
**Rodney J. Dyer, PhD**

Department of Biology

Center for the Study of Biological Complexity

Virginia Commonwealth University


Biological Data Analysis Using R

Contents

Preface xi

I Basic Usability 1

1 Getting R 3

1.1 What Is R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.2 Where Do I Get It? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Language & Grammar 5

2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.2 Function Quickie . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.3 Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.4 Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.5 Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.6 Useful Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

II Biologically Motivated Topics 23

3 Data Frames 25

3.1 Data Input/Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

3.2 Slicing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.3 Complex Selections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

3.4 Useful Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

3.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

4 Summary Statistics 43

4.1 Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

4.2 Random Number Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

4.3 Descriptive Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

4.4 Relationships Between Pairs of Variables . . . . . . . . . . . . . . . . . . . . . 63

4.5 Useful Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

4.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

5 Contingency Tables 71

5.1 One Random Sample . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71


5.2 Paired Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

5.3 Several Random Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

5.4 The Formula Notation & Box Plots . . . . . . . . . . . . . . . . . . . . . . . . 83

5.5 Useful Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

5.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

6 Linear Models 89

6.1 The t-test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

6.2 Regression With A Single Variable . . . . . . . . . . . . . . . . . . . . . . . . . 91

6.3 Multiple Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

6.4 Analysis of Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

6.5 Useful Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

6.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

7 Working With Images 109

7.1 Image Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

7.2 Loading The Image Into R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

7.3 Components of A Pixmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

7.4 Image Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

7.5 Creating Images Programmatically . . . . . . . . . . . . . . . . . . . . . . . . . 117

7.6 Useful Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118

7.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

8 Matrix Analysis 121

8.1 Matrices In R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

8.2 Stage-Classified Matrix Models . . . . . . . . . . . . . . . . . . . . . . . . . . 132

8.3 Useful Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

8.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146

9 Working With Strings 147

9.1 Parsing Text Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

9.2 Producing Formatted Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156

9.3 Plotting Special Characters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159

9.4 Useful Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161

9.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163

III Extending R 165

10 Basic Scripts 167

10.1 Writing Scripts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167

10.2 Evaluating Scripts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169

10.3 Adding Comments To Your Code . . . . . . . . . . . . . . . . . . . . . . . . . 171

10.4 Useful Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172

10.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173

11 Programming 175

11.1 Looping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176

11.2 Conditional Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177

11.3 Outlining A Program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180

11.4 Creating A Program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180

11.5 Synopsis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184

11.6 Useful Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186

11.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187

12 Functions 189

12.1 Function Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189

12.2 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194

12.3 Useful Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195

12.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196

A Answers to Exercises 197

B Installing Additional Libraries 199

B.1 Library Availability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199

B.2 Installing Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199

Bibliography 205

Index 205


List of Tables

2.1 Common constants you will run across in R . . . . . . . . . . . . . . . . . . 11

4.1 Some useful additional commands to customize the appearance of a figure. For a complete listing of possible values that can be customized, try the ?par command. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

4.2 Graphics devices for output of figures . . . . . . . . . . . . . . . . . . . . . . 51

5.1 Diversity of enrolled undergraduate students at Virginia Commonwealth

University in the College of Humanities & Sciences between the academic

years 1998-2008 as reported by the Center for Institutional Effectiveness

(http://www.vcu.edu/cie/analysis/reports/sets.html). . . . . . . . . . . . . 75

8.1 Table of life history values separated into (A) fertility estimates (the $f_X$ items) and (B) transition probabilities depicting the movement between stages and within stages. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

9.1 Caption For Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158


List of Figures

1 Example scatter plot. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvi

4.1 Values for the density function for the $\chi^2$ distribution with 1, 2, and 3 degrees of freedom. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

4.2 A graphical depiction of the critical value of the $\chi^2$ distribution for α = 0.05 and df = 3. The shaded region constitutes a proportion of the area under the curve equal to α. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

4.3 Some example graphs with alternate values for symbols, line types, widths,

colors, and titles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

4.4 Plot of two data sets using the par(new=T) command but not taking into consideration the axis limits of the two data sets before plotting. . . . . . . . . . 50

4.5 Plot of two variables on the same axis after correcting for the range of each

data set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

4.6 Image of colored Poisson distribution that was copied from the graphics

device to a jpeg ﬁle. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

4.7 Examples of the densities of two normal distributions; the red one is drawn

from a random normal distribution with default values of µ = 0 and σ = 1

and another in blue that has µ = σ = 5. . . . . . . . . . . . . . . . . . . . . . . 55

4.8 Histogram with labels and main title changed. . . . . . . . . . . . . . . . . . 56

4.9 Histogram of 1000 random numbers drawn from a Poisson distribution

with the λ parameter set to 5. The red line indicates the density of the values. 57

4.10 Example locations for first two moments of a Normal (N(0, 1)) distribution. . 59

4.11 Negative (left) and positive (right) distributions. In both of these examples the dotted line connects the mode of the distribution (the top peak) to the mean (on the x axis). The direction of this lean determines if the distribution has a negative (left) or positive (right) skew. . . . . . . . . . . . . . . . . . . . 60

4.12 Three distributions (exponential, normal, and logistic) showing different levels of kurtosis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

4.13 Matrix of four plots created from random numbers sampled from the normal, Poisson, exponential, and logistic distributions. . . . . . . . . . . . . . . . 62

4.14 Distribution of random numbers drawn from rpois(1000,5). . . . . . . . . . . . 64

4.15 Scatter plot of some semi-random points. . . . . . . . . . . . . . . . . . . . . 65

4.16 Example plot of two variables used to test correlations. . . . . . . . . . . . . 66

5.1 Undergraduate diversity at Virginia Commonwealth University during aca-

demic years 1998, 2003, & 2008. . . . . . . . . . . . . . . . . . . . . . . . . . 77


5.2 Boxplot of Pinus echinata germination data partitioned by timber extraction

treatment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

6.1 Plot of single variable regression values. . . . . . . . . . . . . . . . . . . . . . 92

6.2 Regression model added to plot of points using abline function. . . . . . . . . 94

6.3 Regression model with fitted line and formula. . . . . . . . . . . . . . . . . . 96

6.4 A 2x2 matrix plot of some diagnostic tools associated with a linear model. They include a plot of the residuals ($e_{ij}$) as a function of the fitted values ($\hat{y}_i$) to see if there are systematic biases in the model (upper left), a Q-Q plot to examine normality of the residuals (upper right), a scale location plot (lower left), and a leverage plot to look for outliers (lower right). . . . . . . . . . . . 97

6.5 Boxplot of germination percentages for Pinus echinata as a function of treatment. A colored rug was added to the right side to show the actual values within treatments (see rug). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

6.6 Conﬁdence intervals for difference in mean germination rates for Pinus echi-

nata families. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

7.1 The image represented in the r.pbm file. This image has been scaled up to make it large enough to see it on the page using the program GIMP (www.gimp.org). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

7.2 A PBM file that was programmatically created in R. The image is rotated because of the default location of the origin. . . . . . . . . . . . . . . . . . . . 112

7.3 The image represented by the dog.pgm file. This image has been scaled up to make it large enough to see it on the page using the program GIMP (www.gimp.org). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

7.4 The image represented in the Libbie.ppm file. This image has been scaled up to make it large enough to see it on the page using the program GIMP (www.gimp.org). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

7.5 The original image along with ones where only the red, green, or blue channel is turned on. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

7.6 The greyscale translation of the PPM image, a histogram of the grey values, and the image resulting from reducing all the grey values in the image by half. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

7.7 A random image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118

7.8 A random image with a square doughnut hole in the middle. . . . . . . . . . 118

8.1 Image depicting two vectors $v_{red} = [4, 2]$ and $v_{blue} = [2, 1]$ that are projecting in the same direction but have different magnitudes. . . . . . . . . . . . . . 131

8.2 A graphical depiction of the life history stages in the fictitious plant Grenus growii . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

8.3 Effects of the instantaneous growth rate $\lambda$ as a function of time for both exponential growth ($\lambda_{blue} = 1.2$) and exponential decay ($\lambda_{red} = 0.8$). . . . . . . 136

8.4 Examples of two different calls to the plotting function barplot(). The parameters used to create these plots are given in the R code. . . . . . . . . . . . . . 138

8.5 Example of a stacked bar plot with multiple categories represented in each

Treatment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

8.6 Size of the four stage classes through time. . . . . . . . . . . . . . . . . . . . 142


8.7 Differences in estimated proportions of individuals in each stage from what

was expected through time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

9.1 Histogram of distance estimates among all sequences using the "K90" model of substitutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

9.2 Neighbor joining tree based upon the trnL-trnF intergenic spacer sequences and the "K90" model of sequence evolution. . . . . . . . . . . . . . . . . . . . 156

9.3 The html printout of a xtable as interpreted in Firefox. You can also import

tables saved as html into popular word processors and use them as normal

table items in the creation of your documents. . . . . . . . . . . . . . . . . . 159

9.4 Example of using the expression function to annotate a graphic. . . . . . . . . 161

11.1 Hemispherical photograph of winter roosting habitat at Monarch Biosphere Reserve, Mexico. Photo by S.B. Weiss made available under the Creative Commons Attribution 2.5 license . . . . . . . . . . . . . . . . . . . . . . . . . . 176

11.2 The blue channel of the canopy picture displayed as a greyscale image. . . . 181

11.3 A histogram of values in the blue channel (Figure 11.2). . . . . . . . . . . . . 181

11.4 Intensity of blue channel values in the image as taken through a slice of the image (at pixel row 230 as indicated by red dashed line). . . . . . . . . . 182

B.1 Example of CRAN mirror window as viewed on Linux . . . . . . . . . . . . . 201

B.2 All packages that can be installed from the selected mirror server on my

machine. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202


Preface

This manuscript was written to scratch a particular itch that I felt was not being satiated. Increasingly, students in biological research programs, at both the undergraduate and the graduate level, are dealing with data sets that are both enormous in size and varied in representation. Image data, sequence data, counts of species in communities, nutrient flux, reaction networks, and a whole host of other kinds of data are encountered on a daily basis in the biological sciences. In order to "drink from this firehose" of data, it is important that we have the correct kinds of tools; the spreadsheet metaphor is no longer valid.

After spending a few years encouraging students to learn a tool, any tool, that would help them deal with the complexity of data we encounter, I decided to put together a course focusing on how R can be used to deal with many different kinds of data. This course was designed for incoming graduate students in Biology at Virginia Commonwealth University with the goal of getting them familiar with R from the beginning of their graduate work. Many of the graduate faculty in Biology use R in their courses and find that a non-trivial amount of time needs to be spent introducing students to R in each course, which takes away from the focus of the course. If a student had taken a short course in R when they began their graduate work, it would be possible to spend more time in our individual courses focusing on the topic at hand.

This manuscript is not designed to be one of the "Biological discipline X in R" kind of offerings; there are already a lot of those kinds of books available. My goal here is to introduce the reader to a wide variety of data types that we deal with in Biology and give a brief introduction to how R can be used to interact with, and perhaps perform analyses on, these data. The treatment of any one kind of data is relatively shallow, as I am assuming that students are going to take a specific course on that topic in the future. And when they do, they will have already seen how R will make their life easier.

In my own research, I use tools such as R in many different circumstances and feel that students can only benefit from a broad understanding of how R can assist in their research. With this focus, it is no coincidence that the kinds of data introduced in this text are pulled directly from the graduate courses that our students will take, such as Community Ecology, Population Genetics, Population Ecology, Evolution & Speciation, Biological Complexity, Molecular Genetics, Landscape Genetics, Bioinformatic Technologies, Ecological Genetics, and Quantitative Ecology.

Given the range of topics covered herein, I think this manuscript has a broad audience, as I assume that the reader of this text will not have much previous experience using R. Obviously, incoming graduate students are my primary audience. However, I also feel that this would be a good beginning text for one who is already working in the field and would like to gain a broader introduction to how R can be used in their particular discipline.

Contents

This manuscript has been partitioned into four separate sections. The first section introduces R as a language and a tool and covers some basic topics that are required to get one going. The next section contains eleven chapters that each target some particular aspect of biological inquiry from the perspective of the kind of data that will be analyzed. The third section focuses on how you can extend the R environment by developing scripts and defining your own functions and libraries. The final section of this text is an appendix that includes the answers to odd-numbered questions from the exercises in each chapter as well as some additional information on installing additional libraries or groups of libraries.

There are some common elements to each chapter that make it easy for the reader to get the larger picture of the topics being introduced. At the beginning of each chapter, a specific list of topics and skills that are to be covered is provided. As topics are introduced, the R code is provided and keywords from the R programming language are highlighted to help the reader follow along.

At the end of each chapter, all the R functions that were used in the chapter, as well as a brief definition of the arguments passed to each function, are provided as a quick reference source. Each chapter also contains a set of exercises that can test the reader's understanding of chapter topics. Answers to odd-numbered exercise problems are provided in Appendix A. Throughout the text, all of the R functions used are also indexed so that the reader can easily find instances where they were used.

Part 1: Basic Usability

The ﬁrst part of this manuscript contains the basic information that is required to install

and begin using R for data analysis. This section has the following chapters:

Chapter 1: Getting R This chapter provides information on how to download the latest

binary release for R as well as compiling it from source code. Particular attention is paid

to the differences associated with installing R on different platforms.

Chapter 2: Language & Grammar This chapter begins introducing the R programming language by focusing on the different kinds of data types that are used (e.g., integers, decimal values, factors). Topics covered include a basic overview of what a function is, an introduction to the most commonly used data types in R, and general operations on these data types.


Part 2: Biologically Motivated Topics

The second section of this manuscript contains the main content.

Chapter 3: Data Frames The data frame is a fundamental object in R. This chapter builds upon the basic understanding of data frames (introduced in Chapter 2) by introducing several methods for putting your data into new and existing data frames and for persistent storage of data frames. This chapter also introduces the concept of using the data frame data type as a light-weight database object. This includes an introduction to making slices of a data set, the methods required to make complex selections of subsets of data, and joining data from multiple data frames.

Chapter 4: Summary Statistics This chapter introduces the reader to general summary

statistics for continuous data, statistical distributions, and random number generation.

This chapter also provides the reader a ﬁrst introduction to creating publication-quality

graphics in R . General graphics include scatter and line plots, histograms, density

plots, plotting several graphical objects on the same set of axes, creating matrices of

plots, and saving graphics to ﬁle.

Chapter 5: Categorical Data This chapter focuses on the analysis of categorical data and contingency tables. Given the ubiquity of the $\chi^2$ test in Biology, a general treatment of contingency tables is provided with examples demonstrating how to examine genetic linkage disequilibrium, Hardy-Weinberg equilibrium, and demographic analysis testing for equality of population diversity. Both parametric and non-parametric approaches are introduced with examples.

Chapter 6: Linear Models This chapter introduces the concept of linear models from simple correlations through single and multiple regression and ANOVA (which is introduced as regression with categorical predictors). Data for this chapter are derived from my own thesis work on the consequences of landscape modification on reproductive success in canopy trees. Examples of model diagnostics, model selection, and post-hoc tests are also covered.

Chapter 7: Working With String Data This chapter uses genetic sequence data as an example of string-related data that can be manipulated in R. Basic skills in string searching and replacement are augmented with a short discussion of genetic sequence alignments and the use of online genetic databases such as NCBI, and the creation of phylogenetic trees using different algorithms is demonstrated.

Chapter 8: Image Data This chapter focuses on image creation, importation, analysis, and manipulation. After a basic overview of image formats and manipulations, hemispheric canopy photos are used as an analysis topic on which several analyses are performed.

Chapter 9: Matrix Analysis Matrix analysis is a general tool used in a variety of biological

disciplines. In this chapter, the topic of life history analysis and population projection is

used as an example for matrix operations in R .

Chapter 10: Multivariate Data Ordination techniques are a broad class of methodologies

that seek to understand the structure of multivariate data. In this chapter, vegetation

data is used as an example of how one conducts and interprets basic ordination.


Chapter 11: Classification This chapter focuses on how morphological shape analysis can be used for classification purposes. Morphological data from the bark beetle species complex Araptus attenuatus are used as an example for comparison with genetic classification schemes.

Chapter 12: Spatial Data In this chapter, the analysis of spatial data is introduced. Topics covered include conversion of GPS way points and GIS data files into R data formats, plotting georeferenced raster and vector maps, and basic spatial analysis.

Chapter 13: Genetic Data This chapter focuses on how one can represent genetic data

in R and perform basic analyses on genetic structure. Examples include the analysis

of inbreeding, population structure, association mapping, and population assignment

tests.

Part 3: Extending R

The chapters in this section only require a basic understanding of R and can be used at

any time as they are stand-alone. In fact, it is suggested that after you get familiar with

R , you should look into these chapters because they contain valuable information that

will make your life easier.

Chapter 14: Creating Basic Scripts This chapter addresses how you create basic R scripts so that you can reuse your code and analyses as well as have persistence across your R sessions.

Chapter 15: Programming R This chapter covers basic programming, ﬂow control, and

decision control statements.

Chapter 16: Functions This chapter demonstrates how the user can create individual

functions from their scripts so that calling complex analyses and operations can be

simpliﬁed.

Appendices

The last part of this manuscript includes supplementary material in support of the con-

tents.

Appendix A: Answers to Exercises This appendix provides answers to the odd numbered

problems located at the end of each chapter.

Appendix B: Installing Additional Libraries There is a broad range of libraries that the R community provides, and this appendix shows you how to find and install additional libraries to your local copy.

Typographic Conventions

The developers of R have worked very hard to make sure that you can interface with R on any platform without worrying about which operating system you are using. However, there are some times when things are slightly different on alternate platforms. When there are platform-specific issues to be dealt with, I will make a notation in the margins with the name of the operating system next to the text to indicate specific issues.

The book is not going to show you how to interact with R using a GUI, because in my opinion GUIs are for babies. If you want to learn how to use R, you will have to learn how to interact with it from the command line and write scripts for R to analyze your data. If you want a point-and-click interface for a statistical analysis program, then perhaps you should check out SPSS (Statistical Package for the Social Sciences) or similar offerings. It is my belief that you will learn more about programming and data analysis if you learn the R language. There are only so many options that GUI-based analyses can provide, but with R on the command line you will have the most flexibility in the analysis of your data. Moreover, when you create scripts to perform your analysis, you will have a persistent record of how you analyzed the data instead of just some data and results. Increasingly, peer-reviewed journals are suggesting that your analysis scripts be included in your supplementary materials for general consumption.

Throughout this book, I will provide examples of code in a box format. You will be able to

tell what is code that can be entered in R because it will be separated from the main text

and in an alternate font, slightly shaded, and with R keywords colored appropriately.

For example, the commands:

> x <- seq(0, 100, by=2)
> y <- rnorm(51)
> plot(x, y, xlab="X Axis", ylab="Y Axis")

create a scatter plot for the variables x, a sequence of even numbers from 0 to 100, and y, which are random numbers sampled from a normal distribution. The result is given in a new graphics window with a plot similar to what is shown in Figure 1. How plots are made and saved to a file for subsequent use is covered in depth throughout the book. I have decided to sprinkle instructions on how to create graphics into the text at locations that are appropriate for the content being discussed rather than creating one or more chapters on graphics with made-up data presented out of the context in which that particular graphical representation is appropriate.
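The choice of 51 random numbers in the scatter plot example is not arbitrary: there are 51 even numbers from 0 to 100 inclusive, and plot() needs x and y to be the same length. A minimal sketch of that check follows. The call to set.seed() and the use of length() are additions for illustration and are not part of the original example.

```r
# set.seed() is an addition for repeatability; it is not in the original example.
set.seed(42)

x <- seq(0, 100, by = 2)   # even numbers 0, 2, 4, ..., 100
length(x)                  # 51 values, which is why the example calls rnorm(51)

y <- rnorm(length(x))      # one random normal value for each x value

# plot() stops with an error if x and y differ in length,
# so confirm that they match before plotting.
stopifnot(length(x) == length(y))
plot(x, y, xlab = "X Axis", ylab = "Y Axis")
```

Writing rnorm(length(x)) instead of a hard-coded 51 keeps the two vectors in step if the sequence is later changed.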

All code provided in this text has text highlighting showing R keywords in dark blue and strings in red (see Chapter 2 for more information on these commands). If you are using a good editor to write your scripts, you will see this kind of text highlighting in your own work. In these code listings the > character is the prefix given by R and is not typed. I provide it here because I want to differentiate between code you type and answers that are given by R, which will not have the > character in them, such as:

> 2 * 6
[1] 12
> rnorm(10)
 [1] -1.08495736 -1.25010428 -0.76237538 -0.08486045 -1.62145675 -0.54872689
 [7]  0.64345848  0.43850325  0.26551658 -0.41362136
> pi/2
[1] 1.570796

where the answers are given in the line immediately following what was entered. Along with the answer is also an index for the answer or answers. For example, the second example gets 10 random numbers from a normal distribution but can only give 6 on a line before it wraps around. The [7] tells you that the first number on the second line is the seventh in the sequence. When you operate on vectors or matrices, these indices are relatively important and allow you to find specific elements rapidly.

Figure 1: Example scatter plot.
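The bracketed index that R prints is the same position you use to pull a value back out of a vector. As a small sketch, the vector v below is a made-up example (not from the text) chosen so that the positions are easy to verify by eye:

```r
# A hypothetical vector of ten values: 10, 20, ..., 100.
v <- seq(10, 100, by = 10)

# Square brackets extract elements by position, and positions start at 1.
v[7]         # the seventh element, 70
v[c(1, 7)]   # the first and seventh elements, 10 and 70
```

When printed output wraps over several lines, the prefix at the start of each line gives the position of the first value on that line, so the other positions can be counted from there.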

Acknowledgments

There are several people I would like to acknowledge for their assistance in this work.

This has been possible primarily due to the ﬂexibility of my Department in allowing me

to ”experiment” on our graduate students. Next, I wish to thank Dr. James Vonesh who

has goaded me into putting this together and been my colleague in crime as we continue

to push R as a general tool in our curricula. Members of my laboratory Stephen Baker,

Daniel Carr, Candace Dillion, Crystal Meadows, and Cathy Viverette sat through the ﬁrst

iteration of the course and have provided insightful feedback on both the focus and

the content. I would also like to thank the developers of R, LaTeX, Grass GIS, Emacs, and

Vim who have provided a set of tools that facilitate good research.

Rodney J. Dyer

Richmond

June 2009


Part I

Basic Usability


Chapter 1

Getting R

I am not going to spend much time on how you go about getting and installing R on

your computer. If you are going to use a machine on campus, it should have it already

installed on it. If not, VCU does not allow students to install programs on their

machines, so this chapter is somewhat irrelevant anyway. However, if you are using your

own computer (which is always the best idea), the internet has a much more in-depth

and complete explanation of how to get and install the R environment for your particular

machine. Reproducing that here would be a waste of paper and of our time, as it

would probably be out of date before long.

1.1 What Is R

R is both a language and an interface for statistical analysis, programming, and graph-

ics. R is modeled after the S language that was originally created by AT&T and in many

cases scripts written for R can be run in S with little to no modification. R has become

a standard interface for statistical analysis in the biological sciences due in part to its

openness, its ability to be extended by users, and its vibrant user base.

The R environment is a command-line interface that allows easy manipulation of data

and calculation of parameters related to that data, offers an easy-to-understand grammar

that facilitates rapid program creation, and can produce publication-quality graphics.

Moreover, you can create R scripts that describe how you analyzed your data so that in

the future you can pick up where you left off. Increasingly, entities such as NSF and

prominent research journals are making R scripts a normal component of the Supple-

mentary materials that you upload along with your research results and ﬁnal reports. It

is my opinion that the sooner you start documenting your data and creating a history of

how you perform analyses on this data, the better you will be in the long run.


1.2 Where Do I Get It?

The main webpage for R is located at http://www.r-project.org/. Here you can find

information on the latest version of R available for your platform. Moreover, you can find

some nice screenshots, find out what is new in the R community, and find links to

manuals, newsletters, wikis, and books on R. There is a lot of information in the online

community and in general, they are a friendly lot. Since R has been around for quite a

while, most of your most basic questions can be answered by a quick Google search of

the mailing list repositories. It is always a good idea to check these out prior to posting

to a discussion board or email list so you do not get the old RTFM treatment...

1.2.1 Installation From Binaries

The CRAN site maintains pre-compiled binary distributions for Linux, Mac OS X, and

Windows. These binaries are the latest stable versions of the software and contain the

basic libraries that you need to run R on your operating system. Depending upon your

platform, the package will contain an installer that allows you to clickity-click your way

through the process and have a base R installation on your machine.

Connected from the main R site is also the CRAN repository where people make avail-

able extensions to R that you can download and use. There is a tremendous variety of

solutions available for you and it is always in your best interest to try to see if someone

has already tackled the problem you are working with. There is no reason to reinvent

the wheel; your time is too valuable.

1.2.2 Compiling

If you know what a compiler is and have one on your computer then you are probably

able to compile the latest version of R on your machine. If you fall into this category

then you do not need me to tell you how to proceed; there is a lot of good documentation

on this found on the R website.


Chapter 2

Language & Grammar

R is a language that has its own grammar and in this chapter you will be exposed

to some basic concepts regarding it. In this and all subsequent chapters, it is

important for you to remember that computers do exactly what you tell them to, and

often not what you had wanted them to do. So learning the grammar is an important

step in understanding R .

In this chapter, you will focus on the following topics:

• Learn basic data types and how to create them in R.

• Understand various operators and how they can be used.

• Understand variable naming and be able to create, manipulate, and destroy variables.

This is a pretty short list of things but it will take you a bit of time to get through it. The

main goal here is to understand a small subset of the different kinds of data that can

be produced in R and how we interact with them. Later, we will become more proﬁcient

with them and add new data types as we move forward.

2.1 Overview

R itself consists of an underlying engine that takes commands and provides feedback

on these commands. From a technical perspective R is called a functional language, as

each command you give the R engine is either an expression or an assignment:

Expression An expression is a statement that you give the R engine. R will evaluate

the expression, give you the answer and not keep any reference to it for future use.

Some examples include:

> 2 + 6
[1] 8
> sqrt(5)
[1] 2.236068
> 3 * (pi/2) - 1
[1] 3.712389


In each of these examples, R evaluates the expression and gives you an answer.

When you use it like this, R is acting as a gloriﬁed calculator.

Assignment An assignment causes R to evaluate the expression and stores the result

in a variable. This is important because you can use the variable in the future. An

example of an assignment is:

> x <- 2+6
> myCoolVariable <- sqrt(5)
> another_one_number23 <- 3 * (pi/2) - 1
> x
[1] 8
> myCoolVariable
[1] 2.236068
> another_one_number23
[1] 3.712389

Notice here the use of the assignment operator <-. This is made with a "less

than" character and a "minus" character. As for the expression, the variables x,

myCoolVariable, and another_one_number23 are all the names of variables whose

value was assigned with the expression. Also notice that to retrieve the value of a

variable, just type it into the command line and it will provide the current value.

2.2 Function Quickie

This chapter will introduce you to several conventions, the main one of which is the

function. A function in R is a collection of statements bound together to make it easier

to use. In the previous example, I used the function sqrt(x), which is the function that

gives the square-root of the argument being passed (or an error if there is one).

Some functions are easy to understand and others are relatively complicated. We will

spend a whole chapter on functions later in the book (see Chapter 12) when you be-

gin to write your own. However, in the interim, you need to know a few things about

functions.

1. A function has two parts: (1) a unique name, and (2) the stuff (e.g., variables) passed

to it within the parentheses. Not all functions need any additional variables. For

example, the function ls() shows which variables R currently has in memory and

does not require any parameters.

2. If you forget to put the parentheses on the function and only use its name, by

default R will show you the code that is inside the function (unless it is a compiled

function). This is because each function is also a variable. This is why you should

not use function names for your variable names (see 2.3 for more on naming).

3. To ﬁnd the deﬁnition of a function, the arguments passed to it, details of the imple-

mentation, and some examples, you can use the ? shortcut. To ﬁnd the deﬁnition

for the sqrt() function type ?sqrt and R will provide you the documentation for that

function. If R cannot ﬁnd the function you may have to do a more thorough search

using the help.search("functionName") approach. This searches throughout the docu-

mentation system and even uses some cool fuzzy searching techniques. For more


info on how to use help.search(), type ?help.search (recursive logic recurses...).

4. Functions can be organized into libraries and only loaded when needed. At the

time of this writing there are just over 1600 different packages containing different

libraries on http://cran.r-project.org. There is no reason to have every conceivable

library loaded and in fact if they were to be loaded would probably leave little mem-

ory for you to work with your data on. As a rule of thumb, only load the libraries

that you need when you need them. More on libraries as we go forward.

5. Functions may have more than one parameter passed to them. Often if there are

a lot of parameters given then there will be some default values provided. For

example, the log() function provides logarithms. The definition of the log function

shows log(x, base=exp(1)) (say from ?log). Playing around with the function shows:

> log(2)
[1] 0.6931472
> log(2, base=2)
[1] 1
> log(2, base=10)
[1] 0.30103

where without the optional base= parameter, it is clear that the log() function returns

the natural log (in fact if you ?ln there is nothing found).
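
To make the idea of default parameters a little more concrete, here is a small sketch that calls log() with and without the base= argument. The variable names are mine and only for illustration; log10() and log2() are convenience functions that ship with base R.

```r
# With no base given, log() falls back on its default, base = exp(1).
natural <- log(exp(1))

# Naming the argument overrides the default.
base10 <- log(1000, base = 10)
base2 <- log(8, base = 2)

# Base R also provides wrappers for the two most common bases.
same10 <- log10(1000)
same2 <- log2(8)

print(c(natural, base10, base2, same10, same2))
```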

2.3 Variables

A variable is something that can hold an item for you. While this is a little bit of Dyer-

speak, and I am sure that there are more elegant deﬁnitions, it is important to under-

stand that variables are things that you will interact with. For example, you may have

a predictor and a response variable you want to ﬁnd a correlation between. It is your

responsibility to deﬁne these variables and then you can subsequently use them in your

analyses.

There are some naming conventions that you can follow to make your life a bit eas-

ier.

1. It is a pretty good idea for you to start your variable name with a letter. You cannot

use a number or punctuation as the ﬁrst character of a variable (N.B. you can use

a period to start it but the variable will be hidden from you and you cannot see it

with ls() so unless you know what you are doing, don't do this).

2. Variable names cannot have spaces in them although it is possible to use periods

("."), underscores ("_"), or you can use what is called camel case (e.g., NumberOfDogsInHouse;

notice the use of upper and lower case letters to smush words together and make

it readable).

3. Try to name your variables something that makes sense to you. Using a,b,c,d,e,

and f as variables is probably not as informative to you when you are reading the

code as Rate, number_of_items, foodDataForNovember.


4. In R when you make a new variable such as x <- sqrt(2) then that variable is in memory.

You can recall it by typing its name and hitting return, you can use it later in

functions or calculations, and you can manipulate it (e.g., x <- x/2 to decrease it by

half).

5. The function ls() provides you a list of all variables that you have defined. It is a very

helpful function. You can remove a variable from memory using the rm(variableName)

function.
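
The naming rules and housekeeping functions above can be sketched as follows. All of the variable names are invented for illustration; the all.names= argument of ls() and the exists() function are part of base R.

```r
# Create a few variables using the naming styles described above.
numberOfDogsInHouse <- 3     # camel case
number_of_items <- 12        # underscores
food.data.november <- 4.5    # periods
.hiddenValue <- 99           # leading period: hidden from a plain ls()

# ls() lists the visible variables in the current environment.
visible <- ls()
print(visible)

# all.names = TRUE also lists names that start with a period.
everything <- ls(all.names = TRUE)
print(everything)

# rm() removes a variable; exists() confirms that it is gone.
rm(number_of_items)
stillThere <- exists("number_of_items")
print(stillThere)
```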

2.4 Data Types

R recognizes about a dozen different types of data. While it is important to know the

differences between these data types, you will probably use only a fraction of them. All

of the data types are characterized by what R calls classes. As such, every data type has

three common functions associated with it; a constructor that creates a speciﬁed type,

an introspection function that tells you if any variable is a particular type, and a casting

function that allows you to coerce the contents of a variable into a specific type (a more

complete discussion of functions can be found later in the book). This may sound a bit confusing

but in reality it is pretty straightforward. For example, the constructor function

type(x) will create a vector of x items of that type (where type is the name of the data type to create), is.type(x) will

determine if x is that particular type of variable, and as.type(x) will return x translated into

that type of variable. Confused yet? It really isn't that bad; examples for each data type

below will discuss the speciﬁcs.

To determine the type of any variable you can use the built-in function class(x). This will

tell you what kind of variable x is and is relatively important in the discussions we are

going to have below about coercion. This is an important concept for understanding

data types. What follows is a brief discussion of each data type and where appropriate

an example of the use of one, how to access it, and how we can operate on it.
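
As a quick sketch of the constructor, introspection, and casting pattern, here it is for the character type (any of the other types would work the same way). The variable names are made up for illustration.

```r
# Constructor: make a vector of three (empty) character values.
labels <- character(3)

# Introspection: ask whether a variable is a particular type.
isChar <- is.character(labels)   # TRUE
isNum <- is.numeric(labels)      # FALSE

# Casting: coerce a value from one type into another.
count <- as.numeric("42")

# class() reports the type of any variable.
print(class(labels))   # "character"
print(class(count))    # "numeric"
```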

2.4.1 Integers

An integer is a common counting number (i.e., one without a fractional part). Technically,

integers can range from −∞ to ∞; however, in practice only a limited range of integers can

be represented, roughly ±2 × 10^9. The integer type is typically used in

the development of R libraries that need to pass succinct integers to C or FORTRAN code

and is not typically used by the normal R end user.

Check out the code listing below and see how one can create, coerce, and use an inte-

ger.

> integer(5)
[1] 0 0 0 0 0
> x <- as.integer(5)
> x
[1] 5
> is.integer(x)
[1] TRUE
> class(x)
[1] "integer"
> x + 2
[1] 7
> class(x+2)
[1] "numeric"
> y <- integer(3) + 2
> is.integer(y)
[1] FALSE
> y <- integer(3) + as.integer(2)
> is.integer(y)
[1] TRUE

There are some things to notice about this:

1. The command integer(5) produces a vector (see 2.4.8) of ﬁve integers.

2. All of the items returned from the integer(5) function were assigned a value of zero

(0), which is the default value for an integer until its value is changed to something

else.

3. The variable x is assigned a particular integer, in this case 5, and is veriﬁed by the

class(x) statement.

4. You can perform operations on integers, but you need to make sure that you use other

integers. For example, adding 2 to the vector of integers represented by the variable

y produces a "numeric" type, not an integer, whereas the integer(3) + as.integer(2) statement

does return an "integer" type. This is your first example of coercion, where

one data type is ”magically” turned into another type. There are rules for these

transformations and the ﬁrst one you should recognize is that the number 2 is not

considered an integer. By default, numbers are coerced into numeric values (see

2.4.2) as integers are not used that often.

5. When adding an integer as.integer(2) to a vector of integers, the same number is added

to every element. There are a few more subtle things to know about adding things

to vectors and I’ll leave that until 2.4.8.

As I said above, the integer type is not used that often and is only provided here for

completeness.
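
If you do need integers, R also lets you write an integer literal by adding an L suffix to a number, which saves typing as.integer() each time. A short sketch follows; the variable names are only for illustration.

```r
# A bare number is numeric, but an L suffix makes an integer literal.
a <- 2
b <- 2L
print(class(a))   # "numeric"
print(class(b))   # "integer"

# Adding two integers gives an integer ...
y <- integer(3) + 2L
print(is.integer(y))   # TRUE

# ... while adding a numeric value coerces the result to numeric.
z <- integer(3) + 2
print(is.integer(z))   # FALSE

# The largest integer R can store on this machine.
biggest <- .Machine$integer.max
print(biggest)
```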

2.4.2 Numeric

Numeric types represent the majority of number valued items you will deal with. When

you assign a number to a variable in R it will most likely be a numeric type (unless you

specify otherwise such as deﬁned in 2.4.5 and 2.4.6). Numeric data types can either

be displayed with or without decimal places depending on whether the value(s) include a decimal

portion. For example:

> x <- numeric(4)
> x
[1] 0 0 0 0
> x[1] = 2.4
> x
[1] 2.4 0.0 0.0 0.0


Notice this is an all or nothing deal here. Also notice (especially those who have some

experience in programming other languages) that indices in vectors (and matrices)

start at 1 rather than 0.

Operations on numeric types proceed as you would expect but since the numeric type

is the default type, you don’t really have to go around using the as.numeric(x) function. For

example:

> is.numeric(2.4)
[1] TRUE
> as.numeric(2) + 0.4
[1] 2.4
> 2 + 0.4
[1] 2.4

shows that no matter how you do it, 2.4 is a numeric data type. In general, programmers

are lazy people who try to do things that minimize the amount of typing they have to do

(since they do a lot of typing to begin with) and as such the numeric type is the easiest

to use.

2.4.3 Character

The character data type is the one that handles letters and letter-like representations of

numbers. For example, observe the following:

> x <- "some sequence of letters of length 37"
> class(x)
[1] "character"
> y <- 23
> class(y)
[1] "numeric"
> z <- as.character(y)
> z
[1] "23"
> class(z)
[1] "character"

Notice how the variable y was initially designated as a numeric type but if we use the

as.character(y) function, we can coerce it into a non-numeric representation of the number...

There will be times when you need to translate various things into characters, such as

when making titles and axis labels and this will come in handy.

You need to think of the character type as a sequence of letters, numbers, symbols, or

other stuff you can produce by pushing keys on your keyboard that are enclosed in

either single or double quotations. It doesn't really make much sense to perform any

operations on a character type (e.g., what would you expect "hello"*3 to accomplish)

although you can paste() them together. For example,

> x <- "I am"
> y <- "not"
> z <- 'a looser'
> x
[1] "I am"
> y
[1] "not"
> z
[1] "a looser"
> paste(x, y, z)
[1] "I am not a looser"
> paste(x, z)
[1] "I am a looser"

If you are a stickler for detail, it is important to note that the paste()

function by default separates the individual variables you give it with a single space.

However, this can be modified by telling the function what to use as the separator.

> paste(x, z, sep=" not ")
[1] "I am not a looser"
> paste(x, z, sep=", ")
[1] "I am, a looser"
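
Here is a small sketch of the kind of label building mentioned above. The strings and variable names are invented for illustration; sep= and collapse= are both standard arguments of paste().

```r
# paste() coerces a number to text and glues it into a label.
n <- 23
plotTitle <- paste("Sample size:", n)
print(plotTitle)   # "Sample size: 23"

# sep = "" removes the separating space entirely.
fileName <- paste("site", 4, ".csv", sep = "")
print(fileName)    # "site4.csv"

# collapse joins the elements of a single vector into one string.
sentence <- paste(c("I", "am", "here"), collapse = " ")
print(sentence)    # "I am here"
```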

2.4.4 Constants

Constants are variables that have a particular value associated with them that cannot

be changed. They are mostly here for convenience so that we do not have to go look

up values for common things. Below are listed some common constants that you will

probably encounter as you play with R .

Table 2.1: Common constants you will run across in R

Constant   Description
pi         The mathematical constant π, representing the ratio of a circle's circumference to its diameter.
NULL       The absence of a type. This is the oubliette, complete nothingness, /dev/null, Richmond on a Wednesday night... This is commonly used by functions that return undefined responses.
NaN        Not a number.
Inf        Infinity (∞), as well as -Inf for −∞.
NA         Typically used to represent something that is not there or missing. You can use it for missing data if you like.

For the non-numerical constants, there are commands such as is.null(), is.nan(), is.infinite()

(and its cousin is.finite()), and is.na() to help you figure out if particular items are of that

constant type if you like. At times this can be handy, such as when you have missing data

and you want to set it to some meaningful value (e.g., X[is.na(X)] <- 32 will set all NA values

in X to 32). We'll get into this more in depth at a later time.
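
A minimal sketch of these checks in action. The vector X and its values are made up for illustration; the replacement step uses the logical vector returned by is.na() as an index.

```r
# A small vector with one missing value.
X <- c(4, NA, 7, 10)

# is.na() returns a logical vector marking the missing entries.
missing <- is.na(X)
print(missing)     # FALSE  TRUE FALSE FALSE

# Use that logical vector as an index to overwrite only the NA entries.
X[is.na(X)] <- 32
print(X)           # 4 32 7 10

# Checks for the other constants.
print(is.null(NULL))        # TRUE
print(is.nan(0 / 0))        # TRUE: 0/0 is not a number
print(is.infinite(1 / 0))   # TRUE: 1/0 is Inf
print(is.finite(pi))        # TRUE
```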

2.4.5 Complex Numbers

Complex numbers are those that can be written in the form a + bi where a is the real

part and the product bi is the imaginary part, with i = √−1. The code snippet below

shows you how to create and query the class of a complex number.


> w <- complex(3)
> w
[1] 0+0i 0+0i 0+0i
> x <- complex(3, 4, 5)
> y <- 4+5i
> x
[1] 4+5i 4+5i 4+5i
> y
[1] 4+5i
> is.complex(x)
[1] TRUE
> is.complex(y)
[1] TRUE

The main difference between the constructor complex() and the other ones we have seen

so far is that it can take initial values. For example, when called as complex(3), it returns

three complex numbers whose real and imaginary parts are set to zero. However, calling

the function as complex(3,4,5) makes three complex numbers, each assigned a four for

the real part and a five for the imaginary part. As shown, you can also create complex

numbers by simply typing them directly on the command line as a + bi, which is

probably the easiest way to do it.

2.4.6 Raw

The raw data type is a hexadecimal data type bound on the inclusive range [0, 255].

Raw numbers are represented as a two digit sequence of hex numbers. Valid hex digits

include 0-9 as well as a, b, c, d, e, & f. The listing below gives you some examples of how

to create some raw data types.

> raw(3)
[1] 00 00 00
> as.raw(255)
[1] ff
> as.raw(13)
[1] 0d
> as.raw(256)
[1] 00
Warning message:
out-of-range values treated as 0 in coercion to raw
> is.raw(13)
[1] FALSE
> is.raw(0d)
Error: unexpected symbol in "is.raw(0d"
> x <- 0d
Error: unexpected symbol in "x <- 0d"

There are several important points to make here.

1. If you try to create a raw number outside its allowable range, R will issue you

a warning and then assign the variable the default value, 00.

2. The digits 13, while valid raw digits, are not considered raw when given by themselves. This

is because all numbers are considered numeric data types (see 2.4.2) by R. Even

in the case of 0d, which is definitely a raw hex number, R doesn't coerce it into a

raw type but leaves it as the characters 0d and then chokes on it. This is probably

good behavior.


3. Similar to what was shown for integers, raw numbers must be constructed with

the raw() or as.raw() functions and cannot be directly created by simply pairing up

valid digits.
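
If you would rather type the hex digits themselves, one option (a sketch, not the only way) is to use R's hexadecimal numeric literals, which are written with a 0x prefix, and hand them to as.raw(). The variable names are made up for illustration.

```r
# A 0x prefix marks a hexadecimal numeric literal, so as.raw(0x0d)
# is the same request as as.raw(13).
a <- as.raw(0x0d)
b <- as.raw(13)
print(a)                  # 0d
print(identical(a, b))    # TRUE

# Several raw values at once, built from ordinary numbers.
bytes <- as.raw(c(0, 11, 255))
print(bytes)              # 00 0b ff

# Casting back recovers the numbers on the range 0 to 255.
asNumbers <- as.integer(bytes)
print(asNumbers)          # 0 11 255
```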

2.4.7 Logical

Logical data types are boolean variables with a value of TRUE or FALSE. Obviously, these

two values are the opposites of each other (e.g., not TRUE is FALSE, etc.). You will encounter

logical data types in two primary situations: (1) when you are writing a conditional

statement that requires you to know the truth about something (e.g., if x == 0 you

probably shouldn't try to divide by x because for some reason mathematicians haven't

figured out how to divide by zero yet...), or (2) if you are trying to select some subset of your

data by using a particular condition (e.g., select all entries where color == "blue").

The interesting thing about logical variables is that numbers can be coerced into a logical

variable. For example the number zero, as an integer, numeric, complex, or raw data

type, is considered to be FALSE whereas any non-zero value is considered TRUE.
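
Both uses of logical values, along with the coercion of numbers, can be sketched in a few lines. The color and height data are invented purely for illustration.

```r
# Zero becomes FALSE; any other number becomes TRUE.
print(as.logical(0))      # FALSE
print(as.logical(2.5))    # TRUE
print(as.logical(-1L))    # TRUE

# A comparison produces one logical value per element.
color <- c("blue", "red", "blue", "green")
isBlue <- color == "blue"
print(isBlue)             # TRUE FALSE  TRUE FALSE

# A logical vector can select the matching subset of the data.
height <- c(10.2, 8.7, 11.5, 9.9)
blueHeights <- height[isBlue]
print(blueHeights)        # 10.2 11.5

# A conditional statement guarding against division by zero.
x <- 0
result <- if (x == 0) NA else 1 / x
print(result)             # NA
```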

2.4.8 Vectors

R is a vector language and as you begin to learn more and more of it you will appreciate

the fact that you can easily work with vectors of numbers as well as single ones. In fact,

I suppose it is probably better to think of a single number as a vector of length 1, which

is why the R command line interface puts the [1] before every answer...

A vector is a sequence of items that can be created using the function vector(). However,

since a vector is simply a sequence, it can be a sequence of any type of data. For example,

I may have a vector of integers or a vector of complex numbers, or whatever. To specify

the data type for a vector, you must tell it what type to use. Here is an example using the

"numeric" data type.

> x <- vector("numeric", 3)
> x
[1] 0 0 0
> is.vector(x)
[1] TRUE
> is.numeric(x)
[1] TRUE

Notice that it assigns default values for each entry as would be expected. However, it is

also important to notice that not only is x a vector but it is also numeric! So in actuality,

in all the preceding cases where we have used the constructor to create a new data

type they are also creating vectors! Blows your mind, doesn't it! This is why it is safe to

consider R as a vector language.

Because you will use vectors so much, there is an easier way to create them using the

c() function (c for combine). This is a short-hand version and R tries to determine the

type of variables that you pass to the c() function to do the right thing. Here are some

examples:


> x <- c(1, 2, 3)
> x
[1] 1 2 3
> y <- c(TRUE,TRUE,FALSE)
> y
[1]  TRUE  TRUE FALSE
> z <- c("I","am","not","a","looser")
> z
[1] "I"      "am"     "not"    "a"      "looser"
> notGoingToWork <- c(00,0b,ff)
Error: unexpected symbol in "notGoingToWork <- c(00,0b"

The only caveat here is that if the data type cannot be determined unambiguously, then

R will choke and tell you so, as shown in the last example where I was trying to make a

vector of raw data types. For cases such as these, use the normal data type constructor

(e.g., raw(3)) and then assign values to each element.
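
It is worth seeing what c() does when every value is a valid piece of R syntax but the types differ: rather than stopping, it coerces everything to a single common type. The sketch below uses made-up variable names.

```r
# Logical and numeric values combine into a numeric vector.
a <- c(TRUE, 2, 3.5)
print(class(a))   # "numeric"
print(a)          # 1.0 2.0 3.5

# Adding a string turns every element into a character.
b <- c(1, TRUE, "three")
print(class(b))   # "character"
print(b)          # "1" "TRUE" "three"
```

So it pays to check class() on a freshly combined vector if you mixed types on purpose or by accident.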

To access an element in a vector, R uses square brackets ([]) as demonstrated here:

> x <- vector("numeric", 3)
> x
[1] 0 0 0
> x[1] <- 2
> x[3] <- 1
> x
[1] 2 0 1
> x[2]
[1] 0

Since working with a vector is such a common thing, there are a number of helper

functions that you can use to make vectors.

> x <- 1:6
> x
[1] 1 2 3 4 5 6
> y <- seq(1, 6)
> y
[1] 1 2 3 4 5 6
> z <- seq(1, 20, by=2)
> z
 [1]  1  3  5  7  9 11 13 15 17 19
> rep(6, 4)
[1] 6 6 6 6

The notation x:y provides a vector of whole numbers from x to y. In a similar fashion the

function seq(x,y,by=z) provides a sequence of numbers from x to y but can also have the

optional parameter by= to determine how the sequence is made (in this case by 2s, giving

all the odd numbers from 1 to 20). The function rep(x,y) repeats x a total of y times. These

are some real time saving options and you will probably be using them often.
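
A few more variations on these helpers, plus a first look at arithmetic on whole vectors. This is only a sketch with made-up variable names; length.out=, times=, and each= are standard arguments of seq() and rep().

```r
# seq() can also be told how many values to produce.
quarters <- seq(0, 1, length.out = 5)
print(quarters)       # 0.00 0.25 0.50 0.75 1.00

# rep() can repeat a whole vector, or each element in turn.
whole <- rep(c(1, 2), times = 3)
byElement <- rep(c(1, 2), each = 3)
print(whole)          # 1 2 1 2 1 2
print(byElement)      # 1 1 1 2 2 2

# Arithmetic is applied to every element of a vector at once.
x <- 1:6
doubled <- x * 2
print(doubled)        # 2 4 6 8 10 12

# A shorter vector is recycled along a longer one.
recycled <- x + c(10, 20)
print(recycled)       # 11 22 13 24 15 26
```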

2.4.9 Matrices

Matrices are 2-dimensional vectors and can be created using the default constructor

matrix() function. However, since they have two dimensions, you must tell R the size of the

matrix that you are interested in creating by passing it a number for nrow and ncol for the

number of rows and columns.


> matrix(nrow=2, ncol=2)
     [,1] [,2]
[1,]   NA   NA
[2,]   NA   NA
> matrix(23, nrow=2, ncol=2)
     [,1] [,2]
[1,]   23   23
[2,]   23   23

If you do not give matrix() a default value to put in each cell, it will ﬁll them with NA, which

is the way R indicates a missing value.

Matrices can be created from vectors as well.

> x <- c(1, 2, 3, 4)
> x
[1] 1 2 3 4
> is.vector(x)
[1] TRUE
> is.matrix(x)
[1] FALSE
> matrix(x)
     [,1]
[1,]    1
[2,]    2
[3,]    3
[4,]    4
> y <- matrix(x, nrow=2)
> y
     [,1] [,2]
[1,]    1    3
[2,]    2    4
> is.matrix(y)
[1] TRUE
> is.vector(y)
[1] FALSE

By default, if you do not provide any dimension to the matrix() function, it will produce

one with a single column of data. If you provide one of the dimensions then it will try

to determine how many of the other dimension is needed by looking at the length of the

vector that you passed (e.g., here nrow=2 was given and it ﬁgured out that it should have

two columns as well).

There is a slight gotcha here if you are not careful.

> x <- 1:4
> matrix(x, nrow=4, ncol=2)
     [,1] [,2]
[1,]    1    1
[2,]    2    2
[3,]    3    3
[4,]    4    4
> matrix(x, nrow=3)
     [,1] [,2]
[1,]    1    4
[2,]    2    1
[3,]    3    2
Warning message:
In matrix(x, nrow = 3) :
  data length [4] is not a sub-multiple or multiple of the number of rows [3]
> matrix(seq(1, 8), nrow=4)
     [,1] [,2]
[1,]    1    5
[2,]    2    6
[3,]    3    7
[4,]    4    8

Notice here that R added the values of x to the matrix until it got to the end of x. However,

that did not fill the matrix, so it started over again from the first value. In the first case the

size of the matrix was a multiple of the size of x, whereas in the second case it wasn't but R still

assigned the values (and gave a warning). Finally, as shown in the last case, if the data exactly

fill the matrix, then the values are placed in a column-wise fashion.

To access values in a matrix you use the square brackets just as was done for the vector

types. However, for matrices, you have to use two indices rather than one.

> X <- matrix(c(1, 2, 3, 4, 5, 6), nrow=2)
> X
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> X[1, 3]
[1] 5
> X[2, 2] <- 3.2
> X
     [,1] [,2] [,3]
[1,]    1  3.0    5
[2,]    2  3.2    6
> X[1, ]
[1] 1 3 5
> X[, 3]
[1] 5 6

We will use matrices quite a bit but will delay the commentary on matrix algebra and

operations until Chapter 8. However, the last two operations provide a hint as to some

of the power associated with manipulating matrices. These are slice operations: giving

only one index (e.g., X[1,]) provides a vector containing the entire row or

column.
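
A short sketch pulling together filling, sizing, and slicing a matrix. The variable names are made up; byrow= is an argument of matrix(), and drop= is an argument of the square-bracket indexing.

```r
# Fill row by row instead of the default column by column.
X <- matrix(1:6, nrow = 2, byrow = TRUE)
print(X)

# dim() reports the number of rows and columns.
size <- dim(X)
print(size)          # 2 3

# A single element, a whole row, and a whole column.
print(X[2, 1])       # 4
print(X[1, ])        # 1 2 3
print(X[, 2])        # 2 5

# drop = FALSE keeps a one-row slice as a matrix instead of a vector.
firstRow <- X[1, , drop = FALSE]
print(is.matrix(firstRow))   # TRUE
```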

2.4.10 Factors

Factors are a particular kind of data that is used in statistics and sampling. You can

think of a factor as a categorical treatment type that you are using in your experiments

(e.g., Male vs. Female or Treatment A vs. Treatment B vs. Treatment C). Factors can

be ordered or unordered depending upon how you are setting up your experiment.

Most factors are given as characters so that naming isn't a problem. Below is

an example of ﬁve observations where the categorical variable sex of the organism is

recorded.

> sex <- factor(c("Male","Male","Female","Female","Unknown"))
> levels(sex)
[1] "Female"  "Male"    "Unknown"
> table(sex)
sex
 Female    Male Unknown
      2       2       1
> sex[5] <- "Male"
> sex
[1] Male   Male   Female Female Male
Levels: Female Male Unknown

Here the table() function takes the vector of factors and makes a summary table from it.

Also notice that the levels() function tells us that there is still an "Unknown" level for the

variable even though there is no longer a sample that has been classiﬁed as "Unknown" (it

just currently has zero of them in the data set).
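
If the leftover level bothers you, or if your levels have a natural ranking, the sketch below shows one way to handle each. The dose data are invented for illustration; droplevels() and the ordered= argument of factor() are part of base R.

```r
sex <- factor(c("Male", "Male", "Female", "Female", "Unknown"))
sex[5] <- "Male"

# table() still reports the empty level with a count of zero.
counts <- table(sex)
print(counts)

# droplevels() discards levels that no longer occur in the data.
sexUsed <- droplevels(sex)
print(levels(sexUsed))     # "Female" "Male"

# An ordered factor records that its levels have a ranking.
dose <- factor(c("low", "high", "medium", "low"),
               levels = c("low", "medium", "high"),
               ordered = TRUE)
belowHigh <- dose < "high"
print(belowHigh)           # TRUE FALSE  TRUE  TRUE
```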

2.4.11 Lists

A list is a convenience data type whose function is to group other data items.

> theList <- list(x=seq(2, 30), dog=LETTERS[1:5], hasStyle=logical(5))
> summary(theList)
         Length Class  Mode
x        29     -none- numeric
dog       5     -none- character
hasStyle  5     -none- logical
> theList
$x
 [1]  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
[26] 27 28 29 30

$dog
[1] "A" "B" "C" "D" "E"

$hasStyle
[1] FALSE FALSE FALSE FALSE FALSE

> theList$x
 [1]  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
[26] 27 28 29 30
> theList$x[2]
[1] 3
> theList$x[2] <- 22
> theList$x
 [1]  2 22  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
[26] 27 28 29 30
> theList$dog[2]
[1] "B"
> theList$MyFavoriteNumber <- 2.9 + 3i
> theList$MyFavoriteNumber
[1] 2.9+3i

As you can see, a list can contain a range of different types of data. The summary()

function gives, not too surprisingly, a summary of the items within the list. These data

are grouped together by the list but you can access them and manipulate them just as

you would if they were stand-alone variables, except for the added list name and the

dollar sign. R uses the dollar sign $ frequently to designate something that is contained

within something else. You will ﬁnd when you conduct analyses and assign the results

to a variable that variable will be a list and to access predicted values, or error terms, or

other components of that analysis you will do so by using the $ nomenclature.
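As a small illustration of reaching into an analysis result with $, here is a sketch using lm() from base R. The height and mass values are hypothetical numbers invented for this example, and fit is my own variable name.

```r
# Hypothetical measurements, invented purely for illustration.
height <- c(23.4, 32.9, 29.7, 38.2, 32.7)
mass   <- c(4.1, 6.3, 5.2, 7.9, 6.0)

# lm() fits a linear model; the object it returns is stored as a list.
fit <- lm(mass ~ height)

print(is.list(fit))        # the fitted model is a list underneath
print(names(fit))          # the components that can be reached with $
print(fit$coefficients)    # intercept and slope
print(fit$residuals)       # one error term per observation
print(fit$fitted.values)   # one predicted value per observation
```

R also provides accessor functions such as coef() and residuals() that return the same components; the $ form is shown here only because it matches the list notation introduced above.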

It is important to remember that lists are general groupings of variables and these
variables do not necessarily have any relationship between them other than my need to
group them as it makes sense to me to do so. This is different from what is found in the
next data type, the data frame.

2.4.12 Data Frames

Data frames are kind of like lists in that they can have named items within them.
However, it is easiest for me to think of a data frame as a spreadsheet. It has rows of items,
and each row has one or more columns. As in a spreadsheet, each column has a variable
name (say height or NumberOfBumps). There is an inherent relationship between the columns
of data that have the same row in that it is an observation of some sort. This is the
distinction between data frames and lists: the i-th row of a data frame can be considered
a single observation across all columns of variables.

Typically when I load data into R from an external source, I do so by creating a data
frame. There are other ways to load data but I find this to be the most convenient. The
topic of data frames is large enough that I will delay discussion of it until Chapter
3, where we discuss it in depth and provide some analogies to how a data frame is like a
database.

2.5 Operators

R recognizes the proper order of operations for mathematical expressions. As in normal
notation, you can override the normal order of operations by using parentheses where
appropriate. What follows is a brief discussion of some basic kinds of operators.
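A few one-liners make the order of operations concrete. This is only a sketch; the variable names a through e are mine.

```r
a <- 2 + 3 * 4     # multiplication happens before addition: 14
b <- (2 + 3) * 4   # parentheses force the addition first: 20
c2 <- 2 * 3^2      # exponents happen before multiplication: 18
d <- -2^2          # the exponent binds tighter than the minus sign: -4
e <- (-2)^2        # parentheses make the base negative: 4

print(c(a, b, c2, d, e))
```

The fourth line is the one that most often surprises people, so it is worth adding parentheses whenever a negative number is raised to a power.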

2.5.1 Assignment Operators

As described above, assignments are made using the assignment operator, <-, and
values can actually be assigned the other way with the operator ->. Examples of assignments
include:

> x <- 23
> 56 -> y
> x
[1] 23
> y
[1] 56

Again, it is important to note that (a) under assignment, there is nothing printed out
from the R engine, and (b) to see the value of a variable, just type its name on the
command line.

2.5.2 Numerical Operators

Numerical operators are defined as operations on variables. These include the normal
set of operators: addition (+), subtraction (-), multiplication (*), division (/), and
exponentiation (^). Examples of these operators are:

> x*2
[1] 46
> y-5
[1] 51
> x-y
[1] -33
> x^2
[1] 529
> x/y
[1] 0.4107143
> x
[1] 23
> y
[1] 56

Notice here that these expressions did not change the values of the variables because

there was no assignment involved.

2.5.3 Logical Operators

Often we need to run comparisons between variables. These operators determine
the truth of a statement and return a boolean (i.e., TRUE or FALSE). Operators include
equality (==; notice this is two equals signs), strict relations (< and >), inclusive
relations (>= for greater than or equal to and <= for less than or equal to), and inequality
(!=).

> x <- 23
> y <- 56
> x==y
[1] FALSE
> x<y
[1] TRUE
> x>y
[1] FALSE
> x>=y
[1] FALSE
> x!=y
[1] TRUE
> y<=x
[1] FALSE

These operators are commonly found in conditions but can also be used to select a

subset of values from a data vector (see ??).
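To give a flavor of that second use, here is a minimal sketch of selecting values from a vector with a comparison. The vector height and the names isTall and tall are my own, chosen for illustration.

```r
height <- c(23.4, 32.9, 29.7, 38.2, 32.7)

# A comparison on a vector returns one TRUE or FALSE per element.
isTall <- height > 30
print(isTall)

# Putting that logical vector inside square brackets keeps only the TRUE entries.
tall <- height[isTall]
print(tall)

# TRUE counts as 1 when summed, so this counts the matching entries.
nTall <- sum(isTall)
print(nTall)
```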


2.6 Useful Functions

The following functions were introduced in this chapter and you will be required to use

them for the exercises. To get more information on any of these functions, use the R

help system.

• class(x) This function will return the kind of variable that x is. We have been using

it all along in the discussions of data types but you will probably not use it very

much.

• dim(x) This function returns the dimension of x. This function returns the number

of rows and columns in x which is appropriate for matrices and data frames. The

result is returned as a vector of length=2 with the number of rows in the ﬁrst index

and the number of columns in the second. This function will return NULL for all

other data types.

• length(x) This will return the length of x which means different things depending

upon the kind of variable that x is.

– If x is an integer, numeric, logical, character, complex, or raw, this
function will return the number of items in x. Essentially, this will tell
you if you have a single data point or a vector of data points. Remember the
default constructors of these data types allow you to make a vector of items so
it treats all these data types as a vector and returns the length of the vector.

– If x is a list or a data frame then it will return the number of variables in that
list or data frame. For example, assume that theList is defined as at the start of 2.4.11
(with the three items x, dog, and hasStyle), then length(theList) would return the number 3.

– For matrices the function returns the number of elements in the matrix. So a

matrix with 3 rows and 2 columns would have a length of 6.

• paste(x,y) This function concatenates items into a character string. By default, this

function puts a space between the items in x and y, although you can change this

behavior by setting a value for the optional sep parameter passed to the function.

• rep(x,n) This function repeats the value x a total of n times and returns it as a

vector.

• seq(f,t,by=b) This function returns a sequence of numeric types from f to t by b.

• summary(x) This function will return an overview of the variable x.

– If x contains numerical values then it will provide the following quantitative
measures: the Minimum, the 1st Quartile, the Median, the Mean, the 3rd Quartile,
and the Maximum.

– If x is a list or data frame then it provides a summary of each variable in x.
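The functions in this list can be tried out together on a couple of small objects. The sketch below uses a short vector v and a 3 by 2 matrix m that I made up for the purpose; all other names are also mine.

```r
v <- seq(1, 10, by = 3)                # the values 1, 4, 7, 10
m <- matrix(1:6, nrow = 3, ncol = 2)   # 3 rows and 2 columns

print(class(v))                        # the kind of variable v is

dimM <- dim(m)                         # rows then columns: 3 2
dimV <- dim(v)                         # NULL, a plain vector has no dimensions

lenV <- length(v)                      # 4 items in the vector
lenM <- length(m)                      # 6 elements in the matrix
lenL <- length(list(x = 1:3, dog = "A"))   # 2 variables in the list

joined <- paste("Population", "A")             # "Population A"
dashed <- paste("Population", "A", sep = "-")  # "Population-A"

reps <- rep("Control", 3)              # "Control" repeated three times

print(summary(v))                      # Min, quartiles, Median, Mean, Max
```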


2.7 Exercises

The following exercises are meant to help you understand the items presented in this

Chapter.

1. Create two variables, x and y, of type integer using the as.integer function and assign
them the values of 4 and 5. Add x and y and store the result in a third variable
named z. What kind of variable is z?

2. Create two variables, x and y, of type integer using the as.integer function and assign
them the values of 3 and 2. Divide x by y and store the result in a third variable
named z. What kind of variable is z? Why is this different from the answer in the
previous question?

3. Coerce x <- 23 into other data types to see which are amenable using the as.* functions
for each data type.

4. What numeric values are considered TRUE when coerced into a logical data type using
the function as.logical()?

5. Create a sequence of numbers ranging from 1 to 10 by 0.1 and assign it to the variable
x.

6. Create a sequence of numbers from 100 down to 50 by 2 and assign it to the variable

y.

7. Turn the vector of character items "Control", "Control", "Control", "Ear Removal", "Ear

Removal", "Ear Removal", "Ear Removal", "Fake Ear Removal", "Fake Ear Removal", "Fake Ear Removal",

"Fake Ear Removal" into a Factor variable and create a table from it to show the number

of entries in each treatment.

8. Create a vector of character variables that contains 25 "a", 15 "b", and 58 "c"
instances. What is the length of this vector? Create a table from the entries.

9. Create a variable that is a list. In the list add variables for your name, email address,

and height.

10. How is a data frame different than a list?


Part II

Biologically Motivated Topics


Chapter 3

Data Frames

In this chapter we will be learning about data frames and how we can use them to

our beneﬁt. Data frames are useful as they are a single object within which we can

store data (to disk or databases), perform statistical analyses, and perform complicated

selections.

In my interactions with R, I spend the vast majority of my time working with data
that is contained within a data frame. This is because I typically keep my data in either
spreadsheets or in databases, both of which force me to coerce my observations into
something like:

Population,Height,Sex

A,23.4,Female

A,32.9,Female

A,29.7,Female

A,38.2,Male

A,32.7,Male

B,28.4,Female

B,27.3,Male

B,27.7,Male

B,30.1,Female

This format is relatively rigid but is amenable to several types of observations. The ﬁrst

row is a header row with the name of each variable spelled out. The second and all

subsequent rows are observations with a value for each column of data. Columns of

data are also separated by some kind of delimiter. Here I am using a comma (and the ﬁle

is probably saved as a csv ﬁle) but tabs, spaces, and other characters can also be used.

For the rest of this chapter, we will use the data above as an example to show how to

interact with and manipulate data frames.

In this Chapter you will learn the following skills:

• Enter data into a data frame.

• Load a data frame from an existing ﬁle.

• Save a data frame to a ﬁle.


• Manipulate data within a data frame.

• Perform complex queries and joins on data frames.

3.1 Data Input/Output

Data can be input into data frames in two different ways: you can enter it directly or
load it from an external file. The former method is good if you have just a little bit of
data whereas the latter is probably better if you have persistent data.

3.1.1 Entering Data Directly

The discussion of data types in Chapter 2 should be all you need to understand
how to put data in manually. To recreate the example data set you could enter:

> Pop <- c("A","A","A","A","A","B","B","B","B")
> Ht <- c(23.4,32.9,29.7,38.2,32.7,28.4,27.3,27.7,30.1)
> Sx <- c("Female","Female","Female","Male","Male","Female","Male","Male","Female")

Once you have these variables entered into R, you can put them into a single data frame
by:

> myData <- data.frame(Population=Pop, Height=Ht, Sex=Sx)

> myData

Population Height Sex

1 A 23.4 Female

2 A 32.9 Female

3 A 29.7 Female

4 A 38.2 Male

5 A 32.7 Male

6 B 28.4 Female

7 B 27.3 Male

8 B 27.7 Male

9 B 30.1 Female

> summary(myData)

Population Height Sex

A:5 Min. :23.40 Female:5

B:4 1st Qu.:27.70 Male :4

Median :29.70

Mean :30.04

3rd Qu.:32.70

Max. :38.20

Notice how the data are already numbered by observation. The names that you pass to

the data.frame() function will be the names of the variables in the data frame and the names

of the variables you previously deﬁned for them will be thrown away (e.g., there is not a

variable named Pop in myData).
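One caveat if you are running a recent version of R: since R 4.0.0, data.frame() leaves character vectors as character columns by default, so summary() reports Length, Class, and Mode for Population and Sex instead of the counts shown above. The sketch below assumes you want to reproduce the behavior in the listing and does so by asking for factors explicitly with the stringsAsFactors argument.

```r
Pop <- c("A", "A", "A", "A", "A", "B", "B", "B", "B")
Ht  <- c(23.4, 32.9, 29.7, 38.2, 32.7, 28.4, 27.3, 27.7, 30.1)
Sx  <- c("Female", "Female", "Female", "Male", "Male",
         "Female", "Male", "Male", "Female")

# stringsAsFactors = TRUE converts the character columns to factors,
# which is what the summary() output in the text assumes.
myData <- data.frame(Population = Pop, Height = Ht, Sex = Sx,
                     stringsAsFactors = TRUE)

print(class(myData$Sex))   # "factor"
print(names(myData))       # the names passed to data.frame(), not Pop, Ht, Sx
print(summary(myData))
```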

Once you have created a data frame, you can access elements within it as you would for

a list (and even as a matrix to some extent).


3.1.2 Loading Data From A File

It is relatively common for you to already have data on hand and it is a bit of a waste

of time for you to re-enter the data into R (this would also cause a high probability of

errors as you type these values in). Getting data into R is pretty easy.

The data format of the file is a relatively important item. There are methods available to
import normal Excel files into R but I will not go into them because the file format for this
program changes with each release and it is not portable across platforms (e.g., there is
no Excel on unix). Moreover, there are a lot of other places that you can get data such
as online databases, data loggers, etc. and a more general approach will be followed
here.

I will assume that you can get your data ﬁle into a text format. What matters for the

import are the following items:

1. Does the data have a row of variable names (headers) in the first row? If you do not
have a row of headers then R will assign them as V1, V2, ....

2. What character do you use to separate columns of data? Is it tab, space, comma,
or some other character that separates your data columns?

3. Do you have any items that are in quotes? Some programs will output text wrapped

in quotes. This is not that common but you should be aware of it.

4. You need to either have the data ﬁle in the same directory that you are working

in when you started R or know the full path to the ﬁle (e.g., /Desktop/data.txt or

C:Whatever).

Note! It is important for you to realize that the data you enter into a data frame have to have
the same number of data columns for every observation. In the example data file above,
there are three values in each row. If you do not have the same number of
values in each row, R will barf up some errors. Be careful here, sometimes when
you export from a particular spreadsheet program (that shall remain nameless) you can
get extra columns of data that will screw up your import. If you get some odd errors, you
may want to open the text file in a text editor to make sure. If you forget to add
one of the additional options to the read.table() function, R may actually load the file but
it won't be as you expect. For example, in the example below where I did not tell R that
the data file uses a comma as a column separator, it loads every row as a single text
observation (and considers it a factor) rather than three columns of data.

> data <- read.table("DataFrame1.txt", header=T)
> data
  Population.Height.Sex
1         A,23.4,Female
2         A,32.9,Female
3         A,29.7,Female
4           A,38.2,Male
5           A,32.7,Male
6         B,28.4,Female
7           B,27.3,Male
8           B,27.7,Male
9         B,30.1,Female
> data[1,]
[1] A,23.4,Female
9 Levels: A,23.4,Female A,29.7,Female A,32.7,Male ... B,30.1,Female

> data <- read.table("DataFrame1.txt", header=TRUE, sep=",")

> data

Population Height Sex

1 A 23.4 Female

2 A 32.9 Female

3 A 29.7 Female

4 A 38.2 Male

5 A 32.7 Male

6 B 28.4 Female

7 B 27.3 Male

8 B 27.7 Male

9 B 30.1 Female

> summary(data)

Population Height Sex

A:5 Min. :23.40 Female:5

B:4 1st Qu.:27.70 Male :4

Median :29.70

Mean :30.04

3rd Qu.:32.70

Max. :38.20

The options passed to read.table() are the file name (with path if necessary), a header
parameter (TRUE or FALSE) to indicate if the data file has a header row, and sep to indicate
what character is used for a separator. Other separators are tab (indicated as sep="\t")
and whitespace (sep="", the default, which splits on any run of spaces or tabs). Barring any
errors that I made in typing in the data in the last section (3.1.1), the printing of the data
frame should be identical.
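If you do not have DataFrame1.txt on disk, the effect of the sep option can still be tried out. The sketch below assumes the file contents are exactly the nine rows shown at the start of the chapter and feeds them to read.table() through its text argument instead of a file name; csvText, wrong, and right are names I made up.

```r
# The example data as one character string, standing in for the file on disk.
csvText <- "Population,Height,Sex
A,23.4,Female
A,32.9,Female
A,29.7,Female
A,38.2,Male
A,32.7,Male
B,28.4,Female
B,27.3,Male
B,27.7,Male
B,30.1,Female"

# Without sep, each whole line is read as a single column.
wrong <- read.table(text = csvText, header = TRUE)

# With sep = ",", the three columns are split as intended.
right <- read.table(text = csvText, header = TRUE, sep = ",")

print(dim(wrong))   # 9 rows, 1 column
print(dim(right))   # 9 rows, 3 columns
str(right)
```

For comma separated files there is also read.csv(), which is read.table() with header=TRUE and sep="," already set as defaults.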

3.1.3 Adding Data To An Existing Data Frame

Once you have a data frame in R, you can add data to it relatively easily. To add
additional rows of data you use the function rbind() (as in row bind). What you add to the
data frame must be another list or data frame that has the same variables in it as in
your original data frame. If you do not have all the variables in the thing you are adding,
R will give you an error. Here is an example.

> rbind(data, data.frame(Population="B", Height=31.3, Sex="Female"))

Population Height Sex

1 A 23.4 Female

2 A 32.9 Female

3 A 29.7 Female

4 A 38.2 Male

5 A 32.7 Male

6 B 28.4 Female

7 B 27.3 Male

8 B 27.7 Male

9 B 30.1 Female

10 B 31.3 Female

> data

Population Height Sex

1 A 23.4 Female

2 A 32.9 Female

3 A 29.7 Female

4 A 38.2 Male

5 A 32.7 Male

6 B 28.4 Female

7 B 27.3 Male

8 B 27.7 Male

9 B 30.1 Female


Notice that the added observation (B 31.3 Female) was not retained in the data
object. That is because this function does not change the data frame that is passed to
it, rather it returns a brand new data frame that is identical to the original one but has
the additional data appended on the bottom. If you want to permanently change your
existing data frame then you need to use the assignment operator as:

> data <- rbind(data, list(Population="A", Height=32, Sex="Male"))

> data

Population Height Sex

1 A 23.4 Female

2 A 32.9 Female

3 A 29.7 Female

4 A 38.2 Male

5 A 32.7 Male

6 B 28.4 Female

7 B 27.3 Male

8 B 27.7 Male

9 B 30.1 Female

10 A 32.0 Male

To add additional columns of data you use the function cbind() (as in column bind). This

amounts to adding another variable to all the observations in your current data set.

Again, for this to work, you should provide as many items as there are rows of data in

the data frame.

> cbind(data, list(SizeClass = c(1,1,1,2,2,1,2,2,1,2)))

Population Height Sex SizeClass

1 A 23.4 Female 1

2 A 32.9 Female 1

3 A 29.7 Female 1

4 A 38.2 Male 2

5 A 32.7 Male 2

6 B 28.4 Female 1

7 B 27.3 Male 2

8 B 27.7 Male 2

9 B 30.1 Female 1

10 A 32.0 Male 2

Again, if you want to make the additions to your data frame permanent then you need

to use the assignment operator.

> data <- cbind(data, list(SizeClass = c(1,1,1,2,2,1,2,2,1,2)))

> data

Population Height Sex SizeClass

1 A 23.4 Female 1

2 A 32.9 Female 1

3 A 29.7 Female 1

4 A 38.2 Male 2

5 A 32.7 Male 2

6 B 28.4 Female 1

7 B 27.3 Male 2

8 B 27.7 Male 2

9 B 30.1 Female 1

10 A 32.0 Male 2

The reason that these two functions do not change the data frame that you passed to
them is because you may want to make a temporary data frame with some additional
variables or a copy of the data frame.
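The non-destructive behavior of rbind() and cbind() is easy to check for yourself. Here is a minimal sketch on a two-row data frame; small, longer, and wider are names I invented for the example.

```r
small <- data.frame(Population = c("A", "B"),
                    Height = c(23.4, 28.4),
                    Sex = c("Female", "Male"))

# rbind() returns a new data frame with the extra row appended.
longer <- rbind(small, data.frame(Population = "B", Height = 31.3,
                                  Sex = "Female"))

# cbind() returns a new data frame with the extra column appended.
wider <- cbind(small, SizeClass = c(1, 2))

print(dim(small))    # still 2 rows and 3 columns
print(dim(longer))   # 3 rows and 3 columns
print(dim(wider))    # 2 rows and 4 columns
```

Only an assignment back to the original name, such as small <- rbind(small, ...), would change small itself.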


3.1.4 Copying Data Frames

To copy a data frame, use the assignment operator. This makes a new copy of the data
frame that is independent. For example, in the listing below, newData is made as a copy
of data. Then the Population variable for the first row is changed from A to B. Notice how
changes to newData are independent of entries in data.

> newData <- data
> newData[1,]
  Population Height    Sex SizeClass
1          A   23.4 Female         1
> newData[1,1] <- "B"
> newData[1,]
  Population Height    Sex SizeClass
1          B   23.4 Female         1
> data[1,]
  Population Height    Sex SizeClass
1          A   23.4 Female         1

3.1.5 Removing Data From A Data Frame

How you remove items from a data frame depends upon whether you are removing columns
or rows of data. To remove a row of data (e.g., a whole set of variables for a single
observation) you can use a negative sign in front of the index.

> data[-10,]

Population Height Sex SizeClass

1 A 23.4 Female 1

2 A 32.9 Female 1

3 A 29.7 Female 1

4 A 38.2 Male 2

5 A 32.7 Male 2

6 B 28.4 Female 1

7 B 27.3 Male 2

8 B 27.7 Male 2

9 B 30.1 Female 1

> data

Population Height Sex SizeClass

1 A 23.4 Female 1

2 A 32.9 Female 1

3 A 29.7 Female 1

4 A 38.2 Male 2

5 A 32.7 Male 2

6 B 28.4 Female 1

7 B 27.3 Male 2

8 B 27.7 Male 2

9 B 30.1 Female 1

10 A 32.0 Male 2

Again, this returns a data frame without the given index. If you want to make this
permanent you must make an assignment as before. You can also pass a vector of
indices to remove more than one row at a time (see also the function subset() in 3.3.1).

> data <- data[-10,]

> data

Population Height Sex SizeClass

1 A 23.4 Female 1

2 A 32.9 Female 1

3 A 29.7 Female 1


4 A 38.2 Male 2

5 A 32.7 Male 2

6 B 28.4 Female 1

7 B 27.3 Male 2

8 B 27.7 Male 2

9 B 30.1 Female 1

> data[-c(2,4,6,8),]

Population Height Sex SizeClass

1 A 23.4 Female 1

3 A 29.7 Female 1

5 A 32.7 Male 2

7 B 27.3 Male 2

9 B 30.1 Female 1

Deleting a column of data can also be accomplished in the same manner or by assigning
the variable the value of NULL.

> data <- data[,-4]

> data

Population Height Sex

1 A 23.4 Female

2 A 32.9 Female

3 A 29.7 Female

4 A 38.2 Male

5 A 32.7 Male

6 B 28.4 Female

7 B 27.3 Male

8 B 27.7 Male

9 B 30.1 Female

> data$Sex <- NULL
> data$Population <- NULL

> data

Height

1 23.4

2 32.9

3 29.7

4 38.2

5 32.7

6 28.4

7 27.3

8 27.7

9 30.1

3.1.6 Saving Data Frames to Files

There comes a time when you have to save some data you have been working on. In fact,

it is quite often. There are several ways to save data in R . First, you can have R save

every variable in memory. When you quit R using the q() function, it will ask if you want

to save:

> q()
Save workspace image? [y/n/c]: y

If you do, there will be a .RData file saved in the directory you are working in that
contains all the data you currently have in memory. When you restart R, it will load
these data back into memory for you. This is a fairly easy and direct way of getting your
data to disk and back, and it is cross-platform. If you are going to use this kind of data
saving, you should create a new folder for any data set you are working with. This will keep
the raw data file(s) in the same location as the data entered and formatted in R. The
main drawback to this is that the name of the saved data file (.RData) starts with a period
(.) and will therefore be invisible to you when you look in the folder with your normal
Finder, File Browser, or whatever. You can easily overwrite it or throw it away since it
isn't immediately visible. It is also a bit inefficient in that if you have a bunch of other
variables in memory you may not want to save them all. If I just merged a bunch of data
frames (see 3.3.2), I may only want to save the final data.

The second way that you can save your data frame is to save the data frame directly.
This allows you to save different data frames with different names and you can save
them wherever you like and name them whatever you like.

> save(data, file="MyNewSavedData.Rdata")

You can also save several variables at once by passing their names as a character vector
to the list parameter of the save() function. Here is an example:

> g <- 1:20
> otherData <- factor(c(T,T,T,T,F,F))
> g
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
> otherData
[1] TRUE  TRUE  TRUE  TRUE  FALSE FALSE
Levels: FALSE TRUE
> save(list=c("data","g","otherData"), file="DataType2.RData")

It is common for saved data from R to have the file suffix of .Rdata so let's not buck
tradition...

Once you have saved the data frame, you can load it back into memory at any time

by:

> ls()
[1] "data"
> rm(data)
> ls()
character(0)
> load("MyNewSavedData.Rdata")
> ls()
[1] "data"

Notice here I use ls() to see what is in memory and rm() to remove data from memory (and
check that it is gone), then reload the data using the load() function.
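The whole save, remove, and reload cycle can be run as a script without cluttering your working directory. The sketch below is only an illustration: it writes to a temporary file made with tempfile() rather than to MyNewSavedData.Rdata, and heights, savedFile, and restored are names I chose.

```r
heights <- c(23.4, 32.9, 29.7)

# tempfile() gives a throwaway path, so nothing in your own folders is touched.
savedFile <- tempfile(fileext = ".RData")
save(heights, file = savedFile)

# Remove the variable and confirm it is gone from the current workspace.
rm(heights)
print(exists("heights", inherits = FALSE))

# load() puts the saved objects back and returns their names.
restored <- load(savedFile)
print(restored)
print(heights)

# Clean up the temporary file.
unlink(savedFile)
```

Note that load() silently replaces any existing variable with the same name as one in the file, so it pays to check ls() first in a long session.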

3.1.7 Deleting Data Frame

Removing a data frame from memory is no different than removing any other variable.

You simply use the rm() function as:

> rm(data)

If you have a lot of different data ﬁles in memory, you can delete them individually, as a

group, or delete everything in memory at once as shown below:


> ls()
[1] "elvis genotypes"          "kent hovinds secret data"
[3] "myCoolData"               "x"
[5] "y"                        "yourNotLooserData"
> rm("x")
> ls()
[1] "elvis genotypes"          "kent hovinds secret data"
[3] "myCoolData"               "y"
[5] "yourNotLooserData"
> rm(list=c("y","myCoolData"))
> ls()
[1] "elvis genotypes"          "kent hovinds secret data"
[3] "yourNotLooserData"
> rm(list=ls())
> ls()
character(0)

To delete individual variables, you must name them, but when you delete several
variables you need to tell the rm() command that you are going to pass it a list of variable
names to delete (the list= parameter). The final example shows you how you can tell it
to delete everything in memory (i.e., delete this list, and this list is all the data that are
currently in memory).
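If you would like to practice these three forms of rm() without wiping out your own session, one option is to do it inside a separate environment. The sketch below is my own arrangement, not something used elsewhere in this text: scratch is a hypothetical environment created with new.env(), and the envir argument points ls() and rm() at it.

```r
# A separate environment to hold the practice variables.
scratch <- new.env()
assign("x", 1, envir = scratch)
assign("y", 2, envir = scratch)
assign("myCoolData", 3, envir = scratch)
print(ls(scratch))

# Delete a single variable by name.
rm("x", envir = scratch)
afterOne <- ls(scratch)
print(afterOne)

# Delete several variables by passing their names to the list parameter.
rm(list = c("y"), envir = scratch)

# Delete everything that is left.
rm(list = ls(scratch), envir = scratch)
afterAll <- ls(scratch)
print(afterAll)
```

Leaving off the envir argument gives the behavior shown in the listing above, where the commands act on your main workspace.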

3.1.8 Components of a Data Frame

A data frame has a few distinct components in addition to the data points. Using the
function attributes() shows the things that make up a data frame. This function returns
a list containing the variable names, class, and row.names.

> attributes(data)
$names
[1] "Population" "Height"     "Sex"

$class
[1] "data.frame"

$row.names
[1] 1 2 3 4 5 6 7 8 9

> dataAttributes <- attributes(data)
> dataAttributes$row.names
[1] 1 2 3 4 5 6 7 8 9

There are also other ways to access these attributes. In Chapter 2, you were introduced
to the class(x) function and we will not need to go over that again here. There are
corresponding functions names(x) and row.names(x) that you can easily use to get access to these
components of a data frame. You can also use these functions to assign new values to
an existing data set. For example:

> data

Population Height Sex

1 A 23.4 Female

2 A 32.9 Female

3 A 29.7 Female

4 A 38.2 Male

5 A 32.7 Male

6 B 28.4 Female

7 B 27.3 Male


8 B 27.7 Male

9 B 30.1 Female

> names(data) <- c("Group","DistanceFromGround","Gender")

> data

Group DistanceFromGround Gender

1 A 23.4 Female

2 A 32.9 Female

3 A 29.7 Female

4 A 38.2 Male

5 A 32.7 Male

6 B 28.4 Female

7 B 27.3 Male

8 B 27.7 Male

9 B 30.1 Female

> row.names(data) <- seq(9,1,by=-1)

> data

Group DistanceFromGround Gender

9 A 23.4 Female

8 A 32.9 Female

7 A 29.7 Female

6 A 38.2 Male

5 A 32.7 Male

4 B 28.4 Female

3 B 27.3 Male

2 B 27.7 Male

1 B 30.1 Female

3.2 Slicing

Grabbing portions of your data frame is pretty easy. Below are some examples of how

you can access some of your data components:

> data[,1]
[1] A A A A A B B B B
Levels: A B
> data[,2]
[1] 23.4 32.9 29.7 38.2 32.7 28.4 27.3 27.7 30.1
> data[1:4,]
  Population Height    Sex
1          A   23.4 Female
2          A   32.9 Female
3          A   29.7 Female
4          A   38.2   Male
> data$Sex
[1] Female Female Female Male   Male   Female Male   Male   Female
Levels: Female Male
> data$Population
[1] A A A A A B B B B
Levels: A B

Here are some rules that you need to keep in mind:

1. To access a data frame's items by index, you use the square brackets [] along with
the indices of the components separated by a comma.

2. R indexes two-dimensional data in row, column order. That is to
say that the first index is for the row and the second index is for the column. For
example, data[1,2] will provide access to the 1st row and the 2nd column.

3. To get all the items in a given row or column you can leave out the other index. For
example, the command data[i,] returns all columns of data from the i-th row whereas
data[,j] returns the data in all rows for the j-th column.

4. You can also index the data for a particular column by calling its name. For exam-

ple, the example data set has variables named Population, Height, and Sex. You can

get all the data in one of these variables by using the notation data$VariableName as in

data$Population.

5. To get a range of values on one or the other index, such as the 2nd through 5th
entries in the height variable, you put the range of indices separated by a colon as
in data[2:5,2]. You can also combine this with the naming of the variables, which
may make it a bit easier to read, as data$Height[2:5]. This can work
in both directions, as shown above when retrieving all the data for the first four
records (data[1:4,]).
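The rules above can be checked against each other on the example data. In the sketch below I rebuild the data frame under the name myData (to avoid depending on whatever is currently stored in data); the other variable names are mine as well.

```r
myData <- data.frame(
  Population = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  Height = c(23.4, 32.9, 29.7, 38.2, 32.7, 28.4, 27.3, 27.7, 30.1),
  Sex = c("Female", "Female", "Female", "Male", "Male",
          "Female", "Male", "Male", "Female")
)

oneValue  <- myData[1, 2]         # row 1, column 2: a single height
oneRow    <- myData[3, ]          # every column of the third observation
oneColumn <- myData[, 2]          # every row of the second column
byIndex   <- myData[2:5, 2]       # rows 2 through 5 of column 2
byName    <- myData$Height[2:5]   # the same values, selected by name

print(oneValue)
print(oneRow)
print(identical(byIndex, byName))
```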

3.3 Complex Selections

R data frames can be thought of as pseudo databases. If you ever interact directly with a
database, you use the Structured Query Language (SQL), a standard language that was
adopted first by the American National Standards Institute (ANSI) and later by the
International Organization for Standardization (ISO). R does
allow you to interact with databases through one of its many database libraries but I will
not be covering that in this chapter. However, if you are familiar with some basic SQL
operations you will find this section rather easy. If not, I will be spending a little extra
time trying to convince you that it is probably in your best interest to understand how to
query your data frames because it gives you a lot of power and flexibility, and I will show
you how to use a data frame as a lite-database. After all, being
agile with your data is a key skill I hope you will be learning in this course. Even if you do
not ever use a database, this section is really important as it will allow you to think about
interacting with your data in interesting and complex ways.

To understand SQL you need to understand that in a database data is contained within

tables. And tables have rows and columns of data, just like a data frame. Each table also

has a name. You can think of a database table as a worksheet in a spreadsheet program

if that helps (though real database gurus are probably cringing as they read that). The

SQL language is very easy to understand and I will partition this section into commands

that query the database and those that create new data frames by the combination of

two or more existing data frames that have a common data column.

3.3.1 Queries

Queries are essentially what we have been doing in 3.2 with indices so I won’t go over

the basic stuff that we have already covered other than to show the SQL equivalents

in case you need to know them. I will however delve a bit into how the function subset()

works because it is pretty powerful.


To select all observations in SQL, you use the statement SELECT * FROM tableName, which in
R is simply what we have been doing by typing the name of the data frame (hereafter I
will use data to refer to the name of the table for similarity with our previously loaded
data frame).

> data

Population Height Sex

1 A 23.4 Female

2 A 32.9 Female

3 A 29.7 Female

4 A 38.2 Male

5 A 32.7 Male

6 B 28.4 Female

7 B 27.3 Male

8 B 27.7 Male

9 B 30.1 Female

In these SQL statements I use words in all capital letters to indicate SQL language
components and lowercase words to indicate table names or variables. Also, in SQL the
asterisk means "everything" (as in all variables).

The strength of SQL and databases lies in the fact that you can do complicated selections
from the tables. For example, in SQL you can select by row number and column number
using the statement SELECT * FROM data WHERE rownum==x AND colnum==y. Using the logical
operator AND adds a lot of power to this statement. However, in R we have been doing this
using the indices directly and the square bracket notation as (with x = 1 and y = 2):

> data[1,2]
[1] 23.4

Several rows or columns can be selected in SQL by SELECT * FROM data WHERE rownum>=5 AND
rownum<=7, which is accomplished in R as:

> data[5:7,]

Population Height Sex

5 A 32.7 Male

6 B 28.4 Female

7 B 27.3 Male

To get only a subset of the variables in each row, you can indicate which variables you

are interested in selecting in SQL as SELECT height, sex FROM data and in R we can either

slice both indices as:

> data[,2:3]

Height Sex

1 23.4 Female

2 32.9 Female

3 29.7 Female

4 38.2 Male

5 32.7 Male

6 28.4 Female

7 27.3 Male

8 27.7 Male

9 30.1 Female

Or we can use the subset() function as:

> subset(data, select=c("Height","Sex"))

Height Sex

1 23.4 Female

2 32.9 Female

3 29.7 Female

4 38.2 Male

5 32.7 Male

6 28.4 Female

7 27.3 Male

8 27.7 Male

9 30.1 Female

Often times you will have rather large data sets in R that you will be working with and

it may be easier to grab parts of your data set by using names of variables rather than

by using column indices (it is up to you).

You can also get a bit more specific and only look for components in your data set using
relational operations. For example, the SQL statements SELECT * FROM data WHERE height>30
and SELECT * FROM data WHERE height>30 AND columnnum==2 are accomplished in R by:

> data[data$Height>30,]

Population Height Sex

2 A 32.9 Female

4 A 38.2 Male

5 A 32.7 Male

9 B 30.1 Female

> data[data$Height > 30, 2]
[1] 32.9 38.2 32.7 30.1

Notice how in the last example here I mixed the use of selecting subsets of observations using the relational operator > and subsets of columns using the numeric index. Also notice how using the 2 in the position after the comma gives only the second column of data.

You can combine conditions in a SELECT-like query such as SELECT * FROM data WHERE height>30 AND sex="Male" by using the binary & operator as:

> data[data$Height > 30 & data$Sex == "Male", ]

Population Height Sex

4 A 38.2 Male

5 A 32.7 Male

This complicated statement needs to be dissected to reduce confusion. The part in the square brackets [] consists of the stuff on the left side of the comma (data$Height>30 & data$Sex=="Male") and the stuff on the right side (which happens to be empty in this case). There are some things to remember when doing compound statements like this:

1. The & operator requires that the conditions on both sides of it are TRUE.

2. The equality operator == must be a double equals sign.

3. I find it easy to take a few passes at these compound statements to make sure I am getting them correct.
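One way to take those "few passes" is to build each condition as its own logical vector and look at it before combining. The sketch below does that; the data frame is retyped from the printouts in this chapter, and the names isTall, isMale, and keep are hypothetical helpers of mine rather than anything defined elsewhere in the text.

```r
# Example data, retyped from the printed output in this chapter.
data <- data.frame(
  Population = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  Height     = c(23.4, 32.9, 29.7, 38.2, 32.7, 28.4, 27.3, 27.7, 30.1),
  Sex        = c("Female", "Female", "Female", "Male", "Male",
                 "Female", "Male", "Male", "Female")
)

isTall <- data$Height > 30       # one TRUE/FALSE value per row
isMale <- data$Sex == "Male"     # note the double equals sign
keep   <- isTall & isMale        # TRUE only where both conditions are TRUE

which(keep)    # row numbers that satisfy both conditions (rows 4 and 5)
data[keep, ]   # the same rows as the one-line compound statement
```

Printing isTall and isMale separately makes it easy to see which condition is responsible when a compound selection returns something unexpected.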

In addition to the AND operator in SELECT statements, there is also an OR operator. It is valid to say in SQL SELECT * FROM data WHERE sex=="Female" OR population=="A". This can also be done in R using the OR operator |.

> data[data$Sex == "Female" | data$Population == "A", ]

Population Height Sex

1 A 23.4 Female

2 A 32.9 Female

3 A 29.7 Female

4 A 38.2 Male

5 A 32.7 Male

6 B 28.4 Female

9 B 30.1 Female

If the selection of subsets of your data becomes more complicated than this, you can use parentheses to separate out conditions. This makes it easier for you to read, and since you are the one who will be writing this code and coming back later to look at it, it pays to be as un-convoluted as possible. Here is a whack example from the SQL SELECT * FROM data WHERE (population=="A" AND sex=="Female") OR (population=="B" AND height<30).

> data[(data$Population == "A" & data$Sex == "Female")
+   | (data$Population == "B" & data$Height < 30), ]

Population Height Sex

1 A 23.4 Female

2 A 32.9 Female

3 A 29.7 Female

6 B 28.4 Female

7 B 27.3 Male

8 B 27.7 Male

Note: I split the command across two lines at the OR operator. In R when you do this, it

gives you the little + sign and you can continue typing as if it were on a single line. I had

to do this because the command is longer than the width of this paper...
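For long compound selections like this one, the subset() function can take the row condition as its second argument, which saves typing data$ in front of every variable. The following is only a sketch of that alternative: the data frame is retyped from the printouts above, and viaBrackets and viaSubset are names I made up for the comparison.

```r
# Example data, retyped from the printed output in this chapter.
data <- data.frame(
  Population = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  Height     = c(23.4, 32.9, 29.7, 38.2, 32.7, 28.4, 27.3, 27.7, 30.1),
  Sex        = c("Female", "Female", "Female", "Male", "Male",
                 "Female", "Male", "Male", "Female")
)

# The square bracket version from the text.
viaBrackets <- data[(data$Population == "A" & data$Sex == "Female") |
                    (data$Population == "B" & data$Height < 30), ]

# The same query with subset(); column names are looked up inside data.
viaSubset <- subset(data, (Population == "A" & Sex == "Female") |
                          (Population == "B" & Height < 30))

rownames(viaSubset)   # rows 1, 2, 3, 6, 7, and 8, as in the output above
```

Both forms select the same rows; which one reads better is mostly a matter of taste.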

3.3.2 Joins

OK, so now that you have everything you want to know about how to select stuff from within a single data frame with an arbitrary level of complexity, let's move into joins. A join is an operation where you have two or more tables (or data frames) and you are going to create a new one based upon the merging of the two, provided that they both have a variable in them you can use as a common index. Here are two examples that we will be using. The first table is the data table we have been working with thus far.

> data

Population Height Sex

1 A 23.4 Female

2 A 32.9 Female

3 A 29.7 Female

4 A 38.2 Male

5 A 32.7 Male

6 B 28.4 Female

7 B 27.3 Male

8 B 27.7 Male

9 B 30.1 Female

The second table is one that has characteristics of the Populations themselves. It is in the

example data sets and is called PopulationAttributes.txt and we can load it into R as:

> popData <- read.table("PopulationAttributes.txt", header = TRUE, sep = ",")
> popData
  Population LongName      State    North      East Elevation
1          A Richmond   Virginia 37.53300  -77.4670      45.7
2          B  Seattle Washington 47.60972 -122.3331       0.0

If you look at these two tables, there is the common variable Population. So in essence, I could add the data from popData and data to create a new data set that has all this information. It is common in databases to have tables split like this. It saves space (imagine having the 5 extra data columns for each row in data; it would be repetitive and for large data sets may max out the memory of your computer). It is also common to find biologists who have programmed software to do some kind of analysis that requires you to put some kinds of data in one file, another kind in a second file, etc. Joins allow you to take these different data frames and join them (catchy name, no?).

To join two tables you will use the function merge() on the data sets. In SQL this would be SELECT * FROM data, popData WHERE data.Population == popData.Population. Fortunately, it is a bit easier to do this in R. Here is an example:

> merge(data, popData)
  Population Height    Sex LongName      State    North      East Elevation
1          A   23.4 Female Richmond   Virginia 37.53300  -77.4670      45.7
2          A   32.9 Female Richmond   Virginia 37.53300  -77.4670      45.7
3          A   29.7 Female Richmond   Virginia 37.53300  -77.4670      45.7
4          A   38.2   Male Richmond   Virginia 37.53300  -77.4670      45.7
5          A   32.7   Male Richmond   Virginia 37.53300  -77.4670      45.7
6          B   28.4 Female  Seattle Washington 47.60972 -122.3331       0.0
7          B   27.3   Male  Seattle Washington 47.60972 -122.3331       0.0
8          B   27.7   Male  Seattle Washington 47.60972 -122.3331       0.0
9          B   30.1 Female  Seattle Washington 47.60972 -122.3331       0.0
> class(merge(data, popData))
[1] "data.frame"

As you can see, it returns a new data frame with all the data included. I think this has

gotten you enough exposure so that you can probably be dangerous. The best way to get

comfortable with these methods is to actually use them.
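If you want to be explicit about which variable the join uses, merge() accepts a by argument, and it accepts by.x and by.y when the key has a different name in each table. The sketch below retypes both example tables from the printouts above rather than reading PopulationAttributes.txt; the renamed column PopCode and the objects popData2, joined, and joined2 are hypothetical and exist only for this illustration.

```r
# Both tables retyped from the printed output in this section.
data <- data.frame(
  Population = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  Height     = c(23.4, 32.9, 29.7, 38.2, 32.7, 28.4, 27.3, 27.7, 30.1),
  Sex        = c("Female", "Female", "Female", "Male", "Male",
                 "Female", "Male", "Male", "Female")
)
popData <- data.frame(
  Population = c("A", "B"),
  LongName   = c("Richmond", "Seattle"),
  State      = c("Virginia", "Washington"),
  North      = c(37.53300, 47.60972),
  East       = c(-77.4670, -122.3331),
  Elevation  = c(45.7, 0.0)
)

# Name the common variable explicitly instead of letting merge() find it.
joined <- merge(data, popData, by = "Population")
dim(joined)   # 9 rows, 8 columns

# Hypothetical case: the key column is named differently in the second table.
popData2 <- popData
names(popData2)[1] <- "PopCode"
joined2 <- merge(data, popData2, by.x = "Population", by.y = "PopCode")
names(joined2)
```

Spelling out the key is a little more typing, but it protects you if the two tables happen to share a second column name that you did not intend to join on.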


3.4 Useful Functions

The following functions were introduced in this chapter and you will be required to use

them for the exercises. To get more information on any of these functions, use the R

help system.

• cbind(x) This function binds a column onto the right side of x. This only works with some kinds of data types (e.g., those where an operation of appending on a column of data makes sense).

• rbind(x) This function binds a row of data onto the end of x. Again, this works for those data types for which this operation makes sense.

• load(x) If x is the name of a .Rdata data file then it will load the contents into memory.

• merge(x, y) This function takes two data frames and merges them on a common variable name. If there is more than one common variable name you can specify which one, and if there are no commonly named variables then you are out of luck (unless you have variables that hold the same data but are just named differently).

• rm(x) This function removes x from memory. Gone. Auf wiedersehen. Can't get it back.

• save(x, file=y) This function saves the R object x to a file named y.

• subset(x) This function returns a slice of your data frame where you can specify which variables to use. You can also do this with creative use of conditional operators and variable names.
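To see how save(), rm(), and load() fit together, here is a small round trip. It is a sketch under a few assumptions of mine: the data frame measurements is made up for the example, and the file is written to R's temporary directory (via tempdir()) so that nothing is left behind in your working folder.

```r
# A small, made-up data frame to save.
measurements <- data.frame(id = 1:3, height = c(23.4, 32.9, 29.7))

# Write the object to an .Rdata file in the session's temporary directory.
rdataFile <- file.path(tempdir(), "measurements.Rdata")
save(measurements, file = rdataFile)

# Remove the object from memory; it now lives only in the file.
rm(measurements)

# Loading the file restores the object under its original name.
load(rdataFile)
measurements
```

Notice that load() takes only the file name; the object comes back with the name it had when it was saved.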


3.5 Exercises

The following exercises are meant to help you understand the items presented in this

Chapter.

1. Create three different variables, a logical one, one that is a numeric type, and a vector

of characters. Use these to create a data frame named theData.

2. In the folder for this Chapter there is a text ﬁle named GuinneaPigData.csv. Load it into

memory and print out a summary.

3. How do you indicate a missing data point in a data ﬁle?

4. Add a numeric data column to the existing data frame, theData. Provide a summary of

the data.

5. How would you save the data frame, theData, to a file named newData.Rdata?

6. What is the difference between row major indexing and column major?

7. Using index numbers, select the 2nd and 3rd rows of the data set theData.

8. Read in the data ﬁle PersonData.csv from the class data set. What kind of data type is

the variable Names? How can you change this to a character type and then change the

name of the third entry in the data frame, theData, to Thomas?

9. Create a new data set with two variables, one that is Order = -1:4 and the other that is Home=c("Olympia", "Juanita", "Centralia", "Tacoma", "Olympia", "Olympia"). Merge this data frame with the one named theData and assign it the name combinedData.

10. How would you perform a query of the combined data set to select all records that have Order >= 3 or Home == "Olympia"?


Chapter 4

Summary Statistics

In this chapter you will explore some of the methodologies that R has for describing

your data. R is an excellent platform for exploring data, looking at relationships among

variables, and graphically portraying results.

In this Chapter you will learn the following skills:

• Learn about some common numerical distributions.

• Learn about commonly used statistical distributions.

• Understand parametric summary statistics.

• Explore non-parametric summary statistics.

• Use the table() function as an entry point into contingency table analysis.

• Create single and multiple line ﬁgures.

• Create histograms and density plots.

4.1 Distributions

R and its various sub-packages contain more numerical distributions than you will probably ever need to use. Moreover, they provide them in a clear and concise interface that has a consistent format. To my knowledge, all the distributions provide the following four components:

1. A density function that is of the form dNameOfDistribution (e.g., dnorm(), df() & dchisq()).

2. A distribution function that is called as pNameOfDistribution (e.g., pnorm(), pf() & pchisq()).

3. A quantile function named qNameOfDistribution (e.g., qnorm(), qf() & qchisq()).

4. A function that produces random numbers sampled from the distribution that is named rNameOfDistribution (e.g., rnorm(), rf() & rchisq()).
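To make the naming pattern concrete, here are the four functions for the normal distribution, using its default mean of 0 and standard deviation of 1. This is just a quick illustration; the seed value passed to set.seed() is an arbitrary choice of mine so that the random draws can be repeated.

```r
# d: density (height of the curve) at a point
dnorm(0)        # about 0.3989, which is 1 / sqrt(2 * pi)

# p: distribution function, the area under the curve to the left of a point
pnorm(1.96)     # about 0.975

# q: quantile function, the point with the given area to its left
qnorm(0.975)    # about 1.96

# r: random draws from the distribution
set.seed(42)    # arbitrary seed so the draws can be reproduced
rnorm(3)

# The p and q functions undo each other.
pnorm(qnorm(0.975))
```

The same four prefixes work for the other distributions, with each distribution's own parameters (for example, df for the chi-square functions).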

These are specifically helpful in a number of situations. For example, you may be running a test and calculating a $\chi^2$ statistic on some table of data and want to know if the value of your observed statistic, $\chi^2_{Obs}$, is large given the particular degrees of freedom that you have at your disposal. Now typically, we have memorized, due to the sheer number of times that we have used it, what the critical value for a $\chi^2$ statistic with a single degree of freedom should be ($\approx 3.841459$, right?). However, what if we have 8 degrees of freedom and $\chi^2_{Obs} = 15.507$? You could go find that old stats book on the shelf and page through the back of it to find the correct Appendix that has the right table (How do you read those tables again?). Or you could use the various functions in R.

In this section, three aspects of using distributions within a statistical context will be introduced. First, you will learn how to determine critical values for the $\chi^2$ distribution as used in formal hypothesis testing using the quantile functions. Then you will see how the distribution function can tell you the probability of a particular estimation of the $\chi^2$ test statistic.

4.1.1 Finding Critical Values

In formal hypothesis testing, there is a specific test statistic that is proposed. Moreover, the estimate of that statistic is compared to a known cutoff set by the degrees of freedom in the model and the Type I error rate that you have chosen (i.e., the α value). For some reason, as biologists we have settled on α = 0.05 as having some kind of special meaning. Now, this is probably an oversimplification of things that was used initially as a teaching aid for understanding the meaning of Type I errors. There is nothing intrinsically interesting about α = 0.05 and it is probably more informative for me to know the real probability of your calculated test statistic rather than if it exceeds some arbitrary cutoff. I mean, is it really that different an interpretation if P = 0.049 versus P = 0.051? That being said, let's jump into understanding how we find critical values for some pre-defined value for α in different distributions.

The most commonly used distribution observed as an undergrad is probably the $\chi^2$ distribution. The distribution itself is shown in Figure 4.1 for three different values for the degrees of freedom. This and other statistical distributions require that you provide the degrees of freedom before it can give you any information.

For any one particular set of the parameters α and df, there is a defined cutoff. The value of the cutoff is defined as the point along the x-axis at which there is 1 − α of the area under the curve to the left of the point and α of the area under the curve from that point and beyond. This is a very non-technical definition, but I think you get the point when you consider the α shaded region in Figure 4.2 and the 1 − α region that is unshaded.

To determine the critical value of the $\chi^2$ distribution you use the qchisq() function. If you were to look up the signature of this function (by typing ?qchisq into R), you would see that it accepts the following options:

qchisq(p, df, ncp = 0, lower.tail = TRUE, log.p = FALSE)

There are two required parameters for this function, p and df. You can tell by looking at this signature that they are required because they do not have an = sign next to them and a default value given. If a parameter has a variable=value format in a function signature then the value will be assigned to variable if you do not give it a value when you call it. Default values are very helpful and save a lot of typing on your part.

Figure 4.1: Values for the density function for the $\chi^2$ distribution with 1, 2, and 3 degrees of freedom.

Figure 4.2: A graphical depiction of the critical value of the $\chi^2$ distribution for α = 0.05 and df = 3. The shaded region constitutes a proportion of the area under the curve equal to α.

The parameter p is the 1 − α cutoff you are interested in finding. In the classic case, this would be 1 − 0.05 = 0.95. At first, it seems a little backwards to use 1 − α instead of α but if we look at the graphical depiction of this distribution in Figure 4.2, we see that the point in question is where we actually have 95% of the area under the curve and we are interested in the extreme α portion. The next required parameter is df, which corresponds to the degrees of freedom. As shown in Figure 4.1, this parameter controls both the shape and location of the $\chi^2$ values.

There are several optional parameters that you can pass to the qchisq() function and I will briefly mention them here for completeness. If you are interested in a more in-depth discussion of these parameters, look up the qchisq() function and read the documentation. The ncp=0 option specifies a non-centrality parameter allowing you to get the critical values for a non-central $\chi^2$ distribution. The lower.tail=TRUE option indicates that p is the proportion of the distribution in the lower tail (i.e., $P[X \le x] = p$) rather than the proportion in the upper tail (i.e., $P[X > x] = p$). The default value here is what we expect since we pass p = 1 − α and are interested in finding the α proportion on the right side of the distribution, not on the left side of the distribution (which, for df = 3, would be all the values less than or equal to about 0.3518). Finally, the log.p=FALSE option allows you to query using the log of p rather than p directly.
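Here is a short sketch of qchisq() in action for the two cases mentioned earlier (1 and 8 degrees of freedom), along with the lower.tail option. The variable names alpha, crit1, crit8, and crit8Upper are mine and only serve the example.

```r
alpha <- 0.05

# Critical value with 1 degree of freedom: about 3.841459.
crit1 <- qchisq(1 - alpha, df = 1)
crit1

# Critical value with 8 degrees of freedom: about 15.507.
crit8 <- qchisq(1 - alpha, df = 8)
crit8

# Same cutoff, asked for in terms of the upper tail: pass alpha itself
# and tell qchisq() that p refers to the right side of the distribution.
crit8Upper <- qchisq(alpha, df = 8, lower.tail = FALSE)
all.equal(crit8, crit8Upper)
```

So an observed statistic of 15.507 with 8 degrees of freedom sits essentially right at the α = 0.05 cutoff, which is something you can now check without digging out the table in the back of a stats book.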

There are several other statistical distributions that you can query in R for particular critical values. Common ones that you will be playing with in the Exercises portion of this chapter include Student's t from qt and Fisher's F from qf.
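The quantile functions for these two distributions are called the same way as qchisq(), only with their own degrees of freedom arguments. The degrees of freedom below (10 for t; 2 and 20 for F) are arbitrary values I picked for illustration, as are the names tCrit and fCrit.

```r
# Two-tailed t test at alpha = 0.05 with 10 degrees of freedom:
# put alpha / 2 in each tail, so ask for the 0.975 quantile.
tCrit <- qt(0.975, df = 10)
tCrit      # about 2.228

# F distribution with 2 numerator and 20 denominator degrees of freedom.
fCrit <- qf(0.95, df1 = 2, df2 = 20)
fCrit      # about 3.49

# A handy check: an F with 1 numerator df is the square of a t.
all.equal(qf(0.95, df1 = 1, df2 = 10), tCrit^2)
```

Note that qf() needs two degrees of freedom values, one for the numerator and one for the denominator.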

Scatter & Line Plots

Creating a simple plot of a line (or points in a sequence) is accomplished using the plot() function. The signature of this function (i.e., the things that you can pass to the function and the things it expects) is:

plot(x, y, ...)

This listing is not very informative! Don't worry, they get more interesting as we go along. The plot() function is kind of a dummy function that allows you to plot lots of different kinds of things, and if things can be plotted, they should know how to plot themselves. Well, that is the theory at least. Let's jump into this graphing stuff by starting off with a more basic approach to creating graphs and building up to what we see in Figure 4.1.

When you begin to create a plot, there are some default characteristics of the plot that you may want to override. For example, the R code plot( rnorm(10) ) produces the graph shown in the leftmost panel of Figure 4.3 consisting of a sequence of 10 random points selected from a normal probability distribution (we will discuss these random functions later in Section 4.2). If you try it, yours will look different; that is why they are random...

The function rnorm(x) returns x random numbers selected from a normal probability distribution with µ = 0 and σ = 1.0 (you can change these values, check the documentation on this function using the ?rnorm command). When you look at this plot, it is rather plain and does not convey any more information than 10 little circles. It may be of interest to you to be able to change some of the properties of this plot. For example, you may want to modify:

• The shape of the symbols

• The color of the symbols

• Add a line to connect the symbols, and perhaps modify the color, width, shape of

that line.

• Provide more meaningful axis labels.

• Remove the box around the plot (my pet peeve)

To do this, we must understand what a graph consists of, how to access the various components, and how to find more information on the appropriate levels that can be set to these components. This chapter will be very long because it takes a lot of page real estate to show a graph, but I think you'll be happy with the results when you can whip out a nice looking graph of your data. When possible, I will use random numbers to create these graphs so as you go through and attempt to recreate them, yours will look slightly different than mine.

To customize any of these values, you need to pass additional information to the plot() function. This is what the ... part of the function signature shown above is for. Table 4.1 shows a list of additional commands that can be passed to the plot() function to customize plot appearances.

Here are some examples of how you would use some of these optional parameters with

graphs shown in Figure 4.3:

> plot(x, y, xlab = "X Label", ylab = "Y Label", pch = 3, col = "green", bty = "l")
> plot(x, y, xlab = "X Label", ylab = "", pch = 2, col = "blue", type = "b", bty = "n", lwd = 2)
> plot(x, y, xlab = "X Label", ylab = "", main = "Title", sub = "subtitle", pch = 2, col = "red",
+   type = "l", bty = "n", lwd = 5)

Figure 4.3: Some example graphs with alternate values for symbols, line types, widths, colors,

and titles.

When creating complicated graphs, I find it easy to build them up incrementally. Start with a plain plot() command to see what the output looks like. Then customize the labels and titles and plot it again to see it. Then continue to add parameters and review the plot.
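As a sketch of that incremental approach, the three calls below draw the same made-up data with progressively more options. The x and y values, the seed, and the label text are all placeholders of mine; swap in your own data.

```r
# Made-up data: ten positions and ten random values.
set.seed(1)          # arbitrary seed so the example can be repeated
x <- 1:10
y <- rnorm(10)

# Pass 1: plain plot, just to see what you have.
plot(x, y)

# Pass 2: add meaningful axis labels and a title.
plot(x, y, xlab = "Index", ylab = "Value", main = "Ten random values")

# Pass 3: connect the points, change symbol and color, and drop the box.
plot(x, y, xlab = "Index", ylab = "Value", main = "Ten random values",
     type = "b", pch = 2, col = "blue", lwd = 2, bty = "n")
```

Each call redraws the whole figure, so you can stop adding options as soon as the graph looks the way you want.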


Table 4.1: Some useful additional commands to customize the appearance of a figure. For a complete listing of possible values that can be customized, try the ?par command.

Command  Usage                       Description
bg       bg="red"                    Colors the background of the figure the specified color.
bty      bty="x"                     Sets the style of the box around the graph. Useful values are "o" for a complete box (the default); "l", "7", "c", "u", and "]", which make a box with sides resembling the upper case version of these characters; and "n" for no box (my preference).
cex      cex=1.0                     Magnifies the default font size by the corresponding factor.
col      col="blue"                  Colors the line and symbols the given color.
fg       fg="blue"                   Colors the foreground of the image the set color.
lty      lty=x                       Specifies the line type (0 = none, 1 = solid, 2 = dashed, 3 = dotted, etc.).
lwd      lwd=x                       Specifies the width of the line (1 = default).
main     main="Title for Graph"      Sets a title along the top of the graph.
mfrow    mfrow=c(nr,nc)              Creates a matrix of plots with the given number of rows (nr) and columns (nc; see 4.3.1 for an example).
pch      pch=x                       Sets the symbol that is plotted on the figure.
sub      sub="Subtitle on Graph"     Adds a subtitle along the bottom of the graph, below the x-axis label.
type     type="x"                    Sets the plot type. Plot types can be "p" for points (the default), "l" for lines, and "b" for both lines and points.
xlab     xlab="label for x-axis"     Sets the label on the x-axis.
ylab     ylab="label for y-axis"     Sets the label on the y-axis.


Overlaying Plots

There are times where it is desirable to produce several plots on a common background (e.g., the different values for df in Figure 4.1). R allows you a lot of leeway to mix up different types of graphs in the same plot (see Figure 11.4 for a rather complex combination of images and plots overlaid on the same area).

To overlay two graphs, you use the par(new=T) command to tell R that the following command is going to apply to the currently active graphics device. This function allows you to adjust a lot of different graphical parameters and the plotting of a new image onto an existing one is only one of the things that you can adjust. For a full discussion of other options that par() accepts type ?par in R. You use it as follows:

plot(x1, y1)
par(new=T)
plot(x2, y2)

This will take the plot for the second set of variables and plot it on the same graphics device as the previous one. When you overlay more than one plot on the same graphing area, you must take into consideration the different scales that the graphs have. By default R will try to maximize the area that is being plotted by changing the default ranges of the x- and y-axes. For example, if I have data such as $x_1 = [0, 1, 2, 3]$; $y_1 = [10, 11, 12, 13]$ and plot it, R will automatically scale the axes to have limits of xlim=c(0,3) and ylim=c(10,13), which means that the x-axis will start and end at 0 and 3 and the y-axis will start and end at 10 and 13.

This is what would be expected to happen and works nicely until you try to add another plot. Suppose your other data has values of $x_2 = [11, 12, 13, 14]$; $y_2 = [23, 22, 21, 20]$ and you try to overlay the two plots by simply typing:

> x1 <- c(0, 1, 2, 3)
> y1 <- c(10, 11, 12, 13)
> x2 <- c(11, 12, 13, 14)
> y2 <- c(23, 22, 21, 20)
> plot(x1, y1, col = "red")
> par(new=T)
> plot(x2, y2, col = "blue")

You get the image shown in Figure 4.4.

There are several obvious issues with this image.

1. You cannot read the axis labels. The two images are put right on top of each other

and the axes are individually scaled to ﬁt the data in each plot () command.

2. It is difficult to tell the relationship among the data. If you look at the raw data, x1[1] ≠ x2[1], but in the plot it appears that they are equal.

3. The labels on the axes are typed over each other.

Figure 4.4: Plot of two data sets using the par(new=T) command but not taking into consideration the axis limits of the two data sets before plotting.

To overcome these issues you need to first find the appropriate limits for the values in both of the data sets, and for both plot() statements we need to set the xlim and ylim values (see Table 4.1) to the appropriate values. These values will tell R what the minimum and maximum values for the x- and y-axes should be. Here is some code that does this.

> x1 <- c(0, 1, 2, 3)
> y1 <- c(10, 11, 12, 13)
> x2 <- c(11, 12, 13, 14)
> y2 <- c(23, 22, 21, 20)
> yLimit <- range(c(y1, y2))
> yLimit
[1] 10 23
> xLimit <- range(c(x1, x2))
> xLimit
[1]  0 14

Here I combined the y values for both data sets and used the range() function to tell me what the range of these values is. Then I did the same thing for the x values in both data sets. Now, if I make the plot, I can make it for each pair of x and y variables, scaling the axes so that both data sets will be displayed on the same figure.

> plot(x1, y1, xlab = "X", ylab = "Y", bty = "n", xlim = xLimit, ylim = yLimit, col = "red")
> par(new=T)
> plot(x2, y2, xlab = "X", ylab = "Y", bty = "n", xlim = xLimit, ylim = yLimit, col = "blue")

Notice how the optional arguments xlim and ylim make sure the axes are scaled correctly (Figure 4.5). I also use bty="n" because I just hate the box that it puts around the plot area by default and this option does not draw any box at all.

As long as you add a par(new=T) between each successive plot () command, you can add as

many plots to the same ﬁgure as you would like.
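For the common case of adding more points to an existing graph, base R also provides the points() function (and lines() for connected lines), which draws onto the current plot without starting a new one. This is a different technique from the par(new=T) approach described above, shown here only as a sketch of an alternative; the data are the same x1, y1, x2, and y2 values used in this section.

```r
x1 <- c(0, 1, 2, 3)
y1 <- c(10, 11, 12, 13)
x2 <- c(11, 12, 13, 14)
y2 <- c(23, 22, 21, 20)

# The axis limits still have to cover both data sets.
xLimit <- range(c(x1, x2))
yLimit <- range(c(y1, y2))

# Draw the first data set with the combined limits...
plot(x1, y1, xlab = "X", ylab = "Y", bty = "n",
     xlim = xLimit, ylim = yLimit, col = "red")

# ...then add the second data set onto the same axes.
points(x2, y2, col = "blue")
```

Because points() reuses the axes that are already drawn, the labels are not typed over each other, but you still need to set xlim and ylim in the first plot() call or the second data set will fall outside the plotting area.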


Figure 4.5: Plot of two variables on the same axis after correcting for the range of each data set.

Saving Images To Disk

While it is rather cool to be able to create rather handsome graphics in R, it is entirely useless if you do not know how to save them for later use. You could take a screenshot of the image and then crop it down a bit but that is not quite the easiest method to use here. Almost all the images in this book were created in R and I was able to save them into a format that made it easy to import them into this document.

R considers the little popup window that shows your graph as a graphics device. Depending upon which platform you are using (e.g., Linux, OSX, Windows), the kinds of output you may be able to produce may change. At present the following types are available:

Device      What receives these graphing commands
bmp         A Windows bitmap device
cairo_pdf   A PDF device based upon the Cairo drawing libraries
jpeg        A JPEG bitmap device
pdf         A PDF file
pictex      A LaTeX graphics command file
png         A PNG bitmap device
postscript  A postscript file
quartz      An OSX graphics window
tiff        A TIFF bitmap device
X11         A graphics window on a system running X-Windows (Unix, some OSX)

Table 4.2: Graphics devices for output of figures

When you type the command plot() a graphics window pops up showing you the image of the figure. What is happening here is that R is looking for the default graphics device and if you have not specified one, then the default value of "show it to the user as a window" is used.

Creating The Plot And Saving To File: This is the method that I used for all the figures in this text. I first created the figure to look the way that I wanted and then I had R copy the figure to a file. You should be aware that when you copy the image, it will only copy the ACTIVE graphics device. If you have more than one graphics window open, only one of them will say ACTIVE in the window title. Be careful of this or you could be copying the wrong figure.

Once you have the graphic the way you like, you can use the dev.copy() command to copy the current graphics device to a file. For this book, I have been saving all the images as JPEG files so I pass the function the device=jpeg option and then specify the name of the file. If you want to save yourself some heartache down the road, use meaningful names for the graphics you create. You can quickly get a lot of different plots that you may want to go through at some time in the future and it sure helps to have them named nicely.

> hist(rpois(1000, 2), xlab = "Counts", ylab = "Frequency", main = "", col = topo.colors(8))
> dev.copy(device = jpeg, file = "ColoredHistogramOfPoissonDistribution.jpeg")
jpeg
   3
> dev.off()
X11cairo
       2

Once the dev.copy() function is finished, you must call the dev.off() function to tell R that you are finished copying things to that particular file and you no longer want to keep it open and ready for subsequent graphing. The output after the dev.off() command shows which graphics device is now active and what kind of device it is (in general, you can ignore this). The image produced from this plot is shown in Figure 4.6.

I also passed the hist() command the optional col=topo.colors(8). The function topo.colors(x) returns x evenly spaced colors from a palette that is used for plotting topo maps. There are other default palettes in R you can use (see ?topo.colors for a list) in coloring parts of your figures. I knew that the hist() function would return 8 bins of data from the rpois(1000,2) distribution (I plotted it first and counted), so I added 8 evenly spaced colors to the plot just to make it look a bit more cheesy.

Plotting Directly To A File: Plotting to a graph window and copying it to a file is not necessarily the only way you can get your graphics saved. You could just write them directly to a file using one of the graphics devices listed in Table 4.2 without looking at it in a window. I find this less appealing since I would like to see what I am plotting before saving it, but if you are chugging through lots of data and creating hundreds of images, perhaps you would be better served to make the plots directly and view them later. At any rate, here is how it is done.

jpeg()
plot(rnorm(1000), xlab = "index", ylab = "value", bty = "n")
dev.off()

With these commands, R will open a jpeg() graphics device. This device is generally a file in the local directory that is named RplotXXX.jpeg (where the XXX values are incremental numbers such as 001, 002, ...). Then when you call the plot() function it sends the plotting commands to the image itself in the file.¹ You can add as many plotting commands as you like and it will continue to send them to the file you specified. When you are done, you can finalize the image by calling dev.off() to turn off the graphics device. To change the default incremental numbering of the files, you can pass a file name to the jpeg() function (or any of the other ones) as we did in the previous section using dev.copy().

Figure 4.6: Image of colored Poisson distribution that was copied from the graphics device to a jpeg file.
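Here is a sketch of plotting straight to a named file. Two things in it are my own choices rather than anything from the text: I use the pdf() device because it is available in every R installation (jpeg() takes a file name in the same way, where JPEG support is present), and I write into R's temporary directory via tempdir() so the example does not clutter your working folder. The file name is likewise just an example.

```r
# Build a meaningful file name in the session's temporary directory.
outFile <- file.path(tempdir(), "RandomNormalScatter.pdf")

# Open the file-based graphics device; no window appears.
pdf(file = outFile)

# Plotting commands now go to the file instead of the screen.
plot(rnorm(1000), xlab = "index", ylab = "value", bty = "n")

# Close the device so the file is finalized on disk.
dev.off()

# Confirm that the file was written.
file.exists(outFile)
```

If you forget the dev.off() call, the file stays open and may be empty or incomplete when you try to view it.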

¹ Actually it keeps them in a buffer and not in the file directly.

4.1.2 What Probability?

The outcome of a statistical analysis is the estimation of a particular test statistic. For example, when you calculate a $\chi^2$ statistic, you need to look up the probability that a value as large or larger than the observed one is expected to occur. In 4.1.1 we determined how to calculate the cutoff value from a particular distribution given a specified Type I error rate (the α value). Here we are not interested in asking whether our calculated value exceeds some particular cutoff; rather, we are interested in understanding the probability of observing a value as large or larger than the one we see.

In keeping with the current examples from the $\chi^2$ statistic, we can determine the probability associated with a particular estimate of $\chi^2_{Calc}$ by using the distribution function pchisq(). The arguments to pchisq() are almost identical to those for the qchisq() function discussed in 4.1.1, with the exception that we do not pass it $1 - \alpha$ as the first parameter; rather, we pass it the estimated $\chi^2_{Calc}$ value and it returns the answer in terms of $P[X \le x]$. The probability of a value as large or larger than the observed one is therefore one minus the returned value. For example:

> chiCritAt0.05 <- qchisq( 0.95, 1 )

> pchisq( chiCritAt0.05, 1 )

[1] 0.95

> pchisq( 7.23, 3 )

[1] 0.9350828

The functions qchisq() and pchisq() give us opposite answers from each other: qchisq() takes a cumulative probability, $P[X \le x]$, and tells us the corresponding critical value, whereas pchisq() takes a value of $\chi^2$ and tells us the cumulative area under the curve up to and including that point.
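As a minimal sketch of how the two functions fit together (the names chiCalc and dfCalc are illustrative and do not come from the text), the upper-tail probability can be obtained by passing lower.tail = FALSE to pchisq(), and qchisq() can be used to recover the original value:

```r
# Illustrative values; chiCalc and dfCalc are hypothetical names
chiCalc <- 7.23
dfCalc <- 3

pLower <- pchisq(chiCalc, dfCalc)                      # P[X <= x]
pUpper <- pchisq(chiCalc, dfCalc, lower.tail = FALSE)  # P[X > x]

# qchisq() undoes pchisq(): feeding the probability back returns the value
backAgain <- qchisq(pLower, dfCalc)

print(c(lower = pLower, upper = pUpper, roundTrip = backAgain))
```

The two tail probabilities sum to one, so either form can be used; lower.tail = FALSE simply avoids the subtraction.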

4.2 Random Number Generation

There are often times when you need to generate some random numbers (playing poker, picking lottery numbers, etc.). Random numbers can be drawn from any of the distributions that are in R using the corresponding r-prefixed function (rnorm(), rpois(), and so on). For example, to draw random numbers from a normal distribution ($N(\mu, \sigma)$) you would call the rnorm(n, mean, sd) function, where n is the number of values to draw. The parameters µ and σ (the mean and sd arguments) signify the mean and standard deviation of the distribution from which you are drawing. For an example of how these influence the outcome, check out Figure 4.7.

There are a large number of random number distributions that you can run across.

Below are some commonly encountered ones:

Normal The normal distribution has a density function of $P(x|\mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$.

Exponential The exponential distribution has a continuous density function of $P(x|\lambda) = \lambda e^{-\lambda x}$ for $x \ge 0$; its cumulative distribution function is $1 - e^{-\lambda x}$.

Poisson The Poisson distribution is a discrete distribution whose density function is $P(k|\lambda) = \frac{e^{-\lambda} \lambda^k}{k!}$.

Later in the Exercises you will get to use some of these distributions.
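A short sketch of drawing from each of the three distributions listed above. The sample sizes, parameter values, and variable names are arbitrary choices for illustration, and set.seed() is used only so that the draws can be repeated:

```r
set.seed(42)  # fix the seed so the draws are reproducible

normDraws <- rnorm(n = 1000, mean = 5, sd = 5)
expDraws  <- rexp(n = 1000, rate = 2)
poisDraws <- rpois(n = 1000, lambda = 5)

# Sample means should land near the theoretical means: 5, 1/rate = 0.5, and 5
print(c(normal = mean(normDraws),
        exponential = mean(expDraws),
        poisson = mean(poisDraws)))
```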

Figure 4.7: Examples of the densities of two normal distributions; the red one is drawn from a normal distribution with the default values of µ = 0 and σ = 1, and the blue one has µ = σ = 5.

Histograms

A histogram is a graphical display of data that has been tallied into bins (e.g., specific buckets). How you define the bucket locations and sizes is up to you. You can specify that there should be a specific number of buckets and R will make them equal sized, or you can define the ranges yourself. The function signature for the hist() function can be seen by typing ?hist in R:

hist(x, breaks = "Sturges",

     freq = NULL, probability = !freq,

     include.lowest = TRUE, right = TRUE,

     density = NULL, angle = 45, col = NULL, border = NULL,

     main = paste("Histogram of", xname),

     xlim = range(breaks), ylim = NULL,

     xlab = xname, ylab,

     axes = TRUE, plot = TRUE, labels = FALSE,

     nclass = NULL, ...)

There are several things we should notice about this function signature. First, this is the first time that we've looked into a particular function and seen all the options. You can see that several of the parameters are given what we call default values (e.g., the =VALUE portions). That way, if we do not provide a particular value for a parameter such as main, it will fill it in for you.
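The two ways of choosing buckets described above can be sketched with the breaks argument. This is only an illustration: the data, the bin width of 0.5, and the variable names are assumptions, and plot = FALSE is used so that the bin edges and counts can be inspected without drawing anything.

```r
set.seed(1)
x <- rnorm(100)

# Ask for roughly 5 equal-sized bins (R treats a single number as a suggestion)
hSuggested <- hist(x, breaks = 5, plot = FALSE)

# Define the bin edges ourselves; they must span the full range of the data
edges <- seq(floor(min(x)), ceiling(max(x)), by = 0.5)
hDefined <- hist(x, breaks = edges, plot = FALSE)

print(hSuggested$breaks)
print(hDefined$breaks)
print(hDefined$counts)
```

Every observation lands in exactly one bin, so the counts always add up to the number of values passed in.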


The first thing that you typically want to change in a graphic is the default values for the axis labels and the title of the graph. It is not commonly accepted practice to provide titles on graphs for most publication-quality graphics, but sometimes it is helpful when you are putting together a talk or just analyzing the data and making graphics for your own interpretation. To change the default values of the axis labels and set an empty title you would do the following (shown in Figure 4.8):

> hist( rnorm(100), xlab="My Defined Bin Categories", ylab="Frequency", main="" )

Figure 4.8: Histogram with labels and main title changed.

Again, I am using the function rnorm() to generate the data from a random normal distribution here. It is perfectly OK to give empty values to things like titles and such.

Density Plots

A density plot is one where the probability density is calculated and turned into a line

across the domain rather than a histogram. Here I will combine the histogram and

density plots to show how to overlay two graphs on the same values.

> data <- rpois( lambda=5, n=1000 )

> den <- density( data )

> den

Call:

density.default(x = data)

Data: y (1000 obs.); Bandwidth 'bw' = 0.5061

       x                y

 Min.   :-1.518   Min.   :3.567e-05

 1st Qu.: 2.491   1st Qu.:8.145e-03

 Median : 6.500   Median :3.973e-02

 Mean   : 6.500   Mean   :6.229e-02

 3rd Qu.:10.509   3rd Qu.:1.219e-01

 Max.   :14.518   Max.   :1.689e-01

> yrange <- range( den$y )

> xrange <- range( den$x )

> hist( data, ylim=yrange, xlim=xrange, xlab="Value of Random Poisson",

+ ylab="Frequency", main="", probability=T, bty="n" )

> par( new=T )

> plot( den, col="red", lwd=2, xlab="", ylab="", main="", bty="n" )

Figure 4.9: Histogram of 1000 random numbers drawn from a Poisson distribution with the λ

parameter set to 5. The red line indicates the density of the values.

There are some things to point out with this plot.

1. I save the values of data as a variable because I needed to plot the same set of

random variables as a histogram and as a density plot. Had I not saved them, I

would be using a different collection of random numbers for each plot and they

wouldn’t match.

2. I used the function density() to calculate the probability density function for the values of data. The object returned by density() has two components, an x variable and a y variable. The probability density is calculated as a probability rather than as a frequency count, which is why the histogram was drawn with probability=T.
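An alternative way to build the same overlay, sketched here under the assumption that a single plotting region is acceptable, is to draw the histogram first and then add the density curve with lines(). This avoids par(new=T), so both layers are guaranteed to share the same axes. The seed, the axis label "Density", and the helper variable top are illustrative choices.

```r
set.seed(1)
data <- rpois(n = 1000, lambda = 5)
den <- density(data)

# Tallest point of either layer, so nothing is cut off at the top
top <- max(den$y, hist(data, plot = FALSE)$density)

# Draw the histogram on the probability scale, then add the density line
hist(data, probability = TRUE, xlim = range(den$x), ylim = c(0, top),
     xlab = "Value of Random Poisson", ylab = "Density", main = "")
lines(den, col = "red", lwd = 2)

# The area under the estimated density curve should be close to 1
area <- sum(den$y) * diff(den$x[1:2])
print(area)
```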


4.3 Descriptive Statistics

Descriptive statistics are valuable tools in understanding particular patterns in your data. For the purposes of this section, we will assume that the experiments producing your data yield one of two different data types. First, observations from your data could be considered random variables: measurements that produce a real number. Examples of random variables may be body size, dissolved oxygen, available light, etc. A collection of random variables will be denoted as $X$ with elements $x_i$; $i = 1 \ldots N$ (e.g., indexing across all N individual observations). The other kind of data we will be examining here are categorical data. Your observations are grouped into distinct categories and consist of relative counts of each category. Examples of this include stage-dependent demographic tallies, gender of your study organisms, some types of genetic data, disease prevalence, etc. Categorical data will be denoted as $Y$, consisting of K categories, and the number of counts observed in each category will be referred to as $y_i$; $i = 1 \ldots K$.

There are two general properties of random variables that we will spend a little time discussing because they form the basis of how we examine our data. First, the mean of a random variable, usually denoted by the symbol µ, is a measure of the central tendency of your variable (a center of gravity, so to speak). We are all familiar with the concept of the mean, but in a general sense the mean is just one of several moments of a distribution. We now turn to this particular moment and then discuss some of the "higher moments."

4.3.1 Moments

There are several properties of random variables that we may be interested in estimating. Notice that here I used the term estimate rather than compute; this is on purpose. We will be making estimates of real parameters of the data, and we do so because in most cases we do not have all the data at our disposal. Rather, we have created a sample of our data from which we make inferences. To get all the data, we would have to sample EVERY single instance out there, and in most cases this is not possible.

There are two common properties that you will probably recognize immediately (I hope) and use all the time. These are the mean and variance of the data and are estimated in R using the functions mean() and var(). Figure 4.10 shows what is being measured by these estimators. This figure was created using the density() function from rnorm(1000000).

The mean, shown by the dashed line and the symbol µ, is located at the center of gravity of the data. In R, you can calculate the mean of the data by using the function mean(). The image also shows the standard deviation (which is the square root of the variance, $\sigma = \sqrt{\sigma^2}$) as indicated by the dotted line. R has a function for both the variance, var(), and the standard deviation, sd().
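A brief sketch of these three estimators on simulated data. The seed, the sample size, and the variable names are arbitrary; the hand-written variance is included only to show what var() is estimating.

```r
set.seed(10)
x <- rnorm(1000, mean = 0, sd = 1)

xMean <- mean(x)
xVar  <- var(x)   # sample variance, divides by N - 1
xSd   <- sd(x)    # square root of the sample variance

# The same variance written out from its definition
xVarByHand <- sum((x - xMean)^2) / (length(x) - 1)

print(c(mean = xMean, var = xVar, sd = xSd, varByHand = xVarByHand))
```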

Figure 4.10: Example locations for the first two moments of a Normal (N(0, 1)) distribution.

There are two more measures of distributions that we should discuss while we are here: the skew and kurtosis of the distribution. (Actually, all four of these measures are tied to the first four moments of the distribution. The mean is the first moment, $E[X]$, and the higher moments, $\mu_k$ for $k = 2 \ldots 4$, are calculated about the mean as $\mu_k = E[(X - \mu)^k]$.) In R these functions are not loaded into memory by default and we must load the moments library to gain access to them. To load this library type:

> library( moments )

If R gives you an error, this means that the moments library is not installed. In this case, see Appendix B for instructions on how to add libraries to your installation of R.

The skew of a distribution is a measure of how "pushed-over" the main lump of the distribution is (again, not a very statistical definition here). Distributions can have either a positive or a negative skew; compare the images in Figure 4.11.

A distribution is said to have a negative skew if the direction of the longer tail is to the left. In these cases it is typically true that mean < median < mode. Conversely, a distribution has a positive skew if the longer tail is on the right, and typically mean > median > mode. Distributions where these measures are equal are said to not have any skew. Skew is estimated in R using the function skewness().
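For readers without the moments library at hand, the idea can be sketched in base R. The helper skewByHand() below is hypothetical (it is not part of R or of the moments library); it uses the moment-based definition, the third central moment divided by the second central moment raised to the 3/2 power, and is assumed rather than verified to give values comparable to skewness().

```r
# Hypothetical helper: moment-based skewness, m3 / m2^(3/2)
skewByHand <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^(3/2)
}

set.seed(3)
rightTailed <- rexp(10000)   # long tail to the right
leftTailed  <- -rightTailed  # mirror image, long tail to the left

print(skewByHand(rightTailed))   # positive skew
print(skewByHand(leftTailed))    # negative skew

# In the right-tailed sample the mean is pulled toward the tail
print(mean(rightTailed) > median(rightTailed))
```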

Figure 4.11: Negatively (left) and positively (right) skewed distributions. In both of these examples the dotted line connects the mode of the distribution (the top peak) to the mean (on the x axis). The direction of this lean determines if the distribution has a negative (left) or positive (right) skew.

The kurtosis of a distribution is a measure of the "peakedness" of a distribution. This term comes from the Greek word kurtos that means 'bulging.' A simple example of how kurtosis looks is found in Figure 4.12 with three different distributions (the normal, logistic, and uniform), each with a different level of kurtosis.

In general, the formula for kurtosis is:

$$K = \frac{\mu_4}{\sigma^4} - 3$$

The correction factor (the −3 part of the equation) is a normalizing constant that allows the kurtosis of a normal distribution to be equal to zero. Below are the raw data and the kurtosis estimates used in producing Figure 4.12.

> normData <- rnorm(100000)

> logisticData <- rlogis(100000)

> unifData <- runif(100000)

> kurtosis( normData ) - 3

[1] -0.02320046

> kurtosis( logisticData ) - 3

[1] 1.219505

> kurtosis( unifData ) - 3

[1] -1.197009

Figure 4.12: Three distributions (uniform, normal, and logistic) showing different levels of kurtosis.

The discrepancy here in the estimates, with the normal distribution not quite equal to zero, arises because the data were created by drawing random numbers rather than by specifying the distribution directly. One benefit of the −3 correction factor is that it allows you to quickly tell the different types of kurtosis by looking at the value of the estimate. In general, the following types of kurtosis are recognized:

Platykurtic Curves that have negative excess kurtosis (e.g., kurtosis() − 3 < 0).

Mesokurtic Curves that do not have excess kurtosis (e.g., kurtosis() − 3 = 0).

Leptokurtic Curves that have positive excess kurtosis (e.g., kurtosis() − 3 > 0).
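The three categories can be sketched in base R as follows. Both helpers are hypothetical: excessKurtosis() applies the formula $\mu_4/\sigma^4 - 3$ to sample moments, and classifyKurtosis() uses an arbitrary tolerance of 0.1 (an assumption made here, not a standard cutoff) because an estimate from random data is never exactly zero.

```r
# Hypothetical helper: excess kurtosis, m4 / m2^2 - 3
excessKurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

# Hypothetical helper: label an estimate, treating |k| <= tol as "no excess"
classifyKurtosis <- function(k, tol = 0.1) {
  if (k < -tol) "platykurtic" else if (k > tol) "leptokurtic" else "mesokurtic"
}

set.seed(7)
samples <- list(uniform  = runif(100000),
                normal   = rnorm(100000),
                logistic = rlogis(100000))

k <- sapply(samples, excessKurtosis)
classes <- sapply(k, classifyKurtosis)

print(round(k, 2))
print(classes)
```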

The last summary statistic we will cover here is the range(), which returns a two-item vector containing the minimum and maximum values. In fact, the range() function calls the min() and max() directly. There is little to discuss about this particular set of functions...

Creating a matrix of Plots

It is often desirable to create more than one plot on a graphic, but not overlaid on top of each other as was explained in Section 4.1.1. To do this, we need to adjust one of the graphics properties using the function par(). The property we need to change is mfrow=c(nr,nc). This will create a matrix of plots that has nr rows and nc columns.

An example of creating a matrix of plots is given in the code below and depicted in Figure 4.13.

> par( mfrow=c(2,2) )

> hist( rnorm(100000) )

> hist( rpois(100000,1) )

> hist( rexp(100000) )

> hist( rlogis(100000) )

Figure 4.13: Matrix of four plots created from random numbers sampled from the normal, Poisson, exponential, and logistic distributions.

Subsequent calls to plotting functions will "reuse" this graphic figure and replot the graphs in the nr x nc matrix. This graphic window will have the nr x nc matrix of plots until it is either closed or you change the mfrow property to something else.
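One way to avoid being stuck with the matrix layout, sketched below, is to keep the value that par() returns and hand it back when you are done. The 1 x 2 layout, the titles, and the variable names are illustrative choices.

```r
# par() returns the previous settings, so they can be restored later
oldPar <- par(mfrow = c(1, 2))

hist(rnorm(1000), main = "Normal")
hist(rexp(1000), main = "Exponential")

duringLayout <- par("mfrow")   # the 1 x 2 layout is active here

par(oldPar)                    # back to the previous layout
afterLayout <- par("mfrow")

print(duringLayout)
print(afterLayout)
```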

4.3.2 Non-Parametric Parameters

Non-parametric statistics are generally concerned with the analysis of data without making assumptions about the underlying statistical distributions. There are several commonly known non-parametric statistics such as the Binomial Test, Goodness of Fit, the Mann-Whitney Test, and the Kruskal-Wallis test. In this section, we will explore some of the methods that R can use to describe data without assuming an underlying distribution.

The first summary statistic outlined here will be the quantile. While you have probably not heard of this particular descriptive statistic, you most likely will have run across terms such as a median, quartile, or percentile. All of these are particular kinds of quantiles, which will be obvious when we consider the formal definition of a quantile.

Quantile The $p^{th}$ quantile is the value $x_p$ such that, when considering the data ($X$), the probability $P(X < x_p) \le p$ and the probability $P(X > x_p) \le 1 - p$.

While this may be statsy, it generally says that the $50^{th}$ quantile is the value $x_{50}$ in the distribution where 50% of the data is less than $x_{50}$ and 50% is greater than $x_{50}$. Thus far, you have probably called this the median (and R has a median() function if you like to call it that). More generally, we can consider the $95^{th}$ quantile analogous to what we were discussing in Section 4.1.1 when we were trying to figure out critical regions of the $\chi^2$ distribution. The main distinction is that in Section 4.1.1 we implicitly used the known distributional form of the $\chi^2$ function to find the critical value, whereas in non-parametric approaches we typically put everything into a vector, sort it, and count to where the quantile is located in the list. As a result, the $50^{th}$ quantile (or median) can be considered a measure of central tendency of the sorted data.
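The sort-and-count idea can be sketched directly. The helper quantileByCounting() is hypothetical and deliberately simple; it is compared against quantile() with type = 1, which requests a rule without interpolation (the default, type = 7, interpolates between neighboring values and so can differ slightly).

```r
set.seed(5)
x <- rpois(1000, 5)

# Hypothetical helper: sort the data and count up to the p-th position
quantileByCounting <- function(x, p) {
  sorted <- sort(x)
  position <- max(1, ceiling(p * length(sorted)))
  sorted[position]
}

probs <- c(0.25, 0.50, 0.75)
byCounting <- sapply(probs, function(p) quantileByCounting(x, p))

# type = 1 asks quantile() for the same "no interpolation" rule
builtIn <- unname(quantile(x, probs, type = 1))

print(rbind(byCounting, builtIn))
```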

Quantiles can also be used to look at the dispersion of data. In parametric statistics we discussed parameters such as the variance and standard deviation that define the dispersion of values around the mean. The notion of quantiles can be used in a similar way. The values of x that give the lower and upper quartiles (e.g., the $25^{th}$ and $75^{th}$ quantiles) provide a range of the data $X$ where the inner 50% of the values lie. These are often called the inner quartiles of the data (the interquartile range). To illustrate the use of the quantile function, consider the data in Figure 4.14 consisting of 1000 numbers drawn from a Poisson random distribution with the parameter λ = 5.

The quantile() function in R by default provides the $0^{th}$ quantile (e.g., the minimum), the $25^{th}$ quantile, the $50^{th}$ quantile (the median), the $75^{th}$ quantile, and the $100^{th}$ quantile (e.g., the maximum). For the data that produced the histogram in Figure 4.14, the quantiles are:

> x <- rpois(1000,5)

> quantile( x )

  0%  25%  50%  75% 100%

   0    3    5    6   12

showing that the center of dispersion is 5 and the inner quartile ranges from 3 to 6.
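Quantiles other than the default five can be requested with the probs argument. The sketch below (seed and variable names are arbitrary) asks for the deciles and the two quartiles, and compares the width of the inner 50% with base R's IQR() function.

```r
set.seed(5)
x <- rpois(1000, 5)

# Ask for specific quantiles with the probs argument
deciles <- quantile(x, probs = seq(0.1, 0.9, by = 0.1))
quartiles <- quantile(x, probs = c(0.25, 0.75))

# Width of the range holding the inner 50% of the data
innerWidth <- unname(diff(quartiles))

print(deciles)
print(quartiles)
print(innerWidth)
print(IQR(x))       # base R's built-in for the same width
print(median(x))
```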

4.4 Relationships Between Pairs of Variables

There are often times when we are interested in knowing about the simultaneous changes in two or more variables. Individually, we can estimate the mean, variance, skew, kurtosis, and various ranges, but this does not tell us about how the variables interact together. For this we need to look at measures that explain the relationship between variables.

Figure 4.14: Distribution of random numbers drawn from rpois(1000,5).

4.4.1 Covariance & Correlation

The covariance of two variables is defined as:

$$c_{ij} = E[(X - \mu_X)(Y - \mu_Y)]$$

and measures the degree to which one variable X changes as another Y changes. Covariance estimates may be positive or negative as long as the two variables are not the same, in which case it is a variance and there is no such thing as a negative variance. Two variables that have a covariance equal to zero are said to be uncorrelated (although if you don't know what a correlation is this moniker is kinda sucky).

In R the covariance between two vectors of values is estimated by the function cov(). Needless to say, the length of the two variables must be the same or R will rightly complain.

> X <- c(1,34,5,23,6,43,56,28,33,7)

> Y <- runif(10,1,100)

> Y

 [1] 90.112843 47.236585 17.148708  3.861546 54.871332 57.234582  8.072745

 [8]  6.000811 84.546069 17.960688

> plot( X, Y )

> cov( X, Y )

[1] -104.1568

Figure 4.15: Scatter plot of some semi-random points.

So here I just pounded on my numeric keypad and made up the numbers for X (not quite random but pretty good) and then had R make some numbers for Y by drawing from a uniform distribution with runif(), selecting 10 values in the range 1 → 100. You can see that the values that I used produced a smattering of points (Figure 4.15).
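To connect the definition to cov(), the estimate can be written out by hand. This is a sketch: the Y values here come from a fresh seeded draw (so they differ from the ones printed in the text), and covByHand is an illustrative name.

```r
X <- c(1, 34, 5, 23, 6, 43, 56, 28, 33, 7)
set.seed(15)
Y <- runif(10, 1, 100)

# Sample covariance written out from the definition (divides by N - 1)
covByHand <- sum((X - mean(X)) * (Y - mean(Y))) / (length(X) - 1)

print(covByHand)
print(cov(X, Y))

# The covariance of a variable with itself is its variance
print(cov(X, X))
print(var(X))
```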

4.4.2 Tests For Correlation

There are parametric and non-parametric methods for looking at the relationship among pairs of variables. In general, all correlations between two random variables (X, Y) should have the following characteristics:

• The value of a correlation is strictly bounded on the interval [−1, 1].

• If larger values of X tend to be associated with larger values of Y then the correlation should approach +1 as the association becomes stronger. We call this a positive correlation.

• If smaller values of X tend to be associated with larger values of Y then the correlation should approach −1 as the association becomes stronger. We call this a negative correlation.

• If there is no general relation between the variables X and Y then the correlation statistic should approach 0. We call this a relationship where the variables are uncorrelated.

The most commonly used measure of correlation is Pearson's product moment correlation, r, that is calculated as:

$$r = \frac{\sum_{i=1}^{N} (X_i - \bar{x})(Y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (X_i - \bar{x})^2 \sum_{i=1}^{N} (Y_i - \bar{y})^2}} \qquad (4.1)$$

where the $\bar{x}$ and $\bar{y}$ values are the means of the N sampled values in X and Y.
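Pearson's r can be computed term by term from its definition and checked against the base R function cor(), which uses Pearson's method by default. The six data points below are made up purely for illustration, and dx, dy, and rByHand are illustrative names.

```r
# Hypothetical illustrative data, not taken from the text
X <- c(2, 4, 6, 8, 10, 12)
Y <- c(1, 3, 2, 7, 8, 11)

dx <- X - mean(X)   # deviations of X from its mean
dy <- Y - mean(Y)   # deviations of Y from its mean

rByHand <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

print(rByHand)
print(cor(X, Y))    # cor() uses Pearson's method by default
```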

Figure 4.16: Example plot of two variables used to test correlations.


In R the test for correlation is performed with the cor.test() function. To demonstrate, we will use the following data shown in Figure 4.16:

> X <- 1:20

> Y <- c(-17, 7, -12, 12, -4, 11, 10, -2, 35, 31, 34, 49, 27, 33, 45, 32, 36, 38, 58, 44)

> cor.test( X, Y )

Pearson's product-moment correlation

data: X and Y

t = 7.3194, df = 18, p-value = 8.489e-07

alternative hypothesis: true correlation is not equal to 0

95 percent confidence interval:

0.6848344 0.9456427

sample estimates:

cor

0.8651642

The correlation between these two variables is r = 0.865, which is both large and positive, as expected from looking at the graph. By default, cor.test() uses the Pearson product moment approach. There are two additional approaches for estimating correlation, developed by Spearman and Kendall, but these are considered non-parametric methods based upon ranks rather than upon Eqn. 4.1 and will be left until 5.2.1, when we can fully discuss how they work. The output also includes a significance test and a display of the 95% confidence interval, both of which are very useful.
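The object returned by cor.test() can also be stored and picked apart, which is handy when only one number is needed. The sketch below reuses the X and Y values from the example; the names result, r, degreesFreedom, and tByHand are illustrative. The t statistic is rebuilt from r using the standard relationship $t = r\sqrt{df/(1 - r^2)}$, and a rank-based test is requested through the method argument.

```r
X <- 1:20
Y <- c(-17, 7, -12, 12, -4, 11, 10, -2, 35, 31, 34, 49, 27, 33, 45, 32,
       36, 38, 58, 44)

result <- cor.test(X, Y)

r <- unname(result$estimate)
degreesFreedom <- unname(result$parameter)

# The t statistic can be rebuilt from r and the degrees of freedom
tByHand <- r * sqrt(degreesFreedom / (1 - r^2))

print(r)
print(result$conf.int)
print(result$p.value)
print(tByHand)

# Rank-based alternative mentioned in the text
spearman <- cor.test(X, Y, method = "spearman", exact = FALSE)
print(unname(spearman$estimate))
```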


4.5 Useful Functions

The following functions were introduced in this chapter and you will be required to use them for the exercises. To get more information on any of these functions, use the R help system.

• dchisq(x,df) Returns the density of the $\chi^2$ distribution with df degrees of freedom at x.

• df(x,df1,df2) Returns the density of the F distribution with df1 and df2 degrees of freedom at x.

• dnorm(x) Returns the density of a normal distribution at x.

• mean(x) Calculates the mean of the values in x.

• pchisq(x,df) Returns the cumulative probability, $P[X \le x]$, of the $\chi^2$ distribution with df degrees of freedom.

• pf(x,df1,df2) Returns the cumulative probability, $P[X \le x]$, of the F distribution with df1 and df2 degrees of freedom.

• plot(x) This is the main wrapper function that creates a graphical display of the variable(s) that you pass to it. Depending upon the variables passed, it will create different types of plots.

• pnorm(x) Returns the cumulative probability, $P[X \le x]$, of a normal distribution at x.

• qchisq(x,df) Returns the quantile of the $\chi^2$ distribution with df degrees of freedom for the cumulative probability x.

• qf(x,df1,df2) Returns the quantile of the F distribution with df1 and df2 degrees of freedom for the cumulative probability x.

• qnorm(x) Returns the quantile of a normal distribution for the cumulative probability x.

• rchisq(x,df) Returns x random numbers from the $\chi^2$ distribution with df degrees of freedom.

• rf(x,df1,df2) Returns x random numbers from the F distribution with df1 and df2 degrees of freedom.

• rnorm(x) Returns x random numbers from the normal distribution.

• sd(x) Returns the sample standard deviation of data in x.

• table(f) This function takes the list of levels in the factor f and makes a table from it.

• var(x) Estimates the sample variance, $s^2$, from the values in x.
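The d, p, q, and r prefixes follow the same pattern for every distribution in this list. A compact sketch using the standard normal (the variable names and the seed are illustrative):

```r
# Density: height of the standard normal curve at 0
heightAtZero <- dnorm(0)

# Distribution: cumulative probability P[X <= 1.96]
cumulative <- pnorm(1.96)

# Quantile: the value below which 97.5% of the distribution lies
cutoff <- qnorm(0.975)

# Random: five draws from the standard normal
set.seed(99)
draws <- rnorm(5)

print(c(heightAtZero = heightAtZero, cumulative = cumulative, cutoff = cutoff))
print(draws)
```

Swapping norm for chisq or f (and supplying the degrees of freedom) gives the corresponding functions for those distributions.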


4.6 Exercises

The following exercises are meant to help you understand the items presented in this Chapter.

1. What are the critical values for a $\chi^2$ distribution with df = 8 if you are assuming that α = [0.2, 0.1, 0.01, 0.001]?

2. Create a scatter plot using the variables x <- rnorm(10) and y <- rpois(10,1). Label the axes "Jaw Size" and "Number of Kids".

3. For the probabilities p = seq(0.1,0.9,by=.1) create a graph that has a red line representing the quantile function for the Poisson distribution (qpois with λ = 1) and a blue one representing the quantile function for the $\chi^2$ distribution (qchisq with df = 1). Make sure to have your axes labeled and drawn properly. Save the image and include it in your answer.

4. In a Platykurtic distribution what is the relationship between the mean, mode, and median?

5. Create a histogram of 1000 random numbers drawn from the F-distribution with parameters df1 = 1 & df2 = 10. On this plot, overlay the density using the density function. Label the axes appropriately.

6. What is the inner-quartile of the data x <- rnorm(200,3)?

7. Is the data from the command x <- rf(1000,1,10) lepto-, meso-, or platykurtic? How do you know?

8. Explain what is happening with the command data <- LETTERS[ rpois(23, 2) ]. Create a new variable that is a table of the results of this command, show me the table, and show how you would access the "B" element in the table.

9. What is the range of possible values you can get for a Pearson's Product-Moment Correlation?

10. There is a data set named HWCorrelationData.csv in the folder. Load this data into R, plot it using an appropriate graphic, and then test the hypothesis $H_O$: Height is independent of Weight.


Chapter 5

Contingency Tables

In this chapter we will examine non-parametric methodologies that are available for the analysis of random variables. It is not uncommon in Biology to encounter the notion that non-parametric approaches are only to be used with categorical (e.g., nominal) data. However, non-parametric analyses are just as applicable to the ordinal and interval data that we commonly come into contact with. In this Chapter we will go over a few examples of how you can use general non-parametric statistical approaches in your own research.

In this Chapter you will learn the following skills:

• Non-parametric analysis of a single categorical data set $(x_1, x_2, \ldots, x_N)$ using a $\chi^2$ test.

• Non-parametric analysis of paired data $((x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N))$ using the Fisher Exact test for small data sets and the general $\chi^2$ test for large data sets.

• Non-parametric analysis of several random samples using the Kruskal-Wallis test.

For most of the exercises in this chapter you will need the stats library, which can be loaded by issuing the command library(stats). (In a default R session this library is already attached, so the call is harmless.)

5.1 One Random Sample

For this section, we will assume that your data consist of N observations made on a single variable, $X = [x_1, x_2, \ldots, x_N]$.

5.1.1 Goodness of Fit

The $\chi^2$ test for goodness of fit is the typical $\chi^2$ test that we have all had a million times as an undergraduate and a graduate student. The data for this test consist of N observations that can be categorized into K discrete categories. In R we will use the factor data type (see 2.4.10 for more on the factor type).


The assumptions of this test are:

1. All the observations are selected randomly.

2. You can assign an observation to one of the K categories without error.

The test statistic for this analysis is the calculated $\chi^2_{Calc}$, which is:

$$\chi^2_{Calc} = \sum_{i=1}^{K} \frac{(O_i - E_i)^2}{E_i} \qquad (5.1)$$

where $O_i$ and $E_i$ are the observed and expected counts in the $i^{th}$ category. The underlying distribution of $\chi^2_{Calc}$ will be approximated using the $\chi^2$-distribution with K − 1 degrees of freedom. From the discussion of this distribution and its depiction in Figure 4.2, it is large values of $\chi^2_{Calc}$ that will lead to the rejection of the null hypothesis, $H_O$.

Example Problem: Assume that we have captured a sample of the Marbled Salamander, Ambystoma opacum, from the Rice Center for Environmental Studies (a field station for Virginia Commonwealth University). On each of these individuals we have classified their marbling pattern as Little White ($N_A = 24$), Moderate White Marbling ($N_B = 47$), or Mostly White ($N_C = 29$). A separate crossing experiment has suggested that the marbling on an individual may be under the control of a limited number of genetic loci and has predicted that the frequency of these types would be 1 : 2 : 1 in populations at equilibrium. Do the proposed mechanisms predict the distribution of phenotypes that you sampled from the wild? To test the hypothesis $H_O$: Phenotypes occur at a ratio of 1 : 2 : 1, in R we would:

> Phenotypes <- as.factor( c( rep("Little White", 24),

+ rep("Marbled", 47), rep("Mostly White", 29) ) )

> p <- c( 1, 2, 1 )

> p <- p / sum( p ) # makes p a vector of probabilities

> table( Phenotypes )

Phenotypes

Little White      Marbled Mostly White

          24           47           29

> chisq.test( table( Phenotypes ), p = p )

Chi-squared test for given probabilities

data: table(Phenotypes)

X-squared = 0.86, df = 2, p-value = 0.6505

So here, the observed and expected values were relatively close to each other, producing a $\chi^2_{Calc}$ (in R called "X-squared") of 0.86, which with df = 2 has a P-value of 0.6505. This is not something that would be considered rare. As a result, we fail to reject $H_O$ that the ratio of phenotypes is 1 : 2 : 1.

Here the thing that was passed to the chisq.test function was an object of class table. This is only one way that you can pass data to the chisq.test function. See ?chisq.test for more information on other ways to pass your data to this function.
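The same result can be rebuilt by hand from Eqn. 5.1, which is a useful check on what chisq.test() is doing. In this sketch the counts are taken from the example, while the variable names (observed, expected, chiCalc, dfCalc, pValue) are illustrative.

```r
observed <- c(24, 47, 29)          # counts from the example
p <- c(1, 2, 1) / 4                # hypothesized 1 : 2 : 1 ratio
expected <- sum(observed) * p      # 25, 50, 25

chiCalc <- sum((observed - expected)^2 / expected)
dfCalc <- length(observed) - 1
pValue <- pchisq(chiCalc, dfCalc, lower.tail = FALSE)

print(c(chiCalc = chiCalc, df = dfCalc, pValue = pValue))

# chisq.test() also accepts the raw vector of counts
fromR <- chisq.test(observed, p = p)
print(fromR$statistic)
```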


5.1.2 Binomial Test

The binomial test evaluates the support for the probability (p) that an observation was categorized into one of two groups. The following assumptions are inherent in the binomial test:

1. Each observation can be characterized as either Category A or Category B, and the probability of assigning it to A is denoted as p (and to B as 1 − p).

2. Each of the N observations is mutually independent.

The binomial test checks whether the number of items you have classified as Category A is rare given a specified probability, p. The test itself is performed using the binom.test() function. In the example below, I am considering the situation where a coin was flipped 20 times and was found to have shown Heads only six times. The hypothesis is $H_O$: p = 0.5. The function itself needs a few pieces of data: the number of times Category A was observed (as x), the total number of trials (as n), and the hypothesized probability p. Calling it with these data would be done as:

> binom.test( x=6, n=20, p=0.5 )

Exact binomial test

data: 6 and 20

number of successes = 6, number of trials = 20, p-value = 0.1153

alternative hypothesis: true probability of success is not equal to 0.5

95 percent confidence interval:

0.1189316 0.5427892

sample estimates:

probability of success

0.3

These results suggest that even with only 6 observed Heads in 20 flips, we cannot reject $H_O$ that it is a fair coin. However, the 95% confidence interval shows that there is a large range of values we cannot reject...
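Where the reported P-value comes from can be sketched with pbinom(). The doubling step below relies on the distribution being symmetric when p = 0.5; for other values of p the two-sided calculation is less simple, so treat this as an illustration for the fair-coin case only. The variable names are illustrative.

```r
heads <- 6
flips <- 20
pNull <- 0.5

# Probability of 6 or fewer heads under the null hypothesis
lowerTail <- pbinom(heads, flips, pNull)

# With p = 0.5 the distribution is symmetric, so double the tail
pByHand <- min(1, 2 * lowerTail)

fromR <- binom.test(x = heads, n = flips, p = pNull)

print(pByHand)
print(fromR$p.value)
print(fromR$conf.int)
```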

5.1.3 General Contingency Tables

For this next application of contingency tables we will focus on data describing the diversity of students in the College of Humanities & Sciences at Virginia Commonwealth University. These data are reported by all public institutions and can be found for VCU at the webpage http://www.vcu.edu/cie/analysis/reports/sets.html and are summarized in Table 5.1.

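In general, we are going to create a contingency table that has the general form:

| | Col 1 | Col 2 | Col 3 | $\cdots$ | Col c | Totals |
|---|---|---|---|---|---|---|
| Row 1 | $O_{11}$ | $O_{12}$ | $O_{13}$ | $\cdots$ | $O_{1c}$ | $R_1$ |
| Row 2 | $O_{21}$ | $O_{22}$ | $O_{23}$ | $\cdots$ | $O_{2c}$ | $R_2$ |
| $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\ddots$ | $\vdots$ | $\vdots$ |
| Row r | $O_{r1}$ | $O_{r2}$ | $O_{r3}$ | $\cdots$ | $O_{rc}$ | $R_r$ |
| Totals | $C_1$ | $C_2$ | $C_3$ | $\cdots$ | $C_c$ | $N$ |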

with r rows of data and c columns. Each of the entries in the r × c contingency table (the $O_{ij}$ values) is a count of the number of observations that were classified as belonging to the category in the $i^{th}$ row and the $j^{th}$ column. Above, when we looked at the $\chi^2$ test, it was a smaller version of this table, and the test statistic for analyses of general contingency tables is the same as above:

$$\chi^2_{Calc} = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

The only distinction here is that our expected values are based upon row and column

totals such that:

E

ij

=

R

i

C

j

N

where R

i

and C

j

are the respective row and column total.
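The two formulas above can be checked by hand on a small table. The sketch below uses a made-up 2 x 3 table (the counts and the names O, E, chiCalc, dfCalc, pCalc and builtIn are hypothetical) and compares the hand calculation with chisq.test. A table larger than 2 x 2 is used on purpose, because chisq.test applies a continuity correction by default only in the 2 x 2 case.

```r
# Hypothetical 2 x 3 table of counts (made-up numbers).
O <- matrix(c(20, 30, 25,
              35, 15, 25), nrow = 2, byrow = TRUE)

R <- rowSums(O)          # row totals R_i
C <- colSums(O)          # column totals C_j
N <- sum(O)              # grand total

E <- outer(R, C) / N     # expected counts, E_ij = R_i * C_j / N

chiCalc <- sum((O - E)^2 / E)                  # the test statistic
dfCalc  <- (nrow(O) - 1) * (ncol(O) - 1)       # (r - 1)(c - 1)
pCalc   <- pchisq(chiCalc, df = dfCalc, lower.tail = FALSE)

# The built-in test should agree with the hand calculation.
builtIn <- chisq.test(O)
builtIn$expected
builtIn$statistic
```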

There are two speciﬁc assumptions that are required to conduct a general contingency

table test such as this:

1. The N observations are drawn randomly from the larger population.

2. Each observation can be classiﬁed into exactly one of the possible r and c categories

according to single and independent criteria (e.g., there is no correlation between

the row and column variables).


Table 5.1: Diversity of enrolled undergraduate students at Virginia Commonwealth University in the College of Humanities & Sciences between the academic years 1998-2008 as reported by the Center for Institutional Effectiveness (http://www.vcu.edu/cie/analysis/reports/sets.html).

Group                               1998   1999   2000   2001   2002   2003   2004   2005   2006   2007   2008
Non-resident Aliens                  186    158    188    208    206    235    272    375    512    577    673
Black non-Hispanic                  2985   3094   3282   3332   3387   3456   3633   3797   3983   4158   4193
American Indian or Alaskan Native     91     80     83     86     90    113    109    116    124    131    131
Asian or Pacific Islander           1103   1139   1132   1175   1231   1437   1632   1764   1970   2148   2330
Hispanic                             279    305    362    400    449    521    559    623    709    761    822
White, non-Hispanic                 8688   8586   9013   9373   9916  10077  10757  11088  11180  11170  11202
Race/ethnicity unknown                 0    188    208    279    387    665    849    928   1019   1287   1642
Total                              13332  13550  14268  14853  15666  16504  17811  18691  19497  20232  20993


To demonstrate this analysis we will analyze the 1998, 2003 and 2008 enrollment data from

Table 5.1 to see if the diversity of students at VCU has changed over the last decade.

These data are present in a text file named VCUCommonData.csv in the folder for this Chapter. It is loaded into R with the following commands.

> data <- read.table( "VCUCommonData.csv", header=TRUE, sep=" " )
> summary( data )
     Yr1998           Yr1999         Yr2000         Yr2001
 Min.   :   0.0   Min.   :  80   Min.   :  83   Min.   :  86.0
 1st Qu.: 138.5   1st Qu.: 173   1st Qu.: 198   1st Qu.: 243.5
 Median : 279.0   Median : 305   Median : 362   Median : 400.0
 Mean   :1904.6   Mean   :1936   Mean   :2038   Mean   :2121.9
 3rd Qu.:2044.0   3rd Qu.:2116   3rd Qu.:2207   3rd Qu.:2253.5
 Max.   :8688.0   Max.   :8586   Max.   :9013   Max.   :9373.0
     Yr2002           Yr2003          Yr2004            Yr2005
 Min.   :  90.0   Min.   :  113   Min.   :  109.0   Min.   :  116
 1st Qu.: 296.5   1st Qu.:  378   1st Qu.:  415.5   1st Qu.:  499
 Median : 449.0   Median :  665   Median :  849.0   Median :  928
 Mean   :2238.0   Mean   : 2358   Mean   : 2544.4   Mean   : 2670
 3rd Qu.:2309.0   3rd Qu.: 2446   3rd Qu.: 2632.5   3rd Qu.: 2780
 Max.   :9916.0   Max.   :10077   Max.   :10757.0   Max.   :11088
     Yr2006            Yr2007          Yr2008
 Min.   :  124.0   Min.   :  131   Min.   :  131.0
 1st Qu.:  610.5   1st Qu.:  669   1st Qu.:  747.5
 Median : 1019.0   Median : 1287   Median : 1642.0
 Mean   : 2785.3   Mean   : 2890   Mean   : 2999.0
 3rd Qu.: 2976.5   3rd Qu.: 3153   3rd Qu.: 3261.5
 Max.   :11180.0   Max.   :11170   Max.   :11202.0

Once the entire data set is loaded into R, we can extract only the values that we are going to use.

> Obs <- as.matrix( cbind( data$Yr1998, data$Yr2003, data$Yr2008 ) )
> Obs
     [,1]  [,2]  [,3]
[1,]  186   235   673
[2,] 2985  3456  4193
[3,]   91   113   131
[4,] 1103  1437  2330
[5,]  279   521   822
[6,] 8688 10077 11202
[7,]    0   665  1642
> colnames( Obs ) <- c( "1998", "2003", "2008" )
> rownames( Obs ) <- c( "Non-resident Aliens", "Black non-Hispanic",
+   "American Indian or Alaskan Native", "Asian or Pacific Islander",
+   "Hispanic", "White, non-Hispanic", "Race/ethnicity unknown" )
> Obs
                                   1998  2003  2008
Non-resident Aliens                 186   235   673
Black non-Hispanic                 2985  3456  4193
American Indian or Alaskan Native    91   113   131
Asian or Pacific Islander          1103  1437  2330
Hispanic                            279   521   822
White, non-Hispanic                8688 10077 11202
Race/ethnicity unknown                0   665  1642

With these data we will be specifically testing the hypothesis that across years there are no differences in the relative distributions of self-identified racial and ethnic groups.

In some texts, this (7x3) contingency test is called a $\chi^2$ Test for Independence, and in R it is conducted using the function chisq.test(). To begin with, we can plot the categories as a barplot (see 8.2.1 for how to make these plots yourself), as represented in Figure 5.1.


Figure 5.1: Undergraduate diversity at Virginia Commonwealth University during academic years 1998, 2003, & 2008.

> test1 <- chisq.test( Obs )
> test1

        Pearson's Chi-squared test

data:  Obs
X-squared = 1704.417, df = 12, p-value < 2.2e-16

> summary( test1 )
          Length Class  Mode
statistic  1     -none- numeric
parameter  1     -none- numeric
p.value    1     -none- numeric
method     1     -none- character
data.name  1     -none- character
observed  21     -none- numeric
expected  21     -none- numeric
residuals 21     -none- numeric

Notice here that I assigned the results of the statistical test to the variable test1. I did this because there are many reasons why you may be interested in looking at various aspects of the analysis. By printing the contents of the test itself, we see that the calculated statistic is $\chi^2_{Calc} = 1704.417$, which with $(r-1)(c-1) = 6 \times 2 = 12$ df produces a very small P-value. If you look back at Figure 4.2, our observed value is way out to the right, with a very small likelihood that you would get a value this large if the null hypothesis were true.

As the output of summary(test1) shows, the analysis returns a list that has all of its components as list items. There are a lot of different reasons why you may be interested in using various components of the analysis. For example, you may want to create a table of the observed or expected values, or you may need to run this test a large number of times and store the results.
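A short sketch of working with those list items follows. So that it runs without the data file, the observed counts for 1998, 2003 and 2008 are typed in by hand from Table 5.1; the name chiCalc is made up for this illustration.

```r
# Observed counts for 1998, 2003 and 2008, typed in from Table 5.1
# (entered by hand so the sketch does not depend on the data file).
Obs <- matrix(
  c( 186,   235,   673,
    2985,  3456,  4193,
      91,   113,   131,
    1103,  1437,  2330,
     279,   521,   822,
    8688, 10077, 11202,
       0,   665,  1642),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    c("Non-resident Aliens", "Black non-Hispanic",
      "American Indian or Alaskan Native", "Asian or Pacific Islander",
      "Hispanic", "White, non-Hispanic", "Race/ethnicity unknown"),
    c("1998", "2003", "2008")))

test1 <- chisq.test(Obs)

names(test1)                 # the components stored in the result
round(test1$expected, 1)     # expected counts E_ij
round(test1$residuals, 2)    # Pearson residuals, (O - E) / sqrt(E)

# Rebuild the test statistic from the stored observed and expected counts.
chiCalc <- sum((test1$observed - test1$expected)^2 / test1$expected)
```

Large positive or negative residuals point to the cells that contribute most to the overall statistic.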

Caveats

There are some caveats that need to be made with respect to the general use of contingency tables. First, they are very robust as long as you have a moderate number of samples in each of the cells. The test statistic we have been using, $\chi^2_{Calc}$ with $(r-1)(c-1)$ df, is actually an approximation that is good only when every cell is well represented. If the values in the cells are small, then the P-value we obtain from the approximation (and hence the Type I error rate, α) is poorly estimated. OK, but what is moderate? Here are some general guidelines:¹

1. If any of the $E_{ij}$ estimates are less than 1, the approximation will be poor.

2. If more than 20% of the $E_{ij}$ values are less than 5, then the approximation will be poor.
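Both guidelines are easy to check before trusting the approximation. The sketch below uses a made-up 3 x 3 table with sparse cells; the counts and the names sparse, anyBelowOne and shareBelow5 are hypothetical.

```r
# Hypothetical 3 x 3 table with some sparse cells (made-up counts).
sparse <- matrix(c(12, 3, 1,
                    9, 2, 0,
                   15, 4, 2),
                 nrow = 3, byrow = TRUE)

# Expected counts under independence: E_ij = R_i * C_j / N
E <- outer(rowSums(sparse), colSums(sparse)) / sum(sparse)
round(E, 2)

anyBelowOne <- any(E < 1)     # guideline 1: is any expected count below 1?
shareBelow5 <- mean(E < 5)    # guideline 2: fraction of cells with E < 5

anyBelowOne
shareBelow5 > 0.20
```

If either check comes back TRUE, the remedies discussed next (collapsing categories or an exact test) are worth considering.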

So what do you do if you have some small expected values? First, you can try to collapse some of your row or column categories and recalculate. Whether this can be done without making the analysis meaningless depends upon your knowledge of the biology of the system.

Second, you can try to use Fisher's Exact Test. This uses combinatorial theory to calculate the probabilities of the test statistic rather than relying on asymptotic assumptions. This is an excellent choice, but because it uses combinatorial theory, at some point you will have to perform an operation like N!, and when N > 170 the result is too large for the computer to represent. There is also the restriction that the product of the row marginals (the $R_i$ values in the table) must be strictly less than $2^{31} - 1$, but the N < 170 rule is a bit easier to remember.
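A minimal sketch of both points, using a made-up 2 x 2 table (the counts and the names small and exact are hypothetical). The last two lines only illustrate the double-precision limit on factorials; they say nothing about how fisher.test arranges its own arithmetic internally.

```r
# Hypothetical 2 x 2 table with small counts (made-up data).
small <- matrix(c(3, 1,
                  1, 3), nrow = 2, byrow = TRUE)

exact <- fisher.test(small)
exact$p.value

# Double-precision limit on factorials: 170! is finite, 171! overflows to Inf.
is.finite(factorial(170))
is.finite(suppressWarnings(factorial(171)))
```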

5.2 Paired Observations

Analyses in this section will be concerned with data that is collected in a pair-wise

fashion (e.g., for each observation, there are two values collected).

¹ These guidelines are a bit on the conservative side and you may want to see a text on non-parametric statistics for a more complete discussion of how far you can stray from these and still not get laughed at.


5.2.1 Rank Correlation

In 4.4.2 we looked at how you use the cor.test function to get a parametric estimate of the correlation between two variables. This is possible as well using a non-parametric approach by adopting a ranking methodology. Non-parametric correlation methods include Spearman's ρ and Kendall's τ, among others, but the interface in R is identical (and the same as we already saw for the Pearson product moment correlation), so I will only cover the Spearman approach and leave you to look into the differences.

Spearman's correlation statistic, ρ, is calculated as:

$$\rho = \frac{\sum_{i=1}^{N} R[X_i]R[Y_i] - N\left(\frac{N+1}{2}\right)^2}{\left(\sum_{i=1}^{N} R[X_i]^2 - N\left(\frac{N+1}{2}\right)^2\right)^{1/2} \left(\sum_{i=1}^{N} R[Y_i]^2 - N\left(\frac{N+1}{2}\right)^2\right)^{1/2}} \qquad (5.3)$$

where the term $R[X_i]$ is the rank of the $i$th element in X. These ranks are computed in comparison to the other values in X. For example, $R[X_i] = 1$ is the smallest value of X, $R[X_i] = 2$ would be the second smallest, etc. So what is being done here is that we are replacing the actual values of the variables by their relative ranks.

Using the same data as in 4.4.2, you specify the use of the Spearman approach using ranks by passing it as an additional option to the cor.test function.

> X <- 1:20
> Y <- c(-17, 7, -12, 12, -4, 11, 10, -2, 35, 31, 34, 49, 27, 33, 45, 32, 36, 38, 58, 44)
> cor.test( X, Y, method="spearman" )

        Spearman's rank correlation rho

data:  X and Y
S = 198, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho
0.8511278

Notice here that the correlation is signiﬁcant although the correlation statistic is a bit

smaller. There is some loss of information by putting the data into ranks rather than

using the raw values.
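To connect the output back to Equation 5.3, the statistic can be rebuilt from the ranks directly. This is only a sketch of the arithmetic; the names rX, rY, centre, rhoByHand and rhoBuiltIn are made up for the illustration.

```r
X <- 1:20
Y <- c(-17, 7, -12, 12, -4, 11, 10, -2, 35, 31,
       34, 49, 27, 33, 45, 32, 36, 38, 58, 44)

N  <- length(X)
rX <- rank(X)                     # R[X_i]
rY <- rank(Y)                     # R[Y_i]
centre <- N * ((N + 1) / 2)^2     # the N((N+1)/2)^2 term in Equation 5.3

rhoByHand <- (sum(rX * rY) - centre) /
  (sqrt(sum(rX^2) - centre) * sqrt(sum(rY^2) - centre))

# The built-in estimate should agree with the hand calculation.
rhoBuiltIn <- cor(X, Y, method = "spearman")

rhoByHand
rhoBuiltIn
```

Equation 5.3 is simply the Pearson correlation applied to the ranks, so cor(rank(X), rank(Y)) gives the same value.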

So why use this instead of the parametric approaches? The calculation of Pearson's r statistic depends upon the bivariate distribution of X and Y. If there is no known joint distribution for these variables, then the density function of r is undefined. What does this mean to you? It means that if your data can be assumed to be normal, then go ahead and use the Pearson approach. However, if you cannot assume that they are normal, or you know they are not, then a rank approach may be more appropriate. For me, I consider the non-parametric approaches as appropriate for all data, whereas the parametric ones are only good for a subset of the data that we encounter.


5.2.2 Wilcoxon Test

The Wilcoxon rank sum test is also known as the Mann-Whitney test and is a rank-based method analogous to the two-sample (unpaired) t-test. This approach tests the null hypothesis that samples drawn from two different populations are essentially the same (i.e., an observation is as likely to have come from one population as from the other). Data here are drawn randomly from two different "treatments" to see if the application of either produces a significant shift in the values of one set of observations.

As was discussed for Spearman's ρ, samples will be ranked in increasing order for this analysis. If the ranks in sample X tend to be generally larger or smaller than those of the observations in Y, then we can reject the null hypothesis $H_O: X = Y$. In general your data should look like:

Treatment 1    Treatment 2
X_1            Y_1
X_2            Y_2
...            ...
X_n            Y_m

In this analysis, we do not assume that both X and Y have the same number of observations; in general we will consider X to have n observations while Y has m, and denote N = n + m. Samples are lumped together and assigned ranks based upon the combined N observations. In the case of ties, where two or more samples have the exact same value, it is recommended to assign the average rank to all the tied observations. Fortunately, the internal R code takes care of this for us (and will provide warnings when appropriate), so we can focus on our tasks and let R focus on the specifics.

Assumptions

The Wilcoxon test has the following assumptions:

1. Both sets of samples (the X and Y observations) are drawn randomly from each population.

2. There is an expected mutual independence between the X and Y values as well.

3. The variables are at least ordinal.

The test statistic for this analysis is the sum of the ranks of the X variables:

$$W = \sum_{i=1}^{n} R[X_i]$$

If the observations in X and Y are drawn from a single population, as stated in the null hypothesis, then the sum of the ranks of X should be about what is expected for a random subset of n of the N combined ranks. If the treatments are producing differences in either X or Y, then the test statistic will be unusually large or unusually small given N.


To show how to conduct the Wilcoxon test, I will use the pine germination data that is in the folder for this Chapter. These data are from my thesis and record the average germination rates for offspring arrays of Pinus echinata families that were sampled in continuous stands (CTRL), selectively cut stands (SEL), and stands where all the trees around P. echinata were clear-cut (CLR). Here we will use the Wilcoxon test to see if there is a significant difference in germination rates between the control (CTRL) and clear-cut (CLR) treatments. Here is how to load the data into R and extract just the treatments of interest.

> pineData <- read.table( "PineGerminationData.txt", header=TRUE )
> summary( pineData )
      GERM          TRT
 Min.   :0.0000   CLR :15
 1st Qu.:0.1800   CTRL:23
 Median :0.3700   SEL :15
 Mean   :0.3625
 3rd Qu.:0.5700
 Max.   :0.9400
> X <- pineData$GERM[ pineData$TRT=="CLR" ]
> Y <- pineData$GERM[ pineData$TRT=="CTRL" ]
> length( X )
[1] 15
> length( Y )
[1] 23
> X
 [1] 0.67 0.64 0.94 0.40 0.01 0.45 0.58 0.00 0.80 0.81 0.21 0.36 0.82 0.35 0.41
> Y
 [1] 0.63 0.29 0.37 0.56 0.19 0.02 0.06 0.07 0.11 0.18 0.03 0.64 0.21 0.00 0.00
[16] 0.53 0.00 0.00 0.00 0.00 0.35 0.39 0.37
> mean( X )
[1] 0.4966667
> mean( Y )
[1] 0.2173913
> range( X )
[1] 0.00 0.94
> range( Y )
[1] 0.00 0.64

You can see that there are different numbers of samples in each treatment but that they have overlapping ranges. To run the Wilcoxon test, use the function wilcox.test and pass it the two variables.

> wilcox.test( X, Y )

        Wilcoxon rank sum test with continuity correction

data:  X and Y
W = 269.5, p-value = 0.003835
alternative hypothesis: true location shift is not equal to 0

Warning message:
In wilcox.test.default(X, Y) : cannot compute exact p-value with ties

According to our test, the data in X and Y appear to be different. The test statistic is W = 269.5, which gives a P-value of 0.004. There is a warning message that you should be aware of. The data contain ties, and this prevents R from calculating an exact P-value, so a normal approximation is used instead. Most of these ties are for families that did not produce any offspring. From a biological perspective, these are valid responses and you would have to just live with the fact that ties existed because throwing out all the 0.00 values changes the interpretation of what happened.
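One detail is worth checking by hand: the W that R prints is not the raw rank sum from the formula above but that sum minus its smallest possible value, n(n+1)/2. The sketch below copies the printed X and Y values and rebuilds the statistic; the names ranks, rankSumX, wByHand and wBuiltIn are made up for the illustration.

```r
# Germination rates copied from the printed X (CLR) and Y (CTRL) vectors.
X <- c(0.67, 0.64, 0.94, 0.40, 0.01, 0.45, 0.58, 0.00,
       0.80, 0.81, 0.21, 0.36, 0.82, 0.35, 0.41)
Y <- c(0.63, 0.29, 0.37, 0.56, 0.19, 0.02, 0.06, 0.07, 0.11, 0.18, 0.03, 0.64,
       0.21, 0.00, 0.00, 0.53, 0.00, 0.00, 0.00, 0.00, 0.35, 0.39, 0.37)

n <- length(X)
ranks <- rank(c(X, Y))          # ties receive the average rank by default
rankSumX <- sum(ranks[1:n])     # the rank sum of X as defined in the text

# R reports the rank sum shifted by its minimum possible value, n(n+1)/2.
wByHand <- rankSumX - n * (n + 1) / 2

# suppressWarnings() hides the "cannot compute exact p-value with ties" warning.
wBuiltIn <- suppressWarnings(wilcox.test(X, Y))$statistic

wByHand
wBuiltIn
```

The shift does not change the P-value; it only changes the scale on which the statistic is reported.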


In general, the Wilcoxon test is rather powerful in determining the equality of samples drawn from two different populations. It is essentially the non-parametric version of the standard two-sample t-test.² Situations where you may favor a Wilcoxon approach over the t-test are when you have non-normal data or data with several outlier points.

5.3 Several Random Samples

The ﬁnal section in this chapter is focused on data that is collected from multiple treat-

ments. In the previous discussion of the Wilcoxon test, the data had k = 2 treatments

and it was introduced as a rank based analog of the t-test. Here we will introduce the

Kruskal-Wallis test which allows for the analysis of k > 2 treatments and we could again

consider it a rank-based analog of an analysis of variance (ANOVA) approach.

5.3.1 Kruskal-Wallis Tests

The Kruskal-Wallis test examines the differences among k different treatments using a rank-based approach similar to that discussed for the Wilcoxon test. In fact, this test is just an extension of the Wilcoxon test using the same sum of ranks approach.

Data for this test are not assumed to be of equal sizes. Each treatment may have a different number of observations, with a total sample size of $N = \sum_{i=1}^{k} n_i$. You should be able to make a list of your data by treatment such as:

Treatment 1    Treatment 2    ...    Treatment k
X_{11}         X_{21}         ...    X_{k1}
X_{12}         X_{22}         ...    X_{k2}
...            ...            ...    ...
X_{1n_1}       X_{2n_2}       ...    X_{kn_k}

The test statistic for this test is a $\chi^2$ approximation with k − 1 degrees of freedom.
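The text does not write the statistic out, so the sketch below uses the usual textbook form for data without ties, H = 12 / (N(N+1)) × Σ R_i² / n_i − 3(N+1), where R_i is the sum of ranks in treatment i. That formula is an addition here, not taken from this chapter, and the data and the names x, g, rankSums, n_i, H, kw and pByHand are made up for the illustration.

```r
# Hypothetical measurements in three groups, with no tied values (made-up data).
x <- c(2.1, 3.4, 1.9,   4.8, 5.2, 6.1, 3.9,   7.3, 8.0, 6.6, 9.4)
g <- factor(rep(c("A", "B", "C"), times = c(3, 4, 4)))

N <- length(x)
r <- rank(x)                         # ranks over the combined sample
rankSums <- tapply(r, g, sum)        # sum of ranks within each treatment
n_i      <- tapply(r, g, length)     # number of observations per treatment

# No-ties form of the statistic.
H <- 12 / (N * (N + 1)) * sum(rankSums^2 / n_i) - 3 * (N + 1)

# Compare against a chi-squared distribution with k - 1 degrees of freedom.
pByHand <- pchisq(H, df = nlevels(g) - 1, lower.tail = FALSE)

kw <- kruskal.test(x, g)
kw$statistic
kw$p.value
```

When ties are present, kruskal.test also applies a tie correction, so a hand calculation with this simple form will differ slightly.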

Assumptions

There are several assumptions associated with this test:

1. All samples are randomly drawn from their respective treatments.

2. Treatments are independent of each other.

3. The observations are at least ordinal in nature.

² Actually, if you do a t-test on the ranks you will get essentially the same answer as the Wilcoxon test; the approaches are nearly identical except for how the data are encoded: raw or as ranks.

As an example using this analysis, we will examine the same Pinus echinata data set that we used to demonstrate the Wilcoxon test. The default method for performing this analysis looks like kruskal.test(x, g, ...), where the variable x is the raw data and g is another variable that has the groupings. In the code below I separate out the variables and then pass them to the function with GerminationRates as the response, grouped by the factor Treatment. I also conduct the analysis and assign it to the variable named germTest so you can see that this analysis also returns a list of results.

> pineData <- read.table( "PineGerminationData.txt", header=TRUE )
> GerminationRates <- pineData$GERM
> Treatment <- as.factor( pineData$TRT )
> Treatment
 [1] CTRL CTRL CTRL CTRL CTRL CTRL CTRL CTRL CTRL CTRL CTRL CTRL CTRL CTRL CTRL
[16] CTRL CTRL CTRL CTRL CTRL CTRL CTRL CTRL SEL  SEL  SEL  SEL  SEL  SEL  SEL
[31] SEL  SEL  SEL  SEL  SEL  SEL  SEL  SEL  CLR  CLR  CLR  CLR  CLR  CLR  CLR
[46] CLR  CLR  CLR  CLR  CLR  CLR  CLR  CLR
Levels: CLR CTRL SEL
> GerminationRates
 [1] 0.630 0.290 0.370 0.560 0.190 0.020 0.060 0.070 0.110 0.180 0.030 0.640
[13] 0.210 0.000 0.000 0.530 0.000 0.000 0.000 0.000 0.350 0.390 0.370 0.580
[25] 0.490 0.450 0.380 0.510 0.570 0.240 0.290 0.620 0.520 0.200 0.240 0.615
[37] 0.760 0.300 0.670 0.640 0.940 0.400 0.010 0.450 0.580 0.000 0.800 0.810
[49] 0.210 0.360 0.820 0.350 0.410
> germTest <- kruskal.test( GerminationRates, Treatment )
> summary( germTest )
          Length Class  Mode
statistic 1      -none- numeric
parameter 1      -none- numeric
p.value   1      -none- numeric
method    1      -none- character
data.name 1      -none- character
> germTest

        Kruskal-Wallis rank sum test

data:  GerminationRates and Treatment
Kruskal-Wallis chi-squared = 12.539, df = 2, p-value = 0.001893

When looking at the results of the test, we see that the estimated test statistic was relatively large (P ≈ 0.002), suggesting that the three timber extraction treatments do differentially influence the germination percentages.

5.4 The Formula Notation & Box Plots

If you look at the function signature for kruskal.test (by typing ?kruskal.test into R), you can see several alternate ways you can pass your data to it.

kruskal.test              package:stats              R Documentation

Kruskal-Wallis Rank Sum Test

Description:
     Performs a Kruskal-Wallis rank sum test.

Usage:
     kruskal.test(x, ...)

     ## Default S3 method:
     kruskal.test(x, g, ...)

     ## S3 method for class 'formula':
     kruskal.test(formula, data, subset, na.action, ...)


When discussing the relationship between the raw germination data and the grouping

variable, I used the statement ”...is a function of...” This notation is the formula notation

that is indicated in the last option for calling the kruskal.test function. In R you can often

use the formula notation to perform analyses and plots and here we will spend a little

bit of time on how that is done. In Chapter 6 you will use this notation quite a bit when

writing out linear models.

The formula notation in R consists of the response variable (or variables, which I'll call Y), the predictor variable (or variables, which will be denoted as X), and the tilde sign ~ showing the relationship. For example, a simple formula would be denoted as Y ~ X, stating that Y is a function of X. Using the formula notation for kruskal.test would look like:

> kruskal.test( GerminationRates ~ Treatment )

        Kruskal-Wallis rank sum test

data:  GerminationRates by Treatment
Kruskal-Wallis chi-squared = 12.539, df = 2, p-value = 0.001893

Figure 5.2: Boxplot of Pinus echinata germination data partitioned by timber extraction treatment.


It is even possible (and perhaps better because we are rather lazy in our typing) to use the formula notation with the variable names within a data.frame without having to make the other variables (GerminationRates and Treatment). However, when you do this, you will have to pass an additional parameter to the analysis function to tell it which data to look into for those variable names. For example, with the pineData data set you can type:

> kruskal.test( GERM ~ TRT, data=pineData )

        Kruskal-Wallis rank sum test

data:  GERM by TRT
Kruskal-Wallis chi-squared = 12.539, df = 2, p-value = 0.001893
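The two calling styles can be compared side by side. So that the sketch runs without the data file, the data frame is rebuilt by hand from the GerminationRates and Treatment values printed earlier (23 CTRL, then 15 SEL, then 15 CLR); the names byVectors and byFormula are made up for the illustration.

```r
# Germination rates copied from the printed GerminationRates vector.
pineData <- data.frame(
  GERM = c(0.630, 0.290, 0.370, 0.560, 0.190, 0.020, 0.060, 0.070, 0.110, 0.180, 0.030, 0.640,
           0.210, 0.000, 0.000, 0.530, 0.000, 0.000, 0.000, 0.000, 0.350, 0.390, 0.370, 0.580,
           0.490, 0.450, 0.380, 0.510, 0.570, 0.240, 0.290, 0.620, 0.520, 0.200, 0.240, 0.615,
           0.760, 0.300, 0.670, 0.640, 0.940, 0.400, 0.010, 0.450, 0.580, 0.000, 0.800, 0.810,
           0.210, 0.360, 0.820, 0.350, 0.410),
  TRT  = rep(c("CTRL", "SEL", "CLR"), times = c(23, 15, 15)),
  stringsAsFactors = TRUE)

byVectors <- kruskal.test(pineData$GERM, pineData$TRT)   # default method: x, g
byFormula <- kruskal.test(GERM ~ TRT, data = pineData)   # formula method

byVectors$statistic
byFormula$statistic
```

Only the data label in the printed output differs between the two calls; the statistic, degrees of freedom and P-value are the same.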

Another common place to find the formula notation is in plotting. Thus far, we have created scatter plots with the function plot(x,y). It is just as easy to call the plot as plot(y ~ x), and you will get the same results if the variable x is a continuous variable. However, if x is a categorical variable you will not get a normal scatter plot. What you will get is a box plot as depicted in Figure 5.2, which was created by calling the function³:

> plot( GERM ~ TRT, data=pineData, xlab="Treatment", ylab="GerminationRate" )

³ To adjust additional parameters on the box plots, see the function bxp, which is the actual plotting function that the plot function hands the data off to. You can adjust many other components of the plot, including notches, box colors, etc.
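One practical caveat when making box plots this way: plot(y ~ x) produces a box plot only when x is stored as a factor. Since R 4.0.0, read.table and data.frame leave text columns as character vectors by default, so an explicit conversion may be needed. The sketch below uses a made-up data set; the names yield, response, group and stats are hypothetical.

```r
# Hypothetical data: a numeric response and a grouping column stored as text.
yield <- data.frame(
  response = c(4.1, 5.3, 4.8, 6.9, 7.4, 7.1, 5.5, 6.0, 5.8),
  group    = rep(c("low", "mid", "high"), each = 3),
  stringsAsFactors = FALSE)

# Convert the grouping column to a factor; levels= also fixes the plotting order.
yield$group <- factor(yield$group, levels = c("low", "mid", "high"))

# With a factor on the right-hand side, plot() draws a box plot.
plot(response ~ group, data = yield, xlab = "Group", ylab = "Response")

# boxplot() accepts the same formula; plot = FALSE returns the summary
# statistics for each box without drawing anything.
stats <- boxplot(response ~ group, data = yield, plot = FALSE)
stats$names
stats$stats
```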


5.5 Useful Functions

The following functions were introduced in this chapter and you will be required to use

them for the exercises. To get more information on any of these functions, use the R

help system.

• as.factor(x) Coerces the data in x into a factor data type.

• as.matrix(x) Coerces the data in x into a matrix data type if possible.

• binom.test(x,n,p) Performs a binomial test to see if observing x occurrences of one category of data in n trials is consistent with the likelihood of it occurring with a frequency of p.

• c(x,y) The concatenate function that munges all the items together and returns them as a vector.

• cbind(x,y,...) Binds together the data in x, y, etc. by columns.

• colnames(x) Access the column names in the item x. This only works for matrices and data.frames.

• cor.test(x,y) Tests for a significant correlation between x and y (i.e., tests the null hypothesis ρ = 0).

• chisq.test(t) Performs a $\chi^2$ test on the values in the table t.

• kruskal.test(x,g) Performs the Kruskal-Wallis Rank Sum test for the data in x as partitioned into groups defined by g.

• length(x) Returns the length of x.

• mean(x) Returns the mean of the items in x.

• range(x) Returns a two-element vector containing the minimum and maximum values in x.

• read.table() Reads raw data into R.

• rownames(x) Access the row names in x. This only works for matrices or data.frames.

• summary(x) Returns a general summary of the data in x.

• table(f) This function takes the list of levels in the factor f and makes a table from it.

• wilcox.test(x,y) Performs the Wilcoxon Rank Sum Test on the variables in x and y.


5.6 Exercises

The following exercises are meant to help you understand the items presented in this Chapter.

1. Calculate the relative proportions of each group in the 1999 VCU data and use the goodness of fit approach (as in 5.1.1) to see if the 2008 student class has the same relative proportions as are predicted by the 1999 class.

2. Compare the freshmen enrollment in the College of Humanities & Sciences at VCU (from Table 5.1) during the 2006-2007 academic year for Degree-Seeking Undergraduates to the three Universities listed below. Is the student diversity across these institutions the same? These data sets are prepared each academic year by each public institution and can be found by searching for "Common Data Sets" and looking at Enrollment & Persistence. Below are the places you can get this information for three Universities in our region.

• Auburn University https://oira.auburn.edu/cds/2006/sectionb.aspx

• University of Virginia http://www.web.virginia.edu/IAAS/data catalog/institutional/cds/current/enrollment.htm

• Virginia Tech http://www.ir.vt.edu/common ds 2006.htm

3. Use wilcox.test to see if the germination rates observed in the SEL and CLR treatments are significantly different. Provide some interpretation of your results.

4. Load the data into R that is found in the file CornOutput.csv (Note: these data are tab-delimited, so you will have to adjust the separator you use in the read.table function). These data represent the output in numbers of bushels per acre of corn with three different fertilizer treatments. Create a density plot showing the distribution of bushels yielded by each treatment.

5. Test the equality of the fertilizers in the data loaded from the last question using a

Kruskal-Wallis test. Interpret your results.

6. What are the inner-quartiles of the three fertilizer yields?

7. From a total of N = 15 students in this course, if 14 pass, is the probability of passing

this course equal to p = 0.65?

8. What does the optional parameter rescale.p change in the chisq.test function? Why would

you want to use this option?

9. Assume that you observed phenotypes in the following amounts: $n_{spots} = 12$ individuals with spots, $n_{silky} = 22$ with silky fur, $n_{smooth} = 15$ smooth coated, and $n_{agouti} = 8$ agouti. Do these data fit the hypothesis that the probability of any one of these phenotypes is equal?

10. Create three variables named First, Second, and Third and assign each of them the value of runif(3). Now, create a bar plot of these data, assuming that the first entry in each variable represents Category A, the second Category B, and the third Category C. Make it look something like Figure 5.1, with the Categories used as the partitioning variable along the x-axis. Feel free to provide your own colors.


Chapter 6

Linear Models

This chapter focuses on the analysis of linear models in R. The term "linear model" is a general one that will be used a bit loosely. In general, a linear model is one that can be written down in the form:

y = x

Some variable, or set of variables, y, are predicted to have a particular relationship with

some predictor variable (or variables) denoted in x. In the simplest case when both x

and y are continuous variables, the analysis is called a regression analysis, if x has

more than one predictor variable then it is called a multiple regression, and if y is binary

it is a logistic regression. However, if the predictor variable is categorical the model

is called an analysis of variance with many variants depending upon the number and

relationship of categorical predictor variables in x. Finally, if predictor variables consist

of categorical and continuous variables then it is called an analysis of covariance. There

are many different ways of introducing these different kinds of analysis but we are going

to focus on the functional form and the kinds of variables that make up the predictor

x.

In this Chapter you will learn the following skills:

• Learn to analyze data using a simple regression approach.

• Be able to incrementally build a multiple regression model using Type III sums of

squares.

• Perform an analysis of variance (ANOVA) analysis for both 1-way and factorial mod-

els.

6.1 The t-test

6.1.1 One-Sample Tests

The ﬁrst linear model we will deal with is the t-test. The functional form of this is:


y = µ

where we believe that the observations sampled in y have some particular mean value and the variation around that mean value is simply the natural variation there is in the kind of samples we are measuring. The function that performs the one-sample t-test in R is (not surprisingly) called t.test and has the following options available to it.

t.test( x, y = NULL,
        alternative = c("two.sided", "less", "greater"),
        mu = 0, paired = FALSE, var.equal = FALSE,
        conf.level = 0.95, ... )

For a one-sample test, we will pass the response variable and a value for the parameter mu to the function. By default, it will test the null hypothesis $H_O: \bar{y} = \mu$ (the mu in the signature) using a "two.sided" alternative hypothesis. This means that we can reject the null if $\bar{y} < \mu$ or if $\bar{y} > \mu$, using a rejection region of size $\alpha/2$ in each tail. If you have reason to believe that the observations should fall above or below some particular value of µ, something along the lines of say "the addition of fertilizer should increase yield," then you should be using a one-tailed test instead, which examines only an α-sized region at one end.

In the data below, we are testing the hypothesis $H_O: \bar{y} = 15$ with the given data.

> Y <- c(19,25,14,15,24,17,19,27,29,25)
> test1 <- t.test( Y, mu=15 )
> summary( test1 )
            Length Class  Mode
statistic   1      -none- numeric
parameter   1      -none- numeric
p.value     1      -none- numeric
conf.int    2      -none- numeric
estimate    1      -none- numeric
null.value  1      -none- numeric
alternative 1      -none- character
method      1      -none- character
data.name   1      -none- character
> print( test1 )

        One Sample t-test

data:  Y
t = 3.8523, df = 9, p-value = 0.003892
alternative hypothesis: true mean is not equal to 15
95 percent confidence interval:
 17.64182 25.15818
sample estimates:
mean of x
     21.4

You can see that I assigned the results of the analysis to the variable named test1. Just as in the contingency table examples (5.1.3 & 5.2.2), the results of an analysis are a list containing all the parameters that were used to perform the analysis as well as intermediary materials and results. Of particular mention are the components p.value, conf.int, and statistic. Overall, the analysis found that we can reject the null hypothesis $H_O: \bar{y} = 15$ with a P-value of ≈ 0.004. This is fairly good support for the notion that the mean of these observations is not equal to 15.
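The reported t statistic can be reproduced from the usual one-sample formula, t = (ȳ − µ) / (s / √n), which also makes the difference between a two-sided and a one-tailed test concrete. The formula is the standard one rather than something spelled out in this chapter, and the names mu0, se, tByHand, pByHand and oneTailed are made up for the illustration.

```r
Y   <- c(19, 25, 14, 15, 24, 17, 19, 27, 29, 25)
mu0 <- 15                                   # hypothesised mean

n  <- length(Y)
se <- sd(Y) / sqrt(n)                       # standard error of the mean
tByHand <- (mean(Y) - mu0) / se
pByHand <- 2 * pt(abs(tByHand), df = n - 1, lower.tail = FALSE)   # two-sided

test1 <- t.test(Y, mu = mu0)
test1$statistic
test1$p.value
test1$conf.int

# One-tailed version: only the upper tail counts as evidence against the null.
oneTailed <- t.test(Y, mu = mu0, alternative = "greater")
oneTailed$p.value
```

Because the observed mean lies above 15, the one-tailed "greater" P-value is half of the two-sided one.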


6.1.2 Paired Tests

The t-test can also be used in a paired fashion. This analysis consists of two sets of variables, X and Y, whose observations are taken in pairs in such a manner that, under the null hypothesis, the differences between them are expected to be negligible. For example, perhaps you think that parasite load has influenced the development of young warblers so you measure the lengths of the primary feathers. Overall the null hypothesis for this is $H_O: X = Y$. Another way to write this hypothesis is $H_O: (X - Y) = 0$, in which case this becomes identical to the one-sample test. An example of this in R (with entirely contrived data) would be:

> X <- round( runif(10,min=12,max=20) )
> Y <- round( runif(10,min=12,max=20) )
> X
 [1] 12 18 18 13 14 15 15 16 17 19
> Y
 [1] 14 17 20 13 17 12 16 17 17 15
> t.test( X, Y, paired=T )

        Paired t-test

data:  X and Y
t = -0.1416, df = 9, p-value = 0.8905
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.697808  1.497808
sample estimates:
mean of the differences
                   -0.1

Notice that since these are paired, they must be taken from the same experimental unit,

which is why we added the paired=T option to the parameters we passed to t.test.
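The claim that the paired test is identical to a one-sample test on the differences can be checked directly. This is a minimal sketch that types in the contrived X and Y values printed above (rather than drawing new random numbers); the names paired and one.sample are made up for the illustration.

```r
# The contrived values printed in the example above
X <- c(12, 18, 18, 13, 14, 15, 15, 16, 17, 19)
Y <- c(14, 17, 20, 13, 17, 12, 16, 17, 17, 15)

# Paired test on the two sets of observations
paired <- t.test(X, Y, paired = TRUE)

# One-sample test on the differences, H_O: (X - Y) = 0
one.sample <- t.test(X - Y, mu = 0)

# Both approaches give the same t value and P-value
c(paired$statistic, one.sample$statistic)
c(paired$p.value, one.sample$p.value)
```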

6.2 Regression With A Single Variable

A linear regression seeks to see if the values in the response variable y can be predicted

to change systematically with the predictor variable x. The general form of a regression

model is:

$$y_{ij} = \beta_0 + \beta_1 x_i + e_j$$

where the response variable $y_{ij}$ is hypothesized to be a function of three independent components:

1. The intercept, $\beta_0$.

2. A slope coefficient, $\beta_1$, that determines at what rate y changes with changes in x.

3. The error term, $e_j$, is the latent variation that every observed value has around the predicted regression line.

The methods by which the parameters $\beta_0$ and $\beta_1$ are estimated are varied. The most common approach is the least squares approach, which tries to find estimates for these two parameters that minimize the sum of squared error terms (i.e., $\sum_{i=1}^{N} e_i^2$). In R we can use the function lm to construct the linear model. Here is an example data set with the values plotted in Figure 6.1.

Figure 6.1: Plot of single variable regression values.

> X <- 1:10
> X
 [1]  1  2  3  4  5  6  7  8  9 10
> Y <- c(19,25,14,15,24,17,19,27,29,25)
> Y
 [1] 19 25 14 15 24 17 19 27 29 25
> plot( Y~X, xlab="X", ylab="Y", bty="n", col="red", pch=19, ylim=c(0,30), xlim=c(0,10) )

To plot these, I used the functional form (see 5.4 for a discussion of how this works) with Y ~ X, set the labels, the plot colors, the ranges of the x- and y-axes, and the plot characters with the pch option. (To see all the different characters that you can use as plot symbols type plot(1:25,pch=1:25) and it will plot each symbol along the x = y line.) By eye-balling the image, do you think there is a relationship between these variables?

> fit1 <- lm( Y~X )
> fit1

Call:
lm(formula = Y ~ X)

Coefficients:
(Intercept)            X
    16.3333       0.9212

I start by assigning the result of the analysis to the variable fit1. Printing the contents of the analysis shows that the intercept term (the $\beta_0$) has been estimated to be 16.333 whereas the slope term (R labels this with the name of the predictor variable, here X, and above we called it $\beta_1$) has been estimated as 0.92. So for each unit increase in X, there is almost a corresponding unit increase in Y (OK since the points do kinda point upwards). But is this significant? You can have a non-zero estimate for a non-significant relationship. To see a slightly more detailed printout of the components in fit1 use the summary function.

> summary( fit1 )

Call:
lm(formula = Y ~ X)

Residuals:
   Min     1Q Median     3Q    Max
-5.097 -4.591  0.600  3.238  6.824

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  16.3333     3.2258   5.063 0.000973 ***
X             0.9212     0.5199   1.772 0.114348
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.722 on 8 degrees of freedom
Multiple R-squared: 0.2819,     Adjusted R-squared: 0.1921
F-statistic: 3.14 on 1 and 8 DF,  p-value: 0.1143

Here we see several components:

1. The formula that was used to call the lm function.

2. A summary of the residuals (the $e_{ij}$ terms).

3. The coefﬁcients themselves with standard errors and probabilities.

4. A summary of the test statistic, F, the df, and the probability.

Overall, it does not appear that the regression line is signiﬁcant. If you are interested in

printing out a more standard ANOVA table for this model, you can pass the variable fit1

to the anova function and it will print out the more normal results.

> anova( fit1 )
Analysis of Variance Table

Response: Y
          Df  Sum Sq Mean Sq F value Pr(>F)
X          1  70.012  70.012  3.1398 0.1143
Residuals  8 178.388  22.298

This printout is probably more like what you will be putting into your manuscripts.

Again, the trend does not seem to be signiﬁcant.
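To make the least squares idea from the start of this section concrete, here is a rough sketch that computes the slope and intercept by hand with the usual closed-form expressions and compares them to what lm reports. The names b0 and b1 are only illustrative and are not part of the lm output.

```r
# Same data as in the regression example above
X <- 1:10
Y <- c(19, 25, 14, 15, 24, 17, 19, 27, 29, 25)

# Closed-form least squares estimates for a single predictor:
# slope = sum of cross-products / sum of squares of X
b1 <- sum((X - mean(X)) * (Y - mean(Y))) / sum((X - mean(X))^2)
# the fitted line passes through the point (mean(X), mean(Y))
b0 <- mean(Y) - b1 * mean(X)
c(b0, b1)

# The same values as estimated by lm
fit1 <- lm(Y ~ X)
coef(fit1)

# The P-value for the slope can be taken from the summary table
summary(fit1)$coefficients["X", "Pr(>|t|)"]
```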


Plotting the Regression Model onto Your Points

It is possible to plot the regression model onto a display of the predictor and response variables. This can sometimes be helpful when visualizing your data. The abline function overlays a line on your current plot. Using the abline function on an existing graph does not require you to call par(new=T) first, as it takes care of that already.

Figure 6.2: Regression model added to plot of points using abline function.

> plot( Y~X, xlab="X", ylab="Y", bty="n", col="red", pch=19, ylim=c(0,30), xlim=c(0,10) )
> abline( fit1, lty=2 )

In addition to passing a variable that is a regression model (e.g., one for which class(fit1) returns "lm"), the function abline can also be called by passing it raw values for the slope and intercept. This means you can add an arbitrary line to any plot you like. As shown above, the function also takes additional parameters that allow you to customize the look of the line. You may want to revisit Table 4.1 as a reminder.
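As a brief sketch of the raw-value form, abline accepts an intercept a and a slope b, as well as h and v for horizontal and vertical lines. The particular values of 10 and 1.5 below are arbitrary and chosen only to show a second line next to the fitted one.

```r
# Same data and model as above
X <- 1:10
Y <- c(19, 25, 14, 15, 24, 17, 19, 27, 29, 25)
fit1 <- lm(Y ~ X)

plot(Y ~ X, xlab = "X", ylab = "Y", bty = "n", col = "red", pch = 19,
     ylim = c(0, 30), xlim = c(0, 10))

# The fitted regression line, dashed
abline(fit1, lty = 2)

# An arbitrary line given by intercept (a) and slope (b)
abline(a = 10, b = 1.5, col = "blue")

# A horizontal line at the mean of Y, dotted
abline(h = mean(Y), lty = 3)
```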


Adding Text To A Graph

While we are customizing this image of our non-significant regression model, it is probably a good time to look at the text() function. This function allows you to add arbitrary text to your plot. The basic call of this function will include the x and y coordinates of where you want to put the text and the character string that you will be putting on the graph.

To illustrate how this is done, we will add the regression formula to the plot. First, we will determine where in the fit1 variable you can find the regression coefficients. You could type out the regression equation yourself, and for a one-off image it may be easier for you to do it this way, but since the values are already stored in the fit1 variable, pulling them from there is the more versatile approach.

> names( fit1 )
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "call"          "terms"         "model"
> fit1$coefficients
(Intercept)           X
 16.3333333   0.9212121
> fit1$coefficients[1]
(Intercept)
   16.33333
> fit1$coefficients[2]
        X
0.9212121

So we can access the values estimated for $\beta_0$ and $\beta_1$ using fit1$coefficients[1] and fit1$coefficients[2]. Now we need to make a single string that has the regression equation $y = \beta_0 + \beta_1 x$. The text parts we can write out, but the values should come from fit1. To do this, we use the paste function. This function takes a list of items and mushes them together into a single character string. More can be found on the paste function and general string manipulation in Chapter 9.

> formula <- paste( "y = ", fit1$coefficients[1], " + ", fit1$coefficients[2], "x" )
> formula
[1] "y =  16.3333333333333  +  0.921212121212122 x"
> text( 5, 12.5, formula )
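The pasted string carries many more decimal places than you would normally want on a figure. One possible way to tidy it, sketched here, is to format the coefficients with sprintf before drawing the text; the names b and label are made up for this illustration.

```r
# Same data and model as above
X <- 1:10
Y <- c(19, 25, 14, 15, 24, 17, 19, 27, 29, 25)
fit1 <- lm(Y ~ X)

# Drop the names so only the numeric values remain
b <- unname(coef(fit1))

# Two decimal places for the intercept and the slope
label <- sprintf("y = %.2f + %.2fx", b[1], b[2])
label

plot(Y ~ X, xlab = "X", ylab = "Y", bty = "n", col = "red", pch = 19,
     ylim = c(0, 30), xlim = c(0, 10))
abline(fit1, lty = 2)
text(5, 12.5, label)
```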

6.2.1 Regression Diagnostics

It is possible to attempt to ﬁt any model to a set of data. However, just because R will

happily (in most cases) provide you an answer to a model ﬁtting, it does not mean that

it is the right model for the data. For example, your data may not be linear, however it

is still possible for you to ﬁt a line to non-linear data. R includes some easy methods

that you can use to examine the appropriateness of your model and here we will focus

on some of the built-in diagnostics. These focus on the single speciﬁed model and allow

you to make decisions on the appropriateness of your proposed model. Later in ?? we

will cover methods that allow you to determine if one model is better than another for

describing your data.


Figure 6.3: Regression model with ﬁtted line and formula.

One of the first things you should do when you specify a linear model is look at the residuals. The residuals are the $e_{ij}$ components of the model in the general formula.

These represent the variation that is not explained by your ﬁtted line. The things you

are looking for in the residuals are:

1. Systematic changes in the residuals when plotted as a function of the predicted val-

ues. This would indicate that there is something else that is changing the response

variable that you are not taking into consideration.

2. Non-linearity in the residuals when plotted against the predicted values. This would

suggest that perhaps your data are not linear to start with and the ﬁtting of a linear

model to it may not be appropriate.

3. Normality of the residuals. These values are expected to be $N(0, \sigma^2)$. If they are not, it may not be appropriate to be fitting this model to your data.

4. Outliers. Do you have any evidence that, once you fit your model to the data, there are particular entries that are obviously not part of the trend? There can be many reasons for outliers. First, the point may just be an outlier, a real observation that should be kept in the model. However, it is also possible that there was an equipment malfunction, you entered the data point incorrectly into the computer, etc. It is always good to check and see if you screwed up.

Figure 6.4: A 2x2 matrix plot of some diagnostic tools associated with a linear model. They include a plot of the residuals ($e_{ij}$) as a function of the fitted values ($\hat{y}_i$) to see if there are systematic biases in the model (upper left), a Q-Q plot to examine normality of the residuals (upper right), a scale location plot (lower left), and a leverage plot to look for outliers (lower right).

R provides a series of four plots for you to look at when you plot a variable returned by lm(). These plots are displayed in Figure ??. You can see these plots by using the command plot(fit1) (or whatever your model variable name is) and R will show you a series of plots examining the distribution of the residuals. For a more in-depth discussion of model verification you should probably consult a textbook on regression analysis.
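The following is a minimal sketch of how the four diagnostic plots can be put on a single page, together with a direct look at the residuals. The use of shapiro.test as a numerical check of normality is my own addition for illustration and is not something the chapter relies on.

```r
# Same data and model as in the single-variable regression
X <- 1:10
Y <- c(19, 25, 14, 15, 24, 17, 19, 27, 29, 25)
fit1 <- lm(Y ~ X)

# Arrange the four diagnostic plots in a 2x2 grid
par(mfrow = c(2, 2))
plot(fit1)
par(mfrow = c(1, 1))   # restore the single-plot layout

# The residuals and fitted values can also be examined directly
res <- residuals(fit1)
fitted.vals <- fitted(fit1)
plot(res ~ fitted.vals, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# One possible numerical check of normality (an addition, see above)
shapiro.test(res)
```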

6.3 Multiple Regression

There are several occasions where we may be interested in how well several predictor variables can explain the variation in a response variable. This is called multiple regression and has a linear model with the form:

$$y_i = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k + e$$

Here you have up to k different predictor variables, each of which contributes to the observed value in y.

The null hypothesis for a multiple regression is $H_O: \beta_i = 0; \forall i$ and states that all the beta regression terms are zero. To address this hypothesis, we build a linear model and then determine how much of the observed variation can be explained by the model.

In R we can use the same lm function as for a single predictor regression but this time we need to change the formula we pass to it to accommodate two predictor variables. For this example, we can use the data shown in Table 6.3.

  i       Y      X1     X2
  1    4.26    1.00   0.88
  2   30.74    2.00   0.41
  3   14.95    3.00   0.72
  4   -5.55    4.00   0.19
  5   21.29    5.00   0.40
  6   33.49    6.00   0.37
  7   32.15    7.00   0.61
  8   45.95    8.00   0.09
  9   38.94    9.00   0.74
 10   48.27   10.00   0.68

These values can be put into R as:

> Y <- c(4.26, 30.74, 14.95, -5.55, 21.29, 33.49, 32.15, 45.95, 38.94, 48.27)
> X1 <- 1:10
> X2 <- c(0.88, 0.41, 0.72, 0.19, 0.40, 0.37, 0.61, 0.09, 0.74, 0.68)
> cbind( Y, X1, X2 )
          Y X1   X2
 [1,]  4.26  1 0.88
 [2,] 30.74  2 0.41
 [3,] 14.95  3 0.72
 [4,] -5.55  4 0.19
 [5,] 21.29  5 0.40
 [6,] 33.49  6 0.37
 [7,] 32.15  7 0.61
 [8,] 45.95  8 0.09
 [9,] 38.94  9 0.74
[10,] 48.27 10 0.68

And then we can create a linear model using the notation lm( Y ˜ X1 + X2 ).

> fit2 <- lm( Y ~ X1 + X2 )
> summary( fit2 )

Call:
lm(formula = Y ~ X1 + X2)

Residuals:
     Min       1Q   Median       3Q      Max
-24.8394  -2.7430  -0.8989   4.1369  20.0461

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    1.170     12.801   0.091   0.9297
X1             4.460      1.422   3.137   0.0164 *
X2             1.473     16.763   0.088   0.9324
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.85 on 7 degrees of freedom
Multiple R-squared: 0.5857,     Adjusted R-squared: 0.4673
F-statistic: 4.948 on 2 and 7 DF,  p-value: 0.04578

> anova( fit2 )
Analysis of Variance Table

Response: Y
          Df  Sum Sq Mean Sq F value  Pr(>F)
X1         1 1631.66 1631.66  9.8875 0.01628 *
X2         1    1.27    1.27  0.0077 0.93244
Residuals  7 1155.16  165.02
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As we can see, the estimates are $\beta_0 = 1.17$, $\beta_1 = 4.46$, and $\beta_2 = 1.47$. Overall, it appears that the only term that is likely to not be zero is the $\beta_1$ term for variable X1. However, even with the $\beta_0$ and $\beta_2$ terms in the model for the intercept and the slope coefficient for the variable X2, the overall model is significant (see the F-statistic at the bottom of the summary output).
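Once the coefficients are estimated, the regression equation can be used to predict the response for new predictor values. Below is a rough sketch of that, assuming the variables are first placed in a data frame; the data frame name multReg.example and the new observation (X1 = 11, X2 = 0.5) are hypothetical and only serve the illustration.

```r
# The example data, stored in a data frame (name is made up)
multReg.example <- data.frame(
  Y  = c(4.26, 30.74, 14.95, -5.55, 21.29, 33.49, 32.15, 45.95, 38.94, 48.27),
  X1 = 1:10,
  X2 = c(0.88, 0.41, 0.72, 0.19, 0.40, 0.37, 0.61, 0.09, 0.74, 0.68)
)

fit2 <- lm(Y ~ X1 + X2, data = multReg.example)
b <- coef(fit2)
b

# A hypothetical new observation
new.obs <- data.frame(X1 = 11, X2 = 0.5)

# Prediction from the fitted model ...
pred <- predict(fit2, newdata = new.obs)

# ... and the same value from the regression equation written out by hand
by.hand <- b["(Intercept)"] + b["X1"] * 11 + b["X2"] * 0.5
c(pred, by.hand)
```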

Adding Interactions

Sometimes it is preferable to run models that show the interaction between variables as well as the influence of individual variables. This is appropriate when you have some reason to believe that the combination of predictor variables will influence the response in a non-additive manner. The linear model for this is:

$$y_{ij} = \mu + \beta_1 X_1 + \beta_2 X_2 + \beta_3 (X_1 X_2) + e_{ij}$$

where the $\beta_3$ coefficient determines the strength of the interaction. If $\beta_3 = 0$ then there is no interaction.

In R interaction terms are indicated by the colon operator. For example, the full model in our example data with the interaction would be specified as

> fit2 <- lm( Y ~ X1 + X2 + X1:X2 )
> summary( fit2 )

Call:
lm(formula = Y ~ X1 + X2 + X1:X2)

Residuals:
    Min      1Q  Median      3Q     Max
-22.882  -2.267  -1.007   4.168  22.401

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -8.500     26.951  -0.315    0.763
X1             6.204      4.459   1.391    0.213
X2            16.270     39.803   0.409    0.697
X1:X2         -2.732      6.569  -0.416    0.692

Residual standard error: 13.68 on 6 degrees of freedom
Multiple R-squared: 0.5973,     Adjusted R-squared: 0.3959
F-statistic: 2.966 on 3 and 6 DF,  p-value: 0.1192

> anova( fit2 )
Analysis of Variance Table

Response: Y
          Df  Sum Sq Mean Sq F value  Pr(>F)
X1         1 1631.66 1631.66  8.7194 0.02552 *
X2         1    1.27    1.27  0.0068 0.93692
X1:X2      1   32.37   32.37  0.1730 0.69192
Residuals  6 1122.78  187.13
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There is a shorthand method that indicates that you are interested in having all interactions between predictor variables and that is:

> fit2Alternate <- lm( Y ~ X1*X2 )
> summary( fit2Alternate )

Call:
lm(formula = Y ~ X1 * X2)

Residuals:
    Min      1Q  Median      3Q     Max
-22.882  -2.267  -1.007   4.168  22.401

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -8.500     26.951  -0.315    0.763
X1             6.204      4.459   1.391    0.213
X2            16.270     39.803   0.409    0.697
X1:X2         -2.732      6.569  -0.416    0.692

Residual standard error: 13.68 on 6 degrees of freedom
Multiple R-squared: 0.5973,     Adjusted R-squared: 0.3959
F-statistic: 2.966 on 3 and 6 DF,  p-value: 0.1192

You can see that this gives exactly the same result. You should be careful with this notation when you are working with several predictor variables because it will include all the interactions, including the three- and four-way (and higher) ones if you have that many variables. This may or may not be what you are interested in testing.
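To see exactly which terms a formula expands into before fitting anything, you can ask R for its term labels. This is a small sketch under the assumption of a third predictor named X3, which does not exist in the example data and is used here only to show the higher-order expansion; the (X1 + X2 + X3)^2 form is one way to stop at two-way interactions.

```r
# Terms implied by the shorthand with two predictors
two.way <- attr(terms(Y ~ X1 * X2), "term.labels")
two.way

# With a (hypothetical) third predictor the three-way term appears as well
three.way <- attr(terms(Y ~ X1 * X2 * X3), "term.labels")
three.way

# Restricting the expansion to main effects and pairwise interactions
pairs.only <- attr(terms(Y ~ (X1 + X2 + X3)^2), "term.labels")
pairs.only
```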

Models Without Intercept Terms

Sometimes it is of interest to test the fit of a model that does not have an intercept term. Perhaps you have already subtracted the mean of the response variable, $\hat{y} = y - \bar{y}$, and as such there is not predicted to be any intercept, or as in the case of our model in the previous section, perhaps the model does not support the addition of an intercept term. At any rate, it is possible to indicate to the lm function that you want to run the analysis without estimating the intercept. The linear model for this would be:

$$y_i = \beta_1 X$$

The formula that you pass to lm is Y ~ X - 1. The - 1 added to the formula is the part that tells R to leave out the intercept. Running the data again but only including the variable X1 and the response variable Y without the intercept term gives:

> fit3 <- lm( Y ~ X1 - 1 )
> summary( fit3 )

Call:
lm(formula = Y ~ X1 - 1)

Residuals:
     Min       1Q   Median       3Q      Max
-24.4756  -2.0177   0.1422   4.0652  21.2772

Coefficients:
   Estimate Std. Error t value Pr(>|t|)
X1   4.7314     0.5798    8.16 1.89e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.38 on 9 degrees of freedom
Multiple R-squared: 0.8809,     Adjusted R-squared: 0.8677
F-statistic: 66.59 on 1 and 9 DF,  p-value: 1.889e-05

> anova( fit3 )
Analysis of Variance Table

Response: Y
          Df Sum Sq Mean Sq F value    Pr(>F)
X1         1 8618.7  8618.7  66.587 1.889e-05 ***
Residuals  9 1164.9   129.4
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

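Overall, this model reports a much larger Multiple R-squared than the full model lm(Y ~ X1 + X2) or the interaction model lm(Y ~ X1*X2). Be careful when comparing these values, though: when the intercept is left out, R computes R-squared relative to zero rather than relative to the mean of Y, so the value is not directly comparable to that of a model with an intercept.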
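The way the reported R-squared changes when the intercept is dropped can be verified by hand. This is a sketch using the example data; the names rss3, rss4, r2.no.intercept, and r2.intercept are made up for the illustration, and fit4 here is the single-predictor model with an intercept.

```r
Y  <- c(4.26, 30.74, 14.95, -5.55, 21.29, 33.49, 32.15, 45.95, 38.94, 48.27)
X1 <- 1:10

fit3 <- lm(Y ~ X1 - 1)   # no intercept
fit4 <- lm(Y ~ X1)       # with intercept

# Residual sums of squares for both models
rss3 <- sum(residuals(fit3)^2)
rss4 <- sum(residuals(fit4)^2)

# Without an intercept the total variation is measured around zero
r2.no.intercept <- 1 - rss3 / sum(Y^2)

# With an intercept the total variation is measured around the mean of Y
r2.intercept <- 1 - rss4 / sum((Y - mean(Y))^2)

# These match the values reported by summary
c(summary(fit3)$r.squared, r2.no.intercept)
c(summary(fit4)$r.squared, r2.intercept)

# The residual sums of squares are on the same scale and can be compared
c(rss3, rss4)
```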

6.3.1 Comparing Models

So in the previous subsection we have developed four different models that we have proposed to explain our data. They are, in order of decreasing complexity:

• The full model with the interaction terms lm( Y ~ X1 + X2 + X1:X2).

• The full model without the interaction terms lm( Y ~ X1 + X2 ).

• The partial model with only X1, lm(Y ~ X1).

• The minimal model with only X1 and without an intercept term, lm(Y ~ X1 - 1).

There are several methods that you should use to determine which of these models you

would like to consider to be the most appropriate.

1. Look at the overall anova signiﬁcance. If the overall models are not signiﬁcant, then

there is no use in discussing them. In our examples, the full interaction model was

not signiﬁcant and should be disregarded.


2. Examine the relative signiﬁcance of each of the terms in the models as is shown

by the summary function. This can give some indication of which terms may be

important. Our various models suggested that the predictor variable X2 did not

help in explaining the variation in the response variable.

3. Look at the relative R-squared values. These indicate the proportion of variation

explained by the model and are given by the summary function.

4. Use a statistically based method to test the differences between two models such

as:

anova You can use the anova function and pass it two models that have been ﬁt to

the same data and it will perform an analysis to see if the additional term(s)

are signiﬁcant. Here is an example using the models having only the variable

X1 to see if the addition of the intercept term is signiﬁcant.

> fit4 <- lm( Y ~ X1 )
> anova( fit3, fit4 )
Analysis of Variance Table

Model 1: Y ~ X1 - 1
Model 2: Y ~ X1
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1      9 1164.91
2      8 1156.43  1      8.48 0.0587 0.8147

AIC There are other statistical methods that you can use to see if the additional

terms are signiﬁcant in your model. One of these is the stepwise method using

the AIC (Akaike Information Criterion). In R you can do this by passing the

largest model to the function step and it will perform the analysis for you. The

AIC statistics will decrease as the estimated predictive power of your model

increases. So you want to look for the smallest values of AIC. Here is an

example using the full model (including the interaction).

Start:  AIC=55.21
Y ~ X1 + X2 + X1:X2

        Df Sum of Sq     RSS   AIC
- X1:X2  1     32.37 1155.16 53.49
<none>               1122.78 55.21

Step:  AIC=53.49
Y ~ X1 + X2

       Df Sum of Sq     RSS   AIC
- X2    1      1.27 1156.43 51.51
<none>              1155.16 53.49
- X1    1   1624.25 2779.40 60.27

Step:  AIC=51.51
Y ~ X1

       Df Sum of Sq     RSS   AIC
<none>              1156.43 51.51
- X1    1   1631.66 2788.09 58.31

Call:
lm(formula = Y ~ X1)

Coefficients:
(Intercept)           X1
      1.989        4.447

As you can see, the AIC values decrease until the final model, which has only the X1 term (along with its intercept).

You should consider a wide range of these methods when attempting to put together a

good regression model.
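For reference, a call that would produce a stepwise search like the one described in the AIC item above can be sketched as follows. The names full and best are made up for the illustration, and trace = 0 is used only to suppress the step-by-step printout.

```r
Y  <- c(4.26, 30.74, 14.95, -5.55, 21.29, 33.49, 32.15, 45.95, 38.94, 48.27)
X1 <- 1:10
X2 <- c(0.88, 0.41, 0.72, 0.19, 0.40, 0.37, 0.61, 0.09, 0.74, 0.68)

# Start from the largest model, including the interaction
full <- lm(Y ~ X1 * X2)

# Stepwise search on AIC; set trace = 1 to see each step printed
best <- step(full, trace = 0)

# The model that the search settled on
formula(best)
coef(best)

# AIC on the scale used by step (second element) for both models
extractAIC(full)
extractAIC(best)
```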

6.4 Analysis of Variance

The analysis of variance is a common method for examining the equality of observations that can be partitioned into categorical treatments. In all reality, an ANOVA is simply a regression with categorical predictor variables (i.e., the values of x are not continuous).

6.4.1 1-Way ANOVA

The simplest ANOVA model is one in which a single treatment has been applied and you

have collected a single set of observations. The linear model can be presented as:

$$y_{ij} = \mu + \tau_i + e_{ij}$$

where the $\tau_i$ is the treatment effect. You can think of this as the deviation from the overall mean that can be attributed to an observation being in a particular treatment. The $e_{ij}$ term is again the error term.

In 5.3.1, we used the Pinus echinata germination data to illustrate how to perform a Kruskal-Wallis test. At that time, I had suggested that the Kruskal-Wallis test was a rank-based version of an analysis of variance (ANOVA). Here we will use the same data again to demonstrate the parametric equivalent of the Kruskal-Wallis test: the one-way ANOVA.

As a reminder, the data consist of family germination rates for Pinus echinata (perhaps one of the homeliest looking conifers in existence) separated by timber treatment. In the Ozark mountains of Missouri, control, selectively cut, and clear cut treatments were applied to previously continuous forest stands. No P. echinata individuals were removed so in essence the treatments were modifications of other species around the resident pines. A summary of germination data is presented in Figure ?? showing the average germination rate lowest in the control stands and highest in the stands where heterospecifics were selectively removed from around the target species.

The null hypothesis for this model is: $H_O$: No Treatment Effects (which is like saying $\tau_{Control} = \tau_{Selective} = \tau_{ClearCut}$).

Figure 6.5: Boxplot of germination percentages for Pinus echinata as a function of treatment. A colored rug was added to the right side to show the actual values within treatments (see rug).

> pineData <- read.table( "PineGerminationData.txt", header=T )
> anova1 <- aov( GERM ~ TRT, data=pineData )
> anova1
Call:
   aov(formula = GERM ~ TRT, data = pineData)

Terms:
                      TRT Residuals
Sum of Squares  0.8717943 2.6520868
Deg. of Freedom         2        50

Residual standard error: 0.2303079
Estimated effects may be unbalanced
> anova( anova1 )
Analysis of Variance Table

Response: GERM
          Df  Sum Sq Mean Sq F value    Pr(>F)
TRT        2 0.87179 0.43590   8.218 0.0008207 ***
Residuals 50 2.65209 0.05304
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From these results, we can see that there is a treatment effect, and it appears to be highly significant. But in looking at the plot in Figure 6.5, are these results supposed to lead us to believe that all the treatments are significantly different or just some subset of them?

One way to get at this is to look at the 95% confidence intervals for the differences between treatment means and see if they overlap zero. One way to do this is to use the Tukey Honest Significant Differences (or TukeyHSD) function. This function takes the aov analysis as an argument and prints out the confidence intervals for the differences in the means of the treatments.

Figure 6.6: Confidence intervals for difference in mean germination rates for Pinus echinata families.

> postHoc <- TukeyHSD( anova1 )
> postHoc
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = GERM ~ TRT, data = pineData)

$TRT
                diff         lwr         upr     p adj
CTRL-CLR -0.27927536 -0.46389755 -0.09465318 0.0017640
SEL-CLR  -0.04566667 -0.24879523  0.15746190 0.8504882
SEL-CTRL  0.23360870  0.04898651  0.41823088 0.0098768

> plot( postHoc )

The postHoc analysis can also be plotted by calling plot( postHoc ), showing the confidence intervals for the differences in treatment levels (those that overlap zero are not significantly different) as presented in Figure 6.6. These results suggest that the significance in the ANOVA model is due to the differences between the control and the other two treatments and that both of the cutting treatments had essentially the same germination rate (just larger than families in the control stands).
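Since the pine data file may not be at hand, here is a self-contained sketch of the same workflow on simulated data. Everything in it is made up for illustration: the data frame fake.pine, the treatment means in mu, the sample size of 15 per treatment, and the random seed. It also shows that fitting the same model with lm gives the same F test as aov, in line with the statement that an ANOVA is a regression with a categorical predictor.

```r
set.seed(42)   # arbitrary seed so the simulated values are repeatable

# Three treatments with 15 simulated families each (all values hypothetical)
trt <- factor(rep(c("CTRL", "SEL", "CLR"), each = 15))
mu <- c(CTRL = 0.30, SEL = 0.55, CLR = 0.60)
germ <- unname(mu[as.character(trt)]) + rnorm(length(trt), sd = 0.2)
fake.pine <- data.frame(GERM = germ, TRT = trt)

# One-way ANOVA and the equivalent linear model
aov.fit <- aov(GERM ~ TRT, data = fake.pine)
lm.fit <- lm(GERM ~ TRT, data = fake.pine)

# The P-value for the treatment effect is the same from both
p.aov <- summary(aov.fit)[[1]][["Pr(>F)"]][1]
p.lm <- anova(lm.fit)[["Pr(>F)"]][1]
c(p.aov, p.lm)

# Post-hoc comparison of all pairs of treatments
postHoc.sim <- TukeyHSD(aov.fit)
postHoc.sim
```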


6.5 Useful Functions

The following functions were introduced in this chapter and you will be required to use

them for the exercises. To get more information on any of these functions, use the R

help system.

• abline(x) Draws a line on the currently active graphics device. You can either specify

the intercept and slope or pass this a ﬁtted linear model.

• anova(x) Creates the Analysis of Variance Tables for the models passed in x.

• aov(x) Performs the analysis of variance on the formula in x.

• cbind(x,y,...) Puts the variables x, y, ... into a single column-bound variable.

• lm(func) Fits the model func using linear least-squares.

• t.test(x) This function performs the t-test for either a single data set and a predicted

mean or a paired t-test using two data sets.

• round(x) Rounds the value of x to the nearest integer.

• pch Optional parameter for the plot function that will designate the type of symbol

plotted using the plot command.

• runif(n,mn,mx) Returns n random numbers drawn uniformly from the range [mn, mx].

• step(x) Evaluates the terms in the model x for inclusion in the model using the AIC

criteria.

• summary(x) Provides description of x.

• text(x,y,c) Plots the text in c on the graph at the coordinates (x, y).

• TukeyHSD(x) Performs Tukey’s Honest Signiﬁcant Difference post-hoc test on the aov

model in x.


6.6 Exercises

The following exercises are meant to help you understand the items presented in this

Chapter.

1. Load the data set Temperature.csv from the chapter folder. These data represent the measured brood chamber temperature for a wood-boring beetle. Test the hypothesis $H_O$: Mean temperature is 61°.

2. Load the data set ClutchSizes.csv from the file. Using a paired t-test, test the hypothesis $H_O$: There is no difference in reproductive output between habitat types.

3. Load the data file SingleRegresssion.RData from the file into R (use the load function). Fit the regression model, Y ~ X. Is it significant? Show the regression equation and the anova table.

4. Plot the regression model from the previous example and indicate the fitted regression line with a dotted red line in the plot.

5. From the single regression model, add the regression equation to the graph indicating the β coefficients that were estimated.

6. Does a plot of the residuals as a function of the predicted values from the estimated regression model suggest that the model is appropriate?

7. Load the data set MultipleRegression.RData from the file. It will contain a data frame named multReg. Use the variables in this data frame, Y, X1, X2, X3, to fit a multiple regression model. Show the summary and the anova table in your results. What is the predicted regression equation?

8. Fit another model to the multReg data that has all the interaction terms amongst the X predictor variables. Use the anova procedure to see which of these models is more appropriate.

9. Load the data file VarroaCounts.RData. It will be a data frame named BeeData. These data represent counts of the parasite Varroa destructor, a common pest of domesticated honey bees. Test the hypothesis using an analysis of variance that there is no difference in mite counts between the different lines of bees.

10. Perform the TukeyHSD test on the parasite data from the previous question.

Chapter 7

Working With Images

In this chapter, you will focus on the following topics:

• Gain a basic understanding of open image formats

• Learn how to import image data into R

• Manipulate image data at the pixel level.

7.1 Image Data

There are several different methods that are available to you to import image data into R. As I was writing this document over Winter break and updating it in the fall, the main image processing library for R, rimage, was broken and caused a few problems when installed. I am sure it will be fixed in the near future and recommend that you look at that library when you next have the need to do some image manipulation because it has a lot of functionality. However, at present, it is not going to be used. The consequence of not having rimage is that importing jpeg, tiff, and bmp image formats appears to be beyond our grasp. Lucky for us, there are a ton of other image formats out there and we can easily convert the image shown in Figure 11.1 into another format and use it just as easily. Perhaps when I update this manuscript the next time around, I'll change this section. I think it is also important that you understand the internal workings of images and for right now, these simpler image formats will serve our purposes nicely and everything you learn here will be easily transferable to those other image formats when you need to deal with them in the future.

7.1.1 PNM Image Format

Images on computers have specific formats in which the color information and other metadata are stored in the file. Some of the formats are relatively easy to use and can be manipulated directly in a text editor. Others are more of a pain and some are "owned" by some company that has patented the way the information is stored in the file and you have to pay royalties to them to view it. For example, the ubiquitous GIF image format uses an algorithm that was patented and owned by a company and if you were to write a viewer for it in some countries you would have to pay a royalty to use it... Lame.

The PNM image format (short for portable anymap) is an open format for the exchange

of image information. Actually, there are three different formats that fall under the PNM

speciﬁcation as detailed below.

Portable Bitmap Format (PBM)

This format stores bitmap images. A bitmap can be thought of as an image whose pixels are either turned on or off (say black and white). The representation of a PBM file can be given as a simple text file with the extension .pbm. An example text file for a bitmap file that encodes the uppercase letter R would be:

P1

# This is an example bit map file r.pbm

5 8

1 1 1 1 0

1 0 0 0 1

1 0 0 0 1

1 0 0 0 1

1 1 1 1 0

1 0 0 1 0

1 0 0 0 1

1 0 0 0 1

In this file, the first line is a special code (the "magic number") that identifies the format; P1 marks a
plain-text bitmap, which needs only one bit per pixel. The second line is a comment line that you can put
anything you like into (but it has to start with the # character). The third line tells how many columns and
rows of data the image has. Note that the width comes first here: the first number
is the number of columns and the second number is the number of rows, which is the
opposite of the rows-then-columns order we use in R when interacting with matrices of data. The rest
of the file consists of the actual bit matrix where 1 represents a pixel that is turned on
and 0 represents a pixel that is turned off. The image represented in this file is given in
Figure 7.1.
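To make the layout concrete, here is a minimal sketch in base R that turns the plain-text bitmap listed above into a matrix. It assumes the file contents are held in a character vector instead of being read from disk, and the names pbmText, tokens, and bitmap are made up for this illustration.

```r
# Contents of r.pbm as listed in the text, one element per line of the file
pbmText <- c("P1",
             "# This is an example bit map file r.pbm",
             "5 8",
             "1 1 1 1 0",
             "1 0 0 0 1",
             "1 0 0 0 1",
             "1 0 0 0 1",
             "1 1 1 1 0",
             "1 0 0 1 0",
             "1 0 0 0 1",
             "1 0 0 0 1")

# Skip the comment line and split everything else into whitespace-separated tokens
tokens <- scan(text = pbmText, what = "", comment.char = "#", quiet = TRUE)

magic <- tokens[1]                  # "P1" identifies a plain-text bitmap
nCols <- as.integer(tokens[2])      # width comes first in the file ...
nRows <- as.integer(tokens[3])      # ... and height second

# The pixels are stored row by row, so fill the matrix with byrow = TRUE
bitmap <- matrix(as.integer(tokens[-(1:3)]),
                 nrow = nRows, ncol = nCols, byrow = TRUE)
dim(bitmap)
```

Note that the matrix ends up with 8 rows and 5 columns even though the file lists the 5 before the 8.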

Figure 7.1: The image represented in the r.pbm file. This image has been scaled up to make it
large enough to see it on the page using the program GIMP (www.gimp.org).

You can make this image programmatically, by creating the matrix in R and using the
image function. Here is an example creating the image of the letter T.

> x <- matrix(0, nrow=8, ncol=5)
> x
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    0    0    0    0
[2,]    0    0    0    0    0
[3,]    0    0    0    0    0
[4,]    0    0    0    0    0
[5,]    0    0    0    0    0
[6,]    0    0    0    0    0
[7,]    0    0    0    0    0
[8,]    0    0    0    0    0
> x[1, ] <- 1
> x[, 3] <- 1
> x
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    1    1    1    1
[2,]    0    0    1    0    0
[3,]    0    0    1    0    0
[4,]    0    0    1    0    0
[5,]    0    0    1    0    0
[6,]    0    0    1    0    0
[7,]    0    0    1    0    0
[8,]    0    0    1    0    0
> colors <- c("black", "grey")
> image(x, col=colors, axes=F)

Here I created a matrix that had all 0 in it and set the top row and the middle column
equal to 1. Then the image function was used to plot it. The image function takes a number
of optional arguments and here I have supplied it the colors and the option to not show
the axes. Since I have two values in the matrix, a two-element vector will be sufficient to
handle all the different colors. Figure 7.2 shows this matrix. There
seems to be a small problem with it in that it is rotated 90° counter-clockwise. This is
because the origin of the plot that is created by the image function is in the lower left-hand
corner. Conversely, most images that are stored on the computer (like the desktop image
in the background) assume that the origin is at the upper left-hand corner of the image.
Obviously these two do not mesh well together.
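One way to reconcile the two conventions is to rearrange the matrix before handing it to image(). The sketch below is only one possible approach, and the name upright is invented for the example: it reverses the row order and then transposes, so that the plot matches the printed layout of the matrix.

```r
# The letter T from the text: top row and middle column set to 1
x <- matrix(0, nrow = 8, ncol = 5)
x[1, ] <- 1
x[, 3] <- 1

# image() places row i of its argument at the i-th x position and column j at
# the j-th y position, counting up from the lower left. Reversing the rows and
# then transposing therefore puts printed row 1 at the top of the plot.
upright <- t(x[nrow(x):1, ])
dim(upright)    # five x positions by eight y positions

# image(upright, col = c("black", "grey"), axes = FALSE)   # uncomment to draw
```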

Figure 7.2: A PBM file that was programmatically created in R. The image is rotated because of the
default location of the origin.

Portable Graymap Format (PGM)

This format is for graymap images, where the term graymap refers to the lack of color
in the image. In terms of complexity, there is slightly more information contained in the
data file, as each pixel is not simply ON or OFF; rather, there is a percentage of ONNESS... (is
that a word?).

P2
# The PGM file for dog.pgm
24 7
5
0 1 1 1 1 0 0 0 0 0 0 5 5 5 0 0 0 0 0 4 4 4 0 0
0 1 0 0 0 1 0 0 0 0 5 0 0 0 5 0 0 0 4 0 0 0 4 0
0 1 0 0 0 0 1 0 0 5 0 0 0 0 0 5 0 0 4 0 0 0 0 0
0 1 0 0 0 0 1 0 0 5 0 0 0 0 0 5 0 0 4 0 0 0 0 0
0 1 0 0 0 0 1 0 0 5 0 0 0 0 0 5 0 0 4 0 0 4 4 0
0 1 0 0 0 1 0 0 0 0 5 0 0 0 5 0 0 0 4 0 0 0 4 0
0 1 1 1 1 0 0 0 0 0 0 5 5 5 0 0 0 0 0 4 4 4 0 0

The first three lines of the file have the same meaning as in the PBM format. The fourth line in the
file gives the maximum value, representing the most white in the image. In this case,
a black pixel will be represented by the number 0, a white pixel will be represented
by 5, and values in between are 1/5 increments of whiteness. The remaining portion
of the file has the actual image represented in a pixel-by-pixel matrix of values. You
can see that the majority of the image is 0/5 (black) and the letters are varying shades of
gray (Figure 7.3).

Figure 7.3: The image represented by the dog.pgm file. This image has been scaled up to make it
large enough to see it on the page using the program GIMP (www.gimp.org).

The number of shades of gray you use in a PGM file is up to you as long as the maximum value does not
exceed 65535 under the current Netpbm specification (older versions of the format stopped at 255). These are
easy files to create and you could imagine how you could
create a matrix of integers from some analysis and save it as a pgm file and view it
directly.
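As a rough sketch of that idea, the following base R code writes a small integer matrix out as a plain-text PGM file and reads it back. The matrix greyLevels, the temporary file, and the comment text are all invented for the example.

```r
# A small matrix of integer grey levels from some hypothetical analysis
greyLevels <- matrix(c(0, 1, 2,
                       3, 4, 5), nrow = 2, byrow = TRUE)

# Header (magic number, comment, width and height, maximum value), followed by
# one line of space-separated values per row of the matrix
pgmLines <- c("P2",
              "# written from R",
              paste(ncol(greyLevels), nrow(greyLevels)),
              max(greyLevels),
              apply(greyLevels, 1, paste, collapse = " "))

pgmFile <- tempfile(fileext = ".pgm")   # temporary file, used only for the demo
writeLines(pgmLines, pgmFile)

# Read the file back to confirm that the round trip preserves the values
tokens <- scan(pgmFile, what = "", comment.char = "#", quiet = TRUE)
roundTrip <- matrix(as.numeric(tokens[-(1:4)]),
                    nrow = nrow(greyLevels), byrow = TRUE)
```

The resulting file can be opened in any viewer that understands PGM, such as GIMP.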

Portable Pixmap Format (PPM)

The last file format, PPM, is one that handles pixmaps, which means that you have colored
pixels in the image. The file format is identical to that of the PGM with the exception that
the code on the first line is P3, which represents 24 bits per pixel: 8 for red,
8 for green, and 8 for blue. The beginning of the PPM file shown in Figure 7.4 is:

P3

# This image contains an image of my daughter Libbie (from Libbie.ppm).

180 240

255

188

219

253

189

220

252

In this file, the values are placed one per line instead of next to each other. Starting
at line number 5 with a value of 188, each pixel is described by three consecutive integers
between 0 and 255 (the maximum value for any color, as given on line
4): one for red, one for green, and one for blue. In the listing above, the first pixel is (188, 219, 253) and the
second is (189, 220, 252). The 180 x 240 = 43,200 pixels therefore take up 3 x 43,200 = 129,600 lines. When
we begin looking at manipulating images
you will find that, once the file is loaded into R, you can interact with each color channel independently.
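In the plain PPM format the samples for each pixel appear as consecutive red, green, blue triples. The short sketch below, which uses only the six sample values printed in the listing above and invented variable names, shows how such a stream can be arranged into one row per pixel and scaled to the [0, 1] range.

```r
# The first six sample values listed in the text (after the four header lines)
samples <- c(188, 219, 253, 189, 220, 252)
maxVal  <- 255    # from line 4 of the file

# One red, green, blue triple per pixel: each row of this matrix is a pixel
# and each column is a channel
pixels <- matrix(samples, ncol = 3, byrow = TRUE,
                 dimnames = list(NULL, c("red", "green", "blue")))

# Scale the raw samples to the [0, 1] range
scaled <- pixels / maxVal
scaled[1, "red"]

# Number of sample lines expected for a 180 x 240 image
nSamples <- 180 * 240 * 3
nSamples
```

The scaled red value of the first pixel, 188/255, is the number that shows up later in the chapter as the top-left element of the red channel.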

One drawback to these image formats is that they are not very efficient. For example,
the image of my daughter in Figure 7.4 has 129,604 lines of information in it, which on
my computer makes it 465K in size. The exact same image saved as a jpeg file is only
25K in size. The compression used in jpeg, tiff, gif, png, and other compressed file
formats is why they are used on the internet. But for our purposes, the lack of compression
and the inefficiency in storage size are relatively irrelevant.

Figure 7.4: The image represented in the Libbie.ppm ﬁle. This image has been scaled up to make

it large enough to see it on the page using the program GIMP (www.gimp.org).

7.2 Loading The Image Into R

OK, now that we have covered the basics of how one kind of image is represented in a data file, it is
time to load one into R and see what we have to work with. To load a PNM file, you must
first load the pixmap library; then you can use the function read.pnm() to load the file into
a local variable and plot it using the plot() function.

> library(pixmap)
> photo <- read.pnm(file="Libbie.ppm")
Read 129600 items
> plot(photo)

The plot() function will open a new image window and show the loaded image.

7.3 Components of A Pixmap

We can learn a little bit more about what kind of data type the variable we call photo is

by using the class() function.

> class(photo)
[1] "pixmapRGB"
attr(,"package")
[1] "pixmap"
> names(attributes(photo))
[1] "size"     "cellres"  "bbox"     "bbcent"   "channels" "red"      "green"
[8] "blue"     "class"

This variable is an instance of the pixmapRGB class that comes from the pixmap package. A class is a
self-contained data structure that has both attributes and data. The command names(attributes(photo))
tells us the names of the attributes that the variable has.

There are some issues that we should touch on when dealing with classes like this one. They differ
from what we have been using thus far, such as data frames, in that we cannot access
the contents of the object using the $ notation. This is because pixmapRGB is a formal (S4) class whose
components are stored in slots, whereas lists and data
frames keep their components as named elements. To access the slots of such an object we use the
@ notation. For example:

> photo@size
[1] 240 180
> photo@channels
[1] "red"   "green" "blue"
> dim(photo@red)
[1] 240 180
> photo@red[1, 1]
[1] 0.7372549
> range(photo@red)
[1] 0 1

Here we can get to the size, channels, and red components of the class directly. We can also

see that the red channel that determines the amount of redness in each pixel has been

standardized on the range [0, 1]. This is important to know if we are going to manipulate

the image directly.

7.4 Image Operations

7.4.1 Extracting Channels

So now we know enough to make some alterations to the image and see what happens. In
the next example, I first copy the photo to make three additional photos, named redPhoto,
bluePhoto, and greenPhoto. Then for each of the new variables I remove all the data in the
other two channels by making each of those channels contain a matrix of zeros the same
size as the original matrix.

> redPhoto <- photo
> bluePhoto <- photo
> greenPhoto <- photo
> redPhoto@size
[1] 240 180
> redPhoto@blue <- redPhoto@green <- matrix(0, nrow=240, ncol=180)
> bluePhoto@red <- bluePhoto@green <- matrix(0, nrow=240, ncol=180)
> greenPhoto@red <- greenPhoto@blue <- matrix(0, nrow=240, ncol=180)
> par(mfrow=c(1, 4))
> plot(photo)
> plot(redPhoto)
> plot(greenPhoto)
> plot(bluePhoto)

Note that I used the sequential assignment A <- B <- C <- D as a shorthand here. This will
assign the value of D to the variable C, then C to B, and then B to A. This is a lazy trick but one
that you will probably use as it saves a bit of time and typing.

Then I make a 1x4 matrix of plots so that I can plot all four images in the same frame (see
?? for more on how this is done) and in each of the four slots I plot one of the images,
yielding a figure similar to what is presented in Figure 7.5.

Figure 7.5: The original image along with versions in which only the red, green, or blue channel is turned
on.

In some cases, it is helpful if you can extract the color information and generalize the

image as a greyscale image (as you will in Chapter 11). Here we use the information

from each channel, weighted equally, in the creation of the image.

> gphoto <- pixmapGrey(photo@red + photo@blue + photo@green)
> plot(gphoto)
> names(attributes(gphoto))
[1] "size"     "cellres"  "bbox"     "bbcent"   "channels" "grey"     "class"
> gphoto@grey[1, 1]
[1] 0.8627451
> range(gphoto@grey)
[1] 0 1

The function pixmapGrey() takes a matrix of data, for which we simply use the element-wise
addition of the channels in the color photo. You can also see that in the creation of the
new grey image, the values were again standardized to the range [0, 1].
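The printed value gphoto@grey[1, 1] = 0.8627451 is consistent with a simple average of the three channels for that pixel. The sketch below checks this arithmetic by hand with plain numbers and matrices; it does not use the pixmap package, and the small matrices r, g, and b are invented stand-ins for the channels.

```r
# Channel values of the top-left pixel, taken from the first three samples in
# the PPM listing and scaled to [0, 1]
red   <- 188 / 255
green <- 219 / 255
blue  <- 253 / 255

# Equal weighting: the sum of three channels can reach 3, so dividing by 3
# brings the result back into [0, 1]
grey <- (red + green + blue) / 3
grey

# The same idea works element-wise on whole matrices
r <- matrix(c(0, 1, 0.5, 0.25), nrow = 2)
g <- matrix(c(0, 1, 0.5, 0.75), nrow = 2)
b <- matrix(c(0, 1, 0.5, 0.50), nrow = 2)
greyMatrix <- (r + g + b) / 3
range(greyMatrix)
```

Other weightings are possible (for example, giving green more weight than blue), but equal weights are what the text uses.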

For the moment, let's examine the contents of this grey image and play around with it a
bit. Let's make it a bit darker by shifting all the grey values down (to make it more black).
We can do this by performing operations on the matrix of grey values in the class. For
simplicity, I will make a copy of the image first and then perform operations on the copy
rather than the original one. Then we will look at the distribution of grey values that
make up the image.

> darkerGphoto <- gphoto
> darkerGphoto@grey <- darkerGphoto@grey / 2
> par(mfrow=c(1, 3))
> plot(gphoto)
> hist(gphoto@grey, xlim=c(0, 1), xlab="Grey", main="")
> plot(darkerGphoto)

We can see that the vast majority of values are towards the light end of the distribution.

To darken this up, we should scale these values to be closer to zero by dividing them by

2 and then replotting the image to see the result (see results in Figure 7.6).

7.5 Creating Images Programmatically

Images can be made programmatically once you understand how images are represented.
There are some helper functions that can help you in creating new images. For the
purposes of this section, we will focus on greyscale images and leave the analysis of
colored images for you to play with on your own time.

Let's start by making an image where each pixel is randomly assigned a greyscale value.
For convenience, I'll make it the same size as the photo named gphoto from 7.4.1.

> randomImageMatrix <- matrix(rnorm(240*180), nrow=240, ncol=180)
> gray <- grey(1:100/100)
> image(randomImageMatrix, col=gray)

Here I use the rnorm() function to create 240 * 180 = 43,200 random numbers in a matrix
that has 240 rows and 180 columns. I then use the grey() function to create 100 different
shades of grey ranging from nearly black to white at equal intervals. When the image is made,
the range of random numbers is used to divide the pixels into the 100 different grey
colors (i.e., the image() function scales the values in randomImageMatrix into length(gray) distinct
groups for plotting). The result is shown in Figure 7.7.
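The grouping step can be imitated by hand. The sketch below is an approximation of what image() does by default (equal-width intervals across the range of the data), not a copy of its internals, and it uses a much smaller matrix and invented names to keep the output manageable.

```r
set.seed(42)                               # only so the example is repeatable
values <- matrix(rnorm(6 * 4), nrow = 6, ncol = 4)
palette100 <- grey(1:100 / 100)            # 100 shades, dark to light

# Cut the observed range into as many equal-width intervals as there are colors
breaks <- seq(min(values), max(values), length.out = length(palette100) + 1)
index  <- cut(values, breaks = breaks, include.lowest = TRUE, labels = FALSE)

# Each value now maps to one of the 100 shades
shade <- matrix(palette100[index], nrow = nrow(values))
range(index)
```

The smallest value lands in the first (darkest) group and the largest in the last (lightest) one, which is why the maximum of the matrix is drawn as white.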

This image can be manipulated by changing the values in the matrix randomImageMatrix. In
the next example, I replace a block of roughly 40x40 pixels in the center with white (which is the
largest value from randomImageMatrix).

> randomImageMatrix[100:140, 70:110] <- max(randomImageMatrix)
> image(randomImageMatrix, col=gray)

The result is shown in Figure 7.8, resembling a square doughnut (mmmm doughnuts...).

Figure 7.6: The greyscale translation of the PPM image, a histogram of the grey values, and the
image resulting from reducing all the grey values in the image by half.

Figure 7.7: A random image.

Figure 7.8: A random image with a square doughnut hole in the middle.

7.6 Useful Functions

The following functions were introduced in this chapter and you will be required to use

them for the exercises. To get more information on any of these functions, use the R

help system.

• cat() This function dumps the passed arguments out to the terminal.

• grey(x) This function returns the grey color associated with the value of x. It is
assumed that 0 ≤ x ≤ 1.

• image(x) Can be used to create an image as either grey or colors for the values in the

matrix x.

• max(x) Returns the maximum value contained in x.

• rnorm(x) Returns x random numbers from a N(µ, σ) distribution (by default µ = 0 and σ = 1).

7.7 Exercises

The following exercises are meant to help you understand the items presented in this

Chapter.

1. Create a Portable Bitmap Format ﬁle (*.pbm) exactly like the one that is shown for the

letter R but make it represent the letter L.

2. Why is Figure 7.2 not right-side-up?

3. Make your L image correct by changing the values of the underlying matrix such that

when it is plotted using the image command it is in the correct orientation.

4. What is the purpose of the PX number on the ﬁrst line of the PNM ﬁle formats?

5. Load your own copy of the image Libbie.ppm into R using the read.pnm function as

demonstrated in the Chapter. Create three copies of the image and for each copy

remove the values in one channel (e.g., make one of the color matrices a zero). Plot

these images in a three-paned graphic using the par(mfrow=c(1,3)) option.

6. Replot the randomImageMatrix using a color palette instead of the grey palette shown.

(Hint: See ?rainbow for ﬁve of the stock palettes available to you.)

7. What is the default palette used in the image plot function?

8. What is the purpose of the optional argument bbox in the pixmapGrey function?

9. Create the greyscale version of the image shown in the leftmost box in Figure 7.6.
The grey channel is composed of greyscale values that must lie in [0, 1]. Can
you invert the colors in this image? (Hint: If you can't figure out how to do this, see
footnote 1, but only as a last resort.)

10. Why do you have to use the @ notation to access components of the pixmaps in this
chapter?

Footnote 1: Are you sure you want a hint? Take 1 minus the grey channel to flip the values within the [0, 1]
interval.

Chapter 8

Matrix Analysis

Matrices are used in a wide variety of biological studies. In this Chapter I will use the
example of stage-classified matrix models to introduce you to how matrix manipulation
operates in R. There are some issues with respect to basic
operations on matrices that you may not fully appreciate if you haven't had a course on Matrix Algebra,
so those are covered first.

In this chapter, you will focus on the following topics:

• Understand matrix operations in R .

• Create stage-classiﬁed matrix models.

8.1 Matrices In R

As shown in 2.4.9, a matrix is a fully recognized data type in R. In fact, R does a
wonderful job of working with matrices and is much faster at doing vector and matrix
operations directly than looping through matrices of values using a for()-loop (see 11.1
for a complete discussion of looping in R).

In specific terms for this Chapter, a matrix can be defined as a 2-dimensional object that
holds numeric values. Matrices can be created by hand using the matrix() function and
the elements within them can be accessed using the square bracket notation (e.g., X[i,j])
as:

> X <- matrix(0, nrow=4, ncol=4)

> X[1, 2] <- 23

> X[1, 4] <- 42

> X

[ , 1] [ , 2] [ , 3] [ , 4]

[ 1 , ] 0 23 0 42

[ 2 , ] 0 0 0 0

[ 3 , ] 0 0 0 0

[ 4 , ] 0 0 0 0

You can also wrap the as.matrix() function around the read.table() function and read the
data from a matrix in a file into a variable directly. For a review of these two functions
see 2.4.9 and 3.1.2. In the online data sets for this chapter, there is a file called
ExampleMatrix.csv that was exported from a spreadsheet as tab-delimited text. It can be loaded with:

> A <- as.matrix(read.table("ExampleMatrix.csv", header=F, sep="\t"))

> A

V1 V2 V3 V4 V5 V6 V7 V8 V9

[ 1 , ] 0.00000 2.00000 2.00000 5.00000 4.00000 2.00000 7.00000 2.603310 2.000000

[ 2 , ] 2.00000 0.00000 4.00000 6.00000 3.00000 4.00000 7.00000 3.603310 4.000000

[ 3 , ] 2.00000 4.00000 0.00000 6.00000 4.00000 3.00000 7.00000 1.603310 1.000000

[ 4 , ] 5.00000 6.00000 6.00000 0.00000 3.00000 1.00000 1.00000 3.694210 6.000000

[ 5 , ] 4.00000 3.00000 4.00000 3.00000 0.00000 3.00000 4.00000 1.966940 4.000000

[ 6 , ] 2.00000 4.00000 3.00000 1.00000 3.00000 0.00000 2.00000 2.148760 3.000000

[ 7 , ] 7.00000 7.00000 7.00000 1.00000 4.00000 2.00000 0.00000 4.694210 7.000000

[ 8 , ] 2.60331 3.60331 1.60331 3.69421 1.96694 2.14876 4.69421 0.000000 0.603306

[ 9 , ] 2.00000 4.00000 1.00000 6.00000 4.00000 3.00000 7.00000 0.603306 0.000000

[ 10 , ] 4.00000 5.00000 4.00000 4.00000 4.00000 2.00000 3.00000 3.421490 4.000000

[ 11 , ] 3.00000 5.00000 3.00000 5.00000 6.00000 2.00000 4.00000 3.603310 3.000000

[ 12 , ] 3.00000 4.00000 3.00000 5.00000 3.00000 3.00000 6.00000 1.421490 2.000000

V10 V11 V12

[ 1 , ] 4.00000 3.00000 3.00000

[ 2 , ] 5.00000 5.00000 4.00000

[ 3 , ] 4.00000 3.00000 3.00000

[ 4 , ] 4.00000 5.00000 5.00000

[ 5 , ] 4.00000 6.00000 3.00000

[ 6 , ] 2.00000 2.00000 3.00000

[ 7 , ] 3.00000 4.00000 6.00000

[ 8 , ] 3.42149 3.60331 1.42149

[ 9 , ] 4.00000 3.00000 2.00000

[ 10 , ] 0.00000 1.00000 3.00000

[ 11 , ] 1.00000 0.00000 4.00000

[ 12 , ] 3.00000 4.00000 0.00000

There are a few things to notice here:

1. R wraps values for matrices so that only a portion of each row can be viewed at a

time.

2. The columns of data that were read from the file did not have a header row so R
assigned them the names V1 - V12. This is the default behavior.

3. If there is one value in a column of the matrix that has a decimal portion to it, all the values in that
column will be displayed with the same number of decimal places (e.g., compare the matrices X
and A from the two listings).
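Since ExampleMatrix.csv is not reproduced here, the following sketch writes a tiny tab-delimited stand-in to a temporary file and reads it back the same way. The file contents and the name tmp are invented for the illustration.

```r
# Write a small tab-delimited file with no header row (a stand-in for
# ExampleMatrix.csv)
tmp <- tempfile(fileext = ".csv")
writeLines(c("0\t2\t2.5", "2\t0\t4", "2.5\t4\t0"), tmp)

# read.table() returns a data frame; as.matrix() converts it to a matrix
A <- as.matrix(read.table(tmp, header = FALSE, sep = "\t"))

colnames(A)    # default names assigned because there was no header row
is.matrix(A)
A
```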

8.1.1 Matrix Arithmetic

Matrices have their own special kind of arithmetic that you may not be aware of, so here
is a very short course. For the following examples, I will be using the matrices X (see footnote 1), Y,
and Z as defined by the R commands:

> X <- matrix(1:9, nrow=3, byrow=TRUE)
> X
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> Y <- matrix(9:1, nrow=3)
> Y
     [,1] [,2] [,3]
[1,]    9    6    3
[2,]    8    5    2
[3,]    7    4    1
> Z <- matrix(1:12, nrow=4)
> Z
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12

Footnote 1: For matrices I will use upper case bold letters for variable names in the text to make it easier
to distinguish them from non-matrix variables as you read along. Obviously, this is not possible in R itself
but for the text hopefully this will make it easier to follow.

One of the main things you have to pay attention to when dealing with matrices is the
number of rows and columns in the matrices. In these example matrices, X and Y are
square matrices (i.e., they have the same number of rows and columns) whereas Z is
not square as it has 4 rows and 3 columns of data. To access the number of rows and
columns in a matrix you can use the function dim().
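A quick sketch of how the dimensions can be inspected is given here. The helper isSquare() is a made-up convenience function for this example, not something provided by R.

```r
X <- matrix(1:9, nrow = 3, byrow = TRUE)
Z <- matrix(1:12, nrow = 4)

dim(X)     # rows first, then columns
dim(Z)
nrow(Z)    # the two pieces individually
ncol(Z)

# A hypothetical helper: a matrix is square when both dimensions agree
isSquare <- function(m) nrow(m) == ncol(m)
isSquare(X)
isSquare(Z)
```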

Scalar Addition & Subtraction

Matrices may be shifted by the addition or subtraction of a constant scalar value (e.g.,

2 + X). Scalar addition and subtraction take the value of the scalar and add it to every

element in the matrix.

> X

[ , 1] [ , 2] [ , 3]

[ 1 , ] 1 2 3

[ 2 , ] 4 5 6

[ 3 , ] 7 8 9

> X + 3

[ , 1] [ , 2] [ , 3]

[ 1 , ] 4 5 6

[ 2 , ] 7 8 9

[ 3 , ] 10 11 12

Matrix Addition & Subtraction

For both addition and subtraction of matrices, the numbers of rows and columns must
be identical. If they are, the addition or subtraction operation is carried out
element-wise on the two matrices. In R you can use the normal addition (+) and subtraction
(-) operators as demonstrated below.

> X+Y

[ , 1] [ , 2] [ , 3]

[ 1 , ] 10 8 6

[ 2 , ] 12 10 8

[ 3 , ] 14 12 10

But when they are not the same size, R will barf up an error message telling you
they are not amenable to this operation.

> X+Z
Error in X + Z : non-conformable arrays

Scalar Multiplication

The values within a matrix may be scaled by multiplication by a scalar value (e.g., 0.5 *
X). Scalar multiplication results in every single element in the matrix being multiplied
by the scalar value. For example:

> X

[ , 1] [ , 2] [ , 3]

[ 1 , ] 1 2 3

[ 2 , ] 4 5 6

[ 3 , ] 7 8 9

> X * 2

[ , 1] [ , 2] [ , 3]

[ 1 , ] 2 4 6

[ 2 , ] 8 10 12

[ 3 , ] 14 16 18

Element-wise Multiplication

It is possible to multiply two matrices where what you want is a new matrix that
is the element-wise product of the original matrices. This is sometimes called
the Hadamard product or the Schur product. In R this operation is conducted using
the regular multiplication character, *, between the two matrices. The result of this
operation is a new matrix with the same dimensions as the two original ones.

> X

[ , 1] [ , 2] [ , 3]

[ 1 , ] 1 2 3

[ 2 , ] 4 5 6

[ 3 , ] 7 8 9

> Y

[ , 1] [ , 2] [ , 3]

[ 1 , ] 9 6 3

[ 2 , ] 8 5 2

[ 3 , ] 7 4 1

> X * Y

[ , 1] [ , 2] [ , 3]

[ 1 , ] 9 12 9

[ 2 , ] 32 25 12

[ 3 , ] 49 32 9

Multiplication

Matrix multiplication is slightly more complicated than multiplication among scalars or
multiplying a scalar by a matrix. For example, in matrix multiplication, $AB \neq BA$ in general.
This is because of the way that matrices are multiplied. Moreover, there are several
restrictions on which sets of matrices can be multiplied together.

For example, consider the operation $A = XY$ where the matrix X has $r_X$ rows and $c_X$
columns of data and the matrix Y has $r_Y$ rows and $c_Y$ columns of data. For this operation
to be defined, the number of columns in X, $c_X$, must equal the number of rows in Y (i.e.,
$c_X = r_Y$). If these are not equal, then you cannot perform the multiplication. Moreover,
the resulting matrix A will have $r_X$ rows and $c_Y$ columns. This is because the matrix
multiplication is conducted as:

$$A_{ij} = \sum_{k=1}^{N} X_{i,k} Y_{k,j}$$

where $N = c_X = r_Y$. Essentially, each element $A_{ij}$ is obtained by multiplying row i of X against
column j of Y, element by element, and summing the products.
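The summation can be spelled out as an explicit double loop. The sketch below does this for the X and Y matrices defined earlier and compares the result to R's built-in matrix product operator %*%; the function name multiplyByHand is invented for the example, and the loop is meant to show the arithmetic, not to be fast.

```r
X <- matrix(1:9, nrow = 3, byrow = TRUE)
Y <- matrix(9:1, nrow = 3)

# Element [1,1] of the product: row 1 of X against column 1 of Y
sum(X[1, ] * Y[, 1])    # 1*9 + 2*8 + 3*7

# The full product, written out as an explicit (and slow) double loop
multiplyByHand <- function(X, Y) {
  stopifnot(ncol(X) == nrow(Y))             # columns of X must match rows of Y
  A <- matrix(0, nrow = nrow(X), ncol = ncol(Y))
  for (i in seq_len(nrow(X))) {
    for (j in seq_len(ncol(Y))) {
      A[i, j] <- sum(X[i, ] * Y[, j])       # sum over k of X[i,k] * Y[k,j]
    }
  }
  A
}

A <- multiplyByHand(X, Y)
all.equal(A, X %*% Y)
```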

In R matrix multiplication uses a unique operator that you probably haven't seen yet. To
indicate that you want two matrices to be multiplied (and not the Hadamard product as
above) you use the compound operator %*%. That is right, it is a pair of percent signs
surrounding the normal multiplication character (a.k.a. the asterisk). Some examples
using the matrices X and Y are given below. Notice how $XY \neq YX$.

> X

[ , 1] [ , 2] [ , 3]

[ 1 , ] 1 2 3

[ 2 , ] 4 5 6

[ 3 , ] 7 8 9

> Y

[ , 1] [ , 2] [ , 3]

[ 1 , ] 9 6 3

[ 2 , ] 8 5 2

[ 3 , ] 7 4 1

> X %*% Y

[ , 1] [ , 2] [ , 3]

[ 1 , ] 46 28 10

[ 2 , ] 118 73 28

[ 3 , ] 190 118 46

> Y %*% X

[ , 1] [ , 2] [ , 3]

[ 1 , ] 54 72 90

[ 2 , ] 42 57 72

[ 3 , ] 30 42 54

> I <- diag(3)   # 3x3 identity matrix, needed before it can be used here

> X %*% I

[ , 1] [ , 2] [ , 3]

[ 1 , ] 1 2 3

[ 2 , ] 4 5 6

[ 3 , ] 7 8 9

> X - (X %*% I)

[ , 1] [ , 2] [ , 3]

[ 1 , ] 0 0 0

[ 2 , ] 0 0 0

[ 3 , ] 0 0 0

> I %*% X

[ , 1] [ , 2] [ , 3]

[ 1 , ] 1 2 3

[ 2 , ] 4 5 6

[ 3 , ] 7 8 9

Here both X and Y are square and have the same number of rows and columns
(i.e., the simplest case because we don't have to make sure the correct rows and columns
match). The identity matrix, I (see The Diagonal in 8.1.2), is shown here with its groovy
properties. Matrix multiplication by the identity matrix is commutative and will result in
the original matrix: a kind of matrix version of multiplying a scalar by one (see footnote 2).

Here is an example using the matrices X and Z, which have different dimensions.

> Z

[ , 1] [ , 2] [ , 3]

[ 1 , ] 1 5 9

[ 2 , ] 2 6 10

[ 3 , ] 3 7 11

[ 4 , ] 4 8 12

> Z %*% X

[ , 1] [ , 2] [ , 3]

[ 1 , ] 84 99 114

[ 2 , ] 96 114 132

[ 3 , ] 108 129 150

[ 4 , ] 120 144 168

> X %*% Z

Error in X %*% Z : non-conformable arguments

In the first case, Z %*% X is defined and provides a result because the number of columns
in Z matches the number of rows in X. The reverse of this multiplication, X %*% Z, is
undefined and R tells you so.

8.1.2 Matrix Operations

There are several other operations that can be conducted on matrices that you will
probably run across as you begin playing with matrices. Here is a smattering of them.

The Diagonal

It is often necessary to interact with the diagonal of a matrix, defined as the elements
whose row index is equal to the column index. For example, in a covariance
matrix, the diagonal elements are the variance estimates. In R you can get access to
the diagonal of a matrix by using the diag() function. Some examples
include:

> X

[ , 1] [ , 2] [ , 3]

[ 1 , ] 1 2 3

[ 2 , ] 4 5 6

[ 3 , ] 7 8 9

> diag (X)

[ 1] 1 5 9

> Z

[ , 1] [ , 2] [ , 3]

[ 1 , ] 1 5 9

[ 2 , ] 2 6 10

[ 3 , ] 3 7 11

[ 4 , ] 4 8 12

> diag ( Z)

[ 1] 1 6 11

Footnote 2: There are other matrices that have this property that are not as simple as this one and if you
take some multivariate statistics, it will blow your mind how cool they are...

Notice how even for non-square matrices the diagonal is defined. You can also extract
and insert particular values for the diagonal as demonstrated below:

> X

[ , 1] [ , 2] [ , 3]

[ 1 , ] 1 2 3

[ 2 , ] 4 5 6

[ 3 , ] 7 8 9

> origDiag <- diag(X)

> origDiag

[ 1] 1 5 9

> diag(X) <- c(42, 23, 4)

> X

[ , 1] [ , 2] [ , 3]

[ 1 , ] 42 2 3

[ 2 , ] 4 23 6

[ 3 , ] 7 8 4

> diag(X) <- origDiag

> X

[ , 1] [ , 2] [ , 3]

[ 1 , ] 1 2 3

[ 2 , ] 4 5 6

[ 3 , ] 7 8 9

A commonly used matrix that can easily be constructed using the diag() function is the
Identity Matrix, whose symbol is I. This matrix has zeros everywhere except on the
diagonal, where every element is one:

> I <- matrix(0, nrow=3, ncol=3)
> diag(I) <- 1

> I

[ , 1] [ , 2] [ , 3]

[ 1 , ] 1 0 0

[ 2 , ] 0 1 0

[ 3 , ] 0 0 1

Finally, there is an operator called the trace of a matrix that is typically written as tr(A),
which is the sum of the diagonal elements. If A is a variance-covariance matrix, as is
commonly found in multivariate statistics, then its trace is the total variance. In R we
can find the trace using a combination of the sum() and diag() functions as:

> X

[ , 1] [ , 2] [ , 3]

[ 1 , ] 1 2 3

[ 2 , ] 4 5 6

[ 3 , ] 7 8 9

> sum( diag ( X ) )

[ 1] 15
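If the trace is needed often, the two calls can be wrapped in a small function. The name tr below is made up for this sketch (base R does not supply a function by that name), and the last two lines illustrate a handy property: swapping the order of a product changes the product but not its trace.

```r
# A small helper (hypothetical name) wrapping the sum-of-diagonal idea
tr <- function(m) sum(diag(m))

X <- matrix(1:9, nrow = 3, byrow = TRUE)
Y <- matrix(9:1, nrow = 3)

tr(X)

# The order of multiplication changes the product but not its trace
tr(X %*% Y)
tr(Y %*% X)
```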

Matrix Determinant

The determinant of a matrix is a scalar value computed from the elements of a square matrix. The
calculation of the determinant
is somewhat complicated when we get to matrices that have more than two rows
and columns and I'll let you go find a linear algebra book to look into it if you so desire.
For a 2x2 matrix, the determinant, denoted as |A|, is given as:

$$|A| = \begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix} = a_{11}a_{22} - a_{12}a_{21}$$

In R the function det() is used to calculate the determinant of a matrix.

> X <- matrix(c(1, 6, 3, 4), nrow=2)

> X

[ , 1] [ , 2]

[ 1 , ] 1 3

[ 2 , ] 6 4

> det (X)

[1] -14
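The 2x2 formula is easy to check by hand against det(). The sketch below does that for the same matrix and also shows, using an invented matrix S, what happens when the rows of a matrix are linearly dependent.

```r
X <- matrix(c(1, 6, 3, 4), nrow = 2)

# The 2x2 formula written out: a11*a22 - a12*a21
byHand <- X[1, 1] * X[2, 2] - X[1, 2] * X[2, 1]
byHand
det(X)

# A matrix whose rows are linearly dependent has a determinant of zero (up to
# floating point rounding)
S <- matrix(1:9, nrow = 3, byrow = TRUE)
det(S)
```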

Matrix Transpose

The transpose of a matrix is an operation that exchanges the row and column indices
of the elements. This will change the dimensions of the matrix if it is not square. Notationally,
you will see several different ways to represent a transpose, such as $A'$ or $A^T$.

In R the transpose operation is performed with the t() function.

> Z

[ , 1] [ , 2] [ , 3]

[ 1 , ] 1 5 9

[ 2 , ] 2 6 10

[ 3 , ] 3 7 11

[ 4 , ] 4 8 12

> t ( Z)

[ , 1] [ , 2] [ , 3] [ , 4]

[ 1 , ] 1 2 3 4

[ 2 , ] 5 6 7 8

[ 3 , ] 9 10 11 12

> t ( t ( Z) )

[ , 1] [ , 2] [ , 3]

[ 1 , ] 1 5 9

[ 2 , ] 2 6 10

[ 3 , ] 3 7 11

[ 4 , ] 4 8 12

Notice that the transpose of a transpose is equal to the original variable.
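A related property worth seeing once is that transposing a product reverses the order of the factors. The sketch below checks this numerically for the Z and X matrices used earlier; the names left and right are just labels for the two sides of the identity.

```r
X <- matrix(1:9, nrow = 3, byrow = TRUE)
Z <- matrix(1:12, nrow = 4)

dim(Z)       # 4 x 3
dim(t(Z))    # 3 x 4

# Transposing a product reverses the order of the factors
left  <- t(Z %*% X)
right <- t(X) %*% t(Z)
all.equal(left, right)
```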

Matrix Inversion

For scalars, the inverse is defined as $x^{-1} = \frac{1}{x}$ but for matrices it is slightly more
complicated. There are even large groups of matrices that cannot be inverted. One property
that prevents inversion is if the matrix is singular (think black hole of mathematics, or
matrices that have a zero determinant).
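As a small illustration, base R's solve() function returns the inverse of a square matrix when called with a single argument. This is an alternative to the ginv() function used in the text, shown here only because it needs no extra library; the singular matrix S is invented for the example.

```r
A <- matrix(c(1, 6, 3, 4), nrow = 2)    # determinant is -14, so A is invertible

Ainv <- solve(A)          # solve() with a single argument returns the inverse
A %*% Ainv                # the identity matrix, up to rounding

# A singular matrix: the second row is twice the first
S <- matrix(c(1, 2, 2, 4), nrow = 2)
det(S)
result <- tryCatch(solve(S), error = function(e) "not invertible")
result
```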

A common use for matrix inversion is in the estimation of regression coefficients by least
squares. In 6.2, we used the lm() function to estimate the intercept and slope coefficients.
This can be done using matrix algebra and the inversion function ginv() found in the MASS
library, which computes a generalized inverse. A one-column matrix of regression coefficients B is
estimated from the formula:

$$B = (X'X)^{-1}X'Y$$

where the matrix Y is the usual one-column matrix of response values and the X matrix
has a first column of all ones (1) for the intercept and the remaining columns hold the
predictor variables.

> X <- matrix(c(rep(1, 10), 1:10), ncol=2)

> X

[ , 1] [ , 2]

[ 1 , ] 1 1

[ 2 , ] 1 2

[ 3 , ] 1 3

[ 4 , ] 1 4

[ 5 , ] 1 5

[ 6 , ] 1 6

[ 7 , ] 1 7

[ 8 , ] 1 8

[ 9 , ] 1 9

[ 10 , ] 1 10

> Y <- matrix(c(19,25,14,15,24,17,19,27,29,25))

> Y

[ , 1]

[ 1 , ] 19

[ 2 , ] 25

[ 3 , ] 14

[ 4 , ] 15

[ 5 , ] 24

[ 6 , ] 17

[ 7 , ] 19

[ 8 , ] 27

[ 9 , ] 29

[ 10 , ] 25

> library(MASS)
> ginv(t(X) %*% X) %*% (t(X) %*% Y)
           [,1]
[1,] 16.3333333
[2,]  0.9212121
> lm(Y ~ c(1:10))

Call:
lm(formula = Y ~ c(1:10))

Coefficients:
(Intercept)      c(1:10)
    16.3333       0.9212

As you can see from the comparison, both lm() and the matrix multiplication/inversion method produce the same estimates for the intercept and the slope coefficient. If you were to center both the response and the predictor (e.g., Z <- Y - mean(Y) and x <- 1:10 - mean(1:10), so that each has mean zero), you could use an X matrix without the column for the intercept ($\beta_0 = 0$) and still get the same estimate for the slope coefficient, $\beta_1$. Centering the response alone is not enough; the predictor must be centered as well.
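The point about dropping the intercept column can be checked numerically. What follows is only a minimal sketch: it assumes that both the response and the predictor are centered, it uses solve() from base R rather than ginv() (which is fine here because the matrices involved are invertible), and the names x, Xc, Z, and slope_centered are introduced purely for illustration.

```r
# Data from the regression example
x <- 1:10
Y <- matrix(c(19, 25, 14, 15, 24, 17, 19, 27, 29, 25))

# Full design matrix: a column of ones (intercept) plus the predictor
X <- cbind(1, x)
B <- solve(t(X) %*% X) %*% t(X) %*% Y   # row 1 = intercept, row 2 = slope

# Center both variables and drop the column of ones
Z  <- Y - mean(Y)
Xc <- matrix(x - mean(x))
slope_centered <- solve(t(Xc) %*% Xc) %*% t(Xc) %*% Z

B[2, 1]               # slope from the model with an intercept
slope_centered[1, 1]  # slope from the centered model without an intercept
```

The two printed slopes should agree, which is the reason the column of ones becomes unnecessary once the data are centered.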

Eigen Decompositions

An eigenvalue/eigenvector decomposition is a "magical property" of matrices that can

only be appreciated by some experience in matrix algebra. However, we will be using

them in the next section so it seems there is a need to introduce them here. Start by

considering the square ($k \times k$) matrix A and the identity matrix (I) in the characteristic equation $|A - \lambda I| = 0$.

Using the matrix:

> A <- matrix(c(1, 6, 3, 4), nrow = 2)
> A
     [,1] [,2]
[1,]    1    3
[2,]    6    4

The eigenvalues for the matrix are given by solving the characteristic formula:

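$$
\begin{aligned}
0 &= |A - \lambda I| \qquad (8.1) \\
  &= \left| \begin{bmatrix} 1 & 3 \\ 6 & 4 \end{bmatrix} - \lambda \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right| \\
  &= \left| \begin{matrix} 1 - \lambda & 3 \\ 6 & 4 - \lambda \end{matrix} \right| \\
  &= (1 - \lambda)(4 - \lambda) - 18 \\
  &= \lambda^2 - 5\lambda - 14
\end{aligned}
$$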

If we solve for λ we see that possible values are 7 and −2. These are called the eigenvalues

of the matrix A.

Each eigenvalue has an associated eigenvector such that:

Ax = λx

where x is a vector (i.e., a matrix with only one column) that is matched to each of the k eigenvalues. The equation above is called the characteristic equation for the right eigenvector; a left eigenvector also exists and has the form $xA = x\lambda$. In either case, we need to solve for x. Starting with the largest eigenvalue, $\lambda_1 = 7$, we have:

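$$
\begin{bmatrix} 1 & 3 \\ 6 & 4 \end{bmatrix}
\begin{bmatrix} e_1 \\ e_2 \end{bmatrix}
= \lambda_1 \begin{bmatrix} e_1 \\ e_2 \end{bmatrix} \qquad (8.2)
$$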

If we multiply these out, we get the following equations:

$$
\begin{aligned}
1e_1 + 3e_2 &= 7e_1 \\
6e_1 + 4e_2 &= 7e_2
\end{aligned}
$$

Here we have two equations in two variables and can easily solve for $e_1$ and $e_2$ (up to a common scaling factor); these values define the eigenvector $v_1 = [e_1, e_2]$ that is linked to the eigenvalue $\lambda_1$. We can do the same for the second vector (which I will let you play with in those boring weekend hours where you are wishing that you had some really cool math problem to solve).

It is important to point out here that the values for $v_1$ can be scaled. As you look at the equations above, we can solve for the components and find that $e_1 = \frac{e_2}{2}$. There are a lot of values for $e_1$ and $e_2$ that make this statement true. If we think about the vector $v_1 = [e_1, e_2]$ as a projection away from the origin a distance of $e_1$ on one axis and $e_2$ on a second orthogonal axis, it may make a bit more sense. There are many vectors that point in a direction that intersects the point $(e_1, e_2)$, all of which are the same except for a scaling factor. This is shown graphically in Figure 8.1 with two vectors pointing in the same direction but with different lengths.

Figure 8.1: Image depicting two vectors $v_{red} = [4, 2]$ and $v_{blue} = [2, 1]$ that are projecting in the same direction but have different magnitudes.

The reason I bring this up is that it is common for routines that calculate vectors, such as we are doing here for the eigenvector decomposition, to scale the vectors such that their lengths are set to some normalizing constant such as 1. As a result, if you solve for $v_1$ and then check it below with the eigen() function you may not get the same values, but if you were to plot the vectors, the lines away from the origin would be pointing in the same direction.
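The scaling argument can be illustrated with a few lines of R. This is a small sketch, not a general recipe: v1 = (1, 2) is just one convenient hand solution satisfying $e_1 = e_2 / 2$, and the names v1, v1_unit, and e1 are made up for this example.

```r
A <- matrix(c(1, 6, 3, 4), nrow = 2)

# A hand-solved eigenvector for lambda_1 = 7: any vector with e2 = 2 * e1 works
v1 <- c(1, 2)
A %*% v1                 # equals 7 * v1

# Rescale to unit length, the normalization that eigen() reports
v1_unit <- v1 / sqrt(sum(v1^2))

# First eigenvector from eigen(); its overall sign is arbitrary
e1 <- eigen(A)$vectors[, 1]

v1_unit
e1
e1 / v1_unit             # a constant ratio means the same direction
```

A constant ratio (here either 1 or -1 in both positions) shows that the hand solution and the eigen() result differ only by a scaling factor.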

There are some interesting properties of eigenvalues and eigenvectors.

• If the original matrix is symmetric (actually non-negative semi-definite, but who's watching), the original matrix $A = \sum_{i=1}^{k} \lambda_i e_i e_i'$. This is called the spectral decomposition of the matrix A.

• The product of the eigenvalues is equal to the determinant of the original matrix (i.e., $\prod_{i=1}^{k} \lambda_i = |A|$).

• The sum of the eigenvalues is equal to the trace of the matrix (i.e., $\sum_{i=1}^{k} n_i \lambda_i = \mathrm{tr}(A)$, where $n_i$ is the number of times the eigenvalue $\lambda_i$ is repeated).

• If it is possible to invert A then the eigenvalues of $A^{-1}$ will be the inverse of the eigenvalues of A (i.e., they will be $\lambda_i^{-1}$).

• The eigenvectors of A and $A^{-1}$ are identical.
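These properties are easy to confirm numerically. The sketch below uses the small matrix A from above plus a symmetric matrix S that is invented here only to demonstrate the spectral decomposition; solve() is the base R inverse, and names such as eA, A_inv, and S_rebuilt are illustrative.

```r
A  <- matrix(c(1, 6, 3, 4), nrow = 2)
eA <- eigen(A)

# Product of the eigenvalues equals the determinant
prod(eA$values)
det(A)

# Sum of the eigenvalues equals the trace
sum(eA$values)
sum(diag(A))

# Eigenvalues of the inverse are the reciprocals of the originals
A_inv <- solve(A)
sort(eigen(A_inv)$values)
sort(1 / eA$values)

# Spectral decomposition of a symmetric matrix (S is a made-up example)
S  <- matrix(c(2, 1, 1, 3), nrow = 2)
eS <- eigen(S)
S_rebuilt <- eS$vectors %*% diag(eS$values) %*% t(eS$vectors)
S_rebuilt                # reproduces S up to rounding error
```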

R has an eigen() function that takes a square matrix and returns the eigenvalues and eigenvectors as a list. Here is an example using our little friend the A matrix we touched on above.

> A
     [,1] [,2]
[1,]    1    3
[2,]    6    4
> rootsOfA <- eigen(A)
> rootsOfA
$values
[1]  7 -2

$vectors
           [,1]       [,2]
[1,] -0.4472136 -0.7071068
[2,] -0.8944272  0.7071068

Barring the possibility that I actually just copied and pasted the results from eigen() into the discussion above on $v_1 = [e_1, e_2]$, the answer looks like it should: in the first eigenvector, $e_1$ is half of $e_2$.

8.2 Stage-Classiﬁed Matrix Models

Stage-classiﬁed matrix models are concerned with understanding the processes that in-

ﬂuence the persistence of populations. These models tacitly assume that the continuum

of life histories for a species can be partitioned into discrete stages and that a census

of individuals in a population can be performed wherein we can tally the number of

individuals in each of these discrete stages. Some species lend themselves to stage-classification better than others, and the details of how to go about defining stages are best left to another course. Here we are going to introduce the notation of a matrix model in R and then perform some analyses on these models. This chapter is intended only to whet your appetite a bit on matrix models; those who are interested should seek out another course or at least read a good text such as Caswell (2001).

8.2.1 Transition Matrices & Census Vectors

For the sake of discussion, let's assume that we are working with a plant, Grenus growii, that has the following four distinct life stages. Moreover, from our vast knowledge of this organism, we have the accompanying information about the way this species proceeds through life stages.

Seed The seed stage lasts a single time step (e.g., there is no persistent seed bank) and

only 50% of the seeds actually germinate, the others are either eaten or rot.

Seedling The seedling stage is a non-reproductive stage and herbivory removes 20%

of the individuals that get into this stage and the remaining individuals become

juveniles.

Juvenile The juvenile stage is the first reproductive stage and on average each juvenile produces 1.3 offspring. Depending upon the habitat the juvenile is located in, half move on to the next stage and a quarter stay as juveniles. The remaining ones are eaten.

Adult The ﬁnal adult stage is where most of the reproduction happens with each indi-

vidual producing an average of 3.1 offspring. Half of the adults persist in the adult

stage from one time step to the next.

A diagram of this fictitious species is shown in Figure 8.2.

Figure 8.2: A graphical depiction of the life history stages in the fictitious plant Grenus growii.

Here each of the spheres in this image represents a stage. The arrows between the stages depict either fertility estimates (labeled $f_X$) when they point back to the seed stage, or transitions (labeled $p_{XY}$, signifying the probability that an individual proceeds to stage X from stage Y). From the description above, we can assign values to the parameters in this life history diagram. Table 8.1 shows the value of each parameter.

These parameters can now be put into a transition matrix [3], A, that has a particularly strict form.

$$
A = \begin{bmatrix}
f_1 & f_2 & f_3 & f_4 \\
p_{21} & p_{22} & p_{23} & p_{24} \\
p_{31} & p_{32} & p_{33} & p_{34} \\
p_{41} & p_{42} & p_{43} & p_{44}
\end{bmatrix} \qquad (8.3)
$$

[3] Actually this is not a transition matrix, as its columns do not sum to 1; rather it is a population projection matrix in the spirit of a Leslie matrix, but I think I can get away with generalizing the term a bit here.

Table 8.1: Life history values separated into (A) fertility estimates (the $f_X$ items) and (B) transition probabilities depicting the movement between stages and within stages.

A. Fertility estimates

Stage       Parameter   Value
Seed        $f_1$       0
Seedling    $f_2$       0
Juvenile    $f_3$       1.3
Adult       $f_4$       3.1

B. Transition probabilities

Transition             Parameter   Value
Seed → Seedling        $p_{21}$    0.5
Seedling → Juvenile    $p_{32}$    0.8
Juvenile → Adult       $p_{43}$    0.5
Juvenile → Juvenile    $p_{33}$    0.25
Adult → Adult          $p_{44}$    0.5

The items in the matrix are partitioned into two components: the top row records the fecundity values, $f_X$, and the second and remaining rows depict the probabilities of transition, $p_{XY}$. Inserting the observed values into this matrix gives us:

$$
A = \begin{bmatrix}
0 & 0 & 1.3 & 3.1 \\
0.5 & 0 & 0 & 0 \\
0 & 0.8 & 0.25 & 0 \\
0 & 0 & 0.5 & 0.5
\end{bmatrix} \qquad (8.4)
$$

In R we can create this matrix using the following code:

> A <- matrix(0, nrow = 4, ncol = 4)
> A[1, 3] <- 1.3
> A[1, 4] <- 3.1
> A[2, 1] <- 0.5
> A[3, 2] <- 0.8
> A[3, 3] <- 0.25
> A[4, 3] <- 0.5
> A[4, 4] <- 0.5
> A
     [,1] [,2] [,3] [,4]
[1,]  0.0  0.0 1.30  3.1
[2,]  0.5  0.0 0.00  0.0
[3,]  0.0  0.8 0.25  0.0
[4,]  0.0  0.0 0.50  0.5
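As an alternative to filling the entries one at a time, the whole matrix can be typed in row by row. This is a sketch of one possible style, not the approach used in the text: byrow = TRUE lets the code mirror the layout of equation 8.4, and the stage labels in dimnames are an optional addition made here for readability.

```r
stages <- c("Seed", "Seedling", "Juvenile", "Adult")

# Rows are the stage an individual moves to, columns the stage it comes from
A <- matrix(c(0.0, 0.0, 1.30, 3.1,
              0.5, 0.0, 0.00, 0.0,
              0.0, 0.8, 0.25, 0.0,
              0.0, 0.0, 0.50, 0.5),
            nrow = 4, byrow = TRUE,
            dimnames = list(to = stages, from = stages))
A

# Entries can then be looked up by name instead of by number
A["Seedling", "Seed"]    # the transition probability p_21
```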

The entries in this matrix have some rather special properties if we put the values into

it as directed.

Intrinsic Growth Rate

The Euler-Lotka integral equation for the instantaneous growth rate, r, is well known to most biologists (...) and has the form:

$$1 = \int_{0}^{\infty} l(x)\, m(x)\, e^{-rx}\, dx$$

where the term l(x) is the fraction of individuals surviving to age x, m(x) is the fertility rate of individuals at age x, and r is the growth rate. The growth rate is the part that we are interested in looking at. Expressed as a multiplier per time step, $\lambda = e^{r}$, it tells us that:

$$
\lambda \;
\begin{cases}
< 1 & \text{Population size decaying exponentially} \\
= 1 & \text{Stable size through time} \\
> 1 & \text{Population size increasing exponentially}
\end{cases}
$$

We can provide an estimate of the growth rate using an eigenvalue decomposition of the transition matrix A. Due to the way the matrix is set up, the largest real eigenvalue of the matrix ($\lambda_1$ as defined in 8.1.2) is the per-time-step multiplier $\lambda = e^{r}$, so that $r = \log(\lambda_1)$. So, once the matrix A is entered into R, we can find the growth parameter as:

> A
     [,1] [,2] [,3] [,4]
[1,]  0.0  0.0 1.30  3.1
[2,]  0.5  0.0 0.00  0.0
[3,]  0.0  0.8 0.25  0.0
[4,]  0.0  0.0 0.50  0.5
> eigen(A)
$values
[1]  1.2075472+0.0000000i -0.0067844+0.8194141i -0.0067844-0.8194141i
[4] -0.4439783+0.0000000i

$vectors
             [,1]                  [,2]                  [,3]          [,4]
[1,] 0.8603823+0i  0.7490103+0.0000000i  0.7490103+0.0000000i -0.4753001+0i
[2,] 0.3562521+0i -0.0037839-0.4570089i -0.0037839+0.4570089i  0.5352740+0i
[3,] 0.2976372+0i -0.4052283+0.1306829i -0.4052283-0.1306829i -0.6170499+0i
[4,] 0.2103303+0i  0.1682952+0.1431813i  0.1682952-0.1431813i  0.3268348+0i

Here we can see that $\lambda_1$ is a real number (the +0.0000000i part tells us that its imaginary component is zero) even though there are some complex eigenvalues (roots) of this matrix. Moreover, it suggests that the overall behavior of this transition matrix is to increase overall population size by a factor of $\lambda_1 \approx 1.2$ each time step (equivalently, an instantaneous rate of $r = \log(\lambda_1) \approx 0.19$).
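Picking the growth parameter out of the eigen() output by eye works, but it can also be done in code. The following is a hedged sketch: it assumes the standard relationship $\lambda = e^{r}$ between the per-time-step multiplier and the instantaneous rate, treats an eigenvalue as real when its imaginary part is below an arbitrary tolerance of 1e-12, and uses illustrative names (ev, is_real, lambda1).

```r
A <- matrix(c(0.0, 0.0, 1.30, 3.1,
              0.5, 0.0, 0.00, 0.0,
              0.0, 0.8, 0.25, 0.0,
              0.0, 0.0, 0.50, 0.5),
            nrow = 4, byrow = TRUE)

ev <- eigen(A)$values

# Keep only eigenvalues whose imaginary part is numerically zero
is_real <- abs(Im(ev)) < 1e-12

# The largest real eigenvalue is the multiplier per time step
lambda1 <- max(Re(ev[is_real]))
lambda1

# The corresponding instantaneous rate
r <- log(lambda1)
r
```

Values of lambda1 above 1 (equivalently, r above 0) indicate a growing population.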

The particular value of $\lambda$ will determine the overall long-term behavior of the population. Essentially, as time increases ($t: 0 \to \infty$), the impact of $\lambda$ is determined by raising it to higher and higher powers. Figure 8.3 shows the projected impact on population size as a function of time for two values, $\lambda_{red} = 0.8$ and $\lambda_{blue} = 1.2$.

Figure 8.3: Effects of the growth rate $\lambda$ as a function of time for both exponential growth ($\lambda_{blue} = 1.2$) and exponential decay ($\lambda_{red} = 0.8$).
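A plot along the lines of Figure 8.3 can be produced by raising each value of $\lambda$ to successive powers. The code below is a guess at how such a figure might be drawn, not the code used for the original; the 20 time steps, the variable names, and the legend placement are all choices made here.

```r
time_steps <- 0:20

# Relative population size after t steps is lambda^t
growth <- 1.2^time_steps
decay  <- 0.8^time_steps

plot(time_steps, growth, type = "l", col = "blue", lwd = 2,
     xlab = "t", ylab = expression(lambda^t))
lines(time_steps, decay, col = "red", lwd = 2)
legend("topleft", legend = c("lambda = 1.2", "lambda = 0.8"),
       col = c("blue", "red"), lwd = 2, bty = "n")
```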

Stable Stage Distribution

The values in A also contain information on the relative proportion of individuals that will be in each stage class as the population settles into a steady state (whether growing, stable, or declining). This information is contained in the eigenvector that is associated with $\lambda_1$. From the output above we see that:

> ssd <- as.numeric(eigen(A)$vectors[, 1])
> ssd
[1] 0.8603823 0.3562521 0.2976372 0.2103303
> sum(ssd)
[1] 1.724602
> ssd <- ssd / sum(ssd)
> ssd
[1] 0.4988875 0.2065706 0.1725831 0.1219587
> sum(ssd)
[1] 1

Here you see that the eigenvectors are scaled to unit length (i.e., t(e_i) %*% e_i = 1) as mentioned above, which results in a total sum of the vector of sum(ssd) = 1.724602. If we are interested in finding the proportion of the population that is in each stage then we need to standardize the vector so that sum(ssd) = 1, and this is done by dividing every element by the total. As a result, ssd suggests that at equilibrium there should be 50% of the individuals as seeds, 21% as seedlings, 17% as juveniles, and 12% as adults.

We will return to these numbers and the estimate for r in the next subsection when we

iterate the data manually.

Bar Plots

In the previous example, we determined the stable stage distribution to estimate the proportion of the total population that is in each group. Graphically, this material could be depicted as a bar graph, and since we haven't covered how to make bar graphs yet, this is as good a time as any...

There is an option in the normal plot() function, type="h", that will kind of plot bars of your data to a figure. Actually, these are high-density lines and not real bar plots. This is what I used to make Figure 4.2 and at that time it got the job done, but a true bar plot looks a bit different than those lines.

R provides the function barplot() that takes a vector of heights and produces a general

barplot for you. Without modiﬁcations, the function barplot() does not produce a very

interesting plot in my opinion. However, there are several optional arguments that can

be used to create a more informative graphic. They include:

• names.arg a vector of names to be placed on the x-axis below the bars.

• width controls the width of the bars.

• space controls the amount of area between the bars, with a value of zero having the bars touch and positive numbers equal to that number of bar widths (e.g., space=2 plots a bar and then 2 bar widths before the next bar shows up).

• horiz is a logical flag that will plot the bars horizontally instead of vertically.

• col takes a single color or a vector of colors used to color the bars.

• ylim adjusts the limits of the y-axis as in normal plotting routines.

• xlab & ylab are the labels for the x- and y-axes.

Using the stable stage distribution associated with $\lambda_1$ from the previous section, we can plot the data as shown in Figure 8.4.

> ssd
[1] 0.4988875 0.2065706 0.1725831 0.1219587
> barplot(ssd)
> barplot(ssd, ylim = c(0, 1), xlab = "Stage", ylab = "Proportion of Individuals",
+         names.arg = c("Seed", "Seedling", "Juvenile", "Adult"),
+         col = c("red", "blue", "green", "yellow"))

The barplot() function can also be used to create stacked graphs (Figure 8.5). To create this example, I used the following code:

Figure 8.4: Examples of two different calls to the plotting function barplot(). The parameters used to create these plots are given in the R code.

> x <- matrix(runif(9), nrow = 3)
> x
          [,1]        [,2]      [,3]
[1,] 0.2355922 0.396869276 0.5674993
[2,] 0.7247734 0.001881527 0.9215767
[3,] 0.4625868 0.767329832 0.6408461
> barplot(x, names.arg = c("Control", "A", "B"), xlab = "Treatments", ylab = "Value",
+         legend = c("Category A", "Category B", "Category C"))

These stacked plots treat every column of data as a single bar, and the order in which the rows are presented is the order in which the stacking occurs. You can standardize the bars so that they all have the same height by dividing each column by that column's sum, which gives a proportional barplot.
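The standardization just described might be carried out as in the sketch below. It assumes a random 3 x 3 matrix like the one in the example; the seed is arbitrary and only makes the run repeatable, sweep() is one of several base R ways to divide each column by its sum, and x_prop is a name chosen for this illustration.

```r
set.seed(42)                       # arbitrary seed so the example is repeatable
x <- matrix(runif(9), nrow = 3)

# Divide every column by its own sum so that each bar has total height 1
x_prop <- sweep(x, 2, colSums(x), "/")
colSums(x_prop)

barplot(x_prop, names.arg = c("Control", "A", "B"),
        xlab = "Treatments", ylab = "Proportion")
```

Base R's prop.table(x, margin = 2) should give the same standardized matrix in a single call.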

Figure 8.5: Example of a stacked bar plot with multiple categories represented in each Treatment.

8.2.2 Projecting Stage Sizes

In this matrix model we have been playing with, the census count of individuals in

each of the four stages can be represented by the vector n and in R as a matrix whose

dimensions are (4x1). Assuming that I start with 12 seeds, 34 seedlings, 21 juveniles, and

12 adults, the vector can be depicted as:

> n <- matrix(c(12, 34, 21, 12))
> n
     [,1]
[1,]   12
[2,]   34
[3,]   21
[4,]   12

Using this notation, we can predict what the number of individuals in the next time slice will be given A and n as:

$$n_{t+1} = A n_t$$

> A
     [,1] [,2] [,3] [,4]
[1,]  0.0  0.0 1.30  3.1
[2,]  0.5  0.0 0.00  0.0
[3,]  0.0  0.8 0.25  0.0
[4,]  0.0  0.0 0.50  0.5
> n
     [,1]
[1,]   12
[2,]   34
[3,]   21
[4,]   12
> A %*% n
      [,1]
[1,] 64.50
[2,]  6.00
[3,] 32.45
[4,] 16.50

So after one generation, we can see that the number of seeds, juveniles, and adults all

increased but the number of seedlings decreased. If we look at the next time step, we

see that:

$$n_{t+2} = A n_{t+1} = A A n_t = A^2 n_t$$

And in general the vector of stage sizes at any arbitrary time step can be written as:

$$n_t = A^t n_0 \qquad (8.5)$$
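Equation 8.5 can also be evaluated directly. Base R has no operator for matrix powers (A^2 squares each element rather than multiplying the matrix by itself), so this sketch defines a small helper; mat_pow is a hypothetical name introduced here, not a built-in function.

```r
# Hypothetical helper: multiply the matrix M by itself 'power' times
mat_pow <- function(M, power) {
  result <- diag(nrow(M))            # identity matrix, i.e. M^0
  for (i in seq_len(power)) {
    result <- result %*% M
  }
  result
}

A <- matrix(c(0.0, 0.0, 1.30, 3.1,
              0.5, 0.0, 0.00, 0.0,
              0.0, 0.8, 0.25, 0.0,
              0.0, 0.0, 0.50, 0.5),
            nrow = 4, byrow = TRUE)
n0 <- matrix(c(12, 34, 21, 12))

n1  <- mat_pow(A, 1)  %*% n0         # one time step ahead
n10 <- mat_pow(A, 10) %*% n0         # ten time steps ahead
n1
n10
```

For large exponents repeated squaring would be more efficient, but a simple loop keeps the connection to equation 8.5 obvious.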

Let's make a matrix of n values for time 1 → 11 in R and calculate the number of individuals in each stage for each time step. I use 11 here because the matrix starts counting at column 1, which will correspond to our time t = 0, so when t = 10 the column will be 11. Let's also set the first column (our t = 0) equal to the census population size we were using above.

> N <- matrix(0, nrow = 4, ncol = 11)
> N
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,]    0    0    0    0    0    0    0    0    0     0     0
[2,]    0    0    0    0    0    0    0    0    0     0     0
[3,]    0    0    0    0    0    0    0    0    0     0     0
[4,]    0    0    0    0    0    0    0    0    0     0     0
> N[, 1] <- n
> N
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,]   12    0    0    0    0    0    0    0    0     0     0
[2,]   34    0    0    0    0    0    0    0    0     0     0
[3,]   21    0    0    0    0    0    0    0    0     0     0
[4,]   12    0    0    0    0    0    0    0    0     0     0

Now, for time steps 1 → 10 (and in the matrix N columns 2 → 11) we will use the equation

8.5 to calculate the number of individuals in each group.

> t <- 1
> N[, (t + 1)] <- A %*% N[, t]
> t <- t + 1
> N
     [,1]  [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,]   12 64.50    0    0    0    0    0    0    0     0     0
[2,]   34  6.00    0    0    0    0    0    0    0     0     0
[3,]   21 32.45    0    0    0    0    0    0    0     0     0
[4,]   12 16.50    0    0    0    0    0    0    0     0     0
> t
[1] 2

OK, here I am going to do something that saves some typing (you can use the up cursor key to repeat the last entry you typed in the R interpreter and I will use this to make my life a bit easier). I have defined the variable t such that it indicates which column of N holds the current counts (the t part) and which column receives the projected counts (the (t+1) part). Because each new column is A times the previous one, column t+1 ends up holding A raised to the power t times the starting vector, as in equation 8.5. Then I will increment the variable t by one and redo it again and again until I've filled up the columns of N.

In the following code examples, I show that you can use a semicolon (;) to put more than one command on a line. Again, I combine the assignment of counts to the appropriate column of N and then update the counter variable t each time through until all eleven columns are full. In Chapter 11 you will learn how to use a loop to do this much more easily, but until then using the up cursor key in the R interpreter is good enough.

> N[, (t + 1)] <- A %*% N[, t]; t <- t + 1
> N
     [,1]  [,2]    [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,]   12 64.50 93.3350    0    0    0    0    0    0     0     0
[2,]   34  6.00 32.2500    0    0    0    0    0    0     0     0
[3,]   21 32.45 12.9125    0    0    0    0    0    0     0     0
[4,]   12 16.50 24.4750    0    0    0    0    0    0     0     0
> N[, (t + 1)] <- A %*% N[, t]; t <- t + 1
> N[, (t + 1)] <- A %*% N[, t]; t <- t + 1
> N[, (t + 1)] <- A %*% N[, t]; t <- t + 1
> N[, (t + 1)] <- A %*% N[, t]; t <- t + 1
> N[, (t + 1)] <- A %*% N[, t]; t <- t + 1
> N[, (t + 1)] <- A %*% N[, t]; t <- t + 1
> N[, (t + 1)] <- A %*% N[, t]; t <- t + 1
> N[, (t + 1)] <- A %*% N[, t]; t <- t + 1
> N
     [,1]  [,2]    [,3]     [,4]     [,5]      [,6]      [,7]      [,8]
[1,]   12 64.50 93.3350 92.65875 95.68719 131.93725 168.77519 193.20372
[2,]   34  6.00 32.2500 46.66750 46.32937  47.84359  65.96862  84.38759
[3,]   21 32.45 12.9125 29.02813 44.59103  48.21126  50.32769  65.35682
[4,]   12 16.50 24.4750 18.69375 23.86094  34.22598  41.21862  45.77316
          [,9]     [,10]     [,11]
[1,] 226.86065 281.25553 343.80907
[2,]  96.60186 113.43032 140.62776
[3,]  83.84928  98.24381 115.30521
[4,]  55.56499  69.70713  83.97547
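For readers who would rather not lean on the up cursor key, the same projection can be written as a short for loop. This is only a preview sketch of a construct the text covers later; it rebuilds A and N from scratch so that it runs on its own, and it uses the same starting census as the example.

```r
A <- matrix(c(0.0, 0.0, 1.30, 3.1,
              0.5, 0.0, 0.00, 0.0,
              0.0, 0.8, 0.25, 0.0,
              0.0, 0.0, 0.50, 0.5),
            nrow = 4, byrow = TRUE)

# Column 1 holds the census at t = 0; columns 2 to 11 hold t = 1 to 10
N <- matrix(0, nrow = 4, ncol = 11)
N[, 1] <- c(12, 34, 21, 12)

# Each pass fills the next column from the current one
for (t in 1:10) {
  N[, t + 1] <- A %*% N[, t]
}
round(N, 2)
```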

That is a large number of values, so let's plot them to see what the stages do as we go through 10 time steps. The code used to produce the image in Figure 8.6 is:

> ylim <- c(0, 350)   # shared y-axis range; assumed here because the original listing does not define ylim
> plot(1:11, N[1, ], xlab = "", ylab = "", axes = F, bty = "n", col = "red", ylim = ylim, type = "l", lwd = 2)
> par(new = T)
> plot(1:11, N[2, ], xlab = "", ylab = "", axes = F, bty = "n", col = "blue", ylim = ylim, type = "l", lwd = 2)
> par(new = T)
> plot(1:11, N[3, ], xlab = "", ylab = "", axes = F, bty = "n", col = "green", ylim = ylim, type = "l", lwd = 2)
> par(new = T)
> plot(1:11, N[4, ], xlab = "t", ylab = "Number of Individuals", axes = T, bty = "n", col = "pink",
+      ylim = ylim, type = "l", lwd = 2)
> legend(2, 350, c("Seed", "Seedling", "Juvenile", "Adult"), col = c("red", "blue", "green", "pink"),
+        lwd = 2, bty = "n")

I use par(new=T) to overlay the lines on a single graph (see 4.1.1 for more on this). I also turn off the labels and axes for the first three plots because if you plot them over and over again, they look too dark on the graphic (think printing the same line on top of itself numerous times). On the last one, I set the labels for the axes and turn on the axes. Also included is the code I used to add the legend to the image. See ?legend for a complete discussion of the options that you can provide to this function.

Figure 8.6: Size of the four stage classes through time.
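The overlay approach works, but base R also has matplot(), which draws one line per column of a matrix in a single call. The sketch below is an alternative offered for comparison rather than the code behind Figure 8.6; it rebuilds N so that it runs on its own and transposes it because matplot() expects the series in columns.

```r
A <- matrix(c(0.0, 0.0, 1.30, 3.1,
              0.5, 0.0, 0.00, 0.0,
              0.0, 0.8, 0.25, 0.0,
              0.0, 0.0, 0.50, 0.5),
            nrow = 4, byrow = TRUE)
N <- matrix(0, nrow = 4, ncol = 11)
N[, 1] <- c(12, 34, 21, 12)
for (t in 1:10) {
  N[, t + 1] <- A %*% N[, t]
}

stage_cols <- c("red", "blue", "green", "pink")

# t(N) has one row per time step and one column per stage
matplot(0:10, t(N), type = "l", lty = 1, lwd = 2, col = stage_cols,
        xlab = "t", ylab = "Number of Individuals", bty = "n")
legend("topleft", legend = c("Seed", "Seedling", "Juvenile", "Adult"),
       col = stage_cols, lwd = 2, bty = "n")
```

Because all four lines share one set of axes, there is no need for par(new=T) or a manually chosen ylim.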

We can check some of the values that we estimated directly from A using the eigen decomposition by looking at the numbers in the matrix N. First, the growth rate we estimated from the first eigenvalue, $\lambda_1 \approx 1.2$, looks pretty close to that estimated from the raw counts.

> eigen(A)$values[1]
[1] 1.207547+0i
> sum(N[, 11]) / sum(N[, 10])
[1] 1.215202

And the proportion of individuals in each class, estimated by standardizing the first eigenvector, $\hat{v}_1 = v_1 / \sum_{i=1}^{4} v_{1i}$, is pretty close to what we see in N (and I throw in the first census so that you don't think I put values in there that were already pretty close).

> N[, 1] / sum(N[, 1])
[1] 0.1518987 0.4303797 0.2658228 0.1518987
> N[, 11] / sum(N[, 11])
[1] 0.5028525 0.2056811 0.1686445 0.1228219
> ssd
[1] 0.4988875 0.2065706 0.1725831 0.1219587

If we were to iterate this a bit longer you would see that the "brute force" estimates of the population growth rate and the stable stage distribution converge towards what was estimated from the eigen decomposition. In fact, Figure 8.7 shows the mean absolute deviation (MAD) representing the differences between the distribution of individuals in each stage and the predicted stable stage distribution (ssd) we calculated earlier. As you can see, it approaches the expected values pretty quickly.

Figure 8.7: Differences in estimated proportions of individuals in each stage from what was

expected through time.
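The code behind Figure 8.7 is not shown, so what follows is a guess at how such a curve could be computed. It assumes that MAD means the average, across the four stages, of the absolute difference between the observed proportion and ssd at each time step; the 30 time steps and the names steps, props, and mad_t are choices made for this sketch.

```r
A <- matrix(c(0.0, 0.0, 1.30, 3.1,
              0.5, 0.0, 0.00, 0.0,
              0.0, 0.8, 0.25, 0.0,
              0.0, 0.0, 0.50, 0.5),
            nrow = 4, byrow = TRUE)

# Project the population forward for 30 time steps
steps <- 30
N <- matrix(0, nrow = 4, ncol = steps + 1)
N[, 1] <- c(12, 34, 21, 12)
for (t in 1:steps) {
  N[, t + 1] <- A %*% N[, t]
}

# Stable stage distribution from the first eigenvector
ssd <- Re(eigen(A)$vectors[, 1])
ssd <- ssd / sum(ssd)

# Proportion of individuals in each stage at each time step
props <- sweep(N, 2, colSums(N), "/")

# Mean absolute deviation from ssd, one value per time step
mad_t <- colMeans(abs(props - ssd))

plot(0:steps, mad_t, type = "l", lwd = 2,
     xlab = "t", ylab = "Mean absolute deviation")
```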

8.3 Useful Functions

The following functions were introduced in this chapter and you will be required to use

them for the exercises. To get more information on any of these functions, use the R

help system.

• %*% Binary operator to perform matrix multiplication. An example would be X %*% Y.

• as.matrix(x) Coerces the variable x into the data type matrix.

• barplot(x) Creates a barplot of the values in x.

• det(x) Calculates, if possible, the determinant of the matrix in x.

• diag(x) Returns the diagonal (e.g., those entries whose row and column indices are

equal) of the matrix in x.

• dim(x) Returns the dimensions of the matrix x (e.g., the number of rows and columns).

• eigen(x) Returns the eigenvalue/eigenvector pairs for the matrix in x as a list. Values

are sorted in descending numerical order and vectors are scaled to unit length.

• ginv(x) Attempts to calculate the generalized inverse of x (found in the MASS library).

• legend(x,y,c) Creates a legend for the plot at the coordinates (x, y) with the entries

in c.

• matrix(x) Creates a new instance of the matrix data type of the values in x. You will

probably need to specify nrow and ncol to set the proper size for the matrices.

• read.table(x) Reads the ﬁle x into memory. See ?read.table for the copious amounts of

additional parameters that may be needed as well as Chapter 3.

• t(x) Returns the transpose of the matrix in x (e.g., reverses the row and column indices).

8.4 Exercises

The following exercises are meant to help you understand the items presented in this

Chapter.

1. In considering the growth rate, it was mentioned that $\lambda_1 > 0$ and this is what you will find in most cases. However, it is possible to get values of $\lambda < 0$. For the following values of $\lambda$ make a graph of t vs. $\lambda^t$ as shown in Figure 8.3 and describe the behavior of the population if these were the real values of the growth rate.

(a) $-1 < \lambda_1 < 0$.

(b) $\lambda_1 < -1$.

2. Create a matrix of random numbers using the runif() function and make a barplot of the values. What happens when you pass the optional argument beside=T?

3. Standardize the columns of data in the matrix from the previous exercise so that the sum of each column is equal to 1. Replot this using the function barplot() as done for Figure 8.5 with the beside=F option. How does standardizing each column influence the display of the plot?

Chapter 9

Working With Strings

While the majority of biological data is numeric in nature, there are still several important reasons to be able to manipulate character-based information. For example, you may be downloading all the references from an online database such as WebOfScience and want to mine the abstracts for metadata. You may also be interested in working with sequence data, which consists mostly of text information. In this relatively short chapter we will learn how to work with string data in R and look at a few examples using genetic sequences.

In this chapter, you will focus on the following topics:

• Learn how to work with string data to perform tasks such as parsing, searching,

and replacement.

• Learn how to access sequence-based data and pre-process it for importation into R.

• Learn how to create genetic distance matrices.

• Construct Neighbor-Joining trees and display them in R.

9.1 Parsing Text Data

At a most basic level you need to understand that character data in R is treated as a

single token in the same way that integer and numeric data is treated. For example,

consider the following code:

> x <- c("bob", "mary", "johnathan")
> length(x)
[1] 3
> x <- "George Stephen Sr."
> length(x)
[1] 1
> x <- c(1, 2, 3)
> length(x)
[1] 3
> x <- 3
> length(x)
[1] 1

9.1.1 Finding Lengths of Character Sequences

So R treats a character data type, independent of the length of the items in the variable, as a single entry. Once we understand this, the rest of this chapter really begins to take shape and make sense.

So, if R thinks that everything between a pair of quotes is a single instance of a character data type, then how do we figure out how many letters are contained between the quotes? The answer here is the function nchar().

> x <- "George Stephen Sr."
> nchar(x)
[1] 18

Another commonly used function for dealing with strings is the strsplit() function. This function takes the string of characters that you are interested in splitting as well as the character you want to split it on and returns the chunks as a list. This returning-as-a-list behavior is kind of a pain in the butt, so at the same time I introduce this function I will also show the unlist() function [1].

> partsOfName <- unlist(strsplit(x, " "))
> partsOfName
[1] "George"  "Stephen" "Sr."
> nchar(partsOfName)
[1] 6 7 3

Here is another example of how we may go about cycling through a set of words in a phrase and doing some operation on them. The first sentence from the first chapter of Darwin's The Origin Of Species is, "WHEN we look to the individuals of the same variety or sub-variety of our older cultivated plants and animals, one of the first points which strikes us, is, that they generally differ much more from each other, than do the individuals of any one species or variety in a state of nature." While this is a very interesting sentence, we are going to use it to show you how to break down the sentence into a vector of words and then tally the number of times each word is used.

We begin by making the sentence all lowercase and without punctuation because the simple matching procedure would consider "When" different than "when", and the strsplit() function will cut up the string on the spaces (that is what I will tell it to do).

> phrase <- paste0("when we look to the individuals of the same variety or sub-variety of our older ",
+                  "cultivated plants and animals one of the first points which strikes us is that they ",
+                  "generally differ much more from each other than do the individuals of any one species or ",
+                  "variety in a state of nature")
> wordList <- unlist(strsplit(phrase, " "))
> table(wordList)
wordList
          a         and     animals         any  cultivated      differ
          1           1           1           1           1           1
         do        each       first        from   generally          in
          1           1           1           1           1           1
individuals          is        look        more        much      nature
          2           1           1           1           1           1
         of       older         one          or       other         our
          5           1           2           2           1           1
     plants      points        same     species       state     strikes
          1           1           1           1           1           1
sub-variety        than        that         the        they          to
          1           1           1           4           1           1
         us     variety          we        when       which
          1           2           1           1           1

[1] The unlist() function takes a list and turns the items in it into a vector, which is easier to work with.
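The lowercase, punctuation-free phrase was typed in by hand above, but the same preparation can be done in code. This is a minimal sketch assuming that only commas and periods need to be removed (so the hyphen in "sub-variety" survives); tolower() and gsub() are base R functions, while sentence, cleaned, and counts are names made up for the example.

```r
sentence <- paste0("WHEN we look to the individuals of the same variety or ",
                   "sub-variety of our older cultivated plants and animals, ",
                   "one of the first points which strikes us, is, that they ",
                   "generally differ much more from each other, than do the ",
                   "individuals of any one species or variety in a state of nature.")

# Lowercase everything, then drop commas and periods
cleaned <- tolower(sentence)
cleaned <- gsub("[,.]", "", cleaned)

# Split on spaces and tally the words
wordList <- unlist(strsplit(cleaned, " "))
counts <- table(wordList)

# The most frequently used words
sort(counts, decreasing = TRUE)[1:3]
```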

9.1.2 Extracting Substrings

It is not possible to use the normal subscripting approaches to access the individual

characters within strings because R treats the entire sequence of characters between

the quotation marks as a single item. However, you can extract internal components of

a string by using the substring() function.

> phrase <- paste0("A Goat, that was sitting next to the gentleman in white, shut his eyes and said ",
+                  "in a loud voice, 'She ought to know her way to the ticket-office, even if she doesn't know ",
+                  "her alphabet!'")
> substring(phrase, 34, 70)
[1] "the gentleman in white, shut his eyes"
> substring(phrase, 98)
[1] "'She ought to know her way to the ticket-office, even if she doesn't know her alphabet!'"

The function takes the string to be searched and the starting and ending locations in

the string and returns the characters in between. If you do not provide an ending

number, it will return all the characters up to the end. This is a shorthand way of saying

substring( phrase, x, nchar(phrase) ).

It is also possible to use vector notation in pulling out substrings by passing vectors to

the start and end arguments.

> startPositions <- c(34,3,58,172,67)
> endPositions <- c(36,6,61,174,70)
> substring(phrase, startPositions, endPositions)
[1] "the" "Goat" "shut" "her" "eyes"
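The equivalence between leaving off the end position and passing nchar() can be checked directly. This is a small sketch on a shortened version of the phrase; the variable names tail1, tail2, and words are assumptions for the illustration.

```r
# A shortened version of the phrase (illustration only)
phrase <- "A Goat, that was sitting next to the gentleman in white"

# Total number of characters in the string
n <- nchar(phrase)

# Omitting the end position is the same as asking for everything up to nchar()
tail1 <- substring(phrase, 34)
tail2 <- substring(phrase, 34, n)

# Vectors of start and end positions return one substring per pair
words <- substring(phrase, c(3, 34), c(6, 36))

print(tail1)
print(words)
```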

9.1.3 Concatenating Strings

Vectors of character data can be concatenated to form a single long string. This is very helpful in creating labels for graphs that have to include the value of a variable and for times when you need to open a lot of data files that have a predictable file naming scheme. In R, string concatenation is accomplished using the paste() function.

> stringVector <- substring(phrase, startPositions, endPositions)
> stringVector
[1] "the" "Goat" "shut" "her" "eyes"
> paste(stringVector, collapse=" ")
[1] "the Goat shut her eyes"
> paste(stringVector, collapse="|")
[1] "the|Goat|shut|her|eyes"
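The two uses mentioned at the start of this section, graph labels and predictable file names, rely on the sep argument rather than collapse. The sketch below is only an illustration; the site names, the ".csv" extension, and the value 3.14159 are made up.

```r
# Hypothetical site names (illustration only)
siteNames <- c("Richmond", "Petersburg")

# sep controls what goes between the pieces of each element;
# an empty sep glues the extension straight onto each name
fileNames <- paste(siteNames, ".csv", sep = "")

# Build a label that includes the value of a variable
meanValue <- 3.14159
axisLabel <- paste("Mean =", round(meanValue, 2))

# collapse joins the elements of one vector into a single string
oneString <- paste(siteNames, collapse = ", ")

print(fileNames)
print(axisLabel)
print(oneString)
```

In short, sep works across the arguments and collapse works along the resulting vector.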

9.1.4 Matching & Substitution

The final tasks we will look into in this section on string operations are matching and substitution. There are many times when you need to see whether a particular set of strings has a specific substring within it. This is the realm of matching and is primarily accomplished by the functions grep() and regexpr(). These functions allow you to use what are called Regular Expressions (RE) to scan through strings. While this is a very powerful method for pattern matching, and is something you should know if you are going to do any extensive work with strings, I am not going to cover it in this chapter. In fact, it probably needs its own chapter, and perhaps in a future version of this text I will include it. For those of you who work with string data on a regular basis, look up the regexpr function and have at it; it will make your life easier. For the rest of us, let's dig into grep for a few light matching exercises.

The grep function takes a pattern that you are looking for and a string that you want to

look into. A simple example would be:

> x <- "The quick brown fox jumped over the candle stick"
> grep("fox", x)
[1] 1
> any(grep("fox", x))
[1] TRUE
> any(grep("o", x))
[1] TRUE
> any(grep("dog", x))
[1] FALSE

In general, the grep function returns the integer positions of the elements of its second argument that contain the pattern. For a single string this is 1 when the pattern is found and an empty integer vector when it is not. I wrapped the grep function here inside the any() function because it collapses that result into a single logical value: TRUE if grep returned at least one position and FALSE otherwise.
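The position-returning behavior of grep is easier to see when it is given a vector of strings instead of a single one. The sketch below uses a made-up vector of species names; grepl() and the value=TRUE argument are standard parts of base R but were not introduced above.

```r
# A made-up vector of species names (illustration only)
taxa <- c("Pinus taeda", "Abies alba", "Pinus echinata", "Larix decidua")

# grep returns the positions of the elements that contain the pattern
hits <- grep("Pinus", taxa)

# value = TRUE returns the matching elements themselves
hitNames <- grep("Pinus", taxa, value = TRUE)

# grepl returns one logical value per element
isPinus <- grepl("Pinus", taxa)

# No match gives an empty integer vector, which any() turns into FALSE
noHits <- grep("Quercus", taxa)

print(hits)
print(hitNames)
print(isPinus)
print(any(noHits))
```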

It is also possible to substitute values in a string with new items. There are two functions

that perform string substitutions, sub and gsub. Both of these functions take at least three

arguments:

1. A pattern to match,

2. The string to replace the matched pattern with, and

3. The string to search within.

The sub function replaces the ﬁrst occurrence of the pattern whereas gsub replaces all of

them (the g stands for global).

> x <- "The quick brown fox jumped over the candle stick with all the kings men."
> sub("the", "THE", x)
[1] "The quick brown fox jumped over THE candle stick with all the kings men."
> gsub("the", "THE", x)
[1] "The quick brown fox jumped over THE candle stick with all THE kings men."
> gsub("the", "THE", x, ignore.case=TRUE)
[1] "THE quick brown fox jumped over THE candle stick with all THE kings men."

Both of these functions have optional arguments, the most common of which is ignore.case, which controls whether the case of the letters is taken into consideration when matching.
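Because the pattern is treated as a regular expression by default, characters such as the period have a special meaning. The sketch below shows the difference between sub and gsub on a taxon label and uses the fixed=TRUE argument (part of base R, though not covered above) to match a literal period. The variable names are assumptions for the illustration.

```r
label <- "Pinus caribaea var. hondurensis"

# gsub replaces every space, sub only the first one
underscored <- gsub(" ", "_", label)
firstOnly <- sub(" ", "_", label)

# fixed = TRUE treats "." as a plain period instead of "any character"
noDot <- gsub(".", "", label, fixed = TRUE)

print(underscored)
print(firstOnly)
print(noDot)
```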

9.1.5 Slightly More In Depth Examples: Genetic Sequence Analyses

Genetic sequences are essentially long character strings and R has a few different libraries available to you for the analysis of sequence data. I am not going to get into what a genetic sequence is; if you do not already know about it then you probably should not be calling yourself a biologist... In this section, we will:

1. Briefly discuss how we go about getting DNA sequence data

2. Learn how to align sequences

3. Import aligned sequence data into R

4. Create a distance matrix from the sequences

5. Use R to estimate a Neighbor-Joining tree from the sequence data

Getting DNA Sequence Data

The mother of all sequence repositories that you can access (without actually doing the sequencing yourself) is the NCBI web database located at http://www.ncbi.nlm.nih.gov/. Here you can run database queries based upon taxa, genes, groups, or whatever. The basic results of a search are given as an annotation (just below). This annotation has three parts:

1. The metadata in the top section, which contains the locus definition, size, who found it, references, and the taxonomy of the organism.

2. The ”FEATURES” of the record that describe what is in the sequences (coding and

non-coding regions if known), some geographical and taxonomic information that

has been standardized (good for data mining and putting on a map) as well as the

translation of genetic sequence into amino acids if appropriate.

3. The ”ORIGIN” which contains the raw sequence information.

An example of a record is given below

LOCUS       FJ347583                 278 bp    DNA     linear   INV 01-JUL-2009
DEFINITION  Araptus attenuatus haplotype 5 muscle protein 20 (MP20) gene,
            partial sequence.
ACCESSION   FJ347583
VERSION     FJ347583.1  GI:227345175
KEYWORDS    .
SOURCE      Araptus attenuatus
  ORGANISM  Araptus attenuatus
            Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta; Pterygota;
            Neoptera; Endopterygota; Coleoptera; Polyphaga; Cucujiformia;
            Curculionidae; Scolytinae; Araptus.
REFERENCE   1  (bases 1 to 278)
  AUTHORS   Garrick,R.C., Meadows,C.A., Nason,J.D., Cognato,A.I. and Dyer,R.J.
  TITLE     Variable nuclear markers for a Sonoran Desert bark beetle, Araptus
            attenuatus Wood (Curculionidae: Scolytinae), with applications to
            related genera
  JOURNAL   Conserv. Genet. 10 (4), 1177-1179 (2009)
REFERENCE   2  (bases 1 to 278)
  AUTHORS   Garrick,R.C., Meadows,C.A., Nason,J.D., Cognato,A.I. and Dyer,R.J.
  TITLE     Direct Submission
  JOURNAL   Submitted (26-SEP-2008) Department of Biology, Virginia
            Commonwealth University, 1000 West Cary Street, Richmond, VA 23284,
            USA
FEATURES             Location/Qualifiers
     source          1..278
                     /organism="Araptus attenuatus"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:634056"
                     /haplotype="5"
     gene            <1..>278
                     /gene="MP20"
                     /note="muscle protein 20; coding region not determined"
ORIGIN
        1 ctaaaatcaa cacttccgga ggacaattta aattcatgga aaacatcaac aagtaagaaa
       61 aaaataattt gacatgtaaa taatgtagag aaaattcata aacattccta ttttttattg
      121 atttgtcaat atttagtttg gaactaaact ctgacaatca attatacagg gtgacaattc
      181 taattacatt tccattcaat gccaactaga aatttcgtga aaaaaaaatt gtttctatgc
      241 caaacatact gttttataag atttaattcc agaaattt
//

Sequence Formats & Aligning Genetic Sequences

The format of sequence data like this is a bit verbose but very informative. When we work with sequence data we will use an abbreviated file format, the FASTA format. This format is very compact and, as a result, rather easy to use. In general, FASTA files are simple text files that have blocks of information for each sequence. Each block contains a summary line that must begin with the greater-than character (>) and can otherwise be anything you like. It is common to put the accession numbers, locus identifier, taxonomy, and other information into this line. The lines following the summary line are the raw sequence. If you want to have more than a single taxon in a file, you just put the next taxon block below the previous one and continue. In general they look like this (this is an excerpt from an example data set that you have in the class folder):

>Pinus caribaea var. hondurensis

GGTTCAAGTCCCTCTATCCCCACCCAGGTTCGGTCCCGAAAGGAYTGATCTATCTTCTCCAATTCCATTG

GTTCGAATCCATTCTAATTTCTCGATTCTTTTACCTCGCTATTTTTTTTTTTTCATGAAGAGAAGAAATT

AGAACATGAATCTTTTCATCCATCTTATGACAAGTTGAGTTGATCTGTTAATAAGTTGATCATATGATCA

ATTTATTTTGTGATATATGATCTACATAGAATAGATTAGATCNTTTTTAAATTATTCAATTGCAGTCCAT

TTTTATCATATTAGTGACTTCCAGATCGAAAATAATAAAGATCATTCTAAAAACTAGTAAAAATACCTTT

TTACTTCTTTTTAGTTGACACAAGTTAAAACCCTGTACCAGGATGATCCACAGGGAAGAGCCGGGGATAG

CTCATTTGGTAAACCAAAGGACTGAAAATCCTCGTGTCACCAGTTCAAAT

>Pinus echinata

ACCCAGGTTCGTTCCCGAACGGATTGATCTATCTTCTCCAATTCCATTGGTTCGAATCCATTCTAATTTC

TCGATTCTTTTACCTCGCTATTTTTTTTTTTCATGAAGAGAAGAAATTAGAACATGAATCTTTTCATCCA

TCTTATGACAAGTTGAGTTGATCTGTTAATAAGTTGATCATATGATCAATTTATTTTGTGATATATGATC

TACATAGAATAGATTAGATCATTTTTAAATTATTCAATTGCAGTCCATTTTTATCATATTAGTGACTTCC

AGATCGAAAATAATAAAGATCATTCTAAAAACTAGTAAAAATACCTTTTTACTTCTTTTTAGTTGACACA

AGTTAAAACCCTGTACCAGGATGATCCACAGGGAAGAGCCGGGATAGCTCAGTTGGTAGAGCAGAGGACT

GAAAATC
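Since a FASTA file is just lines of text, the string functions from earlier in this chapter are enough to pull it apart. The following is a minimal sketch in base R, not the approach used by any sequence library; the two short records and the taxon names are made up, and with a real file the lines would come from readLines() instead of being typed in.

```r
# Made-up FASTA content; with a real file use readLines("yourFile.fasta")
fastaText <- c(">Taxon one",
               "ACCCAGGTTCG",
               "TTCCCGAACGG",
               ">Taxon two",
               "GGTTCAAGTCC")

# Summary lines are the ones that begin with ">"
isHeader <- grepl("^>", fastaText)

# Number the blocks: every summary line starts a new one
blockId <- cumsum(isHeader)

# Glue the sequence lines of each block into one long string
seqs <- tapply(fastaText[!isHeader], blockId[!isHeader], paste, collapse = "")
seqs <- as.vector(seqs)

# Use the summary lines (without the ">") as names
names(seqs) <- sub("^>", "", fastaText[isHeader])

print(seqs)
print(nchar(seqs))
```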

When conducting analyses of genetic sequence data, it is important that you are confident that all the sequences you have are of homologous portions of the genome. For the example I used here, I downloaded some genetic sequence data for a handful of conifers in the family Pinaceae from the NCBI website. The sequence I was looking for is a common intergenic spacer region between the trnL and trnF genes, which encode transfer RNAs. These sequences were between 390 and 470 base pairs in length and are in the file named conifers.fasta in the folder for this chapter. I cleaned up the summary lines in this file so it only has the genus and species names rather than all the other stuff. This makes it a bit easier for you in the future when you interact with the data.

Before I played with these sequences, I ran an alignment on them to make sure we were

dealing with the matching sequences across taxa. There are many ways to do this and I

just used the online ClustalW server at http://align.genome.jp to align the sequences for

me. This is not something you want to do by hand and it is much better to let a computer

do some of the work for you. This algorithm aligns all the sequences and returns the

ﬁle in a clustal format. This is another text ﬁle but this time all the species have been

displayed in blocks with homologous sequence locations in the same text column. An

example of this is shown below with gaps (insertions/deletions) indicated as the dash

character (−).

Pinus caribaea var. hondurensi  CC---CACCCAGG-TTCGGTCCCGAAAGGAYTGATCTATCTTCTCCAATT
Pinus taeda                     ------ACCCAGG-TTCGTTCCCGAACGGATTGATCTATCTTCTCCAATT
Pinus ponderosa                 ------ACCCAGG-TTCGTTCCCGAACGGATTGATCTATCTTCTCCAATT
Pinus echinata                  ------ACCCAGG-TTCGTTCCCGAACGGATTGATCTATCTTCTCCAATT

This ﬁle is also located in the folder for this chapter and is called conifers.aln and this is

the ﬁle we will be working with.

Getting Aligned Sequences Into R

R does not by default recognize sequence data as anything more elegant than a sequence

of characters. As a result, several people have developed libraries for you to use that have

a lot of general functionality to them. In this section, I am going to use the library ape. If

you do not have this library installed on your machine, see Appendix B for an overview

of the process.

I am assuming that you currently have the data file in a location that you can easily reach from within R. To load the aligned sequences into R, type the following:

> library(ape)
> seqs <- read.dna("conifers.aln", format="clustal")
> class(seqs)
[1] "DNAbin"
> summary(seqs)
23 DNA sequences in binary format stored in a matrix.
All sequences of same length: 526
Labels: Abies alba Abies kawakamii Abies veitchii Abies homolepis Larix potaninii Cedrus atlantica ...
Base composition:
    a     c     g     t
0.310 0.187 0.160 0.343

There are several things that you can do with these aligned sequences. You can look for motifs, examine GC content, etc. I will leave these options for you to play with later in the exercises.

Constructing A Neighbor Joining Tree

To construct a Neighbor Joining (NJ) tree, we first need to create a distance matrix that estimates the distances between pairs of sequences that we have in our file. There are several different kinds of distance metrics that you can use in the calculation of this distance matrix (see ?dist.dna for more information on these). We will use the default, which is Kimura's 2-parameter model, called "K80".

> D <- dist.dna(seqs)
> class(D)
[1] "dist"
> summary(D)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.00000 0.07252 0.09310 0.26890 0.15720 1.45700

The function dist.dna() takes as an argument a set of sequences that you have read in (they must be of class DNAbin as shown above) and returns the distance matrix. The distance matrix, D, is a particular kind of matrix that holds only the lower triangle of the pair-wise distance calculations. If you print it out, you will get a whole lot of output as it prints the taxa names for row and column headers.

Since D is a general distance matrix, we can look at the values in it. Figure 9.1 shows a histogram of the distance values that have been estimated in D. From this we see that many of the values are low, meaning that those sequences are very similar to each other, and that there are two or three peaks at larger values, suggesting some degree of sequence divergence.
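To make the idea of a lower-triangle distance object concrete without needing any library, the sketch below computes the simplest possible distance, the proportion of sites that differ between two aligned sequences, for three made-up sequences. This is not the K80 model that dist.dna() uses by default, and the sequences and taxon names are assumptions for the illustration.

```r
# Three made-up aligned sequences of equal length (illustration only)
seqs <- c(taxonA = "ACCCAGGTTC", taxonB = "ACCCAGGTTG", taxonC = "TCCGAGGATG")

# One row per sequence, one column per aligned site
bases <- do.call(rbind, strsplit(seqs, ""))

# Proportion of differing sites for every pair of sequences
n <- nrow(bases)
P <- matrix(0, n, n, dimnames = list(names(seqs), names(seqs)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    P[i, j] <- mean(bases[i, ] != bases[j, ])
  }
}

# as.dist keeps only the lower triangle, giving an object of class "dist"
D <- as.dist(P)
print(D)

# Bin the pairwise distances; drop plot = FALSE to draw the histogram
h <- hist(as.vector(D), plot = FALSE)
print(h$counts)
```

An object built this way has the same class as the one returned by dist.dna(), so summary() and histograms work on it in the same manner.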

To create an NJ tree from these distances, we use the function nj().

> njTree <- nj(D)
> class(njTree)
[1] "phylo"
> summary(njTree)

Phylogenetic tree: njTree

  Number of tips: 23
  Number of nodes: 21
  Branch lengths:
    mean: 0.03838704
    variance: 0.01999758
    distribution summary:
      Min.    1st Qu.     Median    3rd Qu.       Max.
-0.0009736  0.0000000  0.0004898  0.0150700  0.8610000
  No root edge.
  First ten tip labels: Abies alba
                        Abies kawakamii
                        Abies veitchii
                        Abies homolepis
                        Larix potaninii
                        Cedrus atlantica
                        Larix decidua
                        Cedrus deodara
                        Larix laricina
                        Pinus roxburghii
  No node labels.

Figure 9.1: Histogram of distance estimates among all sequences using the "K80" model of substitutions

This function takes a distance matrix and returns a tree that is of the class phylo. We can see that the variable njTree holds some internal information that may be of interest (e.g., branch lengths, etc.), but the real way we can understand it is by looking at a graphic of the tree. To do this, we use the plot() command and pass it the njTree variable as plot(njTree).

Footnote 2: You may be surprised by the utility of the plot function as it seems to know how to plot everything. In actuality, this function is simply a wrapper that takes whatever you pass to it and determines if the class of the object you passed has its own plot command. For the tree, the native command is plot.phylo() and you have to look up that command to see the available options for it.

The topology of the tree (Figure 9.2) is easy to interpret and it is quite obvious where those very large distances shown in Figure 9.1 come from. From this topology we can see that:

1. The Pinus species are generally together, forming a polytomy that connects to the other genera in the family.

2. The Larix, Abies, and Cedrus form generally self-contained groups.

3. The most divergent groups are the Picea and Keteleeria samples.

Figure 9.2: Neighbor joining tree based upon the trnL-trnF intergenic spacer sequences and the "K80" model of sequence evolution.

There is quite a bit more that can be done here but I think that is enough to get you on

the right track if you are interested in using R for some basic sequence analysis.
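The way plot() hands a phylo object over to plot.phylo() can be imitated with a class of your own. The sketch below is only an illustration of that lookup: the class name toyTree and its contents are made up, and the method prints a message instead of drawing anything.

```r
# A made-up object with a made-up class (illustration only)
toy <- structure(list(tips = c("A", "B", "C")), class = "toyTree")

# Naming a function plot.<class> is what lets plot() find it
plot.toyTree <- function(x, ...) {
  message("plot.toyTree() was called")
  invisible(length(x$tips))   # return the number of tips without printing it
}

# plot() looks at class(toy) and passes the object on to plot.toyTree()
nTips <- plot(toy)
print(nTips)
```

The same lookup is why summary() and print() also behaved sensibly for the DNAbin, dist, and phylo objects above.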

9.2 Producing Formatted Output

Often in the use of R there is a need to produce a particular kind of output from an analysis or to display the contents of a particular variable. R does a pretty good job itself, but it has some limitations. For example, you may want to print out a matrix of values but only have 2 decimal places printed for each entry. Or you may want to export a table of values as HTML so that you can copy and paste it into another program.

9.2.1 Formatting Strings For Printing

format(x, trim = FALSE, digits = NULL, nsmall = 0,
       justify = c("left", "right", "centre", "none"),
       width = NULL, na.encode = TRUE, scientific = NA,
       big.mark = "", big.interval = 3,
       small.mark = "", small.interval = 5,
       decimal.mark = ".", zero.print = NULL, drop0trailing = FALSE, ...)
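A few of these arguments cover most everyday needs. The sketch below is a small illustration with made-up values; it shows one way to get the two-decimal matrix mentioned at the start of this section by rounding first and then asking for at least two digits after the decimal point.

```r
# A small made-up matrix (illustration only)
m <- matrix(c(0.1678067, -1.0302831, 0.4396607, 0.8856766), nrow = 2)

# Round first, then ask for at least two digits after the decimal point
twoDp <- format(round(m, 2), nsmall = 2)

# nsmall pads with trailing zeros when needed
half <- format(0.5, nsmall = 2)

# width pads character strings (left-justified by default)
padded <- format("Row 1", width = 8)

# big.mark inserts a separator every three digits
big <- format(1234567, big.mark = ",")

print(twoDp)
print(c(half, padded, big))
```

Note that format() always returns character strings, so the result is for display and not for further arithmetic.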

9.2.2 Formatting Tables

Tabular data is one of the most common kinds of output that you will want to move from R into another format. Tables are common features of statistical analysis and as such you will find it necessary to cut a table out of R and paste it into a document in the same way that graphics can be exported from R to be used in your manuscripts and reports.

For these examples, I will just create a matrix of values and add row and column names using the functions rownames and colnames.

> x <- matrix(rnorm(12), nrow=3)
> x
           [,1]      [,2]       [,3]       [,4]
[1,]  0.1678067 0.8856766 -0.3955881  0.7677516
[2,] -1.0302831 0.7392326 -0.8333904 -0.3235135
[3,]  0.4396607 1.7622323 -0.8763023  0.6091688
> colnames(x) <- c("Header A", "Header B", "Header C", "Header D")
> rownames(x) <- c("Row 1", "Row 2", "Row 3")
> x
        Header A  Header B   Header C   Header D
Row 1  0.1678067 0.8856766 -0.3955881  0.7677516
Row 2 -1.0302831 0.7392326 -0.8333904 -0.3235135
Row 3  0.4396607 1.7622323 -0.8763023  0.6091688
> library(xtable)
> theMatrixTable <- xtable(x, caption="Caption For Table", align="l|cccc")

The variable theMatrixTable is now an xtable object. What we do with it at this point depends upon how you want to interact with it.

Getting LaTeX Output

If you print it out as is, it will display the contents in LaTeX, a typesetting language that is used to create very nice looking manuscripts and books (this entire book has been written in it). If you use LaTeX to write your manuscripts then you are set; the listing that follows shows the formatting that results, and Table 9.1 shows what it looks like when it is inserted into a LaTeX document.

% latex table generated in R 2.8.0 by xtable 1.5-4 package
% Wed Dec 31 14:22:46 2008
\begin{table}[ht]
\begin{center}
\caption{Caption For Table}
\begin{tabular}{l|cccc}
\hline
 & Header A & Header B & Header C & Header D \\
\hline
Row 1 & 0.17 & 0.89 & -0.40 & 0.77 \\
Row 2 & -1.03 & 0.74 & -0.83 & -0.32 \\
Row 3 & 0.44 & 1.76 & -0.88 & 0.61 \\
\hline
\end{tabular}
\end{center}
\end{table}

Table 9.1: Caption For Table

Header A Header B Header C Header D

Row 1 0.17 0.89 -0.40 0.77

Row 2 -1.03 0.74 -0.83 -0.32

Row 3 0.44 1.76 -0.88 0.61

You can also print the table to a file by calling the function print(theMatrixTable, file="thefileName.tex"). There are several other options available to you with the print function; see ?print.xtable for more information.

Exporting In HTML for Web or Word

If you do not use LaTeX and are a biologist who does a lot of mathematical, programming, or scientific work then you should be. That being said, there are many people for whom a general overpriced and underpowered word processor (which shall remain nameless but is buggy and prone to viruses and screwing up your manuscripts, you know which one I mean) is the best they can expect to master. The xtable can be exported into a format you can open up in said program by first exporting the file as type="html". To export it as such, call the command print(theMatrixTable, type="html", file="MyHTMLizedTable.html") and the table will be saved. You can then open it up in your favorite word processor and it will turn the html table into a normal table that you can manipulate in your documents. An example of the html markup that this function produces is given below and an image of it is presented in Figure 9.3.

<!-- html table generated in R 2.8.0 by xtable 1.5-4 package -->
<!-- Wed Dec 31 14:22:51 2008 -->
<TABLE border=1>
<CAPTION ALIGN="bottom"> Caption For Table </CAPTION>
<TR>
<TH> </TH>
<TH> Header A </TH>
<TH> Header B </TH>
<TH> Header C </TH>
<TH> Header D </TH>
</TR>
<TR>
<TD> Row 1 </TD>
<TD align="center"> 0.17 </TD>
<TD align="center"> 0.89 </TD>
<TD align="center"> -0.40 </TD>
<TD align="center"> 0.77 </TD>
</TR>
<TR>
<TD> Row 2 </TD>
<TD align="center"> -1.03 </TD>
<TD align="center"> 0.74 </TD>
<TD align="center"> -0.83 </TD>
<TD align="center"> -0.32 </TD>
</TR>
<TR>
<TD> Row 3 </TD>
<TD align="center"> 0.44 </TD>
<TD align="center"> 1.76 </TD>
<TD align="center"> -0.88 </TD>
<TD align="center"> 0.61 </TD>
</TR>
</TABLE>

The HTML above produces a table that, when opened in Firefox, looks like that presented in Figure 9.3.

Figure 9.3: The html printout of a xtable as interpreted in Firefox. You can also import tables

saved as html into popular word processors and use them as normal table items in the creation of

your documents.
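The HTML markup in the listing is nothing more than strings glued together, so the paste-style functions from earlier in this chapter can produce something similar by hand. The sketch below is only an illustration of that idea on a made-up 2 x 2 matrix; it is not how the xtable library is implemented and it leaves out the caption and comments.

```r
# A made-up 2 x 2 matrix with row and column names (illustration only)
x <- matrix(c(0.17, -1.03, 0.89, 0.74), nrow = 2,
            dimnames = list(c("Row 1", "Row 2"), c("Header A", "Header B")))

# Turn the numbers into strings with two decimal places
cells <- format(x, nsmall = 2)

# One header row: an empty corner cell followed by the column names
headerRow <- paste0("<TR><TH> </TH>",
                    paste0("<TH> ", colnames(x), " </TH>", collapse = ""),
                    "</TR>")

# One table row per matrix row: the row name followed by its values
bodyRows <- vapply(seq_len(nrow(x)), function(i) {
  paste0("<TR><TD> ", rownames(x)[i], " </TD>",
         paste0("<TD align=\"center\"> ", trimws(cells[i, ]), " </TD>", collapse = ""),
         "</TR>")
}, character(1))

html <- c("<TABLE border=1>", headerRow, bodyRows, "</TABLE>")
writeLines(html)
```

Passing a file name as the second argument of writeLines() would save the markup instead of printing it.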

9.3 Plotting Special Characters

There are some special characters that you should be aware of when trying to get your data output into a readable format. These are not characters that you type as visible symbols; rather, they are ones that have their own keys on the keyboard or no visible form at all, namely the tab character, the newline character, and the bell character.

All the characters on your keyboard (assuming that you are using an en_US keyboard) are encoded as single values in ASCII (ASCII stands for the American Standard Code for Information Interchange). Since the first A stands for American, there are a lot of characters that you see on a computer screen that you cannot type directly on a keyboard, such as letters with accents, Greek characters (α, Λ, Ω), and all those other non-US-English characters and hieroglyphs. The terminal that you are running R from may not handle these characters, but you can get them into plots that you make.
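Inside an R string these special characters are written as escape sequences: "\t" for tab, "\n" for newline, and "\a" for the bell. The sketch below is a small illustration with made-up column names and values; each escape sequence counts as a single character.

```r
# Join two column names with a tab character
line <- paste("Site", "Count", sep = "\t")

# Join two lines with a newline character
twoLines <- paste(line, "Richmond\t12", sep = "\n")

# cat() interprets the escapes; print() would show them as \t and \n
cat(twoLines, "\n")

# Each escape sequence is one character, not two
tabLength <- nchar("\t")

# The same characters can be used to split text back apart
fields <- strsplit(line, "\t")[[1]]
rows <- strsplit(twoLines, "\n")[[1]]

print(tabLength)
print(fields)
```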

R has the nice ability to produce slightly complicated output for the axes of your plots as well as for putting into most graphics you produce. Items such as subscripts, superscripts, and mathematical symbols are easily produced using just a few different functions.

The primary way of producing formatted text for graphics output is through the use of the expression function. The best way to see what R can do in terms of nice mathy output is to look at its own demo. So, start R and type:

> demo(plotmath)

This command will show you a small number of tables in a figure window that have examples of the different kinds of math plotting that R handles. When R sources the demo script it passes the optional echo=TRUE parameter, so all the commands that are used to produce each table are also shown in the R command interface. This way you can see how each of the cells in the displayed tables is being encoded. An example of some of the copious output is:

> draw.plotmath.cell(expression(italic(x)), i, nr); i <- i + 1
> draw.plotmath.cell(expression(bold(x)), i, nr); i <- i + 1
> draw.plotmath.cell(expression(bolditalic(x)), i, nr); i <- i + 1

The demo script itself defines the function draw.plotmath.cell(), so don't worry about that part. The part you should focus on is the expression(bold(x)) part. There are several options that you can pass to the expression function and it is not quite worth listing them all here since you can see them in the R demo itself. However, I will show some of the more common methods in the plot shown in Figure 9.4.

> x <- rnorm(100)
> y <- 23 + 1.4*x + 2*rnorm(100)
> plot(x, y, bty="n", ylab=expression(X[stuff]), xlab=expression(chi^2), col="red")

For both the x- and y-axes, I use the expression function to create labels with subscripts and superscripts. If you like, you can define these values as individual variables prior to plotting to keep the plot command a bit cleaner; there is really no difference in the speed at which R would evaluate them. Here is another example:

> xlabel <- expression(bold(x[i]))
> ylabel <- expression(italic(x[i]^2))
> plot(x, y, bty="n", xlim=c(0, 20), type="l", lwd=2, col="blue", xlab=xlabel, ylab=ylabel)

Look at the demo(plotmath) output to see the diversity of plotting approaches.
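Two further patterns come up often enough to sketch here: mixing ordinary text with symbols, and putting the value of a variable into a label. The example below is an illustration only; the label text and the value 23 are made up, and bquote() is a base R function that was not introduced above.

```r
# A single Greek letter and a superscript
labMu <- expression(mu)
labChi <- expression(chi^2)

# paste() inside expression() mixes plain text with symbols
labMixed <- expression(paste("Mean height, ", bar(x), " (cm)"))

# bquote() substitutes the current value of n where .(n) appears
n <- 23
labValue <- bquote(italic(N) == .(n))

# Draw to a throw-away device; drop pdf(NULL) and dev.off() when working interactively
pdf(NULL)
plot(1:10, xlab = labChi, ylab = labMixed, main = labValue)
dev.off()

print(labValue)
```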

Figure 9.4: Example of using the expression function to annotate a graphic.

9.4 Useful Functions

The following functions were introduced in this chapter and you will be required to use

them for the exercises. To get more information on any of these functions, use the R

help system.

• any(x) Returns TRUE if at least one element of x is TRUE (used here to test whether grep found any match).

• cat(x) Concatenates the objects in x and prints them to the interface.

• format(x) Formats the object x for rigid (some say pretty) printing.

• substring(x,s,f) Takes the string in x and returns the substring starting at position s and finishing at position f.

• strsplit(x,c) Splits the string x on the character (or characters) in c.

• nchar(x) Returns the number of characters in the string x.

• expression(x) Takes the unevaluated expression x and holds it so that it can be drawn as formatted (mathematical) text in a plot.

• nj(x) Estimates a neighbor joining tree from the distance matrix x.

• unlist(x) Takes the list x and returns it as a vector.

9.5 Exercises

The following exercises are meant to help you understand the items presented in this

Chapter.

1. Create a table from the data data <- matrix(rnorm(9), nrow=3) and label the rows as c("Richmond", "Petersburg", "Varina") and the columns as c("PPM(A)", "PPM(B)", "PPM(C)"). Use the xtable library to export this table as HTML and then import it into your answers. (This is a very helpful methodology for getting formatted data out of R and into your manuscripts.)

2. Create a table of the different words found on the ﬁrst page of the Chapter entitled

Preface in this text.

3. Use the strsplit function to break apart the raw text of the first four paragraphs of the Chapter entitled Preliminaries into sentences (HINT: use "." as the character to break apart the string on, together with fixed=TRUE since a period is otherwise treated as a regular expression; you can copy and paste the text from the pdf). Then use the grep command to find the sentences that have the word are in them.

4. Show how you would use the sub command to ﬁx the sentence, ”Dr. Dyer is a loser”?

(And when I say ”ﬁx” I mean make it say that I am not...)

5. How many characters are in the ﬁrst paragraph of this Chapter?

6. Create a density plot of the χ² distribution and make the main label say "χ" using the expression() function (hint: this character is called 'chi').

7. In the previous graph, plot a dotted vertical line to indicate where the mean value of the distribution is and put the "µ" symbol next to it.

8. Use the aligned sequences to create a few different distance matrices by changing the model type that you pass to the function dist.dna(). Do alternate distance models have different densities of values? (Hint: plot a density plot for each distance matrix on the same graph similar to what is shown in Figure 9.1).

9. Do these different distance models produce different tree topologies when using the

nj () function? If so, show the trees and describe the differences you see in the trees.

10. Do the functions nj(), fastme.bal(), and bionj() produce the same looking topologies? You should read the help pages for these functions to see what they are as you probably haven't worked with them yet. Explain.

Part III

Extending R

Chapter 10

Basic Scripts

When I use the term script here, what I am referring to is a set of R commands that you

put into a text ﬁle and have R evaluate. Learning how to write scripts will help you out

in the following ways:

1. In general we are all lazy. It seems to be a monumental task to type the same thing

into R over and over again. Scripts allow you to put your commands into a text ﬁle

and have R run them for you.

2. Keeping your analyses and data sets together is a great way for you to not lose a record of what you have done. At a later date, you can come back and pick up where you left off. If you have more data or another angle at the analysis, having a record of how the previous analyses were performed is a huge benefit.

3. There are times when you have to do the same thing over and over again, say make graphs of a large number of variables or transform a lot of different data sets using the same algorithm. If you put the commands in a script (and, later, use the programming constructs of Chapter 11 and the functions of Chapter 12), you can run it over and over again with ease (remember the lazy thing?).

So in essence, scripts are enablers for our laziness.

In this chapter, you will focus on the following topics:

• Learn about basic script writing

• Understand differences between code evaluated from a script and that same code

typed into the interactive R command line

• Execute scripts in R

10.1 Writing Scripts

A script is nothing more than a series of commands that R recognizes and evaluates. Within a script, you can define data (Chapter 2), functions (Chapter 12), or other operations. It is convenient to have a record of the commands that you use in R to produce output.

10.1.1 Knowing Directories

A script must be a plain text file and it must reside in a location that you can point R to. When you start an interactive session in R, it notes the current directory that you are using. This is what is called the cwd or current working directory. If you are using R from a GUI-ish installation such as on Windows, you have to tell R which directory to use as a starting place. You can change the cwd from the "Change dir..." command in the "File" menu. If you are starting R from a terminal (in OSX or some Unix variant), then the directory where you started will be the cwd.
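The cwd can also be inspected and changed from within R itself using getwd() and setwd(). The sketch below is an illustration only: it switches to R's temporary directory as a stand-in for your own project folder, and the file name is the example name used later in this section.

```r
# Remember where we started so we can come back
startDir <- getwd()

# Switch to another directory; tempdir() stands in for your project folder
workDir <- tempdir()
setwd(workDir)

# Files named without a path are now looked for in the new cwd
dataFile <- file.path(getwd(), "DogwoodGerminationRates27.csv")
visibleFiles <- list.files()

# Go back to where we started
setwd(startDir)

print(dataFile)
```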

Here are a few tips that I ﬁnd helpful when I work with R :

• It is a pretty good idea to keep your data sets and the scripts that you use to analyze

these data in the same directory. Use your descriptive skills in naming your data

and scripts such that you know what is contained in the ﬁle without looking at

it (e.g., perhaps a data set named DogwoodGerminationRates27.csv and the R

script as AnalysisOfDogwoodGermination.R; just makes it easier).

• It is also a good idea to make sure that you separate your directories of data and

associated scripts such that it is easy for you to ﬁnd the right directory. Keeping

it all mashed together into a single directory can cause problems with data sets

having the same name (e.g., the infamous data.txt).

• Always provide labels for each column of data. At some time in the future you will need to look at the data set and figure out what each column of data represents.

• In your scripts, provide a lot of comments. Lines that start with the hash character (#) are ignored by R and you can use them for adding comments about what the script, program, functions, or variables actually mean. I cannot emphasize this enough. You will leave this class and at some point in the future look back on some script you wrote and want to figure out how it works, and without copious comments you will fail and have a small sense of being a genuine loser. You have been warned.

10.1.2 The Editor

You can write a script in any basic text editor. For some installations of R, there is a pseudo-GUI associated with it (e.g., Windows) because there is no real command line terminal in the OS. This interface to R often has an integrated editor built into it and if it is there you should probably use it unless you have another editor of choice that you feel more comfortable with.¹ If you do not want to use the supplied editor or do not have one available, you may want to check out TEXTMATE or TEXTWRANGLER on OSX, E or CRIMSON EDITOR (or the million others that are on this pedestrian platform) on Windows, and for Unix/Linux you can use GEDIT, KATE, EMACS or VI (n.b. If you learn one of these last two you will never need another editor on any platform).

¹ There have literally been decades of wars fought over the choice of the real editor. If you are interested in cultural aspects of programming and programmers (e.g., nerds like myself) fire up a Google search for "vi vs. emacs" and sit back and enjoy.

The important component of the editor that you are looking for is one that understands

R (or SPlus) and can provide you with syntax highlighting, parenthesis matching, and

automatic indentation. These are things that just make your life easier. After all, if you

are going to be spending a lot of time in front of your computer, you may as well have

tools that help instead of get in the way. Speaking of getting in the way, you should

never, under any circumstance, even think of using Word to do any of this.

OK, so open your editor and we will make a very small script that does something entirely useless. There is a data set named ScriptExampleData1.txt in the class folder. Make sure your script is saved in the same directory as the data file. In R type the following code and see what happens.

> theData <- read.table("ScriptExampleData1.txt", header=T, sep=",")
> summary(theData)
 Population     Height          Sex
 A:5        Min.   :23.40   Female:5
 B:4        1st Qu.:27.70   Male  :4
            Median :29.70
            Mean   :30.04
            3rd Qu.:32.70
            Max.   :38.20
> range(theData$Height)
[1] 23.4 38.2
> levels(theData$Population)
[1] "A" "B"

It should have loaded theData and provided a summary of it as shown. If not, you are

probably not in the correct directory. Change to the right directory and redo.

Now, take the same code and put it into your script ﬁle. Obviously, you do not want to

copy the responses that the R engine had provided to you, just the commands that you

typed. Save the script as AnalysisOfScriptData.R (note you must have the .R sufﬁx on the

script ﬁle). Congratulations, you have written your ﬁrst script. In the next section we

will evaluate the script and note a few differences.

10.2 Evaluating Scripts

The R engine can load and evaluate scripts relatively easily. Take a look at the docu-

mentation for the source() command by typing ?source into R and give it a read. OK, ready?

In R type source("AnalysisOfScriptData.R") and see what happens... Nothing. Why is this?

The same commands produced lots of output when typed directly into R ...

The issue is that when you are typing commands into R you are doing so in an interactive

mode. You say ”do this” and it says ”OK.” However, when you are executing the contents

of a script, it is not entirely clear where output should go, another ﬁle, to the screen,

some other place. As a result, if you want to get a response from stuff in a script you

need to tell R to print the results. So for example, if you change your script to look

like:


theData <- read.table("ScriptExampleData1.txt", header=T, sep=",")
print(summary(theData))
print(range(theData$Height))
print(levels(theData$Population))

and source it from R, you will get:

> source("AnalysisOfScriptData.R")
 Population     Height          Sex
 A:5        Min.   :23.40   Female:5
 B:4        1st Qu.:27.70   Male  :4
            Median :29.70
            Mean   :30.04
            3rd Qu.:32.70
            Max.   :38.20
[1] 23.4 38.2
[1] "A" "B"

Again, notice that here the output was only the response of the commands, the com-

mands themselves were not echoed to the R environment. You can get R to echo each

command and then provide the results when it is in a script by adding the optional

echo=TRUE option to the source() function as shown in the output below:

> source("AnalysisOfScriptData.R", echo=TRUE)

> theData <- read.table("ScriptExampleData1.txt", header=T, sep=",")

> print(summary(theData))
 Population     Height          Sex
 A:5        Min.   :23.40   Female:5
 B:4        1st Qu.:27.70   Male  :4
            Median :29.70
            Mean   :30.04
            3rd Qu.:32.70
            Max.   :38.20

> print(range(theData$Height))
[1] 23.4 38.2

> print(levels(theData$Population))
[1] "A" "B"

This is helpful if you are debugging a script (e.g., ﬁguring out why it is crashing or giving

you the wrong answers).
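The difference between the two ways of sourcing can be tried without the class data. The sketch below writes a three-line script to a temporary file (the script contents and the variable names are made up for this illustration) and captures what source() sends to the terminal with and without echo=TRUE.

```r
# Write a small three-line script to a temporary file
# (the contents are made up for this illustration)
scriptFile <- tempfile(fileext = ".R")
writeLines(c("x <- 1:10",
             "mean(x)",
             "print(range(x))"), scriptFile)

# Sourced quietly: only the explicit print() produces output
quiet <- capture.output(source(scriptFile))
print(quiet)

# Sourced with echo=TRUE: the commands and all results are shown
echoed <- capture.output(source(scriptFile, echo = TRUE))
print(echoed)
```

Notice that the value of mean(x) only appears in the echoed version, because that line was never wrapped in print().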

So, in a script, things won’t be printed out to the R terminal unless you tell it to. And

it is relatively appropriate to ask why you are wanting some things printed out as the

script is executing. The variables in a script are available in the main R memory so if

you deﬁne a new variable in the script, after the ﬁrst time you source() it, you will have

access to it. However, because you can add variables to the main memory of R from a

script, I typically erase all variables from memory at the beginning of each script using

the command rm( list=ls() ). This way it is easy to see that the variable x you are working

with is the real one and not another x you had used two hours ago. This is a very

important point. Again, we are thinking about the future here and we need to make sure

that the things that we do in our analyses are reproducible at some point in the future.

Relying on variables that are outside our script and are only in memory because we did something before running our scripts will lead to frustration (bet on it!).


In Chapter 9 there was a more complete discussion of how you can format your data for

printing. As you begin writing scripts right now, just focus on writing the routines that

you need to use to get an answer and later you can focus on making it look pretty.

10.3 Adding Comments To Your Code

Speaking of looking pretty, you must add comments to your code so that you remember

what is going on inside that ﬁle. To comment code in R you put a hash character at

the beginning of the section that you want to be commented. This will comment the line

from that point to the right. Everything to the left of the hash character is considered

code that will be evaluated.

x <- 20 # this comment will let the assignment happen
# this is a comment that spans multiple lines and won't
# be evaluated even if it has logical R code in it
# x <- 21
print(x)

Empty lines are also a nice feature to sprinkle through your scripts so that logical partitions can be identified. The R interpreter ignores all commented material and all lines that do not have anything on them, so you are not penalized for having them there.


10.4 Useful Functions

The following functions were introduced in this chapter and you will be required to use

them for the exercises. To get more information on any of these functions, use the R

help system.

• # Indicates the start of a comment. The R interpreter ignores everything to the right

of this symbol.

• rm(x) This function removes the variable x (or if x is a list of variable names all of

them) from memory.

• source(x) This function causes R to look for the script named x and evaluate its

contents from start to ﬁnish. This works just as if you had typed in the lines of the

script with the exception of how variables are printed out to the terminal.

• cat(x) This function dumps the contents of x to the GUI output as a single entity.

• print(x) Send the contents of the variable x to the terminal output.

• summary(x) Provides a summary of the variable x.


10.5 Exercises

The following exercises are meant to help you understand the items presented in this

Chapter.

1. How do you remove all variables from memory in the current workspace?

2. What happens when you set the optional argument verbose=TRUE when calling source?

3. Are you lazy?

4. Can R evaluate scripts that are written in Word or Excel?

5. How do you change the current working directory in R ?

6. How does the optional argument echo=TRUE change the output of sourcing a script in R?

7. How would you print the summary of a data frame from within a script?

8. What character is used to indicate a comment?

9. How would you comment out several lines of code in a script?

10. Why is it important to comment your code?


Chapter 11

Programming

Programming is the art of getting a computer, which does only and exactly what you tell it to do, to do the things you want it to do. The language that R uses for programming is derived from S (the language behind S-Plus) and will look familiar to anyone who has programmed in another language or seen other programming languages before.

In general, the majority of programming in R will be very linear. While it is possible to program in an object-oriented fashion (and indeed it is not a bad implementation, in my opinion), I won’t be covering that in this book. The programs that I’ll help you build will have a start, proceed through a set of operations, have some conditional statements, perhaps some loops, print out some stuff or save it to a file, and then exit. If you have never programmed before you need to think about programming as a kind of recipe, a very precise one. You need to think about the problem that you are going to solve by writing a program. And then you need to think about the exact steps that you will need to take to accomplish what you are attempting to do.

In this chapter, we will tackle a rather easy problem as a test case to show off how to construct a very simple program. The problem that we are going to deal with is how to measure canopy light from a hemispheric photo. An example of a hemispheric photo is given in Figure 11.1. This photo was taken by S.B. Weiss from the winter roosting habitat of the monarch butterfly in the Monarch Biosphere Reserve, Mexico. Because the image was taken through a hemispherical lens, it is easy to see the amount of canopy closure.

What we are going to do in this chapter is determine how much of that image is open

sky as a surrogate to measure available light in these forests. In the next few sections,

I will show some basic programming tools that we will use to write this program. Then

we will walk through the loading of an image and discuss how we get information from

and manipulate image data. Finally we will set out to write the program in a step-wise

fashion and ﬁnish with the completed program.

In this chapter, you will focus on the following topics:

• Be introduced to some basic programming logic and the corresponding R grammar.

• Develop a detailed pseudocode for a given program.

• In a step-wise fashion, develop and test the program.


Figure 11.1: Hemispherical photograph of winter roosting habitat at Monarch Biosphere Reserve, Mexico. Photo by S.B. Weiss made available by the Creative Commons Attribution 2.5 license.

11.1 Looping

As mentioned in Chapter 2, R is primarily a vector language. The consequence of this

is that if you are looking for a language to do fast loops through a data set, R is not it.

In fact, Perl or Python would actually be faster to do looping-like algorithms. That being

said, there are reasons we occasionally need to use loops in R and here is a general

overview.

A loop, when referred to in a programming language, is a sequence of statements that

are repeated over and over again until some condition is reached. The items inside the

loop are typically contained within curly brackets (e.g.,{}).

11.1.1 The While

The while looping construct is a good one to use if you have a particular condition which you want to check over and over again and perform some operations as long as the condition is in one state. The while loop has the form while(COND){ <code goes here> }. The COND term in the parentheses is evaluated as a logical statement each time you go through the loop and the loop will continue as long as COND is TRUE. When COND is FALSE, the loop exits and R starts to evaluate statements after the closing curly bracket. There can be a lot of code between the brackets.

The following example loops as long as x < 10 and prints out the value of x each time

through the loop.

> x <- 0
> while (x < 10) {
+   x <- x + 1
+   cat(x, " ")
+ }
1 2 3 4 5 6 7 8 9 10

When you start looping here, x = 0 and at each time through the loop, the variable x is

incremented and printed out on the console.

11.1.2 The For

Another common loop is one that actually focuses on the value of a counting variable (e.g., the index in the loop). What this looping construct does is combine the initialization of a counter, the stepping of that counter to its next value each time through the loop, and the exit from the loop once the counter has taken every value it was given. The general form of the for statement is for( COND ){ <code goes here> }. The COND can be one of many different constructs that sets up a counting variable. Here are some examples using the counting variable i and a vector x.

> for (i in seq(0, 9)) {
+   cat(i)
+ }
0123456789
> for (i in 0:9) {
+   cat(i)
+ }
0123456789
> x <- seq(0, 9)
> for (i in seq(length(x))) {
+   cat(x[i])
+ }
0123456789
> for (i in x) {
+   cat(i)
+ }
0123456789

For the COND the variable i is used as the counting variable along with the keyword in.
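One caution about the seq(length(x)) form used above: it behaves unexpectedly when x happens to be empty, because seq(0) counts from 1 down to 0. The sketch below (the variable names are made up for this illustration) compares it with the base function seq_along(), which returns an empty sequence in that case.

```r
x <- numeric(0)   # an empty vector, e.g. a selection that matched nothing

# seq(length(x)) is seq(0), which is 1 0, so the loop body runs twice
viaSeq <- 0
for (i in seq(length(x))) {
  viaSeq <- viaSeq + 1
}

# seq_along(x) is empty when x is empty, so the loop body never runs
viaSeqAlong <- 0
for (i in seq_along(x)) {
  viaSeqAlong <- viaSeqAlong + 1
}

cat("seq(length(x)):", viaSeq, "passes; seq_along(x):", viaSeqAlong, "passes\n")
```

For vectors that are never empty the two forms give the same loop, so this only matters when the length of x is not known in advance.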

11.2 Conditional Statements

The next tool in your R programming toolbox is the conditional statement. Conditional statements control the flow of logic through a script or program. There are many cases where you would like to run some command or sets of commands if some condition is true. For example,

if( CONDITION ) then RESPONSE

else if( OTHER_CONDITION ) then OTHER_RESPONSE

else FINAL_RESPONSE

Here the logic asks about the state of CONDITION and OTHER_CONDITION. If CONDITION is TRUE then RESPONSE is done and none of the other conditions are evaluated nor are their responses performed. The R interpreter just skips everything until the end of the set of conditionals. If CONDITION is not TRUE but OTHER_CONDITION is, then the only response to be performed is OTHER_RESPONSE. If neither CONDITION nor OTHER_CONDITION is true then FINAL_RESPONSE is performed. Note, only one response is ever performed each time.
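The outline above is written as pseudocode; R itself has no then keyword. A minimal sketch of the same three-way logic in actual R grammar might look like the following, where the height value and the size classes are made up for this illustration.

```r
# A made-up seedling height in centimeters
height <- 27.5

# Only the first branch whose condition is TRUE is evaluated
if (height < 10) {
  sizeClass <- "small"
} else if (height < 30) {
  sizeClass <- "medium"
} else {
  sizeClass <- "large"
}

cat("A plant of height", height, "is", sizeClass, "\n")
```

Putting each else on the same line as the closing bracket before it keeps R from treating the if statement as finished when the code is typed at the command line.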

In the example below, I set up a vector of boolean (TRUE|FALSE) variables and then loop through them one at a time and see what they are.

> observations <- as.logical(c(TRUE, FALSE, FALSE, TRUE, TRUE))
> observations
[1]  TRUE FALSE FALSE  TRUE  TRUE
> for (obs in observations)
+   print(obs)
[1] TRUE
[1] FALSE
[1] FALSE
[1] TRUE
[1] TRUE
> for (obs in observations) {
+   if (obs == TRUE)
+     cat(obs, "it is true \n")
+   else
+     cat("not\n")
+ }
TRUE it is true
not
not
TRUE it is true
TRUE it is true

We can also use conditional operators as a CONDITION in an if statement. In the example below, we cycle through the numbers 1 through 10 and, for each of them, determine if it is odd or even using the modulus operator %%. This operator returns the remainder after a division.

> for (i in 1:10) {
+   if (i %% 2)
+     cat(i, " is odd\n")
+   else
+     cat(i, " is even\n")
+ }
1  is odd
2  is even
3  is odd
4  is even
5  is odd
6  is even
7  is odd
8  is even
9  is odd
10  is even

Each time through, the remainder i %% 2 is evaluated. Possible values for this are 1 and 0 which, when coerced by as.logical(), turn out to be TRUE and FALSE respectively, so the appropriate message is printed.
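The coercion that if() performs here can be looked at directly. The short sketch below shows how numbers turn into logical values and then repeats the odd/even test with the comparison spelled out, which some readers find easier to follow; the variable isOdd is introduced only for this illustration.

```r
# Zero coerces to FALSE; any other number coerces to TRUE
print(as.logical(0:3))

# The same odd/even test with the comparison written out
for (i in 1:4) {
  if (i %% 2 == 1) {
    cat(i, "is odd\n")
  } else {
    cat(i, "is even\n")
  }
}

# The test also works on a whole vector at once
isOdd <- (1:4) %% 2 == 1
print(isOdd)
```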

11.2.1 Bracketing

There is a little bit of bracket magic going on here and I should take the time to make

a few comments. Notice in the previous listing, there were brackets {} surrounding the

content inside the for loop. These brackets are essential because there is more than

one line of code inside the for loop. If there were only one line (see previous code listing

where print(obs) is the only code inside the for loop) then the enclosing brackets are

optional.

As a general rule, after any conditional (e.g., the if/else if/else) or loop (e.g., while/for) if

there is only one line of code then you do not need to use brackets if you do not want to.

Examples include:

> if (rnorm(1) > 0.5)
+   print("greater")
> while (TRUE)
+   print("this will last forever")

This rule is recursive in that the “one line of code” may itself be a conditional or a loop followed by its own single line of code. In the next example, I loop through the numbers 1-10 and look for those even numbers that are not divisible by 4 (n.b. I could have used a compound conditional statement such as if( !(i%%2) && (i%%4) ) but that would have really screwed up my example).

> for (i in 1:10)
+   if (!(i %% 2))
+     if (i %% 4)
+       cat("the value=", i, "\n")
the value= 2
the value= 6
the value= 10

In some sense, you can think of these kinds of “one-liners” as just one-off shortcuts. There is nothing wrong with using brackets even in these cases. In fact, it may open up your code a bit and make it a bit easier to read in the future. You just do not have to use them.

However, where you want more than one statement to be executed after a loop or conditional statement then you must use brackets.
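What goes wrong when the brackets are left off can be seen in a small sketch. The counters countA and countB are made up for this illustration; the point is that indentation means nothing to R, so only the first statement after the for() belongs to the loop.

```r
# Without brackets, only the first statement belongs to the loop
countA <- 0
for (i in 1:3)
  countA <- countA + 1
  countA <- countA * 10   # indented, but runs once, after the loop

# With brackets, both statements run on every pass
countB <- 0
for (i in 1:3) {
  countB <- countB + 1
  countB <- countB * 10
}

cat("without brackets:", countA, " with brackets:", countB, "\n")
```

The two loops look alike on the page but give different answers, which is a good argument for using brackets whenever there is any doubt.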


11.3 Outlining A Program

The most difficult part of programming is understanding where to start. Writing a program, on the surface, appears to be a daunting task in itself. However, when I write

programs I tend to think of them not as a single large program but as a series of smaller

steps. The key to doing this is to understand the sequence of steps that we need to

accomplish so that the program can do what is required.

So, ﬁrst things ﬁrst. State what you want the program to do in speciﬁc terms. For

this Chapter we will be working on developing a program that calculates the amount of

canopy openness from a hemispheric image (Figure 11.1). If you haven’t already done

so, I recommend that you look at Chapter 7 to refresh yourself on how we work with the

internals of an image.

Next, we need to get out a sheet of paper and write down, exactly, how the program is

going to work. It is important that we include all the steps necessary and in the order in

which they are to be performed. An example of this would be:

1. Load image into memory

2. Determine what parts of image are ”open canopy”

3. Determine total area of image

4. Print out the proportion of canopy that is open.

So, each of these steps is a relatively easy one by itself and we will create the overall

program by breaking it up into manageable parts.

11.4 Creating A Program

It is often necessary to incrementally build a program. Using the outline in the previous

section, we can open a new ﬁle and create a script that does each of these items in suc-

cession. Typically, I ﬁnd it helpful to work on the R command line to test out particular

sets of commands and when I have it exactly like I like it then I move it to a script.

11.4.1 Step 1: Loading An Image Into Memory

In Chapter 7, we examined how to load images into memory, translate them into vari-

ous formats, and get into their knickers, so to speak. So to begin with, the image as I

retrieved it from Wikipedia is a JPEG image. I will begin by turning it into a PPM formatted

image as discussed in Chapter 7 using the program GIMP (http://www.gimp.org), al-

though you could use any image manipulation program and there are several free ones

available for you on the internets. The PPM ﬁle is what you have access to in the class

folder for Chapter 11.

> library(pixmap)
> img <- read.pnm(file="Hemiphoto monarch habitat1.ppm")
Read 637563 items

> plot(img)

Figure 11.2: The blue channel of the canopy picture displayed as a greyscale image.

Figure 11.3: A histogram of values in the blue channel (Figure 11.2).

Now we have the image loaded and a plot that is identical to that displayed in Figure

11.1 and we must ﬁgure out how to have it represented.

11.4.2 Step 2: What Is “Open Canopy”

The variable img has the following components and here we need to ﬁgure out what parts

of the image are the sky parts.

> names(attributes(img))
[1] "size"     "cellres"  "bbox"     "bbcent"   "channels" "red"      "green"
[8] "blue"     "class"

Remembering that there are three different channels in a PPM ﬁle, one for red, one for

green, and one for blue, perhaps we should look there ﬁrst. You can plot each of the

channels as an image by creating a pixmapGrey() image and see the intensity of each color

channel.

> plot(pixmapGrey(img@blue))
> plot(pixmapGrey(img@red))
> plot(pixmapGrey(img@green))

And from this you will see that the different channels look pretty much the same when evaluating the area that is considered the “sky” in this image. For our purposes, we will only use the blue channel as displayed in Figure 11.2.


So if that is the component of the image that we are going to use, we now need to deter-

mine which values to look for. To do this, you can easily make a histogram composed

of the values in the blue channel of the image using the command hist( img@blue ). We can

see from Figure 11.3 that there is a tremendous amount of values in this channel at the

low end, a peak at around 0.2 and another at the top end close to 1.0.

We can get a bit more speciﬁc with this image and plot the intensity of a particular row

of values in the blue channel to double check that we think values close to 1.0 should

represent light values and those near 0.0 are the dark regions. The following commands

create the image displayed in Figure 11.4 where the raw values along the 230th row of pixels (indicated by the red dashed line) are shown in blue. It is easy to see that the value in the blue channel gets larger as the dashed line crosses the image.

Figure 11.4: Intensity of blue channel values in the image as taken through a slice of the image (at pixel row 230 as indicated by red dashed line).

> plot(img, axes=T, bty="n", xlab="Image Width", ylab="Image Height")
> par(new=T)
> abline(230, 0, col="red", lwd=2, lty=2)
> par(new=T)
> plot(img@blue[230, ], bty="n", type="l", xlab="", ylab="", col="blue",
+      lwd=3, axes=F, ylim=c(-10,10))

So, at this point, we need to make a value judgement. We are fairly conﬁdent that values

close to one in the blue channel (and others you can go check yourself) represent areas

in the image where it is pretty light. But, we need to make a cut-off such that if we look

at a pixel, we can put it into the light or not-light category. For the purposes of this

exercise, I will assume that values that are ≥ 0.98 are to be considered as sky and I will

also make the restriction that I need the pixels in each channel to meet or exceed this

cut-off.

Now, to ﬁnd out how much of the image is sky (using this deﬁnition), we must:


1. Loop through every matrix and the items in each matrix.

2. Evaluate if the value should be considered as sky or not.

3. Use a variable to keep track of all the pixels that meet the criteria

So to our script, we will add the following lines of code

> numRows <- img@size[1]
> numCols <- img@size[2]
> numSky <- 0
> for (row in 1:numRows) {
+   for (col in 1:numCols) {
+     if (img@red[row, col] >= 0.98 &
+         img@green[row, col] >= 0.98 &
+         img@blue[row, col] >= 0.98)
+       numSky <- numSky + 1
+   }
+ }
> numSky
[1] 9624

So, in the image across all three color channels, we find a total of 9,624 pixels that can be considered to represent the sky.¹
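The counting logic can be tried out without the photograph. The sketch below builds three small matrices of random values as stand-ins for the color channels (the matrices, the seed, and the lower cut-off are all assumptions made for this illustration) and counts the qualifying pixels both with nested loops and with a single vectorized comparison.

```r
# Three small made-up color channels with values between 0 and 1
set.seed(42)
red   <- matrix(runif(20), nrow = 4)
green <- matrix(runif(20), nrow = 4)
blue  <- matrix(runif(20), nrow = 4)

# A lower cut-off than 0.98, since random toy data rarely
# exceed 0.98 in all three channels at once
cutoff <- 0.5

# Count with nested loops, one pixel at a time
viaLoop <- 0
for (row in 1:nrow(red)) {
  for (col in 1:ncol(red)) {
    if (red[row, col] >= cutoff &&
        green[row, col] >= cutoff &&
        blue[row, col] >= cutoff)
      viaLoop <- viaLoop + 1
  }
}

# Count all pixels at once; sum() treats TRUE as 1 and FALSE as 0
viaSum <- sum(red >= cutoff & green >= cutoff & blue >= cutoff)

cat("loop:", viaLoop, " vectorized:", viaSum, "\n")
```

Both approaches give the same count; the loop makes each step explicit, while the vectorized form is shorter and, on large images, typically much faster.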

11.4.3 Step 3: Determine The Total Area Of The Image

OK, finally we are almost finished. We now need to determine the total number of pixels in the image so that we can get a standardized percent of open canopy. We could use the total number of pixels, 461² = 212,521, but the image taken with the fish-eye lens is not square; rather, it is a circle that fits in a square whose side has 461 pixels. So, we need to figure out the area of this circle as:

> r <- 461/2
> totalArea <- pi * r^2
> totalArea
[1] 166913.6
> (461^2 - totalArea) / totalArea
[1] 0.2732395

As a side note, the last expression in the code listing shows the proportion by which we would bias our estimate if we just used the total number of pixels in the image; 27.3% is a reasonably sized bias!
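The size of this bias does not depend on the particular image. As a short worked check (writing s for the side of the square in pixels, a symbol introduced here only for illustration), the ratio simplifies to a constant:

```latex
\[
\frac{s^{2} - \pi \left( s/2 \right)^{2}}{\pi \left( s/2 \right)^{2}}
  = \frac{s^{2}}{\pi s^{2}/4} - 1
  = \frac{4}{\pi} - 1
  \approx 0.2732
\]
```

So any square image that just encloses a circular fisheye photo overstates the area by about 27.3%, whatever its resolution.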

11.4.4 Step 4: Print Out The Proportion Of Canopy That Is Sky

This part is fairly easy and doesn’t require much.

> numSky / totalArea
[1] 0.05765857

¹ While this part of the exercise was excellent at showing some of the programming paradigms and how they can be combined to give an answer, it is also true that Step 2 can be accomplished in R using the one-liner sum( img@blue >= 0.98 & img@green >= 0.98 & img@red >= 0.98 ). Here the three conditionals return a vector of logical variables, which the function sum() coerces into integers. While it would have been much shorter to do it this way, it would have negated all the quality teaching experiences that I was laying on you...

11.4.5 The Complete Program

The complete program is listed below with comments. There are a few changes in the

program that I made to make it a bit easier to work with. Comments should be self

explanatory and are indicated by lines that start with the hash character (#).

# removes all variables from memory at start of script
rm(list=ls())

# load the pixmap library to open the image
library(pixmap)

# I put the file name into a variable so
# it could be changed easily at the top
# of the file if necessary
fileName = "Hemiphoto monarch habitat1.ppm"

# I also put the criteria into a variable
# so we can change it in one place to see
# how the results differ
skyCriteria <- 0.98

# Read in the image and find the number of
# rows and columns in it
img <- read.pnm(file=fileName)
numRows <- img@size[1]
numCols <- img@size[2]

# Start the count of sky pixels at zero
numSky <- 0

# Loop through each row
for (row in 1:numRows) {
  # Loop through each column
  for (col in 1:numCols) {
    # Evaluate the cell in each channel
    # against the 'sky criteria'
    if (img@red[row, col] >= skyCriteria &
        img@green[row, col] >= skyCriteria &
        img@blue[row, col] >= skyCriteria)
      numSky <- numSky + 1
  }
}

# Find total area of fisheye circle
r <- numRows/2
totalArea <- pi * r^2

# Print out the proportion of canopy that is open
percentCanopyOpen = numSky/totalArea
cat("Canopy Opening:", percentCanopyOpen, "\n")

11.5 Synopsis

This has been a very simple little program that we made. Despite it being simplistic, it

does show you how to go about creating a simple analysis program. R is not a general


programming language and you are not going to make large programs with it. The key

to R is knowing how to get something put together, take it a step at a time, and break

the components into reasonably sized, easy to accomplish pieces. This is where you

start.

In Chapter 12 we will build upon what has been done here when we discuss Functions.

We can encapsulate code into functions and make our lives much easier. For now, play

around with the program and the exercises and get comfortable with typing code.


11.6 Useful Functions

The following functions were introduced in this chapter and you will be required to use

them for the exercises. To get more information on any of these functions, use the R

help system.

• x %% y The modulus operator. This returns the remainder of the division x/y.

• as.logical(x) Coerces x into a logical variable if possible. See 2.4.7 for more infor-

mation on logical variables.

• rm(x) This function removes the variable x from memory

• abline(a,b) This function plots a line with intercept of a and a slope of b in the current

graphics window.

• for(INDEX in SEQUENCE) A main looping construct that specifically uses the counter INDEX, which takes each value contained in SEQUENCE in turn.

• while(COND) A looping construct that continues to loop until some condition is met.

As long as COND==TRUE the loop will continue.

• if(COND) The evaluation of the condition COND. If it is TRUE then the next line following

the if statement is executed. If it is FALSE then the next line is skipped. You can

include several lines to be evaluated after this and other evaluation statements by

enclosing the code in curly brackets {}.

• else if(OTHER_COND) The second evaluation of a condition. This must not be the first

conditional (e.g., there is an else here that implies a previous if or else if statement

that this is following).

• else The last of a conditional, if all the previous ones did not turn out to be true,

then whatever follows the else will be evaluated. It is not necessary that you have

one of these at the end, you may want to not do anything unless some speciﬁc

conditions occur.


11.7 Exercises

The following exercises are meant to help you understand the items presented in this

Chapter.

1. Write a short program that lists all numbers from 1 to 100 and determines if they are

divisible by 2 and 3.

2. Write a program using the for () loop that prints the numbers from 42 down to 27, one

on each line.

3. List some of the assumptions that are included in how the variable numSky is deter-

mined.

4. Using the program we created in this Chapter, make a graph of percent canopy with

different cut-off values. In your opinion what would be the most biologically mean-

ingful cutoff?

5. Change the program to use a cutoff value based on the sum of the individual color

channel values rather than the current requirement that they all be simultaneously

over some threshold.

6. How many lines of output do you expect to get from the following code? HINT: Think

before you try to run this program.

while (1) {
  cat("All work and no play makes Dr. D a dull boy.\n");
}

7. Create an outline of the steps that would ﬁnd the number of values in a matrix that

is equal to or greater than 20.

8. Implement the program you outlined using the matrix M <−matrix( runif(25,10,30), nrow=5) as

your input. Make sure to comment your code appropriately.

9. What is the proper syntax for conditions passed to an if statement requires x to be

greater than 23 and y to be equal to or less than 4?

10. How many else statements can you have after an if statement?


Chapter 12

Functions

Throughout this book, we’ve used both built-in functions such as sqrt() and sum() as well as some that are located in external libraries that we had to load (such as skewness() in 4.3.1 and read.pnm() in 7.2). These functions have been really helpful in making your scripts look clean and readable and have made your life rather easy as you performed some basic statistical analyses. Think what a pain it would have been if you had to write the code every time you wanted to calculate the sqrt() of a number... (I’m not even sure how it is done).

Writing your own functions in R is a very useful way to save a lot of typing. You can consider a function a small self-contained bundle of instructions that you can call whenever you need to. Say you are picky about the way your graphics look, or that there is a particular set of routines that you use to translate your data from one format to another. Putting this code into a function and putting that function in a location where you can get access to it whenever you need it is a real treat.

In this Chapter you will learn the following skills:

• Learn the syntax required to write your own functions.

• Understand the scope of a variable and why you should care.

• Create a basic library of routines that you can use in the future.

12.1 Function Syntax

The format of a function basically has the following three parts:

1. The name of the function. The creation of a name for a function is just as important as for a variable. I find it helpful to try to make the name tell me what the function does (I’m funny that way), which means it typically starts with a verb such as convertMissingData(), removeLameExcuses(), or makeTheGraphTheWayILikeIt().

2. The assignment to the name. Right after the name you will have the assignment of the generic function() function to the variable (see the syntax below). This tells R that the name is not a variable but will actually be the name of a function.

3. The function contents. This is the part that you get to write. Here is where you put all the stuff together to do whatever it needs to do.

In general, these three parts are put together to look like:

doMyBidding <- function() {
  # Function Contents
}

Now this is a fairly boring function: it takes no arguments and doesn’t return anything useful to you. It is kind of like saying, “R, go to your special place and do something but don’t tell me what it is.” As you write functions, they will be considerably more complicated (and hopefully useful).

In this Chapter I will post in the raw code for the function itself followed by the output

of R from the command line. The straight posting of the function syntax allows you to

cut-and-paste them into the R interpreter (even though you will learn it better by typing

it).

Also, functions that you have defined are available in the local memory of the interpreter in the same way as local variables are. If you use the ls() command to list the items in memory it will show your function names alongside your variable names.

> ls()
[1] "doMyBidding" "x"
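Because a function lives in memory like any other object, the same tools that manage variables also manage functions. The sketch below assumes a throw-away function named tempFunction (a name invented for this illustration) and uses the base R functions exists() and rm() to show it appearing in and disappearing from memory:

```r
# Define a throw-away function (the name tempFunction is only for illustration)
tempFunction <- function() {
  "I exist"
}

# The name is now listed alongside any variables in memory
wasListed <- "tempFunction" %in% ls()

# class() confirms that the name refers to a function, not a data value
theClass <- class(tempFunction)

# rm() removes a function from memory exactly as it removes a variable
rm(tempFunction)
isStillThere <- exists("tempFunction", inherits = FALSE)
```

After rm() the name is gone, so calling tempFunction() again would produce an error until the function is defined once more.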

12.1.1 Returning Values From A Function

Most likely you are calling some function because you are interested in getting a response from it. It is not common to write functions that do not give you something back in return.

To return a value from a function, R uses the value of the last expression evaluated in the function, so you simply put the value (or the name of the variable holding it) on the last line of the function. An example of this is the following function that returns a single number.

gimmeANumber <- function() {
  42
}

> gimmeANumber()
[1] 42
> gimmeANumber()
[1] 42

And a slightly better function here that actually returns a random number:

gimmeAnotherNumber <- function() {
  x <- runif(1, 1, 100)
  x
}

> gimmeAnotherNumber()
[1] 87.3278
> gimmeAnotherNumber()
[1] 64.97312


You can also use the return() function to exit the function and, optionally, return a value. Here is an example that checks to see if the passed argument is the right kind. If it is not, it prints an error message and returns; otherwise it performs a calculation and then returns the result.

gimmeHalf <- function(theValue) {
  # check to see if it is a numeric value
  # if it is then return half
  if (is.numeric(theValue)) {
    return(theValue / 2.0)
  }
  # if it isn't then complain
  else {
    cat("The value", theValue, "is not a number, try again.\n")
    return()
  }
}

> gimmeHalf(12)
[1] 6
> gimmeHalf("Hello partner!")
The value Hello partner! is not a number, try again.
NULL

Notice that when the function left the else section by calling return() without any arguments, the function actually returned the value NULL, which R printed. If you do not want to see that NULL (something that merely signals that the value passed to the function may be incorrect), you can remove the last return() statement. The function then returns the result of the cat() call, which is an invisible NULL, so nothing extra is printed. Here is what that function would look like.

gimmeHalf <- function(theValue) {
  # check to see if it is a numeric value
  # if it is then return half
  if (is.numeric(theValue))
    return(theValue / 2.0)
  # if it isn't then complain
  else
    cat("The value", theValue, "is not a number, try again.\n")
}

> gimmeHalf(14)
[1] 7
> gimmeHalf("bob")
The value bob is not a number, try again.
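To see what the second version hands back, you can capture its result. The sketch below redefines the function under the made-up name gimmeHalfQuietly so that it stands on its own. It also shows stop(), a base R function not used in this chapter, as one possible way to signal a bad argument more forcefully; the name gimmeHalfStrict is likewise hypothetical.

```r
# Same logic as the second gimmeHalf(); the name gimmeHalfQuietly is hypothetical
gimmeHalfQuietly <- function(theValue) {
  if (is.numeric(theValue))
    return(theValue / 2.0)
  else
    cat("The value", theValue, "is not a number, try again.\n")
}

# The complaint is printed, and the captured result is NULL
# because cat() itself returns NULL invisibly
result <- gimmeHalfQuietly("bob")
is.null(result)

# An alternative (an assumption of this sketch, not the chapter's approach):
# stop() halts with an error instead of returning anything
gimmeHalfStrict <- function(theValue) {
  if (!is.numeric(theValue))
    stop("theValue must be numeric")
  theValue / 2.0
}

# tryCatch() lets us observe the error without halting the script
outcome <- tryCatch(gimmeHalfStrict("bob"), error = function(e) "caught an error")
```

Which style you prefer depends on whether a bad argument should be a gentle warning to the user or a reason to stop the analysis.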

Vector Arguments

By default your function above can work on vectors of values just as easily as on single numbers. This is because a vector of numbers will return TRUE when asked if it is.numeric() (see 2.4.8 for more on this). Here is an example,

> x <- seq(2, 20, by=3)
> is.numeric(x)
[1] TRUE
> is.vector(x)
[1] TRUE
> x
[1]  2  5  8 11 14 17 20
> gimmeHalf(x)
[1]  1.0  2.5  4.0  5.5  7.0  8.5 10.0

So by default, you can work with vectors of your values just as easily as with single numbers. This is pretty cool and you should try to remember the love that R has for vector operations, because it is much faster to call your gimmeHalf() function by passing it a vector of values than to use a loop to go through the vector and call gimmeHalf() for each individual value.
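To make the comparison concrete, the following sketch computes the same halves both ways. The helper name halve is invented here so that the block runs on its own; it simply repeats the arithmetic of gimmeHalf().

```r
# Stand-in for gimmeHalf(); the name halve is hypothetical
halve <- function(theValue) {
  theValue / 2.0
}

x <- seq(2, 20, by = 3)

# Vector approach: one call handles every element at once
vectorResult <- halve(x)

# Loop approach: one call per element, storing each answer by index
loopResult <- numeric(length(x))
for (i in seq_along(x)) {
  loopResult[i] <- halve(x[i])
}

# Both approaches give the same numbers; the vector call is just less work
identical(vectorResult, loopResult)
```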

Here is a slightly longer example of a function. Notice that inside the function, I have added some comments. This is a very good idea because it allows you to document what you are doing inside the function. In fact, I typically write functions by:

1. Writing the signature of the function, the funcName <- function() { } part.

2. Using comments, writing out the sequence of events that have to occur inside the function so I can see what needs to be done (breaking large problems into small ones here).

3. Filling in the code to allow R to do my bidding.

So let’s walk through these steps and make a function. The purpose of this function is to get a little encouragement for my programming endeavors by having R return some nice praise for me.

Step 1: Create signature. The signature for this function will be:

giveMeSomeMomLove <- function() {}

Step 2: Using comments, create the logic of the function. The overall goal of this function is to return a random statement from my mother, so I will have to set up some statements, find a random one, and then return it.

giveMeSomeMomLove <- function() {
  # set up a vector of loving mother sayings
  # pick a random number to use as index for responses
  # put the name of the vector and the index on the last line to return the saying
}

Step 3: Fill in the R logic. Now that I have the comments set out, it is fairly easy for me to use them as a guide in laying out the logic of the function. You do not have to document every line of code in your functions, but if you put in enough so that it is obvious what is going to happen next, you will find yourself being happy with your past self more often than hating what you had forgotten to do (?).

giveMeSomeMomLove <- function() {
  # set up a vector of loving mother sayings
  momSayings <- c("Honey, your dad and I think you are doing just fine.",
                  "Come home this weekend, I made your favorite dessert.",
                  "We think you are the BEST student at VCU.",
                  "You know I took calculus back in college, maybe I can help.",
                  "I just know you'll be able to find a good job after college.")
  # pick a random number to use as index for responses
  resp <- round(runif(1, 1, length(momSayings)))
  # put the name of the vector and the index on the last line to return the saying
  momSayings[resp]
}

> giveMeSomeMomLove()
[1] "We think you are the BEST student at VCU."
> giveMeSomeMomLove()
[1] "Honey, your dad and I think you are doing just fine."
> giveMeSomeMomLove()
[1] "You know I took calculus back in college, maybe I can help."

Feel free to add some of your own mother sayings here.
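One caveat about the index above: round(runif(1, 1, n)) lands on the first and last sayings only half as often as on the middle ones, because only half a unit of the interval rounds to each end. If that matters to you, base R's sample() draws each element with equal probability. A minimal sketch, with the vector shortened and the function name pickSaying invented for the illustration:

```r
# pickSaying() is a hypothetical variant of giveMeSomeMomLove()
pickSaying <- function() {
  momSayings <- c("Honey, your dad and I think you are doing just fine.",
                  "We think you are the BEST student at VCU.",
                  "You know I took calculus back in college, maybe I can help.")
  # sample(x, 1) returns one element of x, each equally likely
  sample(momSayings, 1)
}

set.seed(42)   # fix the random seed so the run can be repeated
saying <- pickSaying()
saying
```

Setting the seed is only there to make the example repeatable; leave it out when you actually want a different saying on every call.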

12.1.2 Passing Values To A Function

The most common way you will interact with a function is probably by giving it some

variables and expecting to get something back.

getIdentityMatrix <- function(numRows) {
  # make a square matrix with all zeros
  I <- matrix(0, nrow=numRows, ncol=numRows)
  # make the diagonal all ones
  diag(I) <- 1
  # return it to the caller
  I
}

> getIdentityMatrix(2)
     [,1] [,2]
[1,]    1    0
[2,]    0    1
> getIdentityMatrix(5)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    0    0    0    0
[2,]    0    1    0    0    0
[3,]    0    0    1    0    0
[4,]    0    0    0    1    0
[5,]    0    0    0    0    1

Default Values

Functions can have default values associated with the arguments that are passed to them. We’ve seen this many times so far as you’ve looked up and seen the signatures of built-in functions. This is a very convenient feature for you and your users. In general, when you think of writing functions you should not try to make them so specific that you have a lot of different functions that do almost the same thing. Rather, you should make them robust, and if you can combine a few functions into a single one whose behavior changes depending upon a parameter you pass to it, that is better overall form. For example, the function getIdentityMatrix() returns a square matrix with ones down the diagonal. This matrix is a pretty special one (see ??) in matrix analysis and probably should have its own function just because of its status. However, there are a number of reasons why you may need a square matrix with a single value down the diagonal, and perhaps it would be more robust to create a function such as:

getDiagonalMatrix <- function(size, value=1) {
  theMat <- matrix(0, nrow=size, ncol=size)
  diag(theMat) <- value
  theMat
}

> getDiagonalMatrix(3)
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1
> getDiagonalMatrix(3, 42)
     [,1] [,2] [,3]
[1,]   42    0    0
[2,]    0   42    0
[3,]    0    0   42

Now this function has a default value for the diagonal (i.e., 1), producing the identity matrix I by default; however, it can also produce any diagonal matrix with a single value down the diagonal when you pass an additional parameter to the function. If you do not pass that parameter, the value given in the signature is assigned for you. This makes the function perhaps more robust and useful. Of course, this is all up to you; you are the programmer here and you get to make the decisions. After all, there are several different ways to get the correct result when programming, and as biologists we should focus on the biology and use tools like R as simple tools.
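Arguments with defaults can also be supplied by name, which keeps calls readable once a function has several optional parameters. The sketch below repeats getDiagonalMatrix() so that it runs on its own, and compares it against base R's diag(), which builds an identity matrix when given a single number:

```r
# Repeated from the text so the example is self-contained
getDiagonalMatrix <- function(size, value = 1) {
  theMat <- matrix(0, nrow = size, ncol = size)
  diag(theMat) <- value
  theMat
}

# Default used: value is 1, so this is the 3 x 3 identity matrix
identity3 <- getDiagonalMatrix(3)

# Default overridden by position and by name; both calls are equivalent
byPosition <- getDiagonalMatrix(3, 42)
byName     <- getDiagonalMatrix(value = 42, size = 3)

# Base R's diag(3) also returns a 3 x 3 identity matrix
sameAsBuiltIn <- all(identity3 == diag(3))
```

When arguments are named, their order in the call no longer matters, which is why the byName call can list value before size.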

12.2 Scope

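The scope of a variable determines where in your code it can be seen and which value it has there. In R, a variable that is assigned inside a function is local to that function and does not overwrite a variable of the same name outside of it, as the two examples below show. This topic is a pretty important one and can be a bit tricky at times.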

myFunc <- function(x) {
  x <- 42
  cat("x inside function is", x, "\n")
}

> x <- 21
> x
[1] 21
> myFunc(x)
x inside function is 42
> x
[1] 21

myFunc <- function(a) {
  x <- 42
  cat("other x inside function is", x, "\n")
}

> x <- 23
> myFunc(x)
other x inside function is 42
> x
[1] 23
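The two examples above show that assigning to x inside a function leaves the outside x alone. A further point, which the examples do not cover and which is offered here only as a sketch, is that a function can still read a variable that exists outside of it when no local variable of that name has been created. The names outsideValue, readOutside, and shadowOutside are invented for the illustration:

```r
outsideValue <- 21

# No local variable named outsideValue exists in this function,
# so R looks it up in the environment where the function was defined
readOutside <- function() {
  outsideValue * 2
}

# Here a local variable of the same name is created; it hides the
# outside one for the duration of the call and is then discarded
shadowOutside <- function() {
  outsideValue <- 100
  outsideValue * 2
}

fromRead   <- readOutside()     # uses the outside value
fromShadow <- shadowOutside()   # uses the local value
afterCalls <- outsideValue      # the outside value is unchanged
```

Relying on outside variables makes a function harder to reuse, so passing values in as arguments is usually the safer habit.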


12.3 Useful Functions

The following functions were introduced in this chapter and you will be required to use

them for the exercises. To get more information on any of these functions, use the R

help system.

• function(args) code Creates a function that runs the code in code and requires the arguments args.

• return(x) Returns the value x from the function, which means the function is exited immediately and no more of its code is executed.
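The immediate exit caused by return() can be seen in a short sketch. The function name firstNegative is hypothetical and is used only to show that nothing after the return() call runs once it has been reached:

```r
# firstNegative() is a hypothetical helper: it returns the first negative
# value in a vector, or NA when there is none
firstNegative <- function(values) {
  for (v in values) {
    if (v < 0) {
      return(v)    # leaves the function (and the loop) right away
    }
  }
  NA               # only reached when no negative value was found
}

found    <- firstNegative(c(4, 7, -2, 9, -5))
notFound <- firstNegative(c(1, 2, 3))
```

In the first call the loop never looks at 9 or -5, because return() ended the function as soon as -2 was seen.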


12.4 Exercises

The following exercises are meant to help you understand the items presented in this

Chapter.

1. Create a function that allows you to pass it a regression model and that returns a string containing the formula for the model as you would like to have it displayed on a graph.

2. Create a function that takes a single vector of values and creates a histogram and density line from that data in a new graphics window.

3. Explain scope and how it pertains to the values assigned to variables.

4. Create a function that takes an ANOVA or Regression model and saves the ANOVA table to a file. You should probably allow the user to pass a file name to the function.

5. How do you set default values for a function when you write it?

6. Explain how you get your functions to accept vector arguments.

7. Create a function that returns random numbers but allows the user to set an optional argument so that it will only return even numbers.

8. How would you remove a function from the memory of R?

9. Let’s assume that you have a folder full of data files named Data1, Data2, Data3, ..., Data40. Write a function that creates these file names dynamically. You will want to allow the user to specify the base name of the files (e.g., Data) as well as the starting and ending numbers (e.g., 1 and 40) but set the starting number to default to 0.

10. How do you make sure that the arguments that are passed to your functions are the right kind of variables? For example, what if I passed the variable x <- "this is the end" to a function that expects a number?


Appendix A

Answers to Exercises

In this section you will find answers to the odd-numbered Exercises presented in each Chapter. These answers are meant to help you get started on the exercises and to facilitate your completion of the remaining questions. It is my recommendation that you look at the answers only after you have completed the exercises, just to make sure that what you thought you were doing is the correct thing. Don’t look ahead....

Answers to Chapter 2.


Appendix B

Installing Additional Libraries

The R statistical computing environment is made more robust by the addition of external libraries. Libraries can be written in R, C, or FORTRAN by you or other people who want to expand the functionality and utility of R.

B.1 Library Availability

There is a list of libraries available at http://cran.r-project.org. As of the time of this writing, there are 1621 different packages in the repository. All are available for you to install and use at your discretion. Each should also come with a set of documentation covering all the functions that are included in the library, descriptions of the data sets, and some overall discussion of the library.

B.2 Installing Libraries

B.2.1 Using install.packages() As A GUI

The easiest way for you to install a library is to do so from within R itself. To do this, your machine must be connected to the internet. R knows how to find, download, and install binary versions of packages using a Tcl/Tk GUI interface.

If you conduct the installation as a normal user who does not have administrative privileges on your computer, the libraries will be installed in a location that is in your own home directory. Depending upon which operating system you have, this will be in different places. The main thing to worry about here is that when you install libraries into your own directory they will only be available to you and will not be available for any other users on that machine. If two people use the same machine then they will have to install the library twice, once in each home directory. Conversely, if you have administrative privileges on the machine you are using, you can install the libraries into a location that everyone who uses that machine can access.

To start the installation process, issue the command:

> install.packages()

And this will bring up a window (using Tcl/Tk, so it won’t look quite like the normal windows on your operating system) that allows you to select which mirror you would like to use for downloading. An example window is shown in Figure B.1. In general, you should select a location that is geographically close to your current location. All of these mirror servers are kept up-to-date pretty well and you shouldn’t find any differences among the packages on any of them.

Once you have selected your preferred mirror server, another window will be presented (resembling Figure B.2) that lists all the packages that are available to be installed. Be careful here: this simple interface does not check to see which packages you already have installed, it only lists all the packages that are at your disposal. So just because there is a package on that list doesn’t mean that you do not already have it installed on your machine.

Select the package, or packages, that you want to install from the list. To select more than one, click on more than one... To deselect a package, click on it a second time and it will be deselected. Once you hit the OK button on this window, the install.packages() function will look to see what dependencies the selected packages have (e.g., PackageA requires PackageB but you didn’t know that and didn’t select it). Packages will be downloaded and installed in the correct location. After they are installed, you should be able to use them immediately (i.e., without restarting R).
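Since the package list does not tell you what is already installed, you can check from the R prompt before installing. The sketch below wraps that check in a function; the name installIfMissing is made up for this illustration, while requireNamespace() and install.packages() are standard R functions.

```r
# installIfMissing() is a hypothetical convenience wrapper.
# It returns TRUE if the package was already installed and
# FALSE if an installation had to be attempted.
installIfMissing <- function(packageName) {
  # requireNamespace() returns TRUE when the package can be loaded
  if (requireNamespace(packageName, quietly = TRUE)) {
    return(TRUE)
  }
  # not installed yet: download it (needs an internet connection)
  install.packages(packageName)
  FALSE
}

# The stats package ships with R, so nothing is downloaded here
alreadyThere <- installIfMissing("stats")
```

For a package that is not yet on your machine, the same call falls through to install.packages() and you will be asked for a mirror as described above.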

B.2.2 Using install.packages() For Speciﬁc Libraries

If you know the name of the package that you are interested in installing, you can use the install.packages() function directly by passing it a name, or a vector of names, of the packages you are interested in. This will skip the Package Selection Window step shown in Figure B.2. The syntax for this would be:

> install.packages("theNameOfTheLibraryNeeded")

Libraries have also been partitioned into different Task Views. These are meta-packages that contain several different packages under a particular theme. Below is a list of the views that are available as of January 2009 (these categories and descriptions are lifted directly from the website).

Bayesian                   Bayesian Inference
ChemPhys                   Chemometrics and Computational Physics
Cluster                    Cluster Analysis & Finite Mixture Models
Distributions              Probability Distributions
Econometrics               Computational Econometrics
Environmetrics             Analysis of Ecological and Environmental Data
ExperimentalDesign         Design of Experiments (DoE) & Analysis of Experimental Data
Finance                    Empirical Finance
Genetics                   Statistical Genetics
Graphics                   Graphic Displays & Dynamic Graphics & Graphic Devices & Visualization
gR                         gRaphical Models in R
MachineLearning            Machine Learning & Statistical Learning
Multivariate               Multivariate Statistics
NaturalLanguageProcessing  Natural Language Processing
Optimization               Optimization and Mathematical Programming
Pharmacokinetics           Analysis of Pharmacokinetic Data
Psychometrics              Psychometric Models and Methods
Robust                     Robust Statistical Methods
SocialSciences             Statistics for the Social Sciences
Spatial                    Analysis of Spatial Data
Survival                   Survival Analysis
TimeSeries                 Time Series Analysis

Figure B.1: Example of CRAN mirror window as viewed on Linux.

Figure B.2: All packages that can be installed from the selected mirror server on my machine.

You can install all the libraries in one of these views with the install.views() function from the ctv package (install that package first with install.packages("ctv")):

> library(ctv)
> install.views("ViewName")

You will still have to specify the mirror server to use and once you do, R will take it from there. This could be a lengthy process as it may require numerous packages to be downloaded and installed. Be patient.

B.2.3 From the Command Line

Finally, there is one other method that I typically use on my machines. This is because I

typically download the source packages rather than the pre-compiled binaries. However,

this method also works with binaries. You can download the package from the CRAN

site directly and then open a command-line Terminal and change to the directory where

the package is located. From there issue the command:

R CMD INSTALL ThePackageYouDownloaded.tar.gz

and R will install it for you. If you do this as the root or administrator user, it will install it in a globally accessible location so any user on that machine will have access to it.


Bibliography

Caswell, H. (2001). Matrix Population Models: Construction, Analysis, and Interpretation. Sinauer Associates, Sunderland, Mass., 2nd edition.


Index

class, 115

clustal ﬁle, 153

coercion, 9

comment character (#), 172

data types, 8

character, 10

complex, 11

constant, 11

data frame, 18, 26

factors, 16

integer, 8

list, 17

logical, 13

matrix, 14

NULL, 31

numeric, 9

raw, 12

vector, 13

distributions

dchisq, 43, 68

df, 43, 68

dnorm, 43

pchisq, 43, 54

pf, 43

pnorm, 43

qchisq, 43, 44, 54

qf, 43, 46

qnorm, 43

qt, 46

rchisq, 43

rf, 43

rnorm, 43, 54, 58, 68

rpois, 63

runif, 65, 107

fasta ﬁle, 152

ﬁgure

axis labels, 56

title, 56

functions, 6

%%, 186

abline, 186

any, 150, 161

as.factor, 72, 86

as.index, 186

as.matrix, 86, 144

as.matrix(), 121

attributes, 33

barf, 123

barplot, 137

binom.test, 73

c, 86

cat, 118, 161, 172

cbind, 29, 40, 86, 107

class, 20, 33

colnames, 86, 157

components, 6

cov, 64

density, 57, 58

det, 128

diag(), 126

dim, 123

dist.dna, 154

eigen, 132

else, 186

else if, 186

expression, 161

for, 186

format, 161

function, 195

giveMeSomeMomLove, 192

ginv, 128

grep, 150

grey, 118

gsub, 150

image, 118

index, 127, 186

kurtosis, 60

length, 20, 86


levels, 17

lm, 128

load, 32, 40

log, 7

ls, 32

matrix, 121, 145

max, 61, 118

mean, 58, 129

merge, 39, 40

min, 61

names, 33

nchar, 148, 161

nj, 154

par, 49

paste, 11, 20, 95, 149

plot, 47, 155

print, 172

q, 31

qchisq, 45

range, 50, 61, 81, 86

rbind, 28, 40

read.dna, 153

read.table, 27, 28, 86, 145

read.table(), 121

rep, 14, 20

return, 191, 195

rexp, 54

rm, 32, 40, 172, 186

rnorm, 54, 56, 118

round, 107

row.names, 33

rownames, 86, 157

rpois, 52, 54

save, 32, 40

sd, 58

seq, 14, 20

skewness, 59

source, 172

strsplit, 148

sub, 150

subset, 35, 36, 40

substring, 149, 161

sum, 127

summary, 17, 20, 86, 93, 107, 172

t, 128

table, 17, 68, 72

unlist, 148, 162

var, 58

while, 186

genetic distance, 154

graphics

pdf, 51

abline, 94, 107

barplot, 137, 144

bg, 48

bmp, 51

bty, 48

bxp, 85

cairo_pdf, 51

cex, 48

col, 48

density plot, 57

dev.copy, 52, 53

dev.off, 52, 53

fg, 48

hist, 52, 55

jpeg, 51, 53

legend, 142, 145

line plot, 47

lty, 48

lwd, 48

main, 48

mfrow, 48, 61

optional parameters, 48

overlaid, 49

par, 48

pch, 48, 107

pictex, 51

plot, 46, 68, 85

png, 51

postscript, 51

quartz, 51

rug, 104

scatter plot, 46, 47

sub, 48

text, 107

tiff, 51

topo.colors, 52

type, 48

x11, 51

xlab, 48

xlim, 49

ylab, 48


ylim, 49

matrix

%*%, 144

addition, 123

det, 144

diag, 144

diagonal, 126

dim, 144

eigen, 145

element-wise multiplication, 124

ginv, 145

Hadamard product, 124

multiplication, 124

scalar addition, 123

scalar multiplication, 124

scalar subtraction, 123

Schur product, 124

subtraction, 123

t, 145

trace, 127

Neighbor Joining, 154

operator

assignment, 18

logical, 19

numerical, 18

operator order, 18

Pinaceae, 153

stats

anova, 93, 107

aov, 107

binom.test, 86

chisq.test, 72, 76, 86

cor.test, 67, 79, 86

interaction formula, 99

Kruskal-Wallis Test, 82, 83

kruskal.test, 86

lm, 92, 107

Mann-Whitney, 80

mean, 58, 68, 81, 86

median, 63

nj, 162

no intercept, 100

quantile, 63

sd, 68

step, 107

t.test, 107

TukeyHSD, 107

var, 58, 68

Wilcoxon, 80

Wilcoxon Test, 81

variable, 7


List of Tables

2.1 Common constants you will run across in R

4.1 Some useful additional commands to customize the appearance of a figure. For a complete listing of possible values that can be customized, try the ?par command.

4.2 Graphics devices for output of figures

5.1 Diversity of enrolled undergraduate students at Virginia Commonwealth University in the College of Humanities & Sciences between the academic years 1998-2008 as reported by the Center for Institutional Effectiveness (http://www.vcu.edu/cie/analysis/reports/sets.html).

8.1 Table of life history values separated into A Fertility estimates (the fX items) and B transition probabilities depicting the movement between stages and within stages.

9.1 Caption For Table

List of Figures

1 Example scatter plot.

4.1 Values for the density function for the χ² distribution with 1, 2, and 3 degrees of freedom.
4.2 A graphical depiction of the critical value of the χ² distribution for α = 0.05 and df = 3. The shaded region constitutes a proportion of the area under the curve equal to α.
4.3 Some example graphs with alternate values for symbols, colors, line types, widths, and titles.
4.4 Plot of two data sets using the par(new=T) command but not taking into consideration the axis limits of the two data sets before plotting.
4.5 Plot of two variables on the same axis after correcting for the range of each data set.
4.6 Image of colored Poisson distribution that was copied from the graphics device to a jpeg file.
4.7 Examples of the densities of two normal distributions, the red one is drawn from a random normal distribution with default values of µ = 0 and σ = 1 and another in blue that has µ = σ = 5.
4.8 Histogram with labels and main title changed.
4.9 Histogram of 1000 random numbers drawn from a Poisson distribution with the λ parameter set to 5.
4.10 Example locations for first two moments of a Normal (N(0, 1)) distribution.
4.11 Negative (left) and positive (right) distributions. In both of these examples the dotted line connects the mode of the distribution (the top peak) to the mean (on the x axis). The direction of this lean determines if the distribution has a negative (left) or positive (right) skew.
4.12 Three distributions (exponential, normal, and logistic) showing different levels of kurtosis.
4.13 Matrix of four plots created from random numbers sampled from the normal, poisson, exponential, and the logistic distributions.
4.14 Distribution of random numbers drawn from rpois(1000, 5). The red line indicates the density of the values.
4.15 Scatter plot of some semi-random points.
4.16 Example plot of two variables used to test correlations.

5.1 Undergraduate diversity at Virginia Commonwealth University during academic years 1998, 2003, & 2008.
5.2 Boxplot of Pinus echinata germination data partitioned by timber extraction treatment.

6.1 Plot of single variable regression values.
6.2 Regression model added to plot of points using abline function.
6.3 Regression model with fitted line and formula.
6.4 A 2x2 matrix plot of some diagnostic tools associated with a linear model. They include a plot of the residuals (e_ij) as a function of the fitted values (ŷ_i) to see if there are systematic biases in the model (upper left), a Q-Q plot to examine normality of the residuals (upper right), a scale location plot (lower left), and a leverage plot to look for outliers (lower right).
6.5 Boxplot of germination percentages for Pinus echinata as a function of treatment. A colored rug was added to the right side to show the actual values within treatments (see rug).
6.6 Confidence intervals for difference in mean germination rates for Pinus echinata families.

7.1 The image represented in the r.pbm file. This image has been scaled up to make it large enough to see it on the page using the program GIMP (www.gimp.org).
7.2 A PBM file that was programmatically created in R. The image is rotated because of the default location of the origin.
7.3 The image represented by the dog.pgm file. This image has been scaled up to make it large enough to see it on the page using the program GIMP (www.gimp.org).
7.4 The image represented in the Libbie.ppm file. This image has been scaled up to make it large enough to see it on the page using the program GIMP (www.gimp.org).
7.5 The original image along with ones where only the red, green, and blue channel turned on.
7.6 The greyscale translation of the PPM image, a histogram of the grey values and the image resulting from reducing all the grey values in the image by half.
7.7 A random image
7.8 A random image with a square doughnut hole in the middle.

8.1 Image depicting two vectors v_red = [4, 2] and v_blue = [2, 1] that are projecting in the same direction but have different magnitudes.
8.2 A graphical depiction of the life history stages in the fictitious plant Grenus growii.
8.3 Effects of the instantaneous growth rate λ as a function of time for both exponential growth (λ_blue = 1.2) and exponential decay (λ_red = 0.8). The parameters used to create these plots are given in the R code.
8.4 Examples of two different calls to the plotting function barplot().
8.5 Example of a stacked bar plot with multiple categories represented in each Treatment.
8.6 Size of the four stage classes through time.
8.7 Differences in estimated proportions of individuals in each stage from what was expected through time.

9.1 Histogram of distance estimates among all sequences using the "K90" model of substitutions
9.2 Neighbor joining tree based upon the trnL-trnF intergenic spacer sequences and the "K90" model of sequence evolution.
9.3 The html printout of a xtable as interpreted in Firefox. You can also import tables saved as html into popular word processors and use them as normal table items in the creation of your documents.
9.4 Example of using the expression function to annotate a graphic.

11.1 Hemispherical photograph of winter roosting habitat at Monarch Biosphere Reserve, Mexico. Photo by S.B. Weiss made available by the Creative Commons Attribution 2.5.
11.2 The blue channel of the canopy picture displayed as a greyscale image.
11.3 A histogram of values in the blue channel (Figure 11.2).
11.4 Intensity of blue channel values in the image as taken through a slice of the image (at pixel row 230 as indicated by red dashed line).

B.1 Example of CRAN mirror window as viewed on Linux
B.2 All packages that can be installed from the selected mirror server on my machine.

Preface

This manuscript was written to scratch a particular itch that I felt was not being satiated. Increasingly, students in biological research programs, both at the undergraduate and the graduate level, are dealing with data sets that are both enormous in size and varied in representation. Image data, sequence data, reaction networks, nutrient flux, counts of species in communities, and a whole host of other kinds of data are encountered on a daily basis in biological sciences. In order to "drink from this firehose" of data, it is important that we have the correct kinds of tools; the spreadsheet metaphor is no longer valid. After spending a few years encouraging students to learn a tool, any tool, that would help them deal with the complexity of data we encounter, I decided to put together a course focusing on how R can be used to deal with many different kinds of data.

This course was designed for incoming graduate students in Biology at Virginia Commonwealth University with the goal of getting them familiar with R from the beginning of their graduate work. Many of the graduate faculty in Biology use R in their courses and find that a non-trivial amount of time needs to be spent on introducing students to R in each course, which takes away from the focus of the course. However, if a student had taken a short course in R when they began their graduate work, then it would be possible to spend more time in our individual courses focusing on the topic at hand. In my own research, I use tools such as R in many different circumstances and feel that students can only benefit from a broad understanding of how R can assist in their research.

This manuscript is not designed to be one of the "Biological discipline X in R" kind of offerings; there are already a lot of those kinds of books available. My goal here is to introduce the reader to a wide variety of data types that we deal with in Biology and give a brief introduction to how R can be used to interact with, and perhaps perform analyses on, these data. The treatment of any one kind of data is relatively shallow, as I am assuming that students are going to take a specific course on that topic in the future. And when they do, they will have already seen how R will make their life easier. With this focus, it is no coincidence that the kinds of data introduced in this text are pulled directly from the graduate courses that our students will take, such as Community Ecology, Population Ecology, Population Genetics, Landscape Genetics, Ecological Genetics, Evolution & Speciation, Molecular Genetics, Bioinformatic Technologies, Biological Complexity, and Quantitative Ecology.

Given the range of topics covered herein, I think this manuscript has a broad audience, as I assume that the reader of this text will not have much previous experience using R. Obviously, incoming graduate students are my primary audience. However, I also feel that this would be a good beginning text for one who is already working in the field and would like to gain a broader introduction to how R can be used in their particular discipline.

Contents

This manuscript has been partitioned into four separate sections. The first section introduces R as a language and a tool and covers some basic topics that are required to get one going. The next section contains eleven chapters that target some particular aspect of biological inquiry from the perspective of the kind of data that will be analyzed. The third section focuses on how you can extend the R environment by developing scripts and defining your own functions and libraries. The final section of this text is an appendix that includes the answers to odd-numbered questions from the exercises in each chapter as well as some additional information on installing additional libraries or groups of libraries.

There are some common elements to each chapter that make it easy for the reader to get the larger picture of the topics being introduced. At the beginning of each chapter, a specific list of topics and skills that are to be covered is provided. As topics are introduced, the R code is provided and keywords from the R programming language are highlighted to help the reader follow along. At the end of each chapter, all the R functions that were used in the chapter, as well as a brief definition of the arguments passed to each function, are provided as a quick reference source. Throughout the text, all of the R functions used are also indexed so that the reader can easily find instances where they were used. Each chapter also contains a set of exercises that can test the reader's understanding of chapter topics. Answers to odd numbered exercise problems are provided in Appendix A.

Part 1: Basic Usability

The first part of this manuscript contains the basic information that is required to install and begin using R for data analysis. This section has the following chapters:

Chapter 1: Getting R
This chapter provides information on how to download the latest binary release for R as well as compiling it from source code. Particular attention is paid to the differences associated with installing R on different platforms.

Chapter 2: Language & Grammar
This chapter begins introducing the R programming language by focusing on the different kinds of data types that are used (e.g., integers, decimal values, factors). Topics covered include a basic overview of what a function is, an introduction to the most commonly used data types in R, and general operations on these data types.

Part 2: Biologically Motivated Topics

The second section of this manuscript contains the main content.

Chapter 3: Data Frames
The data frame is a fundamental object in R. This chapter builds upon the basic understanding of data frames (introduced in Chapter 2) by introducing several methods for putting your data into new and existing data frames. This includes an introduction to making slices of a data set, the methods required to make complex selections of subsets of data, persistent storage of data frames, and joining data from multiple data frames. This chapter also introduces the concept of using the data frame data type as a light-weight database object.

Chapter 4: Summary Statistics
This chapter introduces the reader to general summary statistics for continuous data, statistical distributions, and random number generation. This chapter also provides the reader a first introduction to creating publication-quality graphics in R. General graphics include scatter and line plots, histograms, density plots, plotting several graphical objects on the same set of axes, creating matrices of plots, and saving graphics to file.

Chapter 5: Categorical Data
This chapter focuses on the analysis of categorical data and contingency tables. Given the ubiquity of the χ² test in Biology, a general treatment of contingency tables is provided with examples demonstrating how to examine genetic linkage disequilibrium, Hardy-Weinberg equilibrium, and demographic analysis testing for equality of population diversity.

Chapter 6: Linear Models
This chapter introduces the concept of linear models from simple correlations through single and multiple regression and ANOVA (which is introduced as regression with categorical predictors). Both parametric and non-parametric approaches are introduced with examples. Examples of model diagnostics, model selection and post-hoc tests are also covered. Data for this chapter is derived from my own thesis working with the consequences of landscape modification on reproductive success in canopy trees.

Chapter 7: Working With String Data
This chapter uses genetic sequence data as an example of string-related data that can be manipulated in R. Basic skills in string searching and replacements are augmented with a short discussion of genetic sequence alignments, the use of online genetic databases such as NCBI, and a demonstration of the creation of phylogenetic trees using different algorithms.

Chapter 8: Image Data
This chapter focuses on image creation, importation, analysis, and manipulation. After a basic overview of image formats and manipulations, hemispheric canopy photos are used as an analysis topic on which several analyses are performed.

Chapter 9: Matrix Analysis
Matrix analysis is a general tool used in a variety of biological disciplines. In this chapter, the topic of life history analysis and population projection is used as an example for matrix operations in R.

Chapter 10: Multivariate Data
Ordination techniques are a broad class of methodologies that seek to understand the structure of multivariate data. In this chapter, vegetation data is used as an example of how one conducts and interprets basic ordination.

Chapter 11: Classification
This chapter focuses on how morphological shape analysis can be used for classification purposes. Morphological data from the bark beetle species complex, Araptus attenuatus, is used as an example for comparison with genetic classification schemes.

Chapter 12: Spatial Data
In this chapter, the analysis of spatial data is introduced. Topics covered include conversion of GPS way points and GIS data files into R data formats, plotting georeferenced raster and vector maps, and basic spatial analysis.

Chapter 13: Genetic Data
This chapter focuses on how one can represent genetic data in R and perform basic analyses on genetic structure. Examples include the analysis of inbreeding, population structure, association mapping, and population assignment tests.

Part 3: Extending R

The chapters in this section only require a basic understanding of R and can be used at any time as they are stand-alone. In fact, it is suggested that after you get familiar with R, you should look into these chapters because they contain valuable information that will make your life easier.

Chapter 14: Creating Basic Scripts
This chapter addresses how you create basic R scripts so that you can reuse your code and analyses as well as have persistence across your R sessions.

Chapter 15: Programming R
This chapter covers basic programming, flow control, and decision control statements.

Chapter 16: Functions
This chapter demonstrates how the user can create individual functions from their scripts so that calling complex analyses and operations can be simplified.

Appendices

The last part of this manuscript includes supplementary material in support of the contents.

Appendix A: Answers to Exercises
This appendix provides answers to the odd numbered problems located at the end of each chapter.

Appendix B: Installing Additional Libraries
There are a broad range of libraries that the R community provides and this appendix shows you how to find and install additional libraries to your local copy.

Typographic Conventions

The developers of R have worked very hard to make sure that you can interface with R on any platform without worrying about which operating system you are using. However, there are some times when things are slightly different on alternate platforms. When there are platform specific issues to be dealt with, I will make a notation in the margins with the name of the operating system next to the text to indicate specific issues.

The book is not going to show you how to interact with R using a GUI, because in my opinion GUIs are for babies. If you want a point-and-click interface for a statistical analysis program then perhaps you should check out SPSS (Statistical Package for the Social Sciences) or similar offerings. If you want to learn how to use R, you will have to learn how to interact with it from the command line and write scripts for R to analyze your data. It is my belief that you will learn more about programming and data analysis if you learn the R language. There are only so many options that GUI-based analyses can provide, but with R on the command line, you will have the most flexibility in the analysis of your data. Moreover, when you create scripts to perform your analysis, you will have a persistent record of how you analyzed the data instead of just some data and results. Increasingly, peer-reviewed journals are suggesting that your analysis scripts be included in your supplementary materials for general consumption.

How plots are made and saved to a file for subsequent use is covered in depth throughout the book. I have decided to sprinkle instructions of how to create graphics into the text at locations that are appropriate for the content being discussed rather than creating one or more chapters on Graphics with made up data presented out of context with how that particular graphical representation is appropriate.

Throughout this book, I will provide examples of code in a box format, slightly shaded, and with R keywords colored appropriately. You will be able to tell what is code that can be entered in R because it will be separated from the main text and in an alternate font. All code provided in this text will have text highlighting showing R keywords in dark blue and strings in red (see Chapter 2 for more information on these commands). If you are using a good editor to write your scripts, you will see this kind of text highlighting in your own work. For example, the commands:

> x <- seq(0, 100, by=2)
> y <- rnorm(51)
> plot(x, y, xlab="X Axis", ylab="Y Axis")

create a scatter plot for the variables x, a sequence of even numbers from 0 to 100, and y, which are random numbers sampled from a normal distribution. The result is given in a new graphics window with a plot similar to what is shown in Figure 1.

Figure 1: Example scatter plot.

In these code listings the > character is the prefix given by R and is not typed. I provide it here because I want to differentiate between code you type and answers that are given by R, which will not have the > character in it, such as:

> 2 * 6
[1] 12
> rnorm(10)
[1] -1.08495736 -1.26551658 -0.62145675 -0.08486045 -1.25010428 -0.54872689
[7]  0.64345848  0.43850325  0.76237538 -0.41362136
> pi/2
[1] 1.570796

where the answers are given in the line immediately following what was entered. Along with the answer is also an index for the answer or answers. For example, the second example gets 10 random numbers from a normal distribution but can only give 6 on a line before it wraps around. The [7] tells you that the first number on the second line is the seventh in the sequence. When you operate on vectors or matrices, these indices are relatively important and allow you to easily find specific indices rapidly.

Acknowledgments

There are several people I would like to acknowledge for their assistance in this work. This has been possible primarily due to the flexibility of my Department in allowing me to "experiment" on our graduate students. Members of my laboratory Stephen Baker, Daniel Carr, Candace Dillion, Crystal Meadows, and Cathy Viverette sat through the first iteration of the course and have provided insightful feedback on both the focus and the content. Next, I wish to thank Dr. James Vonesh who has goaded me into putting this together and been my colleague in crime as we continue to push R as a general tool in our curricula. I would also like to thank the developers of R, LaTeX, Grass GIS, Emacs, and Vim who have provided a set of tools that facilitate good research.

Rodney J. Dyer
Richmond
June 2009

Part I Basic Usability

Chapter 1 Getting R

I am not going to spend much time on how you go about getting and installing R on your computer. If you are going to use a machine on campus, it should have it already installed on it. If not, VCU does not allow students to install programs on their machines so this Chapter is somewhat irrelevant anyways. However, if you are using your own computer (which is always the best idea), the internet has a much more in-depth and complete iteration of how to get and install the R environment for your particular machine. Reproducing that here would be a waste of paper and both of our times as it would probably be out of date before long.

1.1 What Is R

R is both a language and an interface for statistical analysis, programming, and graphics. R is modeled after the S language that was originally created at AT&T's Bell Laboratories and in many cases scripts written for R can be run in S with little to no modification. R has become a standard interface for statistical analysis in biological sciences due in part to its openness, its ability to be extended by users, and its vibrant user base.

The R environment is a command-line interface that allows easy manipulation of data, calculation of parameters related to that data, an easy to understand grammar that facilitates rapid program creation, and the ability to produce publication quality graphics. Moreover, you can create R scripts that describe how you analyzed your data so that in the future you can pick up where you left off. Increasingly, entities such as NSF and prominent research journals are making R scripts a normal component of the Supplementary materials that you upload along with your research results and final reports. It is my opinion that the sooner you start documenting your data and creating a history of how you perform analyses on this data, the better you will be in the long run.

1.2 Where Do I Get It?

The main webpage for R is located at http://www.r-project.org/. Here you can find information on the latest version of R available for your platform, find out what is new in the R community, and find links to manuals, newsletters, wikis, and books on R. Connected from the main R site is also the CRAN repository where people make available extensions to R that you can download and use. There is a tremendous variety of solutions available for you and it is always in your best interest to try to see if someone has already tackled the problem you are working with. There is no reason to reinvent the wheel; your time is too valuable.

Since R has been around for quite a while, most of your most basic questions can be answered by a quick google search of the mailing list repositories. It is always a good idea to check these out prior to posting to a discussion board or email list so you do not get the old RTFM treatment. There is a lot of information in the online community and in general, they are a friendly lot.

1.2.1 Installation From Binaries

The CRAN site maintains pre-compiled binary distributions for Linux, Mac OSX and Windows. These binaries are the latest stable versions of the software and contain the basic libraries that you need to run R on your operating system. Depending upon your platform, the package will contain an installer that allows you to clickity-click your way through the process and have a base R installation on your machine. Moreover, you can find some nice screenshots of the process online.

1.2.2 Compiling

If you know what a compiler is and have one on your computer then you are probably able to compile the latest version of R on your machine. If you fall into this category then you do not need me to tell you how to proceed; there is a lot of good documentation on this found on the R website.
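Once an installer (or a build from source) has finished, a quick check from the R prompt can confirm which version is actually running. The lines below are only a sketch and are not part of the official installation instructions; they rely on objects and functions from base R, and the variable name installed_version is made up for this illustration.

```r
# Store and report the version string of the running R session.
installed_version <- R.version.string
print(installed_version)

# Show the platform this copy of R was built for
# (useful to include when asking for help on a mailing list).
print(R.version$platform)

# List what is attached in the current session; a fresh
# installation shows only the base set of packages here.
print(search())
```

If these commands print a version string and a list that includes package:base, the base installation is most likely working.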

Chapter 2 Language & Grammar

R is a language that has its own grammar and in this chapter you will be exposed to some basic concepts regarding it. In this and all subsequent Chapters, it is important for you to remember that computers do exactly what you tell them to, and often not what you had wanted them to do. So learning the grammar is an important step in understanding R. The main goal here is to understand a small subset of the different kinds of data that can be produced in R and how we interact with them. Later, we will become more proficient with them and add new data types as we move forward. In this chapter, you will focus on the following topics:

• Learn basic data types and how to create them in R.
• Understand various operators and how they can be used.
• Understand variable naming and be able to create, manipulate, and destroy variables.

This is a pretty short list of things but it will take you a bit of time to get through it.

2.1 Overview

R itself consists of an underlying engine that takes commands and provides feedback on these commands. From a technical perspective R is called a functional language, as each command you give the R engine is either an:

Expression
An expression is a statement that you give the R engine. R will evaluate the expression, give you the answer and not keep any reference to it for future use. Some examples include:

> 2 + 6
[1] 8
> sqrt(5)
[1] 2.236068
> 3*(pi/2) - 1
[1] 3.712389

In each of these examples, R evaluates the expression and gives you an answer. When you use it like this, R is acting as a glorified calculator.

Assignment  An assignment causes R to evaluate the expression and store the result in a variable. This is important because you can use the variable in the future. An example of an assignment is:

> x <- 2+6
> myCoolVariable <- sqrt(5)
> another_one_number23 <- 3*(pi/2) - 1
> x
[1] 8
> myCoolVariable
[1] 2.236068
> another_one_number23
[1] 3.712389

Notice here the use of the assignment operator <-. This is made with a "less than" character and a "minus" character. Also notice that to retrieve the value of a variable, you just type its name into the command line and R will provide the current value. In the previous example, x, myCoolVariable, and another_one_number23 are all the names of variables whose value was assigned with the expression (see 2.3 for more on naming).

2.2 Function Quickie

This chapter will introduce you to several conventions, the main one of which is the function. We will spend a whole chapter on functions later in the book (see Chapter 12) when you begin to write your own. However, in the interim, you need to know a few things about functions.

1. A function in R is a collection of statements bound together to make it easier to use. In the expression examples, I used the function sqrt(x), which gives the square root of the argument being passed (or an error if there is one). A function has two parts, (1) a unique name, and (2) the stuff (e.g., variables) passed to it within the parentheses. Not all functions need any additional variables. For example, the function ls() shows which variables R currently has in memory and does not require any parameters.

2. If you forget to put the parentheses on the function and only use its name, by default R will show you the code that is inside the function (unless it is a compiled function). This is because each function is also a variable. This is why you should not use function names for your variable names.

3. Some functions are easy to understand and others are relatively complicated. To find the definition of a function, you can use the ? shortcut. To find the definition for the sqrt() function type ?sqrt and R will provide you the documentation for that function, the arguments passed to it, details of the implementation, and some examples. If R cannot find the function you may have to do a more thorough search using the help.search("functionName") approach. This searches throughout the documentation system and even uses some cool fuzzy searching techniques.

For more info on how to use help.search(), type ?help.search (recursive logic recurses...).

4. Functions may have more than one parameter passed to them. Often if there are a lot of parameters then there will be some default values provided. For example, the log() function provides logarithms. The definition of the log function shows log(x, base=exp(1)) (say from ?log). Playing around with the function shows:

> log(2)
[1] 0.6931472
> log(2, base=2)
[1] 1
> log(2, base=10)
[1] 0.30103

where, without the optional base= parameter, it is clear that the log() function returns the natural log (in fact if you ?ln there is nothing found).

5. Functions can be organized into libraries and only loaded when needed. There is no reason to have every conceivable library loaded, and in fact loading them all would probably leave little memory for you to work with your data. As a rule of thumb, only load the libraries that you need when you need them. At the time of this writing there are just over 1600 different packages containing different libraries on http://cran.r-project.org. More on libraries as we go forward.

2.3 Variables

A variable is something that can hold an item for you. While this is a little bit of Dyerspeak, and I am sure that there are more elegant definitions, it is important to understand that variables are things that you will interact with. It is your responsibility to define these variables and then you can subsequently use them in your analyses. For example, you may have a predictor and a response variable you want to find a correlation between. There are some naming conventions that you can follow to make your life a bit easier.

1. Try to name your variables something that makes sense to you. Using a, b, c, d, and f as variables is probably not as informative to you when you are reading the code as Rate, foodDataForNovember, or NumberOfDogsInHouse.

2. Variable names cannot have spaces in them, although it is possible to use periods ("."), underscores ("_", as in number_of_items), or what is called camel case (e.g., NumberOfDogsInHouse; notice the use of upper and lower case letters to smush words together and make it readable).

3. It is a pretty good idea for you to start your variable name with a letter. You cannot use a number or punctuation as the first character of a variable (N.B., you can use a period to start it but the variable will be hidden from you and you cannot see it with ls(), so unless you know what you are doing, don't do this).

4. In R when you make a new variable such as x <- sqrt(2) then that variable is in memory. You can recall it by typing its name and hitting return, you can use it later in functions or calculations, and you can manipulate it (e.g., x <- x/2 to decrease it by half). The function ls() provides you a list of all variables that you have defined.

5. You can remove a variable from memory using the rm(variableName) function.

2.4 Data Types

R recognizes about a dozen different types of data. While it is important to know the differences between these data types, you will probably use only a fraction of them. What follows is a brief discussion of each data type and, where appropriate, an example of its use.

All of the data types are characterized by what R calls classes. To determine the type of any variable you can use the built-in function class(x). This will tell you what kind of variable x is and is relatively important in the discussions we are going to have below about coercion. It is a very helpful function. As such, every data type has three common functions associated with it: a constructor that creates a specified type, an introspection function that tells you if any variable is a particular type, and a casting function that allows you to coerce the contents of a variable into a specific type (a more complete discussion of functions can be found in ??). This may sound a bit confusing but in reality it is pretty straightforward. For example, the constructor type(x) will create a vector of x items of that type (where type is the name of the data type it will create), is.type(x) determines if x is that particular type of variable, and as.type(x) will return x translated into that type of variable. Confused yet? It really isn't that bad, and the examples for each data type below will discuss the specifics. This is an important concept for understanding data types, how to access them, and how we can operate on them.

2.4.1 Integers

An integer is a common counting number (e.g., one without a fractional part). Technically, integers can range from −∞ to ∞; however, in practice only a limited set of integers can be defined, on a range of about ±2 × 10^9. The integer type is typically used in the development of R libraries that need to pass succinct integers to C or FORTRAN code and is not typically used by the normal R end user. Check out the code listing below and see how one can create, coerce, and use an integer.

> integer(5)
[1] 0 0 0 0 0
> x <- as.integer(5)
> x
[1] 5
> is.integer(x)
[1] TRUE
> class(x)

[1] "integer"
> x + 2
[1] 7
> class(x+2)
[1] "numeric"
> y <- integer(3) + 2
> is.integer(y)
[1] FALSE
> y <- integer(3) + as.integer(2)
> is.integer(y)
[1] TRUE

There are some things to notice about this:

1. The command integer(5) produces a vector (see 2.4.8) of five integers.

2. All of the items returned from the integer(5) function were assigned a value of zero (0), which is the default value for an integer until its value is changed to something else.

3. The variable x is assigned a particular integer, in this case 5, and this is verified by the class(x) statement.

4. You can perform operations on integers, but you need to make sure that you use other integers. By default, numbers are coerced into numeric values (see 2.4.2) as integers are not used that often. For example, adding 2 to the vector of integers represented by the variable y produces a "numeric" type, not an integer, whereas the integer(3) + as.integer(2) statement does return an "integer" type. This is your first example of coercion, where one data type is "magically" turned into another type. There are rules for these transformations and the first one you should recognize is that the number 2 is not considered an integer.

5. When adding an integer such as as.integer(2) to a vector of integers, the same number is added to every element. There are a few more subtle things to know about adding things to vectors and I'll leave that until 2.4.8.

As I said above, the integer type is not used that often and is only provided here for completeness.

2.4.2 Numeric

Numeric types represent the majority of number valued items you will deal with. When you assign a number to a variable in R it will most likely be a numeric type (unless you specify otherwise, such as defined in 2.4.5 and 2.4.6). Numeric data types can be displayed either with or without decimal places depending on whether the value(s) include a decimal portion. For example:

> x <- numeric(4)
> x
[1] 0 0 0 0
> x[1] = 2.4
> x
[1] 2.4 0.0 0.0 0.0

Notice this is an all or nothing deal here: once one value has a decimal portion, every value in the vector is displayed with one. Also notice (especially those who have some experience in programming other languages) that dimensions in vectors (and matrices) start at 1 rather than 0.

Operations on numeric types proceed as you would expect, but since the numeric type is the default type, you don't really have to go around using the as.numeric(x) function. For example, 2.4 is a numeric data type.

> is.numeric(2.4)
[1] TRUE
> as.numeric(2) + 0.4
[1] 2.4
> 2 + 0.4
[1] 2.4

This shows that no matter how you do it, the result is the same. In general, programmers are lazy people who try to do things that minimize the amount of typing they have to do (since they do a lot of typing to begin with), and as such the numeric type is the easiest to use.

2.4.3 Character

The character data type is the one that handles letters and letter-like representations of numbers. You need to think of the character type as a sequence of letters, numbers, symbols, or other stuff you can produce by pushing keys on your keyboard, enclosed in either single or double quotations. For example, observe the following:

> x <- "some sequence of letters of length 37"
> class(x)
[1] "character"
> y <- 23
> class(y)
[1] "numeric"
> z <- as.character(y)
> z
[1] "23"
> class(z)
[1] "character"

Notice how the variable y was initially designated as a numeric type, but if we use the as.character(y) function, we can coerce it into a non-numeric representation of the number. There will be times when you need to translate various things into characters, such as when making titles and axis labels, and this will come in handy.

It doesn't really make much sense to perform any operations on a character type (e.g., what would you expect "hello"*3 to accomplish), although you can paste() them together.

> x <- "I am"
> y <- "not"
> z <- 'a looser'
> x
[1] "I am"
> y
[1] "not"

> z
[1] "a looser"
> paste(x, z)
[1] "I am a looser"
> paste(x, y, z)
[1] "I am not a looser"
> paste(x, z, sep=" not ")
[1] "I am not a looser"
> paste(x, z, sep=", ")
[1] "I am, a looser"

If you are a stickler for detail, it is important to note that the paste() function by default separates the individual variables you give it with a single space. However, this can be modified by telling the function what to use as the separator.

2.4.4 Constants

Constants are variables that have a particular value associated with them and that you should treat as fixed. They are mostly here for convenience so that we do not have to go look up values for common things. Below are listed some common constants that you will probably encounter as you play with R.

Table 2.1: Common constants you will run across in R

Constant   Description
pi         The mathematical constant π, representing the ratio of a circle's circumference to its diameter.
NULL       The absence of a type, complete nothingness. This is the oubliette, /dev/null, Richmond on a Wednesday night. This is commonly used by functions that return undefined responses.
NaN        Not a number.
Inf        Infinity (∞), as well as -Inf for −∞.
NA         Typically used to represent something that is not there or missing. You can use it for missing data if you like.

For the non-numerical constants, there are commands such as is.null(), is.nan(), is.infinite() (and its cousin is.finite()), and is.na() to help you figure out if particular items are of that constant type. At times this can be handy, such as when you have missing data and you want to set it to some meaningful value (e.g., X[is.na(X)] <- 32 will set all NA values in X to 32). We'll get into this more in depth at a later time.

2.4.5 Complex Numbers

Complex numbers are those that can be written in the form a + bi, where a is the real part and the product bi is the imaginary part, with i = √−1. The code snippet below shows you how to create and query the class of a complex number.

> w <- complex(3)
> w
[1] 0+0i 0+0i 0+0i
> x <- complex(3, 4, 5)
> y <- 4+5i
> x
[1] 4+5i 4+5i 4+5i
> y
[1] 4+5i
> is.complex(x)
[1] TRUE
> is.complex(y)
[1] TRUE

The main difference here between the constructor complex() and the other ones we have seen so far is that it can take default values. For example, when called as complex(3), it returns three complex numbers whose real and imaginary parts are set to zero. However, calling the function as complex(3, 4, 5) makes three complex numbers, each assigned a four to the real part and a five to the imaginary part. As shown, you can also create complex numbers by simply typing them directly on the command line as a + bi, and this is probably the easiest way to do it.

2.4.6 Raw

The raw data type is a hexadecimal data type bound on the inclusive range [0, 255]. Raw numbers are represented as a two digit sequence of hex numbers. Valid hex digits include 0 through 9 as well as a, b, c, d, e, and f. The listing below gives you some examples of how to create some raw data types.

> raw(3)
[1] 00 00 00
> as.raw(13)
[1] 0d
> as.raw(255)
[1] ff
> as.raw(256)
[1] 00
Warning message:
out-of-range values treated as 0 in coercion to raw
> is.raw(13)
[1] FALSE
> is.raw(0d)
Error: unexpected symbol in "is.raw(0d"
> x <- 0d
Error: unexpected symbol in "x <- 0d"

There are several important points to make here.

1. If you try to create a raw number outside of its allowable range, R will issue you a warning and then assign the variable the default value, 00.

2. The digits 13, while valid raw digits, are not considered raw when given by themselves. This is because all numbers are considered numeric data types (see 2.4.2) by R. Even in the case of 0d, which is definitely a raw hex number, R doesn't coerce it into a raw type but leaves it as the characters 0d and then chokes on it. This is probably good behavior.

3. Consequently, raw numbers must be constructed with the raw() or as.raw() functions and cannot be directly created by simply pairing up valid digits.

2.4.7 Logical

Logical data types are boolean variables with a value of TRUE or FALSE. Obviously, these two values are the opposites of each other (e.g., not TRUE is FALSE, etc.). You will encounter logical data types in two primary situations: (1) when you are writing a conditional statement that requires you to know the truth about something (e.g., if x == 0 you probably shouldn't try to divide by x because for some reason mathematicians haven't figured out how to divide by zero yet...), or (2) if you are trying to select some subset of your data by using a particular condition (e.g., select all entries where color == "blue").

The interesting thing about logical variables is that numbers can be coerced into a logical variable. For example the number zero, as an integer, numeric, complex, or raw data type, is considered to be FALSE whereas any non-zero value is considered TRUE.

2.4.8 Vectors

R is a vector language and as you begin to learn more and more of it you will appreciate the fact that you can easily work with vectors of numbers as well as single ones. In fact, I suppose it is probably better to think of a single number as a vector of length 1, which is why the R command line interface puts the [1] in front of every answer.

A vector is a sequence of items that can be created using the function vector(). However, since a vector is simply a sequence, it can be a sequence of any type of data. For example, I may have a vector of integers or a vector of complex numbers, or whatever. To specify the data type for a vector, you must tell it what type to use. Here is an example using the "numeric" data type.

> x <- vector("numeric", 3)
> x
[1] 0 0 0
> is.vector(x)
[1] TRUE
> is.numeric(x)
[1] TRUE

Notice that it assigns default values for each entry as would be expected, similar to what was shown for integers. However, it is also important to notice that not only is x a vector but it is also numeric! So in actuality, in all the preceding cases where we have used the constructor to create a new data type, we were also creating vectors! Blows your mind doesn't it! This is why it is safe to consider R as a vector language.

Because you will use vectors so much, there is an easier way to create them using the c() function (c for combine). This is a short-hand version and R tries to determine the type of variables that you pass to the c() function to do the right thing. Here are some examples:

> x <- c(1, 2, 3)
> x
[1] 1 2 3
> y <- c(TRUE, TRUE, FALSE)
> y
[1]  TRUE  TRUE FALSE
> z <- c("I", "am", "not", "a", "looser")
> z
[1] "I"      "am"     "not"    "a"      "looser"
> notGoingToWork <- c(00, 0b, ff)
Error: unexpected symbol in "notGoingToWork <- c(00, 0b"

The only caveat here is that if the data type cannot be determined unambiguously, then R will choke and tell you so, as shown in the last example where I was trying to make a vector of raw data types. For cases such as these, use the normal data type constructor (e.g., raw(3)) and then assign values to each element.

To access an element in a vector, R uses square brackets ([]) as demonstrated here:

> x <- vector("numeric", 3)
> x
[1] 0 0 0
> x[1] <- 2
> x[3] <- 1
> x
[1] 2 0 1
> x[2]
[1] 0

Since working with a vector is such a common thing, there are a number of helper functions that you can use to make vectors. These are some real time saving options and you will probably be using them often.

> x <- 1:6
> x
[1] 1 2 3 4 5 6
> y <- seq(1, 6)
> y
[1] 1 2 3 4 5 6
> z <- seq(1, 20, by=2)
> z
 [1]  1  3  5  7  9 11 13 15 17 19
> rep(6, 4)
[1] 6 6 6 6

The notation x:y provides a vector of whole numbers from x to y. In a similar fashion the function seq(x, y, by=z) provides a sequence of numbers from x to y but can also have the optional parameter by= to determine how the sequence is made (in this case by 2s for all the odd numbers from 1 to 20). The function rep(x, y) repeats x a total of y times.

2.4.9 Matrices

Matrices are 2-dimensional vectors and can be created using the default constructor matrix() function. However, since they have 2 dimensions, you must tell R the size of the matrix that you are interested in creating by passing it a number for nrow and ncol for the number of rows and columns.

> matrix(nrow=2, ncol=2)
     [,1] [,2]
[1,]   NA   NA
[2,]   NA   NA
> matrix(23, nrow=2, ncol=2)
     [,1] [,2]
[1,]   23   23
[2,]   23   23

If you do not give matrix() a default value to put in each cell, it will fill them with NA, which is the way R indicates a missing value. Matrices can be created from vectors as well.

> x <- c(1, 2, 3, 4)
> x
[1] 1 2 3 4
> is.vector(x)
[1] TRUE
> is.matrix(x)
[1] FALSE
> matrix(x)
     [,1]
[1,]    1
[2,]    2
[3,]    3
[4,]    4
> y <- matrix(x, nrow=2)
> y
     [,1] [,2]
[1,]    1    3
[2,]    2    4
> is.matrix(y)
[1] TRUE
> is.vector(y)
[1] FALSE

By default, if you do not provide any dimension to the matrix() function, it will produce one with a single column of data. If you provide one of the dimensions then it will try to determine how many of the other dimension is needed by looking at the length of the vector that you passed (e.g., here nrow=2 was given and it figured out that it should have two columns as well). There is a slight gotcha here if you are not careful.

> x <- 1:4
> matrix(x, nrow=4, ncol=2)
     [,1] [,2]
[1,]    1    1
[2,]    2    2
[3,]    3    3
[4,]    4    4
> matrix(x, nrow=3)
     [,1] [,2]
[1,]    1    4
[2,]    2    1
[3,]    3    2
Warning message:
In matrix(x, nrow = 3) :
  data length [4] is not a sub-multiple or multiple of the number of rows [3]
> matrix(seq(1, 8), nrow=4)
     [,1] [,2]
[1,]    1    5
[2,]    2    6
[3,]    3    7
[4,]    4    8

Notice here that R added the values of x to the matrix until it got to the end of x. However, this did not fill the matrix, so it started over again at the beginning of x. In the first case the size of the matrix was a multiple of the size of x, whereas in the second case it wasn't, but R still assigned the values (and gave a warning). Finally, as shown in the last case, if the sizes match, then it fills up the matrix in a column-wise fashion.

To access values in a matrix you use the square brackets just as was done for the vector types. However, for matrices, you have to use two indices rather than one.

> X <- matrix(c(1, 2, 3, 4, 5, 6), nrow=2)
> X
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> X[2, 2] <- 3.2
> X
     [,1] [,2] [,3]
[1,]    1  3.0    5
[2,]    2  3.2    6
> X[1, 3]
[1] 5
> X[1, ]
[1] 1 3 5
> X[, 3]
[1] 5 6

The last two operations provide a hint as to some of the power associated with manipulating matrices. These are slice operations: when only one index is given (e.g., X[1, ]), the result is a vector holding the entire row or column. We will use matrices quite a bit but will delay the commentary on matrix algebra and operations until Chapter 8.

2.4.10 Factors

Factors are a particular kind of data that is used in statistics and sampling. You can think of a factor as a categorical treatment type that you are using in your experiments (e.g., Male vs. Female or Treatment A vs. Treatment B vs. Treatment C). Factors can be ordered or unordered depending upon how you are setting up your experiment. Most factors are given as characters so that naming isn't a problem. Below is an example of five observations where the categorical variable sex of the organism is recorded.

> sex <- factor(c("Male", "Male", "Female", "Female", "Unknown"))
> levels(sex)
[1] "Female"  "Male"    "Unknown"
> table(sex)
sex
 Female    Male Unknown
      2       2       1
> sex[5] <- "Male"

> sex
[1] Male   Male   Female Female Male
Levels: Female Male Unknown

Here the table() function takes the vector of factors and makes a summary table from it. Also notice that the levels() function tells us that there is still an "Unknown" level for the variable even though there is no longer a sample that has been classified as "Unknown" (it just currently has zero of them in the data set).

2.4.11 Lists

A list is a convenience data type whose function is to group other data items.

> theList <- list(x=seq(2, 30), dog=LETTERS[1:5], hasStyle=logical(5))
> summary(theList)
         Length Class  Mode
x        29     -none- numeric
dog       5     -none- character
hasStyle  5     -none- logical
> theList
$x
 [1]  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
[26] 27 28 29 30

$dog
[1] "A" "B" "C" "D" "E"

$hasStyle
[1] FALSE FALSE FALSE FALSE FALSE

> theList$x
 [1]  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
[26] 27 28 29 30
> theList$x[2]
[1] 3
> theList$x[2] <- 22
> theList$x
 [1]  2 22  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
[26] 27 28 29 30
> theList$dog[2]
[1] "B"
> theList$MyFavoriteNumber <- 2.9+3i
> theList$MyFavoriteNumber
[1] 2.9+3i

As you can see, a list can contain a range of different types of data. The summary() function gives, not too surprisingly, a summary of the items within the list. These data are grouped together by the list, but you can access them and manipulate them just as you would if they were stand-alone variables, with the exception of the list name and the dollar sign. R uses the dollar sign $ frequently to designate something that is contained within something else. You will find when you conduct analyses and assign the results to a variable that the variable will be a list, and to access predicted values, error terms, or other components of that analysis you will do so by using the $ nomenclature.

It is important to remember that lists are general groupings of variables and these variables do not necessarily have any relationship between them other than my need to group them as it makes sense to me to do so.
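To make the remark about analysis results concrete, here is a minimal sketch. It assumes two made-up vectors (dose and growth are hypothetical names and values, not data from this book) and uses the standard lm() function for a linear regression. The fitted model that comes back is stored as a list, so its pieces can be pulled out with the $ notation just like the items in theList.

```r
# Hypothetical data: names and values are for illustration only
dose   <- c(1, 2, 3, 4, 5)
growth <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Fit a simple linear regression and keep the result in a variable
fit <- lm(growth ~ dose)

# The result is stored as a list, so its components have names
is.list(fit)
names(fit)

# Pull out individual components with the dollar sign
fit$coefficients    # intercept and slope
fit$residuals       # the error terms, one per observation
fit$fitted.values   # the predicted values, one per observation
```

Typing names(fit) is a handy first step with any analysis result, since it tells you which components are available after the $.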

The loose grouping that a list provides is different than what is found in the next data type, the data frame.

2.4.12 Data Frames

Data frames are kind of like lists in that they can have named items within them; however, it is easiest for me to think of a data frame as a spreadsheet. It has rows of items, and each row has one or more columns. As in a spreadsheet, each column has a variable name (say height or NumberOfBumps), and the ith row of a data frame can be considered a single observation across all columns of variables. There is an inherent relationship between the columns of data that have the same row in that it is an observation of some sort. This is the distinction between data frames and lists.

Typically when I load data into R from an external source, I do so by creating a data frame. There are other ways to load data but I find this to be the most convenient. The topic of data frames is large enough that I will delay discussion of it until Chapter 3, where we discuss it in depth and provide some analogies to how a data frame is like a database.

2.5 Operators

R recognizes proper orders of operation for mathematical expressions. As in normal notation, you can override the normal order of operations by using parentheses in appropriate areas. What follows is a brief discussion of some basic kinds of operators.

2.5.1 Assignment Operators

As described above, assignments are made using the assignment operator, <-, and actually can be assigned the other way with the operator ->. Examples of assignments include:

> x <- 23
> 56 -> y
> x
[1] 23
> y
[1] 56

Again, it is important to note that (a) under assignment, there is nothing printed out from the R engine, and (b) to see the value of a variable, just type its name on the command line.

2.5.2 Numerical Operators

Numerical operators are defined as operations on variables. These include the normal set of operators including addition (+), subtraction (-), multiplication (*), division (/), and exponents (^).

Examples of the numerical operators, using the x and y assigned above, are:

> x*2
[1] 46
> y-5
[1] 51
> x-y
[1] -33
> x^2
[1] 529
> x/y
[1] 0.4107143
> x
[1] 23
> y
[1] 56

Notice here that these expressions did not change the values of the variables because there was no assignment involved.

2.5.3 Logical Operators

Often times we need to run comparisons between variables. These operators determine the truth of a statement and return a boolean (e.g., TRUE or FALSE). Operators include equality (==, notice this is two equals signs), explicit relations (< and >), range relations (>= for equal to or greater than and <= for less than or equal to), and inequality (!=).

> x <- 23
> y <- 56
> x == y
[1] FALSE
> x < y
[1] TRUE
> x > y
[1] FALSE
> x >= y
[1] FALSE
> x != y
[1] TRUE
> y <= x
[1] FALSE

These operators are commonly found in conditions but can also be used to select a subset of values from a data vector (see ??).
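As a quick sketch of that last point, the block below applies a comparison to a whole vector and then uses the result to pick out values. The vector heights and the variable names are made up for illustration, and the & operator ("and") is a standard R operator that has not been introduced in the text above.

```r
# Hypothetical measurements, used only for illustration
heights <- c(23.4, 32.9, 29.7, 38.2, 32.7)

# A comparison on a vector returns one TRUE/FALSE per element
isTall <- heights > 30
isTall

# Putting the logical vector inside [] keeps only the TRUE positions
tall <- heights[isTall]
tall

# Two conditions can be combined with & (both must be TRUE)
middle <- heights[heights >= 29 & heights <= 33]
middle
```

The same idea carries over to data frames, where a logical condition on one column is used to select rows.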

2.6 Useful Functions

The following functions were introduced in this chapter and you will be required to use them for the exercises. To get more information on any of these functions, use the R help system.

• class(x) This function will return the kind of variable that x is. We have been using it all along in the discussions of data types but you will probably not use it very much.

• dim(x) This function returns the dimension of x, that is, the number of rows and columns in x, which is appropriate for matrices and data frames. The result is returned as a vector of length 2 with the number of rows in the first index and the number of columns in the second. This function will return NULL for all other data types.

• length(x) This will return the length of x, which means different things depending upon the kind of variable that x is.
  – If x is an integer, numeric, character, complex, raw, or logical, this function will return the number of items in x. Remember the default constructors of these data types allow you to make a vector of items, so it treats all these data types as a vector and returns the length of the vector. Essentially, this will tell you if you have a single data point or a vector of data points.
  – For matrices the function returns the number of elements in the matrix. So a matrix with 3 rows and 2 columns would have a length of 6.
  – If x is a list or a data frame then it will return the number of variables in that list or data frame. For example, assume that theList is as first created in 2.4.11 (before MyFavoriteNumber was added); then length(theList) would return the number 3.

• paste(x, y) This function concatenates items into a character string. By default, this function puts a space between the items in x and y, although you can change this behavior by setting a value for the optional sep parameter passed to the function.

• rep(x, n) This function repeats the value x a total of n times and returns it as a vector.

• seq(f, t, by=b) This function returns a sequence of numeric types from f to t by b.

• summary(x) This function will return an overview of the variable x.
  – If x contains numerical values then it will provide the following quantitative measures: the Minimum, the 1st Quartile, the Median, the Mean, the 3rd Quartile, and the Maximum.
  – If x is a list or data frame then it provides a summary of each variable in x.
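The short sketch below exercises several of these functions on one small example so that the differences between length() and dim() are easy to see. The variable names v and m are arbitrary choices made for this illustration.

```r
# A small numeric vector and a matrix built from the same values
v <- seq(1, 11, by = 2)     # 1 3 5 7 9 11
m <- matrix(v, nrow = 3)    # 3 rows, so R works out 2 columns

class(v)      # the kind of variable v is
length(v)     # number of items in the vector
length(m)     # number of elements in the matrix (rows times columns)
dim(m)        # number of rows, then number of columns
dim(v)        # NULL, because a plain vector has no dimensions

rep("a", 3)                          # repeat a value three times
paste("Height", "cm", sep = "_")     # join items with a custom separator
summary(v)                           # minimum, quartiles, median, mean, maximum
```

Running each line on its own at the command line, rather than all at once, makes it easier to connect each result to the description in the list above.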

2.7 Exercises

The following exercises are meant to help you understand the items presented in this Chapter.

1. Create two variables, x and y, of type integer using the as.integer function and assign them the values of 4 and 5. Add x and y and store the result in a third variable named z. What kind of variable is z?

2. Create two variables, x and y, of type integer using the as.integer function and assign them the values of 3 and 2. Divide x by y and store the result in a third variable named z. What kind of variable is z? Why is this different than the answer in the previous question?

3. Coerce x <- 1.23 into other data types to see which are amenable using the as.* functions for each data type.

4. What numeric values are considered TRUE when coerced into a logical data type using the function as.logical()?

5. Create a sequence of numbers ranging from 1 to 10 by 0.1 and assign it to the variable x.

6. Create a sequence of numbers from 100 down to 50 by 2 and assign it to the variable y.

7. Create a vector of character variables that contains 25 "a", 15 "b", and 58 "c" instances. What is the length of this vector? Create a table from the entries.

8. Turn the vector of character items "Control", "Ear Removal", "Fake Ear Removal", "Fake Ear Removal", "Ear Removal", "Control", "Ear Removal", "Fake Ear Removal", "Control", "Ear Removal", "Fake Ear Removal" into a Factor variable and create a table from it to show the number of entries in each treatment.

9. Create a variable that is a list. In the list add variables for your name, email address, and height.

10. How is a data frame different than a list?


Part II Biologically Motivated Topics


Chapter 3 Data Frames

In this chapter we will be learning about data frames and how we can use them to our benefit. Data frames are useful as they are a single object within which we can store data (to disk or databases), perform statistical analyses, and perform complicated selections. In my interactions with R, the vast majority of the time I spend working with data that is contained within a data frame. This is because I typically keep my data in either spreadsheets or in databases, both of which force me to coerce my observations into something like:

Population,Height,Sex
A,23.4,Female
A,32.9,Female
A,29.7,Female
A,38.2,Male
A,32.7,Male
B,28.4,Female
B,27.3,Male
B,27.7,Male
B,30.1,Female

This format is relatively rigid but is amenable to several types of observations. The first row is a header row with the name of each variable spelled out. The second and all subsequent rows are observations with a value for each column of data. Columns of data are also separated by some kind of delimiter. Here I am using a comma (and the file is probably saved as a csv file) but tabs, spaces, and other characters can also be used. For the rest of this chapter, we will use the data above as an example to show how to interact with and manipulate data frames.

In this Chapter you will learn the following skills:

• Enter data into a data frame.
• Load a data frame from an existing file.
• Save a data frame to a file.

frame() function will be the names of the variables in the data frame and the names of the variables you previously deﬁned for them will be thrown away (e. Sex=Sx ) > myData Population Height Sex 1 A 23. DATA FRAMES 3.28.1 Entering Data Directly After reading Chapter 2 discussing different data types should be all you need to understand how to put data in manually.4 . Once you have created a data frame.1 Data Input/Output Data can be input into data frames in two different ways.38.30.27."A" .7 . you can enter it directly or load it from an external ﬁle.9 .2 Male 5 A 32."Female" ."A" .:32. The former method is good if you have just a little bit of data whereas the later is probably better if you have persistent data.70 Max.4 ."Male" ."Female" ."Female" .1 Female > summary( myData ) Population Height Sex A:5 Min .32."B" .20 Notice how the data are already numbered by observation."Female" ) Once you have these variables entered into R . :23.2 .27.04 3rd Qu. • Perform complex queries and joins on data frames."A" .7 Male 6 B 28. To recreate the example data set you could: > Pop <− c ( "A" ."B" ) > Ht <− c (23. you can access elements within it as you would for a list (and even as a matrix to some extent).4 Female 7 B 27..1) > Sx <− c ( "Female" .29. Biological Data Analysis Using R .26 • Manipulate data within a data frame.4 Female 2 A 32.40 Female :5 B:4 1s t Qu.7 Female 4 A 38. The names that you pass to the data."A" ."Male" .7 . you can put them into a single data frame by: > myData <− data .3 Male 8 B 27. frame ( Population=Pop ."B" .70 Mean :30.:27.1. :38.7 Male 9 B 30.7 .70 Male :4 Median :29."Male" ."B" .g.3 . CHAPTER 3. Height=Ht ."Male" . there is not a variable named Pop in myData). 3.9 Female 3 A 29.32.
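The list-like and matrix-like access mentioned above can be sketched as follows. This is a minimal illustration rather than part of the original listing; the stringsAsFactors=TRUE argument is an assumption added so that the character columns become factors, as in the summary() output shown in the text, on versions of R where that is no longer the default.

```r
# Build the example data frame from three vectors (values taken from the text).
Pop <- c("A","A","A","A","A","B","B","B","B")
Ht  <- c(23.4, 32.9, 29.7, 38.2, 32.7, 28.4, 27.3, 27.7, 30.1)
Sx  <- c("Female","Female","Female","Male","Male","Female","Male","Male","Female")

# Assumption: stringsAsFactors = TRUE reproduces the factor columns shown in the text.
myData <- data.frame(Population = Pop, Height = Ht, Sex = Sx,
                     stringsAsFactors = TRUE)

# List-style access: pick a variable by name.
heights <- myData$Height
sexes   <- myData[["Sex"]]

# Matrix-style access: [row, column].
fourthHeight <- myData[4, 2]

dim(myData)     # number of rows, then number of columns
names(myData)   # the names given in the data.frame() call, not Pop, Ht, Sx
```

Both styles reach the same values; which one to use is mostly a matter of readability.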

3.1.2 Loading Data From A File

It is relatively common for you to already have data on hand and it is a bit of a waste of time for you to re-enter the data into R (this would also cause a high probability of errors as you type these values in). Getting data into R is pretty easy. I will assume that you can get your data file into a text format. There are methods available to import normal Excel files into R but I will not go into them because the file format for this program changes with each release and it is not portable across platforms (e.g., there is no Excel on unix). Moreover, there are a lot of other places that you can get data such as online databases, data loggers, etc., and a more general approach will be followed here. What matters for the import are the following items:

1. What character do you use to separate columns of data? Is it tab, space, comma, or some other character that separates your data columns?
2. Does the data have a row of variable names (headers) in the first row? If you do not have a row of headers then R will assign them as V1, V2, etc.
3. Do you have any items that are in quotes? Some programs will output text wrapped in quotes.
4. You need to either have the data file in the same directory that you are working in when you started R or know the full path to the file (e.g., /Desktop/data.txt or C:Whatever).

Note: It is important for you to realize that the data you enter into a data frame have to have the same number of data columns for every observation. In the example data file above, there are three observations for each row. If you do not have the same number of observations for each row, R will barf up some errors. Moreover, sometimes when you export from a particular spreadsheet program (that shall remain nameless) you can get extra columns of data that will screw up your import. This is not that common but you should be aware of it. If you get some odd errors, you may want to open the file in a text editor and look.

Be careful here. If you forget to add one of the additional options to the read.table() function, R may actually load the file but it won't be as you expect. For example, in the example below I did not tell R that the data file uses a comma as a column separator, so it loads every row as a single text observation (and considers it a factor) rather than three columns of data.

> data <- read.table( "DataFrame1.txt", header=T )
> data
  Population.Height.Sex
1         A,23.4,Female
2         A,32.9,Female
3         A,29.7,Female
4           A,38.2,Male
5           A,32.7,Male
6         B,28.4,Female
7           B,27.3,Male
8           B,27.7,Male
9         B,30.1,Female
> data[1,]
[1] A,23.4,Female
9 Levels: A,23.4,Female A,29.7,Female A,32.7,Male ... B,30.1,Female

Telling R about the separator gives the expected result:

> data <- read.table( "DataFrame1.txt", header=TRUE, sep="," )
> data
  Population Height    Sex
1          A   23.4 Female
2          A   32.9 Female
3          A   29.7 Female
4          A   38.2   Male
5          A   32.7   Male
6          B   28.4 Female
7          B   27.3   Male
8          B   27.7   Male
9          B   30.1 Female
> summary( data )
 Population     Height          Sex
 A:5        Min.   :23.40   Female:5
 B:4        1st Qu.:27.70   Male  :4
            Median :29.70
            Mean   :30.04
            3rd Qu.:32.70
            Max.   :38.20

The options passed to read.table() are the file name (with path if necessary), a header parameter (TRUE or FALSE) to indicate if the data file has a header row, and sep to indicate what character is used for a separator. Other separators are tab (indicated as sep="\t") and white space (sep="", which is the default). Barring any errors that I made in typing in the data in the last section (3.1.1), the printing of the data frame should be identical.

3.1.3 Adding Data To An Existing Data Frame

Once you have a data frame in R, you can add data to it relatively easily. What you add to the data frame must be another list or data frame that has the same variables in it as in your original data frame. If you do not have all the variables in the thing you are adding, R will give you an error. To add additional rows of data you use the function rbind() (as in row bind). Here is an example.

> rbind( data, data.frame( Population="B", Height=31.3, Sex="Female" ) )
   Population Height    Sex
1           A   23.4 Female
2           A   32.9 Female
3           A   29.7 Female
4           A   38.2   Male
5           A   32.7   Male
6           B   28.4 Female
7           B   27.3   Male
8           B   27.7   Male
9           B   30.1 Female
10          B   31.3 Female
> data
  Population Height    Sex
1          A   23.4 Female
2          A   32.9 Female
3          A   29.7 Female
4          A   38.2   Male
5          A   32.7   Male
6          B   28.4 Female
7          B   27.3   Male
8          B   27.7   Male
9          B   30.1 Female

Notice that the added observation (B, 31.3, Female) was not retained in the data object. That is because this function does not change the data frame that is passed to it; rather, it returns a brand new data frame that is identical to the original one but has the additional data appended on the bottom. If you want to permanently change your existing data frame then you need to use the assignment operator as:

> data <- rbind( data, list( Population="A", Height=32.0, Sex="Male" ) )
> data
   Population Height    Sex
1           A   23.4 Female
2           A   32.9 Female
3           A   29.7 Female
4           A   38.2   Male
5           A   32.7   Male
6           B   28.4 Female
7           B   27.3   Male
8           B   27.7   Male
9           B   30.1 Female
10          A   32.0   Male

To add additional columns of data you use the function cbind() (as in column bind). This amounts to adding another variable to all the observations in your current data set. Again, for this to work, you should provide as many items as there are rows of data in the data frame.

> cbind( data, list( SizeClass = c(1,1,1,2,2,1,2,2,1,2) ) )
   Population Height    Sex SizeClass
1           A   23.4 Female         1
2           A   32.9 Female         1
3           A   29.7 Female         1
4           A   38.2   Male         2
5           A   32.7   Male         2
6           B   28.4 Female         1
7           B   27.3   Male         2
8           B   27.7   Male         2
9           B   30.1 Female         1
10          A   32.0   Male         2

Again, if you want to make the additions to your data frame permanent then you need to use the assignment operator.

> data <- cbind( data, list( SizeClass = c(1,1,1,2,2,1,2,2,1,2) ) )
> data
   Population Height    Sex SizeClass
1           A   23.4 Female         1
2           A   32.9 Female         1
3           A   29.7 Female         1
4           A   38.2   Male         2
5           A   32.7   Male         2
6           B   28.4 Female         1
7           B   27.3   Male         2
8           B   27.7   Male         2
9           B   30.1 Female         1
10          A   32.0   Male         2

The reason that these two functions do not change the data frame that you passed to them is that you may want to make a temporary data frame with some additional variables or copy the data frame.
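The loading and binding steps above can be tried without the example file. The sketch below is only an illustration: it writes the example data to a temporary file (tempfile() stands in for DataFrame1.txt, and the names csvLines, path, wrong, and bigger are made up here), and stringsAsFactors=TRUE is an assumption added to match the factor columns shown in the text.

```r
# Write the example data to a temporary comma-separated text file.
csvLines <- c("Population,Height,Sex",
              "A,23.4,Female", "A,32.9,Female", "A,29.7,Female",
              "A,38.2,Male",   "A,32.7,Male",   "B,28.4,Female",
              "B,27.3,Male",   "B,27.7,Male",   "B,30.1,Female")
path <- tempfile(fileext = ".txt")
writeLines(csvLines, path)

# Forgetting sep=",": every row is read as one text value, so there is one column.
wrong <- read.table(path, header = TRUE)
ncol(wrong)

# With sep=",": three columns, as intended.
data <- read.table(path, header = TRUE, sep = ",", stringsAsFactors = TRUE)
ncol(data)

# rbind() returns a new data frame; 'data' itself is unchanged.
bigger <- rbind(data, data.frame(Population = "B", Height = 31.3, Sex = "Female"))
nrow(bigger)
nrow(data)

# Assigning the result back makes the change permanent.
data <- cbind(data, SizeClass = c(1, 1, 1, 2, 2, 1, 2, 2, 1))
names(data)
```

Checking ncol() right after an import is a quick way to catch a wrong separator before it causes confusing errors later.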

9 Female 1 3 A 29. > data [ −10 .2 Male 2 5 A 32. ] Population Height Sex SizeClass 1 A 23.1).] Population Height Sex SizeClass 1 A 23. this returns a data frame without the given index. a whole set of variables for a single observation) you an use a negative sign in front of the index. 1 ] <− "B" > newData [ 1 .4 Female 1 2 A 32.3 Male 2 8 B 27.7 Female 1 Biological Data Analysis Using R . Notice how changes to newData are independent of entries in data.7 Male 2 6 B 28.7 Male 2 6 B 28. DATA FRAMES 3.4 Female 1 3.1 Female 1 10 A 32.g. To remove a row of data (e. > newData [ 1 .9 Female 1 3 A 29.4 Female 1 7 B 27. Then the Population variable for the ﬁrst row is changed from A to B.7 Female 1 4 A 38.4 Female 1 > data [ 1 .3 Male 2 8 B 27.7 Male 2 9 B 30.4 Female 1 7 B 27.4 Copying Data Frames To copy a data frame.5 Removing Data From A Data Frame How you remove items from a data frame depends upon if you are removing columns or rows of data.1 Female 1 > data Population Height Sex SizeClass 1 A 23. If you want to make this permanent you must make an assignment as before. ] Population Height Sex SizeClass 1 A 23.30 CHAPTER 3.] > data Population Height Sex SizeClass 1 A 23. For example.4 Female 1 2 A 32.3. in the listing below.2 Male 2 5 A 32.0 Male 2 Again. use the assignment operator.4 Female 1 > newData [ 1 . newData is made as a copy of data. ] Population Height Sex SizeClass 1 B 23.9 Female 1 3 A 29.4 Female 1 2 A 32. This make a new copy of the data frame that is independent. > data <− data [ −10 .7 Male 2 9 B 30..1. You can also pass an array of indices to remove more than one at a time (see also the function subset() in 3.1.7 Female 1 4 A 38.

In fact.2 5 32.1 Female 1 31 Deleting a column of data can also be accomplished by the same manner or by assigning the variable the value of NULL.1 3.9 3 29.4 Female 2 A 32. 8 ) . −4] > data Population Height Sex 1 A 23.2 Male 2 5 A 32.RData ﬁle saved in the directory you are working with that contains all the data you currently have in memory.7 9 30.6 Saving Data Frames to Files There comes a time when you have to save some data you have been working on. There are several ways to save data in R . If you are going to use this kind of data saving.4 Female 1 7 B 27. When you restart R .3 Male 2 8 B 27.7 Male 6 B 28.7 Female 1 5 A 32. it will load these data back into memory for you.1.7 Male 2 9 B 30.7 4 38. you should create a new folder for any data set you are working with.3 Male 2 9 B 30. 4 . Fairly easy and direct way of getting your data to disk and back and it is cross-platform. 6 . When you quit R using the q() function.4 Female 7 B 27. > data <− data [ . there will be a .7 6 28.7 Female 4 A 38.3.1 Female 1 > data[−c ( 2 .4 7 27. ] Population Height Sex SizeClass 1 A 23. it will ask if you want to save: > q() Save workspace image? [ y/n/c ] : y If you do.7 Male 2 6 B 28.9 Female 3 A 29.7 Male 2 7 B 27.3 Male 8 B 27.7 Male 9 B 30.4 2 32. you can have R save every variable in memory. First. it is quite often. This will keep Biological Data Analysis Using R .4 Female 1 3 A 29.1.3 8 27.1 Female > data$Sex <− NULL > data$Population <− NULL > data Height 1 23. DATA INPUT/OUTPUT 4 A 38.2 Male 5 A 32.

3. or delete everything in memory at once as shown below: Biological Data Analysis Using R . or whatever."otherData" ) . The second way that you can save your data frame is to save the data frame directly. F ) ) > g [ 1 ] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 > otherData [ 1 ] TRUE TRUE TRUE TRUE FALSE FALSE Levels : FALSE TRUE > save ( l i s t =c ( "data" .Rdata so lets not buck Once you have saved the data frame. you can load it back into memory at any time by: > ls ( ) [ 1 ] "data > rm(data) > ls() character(0) > load("MyNewSavedData . The main drawback to this is that the name of the saved data ﬁle (.) and will therefore be invisible to you when you look in the folder with your normal Finder."g" .1. Rdata") > ls() [1] "data" Notice here I use ls () to see what is in memory.32 CHAPTER 3. T . data from memory (and 3.. > save ( data . This allows you to save different data frames with different names and you can save them where ever and named what ever you like. F . File Browser. rm() to remove check. .RData" ) save() It is common for saved data from R to have the ﬁle sufﬁx of tradition. then reload the data using the load() function. f i l e ="DataType2. I may only want to save the ﬁnal data.7 Deleting Data Frame Removing a data frame from memory is no different than removing any other variable. You simply use the rm() function as: > rm( data ) If you have a lot of different data ﬁles in memory. DATA FRAMES the raw data ﬁle(s) in the same location as the data entered and formatted in R .2).. You can easily overwrite it or throw it away since it isn’t immediately visible.Rdata" ) You can also save several variables at once by passing their names as a list to the function. T . you can delete them individually. T . It is also a bit inefﬁcient in that if you have a bunch of other variables in memory you may not want to save them all. f i l e ="MyNewSavedData. If I just merged a bunch of data frames (see 3. 
Here is an example: > g <− 1:20 > otherData <− f a c t o r ( c ( T . as a group.RData) starts with a period (.

> ls()
[1] "elvis genotypes"          "kent hovinds secret data"
[3] "myCoolData"               "x"
[5] "y"                        "yourNotLooserData"
> rm( "x" )
> ls()
[1] "elvis genotypes"          "kent hovinds secret data"
[3] "myCoolData"               "y"
[5] "yourNotLooserData"
> rm( list=c("y","myCoolData") )
> ls()
[1] "elvis genotypes"          "kent hovinds secret data"
[3] "yourNotLooserData"
> rm( list=ls() )
> ls()
character(0)

To delete individual variables, you must name them, but when you delete several variables you need to tell the rm() command that you are going to pass it a list of variable names to delete (the list= parameter). The final example shows you how you can tell it to delete everything in memory (e.g., delete this list, and this list is all the data that are currently in memory).

3.1.8 Components of a Data Frame

A data frame has a few distinct components in addition to the data points. Using the function attributes() shows the things that make up a data frame.

> attributes( data )
$names
[1] "Population" "Height"     "Sex"

$class
[1] "data.frame"

$row.names
[1] 1 2 3 4 5 6 7 8 9

This function returns a list containing the variable names, class, and row.names. There are also other ways to access these attributes.

> dataAttributes <- attributes( data )
> dataAttributes$row.names
[1] 1 2 3 4 5 6 7 8 9

In Chapter 2, you were introduced to the class(x) function and we will not need to go over that again here. There are corresponding functions names(x) and row.names(x) that you can easily use to get access to these components of a data frame. You can also use these functions to assign new values to an existing data set. For example:

> data
  Population Height    Sex
1          A   23.4 Female
2          A   32.9 Female
3          A   29.7 Female
4          A   38.2   Male
5          A   32.7   Male
6          B   28.4 Female
7          B   27.3   Male
8          B   27.7   Male
9          B   30.1 Female
> names( data ) <- c("Group","DistanceFromGround","Gender")
> data
  Group DistanceFromGround Gender
1     A               23.4 Female
2     A               32.9 Female
3     A               29.7 Female
4     A               38.2   Male
5     A               32.7   Male
6     B               28.4 Female
7     B               27.3   Male
8     B               27.7   Male
9     B               30.1 Female
> row.names( data ) <- seq( 9, 1, by=-1 )
> data
  Group DistanceFromGround Gender
9     A               23.4 Female
8     A               32.9 Female
7     A               29.7 Female
6     A               38.2   Male
5     A               32.7   Male
4     B               28.4 Female
3     B               27.3   Male
2     B               27.7   Male
1     B               30.1 Female

3.2 Slicing

Grabbing portions of your data frame is pretty easy. Below are some examples of how you can access some of your data components:

> data[,1]
[1] A A A A A B B B B
Levels: A B
> data[,2]
[1] 23.4 32.9 29.7 38.2 32.7 28.4 27.3 27.7 30.1
> data[1:4,]
  Population Height    Sex
1          A   23.4 Female
2          A   32.9 Female
3          A   29.7 Female
4          A   38.2   Male
> data$Sex
[1] Female Female Female Male   Male   Female Male   Male   Female
Levels: Female Male
> data$Population
[1] A A A A A B B B B
Levels: A B

Here are some rules that you need to keep in mind:

1. To access a data frame's items by index, you use the square brackets [] along with the indices of the components separated by a comma.
2. R uses indices for all its data types in what is called row major format. That is to say that the first index is for the row and the second index is for the column. For example data[1,2] will provide access to the 1st row and the 2nd column.
3. To get all the items in a given row or column you can leave out the index. For example, the command data[i,] returns the data in all columns for the ith row whereas data[,j] returns the data in all rows for the jth column.
4. To get a range of values on one or the other index, such as the 2nd through 5th entries in the height variable, you put the range of indices separated by a colon as in data[2:5,2]. This can work in both directions, as shown above when retrieving all the data for the first four records (data[1:4,]).
5. You can also index the data for a particular column by calling its name. For example, the example data set has variables named Population, Height, and Sex. You can get all the data in one of these variables by using the notation data$VariableName as in data$Population. You can also combine this with the indexing, as data$Height[2:5], which may make it a bit easier to read.

3.3 Complex Selections

R data frames can be thought of as pseudo databases. R does allow you to interact with databases through one of its many database libraries but I will not be covering that in this chapter. If you ever interact directly with a database, you use the Structured Query Language (SQL) to interact with the data. It is a standard language that has been adopted by both the American National Standards Institute (ANSI) and later the International Organization for Standardization (ISO). Even if you do not ever use a database, this section is really important as it will allow you to think about interacting with your data in interesting and complex ways. After all, being agile with your data is a key skill I hope you will be learning in this course. However, if you are familiar with some basic SQL operations you will find this section rather easy. If not, I will be spending a little extra time trying to convince you that it is probably in your best interest to understand how to query your data frames, because it gives you a lot of power and flexibility, and to show you how to use a data frame as a lite-database.

To understand SQL you need to understand that in a database data is contained within tables. Each table also has a name, just like a data frame. And tables have rows and columns of data. You can think of a database table as a worksheet in a spreadsheet program if that helps (though real database gurus are probably cringing as they read that). The SQL language is very easy to understand and I will partition this section into commands that query the database and those that create new data frames by the combination of two or more existing data frames that have a common data column.

3.3.1 Queries

Queries are essentially what we have been doing in 3.2 with indices so I won't go over the basic stuff that we have already covered other than to show the SQL equivalents in case you need to know them. I will however delve a bit into how the function subset() works because it is pretty powerful.
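The slicing rules from 3.2 can be collected into one short, self-contained sketch. The data frame is rebuilt here so the block runs on its own; stringsAsFactors=TRUE and the variable names cell, firstFour, byIndex, byName, and dropped are assumptions made for illustration.

```r
# Rebuild the example data frame so this block is self-contained.
data <- data.frame(
  Population = c("A","A","A","A","A","B","B","B","B"),
  Height     = c(23.4, 32.9, 29.7, 38.2, 32.7, 28.4, 27.3, 27.7, 30.1),
  Sex        = c("Female","Female","Female","Male","Male",
                 "Female","Male","Male","Female"),
  stringsAsFactors = TRUE)

cell      <- data[1, 2]         # row 1, column 2
firstFour <- data[1:4, ]        # rows 1 to 4, all columns
byIndex   <- data[2:5, 2]       # rows 2 to 5 of the second column
byName    <- data$Height[2:5]   # the same values, selected by variable name
dropped   <- data[-c(2, 4), ]   # negative indices leave rows out

names(data)       # variable names
row.names(data)   # row labels, returned as character strings
```

Selecting by name (data$Height[2:5]) and by position (data[2:5,2]) give the same values; the named form keeps working if the column order changes.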

To select all observations in SQL, you use the statement SELECT * FROM tableName, which in R is simply what we have been doing by typing the name of the data frame (hereafter I will use data to refer to the name of the table for similarity with our previously loaded data frame).

> data
  Population Height    Sex
1          A   23.4 Female
2          A   32.9 Female
3          A   29.7 Female
4          A   38.2   Male
5          A   32.7   Male
6          B   28.4 Female
7          B   27.3   Male
8          B   27.7   Male
9          B   30.1 Female

In these SQL statements I use words in all capital letters to indicate SQL language components and lowercase words to indicate table names or variables. Also, in SQL the asterisk means "everything" (as in all variables).

The strength of SQL and databases lies in the fact that you can do complicated selections from the tables. For example, in SQL you can select by row number and column number using the statement SELECT * FROM data WHERE rownum=x AND colnum=y. Using the logical operator AND adds a lot of power to this statement. However, in R we have been doing this using the indices directly and the square bracket notation as (with x = 1 and y = 2):

> data[1,2]
[1] 23.4

Several rows or columns can be selected in SQL by SELECT * FROM data WHERE rownum>=5 AND rownum<=7, which is accomplished in R as:

> data[5:7,]
  Population Height    Sex
5          A   32.7   Male
6          B   28.4 Female
7          B   27.3   Male

To get only a subset of the variables in each row, you can indicate which variables you are interested in selecting in SQL as SELECT height, sex FROM data and in R we can either slice both indices as:

> data[,2:3]
  Height    Sex
1   23.4 Female
2   32.9 Female
3   29.7 Female
4   38.2   Male
5   32.7   Male
6   28.4 Female
7   27.3   Male
8   27.7   Male
9   30.1 Female

Or we can use the subset() function as:

> subset( data, select=c("Height","Sex") )
  Height    Sex
1   23.4 Female
2   32.9 Female
3   29.7 Female
4   38.2   Male
5   32.7   Male
6   28.4 Female
7   27.3   Male
8   27.7   Male
9   30.1 Female

Often you will be working with rather large data sets in R and it may be easier to grab parts of your data set by using names of variables rather than by using column indices (it is up to you).

You can also get a bit more specific and only look for components in your data set using relational operations. For example, the SQL statements SELECT * FROM data WHERE height>30 and SELECT * FROM data WHERE height>30 AND columnnum=2 are accomplished in R by:

> data[data$Height>30,]
  Population Height    Sex
2          A   32.9 Female
4          A   38.2   Male
5          A   32.7   Male
9          B   30.1 Female
> data[data$Height>30,2]
[1] 32.9 38.2 32.7 30.1

Notice how in the last example here I mixed the use of selecting subsets of observations using the relational operator > and subsets of columns using the numeric index. Also notice how using the 2 in the position after the comma gives only the second column of data.

You can combine conditions in a SELECT-like query such as SELECT * FROM data WHERE height>30 AND sex="Male" by using the & operator as:

> data[data$Height>30 & data$Sex=="Male",]
  Population Height  Sex
4          A   38.2 Male
5          A   32.7 Male

This complicated statement needs to be dissected to reduce confusion. The part in the square brackets [] consists of the stuff on the left side of the comma (data$Height>30 & data$Sex=="Male") and the stuff on the right side (which happens to be empty in this case). There are some things to remember when doing compound statements like this:

1. The equality operator == must be a double equals sign.
2. The & operator requires that the things on both sides of it are TRUE.

I find it easy to take a few passes at these compound statements to make sure I am getting them correct.

In addition to the AND operator in the SELECT statements there is also an OR operator. It is valid to say in SQL SELECT * FROM data WHERE sex="Female" OR population="A". This can also be done in R using the OR operator |.

> data[data$Sex=="Female" | data$Population=="A",]
  Population Height    Sex
1          A   23.4 Female
2          A   32.9 Female
3          A   29.7 Female
4          A   38.2   Male
5          A   32.7   Male
6          B   28.4 Female
9          B   30.1 Female

If the selection of subsets of your data becomes more complicated than this, you can use parentheses to separate out conditions. This makes it easier for you to read, and since you are the one that will be writing this code and coming back later and looking at it, it pays to be as un-convoluted as possible. Here is a whack example from the SQL SELECT * FROM data WHERE (population="A" AND sex="Female") OR (population="B" AND height<30).

> data[ (data$Population=="A" & data$Sex=="Female")
+ | (data$Population=="B" & data$Height<30), ]
  Population Height    Sex
1          A   23.4 Female
2          A   32.9 Female
3          A   29.7 Female
6          B   28.4 Female
7          B   27.3   Male
8          B   27.7   Male

Note: I split the command across two lines at the OR operator. I had to do this because the command is longer than the width of this paper. In R when you do this, it gives you the little + sign and you can continue typing as if it were on a single line.

3.3.2 Joins

OK, so now that you have everything you want to know about how to select stuff from within a single data frame with an arbitrary level of complexity, let's move into joins. A join is an operation where you have two or more tables (or data frames) and you are going to create a new one based upon the merging of the two, provided that they both have a variable in them you can use as a common index. Here are two examples that we will be using. The first table is the data table we have been working with thus far.

> data
  Population Height    Sex
1          A   23.4 Female
2          A   32.9 Female
3          A   29.7 Female
4          A   38.2   Male
5          A   32.7   Male
6          B   28.4 Female
7          B   27.3   Male
8          B   27.7   Male
9          B   30.1 Female

The second table is one that has characteristics of the Populations themselves. It is in the example data sets and is called PopulationAttributes.txt and we can load it into R as:

> popData <- read.table( "PopulationAttributes.txt", header=T, sep="," )

3331 0. So in essence.60972 47. I could add the data from popData and data to create a new data set that has all this information.3.53300 −77.Population == popData.3331 0.7 −77.60972 47. no?).53300 37.0 −122.7 −77.0 −122.Population.0 merge() SELECT * FROM data. it returns a new data frame with all the data included.7 Female Richmond 4 A 38.9 Female Richmond 3 A 29. on the data sets. To join two tables you will use the function easier to do this in R .0 39 If you look at these two tables.60972 East Elevation −77. It is common in databases to have tables split like this.53300 37. it would be repetitive and for large data sets may max out the memory of your computer.4 Female Richmond 2 A 32.4670 45.60972 −122.53300 37.4670 45.7 −122.60972 47. popData WHERE data.frame" State Virginia Virginia Virginia Virginia Virginia Washington Washington Washington Washington North 37.4670 45. Joins allow you to take these different data frames and join them (catchy name. etc.7 Male S e a t t l e 9 B 30.4670 45.0 −122.3. there is the common variable Population.3331 0. COMPLEX SELECTIONS > popData Population LongName State North East Elevation 1 A Richmond V i r g i n i a 37.4 Female S e a t t l e 7 B 27.53300 47. The best way to get comfortable with these methods is to actually use them. Biological Data Analysis Using R . popData ) ) [ 1 ] "data.7 2 B S e a t t l e Washington 47.3331 0. It is also common to ﬁnd biologists who have programmed software to do some kind of analysis that requires you to put some kinds of data in one ﬁle another kind in a second ﬁle.3331 0.3 Male S e a t t l e 8 B 27.53300 37.2 Male Richmond 5 A 32.7 −77. popData ) Population Height Sex LongName 1 A 23. is is a bit As you can see.1 Female S e a t t l e > class ( merge ( data .4670 45. I think this has gotten you enough exposure so that you can probably be dangerous.7 −77. here is an example: > merge ( data .7 Male Richmond 6 B 28. 
It saves space (imagine having the 5 extra data columns for each row in data.4670 45. In SQL this would be Fortunately.
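To practice the queries and the join without the example files, both tables can be typed in directly. The sketch below is only an illustration: popData is entered by hand from the values printed above rather than read from PopulationAttributes.txt, stringsAsFactors=TRUE is an assumption, and the names tallMales, sameQuery, and combined are made up for this example.

```r
# The two example tables, entered by hand so the block is self-contained.
data <- data.frame(
  Population = c("A","A","A","A","A","B","B","B","B"),
  Height     = c(23.4, 32.9, 29.7, 38.2, 32.7, 28.4, 27.3, 27.7, 30.1),
  Sex        = c("Female","Female","Female","Male","Male",
                 "Female","Male","Male","Female"),
  stringsAsFactors = TRUE)

popData <- data.frame(
  Population = c("A", "B"),
  LongName   = c("Richmond", "Seattle"),
  State      = c("Virginia", "Washington"),
  North      = c(37.53300, 47.60972),
  East       = c(-77.4670, -122.3331),
  Elevation  = c(45.7, 0.0),
  stringsAsFactors = TRUE)

# Query with logical indexing: height > 30 AND sex is Male.
tallMales <- data[data$Height > 30 & data$Sex == "Male", ]

# The same query with subset(); variable names are used without the data$ prefix.
sameQuery <- subset(data, Height > 30 & Sex == "Male")

# Join on the shared Population variable. Naming it with 'by' makes the
# key explicit instead of relying on merge() to find the common name.
combined <- merge(data, popData, by = "Population")
dim(combined)
```

Stating by="Population" is optional here, since it is the only shared variable name, but it documents the intent and guards against accidental matches when two tables share more than one name.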

3.4 Useful Functions

The following functions were introduced in this chapter and you will be required to use them for the exercises. To get more information on any of these functions, use the R help system.

• cbind(x) This function binds a column onto the right side of x. This only works with some kinds of data types (e.g., those where an operation of appending on a column of data makes sense).
• load(x) If x is the name of a .Rdata data file then it will load the contents into memory.
• merge(x) This function takes two data frames and merges them on a common variable name. If there is more than one common variable name you can specify which one, and if there are no commonly named variables then you are out of luck (unless you have variables that hold the same data but are just named differently).
• rbind(x) This function binds a row of data onto the end of x. Again, this works for those data types where the operation makes sense.
• rm(x) This function removes x from memory. Gone. Can't get it back. Auf wiedersehen.
• save(x, file=y) This function saves the R object x to the file named y.
• subset(x) This function returns a slice of your data frame where you can specify which variables to use. You can also do this with creative use of conditional operators and variable names.
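As a quick check of how save(), rm(), and load() fit together, the round trip below writes two objects to a file in R's temporary directory and restores them. It is a sketch under stated assumptions: the path built with tempdir() and the names path and restored are chosen for this example so that nothing is written to your working directory.

```r
# Two small objects to save (the same ones used in section 3.1.6).
g <- 1:20
otherData <- factor(c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE))

# Save both by name into a file inside the session's temporary directory.
path <- file.path(tempdir(), "DataType2.RData")
save(list = c("g", "otherData"), file = path)

# Remove them from memory, then confirm they are gone.
rm(list = c("g", "otherData"))
"g" %in% ls()

# load() puts the objects back and returns the names of what it restored.
restored <- load(path)
restored
sum(g)
```

Capturing the return value of load() is a convenient way to see what a saved file contained, since loaded objects silently replace any existing variables with the same names.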

3.5 Exercises

The following exercises are meant to help you understand the items presented in this Chapter.

1. Create three different variables, one that is a numeric type, a logical one, and a vector of characters. Use these to create a data frame named theData.
2. Add a numeric data column to the existing data frame, theData. Provide a summary of the data.
3. How would you save the data frame, theData, to a file named newData.Rdata?
4. How do you indicate a missing data point in a data file?
5. Using index numbers, select the 2nd and 3rd rows of the data set theData.
6. What is the difference between row major indexing and column major?
7. Create a new data set with two variables, one that is Order = −1:4 and the other that is Home=c("Olympia", "Centralia", "Tacoma", "Olympia", "Juanita", "Olympia"). Merge this data frame with the one named theData and assign it the name combinedData.
8. How would you perform a query of the combined data set to select all records that have Order >= 3 or Home == "Olympia"?
9. In the folder for this Chapter there is a text file named GuinneaPigData.csv. Load it into memory and print out a summary.
10. Read in the data file PersonData.csv from the class data set. What kind of data type is the variable Names? How can you change this to a character type and then change the name of the third entry in the data frame to Thomas?


Chapter 4

Summary Statistics

In this chapter you will explore some of the methodologies that R has for describing your data. R is an excellent platform for exploring data, looking at relationships among variables, and graphically portraying results. In this Chapter you will learn the following skills:

• Learn about some common numerical distributions.
• Learn about commonly used statistical distributions.
• Create histograms and density plots.
• Create single and multiple line figures.
• Understand parametric summary statistics.
• Explore non-parametric summary statistics.
• Use the table() function as an entry point into contingency table analysis.

**4.1 Distributions**

R and its various sub-packages contain more numerical distributions than you will probably ever need to use. Moreover, they provide them in a clear and concise interface that has a consistent format. To my knowledge, all the distributions provide the following four components:

1. A density function of the form dNameOfDistribution (e.g., dnorm(), df() & dchisq()).
2. A distribution function that is called as pNameOfDistribution (e.g., pnorm(), pf() & pchisq()).
3. A quantile function named qNameOfDistribution (e.g., qnorm(), qf() & qchisq()).
4. A function that produces random numbers sampled from the distribution, named rNameOfDistribution (e.g., rnorm(), rf() & rchisq()).

These are specifically helpful in a number of situations. For example, you may be running a test and calculating a χ² statistic on some table of data and want to know if the value of your observed statistic, χ²_Obs, is large given the particular degrees of freedom that you have at your disposal. Now, what if we have 8 degrees of freedom and χ²_Obs = 15.507? You could go find that old stats book on the shelf and page through the back of it to find the correct Appendix that has the right table (How do you read those tables again?). Or you could use the various functions in R.

In this section, three aspects of using distributions within a statistical context will be introduced. First, you will learn how to determine critical values for the χ² distribution as used in formal hypothesis testing using the quantile functions. Then you will see how the distribution function can tell you the probability of a particular estimation of the χ² test statistic.

The most commonly used distribution observed as an undergrad is probably the χ² distribution. The distribution itself is shown in Figure 4.1 for three different values for the degrees of freedom. This and other statistical distributions require that you provide the degrees of freedom before they can give you any information.

Figure 4.1: Values for the density function for the χ² distribution with 1, 2, and 3 degrees of freedom.

4.1.1 Finding Critical Values

In formal hypothesis testing, there is a specific test statistic that is proposed, and the estimate of a value for that statistic is compared to a known cutoff set by the degrees of freedom in the model and the Type I error rate that you have chosen (e.g., the α value). For any one particular set of the parameters α and df, there is a defined cutoff. The value of the cutoff is defined as the point along the x-axis at which there is 1 − α of the area under the curve to the left of the point and α of the area under the curve from that point and beyond. This is a very non-technical definition, but I think you get the point when you consider the α shaded region in Figure 4.2 and the 1 − α region that is unshaded.

Figure 4.2: A graphical depiction of the critical value of the χ² distribution for α = 0.05 and df = 3. The shaded region constitutes a proportion of the area under the curve equal to α.

Now typically, as biologists we have settled on an α = 0.05 value as having some kind of special meaning. Moreover, for some reason, we have memorized, due to the sheer number of times that we have used it, what the critical value for a χ² statistic with a single degree of freedom should be (≈ 3.841459 right?). However, this is probably an oversimplification of things that was used initially as a teaching aid for understanding the meaning of Type I errors. There is nothing intrinsically interesting about α = 0.05 and it is probably more informative for me to know the real probability of your calculated test statistic rather than whether it exceeds some arbitrary cutoff. I mean, is it really that different an interpretation if P = 0.049 versus P = 0.051? That being said, let's jump into understanding how we find critical values for some pre-defined value of α in different distributions.

To determine the critical value of the χ² distribution you use the qchisq() function. If you were to look up the signature of this function (by typing ?qchisq into R), you would see that it accepts the following options:

qchisq(p, df, ncp=0, lower.tail = TRUE, log.p = FALSE)

There are two required parameters for this function, p and df. You can tell by looking at this signature that they are required because they do not have an = sign next to them and a default value given. If a parameter has a variable=value format in a function signature then the value will be assigned to variable if you do not give it a value when you call it. Default values are very helpful and save a lot of typing on your part.

The parameter p is the 1 − α cutoff you are interested in finding. In the classic case, this would be 1 − 0.05 = 0.95. At first, it seems a little backwards to use 1 − α instead of α, but if we look at the graphical depiction of this distribution in Figure 4.2, we see that the point in question is where we actually have 95% of the area under the curve to the left and we are interested in the extreme α portion. The next required parameter is df, which corresponds to the degrees of freedom. As shown in Figure 4.1, this parameter controls both the shape and location of the χ² values.

There are several optional parameters that you can pass to the qchisq() function and I will briefly mention them here for completeness. If you are interested in a more in-depth discussion of these parameters, look up the qchisq() function and read the documentation. The ncp=0 option specifies a non-centrality parameter allowing you to get the critical values for a non-central χ² distribution. The lower.tail=TRUE option indicates that p is the proportion of the distribution in the lower tail (i.e., the function returns the value q for which P[X ≤ q] = p) rather than the proportion in the upper tail (P[X > q] = p). The default value here is what we expect since, with p = 1 − α, we are interested in finding the α proportion on the right side of the distribution, not on the left side of the distribution (which, for the df = 3 example in Figure 4.2, would be all the values less than or equal to 0.3518). Finally, the log.p=FALSE option allows you to query using the log of p rather than p directly.

There are several other statistical distributions that you can query in R for particular critical values. Common ones that you will be playing with in the Exercises portion of this chapter include Student's t from qt() and Fisher's F from qf().

Scatter & Line Plots

This chapter will be very long because it takes a lot of page real estate to show a graph, but I think you'll be happy with the results when you can whip out a nice looking graph of your data. When possible, I will use random numbers to create these graphs, so as you go through and attempt to recreate them, yours will look slightly different than mine. If you try it, it will look different; that is why they are random. Let's jump into this graphing stuff by starting off with a more basic approach to creating graphs and building up to what we see in Figure 4.1. Don't worry, they get more interesting as we go along.

Creating a simple plot of a line (or points in a sequence) is accomplished using the plot() function. The signature of this function (e.g., the things that you can pass to the function and the things it expects) is:

plot(x, y, ...)

This listing is not very informative! The plot() function is kind of a dummy function that allows you to plot lots of different kinds of things, and if things can be plotted, they should know how to plot themselves. Well, that is the theory at least. For example, the R code plot( rnorm(10) ) produces the graph shown in the leftmost panel of Figure 4.3, consisting of a sequence of 10 random points selected from a normal probability distribution (we will discuss these random functions later in Section 4.2). The function rnorm(x) returns x random numbers selected from a normal probability distribution with µ = 0 and σ = 1.0 (you can change these values; check the documentation on this function using the ?rnorm command).

When you look at this plot, it is rather plain and does not convey any more information than 10 little circles. It may be of interest to you to be able to change some of the properties of this plot. When you begin to create a plot, there are some default characteristics of the plot that you may want to override. For example, you may want to:

• Change the shape of the symbols.
• Change the color of the symbols.
• Add a line to connect the symbols, and perhaps modify the color, width, and shape of that line.
• Provide more meaningful axis labels.
• Remove the box around the plot (my pet peeve).

To do this, we must understand what a graph consists of, how to access the various components, and how to find more information on the appropriate levels that can be set for these components. To customize any of these values, you need to pass additional information to the plot() function. This is what the ... part of the function signature shown above is for. Table 4.1 shows a list of additional commands that can be passed to the plot() function to customize plot appearances.

Here are some examples of how you would use some of these optional parameters, with graphs shown in Figure 4.3:

> plot(x, y, pch=2, col="red", xlab="X Label", ylab="Y Label",
+      main="Title", sub="subtitle", bty="l")
> plot(x, y, type="b", pch=2, col="blue", xlab="X Label", ylab="", bty="n", lwd=2)
> plot(x, y, type="l", pch=3, col="green", xlab="X Label", ylab="", bty="n", lwd=5)

Figure 4.3: Some example graphs with alternate values for symbols, colors, line types, widths, and titles.

When creating complicated graphs, I find it easy to build them up incrementally. Start with a plain plot() command to see what the output looks like. Then customize the labels and titles and plot it again to see it. Then continue to add parameters and review the plot.

Table 4.1: Some useful additional commands to customize the appearance of a figure. For a complete listing of possible values that can be customized, try the ?par command.

Command   Usage                     Description
bg        bg="red"                  Colors the background of the figure the specified color.
bty       bty="x"                   Sets the style of the box type around the graph. Useful values are "o" for complete box (the default), "l", "7", "c", "u", "]" which will make a box with sides around the plot area resembling the upper case version of these letters, and "n" for no box (my preference).
cex       cex=1.0                   Magnifies the default font size by the corresponding factor.
col       col="blue"                Colors the line and symbols the given color.
fg        fg="blue"                 Colors the foreground of the image to the set color.
lty       lty=x                     Specifies the line type (0 = none, 1 = solid, 2 = dashed, 3 = dotted, etc.).
lwd       lwd=x                     Specifies the width of the line (1 = default).
main      main="Title for Graph"    Sets a title along the top of the graph.
mfrow     mfrow=c(nr,nc)            Creates a matrix of plots that can potentially have a number of rows (nr) and columns (nc); see 4.3.1 for an example.
pch       pch=x                     Sets the symbol that is plotted on the figure.
sub       sub="Subtitle on Graph"   Adds a subtitle below the x-axis label at the bottom of the graph.
type      type="x"                  Sets the plot type. Plot types can be "p" for points (the default), "l" for lines and "b" for both lines and points.
xlab      xlab="label for x-axis"   Sets the label on the x-axis.
ylab      ylab="label for y-axis"   Sets the label on the y-axis.
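To make the incremental approach concrete, here is a minimal sketch that starts from a plain plot and then layers on a few of the options from Table 4.1. The vectors x and y, the seed, and the temporary output file are assumptions made for illustration only; the pdf() device is used simply so the sketch also runs where no graphics window is available.

```r
# Hypothetical data for illustration only.
set.seed(42)
x <- 1:10
y <- rnorm(10)

# Send the plots to a temporary PDF file (an assumption for this sketch;
# drop the pdf()/dev.off() pair to draw in a graphics window instead).
outFile <- tempfile(fileext = ".pdf")
pdf(outFile)

# Three panels side by side, set with the mfrow graphical parameter.
par(mfrow = c(1, 3))

# Step 1: the plain default plot.
plot(x, y)

# Step 2: change the symbol, color, and axis labels, and connect the points.
plot(x, y, type = "b", pch = 2, col = "blue",
     xlab = "X Label", ylab = "Y Label")

# Step 3: a dashed line only, a title, and no box around the plot area.
plot(x, y, type = "l", lty = 2, lwd = 2, col = "red", bty = "n",
     xlab = "X Label", ylab = "", main = "Title")

# Close the device so the file is written to disk.
dev.off()
```

Comparing the three panels one at a time makes it easier to see which argument is responsible for which change.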

Overlaying Plots

There are times where it is desirable to produce several plots on a common background (e.g., the different values for df in Figure 4.1). R allows you a lot of leeway to mix up different types of graphs in the same plot (see Figure 11.4 for a rather complex combination of images and plots overlaid on the same area). To overlay two graphs, you use the par(new=T) command to tell R that the following command is going to apply to the currently active graphics device. You use it as follows:

plot(x1, y1)
par(new=T)
plot(x2, y2)

This will take the plot for the second set of variables and plot it on the same graphics device as the previous one. The par() function allows you to adjust a lot of different graphical parameters, and the plotting of a new image onto an existing one is only one of the things that you can adjust. For a full discussion of other options that par() accepts, type ?par in R.

When you overlay more than one plot on the same graphing area, you must take into consideration the different scales that the graphs have. By default R will try to maximize the area that is being plotted by changing the default ranges of the x- and y-axes. For example, if I have data such as x1 = [0, 1, 2, 3], y1 = [10, 11, 12, 13] and plot it, R will automatically scale the axes to have limits of xlim=c(0,3) and ylim=c(10,13), which means that the x-axis will start and end at 0 and 3 and the y-axis will start and end at 10 and 13. This is what would be expected to happen and works nicely until you try to add another plot. Suppose your other data has values of x2 = [11, 12, 13, 14], y2 = [23, 22, 21, 20] and you try to overlay the two plots by simply typing:

> x1 <- c(0, 1, 2, 3)
> y1 <- c(10, 11, 12, 13)
> x2 <- c(11, 12, 13, 14)
> y2 <- c(23, 22, 21, 20)
> plot(x1, y1, col="red")
> par(new=T)
> plot(x2, y2, col="blue")

You get the image shown in Figure 4.4. There are several obvious issues with this image. The two images are put right on top of each other and the axes are individually scaled to fit the data in each plot() command. The labels on the axes are typed over each other. You cannot read the axis labels. It is difficult to tell the relationship among the data. If you look at the raw data, x1[1] ≠ x2[1], but in the plot it appears that they are equal.

Figure 4.4: Plot of two data sets using the par(new=T) command but not taking into consideration the axis limits of the two data sets before plotting.

To overcome these issues you need to first find the appropriate limits for the values in both of the data sets, and then, for both plot() statements, set the xlim and ylim values (see Table 4.1) to the appropriate values. These values tell R what the minimum and maximum values for the x- and y-axes should be. Here is some code that does this.

> x1 <- c(0, 1, 2, 3)
> y1 <- c(10, 11, 12, 13)
> x2 <- c(11, 12, 13, 14)
> y2 <- c(23, 22, 21, 20)
> yLimit <- range(c(y1, y2))
> yLimit
[1] 10 23
> xLimit <- range(c(x1, x2))
> xLimit
[1]  0 14

Here I combined the y values for both data sets and used the range() function to tell me what the range of these values is. Then I did the same thing for the x values in both data sets. Now, if I make the plot, I can make it for each pair of x & y variables, scaling the axes so that both data sets will be displayed on the same figure.

> plot(x1, y1, xlim=xLimit, ylim=yLimit, bty="n", xlab="X", ylab="Y", col="red")
> par(new=T)
> plot(x2, y2, xlim=xLimit, ylim=yLimit, bty="n", xlab="X", ylab="Y", col="blue")

Notice how the optional arguments xlim and ylim make sure the axes are scaled correctly (Figure 4.5). I also use bty="n" because I just hate the box that it puts around the plot area by default and this option does not draw any box at all. As long as you add a par(new=T) between each successive plot() command, you can add as many plots to the same figure as you would like.
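As a point of comparison, and not the approach used in the text, base R also provides points() and lines(), which add data to an existing plot rather than starting a new one. The sketch below reuses the same four vectors and the same range() trick for the limits; the temporary PDF file is an assumption made so that the example runs without a graphics window.

```r
# The two data sets used in the overlay example.
x1 <- c(0, 1, 2, 3)
y1 <- c(10, 11, 12, 13)
x2 <- c(11, 12, 13, 14)
y2 <- c(23, 22, 21, 20)

# Common axis limits that cover both data sets.
xLimit <- range(c(x1, x2))
yLimit <- range(c(y1, y2))

# Hypothetical output file for this sketch.
outFile <- tempfile(fileext = ".pdf")
pdf(outFile)

# Draw the first data set with the shared limits; axes are drawn once.
plot(x1, y1, xlim = xLimit, ylim = yLimit,
     xlab = "X", ylab = "Y", col = "red", bty = "n")

# Add the second data set to the same coordinate system.
points(x2, y2, col = "blue")

dev.off()
```

Because the axes and labels are drawn only by the first plot() call, nothing is printed twice. The limits still have to be chosen up front, exactly as with par(new=T).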


Figure 4.5: Plot of two variables on the same axis after correcting for the range of each data set.

Saving Images To Disk

While it is rather cool to be able to create rather handsome graphics in R, it is entirely useless if you do not know how to save them for later use. You could take a screenshot of the image and then crop it down a bit, but that is not quite the easiest method to use here. Almost all the images in this book were created in R and I was able to save them into a format that made it easy to import them into this document. R considers the little popup window that shows your graph to be a graphics device. Depending upon which platform you are using (e.g., Linux, OSX, Windows), the kinds of output you may be able to produce may change. At present the following types are available:

Device       What receives these graphing commands
bmp          A Windows bitmap device
cairo_pdf    A PDF device based upon the Cairo drawing libraries
jpeg         A JPEG bitmap device
pdf          A PDF file
pictex       A LaTeX graphics command file
png          A PNG bitmap device
postscript   A postscript file
quartz       An OSX graphics window
tiff         A TIFF bitmap device
X11          A graphics window on a system running X-Windows (unix, some OSX)

Table 4.2: Graphics devices for output of figures


When you type the command plot() a graphics window pops up showing you the image of the figure. What is happening here is that R is looking for the default graphics device and, if you have not specified one, the default of "show it to the user as a window" is used.

Creating The Plot And Saving To File: This is the method that I used for all the figures in this text. I first created the figure to look the way that I wanted and then I had R copy the figure to a file. You should be aware that when you copy the image, it will only copy the ACTIVE graphics device. If you have more than one graphics window open, only one of them will say ACTIVE in the window title. Be careful of this or you could be copying the wrong figure. Once you have the graphic the way you like, you can use the dev.copy() command to copy the current graphics device to a file. For this book, I have been saving all the images as JPEG files so I pass the function the device=jpeg option and then specify the name of the file. If you want to save yourself some heartache down the road, use meaningful names for the graphics you create. You can quickly accumulate a lot of different plots that you may want to go through at some time in the future and it sure helps to have them named nicely.

> hist(rpois(1000, 2), xlab="Counts", ylab="Frequency", main="", col=topo.colors(8))
> dev.copy(device=jpeg, file="ColoredHistogramOfPoissonDistribution.jpeg")
jpeg
   3
> dev.off()
X11cairo
       2

Once the dev.copy() function is finished, you must call the dev.off() function to tell R that you are finished copying things to that particular file and you no longer want to keep it open and ready for subsequent graphing. The output after the dev.off() command shows which graphics device is now active and what kind of device it is (in general, you can ignore this). The image produced from this plot is shown in Figure 4.6. I also passed the plot command the optional col=topo.colors(8). The function topo.colors(x) returns x evenly spaced colors from a palette that is used for plotting topo maps. There are other default palettes in R you can use (see ?topo.colors for a list) in coloring parts of your figures. I knew that, by default, the hist() function would return 8 bins of data from the rpois(1000,2) distribution (I plotted it first and counted), so I added 8 evenly spaced colors to the plot just to make it look a bit more cheesy.

Plotting Directly To A File: Plotting to a graph window and copying it to a file is not the only way you can get your graphics saved. You could just write them directly to a file using one of the graphics devices listed in Table 4.2 without looking at them in a window. I find this less appealing since I would like to see what I am plotting before saving it, but if you are chugging through lots of data and creating hundreds of images, perhaps you would be better served to make the plots directly and view them later. At any rate, here is how it is done.

jpeg()
plot(rnorm(1000), xlab="index", ylab="value", bty="n")
dev.off()


Figure 4.6: Image of colored Poisson distribution that was copied from the graphics device to a JPEG file.

and R will open a jpeg() graphics device. This device is generally a file in the local directory that is named RplotXXX.jpeg (where the XXX values are incremental numbers such as 001, 002, ...). Then when you call the plot() function it sends the plotting commands to the image itself in the file.¹ You can add as many plotting commands as you like and it will continue to send them to the file you specified. When you are done, you can finalize the image by calling dev.off() to turn off the graphics device. To change the default incremental numbering of the files, you can pass a file name to the jpeg() function (or any of the other ones) as we did in the previous section using dev.copy().
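Here is a minimal sketch of plotting directly to a named file. It uses the pdf() device rather than jpeg() only because pdf() is available on every R installation; the file name and the figure dimensions are assumptions for illustration. The same pattern applies to jpeg(), whose first argument is called filename instead of file.

```r
# Hypothetical, meaningful file name placed in the session's temporary folder.
outFile <- file.path(tempdir(), "RandomNormalValues.pdf")

# Open the device; width and height are in inches for pdf().
pdf(file = outFile, width = 6, height = 4)

# Every plotting command issued now is sent to the file, not to a window.
plot(rnorm(1000), xlab = "index", ylab = "value", bty = "n")

# Close the device to finalize the file.
dev.off()

# Confirm that the file was written.
file.exists(outFile)
```

Until dev.off() is called the file may be incomplete, so it is worth making the open and close calls a matched pair in any script that produces many images.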

¹ Actually it keeps them in a buffer and not in the file directly.

4.1.2 What Probability?

The outcome of a statistical analysis is the estimation of a particular test statistic. For example, when you calculate a χ² statistic, you need to look up the probability that a value as large or larger than the observed one is expected to occur. In 4.1.1 we determined how to calculate the cutoff value from a particular distribution given a specified Type I error rate (the α value). Here we are not interested in asking whether our calculated value exceeds some particular cutoff; rather, we are interested in understanding what the probability is of observing a value as large or larger than the one we see. In keeping with the current examples from the χ² statistic, we can determine the probability associated with a particular estimate χ²_Calc by using the distribution function pchisq(). The arguments to pchisq() are almost identical to those for the qchisq() function discussed in 4.1.1, with the exception that we do not pass it 1 − α as the first parameter; rather, we pass it the estimated χ²_Calc value and it will return the answer in terms of P[X ≤ x]. For example:

> chiCritAt0.05 <- qchisq(0.95, 1)
> pchisq(chiCritAt0.05, 1)
[1] 0.95
> pchisq(7.23, 3)
[1] 0.9350828

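The functions qchisq() and pchisq() are inverses of each other: qchisq() takes a cumulative probability, P[X ≤ x], and returns the corresponding critical value, while pchisq() takes a value for χ² and returns the cumulative area under the curve up to and including that point. Because pchisq() reports the lower tail by default, the probability of observing a value as large or larger than the calculated one is 1 − pchisq(x, df), or equivalently pchisq(x, df, lower.tail = FALSE).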
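The earlier question about 8 degrees of freedom and an observed χ² of 15.507 can be worked through with these two functions. The sketch below is only an illustration; the variable names are made up for the example.

```r
# Values taken from the example in Section 4.1; names are illustrative.
alpha   <- 0.05
degFree <- 8
chiObs  <- 15.507

# Critical value: the point with 1 - alpha of the area to its left.
chiCrit <- qchisq(1 - alpha, degFree)
chiCrit   # approximately 15.507

# Lower-tail probability P[X <= chiObs].
pLower <- pchisq(chiObs, degFree)

# Upper-tail probability P[X > chiObs], the quantity of interest in a test.
pUpper <- pchisq(chiObs, degFree, lower.tail = FALSE)
pUpper    # approximately 0.05

# The two functions undo each other.
pchisq(qchisq(0.95, 1), 1)
```

Here the observed statistic sits almost exactly on the α = 0.05 critical value, which is why the upper-tail probability comes out so close to 0.05.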

**4.2 Random Number Generation**

There are often times when you need to generate some random numbers (playing poker, picking lottery numbers, etc.). Random numbers can be drawn from any of the distributions that are in R using the rdistribution function. For example, to draw random numbers from a normal distribution (N(µ, σ)) you would call the rnorm(x, µ, σ) function. The parameter x is the number of values to draw, and the parameters µ and σ signify the mean and standard deviation of the distribution from which you are drawing. For an example of how this influences the outcome, check out Figure 4.7. There are a large number of random number distributions that you can run across. Below are some commonly encountered ones:

Normal The normal distribution has a density function of

$$P(x|\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}.$$

Exponential The exponential distribution is a continuous distribution with density function $P(x|\lambda) = \lambda e^{-\lambda x}$ for $x \ge 0$; its cumulative distribution function is $1 - e^{-\lambda x}$.

Poisson The Poisson distribution is a discrete distribution whose density function is $P(k|\lambda) = \frac{e^{-\lambda}\lambda^{k}}{k!}$.

Later in the Exercises you will get to use some of these distributions.
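The following sketch draws random numbers from the three distributions just listed and checks each built-in density function against the formula written out by hand (for the exponential, the density λe^(−λx)). The seed, sample sizes, and parameter values are arbitrary choices made for illustration.

```r
# Fix the seed so the draws can be reproduced (an assumption of this sketch).
set.seed(1)

# Random draws: n values from each distribution.
normDraws <- rnorm(1000, mean = 5, sd = 5)
expDraws  <- rexp(1000, rate = 2)
poisDraws <- rpois(1000, lambda = 5)

# Sample means should be near 5, 1/2, and 5 respectively.
c(mean(normDraws), mean(expDraws), mean(poisDraws))

# Arbitrary evaluation points and parameters for the density checks.
x <- 1.5; mu <- 0; sigma <- 1; lambda <- 2; k <- 3

# Densities computed directly from the formulas.
normByHand <- 1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
expByHand  <- lambda * exp(-lambda * x)
poisByHand <- exp(-lambda) * lambda^k / factorial(k)

# Each comparison returns TRUE when the formula matches the d-function.
all.equal(normByHand, dnorm(x, mean = mu, sd = sigma))
all.equal(expByHand,  dexp(x, rate = lambda))
all.equal(poisByHand, dpois(k, lambda = lambda))
```

Without the set.seed() call, each run would produce different draws, which is the behavior described for the figures in this chapter.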

Histograms

A histogram is a graphical display of data that has been tallied into bins (e.g., specific buckets). How you define the bucket locations and sizes is up to you. You can specify that there should be a specific number of buckets and R will make them equal sized, or you can define ranges yourself. You can see the function signature for the hist() function by typing ?hist in R:

hist(x, breaks = "Sturges", freq = NULL, probability = !freq,
     include.lowest = TRUE, right = TRUE, density = NULL, angle = 45,
     col = NULL, border = NULL, main = paste("Histogram of", xname),
     xlim = range(breaks), ylim = NULL, xlab = xname, ylab,
     axes = TRUE, plot = TRUE, labels = FALSE, nclass = NULL, ...)

There are several things we should notice about this function signature. First, this is the first time that we've looked into a particular function and seen all the options. You can see that several of the parameters are given what we call default values (e.g., the =VALUE portions). That way, if we do not provide a particular value for a parameter such as main, R will fill it in for you.

Figure 4.7: Examples of the densities of two normal distributions. The red one is drawn from a random normal distribution with default values of µ = 0 and σ = 1, and the one in blue has µ = σ = 5.

The first thing that you typically want to change in a graphic is the default values for the axis labels and the title of the graph. It is not commonly accepted practice to provide titles on graphs for most publication-quality graphics, but sometimes it is helpful when you are putting together a talk or just analyzing the data and making graphics for your own interpretation. It is perfectly OK to give empty values to things like titles and such. To change the default values of the axis labels and set an empty title you would do the following (shown in Figure 4.8):

> hist(rnorm(100), xlab="My Defined Bin Categories", ylab="Frequency", main="")

Again, I am using the function rnorm() to generate the data from a random normal distribution here.

Figure 4.8: Histogram with labels and main title changed.

Density Plots

A density plot is one where the probability density is calculated and turned into a line across the domain rather than a histogram. Here I will combine the histogram and density plots to show how to overlay two graphs on the same values.

> data <- rpois(lambda=5, n=1000)
> den <- density(data)
> den

Call:
        density.default(x = data)

Data: data (1000 obs.);  Bandwidth 'bw' = 0.5061

       x                y
 Min.   :-1.518   Min.   :2.567e-05
 1st Qu.: 2.491   1st Qu.:8.145e-03
 Median : 6.500   Median :3.973e-02
 Mean   : 6.500   Mean   :6.229e-02
 3rd Qu.:10.509   3rd Qu.:1.219e-01
 Max.   :14.518   Max.   :1.689e-01
> yrange <- range(den$y)
> xrange <- range(den$x)
> hist(data, xlab="Value of Random Poisson",
+      ylab="Frequency", main="", probability=T, xlim=xrange, ylim=yrange)
> par(new=T)
> plot(den, xlab="", ylab="", main="", col="red", lwd=2, bty="n")

Figure 4.9: Histogram of 1000 random numbers drawn from a Poisson distribution with the λ parameter set to 5. The red line indicates the density of the values.

There are some things to point out with this plot.

1. I saved the values of data as a variable because I needed to plot the same set of random variables as a histogram and as a density plot. Had I not saved them, I would be using a different collection of random numbers for each plot and they wouldn't match.
2. I used the function density() to calculate the probability density function for the values of data. The density() function has two components, an x variable and a y variable. The probability density is calculated as a probability rather than as a frequency count (as the

you can calculate the mean of the data by using the function mean(). This ﬁgure was created using the density() function from rnorm(1000000). . we have created a sample of our data from which we make inferences. SUMMARY STATISTICS 4. . K. These are the mean and variance of the data and are estimated in R using the functions: mean() and var() . First. Your observations are grouped into distinct categories and consist of relative counts of each category. . some types of genetic data. µk . Notice that here I used the term estimate rather than compute. the mean is just one of several moments of a distribution and now we turn to this particular moment and then discuss some of the ”higher moments. we will assume that your the experiments that are producing your data yield one of two different data types. Rather. We are all familiar with the concept of mean.10 shows what is being measured by these estimators. Figure 4. A collection of random variables will be denoted as X with elements xi . consisting of K categories and the number of counts observed in each category will be referred to as yi . Examples of this include stage-dependent demographic tallies. i = 1 . We will be making estimates of real parameters of the data and we do so because in most cases we do not have all the data at our disposal. The mean. First.1 Moments There are several properties of random variables that we may be interested in estimating. To get all the data. 4 can be calculated by µk = E[(X − µ)k ]. In R these functions are not loaded Actually all four of these measures are known as the ﬁrst four moments of the distribution. available light. observations from your data could be considered random variables. There are two general properties of random variables that we will spend a little time discussing because they form the basis of how we examine our data. Categorical data will be denoted as Y . this is on purpose. the mean of a random variable. 2 Biological Data Analysis Using R . etc. 
gender of your study organisms. There are two more measures of distributions that we should discuss while we are here.g.” 4. For the purposes of this section. Examples of random variables may be body size. N (e. we would have to sample EVERY single instance out there and in most cases this is not possible. but in a general sense.58 CHAPTER 4. . . indexing across all N individual observations). disease prevalence. The √ image also shows the standard deviation (which is the square root of the variance σ = σ 2 ) as indicated by the dotted line. etc.2 These are the skew and kurtosis of the distribution. There are two common properties that you will probably recognize immediately (I hope) and use all the time. i = 1 .. R has a function for both the variance var() . The ﬁrst for moments. dissolved oxygen.3. . shown by the dashed line and the symbol µ is located at the center of gravity of the data. In R. a measurement that produces a real number.3 Descriptive Statistics Descriptive statistics are valuable tools in understanding particular patterns in your data. k = 1 . and the standard deviation sd(). The other kind of data we will be examining here are categorical data. usually denoted by the symbol µ is a measure of the central tendency of your variable (a center of gravity. so to speak).

3. DESCRIPTIVE STATISTICS 59 Figure 4. Distributions can either have a positive or negative skew. 1)) distribution.10: Example locations for ﬁrst two moments of a Normal (N (0. Skew is estimated in R using the function skewness() The kurtosis of a distribution is a measure of the ”peakedness” of a distribution. In these cases the mean < median < mode. compare the images in Figure 4. The skew of a distribution is a measure of how ”pushed-over” the main lump of the distribution (again not a very statistical deﬁnition here). Conversely. To If R gives you a warning. this means that the moments library is not installed by default. This Biological Data Analysis Using R . a distribution has a positive skew if the tail is on the right and the mean > median > mode. into memory by default and we must load the load these libraries type: > l i b r a r y ( moments ) moments library to gain access to them. In this case. Distributions where these measures are equal is said to not have any skew.4.11 A distribution is said to have a negative skew if the direction of the longer tail is to the left. see Appendix B for instructions on how to add libraries to your installation of R.

logistic.11: Negative (left) and positive (right) distributions.. The direction of this lean determines if the distribution has a negative (left) or positive (right) skew. In general. each with a different level of kurtosis. the following types of kurtosis are available: Platykurtic Curves that have negative excess kurtosis (e.’ A simple example of how kurtosis looks is found in Figure 4. and uniform). > normData <− rnorm(100000) > l o g i s t i c D a t a <− r l o g i s (100000) > unifData <− r u n i f (100000) > kurtosis ( normData ) − 3 [ 1 ] −0. Biological Data Analysis Using R .3 part of the equation is a normalizing constant that allows the kurtosis of a normal distribution to be equal to zero.02320046 > kurtosis ( l o g i s t i c D a t a ) − 3 [ 1 ] 1. SUMMARY STATISTICS Figure 4. In both of these examples the dotted line connects the mode of the distribution (the top peak) to the mean (on the x axis).3 correction factor is that it allows you to quickly tell the different types of kurtosis by looking at the value of the estimate. term comes from the Greek word kurtos that means ’bulging.197009 The discrepancy here in the estimates showing the normal distribution not quite equal to zero is because the data were created by drawing random numbers rather then specifying the distribution directly.12 with three different distributions (the normal. In general.12. Below are the raw data and the kurtosis estimates used in producing Figure 4. the function for kurtosis is: K= µ4 −3 σ4 The correction factor (the .g. One beneﬁt of the .219505 > kurtosis ( unifData ) − 3 [ 1 ] −1.60 CHAPTER 4. the kurtosis()−3 < 0).

Mesokurtic  Curves that do not have excess kurtosis (i.e., kurtosis() − 3 = 0).

Leptokurtic  Curves that have positive excess kurtosis (i.e., kurtosis() − 3 > 0).

Figure 4.12: Three distributions (exponential, normal, and logistic) showing different levels of kurtosis.

The last summary statistic we will cover here is the range(), which returns a two-item vector containing the minimum and maximum values. There is little to discuss about this particular function. In fact, range() calls min() and max() directly.

Creating a matrix of Plots

It is often desirable to create more than one plot on a graphic without overlaying them on top of each other as was explained in Section 4.1.1. To do this, we need to adjust one of the graphics properties using the function par(). The property we need to change is mfrow=c(nr,nc). This will create a matrix of plots that has nr rows and nc columns. An example of creating a matrix of plots is given in the code below and depicted in Figure 4.13.

> par(mfrow=c(2,2))
> hist(rnorm(100000))
> hist(rpois(100000,1))
> hist(rexp(100000))
> hist(rlogis(100000))

Figure 4.13: Matrix of four plots created from random numbers sampled from the normal, Poisson, exponential, and logistic distributions.

Subsequent calls to plotting functions will "reuse" this graphic figure and replot the graphs in the nr x nc matrix. This graphic window will keep the nr x nc matrix of plots until it is either closed or you change the mfrow property to something else.

4.3.2 Non-Parametric Parameters

Non-parametric statistics are generally concerned with the analysis of data without making assumptions about the underlying statistical distributions. There are several commonly known non-parametric statistics such as the Binomial Test, Goodness of Fit, the Mann-Whitney Test, and the Kruskal-Wallis test. In this section, we will explore some of the methods that R can use to describe data without assuming an underlying

distribution.

The first summary statistic outlined here will be the quantile. While you have probably not heard of this particular descriptive statistic, you most likely will have run across terms such as a median, quartile, or percentile. All of these are particular kinds of quantiles, which will be obvious when we consider the formal definition of a quantile.

Quantile  A pth quantile is the value $x_p$ such that, for the data X, the probability $P(X < x_p) \le p$ and the probability $P(X > x_p) \le 1 - p$.

While this may be statsy, it generally says that the 50th quantile is the value $x_{50}$ in the distribution where 50% of the data is less than $x_{50}$ and 50% is greater than $x_{50}$. As a result, you have probably called this the median (and R has a median() function if you like to call it that). More generally though, we can consider the 95th quantile analogous to what we were discussing in Section 4.1.1 when we were trying to figure out critical regions of the $\chi^2$ distribution. The main distinction here is that in Section 4.1.1 we implicitly used the known distributional form of the $\chi^2$ function to find the critical value, whereas in non-parametric approaches we typically apply the approach of putting everything into a vector, sorting it, and counting to where the quantile is located in the list.

To illustrate the use of the quantile function, consider the data in Figure 4.14 consisting of 1000 numbers drawn from a Poisson random distribution with a centrality parameter k = 5. The quantile() function in R by default provides the 0th quantile (i.e., the minimum), the 25th quantile, the 50th quantile (the median), the 75th quantile, and the 100th quantile (i.e., the maximum). For the data that produced the histogram in Figure 4.14, the quantiles are:

> x <- rpois(1000,5)
> quantile(x)
  0%  25%  50%  75% 100%
   0    3    5    6   12

showing that the center of the distribution is 5 and the inner quartile range runs from 3 to 6.

Quantiles can also be used to look at the dispersion of data. In parametric statistics we discussed parameters such as the variance and standard deviation that define the dispersion of values around the mean. The notion of quantiles can be used in a similar way. Individually, the 50th quantile (or median) can be considered a measure of central tendency of the sorted data. The values of x that give the upper and lower quartiles (i.e., the 25th and 75th quantiles) provide a range of the data X where the inner 50% of the values lie. These are often called the inner quartiles of the data.

4.4 Relationships Between Pairs of Variables

There are often times when we are interested in knowing about the simultaneous changes in two or more variables. Thus far, we can estimate the mean, variance, skew, kurtosis, and various ranges, but these do not tell us how the variables interact together. For this we need to look at measures that explain the relationship between variables.
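As a small illustration of why single-variable summaries are not enough, the sketch below pairs one set of values with the same second set in two different orders. The numbers are made up for this example and are not data from this book. The univariate summaries are identical in both cases, while a measure computed on the pair (here cov()) changes sign.

```r
# Hypothetical values chosen only for illustration
x  <- c(2, 4, 6, 8, 10)
y1 <- c(1, 3, 5, 7, 9)      # rises as x rises
y2 <- rev(y1)               # the same values in the opposite order

# Univariate summaries cannot tell y1 and y2 apart
c(mean(y1), mean(y2))
c(var(y1), var(y2))

# A summary of the pair can: the sign of the covariance flips
cov_up   <- cov(x, y1)
cov_down <- cov(x, y2)
cov_up
cov_down
```

The design point is only that mean(), var(), and friends look at one variable at a time, so any statement about how two variables move together has to come from a statistic that takes both as input.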

Figure 4.14: Distribution of random numbers drawn from rpois(1000,5).

4.4.1 Covariance & Correlation

The covariance of two variables is defined as:

$$c_{ij} = E[(X - \mu_X)(Y - \mu_Y)]$$

and measures the degree to which one variable X changes as another Y changes. Covariance estimates may be positive or negative as long as the two variables are not the same, in which case it is a variance and there is no such thing as a negative variance. Two variables that have a covariance equal to zero are said to be uncorrelated (although if you don't know what a correlation is this moniker is kinda sucky). In R the covariance between two vectors of values is estimated by the function cov(). Needless to say, the length of the two variables must be the same or R will rightly complain.

> X <- c(1, 5, 28, 56, 43, 33, 34, 6, 23, 7)
> Y <- runif(10, 1, 100)
> Y

Y ) should have the following characteristics: • The value of a correlation is strictly bound on the interval [−1. Y ) [ 1 ] 2231.2 Tests For Correlation There are parametric and non-parametric methods for looking at the relationship among pairs of variables. You can see that the values that I used produced a smattering of points (Figure 4. Biological Data Analysis Using R .960688 > p l o t ( X.234582 8.4.148708 [ 8 ] 6.112843 47. So here I just pounded on my numeric keypad and made up the numbers for X (not quite random but pretty good) and then had R make some numbers for Y by drawing from a uniform distribution runif() selecting 10 values in the range 1 → 100.871332 57. Y ) > cov ( X.4. all correlations between two random variables (X.236585 17.546069 17.072745 65 Figure 4. In general. 1].15: Scatter plot of some semi-random points. RELATIONSHIPS BETWEEN PAIRS OF VARIABLES [ 1 ] 90.861546 54.15 ) 4.952 3.4.000811 84.

• If larger values of X tend to be associated with larger values of Y then the correlation should approach +1 as the association becomes stronger. We call this a positive correlation.

• If smaller values of X tend to be associated with larger values of Y then the correlation should approach −1 as the association becomes stronger. We call this a negative correlation.

• If there is no general relation between the variables X and Y then the correlation statistic should approach 0. We call this a relationship where the variables are uncorrelated.

Figure 4.16: Example plot of two variables used to test correlations.

The most commonly used measure of correlation is Pearson's product moment correlation, r, that is calculated as:

$$r = \frac{\sum_{i=1}^{N}(X_i - \bar{x})(Y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N}(X_i - \bar{x})^2 \sum_{i=1}^{N}(Y_i - \bar{y})^2}} \tag{4.1}$$

where the $\bar{x}$ and $\bar{y}$ values are the means of the N sampled values in X and Y.
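To connect the formula for Pearson's r to R, the following sketch writes the numerator and denominator out term by term and compares the result with the built-in cor() function. The two vectors are hypothetical measurements invented for this example.

```r
# Hypothetical paired measurements, for illustration only
X <- c(2.1, 3.4, 4.0, 5.7, 6.3, 7.9)
Y <- c(1.0, 2.9, 3.1, 5.2, 5.0, 8.1)

# Numerator: sum of cross-products of deviations from the means
num <- sum((X - mean(X)) * (Y - mean(Y)))

# Denominator: square root of the product of the two sums of squares
den <- sqrt(sum((X - mean(X))^2) * sum((Y - mean(Y))^2))

r_by_hand <- num / den

# R's built-in estimate uses Pearson's method by default
r_builtin <- cor(X, Y)
all.equal(r_by_hand, r_builtin)
```

Because the same N − 1 divisor would appear in both the covariance and the two standard deviations, it cancels, which is why the formula can be written with plain sums.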

In R the test for correlation is performed with the cor.test() function. To demonstrate, we will use the following data shown in Figure 4.16:

> X <- 1:20
> Y <- c(-17, 12, 27, 7, 11, -2, 38, 10, 32, 31, 35, 36, 34, 45, 58, 33, 49, -12, -4, 44)
> cor.test(X, Y)

        Pearson's product-moment correlation

data:  X and Y
t = 7.3194, df = 18, p-value = 8.489e-07
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6848344 0.9456427
sample estimates:
      cor
0.8651642

The correlation between these two variables is r = 0.865, which is both large and positive as expected by looking at the graph. The output also includes a significance test and a display of the 95% confidence interval, which are very useful. By default when you use cor.test(), it will use the Pearson product moment approach. There are two additional approaches for estimating correlation, developed by Spearman and Kendall, but these two are considered non-parametric methods based upon ranks rather than on the quantities shown in Eqn. 4.1, and they will be left until 5.2.1 when we can fully discuss how they work.
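The printed output of cor.test() is convenient to read, but the individual pieces can also be pulled out of the returned object by name. The sketch below uses simulated data (an arbitrary seed and a made-up linear trend, not the data of Figure 4.16) and also shows how the t statistic follows from r and the degrees of freedom.

```r
set.seed(42)                        # arbitrary seed so the sketch is repeatable
x <- 1:20
y <- 3 * x + rnorm(20, sd = 10)     # simulated values, not the book's data

res <- cor.test(x, y)               # Pearson's method by default

res$estimate                        # the correlation, r
res$statistic                       # the t statistic
res$parameter                       # degrees of freedom, N - 2
res$p.value                         # P-value of the test
res$conf.int                        # 95% confidence interval for r

# The t statistic can be recovered from r and the degrees of freedom
r         <- unname(res$estimate)
dof       <- unname(res$parameter)
t_by_hand <- r * sqrt(dof) / sqrt(1 - r^2)
```

Storing the result in a variable, rather than only printing it, is what makes these components available for later use.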

4.5 Useful Functions

The following functions were introduced in this chapter and you will be required to use them for the exercises. To get more information on any of these functions, use the R help system.

• dchisq(x,df)  Returns the density of the $\chi^2$ distribution with df degrees of freedom.
• df(x,df1,df2)  Returns the density of the F distribution with df1 and df2 degrees of freedom.
• dnorm(x)  Returns the density of a normal distribution at x.
• mean(x)  Calculates the mean of the values in x.
• pchisq(x,df)  Returns the cumulative distribution of the $\chi^2$ distribution with df degrees of freedom.
• pf(x,df1,df2)  Returns the cumulative distribution of the F distribution with df1 and df2 degrees of freedom.
• plot(x)  This is the main wrapper function that creates a graphical display of the variable(s) that you pass to it. Depending upon the variables passed, it will create different types of plots.
• pnorm(x)  Returns the cumulative distribution of a normal distribution at x.
• qchisq(x,df)  Returns the quantile of the $\chi^2$ distribution with df degrees of freedom for the probability x.
• qf(x,df1,df2)  Returns the quantile of the F distribution with df1 and df2 degrees of freedom for the probability x.
• qnorm(x)  Returns the quantile of a normal distribution for the probability x.
• rchisq(x,df)  Returns x random numbers from the $\chi^2$ distribution with df degrees of freedom.
• rf(x,df1,df2)  Returns x random numbers from the F distribution with df1 and df2 degrees of freedom.
• rnorm(x)  Returns x random numbers from the normal distribution.
• sd(x)  Returns the sample standard deviation of data in x.
• table(f)  This function takes the list of levels in the factor f and makes a table from it.
• var(x)  Estimates the sample variance, $s^2$, from the values in x.
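The d, p, q, and r prefixes in the list above follow one pattern across all of the distribution families. The short sketch below, using arbitrary example values, shows how the p and q functions undo each other for the normal and $\chi^2$ distributions.

```r
# Normal distribution
p <- pnorm(1.96)        # cumulative probability P(X <= 1.96), about 0.975
q <- qnorm(p)           # the quantile function undoes pnorm()
d <- dnorm(0)           # height of the density curve at 0, i.e. 1 / sqrt(2 * pi)

# The same pattern for the chi-squared distribution (df = 3 is arbitrary)
crit <- qchisq(0.95, df = 3)        # value with 95% of the distribution below it
back <- pchisq(crit, df = 3)        # recovers the probability 0.95

# The r functions draw random values, so their output changes on every call
draws <- rnorm(5)
```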

overlay the density using the density function. Explain what is happening with the command data <−LETTERS[ rpois(23.10). 0.1. and then test the hypothesis HO :Height is independent of Weight. 2 ) ]. What are the critical values for a χ2 distribution with df = 8 if you are assuming that α = [0. or platykurtic? How do you know? 8.6 Exercises The following exercises are meant to help you understand the items presented in this Chapter. In a Platykurtic distribution what is the relationship between the median? mean.3)? 7. Create a histogram of 1000 random numbers drawn from the F -distribution with parameters df 1 = 1 & df 2 = 10. mode. 0. plot it an appropriate graphic.01. meso. and show how you would access the ”B” element in the table. What is the range of possible values you can get for a Pearson’s Product-Moment Correlation? 10. For the probabilities p = seq(0.1.001]? 2.csv in the folder.by=. Create a new variable that is a table of the results of this command.1. 1. 4. Is the data from the command x <−rf(1000. x<−rnorm(10) and y<−rpois(10. Save the image and include it in your answer. Make sure to have your axes labeled and drawn properly. Biological Data Analysis Using R . 0. Label the axes 3. Load this data into R .1) create a graph that has a red line representing the quantile function for the Poisson distribution (qpois with λ = 1) and a blue one representing the quantile function for the χ2 distribution (qchisq with df = 1). show me the table.2. On this plot. 6.4. There is a data set named HWCorrelationData.6.0. 9.1). Label the axes appropriately. What is the inner-quartile of the data x <−rnorm(200. and 5. lepto.9. Create a scatter plot using the variables ”Jaw Size” and ”Number of Kids”. EXERCISES 69 4.


Chapter 5

Contingency Tables

In this chapter we will examine non-parametric methodologies that are available for the analysis of random variables. It is not uncommon in Biology to encounter the notion that non-parametric approaches are only to be used with categorical (e.g., nominal) data. However, non-parametric analyses are just as applicable to the ordinal and interval data that we commonly come into contact with, and in this Chapter we will go over a few examples of how you can use general non-parametric statistical approaches in your own research. In this Chapter you will learn the following skills:

• Non-parametric analysis of a single categorical data set $(x_1, x_2, \ldots, x_N)$ using a $\chi^2$ test.
• Non-parametric analysis of paired data $((x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N))$ using the Fisher Exact test for small data sets and the general $\chi^2$ test for large data sets.
• Non-parametric analysis of several random samples using the Kruskal-Wallis test.

For most of the exercises in this chapter you will need to load the stats library by issuing the command: library(stats).

5.1 One Random Sample

For this section, we will assume that your data consist of N observations made on a single variable, $X = [x_1, x_2, \ldots, x_N]$.

5.1.1 Goodness of Fit

The $\chi^2$ test for goodness of fit is the typical $\chi^2$ test that we have all had a million times as an undergraduate and a graduate student. The data for this test consist of N observations that can be categorized into K discrete categories. In R we will use the factor data type (see 2.4.10 for more on the factor type).

The assumptions of this test are:

1. All the observations are selected randomly.
2. You can assign an observation to one of the K categories without error.

The test statistic for this analysis is the calculated $\chi^2_{Calc}$, which is:

$$\chi^2_{Calc} = \sum_{i=1}^{K} \frac{(O_i - E_i)^2}{E_i} \tag{5.1}$$

The underlying distribution of $\chi^2_{Calc}$ will be approximated using the $\chi^2$-distribution with K − 1 degrees of freedom. From the discussion of this distribution and its depiction in Figure 4.2, it is large values of $\chi^2_{Calc}$ that will lead to the rejection of the null hypothesis, $H_O$.

Example Problem: Assume that we have captured a sample of the Marbled Salamander, Ambystoma opacum, from the Rice Center for Environmental Studies (a field station for Virginia Commonwealth University). On each of these individuals we have classified the marbling pattern as either Little White ($N_A$ = 24), Moderate White Marbling ($N_B$ = 47), or Mostly White ($N_C$ = 29). A separate crossing experiment has suggested that the marbling on an individual may be under the control of a limited number of genetic loci and has predicted that the frequency of these types would be 1 : 2 : 1 in populations at equilibrium. Do the proposed mechanisms predict the distribution of phenotypes that you sampled from the wild? To test the hypothesis, $H_O$: Phenotypes occur at a ratio of 1 : 2 : 1, in R we would:

> Phenotypes <- as.factor(c(rep("Little White", 24),
+   rep("Marbled", 47), rep("Mostly White", 29)))
> p <- c(1, 2, 1)
> p <- p / sum(p) # makes p a vector of probabilities
> table(Phenotypes)
Phenotypes
Little White      Marbled Mostly White
          24           47           29
> chisq.test(table(Phenotypes), p = p)

        Chi-squared test for given probabilities

data:  table(Phenotypes)
X-squared = 0.86, df = 2, p-value = 0.6505

So here, the observed and expected values were relatively close to each other, producing a $\chi^2_{Calc}$ (in R called "X-squared") of 0.86, which with df = 2 has a P-value of 0.6505. Not something that would be considered rare. As a result, we fail to reject $H_O$ that the ratio of phenotypes is 1 : 2 : 1. Here the thing that was passed to the chisq.test function was an object of class table. This is only one way that you can pass data to the chisq.test function. See ?chisq.test for more information on other ways to pass your data to this function.
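The goodness-of-fit statistic is simple enough to verify by hand. The sketch below recomputes it from the three salamander counts and the 1 : 2 : 1 expectation, then obtains the P-value from the upper tail of the $\chi^2$ distribution. The variable names are chosen here for illustration.

```r
obs      <- c(24, 47, 29)           # observed counts from the salamander example
p        <- c(1, 2, 1) / 4          # hypothesized 1 : 2 : 1 ratio as probabilities
expected <- sum(obs) * p            # expected counts: 25, 50, 25

chi2 <- sum((obs - expected)^2 / expected)    # sum of (O - E)^2 / E
dof  <- length(obs) - 1                       # K - 1 degrees of freedom
pval <- pchisq(chi2, df = dof, lower.tail = FALSE)

chi2    # 0.86
pval    # about 0.6505

# chisq.test() also accepts a plain vector of counts
fit <- chisq.test(obs, p = p)
```

Only the upper tail is used because, as noted above, it is large values of the statistic that count against the null hypothesis.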

5.1.2 Binomial Test

The binomial test evaluates the support for the probability (p) that an observation was categorized into one of two groups. The following assumptions are inherent in the binomial test:

1. Each of the N observations is mutually independent.
2. Each observation can be characterized as either Category A or Category B, and the probability of assigning it to A is denoted as p (and to B as 1 − p).

The binomial test tests to see if the number of items you have classified as Category A is rare given a specified probability, p. In the example below, I am considering the situation where a coin was flipped 20 times and was found to have shown Heads only six times. The hypothesis is: $H_O$: p = 0.5. The test itself is performed using the binom.test() function. The function needs a few pieces of data: the number of times Category A was observed (as x), the total number of trials (as n), and the hypothesized probability p. Calling it with these data would be done as:

> binom.test(x=6, n=20, p=0.5)

        Exact binomial test

data:  6 and 20
number of successes = 6, number of trials = 20, p-value = 0.1153
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.1189316 0.5427892
sample estimates:
probability of success
                   0.3

These results suggest that even with only 6 observed Heads in 20 flips, we cannot reject $H_O$ that it is a fair coin. However, the 95% confidence interval shows that there is a large range of values we cannot reject.

5.1.3 General Contingency Tables

For this next application of contingency tables we will focus on data describing the diversity of students in the College of Humanities & Sciences at Virginia Commonwealth University. These data are reported by all public institutions and can be found for VCU at the webpage http://www.vcu.edu/cie/analysis/reports/sets.html and are summarized in Table 5.1. In general, we are going to create a contingency table that has the general form:

          Col 1   Col 2   Col 3   ...   Col c   Totals
Row 1     O11     O12     O13     ...   O1c     R1
Row 2     O21     O22     O23     ...   O2c     R2
...       ...     ...     ...     ...   ...     ...
Row r     Or1     Or2     Or3     ...   Orc     Rr
Totals    C1      C2      C3      ...   Cc      N

with r rows of data and c columns. Each of the entries in the r x c contingency table (the $O_{ij}$ values) is a count of the number of observations that were classified as belonging to the category in the ith row and the jth column. There are two specific assumptions that are required to conduct a general contingency table test such as this:

1. The N samples are drawn randomly from the larger population.
2. Each observation can be classified into exactly one of the possible r and c categories according to single and independent criteria (e.g., there is no correlation between the row and column variables).

Above, when we looked at the $\chi^2$ test, it was a smaller version of this table, and the test statistic for analyses in general contingency tables is the same as above:

$$\chi^2_{Calc} = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

The only distinction here is that our expected values are based upon row and column totals such that:

$$E_{ij} = \frac{R_i C_j}{N}$$

where $R_i$ and $C_j$ are the respective row and column totals.
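The expected values and the test statistic for a general contingency table can be built directly from the row and column totals. The sketch below does this for a hypothetical 2 x 3 table of counts (invented for this example) and checks the result against what chisq.test() stores in its returned object.

```r
# Hypothetical 2 x 3 table of counts, for illustration only
O <- matrix(c(12, 30, 18,
              20, 25, 15), nrow = 2, byrow = TRUE)

R <- rowSums(O)            # row totals, R_i
C <- colSums(O)            # column totals, C_j
N <- sum(O)                # grand total

E    <- outer(R, C) / N    # expected counts, E_ij = R_i * C_j / N
chi2 <- sum((O - E)^2 / E) # the double sum over rows and columns
dof  <- (nrow(O) - 1) * (ncol(O) - 1)

# chisq.test() keeps the same quantities in its result
fit <- chisq.test(O)
all.equal(E, fit$expected, check.attributes = FALSE)
all.equal(chi2, unname(fit$statistic))
```

The outer() call produces every pairwise product of a row total with a column total in one step, which matches the definition of the expected counts term by term.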

Table 5.1: Diversity of enrolled undergraduate students at Virginia Commonwealth University in the College of Humanities & Sciences between the academic years 1998-2008 as reported by the Center for Institutional Effectiveness (http://www.vcu.edu/cie/analysis/reports/sets.html).

Group                               1998   1999   2000   2001   2002   2003   2004   2005   2006   2007   2008
Non-resident Aliens                  186    158    188    208    206    235    272    375    512    577    673
Black non-Hispanic                  2985   3094   3282   3332   3387   3456   3633   3797   3983   4158   4193
American Indian or Alaskan Native     91     80     83     86     90    113    109    116    124    131    131
Asian or Pacific Islander           1103   1139   1132   1175   1231   1437   1632   1764   1970   2148   2330
Hispanic                             279    305    362    400    449    521    559    623    709    761    822
White, non-Hispanic                 8688   8586   9013   9373   9916  10077  10757  11088  11180  11170  11202
Race/ethnicity unknown                 0    188    208    279    387    665    849    928   1019   1287   1642
Total                              13332  13550  14268  14853  15666  16504  17811  18691  19497  20232  20993

To demonstrate this analysis we will analyze the 1998, 2003 and 2008 enrollment data from Table 5.1 to see if the diversity of students at VCU has changed over the last decade. These data are present in a text file named VCUCommonData.csv in the folder for this Chapter. It is loaded into R with the following commands.

> data <- read.table("VCUCommonData.csv", header=T, sep=",")
> summary(data)
     Yr1998           Yr1999         Yr2000         Yr2001
 Min.   :   0.0   Min.   :  80   Min.   :  83   Min.   :  86.0
 1st Qu.: 138.5   1st Qu.: 173   1st Qu.: 198   1st Qu.: 243.5
 Median : 279.0   Median : 305   Median : 362   Median : 400.0
 Mean   :1904.6   Mean   :1936   Mean   :2038   Mean   :2121.9
 3rd Qu.:2044.0   3rd Qu.:2116   3rd Qu.:2207   3rd Qu.:2253.5
 Max.   :8688.0   Max.   :8586   Max.   :9013   Max.   :9373.0
     Yr2002           Yr2003          Yr2004            Yr2005
 Min.   :  90.0   Min.   :  113   Min.   :  109.0   Min.   :  116
 1st Qu.: 296.5   1st Qu.:  378   1st Qu.:  415.5   1st Qu.:  499
 Median : 449.0   Median :  665   Median :  849.0   Median :  928
 Mean   :2238.0   Mean   : 2358   Mean   : 2544.4   Mean   : 2670
 3rd Qu.:2309.0   3rd Qu.: 2446   3rd Qu.: 2632.5   3rd Qu.: 2780
 Max.   :9916.0   Max.   :10077   Max.   :10757.0   Max.   :11088
     Yr2006            Yr2007          Yr2008
 Min.   :  124.0   Min.   :  131   Min.   :  131.0
 1st Qu.:  610.5   1st Qu.:  669   1st Qu.:  747.5
 Median : 1019.0   Median : 1287   Median : 1642.0
 Mean   : 2785.3   Mean   : 2890   Mean   : 2999.0
 3rd Qu.: 2976.5   3rd Qu.: 3153   3rd Qu.: 3261.5
 Max.   :11180.0   Max.   :11170   Max.   :11202.0

Once the entire data set is loaded into R, we can extract only the values that we are going to use.

> Obs <- as.matrix(cbind(data$Yr1998, data$Yr2003, data$Yr2008))
> Obs
     [,1]  [,2]  [,3]
[1,]  186   235   673
[2,] 2985  3456  4193
[3,]   91   113   131
[4,] 1103  1437  2330
[5,]  279   521   822
[6,] 8688 10077 11202
[7,]    0   665  1642
> colnames(Obs) <- c("1998","2003","2008")
> rownames(Obs) <- c("Non-resident Aliens", "Black non-Hispanic",
+   "American Indian or Alaskan Native", "Asian or Pacific Islander",
+   "Hispanic", "White, non-Hispanic", "Race/ethnicity unknown")
> Obs
                                  1998  2003  2008
Non-resident Aliens                186   235   673
Black non-Hispanic                2985  3456  4193
American Indian or Alaskan Native   91   113   131
Asian or Pacific Islander         1103  1437  2330
Hispanic                           279   521   822
White, non-Hispanic               8688 10077 11202
Race/ethnicity unknown               0   665  1642

With these data we will be specifically testing the hypothesis that across years there are no differences in the relative distributions of self-identified racial and ethnic groups. To begin with, we can plot the categories as a barplot (see 8.1 for how to make these plots yourself) as represented in Figure 5.1. In some texts, this (7x3) contingency test is called a $\chi^2$ Test for Independence and in R is conducted using the chisq.test() function.

> test1 <- chisq.test(Obs)
> test1

        Pearson's Chi-squared test

data:  Obs
X-squared = 1704.417, df = 12, p-value < 2.2e-16

> summary(test1)
          Length Class  Mode
statistic  1     -none- numeric
parameter  1     -none- numeric
p.value    1     -none- numeric
method     1     -none- character
data.name  1     -none- character
observed  21     -none- numeric
expected  21     -none- numeric
residuals 21     -none- numeric

Figure 5.1: Undergraduate diversity at Virginia Commonwealth University during academic years 1998, 2003, & 2008.

Notice here that I actually assigned the results of the statistical test to the variable test1. I did this because there are many reasons why you may be interested in looking at various aspects of the analysis. By printing the contents of the test itself, we see that

the calculated statistic $\chi^2_{Calc} = 1704.417$, which with (r − 1)(c − 1) = 6 × 2 = 12 df produces a very small P-value. If you look back at Figure 4.2, our observed value is way out to the right, with a very small likelihood that you would get a value this large if the null hypothesis were true. As the output of summary(test1) shows, the analysis itself returns a list that has all the components as list items. There are a lot of different reasons why you may be interested in using various components of the analysis. For example, you may want to create a table of the observed or expected values, or you may need to run this test a large number of times and store the results.

Caveats

There are some caveats that need to be made with respect to general use of contingency tables. First, the test statistic we have been using, $\chi^2_{Calc}$ with (r − 1)(c − 1) df, is actually an approximation that is good only with good representation. If the values in the cells are small then the approximation that we use to find the Type I error (the α value) is poorly estimated. Second, these tests are very robust as long as you have a moderate number of samples in each of the cells. OK, but what is moderate? Here are some general guidelines:¹

1. If any of the $E_{ij}$ estimates are less than 1 the approximation will be poor.
2. If more than 20% of the $E_{ij}$ values are less than 5 then the approximation will be poor.

¹ These guidelines are a bit on the conservative side and you may want to see a text on non-parametric statistics for a more complete discussion of how far you can stray from these and still not get laughed at.

So what do you do if you have some small expected values? First, you can try to collapse some of your row or column categories and recalculate. It really depends upon your knowledge of the biology of the system whether this can be done without making it a meaningless analysis. Second, you can try to use Fisher's Exact Test. This uses combinatorial theory to estimate the probabilities of the test statistic rather than asymptotic assumptions. This is an excellent choice but has the problem that, since it uses combinatorial theory, at some point you will have to perform an operation like N!, and when N > 170 the computer cannot calculate a number that large. There is also the restriction that the product of the row marginals (the $R_i$ values in the table) must be strictly less than $2^{31} - 1$, but the N < 170 rule is a bit easier to remember.

5.2 Paired Observations

Analyses in this section will be concerned with data that are collected in a pair-wise fashion (e.g., for each observation, there are two values collected).

5.2.1 Rank Correlation

In 4.4.2 we looked at how you use the cor.test function to get a parametric estimate of the correlation between two sets of variables. This is possible as well using a non-parametric approach by adopting a ranking methodology. Non-parametric correlation methods include Spearman's ρ and Kendall's τ, among others, but the interface in R is identical (and the same as we already saw for the Pearson product moment correlation) so I will only cover the Spearman approach and leave you to look into the differences. Spearman's correlation statistic, ρ, is calculated as:

$$\rho = \frac{\sum_{i=1}^{N} R[X_i]R[Y_i] - N\left(\frac{N+1}{2}\right)^2}{\left(\sum_{i=1}^{N} R[X_i]^2 - N\left(\frac{N+1}{2}\right)^2\right)^{1/2}\left(\sum_{i=1}^{N} R[Y_i]^2 - N\left(\frac{N+1}{2}\right)^2\right)^{1/2}} \tag{5.3}$$

where the term $R[X_i]$ is the rank of the ith element in X. These ranks are computed in comparison to the other values in X. For example, $R[X_i] = 1$ is the smallest value of X, $R[X_i] = 2$ would be the second smallest, etc. So what is being done here is that we are replacing the actual values of the variables by their relative ranks. Using the same data as in 4.4.2, you specify the use of the Spearman approach using ranks by passing it as an additional option to the cor.test function.

> X <- 1:20
> Y <- c(-17, 12, 27, 7, 11, -2, 38, 10, 32, 31, 35, 36, 34, 45, 58, 33, 49, -12, -4, 44)
> cor.test(X, Y, method="spearman")

        Spearman's rank correlation rho

data:  X and Y
S = 198, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho
0.8511278

Notice here that the correlation is significant although the correlation statistic is a bit smaller. There is some loss of information by putting the data into ranks rather than using the raw values. So why use this instead of the parametric approaches? Well, the calculation of Pearson's r statistic depends upon the bivariate distribution of X and Y. If there is no known joint distribution for these variables then the density function of r is undefined. What does this mean to you? It means that if your data can be assumed to be normal then go ahead and use the Pearson approach. However, if you cannot assume that they are normal, or you know they are not, then a rank approach may be more appropriate. For me, I consider the non-parametric approaches as appropriate for all data, whereas the parametric ones are only good for a subset of the data that we encounter.
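One way to see what Spearman's ρ is doing is to compute it three ways: from the rank formula written out directly, as Pearson's r applied to the ranks, and with R's built-in method. The sketch below uses simulated continuous data (arbitrary seed, no ties), not the data from the example above.

```r
set.seed(1)                          # arbitrary seed, simulated data
x <- rnorm(15)
y <- x + rnorm(15)                   # continuous values, so no tied ranks

rx <- rank(x)                        # R[X_i]
ry <- rank(y)                        # R[Y_i]
N  <- length(x)
k  <- N * ((N + 1) / 2)^2            # the N((N + 1)/2)^2 term

# Rank formula written out term by term
rho_by_hand <- (sum(rx * ry) - k) /
  (sqrt(sum(rx^2) - k) * sqrt(sum(ry^2) - k))

# Pearson's r computed on the ranks instead of the raw values
rho_ranks <- cor(rx, ry)

# R's built-in Spearman estimate
rho_builtin <- cor(x, y, method = "spearman")

c(rho_by_hand, rho_ranks, rho_builtin)
```

All three agree because the mean of the ranks 1 through N is (N + 1)/2, so subtracting N((N + 1)/2)² is the same as centering the ranks on their mean.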

5.2.2 Wilcoxon Test

The Wilcoxon test is also known as the Mann-Whitney test and is a rank-based method analogous to the two-sample t-test. Data here are drawn randomly from two different "treatments" to see if the application of either produces a significant shift in the values of one set of observations. This approach tests the null hypothesis that samples drawn from two different populations are essentially the same (i.e., they are as likely as samples drawn from one or the other population). In general your data should look like:

Treatment 1    Treatment 2
X1             Y1
X2             Y2
...            ...
Xn             Ym

In this analysis, we do not assume that both X and Y have the same number of observations and in general will consider X to have n observations while Y has m, and denote N = n + m.

Assumptions

The Wilcoxon test has the following assumptions:

1. Both sets of samples (the X and Y observations) are drawn randomly from each population.
2. There is an expected mutual independence between the X and Y values as well.
3. The variables are at least ordinal.

As was discussed for Spearman's ρ, samples will be ranked in increasing order for this analysis. Samples are lumped together and assigned ranks based upon the combined N observations. In the case of ties, where two or more samples have the exact same value, it is recommended to assign the average rank to all the tied observations. Fortunately for us, the internal R code takes care of this (and will provide warnings when appropriate) so we can focus on our tasks and let R focus on the specifics.

The test statistic for this analysis is the sum of the ranks of the X variables:

$$W = \sum_{i=1}^{n} R[X_i]$$

If the observations in X and Y are drawn from a single population, as stated in the null hypothesis, then the sum of the ranks of X should be about as large as expected for the sum of the ranks of Y. If the treatments are producing differences in either X or Y then the test statistic will be unusually large or small given N. If the ranks in sample X tend to be generally larger or smaller than those of the observations in Y then we can reject the null hypothesis $H_O$: X = Y.
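The rank sum described above is easy to compute by hand, and doing so clarifies what wilcox.test() reports. In R, the W printed by wilcox.test() is the rank sum of the first sample minus its smallest possible value, n(n + 1)/2, so the two numbers differ by a constant. The sketch below uses simulated data without ties (arbitrary seed and sample sizes chosen for illustration).

```r
set.seed(7)                          # arbitrary seed, simulated data without ties
x <- rnorm(8, mean = 1)              # "treatment 1", n = 8
y <- rnorm(10)                       # "treatment 2", m = 10
n <- length(x)

r        <- rank(c(x, y))            # ranks within the combined N = n + m values
rank_sum <- sum(r[seq_len(n)])       # sum of the ranks belonging to X

# wilcox.test() reports the rank sum shifted by n(n + 1)/2
W_R <- unname(wilcox.test(x, y)$statistic)

rank_sum
W_R
all.equal(W_R, rank_sum - n * (n + 1) / 2)
```

Because the shift depends only on n, it does not change the P-value or the conclusion; it only changes the number printed next to W.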

To show how to conduct the Wilcoxon test, I will use the pine germination data that is in the folder for this Chapter. These data are from my thesis and record the average germination rates for offspring arrays of Pinus echinata families who were sampled in continuous (CTRL), selectively cut (SEL), and stands where all the trees around P. echinata were clear-cut (CLR). Here we will use the Wilcoxon to see if there is a significant difference in germination rates between the control (CTRL) and clear-cut treatments (CLR). Here is how to load the data into R and extract just the treatments of interest.

> pineData <- read.table("PineGerminationData.txt", header=T)
> summary(pineData)
      GERM           TRT
 Min.   :0.0000   CLR :15
 1st Qu.:0.1800   CTRL:23
 Median :0.3700   SEL :15
 Mean   :0.3625
 3rd Qu.:0.5700
 Max.   :0.9400
> X <- pineData$GERM[pineData$TRT=="CLR"]
> Y <- pineData$GERM[pineData$TRT=="CTRL"]
> length(X)
[1] 15
> length(Y)
[1] 23
> mean(X)
[1] 0.4966667
> mean(Y)
[1] 0.2173913
> range(X)
[1] 0.00 0.94
> range(Y)
[1] 0.00 0.64

You can see that there are different numbers of samples in each treatment but that they have overlapping ranges. To run the Wilcoxon test, use the function wilcox.test and pass it the two variables.

> wilcox.test(X, Y)

        Wilcoxon rank sum test with continuity correction

data:  X and Y
W = 269.5, p-value = 0.003835
alternative hypothesis: true location shift is not equal to 0

Warning message:
In wilcox.test.default(X, Y) : cannot compute exact p-value with ties

According to our test, the data in X and Y appear to be different. The test statistic is W = 269.5, which gives it a P-value of 0.004. There is a warning message that you should be aware of. Apparently in the data, there were ties and this causes some problems in calculating the significance of the parameter. These ties are for families that did not produce any offspring. From a biological perspective, these are valid responses and you would have to just live with the fact that ties existed because throwing out all the 0.00 values changes the interpretation of what happened.
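The tie warning can be reproduced, and avoided, on a small example. The vectors below are hypothetical stand-ins with tied zeros, not the pine data. Passing exact = FALSE asks for the normal approximation explicitly, which is what R falls back to when ties are present, so the P-value should not change.

```r
# Hypothetical germination rates containing tied zeros (not the pine data).
X <- c(0.00, 0.00, 0.35, 0.41, 0.58, 0.94)
Y <- c(0.00, 0.00, 0.00, 0.21, 0.29, 0.37, 0.64)

# Default call: R warns that an exact p-value cannot be computed with ties
# and uses the normal approximation instead.
default_test <- suppressWarnings(wilcox.test(X, Y))

# Requesting the approximation explicitly gives the same answer.
approx_test <- wilcox.test(X, Y, exact = FALSE)

all.equal(default_test$p.value, approx_test$p.value)
```

Whether to silence the warning this way is a matter of taste; the warning is a useful reminder that the zeros are tied.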

In general, the Wilcoxon test is rather powerful in determining the equality of samples drawn from two different populations. It is essentially the non-parametric version of the normal t-test (see footnote 2). Situations where you may favor a Wilcoxon approach over the t-test are when you have non-normal data or data with several outlier points.

Footnote 2: If you do a t-test on the ranks you will get essentially the same answer as the Wilcoxon. The two approaches differ mainly in how the data are encoded, raw or as ranks.

5.3 Several Random Samples

The final section in this chapter is focused on data that is collected from multiple treatments. In the previous discussion of the Wilcoxon test, the data had k = 2 treatments and it was introduced as a rank-based analog of the t-test. Here we will introduce the Kruskal-Wallis test which allows for the analysis of k > 2 treatments and we could again consider it a rank-based analog of an analysis of variance (ANOVA) approach.

5.3.1 Kruskal-Wallis Tests

The Kruskal-Wallis test examines the differences among k different treatments using a rank-based approach similar to that discussed for the Wilcoxon test. In fact, this test is just an extension of the Wilcoxon test using the same sum of ranks approach. Data for this test is not assumed to be of equal sizes. You should be able to make a list of your data by treatment such as:

Treatment 1    X11  X12  ...  X1n1
Treatment 2    X21  X22  ...  X2n2
...
Treatment k    Xk1  Xk2  ...  Xknk

Each treatment may have a different number of observations in it with a total sample size of:

$N = \sum_{i=1}^{k} n_i$

The test statistic for this test is compared against a χ2 approximation with k − 1 degrees of freedom.

Assumptions

There are several assumptions associated with this test:

1. All samples are randomly drawn from their respective treatments.
2. Treatments are independent of each other.
3. The observations are at least ordinal in nature.

As an example using this analysis, we will examine the same Pinus echinata data set that we used to demonstrate the Wilcoxon test. The default method for performing this analysis looks like kruskal.test(x, g, ...) where the variable x is the raw data and the g one is another variable that has the groupings.
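To make the sum of ranks idea concrete, the statistic can be computed by hand. The closed form used below is the standard textbook expression for the Kruskal-Wallis statistic without ties; it is not spelled out in the text above. The data are hypothetical and were chosen to have no tied values.

```r
# Hypothetical yields for k = 3 treatments with unequal sizes and no ties.
x <- c(2.1, 3.4, 1.9, 4.0,   5.2, 6.1, 4.8,   7.3, 6.6, 8.0, 7.7)
g <- factor(rep(c("A", "B", "C"), times = c(4, 3, 4)))

N <- length(x)
r <- rank(x)                      # ranks over all N observations
rank_sums <- tapply(r, g, sum)    # rank sum within each treatment
n_i <- tapply(r, g, length)       # number of observations per treatment

# Kruskal-Wallis statistic (no tie correction is needed for these data).
H <- 12 / (N * (N + 1)) * sum(rank_sums^2 / n_i) - 3 * (N + 1)

# Compare against a chi-squared distribution with k - 1 degrees of freedom.
p <- pchisq(H, df = nlevels(g) - 1, lower.tail = FALSE)

# kruskal.test() should agree with the hand calculation.
kt <- kruskal.test(x, g)
all.equal(H, unname(kt$statistic))
all.equal(p, kt$p.value)
```

With tied observations R applies an additional correction to the statistic, so the hand calculation above would no longer match exactly.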

In the code below I separate out the variables and then pass them to the function with Germination as the response and grouped by the factor Treatment. I also conduct the analysis and assign it to the variable named germTest so you can see that this analysis also returns a list of results.

> pineData <- read.table("PineGerminationData.txt", header=T)
> GerminationRates <- pineData$GERM
> Treatment <- as.factor(pineData$TRT)
> Treatment
 [1] CTRL CTRL CTRL CTRL CTRL CTRL CTRL CTRL CTRL CTRL CTRL CTRL CTRL CTRL CTRL
[16] CTRL CTRL CTRL CTRL CTRL CTRL CTRL CTRL SEL  SEL  SEL  SEL  SEL  SEL  SEL
[31] SEL  SEL  SEL  SEL  SEL  SEL  SEL  SEL  CLR  CLR  CLR  CLR  CLR  CLR  CLR
[46] CLR  CLR  CLR  CLR  CLR  CLR  CLR  CLR
Levels: CLR CTRL SEL
> germTest <- kruskal.test(GerminationRates, Treatment)
> summary(germTest)
          Length Class  Mode
statistic 1      -none- numeric
parameter 1      -none- numeric
p.value   1      -none- numeric
method    1      -none- character
data.name 1      -none- character
> germTest

        Kruskal-Wallis rank sum test

data:  GerminationRates and Treatment
Kruskal-Wallis chi-squared = 12.539, df = 2, p-value = 0.001893

When looking at the results of the test, we see that the estimated test statistic was relatively large, suggesting that the three timber extraction treatments do differentially influence the germination percentages.

5.4 The Formula Notation & Box Plots

If you look at the function signature for the kruskal.test (by typing ?kruskal.test into R) you can see several alternate ways you can pass your data to it.

kruskal.test              package:stats              R Documentation

Kruskal-Wallis Rank Sum Test

Description:
     Performs a Kruskal-Wallis rank sum test.

Usage:
     kruskal.test(x, ...)

     ## Default S3 method:
     kruskal.test(x, g, ...)

     ## S3 method for class 'formula':
     kruskal.test(formula, data, subset, na.action, ...)

When discussing the relationship between the raw germination data and the grouping variable, I used the statement "...is a function of..." This notation is the formula notation that is indicated in the last option for calling the kruskal.test function. In R you can often use the formula notation to perform analyses and plots and here we will spend a little bit of time on how that is done. In Chapter 6 you will use this notation quite a bit when writing out linear models.

The formula notation in R consists of the response variable (or variables that I'll call Y), the predictor variable (or variables which will be denoted as X), and the tilde sign ~ showing the relationship. For example, a simple function would be denoted as Y ~ X stating that Y is a function of X. Using the function notation for the kruskal.test would look like:

> kruskal.test(GerminationRates ~ Treatment)

        Kruskal-Wallis rank sum test

data:  GerminationRates by Treatment
Kruskal-Wallis chi-squared = 12.539, df = 2, p-value = 0.001893

Figure 5.2: Boxplot of Pinus echinata germination data partitioned by timber extraction treatment.

It is even possible (and perhaps better because we are rather lazy in our typing) to use the function notation of the variable names within a data.frame without having to make the other variables (GerminationRates and Treatment). However, when you do this, you will have to pass an additional parameter to the analysis function to tell it which data to look into for those variable names. For example, with the pineData data set you can type:

> kruskal.test(GERM ~ TRT, data=pineData)

        Kruskal-Wallis rank sum test

data:  GERM by TRT
Kruskal-Wallis chi-squared = 12.539, df = 2, p-value = 0.001893

Another common place to find the function notation is in plotting. Thus far, we have called scatter plots by the function plot(x,y). It is just as easy to call the plot as plot(y ~ x) and you will get the same results if the variable x is a continuous variable. However, if x is a categorical variable you will not get a normal scatter plot. What you will get is a box plot as depicted in Figure 5.2 which was created by calling the function (see footnote 3):

> plot(GERM ~ TRT, data=pineData, xlab="Treatment", ylab="GerminationRate")

You can adjust many other components of the plot including notches, box colors, etc.

Footnote 3: To adjust additional parameters on the box plots see the function bxp which is the actual plotting function that the plot function is handing the data off to.
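The two calling styles can be compared side by side. The sketch below uses a simulated data frame named demoData as a stand-in for pineData (the real file is not assumed to be available), so the numbers it produces are not the ones printed above.

```r
# Simulated stand-in for pineData; names and values are for illustration only.
set.seed(1)
demoData <- data.frame(
  GERM = round(runif(30), 2),
  TRT  = factor(rep(c("CLR", "CTRL", "SEL"), each = 10))
)

# 1. Default method: a response vector and a grouping vector.
byVectors <- kruskal.test(demoData$GERM, demoData$TRT)

# 2. Formula method, with data= telling R where to find the variables.
byFormula <- kruskal.test(GERM ~ TRT, data = demoData)

# Both calls describe the same test and give the same statistic.
all.equal(unname(byVectors$statistic), unname(byFormula$statistic))

# boxplot() accepts the same formula and always draws a box plot.
boxplot(GERM ~ TRT, data = demoData,
        xlab = "Treatment", ylab = "Germination rate")
```

One caveat: since R 4.0, read.table leaves text columns as character strings rather than factors by default, so the grouping column may need as.factor() before plot(y ~ x) will give a box plot. Calling boxplot() directly sidesteps the issue.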

5.5 Useful Functions

The following functions were introduced in this chapter and you will be required to use them for the exercises. To get more information on any of these functions, use the R help system.

• as.factor(x)  Coerces the data in x into a factor data type.
• as.matrix(x)  Coerces the data in x into a matrix data type if possible.
• binom.test(x,n,p)  Performs a binomial test to see if observing x occurrences of one category of data in n trials is consistent with the likelihood of it occurring with a frequency of p.
• c(x,y,...)  The concatenate function that munges all the items together and returns them as a vector.
• cbind(x,y)  Binds together the data in x, y, etc. by columns.
• chisq.test(t)  Performs a χ2 test on the values in the table t.
• colnames(x)  Access the column names in the item x. This only works for matrices and data.frames.
• cor.test(x,y)  Tests for a significant (e.g., ρ ≠ 0) correlation.
• kruskal.test(x,g)  Performs the Kruskal-Wallis Rank Sum test for the data in x as partitioned into groups defined by g.
• length(x)  Returns the length of x.
• mean(x)  Returns the mean of the items in x.
• range(x)  Returns a two-element vector containing the minimum and maximum values in x.
• read.table()  Reads raw data into R.
• rownames(x)  Access the row names in x. This only works for matrices or data.frames.
• summary(x)  Returns a general summary of the data in x.
• table(f)  This function takes the list of levels in the factor f and makes a table from it.
• wilcox.test(x,y)  Performs the Wilcoxon Rank Sum Test on the variables in x and y.

5.6 Exercises

The following exercises are meant to help you understand the items presented in this Chapter.

1. Create three variables named First, Second, and Third and assign each of them the value of runif(3). Now, create a bar plot of these data assuming that the first entry in each data set represents Category A, the second Category B, and the third Category C. Make it look something like Figure 5.1 with the Categories used as the partitioning variable along the x-axis. Feel free to provide your own colors.

2. Compare the freshmen enrollment in the College of Humanities & Sciences at VCU (from Table 5.1) during the 2006-2007 academic year for Degree-Seeking Undergraduates to the three Universities listed below. Is the student diversity across these institutions the same? These data sets are prepared each academic year by each public institution and can be found by searching for "Common Data Sets" and looking at Enrollment & Persistence. Below are the places you can get this information for three Universities in our region.

• Auburn University https://oira.auburn.edu/common_ds_2006.aspx
• University of Virginia http://www.web.virginia.edu/IAAS/data_catalog/institutional/cds/current/enrollment.htm
• Virginia Tech http://www.ir.vt.edu/cds/2006/sectionb.htm

3. Calculate the relative proportions of each group in the 1999 VCU data and use the goodness of fit approach (as in 5.1.1) to see if the 2008 student class has the same relative proportions as are predicted by the 1999 class.

4. Load the data into R that is found in the file CornOutput.csv (Note: this data is tab-delimited so you will have to adjust the separator you use in the read.table function). These data represent the output in numbers of bushels per acre of corn with three different fertilizer treatments. Create a density plot showing the distribution of bushels yielded by each treatment.

5. Test the equality of the fertilizers in the data loaded from the last question using a Kruskal-Wallis test. Interpret your results.

6. What are the inner-quartiles of the three fertilizer yields?

7. From a total of N = 15 students in this course, if 14 pass, is the probability of passing this course equal to p = 0.65?

8. What does the optional parameter rescale.p change in the chisq.test function? Why would you want to use this option?

9. Assume that you observed phenotypes in the following amounts: nspots = 12 individuals with spots, nsmooth = 15 smooth coated, nsilky = 22 with silky fur, and naguti = 8 aguti. Do these data fit the hypothesis that the probability of any one of these phenotypes is equal?

10. Use wilcox.test to see if the germination rates observed in the SEL and CLR treatments are significantly different. Provide some interpretation of your results.


Chapter 6

Linear Models

This chapter focuses on the analysis of linear models in R. In this Chapter you will learn the following skills:

• Learn to analyze data using a simple regression approach.
• Be able to incrementally build a multiple regression model using Type III sums of squares.
• Perform an analysis of variance (ANOVA) analysis for both 1-way and factorial models.

The term "linear model" is a general one that will be used a bit loosely. In general, a linear model is one that can be written down in the form:

y = x

Some variable, or set of variables, y, are predicted to have a particular relationship with some predictor variable (or variables) denoted in x. There are many different ways of introducing these different kinds of analysis but we are going to focus on the functional form and the kinds of variables that make up the predictor x. In the simplest case when both x and y are continuous variables, the analysis is called a regression analysis. However, if x has more than one predictor variable then it is called a multiple regression, and if y is binary it is a logistic regression. If the predictor variable is categorical the model is called an analysis of variance with many variants depending upon the number and relationship of categorical predictor variables in x. Finally, if predictor variables consist of categorical and continuous variables then it is called an analysis of covariance.

6.1 The t-test

6.1.1 One-Sample Tests

The first linear model we will deal with is the t-test. The functional form of this is:

y = µ

where we believe that the observations sampled in y have some particular mean value and the variation around that mean value is simply the natural variation there is in the kind of samples we are measuring. The function that performs the one-sample t-test in R is (not surprisingly) called t.test and has the following options available to it.

t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"),
       mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95, ...)

For a one-sampled test, we will pass the response variable and a value for the parameter mu to the function. By default, it will test the null hypothesis HO : ȳ = µ (the mu in the signature) using a "two.sided" alternate hypothesis. This means that we can reject the null if ȳ < µ and if ȳ > µ using an α/2 rejection region at each end. If you have reason to believe that the observations should be larger or smaller than some particular value of µ, something along the lines of say "the addition of fertilizer should increase yield," then you should be using a one-tailed test instead that only examines an α-sized region at one end. In the data below, we are testing the hypothesis that HO : ȳ = 15 with the given data.

> Y <- c(19, 25, 14, 15, 24, 17, 19, 27, 29, 25)
> test1 <- t.test(Y, mu=15)
> summary(test1)
            Length Class  Mode
statistic   1      -none- numeric
parameter   1      -none- numeric
p.value     1      -none- numeric
conf.int    2      -none- numeric
estimate    1      -none- numeric
null.value  1      -none- numeric
alternative 1      -none- character
method      1      -none- character
data.name   1      -none- character
> print(test1)

        One Sample t-test

data:  Y
t = 3.8523, df = 9, p-value = 0.003892
alternative hypothesis: true mean is not equal to 15
95 percent confidence interval:
 17.64182 25.15818
sample estimates:
mean of x
     21.4

You can see that I assigned the results of the analysis to the variable named test1. Just as in the contingency tables examples (5.1.3 & 5.2.2) the results of an analysis are a list containing all the parameters that were used to perform the analysis as well as intermediary materials and results. Of particular mention are the parameters p.value, conf.int, and statistic. Overall, the analysis found that we can reject the null hypothesis HO : ȳ = 15 with a P-value of ≈ 0.004. This is fairly good support for the notion that the mean of these observations is not equal to 15.
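Because the result is a list, the individual pieces can be pulled out by name, and the one-tailed version of the test only needs the alternative argument. The following is a short sketch using the same Y values; the variable name testGreater is introduced here for illustration.

```r
Y <- c(19, 25, 14, 15, 24, 17, 19, 27, 29, 25)
test1 <- t.test(Y, mu = 15)

# Individual results are pulled out of the list with the $ operator.
test1$statistic   # the t value
test1$p.value     # two-sided P-value
test1$conf.int    # 95% confidence interval for the mean

# One-tailed alternative: is the mean greater than 15?
testGreater <- t.test(Y, mu = 15, alternative = "greater")

# Because t is positive, the one-tailed P-value is half the two-sided one.
all.equal(testGreater$p.value, test1$p.value / 2)
```

The one-tailed test should be chosen before looking at the data, based on the biological hypothesis, not on which direction happens to give the smaller P-value.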

6.1.2 Paired Tests

The t-test can also be used in a paired fashion. This analysis consists of two sets of variables, X and Y, that are observations taken in such a manner as to think that the differences between them are negligible. For example, perhaps you think that parasite load has influenced the development of young warblers so you measure the lengths of the primary feathers. Overall the null hypothesis for this is HO : X = Y. Another way to write this hypothesis is HO : (X − Y) = 0, in which case this becomes identical to the one-sampled test. An example of this in R (with entirely contrived data) would be:

> X <- round(runif(10, min=12, max=20))
> Y <- round(runif(10, min=12, max=20))
> X
 [1] 12 18 18 13 14 15 15 16 17 19
> Y
 [1] 14 17 20 13 17 12 16 17 17 15
> t.test(X, Y, paired=T)

        Paired t-test

data:  X and Y
t = -0.1416, df = 9, p-value = 0.8905
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.697808  1.497808
sample estimates:
mean of the differences
                   -0.1

Notice that since these are paired, they must be taken from the same experimental unit, which is why we added the paired=T option to the parameters we passed to t.test.

6.2 Regression With A Single Variable

A linear regression seeks to see if the values in the response variable y can be predicted to change systematically with the predictor variable x. The general form of a regression model is:

$y_{ij} = \beta_0 + \beta_1 x_i + e_{ij}$

where the response variable yij is hypothesized to be a function of three independent components:

1. The intercept, β0.
2. A slope coefficient, β1, that determines at what rate y changes with changes in x.
3. The error term, eij, is the latent variation that every observed value has around the predicted regression line.

The methods by which the parameters β0 and β1 are estimated are varied. The most common approach is the least squares approach which tries to find estimates for these two parameters that minimize the sum of squared error terms (e.g., $\sum_{i=1}^{N} e_i^2$).
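For a single predictor the least squares estimates have a closed form, which can be checked against lm. The formulas below are the standard textbook expressions and are not derived in the text; the X and Y values are the ones used in this section's example.

```r
X <- 1:10
Y <- c(19, 25, 14, 15, 24, 17, 19, 27, 29, 25)

# Least squares estimates written out from their closed-form expressions.
b1 <- sum((X - mean(X)) * (Y - mean(Y))) / sum((X - mean(X))^2)  # slope
b0 <- mean(Y) - b1 * mean(X)                                      # intercept

# Sum of squared error terms at these estimates.
sse <- sum((Y - (b0 + b1 * X))^2)

# lm() arrives at the same estimates.
all.equal(unname(coef(lm(Y ~ X))), c(b0, b1))
```

Any other choice of intercept and slope would give a larger value of sse for these data, which is what "least squares" means.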

In R we can use the function lm to construct the linear model. Here is an example data set with the values plotted in Figure 6.1.

> X <- 1:10
> X
 [1]  1  2  3  4  5  6  7  8  9 10
> Y <- c(19, 25, 14, 15, 24, 17, 19, 27, 29, 25)
> Y
 [1] 19 25 14 15 24 17 19 27 29 25
> plot(Y~X, xlab="X", ylab="Y", bty="n", col="red", pch=19, xlim=c(0,10), ylim=c(0,30))

To plot these, I used the functional form (see 5.4 for a discussion of how this works) with Y ~ X, set the labels, the plot colors, the ranges of the x- and y-axes, and the plot characters with the pch option (see footnote 1).

Footnote 1: To see all the different characters that you can use as plot symbols type plot(1:25, pch=1:25) and it will plot each symbol along the x = y line.

Figure 6.1: Plot of single variable regression values.

By eye-balling the image, do you think there is a relationship between these variables?

> fit1 <- lm(Y~X)
> fit1


Call:
lm(formula = Y ~ X)

Coefficients:
(Intercept)            X
    16.3333       0.9212

I start by assigning the result of the analysis to the variable fit1. Printing the contents of the analysis shows that the intercept term (the β0) has been estimated to be 16.333 whereas the slope term (R calls this by the variable name you use for it and above we called it β1) is 0.92. So for each increment of X, there is an increase of almost one unit in Y (OK since the points do kinda point upwards). But is this significant? You can have a non-zero estimate for a non-significant relationship. To see a slightly more detailed printout of the components in fit1 use the summary function.

> summary(fit1)

Call:
lm(formula = Y ~ X)

Residuals:
   Min     1Q Median     3Q    Max
-5.097 -4.591  0.600  3.238  6.824

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  16.3333     3.2258   5.063 0.000973 ***
X             0.9212     0.5199   1.772 0.114348
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.722 on 8 degrees of freedom
Multiple R-squared: 0.2819,     Adjusted R-squared: 0.1921
F-statistic:  3.14 on 1 and 8 DF,  p-value: 0.1143

Here we see several components:

1. The formula that was used to call the lm function.
2. A summary of the residuals (the eij terms).
3. The coefficients themselves with standard errors and probabilities.
4. A summary of the test statistic, F, the df, and the probability.

Overall, it does not appear that the regression line is significant. If you are interested in printing out a more standard ANOVA table for this model, you can pass the variable fit1 to the anova function and it will print out the more normal results.

> anova(fit1)
Analysis of Variance Table

Response: Y
          Df  Sum Sq Mean Sq F value Pr(>F)
X          1  70.012  70.012  3.1398 0.1143
Residuals  8 178.388  22.298

This printout is probably more like what you will be putting into your manuscripts. Again, the trend does not seem to be significant.
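The numbers in these printouts can also be extracted programmatically instead of being copied by hand. The sketch below refits the same model; the names coefTable, slopeP, aov1 and r2 are introduced here for illustration.

```r
X <- 1:10
Y <- c(19, 25, 14, 15, 24, 17, 19, 27, 29, 25)
fit1 <- lm(Y ~ X)

# The coefficient table from summary() is an ordinary matrix.
coefTable <- coef(summary(fit1))
slopeP <- coefTable["X", "Pr(>|t|)"]

# The ANOVA table is a data frame; R-squared is SS(model) / SS(total).
aov1 <- anova(fit1)
r2 <- aov1["X", "Sum Sq"] / sum(aov1[["Sum Sq"]])

# With a single predictor, the slope's t test and the F test agree.
all.equal(slopeP, aov1["X", "Pr(>F)"])
all.equal(r2, summary(fit1)$r.squared)
```

This shows where the Multiple R-squared of 0.2819 comes from: 70.012 out of a total sum of squares of 248.4.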

Plotting the Regression Model onto Your Points

It is possible to plot the regression model onto a display of the predictor and response variables. This can sometimes be helpful when visualizing your data. The abline function overlays a line on your current plot. Using the abline function on an existing graph does not require you to call par(new=T) first as it takes care of that already.

> plot(Y~X, xlab="X", ylab="Y", bty="n", col="red", pch=19, xlim=c(0,10), ylim=c(0,30))
> abline(fit1, col="red", lty=2)

Figure 6.2: Regression model added to plot of points using abline function.

In addition to passing a variable that is a regression model (e.g., class(fit1) is "lm"), the function abline can also be called by passing it raw values for the slope and intercept. This means you can add an arbitrary line to any plot you like. As shown above, the function also takes additional parameters that allow you to customize the look of the line. You may want to revisit Table 4.1 as a reminder.
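As a sketch of the raw-value form, abline takes the intercept as a and the slope as b, and it can also draw horizontal (h) and vertical (v) reference lines. The reference lines at the two means are an addition for illustration and are not part of Figure 6.2.

```r
X <- 1:10
Y <- c(19, 25, 14, 15, 24, 17, 19, 27, 29, 25)
fit1 <- lm(Y ~ X)

plot(Y ~ X, xlab = "X", ylab = "Y", bty = "n", col = "red", pch = 19,
     xlim = c(0, 10), ylim = c(0, 30))

# Same fitted line, but passed as a raw intercept (a) and slope (b).
abline(a = coef(fit1)[1], b = coef(fit1)[2], col = "red", lty = 2)

# Arbitrary reference lines: horizontal at mean(Y), vertical at mean(X).
abline(h = mean(Y), lty = 3)
abline(v = mean(X), lty = 3)
```

A least squares line always passes through the point (mean(X), mean(Y)), so the three lines cross at a single point.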

Adding Text To A Graph

While we are customizing this image of our non-significant regression model, it is probably a good time to look at the text() function. This function allows you to add arbitrary text to your plot. The basic call of this function will include the x and y coordinates of where you want to put the text and the character string that you will be putting on the graph. To illustrate how this is done, we will add the regression formula to the plot. You could type out the regression equation yourself and for a one-off image it may be easier for you to do it this way, but if the data are already embedded in the fit1 variable then it is a more versatile approach for you to use. First, we will determine where in the fit1 variable you can find the regression coefficients.

> names(fit1)
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "call"          "terms"         "model"
> fit1$coefficients
(Intercept)           X
 16.3333333   0.9212121
> fit1$coefficients[1]
(Intercept)
   16.33333
> fit1$coefficients[2]
        X
0.9212121

So we can access the values estimated for β0 and β1 using fit1$coefficients[1] and fit1$coefficients[2]. Now we need to make a single string that has the regression equation y = β0 + β1 x. The text parts we can write out, but the variables should come from fit1. To do this, we use the paste function. This function takes a list of items and mushes them together into a single character string. More can be found on the paste function and general string manipulation in Chapter 9.

> formula <- paste("y = ", fit1$coefficients[1], " + ", fit1$coefficients[2], "x")
> formula
[1] "y = 16.3333333333333 + 0.921212121212122 x"
> text(5, 5, formula)

6.2.1 Regression Diagnostics

It is possible to attempt to fit any model to a set of data. For example, your data may not be linear, however it is still possible for you to fit a line to non-linear data. However, just because R will happily (in most cases) provide you an answer to a model fitting, it does not mean that it is the right model for the data. R includes some easy methods that you can use to examine the appropriateness of your model and here we will focus on some of the built-in diagnostics. These focus on the single specified model and allow you to make decisions on the appropriateness of your proposed model. Later we will cover methods that allow you to determine if one model is better than another for describing your data.
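Returning briefly to the label built with paste: the coefficients are printed with far more digits than a figure needs. One way to tidy this, sketched here as an alternative rather than as the method used for the figure, is to format the numbers with sprintf. The name eqLabel is introduced for illustration.

```r
X <- 1:10
Y <- c(19, 25, 14, 15, 24, 17, 19, 27, 29, 25)
fit1 <- lm(Y ~ X)

# Format both coefficients to two decimal places before building the label.
b <- coef(fit1)
eqLabel <- sprintf("y = %.2f + %.2f x", b[1], b[2])
eqLabel

plot(Y ~ X, pch = 19, col = "red", xlim = c(0, 10), ylim = c(0, 30))
abline(fit1, col = "red", lty = 2)
text(5, 5, eqLabel)   # same placement as the text() call above
```

The label now reads "y = 16.33 + 0.92 x", which matches the rounding used in the discussion of the coefficients.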

Figure 6.3: Regression model with fitted line and formula.

One of the first things you should do when you specify a linear model is look at the residuals. The residuals are the eij components of the model in the general formula. These represent the variation that is not explained by your fitted line. The things you are looking for in the residuals are:

1. Normality of the residuals. These values are expected to be N(0, σ2). If they are not, it may not be appropriate to be fitting this model to your data.
2. Non-linearity in the residuals when plotted against the predicted values. This would suggest that perhaps your data are not linear to start with and the fitting of a linear model to it may not be appropriate.
3. Systematic changes in the residuals when plotted as a function of the predicted values. This would indicate that there is something else that is changing the response variable that you are not taking into consideration.
4. Outliers. Do you have any evidence that once you fit your model to the data that there are particular entries that are obviously not part of the trend? There can be many reasons for outliers. First, they may just be an outlier and it is a real observation that should be kept in the model. However, it is also possible that you entered the data point incorrectly into the computer, there was an equipment malfunction, etc. It is always good to check and see if you screwed up.
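The checks in the list above can be sketched directly from the fitted model. The Shapiro-Wilk test and the "more than two residual standard errors" rule for flagging outliers are rules of thumb added here for illustration; they are not prescribed by the text.

```r
X <- 1:10
Y <- c(19, 25, 14, 15, 24, 17, 19, 27, 29, 25)
fit1 <- lm(Y ~ X)

res  <- residuals(fit1)   # the e_ij terms
pred <- fitted(fit1)      # the predicted values

# Normality: Q-Q plot plus a Shapiro-Wilk test (an extra check).
qqnorm(res)
qqline(res)
shapiro.test(res)

# Non-linearity and systematic changes: residuals against predicted values.
plot(pred, res, xlab = "Predicted", ylab = "Residual")
abline(h = 0, lty = 2)

# Outliers: flag residuals more than two residual standard errors from zero.
flagged <- which(abs(res) > 2 * summary(fit1)$sigma)
flagged
```

With only ten observations none of these checks is very powerful, so they are best read as a way of spotting gross problems rather than as formal proof that the model is adequate.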

R provides a series of four plots for you to look at when you plot a variable specified by lm(). You can see these plots by using the command plot(fit1) (or whatever your model variable name is) and R will show you a series of plots examining the distribution of the residuals. These plots are displayed in Figure 6.4. They include a plot of the residuals (eij) as a function of the fitted values (ŷi) to see if there are systematic biases in the model (upper left), a Q-Q plot to examine normality of the residuals (upper right), a scale location plot (lower left), and a leverage plot to look for outliers (lower right). For a more in depth discussion of model verification you should probably consult a text book on regression analysis.

Figure 6.4: A 2x2 matrix plot of some diagnostic tools associated with a linear model.

6.3 Multiple Regression

There are several occasions where we may be interested in how well several predictor variables can explain the variation in a response variable. This is called multiple regression and has a linear model with the form:

.74 . 21.95 -5.27) X1 <− 1:10 X2 <− c ( 0 .00 10.55 4 0.20 0.95 . + βk Xk + e Here you have up to k different predictor variables.09 0. 0. ] 30.00 7.74 [ 1 0 .89 0.19 .40 [ 6 . ] 38.15 7 0.09 .00 8. ] 32.95 8 0. For this example. each of which contributing to the observed value in y.1369 Max 20.74 14. To address this hypothesis. −5. ] 21.X2 ) Y X1 X2 [ 1 . 0.00 6.29 .29 5 0.49 . 0. The null hypothesis for a multiple regression is HO : βi = 0.94 .37 0.40 0.72 [ 4 . 0.27 10 0. X1. .7430 lm( Y ˜ X1 + X2 ). Median −0.3. 8 8 . ∀i and states that all the beta regression terms are zero. 0. ] 14.61 0.68 Y 4. ] −5.8989 3Q 4.09 [ 9 . 14. ] 33.26 1 0.49 32.69 And then we can create a linear model using the notation > f i t 2 <− lm ( Y ˜ X1 + X2 ) > summary( f i t 2 ) Call : lm ( formula = Y ˜ X1 + X2 ) Residuals : Min 1Q −24.19 [ 5 .72 0. ] 48.74 0.88 [ 2 .37 .95 .27 X1 1.95 38.00 5.55 21. 0. 0.26 20. we can use the data shown in Table 6.49 6 0. ] 4.68) cbind ( Y . 0. LINEAR MODELS yi = β0 + β1 X1 + β2 X2 + . 32. 48.61 [ 8 .15 45. 33. we build a linear model and then determine how much of the observed variation can be explained by the model in.61 .29 33. 0. 30.0461 Biological Data Analysis Using R .98 CHAPTER 6.37 [ 7 .95 3 0.00 X2 0.00 4.74 2 0.94 9 0. 2 6 .15 .8394 −2.41 [ 3 .41 0. i 1 2 3 4 5 6 7 8 9 10 These values can be put into R as: > > > > Y <− c ( 4 . 45.40 .00 2.00 9. When approaching a multiple regression.72 .74 .00 3. ] 45. In R we can use the same lm function as for a single predictor regression but this time we need change how we put the function equation into it to accommodate two variables.55. 38.94 48.41 .

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    1.170     12.801   0.091   0.9297
X1             4.460      1.422   3.137   0.0164 *
X2             1.473     16.763   0.088   0.9324
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.85 on 7 degrees of freedom
Multiple R-squared: 0.5857, Adjusted R-squared: 0.4673
F-statistic: 4.948 on 2 and 7 DF, p-value: 0.04578

> anova(fit2)
Analysis of Variance Table

Response: Y
          Df  Sum Sq Mean Sq F value  Pr(>F)
X1         1 1631.66 1631.66  9.8875 0.01628 *
X2         1    1.27    1.27  0.0077 0.93244
Residuals  7 1155.16  165.02
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As we can see, the overall model is significant (see the anova table). The estimates are β0 = 1.17, β1 = 4.46, and β2 = 1.47. However, it appears that the only term that is likely to differ from zero is β1, the coefficient for variable X1, even with the β0 and β2 terms in the model for the intercept and the slope coefficient for the variable X2.

Adding Interactions

Sometimes it is preferable to run models that show the interaction between variables as well as the influence of individual variables. This is appropriate when you have some reason to believe that the combination of predictor variables will influence the response in a non-additive manner. The linear model for this is:

$$y_{ij} = \mu + \beta_1 X_1 + \beta_2 X_2 + \beta_3 (X_1 X_2) + e_{ij}$$

where the β3 coefficient determines the strength of the interaction. If β3 = 0 then there is no interaction.

In R interaction terms are indicated by the colon operator. For example, the full model in our example data with the interaction would be specified as

> fit2 <- lm(Y ~ X1 + X2 + X1:X2)
> summary(fit2)

Call:
lm(formula = Y ~ X1 + X2 + X1:X2)

Residuals:
    Min      1Q  Median      3Q     Max
-22.882  -2.267  -1.007   4.168  22.401

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -8.500     26.951  -0.315    0.763
X1             6.204      4.459   1.391    0.213
X2            16.270     39.803   0.409    0.697
X1:X2         -2.732      6.569  -0.416    0.692

Residual standard error: 13.68 on 6 degrees of freedom
Multiple R-squared: 0.5973, Adjusted R-squared: 0.3959
F-statistic: 2.966 on 3 and 6 DF, p-value: 0.1192

> anova(fit2)
Analysis of Variance Table

Response: Y
          Df  Sum Sq Mean Sq F value  Pr(>F)
X1         1 1631.66 1631.66  8.7194 0.02552 *
X2         1    1.27    1.27  0.0068 0.93692
X1:X2      1   32.37   32.37  0.1730 0.69192
Residuals  6 1122.78  187.13
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There is a shorthand method that indicates that you are interested in having all interactions between predictor variables and that is:

> fit2Alternate <- lm(Y ~ X1*X2)
> summary(fit2Alternate)

Call:
lm(formula = Y ~ X1 * X2)

Residuals:
    Min      1Q  Median      3Q     Max
-22.882  -2.267  -1.007   4.168  22.401

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -8.500     26.951  -0.315    0.763
X1             6.204      4.459   1.391    0.213
X2            16.270     39.803   0.409    0.697
X1:X2         -2.732      6.569  -0.416    0.692

Residual standard error: 13.68 on 6 degrees of freedom
Multiple R-squared: 0.5973, Adjusted R-squared: 0.3959
F-statistic: 2.966 on 3 and 6 DF, p-value: 0.1192

You can see that this gives the exact same result. You should be careful with this notation when you are working with several predictor variables because it will include all the interactions, including the three- and four-way (and higher) ones if you have that many variables. This may or may not be what you are interested in testing.

Models Without Intercept Terms

Sometimes it is of interest to test the fit of a model that does not have an intercept term. Perhaps you have already subtracted the mean of the response variable ($\hat{y} = y - \bar{y}$) and as such no intercept is expected, or, as in the case of our model in the previous section, perhaps the model does not support the addition of an intercept term. At any rate, it is possible to indicate to the lm function that you want to run the analysis without estimating the intercept. The linear model for this would be:

$$y_i = \beta_1 X$$

The formula that you pass to lm is Y ~ X - 1. The -1 added to the formula is the part that tells R to leave out the intercept. Running the data again but only including the variable X1 and the response variable Y, without the intercept term, gives:

> fit3 <- lm(Y ~ X1 - 1)
> summary(fit3)

Call:
lm(formula = Y ~ X1 - 1)

Residuals:
     Min       1Q   Median       3Q      Max
-24.4756  -2.0177   0.1422   4.0652  21.2772

Coefficients:
   Estimate Std. Error t value Pr(>|t|)
X1   4.7314     0.5798    8.16 1.89e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.38 on 9 degrees of freedom
Multiple R-squared: 0.8809, Adjusted R-squared: 0.8677
F-statistic: 66.59 on 1 and 9 DF, p-value: 1.889e-05

> anova(fit3)
Analysis of Variance Table

Response: Y
          Df Sum Sq Mean Sq F value    Pr(>F)
X1         1 8618.7  8618.7  66.587 1.889e-05 ***
Residuals  9 1164.9   129.4
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Overall, this model appears to explain much more of the variation than the full model lm(Y ~ X1 + X2) or the interaction model lm(Y ~ X1*X2). Be careful with that comparison, though: for a model without an intercept, R computes R-squared relative to zero rather than relative to the mean of Y, so the value 0.8809 is not directly comparable with the R-squared values from the models that include an intercept.

6.3.1 Comparing Models

So in the previous subsection we have developed four different models that we have proposed to explain our data. They are, in order of decreasing complexity:

• The full model with the interaction terms, lm(Y ~ X1 + X2 + X1:X2).
• The full model without the interaction terms, lm(Y ~ X1 + X2).
• The partial model with only X1, lm(Y ~ X1).
• The minimal model with only X1 and without an intercept term, lm(Y ~ X1 - 1).

There are several methods that you should use to determine which of these models you would like to consider to be the most appropriate.

1. Look at the overall anova significance. If the overall models are not significant, then there is no use in discussing them. In our examples, the full interaction model was not significant and should be disregarded.

2. Look at the relative R-squared values; just compare the Multiple R-squared values. These indicate the proportion of variation explained by the model and are given by the summary function.

3. Examine the relative significance of each of the terms in the models as is shown by the summary function. This can give some indication of which terms may be important. Our various models suggested that the predictor variable X2 did not help in explaining the variation in the response variable.

4. Use a statistically based method to test the differences between two models such as:

anova
You can use the anova function and pass it two models that have been fit to the same data and it will perform an analysis to see if the additional term(s) are significant. Here is an example using the models having only the variable X1 to see if the addition of the intercept term is significant (fit4 is the model with the intercept, lm(Y ~ X1)).

> anova(fit3, fit4)
Analysis of Variance Table

Model 1: Y ~ X1 - 1
Model 2: Y ~ X1
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1      9 1164.91
2      8 1156.43  1      8.48 0.0587 0.8147

AIC
There are other statistical methods that you can use to see if the additional terms are significant in your model. One of these is the stepwise method using the AIC (Akaike Information Criterion). The AIC statistic will decrease as the estimated predictive power of your model increases, so you want to look for the smallest values of AIC. In R you can do this by passing the largest model to the function step and it will perform the analysis for you. Here is an example using the full model (including the interaction).

> step(fit2)
Start:  AIC=55.21
Y ~ X1 + X2 + X1:X2

        Df Sum of Sq     RSS   AIC
- X1:X2  1     32.37 1155.16 53.49
<none>               1122.78 55.21

Step:  AIC=53.49
Y ~ X1 + X2

       Df Sum of Sq     RSS   AIC
- X2    1      1.27 1156.43 51.51
<none>              1155.16 53.49
- X1    1   1624.25 2779.40 60.27

Step:  AIC=51.51
Y ~ X1

       Df Sum of Sq     RSS   AIC
<none>              1156.43 51.51
- X1    1   1631.66 2788.09 58.31

Call:
lm(formula = Y ~ X1)

Coefficients:
(Intercept)           X1
      1.989        4.447

As you can see, the AIC values decrease until the final model, which contains only the intercept and the X1 term. You should consider a wide range of these methods when attempting to put together a good regression model.
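The comparison steps above can be collected into one short script. The following is a minimal sketch on simulated stand-in data (the numbers are made up and are not the values from Table 6.3); the object names fitInt, fitStar, fitAdd, fitX1, and fitNoInt are illustrative choices, not names used elsewhere in this text.

```r
# Simulated stand-in data (hypothetical; NOT the values from Table 6.3)
set.seed(42)
X1 <- 1:10
X2 <- runif(10)
Y  <- 2 + 4.5 * X1 + rnorm(10, sd = 10)

# Candidate models, from most to least complex
fitInt   <- lm(Y ~ X1 + X2 + X1:X2)  # main effects plus interaction
fitStar  <- lm(Y ~ X1 * X2)          # shorthand for the same model
fitAdd   <- lm(Y ~ X1 + X2)          # additive model, no interaction
fitX1    <- lm(Y ~ X1)               # X1 only, with intercept
fitNoInt <- lm(Y ~ X1 - 1)           # X1 only, intercept removed

# The * shorthand expands to the same terms, so the coefficients agree
sameFit <- isTRUE(all.equal(coef(fitInt), coef(fitStar)))

# Sequential F tests on nested models: does each added term help?
nested <- anova(fitX1, fitAdd, fitInt)

# AIC for every model; smaller values are preferred
aicTable <- AIC(fitInt, fitAdd, fitX1, fitNoInt)

# Backward stepwise selection starting from the largest model
best <- step(fitInt, trace = 0)

print(sameFit)
print(nested)
print(aicTable)
print(formula(best))
```

Note that AIC() and step() report values on different scales (step() drops constants that are the same for every model fit to the same data), so only differences between models should be compared, and only within one of the two functions.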
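6.4 Analysis of Variance

The analysis of variance is a common method for examining the equality of observations that can be partitioned into categorical treatments. In all reality, an ANOVA is simply a regression with categorical predictor variables (e.g., the values of x are not continuous).

6.4.1 1-Way ANOVA

The simplest ANOVA model is one in which a single treatment has been applied and you have collected a single set of observations. The linear model can be presented as:

$$y_{ij} = \mu + \tau_i + e_{ij}$$

where $\tau_i$ is the treatment effect. You can think of this as the deviation from the overall mean that can be attributed to an observation being in a particular treatment. The $e_{ij}$ term is again the error term. The null hypothesis for this model is $H_O$: No Treatment Effects (which is like saying $\tau_{Control} = \tau_{Selective} = \tau_{ClearCut}$).

In Chapter 5, we used the Pinus echinata germination data to illustrate how to perform a Kruskal-Wallis test. At that time, I had suggested that the Kruskal-Wallis test was a rank-based version of an analysis of variance (ANOVA). Here we will use the same data again to demonstrate the parametric equivalent of the Kruskal-Wallis test, the one-way ANOVA.

As a reminder, the data consist of family germination rates for Pinus echinata (perhaps one of the homeliest looking conifers in existence) separated by timber treatment. In the Ozark mountains of Missouri, control, selectively cut, and clear cut treatments were applied to previously continuous forest stands. No P. echinata individuals were removed, so in essence the treatments were modifications of other species around the resident pines. A summary of the germination data is presented in Figure 6.5, showing the average germination rate lowest in the control stands and highest in the stands where heterospecifics were selectively removed from around the target species.

> pineData <- read.table("PineGerminationData.txt", header=T)
> anova1 <- aov(GERM ~ TRT, data=pineData)
> anova1
Call: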

   aov(formula = GERM ~ TRT, data = pineData)

Terms:
                      TRT Residuals
Sum of Squares  0.8717943 2.6520868
Deg. of Freedom         2        50

Residual standard error: 0.2303079
Estimated effects may be unbalanced

> anova(anova1)
Analysis of Variance Table

Response: GERM
          Df  Sum Sq Mean Sq F value    Pr(>F)
TRT        2 0.87179 0.43590   8.218 0.0008207 ***
Residuals 50 2.65209 0.05304
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Figure 6.5: Boxplot of germination percentages for Pinus echinata as a function of treatment. A colored rug was added to the right side to show the actual values within treatments (see rug).

From these results, we can see that there is a treatment effect, and it appears to be highly significant. But in looking at the plot in Figure 6.5, are these results supposed to lead us to believe that all the treatments are significantly different or just some subset of them?

One way to get at this is to look at the 95% confidence intervals for the differences between treatment means and see whether they overlap zero. One way to do this is to use the Tukey Honest Significant Differences (or TukeyHSD) function. This function takes the aov analysis as an argument and prints out the confidence intervals for the differences in the means of the treatments.

> postHoc <- TukeyHSD(anova1)
> postHoc
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = GERM ~ TRT, data = pineData)

$TRT
                diff         lwr         upr     p adj
CTRL-CLR -0.27927536 -0.46389755 -0.09465318 0.0017640
SEL-CLR  -0.04566667 -0.24879523  0.15746190 0.8504882
SEL-CTRL  0.23360870  0.04898651  0.41823088 0.0098768

> plot(postHoc)

Figure 6.6: Confidence intervals for difference in mean germination rates for Pinus echinata families.

The postHoc analysis can also be plotted by calling plot(postHoc), which shows the confidence intervals for the differences between treatment levels (those that overlap zero are not significantly different), as presented in Figure 6.6. These results suggest that the significance in the ANOVA model is due to the differences between the control and the other two treatments and that both of the cutting treatments had essentially the same germination rate (just larger than families in the control stands).
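To make the earlier remark that an ANOVA is simply a regression with categorical predictors concrete, here is a minimal sketch on simulated data. The values and the data frame simData are invented for illustration and are not the pine germination data; only the column names GERM and TRT mirror the example above.

```r
# Simulated germination-like data (hypothetical values, not the pine data set)
set.seed(1)
trt  <- factor(rep(c("CTRL", "SEL", "CLR"), each = 15))
germ <- c(rnorm(15, mean = 0.30, sd = 0.1),
          rnorm(15, mean = 0.55, sd = 0.1),
          rnorm(15, mean = 0.60, sd = 0.1))
simData <- data.frame(GERM = germ, TRT = trt)

fitAov <- aov(GERM ~ TRT, data = simData)  # one-way ANOVA
fitLm  <- lm(GERM ~ TRT, data = simData)   # the same model fit as a regression

# Both routes give the same F statistic for the treatment term
fAov <- summary(fitAov)[[1]][["F value"]][1]
fLm  <- anova(fitLm)[["F value"]][1]

# Pairwise differences with family-wise 95% confidence intervals
postHocSim <- TukeyHSD(fitAov)

# A comparison is significant when its interval excludes zero
differs <- postHocSim$TRT[, "lwr"] > 0 | postHocSim$TRT[, "upr"] < 0

print(c(aov = fAov, lm = fLm))
print(postHocSim)
print(differs)
```

The logical vector differs is just the "does the interval overlap zero" rule from the text written out as code.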

6.5 Useful Functions

The following functions were introduced in this chapter and you will be required to use them for the exercises. To get more information on any of these functions, use the R help system.

• abline(x) Draws a line on the currently active graphics device. You can either specify the intercept and slope or pass this a fitted linear model.
• anova(x) Creates the Analysis of Variance Tables for the models passed in x.
• aov(x) Performs the analysis of variance on the formula in x.
• cbind(x,y,...) Puts the variables x, y, ..., into a single column-bound variable.
• lm(func) Tests the model func using linear least-squares.
• pch Optional parameter for the plot function that will designate the type of symbol plotted using the plot command.
• round(x) Rounds the value of x to the nearest integer.
• runif(n,mn,mx) Returns n random numbers drawn uniformly from the range [mn, mx].
• step(x) Evaluates the terms in the model x for inclusion in the model using the AIC criterion.
• summary(x) Provides a description of x.
• t.test(x) This function performs the t-test for either a single data set and a predicted mean or a paired t-test using two data sets.
• text(x,y,c) Plots the text in c on the graph at the coordinates (x, y).
• TukeyHSD(x) Performs Tukey's Honest Significant Difference post-hoc test on the model in x.

6.6 Exercises

The following exercises are meant to help you understand the items presented in this Chapter.

1. Load the data set Temperature.RData from the file into R (see footnote 2). These data represent the measured brood chamber temperature for a wood-boring beetle. Test the hypothesis HO: Mean temperature is 61°.
2. Load the data set ClutchSizes.csv from the file. Using a paired t-test, test the hypothesis HO: There is no difference in reproductive output between habitat types.
3. Load the data file, SingleRegresssion.csv, from the chapter folder. Fit the regression model, Y ~ X. Is it significant? Show the regression equation and the anova table.
4. Plot the regression model from the previous example and indicate the fitted regression line with a dotted red line in the plot.
5. From the single regression model, add the regression equation to the graph indicating the β coefficients that were estimated.
6. Does a plot of the residuals as a function of the predicted values from the estimated regression model suggest that the model is appropriate?
7. Load the data set MultipleRegression.RData from the file; it will contain a data frame named multReg. Use the variables in this data frame, Y, X1, X2, X3, to fit a multiple regression model. Show the summary and the anova table in your results. What is the predicted regression equation?
8. Fit another model to the multReg data that has all the interaction terms amongst the X predictor variables. Use the anova procedure to see which of these models is more appropriate.
9. Load the data file VarroaCounts.RData; it will be a data frame named BeeData. These data represent counts of the parasite Varroa destructor, a common pest of domesticated honey bees. Test the hypothesis using an analysis of variance that there is no difference in mite counts between the different lines of bees.
10. Perform the TukeyHSD test on the parasite data from the previous question.

Footnote 2: Use the load function.

Chapter 7

Working With Images

In this chapter, you will focus on the following topics:

• Gain a basic understanding of open image formats
• Learn how to import image data into R
• Manipulate image data at the pixel level

7.1 Image Data

There are several different methods that are available to you to import image data into R. As I was writing this document over Winter break and updating it in the fall, rimage, the main image processing library for R, was broken and could cause a few problems when installed. I am sure it will be fixed in the near future and recommend that you look at that library when you next have the need to do some image manipulation because it has a lot of functionality. However, at the present, it is not going to be used. Perhaps when I update this manuscript the next time around, I'll change this section.

The consequence of not having rimage is that importing jpeg, tiff, and bmp image formats is beyond our grasp. Lucky for us, there are a ton of other image formats out there and we can easily convert the image shown in Figure 11.1 into another format and use it just as easily. I think it is also important that you understand the internal workings of images, and for right now these simpler image formats will serve our purposes nicely. Everything you learn here will be easily transferable to those other image formats when you need to deal with them in the future.

7.1.1 PNM Image Format

Images on computers have specific formats in which the color information and other meta data is stored in the file. Some of the formats are relatively easy to use and can be manipulated directly in a text editor. Others are more of a pain, and some are "owned" by some company who has patented the way the information is stored in the file and you have to pay royalties to them to view it. For example, the ubiquitous GIF image format uses an algorithm that was patented and owned by a company, and if you were to write a viewer for it in some countries you would have to pay a royalty to use it. Lame.

The PNM image format (short for portable anymap) is an open format for the exchange of image information. Actually, there are three different formats that fall under the PNM specification as detailed below.

Portable Bitmap Format (PBM)

This format stores bitmap images. A bitmap can be thought of as an image whose pixels are either turned on or off (say black and white). The representation of a PBM file can be given as a simple text file with the extension .pbm. An example text file for a bitmap file that encodes the uppercase letter R would be:

P1
# This is an example bit map file r.pbm
5 8
1 1 1 1 0
1 0 0 0 1
1 0 0 0 1
1 0 0 0 1
1 1 1 1 0
1 0 0 1 0
1 0 0 0 1
1 0 0 0 1

In this file, the first line is a special code that identifies the file type and so tells the computer how many bits per pixel to use. The second line is a comment line that you can put anything you like into (but it has to start with the # character). The third line tells how many columns and rows of data the image has. Note that the first number is the number of columns and the second number is the number of rows, which is the opposite of the rows-then-columns order we use in R for interacting with matrices of data. The rest of the file consists of the actual bit matrix where 1 represents a pixel that is turned on and 0 represents a pixel that is turned off. The image represented in this file is given in Figure 7.1.

You can make this kind of image programmatically by creating the matrix in R and using the image function. Here is an example creating the image of the letter T.

> x <- matrix(0, nrow=8, ncol=5)
> x
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    0    0    0    0
[2,]    0    0    0    0    0
[3,]    0    0    0    0    0
[4,]    0    0    0    0    0
[5,]    0    0    0    0    0
[6,]    0    0    0    0    0
[7,]    0    0    0    0    0
[8,]    0    0    0    0    0
> x[1,] <- 1
> x[,3] <- 1
> x

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    1    1    1    1
[2,]    0    0    1    0    0
[3,]    0    0    1    0    0
[4,]    0    0    1    0    0
[5,]    0    0    1    0    0
[6,]    0    0    1    0    0
[7,]    0    0    1    0    0
[8,]    0    0    1    0    0
> colors <- c("black","grey")
> image(x, col=colors, axes=F)

Figure 7.1: The image represented in the r.pbm file. This image has been scaled up to make it large enough to see it on the page using the program GIMP (www.gimp.org).

Here I created the matrix that had all 0 in it and set the top row and the middle column equal to 1. Then the image function was used to plot it. The image function takes a number of optional arguments and here I have supplied it the colors and the option to not show the axes. Since I have two values in the matrix, a two element vector will be sufficient to handle all the different colors. The image shown in Figure 7.2 shows this matrix. There seems to be a small problem with it in that it is rotated 90° counter-clockwise. This is because the origin of the plot that is created by the image function is in the lower left-hand corner. Conversely, most images that are stored on the computer (like the desktop image in the background) assume that the origin is at the upper left-hand corner of the image. Obviously these two do not mesh well together.

Figure 7.2: A PBM file that was programmatically created in R. The image is rotated because of the default location of the origin.

Portable Graymap Format (PGM)

This format is for graymap images, where the term graymap refers to the lack of color in the image. In terms of complexity, there is slightly more information contained in the data file as each pixel is not either ON or OFF; rather, there is a percentage of ONNESS (is that a word?).

P2
# The PGM file for dog.pgm
24 7
5
0 1 1 1 1 0 0 0 0 0 0 5 5 5 0 0 0 0 0 4 4 4 0 0
0 1 0 0 0 1 0 0 0 0 5 0 0 0 5 0 0 0 4 0 0 0 4 0
0 1 0 0 0 0 1 0 0 5 0 0 0 0 0 5 0 0 4 0 0 0 0 0
0 1 0 0 0 0 1 0 0 5 0 0 0 0 0 5 0 0 4 0 0 0 0 0
0 1 0 0 0 0 1 0 0 5 0 0 0 0 0 5 0 0 4 0 0 4 4 0
0 1 0 0 0 1 0 0 0 0 5 0 0 0 5 0 0 0 4 0 0 0 4 0
0 1 1 1 1 0 0 0 0 0 0 5 5 5 0 0 0 0 0 4 4 4 0 0

The first three lines of the file play the same roles as in the PBM format (the code on the first line is now P2). The fourth line in the file gives the maximum value, representing the most white in the image. In this case, a black pixel will be represented by the number 0, white would be represented by 5, and values in between would be 1/5 increments of whiteness. The number of shades of gray you use in a PGM file is up to you; 255 is the usual maximum, and the Netpbm specification allows values up to 65535. The remaining portion of the file has the actual image represented in a pixel-by-pixel matrix of values. You can see that the majority of the image is black (0) and the letters are varying shades of gray (Figure 7.3). These are easy files to create and you could imagine how you could create a matrix of integers from some analysis and save it as a pgm file and view it directly.

Figure 7.3: The image represented by the dog.pgm file. This image has been scaled up to make it large enough to see it on the page using the program GIMP (www.gimp.org).

Portable Pixmap Format (PPM)

The last file format, PPM, is one that handles pixmaps, which means that you have colored pixels in the image. The file format is identical to that of the PGM with the exception that the code on the first line is P3 and each pixel is described by three numbers. With a maximum value of 255 this represents 24 bits per pixel, 8 of which are for red, 8 for green, and 8 for blue. When we begin looking at manipulating images you will find that you can interact with each color channel independently. An example of the PPM file shown in Figure 7.4 is:

P3
# This image contains an image of my daughter Libbie (from Libbie.ppm).
180 240
255
188
219
253
189
220
252
...

In this file, the pixel values are placed one per line instead of next to each other. Starting at line number 5 with a value of 188, the following 180 x 240 x 3 = 129,600 lines each contain an integer whose value is between 0 and 255 (the maximum value, as given on line 4). The values come in groups of three, red, then green, then blue, for each pixel in turn, so the first pixel is (188, 219, 253) and the second is (189, 220, 252). That gives 43,200 values for each of the three colors.

One drawback to these image formats is that they are not very efficient. For example, the image of my daughter in Figure 7.4 has 129,604 lines of information in it, which on my computer makes it 465K in size. The exact same image saved as a jpeg file is only 25K in size. The compression used to make jpeg, gif, tiff, png, and other compressed file formats is why they are used on the internet. But for our purposes, the lack of compression and the inefficiency in storage size are relatively irrelevant.
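Since the text suggests saving a matrix of integers from an analysis as a pgm file, here is a minimal sketch of how that might look using only base R. The helper names writePGM and readPGM are hypothetical (they are not functions from the pixmap package or any other library), and the sketch assumes a plain-text P2 file with complete lines and no inline comments.

```r
# writePGM and readPGM are hypothetical helper names, not package functions.
writePGM <- function(m, file, maxval = max(m)) {
  header <- c("P2",
              "# written from R",
              paste(ncol(m), nrow(m)),  # width (columns) first, then height (rows)
              as.character(maxval))
  # One image row per line of text; apply() visits the rows in order
  rows <- apply(m, 1, paste, collapse = " ")
  writeLines(c(header, rows), con = file)
}

readPGM <- function(file) {
  lines <- readLines(file)
  lines <- lines[!grepl("^#", lines)]          # drop comment lines
  vals  <- scan(text = paste(lines[-1], collapse = " "), quiet = TRUE)
  width  <- vals[1]
  height <- vals[2]                            # vals[3] is the maximum grey value
  matrix(vals[-(1:3)], nrow = height, ncol = width, byrow = TRUE)
}

# A small 3 x 4 test image with grey levels 0..5
img <- matrix(c(0, 1, 2, 3,
                5, 4, 3, 2,
                0, 0, 5, 5), nrow = 3, byrow = TRUE)
f <- tempfile(fileext = ".pgm")
writePGM(img, f)
back <- readPGM(f)
print(readLines(f))
print(all(back == img))
```

The byrow = TRUE argument in readPGM matters: the file lists pixels row by row, whereas R fills matrices column by column by default.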

Figure 7.4: The image represented in the Libbie.ppm file. This image has been scaled up to make it large enough to see it on the page using the program GIMP (www.gimp.org).

7.2 Loading The Image Into R

OK, now that we have covered the basics of how one kind of image is represented in the data files, it is time to load one into R and see what we have to work with. To load a PNM file, you must first import the pixmap library; then you can use the function read.pnm() to load the file into a local variable and plot it using the plot() function.

> library(pixmap)
> photo <- read.pnm(file="Libbie.ppm")
Read 129600 items
> plot(photo)

The plot() function will open a new image window and show the loaded image.

7.3 Components of A Pixmap

We can learn a little bit more about what kind of data type the variable we call photo is by using the class() function.

> class(photo)
[1] "pixmapRGB"
attr(,"package")
[1] "pixmap"
> names(attributes(photo))
[1] "size"     "cellres"  "bbox"     "bbcent"   "channels" "red"      "green"
[8] "blue"     "class"

This variable is a pixmapRGB class that comes from the pixmap package. The command names(attributes(photo)) tells us the names of the attributes that the variable has.

There are some issues that we should touch on when dealing with classes. A class is a self-contained data structure that has both attributes and data. Classes like this one differ from what we have been using thus far, such as data frames, in that we cannot access their contents using the $ notation. This is because a pixmap is a formal (S4) class whose components are stored in named slots, whereas lists and data frames are simpler objects whose elements are reached with $. To access the slots of such a class we use the @ notation. For example:

> photo@size
[1] 240 180
> photo@channels
[1] "red"   "green" "blue"
> dim(photo@red)
[1] 240 180
> photo@red[1,1]
[1] 0.7372549
> range(photo@red)
[1] 0 1

Here we can get to the size, channels, and red components of the class directly. We can also see that the red channel, which determines the amount of redness in each pixel, has been standardized on the range [0, 1] (the first value, 0.7372549, is 188/255). This is important to know if we are going to manipulate the image directly.

7.4 Image Operations

7.4.1 Extracting Channels

So now we know enough to make some alterations to the image and see what happens. In the next example, I first copy the photo to make three additional photos, named redPhoto, bluePhoto, and greenPhoto. Then for each of the new variables I remove all the data in the other two channels by making each of those channels contain a matrix of zeros the same size as the original matrix.

> redPhoto <- photo
> bluePhoto <- photo
> greenPhoto <- photo
> redPhoto@size
[1] 240 180
> redPhoto@blue <- redPhoto@green <- matrix(0, nrow=240, ncol=180)
> greenPhoto@red <- greenPhoto@blue <- matrix(0, nrow=240, ncol=180)
> bluePhoto@red <- bluePhoto@green <- matrix(0, nrow=240, ncol=180)
> par(mfrow=c(1,4))
> plot(photo)
> plot(redPhoto)
> plot(greenPhoto)
> plot(bluePhoto)

Note that I used the sequential assignment A <- B <- C <- D as a shorthand here. This will assign the value of D to the variable C, then C to B, and then B to A. This is a lazy trick but one that you will probably use as it saves a bit of time and typing.
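The @ notation and the chained assignment used in the listing can both be tried without loading any image. The sketch below defines a toy S4 class; ToyImage and its slots are hypothetical and exist only for this illustration (they are not the classes defined by the pixmap package).

```r
# ToyImage is a hypothetical S4 class used only for illustration;
# it is not one of the classes defined by the pixmap package.
setClass("ToyImage", representation(size = "numeric",
                                    red  = "matrix",
                                    blue = "matrix"))

img <- new("ToyImage",
           size = c(2, 3),
           red  = matrix(0.5, nrow = 2, ncol = 3),
           blue = matrix(0.8, nrow = 2, ncol = 3))

# Slots of a formal class are read and written with @
print(img@size)
print(slotNames(img))

# A list holds named elements instead, and those are reached with $
asList <- list(size = c(2, 3))
print(asList$size)

# Chained assignment works right to left: the zero matrix is assigned to
# the blue slot first, and that same value is then assigned to the red slot
img@red <- img@blue <- matrix(0, nrow = 2, ncol = 3)
print(range(img@red))
print(identical(img@red, img@blue))
```

slotNames() is a convenient way to discover which names can follow the @ for a given object.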

The par(mfrow=c(1,4)) call in the listing makes a 1x4 matrix of plots so that I can plot all four images in the same frame (see ?? for more on how this is done), and in each of the four slots I plot one of the images, yielding a figure similar to what is presented in Figure 7.5.

Figure 7.5: The original image along with ones where only the red, green, and blue channel turned on.

In some cases, it is helpful if you can extract the color information and generalize the image as a greyscale image (as you will in Chapter 11). Here we use the information from each channel, weighted equally, in the creation of the image.

> gphoto <- pixmapGrey(photo@red+photo@blue+photo@green)
> plot(gphoto)
> names(attributes(gphoto))
[1] "size"     "cellres"  "bbox"     "bbcent"   "channels" "grey"     "class"
> gphoto@grey[1,1]
[1] 0.8627451
> range(gphoto@grey)
[1] 0 1

The function pixmapGrey() takes a matrix of data, for which we just use the element-wise sum of the channels in the color photo. You can also see that in the creation of the new grey image, the values were again standardized.

For the moment, let's examine the contents of this grey image and play around with it a bit. For simplicity, I will make a copy of the image first and then perform operations on the copy rather than the original one. Then we will look at the distribution of grey values that make up the image. We can do this by performing operations on the matrix of grey values in the class. The histogram shows that the vast majority of values are towards the light end of the distribution. Let's make the image a bit darker by shifting all the grey values down (to make it more black). To darken this up, we scale these values to be closer to zero by dividing them by 2 and then replot the image to see the result (see results in Figure 7.6).

> darkerGphoto <- gphoto
> darkerGphoto@grey <- darkerGphoto@grey / 2
> par(mfrow=c(1,3))
> plot(gphoto)
> hist(gphoto@grey, xlab="Grey", xlim=c(0,1), main="")
> plot(darkerGphoto)
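The arithmetic behind the greyscale conversion and the darkening step can be checked on plain matrices, without the pixmap package. This is a minimal sketch on a made-up 2 x 2 "photo"; only the first pixel (188, 219, 253) is taken from the PPM listing earlier in the chapter, and the rescale helper is a hypothetical stand-in for the standardization that the range() output hints at, not the pixmap implementation.

```r
# A tiny 2 x 2 "photo" stored as three channel matrices scaled to [0, 1].
# Apart from the first pixel, the values are made up for illustration.
red   <- matrix(c(188, 10, 200, 90), nrow = 2) / 255
green <- matrix(c(219, 20, 100, 90), nrow = 2) / 255
blue  <- matrix(c(253, 30,  50, 90), nrow = 2) / 255

# Equal-weight grey value: the element-wise mean of the three channels
greyVals <- (red + green + blue) / 3

# Halving every value moves the image toward black (0)
darker <- greyVals / 2

# Linear rescaling back onto [0, 1] (hypothetical helper)
rescale <- function(m) (m - min(m)) / (max(m) - min(m))

print(round(greyVals[1, 1], 7))
print(range(darker))
print(range(rescale(darker)))
```

Dividing by 2 halves the range as well as the values, which is why a darkened image never reaches white unless it is rescaled afterwards.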

Figure 7.6: The greyscale translation of the PPM image, a histogram of the grey values, and the image resulting from reducing all the grey values in the image by half.

7.5 Creating Images Programmatically

Images can be made programmatically once you understand how images are represented. There are some helper functions that can help you in creating new images. For the purposes of this section, we will focus on greyscale images and leave the analysis of colored images for you to play with on your own time.

Let's start by making an image where each pixel is randomly assigned a greyscale value. For convenience, I'll make it the same size as the photo named gphoto from 7.4.1.

> randomImageMatrix <- matrix(rnorm(240*180), nrow=240, ncol=180)
> gray <- grey(1:100/100)
> image(randomImageMatrix, col=gray)

Here I use the rnorm() function to create 240 * 180 = 43,200 random numbers in a matrix that has 240 rows and 180 columns. I then use the grey() function to create 100 different shades of grey ranging from nearly black to white at equal intervals. When the image is made, the range of random numbers is used to divide the pixels into the 100 different grey colors (i.e., the image() function scales the values in randomImageMatrix into length(gray) distinct groups for plotting). The result is shown in Figure 7.7.

This image can be manipulated by changing the values in the matrix randomImageMatrix. In the next example, I replace a block of roughly 40x40 pixels near the center (rows 100 to 140, columns 70 to 110) with white (which corresponds to the largest value in randomImageMatrix).

> randomImageMatrix[100:140, 70:110] <- max(randomImageMatrix)
> image(randomImageMatrix, col=gray)

The result is shown in Figure 7.8, resembling a square doughnut (mmmm doughnuts...).
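The way values are divided among the grey shades can be written out by hand. The sketch below bins a small random matrix into equal-width intervals, one per shade; it is an approximation of what image() does by default rather than a copy of its internals, and the names shades, breaks, bin, and pixelColour are illustrative.

```r
set.seed(7)
m <- matrix(rnorm(6 * 4), nrow = 6, ncol = 4)  # small random "image"
shades <- grey(1:100 / 100)                    # 100 greys, dark to white

# Equal-width breaks spanning the range of the data, one bin per shade.
# Treat this as an approximation of how image() assigns colours by
# default, not as its exact implementation.
breaks <- seq(min(m), max(m), length.out = length(shades) + 1)
bin <- cut(as.vector(m), breaks = breaks, include.lowest = TRUE, labels = FALSE)
pixelColour <- matrix(shades[bin], nrow = nrow(m))

# The largest value lands in the last (lightest) bin, the smallest in the first
print(bin[which.max(m)])
print(bin[which.min(m)])
print(pixelColour[which.max(m)])
```

This also explains why setting a block of pixels to max(randomImageMatrix) paints it white: the maximum always falls in the lightest bin.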

Figure 7.7: A random image.

Figure 7.8: A random image with a square doughnut hole in the middle.

7.6 Useful Functions

The following functions were introduced in this chapter and you will be required to use them for the exercises. To get more information on any of these functions, use the R help system.

• cat() This function dumps the passed arguments out to the terminal.
• grey(x) This function returns the grey color associated with the value of x. It is assumed that 0 ≤ x ≤ 1.
• image(x) Can be used to create an image as either grey or colors for the values in the matrix x.
• max(x) Returns the maximum value contained in x.
• rnorm(x) Returns x random numbers from a N(µ, σ).

7.7 Exercises

The following exercises are meant to help you understand the items presented in this Chapter.

1. Create a Portable Bitmap Format file (*.pbm) exactly like the one that is shown for the letter R but make it represent the letter L.
2. Why is Figure 7.2 not right-side-up?
3. Make your L image correct by changing the values of the underlying matrix such that when it is plotted using the image command it is in the correct orientation.
4. What is the purpose of the PX number on the first line of the PNM file formats?
5. Load your own copy of the image Libbie.ppm into R using the read.pnm function as demonstrated in the Chapter. Create three copies of the image and for each copy remove the values in one channel (e.g., make one of the color matrices a zero). Plot these images in a three-paned graphic using the par(mfrow=c(1,3)) option.
6. Replot the randomImageMatrix using a color palette instead of the grey palette shown. (Hint: See ?rainbow for five of the stock palettes available to you.)
7. What is the default palette used in the image plot function?
8. What is the purpose of the optional argument bbox in the pixmapGrey function?
9. Create the greyscale version of the image shown in the leftmost box in Figure 7.6. Can you invert the colors in this image? (Hint: If you can't figure out how to do this, see the footnote at the end of this sentence but only as a last resort.) (footnote 1)
10. Why do you have to use the @ notation to access components of the pixmaps in this chapter?

Footnote 1: Are you sure you want a hint? The grey channel is composed of greyscale values that must be in the [0, 1] interval. Take 1 minus the grey channel to flip the values within the [0, 1] interval.


Chapter 8

Matrix Analysis

Matrices are used in a wide variety of biological studies. In this Chapter I will use the example of stage-classified matrix models to introduce you to how matrix manipulation operates in R. In specific terms for this Chapter, you will focus on the following topics:

• Understand matrix operations in R.
• Create stage-classified matrix models.

R does a wonderful job of working with matrices and is much faster at doing vector and matrix operations directly than looping through matrices of values using a for()-loop (see 11.1 for a complete discussion of looping in R). There are some issues that need to be addressed with respect to basic operations on matrices that, if you haven't had a course on Matrix Algebra, you may not fully appreciate.

8.1 Matrices In R

As shown in 2.4.9, a matrix can be defined as a 2-dimensional object that holds numeric values. In fact, a matrix is a fully recognized data type in R. Matrices can be created by hand using the matrix() function and the elements within them can be accessed using the square bracket notation (e.g., X[i,j]) as:

> X <- matrix(0, nrow=4, ncol=4)
> X[1,2] <- 23
> X[1,4] <- 42
> X
     [,1] [,2] [,3] [,4]
[1,]    0   23    0   42
[2,]    0    0    0    0
[3,]    0    0    0    0
[4,]    0    0    0    0

You can also wrap the as.matrix() function around the read.table() function and read the data from a matrix in a file into a variable directly. For a review of these two functions see 2.4.9 and 3.1. In the online data sets for this chapter, there is a file called ExampleMatrix.csv that was exported from a spreadsheet.

> A <- as.matrix(read.table("ExampleMatrix.csv", header=F, sep="\t"))
> A
[Printed output: a 12 × 12 matrix with columns V1 through V12, wrapped into several blocks of columns.]

There are a few things to notice here:

1. The columns of data that were read in the file did not have a header row so R assigned them the values V1, ..., V12.
2. If there is one value in the matrix that has a decimal portion to it, all the values will be displayed with the same number of decimal places (e.g., compare the matrix X and A from the two listings).
3. R wraps values for matrices so that only a portion of each row can be viewed at a time. This is the default behavior.

8.1.1 Matrix Arithmetic

Matrices have their own special kind of arithmetic that you may not be aware of, so here is a very short course. For the following examples, I will be using the matrices X¹, Y, and Z as defined by the R commands:

> X <- matrix(1:9, nrow=3, byrow=TRUE)
> X
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> Y <- matrix(9:1, nrow=3)
> Y
     [,1] [,2] [,3]
[1,]    9    6    3
[2,]    8    5    2
[3,]    7    4    1
> Z <- matrix(1:12, nrow=4)
> Z
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12

¹ For matrices I will use upper case bold letters for variable names in the text to make it easier to distinguish them from non-matrix variables as you read along. Obviously, this is not possible in R itself but for the text hopefully this will make it easier to follow.

One of the main things you have to pay attention to when dealing with matrices is the number of rows and columns in the matrices. To access the number of rows and columns in a matrix you must use the function dim(). In these example matrices, X and Y are square matrices (i.e., they have the same number of rows and columns) whereas Z is not square as it has 4 rows and 3 columns of data.

Scalar Addition & Subtraction

Matrices may be shifted by the addition or subtraction of a constant scalar value (e.g., 2 + X). Scalar addition and subtraction take the value of the scalar and add it to (or subtract it from) every element in the matrix. In R you can use the normal addition (+) and subtraction (-) operators as demonstrated below.

> X
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> X + 3
     [,1] [,2] [,3]
[1,]    4    5    6
[2,]    7    8    9
[3,]   10   11   12

Matrix Addition & Subtraction

For both addition and subtraction of matrices, the numbers of rows and columns must be identical. If they are, the addition and/or subtraction operation results in the element-wise addition (or subtraction) of the two matrices.

> X+Y
     [,1] [,2] [,3]
[1,]   10    8    6
[2,]   12   10    8
[3,]   14   12   10

But when they are not the same size, R will barf up an error message to you telling you they are not amenable to this operation.

> X+Z
Error in X + Z : non-conformable arrays

Scalar Multiplication

The values within a matrix may be scaled by the multiplication of a scalar value (e.g., 0.5 * X). Scalar multiplication results in every single element in the matrix being multiplied by the scalar value.

> X
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> X * 2
     [,1] [,2] [,3]
[1,]    2    4    6
[2,]    8   10   12
[3,]   14   16   18

Element-wise Multiplication

It is possible to multiply two matrices where what you are wanting is a new matrix that is the element-wise product of each of the original matrices. This is sometimes called the Hadamard product or the Schur product. The result of this operation is a new matrix with the same dimensions as the two original ones. In R this operation is conducted using the regular multiplication character, *, between the two matrices. For example:

> X
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> Y
     [,1] [,2] [,3]
[1,]    9    6    3
[2,]    8    5    2
[3,]    7    4    1
> X * Y
     [,1] [,2] [,3]
[1,]    9   12    9
[2,]   32   25   12
[3,]   49   32    9

Multiplication

Matrix multiplication is slightly more complicated than multiplication among scalars or multiplying a scalar by a matrix. For example, in matrix multiplication, AB ≠ BA in general. Moreover, there are several restrictions to which sets of matrices can be multiplied together. This is because of the way that matrices are multiplied.


For example, consider the operation A = XY where the matrix X has rX rows and cX columns of data and the matrix Y has rY rows and cY columns of data. For this operation to be deﬁned, the number of columns in X, cX , must equal the number of rows in Y (e.g., cX = rY ). If these are not equal, then you cannot perform the multiplication. Moreover, the resulting matrix A will have rX rows and cY columns. This is because the matrix multiplication is conducted as:

$$A_{ij} = \sum_{k=1}^{N} X_{i,k} Y_{k,j}$$

where N = cX = rY is the dimension shared by the two matrices.

Essentially every row of X is multiplied against the corresponding column of Y. In R matrix multiplication uses a unique operator that you probably haven't seen yet. To indicate that you want two matrices to be multiplied (and not the Hadamard product as above) you use the compound operator %*%. That is right, it is a pair of percent signs surrounding the normal multiplication character (a.k.a. the asterisk). Two examples using the matrices X and Y are given below. Notice how XY ≠ YX.

> X
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> Y
     [,1] [,2] [,3]
[1,]    9    6    3
[2,]    8    5    2
[3,]    7    4    1
> X %*% Y
     [,1] [,2] [,3]
[1,]   46   28   10
[2,]  118   73   28
[3,]  190  118   46
> Y %*% X
     [,1] [,2] [,3]
[1,]   54   72   90
[2,]   42   57   72
[3,]   30   42   54
> X %*% I
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> X - (X %*% I)
     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    0    0    0
[3,]    0    0    0
> I %*% X
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

Here both X and Y are square and have the same number of rows and columns (i.e., the simplest case because we don't have to make sure the correct rows and columns match). The identity matrix, I (constructed with diag() in Section 8.1.2), is shown here with its groovy properties. Matrix multiplication by the identity matrix is commutative and will result in the original matrix. It is a kind of matrix version of multiplying a scalar by one.² Here is an example using the matrices X and Z, which have different dimensions.

> Z
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12
> Z %*% X
     [,1] [,2] [,3]
[1,]   84   99  114
[2,]   96  114  132
[3,]  108  129  150
[4,]  120  144  168
> X %*% Z
Error in X %*% Z : non-conformable arguments

In the first case, Z %*% X is defined and provides a result because the number of columns in Z matches the number of rows in X. The reverse of this multiplication, X %*% Z, is undefined and R tells you so.
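The dimension rule above can be checked before multiplying. What follows is only a minimal sketch: the helper name conformable() is made up for illustration and is not part of R. It also recomputes one element of the product from the summation formula.

```r
# Example matrices as defined earlier in the chapter
X <- matrix(1:9, nrow = 3, byrow = TRUE)   # 3 x 3
Z <- matrix(1:12, nrow = 4)                # 4 x 3

# Hypothetical helper (not a built-in): TRUE when A %*% B is defined,
# i.e., when the number of columns in A equals the number of rows in B
conformable <- function(A, B) {
  ncol(A) == nrow(B)
}

conformable(Z, X)   # 3 columns in Z, 3 rows in X
conformable(X, Z)   # 3 columns in X, 4 rows in Z

# The product has nrow(Z) rows and ncol(X) columns
ZX <- Z %*% X
dim(ZX)

# Element [1, 2] from the summation formula: row 1 of Z against column 2 of X
a12 <- sum(Z[1, ] * X[, 2])
a12 == ZX[1, 2]
```

Checking ncol() against nrow() first is one way to avoid the non-conformable error when the matrices come from a file and you are not sure of their sizes.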

8.1.2 Matrix Operations

There are several other operations that can be conducted on matrices that you will probably run across as you begin playing with matrices. Here is a smattering of them.

The Diagonal

It is often necessary to interact with the diagonal of a matrix, defined as the elements in the matrix whose row index is equal to the column index. For example, in a covariance matrix, the diagonal elements are the variance estimates. In R you can get access to the diagonal of a matrix by using the diag() function. Some examples using the diag() function include:

> X
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> diag(X)
[1] 1 5 9
> Z
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12
> diag(Z)
[1]  1  6 11

² There are other matrices that have this property that are not as simple as this one and if you take some multivariate statistics, it will blow your mind how cool they are...

Notice how even for non-square matrices the diagonal is deﬁned. You can also extract and insert particular values for the diagonal as demonstrated below:

> X
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> origDiag <- diag(X)
> origDiag
[1] 1 5 9
> diag(X) <- c(42,23,4)
> X
     [,1] [,2] [,3]
[1,]   42    2    3
[2,]    4   23    6
[3,]    7    8    4
> diag(X) <- origDiag
> X
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

A commonly used matrix that can easily be constructed using the diag() function is the Identity Matrix, whose symbol is I. This matrix has zeros everywhere except on the diagonal, where every element is one:

> I <- matrix(0, nrow=3, ncol=3)
> diag(I) <- 1
> I
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1
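As a small aside, diag() offers a shortcut for the same result: given a single whole number k, it returns the k × k identity matrix. The sketch below compares the two constructions; the name I2 is only used here for illustration.

```r
# Identity matrix built the long way, as in the listing above
I <- matrix(0, nrow = 3, ncol = 3)
diag(I) <- 1

# Shortcut: a single whole number k gives the k x k identity matrix
I2 <- diag(3)

# Both constructions hold the same values
all(I == I2)

# Multiplying by either one returns the original matrix
X <- matrix(1:9, nrow = 3, byrow = TRUE)
all(X %*% I2 == X)
all(I2 %*% X == X)
```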

Finally, there is an operator called the trace of a matrix, typically written as tr(A), which is the sum of the diagonal elements. If A is a variance-covariance matrix as is commonly found in multivariate statistics, then its trace is the overall variance. In R we can find the trace using a combination of the sum() and diag() functions as:

> X
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> sum(diag(X))
[1] 15
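To see the claim about variance-covariance matrices in action, here is a minimal sketch. The helper tr() is a name I made up (it is not a built-in function) and the data are random numbers generated only for illustration.

```r
# Hypothetical helper (not built into R): the trace is the sum of the diagonal
tr <- function(M) sum(diag(M))

X <- matrix(1:9, nrow = 3, byrow = TRUE)
tr(X)   # 1 + 5 + 9

# Made-up data: 10 observations on 3 variables
set.seed(42)
dat <- matrix(rnorm(30), ncol = 3)

# Variance-covariance matrix of the three columns
S <- cov(dat)

# The trace of S matches the sum of the individual column variances
tr(S)
sum(apply(dat, 2, var))
```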

Matrix Determinant

The determinant is a scalar value associated with a square matrix. The calculation of the determinant is somewhat complicated when we get to matrices that have more than two rows and columns and I'll let you go find a linear algebra book to look into it if you so desire. For a 2 × 2 matrix, the determinant, denoted as |A|, is given as:

$$|A| = \begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix} = a_{11}a_{22} - a_{12}a_{21}$$

In R the function det() is used to calculate the determinant of a matrix.

> X <- matrix(c(1,6,3,4), nrow=2)
> X
     [,1] [,2]
[1,]    1    3
[2,]    6    4
> det(X)
[1] -14
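The 2 × 2 formula can be checked against det() directly. This is only a sketch; the variable name byHand is made up for the example.

```r
X <- matrix(c(1, 6, 3, 4), nrow = 2)

# a11 * a22 - a12 * a21, written out by hand
byHand <- X[1, 1] * X[2, 2] - X[1, 2] * X[2, 1]
byHand

# det() agrees, up to floating point rounding
all.equal(byHand, det(X))
```

Note that all.equal() is used rather than == because det() works with floating point numbers and may differ from the hand calculation by a tiny rounding error.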

Matrix Transpose

The transpose of a matrix is an operation that exchanges the row and column indices of the elements. This will change the dimensions of the matrix if it is not square. Notationally, you will see several different ways to represent a transpose such as A′ or Aᵀ. In R the transpose operation is performed with the t() function.

> Z
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12
> t(Z)
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
> t(t(Z))
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12

Notice that the transpose of a transpose is equal to the original variable.

Matrix Inversion

For scalars, the inverse is defined as x⁻¹ = 1/x but for matrices it is slightly more complicated. There are even large groups of matrices that cannot be inverted. One property that prevents inversion is if the matrix is singular (think black hole of mathematics or matrices that have a zero determinant).
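As a quick illustration of both cases, the sketch below uses solve(), the base R function for inverting a square matrix, on one invertible and one singular matrix. The matrices and variable names are made up for the example.

```r
# An invertible matrix: its determinant is -14
A <- matrix(c(1, 6, 3, 4), nrow = 2)
Ainv <- solve(A)      # base R matrix inverse
A %*% Ainv            # the identity matrix, up to rounding

# A singular matrix: the second column is twice the first
S <- matrix(c(1, 2, 2, 4), nrow = 2)
det(S)

# solve() stops with an error for a singular matrix; tryCatch() turns
# that error into NULL so the script keeps running
inverted <- tryCatch(solve(S), error = function(e) NULL)
is.null(inverted)
```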

A common use for matrix inversion is in estimation of regression coefficients by least squares. In 6.2, we used the lm() function to estimate the intercept and slope coefficients. This can be done using matrix algebra and the inversion function ginv() found in the MASS library. A one column matrix of coefficients B is estimated from the formula:

$$B = (X'X)^{-1}X'Y$$

where the Y matrix is the normal matrix of response variables and the X matrix has the first column of all ones (1) for the intercept and the remaining columns as the predictor variables.

> X <- matrix(c(rep(1,10), 1:10), ncol=2)
> X
      [,1] [,2]
 [1,]    1    1
 [2,]    1    2
 [3,]    1    3
 [4,]    1    4
 [5,]    1    5
 [6,]    1    6
 [7,]    1    7
 [8,]    1    8
 [9,]    1    9
[10,]    1   10
> Y <- matrix(c(19,25,14,15,24,17,19,27,29,25))
> Y
      [,1]
 [1,]   19
 [2,]   25
 [3,]   14
 [4,]   15
 [5,]   24
 [6,]   17
 [7,]   19
 [8,]   27
 [9,]   29
[10,]   25
> library(MASS)
> ginv(t(X) %*% X) %*% (t(X) %*% Y)
           [,1]
[1,] 16.3333333
[2,]  0.9212121
> lm(Y ~ c(1:10))

Call:
lm(formula = Y ~ c(1:10))

Coefficients:
(Intercept)      c(1:10)
    16.3333       0.9212

You can see from the comparison that both lm() and the matrix multiplication/inversion method produce the same estimates for the intercept and the slope coefficient. If you were to make Z <- Y - mean(Y) (i.e., standardize it for mean zero) and center the predictor in the same way, you could have the X matrix without the column for the intercept (β0 = 0) and you would get the same estimate for the slope coefficient, β1.

Eigen Decompositions

An eigenvalue/eigenvector decomposition is a "magical property" of matrices that can only be appreciated by some experience in matrix algebra. However, we will be using them in the next section so it seems there is a need to introduce them here. Start by considering the square (k × k) matrix A and the identity matrix (I) in the characteristic equation |A − λI| = 0.

The values of λ that satisfy the characteristic equation are called the eigenvalues of the matrix A. Each eigenvalue has an associated eigenvector such that:

$$Ax = \lambda x$$

where x is a vector (e.g., a matrix with only one column) that is matched to each of the k eigenvalues. The equation above is called the characteristic equation for the right eigenvector and a left eigenvector exists and has the form xA = xλ. From both of these, we need to solve for x. Using the matrix:

> A <- matrix(c(1,6,3,4), nrow=2)
> A
     [,1] [,2]
[1,]    1    3
[2,]    6    4

The eigenvalues for the matrix are given by solving the characteristic formula:

$$\begin{aligned}
0 &= |A - \lambda I| \\
  &= \left| \begin{bmatrix} 1 & 3 \\ 6 & 4 \end{bmatrix} - \lambda \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right| \\
  &= \begin{vmatrix} 1-\lambda & 3 \\ 6 & 4-\lambda \end{vmatrix} \\
  &= (1-\lambda)(4-\lambda) - 18 \\
  &= \lambda^2 - 5\lambda - 14
\end{aligned} \tag{8.1}$$

If we solve for λ we see that possible values are 7 and −2. Starting with the largest eigenvalue, λ1 = 7, we have:

$$\begin{bmatrix} 1 & 3 \\ 6 & 4 \end{bmatrix} \begin{bmatrix} e_1 \\ e_2 \end{bmatrix} = \lambda_1 \begin{bmatrix} e_1 \\ e_2 \end{bmatrix} \tag{8.2}$$

If we multiply these out, we get the following equations:

$$\begin{aligned}
1e_1 + 3e_2 &= 7e_1 \\
6e_1 + 4e_2 &= 7e_2
\end{aligned}$$

And here we have two equations in two variables and can easily solve for the values of e1 and e2 and these values define the eigenvector v1 = [e1, e2] that is linked to the eigenvalue λ1. We can do the same for the second vector (which I will let you play with in those boring weekend hours where you are wishing that you had some really cool math problem to solve).
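The hand calculation above can be double-checked numerically. The sketch below is only an illustration: it finds the roots of λ² − 5λ − 14 with polyroot() and then confirms that one particular solution of the two equations, v1 = [1, 2] (both equations reduce to e2 = 2e1), satisfies Av1 = λ1v1. The variable names are my own.

```r
A <- matrix(c(1, 6, 3, 4), nrow = 2)

# Roots of lambda^2 - 5*lambda - 14. polyroot() takes the coefficients in
# increasing order of power and returns complex numbers, so keep the real part
lambda <- sort(Re(polyroot(c(-14, -5, 1))), decreasing = TRUE)
lambda

# Both equations reduce to e2 = 2 * e1, so [1, 2] is one valid choice
v1 <- c(1, 2)

A %*% v1          # left-hand side:  A v1
lambda[1] * v1    # right-hand side: lambda1 * v1
```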

It is important to point out here that the values for v1 can be scaled. As you look at the 2 equations above we can solve for the components and find that 2e1 = e2. There are a lot of values for e1 and e2 that make this statement true. And if we think about the vector v1 = [e1, e2] as a projection away from the origin a distance of e1 on one axis and e2 on a second orthogonal axis it may make a bit more sense. There are several vectors that will point in a direction that will intersect the point (e1, e2), all of which are the same except for a scaling factor. This is graphically shown in Figure 8.1 with two vectors pointing in the same direction but with different lengths.

Figure 8.1: Image depicting two vectors vred = [4, 2] and vblue = [2, 1] that are projecting in the same direction but have different magnitudes.

The reason I bring this up is that it is common for routines that calculate vectors, such as we are doing here for the eigenvector decomposition, to scale the vectors such that their lengths are set to some normalizing constant such as 1. As a result, if you solve for v1 and then check it below with the eigen() function you may not get the same values but if you were to plot the vectors, the lines away from the origin would be pointing in the same direction.

There are some interesting properties of eigenvalues and eigenvectors.

• If the original matrix is symmetric (actually non-negative semi-definite but who's watching), the original matrix $A = \sum_{i=1}^{k} \lambda_i e_i e_i'$. This is called the spectral decomposition of the matrix A.

• The product of the eigenvalues is equal to the determinant of the original matrix (e.g., $\prod_{i=1}^{k} \lambda_i = |A|$).
• The sum of the eigenvalues is equal to the trace of the matrix (e.g., $\sum_{i=1}^{k} \lambda_i = tr(A)$).
• If it is possible to invert A then the eigenvalues of A⁻¹ will be the inverse of the eigenvalues of A (e.g., they will be λᵢ⁻¹).
• The eigenvectors of A and A⁻¹ are identical.

R has an eigen() function that takes a square matrix and returns the eigenvalues and eigenvectors as a list. Here is an example using our little friend the A matrix we touched on above.

> A
     [,1] [,2]
[1,]    1    3
[2,]    6    4
> rootsOfA <- eigen(A)
> rootsOfA
$values
[1]  7 -2

$vectors
           [,1]       [,2]
[1,] -0.4472136 -0.7071068
[2,] -0.8944272  0.7071068

Barring the possibility that I actually just copied and pasted the results from eigen() into the discussion above on vi = [e1, e2], the answer looks like it should.

8.2 Stage-Classified Matrix Models

Stage-classified matrix models are concerned with understanding the processes that influence the persistence of populations. These models tacitly assume that the continuum of life histories for a species can be partitioned into discrete stages and that a census of individuals in a population can be performed wherein we can tally the number of individuals in each of these discrete stages. Some species lend themselves to stage-classification better than others and the distinctions on how to go about defining stages is best left to another course. Here we are going to introduce the notation of a matrix model in R and then perform some analyses on these models. This Chapter is intended to only whet your appetite a bit on matrix models and for those that are interested, you should seek out another course or at least read a good text such as Caswell (2001).

8.2.1 Transition Matrices & Census Vectors

For the sake of discussion, let's assume that we are working with a plant, Grenus growii, that has the following four different distinct life stages. Moreover, from our vast knowledge of this organism, we have the accompanying information about the way this species proceeds through life stages.

Seed: The seed stage lasts a single time step (i.e., there is no persistent seed bank) and only 50% of the seeds actually germinate, the others are either eaten or rot.

Seedling: The seedling stage is a non-reproductive stage and herbivory removes 20% of the individuals that get into this stage and the remaining individuals become juveniles.

Juvenile: The juvenile stage is the first reproductive stage and on average each juvenile produces 1.3 offspring. Depending upon the habitat the juvenile is located in, half move on to the next stage and a quarter stay as a juvenile. The remaining ones are eaten.

Adult: The final adult stage is where most of the reproduction happens with each individual producing an average of 3.1 offspring. Half of the adults persist in the adult stage from one time step to the next.

A diagram of this fictitious species is shown in Figure 8.2.

Figure 8.2: A graphical depiction of the life history stages in the fictitious plant Grenus growii.

Here each of the spheres in this image represents a stage. The arrows between the stages depict either fertility estimates (labeled fX) when they point back to the seed stage, or transitions (labeled pXY, signifying the probability that an individual proceeds to stage X from stage Y). From the description we have above, we can associate particular parameter values with this life history diagram. In Table 8.1 I show the parameters for each of the variables listed. These parameters can now be put into a transition matrix³, A, that has a particularly strict form.

$$A = \begin{bmatrix}
f_1 & f_2 & f_3 & f_4 \\
p_{21} & p_{22} & p_{23} & p_{24} \\
p_{31} & p_{32} & p_{33} & p_{34} \\
p_{41} & p_{42} & p_{43} & p_{44}
\end{bmatrix} \tag{8.3}$$

³ Actually this is not a transition matrix as it does not sum to 1 rather it is a Leslie matrix but I think I can get away with generalizing the term a bit here.

Table 8.1: Table of life history values separated into (A) fertility estimates (the fX items) and (B) transition probabilities depicting the movement between stages and within stages.

A. Fertility Estimates

Stage       Parameter   Value
Seed        f1          0
Seedling    f2          0
Juvenile    f3          1.3
Adult       f4          3.1

B. Transition Probabilities

Transition             Parameter   Value
Seed → Seedling        p21         0.5
Seedling → Juvenile    p32         0.8
Juvenile → Adult       p43         0.5
Juvenile → Juvenile    p33         0.25
Adult → Adult          p44         0.5

Inserting the observed values into this matrix gives us:

$$A = \begin{bmatrix}
0 & 0 & 1.3 & 3.1 \\
0.5 & 0 & 0 & 0 \\
0 & 0.8 & 0.25 & 0 \\
0 & 0 & 0.5 & 0.5
\end{bmatrix} \tag{8.4}$$

The items in the matrix are partitioned into two components: the top row records the fecundity values, fX, and the second and remaining rows depict the probabilities of transition, pXY. In R we can create this matrix using the following code:

> A <- matrix(0, nrow=4, ncol=4)
> A[1,3] <- 1.3
> A[1,4] <- 3.1
> A[2,1] <- 0.5
> A[3,2] <- 0.8
> A[3,3] <- 0.25
> A[4,3] <- 0.5
> A[4,4] <- 0.5
> A
     [,1] [,2] [,3] [,4]
[1,]  0.0  0.0 1.30  3.1
[2,]  0.5  0.0 0.00  0.0
[3,]  0.0  0.8 0.25  0.0
[4,]  0.0  0.0 0.50  0.5

The entries in this matrix have some rather special properties if we put the values into it as directed.
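The same matrix can also be typed in with a single call to matrix(). The sketch below is just one alternative layout, not the way the matrix has to be built: byrow=TRUE lets the values be written in the same order as Equation 8.4, and the stage labels (my own addition through the dimnames argument) make the rows and columns easier to read.

```r
stages <- c("Seed", "Seedling", "Juvenile", "Adult")

# Values typed row by row, in the same layout as Equation 8.4
A <- matrix(c(0.0, 0.0, 1.30, 3.1,
              0.5, 0.0, 0.00, 0.0,
              0.0, 0.8, 0.25, 0.0,
              0.0, 0.0, 0.50, 0.5),
            nrow = 4, byrow = TRUE,
            dimnames = list(to = stages, from = stages))
A

# p32: the probability of moving to Juvenile from Seedling
A["Juvenile", "Seedling"]

# Dropping the fertility row and summing each column gives the fraction
# of each stage that is still alive after one time step
colSums(A[-1, ])
```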

Intrinsic Growth Rate

The Euler-Lotka integral equation for the instantaneous growth rate, r, is well known to most biologists (...) and has the form:

$$1 = \int_{0}^{\infty} l(x)\, m(x)\, e^{-rx}\, dx$$

where the term l(x) is the fraction of reproductive individuals surviving to x, m(x) is the fertility rate of individuals at x, and r is the growth rate. The r component here is the part that we are interested in looking at because:

$$r = \begin{cases}
< 1 : & \text{Population size decaying exponentially} \\
= 1 : & \text{Stable size through time} \\
> 1 : & \text{Population size increasing exponentially}
\end{cases}$$

We can provide an estimate of r using an eigenvalue decomposition of the transition matrix A. Due to the way the matrix is set up, the largest non-imaginary eigenvalue of the matrix (λ1 as defined in 8.2) is equal to r. So, once the matrix A is entered into R, we can find the growth parameter as:

> A
     [,1] [,2] [,3] [,4]
[1,]  0.0  0.0 1.30  3.1
[2,]  0.5  0.0 0.00  0.0
[3,]  0.0  0.8 0.25  0.0
[4,]  0.0  0.0 0.50  0.5
> eigen(A)
$values
[1]  1.2075472+0.0000000i -0.0067844+0.8194141i -0.0067844-0.8194141i
[4] -0.4439783+0.0000000i

$vectors
[Printed output: a 4 × 4 matrix of complex values. Its first column, the eigenvector associated with λ1, is 0.8603823+0i, 0.3562521+0i, 0.2976372+0i, 0.2103303+0i.]

Here we can see that λ1 is not a complex number (the +0.0000000i part tells us that) even though there are some complex eigenvalues (roots) of this matrix. Moreover, it suggests that the overall behavior of this transition matrix is to increase overall population size with an instantaneous rate of r ≈ 1.2. The particular values of λ will determine the overall long term behavior of the population. Essentially as time increases, t : 0 → ∞, the impact of λ is determined by raising it to higher and higher powers. Figure 8.3 shows the projected impact on population growth rate as a function of time for two values, λred = 0.8 and λblue = 1.2.
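If you only want the growth parameter and not the whole decomposition, the first eigenvalue can be pulled out directly. This is a minimal sketch under two assumptions: that the matrix is entered as in Equation 8.4, and that the real part of the first eigenvalue is the quantity of interest (eigen() returns eigenvalues ordered by decreasing modulus). The names ev and lambda1 are my own.

```r
# Transition matrix from Equation 8.4
A <- matrix(c(0.0, 0.0, 1.30, 3.1,
              0.5, 0.0, 0.00, 0.0,
              0.0, 0.8, 0.25, 0.0,
              0.0, 0.0, 0.50, 0.5),
            nrow = 4, byrow = TRUE)

ev <- eigen(A)

# The first eigenvalue has the largest modulus; Re() drops the
# zero imaginary part that comes along with the complex output
lambda1 <- Re(ev$values[1])
lambda1

# Raising lambda to higher powers shows the long term behavior for a
# declining (0.8) and a growing (1.2) population, as in Figure 8.3
t <- 0:10
round(0.8^t, 3)
round(1.2^t, 3)
```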

Figure 8.3: Effects of the instantaneous growth rate λ as a function of time for both exponential growth (λblue = 1.2) and exponential decay (λred = 0.8).

Stable Stage Distribution

The values in A also contain information on the relative proportion of individuals that will be in each stage class as the population stabilizes into a steady state (either growing, stable, or declining). This information is contained in the eigenvector that is associated with λ1. From the output above we see that:

> ssd <- as.numeric(eigen(A)$vectors[,1])
> ssd
[1] 0.8603823 0.3562521 0.2976372 0.2103303
> sum(ssd)
[1] 1.724602
> ssd <- ssd / sum(ssd)
> ssd
[1] 0.4988875 0.2065706 0.1725831 0.1219587
> sum(ssd)
[1] 1

Here you see that the eigenvectors are scaled to unit size (e.g., t(ei) %*% ei = 1) as mentioned above, which results in a total sum of the vector of sum(ssd) = 1.724602. If we are interested in finding the proportion of the population that is in each stage then we need to standardize the vector so that sum(ssd) = 1 and this is done by dividing every element by the total. As a result, ssd suggests that at equilibrium there should be 49% of the individuals as seeds, 21% as seedlings, 17% as juveniles and 12% as adults. We will return to these numbers and the estimate for r in the next subsection when we iterate the data manually.

Bar Plots

In the previous example, we determined the stable stage distribution to estimate the proportion of the total population that is in each group. Graphically, this material could be depicted as a bar graph and since we haven't covered how to make bar graphs yet, this is as good a time as any. There is an option in the normal plot() function, type="h", that will kind of plot bars of your data to a figure. This is what I used to make Figure 4.2 and at that time it got the job done correctly. However, these are high density lines and not real bar plots; a true bar plot is something that looks a bit different than those lines. R provides the function barplot() that takes a vector of heights and produces a general barplot for you. Without modifications, the function barplot() does not produce a very interesting plot in my opinion. As a result, there are several optional arguments that can be used to create a more informative graphic. They include:

• names.arg a vector of names that you can have placed on the x-axis below the bars.
• width controls the width of the bars.
• space controls the amount of area between the bars with a value of zero having the bars touch and positive numbers equal to that number of bar widths (e.g., space=2 plots a bar and then 2 bar widths before the next bar shows up).
• horiz is a logical flag that will plot the bars horizontally instead of vertically.
• col can be passed as a single color or a vector of colors which are used to color the bars.
• ylim can adjust the limit of the y-axis as in normal plotting routines.
• xlab & ylab Labels for the x- and y-axes.

Using the data from λ1 in the previous section, we can plot the data as (shown in Figure 8.4):

> ssd
[1] 0.4988875 0.2065706 0.1725831 0.1219587
> barplot(ssd)
> barplot(ssd, xlab="Stage", ylab="Proportion of Individuals", ylim=c(0,1),
+ names.arg=c("Seed","Seedling","Juvenile","Adult"),
+ col=c("red","blue","green","yellow"))

Figure 8.4: Examples of two different calls to the plotting function barplot(). The parameters used to create these plots are given in the R code.

The barplot() function can also be used to create stacked graphs (Figure 8.5). To create this example, I used the following code:

> x <- matrix(runif(9), nrow=3)
> x
[Printed output: a 3 × 3 matrix of random values between 0 and 1.]
> barplot(x, xlab="Treatments", ylab="Value", names.arg=c("Control","A","B"),
+ legend=c("Category A","Category B","Category C"))

These stacked plots treat every column of data as a single bar and the order in which the rows are presented is the order in which the stacking occurs. You can standardize the plot so that all bars have the same height by dividing each column by that column's sum, providing a proportional barplot.
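The proportional version mentioned above can be sketched as follows. This is one way to do it, not the only one: sweep() divides each column by its own sum (prop.table(x, 2) would give the same result). The seed, the data, and the name xProp are made up for the example, so your bars will not match Figure 8.5.

```r
# Made-up data: three categories (rows) in three treatments (columns)
set.seed(1)
x <- matrix(runif(9), nrow = 3)

# Divide every column by its own sum so that each bar has height 1
xProp <- sweep(x, 2, colSums(x), "/")
colSums(xProp)

# Same call as before, now on the proportions
barplot(xProp, xlab = "Treatments", ylab = "Proportion",
        names.arg = c("Control", "A", "B"),
        legend.text = c("Category A", "Category B", "Category C"))
```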

Figure 8.5: Example of a stacked bar plot with multiple categories represented in each Treatment.

8.2.2 Projecting Stage Sizes

In this matrix model we have been playing with, the census count of individuals in each of the four stages can be represented by the vector n and in R as a matrix whose dimensions are (4 × 1). Assuming that I start with 12 seeds, 34 seedlings, 21 juveniles, and 12 adults, the vector can be depicted as:

> n <- matrix(c(12,34,21,12))
> n
     [,1]
[1,]   12
[2,]   34
[3,]   21
[4,]   12

Using this notation, we can predict what the number of individuals in the next time slice will be given A and n as:

$$n_{t+1} = A n_t$$

> A
     [,1] [,2] [,3] [,4]
[1,]  0.0  0.0 1.30  3.1
[2,]  0.5  0.0 0.00  0.0
[3,]  0.0  0.8 0.25  0.0
[4,]  0.0  0.0 0.50  0.5
> n
     [,1]
[1,]   12
[2,]   34
[3,]   21
[4,]   12
> A %*% n
      [,1]
[1,] 64.50
[2,]  6.00
[3,] 32.45
[4,] 16.50

So after one generation, we can see that the number of seeds, juveniles, and adults all increased but the number of seedlings decreased. If we look at the next time step, we see that:

$$n_{t+2} = A n_{t+1} = A A n_t = A^2 n_t$$

And in general the vector of stage sizes at any arbitrary time step can be written as:

$$n_t = A^t n_0 \tag{8.5}$$

Let's make a matrix of n values for time 1 → 11 in R and calculate the number of individuals in each stage for each time step. I use 11 here because the matrix starts counting at column 1 which will correspond to our time t = 0 so when t = 10 the column will be 11. Let's also set the first column (our t = 0) equal to the census population size we were using above.

> N <- matrix(0, nrow=4, ncol=11)
> N
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,]    0    0    0    0    0    0    0    0    0     0     0
[2,]    0    0    0    0    0    0    0    0    0     0     0
[3,]    0    0    0    0    0    0    0    0    0     0     0
[4,]    0    0    0    0    0    0    0    0    0     0     0
> N[,1] <- n
> N
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,]   12    0    0    0    0    0    0    0    0     0     0
[2,]   34    0    0    0    0    0    0    0    0     0     0
[3,]   21    0    0    0    0    0    0    0    0     0     0
[4,]   12    0    0    0    0    0    0    0    0     0     0

Now, for time steps 1 → 10 (and in the matrix N columns 2 → 11) we will use Equation 8.5 to calculate the number of individuals in each group.
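For readers who already know a little about loops (they are covered properly in Chapter 11), the whole projection can be sketched in a few lines. This is only an illustration of Equation 8.5 applied one step at a time; the names n0 and totals are my own.

```r
# Transition matrix from Equation 8.4 and the starting census
A <- matrix(c(0.0, 0.0, 1.30, 3.1,
              0.5, 0.0, 0.00, 0.0,
              0.0, 0.8, 0.25, 0.0,
              0.0, 0.0, 0.50, 0.5),
            nrow = 4, byrow = TRUE)
n0 <- c(12, 34, 21, 12)

# Column 1 holds t = 0, so column 11 holds t = 10
N <- matrix(0, nrow = 4, ncol = 11)
N[, 1] <- n0

# Each column is the transition matrix times the previous column
for (t in 1:10) {
  N[, t + 1] <- A %*% N[, t]
}

# Stage sizes after one time step
N[, 2]

# Ratio of total population size in successive time steps; this should
# move toward the dominant eigenvalue as t grows
totals <- colSums(N)
totals[-1] / totals[-11]
```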

OK, here I am going to do something that saves some typing (you can use the up cursor key to repeat the last entry you typed in the R interpreter and I will use this to make my life a bit easier). I have defined the variable t such that it will be used to indicate which column of the matrix to use (the (t+1) part) as well as the exponent to the matrix A. Then I will increment the variable t by one and redo it again and again until I've filled up the columns of N.

> t <- 1
> N[,(t+1)] <- A %*% N[,t]
> N
     [,1]  [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,]   12 64.50    0    0    0    0    0    0    0     0     0
[2,]   34  6.00    0    0    0    0    0    0    0     0     0
[3,]   21 32.45    0    0    0    0    0    0    0     0     0
[4,]   12 16.50    0    0    0    0    0    0    0     0     0
> t <- t + 1
> t
[1] 2
> N[,(t+1)] <- A %*% N[,t]
> N
     [,1]  [,2]    [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,]   12 64.50 93.3350    0    0    0    0    0    0     0     0
[2,]   34  6.00 32.2500    0    0    0    0    0    0     0     0
[3,]   21 32.45 12.9125    0    0    0    0    0    0     0     0
[4,]   12 16.50 24.4750    0    0    0    0    0    0     0     0

In the following code examples, I show that you can use a semicolon (;) to put more than one command on a line. Again, I combine the assignment of counts to the appropriate column of N and then update the counter variable t each time through until all eleven columns are full. In Chapter 11 you will learn how to use a loop to do this much easier but until then using the up cursor key in the R interpreter is good enough.

> N[,(t+1)] <- A %*% N[,t]; t <- t + 1
> N[,(t+1)] <- A %*% N[,t]; t <- t + 1
> N[,(t+1)] <- A %*% N[,t]; t <- t + 1
> N[,(t+1)] <- A %*% N[,t]; t <- t + 1
> N[,(t+1)] <- A %*% N[,t]; t <- t + 1
> N[,(t+1)] <- A %*% N[,t]; t <- t + 1
> N[,(t+1)] <- A %*% N[,t]; t <- t + 1
> N[,(t+1)] <- A %*% N[,t]; t <- t + 1
> N[,(t+1)] <- A %*% N[,t]; t <- t + 1
> N
     [,1]  [,2]    [,3]     [,4]     [,5]      [,6]      [,7]      [,8]
[1,]   12 64.50 93.3350 92.65875 95.68719 131.93725 168.77519 193.20372
[2,]   34  6.00 32.2500 46.66750 46.32937  47.84359  65.96862  84.38759
[3,]   21 32.45 12.9125 29.02813 44.59103  48.21126  50.32769  65.35682
[4,]   12 16.50 24.4750 18.69375 23.86094  34.22598  41.21862  45.77316
          [,9]     [,10]     [,11]
[1,] 226.86065 281.25553 343.80907
[2,]  96.60186 113.43032 140.62776
[3,]  83.84928  98.24381 115.30521
[4,]  55.56499  69.70713  83.97547

So this is a large number of values here so let's plot this out to see what the stages do as we go through 10 time steps. The code used to produce the image in Figure 8.6 is:

> plot(1:11, N[1,], type="l", col="red", ylim=ylim, xlab="", ylab="", axes=F, bty="n", lwd=2)
> par(new=T)
> plot(1:11, N[2,], type="l", col="blue", ylim=ylim, xlab="", ylab="", axes=F, bty="n", lwd=2)

> par(new=T)
> plot(1:11, N[3,], type="l", col="green", ylim=ylim, xlab="", ylab="", axes=F, bty="n", lwd=2)
> par(new=T)
> plot(1:11, N[4,], type="l", col="pink", ylim=ylim, xlab="t",
+      ylab="Number of Individuals", axes=T, bty="n", lwd=2)
> legend(2, 350, c("Seed","Seedling","Juvenile","Adult"),
+        col=c("red","blue","green","pink"), lwd=2, bty="n")

I use the par(new=T) to overlay the lines on a single graph (see 4.1.1 for more on this). I also turn off the labels and axes for the first three plots because if you plot them over and over again, they look too dark on the graphic (think printing the same line on top of itself numerous times). On the last one, I set the labels for the axes and then turn on the axes. Also included is the code I used to add the legend to the image. See ?legend for a complete discussion of the options that you can provide to this function.

Figure 8.6: Size of the four stage classes through time.

We can check some of the values that we estimated directly from A using the eigen decomposition by looking at the numbers in the matrix N. First, the growth rate we estimated from the first eigenvalue λ1 ≈ 1.2 looks pretty close to that estimated from the raw counts.

> eigen(A)$values[1]

[1] 1.207547+0i
> sum(N[,11]) / sum(N[,10])
[1] 1.215202

And the proportion of individuals in each class, which was estimated by standardizing the first eigenvector as $\hat{v}_1 = v_1 / \sum_{i=1}^{4} v_{1i}$, is pretty close to what we see in N (and I throw in the first census so that you don't think I put values in there that were already pretty close).

> N[,1] / sum(N[,1])
[1] 0.1518987 0.4303797 0.2658228 0.1518987
> N[,11] / sum(N[,11])
[1] 0.5028525 0.2056811 0.1686445 0.1228219
> ssd
[1] 0.4988875 0.2065706 0.1725831 0.1219587

If we were to iterate this a bit longer you would see that the "brute force" method of getting the population growth rate and the stable stage distribution converges towards what was estimated. Figure 8.7 shows the mean absolute deviation (MAD) representing the differences between the distribution of individuals in each stage and the predicted stable stage distribution (ssd) we calculated earlier. As you can see, it approaches the expected values pretty quickly. In fact,
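For readers who already know loops (they are covered in Chapter 11), the whole projection and both checks can be sketched in a few lines. This is only an illustration under the values used in this section: the names lambda.counts, lambda1, props, and mad are introduced here, and taking the column means of the absolute differences is my reading of what a "mean absolute deviation" per time step is, not necessarily the exact calculation behind the figure.

```r
# Stage matrix and starting census, re-entered so the block runs on its own
A <- matrix(c(0.0, 0.0, 1.30, 3.1,
              0.5, 0.0, 0.00, 0.0,
              0.0, 0.8, 0.25, 0.0,
              0.0, 0.0, 0.50, 0.5), nrow = 4, byrow = TRUE)
n0 <- c(12, 34, 21, 12)

# Fill the columns of N with a loop instead of the up-cursor key
N <- matrix(0, nrow = 4, ncol = 11)
N[, 1] <- n0
for (t in 1:10) {
  N[, t + 1] <- A %*% N[, t]
}

# Growth rate from the raw counts and from the first eigenvalue
lambda.counts <- sum(N[, 11]) / sum(N[, 10])
e <- eigen(A)
lambda1 <- Re(e$values[1])            # drop the zero imaginary part

# Stable stage distribution: first eigenvector scaled to sum to one
v1  <- Re(e$vectors[, 1])
ssd <- v1 / sum(v1)

# Proportion in each stage per time step, and its mean absolute
# deviation from ssd (an assumed definition of the MAD)
props <- sweep(N, 2, colSums(N), "/")
mad   <- colMeans(abs(props - ssd))
round(mad, 4)
```

Dividing v1 by its own sum also takes care of the arbitrary sign that eigen() may give the eigenvector.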

Figure 8.7: Differences in estimated proportions of individuals in each stage from what was expected through time.

8.3 Useful Functions

The following functions were introduced in this chapter and you will be required to use them for the exercises. To get more information on any of these functions, use the R help system.

• %*% Binary operator to perform matrix multiplication. An example would be X %*% Y.
• as.matrix(x) Coerces the variable x into the data type matrix.
• barplot(x) Creates a barplot of the values in x.
• det(x) Calculates, if possible, the determinant of the matrix in x.
• diag(x) Returns the diagonal (e.g., those entries whose row and column indices are equal) of the matrix in x.
• dim(x) Returns the dimensions of the matrix x (e.g., the number of rows and columns).

• eigen(x) Returns the eigenvalue/eigenvector pairs for the matrix in x as a list. Values are sorted in descending numerical order and vectors are scaled to unit length.
• ginv(x) Attempts to calculate the generalized inverse of x.
• legend(x,y,c) Creates a legend for the plot at the coordinates (x, y) with the entries in c.
• matrix(x) Creates a new instance of the matrix data type of the values in x. You will probably need to specify nrow and ncol to set the proper size for the matrices.
• read.table(x) Reads the file x into memory. See ?read.table for the copious amounts of additional parameters that may be needed as well as Chapter 3.
• t(x) Returns the transpose of the matrix in x (e.g., reverses the row and column indices).
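To see a few of the functions listed above in action, here is a small sketch that applies them to the stage matrix from this chapter. The matrix is re-entered so the block is self-contained; nothing here goes beyond base R.

```r
# Stage matrix from this chapter, entered row by row
A <- matrix(c(0.0, 0.0, 1.30, 3.1,
              0.5, 0.0, 0.00, 0.0,
              0.0, 0.8, 0.25, 0.0,
              0.0, 0.0, 0.50, 0.5), nrow = 4, byrow = TRUE)

dim(A)        # number of rows and columns
diag(A)       # entries whose row and column indices are equal
tA <- t(A)    # transpose: rows become columns
tA[1, 2]      # same entry as A[2, 1]
det(A)        # determinant of A
```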

8.4 Exercises

The following exercises are meant to help you understand the items presented in this Chapter.

1. In considering the instantaneous growth rate r, it was mentioned that λ1 > 0 and this is what you will find in most cases. However, it is possible to get values of λ < 0. For the following values of λ make a graph of t vs. λ^t as shown in Figure 8.3 and describe the behavior of the population if these were the real values of r.
   (a) −1 < λ1 < 0.
   (b) λ1 < −1.
2. Create a matrix of random numbers using the runif() function and make a barplot of the values. What happens when you pass the optional argument beside=T?
3. Standardize the columns of data in the matrix from the previous example so that the sum of each column is equal to 1. Replot this using the function barplot() as done for Figure 8.5 with the beside=F option. How does standardizing each column influence the display of the plot?

Chapter 9

Working With Strings

While the majority of biological data is numeric in nature there are still several important reasons to be able to manipulate character-based information. For example, you may be downloading all the references from an online database such as WebOfScience and want to mine the abstracts for metadata. You may also be interested in working with sequence data which consists of mostly text information. In this relatively short chapter we will learn about how we can work with string data in R and look at a few examples using genetic sequences. In this chapter, you will focus on the following topics:

• Learn how to work with string data to perform tasks such as parsing, searching, and replacement.
• Learn how to access sequence based data and pre-process it for importation into R.
• Learn how to create genetic distance matrices.
• Construct Neighbor-Joining trees and display them in R.

9.1 Parsing Text Data

At a most basic level you need to understand that character data in R is treated as a single token in the same way that integer and numeric data is treated. For example, consider the following code:

> x <- c("bob","mary","johnathan")
> length(x)
[1] 3
> x <- "George Stephen Sr."
> length(x)
[1] 1
> x <- c(1,2,3)
> length(x)
[1] 3
> x <- 3
> length(x)
[1] 1

9.1.1 Finding Lengths of Character Sequences

So R treats a character data type, independent of the length of the items in the variable, as a single entry. Once we understand this then the rest of this Chapter really begins to take shape and make sense. So, if R thinks that everything between a pair of quotes is a single instance of a character data type then how do we figure out how many letters are contained between the quotes? The answer here is the function nchar().

> x <- "George Stephen Sr."
> nchar(x)
[1] 18

Another commonly used function for dealing with strings is the strsplit() function. This function takes the string of characters that you are interested in splitting as well as the character you want to split it on and returns the chunks as a list. This returning-as-a-list behavior is kind of a pain in the butt so at the same time I introduce this function I will also show the unlist() function (footnote 1).

> partsOfName <- unlist(strsplit(x, " "))
> partsOfName
[1] "George"  "Stephen" "Sr."
> nchar(partsOfName)
[1] 6 7 3

Footnote 1: This function takes a list and turns the items in it into a vector which is easier to work with.

Here is another example as to how we may go about cycling through a set of words in a phrase and doing some operation on them. The first sentence from the first chapter of Darwin's The Origin Of Species is, "WHEN we look to the individuals of the same variety or sub-variety of our older cultivated plants and animals, one of the first points which strikes us, is, that they generally differ much more from each other, than do the individuals of any one species or variety in a state of nature." While this is a very interesting sentence, we are going to use it to show you how to break down the sentence into an array of words and then tally the number of times each word is used. We begin by making the sentence all lowercase and without punctuation because the simple matching procedure would consider "When" different than "when" and the strsplit() function will cut up the string on the spaces (that is what I will tell it to do). The phrase is long, so I use paste() to glue the pieces together with single spaces.

> phrase <- paste("when we look to the individuals of the same variety or sub-variety of our older",
+   "cultivated plants and animals one of the first points which strikes us is that they",
+   "generally differ much more from each other than do the individuals of any one species or",
+   "variety in a state of nature")
> wordList <- unlist(strsplit(phrase, " "))
> table(wordList)
wordList
          a         and     animals         any  cultivated      differ
          1           1           1           1           1           1
         do        each       first        from   generally          in
          1           1           1           1           1           1
individuals          is        look        more        much      nature
          2           1           1           1           1           1
         of       older         one          or       other         our
          5           1           2           2           1           1

     plants      points        same     species       state     strikes
          1           1           1           1           1           1
sub-variety        than        that         the        they          to
          1           1           1           4           1           1
         us     variety          we        when       which
          1           2           1           1           1

9.1.2 Extracting Substrings

It is not possible to use the normal subscripting approaches to access the individual characters within strings because R treats the entire sequence of characters between the quotation marks as a single item. However, you can extract internal components of a string by using the substring() function.

> phrase <- "A Goat, that was sitting next to the gentleman in white, shut his eyes and said
+ in a loud voice, 'She ought to know her way to the ticket-office, even if she doesn't know
+ her alphabet!'"
> substring(phrase, 34, 70)
[1] "the gentleman in white, shut his eyes"

The function takes the string to be searched and the starting and ending locations in the string and returns the characters in between. If you do not provide an ending number, it will return all the characters up to the end. This is a shorthand way of saying substring(phrase, x, nchar(phrase)).

> substring(phrase, 98)
[1] "'She ought to know her way to the ticket-office, even if she doesn't know her alphabet!'"

It is also possible to use vector notation in pulling out substrings by passing vectors to the start and end arguments.

> startPositions <- c(34, 3, 58, 172, 67)
> endPositions <- c(36, 6, 61, 174, 70)
> substring(phrase, startPositions, endPositions)
[1] "the"  "Goat" "shut" "her"  "eyes"

9.1.3 Concatenating Strings

Vectors of character data can be concatenated to form a single long string. In R string concatenation is accomplished using the paste() function. This is very helpful in creating labels for graphs that have to include the value of a variable and for times when you need to open a lot of data files that have a predictable file naming scheme.

> stringVector <- substring(phrase, startPositions, endPositions)
> stringVector
[1] "the"  "Goat" "shut" "her"  "eyes"
> paste(stringVector, collapse=" ")
[1] "the Goat shut her eyes"
> paste(stringVector, collapse="|")
[1] "the|Goat|shut|her|eyes"
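The two uses of paste() mentioned above, building file names and building plot labels, can be sketched as follows. The site codes, the file naming scheme, and the value of lambda are all made up for the illustration; the point is the difference between the sep argument (what goes between the pieces being pasted) and the collapse argument (what goes between elements when a vector is squashed into one string).

```r
# Hypothetical site codes and a made-up file naming scheme
sites <- c("A", "B", "C")
fileNames <- paste("site_", sites, ".csv", sep = "")
fileNames

# A label that includes the value of a variable (default sep is a space)
lambda <- 1.2
plotLabel <- paste("Growth rate =", lambda)
plotLabel

# collapse turns the whole vector into a single string
allFiles <- paste(fileNames, collapse = ", ")
allFiles
```

The vector in fileNames could then be handed, one element at a time, to a function such as read.table().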

9.1.4 Matching & Substitution

The final tasks we will look into in this section on string operations are matching and substitutions. There are a lot of times when it is useful to be able to see if a particular set of strings has a specific substring within it. This is the realm of matching and is primarily accomplished by the functions grep() and regexpr(). This last function allows you to use what are called Regular Expressions (RE) to scan through strings. While this is a very powerful method for pattern matching and is something that you should know if you are going to do any extensive work with strings, I am not going to cover it in this Chapter. In fact, it probably needs its own chapter and perhaps in a future version of this text I will include it. For those of you who work with string data on a regular basis, look up the regexpr function and have at it, it will make your life easier. For the rest of us, let's dig into grep for a little light matching exercise. The grep function takes a pattern that you are looking for and a string that you want to look into. A simple example would be:

> x <- "The quick brown fox jumped over the candle stick"
> grep("fox", x)
[1] 1
> any(grep("fox", x))
[1] TRUE
> any(grep("o", x))
[1] TRUE
> any(grep("dog", x))
[1] FALSE

In general, the grep function returns the indices of the elements of x that contain a copy of the pattern, and an empty vector if none of them do. I wrapped the grep function here inside the any() function because it will take either a single argument or a vector of arguments and return a logical value.

It is also possible to substitute values in a string with new items. There are two functions that perform string substitutions, sub and gsub. Both of these functions take at least three arguments:

1. A pattern to match,
2. The string to replace the matched pattern with, and
3. The string to search within.

The sub function replaces the first occurrence of the pattern whereas gsub replaces all of them (the g stands for global).

> x <- "The quick brown fox jumped over the candle stick with all the kings men."
> sub("the","THE",x)
[1] "The quick brown fox jumped over THE candle stick with all the kings men."
> gsub("the","THE",x)
[1] "The quick brown fox jumped over THE candle stick with all THE kings men."
> gsub("the","THE",x,ignore.case=T)
[1] "THE quick brown fox jumped over THE candle stick with all THE kings men."

Both of these functions have optional arguments, the most common one of which is the ignore.case option that allows the searching and replacing to either take into consideration

the case of the letters when matching or not.

9.1.5 Slightly More In Depth Examples: Genetic Sequence Analyses

Genetic sequences are essentially long character strings and R has a few different libraries available to you for the analysis of sequence data. I am not going to get into what a genetic sequence is, if you do not already know about it then you probably should not be calling yourself a biologist. In this section, we will:

1. Briefly discuss how we go about getting DNA sequence data
2. Learn how to align sequences
3. Import aligned sequence data into R
4. Create a distance matrix from the sequences
5. Use R to estimate a Neighbor-Joining tree from the sequence data

Getting DNA Sequence Data

The mother of all sequence repositories that you can access (without actually doing the sequencing yourself) is the NCBI web database located at http://www.ncbi.nlm.nih.gov/ Here you can run database queries based upon taxa, genes, groups, size, or whatever. The basic results of a search are given as an annotation (just below). This annotation has three parts.

1. The meta data in the top section that contains the locus definition, who found it, references and the taxonomy of the organism.
2. The "FEATURES" of the record that describe what is in the sequences (coding and non-coding regions if known), some geographical and taxonomic information that has been standardized (good for data mining and putting on a map) as well as the translation of genetic sequence into amino acids if appropriate.
3. The "ORIGIN" which contains the raw sequence information.

An example of a record is given below

LOCUS       FJ347583                 278 bp    DNA     linear   INV 01-JUL-2009
DEFINITION  Araptus attenuatus haplotype 5 muscle protein 20 (MP20) gene,
            partial sequence.
ACCESSION   FJ347583
VERSION     FJ347583.1  GI:227345175
KEYWORDS    .
SOURCE      Araptus attenuatus
  ORGANISM  Araptus attenuatus
            Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta; Pterygota;
            Neoptera; Endopterygota; Coleoptera; Polyphaga; Cucujiformia;
            Curculionidae; Scolytinae; Araptus.
REFERENCE   1  (bases 1 to 278)
  AUTHORS   Garrick,R.C., Meadows,C.A., Nason,J.D., Cognato,A.I. and Dyer,R.J.
  TITLE     Variable nuclear markers for a Sonoran Desert bark beetle,
            Araptus attenuatus Wood (Curculionidae: Scolytinae), with
            applications to related genera

  JOURNAL   Conserv. Genet. 10 (4), 1177-1179 (2009)
REFERENCE   2  (bases 1 to 278)
  AUTHORS   Garrick,R.C., Meadows,C.A., Nason,J.D., Cognato,A.I. and Dyer,R.J.
  TITLE     Direct Submission
  JOURNAL   Submitted (26-SEP-2008) Department of Biology, Virginia
            Commonwealth University, 1000 West Cary Street, Richmond, VA
            23284, USA
FEATURES             Location/Qualifiers
     source          1..278
                     /organism="Araptus attenuatus"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:634056"
                     /haplotype="5"
     gene            <1..>278
                     /gene="MP20"
                     /note="muscle protein 20; coding region not determined"
ORIGIN
        1 ctaaaatcaa cacttccgga ggacaattta aattcatgga aaacatcaac aagtaagaaa
       61 aaaataattt gacatgtaaa taatgtagag aaaattcata aacattccta ttttttattg
      121 atttgtcaat atttagtttg gaactaaact ctgacaatca attatacagg gtgacaattc
      181 taattacatt tccattcaat gccaactaga aatttcgtga aaaaaaaatt gtttctatgc
      241 caaacatact gttttataag atttaattcc agaaattt
//

Sequence Formats & Aligning Genetic Sequences

The format of the sequence data like this is a bit verbose but very informative. When we work with sequence data we will use an abbreviated file format, the FASTA format. This format is very compact and as a result, it is rather easy to use. In general, FASTA files are simple text files that have blocks of information for each sequence. Each block contains a summary line that must begin with the greater than character (>) and can be anything you like. It is common to put the accession numbers, locus identifier, taxonomy and other information into this line. The lines following the summary line are the raw sequence. If you want to have more than a single taxon in a file, you just put the next taxon block below the previous one and continue. In general they look like this (this is an excerpt from an example data set that you have in the class folder):

>Pinus caribaea var. hondurensis
GGTTCAAGTCCCTCTATCCCCACCCAGGTTCGGTCCCGAAAGGAYTGATCTATCTTCTCCAATTCCATTG
GTTCGAATCCATTCTAATTTCTCGATTCTTTTACCTCGCTATTTTTTTTTTTTCATGAAGAGAAGAAATT
AGAACATGAATCTTTTCATCCATCTTATGACAAGTTGAGTTGATCTGTTAATAAGTTGATCATATGATCA
ATTTATTTTGTGATATATGATCTACATAGAATAGATTAGATCNTTTTTAAATTATTCAATTGCAGTCCAT
TTTTATCATATTAGTGACTTCCAGATCGAAAATAATAAAGATCATTCTAAAAACTAGTAAAAATACCTTT
TTACTTCTTTTTAGTTGACACAAGTTAAAACCCTGTACCAGGATGATCCACAGGGAAGAGCCGGGGATAG
CTCATTTGGTAAACCAAAGGACTGAAAATCCTCGTGTCACCAGTTCAAAT
>Pinus echinata
ACCCAGGTTCGTTCCCGAACGGATTGATCTATCTTCTCCAATTCCATTGGTTCGAATCCATTCTAATTTC
TCGATTCTTTTACCTCGCTATTTTTTTTTTTCATGAAGAGAAGAAATTAGAACATGAATCTTTTCATCCA
TCTTATGACAAGTTGAGTTGATCTGTTAATAAGTTGATCATATGATCAATTTATTTTGTGATATATGATC
TACATAGAATAGATTAGATCATTTTTAAATTATTCAATTGCAGTCCATTTTTATCATATTAGTGACTTCC
AGATCGAAAATAATAAAGATCATTCTAAAAACTAGTAAAAATACCTTTTTACTTCTTTTTAGTTGACACA
AGTTAAAACCCTGTACCAGGATGATCCACAGGGAAGAGCCGGGATAGCTCAGTTGGTAGAGCAGAGGACT
GAAAATC

When conducting analyses of genetic sequence data, it is important that you are confident that all the sequences you have are of homologous portions of the genome. For the

example I used here, I downloaded some genetic sequence data for a handful of conifers in the family Pinaceae from the NCBI website. The sequences I was looking for are from a common inter-genic spacer region between the genes encoding for tRNA-trnL and tRNA-trnF. These sequences were between 390-470 base pairs in length and are in the file named conifers.fasta in the folder for this chapter. I cleaned up the summary lines in this file so it only has the genus and species names rather than all the other stuff. This makes it a bit easier for you in the future when you interact with the data.

Before I played with these sequences, I ran an alignment on them to make sure we were dealing with the matching sequences across taxa. This is not something you want to do by hand and it is much better to let a computer do some of the work for you. There are many ways to do this and I just used the online ClustalW server at http://align.genome.jp to align the sequences for me. This algorithm aligns all the sequences and returns the file in a clustal format. This is another text file but this time all the species have been displayed in blocks with homologous sequence locations in the same text column. An example of this is shown below with gaps (insertions/deletions) indicated as the dash character (-).

Pinus caribaea var. hondurensi  CC-CACCCAGGTTCGGTCCCGAAAGGAYTGATCTATCTTCTCCAATT----
Pinus taeda                     ----ACCCAGGTTCGTTCCCGAACGGATTGATCTATCTTCTCCAATT----
Pinus ponderosa                 ----ACCCAGGTTCGTTCCCGAACGGATTGATCTATCTTCTCCAATT----
Pinus echinata                  ----ACCCAGGTTCGTTCCCGAACGGATTGATCTATCTTCTCCAATT----

This file is also located in the folder for this chapter and is called confiers.aln and this is the file we will be working with.

Getting Aligned Sequences Into R

R does not by default recognize sequence data as anything more elegant than a sequence of characters. As a result, several people have developed libraries for you to use that have a lot of general functionality to them. In this section, I am going to use the library ape. If you do not have this library installed on your machine, see Appendix B for an overview of the process. I am assuming that you currently have the data file in a location that you can reach easily from within R. To load the aligned sequences into R type the following:

> library(ape)
> seqs <- read.dna("confiers.aln", format="clustal")
> class(seqs)
[1] "DNAbin"
> summary(seqs)
23 DNA sequences in binary format stored in a matrix.

All sequences of same length: 526

Labels: Abies alba Abies kawakamii Abies veitchii Abies homolepis Larix potaninii Cedrus atlantica ...

Base composition:
    a     c     g     t
0.310 0.187 0.160 0.343
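Because a FASTA file is just text, the string tools from section 9.1 are enough to take one apart by hand. The sketch below is only an illustration of the format and not a substitute for read.dna(): the two short records are invented, and in practice the lines would come from readLines() on a real file. The names fastaLines, seqStrings, and baseFreq are introduced here.

```r
# Two invented FASTA records, one line of text per element
fastaLines <- c(">Taxon one",
                "ACCCAGGTTC",
                "GTTCCCGAAC",
                ">Taxon two",
                "GGTTCAAGTC")

# Summary lines begin with ">"; every summary line starts a new block
isHeader <- substring(fastaLines, 1, 1) == ">"
block    <- cumsum(isHeader)

# Taxon names are the summary lines without the leading ">"
taxa <- sub(">", "", fastaLines[isHeader])

# Glue the sequence lines of each block into one long string
seqStrings <- as.vector(tapply(fastaLines[!isHeader], block[!isHeader],
                               paste, collapse = ""))
names(seqStrings) <- taxa
nchar(seqStrings)

# Base composition of the first sequence, one character at a time
bases    <- unlist(strsplit(seqStrings[["Taxon one"]], ""))
baseFreq <- table(bases) / length(bases)
baseFreq
```

This is the same split-and-tally idea used for the Darwin sentence, applied to single characters instead of words.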

There are several things that you can do with these aligned sequences. You can look for motifs, examine CG content, etc. I will leave these options for you to play with later in the exercises.

Constructing A Neighbor Joining Tree

To construct a Neighbor Joining (NJ) tree, we first need to create a distance matrix that estimates the distances between pairs of sequences that we have in our file. There are several different kinds of distance metrics that you can use in the calculation of this distance matrix (see ?dist.dna for more information on these). We will use the default value which is Kimura's 2-parameter model called "K80".

> D <- dist.dna(seqs)
> class(D)
[1] "dist"
> summary(D)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.00000 0.07252 0.09310 0.26890 0.15720 1.45700

The function dist.dna() takes as an argument a set of sequences that you have read in (they must be of class DNAbin as shown above) and spits out the distance matrix. The distance matrix, D, is a particular kind of matrix that holds the lower triangle of the pair-wise distance calculations. If you print it out, you will get a whole lot of output as it prints the taxa names for row and column headers. Since D is a general distance matrix, we can look at the values in it. Figure 9.1 shows a histogram of the distance values that have been estimated in D. From this we see that there are several values that are low, meaning that the sequences are very similar to each other, and then there are 2-3 peaks at larger values suggesting some degree of sequence divergence.

To create a NJ tree from these distances, we use the function nj().

> njTree <- nj(D)
> class(njTree)
[1] "phylo"
> summary(njTree)

Phylogenetic tree: njTree

  Number of tips: 23
  Number of nodes: 21
  Branch lengths:
    mean: 0.03838704
    variance: 0.01999758
    distribution summary:
      Min.    1st Qu.     Median    3rd Qu.       Max.
-0.0009736  0.0000000  0.0004898  0.0150700  0.8610000
  No root edge.
  First ten tip labels: Abies alba
                        Abies kawakamii
                        Abies veitchii
                        Abies homolepis
                        Larix potaninii
                        Cedrus atlantica
                        Larix decidua
                        Cedrus deodara

                        Larix laricina
                        Pinus roxburghii
  No node labels.

Figure 9.1: Histogram of distance estimates among all sequences using the "K80" model of substitutions

This function takes a distance matrix and returns a tree that is of the class phylo. We can see that internally the variable njTree has some internal information that may be of interest (e.g., branch lengths, etc) but the real way we can understand it is by looking at a graphic of the tree that is produced. To do this, we use the plot() command and pass it the njTree variable as plot(njTree) (footnote 2).

Footnote 2: You may be surprised by the utility of the plot function as it seems to know how to plot everything. Well in actuality this function is simply a wrapper that takes whatever you pass to it and determines if the class of the object you passed has its own plot command. For the tree, the native command is plot.phylo() and you have to look up that command to see the available options for it.

The topology of the tree (Figure 9.2) is easy to interpret and it is quite obvious where those very large distances shown in Figure 9.1 come from. From this topology we can see that:

1. The Pinus species are generally together forming a polytomy that connects to the

other genera in the family.
2. The Larix, Abies, and Cedrus form generally self contained groups.
3. The most divergent groups are the Picea and Keteleeria samples.

Figure 9.2: Neighbor joining tree based upon the trnL-trnF intergenic spacer sequences and the "K80" model of sequence evolution.

There is quite a bit more that can be done here but I think that is enough to get you on the right track if you are interested in using R for some basic sequence analysis.

9.2 Producing Formatted Output

Often in the use of R there is a need to produce a particular kind of output from an analysis or to display the contents of a particular variable. R does a pretty good job itself, but it has some limitations. For example, you may want to print out a matrix of values but only have 2 decimal places printed for each entry. Or you may want to export a table of values as HTML so that you can copy and paste it into another program.
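As a taste of the first case, printing a matrix with only 2 decimal places, here is a minimal sketch using round(), format(), and sprintf() from base R. The values in the matrix are arbitrary numbers chosen for the illustration.

```r
# An arbitrary 2 x 2 matrix with more digits than we want to show
x <- matrix(c(3.14159, 2.71828, 1.41421, -0.57722), nrow = 2)

# round() keeps the values numeric
rounded <- round(x, 2)
rounded

# format() returns a character matrix; nsmall = 2 forces two decimals
# and every entry is padded to a common width
formatted <- format(rounded, nsmall = 2)
formatted

# sprintf() formats each value on its own but drops the matrix shape
asText <- sprintf("%.2f", x)
asText
```

Note that format() and sprintf() hand back character data, so they are for display only; do the arithmetic first and the formatting last.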

9.2.1 Formatting Strings For Printing

format(x, trim = FALSE, digits = NULL, nsmall = 0,
       justify = c("left", "right", "centre", "none"),
       width = NULL, na.encode = TRUE, scientific = NA,
       big.mark = "", big.interval = 3,
       small.mark = "", small.interval = 5,
       decimal.mark = ".", zero.print = NULL,
       drop0trailing = FALSE, ...)

9.2.2 Formatting Tables

A common kind of output that needs to be moved into another program is tabular data. Tables are common features of statistical analysis and as such you will find it necessary to cut a table out of R and paste it into a document in the same way that graphics can be exported from R to be used in your manuscripts and reports. For these examples, I will just create a matrix of values and add row and column names using the functions rownames and colnames.

> x <- matrix(rnorm(12), nrow=3)
> x
           [,1]      [,2]       [,3]       [,4]
[1,]  0.1678067 0.8856766 -0.3955881  0.7677516
[2,] -1.0302831 0.7392326 -0.8333904 -0.3235135
[3,]  0.4396607 1.7622323 -0.8763023  0.6091688
> colnames(x) <- c("Header A", "Header B", "Header C", "Header D")
> rownames(x) <- c("Row 1", "Row 2", "Row 3")
> x
        Header A  Header B   Header C   Header D
Row 1  0.1678067 0.8856766 -0.3955881  0.7677516
Row 2 -1.0302831 0.7392326 -0.8333904 -0.3235135
Row 3  0.4396607 1.7622323 -0.8763023  0.6091688
> library(xtable)
> theMatrixTable <- xtable(x, caption="Caption For Table", align="l|cccc")

The variable theMatrixTable now is an xtable object. What we do with it at this point depends upon how you want to interact with it.

Getting LaTeX Output

If you print it out as is, it will display the contents in LaTeX, a typesetting language that is used to create very nice looking manuscripts and books (this entire book has been written in it). If you use LaTeX to write your manuscripts then you are set. The listing that follows shows the formatting that results, and Table 9.1 is what it looks like when it is inserted into a LaTeX document.

% latex table generated in R 2.8.0 by xtable 1.5-4 package
% Wed Dec 31 14:22:46 2008
\begin{table}[ht]
\begin{center}
\caption{Caption For Table}
\begin{tabular}{l|cccc}
\hline
 & Header A & Header B & Header C & Header D \\
\hline

ﬁle=”theﬁleName.88 & 0.76 & −0. WORKING WITH STRINGS Table 9. <!−− html t a b l e generated in R 2. programming.61 You can also print the table to a ﬁle by calling the function print(theMatrixTable. The xtable can be exported into a format you can open up in said program by ﬁrst exporting the ﬁle as type="html".61 \\ \ hline \end{ tabular } \end{ center } \end{ t a b l e } CHAPTER 9.03 0.32 0.88 Header D 0. Exporting In HTML for Web or Word A If you do not use LTEXand are a biologist that does a lot of mathematical. see ?print.03 </TD > <TD a l i g n ="center"> 0. That being said there are many people for which a general overpriced and under powered word processor (which shall remain nameless but is buggy and prone to viruses and screwing up your manuscripts.1: Caption For Table Row 1 Row 2 Row 3 Header A 0.83 & −0.89 & −0.xtable for more information.40 & 0.89 </TD > <TD a l i g n ="center"> −0.40 </TD > <TD a l i g n ="center"> 0.8.3.74 </TD > <TD a l i g n ="center"> −0.17 & 0. you know which one I mean) is the best you can expect to master. There are several other options available to you with the print function.77 </TD > </TR> <TR> <TD Row 2 </TD > > <TD a l i g n ="center"> −1.17 </TD > <TD a l i g n ="center"> 0.40 -0.77 \\ Row 2 & −1.type=”html”.32 \\ Row 3 & 0.44 & 1.44 Header B 0.89 0.03 & 0.17 -1.74 & −0.html”) and the table will be saved. An example of the html markup that this function produces is given below and an image of it is presented in Figure 9. To export it as such call the command > print(theMatrixTable.83 </TD > Biological Data Analysis Using R .83 -0.77 -0.76 Header C -0. 
You can then open it up in your favorite word processor and it will turn the html table into a normal table that you can manipulate in your documents.158 Row 1 & 0.74 1.0 by xtable 1.ﬁle=”MyHTMLizedTable.tex”).5−4 package −− > <!−− Wed Dec 31 14:22:51 2008 −− > <TABLE border=1> <CAPTION ALIGN="bottom"> Caption For Table </CAPTION> <TR> <TH > </TH > <TH Header A </TH > > <TH Header B </TH > > <TH Header C </TH > > <TH Header D </TH > > </TR> <TR> <TD Row 1 </TD > > <TD a l i g n ="center"> 0. or scientiﬁc work then you should be.

PLOTTING SPECIAL CHARACTERS <TD a l i g n ="center"> </TR> <TR> <TD Row 3 </TD > > <TD a l i g n ="center"> <TD a l i g n ="center"> <TD a l i g n ="center"> <TD a l i g n ="center"> </TR> </TABLE > −0. Λ. Biological Data Analysis Using R . You can also import tables saved as html into popular word processors and use them as normal table items in the creation of your documents.61 </TD > The HTML above produces a table that when imported into Firefox looks like that presented in Figure 9.3: The html printout of a xtable as interpreted in Firefox. All the characters on your keyboard (assuming that you are using an en US keyboard) are speciﬁed in as single variables in ASCII (ASCII stands for the American Standard Code for Information Interchange). the newline character. superscripts.44 </TD > 1.xtable for more information.3 Plotting Special Characters There are some special characters that you should be aware of when trying to get your data output into a readable format.32 </TD > 159 0.76 </TD > −0. namely the tab character. and then there are all those non-US English characters and hieroglyphs. There are several other options available to you with the print function. Obviously. since the ﬁrst A stands for American.3.3.9.88 </TD > 0. and the bell character. see ?print. Figure 9. R has the nice ability to produce slightly complicated output for the axes of your plots as well as for putting into most graphics you produce. These characters are not necessarily ones that you speciﬁcally type on the keyboard rather they are ones that are available as their own buttons on the keyboard. Your terminal that you are running R from cannot handle these characters but you can get them into plots that you make. 9. and mathematical symbols are easily produced using just a few different functions. Items such as subscripts. Greek and Latin characters (α. 
there are a lot of characters that you see on a computer screen that you cannot type directly on a keyboard such as letters with accents. Ω).
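The tab, newline, and bell characters mentioned above are typed in R strings as backslash escapes. The following is a minimal sketch using only base R; the variable names tab_text and two_lines are made up for illustration and do not come from the text.

```r
# Escape sequences are typed with a backslash but stored as ONE character.
tab_text  <- "Population\tHeight"        # \t is the tab character
two_lines <- "first line\nsecond line"   # \n is the newline character

nchar("\t")        # 1: the tab counts as a single character
nchar(tab_text)    # 17: 10 letters + 1 tab + 6 letters

# cat() interprets the escapes when it writes to the console.
cat(tab_text, "\n")
cat(two_lines, "\n")

# utf8ToInt() shows the ASCII code behind each special character.
utf8ToInt("\a")    # 7  (bell)
utf8ToInt("\t")    # 9  (tab)
utf8ToInt("\n")    # 10 (newline)
```

Note that print() shows the escapes literally (for example "Population\tHeight"), while cat() sends the actual tab or newline to the output.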

The primary way of producing formatted text for graphics output is through the use of the expression function. There are several options that you can pass to the expression function and it is not quite worth listing them all here since you can see them in the R demo itself. The best method for looking at the ability of R to provide nice mathy-like output is to look at its own demo. So, start R and type:

> demo(plotmath)

This command will show you a short number of tables in a figure window that have examples of the different kinds of math plotting that R handles. When R sources the demo script it passes the optional echo=TRUE parameter, so all the commands that are used to produce each table are also shown in the R command interface. This way you can see how each of the cells in the displayed tables is being encoded. An example of some of the copious output is:

> draw.plotmath.cell(expression(bold(x)), i, nr); i <- i + 1
> draw.plotmath.cell(expression(italic(x)), i, nr); i <- i + 1
> draw.plotmath.cell(expression(bolditalic(x)), i, nr); i <- i + 1

The demo script itself defines the function draw.plotmath.cell() so don't worry about that part. The part you should focus on is the expression(bold(x)) part. I will show some of the more common methods in the plot shown in Figure 9.4.

> x <- rnorm(100)
> y <- 23 + 1.4*x + 2*rnorm(100)
> plot(x, y, xlab=expression(chi^2), ylab=expression(X[stuff]), bty="n", col="red")

For both the x- and y-axes, I use the expression function to create labels with subscripts and superscripts. If you like, you can define these values as individual variables prior to plotting to keep the plot command a bit cleaner. However, there is really no difference in the speed at which R would evaluate them. Here is another example:

> xlabel <- expression(bold(x[i]))
> ylabel <- expression(italic(x[i]^2))
> plot(x, y, type="l", lwd=2, col="blue", xlim=c(0,20), bty="n", xlab=xlabel, ylab=ylabel)

Look at the demo(plotmath) output to see the diversity of plotting approaches.
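For characters that cannot be typed directly, such as Greek letters, a label can be built from the symbol's name. The sketch below is illustrative only: the names x_label and y_label, the choice of a chi-square curve with 3 degrees of freedom, and the use of a null graphics device are all assumptions made for the example, not part of the text.

```r
# Build the labels first; alpha, chi, and mu are plotmath symbol names.
x_label <- expression(alpha[i])                     # Greek letter with a subscript
y_label <- expression(paste("Density of ", chi^2))  # plain text mixed with a symbol

pdf(NULL)   # null device: nothing is written to disk or shown on screen
curve(dchisq(x, df = 3), from = 0, to = 20,
      xlab = x_label, ylab = y_label, bty = "n")
text(10, 0.15, expression(mu == 3))   # '==' is drawn as an equals sign
invisible(dev.off())
```

Dropping the pdf(NULL) and dev.off() lines sends the same plot to the usual interactive graphics window.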

Figure 9.4: Example of using the expression function to annotate a graphic.

9.4 Useful Functions

The following functions were introduced in this chapter and you will be required to use them for the exercises. To get more information on any of these functions, use the R help system.

• any(x) Returns TRUE if at least one element of the logical vector x is TRUE; for example, any(x == y) asks whether x has any instance of y in it.
• cat(x) Concatenates the objects in x and dumps them out to the interface.
• expression(x) This function takes the contents of x and turns them into an expression that a plotting function can draw as formatted text.
• format(x) Formats the object x for rigid (some say pretty) printing.
• nchar(x) Returns the number of characters in the string x.
• strsplit(x,c) Splits the string x on the character (or characters) in c.
• substring(x,s,f) Takes the string in x and returns the substring starting at position s and finishing at position f.
• nj(x) Performs neighbor joining on the distance matrix x.
• unlist(x) Takes the list x and returns it as a vector.
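The base R string helpers in the list above can be tried on a short made-up sentence. This is a small sketch, not an example from the chapter; the variable names and the sentence are invented. It also illustrates one assumption worth stating: strsplit() treats its split argument as a regular expression unless fixed=TRUE is given, so a literal "." needs that flag.

```r
sentence <- "Dogwood seeds germinate slowly. Oak seeds are faster."

nchar(sentence)             # number of characters, spaces and periods included
substring(sentence, 1, 7)   # characters 1 through 7

# fixed = TRUE makes ". " a literal period-plus-space rather than a regex.
pieces <- strsplit(sentence, ". ", fixed = TRUE)   # a list, one element per input string
parts  <- unlist(pieces)                           # flatten the list to a character vector

any(parts == "Oak seeds are faster.")   # is that sentence among the pieces?
format(3.14159, digits = 3)             # number formatted as a string
cat(parts, sep = "\n")                  # one piece per line
```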

9.5 Exercises

The following exercises are meant to help you understand the items presented in this Chapter.

1. Use the strsplit function to break apart the raw text of the first four paragraphs of the Chapter entitled Preliminaries into sentences (HINT: use the "." as the character to break apart the string on and you can copy and paste it from the pdf). Then use the grep command to find the sentences that have the word are in them.
2. Use the aligned sequences to create a few different distance matrices by changing the model type that you pass to the function dist.dna(). Do alternate distance models have different densities of values? (Hint: plot a density plot for each distance matrix on the same graph similar to what is shown in Figure 9.1).
3. Do these different distance models produce different tree topologies when using the nj() function? If so, show the trees and describe the differences you see in the trees.
4. Show how you would use the sub command to fix the sentence "Dr. Dyer is a loser". (And when I say "fix" I mean make it say that I am not.)
5. How many characters are in the first paragraph of this Chapter?
6. Do the functions nj(), fastme.bal(), and bionj() produce the same looking topologies? You should read about the functions to see what they are as you probably haven't worked with them yet. Explain.
7. Create a table from the data data <- matrix( rnorm(9), nrow=3 ) and label the rows as c("Richmond", "Petersburg", "Varina") and the columns as c("PPM(A)", "PPM(B)", "PPM(C)"). Use the xtable library to export this table as HTML and then import it into your answers. (This is a very helpful methodology for getting formatted data out of R and into your manuscripts).
8. Create a density plot of the χ2 distribution and make the main label say "χ" using the expression() function (hint: this character is called 'chi').
9. In the previous graph, plot a dotted vertical line to indicate where the mean value of the distribution is and put the "µ" symbol next to it.
10. Create a table of the different words found on the first page of the Chapter entitled Preface in this text.

Part III

Extending R

Chapter 10

Basic Scripts

When I use the term script here, what I am referring to is a set of R commands that you put into a text file and have R evaluate. There are times when you have to do the same thing over and over again, say make graphs of a large number of variables or transform a lot of different data sets using the same algorithm. It seems to be a monumental task to type the same thing into R over and over again. Scripts allow you to put your commands into a text file and have R run them for you. Learning how to write scripts will help you out in the following ways:

1. In general we are all lazy. So in essence, scripts are enablers for our laziness.
2. Keeping your analyses and data sets together is a great way for you to not lose a record of what you have done. At a later date, having a record of how the previous analyses were performed is a huge benefit.
3. If you have more data or another angle at the analysis, you can come back and pick up where you left off.

In this chapter, you will focus on the following topics:

• Learn about basic script writing
• Understand differences between code evaluated from a script and that same code typed into the interactive R command line
• Execute scripts in R

10.1 Writing Scripts

A script is nothing more than a series of commands that R recognizes and evaluates. Within a script, you can define data (Chapter 2), functions (Chapter 12), or other operations. If you put the commands in a script, and later when we get into programming (Chapter 11) and functions (Chapter 12), you can run it over and over again with ease (remember the lazy thing?). It is convenient to have a record of the commands that you use in R to produce output. Here are a few tips that I find helpful when I work with R:

• It is a pretty good idea to keep your data sets and the scripts that you use to analyze these data in the same directory.
• It is also a good idea to keep a separate directory for each set of data and its associated scripts so that it is easy for you to find the right directory. Keeping everything mashed together in a single directory can cause problems with data sets having the same name (e.g., the infamous data.txt).
• Use your descriptive skills in naming your data and scripts such that you know what is contained in the file without looking at it (e.g., perhaps a data set named DogwoodGerminationRates27.csv and the R script as AnalysisOfDogwoodGermination.R; it just makes it easier).
• Always provide labels for each column of data. At some time in the future you will need to look at the data set and figure out what that column of data represents.
• In your scripts, provide a lot of comments. I cannot emphasize this enough. You will leave this class and at some point in the future look back on some script you wrote and want to figure out how it works, and without copious comments you will fail and have a small sense of being a genuine loser. Everything from a hash character (#) to the end of the line is ignored by R and you can use it for adding comments about what the script, program, functions, or variables actually mean.

10.1.1 Knowing Directories

A script must be in text and it must reside in a location where you can tell R it is located. When you start an interactive session in R, it notes the current directory that you are using. This is what is called the cwd or current working directory. If you are starting R from a terminal (in OSX or some Unix variant), then the directory where you started will be the cwd. Now if you are using R from a GUI-ish installation such as on Windows, you have to tell R which directory to use as a starting place. You can change the cwd from the "Change dir..." command in the "File" menu.

10.1.2 The Editor

You can write a script in any basic text editor. For some installations of R, there is a pseudo-GUI associated with it (e.g., Windows) because there is no real command line terminal in the OS. This interface to R often has an integrated editor built into it and if it is there you should probably use it unless you have another editor of choice that you feel more comfortable with.[1] If you do not want to use the supplied editor or do not have one available, you may want to check out TextMate or TextWrangler on OSX, E or Crimson Editor (or the million others that are on this pedestrian platform) on Windows, and for Unix/Linux you can use gedit, Kate, Emacs or vi (n.b., if you learn one of these last two you will never need another editor on any platform). The important thing to look for is an editor that understands R (or S-Plus) and can provide you with syntax highlighting, parenthesis matching, and automatic indentation. These are things that just make your life easier. After all, if you are going to be spending a lot of time in front of your computer, you may as well have tools that help instead of get in the way. Speaking of getting in the way, you should never, under any circumstance, even think of using Word to do any of this.

[1] If you are interested in cultural aspects of programming and programmers (e.g., nerds like myself) fire up a google search for "vi vs. emacs" and sit back and enjoy. There have literally been decades of wars fought over the choice of the real editor. You have been warned.

OK, so open your editor and we will make a very small script that does something entirely useless. There is a data set named ScriptExampleData1.txt in the class folder. In R type the following code and see what happens.

> theData <- read.table("ScriptExampleData1.txt", header=T, sep=",")
> summary(theData)
 Population     Height          Sex
 A:5        Min.   :23.40   Female:5
 B:4        1st Qu.:27.70   Male  :4
            Median :29.70
            Mean   :30.04
            3rd Qu.:32.70
            Max.   :38.20
> range(theData$Height)
[1] 23.4 38.2
> levels(theData$Population)
[1] "A" "B"

It should have loaded theData and provided a summary of it as shown. If not, you are probably not in the correct directory. Change to the right directory and redo. Now, take the same code and put it into your script file. Obviously, you do not want to copy the responses that the R engine had provided to you, just the commands that you typed. Save the script as AnalysisOfScriptData.R (note you must have the .R suffix on the script file). Make sure your script is saved in the same directory as the data file. Congratulations, you have written your first script. In the next section we will evaluate the script and note a few differences.

10.2 Evaluating Scripts

The R engine can load and evaluate scripts relatively easily. Take a look at the documentation for the source() command by typing ?source into R and give it a read. OK, ready? In R type source("AnalysisOfScriptData.R") and see what happens. Nothing. Why is this? The same commands produced lots of output when typed directly into R. The issue is that when you are typing commands into R you are doing so in an interactive mode. You say "do this" and it says "OK." However, when you are executing the contents of a script, it is not entirely clear where output should go: to the screen, another file, or some other place. As a result, if you want to get a response from stuff in a script you need to tell R to print the results. So for example, if you change your script to look like:

theData <- read.table("ScriptExampleData1.txt", header=T, sep=",")
print(summary(theData))
print(range(theData$Height))
print(levels(theData$Population))

and from R source it you'll get:

> source("AnalysisOfScriptData.R")
 Population     Height          Sex
 A:5        Min.   :23.40   Female:5
 B:4        1st Qu.:27.70   Male  :4
            Median :29.70
            Mean   :30.04
            3rd Qu.:32.70
            Max.   :38.20
[1] 23.4 38.2
[1] "A" "B"

Again, notice that here the output was only the response of the commands; the commands themselves were not echoed to the R environment. You can get R to echo each command and then provide the results when it is in a script by adding the optional echo=TRUE option to the source() function as shown in the output below:

> source("AnalysisOfScriptData.R", echo=TRUE)

> theData <- read.table("ScriptExampleData1.txt", header=T, sep=",")

> print(summary(theData))
 Population     Height          Sex
 A:5        Min.   :23.40   Female:5
 B:4        1st Qu.:27.70   Male  :4
            Median :29.70
            Mean   :30.04
            3rd Qu.:32.70
            Max.   :38.20

> print(range(theData$Height))
[1] 23.4 38.2

> print(levels(theData$Population))
[1] "A" "B"

This is helpful if you are debugging a script (e.g., figuring out why it is crashing or giving you the wrong answers).

The variables in a script are available in the main R memory, so if you define a new variable in the script, after the first time you source() it, you will have access to it. This is a very important point, because you can add variables to the main memory of R from a script. However, we are thinking about the future here and we need to make sure that the things that we do in our analyses are reproducible at some point in the future. Relying on variables that are outside our script and are only in memory because we did something before running our scripts will lead to frustration (bet on it!). So, I typically erase all variables from memory at the beginning of each script using the command rm( list=ls() ). This way it is easy to see that the variable x you are working with is the real one and not another x you had used two hours ago.

Again, in a script, things won't be printed out to the R terminal unless you tell it to. And it is relatively appropriate to ask why you are wanting some things printed out as the script is executing.
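The difference between a plain source() and source() with echo=TRUE can be checked without the class data file. The sketch below writes a throwaway three-line script to a temporary file; the file contents and the names script_file, quiet, and echoed are hypothetical, chosen only for this illustration.

```r
# Write a tiny script to a temporary file (contents are made up).
script_file <- tempfile(fileext = ".R")
writeLines(c("heights <- c(23.4, 29.7, 38.2)",
             "range(heights)",           # no print(): silent when sourced
             "print(mean(heights))"),    # explicit print(): always shown
           script_file)

# capture.output() collects whatever would have been written to the console.
quiet  <- capture.output(source(script_file))
echoed <- capture.output(source(script_file, echo = TRUE))

quiet            # only the line produced by print()
length(echoed)   # longer: commands are echoed and range() is shown as well

exists("heights")   # TRUE: variables made by the script stay in memory

unlink(script_file) # remove the temporary file
```

The last check shows why clearing memory at the top of a script can matter: anything a previous run defined is still there on the next run.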

In Chapter 9 there was a more complete discussion of how you can format your data for printing. As you begin writing scripts right now, just focus on writing the routines that you need to use to get an answer and later you can focus on making it look pretty.

10.3 Adding Comments To Your Code

Speaking of looking pretty, you must add comments to your code so that you remember what is going on inside that file. To comment code in R you put a hash character at the beginning of the section that you want to be commented. This will comment the line from that point to the right. Everything to the left of the hash character is considered code that will be evaluated.

x <- 20 # this comment will let the assignment happen

# this is a comment that spans multiple lines and won't
# be evaluated even if it has logical R code in it
# x <- 21

print(x)

Empty lines are also a nice feature to sprinkle through your scripts so that logical partitions can be identified. The R interpreter ignores all commented material and all lines that do not have anything on them, so you are not penalized for having them there.

10.4 Useful Functions

The following functions were introduced in this chapter and you will be required to use them for the exercises. To get more information on any of these functions, use the R help system.

• # Indicates the start of a comment. The R interpreter ignores everything to the right of this symbol.
• cat(x) This function dumps the contents of x to the GUI output as a single entity.
• print(x) Sends the contents of the variable x to the terminal output.
• rm(x) This function removes the variable x (or, if x is a list of variable names, all of them) from memory.
• source(x) This function causes R to look for the script named x and evaluate its contents from start to finish. This works just as if you had typed in the lines of the script, with the exception of how variables are printed out to the terminal.
• summary(x) Provides a summary of the variable x.

10.5 Exercises

The following exercises are meant to help you understand the items presented in this Chapter.

1. How do you remove all variables from memory in the current workspace?
2. What happens when you set the optional argument verbose=TRUE when calling source?
3. Are you lazy?
4. Can R evaluate scripts that are written in Word or Excel?
5. How do you change the current working directory in R?
6. How does the optional argument echo=TRUE change the output of sourcing a script in R?
7. How would you print the summary of a data frame from within a script?
8. What character is used to indicate a comment?
9. How would you comment out several lines of code in a script?
10. Why is it important to comment your code?


Chapter 11

Programming

Programming is the art of making a computer, which does only and exactly what you tell it to do, do the things you want it to do. The language that R uses for programming is derived from S-Plus and will look familiar to anyone who has programmed in another language or seen other programming languages before. In general, the majority of programming in R will be very linear. The programs that I'll help you build will have a start, proceed through a set of operations, perhaps some loops, have some conditional statements, print out some stuff or save it to a file, and then exit. While it is possible to program in an object-orientated fashion (and indeed it is not that bad of an implementation in my opinion), I won't be covering that in this book. In this chapter, you will focus on the following topics:

• Be introduced to some basic programming logic and the corresponding R grammar.
• Develop a detailed pseudocode for a given program.
• In a step-wise fashion, develop and test the program.

If you have never programmed before you need to think about programming as a kind of recipe, a very precise one. You need to think about the problem that you are going to solve by writing a program. And then you need to think about the exact steps that you will need to do to accomplish what you are attempting to do.

In this chapter, we will tackle a rather easy problem as a test case to show off how to construct a very simple program. The problem that we are going to deal with is how to measure canopy light from a hemispheric photo. An example of a hemispheric photo is given in Figure 11.1. This photo was taken by S.B. Weiss from the winter roosting habitat of the monarch butterfly in the Monarch Biosphere Reserve, Mexico. In this image, it is easy to see the amount of canopy closure when taken from the hemispherical lens. What we are going to do in this chapter is determine how much of that image is open sky as a surrogate to measure available light in these forests.

In the next few sections, I will show some basic programming tools that we will use to write this program. Then we will walk through the loading of an image and discuss how we get information from and manipulate image data. Finally we will set out to write the program in a step-wise fashion and finish with the completed program.

Figure 11.1: Hemispherical photograph of winter roosting habitat at Monarch Biosphere Reserve, Mexico. Photo by S.B. Weiss made available by the Creative Commons Attribution 2.5

11.1 Looping

As mentioned in Chapter 2, R is primarily a vector language. The consequence of this is that if you are looking for a language to do fast loops through a data set, R is not it. In fact, Perl or Python would actually be faster to do looping-like algorithms. That being said, there are reasons we occasionally need to use loops in R and here is a general overview. A loop, when referred to in a programming language, is a sequence of statements that are repeated over and over again until some condition is reached. The items inside the loop are typically contained within curly brackets (e.g., {}).

11.1.1 The While

The while looping metric is a good one to use if you have a particular condition which you want to check over and over again and perform some operations as long as the condition is in one state. The while loop has the form while(COND){ <code goes here> }. The COND term in the parentheses is evaluated as a logical statement each time you go through the loop, and the loop will continue as long as COND=TRUE. When COND=FALSE, the loop exits and R starts to evaluate statements after the closing curly bracket. There can be a lot of code between the brackets. The following example loops as long as x < 10 and prints out the value of x each time through the loop.

> x <- 0
> while ( x < 10 ) {
+ x <- x + 1
+ cat( x, " " )
+ }
1 2 3 4 5 6 7 8 9 10

When you start looping here, x = 0 and each time through the loop, the variable x is incremented and printed out on the console.

11.1.2 The For

Another common loop is one that actually focuses on the value of a counting variable (e.g., the index in the loop). What this looping metric does is combine the initialization of the condition variable (a counter) as a numeric value, increment the counter each time through the loop, and exit the loop when some condition on the counter is met. The general form of the for statement is for( COND ){ <code goes here> }. The COND can be one of many different constructs that set up a counting variable. Here are some examples using the variable x.

> for ( i in seq(0,9) ) {
+ cat( i )
+ }
0123456789
> for ( i in 0:9 ) {
+ cat( i )
+ }
0123456789
> x <- seq(0,9)
> for ( i in seq(length(x)) ) {
+ cat( x[i] )
+ }
0123456789
> for ( i in x ) {
+ cat( i )
+ }
0123456789

For the COND, the variable i is used as the counting variable along with the keyword in.

11.2 Conditional Statements

The next tool in your R programming toolbox is the conditional statement. Conditional statements control the flow of logic through a script or program. There are many cases where you would like to run some command or sets of commands if some condition is true. For example,

if( CONDITION ) then RESPONSE
else if( OTHER_CONDITION ) then OTHER_RESPONSE
else FINAL_RESPONSE

Here the logic asks about the state of CONDITION and OTHER_CONDITION. If CONDITION is TRUE then RESPONSE is done and none of the other conditions are evaluated nor are their responses performed. The R interpreter just skips everything until the end of the set of conditionals. If CONDITION is not TRUE but OTHER_CONDITION is, then the only response to be performed is OTHER_RESPONSE. If neither CONDITION nor OTHER_CONDITION is true then FINAL_RESPONSE is performed. Note, only one response is ever performed each time.

In the example below, I set up a vector of boolean (TRUE|FALSE) variables and then loop through them one at a time and see what they are.

> observations <- as.logical( c(TRUE, FALSE, FALSE, TRUE, TRUE) )
> observations
[1]  TRUE FALSE FALSE  TRUE  TRUE
> for ( obs in observations )
+ print( obs )
[1] TRUE
[1] FALSE
[1] FALSE
[1] TRUE
[1] TRUE
> for ( obs in observations ) {
+ if ( obs == TRUE )
+ cat( obs, "it is true \n" )
+ else
+ cat( "not\n" )
+ }
TRUE it is true
not
not
TRUE it is true
TRUE it is true

We can also use conditional operators as a CONDITION in an if statement. In the example below, we cycle through the numbers 1 through 10. And for each of them, we determine if they are odd or even using the modulus operator %%. This operator returns the remainder after a division.

> for ( i in 1:10 ) {
+ if ( i %% 2 )
+ cat( i, " is odd\n" )
+ else
+ cat( i, " is even\n" )
+ }
1 is odd
2 is even
3 is odd
4 is even
5 is odd
6 is even
7 is odd
8 is even
9 is odd
10 is even

Each time through, the remainder of i %% 2 is evaluated. Possible values for this are 1 and 0 which, when evaluated by as.logical(), turn out to be either TRUE or FALSE, printing the appropriate message.

11.2.1 Bracketing

There is a little bit of bracket magic going on here and I should take the time to make a few comments. Notice in the previous listing, there were brackets {} surrounding the content inside the for loop. These brackets are essential because there is more than one line of code inside the for loop. If there were only one line (see the earlier code listing where print(obs) is the only code inside the for loop) then the enclosing brackets are optional. As a general rule, after any conditional (e.g., the if/else if/else) or loop (e.g., while/for), if there is only one line of code then you do not need to use brackets if you do not want to. Examples include:

> if ( rnorm(1) > 0.5 )
+ print( "greater" )
> while ( TRUE )
+ print( "this will last forever" )

This rule is recursive in that the "one line of code" is any line that is not a conditional or a loop. In the next example, I loop through the numbers 1-10 and look for those even numbers that are not divisible by 4 (n.b., I could have used a compound conditional statement such as if( !(i%%2) && (i%%4) ) but that would have really screwed up my example).

> for ( i in 1:10 )
+ if ( !(i %% 2) )
+ if ( i %% 4 )
+ cat( "the value=", i, "\n" )
the value= 2
the value= 6
the value= 10

In some sense, you can think of these kinds of "one-liners" as just extensions of one-offs. However, where you want more than one statement to be executed after a loop or conditional statement then you must use brackets. There is nothing wrong with using brackets even in the one-line cases. In fact, it may open up your code a bit and make it a bit easier to read in the future. You just do not have to use them.
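The bracketing rule above has a practical consequence that is easy to check: without brackets only the first statement belongs to the loop, no matter how the code is indented. The following is a minimal sketch with made-up counter names (inside, after, with_braces).

```r
inside <- 0
after  <- 0

# No brackets: only the first statement is the body of the loop.
for (i in 1:3)
  inside <- inside + 1
  after  <- after + 1    # misleading indentation: runs once, after the loop

inside   # 3
after    # 1

# With brackets: both statements run on every pass.
with_braces <- 0
for (i in 1:3) {
  with_braces <- with_braces + 1
  with_braces <- with_braces + 1
}

with_braces   # 6
```

A related point when working at the top level of a script: an else that starts its own line must sit inside brackets (as it does inside the for loops in the listings above), because otherwise R considers the if statement finished before it reaches the else.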

11. when I write programs I tend to think of them not as a single large program but as a series of smaller steps. Typically. If you haven’t already done so. ﬁrst things ﬁrst.4. we need to get out a sheet of paper and write down. on the surface. 11. So to begin with. Using the outline in the previous section. we examined how to load images into memory. how the program is going to work.pnm( f i l e ="Hemiphoto monarch habitat1. > l i b r a r y ( pixmap ) > img <− read . and get into their knickers. Load image into memory 2.3 Outlining A Program The most difﬁcult part of programming is understanding where to start. exactly.gimp. I ﬁnd it helpful to work on the R command line to test out particular sets of commands and when I have it exactly like I like it then I move it to a script. appears to be a daunting task in intself.4 Creating A Program It is often necessary to incrementally build a program.org). although you could use any image manipulation program and there are several free ones available for you on the internets. Print out the proportion of canopy that is open.1 Step 1: Loading An Image Into Memory In Chapter 7. Next. It is important that we include all the steps necessary and in the order in which they are to be performed. so to speak.180 CHAPTER 11. PROGRAMMING 11. translate them into various formats. The key to doing this is to understand the sequence of steps that we need to accomplish so that the program can do what is required.1). Determine total area of image 4. So. we can open a new ﬁle and create a script that does each of these items in succession. However. The PPM ﬁle is what you have access to in the class folder for Chapter 11. So.ppm" ) Read 637563 items Biological Data Analysis Using R . Writing a program. I will begin by turning it into a PPM formatted image as discussed in Chapter 7 using the program GIMP (http://www. I recommend that you look at Chapter 7 to refresh yourself on how we work with the internals of an image. 
the image as I retrieved it from Wikipedia is a JPEG image. For this Chapter we will be working on developing a program that calculates the amount of canopy openness from a hemispheric image (Figure 11. Determine what parts of image are ”open canopy” 3. An example of this would be: 1. each of these steps is a relatively easy one by itself and we will create the overall program by breaking it up into manageable parts. State what you want the program to do in speciﬁc terms.

> plot( img )

Now we have the image loaded and a plot that is identical to that displayed in Figure 11.1, and we must figure out how it is represented.

11.4.2 Step 2: What Is "Open Canopy"

The variable img has the following components and here we need to figure out what parts of the image are the sky parts.

> names( attributes( img ) )
[1] "size"     "cellres"  "bbox"     "bbcent"   "channels" "red"      "green"
[8] "blue"     "class"

Remembering that there are three different channels in a PPM file, one for red, one for green, and one for blue, perhaps we should look there first. You can plot each of the channels as an image by creating a pixmapGrey() image and see the intensity of each color channel.

> plot( pixmapGrey( img@blue ) )
> plot( pixmapGrey( img@red ) )
> plot( pixmapGrey( img@green ) )

And from this you will see that the different channels look pretty much the same when evaluating the area that is considered the "sky" in this image. For our purposes, we will only use the blue channel as displayed in Figure 11.2.

Figure 11.2: The blue channel of the canopy picture displayed as a greyscale image.

Figure 11.3: A histogram of values in the blue channel (Figure 11.2).

So if that is the component of the image that we are going to use, we now need to determine which values to look for. To do this, you can easily make a histogram composed of the values in the blue channel of the image using the command hist( img@blue ). We can see from Figure 11.3 that there is a tremendous number of values in this channel at the low end, a peak at around 0.2 and another at the top end close to 1.0. We are fairly confident that values close to one in the blue channel (and others you can go check yourself) represent areas in the image where it is pretty light.

We can get a bit more specific with this image and plot the intensity of a particular row of values in the blue channel to double check that values close to 1.0 represent light regions and those near 0.0 are the dark regions. The following commands create the image displayed in Figure 11.4 where the raw values along the 230th row of pixels (indicated by the red dashed line) are shown in blue. It is easy to see that the value in the blue channel gets larger as the dashed line crosses the image.

> plot( img, axes=T, bty="n", xlab="Image Width", ylab="Image Height" )
> par( new=T )
> abline( 230, 0, col="red", lwd=3, lty=2 )
> par( new=T )
> plot( img@blue[230,], type="l", xlab="", ylab="", axes=F, bty="n",
+       col="blue", lwd=2, ylim=c(-10,10) )

Figure 11.4: Intensity of blue channel values in the image as taken through a slice of the image (at pixel row 230 as indicated by red dashed line).

So, at this point, we need to make a value judgement. We need to make a cut-off such that if we look at a pixel, we can put it into the light or not-light category. For the purposes of this exercise, I will assume that values that are ≥ 0.98 are to be considered as sky and I will also make the restriction that I need the pixels in each channel to meet or exceed this cut-off. Now, to find out how much of the image is sky (using this definition), we must:
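The cut-off idea can be tried on a handful of numbers before touching the real image. The sketch below uses a made-up vector, blueRow, standing in for one row of the blue channel. The values are invented for illustration and are not taken from the canopy photo.

```r
# Hypothetical blue-channel intensities for one row of pixels
# (0 = dark, 1 = light); these values are made up
blueRow <- c(0.10, 0.22, 0.99, 1.00, 0.97, 0.98, 0.15)
cutoff <- 0.98

# Comparing a vector to a single number gives one TRUE/FALSE per pixel
isLight <- blueRow >= cutoff
isLight

# sum() counts the TRUE values and mean() gives their proportion
sum(isLight)
mean(isLight)
```

Three of the seven made-up pixels meet the cut-off here. Changing cutoff and re-running the last three lines is a quick way to get a feel for how sensitive the answer is to the value judgement.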

1. Loop through every matrix and the items in each matrix.
2. Evaluate if the value should be considered as sky or not.
3. Use a variable to keep track of all the pixels that meet the criteria.

So to our script, we will add the following lines of code. Note that the counter numSky has to be set to zero before the loop starts.

> numRows <- img@size[1]
> numCols <- img@size[2]
> numSky <- 0
> for( row in 1:numRows ) {
+   for( col in 1:numCols ) {
+     if( img@red[row,col] >= 0.98 &
+         img@green[row,col] >= 0.98 &
+         img@blue[row,col] >= 0.98 )
+       numSky <- numSky + 1
+   }
+ }
> numSky
[1] 9624

So, in the image across all three color channels, we find a total of 9,624 pixels that can be considered to represent the sky.

(Footnote 1: While this part of the exercise was excellent at showing some of the programming paradigms and how they can be combined to give an answer, it is also true that Step 2 can be accomplished in R using the one-liner sum( img@blue >= 0.98 & img@green >= 0.98 & img@red >= 0.98 ). Here the three conditionals return logical values, which the function sum() coerces into integers. While it would have been much shorter to do it this way, it would have negated all the quality teaching experiences that I was laying on you.)

11.4.3 Step 3: Determine The Total Area Of The Image

OK, finally we are almost finished. We need to now determine the total number of pixels in the image so that we can get a standardized percent of open canopy. We could use the total number of pixels, 461^2 = 212,521, but the image taken with the fish-eye lens is not square, rather it is a circle that fits in a square whose side has 461 pixels. So, we need to figure out the area of this circle as:

> r <- 461/2
> totalArea <- pi * r^2
> totalArea
[1] 166913.6
> (461^2 - totalArea) / totalArea
[1] 0.2732395

As a side note, the last expression in the code listing shows the proportion by which we would bias our estimate if we just used the total number of pixels in the image; 27.3% is a reasonably sized bias!

11.4.4 Step 4: Print Out The Proportion Of Canopy That Is Sky

This part is fairly easy and doesn't require much.
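The loop and the one-liner from the footnote should always give the same count. Here is a small sketch that checks this on made-up data, so you can run it without the image file. The three matrices, the 20 x 20 size, and the cut-off of 0.5 are all assumptions chosen only to keep the example small.

```r
set.seed(42)   # make the made-up data reproducible
n <- 20        # hypothetical image of 20 x 20 pixels
red   <- matrix(runif(n * n), nrow = n)
green <- matrix(runif(n * n), nrow = n)
blue  <- matrix(runif(n * n), nrow = n)
cutoff <- 0.5

# Loop version, mirroring the three steps listed in the text
numSkyLoop <- 0
for (row in 1:n) {
  for (col in 1:n) {
    if (red[row, col] >= cutoff &
        green[row, col] >= cutoff &
        blue[row, col] >= cutoff)
      numSkyLoop <- numSkyLoop + 1
  }
}

# Vectorised version: & works element by element, sum() counts the TRUEs
numSkyVec <- sum(red >= cutoff & green >= cutoff & blue >= cutoff)

numSkyLoop == numSkyVec
```

Both approaches visit every pixel once; the vectorised form simply lets R do the looping internally, which is why it is so much shorter.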

> numSky / totalArea
[1] 0.05765857

11.4.5 The Complete Program

The complete program is listed below with comments. There are a few changes in the program that I made to make it a bit easier to work with. Comments should be self explanatory and are indicated by lines that start with the hash character (#).

# removes all variables from memory at start of script
rm( list=ls() )

# load the pixmap library to open the image
library( pixmap )

# I put the file name into a variable so
# it could be changed easily at the top
# of the file if necessary
fileName <- "Hemiphoto monarch habitat1.ppm"

# I also put the criteria into a variable
# so we can change it in one place to see
# how the results differ
skyCriteria <- 0.98

# Read in the image and find the number of
# rows and columns in it
img <- read.pnm( file=fileName )
numRows <- img@size[1]
numCols <- img@size[2]

# Start the count of sky pixels at zero
numSky <- 0

# Loop through each row
for( row in 1:numRows ) {
  # Loop through each column
  for( col in 1:numCols ) {
    # Evaluate the cell in each channel for
    # the 'sky criteria'
    if( img@red[row,col] >= skyCriteria &
        img@green[row,col] >= skyCriteria &
        img@blue[row,col] >= skyCriteria )
      numSky <- numSky + 1
  }
}

# Find total area of fisheye circle
r <- numRows/2
totalArea <- pi * r^2

# Print out the proportion of canopy that is open
percentCanopyOpen <- numSky/totalArea
cat( "Canopy Opening:", percentCanopyOpen, "\n" )

11.5 Synopsis

This has been a very simple little program that we made. Despite it being simplistic, it does show you how to go about creating a simple analysis program. R is not a general programming language and you are not going to make large programs with it. The key to R is knowing how to get something put together, take it a step at a time, and break the components into reasonably sized, easy to accomplish pieces. This is where you start. For now, play around with the program and the exercises and get comfortable with typing code. In Chapter 12 we will build upon what has been done here when we discuss Functions. We can encapsulate code into functions and make our lives much easier.
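As a small taste of what that encapsulation might look like, here is a rough sketch that wraps the counting and the area calculation into one function. The name canopyOpenness and its arguments are hypothetical, invented for this illustration, and the function works on plain matrices rather than on a pixmap object so that it can be tried without the image file.

```r
# Hypothetical helper: proportion of a circular fisheye image that is "sky".
# red, green and blue are square matrices of intensities between 0 and 1.
canopyOpenness <- function(red, green, blue, cutoff = 0.98) {
  # count pixels that meet the cut-off in all three channels
  numSky <- sum(red >= cutoff & green >= cutoff & blue >= cutoff)
  # area of the circle that fits inside the square image
  r <- nrow(red) / 2
  numSky / (pi * r^2)
}

# Made-up 4 x 4 image in which exactly two pixels are bright
chan <- matrix(0.2, nrow = 4, ncol = 4)
chan[2, 2] <- 1.0
chan[3, 3] <- 0.99
canopyOpenness(chan, chan, chan)
```

With two bright pixels and a circle of radius 2, the result is 2 / (4 * pi). The syntax used here (function(), default values) is explained properly in Chapter 12.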

11.6 Useful Functions

The following functions were introduced in this chapter and you will be required to use them for the exercises. To get more information on any of these functions, use the R help system.

• x %% y The modulus operator. This returns the remainder of the division x/y.
• abline(a,b) This function plots a line with intercept of a and a slope of b in the current graphics window.
• as.logical(x) Coerces x into a logical variable if possible. See 2.4.7 for more information on logical variables.
• for(INDEX in SEQUENCE) A main looping construct that specifically uses the counter INDEX that is contained in SEQUENCE.
• if(COND) The evaluation of the condition COND. If it is TRUE then the next line following the if statement is executed. If it is FALSE then the next line is skipped. You can include several lines to be evaluated after this and other evaluation statements by enclosing the code in curly brackets {}.
• else if(OTHER COND) The second evaluation of a condition. This must not be the first conditional (e.g., there is an else here that implies a previous if or else if statement that this is following).
• else The last of a conditional. If all the previous ones did not turn out to be true, then whatever follows the else will be evaluated. It is not necessary that you have one of these at the end; you may want to not do anything unless some specific conditions occur.
• rm(x) This function removes the variable x from memory.
• while(COND) A looping construct that continues to loop until some condition is met. As long as COND==TRUE the loop will continue.
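A few of these can be tried together in a couple of lines. The sketch below is only an illustration; the variables x, value and count are made up for it.

```r
x <- 1:6

# remainders after dividing by 2: 1 for odd numbers, 0 for even ones
x %% 2

# coercing the remainders gives TRUE for odd and FALSE for even
isOdd <- as.logical(x %% 2)
isOdd

# a while loop keeps going only as long as its condition is TRUE
value <- 100
count <- 0
while (value > 1) {
  value <- value / 2
  count <- count + 1
}
count
```

Here the loop halves 100 seven times before the value drops to 1 or below, so count ends at 7. Note that the condition is checked before every pass, so a condition that starts out FALSE means the body never runs at all.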

11.7 Exercises

The following exercises are meant to help you understand the items presented in this Chapter.

1. Create an outline of the steps that would find the number of values in a matrix that are equal to or greater than 20.
2. Implement the program you outlined using the matrix M <- matrix( runif(25,10,30), nrow=5) as your input. Make sure to comment your code appropriately.
3. List some of the assumptions that are included in how the variable numSky is determined.
4. Using the program we created in this Chapter, make a graph of percent canopy with different cut-off values. In your opinion what would be the most biologically meaningful cutoff?
5. Change the program to use a cutoff value based on the sum of the individual color channel values rather than the current requirement that they all be simultaneously over some threshold.
6. How many lines of output do you expect to get from the following code? HINT: Think before you try to run this program.

while( 1 ) {
  cat( "All work and no play makes Dr. D a dull boy.\n" );
}

7. Write a short program that lists all numbers from 1 to 100 and determines if they are divisible by 2 and 3.
8. Write a program using the for() loop that prints the numbers from 42 down to 27, one on each line.
9. What is the proper syntax for conditions passed to an if statement that requires x to be greater than 23 and y to be equal to or less than 4?
10. How many else statements can you have after an if statement?


Chapter 12

Functions

Throughout this book, we've used both built-in functions such as sqrt() and sum() as well as some that are located in external libraries that we had to load (such as skewness() in 4.3.1 and read.pnm() in 7.2). These functions have been really helpful in making your scripts look clean and readable and have made your life rather easy as you performed some basic statistical analysis. Think what a pain it would have been if you had to write code every time you wanted to calculate a sqrt() of a number (I'm not even sure how it is done).

Writing your own functions in R is a very useful way to save a lot of typing. Say you are picky about the way your graphics look, or that there is a particular set of routines that you use to make translations of your data from one format to another. Putting this code into a function and putting that function in a location where you can get access to it whenever you need it is a real treat. You can consider a function a small self-contained bundle of instructions that you can call whenever you need to. In this Chapter you will learn the following skills:

• Learn the syntax required to write your own functions.
• Understand the scope of a variable and why you should care.
• Create a basic library of routines that you can use in the future.

12.1 Function Syntax

The format of a function basically has the following three parts:

1. The name of the function. The creation of a name for a function is just as important as for a variable. I find it helpful to try to make the name tell me what the function does (I'm funny that way), which means it typically starts with a verb such as convertMissingData(), removeLameExcuses(), or makeTheGraphTheWayILikeIt().
2. The assignment to the name. Right after the name you will have the assignment of the generic function() function to the variable (see the syntax below). This tells R that the name is not a variable but will actually be the name of a function.
3. The function contents. This is the part that you get to write. Here is where you put all the stuff together to do whatever it needs to do.

In general, these three parts are put together to look like:

doMyBidding <- function() {
  # Function Contents
}

Now this is a fairly boring function here, it takes no arguments and doesn't return anything useful to you. It is kind of like saying, "R go to your special place and do something but don't tell me what it is." As you write functions, they will be considerably more complicated (and hopefully useful).

In this Chapter I will post the raw code for the function itself followed by the output of R from the command line. The straight posting of the function syntax allows you to cut-and-paste it into the R interpreter (even though you will learn it better by typing it). Also, functions that you have defined are available in the local memory of the interpreter in the same way as local variables are. If you use the ls() command to list the items in memory it will show your function names alongside your variable names.

> ls()
[1] "doMyBidding" "x"

12.1.1 Returning Values From A Function

Most likely you are calling some function because you are interested in getting a response from it. It is not common to write functions that do not give you something back in return. To return a value from a function, R has you put the name of the variable on the last line of the function. An example of this is the following function that returns a single number.

gimmeANumber <- function() {
  42
}

> gimmeANumber()
[1] 42
> gimmeANumber()
[1] 42

And a slightly better function here that actually returns a random number:

gimmeAnotherNumber <- function() {
  x <- runif(1,1,100)
  x
}

> gimmeAnotherNumber()
[1] 87.3278
> gimmeAnotherNumber()
[1] 64.97312
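It can help to poke at these objects a little. The sketch below assumes nothing beyond base R; the variable x is made up, and doMyBidding is the same empty function as in the text.

```r
doMyBidding <- function() {
  # Function Contents
}
x <- 3

# a function is stored under its name just like any other object
is.function(doMyBidding)
is.function(x)

# a function whose body is empty still hands something back: NULL
result <- doMyBidding()
is.null(result)

# otherwise the value of the last line is what comes back
gimmeANumber <- function() {
  42
}
gimmeANumber() + 1
```

So strictly speaking even the boring function returns a value; it is just that the value is NULL, which R uses to mean "nothing here".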

20 . FUNCTION SYNTAX 191 You can also use the return return() to exit the function and potentially return a value. if it is not it prints an error and returns. theValue . by=3) > i s . numeric ( theValue ) ) { return ( theValue / 2 .8 for more on this). try again. Here is an example that checks to see if the passed argument is the right kind. "is not a number. "is not a number. something that signals to you that the value passed to the function may be incorrect then you can remove the last return() statement and have the function not return anything. numeric ( theValue ) ) return ( theValue / 2 . theValue . Here is what that function would look like. otherwise it performs a calculation and then returns the result. gimmeHalf <− function ( theValue ) { # check to see i f i t i s a numeric value # i f i t i s the return h a l f i f ( i s .1. t r y again . > x <− seq (2 .\n" ) return ( ) } } > gimmeHalf ( 12 ) [1] 6 > gimmeHalf ( "Hello partner ! " ) The value Hello partner ! i s not a number.4. try again. gimmeHalf <− function ( theValue ) { # check to see i f i t i s a numeric value # i f i t i s the return h a l f i f ( i s . numeric ( x ) [ 1 ] TRUE Biological Data Analysis Using R . This is because a vector of numbers will return TRUE when asked if it is . If you are not interested in having a function return NULL. Here is an example.\n" ) } > gimmeHalf ( 1 4 ) [1] 7 > gimmeHalf ( "bob" ) The value bob i s not a number. NULL Notice here that when the function left the else section of the function by calling the return() without any arguments then the function actually returned the NULL value. 0 ) # i f i t isn ’ t then complain else cat ( "The value" . t r y again . Vector Arguments By default you function above can work on vectors of values just as easy as single numbers. 0 ) } # i f i t isn ’ t then complain else { cat ( "The value" .12.numeric() (see 2.

vector ( x ) [ 1 ] TRUE > x [ 1 ] 2 5 8 11 14 17 20 > gimmeHalf ( x ) [ 1 ] 1.5 4. I typically write functions by: 1. write the sequence of events that have to occur inside the function so I can see what needs to be done (breaking large problems into small ones here) 3.0 2. 2..0 CHAPTER 12. Notice that inside the function. you can work with vectors of your values just as easy as single numbers. FUNCTIONS 8. ﬁnd a random one. In fact." . "Come home this weekend." . you will ﬁnd yourself being happy with your past self more often than hating what you had forgotten to do (?). and then return it.0 So by default. "We think you are the BEST student at VCU." . So lets walk through these steps and make a function. but if you put in enough so that it is obvious what is going to happen next. your dad and I think you are doing just fine. Fill in the code to allow R to do my bidding.. This is pretty cool and you should try to remember the love that R has for vector operations because it is much faster to call your gimmeHalf() function by passing it vector of value than using a loop to go through the vector and calling gimmeHalf() for each individual value. This is a very good idea because it allows you to document what you are doing inside the function. "You know I took calculus back in college.5 10.. You do not have to document every line of code in your functions. I have added some comments. giveMeSomeMomLove <− function ( ) { # set up a vector o f l o v i n g mother sayings momSayings <− c ( "Honey.5 7. maybe I can help. I made your favorite dessert. Biological Data Analysis Using R . Step 1: Create signature The signature for this function will be: giveMeSomeMomLove <− function ( ) { } Step 2: Using comments create logic of function: The overall goal of this function is to return a random statement from my mother so I will have to set up some statements.0 5. the funcName <−function(){ } part. 
giveMeSomeMomLove <− function ( ) { # set up a vector o f l o v i n g mother sayings # pick a random number to use as index f o r responses # I f you put the name vector and the index on the l a s t l i n e } Step 3: Fill in the R logic: Now that I have the comments set out. Write the signature of the function." .192 > i s . The purpose of this function is to get a little encouragement for my programming endeavors by having R return some nice praise for me. it is fairly easy for me to use them as a guide in laying out the logic of function. Using comments. Here is a slightly longer example of a function.

] 0 1 > getIdentityMatrix ( 5 ) [ . 1 ." > giveMeSomeMomLove ( ) [ 1 ] "Honey. nrow=numRows.2] [ . This matrix is a pretty special one (see ??) in matrix analysis and probably Biological Data Analysis Using R .12. For example.1] [ . when you think of writing functions you should not try to make them so speciﬁc that you have a lot of different functions that do almost the same thing.5] [1 . ncol=numRows ) # make the diagonal a l l ones diag ( I ) <− 1 # return i t to the c a l l e r I } > getIdentityMatrix ( 2 ) [ . length ( momSayings ) ) ) # I f you put the name vector and the index on the l a s t l i n e momSayings [ resp ] } > giveMeSomeMomLove ( ) [ 1 ] "We think you are the BEST student at VCU.2] [1 . your dad and I think you are doing just fine. the function getIdentityMatrix() returns a square matrix with ones down the diagonal. In general. FUNCTION SYNTAX "I just know you’ll be able to find a good job after college.] 0 0 0 1 0 [5 .1.] 0 0 1 0 0 [4 .1] [ . it is better overall form. maybe I can help." > giveMeSomeMomLove ( ) [ 1 ] "You know I took calculus back in college.] 1 0 [2 . This is a very convenient feature for you and your users. rather you should make them robust and if you can combine a few functions into a single one whose values change depending upon a parameter you pass to it." 193 Feel free to add some of your own mother sayings here 12.] 0 0 0 0 1 Default Values Functions can have default values associated with variables that are passed to them.2 Passing Values To A Function The most common way you will interact with a function is probably by giving it some variables and expecting to get something back. We’ve seen this many times so far as you’ve looked up and seen the function signatures of built in variables. g e t I d e n t i t y M a t r i x <− function ( numRows ) { # make a square matrix with a l l zeros I <− matrix ( 0 ." ) # pick a random number to use as index f o r responses resp = round ( r u n i f ( 1 .1.] 0 1 0 0 0 [3 .3] [ .] 
1 0 0 0 0 [2 .4] [ .

] 0 0 42 Now this function has a default value to set the diagonal values to (e. x . value=1 ) { theMat <− matrix ( 0 .2 Scope The scope of a variable determines the value that it has depending upon where it is located. this is all up to you. x . FUNCTIONS should have its own function just because of its status. "\n" ) } > x <− 21 > x [ 1 ] 21 > myFunc( x ) x inside i s 42 > x [ 1 ] 21 myFunc <− function ( a ) { x <− 42 cat ( "other x inside function is" .3] [1 . myFunc <− function ( x ) { x <− 42 cat ( "x inside function is" . 1) producing the Identity matrix I by default. there are a number of reasons why you may need a square matrix with a single value down the diagonal and perhaps it would be more robust to create a function such as: getDiagonalMatrix <− function ( size . it is assigned in the signature for you by default. However..] 1 0 0 [2 .2] [ . however.1] [ . we should focus on the biology and use tools like R as simple tools. you are the programmer here and you get to make the decisions. 12. After all. This topic is a pretty important one and can be a bit tricky at times. "\n" ) } > x <− 23 > myFunc( x ) other x inside function i s 42 > x [ 1 ] 23 Biological Data Analysis Using R . ncol= s i z e ) diag ( theMat ) <− value theMat } > getDiagonalMatrix ( 3 ) [ .2] [ .194 CHAPTER 12. 4 2 ) [ . it can also produce any diagonal matrix when you pass an additional parameter to the function.1] [ . If you do not pass it to the function.] 0 1 0 [3 .] 0 0 1 > getDiagonalMatrix ( 3 . there are several different ways to get the correct result when programming and as Biologists. This makes the function perhaps more robust and useful.g. nrow=size . Of course.] 42 0 0 [2 .] 0 42 0 [3 .3] [1 .
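The two listings above show that assigning to x inside a function does not touch the x outside. The flip side is that a function can still read a variable from outside when it has no variable of its own with that name. The sketch below illustrates both directions; the function names changeInside and readOutside are made up for this example.

```r
x <- 23

# assigning inside a function creates a new x that lives only there
changeInside <- function() {
  x <- 42
  x
}

# with no x of its own, the function looks outside and finds the 23
readOutside <- function() {
  x + 1
}

inside <- changeInside()
outside <- readOutside()

inside
outside
x
```

After running this, inside is 42, outside is 24, and the original x is still 23. Relying on outside variables like readOutside() does can make a function hard to reuse, so passing values in as arguments is usually the safer habit.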

12.3 Useful Functions

The following functions were introduced in this chapter and you will be required to use them for the exercises. To get more information on any of these functions, use the R help system.

• function(args) code Creates a function that has the code inside code requiring the arguments args.
• return(x) Returns the value x from the function, which means it is immediately exited and no more code is executed in the function.
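Both entries can be seen at work in one short function. This is only a sketch; the function scaleBy and its argument names are hypothetical and were made up for the illustration.

```r
# Hypothetical function: multiply a vector by a factor (default of 2)
scaleBy <- function(values, factor = 2) {
  # leave immediately if the input is not numeric;
  # nothing below this line is run in that case
  if (!is.numeric(values))
    return(NULL)
  values * factor
}

doubled <- scaleBy(c(1, 2, 3))               # default factor is used
tenfold <- scaleBy(c(1, 2, 3), factor = 10)  # default is overridden
nothing <- scaleBy("bob")                    # early exit gives NULL

doubled
tenfold
is.null(nothing)
```

Putting the check and the early return() at the top keeps the main calculation on the last line, where the function's value is expected to be.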

12.4 Exercises

The following exercises are meant to help you understand the items presented in this Chapter.

1. Create a function that returns random numbers but allow the user to set an optional argument that will only return even numbers.
2. Create a function that takes a single vector of values and creates a histogram and density line from that data in a new graphics window.
3. Create a function that allows you to pass it a regression model and it will return a string that contains the formula for the model as you would like to have it displayed on a graph.
4. Create a function that takes an ANOVA or Regression model and saves the ANOVA table to a file. You should probably allow the user to pass a file name to the function.
5. How do you set default values for a function when you write it?
6. Let's assume that you have a folder full of data files named Data1, Data2, Data3, ..., Data40. Write a function that creates these file names dynamically. You will want to allow the user to specify the base name of the files (e.g., Data) as well as the starting and ending numbers (e.g., 1 and 40) but set the starting number to default to 0.
7. How do you make sure that the arguments that are passed to your functions are the right kind of variables? For example, what if I passed the variable x <- "this is the end" to a function that expects a number?
8. How would you remove a function from the memory of R?
9. Explain how you get your functions to accept vector arguments.
10. Explain scope and how it pertains to the values assigned to variables.

Appendix A

Answers to Exercises

In this section you will find answers to the odd numbered Exercises presented in each Chapter. These answers are meant to help you start on the exercises, facilitating your completion of the remaining questions. It is my recommendation that you look at the answers only after you have completed them, just to make sure that what you thought you were doing is the correct thing. Don't look ahead...

Answers to Chapter 2...


Appendix B

Installing Additional Libraries

The R statistical computing environment is made more robust by the addition of external libraries. Libraries can be written in R, C, or FORTRAN by you or other people who want to expand the functionality and utility of R. Each should also come with a set of documentation covering all the functions that are included in the library, descriptions of the data sets, and some overall discussion of the library.

B.1 Library Availability

There is a list of libraries available at http://cran.r-project.org. As of the time of this writing, there are currently 1621 different packages in the repository. All are available for you to install and use at your discretion.

B.2 Installing Libraries

If you conduct the installation as a normal user that does not have administrative privileges on your computer, the libraries will be installed in a location that is in your own home directory. Depending upon which operating system you have, this will be in different places. The main thing to worry about here is that when you install libraries into your own directory they will only be available to that user and will not be available for any other users on that machine. If two people use the same machine then they will have to install it twice, once in each home directory. Conversely, if you have administrative privileges on the machine you are using, you can install the libraries into a location that everyone that uses that machine can access.

B.2.1 Using install.packages() As A GUI

The easiest way for you to install a library is to do so from within R itself. R knows how to find, download, and install binary versions of packages using a Tcl/Tk GUI interface. To do this, your machine must be connected to the internet. To start the installation process, issue the command:

> install.packages()

And this will bring up a window (using Tcl/Tk so it won't look quite like the normal window on your operating system) that allows you to select which mirror you would like to use for downloading. An example window is shown in Figure B.1. In general, you should select a location that is geographically close to your current location. All of these mirror servers are kept up-to-date pretty well and you shouldn't find any differences among the packages on any of them. Once you have selected your preferred mirror server, another window will be presented (resembling Figure B.2) that lists all the packages that are available to be installed. Be careful here, this simple interface does not check to see which packages you already have installed, it only lists all the packages that are at your disposal. So just because there is a package on that list doesn't mean that you do not already have it installed on your machine. Select the package, or packages, that you want to install from the list. To select more than one, click on more than one. To deselect a package, click on it a second time and it will be deselected. Once you hit the OK button on this window, the install.packages() function will look to see what dependencies the selected packages have (e.g., PackageA requires PackageB but you didn't know that and didn't select it). Packages will be downloaded and installed in the correct location. After they are installed, you should be able to use them immediately (e.g., without restarting R).
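Since the selection window does not tell you what you already have, you may want to check for yourself first. The following is a minimal sketch of one way to do that using requireNamespace() from base R. The helper name findMissing and the package name aPackageThatDoesNotExist are made up for the illustration.

```r
# Hypothetical helper: which of these packages are not yet installed?
findMissing <- function(pkgs) {
  # requireNamespace() gives TRUE when the package can be loaded
  have <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!have]
}

# "stats" ships with R; the second name is deliberately bogus
wanted <- c("stats", "aPackageThatDoesNotExist")
missingPkgs <- findMissing(wanted)
missingPkgs

# install.packages(missingPkgs)   # uncomment to install what is missing
```

The install line is left commented out on purpose so that running the sketch never downloads anything by accident.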

B.2.2 Using install.packages() For Specific Libraries

If you know the name of the package that you are interested in installing you can use the install.packages() function directly by passing it a name, or vector of names, of the packages you are interested in. This will skip the Package Selection Window step shown in Figure B.2. The syntax for this would be:

> install.packages( "theNameOfTheLibraryNeeded" )

Figure B.1: Example of CRAN mirror window as viewed on Linux

Figure B.2: All packages that can be installed from the selected mirror server on my machine.

Libraries have also been partitioned into different Task Views. These are meta-packages that contain several different packages under a particular theme. Below is a list of the views that are available as of January 2009 (these categories and descriptions are lifted directly from the website).

Bayesian: Bayesian Inference
ChemPhys: Chemometrics and Computational Physics
Cluster: Cluster Analysis & Finite Mixture Models
Distributions: Probability Distributions
Econometrics: Computational Econometrics
Environmetrics: Analysis of Ecological and Environmental Data
ExperimentalDesign: Design of Experiments (DoE) & Analysis of Experimental Data
Finance: Empirical Finance
Genetics: Statistical Genetics
Graphics: Graphic Displays & Dynamic Graphics & Graphic Devices & Visualization
gR: gRaphical Models in R
MachineLearning: Machine Learning & Statistical Learning
Multivariate: Multivariate Statistics
NaturalLanguageProcessing: Natural Language Processing
Optimization: Optimization and Mathematical Programming
Pharmacokinetics: Analysis of Pharmacokinetic Data
Psychometrics: Psychometric Models and Methods
Robust: Robust Statistical Methods
SocialSciences: Statistics for the Social Sciences
Spatial: Analysis of Spatial Data
Survival: Survival Analysis
TimeSeries: Time Series Analysis

A view name is not itself a package, so it cannot be passed straight to install.packages(). You can install all the libraries in a particular view with the install.views() function from the ctv package:

> install.packages( "ctv" )
> library( ctv )
> install.views( "ViewName" )

You will still have to specify the mirror server to use and once you do, R will take it from there. This could be a lengthy process as it may require numerous packages to be downloaded and installed. Be patient.

B.2.3 From the Command Line

Finally, there is one other method that I typically use on my machines. This is because I typically download the source packages rather than the pre-compiled binaries. However, this method also works with binaries. You can download the package from the CRAN site directly and then open a command-line Terminal and change to the directory where the package is located. From there issue the command:

R CMD INSTALL ThePackageYouDownloaded.tar.gz

and R will install it for you. If you do this as the root or administrator user, the package is installed in a globally accessible location, so every user on that machine will have access to it.
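
A downloaded source archive can also be installed from inside an R session rather than from the Terminal. The following is a rough sketch, assuming the archive sits in the current working directory; the file name is the same hypothetical placeholder used above.

```r
# Hypothetical file name, matching the placeholder used in the text.
pkg_file <- "ThePackageYouDownloaded.tar.gz"

# List the library directories R knows about; by default,
# packages are installed into the first one.
print(.libPaths())

# Install from the local source archive instead of a mirror.
# repos = NULL tells install.packages() to treat the argument as a file path.
if (file.exists(pkg_file)) {
  install.packages(pkg_file, repos = NULL, type = "source")
} else {
  message("Not found: ", pkg_file, " (download it from CRAN first)")
}
```

Whether the package ends up in a shared or a per-user location depends on which of the directories reported by .libPaths() you are allowed to write to.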



