
Statistics with R —

a practical guide for beginners

Table of Contents

Introduction
   Who should use this script?
   How to use this script
1. Getting started with statistics
   1.1 Distributions and probability
   1.2 Descriptive statistics
   1.3 How statistical tests work
   1.4 Exercise 1: Getting started with statistics
2 Getting started with R
   2.1 What is R?
   2.2 Downloading and installing R
   2.3 How to work with R
   2.4 Exercise 2: Getting started with R
   2.5 Web resources and books on R
3 Loading and exploring data
   3.1 Entering and loading data
   3.2 Object types in R
   3.3 Exploring data with tables
   3.4 Exploring data graphically
   3.5 Calculating descriptive statistics
   3.6 Understanding help functions
   3.7 Exercise 3: Exploring data
4 Basic statistics with nice datasets
   4.1 t-tests (and non-parametric alternatives)
   4.2 Correlation analysis
   4.3 Cross-tabulation and the χ² test
   4.4 Exercise 4: Basic statistics with nice datasets
5 Handling larger and nastier datasets
   5.1 Handling data
   5.2 Dealing with missing values
   5.3 Exercise 5: Handling larger & nastier datasets
6 Linear models
   6.1 Linear models: basic reasoning
   6.2 Linear models in R
   6.3 One-way ANOVA: worked example
   6.4 Two-way ANOVA with interaction: worked example
   6.5 Linear regression: worked example
   6.6 Exercise 6: Linear models
7 Basic graphs with R
   7.1 Bar-plots
   7.2 Grouped scatter plot with regression lines
   7.3 Exercise 7: Basic graphs with R
8 Introduction to customizing R
   8.1 Flow control
   8.2 Custom functions
   8.3 Summary
   8.4 Exercise 8: Customizing R

Introduction

Welcome to the statistics with R guide!

Who should use this script?

This introduction is directed at beginning MSc students and advanced BSc students of Biology at
Uppsala University who specialize in Ecology and Conservation, Evolutionary Biology, Limnology
or Toxicology.

This "quick-start" guide is intended to generate interest in statistics with R and to
allow you to learn more on your own.

You may find this script helpful if you are:

1. an incoming Master or exchange student with limited previous education in Statistics with R

2. a student who wishes to freshen up knowledge in statistics and/or R for courses, project
work or research (Uppsala University student, Master thesis, Doctoral thesis)

3. anyone interested in a quick-start guide to beginner level Statistics with R

How to use this script

We wrote this script for flexible use, such that you can direct your attention to the parts that you
want to focus on, given your background and current interest.

You will have most use of this pdf file if you read it electronically using a pdf reader that provides
a content sidebar (bookmarks pane), for example Adobe Reader for Macintosh or PC.
You can browse and navigate between sections and subsections using the bookmarks pane.
This file also contains an R code demo file and data files as attachments (paperclip symbol in
Adobe reader).

Please feel free to approach us with questions on the contents and on the exercises. Exercise
solutions are available from us after you have handed in your own solution.

We hope that you will find this script useful and fun!

August 2018,

Sophie Karrenberg

Contributions to earlier versions:
Andrés J. Cortés

1. Getting started with statistics

1.1 Distributions and probability

Goals
In this section you will:

1. learn about why and how statistics are used in biology


2. understand basic statistical concepts such as distributions and probabilities
3. become familiar with the normal distribution

Why do we need (so much) statistics in biology?


Organisms that biologists study are influenced by a multitude of factors including their genetic
makeup, their developmental stage and the environmental conditions experienced. It is usually
impossible to work on all the units of the group or species that you are interested in, for example
ALL trees of a certain species. Instead, biologists commonly work on a subset of units taken at
random and make inferences from this subset. The whole group of units is called a "population",
while the subset is referred to as a "sample" (Figure 1-1). Statistical analyses help you to use
samples to make inferences about the populations.

Figure 1-1 Population and sample.

Biological questions such as which genes affect certain traits or how climate change affects the
biosphere can only be solved using statistical analyses on massive datasets. But even
comparatively simple questions, for example to what extent men are taller than women, are in need
of statistical treatment. Thus, as soon as you formulate a study question, you should start
thinking about statistics.

Statistical analyses have a central place in biological studies and in many other scientific
disciplines (Figure 1-2):

Figure 1-2 The role of statistical analyses in the biological sciences.

Distributions
A distribution describes how often different values occur in a set of data. In the graph on the
next page you see a common representation of a distribution, a histogram (Figure 1-3). In
histograms, the horizontal x-axis represents the values occurring in the data, separated into
groups (columns), and the vertical y-axis shows how often they occur (proportion or frequency).

Figure 1-3 Histogram of normally distributed data.

Probability and probability density functions


Probabilities express how likely events are, as the number of events of interest divided by the
number of all events. Thus, probabilities range from zero (never) to 1 (always) or are expressed as
percent. For example, when you throw a coin it is equally likely that it lands on heads or tails. The
probability of a coin landing on heads is 0.5. However, for a single throw of the coin you cannot
predict where it will land! If you, however, throw the coin very many times you expect it to land
on heads about half of the time, corresponding to a probability of 0.5 or 50%.

The coin example concerns an outcome with two categories, heads and tails. For continuous
(measurement) values, probability density functions can be derived (Figure 1-4). Note the
similarity in shape to the histogram above (Figure 1-3). For each value on the x-axis the value of
the probability density function displayed on the y-axis is the expected probability of that value
occurring. The value 2 is thus expected to occur with a very low probability in data from a
normal distribution with a mean of 0 and a standard deviation of 1 (green curve, Figure 1-4).
Probability density functions of test statistics are used for the evaluation of statistical tests.

The normal distribution


The normal distribution provides the basis for many statistical analyses in use today.
The parameters of the normal distribution are
the mean, corresponding to the center of the distribution, and
the standard deviation (sd), describing the spread of the distribution
(see Descriptive statistics and Figure 1-4).

The probability of obtaining values in a certain range corresponds to the area under the curve
in this range. The entire area under the probability density curve sums to one.

Figure 1-4 Probability density curves for different normal
distributions.

For normally distributed data, values within 1 sd to either side of the mean represent 34.1% of
the data (pink, Figure 1-5), 13.6% of the values occur between 1 and 2 sd from the mean on either
side (yellow), 2.1% of the values occur between 2 and 3 sd from the mean (green) and 0.1% of the data
occur beyond 3 sd from the mean (white). Values more than three standard deviations from the mean
are thus extremely unlikely!
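If you later want to verify these percentages in R (R is introduced in Chapter 2), the function pnorm() returns the area under the standard normal curve up to a given value; a quick sketch:

pnorm(1) - pnorm(0)   # 0.341: between the mean and 1 sd above it
pnorm(2) - pnorm(1)   # 0.136: between 1 and 2 sd above the mean
pnorm(3) - pnorm(2)   # 0.021: between 2 and 3 sd above the mean
1 - pnorm(3)          # 0.001: more than 3 sd above the mean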

Figure 1-5 Probability density of a standard normal distribution


(mean = 0 and sd=1), with percentages of values.

Summary

Statistical analyses are important in the biological sciences.


Distributions can be displayed as histograms and show how often different values (or classes
of values) occur.
Probabilities express how likely events or outcomes are. Probability density functions show
how likely it is to obtain values under a certain distribution.
The normal distribution is fundamentally linked to many common statistical tests. Normal
distributions are described by their mean and their standard deviation.

1.2 Descriptive statistics

Goals
In this section you will learn how to describe your data in terms of

range, quartiles and median


mean and standard deviation and standard error of the mean

Range, median and quartiles


Once you obtain data you often wish to gain an overview before you start conducting
analyses. One of the most basic measures of data series is their range (Figure 1-6).

Figure 1-6 Histogram of normally distributed data with descriptive statistics.

The range refers to the interval between the smallest value, the minimum, and the largest value,
the maximum. Indeed, looking at the range is highly recommended as it allows you to conduct a
first check of the data: are the values actually in the expected (or reasonable) range?

In addition to the range, you can calculate the median, the value "in the middle" of the data, that is,
half of the data are smaller than the median and the other half are larger (Figure 1-6). For example,
the data (2, 3, 5, 7, 10, 50, 73) have a median of 7 and the data (2, 3, 4, 5, 6, 7) have a median of 4.5.
Symmetrically distributed data have a median more or less in the middle of their range. Data with a
range of 1 to 10 with a median of 2, in contrast, are NOT symmetrically distributed. You can further
calculate the 25% and 75% quartiles, separating the smaller 25% of the data from the larger 75%
and the smaller 75% from the larger 25% of the data, respectively (Figure 1-6, Figure 1-7).

Biological data are often asymmetrically distributed, especially size measurements. It is common
to obtain many small measurement values and fewer large values, such that the data have a distribution
as in the histogram below (right-skewed, Figure 1-7).

Figure 1-7 Histogram and descriptive statistics on a right-skewed distribution.

Median and quartiles are in fact good descriptive statistics for such data: the median indeed is in
the center of the data and the quartiles nicely reflect the asymmetry in the distribution, i.e., the
distance between 25% quartile and median is smaller than the distance between 75% quartile and
the median (Figure 1-7). Alternatively, you can use data transformations (see Chapter 4.1).

Mean and standard deviation of a sample
Common descriptive statistics are the mean and the standard deviation. They make most
sense for symmetrically distributed data. The mean is calculated as the sum of all values
$x_i$ divided by the number of values $n$. The sample standard deviation $s$ (also sd) is calculated as
the square root of the "sum of squares" (the sum of the squared differences between each value
and the mean), divided by $n - 1$.

Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Standard deviation: $s = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$
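As a quick check of these formulas in R (R itself is introduced in Chapter 2; the vector name my.data is a placeholder), calculating the mean and standard deviation "by hand" gives the same results as the built-in commands mean() and sd():

my.data <- c(2, 4, 9)
n <- length(my.data)
sum(my.data) / n                                   # 5, same as mean(my.data)
sqrt(sum((my.data - mean(my.data))^2) / (n - 1))   # 3.606, same as sd(my.data)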

Standard error and confidence interval of the mean


The standard error of the mean (SE or se) gives a measure of the precision of the estimate of the
mean. The standard error can be used to calculate a confidence interval (CI) for the mean. The
95% confidence interval around the sample mean is expected to contain the population mean
with 95% probability.

Note that standard error and confidence interval of the mean become smaller the larger the
sample is. This reflects the greater trust you can have for a mean calculated from a large sample
as opposed to a mean calculated from a small sample.
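The standard error is the sample standard deviation divided by the square root of the sample size, and the 95% confidence interval can be obtained from it with a t-quantile. A minimal sketch in R (object names are placeholders):

my.sample <- rnorm(n = 20)                     # a sample of 20 values
se <- sd(my.sample) / sqrt(length(my.sample))  # standard error of the mean
# 95% confidence interval of the mean:
mean(my.sample) + c(-1, 1) * qt(0.975, df = length(my.sample) - 1) * se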

Summary

Range, quartiles and median are basic descriptive statistics for data with any distribution.
Mean and standard deviation are most useful for symmetrically distributed data.
Descriptive statistics are important to check data and are used to summarize data.

1.3 How statistical tests work

Goals
In this section you will learn about

1. basic reasoning in classical statistical tests


2. P-Values, error types and statistical power

Hypothesis testing
Many classic statistical tests evaluate a strict null hypothesis, H0, against an alternative
hypothesis, HA. For example:

H0: mean plant height does not differ between plants grown at low and high densities

HA: mean plant height differs between plants grown at low and high densities

Statistical tests calculate a test statistic from the data to find out how likely the obtained result
(or a more extreme one) would be under the null hypothesis. This probability is termed the P-value.
The probability of the test statistic is found using its theoretical probability distribution.

Figure 1-8 Theoretical distribution of a test statistic under a null hypothesis and
an alternative hypothesis.

Statistical tests can have four potential outcomes, two are correct and two are false (Table 1-1).
The probability of rejecting the null hypothesis when it is true (Figure 1-8, Table 1-1) is termed
the type I error rate; the significance level limits this error and is used to assess statistical significance.

If the test statistic calculated from the data happens to be a value that is very rare under the null
hypothesis, usually occurring at a probability of less than 5% (P-value < 0.05), the null hypothesis
is discarded in favor of the alternative hypothesis (Figure 1-8, see also below). This result is
referred to as "statistically significant". If the P-value is larger than 0.05 the null hypothesis is
retained over the alternative hypothesis and the test result is referred to as "not
significant".

Table 1-1 Possible outcomes of statistical tests with a significance level of 0.05.

                                          Reality
Test result                               H0 true                H0 false
Test significant (P-value < 0.05):        Type I error           Correct!
H0 rejected, HA favored                   ("false positive")
Test not significant (P-value >= 0.05):   Correct!               Type II error
H0 accepted, HA discarded                                        ("false negative")

The significance level is commonly set to 0.05 in biological studies and P < 0.01 or P < 0.001 are
regarded as highly significant and very highly significant. Importantly, the choice of significance
level has direct implications for the two error types. If the significance level and thus the type I
error is decreased to 0.01 the type II error is inevitably increased (Figure 1-8, lower panel).

One- and two-tailed tests


When comparing means of two groups statistically, the alternative hypothesis is usually specified to
consider means smaller OR larger than under the null hypothesis. This is referred to as a two-sided
hypothesis and is tested with a two-tailed test (Figure 1-9). For example:
H0: mean of group 1 equal to mean of group 2
HA: mean of group 1 NOT equal to mean of group 2.

The alternative hypothesis can also be formulated to specifically test how the means
of the two groups differ. Here, one of two different alternative hypotheses for group means can be
used, either group 1 > group 2 or group 1 < group 2.
Such hypotheses are referred to as one-sided hypotheses and are analyzed by one-tailed tests
(Figure 1-9).
Importantly, all statistical tests make assumptions about the data and are only valid if these
assumptions are met (see chapters 4, 5, and 6).

Figure 1-9 Illustration of significance (P < 0.05) ranges in one- and two-tailed tests.

Summary

Statistical tests use samples to make inferences about larger populations and generally evaluate a
null hypothesis (usually no difference) against an alternative hypothesis (a difference). They
do so by comparing a test statistic calculated from the data against a theoretical distribution
of this statistic under the null hypothesis.
The significance level used in statistical testing is related to both type I errors (false positives)
and type II errors (false negatives).

1.4 Exercise 1: Getting started with statistics

1-A

A B

(a)
(b)
(c)
(d)
(e)
(f) Most values are smaller than 2.
(g)

1-B

or should you complain

1-C

(a)
(b)
(c)

1-D

(a)
(b)
(c)
(d)
(e)
(f)

2 Getting started with R

2.1 What is R?
R is a versatile and powerful statistical programming language developed by the statistics
professors Robert Gentleman and Ross Ihaka at the University of Auckland, New Zealand.
Unlike many other statistical programs, R is free and its source code is available. R was released
in 1996 and is maintained by the R development core team (http://www.R-project.org/) with a
very large number of international contributions, and it continues to develop at a fast pace. R is
among the most widely used statistical programs at universities today. For more information
please check out the list of Web resources and books on R at the end of this chapter.

2.2 Downloading and installing R


R is available for Macintosh, Windows and Linux operating systems and easy to install. To
download and install R, please go to The Comprehensive R Archive Network page, CRAN,
(http://www.R-project.org/) and follow the instructions there.

2.3 How to work with R

Goals
In this section, you will learn:

1 What R scripts, R commands and the R console are


2 How to work with R and RStudio
3 How to assign R objects and write code
4 Basic calculations and data display
5 The steps to follow when working with R (work flow)

Elements of R: Script, Console & Co.
Using R involves mostly writing commands (or "code") rather than clicking on menus.
The commands are usually assembled in a script that can be saved and reused (Figure
2-1). The R console receives the commands either from the script or by direct typing.
The console shows the progress of analyses and displays the output. Graphical output
will open in a separate window. This means that working with R can involve quite a lot
of windows and files, the script, the console, graphs and other output and then, of
course, your data files. You can assign a working directory where all of these files and
outputs are saved by default.
An excellent way of ordering and manipulating your R windows and files is to use the
free and powerful interface for R provided by RStudio (see below).
We highly recommend that you use RStudio.

Figure 2-1 An overview of the workflow in R.

Using RStudio to work with R


RStudio incorporates the console, script, graphical output and various other elements in
an accessible and easy-to-manipulate form. RStudio is free and available for both
Windows and Macintosh operating systems and can be downloaded
from http://www.rstudio.com/products/rstudio/. Note that the RStudio menu differs
slightly between PC and Mac versions. You need to install R before using RStudio (see
above).
The RStudio screen is divided into four resizable parts (Figure 2-2). The upper left part
contains a script editor where commands are written and saved. The various tabs in the
upper left part can contain multiple scripts and also data files. Commands are sent to the
console in the lower left part using the key combinations Cmd+Return (Macintosh) or
Control+R (PC). On the right side, RStudio displays a workspace tab listing all objects
in the current analysis and a history tab providing a recollection of executed commands.
The lower right partition hosts a figure tab where graphical output is stored, a package
tab where packages (for specialized analyses) can be viewed and installed, a file tab to
manipulate files, including the working directory, as well as a help tab where R help
can be searched and displayed.

Figure 2-2 A commented screenshot of RStudio.

In RStudio, you can bundle your analyses into projects by using the RStudio menu in the
top right corner of RStudio (Figure 2-2). Projects contain all elements of analyses
allowing you to continue a session exactly where you ended the previous time.

You can set a new working directory by navigating to the folder you want to use on the
files tab (lower right) and choosing "Set as working directory" from the "More" menu
(Figure 2-2). Note that the working directory basically is a folder on your computer that
contains various files whereas the workspace is a collection of R objects (see below)
assigned through R (Figure 2-1).

To create a new script, you can follow "File > New File > R script", or use the shortcut
Ctrl + Shift + N. Save your scripts regularly. A file that has been modified but not saved
again is displayed with a red title and a * at the end.

You can navigate between different plots on the plots tab (lower right) produced during
a session using the blue arrows at the top left corner of the plots tab. You can save your
graphs by clicking on "Export". When you are finished with your analyses you can
compile your code and output into a notebook to document a successful analysis.
However, in recent years, issues with version incompatibilities and RStudio bugs have
led to difficulties with notebook creation and, for this reason, we currently DO NOT
recommend that you use notebooks as a beginner.

For help with RStudio, see the Web resources at the end of this chapter.

Time to get started!


You can now open "Statistics with R_2015_Demo_1.R" from RStudio (menu: file > open file).
The file is attached to this pdf (use paper clip symbol). It contains R code with comments to go
along with procedures described below.

Work flow in R
When working with R on a new analysis you usually follow these steps:
1. Define/create a folder to be used as the working directory.
2. Open RStudio and create a new script file (menu). You can also create a project
(button top right).
3. Set the working directory to your prepared folder.
4. Write your script in the script window and save it. Send selected code line(s) to
the console using Cmd+Return (Mac) or Ctrl+R (PC).
5. Conduct analyses, save the script, outputs and graphs. When the entire analysis is
ready, you can compile code and output into a notebook.
6. Quit R using the menu or the command q(). You usually DO NOT need to save the
workspace.

R commands
R commands always have the same structure (Figure 2-3). A command name is followed
by parentheses without a space in between. Command names are often closely related to
what the command does, for example the command mean() will calculate the mean.
The parentheses contain the arguments that the command will use, separated by commas.
Such arguments tell R what data to use and which analysis options to select. Which
arguments are needed differs between commands.

Figure 2-3 Structure of R commands.

R object assignment and the workspace


In R, all data and analysis parts are assigned to so-called objects.
You can assign content to objects using the "assignment arrow", a "less than" sign and
a minus character (<-) pointing from the content to the name you give to the
object. You can choose names freely but stick to the English alphabet and avoid blanks
and special characters, except for the underscore (_) and the dot (.).

For example, the following line of code

greeting <- "hej"

will assign "hej" to an object named greeting. All text has to be in quotes (" "),
otherwise R will look for an object with this name and produce an error message. For
example,

greeting <- hej

will result in the error message

Error: object 'hej' not found

To call an object and have it displayed on the console you type its name (in the script
window and send it to the console).

greeting

will result in the output

[1] "hej"

The [1] indicates that this is the first (and only) element of this object.

One (very slow) way to enter data into R is to assign them to an object directly using
the command c(). For example,

my.numbers <- c(1, 2, 3)

will assign the numbers 1, 2, and 3 to an object called my.numbers (the name is
arbitrary). This type of object is called a vector.

You can check which objects have been created in R using the command ls().
This will result in a list of the current objects in the workspace. In RStudio you can view
and manipulate current objects in the workspace tab (top right partition). You can also
use rm(name.of.object) to remove objects.
The command rm(list = ls()) deletes all current objects.

Objects are overwritten without notice whenever something else is assigned to the
same name.

Basic calculations and data display in R


You can perform calculations directly in R. For example, the line

2 + 3

will result in

[1] 5

Other common arithmetic operators are: - (minus), / (division), * (multiplication),
^ (exponentiation). Calculations can also be done on vectors. Basic calculations (+, -, /,
*, ^, etc.) are then conducted on each element of a vector. For example,

my.numbers * 2

will multiply each element of the vector by 2, resulting in the output

[1] 2 4 6

Vectors of the same length (containing the same number of elements) will be combined
element-wise in calculations. Below we create a second vector with 3
elements, and add it to my.numbers.

my.numbers.2 <- c(10, 20, 30)
my.numbers + my.numbers.2
[1] 11 22 33
You can use this procedure to explore functional relationships of interest. For example,
my.x <- 1:100
creates a vector with the numbers from 1 to 100.
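You could then compute a function of this vector and plot the two against each other; a small sketch (names invented):

my.y <- my.x^2       # a quadratic relationship
plot(my.x, my.y)     # plots the two vectors against each other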

R allows you to explore sets of random numbers as well. The command rnorm()
generates random numbers from a normal distribution, by default from the standard
normal distribution with a mean of zero and a standard deviation of 1 (if you
are not sure what this means, see Distributions and The normal distribution). Below we
create a vector containing 100 randomly drawn values from this distribution; the number
of values to draw is given with the argument n (the object name is a placeholder):

my.sample <- rnorm(n = 100)

One way to quickly visualize this data is to create a boxplot that displays the
median (black bar in the middle) and 25% and 75% quantiles (box) and indicates outliers
as values beyond the t-bars or whiskers. To display a boxplot of the 100 random
numbers we use the object name my.sample (see code above) as an argument
to boxplot() and generate the graph below (Figure 2-4).

boxplot(my.sample)

Figure 2-4 Boxplot of a sample of 100 numbers randomly drawn from the standard
normal distribution. The bar indicates the median; the box shows the 25% and 75%
quantiles.

The median is around zero, as expected for a sample from the standard normal
distribution. If we now want to compare data from several groups using a boxplot we
can do this with two vectors, one with the numbers from both groups combined and the
other indicating which number belongs to which group. As an example, we can use two
random samples of 100 numbers each and combine them with c().

We can use the command rep() to create a group indicator vector with the sample
names repeated 100 times each and combine these with c() as well. The boxplot
command then receives a so-called formula statement (more on that in later chapters)
with the sample numbers on the left side connected with a tilde symbol (~) to the group
indicator on the right side (resulting boxplot in Figure 2-5); see the sketch below.
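A minimal sketch of these steps (all object names are placeholders):

sample.1 <- rnorm(100)
sample.2 <- rnorm(100)
values <- c(sample.1, sample.2)                       # both groups combined
groups <- rep(c("sample.1", "sample.2"), each = 100)  # group indicator vector
boxplot(values ~ groups)                              # formula: values by group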

Figure 2-5 Boxplot comparing two groups of random


numbers. Outliers are indicated as points beyond the t-bars.

Now it is time to try this out using the demo code (attached) and some exercises; please
check out the work-flow (Work flow in R) and the notes below before you start!

Important notes for writing and executing code


Write code in the script window. You send code lines to the console such that
they are executed by highlighting one or multiple lines and pressing the
keyboard shortcuts Cmd+Return (Mac) or Ctrl+R (PC).
We highly recommend that one of the first commands you write is the one below,
which sets the language of output and error messages to English (this greatly
improves communication with teachers and other students…).
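A common way to do this is the following call:

Sys.setenv(LANG = "en")   # display R messages in English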

You can enter comments and titles preceded by a hash (#). Everything written
after a # will be ignored and not executed by R (see example code).

In RStudio, you can use four hash symbols (####) after a title to
organize the script. You can later navigate through these headings using
the pull-down menu at the bottom left of the script window (see example code).
Observe the ">" sign in the console. This is the command prompt indicating that
R is ready to receive commands and has finished executing previous commands.
R will display a "+" if a command is incomplete. On the console, you can cycle
through previous commands using the "arrow up" and "arrow down" keys on
your keyboard.

Summary

In R, you use a script window to enter the commands. Commands are transferred
to the R console for execution. Scripts can be saved and re-used.
Data, output and scripts are saved in a designated working directory.
R stores data and analysis outputs as objects.
Content is assigned to named objects with the assignment arrow "<-".
The workflow in R involves setting the working directory at the beginning and
saving the script file repeatedly.
In R, you can conduct basic mathematical calculations directly and element-wise, for
example on vectors.
Boxplots can be generated with boxplot() to summarize data.

2.4 Exercise 2: Getting started with R

(attached)

2-A

vertical
each o

A B
1.3 10.1
1.8 13.7
0.75 15.1
0.9 9.3
0.6 12.0
1.0 16.8
1.1 17.3

2-B Effects of sample size.

Hint

(a)
(b) (5, 10, or 100)
of each size

(only if you want e-mail support)

(or similar) with your comments/interpretations.

2.5 Web resources and books on R

Web resources
CRAN page (http://www.r-project.org/)

R page for download of program and packages, manuals and guides.

R-Studio (http://www.rstudio.com)

User interface for working with R, highly recommended.

Quick R (http://www.statmethods.net/index.html)

Code example page for beginners and a little bit more advanced users.

Books
Dalgaard, Peter (2008) Introductory statistics with R, 2nd edition. Springer. ISBN-
13: 978-0387790534, also available as e-book.

Popular and very well-written, covers both R and statistics, including slightly more
advanced methods. Suitable for beginners. We recommend this book to get started and
also as a reference throughout your studies.

Zuur, Alain, Ieno, Elena N., Meesters, Erik (2009) A Beginner´s Guide to R.
Springer. ISBN-13: 978-0387938363.

More detailed introduction to R and R programming. Does not cover statistics. We


recommend this book if you want to deepen your knowledge and/or start programming
in R.

Crawley, Michael (2012) The R book. Wiley. ISBN: 9780470973929

Large overview with many biological examples, suitable for beginning users with some
statistical knowledge. This is a popular reference volume.

Kabacoff, Robert (2011) R in action. Manning Publications. ISBN-13: 978-1935182399

From the author of Quick R; this book covers both beginner and more advanced
methods and follows a case-study approach. Previous experience with R is desirable.

3 Loading and exploring data

3.1 Entering and loading data

Entering data in the script


Small datasets can be created using the command data.frame(); larger datasets are
loaded into R from files prepared in other programs such as Excel (see next section).
The data.frame() command creates an R object that can combine vectors with
numbers, for example measurements, and vectors with categories. Below is an example
from a plant experiment. Width and length of six leaves were measured (in cm) in the
plant species Silene dioica. The first three plants were flowering and the last three were not
flowering. Note that data for the flowering state is entered in quotes because flowering
state is a category. In the data.frame() command, each column is set by an argument
giving the column name. Check out where commas are placed — they are vital! Without
them the command will not work. You can choose column names freely (in the sketch
below they are plant, width, length, and state).
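A sketch of such a command (the column names and the measurement values below are invented for illustration):

leaves <- data.frame(
  plant  = c(1, 2, 3, 4, 5, 6),
  width  = c(1.1, 1.4, 1.0, 0.8, 0.9, 1.2),
  length = c(4.2, 4.8, 3.9, 3.1, 3.3, 4.0),
  state  = c("flowering", "flowering", "flowering",
             "vegetative", "vegetative", "vegetative"),
  stringsAsFactors = TRUE  # makes the text column a factor (the default in R < 4.0)
)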

Note: when you execute these commands nothing visible happens in the console!

Preparing data files
Most of the time it will be convenient to load data from a file. You first need to prepare a
suitable file from another program, for example Excel. Follow these points:

1. Arrange the variables (measurements) in columns and study units in rows.


2. Make sure that column headings and table entries contain only letters from the
English alphabet or numbers; in particular, they must not have spaces, slashes or
commas. This is advisable even though some newer versions of R (sometimes)
tolerate spaces and other letters. If you want to make your headings clearer, you
can use points or underscores, for example, leaf_length or leaf.length.
3. Save your table as .csv. (Other formats are also possible but not covered here.)
4. Open your file in a text editor. Observe two things: First, what is the decimal
separator, i.e., is it 1,5 or 1.5 (comma or point)? Secondly, how are the entries
separated? This can for example be a space, tabulator, comma or semicolon
(;). You need this information to ensure correct loading of your data in R, as
explained below.

Loading data files through the RStudio menu


There are many ways to load data into R. In RStudio, you can load data through the
workspace tab using the import data pull-down menu. Unfortunately, this does not
always work in all operating systems and versions of R. For this reason, we recommend
that data is loaded using the script (see below).
If you do load data through the menu, please observe that this is not compatible with
notebook creation. However, data loading through the RStudio menu will produce
data loading code in the console. You can copy this code into your script such
that it is executed during notebook assembly.

Loading data files through the script


We recommend the command read.table() because it is universally applicable. Within
the read.table() command you can specify to browse your computer for the file to
load using the argument file = file.choose(). Alternatively, you can give the path
to the file, for example file = "My_harddisk/documents/R_exercise" (Mac) or
file = "C:\documents\R_exercise" (PC). The path is needed for notebook creation.
For data loading, you further indicate whether your data contains headings with
header = TRUE (has headings) or header = FALSE (no headings). The table separator is set
using the argument sep, for example,
sep = ";" (semicolon), sep = "," (comma), or sep = "\t"
(tabulator). The decimal separator is specified by dec, as dec = "," (comma) or dec = "."
(point). The input file needs to be assigned to an object using the arrow "<-", below to the
object my.data (a placeholder name). In most cases, this will automatically be a data frame object. For a .csv
file with a header, semicolon-separated entries and decimal commas (as usually used with
Swedish settings for .csv files saved from Excel) the command looks like this:
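# Sketch; the object name my.data is a placeholder
my.data <- read.table(file = file.choose(), header = TRUE, sep = ";", dec = ",")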

For a .csv file with a header, comma-separated entries and decimal points (as common in
North America) the command looks like this:
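# Sketch; the object name my.data is a placeholder
my.data <- read.table(file = file.choose(), header = TRUE, sep = ",", dec = ".")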

Common problems when loading data


Additional entries. Loading a file with additional entries (sometimes invisible ones such
as spaces) outside the data will yield an error message.

Remedy: copy your data (and only your data!) to a new sheet in Excel, save it as .csv and
reload.

Non-English characters and signs. Non-English characters (for example ä, ö, é),
special signs or spaces in the column names will also produce an error message.

Remedy: change the names in an Excel file or directly in a .csv file, save it as .csv and
reload.

Checking data structure


To look at small datasets you can just type the name of the object; the entire table is then
printed on the console.
To control what sort of object your data is stored in and whether it has the correct
structure, you can use the structure command str(). The same information is
displayed when you click on the object in the workspace tab in RStudio (top right
partition). Applied to the sketch data frame created above, it yields the following output:
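str(leaves)
# 'data.frame':   6 obs. of  4 variables:
#  $ plant : num  1 2 3 4 5 6
#  $ width : num  1.1 1.4 1 0.8 0.9 1.2
#  $ length: num  4.2 4.8 3.9 3.1 3.3 4
#  $ state : Factor w/ 2 levels "flowering","vegetative": 1 1 1 2 2 2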

This indicates that the object is a data frame with six observations, i.e.,
six rows, and four variables, i.e., four columns. This matches well with the six plants and
four columns in our data.

The column names and the type of the columns are also given together with the first few
values (in this case all six values). The columns width
and length are numeric (continuous numbers) and you will be able to do
calculations with these numbers. The column state is a factor with the two
levels flowering and vegetative. Different from a vector, a factor stores
information on the factor levels and this information will be used directly in analyses and
graphing commands.

If you want to change the type of a column, for example changing the plant
column from numeric to factor (because the number is a "name" in this
case), use the following command:

leaves$plant <- as.factor(leaves$plant)

In this command, you use the expression leaves$plant to refer to
the plant column in the data frame object leaves. Note that the
name of the data frame and the column name are connected by a dollar sign ($) – a
common notation in R. You can control whether the factor assignment has been
successful by calling the structure command again. We will cover more on accessing and
handling data in the section Handling data.

3.2 Object types in R
Before we can go on and load data and conduct analyses we need to have a look at the
different object types in R and the data contained in them. In general, R objects contain
numbers, characters (entered in quotes, such as the
category names in the example above) and logical statements (TRUE or FALSE; these
are covered later).

The following types of objects are commonly used by beginners:

Vector. You have already created vectors in Exercise 2. Vectors are one-dimensional
and contain a sequence of one type of data, numbers OR categories (letters, group
names) OR logical statements. Vectors can be created using the command c(),
which concatenates (connects them one after the other)
the different elements into a vector. You can also use c() to combine multiple vectors
as you may have done in Exercise 2-B.

Sequences of numbers can be created using the colon (:). For instance,

x <- 1:7

creates the vector x that contains a sequence of numbers from 1 to 7.

There are a number of other functions for creating vectors, seq() for user-defined
sequences and rep() for repeated elements (see Chapter 2, Basic calculations and data
display in R).

Factor: Factors are similar to vectors but also contain information on grouping levels.
They are used in analyses and graphs that involve groups. Factors can be created from
vectors using factor() or as.factor() (see Checking data structure above). You can
create a factor using the code below:
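# Sketch; the name and the levels are invented
my.factor <- factor(c("low", "low", "high", "high"))
levels(my.factor)   # "high" "low" — levels are sorted alphabetically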

Data frame: data frames are a collection of vectors and factors of the same length. This
is the format commonly used for basic data analysis where each row corresponds to an
observation and each column corresponds to a measurement or group (vector or factor,
respectively). The above section Entering and loading data explains how to create data
frames from your data.

List: lists are collections of elements of any type and length and can be created
using list(). Outputs of statistical analyses often are lists. Other types of objects
include matrix and array. Many commands create custom object types.
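A brief sketch of a list (names invented):

my.list <- list(numbers = 1:3, greeting = "hej", flag = TRUE)
my.list$numbers   # elements can be accessed by name, as in data frames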

3.3 Exploring data with tables


It often is a good idea to start exploring data with the table() command. This command
creates a table reporting the number of data points (or rows) for levels of factors or for
combinations of factor levels. These tables are very useful for determining if your data
contains the expected number of entries. For example, we can produce a table of
the factor column of our example data frame:
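# Using the sketch data frame from above
table(leaves$state)
#  flowering vegetative
#          3          3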

We can thus quickly confirm that there are three flowering and three vegetative plants in
the dataset. Of course, this only becomes really useful in larger datasets. If you enter
multiple factors as arguments, counts on all possible combinations of levels are reported.

3.4 Exploring data graphically


In most situations you will benefit from first looking at your data graphically. This is in
order to

detect any large outliers (for example data entry mistakes)


assess what the approximate distribution of the data is
see the preliminary patterns in the data.

Histograms
The hist() command produces a histogram displaying data values on the x-axis against
their frequencies on the y-axis, allowing you to judge the distribution of the data (see
Chapter 1, Distributions). The hist() command is applied to individual variables
(columns) of the data. The code below yields Figure 3-1.
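# Histogram of sepal length in the built-in iris dataset (Figure 3-1)
hist(iris$Sepal.Length)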

The plot() command
A quick graphical check of the data is provided by the versatile command plot(). In the
previous chapter we saw that plot() can be used to plot two vectors against each other
(see chapter 2, Basic calculations and data display in R). The command can also
be used to depict all pair-wise combinations of variables in a data frame against each
other. Below is an example for the internal data frame iris, called as plot(iris) (Figure 3-2).

Figure 3-1 Histogram of sepal length in iris

Figure 3-2 Pairwise plot for iris dataset

Boxplots
Boxplots (Figure 3-3) that we have introduced earlier (see Chapter 2, Basic calculations
and data display in R) can be used to get an idea of whether there are large differences
between groups, whether the data is distributed symmetrically within groups and whether
there are outliers. In the default settings, the boxplot() command that you have used
already shows medians as thick black lines and quartiles as a box around the median.
The t-bars ("whiskers") show the range of the data that is within 1.5 times the inter-quartile
distance from the median. Data points outside that range are regarded as outliers and are
displayed as circles. When working with data frames we can also give the name of
the data frame as an argument (data = name.of.data.frame) and use only the variable names
to refer to the data and the grouping variable (see the code below).
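# Boxplot of sepal length by species in iris (Figure 3-3)
boxplot(Sepal.Length ~ Species, data = iris)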

Figure 3-3 Boxplot of sepal length of iris.

3.5 Calculating descriptive statistics


Another quick way to explore data is to use the command summary(). This command
gives you a number of descriptive statistics for each continuous variable (range, quartiles,
mean, median) (see Chapter 1, Getting started with statistics). For factor variables the
command tabulates the number of observations for each factor level. In our example of
the data frame:
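# Using the sketch data frame from above
summary(leaves)
# prints minimum, quartiles, median and mean for the numeric columns
# and the level counts for the factor column state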

You can also calculate these and further descriptive statistics directly using commands
(Table 3-1) that are all applied to numeric vectors or data frame columns (containing
continuous numbers, here referred to as my.vector — replace this by the name of
your vector). If you want to apply these commands to data with missing values, please
see Chapter 5, Dealing with missing values.

Table 3-1 Commands to calculate descriptive statistics

Statistic                  Command
mean                       mean(my.vector)
median                     median(my.vector)
minimum and maximum        range(my.vector)
quartiles/quantiles        quantile(my.vector)
standard deviation         sd(my.vector)
variance                   var(my.vector)
Descriptive statistics for groups of data
Often, you will wish to apply these commands over groups of the data that are defined in
a grouping factor. You can do this using the command aggregate().

aggregate() is the most complicated command we have introduced thus far. The main
arguments of aggregate() are x, the variable that is to be summarized, by, one or more
grouping variable(s), and FUN, the command or function you want to apply to the data.
The by variable has to be a list object and can be converted from a factor or vector
into a list within the command using list(). The
command below calculates the mean of sepal length in the three species of Iris using
the iris dataset.
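aggregate(iris$Sepal.Length, by = list(iris$Species), FUN = mean)
#      Group.1     x
# 1     setosa 5.006
# 2 versicolor 5.936
# 3  virginica 6.588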

You can state all kinds of other commands or functions (including functions that you
write yourself, see chapter 8) in the FUN argument.

3.6 Understanding help functions


Now that we have seen quite a number of different commands you may wish to find out
more. R provides standardized documentation on commands and word searches.
However, unfortunately, we need to warn you that you may not fully understand these
help texts until you are a more advanced user of R. There are also many web forums
dedicated to discussing R; these are also most useful for experienced R users.

Documentation on commands
Typing a question mark followed by a command, for example

?mean

will open the help file on the command in the lower right partition in RStudio. You can
also search directly from the help tab there. The information is always displayed in the
same general way. At the top of the page, the R package that the command originates from
is given in braces. Here, mean {base} shows that the command originates
from the R package base. This basic package is pre-installed when you download R. A
large number of more advanced or specialized packages can be downloaded, installed
and updated through the package tab in the lower right partition of RStudio. Further help

sections contain a description, the usage, the arguments and the value or object returned
by the command. The help file for mean() indicates that its arguments are x, trim
and na.rm. It further explains that the command returns one value as a
result.

The help pages end with references, similar commands ("see also" section, here
weighted.mean among others) and, importantly, examples. Example code can be directly copy-pasted
into the script and only internal data or data generated within the example code is used.
Running example code is a very good way to examine how to work with a command.

Searching for words or terms

When looking for a command that you do not know the name of, for example one related
to a statistical term, you can use two question marks followed by the term you are
looking for. Information on analyses involving medians, for example, is found by typing

??median

This command will open a table of commands related to the median in some way. The
table lists the commands and the packages they originate from, as well as a short
description, including the median() command itself, listed
as stats::median. Clicking on these entries will open the R documentation for these
commands. If you want to use commands from other packages you may need to
install the respective package first (package tab in lower right partition of RStudio).

3.7 Exercise 3: Exploring data

3-A Data structure

(a)

(b)

(c) for each species

3-B Loading and graphical exploration of data


(attached to this pdf)

(a)
(b)

(c) s s for for ,


seperately
(d)

4 Basic statistics with nice datasets

Goals
In this section you will learn when and how to conduct and how to interpret

t-tests (non-parametric alternatives are also mentioned briefly)

assessments of normality (important also for later chapters!)
correlation analysis
cross-tabulation and the χ²-test (pronounced "chi-square" test)

4.1 t-tests (and non-parametric alternatives)


t-tests are a common method to compare groups of measurements (samples) and to
assess whether a single group of measurements is consistent with a hypothesized mean.
We first describe the comparison of two independent samples in detail as an example.
We can use the following hypotheses:

H0: both samples come from the same population (with the same mean)

HA: the two samples come from different populations (with different means).

From Chapter 1 (How statistical tests work) you may remember that the basic reasoning
of many statistical tests is relating a test statistic calculated from the data to the
theoretically derived distribution of that test statistic under H0. When testing the above
H0 with equal sample sizes n in the two samples, a t-test has the test statistic

$t_s = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\tfrac{1}{n}\left(s_1^2 + s_2^2\right)}}$

The s in $t_s$ indicates that $t_s$ is calculated from a sample. $t_s$ relates the difference between
sample means ($\bar{x}_1 - \bar{x}_2$, numerator) to the standard error of this difference in the
denominator. $t_s$ will be very large or very small (depending on which of the means is
larger) when the difference in means is large in comparison to its standard error. It will
be small or near zero when the two sample means differ very little or when the difference
cannot be determined with any confidence and its standard error is large. The test
statistic $t_s$ has a theoretical distribution called the t-distribution. This distribution looks
roughly similar to the normal distribution and it depends on the so-called degrees of
freedom, abbreviated as df (Figure 4-1). In this case with equal n, degrees of freedom are
calculated as df = 2(n - 1). The null hypothesis is tested by assessing the probability of $t_s$
given df.

Figure 4-1 Probability density of different t-distributions.
Note that for very high df the t-distribution approaches the
normal distribution.

Importantly, t-tests are only valid when the assumptions of this analysis are met. These
are normality of the data in each sample and homogeneity of variances. How to assess
normality is explained later in this chapter (Assessing normality using quantile-quantile-
plots (qq-plots)). If the data are not normally distributed a transformation can be applied
or an alternative test that does not require normal distribution can be conducted; such
tests are referred to as non-parametric (Non-parametric alternatives). The homogeneity
of variances can be roughly seen from calculating the variances (Calculating descriptive
statistics). However, R automatically conducts a "safer" form of the t-test, Welch's t-test,
that applies a correction for differences in the variances.

The general work-flow for conducting group comparisons involves plotting the data,
testing for normality and conducting the test if appropriate (Figure 4-2).

Figure 4-2 Work flow for group comparisons.

Different forms of the t-test can be used for two samples with unequal sample sizes, to
compare a single sample to a hypothesized mean and for comparing paired samples

where pairs of measurements are related in some way (see Two sample, One-sample and
paired t-tests).

Worked example: t-test with normally distributed data

We compare the duration of nectaring (feeding from flowers, measured in seconds) in
male and female butterflies and use the command t.test(). In this first example, the
data are normally distributed and variances are equal.
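The original data table is not reproduced here; for illustration, invented values with the same group means as reported below (2.2 s for females, 4.4 s for males) could be entered like this:

duration <- c(1.8, 2.0, 2.2, 2.4, 2.6,   # females
              3.8, 4.1, 4.4, 4.7, 5.0)   # males
sex <- factor(rep(c("female", "male"), each = 5))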

The main argument to t.test() is a formula statement relating the vector with the data
(duration) to the grouping vector (sex) with a tilde symbol (~). Note that this is the
same kind of formula that is used in the boxplot() command. In t.test(), equal
variances must be specified using the argument var.equal = TRUE, otherwise a more
robust version, Welch's t-test, that does not assume equal variances is automatically used.
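# Two-sample t-test assuming equal variances (using the sketch data above)
t.test(duration ~ sex, var.equal = TRUE)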

The value of the test statistic (rounded) lies in the far left tail of the
t-distribution with df = 2(5 - 1) = 8 (Figure 4-1). Accordingly, the P-value is < 0.05 and
significant. The output also specifies the alternative hypothesis
and gives the 95% confidence interval for the difference between means together with
the means for the two groups, 2.2 for females and 4.4 for males. The difference, 2.2 - 4.4
= -2.2, thus has a confidence interval from -3.4 to -0.934, again indicating that we can be
rather sure about this result. Thus, we reject the null hypothesis and conclude that male
butterflies feed on nectar significantly longer than female butterflies.

In this example, we have used a two-tailed hypothesis (compare One- and two-tailed
tests), the default option for t.test(). One-tailed hypotheses can be tested using the
additional argument alternative = "less" or alternative = "greater",
depending on which tail is to be tested.

Assessing normality using quantile-quantile plots (Q-Q plots)
Normality of data is best assessed graphically. Here, we make use of cumulative
empirical distributions. Cumulative distributions are displayed as ordered values
(x-axis) against their rank divided by the number of values, n (Figure 4-3, left).
For example, a dataset with ten values will have points at 1/10, 2/10, 3/10, …, 10/10
(Figure 4-3). This way, the values on the y-axis represent the proportion of data smaller
than the corresponding values in the sample on the x-axis. These values are actually
quantiles (sample quantiles); for example, the x-axis value corresponding to 0.5 (50%) is
the 50% quantile, the median (see also Descriptive statistics). In order to assess
normality of the sample, we compare such an empirical cumulative distribution to what
would be expected in a standard normal distribution. The comparison is made by
plotting the sample quantiles against the quantiles of the standard normal distribution
that are obtained at the same proportions. This is called a Q-Q plot (quantile-quantile
plot). If the sample data is normally distributed, the points should fall on a line. The sample
mean should plot near zero because the standard normal distribution has a mean of zero.
Moreover, sample quantiles corresponding to standard normal quantiles of -2 or 2, that is,
values more than two standard deviations away from the mean, should be very rare.

Figure 4-3 Empirical cumulative distribution and normal Q-Q plot for an example
with 10 data points randomly drawn from a normal distribution.

The command qqnorm() produces the Q-Q
plot. qqline(), added afterwards, produces a line from the first to the third quartile of the
sample data. You expect the points to be more or less close to this line if the data is
normally distributed (Figure 4-4 and 4-5).
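A minimal sketch (the vector name is a placeholder):

my.sample <- rnorm(50)   # 50 values drawn from a standard normal distribution
qqnorm(my.sample)        # sample quantiles against standard normal quantiles
qqline(my.sample)        # line through the first and third quartiles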

In Figure 4-4, examples of histograms and Q-Q plots for 250 data points that are
normally distributed, left-skewed and right-skewed are displayed. Data
that look like the left-skewed or right-skewed examples are not suitable for analysis with
a t-test and should be transformed before analysis (Transformations). If this does not
work, a non-parametric test should be used (Non-parametric alternatives).

Figure 4-4 Histogram and Q-Q plot for normally-distributed,
right- skewed and left-skewed data.

However, for smaller datasets, considerable deviations from the Q-Q line are expected
even in normally distributed data, especially at the extremes. In Figure 4-5 you see three
examples of histograms and corresponding Q-Q plots for five and ten values sampled
from a normal distribution. If your data looks like this you can use t-tests!

Figure 4-5 Example Q-Q plots for small datasets of normally distributed data.
All samples were drawn from a standard normal distribution.

Transformations
To obtain normally distributed data for further analysis, the following transformations
(Table 4-1) are recommended in most cases.

Table 4-1 Transformations

Data                 Transformation   R code
right-skewed data    logarithmic      log(x)
left-skewed data     exponential      exp(x)

Most of these transformations, with the exception of the exponential transformation for
left-skewed data, are not defined mathematically for values smaller than zero. You may
need to add a constant value to all values in order to perform the transformation. A
constant value can be added element-wise to vectors, for example x + 0.5 (see also
Basic calculations and data display in R). You may need to try out several
different transformations. You can apply the transformation directly within other
commands, as sketched below. If you are still not satisfied with the distribution, please
use a non-parametric test (see Non-parametric alternatives).
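A sketch with a hypothetical vector x (containing zeros) and a hypothetical grouping factor group:

x.t <- log(x + 0.5)       # constant added element-wise before the log
qqnorm(log(x + 0.5))      # transformation applied directly within a command
hist(log(x + 0.5))
t.test(log(x + 0.5) ~ group)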

At first sight, transformations may appear to "distort the data". However, one can also
think of transformations as measuring on other scales than the linear one; for example,
measurements could also be taken with a "logarithmic ruler". Nonetheless, transformations
can be problematic in specialized investigations of the relationship between variables.

Back-transformation
Descriptive statistics, such as the mean and the standard error of the mean, are not
meaningful when calculated from transformed data. For reporting, means are therefore
back-transformed to the original scale using the inverse of the transformation, for
example exp() after a log-transformation. The standard error of the mean cannot be
directly back-transformed. Instead, the interval of the mean plus and minus one standard
error is back-transformed, which gives an interval that is asymmetric around the
back-transformed mean.

Worked example: t-test after data transformation

We have obtained data on the height of poplar trees at two different distances from a
river, 5-20 m and 20-40 m. You find this data in a file attached to this pdf. The data has
two columns, one factor giving the distance group and one with the height measurements.

The first thing we do is to look at the data using a boxplot (Figure 4-6).

Figure 4-6 Boxplot of height in poplars
growing at different distances from a river.

There seem to be many outliers of larger values in both samples. Also, one sample
varies more than the other. Let's look at this further using a Q-Q plot for each
sample. In the sketch below, the two samples are selected using a logical
statement within square brackets; more on how to select data will be explained in the
next chapter.
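A sketch of this step; the file name and the column names are assumptions, following the conventions used elsewhere in this script:

poplar <- read.table("Poplar_Ch.4.csv", sep = ",", dec = ".", header = T)
near <- poplar$height[poplar$distance == "5-20m"]    # logical selection in [ ]
far  <- poplar$height[poplar$distance == "20-40m"]
par(mfrow = c(1, 2))   # two plots side by side
qqnorm(near); qqline(near)
qqnorm(far);  qqline(far)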

Figure 4-7 Q-Q plot of height in poplar trees growing at two distances from a
river.

As suspected, these data do not conform to the normal distribution and appear to have a
right-skewed distribution (compare Figure 4-4). We try out a log-transformation and plot
Q-Q plots of the log-transformed data.
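Continuing the sketch with the log-transformed samples:

qqnorm(log(near)); qqline(log(near))
qqnorm(log(far));  qqline(log(far))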

Figure 4-8 Q-Q plots of poplar height after log-transformation.

This indeed looks much better! We look at a boxplot of the transformed data as well
(Figure 4-9).

This looks fine. We use tapply() to compare variances between the samples
(see Calculating descriptive statistics for groups of data), as sketched below.
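A sketch, reusing the assumed names from above:

boxplot(log(height) ~ distance, data = poplar)    # Figure 4-9
tapply(log(poplar$height), poplar$distance, var)  # variance per group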

Figure 4-9 Boxplot of log-transformed
height in poplar trees growing at different
distances to a river.

Variances appear different. Let's conduct the test without assuming that variances are
identical.
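A sketch of the call; t.test() uses Welch's version, which does not assume equal variances, by default:

t.test(log(height) ~ distance, data = poplar)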

These results indicate that poplar trees that grow farther from the river are not
significantly different in height from those that grow close to the river. This can also
be seen from the confidence interval of the difference between means that includes zero.

Means and standard errors reported with this analysis need to be back-transformed.
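A sketch of the back-transformation, with the assumed names from above:

m  <- tapply(log(poplar$height), poplar$distance, mean)
se <- tapply(log(poplar$height), poplar$distance,
             function(x) sd(x) / sqrt(length(x)))
exp(m)        # back-transformed means
exp(m - se)   # lower ends of the standard error intervals
exp(m + se)   # upper ends (asymmetric around the back-transformed means)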

This corresponds well with the center of the data in the boxplot of the original data
(Figure 4-6).

Back-transformed means can be reported together with the standard error interval,
usually rounded to two digits after the decimal point, 5.50 [4.91, 6.16] m for trees far
from the river and 4.27 [3.89, 4.68] m for trees closer to the river. These two means are
not statistically different as we have seen above. The P-Value of 0.085 is rather
small though, so this might also be a case where the decision is not entirely clear.

Two sample, one-sample and paired t-tests


Apart from the two-sample t-test explained above, there are two more ways to use the t-
test with the command t.test(): the one-sample t-test and the paired-sample t-test. All
three types of t-tests are sketched with example R code below.

Two independent samples test

One-sample test

Paired-sample test (the paired measurements are stored in two different columns)
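Sketches of the three calls; all object and column names are hypothetical:

t.test(height ~ group, var.equal = TRUE)   # two independent samples
t.test(height, mu = 10)                    # one sample against a hypothesized mean of 10
t.test(before, after, paired = TRUE)       # paired measurements from two columns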

Non-parametric alternatives
In certain cases, it can be impossible to meet the assumption of normality required for
standard t-tests, even after transformation. In these cases, a non-parametric alternative
such as the Wilcoxon family of tests can be used. These include two-sample (also named
Mann-Whitney U test), one-sample and paired-sample alternatives, all available through
the command wilcox.test(). The syntax of wilcox.test() is very similar to that
of t.test(), as sketched below.
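The corresponding sketches for wilcox.test(), with hypothetical names as above:

wilcox.test(height ~ group)                 # two samples (Mann-Whitney U test)
wilcox.test(height, mu = 10)                # one sample
wilcox.test(before, after, paired = TRUE)   # paired samples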

4.2 Correlation analysis


You can use correlation tests to assess whether two variables are associated. For a
correlation, no assumptions about the functional relationship between the variables are
made. This differs from regression analysis (see Ch. 6, Linear models), which is used for
cases where one or several predictor variables directly affect (cause) the outcome of a
response variable.

Pearson Correlation
One of the most common correlation analyses is the Pearson product-moment
correlation with the correlation coefficient Pearson's r. Pearson's r ranges from -1
(perfect negative correlation, one variable increases as the other decreases) to 1 (perfect
positive correlation, one variable increases as the other increases). A value of zero
indicates no correlation. You can obtain Pearson's r and test whether it differs
significantly from zero using the command cor.test(). If Pearson's r differs
significantly from zero, we can infer that the two variables are significantly associated.
This test assumes the data conforms to the normal distribution (see Assessing normality
using quantile-quantile plots (Q-Q plots)).

In an example using the internal dataset iris on floral measurements, we test whether

sepal length and petal length are significantly associated.
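The call, assuming the internal iris dataset mentioned above:

cor.test(iris$Sepal.Length, iris$Petal.Length)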

This analysis shows that sepal length and petal length are strongly and positively
associated (Pearson's r = 0.87) and this association is highly significant, with a P-Value
smaller than 2.2 × 10^-16 (displayed as < 2.2e-16).

Spearman Correlation
If the data does not conform to a normal distribution, a non-parametric alternative to
Pearson's r, for example Spearman's rank correlation coefficient (Spearman's rho), can be
used. Like Pearson's r, Spearman's rho determines the level of association between two
variables and ranges from -1 to 1. Spearman's rho is calculated using the rank order of
the data rather than the raw values.

Spearman's rho can also be obtained with the cor.test() command, using the additional
argument method = "spearman". Below we repeat the correlation analysis above using
Spearman's rho.
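cor.test(iris$Sepal.Length, iris$Petal.Length, method = "spearman")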

This test produces a very similar, but not identical, result compared with the test
based on Pearson's r above.

4.3 Cross-tabulation and the χ²-test

Apart from the continuous data that we have treated thus far, biological experiments can
produce categorical data or counts. For example, in an experiment on the inheritance
of eye color in flies, eye color (red or white) was recorded in two groups of flies (A and
B). Here we want to test whether eye colors differ between groups. The data is available
in the file flies_Ch.4.csv (attached to this pdf).

The data is organized in 100 rows, one for each fly, and two columns, one giving the
group (A or B) and one giving the eye color (red or white); both are factors. We use the
command table() to summarize how many flies of each eye color were found in each
group. This kind of table is referred to as a 2 x 2 contingency table.

table(flies_Ch.4$group, flies_Ch.4$color)   # column names assumed

In group A, 34 flies have red eyes and 16 flies have white eyes. In group B, 41 flies have
red eyes and 9 flies have white eyes. Alternatively, you could enter such data directly (it's
only four numbers here!), as sketched below.
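A sketch of the direct entry; the dimension names are added here for readability:

eye.table <- matrix(c(34, 16, 41, 9), nrow = 2, byrow = TRUE,
                    dimnames = list(group = c("A", "B"),
                                    color = c("red", "white")))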

We now want to test whether the eye color differs between groups or, in statistical terms,
we test the null hypothesis that eye color is independent of the group. For this we use
the χ²-test (read "chi-square") that compares the test statistic X² to the χ²
distribution. The main argument of the command chisq.test() is the contingency
table that we created above.
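Either version of the table can be passed to the test:

chisq.test(table(flies_Ch.4$group, flies_Ch.4$color))   # column names assumed
chisq.test(eye.table)                                   # directly entered table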

This test shows that eye color and group are not associated (the P-Value is > 0.05, not
significant) and we conclude that eye color does not differ between groups. The χ²-test
can also be used for larger contingency tables with more than two categories.

Summary

t-tests compare the means of two groups or a single mean to a hypothesized mean.
Q-Q plots are used to assess normality.
Non-parametric tests should be used on data that is non-normal even after
transformation.
Correlation analysis tests whether two continuous variables are associated.
The χ²-test is used to test whether categorical variables are associated.

4.4 Exercise 4: Basic statistics with nice datasets

4-A Identify the test type

(a)

(b)

(c)

(d)

(e)

4-B Snow-melt times


4-C Willow shrub responses to herbivory

Use in attached

(a)
(b)

4-D Swiss population survey

ould (in principle...)

(a)
(b)
(c)

4-E Crab behavior

5 Handling larger and nastier datasets

5.1 Handling data

Goal
In this section you will learn how to

change and subset data


handle missing data

Accessing and changing data


Sometimes you need to check or change individual elements of your data. In R, the
elements of vectors and data frames are always internally numbered and we use this
numbering to access and change the data. Here we use our Silene leaves data (p. 29).

For example, the vector of width measurements has six elements. We can access its
third element using the vector name followed by square brackets and the element
number. Should we realize that this element needs to be changed from 2.8 to 3.0, we
can do that using the assignment arrow; calling the vector again shows that the change
has actually happened. These steps are sketched below.
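A sketch with hypothetical values (the vector name width and the values are assumptions; the third element is the 2.8 mentioned above):

width <- c(2.2, 4.5, 2.8, 3.1, 4.1, 3.5)
width[3]          # brings up the third element
width[3] <- 3.0   # changes it using the assignment arrow
width             # shows that the change has actually happened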

We can access the elements of data frames in the same way, except that data frames have
two dimensions, rows and columns. These are accessed by two numbers, separated by a
comma within square brackets ([row, column]). The first number always refers to rows,
the second to columns. To access the element in the third row and the second column of
our data frame from before (Checking data structure), we use the row and the column
number within square brackets.

This is the same measurement as in the vector example above. To change it to 3.0 in the
data frame, we use the same kind of assignment operation and can then check whether
the change has happened; all three steps are sketched below.
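The same steps for the data frame (the column position is an assumption):

Silene.leaves[3, 2]          # element in row 3, column 2
Silene.leaves[3, 2] <- 3.0   # change it to 3.0
Silene.leaves[3, 2]          # check whether this has happened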

Entire rows and columns of data frames can be accessed by leaving the column (or row)
number empty in the square brackets. Note that the comma must always be entered
because data frames have two dimensions. Accessing rows and columns is needed to
conduct analyses and to make changes or calculations. In the sketch below, the first line
brings up the entire second column and the second line brings up the entire third row,
for example to check that plant's measurements; the 3 is the row number.
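Silene.leaves[, 2]   # the entire second column
Silene.leaves[3, ]   # the entire third row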

Alternatively, column names can be used in place of the numbers. R has a special
notation for columns involving the dollar sign, as we have seen earlier. The first line
in the sketch below will also bring up the column. Column names can also be entered in
quotes directly within the square brackets (note the comma!), as in the second line.
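A sketch, assuming the column is named width:

Silene.leaves$width        # column by name, using the dollar sign
Silene.leaves[, "width"]   # column name in quotes within the brackets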

Should you now realize that the width measurements all need to be increased by 0.2,
you can do that in any of the three equivalent ways sketched below.
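Three equivalent sketches (column name and position assumed):

Silene.leaves[, 2]       <- Silene.leaves[, 2] + 0.2
Silene.leaves$width      <- Silene.leaves$width + 0.2
Silene.leaves[, "width"] <- Silene.leaves[, "width"] + 0.2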

Which of these options is most convenient depends on your column names, the size of
your data file and your preferences.

Adding and deleting columns


Additional columns can be assigned at any time. For example, you may wish to create a
column for the ratio of leaf width to leaf length in our Silene.leaves data frame.

Calling the structure command str() shows that the column has been added and is numeric.

Deleting one or several columns can be done using the minus sign within the square
brackets. This only works with column numbers, not with column names. Removing rows,
for example when you realize that the measurements of an entire row are faulty, works in
the same way (observe the placement of the comma!). These operations are sketched below.
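Sketches of these operations (column names and positions are assumptions):

Silene.leaves$ratio <- Silene.leaves$width / Silene.leaves$length
str(Silene.leaves)                     # the new column has been added and is numeric
Silene.leaves <- Silene.leaves[, -4]   # removes the ratio column, assuming it is the fourth
Silene.leaves <- Silene.leaves[-1, ]   # removes the first row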

If you need to remove more than one row or column, use the c() command within the
square brackets: in the sketch below, the first line removes rows one to three and the
second line removes columns one and four.
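Silene.leaves[-c(1, 2, 3), ]   # removes rows one to three
Silene.leaves[, -c(1, 4)]      # removes columns one and four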

Subsetting data
There are many situations where only a specific subset of the data needs to be used. In R,
this is done by entering a so-called logical statement (see below) in the square brackets.
For example, if you want to select only the flowering plants in the Silene.leaves data
frame, you use a logical statement for row selection, as sketched below; this produces a
new data frame containing only the flowering plants.
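A sketch, assuming a factor column named status:

flowering.plants <- Silene.leaves[Silene.leaves$status == "flowering", ]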

Let's have a closer look at the logical statement:

In words, this statement means something like "check for each element of the column
whether it reads "flowering" or not". If you execute only the logical statement, you
obtain a vector with the same number of elements as the column, six; the first three are
TRUE (corresponding to the flowering plants) and the last three are FALSE
(corresponding to the vegetative plants).

When you use such statements for row selection, all rows corresponding to TRUE will be
selected, in this case the first three rows. TRUE and FALSE are logical values and
constitute a special data type in R, logical data. TRUE evaluates to 1 and FALSE to zero.
Thus, when you sum a logical vector, the number of TRUE elements is returned.

Note that R does not assume that you will use only columns of the same data frame here;
in fact, you can also use columns from other data frames or vectors. For this reason, you
need to write the full column reference, with the data frame name and the dollar sign, and
not only the column name.

You can use the following logical operators:

==   identical

!=   not identical

>, >=   greater than, greater than or equal to

<, <=   less than, less than or equal to

and combine conditions using

|   (pipe symbol), logical OR, at least one of the conditions fulfilled

&   logical AND, both conditions fulfilled

Here are some more examples, sketched below: selecting plants with leaf widths over 4.0;
selecting plants with either width or length under 3.5; and selecting plants with both
width and length over 4.0.
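Sketches, with the assumed column names from above:

Silene.leaves[Silene.leaves$width > 4.0, ]
Silene.leaves[Silene.leaves$width < 3.5 | Silene.leaves$length < 3.5, ]
Silene.leaves[Silene.leaves$width > 4.0 & Silene.leaves$length > 4.0, ]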

Data selection can also be done with the subset() command. The first argument
of subset() specifies the data frame to subset. The second argument is a logical
expression, as explained above, to select rows. The third argument, select, indicates the
columns to be selected using their names. If several columns are selected, the names are
combined into a vector. If you only want to omit one column, use - in front of the
column name. The line sketched below will create a new data frame containing only the
rows with the flowering plants, and all columns except one. The advantage of the subset
command is that you can refer to columns within a data frame directly, without the
dollar sign.
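A sketch, again with assumed column names:

flowering.plants <- subset(Silene.leaves, status == "flowering", select = -status)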

Summary

Individual data entries, rows and columns can be accessed and changed using their
row and column subscripts.
Data can be subset using logical statements involving logical operators such as
==, !=, | and &.
Subsetting of data frames can be done either with row and column subscripts in
square brackets or with the subset() command.

5.2 Dealing with missing Values

Goal
In this section you will learn how to

Interpret different types of missing value indicators in R


Handle missing values in common functions
Identify, count and set missing values

Most real datasets end up containing missing values. This is often because certain
experimental units, such as animals or plants, were not available for measuring because
they died or could not be observed or measured for other reasons. Moreover, technical
failures of the equipment or inevitable mistakes of the researcher commonly result in
missing values. In fact, entirely complete datasets are very rare in biological experiments,
especially in those conducted in natural populations or in the field. For this reason, it is
important to be able to deal with missing values and to assess, based on the research
questions, how missing data may affect the results.

Types of missing values in R


NA (not available) codes missing data in R. When preparing data, it is good
practice to enter NA into "empty cells" in your Excel table. NA also appears as a
result when a command cannot be executed, for example because the data
contains NA and the command is not prepared to handle NA.
NaN (not a number) appears when a calculation does not yield a mathematically
defined answer. R often gives a warning when NaN are generated, as in
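sqrt(-1)   # returns NaN, with the warning "NaNs produced"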

Handling missing values in common commands


R is very cautious. Most of the basic commands return NA as soon as a single NA is
present in the data. However, many commands have an optional argument to tell R to
ignore NA, but this differs between commands. For example,

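x <- c(2.1, NA, 3.5, 4.2)   # a hypothetical vector containing one NA
mean(x)                     # returns NA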
In this case, the output is NA, because the command cannot be executed due to the NA in
the data. Setting the optional argument na.rm (for NA remove) to TRUE tells R to
remove NA before calculations. TRUE can be abbreviated to T. For example,
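mean(x, na.rm = TRUE)   # returns 3.266667; the NA is removed first
mean(x, na.rm = T)      # the same, with the abbreviation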

This also works for commands such as sd(), var(), median(), min(), max(), sum()
and many others. An exception is the command length(). Here,
the number of elements is counted, regardless of the presence of NA, for example:
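length(x)   # returns 4; the NA is counted as an element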

The commands cor() for correlation and cov() for covariance ignore NA with the
argument use = "complete.obs", as sketched below, where x and y are two vectors
of the same length.
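cor(x, y, use = "complete.obs")   # x, y: hypothetical vectors of equal length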

Other commands, such as lm() for calculating linear models (see Chapter 6),

ignore NA in the default setting. Consult the help files to find out how NA is dealt with
for specific commands (see section 3.6 on help functions).

Finding and counting missing values

To find out whether your data contains NA, use is.na() on the whole data or, more
specifically, on single columns. This command can be applied to any data structure or
part thereof. The is.na() command returns a logical statement for each element of the
data, with TRUE for both NA and NaN and FALSE for other entries. To find only NaN,
use is.nan(). For example,
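y <- c(1.2, NA, 0.8, NaN)   # a hypothetical vector
is.na(y)    # FALSE TRUE FALSE TRUE (TRUE for both the NA and the NaN)
is.nan(y)   # FALSE FALSE FALSE TRUE (TRUE only for the NaN)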

The number of missing values can be obtained by summing logical vectors. To access
only the rows that have no NA in any of the columns, use na.omit(). Note that the
summary() command will also provide the number of NA in each column. To find out
where the NA are in the data, use the command which(); in the sketch below, which()
indicates that elements 2 and 4 are NA.
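sum(is.na(y))     # 2 missing values in the vector from above
which(is.na(y))   # 2 4
na.omit(df)       # rows of a hypothetical data frame df without any NA
summary(df)       # also reports the number of NA per column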

Setting missing values


To set certain data points to NA, for example when you realize that there is a problem
with them, access the elements of the data frame using row and column numbers and
assign NA to those. For example, y[1] <- NA will set the first element of the vector y
to NA, and df[2, 3] <- NA will set row 2, column 3 of a hypothetical data frame df
to NA. Note that these changes are made to the data frame
object stored in R's current workspace, NOT to your original data file.

Summary

Missing data types in R are NA (not available, to be used in data tables) and NaN
(not a number).
Many commands have optional arguments to deal with missing values; for
example, na.rm = TRUE will tell R to ignore missing values in mean()
and other basic commands.
The command is.na() is used to identify NA and NaN.

sum(is.na(x)) will return the number of missing values in the data
and which(is.na(x)) will return the subscript numbers of the elements that
are NA or NaN.
Data entries can be set to NA with the assignment arrow.

5.3 Exercise 5: Handling larger & nastier datasets

5-A iris breeding

Iris

(a)

(b) Iris
Hint
.

5-B Outliers and missing data


attached measured
s
(a)
(b)
(c) in
the
(d)

6 Linear models

Goals
In this section you will learn how to

design and interpret linear models


assess whether models meet assumptions using analysis of residuals

6.1 Linear models: basic reasoning


Linear models are a large family of statistical analyses that relate a continuous response to
one or several explanatory variables. The response variable contains the measurements or
observations that you are interested in understanding. You hypothesize that one or
several explanatory variables influence this response variable. For example, when testing
for differences in mean plant height among three species of plants, plant height is the
response variable and the grouping factor, identifying the three species, is the explanatory
variable. Linear models can accommodate explanatory variables in the form of such
grouping factors (analysis of variance, ANOVA), as continuous variables (linear
regression) or a combination of both.

Linear models generally relate the variation explained by the fitted model ("explained
variation") to the variation remaining around the model ("unexplained variation" or
residuals). The ratio of these two quantities, the F-ratio, is the test statistic and is
compared to its theoretically derived distribution, the F-distribution, in order to
determine model significance. For example, a model for comparing three groups fits
three means to the data. A regression analysis fits a slope and an intercept to the data.
For group comparisons (ANOVA), the explained variation is expressed as a sort of
average squared distance from the fitted means, referred to as the mean square or MS.
For n data points in each of k groups, the MS for the explained variation is obtained by
summing up the squared distances from the group means to the overall mean (counted
once for each of the n data points in a group, Figure 6-1) and dividing this sum by k - 1.
The MS of the (unexplained) residual variation is expressed as the sum of squared
distances from the group means to the data points (Figure 6-1) divided by k(n - 1). The
ratio between these two MS is the F-ratio that is tested against the F-distribution with
k - 1 and k(n - 1) degrees of freedom, corresponding to the denominators of the two MS
calculations. From this, the P-Value is obtained. For regression analyses, the MS are
calculated in a similar way from the fitted regression line (Figure 6-2), with model df of 1
and residual df of n - 2. The information on df, sums of squares, MS, F-ratio and P-Value
is commonly presented as an ANOVA table (Table 6-1). The second important output of

linear models is information on the fitted values, such as means, slopes and intercepts
with their standard errors. More complicated models follow a similar reasoning.

Note that the F-distribution depends on two df, not on one df like the t-distribution.
F-ratios over four tend to be significant and indicate that the variation explained by
the model is more than four times larger than the unexplained variation.
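A minimal sketch with made-up numbers (k = 3 groups, n = 4 each), showing how the MS and the F-ratio arise:

y   <- c(5, 6, 7, 6,  8, 9, 10, 9,  12, 13, 12, 11)   # hypothetical data
grp <- factor(rep(c("A", "B", "C"), each = 4))
grand  <- mean(y)                                     # overall mean
gmeans <- tapply(y, grp, mean)                        # group means
MS.model <- sum(4 * (gmeans - grand)^2) / (3 - 1)     # explained MS, df = k - 1
MS.resid <- sum((y - gmeans[grp])^2) / (3 * (4 - 1))  # residual MS, df = k(n - 1)
F.ratio  <- MS.model / MS.resid
1 - pf(F.ratio, 2, 9)                                 # P-Value from the F-distribution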

The distances from the fitted models (Figure 6-1, Figure 6-2) are referred to as the
residuals. These are a very important quantity for testing model assumptions (see next
section, Assumptions and diagnostic plots). The estimates of means, slopes and
intercepts are actually obtained by mathematically minimizing the sum of the squared
residuals. For this reason, linear models are also referred to as a "least squares" method.

Figure 6-1 Graphical representation of explained variation and unexplained variation


(residuals) in group comparisons (ANOVA, see text).

Figure 6-2 Graphical representation of explained and unexplained variation (residuals)


in regression analysis (see text).

Table 6-1 Example of an ANOVA table for a comparison of three groups (k = 3) with four
units each (n = 4).

Source of variation    df             Sum of squares   Mean square   F-ratio   P-Value
Model (groups)         k - 1 = 2      …                …             …         …
Residuals              k(n - 1) = 9   …                …

Assumptions and diagnostic plots


All linear models have the following assumptions:

1. The experimental units are independent and sampled at random.


2. The residuals have constant variance across values of explanatory
variables.
3. The residuals are normally distributed with a mean of zero.

Analysis of residuals is thus a key step when conducting linear model analyses. Residuals
are analyzed graphically after fitting the model. Whether or not experimental units are
independent (assumption 1) depends on the experimental design and should be known
by the experimenter. Constancy of variances (assumption 2) is checked using a plot of
the residuals against the fitted values. This plot is also referred to as the Tukey-
Anscombe plot (Figure 6-3, left).

Figure 6-3 Example of a Tukey-Anscombe plot (Residuals vs Fitted) and a Q-Q plot
of the residuals for a linear model. These plots indicate that the model satisfies the
assumptions and can be used (see text).

If the variances are constant along the fitted values of the regression line or among
groups, we expect a random scatter around zero in the Tukey-Anscombe plot (Figure
6-3). If the variation of residuals strongly increases or decreases with the fitted values,
the Tukey-Anscombe plot shows funnel-like patterns (Figure 6-4). If you obtain such a
diagnostic plot, or any other strong pattern in the Tukey-Anscombe plot, the data often
needs to be transformed before a new model is calculated (Transformations).
Normality of the residuals (assumption 3) is tested using a Q-Q plot of the residuals that
we have already used (see Assessing normality using quantile-quantile-plots (Q-Q plots)).
Transformations may also be appropriate when the residuals are not normally
distributed; in particular, the log-transformation is commonly applied in group
comparisons of measurement data. For regressions, the use of transformations changes
the relationship between the variables and this may or may not be appropriate depending
on the study questions.

Figure 6-4 Tukey-Anscombe plot (Residuals vs Fitted) and Q-Q plot of the residuals of
an unsuitable model. Note that the Tukey-Anscombe plot has a funnel shape, with
increasing residuals at larger fitted values. Here, log-transformation of the response
can be used to improve the model.

Workflow for linear models


First, start by exploring the data through a basic plot (see Exploring data graphically).
Define the model based on your hypotheses and analyze the residuals. If the residuals are
normally distributed, and the Tukey-Anscombe plot looks fine, obtain and interpret the
results. Otherwise, try transforming the data (either the response, the explanatory
variable or both) if appropriate and calculate the model again. Only models that satisfy
the assumptions should be used as results and be interpreted (Figure 6-5).

Figure 6-5 Workflow for linear models.

6.2 Linear models in R


In R, linear models are analyzed using the command lm(), where a formula statement
and the type of variables contained in it define the type of linear model. Formula
statements have the same form as for boxplot() and t.test(), and always relate the
response variable on the left to the explanatory variables on the right using a tilde symbol
(~): response ~ explanatory.variables. Explanatory variables can be
combined in different ways to create different types of models. Below we give an
overview of the analysis types, followed by formula sketches.

One-way analysis of variance (ANOVA)

A one-way ANOVA compares the means of several groups; the explanatory variable in
the formula for lm() must be a factor. The one-way ANOVA does not, however, tell you
which groups are significantly different from each other. This can be done with
a Tukey test using HSD.test() from the package agricolae.

Note that if there are only two groups, a t-test can be used instead (same result).

Two-way ANOVA

A two-way ANOVA tests the effects of two grouping factors; their statistical
interaction can also be included (see the worked example below).

Linear regression

Analysis of covariance (ANCOVA)
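Formula sketches for these model types (all variable names are hypothetical; grouping variables must be factors):

lm(y ~ group)             # one-way ANOVA
lm(y ~ groupA * groupB)   # two-way ANOVA with interaction
lm(y ~ groupA + groupB)   # two-way ANOVA without interaction
lm(y ~ x)                 # linear regression (x: continuous)
lm(y ~ group + x)         # analysis of covariance (ANCOVA)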

6.3 One-way ANOVA: worked example
To test whether fruit production differs between populations of Lythrum, fruits were
counted on 10 individuals in each of 3 populations.

We obtain a first exploratory plot (Figure 6-6), as sketched below.
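A sketch, assuming a data frame Lythrum with columns fruits and pop:

boxplot(fruits ~ pop, data = Lythrum)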

Figure 6-6 Boxplot of fruits per population

The boxplot shows reasonably symmetrically distributed data and suggests some
differences between groups. As a next step, we calculate a linear model with fruit
number as the response variable and population as the explanatory variable. We specify
the data frame using the argument data and assign the model to an object.
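Continuing the sketch (the object name model.1 is an assumption):

model.1 <- lm(fruits ~ pop, data = Lythrum)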

The model is now stored in the object. We will later extract the results from this
object. But first we need to analyze the residuals. R provides both the Tukey-Anscombe
plot and the normal Q-Q plot of the residuals through the command plot(), with the
name of the model object as the first argument. The two plots are actually the
first two, and the most important, of six diagnostic plots, and we concentrate on these for
now. To bring up only the first two plots, we add the argument which = 1:2 to
plot(). We also use par(mfrow = c(1, 2)) to change the plotting parameters such
that both plots are displayed next to each other in a single plotting window. This change
of plotting parameters remains until changed again. Thus, if you want a single plot the
next time, you need to either use par(mfrow = c(1, 1)) or close the plotting window
with the command dev.off(), such that a new, default plotting window opens.
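par(mfrow = c(1, 2))
plot(model.1, which = 1:2)   # Tukey-Anscombe plot and Q-Q plot (Figure 6-7)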

Figure 6-7 Tukey Anscombe plot and Q-Q plot

Both diagnostic plots look fine and we proceed to extract the ANOVA table. This is done
with the command anova(), which also uses the model object name as its argument.
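anova(model.1)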

In this ANOVA table, we see one row for the effect of populations and another for the
residuals (compare Table 6-1). Degrees of freedom, sums of squares, mean squares, the F-
ratio (F value) and the P-Value (Pr(>F)) are given. This analysis shows that the three
populations differ significantly in fruit production. The P-Value is 6.7 × 10^-6 (displayed
in R's scientific notation), thus it is very unlikely that this result is due to chance. Indeed,
the variation explained by the model is 19 times larger than residual variation (F-ratio = 19.1).

This analysis does not, however, tell us whether fruit number differs significantly between
all three populations or whether two are more similar to each other. This can be
determined using a Tukey test. This is a so-called post-hoc ("after the fact") test that
is applied only if the ANOVA shows a significant overall effect. The Tukey test
compares all pairs of groups (here populations) with each other. This is referred to as
multiple pairwise comparisons. Manually calculating multiple t-tests would be invalid
because the pairwise comparisons are not independent. If we, for example, already know
that A is much larger than B and B is much larger than C, it also follows (without looking
at the data) that A is larger than C. The Tukey test adjusts the significance level to take
account of this issue.

One convenient way to conduct the Tukey test after a linear model is using the
command HSD.test(). For this you first need to download the package agricolae.
You can do this in the RStudio package tab in the right lower partition. Search for
agricolae and download it. After that, the package is loaded using library(agricolae).
You need to load the package each time you want to use it.

The command HSD.test() has the model object as its first argument, followed by
the argument trt to specify the grouping factor ("treatment") to use; this also matters
for models with multiple grouping factors. The further arguments group = TRUE
and console = TRUE result in a convenient output.
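library(agricolae)
HSD.test(model.1, trt = "pop", group = TRUE, console = TRUE)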

The output starts by repeating the response and the grouping factor used. It further gives
information on the pairwise comparisons together with the group means, their sd, sample
size and range. Here, the honestly significant difference specifies that any difference
between group means larger than 2.56 is significant. The output ends with repeating the
treatment means (referring to populations 1, 2 and 3 here) together with a groups column.
Means with the same letter in this column are not significantly different. We can conclude
that the means of populations 2 and 3 are not significantly different from each other,
whereas both are significantly different from population 1. The letters in the groups
column are often displayed in results graphs. In the biological sciences, it is common to
report the standard error of the mean together with the mean. We can obtain the means
and standard errors using tapply().
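A sketch, with the assumed names from above:

tapply(Lythrum$fruits, Lythrum$pop, mean)                                 # means
tapply(Lythrum$fruits, Lythrum$pop, function(x) sd(x) / sqrt(length(x))) # standard errors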

Means and standard errors can be reported either in the text (as mean ± standard error)
or in barplots (see Basic graphs with R).

6.4 Two-way ANOVA with interaction: worked example

In a study on pea cultivation methods, pea production was assessed in two treatments of
irrigation (normal irrigation and drought) and two treatments of radiation (low and
high). 10 plants in each of the four combinations were considered. We load the data
(attached to this pdf and placed in the working directory) and look at the data frame
object:

peas <- read.table("Peas_Ch.6.csv", sep = ",", dec = ".", header = T)  # object name assumed
str(peas)

We see that both irrigation and radiation are factors with 2 levels each and that we have
10 measurements in each of the four combinations of levels (drought and high radiation,
drought and low radiation, normal irrigation and high radiation, normal irrigation and
low radiation).

The next step is to create an exploratory graph. Here, it is most informative to display all
four combinations of factor levels (Figure 6-8). We can do this by creating another
column with the combined factor levels, as sketched below.
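One possible sketch (column names assumed):

peas$combi <- interaction(peas$irrigation, peas$radiation)
boxplot(seeds ~ combi, data = peas)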

Figure 6-8 Boxplot of the peas data (see text).

In the plot, the data seem reasonably symmetrical within groups, without any extreme
outliers. We can go ahead and calculate the linear model with an interaction term.
Interaction terms should always be included in initial models. If the interaction
term is not significant, it is sometimes dropped from final models; however, this usually
has little effect on the results. We also move ahead and produce the diagnostic plots of
the residuals.
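A sketch (model and column names assumed):

peas.model <- lm(seeds ~ irrigation * radiation, data = peas)
par(mfrow = c(1, 2))
plot(peas.model, which = 1:2)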

Figure 6-9 Diagnostic plots for a two-way ANOVA on the peas data.

The diagnostic plots of the residuals both look consistent with the model assumptions. We
proceed to extract the results from the model using the anova() command
(significance codes removed from the output shown here).

The ANOVA table shows that the interaction is significant. This means that the effect of
irrigation depends on that of radiation. In the exploratory boxplot (Figure 6-8) we see that,
indeed, seed number is reduced by radiation under drought but not under normal
watering. Because of this, it is not possible to interpret the overall effect of irrigation and
radiation from the model we just calculated. To find out more, this analysis should be
followed by splitting the dataset into the two levels of one of the grouping factors, either
into two irrigation levels or into the two radiation levels. After that, the effect of the
second grouping factor is tested with each level of the first one. In this example, we can
test for the effect of radiation on drought-treated plants and, separately, on normally
watered plants. Alternatively, we could test for the effect of drought on plants that received
high radiation and, separately, on plants that received low radiation. Which split of
the data is most useful depends on the questions and interests of the researcher. However,
the data should not be split both ways. The analysis can be completed by calculating
means and standard errors. Usually, both the overall ANOVA with the significant
interaction term and the separate ANOVA results on the split datasets are reported. Such
analyses are often accompanied by grouped barplots (see Basic graphs with R).

6.5 Linear regression: worked example

In the following example, we analyze whether wind speed is a good predictor of ozone
levels, using the internal dataset airquality (use ?airquality and str(airquality)
for more information). We start by plotting the data (Figure 6-10).
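plot(Ozone ~ Wind, data = airquality)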

Figure 6-10 Plot of ozone levels against wind
speed (airquality internal dataset).

We see that ozone levels decrease with wind speed. The relationship appears a bit curved.
We try a linear model with the untransformed data first and analyze the residuals (Figure 6-11).
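model.1 <- lm(Ozone ~ Wind, data = airquality)   # object name assumed
par(mfrow = c(1, 2))
plot(model.1, which = 1:2)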

Figure 6-11 Diagnostic plots for the model

In the Tukey-Anscombe plot, we see that the residuals increase with larger fitted values
(funnel shape). This indicates either that the model does not fit the data well, such that
the fitted line is farther away from the data at higher ozone values (fitted values), or that
there is more variation in the measurements corresponding to higher ozone levels. We
do not see a strong pattern at higher ozone levels in the normal Q-Q plot, and neither in
our initial data plot (Figure 6-10). Let's plot the fitted line on the data to understand this
better. We do that by producing the initial plot again and then adding the regression line
to it using abline().
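plot(Ozone ~ Wind, data = airquality)
abline(model.1)   # adds the fitted regression line (Figure 6-12)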

Figure 6-12 Plot of ozone levels against wind speed


with fitted regression line

Indeed, it appears that the model does not fit the ozone values very well. Next,
we explore a log-transformation of the response. We calculate a new model and produce
the diagnostic plots (Figure 6-13). We also plot the fitted regression line on a plot of log-
transformed ozone values against wind speed (Figure 6-14).
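model.2 <- lm(log(Ozone) ~ Wind, data = airquality)
par(mfrow = c(1, 2))
plot(model.2, which = 1:2)                   # Figure 6-13
plot(log(Ozone) ~ Wind, data = airquality)
abline(model.2)                              # Figure 6-14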

Figure 6-13 Diagnostic plot of the model

Figure 6-14 Plot of log-transformed ozone levels


against wind speed with fitted regression line

These diagnostic plots look much better and the model can be accepted. Nonetheless,
there are still mostly positive residuals on the left side of the graph; however, there are
fewer measurements in that region. We could also try log-transforming both the response
and the explanatory variable wind speed. What sort of transformation is appropriate
depends on the functional relationship between the variables and on the aims of the
study. If accurate prediction of ozone levels is the aim, the best-fitting model is
interesting as it leads to the best predictions. If the study is about testing particular
theories on functional relationships, a model that reflects the hypothesized relationship
should be used. It is possible that linear models cannot be used in some of these cases
(see Web resources and books on R for books that include more advanced modeling
techniques). We accept model.2 for now and extract the results.
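anova(model.2)
summary(model.2)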

The ANOVA table indicates that wind speed has a significant effect on ozone levels. Apart
from the ANOVA table, we can extract the slope and the intercept of the regression line
using the command summary().

The output first reiterates the model formula and gives information on the residuals. The
coefficients part gives the slope of the regression line as -0.13 with a standard error of
0.019 and the intercept as 4.70 with a standard error of 0.200. The table also reports t-
tests of the hypothesis that these coefficients are zero. For the intercept, this is not
informative in our example. The test of the slope reiterates the F-test in the ANOVA
table: in this simple case, with one explanatory variable, t² equals the F-ratio
((-6.822)² = 46.539) and the P-Values of the two tests are identical. In more
complicated cases, with several explanatory variables, the default settings will produce
ANOVA analyses with sequentially added explanatory variables, such that the results can
depend on the order of the explanatory variables in the model.

The output ends by stating R² and adjusted R² and repeating the overall model test
that we have already seen in the anova() output. Adjusted R² is used to express the
percentage of the variation in the response that is explained by the model. In our
example, 28% of the variation in log-transformed ozone levels is explained by wind
speed.

We have now covered the basic reasoning and coding steps involved in calculating and
interpreting linear models. More complicated models are described in the sources listed
in Web resources and books on R.

Summary

Linear models are used for various analyses of continuous responses, including
analysis of variance (ANOVA) and regression.
The workflow involves exploring the data through a plot, stating the model,
analyzing the residuals to test whether assumptions are met, and obtaining and
interpreting the results.
Assumptions are independent experimental units, constant variance of residuals and
normal distribution of residuals.
Statistical interactions imply that the effect of an explanatory variable depends on
another explanatory variable.
Linear model analyses in R involve the commands lm(), plot(), anova()
and summary().

6.6 Exercise 6: Linear Models

6-A Flower differences in Iris

the S s
Iris.
Iris
Please make sure that you follow the work-flow.
(a) Produce exploratory graphs.
(b) U

(c) corresponding

(d) ,

6-B Fertilization and herbivory

Ex_6B_Birch.csv

Does fertilization increase herbivory?


(a)
(b)

(c)
(d)


6-C Mating preference in earwigs
Is mating preference in earwigs determined by the size of the female? Use the dataset
Ex_6C_Earwigs.csv on measurements of female size (mm) and courtship duration.
(a) P
(b) a regression

(c) the

(d)

7 Basic graphs with R

7.1 Bar-plots

Goal
In this section, you will learn how to script a grouped barplot of means with standard
errors indicated as t-bars in R (Figure 7-1) and how to adjust the layout of this type of
graph.

Figure 7-1 Example of a grouped barplot of means with


standard errors (as T-bars).

How to do it
We are going to use the internal dataset ToothGrowth, which contains measurements of
tooth length in guinea pigs that received three levels of vitamin C and two supplement
types (Figure 7-1). To explore this dataset you can
use ?ToothGrowth, str(ToothGrowth) and summary(ToothGrowth). We want to
produce a barplot of the mean tooth length for all six combinations of the two factors
(supplement type: 2 levels, dose: 3 levels). We first need to calculate the mean tooth
length for each of the combinations. For this, we use the tapply()
command (see Calculating descriptive statistics for groups of data).
tapply() returns mean tooth lengths for all six combinations in the right
format for the barplot() command that is later used to produce the plot. We assign
these means to an object and display them in the console at the same time by
using parentheses around the entire command.
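A sketch; the object name means is an assumption:

(means <- tapply(ToothGrowth$len, list(ToothGrowth$supp, ToothGrowth$dose), mean))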

We are now ready to use the command barplot(). The first argument to barplot() is

the data to use, the table of means. Using the argument beside = TRUE indicates that
the bars should be plotted beside each other instead of on top of each other. This will
result in a basic barplot (not shown).

We can customize the layout of the barplot using further arguments; otherwise default
options are used. The labels of the axes can be specified with the arguments xlab
and ylab, and the labels below each group of bars are controlled with the
argument names.arg. The font size of these labels can be changed with cex.names
and cex.axis. These arguments are set to 1 by default and changes are relative to this
default. For example, cex.names = 2 will double the font size. The limits of the y-axis
are specified with ylim. Here we use a range from zero to the maximum in the dataset.
The orientation of the axis labels can be altered with the argument las, which has four
options (0, 1, 2, 3). Here, las = 1 produces horizontal axis labels. You can try out the
other options as well. The colors of the bars are determined by col, in our example by a
vector with a length of two for the two groups, specifying 1 (black) and 8 (grey). The
color can be specified either with numbers (1 to 8) or with the color name. An overview
of all available color names is brought up by colors(). You can further explore colors
at http://research.stowers-institute.org/efg/R/Color/Chart/. With the above
adjustments, the command looks like this:
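A sketch of the full call; the axis labels are illustrative, not from the original:

barplot(means, beside = TRUE,
        xlab = "Dose of vitamin C", ylab = "Tooth length",
        names.arg = c("0.5", "1", "2"),
        cex.names = 1.2, cex.axis = 1.2,
        ylim = c(0, max(ToothGrowth$len)),
        las = 1, col = c(1, 8))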

The next step is to add error bars to the barplot. There is no standard command to add
error bars. Instead, we have to draw them ourselves with the command arrows().
First, we need the standard error of the mean for all six groups. We do this in the same
way as calculating the means: we use tapply(), but ask for the calculation of the
standard error. Here we do that by creating a custom function within tapply() (more
on functions in the next chapter). Besides the length of the error bars, we also need the
horizontal locations of the bars, such that the error bars end up in the middle of the bars.
These midpoints, in the same format as the means above, can be extracted from a
basic barplot() call. We assign the barplot() command to an object, here
named, for example, mids. We can use plot = FALSE to suppress the plotting. Without
that argument, the plot is created and the midpoints saved at the same time.
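ses  <- tapply(ToothGrowth$len, list(ToothGrowth$supp, ToothGrowth$dose),
               function(x) sd(x) / sqrt(length(x)))   # custom function for the SE
mids <- barplot(means, beside = TRUE, plot = FALSE)   # midpoints only, no plot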

Now we are ready to draw the error bars using the command arrows(). Within this
command, we first state the position of the error bars by two sets of x and y coordinates
corresponding to the start and the end of the error bars. First, the starting coordinate
set identifies the six midpoints as x coordinates and
the means minus the standard errors as the six y coordinates. Secondly, the same
midpoints together with the means plus the standard errors are used for the end of the
error bars. We further use the arguments angle = 90 and code = 3 such that we get
bars with T's on both ends and not arrows. The arguments length and lwd set the size
of the T's and the line width of the entire error bars.
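arrows(mids, means - ses, mids, means + ses,
       angle = 90, code = 3, length = 0.1, lwd = 2)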

We can further use the command legend() to add a legend with our groups to the
graph. We can specify the place of the legend in the graph either with coordinates
or with default options such as "topleft" or "topright" (see ?legend).
The argument fill produces boxes with the specified colors to place next to the legend
text. bty determines whether a box is drawn around the legend (default:
"o", with box); here, bty = "n" removes the box. The font size in the legend is
determined by cex, as explained above.
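legend("topleft", legend = c("OJ", "VC"),   # the two supplement types
       fill = c(1, 8), bty = "n", cex = 1.2)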

With this code we obtain the graph displayed in Figure 7-1. There are many more details
of the plot that can be controlled and changed. For an overview of the graphical
parameters that can be changed by arguments, check out ?par.
Plots can be saved through the RStudio menu ("Export" button on the Plots tab).

Summary

A barplot can be made with the command barplot(), a higher-level plotting

command that creates a new graph. Mean values to be plotted should be calculated
first with the tapply() command for grouped barplots.
Standard error bars, calculated with tapply(), and a legend can be added with the
lower-level plotting commands arrows() and legend(), which add extra features to
an existing graph.
A large number of graphical parameters can be used to customize plots. An
overview of some of the arguments is given in the help for par (?par).

7.2 Grouped scatter plot with regression lines

Goal
In this section, you will learn how to script a grouped scatterplot with regression lines
(Figure 7-2) and how to adjust the layout of this type of graph.

Figure 7-2 Example of a grouped scatterplot with regression


lines.

How to do it
To produce a scatterplot, we will use the plot() command. plot() is a higher-level
plotting command, that is, it will create a new graph.

We are going to use part of the internal dataset iris (available with the R installation) as
an example (Figure 7-2). iris contains flower measurements of three Iris species. You
can explore the dataset by str(iris). We reduce the dataset to two species and create an
initial plot (not shown).
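A sketch of these steps (the object name iris2 is an assumption):

iris2 <- iris[iris$Species != "virginica", ]    # keep Iris setosa and Iris versicolor
plot(Sepal.Width ~ Sepal.Length, data = iris2)  # initial plot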

We can now assign two different plotting symbols to the species by creating a new
column in the data frame iris2 that contains the
number of the plotting character (pch) to be used. There are 26 different plotting
characters. Here we use character 1 for Iris setosa and character 16 for Iris versicolor. You
can use the same procedure to assign different colors to the two species (see above). We
can then set the axis labels, ranges and orientation as well as font sizes
using xlab, ylab, xlim, ylim, las, cex.lab and cex.axis, as explained above.

The next step is to add a regression line for each species, assuming that sepal length
causes changes in sepal width (which may or may not be reasonable). For this, we have
to model the regression lines first. Subsequently, we plot lines corresponding to these
models with the lower-level plotting command lines(). The line is specified by the x
and y coordinates, which are both vectors: the x-vector contains the sepal lengths, the y-
vector contains the sepal widths predicted by the model. We increase the line widths
using lwd. The command abline() that we have used in the linear regression chapter
produces the same line, except that it extends over the entire graph. Using lines()
allows us to have the regression line only within the actual datapoints.
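A sketch; the model and helper object names are made up:

m.set <- lm(Sepal.Width ~ Sepal.Length, data = iris2, subset = Species == "setosa")
m.ver <- lm(Sepal.Width ~ Sepal.Length, data = iris2, subset = Species == "versicolor")
x.set <- iris2$Sepal.Length[iris2$Species == "setosa"]
x.ver <- iris2$Sepal.Length[iris2$Species == "versicolor"]
o1 <- order(x.set); o2 <- order(x.ver)         # sort by x before drawing
lines(x.set[o1], fitted(m.set)[o1], lwd = 2)
lines(x.ver[o2], fitted(m.ver)[o2], lwd = 2)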

We should also add a legend to the figure. This is similar to the barplot legend above.
Species names in italics are generated using expression(italic("species"))
for each entry.
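A sketch (legend position chosen arbitrarily):

legend("topright", pch = c(1, 16), bty = "n",
       legend = c(expression(italic("Iris setosa")),
                  expression(italic("Iris versicolor"))))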

There are many more details of the plot that can be customized. An overview of the
graphical parameters that can be changed can be viewed using ?par.

Summary

Scatter plots can be created with the higher-level plotting command plot().

A new vector in the data frame can be used to specify plotting symbols and colors.
The lower-level plotting command lines() can be used to add regression lines
according to a linear model.

7.3 Exercise 7: Basic graphs with R

7-A

7-B

7-C
Hint

legend()

8 Introduction to customizing R

8.1 Flow control

Goal
In this section, you will gain basic understanding and experience of how to
> write code for repeated execution of commands (iteration, "loops") using
for(), while() and repeat()

> specify conditions for commands to be executed (conditional constructs) using
if() and else
How to do it
The syntax of the looping functions is listed below. Note that we use a new type of
bracket here, the curly bracket ({ }).

for (VAR in SEQ) {EXPR}
while (COND) {EXPR}
repeat {EXPR}

Here, VAR is the abbreviation of variable. SEQ is the abbreviation for a sequence of
elements in a vector or list. COND refers to a condition, which can evaluate to TRUE
or FALSE. EXPR stands for expression, that is, the actual commands to be executed.

The for() structure iterates through each component of the sequence SEQ. For
example, in the first iteration VAR = SEQ[1], in the second iteration VAR = SEQ[2],
and so on. The following code (a sketch with made-up values) uses the for() structure
to print out the square of each component in a vector.
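v <- c(2, 4, 7)
for (i in v) {
  print(i^2)   # prints 4, 16 and 49
}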

The other two loop structures, while() and repeat, rely on the change of state of the
COND expression, or on the use of break, to end the loop. The command break halts the
execution of the innermost loop and passes control to the first statement outside. When
using while() or repeat, special attention should be paid to averting an infinite
loop, that is, a loop which iterates without end. This may cause your computer to freeze!
Below is an example showing two different ways of ending the loop when the loop
indicator (here i or j) becomes larger than 10.
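i <- 1
while (i <= 10) {   # ends when the condition becomes FALSE
  i <- i + 1
}

j <- 1
repeat {            # ends via break
  j <- j + 1
  if (j > 10) break
}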

We now introduce the syntax of the conditional structure if() and else:

if (COND) {EXPR1} else {EXPR2}

The condition COND is first evaluated, and if it is TRUE, the expression EXPR1 is
executed; if COND evaluates to FALSE, EXPR2 is executed. When COND evaluates to
a numeric value of zero, R treats it as FALSE; COND evaluating to any non-zero
number is treated as TRUE. We can also extend or shrink if() structures by
adding or removing one or several else clauses; if() also works without an else clause.
Note that in these cases, the order of the conditional clauses is vital. Here is an example
(a sketch with a made-up temperature value) asking R to comment on the temperature.
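temp <- 25
if (temp > 30) {
  print("hot")
} else if (temp > 15) {
  print("warm")
} else {
  print("cold")
}
# [1] "warm"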

We see that R came up with the right answer!

8.2 Custom functions

Goal
In this section, you will learn how to develop your own R functions.

How to do it
R provides a convenient way to define custom functions using function(). When
writing functions, you define yourself what arguments the function should use, what
should happen to the input and what the output should be. The syntax of function() is:

NAME <- function(ARGUMENTS) {EXPR}

If the expression only includes one statement, it can be entered directly; when there
are multiple expressions, they have to be enclosed in braces { }. Functions are assigned to
a function object and can later be executed using this object name together with
arguments, just as all of the other R commands. Here is an example of a function,
named my.sample() here (the name is made up), that has the argument x. It samples 5
elements, with replacement, from the vector given as the argument x.
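my.sample <- function(x) {
  sample(x, size = 5, replace = TRUE)   # 5 elements, drawn with replacement
}
my.sample(1:20)   # example call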

The arguments of a function can later be inspected via the R
functions args() and formals().

We have now introduced a first idea of how R can be customized using programming
structures. You can learn more about programming in R using some of the more advanced
books suggested at the end of Chapter 2 (Web resources and books on R).

8.3 Summary

R programs are made up of expressions. Flow control constructs are needed to

control compound expressions. R provides for(), while() and repeat to write
loops. if() and else statements can be used to execute expressions based on one
or several conditions.
You can create new functions and combine existing commands in custom ways by
defining your own functions.

8.4 Exercise 8: Customizing R

Are my data/residuals normally distributed? Creating a function and loop


for normality assessment

, ,

The goal:

within the cloud of


The any

Instructions

Step 1:

Step 2:

Step 3: sample

Step 4:
, d

Step 5:

Step 6:

Step 7:

Step 8:

a very similar gr

Step 9:
