Table of Contents
Introduction ..................................................................................................................... 4
Who should use this script? ...................................................................................................... 4
How to use this script ............................................................................................................... 5
1. Getting started with statistics .................................................................................... 6
1.1 Distributions and probability ........................................................................................... 6
1.2 Descriptive statistics ..................................................................................................... 10
1.3 How statistical tests work ............................................................................................. 13
1.4 Exercise 1: Getting started with statistics ....................................................................... 16
2 Getting started with R ............................................................................................... 18
2.1 What is R?..................................................................................................................... 18
2.2 Downloading and installing R ........................................................................................ 18
2.3 How to work with R ...................................................................................................... 18
2.4 Exercise 2: Getting started with R .................................................................................. 27
2.5 Web resources and books on R ...................................................................................... 28
3 Loading and exploring data ....................................................................................... 29
3.1 Entering and loading data.............................................................................................. 29
3.2 Object types in R ........................................................................................................... 33
3.3 Exploring data with tables ............................................................................................. 34
3.4 Exploring data graphically ............................................................................................. 34
3.5 Calculating descriptive statistics .................................................................................... 36
3.6 Understanding help functions........................................................................................ 38
3.7 Exercise 3: Exploring data .............................................................................................. 40
4 Basic statistics with nice datasets................................................................................ 41
4.1 t-tests (and non-parametric alternatives)....................................................................... 41
4.2 Correlation analysis....................................................................................................... 52
4.3 Cross-tabulation and the χ2 test..................................................................... 54
4.4 Exercise 4: Basic statistics with nice datasets ................................................................. 56
5 Handling larger and nastier datasets ......................................................................... 58
5.1 Handling data ............................................................................................................... 58
5.2 Dealing with missing Values .......................................................................................... 64
5.3 Exercise 5: Handling larger & nastier datasets ............................................................... 67
6 Linear models ........................................................................................................... 68
6.1 Linear models: basic reasoning ...................................................................................... 68
6.2 Linear models in R ......................................................................................................... 72
6.3 One-way ANOVA: worked example ............................................................................... 74
6.4 Two-way ANOVA with interaction: worked example ...................................................... 77
6.5 Linear regression: worked example ............................................................................... 80
6.6 Exercise 6: Linear Models .............................................................................................. 86
7 Basic graphs with R ................................................................................................... 88
7.1 Bar-plots ....................................................................................................................... 88
7.2 Grouped scatter plot with regression lines ..................................................................... 91
7.3 Exercise 7: Basic graphs with R ...................................................................................... 94
8 Introduction to customizing R ................................................................................... 95
8.1 Flow control.................................................................................................................. 95
8.2 Custom functions .......................................................................................................... 97
8.3 Summary ...................................................................................................................... 98
8.4 Exercise 8: Customizing R .............................................................................................. 99
Introduction
This introduction is directed at beginning MSc students and advanced BSc students of Biology at Uppsala University who specialize in Ecology and Conservation, Evolutionary Biology, Limnology or Toxicology. This script is also useful if you are:
1. an incoming Master or exchange student with limited previous education in statistics with R
2. a student who wishes to freshen up knowledge in statistics and/or R for courses, project work or research (Uppsala University student, Master thesis, Doctoral thesis)
How to use this script
We wrote this script for flexible use, so that you can direct your attention to the parts you want to focus on, given your background and current interests.
You will get the most out of this pdf file if you read it electronically using a pdf reader that provides a contents sidebar (bookmarks pane), for example Adobe Reader for Macintosh or for PC. You can browse and navigate between sections and subsections using the bookmarks pane.
This file also contains an R code demo file and data files as attachments (paperclip symbol in Adobe Reader).
Please feel free to approach us with questions on the contents and on the exercises. Exercise solutions are available from us after you have handed in your own solution.
August 2018,
Sophie Karrenberg
with contributions to earlier versions by Andrés J. Cortés
1. Getting started with statistics
Goals
In this section you will:
Biological questions such as which genes affect certain traits or how climate change affects the biosphere can only be solved using statistical analyses of massive datasets. But even comparatively simple questions, for example to what extent men are taller than women, require statistical treatment. Thus, as soon as you formulate a study question, you should start thinking about statistics.
Statistical analyses have a central place in biological studies and in many other scientific
disciplines (Figure 1-2):
Distributions
A distribution describes how often different values occur in a set of data. In the graph on the
next page you see a common representation of a distribution, a histogram (Figure 1-3). In
histograms, the horizontal x-axis represents the values occurring in the data, separated into
groups (columns), and the vertical y-axis shows how often they occur (proportion or frequency).
Figure 1-3 Histogram of normally distributed data.
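A histogram like the one in Figure 1-3 can be reproduced with a few lines of R. This is a sketch only: the data behind the figure are not given, so we draw our own random sample.

```r
# Draw 1000 random values from a normal distribution with mean 0
# and standard deviation 1, then display them as a histogram
set.seed(1)                        # makes the random draw reproducible
x <- rnorm(1000, mean = 0, sd = 1)
hist(x, xlab = "Value", ylab = "Frequency",
     main = "Histogram of normally distributed data")
```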
The coin example concerns an outcome with two categories, heads and tails. For continuous
(measurement) values, probability density functions can be derived (Figure 1-4). Note the
similarity in shape to the histogram above (Figure 1-3). For each value on the x-axis the value of
the probability density function displayed on the y-axis is the expected probability of that value
occurring. The value 2 is thus expected to occur with a very low probability in data from a
normal distribution with a mean of 0 and a standard deviation of 1 (green curve, Figure 1-4).
Probability density functions of test statistics are used for the evaluation of statistical tests.
The probability of obtaining values in a certain range corresponds to the area under the curve
in this range. The entire area under the probability density curve sums to one.
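These ideas can be checked directly in R: dnorm() gives the probability density and pnorm() the cumulative probability (the area under the curve up to a given value) for the normal distribution.

```r
dnorm(0)             # density at the mean of N(0, 1), about 0.399
dnorm(2)             # much lower density two sd from the mean, about 0.054
# area under the curve between -1 and 1:
pnorm(1) - pnorm(-1) # about 0.683
# the entire area under the density curve sums to one:
pnorm(Inf)           # 1
```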
Figure 1-4 Probability density curves for different normal
distributions.
For normally distributed data, values within 1 sd to either side of the mean represent 34.1% of the data each (pink, Figure 1-5), 13.6% of the values occur between 1 and 2 sd from the mean on either side (yellow), 2.1% of the values occur between 2 and 3 sd from the mean (green) and 0.1% of the data occur beyond 3 sd from the mean (white). Values more than four standard deviations from the mean are thus extremely unlikely!
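The percentages above can be verified with pnorm(), the cumulative probability function of the standard normal distribution:

```r
pnorm(1) - pnorm(0)  # within 1 sd above the mean: about 0.341 (34.1%)
pnorm(2) - pnorm(1)  # between 1 and 2 sd:         about 0.136 (13.6%)
pnorm(3) - pnorm(2)  # between 2 and 3 sd:         about 0.021 (2.1%)
1 - pnorm(3)         # beyond 3 sd:                about 0.001 (0.1%)
1 - pnorm(4)         # beyond 4 sd: about 0.00003, extremely unlikely
```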
Summary

1.2 Descriptive statistics
Goals
In this section you will learn how to describe your data in terms of range, median, quartiles, mean and standard deviation.
The range refers to the interval between the smallest value, the minimum, and the largest value,
the maximum. Indeed, looking at the range is highly recommended as it allows you to conduct a
first check of the data: are the values actually in the expected (or reasonable) range?
In addition to the range, you can calculate the median, the value "in the middle" of the data, that is,
half of the data are smaller than the median and the other half are larger (Figure 1-6). For example,
the data (2, 3, 5, 7, 10, 50, 73) have a median of 7 and the data (2, 3, 4, 5, 6, 7) have a median of 4.5.
Symmetrically distributed data have a median more or less in the middle of their range. Data with a
range of 1 to 10 with a median of 2, in contrast, are NOT symmetrically distributed. You can further
calculate the 25% and 75% quartiles, separating the smaller 25% of the data from the larger 75%
and the smaller 75% from the larger 25% of the data, respectively (Figure 1-6, Figure 1-7).
Median and quartiles are in fact good descriptive statistics for such data: the median indeed is in
the center of the data and the quartiles nicely reflect the asymmetry in the distribution, i.e., the
distance between 25% quartile and median is smaller than the distance between 75% quartile and
the median (Figure 1-7). Alternatively, you can use data transformations (see Chapter 4.1).
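The same statistics can be obtained in R for the example data used above:

```r
x <- c(2, 3, 5, 7, 10, 50, 73)  # the first example dataset from the text
range(x)                        # minimum and maximum: 2 and 73
median(x)                       # 7: half the data are smaller, half larger
quantile(x, c(0.25, 0.75))      # 25% and 75% quartiles
median(c(2, 3, 4, 5, 6, 7))     # 4.5 for an even number of values
```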
Mean and standard deviation of a sample
Common descriptive statistics are the mean and the standard deviation. They make most sense for symmetrically distributed data. The mean is calculated as the sum of all values xᵢ divided by the number of values n. The sample standard deviation s (also sd) is calculated as the root of the "sum of squares" (the sum of the squared deviations of each value from the mean), divided by n − 1.

Sample mean: x̄ = ( Σ xᵢ ) / n
Sample standard deviation: s = √( Σ (xᵢ − x̄)² / (n − 1) )
Note that standard error and confidence interval of the mean become smaller the larger the
sample is. This reflects the greater trust you can have for a mean calculated from a large sample
as opposed to a mean calculated from a small sample.
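In R, mean and standard deviation are computed with mean() and sd(). The standard error of the mean is, by the usual formula, the standard deviation divided by the square root of the sample size; that formula is standard but not spelled out in the text above.

```r
x <- c(2, 4, 6, 8)               # a small example sample
n <- length(x)                   # sample size, here 4
mean(x)                          # sum of the values divided by n: 5
sd(x)                            # root of the sum of squares / (n - 1)
sd(x) / sqrt(n)                  # standard error of the mean
# the standard error shrinks as the sample size n grows
```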
Summary
Range, quartiles and median are basic descriptive statistics for data with any distribution.
Mean and standard deviation are most useful for symmetrically distributed data.
Descriptive statistics are important to check data and are used to summarize data.
1.3 How statistical tests work
Goals
In this section you will learn about
Hypothesis testing
Many classic statistical tests evaluate a strict null hypothesis, H0, against an alternative hypothesis, HA. For example:
H0: mean plant height does not differ between plants grown at low and high densities
HA: mean plant height differs between plants grown at low and high densities
Statistical tests calculate a test statistic from the data to find out how likely the obtained result
would be under the null hypothesis. This probability is termed the P-value. The probability of
the test statistic is found using the theoretical probability distribution of the test statistic.
Figure 1-8 Theoretical distribution of a test statistic under a null hypothesis and
an alternative hypothesis.
Statistical tests can have four potential outcomes; two are correct and two are false (Table 1-1). Statistical significance is assessed with the P-value, the probability of obtaining the observed test statistic, or a more extreme one, when the null hypothesis is true (Figure 1-8, Table 1-1).
If the test statistic calculated from the data happens to be a value that is very rare under the null hypothesis, usually occurring with a probability of less than 5% (P-value < 0.05), the null hypothesis is discarded in favor of the alternative hypothesis (Figure 1-8, see also below). This result is referred to as "statistically significant". If the P-value is larger than 0.05, the null hypothesis is retained and the test result is referred to as "not significant".
Table 1-1 Possible outcomes of statistical tests with a significance level of 0.05.
The significance level is commonly set to 0.05 in biological studies, and P < 0.01 and P < 0.001 are regarded as highly significant and very highly significant, respectively. Importantly, the choice of significance level has direct implications for the two error types: if the significance level, and thus the type I error rate, is decreased to 0.01, the type II error rate inevitably increases (Figure 1-8, lower panel).
The alternative hypothesis can also be formulated to specifically test how the means
of the two groups differ. Here, one of two different alternative hypotheses for group means can be
used, either group 1 > group 2 or group 1 < group 2.
Such hypotheses are referred to as one-sided hypotheses and are analyzed by one-tailed tests
(Figure 1-9).
Importantly, all statistical tests make assumptions about the data and are only valid if these assumptions are met (see chapters 4, 5, and 6).
Figure 1-9 Illustration of significance (P < 0.05) ranges in one- and two-tailed tests.
Summary
Statistical tests use samples to make inferences about larger populations and generally evaluate a
null hypothesis (usually no difference) against an alternative hypothesis (a difference). They
do so by comparing a test statistic calculated from the data against a theoretical distribution
of this statistic under the null hypothesis.
The significance level used in statistical testing is related to both type I errors (false positives)
and type II errors (false negatives).
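The comparison of a test statistic to its theoretical distribution can be sketched in R. Here pt() gives the cumulative probability of the t-distribution; the observed statistic and the degrees of freedom below are made-up example values:

```r
t.obs <- 2.5                 # an observed t statistic (invented example)
df    <- 18                  # degrees of freedom (invented example)
# two-tailed P-value: area in both tails beyond +/- t.obs
p <- 2 * pt(-abs(t.obs), df)
p                            # about 0.02, smaller than 0.05
```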
1.4 Exercise 1: Getting started with statistics
1-A
A B
(a)
(b)
(c)
(d)
(e)
(f) Most values are smaller than 2.
(g)
1-B
1-C
(a)
(b)
(c)
1-D
(a)
(b)
(c)
(d)
(e)
(f)
2 Getting started with R
2.1 What is R?
R is a versatile and powerful statistical programming language developed by the statistics
professors Robert Gentleman and Ross Ihaka at the University of Auckland, New Zealand.
Different from other statistical programs, R is free and its source code is available. R was released
in 1996 and is maintained by the R development core team (http://www.R-project.org/) with a
very large number of international contributions and continues to develop at a fast pace. R is
among the most widely used statistical programs at universities today. For more information,
please see the list of Web resources and books on R at the end of this section.
Goals
In this section, you will learn:
Elements of R: Script, Console & Co.
Using R involves mostly writing commands (or "code") rather than clicking on menus.
The commands are usually assembled in a script that can be saved and reused (Figure
2-1). The R console receives the commands either from the script or by direct typing.
The console shows the progress of analyses and displays the output. Graphical output
will open in a separate window. This means that working with R can involve quite a lot
of windows and files, the script, the console, graphs and other output and then, of
course, your data files. You can assign a working directory where all of these files and
outputs are saved by default.
An excellent way of ordering and manipulating your R windows and files is to use the
free and powerful interface for R provided by RStudio (see below).
We highly recommend that you use RStudio.
In RStudio, you can bundle your analyses into projects by using the RStudio menu in the
top right corner of RStudio (Figure 2-2). Projects contain all elements of analyses
allowing you to continue a session exactly where you ended the previous time.
You can set a new working directory by navigating to the folder you want to use on the
files tab (lower right) and choosing "Set as working directory" from the "More" menu
(Figure 2-2). Note that the working directory basically is a folder on your computer that
contains various files whereas the workspace is a collection of R objects (see below)
assigned through R (Figure 2-1).
To create a new script, you can follow "File > New File > R script", or use the shortcut
Ctrl + Shift + N. Save your scripts regularly. A file that has been modified but not saved
again is displayed with a red title and a * at the end.
You can navigate between different plots on the plots tab (lower right) produced during
a session using the blue arrows at the top left corner of the plots tab. You can save your
graphs by clicking on "Export". When you are finished with your analyses you can
compile your code and output into a notebook to document a successful analysis. However, in recent years, issues with
version incompatibilities and RStudio bugs have led to difficulties with notebook creation and,
for this reason, we currently DO NOT recommend that you use notebooks as a beginner.
Work flow in R
When working with R on a new analysis you usually follow these steps:
1. Define/create a folder to be used as the working directory.
2. Open R Studio and create a new Script file (menu). You can also create a project
(button top right).
3. Set the working directory to your prepared folder.
4. Write your script in the script window and save it. Send selected code line(s) to
the console using cmd + return (Mac) or ctrl + R (PC).
5. Conduct analyses, save the script, outputs and graphs. When the entire analysis is
ready, you can compile code and output into a notebook.
6. Quit R using the menu or the command q(). You usually DO NOT need to save the
workspace.
R commands
R commands always have the same structure (Figure 2-3). A command name is followed
by parentheses without a space in between. Command names are often closely related to
what the command does; for example, the command mean() will calculate the mean.
The parentheses contain the arguments that the command will use, separated by commas.
Such arguments tell R what data to use and which analysis options to select. Which
arguments are needed differs between commands.
Figure 2-3 Structure of R commands.
The command greeting <- "hej" will assign "hej" to an object named greeting. All text has to be in quotes (" ");
otherwise R will look for an object with this name and produce an error message.
To call an object and have it displayed on the console you type its name (in the script
window and send it to the console).
The [1] in the output indicates that this is the first (and only) element of this object.
One (very slow) way to enter data into R is to assign the values to an object directly using
the command c(). For example, a command like x <- c(1, 2, 3) (the object name x is just an
example) will assign the numbers 1, 2, and 3 to an object called x. This type of
object is called a vector.
You can check which objects have been created in R using the command ls().
This will result in a list of the current objects in the workspace. In RStudio you can view
and manipulate current objects in the workspace tab (top right partition). You can also
use rm(name.of.object) to remove objects.
The command rm(list = ls()) deletes all current objects.
Objects are overwritten without notice whenever something else is assigned to the
same name.
will result in
Vectors of the same length (containing the same number of elements) are combined
element-wise in calculations. Below we create a second vector with 3
elements and add it to the first one.
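For example (the vector names below are illustrative):

```r
a <- c(1, 2, 3)        # first vector with 3 elements
b <- c(10, 20, 30)     # second vector with 3 elements
a + b                  # element-wise addition: 11 22 33
a * b                  # element-wise multiplication: 10 40 90
a * 2                  # a single number is applied to every element: 2 4 6
```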
You can use this procedure to explore functional relationships of interest.
The command 1:100 creates a vector with the numbers from 1 to 100.
One way to quickly visualize this data is to create a boxplot that displays the
median (black bar in the middle) and 25% and 75% quantiles (box) and indicates outliers
as values beyond the t-bars or whiskers. To display a boxplot of the 100 random
numbers we use the object name (see code above) as an argument
to boxplot() and generate the graph below (Figure 2-4).
Figure 2-4 Boxplot of a sample of 100 numbers randomly drawn from the standard
normal distribution. The bar indicates the median; the box shows the 25% and 75%
quantiles.
The median is around zero, as expected for a sample from the standard normal
distribution. If we now want to compare data from several groups using a boxplot we
can do this with two vectors, one with the numbers from both groups combined and the
other indicating which number belongs to which group. As an example, we can use two
random samples of 100 numbers each and combine them with c().
We can use the command rep() to create a group indicator vector with the sample
names repeated 100 times each and combine these with c() as well. The boxplot
command then receives a so-called formula statement (more on that in later chapters)
with the sample numbers on the left side connected with a tilde symbol (~) to the group
indicator on the right side (resulting boxplot in Figure 2-5).
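Put together, the two-group boxplot described above can be sketched like this (the object names are illustrative):

```r
set.seed(2)
sample1 <- rnorm(100)                    # 100 random numbers, group 1
sample2 <- rnorm(100, mean = 1)          # 100 random numbers, group 2
values  <- c(sample1, sample2)           # numbers from both groups combined
group   <- rep(c("sample1", "sample2"),  # group indicator: each name
               each = 100)               # repeated 100 times
boxplot(values ~ group)                  # formula: values by group
```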
Now it is time to try this out using the demo code (attached) and some exercises, please
check out the work-flow (Work flow in R) and the notes below before you start!
You can enter comments and titles preceded by a hash (#). Everything written
after a # will be ignored and not executed by R (see example code).
In RStudio, you can use four hash symbols (####) after a title to
organize the script. You can later navigate through these headings using
the pull-down menu at the bottom left of the script window (see example code).
Observe the ">" sign in the console. This is the command prompt indicating that
R is ready to receive commands and has finished executing previous commands.
R will display a "+" if a command is incomplete. On the console, you can cycle
through previous commands using the "arrow up" and "arrow down" keys on
your keyboard.
Summary
In R, you use a script window to enter the commands. Commands are transferred
to the R console for execution. Scripts can be saved and re-used.
Data, output and scripts are saved in a designated working directory.
R stores data and analysis outputs as objects.
Content is stored as objects that are named with the assignment arrow "<-".
The workflow in R involves setting the working directory at the beginning and
saving the script file repeatedly.
In R, you can conduct basic mathematical calculations directly and element-wise, for
example on vectors.
Boxplots can be generated with boxplot() to summarize data.
2.4 Exercise 2: Getting started with R
(attached)
2-A
vertical
each o
A B
1.3 10.1
1.8 13.7
0.75 15.1
0.9 9.3
0.6 12.0
1.0 16.8
1.1 17.3
Hint
(a)
(b) (5, 10, or 100)
of each size
2.5 Web resources and books on R
Web resources
CRAN page (http://www.r-project.org/)
R-Studio (http://www.rstudio.com)
Quick R (http://www.statmethods.net/index.html)
A code example page for beginners and slightly more advanced users.
Books
Dalgaard, Peter (2008) Introductory statistics with R, 2nd edition. Springer. ISBN-
13: 978-0387790534, also available as e-book.
Popular and very well-written, covers both R and statistics, including slightly more
advanced methods. Suitable for beginners. We recommend this book to get started and
also as a reference throughout your studies.
Zuur, Alain, Ieno, Elena N., Meesters, Erik (2009) A Beginner's Guide to R.
Springer. ISBN-13: 978-0387938363.
Large overview with many biological examples, suitable for beginning users with some
statistical knowledge. This is a popular reference volume.
Kabacoff, Robert (2011) R in Action: Data Analysis and Graphics with R. Manning.
From the author of Quick R; this book covers both beginner and more advanced
methods and follows a case-study approach. Previous experience with R is desirable.
3 Loading and exploring data
Note: when you execute these commands nothing visible happens in the console!
Preparing data files
Most of the time it will be convenient to load data from a file. You first need to prepare a
suitable file from another program, for example Excel. Follow these points:
For a .csv file with a header, comma-separated entries and decimal points (as common in
North America), the command looks like this:
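The call itself is not preserved in this copy of the script; with read.csv() it would look as below. To keep the sketch self-contained we first write a small example file (the file name and the columns are placeholders for your own data):

```r
# create a small example .csv file with a header row
writeLines(c("plant,height,status",
             "1,12.5,flowering",
             "2,9.8,vegetative"),
           "my.data.csv")
# read the comma-separated file into a data frame
my.data <- read.csv("my.data.csv", header = TRUE)
str(my.data)   # a data frame: 2 observations of 3 variables
```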
Remedy: copy your data (and only your data!) to a new sheet in Excel, save it as .csv and
reload.
Non-English characters and signs. Non-English characters (for example ä, ö, é),
special signs, or spaces in the column
names will produce an error message similar to this:
Remedy: change the names in an Excel file or directly in a .csv file, save it as .csv and
reload.
will yield
To check what sort of object your data is stored in and whether it has the correct
structure, you can use the structure command str(). The same information is
displayed when you click on the object in the workspace tab in RStudio (top right
partition).
This indicates that the object is a data frame with six observations, i.e.,
six rows, and four variables, i.e., four columns. This matches well with the six plants and
four columns in our data.
The column names and the types of the columns are also given, together with the first few
values (in this case all six values). Columns of type numeric contain continuous numbers
and you will be able to do calculations with these numbers. A column of type factor
contains grouping levels. Different from a vector, a factor stores
information on the factor levels and this information will be used directly in analyses and
graphing commands.
If you want to change the type of a column, for example changing a column
from numeric to factor (because the number is a "name" in this
case), you can use the command as.factor().
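The original conversion snippet is not preserved in this copy; a sketch with as.factor() follows (the data frame and column names are illustrative):

```r
d <- data.frame(plant  = c(1, 2, 3),          # id numbers that are "names"
                height = c(12.5, 9.8, 14.1))  # a measured variable
d$plant <- as.factor(d$plant)  # convert the column from numeric to factor
str(d)                         # "plant" is now a factor with 3 levels
```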
3.2 Object types in R
Before we can go on to load data and conduct analyses, we need to have a look at the
different object types in R and the data contained in them. In general, R objects contain
numbers, characters (entered in quotes, such as the "flowering"
and "vegetative" in the example above) and logical statements (TRUE / FALSE; these
are covered later).
Vector. You have already created vectors in Exercise 2. Vectors are one-dimensional
and contain a sequence of one type of data: numbers OR categories (letters, group
names) OR logical statements. Vectors can be created using the command c(),
which concatenates the different elements (connects them one after the other)
into a vector. You can also use c() to combine multiple vectors,
as you may have done in Exercise 2-B.
Sequences of numbers can be created using the colon (:). For instance, 1:10 creates
the numbers from 1 to 10. There are a number of other functions for creating vectors,
seq() for user-defined sequences and rep() for repeated elements (see Chapter 2,
Basic calculations and data display in R).
Factor: Factors are similar to vectors but also contain information on grouping levels.
They are used in analyses and graphs that involve groups. Factors can be created from
vectors using factor() or as.factor() (see Checking data structure above).
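For example, a factor can be created from a character vector like this (the name status and its levels are illustrative):

```r
status <- factor(c("flowering", "vegetative", "flowering"))
status            # prints the values together with the levels
levels(status)    # "flowering" "vegetative"
```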
Data frame: data frames are a collection of vectors and factors of the same length. This
is the format commonly used for basic data analysis where each row corresponds to an
observation and each column corresponds to a measurement or group (vector or factor,
respectively). The above section Entering and loading data explains how to create data
frames from your data.
List: lists are collections of elements of any type and length and can be created
using list(). Outputs of statistical analyses often are lists. Other types of objects
include matrix and array. Many commands create custom object types.
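The counting command referred to below is not shown in this copy; counts per factor level come from table(), sketched here with an illustrative factor matching the six-plant example:

```r
status <- factor(c("flowering", "flowering", "flowering",
                   "vegetative", "vegetative", "vegetative"))
table(status)   # flowering: 3, vegetative: 3
```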
We can thus quickly confirm that there are three flowering and three vegetative plants in
the dataset. Of course, this only becomes really useful in larger datasets. If you enter
multiple factors as arguments, counts for all possible combinations of levels are reported.
Histograms
The command hist() produces a histogram displaying data values on the x-axis against
their frequencies on the y-axis, allowing you to judge the distribution of the data (see
Chapter 1, Distributions). The command is applied to individual variables
(columns) of the data. The code below yields Figure 3-1.
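The original code for Figure 3-1 is not preserved here; an equivalent call on a single column, using the built-in iris data as a stand-in for the course data, is:

```r
# histogram of one variable (column) of a data frame
hist(iris$Sepal.Length, xlab = "Sepal length", main = "")
```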
The plot() command
A quick graphical check of the data is provided by the versatile command plot(). In the
previous chapter we saw that plot() can be used to plot two vectors against each other
(see Chapter 2, Basic calculations and data display in R). The plot() command can also
be used to depict all pair-wise combinations of variables in a data frame against each
other. Below is an example for the internal data frame iris, as plot(iris) (Figure 3-2).
Boxplots
Boxplots (Figure 3-3), which we have introduced earlier (see Chapter 2, Basic calculations
and data display in R), can be used to get an idea of whether there are large differences
between groups, whether the data are distributed symmetrically within groups and whether
there are outliers. In the default settings, the boxplot() command that you have used
already shows medians as thick black lines and quartiles as a box around the median.
The t-bars ("whiskers") mark the range of the data that is within 1.5 times the inter-quartile
distance from the box. Data points outside that range are regarded as outliers and are
displayed as circles. When working with data frames we can also give the name of
the data frame as an argument (data = name.of.dataframe) and use only the variable
names to refer to the data and the grouping variable (y ~ group).
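As a sketch, with the built-in iris data standing in for the course data:

```r
# formula interface: the measured variable on the left of the tilde,
# the grouping factor on the right, data frame given via "data"
boxplot(Sepal.Length ~ Species, data = iris)
```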
You can also calculate these and further descriptive statistics directly using commands
(Table 3-1) that are applied to numeric vectors or data frame columns (containing
continuous numbers; below, x stands for the name of your vector). If you want to apply
these commands to data with missing values, please see Chapter 5, Dealing with
missing values.

Table 3-1 Commands for descriptive statistics. The original table is not fully preserved
in this copy; the selection below is a standard set.
Statistic              Command
mean                   mean(x)
standard deviation     sd(x)
median                 median(x)
minimum, maximum       min(x), max(x)
range                  range(x)
quartiles/quantiles    quantile(x)
number of values       length(x)
Descriptive statistics for groups of data
Often, you will wish to apply these commands over groups of the data that are defined in
a grouping factor. You can do this using the command tapply().
You can state all kinds of other commands or functions (including functions that you
write yourself, see Chapter 8) in the function argument.
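A sketch with the built-in iris data: tapply() takes the data vector, the grouping factor and the function to apply per group.

```r
tapply(iris$Sepal.Length, iris$Species, mean)  # mean per species
tapply(iris$Sepal.Length, iris$Species, sd)    # any function works
```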
Documentation on commands
Typing a question mark followed by a command name, for example ?median,
will open the help file on the command in the lower right partition in RStudio. You can
also search directly from the help tab there. The information is always displayed in the
same general way. At the top of the page, the R package that the command originates from
is given in braces. Here, median {stats} shows that the command originates
from the R package stats. This basic package is pre-installed when you download R. A
large number of more advanced or specialized packages can be downloaded, installed
and updated through the package tab in the lower right partition of RStudio. Further help
sections contain a description, the usage, the arguments and the value or object returned
by the command. The help file for median indicates that its arguments are x
and na.rm. It further explains that the command returns one value as a
result.
The help pages end with references, similar commands ("see also" section)
and, importantly, examples. Example code can be directly copy-pasted
into the script; only internal data or data generated within the example code is used.
Running example code is a very good way to examine how to work with a command.
You can also search the help files of all installed packages by typing two question
marks followed by a search term:

??median

This command will open a table of commands related to the median in some way. The
table lists the commands and the packages they originate from, as well as a short
description, including the one from the package stats described above, listed
as stats::median. Clicking on these entries will open the R documentation for these
commands. If you want to use commands from other packages you may need to
install the respective package first (Packages tab in the lower-right pane of RStudio).
3.7 Exercise 3: Exploring data
(a)
(b)
(a)
(b)
4 Basic statistics with nice datasets
Goals
In this section you will learn when and how to conduct, and how to interpret, basic
statistical tests. The two-sample t-test, for example, tests the hypotheses
H0: both samples come from the same population (with the same mean)
HA: the two samples come from different populations (with different means).
From Chapter 1 (How statistical tests work) you may remember that the basic reasoning
of many statistical tests is relating a test statistic calculated from the data to the
theoretically derived distribution of that test statistic under H0. When testing the above
H0 with equal sample size n in the two samples, a t-test has the test statistic

t_s = (x̄1 − x̄2) / √((s1² + s2²) / n)

where x̄1 and x̄2 are the two sample means and s1² and s2² the two sample variances.
The s in t_s indicates that t_s is calculated from a sample. t_s relates the difference between
the sample means (the numerator) to the standard error of this difference (the
denominator). t_s will be very large or very small (depending on which of the means is
larger) when the difference in means is large in comparison to its standard error. It will
be close to zero when the two sample means differ very little, or when the difference
cannot be determined with any confidence because its standard error is large. The test
statistic t_s has a theoretical distribution called the t-distribution. This distribution looks
roughly similar to the normal distribution and depends on the so-called degrees of
freedom, abbreviated as df (Figure 4-1). In this case with equal n, the degrees of freedom are
calculated as df = 2(n − 1). The null hypothesis is tested by assessing the probability of t_s
given df.
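As a sketch with hypothetical numbers, the test statistic and its P-value can be computed by hand and compared with R's built-in t.test():

```r
# Hypothetical data: two samples of equal size n = 5
a <- c(1.8, 2.2, 1.9, 2.4, 1.7)
b <- c(3.1, 2.7, 3.4, 2.9, 2.9)
n <- 5

# test statistic for equal sample sizes: difference of means over its standard error
t_s <- (mean(a) - mean(b)) / sqrt((var(a) + var(b)) / n)
df  <- 2 * (n - 1)

2 * pt(-abs(t_s), df)           # two-sided P-value from the t-distribution
t.test(a, b, var.equal = TRUE)  # gives the same t and P-value
```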
Figure 4-1 Probability density of different t-distributions.
Note that for very high df the t-distribution approaches the
normal distribution.
Importantly, t-tests are only valid when the assumptions of this analysis are met. These
are normality of the data in each sample and homogeneity of variances. How to assess
normality is explained later in this chapter (Assessing normality using quantile-quantile-
plots (qq-plots)). If the data are not normally distributed a transformation can be applied
or an alternative test that does not require normal distribution can be conducted; such
tests are referred to as non-parametric (Non-parametric alternatives). The homogeneity
of variances can be roughly seen from calculating the variances (Calculating descriptive
statistics). However, R automatically conducts a "safer" form of the t-test, Welch's t-test,
which applies a correction for differences in the variances.
The general work-flow for conducting group comparisons involves plotting the data,
testing for normality and conducting the test if appropriate (Figure 4-2).
Different forms of the t-test can be used for two samples with unequal sample sizes, to
compare a single sample to a hypothesized mean, and to compare paired samples
where pairs of measurements are related in some way (see Two-sample, one-sample and
paired t-tests).
The main argument to t.test() is a formula statement relating the vector with the data
to the grouping vector with a tilde symbol (~). Note that this is the
same kind of formula that is used in the command boxplot(). In t.test(), equal
variances must be specified using the argument var.equal = TRUE; otherwise a more
robust version, Welch's t-test, which does not assume equal variances, is used automatically.
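As a sketch, with hypothetical feeding times chosen to reproduce the group means from the text (2.2 and 4.4); the exact values, and hence the confidence interval, will differ from the original butterfly data:

```r
# Hypothetical feeding times (females first, then males)
feeding <- c(2.0, 2.4, 1.9, 2.5, 2.2, 4.2, 4.7, 4.1, 4.6, 4.4)
sex     <- factor(rep(c("female", "male"), each = 5))

t.test(feeding ~ sex, var.equal = TRUE)  # formula: data vector ~ grouping vector
```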
The (rounded) value of the test statistic lies in the far left tail of the
t-distribution with df = 2(5 − 1) = 8 (Figure 4-1). Accordingly, the P-value is < 0.05 and
the result is significant. The output also specifies the alternative hypothesis
and gives the 95% confidence interval for the difference between the means, together with
the means for the two groups, 2.2 for females and 4.4 for males. The difference, 2.2 − 4.4
= −2.2, thus has a confidence interval from −3.4 to −0.934, again indicating that we can be
rather sure about this result. We therefore reject the null hypothesis and conclude that male
butterflies feed on nectar significantly longer than female butterflies.
Assessing normality using quantile-quantile-plots (Q-Q plots)
Normality of data is best assessed graphically. Here, we make use of cumulative
empirical distributions. Cumulative distributions are displayed as ordered values
(x-axis) against their rank divided by the number of values, n (Figure 4-3, left).
For example, a dataset with ten values will have points at 1/10, 2/10, 3/10, …, 10/10
(Figure 4-3). This way, the values on the y-axis represent the proportion of data smaller
than the corresponding values in the sample on the x-axis. These values are actually
quantiles (sample quantiles); for example, the x-axis value corresponding to 0.5 (50%) is
the 50% quantile, the median (see also Descriptive statistics). In order to assess
normality of the sample, we compare such an empirical cumulative distribution to what
would be expected in a standard normal distribution. The comparison is made by
plotting the sample quantiles against the quantiles of the standard normal distribution
that are obtained at the same proportions. This is called a Q-Q plot (quantile-quantile
plot). If the sample data is normally distributed, the points should fall on a line. The sample
mean should plot near zero because the standard normal distribution has a mean of zero.
Moreover, sample quantiles corresponding to standard normal quantiles of -2 or 2, that is,
values more than two standard deviations away from the mean, should be very rare.
Figure 4-3 Empirical cumulative distribution and normal Q-Q plot for an example
with 10 data points randomly drawn from a normal distribution.
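In R, such a Q-Q plot is produced with qqnorm(), and qqline() adds the reference line (a sketch with random example data):

```r
set.seed(1)     # makes the random draws reproducible
x <- rnorm(10)  # 10 values from a standard normal distribution

qqnorm(x)  # sample quantiles (y-axis) against standard normal quantiles (x-axis)
qqline(x)  # reference line; normally distributed data fall close to it
```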
In Figure 4-4 examples of histograms and Q-Q plots for 250 data points that are
normally distributed (green), left-skewed (red) and right-skewed (red) are displayed. Data
that look like the left-skewed or right-skewed examples are not suitable for analysis with
a t-test and should be transformed before analysis (Transformations). If this does not
work a non-parametric test should be used (Non-parametric alternatives).
Figure 4-4 Histogram and Q-Q plot for normally-distributed,
right-skewed and left-skewed data.
However, for smaller datasets, considerable deviations from the Q-Q line are expected in
normally distributed data, especially at the extremes. In Figure 4-5 you see three
examples of histograms and corresponding Q-Q plots for five and ten values sampled
from a normal distribution. If your data looks like this you can use t-tests!
Figure 4-5 Example Q-Q plots for small datasets of normally distributed data.
ALL samples were drawn from a standard normal distribution.
Transformations
To obtain normally distributed data for further analysis the following transformations
(Table 4-1) are recommended in most cases.
Most of these transformations, with the exception of the exponential transformation for
left-skewed data, are not defined mathematically for values smaller than zero. You may
need to add a constant value to all values in order to perform the transformation. A
constant value can be added element-wise to a vector, for example my.vector + 1
(see also Basic calculations and data display in R). You may need to try out several
different transformations. You can apply a transformation directly within other
commands, for example within hist() or qqnorm(). If you
are still not satisfied with the distribution, please use a non-parametric test (see Non-
parametric alternatives).
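For instance (my.vector again a placeholder for your own data):

```r
my.vector <- c(1.2, 1.5, 2.0, 2.8, 4.1, 9.5, 15.3)  # hypothetical right-skewed data

hist(log(my.vector))     # transformation applied directly within another command
qqnorm(sqrt(my.vector))  # works the same way in any plotting or test command

# add a constant first if the data contain zeros
log(c(0, 2, 5, 11) + 1)
```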
At first sight, transformations may appear as "distorting the data". However, one can also
think of transformations as measuring on other scales than the linear one, for example,
measurements could also be taken with a "logarithmic ruler". Nonetheless, transformations
can be problematic in specialized investigations of the relationship between variables.
Back-transformation
Descriptive statistics, such as the mean and the standard error of the mean, are not
meaningful when calculated from transformed data and must be back-transformed to the
original scale before they are reported.
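A sketch of back-transformation after a log-transformation (hypothetical data):

```r
x     <- c(3.1, 4.5, 5.2, 6.8, 9.9)  # hypothetical measurements
log.x <- log(x)

m  <- mean(log.x)
se <- sd(log.x) / sqrt(length(log.x))  # standard error on the log scale

exp(m)                  # back-transformed mean (the geometric mean of x)
exp(c(m - se, m + se))  # back-transformed standard error interval
```

Note that the back-transformed interval is asymmetric around the back-transformed mean.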
We have obtained data on the height of poplar trees at two different distances from a
river, 5-20 m and 20-40 m. You find this data in the attached file. The data has two
columns: one factor giving the distance group and one with the height measurements.
The first thing we do is look at the data using a boxplot (Figure 4-6).
Figure 4-6 Boxplot of height in poplars
growing at different distances from a river.
There seem to be many outliers of larger values in both samples. Also, one sample
varies more than the other. Let's look at this further using a Q-Q plot for each
sample. In the code below, the two samples are selected using a logical
statement within square brackets; more on
how to select data will be explained in the next chapter.
Figure 4-7 Q-Q plot of height in poplar trees growing at two distances from a
river.
As suspected, these data do not conform to the normal distribution and appear to have a
right skewed distribution (compare Figure 4-4). We try out a log-transformation and plot
Q-Q plots of the log-transformed data.
This indeed looks much better! We look at a boxplot of the transformed data as well
(Figure 4-9).
Figure 4-9 Boxplot of log-transformed
height in poplar trees growing at different
distances to a river.
Variances appear different. Let's conduct the test without assuming that variances are
identical.
These results indicate that poplar trees that grow farther from the river are not
significantly different in height from those that grow close to the river. This can also
be seen from the confidence interval of the difference between means that includes zero.
Means and standard errors reported with this analysis need to be back-transformed.
This corresponds well with the center of the data in the boxplot of the original data
(Figure 4-6).
Back-transformed means can be reported together with the standard error interval,
usually rounded to two digits after the decimal point, 5.50 [4.91, 6.16] m for trees far
from the river and 4.27 [3.89, 4.68] m for trees closer to the river. These two means are
not statistically different as we have seen above. The P-Value of 0.085 is rather
small though, so this might also be a case where the decision is not entirely clear.
One-sample test
Paired-sample test
Non-parametric alternatives
In certain cases, it can be impossible to meet the assumption of normality required for
standard t-tests, even after transformation. In such cases a non-parametric alternative
such as the Wilcoxon family of tests can be used. These include two-sample (also named
Mann-Whitney U test), one-sample and paired-sample alternatives, all available through
the command wilcox.test(). The syntax of wilcox.test() is very similar to that
of t.test().
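A minimal sketch with hypothetical non-normal data (the formula syntax mirrors t.test()):

```r
# Hypothetical skewed measurements in two groups
values <- c(1, 2, 3, 5, 16, 18, 4, 6, 7, 9, 11, 30)
group  <- factor(rep(c("A", "B"), each = 6))

wilcox.test(values ~ group)  # two-sample (Mann-Whitney U) test
wilcox.test(values, mu = 5.2)  # one-sample test against a hypothesized median
```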
Pearson Correlation
One of the most common correlation analyses is the Pearson product-moment
correlation with the correlation coefficient Pearson's r. Pearson's r ranges from -1
(perfect negative correlation, one variable increases as the other decreases) to 1 (perfect
positive correlation, one variable increases as the other increases). A value of zero
indicates no correlation. You can obtain Pearson's r and test whether it differs
significantly from zero using the command cor.test(). If Pearson's r differs
significantly from zero we can infer that the two variables are significantly associated.
This test assumes the data conform to the normal distribution (see Assessing normality
using quantile-quantile plots (Q-Q plots)).
This analysis shows that sepal length and petal length are strongly and positively
associated (Pearson's r = 0.87) and this association is highly significant, with a P-value
smaller than 2.2 * 10^-16 (displayed as p-value < 2.2e-16).
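The numbers quoted here correspond to R's built-in iris dataset, so the analysis can be sketched as:

```r
# Pearson correlation between sepal length and petal length (built-in iris data)
cor.test(iris$Sepal.Length, iris$Petal.Length)
```

The output reports the correlation (cor), the t-statistic with its df, the P-value and a 95% confidence interval for r.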
Spearman Correlation
If the data do not conform to a normal distribution, a non-parametric alternative to
Pearson's r, for example Spearman's rank correlation coefficient (Spearman's rho), can be
used. Like Pearson's r, Spearman's rho measures the level of association between two
variables and ranges from -1 to 1. Spearman's rho is calculated using the rank order of
the data rather than the raw values.
Spearman's rho can also be obtained with the cor.test() command (argument
method = "spearman"). Below we repeat the correlation analysis above using Spearman's rho.
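Using the same built-in iris data:

```r
# Spearman's rho for the same pair of variables (built-in iris data)
cor.test(iris$Sepal.Length, iris$Petal.Length, method = "spearman")
```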
This test produces a very similar, but not identical, result compared with the test
based on Pearson's r above.
4.3 Cross-tabulation and the χ²-test
Apart from continuous data that we have treated thus far, biological experiments can
produce categorical data or counts. For example, in an experiment on the inheritance
of eye color in flies, eye color (red or white) was recorded in two groups of flies (A and
B). Here we want to test whether eye colors differ between groups. The data is available
in the file flies_Ch.4.csv (attached to this pdf).
The data is organized in 100 rows, one for each fly, and two columns, one for the group
and one for the eye color; both are factors. The entries are the group (A or B) and the eye
color (red or white). We use the command table() to summarize how many flies of
each eye color were found in each group. This kind of table is referred to as a 2 x 2
contingency table.
In group A, 34 flies have red eyes and 16 flies have white eyes. In group B, 41 flies have
red eyes and 9 flies have white eyes. Alternatively, you could enter such data directly (it's
only four numbers here!) as a matrix.
We now want to test whether the eye color differs between groups or, in statistical terms,
we test the null hypothesis that eye color is independent of the group. For this we use
the χ²-test (read "chi-square"), which compares the test statistic X² to the χ²
distribution. The main argument of the command chisq.test() is the contingency
table that we created above.
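With the counts given above, the table can be entered and tested directly:

```r
# Contingency table entered directly: rows = eye color, columns = group
flies <- matrix(c(34, 16, 41, 9), nrow = 2,
                dimnames = list(color = c("red", "white"), group = c("A", "B")))
flies

chisq.test(flies)  # test of independence (Yates' continuity correction for 2 x 2 tables)
```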
This test shows that eye color and group are not associated (the P-value is > 0.05, not
significant) and we conclude that eye color does not differ between groups. The χ²-test
can also be used for larger contingency tables with more than two categories.
Summary
t-tests compare the means of two groups, or a single mean to a hypothesized mean.
Q-Q plots are used to assess normality.
Non-parametric tests should be used on data that is non-normal even after
transformation.
Correlation analysis tests whether two continuous variables are associated.
The χ²-test is used to test whether categorical variables are associated.
4.4 Exercise 4: Basic statistics with nice datasets
(a)
(b)
(c)
(d)
(e)
4-C Willow shrub responses to herbivory
Use the data in the attached file.
(a)
(b)
(a)
(b)
(c)
5 Handling larger and nastier datasets
Goal
In this section you will learn how to
Our example vector has six elements. We can access its third element using the vector name followed by
square brackets and the element number. This line will bring up the third element:
Should we realize that this element needs to be changed from 2.8 to 3.0 we can do that
using the assignment arrow:
We can access the elements of data frames in the same way, except that data frames have
two dimensions, rows and columns. These are accessed by two numbers, separated by a
comma, within square brackets ([row, column]). The first number always refers to rows, the
second to columns. To access the element in the third row and the second column of
our data frame from before (Checking data structure) we use
This is the same measurement as in the vector example above. To change it to 3.0 in the
data frame we use the same kind of assignment operation:
Entire rows and columns of data frames can be accessed by leaving the column (or row)
number empty in the square brackets. Note that the comma must always be entered
because data frames have two dimensions. Accessing rows and columns is needed to
conduct analyses and to make changes or calculations. The following line
brings up the entire third row, for example to check that plant's measurements. The 3
before the comma is the row number.
Alternatively, column names can be used in place of the numbers. R has a special
notation for columns involving the dollar sign as we have seen earlier. The following line
will also bring up the third column.
Column names can also be entered in quotes directly within the square brackets (note the
comma!).
Should you now realize that the width measurements all need to be increased by 0.2, you
can do that in any of the following ways:
OR
OR
Which of these options is most convenient depends on your column names, the size of
your data file and your preferences.
Calling the structure command shows that the column has been added and is numeric.
Deleting one or several columns can be done using the minus sign within the square
brackets. This only works with column numbers, not with column names. This line
removes the newly added width-length ratio column (output not shown):
Removing rows, for example, when you realize that measurements of an entire row are
faulty, works in the same way. This line removes the first row (observe the placement of
the comma!).
If you need to remove more than one column, use the command c() within the square
brackets:
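The operations described in this section, sketched on a small hypothetical data frame:

```r
# Hypothetical plant measurements
plants <- data.frame(length = c(4.1, 5.2, 3.8),
                     width  = c(2.1, 2.8, 2.5),
                     height = c(10.2, 12.5, 9.8))

plants[3, 2]          # element in row 3, column 2
plants[3, 2] <- 3.0   # change that element
plants[3, ]           # entire third row (note the comma)
plants$width          # a column via the dollar sign
plants[, "width"]     # the same column via its name in brackets

plants$width <- plants$width + 0.2  # increase all widths by 0.2
plants[, -3]          # drop the third column
plants[-1, ]          # drop the first row
plants[, -c(2, 3)]    # c() drops several columns at once
```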
Subsetting data
There are many situations where only a specific subset of the data needs to be used. In R,
this is done by entering a so-called logical statement (see below) in the square brackets.
For example, if you want to select only the flowering plants in the
data frame, you use a logical statement on the column containing the flowering status
for row selection. Assigning the result to a new name
will produce a new data frame containing only the flowering
plants.
In words, this statement means something like "check for each element
of the column whether it reads "flowering" or not". If you execute only
the logical statement, you obtain a vector with the same number of elements as
the column, six; the first three are TRUE (corresponding to the flowering
plants) and the last three are FALSE (corresponding to the vegetative plants).
When you use such statements for row selection, all rows corresponding to TRUE will be
selected, in this case the first three rows. TRUE and FALSE are logical values and
constitute a special data type in R, logical data. TRUE evaluates to 1 and FALSE to zero.
Thus, when you sum a logical vector, the number of TRUE elements will be returned.
Note that R does not assume that you will use only columns of the same data frame here;
in fact, you can also use columns from other data frames or vectors. For this reason, you
need to write the column name together with its data frame (using the dollar sign) and
not the column name alone.
== identical
!= not identical
Here are some more examples:
Selecting plants with leaf widths over 4.0 can be done with:
Data selection can also be done with the subset() command. The first argument
of subset() specifies the data frame to subset. The second argument is a logical
expression, as explained above, to select rows. The third argument indicates the columns
to be selected using their names. If several columns are selected, the names are
combined into a vector. If you only want to omit one column, use a minus sign in front
of the column name. The lines below
will create a new data frame containing only the rows with the flowering plants and all
columns except the omitted one. The advantage of the subset() command
is that you can refer to columns within a data frame directly, without the dollar sign.
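A sketch with hypothetical column names:

```r
plants <- data.frame(status = factor(c("flowering", "flowering", "vegetative")),
                     length = c(4.1, 5.2, 3.8),
                     width  = c(2.1, 2.8, 2.5))

# rows: flowering plants only; columns: everything except width
flowering.plants <- subset(plants, status == "flowering", select = -width)
flowering.plants
```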
Summary
Individual data entries, rows and columns can be accessed and changed using their
row and column subscripts.
Data can be subset using logical statements involving logical operators such as
== , != , | and &.
Subsetting of data frames can be done either with row and column subscripts in
square brackets or with the subset() command.
5.2 Dealing with missing Values
Goal
In this section you will learn how to
Most real datasets end up containing missing values. This is often because certain
experimental units, such as animals or plants, were not available for measuring because
they died or could not be observed or measured for other reasons. Moreover, technical
failures of the equipment or inevitable mistakes of the researcher commonly result in
missing values. In fact, entirely complete datasets are very rare in biological experiments,
especially those conducted in natural populations or in the field. For this reason, it is
important to be able to deal with missing values and to assess, based on the research
questions, how missing data may affect the results.
In this case, the output is NA, because the command cannot be executed due to the NA in
the data. Setting the optional argument na.rm (for "NA remove") to TRUE tells R to
remove NAs before the calculation. TRUE can be abbreviated to T.
The commands cor() for correlation and cov() for covariance ignore NAs with the
argument use, more specifically use = "complete.obs".
The command is.na() can be applied to any data structure or part thereof. It
returns a logical value for each element of the data, TRUE
for both NA and NaN and FALSE for other entries. To find only NaN, use is.nan().
The number of missing values can be obtained by summing the logical vector, for
example sum(is.na(x)). Data entries can also be set to missing with the assignment
arrow; assigning NA to row 2, column 3 of a data frame, for instance, will set that entry
to NA. Note that these changes are made to the data frame
object stored in R's current workspace, NOT to your original data file.
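A sketch of these commands (the vector and its missing values are hypothetical):

```r
x <- c(2.1, NA, 3.4, 5.0, NA)  # hypothetical vector with missing values

mean(x)                # NA: missing values propagate
mean(x, na.rm = TRUE)  # NAs removed before the calculation

is.na(x)         # TRUE for each missing element
sum(is.na(x))    # number of missing values (TRUE counts as 1)
which(is.na(x))  # positions of the missing values

x[2] <- 2.9  # replace a missing value by assignment
```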
Summary
Missing data types in R are NA (not available, to be used in data tables) and NaN
(not a number).
Many commands have optional arguments to deal with missing values; for
example, na.rm = TRUE will tell R to ignore missing values in mean()
and other basic commands.
The command is.na() is used to identify NA and NaN.
sum(is.na(x)) will return the number of missing values in the data
and which(is.na(x)) will return the subscript numbers of the elements that
are NA or NaN.
Data entries can be set to NA with the assignment arrow.
Iris
(a)
(b) Iris
Hint
6 Linear models
Goals
In this section you will learn how to
Linear models generally relate the variation explained by the fitted model ("explained
variation") to the remaining variation around the model ("unexplained variation" or
residuals). The ratio of these two, the so-called F-ratio, is the test statistic and is
compared to its theoretically derived distribution, the F-distribution, in order to determine
model significance. For example, a model for comparing three groups fits three means to the data. A
example, a model for comparing three groups fits three means to the data. A
regression analysis fits a slope and an intercept to the data. For group comparisons
(ANOVA), the explained variation is expressed as a sort of average squared distance
from the fitted means referred to as the mean square or MS. For n data points in each of
k groups the MS for the explained variation is obtained by summing up the squared
distances from the group means to the overall mean (Figure 6-1) and dividing this sum
by k − 1. The MS of the (unexplained) residual variation is expressed as the sum of
squared distances from the group means to the data points (Figure 6-1) divided by
k(n − 1). The ratio between these two MS is the F-ratio, which is tested against the F-distribution
with k-1 and k(n-1) degrees of freedom, corresponding to the denominators of the two
MS calculations. From this, the P-Value is obtained. For regression analyses, the MS are
calculated in a similar way from the fitted regression line (Figure 6-2) with model df of 1
and residual df of n-2. The information on df, sum of squares, MS, F-ratio and P-values
is commonly presented as an ANOVA table (Table 6-1). The second important output of
linear models is information on the fitted values, such as means, slopes and intercepts
with their standard errors. More complicated models follow a similar reasoning.
Note that the F-distribution depends on two df, not on one df as the t-distribution does.
F-ratios over four tend to be significant and indicate that the variation explained by
the model is more than four times larger than the unexplained variation.
The distances from the fitted models (Figure 6-1, Figure 6-2) are referred to as the
residuals. These are a very important quantity for testing model assumptions (see next
section, Assumptions and diagnostic plots). The estimates of means, slopes and
intercepts are actually obtained by mathematically minimizing the sum of the squared
residuals. For this reason, linear models are also referred to as a "least squares" method.
Table 6-1 Example of an ANOVA table for a comparison of three groups (k=3) with four
units each (n=4).
Analysis of residuals is thus a key step when conducting linear model analyses. Residuals
are analyzed graphically after fitting the model. Whether or not experimental units are
independent (assumption 1) depends on the experimental design and it should be known
by the experimenter. Constancy of variances (assumption 2) is checked using a plot of
the residuals against the fitted values. This plot is also referred to as the Tukey-
Anscombe plot (Figure 6-3, left).
Figure 6-3 Example of a Tukey-Anscombe plot (Residuals vs Fitted) and a Q-Q plot
of the residuals for a linear model. These plots indicate that the model satisfies the
assumptions and can be used (see text).
If the variances are constant along the fitted values of the regression line or among
groups we expect a random scatter around zero in the Tukey-Anscombe plot (Figure
6-3). If variation of residuals strongly increases or decreases with the fitted values the
Tukey-Anscombe plot shows funnel-like patterns (Figure 6-4). If you obtain such a
diagnostic plot or any other strong patterns on the Tukey-Anscombe plot the data often
needs to be transformed before a new model can be calculated (Transformations).
Normality of the residuals (assumption 3) is tested using a Q-Q plot of the residuals that
we have already used (see Assessing normality using quantile-quantile-plots (Q-Q plots)).
Transformations may also be appropriate when the residuals are not normally
distributed; in particular, the log-transformation is commonly applied in group
comparisons of measurement data. For regressions, the use of transformations changes
the relationship between the variables and this may or may not be appropriate depending
on the study questions.
Figure 6-4 Tukey-Anscombe plot (Residuals vs Fitted) and Q-Q plot of the residuals of
an unsuitable model. Note that the Tukey-Anscombe plot has a funnel shape, with
residuals increasing with larger fitted values. Here, log-transformation of the response
can be used to improve the model.
Figure 6-5 Workflow for linear models.
The explanatory (grouping) variable must be a factor.
Note that if there are only two groups, a t-test can be used (same result).
Two-way ANOVA
and
Linear regression
6.3 One-way ANOVA: worked example
To test whether fruit production differs between populations of Lythrum, fruits were
counted on 10 individuals in each of 3 populations.
The boxplot shows reasonably symmetrically distributed data and suggests some
differences between groups. As a next step, we calculate a linear model with fruit number
as the response variable and population as the explanatory variable. We specify the data
frame using the data argument and assign the model to an object.
The model is now stored in that object. We will later extract the results from this
object, but first we need to analyze the residuals. R provides both the Tukey-Anscombe
plot and the normal Q-Q plot of the residuals through the command plot(), with the
name of the model object as the first argument. These two plots are actually the
first two, and the most important, of six diagnostic plots and we concentrate on these for
now. To bring up only the first two plots we add the argument which = 1:2 to
plot(). We also use par(mfrow = c(1, 2)) to change the plotting parameters such
that both plots are displayed next to each other in a single plotting window. This change
of plotting parameters remains until changed again. Thus, if you want a single plot the
next time, you need to use either par(mfrow = c(1, 1)) or close the plotting window
with the command dev.off(), so that a new, default plotting window opens.
Both diagnostic plots look fine and we proceed to extract the ANOVA table. This is done
with the command anova(), which also uses the model object name as its argument.
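The whole sequence can be sketched with simulated data standing in for the Lythrum file (object and column names are illustrative):

```r
set.seed(42)  # simulated stand-in for the Lythrum fruit counts
lythrum <- data.frame(pop    = factor(rep(c("1", "2", "3"), each = 10)),
                      fruits = c(rnorm(10, 20, 4), rnorm(10, 30, 4), rnorm(10, 32, 4)))

model.1 <- lm(fruits ~ pop, data = lythrum)  # fit the linear model

par(mfrow = c(1, 2))        # two plots side by side
plot(model.1, which = 1:2)  # Tukey-Anscombe plot and Q-Q plot of the residuals

anova(model.1)  # extract the ANOVA table
```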
In this ANOVA table we see one row for the effect of populations and another for the
residuals (compare Table 6-1). Degrees of freedom, sums of squares, mean squares, the F-
ratio (F value) and the P-value (Pr(>F)) are given. This analysis shows that the three
populations differ significantly in fruit production. The P-value is 6.7 * 10^-6 (displayed
as 6.7e-06), thus it is very unlikely that this result is due to chance. Indeed, variation
explained by the model is 19 times larger than the residual variation (F-value = 19.1).
This analysis does not, however, tell us whether fruit number differs significantly between
all three populations or whether two are more similar to each other. This can be
determined using a Tukey test. This is a so-called post-hoc ("after the fact") test that
is applied only if the ANOVA shows a significant overall effect. The Tukey test
compares all pairs of groups (here populations) with each other. This is referred to as
multiple pairwise comparisons. Manual calculation of multiple t-tests would be invalid as
all pairwise comparisons are not independent. If we, for example, already know that A is
much larger than B and B is much larger than C, it also follows (without looking at the
data) that A is larger than C. The Tukey test adjusts the significance level to take account
of this issue.
One convenient way to conduct the Tukey test after a linear model is with the
command HSD.test() from the package agricolae. You first need to install this
package: you can do this in the RStudio Packages tab in the lower-right pane; search for
agricolae and install it. After that, the package is loaded using library(agricolae).
You need to load the package each time you want to use it.
The command HSD.test() has the model object as its first argument, followed by
the argument trt to specify the grouping factor ("treatment") to use. You can also
use this argument for models with multiple grouping factors. Further arguments,
group = TRUE and console = TRUE, result in a convenient output.
The output starts by repeating the response and the grouping factor
used. It further gives information on the pairwise comparisons together with the
group means, their sd, sample size and range. Here, the honestly significant difference
specifies that any difference between group means larger than 2.56 is significant. The
output ends with repeating the treatment means (referring to populations 1, 2 and 3 here)
together with a groups column. Means with the same letter in this column are not
significantly different. We can conclude that the means of populations 2 and 3 are not
significantly different from each other whereas both are significantly different from
population 1. The letters in the groups column are often displayed in results graphs. In
the biological sciences, it is common to report the standard error of the mean together
with the mean. We can obtain the means and standard errors using tapply().
Means and standard errors can be reported either in the text (as mean ± standard error)
or in barplots (see Basic graphs with R).
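If you prefer to stay within base R, the command TukeyHSD() applied to an aov() fit performs the same multiple pairwise comparisons (a sketch with simulated data; names are illustrative):

```r
set.seed(3)  # simulated stand-in for the fruit counts
fruits <- c(rnorm(10, 12, 3), rnorm(10, 20, 3), rnorm(10, 21, 3))
pop    <- factor(rep(c("1", "2", "3"), each = 10))

fit <- aov(fruits ~ pop)  # same linear model, stored as an aov object
TukeyHSD(fit)             # adjusted P-values for all pairwise comparisons
```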
We see that both irrigation and radiation are factors with 2 levels each and that we have
10 measurements in each of the four combinations of levels (drought and high radiation,
drought and low radiation, normal irrigation and high radiation, normal irrigation and
low radiation).
The next step is to create an exploratory graph. Here, it is most informative to display all
four combinations of factor levels (Figure 6-8). We can do this by creating another
column that combines the two factor levels (for example with interaction()).
Figure 6-8 Boxplot of the peas data (see text).
In the plot, the data seem reasonably symmetrical within groups, without any extreme outliers. We can go ahead and calculate the linear model with an interaction term. Interaction terms should always be included in initial models. If the interaction term is not significant, it is sometimes dropped from the final model; however, this usually has little effect on the results. We also move ahead and produce the diagnostic plots of the residuals.
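The model and diagnostic plots can be sketched like this (again with a simulated stand-in for the peas data frame, since the original is not reproduced here):

```r
# Simulate a data frame comparable to the peas dataset
peas <- expand.grid(irrigation = c("drought", "normal"),
                    radiation  = c("high", "low"))
peas <- peas[rep(1:4, each = 10), ]
peas$seeds <- rnorm(40, mean = 20, sd = 3)

# 'A * B' expands to 'A + B + A:B', i.e. both main effects
# plus their interaction
model <- lm(seeds ~ irrigation * radiation, data = peas)

# Diagnostic plots: Tukey-Anscombe and normal Q-Q plot
par(mfrow = c(1, 2))
plot(model, which = 1:2)
```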
Figure 6-9 Diagnostic plots for a two-way ANOVA on the peas data.
The diagnostic plots of residuals both look consistent with the model assumptions. We proceed to extract the ANOVA table from the model.
The ANOVA table shows that the interaction is significant. This means that the effect of
irrigation depends on that of radiation. In the exploratory boxplot (Figure 6-8) we see that,
indeed, seed number is reduced by radiation under drought but not under normal
watering. Because of this, it is not possible to interpret the overall effect of irrigation and
radiation from the model we just calculated. To find out more, this analysis should be
followed by splitting the dataset into the two levels of one of the grouping factors, either
into the two irrigation levels or into the two radiation levels. After that, the effect of the second grouping factor is tested within each level of the first one. In this example, we can
test for the effect of radiation on drought-treated plants and, separately, on normally
watered plants. Alternatively, we could test for the effect of drought on plants that received
high light radiation and, separately, on plants that received low radiation. Which split of
the data is most useful depends on the questions and interests of the researcher. However,
the data should not be split both ways. The analysis can be completed by calculating
means and standard errors. Usually, both the overall ANOVA with the significant
interaction term and the separate ANOVA results on the split datasets are reported. Such
analyses are often accompanied by grouped barplots (see Basic graphs with R).
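The split-and-retest step described above can be sketched as follows ('peas', 'seeds', 'irrigation' and 'radiation' are hypothetical names standing in for the book's dataset):

```r
# Simulate a data frame comparable to the peas dataset
peas <- expand.grid(irrigation = c("drought", "normal"),
                    radiation  = c("high", "low"))
peas <- peas[rep(1:4, each = 10), ]
peas$seeds <- rnorm(40, mean = 20, sd = 3)

# Split the data by irrigation level
drought <- subset(peas, irrigation == "drought")
normal  <- subset(peas, irrigation == "normal")

# Separate one-way ANOVAs testing radiation within each level
anova(lm(seeds ~ radiation, data = drought))
anova(lm(seeds ~ radiation, data = normal))
```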
Figure 6-10 Plot of ozone levels against wind speed (airquality internal dataset).
We see that ozone levels decrease with wind speed. The relationship appears slightly curved. We first try a linear model with the untransformed data and analyze the residuals (Figure 6-11).
In the Tukey-Anscombe plot, we see that the residuals increase with larger fitted values (funnel shape). This indicates either that the model does not fit the data, such that the fitted line is farther away from the data at higher ozone values (fitted values), or that there is more variation in measurements corresponding to higher ozone levels. In the normal Q-Q plot we do not see a strong pattern at higher ozone levels, and neither do we in our initial data plot (Figure 6-10). Let's plot the fitted line on the data to understand this better. We do that by producing the initial plot again and then adding the regression line to it using abline().
Indeed, it appears that the model does not fit the higher ozone values very well. Next, we explore a log transformation of the response. We calculate a new model and plot the diagnostic plots (Figure 6-13). We also plot the fitted regression on a plot of log-transformed ozone values against wind speed (Figure 6-14).
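A sketch of these steps with the built-in airquality dataset; the model names follow the text's model.2, but the exact code in the original may differ:

```r
# Untransformed and log-transformed models of ozone on wind speed
model.1 <- lm(Ozone ~ Wind, data = airquality)
model.2 <- lm(log(Ozone) ~ Wind, data = airquality)

# Diagnostic plots of the log-transformed model
par(mfrow = c(1, 2))
plot(model.2, which = 1:2)

# Fitted regression line on the log scale
par(mfrow = c(1, 1))
plot(log(Ozone) ~ Wind, data = airquality)
abline(model.2)
```

Rows with missing Ozone values are dropped automatically by lm().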
Figure 6-13 Diagnostic plots of the log-transformed model.
These diagnostic plots look much better and can be accepted. Nonetheless, there are still mostly positive residuals on the left side of the graph; however, there are fewer measurements in that region. We could also try log-transforming both the response and the explanatory variable wind speed. What sort of transformation is appropriate depends on the functional relationship between the variables and on the aims of the study. If accurate prediction of ozone levels is the aim, the best-fitting model is interesting as it leads to the best predictions. If the study is about testing particular theories on functional relationships, a model that reflects the hypothesized relationship should be used. It is possible that linear models cannot be used in some of these cases (see Web resources and books on R for books that include more advanced modeling techniques). We accept model.2 for now and extract the results.
The ANOVA table indicates that wind speed has a significant effect on ozone levels. Apart from the ANOVA table, we can extract the slope and the intercept of the regression line using the summary() command.
The output first reiterates the model formula and gives information on the residuals. The coefficients part gives the slope of the regression line as -0.13 with a standard error of 0.019 and the intercept as 4.70 with a standard error of 0.200. The table also reports t-tests of the hypothesis that these coefficients are zero. For the intercept, this is not informative in our example. The test of the slope reiterates the F-test in the ANOVA table. In this simple case, with one explanatory variable, the squared t-value equals the F-ratio ((-6.822)² = 46.539) and the P-values of both tests are identical. In more complicated cases, with several explanatory variables, the default settings will produce ANOVA analyses with sequentially added explanatory variables, such that the results can depend on the order of the explanatory variables in the model.
The output ends by stating R² and adjusted R² and repeating the overall model test that we have already seen in the ANOVA table. Adjusted R² is used to express the percentage of the variation in the response that is explained by the model. In our example, 28% of the variation in log-transformed ozone levels is explained by wind speed.
We have now covered the basic reasoning and coding steps involved in calculating and
interpreting linear models. More complicated models are described in the sources listed
in Web resources and books on R.
Summary
Linear models are used for various analyses of continuous responses, including analysis of variance (ANOVA) and regression.
The workflow involves exploring the data through a plot, stating the model, analyzing the residuals to test whether the assumptions are met, and obtaining and interpreting the results.
The assumptions are independent experimental units, constant variance of the residuals and normal distribution of the residuals.
A statistical interaction implies that the effect of one explanatory variable depends on another explanatory variable.
Linear model analyses in R involve the commands lm(), plot(), anova() and summary().
6.6 Exercise 6: Linear Models
6-A An analysis of the iris dataset
Please make sure that you follow the work-flow.
(a) Produce exploratory graphs.
(b)
(c)
(d)
Ex_6B_Birch.csv
(c)
(d)
6-C Mating preference in earwigs
Is courtship duration determined by the size of the female? Use the dataset Ex_6C_Earwigs.csv with measurements of female size (mm) and courtship duration.
(a) P
(b) a regression
(c) the
(d)
7 Basic graphs with R
7.1 Bar-plots
Goal
In this section, you will learn how to script a grouped barplot of means with standard
errors indicated as t-bars in R (Figure 7-1) and how to adjust the layout of this type of
graph.
How to do it
We are going to use the internal dataset ToothGrowth, which contains measurements of tooth length in guinea pigs that received three levels of vitamin C and two supplement types (Figure 7-1). To explore this dataset, you can use the data exploration commands introduced earlier. We want to produce a barplot of the mean tooth length for all six combinations of the two factors (supplement type: 2 levels, dose: 3 levels). We first need to calculate the mean tooth length for each of the combinations. For this, we use the tapply() command. tapply() returns mean tooth lengths for all six combinations in the right format for the barplot() command that is later used to produce the plot. We assign these means to an object and display them in the console at the same time by using parentheses around the entire command.
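This step can be sketched as follows (the object name 'means' is an illustrative choice, not necessarily the one used in the original script):

```r
# 'dose' is numeric in the raw ToothGrowth data, so we treat it
# as a factor for the grouped barplot
ToothGrowth$dose <- factor(ToothGrowth$dose)

# Mean tooth length for all six supplement-by-dose combinations;
# tapply() returns a 2 x 3 matrix suitable for barplot().
# The outer parentheses assign and print at the same time.
(means <- tapply(ToothGrowth$len,
                 list(ToothGrowth$supp, ToothGrowth$dose), mean))
```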
We can customize the layout of the barplot using further arguments; otherwise default options will be used. The labels of the axes can be specified with the arguments xlab and ylab, and the labels below each group of bars are controlled with the argument names.arg. The font size of these labels can be changed with cex.lab and cex.names. These arguments are set to 1 by default and changes are relative to this default. For example, a value of 2 will double the font size. The limits of the y-axis are specified with ylim. Here we use a range from zero to the maximum in the dataset. The orientation of the axis labels can be altered with the argument las, which has four options (0, 1, 2, 3). Here, las = 1 produces horizontal axis labels. You can try out the other options as well. The colors of the bars are determined by col, in our example by a vector with a length of two for the two groups, specifying 1 (black) and 8 (grey). The color can be specified either with numbers (1 to 8) or with the color name. An overview of all available color names is brought up by colors(). You can further explore colors at http://research.stowers-institute.org/efg/R/Color/Chart/. With the above adjustments, the command looks like this:
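A sketch of such a customized call; the axis labels and sizes below are illustrative choices, not necessarily the original script's exact values:

```r
ToothGrowth$dose <- factor(ToothGrowth$dose)
means <- tapply(ToothGrowth$len,
                list(ToothGrowth$supp, ToothGrowth$dose), mean)

barplot(means,
        beside = TRUE,                      # grouped, not stacked
        xlab = "Dose (mg/day)",
        ylab = "Tooth length",
        names.arg = c("0.5", "1", "2"),     # labels below bar groups
        cex.lab = 1, cex.names = 1,         # relative font sizes
        ylim = c(0, max(ToothGrowth$len)),  # y-axis range
        las = 1,                            # horizontal axis labels
        col = c(1, 8))                      # black and grey bars
```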
The next step is to add error bars to the barplot. There is no standard command to add error bars. Instead, we have to draw them ourselves with the command arrows(). First, we need the standard error of the mean for all six groups. We do this in the same way as calculating the means: we use tapply() but ask for calculation of the standard error. Here we do that by creating a custom function within tapply() (more on functions in the next chapter). Besides the length of the error bars, we also need the horizontal locations of the bars, such that the error bars end up in the middle of the bars. These midpoints, in the same format as the means above, can be extracted from a basic barplot() call, which we assign to an object. We can use plot = FALSE to suppress the plotting. Without that argument, the plot is created and the midpoints saved at the same time.
Now we are ready to draw the error bars using the command arrows(). Within this command, we first state the position of the error bars by two sets of x and y coordinates corresponding to the start and end of the error bars. The starting coordinate set identifies the six midpoints as x coordinates and the means minus the standard errors as the six y coordinates; the second set, with the means plus the standard errors, is used for the end of the error bars. We further use the arguments angle = 90 and code = 3 such that we get bars with T's on both ends and not arrows. The arguments length and lwd set the size of the T's and the line width of the entire error bars.
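The whole error-bar procedure can be sketched like this (object names are illustrative; the ylim value is an assumption chosen to leave room for the bars):

```r
ToothGrowth$dose <- factor(ToothGrowth$dose)
means <- tapply(ToothGrowth$len,
                list(ToothGrowth$supp, ToothGrowth$dose), mean)

# Standard errors via a custom function inside tapply()
ses <- tapply(ToothGrowth$len,
              list(ToothGrowth$supp, ToothGrowth$dose),
              function(x) sd(x) / sqrt(length(x)))

# Bar midpoints; plot = FALSE returns them without drawing
mids <- barplot(means, beside = TRUE, plot = FALSE)

barplot(means, beside = TRUE, ylim = c(0, 35), col = c(1, 8))
arrows(mids, means - ses,     # start of each error bar
       mids, means + ses,     # end of each error bar
       angle = 90, code = 3,  # T-ends on both sides
       length = 0.05, lwd = 1.5)
```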
We can further use the command legend() to add a legend with our groups to the graph. We can specify the place of the legend in the graph either with x and y coordinates or with keyword options such as "topleft" (see ?legend). The argument fill produces boxes with the specified colors to place next to the legend text. The argument bty determines whether a box is drawn around the legend (default: bty = "o", with box); here, bty = "n" removes the box. The font size in the legend is determined by cex, as explained above.
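A minimal sketch of such a legend call (the position and group labels are illustrative assumptions):

```r
# A small grouped barplot to attach the legend to
barplot(matrix(c(20, 25, 15, 18), nrow = 2),
        beside = TRUE, col = c(1, 8), ylim = c(0, 30))

legend("topleft",                     # keyword position; or x, y coords
       legend = c("Group A", "Group B"),
       fill = c(1, 8),                # color boxes next to the text
       bty = "n",                     # no box around the legend
       cex = 1)
```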
With this code we obtain the graph displayed in Figure 7-1. There are many more details of the plot that can be controlled and changed. For an overview of the graphical parameters that can be changed by arguments, check out ?par.
Plots can be saved through the RStudio menu ("Export" button on the plot tab).
Summary
7.2 Scatterplots
Goal
In this section, you will learn how to script a grouped scatterplot with regression lines (Figure 7-2) and how to adjust the layout of this type of graph.
How to do it
To produce a scatterplot, we will use the command. is a higher-level
plotting command that it will create a new graph.
We are going to use part of the internal dataset (available with the R installation) as
an example (Figure 7-2). i contains flower measurements of three ,ULV species. You
can explore the dataset by . We reduce the dataset to two species and and create an
initial plot (not shown).
We can now assign different plotting symbols to the two species by creating a new column in the data frame that contains the number of the plotting character (pch) to be used. There are 26 different plotting characters. Here we use character 1 for Iris setosa and character 16 for Iris versicolor. You can use the same procedure to assign different colors to the two species (see above). We can then set the axis labels, range and orientation as well as the font size using xlab, ylab, ylim, las and the cex arguments, as explained above.
The next step is to add a regression line for each species, assuming that sepal length causes changes in sepal width (which may or may not be biologically reasonable). For this, we have to model the regression lines first. Subsequently, we plot lines corresponding to these models with the lower-level plotting command lines(). Each line is specified by its x and y coordinates, which are both vectors: the x-vector contains sepal lengths, the y-vector contains the sepal widths predicted by the model. We increase the line widths using the argument lwd. The command abline() that we used in the linear regression chapter produces the same line, except that it extends over the entire graph. Using lines() allows us to have the regression line only within the range of the actual data points.
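These steps can be sketched as follows; the column name 'pch.nr' and the axis labels are illustrative assumptions (note that iris measurements are in cm):

```r
# Reduce the built-in iris dataset to two species
iris2 <- subset(iris, Species %in% c("setosa", "versicolor"))

# Plotting character 1 for setosa, 16 for versicolor
iris2$pch.nr <- ifelse(iris2$Species == "setosa", 1, 16)

plot(Sepal.Width ~ Sepal.Length, data = iris2,
     pch = iris2$pch.nr,
     xlab = "Sepal length (cm)", ylab = "Sepal width (cm)", las = 1)

# One regression per species, drawn only over the observed range
for (sp in c("setosa", "versicolor")) {
  d   <- subset(iris2, Species == sp)
  m   <- lm(Sepal.Width ~ Sepal.Length, data = d)
  ord <- order(d$Sepal.Length)           # sort x for a clean line
  lines(d$Sepal.Length[ord], predict(m)[ord], lwd = 2)
}
```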
We should also add a legend to the figure. This is similar to the barplot legend above. Species names in italics are generated using expression(italic("species")) for each entry.
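A minimal sketch of a legend with italic species names (the position and symbols are illustrative):

```r
# Empty plot to attach the legend to
plot(1, type = "n", xlab = "", ylab = "")

legend("topleft",
       legend = c(expression(italic("Iris setosa")),
                  expression(italic("Iris versicolor"))),
       pch = c(1, 16),   # match the plotting characters used above
       bty = "n")
```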
There are many more details of the plot that can be customized. An overview of the graphical parameters that can be changed can be viewed using ?par.
Summary
7.3 Exercise 7: Basic graphs with R
7-A
7-B
7-C
Hint
legend()
8 Introduction to customizing R
Goal
In this section, you will gain basic understanding and experience of how to
> write code for repeated execution of commands (iteration, "loops") using
for(), while() and repeat()
for (VAR in SEQ) EXPR
while (COND) EXPR
repeat EXPR
Here, VAR is the abbreviation of variable. SEQ is the abbreviation for a sequence of elements in a vector or list. COND refers to a condition, which can evaluate to TRUE or FALSE. EXPR stands for expression, that is, the actual commands to be executed.
The for() structure iterates through each component of the sequence SEQ. For example, in the first iteration VAR = SEQ[1], in the second iteration VAR = SEQ[2], and so on. The following code uses the for() structure to print out the square of each component of a vector.
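A minimal sketch of such a loop (the vector contents are an illustrative choice):

```r
# Print the square of each component of a vector
x <- c(2, 4, 6)
for (i in x) {
  print(i^2)
}
```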
The other two loop structures, while() and repeat(), rely on a change of state of the condition COND, or on the use of break to end the loop. The command break halts the execution of the innermost loop and passes control to the first statement outside it. When using while() or repeat(), special attention should be paid to averting an infinite loop, that is, a loop which iterates without end. This may cause your computer to freeze! Below is an example showing two different ways of ending the loop when the loop indicator becomes larger than 10.
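A sketch of the two approaches (the indicator names i and j are illustrative):

```r
# while(): the condition is checked before each iteration
i <- 1
while (i <= 10) {
  i <- i + 1
}
print(i)            # 11

# repeat: runs until break is reached
j <- 1
repeat {
  j <- j + 1
  if (j > 10) break # break exits the innermost loop
}
print(j)            # 11
```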
The condition COND is first evaluated, and if it is TRUE, the expression EXPR1 is executed; if COND evaluates to FALSE, EXPR2 is executed. When COND evaluates to a numeric value of zero, R treats it as FALSE; COND evaluating to any non-zero number is treated as TRUE. We can also extend or shrink if() ... else structures by adding or removing one or several else if() clauses. if() also works without an else clause. Note that in these cases, the order of conditional clauses is vital. Here is an example asking R to comment on the temperature.
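A sketch of such an if() ... else if() ... else chain; the temperature thresholds and messages are illustrative, not necessarily the book's example:

```r
# Comment on the temperature; later clauses are only reached
# when the earlier conditions are FALSE, so the order matters
temp <- 25
if (temp > 30) {
  print("hot")
} else if (temp > 15) {
  print("pleasant")
} else {
  print("cold")
}
```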
Goal
In this section, you will learn how to develop your own R functions.
How to do it
R provides a convenient way to define custom functions using function(). When writing functions, you define yourself what arguments the function should use, what should happen to the input and what the output should be. The syntax is:
NAME <- function(ARGUMENTS) EXPR
If the expression only includes one statement, it can be entered directly; when there are multiple expressions, they have to be enclosed in braces {}. Functions are assigned to a function object and can later be executed using this object name together with arguments, just like all other R commands. Here is an example of a function that has the argument x. It samples 5 elements, with replacement, from the vector given as the argument x.
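A sketch of such a function (the function name is hypothetical; the original may use a different one):

```r
# Sample 5 elements, with replacement, from the vector x
sample5 <- function(x) {
  sample(x, size = 5, replace = TRUE)
}

sample5(1:3)   # five values drawn from 1, 2, 3
```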
The arguments of a function can later be inspected via the R functions args() and formals().
We have now introduced a first idea of how R can be customized using programming structures. You can learn more about programming in R using some of the more advanced books suggested at the end of Chapter 2 (Web resources and books on R).
8.3 Summary
8.4 Exercise 8: Customizing R
The goal:
Instructions
Step 1:
Step 2:
Step 3: sample
Step 4:
Step 5:
Step 6:
Step 7:
Step 8:
a very similar graph
Step 9: