Objectives
The focus of today’s prac is to introduce you to some of R’s graphical capabilities. While R is well known for its
analytic capacity, it can also produce professional-grade scientific figures. Before you do any analyses or
modelling of your data, you should always have a look at the data. R makes that relatively easy. In this prac
session we will, as far as possible, use the ggplot2 package for graphing (while recognising that
base R also has very rich graphical capabilities). By the end of today’s prac, you should be able to create an R
Markdown document that shows how to make, modify, and save several types of figures using ggplot .
Assessment
This prac is NOT assessed. So, you will not need to submit anything. However, you will be using the graphical
skills in this prac throughout the rest of the unit and they will figure prominently in the final exam. As such, we
encourage you to make sure that you can complete every question in this week’s prac manual.
library(ggplot2)
The dataset
For today’s prac we will be using a dataset that describes the heights and diameters of tropical trees in
Thailand. The data were collected in 2001 as part of a broad survey of forest diversity in all of the National
Parks, Wildlife Sanctuaries, and other protected reserves in continental Thailand. The survey established 600
0.4 ha plots across the protected areas. Where a protected area had more than one forest type, separate plots
https://02f24688475047b38b0f08a46eb46aea.app.rstudio.cloud/file_show?path=%2Fcloud%2Fproject%2FWeek3_DataVisualisation_Manual.html 1/22
16/03/2022, 12:03 AGRI90075 Week 3 - Data visualisation with ggplot2
were established in each forest type. The full dataset includes 111,174 trees from 1218 species. For this
exercise, we are working with a subset of the full dataset that includes the 20 most abundant species (6341
trees in total).
str(trees)
## $ spp : chr "Toona surei (Bl.) Merr." "Toona surei (Bl.) Merr." "Toona surei (Bl.) Merr." "Toona surei (Bl.) Merr." ...
This shows us that trees is a data frame with 6325 observations and 6 variables and then lists each of the
variables and its data type. The variables spp , sppCode , and forestType are all categorical (R stores them as
character strings or factors) because they are based on names (e.g., this species, that species). The variables
dbh and height are numeric variables because they are continuous (i.e., not integer) values. The variable
treeNum is an integer; however, because it is the ID for each tree it makes more sense to think of it as a
factor, too. For today’s prac we won’t be using it, so we can just ignore it for now.
If we want to look at a snippet of the data to get a sense for what the dataset looks like, we can look at the first
or last 6 rows of data by using the head() or tail() functions:
head(trees)
tail(trees)
Another thing that we can do is to confirm the size of the dataframe (this can be useful when you are cleaning
your dataset or subsetting to make sure that the dataframe has indeed changed in size). We do this by using
the dim (for dimension) command:
dim(trees)
## [1] 6325 6
We can get a quick summary of any variable using the summary command for the whole dataframe or for just
one of the variables:
summary(trees)
## 1st Qu.: 40780 Class :character Class :character 1st Qu.: 24.00
## height forestType
## Mean :19.44
## 3rd Qu.:25.00
## Max. :52.50
summary(trees$dbh)
summary(trees$sppCode)
Notice the differences among these outputs. For the trees$dbh summary, you get the min, max, mean,
median, and 1st and 3rd quartiles. For the non-numeric variables, you don’t get much information. Be careful
when you look at the treeNum – it’s a numeric variable, but not the kind of variable where these summary
statistics are useful (e.g., why should we care what the mean tag number is for the dataset?). For the
trees$sppCode summary, you get a table of the number of stems per species.
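To make the contrast concrete, here is a minimal sketch using three small made-up vectors (they are not taken from the trees data). It shows what summary() returns for a numeric vector, for a character vector, and for the same values stored as a factor.

```r
# Made-up values for illustration only; not part of the prac dataset.
dbh_toy <- c(12.5, 30.1, 45.0, 22.3)
spp_chr <- c("DIPTOB", "SHOROB", "DIPTOB", "PINUME")
spp_fct <- factor(spp_chr)

summary(dbh_toy)  # min, quartiles, median, mean, max
summary(spp_chr)  # only length, class, and mode
summary(spp_fct)  # a count for each level
```

If a categorical column gives you the uninformative character summary, wrapping it in factor() is one way to get the counts.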
BF – beach forest
DDF – deciduous dipterocarp forest
DDFP – deciduous dipterocarp and pine forest
DEF – dry evergreen forest
DMDF – dry mixed deciduous forest
LMF – lower montane forest
MMDF – moist mixed deciduous forest
PF – pine forest
Another way to look at these data is with the table command. This works particularly well for Factor variables
(such as spp and forestType in this dataset).
table(trees$forestType)
##
table(trees$sppCode, trees$forestType)
##            BF  DDF DDFP  DEF DMDF  LMF MMDF   PF
##   CASTAC    0    1    0   15   16   98    0    0
##   DIPTCO    0    0    0   78   78    0    0    0
##   DIPTOB    0  695   32    0   10    0    0   84
##   DIPTTU    0  493   78    0   18    0    0    0
##   HOPEFE    0    1    0   90    2    0   17    0
##   IRVIMA    0   48    0   31   26    0   33    0
##   LAGECA    0    1    0   20   77    0  190    0
##   MELAQU  116    3    0    0    0    0    0    0
##   PINUKE    0    4    5    0    0   14    0  400
##   PINUME    0    0   81    0    6    1    0  245
##   PTERMA    0   86    0    5   62    0    5    4
##   SCHIWA    0    7    0   45   62   53    0   11
##   SHOROB    0  619   64    0    1    0    0    8
##   SHORRO    0   53   33    3   35    0    0    3
##   SHORSI    0 1028   22    1   34    2   95    1
##   TECTGR    0    1    0    0  259    0    0    0
##   TERMTR    0    2    0   14   33    0   80    0
##   TOONSU    0    0    0    0    0    0  122    0
##   XYLIXY    0   44    1    1   73    0    3    0
QUESTION 2: Last week we applied various functions to vectors. Use them to find the mean DBH and the
minimum and maximum heights of the trees in this dataset.
This is easier to understand by doing it. So, let’s build a ggplot from the ground up. We’ll look at DBH vs height
in the trees dataframe. We begin by specifying the data:
ggplot(data = trees)
We have not specified any mappings, so it does not know what to plot. The mappings are important because
they link variables to things that you will see in the plot. In the next code chunk, the mapping tells ggplot
which variables to use for x and y. As we will see later, it can also include colour, shape, size, and line type (e.g., solid,
dashed, or some other pattern). The mapping does not necessarily say what colours will be used; rather, it
says which variables will be represented by visual elements such as colour, etc. This is an important
distinction and one we will return to repeatedly. Let’s add the mappings:
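As a minimal sketch of this step, the chunk below adds a mapping to a small made-up data frame. The name toy_trees and its values are invented for illustration, and putting dbh on x and height on y is an assumption; with the real data you would use trees instead.

```r
library(ggplot2)

# Toy stand-in for `trees`; the values are invented for illustration only.
toy_trees <- data.frame(
  dbh    = c(12.5, 30.1, 45.0, 22.3, 60.2, 18.7),
  height = c(9.0, 18.5, 27.0, 15.2, 33.1, 12.4)
)

# Data plus mappings, but no geom yet: the axes are set up, no points are drawn.
# Assumption: dbh goes on the x-axis and height on the y-axis.
p <- ggplot(data = toy_trees, mapping = aes(x = dbh, y = height))
p
```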
So, what happened? Not much. The figure now has labels on the x- and y-axis (due to the mapping=
statement) and has identified what the scale limits to those axes should be. But we can’t see any data yet.
To see the individual points, we need to specify the geometry that we would like to use. For x,y data, we can
use geom_point() . Because we have already specified what the ggplot object needs in terms of data and
mappings, we can just add the geom to the existing ggplot object like this:
ggplot(data = trees, mapping = aes(x = dbh, y = height)) +
  geom_point()
And there you have it! The basic elements of a ggplot graph: data + mapping + geom = a plot of dbh vs height
for 6325 trees.
NOTE: Just as a point of programming style, I like to put each new part of the `ggplot` call
on to a new line. That makes it easier to copy, paste, and modify individual pieces of the
`ggplot` call. The only trick to doing this is knowing that each element of the `ggplot` must
be separated by a `+` sign and the `+` signs must go at the end of each line, not at the
beginning of the next line.
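Here is a small sketch of that layout. The data frame toy_trees is made up for illustration, and the dbh/height mapping is an assumption.

```r
library(ggplot2)

# Toy stand-in for `trees`; the values are invented for illustration only.
toy_trees <- data.frame(
  dbh    = c(12.5, 30.1, 45.0, 22.3),
  height = c(9.0, 18.5, 27.0, 15.2)
)

# Each `+` ends a line, so R knows the statement continues on the next one.
p <- ggplot(data = toy_trees, mapping = aes(x = dbh, y = height)) +
  geom_point()

# If the `+` started the second line instead, R would treat the first line
# as a complete statement and the geom would never be added to the plot.
```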
As an example, let’s start by colouring the points in the first figure green. We’ll add into the geom_point() a
statement specifying the colour.
ggplot(data = trees, mapping = aes(x = dbh, y = height)) +
  geom_point(colour='green')
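The difference between setting a colour and mapping a variable to colour can be sketched as follows. The data frame toy_trees and its values are made up for illustration; the column names mirror those in trees.

```r
library(ggplot2)

# Toy stand-in for `trees`; the values are invented for illustration only.
toy_trees <- data.frame(
  dbh        = c(12.5, 30.1, 45.0, 22.3, 60.2, 18.7),
  height     = c(9.0, 18.5, 27.0, 15.2, 33.1, 12.4),
  forestType = c("DDF", "DDF", "DEF", "DEF", "PF", "PF")
)

# Setting: every point gets the same fixed colour (given outside aes()).
p_set <- ggplot(data = toy_trees, mapping = aes(x = dbh, y = height)) +
  geom_point(colour = "green")

# Mapping: colour represents a variable (given inside aes());
# ggplot chooses the actual colours and adds a legend.
p_map <- ggplot(data = toy_trees,
                mapping = aes(x = dbh, y = height, colour = forestType)) +
  geom_point()
```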
You can change other features of the geom, too. Try playing around with shape= , size= , alpha= . For
shape , use integer values from 0 to 20 (although there are others to choose from). For size , use positive
non-zero values (non-integers are OK – 0.5, 1.75, 3.3, etc.). For alpha , use values from 0 to 1. You can use
more than one of these at a time. Just separate them with commas in the geom statement.
Here’s an example:
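A minimal sketch of combining these arguments is below. The specific values (shape 17, size 1.75, alpha 0.5) are arbitrary choices, and toy_trees is a made-up stand-in for trees.

```r
library(ggplot2)

# Toy stand-in for `trees`; the values are invented for illustration only.
toy_trees <- data.frame(
  dbh    = c(12.5, 30.1, 45.0, 22.3, 60.2, 18.7),
  height = c(9.0, 18.5, 27.0, 15.2, 33.1, 12.4)
)

# Several fixed settings at once, separated by commas inside the geom.
p <- ggplot(data = toy_trees, mapping = aes(x = dbh, y = height)) +
  geom_point(colour = "green", shape = 17, size = 1.75, alpha = 0.5)
p
```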
We can also add nice (or more detailed) labelling. To do this we add the labs() call to the overall statement
(like adding the geom_point() call). There are several options that we can call (I have not included a subtitle,
although you are welcome to…).
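A sketch of a labs() call is below. The label text is my own wording, the units (cm for DBH, m for height) are assumptions, and toy_trees is a made-up stand-in for trees.

```r
library(ggplot2)

# Toy stand-in for `trees`; the values are invented for illustration only.
toy_trees <- data.frame(
  dbh    = c(12.5, 30.1, 45.0, 22.3, 60.2, 18.7),
  height = c(9.0, 18.5, 27.0, 15.2, 33.1, 12.4)
)

# labs() is added with `+`, just like a geom.
p <- ggplot(data = toy_trees, mapping = aes(x = dbh, y = height)) +
  geom_point() +
  labs(title = "Tree height versus diameter",
       x = "DBH (cm)",      # assumed units
       y = "Height (m)")    # assumed units
p
```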
We will touch on a few other things to add value and clarity to your figure as we go along. However, at a
minimum every figure you make should have:
geom_point()
With a large dataset like this, we might be better off breaking this figure up into a bunch of subfigures. This is
known as faceting. We just need to add the facet_wrap() call to do this and specify the grouping variable that
we want to use (preceded with a tilde ~ ).
ggplot(data = trees, mapping = aes(x = dbh, y = height)) +
  geom_point(size=0.5) +
  facet_wrap(~forestType)
QUESTION 3: Remake this figure so that the points within each facet have different colours.
QUESTION 4: Repeat this figure, but faceted by species rather than forest types.
QUESTION 5: Try to make a figure that is faceted by species, but in which the points are coloured by forest
type. What would this tell us?
QUESTION 6: Having now made figures in which multiple categories are distinguished by colour, by faceting,
and by both, what are the relative advantages and disadvantages of colour mapping versus faceting in
presenting data?
ggsave("DBH_Height_by_ForestType.pdf")
## Saving 7 x 5 in image
If you want to adjust the dimensions of your figure, you can use the width and height arguments in the
ggsave() function like this:
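For instance, a sketch along these lines saves a 4 x 8 inch PDF. The file name, the dimensions, and the toy data are all made up for illustration, and the file is written to a temporary folder so that it does not clutter your project.

```r
library(ggplot2)

# Toy stand-in for `trees`; the values are invented for illustration only.
toy_trees <- data.frame(
  dbh    = c(12.5, 30.1, 45.0, 22.3),
  height = c(9.0, 18.5, 27.0, 15.2)
)

p <- ggplot(data = toy_trees, mapping = aes(x = dbh, y = height)) +
  geom_point()

# width and height are in inches by default.
out_file <- file.path(tempdir(), "DBH_Height_tall.pdf")
ggsave(out_file, plot = p, width = 4, height = 8)
```

Passing plot = p saves that specific object; if you leave it out, ggsave() saves the last plot that was displayed.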
This makes a tall, narrow image. If you reverse the width and height values, it will make a short, wide image.
To show you some of the ways of displaying univariate data, we’ll use the same trees dataset, but we’ll
confine our graphical explorations to the dbh data. The structure of the ggplot call is the same as before. We
need statements specifying the data, mapping, and geom. The main differences are that we only need to
specify one variable in the mapping aesthetic and that we need to use geoms that are appropriate to univariate
data (i.e., not geom_point() ). There are several geoms that we can use for univariate data (see the ggplot2
cheat sheet for a good summary). The most widely used ones are geom_histogram() and geom_density() .
We’ll start with geom_histogram() .
ggplot(data = trees, mapping = aes(x = dbh)) +
  geom_histogram()
This shows the number of individuals in each of 30 equal-sized dbh bins. If you want to see the data in finer
detail, you can specify a larger number of bins; if you want to see the data displayed in large bins, you can
specify a smaller number of bins. You just need to write bins=100 or whatever number you want within the
geom_histogram brackets.
QUESTION 7: Remake this figure three times using 10, 50, and 100 bins. How do they differ? What
information has been lost or gained?
What if you wanted to look at the DBH distribution for a single species? We could do this in a number of ways.
For example, we could use facet_wrap() as we did before. That will show us the DBH distribution for all of
the species, but with one species per panel.
ggplot(data = trees, mapping = aes(x = dbh)) +
  geom_histogram() +
  facet_wrap(~sppCode)
That’s fine for some of the really abundant species (e.g., Dipterocarpus obtusifolius, Shorea obtusa, Shorea
siamensis), but it’s not so great for the less abundant species because the y-axis is scaled to the most
abundant species.
If we want to look at just a single species, we need to subset the data. We can do that in two different
ways. First, we can create a new dataframe using the subset command and then put the new dataframe in
the data statement. Here we create a new dataframe called diptob for the species Dipterocarpus obtusifolius.
diptob <- subset(trees, sppCode %in% "DIPTOB")
ggplot(data = diptob, mapping = aes(x = dbh)) + geom_histogram()
Alternatively, we can just subset within the ggplot call. This has the benefit of avoiding multiple, overlapping
dataframes floating around in your workspace, but requires understanding the logic of the call. Here we add
the subset call directly into the data statement. We then specify the dataframe ( trees ) that we are subsetting
and what we would like to subset. This has a funny, backward syntax where we state the variable that the
subsetting is based on and then the value(s) that we want in the subset. We separate these with the %in%
operator. In effect, we read this expression from right to left, not left to right. We can also specify limits for the
x- and y-axes (which will make comparison between this and the next figure easier) using xlim() and
ylim() . Note the number of bins has changed. This will allow us to see a bit more of the shape of the data.
ggplot(data = subset(trees, sppCode %in% "DIPTOB"), mapping = aes(x=dbh)) +
  geom_histogram(bins=50) +
  xlim(0,80) +
  ylim(0,70)
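To see what the %in% expression is doing on its own, here is a small sketch with a made-up data frame (toy_trees and its values are invented; the species codes are borrowed from the table above).

```r
# Toy stand-in for `trees`; the values are invented for illustration only.
toy_trees <- data.frame(
  sppCode = c("DIPTOB", "SHOROB", "DIPTOB", "PINUME", "SHOROB", "DIPTOB"),
  dbh     = c(12.5, 30.1, 45.0, 22.3, 60.2, 18.7)
)

# %in% returns TRUE for each row whose sppCode appears on the right-hand side.
keep <- toy_trees$sppCode %in% "DIPTOB"
keep

# subset() keeps only the rows where the expression is TRUE.
diptob_toy <- subset(toy_trees, sppCode %in% "DIPTOB")

# The right-hand side can hold more than one value.
two_spp_toy <- subset(toy_trees, sppCode %in% c("DIPTOB", "SHOROB"))
```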
We can make these subset statements more complex. For example, we might want to know what the DBH
distribution of Dipterocarpus obtusifolius (DIPTOB) >20 cm looks like. We just use the & operator in the
subset statement:
ggplot(data = subset(trees, sppCode %in% "DIPTOB" & dbh >20), mapping = aes(x=dbh)) +
geom_histogram(bins=50) +
xlim(0,80) +
ylim(0,70)
QUESTION 9: Create your own subset (by species or forest type or dbh or height) and make a histogram.
The last figure we’ll make today is a boxplot. Boxplots give us a description of the distribution of a single
variable across different groups. As such, the mapping aesthetic requires both x and y to be defined. The key
thing to remember is that x is the grouping variable and y is the continuous variable. Otherwise, everything is
much the same as before:
ggplot(data = trees, mapping = aes(x = sppCode, y = dbh)) +
  geom_boxplot()
The only problem with this is that the species codes are unreadable because they all overlap. It would be much
easier to see them if they were on the y-axis instead of the x-axis. Fortunately, ggplot2 has an intuitive
option for that ( coord_flip ):
ggplot(data = trees, mapping = aes(x = sppCode, y = dbh)) +
  geom_boxplot() +
  coord_flip()
One issue with this figure is that the data are ordered alphabetically from bottom to top. Usually ecology and
alphabetisation are uncorrelated. So it is often helpful to sort the data based on something more meaningful. If
we add the reorder command to the mapping aesthetic and specify which variable to sort on (dbh), we
get the following:
ggplot(data = trees, mapping = aes(x = reorder(sppCode, dbh), y = dbh)) +
  geom_boxplot() +
  coord_flip()
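To see what reorder() does by itself, here is a sketch with made-up values (the groups A, B, and C are hypothetical, not species from the dataset). By default, reorder() sorts the groups by the mean of the second argument.

```r
# Made-up groups and values for illustration only.
grp     <- factor(c("A", "A", "B", "B", "C", "C"))
dbh_toy <- c(50, 60, 10, 20, 30, 40)

levels(grp)                       # alphabetical order: A, B, C

# Group means are A = 55, B = 15, C = 35, so the new order is B, C, A.
grp_sorted <- reorder(grp, dbh_toy)
levels(grp_sorted)
```

To sort on a different statistic, reorder() accepts a FUN argument, for example reorder(grp, dbh_toy, FUN = median).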
QUESTION 10: Make a boxplot figure of the height distributions for each of the forest types.
Make sure the axes are clearly labelled and the figure has a title. Pay close
attention to what is labelled x and what is labelled y. Feel free to explore colour
and fill options in the geom.
geom_boxplot() +
coord_flip() +
theme_bw()