16/03/2022, 12:03 AGRI90075 Week 3 - Data visualisation with ggplot2
AGRI90075 Week 3 - Data visualisation with ggplot2
Laura Brannelly, Saras Windecker, and Patrick Baker
14 March 2022
Objectives
The focus of today’s prac is to introduce you to some of R’s graphical capabilities. While R is well known for its
analytic capacity, it can also produce professional-grade scientific figures. Before you do any analyses or
modeling of your data, you should always have a look at the data. R makes that relatively easy. In this prac
session, we will, as much as possible, use the ggplot2 package for graphing (while recognising that
base R also has very rich graphical capabilities). By the end of today’s prac, you should be able to create an R
Markdown document that shows how to make, modify, and save several types of figures using ggplot2.
Assessment
This prac is NOT assessed. So, you will not need to submit anything. However, you will be using the graphical
skills in this prac throughout the rest of the unit and they will figure prominently in the final exam. As such, we
encourage you to make sure that you can complete every question in this week’s prac manual.
The ggplot2 package

The ggplot2 package was developed by Hadley Wickham in an attempt to implement the ideas described by
Leland Wilkinson in his book, the Grammar of Graphics. It provides the ability to generate complex,
publication-quality graphics using a simple logic structure that can be easily repeated and adjusted to create a
wide range of graphics. ggplot2 is aesthetically nicer and conceptually simpler than the graphics in base R.
While they are both capable of producing high-quality figures, we will primarily use ggplot2 for this class.
Having said that, if you prefer base R graphics, you are welcome to use them. There is no penalty associated
with using one approach over the other.
Loading the ggplot2 package

R has many packages and you can install many more. However, R does not keep them all in its active
memory. You need to tell R which package you’d like it to load into the workspace to use in your analyses. We
will specify in the set-up code chunk of the R Markdown answer template the necessary packages to load for
each prac. This means that you do not need to load them yourself. However, if you are running R and RStudio on your own computer,
then you can load a package with either the library() or require() functions. I like to use library() :
library(ggplot2)
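On your own computer, a package has to be installed once before library() can load it. The lines below are a minimal sketch of that pattern, assuming you have an internet connection and a CRAN mirror already configured; the check with requireNamespace() is just one common way to avoid reinstalling a package that is already there.

```r
# Install ggplot2 only if it is not already available (a one-off step)
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Load the package into the current R session (needed in every new session)
library(ggplot2)
```

Installing is done once per computer, whereas loading with library() is done once per R session.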
The dataset
For today’s prac we will be using a dataset that describes the heights and diameters of tropical trees in
Thailand. The data were collected in 2001 as part of a broad survey of forest diversity in all of the National
Parks, Wildlife Sanctuaries, and other protected reserves in continental Thailand. The survey established 600
0.4 ha plots across the protected areas. Where a protected area had more than one forest type, separate plots
were established in each forest type. The full dataset includes 111,174 trees from 1218 species. For this
exercise, we are working with a subset of the full dataset that includes the 20 most abundant species (6341
trees in total).
Loading the dataset

The dataset has been saved as ThaiTrees.csv and put in the data folder of the Rproject. You can load the
dataset into a dataframe in R called trees by typing this:
trees <- read.csv("data/ThaiTrees.csv", header=TRUE)
What’s in the trees dataset?

Before we start making plots with the data, let’s just have a quick look at what is in the dataset. There are a
variety of ways to do this. A good place to start is with the str() (or structure) function. This gives you a
summary of all of the variables in the dataset and what type of data they hold.
str(trees)
## 'data.frame': 6325 obs. of 6 variables:
## $ treeNum : int 3 4 12 21 22 25 74 88 94 96 ...
## $ spp : chr "Toona surei (Bl.) Merr." "Toona surei (Bl.) Merr." "Toona surei (Bl.) Merr." "Toona surei (Bl.) Merr." ...
## $ sppCode : chr "TOONSU" "TOONSU" "TOONSU" "TOONSU" ...
## $ dbh : num 23.9 32.9 29.5 40.7 78.8 ...
## $ height : num 23 28 21 22 33 27 33 25 14 34 ...
## $ forestType: chr "MMDF" "MMDF" "MMDF" "MMDF" ...
This shows us that trees is a data frame with 6325 observations and 6 variables and then lists each of the
variables and the data type. The variables spp, sppCode, and forestType are stored as character (chr) vectors
because they hold names (e.g., this species, that species); they are categorical variables and can be treated as
factors. The variables dbh and height are numeric variables because they are continuous (i.e., not integer)
values. The variable treeNum is an integer; however, because it is the ID for each tree it makes more sense to
think of it as a factor, too. For today’s prac we won’t be using it, so we can just ignore it for now.
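Whether text columns arrive as character vectors or as factors depends on how the file was read; the str() output above lists them as chr. The sketch below shows one way to convert such columns to factors. It uses a small made-up data frame (toy) rather than the real trees data, so the values are purely illustrative.

```r
# Toy stand-in for a few columns of the trees data (made-up values)
toy <- data.frame(
  treeNum    = c(3L, 4L, 12L),
  forestType = c("MMDF", "MMDF", "DDF"),
  stringsAsFactors = FALSE
)
str(toy)  # forestType is chr, treeNum is int

# Convert the categorical columns to factors
toy$forestType <- factor(toy$forestType)
toy$treeNum    <- factor(toy$treeNum)

# Factor levels are sorted alphabetically by default
levels(toy$forestType)
```

After the conversion, functions such as summary() treat these columns as categories rather than as free text or numbers.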
If we want to look at a snippet of the data to get a sense for what the dataset looks like, we can look at the first
or last 6 rows of data by using the head() or tail() functions:
head(trees)
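## treeNum spp sppCode dbh height forestType
## 1 3 Toona surei (Bl.) Merr. TOONSU 23.9 23 MMDF
## 2 4 Toona surei (Bl.) Merr. TOONSU 32.9 28 MMDF
## 3 12 Toona surei (Bl.) Merr. TOONSU 29.5 21 MMDF
## 4 21 Toona surei (Bl.) Merr. TOONSU 40.7 22 MMDF
## 5 22 Toona surei (Bl.) Merr. TOONSU 78.8 33 MMDF
## 6 25 Toona surei (Bl.) Merr. TOONSU 53.2 27 MMDF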
tail(trees)
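## treeNum spp sppCode dbh height forestType
## 6320 133462 Pinus kesiya Royle ex Gordon PINUKE 65.87 19 DDFP
## 6321 133463 Dipterocarpus tuberculatus Roxb. DIPTTU 28.83 10 DDFP
## 6322 133477 Shorea obtusa Wall. ex Blume SHOROB 19.00 5 DDFP
## 6323 133485 Shorea obtusa Wall. ex Blume SHOROB 12.30 5 DDFP
## 6324 133496 Shorea obtusa Wall. ex Blume SHOROB 41.61 10 DDFP
## 6325 133510 Shorea obtusa Wall. ex Blume SHOROB 26.34 6 DDFP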
So a quick explanation of the variables:
treeNum is the unique ID number for each tree
spp is the scientific name (with botanical authority) for each tree
sppCode is the 6-letter species code (first four letters of the genus + first two letters of species)
dbh is the stem diameter (in cm) measured at 1.3 m above the ground
height is the height of the tree (in m)
forestType is the forest type classification for that plot using the standard Thai forest-type classification
system
Another thing that we can do is to confirm the size of the dataframe (this can be useful when you are cleaning
your dataset or subsetting to make sure that the dataframe has indeed changed in size). We do this by using
the dim (for dimension) command:
dim(trees)
## [1] 6325 6
We can get a quick summary of any variable using the summary command for the whole dataframe or for just
one of the variables:
summary(trees)
## treeNum spp sppCode dbh
## Min. : 3 Length:6325 Length:6325 Min. : 4.50
## 1st Qu.: 40780 Class :character Class :character 1st Qu.: 24.00
## Median : 75645 Mode :character Mode :character Median : 34.50
## Mean : 74235 Mean : 40.18
## 3rd Qu.:107478 3rd Qu.: 49.46
## Max. :133510 Max. :180.00
## height forestType
## Min. : 2.00 Length:6325
## 1st Qu.:12.00 Class :character
## Median :18.00 Mode :character
## Mean :19.44
## 3rd Qu.:25.00
## Max. :52.50
summary(trees$dbh)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.50 24.00 34.50 40.18 49.46 180.00
summary(trees$sppCode)
## Length Class Mode
## 6325 character character
Notice the differences among these outputs. For the trees$dbh summary, you get the min, max, mean,
median, and 1st and 3rd quartiles. For the non-numeric variables, you don’t get much information. Be careful
when you look at treeNum: it’s a numeric variable, but not the kind of variable where these summary
statistics are useful (e.g., why should we care what the mean tag number is for the dataset?). For the
trees$sppCode summary, you only get the length, class, and mode because the variable is stored as
character; if it were a factor, you would get a table of the number of stems per species.
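To see how summary() behaves differently for character and factor data, here is a small sketch on a made-up vector of species codes (codes is a hypothetical stand-in, not the real sppCode column).

```r
# A made-up vector of species codes, stored as character
codes <- c("TOONSU", "TOONSU", "PINUKE", "SHOROB", "SHOROB", "SHOROB")

# On a character vector, summary() reports only Length, Class and Mode
summary(codes)

# On a factor, summary() reports the number of entries per level
counts <- summary(factor(codes))
counts
```

The same distinction applies to any text column in a data frame.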
QUESTION 1: How many trees are there in each forest type?
For your reference, the forest type codes are:
BF – beach forest
DDF – deciduous dipterocarp forest
DDFP – deciduous dipterocarp and pine forest
DEF – dry evergreen forest
DMDF – dry mixed deciduous forest
LMF – lower montane forest
MMDF – moist mixed deciduous forest
PF – pine forest
Another way to look at these data is with the table command. This works particularly well for categorical
variables (such as spp and forestType in this dataset).
table(trees$forestType)
##
## BF DDF DDFP DEF DMDF LMF MMDF PF
## 116 3087 316 429 907 168 546 756
You can also have two-way tables…
table(trees$sppCode, trees$forestType)
##
## BF DDF DDFP DEF DMDF LMF MMDF PF
## CASTAC 0 1 0 15 16 98 0 0
## DIPTCO 0 0 0 78 78 0 0 0
## DIPTOB 0 695 32 0 10 0 0 84
## DIPTT2 0 1 0 126 115 0 1 0
## DIPTTU 0 493 78 0 18 0 0 0
## HOPEFE 0 1 0 90 2 0 17 0
## IRVIMA 0 48 0 31 26 0 33 0
## LAGECA 0 1 0 20 77 0 190 0
## MELAQU 116 3 0 0 0 0 0 0
## PINUKE 0 4 5 0 0 14 0 400
## PINUME 0 0 81 0 6 1 0 245
## PTERMA 0 86 0 5 62 0 5 4
## SCHIWA 0 7 0 45 62 53 0 11
## SHOROB 0 619 64 0 1 0 0 8
## SHORRO 0 53 33 3 35 0 0 3
## SHORSI 0 1028 22 1 34 2 95 1
## TECTGR 0 1 0 0 259 0 0 0
## TERMTR 0 2 0 14 33 0 80 0
## TOONSU 0 0 0 0 0 0 122 0
## XYLIXY 0 44 1 1 73 0 3 0
QUESTION 2: Last week we applied various functions to vectors. Use them to find the mean DBH and the
minimum and maximum heights of the trees in this dataset.
Making graphical plots with ggplot2

The basic structure used by ggplot is relatively simple. You identify your data, specify a mapping, and then
choose an appropriate geometry to display your data. The data is the dataframe that you are working with. The
mapping is the variables (referred to as aesthetics in ggplot ) that you wish to display. The geometry is the
way you would like to display them (e.g., as a scatterplot, as a histogram, as a heat map). Once you have
these components in place, you can tinker with scales, labels, guides, and formatting for the figure.
This is easier to understand by doing it. So, let’s build a ggplot from the ground up. We’ll look at DBH vs height
in the trees dataframe. We begin by specifying the data:
ggplot(data = trees)
We have not specified any mappings, so it does not know what to plot. The mappings are important because
they link variables to things that you will see in the plot. In the next code chunk, the mapping tells ggplot which
variables to use for x and y. But, as we will see later, it can also include colour, shape, size, and line type (e.g., solid,
dashed, or some other pattern). The mapping does not necessarily say what colours will be used; rather, it
says which variables will be represented by visual elements such as colour, etc. This is an important
distinction and one we will return to repeatedly. Let’s add the mappings:
ggplot(data = trees, mapping = aes(x = dbh, y = height))
So, what happened? Not much. The figure now has labels on the x- and y-axis (due to the mapping=
statement) and has identified what the scale limits to those axes should be. But we can’t see any data yet.
To see the individual points, we need to specify the geometry that we would like to use. For x,y data, we can
use geom_point() . Because we have already specified what the ggplot object needs in terms of data and
mappings, we can just add the geom to the existing ggplot object like this:
ggplot(data = trees, mapping = aes(x = dbh, y = height)) +
geom_point()
And there you have it! The basic elements of a ggplot graph: data + mapping + geom = a plot of dbh vs height
for 6325 trees.
NOTE: Just as a point of programming style, I like to put each new part of the `ggplot` call
on to a new line. That makes it easier to copy, paste, and modify individual pieces of the
`ggplot` call. The only trick to doing this is knowing that each element of the `ggplot` must
be separated by a `+` sign and the `+` signs must go at the end of each line, not at the
beginning of the next line.
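Because a ggplot is an ordinary R object, the data + mapping + geom steps can also be built up one at a time and stored in variables. The sketch below illustrates this with a small made-up data frame (toy); the names p and p_points are hypothetical and chosen only for this example.

```r
library(ggplot2)

# Toy stand-in for the trees data (made-up values for illustration)
toy <- data.frame(
  dbh    = c(23.9, 32.9, 19.0, 12.3, 41.6, 26.3),
  height = c(23, 28, 5, 5, 10, 6)
)

# Step 1: data and mapping only (axes, but no points yet)
p <- ggplot(data = toy, mapping = aes(x = dbh, y = height))

# Step 2: add the geom; note the `+` sits at the END of the line
p_points <- p +
  geom_point()

# Printing the object draws the figure
print(p_points)
```

Storing the base plot in a variable makes it easy to try different geoms without retyping the data and mapping.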
Prettifying your figure

Both ggplot2 and base R graphics have many, many ways to tweak individual elements of a figure. We will
focus on some of the ways to do that in ggplot2 . We might start with adding some colour to the plot. To do
this we need to think about where we would add that. So far all of our specifications have gone into the ggplot
call either as data or as the mapping. However, recall that the mapping is the variables that you want to
display. So, putting a fixed colour such as 'green' into the mapping will yield unexpected results (see Question 3
below). If we want to set a fixed appearance for the points, we need to modify the geom.
As an example, let’s start by colouring the points in the first figure green. We’ll add into the geom_point() a
statement specifying the colour.
ggplot(data = trees, mapping = aes(x = dbh, y = height)) +
  geom_point(colour = 'green')
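The difference between setting a colour in the geom and mapping a colour inside aes() can be seen directly. This is a minimal sketch on made-up data (toy); p_set and p_map are hypothetical names used only here.

```r
library(ggplot2)

# Toy stand-in for the trees data (made-up values for illustration)
toy <- data.frame(
  dbh    = c(23.9, 32.9, 19.0, 12.3, 41.6, 26.3),
  height = c(23, 28, 5, 5, 10, 6)
)

p <- ggplot(data = toy, mapping = aes(x = dbh, y = height))

# Setting: a fixed colour given outside aes() is applied to every point
p_set <- p +
  geom_point(colour = "green")

# Mapping: inside aes(), "green" is treated as a data value (a single
# category), so ggplot picks a default palette colour and adds a legend
p_map <- p +
  geom_point(mapping = aes(colour = "green"))

print(p_set)
print(p_map)
```

In the second plot the points are not green, which is the kind of unexpected result that comes from putting a fixed value into the mapping.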
You can change other features of the geom, too. Try playing around with shape= , size= , alpha= . For
shape , use integer values from 0 to 20 (although there are others to choose from). For size , use positive
non-zero values (non-integers are OK – 0.5, 1.75, 3.3, etc.). For alpha , use values from 0 to 1. You can use
more than one of these at a time. Just separate them with commas in the geom statement.
Here’s an example:
ggplot(data = trees, mapping = aes(x = dbh, y = height)) +
  geom_point(colour = 'darkblue', size = 0.75, shape = 3)
We can also add nice (or more detailed) labelling. To do this we add the labs() call to the overall statement
(like adding the geom_point() call). There are several options that we can call (I have not included a subtitle,
although you are welcome to…).
ggplot(data = trees, mapping = aes(x = dbh, y = height)) +
  geom_point(colour = 'darkblue', size = 0.75, shape = 3) +
  labs(x = "DBH (cm)", y = "Height (m)",
       title = "DBH-Height relationships in Thai forests",
       caption = "Source: Thai ForestGEO")
We will touch on a few other things to add value and clarity to your figure as we go along. However, at a
minimum every figure you make should have:
clear data presentation
descriptive axis labels (with units!)
a descriptive title
the data source (if appropriate)
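As a sketch of a figure that ticks off each item in this checklist, the example below also adds the optional subtitle argument of labs(). The data frame toy and all label text are made up for illustration.

```r
library(ggplot2)

# Toy stand-in for the trees data (made-up values for illustration)
toy <- data.frame(
  dbh    = c(23.9, 32.9, 19.0, 12.3, 41.6, 26.3),
  height = c(23, 28, 5, 5, 10, 6)
)

p_labelled <- ggplot(data = toy, mapping = aes(x = dbh, y = height)) +
  geom_point(colour = "darkblue", size = 2) +      # clear data presentation
  labs(x = "DBH (cm)",                             # axis labels with units
       y = "Height (m)",
       title = "DBH-height relationship",          # descriptive title
       subtitle = "Six example trees",             # optional subtitle
       caption = "Source: made-up example values") # data source

print(p_labelled)
```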
Grouping data within a figure

What if we want to see how different groups within the dataset behave? We can add another aesthetic term to
the initial mapping. For example, we might be interested in looking at patterns in dbh-height relationships
across forest types. Are some forests taller for a given stem dbh? To do this we can add a third aesthetic,
colour (or color ), for grouping. Here we create a separate colour for each forestType :
ggplot(data = trees, mapping = aes(x = dbh, y = height, colour = forestType)) +
geom_point()
With a large dataset like this, we might be better off breaking this figure up into a bunch of subfigures. This is
known as faceting. We just need to add the facet_wrap() call to do this and specify the grouping variable that
we want to use (preceded with a tilde ~ ).
ggplot(data = trees, mapping = aes(x = dbh, y = height)) +
  geom_point(size = 0.5) +
  facet_wrap(~forestType)
QUESTION 3: Remake this figure so that the points within each facet have different colours.
QUESTION 4: Repeat this figure, but faceted by species rather than forest types.
QUESTION 5: Try to make a figure that is faceted by species, but in which the points are coloured by forest
type. What would this tell us?
QUESTION 6: Having now made figures in which multiple categories are distinguished by colour, by faceting,
and by both, what are the relative advantages and disadvantages of colour mapping versus faceting in
presenting data?
Saving your figures

One thing that you might want to know at this point is how to save a figure. If you are working in R Markdown,
it is probably less important as R can make and display the figure on the fly. But if you are taking a class that
requires making a presentation and you need to get an R figure into your PowerPoint slide deck, then you
might want to save the file. In ggplot2 we can use the ggsave function. We just need to specify the filename
(including filetype) that we are making and the plot that we wish to save. If you do not specify the plot, it will
save the last plot created. The file will be saved in your working directory, unless you specify otherwise.
ggsave("DBH_Height_by_ForestType.pdf")
## Saving 7 x 5 in image
If you want to adjust the dimensions of your figure, you can use the width and height arguments in the
ggsave() function like this:
ggsave("DBH_Height_by_ForestType2.pdf", width=4, height=8)
This makes a tall, narrow image. If you reverse the width and height values, it will make a short, wide image.
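Rather than relying on the last plot created, you can store a plot in a variable and pass it to ggsave() through its plot argument. The sketch below does this with made-up data (toy) and writes to a temporary folder so that it does not clutter your project; the file name and the object names are hypothetical.

```r
library(ggplot2)

# Toy stand-in for the trees data (made-up values for illustration)
toy <- data.frame(
  dbh    = c(23.9, 32.9, 19.0, 12.3, 41.6, 26.3),
  height = c(23, 28, 5, 5, 10, 6)
)

# Store the plot in a variable so we can say exactly which plot to save
p <- ggplot(data = toy, mapping = aes(x = dbh, y = height)) +
  geom_point()

# Save to a temporary folder; width and height here are in centimetres
out_file <- file.path(tempdir(), "dbh_height_toy.pdf")
ggsave(out_file, plot = p, width = 12, height = 8, units = "cm")

# Check that the file was written
file.exists(out_file)
```

Naming the plot explicitly avoids saving the wrong figure when several plots have been made in the same session.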
What about univariate data?

We started exploring ggplot2 using two variables (dbh and height) because it is useful to illustrate the basic
concepts of ggplot2 . However, visualising single variables is important, both for data exploration and for data
analyses. The main interest with a single variable is what the distribution of the data looks like. Is it narrow? Is
it wide? Is it skewed? Does it have multiple peaks? While we can use various statistical functions (e.g., mean ,
median , sd , range ) to answer some of these questions, displaying the data for visual assessment is quick,
easy, and very powerful.
To show you some of the ways of displaying univariate data, we’ll use the same trees dataset, but we’ll
confine our graphical explorations to the dbh data. The structure of the ggplot call is the same as before. We
need statements specifying the data, mapping, and geom. The main differences are that we only need to
specify one variable in the mapping aesthetic and that we need to use geoms that are appropriate to univariate
data (i.e., not geom_point() ). There are several geoms that we can use for univariate data (see the ggplot2
cheat sheet for a good summary). The most widely used ones are geom_histogram() and geom_density() .
We’ll start with geom_histogram() .
ggplot(data = trees, mapping = aes(x = dbh)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This shows the number of individuals in each of 30 equal-sized dbh bins. If you want to see the data in finer
detail, you can specify a larger number of bins; if you want to see the data displayed in large bins, you can
specify a smaller number of bins. You just need to write bins=100 or whatever number you want within the
geom_histogram brackets.
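The message printed by geom_histogram() also mentions binwidth, which sets the width of each bin in data units instead of the number of bins. The following is a minimal sketch of both options on a made-up vector of diameters (toy); the bin values chosen are arbitrary.

```r
library(ggplot2)

# Toy stand-in for the dbh column (made-up values, in cm)
toy <- data.frame(
  dbh = c(4.5, 12.3, 19.0, 23.9, 24.0, 26.3, 28.8, 29.5,
          32.9, 34.5, 40.7, 41.6, 49.5, 53.2, 65.9, 78.8)
)

# Option 1: choose the NUMBER of bins
p_bins <- ggplot(data = toy, mapping = aes(x = dbh)) +
  geom_histogram(bins = 10)

# Option 2: choose the WIDTH of each bin (here 5 cm)
p_width <- ggplot(data = toy, mapping = aes(x = dbh)) +
  geom_histogram(binwidth = 5)

print(p_bins)
print(p_width)
```

Specifying binwidth can be easier to interpret, because each bar then covers a known range of the variable (here, 5 cm of dbh).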
QUESTION 7: Remake this figure three times using 10, 50, and 100 bins. How do they differ? What
information has been lost or gained?
QUESTION 8: Remake this as a density figure. [HINT: use geom_density()]
What if you wanted to look at the DBH distribution for a single species? We could do this in a number of ways.
For example, we could use facet_wrap() as we did before. That will show us the DBH distribution for all of
the species, but with one species per panel.
ggplot(data = trees, mapping = aes(x = dbh)) +
geom_histogram() +
facet_wrap(~sppCode)
That’s fine for some of the really abundant species (e.g., Dipterocarpus obtusifolius, Shorea obtusa, Shorea
siamensis), but it’s not so great for the less abundant species because the y-axis is scaled to the most
abundant species.
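One possible way around the shared y-axis is the scales argument of facet_wrap(), which lets each panel have its own y-axis. This is a sketch on simulated data (toy) with one abundant and one rare species; the species codes are borrowed from the dataset but the values are made up.

```r
library(ggplot2)

# Simulated stand-in: one abundant and one rare species (made-up values)
set.seed(42)
toy <- data.frame(
  sppCode = c(rep("DIPTOB", 200), rep("TECTGR", 15)),
  dbh     = c(runif(200, min = 5, max = 80), runif(15, min = 5, max = 80))
)

# scales = "free_y" gives every panel its own y-axis range
p_free <- ggplot(data = toy, mapping = aes(x = dbh)) +
  geom_histogram(bins = 20) +
  facet_wrap(~sppCode, scales = "free_y")

print(p_free)
```

The trade-off is that bar heights are no longer directly comparable between panels, so check the axis values before comparing species.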
If we wanted to just look at a single species, we need to subset the data. We can do that in two different
ways. First, we can create a new dataframe using the subset command and then put the new dataframe in
the data statement. Here we create a new dataframe called diptob for the species Dipterocarpus obtusifolius.
diptob <- subset(trees, sppCode=="DIPTOB")
ggplot(data = diptob, mapping = aes(x = dbh)) +
geom_histogram()
Alternatively, we can just subset within the ggplot call. This has the benefit of avoiding multiple, overlapping
dataframes floating around in your workspace, but requires understanding the logic of the call. Here we add
the subset call directly into the data statement. We then specify the dataframe ( trees ) that we are subsetting
and what we would like to subset. This has a funny, backward syntax where we state the variable that the
subsetting is based on and then the value(s) that we want in the subset. We separate these with the %in%
operator. In effect, we read this expression from right to left, not left to right. We can also specify limits for the
x- and y-axes (which will make comparison between this and the next figure easier) using xlim() and
ylim() . Note the number of bins has changed. This will allow us to see a bit more of the shape of the data.
ggplot(data = subset(trees, sppCode %in% "DIPTOB"), mapping = aes(x=dbh)) +
geom_histogram(bins=50) +
xlim(0,80) +
ylim(0,70)
## Warning: Removed 2 rows containing missing values (geom_bar).
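One practical reason to use %in% rather than == is that %in% accepts several values at once. The sketch below shows this on a small made-up data frame (toy); one_spp and two_spp are hypothetical names.

```r
# Toy stand-in for the trees data (made-up values for illustration)
toy <- data.frame(
  sppCode = c("DIPTOB", "DIPTOB", "SHOROB", "SHOROB", "TECTGR", "TECTGR"),
  dbh     = c(23.9, 32.9, 19.0, 12.3, 41.6, 26.3)
)

# A single species: == and %in% give the same rows
one_spp <- subset(toy, sppCode == "DIPTOB")

# Several species: use %in% with a vector of codes
# (== would compare element by element and silently give wrong results)
two_spp <- subset(toy, sppCode %in% c("DIPTOB", "SHOROB"))

nrow(one_spp)
nrow(two_spp)
```

The resulting data frame can be passed to the data argument of ggplot() in exactly the same way as before.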
We can make these subset statements more complex. For example, we might want to know what the DBH
distribution of Dipterocarpus obtusifolius (DIPTOB) >20 cm looks like. We just use the & operator in the
subset statement:
ggplot(data = subset(trees, sppCode %in% "DIPTOB" & dbh > 20), mapping = aes(x = dbh)) +
  geom_histogram(bins = 50) +
  xlim(0, 80) +
  ylim(0, 70)
## Warning: Removed 2 rows containing missing values (geom_bar).
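The warnings above arise because xlim() and ylim() set the limits of the scales, and anything falling outside those limits is dropped from the plot. An alternative, sketched here on simulated data (toy), is coord_cartesian(), which zooms the view without discarding anything. Whether that is preferable depends on what you want the figure to show.

```r
library(ggplot2)

# Simulated stand-in for dbh: most values below 80 cm, two large trees above
set.seed(1)
toy <- data.frame(dbh = c(runif(40, min = 5, max = 80), 95, 120))

# coord_cartesian() zooms in on the chosen range but keeps all the data
p_zoom <- ggplot(data = toy, mapping = aes(x = dbh)) +
  geom_histogram(bins = 50) +
  coord_cartesian(xlim = c(0, 80), ylim = c(0, 70))

print(p_zoom)
```

Here the two large trees are still counted in the histogram; they simply fall outside the visible window.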
QUESTION 9: Create your own subset (by species or forest type or dbh or height) and make a histogram.
The last figure we’ll make today is a boxplot. Boxplots give us a description of the distribution of a single
variable across different groups. As such, the mapping aesthetic requires both x and y to be defined. The key
thing to remember is that x is the grouping variable and y is the continuous variable. Otherwise, everything is
much the same as before:
ggplot(data = trees, mapping = aes(x=sppCode, y=dbh)) +
geom_boxplot()
The only problem with this is that the species codes are unreadable because they all overlap. It would be much
easier to see them if they were on the y-axis instead of the x-axis. Fortunately, ggplot2 has an intuitive
option for that ( coord_flip ):
ggplot(data = trees, mapping = aes(x=sppCode, y=dbh)) +
geom_boxplot() +
coord_flip()
One issue with this figure is that the data are ordered alphabetically from bottom to top. Usually ecology and
alphabetisation are uncorrelated. So it is often helpful to sort the data based on something more meaningful. If
we add the reorder command into the mapping aesthetic and specify which variable to sort on (dbh), the
species are ordered by their mean dbh (the default for reorder ) and we get the following:
ggplot(data = trees, mapping = aes(x=reorder(sppCode, dbh), y=dbh)) +
geom_boxplot() +
coord_flip()
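Since a boxplot draws the median, it can be natural to sort the groups by median rather than by mean. reorder() accepts a FUN argument for this. The sketch below uses a small made-up data frame (toy); the object names are hypothetical.

```r
library(ggplot2)

# Toy stand-in for the trees data (made-up values for illustration)
toy <- data.frame(
  sppCode = c("DIPTOB", "DIPTOB", "SHOROB", "SHOROB", "TECTGR", "TECTGR"),
  dbh     = c(23.9, 32.9, 19.0, 12.3, 41.6, 26.3)
)

# Check the ordering that reorder() produces when sorting by median dbh
ordered_codes <- reorder(toy$sppCode, toy$dbh, FUN = median)
levels(ordered_codes)

# Use the same call inside the mapping
p_sorted <- ggplot(data = toy,
                   mapping = aes(x = reorder(sppCode, dbh, FUN = median), y = dbh)) +
  geom_boxplot() +
  coord_flip() +
  labs(x = "Species code", y = "DBH (cm)")

print(p_sorted)
```

Adding labs() here also replaces the default axis label, which would otherwise show the whole reorder(...) expression.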
QUESTION 10: Make a boxplot figure of the height distributions for each of the forest types.
Make sure the axes are clearly labelled and the figure has a title. Pay close
attention to what is labelled x and what is labelled y. Feel free to explore colour
and fill options in the geom.
Making your figures a bit prettier

So far we have been using the default plotting theme in ggplot2 . This gives us the grey background and white
guidelines within the plot background. If you want to change the theme, there are other themes that you can
specify. It is just an additional piece that you add to the overall graphical statement. Here I use the basic black-
and-white theme theme_bw() :
ggplot(data = trees, mapping = aes(x=reorder(sppCode, dbh), y=dbh)) +
geom_boxplot() +
coord_flip() +
theme_bw()
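Other built-in themes are added in the same way. As a rough sketch on made-up data (toy), the example below stores the boxplot once and then applies two of the other themes that ship with ggplot2; the base_size value is an arbitrary choice.

```r
library(ggplot2)

# Toy stand-in for the trees data (made-up values for illustration)
toy <- data.frame(
  sppCode = c("DIPTOB", "DIPTOB", "SHOROB", "SHOROB", "TECTGR", "TECTGR"),
  dbh     = c(23.9, 32.9, 19.0, 12.3, 41.6, 26.3)
)

# Store the plot once, then try different themes on it
p <- ggplot(data = toy, mapping = aes(x = sppCode, y = dbh)) +
  geom_boxplot() +
  coord_flip()

# Axis lines only, no grid lines
p_classic <- p + theme_classic()

# Minimal theme with a larger base font size for all text
p_minimal <- p + theme_minimal(base_size = 14)

print(p_classic)
print(p_minimal)
```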
