
Learn R

…as you learnt your mother tongue

Pedro J. Aphalo
Git hash: 1bf3003; Git date: 2017-04-11 00:00:26 +0300

Pedro J. Aphalo

Helsinki, 11 April 2017

Draft, 95% done


Available through Leanpub
© 2001–2017 by Pedro J. Aphalo
Licensed under one of the Creative Commons licenses as indicated, or when not explicitly indicated, under the CC BY-SA 4.0 license.

Typeset with XeLaTeX in Lucida Bright and Lucida Sans using the KOMA-Script book class.
The manuscript was written using R with package knitr, and edited in WinEdt and RStudio. The source files for the whole book are available at https://bitbucket.org/aphalo/using-r.
Contents

1 Introduction 1
1.1 R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 What is R? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 R as a computer program . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.3 R as a language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2 Packages and repositories . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Reproducible data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Finding additional information . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4.1 R’s built-in help . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4.2 Obtaining help from on-line forums . . . . . . . . . . . . . . . . . . 11
1.5 Additional tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5.1 Revision control: Git and Subversion . . . . . . . . . . . . . . . . . 12
1.5.2 C, C++ and FORTRAN compilers . . . . . . . . . . . . . . . . . . . 13
1.5.3 LaTeX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.5.4 Markdown . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.6 What is needed to run the examples in this book? . . . . . . . . . . . . . 14

2 R as a powerful calculator 15
2.1 Aims of this chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Working at the R console . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 Arithmetic and numeric values . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4 Boolean operations and logical values . . . . . . . . . . . . . . . . . . . . . 27
2.5 Comparison operators and operations . . . . . . . . . . . . . . . . . . . . 29
2.6 Character values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.7 The ‘mode’ and ‘class’ of objects . . . . . . . . . . . . . . . . . . . . . . . . 37
2.8 ‘Type’ conversions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.9 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.10 Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.11 Lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.12 Data frames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
2.13 Simple built-in statistical functions . . . . . . . . . . . . . . . . . . . . . . 59
2.14 Functions and execution flow control . . . . . . . . . . . . . . . . . . . . . 60


3 R Scripts and Programming 61


3.1 Aims of this chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.2 What is a script? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.3 How do we use a script? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.4 How to write a script? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.5 The need to be understandable to people . . . . . . . . . . . . . . . . . . 64
3.6 Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.7 Objects, classes and methods . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.8 Control of execution flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.8.1 Conditional execution . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.8.2 Why using vectorized functions and operators is important . . . 79
3.8.3 Repetition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.8.4 Nesting of loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
3.9 Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

4 R built-in functions 91
4.1 Aims of this chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.2 Loading data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.3 Looking at data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.4 Plotting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.5 Fitting linear models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.5.1 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.5.2 Analysis of variance, ANOVA . . . . . . . . . . . . . . . . . . . . . . 101
4.5.3 Analysis of covariance, ANCOVA . . . . . . . . . . . . . . . . . . . . 102
4.6 Generalized linear models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

5 Storing and manipulating data with R 105


5.1 Aims of this chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.2 Packages used in this chapter . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.3 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.4 Data input and output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.4.1 .Rda files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.4.2 File names and portability . . . . . . . . . . . . . . . . . . . . . . . . 110
5.4.3 Text files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.4.4 Worksheets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.4.5 Statistical software . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
5.4.6 NetCDF files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
5.4.7 Remotely located data . . . . . . . . . . . . . . . . . . . . . . . . . . 139
5.4.8 Data acquisition from physical devices . . . . . . . . . . . . . . . . 141
5.4.9 Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142


5.5 Apply functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143


5.5.1 Base R’s apply functions . . . . . . . . . . . . . . . . . . . . . . . . . 144
5.6 Grammar of data manipulation . . . . . . . . . . . . . . . . . . . . . . . . . 152
5.6.1 Better data frames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
5.6.2 Tidying up data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
5.6.3 Row-wise manipulations . . . . . . . . . . . . . . . . . . . . . . . . . 160
5.6.4 Group-wise manipulations . . . . . . . . . . . . . . . . . . . . . . . . 164
5.7 Pipes and tees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
5.7.1 Pipes and tees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
5.8 Joins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
5.9 Extended examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
5.9.1 Well-plate data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
5.9.2 Seedling morphology . . . . . . . . . . . . . . . . . . . . . . . . . . . 176

6 Plots with ggplot 179


6.1 Aims of this chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
6.2 Packages used in this chapter . . . . . . . . . . . . . . . . . . . . . . . . . . 179
6.3 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
6.4 Grammar of graphics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
6.4.1 Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
6.4.2 Geometries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
6.4.3 Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
6.4.4 Scales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
6.4.5 Coordinate systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
6.4.6 Themes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
6.5 Scatter plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
6.6 Line plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
6.7 Plotting functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
6.8 Plotting text and maths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
6.9 Axis- and key labels, titles, subtitles and captions . . . . . . . . . . . . . 205
6.10 Tile plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
6.11 Bar plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
6.12 Plotting summaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
6.12.1 Statistical “summaries” . . . . . . . . . . . . . . . . . . . . . . . . . . 218
6.13 Fitted smooth curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
6.14 Frequencies and densities . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
6.14.1 Marginal rug plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
6.14.2 Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
6.14.3 Density plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237


6.14.4 Box and whiskers plots . . . . . . . . . . . . . . . . . . . . . . . . . . 240


6.14.5 Violin plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
6.15 Using facets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
6.16 Scales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
6.16.1 Continuous scales for 𝑥 and 𝑦 . . . . . . . . . . . . . . . . . . . . . . 250
6.16.2 Time and date scales for 𝑥 and 𝑦 . . . . . . . . . . . . . . . . . . . . 258
6.16.3 Discrete scales for 𝑥 and 𝑦 . . . . . . . . . . . . . . . . . . . . . . . . 258
6.16.4 Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
6.16.5 Color and fill . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
6.16.6 Continuous colour-related scales . . . . . . . . . . . . . . . . . . . 266
6.16.7 Discrete colour-related scales . . . . . . . . . . . . . . . . . . . . . . 266
6.16.8 Identity scales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
6.16.9 Position of axes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
6.16.10 Secondary axes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
6.17 Adding annotations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
6.18 Coordinates and circular plots . . . . . . . . . . . . . . . . . . . . . . . . . 272
6.18.1 Pie charts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
6.18.2 Wind-rose plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
6.19 Themes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
6.19.1 Predefined themes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
6.19.2 Modifying a theme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
6.19.3 Defining a new theme . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
6.20 Using plotmath expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
6.21 Generating output files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
6.21.1 Using LaTeX instead of plotmath . . . . . . . . . . . . . . . . . . . 304
6.22 Building complex data displays . . . . . . . . . . . . . . . . . . . . . . . . . 304
6.22.1 Using the grammar of graphics for individual plots . . . . . . . . 304
6.22.2 Using the grammar of graphics for series of plots with consistent
design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
6.23 Extended examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
6.23.1 Heat maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
6.23.2 Quadrat plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
6.23.3 Volcano plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
6.23.4 Anscombe’s regression examples . . . . . . . . . . . . . . . . . . . 316
6.23.5 Plotting color patches . . . . . . . . . . . . . . . . . . . . . . . . . . 319
6.23.6 Pie charts vs. bar plots example . . . . . . . . . . . . . . . . . . . . 325

7 Extensions to ggplot 329


7.1 Packages used in this chapter . . . . . . . . . . . . . . . . . . . . . . . . . . 329


7.2 Aims of this chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330


7.3 ‘showtext’ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
7.4 ‘viridis’ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
7.5 ‘pals’ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
7.6 ‘gganimate’ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
7.7 ‘ggstance’ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
7.8 ‘ggbiplot’ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
7.9 ‘ggalt’ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
7.10 ‘ggExtra’ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
7.11 ‘ggfortify’ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
7.12 ‘ggnetwork’ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
7.13 ‘geomnet’ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
7.14 ‘ggforce’ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
7.14.1 Geoms and stats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
7.14.2 Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
7.14.3 Theme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
7.14.4 Paginated facetting . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
7.15 ‘ggpmisc’ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
7.15.1 Plotting time-series . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
7.15.2 Peaks and valleys . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374
7.15.3 Equations as text or labels in plots . . . . . . . . . . . . . . . . . . 378
7.15.4 Highlighting deviations from fitted line . . . . . . . . . . . . . . . 392
7.15.5 Plotting residuals from linear fit . . . . . . . . . . . . . . . . . . . . 394
7.15.6 Filtering observations based on local density . . . . . . . . . . . . 394
7.15.7 Learning and/or debugging . . . . . . . . . . . . . . . . . . . . . . . 398
7.16 ‘ggrepel’ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402
7.16.1 New geoms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
7.16.2 Selectively plotting repulsive labels . . . . . . . . . . . . . . . . . . 407
7.17 ‘tidyquant’ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
7.18 ‘ggseas’ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
7.19 ‘ggsci’ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
7.20 ‘ggthemes’ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
7.21 ‘ggtern’ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
7.22 Other extensions to ‘ggplot2’ . . . . . . . . . . . . . . . . . . . . . . . . . . 421
7.23 Extended examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422
7.23.1 Anscombe’s example revisited . . . . . . . . . . . . . . . . . . . . . 422
7.23.2 Heatmaps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
7.23.3 Volcano plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
7.23.4 Quadrat plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423


8 Plotting maps and images 425


8.1 Aims of this chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
8.2 ggmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
8.2.1 Google maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 428
8.2.2 World map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
8.3 imager . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441
8.3.1 Using the package: 1st example . . . . . . . . . . . . . . . . . . . . 441
8.3.2 Plotting with ‘ggplot2’: 1st example . . . . . . . . . . . . . . . . . . 444
8.3.3 Using the package: 2nd example . . . . . . . . . . . . . . . . . . . . 448
8.3.4 Plotting with ‘ggplot2’: 2nd example . . . . . . . . . . . . . . . . . 450
8.3.5 Manipulating pixel data: 2nd example . . . . . . . . . . . . . . . . 452
8.3.6 Using bitmaps as data in R . . . . . . . . . . . . . . . . . . . . . . . 459

9 If and when R needs help 463


9.1 Packages used in this chapter . . . . . . . . . . . . . . . . . . . . . . . . . . 463
9.2 Aims of this chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463
9.3 R’s limitations and strengths . . . . . . . . . . . . . . . . . . . . . . . . . . 463
9.3.1 Optimizing R code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463
9.3.2 Using the best tool for each job . . . . . . . . . . . . . . . . . . . . 464
9.3.3 R is great, but not always best . . . . . . . . . . . . . . . . . . . . . 465
9.4 Rcpp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
9.5 FORTRAN and C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467
9.6 Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467
9.7 Java . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467
9.8 sh, bash . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 469
9.9 Web pages, and interactive interfaces . . . . . . . . . . . . . . . . . . . . . 469

10 Further reading about R 471


10.1 Introductory texts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471
10.2 Texts on specific aspects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471
10.3 Advanced texts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471

Bibliography 473

Preface

“Suppose that you want to teach the ‘cat’ concept to a very young
child. Do you explain that a cat is a relatively small, primarily
carnivorous mammal with retractible claws, a distinctive sonic
output, etc.? I’ll bet not. You probably show the kid a lot of
different cats, saying ‘kitty’ each time, until it gets the idea. To put
it more generally, generalizations are best made by abstraction
from experience.”

— R. P. Boas (1981) Can we make mathematics intelligible? American Mathematical Monthly 88: 727–731.

This book covers different aspects of the use of R. It is meant to be used as a tutorial complementing a reference book about R, or the documentation that accompanies R and the many packages used in the examples. Explanations are rather short and terse, so as to encourage the development of a routine of exploration. This is not an arbitrary decision: it is the normal modus operandi of most of us who use R regularly for a variety of different problems.
I do not discuss statistics here, just R as a tool and language for data manipulation and display. The idea is for you to learn the R language like children learn a language: they work out what the rules are simply by listening to people speak and trying to utter what they want to tell their parents. I do give some explanations and comments, but the idea of these notes is mainly for you to use the numerous examples to find out by yourself the overall patterns and coding philosophy behind the R language. Instead of parents being the sounding board for your first utterances in R, the computer will play this role. You should look at and try to repeat the examples, and then try your own hand and see how the computer responds: does it understand you or not?
When teaching I tend to lean towards challenging students rather than telling a simplified story. I do the same here, because it is what I prefer as a student, and how I learn best myself. Not everybody learns best with the same approach; for me the most limiting factor is whether what I listen to, or read, is in one way or another challenging or entertaining enough to keep my thoughts focused. This I achieve best when making an effort to understand the contents or to follow the thread or plot of a story. So, be warned: reading this book will be about exploring a new world. This book aims to be a travel guide, neither a traveler’s account nor a cookbook of R recipes.
Do not expect to ever know everything about R! R in a broad sense is vast because


its capabilities can be expanded with independently developed packages. Currently there are more than ten thousand packages available for free in the Comprehensive R Archive Network (CRAN), the main, but not the only, public repository for R packages. You just need to learn to use what you need to use, and to have an idea of what else is available, so that you know where to look for packages when your needs change in the future. And if what you need does not exist, then take the plunge and write your very own package to share with the world (or not). Because R is very popular, there is nowadays a lot of information available, plus a helpful and open-minded on-line community willing to help with those difficult problems for which Google will not be of help.
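As a minimal sketch of how working with packages looks in practice: a package from CRAN is installed once per R installation and then loaded in each session where it is needed. The package name ‘ggplot2’ is used here only as an example; any CRAN package name works the same way.

```r
# Install a package from CRAN; needed only once per R installation
install.packages("ggplot2")

# Load the installed package into the current session,
# making its functions and data sets available
library(ggplot2)

# List the packages currently attached to the search path
search()
```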
How to read this book? My idea is that you will run all the code examples and try as many other variations as needed until you are sure you understand the basic ‘rules’ of the R language and how each function or command described works. In R, a help page is available for each function, data set, etc. In addition, if you use a front-end like RStudio, auto-completion is available, as well as balloon help on the arguments accepted by functions. For scripts, there is syntax checking of the source code before its execution: possible mistakes and even formatting-style problems are highlighted in the editor window. Error messages tend to be terse in R, and may require some lateral thinking and/or ‘experimentation’ to understand the real cause behind problems. When you are not sure you understand how some command works, it is useful in many cases to try simple examples for which you know the correct answer and see if you can reproduce them in R.
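The exploration routine just described can start at the console itself. These are standard base R commands for accessing the built-in documentation; the function `sum` is used here purely as an example of a documented topic.

```r
# Open the help page for a specific function
help(sum)        # the shorthand ?sum is equivalent

# Search the help system when you do not know the exact name
help.search("standard deviation")

# Run the examples included at the end of a function's help page
example(sum)
```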
As with any computer language, in addition to learning the grammar of the language, learning the commonly used writing styles and idioms is extremely important. Computer programs should be readable and easy to understand for humans, in addition to being valid. One aspect of this is consistency. I have tried to be consistent, and to use a clear style that does not diverge much from current usual practice. With time you may develop to some extent a personal style, and this is usually fine. However, when writing computer code, as for any other text intended for humans to read, strive to stick to a consistent writing style and formatting, as they go a long way in making your intentions clear.
As I wrote above, there are many different ways of doing things in R, and many of the packages that are most popular nowadays did not exist when I started using R. One could write many different R books with similar content and still use substantially different ways of achieving the same results. I limit myself to packages that are currently popular or that I consider elegantly designed. I have in particular tried to limit myself to packages with similar design philosophies, especially in relation to their interfaces. What counts as elegant design, and in particular as a friendly user interface, depends strongly on each user’s preferences and previous experience. Consequently,

the contents of the book are strongly biased by my own preferences. Once again, I encourage readers to take this book as a travel guide, as a starting point for exploring the very many packages, styles and approaches which I have not described.
I will appreciate suggestions for further examples, notification of errors and unclear
sections. Many of the examples here have been collected from diverse sources over
many years and because of this not all sources are acknowledged. If you recognize
any example as yours or someone else’s please let me know so that I can add a proper
acknowledgement. I warmly thank the students who over the years have asked the questions and posed the problems that have helped me write this text and correct the mistakes and gaps of previous versions. I have also received help on on-line forums and in person from numerous people, learnt from archived e-mail list messages,
blog posts, books, articles, tutorials, webinars, and by struggling to solve some new
problems on my own. In many ways this text owes much more to people who are
not authors than to myself. However, as I am the one who has written this version
and decided what to include and exclude, as author, I take full responsibility for any
errors and inaccuracies.
I have been using R since around 1998 or 1999, but I am still constantly learning new things about R itself and R packages. With time it has replaced several other pieces of software in my work as a researcher and teacher: SPSS, Systat, Origin and Excel. It has become a central piece of the tool set I use for producing lecture slides, notes, books and even web pages. That is to say, it is the most useful piece of software and programming language I have ever learnt to use. Of course, in time it will be replaced by something better, but at the moment it is the “hot” thing to learn for anybody with a need to analyse and display data.

I encourage you to approach R like a child approaches his or her mother tongue when learning to speak: do not struggle, just play! If the going gets difficult and frustrating, take a break! If you get a new insight, take a break to enjoy the victory!


Icons used to mark different content. Throughout the book, text boxes marked with icons present different types of information. First of all, we have playground boxes, indicated with U, which contain open-ended exercises—ideas and pieces of R code to play with at the R console. A few of these will require more time to grasp, and are indicated with U. Boxes providing general information, usually not directly related to R as a language, are indicated with =. Some boxes, highlighted with , give important bits of information that must be remembered when using R—i.e. they explain some unusual feature of the language. Finally, some boxes, indicated by , give in-depth explanations that may require you to spend time thinking. These can in general be skipped on first reading, but you should return to them at a later, and peaceful, time with a cup of coffee or tea.


= Status as of 2016-11-23. I have updated the manuscript to track package updates since the previous version uploaded six months ago, and added several examples of the new functionality added to packages ‘ggpmisc’, ‘ggrepel’, and ‘ggplot2’. I have written new sections on packages ‘viridis’, ‘gganimate’, ‘ggstance’, ‘ggbiplot’, ‘ggforce’, ‘ggtern’ and ‘ggalt’. Some of these sections are to be expanded, and additional sections are planned for other recently released packages.
With respect to the chapter Storing and manipulating data with R, I have put it on hold, except for the introduction, until I can see a soon-to-be-published book covering the same subject. Hadley Wickham has named the set of tools developed by him and his collaborators the tidyverse, to be described in the book titled R for Data Science by Grolemund and Wickham (O’Reilly).
An important update to ‘ggplot2’ was released last week, and it includes changes to the behavior of some existing functions; especially, faceting has become extensible through other packages. Several of the new facilities are described in the updated text and code included in this book, and this PDF has been generated with up-to-date versions of ‘ggplot2’ and other packages as available today from CRAN, except for ‘ggtern’, which was downloaded from Bitbucket minutes ago.
The present update adds about 100 pages to the previous versions. I expect to upload a new update to this manuscript in one or two months’ time.
Status as of 2017-01-17. Added “playground” exercises to the chapter describing ‘ggplot2’, and converted some of the examples that were earlier part of the main text into these playground items. Added icons to help readers quickly distinguish playground sections (U), information sections (=), warnings about things one needs to be especially aware of (  ) and boxes with more advanced content that may require longer time/more effort to grasp (). Added to the scales and examples sections in the ‘ggplot2’ chapter details about the use of colors in R and ‘ggplot2’. Removed some redundant examples, and updated the section on plotmath. Added terms to the alphabetical index. Increased line-spacing to avoid uneven spacing with inline code bits.
Status as of 2017-02-09. Wrote a section on ‘ggplot2’ themes, and on using system and Google fonts in ggplots with the help of package ‘showtext’. Expanded the section on ‘ggplot2’’s annotations, and revised some sections in the “R scripts and Programming” chapter. Started writing the data chapter. Wrote a draft on writing and reading text files. Several other smaller edits to text and a few new examples.
Status as of 2017-02-14. Wrote sections on reading and writing MS-Excel files, files from statistical programs such as SPSS, Systat, etc., and NetCDF files. Also


wrote sections on using URLs to directly read data, and on reading HTML and XML files directly, as well as on using JSON to retrieve measured/logged data from IoT (internet of things) and similar intelligent physical sensors, micro-controller boards and sensor hubs with network access.
Status as of 2017-03-25. Revised and expanded the chapter on plotting maps, adding a section on the manipulation and plotting of image data. Revised and expanded the chapter on extensions to ‘ggplot2’, so that there are no longer empty sections. Wrote the short chapter “If and when R needs help”. Revised and expanded the “Introduction” chapter. Added index entries, and additional citations to the literature.
Status as of 2017-04-04. Revised and expanded the chapter on using R as a calculator. Revised and expanded the “Scripts” chapter. Minor edits to the “Functions” chapter. Continued writing the chapter on data, writing a section on R’s native apply functions, and added preliminary text for a pipes and tees section. Wrote an intro to the ‘tidyverse’ and the grammar of data manipulation. Added index entries, and a few additional citations to the literature. Spell checking.
Status as of 2017-04-08. Completed writing first draft of chapter on data, writ-
ing all the previously missing sections on the “grammar of data manipulation”.
Wrote two extended examples in the same chapter. Added a table listing several extensions to ‘ggplot2’ not described in the book.
Status as of 2017-04-10. Revised all chapters correcting some spelling mis-
takes, adding some explanatory text and indexing all functions and operators
used. Thoroughly revised the Introduction chapter and the Preface.

1 Introduction

The creative adult is the child who has survived.

— Ursula K. le Guin

1.1 R

1.1.1 What is R?

Most people think of R as a computer program. R is indeed a computer program, a piece of software, but it is also a computer language, implemented in the R program.
Does this make a difference? Yes, until recently we had only one mainstream imple-
mentation of R, the program R. In the last couple of years another implementation has
started to gain popularity, Microsoft R. These are not the only two implementations,
but others are not in widespread use.
Being, in its simplest incarnation, a command-line application, R can be made to work on what nowadays are frugal computing resources, equivalent to a personal computer of a couple of decades ago. Nowadays R can even be made to run on the Raspberry Pi, a single-board Linux computer with the processing power of a modest smartphone. At the other end of the spectrum, on really powerful servers, it can be
used for the analysis of big data sets with millions of observations. How powerful a
computer you will need will depend on the size of the data sets to analyze, on how
patient you are, and on your ability to write ‘good’ code.
One could think of R as a dialect of the S language. S was created and implemented before R. S evolved into S-Plus. As S and S-Plus are commercial programs, variations in the language appeared only between versions. R started as a poor man's home-brewed implementation of S, for use in teaching. Initially R, the program, implemented a subset of the S language. The R program evolved until only some relatively small differences between S and R remained, and these differences were intentional, thought of as improvements. As R overtook S-Plus in popularity, some of the new features in R made their way back into S. R is sometimes called GNU S.
What makes R different from SPSS, SAS, etc., is that it is based on a complete com-
puter programming language designed from scratch for data analysis and visualiza-
tion. This may look unimportant for someone not actually needing or willing to write
software for data analysis. However, in reality it makes a huge difference because R is
extensible. By this it is meant that new functionality can be easily added, and shared,

and this new functionality is to the user indistinguishable from that built into R. In other words, instead of having to switch between different pieces of software to do different types of analyses or plots, one can usually find an R package that will do the job. For those routinely doing similar analyses, the ability to write a short program, sometimes just a handful of lines of code, will allow automation of routine analyses. For those willing to spend time programming, the door is open to building the tools they need if they do not already exist.
However, the most important advantage of using R is that it makes it easy to do data
analyses in a way that ensures that they can be exactly repeated. In other words,
the biggest advantage of using R, as a language, is not in communicating with the
computer, but in communicating to other people what has been done, in a way that
is unambiguous. Of course, other people may want to run the same commands in
another computer, but still it means that a translation from a set of instructions to
the computer into text readable to humans—say the materials and methods section
of a paper—and back is avoided.

1.1.2 R as a computer program

The R code is open source: it is available for anybody to inspect, modify and use. Only a small fraction of users will directly contribute improvements to the R program itself, but such contributions are possible, and they are important in making R reliable.
The executable, the R program we actually use, can be built for different operating
systems and computer hardware. The developers make an important effort to keep
the results obtained from calculations done on all the different builds and computer
architectures as consistent as possible.
R does not have a graphical user interface (GUI), or menus from which to start
different types of analyses. One types the commands at the R console, or saves the
commands into a text file, and uses the file as a ‘script’ or list of commands to be run.
When we work at the console typing in commands one by one, we say that we use R interactively. When we run a script we say that we run a “batch job”. These are the two options that R by itself provides; however, we can use a front-end program on top of R. The simplest option is to use a text editor like Emacs to edit the scripts and then run the scripts in R. With some editors like Emacs, rather good integration is possible, but nowadays there are also some Integrated Development Environments for R, with RStudio currently the most popular by a wide margin.

Using R interactively

Typing commands at the R console is useful when one is playing around, aimlessly
exploring things, but once we want to keep track of what we are doing, there are


Figure 1.1: Screen capture of the R console being used interactively.

better ways of using R. However, the different ways of using R are not exclusive: most users will use the R console to test individual commands and to plot data during the first stages of exploring it. As soon as we know how we want to plot or analyse the data, it is best to start using scripts. This is not enforced in any way by R, but using scripts, or, as we will see below, literate scripts to produce reports, is what really brings to fruition the most important advantages of using R. In Figure 1.1 we can see how the R console looks under MS-Windows. The text in red has been typed in by the user, except for the prompt >, and the blue text is what R has displayed in response. It is essentially a dialogue between user and R.

Using R as a “batch job”

To run a script we first need to prepare it in a text editor. Figure 1.2 shows the console immediately after running the script file shown in the lower window. As before, the red text, the command source("my-script.R"), was typed by the user, and the blue text in the console is what was displayed by R as a result of this action.
A true “batch job” is not run at the R console but at the operating system command prompt, or shell. The shell is the console of the operating system: Linux, Unix, OS X, or MS-Windows. Figure 1.3 shows how running a script at the Windows command


Figure 1.2: Screen capture of the R console and editor just after running a script. The upper
window shows the R console, and the lower window the script file in an editor window.

Figure 1.3: Screen capture of the Windows 10 command console just after running the same script. Here we use Rscript to run the script; the exact syntax will depend on the operating system in use. In this case R prints the results at the operating system console or shell, rather than in its own R console.

prompt looks. In normal use, a script run at the operating system prompt does time-consuming calculations and the output is saved to a file. One may use this approach on a server, say, to leave the batch job running overnight.


Where do IDEs fit?

Integrated Development Environments (IDEs) were initially created for computer pro-
gram development. They are programs that the user interacts with, from within which
the different tools needed can be used in a coordinated way. They usually include a
dedicated editor capable of displaying the output from different tools in a useful way, which in many cases can also do syntax highlighting, and even report some mistakes related to the programming language in use while the user types. One could describe such an editor as the equivalent of a word processor that can check the program code for spelling and syntax errors, and has a built-in thesaurus for the computer language. In the case of RStudio, the main, but not the only, language supported is R.
IDEs usually displays several panes or windows simultaneously. From within the IDE
one has access to the R console, an editor, a file-system browser, and several other tools. Although RStudio supports very well the development of large scripts and packages, it is also the best possible way of using R at the console, as it has the R help system very well integrated. Figure 1.4 shows the window displayed by RStudio under Windows after running the same script as shown above at the R console and at the operating system command prompt. We can see in this figure how RStudio is really a layer between the user and an unmodified R executable. The script was sourced by pressing the “Source” button at the top of the editor pane. RStudio, in response to this, generated the code needed to source the file and “entered” it at the console, the same console where we would ourselves type any R commands.
When a script is run and an error is triggered, RStudio automatically finds the location of the error. RStudio also supports the concept of projects, allowing settings to be saved separately for each project. Some features are beyond what you need for everyday data analysis and are aimed at package development, such as integration of debugging, traceback on errors, and profiling and benchmarking of code so as to analyse and improve performance.
It also integrates support for file version control, which is not only useful for pack-
age development, but also for keeping track of the progress or collaboration in the
analysis of data.
The version of RStudio that one uses locally, i.e., installed in your own computer, runs with an almost identical user interface on most modern operating systems, such as Linux, Unix, OS X, and MS-Windows. There is also a server version that runs on Linux, and that can be used remotely through any web browser. The user interface is still the same.
RStudio is under active development, and constantly improved. Visit http://www.rstudio.org/ for an up-to-date description and download and installation instructions. Two books (Hillebrand and Nierhoff 2015; Loo and Jonge 2012) describe and teach how to use RStudio without going in depth into data analysis or statistics; however, as RStudio is under very active development, several recently added important features are not described in these books. You will find tutorials and up-to-date cheat sheets at http://www.rstudio.org/.

Figure 1.4: The RStudio interface just after running the same script. Here we used the “Source” button to run the script. In this case R prints the results to the R console in the lower left pane.

1.1.3 R as a language

R is a computer language designed for data analysis and data visualization; however, in contrast to some other scripting languages, it is, from the point of view of computer programming, a complete language: it is not missing any important feature. As mentioned above, R started as a free and open-source implementation of the S language (Becker and Chambers 1984; Becker et al. 1988). We will describe the features of the R language in later chapters. Here I mention that it has some features that make it different from other programming languages. For example, it does not have the strict type checks of Pascal or C++. It also has operators that can take vectors and matrices as operands, allowing much more concise program statements for such operations than in other languages. Writing programs, especially reliable and fast code, requires familiarity with some of these idiosyncrasies of the R language. For those using R interactively, or writing short scripts, these features make life a lot easier.
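As a minimal sketch of what these vectorized operators look like in practice (the variable names are arbitrary):

```r
# Arithmetic operators in R are vectorized: they operate element-wise
# on whole vectors, with no need for an explicit loop.
x <- c(1, 2, 3, 4)
y <- x * 2       # every element multiplied by 2: 2 4 6 8
z <- x + rev(x)  # element-wise sum of x and its reverse: 5 5 5 5
```

The same element-wise behaviour extends to comparisons and to matrix operands, which is why many R scripts contain far fewer explicit loops than equivalent programs in other languages.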


 Some languages have been standardised, and their grammar has been formally defined. R, in contrast, is not standardized, and there is no formal grammar definition. So, the R language is defined by the behaviour of the R program.

Because R was initially designed for interactive use in teaching, the R program uses an interpreter instead of a compiler.

 Interpreters and compilers Computer programs and scripts are nowadays almost always written in a high-level language that is readable to humans, and that relies on a grammar much more complex than that understood by the hardware processor chip in the computer or device. Consequently, one or more translation
steps are needed. An interpreter translates user code at the time of execution, and consequently parts of the code that are executed repeatedly are translated multiple times. A native compiler translates the user code into machine code in a separate step, and the compiled machine code can be stored and executed as many times as needed. On the other hand, compiled code can be executed only on given hardware (a processor, or processors from a given family). A byte-code compiler translates user code into an intermediate representation, which cannot be directly executed by any hardware, and which is independent of the hardware architecture, but easier and faster to translate into machine code. The interpreter that executes this intermediate representation is called a “virtual machine”, as it does not depend on a real hardware processor architecture.
An interpreter adds flexibility and makes interactive use possible, but results in slower execution compared to compiled executables. Nowadays, byte compiling is part of the R program, and is used by default in some situations or under user control. Just-in-time (JIT) compiling is a relatively new feature in R, and consists of compiling, on the fly, code that is repeatedly evaluated within a single run of a script.
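As an illustration of byte compiling under user control, here is a minimal sketch using the ‘compiler’ package that ships with R; the function f is just a toy example, and in recent versions of R, functions are often JIT-compiled automatically, so this explicit step matters mainly when JIT is disabled:

```r
library(compiler)  # bundled with the base R distribution

# A deliberately loop-heavy toy function, the kind of code that
# benefits most from byte compilation.
f <- function(x) {
  s <- 0
  for (xi in x) s <- s + xi
  s
}

fc <- cmpfun(f)  # byte-compiled version of f

# Both versions return the same value; the compiled one is usually
# at least as fast.
f(1:100)   # 5050
fc(1:100)  # 5050
```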
Functions or subroutines that have been compiled to machine code can be called from within R, but they cannot currently be written in the R language itself, as no native compiler exists for R. It is common to call from within R code compiled functions, or to use whole libraries coded in languages such as C, C++ and FORTRAN, when maximum execution speed is needed. The calls are normally done from within an R package, so that to the user they appear no different from any other R function. Functions and libraries written in other interpreted and/or byte-compiled languages like Java and Python can also be called from R.


In addition, R exposes a programming interface (API), and many R functions can be called from within programs or scripts written in other languages such as Python and Java, and also from database systems and worksheets. This flexibility is one of the reasons behind R’s popularity.

1.2 Packages and repositories

The most elegant way of adding new features or capabilities to R is through packages. This is without doubt the best mechanism when these extensions need to be shared; however, in most situations it is also the best mechanism for managing code that will be reused, even by a single person, over time. R packages have strict rules about their contents, file structure, and documentation, which makes it possible, among other things, for the package documentation to be merged into R’s help system when a package is loaded. With a few exceptions, packages can be written so that they will work on any computer where R runs.
Packages can be shared as source or binary package files, sent for example through e-mail. However, for sharing them widely, the best option is to submit them to a repository. The largest public repository of R packages is called CRAN, an acronym for Comprehensive R Archive Network. Packages available through CRAN are guaranteed to work, in the sense of not failing any tests built into the package and not crashing. They are tested daily, as they may depend on other packages that may change as they are updated. In January 2017, the number of packages available through CRAN passed the 10 000 mark.
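As a sketch, extending R with a package involves just two steps; the package named in the commented-out line is only an example:

```r
# Install a package and its dependencies from CRAN (commented out here
# because it needs network access):
# install.packages("ggplot2")

# Attach an installed package to the current session; 'tools' ships
# with every R installation, so it is used here as the example.
library(tools)

# Once attached, the package's exported functions can be called like
# any built-in function.
file_ext("my-script.R")  # returns "R"
```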

1.3 Reproducible data analysis

One requirement for reproducible data analysis is a reliable record of what commands have been run on which data. Such a record is especially difficult to keep when issuing commands through menus and dialogue boxes in a graphical user interface. When working interactively at the R console, it is a bit easier, but still copying and pasting is error prone.
A further requirement is to be able to match the R commands used to the output they produced. If the script writes the output to separate files, then the user will need to take care that the script saved or shared as a record of the data analysis was the one actually used for obtaining the reported results and conclusions. This is another error-prone stage in the reporting of a data analysis. To solve this problem an approach was developed, inspired by what is called literate programming. The idea is that running


the script will produce a document that includes the script, the results of running the script, and any explanatory text needed to understand and interpret the analysis.
Although a system capable of producing such reports, called Sweave, has been available for a couple of decades, it was rather limited and not supported by an IDE, making its use tedious. A more recently developed system called ‘knitr’, together with its integration into RStudio, has made the use of this type of reports very easy. The most recent development is Notebooks produced within RStudio. This very new feature can produce a readable report of running the script, including the code used interspersed with the results within the viewable file. However, this newest approach goes even further, in that the actual source script used to generate the report is embedded in the HTML file of the report. This means that anyone who gets access to the output of the analysis in human-readable form also gets access to the code used to generate the report, in a format that can be immediately executed as long as the data is available.
Because of these recent developments, R is an ideal language to use when the goal of reproducibility is important. During recent years the problem of the lack of reproducibility in scientific research has been broadly discussed and analysed. One of the problems faced when attempting to reproduce experimental work is reproducing the data analysis. R, together with these modern tools, can help in avoiding one of the sources of lack of reproducibility.
How powerful and flexible are these tools? They are powerful and flexible enough to write whole books, such as this very book you are now reading, produced with R, knitr and LATEX. All pages in the book are generated directly, and all figures are generated by R and included automatically, except for the three figures in this chapter that have been manually captured from the computer screen. Why am I using this approach? First, because I want to make sure that every bit of code, exactly as you see it printed, runs without error. In addition, I want to make sure that the output that you see below every line or chunk of R language code is exactly what R returns. Furthermore, it saves a lot of work for me as author: I can just update R and all the packages used to their latest versions, and build the book again, to keep it up to date and free of errors.

1.4 Finding additional information

When searching for answers, asking for advice or reading books you will be confronted with different ways of doing the same tasks. Do not let this overwhelm you; in most cases it will not matter, as many computations can be done in R, as in any language, in several different ways, still obtaining the same result. The different approaches may differ mainly in two aspects: 1) how readable to humans are the instructions given to the computer as part of a script or program, and 2) how fast


the code will run. Unless performance is an important bottleneck in your work, just concentrate on writing code that is easy for you and others to understand, and consequently easy to check and reuse. Of course, always check any code you write for mistakes, preferably using actual numerical test cases for any complex calculation or even relatively simple scripts. Testing and validation are extremely important steps in data analysis, so get into this habit while reading this book. Testing how every function works, as I will challenge you to do in this book, is at the core of any robust data analysis or computer programming. When developing R packages, including good coverage of test cases as part of the package itself simplifies code maintenance enormously.

1.4.1 R’s built-in help

To access help pages through the command prompt we use the function help() or a question mark. Every object exported by an R package (functions, methods, classes, data) is documented. Sometimes a single help page documents several R objects. Usually some usage examples are given at the end of the help pages. For example, one can search for a help page at the R console.

help("sum")
?sum

U Look at help for some other functions like mean() , var() , plot() and, why
not, help() itself!

help(help)

When using RStudio there are several easier ways of navigating to a help page; for example, with the cursor on the name of a function in the editor or console, pressing the F1 key opens the corresponding help page in the help pane. Letting the cursor hover for a few seconds over the name of a function at the R console will open “bubble help” for it. If the function is defined in a script or another file open in the editor pane, one can directly navigate from the line where the function is called to where it is defined. In RStudio one can also search for help through the graphical interface.
In addition to help pages, the R distribution includes useful manuals as PDF or HTML files. These can be accessed most easily through the Help menu in RStudio or RGUI. Extension packages provide help pages for the functions and data they export. When a package is loaded into an R session, its help pages are added to the native help of R. In addition to these individual help pages, each package provides an index of its corresponding help pages, for users to browse. Many packages also provide vignettes such as User Guides or articles describing the algorithms used.
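When one does not remember the exact name of a function, the built-in documentation can also be searched by topic or by partial name; the search terms below are arbitrary examples:

```r
# Search all installed help pages by topic; equivalent to typing
# ??"standard deviation" at the console.
help.search("standard deviation")

# List the names of visible objects whose names contain "mean",
# e.g. "mean" and "weighted.mean".
apropos("mean")

# Open the index of help pages for a given installed package.
help(package = "stats")
```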

1.4.2 Obtaining help from on-line forums

Netiquette

In most internet forums, a certain behaviour is expected from those asking and answering questions. Some types of misbehaviour, like use of offensive or inappropriate language, will usually result in the user losing writing rights in a forum. Occasional minor misbehaviour will usually result in the original question not being answered and instead the problem being highlighted in the reply.

• Do your homework: first search for existing answers to your question, both on-
line and in the documentation. (Do mention that you attempted this without
success when you post your question.)

• Provide a clear explanation of the problem, and all the relevant information. Say
if it concerns R, the version, operating system, and any packages loaded and
their versions.

• If at all possible provide a simplified and short, but self-contained, code example
that exemplifies the problem.

• Be polite.

• Contribute to the forum by answering other users’ questions when you know
the answer.

StackOverflow

Nowadays, StackOverflow (http://stackoverflow.com/) is the best question-and-answer support site for R. In most cases, searching for existing questions and their answers will be all you need to do. If asking a question, make sure that it is really a new question. If there is some question that looks similar, make clear how your question is different.
StackOverflow has a user-rights system based on reputation, and questions and answers can be up- and down-voted. Those with the most up-votes are listed at the top of searches. If the questions or answers you write are up-voted, you accumulate reputation; with enough reputation you acquire badges and rights, such as editing other users’ questions and answers or, later on, even deleting wrong answers or off-topic questions from the system. This sounds complicated, but it works extremely well at ensuring that the base of questions and answers is relevant and correct, without relying on a single moderator or ad hoc moderators.

1.5 Additional tools

Additional tools can be used from within RStudio. These tools are not described in this book, but they can be either needed or very useful when working with R. Revision control systems like Git are very useful for keeping track of the history of any project, be it data analysis, package development, or a manuscript. For example, I not only use Git for the development of packages and for data analysis; the source files of this book are also managed with Git.
If you have to install packages from source files, and the packages include code in other languages like C, C++ or FORTRAN, you will need to have the corresponding compilers installed. For Windows and OS X, compiled versions of packages are available through CRAN, so the compilers will rarely be needed. Under Linux, packages are normally installed from sources, but in most Linux distributions the compilers are installed by default as part of the Linux installation.
When using ‘knitr’ for report writing or literate R programming we can use two different types of markup for the non-code text, i.e., the text that is not R code. The software needed to use Markdown is installed together with RStudio. To use LATEX, a TEX distribution such as TeXLive or MikTeX must be installed separately.

1.5.1 Revision control: Git and Subversion

Revision control systems help by keeping track of the history of software development, data analysis, or even manuscript writing. They make it possible for several programmers, data analysts, authors and/or editors to work on the same files in parallel and then merge their edits. They also allow easy transfer of whole ‘projects’ between computers. Git is very popular, and Github (https://github.com/) and Bitbucket (https://bitbucket.org/) are popular hosts for Git repositories. Git itself is free software, was designed by Linus Torvalds of Linux fame, and can also be run locally, or on one’s own private server, either as an AWS instance, on other hosting services, or on your own hardware.
The books ‘Git: Version Control for Everyone’ (Somasundaram 2013) and ‘Pragmatic
Guide to Git’ (Swicegood 2010) are good introductions to revision control with Git.
Free introductory videos and cheatsheets are available at https://git-scm.com/
doc.


1.5.2 C, C++ and FORTRAN compilers

As mentioned above, although R is an interpreted language, a compiler may need to be installed for installing packages containing functions or libraries written in C, C++ or FORTRAN. Although these languages are defined by standards, compilers still differ, and standards evolve. Under MS-Windows a specific compiler and set of tools are needed; they are available from CRAN, ready to be installed, as Rtools. Under OS X, the compiler to install is Xcode, available for free from Apple. In Linux distributions the compilers installed as part of the operating system should be all that is needed.

1.5.3 LATEX

LATEX is built on top of TEX. TEX code and features were ‘frozen’ (only bugs are fixed) long ago. There are currently a few ‘improved’ derivatives: pdfTEX, XƎTEX, and LuaTEX. Currently the most popular TEX engine in western countries is pdfTEX, which can directly output PDF files. XƎTEX can handle text written both from left to right and right to left, even in the same document, supports additional font formats, and is the most popular TEX engine in China and other Asian countries. Both XƎLATEX and LuaTEX are rapidly becoming popular also for typesetting texts in variants of the Latin and Greek alphabets, as these new TEX engines natively support large character sets and modern font formats such as TTF (True Type) and OTF (Open Type).
LATEX is needed only for building the documentation of packages that include documentation using this text markup language. However, building the PDF manuals is optional. The most widely used distribution of TEX is TEXLive, which is available for Linux, OS X and MS-Windows. However, under MS-Windows many users prefer the MikTEX distribution. The equivalent of CRAN for TEX is CTAN, the Comprehensive TEX Archive Network, at http://ctan.tug.org. A good source of additional information on TEX and LATEX is TUG, the TEX Users Group (http://www.tug.org).

1.5.4 Markdown

Markdown (see https://daringfireball.net/projects/markdown/) is a simple markup language which, although offering somewhat less flexibility than LATEX, is much easier to learn, and text files using this markup language can be easily converted to various output formats such as HTML and XHTML in addition to PDF. RStudio supports editing Markdown and the variants R Markdown and Bookdown. Documentation on R Markdown is available on-line at http://rmarkdown.rstudio.com/ and on Bookdown at https://bookdown.org/.


1.6 What is needed to run the examples in this book?

I recommend that you use RStudio as an editor or IDE (integrated development environment). RStudio is user friendly, actively maintained, free, open-source and available both in desktop and server versions. The desktop version runs on Windows, Linux, OS X and other Unixes. For running the examples in this handbook, you would need only to have R installed. That would be enough as long as you also have a text editor available. This is possible, but does not give a very smooth workflow for data analyses that are beyond the very simple. The next stage is to use a text editor which integrates to some extent with R, but still this is not ideal, especially for writing packages or long scripts for data analysis. Currently, by far the best option is to use RStudio.
Of course, when choosing which editor to use, personal preferences and previous familiarity play an important role. Currently, for the development of packages, I use RStudio exclusively. For writing this book I have used both RStudio and the text editor WinEdt, which also has some support for R together with excellent support for LATEX. When working on a large project or collaborating with other data analysts or researchers, one big advantage of a system based on plain text files is that the same files can be edited with different programs as needed or wished by the different persons involved in a project.
When I started using R, nearly two decades ago, I was using other editors, using the operating system shell a lot more, and struggling with debugging as no IDE was available. The only reasonably good integration with an editor was for Emacs, which was widely available only under Unix-like systems. Given this past experience, I encourage you to use an IDE for R. RStudio is nowadays very popular, but if you do not like it, need a different set of features, such as integration with ImageJ, or are already familiar with the Eclipse IDE, you may like to try the Bio7 IDE, available from http://bio7.org.

2 R as a powerful calculator

The desire to economize time and mental effort in arithmetical computations, and to eliminate human liability to error, is probably as old as the science of arithmetic itself.

— Howard Aiken, Proposed automatic calculating machine, presented to IBM in 1937

2.1 Aims of this chapter

In my experience, for those not familiar with computer programming or scripting languages, and who have mostly used computer programs through visual interfaces making heavy use of menus and icons, the best first step in learning R is to learn the basics of the language through its use at the R command prompt. This will teach not only the syntax and grammar rules, but also give a glimpse of the advantages and flexibility of this approach to data analysis.
Menu-driven programs are not necessarily bad; they are just unsuitable when there
is a need to set very many options and choose from many different actions. They
are also difficult to maintain when extensibility is desired, and when independently
developed modules of very different characteristics need to be integrated. Textual
languages also have the advantage, to be dealt with in the next chapter, that command
sequences can be stored as a human- and computer-readable text file that keeps a
record of all the steps used and that in most cases makes it trivial to reproduce the
same steps at a later time. The scripts are also a very simple and handy way of
communicating to others how to do a given data analysis.

2.2 Working at the R console

I assume here that you have installed, or had someone else install, R and RStudio,
and that you are already familiar enough with RStudio to find your way around
its user interface. The examples in this chapter use only the console window, and
results are printed to the console. The values stored in the different variables are
visible in the Environment tab in RStudio.
In the console you can type commands at the > prompt. When you end a line by
pressing the return key, if the line can be interpreted as an R command, the result will


be printed in the console, followed by a new > prompt. If the command is incomplete,
a + continuation prompt will be shown, and you will be able to type in the rest of the
command. For example, if the whole calculation that you would like to do is 1 + 2 + 3
and you enter 1 + 2 + as one line in the console, you will get a continuation prompt
where you will be able to type 3 . However, if you type 1 + 2 , the result will be
calculated and printed.
When working at the command prompt, results are printed by default, but in other
cases you may need to use the function print() explicitly. The examples here rely
on the automatic printing.
The idea with these examples is that you learn by working out how different commands
work, based on the results of the example calculations listed. The examples
are designed so that they allow the rules, and also a few quirks, to be found by ‘detective
work’. This should hopefully lead to better understanding than just studying
rules.

2.3 Arithmetic and numeric values

When working with arithmetic expressions the normal mathematical precedence rules
are respected, but parentheses can be used to alter this order. Parentheses can be
nested, and at all nesting levels the normal rounded parentheses are used. The number
of opening (left side) and closing (right side) parentheses must be balanced, and they
must be located so that each enclosed term is a valid mathematical expression. For
example, while (1 + 2) * 3 is valid, (1 +) 2 * 3 is a syntax error, as 1 + is incomplete
and cannot be calculated.

1 + 1

## [1] 2

2 * 2

## [1] 4

2 + 10 / 5

## [1] 4

(2 + 10) / 5

## [1] 2.4

10^2 + 1

## [1] 101


sqrt(9)

## [1] 3

pi # whole precision not shown when printing

## [1] 3.141593

print(pi, digits = 22)

## [1] 3.1415926535897931

sin(pi) # oops! Read on for explanation.

## [1] 1.224606e-16

log(100)

## [1] 4.60517

log10(100)

## [1] 2

log2(8)

## [1] 3

exp(1)

## [1] 2.718282

One can use variables to store values. The ‘usual’ assignment operator is <- . Variable
names, and all other names in R, are case sensitive. Variables a and A are two
different variables. Variable names can be quite long, but usually it is not a good
idea to use very long names. Here I am using very short names, which is usually a
bad idea; however, in cases like these examples, where the stored values have no real
connection to the real world and are used just once or twice, these names emphasize
their abstract nature.

a <- 1
a + 1

## [1] 2

a

## [1] 1

b <- 10


b <- a + b
b

## [1] 11

3e-2 * 2.0

## [1] 0.06

There are some syntactically legal statements that are not very frequently used,
but you should be aware that they are valid, as they will not trigger error messages,
and may surprise you. The important thing is that you write commands consistently.
The ‘backwards’ assignment operator -> , resulting in code like 1 -> a , is
valid but rarely used. The use of the equals sign ( = ) for assignment, although valid, is
generally discouraged, as this meaning of = was not part of the R language earlier
and it remains seldom used. Chaining assignments as in the first line below is sometimes used,
and signals to the human reader that a , b and c are being assigned the same value.

a <- b <- c <- 0.0
a

## [1] 0

b

## [1] 0

c

## [1] 0

1 -> a
a

## [1] 1

a = 3
a

## [1] 3

 Here I very briefly introduce the concept of the mode of an R object. In R,
numbers belong to mode numeric . We can query whether the mode of an object
is numeric with function is.numeric() .


is.numeric(1)

## [1] TRUE

a <- 1
is.numeric(a)

## [1] TRUE

One can think informally of a mode as a “type” or “kind” of object. Constants
like 1 , or variables such as a in the code chunk above, have a mode
that indicates that they are numbers. Other modes that we will use later in the
present chapter are logical and character (we will discuss the concepts of
mode and class, as used in R, in section 2.7 on page 37).
As numbers can be stored in computers in different ways, most computing
languages allow the use of several different types of numbers. In most cases R’s
numeric() can be used everywhere a number is expected. In some cases
it can be more efficient to explicitly indicate that we will store or operate on
integer numbers, in which case we can use class integer , with integer constants
indicated by a trailing capital L, as in 32L . When the intention is to represent
Real numbers within a finite range, in other words, floats, we can directly use
class double and the constructor double() .

is.numeric(1L)

## [1] TRUE

is.integer(1L)

## [1] TRUE

is.double(1L)

## [1] FALSE

The name double originates from the C language, in which there are different
types of floats available. Similarly, the use of L stems from the long type in C.

Numeric variables can contain more than one value. Even single numbers are vector s
of length one. We will later see why this is important. As you have seen above, the
results of calculations were printed preceded by [1] . This is the index or position
in the vector of the first number (or other value) displayed at the head of the current
line.


One can use c() (‘concatenate’) to create a vector from other vectors, including vectors
of length 1, such as the numeric constants in the statements below.

a <- c(3, 1, 2)
a

## [1] 3 1 2

b <- c(4, 5, 0)
b

## [1] 4 5 0

c <- c(a, b)
c

## [1] 3 1 2 4 5 0

d <- c(b, a)
d

## [1] 4 5 0 3 1 2

One can also create sequences using seq() , or repeat values with rep() . In this case
I leave it to the reader to work out the rules by running these and his or her own examples.

a <- -1:5
a

## [1] -1 0 1 2 3 4 5

b <- 5:-1
b

## [1] 5 4 3 2 1 0 -1

c <- seq(from = -1, to = 1, by = 0.1)


c

## [1] -1.0 -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2
## [10] -0.1 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7
## [19] 0.8 0.9 1.0

d <- rep(-5, 4)
d

## [1] -5 -5 -5 -5

Now something that makes R different from most other programming languages:
vectorized arithmetic.


a + 1 # we add one to vector a defined above

## [1] 0 1 2 3 4 5 6

(a + 1) * 2

## [1] 0 2 4 6 8 10 12

a + b

## [1] 4 4 4 4 4 4 4

a - a

## [1] 0 0 0 0 0 0 0

As can be seen in the first line above, this is another peculiarity of R, frequently
called “recycling”: as vector a is of length 7, but the constant 1 is a vector of length 1,
this 1 is extended by recycling into a vector of ones of the same length as the longest
vector in the statement, in this case, a .
Make sure you understand what calculations are taking place in the chunk above,
and also the one below.

a <- rep(1, 6)
a

## [1] 1 1 1 1 1 1

a + 1:2

## [1] 2 3 2 3 2 3

a + 1:3

## [1] 2 3 4 2 3 4

a + 1:4

## Warning in a + 1:4: longer object length is not a multiple of shorter object length

## [1] 2 3 4 5 2 3

 A useful thing to know: a vector can have length zero. Vectors of length
zero may seem at first sight quite useless, but in fact they are very useful. They
allow the handling of “no input” or “nothing to do” cases as normal cases, which
in the absence of vectors of length zero would need to be treated as special

cases. We also introduce here two useful functions: length() , which returns the
length of a vector, and is.numeric() , which can be used to test whether an R object is
numeric .

z <- numeric(0)
z

## numeric(0)

length(z)

## [1] 0

is.numeric(z)

## [1] TRUE

Vectors of length zero behave in most cases as expected—e.g. they can be
concatenated as shown here.

length(c(a, numeric(0), b))

## [1] 13

length(c(a, b))

## [1] 13

Many functions, such as R’s maths functions and operators, will accept numeric
vectors of length zero as valid input, also returning a vector of length zero, and issuing
neither a warning nor an error message. In other words, these are valid operations
in R.

log(numeric(0))

## numeric(0)

5 + numeric(0)

## numeric(0)

Even when of length zero, vectors do have to belong to a class acceptable for
the operation.

It is possible to remove variables from the workspace with rm() . Function ls()
returns a list of all objects in the current environment, or, by supplying a pattern
argument, only the objects with names matching the pattern . The pattern is given
as a regular expression, with [] enclosing alternative matching characters, ^ and $
indicating the extremes of the name (start and end, respectively). For example "^z$"
matches only the single character ‘z’ while "^z" matches any name starting with ‘z’.
In contrast "^[zy]$" matches both ‘z’ and ‘y’ but neither ‘zy’ nor ‘yz’, and "^[a-z]"
matches any name starting with a lower case ASCII letter. If you are using RStudio,
all objects are listed in the Environment pane, and the search box of the panel can be
used to find a given object.
ls(pattern="^z$")

## [1] "z"

rm(z)
try(z)
ls(pattern="^z$")

## character(0)

There are some special values available for numbers. NA , meaning ‘not available’, is
used for missing values. Calculations can also yield the following values: NaN , ‘not a
number’, and Inf and -Inf for ∞ and −∞. As you will see below, calculations yielding
these values do not trigger errors or warnings, as they are arithmetically valid. Inf
and -Inf are also valid numerical values for input and constants.
a <- NA
a

## [1] NA

-1 / 0

## [1] -Inf

1 / 0

## [1] Inf

Inf / Inf

## [1] NaN

Inf + 4

## [1] Inf

b <- -Inf
b * -1

## [1] Inf


Not available ( NA ) values are very important in the analysis of experimental data, as
frequently some observations are missing from an otherwise complete data set due
to “accidents” during the course of an experiment. It is important to understand how
to interpret NA ’s. They are simply placeholders for something that is unavailable, in
other words, unknown.

A <- NA
A

## [1] NA

A + 1

## [1] NA

A + Inf

## [1] NA

Any operation, even a test of equality, involving one or more NA ’s returns NA . In
other words, when one input to a calculation is unknown, the result of the calculation
is unknown. This means that a special function is needed for testing for the presence
of NA values.

is.na(c(NA, 1))

## [1] TRUE FALSE

 When to use vectors of length zero, and when NA s? NA is used to signal
a value that “was lost” or “was expected” but is unavailable. A vector of length zero
usually represents a value that is not available, but in a case that is within the
normal expectations. In particular, if vectors are expected to have a certain length,
or if index positions along a vector are meaningful, then using NA is a must.

One thing to be aware of, and which we will discuss again later, is that numbers in
computers are almost always stored with finite precision. This means that they do not
always behave as Real numbers as defined in mathematics. In R the usual numbers
are stored as double-precision floats, which means that there are limits to the largest
and smallest numbers that can be represented (approx. −1 ⋅ 10^308 and 1 ⋅ 10^308 ), and
to the number of significant digits that can be stored. The machine precision is usually
described by 𝜖 (epsilon, abbreviated eps), defined as the smallest positive number for
which 1 + 𝜖 ≠ 1. This can be
sometimes important, and can generate unexpected results in some cases, especially


when testing for equality. In the example below, the result of the subtraction is still
exactly 1.

1 - 1e-20

## [1] 1

It is usually safer not to test for equality to zero when working with numeric values.
One alternative is comparing against a suitably small number, which will depend on
the situation, although eps is usually a safe bet, unless the expected range of values
is known to be small. This type of precaution is especially important in what is usually
called “production” code: a script or program that will be used many times and
with little further intervention by the researcher or programmer. Such code must
work correctly, or not work at all; it should never, under any imaginable circumstance,
give a wrong answer.

eps <- .Machine$double.eps


abs(-1)

## [1] 1

abs(1)

## [1] 1

x <- 1e-40
abs(x) < eps * 2

## [1] TRUE

abs(x) < 1e-100

## [1] FALSE

The same precautions apply to tests for equality, so whenever possible, according
to the logic of the calculations, it is best to test for inequalities, for example
using x <= 1.0 instead of x == 1.0 . If this is not possible, then the tests should be
treated as above, for example replacing x == 1.0 with abs(x - 1.0) < eps . Function
abs() returns the absolute value; in simple words, it makes all values positive or
zero, by changing the sign of negative values.
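The advice above can be wrapped in a small helper function; near_equal() is a made-up name used here only for illustration, and base R also provides all.equal() for the same purpose.

```r
# A tolerant test for near-equality; near_equal() is a made-up
# helper name, not part of base R.
near_equal <- function(x, y, tol = .Machine$double.eps^0.5) {
  abs(x - y) < tol
}

x <- 1 - 1e-12
x == 1.0           # exact test: the difference is representable

## [1] FALSE

near_equal(x, 1)

## [1] TRUE

isTRUE(all.equal(x, 1))  # base R's ready-made alternative

## [1] TRUE
```

The default tolerance used here, the square root of eps, is only one common choice; the suitable value depends on the calculations that produced the numbers being compared.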
When comparing integer values these problems do not exist, as integer arithmetic
is not affected by loss of precision in calculations restricted to integers (the L comes
from ‘long’, a name sometimes used for a machine representation of integers). Because
of the way integers are stored in the memory of computers, within the acceptable
range they are stored exactly. One can think of computer integers as a subset
of the whole numbers, restricted to a certain range of values.


1L + 3L

## [1] 4

1L * 3L

## [1] 3

1L %/% 3L

## [1] 0

1L %% 3L

## [1] 1

1L / 3L

## [1] 0.3333333

The last statement in the example immediately above, using the ‘usual’ division
operator, yields a floating-point double result, while the integer division operator %/%
yields an integer result, and %% returns the remainder from the integer division.
Both doubles and integers are considered numeric. In most situations conversion
is automatic and we do not need to worry about the differences between these two
types of numeric values. The next chunk shows returned values that are either TRUE
or FALSE . These are logical values that will be discussed in the next section.

is.numeric(1L)

## [1] TRUE

is.integer(1L)

## [1] TRUE

is.double(1L)

## [1] FALSE

is.double(1L / 3L)

## [1] TRUE

is.numeric(1L / 3L)

## [1] TRUE


2.4 Boolean operations and logical values

What in maths are usually called Boolean values are called logical values in R. They
can have only two values, TRUE and FALSE , in addition to NA (not available). They
are stored in vectors, as are all other simple types in R. There are also logical operators that allow
Boolean algebra (and support for set operations, which we will describe only very briefly).
In the chunk below we work with logical vectors of length one.
a <- TRUE
b <- FALSE
a

## [1] TRUE

!a # negation

## [1] FALSE

a && b # logical AND

## [1] FALSE

a || b # logical OR

## [1] TRUE

Again, vectorization is possible. I present this here, and will come back to it later,
because it is one of the most troublesome aspects of the R language for beginners:
there are two sets of ‘equivalent’ logical operators that behave differently, but use
similar syntax! The vectorized operators have single-character names, & and | , while
the non-vectorized ones have double-character names, && and || . There is only one
version of the negation operator, ! , and it is vectorized. In some, but not all, cases a
warning will indicate that there is a possible problem.

a <- c(TRUE,FALSE)
b <- c(TRUE,TRUE)
a

## [1] TRUE FALSE

b

## [1] TRUE TRUE

a & b # vectorized AND

## [1] TRUE FALSE

a | b # vectorized OR


## [1] TRUE TRUE

a && b # not vectorized

## [1] TRUE

a || b # not vectorized

## [1] TRUE

Functions any() and all() take a logical vector as argument, and return a single
logical value ‘summarizing’ the logical values in the vector: all() returns TRUE only
if every value in the argument is TRUE , and any() returns TRUE unless every value in
the argument is FALSE .

any(a)

## [1] TRUE

all(a)

## [1] FALSE

any(a & b)

## [1] TRUE

all(a & b)

## [1] FALSE

Another important thing to know about these logical operators is that they ‘short-cut’
evaluation: if the result is known from the first part of the statement, the rest of
the statement is not evaluated. Try to understand what happens when you enter the
following commands. Short-cut evaluation is useful, as the first condition can be used
as a guard preventing a later condition from being evaluated when its computation would
result in an error (and possibly abort the whole computation).

TRUE || NA

## [1] TRUE

FALSE || NA

## [1] NA

TRUE && NA

## [1] NA


FALSE && NA

## [1] FALSE

TRUE && FALSE && NA

## [1] FALSE

TRUE && TRUE && NA

## [1] NA
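As a small sketch of the guard idiom mentioned above, the first condition below prevents indexing into an empty vector from affecting the result; the variable name x is arbitrary.

```r
# The first condition guards the second: with an empty vector,
# x[1] is NA, and without the guard the whole test would be NA.
x <- numeric(0)
length(x) > 0 && x[1] > 0

## [1] FALSE

x <- c(5, 1)
length(x) > 0 && x[1] > 0

## [1] TRUE
```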

When using the vectorized operators on vectors of length greater than one, a similar
logic applies element-wise: when the result is determined by the operands with known
values, it is returned, and otherwise NA is returned.

a & b & NA

## [1] NA FALSE

a & b & c(NA, NA)

## [1] NA FALSE

a | b | c(NA, NA)

## [1] TRUE TRUE

2.5 Comparison operators and operations

Comparison operators yield as result logical values.

1.2 > 1.0

## [1] TRUE

1.2 >= 1.0

## [1] TRUE

1.2 == 1.0 # be aware that here we use two = symbols

## [1] FALSE

1.2 != 1.0

## [1] TRUE

1.2 <= 1.0

## [1] FALSE


1.2 < 1.0

## [1] FALSE

a <- 20
a < 100 && a > 10

## [1] TRUE

Again these operators can be used on vectors of any length, returning as result
a logical vector. Recycling of logical values works in the same way as described
above for numeric values.

a <- 1:10
a > 5

## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE


## [8] TRUE TRUE TRUE

a < 5

## [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE


## [8] FALSE FALSE FALSE

a == 5

## [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE


## [8] FALSE FALSE FALSE

all(a > 5)

## [1] FALSE

any(a > 5)

## [1] TRUE

b <- a > 5
b

## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE


## [8] TRUE TRUE TRUE

any(b)

## [1] TRUE

all(b)

## [1] FALSE


Be once more aware of ‘short-cut evaluation’: if the result would not be affected by
the missing value, then that result is returned; if the presence of the NA makes the
end result unknown, then NA is returned.

c <- c(a, NA)


c > 5

## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE


## [8] TRUE TRUE TRUE NA

all(c > 5)

## [1] FALSE

any(c > 5)

## [1] TRUE

all(c < 20)

## [1] NA

any(c > 20)

## [1] NA

is.na(a)

## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE


## [8] FALSE FALSE FALSE

is.na(c)

## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE


## [8] FALSE FALSE FALSE TRUE

any(is.na(c))

## [1] TRUE

all(is.na(c))

## [1] FALSE

This behaviour can be modified, in the case of many of base R’s functions, by means
of an optional argument passed through parameter na.rm , which if TRUE removes
NA values before the function is applied. Even some functions defined in packages
extending R have an na.rm parameter.


all(c < 20)

## [1] NA

any(c > 20)

## [1] NA

all(c < 20, na.rm=TRUE)

## [1] TRUE

any(c > 20, na.rm=TRUE)

## [1] FALSE

 You may skip this box on first reading. See also page 25. Here I give some
examples in which the finite resolution of machine floats, as compared
to Real numbers as defined in mathematics, makes an important difference.

1e20 == 1 + 1e20

## [1] TRUE

1 == 1 + 1e-20

## [1] TRUE

0 == 1e-20

## [1] FALSE

As R can run on different types of computer hardware, the actual machine limits
for storing numbers in memory may vary depending on the type of processor and
even compiler used. However, it is possible to obtain these values at run time
from the variable .Machine . Please see the help page for .Machine for a detailed,
and up-to-date, description of the available constants.


.Machine$double.eps

## [1] 2.220446e-16

.Machine$double.neg.eps

## [1] 1.110223e-16

.Machine$double.max.exp

## [1] 1024

.Machine$double.min.exp

## [1] -1022

The last two values are the exponents of 2, rather than the maximum and
minimum size of numbers that can be handled as doubles . Values beyond the
upper limit are stored as Inf or -Inf , while values smaller in magnitude than
the lower limit underflow to zero. Inf and -Inf enter arithmetic as infinite
values would according to the mathematical rules.

1e1026

## [1] Inf

1e-1026

## [1] 0

Inf + 1

## [1] Inf

-Inf + 1

## [1] -Inf

As integer values are stored in machine memory without loss of precision,


epsilon is not defined for integer values.

.Machine$integer.max

## [1] 2147483647

2147483699L

## [1] 2147483699


In the last statement in the previous code chunk, the out-of-range integer
constant is promoted to a numeric double to avoid the loss of information. A similar
promotion does not take place when operations result in an overflow, or out-of-range
values. However, if one of the operands is a double , then the other operands
are promoted before the operation is attempted.

2147483600L + 99L

## Warning in 2147483600L + 99L: NAs produced by integer overflow

## [1] NA

2147483600L + 99

## [1] 2147483699

2147483600L * 2147483600L

## Warning in 2147483600L * 2147483600L: NAs produced by integer overflow

## [1] NA

2147483600L * 2147483600

## [1] 4.611686e+18

2147483600L^2

## [1] 4.611686e+18

U Explore, with examples similar to the ones above but making use of other
operands and functions, when promotion to a “wider” type of storage
takes place, and when it does not.

In many situations, when writing programs, one should avoid testing for equality
of floating point numbers (‘floats’). Here we show how to handle rounding
errors gracefully. As the example shows, rounding errors may accumulate, and in
practice .Machine$double.eps is not always a good value to safely use in tests
for “zero”; a larger value may be needed.


a == 0.0 # may not always work

## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE


## [8] FALSE FALSE FALSE

abs(a) < 1e-15 # is safer

## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE


## [8] FALSE FALSE FALSE

sin(pi) == 0.0 # angle in radians, not degrees!

## [1] FALSE

sin(2 * pi) == 0.0

## [1] FALSE

abs(sin(pi)) < 1e-15

## [1] TRUE

abs(sin(2 * pi)) < 1e-15

## [1] TRUE

sin(pi)

## [1] 1.224606e-16

sin(2 * pi)

## [1] -2.449213e-16

2.6 Character values

Character variables can be used to store any character. Character constants are written
by enclosing characters in quotes. There are three types of quotes in the ASCII
character set: double quotes " , single quotes ' , and back ticks ` . The first two
types of quotes can be used for delimiting character constants.
a <- "A"
a

## [1] "A"

b <- 'A'
b


## [1] "A"

a == b

## [1] TRUE
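The third type of quotes, back ticks, is not used for character constants but for quoting names; a brief sketch, in which the name `my variable` is just an arbitrary example:

```r
# Back ticks quote names rather than delimiting character strings;
# they allow otherwise syntactically invalid names to be used.
`my variable` <- 42
`my variable`

## [1] 42

`+`(1, 2)  # operators are functions whose names can be quoted

## [1] 3
```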

There are in R two predefined vectors, letters and LETTERS , containing the lower-
and upper-case letters stored in alphabetical order.

a <- "A"
b <- letters[2]
c <- letters[1]
a

## [1] "A"

b

## [1] "b"

c

## [1] "a"

d <- c(a, b, c)
d

## [1] "A" "b" "a"

e <- c(a, b, "c")


e

## [1] "A" "b" "c"

h <- "1"
try(h + 2)

A vector of character strings is not the same as a single string of characters. In the
vector f below, each element is a string of one character, while g is a single string
of three characters, which occupies a single position or slot in a character vector of
length one.

f <- c("1", "2", "3")
g <- "123"
f == g

## [1] FALSE FALSE FALSE

f

## [1] "1" "2" "3"

g

## [1] "123"

One can use the ‘other’ type of quotes as delimiter when one wants to include
quotes within a string. Pretty-printing changes what was typed into a representation
of the string as stored by R: in the second statement below I typed
b <- 'He said "hello" when he came in' ; try it.
a <- "He said 'hello' when he came in"
a

## [1] "He said 'hello' when he came in"

b <- 'He said "hello" when he came in'


b

## [1] "He said \"hello\" when he came in"

The outer quotes are not part of the string; they are ‘delimiters’ used to mark its
boundaries. As you can see when b is printed, special characters can be represented
using ‘escape sequences’. There are several of them, and here we will show just two,
newline and tab. We also show here the different behaviour of print() and cat() ,
with cat() interpreting the escape sequences and print() not.
c <- "abc\ndef\txyz"
print(c)

## [1] "abc\ndef\txyz"

cat(c)

## abc
## def xyz

Above, you will not see any effect of these escapes when using print() : \n represents
‘new line’ and \t means ‘tab’ (tabulator). The escape codes work only in some
contexts, as when using cat() to generate the output. They are also very useful
when one wants to split an axis label, title or other label in a plot into two or more lines, as
they can be embedded in any string.
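As a small sketch of the last point, a label spanning two lines can be built by embedding the newline escape in a string; the variable name y_label is arbitrary.

```r
# A two-line label built with an embedded newline escape.
y_label <- "Temperature\n(degrees C)"
cat(y_label)

## Temperature
## (degrees C)

nchar(y_label)  # the escape \n counts as a single character

## [1] 23
```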

2.7 The ‘mode’ and ‘class’ of objects

Variables have a mode that depends on what can be stored in them. Differently from
other languages, assignment to a variable of a different mode is allowed, and in most
cases the variable’s mode changes together with its contents. However, there is a restriction:
all elements in a vector, array or matrix must be of the same mode, while this is
not required for lists, which can be heterogeneous. In practice this means that we can
assign an object, such as a vector, with a different mode to a name already in use, but
if we use indexing to assign a value of a different mode to some members of
a vector, matrix or array, the whole object will be coerced to the more general mode.
Functions with names starting with is. are tests returning
a logical value, TRUE , FALSE or NA . Function mode() returns the mode of an object
as a character string.
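In fact, when a value of a different mode is assigned to one member of a vector, R reconciles the modes by coercing the whole vector rather than signalling an error; a minimal sketch:

```r
# Assigning a character value to one member of a numeric vector
# coerces the whole vector to mode character.
x <- c(1, 2, 3)
mode(x)

## [1] "numeric"

x[2] <- "b"
x

## [1] "1" "b" "3"

mode(x)

## [1] "character"
```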

my_var <- 1:5


mode(my_var)

## [1] "numeric"

is.numeric(my_var)

## [1] TRUE

is.logical(my_var)

## [1] FALSE

is.character(my_var)

## [1] FALSE

my_var <- "abc"


mode(my_var)

## [1] "character"

While mode is a fundamental property, limited to those modes defined as part
of the R language, the concept of class is different in that classes can be defined by
user code. In particular, different R objects of a given mode, such as numeric , can
belong to different class es. The use of classes for dispatching functions is discussed
briefly in section ?? on page ??, in relation to object-oriented programming in R.
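The distinction can be seen by querying both properties of the same objects; a brief sketch using only constants already introduced in this chapter:

```r
# Two objects of the same mode can differ in class.
mode(1)

## [1] "numeric"

mode(1L)

## [1] "numeric"

class(1)

## [1] "numeric"

class(1L)

## [1] "integer"
```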

2.8 ‘Type’ conversions

The least intuitive conversions are those related to logical values; all others are as one
would expect. By convention, functions used to convert objects from one mode to a different
one have names starting with as. .

as.character(1)

## [1] "1"


as.character(3.0e10)

## [1] "3e+10"

as.numeric("1")

## [1] 1

as.numeric("5E+5")

## [1] 5e+05

as.numeric("A")

## Warning: NAs introduced by coercion

## [1] NA

as.numeric(TRUE)

## [1] 1

as.numeric(FALSE)

## [1] 0

TRUE + TRUE

## [1] 2

TRUE + FALSE

## [1] 1

TRUE * 2

## [1] 2

FALSE * 2

## [1] 0

as.logical("T")

## [1] TRUE

as.logical("t")

## [1] NA

as.logical("TRUE")

## [1] TRUE

as.logical("true")


## [1] TRUE

as.logical(100)

## [1] TRUE

as.logical(0)

## [1] FALSE

as.logical(-1)

## [1] TRUE

f <- c("1", "2", "3")


g <- "123"
as.numeric(f)

## [1] 1 2 3

as.numeric(g)

## [1] 123

Some functions are useful for dealing with how results are displayed. Be aware that,
although the printing below happens by default, these functions return numerical
values that are different from their input. Look at the help pages for further details.
Very briefly, round() is used to round numbers to a certain number of decimal places
after or before the decimal point, while signif() keeps the requested number of
significant digits.

round(0.0124567, digits = 3)

## [1] 0.012

round(0.0124567, digits = 1)

## [1] 0

round(0.0124567, digits = 5)

## [1] 0.01246

signif(0.0124567, digits = 3)

## [1] 0.0125

round(1789.1234, digits = 3)

## [1] 1789.123


signif(1789.1234, digits = 3)

## [1] 1790

a <- 0.12345
b <- round(a, digits = 2)
a == b

## [1] FALSE

a - b

## [1] 0.00345

b

## [1] 0.12

As digits is the second parameter of these functions, the argument can also be
passed by position. However, code is usually easier for humans to understand when
parameter names are made explicit.

round(0.0124567, digits = 3)

## [1] 0.012

round(0.0124567, 3)

## [1] 0.012

When applied to vectors, signif() behaves slightly differently: it ensures that the
value of smallest magnitude retains digits significant digits.

signif(c(123, 0.123), digits = 3)

## [1] 123.000 0.123

U What does value truncation mean? Function trunc() truncates a numeric


value, but it does not return an integer .

• Compare the values returned by trunc() and as.integer() when applied


to a floating point number, such as 12.34 . Check for the equality of values,
and for the class of the returned objects.

• Explore how trunc() and ceiling() differ. Test them both with positive
and negative values.


• Advanced Use function abs() and operators + and - to recreate the out-
put of trunc() and ceiling() for the different inputs.

• Can trunc() and ceiling() be considered type conversion functions in R?

Other functions relevant to the “conversion” of numbers and other values are
format() and sprintf() . These two functions return character strings, instead
of numeric or other values, and are useful for printing output. One could think of
these functions as advanced conversion functions returning formatted, and possibly
combined and annotated, character strings. However, they are usually not considered
normal conversion functions, as they are very rarely used in a way that preserves the
original precision of the input values.
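A minimal, hedged sketch of both functions; the format strings shown are arbitrary examples, not the only possibilities:

```r
# Both functions return character strings, not numbers.
format(pi, digits = 3)

## [1] "3.14"

sprintf("%.3f", pi)

## [1] "3.142"

sprintf("x = %.2f, n = %d", pi, 7L)

## [1] "x = 3.14, n = 7"
```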

U Function format() may be easier to use in some cases, but sprintf() is
more flexible and powerful. Those with experience in the use of the C language
will already know about sprintf() and its use of templates for formatting output.
Look up the help pages for both functions, and practice by trying to create the
same output by means of the two functions.

2.9 Vectors

You already know how to create a vector. Now we are going to see how to extract
individual elements (e.g. numbers or characters) out of a vector. Elements are accessed
using an index that indicates the position in the vector, starting from one, following
the usual mathematical tradition. What in maths would be 𝑥𝑖 for a vector 𝑥 is
represented in R as x[i] . (In R, indexes (or subscripts) always start from one, while
in some other programming languages, such as C and C++, indexes start from zero.
This difference is important, as code implementing many algorithms will need to be
modified when implemented in a language using a different convention for indexes.)

a <- letters[1:10]
a

## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"

a[2]

## [1] "b"


a[c(3,2)]

## [1] "c" "b"

a[10:1]

## [1] "j" "i" "h" "g" "f" "e" "d" "c" "b" "a"

The examples below demonstrate the result of using a vector of indexes that is
longer than the indexed vector. The length of the indexing vector is not restricted
by the length of the indexed vector; individual values in the indexing vector that point
to positions not present in the indexed vector result in NA s. This is easier
to demonstrate than to explain.

length(a)

## [1] 10

a[c(3,3,3,3)]

## [1] "c" "c" "c" "c"

a[c(10:1, 1:10)]

## [1] "j" "i" "h" "g" "f" "e" "d" "c" "b" "a" "a"
## [12] "b" "c" "d" "e" "f" "g" "h" "i" "j"

a[c(1,11)]

## [1] "a" NA

Negative indexes have a special meaning: they indicate the positions at which values
should be excluded. Be aware that it is illegal to mix positive and negative values in
the same indexing operation.

a[-2]

## [1] "a" "c" "d" "e" "f" "g" "h" "i" "j"

a[-c(3,2)]

## [1] "a" "d" "e" "f" "g" "h" "i" "j"

a[-3:-2]

## [1] "a" "d" "e" "f" "g" "h" "i" "j"

# a[c(-3,2)]

As the examples below show, results from indexing with out-of-range values may be surprising.


a[11]

## [1] NA

a[1:11]

## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" NA

Results from indexing with special values may be surprising.

a[ ]

## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"

a[0]

## character(0)

a[numeric(0)]

## character(0)

a[NA]

## [1] NA NA NA NA NA NA NA NA NA NA

a[c(1, NA)]

## [1] "a" NA

a[NULL]

## character(0)

a[c(1, NULL)]

## [1] "a"

Another way of indexing, which is very handy but not available in most other
programming languages, is indexing with a vector of logical values. In practice, the
vector of logical values used for ‘indexing’ is in most cases of the same length as
the vector from which elements are going to be selected. However, this is not a
requirement, and if the logical vector is shorter, it is ‘recycled’ as discussed above in
relation to operators.

a[TRUE]

## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"

a[FALSE]


## character(0)

a[c(TRUE, FALSE)]

## [1] "a" "c" "e" "g" "i"

a[c(FALSE, TRUE)]

## [1] "b" "d" "f" "h" "j"

a > "c"

## [1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE
## [8] TRUE TRUE TRUE

a[a > "c"]

## [1] "d" "e" "f" "g" "h" "i" "j"

selector <- a > "c"


a[selector]

## [1] "d" "e" "f" "g" "h" "i" "j"

which(a > "c")

## [1] 4 5 6 7 8 9 10

indexes <- which(a > "c")


a[indexes]

## [1] "d" "e" "f" "g" "h" "i" "j"

b <- 1:10
b[selector]

## [1] 4 5 6 7 8 9 10

b[indexes]

## [1] 4 5 6 7 8 9 10

Make sure to understand the examples above. These types of constructs are very
widely used in R scripts because they allow concise code that is easy to understand
once you are familiar with the indexing rules. However, if you have not mastered these
rules, many of these ‘terse’ statements will be unintelligible to you.
Indexing can be used on both sides of an assignment. This may look rather esoteric
at first sight, but it is just a simple extension of the logic of indexing described above.


a <- 1:10
a

## [1] 1 2 3 4 5 6 7 8 9 10

a[1] <- 99
a

## [1] 99 2 3 4 5 6 7 8 9 10

a[c(2,4)] <- -99


a

## [1] 99 -99 3 -99 5 6 7 8 9 10

a[TRUE] <- 1
a

## [1] 1 1 1 1 1 1 1 1 1 1

a <- 1

We can also use subscripting on both sides of the same assignment.

a <- letters[1:10]
a

## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"

a[1] <- a[10]


a

## [1] "j" "b" "c" "d" "e" "f" "g" "h" "i" "j"

a <- a[10:1]
a

## [1] "j" "i" "h" "g" "f" "e" "d" "c" "b" "j"

a[10:1] <- a
a

## [1] "j" "b" "c" "d" "e" "f" "g" "h" "i" "j"

a[5:1] <- a[c(TRUE,FALSE)]


a

## [1] "i" "g" "e" "c" "j" "f" "g" "h" "i" "j"


U Do play with subscripts to your heart’s content: really grasping how they
work and how they can be used will be very useful in anything you do with R
in the future.

2.10 Factors

Factors are used for indicating categories, most frequently the factors describing the
treatments in an experiment, or categories in a survey. They can be created either
from numerical or character vectors. The different possible values are called levels.
Normal factors created with factor() are unordered or categorical. R also defines
ordered factors that can be created with function ordered() .

my.vector <- c("treated", "treated", "control", "control", "control", "treated")

my.factor <- factor(my.vector)
my.factor <- factor(my.vector, levels=c("treated", "control"))

It is always preferable to use meaningful names for levels, although it is possible to
use numbers. The order of the levels becomes important when plotting data, as it affects
the order of the levels along the axes, or in legends. Converting factors to numbers
is not intuitive, because even if the levels look like numbers when displayed, they are
just character strings.

my.vector2 <- rep(3:5, 4)


my.factor2 <- factor(my.vector2)
as.numeric(my.factor2)

## [1] 1 2 3 1 2 3 1 2 3 1 2 3

as.numeric(as.character(my.factor2))

## [1] 3 4 5 3 4 5 3 4 5 3 4 5

Internally factor levels are stored as running numbers starting from one, and those
are the numbers returned by as.numeric() when applied to a factor.
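Although ordered() was mentioned above, no example was given; here is a minimal sketch, with made-up level names:

```r
# an ordered factor defines a ranking among its levels
sizes <- ordered(c("small", "large", "medium"),
                 levels = c("small", "medium", "large"))
sizes
## [1] small  large  medium
## Levels: small < medium < large
sizes[1] < sizes[2]  # comparisons are meaningful for ordered factors
## [1] TRUE
```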
Factors are very important in R. In contrast to other statistical software, in which
the role of a variable is set when defining a model to be fitted or when setting up a
test, in R models are specified in exactly the same way for ANOVA and regression
analysis, as linear models. What ‘decides’ the type of model fitted is whether
the explanatory variable is a factor (giving ANOVA) or a numerical variable (giving
regression). This makes a lot of sense because, in most cases, whether an explanatory
variable is considered categorical or not depends on the design of the experiment or survey; in
other words, it is a property of the data and of the experiment or survey that gave origin
to them, rather than of the data analysis.

2.11 Lists

In R, the main difference between lists and other collections is that lists can be
heterogeneous. The members of a list can be considered as forming a sequence, accessible
through numerical indexes, in the same way as vectors. However, most frequently the members
of a list are given names, and are retrieved (indexed) through these names.
Lists as usually defined in languages like C are based on pointers stored at each
node, which chain the different member nodes. In such implementations, indexing by
position is not possible, or at least requires “walking” down the list, node by node. In
R, list members can be accessed through positional indexes. Of course, insertions
and deletions in the middle of a list, whatever the implementation, modify any
position-based indexes. Elements in a list can be named, and named elements are normally
accessed by name. Lists are created using function list() .
a.list <- list(x = 1:6, y = "a", z = c(TRUE, FALSE))
a.list

## $x
## [1] 1 2 3 4 5 6
##
## $y
## [1] "a"
##
## $z
## [1] TRUE FALSE

a.list$x

## [1] 1 2 3 4 5 6

a.list[["x"]]

## [1] 1 2 3 4 5 6

a.list[[1]]

## [1] 1 2 3 4 5 6

a.list["x"]

## $x
## [1] 1 2 3 4 5 6

a.list[1]


## $x
## [1] 1 2 3 4 5 6

a.list[c(1,3)]

## $x
## [1] 1 2 3 4 5 6
##
## $z
## [1] TRUE FALSE

try(a.list[[c(1,3)]])

## [1] 3

To investigate the returned values, function str() (for structure) tends to help,
especially when a list has many members, as it prints more compact output than
printing the list itself.

str(a.list)

## List of 3
## $ x: int [1:6] 1 2 3 4 5 6
## $ y: chr "a"
## $ z: logi [1:2] TRUE FALSE

Using double square brackets for indexing gives the element stored in the list, in
its original mode; in the example above, a.list[["x"]] returns a numeric vector,
while a.list[1] returns a list containing the numeric vector x . a.list$x returns
the same value as a.list[["x"]] , a numeric vector. While a.list[c(1,3)] returns
a list of length two, a.list[[c(1,3)]] applies the indexes recursively, equivalent to
a.list[[1]][[3]] , returning the third element of the first member.
Lists can be also nested.

a.list <- list("a", "ff")


b.list <- list("b", "ff")
c.list <- list(a = a.list, b = b.list)
c.list

## $a
## $a[[1]]
## [1] "a"
##
## $a[[2]]
## [1] "ff"
##
##
## $b
## $b[[1]]


## [1] "b"
##
## $b[[2]]
## [1] "ff"

The nesting can be also done within a single statement.

d.list <- list(a = list("a", "ff"), b = list("b", "ff"))


d.list

## $a
## $a[[1]]
## [1] "a"
##
## $a[[2]]
## [1] "ff"
##
##
## $b
## $b[[1]]
## [1] "b"
##
## $b[[2]]
## [1] "ff"

U What do you expect each of the following statements to return? Before running
the code, predict what value, and of which mode, each statement will return. You
may use implicit or explicit calls to print() , or calls to str() , to visualize the
structure of the different objects.

c.list[c(1,2,1,3)]
c.list[1]
c.list[[1]][2]
c.list[[1]][[2]]
c.list[2][[1]][[2]]

Sometimes we need to flatten a list, or a nested structure of lists within lists.
Function unlist() is what should normally be used in such cases.
The list c.list is a nested system of lists, but all the “terminal” members are
character strings. In other words, the terminal nodes are all of the same mode.
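When the terminal nodes are not all of the same mode, unlist() coerces them to a common mode, as this small made-up example shows:

```r
# members of different modes are coerced, here to character
mixed.list <- list(n = 1, s = "two", b = TRUE)
unlist(mixed.list)
##      n      s      b
##    "1"  "two" "TRUE"
mode(unlist(mixed.list))
## [1] "character"
```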


c.vec <- unlist(c.list)


c.vec

## a1 a2 b1 b2
## "a" "ff" "b" "ff"

is.list(c.list)

## [1] TRUE

is.list(c.vec)

## [1] FALSE

mode(c.list)

## [1] "list"

mode(c.vec)

## [1] "character"

names(c.list)

## [1] "a" "b"

names(c.vec)

## [1] "a1" "a2" "b1" "b2"

The returned value is a vector with named member elements. Function str()
helps figure out what this object looks like. The names are, in this case, based
on the names of the list elements; as the nodes within each nested list are
anonymous, a sequential number is appended to each name. We can access the
members of the vector either through numeric indexes or through names.


str(c.vec)

## Named chr [1:4] "a" "ff" "b" "ff"


## - attr(*, "names")= chr [1:4] "a1" "a2" "b1" "b2"

c.vec[2]

## a2
## "ff"

c.vec["a2"]

## a2
## "ff"

U Function unlist() has two additional parameters, for which we did not
change the default arguments in the example above. These are recursive
and use.names , both of them expecting a logical value as argument. Modify
the statement c.vec <- unlist(c.list) by passing FALSE to each of them,
in turn, and in each case study the value returned and how it differs from
the one obtained above.

2.12 Data frames

Data frames are a special type of list, in which each element is a vector or a factor of
the same length. They are created with function data.frame() with a syntax similar to
that used for lists. When a shorter vector is supplied as argument, it is recycled until
the full length of the variable is filled. This is very different from what we obtained in
the previous section when we created a list.

a.df <- data.frame(x = 1:6, y = "a", z = c(TRUE, FALSE))


a.df

## x y z
## 1 1 a TRUE
## 2 2 a FALSE
## 3 3 a TRUE
## 4 4 a FALSE
## 5 5 a TRUE
## 6 6 a FALSE


str(a.df)

## 'data.frame': 6 obs. of 3 variables:


## $ x: int 1 2 3 4 5 6
## $ y: Factor w/ 1 level "a": 1 1 1 1 1 1
## $ z: logi TRUE FALSE TRUE FALSE TRUE FALSE

class(a.df)

## [1] "data.frame"

mode(a.df)

## [1] "list"

is.data.frame(a.df)

## [1] TRUE

is.list(a.df)

## [1] TRUE

Indexing of data frames is somewhat similar to that of the underlying list, but not
exactly equivalent. We can index with [[ ]] to extract individual variables, thought
of as being stored as columns in a matrix-like list or “worksheet”.

a.df$x

## [1] 1 2 3 4 5 6

a.df[["x"]]

## [1] 1 2 3 4 5 6

a.df[[1]]

## [1] 1 2 3 4 5 6

class(a.df)

## [1] "data.frame"

R is an object-oriented language, and objects belong to classes. With function
class() we can query the class of an object. As we saw in the two previous chunks,
list and data frame objects belong to two different classes. However, their relationship
is based on a hierarchy of classes. We say that class data.frame is derived from
class list . Consequently, data frames inherit the methods and characteristics of
lists that have not been modified for data frames.


In the same way as with vectors, we can add members to lists and data frames.

a.df$x2 <- 6:1


a.df$x3 <- "b"
a.df

## x y z x2 x3
## 1 1 a TRUE 6 b
## 2 2 a FALSE 5 b
## 3 3 a TRUE 4 b
## 4 4 a FALSE 3 b
## 5 5 a TRUE 2 b
## 6 6 a FALSE 1 b

We have added two columns to the data frame, and in the case of column x3
recycling took place. This is where lists and data frames differ substantially in their
behaviour. In a data frame, although class and mode can be different for different
variables (columns), they are required to have the same length. In the case of lists,
there is no such requirement, and recycling never takes place when adding a node.
Compare the values returned below for a.ls , to those in the example above for a.df .

a.ls <- list(x = 1:6, y = "a", z = c(TRUE, FALSE))


a.ls

## $x
## [1] 1 2 3 4 5 6
##
## $y
## [1] "a"
##
## $z
## [1] TRUE FALSE

a.ls$x2 <- 6:1


a.ls$x3 <- "b"
a.ls

## $x
## [1] 1 2 3 4 5 6
##
## $y
## [1] "a"
##
## $z
## [1] TRUE FALSE
##
## $x2
## [1] 6 5 4 3 2 1
##


## $x3
## [1] "b"

Data frames are extremely important to anyone analysing or plotting data in R. One
can think of data frames as tightly structured work-sheets, or as lists. As you may
have guessed from the examples earlier in this section, there are several different
ways of accessing columns, rows, and individual observations stored in a data frame.
The columns can to some extent be treated as elements in a list, and can be accessed
both by name and by index (position). When accessed by name, using $ or double square
brackets, a single column is returned as a vector or factor. In contrast to lists, data
frames are ‘rectangular’, and for this reason the values stored can also be accessed
in a way similar to how elements in a matrix are accessed, using two indexes. As we
saw for vectors, indexes can be vectors of integer numbers or vectors of logical values.
For columns, they can in addition be vectors of character strings matching the names
of the columns. When using two indexes, it is extremely important to remember that the
row index is always given first.
a.df[ , 1] # first column

## [1] 1 2 3 4 5 6

a.df[ , "x"] # first column

## [1] 1 2 3 4 5 6

a.df[1, ] # first row

## x y z x2 x3
## 1 1 a TRUE 6 b

a.df[1:2, c(FALSE, FALSE, TRUE, FALSE, FALSE)] # first two rows of the third column

## [1] TRUE FALSE


a.df[a.df$z , ] # the rows for which z is true

## x y z x2 x3
## 1 1 a TRUE 6 b
## 3 3 a TRUE 4 b
## 5 5 a TRUE 2 b

a.df[a.df$x > 3, -3] # rows for which x > 3, all columns except the third one

## x y x2 x3
## 4 4 a 3 b
## 5 5 a 2 b
## 6 6 a 1 b


As explained earlier for vectors, indexing can be present both on the right-hand
side and on the left-hand side of an assignment. The next few examples do assignments to
“cells” of a.df , either to one whole column, or to individual values. The last statement
in the chunk below copies a number from one location to another by using indexing
of the same data frame on both the right side and the left side of the assignment.

a.df[1, 1] <- 99
a.df

## x y z x2 x3
## 1 99 a TRUE 6 b
## 2 2 a FALSE 5 b
## 3 3 a TRUE 4 b
## 4 4 a FALSE 3 b
## 5 5 a TRUE 2 b
## 6 6 a FALSE 1 b

a.df[ , 1] <- -99


a.df

## x y z x2 x3
## 1 -99 a TRUE 6 b
## 2 -99 a FALSE 5 b
## 3 -99 a TRUE 4 b
## 4 -99 a FALSE 3 b
## 5 -99 a TRUE 2 b
## 6 -99 a FALSE 1 b

a.df[["x"]] <- 123


a.df

## x y z x2 x3
## 1 123 a TRUE 6 b
## 2 123 a FALSE 5 b
## 3 123 a TRUE 4 b
## 4 123 a FALSE 3 b
## 5 123 a TRUE 2 b
## 6 123 a FALSE 1 b

a.df[1, 1] <- a.df[6, 4]


a.df

## x y z x2 x3
## 1 1 a TRUE 6 b
## 2 123 a FALSE 5 b
## 3 123 a TRUE 4 b
## 4 123 a FALSE 3 b
## 5 123 a TRUE 2 b
## 6 123 a FALSE 1 b


 We mentioned above that indexing by name can be done either with double
square brackets, [[ ]] , or with $ . In the first case the name of the variable
or column is given as a character string, enclosed in quotation marks, or as a
variable with mode character . When using $ , the name is entered as is, without
quotation marks.

x.list <- list(abcd = 123, xyzw = 789)


x.list[["abcd"]]

## [1] 123

x.list$abcd

## [1] 123

x.list$ab

## [1] 123

x.list$a

## [1] 123

Both in the case of lists and data frames, when using double square brackets, an exact
match is required between the name stored in the object and the name used for
indexing. In contrast, with $ any unambiguous partial match will be accepted.
For interactive use, partial matching is helpful in reducing typing. However, in
scripts, and especially in R code in packages, it is best to avoid the use of $ , as a
partial match to a wrong variable added at a later time (e.g. when someone
else revises the script) can lead to errors that are very difficult to diagnose. In addition, as
$ is implemented by first attempting to match the name and then calling [[ ]] ,
using $ for indexing can result in slightly slower performance compared to using
[[ ]] .
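A short sketch of this difference, using a made-up list; note that for lists [[ ]] also accepts an exact parameter, which defaults to TRUE :

```r
x.list <- list(abcd = 123, xyzw = 789)
x.list[["ab"]]                 # exact matching: no member named "ab"
## NULL
x.list$ab                      # unambiguous partial match succeeds
## [1] 123
x.list[["ab", exact = FALSE]]  # opting in to partial matching
## [1] 123
```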

When the names of data frames are long, complex conditions become awkward to
write using indexing, i.e. subscripts. In such cases subset() is handy, because evaluation
is done in the ‘environment’ of the data frame, i.e. the names of the columns
are recognized if entered directly when writing the condition.

a.df <- data.frame(x = 1:6, y = "a", z = c(TRUE, FALSE))


subset(a.df, x > 3)

## x y z


## 4 4 a FALSE
## 5 5 a TRUE
## 6 6 a FALSE

When calling functions that return a vector, data frame, or other structure, the
square brackets can be appended to the rightmost parenthesis of the function call, in
the same way as to the name of a variable holding the same data.

subset(a.df, x > 3)[ , -3]

## x y
## 4 4 a
## 5 5 a
## 6 6 a

subset(a.df, x > 3)$x

## [1] 4 5 6

None of the examples in the last three code chunks alter the original data frame
a.df . We can store the returned value using a new name if we want to preserve
a.df unchanged, or we can assign the result to a.df , deleting the original in the
process. Another way to delete a column from a data frame is to assign NULL to it.

a.df[["x2"]] <- NULL


a.df$x3 <- NULL
a.df

## x y z
## 1 1 a TRUE
## 2 2 a FALSE
## 3 3 a TRUE
## 4 4 a FALSE
## 5 5 a TRUE
## 6 6 a FALSE

In the previous code chunk we deleted the last two columns of the data frame a.df .
Finally, an esoteric trick for you to think about.

a.df[1:6, c(1,3)] <- a.df[6:1, c(3,1)]


a.df

## x y z
## 1 0 a 6
## 2 1 a 5
## 3 0 a 4
## 4 1 a 3
## 5 0 a 2
## 6 1 a 1


Although in this last example we used numeric indexes to make it more interesting,
in practice, especially in scripts or other code that will be reused, do use column
names instead of positional indexes. This makes your code much more reliable, as
changes elsewhere in the script are much less likely to lead to undetected errors.
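As a small sketch of this advice (the data frame below is invented for the example), selecting columns by name keeps working even if columns are later reordered or new ones inserted:

```r
b.df <- data.frame(x = 1:3, y = c("a", "b", "c"), z = c(TRUE, FALSE, TRUE))
b.df[ , c("x", "z")]  # by name: robust to changes in column order
##   x     z
## 1 1  TRUE
## 2 2 FALSE
## 3 3  TRUE
b.df[b.df$z, "x"]     # rows selected by condition, column by name
## [1] 1 3
```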

2.13 Simple built-in statistical functions

As R’s main focus is statistics, it provides functions for both simple and complex
calculations, going from means and variances to fitting very complex models. We will
start with the simple ones.

x <- 1:20
mean(x)

## [1] 10.5

var(x)

## [1] 35

median(x)

## [1] 10.5

mad(x)

## [1] 7.413

sd(x)

## [1] 5.91608

range(x)

## [1] 1 20

max(x)

## [1] 20

min(x)

## [1] 1

length(x)

## [1] 20
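A few more base summaries in the same vein; the identity checked at the end holds because sd() is defined as the square root of var() :

```r
x <- 1:20
sum(x)
## [1] 210
prod(1:5)
## [1] 120
quantile(x, c(0.25, 0.75))
##   25%   75%
##  5.75 15.25
all.equal(sd(x)^2, var(x))  # sd() is the square root of var()
## [1] TRUE
```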


2.14 Functions and execution flow control

Although functions can be defined and used at the command prompt, we will dis-
cuss them on their own, in Chapter 4 starting on page 91. Flow-control statements
(e.g. repetition and conditional execution) are introduced in Chapter 3, immediately
following.

3 R Scripts and Programming

An R script is simply a text file containing (almost) the same
commands that you would enter on the command line of R.

— Kickstarting R

3.1 Aims of this chapter

In my experience, for those who have mainly used graphical user interfaces,
understanding why and when scripts can help in communicating a certain data analysis
protocol can be revelatory. As soon as a data analysis stops being trivial, describing
the steps followed through a system of menus and dialogue boxes becomes extremely
tedious.
It is also usually the case that graphical user interfaces tend to be difficult to extend
or improve in a way that keeps step-by-step instructions valid across program versions
and operating systems.
Many times the same sequence of commands needs to be applied to different data
sets, and scripts make meeting such a requirement easy.
In this chapter I will walk you through the use of R scripts, starting from an extremely
simple script.

3.2 What is a script?

We call a script a text file that contains the same commands that you would type at
the console prompt. A true script is not, for example, an MS-Word file where you have
pasted or typed some R commands. A script file has the following characteristics.

• The script is a text file (ASCII or some other encoding e.g. UTF-8 that R uses in
your set-up).

• The file contains valid R statements (including comments) and nothing else.

• Comments start at a # and end at the end of the line. (True end-of line as coded
in file, the editor may wrap it or not at the edge of the screen).

• The R statements are in the file in the order that they must be executed.

• R scripts have file names ending in .r or .R.


It is good practice to write scripts so that they are self-contained. Such a script
will run in a new R session, as it includes library() calls to load all the required
packages.

3.3 How do we use a script?

A script can be sourced. If we have a text file called my.first.script.r containing


the following text:

# this is my first R script


print(3 + 4)

And then source this file:

source("my.first.script.r")

## [1] 7

The results of executing the statements contained in the file will appear in the
console. The commands themselves are not shown (the sourced file is not echoed to
the console) and the results will not be printed unless you include explicit print()
commands in the script. This applies in many cases also to plots, e.g. a figure created
with ggplot() needs to be printed if we want it to be included in the output when
the script is run. Adding a redundant print() is harmless.
From within RStudio, if you have an R script open in the editor, there will be a “Source”
drop-down (≠ DropBox) visible, from which you can choose “Source”, as described above,
or “Source with echo”, for the currently open file.
When a script is sourced, the output can be saved to a text file instead of being
shown in the console. It is also easy to call R with the script file as argument directly
at the command prompt of the operating system.

Rscript my.first.script.r

You can open an operating system’s shell from the Tools menu in RStudio to run
this command. The output will be printed to the shell console. If you would like to
save the output to a file, use redirection.


Rscript my.first.script.r > my.output.txt

Sourcing is very useful when a script is ready for use; however, while developing a script,
or sometimes when testing things, one usually wants to run (or execute) one or a few
statements at a time. This can be done using the “Run” button, after either positioning
the cursor in the line to be executed, or selecting the text that one would like to run
(the selected text can be part of a line, a whole line, or a group of lines, as long as it
is syntactically valid).
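Output can also be captured from within R itself using sink() ; a minimal sketch reusing the file names from this section:

```r
# create a small script, then capture the output of sourcing it
writeLines('print(3 + 4)', "my.first.script.r")
sink("my.output.txt")         # divert printed output to the file
source("my.first.script.r")
sink()                        # restore output to the console
readLines("my.output.txt")
## [1] "[1] 7"
```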

3.4 How to write a script?

The approach used, or the mix of approaches, will depend on your preferences, and on
how confident you are that the statements will work as expected.

If one is very familiar with similar problems One would just create a new text file,
write the whole thing in the editor, and then test it. This is rather unusual.

If one is moderately familiar with the problem One would write the script as
above, but test it part by part while writing it. This is usually what I
do.

If one is mostly playing around Then, if one is using RStudio, one types statements
at the console prompt. As you should know by now, everything you run at the
console is saved to the “History”. In RStudio the History is displayed in its own
pane, and in this pane one can select any previous statement and, by pressing
a single key, have it copied and pasted to either the console prompt or the cursor
position in the file visible in the editor. In this way one can build a script by
copying and pasting from the History to the script file the bits that have worked
as intended.

U By now you should be familiar enough with R to be able to write your own
script.

1. Create a new R script (in RStudio, from ‘File’ menu, “+” button, or by typing
“Ctrl + Shift + N”).

2. Save the file as my.second.script.r.


3. Use the editor pane in RStudio to type some R commands and comments.

4. Run individual commands.

5. Source the whole file.

3.5 The need to be understandable to people

When you write a script, it is either because you want to document what you have
done or because you want to re-use it at a later time. In either case, the script itself,
although still meaningful for the computer, could become very obscure to you, and even
more so to someone seeing it for the first time.
How does one achieve an understandable script or program?

• Avoid the unusual. People using a certain programming language tend to use
some implicit or explicit rules of style1 . As a minimum try to be consistent with
yourself.

• Use meaningful names for variables, and any other object. What is meaningful
depends on the context. Depending on common use, a single letter may be more
meaningful than a long word. However, self-explanatory names are usually better: e.g.
using n.rows and n.cols is much clearer than using n1 and n2 when dealing
with a matrix of data. Probably number.of.rows and number.of.columns would
just increase the length of the lines in the script, and one would spend more time
typing without getting much in return.

• How to make the words visible in names: traditionally in R one would use dots to
separate the words and use only lower case. Some years ago, it became possible
to use underscores. The use of underscores is quite common nowadays, because
in some contexts it is “safer”, as in some situations a dot may have a special
meaning. What we call “camel case” is only infrequently used in R programming but
is common in other languages like Pascal. An example of camel case is NumCols .
In some cases it can become a bit confusing, as in UVMean or UvMean .

U Here is an example of bad style in a script. Read Google’s R Style Guide2, and
edit the code in the chunk below so that it becomes easier to read.

1 Style includes indentation of statements, capitalization of variable and function names.


a <- 2 # height
b <- 4 # length
C <-
a *
b
C -> variable
print(
"area: ", variable
)

The points discussed above already help a lot. However, one can go further in
achieving the goal of human readability by interspersing explanations and code
“chunks”, and by using all the facilities of typesetting, even of maths, within the listing
of the script. Furthermore, by including the results of the calculations and the code
itself in a typeset report built automatically, we ensure that the results are indeed the
result of running the code shown. This greatly contributes to data analysis
reproducibility, which is becoming a widespread requirement for any data analysis, both in
academic research and in industry. It is possible not only to build whole books like
this one, but also whole data-based web sites with these tools.
In the realm of programming, this approach is called literate programming, and was
first proposed by Donald Knuth (Knuth 1984) through his WEB system. In the case
of R programming, the first support for literate programming was through ‘Sweave’,
which has been mostly superseded by ‘knitr’ (Xie 2013). This package supports
the use of Markdown or LaTeX (Lamport 1994) as the markup language for the textual
contents, and can also format and add syntax highlighting to code chunks. The Markdown
language has been extended to make it easier to include R code, as R Markdown
(http://rmarkdown.rstudio.com/), and in addition to make it suitable for typesetting large
and complex documents, as Bookdown (Xie 2016). The use of ‘knitr’ is well integrated
into the RStudio IDE.
This is not strictly an R programming subject, as it concerns programming in any
language. On the other hand, this is an incredibly important skill to learn, but well
described in other books and web sites cited in the previous paragraph. This whole
book, including figures, has been generated using ‘knitr’ and the source scripts for the
book are available through Bitbucket at https://bitbucket.org/aphalo/using-r.

3.6 Functions

When writing scripts, or any program, one should avoid repeating blocks of code
(groups of statements). The reasons for this are: 1) if the code needs to be changed,
e.g. to fix a bug or error, you have to make changes in more than one place in the
file, or in more than one file, and sooner or later some copies will remain unchanged
by mistake, leading to inconsistencies and hard-to-track bugs; 2) it makes the
script file longer, and this makes debugging, commenting, etc. more tedious and
error prone; 3) abstraction and division of a problem into smaller chunks helps with
keeping the code understandable to humans.
How do we avoid repeating bits of code? We write a function containing the statements
that we would need to repeat, and then call (“use”) the function in their place.
Functions are defined by means of function() , and saved like any other object
in R by assignment to a variable. In the example below x and y are both formal
parameters, or names used within the function for objects that will be supplied as
“arguments” when the function is called. One can think of parameter names as
placeholders.
my.prod <- function(x, y){x * y}
my.prod(4, 3)

## [1] 12

First some basic knowledge. In R, arguments are passed by copy. This is something
very important to remember. Whatever you do within a function to modify an
argument, its value outside the function will (almost) always remain unchanged.
my.change <- function(x){x <- NA}
a <- 1
my.change(a)
a

## [1] 1
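If the caller does need the modified value, the idiomatic approach is to return it and reassign; a small sketch (the function name is invented here):

```r
# return the new value, and let the caller do the reassignment
my.change2 <- function(x) {
  x <- NA
  x   # the value of the last statement is returned
}
a <- 1
a <- my.change2(a)
a
## [1] NA
```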

Any result that needs to be made available outside the function must be returned
by the function. If the function return() is not explicitly used, the value returned
by the last statement executed within the body of the function will be returned.
print.x.1 <- function(x){print(x)}
print.x.1("test")

## [1] "test"

print.x.2 <- function(x){print(x); return(x)}


print.x.2("test")

## [1] "test"
## [1] "test"

print.x.3 <- function(x){return(x); print(x)}


print.x.3("test")


## [1] "test"

print.x.4 <- function(x){return(); print(x)}


print.x.4("test")

## NULL

print.x.5 <- function(x){x}

print.x.5("test")

## [1] "test"

Now we will define a useful function: a function for calculating the standard error
of the mean from a numeric vector.

SEM <- function(x){sqrt(var(x)/length(x))}


a <- c(1, 2, 3, -5)
a.na <- c(a, NA)
SEM(x=a)

## [1] 1.796988

SEM(a)

## [1] 1.796988

SEM(a.na)

## [1] NA

For example, in SEM(a) we are calling function SEM() with a as argument.
The function we defined above fails to give a useful answer when x contains NA s,
because var() returns NA . Simply discarding NA s within var() would not be enough,
as NA s would still be counted by length() ; so we need to remove them before
calling length() as well.

simple_SEM <- function(x) {


sqrt(var(x, na.rm=TRUE)/length(na.omit(x)))
}
a <- c(1, 2, 3, -5)
a.na <- c(a, NA)
simple_SEM(x=a)

## [1] 1.796988

simple_SEM(a)

## [1] 1.796988

simple_SEM(a.na)

## [1] 1.796988

67
3 R Scripts and Programming

R does not have a built-in function for the standard error, so the function above is
generally useful. If we want to make this function both safe, and consistent
with other R functions, we can define it as follows, allowing the user to provide a
second argument which is passed on as an argument to var() :

SEM <- function(x, na.rm=FALSE){


sqrt(var(x, na.rm=na.rm)/length(na.omit(x)))
}
SEM(a)

## [1] 1.796988

SEM(a.na)

## [1] NA

SEM(a.na, TRUE)

## [1] 1.796988

SEM(x=a.na, na.rm=TRUE)

## [1] 1.796988

SEM(TRUE, a.na)

## Warning in if (na.rm) "na.or.complete" else "everything": the condition has


length > 1 and only the first element will be used

## [1] NA

SEM(na.rm=TRUE, x=a.na)

## [1] 1.796988

In this example you can see that functions can have more than one parameter, and
that parameters can have default values that are used when no argument is supplied.
In addition, if the names of the parameters are indicated, arguments can be supplied
in any order, while unnamed arguments are matched to parameters by position.
Named and positional arguments can be mixed in one call: unnamed arguments are
matched, in order, to the parameters not already filled by name.

U Test the behaviour of print.x.1 and print.x.5 at the command prompt,


and from a script, by writing and sourcing a script. The behaviour of one of these
functions will be different when the script is sourced than at the command prompt. Explain why.

U Define your own function to calculate the mean in a similar way as SEM()
was defined above. Hint: function sum() could be of help.

U Create some additional vectors containing NA s or not. Use them to test


functions simple_SEM() and SEM() defined above, and then explain why SEM()
always returns the correct value, even though “ na.omit(x) ” is unconditionally
(always) applied to x before calculating its length.

 R handles evaluation of function arguments differently from many other com-


puter languages. Not only are arguments passed by value, but in addition they are
evaluated only at the time of first use in the function body code. This is called lazy
evaluation, and before evaluation arguments remain as promises. In many cases
this is advantageous, as it improves computation efficiency. However, if the value
of the variable used as argument, or in an expression used as argument, changes,
the value of the variable at the time of evaluation will be used. This is rarely a
problem, but being aware of this behaviour is helpful, especially when programmat-
ically defining functions. Very rarely, an argument will not be evaluated when it
should be (e.g. because of bugs in packages, or use of “trickery”). Earlier evaluation
can be forced at any time with function force() .
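A minimal sketch of the classic lazy-evaluation pitfall, using a hypothetical function factory make_power() (the name and setup are our own, not from the examples above):

```r
# 'n' is received as a promise and is not evaluated until first used.
make_power <- function(n) {
  # force(n) # uncommenting this line evaluates 'n' immediately
  function(x) x^n
}

fs <- list()
for (i in 1:3) fs[[i]] <- make_power(i)
# The promises are first evaluated here, after the loop has finished and
# i == 3, so every stored function raises its argument to the power 3.
fs[[1]](2) # 8, not 2
```

With force(n) uncommented, each closure captures the value that i had at creation time, and fs[[1]](2) returns 2.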

3.7 Objects, classes and methods

An in-depth discussion of object oriented programming in R is outside the scope of


this book. Several books describe in detail the different class systems available and
how to take best advantage of them when developing packages extending R. For the
non-programmer user, a basic understanding can be useful, even if he or she does not
intend to create new classes. This basic knowledge is what we intend to convey in
this section. For an in-depth treatment of the subject please consult the recently
published book Advanced R (Wickham 2014a).
We start with a quotation from “S poetry” (Burns 1998, page 13).


The idea of object-oriented programming is simple, but carries a lot of


weight. Here’s the whole thing: if you told a group of people “dress for
work”, then you would expect each to put on clothes appropriate for that
individual’s job. Likewise it is possible for S[R] objects to get dressed ap-
propriately depending on what class of object they are.

R supports the use of the object oriented programming paradigm but, as a system
that has evolved over the years, R currently includes several different approaches. The
most popular approach is still the one called S3; a more recent and more powerful approach,
with slower performance, is called S4. The general idea is that a name like plot()
can act as a generic, and which specific version of plot() is called
depends on the arguments of the call. Using computing terms we could say that the
generic version of plot() dispatches the original call to different specific versions of
plot() based on the class of the arguments passed. S3 generic functions dispatch,
by default, based only on the argument passed to a single parameter, the first one.
S4 generic functions can dispatch the call based on the arguments passed to more
than one parameter, and the structure of the objects of a given class is known to the
interpreter. In the S3 system, the specializations of a generic are recognized/identified
only by their name, and the class of an object is given by a character string stored as an
attribute of the object.
The most basic approach is to create a new class by prepending its name to the exist-
ing class attribute of an object. This would normally take place within a constructor.

a <- 123
class(a)

## [1] "numeric"

class(a) <- c("myclass", class(a))


class(a)

## [1] "myclass" "numeric"

Now we create a print method specific to "myclass" objects.

print.myclass <- function(x) {


sprintf("[myclass] %.4g", x)
}

Once a specialized method exists, it will be used.

print(a)

## [1] "[myclass] 123"

print(as.numeric(a))


## [1] 123

The S3 class system is “lightweight” in that it adds very little additional computation
load, but it is rather fragile in that most of the responsibility for the consistency and
correctness of the design (e.g. not messing up dispatch by redefining functions, or by
loading a package exporting functions with the same name) rests with the programmer,
and is not checked by the R interpreter.
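As a small illustration of this fragility (a sketch of our own, not from the examples above), nothing prevents assigning a misleading class to an object:

```r
# The class attribute is just a character vector; R does not check that
# the structure of the object matches the class we claim for it.
x <- "not really a data frame"
class(x) <- "data.frame"
inherits(x, "data.frame") # TRUE, although x lacks the expected structure
```

Methods specialized for data frames will be dispatched for x, and may fail or misbehave when they assume a structure that x does not have.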
Defining a new S3 generic is also quite simple. A generic method and a default
method need to be created.

my_print <- function (x, ...) {


UseMethod("my_print", x)
}

my_print.default <- function(x, ...) {


print(class(x))
print(x, ...)
}

my_print(123)

## [1] "numeric"
## [1] 123

my_print("abc")

## [1] "character"
## [1] "abc"

Up to now, my_print() has no specializations. We now write one for data frames.

my_print.data.frame <- function(x, rows = 1:5, ...) {


print(x[rows, ], ...)
invisible(x)
}

We add the second statement so that the function invisibly returns the whole data
frame, rather than the lines printed. We now do a quick test of the function.

my_print(cars)

## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16


my_print(cars, 8:10)

## speed dist
## 8 10 26
## 9 10 34
## 10 11 17

my_print(cars, TRUE)

## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
## 7 10 18
## 8 10 26
## 9 10 34
## 10 11 17
## 11 11 28
## 12 12 14
## 13 12 20
## 14 12 24
## 15 12 28
## 16 13 26
## 17 13 34
## 18 13 34
## 19 13 46
## 20 14 26
## 21 14 36
## 22 14 60
## 23 14 80
## 24 15 20
## 25 15 26
## 26 15 54
## 27 16 32
## 28 16 40
## 29 17 32
## 30 17 40
## 31 17 50
## 32 18 42
## 33 18 56
## 34 18 76
## 35 18 84
## 36 19 36
## 37 19 46
## 38 19 68
## 39 20 32
## 40 20 48


## 41 20 52
## 42 20 56
## 43 20 64
## 44 22 66
## 45 23 54
## 46 24 70
## 47 24 92
## 48 24 93
## 49 24 120
## 50 25 85

b <- my_print(cars)

## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16

b

## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
## 7 10 18
## 8 10 26
## 9 10 34
## 10 11 17
## 11 11 28
## 12 12 14
## 13 12 20
## 14 12 24
## 15 12 28
## 16 13 26
## 17 13 34
## 18 13 34
## 19 13 46
## 20 14 26
## 21 14 36
## 22 14 60
## 23 14 80
## 24 15 20
## 25 15 26
## 26 15 54
## 27 16 32


## 28 16 40
## 29 17 32
## 30 17 40
## 31 17 50
## 32 18 42
## 33 18 56
## 34 18 76
## 35 18 84
## 36 19 36
## 37 19 46
## 38 19 68
## 39 20 32
## 40 20 48
## 41 20 52
## 42 20 56
## 43 20 64
## 44 22 66
## 45 23 54
## 46 24 70
## 47 24 92
## 48 24 93
## 49 24 120
## 50 25 85

U 1) What would be the most concise way of defining a specialization for


matrix ? Write one, and test it. 2) How would you modify the code so that also
columns to print can be selected?

3.8 Control of execution flow

We call control of execution statements those that allow sections of code to be
executed only when a certain dynamically computed condition is TRUE . Some of the control
of execution flow statements function like ON-OFF switches for program statements.
Others allow statements to be executed repeatedly while or until a condition is met, or
until all members of a list or a vector have been processed.

3.8.1 Conditional execution

Non-vectorized

R has two types of if statements, non-vectorized and vectorized. We will start with
the non-vectorized one, which is similar to what is available in most other computer


programming languages.
Before this we need to explain compound statements. Individual statements can be
grouped into compound statements by enclosing them in curly braces.

print("A")

## [1] "A"

{
print("B")
print("C")
}

## [1] "B"
## [1] "C"

The example above is pretty useless, but becomes useful when used together with
‘control’ constructs. The if construct controls the execution of one statement, how-
ever, this statement can be a compound statement of almost any length or complexity.
Play with the code below by changing the value assigned to variable printing , includ-
ing NA , and logical(0) .

printing <- TRUE


if (printing) {
print("A")
print("B")
}

## [1] "A"
## [1] "B"

The condition passed as argument to if , enclosed in parentheses, can be anything


yielding a logical vector. However, as this condition is not vectorized, only the first
element will be used. Play with this example by changing the value assigned to a .

a <- 10.0
if (a < 0.0) print("'a' is negative") else print("'a' is not negative")

## [1] "'a' is not negative"

print("This is always printed")

## [1] "This is always printed"

As you can see above the statement immediately following else is executed if the
condition is false. Later statements are executed independently of the condition.
Do you still remember the rules about continuation lines?



# 1
a <- 1
if (a < 0.0)
print("'a' is negative") else
print("'a' is not negative")

## [1] "'a' is not negative"

Why does the statement below (not evaluated here) trigger an error?

# 2 (not evaluated here)


if (a < 0.0) print("'a' is negative")
else print("'a' is not negative")

Play with the use of conditional execution, with both simple and compound state-
ments, and also think how to combine if and else to select among more than two
options.

U Revise the conversion rules between numeric and logical values, run each
of the statements below, and explain the output based on how type conversions
are interpreted, remembering the difference between floating-point numbers as
implemented in computers and real numbers (ℝ) as defined in mathematics:

if (0) print("hello")
if (-1) print("hello")
if (0.01) print("hello")
if (1e-300) print("hello")
if (1e-323) print("hello")
if (1e-324) print("hello")
if (1e-500) print("hello")
if (as.logical("true")) print("hello")
if (as.logical(as.numeric("1"))) print("hello")
if (as.logical("1")) print("hello")
if ("1") print("hello")

R also has a switch() statement, which can be used


to select among “cases”, or several alternative statements, based on an expression
evaluating to a number or a character string. The switch statement returns a value: the
value returned by the code corresponding to the matching switch value, or the default
if there is no match and a default has been included in the code. Both character values
and numeric values can be used.

my.object <- "two"


b <- switch(my.object,
one = 1,
two = 1 / 2,
three = 1/ 4,
0
)
b

## [1] 0.5

Do play with the use of the switch statement.

Vectorized

Vectorized conditional execution is coded by means of a function called ifelse()


(written as a single word). This function takes three arguments: a logical vector
( test ), a result vector for TRUE ( yes ), and a result vector for FALSE ( no ). All three can
be any construct giving the necessary argument as its return value. Vectors
passed as arguments to parameters yes and no are recycled if they
are not of the expected length. No recycling applies to test .

a <- 1:10
ifelse(a > 5, 1, -1)

## [1] -1 -1 -1 -1 -1 1 1 1 1 1

ifelse(a > 5, a + 1, a - 1)

## [1] 0 1 2 3 4 7 8 9 10 11

ifelse(any(a>5), a + 1, a - 1) # tricky

## [1] 2

ifelse(logical(0), a + 1, a - 1) # even more tricky

## logical(0)

ifelse(NA, a + 1, a - 1) # as expected

## [1] NA


 In the case of ifelse() , the length of the returned value is determined


by the length of the logical vector passed as argument to its first formal
parameter (named test )! A frequent mistake is to use a condition that returns
a logical of length one, expecting that it will be recycled because the arguments
passed to the other parameters (named yes and no ) are longer. However, no
recycling will take place, resulting in a returned value of length one, with the
remainder of the vectors being discarded. Do try this by yourself, using logical
vectors of different lengths. You can start with the examples below, making sure
you understand why the returned values are what they are.

ifelse(TRUE, 1:5, -5:-1)

## [1] 1

ifelse(FALSE, 1:5, -5:-1)

## [1] -5

ifelse(c(TRUE, FALSE), 1:5, -5:-1)

## [1] 1 -4

ifelse(c(FALSE, TRUE), 1:5, -5:-1)

## [1] -5 2

ifelse(c(FALSE, TRUE), 1:5, 0)

## [1] 0 2

U Try to understand what is going on in the previous example. Create your own
examples to test how ifelse() works.

U Write, using ifelse() , a single statement to combine numbers from the two
vectors a and b into a result vector d , based on whether the corresponding
value in vector c is the character "a" or "b" . Then print vector d to make the
result visible.


a <- -10:-1
b <- +1:10
c <- c(rep("a", 5), rep("b", 5))
# your code

If you do not understand how the three vectors are built, or you cannot guess
the values they contain by reading the code, print them, and play with the argu-
ments, until it is clear what each parameter does.

3.8.2 Why using vectorized functions and operators is important

If you have written programs in other languages, it may feel natural to you to use
loops (for, repeat while, repeat until) for many of the things for which we have been us-
ing vectorization. When using the R language it is best to use vectorization whenever
possible, because it keeps the listing of scripts and programs shorter and easier to
understand (at least for those with experience in R). However, there is another very
important reason: execution speed. The reason behind this is that R is an interpreted
language. In current versions of R it is possible to byte-compile functions, but this
is rarely done for scripts, and even byte-compiled loops are usually much slower to
execute than vectorized functions.
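The size of the difference can be measured with system.time() . The sketch below squares one million numbers both ways; the exact timings depend on the machine, but the vectorized form is typically orders of magnitude faster:

```r
x <- rnorm(1e6)

# Squaring element by element with an explicit loop.
loop.time <- system.time({
  y1 <- numeric(length(x))
  for (i in seq(along.with = x)) {
    y1[i] <- x[i]^2
  }
})

# The same computation, vectorized.
vec.time <- system.time({
  y2 <- x^2
})

all.equal(y1, y2) # TRUE: both approaches give the same result
loop.time
vec.time
```

system.time() returns user, system and elapsed times in seconds, so the two results can be compared directly.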
However, there are cases where we need to repeatedly execute statements in a way
that cannot be vectorized, or where we do not need to maximize execution speed. The
R language does have loop constructs, and we will describe them next.

3.8.3 Repetition

The most frequently used type of loop is the for loop. In R, these loops iterate
over the members of a list or vector of values to act upon.

b <- 0
for (a in 1:5) b <- b + a
b

## [1] 15

b <- sum(1:5) # built-in function


b

## [1] 15

Here the statement b <- b + a is executed five times, with a sequentially taking
each of the values in 1:5 . Instead of the simple statement used here, a compound
statement could also have been used.


Here are a few examples that show some of the properties of for loops,
combined with the use of a function.

test.for <- function(x) {


for (i in x) {print(i)}
}
test.for(numeric(0))
test.for(1:3)

## [1] 1
## [1] 2
## [1] 3

test.for(NA)

## [1] NA

test.for(c("A", "B"))

## [1] "A"
## [1] "B"

test.for(c("A", NA))

## [1] "A"
## [1] NA

test.for(list("A", 1))

## [1] "A"
## [1] 1

test.for(c("z", letters[1:4]))

## [1] "z"
## [1] "a"
## [1] "b"
## [1] "c"
## [1] "d"

In contrast to other languages, in R function arguments are not checked for ‘type’
when the function is called. The only requirement is that the function code can handle
the argument provided. In this example you can see that the same function works
with numeric and character vectors, and with lists. We haven’t seen lists before. As
earlier discussed all elements in a vector should have the same type. This is not the
case for lists. It is also interesting to note that a list or vector of length zero is a valid
argument, that triggers no error, but that as one would expect, causes the statements
in the loop body to be skipped.


Some examples of use of for loops — and of how to avoid their use.
a <- c(1, 4, 3, 6, 8)
for(x in a) x*2 # result is lost
for(x in a) print(x*2) # print is needed!

## [1] 2
## [1] 8
## [1] 6
## [1] 12
## [1] 16

b <- for(x in a) x*2 # does not work as expected, but triggers no error
b

## NULL

for(x in a) b <- x*2 # a bit of a surprise, as b is not a vector!

b <- numeric()
for(i in seq(along.with = a)) {
b[i] <- a[i]^2
print(b)
}

## [1] 1
## [1] 1 16
## [1] 1 16 9
## [1] 1 16 9 36
## [1] 1 16 9 36 64

b # is a vector!

## [1] 1 16 9 36 64

# a bit faster if we first allocate a vector of the required length


b <- numeric(length(a))
for(i in seq(along.with = a)) {
b[i] <- a[i]^2
print(b)
}

## [1] 1 0 0 0 0
## [1] 1 16 0 0 0
## [1] 1 16 9 0 0
## [1] 1 16 9 36 0
## [1] 1 16 9 36 64

b # is a vector!

## [1] 1 16 9 36 64

# vectorization is simplest and fastest


b <- a^2
b

## [1] 1 16 9 36 64


seq(along.with = a) builds a new numeric vector with a sequence of the same


length as the vector a , passed as argument to parameter along.with .

U Look at the results from the above examples, and try to understand where
the returned value comes from in each case. In the code chunk above,
print() is used within the loop to make intermediate values visible. You can
add additional print() statements to visualize other variables such as i , or run
parts of the code, such as seq(along.with = a) , by themselves.
In this case the code examples are valid, but the same approach can be used
for debugging syntactically correct code that does not return the expected results,
either for every input value, or with a specific value as input.

 In the examples above we show the use of seq() passing a vector as argu-
ment to its parameter along.with . This approach is much better than using the
not exactly equivalent call to seq() based on the length of the vector, or its short
version using operator : .

a <- c(1, 4, 3, 6, 8)
# a <- numeric(0)

b <- numeric(length(a))
for(i in seq(along.with = a)) {
b[i] <- a[i]^2
}
print(b)

## [1] 1 16 9 36 64

c <- numeric(length(a))
for(i in 1:length(a)) {
c[i] <- a[i]^2
}
print(c)

## [1] 1 16 9 36 64

With a of length 1 or longer, the statements are equivalent, but when a has
length zero the two statements are no longer equivalent. Run the statements
above, after un-commenting the second definition of a and try to understand
why they behave as they do.
Advanced note: R vectors are indexed starting with 1 while languages like C


and C++ use indexes starting from 0 . In addition, these languages also differ
from R in how they handle vectors of length zero.
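As a hint towards the exercise above, the reason the two forms diverge for zero-length vectors can be seen directly at the console (a minimal sketch):

```r
a <- numeric(0)
seq(along.with = a) # integer(0): a for loop over it runs zero times
1:length(a)         # 1 0: a for loop over it runs twice, with invalid indexes
```

With 1:length(a) and an empty a , the loop body is entered with i equal to 1 and then 0 , and indexing with those values does not do what the code intends.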

We sometimes may not be able to use vectorization, or it may be easiest not to use it.
However, whenever working with large data sets, or many similar data sets, we will
need to take performance into account. As vectorization usually also makes code
simpler, it is good style to use it whenever possible.

b <- numeric(length(a)-1)
for(i in seq(along.with = b)) {
b[i] <- a[i+1] - a[i]
print(b)
}

## [1] 3 0 0 0
## [1] 3 -1 0 0
## [1] 3 -1 3 0
## [1] 3 -1 3 2

# although in this case there were alternatives, there


# are other cases when we need to use indexes explicitly
b <- a[2:length(a)] - a[1:length(a)-1]
b

## [1] 3 -1 3 2

# or even better
b <- diff(a)
b

## [1] 3 -1 3 2

while loops are also quite frequently useful. Instead of a list or vector, they take
a logical argument, which is usually an expression, but which can also be a variable.
For example, the previous calculation could also be done as follows.

a <- c(1, 4, 3, 6, 8)
b <- numeric(length(a))
i <- 1
while (i <= length(a)) {
b[i] <- a[i]^2
print(b)
i <- i + 1
}

## [1] 1 0 0 0 0
## [1] 1 16 0 0 0
## [1] 1 16 9 0 0
## [1] 1 16 9 36 0
## [1] 1 16 9 36 64

Here is another example. In this case we use the result of the previous iteration in
the current one. In this example you can also see that it is allowed to put more than
one statement on a single line, in which case the statements should be separated by a
semicolon (;).

a <- 2
while (a < 50) {print(a); a <- a^2}

## [1] 2
## [1] 4
## [1] 16

print(a)

## [1] 256

U Make sure that you understand why the final value of a is larger than 50.

U The statements above can be simplified to:


a <- 2
while (a < 50) {print(a <- a^2)}

## [1] 4
## [1] 16
## [1] 256

print(a)

## [1] 256

Explain why this works, and how it relates to the support in R of chained as-
signments to several variables within a single statement like the one below.


a <- b <- c <- 1:5


a

## [1] 1 2 3 4 5

b

## [1] 1 2 3 4 5

c

## [1] 1 2 3 4 5

repeat is seldom used, but adds flexibility, as break can be located anywhere within the
compound statement.

a <- 2
repeat{
print(a)
a <- a^2
if (a > 50) {print(a); break}
}

## [1] 2
## [1] 4
## [1] 16
## [1] 256

# or more elegantly
a <- 2
repeat{
print(a)
if (a > 50) break
a <- a^2
}

## [1] 2
## [1] 4
## [1] 16
## [1] 256

U Please, explain why the examples above return the values they do. Use the
approach of adding print() statements, as described on page 82.


3.8.4 Nesting of loops

All the execution-flow control statements seen above can be nested. We will show an
example with two for loops. We first need a matrix of data to work with:

A <- matrix(1:50, 10)


A

## [,1] [,2] [,3] [,4] [,5]


## [1,] 1 11 21 31 41
## [2,] 2 12 22 32 42
## [3,] 3 13 23 33 43
## [4,] 4 14 24 34 44
## [5,] 5 15 25 35 45
## [6,] 6 16 26 36 46
## [7,] 7 17 27 37 47
## [8,] 8 18 28 38 48
## [9,] 9 19 29 39 49
## [10,] 10 20 30 40 50

A <- matrix(1:50, 10, 5)


A

## [,1] [,2] [,3] [,4] [,5]


## [1,] 1 11 21 31 41
## [2,] 2 12 22 32 42
## [3,] 3 13 23 33 43
## [4,] 4 14 24 34 44
## [5,] 5 15 25 35 45
## [6,] 6 16 26 36 46
## [7,] 7 17 27 37 47
## [8,] 8 18 28 38 48
## [9,] 9 19 29 39 49
## [10,] 10 20 30 40 50

# argument names used for clarity


A <- matrix(1:50, nrow = 10)
A

## [,1] [,2] [,3] [,4] [,5]


## [1,] 1 11 21 31 41
## [2,] 2 12 22 32 42
## [3,] 3 13 23 33 43
## [4,] 4 14 24 34 44
## [5,] 5 15 25 35 45
## [6,] 6 16 26 36 46
## [7,] 7 17 27 37 47
## [8,] 8 18 28 38 48
## [9,] 9 19 29 39 49
## [10,] 10 20 30 40 50


A <- matrix(1:50, ncol = 5)


A

## [,1] [,2] [,3] [,4] [,5]


## [1,] 1 11 21 31 41
## [2,] 2 12 22 32 42
## [3,] 3 13 23 33 43
## [4,] 4 14 24 34 44
## [5,] 5 15 25 35 45
## [6,] 6 16 26 36 46
## [7,] 7 17 27 37 47
## [8,] 8 18 28 38 48
## [9,] 9 19 29 39 49
## [10,] 10 20 30 40 50

A <- matrix(1:50, nrow = 10, ncol = 5)


A

## [,1] [,2] [,3] [,4] [,5]


## [1,] 1 11 21 31 41
## [2,] 2 12 22 32 42
## [3,] 3 13 23 33 43
## [4,] 4 14 24 34 44
## [5,] 5 15 25 35 45
## [6,] 6 16 26 36 46
## [7,] 7 17 27 37 47
## [8,] 8 18 28 38 48
## [9,] 9 19 29 39 49
## [10,] 10 20 30 40 50

All the statements above are equivalent, but some are easier to read than others.

row.sum <- numeric() # slower as size needs to be expanded


for (i in 1:nrow(A)) {
row.sum[i] <- 0
for (j in 1:ncol(A))
row.sum[i] <- row.sum[i] + A[i, j]
}
print(row.sum)

## [1] 105 110 115 120 125 130 135 140 145 150

row.sum <- numeric(nrow(A)) # faster


for (i in 1:nrow(A)) {
row.sum[i] <- 0
for (j in 1:ncol(A))
row.sum[i] <- row.sum[i] + A[i, j]
}
print(row.sum)


## [1] 105 110 115 120 125 130 135 140 145 150

Look at the output of these two examples to understand what happens differently
with row.sum .
The code above is very general: it will work with a two-dimensional matrix of any size,
which is good programming practice. However, sometimes we need more specific
calculations. A[1, 2] selects one cell in the matrix, the one on the first row of the
second column. A[1, ] selects row one, and A[ , 2] selects column two. In the
example above the value of i changes for each iteration of the outer loop. The value
of j changes for each iteration of the inner loop, and the inner loop is run in full for
each iteration of the outer loop. The inner loop index j changes fastest.
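The indexing rules just described can be checked directly (a small sketch, re-creating the matrix A used above):

```r
A <- matrix(1:50, ncol = 5)

A[1, 2] # a single cell: first row, second column
## 11

A[1, ]  # the whole first row
## 1 11 21 31 41

A[ , 2] # the whole second column
## 11 12 13 14 15 16 17 18 19 20
```

Note that dropping an index, as in A[1, ] , keeps all values along that dimension.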

U 1) modify the example above to add up only the first three columns of A , 2)
modify the example above to add the last three columns of A .
Will the code you wrote continue working as expected if the number of rows in
A changed? and what if the number of columns in A changed, and the required
results still needed to be calculated for relative positions? What would happen
if A had fewer than three columns? Try to think first what to expect based on
the code you wrote. Then create matrices of different sizes and test your code.
After that think how to improve the code, at least so that wrong results are not
produced.

Vectorization can in this case easily be achieved for the inner loop, as R includes
the function sum() , which returns the sum of a vector passed as its argument. Replacing
the inner loop, which is the most frequently executed one, with an efficient vectorized
function can be expected to improve performance significantly.

row.sum <- numeric(nrow(A)) # faster


for (i in 1:nrow(A)) {
row.sum[i] <- sum(A[i, ])
}
print(row.sum)

## [1] 105 110 115 120 125 130 135 140 145 150

A[i, ] selects row i and all columns. In R, the row index always comes first,
which is not the case in all programming languages.
Both explicit loops can be eliminated if we use an apply function, such as apply() , lapply()
or sapply() , in place of the outer for loop. See section 5.5 on page 143 for details
on the use of R’s apply functions.


row.sum <- apply(A, MARGIN = 1, sum) # MARGIN = 1 indicates rows


print(row.sum)

## [1] 105 110 115 120 125 130 135 140 145 150

U How would you change this last example, so that only the last three columns
are added up? (Think about use of subscripts to select a part of the matrix.)

There are many variants of apply functions, both in base R and exported by con-
tributed packages. See section 5.5 for details on the use of several of the latter ones.
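For comparison, here is the same row-sum computation written with sapply() and with the dedicated base R function rowSums() (a sketch expanding on the example above):

```r
A <- matrix(1:50, ncol = 5)

# sapply() applies the anonymous function to each row index in turn.
row.sum.s <- sapply(seq_len(nrow(A)), function(i) sum(A[i, ]))

# Base R also provides a dedicated, very fast function for this common case.
row.sum.r <- rowSums(A)

all(row.sum.s == row.sum.r) # TRUE
```

When a specialized function like rowSums() or colSums() exists, it is usually both clearer and faster than either a loop or a generic apply call.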

3.9 Packages

In R speak, a ‘library’ is the location where ‘packages’ are installed. Packages are sets
of functions, and data, specific to some particular purpose, that can be loaded into
an R session to make them available, so that they can be used in the same way as
built-in R functions and data. The function library() is used to load packages,
already installed in the local R library, into the current session, while the function
install.packages() is used to install packages, either from a file, or directly from
the internet, into the library. When using RStudio it is easiest to use RStudio
commands (which call install.packages() and update.packages() ) to install and
update packages.

library(graphics)
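To see where your library (or libraries) are located, and which packages are currently installed in them, base R provides query functions (a small sketch, safe to run as it installs nothing):

```r
# Locations of the libraries currently in use.
.libPaths()

# Names of a few of the packages installed in those libraries.
head(rownames(installed.packages()))
```

The exact paths and package names shown depend on your installation, but packages such as ‘base’ and ‘graphics’ are always present.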

Currently there are thousands of packages available. The most reliable source of
packages is CRAN, as only packages that pass strict tests and are actively maintained
are included. In some cases you may need or want to install less stable code, and
this is also possible. With package ‘devtools’ it is even possible to install packages
directly from GitHub, Bitbucket and a few other repositories. These latter installations
are always installations from source (see below).
R packages can be installed either from source, or from already built ‘binaries’. In-
stalling from sources, depending on the package, may require quite a lot of additional
software to be available. Under MS-Windows, the needed shell, commands
and compilers are very rarely already available. Installing them is not too difficult (you will need
RTools, and MiKTEX). However, for this reason it is the norm to install packages from
binary .zip files under MS-Windows. Under Linux most tools will be available, or very
easy to install, so it is usual to install packages from sources. For OS X (Mac) the
situation is somewhere in-between. If the tools are available, packages can be very easily
installed from sources from within RStudio. However, binaries are also readily available
for most packages.
The development of packages is beyond the scope of the current book, and is very
well explained in the book R Packages (Wickham 2015). However, it is still worth-
while mentioning a few things about the development of R packages. Using RStudio
it is relatively easy to develop your own packages. Packages can be of very different
sizes. Packages use a relatively rigid structure of folders for storing the different
types of files, and there is a built-in help system that one needs to use, so that the
package documentation gets linked to the R help system when the package is loaded.
In addition to R code, packages can call C, C++, FORTRAN, Java, etc. functions and
routines, but some kind of ‘glue’ is needed, as function call conventions and name
mangling depend on the programming language, and in many cases also on the
compiler used. At least for C++, the recently developed ‘Rcpp’ R package makes the
“gluing” extremely easy. See Chapter 9 starting on page 463 for more information on
performance-related and other limitations of R and how to solve possible bottlenecks.
One good way of learning how R works is by experimenting with it and, whenever
using a certain function, looking at its help to check all the available options.
How much documentation is included with packages varies a lot, but many packages
include comprehensive user guides or examples as vignettes, in addition to the help
pages for individual functions or data sets.

4 R built-in functions

The desire to economize time and mental effort in arithmetical


computations, and to eliminate human liability to error, is
probably as old as the science of arithmetic itself.

— Howard Aiken, Proposed automatic calculating machine,


presented to IBM in 1937

4.1 Aims of this chapter

The aim of this chapter is to introduce some of the frequently used functions available
in base R, i.e. without any non-standard packages loaded. This is by necessity a very
incomplete introduction to the capabilities of base R. This chapter is designed to
give the reader only an introduction to base R, as there are several good texts on the
subject (e.g. Matloff 2011). Furthermore, many of base R’s functions are specific to
different statistical procedures, maths and calculus, which transcend the description
of R as a programming language.

4.2 Loading data

To start with, we need some data to run the examples. Here we use cars , a data set included in base R. How to read or import “foreign” data is discussed in R’s documentation in R Data Import/Export, and in this book, in Chapter 5 starting on page 105. In general, data() is used to load R objects saved in a file format used by R. Text files can be read with functions scan() , read.table() , read.csv() and their variants. It is also possible to ‘import’ data saved in files of foreign formats, defined by other programs. Packages such as ’foreign’, ’readr’, ’readxl’, ’RNetCDF’, ’jsonlite’, etc. allow importing data from other statistical and data analysis applications and from standard data exchange formats. It is also good to keep in mind that in R, URLs are accepted as arguments to the file argument (see Chapter 5 starting on page 105 for details and examples on how to import data from different “foreign” formats and sources).
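As a brief sketch of this last point (the URL below is a made-up example, not a real data source), a CSV file can be read directly from a web address:

```r
# read.csv() accepts a URL string in place of a local file name;
# this URL is hypothetical, used here only for illustration.
remote.df <- read.csv("http://example.com/data/experiment-1.csv")
```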
In the examples of the present chapter we use data included in R, as R objects, which can be loaded with function data() . cars is a data frame.

data(cars)


4.3 Looking at data

There are several functions in R that let us obtain different ‘views’ into objects. Function print() is useful for small data sets or objects. Especially in the case of large data frames, we need to explore them step by step. In the case of named components, we can obtain their names with names() . If a data frame contains many rows of observations, head() and tail() allow us to easily restrict the number of rows printed. Functions nrow() and ncol() return the number of rows and columns in the data frame (but are not applicable to lists). As mentioned earlier, str() produces abbreviated output, but in a way that preserves the structure of the object.
class(cars)

## [1] "data.frame"

nrow(cars)

## [1] 50

ncol(cars)

## [1] 2

names(cars)

## [1] "speed" "dist"

head(cars)

## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10

tail(cars)

## speed dist
## 45 23 54
## 46 24 70
## 47 24 92
## 48 24 93
## 49 24 120
## 50 25 85

str(cars)

## 'data.frame': 50 obs. of 2 variables:


## $ speed: num 4 4 7 7 8 9 10 10 10 11 ...
## $ dist : num 2 10 4 22 16 10 18 26 34 17 ...


U Look up the help pages for head() and tail() , and edit the code above to
print only the first line, or only the last line of cars , respectively. As a second
exercise print the 25 topmost rows of cars .

Data frames consist of columns of equal length (see Chapter 2, section 2.12 on page 52 for details). The different columns of a data frame can contain data of different modes (e.g. numeric, factor and/or character).
To explore the mode of the columns of cars , we can use an apply function. In the present case, we want to apply function mode() to each column of the data frame cars .

sapply(cars, mode)

## speed dist
## "numeric" "numeric"

The statement above returns a vector of character strings, with the mode of each column. Each element of the vector is named according to the name of the corresponding “column” in the data frame. For this same statement to be used with any other data frame or list, we only need to substitute the name of the object, the second argument, with the name of the object of current interest.

U Data set airquality contains data from air quality measurements in New York, and, being included in the R distribution, can be loaded with data(airquality) . Load it, and repeat the steps above, to learn what variables are included, their modes, the number of rows, etc.

There is in R a function called summary() , which can be used to obtain a suitable summary from objects of most classes. We can also use sapply() or lapply() to apply any suitable function to individual columns. See section 5.5 on page 143 for details about R’s apply functions.

summary(cars)

## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00


sapply(cars, range)

## speed dist
## [1,] 4 2
## [2,] 25 120

U Obtain the summary of airquality with function summary() but, in addition, write code with an apply function to count the number of non-missing values in each column.

4.4 Plotting

Base R’s generic function plot() can be used to plot different kinds of data, as it has suitable methods for different classes of objects (see section 3.7 on page 69 for a brief introduction to objects, classes and methods). In this section we only very briefly demonstrate the use of the most common base R graphics functions. They are well described in the book R Graphics, Second Edition (Chapman & Hall/CRC The R Series; Murrell 2011). Neither will we describe the Trellis and Lattice approach to plotting (Sarkar 2008). We describe in detail the use of the grammar of graphics and plotting with package ‘ggplot2’ in Chapter 6 from page 179 onwards.

plot(dist ~ speed, data = cars)


[Figure: scatterplot of stopping distance ( dist ) against speed ( speed ) for the cars data set.]
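The appearance of the plot can be adjusted through additional arguments; a minimal sketch (the axis-label texts follow the units documented for the cars data set):

```r
# Same scatterplot with informative axis labels, plus the fitted
# regression line added on top with abline().
plot(dist ~ speed, data = cars,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)")
abline(lm(dist ~ speed, data = cars), lty = "dashed")
```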


4.5 Fitting linear models

One important thing to remember is that model ‘formulas’ are used in different contexts: plotting, fitting of models, and tests like the 𝑡-test. The basic syntax is rather consistently followed, although there are some exceptions.

4.5.1 Regression

The R function lm() is used next to fit linear models. If the explanatory variable is continuous, the fit is a regression. In the example below, speed is a numeric variable (floating point in this case). In the ANOVA table calculated for the model fit, in this case a linear regression, we can see that the term for speed has only one (numerator) degree of freedom (df).
We first fit the model and save the result as fm1 (a name chosen to remind myself that this is the first fitted model in this chapter).

fm1 <- lm(dist ~ speed, data=cars)

The next step is diagnosis of the fit: are the assumptions of the linear model procedure used reasonably well fulfilled? In R it is most common to use plots to this end. We show here only one of the four plots normally produced. This quantile vs. quantile plot allows us to assess how much the residuals deviate from being normally distributed.

plot(fm1, which = 2)

[Figure: normal quantile–quantile plot of standardized residuals for fm1 , lm(dist ~ speed) ; observations 23, 35 and 49 are labelled as extreme.]

In the case of a regression, calling summary() with the fitted model object as argument is most useful, as it provides a table of coefficient estimates and their errors. anova() applied to the same fitted object returns the ANOVA table.


summary(fm1) # we inspect the results from the fit

##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123
## speed 3.9324 0.4155 9.464 1.49e-12
##
## (Intercept) *
## speed ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511,Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12

anova(fm1) # we calculate an ANOVA

## Analysis of Variance Table


##
## Response: dist
## Df Sum Sq Mean Sq F value Pr(>F)
## speed 1 21186 21185.5 89.567 1.49e-12 ***
## Residuals 48 11354 236.5
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Let’s look at each argument separately: dist ~ speed is the specification of the model to be fitted. The intercept is always implicitly included. To ‘remove’ this implicit intercept from the earlier model we can use dist ~ speed - 1 . In what follows we fit a straight line through the origin (𝑥 = 0, 𝑦 = 0).

fm2 <- lm(dist ~ speed - 1, data=cars)


plot(fm2, which = 2)
summary(fm2)

##
## Call:
## lm(formula = dist ~ speed - 1, data = cars)


##
## Residuals:
## Min 1Q Median 3Q Max
## -26.183 -12.637 -5.455 4.590 50.181
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## speed 2.9091 0.1414 20.58 <2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.26 on 49 degrees of freedom
## Multiple R-squared: 0.8963,Adjusted R-squared: 0.8942
## F-statistic: 423.5 on 1 and 49 DF, p-value: < 2.2e-16

anova(fm2)

## Analysis of Variance Table


##
## Response: dist
## Df Sum Sq Mean Sq F value Pr(>F)
## speed 1 111949 111949 423.47 < 2.2e-16 ***
## Residuals 49 12954 264
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

[Figure: normal quantile–quantile plot of standardized residuals for fm2 , lm(dist ~ speed - 1) .]

We now fit a second-degree polynomial.

fm3 <- lm(dist ~ speed + I(speed^2), data = cars) # we fit a model, and then save the result
plot(fm3, which = 3) # we produce diagnosis plots
summary(fm3) # we inspect the results from the fit

##


## Call:
## lm(formula = dist ~ speed + I(speed^2), data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.720 -9.184 -3.188 4.628 45.152
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.47014 14.81716 0.167 0.868
## speed 0.91329 2.03422 0.449 0.656
## I(speed^2) 0.09996 0.06597 1.515 0.136
##
## Residual standard error: 15.18 on 47 degrees of freedom
## Multiple R-squared: 0.6673,Adjusted R-squared: 0.6532
## F-statistic: 47.14 on 2 and 47 DF, p-value: 5.852e-12

anova(fm3) # we calculate an ANOVA

## Analysis of Variance Table


##
## Response: dist
## Df Sum Sq Mean Sq F value Pr(>F)
## speed 1 21185.5 21185.5 91.986 1.211e-12
## I(speed^2) 1 528.8 528.8 2.296 0.1364
## Residuals 47 10824.7 230.3
##
## speed ***
## I(speed^2)
## Residuals
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

[Figure: scale–location diagnostic plot for fm3 , lm(dist ~ speed + I(speed^2)) .]

The “same” model can be fitted using an orthogonal polynomial. Higher degrees can be obtained by supplying a different positive integer value as the second argument to poly() .

fm3a <- lm(dist ~ poly(speed, 2), data=cars) # we fit a model, and then save the result
plot(fm3a, which = 3) # we produce diagnosis plots
summary(fm3a) # we inspect the results from the fit

##
## Call:
## lm(formula = dist ~ poly(speed, 2), data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.720 -9.184 -3.188 4.628 45.152
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 42.980 2.146 20.026
## poly(speed, 2)1 145.552 15.176 9.591
## poly(speed, 2)2 22.996 15.176 1.515
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## poly(speed, 2)1 1.21e-12 ***
## poly(speed, 2)2 0.136
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.18 on 47 degrees of freedom
## Multiple R-squared: 0.6673,Adjusted R-squared: 0.6532
## F-statistic: 47.14 on 2 and 47 DF, p-value: 5.852e-12

anova(fm3a) # we calculate an ANOVA

## Analysis of Variance Table


##
## Response: dist
## Df Sum Sq Mean Sq F value
## poly(speed, 2) 2 21714 10857.1 47.141
## Residuals 47 10825 230.3
## Pr(>F)
## poly(speed, 2) 5.852e-12 ***
## Residuals
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


[Figure: scale–location diagnostic plot for fm3a , lm(dist ~ poly(speed, 2)) .]
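For instance, a third-degree orthogonal polynomial could be fitted as follows (a sketch continuing the naming pattern used above; fm3b is a name invented here):

```r
# Fit a cubic orthogonal polynomial of speed to the cars data.
fm3b <- lm(dist ~ poly(speed, 3), data = cars)
summary(fm3b)
```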

We can also compare two models, to test whether one of the models describes the data better than the other.

anova(fm2, fm1)

## Analysis of Variance Table


##
## Model 1: dist ~ speed - 1
## Model 2: dist ~ speed
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 49 12954
## 2 48 11354 1 1600.3 6.7655 0.01232 *
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can compare three or more models in the same way, but be careful, as the order of the arguments matters.

anova(fm2, fm1, fm3, fm3a)

## Analysis of Variance Table


##
## Model 1: dist ~ speed - 1
## Model 2: dist ~ speed
## Model 3: dist ~ speed + I(speed^2)
## Model 4: dist ~ poly(speed, 2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 49 12954
## 2 48 11354 1 1600.26 6.9482 0.01133 *
## 3 47 10825 1 528.81 2.2960 0.13640
## 4 47 10825 0 0.00
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


We can use different criteria to choose the ‘best’ model: significance based on 𝑃 -values, or information criteria (AIC, BIC). AIC and BIC penalize the ‘goodness’ of a fit based on the number of parameters in the fitted model. For both AIC and BIC a smaller value is better, and the values returned can be either positive or negative; in the latter case, more negative is better.

BIC(fm2, fm1, fm3, fm3a)

## df BIC
## fm2 2 427.5739
## fm1 3 424.8929
## fm3 4 426.4202
## fm3a 4 426.4202

AIC(fm2, fm1, fm3, fm3a)

## df AIC
## fm2 2 423.7498
## fm1 3 419.1569
## fm3 4 418.7721
## fm3a 4 418.7721

One can see above that these three criteria do not necessarily agree on the model to be chosen: the nested-model ANOVA and BIC favour fm1 , while AIC favours fm3 (tied with fm3a ).

4.5.2 Analysis of variance, ANOVA

We use the InsectSprays data set, giving insect counts in plots sprayed with different insecticides. In these data, spray is a factor with six levels.

fm4 <- lm(count ~ spray, data = InsectSprays)

plot(fm4, which = 2)


[Figure: normal quantile–quantile plot of standardized residuals for fm4 , lm(count ~ spray) .]

anova(fm4)

## Analysis of Variance Table


##
## Response: count
## Df Sum Sq Mean Sq F value Pr(>F)
## spray 5 2668.8 533.77 34.702 < 2.2e-16 ***
## Residuals 66 1015.2 15.38
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

4.5.3 Analysis of covariance, ANCOVA

When a linear model includes both explanatory factors and continuous explanatory
variables, we say that analysis of covariance (ANCOVA) is used. The formula syntax
is the same for all linear models, what determines the type of analysis is the nature of
the explanatory variable(s). Conceptually a factor (an unordered categorical variable)
is very different from a continuous variable.
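As a minimal sketch (this example is not one of the book’s worked analyses), the ToothGrowth data set included in R combines a factor ( supp ) with a numeric variable ( dose ), so fitting a model with both terms is an ANCOVA:

```r
# ANCOVA sketch: the formula syntax is unchanged compared to regression
# and ANOVA; 'supp' is a factor and the numeric 'dose' is the covariate.
fm5 <- lm(len ~ supp + dose, data = ToothGrowth)
anova(fm5)
```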

4.6 Generalized linear models

Linear models make the assumption of normally distributed residuals. Generalized linear models, fitted with function glm() , are more flexible: they allow the assumed distribution to be selected, as well as the link function. For the analysis of the InsectSprays data set above (section 4.5.2 on page 101), the Normal distribution is not a good approximation, as count data deviate from it. This was visible in the quantile–quantile plot above.


For count data, GLMs provide a better alternative. In the example below we fit the same model as above, but assume a quasi-Poisson distribution instead of the Normal.

fm10 <- glm(count ~ spray, data = InsectSprays, family = quasipoisson)


plot(fm10, which = 2)
anova(fm10, test = "F")

## Analysis of Deviance Table


##
## Model: quasipoisson, link: log
##
## Response: count
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev F
## NULL 71 409.04
## spray 5 310.71 66 98.33 41.216
## Pr(>F)
## NULL
## spray < 2.2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

[Figure: normal quantile–quantile plot of standardized deviance residuals for fm10 , glm(count ~ spray) .]

5 Storing and manipulating data with R

Essentially everything in S[R], for instance, a call to a function, is an S[R] object. One viewpoint is that S[R] has self-knowledge. This self-awareness makes a lot of things possible in S[R] that are not in other languages.

— Patrick J. Burns (1998) S Poetry.
http://www.burns-stat.com/documents/books/s-poetry/

5.1 Aims of this chapter

Base R includes many functions for importing and/or manipulating data. This set is complete, in that it supports all the usually needed operations. However, many of these functions have not been designed to perform optimally on very large data sets (see Matloff 2011). The usual paradigm consists in indexing more complex objects, such as arrays and data frames, to apply math operations on vectors. Quite some effort has been put into improving the implementation of these operations on several fronts: 1) designing an enhanced user interface that is simpler to use and also easier to optimize for performance, 2) adding to the existing paradigm of always copying arguments passed to functions an additional semantics based on the use of references to variables, and 3) allowing data to be read into memory selectively from files.
The aim of this chapter is to describe, and show how, some of the existing enhancements available through CRAN can be useful both with small and large data sets.

5.2 Packages used in this chapter

For executing the examples listed in this chapter you need first to load the following
packages from the library:

library(tibble)
library(magrittr)
library(stringr)
library(dplyr)
library(tidyr)
library(readr)
library(readxl)


library(xlsx)
library(foreign)
library(haven)
library(xml2)
library(RNetCDF)
library(ncdf4)
library(lubridate)
library(jsonlite)

= The data sets used in this chapter are at the moment available for download. The details of how to download files from within R are explained in section 5.4.7 on page 139. The examples for local data use the same files. As it is easier to first exemplify reading local files, please run the code in the chunk below at least once before attempting to run the code in the next sections. Make sure that the current folder/directory is the same one that will be current when running the examples.
The code chunk below will create a folder called data unless it already exists
and download all files except one from my web server. Existing files with the
same names will not be overwritten.


dir.name = "./data"
if (!dir.exists(dir.name)) {
dir.create(dir.name)
}
# download file in text mode
file.name <- paste(dir.name, "logger_1.txt", sep ="/")
if (!file.exists(file.name)) {
download.file("http://r4photobiology.info/learnr/logger_1.txt",
file.name)
}
# download remaining files in binary mode
bin.file.names <- c("my-data.xlsx", "Book1.xlsx", "BIRCH1.SYS",
"thiamin.sav", "my-data.sav", "meteo-data.nc")
for (file.name in bin.file.names) {
f <- paste(dir.name, file.name, sep ="/")
if (!file.exists(f)) {
download.file(paste("http://r4photobiology.info/learnr",
file.name, sep="/"),
f,
mode = "wb")
}
}
# download NetCDF file from NOAA server
file.name <- paste(dir.name, "pevpr.sfc.mon.ltm.nc", sep ="/")
if (!file.exists(file.name)) {
my.url <- paste("ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.derived/",
"surface_gauss/pevpr.sfc.mon.ltm.nc",
sep = "")
download.file(my.url,
mode = "wb",
destfile = paste(dir.name, "pevpr.sfc.mon.ltm.nc", sep ="/"))
}

5.3 Introduction

By reading previous chapters, you have already become familiar with base R’s classes, methods, functions and operators for storing and manipulating data. Several recently developed packages provide somewhat different, and in my view easier, ways of working with data in R without compromising performance to a level that would matter outside the realm of ‘big data’. Some other recent packages emphasize computation speed, at some cost with respect to simplicity of use, and in particular intuitiveness. Of course, as with any user interface, much depends on one’s own preferences and attitudes to data analysis. However, a package designed for maximum efficiency like

‘data.table’ requires the user to have a good understanding of computers in order to understand the compromises made and its unusual behaviour compared to the rest of R. I will base this chapter on what I mostly use myself for everyday data analysis and scripting, and exclude the complexities of R programming and package development.
The chapter is divided into three sections. The first deals with reading data from files produced by other programs or instruments, or typed by users outside of R, with querying databases, and, very briefly, with reading data from the internet. The second section deals with transformations of the data that do not combine different observations, although they may combine different variables from a single observation event, or select certain variables or observations from a larger set. The third section deals with operations that produce summaries or involve other operations on groups of observations.

5.4 Data input and output

In recent years several packages have made it easier and faster to import data into R. This, together with wider and faster internet access to data sources, has made it possible to efficiently work with relatively large data sets. The way R is implemented, keeping all data in memory (RAM), imposes limits on the size of the data sets that can be analysed with base R. One option is to use a 64-bit version of R on a computer running a 64-bit operating system. This allows the use of large amounts of RAM, if available. For larger data sets, one can use different packages that allow selective reading of data from files, and the use of queries to obtain subsets of data from databases. We will start with the simplest case, files using the native formats of R itself.

5.4.1 .Rda files

In addition to saving the whole workspace, one can save any R objects present in the workspace to disk. One or more objects, belonging to any mode or class, can be saved into the same file. Reading the file restores all the saved objects into the current workspace. These files are portable across most R versions. Whether compression is used, and whether the file is encoded in ASCII characters—allowing maximum portability at the expense of increased size—or in a binary format, can both be controlled.
We create and save a data frame object.

my.df <- data.frame(x = 1:10, y = 10:1)


my.df

## x y
## 1 1 10
## 2 2 9


## 3 3 8
## 4 4 7
## 5 5 6
## 6 6 5
## 7 7 4
## 8 8 3
## 9 9 2
## 10 10 1

save(my.df, file = "my-df.rda")

We delete the data frame object and confirm that it is no longer present in the
workspace.

rm(my.df)
ls(pattern = "my.df")

## character(0)

We read the file we earlier saved to restore the object.

load(file = "my-df.rda")
ls(pattern = "my.df")

## [1] "my.df"

my.df

## x y
## 1 1 10
## 2 2 9
## 3 3 8
## 4 4 7
## 5 5 6
## 6 6 5
## 7 7 4
## 8 8 3
## 9 9 2
## 10 10 1

The default format used is binary and compressed, which results in smaller files.
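These choices are exposed through documented arguments of save() ; a brief sketch:

```r
# Save in ASCII encoding without compression, trading larger file size
# for maximum portability; 'ascii' and 'compress' are parameters of save().
save(my.df, file = "my-df-ascii.rda", ascii = TRUE, compress = FALSE)
```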

U In the example above, only one object was saved, but one can simply give the
names of additional objects as arguments. Just try saving, more than one data
frame to the same file. Then the data frames plus a few vectors. Then define a
simple function and save it. After saving each file, clear the workspace and then
load the objects you save from the file.


Sometimes it is easier to supply the names of the objects to be saved as a vector of character strings, through an argument passed to parameter list . One such case is when we want to save a group of objects based on their names. We can use ls() to list the names of objects matching a simple pattern or a complex regular expression. The example below does this in two steps, saving the character vector first, and then using this saved object as argument to save() ’s list parameter.

objcts <- ls(pattern = "*.df")


save(list = objcts, file = "my-df1.rda")

The intermediate step can be skipped.

save(list = ls(pattern = "*.df"), file = "my-df1.rda")

U Practice using different patterns with ls() . You do not need to save the
objects to a file. Just have a look at the list of object names returned.

As a coda, we show how to clean up by deleting the two files we created. Function unlink() can also be used to delete folders.

unlink(c("my-df.rda", "my-df1.rda"))
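When deleting a folder, the recursive argument must be set; a sketch (the folder name is hypothetical):

```r
# Delete a folder and all of its contents; without 'recursive = TRUE'
# unlink() refuses to remove directories.
unlink("my-temp-dir", recursive = TRUE)
```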

5.4.2 File names and portability

When saving data to files from scripts or code that one expects to be run on a different operating system (OS), we need to be careful to choose file names valid under all OSs where the file could be used. This is especially important when developing R packages. It is best to avoid space characters as part of file names and the use of more than one dot. For widest portability, underscores should be avoided, while dashes are usually not a problem.
R provides some functions which help with portability by hiding the idiosyncrasies of the different OSs from R code. Different OSs use different characters in paths, for example, and consequently the algorithm needed to extract a file name from a file path is OS specific. However, R’s function basename() allows the inclusion of this operation in users’ code portably.
Under MS-Windows, paths include backslash characters, which are not “normal” characters in R and many other languages, but rather “escape” characters. Within R, forward slashes can be used in their place,


basename("C:/Users/aphalo/Documents/my-file.txt")

## [1] "my-file.txt"

or backslash characters can be “escaped” by repeating them.

basename("C:\\Users\\aphalo\\Documents\\my-file.txt")

## [1] "my-file.txt"

The complementary function is dirname() , which extracts the bare path to the containing disk folder from a full file path.

dirname("C:/Users/aphalo/Documents/my-file.txt")

## [1] "C:/Users/aphalo/Documents"

 We here use in the examples paths and file names valid under MS-Windows. We have tried to avoid names incompatible with other operating systems, but the special characters separating directories (= folders) in paths differ among operating systems. For example, if you use UNIX (e.g. Apple’s OS X) or a Linux distribution (such as Debian or Ubuntu), only forward slashes will be recognized as separators.

Functions getwd() and setwd() can be used to get the path to the current working
directory and to set a directory as current, respectively.

getwd()

## [1] "D:/aphalo/Documents/Own_manuscripts/Books/using-r"

Function setwd() returns the path of the previous working directory, allowing us to portably restore it as the working directory later. Both relative paths, as in the example, and absolute paths are accepted as arguments.

oldwd <- setwd("..")


getwd()

## [1] "D:/aphalo/Documents/Own_manuscripts/Books"

The returned value is always an absolute full path, so it remains valid even if the path to the working directory changes more than once before it is restored.

oldwd

## [1] "D:/aphalo/Documents/Own_manuscripts/Books/using-r"


setwd(oldwd)
getwd()

## [1] "D:/aphalo/Documents/Own_manuscripts/Books/using-r"

We can also obtain a list of files and/or directories (= disk folders).

head(list.files("."))

## [1] "abbrev.sty"
## [2] "anscombe.svg"
## [3] "aphalo-learnr-001.pdf"
## [4] "aphalo-learnr-002.pdf"
## [5] "aphalo-learnr-003.pdf"
## [6] "aphalo-learnr-004.pdf"

head(list.dirs("."))

## [1] "." "./.git"


## [3] "./.git/hooks" "./.git/info"
## [5] "./.git/logs" "./.git/logs/refs"

head(dir("."))

## [1] "abbrev.sty"
## [2] "anscombe.svg"
## [3] "aphalo-learnr-001.pdf"
## [4] "aphalo-learnr-002.pdf"
## [5] "aphalo-learnr-003.pdf"
## [6] "aphalo-learnr-004.pdf"

U Above we passed "." as argument for parameter path . This is the same
as the default. Convince yourself that this is indeed the default by calling the
functions without an explicit argument. After this, play with the functions trying
other existing and non-existent paths in your computer.

U Combine the use of basename() with list.files() to obtain a list of file names.

U Compare the behaviour of functions dir() and list.dirs() , and try, by overriding the default arguments of list.dirs() , to get the call to return the same

output as dir() does by default.

Base R provides several functions for working with files; they are listed in the help page for files and in individual help pages. Use help("files") to access the help for this “family” of functions.

if (!file.exists("xxx.txt")) {
file.create("xxx.txt")
}

## [1] TRUE

file.size("xxx.txt")

## [1] 0

file.info("xxx.txt")

## size isdir mode mtime


## xxx.txt 0 FALSE 666 2017-04-11 00:00:47
## ctime atime
## xxx.txt 2017-04-11 00:00:47 2017-04-11 00:00:47
## exe
## xxx.txt no

file.rename("xxx.txt", "zzz.txt")

## [1] TRUE

file.exists("xxx.txt")

## [1] FALSE

file.exists("zzz.txt")

## [1] TRUE

file.remove("zzz.txt")

## [1] TRUE

U Function file.path() can be used to construct a file path from its compon-
ents in a way that is portable across OSs. Look at the help page and play with the
function to assemble some paths that exist in the computer you are using.


5.4.3 Text files

Base R and ‘utils’

Text files come in many different sizes and formats, but can be divided into two broad groups: those with fixed-format fields, and those with delimited fields. Fixed-format fields were especially common in the early days of FORTRAN and COBOL, and on computers with very limited resources. They are usually capable of encoding information using fewer characters than delimited fields. The best way of understanding the differences is with examples. We first discuss base R functions and, starting from page 119, we discuss the functions defined in package ‘readr’.
In a format with delimited fields, a delimiter, in this case “,”, is used to separate the values to be read. In the example below, the values are aligned by inserting “white space”. This is what is called the comma-separated-values (CSV) format. Functions write.csv() and read.csv() can be used to write and read such files using the conventions of this example.

1.0, 24.5, 346, ABC


23.4, 45.6, 78, ZXY

When reading a CSV file, white space is ignored and fields are recognized based on the separators. In most cases decimal points and exponential notation are allowed for floating-point values. Alignment is optional, and helps only reading by humans, as white space is ignored. This misaligned version of the example above can be expected to be readable with base R function read.csv() .

1.0,24.5,346,ABC
23.4,45.6,78,ZXY

With a fixed format for fields, no delimiters are needed, but a description of the format is required. Decoding is based solely on the position of the characters in the line or record. A file like this cannot be interpreted without a description of the format used for saving the data. Files containing data stored in fixed-format fields can be read with base R function read.fwf() . Records can be stored in multiple lines, each line with fields of different, but fixed, widths.

10245346ABC
234456 78ZXY
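A call to read.fwf() could look as follows (a sketch: the file name, field widths and column names are assumptions for illustration, not a decoding of the exact example above):

```r
# Read records consisting of three numeric fields and one text field,
# each three characters wide; 'widths' gives the width of every field.
fwf.df <- read.fwf("fixed-format.txt",
                   widths = c(3, 3, 3, 3),
                   col.names = c("x", "y", "z", "tag"))
```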

Function read.fortran() is a wrapper on read.fwf() that accepts format definitions similar to those used in FORTRAN, although not completely compatible with them. One peculiarity of FORTRAN-formatted data transfer is that the decimal marker can be omitted in the saved file and its position specified as part of the format definition, again an additional trick used to make text files (or stacks of punch cards) smaller.
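A sketch of such a format definition (the file name and field layout are assumed here): in FORTRAN notation, "F3.1" describes a three-character numeric field with one implied decimal digit.

```r
# With an implied decimal position, a field containing "245" under
# format "F3.1" is decoded as 24.5; "A3" is a three-character string.
fortran.df <- read.fortran("fixed-format.txt",
                           format = c("F3.1", "F3.1", "F3.0", "A3"))
```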


R functions write.table() and read.table() default to separating fields with
whitespace. Functions write.csv() and read.csv() have defaults for their argu-
ments suitable for writing and reading CSV files in English-language locales. Func-
tions write.csv2() and read.csv2() are similar but have defaults for delimiters
and decimal markers suitable for CSV files in locales with languages like Spanish,
French, or Finnish that use the comma (,) as decimal marker and the semicolon (;)
as field delimiter. Another frequently used field delimiter is the “tab” or tabulator
character, and sometimes any white space character (tab, space). In most cases the
records (observations) are delimited by new lines, but this is not the only possible
approach, as the user can pass the delimiters to be used as arguments in the
function call.
We give examples of the use of all the functions described in the paragraphs above,
starting by writing data to a file, and then reading this file back into the workspace.
The write() functions take as an argument data frames, or objects that can be
coerced into data frames. In contrast to save() , these functions can write to files
only data that is in a tabular or matrix-like arrangement.

my.df <- data.frame(x = 1:10, y = 10:1)

We write a CSV file suitable for an English-language locale, and then display its
contents. In most cases setting row.names = FALSE when writing a CSV file will help
when it is read. Of course, if row names do contain important information, such as
gene tags, you cannot skip writing the row names to the file unless you first copy
these data into a column in the data frame. (Row names are stored separately as an
attribute in data.frame objects.)

write.csv(my.df, file = "my-file1.csv", row.names = FALSE)


file.show("my-file1.csv", pager = "console")

"x","y"
1,10
2,9
3,8
4,7
5,6
6,5
7,4
8,3
9,2
10,1

If we had written the file using default settings, reading the file so as to recover
the original object would have required overriding the default argument of para-
meter row.names .

115
5 Storing and manipulating data with R

my_read1.df <- read.csv(file = "my-file1.csv")


my_read1.df

## x y
## 1 1 10
## 2 2 9
## 3 3 8
## 4 4 7
## 5 5 6
## 6 6 5
## 7 7 4
## 8 8 3
## 9 9 2
## 10 10 1

all.equal(my.df, my_read1.df, check.attributes = FALSE)

## [1] TRUE

U Read the file with function read.csv2() instead of read.csv() . Although
this may look like a waste of time, the point of the exercise is for you to get familiar
with R’s behaviour in case of such a mistake. This will help you recognize similar
errors when they happen accidentally.

We write a CSV file suitable for a Spanish, Finnish or similar locale, and then dis-
play its contents. It can be seen that the same data frame is saved using different
delimiters.

write.csv2(my.df, file = "my-file2.csv", row.names = FALSE)


file.show("my-file2.csv", pager = "console")

"x";"y"
1;10
2;9
3;8
4;7
5;6
6;5
7;4
8;3
9;2
10;1

As with read.csv() , had we written row names to the file, we would have needed
to override the default behaviour.


my_read2.df <- read.csv2(file = "my-file2.csv")


my_read2.df

## x y
## 1 1 10
## 2 2 9
## 3 3 8
## 4 4 7
## 5 5 6
## 6 6 5
## 7 7 4
## 8 8 3
## 9 9 2
## 10 10 1

all.equal(my.df, my_read2.df, check.attributes = FALSE)

## [1] TRUE

U Read the file with function read.csv() instead of read.csv2() . This may
look like an even more futile exercise than the previous one, but it is not, as the
behaviour of R is different. Consider how values are erroneously decoded in each
exercise. If the structure of the data frames read is not clear to you, do use
function str() to look at them.

We write a file with the fields separated by white space with function
write.table() .

write.table(my.df, file = "my-file3.txt", row.names = FALSE)


file.show("my-file3.txt", pager = "console")

"x" "y"
1 10
2 9
3 8
4 7
5 6
6 5
7 4
8 3
9 2
10 1

In the case of read.table() there is no need to override the default, regardless
of whether row names were written to the file or not. The reason is related to the
default behaviour of the write functions: whether they write a column name ( "" ,
an empty character string) or not for the first column containing the row names.
my_read3.df <- read.table(file = "my-file3.txt", header = TRUE)
my_read3.df

## x y
## 1 1 10
## 2 2 9
## 3 3 8
## 4 4 7
## 5 5 6
## 6 6 5
## 7 7 4
## 8 8 3
## 9 9 2
## 10 10 1

all.equal(my.df, my_read3.df, check.attributes = FALSE)

## [1] TRUE

U If you are still unclear about why the files were decoded in the way they were,
now try to read them with read.table() . Do the three examples now make sense
to you?

Function cat() takes R objects and writes them, after conversion to character
strings, to a file, inserting one or more characters as separators, by default a space.
This separator can be set through parameter sep . In our example we set sep
to a new line (entered as the escape sequence "\n" ).
my.lines <- c("abcd", "hello world", "123.45")
cat(my.lines, file = "my-file4.txt", sep = "\n")
file.show("my-file4.txt", pager = "console")

abcd
hello world
123.45

my_read.lines <- readLines('my-file4.txt')


my_read.lines

## [1] "abcd" "hello world" "123.45"

all.equal(my.lines, my_read.lines, check.attributes = FALSE)

## [1] TRUE


 There are a couple of things to take into account when reading data from text
files using base R function read.table() and its relatives: by default, columns
containing character strings are converted into factors, and column names are
sanitised (spaces and other “inconvenient” characters are replaced with dots).
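Both defaults can be overridden through arguments. A brief sketch, reusing the
file written earlier in this section:

```r
# stringsAsFactors = FALSE keeps character columns as character;
# check.names = FALSE keeps column names exactly as in the file.
my_asis.df <- read.csv(file = "my-file1.csv",
                       stringsAsFactors = FALSE,
                       check.names = FALSE)
```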

‘readr’

citation(package = "readr")

##
## To cite package 'readr' in publications use:
##
## Hadley Wickham, Jim Hester and Romain
## Francois (2017). readr: Read Rectangular
## Text Data. R package version 1.1.0.
## https://CRAN.R-project.org/package=readr
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {readr: Read Rectangular Text Data},
## author = {Hadley Wickham and Jim Hester and Romain Francois},
## year = {2017},
## note = {R package version 1.1.0},
## url = {https://CRAN.R-project.org/package=readr},
## }

Package ‘readr’ is part of the ‘tidyverse’ suite. It defines functions that allow much
faster input and output, and that have different default behaviour. Contrary to base R
functions, they are optimized for speed, but may sometimes wrongly decode their in-
put, and may do this silently even for some CSV files that are correctly decoded
by the base functions. Base R functions are “dumb”: the file format and delimiters must
be supplied as arguments. The ‘readr’ functions use “magic” to guess the format;
in most cases they succeed, which is very handy, but occasionally the power of the
magic is not strong enough. The “magic” can be overridden by passing arguments.
Another important advantage is that these functions read character strings formatted
as dates or times directly into columns of class datetime .
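As a sketch of this behaviour, we can pass a literal string instead of a file name
(a feature of ‘readr’); the data below are invented for illustration:

```r
# readr accepts literal data containing newlines in place of a file name.
read_csv("day,temperature
2017-04-11,5.5
2017-04-12,7.0")
# The 'day' column is decoded as dates, with no extra arguments needed.
```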
All write functions defined in this package have an append parameter, which can
be used to change the default behaviour of overwriting an existing file with the same
name, to appending the output at its end.
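For example, a sketch reusing my.df from above (the file name is invented):

```r
# The first call overwrites any existing file; the second appends rows
# at the end, without repeating the header line.
write_csv(my.df, path = "my-file-app.csv")
write_csv(my.df, path = "my-file-app.csv", append = TRUE)
```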
Although in this section we exemplify the use of these functions by passing a file
name as an argument, URLs and open file descriptors are also accepted. Furthermore,
if the file name ends in a tag recognizable as indicating a compressed file format, the
file will be uncompressed on the fly.
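A brief sketch: we write a gzip-compressed copy of the earlier file with base R
tools, and read it back directly:

```r
# Create a compressed copy of the CSV file written earlier.
con <- gzfile("my-file1.csv.gz", "w")
writeLines(readLines("my-file1.csv"), con)
close(con)
# The ".gz" tag in the name triggers decompression on the fly.
read_csv(file = "my-file1.csv.gz")
```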

 The functions “equivalent” to those described in the previous sec-
tion have names formed by replacing the dot with an underscore, e.g. read_csv()
≈ read.csv() . The similarity refers to the format of the files read, but not to
the order, names or roles of their formal parameters. Function read_table()
behaves differently from read.table() : although they both read fields sep-
arated by white space, read_table() expects the fields in successive records
(usually lines) to be vertically aligned, while read.table() tolerates vertical mis-
alignment. Other aspects of the default behaviour are also different; for example,
these functions do not convert columns of character strings into factors, and row
names are not set in the returned data frame (truly a tibble , which inherits from
data.frame ).

read_csv(file = "my-file1.csv")

## Parsed with column specification:


## cols(
## x = col_integer(),
## y = col_integer()
## )

## # A tibble: 10 × 2
## x y
## <int> <int>
## 1 1 10
## 2 2 9
## 3 3 8
## 4 4 7
## 5 5 6
## 6 6 5
## 7 7 4
## 8 8 3
## 9 9 2
## 10 10 1
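If the guessed column specification is not what we want, it can be overridden
explicitly. A brief sketch, forcing both columns to be decoded as doubles:

```r
# Explicit column specification instead of readr's guess.
read_csv(file = "my-file1.csv",
         col_types = cols(x = col_double(), y = col_double()))
```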

read_csv2(file = "my-file2.csv")

## Using ',' as decimal and '.' as grouping mark. Use read_delim() for more control.
## Parsed with column specification:
## cols(


## x = col_integer(),
## y = col_integer()
## )

## # A tibble: 10 × 2
## x y
## <int> <int>
## 1 1 10
## 2 2 9
## 3 3 8
## 4 4 7
## 5 5 6
## 6 6 5
## 7 7 4
## 8 8 3
## 9 9 2
## 10 10 1

Because of the vertically misaligned fields in file my-file3.txt , we need to use


read_delim() instead of read_table() .

read_delim(file = "my-file3.txt", " ")

## Parsed with column specification:


## cols(
## x = col_integer(),
## y = col_integer()
## )

## # A tibble: 10 × 2
## x y
## <int> <int>
## 1 1 10
## 2 2 9
## 3 3 8
## 4 4 7
## 5 5 6
## 6 6 5
## 7 7 4
## 8 8 3
## 9 9 2
## 10 10 1

U See what happens when you modify the code to use read functions to read
files that are not matched to them—i.e. mix and match functions and files from
the three code chunks above. As mentioned earlier, forcing errors will help you
learn how to diagnose when such errors are caused by coding mistakes.

We demonstrate here the use of write_tsv() to produce a text file with tab-
separated fields.

write_tsv(my.df, path = "my-file5.tsv")


file.show("my-file5.tsv", pager = "console")

x y
1 10
2 9
3 8
4 7
5 6
6 5
7 4
8 3
9 2
10 1

my_read4.df <- read_tsv(file = "my-file5.tsv")

## Parsed with column specification:


## cols(
## x = col_integer(),
## y = col_integer()
## )

my_read4.df

## # A tibble: 10 × 2
## x y
## <int> <int>
## 1 1 10
## 2 2 9
## 3 3 8
## 4 4 7
## 5 5 6
## 6 6 5
## 7 7 4
## 8 8 3
## 9 9 2
## 10 10 1

all.equal(my.df, my_read4.df, check.attributes = FALSE)

## [1] TRUE


We demonstrate here the use of write_excel_csv() to produce a text file with


comma-separated fields suitable for reading with Excel.

write_excel_csv(my.df, path = "my-file6.csv")


file.show("my-file6.csv", pager = "console")

x,y
1,10
2,9
3,8
4,7
5,6
6,5
7,4
8,3
9,2
10,1

U Compare the output from write_excel_csv() and write_csv() . What is
the difference? Does it matter when you import the written CSV file into Excel
(the version you are using, with the locale settings of your computer)?

write_lines(my.lines, path = "my-file7.txt")


file.show("my-file7.txt", pager = "console")

abcd
hello world
123.45

my_read.lines <- read_lines("my-file7.txt")


my_read.lines

## [1] "abcd" "hello world" "123.45"

all.equal(my.lines, my_read.lines, check.attributes = FALSE)

## [1] TRUE

Additional write and read functions not mentioned are also provided by the pack-
age: write_csv() , write_delim() , write_file() , and read_fwf() .


U Use write_file() to write a file that can be read with read_csv() .

5.4.4 Worksheets

Microsoft Office, Open Office and Libre Office are the most frequently used suites
containing programs based on the worksheet paradigm. There is a standardized file
format for the exchange of worksheet data, but it does not support all the features
present in native file formats. We will start by considering MS-Excel. The file format
used by Excel has changed significantly over the years, and old formats tend to be
less well supported by available R packages, and may require the file to be updated
to a more modern format with Excel itself before import into R. The current format
is based on XML and relatively simple to decode; older binary formats are more
difficult. Consequently, for the format currently in use there are alternatives.

Exporting CSV files

If you have access to the original software used, then exporting a worksheet to a text
file in CSV format and importing it into R using the functions described in section
5.4.3 starting on page 114 is a workable solution. It is not ideal, as storing the same
data set repeatedly can lead to these versions diverging when updated. A better
approach is, when feasible, to import the data directly from the workbook or
worksheets into R.

‘readxl’

citation(package = "readxl")

##
## To cite package 'readxl' in publications
## use:
##
## Hadley Wickham (2016). readxl: Read Excel
## Files. R package version 0.1.1.
## https://CRAN.R-project.org/package=readxl
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {readxl: Read Excel Files},
## author = {Hadley Wickham},


## year = {2016},
## note = {R package version 0.1.1},
## url = {https://CRAN.R-project.org/package=readxl},
## }

This package exports only two functions for reading Excel workbooks in xlsx format.
The interface is simple, and the package is easy to install. We will import a file that
in Excel looks as in the screen capture below.

We first list the sheets contained in the workbook file with excel_sheets() .

sheets <- excel_sheets("data/Book1.xlsx")


sheets

## [1] "my data"

In this case the argument passed to sheet is redundant, as there is only a single
worksheet in the file. It is possible to use either the name of the sheet or a posi-
tional index (in this case 1 would be equivalent to "my data" ). We use function
read_excel() to import the worksheet.

Book1.df <- read_excel("data/Book1.xlsx", sheet = "my data")


Book1.df

## # A tibble: 10 × 3
## sample group observation
## <dbl> <chr> <dbl>


## 1 1 a 1.0
## 2 2 a 5.0
## 3 3 a 7.0
## 4 4 a 2.0
## 5 5 a 5.0
## 6 6 b 0.0
## 7 7 b 2.0
## 8 8 b 3.0
## 9 9 b 1.0
## 10 10 b 1.5

Of the remaining arguments, skip is useful when we need to skip the top row of
a worksheet.
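For example, if the worksheet had a decorative title in its top row, we could skip it.
A sketch only: the file name below is hypothetical, unlike Book1.xlsx it is assumed
to contain one extra row above the column headers.

```r
# skip = 1 ignores the first row; column headers are then read from row 2.
read_excel("data/Book1-titled.xlsx", sheet = 1, skip = 1)
```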

‘xlsx’

Package ‘xlsx’ can be more difficult to install, as it uses Java functions to do the actual
work. However, it is more comprehensive, with functions both for reading and for
writing Excel worksheets and workbooks, in different formats. It also allows selecting
regions of a worksheet to be imported.

citation(package = "xlsx")

##
## To cite package 'xlsx' in publications use:
##
## Adrian A. Dragulescu (2014). xlsx: Read,
## write, format Excel 2007 and Excel
## 97/2000/XP/2003 files. R package version
## 0.5.7.
## https://CRAN.R-project.org/package=xlsx
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {xlsx: Read, write, format Excel 2007 and Excel 97/2000/XP/2003 files},
## author = {Adrian A. Dragulescu},
## year = {2014},
## note = {R package version 0.5.7},
## url = {https://CRAN.R-project.org/package=xlsx},
## }
##
## ATTENTION: This citation information has
## been auto-generated from the package
## DESCRIPTION file and may need manual
## editing, see 'help("citation")'.

Here we use function read.xlsx() , indexing the worksheet by name.


Book1_xlsx.df <- read.xlsx("data/Book1.xlsx", sheetName = "my data")


Book1_xlsx.df

## sample group observation


## 1 1 a 1.0
## 2 2 a 5.0
## 3 3 a 7.0
## 4 4 a 2.0
## 5 5 a 5.0
## 6 6 b 0.0
## 7 7 b 2.0
## 8 8 b 3.0
## 9 9 b 1.0
## 10 10 b 1.5

As above, but indexing the worksheet by position.

Book1_xlsx2.df <- read.xlsx2("data/Book1.xlsx", sheetIndex = 1)


Book1_xlsx2.df

## sample group observation


## 1 1 a 1
## 2 2 a 5
## 3 3 a 7
## 4 4 a 2
## 5 5 a 5
## 6 6 b 0
## 7 7 b 2
## 8 8 b 3
## 9 9 b 1
## 10 10 b 1.5

With the three different functions we get a data frame or a tibble, which is compat-
ible with data frames.

class(Book1.df)

## [1] "tbl_df" "tbl" "data.frame"

class(Book1_xlsx.df)

## [1] "data.frame"

class(Book1_xlsx2.df)

## [1] "data.frame"

However, the columns are imported differently. Book1.df and Book1_xlsx.df
differ only in whether the second column, a character variable, has been converted
into a factor. This is to be expected, as packages in the ‘tidyverse’ suite default to
preserving character variables as such, while base R functions convert them to
factors. The third function, read.xlsx2() , did not decode numeric values correctly,
and converted everything into factors. This function is reported to be much faster
than read.xlsx() .

sapply(Book1.df, class)

## sample group observation


## "numeric" "character" "numeric"

sapply(Book1_xlsx.df, class)

## sample group observation


## "numeric" "factor" "numeric"

sapply(Book1_xlsx2.df, class)

## sample group observation


## "factor" "factor" "factor"

With function write.xlsx() we can also write data frames out to Excel worksheets
and even append new worksheets to an existing workbook.

set.seed(456321)
my.data <- data.frame(x = 1:10, y = 1:10 + rnorm(10))
write.xlsx(my.data, file = "data/my-data.xlsx", sheetName = "first copy")
write.xlsx(my.data, file = "data/my-data.xlsx", sheetName = "second copy", append = TRUE)

When opened in Excel we get a workbook, containing two worksheets, named using
the arguments we passed through sheetName in the code chunk above.
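We can also check the result from within R, using excel_sheets() from ‘readxl’,
introduced above:

```r
# List the names of the worksheets in the workbook we just wrote.
excel_sheets("data/my-data.xlsx")
```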


U If you have some worksheet files available, import them into R, to get a feel
of how the way data is organized in the worksheets affects how easy or difficult
it is to read the data from them.

‘xml2’

Several modern data exchange formats are based on the XML standard, which uses
schemas for flexibility. Package ‘xml2’ provides functions for reading and parsing
such files, as well as HTML files. This is a vast subject, of which I will give only a
brief introduction.
We first read a very simple web page with function read_html() .

web_page <- read_html("http://r4photobiology.info/R/index.html")


html_structure(web_page)

## <html>
## <head>
## <title>
## {text}
## <meta [name, content]>
## <meta [name, content]>
## <meta [name, content]>
## <body>


## {text}
## <hr>
## <h1>
## {text}
## {text}
## <hr>
## <p>
## {text}
## <a [href]>
## {text}
## {text}
## {text}
## <p>
## {text}
## <a [href]>
## {text}
## {text}
## {text}
## <address>
## {text}
## {text}

We then extract the text of its title element, using functions xml_find_all()
and xml_text() .

xml_text(xml_find_all(web_page, ".//title"))

## [1] "r4photobiology repository"

The functions defined in this package and in package ‘XML’ can be used to “harvest”
data from web pages, but also to read data from files using formats that are defined
through XML schemas.
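For example, continuing with the page read above, the link targets can be extracted
with xml_attr() :

```r
# Find all anchor nodes and extract their 'href' attributes.
xml_attr(xml_find_all(web_page, ".//a"), "href")
```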

5.4.5 Statistical software

There are two different comprehensive packages for importing data saved from other
statistical software such as SAS, Statistica, SPSS, etc.: the long-time “standard”, pack-
age ‘foreign’, and the much newer ‘haven’. In the case of files saved with old versions
of statistical programs, functions from ‘foreign’ tend to be more robust than those
from ‘haven’.

‘foreign’

Functions in this package allow importing data from files saved by several foreign
statistical analysis programs, including SAS, Stata and SPSS among others, and a
function for writing data into files with formats native to these three programs.
Documentation describing them is included with R in R Data Import/Export. As a
simple example we use function read.spss() to read a .sav file, saved with a
recent version of SPSS.

my_spss.df <- read.spss(file = "data/my-data.sav", to.data.frame = TRUE)

## Warning in read.spss(file = "data/my-data.sav", to.data.frame = TRUE):


data/my-data.sav: Unrecognized record type 7, subtype 18 encountered in system file
## Warning in read.spss(file = "data/my-data.sav", to.data.frame = TRUE):
data/my-data.sav: Unrecognized record type 7, subtype 24 encountered in system file

head(my_spss.df)

## block treat mycotreat water1 pot harvest


## 1 0 Watered, EM 1 1 14 1
## 2 0 Watered, EM 1 1 52 1
## 3 0 Watered, EM 1 1 111 1
## meas_order spad psi H_mm d_mm pot_plant_g
## 1 NA NA NA 67 2.115 NA
## 2 NA NA NA 44 1.285 NA
## 3 NA NA NA 65 1.685 NA
## plant_g tag_g pot_g leaf_area harvest_date
## 1 NA NA NA 35.883 13653705600
## 2 NA NA NA 16.938 13653705600
## 3 NA NA NA 38.056 13653705600
## stem_g leaves_g green_leaves save_order
## 1 0.0372 0.1685 0.0542 1
## 2 0.0139 0.0626 0.0443 2
## 3 0.0279 0.1522 0.0511 3
## waterprcnt height_1 height_2 height_3 height_4
## 1 NA 23 34 55 NA
## 2 NA 10 21 37 NA
## 3 NA 12 27 48 NA
## diam_1 height_5 diam_2
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## [ reached getOption("max.print") -- omitted 3 rows ]

Dates were not converted into R’s datetime objects, but instead into numbers.
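SPSS stores dates as seconds counted from 1582-10-14, so, when needed, the
numbers can be converted manually. A sketch, using the data frame read above:

```r
# Convert seconds-since-1582-10-14 into R Date objects.
as.Date(my_spss.df$harvest_date / 86400, origin = "1582-10-14")
```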
A second example, this time with a simple .sav file saved 15 years ago.

thiamin.df <- read.spss(file = "data/thiamin.sav", to.data.frame = TRUE)


head(thiamin.df)

## THIAMIN CEREAL
## 1 5.2 wheat
## 2 4.5 wheat
## 3 6.0 wheat


## 4 6.1 wheat
## 5 6.7 wheat
## 6 5.8 wheat

Another example, for a Systat file saved on a PC more than 20 years ago, and read
with function read.systat() .

my_systat.df <- read.systat(file = "data/BIRCH1.SYS")


my_systat.df

## CONT DENS BLOCK SEEDL VITAL BASE ANGLE HEIGHT


## 1 1 1 1 2 44 2 0 1
## 2 1 1 1 2 41 2 1 2
## 3 1 1 1 2 21 2 0 1
## 4 1 1 1 2 15 3 0 1
## 5 1 1 1 2 37 3 0 1
## 6 1 1 1 2 29 2 1 1
## 7 1 1 1 1 30 0 NA NA
## 8 1 1 1 1 28 0 NA NA
## 9 1 1 1 1 37 3 2 1
## 10 1 1 1 1 26 3 1 3
## 11 1 1 1 1 27 3 0 1
## DIAM
## 1 53
## 2 70
## 3 65
## 4 79
## 5 71
## 6 43
## 7 NA
## 8 NA
## 9 74
## 10 71
## 11 64
## [ reached getOption("max.print") -- omitted 383 rows ]

The functions in ‘foreign’ can return data frames, but this is not always the default.

‘haven’

The recently released package ‘haven’ is less ambitious in scope, providing read and
write functions for only three file formats: SAS, Stata and SPSS. On the other hand,
‘haven’ provides flexible ways to convert the different labelled values that cannot be
directly mapped to normal R modes. Its functions also decode dates and times
according to the idiosyncrasies of each of these file formats. When an imported file
contains labelled values, the returned tibble object needs some further work from
the user before a ‘normal’ data-frame-compatible tibble is obtained.


Here we use function read_sav() to import a .sav file saved by a recent
version of SPSS.

my_spss.tb <- read_sav(file = "data/my-data.sav")


my_spss.tb

## # A tibble: 372 × 29
## block treat mycotreat water1 pot harvest
## <dbl> <dbl+lbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 1 1 1 14 1
## 2 0 1 1 1 52 1
## 3 0 1 1 1 111 1
## 4 0 1 1 1 127 1
## 5 0 1 1 1 230 1
## 6 0 1 1 1 258 1
## 7 0 1 1 1 363 1
## 8 0 1 1 1 400 1
## 9 0 1 1 1 424 1
## 10 0 1 1 1 443 1
## # ... with 362 more rows, and 23 more variables:
## # meas_order <dbl>, spad <dbl>, psi <dbl>,
## # H_mm <dbl>, d_mm <dbl>, pot_plant_g <dbl>,
## # plant_g <dbl>, tag_g <dbl>, pot_g <dbl>,
## # leaf_area <dbl>, harvest_date <date>,
## # stem_g <dbl>, leaves_g <dbl>,
## # green_leaves <dbl>, save_order <dbl>,
## # waterprcnt <dbl>, height_1 <dbl>,
## # height_2 <dbl>, height_3 <dbl>,
## # height_4 <dbl>, diam_1 <dbl>, height_5 <dbl>,
## # diam_2 <dbl>

head(my_spss.tb$harvest_date)

## [1] "2015-06-15" "2015-06-15" "2015-06-15"


## [4] "2015-06-15" "2015-06-15" "2015-06-15"

In this case the dates are correctly decoded.

Next, an SPSS .sav file saved 15 years ago.

thiamin.tb <- read_sav(file = "data/thiamin.sav")


thiamin.tb

## # A tibble: 24 × 2
## THIAMIN CEREAL
## <dbl> <dbl+lbl>
## 1 5.2 1
## 2 4.5 1
## 3 6.0 1
## 4 6.1 1

133
5 Storing and manipulating data with R

## 5 6.7 1
## 6 5.8 1
## 7 6.5 2
## 8 8.0 2
## 9 6.1 2
## 10 7.5 2
## # ... with 14 more rows

thiamin.tb <- as_factor(thiamin.tb)


thiamin.tb

## # A tibble: 24 × 2
## THIAMIN CEREAL
## <dbl> <fctr>
## 1 5.2 wheat
## 2 4.5 wheat
## 3 6.0 wheat
## 4 6.1 wheat
## 5 6.7 wheat
## 6 5.8 wheat
## 7 6.5 barley
## 8 8.0 barley
## 9 6.1 barley
## 10 7.5 barley
## # ... with 14 more rows

U Compare the values returned by different read functions when applied to the
same file on disk. Use names() , str() and class() as tools in your exploration.
If you are brave, also use attributes() , mode() , dim() , dimnames() , nrow()
and ncol() .

U If you use or have used in the past other statistical software or a general
purpose language like Python, look up some files, and import them into R.

5.4.6 NetCDF files

In some fields, including geophysics and meteorology, NetCDF is a very common
format for the exchange of data. It is also used in other contexts in which data is
referenced to an array of locations, such as data read from Affymetrix microarrays
used to study gene expression. The NetCDF format allows the storage of metadata
together with the data itself in a well organized and standardized format, which is
ideal for the exchange of moderately large data sets.
Officially, it is described as

NetCDF is a set of software libraries and self-describing, machine-


independent data formats that support the creation, access, and sharing
of array-oriented scientific data.

As NetCDF files are sometimes large, it is convenient that functions in packages
‘ncdf4’ and ‘RNetCDF’ make it possible to selectively read the data from individual
variables. On the other hand, this implies that, contrary to other data-file reading
operations, reading a NetCDF file is done in two or more steps.

‘ncdf4’

We first need to read an index into the file contents, and in additional steps we read
a subset of the data. With print() we can find out the names and characteristics of
the variables and attributes. In this example we use long term averages for potential
evapotranspiration (PET).
We first open a connection to the file with function nc_open() .

meteo_data.nc <- nc_open("data/pevpr.sfc.mon.ltm.nc")


# very long output
# print(meteo_data.nc)

U Uncomment the print() statement above and study the metadata available
for the data set as a whole, and for each variable.

The dimensions of the array data are described with metadata, mapping indexes,
in our example, to a grid of latitudes and longitudes and to a time vector as a third
dimension. The dates are returned as character strings. We get the variables one at
a time with function ncvar_get() .

time.vec <- ncvar_get(meteo_data.nc, "time")


head(time.vec)

## [1] -657073 -657042 -657014 -656983 -656953


## [6] -656922

longitude <- ncvar_get(meteo_data.nc, "lon")


head(longitude)


## [1] 0.000 1.875 3.750 5.625 7.500 9.375

latitude <- ncvar_get(meteo_data.nc, "lat")


head(latitude)

## [1] 88.5420 86.6531 84.7532 82.8508 80.9473


## [6] 79.0435

The time vector is rather odd, as it contains only month data, these being long-
term averages. From the metadata we can infer that the values correspond to the
months of the year, and we generate these directly instead of attempting a conversion.
We construct a tibble object with PET values for one grid point, taking advantage
of the recycling of short vectors.

pet.tb <-
tibble(moth = month.abb[1:12],
lon = longitude[6],
lat = latitude[2],
pet = ncvar_get(meteo_data.nc, "pevpr")[6, 2, ]
)
pet.tb

## # A tibble: 12 × 4
## moth lon lat pet
## <chr> <dbl> <dbl> <dbl>
## 1 Jan 9.375 86.6531 4.275492
## 2 Feb 9.375 86.6531 5.723819
## 3 Mar 9.375 86.6531 4.379165
## 4 Apr 9.375 86.6531 6.760361
## 5 May 9.375 86.6531 16.582457
## 6 Jun 9.375 86.6531 28.885454
## 7 Jul 9.375 86.6531 22.823912
## 8 Aug 9.375 86.6531 12.661168
## 9 Sep 9.375 86.6531 4.085276
## 10 Oct 9.375 86.6531 3.354837
## 11 Nov 9.375 86.6531 5.083717
## 12 Dec 9.375 86.6531 5.168580

If we want to read in several grid points, we can use several different approaches.
In this example we take all latitudes along one longitude, and avoid using loops
altogether when creating a tidy tibble object. However, because of how the data
is stored, we need to transpose the intermediate array before converting it into a
vector.

pet2.tb <-
  tibble(moth = rep(month.abb[1:12], length(latitude)),
         lon = longitude[6],
         lat = rep(latitude, each = 12),
         pet = as.vector(t(ncvar_get(meteo_data.nc, "pevpr")[6, , ]))
  )
pet2.tb

## # A tibble: 1,128 × 4
## moth lon lat pet
## <chr> <dbl> <dbl> <dbl>
## 1 Jan 9.375 88.542 1.0156335
## 2 Feb 9.375 88.542 1.5711517
## 3 Mar 9.375 88.542 0.8833860
## 4 Apr 9.375 88.542 3.5472817
## 5 May 9.375 88.542 12.4486160
## 6 Jun 9.375 88.542 27.0826015
## 7 Jul 9.375 88.542 21.7112827
## 8 Aug 9.375 88.542 11.0301638
## 9 Sep 9.375 88.542 0.3564302
## 10 Oct 9.375 88.542 -1.1898587
## # ... with 1,118 more rows

subset(pet2.tb, lat == latitude[2])

## # A tibble: 12 × 4
## moth lon lat pet
## <chr> <dbl> <dbl> <dbl>
## 1 Jan 9.375 86.6531 4.275492
## 2 Feb 9.375 86.6531 5.723819
## 3 Mar 9.375 86.6531 4.379165
## 4 Apr 9.375 86.6531 6.760361
## 5 May 9.375 86.6531 16.582457
## 6 Jun 9.375 86.6531 28.885454
## 7 Jul 9.375 86.6531 22.823912
## 8 Aug 9.375 86.6531 12.661168
## 9 Sep 9.375 86.6531 4.085276
## 10 Oct 9.375 86.6531 3.354837
## 11 Nov 9.375 86.6531 5.083717
## 12 Dec 9.375 86.6531 5.168580

U Play with as.vector(t(ncvar_get(meteo_data.nc, "pevpr")[6, , ])) until


you understand what is the effect of each of the nested function calls, starting
from ncvar_get(meteo_data.nc, "pevpr") . You will also want to use str() to
see the structure of the objects returned at each stage.


U Instead of extracting data for one longitude across latitudes, extract data
across longitudes for one latitude near the Equator.

‘RNetCDF’

 Package RNetCDF supports NetCDF3 files, but not those saved using the
current NetCDF4 format.

We first need to read an index into the file contents, and in additional steps we read
a subset of the data. With print.nc() we can find out the names and characteristics
of the variables and attributes. We open the connection with function open.nc() .

meteo_data.nc <- open.nc("data/meteo-data.nc")


str(meteo_data.nc)

## Class 'NetCDF' num 65536

# very long output


# print.nc(meteo_data.nc)

The dimensions of the array data are described with metadata, mapping indexes,
in our example, to a grid of latitudes and longitudes and to a time vector as a third
dimension. The dates are returned as character strings. We get variables, one at a
time, with function var.get.nc() .

time.vec <- var.get.nc(meteo_data.nc, "time")


head(time.vec)

## [1] 20080902 20080903 20080904 20080905 20080906


## [6] 20080907

longitude <- var.get.nc(meteo_data.nc, "lon")


head(longitude)

## [1] 19.5 20.5 21.5 22.5 23.5 24.5

latitude <- var.get.nc(meteo_data.nc, "lat")


head(latitude)

## [1] 59.5 60.5 61.5 62.5 63.5 64.5

We construct a tibble object with values for midday UV Index for 26 days. For
convenience, we convert the strings into R’s datetime objects.


uvi.tb <-
tibble(date = ymd(time.vec, tz="EET"),
lon = longitude[6],
lat = latitude[2],
uvi = var.get.nc(meteo_data.nc, "UVindex")[6,2,]
)
uvi.tb

## # A tibble: 26 × 4
## date lon lat uvi
## <dttm> <dbl> <dbl> <dbl>
## 1 2008-09-02 24.5 60.5 2.3613100
## 2 2008-09-03 24.5 60.5 1.1853613
## 3 2008-09-04 24.5 60.5 1.2863934
## 4 2008-09-05 24.5 60.5 3.2393212
## 5 2008-09-06 24.5 60.5 2.3606744
## 6 2008-09-07 24.5 60.5 2.6877227
## 7 2008-09-08 24.5 60.5 1.4642892
## 8 2008-09-09 24.5 60.5 1.8718901
## 9 2008-09-10 24.5 60.5 0.8997096
## 10 2008-09-11 24.5 60.5 2.4975569
## # ... with 16 more rows
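Once done, it is good practice to close the connections opened earlier; the function
names differ between the two packages:

```r
# 'RNetCDF' uses close.nc(); in 'ncdf4' the equivalent is nc_close().
close.nc(meteo_data.nc)
```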

5.4.7 Remotely located data

Many of the functions described above accept a URL in place of a file name.
Consequently files can be read remotely, without a separate download step. This can
be useful, especially when file names are generated within a script. However, one
should avoid, especially in the case of servers open to public access, generating
unnecessary load on the server and/or network traffic by repeatedly downloading
the same file. Because of this, our first example reads a small file from my own web
site. See section 5.4.3 on page 114 for details of the use of these and other functions
for reading text files.
logger.df <-
read.csv2(file = "http://r4photobiology.info/learnr/logger_1.txt",
header = FALSE,
col.names = c("time", "temperature"))
sapply(logger.df, class)

## time temperature
## "factor" "numeric"

sapply(logger.df, mode)

## time temperature
## "numeric" "numeric"

139
5 Storing and manipulating data with R

logger.tb <-
read_csv2(file = "http://r4photobiology.info/learnr/logger_1.txt",
col_names = c("time", "temperature"))

## Using ',' as decimal and '.' as grouping mark. Use read_delim() for more control.
## Parsed with column specification:
## cols(
## time = col_character(),
## temperature = col_double()
## )

sapply(logger.tb, class)

## time temperature
## "character" "numeric"

sapply(logger.tb, mode)

## time temperature
## "character" "numeric"

While functions in package ‘readr’ support the use of URLs, those in packages
‘readxl’ and ‘xlsx’ do not. Consequently, we need to first download the file, saving
a local copy that we can then read as described in section 5.4.4 on page 124.

download.file("http://r4photobiology.info/learnr/my-data.xlsx",
"data/my-data-dwn.xlsx",
mode = "wb")

Functions in package ‘foreign’, as well as those in package ‘haven’, support URLs.
See section 5.4.5 on page 130 for more information about importing this kind of data
into R.

remote_thiamin.df <-
read.spss(file = "http://r4photobiology.info/learnr/thiamin.sav",
to.data.frame = TRUE)
head(remote_thiamin.df)

## THIAMIN CEREAL
## 1 5.2 wheat
## 2 4.5 wheat
## 3 6.0 wheat
## 4 6.1 wheat
## 5 6.7 wheat
## 6 5.8 wheat


remote_my_spss.tb <-
read_sav(file = "http://r4photobiology.info/learnr/thiamin.sav")
remote_my_spss.tb

## # A tibble: 24 × 2
## THIAMIN CEREAL
## <dbl> <dbl+lbl>
## 1 5.2 1
## 2 4.5 1
## 3 6.0 1
## 4 6.1 1
## 5 6.7 1
## 6 5.8 1
## 7 6.5 2
## 8 8.0 2
## 9 6.1 2
## 10 7.5 2
## # ... with 14 more rows

Function download.file() in R’s default ‘utils’ package can be used to download
files using URLs. It supports different modes, such as binary or text, and write or
append, and different methods, such as internal, wget and libcurl.
In this example we use a downloaded NetCDF file of long-term means for potential
evapotranspiration from NOAA, the same used above in the ‘ncdf4’ example. This is
a moderately large file at 444 KB. In this case we cannot directly open a connection
to the remote NetCDF file; we first download it (commented-out code, as we have a
local copy), and then we open the local file.

my.url <- paste("ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.derived/",


"surface_gauss/pevpr.sfc.mon.ltm.nc",
sep = "")
# download.file(my.url,
#               mode = "wb",
#               destfile = "data/pevpr.sfc.mon.ltm.nc")
pet_ltm.nc <- nc_open("data/pevpr.sfc.mon.ltm.nc")

 For portability NetCDF files should be downloaded in binary mode, setting


mode = "wb" , which is required at least under MS-Windows.

5.4.8 Data acquisition from physical devices

Numerous modern data acquisition devices based on microcontrollers, including
internet-of-things (IoT) devices, have servers (or daemons) that can be queried over
a network connection to retrieve either real-time or logged data. Formats based on
XML schemas, or the JSON format, are commonly used.
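Package ‘jsonlite’ maps JSON objects onto R lists and vectors. As a minimal, self-contained sketch—not tied to any device, and with invented field names—we parse a small JSON text with fromJSON() :

```r
library(jsonlite)
# a hypothetical logger record as a JSON string
json.txt <- '{"sensor": "t1", "unit": "C", "readings": [20.1, 20.4, 19.8]}'
record.lst <- fromJSON(json.txt)
record.lst$sensor
record.lst$readings
```

By default fromJSON() simplifies JSON arrays into R vectors, so readings is returned as a numeric vector.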

‘jsonlite’

We give here a simple example with a module from the YoctoPuce family, accessed
through a software hub running locally. We retrieve logged data from a YoctoMeteo module.

= This example is not run, as it requires a YoctoPuce module configured
beforehand. Fully reproducible examples, including configuration
instructions, will be included in a future revision of the manuscript.

Here we use function fromJSON() to retrieve logged data from one sensor.

hub.url <- "http://127.0.0.1:4444/"


Meteo01.lst <-
fromJSON(paste(hub.url, "byName/Meteo01/dataLogger.json",
sep = ""))
names(Meteo01.lst)
Meteo01.lst

The minimum, mean and maximum values for each logging interval need to be
split from a single vector. We do this by indexing with a logical vector (recycled). The
data returned is tidy with respect to the variables, with quantity names and units,
as well as the time, also returned by the module.

val.vector <- unlist(Meteo01.lst[["val"]])


dplyr::transmute(Meteo01.lst,
utc.time = as.POSIXct(utc, origin = "1970-01-01", tz = "UTC"),
qty = qty.name,
unit = qty.unit,
minimum = val.vector[c(TRUE, FALSE, FALSE)],
mean = val.vector[c(FALSE, TRUE, FALSE)],
maximum = val.vector[c(FALSE, FALSE, TRUE)],
dur,
freq)

5.4.9 Databases

One of the advantages of using databases is that subsets of cases and variables can
be retrieved, even remotely, making it possible to work with huge data sets both
locally and remotely. One should remember that R natively keeps whole
objects in RAM, and consequently available machine memory limits the size of data
sets with which it is possible to work.

= The contents of this section are still missing, but will in any case be basic. I re-
commend the book R for Data Science (Wickham and Grolemund 2017) for learning
how to use the packages in the ‘tidyverse’ suite, especially in the case of connect-
ing to databases.
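As a minimal placeholder sketch, packages ‘DBI’ and ‘RSQLite’ can be used to retrieve a subset of rows from a database with an SQL query—here an in-memory SQLite database filled with R’s iris data; remote database back-ends are accessed through the same DBI interface:

```r
library(DBI)
# create a transient in-memory database and copy iris into it
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "iris", iris)
# only the rows selected by the query are transferred into R
setosa.df <- dbGetQuery(con,
                        "SELECT * FROM iris WHERE Species = 'setosa' LIMIT 5")
nrow(setosa.df)
dbDisconnect(con)
```

With a remote server, only the dbConnect() call changes; the queries remain the same.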

5.5 Apply functions

Apply functions apply a function to the elements of a collection of R objects. These collec-
tions can be vectors, lists, data frames, matrices or arrays. As long as the operations
to be applied are independent—i.e. the results from one iteration are not used in an-
other iteration, and each iteration refers to only one member of the collection—
these functions can replace for , while or repeat loops.

 When can apply functions not replace traditional loop constructs? We will
give some typical examples. The first case is the accumulation pattern, where we
“walk” through a collection storing a partial result between iterations.

set.seed(123456)
a.vector <- runif(20)
total <- 0
for (i in seq(along.with = a.vector)) {
total <- total + a.vector[i]
}
total

## [1] 11.88678

Although the loop above cannot be replaced by a statement based on an apply
function, it can be replaced by the summation function sum() from base R.

set.seed(123456)
a.vector <- runif(20)
total <- sum(a.vector)
total

## [1] 11.88678
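For accumulation patterns that lack a ready-made summary function like sum() , base R’s Reduce() provides a general functional alternative to the explicit loop:

```r
set.seed(123456)
a.vector <- runif(20)
# fold `+` over the vector, accumulating a single result
Reduce(`+`, a.vector)
# with accumulate = TRUE all partial results are kept, as with cumsum()
head(Reduce(`+`, a.vector, accumulate = TRUE))
```

Reduce() accepts any binary function, so it also covers accumulation patterns for which no specialized function exists.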

Another frequent pattern is operations, at each iteration, on a subset composed
of two or more consecutive elements of the collection. The simplest and
probably most frequent calculation of this kind is computing differences
between successive members.

set.seed(123456)
a.vector <- runif(20)
b.vector <- numeric(length(a.vector) - 1)
for (i in seq(along.with = b.vector)) {
b.vector[i] <- a.vector[i + 1] - a.vector[i]
}
b.vector

## [1] -0.04421923 -0.36230941 -0.04969899


## [4] 0.01973741 -0.16294938 0.33651323
## [7] -0.43833172 0.89132070 -0.82027747
## [10] 0.63041965 -0.20419511 0.31151599
## [13] -0.02446136 0.11298790 -0.09788022
## [16] -0.01731298 -0.68103760 0.13738785
## [19] 0.44221272

In this case, we can use diff() instead of an explicit loop.

b.vector <- diff(a.vector)


b.vector

## [1] -0.04421923 -0.36230941 -0.04969899


## [4] 0.01973741 -0.16294938 0.33651323
## [7] -0.43833172 0.89132070 -0.82027747
## [10] 0.63041965 -0.20419511 0.31151599
## [13] -0.02446136 0.11298790 -0.09788022
## [16] -0.01731298 -0.68103760 0.13738785
## [19] 0.44221272

5.5.1 Base R’s apply functions

Base R’s apply functions differ in the class of the returned value and in the class of
the argument expected for their X parameter: apply() expects a matrix or array
as argument, or an argument like a data.frame which can be converted to a matrix
or array. apply() returns an array , a list or a vector depending on the size, and
consistency in length and class, of the values returned by the applied function.
lapply() and sapply() expect a vector or list as argument passed through X .
lapply() always returns a list ; vapply() simplifies its returned value according
to the template passed to its FUN.VALUE parameter, while sapply() does the
simplification according to the argument passed to its simplify parameter. All these
apply functions can be used to apply any R function that returns a value of the same
or a different class as its argument. In the case of apply() and lapply() not even
the length of the values returned for each member of the collection passed as
argument needs to be consistent. In summary, apply() is used to apply a function
to the elements of an object that has dimensions defined, and lapply() and
sapply() to apply a function to the members of an object without dimensions, such
as a vector.

 Of course, a matrix can have a single row, a single column, or even a single
element, but even in such cases, a matrix will have dimensions defined and stored
as an attribute.

my.vector <- 1:6


dim(my.vector)

## NULL

one.col.matrix <- matrix(1:6, ncol = 1)


dim(one.col.matrix)

## [1] 6 1

two.col.matrix <- matrix(1:6, ncol = 2)


dim(two.col.matrix)

## [1] 3 2

one.elem.matrix <- matrix(1, ncol = 1)


dim(one.elem.matrix)

## [1] 1 1

U Print the matrices defined in the chunks above. Then, look up the help
page for array() and write equivalent examples for arrays with three and
higher dimensions.

We first exemplify the use of lapply() and sapply() given their simpler argument
for X .

set.seed(123456)
a.vector <- runif(10)
my.fun <- function(x, k) {log(x) + k}
z <- lapply(X = a.vector, FUN = my.fun, k = 5)


class(z)

## [1] "list"

dim(z)

## NULL

## [[1]]
## [1] 4.774083
##
## [[2]]
## [1] 4.71706
##
## [[3]]
## [1] 4.061606
##
## [[4]]
## [1] 3.925758
##
## [[5]]
## [1] 3.981937
##
## [[6]]
## [1] 3.382251
##
## [[7]]
## [1] 4.374246
##
## [[8]]
## [1] 2.66206
##
## [[9]]
## [1] 4.987772
##
## [[10]]
## [1] 3.213643

z <- sapply(X = a.vector, FUN = my.fun, k = 5)


class(z)

## [1] "numeric"

dim(z)

## NULL


## [1] 4.774083 4.717060 4.061606 3.925758 3.981937


## [6] 3.382251 4.374246 2.662060 4.987772 3.213643

z <- sapply(X = a.vector, FUN = my.fun, k = 5, simplify = FALSE)


class(z)

## [1] "list"

dim(z)

## NULL

## [[1]]
## [1] 4.774083
##
## [[2]]
## [1] 4.71706
##
## [[3]]
## [1] 4.061606
##
## [[4]]
## [1] 3.925758
##
## [[5]]
## [1] 3.981937
##
## [[6]]
## [1] 3.382251
##
## [[7]]
## [1] 4.374246
##
## [[8]]
## [1] 2.66206
##
## [[9]]
## [1] 4.987772
##
## [[10]]
## [1] 3.213643

Anonymous functions can be defined on the fly, resulting in the same returned
value.

sapply(X = a.vector, FUN = function(x, k) {log(x) + k}, k = 5)

## [1] 4.774083 4.717060 4.061606 3.925758 3.981937


## [6] 3.382251 4.374246 2.662060 4.987772 3.213643

Of course, as discussed in Chapter ??, when vectorization is possible, this results


also in fastest execution and simplest code.

log(a.vector) + 5

## [1] 4.774083 4.717060 4.061606 3.925758 3.981937


## [6] 3.382251 4.374246 2.662060 4.987772 3.213643
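That the three approaches—explicit apply, anonymous function, and vectorized expression—return the same values can be checked directly with all.equal() , which allows for floating-point rounding:

```r
set.seed(123456)
a.vector <- runif(10)
my.fun <- function(x, k) {log(x) + k}
# compare element by element, tolerating floating-point rounding
all.equal(sapply(X = a.vector, FUN = my.fun, k = 5), log(a.vector) + 5)
```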

Next we give examples of the use of apply() . The argument passed to MARGIN
determines the dimension along which the matrix or array passed to X will be split
before being passed as argument to the function supplied through FUN . In the example
below we get either row or column means. In these examples, mean() is passed a
vector, for each row or each column of the matrix. As function mean() returns a
single value independently of the length of its argument, instead of a matrix, the
returned value is a vector; in other words, an array with one dimension less than that
of its input.

set.seed(123456)
a.mat <- matrix(runif(10), ncol = 2)
row.means <- apply(X = a.mat, MARGIN = 1, FUN = mean, na.rm = TRUE)
class(row.means)

## [1] "numeric"

dim(row.means)

## NULL

row.means

## [1] 0.4980645 0.6442115 0.2438910 0.6647018


## [5] 0.2644318

col.means <- apply(X = a.mat, MARGIN = 2, FUN = mean, na.rm = TRUE)


class(col.means)

## [1] "numeric"

dim(col.means)

## NULL

col.means

## [1] 0.5290912 0.3970291


U Look up the help pages for apply() and mean() to study them until you
understand how to pass additional arguments to any applied function. Can you
guess why apply was designed to have parameter names fully in upper case,
something very unusual for R functions?

 If we apply a function that returns a value of the same length as its input,
then the dimensions of the value returned by apply() are the same as those of its
input. We use in the next examples a “no-op” function that returns its argument
unchanged, so that input and output can be easily compared.

set.seed(123456)
a.mat <- matrix(1:10, ncol = 2)
no_op.fun <- function(x) {x}
b.mat <- apply(X = a.mat, MARGIN = 2, FUN = no_op.fun)
class(b.mat)

## [1] "matrix"

dim(b.mat)

## [1] 5 2

b.mat

## [,1] [,2]
## [1,] 1 6
## [2,] 2 7
## [3,] 3 8
## [4,] 4 9
## [5,] 5 10

t(b.mat)

## [,1] [,2] [,3] [,4] [,5]


## [1,] 1 2 3 4 5
## [2,] 6 7 8 9 10

In the chunk above we passed MARGIN = 2 , but if we pass MARGIN = 1 , we
get an equivalent return value, but transposed! To restore the original layout of
the matrix we can transpose the result with function t() .


b.mat <- apply(X = a.mat, MARGIN = 1, FUN = no_op.fun)


class(b.mat)

## [1] "matrix"

dim(b.mat)

## [1] 2 5

b.mat

## [,1] [,2] [,3] [,4] [,5]


## [1,] 1 2 3 4 5
## [2,] 6 7 8 9 10

t(b.mat)

## [,1] [,2]
## [1,] 1 6
## [2,] 2 7
## [3,] 3 8
## [4,] 4 9
## [5,] 5 10

Of course, these two toy examples show something that can, and should, always
be avoided, as vectorization allows us to apply the function directly to the
whole matrix.

b.mat <- no_op.fun(a.mat)

A more realistic example, but difficult to grasp without seeing the toy examples
shown above, is when we apply a function that returns a value of a different
length than its input, but longer than one. If this length is consistent, an array
with matching dimensions is returned, but again with the original columns as
rows. What happens is that by using apply() one dimension of the original matrix
or array disappears, as we apply the function over it. Consequently, given how
matrices are stored in R, when the column dimension disappears, the row dimension
becomes the new column dimension. After this, the elements of the vectors
returned by the applied function are stored along rows. To restore the
original rows to rows in the result matrix we can transpose it with function
t() .


set.seed(123456)
a.mat <- matrix(runif(10), ncol = 2)
mean_and_var <- function(x, na.rm = FALSE) {
c(mean(x, na.rm = na.rm), var(x, na.rm = na.rm))
}
c.mat <- apply(X = a.mat, MARGIN = 1, FUN = mean_and_var, na.rm = TRUE)
class(c.mat)

## [1] "matrix"

dim(c.mat)

## [1] 2 5

c.mat

## [,1] [,2] [,3] [,4]


## [1,] 0.4980645 0.6442115 0.24389096 0.6647018
## [2,] 0.1796639 0.0239164 0.04343272 0.2088455
## [,5]
## [1,] 0.26443179
## [2,] 0.01876462

t(c.mat)

## [,1] [,2]
## [1,] 0.4980645 0.17966391
## [2,] 0.6442115 0.02391640
## [3,] 0.2438910 0.04343272
## [4,] 0.6647018 0.20884554
## [5,] 0.2644318 0.01876462

In this case, calling the user-defined function with the whole matrix as argument
is not equivalent. Of course, a for loop stepping through the rows would do the
job, but more slowly.

Function vapply() is not as frequently used, but can sometimes be useful. Here
is a possible way of obtaining the mean and variance of each member vector in a
list of vectors.

set.seed(123456)
a.list <- lapply(rep(4, 5), runif)
a.list

## [[1]]
## [1] 0.7977843 0.7535651 0.3912557 0.3415567


##
## [[2]]
## [1] 0.36129411 0.19834473 0.53485796 0.09652624
##
## [[3]]
## [1] 0.9878469 0.1675695 0.7979891 0.5937940
##
## [[4]]
## [1] 0.9053100 0.8808486 0.9938366 0.8959563
##
## [[5]]
## [1] 0.8786434 0.1976057 0.3349936 0.7772063

mean_and_var <- function(x, na.rm = FALSE) {


c(mean(x, na.rm = na.rm), var(x, na.rm = na.rm))
}
values <- vapply(X = a.list,
FUN = mean_and_var,
FUN.VALUE = c(mean = 0, var = 0),
na.rm = TRUE)
class(values)

## [1] "matrix"

dim(values)

## [1] 2 5

values

## [,1] [,2] [,3] [,4]


## mean 0.57104045 0.29775576 0.6367999 0.918987886
## var 0.05657113 0.03687682 0.1237476 0.002591487
## [,5]
## mean 0.5471123
## var 0.1100018
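If instead we wanted means computed across the member vectors at each index position—“parallel” means—one approach is to first bind the vectors into a matrix and then use apply() over columns (a sketch assuming all member vectors have equal length):

```r
set.seed(123456)
a.list <- lapply(rep(4, 5), runif)
# one row per list member, one column per index position
a.mat <- do.call(rbind, a.list)
# one mean per index position, computed across the five member vectors
apply(a.mat, MARGIN = 2, FUN = mean)
```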

5.6 The grammar of data manipulation of the ‘tidyverse’

Packages in the ‘tidyverse’ define more user-friendly apply functions, which I describe
in the next sections. These packages do much more than provide replacements
for R’s apply functions. They define a “grammar of data” for data manipulations like
transformations and summaries, based on the same philosophy as that behind the
grammar of graphics on which package ‘ggplot2’ is based (see Chapter 6 starting on
page 179).
To make the problem of manipulating data tractable and consistent, the first step
is to settle on a certain way of storing data. In R’s data frames, variables are most
frequently in columns and cases in rows. This is a good start, and also frequently
used in other software. The first major inconsistency across programs, and to some
extent among R packages, is how to store data for sequential or repeated measure-
ments. Do the rows represent measuring events, or measured objects? In R, data
from individual measuring events are in most cases stored as rows, with those rows
that correspond to the same object or individual encoded with an index variable. Further-
more, say in a time sequence, the times or dates are stored in an additional variable.
R’s approach is much more flexible in that it does not assume that observations on dif-
ferent individuals are synchronized. Wickham (2014c) has coined the name
“tidy data” for data organized in this manner.
Hadley Wickham, together with collaborators, has developed a set of R tools for
the manipulation, plotting and analysis of tidy data, thoroughly described in the re-
cently published book R for Data Science (Wickham and Grolemund 2017). The book
Mastering Software Development in R (Peng et al. 2017) covers data manipulation in
the first chapters before moving on to programming. Here we give an overview of the
components of the ‘tidyverse’ grammar of data manipulation. The book R for Data
Science and the documentation included with the various packages should be con-
sulted for a deeper and more detailed discussion. Aspects of the ‘tidyverse’ related
to reading and writing data files (‘readr’, ‘readxl’, and ‘xml2’) have been discussed in
earlier sections of this chapter, while the use of (‘ggplot2’) for plotting is described
in later chapters.

5.6.1 Better data frames

Package ‘tibble’ defines an improved class tibble that can be used in place of data
frames. Changes are several, including differences in default behaviour of both con-
structors and methods. Objects of class tibble can nonetheless be used as argu-
ments for most functions that expect data frames as input.

= In their first incarnation, the name for tibble was data_frame (with an under-
score instead of a dot). The old name is still recognized, but it is better to only use
tibble() to avoid confusion. One should be aware that although the constructor
tibble() and conversion function as.tibble() , as well as the test is.tibble() ,
use the name tibble , the class attribute is named tbl .


my.tb <- tibble(numbers = 1:3)


is.tibble(my.tb)

## [1] TRUE

class(my.tb)

## [1] "tbl_df" "tbl" "data.frame"

Furthermore, by necessity, to support tibbles based on different underlying


data sources a further derived class is needed. In our example, as our tibble has
an underlying data.frame class, the most derived class of my.tb is tbl_df .

We start with the constructor and conversion methods. For this we will define our
own diagnosis function.

show_classes <- function(x) {


cat(
paste(paste(class(x)[1],
"containing:"),
paste(names(x),
sapply(x, class), collapse = ", ", sep = ": "),
sep = "\n")
)
}

In the next two chunks we can see some of the differences. The tibble()
constructor does not by default convert character data into factors, while the
data.frame() constructor does.

my.df <- data.frame(codes = c("A", "B", "C"), numbers = 1:3, integers = 1L:3L)
is.data.frame(my.df)

## [1] TRUE

is.tibble(my.df)

## [1] FALSE

show_classes(my.df)

## data.frame containing:
## codes: factor, numbers: integer, integers: integer

Tibbles are data frames—or more formally class tibble is derived from class
data.frame . However, data frames are not tibbles.


my.tb <- tibble(codes = c("A", "B", "C"), numbers = 1:3, integers = 1L:3L)
is.data.frame(my.tb)

## [1] TRUE

is.tibble(my.tb)

## [1] TRUE

show_classes(my.tb)

## tbl_df containing:
## codes: character, numbers: integer, integers: integer

The print() method for tibbles, overrides the one defined for data frames.

print(my.df)

## codes numbers integers


## 1 A 1 1
## 2 B 2 2
## 3 C 3 3

print(my.tb)

## # A tibble: 3 × 3
## codes numbers integers
## <chr> <int> <int>
## 1 A 1 1
## 2 B 2 2
## 3 C 3 3

U The main difference is in how tibbles and data frames are printed when they
have many rows. Construct a data frame and an equivalent tibble with at least 50
rows, and then test how the output looks when they are printed.

Data frames can be converted into tibbles with as.tibble() .

my_conv.tb <- as.tibble(my.df)


is.data.frame(my_conv.tb)

## [1] TRUE

is.tibble(my_conv.tb)

## [1] TRUE


show_classes(my_conv.tb)

## tbl_df containing:
## codes: factor, numbers: integer, integers: integer

my_conv.df <- as.data.frame(my.tb)


is.data.frame(my_conv.df)

## [1] TRUE

is.tibble(my_conv.df)

## [1] FALSE

show_classes(my_conv.df)

## data.frame containing:
## codes: character, numbers: integer, integers: integer

U Look carefully at the result of the conversions. Why do we now have a data
frame with codes as character and a tibble with codes as a factor ?

 Not all conversion functions work consistently when converting from a de-
rived class into its parent. The reason for this is disagreement among authors
on what the correct behaviour is, based on logic and theory. You are not likely to
be hit by this problem frequently, but it can be difficult to diagnose.
We have already seen that calling as.data.frame() on a tibble strips the de-
rived class attributes, returning a data frame. We now look at the whole contents
of the "class" attribute to better exemplify the problem. We also test the two
objects for equality, in two different ways. Using the operator == tests for equi-
valent objects, i.e. objects that contain the same data. Using identical() tests
that objects are exactly the same, including having equal attributes and equal
class attributes.


class(my.tb)

## [1] "tbl_df" "tbl" "data.frame"

class(my_conv.df)

## [1] "data.frame"

my.tb == my_conv.df

## codes numbers integers


## [1,] TRUE TRUE TRUE
## [2,] TRUE TRUE TRUE
## [3,] TRUE TRUE TRUE

identical(my.tb, my_conv.df)

## [1] FALSE

Now we derive from a tibble, and then attempt a conversion back into a tibble.

my.xtb <- my.tb


class(my.xtb) <- c("xtb", class(my.xtb))
class(my.xtb)

## [1] "xtb" "tbl_df" "tbl"


## [4] "data.frame"

my_conv_x.tb <- as_tibble(my.xtb)


class(my_conv_x.tb)

## [1] "xtb" "tbl_df" "tbl"


## [4] "data.frame"

my.xtb == my_conv_x.tb

## codes numbers integers


## [1,] TRUE TRUE TRUE
## [2,] TRUE TRUE TRUE
## [3,] TRUE TRUE TRUE

identical(my.xtb, my_conv_x.tb)

## [1] TRUE

The two viewpoints on conversion functions are as follows: 1) the conversion
function should return an object of its corresponding class, even if the argument
is an object of a derived class, stripping the derived class; 2) if the object is of
the class to be converted to, including objects of derived classes, then it should
remain untouched. Base R follows, as far as I have been able to work out, approach
1). Packages in the ‘tidyverse’ follow approach 2). If in doubt about the behaviour
of some function, then you need to do a test similar to the one I have presented in
the chunks in this box.

There are additional important differences between the constructors tibble() and
data.frame() . One of them is that variables (“columns”) being defined can be used
in the definition of subsequent variables.

tibble(a = 1:5, b = 5:1, c = a + b, d = letters[a + 1])

## # A tibble: 5 × 4
## a b c d
## <int> <int> <int> <chr>
## 1 1 5 6 b
## 2 2 4 6 c
## 3 3 3 6 d
## 4 4 2 6 e
## 5 5 1 6 f

U What is the behaviour if you replace tibble() by data.frame() in the state-


ment above?

Furthermore, while data frame columns are required to be vectors, columns of


tibbles can also be lists.

tibble(a = 1:5, b = 5:1, c = list("a", 2, 3, 4, 5))

## # A tibble: 5 × 3
## a b c
## <int> <int> <list>
## 1 1 5 <chr [1]>
## 2 2 4 <dbl [1]>
## 3 3 3 <dbl [1]>
## 4 4 2 <dbl [1]>
## 5 5 1 <dbl [1]>

This even allows a list of lists as a variable, or a list of vectors.

tibble(a = 1:5, b = 5:1, c = list("a", 1:2, 0:3, letters[1:3], letters[3:1]))

## # A tibble: 5 × 3
## a b c


## <int> <int> <list>


## 1 1 5 <chr [1]>
## 2 2 4 <int [2]>
## 3 3 3 <int [4]>
## 4 4 2 <chr [3]>
## 5 5 1 <chr [3]>
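Members of a list column are extracted with the usual double-bracket indexing, first selecting the column and then the member:

```r
library(tibble)
a.tb <- tibble(a = 1:3, c = list("a", 1:2, 0:3))
# second member of list column c, an integer vector of length two
a.tb$c[[2]]
```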

5.6.2 Tidying up data

In later sections of this and subsequent chapters we assume that available data is in
a tidy arrangement, in which rows correspond to measurement events, and columns
correspond to values for different variables measured at a given measuring event, or
descriptors of groups or permanent features of the measured units. Real-world data
can be quite messy, so frequently the first task in an analysis is to make data in ad-
hoc or irregular formats “tidy”. Please consult the vignettes and other documentation
of package ‘tidyr’ for details.
In most cases using function gather() is the easiest way of converting data in
“wide” form into “long”, or tidy, form. We will use the iris data
set included with R. We print iris as a tibble for the nicer formatting of the screen
output, but we do not save the result. We use gather() to obtain a long-form tibble.
Be aware that in this case, the original wide form would in some cases be best for
further analysis.
We first convert iris into a tibble to more easily control the length of output.

data(iris)
iris.tb <- as.tibble(iris)
iris.tb

## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length
## <dbl> <dbl> <dbl>
## 1 5.1 3.5 1.4
## 2 4.9 3.0 1.4
## 3 4.7 3.2 1.3
## 4 4.6 3.1 1.5
## 5 5.0 3.6 1.4
## 6 5.4 3.9 1.7
## 7 4.6 3.4 1.4
## 8 5.0 3.4 1.5
## 9 4.4 2.9 1.4
## 10 4.9 3.1 1.5
## # ... with 140 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fctr>


By comparing iris.tb above with long_iris below we can appreciate how


gather() transformed its input.

long_iris <- gather(iris.tb, key = part, value = dimension, -Species)


long_iris

## # A tibble: 600 × 3
## Species part dimension
## <fctr> <chr> <dbl>
## 1 setosa Sepal.Length 5.1
## 2 setosa Sepal.Length 4.9
## 3 setosa Sepal.Length 4.7
## 4 setosa Sepal.Length 4.6
## 5 setosa Sepal.Length 5.0
## 6 setosa Sepal.Length 5.4
## 7 setosa Sepal.Length 4.6
## 8 setosa Sepal.Length 5.0
## 9 setosa Sepal.Length 4.4
## 10 setosa Sepal.Length 4.9
## # ... with 590 more rows

U To better understand why I added -Species as an argument, edit the code


removing it, and execute the statement to see how the returned tibble is different.

5.6.3 Row-wise manipulations

We can calculate derived quantities by combining different variables measured on the


same measuring unit—i.e. calculations within a single row of a data frame or tibble.
In this case there are two options: we add new variables (columns) retaining existing
ones using mutate() , or we assemble a new tibble containing only the columns we
explicitly specify using transmute() .
Continuing with the example from the previous section, we most likely would like to
split the values in variable part into plant_part and part_dim . We use mutate()
from ‘dplyr’ and str_extract() from ‘stringr’. We use regular expressions as argu-
ments passed to pattern . We do not show it here, but mutate() can be used with
variables of any mode , and calculations can involve values from several columns. It
is even possible to operate on values applying a lag, in other words using rows
displaced relative to the current one. As shown in the example in section 5.9.2 on
page 176, within a single call to mutate() values calculated first can be used in the
calculations for later variables.
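A lagged computation within mutate() can be sketched with function lag() from ‘dplyr’ (a toy tibble, unrelated to the iris data):

```r
library(tibble)
library(dplyr)
d.tb <- tibble(x = c(1, 3, 6, 10))
# difference between each value and the preceding one; the first row gets NA
mutate(d.tb, step = x - lag(x))
```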


long_iris <- mutate(long_iris,


plant_part = str_extract(part, "^[:alpha:]*"),
part_dim = str_extract(part, "[:alpha:]*$"))
long_iris

## # A tibble: 600 × 5
## Species part dimension plant_part
## <fctr> <chr> <dbl> <chr>
## 1 setosa Sepal.Length 5.1 Sepal
## 2 setosa Sepal.Length 4.9 Sepal
## 3 setosa Sepal.Length 4.7 Sepal
## 4 setosa Sepal.Length 4.6 Sepal
## 5 setosa Sepal.Length 5.0 Sepal
## 6 setosa Sepal.Length 5.4 Sepal
## 7 setosa Sepal.Length 4.6 Sepal
## 8 setosa Sepal.Length 5.0 Sepal
## 9 setosa Sepal.Length 4.4 Sepal
## 10 setosa Sepal.Length 4.9 Sepal
## # ... with 590 more rows, and 1 more variables:
## # part_dim <chr>

In the next few chunks we print the returned values rather than saving them in
variables. In most cases in practice one will combine these functions into a “pipe”
using operator %>% (see section 5.7 on page 165, and for more realistic examples,
section 5.9 starting on page 174).
Function arrange() is used for sorting the rows—it makes sorting a data frame sim-
pler than using sort() and order() , although these two base R functions are more
versatile.

arrange(long_iris, Species, plant_part, part_dim)

## # A tibble: 600 × 5
## Species part dimension plant_part
## <fctr> <chr> <dbl> <chr>
## 1 setosa Petal.Length 1.4 Petal
## 2 setosa Petal.Length 1.4 Petal
## 3 setosa Petal.Length 1.3 Petal
## 4 setosa Petal.Length 1.5 Petal
## 5 setosa Petal.Length 1.4 Petal
## 6 setosa Petal.Length 1.7 Petal
## 7 setosa Petal.Length 1.4 Petal
## 8 setosa Petal.Length 1.5 Petal
## 9 setosa Petal.Length 1.4 Petal
## 10 setosa Petal.Length 1.5 Petal
## # ... with 590 more rows, and 1 more variables:
## # part_dim <chr>


Function filter() is used to select a subset of rows—similar to subset() but with a
syntax consistent with that of other functions in the ‘tidyverse’.

filter(long_iris, plant_part == "Petal")

## # A tibble: 300 × 5
## Species part dimension plant_part
## <fctr> <chr> <dbl> <chr>
## 1 setosa Petal.Length 1.4 Petal
## 2 setosa Petal.Length 1.4 Petal
## 3 setosa Petal.Length 1.3 Petal
## 4 setosa Petal.Length 1.5 Petal
## 5 setosa Petal.Length 1.4 Petal
## 6 setosa Petal.Length 1.7 Petal
## 7 setosa Petal.Length 1.4 Petal
## 8 setosa Petal.Length 1.5 Petal
## 9 setosa Petal.Length 1.4 Petal
## 10 setosa Petal.Length 1.5 Petal
## # ... with 290 more rows, and 1 more variables:
## # part_dim <chr>

Function slice() is used to select a subset of rows based on their positions—in base R
this would be done with positional indexes within [ , ] .

slice(long_iris, 1:5)

## # A tibble: 5 × 5
## Species part dimension plant_part
## <fctr> <chr> <dbl> <chr>
## 1 setosa Sepal.Length 5.1 Sepal
## 2 setosa Sepal.Length 4.9 Sepal
## 3 setosa Sepal.Length 4.7 Sepal
## 4 setosa Sepal.Length 4.6 Sepal
## 5 setosa Sepal.Length 5.0 Sepal
## # ... with 1 more variables: part_dim <chr>
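The base R equivalent with positional indexes can be sketched on a small made-up data frame:

```r
df <- data.frame(x = 1:10, y = letters[1:10])
# Equivalent of slice(df, 1:5): select rows by position with [ , ].
first_five <- df[1:5, ]
nrow(first_five)  # 5
```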

Function select() is used to select a subset of columns—in base R this requires selection with subindexes. In the first example we remove one column by name.

select(long_iris, -part)

## # A tibble: 600 × 4
## Species dimension plant_part part_dim
## <fctr> <dbl> <chr> <chr>
## 1 setosa 5.1 Sepal Length
## 2 setosa 4.9 Sepal Length
## 3 setosa 4.7 Sepal Length
## 4 setosa 4.6 Sepal Length
## 5 setosa 5.0 Sepal Length


## 6 setosa 5.4 Sepal Length


## 7 setosa 4.6 Sepal Length
## 8 setosa 5.0 Sepal Length
## 9 setosa 4.4 Sepal Length
## 10 setosa 4.9 Sepal Length
## # ... with 590 more rows

In addition, select() , like other functions in 'dplyr', can be used together with functions starts_with() , ends_with() , contains() , and matches() to select groups of columns to be retained or removed. For this example we use iris.tb , based on R's iris , instead of our long_iris .

select(iris.tb, -starts_with("Sepal"))

## # A tibble: 150 × 3
## Petal.Length Petal.Width Species
## <dbl> <dbl> <fctr>
## 1 1.4 0.2 setosa
## 2 1.4 0.2 setosa
## 3 1.3 0.2 setosa
## 4 1.5 0.2 setosa
## 5 1.4 0.2 setosa
## 6 1.7 0.4 setosa
## 7 1.4 0.3 setosa
## 8 1.5 0.2 setosa
## 9 1.4 0.2 setosa
## 10 1.5 0.1 setosa
## # ... with 140 more rows

select(iris.tb, Species, matches("pal"))

## # A tibble: 150 × 3
## Species Sepal.Length Sepal.Width
## <fctr> <dbl> <dbl>
## 1 setosa 5.1 3.5
## 2 setosa 4.9 3.0
## 3 setosa 4.7 3.2
## 4 setosa 4.6 3.1
## 5 setosa 5.0 3.6
## 6 setosa 5.4 3.9
## 7 setosa 4.6 3.4
## 8 setosa 5.0 3.4
## 9 setosa 4.4 2.9
## 10 setosa 4.9 3.1
## # ... with 140 more rows
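In base R a comparable selection by name pattern can be sketched with grepl() applied to the column names, here using the built-in iris data frame:

```r
# Keep the columns whose names do not start with "Sepal".
keep <- !grepl("^Sepal", names(iris))
no_sepal <- iris[ , keep]
names(no_sepal)  # "Petal.Length" "Petal.Width" "Species"
```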

Function rename() is used to rename columns—in base R this requires the use of names() and names<-() and a way of matching the old name.


rename(long_iris, dim = dimension)

## # A tibble: 600 × 5
## Species part dim plant_part part_dim
## <fctr> <chr> <dbl> <chr> <chr>
## 1 setosa Sepal.Length 5.1 Sepal Length
## 2 setosa Sepal.Length 4.9 Sepal Length
## 3 setosa Sepal.Length 4.7 Sepal Length
## 4 setosa Sepal.Length 4.6 Sepal Length
## 5 setosa Sepal.Length 5.0 Sepal Length
## 6 setosa Sepal.Length 5.4 Sepal Length
## 7 setosa Sepal.Length 4.6 Sepal Length
## 8 setosa Sepal.Length 5.0 Sepal Length
## 9 setosa Sepal.Length 4.4 Sepal Length
## 10 setosa Sepal.Length 4.9 Sepal Length
## # ... with 590 more rows
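The base R approach alluded to above can be sketched as follows, with a small made-up data frame:

```r
d <- data.frame(dimension = 1:3, other = letters[1:3])
# Match the old name within names() and assign the new one in place.
names(d)[names(d) == "dimension"] <- "dim"
names(d)  # "dim" "other"
```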

The first advantage a user sees of these functions is the completeness of the set of
operations supported and the symmetry and consistency among the different func-
tions. A second advantage is that almost all the functions are defined not only for
objects of class tibble , but also for objects of class data.table and for accessing
SQL based databases with the same syntax. The functions are also optimized for fast
performance.

5.6.4 Group-wise manipulations

Another important operation is to summarize quantities by groups of rows. Contrary to base R, the grammar of data manipulation splits this operation into two steps: the setting of the grouping, and the calculation of summaries. This simplifies the code, making it more easily understandable compared to the approach of base R's aggregate() , and it also makes it easier to summarize several columns in a single operation.
The first step is to use group_by() to “tag” a tibble with the grouping. We create a
tibble and then convert it into a grouped tibble.

my.tb <- tibble(numbers = 1:9, letters = rep(letters[1:3], 3))


my_gr.tb <- group_by(my.tb, letters)

Once we have a grouped tibble, function summarise() will recognize the grouping
and use it when the summary values are calculated.

summarise(my_gr.tb,
mean_numbers = mean(numbers),
median_numbers = median(numbers),
n = n())


## # A tibble: 3 × 4
## letters mean_numbers median_numbers n
## <chr> <dbl> <int> <int>
## 1 a 4 4 3
## 2 b 5 5 3
## 3 c 6 6 3
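For comparison, a sketch of the same group-wise mean computed with base R's aggregate() , where grouping and summarising happen in a single call and each summary function needs a separate call:

```r
d <- data.frame(numbers = 1:9, letters = rep(letters[1:3], 3))
# The formula gives the grouping; FUN is applied to each group.
group_means <- aggregate(numbers ~ letters, data = d, FUN = mean)
group_means  # means 4, 5 and 6 for groups a, b and c
```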

 How is grouping implemented for data-frame-based tibbles? In our case, as our tibble belongs to class tbl_df , grouping adds grouped_df as the most derived class. It also adds several attributes with the grouping information in a format suitable for fast selection of group members.

my.tb <- tibble(numbers = 1:9, letters = rep(letters[1:3], 3))


class(my.tb)

## [1] "tbl_df" "tbl" "data.frame"

my_gr.tb <- group_by(my.tb, letters)


class(my_gr.tb)

## [1] "grouped_df" "tbl_df" "tbl"


## [4] "data.frame"

U Use function attributes() to compare the attributes of my.tb and my_gr.tb . Try to see how the grouping information is stored in the attributes.

5.7 Pipes and tees

Pipes have been part of Unix shells starting from the early days of Unix in 1973. By the early 1980s the idea had led to the development of many tools to be used in sh connected by pipes (Kernighan and Plauger 1981). Shells developed more recently, like the Korn shell, ksh, and bash, maintained support for this approach (Rosenblatt 1993). The idea behind the concept of a data pipe is that one can directly use the output from one tool as input for the tool doing the next stage in the processing. These tools are simple programs that each do a well-defined operation, such as ls or cat—from which the names of the equivalent functions in R were coined.
Apple’s OS X is based on Unix, and allows the use of pipes at the command prompt


and in shell scripts. Linux uses the tools from the GNU project, which to a large extent replicate and extend the capabilities of the original Unix tools, and it also natively supports pipes equivalent to those in Unix. In Windows, support for pipes at the command prompt was initially partial. Currently, Windows' PowerShell supports the use of pipes, and some Linux shells are available in versions that can be used under MS-Windows.
Within R code, the support for pipes is not native, but is instead implemented by some recent packages. Most of the packages in the 'tidyverse' support this new syntax through the use of package 'magrittr'. The use of pipes has advantages and disadvantages. Pipes are at their best when connecting small functions with rather simple inputs and outputs. They can, however, be difficult to debug, a problem that counterbalances the advantages of the clear and concise notation achieved.

5.7.1 Pipes and tees

The pipe operator %>% is defined in package 'magrittr', but imported and re-exported by other packages in the 'tidyverse'. The idea is that the value returned by a function is passed by the pipe operator as the first argument to the next function in the "pipeline".
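The rewriting done by the operator can be seen in a minimal sketch (assuming 'magrittr' or a 'tidyverse' package is loaded): x %>% f(y) is equivalent to f(x, y) .

```r
library(magrittr)
# The left-hand side of %>% becomes the first argument of the call
# on the right-hand side.
piped  <- c(1, 2, 3) %>% sum(10)
direct <- sum(c(1, 2, 3), 10)
piped == direct  # TRUE, both are 16
```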
We can chain some of the examples in the previous section into a “pipe”.

tibble(numbers = 1:9, letters = rep(letters[1:3], 3)) %>%


group_by(letters) %>%
summarise(mean_numbers = mean(numbers),
var_numbers = var(numbers),
n = n())

## # A tibble: 3 × 4
## letters mean_numbers var_numbers n
## <chr> <dbl> <dbl> <int>
## 1 a 4 9 3
## 2 b 5 9 3
## 3 c 6 9 3

If we want to save the returned value, to me it feels more natural to use a left-to-right assignment, although the usual right-to-left one can also be used.

tibble(numbers = 1:9, letters = rep(letters[1:3], 3)) %>%


group_by(letters) %>%
summarise(mean_numbers = mean(numbers),
var_numbers = var(numbers),
n = n()) -> summary.tb
summary.tb

## # A tibble: 3 × 4
## letters mean_numbers var_numbers n


## <chr> <dbl> <dbl> <int>


## 1 a 4 9 3
## 2 b 5 9 3
## 3 c 6 9 3

summary.tb <-
tibble(numbers = 1:9, letters = rep(letters[1:3], 3)) %>%
group_by(letters) %>%
summarise(mean_numbers = mean(numbers),
var_numbers = var(numbers),
n = n())
summary.tb

## # A tibble: 3 × 4
## letters mean_numbers var_numbers n
## <chr> <dbl> <dbl> <int>
## 1 a 4 9 3
## 2 b 5 9 3
## 3 c 6 9 3

As print() returns its input, we can also include it in the middle of a pipe as a
simple way of visualizing what takes place at each step.

tibble(numbers = 1:9, letters = rep(letters[1:3], 3)) %>%


print() %>%
group_by(letters) %>%
summarise(mean_numbers = mean(numbers),
var_numbers = var(numbers),
n = n()) %>%
print() -> summary.tb

## # A tibble: 9 × 2
## numbers letters
## <int> <chr>
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 a
## 5 5 b
## 6 6 c
## 7 7 a
## 8 8 b
## 9 9 c
## # A tibble: 3 × 4
## letters mean_numbers var_numbers n
## <chr> <dbl> <dbl> <int>
## 1 a 4 9 3
## 2 b 5 9 3
## 3 c 6 9 3


 Why and how can we insert a call to print() in the middle of a pipe? An extremely simple example, with a twist, follows.

print("a") %>% print()

## [1] "a"
## [1] "a"

The example above is equivalent to.

print(print("a"))

## [1] "a"
## [1] "a"

The examples above are somewhat surprising but instructive. Function print() returns a value, its first argument, but invisibly—see the help for invisible() . Otherwise default printing would result in the value being printed twice at the R prompt. We can demonstrate this by saving the value returned by print() .

a <- print("a")

## [1] "a"

class(a)

## [1] "character"

a

## [1] "a"

b <- print(2)

## [1] 2

class(b)

## [1] "numeric"

b

## [1] 2


U Assemble different pipes, predict what will be the output, and check your
prediction by executing the code.

Although %>% is the most frequently used pipe operator, there are some additional
ones available. We start by creating a tibble.

my.tb <- tibble(numbers = 1:9, letters = rep(letters[1:3], 3))

We first demonstrate that a pipe can have a variable at its head, in this case a tibble, using the same operator as we used above.

my.tb %>%
group_by(letters) %>%
summarise(mean_numbers = mean(numbers),
var_numbers = var(numbers),
n = n())

## # A tibble: 3 × 4
## letters mean_numbers var_numbers n
## <chr> <dbl> <dbl> <int>
## 1 a 4 9 3
## 2 b 5 9 3
## 3 c 6 9 3

my.tb

## # A tibble: 9 × 2
## numbers letters
## <int> <chr>
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 a
## 5 5 b
## 6 6 c
## 7 7 a
## 8 8 b
## 9 9 c

We could save the output of the pipe to the same variable at the head of the pipe
by explicitly using the same name, but operator %<>% does this directly.

my.tb %<>%
group_by(letters) %>%
summarise(mean_numbers = mean(numbers),
var_numbers = var(numbers),


n = n())
my.tb

## # A tibble: 3 × 4
## letters mean_numbers var_numbers n
## <chr> <dbl> <dbl> <int>
## 1 a 4 9 3
## 2 b 5 9 3
## 3 c 6 9 3

A few additional operators defined in 'magrittr' are not re-exported by packages in the 'tidyverse', so their use requires 'magrittr' to be loaded.
When a function has a side effect, like print() displaying its input, and passes the input unchanged as the returned value, we do not need to split the flow of processing through a pipe. In real house plumbing, when a split is needed a "tee"-shaped pipe joint is used. This is where the name tee as used in programming originates. Operator %T>% passes along not the value returned by a function, but instead the value passed to it as input.
As in the previous chunk we assigned the summaries to my.tb , we need to re-create
it to run the next example.

my.tb <- tibble(numbers = 1:9, letters = rep(letters[1:3], 3))

sump <- function(x) {print("hello"); return(NULL)}


my.tb %>%
group_by(letters) %>%
summarise(mean_numbers = mean(numbers),
var_numbers = var(numbers),
n = n()) %T>%
sump() -> summary.tb

## [1] "hello"

We can see that the value saved in summary.tb is the one returned by summarize()
rather than the one returned by sump() .

U Look up the help page for operator %$% and write an example of its use.

5.8 Joins

Joins allow us to combine two data sources which share some variables. The variables
in common are used to match the corresponding rows before adding columns from


both sources together. There are several join functions in ‘dplyr’. They differ mainly
in how they handle mismatched rows.
We create here some artificial data to demonstrate the use of these functions. We
will create two small tibbles, with one column in common and one mismatched row
in each.

first.tb <- tibble(idx = c(1:4, 5), values1 = "a")


second.tb <- tibble(idx = c(1:4, 6), values2 = "b")

Here we apply all the join functions exported by 'dplyr'— full_join() , left_join() , right_join() , inner_join() , semi_join() , and anti_join() —to the two tibbles, each time swapping their order as input to help make the differences in behaviour clear.

full_join(first.tb, second.tb)

## Joining, by = "idx"

## # A tibble: 6 × 3
## idx values1 values2
## <dbl> <chr> <chr>
## 1 1 a b
## 2 2 a b
## 3 3 a b
## 4 4 a b
## 5 5 a <NA>
## 6 6 <NA> b

full_join(second.tb, first.tb)

## Joining, by = "idx"

## # A tibble: 6 × 3
## idx values2 values1
## <dbl> <chr> <chr>
## 1 1 b a
## 2 2 b a
## 3 3 b a
## 4 4 b a
## 5 6 b <NA>
## 6 5 <NA> a

left_join(first.tb, second.tb)

## Joining, by = "idx"

## # A tibble: 5 × 3


## idx values1 values2


## <dbl> <chr> <chr>
## 1 1 a b
## 2 2 a b
## 3 3 a b
## 4 4 a b
## 5 5 a <NA>

left_join(second.tb, first.tb)

## Joining, by = "idx"

## # A tibble: 5 × 3
## idx values2 values1
## <dbl> <chr> <chr>
## 1 1 b a
## 2 2 b a
## 3 3 b a
## 4 4 b a
## 5 6 b <NA>

right_join(first.tb, second.tb)

## Joining, by = "idx"

## # A tibble: 5 × 3
## idx values1 values2
## <dbl> <chr> <chr>
## 1 1 a b
## 2 2 a b
## 3 3 a b
## 4 4 a b
## 5 6 <NA> b

right_join(second.tb, first.tb)

## Joining, by = "idx"

## # A tibble: 5 × 3
## idx values2 values1
## <dbl> <chr> <chr>
## 1 1 b a
## 2 2 b a
## 3 3 b a
## 4 4 b a
## 5 5 <NA> a


inner_join(first.tb, second.tb)

## Joining, by = "idx"

## # A tibble: 4 × 3
## idx values1 values2
## <dbl> <chr> <chr>
## 1 1 a b
## 2 2 a b
## 3 3 a b
## 4 4 a b

inner_join(second.tb, first.tb)

## Joining, by = "idx"

## # A tibble: 4 × 3
## idx values2 values1
## <dbl> <chr> <chr>
## 1 1 b a
## 2 2 b a
## 3 3 b a
## 4 4 b a

semi_join(first.tb, second.tb)

## Joining, by = "idx"

## # A tibble: 4 × 2
## idx values1
## <dbl> <chr>
## 1 1 a
## 2 2 a
## 3 3 a
## 4 4 a

semi_join(second.tb, first.tb)

## Joining, by = "idx"

## # A tibble: 4 × 2
## idx values2
## <dbl> <chr>
## 1 1 b
## 2 2 b
## 3 3 b
## 4 4 b


anti_join(first.tb, second.tb)

## Joining, by = "idx"

## # A tibble: 1 × 2
## idx values1
## <dbl> <chr>
## 1 5 a

anti_join(second.tb, first.tb)

## Joining, by = "idx"

## # A tibble: 1 × 2
## idx values2
## <dbl> <chr>
## 1 6 b
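In all the examples above the variables used for matching were detected automatically. When the key columns are named differently in the two tibbles, the by argument can map them explicitly; a minimal sketch with made-up data:

```r
library(dplyr)
a.tb <- tibble(id = 1:3, values_a = c("x", "y", "z"))
b.tb <- tibble(key = 2:4, values_b = c("p", "q", "r"))
# Match a.tb$id against b.tb$key instead of relying on shared names.
joined <- inner_join(a.tb, b.tb, by = c("id" = "key"))
joined  # two rows, for id values 2 and 3
```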

See section 5.9.1 on page 174 for a realistic example of the use of a join.

5.9 Extended examples

5.9.1 Well-plate data

Our first example attempts to simulate data arranged in rows and columns based on
spatial position, such as in a well plate. We will use pseudo-random numbers for the
fake data—i.e. the measured response.

well_data.tb <-
as.tibble(matrix(rnorm(50),
nrow = 5,
dimnames = list(as.character(1:5), LETTERS[1:10])))
# drops names of rows
well_data.tb <-
add_column(well_data.tb, row_ids = 1:5, .before = 1)

In addition, we create a matrix of fake treatment ids.

well_ids.tb <-
as.tibble(matrix(sample(letters, size = 50, replace = TRUE),
nrow = 5,
dimnames = list(as.character(1:5), LETTERS[1:10])))
# drops names of rows
well_ids.tb <-
add_column(well_ids.tb, row_ids = 1:5, .before = 1)


As we will combine them, the coordinates should be encoded consistently in the two objects. I will take the approach of first converting each tibble into a tidy tibble. We use function gather() from package 'tidyr'.

well_data.ttb <- gather(well_data.tb,


key = col_ids, value = reading,
-row_ids)
well_ids.ttb <- gather(well_ids.tb,
key = col_ids, value = group,
-row_ids)

Now we need to join the two tibbles into a single one. In this case, as we know that the row order in the two tibbles is matched, we could simply use cbind() . However, full_join() from package 'dplyr' provides a more general and less error-prone alternative, as it can do the matching based on the values of any variables common to both tibbles—by default all the variables in common, as needed here. We use a "pipe" through which, after the join, we remove the ids (assuming they are no longer needed), sort the rows by group, and finally save the result to a new "tidy" tibble.

full_join(well_ids.ttb, well_data.ttb) %>%


select(-row_ids, -col_ids) %>%
arrange(group) -> well.tb

## Joining, by = c("row_ids", "col_ids")

well.tb

## # A tibble: 50 × 2
## group reading
## <chr> <dbl>
## 1 a 0.9305284
## 2 a -1.1298596
## 3 a 1.0859498
## 4 b -0.6204916
## 5 b 1.0439944
## 6 b -0.9659226
## 7 b 1.5372426
## 8 c -0.1219225
## 9 d -0.2527467
## 10 f -1.1139499
## # ... with 40 more rows

We finally calculate summaries by group using function summarise() , and store the tibble containing the summaries in variable well_summaries.tb .

group_by(well.tb, group) %>%


summarise(avg_read = mean(reading),


var_read = var(reading),
count = n()) -> well_summaries.tb
well_summaries.tb

## # A tibble: 23 × 4
## group avg_read var_read count
## <chr> <dbl> <dbl> <int>
## 1 a 0.29553954 1.52986101 3
## 2 b 0.24870571 1.50787915 4
## 3 c -0.12192251 NA 1
## 4 d -0.25274669 NA 1
## 5 f -0.52219793 1.03433077 4
## 6 g -0.25139495 0.37467384 4
## 7 h 0.03399791 0.55269991 2
## 8 i 0.01658600 NA 1
## 9 k -0.75397477 NA 1
## 10 l -0.24281301 0.06127658 3
## # ... with 13 more rows

We now save the tibbles into an R data file with function save() .

save(well.tb, well_summaries.tb, file = "data/well-data.rda")

5.9.2 Seedling morphology

We use here data from an experiment on the effects of spacing in the nursery between silver birch seedlings on their morphology. We take one variable from a larger study (Aphalo and Rikala 2006): the leaf area at different heights above the ground, in 10 cm increments. Area was measured separately for leaves on the main stem and leaves on branches.
In this case, as the columns are badly aligned in the original text file, we use
read.table() from base R, rather than read_table() from ‘readr’. Afterwards we
heavily massage the data into shape so as to obtain a tidy tibble with the total leaf
area per height segment per plant. The file contains additional data that we discard
for this example.

as.tibble(read.table("data/areatable.dat", header = TRUE)) %>%


filter(row %in% 4:8) %>%
select(code, tray, row, starts_with("a.")) %>%
gather(key = sample, value = area, -tray, -row, -code) %>%
mutate(segment = str_extract(sample, "[0-9]{1,2}"),
part = ifelse(str_extract(sample, "[bm]") == "b",
"branch", "main")) %>%
group_by(tray, code, row, segment) %>%
summarise(area_tot = sum(area)) -> birch.tb


birch.tb

## Source: local data frame [240 x 5]


## Groups: tray, code, row [?]
##
## tray code row segment area_tot
## <int> <int> <int> <chr> <int>
## 1 5 2 4 10 0
## 2 5 2 4 20 7313
## 3 5 2 4 30 0
## 4 5 2 4 40 0
## 5 5 2 4 50 0
## 6 5 2 4 60 0
## 7 5 3 5 10 8387
## 8 5 3 5 20 8944
## 9 5 3 5 30 8160
## 10 5 3 5 40 11947
## # ... with 230 more rows
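The regular expressions used in the mutate() step above can be tested in isolation; a minimal sketch (assuming 'stringr' is loaded, and using a made-up column name, not necessarily one present in the data file):

```r
library(stringr)
# "[0-9]{1,2}" matches the first run of one or two digits,
# "[bm]" the first occurrence of the letter b or m.
str_extract("a.b10", "[0-9]{1,2}")  # "10"
str_extract("a.b10", "[bm]")        # "b"
```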

U The previous chunk uses a long “pipe” to manipulate the data. I built this
example interactively, starting at the top, and adding one line at a time. Repeat
this process, line by line. If in a given line you do not understand why a certain
bit of code is included, look at the help pages, and edit the code to experiment.

We will now calculate means per true replicate, the trays. We then use these means to calculate overall means, standard deviations and coefficients of variation (%).

group_by(birch.tb, tray, row, segment) %>%


summarise(area = mean(area_tot)) %>%
group_by(row, segment) %>%
summarise(mean_area = mean(area),
sd_area = sd(area),
cv_area = sd_area / mean_area * 100) ->
birch_summaries.tb
birch_summaries.tb

## Source: local data frame [30 x 5]


## Groups: row [?]
##
## row segment mean_area sd_area cv_area
## <int> <chr> <dbl> <dbl> <dbl>
## 1 4 10 20.250 40.5000 200.000000
## 2 4 20 8604.250 432.0232 5.021045
## 3 4 30 9195.500 3879.4272 42.188323
## 4 4 40 9180.125 2853.6121 31.084676


## 5 4 50 7054.375 5987.9445 84.882707


## 6 4 60 2983.750 3103.2491 104.004996
## 7 5 10 3088.375 3660.2748 118.517822
## 8 5 20 9880.000 1714.6564 17.354822
## 9 5 30 11151.875 3734.8653 33.490918
## 10 5 40 9406.000 1264.8214 13.446964
## # ... with 20 more rows

We could also be interested in the total leaf area per plant. The code is the same as above, but with no grouping by segment .

group_by(birch.tb, tray, row) %>%


summarise(area = mean(area_tot)) %>%
group_by(row) %>%
summarise(mean_area = mean(area),
sd_area = sd(area),
cv_area = sd_area / mean_area * 100) ->
birch_plant_summaries.tb
birch_plant_summaries.tb

## # A tibble: 5 × 4
## row mean_area sd_area cv_area
## <int> <dbl> <dbl> <dbl>
## 1 4 6173.042 2254.8318 36.527080
## 2 5 7160.792 1442.7202 20.147495
## 3 6 7367.958 1002.8878 13.611475
## 4 7 8210.146 559.3762 6.813231
## 5 8 7807.792 448.3005 5.741707

We now save the tibbles into an R data file.

save(birch.tb, birch_summaries.tb, birch_plant_summaries.tb,


file = "data/birch-data.rda")

U Repeat the same calculations for all the rows as I originally did. I eliminated
the data from the borders of the trays, as those plants apparently did not really
experience as crowded a space as that corresponding to the nominal spacing.

6 Plots with ‘ggplot2’

The commonality between science and art is in trying to see profoundly—to develop strategies of seeing and showing.

— Edward Tufte

6.1 Aims of this chapter

Three main plotting systems are available to R users: base R, package 'lattice' (Sarkar 2008) and package 'ggplot2' (Wickham and Sievert 2016), the last one being the most recent and currently the most popular system available in R for plotting data. Two different sets of graphics primitives are even available in R: that in base R and a newer one in the 'grid' package (Murrell 2011).
In this chapter you will learn the concepts of the grammar of graphics, on which package 'ggplot2' is based. You will also learn how to do many of the data plots that can be produced with package 'ggplot2'. We will focus only on the grammar of graphics, as it is currently the most used plotting approach in R. As a consequence of this popularity and its flexibility, many extensions to 'ggplot2' have been released under free licences and deposited in public repositories. Several of these packages will be described in Chapter 7 starting on page 329 and in Chapter 8 starting on page 425. As with previous chapters, this chapter is intended to be read as a whole.

6.2 Packages used in this chapter

citation(package = "ggplot2")

##
## To cite ggplot2 in publications, please use:
##
## H. Wickham. ggplot2: Elegant Graphics for
## Data Analysis. Springer-Verlag New York,
## 2009.
##
## A BibTeX entry for LaTeX users is
##
## @Book{,
## author = {Hadley Wickham},


## title = {ggplot2: Elegant Graphics for Data Analysis},


## publisher = {Springer-Verlag New York},
## year = {2009},
## isbn = {978-0-387-98140-6},
## url = {http://ggplot2.org},
## }

For executing the examples listed in this chapter you first need to load the following packages from the library:

library(ggplot2)
library(scales)
library(tikzDevice)
library(lubridate)

We set a font of larger size than the default:

theme_set(theme_grey(14))

6.3 Introduction

R being extensible, in addition to the built-in plotting functions, there are several alternatives provided by packages. Of the general-purpose ones, the most extensively used are 'lattice' (Sarkar 2008) and 'ggplot2' (Wickham and Sievert 2016). There are additional packages that add extra functionality to these packages (see Chapter 7 starting on page 329).
In the examples in this chapter we describe the use of package 'ggplot2'. We start with an introduction to the 'grammar of graphics' and 'ggplot2'. There is ample literature on the use of 'ggplot2', including the very good reference documentation at http://docs.ggplot2.org/. The book titled ggplot2: Elegant Graphics for Data Analysis (Wickham and Sievert 2016) is the authoritative reference, as it is authored by the developers of 'ggplot2'. The book 'R Graphics Cookbook' (Chang 2013) is very useful as a reference as it contains many worked-out examples. Some of the literature available at this time is for older versions of 'ggplot2', but here we describe version 2.2.0 and highlight the most important incompatibilities that need to be taken into account when using versions of 'ggplot2' earlier than 2.2.0. There is no comprehensive text on packages extending 'ggplot2', so I will describe many of them in later chapters. In the present chapter we describe the functions and methods defined in package 'ggplot2'; in chapter 7 on page 329 we describe extensions to 'ggplot2' defined in other packages, except for those related to plotting data onto maps and other images, which are described in chapter 8 on page 425. Consistent with the title of this book, we


use a tutorial style, interspersing exercises to motivate learning using a hands-on approach and playful exploration of a wide range of possible uses of the grammar of graphics.

6.4 Grammar of graphics

What separates 'ggplot2' from base R and trellis/lattice plotting functions is the use of a grammar of graphics (the reason behind the 'gg' in the name of the package). What is meant by grammar in this case is that plots are assembled piece by piece from different 'nouns' and 'verbs' (Cleveland 1985). Instead of using a single function with many arguments, plots are assembled by combining different elements with operators + and %+%. Furthermore, the construction is mostly semantic-based, and to a large extent how the plot looks when it is printed, displayed or exported to a bitmap or vector graphics file is controlled by themes.

6.4.1 Mapping

When we design a plot, we need to map data variables to aesthetics (or graphic 'properties'). Most plots will have an 𝑥 dimension, which is considered an aesthetic, and a variable containing numbers mapped to it. The position on a 2D plot of, say, a point will be determined by the 𝑥 and 𝑦 aesthetics, while in a 3D plot three aesthetics need to be mapped: 𝑥, 𝑦 and 𝑧. Many aesthetics are not related to coordinates; they are properties, like color, size, shape, line type or even rotation angle, which add an additional dimension on which to represent the values of variables and/or constants.

6.4.2 Geometries

Geometries describe the graphical representation of the data: for example, geom_point() plots a 'point' or symbol for each observation, while geom_line() draws line segments between successive observations. Some geometries rely on statistics, but most 'geoms' default to the identity statistic.

6.4.3 Statistics

Statistics are ‘words’ that represent calculation of summaries or some other operation
on the values from the data, and these summary values can be plotted with a geometry.
For example stat_smooth() fits a smoother, and stat_summary() applies a summary
function. Statistics are applied automatically by group when data has been grouped
by mapping additional aesthetics such as color to a factor.


6.4.4 Scales

Scales give the relationship between data values and the aesthetic values to be actually
plotted. Mapping a variable to the ‘color’ aesthetic only tells that different values
stored in the mapped variable will be represented by different colors. A scale, such
as scale_color_continuous() will determine which color in the plot corresponds to
which value in the variable. Scales are used both for continuous variables, such as
numbers, and categorical ones such as factors.

6.4.5 Coordinate systems

The most frequently used coordinate system when plotting data is the Cartesian system, which is the default for most geometries. In the Cartesian system, 𝑥 and 𝑦 are represented as distances on two orthogonal (at 90∘ ) axes. In the polar system of coordinates, angles around a central point are used instead of distances on a straight line. In addition, package 'ggtern' adds a ternary system of coordinates, extending the grammar to allow the construction of ternary plots.

6.4.6 Themes

How the plots look when displayed or printed can be altered by means of themes.
A plot can be saved without adding a theme and then printed or displayed using
different themes. Also individual theme elements can be changed, and whole new
themes defined. This adds a lot of flexibility and helps in the separation of the data
representation aspects from those related to the graphical design.
As discussed above, the grammar of graphics is based on aesthetics ( aes ), for example color; geometric elements geom_… , such as lines and points; statistics stat_… ; scales scale_… ; labels labs ; coordinate systems; and themes theme_… . Plots are assembled from these elements; we start with a plot with two aesthetics and one geometry.
As the workings and use of the grammar are easier to show by example than to explain with words, after this short introduction we will focus on examples showing how to produce graphs of increasing complexity.

6.5 Scatter plots

In the examples that follow we will use the mtcars data set included in R. To learn
more about this data set, type help("mtcars") at the R command prompt.
Data variables must be 'mapped' to aesthetics to appear in a plot. Variables to be represented in a plot can be either continuous (numeric) or discrete (categorical,


factor). Variable cyl is encoded in the mtcars data frame as numeric values. Even
though only three values are present, a continuous color scale is used by default.
In the example below, x , y and color are aesthetics. In this example they are all
mapped to variables contained in the data frame mtcars . To build a scatter plot, we
use the geom_point() geometry as in a scatter plot each individual observation is
represented by a point or symbol in the plot.

ggplot(data = mtcars,
aes(x = disp, y = mpg, color = cyl)) +
geom_point()

[Figure: scatter plot of mpg vs. disp from the mtcars data; points colored by cyl using a continuous color scale.]

U Try a different mapping: disp → color , cyl → x . Continue by using


help(mtcars) and/or names(mtcars) to see what variables are available, and
then try the combinations that trigger your curiosity.

Some scales exist in two 'flavours', one suitable for continuous variables and another for discrete variables. We can convert cyl into a factor 'on-the-fly' to force the use of a discrete color scale. If we map the color aesthetic to factor(cyl) , points get colors according to the levels of the factor, and by default a guide or key for the mapping is also added.

ggplot(data = mtcars,
aes(x = disp, y = mpg, color = factor(cyl))) +
geom_point()


[Figure: scatter plot of mpg vs. disp from the mtcars data; points colored by factor(cyl) using a discrete color scale (legend: 4, 6, 8).]

U Try a different mapping: mpg → color , cyl → y . Invent your own mappings taking into account which variables are continuous and which ones categorical.

Using an aesthetic involves the mapping of values in the data to aesthetic values such as colours. The mapping is defined by means of scales. If we now consider the color aesthetic in the previous statements, a default discrete color scale was used when factor(cyl) was mapped to the aesthetic, while a continuous color scale was used when the numeric cyl was mapped to it.

In the case of the discrete scale three different colours taken from a default palette
were used. If we would like to use a different set of three colours for the three values
of the factor, but still have them assigned automatically to each point in the plot, we
can select a different colour palette by passing an argument to the corresponding
scale function.

ggplot(data = mtcars,
aes(x = disp, y = mpg, color = factor(cyl))) +
geom_point() +
scale_color_brewer(type = "qual", palette = 2)

6.5 Scatter plots

[Figure: the same scatter plot, drawn using a qualitative ColorBrewer palette for factor(cyl).]

U Try the different palettes available through the brewer scale. You can play directly with the palettes using function brewer_pal() from package ‘scales’ together with show_col().

show_col(brewer_pal()(3))
show_col(brewer_pal(type = "qual", palette = 2, direction = 1)(3))

Once you have found a suitable palette for these data, redo the plot above with
the chosen palette.

Neither the data, nor the aesthetic mappings or geometries, are different from those in the earlier code; to alter how the plot looks we have changed only the palette used by the color aesthetic. Conceptually it is still exactly the same plot we created earlier. This is a very important point to understand, because it is extremely useful in practice. Plots are assembled piece by piece and it is even possible to replace elements in an existing plot.

 Within aes() the aesthetics are interpreted as being a function of the val-
ues in the data—i.e. to be mapped. If given outside aes() they are interpreted as
constant values, which apply to one geometry if given within the call to a geom_
but outside aes() . The aesthetics and data given as ggplot() ’s arguments be-
come the defaults for all the geoms, but geoms also accept aesthetics and data as
arguments, which when supplied locally override the whole-plot defaults. In the
example below, we override the default colour of the points.


If we set the color aesthetic to a constant value, "red" , all points are plotted in
red.

ggplot(data = mtcars,
aes(x = disp, y = mpg, color = factor(cyl))) +
geom_point(color = "red")

[Figure: scatter plot of mpg vs. disp with all points plotted in red; no key is shown.]

U Does the code chunk below produce exactly the same plot as that above this box? Consider how the two mappings differ, and make sure that you understand the reasons behind the difference, or lack of difference, in output by trying different variations of these examples.

ggplot(data = mtcars,
aes(x = disp, y = mpg)) +
geom_point(color = "red")

As with any R function, it is possible to pass arguments by position to aes() when mapping variables to aesthetics, but this makes the code more difficult to read and less tolerant of possible changes to the definitions of functions. This terse style is not recommended in scripts or package code. However, it can usually be used by experienced users at the command prompt without problems.
Mapping, passing arguments by name to aes().

ggplot(data = mtcars, aes(x = disp, y = mpg)) +
  geom_point()


[Figure: scatter plot of mpg vs. disp, with all points in the default color.]

U If we swap the order of the arguments do we still obtain the same plot?
ggplot(data = mtcars, aes(y = mpg, x = disp)) +
geom_point()

Mapping, passing arguments by position to aes().

ggplot(mtcars, aes(disp, mpg)) +
  geom_point()

[Figure: the same scatter plot of mpg vs. disp as above.]

U If we swap the order of the arguments do we obtain a different plot?


ggplot(mtcars, aes(mpg, disp)) +
  geom_point()

When not relying on colors, the most common way of distinguishing groups of
observations in scatter plots is to use the shape of the points as an aesthetic. We need
to change a single “word” in the code statement to achieve this different mapping.

ggplot(data = mtcars, aes(x = disp, y = mpg, shape = factor(cyl))) +
  geom_point()

[Figure: scatter plot of mpg vs. disp with point shape mapped to factor(cyl), and a key titled factor(cyl).]

We can use scale_shape_manual() to choose each shape to be used. We set three “open” shapes that, as we will see later, are very useful because they obey both the color and fill aesthetics.

ggplot(data = mtcars, aes(x = disp, y = mpg, shape = factor(cyl))) +
  geom_point() +
  scale_shape_manual(values = c(21, 22, 23))


[Figure: the same scatter plot, using the “open” shapes 21, 22 and 23 for the three levels of factor(cyl).]

It is also possible to use characters as shapes. The character is centred on the position of the observation. Conceptually, using character values for shape is different from using geom_text(): in the latter case there is much more flexibility, as character strings and expressions are allowed in addition to single characters, and positioning with respect to the coordinates of the observations can be adjusted through justification. While geom_text() is usually used for annotations, the present example treats the character string as a symbol. (This also opens the door to the use as shapes of symbols defined in special fonts.)

ggplot(data = mtcars, aes(x = disp, y = mpg, shape = factor(cyl))) +
  geom_point(size = 2.5) +
  scale_shape_manual(values = c("4", "6", "8"), guide = FALSE)

[Figure: scatter plot of mpg vs. disp with the characters “4”, “6” and “8” used as point shapes, and no key.]

U What do you expect to be the result of the following statement?


ggplot(data = mtcars, aes(x = disp, y = mpg, shape = factor(cyl))) +
  geom_point(size = 4) +
  scale_shape_manual(values = c("c4", "c6", "c8"), guide = FALSE)

As seen earlier, one variable can be mapped to more than one aesthetic, allowing redundant aesthetics. This may seem wasteful, but it is extremely useful as it allows one to produce figures that, even when produced in color, can still be read if reproduced as monochrome images.

ggplot(data = mtcars, aes(x = disp, y = mpg,
                          shape = factor(cyl),
                          color = factor(cyl))) +
  geom_point()

[Figure: scatter plot with both shape and color of the points mapped to factor(cyl), combined into a single key.]

U Here we map fill and shape to cyl . What do you expect this variation of
the statement above to produce?

ggplot(data = mtcars, aes(x = disp, y = mpg,
                          shape = factor(cyl),
                          fill = factor(cyl))) +
  geom_point()

Hint: Do all shapes obey the fill aesthetic? (Having a look at page 188 may
be of help.)

We can create a “bubble” plot by mapping the size aesthetic to a continuous variable. In this case, one has to think about what is visually more meaningful. Although the radius of the shape is frequently mapped, due to how human perception works, mapping a variable to the area of the shape is more useful, being perceptually closer to a linear mapping. For this example we add a new variable to the plot, the weight of the car in tons, and map it to the area of the points.

ggplot(data = mtcars, aes(x = disp, y = mpg,
                          color = factor(cyl),
                          size = wt)) +
  scale_size_area() +
  geom_point()

[Figure: “bubble” plot of mpg vs. disp, with point area mapped to wt and color mapped to factor(cyl); keys for both wt and factor(cyl) are shown.]

U If we use a radius-based scale the “impression” is different.

ggplot(data = mtcars, aes(x = disp, y = mpg,
                          color = factor(cyl),
                          size = wt)) +
  scale_size() +
  geom_point()

Make the plot, look at it carefully. Check the numerical values of some of the
weights, and assess if your perception of the plot matches the numbers behind
it.

As a final example of how to combine different aesthetics, we use in a single plot several of the different mappings described in earlier examples.

ggplot(data = mtcars, aes(x = disp, y = mpg,
                          shape = factor(cyl),
                          fill = factor(cyl),
                          size = wt)) +
  geom_point(alpha = 0.33, color = "black") +
  scale_size_area() +
  scale_shape_manual(values = c(21, 22, 23))

[Figure: “bubble” plot combining the shape, fill and size aesthetics, with keys for wt and factor(cyl).]

U Play with the code in the chunk above. Remove or change each of the mappings and the scale, display the new plot and compare it to the one above. Continue playing with the code until you are sure you understand which graphical element in the plot each individual element in the code statement creates or controls.

Data assigned to an aesthetic can be the ‘result of a computation’. In other words, the values to be plotted do not need to be stored in the data frame passed as argument to data, the first formal parameter of ggplot().
Here we plot the ratio of miles-per-gallon, mpg, to the engine displacement (volume), disp. Instead of mapping disp to the 𝑥 aesthetic as above, we map factor(cyl) to 𝑥. In contrast to the continuous variable disp we used earlier, now we use a factor, so a discrete (categorical) scale is used by default for 𝑥.

ggplot(data = mtcars, aes(x = factor(cyl), y = mpg / disp)) +
  geom_point()



[Figure: scatter plot of mpg / disp vs. factor(cyl), with the levels 4, 6 and 8 on a discrete 𝑥 axis.]

U What will happen if we replace factor(cyl) with cyl in the statement above? How do you expect the plot to change? First think carefully what you can expect, and then run the edited code.

Although factor(cyl) is mapped to 𝑥, we can in addition map it to color. This may be useful when we want to keep the design consistent across plots, for example this one and those above.

ggplot() +
aes(x = factor(cyl), y = mpg / disp,
colour = factor(cyl)) +
geom_point(data = mtcars)


[Figure: the same plot of mpg / disp vs. factor(cyl), with point color also mapped to factor(cyl) and a matching key added.]

We can set the labels for the different aesthetics, and give a title (\n means ‘new line’ and can be used to continue a label on the next line). In this case, if two aesthetics are linked to the same variable, the labels supplied should be identical, otherwise two separate keys will be produced.

ggplot(data = mtcars,
aes(x=disp, y=hp, colour=factor(cyl),
shape=factor(cyl))) +
geom_point() +
labs(x="Engine displacement",
y="Gross horsepower",
colour="Number of\ncylinders",
shape="Number of\ncylinders")

[Figure: scatter plot of hp vs. disp with color and shape mapped to factor(cyl), a single key titled “Number of cylinders”, and axis labels “Engine displacement” and “Gross horsepower”.]

U Play with the code statement above. Edit the character strings. Move the \n
around. How would you write a string so that quotation marks can be included
as part of the title of the plot? Experiment, and google, if needed, until you get
this to work.

Please see section 6.9 on page 205 for an extended description of the use of labs().

6.6 Line plots

For line plots we use geom_line() . The size of a line is its thickness, and as we
had shape for points, we have linetype for lines. In a line plot observations in
successive rows of the data frame, or the subset corresponding to a group, are joined
by straight lines. We use a different data set included in R, Orange, with data on the
growth of five orange trees. See the help page for Orange for details.


ggplot(data = Orange,
aes(x = age, y = circumference, color = Tree)) +
geom_line()

[Figure: line plot of circumference vs. age for the five trees in Orange, with line color mapped to Tree.]

ggplot(data = Orange,
aes(x = age, y = circumference, linetype = Tree)) +
geom_line()

[Figure: the same line plot, with linetype instead of color mapped to Tree.]

Much of what was described above for scatter plots can be adapted to line plots.

6.7 Plotting functions

In addition to plotting data from a data frame with variables to map to the 𝑥 and 𝑦 aesthetics, it is possible to have only a variable mapped to 𝑥 and use stat_function() to generate the values to be mapped to 𝑦 using a function. This avoids the need to generate data beforehand (the number of data points to be generated can also be set).


We start with the density function of the Normal distribution.

ggplot(data.frame(x = -3:3), aes(x = x)) +
  stat_function(fun = dnorm)

[Figure: bell-shaped curve of the Normal density plotted with stat_function() over the range −3 to 3.]
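The number of points at which the function is evaluated can be set through stat_function()'s n parameter (by default a fixed number of evenly spaced points is used). A minimal sketch, assuming ‘ggplot2’ is loaded:

```r
library(ggplot2)
# Evaluate dnorm at 500 points instead of the default,
# giving a smoother curve at the cost of a little extra computation.
ggplot(data.frame(x = -3:3), aes(x = x)) +
  stat_function(fun = dnorm, n = 500)
```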

Using a list we can even pass additional arguments to the function by name.

ggplot(data.frame(x = -3:3), aes(x = x)) +
  stat_function(fun = dnorm, args = list(mean = 1, sd = .5))

[Figure: Normal density with mean 1 and standard deviation 0.5, plotted over the range −3 to 3.]

U 1) Edit the code above so as to plot in the same figure three curves, either for
three different values for mean or for three different values for sd .
2) Edit the code above to use a different function, say df , the F distribution,
adjusting the argument(s) passed through args accordingly.

Of course, user-defined functions (not shown) and anonymous functions (below) can also be used.


ggplot(data.frame(x = 0:1), aes(x = x)) +
  stat_function(fun = function(x, a, b){a + b * x^2},
                args = list(a = 1, b = 1.4))

[Figure: the parabola 𝑦 = 1 + 1.4𝑥², plotted over the range 0 to 1.]

U Edit the code above to use a different function, such as 𝑒^(𝑥+𝑘), adjusting the argument(s) passed through args accordingly. Do this by means of an anonymous function, and by means of an equivalent named function defined by your code.

 In some cases we may want to tweak some aspects of the plot to better
match the properties of the mathematical function. Here we use a predefined
function for which the default 𝑥-axis breaks (tick positions) are not the best. We
first show how the plot looks using defaults.

ggplot(data.frame(x = c(0, 2 * pi)), aes(x = x)) +
  stat_function(fun = sin)


[Figure: one cycle of the sine function with the default 𝑥-axis breaks at 0, 2, 4 and 6.]

Next we change the 𝑥-axis scale to better match the sine function and the use
of radians as angular units.

ggplot(data.frame(x = c(0, 2 * pi)), aes(x = x)) +
  stat_function(fun = sin) +
  scale_x_continuous(
    breaks = c(0, 0.5, 1, 1.5, 2) * pi,
    labels = c("0", expression(0.5~pi), expression(pi),
               expression(1.5~pi), expression(2~pi))) +
  labs(y = "sin(x)")

[Figure: the sine curve with 𝑥-axis breaks at 0, 0.5π, π, 1.5π and 2π, and the 𝑦-axis labelled sin(x).]

There are three things in the above code that you need to understand: the use
of the R built-in numeric constant pi , the use of argument ‘recycling’ to avoid
having to type pi many times, and the use of R expressions to construct suitable
tick labels for the 𝑥 axis. Do also consider why pi is interpreted differently within
expression than within the numeric statements.
The use of expression is explained in detail in section 6.20, and the use of scales in section 6.16.
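The ‘recycling’ mentioned above is easy to check at the R console; in this base R snippet the length-one value pi multiplies each element of the vector of multipliers:

```r
# Each element of the numeric vector is multiplied by pi,
# yielding the five break positions used in the plot above.
breaks <- c(0, 0.5, 1, 1.5, 2) * pi
print(breaks)
```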

6.8 Plotting text and maths

We can use geom_text() or geom_label() to add text labels to observations. For geom_text() and geom_label(), the aesthetic label provides the text to be plotted, and the usual aesthetics x and y the location of the labels. As one would expect, the color and size aesthetics can also be used for the text. In addition, angle, vjust and hjust can be used to rotate the label and adjust its position. The default value of 0.5 for both hjust and vjust centres the label: the centre of the text is at the supplied x and y coordinates. ‘Vertical’ and ‘horizontal’ for justification refer to the text, not the plot. This is important when angle is different from zero. Negative justification values shift the label left or down, and positive values right or up. A value of 1 or 0 sets the text so that its edge is at the supplied coordinate. Values outside the range 0 … 1 shift the text even further away, based on the length of the string. In the case of geom_label() the text is enclosed in a rectangle, which obeys the fill aesthetic and takes additional parameters (described starting at page 203). However, it does not support rotation with angle.

my.data <-
data.frame(x = 1:5,
y = rep(2, 5),
label = c("a", "b", "c", "d", "e"))

ggplot(my.data, aes(x, y, label = label)) +
  geom_text(angle = 45, hjust = 1.5, size = 8) +
  geom_point()

[Figure: five points at 𝑦 = 2 with the labels “a” to “e” plotted rotated 45°, below and to the left of each point.]


U Modify the examples above to use geom_label() instead of geom_text(), using in addition the fill aesthetic.

In the next example we select a different font family, using the same characters in the Roman alphabet. We start by checking which font families R recognizes on our system for the PDF output device we use to compile the figures in this book.

names(pdfFonts())

## [1] "serif" "sans"
## [3] "mono" "AvantGarde"
## [5] "Bookman" "Courier"
## [7] "Helvetica" "Helvetica-Narrow"
## [9] "NewCenturySchoolbook" "Palatino"
## [11] "Times" "URWGothic"
## [13] "URWBookman" "NimbusMon"
## [15] "NimbusSan" "URWHelvetica"
## [17] "NimbusSanCond" "CenturySch"
## [19] "URWPalladio" "NimbusRom"
## [21] "URWTimes" "ArialMT"
## [23] "Japan1" "Japan1HeiMin"
## [25] "Japan1GothicBBB" "Japan1Ryumin"
## [27] "Korea1" "Korea1deb"
## [29] "CNS1" "GB1"

A sans-serif font, either "Helvetica" or "Arial", is the default, but we can change the default through parameter family. Some of the family names are generic, like serif, sans (sans-serif) and mono (mono-spaced), and others refer to actual font names. Some related fonts (e.g. from different designers or foundries) may also use variations of the same name. Base R does not support the use of system fonts in graphics output devices; however, add-on packages allow their use. The simplest to use is package ‘showtext’, described in section 7.3 on page 330.

my.data <-
data.frame(x = 1:5,
y = rep(2, 5),
label = c("a", "b", "c", "d", "e"))

ggplot(my.data, aes(x, y, label = label)) +
  geom_text(angle = 45, hjust = 1.5, size = 8, family = "serif") +
  geom_point()


[Figure: the same five labelled points, with the labels set in a serif font.]

In the next example we use paste() (which uses recycling here) to add a space at
the end of each label.

my.data <-
data.frame(x = 1:5, y = rep(2, 5),
label = paste(c("a", "ab", "abc", "abcd", "abcde"), " "))

ggplot(my.data, aes(x, y, label = label)) +
  geom_text(angle = 45, hjust = 1, color = "blue") +
  geom_point()

[Figure: five points with the right-justified labels “a ” to “abcde ” plotted in blue at 45°.]

U Justification values outside the range 0 … 1 are allowed, but are relative to the width of the label. As the labels are of different lengths, using any value other than zero or one results in uneven positioning of the labels with respect to the points. Edit the code above using hjust set to 1.5 instead of 1, without pasting a space character to the labels. Is the plot obtained “tidy” enough for publication? And for data exploration?

Plotting expressions (mathematical expressions) involves mapping to the label aesthetic character strings that can be parsed as expressions, and setting parse = TRUE.

my.data <-
data.frame(x = 1:5, y = rep(2, 5),
label=paste("alpha[", 1:5, "]", sep = ""))

ggplot(my.data, aes(x, y, label = label)) +
  geom_text(hjust = -0.2, parse = TRUE, size = 6) +
  geom_point()

[Figure: five points labelled with the subscripted Greek symbols α₁ to α₅.]

Plotting maths and other alphabets using R expressions is discussed in section 6.20 on page 291.
In the examples above we plotted text and expressions present in the data frame passed as argument for data. It is also possible to build suitable labels on-the-fly within aes when setting the mapping for label. Here we use geom_text() and expressions for the example, but the same two approaches can be used to “build” character strings to be used directly without parsing.

my.data <-
data.frame(x = 1:5, y = rep(2, 5))

ggplot(my.data, aes(x,
y,
label = paste("alpha[", x, "]", sep = ""))) +
geom_text(hjust = -0.2, parse = TRUE, size = 6) +
geom_point()


[Figure: the same plot of five points labelled α₁ to α₅, with the labels built on-the-fly.]

U What are the advantages and disadvantages of each approach with respect to the ease with which a script producing several figures that use the same “labels” can be written, and with respect to consistency across figures? In contrast, which approach would you prefer if different figures in the same script used different variations of labels constructed from the same variables in the data?

As geom_label() obeys the same parameters as geom_text() except for angle, we describe below only the additional parameters compared to geom_text().

my.data <-
data.frame(x = 1:5, y = rep(2, 5),
label=paste("alpha[", 1:5, "]", sep = ""))
ggplot(my.data, aes(x, y, label = label)) +
geom_label(hjust = -0.2, parse = TRUE, size = 6) +
geom_point() +
expand_limits(x = 5.4)


[Figure: five points labelled α₁ to α₅, with each label enclosed in a rectangle drawn by geom_label().]

We may want to alter the default width of the border line or the color used to fill
the rectangle, or to change the “roundness” of the corners. To suppress the border
line use NA , as a value of zero produces a very thin border. Corner roundness is
controlled by parameter label.r and the size of the margin around the text with
label.padding .

my.data <-
data.frame(x = 1:5, y = rep(2, 5),
label=paste("alpha[", 1:5, "]", sep = ""))
ggplot(my.data, aes(x, y, label = label)) +
geom_label(hjust = -0.2, parse = TRUE, size = 6,
label.size = NA,
label.r = unit(0, "lines"),
label.padding = unit(0.15, "lines"),
fill = "yellow", alpha = 0.5) +
geom_point() +
expand_limits(x = 5.4)

[Figure: the same labelled points, with square-cornered, borderless, semi-transparent yellow label rectangles.]


U Play with the arguments to the different parameters and with the aesthetics to get an idea of what can be done with them. For example, use thicker border lines and increase the padding so that a good margin is still achieved. You may also try mapping the fill and color aesthetics to factors in the data.

 You should be aware that R and ggplot2 support the use of UNICODE, such as the UTF8 character encoding, in strings. If your editor or IDE supports their use, then you can type Greek letters and simple maths symbols directly, and they may show correctly in labels if a suitable font is loaded and an extended encoding like UTF8 is in use by the operating system. Even if UTF8 is in use, text is not fully portable unless the same font is available, as even though the character positions are standardized for many languages, most UNICODE fonts support at most a small number of languages. In principle one can use this mechanism to have labels using other alphabets, and languages like Chinese with their numerous symbols, mixed in the same figure. Furthermore, the support for fonts, and consequently character sets, in R is output-device dependent. The font encoding used by R by default depends on the default locale settings of the operating system, which can also lead to garbage printed to the console or wrong characters being plotted when running the same code on a different computer from the one where a script was edited. Not all is lost, though, as R can be coerced to use system fonts and Google fonts with functions provided by packages ‘showtext’ and ‘extrafont’, described in section 7.3 on page 330. Encoding-related problems, especially in MS-Windows, are very common.
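As a minimal sketch of what can work when a script is saved with UTF-8 encoding and the output device and font in use provide the glyphs, non-ASCII characters can be typed directly into labels:

```r
library(ggplot2)
# The Greek letters and the degree sign below are typed directly as
# UTF-8 characters; whether they render depends on the device and font.
ggplot(data.frame(x = 1:10, y = (1:10)^2), aes(x, y)) +
  geom_line() +
  labs(title = "Response to Δ temperature", x = "t (°C)", y = "α")
```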

6.9 Axis- and key labels, titles, subtitles and captions

I describe these in the same section, immediately after the section on plotting text labels, as they are added to plots using similar approaches. Be aware that the default justification of plot titles changed in ‘ggplot2’ version 2.2.0 from centered to left justified. At the same time, support for subtitles and captions was added.
The most flexible approach is to use labs() as it allows the user to set the text or
expressions to be used for these different elements.

ggplot(data = Orange,
       aes(x = age, y = circumference, color = Tree)) +
  geom_line() +
  geom_point() +
  expand_limits(y = 0) +
  labs(title = "Growth of orange trees",
       subtitle = "Starting from 1968-12-31",
       caption = "see Draper, N. R. and Smith, H. (1998)",
       x = "Time (d)",
       y = "Stem circumference (mm)",
       color = "Tree\nnumber")

[Figure: growth curves of the five orange trees, with title “Growth of orange trees”, subtitle “Starting from 1968−12−31”, caption “see Draper, N. R. and Smith, H. (1998)”, axis labels “Time (d)” and “Stem circumference (mm)”, and key titled “Tree number”.]

There are, in addition to labs(), convenience functions for setting the axis labels: xlab() and ylab().

ggplot(data = Orange,
aes(x = age, y = circumference, color = Tree)) +
geom_line() +
geom_point() +
expand_limits(y = 0) +
xlab("Time (d)") +
ylab("Stem circumference (mm)")




[Figure: the same growth curves with axis labels “Time (d)” and “Stem circumference (mm)” set with xlab() and ylab().]

An additional convenience function, ggtitle(), can be used to add a title and optionally a subtitle.

ggplot(data = Orange,
aes(x = age, y = circumference, color = Tree)) +
geom_line() +
geom_point() +
expand_limits(y = 0) +
ggtitle("Growth of orange trees",
subtitle = "Starting from 1968-12-31")

[Figure: the growth curves with title “Growth of orange trees” and subtitle “Starting from 1968−12−31” added with ggtitle().]

U Make an empty plot ( ggplot() ) and add to it as title an expression producing 𝑦 = 𝑏₀ + 𝑏₁𝑥 + 𝑏₂𝑥². (Hint: have a look at the examples for the use of expressions as labels in section 6.8 on page 199 and the plotmath demo in R.)

Function update_labels() allows the replacement of labels in an existing plot. We first create a plot with one set of labels, and afterwards we replace them. (In ‘ggplot2’ 2.2.1 update_labels() fails for aesthetic color but works as expected with colour. Issue raised in Github on 2016-01-21.)
p <-
ggplot(data = mtcars,
aes(x = disp, y = hp, colour = factor(cyl),
shape = factor(cyl))) +
geom_point() +
labs(x = "Engine displacement",
y = "Gross horsepower",
color = "Number of\ncylinders",
shape = "Number of\ncylinders")
p

[Figure: the hp vs. disp scatter plot with the English labels set in the code above.]

update_labels(p, list(x = "Cilindrada",
                      y = "Potencia bruta (caballos de fuerza)",
                      colour = "no. de\ncilindros",
                      shape = "no. de\ncilindros"))
[Figure: the same plot with the labels replaced by the Spanish ones: “Cilindrada”, “Potencia bruta (caballos de fuerza)” and key title “no. de cilindros”.]


 When setting or updating labels using either labs() or update_labels() be aware that even though color and colour are synonyms for the same aesthetics, the ‘name’ used in the call to aes() must match the ‘name’ used when setting or updating the labels.

U Modify the code used in the code chunk above to update labels, so that
colour is used instead of color . How does the figure change?

The labels used in keys and axis tick-labels for factor levels can be changed through
the different scales as described in section 6.16 on page 249.
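Although the details are left for section 6.16, a minimal sketch of this mechanism is to pass a labels argument to the discrete scale; here the key labels for the levels of factor(cyl) are replaced (the replacement texts are, of course, arbitrary):

```r
library(ggplot2)
# The labels are matched to the factor levels in order: 4, 6, 8.
ggplot(data = mtcars,
       aes(x = disp, y = mpg, color = factor(cyl))) +
  geom_point() +
  scale_color_discrete(labels = c("four", "six", "eight"))
```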

 Sometimes we would like to include in the title, or as an annotation in the plot, the name of the argument passed to ggplot()’s data parameter. To obtain the name of an object as a character string, the usual R “slang” is deparse(substitute(x)) where x is the object (see section 6.20 on page 291 for further details).
ggplot(data = Orange,
aes(x = age, y = circumference, color = Tree)) +
geom_line() +
geom_point() +
expand_limits(y = 0) +
ggtitle(paste("Data:", deparse(substitute(Orange))))

[Figure: the orange tree growth curves with the title “Data: Orange” generated with deparse(substitute(Orange)).]

The example above rarely is of much use, as we anyway have to pass the object itself twice, and consequently there is no advantage in effort compared to typing "Data: Orange" as argument to ggtitle(). A more general way to solve this problem is to write a wrapper function.

ggwrapper <- function(data, ...) {
  ggplot(data, ...) +
    ggtitle(paste("Object: ", substitute(data)))
}

ggwrapper(data = Orange,
mapping = aes(x = age, y = circumference, color = Tree)) +
geom_line() +
geom_point() +
expand_limits(y = 0)

[Figure: the growth curves with the title “Object: Orange” added by the wrapper function.]

This is a bare-bones example, as it does not retain user control over the formatting of the title. The ellipsis ( ... ) is a catch-all parameter that we use to pass all other arguments to ggplot(). Because of the way our wrapper function is defined using the ellipsis, we need to always pass mapping and other arguments that are to be “forwarded” to ggplot() by name.
Using this function in a loop over a list or vector will produce output that is not as useful as you may expect. In many cases the best, although more complex, solution is to add case-specific code to the loop itself to generate suitable titles automatically.
We create a suitable set of data frames, and build a list named my.dfs containing them.

df1 <- data.frame(x = 1:10, y = (1:10)^2)
df2 <- data.frame(x = 10:1, y = (10:1)^2.5)
my.dfs <- list(first.df = df1, second.df = df2)


If we print the output produced by the wrapper function when called in a loop, we always get the same title, so this approach is not useful.

for (df in my.dfs) {
  print(
    ggwrapper(data = df,
              mapping = aes(x = x, y = y)) +
      geom_line()
  )
}

[Figures: two line plots, one per data frame, both with the same title “Object: df”.]


 Automatic printing of objects is disabled within functions and iteration loops, making it necessary to use print() explicitly in these cases (see the loops above). This ‘inconsistency’ in behaviour is frequently surprising to unexperienced R users, so keep in mind that if some chunk of R code unexpectedly fails to produce visible output, the most frequent reason is that print() needs to be included in the code to make the ‘missing’ result visible. Except for base R plotting functions, the norm in R is that printing, either implicit or explicit, is needed for output to be visible to the user.
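A minimal base R illustration of this behaviour, unrelated to plotting:

```r
# Inside a loop, automatic printing is disabled: this produces no output.
for (i in 1:3) i
# With an explicit print() call, the values 1, 2 and 3 are displayed.
for (i in 1:3) print(i)
```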

As we have given names to the list members, we can use these and enclose the loop in a function. This is a very inflexible approach, and in addition the plots are only printed, with the ggplot objects discarded once printed.

plot.dfs <- function(x, ...) {
  list.name <- deparse(substitute(x))
  member.names <- names(x)
  if (is.null(member.names)) {
    member.names <- as.character(seq_along(x))
  }
  for (i in seq_along(x)) {
    print(
      ggplot(data = x[[i]], aes(x = x, y = y)) +
        geom_line() +
        ggtitle(paste("Object: ", list.name,
                      '[["', member.names[i], '"]]', sep = ""))
    )
  }
}

plot.dfs(my.dfs)


[Figures: the two line plots, now titled “Object: my.dfs[["first.df"]]” and “Object: my.dfs[["second.df"]]”.]

U Study the output from the two loops, and analyse why the titles differ. This will help you understand not only this problem, but also the implications of formulating for loops in these three syntactically correct ways.

As should be obvious by now, as an object “moves” through the function-call stack its visible name changes. Consequently, when we nest functions or use loops it becomes difficult to retrieve the name under which the object was saved by the user. After these experiments, it should be clear that saving the titles “in” the data frames would be the most elegant approach. It is possible to save additional data in R objects using attributes. R itself uses attributes to keep track of objects’ properties like the names of members in a list, or the class of objects.
When one has control over the objects, one can add the desired title as an attribute to the data frame, and then retrieve and use this when plotting. One should be careful, however, as some functions and operators may fail to copy user attributes to their output.
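A small base R sketch of this behaviour; the "title" attribute used here is our own invention, not one with a predefined meaning in R:

```r
d <- data.frame(x = 1:3, y = (1:3)^2)
attr(d, "title") <- "my plot title"   # attach a user attribute
attr(d, "title")                      # retrieve it: "my plot title"
# Many operations build new objects without copying user attributes;
# e.g. subsetting a data frame silently drops them.
attr(d[1:2, ], "title")               # NULL
```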

U
As an advanced exercise I suggest implementing this attribute-based solution by tagging the data frames using a function defined as shown below, or by directly using attr(). You will also need to modify the code to use the new attribute when building the ggplot object.

add.title.attr <- function(x, my.title) {
  attr(x, "title") <- my.title
  x
}

What advantages and disadvantages does this approach have? Can it be used in a loop?

6.10 Tile plots

For the special case of heat maps see section 6.23.1 on page 309. Here we describe the use of geom_tile() for simple tile plots with no use of clustering.
We generate 100 random draws from the 𝐹 distribution with degrees of freedom 𝜈₁ = 5, 𝜈₂ = 20.

set.seed(1234)
randomf.df <- data.frame(z = rf(100, df1 = 5, df2 = 20),
x = rep(letters[1:10], 10),
y = LETTERS[rep(1:10, rep(10, 10))])

ggplot(randomf.df, aes(x, y, fill = z)) +
  geom_tile()

[Figure: a tile plot of randomf.df, with z mapped to the fill aesthetic.]

We can use "white" or some other contrasting color to better delineate the borders
of the tiles.

ggplot(randomf.df, aes(x, y, fill = z)) +
  geom_tile(color = "white")

[Figure: the same tile plot, with white borders delineating the tiles.]

Any continuous fill scale can be used to control the appearance. Here we show a
tile plot using a grey gradient.

ggplot(randomf.df, aes(x, y, fill = z)) +
  geom_tile(color = "black") +
  scale_fill_gradient(low = "grey15", high = "grey85", na.value = "red")

[Figure: the tile plot using a grey gradient fill scale, with black tile borders.]

6.11 Bar plots

R users not yet familiar with ‘ggplot2’ are frequently surprised by the default behaviour of geom_bar() , as it uses stat_count() to compute the value plotted, rather than plotting values as is (see section 6.12 on page 217). The default can be changed: geom_col() is equivalent to geom_bar() used with "identity" as the argument to parameter stat . The statistic stat_identity() just echoes its input. In previous sections, as when plotting points and lines, this statistic was used by default.
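The counting done by stat_count() can be mimicked in base R with table() ; the vector of classes below is made up for illustration:

```r
car.classes <- c("suv", "compact", "suv", "midsize", "compact", "suv")
table(car.classes)  # the counts that become the heights of the bars
```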
In this bar plot, each bar shows the number of observations in each class of car in the data set. We use a data set included in ‘ggplot2’ for this example, based on its documentation.

ggplot(mpg, aes(class)) + geom_bar()

[Figure: a bar plot of counts of observations per class of car in the mpg data set.]

We can easily get stacked bars grouped by the number of cylinders of the engine.


ggplot(mpg, aes(class, fill = factor(cyl))) + geom_bar()

[Figure: the same bar plot with bars stacked by factor(cyl), mapped to fill.]

The default palette used for fill is rather ugly, so we also show the same plot
with another scale for fill.

ggplot(mpg, aes(class, fill = factor(cyl))) +
  geom_bar(color = "black") +
  scale_fill_brewer()

[Figure: the stacked bar plot using the Brewer fill scale and black bar outlines.]

6.12 Plotting summaries

The summaries discussed in this section can be superimposed on raw data plots, or plotted on their own. Beware that if scale limits are manually set, the summaries will be calculated from the subset of observations within these limits. Scale limits can be altered when explicitly defining a scale or by means of functions xlim() and ylim() . See the text box on page 221 for a way of constraining the viewport (the region visible in the plot) by changing coordinate limits while keeping the scale limits on a wider range of 𝑥 and 𝑦 values.

6.12.1 Statistical “summaries”

It is possible to summarize data on-the-fly when plotting. We describe in the same section the calculation of measures of central position and of variation, as stat_summary() allows them to be calculated in the same function call.
For the examples we will generate some normally distributed artificial data.

fake.data <- data.frame(
  y = c(rnorm(10, mean=2, sd=0.5),
        rnorm(10, mean=4, sd=0.7)),
  group = factor(c(rep("A", 10), rep("B", 10)))
)

We first use scatter plots for the examples; later we give some additional examples for bar plots. We will reuse a “base” plot in a series of examples, so that the differences are easier to appreciate. We first add just the mean. In this case we need to pass as an argument to stat_summary() the geom to use, as the default one, geom_pointrange() , expects data for plotting error bars in addition to the mean.

ggplot(data=fake.data, aes(y=y, x=group)) +
  geom_point(shape = 21) +
  stat_summary(fun.y = "mean", geom="point", color="red", shape="-", size=10)




[Figure: scatter plot of the fake data, with the group means marked by red dashes.]

Then the median, by changing the argument passed to fun.y .

ggplot(data=fake.data, aes(y=y, x=group)) +
  geom_point(shape = 21) +
  stat_summary(fun.y = "median", geom="point", colour="red", shape="-", size=10)

[Figure: the same plot, with red dashes now marking the group medians.]

We can add the mean and 𝑝 = 0.95 confidence intervals assuming normality (using
the 𝑡 distribution):

ggplot(data=fake.data, aes(y=y, x=group)) +
  geom_point(shape = 21) +
  stat_summary(fun.data = "mean_cl_normal", colour="red", size=1, alpha=0.7)




[Figure: points plus red point-range bars showing means and 95% confidence intervals from mean_cl_normal.]
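As a sketch of the computation behind mean_cl_normal (the sample values below are made up), the same interval can be obtained in base R from the 𝑡 distribution:

```r
y <- c(2.1, 1.8, 2.4, 2.0, 1.7)
n <- length(y)
se <- sd(y) / sqrt(n)                             # standard error of the mean
ci <- mean(y) + c(-1, 1) * qt(0.975, n - 1) * se  # 95% confidence limits
ci
```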

We can add the means and 𝑝 = 0.95 confidence intervals not assuming normality
(using the actual distribution of the data by bootstrapping):

ggplot(data=fake.data, aes(y=y, x=group)) +
  geom_point(shape = 21) +
  stat_summary(fun.data = "mean_cl_boot", colour="red", size=1, alpha=0.7)

[Figure: means and bootstrap 95% confidence intervals computed with mean_cl_boot.]
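A rough base-R sketch of the percentile bootstrap underlying mean_cl_boot() (from package ‘Hmisc’); the sample data, seed and number of resamples are arbitrary:

```r
set.seed(42)
y <- rnorm(20, mean = 2, sd = 0.5)
# resample with replacement and compute the mean of each resample
boot.means <- replicate(1000, mean(sample(y, replace = TRUE)))
quantile(boot.means, c(0.025, 0.975))  # approximate 95% CI for the mean
```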

If needed, we can display less restrictive confidence intervals, at 𝑝 = 0.90 in this example, by means of conf.int = 0.90 passed within a list, through fun.args , to the underlying function being called.

ggplot(data=fake.data, aes(y=y, x=group)) +
  geom_point(shape = 21) +
  stat_summary(fun.data = "mean_cl_boot",
               fun.args = list(conf.int = 0.90),
               colour = "red", size = 1, alpha = 0.7)




[Figure: means with 90% bootstrap confidence intervals.]

We can plot error bars corresponding to ±s.e. (standard errors) with the function
"mean_se" , added in ‘ggplot2’ 2.0.0.

ggplot(data=fake.data, aes(y=y, x=group)) +
  geom_point(shape = 21) +
  stat_summary(fun.data = "mean_se",
               colour="red", size=1, alpha=0.7)

[Figure: means ± one standard error plotted with mean_se.]
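What mean_se() returns can be reproduced with a few lines of base R (illustrative numbers):

```r
y <- c(2.1, 1.8, 2.4, 2.0, 1.7)
se <- sd(y) / sqrt(length(y))  # standard error of the mean
c(ymin = mean(y) - se, y = mean(y), ymax = mean(y) + se)
```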

 Scale- and coordinate limits are very different. Scale limits restrict the data used, while coordinate limits restrict the data that are visible. For a scatter plot, the effect of either approach on the resulting plot is equivalent, as no calculations are involved, but when using statistics to compute summaries, one should almost always rely on coordinate limits, to make sure that no data are excluded from the calculated summary. An example follows, using artificial data with an outlier added.
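Numerically, excluding observations before summarising changes the estimate; with made-up values:

```r
y <- c(2, 2.2, 1.9, 10)  # the last value is an outlier
mean(y)                  # outlier included: 4.025
mean(y[y <= 3])          # outlier silently excluded: about 2.03
```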

outlier.data <- fake.data
outlier.data[1, "y"] <- outlier.data[1, "y"] * 5

This figure has the wrong values for mean and standard error, as the outlier has been excluded from the calculations. A warning is issued, reporting that observations have been excluded. One should never ignore such warnings before one understands why they are being triggered and is confident that this is what one really intended!

ggplot(data=outlier.data, aes(y=y, x=group)) +
  stat_summary(fun.data = "mean_se",
               colour="red", size=1, alpha=0.7) +
  ylim(range(fake.data$y))

## Warning: Removed 1 rows containing non-finite values
## (stat_summary).

[Figure: mean ± s.e. computed after the outlier was dropped by the scale limits (incorrect).]

This figure has the correct values for mean and standard error, as the outlier
has been included in the calculations.

ggplot(data=outlier.data, aes(y=y, x=group)) +
  stat_summary(fun.data = "mean_se",
               colour="red", size=1, alpha=0.7) +
  coord_cartesian(ylim = range(fake.data$y))

[Figure: mean ± s.e. computed from all data, with the viewport zoomed by coord_cartesian() (correct).]

As mult is the multiplier based on the probability distribution used, by default Student’s 𝑡, by setting it to one we also get standard errors of the mean.

ggplot(data=fake.data, aes(y=y, x=group)) +
  geom_point(shape = 21) +
  stat_summary(fun.data = "mean_cl_normal",
               fun.args = list(mult = 1),
               colour="red", size=1, alpha=0.7)

[Figure: means ± one standard error obtained with mean_cl_normal and mult = 1.]

However, be aware that code such as the chunk below (not evaluated here), as used with earlier versions of ‘ggplot2’, needs to be rewritten as above.

ggplot(data=fake.data, aes(y=y, x=group)) +
  geom_point(shape = 21) +
  stat_summary(fun.data = "mean_cl_normal", mult = 1,
               colour="red", size=1, alpha=0.7)

Finally we can plot error bars showing ±s.d. (standard deviation).

ggplot(data=fake.data, aes(y=y, x=group)) +
  geom_point(shape = 21) +
  stat_summary(fun.data = "mean_sdl", colour="red", size=1, alpha=0.7)

[Figure: means ± one standard deviation plotted with mean_sdl.]

We do not give an example here, but instead of using these functions (from package ‘Hmisc’) it is possible to define one’s own functions. In addition, as the arguments to any function used, except for the first one containing the actual data, are supplied as a list through formal parameter fun.args , there is a lot of flexibility with respect to what functions can be used.
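A sketch of such a user-defined summary function; a fun.data function must return a data frame with columns y , ymin and ymax . The name median_iqr is hypothetical, not from any package:

```r
# summary function returning the median and the quartiles,
# in the format stat_summary() expects from fun.data
median_iqr <- function(x) {
  data.frame(y = median(x),
             ymin = unname(quantile(x, 0.25)),
             ymax = unname(quantile(x, 0.75)))
}
median_iqr(c(1, 2, 3, 4, 100))
```

It could then be passed to stat_summary() as its fun.data argument.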


Finally we plot the means in a scatter plot, with the observations superimposed and a 𝑝 = 0.95 confidence interval (the order in which the geoms are added is important: by having geom_point() last it is plotted on top of the bars). In this case we set fill, colour and alpha (transparency) to constants, but in more complex data sets mapping them to factors in the data set can be used to distinguish them. Adding stat_summary() twice allows us to plot the mean and the error bars using different colors.

ggplot(data=fake.data, aes(y=y, x=group)) +
  stat_summary(fun.y = "mean", geom = "point",
               fill="white", colour="black") +
  stat_summary(fun.data = "mean_cl_boot",
               geom = "errorbar",
               width=0.1, size=1, colour="red") +
  geom_point(size=3, alpha=0.3)


[Figure: observations, means, and red bootstrap-CI error bars combined in one plot.]

Similarly to scatter plots, we can plot summaries as bar plots and add error bars. If we supply a different argument to stat , we can, for example, plot the means or medians of a variable for each class of car.

ggplot(mpg, aes(class, hwy)) + geom_bar(stat = "summary", fun.y = mean)

[Figure: bar plot of the mean of hwy for each class of car.]

ggplot(mpg, aes(class, hwy)) + geom_bar(stat = "summary", fun.y = median)

[Figure: bar plot of the median of hwy for each class of car.]

The “reverse” syntax is also possible: we can add the statistic to the plot object and pass the geometry as an argument to it.

ggplot(mpg, aes(class, hwy)) +
  stat_summary(geom = "col", fun.y = mean)

[Figure: the same bar plot of mean hwy, built with stat_summary() and geom "col".]

And we can easily add error bars to the bar plot. We use size to make the lines of the error bar thicker, and a value smaller than one for fatten to make the point smaller. The default geom for stat_summary() is geom_pointrange() .

ggplot(mpg, aes(class, hwy)) +
  stat_summary(geom = "col", fun.y = mean) +
  stat_summary(fun.data = "mean_se", size = 1,
               fatten = 0.5, color = "red")

[Figure: bars of mean hwy with red point-range error bars (± s.e.).]

Instead of making the point smaller, we can pass "linerange" as the argument for geom to eliminate the point completely, through the use of geom_linerange() .

ggplot(mpg, aes(class, hwy)) +
  stat_summary(geom = "col", fun.y = mean) +
  stat_summary(geom = "linerange",
               fun.data = "mean_se", size = 1,
               color = "red")

[Figure: bars with plain red line-range error bars, with no point.]

Passing "errorbar" to geom results in the use of geom_errorbar() , giving traditional “capped” error bars. However, this type of error bar has been criticized as adding unnecessary clutter to plots (Tufte 1983). We use width to reduce the width of the cross lines at the ends of the bars.

ggplot(mpg, aes(class, hwy)) +
  stat_summary(geom = "col", fun.y = mean) +
  stat_summary(geom = "errorbar",
               fun.data = "mean_se", width = 0.1, size = 1,
               color = "red")

[Figure: bars with capped red error bars drawn by geom_errorbar().]

If we have ready-calculated values for the summaries, we can still obtain the same plots. Here we calculate the summaries before plotting, and then redraw the plot immediately above.

mpg_g <- dplyr::group_by(mpg, class)
mpg_summ <- dplyr::summarise(mpg_g, hwy_mean = mean(hwy),
                             hwy_se = sd(hwy) / sqrt(n()))


ggplot(mpg_summ, aes(x = class,
                     y = hwy_mean,
                     ymax = hwy_mean + hwy_se,
                     ymin = hwy_mean - hwy_se)) +
  geom_col() +
  geom_errorbar(width = 0.1, size = 1, color = "red")

[Figure: the same bar plot with error bars, built from the pre-computed summaries.]

6.13 Fitted smooth curves

The statistic stat_smooth() fits a smooth curve to observations when the scales for 𝑥 and 𝑦 are continuous. For the first example we use the default smoother; the smoothing method is chosen automatically based on the number of observations (loess here, as the data set is small).

ggplot(data = mtcars, aes(x=disp, y=mpg)) +
  stat_smooth()

## `geom_smooth()` using method = 'loess'

[Figure: a loess smoother with its confidence band, fitted to mpg vs. disp.]

In most cases we will want to plot the observations as points together with the smoother. We can plot the observations on top of the smoother, as done here, or the smoother on top of the observations.

ggplot(data = mtcars, aes(x=disp, y=mpg)) +
  stat_smooth() +
  geom_point()

## `geom_smooth()` using method = 'loess'

[Figure: the observations plotted on top of the loess smoother.]

Instead of using the default smoother, we can fit a different model. In this example we use a linear model as the smoother, fitted by lm() .

ggplot(data = mtcars, aes(x=disp, y=mpg)) +
  stat_smooth(method="lm") +
  geom_point()

[Figure: a linear regression fitted with lm() as the smoother, plus the observations.]

These data are really grouped, so we map the grouping to the color aesthetic. Now
we get three groups of points with different colours but also three separate smooth
lines.

ggplot(data = mtcars, aes(x=disp, y=mpg, color=factor(cyl))) +
  stat_smooth(method="lm") +
  geom_point()

[Figure: separate linear fits, one for each level of factor(cyl).]

To obtain a single smoother for the three groups, we need to set the mapping of the color aesthetic to a constant within stat_smooth . This local value overrides, just for this single statistic, the whole-plot default set with aes . We use "black" but this could be replaced by any other color definition known to R.

ggplot(data = mtcars, aes(x=disp, y=mpg, color=factor(cyl))) +
  stat_smooth(method="lm", colour="black") +
  geom_point()

[Figure: a single black linear fit for all groups, with points still coloured by factor(cyl).]

Instead of using the default formula for a linear regression as smoother, we pass
a different formula as argument. In this example we use a polynomial of order 2
fitted by lm() .

ggplot(data = mtcars, aes(x=disp, y=mpg, color=factor(cyl))) +
  stat_smooth(method="lm", formula=y~poly(x,2), colour="black") +
  geom_point()

[Figure: a second-order polynomial fitted by lm() as the smoother.]
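The same polynomial can be fitted directly with lm() outside of any plot; stat_smooth() performs an equivalent fit behind the scenes:

```r
fit <- lm(mpg ~ poly(disp, 2), data = mtcars)
coef(fit)  # intercept plus two orthogonal-polynomial coefficients
```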

It is possible to use other types of models, including GAM and GLM, as smoothers, but we will not give examples of the use of these more advanced models in this section.

 The different geoms and elements can be added in almost any order to a ggplot object, but they will be plotted in the order that they are added. The alpha (transparency) aesthetic can be mapped to a constant to make underlying layers visible, or alpha can be mapped to a data variable, for example making the transparency of points in a plot depend on the number of observations used in its calculation.

ggplot(data = mtcars, aes(x=disp, y=mpg, colour=factor(cyl))) +
  geom_point() +
  geom_smooth(colour="black", alpha=0.7) +
  theme_bw()

## `geom_smooth()` using method = 'loess'

[Figure: points drawn first, smoother and confidence band drawn on top.]

The plot looks different if the order of the geometries is swapped. The data points overlapping the confidence band are more clearly visible in this second example because they are above the shaded area instead of below it.

ggplot(data = mtcars, aes(x=disp, y=mpg, colour=factor(cyl))) +
  geom_smooth(colour="black", alpha=0.7) +
  geom_point() +
  theme_bw()

## `geom_smooth()` using method = 'loess'

[Figure: smoother drawn first, points drawn on top of the confidence band.]

6.14 Frequencies and densities

A different type of summary is given by frequencies and empirical density functions. These can be calculated in one or more dimensions. Sometimes, instead of calculating them, we rely on the density of graphical elements to convey the density of observations: scatter plots using a well-chosen value for alpha can give a satisfactory impression of the density. Rug plots, described below, work in a similar way.

6.14.1 Marginal rug plots

Rug plots are rarely used by themselves. Instead they are usually an addition to scatter plots. An example follows. They make it easier to see the distribution of observations along the 𝑥- and 𝑦-axes.
We generate new fake data by random sampling from the normal distribution. We use set.seed(12345) to initialize the pseudo-random number generator so that the same data are generated each time the code is run.

set.seed(12345)
my.data <-
data.frame(x = rnorm(200),
y = c(rnorm(100, -1, 1), rnorm(100, 1, 1)),
group = factor(rep(c("A", "B"), c(100, 100))) )

ggplot(my.data, aes(x, y, colour = group)) +
  geom_point() +
  geom_rug()

[Figure: scatter plot of the two groups, with marginal rugs on both axes.]

6.14.2 Histograms

Histograms are defined by how the plotted values are calculated. Although they are most frequently plotted as bar plots, many bar plots are not histograms. Although rarely done in practice, a histogram could be plotted using a different geometry together with stat_bin , the statistic used by default by geom_histogram() . This statistic bins the observations before computing frequencies, as is suitable for continuous 𝑥 scales. For categorical data stat_count should be used, which, as seen in section 6.11 on page 216, is the default stat for geom_bar .
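The binning that stat_bin performs can be mimicked with cut() and table() in base R (made-up values):

```r
x <- c(0.1, 0.4, 0.5, 0.9, 1.2, 1.4)
bins <- cut(x, breaks = seq(0, 1.5, by = 0.5))  # three half-open intervals
table(bins)                                     # counts per bin
```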

ggplot(my.data, aes(x)) +
geom_histogram(bins = 15)

[Figure: histogram of x with 15 bins.]

ggplot(my.data, aes(y, fill = group)) +
  geom_histogram(bins = 15, position = "dodge")

[Figure: histograms of y by group, with dodged bars.]

ggplot(my.data, aes(y, fill = group)) +
  geom_histogram(bins = 15, position = "stack")

[Figure: the same histograms with stacked bars.]

ggplot(my.data, aes(y, fill = group)) +
  geom_histogram(bins = 15, position = "identity", alpha = 0.5) +
  theme_bw(16)

[Figure: overlapping semi-transparent histograms using position "identity".]

The geometry geom_bin2d() by default uses the statistic stat_bin2d , which can be thought of as computing a histogram in two dimensions. The frequency for each rectangle is mapped onto a fill scale.

ggplot(my.data, aes(x, y)) +
  geom_bin2d(bins = 8) +
  facet_wrap(~group)

[Figure: 2D binned counts shown as rectangles, one panel per group.]

The geometry geom_hex() by default uses the statistic stat_binhex() , which can also be thought of as a histogram in two dimensions. The frequency for each hexagon is mapped onto a fill scale.

ggplot(my.data, aes(x, y)) +
  geom_hex(bins = 8) +
  facet_wrap(~group)

[Figure: 2D binned counts shown as hexagons, one panel per group.]

6.14.3 Density plots

Empirical density functions are the equivalent of a histogram, but are continuous and not calculated using bins. They can be calculated in one or two dimensions (2d), for 𝑥, or for 𝑥 and 𝑦, respectively. As with histograms, it is possible to use different geometries to visualize them.
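The estimate plotted by geom_density() comes from a kernel density computation such as the one done by base R's density() :

```r
set.seed(1)
x <- rnorm(100)
d <- density(x)  # kernel density estimate, evaluated at 512 points by default
range(d$y)       # the estimated densities are non-negative
```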

ggplot(my.data, aes(x, colour = group)) +
  geom_density()

[Figure: density estimates of x, one curve per group.]

ggplot(my.data, aes(y, colour = group)) +
  geom_density()

[Figure: density estimates of y, one curve per group.]

ggplot(my.data, aes(y, fill = group)) +
  geom_density(alpha = 0.5)

[Figure: the same densities drawn as semi-transparent filled areas.]

ggplot(my.data, aes(x, y, colour = group)) +
  geom_point() +
  geom_rug() +
  geom_density_2d()

[Figure: scatter plot with marginal rugs and 2D density contours for each group.]

ggplot(my.data, aes(x, y)) +
  geom_density_2d() +
  facet_wrap(~group)

[Figure: 2D density contours, one panel per group.]

ggplot(my.data, aes(x, y)) +
  stat_density_2d(aes(fill = ..level..), geom = "polygon") +
  facet_wrap(~group)

[Figure: filled density polygons, with the contour level mapped to fill.]

6.14.4 Box and whiskers plots

Box-and-whiskers plots, also very frequently called just boxplots, are also summaries that convey some of the characteristics of a distribution. They are calculated and plotted by means of geom_boxplot() . Although they can be calculated and plotted based on just a few observations, they are not useful unless each box plot is based on more than 10 to 15 observations.
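The values drawn by geom_boxplot() correspond to those computed by base R's boxplot.stats() (illustrative data):

```r
y <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
boxplot.stats(y)$stats  # lower whisker end, hinge, median, hinge, upper whisker end
boxplot.stats(y)$out    # 100 is flagged as an outlier
```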

ggplot(my.data, aes(group, y)) +
  geom_boxplot()

[Figure: box plots of y for groups A and B.]

As with other geometries, their appearance obeys both the usual aesthetics, such as color, and others specific to this type of visual representation.


6.14.5 Violin plots

Violin plots are a more recent development than box plots, and are usable with relatively large numbers of observations. They could be thought of as a sort of hybrid between an empirical density function and a box plot. As is the case with box plots, they are particularly useful when comparing distributions of related data, side by side.

ggplot(my.data, aes(group, y)) +
  geom_violin()

[Figure: violin plots of y for groups A and B.]

ggplot(my.data, aes(group, y, fill = group)) +
  geom_violin(alpha = 0.16) +
  geom_point(alpha = 0.33, size = rel(4),
             colour = "black", shape = 21)

[Figure: semi-transparent violins with the observations overplotted as points.]

As with other geometries, their appearance obeys both the usual aesthetics, such as color, and others specific to this type of visual representation.


6.15 Using facets

Sets of coordinated plots are a very useful tool for visualizing data. These became popular through the trellis graphs in S, and the ‘lattice’ package in R. The basic idea is to have rows and/or columns of plots with common scales, all plots showing values for the same response variable. This is useful when there are multiple classification factors in a data set. Similar-looking plots, but with free scales or with the same scale but a ‘floating’ intercept, are sometimes also useful. In ‘ggplot2’ there are two possible types of facets: facets organized in a grid, and facets along a single ‘axis’ but wrapped into several rows. These are produced by adding facet_grid() or facet_wrap() to a ggplot, respectively. In the examples below we use geom_point() , but faceting can be used with any ggplot object (even with maps, spectra and ternary plots produced by functions in packages ‘ggmap’, ‘ggspectra’ and ‘ggtern’).

 The code underlying faceting has been rewritten in ‘ggplot2’ version 2.2.0.
All the examples given here are backwards compatible with versions 2.1.0 and
possibly 2.0.0. The new functionality is related to the writing of extensions or
controlled through themes, and will be discussed in other sections.

p <- ggplot(data = mtcars, aes(mpg, wt)) + geom_point()
# With one variable
p + facet_grid(. ~ cyl)

[Figure: facets in one row, one column per level of cyl.]

p + facet_grid(cyl ~ .)

[Figure: facets in one column, one row per level of cyl.]

p + facet_grid(. ~ cyl, scales = "free")

[Figure: the same facets with free scales.]

p + facet_grid(. ~ cyl, scales = "free", space = "free")

[Figure: facets with free scales and free panel widths (space = "free").]


p + facet_grid(vs ~ am)

[Figure: a 2 × 2 grid of facets for vs and am.]

p + facet_grid(vs ~ am, margins=TRUE)

[Figure: the same grid with marginal panels added (margins = TRUE).]

p + facet_grid(. ~ vs + am)

[Figure: one row of facets for the combinations of vs and am.]


p + facet_grid(. ~ vs + am, labeller = label_both)

[Figure: the same facets labelled with both variable names and values (label_both).]

p + facet_grid(. ~ vs + am, margins=TRUE)

[Figure: facets for vs and am with marginal panels added.]

p + facet_grid(cyl ~ vs, labeller = label_both)

[Figure: a grid of facets for cyl and vs labelled with label_both.]

mtcars$cyl12 <- factor(mtcars$cyl,
                       labels = c("alpha", "beta", "sqrt(x, y)"))
p1 <- ggplot(data = mtcars, aes(mpg, wt)) +
  geom_point() +
  facet_grid(. ~ cyl12, labeller = label_parsed)

Here we use as labeller function label_bquote() with a special syntax that allows us to use an expression where replacement based on the facet (panel) data takes place. See section 6.20 for an example of the use of bquote() , the R function upon which this labeller is built.

p + facet_grid(. ~ vs, labeller = label_bquote(alpha ^ .(vs)))

[Figure: facets labelled with the expression α raised to the value of vs.]

In versions of ‘ggplot2’ before 2.0.0, labeller was not implemented for facet_wrap() ; it was only available for facet_grid() .


p + facet_wrap(~ vs, labeller = label_bquote(alpha ^ .(vs)))

[Figure: wrapped facets labelled α⁰ and α¹ with label_bquote().]

A minimal example of a wrapped facet follows. In this case the number of levels is small; when there are more, the row of plots will be wrapped into two or more continuation rows. When using facet_wrap() there is only one dimension, so no ‘.’ is needed before or after the tilde.

p + facet_wrap(~ cyl)

[Figure: wrapped facets, one panel per level of cyl.]

An example showing that even though faceting with facet_wrap() is along a single, possibly wrapped, row, it is possible to produce facets based on more than one variable.

p + facet_wrap(~ vs + am, ncol=2)

[Figure: facets for the combinations of vs and am, wrapped into two columns.]

6.16 Scales

Scales map data onto aesthetics. There are different types of scales depending on the characteristics of the data being mapped: scales can be continuous or discrete. And, of course, there are scales for different attributes of the plotted geometrical objects, such as color , size , position ( x, y, z ), alpha or transparency, angle , justification, etc. This means that many properties of, for example, the symbols used in a plot can be either set by a constant or mapped to data. The most elemental mapping is identity , which means that the data are taken at face value. In a numerical scale, say scale_x_continuous() , this means that, for example, a ‘5’ in the data is plotted at a position in the plot corresponding to the value ‘5’ along the 𝑥-axis. A simple mapping could be a log10 transformation, which we can easily achieve with the pre-defined scale_x_log10() , in which case the position on the 𝑥-axis will be based on the logarithm of the original data. A continuous data variable can, if we think it useful for describing our data, be mapped to a continuous scale using either an identity mapping or a transformation, which could be useful, for example, if we want to map the value of a variable to the area of a symbol rather than to its diameter.
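The transformation applied by scale_x_log10() can be seen with a quick base-R computation; positions along the axis are transformed while tick labels stay in the original units:

```r
x <- c(1, 10, 100, 1000)
log10(x)  # positions along the axis: 0 1 2 3
```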


Discrete scales work in a similar way. We can use scale_colour_identity() and have in our data a variable with values that are valid colour names like "red" or "blue". However, we can also map the colour aesthetic to a factor with levels like "control" and "treatment", and these levels will be mapped to colours from the default palette, unless we choose a different palette, or even use scale_colour_manual() to assign whatever colour we want to each level to be mapped. The same is true for other discrete scales like symbol shape and linetype . Remember that, for example for colour and ‘numbers’, there are both discrete and continuous scales available. Mapping colour or fill to NA makes such observations invisible.

Advanced scale manipulation requires package ‘scales’ to be loaded, although ‘ggplot2’ 2.0.0 and later re-exports many functions from package ‘scales’. Some simple examples follow.

We generate new fake data.

fake2.data <-
data.frame(y = c(rnorm(20, mean=20, sd=5),
rnorm(20, mean=40, sd=10)),
group = factor(c(rep("A", 20), rep("B", 20))),
z = rnorm(40, mean=12, sd=6))

6.16.1 Continuous scales for 𝑥 and 𝑦

Limits

To change the limits of the 𝑦 scale, ylim() is a convenience function used to modify the limits ( lims ) of the scale used by the 𝑦 aesthetic. We here exemplify the use of ylim() only, but xlim() can be used equivalently for the 𝑥 scale.

We can set both limits, minimum and maximum.

ggplot(fake2.data, aes(z, y)) + geom_point() + ylim(0, 100)

[Figure: scatter plot with the y scale limited to the range 0 to 100.]

We can set both limits, minimum and maximum, reversing the direction of the axis
scale.

ggplot(fake2.data, aes(z, y)) + geom_point() + ylim(100, 0)

[Figure: the same plot with the y axis reversed by swapping the limits.]

We can set one limit and leave the other one free.

ggplot(fake2.data, aes(z, y)) + geom_point() + ylim(0, NA)

[Figure: the y scale with only the lower limit set to zero.]

We can use lims with discrete scales, listing all the levels that are to be included in the scale, even if they are missing from a given data set, such as after subsetting. And we can expand the limits, to set a default minimum range that will grow when needed to accommodate all observations in the data set. Of course, here x and y refer to the aesthetics and not to names of variables in the data frame fake2.data .

ggplot(fake2.data, aes(z, y)) + geom_point() + expand_limits(y = 0, x = 0)


[Figure: limits expanded so that the origin (0, 0) is included in the plot.]

Transformed scales

The default scale used by the y aesthetic applies an identity transformation, but there are also predefined scales for transformed data. Although transformations can be passed as an argument to scale_x_continuous() and scale_y_continuous() , there are predefined convenience scale functions for log10 , sqrt and reverse .


 Similarly to R’s maths functions, the names of the scales are
scale_x_log10() and scale_y_log10() , rather than scale_y_log() , because in
R the function log() returns the natural (Napierian) logarithm.

We can use scale_x_reverse() to reverse the direction of a continuous scale.
ggplot(fake2.data, aes(z, y)) + geom_point() + scale_x_reverse()

[plot omitted: scatter plot of y vs. z with the x axis reversed]

Axis tick-labels display the original values before applying the transformation. The
"breaks" need to be given in the original scale as well. We use scale_y_log10() to
apply a log10 transformation to the 𝑦 values.

ggplot(fake2.data, aes(z, y)) +
  geom_point() +
  scale_y_log10(breaks = c(10, 20, 50, 100))



[plot omitted: scatter plot of y vs. z with a log10-transformed y axis and breaks at 10, 20, 50 and 100]


In contrast, transforming the data on the fly when mapping them to the 𝑦 aesthetic
results in tick labels expressed in the logarithm of the original data.

ggplot(fake2.data, aes(z, log10(y))) + geom_point()

[plot omitted: scatter plot of log10(y) vs. z; the y axis tick labels show the transformed values]

We show here how to specify a transformation to a continuous scale, using a
predefined “transformation” object.

ggplot(fake2.data, aes(z, y)) +
  geom_point() +
  scale_y_continuous(trans = "reciprocal")



[plot omitted: scatter plot of y vs. z with a reciprocal-transformed y axis]

Natural logarithms are important in growth analysis as the slope against time gives
the relative growth rate. We show this with the Orange data set.

ggplot(data = Orange,
aes(x = age, y = circumference, color = Tree)) +
geom_line() +
geom_point() +
scale_y_continuous(trans = "log", breaks = c(20, 50, 100, 200))

[plot omitted: circumference vs. age for the Orange data set, lines and points colored by Tree, log-transformed y axis with breaks at 20, 50, 100 and 200]

In section 6.23.3 on page 313 we define and use a transformation object.

 When combining scale transformations and summaries, one should be aware
of which data are used, transformed or not.

Tick labels

Finally, when we want to display tick labels as percentages for data available as
fractions, we can use labels = scales::percent .

ggplot(fake2.data, aes(z, y / max(y))) +
  geom_point() +
  scale_y_continuous(labels = scales::percent)

[plot omitted: scatter plot of y/max(y) vs. z with percent tick labels on the y axis]

In the case of currency we can use labels = scales::dollar , and if we
want to use commas to separate thousands, millions, and so on, we can use
labels = scales::comma .

ggplot(fake2.data, aes(z, y)) +
  geom_point() +
  scale_y_continuous(labels = scales::dollar)

[plot omitted: scatter plot of y vs. z with dollar-formatted y axis tick labels]

When setting breaks , we can just accept the default labels computed for them.

ggplot(fake2.data, aes(z, y)) +
  geom_point() +
  scale_y_continuous(breaks = c(20, 40, 47, 60))

[plot omitted: scatter plot of y vs. z with y axis breaks at 20, 40, 47 and 60]

We can also set tick labels manually, in parallel to the setting of breaks .

ggplot(fake2.data, aes(z, y)) +
  geom_point() +
  scale_y_continuous(breaks = c(20, 40, 47, 60),
                     labels = c("20", "40", "->", "60"))

[plot omitted: scatter plot of y vs. z with the tick label at y = 47 replaced by "->"]

Using an expression we obtain a Greek letter.

ggplot(fake2.data, aes(z, y)) +
  geom_point() +
  scale_y_continuous(breaks = c(20, 40, 47, 60),
                     labels = c("20", "40", expression(alpha), "60"))

[plot omitted: scatter plot of y vs. z with the tick label at y = 47 replaced by the Greek letter alpha]

We can pass to labels a function that accepts the breaks and returns labels. Pack-
age ‘scales’ defines several such formatters, and we can also define our own. Here
we use scientific notation for the tick labels.

ggplot(fake2.data, aes(z, y)) +
  geom_point() +
  scale_y_continuous(labels = scales::scientific_format())

[plot omitted: scatter plot of y vs. z with scientific-notation y axis tick labels]

Please, see section 6.23.3 on page 313 for an example of the use of
scales::math_format together with a logarithmic transformation of the data.

6.16.2 Time and date scales for 𝑥 and 𝑦

Limits

Time and date scales are conceptually similar to continuous numeric scales, but use
special data types and formatting for the labels. We can set limits and breaks using
time or date constants. These are most easily input with the functions in packages
‘lubridate’ or ‘anytime’.
Please, see section ?? on page ?? for examples.
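A minimal sketch of setting such limits (the data frame and variable names below are invented for illustration, and ymd_hm() and hours() are from ‘lubridate’):

```r
library(ggplot2)
library(lubridate)
# One day of fake hourly observations.
df <- data.frame(when = ymd_hm("2017-04-11 00:00") + hours(0:23),
                 y = rnorm(24))
# Show only the 06:00 to 18:00 portion of the day.
ggplot(df, aes(when, y)) +
  geom_line() +
  scale_x_datetime(limits = ymd_hm(c("2017-04-11 06:00",
                                     "2017-04-11 18:00")))
```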

Axis labels

By default, the tick labels produced and their formatting are automatically selected
based on the extent of the time data. For example, if we have all data collected within
a single day, then the tick labels will show hours and minutes. If we plot data for
several years, the labels will show the date portion of the time instant. The default is
frequently good enough, but it is possible, as for numbers, to use different formatter
functions to generate the tick labels.

6.16.3 Discrete scales for 𝑥 and 𝑦

In the case of ordered or unordered factors, the tick labels are by default the names
of the factor levels. Consequently, one roundabout way of obtaining the desired tick
labels is to use them as factor levels. This approach is not recommended, as in most
cases the text of the desired tick labels is not a valid R name, making the code using
them difficult to type in scripts or at the command prompt. It is best to
use simple mnemonic short names for factor levels and variables, and to set suitable
labels when plotting, as we show here.

 When using factors, the ordering used for plotting levels is the one they
have in the factor. When a factor is created, the default is for the levels to be stored
in alphabetical order. This default can easily be overridden at the time of creation,
and the order can also be modified at a later time.

default.fct <- factor(c("a", "c", "f", "f", "a", "d"))
levels(default.fct)

## [1] "a" "c" "d" "f"

levels.fct <- factor(c("a", "c", "f", "f", "a", "d"),
                     levels = c("f", "a", "d", "c"))
levels(levels.fct)

## [1] "f" "a" "d" "c"

Function reorder() can be used to change the order of the levels based on the values
of a numeric variable. We will visit once again the Orange data set.

my1.Tree <- with(Orange,
                 reorder(Tree, -circumference))
levels(Orange$Tree)

## [1] "3" "1" "5" "2" "4"

levels(my1.Tree)

## [1] "4" "2" "5" "1" "3"

In this particular case, this is equivalent to reversing the order of the levels.

my2.Tree <- with(Orange,
                 factor(Tree,
                        levels = rev(levels(Tree))))
levels(Orange$Tree)

## [1] "3" "1" "5" "2" "4"

levels(my2.Tree)

## [1] "4" "2" "5" "1" "3"


We restore the default ordering.

my3.Tree <- with(Orange,
                 factor(Tree,
                        levels = sort(levels(Tree))))
levels(Orange$Tree)

## [1] "3" "1" "5" "2" "4"

levels(my3.Tree)

## [1] "1" "2" "3" "4" "5"

We can set the levels in any arbitrary order by explicitly listing the level names,
not only at the time of creation but also later. Here we show that it is possible
to not only reorder existing levels, but even to add a level for which there are no
observations.

my3.Tree <- with(Orange,
                 factor(Tree,
                        levels = c("1", "2", "3", "4", "5", "9")))
levels(Orange$Tree)

## [1] "3" "1" "5" "2" "4"

levels(my3.Tree)

## [1] "1" "2" "3" "4" "5" "9"

We use here once again the mpg data set.

We order the columns in the plot based on mpg$hwy by reordering mpg$class . This
approach makes sense if this ordering is needed for all plots. It is always bad to keep
several versions of a single data set as it easily leads to mistakes and confusion.

my.mpg <- mpg
my.mpg$class <- with(my.mpg, reorder(factor(class), hwy))
ggplot(my.mpg, aes(class, hwy)) +
  stat_summary(geom = "col", fun.y = mean)

[plot omitted: column plot of mean hwy by class, columns ordered by increasing hwy]

Or the same on the fly, which is much better as the data remain unmodified.

ggplot(mpg, aes(reorder(factor(class), hwy), hwy)) +
  stat_summary(geom = "col", fun.y = mean)

[plot omitted: the same column plot, with the classes reordered on the fly]

Or ordering based on a different variable, displ .

ggplot(mpg, aes(reorder(factor(class), displ), hwy)) +
  stat_summary(geom = "col", fun.y = mean)

[plot omitted: column plot of mean hwy by class, columns ordered by increasing displ]

Alternatively, we can use scale_x_discrete() to reorder and select the columns
without altering the data. If we use this approach to subset the data, then to avoid
warnings we need to add na.rm = TRUE . We use the scale in this example to convert
level names to uppercase. The complementary function of toupper() is tolower() .

ggplot(mpg, aes(class, hwy)) +
  stat_summary(geom = "col", fun.y = mean, na.rm = TRUE) +
  scale_x_discrete(limits = c("compact", "subcompact", "midsize"),
                   labels = toupper)

[plot omitted: column plot of mean hwy for classes COMPACT, SUBCOMPACT and MIDSIZE only]

6.16.4 Size

For the size aesthetic several scales are available, both discrete and continu-
ous. They do not differ much from those already described above. Geomet-
ries geom_point() , geom_line() , geom_hline() , geom_vline() , geom_text() and
geom_label() obey size as expected. In the case of geom_bar() , geom_col() ,
geom_area() and all other geometric elements bordered by lines, size is obeyed
by these border lines. In fact, other aesthetics natural for lines, such as linetype ,
also apply to these borders.
When using size scales, breaks and labels affect the key or guide . In scales
that produce a key, passing guide = FALSE removes the key corresponding to the
scale.
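As a brief sketch (using the mpg data set rather than this chapter’s fake data), a continuous size scale with manually chosen breaks, which are the values shown in the key:

```r
library(ggplot2)
# Map the number of cylinders also to point size; the breaks set
# here determine which sizes appear in the key.
ggplot(mpg, aes(displ, hwy, size = cyl)) +
  geom_point() +
  scale_size_continuous(breaks = c(4, 6, 8))
```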

6.16.5 Color and fill

Colour and fill scales are similar, but they affect different elements of the plot. All
visual elements in a plot obey the color aesthetic, but only elements that have an
inner region and a boundary obey both the color and fill aesthetics. There are
separate but equivalent sets of scales available for these two aesthetics. We will
describe the color aesthetic in more detail and give only some examples for fill .
We will, however, start by reviewing how colors are defined and used in R.

Color definitions in R

Colors can be specified in R not only through character strings with the names of
previously defined colors, but also directly as strings describing the RGB components
as hexadecimal (base 16) numbers, such as "#FFFFFF" for white, "#000000" for
black, or "#FF0000" for the brightest available pure red. The list of color names
known to R can be obtained by entering colors() in the console.
Given the number of colors available, we may want to subset them based on their
names. Function colors() returns a character vector. We can use grep() or
grepl() to find the indexes of the names containing a given character substring, in this
example "dark" .

grep("dark",colors())

## [1] 73 74 75 76 77 78 79 80 81 82 83
## [12] 84 85 86 87 88 89 90 91 92 93 94
## [23] 95 96 97 98 99 100 101 102 103 104 105
## [34] 106 107 108 109 110 111 112 113 114 115

U Replace grep() by grepl() in the example above. What is the difference in
the returned value?

Although the vector of indexes, or the logical vector, could be used to extract the
subset of matching color names with code like,


colors()[grep("dark",colors())]

## [1] "darkblue" "darkcyan"


## [3] "darkgoldenrod" "darkgoldenrod1"
## [5] "darkgoldenrod2" "darkgoldenrod3"
## [7] "darkgoldenrod4" "darkgray"
## [9] "darkgreen" "darkgrey"
## [11] "darkkhaki" "darkmagenta"
## [13] "darkolivegreen" "darkolivegreen1"
## [15] "darkolivegreen2" "darkolivegreen3"
## [17] "darkolivegreen4" "darkorange"
## [19] "darkorange1" "darkorange2"
## [21] "darkorange3" "darkorange4"
## [23] "darkorchid" "darkorchid1"
## [25] "darkorchid2" "darkorchid3"
## [27] "darkorchid4" "darkred"
## [29] "darksalmon" "darkseagreen"
## [31] "darkseagreen1" "darkseagreen2"
## [33] "darkseagreen3" "darkseagreen4"
## [35] "darkslateblue" "darkslategray"
## [37] "darkslategray1" "darkslategray2"
## [39] "darkslategray3" "darkslategray4"
## [41] "darkslategrey" "darkturquoise"
## [43] "darkviolet"

a simpler approach is available.

grep("dark", colors(), value = TRUE)

## [1] "darkblue" "darkcyan"


## [3] "darkgoldenrod" "darkgoldenrod1"
## [5] "darkgoldenrod2" "darkgoldenrod3"
## [7] "darkgoldenrod4" "darkgray"
## [9] "darkgreen" "darkgrey"
## [11] "darkkhaki" "darkmagenta"
## [13] "darkolivegreen" "darkolivegreen1"
## [15] "darkolivegreen2" "darkolivegreen3"
## [17] "darkolivegreen4" "darkorange"
## [19] "darkorange1" "darkorange2"
## [21] "darkorange3" "darkorange4"
## [23] "darkorchid" "darkorchid1"
## [25] "darkorchid2" "darkorchid3"
## [27] "darkorchid4" "darkred"
## [29] "darksalmon" "darkseagreen"
## [31] "darkseagreen1" "darkseagreen2"
## [33] "darkseagreen3" "darkseagreen4"
## [35] "darkslateblue" "darkslategray"
## [37] "darkslategray1" "darkslategray2"
## [39] "darkslategray3" "darkslategray4"


## [41] "darkslategrey" "darkturquoise"


## [43] "darkviolet"

To retrieve the RGB values for a color definition we use

col2rgb("purple")

## [,1]
## red 160
## green 32
## blue 240

col2rgb("#FF0000")

## [,1]
## red 255
## green 0
## blue 0

Color definitions in R can contain a transparency described by an alpha value,
which by default is not returned.

col2rgb("purple", alpha = TRUE)

## [,1]
## red 160
## green 32
## blue 240
## alpha 255

With function rgb() we can define new named or nameless colors.

rgb(1, 1, 0)

## [1] "#FFFF00"

rgb(1, 1, 0, names = "my.color")

## my.color
## "#FFFF00"

rgb(255, 255, 0, names = "my.color", maxColorValue = 255)

## my.color
## "#FFFF00"

As described above, colors can be defined in the RGB color space; however, other
color models such as HSV (hue, saturation, value) can also be used to define colours.


hsv(c(0,0.25,0.5,0.75,1), 0.5, 0.5)

## [1] "#804040" "#608040" "#408080" "#604080"


## [5] "#804040"

Probably a more useful flavour of HSV colors are those returned by function
hcl() , for hue, chroma and luminance. While the “value” and “saturation” in HSV
are based on physical values, the “chroma” and “luminance” values in HCL are based
on human visual perception. Colours with equal luminance will be perceived as equally
bright by an average human being. In a scale based on different hues but equal chroma and
luminance values, as used by package ‘ggplot2’, all colours are perceived as equally
bright. The hues need to be expressed as angles in degrees, with values between zero
and 360.

hcl(c(0,0.25,0.5,0.75,1) * 360)

## [1] "#FFC5D0" "#D4D8A7" "#99E2D8" "#D5D0FC"


## [5] "#FFC5D0"

It is also important to remember that humans can only distinguish a limited set
of colours, and even smaller colour gamuts can be reproduced by screens and print-
ers. Furthermore, variation from individual to individual exists in color perception,
including different types of colour blindness. It is important to take this into account
when using colour in illustrations.

6.16.6 Continuous colour-related scales

Scales scale_color_continuous() , scale_color_gradient() ,
scale_color_gradient2() , scale_color_gradientn() , scale_color_date() and
scale_color_datetime() give a smooth continuous gradient between two or more
colours. They are useful for numeric, date and datetime data. A corresponding set
of fill scales is also available.
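A minimal sketch (with the mpg data set, not this chapter’s fake data) mapping a numeric variable to color through a two-color gradient:

```r
library(ggplot2)
# City fuel economy (cty) mapped to a gradient between two colors.
ggplot(mpg, aes(displ, hwy, color = cty)) +
  geom_point() +
  scale_color_gradient(low = "grey80", high = "black")
```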

6.16.7 Discrete colour-related scales

Scales scale_color_discrete() , scale_color_hue() and scale_color_grey() are
useful for categorical data stored as factors.
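A minimal sketch (again with mpg) rendering a factor with a grey color scale; start and end set the range of greys used:

```r
library(ggplot2)
# Number of cylinders, as a factor, rendered in shades of grey.
ggplot(mpg, aes(displ, hwy, color = factor(cyl))) +
  geom_point() +
  scale_color_grey(start = 0.1, end = 0.7)
```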

6.16.8 Identity scales

In the case of identity scales the mapping is one-to-one to the data. For example, if we
map the color or fill aesthetic to a variable using scale_color_identity() or
scale_fill_identity() , the variable in the data frame passed as argument for data
must already contain valid color definitions. In the case of mapping alpha , the vari-
able must contain numeric values in the range 0 to 1.
We create a data frame containing a variable colors with character strings
interpretable as the names of color definitions known to R. We then use them directly
in the plot.

df99 <- data.frame(x = 1:10, y = dnorm(10), colors = rep(c("red", "blue"), 5))

ggplot(df99, aes(x, y, color = colors)) +
  geom_point() +
  scale_color_identity()

[plot omitted: points at x = 1 to 10, colored red and blue as given by the colors variable]

U How does the plot look if the identity scale is deleted from the example
above? Edit and re-run the example code.

U While using the identity scale, how would you need to change the code
example above to produce a plot with green and purple points?

6.16.9 Position of axes

ggplot(fake2.data, aes(z, y)) + geom_point() +
  scale_x_continuous(position = "top") +
  scale_y_continuous(position = "right")

[plot omitted: scatter plot of y vs. z with the x axis at the top and the y axis on the right]
6.16.10 Secondary axes

ggplot(fake2.data, aes(z, y)) + geom_point() +
  scale_y_continuous(
    "y",
    sec.axis = sec_axis(~ . ^-1, name = "1/y")
  )

[plot omitted: scatter plot of y vs. z with a secondary y axis showing 1/y]

ggplot(fake2.data, aes(z, y)) + geom_point() +
  scale_y_continuous(
    "y",
    sec.axis = sec_axis(~ ., name = "y", breaks = c(33.2, 55.4))
  )

[plot omitted: scatter plot of y vs. z with a secondary y axis with breaks at 33.2 and 55.4]

6.17 Adding annotations

Annotations use the data coordinates of the plot, but do not ‘inherit’ data or aesthetics
from the ggplot object. They are added to a ggplot with annotate() . Annotations
frequently make use of the "text" or "label" geometries with character strings as
data, possibly to be parsed as expressions. However, other geometries can also be
very useful. We start with a simple example with text.

ggplot(fake2.data, aes(z, y)) +
  geom_point() +
  annotate(geom = "text",
           label = "origin",
           x = 0, y = 0,
           color = "blue",
           size = 4)


[plot omitted: scatter plot of y vs. z with the blue text annotation "origin" at (0, 0)]

U Play with the values of the arguments to annotate() to vary the position,
size, color, font family, font face, rotation angle and justification of the
annotation.

We can add lines to mark the origin more precisely and effectively. With ‘ggplot2’
2.2.1 we cannot use annotate() with geom = "vline" or geom = "hline" , but
we can achieve the same effect by directly adding layers with the geometries,
geom_vline() and/or geom_hline() , to the plot.

ggplot(fake2.data, aes(z, y)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "blue") +
  geom_vline(xintercept = 0, color = "blue")


[plot omitted: scatter plot of y vs. z with blue horizontal and vertical lines through the origin]

U Play with the values of the arguments to annotate to vary the position and
attributes of the lines. The vertical and horizontal line geometries have the same
properties as other lines: linetype, color, size, etc.

U Modify the examples above to use the line geometry for the annotations.
Explore the help page for geom_line and add arrows as annotations to the plot.

In this third example, in addition to adding expressions as annotations, we
also pass expressions as tick labels through the scale. Do notice that we use
recycling for setting the breaks, as c(0, 0.5, 1, 1.5, 2) * pi is equivalent to
c(0, 0.5 * pi, pi, 1.5 * pi, 2 * pi) . Annotations are plotted at their own pos-
ition, unrelated to any observation in the data, but using the same coordinates and
units as for plotting the data.

ggplot(data.frame(x = c(0, 2 * pi)), aes(x = x)) +
  stat_function(fun = sin) +
  scale_x_continuous(
    breaks = c(0, 0.5, 1, 1.5, 2) * pi,
    labels = c("0", expression(0.5~pi), expression(pi),
               expression(1.5~pi), expression(2~pi))) +
  labs(y = "sin(x)") +
  annotate(geom = "text",
           label = c("+", "-"),
           x = c(0.5, 1.5) * pi, y = c(0.5, -0.5),
           size = 20) +
  annotate(geom = "point",
           colour = "red",
           shape = 21,
           fill = "white",
           x = c(0, 1, 2) * pi, y = 0,
           size = 6)

[plot omitted: sin(x) over 0 to 2 pi with pi-based tick labels, "+" and "-" text annotations, and open points at the zero crossings]

U Modify the plot above to show the cosine instead of the sine function,
replacing sin with cos . This is easy, but the catch is that you will need to
relocate the annotations.


6.18 Coordinates and circular plots

In this section I include pie charts and wind-rose plots. Here we add a new “word” to
the grammar of graphics, coordinates, such as coord_polar() in the next examples.
The default coordinate system for the 𝑥 and 𝑦 aesthetics is Cartesian.

6.18.1 Pie charts

Pie charts are more difficult to read than bar plots: our brain is more comfortable
comparing lengths than angles. If used at all, pie charts should only show composition,
i.e., fractional components that add up to a total, and only when the number of
“pie slices” is small (rule of thumb: fewer than seven).
We make the equivalent of the first bar plot above. As we are still using geom_bar() ,
the default statistic is stat_count . As earlier, we use the brewer scale for nicer colors.

ggplot(data = mpg, aes(x = factor(1), fill = factor(class))) +
  geom_bar(width = 1, color = "black") +
  coord_polar(theta = "y") +
  scale_fill_brewer() +
  scale_x_discrete(breaks = NULL) +
  labs(x = NULL, fill = "Vehicle class")

[plot omitted: pie chart of vehicle counts by class, with a brewer fill scale]

Even with four slices pie charts can be difficult to read. Compare the following bar
plot and pie chart.

ggplot(data = mpg, aes(x = factor(cyl), fill = factor(cyl))) +
  geom_bar(color = "black") +
  scale_fill_grey() +
  scale_x_discrete(breaks = NULL) +
  labs(x = NULL, fill = "Cylinders") +
  theme_bw()


ggplot(data = mpg, aes(x = factor(1), fill = factor(cyl))) +
  geom_bar(width = 1, color = "black") +
  coord_polar(theta = "y") +
  scale_fill_grey() +
  scale_x_discrete(breaks = NULL) +
  labs(x = NULL, fill = "Cylinders") +
  theme_bw()

[plots omitted: bar plot and pie chart of vehicle counts by number of cylinders, grey fill scale]

An example comparing pie charts to bar plots is presented in section 6.23.6 on page
325.

6.18.2 Wind-rose plots

Wind-rose plots can be drawn as histograms on polar coordinates, when the data are
to be represented by frequencies, or as density plots. A bar plot, or a line or points,
can be used when the values are means calculated with a statistic, or when a single
observation is available per direction sector. It is also possible to use summaries or
smoothers.


Some types of data are more naturally expressed on polar coordinates than on
cartesian coordinates. The clearest example is wind direction, from which the name
derives. In some cases of time series data with a strong periodic variation, polar
coordinates can be used to highlight any phase shifts or changes in frequency. A
more mundane application is to plot variation in a response variable through the day
with a clock-face like representation of time-of-day.
We use for this example wind speed and direction data, measured once per minute
during 24 h.

load("data/wind-data.rda")

We first show a time series plot, using cartesian coordinates, which demonstrates
the problem of using an arbitrary origin at the North for a variable that does not have
a scale with true limits: early in the day the predominant direction is just slightly
West of 0 degrees North and the cloud of observations gets artificially split. We can
also observe a clear change in wind direction soon after solar noon.

ggplot(viikki_d29.dat, aes(solar_time, WindDir_D1_WVT)) +
  geom_point() +
  scale_x_datetime(date_labels = "%H:%M") +
  labs(x = "Time of day (hh:mm)", y = "Wind direction (degrees)")

[plot omitted: wind direction (degrees) vs. time of day (hh:mm); the point cloud near North is artificially split at 0/360 degrees]

No such problem exists with wind speed, and we add a smooth line with
geom_smooth() .

ggplot(viikki_d29.dat, aes(solar_time, WindSpd_S_WVT)) +
  geom_point() +
  geom_smooth() +
  scale_x_datetime(date_labels = "%H:%M") +
  labs(x = "Time of day (hh:mm)", y = "Wind speed (m/s)")

## `geom_smooth()` using method = 'gam'

[plot omitted: wind speed (m/s) vs. time of day (hh:mm), with a smooth line]

Using a scatter plot with polar coordinates helps to some extent, but having time
of day on the radial axis is rather unclear.

ggplot(viikki_d29.dat, aes(WindDir_D1_WVT, solar_time)) +
  coord_polar() +
  geom_point() +
  scale_x_continuous(breaks = c(0, 90, 180, 270),
                     labels = c("N", "E", "S", "W"),
                     limits = c(0, 360),
                     expand = c(0, 0),
                     name = "Wind direction") +
  scale_y_datetime(date_labels = "%H:%M",
                   name = "Time of day (hh:mm)",
                   date_breaks = "6 hours",
                   date_minor_breaks = "3 hours")

[plot omitted: polar scatter plot of time of day vs. wind direction, with N, E, S and W direction labels]

Most frequently, wind-rose plots use summaries, such as histograms or densities.
Next we plot a circular histogram of wind directions with 15-degree-wide bins. We
use stat_bin() .

ggplot(viikki_d29.dat, aes(WindDir_D1_WVT)) +
  coord_polar() +
  stat_bin(color = "black", fill = "grey50", binwidth = 15, geom = "bar") +
  scale_x_continuous(breaks = c(0, 90, 180, 270),
                     labels = c("N", "E", "S", "W"),
                     limits = c(0, 360),
                     expand = c(0, 0),
                     name = "Wind direction") +
  scale_y_continuous(name = "Frequency")

[plot omitted: circular histogram of wind direction frequencies in 15-degree bins]

An equivalent plot, using an empirical density, created with stat_density() .

ggplot(viikki_d29.dat, aes(WindDir_D1_WVT)) +
  coord_polar() +
  stat_density(color = "black", fill = "grey50", size = 1, na.rm = TRUE) +
  scale_x_continuous(breaks = c(0, 90, 180, 270),
                     labels = c("N", "E", "S", "W"),
                     limits = c(0, 360),
                     expand = c(0, 0),
                     name = "Wind direction") +
  scale_y_continuous(name = "Density")

[plot omitted: circular empirical density of wind directions]

As final wind-rose plot examples we draw a scatter plot of wind speed versus wind
direction, and a two-dimensional density plot. In both cases we use facet_wrap() to
have separate panels for AM and PM. In the scatter plot we set alpha = 0.1 for better
visualization of overlapping points.

ggplot(viikki_d29.dat, aes(WindDir_D1_WVT, WindSpd_S_WVT)) +
coord_polar() +
geom_point(alpha = 0.1, shape = 16) +
scale_x_continuous(breaks = c(0, 90, 180, 270),
labels = c("N", "E", "S", "W"),
limits = c(0, 360),
expand = c(0, 0),
name = "Wind direction") +
scale_y_continuous(name = "Wind speed (m/s)") +
facet_wrap(~factor(ifelse(hour(solar_time) < 12, "AM", "PM")))

[Figure: polar scatter plots of Wind speed (m/s) vs. Wind direction, in AM and PM panels]

ggplot(viikki_d29.dat, aes(WindDir_D1_WVT, WindSpd_S_WVT)) +
coord_polar() +
stat_density_2d() +
scale_x_continuous(breaks = c(0, 90, 180, 270),
labels = c("N", "E", "S", "W"),
limits = c(0, 360),
expand = c(0, 0),
name = "Wind direction") +
scale_y_continuous(name = "Wind speed (m/s)") +
facet_wrap(~factor(ifelse(hour(solar_time) < 12, "AM", "PM")))

[Figure: polar 2D density plots of Wind speed (m/s) vs. Wind direction, in AM and PM panels]

6.19 Themes

For ggplots, themes are the equivalent of style sheets for text. They determine how
the different elements of a plot are rendered when displayed, printed or saved to a
file. They do not alter how the data themselves are displayed, but instead how text
labels, titles, axes, grids, etc., are formatted. Package ‘ggplot2’ includes several
predefined themes, and some extension packages define additional ones. In addition
to switching between themes, the user can modify the format applied to individual
elements, or define totally new themes.

6.19.1 Predefined themes

The theme used by default is theme_grey() . Themes are defined as functions with
parameters. These parameters allow changing some “base” properties. The base size
for text elements is given in points, and affects all text elements in a plot (except
those produced by geometries), as their size is by default defined relative to the
base size. Another parameter, base_family , allows the font family to be set.


ggplot(fake2.data, aes(z, y)) +
geom_point() +
theme_grey(15, "serif")

[Figure: scatter plot of y vs. z rendered with theme_grey(15, "serif")]

U Change the code in the previous chunk to use the "mono" font family at size 8.

ggplot(fake2.data, aes(z, y)) +
geom_point() +
theme_bw()

[Figure: the same scatter plot rendered with theme_bw()]


U Change the code in the previous chunk to use all the other predefined themes:
theme_classic() , theme_minimal() , theme_linedraw() , theme_light() ,
theme_dark() and theme_void() .

A frequent idiom is to create a ggplot without specifying a theme, and then add the
theme when the plot is printed.

p <- ggplot(fake2.data, aes(z, y)) +
geom_point()
p + theme_bw()

[Figure: the saved plot p rendered with theme_bw() added]

U Play with the last statement in the previous code chunk, replacing the theme
used to print the saved ggplot object p . Do also try the effect of changing the
base size and font family.

It is also possible to set the default theme to be used by all subsequent plots
rendered.


[Figure: scatter plot of y vs. z]

We save the current default theme, so as to be able to restore it. If there is no need
to ‘go back’ then saving can be skipped by not including the left hand side and the
assignment operator in the first statement below.

old_theme <- theme_set(theme_bw(15))
p

[Figure: plot p rendered after theme_set(theme_bw(15))]

theme_set(old_theme)
p


[Figure: plot p rendered after restoring the saved default theme]

6.19.2 Modifying a theme

Sometimes we would just like to slightly tweak one of the predefined themes. This
is also possible. We exemplify this by solving the frequent problem of overlapping
𝑥-axis tick labels with different approaches. We force the problem by setting the
number of ticks to a high value. Usually rotating the text of the labels solves it.

ggplot(fake2.data, aes(z + 100, y)) +
geom_point() +
scale_x_continuous(breaks = scales::pretty_breaks(n = 20)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

[Figure: scatter plot with 𝑥-axis tick labels rotated 90 degrees; x axis: z + 100]

U Play with the code above, modifying the values used for angle , hjust and
vjust . (Angles are expressed in degrees, and justification with values between 0
and 1.)

 When tick labels are rotated one usually needs to set both the horizontal
and vertical justification as the default values are no longer suitable. This is due
to the fact that justification settings are referenced to the text itself rather than
to the plot, i.e. vertical justification of 𝑥-axis tick labels rotated 90 degrees sets
their horizontal position with respect to the plot.

Another possibility is to use a smaller font size. Within a theme definition, function
rel() can be used to set the size relative to the base size.

ggplot(fake2.data, aes(z + 100, y)) +
geom_point() +
scale_x_continuous(breaks = scales::pretty_breaks(n = 20)) +
theme(axis.text = element_text(color = "darkblue"),
axis.text.x = element_text(size = rel(0.6)))

[Figure: scatter plot with dark blue tick labels and a smaller 𝑥-axis tick label font; x axis: z + 100]

Theme definitions follow a hierarchy, allowing us to modify the formatting of
groups of similar elements, as well as of individual elements. In the chunk above
we modified the color of the tick labels on both axes, but changed the font size only
for the 𝑥-axis.

U Modify the example above, so that the tick labels on the 𝑥-axis are blue and
those on the 𝑦-axis red, and the font size the same for both axes, but changed
from the default.

Formatting of all other text elements can be adjusted in a similar way.

The color of the background, and the properties of the grid lines and other lines,
can be adjusted through theme elements. We next change the properties of the lines
used for the axes, removing the lines on the top and right margins, and adding arrow
heads to the axis lines. See chapter 9 in ggplot2: Elegant Graphics for Data Analysis
(Wickham and Sievert 2016) for additional examples and R Graphics Cookbook (Chang
2013) for more details.
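A minimal sketch of this kind of change (the data frame below is a stand-in for fake2.data, which is created elsewhere in the book, and the arrow length is an arbitrary choice):

```r
library(ggplot2)
library(grid) # arrow() and unit()

# Stand-in data, as fake2.data is created elsewhere in the book
fake2.data <- data.frame(z = 0:20, y = 20 + 2 * (0:20))

ggplot(fake2.data, aes(z, y)) +
  geom_point() +
  theme_classic() + # no panel border, so no lines at the top and right
  theme(axis.line = element_line(arrow = arrow(length = unit(2, "mm"))))
```

Starting from theme_classic() removes the top and right lines; the partial theme() call then redefines only axis.line, leaving the rest of the theme untouched.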

 If you use a saved theme, and want to modify some elements, then the saved
theme should be added to the plot before adding + theme(...) as otherwise the
changes would be overwritten.

It is also possible to modify the default theme used for rendering all subsequent
plots.

[Figure: scatter plot of y vs. z]

As above, we save the current default theme, so as to be able to restore it.

old_theme <- theme_update(text = element_text(color = "red"))
p


[Figure: plot p rendered with red text elements, after theme_update()]

theme_set(old_theme)
p

[Figure: plot p rendered after restoring the saved theme]

6.19.3 Defining a new theme

Themes can be defined either from scratch or by modifying an existing saved theme
and saving the modified version. If we want to preserve the ability to change the base
settings, we cannot use theme() to modify a saved theme and save the resulting
theme; we need to create a new theme from scratch. However, unless you are writing
a package, the first way of “creating” a new theme is enough, and it is documented in
the vignette accompanying package ‘ggplot2’. We give an example below.

my_theme <- theme_bw() + theme(text = element_text(color = "red"))

The default theme remains unchanged.


[Figure: plot p rendered with the unchanged default theme]

But we can use the saved theme when desired.

p + my_theme

[Figure: plot p rendered with my_theme added]

Be aware that our own my_theme is not a function, and consequently we do not use
parentheses as with the saved themes included in package ‘ggplot2’.

U It is always good to learn to recognize error messages. One way of doing this
is by generating errors on purpose. So do add parentheses to the statement in
the code chunk above.


 How to create a new theme that behaves like those that are part of pack-
age ‘ggplot2’ is not documented, as is usually the case with changes that in-
volve programming. However, you should always remember that the source code
is available. Usually typing the name of a function without the parentheses is
enough to get a listing of its definition, or, if this is not useful, reading the
source file in the package reveals how a function has been defined. We can then
use it as a template for writing our own function.

Looking at the definition of theme_minimal() gives us enough information to
proceed to define our own modified theme as a function.

theme_minimal

## function (base_size = 11, base_family = "")
## {
## theme_bw(base_size = base_size, base_family = base_family) %+replace%
## theme(axis.ticks = element_blank(), legend.background = element_blank(),
## legend.key = element_blank(), panel.background = element_blank(),
## panel.border = element_blank(), strip.background = element_blank(),
## plot.background = element_blank(), complete = TRUE)
## }
## <environment: namespace:ggplot2>

Using theme_minimal() as a model, we will proceed to define our own theme func-
tion. Argument complete = TRUE is essential, as it affects the behaviour of the re-
turned theme. A ‘complete’ theme replaces any theme present in the ggplot object,
clearing all settings, while a theme that is not ‘complete’ adds the new elements to
the existing theme without clearing settings that are not being redefined. Saved
themes like theme_grey() are complete themes, while the theme objects returned
by theme() are by default not complete.

my_theme <-
function (base_size = 11, base_family = "") {
theme_grey(base_size = base_size, base_family = base_family) +
theme(text = element_text(color = "red"), complete = TRUE)
}

The default theme remains unchanged, as shown earlier. The saved theme is now a
function, and accepts arguments. In this example we have kept the function paramet-
ers the same as those used by the predefined themes: whenever possible we should
avoid surprising users.


p + my_theme(base_family = "serif")

[Figure: plot p rendered with my_theme(base_family = "serif")]

There is nothing to prevent us from defining a theme function with additional
parameters. The example below is fully compatible with the one defined above,
thanks to the default argument for text.color , but allows changing the color.

my_theme <-
function (base_size = 11, base_family = "", text.color = "red") {
theme_grey(base_size = base_size, base_family = base_family) +
theme(text = element_text(color = text.color), complete = TRUE)
}

p + my_theme(text.color = "green")

[Figure: plot p rendered with my_theme(text.color = "green")]


U Define a theme function that instead of color allows setting the face (regular,
bold, italic) through a user-supplied argument.

 In the definition of theme_minimal() , %+replace% is used so as to unset
all the properties of each theme element, while + only replaces the properties
explicitly given as argument to the element-setting function.

 The function theme_minimal() was a good model for the example above,
however, it was not the first function I explored. I did list the definition of
theme_gray() first, but as this theme is defined from scratch, it was not the best
starting point for our problem. Of course, if we had wanted to define a theme
from scratch, then it would have been the ‘model’ to use for defining it.

Frequently one needs the same plots differently formatted, e.g. for overhead slides
and for use in a printed article or book. In such a case, we may even want some
elements, like titles, to be included only in the plots used as slides. One could
create two different ggplot objects, one for each occasion, but this can lead to
inconsistencies if the code used to create the plot is updated. A better solution is to
use themes; more generally, to define themes for the different occasions according to
one's taste and needs. A simple example is given in the next five code chunks.

theme_ovh <-
function (base_size = 15, base_family = "") {
theme_grey(base_size = base_size, base_family = base_family) +
theme(text = element_text(face = "bold"), complete = TRUE)
}

theme_prn <-
function (base_size = 11, base_family = "serif") {
theme_classic(base_size = base_size, base_family = base_family) +
theme(plot.title = element_blank(),
plot.subtitle = element_blank(),
complete = TRUE)
}

p1 <- p + ggtitle("A Title", subtitle = "with a subtitle")


p1

[Figure: plot p1, with title "A Title" and subtitle "with a subtitle"]

p1 + theme_ovh()

[Figure: plot p1 rendered with theme_ovh()]

p1 + theme_prn()


[Figure: plot p1 rendered with theme_prn(); title and subtitle removed]

U Modify the two themes defined above, so as to suit your own tastes and
needs, but first of all, just play around to get a feel for all the possibilities. The
help page for function theme() describes and exemplifies the use of most if not
all the valid theme elements.

6.20 Using plotmath expressions

In sections 6.7 and 6.8 we gave some simple examples of the use of R expressions in
plots. The plotmath demo and help in R give all the details of using expressions in
plots. Composing syntactically correct expressions can be challenging. Expressions
are very useful but rather tricky to use because their syntax is unusual. Although
expressions are shown here in the context of plotting, they are also used in other
contexts in R code.

When constructing a ggplot object one can either use expressions explicitly, or
supply them as character string labels and tell ggplot to parse them. For titles,
axis labels, etc. (anything that is defined within labs() ) the expressions have to
be entered explicitly, or saved as such into a variable, and the variable supplied as
argument.

When plotting expressions using geom_text() , expression arguments should be sup-
plied as character strings, and the optional argument parse = TRUE used to tell the
geometry to parse ("convert") the text labels into expressions.

Finally, in the case of facets, panel labels can also be expressions. They can be
generated by labeller functions, to allow them to be dynamic.
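For example, strip labels written in plotmath syntax can be parsed by the labeller label_parsed (the data frame here is invented for illustration):

```r
library(ggplot2)

# Strip labels written in plotmath syntax, parsed by label_parsed
df <- data.frame(x = rep(1:3, 2),
                 y = c(1, 4, 9, 2, 5, 10),
                 group = rep(c("alpha[1]", "beta[2]"), each = 3))

ggplot(df, aes(x, y)) +
  geom_point() +
  facet_wrap(~group, labeller = label_parsed)
```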


Before giving examples using these different mechanisms to add maths to plots, I
will describe the syntax used to write expressions. The most difficult thing to remem-
ber is how to connect the different parts of the expression. Tilde ( ~ ) adds space in
between symbols. Asterisk ( * ) can also be used as a connector, and is usually needed
when dealing with numbers. Using a space is allowed in some situations, but not
in others. For a long list of examples have a look at the output and code displayed by
demo(plotmath) at the R command prompt.

demo(plotmath)
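A few labels built with these connectors (these examples are my own, not from the demo; the comments describe how each renders):

```r
expression(x~y)         # renders as "x y", with a space
expression(x*y)         # renders as "xy", juxtaposed
expression(x[1]*" (m)") # subscripted x followed by quoted text
expression(m~s^{-1})    # units with a superscript
```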

We will use a couple of complex examples to show in each plot how to use expres-
sions for different elements of a plot.
We first create a data frame, using paste() to assemble a vector of subscripted 𝛼
values.

set.seed(54321) # make sure we always generate the same data


my.data <-
data.frame(x = 1:5,
y = rnorm(5),
greek.label = paste("alpha[", 1:5, "]", sep = ""))

We also use a Greek 𝛼 character, but with 𝑖 as subscript, instead of a number. The
𝑦-axis label uses a superscript for the units. The title is a rather complex expression.
In these three cases, we explicitly use expression() .

We label each observation with a subscripted 𝛼, offset from the point position
and rotated. We finally add an annotation with the same formula as used for the title,
but in red. Annotations are plotted ignoring the default aesthetics, but still make
use of geometries. We cannot pass expressions to geometries by simply mapping
them to the label aesthetic. Instead, we pass character strings that can be parsed into
expressions; in simpler terms, strings written using the syntax of expressions
but not wrapped in the function expression() . We need to set parse = TRUE so that the
strings, instead of being plotted as is, are parsed into expressions at the time the plot
is output. When using geom_text() , the argument passed to parameter label must
be a character string. Consequently, expressions to be plotted through this geometry
always need to be parsed.

ggplot(my.data, aes(x, y, label = greek.label)) +
  geom_point() +
  geom_text(angle = 45, hjust = 1.2, parse = TRUE) +
  labs(x = expression(alpha[i]),
       y = expression(Speed~~(m~s^{-1})),
       title = expression(sqrt(alpha[1] + frac(beta, gamma)))) +
  annotate("text", label = "sqrt(alpha[1] + frac(beta, gamma))",
           y = 2.5, x = 3, size = 8, colour = "red", parse = TRUE) +
  expand_limits(y = c(-2, 4))

[Figure: the plot, with plotmath title, axis labels, point labels and red annotation]

We can also use a character string stored in a variable, and use parse() both ex-
plicitly and implicitly by setting parse = TRUE .

my_eq.char <- "sqrt(alpha[1] + frac(beta, gamma))"

ggplot(my.data, aes(x, y, label = greek.label)) +
  geom_point() +
  geom_text(angle = 45, hjust = 1.2, parse = TRUE) +
  labs(x = expression(alpha[i]),
       y = expression(Speed~~(m~s^{-1})),
       title = parse(text = my_eq.char)) +
  annotate("text", label = my_eq.char,
           y = 2.5, x = 3, size = 8, colour = "red", parse = TRUE) +
  expand_limits(y = c(-2, 4))

[Figure: the same plot, with the title built by parsing my_eq.char]


The examples above are moderately complex, but do not use expressions for all
the elements in a ggplot that accept them. The next example uses them for scale
labels. In the case of scales, there are alternative approaches. One approach is to
use user-supplied expressions.

ggplot(my.data, aes(x,y,label=greek.label)) +
geom_point() +
geom_text(angle=45, hjust=1.2, parse=TRUE) +
labs(x = NULL,
y = expression(Speed~~(m~s^{-1})),
title = expression(sqrt(alpha[1] + frac(beta, gamma)))
) +
annotate("text", label="sqrt(alpha[1] + frac(beta, gamma))",
y=2.5, x=3, size=8, colour="red", parse=TRUE) +
scale_x_continuous(breaks = c(1,3,5),
labels = c(expression(alpha[1]),
expression(alpha[3]),
expression(alpha[5]))
) +
expand_limits(y = c(-2, 4))

[Figure: the same plot, with expressions α1, α3, α5 as 𝑥-axis tick labels]

As expression() accepts multiple arguments separated by commas, the labels can be
written more concisely using a single call to expression() .

ggplot(my.data, aes(x,y,label=greek.label)) +
geom_point() +
geom_text(angle=45, hjust=1.2, parse=TRUE) +
labs(x = NULL,
y = expression(Speed~~(m~s^{-1})),
title = expression(sqrt(alpha[1] + frac(beta, gamma)))
) +
annotate("text", label="sqrt(alpha[1] + frac(beta, gamma))",
y=2.5, x=3, size=8, colour="red", parse=TRUE) +


scale_x_continuous(breaks = c(1,3,5),
labels = expression(alpha[1], alpha[3], alpha[5])
) +
expand_limits(y = c(-2, 4))

[Figure: the same plot, with tick labels from a single call to expression()]

A different approach (no example shown) would be to use parse() explicitly for
each individual label, something that might be needed if the tick labels need to be
“assembled” programmatically instead of set as constants.

U Instead of this being an exercise for you to write code, you will need
to study the code shown bellow until you are sure understand how it works. It
makes use of different things you have learn in the current and previous chapters.
Parsing multiple labels in a scale definition, after assembling them with
paste() . We want to achieve more generality, looking ahead to a future func-
tion to be defined.

labels.char <- paste("alpha[", as.character(c(1,3,5)), "]")
my_parse <- function(x, ...) {parse(text = x, ...)}
labels.xpr <- sapply(labels.char, my_parse)

These three lines of code return a vector of expressions that can be used in a
scale definition. Before using them, we will make a function out of them.

make_labels <- function(base_text = "alpha", idxs = 1:5, ...) {
sapply(X = paste(base_text, "[", as.character(idxs), "]", sep = ""),
FUN = function(x, ...) {parse(text = x, ...)},
USE.NAMES = FALSE)
}


And now we can use the function in a plot.

breaks <- c(1,3,5)

ggplot(my.data, aes(x,y,label=greek.label)) +
geom_point() +
geom_text(angle=45, hjust=1.2, parse=TRUE) +
labs(x = NULL,
y = expression(Speed~~(m~s^{-1})),
title = expression(sqrt(alpha[1] + frac(beta, gamma)))
) +
annotate("text", label="sqrt(alpha[1] + frac(beta, gamma))",
y=2.5, x=3, size=8, colour="red", parse=TRUE) +
scale_x_continuous(breaks = breaks,
labels = make_labels("alpha", breaks)
) +
expand_limits(y = c(-2, 4))

[Figure: the same plot, with tick labels generated by make_labels("alpha", breaks)]

As a final task, change the code above so that the labels are subscripted 𝛽s and
breaks from 1 to 5 with step 1.

 Differences between parse() and expression() . Function parse() takes
as argument a character string. This is very useful as the character string can be
created programmatically. When using expression() this is not possible, except
for substitution at execution time of the value of variables into the expression.
See help pages for both functions.
Function expression() accepts its arguments without any delimiters. Func-
tion parse() takes a single character string as argument to be parsed, in which
case quotation marks need to be escaped (using \" where a literal " is desired).


We can also, in both cases, embed a character string by means of one of the functions
plain() , italic() , bold() or bolditalic() , which also affect the font
used. The argument to these functions sometimes needs to be a character string
delimited by quotation marks.
When using expression() , bare quotation marks can be embedded,

ggplot(cars, aes(speed, dist)) +
geom_point() +
xlab(expression(x[1]*" test"))

[Figure: scatter plot of dist vs. speed; x axis label: x1 test]

while in the case of parse() they need to be escaped,

ggplot(cars, aes(speed, dist)) +
geom_point() +
xlab(parse(text = "x[1]*\" test\""))

[Figure: the same plot; x axis label: x1 test]

and in some cases will need to be enclosed within a format function.


ggplot(cars, aes(speed, dist)) +
geom_point() +
xlab(parse(text = "x[1]*italic(\" test\")"))

[Figure: the same plot, with "test" in italics in the x axis label]

We can compare the expressions returned by expression() and parse() as used
above.

expression(x[1]*" test")

## expression(x[1] * " test")

parse(text = "x[1]*\" test\"")

## expression(x[1] * " test")

A few additional remarks. If expression() is passed multiple arguments,
ggplot() uses only the first one in the case of axis labels, where a single character
string is expected as argument.

expression(x[1], " test")

## expression(x[1], " test")

ggplot(cars, aes(speed, dist)) +
geom_point() +
xlab(expression(x[1], " test"))


[Figure: the same plot; only x1 appears in the x axis label, the second argument being ignored]

Depending on the location within an expression, spaces may be ignored, or even
illegal. To juxtapose elements without adding space use * ; to explicitly insert
white space, use ~ . As shown above, spaces are accepted within quoted text.

So the following alternatives can also be used.

ggplot(cars, aes(speed, dist)) +
geom_point() +
xlab(parse(text = "x[1]~~~~\"test\""))

[Figure: the same plot, with spacing inserted by the tildes: x1 test]

ggplot(cars, aes(speed, dist)) +
geom_point() +
xlab(parse(text = "x[1]~~~~plain(test)"))


[Figure: the same plot; x axis label: x1 test]

However, unquoted white space is discarded.

ggplot(cars, aes(speed, dist)) +
geom_point() +
xlab(parse(text = "x[1]*plain( test)"))

[Figure: the same plot; the unquoted space is discarded, giving x1test]

Above we used paste() to insert values stored in a variable, and this, combined with
format() , sprintf() , and strftime() , already gives a lot of flexibility.
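For example, a numeric value can be formatted into a plotmath string with sprintf() and then parsed into an expression (the value 2.5 used here is arbitrary):

```r
k <- 2.5
label.chr <- sprintf("alpha[1] == %.1f", k)
label.chr
## [1] "alpha[1] == 2.5"
# The parsed result can be supplied to labs(), or the string itself to
# annotate(..., parse = TRUE)
parse(text = label.chr)
```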

U Study the examples below. If you are familiar with C or C++, the last two
functions will already be familiar to you.


sprintf("%s: %.3g two values formatted and inserted", "test", 15234)

## [1] "test: 1.52e+04 two values formatted and inserted"

sprintf("log(%.3f) = %.3f", 5, log(5))

## [1] "log(5.000) = 1.609"

sprintf("log(%.3g) = %.3g", 5, log(5))

## [1] "log(5) = 1.61"

Write a function for the second statement in the chunk above. The function
should take a single numeric argument through its only formal parameter, and
produce equivalent output to the statement above. However, it should be usable
with any numeric value.
Do look up the help pages for these three functions and play with them at the
console. They are extremely useful.

It is also possible to substitute the value of variables, or in fact the result of
evaluation, into a new expression, allowing on-the-fly construction of expressions. Such
expressions are frequently used as labels in plots. This is achieved through the use of
quoting and substitution.

We use bquote() to substitute variables or expressions enclosed in .( ) by their
value. Be aware that the argument to bquote() needs to be written as an expression;
in this example we need to use a tilde, ~ , to insert a space between words.
Furthermore, if the expressions include variables, these will be searched for in the
environment.
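The effect of bquote() is easiest to see at the console, before using it in a plot (the value is arbitrary):

```r
x.bar <- 12.3
# .(x.bar) is replaced by the current value of x.bar
bquote(bar(x) == .(x.bar))
## bar(x) == 12.3
```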

ggplot(cars, aes(speed, dist)) +
geom_point() +
labs(title = bquote(Time~zone: .(Sys.timezone())),
subtitle = bquote(Date: .(as.character(today())))
)


[Figure: scatter plot of dist vs. speed; title: Time zone: Europe/Helsinki; subtitle: Date: 2017-04-11]

In the case of substitute() we supply what is to be used for substitution through a
named list.

ggplot(cars, aes(speed, dist)) +
geom_point() +
labs(title = substitute(Time~zone: tz, list(tz = Sys.timezone())),
subtitle = substitute(Date: date, list(date = as.character(today())))
)

[Figure: the same plot, with the same title and subtitle built with substitute()]

For example, substitution can be used to assemble an expression within a function
based on the arguments passed. One case of interest is to retrieve the name of the
object passed as an argument, from within a function.

deparse_test <- function(x) {
print(deparse(substitute(x)))
}

a <- "saved in variable"


deparse_test("constant")

## [1] "\"constant\""

deparse_test(1 + 2)

## [1] "1 + 2"

deparse_test(a)

## [1] "a"

6.21 Generating output files

It is possible, when using RStudio, to directly export the displayed plot to a file. How-
ever, if the file will have to be generated again at a later time, or a series of plots need
to be produced with consistent format, it is best to include the commands to export
the plot in the script.

In R, files are created by printing to different devices. Printing is directed to a
currently open device. Some devices produce screen output, others files. Devices
depend on drivers. There are both devices that are part of R, and devices that can be
added through packages.

A very simple example of PDF output (width and height in inches):

fig1 <- ggplot(data.frame(x=-3:3), aes(x=x)) +
stat_function(fun=dnorm)
pdf(file="fig1.pdf", width=8, height=6)
print(fig1)
dev.off()

Encapsulated Postscript output (width and height in inches):

postscript(file="fig1.eps", width=8, height=6)
print(fig1)
dev.off()

There are graphics devices for BMP, JPEG, PNG and TIFF format bitmap files. In this
case the default units for width and height are pixels. For example, we can generate
TIFF output:

tiff(file="fig1.tiff", width=1000, height=800)
print(fig1)
dev.off()
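Package ‘ggplot2’ also provides ggsave(), which opens a device chosen from the file-name extension, prints the plot, and closes the device in a single call (a sketch, re-creating fig1 so the chunk stands alone; width and height in inches):

```r
library(ggplot2)

fig1 <- ggplot(data.frame(x = -3:3), aes(x = x)) +
  stat_function(fun = dnorm)
# PNG chosen from the ".png" extension; dpi controls the bitmap resolution
ggsave("fig1.png", plot = fig1, width = 8, height = 6, dpi = 150)
```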


6.21.1 Using LaTeX instead of plotmath

To use LaTeX syntax in plots we need to use a different software device for output. It
is called Tikz and defined in package ‘tikzDevice’. This device generates output that
can be interpreted by LaTeX, either as a self-contained file or as a file to be input into
another LaTeX source file. As the bulk of this handbook does not use this device, we
will use it explicitly and input the files into this section. A TeX distribution should be
installed, with LaTeX and several LaTeX packages including ‘tikz’.

Fonts

Font face selection, weight, size, maths, etc. are set with LaTeX syntax. The main ad-
vantage of using LaTeX is the consistency between the typesetting of the text body and
figure labels and legends. For those familiar with LaTeX, not having to remember or
learn the syntax of plotmath will be a bonus.

We will revisit the example from the previous sections, but now using LaTeX instead
of plotmath for the subscripted Greek 𝛼 labels. In this example we use as
subscripts numeric values from another variable in the same dataframe.

6.22 Building complex data displays

In this section we do not refer to those aspects of the design of a plot that can be
adjusted through themes (see section 6.19 on page 278). Whenever that possibility
exists, it is the best option. Here we refer to aspects that are not really part of the
graphical ("artistic") design, but instead to mappings, labels and similar data- and
metadata-related aspects of plots. In many cases scales (see section 6.16 on page 249)
also fall within the scope of the present section.

6.22.1 Using the grammar of graphics for individual plots

The grammar of graphics allows one to build and test plots incrementally. In daily
use, it is best to start with a simple design for a plot, print this plot, and check that
the output is as expected and the code error-free. Afterwards, one can gradually map
additional aesthetics and add geometries and statistics. The final steps are then to
add annotations and the text or expressions used for titles, and axis and key labels.

U Build a graphically complex data plot of your interest, step by step. By step
by step, I do not refer to using the grammar in the construction of the plot as
earlier, but to taking advantage of this modularity to test intermediate versions
in an iterative design process: first by building up the complex plot in stages as
a tool for debugging, and later using iteration in the process of improving the
graphic design of the plot and its readability and effectiveness.

6.22.2 Using the grammar of graphics for series of plots with consistent design

As in any type of script with instructions (for humans or computers), we should avoid
unnecessary repetition, as repetition conspires against consistent results and is a
major source of errors when the script needs to be modified. No less important, a
shorter script, if well written, is easier to read.
One approach is to use user-defined functions. One can, for example, write
simple wrapper functions on top of functions defined in ‘ggplot2’, for example,
adding or changing the default mappings to ones suitable for our application. In the
case of ggplot() , as it is defined as a generic function, if one’s data is stored in
objects of a user-defined class, the wrapper can be a specialization of the generic,
and become almost invisible to users (e.g. not require a different syntax or adding a
word to the grammar). At the other extreme of complexity, compared to a wrapper
function, we could write a function that encapsulates all the code needed to build a
specific type of plot. Package ‘ggspectra’ uses the last two approaches.
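A minimal sketch of the simplest of these approaches, a wrapper that supplies default mappings and a default geometry (the function name ggscatter() and its defaults are invented for this example):

```r
library(ggplot2)

# wrapper with default mappings suited to our hypothetical application
ggscatter <- function(data, mapping = aes(x = x, y = y), ...) {
  ggplot(data = data, mapping = mapping, ...) +
    geom_point()
}

# usage: behaves like ggplot(), but with our defaults already in place
fake.data <- data.frame(x = 1:10, y = (1:10) + rnorm(10))
ggscatter(fake.data)
```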
As ggplot objects are composed using operator + to assemble the dif-
ferent components, one can also store these components, or partial plots, in a
variable or in a list, and use them later to compose the final figure.

 We can assign a ggplot object or a part of it to a variable, and then assemble
a new plot from the different pieces.

myplot <- ggplot(data = mtcars,
                 aes(x = disp, y = mpg,
                     colour = factor(cyl))) +
  geom_point()

mylabs <- labs(x = "Engine displacement (cu. in.)",
               y = "Fuel economy (mpg)",
               colour = "Number of\ncylinders",
               shape = "Number of\ncylinders")

And now we can assemble them into plots.


myplot
myplot + mylabs + theme_bw(16)
myplot + mylabs + theme_bw(16) + ylim(0, NA)

[Three figures: the plot of mpg vs. disp with default labels; the same plot with the
labels from mylabs and theme_bw(16); and the latter with the y axis extended to
zero by ylim(0, NA).]

We can also save intermediate results.

mylogplot <- myplot + scale_y_log10(limits = c(8, 55))

mylogplot + mylabs + theme_bw(16)

[Figure: the plot with a log10 y scale, the labels from mylabs and theme_bw(16).]

If the pieces to put together do not include a "ggplot" object, we can put them
into a "list" object.

myparts <- list(mylabs, theme_bw(16))

mylogplot + myparts
[Figure: the same plot, produced by adding the list of parts to mylogplot.]

There are a few predefined themes in package ‘ggplot2’, and additional ones in
other packages such as ‘cowplot’. Even the default theme_grey() can come in
handy, because the first parameter of the theme constructors is the base point size used
as reference to calculate all other font sizes. You can see in the two examples below that the
size of all text elements changes proportionally when we set a different base size
in points.


myplot + mylabs + theme_grey(10)

myplot + mylabs + theme_grey(16)

[Two figures: the same plot rendered with theme_grey(10) and with theme_grey(16),
showing all text elements scaled proportionally to the base size.]

The code in the next chunk is valid and returns an empty plot. This apparently
useless plot can be very useful when writing functions that return ggplot objects,
or when building plots piece by piece in a loop.

ggplot()
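For example (a sketch, with the grouping variable chosen arbitrarily for illustration), we can start from the empty plot returned by ggplot() and add one layer per group inside a loop:

```r
library(ggplot2)

p <- ggplot()
# add one point layer per cylinder class, piece by piece
for (cyl.value in unique(mtcars$cyl)) {
  p <- p +
    geom_point(data = subset(mtcars, cyl == cyl.value),
               mapping = aes(x = disp, y = mpg))
}
p
```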


U Revise the code you wrote for the “playground” exercise in section 6.22.1,
but this time, pre-building and saving groups of elements that you expect to
be useful unchanged when composing a different plot of the same type, or a
plot of a different type from the same data.

6.23 Extended examples

In this section we first produce some publication-ready plots requiring the use of
different combinations of what has been presented earlier in this chapter, and then
we recreate some well-known plots, using versions from Wikipedia articles as mod-
els. Our objective here is to show how, by combining different terms and modifiers
from the grammar of graphics, we can build very complex plots step by step and/or
annotate them with sophisticated labels. Here we do not use any packages extending
‘ggplot2’. Even more elaborate versions of these plots are presented in later chapters
using ‘ggplot2’ together with other packages.

6.23.1 Heat maps

Heat maps are pseudo-3D plots: two axes with cartesian coordinates give origin to
rectangular tiles, and a third dimension is represented by the fill of the tiles. They
are used to describe deviations from a reference or control condition, with, for ex-
ample, blue representing values below the reference and red above. A color gradient
represents the size of the deviation. Simple heat maps can be produced directly with
‘ggplot2’ functions and methods. Heat maps with similarity trees obtained through
clustering require additional tools.
The main difference with a generic tile plot (see section 6.10 on page 214) is that
the fill scale is centred on zero and the red to blue colours used for fill represent a
“temperature”. Nowadays, the name heat map is also used for tile plots using other
colors for fill, as long as they represent deviations from a central value.
To obtain a heat map, then, we need to use scale_fill_gradient2() as the fill scale.
In the first plot we use the default colors for the fill, and in the second example we use
different ones.
For the examples in this section we use artificial data to build a correlation matrix,
which we convert into a data frame before plotting.

set.seed(123)
x <- matrix(rnorm(200), nrow=20, ncol=10)
y <- matrix(rnorm(200), nrow=20, ncol=10)
cor.mat <- cor(x,y)
cor.df <- data.frame(cor = as.vector(cor.mat),
x = rep(letters[1:10], 10),
y = LETTERS[rep(1:10, rep(10, 10))])

ggplot(cor.df, aes(x, y, fill = cor)) +
  geom_tile(color = "white") +
  scale_fill_gradient2()

[Figure: heat map of the correlation matrix, with the default blue-white-red fill
gradient centred on cor = 0.]

ggplot(cor.df, aes(x, y, fill = cor)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "darkred", mid = "yellow",
                       high = "darkgreen")


[Figure: the same heat map with a dark red to yellow to dark green fill gradient.]

6.23.2 Quadrat plots

A quadrat plot is usually a scatter plot, although sometimes lines are also used. The
scales are symmetrical for both 𝑥 and 𝑦, with matching negative and positive ranges: the origin
𝑥 = 0, 𝑦 = 0 is at the geometrical center of the plot.

We generate an artificial data set with y values correlated to x values.

set.seed(4567)
x <- rnorm(200, sd = 1)
quadrat_data.df <- data.frame(x = x,
y = rnorm(200, sd = 0.5) + 0.5 * x)

Here we draw a simple quadrat plot, by adding two lines and using fixed coordinates
with a 1:1 ratio between 𝑥 and 𝑦 scales.

ggplot(data = quadrat_data.df, aes(x, y)) +
  geom_vline(xintercept = 0) +
  geom_hline(yintercept = 0) +
  geom_point() +
  coord_fixed(ratio = 1) +
  theme_bw()

[Figure: a simple quadrat plot, with the origin at the center of the plotting area
and a 1:1 ratio between the x and y scales.]

We may want to add lines showing 1:1 slopes, make the axis limits symmetric,
and make points semi-transparent so that overlapping points can be visualized. We
expand the limits with expand_limits() rather than setting them with limits or xlim()
and ylim() , so that if there are observations in the data set outside our target limits,
the limits will still include them. In other words, we set a minimum extent for the
limits of the axes, but allow them to grow further if needed.

ggplot(data = quadrat_data.df, aes(x, y)) +
  geom_vline(xintercept = 0) +
  geom_hline(yintercept = 0) +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  geom_abline(slope = -1, intercept = 0, color = "blue", linetype = "dashed") +
  geom_point(alpha = 0.5) +
  coord_fixed(ratio = 1) +
  expand_limits(x = -3, y = -3) +
  expand_limits(x = +3, y = +3) +
  theme_bw()

[Figure: the quadrat plot with symmetric limits, dashed red and blue 1:1 guide
lines, and semi-transparent points.]

It is also easy to add a linear regression line with its confidence band.

ggplot(data = quadrat_data.df, aes(x, y)) +
  geom_vline(xintercept = 0) +
  geom_hline(yintercept = 0) +
  stat_smooth(method = "lm") +
  geom_point(alpha = 0.5) +
  coord_fixed(ratio = 1) +
  expand_limits(x = -3, y = -3) +
  expand_limits(x = +3, y = +3) +
  theme_bw()

[Figure: the quadrat plot with a linear regression line and its confidence band added.]

6.23.3 Volcano plots

A volcano plot is just an elaborate version of a scatter plot, and can be created with
‘ggplot2’ functions. Here we demonstrate how to create a volcano plot with tick labels
in untransformed units, off-scale values drawn at the edge of the plotting region
and highlighted with a different shape, and points color-coded according to whether
expression is significantly enhanced or depressed, or the evidence for the direction
of the effect is inconclusive. We use a random sample of size 5000 from real data
from an RNAseq experiment.

load(file = "data/volcano-example.rda")
head(clean5000.df, 4)

##            logFC   logCPM         LR       PValue
## 2766  -0.5362781 3.955900  2.0643968 1.507746e-01
## 4175   0.3278792 3.501271  0.4639981 4.957614e-01
## 2953  -0.7472158 5.896275 16.3420534 5.287742e-05
## 11128  0.2916467 4.772487  0.8551329 3.551043e-01
##       outcome
## 2766        0
## 4175        0
## 2953       -1
## 11128       0

First we create a no-frills volcano plot. This is just an ordinary scatter plot, with a
certain way of transforming the 𝑃 -values. We do this transformation on the fly when
mapping the 𝑦 aesthetic with y = -log10(PValue) .

ggplot(data = clean5000.df,
aes(x = logFC,
y = -log10(PValue),
color = factor(outcome))) +
geom_point() +
scale_color_manual(values = c("blue", "grey10", "red"), guide = FALSE)

[Figure: no-frills volcano plot of -log10(PValue) vs. logFC, with points coloured
blue, grey and red according to outcome.]

Now we add quite a few tweaks to the 𝑥 and 𝑦 scales: 1) we show tick labels in
back-transformed units, at nice round numbers; 2) we add publication-ready axis
labels; 3) we restrict the limits of the 𝑥 and 𝑦 scales, but use oob = scales::squish
so that observations outside the range limits, instead of being dropped, are plotted at
the limit and highlighted with a different shape. We also use the black-and-white
theme instead of the default one.

As we assume the reverse log transformation to be generally useful, we define a
function reverselog_trans() for it. In the plot we use this function to set the trans-
formation as part of the 𝑦-scale definition, so that we can directly map 𝑃 -values to
the 𝑦 aesthetic.

reverselog_trans <- function(base = exp(1)) {
  trans <- function(x) -log(x, base)
  inv <- function(x) base^(-x)
  scales::trans_new(paste0("reverselog-", format(base)), trans, inv,
                    scales::log_breaks(base = base),
                    domain = c(1e-100, Inf))
}
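As a quick sanity check (a sketch, not part of the plot code), we can verify that the transformation and its inverse are consistent with each other:

```r
tr <- reverselog_trans(10)
tr$transform(c(1, 0.01, 1e-10))  # -log10 of the P-values: 0, 2, 10
tr$inverse(c(0, 2, 10))          # back-transformed: 1, 0.01, 1e-10
```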

ggplot(data = clean5000.df,
aes(x = logFC,
y = PValue,
color = factor(outcome),
shape = factor(ifelse(PValue <= 1e-40, "out", "in")))) +
geom_vline(xintercept = c(log2(2/3), log2(3/2)), linetype = "dotted",
color = "grey75") +
geom_point() +
scale_color_manual(values = c("blue", "grey80", "red"), guide = FALSE) +
scale_x_continuous(breaks = c(log2(1e-2), log2(1e-1), log2(1/2),
0, log2(2), log2(1e1), log2(1e2)),
labels = c("1/100", "1/10", "1/2", "1",
"2", "10", "100"),
limits = c(log2(1e-2), log2(1e2)),
name = "Relative expression",
minor_breaks = NULL) +
scale_y_continuous(trans = reverselog_trans(10),
breaks = c(1, 1e-3, 1e-10, 1e-20, 1e-30, 1e-40),
labels = scales::trans_format("log10",
scales::math_format(10^.x)),
limits = c(1, 1e-40), # axis is reversed!
name = expression(italic(P)-{value}),
oob = scales::squish,
minor_breaks = NULL) +
scale_shape(guide = FALSE) +
theme_bw()


[Figure: the polished volcano plot; x axis "Relative expression" with back-transformed
tick labels from 1/100 to 100, y axis "P-value" from 10^0 down to 10^-40 on the
reversed log scale; off-scale points are squished to the limit and drawn with a
different shape.]

6.23.4 Anscombe’s regression examples


This is another figure from Wikipedia: http://commons.wikimedia.org/wiki/
File:Anscombe.svg?uselang=en-gb.
This classical example from Anscombe (1973) demonstrates four very different data
sets that yield exactly the same results when a linear regression model is fit to them,
including 𝑅² = 0.666. It is usually presented as a warning about the need to check
model fits beyond looking at 𝑅² and other parameter estimates.
I will redraw the Wikipedia figure using ‘ggplot2’, but first I rearrange the original
data.

# we rearrange the data
my.mat <- matrix(as.matrix(anscombe), ncol = 2)
my.anscombe <- data.frame(x = my.mat[ , 1],
                          y = my.mat[ , 2],
                          case = factor(rep(1:4, rep(11, 4))))
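The identical fits can also be verified numerically; a quick check using the rearranged data, fitting the linear model separately to each of the four cases:

```r
# fit y ~ x for each case and collect the coefficients
sapply(levels(my.anscombe$case),
       function(k) {
         coef(lm(y ~ x, data = subset(my.anscombe, case == k)))
       })
# each column gives an intercept of approximately 3.00
# and a slope of approximately 0.50
```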

Once the data is in a data frame, plotting the observations plus the regression lines
is easy.

ggplot(my.anscombe, aes(x,y)) +
geom_point() +
geom_smooth(method="lm") +
facet_wrap(~case, ncol=2)

[Figure: the four Anscombe data sets in separate panels, each with its fitted
regression line and confidence band.]

It is not much more difficult to make it look similar to the Wikipedia original.

ggplot(my.anscombe, aes(x,y)) +
geom_point(shape=21, fill="orange", size=3) +
geom_smooth(method="lm", se=FALSE) +
facet_wrap(~case, ncol=2) +
theme_bw(16)

[Figure: the same four panels styled like the Wikipedia original: orange points,
regression lines without confidence bands, and theme_bw(16).]

Although I think that the confidence bands make the point of the example much
clearer.

ggplot(my.anscombe, aes(x,y)) +
geom_point(shape=21, fill="orange", size=3) +
geom_smooth(method="lm") +
facet_wrap(~case, ncol=2) +
theme_bw(16)

[Figure: the styled four-panel plot with the confidence bands added back.]

6.23.5 Plotting color patches

For choosing colours when designing plots, or scales used in them, an indexed colour
patch plot is usually very convenient (see section 6.16.5 on page 263). We can produce
such a chart of colors with subsets of colors, or colours re-ordered compared to their
position in the value returned by colors() . As the present chapter is on package
‘ggplot2’, we use this package in this example. As such charts are likely to be needed
frequently, I define here a function ggcolorchart() .

ggcolorchart <- function(colors,
                         ncol = NULL,
                         use.names = NULL,
                         text.size = 2) {
  # needed if the argument passed is subset with [ ]!
  force(colors)

  len.colors <- length(colors)
  # by default we attempt to use a roughly square arrangement
  if (is.null(ncol)) {
    ncol <- max(trunc(sqrt(len.colors)), 1L)
  }
  # default for when to use color names
  if (is.null(use.names)) {
    use.names <- ncol < 8
  }
  # number of rows needed to fit all colors
  nrow <- len.colors %/% ncol
  if (len.colors %% ncol != 0) {
    nrow <- nrow + 1
  }
  # we extend the vector with NAs to match the number of tiles
  if (len.colors < ncol * nrow) {
    colors[(len.colors + 1):(ncol * nrow)] <- NA
  }
  # we build a data frame
  colors.df <-
    data.frame(color = colors,
               text.color =
                 ifelse(sapply(colors,
                               function(x) {mean(col2rgb(x))}) > 110,
                        "black", "white"),
               x = rep(1:ncol, nrow),
               y = rep(nrow:1, rep(ncol, nrow)),
               idx = ifelse(is.na(colors),
                            "",
                            format(1:(ncol * nrow), trim = TRUE)))
  # we build the plot
  p <- ggplot(colors.df, aes(x, y, fill = color))
  if (use.names) {
    p <- p + aes(label = ifelse(is.na(colors), "", colors))
  } else {
    p <- p + aes(label = format(idx, width = 3))
  }
  p <- p +
    geom_tile(color = "white") +
    scale_fill_identity() +
    geom_text(size = text.size, aes(color = text.color)) +
    scale_color_identity()
  p + theme_void()
}

U After reading the use examples below, review the definition of the function,
section by section, trying to understand the purpose of each section of the
code. You can add print statements at different steps to look at the intermediate
data values. Once you think you have grasped the purpose of a given statement,
you can modify it in some way that modifies the output. For example, change
the defaults for the shape of the tiles, e.g. so that the number of columns is
about 1/3 of the number of rows. Although you may never need exactly this
function, studying its code will teach you some idioms used by R programmers.
This function, in contrast to some other R code examples for plotting color tiles,
does not contain any loop. It returns a ggplot object, which can be added to and/or
modified.

We first chart the predefined colors available in R.

ggcolorchart(colors()) +
ggtitle("R colors",
subtitle = "Labels give index or position in colors() vector")

[Figure: "R colors" chart, subtitle "Labels give index or position in colors()
vector"; all 657 predefined colors shown as numbered tiles.]

We subset those containing “blue” in the name, using the default number of
columns.

ggcolorchart(grep("blue", colors(), value = TRUE), text.size = 3)

[Figure: chart of the 66 colors whose name contains "blue", labelled by index,
using the default number of columns.]

We reduce the number of columns and obtain rectangular tiles. The default for
use.names depends on the number of tile columns, automatically triggering the
switch to colour-name labels.

ggcolorchart(grep("blue", colors(), value = TRUE), ncol = 4)

[Figure: the "blue" colors in four columns, with tiles labelled by colour name,
from aliceblue through steelblue4.]

We demonstrate how perceived colors are affected by the hue, saturation and value
in the HSV colour model.

ggcolorchart(hsv(1, (0:48)/48, 0.67), text.size = 3) +
  ggtitle("HSV saturation", "H = 1, S = 0..1, V = 0.67")

[Figure: "HSV saturation", H = 1, S = 0..1, V = 0.67; a series of tiles from grey
to fully saturated dark red, labelled with hex colour definitions.]

ggcolorchart(hsv(1, 1, (0:48)/48), text.size = 3) +
  ggtitle("HSV value", "H = 1, S = 1, V = 0..1")

[Figure: "HSV value", H = 1, S = 1, V = 0..1; a series of tiles from black to
bright red.]

ggcolorchart(hsv((0:48)/48, 1, 1), text.size = 3) +
  ggtitle("HSV hue", "H = 0..1, S = 1, V = 1")

[Figure: "HSV hue", H = 0..1, S = 1, V = 1; a series of tiles traversing the full
hue circle from red back to red.]

We demonstrate how perceived colors are affected by the hue, chroma and luminance
in the HCL colour model.

ggcolorchart(hcl((0:48)/48 * 360), text.size = 3) +
  ggtitle("CIE-LUV 'hcl' hue", "h = 0..360, c = 35, l = 85")

[Figure: "CIE-LUV 'hcl' hue", h = 0..360, c = 35, l = 85; a series of pastel tiles
of constant chroma and luminance.]

ggcolorchart(hcl((0:48)/48 * 360, l = 67), text.size = 3) +
  ggtitle("CIE-LUV 'hcl' hue", "h = 0..360, c = 35, l = 67")

[Figure: the same hue series with luminance lowered to l = 67.]

ggcolorchart(hcl((0:48)/48 * 360, c = 67), text.size = 3) +
  ggtitle("CIE-LUV 'hcl' hue", "h = 0..360, c = 67, l = 85")

[Figure: the same hue series with chroma increased to c = 67.]

U The default order of the different colors in the vector returned by colors()
results in a rather unappealing color tile plot (see page 321). Use functions
col2rgb() , rgb2hsv() and sort() or order() to rearrange the tiles into a more
pleasant arrangement, while still using as labels the indexes giving the positions
of the colors in the original, unsorted vector.

6.23.6 Pie charts vs. bar plots example



There is an example figure widely used in Wikipedia to show how much easier it
is to ‘read’ bar plots than pie charts (http://commons.wikimedia.org/wiki/File:
Piecharts.svg?uselang=en-gb).

Here is my ‘ggplot2’ version of the same figure, using much simpler code and ob-
taining almost the same result.

example.data <-
data.frame(values = c(17, 18, 20, 22, 23,
20, 20, 19, 21, 20,
23, 22, 20, 18, 17),
examples= rep(c("A", "B", "C"), c(5,5,5)),
cols = rep(c("red", "blue", "green", "yellow", "black"), 3)
)

ggplot(example.data, aes(x = cols, y = values, fill = cols)) +
  geom_col(width = 1) +
  facet_grid(. ~ examples) +
  scale_fill_identity()

ggplot(example.data, aes(x = factor(1), y = values, fill = cols)) +
  geom_col(width = 1) +
  facet_grid(. ~ examples) +
  scale_fill_identity() +
  coord_polar(theta = "y")

[Two figures: the three data sets A, B and C shown as faceted bar plots of values
vs. color, and the same data shown as pie charts via coord_polar(theta = "y").]

try(detach(package:lubridate))
try(detach(package:tikzDevice))
try(detach(package:ggplot2))
try(detach(package:scales))

7 Extensions to ‘ggplot2’

What this means is that we shouldn’t abbreviate the truth but
rather get a new method of presentation.

— Edward Tufte

7.1 Packages used in this chapter

To run the examples listed in this chapter, you first need to load the following
packages from the library:

library(tibble)
library(ggplot2)
library(showtext)
library(viridis)
library(pals)
library(ggrepel)
library(ggforce)
library(ggpmisc)
library(ggseas)
library(gganimate)
library(ggstance)
library(ggbiplot)
library(ggalt)
library(ggExtra)
# library(ggfortify) # loaded later
library(ggnetwork)
library(geomnet)
# library(ggradar)
library(ggsci)
library(ggthemes)
library(xts)
library(MASS)

We set a font size larger than the default:

theme_set(theme_grey(14))


7.2 Aims of this chapter

In this chapter I describe packages that add functionality or new graphical
designs of plots to package ‘ggplot2’. Several new packages were written after
‘ggplot2’ version 2.0.0 was released, because this version for the first time made it
straightforward to write such extensions. To keep up to date with the release of new
extensions I recommend regularly checking the site ‘ggplot2 Extensions’ (maintained
by Daniel Emaasit) at https://www.ggplot2-exts.org/.
In contrast with previous chapters, I expect readers to first browse through the
whole chapter to get an idea of what is possible, and cherry-pick the sections
they find useful and worthy of detailed study. Some of the packages are generally
useful, but others are more specialized. I have tried to cover a wide array of plot
types, but I have described in more depth packages that I have written myself, or
that I am more familiar with; i.e. the space dedicated to each package description is
not to be taken as a measure of its usefulness for your own work.

 In this chapter we use mostly the modernized data frames of package
‘tibble’. The main reason is that the tibble() constructor does not by default
convert character variables into factors as the data.frame() constructor does.
The format used for printing is also improved. It is possible to use data.frame()
instead of tibble() in the examples below, but in some cases you will need to
add stringsAsFactors = FALSE to the call.
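The difference mentioned above can be seen in a small comparison (note: this reflects the data.frame() default of stringsAsFactors = TRUE current at the time of writing; the default has since been changed in base R):

```r
library(tibble)

d <- data.frame(txt = c("a", "b"))
t <- tibble(txt = c("a", "b"))
class(d$txt)  # "factor" under the defaults current at the time of writing
class(t$txt)  # "character"
```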

7.3 ‘showtext’

citation(package = "showtext")

##
## To cite package 'showtext' in publications
## use:
##
## Yixuan Qiu and authors/contributors of the
## included software. See file AUTHORS for
## details. (2017). showtext: Using Fonts
## More Easily in R Graphs. R package version
## 0.4-6.
## https://CRAN.R-project.org/package=showtext
##
## A BibTeX entry for LaTeX users is
##

330
7.3 ‘showtext’

## @Manual{,
## title = {showtext: Using Fonts More Easily in R Graphs},
## author = {Yixuan Qiu and authors/contributors of the included software. See file AUTHORS for details
## year = {2017},
## note = {R package version 0.4-6},
## url = {https://CRAN.R-project.org/package=showtext},
## }
##
## ATTENTION: This citation information has
## been auto-generated from the package
## DESCRIPTION file and may need manual
## editing, see 'help("citation")'.

Package ‘showtext’ allows portable use of different system fonts, or fonts from
Google, in plots created with ggplot.
A font with Chinese characters is included in the package. This example is
borrowed from the package vignette, but modified to use default fonts;
"wqy-microhei" is a Chinese font included in package ‘showtext’.

ggplot(NULL, aes(x = 1, y = 1)) + ylim(0.8, 1.2) +
  theme(axis.title = element_blank(), axis.ticks = element_blank(),
        axis.text = element_blank()) +
  annotate("text", 1, 1.1, family = "wqy-microhei", size = 12,
           label = "\u4F60\u597D\uFF0C\u4E16\u754C") +
  annotate("text", 1, 0.9, label = 'Chinese for "Hello, world!"',
           family = "sans", fontface = "italic", size = 8)

Next we load some system fonts, the same ones used for the text of this book.
Within code chunks, when using ‘knitr’, we can enable showtext with the chunk option
fig.showtext = TRUE as done here (but not visible). In a script or at the console we
can use showtext.auto(), or showtext.begin() and showtext.end(). As explained
in the package vignette, using showtext can increase the size of the PDF files created,
but on the other hand, it makes embedding of fonts unnecessary.
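In a stand-alone script, enabling showtext could look as follows. This is only a sketch based on the functions named above; it assumes ‘showtext’ and ‘ggplot2’ are installed, and the output file name hello.pdf is arbitrary.

```r
library(ggplot2)
library(showtext)

showtext.auto()  # use showtext for all graphics output from here on

p <- ggplot(data.frame(x = 1, y = 1), aes(x, y)) +
  annotate("text", 1, 1, label = "\u4F60\u597D",
           family = "wqy-microhei", size = 12)
ggsave("hello.pdf", p)  # text drawn by showtext; font embedding unnecessary

# Alternatively, wrap individual plots:
# showtext.begin(); print(p); showtext.end()
```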

Function font.families() lists the fonts known to R, and function font.add()
can be used to make system fonts visible to R. We set families, and indicate the font
names for each face.

font.families()

## [1] "sans"         "serif"        "mono"
## [4] "wqy-microhei"

font.add(family = "Lucida.Sans",
regular = "LucidaSansOT.otf",
italic = "LucidaSansOT-Italic.otf",
bold = "LucidaSansOT-Demi.otf",
bolditalic = "LucidaSansOT-DemiItalic.otf")

font.add(family = "Lucida.Bright",
regular = "LucidaBrightOT.otf",
italic = "LucidaBrightOT-Italic.otf",
bold = "LucidaBrightOT-Demi.otf",
bolditalic = "LucidaBrightOT-DemiItalic.otf")

font.families()

## [1] "sans"          "serif"
## [3] "mono"          "wqy-microhei"
## [5] "Lucida.Sans"   "Lucida.Bright"

We can then select these fonts in the usual way.

ggplot(NULL, aes(x = 1, y = 1)) + ylim(0.8, 1.2) +
  theme(axis.title = element_blank(), axis.ticks = element_blank(),
        axis.text = element_blank()) +
  annotate("text", 1, 1.1, label = 'Lucida Bright Demi "Hello, world!"',
           family = "Lucida.Bright", fontface = "bold", size = 6) +
  annotate("text", 1, 0.9, label = 'Lucida Sans Italic "Hello, world!"',
           family = "Lucida.Sans", fontface = "italic", size = 6)


my.data <-
data.frame(x = 1:5, y = rep(2, 5),
label = c("a", "b", "c", "d", "e"))

ggplot(my.data, aes(x, y, label = label)) +
  geom_text(hjust = 1.5, family = "Lucida.Sans", fontface = "italic") +
  geom_point()


ggplot(my.data, aes(x, y, label = label)) +
  geom_text(hjust = 1.5, family = "Lucida.Bright") +
  geom_point()


The examples that follow, using function font.add.google() to add Google fonts,
are more portable. This is so because as long as internet access is available, fonts
can be downloaded if not available locally. You can browse the available fonts at
https://fonts.google.com/. The names used in the statements below are those
under which the fonts are listed.

## Loading Google fonts (http://www.google.com/fonts)
font.add.google(name = "Permanent Marker", family = "Marker")
font.add.google(name = "Courgette")
font.add.google(name = "Lato")

ggplot(NULL, aes(x = 1, y = 1)) + ylim(0.8, 1.2) +
  theme(axis.title = element_blank(), axis.ticks = element_blank(),
        axis.text = element_blank()) +
  annotate("text", 1, 1.1, label = 'Courgette "Hello, world!"',
           family = "Courgette", size = 6) +
  annotate("text", 1, 0.9, label = 'Permanent Marker "Hello, world!"',
           family = "Marker", size = 6)


In all the examples above we used geom_text(), but geom_label() can be used
similarly. In the case of the title, axis labels, tick labels, and similar components,
the use of fonts is controlled through the theme. Here we change the base family
used. Please see section 6.19 on page 278 for examples of how to set the family for
individual elements of the plot.

font.add.google(name = "Lora", regular.wt = 400, bold.wt = 700)
font.families()

## [1] "sans"          "serif"
## [3] "mono"          "wqy-microhei"
## [5] "Lucida.Sans"   "Lucida.Bright"
## [7] "Marker"        "Courgette"
## [9] "Lato"          "Lora"

ggplot(my.data, aes(x, y, label = label)) +
  geom_text(vjust = -1.2,
            family = "Lora",
            fontface = "bold",
            size = 8) +
  geom_point() +
  theme_classic(base_size = 15, base_family = "Lora")


 Be aware that in geometries the equivalent of face in theme text elements
is called fontface, while the character-string values they accept are the same.

7.4 ‘viridis’

citation(package = "viridis")

##
## To cite package 'viridis' in publications
## use:
##
## Simon Garnier (2017). viridis: Default
## Color Maps from 'matplotlib'. R package
## version 0.4.0.
## https://CRAN.R-project.org/package=viridis
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {viridis: Default Color Maps from 'matplotlib'},
## author = {Simon Garnier},
## year = {2017},
## note = {R package version 0.4.0},
## url = {https://CRAN.R-project.org/package=viridis},
## }

Package ‘viridis’ defines color palettes, and fill and color scales, with colors selected
based on human perception, with special consideration of visibility for viewers with
different kinds of color blindness, as well as in grey-scale reproduction.


set.seed(56231)
my.data <- tibble(x = rnorm(500),
y = c(rnorm(250, -1, 1), rnorm(250, 1, 1)),
group = factor(rep(c("A", "B"), c(250, 250))) )

Using scale_fill_viridis() replaces the default palette.

ggplot(my.data, aes(x, y)) +
  stat_density_2d(aes(fill = ..level..), geom = "polygon") +
  facet_wrap(~group) +
  scale_fill_viridis()

[Figure: filled 2-D density plots, faceted by group (A, B), with the default viridis fill scale mapped to level.]

Function scale_fill_viridis() supports several different palettes, which can be
selected through an argument passed to parameter option.

ggplot(my.data, aes(x, y)) +
  stat_density_2d(aes(fill = ..level..), geom = "polygon") +
  facet_wrap(~group) +
  scale_fill_viridis(option = "magma")

[Figure: the same density plots drawn with the "magma" palette.]


ggplot(my.data, aes(x, y)) +
  stat_density_2d(aes(fill = ..level..), geom = "polygon") +
  facet_wrap(~group) +
  scale_fill_viridis(option = "inferno")

[Figure: the same density plots drawn with the "inferno" palette.]

ggplot(my.data, aes(x, y)) +
  stat_density_2d(aes(fill = ..level..), geom = "polygon") +
  facet_wrap(~group) +
  scale_fill_viridis(option = "plasma")

[Figure: the same density plots drawn with the "plasma" palette.]

ggplot(my.data, aes(x, y)) +
  geom_bin2d(bins = 8) +
  facet_wrap(~group) +
  scale_fill_viridis()


[Figure: counts plotted with geom_bin2d(), faceted by group, with the viridis fill scale.]

ggplot(my.data, aes(x, y)) +
  geom_hex(bins = 8) +
  facet_wrap(~group) +
  scale_fill_viridis()

[Figure: counts plotted with geom_hex(), faceted by group, with the viridis fill scale.]

7.5 ‘pals’

citation(package = "pals")

##
## To cite package 'pals' in publications use:
##
## Kevin Wright (2016). pals: Color Palettes,
## Colormaps, and Tools to Evaluate Them. R
## package version 1.0.
## https://CRAN.R-project.org/package=pals


##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {pals: Color Palettes, Colormaps, and Tools to Evaluate Them},
## author = {Kevin Wright},
## year = {2016},
## note = {R package version 1.0},
## url = {https://CRAN.R-project.org/package=pals},
## }

Package ‘pals’ fulfils a very specific role: it provides definitions for palettes and
color maps, as well as palette-evaluation tools. Being a specialized package, we
describe it briefly and recommend that readers read the vignette and other
documentation included with the package.
We modify some of the examples from the previous section to show how to use the
palettes and colormaps defined in this package.

set.seed(56231)
my.data <- tibble(x = rnorm(500),
y = c(rnorm(250, -1, 1), rnorm(250, 1, 1)),
group = factor(rep(c("A", "B"), c(250, 250))) )

First we simply reproduce the first example, obtaining the same plot as by use of
scale_fill_viridis().

ggplot(my.data, aes(x, y)) +
  stat_density_2d(aes(fill = ..level..), geom = "polygon") +
  facet_wrap(~group) +
  scale_fill_gradientn(colours = viridis(100), guide = "colourbar")

[Figure: the same density plots as obtained earlier with scale_fill_viridis().]

The biggest advantage is that we can, in the same way, use any of the very numerous
colormaps and palettes, and choose how smooth a color gradient to use.


ggplot(my.data, aes(x, y)) +
  stat_density_2d(aes(fill = ..level..), geom = "polygon") +
  facet_wrap(~group) +
  scale_fill_gradientn(colours = viridis(10), guide = "colourbar")

[Figure: the same density plots drawn with a coarser, 10-step viridis gradient.]

We can compare different colormaps with pal.bands(). In this example we compare
those included in package ‘viridis’ with some of the other palettes defined in
package ‘pals’.

pal.bands(viridis, magma, inferno, plasma, coolwarm, tol.rainbow, parula)

[Figure: color bands for the viridis, magma, inferno, plasma, coolwarm, tol.rainbow and parula colormaps.]

How does the luminance of the red, green and blue colour channels vary along the
palette or color map gradient? We can see this with pal.channels() .

pal.channels(viridis, main = "viridis")


[Figure: red, green and blue channel luminance plotted along the viridis colormap.]

How would viridis look in monochrome, and to persons with different kinds of
color blindness? We can see this with pal.safe() .

pal.safe(viridis, main = "viridis")

[Figure: the viridis colormap as seen in the original, in black and white, and as perceived with deutan, protan and tritan color blindness.]

A brief example with a discrete palette follows, using tol().

ggplot(data = mtcars,
aes(x = disp, y = mpg, color = factor(cyl))) +
geom_point() +
scale_color_manual(values = tol(n = 3))


[Figure: scatterplot of mpg vs. disp from mtcars, with color mapped to factor(cyl) (levels 4, 6 and 8) using a three-step tol palette.]

Parameter n gives the number of discrete values in the palette. Discrete palettes
have a maximum value for n; in the case of tol(), 12 discrete steps.

pal.bands(tol(n = 3), tol(n = 6), tol())
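Since tol() simply returns a character vector of color definitions, the number of steps can also be checked directly at the console — a small sketch, assuming package ‘pals’ is attached:

```r
library(pals)

length(tol(n = 3))        # 3
length(tol(n = 6))        # 6
is.character(tol(n = 3))  # TRUE: plain color-definition strings
```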

U Play with the argument passed to n to test what happens when the number
of values in the scale is smaller or larger than the number of levels of the factor
mapped to the color aesthetic.

Is this palette safe?

pal.safe(tol(n = 3))


[Figure: the three-step tol palette as seen in the original, in black and white, and with deutan, protan and tritan color blindness.]

U Explore the available palettes until you find a nice one that is also safe with
three steps. Be aware that color maps like viridis() can be used to define
a discrete color scale using scale_color_manual() in exactly the same way as
palettes like tol(). Colormaps, however, may be perceived as gradients rather
than unordered discrete categories, so care is needed.
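As a sketch of the approach mentioned above (assuming packages ‘ggplot2’ and ‘viridis’ are attached), a colormap function can supply the values of a manual discrete scale:

```r
library(ggplot2)
library(viridis)

ggplot(data = mtcars,
       aes(x = disp, y = mpg, color = factor(cyl))) +
  geom_point() +
  # viridis(3) returns three colors picked along the colormap
  scale_color_manual(values = viridis(3))
```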

7.6 ‘gganimate’

citation(package = "gganimate")

##
## To cite package 'gganimate' in publications
## use:
##
## c)) (2016). gganimate: Create easy
## animations with ggplot2. R package version
## 0.1. http://github.com/dgrtwo/gganimate
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {gganimate: Create easy animations with ggplot2},
## author = {{c))}},
## year = {2016},
## note = {R package version 0.1},
## url = {http://github.com/dgrtwo/gganimate},
## }
##


## ATTENTION: This citation information has
## been auto-generated from the package
## DESCRIPTION file and may need manual
## editing, see 'help("citation")'.

Package ‘gganimate’ allows the use of package ‘animation’ with ggplots, through a
syntax consistent with the grammar of graphics. It adds a new aesthetic, frame, which
can be used to map groups of data to frames in the animation.
Use of the package is extremely easy, but installation can be somewhat tricky
because of system requirements. Just make sure to have ImageMagick installed and
included in the search PATH.
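Whether ImageMagick is reachable through the PATH can be checked from within R; Sys.which() returns an empty string when a program is not found (the command-line tool is named convert in older ImageMagick versions, and magick in newer ones):

```r
# Empty string means the program was not found on the search PATH
Sys.which("convert")
Sys.which("magick")
```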
We modify an example from section 6.5. We add the frame aesthetic
to the earlier figure.

p <- ggplot(data = mtcars,
            aes(x = disp, y = mpg, colour = factor(cyl), frame = cyl)) +
  geom_point()

Now we can print p as a normal plot, with print() , here called automatically.

[Figure: the static plot, a scatterplot of mpg vs. disp colored by factor(cyl), showing all observations.]

Or display an animation with gg_animate(). The animation will look different
depending on the output format and the program used for viewing it. For example,
in this PDF file the animation will work when viewed with Adobe Reader or Adobe
Acrobat, but not in the Sumatra PDF viewer. We add title_frame = FALSE as a title
does not seem useful in this simple animation.

gg_animate(p, title_frame = FALSE)


[Figure: one frame of the animation, showing only the points for a single level of cyl.]

Or save it to a file.

gg_animate(p, "p-animation.gif")

Cumulative animations are also supported. Here we use the same example with
three frames, but this type of animation is particularly effective for time-series data.
To achieve this we only need to add cumulative = TRUE to the aesthetic mappings.

p <- ggplot(data = mtcars,
            aes(x = disp, y = mpg, colour = factor(cyl),
                frame = cyl, cumulative = TRUE)) +
  geom_point()

Now we can print p as a normal plot,

p

[Figure: the static plot, identical to the earlier scatterplot of mpg vs. disp.]

Or display an animation with gg_animate(). The animation will look different
depending on the output format and the program used for viewing it. For example,
in this PDF file the animation will work when viewed with Adobe Reader or Adobe
Acrobat, but not in the Sumatra PDF viewer.

gg_animate(p, title_frame = FALSE)

[Figure: one frame of the cumulative animation, with points accumulating over frames.]

7.7 ‘ggstance’

citation(package = "ggstance")

## Warning in citation(package = "ggstance"): no date field in DESCRIPTION file of
## package 'ggstance'
## Warning in citation(package = "ggstance"): could not determine year for
## 'ggstance' from package DESCRIPTION file

##
## To cite package 'ggstance' in publications
## use:
##
## Lionel Henry, Hadley Wickham and Winston
## Chang (NA). ggstance: Horizontal 'ggplot2'
## Components. R package version 0.3.
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {ggstance: Horizontal 'ggplot2' Components},
## author = {Lionel Henry and Hadley Wickham and Winston Chang},
## note = {R package version 0.3},
## }


Package ‘ggstance’ defines horizontal versions of common ggplot geometries,
statistics and positions. Although ‘ggplot2’ defines coord_flip, ‘ggstance’ provides a
more intuitive user interface and more consistent plot formatting.

Currently the package defines horizontal geoms geom_barh(), geom_histogramh(),
geom_linerangeh(), geom_pointrangeh(), geom_errorbarh(), geom_crossbarh(),
geom_boxploth(), and geom_violinh(). It also defines horizontal stats
stat_binh(), stat_boxploth(), stat_counth(), and stat_xdensity(), and vertical
positions position_dodgev, position_nudgev, position_fillv, position_stackv,
and position_jitterdodgev.

We will give only a couple of examples, as their use holds no surprises. First
we make horizontal versions of the histogram plots shown in section 6.14.2 on page
234.

set.seed(12345)
my.data <- tibble(x = rnorm(200),
y = c(rnorm(100, -1, 1), rnorm(100, 1, 1)),
group = factor(rep(c("A", "B"), c(100, 100))) )

ggplot(my.data, aes(y = x)) +
  geom_histogramh(bins = 15)

[Figure: horizontal histogram of x.]

ggplot(my.data, aes(y = y, fill = group)) +
  geom_histogramh(bins = 15, position = "dodgev")


[Figure: horizontal histogram of y with groups A and B dodged vertically.]

ggplot(my.data, aes(y = y, fill = group)) +
  geom_histogramh(bins = 15, position = "stackv")

[Figure: horizontal histogram of y with the groups stacked.]

ggplot(my.data, aes(y = y, fill = group)) +
  geom_histogramh(bins = 15, position = "identity", alpha = 0.5) +
  theme_bw(16)


[Figure: horizontal histogram of y with overlapping, semi-transparent groups.]

Now we make a horizontal version of the boxplot shown in section 6.14.4 on page
240.

ggplot(my.data, aes(y, group)) +
  geom_boxploth()

[Figure: horizontal boxplots of y for groups A and B.]

7.8 ‘ggbiplot’

citation(package = "ggbiplot")

##
## To cite package 'ggbiplot' in publications
## use:
##
## Vincent Q. Vu (2011). ggbiplot: A ggplot2
## based biplot. R package version 0.55.


## http://github.com/vqv/ggbiplot
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {ggbiplot: A ggplot2 based biplot},
## author = {Vincent Q. Vu},
## year = {2011},
## note = {R package version 0.55},
## url = {http://github.com/vqv/ggbiplot},
## }
##
## ATTENTION: This citation information has
## been auto-generated from the package
## DESCRIPTION file and may need manual
## editing, see 'help("citation")'.

Package ‘ggbiplot’ defines two functions, ggscreeplot() and ggbiplot(). These
functions make it easy to produce nice plots of the results from a principal
components analysis done with prcomp().

For the time being we reproduce an example from the package README.

data(wine)
wine.pca <- prcomp(wine, scale. = TRUE)
ggbiplot(wine.pca, obs.scale = 1, var.scale = 1,
groups = wine.class, ellipse = TRUE, circle = TRUE) +
scale_color_discrete(name = '') +
theme(legend.direction = 'horizontal', legend.position = 'top')


[Figure: biplot of the wine data, PC1 (36.2% explained var.) against PC2 (19.2% explained var.), with observations colored by wine class (barolo, grignolino, barbera), confidence ellipses, and variable loadings drawn as arrows.]

7.9 ‘ggalt’

citation(package = "ggalt")

##
## To cite package 'ggalt' in publications use:
##
## Bob Rudis, Ben Bolker and Jan Schulz
## (2017). ggalt: Extra Coordinate Systems,
## 'Geoms', Statistical Transformations,
## Scales and Fonts for 'ggplot2'. R package
## version 0.4.0.
## https://CRAN.R-project.org/package=ggalt
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {ggalt: Extra Coordinate Systems, 'Geoms', Statistical Transformations,
## Scales and Fonts for 'ggplot2'},
## author = {Bob Rudis and Ben Bolker and Jan Schulz},


## year = {2017},
## note = {R package version 0.4.0},
## url = {https://CRAN.R-project.org/package=ggalt},
## }

Package ‘ggalt’ defines geoms geom_xspline(), geom_bkde(), geom_bkde2d(),
geom_stateface(), geom_encircle(), geom_lollipop(), geom_dumbbell(), and
geom_stepribbon(); stats stat_xspline(), stat_bkde(), stat_bkde2d(), and
stat_ash(); scale scale_fill_pokemon(); and formatter byte_format().
The highlights are the use of functions from package ‘KernSmooth’ for density
estimation, the provision of X-splines, and the formatting of “bytes” in the way
usual when describing computer memory.
The first example is the use of X-splines, which are very flexible splines that are
smooth (have a continuous first derivative). They can be tuned from interpolation
(passing through every observation) to being rather “stiff” smoothers.

set.seed(1816)
dat <- tibble(x=1:10,
y=c(sample(15:30, 10)))

ggplot(dat, aes(x, y)) +
  geom_point() +
  geom_xspline()

[Figure: scatterplot of the random data with a default X-spline curve.]

The “flexibility” of the spline can be adjusted by passing a numeric argument to
parameter spline_shape.

ggplot(dat, aes(x, y)) +
  geom_point() +
  geom_xspline(spline_shape = 0.4)


[Figure: the same scatterplot with a stiffer X-spline (spline_shape = 0.4).]

We also redo some of the density plot examples from section 6.14.3 on page 237.

ggplot(my.data, aes(y, fill = group)) +
  geom_bkde(alpha = 0.5)

## Bandwidth not specified. Using '0.37', via KernSmooth::dpik.
## Bandwidth not specified. Using '0.29', via KernSmooth::dpik.

[Figure: overlapping kernel-density estimates of y for groups A and B.]

ggplot(my.data, aes(x, y, colour = group)) +
  geom_point() +
  geom_rug() +
  geom_bkde2d()

## Bandwidth not specified. Using ['0.39', '0.37'], via KernSmooth::dpik.
## Bandwidth not specified. Using ['0.42', '0.29'], via KernSmooth::dpik.


[Figure: scatterplot of x and y colored by group, with marginal rugs and 2-D kernel-density contours.]

ggplot(my.data, aes(x, y)) +
  geom_bkde2d() +
  facet_wrap(~group)

## Bandwidth not specified. Using ['0.39', '0.37'], via KernSmooth::dpik.
## Bandwidth not specified. Using ['0.42', '0.29'], via KernSmooth::dpik.

[Figure: 2-D density contours, faceted by group.]

We here use a scale from package ‘viridis’ described in section 7.4 on page 336.

ggplot(my.data, aes(x, y)) +
  stat_bkde2d(aes(fill = ..level..), geom = "polygon") +
  facet_wrap(~group) +
  scale_fill_viridis()

## Bandwidth not specified. Using ['0.39', '0.37'], via KernSmooth::dpik.
## Bandwidth not specified. Using ['0.42', '0.29'], via KernSmooth::dpik.


[Figure: filled 2-D density polygons, faceted by group, with the viridis fill scale.]

7.10 ‘ggExtra’

Sometimes it is useful to add marginal plots to a ggplot. Package ‘ggExtra’ provides
this functionality through an easy-to-use interface.

citation(package = "ggExtra")

##
## To cite package 'ggExtra' in publications
## use:
##
## Dean Attali (2016). ggExtra: Add Marginal
## Histograms to 'ggplot2', and More
## 'ggplot2' Enhancements. R package version
## 0.6.
## https://CRAN.R-project.org/package=ggExtra
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {ggExtra: Add Marginal Histograms to 'ggplot2', and More 'ggplot2'
## Enhancements},
## author = {Dean Attali},
## year = {2016},
## note = {R package version 0.6},
## url = {https://CRAN.R-project.org/package=ggExtra},
## }

set.seed(12345)
my.data <-
  data.frame(x = rnorm(200),
             y = c(rnorm(100, -1, 1), rnorm(100, 1, 1)),
             group = factor(rep(c("A", "B"), c(100, 100))) )

p01 <- ggplot(my.data, aes(x, y)) +
  geom_point()

ggMarginal(p01)

[Figure: scatterplot with default marginal density plots on both axes.]

ggMarginal(p01, type = "histogram", margins = "x", size = 3)

## `stat_bin()` using `bins = 30`. Pick better
## value with `binwidth`.

[Figure: scatterplot with a marginal histogram along the x axis only.]


U Read the documentation for ggMarginal() and play by changing the aesthetics
used for the lines and bars on the margins.

= At the time of writing, ggMarginal() does not support grouping or facets.
Both of these features would quite frequently be very useful, but this needs to
be done manually, using the facilities of package ‘gridExtra’ to combine and align
ggplots created individually. Grouping is ignored, and facets in the plot passed as
argument trigger fatal errors when ggMarginal() is executed.
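A minimal sketch of such a manual combination follows, assuming packages ‘ggplot2’ and ‘gridExtra’ are installed; note that grid.arrange() stacks the plots but does not by itself guarantee that the x axes of the two panels are exactly aligned.

```r
library(ggplot2)
library(gridExtra)

# the same random data as used above
set.seed(12345)
my.data <- data.frame(x = rnorm(200),
                      y = c(rnorm(100, -1, 1), rnorm(100, 1, 1)),
                      group = factor(rep(c("A", "B"), c(100, 100))))

# marginal panel: grouped densities along x, with axes suppressed
p.top <- ggplot(my.data, aes(x, fill = group)) +
  geom_density(alpha = 0.5) +
  theme_void()
# main panel: grouped scatterplot
p.main <- ggplot(my.data, aes(x, y, color = group)) +
  geom_point()

# stack the marginal density above the main plot
grid.arrange(p.top, p.main, ncol = 1, heights = c(1, 3))
```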

p02 <- ggplot(my.data, aes(x, y, color = group)) +
  geom_point()

ggMarginal(p02, margins = "x")

[Figure: scatterplot colored by group with a single marginal density plot along the x axis; the grouping is ignored in the margin.]

7.11 ‘ggfortify’

# interferes with 'ggbiplot'
library(ggfortify)

##
## Attaching package: 'ggfortify'
## The following object is masked from 'package:ggbiplot':
##
## ggbiplot


citation(package = "ggfortify")

##
## To cite ggfortify in publications, please
## use:
##
## Yuan Tang, Masaaki Horikoshi, and Wenxuan
## Li. ggfortify: Unified Interface to
## Visualize Statistical Result of Popular R
## Packages. The R Journal, 2016.
##
## Masaaki Horikoshi and Yuan Tang (2016).
## ggfortify: Data Visualization Tools for
## Statistical Analysis Results.
## https://CRAN.R-project.org/package=ggfortify

The fortify() approach re-organizes the output from different model-fitting
functions into a more consistent format that is easier to handle, which is especially
useful when collecting the results from different fits. Package ‘ggfortify’ extends this
idea to encompass the creation of diagnostic and other plots from model fits using
‘ggplot2’. The most important method to remember is autoplot(), for which many
different specializations are provided. As the returned objects are of class "ggplot",
it is easy to add additional layers and graphical elements to them.
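As a small sketch of adding elements to an autoplot() result (assuming ‘ggfortify’ and ‘ggplot2’ are attached; the lynx time series from base R serves as data):

```r
library(ggplot2)
library(ggfortify)

p <- autoplot(lynx)  # a "ggplot" object built from the time series
# theme and labels can be added as with any other ggplot
p + labs(y = "Lynx trappings") + theme_minimal()
```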

We start with a linear model as example. We return to the regression example used
in Chapter 4, page 95.

fm1 <- lm(dist ~ speed, data=cars)

autoplot(fm1)

359
7 Extensions to ggplot

Residuals vs Fitted Normal Q−Q


23
● 49
● 3 49

40 23

Standardized residuals
35

2 35

● ● ●●
20
Residuals

● ●●
● ● ● ● ● 1 ●●●●●
● ●

● ● ● ●●●
● ● ● ● ●●●●

0 ●



0 ●●●
●●

● ●●
● ● ●
● ●●●●
● ● ● ● ●●●
● ● ●●●●
● ● ● ●●●
● ● ●●
● ●

−1 ●
●●
● ●
−20 ● ● ● ●

● ●
−2
0 20 40 60 80 −2 −1 0 1 2
Fitted values Theoretical Quantiles
Scale−Location Residuals vs Leverage
23
● 49
● 3 49

23

Standardized residuals

Standardized Residuals
1.5
35

● 2 ●

● ●●

● ●
● ●
● ●
1 ● ● ● ●


● ●
1.0 ●

● ●
● ● ● ●
● ● ● ● ● ●
● ● ●● ●
●●


● ● ● 0 ● ●

● ●
● ● ● ● ●●
● ● ●
● ● ● ● ● ●●
● ● ●● ●
● ● ●
0.5 ●
● ● −1 ●



● ● ●
● ●
● ●

● −2 39

0 20 40 60 80 0.00 0.03 0.06 0.09 0.12


Fitted values Leverage

And here is the example used for ANOVA on page 101.

fm4 <- lm(count ~ spray, data = InsectSprays)

autoplot(fm4)


[Figure: the same four diagnostic plots, here for fm4.]

There is also an autoplot() specialization for time series data.

autoplot(lynx)

[Figure: line plot of the lynx time series.]

Please see section 7.15 for an alternative approach, slightly less automatic, but
based on a specialization of the ggplot() method.


7.12 ‘ggnetwork’

citation(package = "ggnetwork")

##
## To cite package 'ggnetwork' in publications
## use:
##
## Francois Briatte (2016). ggnetwork:
## Geometries to Plot Networks with
## 'ggplot2'. R package version 0.5.1.
## https://CRAN.R-project.org/package=ggnetwork
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {ggnetwork: Geometries to Plot Networks with 'ggplot2'},
## author = {Francois Briatte},
## year = {2016},
## note = {R package version 0.5.1},
## url = {https://CRAN.R-project.org/package=ggnetwork},
## }
##
## ATTENTION: This citation information has
## been auto-generated from the package
## DESCRIPTION file and may need manual
## editing, see 'help("citation")'.

Package ‘ggnetwork’ provides methods and functions to plot network graphs with
‘ggplot2’; these are a rather specialized type of plot. The package contains a very
nice vignette with many examples, so in this section I will only provide a few
examples to motivate readers to explore the package documentation and use the
package. This package allows very flexible control of the graphical design.

data(blood, package = "geomnet")

Using mostly defaults, the plot is not visually attractive. For the layout to be
deterministic, we need to set the seed used by the pseudorandom number generator.
To assemble the plot we add three layers of data with geom_edges(), geom_nodes()
and geom_nodetext(). We use theme_blank() as axes and their labels serve no
purpose in a plot like this.

set.seed(12345)
ggplot(ggnetwork(network::network(blood$edges[, 1:2]),
                 layout = "circle"),
       aes(x, y, xend = xend, yend = yend)) +
  geom_edges() +
  geom_nodes() +
  geom_nodetext(aes(label = vertex.names)) +
  theme_blank()

## Loading required package: sna


## Loading required package: statnet.common
## Loading required package: network
## network: Classes for Relational Data
## Version 1.13.0 created on 2015-08-31.
## copyright (c) 2005, Carter T. Butts, University of California-Irvine
## Mark S. Handcock, University of California -- Los Angeles
## David R. Hunter, Penn State University
## Martina Morris, University of Washington
## Skye Bender-deMoll, University of Washington
## For citation information, type citation("network").
## Type help("network-package") to get started.
##
## Attaching package: 'network'
## The following object is masked from 'package:plyr':
##
## is.discrete
## sna: Tools for Social Network Analysis
## Version 2.4 created on 2016-07-23.
## copyright (c) 2005, Carter T. Butts, University of California-Irvine
## For citation information, type citation("sna").
## Type help(package="sna") to get started.

[Figure: network graph of the blood-type data, with nodes A+, A−, B+, B−, AB+, AB−, O+ and O− arranged in a circle.]


Some tweaking of the aesthetics leads to a nicer plot.

set.seed(12345)
ggplot(ggnetwork(network::network(blood$edges[, 1:2]),
layout = "circle", arrow.gap = 0.06),
aes(x, y, xend = xend, yend = yend)) +
geom_edges(color = "grey30",
arrow = arrow(length = unit(6, "pt"), type = "open")) +
geom_nodes(size = 16, color = "darkred") +
geom_nodetext(aes(label = vertex.names), color = "white") +
theme_blank()

[Figure: the same network drawn with dark red nodes, white labels and open arrows.]

U How does the layout change if you change the argument passed to set.seed()?
And what happens with the layout if you run the plotting statement more than
once, without calling set.seed()?

U What happens if you change the order of the geoms in the code above?
Experiment by editing and running the code to find the answer, or if you think you
know the answer, to check whether your guess was right or wrong.


U Change the graphic design of the plot in steps, by changing: 1) the shape of
the nodes, 2) the color of the nodes, 3) the size of the nodes and the size of the
text, 4) the type of arrows and their size, 5) the font used in nodes to italic.

= This is not the only package supporting the plotting of network graphs
with package ‘ggplot2’. Packages ‘GGally’ and ‘geomnet’ support network graphs.
Package ‘ggCompNet’ compares the three methods, both for performance and by
giving examples of the visual design.

7.13 ‘geomnet’

citation(package = "geomnet")

##
## To cite package 'geomnet' in publications
## use:
##
## Samantha Tyner and Heike Hofmann (2016).
## geomnet: Network Visualization in the
## 'ggplot2' Framework. R package version
## 0.2.0.
## https://CRAN.R-project.org/package=geomnet
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {geomnet: Network Visualization in the 'ggplot2' Framework},
## author = {Samantha Tyner and Heike Hofmann},
## year = {2016},
## note = {R package version 0.2.0},
## url = {https://CRAN.R-project.org/package=geomnet},
## }
##
## ATTENTION: This citation information has
## been auto-generated from the package
## DESCRIPTION file and may need manual
## editing, see 'help("citation")'.

Package ‘geomnet’ provides methods and functions to plot network graphs, a
rather specialized type of plot, with ‘ggplot2’.


data(blood, package = "geomnet")

Using mostly defaults, the plot is very simple and lacks labels. As above, for the
layout to be deterministic we need to set the seed. In the case of ‘geomnet’, new
aesthetics from_id and to_id are defined, and only one layer is needed, added
with geom_net() . We use here theme_net() , also exported by this package.

set.seed(12345)
ggplot(data = blood$edges, aes(from_id = from, to_id = to)) +
geom_net() +
theme_net()




Some tweaking of the aesthetics leads to a nicer plot, equivalent to the second ex-
ample in the previous section.

set.seed(12345)
ggplot(data = blood$edges, aes(from_id = from, to_id = to)) +
geom_net(colour = "darkred", layout.alg = "circle", labelon = TRUE, size = 16,
directed = TRUE, vjust = 0.5, labelcolour = "white",
arrow = arrow(length = unit(6, "pt"), type = "open"),
linewidth = 0.5, arrowgap = 0.06,
selfloops = FALSE, ecolour = "grey30") +
theme_net()


(Figure: the ‘geomnet’ version of the blood-type network, matching the ‘ggnetwork’ plot in the previous section.)

U Change the graphic design of the plot in steps, by changing: 1) the shape of
the nodes, 2) the color of the nodes, 3) the size of the nodes and the size of the
text, 4) the type of arrows and their size, 5) the font used in nodes to italic.

7.14 ‘ggforce’

citation(package = "ggforce")

##
## To cite package 'ggforce' in publications
## use:
##
## Thomas Lin Pedersen (2016). ggforce:
## Accelerating 'ggplot2'. R package version
## 0.1.1.
## https://CRAN.R-project.org/package=ggforce
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {ggforce: Accelerating 'ggplot2'},
## author = {Thomas Lin Pedersen},

367
7 Extensions to ggplot

## year = {2016},
## note = {R package version 0.1.1},
## url = {https://CRAN.R-project.org/package=ggforce},
## }
##
## ATTENTION: This citation information has
## been auto-generated from the package
## DESCRIPTION file and may need manual
## editing, see 'help("citation")'.

Package ‘ggforce’ includes an assortment of useful extensions to ‘ggplot2’.

7.14.1 Geoms and stats

Sina plots are a new type of plot resembling violin plots (described in section 6.14.5
on page 241), in which the actual observations are plotted as a cloud that spreads
more widely as the local density increases. Both a geometry and a statistic are defined.

set.seed(12345)
my.data <-
data.frame(x = rnorm(200),
y = c(rnorm(100, -1, 1), rnorm(100, 1, 1)),
group = factor(rep(c("A", "B"), c(100, 100))) )

Sina plots can be obtained with geom_sina() .

ggplot(my.data, aes(group, y)) +
  geom_sina()
(Figure: sina plot of y versus group.)

ggplot(my.data, aes(group, y, fill = group)) +
  geom_sina()

(Figure: sina plot of y versus group, with fill mapped to group.)

The geometries geom_sina() and geom_violin() can be combined to create
attractive and informative plots. As with any ggplot, varying the order of the layers
and their transparency ( alpha ) can be used to obtain plots in which one or the
other geometry is highlighted.

ggplot(my.data, aes(group, y, fill = group)) +
  geom_violin(alpha = 0.16) +
  geom_sina(alpha = 0.33)

(Figure: overlaid violin and sina plots of y versus group.)

Several geometries for plotting arcs, curves and circles are also provided:
geom_circle() , geom_arc() , geom_arc_bar() , geom_bezier() and geom_bspline() .
coming soon.
Geometries similar to geom_path() and geom_segment() , called geom_link() and
geom_link2() , add interpolation of aesthetics along the segment or path between
each pair of observations/points.
coming soon.
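In the meanwhile, a hedged sketch (not from the book's examples): geom_link() draws a segment as many short pieces, so aesthetics can be interpolated along its length via the computed variable ..index.. , here mapped to alpha .

```r
# A hedged sketch: interpolate transparency along a single segment.
# Assumes 'ggforce' provides geom_link() with aesthetics x, y, xend,
# yend and the computed variable ..index.. (progression 0 to 1).
library(ggplot2)
library(ggforce)

df <- data.frame(x = 1, y = 1, xend = 10, yend = 5)
ggplot(df) +
  geom_link(aes(x = x, y = y, xend = xend, yend = yend,
                alpha = ..index..),
            size = 2)
```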

369
7 Extensions to ggplot

7.14.2 Transformations

trans_reverser() can be used to reverse any monotonic transformation. New
transformations power_trans() and radial_trans() are also provided.
coming soon.
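In the meanwhile, a hedged sketch of how such a transformation can be used in a scale (the assumption is that trans_reverser() accepts the name of an existing transformation):

```r
# A hedged sketch: a reversed log10 y axis built with trans_reverser()
# from 'ggforce', passed through scale_y_continuous(trans = ...).
library(ggplot2)
library(ggforce)

ggplot(mtcars, aes(disp, mpg)) +
  geom_point() +
  scale_y_continuous(trans = trans_reverser("log10"))
```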

7.14.3 Theme

theme_no_axes() is not that useful for a sina plot, but could be used to advantage
for raster images or maps. It differs from theme_blank() and theme_null() in
that the plot is framed and has a white plotting area.

ggplot(my.data, aes(group, y)) +
  geom_sina() +
  theme_no_axes()

(Figure: the sina plot drawn with theme_no_axes(): framed panel, no axes.)

7.14.4 Paginated facetting

facet_grid_paginate() , facet_wrap_paginate() and facet_zoom() add pagination
to the usual facetting, allowing one to split large facetted plots into pages, and
to zoom into individual panels of a facetted plot.
coming soon.
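In the meanwhile, a hedged sketch of pagination (the assumption is that nrow , ncol and page are arguments of facet_wrap_paginate() , with page selecting which page of panels to draw):

```r
# A hedged sketch: draw only the first 2-by-2 page of a large set of
# panels with facet_wrap_paginate() from 'ggforce'.
library(ggplot2)
library(ggforce)

ggplot(diamonds, aes(carat, price)) +
  geom_point(alpha = 0.1) +
  facet_wrap_paginate(~cut:clarity, nrow = 2, ncol = 2, page = 1)
```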

7.15 ‘ggpmisc’

citation(package = "ggpmisc")


##
## To cite ggpmisc in publications, please use:
##
## Pedro J. Aphalo. (2016) Learn R ...as you
## learnt your mother tongue. Leanpub,
## Helsinki.
##
## A BibTeX entry for LaTeX users is
##
## @Book{,
## author = {Pedro J. Aphalo},
## title = {Learn R ...as you learnt your mother tongue},
## publisher = {Leanpub},
## year = {2016},
## url = {http://leanpub.com/learnr},
## }

Package ‘ggpmisc’ was developed by myself in response to questions from
workmates and in Stackoverflow, and to provide functionality that I have needed
in my own research or teaching. It provides new stats for everyday use:
stat_peaks() , stat_valleys() , stat_poly_eq() , stat_fit_glance() ,
stat_fit_deviations() , and stat_fit_augment() ; a function for converting
time-series data into a data frame that can be easily plotted with ‘ggplot2’; and
some debugging tools that echo the data received as input: stat_debug_group() ,
stat_debug_panel() and geom_debug() , plus geom_null() , which does not plot
its input.

7.15.1 Plotting time-series

Instead of creating a new statistic or geometry for plotting time series, we provide a
function that can be used to convert time-series objects into data frames suitable for
plotting with ‘ggplot2’. A single function, try_tibble() (also available as
try_data_frame() ), accepts time-series objects saved with different packages as
well as R’s native ts objects. The magic is done mainly by package ‘xts’, to which
we add a wrapper that returns a data frame. By default the time variable is named
time , and the variable with the observations takes the “name” of the data argument
passed; in the usual case of passing a time-series object, its name is used.
We exemplify this with some of the time-series data included in R. In the first
example we use the default format for time.
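Before plotting it can help to inspect the returned data frame (a hedged illustration; the column names follow the naming rule just described):

```r
# A hedged illustration: the first rows of the data frame that
# try_tibble() returns for a built-in "ts" object.
library(ggpmisc)
head(try_tibble(austres), 3)
```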

ggplot(try_tibble(austres), aes(time, austres)) +
  geom_line()

## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.

371
7 Extensions to ggplot

17000

16000
austres

15000

14000

13000
1975 1980 1985 1990
time

In the second example we use “decimal years” in numeric format for expressing
‘time’.

ggplot(try_tibble(lynx, as.numeric = TRUE),
       aes(x = time, y = lynx)) +
  geom_line()

## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.

(Figure: line plot of the lynx time series versus time in decimal years.)

Here we use dates rounded to the month.

ggplot(try_tibble(AirPassengers, "month"),
aes(time, AirPassengers)) +
geom_line()

## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.

(Figure: line plot of AirPassengers versus time, with dates rounded to the month.)

Multivariate time series are also supported.


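A hedged sketch of the multivariate case ( EuStockMarkets is a built-in multivariate "ts" ; the assumption is that try_tibble() returns one column per series plus a time column):

```r
# A hedged sketch: convert a multivariate "ts" and plot one of its
# series (here DAX) against time.
library(ggplot2)
library(ggpmisc)

ggplot(try_tibble(EuStockMarkets), aes(time, DAX)) +
  geom_line()
```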
Plotting can be automated even further for "ts" and "xts" objects with the
specialized ggplot() methods defined by package ‘ggpmisc’. The same parameters
as in try_tibble() are accepted.

ggplot(AirPassengers) +
  geom_line()

(Figure: line plot of AirPassengers versus time, using the ggplot() method for "ts".)

These methods default to using “decimal time” for time , as not all statistics (e.g.
those from package ‘ggseas’) work correctly with POSIXct . Passing FALSE as the
argument to as.numeric results in time being returned as a datetime variable,
which allows the use of ‘ggplot2’s time scales.

ggplot(AirPassengers, as.numeric = FALSE) +
  scale_x_datetime(date_breaks = "1 year", date_labels = "%Y") +
  geom_line()

(Figure: the same AirPassengers plot with yearly breaks on a datetime x axis.)

7.15.2 Peaks and valleys

Peaks and valleys are local (or global) maxima and minima. These stats return the
𝑥 and 𝑦 values at the peaks or valleys plus suitable labels, and default aesthetics
that make them easy to use with several different geoms, including geom_point() ,
geom_text() , geom_label() , geom_vline() , geom_hline() and geom_rug() , and
also with the geoms defined by package ‘ggrepel’. Some examples follow.

In many cases, for example in physics and chemistry, but also when plotting
time-series data, we need to automatically locate and label local maxima (peaks) or
local minima (valleys) in curves. The statistics presented here are useful only for
dense data, as they do not fit a peak function but instead simply search for the local
maxima or minima in the observed data. However, they allow flexible generation of
labels from both the 𝑥 and 𝑦 coordinates of each peak or valley.

We use as example the same time series as above. In the next several examples we
demonstrate some of this flexibility.
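The local search just described can be sketched in base R (a simplified illustration of the idea, an assumption about the algorithm rather than the actual implementation used by stat_peaks() ):

```r
# A hedged sketch: a value is flagged as a peak if it is the unique
# maximum within a centered window of `span` observations.
find_peaks <- function(x, span = 5) {
  half <- span %/% 2
  n <- length(x)
  vapply(seq_len(n), function(i) {
    window <- x[max(1, i - half):min(n, i + half)]
    x[i] == max(window) && sum(window == x[i]) == 1
  }, logical(1))
}
# indices of the first few peaks found in the lynx series
head(which(find_peaks(as.numeric(lynx), span = 11)))
```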

ggplot(lynx) + geom_line() +
stat_peaks(colour = "red") +
stat_peaks(geom = "text", colour = "red",
vjust = -0.5, x.label.fmt = "%4.0f") +
stat_valleys(colour = "blue") +
stat_valleys(geom = "text", colour = "blue",
vjust = 1.5, x.label.fmt = "%4.0f") +
ylim(-100, 7300)

(Figure: lynx series with peaks marked in red and valleys in blue, each labeled with its year.)

ggplot(lynx) + geom_line() +
stat_peaks(colour = "red") +
stat_peaks(geom = "rug", colour = "red") +
stat_peaks(geom = "text", colour = "red",
vjust = -0.5, x.label.fmt = "%4.0f") +
ylim(NA, 7300)

(Figure: lynx series with peaks marked and labeled with years, plus a red rug on the x axis.)

ggplot(lynx) + geom_line() +
stat_peaks(colour = "red") +
stat_peaks(geom = "rug", colour = "red") +
stat_valleys(colour = "blue") +
stat_valleys(geom = "rug", colour = "blue")

(Figure: lynx series with peak and valley rugs in red and blue.)

ggplot(lynx) + geom_line() +
stat_peaks(colour = "red") +
stat_peaks(geom = "rug", colour = "red") +
stat_peaks(geom = "text", colour = "red",
hjust = -0.1, label.fmt = "%4.0f",
angle = 90, size = 2.5,
aes(label = paste(..y.label..,
"skins in year", ..x.label..))) +
stat_valleys(colour = "blue") +
stat_valleys(geom = "rug", colour = "blue") +
stat_valleys(geom = "text", colour = "blue",
hjust = -0.1, label.fmt = "%4.0f",
angle = 90, size = 2.5,
aes(label = paste(..y.label..,
"skins in year", ..x.label..))) +
ylim(NA, 10000)
(Figure: lynx series with rotated text labels of the form “NNNN skins in year YYYY” at each peak and valley.)

Using POSIXct for time but supplying a format string, we can show only the month
corresponding to each peak or valley. Any format string accepted by strftime()
can be used.

ggplot(AirPassengers, as.numeric = FALSE) + geom_line() +
  stat_peaks(colour = "red", span = 9) +
  stat_peaks(geom = "text", span = 9, colour = "red", hjust = -0.5,
             angle = 90, x.label.fmt = "%b") +
  stat_valleys(colour = "blue", span = 9) +
  stat_valleys(geom = "text", span = 9, colour = "blue", hjust = 1.5,
               angle = 90, x.label.fmt = "%b") +
  scale_x_datetime(date_breaks = "1 year", date_labels = "%Y") +
  ylim(-50, 700)

(Figure: AirPassengers with peaks and valleys labeled by month name.)

Rotating the labels.

ggplot(lynx, as.numeric = FALSE) + geom_line() +
  stat_peaks(colour = "red") +
  stat_peaks(geom = "text", colour = "red", angle = 66,
             hjust = -0.1, x.label.fmt = "%Y") +
  ylim(NA, 7800)

(Figure: lynx series with year labels at the peaks rotated 66 degrees.)

Of course, if one finds use for it, the peaks and/or valleys can be plotted on their
own. Here we plot an “envelope” using geom_line() .

ggplot(AirPassengers) +
geom_line() +
stat_peaks(geom = "line", span = 9, linetype = "dashed") +
stat_valleys(geom = "line", span = 9, linetype = "dashed")

(Figure: AirPassengers with dashed lines joining the peaks and the valleys, forming an envelope.)

7.15.3 Equations as text or labels in plots

How to add a label with a polynomial equation, including the coefficient estimates
from a model fit, seems to be a frequently asked question in Stackoverflow. The
parameter estimates are extracted automatically from a fit object corresponding to
each group or panel in a plot, and the other aesthetics of the group are respected.
An aesthetic is provided for this label, and only for this. Such a statistic needs
to be used together with another geom or stat, like geom_smooth() , to add the
fitted line itself. A different approach, discussed in Stackoverflow, is to write a
statistic that both plots the polynomial and adds the equation label. Package
‘ggpmisc’ defines stat_poly_eq() using the first approach, which follows the
‘rule’ of using one function in the code for a single action. This has the drawback
that the user is responsible for ensuring that the model used for the fitted line and
for the label are the same, and in addition the same model is fitted twice to the data.
We first generate some artificial data.

set.seed(4321)
# generate artificial data
x <- 1:100
y <- (x + x^2 + x^3) +
rnorm(length(x), mean = 0, sd = mean(x^3) / 4)

my.data <- tibble(x, y,
                  group = rep(c("A", "B"), 50),
                  y2 = y * c(0.5, 2))

Linear models

This section shows examples of linear models with one independent variables, includ-
ing different polynomials. We first give an example using default arguments.

formula <- y ~ poly(x, 3, raw = TRUE)


ggplot(my.data, aes(x, y)) +
geom_point() +
geom_smooth(method = "lm", formula = formula) +
stat_poly_eq(formula = formula, parse = TRUE)



(Figure: scatterplot of y versus x with a cubic polynomial fit; stat_poly_eq() adds the label R² = 0.96.)

The default geometry used by the statistic is geom_text() , but it is possible to use
geom_label() instead when the intention is to have a colored background for the
label. The default background fill is white, but this can also be changed in the
usual way by mapping the fill aesthetic.

formula <- y ~ poly(x, 3, raw = TRUE)


ggplot(my.data, aes(x, y)) +
geom_point() +
geom_smooth(method = "lm", formula = formula) +
stat_poly_eq(geom = "label", formula = formula, parse = TRUE)

(Figure: the same plot with the R² label drawn by geom_label().)

It is also possible to create a semi-transparent text background by use of the
alpha aesthetic.

formula <- y ~ poly(x, 3, raw = TRUE)


ggplot(my.data, aes(x, y)) +
geom_point() +
geom_smooth(method = "lm", formula = formula) +
stat_poly_eq(geom = "label", alpha = 0.3, formula = formula, parse = TRUE)


(Figure: the same plot with a semi-transparent label background.)

geom_label_repel() accepts the same arguments as geom_label() for controlling
the format of the box and border. We give a simple example here; for other
examples see page ??.

formula <- y ~ poly(x, 3, raw = TRUE)

ggplot(my.data, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(geom = "label",
               label.size = NA,
               label.r = unit(0, "lines"),
               color = "white",
               fill = "grey10",
               formula = formula, parse = TRUE) +
  theme_bw()


(Figure: the same plot with a dark label box, white text and theme_bw().)

The remaining examples in this section use the default geom_text() but can be
modified to use geom_label() as shown above.

stat_poly_eq() makes available five different labels in the returned data frame:
𝑅², adjusted 𝑅², AIC, BIC and the polynomial equation. 𝑅² is used by default, but
aes() can be used to select a different one.

formula <- y ~ poly(x, 3, raw = TRUE)


ggplot(my.data, aes(x, y)) +
geom_point() +
geom_smooth(method = "lm", formula = formula) +
stat_poly_eq(aes(label = ..adj.rr.label..),
formula = formula, parse = TRUE)

(Figure: the same plot labeled with the adjusted R², R²adj = 0.96.)

formula <- y ~ poly(x, 3, raw = TRUE)


ggplot(my.data, aes(x, y)) +
geom_point() +
geom_smooth(method = "lm", formula = formula) +
stat_poly_eq(aes(label = ..AIC.label..),
formula = formula, parse = TRUE)



(Figure: the same plot labeled AIC = 2486.)

formula <- y ~ poly(x, 3, raw = TRUE)


ggplot(my.data, aes(x, y)) +
geom_point() +
geom_smooth(method = "lm", formula = formula) +
stat_poly_eq(aes(label = ..eq.label..),
formula = formula, parse = TRUE)

(Figure: the same plot labeled with the fitted equation y = −4840 + 1170 x − 23.1 x² + 1.14 x³.)

Within aes() it is possible to compute new labels based on those returned, plus
“arbitrary” text. The supplied labels are meant to be parsed into R expressions, so
any text added should be valid within a string that will be parsed. Here we need to
escape the quotation marks. See section 6.20 starting on page 291 for details on
parsing character strings into expressions.

formula <- y ~ poly(x, 3, raw = TRUE)


ggplot(my.data, aes(x, y)) +
geom_point() +
geom_smooth(method = "lm", formula = formula) +
stat_poly_eq(aes(label = paste(..eq.label..,
..adj.rr.label..,
sep = "*\",\"~~")),
formula = formula, parse = TRUE)


(Figure: the same plot labeled with both the equation and R²adj = 0.96.)

formula <- y ~ poly(x, 3, raw = TRUE)

ggplot(my.data, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(aes(label = paste("atop(", ..AIC.label.., ",",
                                 ..BIC.label.., ")",
                                 sep = "")),
               formula = formula, parse = TRUE)


(Figure: the same plot labeled with AIC = 2486 and BIC = 2499 stacked using atop().)

Two examples follow of removing or changing the lhs and/or the rhs of the
equation. (Be aware that the equals sign must always be enclosed in backticks in a
string that will be parsed.)

formula <- y ~ poly(x, 3, raw = TRUE)


ggplot(my.data, aes(x, y)) +
geom_point() +
geom_smooth(method = "lm", formula = formula) +
stat_poly_eq(aes(label = ..eq.label..),
eq.with.lhs = "italic(hat(y))~`=`~",
formula = formula, parse = TRUE)

(Figure: the equation label with ŷ as lhs: ŷ = −4840 + 1170 x − 23.1 x² + 1.14 x³.)

formula <- y ~ poly(x, 3, raw = TRUE)


ggplot(my.data, aes(x, y)) +
geom_point() +
geom_smooth(method = "lm", formula = formula) +
labs(x = expression(italic(z)), y = expression(italic(h)) ) +
stat_poly_eq(aes(label = ..eq.label..),
eq.with.lhs = "italic(h)~`=`~",
eq.x.rhs = "~italic(z)",
formula = formula, parse = TRUE)



(Figure: the equation written in terms of h and z, matching the renamed axes: h = −4840 + 1170 z − 23.1 z² + 1.14 z³.)

As any valid R expression can be used, Greek letters are also supported, as well as
the inclusion in the label of variable transformations used in the model formula.

formula <- y ~ poly(x, 2, raw = TRUE)

ggplot(my.data, aes(x, log10(y + 1e6))) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(aes(label = ..eq.label..),
               eq.with.lhs = "plain(log)[10](italic(y)+10^6)~`=`~",
               formula = formula, parse = TRUE)

(Figure: quadratic fit labeled log10(y + 10⁶) = 6.01 − 0.000922 x + 3.9 × 10⁻⁵ x².)

Example of a polynomial of fifth order.

opts_chunk$set(opts_fig_wide)

formula <- y ~ poly(x, 5, raw = TRUE)


ggplot(my.data, aes(x, y)) +
geom_point() +
geom_smooth(method = "lm", formula = formula) +
stat_poly_eq(aes(label = ..eq.label..),
formula = formula, parse = TRUE)


(Figure: fifth-order polynomial fit, in a wide figure, labeled y = 35000 − 5340 x + 186 x² + 0.33 x³ − 0.0317 x⁴ + 0.000255 x⁵.)

opts_chunk$set(opts_fig_narrow)

Intercept forced to zero—line through the origin.


formula <- y ~ x + I(x^2) + I(x^3) - 1


ggplot(my.data, aes(x, y)) +
geom_point() +
geom_smooth(method = "lm", formula = formula) +
stat_poly_eq(aes(label = ..eq.label..),
formula = formula, parse = TRUE)


(Figure: cubic fit through the origin labeled y = 813 x − 15.9 x² + 1.1 x³.)

We give some additional examples to demonstrate how other components of the
ggplot object affect the behaviour of this statistic.

Facets work as expected, either with fixed or free scales, although below we had
to adjust the size of the font used for the equation.

formula <- y ~ poly(x, 3, raw = TRUE)


ggplot(my.data, aes(x, y2)) +
geom_point() +
geom_smooth(method = "lm", formula = formula) +
stat_poly_eq(aes(label = ..eq.label..), # size = 2.8,
formula = formula, parse = TRUE) +
facet_wrap(~group)


(Figure: facets A and B, each panel labeled with its own fitted equation.)

Grouping, in this example using the colour aesthetic, also works as expected.

formula <- y ~ poly(x, 3, raw = TRUE)


ggplot(my.data, aes(x, y2, colour = group)) +
geom_point() +
geom_smooth(method = "lm", formula = formula) +
stat_poly_eq(aes(label = ..eq.label..),
formula = formula, parse = TRUE) +
theme_bw() +
theme(legend.position = "top")

(Figure: grouped plot with colour mapped to group; each group's equation label is drawn in its colour.)

Other types of models

Another statistic, stat_fit_glance() , allows lots of flexibility, but at the moment
there is no equivalently flexible version of stat_smooth() .
We give an example with a linear model, showing a P-value (a frequent request,
although one for which I do not find much use).


We use geom_debug() to find out what values stat_fit_glance() returns for our
linear model, and then add labels with P-values for the fits.

formula <- y ~ x + I(x^2) + I(x^3)


ggplot(my.data, aes(x, y2, colour = group)) +
geom_point() +
geom_smooth(method = "lm", formula = formula) +
stat_fit_glance(method.args = list(formula = formula),
geom = "debug",
summary.fun = print,
summary.fun.args = list()) +
theme_bw() +
theme(legend.position = "top")

## Input 'data' to 'geom_debug()':

## colour hjust vjust r.squared adj.r.squared


## 1 #F8766D 0 1.4 0.9619032 0.9594187
## 2 #00BFC4 0 2.8 0.9650270 0.9627461
## sigma statistic p.value df logLik
## 1 29045.57 387.1505 1.237801e-32 4 -582.6934
## 2 118993.86 423.0996 1.732868e-33 4 -653.2037
## AIC BIC deviance df.residual x
## 1 1175.387 1184.947 38807664340 46 1
## 2 1316.407 1325.968 651338752799 46 1
## y PANEL group
## 1 2154937 1 1
## 2 2154937 1 2
## colour hjust vjust r.squared adj.r.squared
## 1 #F8766D 0 1.4 0.9619032 0.9594187
## 2 #00BFC4 0 2.8 0.9650270 0.9627461
## sigma statistic p.value df logLik
## 1 29045.57 387.1505 1.237801e-32 4 -582.6934
## 2 118993.86 423.0996 1.732868e-33 4 -653.2037
## AIC BIC deviance df.residual x
## 1 1175.387 1184.947 38807664340 46 1
## 2 1316.407 1325.968 651338752799 46 1
## y PANEL group
## 1 2154937 1 1
## 2 2154937 1 2


(Figure: grouped scatterplot with cubic fits; no labels are drawn, as geom_debug() only echoes its input data.)

Using the information now at hand we create some labels.

formula <- y ~ x + I(x^2) + I(x^3)


ggplot(my.data, aes(x, y2, colour = group)) +
geom_point() +
geom_smooth(method = "lm", formula = formula) +
stat_fit_glance(aes(label = paste('italic(P)~`=`~', signif(..p.value.., 3), sep = "")),
parse = TRUE,
method.args = list(formula = formula),
geom = "text") +
theme_bw() +
theme(legend.position = "top")

(Figure: grouped plot with cubic fits, labeled P = 1.24e−32 and P = 1.73e−33.)

We use geom_debug() to find out what values stat_fit_glance() returns for our
resistant linear model fitted with rlm() from package ‘MASS’.

formula <- y ~ x + I(x^2) + I(x^3)

ggplot(my.data, aes(x, y2, colour = group)) +
  geom_point() +
  geom_smooth(method = "rlm", formula = formula) +
  stat_fit_glance(method.args = list(formula = formula),
                  geom = "debug",
                  method = "rlm",
                  summary.fun = print,
                  summary.fun.args = list()) +
  theme_bw() +
  theme(legend.position = "top")

## Input 'data' to 'geom_debug()':

## colour hjust vjust sigma converged


## 1 #F8766D 0 1.4 20078.62 TRUE
## 2 #00BFC4 0 2.8 126111.74 TRUE
## logLik AIC BIC deviance x
## 1 -582.8362 1175.672 1185.232 39029842201 1
## 2 -653.2392 1316.478 1326.039 652263183741 1
## y PANEL group
## 1 2154937 1 1
## 2 2154937 1 2
## colour hjust vjust sigma converged
## 1 #F8766D 0 1.4 20078.62 TRUE
## 2 #00BFC4 0 2.8 126111.74 TRUE
## logLik AIC BIC deviance x
## 1 -582.8362 1175.672 1185.232 39029842201 1
## 2 -653.2392 1316.478 1326.039 652263183741 1
## y PANEL group
## 1 2154937 1 1
## 2 2154937 1 2

(Figure: grouped plot with rlm() fits; geom_debug() echoes the data but draws nothing.)

Using the information now at hand we create some labels.

formula <- y ~ x + I(x^2) + I(x^3)

ggplot(my.data, aes(x, y2, colour = group)) +
  geom_point() +
  geom_smooth(method = "rlm", formula = formula) +
  stat_fit_glance(aes(label = paste('AIC~`=`~', signif(..AIC.., 3),
                                    "~~", 'BIC~`=`~', signif(..BIC.., 3),
                                    sep = "")),
                  parse = TRUE,
                  method = "rlm",
                  method.args = list(formula = formula),
                  geom = "text") +
  theme_bw() +
  theme(legend.position = "top")

(Figure: grouped rlm() fits labeled AIC = 1180, BIC = 1190 and AIC = 1320, BIC = 1330.)

In a similar way one can generate labels for any fit supported by package ‘broom’.

7.15.4 Highlighting deviations from fitted line

First an example using default arguments for stat_fit_deviations() .

formula <- y ~ poly(x, 3, raw = TRUE)


ggplot(my.data, aes(x, y)) +
geom_point() +
geom_smooth(method = "lm", formula = formula) +
stat_fit_deviations(formula = formula)



(Figure: cubic fit with the deviations from the fitted line drawn as segments.)

And setting some of the aesthetics to non-default values.

formula <- y ~ poly(x, 3, raw = TRUE)


ggplot(my.data, aes(x, y)) +
geom_smooth(method = "lm", formula = formula) +
stat_fit_deviations(formula = formula, color = "red",
arrow = arrow(length = unit(0.015, "npc"),
ends = "both")) +
geom_point()


(Figure: the deviations drawn in red with double-headed arrows.)

Grouping is respected. Here colour is mapped to the variable group .

formula <- y ~ poly(x, 3, raw = TRUE)


ggplot(my.data, aes(x, y, colour = group)) +
geom_smooth(method = "lm", formula = formula) +
stat_fit_deviations(formula = formula) +
geom_point()



(Figure: grouped fits, A and B, with deviations coloured by group.)

7.15.5 Plotting residuals from linear fit

We can plot the residuals by themselves with stat_fit_residuals() .

formula <- y ~ poly(x, 3, raw = TRUE)


ggplot(my.data, aes(x, y, colour = group)) +
geom_hline(yintercept = 0, linetype = "dashed") +
stat_fit_residuals(formula = formula)


[Figure: residuals from the polynomial fit plotted against x, with a dashed horizontal line at zero and colour mapped to group.]

7.15.6 Filtering observations based on local density

The statistic stat_dens2d_filter() works best with clouds of observations, so we generate some random data.


set.seed(1234)
nrow <- 200
my.2d.data <- tibble(
x = rnorm(nrow),
y = rnorm(nrow) + rep(c(-1, +1), rep(nrow / 2, 2)),
group = rep(c("A", "B"), rep(nrow / 2, 2))
)

In most recipes in this section we use stat_dens2d_filter() to highlight observations with the color aesthetic. Other aesthetics can also be used.

By default 1/10 of the observations are kept from regions of lowest density.

ggplot(my.2d.data, aes(x, y)) +
  geom_point() +
  stat_dens2d_filter(color = "red")


[Figure: scatterplot of the random data, with the 1/10 of observations from the sparsest regions highlighted in red.]

Here we change the fraction to 1/3.

ggplot(my.2d.data, aes(x, y)) +
  geom_point() +
  stat_dens2d_filter(color = "red",
                     keep.fraction = 1/3)



[Figure: the same scatterplot, with 1/3 of the observations highlighted in red.]

We can also set a maximum number of observations to keep.

ggplot(my.2d.data, aes(x, y)) +
  geom_point() +
  stat_dens2d_filter(color = "red",
                     keep.number = 3)


[Figure: the same scatterplot, with only the three observations from the sparsest regions highlighted in red.]

We can also keep the observations from the densest areas instead of from the sparsest.

ggplot(my.2d.data, aes(x, y)) +
  geom_point() +
  stat_dens2d_filter(color = "red",
                     keep.sparse = FALSE)



[Figure: the same scatterplot, with observations from the densest regions highlighted in red.]

ggplot(my.2d.data, aes(x, y)) +
  geom_point() +
  stat_dens2d_filter(color = "red",
                     keep.sparse = FALSE) +
  facet_grid(~group)

[Figure: the same plot faceted with facet_grid() into panels A and B, with observations from the densest regions of each panel highlighted in red.]

In addition to stat_dens2d_filter() there is stat_dens2d_filter_g(). The difference is that the first one computes the density on a plot-panel basis while the second one does it on a group basis. This makes a difference only when observations are grouped based on another aesthetic within each panel.

ggplot(my.2d.data, aes(x, y, color = group)) +
  geom_point() +
  stat_dens2d_filter(shape = 1, size = 3)




[Figure: scatterplot with colour mapped to group; observations from the sparsest regions of the whole panel are marked with larger open circles.]

ggplot(my.2d.data, aes(x, y, color = group)) +
  geom_point() +
  stat_dens2d_filter_g(shape = 1, size = 3)



[Figure: the same plot using stat_dens2d_filter_g(); observations from the sparsest regions of each group are marked with larger open circles.]

A related statistic, stat_dens2d_labels(), also defined in package ‘ggpmisc’, is described in section 7.16.2 on page 407.

7.15.7 Learning and/or debugging

A very simple statistic named stat_debug_group() can save the work of adding print statements to the code of statistics to find out what data are being passed to the compute_group() function. Because the code of this function is stored in a ggproto object, at the moment it is impossible to directly set breakpoints in it. This statistic may also help users diagnose problems with the mapping of aesthetics in their code, or just get a better idea of how the internals of ‘ggplot2’ work.


ggplot(lynx) + geom_line() +
stat_debug_group()

## [1] "Input 'data' to 'compute_group()':"


## # A tibble: 114 × 4
## x y PANEL group
## * <dbl> <dbl> <int> <int>
## 1 1821 269 1 -1
## 2 1822 321 1 -1
## 3 1823 585 1 -1
## 4 1824 871 1 -1
## 5 1825 1475 1 -1
## 6 1826 2821 1 -1
## 7 1827 3928 1 -1
## 8 1828 5943 1 -1
## 9 1829 4950 1 -1
## 10 1830 2577 1 -1
## # ... with 104 more rows

[Figure: line plot of the lynx time series, 1820-1920.]

ggplot(lynx,
aes(time, lynx,
color = ifelse(time >= 1900, "XX", "XIX"))) +
geom_line() +
stat_debug_group() +
labs(color = "century")

## [1] "Input 'data' to 'compute_group()':"


## # A tibble: 79 × 5
## x y colour PANEL group
## * <dbl> <dbl> <chr> <int> <int>
## 1 1821 269 XIX 1 1
## 2 1822 321 XIX 1 1
## 3 1823 585 XIX 1 1
## 4 1824 871 XIX 1 1


## 5 1825 1475 XIX 1 1


## 6 1826 2821 XIX 1 1
## 7 1827 3928 XIX 1 1
## 8 1828 5943 XIX 1 1
## 9 1829 4950 XIX 1 1
## 10 1830 2577 XIX 1 1
## # ... with 69 more rows
## [1] "Input 'data' to 'compute_group()':"
## # A tibble: 35 × 5
## x y colour PANEL group
## * <dbl> <dbl> <chr> <int> <int>
## 1 1900 387 XX 1 2
## 2 1901 758 XX 1 2
## 3 1902 1307 XX 1 2
## 4 1903 3465 XX 1 2
## 5 1904 6991 XX 1 2
## 6 1905 6313 XX 1 2
## 7 1906 3794 XX 1 2
## 8 1907 1836 XX 1 2
## 9 1908 345 XX 1 2
## 10 1909 382 XX 1 2
## # ... with 25 more rows

[Figure: line plot of the lynx time series, with colour mapped to century (XIX, XX).]

By means of geom_debug() it is possible to “print” to the console the data returned by a ggplot statistic.

ggplot(mpg, aes(class, hwy, color = class)) +
  geom_point(alpha = 0.2) +
  stat_summary(fun.data = mean_se, size = 0.6)


[Figure: hwy versus class for the mpg data, with semi-transparent observations and means with standard-error bars, colour mapped to class.]

ggplot(mpg, aes(class, hwy, color = class)) +
  geom_debug() +
  stat_summary(fun.data = mean_se,
               geom = "debug", summary.fun = as_tibble, summary.fun.args = list())

## Input 'data' to 'geom_debug()':

## # A tibble: 234 × 5
## colour x y PANEL group
## <chr> <int> <dbl> <int> <int>
## 1 #C49A00 2 29 1 2
## 2 #C49A00 2 29 1 2
## 3 #C49A00 2 31 1 2
## 4 #C49A00 2 30 1 2
## 5 #C49A00 2 26 1 2
## 6 #C49A00 2 26 1 2
## 7 #C49A00 2 27 1 2
## 8 #C49A00 2 26 1 2
## 9 #C49A00 2 25 1 2
## 10 #C49A00 2 28 1 2
## # ... with 224 more rows

## Input 'data' to 'geom_debug()':

## # A tibble: 7 × 7
## colour x group y ymin ymax
## <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 #F8766D 1 1 24.80000 24.21690 25.38310
## 2 #C49A00 2 2 28.29787 27.74627 28.84948
## 3 #53B400 3 3 27.29268 26.95911 27.62626
## 4 #00C094 4 4 22.36364 21.74172 22.98555
## 5 #00B6EB 5 5 16.87879 16.48289 17.27469
## 6 #A58AFF 6 6 28.14286 27.23431 29.05140
## 7 #FB61D7 7 7 18.12903 17.75083 18.50724
## # ... with 1 more variables: PANEL <int>


[Figure: plot frame for hwy versus class; the "debug" geoms add no graphical output.]

7.16 ‘ggrepel’

citation(package = "ggrepel")

##
## To cite package 'ggrepel' in publications
## use:
##
## Kamil Slowikowski (2016). ggrepel:
## Repulsive Text and Label Geoms for
## 'ggplot2'. R package version 0.6.5.
## https://CRAN.R-project.org/package=ggrepel
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {ggrepel: Repulsive Text and Label Geoms for 'ggplot2'},
## author = {Kamil Slowikowski},
## year = {2016},
## note = {R package version 0.6.5},
## url = {https://CRAN.R-project.org/package=ggrepel},
## }

Package ‘ggrepel’ is under development by Kamil Slowikowski. It does a single thing: it relocates text labels so that they do not overlap. This is achieved through two geometries that work similarly to those provided by ‘ggplot2’, except for the relocation. This is incredibly useful both when labeling peaks and valleys and when labeling points in scatter plots. Overlapping labels are a significant problem in bioinformatics plots and in maps.


7.16.1 New geoms

Package ‘ggrepel’ provides two new geoms: geom_text_repel() and geom_label_repel(). They are used similarly to geom_text() and geom_label(), but the text or labels “repel” each other so that they rarely overlap unless the plot is very crowded. The vignette ggrepel Usage Examples provides very nice examples of the power and flexibility of these geoms. The algorithm used for avoiding overlaps through repulsion is iterative, and can be slow when the number of labels or observations is in the thousands.

I reproduce here some simple examples from the ‘ggrepel’ vignette.

opts_chunk$set(opts_fig_wide_square)

Just using defaults, we avoid overlaps among text items on the plot.
geom_text_repel() has some parameters matching those in geom_text() , but those
related to manual positioning are missing except for angle . Several new parameters
control both the appearance of text and the function of the repulsion algorithm.

ggplot(mtcars, aes(wt, mpg)) +
  geom_point(color = 'red') +
  geom_text_repel(aes(label = rownames(mtcars)))


[Figure: mpg versus wt for mtcars, with red points and repelled car-model text labels.]

The chunk below shows how to change the appearance of labels. geom_label_repel() is comparable to geom_label(), but with repulsion.

set.seed(42)
ggplot(mtcars) +
geom_point(aes(wt, mpg), size = 5, color = 'grey') +
geom_label_repel(
aes(wt, mpg, fill = factor(cyl), label = rownames(mtcars)),
fontface = 'bold', color = 'white',
box.padding = unit(0.25, "lines"),
point.padding = unit(0.5, "lines")) +
theme(legend.position = "top")


[Figure: mpg versus wt for mtcars, with repelled labels filled by factor(cyl) and bold white text; legend at top.]

As with geom_label(), we can change the width of the border line, or remove it completely as in the example below, by means of an argument passed through parameter label.size, which defaults to 0.25. Although 0 as argument still results in a thin border line, NA removes it altogether.

set.seed(42)
ggplot(mtcars) +
geom_point(aes(wt, mpg), size = 5, color = 'grey') +
geom_label_repel(
aes(wt, mpg, fill = factor(cyl), label = rownames(mtcars)),
fontface = 'bold', color = 'white',
box.padding = unit(0.25, "lines"),
point.padding = unit(0.5, "lines"),
label.size = NA) +
theme(legend.position = "top")


[Figure: the same plot as above, but with the labels' border lines removed (label.size = NA).]

The parameters nudge_x and nudge_y allow strengthening or weakening the repul-
sion force, or favouring a certain direction. We also need to expand the x-axis high
limit to make space for the labels.

opts_chunk$set(opts_fig_wide)

set.seed(42)
ggplot(Orange, aes(age, circumference, color = Tree)) +
geom_line() +
expand_limits(x = max(Orange$age) * 1.1) +
geom_text_repel(data = subset(Orange, age == max(age)),
aes(label = paste("Tree", Tree)),
size = 5,
nudge_x = 65,
segment.color = NA) +
theme(legend.position = "none") +
labs(x = "Age (days)", y = "Circumference (mm)")


[Figure: circumference (mm) versus age (days) for the Orange data, one line per tree, each labelled "Tree n" at its right end.]

We can combine stat_peaks() from package ‘ggpmisc’ with the use of repulsive
text to avoid overlaps between text items. We use nudge_y = 500 to push the text
upwards.

ggplot(lynx) +
geom_line() +
stat_peaks(geom = "text_repel", nudge_y = 500)

[Figure: line plot of the lynx time series with repelled text labels giving the year of each peak.]

7.16.2 Selectively plotting repulsive labels

To repel text or labels so that they do not overlap unlabelled observations, one can set the labels to an empty character string "". Setting labels to NA skips the observation completely, as is the usual behavior in ‘ggplot2’ geoms, and can result in text or labels overlapping those observations. Labels can be set manually to "", but in those cases where all observations have labels in the data, and we would like to plot only those in low-density regions, this can be automated. Geoms


geom_text_repel() and geom_label_repel() from package ‘ggrepel’ can be used together with stat_dens2d_labels() from package ‘ggpmisc’.
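The difference between blank and missing labels is easy to check directly. The sketch below is not from the book; the vector names are made up for illustration.

```r
# Hypothetical example: blank vs. missing labels.
# "" keeps the observation as an obstacle for the repulsion layout but
# draws no text; NA makes the repulsive geoms skip the observation.
lab <- c("alpha", "beta", "gamma", "delta")
in.sparse <- c(TRUE, FALSE, TRUE, FALSE)
lab.blank <- ifelse(in.sparse, lab, "") # c("alpha", "", "gamma", "")
lab.na <- ifelse(in.sparse, lab, NA)    # c("alpha", NA, "gamma", NA)
```

Mapping label = lab.blank in aes() then keeps the blanked observations in the layout, while label = lab.na drops them, possibly resulting in overlaps.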
To demonstrate this we first generate suitable data and labels.
# Make random labels
random_string <- function(len = 6) {
paste(sample(letters, len, replace = TRUE), collapse = "")
}

# Make random data.


set.seed(1001)
myl.data <- tibble(
x = rnorm(100),
y = rnorm(100),
group = rep(c("A", "B"), c(50, 50)),
lab = replicate(100, { random_string() })
)
head(myl.data)

## # A tibble: 6 × 4
## x y group lab
## <dbl> <dbl> <chr> <chr>
## 1 2.1886481 0.07862339 A emhufi
## 2 -0.1775473 -0.98708727 A yrvrlo
## 3 -0.1852753 -1.17523226 A wrpfpp
## 4 -2.5065362 1.68140888 A ogrqsc
## 5 -0.5573113 0.75623228 A wfxezk
## 6 -0.1435595 0.30309733 A zjccnn

The first example uses defaults.


ggplot(data = myl.data, aes(x, y, label = lab, color = group)) +
  geom_point() +
  stat_dens2d_labels(geom = "text_repel")

[Figure: scatterplot of the random data with colour mapped to group; repelled text labels are drawn only for observations in the sparsest regions.]


Both the fraction of observations labelled and the maximum number of labels can be set through parameters, as shown in section 7.15.6 on page 394.
Something to be aware of when rotating labels is that repulsion is always based on a bounding box that does not rotate. For long labels and angles that are not multiples of 90 degrees, this reserves too much space and leaves gaps between segments and text. Compare the next two figures.

ggplot(data = myl.data, aes(x, y, label = lab, color = group)) +
  geom_point() +
  stat_dens2d_labels(geom = "text_repel", angle = 90)

[Figure: the same plot with the text labels rotated 90 degrees.]

ggplot(data = myl.data, aes(x, y, label = lab, color = group)) +
  geom_point() +
  stat_dens2d_labels(geom = "text_repel", angle = 45)



[Figure: the same plot with the text labels rotated 45 degrees; the non-rotating bounding boxes leave visible gaps between segments and text.]

Labels cannot be rotated.


ggplot(data = myl.data, aes(x, y, label = lab, color = group)) +
  geom_point() +
  stat_dens2d_labels(geom = "label_repel")



[Figure: the same plot using geom "label_repel"; boxed labels are drawn, horizontally, for observations in the sparsest regions.]

7.17 ‘tidyquant’

citation(package = "tidyquant")

##
## To cite package 'tidyquant' in publications
## use:
##
## Matt Dancho and Davis Vaughan (2017).
## tidyquant: Tidy Quantitative Financial
## Analysis. R package version 0.5.0.
## https://CRAN.R-project.org/package=tidyquant
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {tidyquant: Tidy Quantitative Financial Analysis},
## author = {Matt Dancho and Davis Vaughan},
## year = {2017},
## note = {R package version 0.5.0},
## url = {https://CRAN.R-project.org/package=tidyquant},
## }

The focus of this extension to ‘ggplot2’ is the conversion of time series data into tidy tibbles. It also defines additional geometries for plotting moving averages with ‘ggplot2’: package ‘tidyquant’ defines six geometries, and several mutators for time series stored as tibbles are also exported. Furthermore, it integrates with packages used for the analysis of financial time series: ‘xts’, ‘zoo’, ‘quantmod’, and ‘TTR’. Financial analysis falls outside the scope of this book, so we give no examples of the use of this package.

7.18 ‘ggseas’

citation(package = "ggseas")

##
## To cite package 'ggseas' in publications
## use:
##
## Peter Ellis (2016). ggseas: 'stats' for
## Seasonal Adjustment on the Fly with
## 'ggplot2'. R package version 0.5.1.
## https://CRAN.R-project.org/package=ggseas
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {ggseas: 'stats' for Seasonal Adjustment on the Fly with 'ggplot2'},
## author = {Peter Ellis},
## year = {2016},
## note = {R package version 0.5.1},
## url = {https://CRAN.R-project.org/package=ggseas},
## }

The focus of this extension to ‘ggplot2’ is the seasonal decomposition of time series done on the fly while creating a ggplot. Package ‘ggseas’ defines five statistics: stat_index(), stat_decomp(), stat_rollapplyr(), stat_stl(), and stat_seas(). By default they all use geom_line(). This package also defines the function tsdf(), which converts time series into data frames that can be passed as the data argument to ggplot().

The index is referenced to the first two observations in the series. Here we use the ggplot() method for class "ts" from our package ‘ggpmisc’. Functions try_tibble() from ‘ggpmisc’ and tsdf() from ‘ggseas’ can also be used.

ggplot(lynx) +
stat_index(index.ref = 1:2) +
expand_limits(y = 0)


[Figure: index of the lynx time series, referenced to the first two observations.]
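As a rough base-R sketch of what such an index is (my reading of what stat_index() computes, not code from ‘ggseas’): the series is rescaled so that the mean over the reference observations becomes 100.

```r
# Hypothetical sketch of an index series: values are rescaled so that
# the mean over the reference observations (here the first two) is 100.
index_series <- function(x, ref = 1:2) {
  100 * x / mean(x[ref])
}

index_series(c(10, 30, 40, 80))
## [1]  50 150 200 400
```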

ggplot(AirPassengers) +
stat_index(index.ref = 1:10) +
expand_limits(y = 0)

[Figure: index of the AirPassengers time series, referenced to the first ten observations.]

Rolling average.

We use a width of 9, which seems to be approximately the length of the cycle.

ggplot(lynx) +
geom_line() +
stat_rollapplyr(width = 9, align = "center", color = "blue") +
expand_limits(y = 0)

## Warning: Removed 8 rows containing missing values (geom_path).


[Figure: the lynx series with a centered 9-year rolling average overplotted in blue.]
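The centered rolling mean added above can be sketched in base R with stats::filter(); this is a generic illustration, not the implementation used by stat_rollapplyr().

```r
# Centered moving average of width 3 over a toy series; positions
# without a complete window (the first and last here) become NA.
x <- c(1, 2, 3, 4, 5)
as.numeric(stats::filter(x, rep(1/3, 3), sides = 2))
## [1] NA  2  3  4 NA
```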

For monthly data on air travel, it is clear that a width of 12 observations (months)
is best.

ggplot(AirPassengers) +
geom_line() +
stat_rollapplyr(width = 12, align = "center", color = "blue") +
expand_limits(y = 0)

## Warning: Removed 11 rows containing missing values (geom_path).

[Figure: the AirPassengers series with a centered 12-month rolling average overplotted in blue.]

Seasonal decomposition.

ggplot(AirPassengers) +
geom_line() +
stat_seas(colour = "blue") +
stat_stl(s.window = 7, color = "red") +
expand_limits(y = 0)

## Calculating starting date of 1949 from the data.


## Calculating frequency of 12 from the data.
## Calculating frequency of 12 from the data.

[Figure: the AirPassengers series with the seasonally adjusted series from stat_seas() in blue and from stat_stl() in red.]

Using function tsdf() from package ‘ggseas’.

ggplot(tsdf(AirPassengers),
aes(x, y)) +
geom_line() +
stat_seas(colour = "blue") +
stat_stl(s.window = 7, color = "red") +
expand_limits(y = 0)

## Calculating starting date of 1949 from the data.


## Calculating frequency of 12 from the data.
## Calculating frequency of 12 from the data.

[Figure: the same plot built from the data frame returned by tsdf(), with x and y mapped explicitly.]

7.19 ‘ggsci’


citation(package = "ggsci")

##
## To cite package 'ggsci' in publications use:
##
## Nan Xiao and Miaozhu Li (2017). ggsci:
## Scientific Journal and Sci-Fi Themed Color
## Palettes for 'ggplot2'. R package version
## 2.4.
## https://CRAN.R-project.org/package=ggsci
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {ggsci: Scientific Journal and Sci-Fi Themed Color Palettes for
## 'ggplot2'},
## author = {Nan Xiao and Miaozhu Li},
## year = {2017},
## note = {R package version 2.4},
## url = {https://CRAN.R-project.org/package=ggsci},
## }
##
## ATTENTION: This citation information has
## been auto-generated from the package
## DESCRIPTION file and may need manual
## editing, see 'help("citation")'.

I list here package ‘ggsci’ as it provides several color palettes (and color maps) that some users may like or find useful. They attempt to reproduce those used by several publications, films, etc. Although visually attractive, several of them are not safe, in the sense discussed in section 7.5 on page 339. For each palette, the package exports a corresponding scale for use with package ‘ggplot2’.

Here is one example, using package ‘pals’, to test if it is “safe”.

pal.safe(pal_uchicago(), n = 9)


[Figure: pal.safe() output for pal_uchicago(), showing the original palette and simulations for black/white, deutan, protan and tritan vision.]

A few of the discrete palettes as bands, setting n to 8, which is the largest value supported by the smallest of these palettes.

pal.bands(pal_npg(),
pal_aaas(),
pal_nejm(),
pal_lancet(),
pal_igv(),
pal_simpsons(),
n = 8)

[Figure: colour bands for pal_npg(), pal_aaas(), pal_nejm(), pal_lancet(), pal_igv() and pal_simpsons(), eight colours each.]

And a plot using a palette mimicking the one used by Nature Publishing Group
(NPG).

ggplot(data = Orange,
aes(x = age, y = circumference, color = Tree)) +
geom_line() +
scale_color_npg() +
theme_classic()


[Figure: circumference versus age for the Orange data, one line per tree, using the NPG colour palette and theme_classic().]

7.20 ‘ggthemes’

citation(package = "ggthemes")

##
## To cite package 'ggthemes' in publications
## use:
##
## Jeffrey B. Arnold (2017). ggthemes: Extra
## Themes, Scales and Geoms for 'ggplot2'. R
## package version 3.4.0.
## https://CRAN.R-project.org/package=ggthemes
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {ggthemes: Extra Themes, Scales and Geoms for 'ggplot2'},
## author = {Jeffrey B. Arnold},
## year = {2017},
## note = {R package version 3.4.0},
## url = {https://CRAN.R-project.org/package=ggthemes},
## }

Package ‘ggthemes’, as one can infer from its name, provides definitions of several themes for use with package ‘ggplot2’. They vary from formal to informal graphic designs, mostly attempting to follow the recommendations and examples of designers like Tufte (Tufte 1983), or to reproduce designs used by well-known publications or the default output of some frequently used computer programs.
We first save one of the plots used earlier as an example, and later print it using different themes.


p05 <- ggplot(data = Orange,
              aes(x = age, y = circumference, color = Tree)) +
  geom_line()

A theme_tufte() obeying Tufte’s recommendation of maximizing the information-to-ink ratio.

p05 + theme_tufte()

[Figure: the saved plot rendered with theme_tufte().]

A theme_economist() like The Economist.

p05 + theme_economist()

[Figure: the saved plot rendered with theme_economist().]

A theme_gdocs() like Google docs.

p05 + theme_gdocs()


[Figure: the saved plot rendered with theme_gdocs().]

7.21 ‘ggtern’

citation(package = "ggtern")

##
## To cite package 'ggtern' in publications
## use:
##
## Nicholas Hamilton (2016). ggtern: An
## Extension to 'ggplot2', for the Creation
## of Ternary Diagrams. R package version
## 2.2.0.
## https://CRAN.R-project.org/package=ggtern
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {ggtern: An Extension to 'ggplot2', for the Creation of Ternary Diagrams},
## author = {Nicholas Hamilton},
## year = {2016},
## note = {R package version 2.2.0},
## url = {https://CRAN.R-project.org/package=ggtern},
## }

Package ‘ggtern’ provides facilities for making ternary plots, frequently used in soil
science and in geology, and in sensory physiology and color science for representing
trichromic vision (red-green-blue for humans). They are based on a special system of
coordinates with three axes on a single plane.
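The three ternary coordinates of an observation are proportions that sum to one, which is why three axes fit on a single plane. A small sketch of this normalization (an illustration, not code from ‘ggtern’):

```r
# Hypothetical sketch: the three components of an observation are
# rescaled to proportions summing to one before plotting on the triangle.
rgb.triple <- c(R = 255, G = 165, B = 0) # the colour "orange"
proportions <- rgb.triple / sum(rgb.triple)
round(proportions, 3) # R ~0.607, G ~0.393, B 0
```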


Package ‘ggtern’ redefines some functions exported by ‘ggplot2’ and currently conflicts easily with other extensions to ‘ggplot2’. One would rarely want to use functions from this and other packages extending ‘ggplot2’ in the same figure, but using them in the same document could be necessary. In such cases one may need to call the original definitions explicitly, for example ggplot2::ggplot() instead of simply ggplot(), which after loading ‘ggtern’ no longer refers to the original definition. Because of these problems we load this package here, near the end of the chapter.

library(ggtern)

## --
## Consider donating at: http://ggtern.com
## Even small amounts (say $10-50) are very much appreciated!
## Remember to cite, run citation(package = 'ggtern') for further info.
## --
##
## Attaching package: 'ggtern'
## The following objects are masked from 'package:ggplot2':
##
## %+%, aes, annotate, calc_element,
## ggplot_build, ggplot_gtable, ggplotGrob,
## ggsave, layer_data, theme, theme_bw,
## theme_classic, theme_dark, theme_gray,
## theme_light, theme_linedraw,
## theme_minimal, theme_void

In this example of the use of ggtern() , we use colors pre-defined in R and make a
ternary plot of the red, green and blue components of these colors.

colours <- c("red", "green", "yellow", "white",
             "orange", "purple", "seagreen", "pink")
rgb.values <- col2rgb(colours)
color.data <- data.frame(colour = colours,
                         R = rgb.values[1, ],
                         G = rgb.values[2, ],
                         B = rgb.values[3, ])
ggtern(data = color.data,
       aes(x = R, y = G, z = B, label = colour, fill = colour)) +
  geom_point(shape = 23, size = 3) +
  geom_text(hjust = -0.2) +
  labs(x = "R", y = "G", z = "B") + scale_fill_identity() +
  theme_nomask()


[Figure: ternary plot of the R, G and B components of the selected colours, with diamond-shaped points and colour-name labels on triangular axes R, G and B.]

In the example above we need to use theme_nomask() to avoid clipping of symbols drawn on the edges of the triangular plotting area.

Test how the plot changes if you remove ‘ + theme_nomask() ’ from the code chunk above.

7.22 Other extensions to ‘ggplot2’

In this section I list some specialized or very recently released extensions to ‘ggplot2’ (Table 7.1). The table below will hopefully tempt you to explore those suitable for the data analysis tasks you deal with. There is a package under development, already released through CRAN, called ‘ggvis’. This package is not an extension to ‘ggplot2’, but instead a new implementation of the grammar of graphics, with a focus on the creation of interactive plots.


Table 7.1: Additional packages extending ‘ggplot2’ whose use is not described in this book.
All these packages are available at CRAN.

Package Title

‘ggspectra’ Extensions … for Radiation Spectra


‘ggspatial’ Spatial data framework …
‘ggsignif’ Significance Bars …
‘ggsn’ North Symbols and Scale Bars for Maps …
‘ggmosaic’ Mosaic Plots …
‘ggimage’ Use image [map image to shape aesthetic]
‘cowplot’ Streamlined Plot Theme and Plot Annotations …
‘hrbrthemes’ Additional Themes, Theme Components and Utilities …
‘ggedit’ Interactive …Layer and Theme Aesthetic Editor
‘ggparallel’ … Parallel Coordinate Plots for Categorical Data
‘ggraph’ … Grammar of Graphics for Graphs and Networks
‘gglogo’ Geom for Logo Sequence Plots
‘ggiraph’ Make …Graphics Interactive
‘ggiraphExtra’ Make Interactive …

7.23 Extended examples

7.23.1 Anscombe’s example revisited

To make the example self-contained we repeat the code from chapter 6, page 316.

# we rearrange the data
my.mat <- matrix(as.matrix(anscombe), ncol = 2)
my.anscombe <- tibble(x = my.mat[ , 1],
                      y = my.mat[ , 2],
                      case = factor(rep(1:4, rep(11, 4))))

ggplot(my.anscombe, aes(x = x, y = y)) +
  geom_point(shape = 21, fill = "orange", size = 3) +
  geom_smooth(method = "lm") +
  stat_poly_eq(formula = y ~ x, parse = TRUE,
               aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~~"))) +
  facet_wrap(~case, ncol = 2) +
  theme_bw(16)


[Figure: the four Anscombe data sets as panels 1-4, each with its linear fit and the label "y = 3 + 0.5 x, R² = 0.67".]

7.23.2 Heatmaps

7.23.3 Volcano plots

7.23.4 Quadrat plots

try(detach(package:ggfortify))
try(detach(package:MASS))
try(detach(package:xts))
try(detach(package:ggthemes))
try(detach(package:ggsci))
#try(detach(package:ggradar))
try(detach(package:geomnet))
try(detach(package:ggnetwork))
try(detach(package:ggExtra))
try(detach(package:ggalt))
try(detach(package:ggbiplot))
try(detach(package:ggstance))
try(detach(package:gganimate))
try(detach(package:ggseas))


try(detach(package:ggpmisc))
try(detach(package:ggforce))
try(detach(package:ggrepel))
try(detach(package:pals))
try(detach(package:viridis))
try(detach(package:showtext))
try(detach(package:ggplot2))
try(detach(package:tibble))

8 Plotting maps and images

“A labyrinth is a symbolic journey…but it is a map we can really walk on, blurring the difference between map and world.”

— Rebecca Solnit (2001) Wanderlust: A History of Walking. Penguin Books.

8.1 Aims of this chapter

Once again, plotting maps and bitmaps is anything but trivial. Plotting maps usually involves downloading the map information and applying a certain projection to create a suitable map on a flat surface. Of course, it is very common to plot other data on top, ranging from annotations of place names to miniature bar plots, histograms, etc., or filling different regions or countries with different colours. In the first half of the chapter we describe not only plotting of maps using the grammar of graphics, but also how to download map images and shape files from service providers like Google and from repositories.
In the second half of the chapter we describe how to load, write and manipulate raster images in R. R is not designed to work efficiently with bitmap images as data. We describe a couple of packages that attempt to overcome this limitation.

8.2 ‘ggmap’

library(ggplot2)
library(ggmap)

##
## Attaching package: 'ggmap'
## The following object is masked from 'package:magrittr':
##
## inset

library(rgdal)

## Loading required package: sp


## rgdal: version: 1.2-6, (SVN revision 651)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 2.0.1, released 2015/09/15


## Path to GDAL shared files: C:/Users/aphalo/Documents/R/win-library/3.3/rgdal/gdal


## Loaded PROJ.4 runtime: Rel. 4.9.2, 08 September 2015, [PJ_VERSION: 492]
## Path to PROJ.4 shared files: C:/Users/aphalo/Documents/R/win-library/3.3/rgdal/proj
## Linking to sp version: 1.2-4

library(scatterpie)

##
## Attaching package: 'scatterpie'
## The following object is masked from 'package:sp':
##
## recenter

library(imager)

##
## Attaching package: 'imager'
## The following object is masked from 'package:sp':
##
## bbox
## The following object is masked from 'package:grid':
##
## depth
## The following object is masked from 'package:plyr':
##
## liply
## The following object is masked from 'package:hexbin':
##
## erode
## The following object is masked from 'package:tidyr':
##
## fill
## The following object is masked from 'package:magrittr':
##
## add
## The following object is masked from 'package:stringr':
##
## boundary
## The following objects are masked from 'package:stats':
##
## convolve, spectrum
## The following object is masked from 'package:graphics':
##
## frame
## The following object is masked from 'package:base':
##
## save.image


citation(package = "ggmap")

##
## To cite ggmap in publications, please use:
##
## D. Kahle and H. Wickham. ggmap: Spatial
## Visualization with ggplot2. The R Journal,
## 5(1), 144-161. URL
## http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf
##
## A BibTeX entry for LaTeX users is
##
## @Article{,
## author = {David Kahle and Hadley Wickham},
## title = {ggmap: Spatial Visualization with ggplot2},
## journal = {The R Journal},
## year = {2013},
## volume = {5},
## number = {1},
## pages = {144--161},
## url = {http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf},
## }

citation(package = "rgdal")

##
## To cite package 'rgdal' in publications use:
##
## Roger Bivand, Tim Keitt and Barry
## Rowlingson (2017). rgdal: Bindings for the
## Geospatial Data Abstraction Library. R
## package version 1.2-6.
## https://CRAN.R-project.org/package=rgdal
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {rgdal: Bindings for the Geospatial Data Abstraction Library},
## author = {Roger Bivand and Tim Keitt and Barry Rowlingson},
## year = {2017},
## note = {R package version 1.2-6},
## url = {https://CRAN.R-project.org/package=rgdal},
## }

Package ‘ggmap’ is an extension to package ‘ggplot2’ for plotting and retrieving map data. Package ‘ggmap’ makes it possible to plot data using normal ‘ggplot2’ syntax on top of a map. Maps can be easily retrieved from the internet through different services. Some of these services require the user to register and obtain a key for access. As Google Maps does not require such a key for normal-resolution maps, we use this service in the examples.

8.2.1 Google maps

The first step is to fetch the desired map. One can fetch a map based on any valid Google Maps search term, or by giving the coordinates of the center of the map. Although zoom defaults to "auto", the best result is frequently obtained by providing this argument explicitly. Valid values for zoom are integers in the range 1 to 20.

We will fetch maps from Google Maps. We have disabled the messages, to avoid repeated messages about Google’s terms of use.

Google Maps API Terms of Service: http://developers.google.com/maps/terms

Information from URL: http://maps.googleapis.com/maps/api/geocode/json?address=Europe&sensor=false

Map from URL: http://maps.googleapis.com/maps/api/staticmap?center=Europe&zoom=3&size=%20640x640&scale=%202&maptype=terrain&sensor=false

We start by using get_map() to fetch, and ggmap() to plot, a map of Europe of type satellite . We use the default extent panel , and also the extents device and normal . The normal plot includes axes showing the coordinates, device does not show them, and panel shows axes but fits the map tightly into the drawing area:

Europe1 <- get_map("Europe", zoom = 3, maptype = "satellite")
ggmap(Europe1)

[Figure: satellite map of Europe, default extent, with lon and lat axes.]

ggmap(Europe1, extent = "device")

## Warning: `panel.margin` is deprecated. Please use `panel.spacing` property instead


ggmap(Europe1, extent = "normal")

[Figure: map of Europe with extent = "normal"; the lon and lat axes extend beyond the map.]

To demonstrate the option to fetch a map in black and white instead of the default
colour version, we use a map of Europe of type terrain .
Europe2 <- get_map("Europe", zoom = 3, maptype = "terrain")
ggmap(Europe2)

[Figure: terrain map of Europe in colour, with lon and lat axes.]

Europe3 <-
get_map("Europe", zoom = 3, maptype = "terrain", color = "bw")
ggmap(Europe3)

[Figure: terrain map of Europe in black and white.]

To demonstrate the difference between type roadmap and the default type
terrain , we use the map of Finland. Note that we search for “Oulu” instead of
“Finland” as Google Maps takes the position of the label “Finland” as the center of
the map, and clips the northern part. By means of zoom we override the default
automatic zooming onto the city of Oulu.

Finland1 <- get_map("Oulu", zoom = 5, maptype = "terrain")
ggmap(Finland1)

[Figure: terrain map of Finland, with lon and lat axes.]

Finland2 <- get_map("Oulu", zoom = 5, maptype = "roadmap")
ggmap(Finland2)

[Figure: road map of Finland, with lon and lat axes.]

We can even search for a street address, and in this case, with a high zoom value, we can see the building where one of us works:


BIO3 <- get_map("Viikinkaari 1, 00790 Helsinki",
                zoom = 18,
                maptype = "satellite")
ggmap(BIO3)

[Figure: satellite map of the building at Viikinkaari 1, Helsinki.]

We will now show a simple example of plotting data on a map, first by explicitly giving the coordinates; in the second example we show how to fetch from Google Maps, with function geocode() , coordinate values that can then be plotted. In the first example we use geom_point() and geom_text() , while in the second example we use annotate() , but either approach could have been used for both plots:

viikki <- get_map("Viikki, 00790 Helsinki",
                  zoom = 15,
                  maptype = "satellite")

our_location <- data.frame(lat = c(60.225, 60.227),
                           lon = c(25.017, 25.018),
                           label = c("BIO3", "field"))

ggmap(viikki, extent = "normal") +
  geom_point(data = our_location, aes(y = lat, x = lon),
             size = 4, colour = "yellow") +
  geom_text(data = our_location, aes(y = lat, x = lon, label = label),
            hjust = -0.3, colour = "yellow") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0))


[Figure: satellite map of Viikki with the locations "BIO3" and "field" marked with yellow points and labels.]

our_geocode <- geocode("Viikinkaari 1, 00790 Helsinki")

ggmap(viikki, extent = "normal") +
  annotate(geom = "point",
           y = our_geocode[1, "lat"], x = our_geocode[1, "lon"],
           size = 4, colour = "yellow") +
  annotate(geom = "text",
           y = our_geocode[1, "lat"], x = our_geocode[1, "lon"],
           label = "BIO3", hjust = -0.3, colour = "yellow") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0))


[Figure: satellite map of Viikki with the geocoded location "BIO3" marked in yellow.]

8.2.2 World map

Using function get_map() from package ‘ggmap’ to draw a world map is not possible at the time of writing. In addition, a worked-out example showing how to plot shape files, and how to download them from a repository, is suitable as our final example. We also show how to change the map projection. The example is adapted from a blog post at http://rpsychologist.com/working-with-shapefiles-projections-and-world-maps-in-ggplot.
We start by downloading the map data archive files from http://www.naturalearthdata.com, which are available as different layers. We use only three of the available archives: two ‘physical’ ones, describing the coastlines and a graticule plus bounding box, and one ‘cultural’ one, giving country borders. We save them in a folder with name ‘maps’, which is expected to already exist. After downloading each file, we unzip it. The recommended way of changing the root directory in a knitr document such as this is to use a chunk option, which is not visible in the output. The commented-out lines would have the same effect if typed at the R console.

# oldwd <- setwd("./maps")

url_path <-
# "http://www.naturalearthdata.com/download/110m/"
"http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/110m/"


download.file(paste(url_path,
"physical/ne_110m_land.zip",
sep = ""), "ne_110m_land.zip")
unzip("ne_110m_land.zip")

download.file(paste(url_path,
"cultural/ne_110m_admin_0_countries.zip",
sep = ""), "ne_110m_admin_0_countries.zip")
unzip("ne_110m_admin_0_countries.zip")

download.file(paste(url_path,
"physical/ne_110m_graticules_all.zip",
sep = ""), "ne_110m_graticules_all.zip")
unzip("ne_110m_graticules_all.zip")

# setwd(oldwd)
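The three repetitive download.file() calls above can also be generated in a loop; this is a minimal sketch, using the same url_path and archive names as above and base R's vectorized paste() and basename() (the download loop is left commented out to avoid network access):

```r
# Build the three archive URLs in one vectorized call instead of
# repeating paste() for each file.
url_path <-
  "http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/110m/"
files <- c("physical/ne_110m_land.zip",
           "cultural/ne_110m_admin_0_countries.zip",
           "physical/ne_110m_graticules_all.zip")
urls <- paste(url_path, files, sep = "")   # paste() is vectorized
# for (u in urls) { download.file(u, basename(u)); unzip(basename(u)) }
basename(urls)   # the local file names the loop would create
```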

We list the layers that we have downloaded.

ogrListLayers(dsn = "./maps")

## [1] "ne_110m_admin_0_countries"
## [2] "ne_110m_graticules_1"
## [3] "ne_110m_graticules_10"
## [4] "ne_110m_graticules_15"
## [5] "ne_110m_graticules_20"
## [6] "ne_110m_graticules_30"
## [7] "ne_110m_graticules_5"
## [8] "ne_110m_land"
## [9] "ne_110m_wgs84_bounding_box"
## attr(,"driver")
## [1] "ESRI Shapefile"
## attr(,"nlayers")
## [1] 9

Next we read the layer for the coastline, and use fortify() to convert it into a data
frame. We also create a second version of the data using the Robinson projection.

wmap <- readOGR(dsn = "./maps", layer = "ne_110m_land")

## OGR data source with driver: ESRI Shapefile


## Source: "./maps", layer: "ne_110m_land"
## with 127 features
## It has 2 fields

wmap.data <- fortify(wmap)

## Regions defined for each Polygons


wmap_robin <- spTransform(wmap, CRS("+proj=robin"))
wmap_robin.data <- fortify(wmap_robin)

## Regions defined for each Polygons

We do the same for country borders,

countries <- readOGR("./maps", layer = "ne_110m_admin_0_countries")

## OGR data source with driver: ESRI Shapefile


## Source: "./maps", layer: "ne_110m_admin_0_countries"
## with 177 features
## It has 63 fields

countries.data <- fortify(countries)

## Regions defined for each Polygons

countries_robin <- spTransform(countries, CRS("+init=ESRI:54030"))
countries_robin.data <- fortify(countries_robin)

## Regions defined for each Polygons

and for the graticule at 15° intervals, and the bounding box.

grat <- readOGR("./maps", layer = "ne_110m_graticules_15")

## OGR data source with driver: ESRI Shapefile


## Source: "./maps", layer: "ne_110m_graticules_15"
## with 35 features
## It has 5 fields
## Integer64 fields read as strings: degrees scalerank

grat.data <- fortify(grat)


grat_robin <- spTransform(grat, CRS("+proj=robin"))
grat_robin.data <- fortify(grat_robin)

bbox <- readOGR("./maps", layer = "ne_110m_wgs84_bounding_box")

## OGR data source with driver: ESRI Shapefile


## Source: "./maps", layer: "ne_110m_wgs84_bounding_box"
## with 1 features
## It has 2 fields

bbox.data <- fortify(bbox)

## Regions defined for each Polygons

bbox_robin <- spTransform(bbox, CRS("+proj=robin"))
bbox_robin.data <- fortify(bbox_robin)

## Regions defined for each Polygons


Now we plot the world map of the coastlines, on a longitude and latitude scale, as
a ggplot using geom_polygon() .

ggplot(wmap.data, aes(long, lat, group = group)) +
  geom_polygon() +
  labs(title = "World map (longlat)") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  coord_equal()

[Figure: "World map (longlat)" — coastline polygons plotted on long and lat axes.]

There is one noticeable problem in the map shown above: the Caspian sea is missing.
We need to use aesthetic fill and a manual scale to correct this.

ggplot(wmap.data, aes(long, lat, group = group, fill = hole)) +
  geom_polygon() +
  labs(title = "World map (longlat)") +
  scale_fill_manual(values = c("#262626", "#e6e8ed"), guide = "none") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  coord_equal()


[Figure: "World map (longlat)" — the same map with polygon holes filled, so the Caspian Sea is now visible.]

When plotting a map using a projection, many default elements of the ggplot
theme need to be removed, as the data is no longer in units of degrees of latitude and
longitude and axes and their labels are no longer meaningful.

theme_map_opts <-
list(theme(panel.grid.minor = element_blank(),
panel.grid.major = element_blank(),
panel.background = element_blank(),
plot.background = element_rect(fill="#e6e8ed"),
panel.border = element_blank(),
axis.line = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank()))

Next we plot all the layers using the Robinson projection. This is still a ggplot
object and consequently one can plot data on top of the map, being aware of the
transformation of the scale needed to make the data location match locations in a
map using a certain projection.
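That coordinate transformation can be done with spTransform() , the same function used above for the map layers. This is a minimal sketch, assuming packages ‘sp’ and ‘rgdal’ are loaded as earlier in this chapter, and using two hypothetical point locations:

```r
# Hypothetical point locations in degrees (WGS84 long/lat).
pts <- data.frame(lon = c(25.017, -0.128), lat = c(60.225, 51.507))
coordinates(pts) <- c("lon", "lat")                    # promote to SpatialPoints
proj4string(pts) <- CRS("+proj=longlat +datum=WGS84")
# Re-project to the Robinson projection used for the map above.
pts_robin <- spTransform(pts, CRS("+proj=robin"))
# The transformed coordinates (now on the projection plane) can be used
# as data in a geom_point() layer on top of the projected map.
as.data.frame(pts_robin)
```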

ggplot(bbox_robin.data, aes(long, lat, group = group)) +
  geom_polygon(fill = "white") +
  geom_polygon(data = countries_robin.data, aes(fill = hole)) +
  geom_path(data = countries_robin.data, color = "white", size = 0.3) +
  geom_path(data = grat_robin.data, linetype = "dashed", color = "grey50") +
  labs(title = "World map (Robinson)") +
  coord_equal() +
  theme_map_opts +
  scale_fill_manual(values = c("black", "white"),
                    guide = "none")

[Figure: "World map (Robinson)" — black-and-white world map in the Robinson projection, with graticule and bounding box.]

As a last example, a variation of the plot above in colour and using the predefined
theme theme_void() instead of our home-brewed theme settings.

ggplot(bbox_robin.data, aes(long, lat, group = group)) +
  geom_polygon(fill = "blue") +
  geom_polygon(data = countries_robin.data, aes(fill = hole)) +
  geom_path(data = countries_robin.data, color = "white", size = 0.3) +
  geom_path(data = grat_robin.data, linetype = "dashed", color = "grey75") +
  labs(title = "World map (Robinson)") +
  coord_equal() +
  theme_void() +
  scale_fill_manual(values = c("brown", "white"),
                    guide = "none")

[Figure: "World map (Robinson)" — the colour version, using theme_void().]

8.3 ‘imager’

Functions in this package allow easy plotting and “fast” processing of images with
R. It is based on the CImg library. CImg, http://cimg.eu, is a simple, modern C++
library for image processing defined using C++ templates for flexibility and to achieve
fast computations.

citation(package = "imager")

##
## To cite package 'imager' in publications
## use:
##
## Simon Barthelme (2017). imager: Image
## Processing Library Based on 'CImg'. R
## package version 0.40.1.
## https://CRAN.R-project.org/package=imager
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {imager: Image Processing Library Based on 'CImg'},
## author = {Simon Barthelme},
## year = {2017},
## note = {R package version 0.40.1},
## url = {https://CRAN.R-project.org/package=imager},
## }

8.3.1 Using the package: first example

I will use as examples downsized cropped sections¹ from photographs of a Dahlia flower.
The first example is a photograph taken in sunlight, with no filter on the camera objective—i.e. a normal image.
We use load.image() to read the image from a TIFF file with luminance data en-
coded in 8 bits per channel, i.e. as values in the range from 0 to 255. The image is
saved as an object of class "cimg" as defined in package ‘imager’.

dahlia01.img <- load.image("data/dahlia-vis.tif")
class(dahlia01.img)

## [1] "cimg" "imager_array" "numeric"

¹ A crop is used to make the code faster. I may replace it with a higher-resolution image before the book is published.


mode(dahlia01.img)

## [1] "numeric"

dahlia01.img

## Image. Width: 800 pix Height: 800 pix Depth: 1 Colour channels: 3

range(R(dahlia01.img))

## [1] 32 255

range(G(dahlia01.img))

## [1] 0 250

range(B(dahlia01.img))

## [1] 0 227

We exemplify first the use of the plot() method from package ‘imager’.

plot(dahlia01.img)
[Figure: the colour photograph displayed with plot(); axes give pixel coordinates.]

U Read a different image, preferably one you have captured yourself. Images are not only photographs, so, for example, you may want to play with electrophoresis gels. Several different bitmap file formats are accepted, and the path to a file can also be a URL (see Chapter 5 for details). Which file formats can be read depends on other tools installed in the computer you are using; in particular, if ImageMagick is available, many different file formats can be automatically recognized and decoded/uncompressed when read. When playing, use a rather small bitmap, e.g. one megapixel or smaller, to get a fast response when plotting.

Converting the image to gray scale is easy if it is an 8 bit per channel image. It is
done with function grayscale() .

dahlia01g.img <- grayscale(dahlia01.img)
class(dahlia01g.img)

## [1] "cimg" "imager_array" "numeric"

mode(dahlia01g.img)

## [1] "numeric"

dahlia01g.img

## Image. Width: 800 pix Height: 800 pix Depth: 1 Colour channels: 1

plot(dahlia01g.img)
[Figure: the gray scale image displayed with plot().]


U Convert to gray scale a different colour image, after reading it from a file.

We can convert a gray scale image into a black-and-white image with binary values for each pixel.

dahlia01t.img <- threshold(dahlia01g.img)
plot(dahlia01t.img)
[Figure: the thresholded, binary black-and-white image.]

U Function threshold() has a parameter that allows overriding the automatically chosen threshold value. Read the documentation of the function and play by passing different thresholds for the same image, and looking at the plotted result. As an additional task, try the behaviour of the default with different images and, by reading the documentation on how the default is chosen, try to make sense of how the different images were segmented using the default threshold.

8.3.2 Plotting with ‘ggplot2’: first example

Although a plot() method is provided for "cimg" objects, we convert the image into a data frame so as to be able to use the usual R functions to plot and operate on the data. For simplicity’s sake we start with the gray scale image. The as.data.frame() method converts this image into a tidy data frame with the pixel coordinates in columns x and y , and the luminance values in column value .

dahlia01g.df <- as.data.frame(dahlia01g.img)
names(dahlia01g.df)

## [1] "x" "y" "value"

Now we can use functions from package ‘ggplot2’ as usual to create plots. We start
by plotting a histogram of the value column.

ggplot(dahlia01g.df, aes(value)) +
geom_histogram(bins = 30)

[Figure: histogram of the value column of the gray scale image.]

And then we plot it as a raster image adding a layer with geom_raster() , mapping
luminance to the alpha aesthetic and setting fill to a constant "black" . Because
the 𝑦-axis of the image is the reverse of the default expected by aes() we need to
reverse the scale, and we change expansion to zero, as we want the raster to extend
up to the edges of the plotting area. As coordinates of pixel locations are not needed,
we use theme_void() to remove 𝑥- and 𝑦-axis labels, and the background grid. We
use coord_fixed() accepting the default ratio between 𝑥 and 𝑦 scales equal to one,
as the image has square pixels.

ggplot(dahlia01g.df,
       aes(x, y, alpha = (255 - value) / 255)) +
  geom_raster(fill = "black") +
  coord_fixed() +
  scale_alpha_identity() +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0),
                     trans = scales::reverse_trans()) +
  theme_void()
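The alpha mapping used in the plot above is simple arithmetic: it inverts the 0–255 luminance scale and rescales it to 0–1, so that dark pixels become opaque. In isolation:

```r
# 8-bit luminance values: black, mid gray, white.
value <- c(0, 128, 255)
alpha <- (255 - value) / 255
alpha  # black maps to 1 (opaque), white to 0 (fully transparent)
```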

U Plotting a large raster is slow, even with geom_raster() . Package ‘imager’ provides a function resize() that can be used to expand (by interpolation) or reduce the size of the image. Try reducing the x and y dimensions of the bitmap to 50%, 20%, and 5% of their original size, and plotting it. At how much size reduction does the image quality deteriorate enough to be noticed on the monitor or laptop screen you are using?

After this first simple example, we handle the slightly more complicated case of working with the original RGB colour image. In this case, the as.data.frame() method converts the image into a tidy data frame with column cc identifying the three colour channels, and the luminance values in column value . We add a factor channel with ‘nice’ labels, and a numeric variable luminance holding a copy of the values (still in the 0 to 255 range).
dahlia01.df <- as.data.frame(dahlia01.img)
names(dahlia01.df)

## [1] "x" "y" "cc" "value"

dahlia01.df <- plyr::mutate(dahlia01.df,
                            channel = factor(cc, labels = c('R','G','B')),
                            luminance = value)
names(dahlia01.df)

## [1] "x"         "y"         "cc"
## [4] "value"     "channel"   "luminance"
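The factor() relabelling used in the call to mutate() above can be seen in isolation with base R alone:

```r
# cc encodes the colour channel as integers 1..3, as in the image data frame.
cc <- c(1, 1, 2, 3, 3)
channel <- factor(cc, labels = c("R", "G", "B"))
levels(channel)   # "R" "G" "B"
table(channel)    # counts per channel
```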


Now we can use functions from package ‘ggplot2’ as usual to create different plots.
We start by plotting histograms for the different color channels.

ggplot(dahlia01.df,
aes(luminance, fill = channel)) +
geom_histogram(bins = 30, color = NA) +
scale_fill_manual(values = c('R' = "red", 'G' = "green", 'B' = "blue"),
guide = FALSE) +
facet_wrap(~channel)

[Figure: histograms of the R, G and B channels, each filled in its own colour.]

We now plot each channel as a separate raster with geom_raster() , mapping lu-
minance to the alpha aesthetic and map the colour corresponding to each channel
as a uniform fill . As above, because the 𝑦-axis of the image is the reverse of the
default expected by aes() we need to reverse the scale, and we change expansion
to zero, as we want the raster to extend up to the edges of the plotting area. Also
as above, we use theme_void() to remove 𝑥- and 𝑦-axis labels, and the background
grid. We use coord_fixed() accepting the default ratio between 𝑥 and 𝑦 scales equal
to one.

ggplot(dahlia01.df,
aes(x, y, alpha = (255 - luminance) / 255, fill = channel)) +
geom_raster() +
facet_wrap(~channel) +
coord_fixed() +
scale_fill_manual(values = c('R' = "red", 'G' = "green", 'B' = "blue"),
guide = FALSE) +
scale_alpha_identity() +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0),
trans = scales::reverse_trans()) +
theme_void()

[Figure: the R, G and B channels shown side by side as rasters, each in its own colour.]

U Change the code used to build the ggplot above so that 1) the panels are in a column instead of in a row, 2) the bitmap for each channel is shown in gray scale rather than as a single red, green or blue image, and consider whether the relative darkness of the three channels “feels” different in the two figures, 3) add to the previous figure a fourth panel with the image converted to a single gray scale channel. Hint: the way to do it is to combine the data into a single data frame.

8.3.3 Using the package: second example

The second original is a photograph of the same flower taken in sunlight, but using
a UV-A band-pass filter. I chose such an image because the different colour chan-
nels have very different luminance values even after applying the full strength of the
corrections available in the raw conversion software, making it look almost mono-
chromatic.
We read the image from a TIFF file with luminance data encoded in 8 bits per chan-
nel, i.e. as values in the range from 0 to 255. As above the image is saved as an object
of class "cimg" as defined in package ‘imager’.

dahlia02.img <- load.image("data/dahlia-uva.tif")

We use as above the plot() method from package ‘imager’.

plot(dahlia02.img)

[Figure: the UV-A photograph displayed with plot(); it appears almost monochromatic.]

Converting this image to gray scale with grayscale() is easy, as it is an 8-bit-per-channel image².

dahlia02g.img <- grayscale(dahlia02.img)
dahlia02g.img

## Image. Width: 800 pix Height: 800 pix Depth: 1 Colour channels: 1

plot(dahlia02g.img)

² In the case of images with 16-bit data, one needs to re-scale the luminance values to avoid out-of-range errors.


[Figure: the gray scale version of the UV-A image.]

8.3.4 Plotting with ‘ggplot2’: second example

To be able to use package ‘ggplot2’ we convert the image into a data frame so as to be able to use the usual R functions to plot and operate on the data. The as.data.frame() method converts the image into a tidy data frame with column cc identifying the three colour channels, and the luminance values in column value . We add a factor channel with ‘nice’ labels, and a numeric variable luminance holding a copy of the values (still in the 0 to 255 range).

dahlia02.df <- as.data.frame(dahlia02.img)
names(dahlia02.df)

## [1] "x" "y" "cc" "value"

dahlia02.df <- plyr::mutate(dahlia02.df,
                            channel = factor(cc, labels = c('R','G','B')),
                            luminance = value)
names(dahlia02.df)

## [1] "x"         "y"         "cc"
## [4] "value"     "channel"   "luminance"

Now we can use functions from package ‘ggplot2’ as usual to create different plots.
We start by plotting histograms for the different color channels.


ggplot(dahlia02.df,
aes(luminance, fill = channel)) +
geom_histogram(bins = 30, color = NA) +
scale_fill_manual(values = c('R' = "red", 'G' = "green", 'B' = "blue"),
guide = FALSE) +
facet_wrap(~channel)

[Figure: histograms of the R, G and B channels of the UV-A image; the channels differ markedly in luminance.]

We now plot each channel as a separate raster using geom_raster() , mapping lu-
minance to the alpha aesthetic so as to be able to map the colour corresponding to
each channel as a uniform fill . Because the 𝑦-axis of the image is the reverse of the
default expected by aes() we need to reverse the scale, and we change expansion to
zero, as we want the raster to extend up to the edges of the plotting area. As coordin-
ates of pixel locations are not needed, we use theme_void() to remove 𝑥- and 𝑦-axis
labels, and the background grid. We use coord_fixed() accepting the default ratio
between 𝑥 and 𝑦 scales equal to one, as the image has square pixels.

ggplot(dahlia02.df,
aes(x, y, alpha = (255 - luminance) / 255, fill = channel)) +
geom_raster() +
facet_wrap(~channel) +
coord_fixed() +
scale_fill_manual(values = c('R' = "red", 'G' = "green", 'B' = "blue"),
guide = FALSE) +
scale_alpha_identity() +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0),
trans = scales::reverse_trans()) +
theme_void()

[Figure: the R, G and B channels of the UV-A image as side-by-side rasters.]

8.3.5 Manipulating pixel data: second example

After seeing the histograms, we guess values for constants to use to improve the white balance in a very simplistic way. Be aware that code equivalent to the one below, but using ifelse() , triggers an error.

dahlia03.img <- dahlia02.img
range(G(dahlia03.img))

## [1] 0 230

G(dahlia03.img) <- G(dahlia03.img) + 40
range(G(dahlia03.img))

## [1] 40 270
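Note that the shifted values now exceed 255. This base-R sketch, on a plain numeric vector rather than on the image itself, shows how such values can be clamped back into the valid 0–255 range with pmax() and pmin() , which also work on the numeric arrays underlying cimg objects:

```r
# Hypothetical 8-bit green-channel values after adding a constant of 40.
g_shifted <- c(0, 120, 230) + 40           # 40 160 270: 270 is out of range
g_clamped <- pmin(pmax(g_shifted, 0), 255) # clamp into 0..255
g_clamped                                  # 40 160 255
```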

dahlia03.df <- as.data.frame(dahlia03.img)
dahlia03.df <- plyr::mutate(dahlia03.df,
                            channel = factor(cc, labels = c('R','G','B')),
                            luminance = value)

ggplot(dahlia03.df,
aes(luminance, fill = channel)) +
geom_histogram(bins = 30, color = NA) +
scale_fill_manual(values = c('R' = "red", 'G' = "green", 'B' = "blue"),
guide = FALSE) +
facet_wrap(~channel)


[Figure: channel histograms after shifting the G channel by +40.]

plot(dahlia03.img)
[Figure: the image after the simplistic white-balance correction.]

Another approach would be to equalize the histograms. We start with the gray scale
image.

plot(as.cimg(ecdf(dahlia02g.img)(dahlia02g.img), dim = dim(dahlia02g.img)))

[Figure: the gray scale image after histogram equalization.]

The line above is not that easy to understand. What is going on is that the call
ecdf(dahlia02g.img) returns a function built on the fly, and then with the additional
set of parentheses we call it, and then pass the result to the as.cimg() method, and
the object this method returns is passed as argument to the plot() method. It can
be split as follows into four statements.

eq.f <- ecdf(dahlia02g.img)
equalized_dahlia02g <- eq.f(dahlia02g.img)
equalized_dahlia02g.img <- as.cimg(equalized_dahlia02g, dim = dim(dahlia02g.img))
plot(equalized_dahlia02g.img)
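The same pattern—ecdf() returning a function, which is then applied to the data—can be tried on a plain numeric vector, with no image involved:

```r
set.seed(123)
v <- rnorm(1000, mean = 100, sd = 15)  # arbitrary example data
eq <- ecdf(v)      # ecdf() returns a function...
v_eq <- eq(v)      # ...which maps each value to its empirical quantile
is.function(eq)    # TRUE
range(v_eq)        # all values now lie in (0, 1]
```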

[Figure: the same equalized image, produced by the four separate statements.]

A third syntax is to use the %>% pipe operator. This operator is not native to the R language, but is defined by package ‘magrittr’. In recent times its use has become rather popular for data transformations. The pipe below is equivalent to the nested calls in the one-line statement above.

ecdf(dahlia02g.img)(dahlia02g.img) %>%
  as.cimg(dim = dim(dahlia02g.img)) %>%
  plot()

[Figure: the equalized image once more, produced with the pipe.]
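For readers who have not met the pipe before, this minimal example—assuming package ‘magrittr’, which defines %>% , is installed—shows the equivalence between nested calls and a pipe:

```r
library(magrittr)

x <- c(4, 16, 25)
a <- sqrt(sum(x))            # nested calls, read inside-out
b <- x %>% sum() %>% sqrt()  # the same computation as a pipe, read left to right
identical(a, b)              # TRUE
```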

We can plot the histogram before and after equalization.

ggplot(as.data.frame(dahlia02g.img),
aes(value)) +
geom_histogram(bins = 30)

[Figure: histogram of value (0–255) before equalization.]


ggplot(as.data.frame(equalized_dahlia02g.img),
aes(value)) +
geom_histogram(bins = 30)

[Figure: histogram of value (0–1) after equalization.]

We can further check what the ECDF function looks like, by looking at its attributes and printing its definition.

class(eq.f)

## [1] "ecdf" "stepfun" "function"

mode(eq.f)

## [1] "function"

eq.f

## Empirical CDF
## Call: ecdf(dahlia02g.img)
## x[1:26384] = 1.86, 2.56, 2.67, ..., 238.48, 240.25

U Define a function that accepts as argument a cimg object, and returns an equalized image as a cimg object. Do the development in steps as follows.

Easy Implement for gray scale images.


Medium Implement for colour images both as argument and return values.

Advanced, brute force approach As Medium but use R package ‘Rcpp’ to imple-
ment the “glue” code calling functions in the CImg library in C++ so that the
data is passed back and forth between R and compiled C++ code only once.
Hint: look at the source code of package ‘imager’, and use this as example.
Read the documentation for the CImg library and package ‘Rcpp’ and try to
avoid as much as possible the use of interpreted R code (see also Chapter
9).

Advanced, efficient approach As above but use profiling and bench marking
tools to first find which parts of the R and/or C++ code are limiting per-
formance and worthwhile optimizing for execution speed (see also Chapter
9).

In the example above we used the as.data.frame() method defined in package ‘imager’ to obtain a tidy data frame with the luminance values in a single column. For some operations it may be better to work directly on the cimg object, which is simply a multidimensional numeric array with some “frosting on top”.

U Study the code of functions R() , G() and B() , so as to understand why one could call them wrappers of R array-extraction operators. Then study the assignment versions of the same functions, R<-() , G<-() and B<-() .

## function (im)
## {
## channel(im, 1)
## }
## <environment: namespace:imager>

channel

## function (im, ind)
## {
##     im[, , , ind, drop = FALSE]
## }
## <environment: namespace:imager>


8.3.6 Using bitmaps as data in R

I end with some general considerations about manipulating bitmap data in R. The functions in package ‘imager’ convert the images read from files into R’s numeric arrays, something that is very handy because it allows applying any of the maths operators and functions available in R to the raster data. The downside is that this is wasteful with respect to memory use, as in most cases the original data has only 8 or at most 16 bits of resolution. This approach can also slow down some operations compared to calling the functions defined in the CImg library directly from C++; for example, plotting can be slow enough to cause problems. The CImg library itself is very flexible and can use memory efficiently (see http://cimg.eu/); however, profiting from all its capabilities and flexibility in combination with functions defined in R is made difficult by the fact that R supports fewer types of numerical data than C++, and tends to convert results to a wider type quite frequently.
To better understand what this means in practice, we can explore how the image is stored.
dim(dahlia01.img)

## [1] 800 800 1 3

dimnames(dahlia01.img)

## NULL

attributes(dahlia01.img)

## $class
## [1] "cimg" "imager_array" "numeric"
##
## $dim
## [1] 800 800 1 3

str(dahlia01.img)

## cimg [1:800, 1:800, 1, 1:3] 228 229 228 230 230 230 229 228 228 227 ...
## - attr(*, "class")= chr [1:3] "cimg" "imager_array" "numeric"

is.integer(dahlia01.img)

## [1] FALSE

is.double(dahlia01.img)

## [1] TRUE

is.logical(dahlia01.img)

## [1] FALSE


We use function object.size() defined in the base R package ‘utils’ to find out
how much space in memory the cimg object dahlia01.img occupies, and then we
divide this value by the number of pixels.

format(object.size(dahlia01.img), units = "MB")

## [1] "14.6 Mb"

nPix(dahlia01.img) * 1e-6 # MPix

## [1] 1.92

object.size(dahlia01.img) %/% nPix(dahlia01.img)

## 8 bytes

width(dahlia01.img) * height(dahlia01.img) * 1e-6 # MPix

## [1] 0.64

object.size(dahlia01.img) %/% (width(dahlia01.img) * height(dahlia01.img))

## 24 bytes

We can see above that function nPix() returns the number of pixels in the image times the number of colour channels, and that to obtain the actual number of pixels we should multiply the width by the height of the image. The images used in these examples are small by current standards, only 0.64 MPix, and at their native colour depth of 8 bits per channel they have a size of 1.92 MB. They were read from compressed TIFF files with a size of about 0.8 to 1.1 MB on disk. However, they occupy nearly 15 MB in memory, or 8 times the size required to represent the information they contain.
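The arithmetic behind these numbers is simple enough to verify directly. A sketch using the dimensions from the example above (800 by 800 pixels, three channels, doubles of 8 bytes):

```r
pixels <- 800 * 800              # width times height
channels <- 3                    # R, G and B
bytes_per_value <- 8             # R stores the values as doubles
in_memory <- pixels * channels * bytes_per_value
in_memory / 2^20                 # in MiB, matching object.size() above
## [1] 14.64844
native <- pixels * channels * 1  # 8 bits = 1 byte per value
in_memory / native               # the 8-fold overhead mentioned in the text
## [1] 8
```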

= Package ‘imager’ contains many different functions wrapping functions from the CImg library; the examples given here are only an introduction to the most basic of its capabilities. The CImg library is written in C++ using templates, and can be instantiated at compile time for different types of pixel data. Consequently, one cannot expect calling these functions from R to be as fast as a good C++ implementation of the same operations with the same library. On the other hand, for relatively small images or small numbers of images, calling the library from R allows the use of R for computations on pixel data, which opens the door to the quick development and testing of pixel-related statistical algorithms.


try(detach(package:imager))
try(detach(package:ggmap))
try(detach(package:rgdal))

9 If and when R needs help

Improving the efficiency of your S functions can be well worth some effort. …But remember that large efficiency gains can be made by using a better algorithm, not just by coding the same algorithm better.

— Patrick J. Burns (1998) S Poetry.
http://www.burns-stat.com/documents/books/s-poetry/

9.1 Packages used in this chapter

For executing the examples listed in this chapter, you first need to load the following packages from the library:

library(Rcpp)
library(inline)
# library(rPython)
library(rJava)

9.2 Aims of this chapter

In this final chapter I highlight what in my opinion are the limitations and advantages of using R as a scripting language for data analysis, briefly describing alternative approaches that can help overcome performance bottlenecks in R code.

9.3 R’s limitations and strengths

9.3.1 Optimizing R code

Some constructs like for and while loops execute slowly in R, as they are interpreted. Byte compiling and just-in-time (JIT) compiling of loops (enabled by default in R >= 3.4.0) should decrease this burden. However, base R as well as some packages define several apply functions. As these are compiled functions, written in C or C++, using apply functions instead of explicit loops can provide a major improvement in performance while keeping the user’s code fully written in R. Pre-allocating memory, rather than growing a vector or array at each iteration, can also help. One little-known problem is related to the consistency tests run when ‘growing’ data frames: if we add variables one by one to a large data frame, the overhead is in many cases huge. This can often be avoided by assembling the object as a list, and once assembled, converting it into a data frame.
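A minimal sketch of this idiom (the column names and values are made up for the example): collect the columns in a list first, and convert to a data frame in a single step at the end.

```r
columns <- list()
for (name in c("a", "b", "c")) {
  columns[[name]] <- seq_len(5)  # in real use, one computed vector per column
}
df <- as.data.frame(columns)     # a single conversion, instead of repeated
                                 # consistency checks on a growing data frame
str(df)
```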
You may ask: how can I know where in the code the performance bottleneck is? During the early years of R this was quite a difficult task. Nowadays there are good code-profiling and benchmarking tools which, in the most recent versions, are integrated into the RStudio IDE. Profiling consists in measuring how much of the total runtime of a test is spent in different functions, or even lines of code. Benchmarking consists in timing the execution of alternative versions of a piece of code, to decide which one should be preferred.
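Even without dedicated packages, base R’s system.time() suffices for a first, rough benchmark. A sketch (the timed expressions are made up for the example):

```r
x <- runif(1e5)
system.time(y1 <- sapply(x, function(v) v^2))  # one function call per element
system.time(y2 <- x^2)                         # vectorized, much faster
all.equal(y1, y2)                              # same result either way
## [1] TRUE
```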
There are rules of style, and of common sense, that should always be applied to develop good-quality program code. However, as high performance in most cases comes at the cost of a more complex program or algorithm, optimizations should be applied only to the parts of the code that limit overall performance. Even when the requirement of high performance is known in advance, it is in most cases best to start with a simple implementation of a simple algorithm. Get this first solution working reliably, and use it as a reference for both performance and accuracy of the returned results while attempting optimization.
The book The Art of R Programming: A Tour of Statistical Software Design (Matloff
2011) is very good at presenting the use of R language and how to profit from its
peculiar features to write concise and efficient code. Studying the book Advanced R
(Wickham 2014b) will give you a deep understanding of the R language, its limitations
and good and bad approaches to its use. If you aim at writing R packages, then R
Packages (Wickham 2015) will guide you on how to write your own packages, using
modern tools. Finally, any piece of software benefits from thorough and consistent testing, and R packages and scripts are no exception. Building a set of test cases simplifies code maintenance enormously, as the tests help detect unintended changes in program behaviour (Cotton 2016; Wickham 2015).

9.3.2 Using the best tool for each job

In many cases optimizing R code for performance can yield more than an order-of-magnitude decrease in runtime. Frequently this is enough, and the most cost-effective solution. There are both packages and functions in base R that, if properly used, can make a huge difference in performance. In addition, efforts in recent years to optimize the overall performance of R itself have been successful. Some of the packages with enhanced performance have been described in earlier chapters, as they are easy enough to use and have an easy-to-learn user interface. Other packages, like ‘data.table’, although achieving very fast execution, incur the cost of a user interface and behaviour alien to the “normal way of working” with R.
Sometimes the best available tools for a certain job have not been implemented in R but are available in other languages. Alternatively, the algorithms or the size of the data are such that performance is poor when implemented in the R language, and can be better in a compiled language.

9.3.3 R is great, but not always best

One extremely important feature behind the success of R is its extensibility: not only by writing packages in R itself, but by allowing the development of packages containing functions written in other computer languages. The beauty of the package-loading mechanism is that even though R itself is written in C and compiled into an executable, packages containing interpreted R code, compiled C, C++, FORTRAN or other languages, or calls to libraries written in Java, Python, etc., can be loaded and unloaded at runtime.
The most common reasons for using compiled code are the availability of libraries written in FORTRAN, C and C++ that are well tested and optimized for performance. This is frequently the case for numerical calculations and for time-consuming data manipulations like image analysis. In such cases the R code in packages is just a wrapper (or “glue”) that allows the functions in the library to be called from R.
In other cases we diagnose a performance bottleneck and decide to write a few functions, within a package otherwise written in R, in a compiled language like C++. In such cases it is a good idea to use benchmarking, as the use of a compiled language does not necessarily provide a worthwhile performance enhancement. Different languages do not always store data in memory in the same format, and this can add overhead to function calls across languages.

9.4 Rcpp

citation(package = "Rcpp")

##
## To cite Rcpp in publications use:
##
## Dirk Eddelbuettel and Romain Francois
## (2011). Rcpp: Seamless R and C++
## Integration. Journal of Statistical
## Software, 40(8), 1-18. URL
## http://www.jstatsoft.org/v40/i08/.
##


## Eddelbuettel, Dirk (2013) Seamless R and
## C++ Integration with Rcpp. Springer, New
## York. ISBN 978-1-4614-6867-7.

Nowadays, thanks to package ‘Rcpp’, mixing C++ and R code is fairly simple (Eddelbuettel 2013). This package not only provides R code, but also a C++ header file with macro definitions that reduce the writing of the necessary “glue” code to the use of a simple macro in the C++ code. Although this mechanism is most frequently used as a component of packages, it is also possible to define a function written in C++ at the R console, or in a simple user’s script. Of course, for this to work all the tools needed to build R packages from source are required, including a suitable compiler and linker.
An example taken from the ‘Rcpp’ documentation follows. This is how one would define a function during an interactive session at the R console, or in a simple script. When writing a package, one would instead write a separate source file for the function, include the Rcpp.h header, and use the C++ macros to build the R code side. Using C++ inline requires package ‘inline’ to be loaded in addition to ‘Rcpp’. First we save the source code for the function written in C++, taking advantage of types and templates defined in the Rcpp.h header file.

src <- '
  Rcpp::NumericVector xa(a);
  Rcpp::NumericVector xb(b);
  int n_xa = xa.size(), n_xb = xb.size();
  Rcpp::NumericVector xab(n_xa + n_xb - 1);
  for (int i = 0; i < n_xa; i++)
    for (int j = 0; j < n_xb; j++)
      xab[i + j] += xa[i] * xb[j];
  return xab;
'

The second step is to compile and load the function, in a way that it can be called
from R code and indistinguishable from a function defined in R itself.

fun <- cxxfunction(signature(a = "numeric", b = "numeric"), src, plugin = "Rcpp")

We can now use it as any other R function.

fun(1:3, 1:4)

## [1] 1 4 10 16 17 12

As we will see below, this is not the case when calling Java and Python: although the integration is relatively tight, special syntax is needed when calling the “foreign” functions. The advantage of Rcpp in this respect is very significant, as we can define functions that have exactly the same argument signature, use the same call syntax and behave in the same way, whether defined in R or in C++. This means that at any point during the development of a package a function defined in R can be replaced by an equivalent function defined in C++, or vice versa, with absolutely no impact on users’ code, except possibly for the faster execution of the C++ version.
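To illustrate this interchangeability, here is a pure-R function, written for this example, with the same signature and behaviour as the C++ fun() defined above; either one could replace the other in user code:

```r
# Pure-R equivalent of the C++ convolution: xab[k] accumulates the
# products a[i] * b[j] for all pairs with i + j - 1 == k.
r_fun <- function(a, b) {
  xab <- numeric(length(a) + length(b) - 1)
  for (i in seq_along(a))
    for (j in seq_along(b))
      xab[i + j - 1] <- xab[i + j - 1] + a[i] * b[j]
  xab
}
r_fun(1:3, 1:4)  # same result as fun(1:3, 1:4)
## [1] 1 4 10 16 17 12
```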

9.5 FORTRAN and C

In the case of FORTRAN and C, the process is less automated, as the R code needed to call the compiled functions must be explicitly written (see Writing R Extensions in the R documentation for up-to-date details). Once written, the building and installation of the package is automatic. This is how many existing libraries are called from within R and R packages.

9.6 Python

Package ‘rPython’ allows calling Python functions and methods from R code. Cur-
rently this package is not available under MS-Windows.
Example taken from the package description (not run).

python.call( "len", 1:3 )

a <- 1:4
b <- 5:8
python.exec( "def concat(a,b): return a+b" )
python.call( "concat", a, b)

It is also possible to call R functions from Python. However, this is outside the
scope of this book.

9.7 Java

Although Java compilers exist, most frequently Java programs are compiled into intermediate byte code which is then interpreted, and usually the interpreter includes a JIT compiler. For calling Java functions or accessing Java objects from R code, the solution is to use package ‘rJava’. One important point to remember is that the Java Development Kit (JDK) must be installed for this package to work; the usually installed runtime (JRE) is not enough.
We need first to start the Java Virtual Machine (the byte-code interpreter).


.jinit()

## [1] 0

The code that follows is not that clear, and merits some explanation.
We first create a Java array from inside R.

a <- .jarray( list(
  .jnew( "java/awt/Point", 10L, 10L ),
  .jnew( "java/awt/Point", 30L, 30L )
) )
print(a)

## [1] "Java-Array-Object[Ljava/lang/Object;:[Ljava.lang.Object;@731f8236"

mode(a)

## [1] "S4"

class(a)

## [1] "jarrayRef"
## attr(,"package")
## [1] "rJava"

str(a)

## Formal class 'jarrayRef' [package "rJava"] with 2 slots
##   ..@ jobj  :<externalptr>
##   ..@ jclass: chr "[Ljava/lang/Object;"
##   ..@ jsig  : chr "[Ljava/lang/Object;"

Then we use base R’s function sapply() to apply a user-defined R function to the elements of the Java array, obtaining as returned value an R numeric vector.

b <- sapply(a,
function(point){
with(point, {
(x + y )^2
} )
})
print(b)

## [1] 400 3600

mode(b)

## [1] "numeric"

class(b)


## [1] "numeric"

str(b)

## num [1:2] 400 3600

Although more cumbersome than in the case of ‘Rcpp’, one can manually write wrapper code to hide the special syntax and object types from users.
It is also possible to call R functions from within a Java program. This is outside
the scope of this book.

9.8 sh, bash

The operating system shell can be accessed from within R, and the output from programs and shell scripts returned to the R session. This is useful, for example, for pre-processing raw data files with tools like AWK or Perl scripts. The problem with this approach is that the R script can no longer run portably across operating systems, or in the absence of the tools or of the sh or bash scripts. Except for code that will never be reused (i.e. used once and discarded), it is preferable to use R’s built-in commands whenever possible, or, if shell scripts are used, to make the shell script the master script from which the R scripts are called, rather than the other way around. The reason for this is mainly to make the developer’s intention clear: that the code as a whole will be run in a given operating system using a certain set of tools, rather than hiding shell calls inside the R script. In other words, keep the least portable bits in full view.
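The first sentence above can be illustrated with base R’s system2(), which runs a program and can capture its standard output as a character vector (the example assumes a Unix-like system where echo is available):

```r
# Run a shell command and capture its standard output in R.
out <- system2("echo", args = "hello from the shell", stdout = TRUE)
print(out)
## [1] "hello from the shell"
```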

9.9 Web pages, and interactive interfaces

There is a lot to write on this topic, and intense development efforts are under way. One example is the ‘Shiny’ package and Shiny server (https://shiny.rstudio.com/). This package allows the creation of interactive displays to be viewed through any web browser.
There are other packages for generating both static and interactive graphics in formats suitable for on-line display, as well as package ‘knitr’ (https://yihui.name/knitr/), used for writing this book, which when using R Markdown for markup (with package ‘rmarkdown’, http://rmarkdown.rstudio.com, or ‘Bookdown’, https://bookdown.org/) can output self-contained HTML files in addition to RTF and PDF formats.

10 Further reading about R

Before you become too entranced with gorgeous gadgets and


mesmerizing video displays, let me remind you that information is
not knowledge, knowledge is not wisdom, and wisdom is not
foresight. Each grows out of the other, and we need them all.

— Arthur C. Clarke

10.1 Introductory texts

Dalgaard 2008; Paradis 2005; Peng 2016; Peng et al. 2017; Teetor 2011; Zuur et al.
2009

10.2 Texts on specific aspects

Chang 2013; Everitt and Hothorn 2011; Faraway 2004, 2006; Fox 2002; Fox and Weis-
berg 2010; Wickham and Grolemund 2017

10.3 Advanced texts

Chambers 2016; Ihaka and Gentleman 1996; Matloff 2011; Murrell 2011; Pinheiro and
Bates 2000; Venables and Ripley 2000; Wickham 2014b, 2015; Wickham and Sievert
2016; Xie 2013

Bibliography

Anscombe, F. J. (1973). “Graphs in Statistical Analysis”. In: The American Statistician 27.1, p. 17. doi: 10.2307/2682899 (cit. on p. 316).
Aphalo, P. J. and R. Rikala (2006). “Spacing of silver birch seedlings grown in contain-
ers of equal size affects their morphology and its variability.” In: Tree physiology
26.9, pp. 1227–1237. doi: 10.1093/treephys/26.9.1227 (cit. on p. 176).
Becker, R. A. and J. M. Chambers (1984). S: An Interactive Environment for Data Ana-
lysis and Graphics. Chapman and Hall/CRC. isbn: 0-534-03313-X (cit. on p. 6).
Becker, R. A., J. M. Chambers, and A. R. Wilks (1988). The New S Language: A Pro-
gramming Environment for Data Analysis and Graphics. Chapman & Hall. isbn: 0-
534-09192-X (cit. on p. 6).
Burns, P. J. (1998). “S poetry”. In: (cit. on p. 69).
Chambers, J. M. (2016). Extending R. The R Series. Chapman and Hall/CRC. isbn:
1498775713 (cit. on p. 471).
Chang, W. (2013). R Graphics Cookbook. 1-2. Sebastopol: O’Reilly Media, p. 413. isbn:
9781449316952 (cit. on pp. 180, 284, 471).
Cleveland, W. S. (1985). The Elements of Graphing Data. Wadsworth, Inc. isbn: 978-
0534037291 (cit. on p. 181).
Cotton, R. J. (2016). Testing R Code. The R Series. Chapman and Hall/CRC. isbn:
1498763650 (cit. on p. 464).
Dalgaard, P. (2008). Introductory Statistics with R. Springer, p. 380. isbn: 0387790543
(cit. on p. 471).
Eddelbuettel, D. (2013). Seamless R and C++ Integration with Rcpp. Springer, p. 248.
isbn: 1461468671 (cit. on p. 466).
Everitt, B. and T. Hothorn (2011). An Introduction to Applied Multivariate Analysis with
R. Springer, p. 288. isbn: 1441996494 (cit. on p. 471).
Faraway, J. J. (2004). Linear Models with R. Boca Raton, FL: Chapman & Hall/CRC,
p. 240 (cit. on p. 471).
– (2006). Extending the linear model with R: generalized linear, mixed effects and non-
parametric regression models. Chapman & Hall/CRC Taylor & Francis Group, p. 345.
isbn: 158488424X (cit. on p. 471).
Fox, J. (2002). An {R} and {S-Plus} Companion to Applied Regression. Thousand Oaks,
CA, USA: Sage Publications (cit. on p. 471).


Fox, J. and H. S. Weisberg (2010). An R Companion to Applied Regression. SAGE Publications, Inc, p. 472. isbn: 141297514X (cit. on p. 471).
Hillebrand, J. and M. H. Nierhoff (2015). Mastering RStudio - Develop, Communicate,
and Collaborate with R. Packt Publishing. 348 pp. isbn: 9781783982554 (cit. on
p. 5).
Ihaka, R. and R. Gentleman (1996). “R: A Language for Data Analysis and Graphics”.
In: J. Comput. Graph. Stat. 5, pp. 299–314 (cit. on p. 471).
Kernighan, B. W. and P. J. Plauger (1981). Software Tools in Pascal. Reading, Massachusetts: Addison-Wesley Publishing Company, p. 366 (cit. on p. 165).
Knuth, D. E. (1984). “Literate programming”. In: The Computer Journal 27.2, pp. 97–
111 (cit. on p. 65).
Lamport, L. (1994). LATEX: a document preparation system. English. 2nd ed. Reading:
Addison-Wesley, p. 272. isbn: 0-201-52983-1 (cit. on p. 65).
Loo, M. P. van der and E. de Jonge (2012). Learning RStudio for R Statistical Computing.
1st ed. Birmingham: Packt Publishing, p. 126. isbn: 9781782160601 (cit. on p. 5).
Matloff, N. (2011). The Art of R Programming: A Tour of Statistical Software Design.
No Starch Press, p. 400. isbn: 1593273843 (cit. on pp. 91, 105, 464, 471).
Murrell, P. (2011). R Graphics, Second Edition (Chapman & Hall/CRC The R Series).
CRC Press, p. 546. isbn: 1439831769 (cit. on pp. 94, 179, 471).
Paradis, E. (2005). R for Beginners. Montpellier. 76 pp. url: https : / / cran . r -
project.org/doc/contrib/Paradis-rdebuts_en.pdf (visited on 07/17/2016)
(cit. on p. 471).
Peng, R. D. (2016). R Programming for Data Science. Leanpub. 182 pp. url: https:
//leanpub.com/rprogramming (visited on 08/07/2016) (cit. on p. 471).
Peng, R. D., S. Kross, and B. Anderson (2017). Mastering Software Development in R.
Leanpub. url: https://leanpub.com/msdr (cit. on pp. 153, 471).
Pinheiro, J. C. and D. M. Bates (2000). Mixed-Effects Models in S and S-Plus. New York:
Springer (cit. on p. 471).
Rosenblatt, B. (1993). Learning the Korn Shell. English. Sebastopol: O’Reilly and Asso-
ciates, p. 337. isbn: 1-56592-054-6 (cit. on p. 165).
Sarkar, D. (2008). Lattice: Multivariate Data Visualization with R. 1st ed. Springer,
p. 268. isbn: 0387759689 (cit. on pp. 94, 179, 180).
Somasundaram, R. (2013). Git. Packt Publishing. 180 pp. isbn: 1849517525 (cit. on
p. 12).
Swicegood, T. (2010). Pragmatic Guide to Git. Pragmatic Programmers, LLC. isbn: 978-
1-934356-72-2 (cit. on p. 12).
Teetor, P. (2011). R Cookbook. 1st ed. Sebastopol: O’Reilly Media, p. 436. isbn:
9780596809157 (cit. on p. 471).


Tufte, E. R. (1983). The Visual Display of Quantitative Information. Graphics Press, p. 197. isbn: 0-9613921-0-X (cit. on pp. 227, 417).
Venables, W. N. and B. D. Ripley (2000). S Programming. Statistics and Computing.
New York: Springer, pp. x + 264. isbn: 0 387 98966 8 (cit. on p. 471).
Wickham, H. (2014a). Advanced R. Chapman & Hall/CRC The R Series. CRC Press. isbn:
9781466586970 (cit. on p. 69).
– (2014b). Advanced R. Chapman & Hall/CRC The R Series. CRC Press. isbn:
9781466586970 (cit. on pp. 464, 471).
– (2015). R Packages. O’Reilly Media. isbn: 9781491910542 (cit. on pp. 90, 464, 471).
Wickham, H. (2014c). “Tidy Data”. In: Journal of Statistical Software 59.10. issn: 1548-
7660. url: http://www.jstatsoft.org/v59/i10 (cit. on p. 153).
Wickham, H. and G. Grolemund (2017). R for Data Science. O’Reilly. isbn: 978-1-4919-
1039-9. url: http://r4ds.had.co.nz/ (visited on 02/11/2017) (cit. on pp. 143,
153, 471).
Wickham, H. and C. Sievert (2016). ggplot2: Elegant Graphics for Data Analysis. 2nd ed.
Springer. XVI + 260. isbn: 978-3-319-24277-4. doi: 10.1007/978-3-319-24277-4
(cit. on pp. 179, 180, 284, 471).
Xie, Y. (2013). Dynamic Documents with R and knitr. The R Series. Chapman and
Hall/CRC, p. 216. isbn: 1482203537 (cit. on pp. 65, 471).
– (2016). bookdown: Authoring Books and Technical Documents with R Markdown.
Chapman & Hall/CRC The R Series. Chapman and Hall/CRC. isbn: 9781138700109
(cit. on p. 65).
Zuur, A. F., E. N. Ieno, and E. Meesters (2009). A Beginner’s Guide to R. 1st ed. Springer,
p. 236. isbn: 0387938362 (cit. on p. 471).

475
Index

(, 42 apply , 88
+, 42 apply(), 88, 144, 145, 148–150
-, 42 arrange(), 161
->, 18 array(), 145
<-, 17 as.cimg(), 454
=, 18 as.data.frame(), 156, 444, 446, 450,
[[ ]], 57 458
$, 55, 57 as.integer(), 41
%<>%, 169 as.tibble(), 153
%>%, 161, 166, 169, 455 assignment, 17
%T>%, 170 chaining, 18
%$%, 170 leftwise, 18
MiKTEX, 89 attr(), 214
attributes(), 134, 165
abs(), 25, 42 autoplot(), 359, 361
aes(), 209, 381, 383, 447, 451 AWK, 469
aesthetics (ggplot), see plots,
aesthetics B(), 458
aggregate(), 164 B<-(), 458
all(), 28 basename(), 110, 112
analysis of covariance, 102 bash, 469
analysis of variance, 101 bash, 165
ANCOVA, see analysis of covariance Bio7, 14
‘animation’, 345 bold(), 297
annotate, 270 bolditalic(), 297
annotate(), 269, 270 Bookdown, 65
annotations (ggplot), see plots, Bookdown, 13
annotations ‘Bookdown’, 469
ANOVA, see analysis of variance Boolean arithmetic, 27
anova(), 95 box plots, see plots, box and whiskers
anti_join(), 171 plot
any(), 28 bquote(), 301
‘anytime’, 258 break(), 85

477
INDEX

‘broom’, 392 code


byte compiler, 7 benchmarking, 464
byte_format(), 353 optimization, 463
performance, 464
C, 7, 12, 13, 19, 42, 48, 82, 90, 300, profiling, 464
463, 465, 467 writing style, 464
c col2rgb(), 325
compiler, 13 color
C++, 13 definitions, 263–266
c, 20 using, 263–267
C++, 6, 7, 12, 13, 42, 83, 90, 300, 441, color maps, 340
458–460, 463, 465–467 color palettes, 339, 340
cat, 165 colors(), 319, 325
cat(), 37, 118 colour, see color
categorical variables, see factors command shell, 469
cbind(), 175 comparison operators, 29
ceiling(), 41, 42 compiler, 7
character, 42 compute_group(), 398
character strings, 35 conditional execution, 74
CImg, 441, 458–460 console, 15
class contains(), 163
character, 35–37 control of execution flow, 74
logical, 27–35 coord_fixed(), 445, 447, 451
numeric, 26 coord_polar(), 272
class(), 53, 134 coordinates
classes, 69 polar, 272
classes and modes coordinates (ggplot), see plots,
character, 42 coordinates
data.frame, 120 ‘cowplot’, 307, 422
double, 19
ggplot, 212, 214, 242, 269, 289, data
291, 305, 308, 321, 356, 369, bitmaps, 459
373, 387, 438, 439 exploring at the console, 92
integer, 19 loading data sets, 91
logical, 75–77 raster image, 459
numeric, 18, 22, 34, 76 data frames, 52
tibble, 120, 132, 136, 138, 153, data(), 91
154, 164 data.frame, 120
vector, 19 data.frame(), 154, 158, 330

478
INDEX

‘data.table’, 108, 465 for , 79


density plots, 233 ‘foreign’, 130, 132, 140
devices format(), 42, 300
output, see graphic output devices ‘fortify’, 359
‘devtools’, 89 fortify(), 436
dim(), 134 FORTRAN, 7, 12, 13, 90, 465, 467
dimnames(), 134 fromJSON(), 142
dir(), 113 full_join(), 171, 175
dirname(), 111 function(), 66
double, 19 functions
double(), 19 abs(), 25, 42
download.file(), 141 aes(), 209, 381, 383, 447, 451
‘dplyr’, 160, 163, 171, 175 aggregate(), 164
all(), 28
Eclipse, 14
annotate, 270
editor for R scripts, 14
annotate(), 269, 270
ends_with(), 163
anova(), 95
EPS, see machine arithmetic precision
anti_join(), 171
examples
any(), 28
modular plot construction,
apply(), 88, 144, 145, 148–150
304–309
arrange(), 161
Excel, xiii
array(), 145
excel_sheets(), 125
as.cimg(), 454
expand_limits(), 312
as.data.frame(), 156, 444, 446,
expression(), 292, 294, 296–298
450, 458
‘extrafont’, 205
as.integer(), 41
facet_grid(), 242, 247 as.tibble(), 153
facet_grid_paginate(), 370 attr(), 214
facet_wrap(), 242, 247, 248, 277 attributes(), 134, 165
facet_wrap_paginate(), 370 autoplot(), 359, 361
facet_zoom(), 370 B(), 458
facets (ggplot), see plots, facets B<-(), 458
factor(), 47 base R, 59
factors, 47 basename(), 110, 112
file.path(), 113 bold(), 297
filter(), 162 bolditalic(), 297
font.add(), 332 bquote(), 301
font.add.google(), 334 break(), 85
font.families(), 332 byte_format(), 353

479
INDEX

c, 20 function(), 66
cat(), 37, 118 G(), 458
cbind(), 175 G<-(), 458
ceiling(), 41, 42 gather(), 159, 160, 175
class(), 53, 134 geocode(), 433
col2rgb(), 325 geom_arc(), 369
colors(), 319, 325 geom_arcbar(), 369
compute_group(), 398 geom_area(), 263
contains(), 163 geom_bar(), 216, 262, 272
coord_fixed(), 445, 447, 451 geom_barh(), 348
coord_polar(), 272 geom_bezier(), 369
data(), 91 geom_bin2d(), 236
data.frame(), 154, 158, 330 geom_bkde(), 353
defining new, 65 geom_bkde2d(), 353
dim(), 134 geom_boxplot(), 240
dimnames(), 134 geom_boxploth(), 348
dir(), 113 geom_bspline(), 369
dirname(), 111 geom_circle(), 369
double(), 19 geom_col(), 262
download.file(), 141 geom_crossbarh(), 348
ends_with(), 163 geom_debug(), 371, 389, 390, 400
excel_sheets(), 125 geom_dumbbell(), 353
expand_limits(), 312 geom_edges(), 362
expression(), 292, 294, 296–298 geom_encircle(), 353
facet_grid(), 242, 247 geom_errorbar, 227
facet_grid_paginate(), 370 geom_errorbarh(), 348
facet_wrap(), 242, 247, 248, 277 geom_hex(), 236
facet_wrap_paginate(), 370 geom_histogram(), 234
facet_zoom(), 370 geom_histogramh(), 348
factor(), 47 geom_hline(), 262, 270, 374
file.path(), 113 geom_label(), 199, 203, 262,
filter(), 162 335, 374, 379–381, 403–405
font.add(), 332 geom_label_repel(), 380, 403,
font.add.google(), 334 404, 408
font.families(), 332 geom_line(), 181, 194, 262, 378,
format(), 42, 300 411
fortify(), 436 geom_linerange(), 226
fromJSON(), 142 geom_linerangeh(), 348
full_join(), 171, 175 geom_link(), 369

480
INDEX

geom_link2(), 369 ggscreeplot(), 351


geom_lollipop(), 353 ggtern(), 420
geom_net(), 366 ggtitle(), 207, 210
geom_nodes(), 362 glm(), 102
geom_nodetext(), 362 grayscale(), 443, 449
geom_null(), 371 grep(), 263
geom_path(), 369 grepl(), 263
geom_point(), 181, 183, 224, group_by(), 164
242, 262, 374, 433 hcl(), 266
geom_pointrange, 226 head(), 92, 93
geom_pointrange(), 218 identical(), 156
geom_pointrangeh(), 348 ifelse(), 77, 78, 452
geom_polygon(), 438 inner_join(), 171
geom_raster(), 445–447, 451 install.packages(), 89
geom_rug(), 374 invisible(), 168
geom_segment(), 369 is.numeric(), 18, 22
geom_sina(), 368, 369 is.tibble(), 153
geom_smooth(), 274 italic(), 297
geom_stateface(), 353 label_bquote(), 247
geom_stepribbon(), 353 labs(), 205, 209, 291
geom_text(), 199, 202, 203, 262, lapply(), 88, 93, 144, 145, 468
291, 292, 335, 374, 379, 381, left_join(), 171
403, 433 length(), 22
geom_text_repel(), 403 lis.dirs(), 112
geom_tile(), 214 list.dirs(), 112
geom_violin(), 369 list.files(), 112
geom_violinh(), 348 lm(), 95, 229, 231
geom_vline(), 262, 270, 374 load.image(), 441
geom_xspline(), 353 ls(), 22, 110
get_map(), 428, 435 matches(), 163
getwd(), 111 mean(), 149
gg_animate(), 345, 346 mode(), 38, 134
ggbiplot(), 351 mutate(), 160
ggcolorchart(), 319 my_print(), 71
ggmap(), 428 names(), 92, 134, 163
ggMarginal(), 358 names<-(), 163
ggMargins(), 358 nc_open(), 135
ggplot(), 62, 209, 210, 298, 305, ncol(), 92, 134
361, 411, 420 ncvar_get(), 135

481
INDEX

nPix(), 460 rename(), 163


nrow(), 92, 134 resize(), 446
numeric(), 19 return(), 66
object.size(), 460 reverselog_trans(), 315
open.nc(), 138 rgb(), 265
order(), 161, 325 rgb2hsv(), 325
ordered(), 47 right_join(), 171
pal.bands(), 341 rlm(), 390
pal.channels(), 341 rm(), 22
pal.safe(), 342 round(), 40
parse(), 293, 295–298 sapply(), 88, 93, 144, 145
paste(), 292, 295 save(), 115, 176
plain(), 297 scale_color_continuous(),
plot(), 70, 442, 444, 448, 454 182, 266
power_trans(), 370 scale_color_date(), 266
prcomp(), 351 scale_color_datetime(), 266
print(), 16, 37, 50, 62, 82, 85, scale_color_discrete(), 266
92, 135, 155, 167, 168, 170, scale_color_gradient(), 266
212, 345 scale_color_gradient2(), 266
print.nc(), 138 scale_color_gradientn(), 266
R(), 458 scale_color_grey(), 266
R<-(), 458 scale_color_hue(), 266
radial_trans(), 370 scale_color_identity(), 266
read.csv(), 91, 114–117, 120 scale_color_manual(), 344
read.csv2(), 115–117 scale_colour_identity(), 250
read.fortran(), 114 scale_colour_manual(), 250
read.fwf(), 114 scale_fill_gradient2(), 310
read.spss(), 131 scale_fill_identity(), 267
read.table(), 91, 115, 117, 119, scale_fill_pokemon(), 353
120, 176 scale_fill_viridis(), 337,
read.xlsx(), 126, 128 340
read.xlsx2(), 128 scale_x_continuous(), 249,
read_csv(), 120, 124 252
read_delim(), 121 scale_x_discrete(), 262
read_excel(), 125 scale_x_log10(), 253
read_html(), 129 scale_x_reverse(), 253
read_sav(), 133 scale_y_continuous(), 252
read_table(), 120, 121, 176 scale_y_log(), 253
rel(), 283 scale_y_log10(), 253


scan(), 91 stat_function(), 195


select(), 162, 163 stat_glance(), 389, 390
SEM(), 67, 69 stat_identity(), 216
semi_join(), 171 stat_index(), 411
seq(), 20 stat_peaks(), 371, 407
set.seed(), 364 stat_poly_eq(), 371, 378, 381
setwd(), 111 stat_rollapplyr(), 411
showtext.auto(), 331 stat_seas(), 411
showtext.begin(), 331 stat_smooth(), 181, 228, 388
showtext.end(), 331 stat_stl(), 411
signif(), 40, 41 stat_summary(), 181, 218, 224,
simple_SEM(), 69 226
slice(), 162 stat_valleys(), 371
sort(), 161, 325 stat_xdensity(), 348
sprintf(), 42, 300 stat_xspline(), 353
starts_with(), 163 str(), 49–51, 92, 117, 134, 137
stat_ash(), 353 str_extract(), 160
stat_bin(), 276 strftime(), 300, 376
stat_binh(), 348 subset(), 57, 162
stat_binhex(), 236 substitute(), 302
stat_bkde(), 353 sum(), 69, 143, 148
stat_bkde2d(), 353 summarise(), 164, 175
stat_boxploth(), 348 summarize(), 170
stat_count(), 216 summary(), 93–95
stat_counth(), 348 sump(), 170
stat_debug(), 398 switch(), 76
stat_debug_group(), 371 t(), 149, 150
stat_debug_panel(), 371 tail(), 92, 93
stat_decomp(), 411 theme(), 287, 291
stat_dens2d_filter(), 394, theme_blank(), 362, 370
395, 397 theme_classic(), 280
stat_dens2d_filter_g(), 397 theme_dark(), 280
stat_dens2d_label(), 398, 408 theme_economist(), 418
stat_density(), 276 theme_gdocs(), 418
stat_fit_augment(), 371 theme_gray(), 289
stat_fit_deviations(), 371, theme_grey(), 278, 287, 307
392 theme_linedraw(), 280
stat_fit_glance(), 371, 388 theme_minimal(), 280, 287, 289
stat_fit_residuals(), 394 theme_net(), 366


theme_no_axes(), 370 functions: arguments, 66


theme_nomask(), 421
theme_null(), 370 G(), 458
theme_tufte(), 418 G<-(), 458
theme_void(), 280, 440, 445, gather(), 159, 160, 175
447, 451 generalized linear models, 102
threshold(), 444 geocode(), 433
tibble(), 153, 154, 158, 330 geom, see plots, geometries
tol(), 342, 344 geom_arc(), 369
tolower(), 262 geom_arcbar(), 369
toupper(), 262 geom_area(), 263
transmute(), 160 geom_bar(), 216, 262, 272
trunc(), 41, 42 geom_barh(), 348
try_tibble(), 371 geom_bezier(), 369
tsdf(), 411, 414 geom_bin2d(), 236
unlist(), 50, 52 geom_bkde(), 353
update.packages(), 89 geom_bkde2d(), 353
update_labels(), 209 geom_boxplot(), 240
vapply(), 144, 151 geom_boxploth(), 348
var(), 68 geom_bspline(), 369
var.get.nc(), 138 geom_circle(), 369
viridis(), 344 geom_col(), 262
write(), 115 geom_crossbarh(), 348
write.csv(), 114, 115 geom_debug(), 371, 389, 390, 400
write.csv2(), 115 geom_dumbbell(), 353
write.table(), 115, 117 geom_edges(), 362
write.xlsx(), 128 geom_encircle(), 353
write_csv(), 123 geom_errorbar, 227
write_delim(), 123 geom_errorbarh(), 348
write_excel_csv(), 123 geom_hex(), 236
write_file(), 123, 124 geom_histogram(), 234
write_tsv(), 122 geom_histogramh(), 348
xlab(), 206 geom_hline(), 262, 270, 374
xlim(), 217, 250, 312 geom_label(), 199, 203, 262, 335,
xml_find_all(), 130 374, 379–381, 403–405
xml_text(), 130 geom_label_repel(), 380, 403, 404,
ylab(), 206 408
ylim, 217 geom_line(), 181, 194, 262, 378, 411
ylim(), 250, 312 geom_linerange(), 226


geom_linerangeh(), 348 ‘gganimate’, xv, 344, 345


geom_link(), 369 ‘ggbiplot’, xv, 350, 351
geom_link2(), 369 ggbiplot(), 351
geom_lollipop(), 353 ggcolorchart(), 319
geom_net(), 366 ‘ggCompNet’, 365
geom_nodes(), 362 ‘ggedit’, 422
geom_nodetext(), 362 ‘ggExtra’, 356
geom_null(), 371 ‘ggforce’, xv, 367, 368
geom_path(), 369 ‘ggfortify’, 358, 359
geom_point(), 181, 183, 224, 242, ‘ggimage’, 422
262, 374, 433 ‘ggiraph’, 422
geom_pointrange, 226 ‘ggiraphExtra’, 422
geom_pointrange(), 218 ‘gglogo’, 422
geom_pointrangeh(), 348 ‘ggmap’, 242, 425, 427, 435
geom_polygon(), 438 ggmap(), 428
geom_raster(), 445–447, 451 ggMarginal(), 358
geom_rug(), 374 ggMargins(), 358
geom_segment(), 369 ‘ggmosaic’, 422
geom_sina(), 368, 369 ‘ggnetwork’, 362
geom_smooth(), 274 ‘ggparallel’, 422
geom_stateface(), 353
geom_stepribbon(), 353 ggplot, 212, 214, 242, 269, 289, 291,
geom_text(), 199, 202, 203, 262, 291, 305, 308, 321, 356, 369, 373,
292, 335, 374, 379, 381, 403, 387, 438, 439
433 ggplot(), 62, 209, 210, 298, 305, 361,
geom_text_repel(), 403 411, 420
geom_tile(), 214 ‘ggplot2’, xv, xvi, 94, 152, 153,
geom_violin(), 369 179–181, 205, 208, 216, 220,
geom_violinh(), 348 223, 242, 247, 250, 266, 270,
geom_vline(), 262, 270, 374 278, 285–287, 305, 307, 309,
geom_xspline(), 353 310, 313, 316, 319, 326, 329,
geometries (ggplot), see plots, 330, 348, 359, 362, 365, 368,
geometries 371, 373, 398, 402, 407, 410,
‘geomnet’, 365, 366 411, 415, 417, 420–422, 427,
get_map(), 428, 435 444, 445, 447, 450
getwd(), 111 ‘ggpmisc’, xv, 370, 371, 373, 378, 398,
gg_animate(), 345, 346 407, 408, 411
‘GGally’, 365 ‘ggraph’, 422
‘ggalt’, xv, 352, 353 ‘ggrepel’, xv, 374, 402, 403, 408


‘ggsci’, 414, 415 plotting, 441


ggscreeplot(), 351 processing, 441
‘ggseas’, 373, 411, 414 ‘inline’, 466
‘ggsignif’, 422 inner_join(), 171
‘ggsn’, 422 install.packages(), 89
‘ggspatial’, 422 integer, 19
‘ggspectra’, 242, 305, 422 interpreter, 7
‘ggstance’, xv, 347, 348 invisible(), 168
‘ggtern’, xv, 182, 242, 419, 420 is.numeric(), 18, 22
ggtern(), 420 is.tibble(), 153
‘ggthemes’, 417 italic(), 297
ggtitle(), 207, 210
Git, 12
GLM, see generalized linear models
‘knitr’, 9, 12, 65, 331, 469
glm(), 102
ksh, 165
grammar of graphics, 181, 304
graphic output devices, 303–304 label_bquote(), 247
grayscale(), 443, 449 labs(), 205, 209, 291
grep(), 263 lapply(), 88, 93, 144, 145, 468
grepl(), 263 LaTeX, 13
‘grid’, 179 ‘Lattice’, 180
‘gridExtra’, 358 ‘lattice’, 179, 242
group_by(), 164 left_join(), 171
length(), 22
‘haven’, 130, 132, 140
limits
hcl(), 266
coordinate, 221
head(), 92, 93
scale, 221
‘Hmisc’, 223
linear models, 95
‘hrbrthemes’, 422
linear regression, 95
IDE for R, 14
identical(), 156 list.dirs(), 112
if, 74 list.files(), 112
ifelse, 77 lists, 48
ifelse(), 77, 78, 452 literate programming, 65
ImageJ, 14 LM, see linear models
ImageMagick, 345, 443 lm(), 95, 229, 231
‘imager’, 441, 442, 446, 448, 458–460 load.image(), 441
images logical, 75–77


logical operators, 27 numbers


logical values, 27 double, 26
ls, 165 floating point, 24
ls(), 22, 110 interger, 26
LuaTEX, 13 numeric, 18, 22, 34, 76
‘lubridate’, 258 numeric values, 16
numeric(), 19
machine arithmetic precision, 32
‘magrittr’, 166, 170 object
marginal density plots, 356 mode, 37
marginal histograms, 356 object.size(), 460
marginal plots, 356 objects, 69
Markdown, 65 open.nc(), 138
Markdown, 13 operators
‘MASS’, 390 +, 42
matches(), 163 -, 42
math functions, 16 ->, 18
math operators, 16 <-, 17
mean(), 149 =, 18
methods, 69 [[ ]], 57
mode $, 55, 57
numeric, 16 %<>%, 169
mode(), 38, 134 %>%, 161, 166, 169, 455
models %T>%, 170
linear, 95 %$%, 170
MS-Windows, 89, 110 order(), 161, 325
mutate(), 160 ordered(), 47
my_print(), 71 Origin, xiii
OS X, 89
names(), 92, 134, 163
names<-(), 163 packages
nc_open(), 135 ‘animation’, 345
‘ncdf4’, 135, 141 ‘anytime’, 258
ncol(), 92, 134 ‘Bookdown’, 469
ncvar_get(), 135 ‘broom’, 392
netiquette, 11 ‘cowplot’, 307, 422
network etiquette, 11 ‘data.table’, 108, 465
network graphs, 362, 365 ‘devtools’, 89
nPix(), 460 ‘dplyr’, 160, 163, 171, 175
nrow(), 92, 134 ‘extrafont’, 205


‘foreign’, 130, 132, 140 ‘ggspectra’, 242, 305, 422


‘fortify’, 359 ‘ggstance’, xv, 347, 348
‘geomnet’, 365, 366 ‘ggtern’, xv, 182, 242, 419, 420
‘GGally’, 365 ‘ggthemes’, 417
‘ggalt’, xv, 352, 353 ‘grid’, 179
‘gganimate’, xv, 344, 345 ‘gridExtra’, 358
‘ggbiplot’, xv, 350, 351 ‘haven’, 130, 132, 140
‘ggCompNet’, 365 ‘Hmisc’, 223
‘ggedit’, 422 ‘hrbrthemes’, 422
‘ggExtra’, 356 ‘imager’, 441, 442, 446, 448,
‘ggforce’, xv, 367, 368 458–460
‘ggfortify’, 358, 359 ‘inline’, 466
‘ggimage’, 422 ‘jsonlite’, 142
‘ggiraph’, 422 ‘knitr’, 9, 12, 65, 331, 469
‘ggiraphExtra’, 422 ‘Lattice’, 180
‘gglogo’, 422 ‘lattice’, 179, 242
‘ggmap’, 242, 425, 427, 435 ‘lubridate’, 258
‘ggmosaic’, 422 ‘magrittr’, 166, 170
‘ggnetwork’, 362 ‘MASS’, 390
‘ggparallel’, 422 ‘ncdf4’, 135, 141
‘pals’, 339–341, 415
‘ggplot2’, xv, xvi, 94, 152, 153, ‘quantmod’, 411
179–181, 205, 208, 216, 220, ‘Rcpp’, 90, 458, 466, 469
223, 242, 247, 250, 266, 270, ‘readr’, 114, 119, 140, 153, 176
278, 285–287, 305, 307, 309, ‘readxl’, 124, 140, 153
310, 313, 316, 319, 326, 329, ‘rJava’, 467
330, 348, 359, 362, 365, 368, ‘rmarkdown’, 469
371, 373, 398, 402, 407, 410, ‘RNetCDF’, 135, 138
411, 415, 417, 420–422, 427, ‘rPython’, 467
444, 445, 447, 450 ‘scales’, 185, 257
‘ggpmisc’, xv, 370, 371, 373, 378, ‘Shiny’, 469
398, 407, 408, 411 ‘showtext’, xv, 200, 205, 330, 331
‘ggraph’, 422 ‘stringr’, 160
‘ggrepel’, xv, 374, 402, 403, 408 ‘Sweave’, 65
‘ggsci’, 414, 415 ‘tibble’, 153, 330
‘ggseas’, 373, 411, 414 ‘tidyquant’, 410
‘ggsignif’, 422 ‘tidyr’, 159, 175
‘ggsn’, 422 ‘tidyverse’, 119, 128, 143, 152,
‘ggspatial’, 422 153, 158, 162, 166, 170


‘tikz’, 304 annotations, 269


‘tikzDevice’, 304 fitted model labels, 378
‘TTR’, 411 arcs, curves and circles, 369
using, 89 axis position, 267
‘utils’, 114, 141, 460 b-splines, 369
‘viridis’, xv, 336, 341, 355 bar plot, 216–217
‘xlsx’, 126, 140 base R graphics, 94
‘XML’, 130 Bezier curves, 369
‘xml2’, 129, 153 bitmap output, 303
‘xts’, 371, 411 box and whiskers plot, 240
‘zoo’, 411 horizontal, 350
pal.bands(), 341 caption, 205–214
pal.channels(), 341 circular, 272–278
pal.safe(), 342 color palettes, 336, 339, 340
‘pals’, 339–341, 415 consistent format using functions,
parse(), 293, 295–298 305
Pascal, 6 coordinates, 182
paste(), 292, 295 polar, 272
pdfTeX, 13 ternary, 419
performance, 463 debugging, 398
Perl, 469 density plot
plain(), 297 1 dimension, 237–238, 354
plot 2 dimensions, 238–240, 354
scales, 269 dumbell plot, 352
plot(), 70, 442, 444, 448, 454 facets, 242–249
plotmath, 291 pagination, 370
plots zooming, 370
additional colour palettes, 414 filter observations by density, 394
advanced examples, 309 fitted curves, 228–233
Anscombe’s linear regression deviations, 392
plots, 316–319, 422–423 equation annotation, 378
color patches, 319–325 residuals, 392, 394
heatmap plot, 309–311, 423 fitted models, 359
quadrat plot, 311–313, 423 fonts, 205, 331
selected repulsive text, 407 formatters
volcano plot, 313–316, 423 byte, 352
World map, 435 geographical maps, see plots,
aesthetics, 181 maps
animation, 344, 345 geometries, 181


arc, 369 from Natural Earth, 435


arcbar, 369 projection, 436
barh, 347 Robinson projection, 436
bezier, 369 shape files, 435
boxploth, 347 marginal, 356
bspline, 369 math expressions, 291–303
circle, 369 maths in, 199–205
crossbarh, 347 network graphs, 362, 365
debug, 398 output to files, 303
dumbbell, 352 panels, see plots, facets
encircle, 352 PDF output, 303
errorbarh, 347 pie charts, 272–273
histogramh, 347 plotting functions, 195–199
linerangeh, 347 positions
link, 369 dodgev, 347
link2, 369 fillv, 347
lollipop, 352 jitterdodgev, 347
pointrangeh, 347 nudgev, 347
repulsive label, 402 stackv, 347
repulsive text, 402 Postscript output, 303
sina, 368 principal components, 350, 351
stateface, 352 printing, 303
step ribbon, 352 raster images, 441
violinh, 347 reusing parts of, 305
x-spline, 353 rug marging, 233–234
histogram satellite images
horizontal, 348 from Google, 428
histograms, 234–237 saving, 303
horizontal geometries, 347 scales, 182, 249
horizontal positions, 347 color, 266–267, 336
horizontal statistics, 347 discrete, 258
interpolation, 369 fill, 266–267, 336, 352
labels, 205–214 limits, 258
layers, 304 size, 262
line plot, 194–195 scatter plot, 182–194
lollipop plot, 352 secondary axes, 268
maps, 427 sina plot, 368
data overlay layer, 433 smooth curves, 228–233, 353
from Google Maps, 428 statistics, 181, 218–228


binh, 347 math operations, 24


boxploth, 347 print(), 16, 37, 50, 62, 82, 85, 92,
counth, 347 135, 155, 167, 168, 170, 212,
debug, 398 345
density, 237 print.nc(), 138
density 2d, 238 programing languages
function, 195 (, 42
peaks, 374 AWK, 469
smooth, 228 Bookdown, 13
summary, 218 C, 7, 12, 13, 48, 82, 90, 300, 463,
valleys, 374 465, 467
x-spline, 353 C++, 7, 12, 13, 83, 90, 300, 441,
xdensity, 347 458–460, 463, 465–467
step ribbon plot, 352 FORTRAN, 7, 12, 13, 90, 465, 467
subtitle, 205–214 Java, 7, 8, 90, 465, 467–469
ternary plots, 419 Markdown, 13
text in, 199–205, 331, 402 Perl, 469
themes, 182, 278–291, 307, 417 Python, 7, 8, 134, 465, 467
creating, 285–291 R, xi, xiv, 6, 7, 15, 19, 20, 38, 42,
modifying, 282–285 48, 69, 70, 76, 79, 82–84,
no axes, 370 88–92, 94, 165, 458, 463–465
predefined, 278–282 R Markdown, 13
tile plot, 214–216 Rmarkdown, 13
time series, 371 S, 242
moving average, 410 programmes
seasonal decomposition, 411 MiKTEX, 89
tibble, 410 bash, 165
title, 205–214 Bio7, 14
transformations C, 19, 42
power, 370 C++, 6, 42
radial, 370 cat, 165
reverser, 370 CImg, 441, 458–460
using LaTeX, 304 Eclipse, 14
violin plot, 241 Excel, xiii
wind rose, 273–278 Git, 12
portability, 205 ImageJ, 14
power_trans(), 370 ImageMagick, 345, 443
prcomp(), 351 ksh, 165
precision ls, 165


MS-Windows, 89, 110 read.csv2(), 115–117


Origin, xiii read.fortran(), 114
OS X, 89 read.fwf(), 114
Pascal, 6 read.spss(), 131
R, xi–xiii, xvi, 1, 7–11, 14, 15, 18, read.table(), 91, 115, 117, 119, 120,
19, 21, 22, 24, 27, 31, 32, 36, 176
37, 42, 45, 47, 48, 53, 55, read.xlsx(), 126, 128
61–66, 69, 74, 80, 89–91, 93, read.xlsx2(), 128
465 read_csv(), 120, 124
RGUI, 10 read_delim(), 121
RStudio, xii, 5, 6, 10, 13–15, 62–65, read_excel(), 125
90, 303, 464 read_html(), 129
RTools, 89 read_sav(), 133
Rtools, 13 read_table(), 120, 121, 176
SAS, 130, 132 ‘readr’, 114, 119, 140, 153, 176
sh, 165 ‘readxl’, 124, 140, 153
SPSS, xiii, 130–133 recycling of arguments, 21, 79
Stata, 130, 132 rel(), 283
Systat, xiii rename(), 163
WEB, 65 resize(), 446
Python, 7, 8, 134, 465, 467 return(), 66
reverselog_trans(), 315
‘quantmod’, 411
revision control, 12
R rgb(), 265
design, 7 rgb2hsv(), 325
extensibility, 465 RGUI, 10
help, 10 right_join(), 171
R, xi–xiv, xvi, 1, 6–11, 14, 15, 18–22, ‘rJava’, 467
24, 27, 31, 32, 36–38, 42, 45, rlm(), 390
47, 48, 53, 55, 61–66, 69, 70, rm(), 22
74, 76, 79, 80, 82–84, 88–94, Rmarkdown, 13
165, 458, 463–465 ‘rmarkdown’, 469
R Markdown, 13 ‘RNetCDF’, 135, 138
R(), 458 round(), 40
R<-(), 458 ‘rPython’, 467
radial_trans(), 370 RStudio, xii, 5, 6, 10, 13–15, 62–65, 90,
‘Rcpp’, 90, 458, 466, 469 303, 464
read.csv(), 91, 114–117, 120 RTools, 89


Rtools, 13 readability, 64
sourcing, 62
S, 242 writing, 63
sapply(), 88, 93, 144, 145 select(), 162, 163
SAS, 130, 132 SEM(), 67, 69
save(), 115, 176 semi_join(), 171
scale_color_continuous(), 182, seq(), 20
266 sequence, 20
scale_color_date(), 266 set.seed(), 364
scale_color_datetime(), 266 setwd(), 111
scale_color_discrete(), 266 sh, 469
scale_color_gradient(), 266 sh, 165
scale_color_gradient2(), 266 ‘Shiny’, 469
scale_color_gradientn(), 266 ‘showtext’, xv, 200, 205, 330, 331
scale_color_grey(), 266 showtext.auto(), 331
scale_color_hue(), 266 showtext.begin(), 331
scale_color_identity(), 266 showtext.end(), 331
scale_color_manual(), 344 signif(), 40, 41
scale_colour_identity(), 250 simple_SEM(), 69
scale_colour_manual(), 250 slice(), 162
scale_fill_gradient2(), 310 sort(), 161, 325
scale_fill_identity(), 267 SPPS, 130
scale_fill_pokemon(), 353 sprintf(), 42, 300
scale_fill_viridis(), 337, 340 SPSS, xiii, 131–133
scale_x_continuous(), 249, 252 StackOverflow, 11
scale_x_discrete(), 262 starts_with(), 163
scale_x_log10(), 253 stat , see plots, statistics
scale_x_reverse(), 253 stat_ash(), 353
scale_y_continuous(), 252 stat_bin(), 276
scale_y_log(), 253 stat_binh(), 348
scale_y_log10(), 253 stat_binhex(), 236
scales stat_bkde(), 353
color, 263 stat_bkde2d(), 353
fill, 263 stat_boxploth(), 348
‘scales’, 185, 257 stat_count(), 216
scales (ggplot), see plots, scales stat_counth(), 348
scan(), 91 stat_debug(), 398
scripts, 61 stat_debug_group(), 371
definition, 61 stat_debug_panel(), 371


stat_decomp(), 411 switch(), 76


stat_dens2d_filter(), 394, 395, Systat, xiii
397
stat_dens2d_filter_g(), 397 t(), 149, 150
stat_dens2d_label(), 398, 408 tail(), 92, 93
stat_density(), 276 TeX, 13
stat_fit_augment(), 371 theme(), 287, 291
stat_fit_deviations(), 371, 392 theme_blank(), 362, 370
stat_fit_glance(), 371, 388 theme_classic(), 280
stat_fit_residuals(), 394 theme_dark(), 280
stat_function(), 195 theme_economist(), 418
stat_glance(), 389, 390 theme_gdocs(), 418
stat_identity(), 216 theme_gray(), 289
stat_index(), 411 theme_grey(), 278, 287, 307
stat_peaks(), 371, 407 theme_linedraw(), 280
stat_poly_eq(), 371, 378, 381 theme_minimal(), 280, 287, 289
stat_rollapplyr(), 411 theme_net(), 366
stat_seas(), 411 theme_no_axes(), 370
stat_smooth(), 181, 228, 388 theme_nomask(), 421
stat_stl(), 411 theme_null(), 370
stat_summary(), 181, 218, 224, 226 theme_tufte(), 418
stat_valleys(), 371 theme_void(), 280, 440, 445, 447,
stat_xdensity(), 348 451
stat_xspline(), 353 themes (ggplot), see plots, themes
Stata, 130, 132 threshold(), 444
statistics (ggplot), see plots, statistics ‘tibble’, 153, 330
str(), 49–51, 92, 117, 134, 137 tibble, 120, 132, 136, 138, 153, 154,
str_extract(), 160 164
strftime(), 300, 376 tibble(), 153, 154, 158, 330
‘stringr’, 160 ‘tidyquant’, 410
subset(), 57, 162 ‘tidyr’, 159, 175
substitute(), 302 ‘tidyverse’, 119, 128, 143, 152, 153,
Subversion, 12 158, 162, 166, 170
sum(), 69, 143, 148 ‘tikz’, 304
summarise(), 164, 175 tikz output device, see plots, using
summarize(), 170 LaTeX
summary(), 93–95 ‘tikzDevice’, 304
sump(), 170 time series
‘Sweave’, 65 conversion into data frame, 371


conversion into tibble, 371 WEB, 65


tol(), 342, 344 ‘worksheet’, see data frame
tolower(), 262 write(), 115
toupper(), 262 write.csv(), 114, 115
transmute(), 160 write.csv2(), 115
trunc(), 41, 42 write.table(), 115, 117
try_tibble(), 371 write.xlsx(), 128
tsdf(), 411, 414 write_csv(), 123
‘TTR’, 411 write_delim(), 123
type conversion, 38 write_excel_csv(), 123
write_file(), 123, 124
Unicode, 205
write_tsv(), 122
unlist(), 50, 52
update.packages(), 89
XeTeX, 13
update_labels(), 209
xlab(), 206
UTF-8, 205
xlim(), 217, 250, 312
‘utils’, 114, 141, 460
‘xlsx’, 126, 140
vapply(), 144, 151 ‘XML’, 130
var(), 68 ‘xml2’, 129, 153
var.get.nc(), 138 xml_find_all(), 130
variables, 17 xml_text(), 130
vector, 19 ‘xts’, 371, 411
vectorization, 79
vectorized arithmetic, 20 ylab(), 206
vectors, 42 ylim, 217
indexing, 42 ylim(), 250, 312
‘viridis’, xv, 336, 341, 355
viridis(), 344 ‘zoo’, 411
