
Data Visualization with ggplot2: aesthetic mappings, facets,
common problems, layered grammar of graphics.


Dr Diana Abdul Wahab

Sem II 2021/2022
The simple graph has brought more information to the data analyst’s mind than any other
device (John Tukey).
This chapter will teach you how to visualize your data using ggplot2. R has several systems
for making graphs, but ggplot2 is one of the most elegant and most versatile. ggplot2
implements the grammar of graphics, a coherent system for describing and building
graphs. With ggplot2, you can do more faster by learning one system and applying it in
many places. To access the datasets, help pages, and functions that we will use in this
chapter, load these packages:
install.packages(c("ggplot2","gcookbook","tidyverse"))

library(ggplot2)
library(gcookbook)
library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## v purrr   0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Scatterplot
Let’s use our first graph to answer a question: do cars with big engines use more fuel than
cars with small engines? You probably already have an answer, but try to make your
answer precise. What does the relationship between engine size and fuel efficiency look
like? Is it positive? Negative? Linear? Nonlinear?
?mpg

str(mpg)
## tibble [234 x 11] (S3: tbl_df/tbl/data.frame)
## $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
## $ model : chr [1:234] "a4" "a4" "a4" "a4" ...
## $ displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr [1:234] "f" "f" "f" "f" ...
## $ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr [1:234] "p" "p" "p" "p" ...
## $ class : chr [1:234] "compact" "compact" "compact" "compact" ...

ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
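
One way to make the answer to the opening question more precise is to put numbers next to the picture. The sketch below is an illustration rather than part of the original notes: it computes the correlation between displ and hwy and the slope of a straight-line fit. The object names r and fit are chosen here for illustration, and a straight line is only a rough summary if the relationship is nonlinear.

```r
library(ggplot2)  # provides the mpg dataset

# Pearson correlation between engine displacement (litres) and highway mpg;
# a negative value means larger engines tend to have lower highway mileage
r <- cor(mpg$displ, mpg$hwy)
r

# Slope of a straight-line fit: change in hwy per extra litre of displacement
fit <- lm(hwy ~ displ, data = mpg)
coef(fit)["displ"]
```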

Exercise:
1. Run ggplot(data = mpg). What do you see?
2. How many rows are in mtcars? How many columns?
3. What does the drv variable describe? Read the help for ?mpg to find out.
4. Make a scatterplot of hwy versus cyl.
Map an aesthetic of your points (color, size, transparency, or shape) to the class variable to reveal the class of each car:
# By color:
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))
# By size (ggplot2 warns that size is not advised for a discrete variable):
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, size = class))

# By transparency (alpha):
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
# By shape (only six shapes are used by default, so the seventh class is not plotted):
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class))

# Change all data points to blue:
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
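
A common problem at this point is putting color = "blue" inside aes(). The sketch below, added here as an illustration, contrasts the two forms and inspects the colours each plot actually draws with layer_data(). The names p_set and p_map are hypothetical.

```r
library(ggplot2)

# Setting: the colour is a fixed property of the layer, outside aes()
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# Mapping (usually a mistake here): "blue" is treated as a one-level variable,
# so ggplot2 picks a colour from its default scale and adds a legend
p_map <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

# Colours actually used for the points in each plot
unique(layer_data(p_set, 1)$colour)
unique(layer_data(p_map, 1)$colour)
```

Only the first plot reports "blue". In the second, the string is data rather than a colour name.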
# Add a smoothed trend line (loess by default for a data set of this size):
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
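
The message above shows that geom_smooth() chose a loess smoother. If a straight regression line is wanted instead, one option is to pass method = "lm". This is a minimal sketch and the name p_lm is chosen here for illustration.

```r
library(ggplot2)

# Same scatterplot, but with a least-squares line and its confidence band
p_lm <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth(method = "lm", formula = y ~ x)
p_lm
```

Because color is mapped only in the point layer, the line is fitted to all cars at once rather than once per class.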

Here the smooth line is fitted to just a subset of the mpg dataset, the subcompact cars, by giving geom_smooth() its own data argument:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth(
    data = filter(mpg, class == "subcompact"),
    se = FALSE
  )

Using the lm() function for regression


You have already created a fitted regression model object for a data set, and you want to
plot the lines for that model. Usually the easiest way to overlay a fitted model is to simply
ask stat_smooth() to do it for you. Sometimes, however, you may want to create the model
yourself and then add it to your graph. This allows you to be sure that the model you’re
using for other calculations is the same one that you see.
In this example, we’ll build a quadratic model using lm() with ageYear as a predictor of
heightIn. Then we’ll use the predict() function to find the predicted values of heightIn
across the range of values for the predictor, ageYear.
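# Quadratic model: heightIn as a function of ageYear and ageYear^2
model <- lm(heightIn ~ ageYear + I(ageYear^2), data = heightweight)
# Grid of 100 ageYear values spanning the observed range
xmin <- min(heightweight$ageYear)
xmax <- max(heightweight$ageYear)
predicted <- data.frame(ageYear = seq(xmin, xmax, length.out = 100))
# Predicted heightIn at each grid point
predicted$heightIn <- predict(model, predicted)
# Overlay the fitted curve on the raw data
# (in ggplot2 3.4.0 and later, linewidth = 1 replaces size = 1 for lines)
ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
  geom_point(colour = "grey40") +
  geom_line(data = predicted, size = 1)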
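
As a cross-check, the same quadratic curve can be requested from geom_smooth() directly and compared with the predictions of the hand-built model. This is a sketch under the assumption that the gcookbook package is installed. The names p_quad, smooth_curve, and manual are chosen here for illustration.

```r
library(ggplot2)
library(gcookbook)  # provides the heightweight data set

model <- lm(heightIn ~ ageYear + I(ageYear^2), data = heightweight)

# Let geom_smooth() fit the same quadratic formula internally
p_quad <- ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
  geom_point(colour = "grey40") +
  geom_smooth(method = "lm", formula = y ~ x + I(x^2), se = FALSE)

# Points of the curve drawn by geom_smooth() (layer 2 of the plot)
smooth_curve <- layer_data(p_quad, 2)

# Predictions from our own model at the same x positions
manual <- predict(model, data.frame(ageYear = smooth_curve$x))
all.equal(unname(manual), smooth_curve$y)
```

If the two agree, the curve on the graph is the same model that is used for other calculations, which is the point of building it by hand.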

Line Graph
A basic line graph:
ggplot(BOD, aes(x=Time, y=demand)) +
geom_line()
ggplot(BOD, aes(x=Time, y=demand)) +
geom_line() +
geom_point()

You want to make a line graph with more than one line.
library(plyr)

## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following object is masked from 'package:purrr':
##
##     compact

tg <- ddply(ToothGrowth, c("supp", "dose"),
            summarise, length=mean(len))
ggplot(tg, aes(x=dose, y=length, colour=supp)) +
  geom_line()
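
Loading plyr after dplyr masks several dplyr functions, as the warning above notes. As a hedged alternative that is not part of the original notes, the same table of group means can be built with base R's aggregate(), which avoids the conflict. The name tg_base is chosen here for illustration.

```r
library(ggplot2)

# Mean tooth length for each supplement/dose combination, using base R only
tg_base <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)

# Rename the summary column to match the name used in the plots of these notes
names(tg_base)[names(tg_base) == "len"] <- "length"

# One line per supplement type
ggplot(tg_base, aes(x = dose, y = length, colour = supp)) +
  geom_line()
```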

ggplot(tg, aes(x=dose, y=length,
               linetype=supp)) +
  geom_line()

ggplot(tg, aes(x=dose, y=length, colour=supp)) +
  geom_line() +
  scale_colour_brewer(palette="Set1")
ggplot(BOD, aes(x=Time, y=demand)) +
geom_line() +
geom_point(size=4, shape=22, colour="darkred", fill="pink")

You want to make a stacked area graph.


ggplot(uspopage, aes(x=Year, y=Thousands,
fill=AgeGroup)) +
geom_area()
ggplot(uspopage, aes(x=Year, y=Thousands, fill=AgeGroup)) +
geom_area(colour="black", size=.2, alpha=.4) +
scale_fill_brewer(palette="Blues",
breaks=rev(levels(uspopage$AgeGroup)))

You want to add a confidence region to a graph.


clim <- subset(climate, Source == "Berkeley",
select=c("Year", "Anomaly10y", "Unc10y"))
ggplot(clim, aes(x=Year, y=Anomaly10y)) +
geom_ribbon(aes(ymin=Anomaly10y-Unc10y,
ymax=Anomaly10y+Unc10y),
alpha=0.2) +
geom_line()
Bar Charts
BOD

## Time demand
## 1 1 8.3
## 2 2 10.3
## 3 3 19.0
## 4 4 16.0
## 5 5 15.6
## 6 7 19.8

barplot(BOD$demand, names.arg=BOD$Time)

You have a data frame where one column represents the x position of each bar, and another
column represents the vertical (y) height of each bar. Use ggplot() with
geom_bar(stat="identity") and specify what variables you want on the x- and y-axes.
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...

barplot(table(mtcars$cyl))

ggplot(BOD, aes(x=Time, y=demand)) +
  geom_bar(stat="identity")

ggplot(mtcars, aes(x=factor(cyl))) +
geom_bar()
# Another example:
ggplot(pg_mean, aes(x=group, y=weight)) +
geom_bar(stat="identity")

Bar chart for a continuous variable: when x is a continuous (or numeric) variable, the bars
behave a little differently. Instead of having one bar at each actual x value, there is one bar
at each possible x value between the minimum and the maximum. You can convert the
continuous variable to a discrete variable by using factor():
ggplot(BOD, aes(x=Time, y=demand)) +
geom_bar(stat="identity")
ggplot(BOD, aes(x=factor(Time), y=demand)) +
geom_bar(stat="identity")
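
The difference between the two plots comes from a value of Time that never occurs in BOD. A small base R check, added here as an illustration with the hypothetical name all_times, makes the gap explicit:

```r
# BOD ships with base R (package datasets)
all_times <- seq(min(BOD$Time), max(BOD$Time))  # every whole number from min to max

# Whole-number times with no observation; the continuous axis leaves a gap here
setdiff(all_times, BOD$Time)

# factor() keeps only the values that actually occur, so no gap is drawn
levels(factor(BOD$Time))
```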

Group bars together by a second variable.


cabbage_exp

## Cultivar Date Weight sd n se
## 1 c39 d16 3.18 0.9566144 10 0.30250803
## 2 c39 d20 2.80 0.2788867 10 0.08819171
## 3 c39 d21 2.74 0.9834181 10 0.31098410
## 4 c52 d16 2.26 0.4452215 10 0.14079141
## 5 c52 d20 3.11 0.7908505 10 0.25008887
## 6 c52 d21 1.47 0.2110819 10 0.06674995

ggplot(cabbage_exp,
aes(x=Date, y=Weight, fill=Cultivar)) +
geom_bar(position = "dodge", stat = "identity")
If the data has one row representing each case, and you want to plot counts of the cases:
ggplot(diamonds, aes(x=cut)) +
geom_bar()
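
To see what geom_bar() counts when no y variable is given, the bar heights can be compared with an ordinary frequency table. This is a minimal sketch; p_cut, bar_heights, and cut_counts are names chosen here for illustration.

```r
library(ggplot2)  # provides the diamonds dataset

p_cut <- ggplot(diamonds, aes(x = cut)) +
  geom_bar()

# Bar heights computed by geom_bar(), one per level of cut
bar_heights <- layer_data(p_cut, 1)$count

# The same counts from a plain frequency table
cut_counts <- as.vector(table(diamonds$cut))

all.equal(as.numeric(bar_heights), as.numeric(cut_counts))
```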

Use different colors for negative and positive-valued bars.


csub <- subset(climate,
Source=="Berkeley" & Year >= 1900)
csub$pos <- csub$Anomaly10y >= 0
ggplot(csub, aes(x=Year, y=Anomaly10y, fill=pos)) +
geom_bar(stat="identity", position="identity")
# Add thin black outlines, set the fill colors manually, and hide the legend:
ggplot(csub, aes(x=Year, y=Anomaly10y, fill=pos)) +
  geom_bar(stat="identity", position="identity",
           colour="black", size=0.25) +
  scale_fill_manual(values=c("#CCEEFF", "#FFDDDD"),
                    guide="none")

Stacked bar graph.


ggplot(cabbage_exp,
aes(x=Date, y=Weight, fill=Cultivar)) +
geom_bar(stat="identity")
# 100% stacked bar graph
ce <- ddply(cabbage_exp, "Date", transform,
percent_weight = Weight / sum(Weight) * 100)
ggplot(ce, aes(x=Date, y=percent_weight, fill=Cultivar)) +
geom_bar(stat="identity")
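
As an alternative to computing percentages by hand, position = "fill" asks ggplot2 to rescale each stack to a total of 1. The sketch below is an illustration, not part of the original notes. To stay self-contained it rebuilds the Cultivar, Date, and Weight columns from the cabbage_exp table printed earlier in these notes, under the hypothetical name cabbage.

```r
library(ggplot2)

# Values copied from the cabbage_exp table printed earlier in these notes
cabbage <- data.frame(
  Cultivar = rep(c("c39", "c52"), each = 3),
  Date     = rep(c("d16", "d20", "d21"), times = 2),
  Weight   = c(3.18, 2.80, 2.74, 2.26, 3.11, 1.47)
)

# position = "fill" rescales every stack so that it sums to 1
p_fill <- ggplot(cabbage, aes(x = Date, y = Weight, fill = Cultivar)) +
  geom_bar(stat = "identity", position = "fill") +
  scale_y_continuous(labels = function(p) paste0(100 * p, "%"))
p_fill
```

The y-axis here runs from 0 to 1 internally. The labels function only changes how the tick marks are printed.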

ggplot(cabbage_exp,
aes(x=interaction(Date, Cultivar), y=Weight)) +
geom_bar(stat="identity") +
geom_text(aes(label=Weight),
vjust=-0.2)
Histogram
hist(mtcars$mpg, breaks=10)

hist(mtcars$mpg, breaks=25)
ggplot(mtcars, aes(x=mpg)) +
geom_histogram(binwidth=4)
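
The two systems treat their bin arguments differently: in hist(), a single number for breaks is only a suggestion, while binwidth in geom_histogram() fixes the width of every bin. The following sketch, with the illustrative names h, p_hist, and bins, inspects what each one actually used.

```r
library(ggplot2)

# Base R: compute the histogram without drawing it and look at the breakpoints
h <- hist(mtcars$mpg, breaks = 10, plot = FALSE)
h$breaks

# ggplot2: every bin is binwidth units wide
p_hist <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 4)
bins <- layer_data(p_hist, 1)

# Width of each bin drawn by geom_histogram()
bins$xmax - bins$xmin
```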

Boxplot
plot(ToothGrowth$supp, ToothGrowth$len)

boxplot(len ~ supp, data = ToothGrowth)


ggplot(ToothGrowth, aes(x=supp, y=len)) +
geom_boxplot()
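
The numbers behind each box can be pulled out of the plot and checked against a direct calculation. This is a minimal sketch; p_box and box_stats are names chosen here for illustration.

```r
library(ggplot2)

p_box <- ggplot(ToothGrowth, aes(x = supp, y = len)) +
  geom_boxplot()

# Statistics used to draw each box: first quartile, median, third quartile
box_stats <- layer_data(p_box, 1)[, c("lower", "middle", "upper")]
box_stats

# Medians computed directly, one per supplement type, for comparison
tapply(ToothGrowth$len, ToothGrowth$supp, median)
```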

# For two categories:


ggplot(ToothGrowth,
aes(x=interaction(supp, dose), y=len)) +
geom_boxplot()
Wickham, H., & Grolemund, G. (2016). R for Data Science: Import, Tidy, Transform, Visualize,
and Model Data. O’Reilly Media, Inc.
