Sem II 2021/2022
The simple graph has brought more information to the data analyst’s mind than any other
device (John Tukey).
This chapter will teach you how to visualize your data using ggplot2. R has several systems
for making graphs, but ggplot2 is one of the most elegant and most versatile. ggplot2
implements the grammar of graphics, a coherent system for describing and building
graphs. With ggplot2, you can do more, and do it faster, by learning one system and applying it in
many places. To access the datasets, help pages, and functions that we will use in this
chapter, load these packages:
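# Install the packages once; the library() calls are needed in every session.
install.packages(c("ggplot2","gcookbook","tidyverse"))
library(ggplot2)
library(gcookbook)
library(tidyverse)  # also attaches ggplot2 and dplyr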
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Scatterplot
Let’s use our first graph to answer a question: do cars with big engines use more fuel than
cars with small engines? You probably already have an answer, but try to make your
answer precise. What does the relationship between engine size and fuel efficiency look
like? Is it positive? Negative? Linear? Nonlinear?
?mpg
str(mpg)
## tibble [234 x 11] (S3: tbl_df/tbl/data.frame)
## $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
## $ model : chr [1:234] "a4" "a4" "a4" "a4" ...
## $ displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr [1:234] "f" "f" "f" "f" ...
## $ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr [1:234] "p" "p" "p" "p" ...
## $ class : chr [1:234] "compact" "compact" "compact" "compact" ...
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
Exercise:
1. Run ggplot(data = mpg). What do you see?
2. How many rows are in mtcars? How many columns?
3. What does the drv variable describe? Read the help for ?mpg to find out.
4. Make a scatterplot of hwy versus cyl.
To reveal the class of each car, map the class variable to an aesthetic of the points, such as color, size, transparency, or shape:
# By color:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
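# By size (ggplot2 warns that using size for a discrete variable is not advised):
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = class))
# Use transparency (ggplot2 gives the same kind of warning for alpha):
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
# By shape (only six shapes are used by default, so the seventh class, suv, is not plotted):
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))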
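The shape mapping above leaves some cars unplotted because class has more levels than the default shape palette has shapes. The sketch below checks the number of classes and supplies one shape per class by hand. The choice of shape codes 0 to 6 and the object names n_classes and p_shapes are illustrative assumptions, not part of the original notes.

```r
library(ggplot2)

# Count the distinct car classes in mpg
n_classes <- length(unique(mpg$class))
n_classes

# Assumption: shape codes 0 to 6 are an arbitrary illustrative choice,
# one code for each class
p_shapes <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class)) +
  scale_shape_manual(values = 0:6)
p_shapes
```

With one shape supplied for every class, no points should be dropped from the plot.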
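Each layer can use its own data. Here the smooth line displays just a subset of the mpg dataset, the subcompact cars, because the data argument in geom_smooth() overrides the data given to ggplot() for that layer only.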
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth(
data = filter(mpg, class == "subcompact"),
se = FALSE
)
Line Graph
A basic line graph:
ggplot(BOD, aes(x=Time, y=demand)) +
geom_line()
ggplot(BOD, aes(x=Time, y=demand)) +
geom_line() +
geom_point()
To make a line graph with more than one line, map a discrete variable to colour (or linetype) in addition to x and y.
library(plyr)
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
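The notes do not show the code for the multi-line graph itself. As a minimal sketch, assume the built-in ToothGrowth data is first summarised to one mean tooth length per supplement and dose. The summary step here uses base R aggregate() rather than plyr, and the names tg and p_lines are hypothetical.

```r
library(ggplot2)

# Assumption: one mean length per supplement type and dose, computed with
# base R aggregate(); the original notes do not show this step
tg <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)

# Mapping supp to colour also groups the data, so each supplement gets its own line
p_lines <- ggplot(tg, aes(x = dose, y = len, colour = supp)) +
  geom_line() +
  geom_point()
p_lines
```

Because supp is a factor, mapping it to colour is enough to split the data into one line per supplement. With a numeric grouping variable, an explicit group mapping would be needed as well.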
Bar Graph
BOD
## Time demand
## 1 1 8.3
## 2 2 10.3
## 3 3 19.0
## 4 4 16.0
## 5 5 15.6
## 6 7 19.8
barplot(BOD$demand, names.arg=BOD$Time)
You have a data frame where one column represents the x position of each bar, and another
column represents the vertical (y) height of each bar. Use ggplot() with
geom_bar(stat="identity") and specify what variables you want on the x- and y-axes.
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
barplot(table(mtcars$cyl))
ggplot(mtcars, aes(x=factor(cyl))) +
geom_bar()
# Another example:
ggplot(pg_mean, aes(x=group, y=weight)) +
geom_bar(stat="identity")
Bar chart for a continuous variable: when x is a continuous (numeric) variable, the bars
behave a little differently. Instead of one bar at each actual x value, there is one bar position
at each possible x value between the minimum and the maximum. BOD has no row for Time = 6, so the
first plot below leaves an empty space there. Converting the continuous variable to a discrete
one with factor() removes the gap:
ggplot(BOD, aes(x=Time, y=demand)) +
geom_bar(stat="identity")
ggplot(BOD, aes(x=factor(Time), y=demand)) +
geom_bar(stat="identity")
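The geom_bar(stat="identity") calls above can also be written with geom_col(), which ggplot2 provides for bars whose heights come directly from a y column. The sketch below builds the same BOD chart both ways and compares the bar heights. The object names p_identity and p_col are illustrative.

```r
library(ggplot2)

# Bar heights taken from the demand column, written two ways
p_identity <- ggplot(BOD, aes(x = factor(Time), y = demand)) +
  geom_bar(stat = "identity")
p_col <- ggplot(BOD, aes(x = factor(Time), y = demand)) +
  geom_col()

# layer_data() returns the values ggplot2 computed for drawing the layer
heights_identity <- layer_data(p_identity)$y
heights_col <- layer_data(p_col)$y
all.equal(heights_identity, heights_col)
```

Which form to use is a matter of taste; geom_col() is shorter to type.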
# Grouped bars: map Cultivar to fill and place the bars side by side with position = "dodge"
ggplot(cabbage_exp,
aes(x=Date, y=Weight, fill=Cultivar)) +
geom_bar(position = "dodge", stat = "identity")
If the data has one row per case and you want to plot the number of cases in each category, use geom_bar() without mapping anything to y:
ggplot(diamonds, aes(x=cut)) +
geom_bar()
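To see that geom_bar() without a y mapping counts rows, the sketch below compares the computed bar heights with a plain table() of the same variable. The names p_counts, bar_heights and case_counts are illustrative.

```r
library(ggplot2)

p_counts <- ggplot(diamonds, aes(x = cut)) +
  geom_bar()

# Bar heights computed by ggplot2 for the layer
bar_heights <- layer_data(p_counts)$count

# Number of rows per cut category, counted directly
case_counts <- as.vector(table(diamonds$cut))

all(bar_heights == case_counts)
```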
# Add a text label to each bar; a negative vjust moves the label above the top of the bar
ggplot(cabbage_exp,
aes(x=interaction(Date, Cultivar), y=Weight)) +
geom_bar(stat="identity") +
geom_text(aes(label=Weight),
vjust=-0.2)
Histogram
hist(mtcars$mpg, breaks=10)
hist(mtcars$mpg, breaks=25)
ggplot(mtcars, aes(x=mpg)) +
geom_histogram(binwidth=4)
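In hist(), a single number passed to breaks is only a suggestion, since R rounds the breakpoints to pretty values. In geom_histogram() the bins can be controlled either by width (binwidth, as above) or by number (bins). The following is a small sketch of the second option; the value 10 and the fill and outline colours are arbitrary choices for illustration.

```r
library(ggplot2)

# bins sets the number of bins; binwidth would set their width instead
p_hist <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10, fill = "white", colour = "black")
p_hist

# Every car falls into exactly one bin
hist_counts <- layer_data(p_hist)$count
sum(hist_counts) == nrow(mtcars)
```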
Boxplot
plot(ToothGrowth$supp, ToothGrowth$len)
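The base plot() call above draws a boxplot because supp is a factor. A ggplot2 version, sketched here in the same pattern as the earlier sections, uses geom_boxplot(). The object name p_box is illustrative.

```r
library(ggplot2)

# One box per supplement type, showing the distribution of tooth length
p_box <- ggplot(ToothGrowth, aes(x = supp, y = len)) +
  geom_boxplot()
p_box
```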