
DATACAMP – CLASS NOTES

MASTER DATA ANALYSIS WITH R

02/07/2020
CHIS - Mosaic Plots

CHIS - Descriptive Statistics


Ex.

Do Things Manually
In the previous exercise we looked at how to produce a frequency histogram when
we have many sub-categories. The problem here is that this can't be facetted because
the calculations occur on the fly inside ggplot2.

To overcome this we're going to calculate the proportions outside ggplot2. This is the
beginning of our flexible script for a mosaic plot.

The dataset adult and the BMI_fill object from the previous exercise have been
carried over for you. Code that tries to make the accurate frequency histogram
facetted is available. You should understand these commands by now.

● Use adult$RBMI and adult$SRAGE_P as arguments in table() to create a
contingency table of the two variables. Save this as DF.
● Use apply() to get the frequency of each group. The first argument is DF, the
second argument 2, because you want to do calculations on each column. The
third argument should be function(x) x/sum(x). Store the result as DF_freq.
● Load the reshape2 package and use the melt() function on DF_freq. Store the
result as DF_melted. Examine the structure of DF_freq and DF_melted if you are
not familiar with this operation.

Note: Here we use reshape2 instead of the more current tidyr because reshape2::melt()
allows us to work directly on a table. tidyr::gather() requires a data frame.

Apuntes de clase - Data Camp R - 1


● Use names() to rename the variables in DF_melted to be c("FILL", "X", "value"),
with the prospect of making this a generalized function later on.
● The plotting call at the end uses DF_melted. Add code to make it facetted. Use
the formula FILL ~ . (one row of facets per FILL level, no column facets). Note that we
now use geom_col(), which is just a shortcut for geom_bar(stat = "identity").

# An attempt to facet the accurate frequency histogram from before (failed)


ggplot(adult, aes (x = SRAGE_P, fill= factor(RBMI))) +
geom_histogram(aes(y = ..count../sum(..count..)), binwidth = 1, position = "fill") +
BMI_fill +
facet_grid(RBMI ~ .)

# Create DF with table()


DF <- table(adult$RBMI, adult$SRAGE_P)

# Use apply on DF to get frequency of each group


DF_freq <- apply(DF, 2, function(x) x/sum(x))

# Load reshape2 and use melt on DF to create DF_melted


library(reshape2)
DF_melted <- melt(DF_freq)
str(DF_freq)
str(DF_melted)

# Change names of DF_melted


names(DF_melted) <- c("FILL", "X", "value")

# Add code to make this a faceted plot


ggplot(DF_melted, aes(x = X, y = value, fill = FILL)) +
geom_col(position = "stack") +
BMI_fill +
facet_grid(FILL ~ .) # Facets
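
The table(), apply() and melt steps above can be checked on a small built-in dataset. The following is only a sketch under stated assumptions: it uses mtcars instead of adult, and it swaps reshape2::melt() for the base R combination as.data.frame(as.table()) so that it runs without extra packages. Each column of DF_freq should sum to 1.

```r
# Illustration on the built-in mtcars data (an assumption; the exercise uses adult)
cars <- datasets::mtcars
DF <- table(cars$cyl, cars$gear)          # contingency table: cyl (rows) by gear (columns)

# Proportion of each cyl group within every gear column
DF_freq <- apply(DF, 2, function(x) x / sum(x))
colSums(DF_freq)                          # every column sums to 1

# Base R stand-in for reshape2::melt(): one row per cell of the table
DF_long <- as.data.frame(as.table(DF_freq))
names(DF_long) <- c("FILL", "X", "value")
str(DF_long)
```

Because the proportions are computed before plotting, each facet can then draw its own share of every X group.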



Ex.
# Plot 1 - Count histogram
ggplot(adult, aes (x = SRAGE_P, fill= factor(RBMI))) +
geom_histogram(binwidth = 1) +
BMI_fill

# Plot 2 - Density histogram


ggplot(adult, aes (x = SRAGE_P, fill= factor(RBMI))) +
geom_histogram(aes(y = ..density..), binwidth = 1) +
BMI_fill



# Plot 3 - Faceted count histogram
ggplot(adult, aes (x = SRAGE_P, fill= factor(RBMI))) +
geom_histogram(binwidth = 1) +
BMI_fill+
facet_grid(RBMI ~ .)

# Plot 4 - Faceted density histogram


ggplot(adult, aes (x = SRAGE_P, fill= factor(RBMI))) +
geom_histogram(aes(y = ..density..), binwidth = 1) +
BMI_fill+
facet_grid(RBMI ~ .)



# Plot 5 - Density histogram with position = "fill"
ggplot(adult, aes (x = SRAGE_P, fill= factor(RBMI))) +
geom_histogram(aes(y = ..density..), binwidth = 1, position = "fill") +
BMI_fill

# Plot 6 - The accurate histogram


ggplot(adult, aes (x = SRAGE_P, fill= factor(RBMI))) +
geom_histogram(aes(y = ..count../sum(..count..)), binwidth = 1, position = "fill") +
BMI_fill

Ex.

Multiple Histograms
When we introduced histograms we focused on univariate data, which is exactly
what we've been doing here. However, when we want to explore distributions
further there is much more we can do. For example, there are density plots, which
you'll explore in the next course. For now, we'll look deeper at frequency histograms
and begin developing our mosaic plots.

The adult dataset, which is cleaned up by now, is available in the workspace for you.



Two layers have been pre-defined for you: BMI_fill is a scale layer which we can add
to a ggplot() command using +: ggplot(...) + BMI_fill. fix_strips is a theme() layer to
make nice facet titles.

● The histogram from the first exercise of age colored by BMI has been
provided. The predefined theme(), fix_strips (see above), has been added to the
histogram. Add BMI_fill to this plot using the + operator as well.
● In addition, add the following elements to create a pretty & insightful plot:
● Use facet_grid() to facet the rows according to RBMI (Remember formula
notation ROWS ~ COL and use . as a place-holder when not facetting in that
direction).
● Add the classic theme using theme_classic().

# The color scale used in the plot


BMI_fill <- scale_fill_brewer("BMI Category", palette = "Reds")

# Theme to fix category display in faceted plot


fix_strips <- theme(strip.text.y = element_text(angle = 0, hjust = 0, vjust = 0.1, size = 14),
strip.background = element_blank(),
legend.position = "none")

# Histogram, add BMI_fill and customizations


ggplot(adult, aes (x = SRAGE_P, fill= RBMI)) +
BMI_fill +
geom_histogram(binwidth = 1) +
facet_grid(RBMI ~ .)+
theme_classic()+
fix_strips

Ex.



# Keep adults younger than or equal to 84
adult <- adult[adult$SRAGE_P <= 84, ]

# Keep adults with BMI at least 16 and less than 52


adult <- adult[adult$BMI_P >= 16 & adult$BMI_P < 52, ]

# Relabel the race variable


adult$RACEHPR2 <- factor(adult$RACEHPR2,
labels = c("Latino", "Asian", "African American", "White"))

# Relabel the BMI categories variable


adult$RBMI <- factor(adult$RBMI,
labels = c("Under-weight", "Normal-weight", "Over-weight", "Obese"))

Ex.

Exploring Data
In this chapter we're going to continuously build on our plotting functions and
understanding to produce a mosaic plot (aka Marimekko plot). This is a visual
representation of a contingency table, comparing two categorical variables.
Essentially, our question is which groups are over or under represented in our
dataset. To visualize this we'll color groups according to their Pearson residuals from
a chi-squared test. At the end of it all we'll wrap up our script into a flexible function
so that we can look at other variables.
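
As a minimal sketch of the idea, the Pearson residuals of a chi-squared test can be read directly from the result of base R's chisq.test(). The example assumes the built-in mtcars data in place of the CHIS data, which is not available here; the variable names tab, res and manual are illustrative.

```r
# Illustration on mtcars (an assumption; the course applies this to the CHIS adult data)
cars <- datasets::mtcars
tab <- table(cars$cyl, cars$gear)

# chisq.test() may warn about small expected counts in such a tiny table
res <- suppressWarnings(chisq.test(tab))

# Pearson residuals: (observed - expected) / sqrt(expected)
manual <- (res$observed - res$expected) / sqrt(res$expected)
all.equal(as.numeric(manual), as.numeric(res$residuals))

# Positive residual: the group is over-represented; negative: under-represented
round(res$residuals, 2)
```

Colouring each tile of the mosaic plot by these residuals is what shows which groups are over or under represented.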

We'll familiarize ourselves with a small number of variables from the 2009 CHIS
adult-response dataset (as opposed to children). We have selected the following
variables to explore:

● RBMI: BMI Category description


● BMI_P: BMI value
● RACEHPR2: race
● SRSEX: sex
● SRAGE_P: age
● MARIT2: Marital status
● AB1: General Health Condition
● ASTCUR: Current Asthma Status
● AB51: Type I or Type II Diabetes
● POVLL: Poverty level



We'll filter our dataset to plot a more reliable subset (we'll still retain over 95% of the
data).

Before we get into mosaic plots it's worthwhile exploring the dataset using simple
distribution plots - i.e. histograms.

ggplot2 is already loaded and the dataset, named adult, is already available in the
workspace.

● Use the typical commands for exploring the structure of adult to get familiar
with the variables: summary() and str().

# Explore the dataset with summary and str


summary(adult)
str(adult)

● As a first exploration of the data, plot two histograms using ggplot2 syntax: one
for age (SRAGE_P) and one for BMI (BMI_P). The goal is to explore the dataset
and get familiar with the distributions here. Feel free to explore different bin
widths. We'll ask some questions about these in the next exercises.

# Age histogram
ggplot(adult, aes(x=SRAGE_P))+
geom_histogram()

# BMI value histogram


ggplot(adult, aes(x=BMI_P))+
geom_histogram(binwidth = 3)



● Next plot a binned-distribution of age, filling each bar according to the BMI
categorization. Inside geom_histogram(), set binwidth = 1. You'll want to use fill
= factor(RBMI) since RBMI is a categorical variable.

# Age colored by BMI, binwidth = 1


ggplot(adult, aes(x=SRAGE_P, fill=factor(RBMI)))+
geom_histogram(binwidth = 1)

01/07/2020
Best Practices: Heat Maps

30/06/2020
Best Practices: Bar Plots
Ex.

# Base layers



m <- ggplot(mtcars.cyl, aes(x = cyl, y = wt.avg))

# Plot 1: Draw bar plot with geom_bar


m + geom_bar(stat = "identity", fill = "skyblue")

# Plot 2: Draw bar plot with geom_col
# geom_col() is a shortcut for geom_bar(stat = "identity"), for when your data
# already contains the bar heights (here, wt.avg).
m + geom_col(fill = "skyblue")

# Plot 3: geom_col with variable widths.


m + geom_col(fill = "skyblue", width = mtcars.cyl$prop)

# Plot 4: Add error bars


m+
geom_col(fill = "skyblue", width = mtcars.cyl$prop) +
geom_errorbar(aes(ymin = wt.avg - sd, ymax = wt.avg + sd), width = 0.1)
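
mtcars.cyl is provided by the course workspace and is not defined in these notes. The sketch below shows one way such a summary table could be built with base R. The column names wt.avg, sd and prop come from the plotting code above; the construction itself, and the column n, are assumptions.

```r
library(ggplot2)

# Hypothetical reconstruction of mtcars.cyl: one row per cylinder count
cars <- datasets::mtcars
groups <- split(cars$wt, cars$cyl)
mtcars.cyl <- data.frame(
  cyl    = as.numeric(names(groups)),
  wt.avg = sapply(groups, mean),      # mean weight per group
  sd     = sapply(groups, sd),        # standard deviation per group
  n      = sapply(groups, length)     # group size
)
mtcars.cyl$prop <- mtcars.cyl$n / sum(mtcars.cyl$n)   # share of all cars

m <- ggplot(mtcars.cyl, aes(x = cyl, y = wt.avg))

# geom_col() and geom_bar(stat = "identity") draw the same bar heights
p_col <- m + geom_col(fill = "skyblue")
p_bar <- m + geom_bar(stat = "identity", fill = "skyblue")
```

With a table of this shape, the four plots above run as written.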



Ex.
# Base layers
m <- ggplot(mtcars, aes(x = cyl,y = wt, col = am, fill = am))

# Plot 1: Draw dynamite plot


m+
stat_summary(fun.y = mean, geom = "bar") +
stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), geom = "errorbar", width =
0.1)

# Plot 2: Set position dodge in each stat function


m+
stat_summary(fun.y = mean, geom = "bar", position = "dodge") +
stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1),
geom = "errorbar", width = 0.1, position = "dodge")

# Set your dodge posn manually


posn.d <- position_dodge(0.9)



# Plot 3: Redraw dynamite plot
m+
stat_summary(fun.y = mean, geom = "bar", position = posn.d) +
stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), geom = "errorbar", width =
0.1, position = position_dodge(0.9))

Ex.
# Base layers
m <- ggplot(mtcars, aes(x = cyl, y = wt))

# Draw dynamite plot


m+
stat_summary(fun.y = mean, geom = "bar", fill = "skyblue") +
stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), geom = "errorbar", width =
0.1)



22/06/2020
Recycling Themes
Ex.

Exploring ggthemes
There are many themes available by default in ggplot2: theme_bw(), theme_classic(),
theme_gray(), etc. In the previous exercise, you saw that you can apply these themes
to all following plots, with theme_set():

theme_set(theme_bw())

But you can also apply them on an individual plot, with:

... + theme_bw()

You can also extend these themes with your own modifications. In this exercise, you'll
experiment with this and use some preset templates available from the ggthemes
package. The workspace already contains the same basic plot from before under the
name z2.
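
A minimal sketch of extending a built-in theme and switching the default, using only ggplot2 (no ggthemes). The object names custom and old are illustrative. It relies on theme_set() returning the previously active theme, so the default can be restored afterwards.

```r
library(ggplot2)

# Extend a built-in theme with your own modifications
custom <- theme_bw() + theme(legend.position = "bottom")

# theme_set() installs a new default and returns the previous one
old <- theme_set(custom)
theme_get()$legend.position      # the active default now carries the modification

# Restore the previous default so later plots are unaffected
theme_set(old)
```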

Answer.
# Original plot
z2

# Load ggthemes
library(ggthemes)

# Apply theme_tufte(), plot additional modifications


custom_theme <- theme_tufte() +
theme(legend.position = c(0.9, 0.9),
legend.title = element_text(face = "italic", size = 12),
axis.title = element_text(face = "bold", size = 14))

# Draw the customized plot


z2 + custom_theme

# Use theme set to set custom theme as default


theme_set(custom_theme)

# Plot z2 again
z2

Ex.
# Original plot
z2

# Theme layer saved as an object, theme_pink


theme_pink <- theme(panel.background = element_blank(),
legend.key = element_blank(),
legend.background = element_blank(),
strip.background = element_blank(),
plot.background = element_rect(fill = myPink, color = "black", size = 3),
panel.grid = element_blank(),
axis.line = element_line(color = "red"),
axis.ticks = element_line(color = "red"),
strip.text = element_text(size = 16, color = myRed),
axis.title.y = element_text(color = myRed, hjust = 0, face = "italic"),
axis.title.x = element_text(color = myRed, hjust = 0, face = "italic"),
axis.text = element_text(color = "black"),
legend.position = "none")

# 1 - Apply theme_pink to z2
z2 +
theme_pink



# 2 - Update the default theme, and at the same time
# assign the old theme to the object old.
old <- theme_update(panel.background = element_blank(),
legend.key = element_blank(),
legend.background = element_blank(),
strip.background = element_blank(),
plot.background = element_rect(fill = myPink, color = "black", size = 3),
panel.grid = element_blank(),
axis.line = element_line(color = "red"),
axis.ticks = element_line(color = "red"),
strip.text = element_text(size = 16, color = myRed),
axis.title.y = element_text(color = myRed, hjust = 0, face = "italic"),
axis.title.x = element_text(color = myRed, hjust = 0, face = "italic"),
axis.text = element_text(color = "black"),
legend.position = "none")

# 4 - Restore the old default theme


theme_set(old)

# Display the plot z2 - old theme restored


z2



Themes from Scratch
# Original plot, color provided
z
myRed

# Extend z with theme() function and 3 args


z+
theme(strip.text = element_text(size = 16, color = myRed),
axis.title = element_text(color = myRed, hjust = 0, face = "italic"),
axis.text = element_text(color = "black"))

Facets Layer
Ex.
# Code to create the cyl_am col and myCol vector
mtcars$cyl_am <- paste(mtcars$cyl, mtcars$am, sep = "_")
myCol <- rbind(brewer.pal(9, "Blues")[c(3,6,8)],
brewer.pal(9, "Reds")[c(3,6,8)])

# Map cyl_am onto col


ggplot(mtcars, aes(x = wt, y = mpg, col = cyl_am)) +
geom_point() +
# Add a manual colour scale
scale_color_manual(values = myCol)

# Grid facet on gear vs. vs


ggplot(mtcars, aes(x = wt, y = mpg, col = cyl_am)) +
geom_point() +
# Add a manual colour scale
scale_color_manual(values = myCol)+
facet_grid(gear ~ vs)

# Also map disp to size


ggplot(mtcars, aes(x = wt, y = mpg, col = cyl_am, size=disp)) +
geom_point() +
# Add a manual colour scale
scale_color_manual(values = myCol)+
facet_grid(gear ~ vs)



Ex.
# Basic scatter plot
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point()

# 1 - Separate rows according to transmission type, am


p+
facet_grid(am ~ . )

# 2 - Separate columns according to cylinders, cyl


p+
facet_grid(. ~ cyl)



# 3 - Separate by both columns and rows
p+
facet_grid(am ~ cyl)

21/06/2020
Ex.

# Create a stacked bar plot: wide.bar


wide.bar <- ggplot(mtcars, aes(x = 1, fill = cyl)) +
geom_bar()

# Convert wide.bar to pie chart


wide.bar +
coord_polar(theta = "y")



# Create stacked bar plot: thin.bar
thin.bar <- ggplot(mtcars, aes(x = 1, fill = cyl)) +
geom_bar(width = 0.1) +
scale_x_continuous(limits = c(0.5,1.5))

# Convert thin.bar to "ring" type pie chart


thin.bar +
coord_polar(theta = "y")

Ex.
# The base ggplot command; you don't have to change this
wt.cyl.am <- ggplot(mtcars, aes(x = cyl,y = wt, col = am, fill = am, group = am))

# Add three stat_summary calls to wt.cyl.am


wt.cyl.am +
stat_summary(geom = "linerange", fun.data = med_IQR,
position = posn.d, size = 3) +
stat_summary(geom = "linerange", fun.data = gg_range,
position = posn.d, size = 3,
alpha = 0.4) +
stat_summary(geom = "point", fun.y = median,
position = posn.d, size = 3,
col = "black" , shape = "X")
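
med_IQR and gg_range are helper functions supplied by the course workspace; their definitions do not appear in these notes. The sketch below is a hypothetical reconstruction, assuming each returns a one-row data frame with the ymin and ymax columns (plus y for the median) that stat_summary(fun.data = ...) expects.

```r
# Hypothetical helper definitions (assumed; not shown in these notes)

# Full range of the data: ymin = minimum, ymax = maximum
gg_range <- function(x) {
  data.frame(ymin = min(x), ymax = max(x))
}

# Median with the interquartile range: ymin = 1st quartile, ymax = 3rd quartile
med_IQR <- function(x) {
  q <- unname(quantile(x, probs = c(0.25, 0.5, 0.75)))
  data.frame(y = q[2], ymin = q[1], ymax = q[3])
}

gg_range(datasets::mtcars$wt)
med_IQR(datasets::mtcars$wt)
```

Drawing the range with alpha = 0.4 and the interquartile range on top, as in the plot above, gives a compact box-plot-like summary per group.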



Ex.
# wt.cyl.am, posn.d, posn.jd and posn.j are available

# Plot 1: Jittered, dodged scatter plot with transparent points


wt.cyl.am +
geom_point(position = posn.jd, alpha = 0.6)

# Plot 2: Mean and SD - the easy way


wt.cyl.am +
stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), position = posn.d)

# Plot 3: Mean and 95% CI - the easy way


wt.cyl.am +
stat_summary(fun.data = mean_cl_normal, position = posn.d)



# Plot 4: Mean and SD - with T-tipped error bars - fill in ___
wt.cyl.am +
stat_summary(geom = "point", fun.y = mean,
position = posn.d) +
stat_summary(geom = "errorbar", fun.data = mean_sdl,
position = posn.d, fun.args = list(mult = 1), width = 0.1)
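
mean_sdl and mean_cl_normal are thin wrappers around functions from the Hmisc package. As a rough sketch of what they summarise, the same quantities can be computed in base R. This assumes the confidence interval is the usual t-based interval of the mean; the names m_sd, m_ci and half_width are illustrative.

```r
x <- datasets::mtcars$wt
n <- length(x)

# Mean plus/minus one standard deviation (what mean_sdl with mult = 1 summarises)
m_sd <- c(y = mean(x), ymin = mean(x) - sd(x), ymax = mean(x) + sd(x))

# 95% confidence interval of the mean, based on the t distribution
half_width <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)
m_ci <- c(y = mean(x), ymin = mean(x) - half_width, ymax = mean(x) + half_width)

m_sd
m_ci
```

The confidence interval is narrower than the standard deviation band because it describes the uncertainty of the mean, not the spread of the data.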

Ex.
# Display structure of mtcars
str(mtcars)

# Convert cyl and am to factors


mtcars$cyl <- as.factor(mtcars$cyl )
mtcars$am <- as.factor(mtcars$am)

# Define positions
posn.d <- position_dodge(width = 0.1)
posn.jd <- position_jitterdodge(jitter.width=0.1, dodge.width=0.2)
posn.j <- position_jitter(width=0.2)

# Base layers
wt.cyl.am <- ggplot(mtcars, aes(x=cyl, y=wt, col=am, fill=am, group=am))

Ex.



Sum
Another useful stat function is stat_sum(). This function counts the number of
overlapping observations at each position and is another good way to deal with overplotting.
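
A minimal sketch of stat_sum() on the built-in mtcars data (an assumption; the course exercise may use a different dataset). layer_data() is used here only to inspect the counts that the stat computes.

```r
library(ggplot2)

# cyl and gear take only a few values, so many of the 32 cars overlap exactly
cars <- datasets::mtcars
p <- ggplot(cars, aes(x = cyl, y = gear)) +
  stat_sum()

# One row per distinct (cyl, gear) position; n is the number of cars at that position
counts <- layer_data(p)
counts[, c("x", "y", "n")]
```

In the drawn plot the point size is mapped to n, so heavily overplotted positions appear as larger points.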

Ex.
# Use stat_quantile instead of stat_smooth
ggplot(Vocab, aes(x = education, y = vocabulary, col = year, group = factor(year))) +
stat_quantile(alpha = 0.6, size = 2) +
scale_color_gradientn(colors = brewer.pal(9,"YlOrRd"))

# Set quantile to 0.5


ggplot(Vocab, aes(x = education, y = vocabulary, col = year, group = factor(year))) +
stat_quantile(alpha = 0.6, size = 2, quantiles = 0.5)+
scale_color_gradientn(colors = brewer.pal(9,"YlOrRd"))

Ex.
# Plot 1: Jittered scatter plot, add a linear model (lm) smooth
ggplot(Vocab, aes(x = education, y = vocabulary)) +
geom_jitter(alpha = 0.2) +
stat_smooth(method = "lm", se = FALSE) # smooth

# Plot 2: points, colored by year


ggplot(Vocab, aes(x = education, y = vocabulary, col = year)) +
geom_jitter(alpha = 0.2)

# Plot 3: lm, colored by year


ggplot(Vocab, aes(x = education, y = vocabulary, col = factor(year))) +
stat_smooth(se=FALSE, method="lm") # smooth

# Plot 4: Set a color brewer palette


ggplot(Vocab, aes(x = education, y = vocabulary, col = factor(year))) +
stat_smooth(se=FALSE, method="lm") + # smooth
scale_color_brewer() # colors



# Plot 5: Add the group aes, specify alpha and size
ggplot(Vocab, aes(x = education, y = vocabulary, col = year, group = factor(year))) +
stat_smooth(method = "lm", se = FALSE, alpha = 0.6, size = 2) +
scale_color_gradientn(colors = brewer.pal(9, "YlOrRd"))

12/06/2020
Stats and Geoms

ggplot(mtcars, aes(x = wt, y = mpg)) +


geom_point()+
geom_smooth()



ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point()+
geom_smooth(method = "lm", formula = y ~ x)

ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point()+
geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

ggplot(mtcars, aes(x = wt, y = mpg))+


stat_smooth(method = "lm", se = FALSE)

Wrap-up



Ex.
# 2 - Use ggplot() for the first instruction
ggplot(titanic, aes(x = Pclass, fill = Sex)) +
geom_bar(position = "dodge")

ggplot(titanic, aes(x = Pclass, fill = Sex)) +


geom_bar(position = "dodge")+
facet_grid(. ~ Survived)

ggplot(titanic, aes(x = Pclass, y = Age, color = Sex)) +


geom_point(size = 3, alpha = 0.5, position = posn.jd)+
facet_grid(. ~ Survived)



10/06/2020
qplot
Ex.

Choosing geoms, part 2 - dotplot


Some naming conventions:

● Scatter plots: Continuous x, continuous y.


● Dot plots: Categorical x, continuous y.

You use geom_point() for both plot types. Jittering position is set in the geom_point()
layer.

However, to make a "true" dot plot, you can use geom_dotplot(). The difference is that
unlike geom_point(), geom_dotplot() uses a binning statistic. Binning means to cut up a
continuous variable (the y in this case) into discrete "bins". You already saw binning
with geom_histogram() (see this exercise for a refresher).

One thing to notice is that geom_dotplot() uses a different plotting symbol to
geom_point(). For these symbols, the color aesthetic changes the color of its border,
and the fill aesthetic changes the color of its interior.

Let's take a look at how the two geoms compare.

# cyl and am are factors, wt is numeric


class(mtcars$cyl)
class(mtcars$am)
class(mtcars$wt)

# "Basic" dot plot, with geom_point():


ggplot(mtcars, aes(cyl, wt, col = am)) +
geom_point(position = position_jitter(0.2, 0))

# 1 - "True" dot plot, with geom_dotplot():


ggplot(mtcars, aes(cyl, wt, fill = am)) +
geom_dotplot(binaxis = "y", stackdir = "center")

# 2 - qplot with geom "dotplot", binaxis = "y" and stackdir = "center"



qplot(
cyl, wt,
data = mtcars,
fill = am,
geom = "dotplot",
binaxis = "y",
stackdir = "center"
)

IMPORTANT!
Categorical variables: the values are limited to a finite set of groups. Examples:
country, gender, occupation (and year, when it is used as a label).
Continuous variables: can take any value within an interval, so they can be expressed
as decimals. They are often measured quantities, for example revenue or the price of a
share. In R they are stored as numeric.
Discrete variables: numeric variables whose values are necessarily whole numbers or
other distinct values, such as population or counts of items. In R they are typically
stored as integer.
Factor: the data type R uses to store categorical variables.

More info: https://rcompanion.org/handbook/C_01.html
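
A short base R sketch of how these variable types appear in practice, using the built-in mtcars data as an illustration; the names cyl_f and n_per_group are illustrative.

```r
cars <- datasets::mtcars

# cyl is stored as a number in the raw mtcars data
class(cars$cyl)                   # "numeric"

# Converting to a factor tells R (and ggplot2) to treat it as categorical
cyl_f <- factor(cars$cyl)
levels(cyl_f)                     # "4" "6" "8"

# Counts are discrete: whole numbers, which R stores as integer
n_per_group <- table(cyl_f)
class(as.vector(n_per_group))     # "integer"

# Measured quantities such as weight are continuous
is.numeric(cars$wt)               # TRUE
```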


# qplot() with x only


qplot(factor(cyl), data = mtcars)



# qplot() with x and y
qplot(factor(cyl), factor(vs), data = mtcars)

# qplot() with geom set to jitter manually


qplot(factor(cyl), factor(vs), data = mtcars, geom = "jitter")

Line Plots - Time Series


Ex.
ggplot(fish.tidy, aes(x = Year, y = Capture, color = Species)) +
geom_line()



Ex.
# Basic line plot
ggplot(economics, aes(x = date, y = unemploy/pop)) +
geom_line()

# Expand the following command with geom_rect() to draw the recess periods
ggplot(economics, aes(x = date, y = unemploy/pop)) +
geom_rect(data = recess,
aes(xmin = begin, xmax = end, ymin = -Inf, ymax = +Inf),
inherit.aes = FALSE, fill = "red", alpha = 0.2) +
geom_line()

09/06/2020
Ex.

# 1 - Basic histogram plot command


ggplot(mtcars, aes(mpg)) +
geom_histogram(binwidth = 1)



# 2 - Plot 1, Expand aesthetics: am onto fill
ggplot(mtcars, aes(mpg, fill = am)) +
geom_histogram(binwidth = 1)

# 3 - Plot 2, change position = "dodge"


ggplot(mtcars, aes(mpg, fill = am)) +
geom_histogram(binwidth = 1, position = "dodge")

# 4 - Plot 3, change position = "fill"


ggplot(mtcars, aes(mpg, fill = am)) +
geom_histogram(binwidth = 1, position = "fill")



# 5 - Plot 4, plus change position = "identity" and alpha = 0.4
ggplot(mtcars, aes(mpg, fill = am)) +
geom_histogram(binwidth = 1, position = "identity", alpha = 0.4)

# 6 - Plot 5, plus change mapping: cyl onto fill


ggplot(mtcars, aes(mpg, fill = cyl)) +
geom_histogram(binwidth = 1, position = "identity", alpha = 0.4)

Ex.

# Plot education on x and vocabulary on fill
ggplot(Vocab, aes(x=education, fill=vocabulary))+
geom_bar(position = "fill")

# Plot education on x and vocabulary on fill


# Use the default brewed color palette
ggplot(Vocab, aes(x=education, fill=vocabulary))+
scale_fill_brewer()+
geom_bar(position = "fill")



Warning in the console:
Warning message: n too large, allowed maximum for palette Blues is 9
Returning the palette you asked for with that many colors

Ex.

Bar plots with color ramp, part 2


In the previous exercise, you ended up with an incomplete bar plot. This was because
for continuous data, the default RColorBrewer palette that scale_fill_brewer() calls is
"Blues". There are only 9 colours in the palette, and since you have 11 categories,
your plot looked strange.

In this exercise, you'll manually create a color palette that can generate all the colours
you need. To do this you'll use a function called colorRampPalette().

The input is a character vector of 2 or more colour values, e.g. "#FFFFFF" (white) and
"#0000FF" (pure blue). (See this exercise for a discussion on hexadecimal codes).

The output is itself a function! So when you assign it to an object, that object should
be used as a function. To see what we mean, execute the following three lines in the
console:

new_col <- colorRampPalette(c("#FFFFFF", "#0000FF"))


new_col(4) # the newly extrapolated colours
munsell::plot_hex(new_col(4)) # Quick and dirty plot

new_col() is a function that takes one argument: the number of colours you want to
extrapolate. You want to use nicer colours, so we've assigned the entire "Blues"
colour palette from the RColorBrewer package to the character vector blues.

# Final plot of last exercise


ggplot(Vocab, aes(x = education, fill = vocabulary)) +
geom_bar(position = "fill") +
scale_fill_brewer()

# Definition of a set of blue colors


blues <- brewer.pal(9, "Blues") # from the RColorBrewer package

# 1 - Make a color range using colorRampPalette() and the set of blues


blue_range <- colorRampPalette(blues)

# 2 - Use blue_range to adjust the color of the bars, use scale_fill_manual()


ggplot(Vocab, aes(x = education, fill = vocabulary)) +
geom_bar(position = "fill") +
scale_fill_manual(values = blue_range(11))

Ex.

# A basic histogram, add coloring defined by cyl


ggplot(mtcars, aes(mpg, fill = cyl)) +
geom_histogram(binwidth = 1)

# Change position to identity


ggplot(mtcars, aes(mpg, fill = cyl)) +
geom_histogram(binwidth = 1, position = "identity")



# Change geom to freqpoly (position is identity by default)
ggplot(mtcars, aes(mpg, color = cyl)) +
geom_freqpoly(binwidth = 1, position = "identity")

Overlapping bar plots


So far you've seen three different positions for bar plots: stack (the default), dodge
(preferred), and fill (to show proportions).

However, you can go one step further by adjusting the dodging, so that your bars
partially overlap each other. For this example you'll again use the mtcars dataset. Like
last time cyl and am are already available as factors inside mtcars.

Instead of using position = "dodge" you're going to use position_dodge(), like you did
with position_jitter() in the Scatter plots and jittering (1) exercise. Here, you'll save this
as an object, posn_d, so that you can easily reuse it.

Remember, the reason you want to use position_dodge() (and position_jitter()) is to
specify how much dodging (or jittering) you want.

alpha -> transparency
dodge -> overlap that still lets you see every element
stack -> one on top of the other, which does not let you see all the elements with
smaller values.

# 1 - The last plot from the previous exercise


ggplot(mtcars, aes(x = cyl, fill = am)) +
geom_bar(position = "dodge")

# 2 - Define posn_d with position_dodge()


posn_d <- position_dodge(width = 0.2)

# 3 - Change the position argument to posn_d


ggplot(mtcars, aes(x = cyl, fill = am)) +
geom_bar(position = posn_d)

# 4 - Use posn_d as position and adjust alpha to 0.6


ggplot(mtcars, aes(x = cyl, fill = am)) +
geom_bar(position = posn_d, alpha = 0.6)



07/06/2020
Scatter Plots
Notice that jitter can be:
1) an argument in geom_point(position = 'jitter'),
2) a geom itself, geom_jitter(), or
3) a position function, position_jitter(0.1)

Ex.
# Shown in the viewer:
ggplot(mtcars, aes(x = cyl, y = wt)) +
geom_point()

# Solutions:
# 1 - With geom_jitter()
ggplot(mtcars, aes(x = cyl, y = wt)) +
geom_jitter()

# 2 - Set width in geom_jitter()


ggplot(mtcars, aes(x = cyl, y = wt)) +
geom_jitter(width = 0.1)

# 3 - Set position = position_jitter() in geom_point()


ggplot(mtcars, aes(x = cyl, y = wt)) +
geom_point(position = position_jitter(0.1))

Bar Plots

Histograms

Histograms are one of the most common and intuitive ways of showing distributions.
In this exercise you'll use the mtcars data frame to explore typical variations of simple
histograms. But first, some background:

The x axis/aesthetic: The documentation for geom_histogram() states the argument
stat = "bin" as a default. Recall that histograms cut up a continuous variable into
discrete bins - that's what the stat "bin" is doing. You always get 30 evenly-sized bins
by default, which is specified with the default argument binwidth = range/30. This is a
pretty good starting point if you don't know anything about the variable being plotted
and want to start exploring.

The y axis/aesthetic: geom_histogram() only requires one aesthetic: x. But there is
clearly a y axis on your plot, so where does it come from? Actually, there is a variable
mapped to the y aesthetic, it's called ..count... When geom_histogram() executed the
binning statistic (see above), it not only cut up the data into discrete bins, but it also
counted how many values are in each bin. So there is an internal data frame where
this information is stored. The .. calls the variable count from this internal data frame.
This is what appears on the y aesthetic. But it gets better! The density has also been
calculated. This is the proportional frequency of this bin in relation to the whole data
set. You use ..density.. to access this information.
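
The internal data frame described above can be inspected with layer_data(). The sketch below is an illustration on mtcars; the names binwidth and bins are illustrative. Note that density is the count divided by the total number of observations and by the bin width, so it equals the proportional frequency only when binwidth = 1, as it is here.

```r
library(ggplot2)

binwidth <- 1
p <- ggplot(datasets::mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = binwidth)

# layer_data() returns the data frame computed by the binning statistic
bins <- layer_data(p)
head(bins[, c("xmin", "xmax", "count", "density")])

sum(bins$count)                 # total number of observations
sum(bins$density * binwidth)    # the bar areas add up to 1
```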

Ex.

# 1 - Make a univariate histogram


ggplot(mtcars, aes(x = mpg)) +
geom_histogram()

# 2 - Plot 1, plus set binwidth to 1 in the geom layer


ggplot(mtcars, aes(x = mpg)) +
geom_histogram(binwidth = 1)

# 3 - Plot 2, plus MAP ..density.. to the y aesthetic (i.e. in a second aes() function)
ggplot(mtcars, aes(x = mpg)) +
geom_histogram(binwidth = 1, aes(y=..density..))

# 4 - plot 3, plus SET the fill attribute to "#377EB8"


ggplot(mtcars, aes(x = mpg)) +
geom_histogram(binwidth = 1, aes(y=..density..), fill="#377EB8")

05/06/2020
Aesthetics Best Practices
Principles:

Form follows function (plots do not need to be beautiful)
Do not include unnecessary information, to avoid introducing visual noise
Efficiency and accuracy in data representation

Ex.

# 1 - Create jittered plot of mtcars, mpg onto x, 0 onto y


ggplot(mtcars, aes(x = mpg, y = 0)) +
geom_jitter()

# 2 - Add function to change y axis limits


ggplot(mtcars, aes(x = mpg, y = 0)) +
geom_jitter()+
scale_y_continuous(limits = c(-2, 2))

Ex.
# The base layer, cyl.am, is available for you
# Add geom (position = "stack" by default)
cyl.am +
geom_bar(position = "stack")

# Fill - show proportion


cyl.am +
geom_bar(position = "fill")

# Dodging - principles of similarity and proximity


cyl.am +
geom_bar(position = "dodge")

# Clean up the axes with scale_ functions


val = c("#E41A1C", "#377EB8")
lab = c("Automatic", "Manual") # in mtcars, am: 0 = automatic, 1 = manual
cyl.am +
geom_bar(position = "dodge") +
scale_x_discrete("Cylinders") +
scale_y_continuous("Number") +
scale_fill_manual("Transmission",
values = val,
labels = lab)

IMPORTANT!
In the last chapter you saw that all the visible aesthetics can serve as attributes
(inside geom_*(???)) and aesthetics (inside aes(???)) , but I very conveniently left out
x and y. That's because although you can make univariate plots (such as histograms,
which you'll get to in the next chapter), a y-axis will always be provided, even if you
didn't ask for it.



03/06/2020
Ex.

The color aesthetic typically changes the outside outline of an object and the fill
aesthetic is typically the inside shading. However, as you saw in the last exercise,
geom_point() is an exception. Here you use color, instead of fill for the inside of the
point. But it's a bit subtler than that.

Which shape to use? The default geom_point() uses shape = 19 (a solid circle with an
outline the same colour as the inside). Good alternatives are shape = 1 (hollow) and
shape = 16 (solid, no outline). These all use the col aesthetic (don't forget to set alpha
for solid points).

A really nice alternative is shape = 21 which allows you to use both fill for the inside
and col for the outline! This is a great little trick for when you want to map two
aesthetics to a dot.

# From the previous exercise


ggplot(mtcars, aes(x = wt, y = mpg, col = cyl)) +
geom_point(shape = 1, size = 4)

# 1 - Map cyl to fill


ggplot(mtcars, aes(x = wt, y = mpg, fill = cyl)) +
geom_point(shape = 1, size = 4)

# 2 - Change shape and alpha of the points in the above plot


ggplot(mtcars, aes(x = wt, y = mpg, fill = cyl)) +
geom_point(shape = 21, size = 4, alpha = 0.6)

# 3 - Map am to col in the above plot


ggplot(mtcars, aes(x = wt, y = mpg, fill = cyl, col = am)) +
geom_point(shape = 21, size = 4, alpha = 0.6)

Notice that mapping a categorical variable onto fill doesn't change the colors, although a
legend is generated! This is because the default shape for points only has a color attribute
and not a fill attribute! Use fill when you have another shape (such as a bar), or when using
a point that does have a fill and a color attribute, such as shape = 21, which is a circle with
an outline. Any time you use a solid color, make sure to use alpha blending to account for
over plotting.

Modifying Aesthetics
Ex.
# Map cyl to size
ggplot(mtcars, aes(x = wt, y = mpg, size = cyl))+
geom_point()

# Map cyl to alpha


ggplot(mtcars, aes(x = wt, y = mpg, alpha = cyl)) +
geom_point()

# Map cyl to shape


ggplot(mtcars, aes(x = wt, y = mpg, shape = cyl))+
geom_point()

# Map cyl to label


ggplot(mtcars, aes(x = wt, y = mpg, label = cyl))+
geom_text()

Ex.
# Expand to draw points with alpha 0.5
ggplot(mtcars, aes(x = wt, y = mpg, fill = cyl)) +
geom_point(alpha = 0.5)

# Expand to draw points with shape 24 and color yellow


ggplot(mtcars, aes(x = wt, y = mpg, fill = cyl)) +
geom_point(shape = 24, color = "yellow")

# Expand to draw text with label rownames(mtcars) and color red


ggplot(mtcars, aes(x = wt, y = mpg, fill = cyl)) +
geom_text(label = rownames(mtcars), color = "red")

IMPORTANT!
Note: In this chapter you saw aesthetics and attributes. Variables in a data frame are
mapped to aesthetics in aes(). (e.g. aes(col = cyl)) within ggplot(). Visual elements are set
by attributes in specific geom layers (geom_point(col = "red")). Don't confuse these two
things - here you're focusing on aesthetic mappings.
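
The difference between mapping an aesthetic and setting an attribute can be sketched as follows. The plot names p_map and p_set are illustrative, and layer_data() is used only to inspect the colours each layer ends up with.

```r
library(ggplot2)

# Mapping: the variable cyl drives the colour, one colour per level, with a legend
p_map <- ggplot(datasets::mtcars, aes(x = wt, y = mpg, col = factor(cyl))) +
  geom_point()

# Setting: every point gets the same fixed colour, and no legend is drawn
p_set <- ggplot(datasets::mtcars, aes(x = wt, y = mpg)) +
  geom_point(col = "red")

unique(layer_data(p_map)$colour)   # one colour per cyl level
unique(layer_data(p_set)$colour)   # a single fixed colour
```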

IMPORTANT!
label and shape are only applicable to categorical data.
