Class Notes - DataCamp - R

DATACAMP – CLASS NOTES
MASTER DATA ANALYSIS WITH R
02/07/2020
CHIS - Mosaic Plots
CHIS - Descriptive Statistics

Ex.
Do Things Manually
In the previous exercise we looked at how to produce a frequency histogram when
we have many sub-categories. The problem here is that this can't be facetted because
the calculations occur on the fly inside ggplot2.
To overcome this we're going to calculate the proportions outside ggplot2. This is the
beginning of our flexible script for a mosaic plot.
The dataset adult and the BMI_fill object from the previous exercise have been
carried over for you. Code that tries to make the accurate frequency histogram
facetted is available. You should understand these commands by now.
● Use adult$RBMI and adult$SRAGE_P as arguments in table() to create a
contingency table of the two variables. Save this as DF.
● Use apply() to get the frequency of each group. The first argument is DF, the
second argument 2, because you want to do calculations on each column. The
third argument should be function(x) x/sum(x). Store the result as DF_freq.
● Load the reshape2 package and use the melt() function on DF_freq. Store the
result as DF_melted. Examine the structure of DF_freq and DF_melted if you are
not familiar with this operation.
Note: Here we use reshape2 instead of the more current tidyr because reshape2::melt()
allows us to work directly on a table. tidyr::gather() requires a data frame.

● Use names() to rename the variables in DF_melted to be c("FILL", "X", "value"),
with the prospect of making this a generalized function later on.
● The plotting call at the end uses DF_melted. Add code to make it facetted. Use
the formula FILL ~ . for this. Note that we use geom_col() now; this is just a shortcut
for geom_bar(stat = "identity").
# An attempt to facet the accurate frequency histogram from before (failed)

ggplot(adult, aes (x = SRAGE_P, fill= factor(RBMI))) +
geom_histogram(aes(y = ..count../sum(..count..)), binwidth = 1, position = "fill") +
BMI_fill +
facet_grid(RBMI ~ .)
# Create DF with table()

DF <- table(adult$RBMI, adult$SRAGE_P)
# Use apply on DF to get frequency of each group

DF_freq <- apply(DF, 2, function(x) x/sum(x))
# Load reshape2 and use melt on DF_freq to create DF_melted

library(reshape2)
DF_melted <- melt(DF_freq)
str(DF_freq)
str(DF_melted)
# Change names of DF_melted

names(DF_melted) <- c("FILL", "X", "value")
# Add code to make this a faceted plot

ggplot(DF_melted, aes(x = X, y = value, fill = FILL)) +
geom_col(position = "stack") +
BMI_fill +
facet_grid(FILL ~ .) # Facets
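
To make the apply() step concrete, here is a minimal sketch on a made-up six-row data set (the vectors bmi and age are invented for illustration and are not part of CHIS). It suggests that dividing each column by its sum gives column proportions, the same numbers base R's prop.table() returns with margin = 2. As a swapped-in alternative to melt(), it also uses as.data.frame(as.table()) to get a similar long layout without reshape2.

```r
# Toy data, invented for illustration (not the CHIS adult dataset)
bmi <- c("Normal", "Obese", "Normal", "Normal", "Obese", "Obese")
age <- c(20, 20, 20, 30, 30, 30)

# Contingency table: rows = BMI category, columns = age
DF <- table(bmi, age)

# Column-wise proportions, as in the exercise
DF_freq <- apply(DF, 2, function(x) x / sum(x))

# Every column should now sum to 1
colSums(DF_freq)

# prop.table() with margin = 2 computes the same proportions
prop.table(DF, margin = 2)

# Long layout using base R only: one row per (bmi, age) cell
DF_long <- as.data.frame(as.table(DF_freq))
names(DF_long) <- c("FILL", "X", "value")
DF_long
```

The long layout is what makes faceting possible: each row carries its own FILL, X and value, so ggplot2 no longer has to compute anything on the fly.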

Ex.
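# Base layers assumed from the earlier age-by-BMI histogram (adult, SRAGE_P, RBMI)
# Plot 1 - Count histogram
ggplot(adult, aes(x = SRAGE_P, fill = factor(RBMI))) +
  geom_histogram(binwidth = 1) +
  BMI_fill
# Plot 2 - Density histogram
ggplot(adult, aes(x = SRAGE_P, fill = factor(RBMI))) +
  geom_histogram(aes(y = ..density..), binwidth = 1) +
  BMI_fill
# Plot 3 - Faceted count histogram
ggplot(adult, aes(x = SRAGE_P, fill = factor(RBMI))) +
  geom_histogram(binwidth = 1) +
  BMI_fill +
  facet_grid(RBMI ~ .)
# Plot 4 - Faceted density histogram
ggplot(adult, aes(x = SRAGE_P, fill = factor(RBMI))) +
  geom_histogram(aes(y = ..density..), binwidth = 1) +
  BMI_fill +
  facet_grid(RBMI ~ .)
# Plot 5 - Density histogram with position = "fill"
ggplot(adult, aes(x = SRAGE_P, fill = factor(RBMI))) +
  geom_histogram(aes(y = ..density..), binwidth = 1, position = "fill") +
  BMI_fill
# Plot 6 - The accurate histogram
ggplot(adult, aes(x = SRAGE_P, fill = factor(RBMI))) +
  geom_histogram(aes(y = ..count../sum(..count..)), binwidth = 1, position = "fill") +
  BMI_fill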
Ex.
Multiple Histograms
When we introduced histograms we focused on univariate data, which is exactly
what we've been doing here. However, when we want to explore distributions
further there is much more we can do. For example, there are density plots, which
you'll explore in the next course. For now, we'll look deeper at frequency histograms
and begin developing our mosaic plots.
The adult dataset, which is cleaned up by now, is available in the workspace for you.

Two layers have been pre-defined for you: BMI_fill is a scale layer which we can add
to a ggplot() command using +: ggplot(...) + BMI_fill. fix_strips is a theme() layer to
make nice facet titles.
● The histogram from the first exercise of age colored by BMI has been
provided. The predefined theme(), fix_strips (see above), has been added to the
histogram. Add BMI_fill to this plot using the + operator as well.
● In addition, add the following elements to create a pretty & insightful plot:
● Use facet_grid() to facet the rows according to RBMI (Remember formula
notation ROWS ~ COL and use . as a place-holder when not facetting in that
direction).
● Add the classic theme using theme_classic().
# The color scale used in the plot

BMI_fill <- scale_fill_brewer("BMI Category", palette = "Reds")
# Theme to fix category display in faceted plot

fix_strips <- theme(strip.text.y = element_text(angle = 0, hjust = 0, vjust = 0.1, size = 14),
strip.background = element_blank(),
legend.position = "none")
# Histogram, add BMI_fill and customizations

ggplot(adult, aes(x = SRAGE_P, fill = RBMI)) +
  geom_histogram(binwidth = 1) +
  BMI_fill +
  facet_grid(RBMI ~ .) +
  theme_classic() +
  fix_strips
Ex.

# Keep adults younger than or equal to 84
adult <- adult[adult$SRAGE_P <= 84, ]
# Keep adults with BMI at least 16 and less than 52

adult <- adult[adult$BMI_P >= 16 & adult$BMI_P < 52, ]
# Relabel the race variable

adult$RACEHPR2 <- factor(adult$RACEHPR2,
                         labels = c("Latino", "Asian", "African American", "White"))
# Relabel the BMI categories variable

adult$RBMI <- factor(adult$RBMI,
                     labels = c("Under-weight", "Normal-weight", "Over-weight", "Obese"))
Ex.
Exploring Data
In this chapter we're going to continuously build on our plotting functions and
understanding to produce a mosaic plot (aka Marimekko plot). This is a visual
representation of a contingency table, comparing two categorical variables.
Essentially, our question is which groups are over or under represented in our
dataset. To visualize this we'll color groups according to their Pearson residuals from
a chi-squared test. At the end of it all we'll wrap up our script into a flexible function
so that we can look at other variables.
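
As a rough illustration of the Pearson residuals mentioned above, the sketch below computes them by hand for a small contingency table and compares the result with what chisq.test() reports. It uses the built-in mtcars data rather than CHIS, and the object names (tab, expected, pearson) are chosen here for illustration only. A positive residual suggests a cell is over-represented relative to independence, a negative one that it is under-represented.

```r
# Contingency table from the built-in mtcars data (cylinders by transmission)
tab <- table(cyl = mtcars$cyl, am = mtcars$am)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson residuals: (observed - expected) / sqrt(expected)
pearson <- (tab - expected) / sqrt(expected)
pearson

# chisq.test() stores the same residuals; warnings are suppressed because
# some expected counts in this small table are below 5
test <- suppressWarnings(chisq.test(tab))
test$residuals
```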
We'll familiarize ourselves with a small number of variables from the 2009 CHIS
adult-response dataset (as opposed to children). We have selected the following
variables to explore:
● RBMI: BMI Category description
● BMI_P: BMI value
● RACEHPR2: race
● SRSEX: sex
● SRAGE_P: age
● MARIT2: Marital status
● AB1: General Health Condition
● ASTCUR: Current Asthma Status
● AB51: Type I or Type II Diabetes
● POVLL: Poverty level

We'll filter our dataset to plot a more reliable subset (we'll still retain over 95% of the
data).
Before we get into mosaic plots it's worthwhile exploring the dataset using simple
distribution plots - i.e. histograms.
ggplot2 is already loaded and the dataset, named adult, is already available in the
workspace.
● Use the typical commands for exploring the structure of adult to get familiar
with the variables: summary() and str().
# Explore the dataset with summary and str

summary(adult)
str(adult)
● As a first exploration of the data, plot two histograms using ggplot2 syntax: one
for age (SRAGE_P) and one for BMI (BMI_P). The goal is to explore the dataset
and get familiar with the distributions here. Feel free to explore different bin
widths. We'll ask some questions about these in the next exercises.
# Age histogram
ggplot(adult, aes(x=SRAGE_P))+
geom_histogram()
# BMI value histogram

ggplot(adult, aes(x=BMI_P))+
geom_histogram(binwidth = 3)

● Next plot a binned-distribution of age, filling each bar according to the BMI
categorization. Inside geom_histogram(), set binwidth = 1. You'll want to use fill
= factor(RBMI) since RBMI is a categorical variable.
# Age colored by BMI, binwidth = 1

ggplot(adult, aes(x = SRAGE_P, fill = factor(RBMI))) +
  geom_histogram(binwidth = 1)
01/07/2020
Best Practices: Heat Maps
30/06/2020
Best Practices: Bar Plots
Ex.
# Base layers

m <- ggplot(mtcars.cyl, aes(x = cyl, y = wt.avg))
# Plot 1: Draw bar plot with geom_bar

m + geom_bar(stat = "identity", fill = "skyblue")
# Plot 2: Draw bar plot with geom_col
# geom_col() is a shortcut for geom_bar(stat = "identity"), for when your data
# already has the bar heights (e.g. counts).
m + geom_col(fill = "skyblue")
# Plot 3: geom_col with variable widths.

m + geom_col(fill = "skyblue", width = mtcars.cyl$prop)
# Plot 4: Add error bars

m+
geom_col(fill = "skyblue", width = mtcars.cyl$prop) +
geom_errorbar(aes(ymin = wt.avg - sd, ymax = wt.avg + sd), width = 0.1)
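
The equivalence of geom_col() and geom_bar(stat = "identity") can be checked with a small sketch. The notes do not show how mtcars.cyl was built, so avg_wt below is a hypothetical stand-in computed from the built-in mtcars data; layer_data() is used to pull out the values each layer would draw.

```r
library(ggplot2)

# Hypothetical summary table standing in for mtcars.cyl:
# average weight per number of cylinders
avg_wt <- aggregate(wt ~ cyl, data = mtcars, FUN = mean)
avg_wt$cyl <- factor(avg_wt$cyl)

# Same data, two spellings of the same bar layer
p_bar <- ggplot(avg_wt, aes(x = cyl, y = wt)) +
  geom_bar(stat = "identity")
p_col <- ggplot(avg_wt, aes(x = cyl, y = wt)) +
  geom_col()

# Both layers should draw bars of the same height
all.equal(layer_data(p_bar)$y, layer_data(p_col)$y)
```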

Ex.
# Base layers
m <- ggplot(mtcars, aes(x = cyl,y = wt, col = am, fill = am))
# Plot 1: Draw dynamite plot

m+
stat_summary(fun.y = mean, geom = "bar") +
stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), geom = "errorbar", width =
0.1)
# Plot 2: Set position dodge in each stat function

m+
stat_summary(fun.y = mean, geom = "bar", position = "dodge") +
stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1),
geom = "errorbar", width = 0.1, position = "dodge")
# Set your dodge posn manually

posn.d <- position_dodge(0.9)

# Plot 3: Redraw dynamite plot
m +
  stat_summary(fun.y = mean, geom = "bar", position = posn.d) +
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), geom = "errorbar",
               width = 0.1, position = position_dodge(0.9))
Ex.
# Base layers
m <- ggplot(mtcars, aes(x = cyl, y = wt))
# Draw dynamite plot

m +
  stat_summary(fun.y = mean, geom = "bar", fill = "skyblue") +
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), geom = "errorbar",
               width = 0.1)

22/06/2020
Recycling Themes
Ex.
Exploring ggthemes
There are many themes available by default in ggplot2: theme_bw(), theme_classic(),
theme_gray(), etc. In the previous exercise, you saw that you can apply these themes
to all following plots, with theme_set():
theme_set(theme_bw())
But you can also apply them on an individual plot, with:
... + theme_bw()
You can also extend these themes with your own modifications. In this exercise, you'll
experiment with this and use some preset templates available from the ggthemes
package. The workspace already contains the same basic plot from before under the
name z2.
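
A minimal sketch of the two ways of applying a theme described above. The plot p is a stand-in built from mtcars, since z2 is only available in the course workspace. theme_set() returns the previous default, which is worth keeping so it can be restored afterwards.

```r
library(ggplot2)

# Stand-in plot (z2 from the course workspace is not available here)
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Apply a theme to this plot only
p + theme_classic()

# Change the default for all following plots; keep the old default
old <- theme_set(theme_bw())
p

# Restore the previous default
theme_set(old)
```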
Answer.
# Original plot
z2
# Load ggthemes
library(ggthemes)
# Apply theme_tufte(), plot additional modifications

custom_theme <- theme_tufte() +
  theme(legend.position = c(0.9, 0.9),
        legend.title = element_text(face = "italic", size = 12),
        axis.title = element_text(face = "bold", size = 14))
# Draw the customized plot

z2 + custom_theme
# Use theme set to set custom theme as default

theme_set(custom_theme)
# Plot z2 again
z2
Ex.
# Original plot
z2
# Theme layer saved as an object, theme_pink

theme_pink <- theme(panel.background = element_blank(),
legend.key = element_blank(),
legend.background = element_blank(),
plot.background = element_rect(fill = myPink, color = "black", size = 3),
panel.grid = element_blank(),
axis.line = element_line(color = "red"),
axis.ticks = element_line(color = "red"),
strip.text = element_text(size = 16, color = myRed),
axis.title.y = element_text(color = myRed, hjust = 0, face = "italic"),
axis.title.x = element_text(color = myRed, hjust = 0, face = "italic"),
axis.text = element_text(color = "black"))
# 1 - Apply theme_pink to z2
z2 +
theme_pink

# 2 - Update the default theme, and at the same time
# assign the old theme to the object old.
old <- theme_update(panel.background = element_blank(),
legend.key = element_blank(),
legend.background = element_blank(),
plot.background = element_rect(fill = myPink, color = "black", size = 3),
panel.grid = element_blank(),
axis.line = element_line(color = "red"),
axis.ticks = element_line(color = "red"),
strip.text = element_text(size = 16, color = myRed),
axis.title.y = element_text(color = myRed, hjust = 0, face = "italic"),
axis.title.x = element_text(color = myRed, hjust = 0, face = "italic"),
axis.text = element_text(color = "black"))
# 4 - Restore the old default theme

theme_set(old)
# Display the plot z2 - old theme restored

z2

Themes from Scratch
# Original plot, color provided
z
myRed
# Extend z with theme() function and 3 args

z+
theme(strip.text = element_text(size = 16, color = myRed),
axis.title = element_text(color = myRed, hjust = 0, face = "italic"),
axis.text = element_text(color = "black"))
Facets Layer
Ex.
# Code to create the cyl_am col and myCol vector
mtcars$cyl_am <- paste(mtcars$cyl, mtcars$am, sep = "_")
myCol <- rbind(brewer.pal(9, "Blues")[c(3,6,8)],
brewer.pal(9, "Reds")[c(3,6,8)])
# Map cyl_am onto col

ggplot(mtcars, aes(x = wt, y = mpg, col = cyl_am)) +

geom_point() +
# Add a manual colour scale
scale_color_manual(values = myCol)
# Grid facet on gear vs. vs

ggplot(mtcars, aes(x = wt, y = mpg, col = cyl_am)) +
geom_point() +
scale_color_manual(values = myCol)+
facet_grid(gear ~ vs)
# Also map disp to size

ggplot(mtcars, aes(x = wt, y = mpg, col = cyl_am, size=disp)) +
geom_point() +
scale_color_manual(values = myCol)+
facet_grid(gear ~ vs)

Ex.
# Basic scatter plot
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point()
# 1 - Separate rows according to transmission type, am -

p+
facet_grid(am ~ . )
# 2 - Separate columns according to cylinders, cyl

p+
facet_grid(. ~ cyl)

# 3 - Separate by both columns and rows
p+
facet_grid(am ~ cyl)
21/06/2020
Ex.
# Create a stacked bar plot: wide.bar

wide.bar <- ggplot(mtcars, aes(x = 1, fill = cyl)) +
geom_bar()
# Convert wide.bar to pie chart

wide.bar +
coord_polar(theta = "y")

# Create stacked bar plot: thin.bar
thin.bar <- ggplot(mtcars, aes(x = 1, fill = cyl)) +
geom_bar(width = 0.1) +
scale_x_continuous(limits = c(0.5,1.5))
# Convert thin.bar to "ring" type pie chart

thin.bar +
coord_polar(theta = "y")
Ex.
# The base ggplot command; you don't have to change this
wt.cyl.am <- ggplot(mtcars, aes(x = cyl,y = wt, col = am, fill = am, group = am))
# Add three stat_summary calls to wt.cyl.am

wt.cyl.am +
stat_summary(geom = "linerange", fun.data = med_IQR,
position = posn.d, size = 3) +
stat_summary(geom = "linerange", fun.data = gg_range,
position = posn.d, size = 3,
alpha = 0.4) +
stat_summary(geom = "point", fun.y = median,
position = posn.d, size = 3,
col = "black" , shape = "X")
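
The helpers med_IQR and gg_range used above are not defined anywhere in these notes. The sketch below is one plausible definition, assuming they return the ymin/ymax (and y) columns that a "linerange" layer expects from fun.data; the course's own versions may differ.

```r
# Assumed helper: full range of x, as ymin/ymax for a linerange
gg_range <- function(x) {
  data.frame(ymin = min(x), ymax = max(x))
}

# Assumed helper: median with the interquartile range around it
med_IQR <- function(x) {
  data.frame(y = median(x),
             ymin = unname(quantile(x, 0.25)),
             ymax = unname(quantile(x, 0.75)))
}

# Quick check on the car weights
gg_range(mtcars$wt)
med_IQR(mtcars$wt)
```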

Ex.
# wt.cyl.am, posn.d, posn.jd and posn.j are available
# Plot 1: Jittered, dodged scatter plot with transparent points

wt.cyl.am +
geom_point(position = posn.jd, alpha = 0.6)
# Plot 2: Mean and SD - the easy way

wt.cyl.am +
stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), position = posn.d)
# Plot 3: Mean and 95% CI - the easy way

wt.cyl.am +
stat_summary(fun.data = mean_cl_normal, position = posn.d)

# Plot 4: Mean and SD - with T-tipped error bars - fill in ___
wt.cyl.am +
stat_summary(geom = "point", fun.y = mean,
position = posn.d) +
stat_summary(geom = "errorbar", fun.data = mean_sdl,
position = posn.d, fun.args = list(mult = 1), width = 0.1)
Ex.
# Display structure of mtcars
str(mtcars)
# Convert cyl and am to factors

mtcars$cyl <- as.factor(mtcars$cyl )
mtcars$am <- as.factor(mtcars$am)
# Define positions
posn.d <- position_dodge(width = 0.1)
posn.jd <- position_jitterdodge(jitter.width=0.1, dodge.width=0.2)
posn.j <- position_jitter(width=0.2)
# Base layers
wt.cyl.am <- ggplot(mtcars, aes(x=cyl, y=wt, col=am, fill=am, group=am))
Ex.

Sum
Another useful stat function is stat_sum(). This function calculates the total number
of overlapping observations and is another good way to deal with overplotting.
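
A short sketch of stat_sum() on the built-in mtcars data (chosen here for illustration; the course exercise uses a different dataset). Many cars share the same cyl/gear combination, so plain points would sit on top of each other. stat_sum() draws one point per location and, by default, sizes it by the number of observations there.

```r
library(ggplot2)

# One point per (cyl, gear) location, sized by how many cars fall on it
p <- ggplot(mtcars, aes(x = cyl, y = gear)) +
  stat_sum()
p

# The computed count per location is stored in the column n
counts <- layer_data(p)
counts[, c("x", "y", "n")]
```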
Ex.
# Use stat_quantile instead of stat_smooth
ggplot(Vocab, aes(x = education, y = vocabulary, col = year, group = factor(year))) +
stat_quantile(alpha = 0.6, size = 2) +
scale_color_gradientn(colors = brewer.pal(9,"YlOrRd"))
# Set quantile to 0.5

ggplot(Vocab, aes(x = education, y = vocabulary, col = year, group = factor(year))) +
  stat_quantile(alpha = 0.6, size = 2, quantiles = 0.5) +
  scale_color_gradientn(colors = brewer.pal(9, "YlOrRd"))
Ex.
# Plot 1: Jittered scatter plot, add a linear model (lm) smooth
ggplot(Vocab, aes(x = education, y = vocabulary)) +
geom_jitter(alpha = 0.2) +

stat_smooth(method = "lm", se = FALSE) # smooth
# Plot 2: points, colored by year

ggplot(Vocab, aes(x = education, y = vocabulary, col = year)) +
geom_jitter(alpha = 0.2)
# Plot 3: lm, colored by year

ggplot(Vocab, aes(x = education, y = vocabulary, col = factor(year))) +
stat_smooth(se=FALSE, method="lm") # smooth
# Plot 4: Set a color brewer palette

ggplot(Vocab, aes(x = education, y = vocabulary, col = factor(year))) +
stat_smooth(se=FALSE, method="lm") + # smooth
scale_color_brewer() # colors

# Plot 5: Add the group aes, specify alpha and size
ggplot(Vocab, aes(x = education, y = vocabulary, col = year, group = factor(year))) +
  stat_smooth(method = "lm", se = FALSE, alpha = 0.6, size = 2) +
  scale_color_gradientn(colors = brewer.pal(9, "YlOrRd"))
12/06/2020
Stats and Geoms
ggplot(mtcars, aes(x = wt, y = mpg)) +

geom_point()+
geom_smooth()

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
ggplot(mtcars, aes(x = wt, y = mpg))+

stat_smooth(method = "lm", se = FALSE)
Wrap-up

Ex.
# 2 - Use ggplot() for the first instruction
ggplot(titanic, aes(x = Pclass, fill = Sex)) +
geom_bar(position = "dodge")
ggplot(titanic, aes(x = Pclass, fill = Sex)) +

geom_bar(position = "dodge")+
facet_grid(. ~ Survived)
ggplot(titanic, aes(x = Pclass, y = Age, color = Sex)) +

geom_point(size = 3, alpha = 0.5, position = posn.jd)+
facet_grid(. ~ Survived)

10/06/2020
qplot
Ex.
Choosing geoms, part 2 - dotplot

Some naming conventions:
● Scatter plots: Continuous x, continuous y.

● Dot plots: Categorical x, continuous y.
You use geom_point() for both plot types. Jittering position is set in the geom_point()
layer.
However, to make a "true" dot plot, you can use geom_dotplot(). The difference is that
unlike geom_point(), geom_dotplot() uses a binning statistic. Binning means to cut up a
continuous variable (the y in this case) into discrete "bins". You already saw binning
with geom_histogram() (see this exercise for a refresher).
One thing to notice is that geom_dotplot() uses a different plotting symbol from
geom_point(). For this symbol, the color aesthetic changes the color of its border,
and the fill aesthetic changes the color of its interior.
Let's take a look at how the two geoms compare.
# cyl and am are factors, wt is numeric

class(mtcars$cyl)
class(mtcars$am)
class(mtcars$wt)
# "Basic" dot plot, with geom_point():

ggplot(mtcars, aes(cyl, wt, col = am)) +
geom_point(position = position_jitter(0.2, 0))
# 1 - "True" dot plot, with geom_dotplot():

ggplot(mtcars, aes(cyl, wt, fill = am)) +
geom_dotplot(binaxis = "y", stackdir = "center")
# 2 - qplot with geom "dotplot", binaxis = "y" and stackdir = "center"

qplot(
cyl, wt,
data = mtcars,
fill = am,
geom = "dotplot",
binaxis = "y",
stackdir = "center"
)
IMPORTANT!
Categorical variables: the values are limited to a finite set of groups. For example,
a categorical variable can be country, year, gender or occupation.
Continuous variables: the values can be any number, from integer to decimal. For
example, revenue or the price of a share. Numeric variables are R's default for
numbers; they are stored as numeric or integer.
Factor: R stores categorical variables as factors.
Discrete variables are a subtype of numeric variable. Their values are necessarily
whole numbers or other discrete values, such as population or counts of items.
Continuous variables can take on any value within an interval, and so can be
expressed as decimals. They are often measured quantities.
More info: https://rcompanion.org/handbook/C_01.html
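
A small sketch of how these variable types show up in R, using the built-in mtcars data. In base mtcars, cyl is stored as a number; converting it with factor() tells R (and ggplot2) to treat it as categorical. The object names cyl_num and cyl_fac are chosen here for illustration.

```r
# cyl is stored as a number in the built-in mtcars data
cyl_num <- mtcars$cyl
class(cyl_num)
is.numeric(cyl_num)

# Converting to a factor makes it categorical, with one level per distinct value
cyl_fac <- factor(cyl_num)
class(cyl_fac)
levels(cyl_fac)

# wt is a measured quantity and stays continuous
class(mtcars$wt)
```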
# qplot() with x only

qplot(factor(cyl), data = mtcars)

# qplot() with x and y
qplot(factor(cyl), factor(vs), data = mtcars)
# qplot() with geom set to jitter manually

qplot(factor(cyl), factor(vs), data = mtcars, geom = "jitter")
Line Plots - Time Series

Ex.
ggplot(fish.tidy, aes(x = Year, y = Capture, color = Species)) +
geom_line()

Ex.
# Basic line plot
ggplot(economics, aes(x = date, y = unemploy/pop)) +
geom_line()
# Expand the following command with geom_rect() to draw the recess periods
ggplot(economics, aes(x = date, y = unemploy/pop)) +
geom_rect(data = recess,
aes(xmin = begin, xmax = end, ymin = -Inf, ymax = +Inf),
inherit.aes = FALSE, fill = "red", alpha = 0.2) +
geom_line()
09/06/2020
Ex.
# 1 - Basic histogram plot command
ggplot(mtcars, aes(mpg)) +
  geom_histogram(binwidth = 1)
# 2 - Plot 1, Expand aesthetics: am onto fill
ggplot(mtcars, aes(mpg, fill = am)) +
  geom_histogram(binwidth = 1)
# 3 - Plot 2, change position = "dodge"
ggplot(mtcars, aes(mpg, fill = am)) +
  geom_histogram(binwidth = 1, position = "dodge")
# 4 - Plot 3, change position = "fill"
ggplot(mtcars, aes(mpg, fill = am)) +
  geom_histogram(binwidth = 1, position = "fill")
# 5 - Plot 4, plus change position = "identity" and alpha = 0.4
ggplot(mtcars, aes(mpg, fill = am)) +
  geom_histogram(binwidth = 1, position = "identity", alpha = 0.4)
# 6 - Plot 5, plus change mapping: cyl onto fill
ggplot(mtcars, aes(mpg, fill = cyl)) +
  geom_histogram(binwidth = 1, position = "identity", alpha = 0.4)
Ex.
# Plot education on x and vocabulary on fill

# Use the default brewed color palette
ggplot(Vocab, aes(x=education, fill=vocabulary))+
geom_bar(position = "fill")
# Plot education on x and vocabulary on fill
# Use the default brewed color palette
ggplot(Vocab, aes(x = education, fill = vocabulary)) +
  geom_bar(position = "fill") +
  scale_fill_brewer()

Warning in the console:
Warning message: n too large, allowed maximum for palette Blues is 9
Returning the palette you asked for with that many colors
Ex.
Bar plots with color ramp, part 2

In the previous exercise, you ended up with an incomplete bar plot. This was because
for continuous data, the default RColorBrewer palette that scale_fill_brewer() calls is
"Blues". There are only 9 colours in the palette, and since you have 11 categories,
your plot looked strange.
In this exercise, you'll manually create a color palette that can generate all the colours
you need. To do this you'll use a function called colorRampPalette().
The input is a character vector of 2 or more colour values, e.g. "#FFFFFF" (white) and
"#0000FF" (pure blue). (See this exercise for a discussion on hexadecimal codes).
The output is itself a function! So when you assign it to an object, that object should
be used as a function. To see what we mean, execute the following three lines in the
console:
new_col <- colorRampPalette(c("#FFFFFF", "#0000FF"))

new_col(4) # the newly extrapolated colours
munsell::plot_hex(new_col(4)) # Quick and dirty plot
new_col() is a function that takes one argument: the number of colours you want to
extrapolate. You want to use nicer colours, so we've assigned the entire "Blues"
colour palette from the RColorBrewer package to the character vector blues.
# Final plot of last exercise

ggplot(Vocab, aes(x = education, fill = vocabulary)) +

geom_bar(position = "fill") +
scale_fill_brewer()
# Definition of a set of blue colors

blues <- brewer.pal(9, "Blues") # from the RColorBrewer package
# 1 - Make a color range using colorRampPalette() and the set of blues

blue_range <- colorRampPalette(blues)
# 2 - Use blue_range to adjust the color of the bars, use scale_fill_manual()

ggplot(Vocab, aes(x = education, fill = vocabulary)) +
geom_bar(position = "fill") +
scale_fill_manual(values = blue_range(11))
Ex.
# A basic histogram, add coloring defined by cyl
ggplot(mtcars, aes(mpg, fill = cyl)) +
  geom_histogram(binwidth = 1)
# Change position to identity
ggplot(mtcars, aes(mpg, fill = cyl)) +
  geom_histogram(binwidth = 1, position = "identity")

# Change geom to freqpoly (position is identity by default)
ggplot(mtcars, aes(mpg, color = cyl)) +
geom_freqpoly(binwidth = 1, position = "identity")
Overlapping bar plots

So far you've seen three different positions for bar plots: stack (the default), dodge
(preferred), and fill (to show proportions).
However, you can go one step further by adjusting the dodging, so that your bars
partially overlap each other. For this example you'll again use the mtcars dataset. Like
last time cyl and am are already available as factors inside mtcars.
Instead of using position = "dodge" you're going to use position_dodge(), like you did
with position_jitter() in the Scatter plots and jittering (1) exercise. Here, you'll save this
as an object, posn_d, so that you can easily reuse it.
Remember, the reason you want to use position_dodge() (and position_jitter()) is to
specify how much dodging (or jittering) you want.
alpha -> transparency
dodge -> overlap that still lets you see the elements
stack -> one on top of the other, which does not let you see all the elements of the
smaller values.
# 1 - The last plot from the previous exercise
ggplot(mtcars, aes(x = cyl, fill = am)) +
  geom_bar(position = "dodge")
# 2 - Define posn_d with position_dodge()
posn_d <- position_dodge(width = 0.2)
# 3 - Change the position argument to posn_d
ggplot(mtcars, aes(x = cyl, fill = am)) +
  geom_bar(position = posn_d)
# 4 - Use posn_d as position and adjust alpha to 0.6
ggplot(mtcars, aes(x = cyl, fill = am)) +
  geom_bar(position = posn_d, alpha = 0.6)

07/06/2020
Scatter Plots
Notice that jitter can be:
1) an argument in geom_point(position = 'jitter'),
2) a geom itself, geom_jitter(), or
3) a position function, position_jitter(0.1)
Ex.
# Shown in the viewer:
ggplot(mtcars, aes(x = cyl, y = wt)) +
geom_point()
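# Solutions:
# 1 - With geom_jitter()
ggplot(mtcars, aes(x = cyl, y = wt)) +
  geom_jitter()
# 2 - Set width in geom_jitter()
ggplot(mtcars, aes(x = cyl, y = wt)) +
  geom_jitter(width = 0.1)
# 3 - Set position = position_jitter() in geom_point()
ggplot(mtcars, aes(x = cyl, y = wt)) +
  geom_point(position = position_jitter(0.1))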
Bar Plots
Histograms
Histograms are one of the most common and intuitive ways of showing distributions.
In this exercise you'll use the mtcars data frame to explore typical variations of simple
histograms. But first, some background:
The x axis/aesthetic: The documentation for geom_histogram() states the argument
stat = "bin" as a default. Recall that histograms cut up a continuous variable into
discrete bins - that's what the stat "bin" is doing. You always get 30 evenly-sized bins
by default, which is specified with the default argument binwidth = range/30. This is a
pretty good starting point if you don't know anything about the variable being plotted
and want to start exploring.
The y axis/aesthetic: geom_histogram() only requires one aesthetic: x. But there is
clearly a y axis on your plot, so where does it come from? Actually, there is a variable
mapped to the y aesthetic; it's called "..count..". When geom_histogram() executed the
binning statistic (see above), it not only cut up the data into discrete bins, but it also
counted how many values are in each bin. So there is an internal data frame where
this information is stored. The .. calls the variable count from this internal data frame.
This is what appears on the y aesthetic. But it gets better! The density has also been
calculated. This is the proportional frequency of this bin in relation to the whole data
set, divided by the bin width, so that the bar areas add up to 1 (with binwidth = 1 it
equals the proportion). You use ..density.. to access this information.
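
The relationship between count and density can be sketched with base R's hist(), used here as a stand-in for ggplot2's binning statistic (the two may place bin edges differently, so the numbers need not match a given ggplot2 histogram). The breaks below are chosen for illustration. The point is that density is the count divided by the number of observations times the bin width.

```r
# Bin mpg into 2-unit-wide bins without drawing anything
h <- hist(mtcars$mpg, breaks = seq(10, 34, by = 2), plot = FALSE)

h$counts    # number of cars in each bin
h$density   # density of each bin

# density = count / (n * bin width)
width <- diff(h$breaks)
all.equal(h$density, h$counts / (sum(h$counts) * width))

# The bar areas (density * width) should add up to 1
sum(h$density * width)
```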
Ex.
# 1 - Make a univariate histogram
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram()
# 2 - Plot 1, plus set binwidth to 1 in the geom layer
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 1)
# 3 - Plot 2, plus MAP ..density.. to the y aesthetic (i.e. in a second aes() function)
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 1, aes(y = ..density..))
# 4 - plot 3, plus SET the fill attribute to "#377EB8"
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 1, aes(y = ..density..), fill = "#377EB8")
05/06/2020
Aesthetics Best Practices
Principles:
Form follows function (plots do not need to be beautiful)
Leave out unnecessary information to avoid introducing visual noise
Efficiency and accuracy in data representation
Ex.
# 1 - Create jittered plot of mtcars, mpg onto x, 0 onto y

ggplot(mtcars, aes(x = mpg, y = 0)) +
geom_jitter()
# 2 - Add function to change y axis limits

ggplot(mtcars, aes(x = mpg, y = 0)) +
geom_jitter()+
scale_y_continuous(limits = c(-2, 2))
Ex.
# The base layer, cyl.am, is available for you
# Add geom (position = "stack" by default)
cyl.am +
geom_bar(position = "stack")
# Fill - show proportion
cyl.am +
  geom_bar(position = "fill")
# Dodging - principles of similarity and proximity
cyl.am +
  geom_bar(position = "dodge")
# Clean up the axes with scale_ functions

val = c("#E41A1C", "#377EB8")
lab = c("Manual", "Automatic")
cyl.am +
geom_bar(position = "dodge") +
scale_x_discrete("Cylinders") +
scale_y_continuous("Number") +
scale_fill_manual("Transmission",
values = val,
labels = lab)
IMPORTANT!
In the last chapter you saw that all the visible aesthetics can serve as attributes
(inside geom_*(...)) and aesthetics (inside aes(...)), but I very conveniently left out
x and y. That's because although you can make univariate plots (such as histograms,
which you'll get to in the next chapter), a y-axis will always be provided, even if you
didn't ask for it.

03/06/2020
Ex.
The color aesthetic typically changes the outside outline of an object and the fill
aesthetic is typically the inside shading. However, as you saw in the last exercise,
geom_point() is an exception. Here you use color, instead of fill for the inside of the
point. But it's a bit subtler than that.
Which shape to use? The default geom_point() uses shape = 19 (a solid circle with an
outline the same colour as the inside). Good alternatives are shape = 1 (hollow) and
shape = 16 (solid, no outline). These all use the col aesthetic (don't forget to set alpha
for solid points).
A really nice alternative is shape = 21 which allows you to use both fill for the inside
and col for the outline! This is a great little trick for when you want to map two
aesthetics to a dot.
# From the previous exercise

ggplot(mtcars, aes(x = wt, y = mpg, col = cyl)) +
geom_point(shape = 1, size = 4)
# 1 - Map cyl to fill

ggplot(mtcars, aes(x = wt, y = mpg, fill = cyl)) +
geom_point(shape = 1, size = 4)
# 2 - Change shape and alpha of the points in the above plot
ggplot(mtcars, aes(x = wt, y = mpg, fill = cyl)) +
  geom_point(shape = 21, size = 4, alpha = 0.6)
# 3 - Map am to col in the above plot

ggplot(mtcars, aes(x = wt, y = mpg, fill = cyl, col = am)) +
geom_point(shape = 21, size = 4, alpha = 0.6)
Notice that mapping a categorical variable onto fill doesn't change the colors, although a
legend is generated! This is because the default shape for points only has a color attribute
and not a fill attribute! Use fill when you have another shape (such as a bar), or when using
a point that does have a fill and a color attribute, such as shape = 21, which is a circle with
an outline. Any time you use a solid color, make sure to use alpha blending to account for
over plotting.
Modifying Aesthetics
Ex.
# Map cyl to size
ggplot(mtcars, aes(x = wt, y = mpg, size = cyl))+
geom_point()
# Map cyl to alpha

ggplot(mtcars, aes(x = wt, y = mpg, alpha = cyl)) +
geom_point()
# Map cyl to shape

ggplot(mtcars, aes(x = wt, y = mpg, shape = cyl))+
geom_point()
# Map cyl to label

ggplot(mtcars, aes(x = wt, y = mpg, label = cyl))+
geom_text()
Ex.
# Expand to draw points with alpha 0.5
geom_point(alpha = 0.5)
# Expand to draw points with shape 24 and color yellow

geom_point(shape = 24, color = "yellow")
# Expand to draw text with label rownames(mtcars) and color red

geom_text(label = rownames(mtcars), color = "red")
IMPORTANT!
Note: In this chapter you saw aesthetics and attributes. Variables in a data frame are
mapped to aesthetics in aes(). (e.g. aes(col = cyl)) within ggplot(). Visual elements are set
by attributes in specific geom layers (geom_point(col = "red")). Don't confuse these two
things - here you're focusing on aesthetic mappings.
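
The difference between mapping an aesthetic and setting an attribute can be seen in a small sketch. It works on a local copy of mtcars (named cars here for illustration) with cyl converted to a factor, and uses layer_data() to inspect the colours each layer ends up with.

```r
library(ggplot2)

# Local copy with cyl as a factor, so the original mtcars is left untouched
cars <- transform(mtcars, cyl = factor(cyl))

# Aesthetic: the variable cyl is MAPPED to colour inside aes()
p_mapped <- ggplot(cars, aes(x = wt, y = mpg, col = cyl)) +
  geom_point()

# Attribute: the colour is SET to a fixed value inside the geom
p_set <- ggplot(cars, aes(x = wt, y = mpg)) +
  geom_point(col = "red")

# Mapped: one colour per level of cyl; set: a single fixed colour
length(unique(layer_data(p_mapped)$colour))
unique(layer_data(p_set)$colour)
```

Only the mapped version produces a legend, because only there does the colour carry information about the data.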
IMPORTANT!
label and shape are only applicable to categorical data.
