02/07/2020
CHIS - Mosaic Plots
Do Things Manually
In the previous exercise we looked at how to produce a frequency histogram when
we have many sub-categories. The problem here is that this can't be facetted because
the calculations occur on the fly inside ggplot2.
To overcome this we're going to calculate the proportions outside ggplot2. This is the
beginning of our flexible script for a mosaic plot.
The dataset adult and the BMI_fill object from the previous exercise have been
carried over for you. Code that tries to make the accurate frequency histogram
facetted is available. You should understand these commands by now.
Note: Here we use reshape2 instead of the more current tidyr because reshape2::melt()
allows us to work directly on a table. tidyr::gather() requires a data frame.
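As a rough illustration of calculating the proportions outside ggplot2, the sketch below uses base R on a small made-up data frame rather than the adult dataset. The toy values and the names toy, DF_freq and DF_long are assumptions for illustration only, and as.data.frame() on a table is swapped in here for reshape2::melt().

```r
# Toy stand-in for the adult data (made-up values, for illustration only)
toy <- data.frame(
  SRAGE_P = rep(c(20, 30, 40), each = 6),
  RBMI    = rep(c("Under", "Normal", "Over", "Normal", "Over", "Over"), times = 3)
)

# Contingency table: BMI category (rows) by age (columns)
DF <- table(toy$RBMI, toy$SRAGE_P)

# Proportion of each BMI category within each age, so every column sums to 1
DF_freq <- prop.table(DF, margin = 2)

# Long format (one row per BMI/age combination), ready for plotting
DF_long <- as.data.frame(DF_freq)
names(DF_long) <- c("FILL", "X", "value")
DF_long
```

Because the proportions now live in an ordinary data frame, they can be plotted and faceted like any other variable.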
Ex.
Multiple Histograms
When we introduced histograms we focused on univariate data, which is exactly
what we've been doing here. However, when we want to explore distributions
further there is much more we can do. For example, there are density plots, which
you'll explore in the next course. For now, we'll look deeper at frequency histograms
and begin developing our mosaic plots.
The adult dataset, which is cleaned up by now, is available in the workspace for you.
● The histogram from the first exercise of age colored by BMI has been
provided. The predefined theme(), fix_strips (see above), has been added to the
histogram. Add BMI_fill to this plot using the + operator as well.
● In addition, add the following elements to create a pretty & insightful plot:
● Use facet_grid() to facet the rows according to RBMI (Remember formula
notation ROWS ~ COL and use . as a place-holder when not facetting in that
direction).
● Add the classic theme using theme_classic().
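The formula notation for facet_grid() can be sketched on mtcars, which stands in for adult here; using cyl in place of RBMI is an assumption made only so that the example runs without the course workspace.

```r
library(ggplot2)

# mtcars stands in for adult; cyl plays the role of RBMI (an assumption)
p <- ggplot(mtcars, aes(x = mpg, fill = factor(cyl))) +
  geom_histogram(binwidth = 2) +
  facet_grid(cyl ~ .) +   # ROWS ~ COL: facet rows by cyl, no column facets
  theme_classic()
p
```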
Ex.
Ex.
Exploring Data
In this chapter we're going to continuously build on our plotting functions and
understanding to produce a mosaic plot (aka Marimekko plot). This is a visual
representation of a contingency table, comparing two categorical variables.
Essentially, our question is which groups are over- or under-represented in our
dataset. To visualize this we'll color groups according to their Pearson residuals from
a chi-squared test. At the end of it all we'll wrap up our script into a flexible function
so that we can look at other variables.
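To make the idea of Pearson residuals concrete, here is a minimal sketch using chisq.test() from base R. The counts and the labels in tab are made up for illustration and are not CHIS data.

```r
# Made-up 2 x 3 contingency table (counts are illustrative, not CHIS data)
tab <- matrix(c(30, 20, 25, 25, 10, 40), nrow = 2,
              dimnames = list(group = c("A", "B"),
                              category = c("low", "mid", "high")))

test <- chisq.test(tab)

# Pearson residuals as returned by the test
test$residuals

# The same values by hand: (observed - expected) / sqrt(expected)
manual <- (test$observed - test$expected) / sqrt(test$expected)
all.equal(test$residuals, manual)
```

Positive residuals mark cells with more observations than expected under independence, and negative residuals mark cells with fewer.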
We'll familiarize ourselves with a small number of variables from the 2009 CHIS
adult-response dataset (as opposed to children). We have selected the following
variables to explore:
Before we get into mosaic plots it's worthwhile exploring the dataset using simple
distribution plots - i.e. histograms.
ggplot2 is already loaded and the dataset, named adult, is already available in the
workspace.
● Use the typical commands for exploring the structure of adult to get familiar
with the variables: summary() and str().
● As a first exploration of the data, plot two histograms using ggplot2 syntax: one
for age (SRAGE_P) and one for BMI (BMI_P). The goal is to explore the dataset
and get familiar with the distributions here. Feel free to explore different bin
widths. We'll ask some questions about these in the next exercises.
# Age histogram
ggplot(adult, aes(x=SRAGE_P))+
geom_histogram()
01/07/2020
Best Practices: Heat Maps
30/06/2020
Best Practices: Bar Plots
Ex.
# Base layers
Ex.
# Base layers
m <- ggplot(mtcars, aes(x = cyl, y = wt))
Exploring ggthemes
There are many themes available by default in ggplot2: theme_bw(), theme_classic(),
theme_gray(), etc. In the previous exercise, you saw that you can apply these themes
to all following plots, with theme_set():
theme_set(theme_bw())
To apply a theme to a single plot instead, add it to that plot:
... + theme_bw()
You can also extend these themes with your own modifications. In this exercise, you'll
experiment with this and use some preset templates available from the ggthemes
package. The workspace already contains the same basic plot from before under the
name z2.
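Extending a built-in theme might look like the sketch below. The plot z is a hypothetical stand-in for z2, which is not defined in these notes, and my_theme is a name chosen for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for z2, which is not defined in these notes
z <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Extend a built-in theme with your own modifications
my_theme <- theme_bw() +
  theme(panel.grid.minor = element_blank(),
        legend.position = "bottom")

# Apply it to a single plot
z + my_theme

# Or make it the default for all following plots; theme_set() returns the old theme
old <- theme_set(my_theme)
z
theme_set(old)  # restore the previous default
```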
Answer
# Original plot
z2
# Load ggthemes
library(ggthemes)
# Plot z2 again
z2
Ex.
# Original plot
z2
# 1 - Apply theme_pink to z2
z2 +
theme_pink
Facets Layer
Ex.
# Code to create the cyl_am col and myCol vector
library(RColorBrewer) # provides brewer.pal()
mtcars$cyl_am <- paste(mtcars$cyl, mtcars$am, sep = "_")
myCol <- rbind(brewer.pal(9, "Blues")[c(3,6,8)],
brewer.pal(9, "Reds")[c(3,6,8)])
21/06/2020
Ex.
Ex.
# The base ggplot command; you don't have to change this
wt.cyl.am <- ggplot(mtcars, aes(x = cyl,y = wt, col = am, fill = am, group = am))
Ex.
# Display structure of mtcars
str(mtcars)
# Define positions
posn.d <- position_dodge(width = 0.1)
posn.jd <- position_jitterdodge(jitter.width=0.1, dodge.width=0.2)
posn.j <- position_jitter(width=0.2)
# Base layers
wt.cyl.am <- ggplot(mtcars, aes(x=cyl, y=wt, col=am, fill=am, group=am))
Ex.
Ex.
# Use stat_quantile instead of stat_smooth
ggplot(Vocab, aes(x = education, y = vocabulary, col = year, group = factor(year))) +
stat_quantile(alpha = 0.6, size = 2) +
scale_color_gradientn(colors = brewer.pal(9,"YlOrRd"))
Ex.
# Plot 1: Jittered scatter plot, add a linear model (lm) smooth
ggplot(Vocab, aes(x = education, y = vocabulary)) +
geom_jitter(alpha = 0.2) +
stat_smooth(method = "lm")
12/06/2020
Stats and Geoms
Wrap-up
You use geom_point() for both plot types. Jittering position is set in the geom_point()
layer.
However, to make a "true" dot plot, you can use geom_dotplot(). The difference is that
unlike geom_point(), geom_dotplot() uses a binning statistic. Binning means to cut up a
continuous variable (the y in this case) into discrete "bins". You already saw binning
with geom_histogram() in an earlier exercise.
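Binning itself can be illustrated with cut() from base R. This is only a sketch of the idea, not a description of how geom_dotplot() computes its bins internally; the break points are chosen arbitrarily.

```r
# Cut the continuous variable wt into 1-unit-wide bins
bins <- cut(mtcars$wt, breaks = seq(1, 6, by = 1))

# Each observation now belongs to exactly one discrete bin
levels(bins)
table(bins)
```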
IMPORTANT!
Categorical variables: the values are limited to a finite set of groups. For example, a
categorical variable can be country, year, gender or occupation.
Continuous variables: the values can be any number, from integer to decimal. For
example, revenue or the price of a share. Continuous is the default class for
variables in R; they are stored as numeric or integer.
Factor: R stores categorical variables as factors.
Discrete variables are a subtype of numerical variable. Their values are necessarily
whole numbers or other discrete values, such as population or counts of items.
Continuous variables, by contrast, can take on any value within an interval, and so can
be expressed as decimals. They are often measured quantities.
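A short sketch of how these variable types show up in R, using mtcars. The name cyl_f is chosen for illustration.

```r
# cyl is stored as numeric in mtcars, so R treats it as continuous by default
is.numeric(mtcars$cyl)

# Converting it to a factor tells R that it is categorical
cyl_f <- factor(mtcars$cyl)
is.factor(cyl_f)
levels(cyl_f)

# wt is a measured quantity with decimals: a continuous variable
head(mtcars$wt)
```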
# Expand the following command with geom_rect() to draw the recess periods
ggplot(economics, aes(x = date, y = unemploy/pop)) +
geom_rect(data = recess,
aes(xmin = begin, xmax = end, ymin = -Inf, ymax = +Inf),
inherit.aes = FALSE, fill = "red", alpha = 0.2) +
geom_line()
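The command above relies on a recess data frame that is not shown in these notes. The sketch below assumes it has Date columns named begin and end, and fills it with two placeholder periods so that the plot can be run. The dates are illustrative assumptions, not the values used in the exercise.

```r
library(ggplot2)

# Placeholder recess data frame: the dates are illustrative assumptions,
# not the values used in the exercise
recess <- data.frame(
  begin = as.Date(c("1969-12-01", "1973-11-01")),
  end   = as.Date(c("1970-11-01", "1975-03-01"))
)

p <- ggplot(economics, aes(x = date, y = unemploy / pop)) +
  geom_rect(data = recess,
            aes(xmin = begin, xmax = end, ymin = -Inf, ymax = Inf),
            inherit.aes = FALSE, fill = "red", alpha = 0.2) +
  geom_line()
p
```

Setting inherit.aes = FALSE stops geom_rect() from looking for date and unemploy/pop in recess, which does not contain them.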
09/06/2020
Ex.
Ex.
Ex.
In this exercise, you'll manually create a color palette that can generate all the colours
you need. To do this you'll use a function called colorRampPalette().
The input is a character vector of 2 or more colour values, e.g. "#FFFFFF" (white) and
"#0000FF" (pure blue). Hexadecimal colour codes were discussed in an earlier exercise.
The output is itself a function! So when you assign it to an object, that object should
be used as a function. To see what we mean, execute the following three lines in the
console:
new_col() is a function that takes one argument: the number of colours you want to
interpolate. You want to use nicer colours, so we've assigned the entire "Blues"
colour palette from the RColorBrewer package to the character vector blues.
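A minimal sketch of colorRampPalette() with the white and blue values mentioned above. It is not necessarily the exact code from the exercise.

```r
# colorRampPalette() returns a function, not a vector of colours
new_col <- colorRampPalette(c("#FFFFFF", "#0000FF"))
is.function(new_col)

# Calling that function with n returns n colours running from white to blue
new_col(4)
```

The same pattern should work with a longer input vector such as blues, assuming it holds valid colour codes.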
Ex.
However, you can go one step further by adjusting the dodging, so that your bars
partially overlap each other. For this example you'll again use the mtcars dataset. Like
last time cyl and am are already available as factors inside mtcars.
Instead of using position = "dodge" you're going to use position_dodge(), like you did
with position_jitter() in the Scatter plots and jittering (1) exercise. Here, you'll save this
as an object, posn_d, so that you can easily reuse it.
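Saving the dodge as an object could look like the sketch below. The width of 0.2 and the alpha value are assumptions chosen so that the overlap is visible, and cyl and am are converted with factor() so the code also runs on an unmodified mtcars.

```r
library(ggplot2)

# Store the position so it can be reused across layers
posn_d <- position_dodge(width = 0.2)

# Convert to factors here so the sketch runs on a fresh copy of mtcars
p <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(am))) +
  geom_bar(position = posn_d, alpha = 0.6)  # dodge < bar width, so bars overlap
p
```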
Ex.
# Shown in the viewer:
ggplot(mtcars, aes(x = cyl, y = wt)) +
geom_point()
# Solutions:
# 1 - With geom_jitter()
ggplot(mtcars, aes(x = cyl, y = wt)) +
geom_jitter()
Bar Plots
Histograms
Histograms are one of the most common and intuitive ways of showing distributions.
In this exercise you'll use the mtcars data frame to explore typical variations of simple
histograms. But first, some background:
Ex.
# 3 - Plot 2, plus MAP ..density.. to the y aesthetic (i.e. in a second aes() function)
ggplot(mtcars, aes(x = mpg)) +
geom_histogram(binwidth = 1, aes(y=..density..))
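In more recent ggplot2 releases the ..density.. notation is superseded by after_stat(density). The sketch below assumes ggplot2 3.3.0 or later and also checks what mapping density to y means.

```r
library(ggplot2)

# after_stat(density) is the newer spelling of ..density.. (assumes ggplot2 >= 3.3.0)
p <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 1, aes(y = after_stat(density)))
p

# With density on the y axis, the bar areas sum to 1
d <- layer_data(p)
sum(d$density * (d$xmax - d$xmin))
```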
05/06/2020
Aesthetics Best Practices
Principles:
Ex.
Ex.
# The base layer, cyl.am, is available for you
# Add geom (position = "stack" by default)
cyl.am +
geom_bar(position = "stack")
IMPORTANT!
In the last chapter you saw that all the visible aesthetics can serve as attributes
(inside geom_*()) and aesthetics (inside aes()), but I very conveniently left out
x and y. That's because although you can make univariate plots (such as histograms,
which you'll get to in the next chapter), a y-axis will always be provided, even if you
didn't ask for it.
The color aesthetic typically changes the outside outline of an object and the fill
aesthetic is typically the inside shading. However, as you saw in the last exercise,
geom_point() is an exception. Here you use color, instead of fill for the inside of the
point. But it's a bit subtler than that.
Which shape to use? The default geom_point() uses shape = 19 (a solid circle with an
outline the same colour as the inside). Good alternatives are shape = 1 (hollow) and
shape = 16 (solid, no outline). These all use the col aesthetic (don't forget to set alpha
for solid points).
A really nice alternative is shape = 21 which allows you to use both fill for the inside
and col for the outline! This is a great little trick for when you want to map two
aesthetics to a dot.
Notice that mapping a categorical variable onto fill doesn't change the colors, although a
legend is generated! This is because the default shape for points only has a color attribute
and not a fill attribute! Use fill when you have another shape (such as a bar), or when using
a point that does have a fill and a color attribute, such as shape = 21, which is a circle with
an outline. Any time you use a solid color, make sure to use alpha blending to account for
overplotting.
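The shape = 21 trick can be sketched as follows. The choice of cyl for fill and am for col is an assumption made for illustration.

```r
library(ggplot2)

# shape = 21 has both an inside (fill) and an outline (col),
# so two variables can be mapped onto one point
p <- ggplot(mtcars, aes(x = wt, y = mpg,
                        fill = factor(cyl), col = factor(am))) +
  geom_point(shape = 21, size = 4, alpha = 0.6)
p
```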
Modifying Aesthetics
Class notes - Data Camp R - 41
Ex.
# Map cyl to size
ggplot(mtcars, aes(x = wt, y = mpg, size = cyl))+
geom_point()
Ex.
# Expand to draw points with alpha 0.5
ggplot(mtcars, aes(x = wt, y = mpg, fill = cyl)) +
geom_point(alpha = 0.5)
IMPORTANT!
Note: In this chapter you saw aesthetics and attributes. Variables in a data frame are
mapped to aesthetics in aes() (e.g. aes(col = cyl)) within ggplot(). Visual elements are set
by attributes in specific geom layers (e.g. geom_point(col = "red")). Don't confuse these two
things - here you're focusing on aesthetic mappings.
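The two cases side by side, as a minimal sketch. The names p_mapped and p_set are chosen for illustration.

```r
library(ggplot2)

# Aesthetic: the variable cyl is mapped to colour inside aes()
p_mapped <- ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) +
  geom_point()

# Attribute: every point is set to red inside the geom layer
p_set <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(col = "red")

p_mapped
p_set
```

Only the mapped version produces a legend, because only there does colour carry information from the data.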
IMPORTANT!
label and shape are only applicable to categorical data.