
Lab session 1: Intro to ggplot2

Data analysis and Visualization – Ilias Thomas

Red: code to copy paste in RStudio

Italics: variable and dataset names

Blue: functions and arguments

First things first


In this course we will mainly work with ggplot2, a widely used R package for creating good-looking
graphs with code.

To work with ggplot2, you will need to install R and RStudio on your computer if you have not done so
already (Download RStudio - Posit). RStudio is an integrated development environment for R.

After finishing the installations, launch RStudio. Create a new R script through the plus sign on the top
left of the screen. Working with scripts allows you to save your progress and return to your work
when necessary.

In your script install ggplot2.

install.packages("ggplot2")

and then load ggplot2

library(ggplot2)

We will mainly work with three datasets that come with the ggplot2 package:

• midwest: Midwest demographics
• mpg: Fuel economy data from 1999 to 2008 for 38 popular models of cars
• msleep: An updated and expanded version of the mammals sleep dataset, covering 83 mammals

Start by getting to know the datasets. You can do so by assigning them to names of your choice in your
environment. Note that R is case-sensitive, so the dataset must be written as midwest. For example:

Midwest_demo_data <- midwest


Task 1
1. Explore the data for properties such as size (number of observations and variables)

2. Understand the data using the ? sign (opens the documentation page, for example ?midwest)

3. Use functions such as summary to find quantitative properties of the datasets.
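
As a minimal sketch of this kind of exploration, the lines below inspect mpg; the same calls work on the other two datasets. The name fuel_data is only an illustrative choice, not something defined by ggplot2.

```r
library(ggplot2)

# Copy a built-in dataset under a name of your choice (illustrative name).
fuel_data <- mpg

dim(fuel_data)      # number of observations (rows) and variables (columns)
names(fuel_data)    # variable names
str(fuel_data)      # type of each variable and its first few values
summary(fuel_data)  # per-variable summary (min, quartiles, mean, max for numeric ones)

# In an interactive session, ?mpg opens the documentation page of the dataset.
```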

How to use ggplot2


There are three components in every ggplot2 plot:

1. The data you want to plot

2. Aesthetic mappings between variables

3. At least one layer describing how to draw each observation, usually created with a geom function.

Example: ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()

This plots two variables from the mpg dataset (here displ and hwy) and adds a layer that draws each
observation as a point. What do you make of this plot? Do you see any relationship between the
variables? This plot is called a scatterplot.

You will use this pattern for every plot in ggplot2. The plus sign allows you to add more
components as the plots get more advanced. Please note that the argument names x = and y = are optional
and can be omitted, because the first two arguments of aes are always read as x and y. Always remember
this order.
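
A short sketch of the point about argument order: the two calls below should produce the same scatterplot, once with named and once with positional aesthetics. The object names p_named and p_positional are illustrative.

```r
library(ggplot2)

# Named aesthetics: x and y are spelled out.
p_named <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()

# Positional aesthetics: the first variable is x, the second is y.
p_positional <- ggplot(mpg, aes(displ, hwy)) + geom_point()

print(p_named)
print(p_positional)
```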

Task 2

1. Is there a connection between the college education and poverty levels in the midwest data? What
would you guess before plotting the data?

2. Is there a connection between body weight and the sleep duration in the msleep data? What would
you guess before plotting the data?

Color, size, shape


Colour, size, and shape are aesthetics as well, so to map them to a variable they are added inside aes.

An example from the mpg dataset: ggplot(mpg, aes(displ, cty, colour = class)) + geom_point(). What do
you notice here?
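
The same pattern applies to the other aesthetics. As a hedged sketch, the two plots below map shape to the drive train (drv) and size to the number of cylinders (cyl); these variable choices and the object names are only for illustration.

```r
library(ggplot2)

# Shape mapped to a categorical variable with few levels.
p_shape <- ggplot(mpg, aes(displ, cty, shape = drv)) + geom_point()

# Size mapped to a numeric variable.
p_size <- ggplot(mpg, aes(displ, cty, size = cyl)) + geom_point()

print(p_shape)
print(p_size)
```

ggplot2 adds a legend for each mapped aesthetic automatically.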

Task 3
1. Play around with the size and shape of the plot. Does it work well? What is the solution if you want to
apply the same size and shape to all your data?

2. Do the x and y axis labels look fine to you? How could you change those? Replace the labels of this
plot to “engine displacement, in litres” and “city miles per gallon” for x and y respectively.

ggplot(mpg, aes(displ, cty, colour = class)) + geom_point()


Hint: Use the xlab and ylab functions.

3. Can you use a continuous variable to set the color? What happens if you try to use year or highway
miles per gallon as color?

4. Plot the sleeping times against body weight and apply shape based on the diet the animals follow. Do
you see any patterns?

Axis properties
We saw in the previous task that we can use xlab and ylab to rename the axes. We can also use axis
properties to set specific limits or to transform the data. In ggplot2, the functions xlim and ylim set
the limits of the axis of interest. Moreover, scale_y_log10 re-scales the y axis to the log scale
(scale_x_log10 does the same for the x axis).
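
A minimal sketch of these axis functions on the mpg data; the limit values and object names are arbitrary choices for illustration.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# Restrict both axes; observations outside the limits are dropped with a warning.
p_limits <- p + xlim(2, 5) + ylim(15, 35)

# Show the y axis on a log10 scale.
p_log <- p + scale_y_log10()

print(p_limits)
print(p_log)
```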

Task 4
We saw before that it was hard to understand the relationship between body weight and total sleep
time. Can we transform one of the variables so that the relationship becomes clearer?

Facetting or Small multiples


To separate a plot into smaller plots based on some grouping factor in the data, add facet_wrap to the
plot.
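
As a sketch, the plot below splits the mpg scatterplot into one panel per vehicle class; class is used here only as an example of a grouping variable, and p_facets is an illustrative name.

```r
library(ggplot2)

# One small scatterplot per value of class.
p_facets <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_wrap(~class)

print(p_facets)
```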

Task 5
1. In the midwest dataset plot the poverty vs. college education by state.

2. In the mpg dataset plot the engine size vs. consumption by manufacturer.

Boxplots, histograms, bar charts, and others

The first plot we worked with was the scatterplot. When plotting the relationship between two
variables, it is common to add a line that models that relationship. In ggplot2
we can easily do that with geom_smooth.
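
A minimal sketch on the mpg data: the first plot uses the default smoother, the second asks for a straight line with method = "lm". The object names are illustrative.

```r
library(ggplot2)

base_plot <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# Default: a smooth curve with a confidence band.
p_smooth <- base_plot + geom_smooth()

# A straight-line (linear model) fit instead.
p_lm <- base_plot + geom_smooth(method = "lm")

print(p_smooth)
print(p_lm)
```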

Can you fit a line to the data when you plot the body weight of the mammals vs. the sleeping time?

We would use a boxplot to make comparisons between groups. A grouped plot will be automatically
created when we have two variables and one of them is categorical. See what happens when you plot
state against poverty on a regular plot (with geom_point and no other arguments).

This plot is informative but not all it could be, because many points overlap and the groups are
hard to compare. The solution is simple: change geom_point to geom_jitter or, even better, to
geom_boxplot. Now we can see clear differences! Another option is a violin plot (geom_violin), which
also shows the density of the data around specific values.
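
The four variants can be compared side by side. The sketch below uses drv and hwy from mpg purely as an example of a categorical and a numerical variable; the object names are illustrative.

```r
library(ggplot2)

base_plot <- ggplot(mpg, aes(drv, hwy))

p_points <- base_plot + geom_point()    # many points overlap
p_jitter <- base_plot + geom_jitter()   # random noise separates the points
p_box    <- base_plot + geom_boxplot()  # median, quartiles, and outliers per group
p_violin <- base_plot + geom_violin()   # density of the data per group

print(p_points)
print(p_jitter)
print(p_box)
print(p_violin)
```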

Task 6
1. In the mpg data plot consumption against manufacturer in a boxplot.

2. In the msleep data plot diet against body weight in a violin plot. Careful with the transformation!

Continuous data distributions


We can also plot the distribution of a single numerical variable in a histogram. All we need to do is use
the geom_histogram function as:

ggplot(midwest, aes(percbelowpoverty)) + geom_histogram()

We can also adjust the bin width of our plot with the binwidth argument:

ggplot(midwest, aes(percbelowpoverty)) + geom_histogram(binwidth=10)

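Finally, we can split the distribution into groups based on a categorical variable by mapping it to
an aesthetic. Note that in a histogram colour sets the outline of the bars, while fill sets their interior.

ggplot(midwest, aes(percbelowpoverty, colour=state)) + geom_histogram()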

Another good alternative is using geom_freqpoly instead of geom_histogram. It draws lines instead of
bars, which makes overlapping groups easier to compare.
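
A hedged sketch of both options on the midwest data: a frequency polygon with one line per state, and a histogram split into one panel per state. The bin width of 2 and the object names are arbitrary choices for illustration.

```r
library(ggplot2)

# One line per state; lines overlap without hiding each other.
p_freq <- ggplot(midwest, aes(percbelowpoverty, colour = state)) +
  geom_freqpoly(binwidth = 2)

# One histogram panel per state, stacked vertically for easy comparison.
p_hist_facets <- ggplot(midwest, aes(percbelowpoverty, fill = state)) +
  geom_histogram(binwidth = 2) +
  facet_wrap(~state, ncol = 1)

print(p_freq)
print(p_hist_facets)
```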


Task 7
1. In the midwest data plot the poverty distribution in relation to the metropolitan classification of the
county. Do you see any differences?

2. In the mpg data plot the consumption distribution against the drive train type (drv) of the car.

Categorical data distributions


Finally, to plot the distribution of a categorical variable, we use a bar chart, created with geom_bar.
Bar charts can also be used when plotting a categorical against a numerical variable.
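
A minimal sketch of both uses on the mpg data. The first plot counts the cars in each class with geom_bar. The second assumes we want the mean highway consumption per class, computes it with aggregate, and draws the precomputed values with geom_col; mean_hwy and the plot names are illustrative.

```r
library(ggplot2)

# Distribution of a categorical variable: geom_bar counts the rows per class.
p_counts <- ggplot(mpg, aes(class)) + geom_bar()

# Categorical vs. numerical: compute one value per class, then draw it as bars.
mean_hwy <- aggregate(hwy ~ class, data = mpg, FUN = mean)
p_means <- ggplot(mean_hwy, aes(class, hwy)) + geom_col()

print(p_counts)
print(p_means)
```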
