To work through this material, you will need to install R and RStudio on your computer if you have not
done so already (Download RStudio - Posit). RStudio is an integrated development environment for R.
After finishing the installations, launch RStudio. Create a new R script using the plus sign at the top
left of the screen. Working in a script lets you save your progress and return to your work
when necessary.
install.packages("ggplot2")
library(ggplot2)
We will mainly work with three datasets that ship with the ggplot2 package: mpg, midwest, and msleep.
Try to initially understand the datasets. You can do so by loading the datasets in your environment with
names of your choice, for example:
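As a minimal sketch of that step, the datasets can be copied into your environment and inspected as shown here. The object names cars, counties, and mammals are only illustrative; any names of your choice will do.

```r
library(ggplot2)

# Copy the built-in datasets into objects with names of our choosing
# (cars, counties, and mammals are hypothetical names, not required ones).
cars <- mpg
counties <- midwest
mammals <- msleep

# A few standard ways to get a first impression of a dataset.
str(cars)          # column names and types
head(counties)     # first rows
summary(mammals)   # per-column summaries, including counts of missing values
```

You can also type ?mpg, ?midwest, or ?msleep in the console to read the description of each variable.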
3. Some layer describing how to plot each observation, usually created through the geom function.
Using these components, you can plot two variables of your choice from the mpg dataset, with points as
the layer. What do you make of this plot? Do you see any relationship between the variables? This plot is
called a scatterplot.
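One possible version of such a scatterplot is sketched here. The choice of displ (engine displacement) and hwy (highway miles per gallon) is just an example; any two variables from mpg can be used.

```r
library(ggplot2)

# Data: mpg; aesthetic mappings: displ on x, hwy on y; layer: points.
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

print(p)
```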
You will use this pattern for every plot you create in ggplot2. The plus sign lets you add more
components as the plots get more advanced. Please note that the argument names x and y are optional
inside aes() and can be omitted, because the first two unnamed arguments are taken as x and y. Always
remember the order.
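To illustrate why the order matters, the following sketch compares the named and the shortened form of the same mapping, and then swaps the two variables. The variables displ and hwy are an arbitrary choice for illustration.

```r
library(ggplot2)

# Named form: x and y are spelled out.
p_named <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()

# Short form: the first unnamed argument becomes x, the second becomes y.
p_short <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# Swapping the order swaps the axes, so this is a different plot.
p_swapped <- ggplot(mpg, aes(hwy, displ)) + geom_point()

print(p_short)
print(p_swapped)
```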
Task 2
1. Is there a connection between college education and poverty levels in the midwest data? What
would you guess before plotting the data?
2. Is there a connection between body weight and sleep duration in the msleep data? What would
you guess before plotting the data?
An example from the mpg dataset: ggplot(mpg, aes(displ, cty, colour = class)) + geom_point(). What do
you notice here?
Task 3
1. Play around with the size and shape of the points. Does it work well? What is the solution if you want
to apply the same size and shape to all your data?
2. Do the x and y axis labels look fine to you? How could you change those? Change the labels of this
plot to “engine displacement, in litres” and “city miles per gallon” for x and y respectively.
3. Can you use a continuous variable to set the color? What happens if you try to use year or highway
miles per gallon as color?
4. Plot the sleeping times against body weight and map shape to the diet the animals follow. Do you
see any patterns?
Axis properties
We saw in the previous task that we can use xlab and ylab to rename the axes. We can also use axis
properties to transform the data or to set specific limits. In ggplot2, use xlim or ylim to set the
limits of the axis of interest. Moreover, scale_y_log10 re-scales the y axis to a log scale.
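The axis functions mentioned above can be tried out as in this short sketch. The variables, label text, and limit values are arbitrary choices for illustration.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# Rename the axes.
p_labelled <- p + xlab("engine displacement") + ylab("highway miles per gallon")

# Restrict the axes to a range. Points outside the limits are dropped,
# and ggplot2 warns about the removed rows when the plot is drawn.
p_limited <- p + xlim(2, 6) + ylim(15, 40)

# Show the y axis on a log10 scale.
p_log <- p + scale_y_log10()

print(p_labelled)
print(p_log)
```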
Task 4
We saw before that it was hard to understand the relationship between body weight and total sleep
time. Can we transform one of the variables so that the relationship becomes clearer?
Task 5
1. In the midwest dataset plot the poverty vs. college education by state.
2. In the mpg dataset plot the engine size vs. consumption by manufacturer.
Boxplots, histograms, bar charts, and others
The first plot we worked with was the scatterplot. As you may have seen before, when we plot the
relationship between two variables, it is common to add a line that models that relationship. In ggplot2
we can easily do that with geom_smooth.
Can you fit a line to the data when you plot the body weight of the mammals vs. the sleeping time?
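As a hedged example of the idea, here is geom_smooth added to a scatterplot of the mpg data (chosen so that the mammals exercise is left to you). With no arguments, geom_smooth picks a smoothing method itself and reports which one it used; method = "lm" asks for a straight line instead.

```r
library(ggplot2)

base <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# Default smoother with its confidence band.
p_smooth <- base + geom_smooth()

# A straight-line (linear model) fit.
p_linear <- base + geom_smooth(method = "lm")

print(p_smooth)
print(p_linear)
```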
We would use a boxplot to make comparisons between groups. A grouped plot will be created
automatically when you have two variables and one of them is categorical. See what happens when you plot
state against poverty on a regular plot (with geom_point and no other arguments).
This plot is informative but not all it could be. That is because there are a lot of overlapping points,
which makes the groups harder to compare. The solution is simple: change geom_point to geom_jitter or,
even better, to geom_boxplot. Now we can see clear differences! Another way to plot this is to use a
violin plot, which also shows the data density around specific values.
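The four variants can be compared side by side with a sketch like the following. It uses drv and hwy from the mpg data as a stand-in for state and poverty, and the jitter width is an arbitrary value.

```r
library(ggplot2)

# One categorical variable (drv) and one continuous variable (hwy).
base <- ggplot(mpg, aes(drv, hwy))

p_points <- base + geom_point()                            # heavy overplotting
p_jitter <- base + geom_jitter(width = 0.25, height = 0)   # points spread sideways only
p_box    <- base + geom_boxplot()                          # median, quartiles, outliers
p_violin <- base + geom_violin()                           # density of values per group

print(p_jitter)
print(p_box)
print(p_violin)
```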
Task 6
1. In the mpg data, plot consumption against manufacturer in a boxplot.
2. In the msleep data, plot diet against body weight in a violin plot. Careful with the transformation!
Finally, we can split the distribution into multiple plots based on another categorical variable.
2. In the mpg data plot the consumption distribution against the drive axis of the car.
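Splitting a distribution into several panels can be sketched as follows. This assumes that a histogram combined with facet_wrap is the intended approach; the variables displ and class and the bin width are arbitrary choices, so that the task itself is left open.

```r
library(ggplot2)

# Histogram of engine displacement, with one panel per car class.
p_facets <- ggplot(mpg, aes(displ)) +
  geom_histogram(binwidth = 0.5) +
  facet_wrap(~class)

print(p_facets)
```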