
STA1007S Lab 2: Summaries and plots

SUBMISSION INSTRUCTIONS:

Your answers need to be submitted on Vula.

Go into the Submissions section and click on Lab Session 2 to access the submission form. Please note that
the answers get automatically marked and so have to be in the correct format:

ENTER YOUR ANSWERS TO 2 DECIMAL PLACES UNLESS THE ANSWER IS A ZERO OR AN INTEGER
(for example, if the answer is 0 you just enter 0 and not 0.00, or if the answer is 2 you enter 2 and not 2.00).

DO NOT INCLUDE ANY UNITS (e.g. meters, mg, etc.).

PROBABILITIES MUST BE BETWEEN 0 AND 1, SO A 50% CHANCE WOULD CORRESPOND TO A
PROBABILITY OF 0.5.

Introduction
The first steps in any data analysis have to do with familiarizing ourselves with the information we have
available. Summaries and plots are a great way to do this. R has extremely powerful graphical capabilities,
with so many options that it is easy to get lost. In this lab we will practice some of the most common
exploratory graphs and summaries that will help us get acquainted with our data and detect possible
relationships existing between variables. But first we will need to read some data into R!
You will find that most of the R code necessary to execute the R commands is provided. This lab is meant to
be practice for you, so even if the code and the output of the code is provided, you are expected to create
your own script, run the pieces of code yourself and check whether the output is what you would expect it to
be. Every now and then, you will be asked to fill in blank pieces of code marked as ---. In addition to “fill
in the code”, you will need to answer other questions for which you must produce plots, run your own code
or explore your data. The questions you need to submit through Vula will appear in the submission boxes.
At any time you can call the function help() to obtain information about any function you want. For example,
if you wanted a description of how the function sample() works, you could type the following in the
console (bottom left panel in RStudio):
help("sample")

or you can just type:


?sample

You should make this a habit and check the help file of every function you use for the first time.

Start a new R script


Let’s begin this lab with a quick recap of the previous session. The first thing we need to do in any lab is
create a new R script in RStudio (we envision you will be using RStudio for all labs but if your computer is
slow, you can also do the labs straight in R: go to ‘File’ -> ‘New script’ and you will get a basic text editor
in which you can create your script). Remember that we created a folder called “MySTA1007analysis” or
something similar, and that we also created an RStudio project that lives inside this folder? We will be using
these once again.

1. Open RStudio
2. Open the “MySTA1007analysis” project by clicking File → Open Project... and then browsing to the
“MySTA1007analysis” project.
Note that the Projects tab in the top right corner will now display the name of your project.
3. Create a new R script: click File → New File → R Script

Good! We have now opened our RStudio project, created a new R script and we are ready to go.

Script preamble
Another important thing to remember from the previous lab is that we must always start our R scripts with
a preamble, in which we remind ourselves of what the purpose of the script is and several details about it,
like who wrote it or when it was written. You might also remember that any lines preceded by the character
# will be taken as comments and R will not try to run them. Let’s type our preamble now (we are typing in
our R script, which is the top left panel, NOT in the console, which is the bottom left panel):
# Amazing R User (your name)
# 23rd August (today's date)
# This script contains STA1007S Lab 2 commands

The next thing we need to do is to clean R’s memory, just in case there are some objects lying around that we
are not aware of.
# Clean my working space
rm(list = ls())

Just to remind ourselves of what these functions do, we may type the following in the console (bottom
left):
help("ls")

We type directly in the console now because we don’t necessarily want to execute this command every time
we run our script. In other words, this is a once-off visit to the help file.
Now we see in the bottom right panel, the help file for the function ls(). Help files might sometimes be a
little confusing, but with time, we will be able to extract the information we want. In this case, it explains
that ls() provides a vector (group of elements) with all the names of the objects in our working environment
(R’s memory). It also explains what arguments this function might take. Arguments define different settings
the function might take on to perform slightly different actions. We’ll see more of this later.
Let’s have a look at the help file of the function rm(). You may type now in your console:
help("rm")

We now see the help file for the function rm() in the bottom right panel of the screen. It indicates, amongst
many other things, that this function is used to remove objects from the working environment and that we
may give it a list of names (character strings...) of the objects to get rid of.
OK, so it seems that the command rm(list = ls()) is creating a list of names with all the objects in the
working environment and then removing them. That is exactly what we wanted to do. Clean up R’s memory,
otherwise called “working environment”.

If this was a bit confusing, don’t worry, we will be using functions that are much simpler than this. This just
gives you a taste of how we communicate with R using functions. It also shows you that we can visit the help
files of the functions at any time.
Now, back to the preamble.
We have written a few details about the script and we have removed all the objects in our working environment.
Next, we want to check where R looks for data and saves its output. This location is called the “working
directory”. We can check what the working directory is by typing (this must go in your script, not in the
console):
# Check what working directory R is looking at
getwd()

Oh, R is looking exactly where we wanted it to look! That’s right, that is the magic of RStudio projects.
Remember that we said that as long as you have the R Project file in your folder, R will consider this as the
root directory and will always set the working directory here. You may move the folder containing the R
Project file anywhere and whenever you open the project, R will change the working directory to the new
location.
Remember, to run the code in your R script you must place your cursor in the line you want to run and click
“Run” in the top right corner of your script panel. Alternatively, you can use the shortcut Ctrl + Enter. If
you want to run more than one line at once, you may select all the lines you wish to run and then either click
“Run” or press Ctrl + Enter.
OK, so nothing else to do here. We have written our preamble and made sure that our working directory is
correct. Let’s now save the R script into the ‘MySTA1007analysis’ folder.

1. Click on File → Save


2. Browse to the ‘MySTA1007analysis’ folder
3. Give your script a sensible name such as: lab2script.R (no spaces and no special characters such as
#, $, %, &, *, +, etc.)

Remember to save your R script frequently. All the code we save is code we don’t have to type again!

Importing data
Enough preambles, let’s start using R. The first thing we need is data, so we need a way to import our data
into R. The easiest format for R to read data from is a comma-delimited text file, with the extension .csv.
It is easy to get data into this format, for example from Excel: just save the data as .csv file. But don’t
worry about this right now. We will use a data file that is already in the correct format.

1. Locate the file Class_Survey.csv on the STA1007S site in Vula → Resources → Labs → Data.
WARNING: do not open this file in Excel! When it comes to data analysis, Excel is not your
friend! If you accidentally opened it in Excel, it is best to delete the file from your computer,
download it again and start afresh (in this case).
2. Save Class_Survey.csv to the folder you made for the labs on your computer (the folder that contains
the R project file).

This data set contains the information we asked you to submit about yourself at the beginning of the course.
The R function that reads .csv files is called read.csv(). So, it would be tempting to run the command
(you are welcome to try):
read.csv("Class_Survey.csv")

This will read the data but it would not do anything useful with it, other than showing it to us. If you
remember, in the first lab we showed you how to create objects in R. These objects were stored in R’s memory
and we could call them anytime to use them. We’ll do something similar here. We will create an object that
will contain our data set. Then, we can call it anytime we want to manipulate it. To create an object, we
use the <- operator, which assigns a value to our object. Try typing this into your R script and run
the code:
# Read the class data set into R
classData <- read.csv("Class_Survey.csv")

In this case, our object will contain a full data set! R calls this type of object a data frame.
Instead of classData, we could have chosen any name for this object. It is best to choose a name that is fairly
short but holds some clues of what the object contains. R is VERY case sensitive! So, ClassData, Classdata,
and classdata, would all be different things to R! Keep this in mind; it can cause a lot of frustration. Also
remember that R happily overwrites objects without warning, so be careful not to re-use names within the
same R session. Finally, avoid names that clash with functions.
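A short sketch of these two pitfalls may help. The object names myValue and myvalue are made up for this illustration and are not part of the lab:

```r
# Two objects whose names differ only in case are unrelated as far as R is concerned
myValue <- 10
myvalue <- 20
myValue + myvalue

# Re-using a name silently replaces the old contents; R gives no warning
myValue <- "now I am text"
class(myValue)
```

After the last assignment the number 10 is gone for good, which is why it pays to choose distinct names.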
So read.csv() is a function, and as we’ve already seen, R functions typically take arguments - input that
allows us to tell the function what we want it to do, and we supply this information inside the round brackets.
In the case of read.csv(), the argument it needs is the path to the data file. We’ve already set the path to
the working directory, so all we need to specify is the file name. The file name is a character string (rather
than an object) and we tell R this by putting it into quotes. Again, you may at any time visit the help file of
the function to see further details.
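If read.csv() complains that it cannot find the file, a quick check along these lines can help. This is only a sketch; it assumes the file is called Class_Survey.csv and is supposed to sit in the working directory, as described above:

```r
# Where is R currently looking for files?
getwd()

# Is the data file in that folder? TRUE means read.csv() should find it
file.exists("Class_Survey.csv")

# List every .csv file in the working directory (handy for spotting typos in the name)
list.files(pattern = "\\.csv$")
```

You may type these directly in the console, since this is a once-off check rather than part of the analysis.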
Alright, back to our data. If you haven’t done so yet, read the Class_Survey.csv data set into R. When
you look at the console again, it seems like nothing happened. However, if you look closely at the top right
window, under the Environment tab, you can see that RStudio now lists the object classData and tells us
how many observations and how many variables the data set has.
We want to be sure that our data have been read in correctly and also get a feeling of what our data looks
like. R has a number of functions that help us do so. Let’s see a few of them.
Often the most straightforward way to quickly get a glimpse of what the data look like is to print the first
few rows. We can do this by using the function head(). Go to your script and type in the following code:
# Explore the data set by printing the first few rows
head(classData)

##   cigs height        views alc tattoo age           sleep gender stats  eyes
## 1    0   1.20     moderate   0     No  19    6 or 7 hours female     9 Brown
## 2    0   1.63 conservative   0     No  20    6 or 7 hours female    10 Brown
## 3    0   1.63    socialist   0     No  19    4 or 5 hours female     9 Brown
## 4    0   1.63     moderate   0     No  19 8 hours or more female     8  Blue
## 5    0   1.75     moderate   0     No  20 8 hours or more female     9 Hazel
## 6    0   1.84     moderate   0     No  19    6 or 7 hours   male     7 Brown
We immediately see the names of the variables, whether the variables are numbers or characters and also if
they allow any decimal places.
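Like most functions, head() takes arguments that change what it does. The sketch below uses a small toy data frame (called toyData here, a name chosen only for this illustration) built from three of the columns in the six rows printed above, so you can run it without the data file:

```r
# Toy data frame holding three of the columns from the six rows printed above
toyData <- data.frame(
  height = c(1.20, 1.63, 1.63, 1.63, 1.75, 1.84),
  age    = c(19, 20, 19, 19, 20, 19),
  eyes   = c("Brown", "Brown", "Brown", "Blue", "Hazel", "Brown")
)

# head() shows six rows by default; the argument n changes that
firstThree <- head(toyData, n = 3)
firstThree

# nrow() and ncol() count the rows and columns of a data frame
nrow(firstThree)
ncol(toyData)
```

The same argument works on the real data, for example head(classData, n = 10).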
We can gain a deeper understanding of the structure of the variables by calling the function str().
# Explore the data set by looking at the structure of the variables
str(classData)

## 'data.frame': 202 obs. of 10 variables:
## $ cigs : int 0 0 0 0 0 0 0 0 0 0 ...
## $ height: num 1.2 1.63 1.63 1.63 1.75 1.84 1.67 1.85 1.62 1.75 ...
## $ views : chr "moderate" "conservative" "socialist" "moderate" ...
## $ alc : num 0 0 0 0 0 0 0 0 0 0 ...
## $ tattoo: chr "No" "No" "No" "No" ...
## $ age : int 19 20 19 19 20 19 18 18 19 18 ...
## $ sleep : chr "6 or 7 hours" "6 or 7 hours" "4 or 5 hours" "8 hours or more" ...
## $ gender: chr "female" "female" "female" "female" ...
## $ stats : num 9 10 9 8 9 7 8 10 8 8.5 ...
## $ eyes : chr "Brown" "Brown" "Brown" "Blue" ...
This output tells us what type of variables we have in the data set. For example, the variable cigs is an integer
(quantitative variable that can only take on integer values), while height is numeric (quantitative variable that
can take on any real value) and views is a character, i.e. text. The character variables need some attention. Since
R version 4.0.0, read.csv() keeps text columns as plain text and does not make any assumption about what we want
to do with them. For analysis, we typically want R to treat them as a factor (qualitative or categorical variable
that can only take on some pre-defined values). We’ll see more of this later. But right now, let’s tell R that we
want it to treat the variable sleep as a factor.
classData$sleep <- as.factor(classData$sleep)

(Versions of R older than 4.0.0 automatically treated text variables as factors; if you happen to use an older
version of R, you might not need to do this, although it never hurts to make sure R treats your variables the way
you want.)
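To see what the conversion changes, here is a minimal sketch using only the first four sleep answers shown in the str() output above. The names sleepText and sleepFactor are chosen just for this example:

```r
# The first four sleep answers shown in the str() output above
sleepText <- c("6 or 7 hours", "6 or 7 hours", "4 or 5 hours", "8 hours or more")

# As plain text, summary() only reports the length and the type
summary(sleepText)

# As a factor, R knows the distinct categories (levels) ...
sleepFactor <- as.factor(sleepText)
levels(sleepFactor)

# ... and summary() counts the observations in each category
summary(sleepFactor)
```

Note that as.factor() orders the levels alphabetically, which is not necessarily the order that makes most sense for the analysis.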
Now that we know what variables we are dealing with, we may have a look at some basic statistical information
about them. The quickest and most general way to look at this is by using the function summary()
# Produce a summary of the variables in the data set
summary(classData)

##       cigs             height          views                alc
##  Min.   : 0.0000   Min.   :1.200   Length:202         Min.   :  0.00
##  1st Qu.: 0.0000   1st Qu.:1.583   Class :character   1st Qu.:  0.00
##  Median : 0.0000   Median :1.650   Mode  :character   Median :  0.00
##  Mean   : 0.4753   Mean   :1.657                      Mean   :  7.43
##  3rd Qu.: 0.0000   3rd Qu.:1.730                      3rd Qu.:  0.00
##  Max.   :50.0000   Max.   :1.980                      Max.   :750.00
##
##     tattoo               age                       sleep        gender
##  Length:202         Min.   : 17.00   4 or 5 hours     : 23   Length:202
##  Class :character   1st Qu.: 18.00   6 or 7 hours     :100   Class :character
##  Mode  :character   Median : 19.00   8 hours or more  : 77   Mode  :character
##                     Mean   : 19.99   Less than 4 hours:  2
##                     3rd Qu.: 20.00
##                     Max.   :188.00
##                     NA's   :1
##      stats            eyes
##  Min.   : 1.000   Length:202
##  1st Qu.: 8.000   Class :character
##  Median : 9.000   Mode  :character
##  Mean   : 8.418
##  3rd Qu.:10.000
##  Max.   :21.000
##  NA's   :1

This function provides the five-number summary, plus the mean, of the numerical variables and the frequencies
of the categorical variables stored as factors, such as sleep (i.e. how many observations in each category).
Variables that are still stored as characters only show their length and type. This is a great way of having a
quick look at the scale and range of the different variables, detecting possible outliers (min or max values may
not make much sense) or missing values (R will show missing values as “NA”), or determining which categories
are more frequent. Summaries are not only useful to get familiar with our data, they are also great for
troubleshooting. Sometimes the reason why our analysis doesn’t want to run or R doesn’t seem to cooperate is
that our variables are in a format we didn’t expect or there are missing values that we were unaware of.
Summaries are a “must do” at the beginning of any analysis.
If we want a summary of a single variable in our data frame, we can use the $ operator. Suppose we
want a summary for the variable height and run the following command:
# Produce a summary for the variable height
summary(height)

## Error in summary(height): object 'height' not found


R can’t find the object height and throws an error. We need to tell R that it needs to look for this variable
inside the data frame classData (or whatever you called it), so we type the name of the data frame followed
by the $ sign and the name of the variable.
# Produce a summary for the variable height
summary(classData$height)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.200   1.583   1.650   1.657   1.730   1.980
This code provides the summary for the variable height inside the data frame classData. We’ll use this
notation very often. There are other ways of indexing and sub-setting data frames, but this is all you need to
know for now; we’ll see more of this in the next lab.
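One thing to watch out for with the $ notation: if you misspell the variable name (including its case), R does not throw an error for a data frame column; it quietly returns NULL instead. A small sketch, using a toy data frame (toyData, a made-up name) with the six heights printed by head() above:

```r
# Toy data frame with the six heights printed by head() above
toyData <- data.frame(height = c(1.20, 1.63, 1.63, 1.63, 1.75, 1.84))

# names() lists the variable names that can follow the $ sign
names(toyData)

# Correct spelling: we get the numbers back and can summarise them
summary(toyData$height)

# Wrong case: there is no column called Height, so R returns NULL
toyData$Height
```

If a summary or plot looks empty or strange, checking names() for the exact spelling is a good first step.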

SUBMISSION:
Vula Question 1. Is the variable age a quantitative or a qualitative variable?
Vula Question 2. By looking at the variable sleep, fill in the blanks: "Most people in the class sleep between ---
and --- hours."
What about the variance and the standard deviation? These two measures of dispersion are used very
frequently. We can compute them with the functions var() and sd(), respectively. These functions take
single variables, though, not entire data frames. We can also compute the mean of a variable with the
function mean(). Let’s look at the mean, variance and standard deviation of the variable age (fill in the
blank pieces of code ---):
# Calculate the mean, variance and standard deviation of the variable age
mean(classData$age)
var(---$age)
sd(---$---)
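A word of caution: the summary above reported one missing value (NA) for age. By default, mean(), var() and sd() return NA as soon as the variable contains a missing value; the argument na.rm = TRUE tells them to drop missing values first. The sketch below uses a toy vector (ages, a name made up for this example) with the six ages printed by head() plus one NA added on purpose, so it does not give away the answers for the full data set:

```r
# Six ages from the head() output, plus one missing value added for illustration
ages <- c(19, 20, 19, 19, 20, 19, NA)

# With a missing value present, the default result is NA
mean(ages)

# na.rm = TRUE removes the missing values before calculating
mean(ages, na.rm = TRUE)
var(ages, na.rm = TRUE)
sd(ages, na.rm = TRUE)

# The standard deviation is the square root of the variance
sqrt(var(ages, na.rm = TRUE))
```

Whether dropping missing values is appropriate depends on the analysis, so it is worth checking how many there are first (summary() shows this).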

Plotting quantitative variables


Summaries are great, and we should always start our analysis with them; however, they lack the visual appeal
of plots. Plots are much better at communicating what the frequency distribution of the variable values is.
For example, by looking at the five-number summary for the variable height, would you say that average
heights are more or less frequent than large heights? It’s difficult to say...
In the next sections, we will look at a couple of ways to plot quantitative variables. In the following labs we’ll
see more plotting techniques, including some to deal with qualitative (categorical) variables.

Histograms
Histograms are used to explore the frequency distribution of continuous variables (those variables that can
take on any real number). Let’s now try a histogram of the variable height (note the $ notation again).
# Produce a histogram for the variable height
hist(classData$height)

[Figure: histogram titled "Histogram of classData$height". x-axis: classData$height (1.2 to 2.0); y-axis: Frequency (0 to 60).]

The histogram shows that average to large height values are more common than very large or very small
values of height. The distribution is not completely symmetric but is rather skewed to the left (larger values
tend to be more frequent and the tail to the left is relatively long and ‘heavy’). This is what is meant when
we refer to the “frequency distribution” of the data and it is an extremely important characteristic, as we’ll
see over and over again.
Back to our histogram plot. You can click on the little field that says ‘Zoom’ above the graph, to show a
larger version. You can also save the graph by clicking on the ‘Export’ button. This is all pretty neat. But,
hang on, how did R decide on the bin width for the histogram? We didn’t give any instructions in this regard
and we know that this is an important decision! To answer this question, we ask RStudio to display the help
file for the function hist() by typing (remember, you may type this directly into the console if you prefer,
otherwise R will show the help file every time you run your script; this is up to you):
?hist

As we’ve mentioned before, help files are really important in R. We see that the hist() function expects
some arguments. The first is ‘x’, i.e. ‘a vector of values for which the histogram is desired’. We did supply
such a vector (a vector in R is just an object with several values), classData$height, which contains the
height values for the class. What about all the other arguments? And why didn’t we have to specify anything
for them? The reason is that the other arguments all have defaults, which R uses when we don’t specify
anything. For example, the next argument that hist() expects is ‘breaks’ and the help file tells us that the
default is ‘Sturges’. So, this is the method that R used to determine the number of bins since we did not
specify anything else. Now let’s change this. Under the heading ‘Arguments’, the help file gives us a few
options for specifying the bin widths. Let’s try the third option, ‘a single number giving the number of cells
for the histogram’. Let’s try 20 bins for example:
# Produce a histogram for the variable height with 20 bins
hist(classData$height, breaks = 20)

[Figure: histogram titled "Histogram of classData$height", drawn with breaks = 20. x-axis: classData$height (1.2 to 2.0); y-axis: Frequency (0 to 35).]

We now have more detail because of the larger number of bins, and the frequencies (y-axis) are smaller for
each bin. This is because we have the same number of observations (data points) and we need to distribute
them into more bins. In this case, the general pattern of the histogram remains the same. However, as we
increase the number of bins, we’ll see that the pattern starts “fading”. This is because each bin only captures
a few data points and we end up with either a handful of observations or none at all in each bin. Try this:
# Produce a histogram for the variable height with 100 bins
hist(classData$height, breaks = 100)

[Figure: histogram titled "Histogram of classData$height", drawn with breaks = 100. x-axis: classData$height (1.2 to 2.0); y-axis: Frequency (0 to 15).]

We have asked for a histogram with 100 bins for a data set with 202 observations. Probably too many bins...
We see how the general pattern breaks down into something “spikier”. Some values seem to be much more
frequent than others that are right next to them. With such a high resolution, characteristics of this particular
data set stand out over the general pattern, which is more likely to represent the characteristics of the
population under study. This is why the choice of number of bins (or alternatively bin width) matters.
Conversely, bin widths that are too large will give a picture that is too broad and we might lose important
details of the distribution. R usually provides sensible defaults, but good judgement is key here.
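One detail worth knowing: according to the help file of hist(), a single number given to breaks is only a suggestion, and R picks "pretty" break points close to it, so the number of bins you get may differ from the number you asked for. The sketch below shows how to check this. It uses simulated values (simHeights, generated with rnorm() purely for illustration, not the class data) and plot = FALSE so that hist() returns its calculations without drawing anything:

```r
# Simulated "heights" for illustration only; this is NOT the class data
set.seed(1)
simHeights <- rnorm(200, mean = 1.66, sd = 0.1)

# Ask for about 20 bins, but keep the result instead of plotting it
h <- hist(simHeights, breaks = 20, plot = FALSE)

# The break points R actually chose
h$breaks

# Number of bins actually used (one fewer than the number of break points)
length(h$breaks) - 1

# The counts in all bins add up to the number of observations
sum(h$counts)
```

You could store the result of hist(classData$height, breaks = 100, plot = FALSE) in the same way to see how many bins R really used for the class data.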

SUBMISSION:
Produce a histogram of the variable height with only 3 bins.
Vula Question 3. Choose the best answer:
a) Using this plot, we can be sure that height does not have a bimodal distribution.
b) The histogram with 3 bins captures the general frequency in the data, but it hides important details
visible at higher resolution.
c) The histogram with 3 bins seems to be optimal, since fewer bins are always preferable.

Box-and-Whisker plots
We might want to quickly compare two data sets or different categories within a data set. Box-and-whisker
plots provide a quick and visual way of comparing the five-number summary of any two groups of data. These
plots work well for any quantitative variable.
Let’s compare the five-number summary of the values of height amongst people with a tattoo and without a
tattoo, using a box-and-whisker plot:
# Produce a boxplot for height amongst people with and without tattoos
boxplot(formula = height ~ tattoo, data = classData)
[Figure: side-by-side box-and-whisker plots of height for the two tattoo groups. x-axis: tattoo (No, Yes); y-axis: height (1.2 to 2.0).]

The easiest way to tell the function boxplot() what to do is by using a formula. If you go to the help file of
the function boxplot,
?boxplot

you’ll see that you can pass on different types of arguments to specify what you want the function to plot.
The option with the argument formula seems to be the most straightforward. You just have to use the ~
sign, which stands for ‘as a function of’ or ‘grouped by’. In this case you want to plot the variable height
‘grouped by’ the variable tattoo and so you pass on the argument formula = height ~ tattoo.
On the x-axis of the plot we observe the different categories (groups) being compared. As you have seen
in the lectures, the box spans the values of the data between the first and third quartiles
(central 50% of the data). The bold line represents the median of the distribution of values for each group.
The whiskers capture those values more extreme than the first or third quartiles but that are not considered
outliers. And what are considered outliers? Once again, it is something we need to ask R’s help file. This is a
bit more difficult to find, but if we move down to the argument range we find out that it determines how far
the plot whiskers extend out from the box. If range is positive, the whiskers extend to the most extreme data
point which is no more than range times the interquartile range from the box. And if we look at the default
value for this argument (under the Usage section, at the beginning of the help file) we see that it is 1.5. So,
outliers, in this case, are any observations lying more than 1.5 times the interquartile range below the first
quartile or above the third quartile.
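The 1.5 rule can be checked by hand. The sketch below applies it to the first ten heights shown in the str() output above and compares the result with boxplot.stats(), the helper function that boxplot() uses for its calculations. Two assumptions to be aware of: the variable names are made up for this example, and R's boxplot works with "hinges" from fivenum(), which are very close to, but not always identical to, the quartiles reported by summary():

```r
# The first ten heights shown in the str() output above
heights <- c(1.20, 1.63, 1.63, 1.63, 1.75, 1.84, 1.67, 1.85, 1.62, 1.75)

# fivenum() returns: minimum, lower hinge, median, upper hinge, maximum
fiveNumbers <- fivenum(heights)
lowerHinge <- fiveNumbers[2]
upperHinge <- fiveNumbers[4]

# Length of the box (the interquartile range based on the hinges)
boxLength <- upperHinge - lowerHinge

# Anything beyond these limits is drawn as an outlier (range = 1.5)
lowerLimit <- lowerHinge - 1.5 * boxLength
upperLimit <- upperHinge + 1.5 * boxLength

# Outliers found by hand ...
manualOutliers <- heights[heights < lowerLimit | heights > upperLimit]
manualOutliers

# ... and the outliers R itself would plot
boxplot.stats(heights)$out
```

For these ten values both approaches flag the same single observation, the height of 1.2.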
Looking at the boxplot, we see a number of small outliers in the heights of the group with No tattoo. Do you
think there is a connection between these outliers and the frequency distribution being skewed to the left in
the histogram above?
If someone asked you, by looking at the boxplot, would you say that people with a tattoo tend to be taller
than people without a tattoo?
We’ll leave you to think about these questions.

SUBMISSION:
Produce a boxplot of the variable stats for the different political views in class.
Vula Question 4. Fill in the blank: "The group with --- views had the most outliers in the variable stats".
Vula Question 5. Use the functions you used today to answer, how many observations were used to produce
the boxplot of the socialist group?
Vula Question 6. Use the functions you used today to answer, what is the maximum number of missing
values in a single variable in this data set?

The commands you learned today


These are the functions and operators that you learned today. Fill in your own description of what they do.
read.csv()
head()
str()
as.factor()
summary()
mean()
sd()
var()
hist()
boxplot()
