STA1007S Lab 3: Plots (II) and sub-setting

SUBMISSION INSTRUCTIONS:

Your answers need to be submitted on Vula.

Go into the Submissions section and click on Lab Session 3 to access the submission form. Please note that
the answers get automatically marked and so have to be in the correct format:

ENTER YOUR ANSWERS TO 2 DECIMAL PLACES UNLESS THE ANSWER IS A ZERO OR AN INTEGER
(for example, if the answer is 0 you just enter 0 and not 0.00, or if the answer is 2 you enter 2 and not 2.00).

DO NOT INCLUDE ANY UNITS (e.g. meters, mg, etc.).

PROBABILITIES MUST BE BETWEEN 0 AND 1, SO A 50% CHANCE WOULD CORRESPOND TO A
PROBABILITY OF 0.5.

Introduction
In the last lab session, we saw how to produce a quick summary of the variables in our data set, as well
as how to produce histograms and boxplots for quantitative variables. In the first part of this lab, we
will keep practicing exploratory techniques using the Class_Survey data set. We will look at how to plot
qualitative (categorical) variables and also quantitative variables that can only take on certain values (these
are called discrete variables and usually take on integer numbers). We will then learn more about indexing
and sub-setting in R.
You will find that most of the R code necessary to execute the R commands is provided. This lab is meant to
be practice for you, so even if the code and the output of the code is provided, you are expected to create
your own script, run the pieces of code yourself and check whether the output is what you would expect it to
be. Every now and then, you will be asked to fill in blank pieces of code marked as ---. In addition to “fill
in the code”, you will need to answer other questions for which you must produce plots, run your own code
or explore your data. The questions you need to submit through Vula will appear in the submission boxes.
At any time you can call the function help() to obtain information about any function you want. E.g. if
you wanted a description of how the function sample() works, you could type in the
console (bottom left panel in RStudio):
help("sample")

or you can just type:


?sample

You should make this a habit and check the help files of the functions you use for the first time.

Start a new R script


By now, you should be getting the idea of how to start an R script for an existing project. Let’s quickly
recap once more:

1. Open RStudio
2. Open “MySTA1007analysis” project by clicking File → Open Project. . . and then browse to
“MySTA1007analysis” project.
Note that the Projects tab in the top right corner will now display the name of your project.
3. Create a new R script: click File → New File → R Script

Script preamble
Remember to write a preamble for your script. We will provide the code once again, although you should be
familiarizing yourself with it, so we won’t go through all the steps in detail.
# Amazing R User (your name)
# 23rd August (today's date)
# This script contains STA1007S Lab 3 commands

# Clean my working space


rm(list = ls())

# Check what working directory R is looking at


getwd()

Make sure that R is looking in the correct working directory and save your script. Remember to save your R
script frequently. All the code we save is code we don’t have to type again!

Importing data
Next, we need to get our data into R. We saw how to do this in the last lab session. For this lab and the
following ones, we will use an updated version of the Class_Survey data set, called Class_Survey2.csv. Make
sure you use this version of the data set!

1. Locate the file Class_Survey2.csv on Vula in the Resources → Labs → Data folder.
2. Save Class_Survey2.csv to the folder you made for the labs on your F drive (the folder that contains
the R project file).

This data set contains the information we asked you to submit about yourself at the beginning of the course.
Remember, the R function that reads .csv files is called read.csv(). Fill in the missing pieces (marked as
---) and run the following code.
# Read the class data set into R
classData --- read.csv("---.csv")

Make sure that the data frame was loaded into R correctly. We saw a few R functions that help us do exactly
that. Type and run the following code:
# Check that the data was loaded correctly
dim(classData) # how many rows and columns does the data frame have?

## [1] 196 10

head(classData)

## cigs height views alc tattoo age sleep gender stats eyes
## 1 0 1.20 moderate 0 No 19 6 or 7 hours female 9 Brown
## 2 0 1.63 conservative 0 No 20 6 or 7 hours female 10 Brown
## 3 0 1.63 socialist 0 No 19 4 or 5 hours female 9 Brown
## 4 0 1.63 moderate 0 No 19 8 hours or more female 8 Blue
## 5 0 1.75 moderate 0 No 20 8 hours or more female 9 Hazel
## 6 0 1.84 moderate 0 No 19 6 or 7 hours male 7 Brown
str(classData)

## 'data.frame': 196 obs. of 10 variables:


## $ cigs : int 0 0 0 0 0 0 0 0 0 0 ...
## $ height: num 1.2 1.63 1.63 1.63 1.75 1.84 1.67 1.85 1.62 1.75 ...
## $ views : chr "moderate" "conservative" "socialist" "moderate" ...
## $ alc : num 0 0 0 0 0 0 0 0 0 0 ...
## $ tattoo: chr "No" "No" "No" "No" ...
## $ age : int 19 20 19 19 20 19 18 18 19 18 ...
## $ sleep : chr "6 or 7 hours" "6 or 7 hours" "4 or 5 hours" "8 hours or more" ...
## $ gender: chr "female" "female" "female" "female" ...
## $ stats : num 9 10 9 8 9 7 8 10 8 8.5 ...
## $ eyes : chr "Brown" "Brown" "Brown" "Blue" ...
A good first check is to see how many rows and columns the data frame has and the function dim() gives us
these two quantities.
At this point, it is important to note the different types of variables we have in the data set.
We see, for example, that the name of the variable height is followed by the abbreviation num, which means
that this variable is numeric and therefore, it can take on any real value (basically, it means it can have
decimal values).
In contrast, the name of the variable age is followed by the abbreviation int for integer. This means that
this variable is still quantitative, but it can only take on integer numbers (no decimal places). Why does this
matter? We’ll see that probabilities for the different values a variable can take are computed
differently for numeric and integer variables.
NOTE: In general, during the labs we will refer to numeric variables as being continuous and to integer
variables as being discrete. Although we might use these terms interchangeably, note that they are
not exactly the same: e.g. a variable consisting of integers divided by 10 would also be discrete, but its values
would no longer be integers. But don’t worry too much about this for now.
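The distinction in the note can be checked on a small made-up vector. This is only a sketch: the object names counts and tenths are illustrative and are not part of the class data.

```r
# A hypothetical vector of whole-number counts, stored as integers (note the L suffix)
counts <- c(0L, 2L, 7L)
class(counts)        # "integer"

# Dividing by 10 gives values that are still discrete (only certain values are possible),
# but R no longer stores them as integers
tenths <- counts / 10
class(tenths)        # "numeric"
tenths               # 0.0 0.2 0.7
```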
Lastly, there are some variables labeled with chr (for character, i.e. R treats these as simple strings of
characters), for example the variables views and tattoo. In our case, these are categorical variables and
can only take on particular values. They are useful for classifying and defining groups within the data. To
work with categorical variables that have been read in as chr, we need to turn them into the class factor.
Factors are variables that can take on a limited set of values called “levels”.
classData$views <- as.factor(classData$views)
classData$tattoo <- as.factor(classData$tattoo)
classData$sleep <- as.factor(classData$sleep)
classData$gender <- as.factor(classData$gender)
classData$eyes <- as.factor(classData$eyes)

To check that R is now treating these variables as factors, let’s look at the attributes of the variable “views”
as an example.

# What type of variable is the variable views?
class(classData$views)

## [1] "factor"
# What values can the variable views take on?
levels(classData$views)

## [1] "communist" "conservative" "liberal" "moderate" "socialist"
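To see what the conversion does, here is a minimal sketch on a made-up character vector. The names answers and answersFactor are hypothetical and chosen only for illustration.

```r
# A hypothetical character vector of answers (not the class data)
answers <- c("No", "Yes", "No", "No")
class(answers)                    # "character"

# Convert it to a factor: R now knows the set of possible values
answersFactor <- as.factor(answers)
class(answersFactor)              # "factor"
levels(answersFactor)             # "No" "Yes"

# summary() counts the observations in each level of a factor
summary(answersFactor)
```

This is why summary() shows counts per category for the converted variables rather than a five-number summary.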

Plotting categorical variables


Before we start plotting variables, let’s refresh our memory of what the variables in our data set look like by
using the function summary(). Run the following code, filling in the missing piece:
# Produce a summary of the variables in the data set
summary(---)

## cigs height views alc tattoo


## Min. :0.000 Min. :1.200 communist : 7 Min. :0.0000 No :177
## 1st Qu.:0.000 1st Qu.:1.587 conservative:11 1st Qu.:0.0000 Yes: 19
## Median :0.000 Median :1.650 liberal :86 Median :0.0000
## Mean :0.199 Mean :1.656 moderate :60 Mean :0.4633
## 3rd Qu.:0.000 3rd Qu.:1.722 socialist :32 3rd Qu.:0.0000
## Max. :8.000 Max. :1.980 Max. :6.0000
## age sleep gender stats
## Min. :17.00 4 or 5 hours :22 female :138 Min. : 1.000
## 1st Qu.:18.00 6 or 7 hours :98 male : 55 1st Qu.: 8.000
## Median :19.00 8 hours or more :74 No Answer: 1 Median : 9.000
## Mean :19.14 Less than 4 hours: 2 other : 2 Mean : 8.367
## 3rd Qu.:20.00 3rd Qu.:10.000
## Max. :25.00 Max. :10.000
## eyes
## Blue : 36
## Brown:124
## Green: 19
## Hazel: 11
## Other: 6
##
Remember, this function provides the five-number summary, plus the mean, of the numerical variables and
the frequencies of the categorical variables (i.e. how many observations in each category).
This is a good point to bridge what we have seen in class with this practical exercise.

SUBMISSION:
Vula Question 1 Using functions you learned so far, calculate the standard error for the mean of the
variable height. The function sqrt(x) calculates the square root of a number x. ROUND YOUR ANSWER
TO 4 DECIMAL PLACES!
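As a reminder of the idea, the standard error of a mean is the sample standard deviation divided by the square root of the sample size. The sketch below applies this to a small made-up vector, not to the class data. The name x is hypothetical, and the sketch assumes there are no missing values.

```r
# Hypothetical measurements, not taken from the class survey
x <- c(2, 4, 4, 6)

n <- length(x)              # sample size
s <- sd(x)                  # sample standard deviation
se <- s / sqrt(n)           # standard error of the mean
round(se, 4)                # 0.8165
```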
Probably the first question we might ask is whether we are equally likely to observe any of the eye colours, if
we picked a random person in class. One way to answer this question is by using the function table(). The
function table() allows us to determine how many observations we have of each eye colour. Let’s try the
following code (fill in the missing code):

# Produce a table with the number of cases of each level in the variable eyes
table(classData$---)

##
## Blue Brown Green Hazel Other
## 36 124 19 11 6
R has produced a table with the number of observations associated with each eye colour. It shouldn’t come as a
surprise that the sum of all cases equals the total number of rows in the data frame. It seems that “Brown”
eyes are the most frequent.
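The same check can be tried on a small made-up vector. This is a sketch only: the name colours is hypothetical and the values are not taken from the class data. Dividing the counts by their total turns them into proportions, which is one way to judge whether the categories are equally likely.

```r
# A hypothetical vector of eye colours (not the class data)
colours <- c("Brown", "Blue", "Brown", "Green", "Brown")

counts <- table(colours)         # number of observations per category
counts

sum(counts) == length(colours)   # TRUE: the counts add up to the number of observations

# Dividing by the total gives relative frequencies (proportions between 0 and 1)
counts / sum(counts)
```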
There are only a few categories in this table, which makes this relatively easy to visualize, but with a greater
number of categories it would be difficult to get a quick picture of which categories are more frequent. It is
usually much better to plot our results to get a visual idea of what is going on.
We can’t produce a histogram for the variable eyes because it is categorical and it wouldn’t make sense to
distribute the different data points into bins. Actually it would, but only if we created exactly one bin for
each eye colour. This is exactly what a barplot does.
A barplot is a common way of visualizing categorical data, with each bar representing a different category
and the height of the bar representing the number of cases (observations) in each category. We will use the
function barplot() to produce a barplot for the variable eyes. This function needs us to provide the height
of the bars, which we just said corresponds to the number of cases in each category. Wasn’t that what the
function table() just provided?
Alright, let’s see if we can work this out. . . we’ve seen that we can nest functions inside functions in R. You
should be able to figure out what argument the function table() needs, in the following piece of code
# Produce a barplot with the variable eyes of the classData data frame
barplot(height = table(---))
[Figure: barplot of the variable eyes. Bars: Blue, Brown, Green, Hazel, Other; vertical axis from 0 to 120.]


You should now see a plot similar to the one above. The plot looks similar to a histogram, but there is some
space between the bins. This reflects the discontinuous nature of the data.
OK, now we have seen two ways of looking at the frequency of the different levels of a categorical variable,
namely: using the function table() and also using the function barplot().
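The link between the two functions can be sketched on a made-up variable. The names pets, petCounts and mids are hypothetical and the data are not from the class survey.

```r
# A hypothetical categorical variable (not the class data)
pets <- c("cat", "dog", "dog", "fish", "dog", "cat")

petCounts <- table(pets)          # counts per category: cat 2, dog 3, fish 1

# barplot() draws one bar per category; it also returns the bar midpoints
mids <- barplot(height = petCounts, ylab = "Frequency")
length(mids)                      # 3, one bar per category
```

The heights of the bars are exactly the numbers printed by table(), so either one can be used to read off a frequency.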

SUBMISSION:
Vula Question 2 What is the frequency of people sleeping 8 hours or more, in our data set?

Plotting discrete numerical variables
We have seen how to plot the frequency distribution of continuous quantitative variables and also of categorical
variables. What about quantitative variables that can only take on certain values? Although they might
seem like strange creatures, we will work with this type of variable quite a lot. If you think about it, counts
like “number of cigarettes” can only take on integer values and it is a very common type of variable. Other
variables like “age” are very often rounded to the nearest integer. So how do we plot them?
We can use histograms, although it usually doesn’t make much sense unless the range of the variable is large.
For example, (fill in the blanks and) let’s create a histogram of the variable age.
# Create a histogram for the variable age
---(---$age)

[Figure: Histogram of classData$age. Horizontal axis: classData$age (18 to 24); vertical axis: Frequency (0 to 60).]

Now create a histogram of the variable age but with 100 bins (fill in the blanks and refer to the help file if
you need to):
# Create a histogram for the variable age with 100 bins
---(---, --- = ---)

[Figure: Histogram of classData$age, drawn with 100 bins. Horizontal axis: classData$age (18 to 24); vertical axis: Frequency (0 to 60).]

You can see that the frequency pattern for both histograms is very similar. All the data points concentrate
exactly on the integers and therefore by increasing the number of breaks we only accomplish reducing the
width of the bins and creating empty spaces between them (empty bins, actually).
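The empty bins can be counted directly on a small made-up vector. This is a sketch under stated assumptions: ages and h are hypothetical names and the values are not the class data. Note that R treats a single number given to breaks as a suggestion, so the actual number of bins may differ from the number requested.

```r
# Hypothetical integer ages (not the class data)
ages <- c(17, 18, 18, 19, 19, 19, 20, 25)

# plot = FALSE returns the binning without drawing, so we can inspect it
h <- hist(ages, breaks = 100, plot = FALSE)

sum(h$counts)          # 8: every observation falls in some bin
sum(h$counts > 0)      # 5: only the bins containing 17, 18, 19, 20 and 25 are occupied
sum(h$counts == 0)     # all the remaining bins are empty
```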
So, if the range of the variable is not large, as is the case with most counts, we may as well just use
barplots.
# Create a barplot for the variable age
barplot(height = ---(classData$age))
[Figure: barplot of the variable age. Bars: 17, 18, 19, 20, 21, 22, 23, 25; vertical axis from 0 to 70.]
The problem with the barplot above is that we don’t see the gaps where there is no data, e.g. there is no
bar for age 24 to separate 23 from 25. This is a problem that can be fixed but requires some extra
manipulation that we won’t cover now.
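For the curious, and not required for this lab, one possible way of keeping the gaps is to tell R which values are possible before counting. The sketch below uses made-up data; the names ages and ageCounts are hypothetical, and this is only one approach among several.

```r
# Hypothetical ages with no observation at 19 (not the class data)
ages <- c(17, 18, 18, 20)

# Declaring every possible age as a level keeps the empty category in the table
ageCounts <- table(factor(ages, levels = 17:20))
ageCounts                      # counts for 17, 18, 19, 20 are 1, 2, 0, 1

barplot(height = ageCounts)    # the bar for 19 has height zero, so the gap is visible
```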
For now, using either option is fine; however, we should think about whether we are plotting “binned” data
or we just want to see bars on top of integer numbers.

Find and grab part of your data – indexing and sub-setting
We have seen that objects in R can be made up of different elements. These elements are indexed (given an
identifier) by R to make it possible for us to refer to them. R does this automatically and we just need to
understand how it works. Sub-setting is used to refer to or extract these elements and this is what we will
use to explore and manipulate the structure of our objects.
We have already seen one way of sub-setting objects, with the $ operator. Another common way of sub-setting
is using the square brackets [].
Now, recall that R has stored our data in a data frame. This is the typical format for a data set in R. A data
frame has two dimensions: rows and columns. Many data analysis tasks require us to work with a subset
of the dataset, e.g. certain variables (columns) or particular rows. Data frames in R have implicit row and
column numbers that allow us to subset data.
In this case the indexing system deals first with rows, then with columns, and separates the two by a comma
[rows,columns]. Let’s see how this works:
# Extract the first row of the class survey data frame
classData[1,]

## cigs height views alc tattoo age sleep gender stats eyes
## 1 0 1.2 moderate 0 No 19 6 or 7 hours female 9 Brown
This should return the first row of the data frame. Make sure you understand what these numbers are by
comparing them to the beginning of the data frame. Remember how to display the first few rows of your
data frame?
Fill the blank and run the following code:
# Display the first rows of the class survey data set
---(---)

## cigs height views alc tattoo age sleep gender stats eyes
## 1 0 1.20 moderate 0 No 19 6 or 7 hours female 9 Brown
## 2 0 1.63 conservative 0 No 20 6 or 7 hours female 10 Brown
## 3 0 1.63 socialist 0 No 19 4 or 5 hours female 9 Brown
## 4 0 1.63 moderate 0 No 19 8 hours or more female 8 Blue
## 5 0 1.75 moderate 0 No 20 8 hours or more female 9 Hazel
## 6 0 1.84 moderate 0 No 19 6 or 7 hours male 7 Brown
Hopefully, you are now seeing the first few rows of your data set in your console and you may see that the
output of classData[1,] corresponds to the first row.
Let’s now extract the second column (second variable) of our data frame. Note that the number 2 is now to
the right of the comma:
# Extract the second column of the classData data frame
classData[,2]

## [1] 1.20 1.63 1.63 1.63 1.75 1.84 1.67 1.85 1.62 1.75 1.53 1.72 1.50 1.65 1.58
## [16] 1.60 1.63 1.58 1.60 1.62 1.66 1.72 1.80 1.60 1.70 1.64 1.65 1.58 1.43 1.66
## [31] 1.54 1.60 1.58 1.78 1.53 1.71 1.75 1.86 1.65 1.68 1.78 1.56 1.55 1.84 1.64
## [46] 1.86 1.84 1.54 1.54 1.70 1.60 1.74 1.77 1.85 1.80 1.73 1.63 1.67 1.68 1.90
## [61] 1.69 1.69 1.79 1.61 1.72 1.60 1.59 1.49 1.70 1.77 1.69 1.61 1.81 1.70 1.63
## [76] 1.74 1.51 1.65 1.83 1.55 1.66 1.51 1.79 1.69 1.75 1.66 1.53 1.50 1.65 1.59
## [91] 1.68 1.67 1.50 1.72 1.60 1.77 1.60 1.97 1.85 1.70 1.72 1.50 1.79 1.60 1.62
## [106] 1.59 1.54 1.53 1.57 1.55 1.80 1.58 1.87 1.65 1.40 1.59 1.70 1.70 1.56 1.78
## [121] 1.65 1.93 1.50 1.72 1.95 1.54 1.25 1.63 1.69 1.75 1.58 1.53 1.67 1.70 1.78
## [136] 1.72 1.65 1.64 1.70 1.50 1.56 1.50 1.65 1.65 1.71 1.60 1.71 1.70 1.60 1.72

## [151] 1.98 1.50 1.87 1.75 1.80 1.63 1.61 1.80 1.74 1.56 1.80 1.72 1.77 1.55 1.61
## [166] 1.69 1.70 1.69 1.69 1.71 1.56 1.67 1.59 1.68 1.79 1.40 1.50 1.59 1.51 1.60
## [181] 1.55 1.60 1.40 1.33 1.42 1.60 1.75 1.75 1.73 1.50 1.83 1.69 1.66 1.66 1.60
## [196] 1.65
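Note that when you ask R for a single column of data, it will return it as a plain series of values printed
across the screen, not arranged neatly as a column. The number in square brackets at the start of each output
line gives the position of the first value on that line. This can be confusing when you first see it but saves
space on the screen.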
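The difference in what R hands back can be checked on a small made-up data frame. This is a minimal sketch: toy, col2 and row1 are hypothetical names and the values are not from the class data.

```r
# A small hypothetical data frame (not the class data)
toy <- data.frame(height = c(1.60, 1.75, 1.82),
                  age    = c(19L, 20L, 18L))

col2 <- toy[, 2]          # a single column comes back as a plain vector
is.vector(col2)           # TRUE
is.data.frame(col2)       # FALSE

row1 <- toy[1, ]          # a single row is still a (one-row) data frame
is.data.frame(row1)       # TRUE
```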
You can ask R to extract more than one row or column, e.g.:
# Extract the first 3 rows of the classData data frame
classData[1:3,]

## cigs height views alc tattoo age sleep gender stats eyes
## 1 0 1.20 moderate 0 No 19 6 or 7 hours female 9 Brown
## 2 0 1.63 conservative 0 No 20 6 or 7 hours female 10 Brown
## 3 0 1.63 socialist 0 No 19 4 or 5 hours female 9 Brown
Remember the : operator we saw in the first lab? This operator created a sequence of numbers, in this case
1 2 3. R will interpret that we want rows 1, 2 and 3.
Similarly, we can extract several columns:
# Extract the first 3 rows of the first 4 columns of the classData data frame
classData[1:3,1:4]

## cigs height views alc


## 1 0 1.20 moderate 0
## 2 0 1.63 conservative 0
## 3 0 1.63 socialist 0
We extracted only the first 3 rows to prevent R from filling the console with data, although you are welcome
to try extracting the full first 4 columns and see what happens. . .
So, numbers inside the square brackets to the left of the comma refer to rows and numbers to the right of the
comma refer to columns.
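The [rows, columns] rule can be practised on a tiny made-up data frame where every value is easy to find by eye. The data frame toy and its columns are hypothetical and only used for illustration.

```r
# A small hypothetical data frame (not the class data)
toy <- data.frame(id    = 1:4,
                  score = c(7.5, 9.0, 8.0, 6.5),
                  group = c("a", "b", "a", "b"))

toy[2, ]           # row 2, all columns
toy[, 2]           # all rows, column 2 (the scores)
toy[2:3, 1:2]      # rows 2 and 3 of columns 1 and 2
toy[3, 2]          # the single value in row 3, column 2: 8
```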
If we don’t want a complete series from 1 to some number, we may pass a vector with the indices we want.
We construct vectors with the function c() (for “combine” or “concatenate”). Inside the brackets we pass
the values we want, separated by commas. For example, to see how this works:
# Create a vector with the numbers 1, 4 and 6.
c(1,4,6)

## [1] 1 4 6
# Extract the first 3 rows of columns 1, 4 and 6 of the classData data frame
classData[1:3, c(1,4,6)]

## cigs alc age


## 1 0 0 19
## 2 0 0 20
## 3 0 0 19
You can also use the column headings (names) to extract columns. Extract the variable cigs using:
# Extract the variable "cigs" of the classData data frame
classData[, "cigs"]

## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0
## [38] 0 0 0 0 0 1 0 0 0 0 7 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0

## [75] 0 0 0 0 0 0 0 0 0 0 7 1 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [112] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0
## [149] 8 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [186] 0 0 0 0 0 0 0 0 0 0 0
or some observations of the variables cigs and tattoo:
# Extract observations 1, 4 and 10 of the variables "cigs" and "tattoo"
classData[c(1,4,10), c("cigs", "tattoo")]

## cigs tattoo
## 1 0 No
## 4 0 No
## 10 0 No
Note the use of the c() function and also the quotation marks. These are really important: they help R
distinguish between objects and character strings (words). Don’t worry about it too much for now; you will
have time to practice. Just be aware that using quotation marks or not does make a difference.
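A short sketch of the difference, using a made-up data frame. The names toy, result and wanted are hypothetical, and the sketch assumes that no object called cigs exists in your workspace.

```r
# A small hypothetical data frame (not the class data)
toy <- data.frame(cigs = c(0, 2, 1), tattoo = c("No", "Yes", "No"))

toy[, "cigs"]            # with quotes: the column whose name is the string "cigs"

# Without quotes, R looks for an object called cigs; none exists here, so this fails
result <- tryCatch(toy[, cigs], error = function(e) "error: object not found")
result

# An object that holds the name works without quotes, because R looks up its value
wanted <- "cigs"
toy[, wanted]
```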

SUBMISSION:
Vula Question 3. Choose the best answer:
When do we use square brackets [], and when do we use round ones ()?
1. Square brackets are used to subset objects, while round brackets are used to pass arguments to functions.
2. Square brackets are used when referring to objects, while round brackets are used when referring to variable
names.
3. It doesn’t make a difference; we can use square brackets and round brackets interchangeably.
Vula Question 4. Fill in the blanks:
To display the first 5 observations of the first and third variable of my data, I would use the code
classData[---, ---].
Vula Question 5. True or false:
Using the class survey data set, the code classData[1:6, 1:10] will produce the same output as the code
head(classData).

The commands you learned today


These are the functions and operators that you learned today. Fill in your own description of what they do.
dim()
sqrt()
levels()
table()
barplot()
[,]
c()
