
Introduction to R: Data Structures and Operations

Precursor
This lab introduces some of the basic concepts of data structures in R. This is important because you will
often be accessing and manipulating a dataset, either as an aid to visualizing the data, or as a result of
running a model. Note that this document is an R Notebook - code chunks are in the grey boxes, and can be
run by selecting the green icon on the right side of the box. Any output from an R command that is run will
be displayed below the code.

Vectors, Matrices and DataFrames

Vectors

The most basic data structure in R is the vector. You can think of this as a list of values. Note that in R the
assignment operator is <-, although you can use = instead. The = operator is normally used for
setting parameters in a function call. The code below creates a vector with 4 elements. The c(...) function is the
concatenate function.
v <- c(1,3,8,4)
v

## [1] 1 3 8 4
class(v)

## [1] "numeric"
vs <- c(1,3,8,"Hello")

Note that since all of the elements in the vector are numbers, the class type of v is numeric. Examine the
value and class type of the variable vs in the R console. Why do you think it is different from v? You can
also find out the length of any object in R using the length(..) function.
v <- c(1,2,4,6)
v2 <- 1:10
length(v)

## [1] 4
length(v2)

## [1] 10
Note that in the above example the : operator is used to create a vector of values ordered from 1 to 10. We
can use this operator to select out a subset of rows, columns or items. For example, if we only wanted the
first 3 values from the vector v we could write v[1:3]. Try this in the console.
Exercise 1. How would you extract the last 2 elements from v?
There are many other ways to extract elements from a vector (or any other data structure, as we will see).
For example, you can remove elements by specifying a minus sign for an index, or a group of indexes. Try
the following examples and check that you understand what is happening.
v[-3]

## [1] 1 2 6
v[-(1:3)]

## [1] 6
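Two further ways of indexing are worth trying alongside ranges and negative indexes. The sketch below reuses the vector v from above; the names picked and large are only illustrative.

```r
v <- c(1, 2, 4, 6)

# Index with a vector of positions: first and fourth elements
picked <- v[c(1, 4)]
picked
## [1] 1 6

# Index with a logical condition: keep the values greater than 2
large <- v[v > 2]
large
## [1] 4 6
```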
R provides many useful functions for creating vectors, tables and matrices. For example, the seq(.) function
generates sequences of values. R has an extensive help system for all functions, although be aware that
sometimes it is more confusing than you may have hoped! To get the help for a function (such as seq), just
type help(seq) or ?seq.
It is worth becoming familiar with the structure of R help. At the top of the page is a description of the
function, followed by a usage section showing how the function is called. If a function takes arguments, their
default values are often given there. The most useful section is often found at the bottom of the
help file, where some worked examples are given. You can often just copy and paste
these into R to see how they work. Go to the bottom of the help for seq and try out the examples.
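As a starting point before reading the help page, here is a minimal sketch of two common ways to call seq. The by and length.out arguments are part of seq itself; the variable names odds and fifths are only illustrative.

```r
# Step through a range in increments of 2
odds <- seq(1, 9, by = 2)
odds
## [1] 1 3 5 7 9

# Ask for a fixed number of evenly spaced values instead
fifths <- seq(0, 1, length.out = 5)
fifths
## [1] 0.00 0.25 0.50 0.75 1.00
```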
Exercise 2. Find the sum and the mean of the first 20 integers (i.e. 1, 2, ..., 20). Hint: Use the seq( ), sum( )
and mean( ) functions. Note: You can also create the first 20 numbers by using 1:20.

Matrices

Matrices in R are just like tables in Excel, except that a matrix holds just a single type of data (we will see
later that a data frame allows mixed types). Matrices have a number of rows and columns. A matrix can be
created by using some data and specifying the number of rows/columns that we want to create. Note that
R will try to create the matrix given just a row or column specification, and will warn you if there is a
problem. You can create an empty matrix (which will have the values NA, indicating no value), or fill one
with a vector of values. Creating a matrix also specifies how the data is to be filled in the table - either by
row or by column. Make sure you understand the difference in the examples below.
m.novals <- matrix(nrow=2,ncol=3)
m.novals

##      [,1] [,2] [,3]
## [1,]   NA   NA   NA
## [2,]   NA   NA   NA
mv <- matrix(v,nrow=2,byrow=FALSE)
mv

##      [,1] [,2]
## [1,]    1    4
## [2,]    2    6
mv1 <- matrix(v,nrow=2,byrow=TRUE)
mv1

##      [,1] [,2]
## [1,]    1    2
## [2,]    4    6
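To see what happens when only one dimension is given, and when the data does not fit, you could try the sketch below. The names m6 and msg are only illustrative, and tryCatch is used here simply to capture the warning text instead of printing it.

```r
# Only ncol is given, so R works out that 3 rows are needed for 6 values
m6 <- matrix(1:6, ncol = 2)
dim(m6)
## [1] 3 2

# 5 values cannot fill 2 rows evenly; capture the warning message to inspect it
msg <- tryCatch(matrix(1:5, nrow = 2),
                warning = function(w) conditionMessage(w))
is.character(msg)   # TRUE when a warning was raised
## [1] TRUE
```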
The output shown above is worth examining in detail. It shows you that the variable mv has 2 rows, 2
columns, and that they are indexed as [row,column], starting at 1 (just like a vector). The number of rows
and columns for a matrix can be found using the functions nrow(x) and ncol(x).
Exercise 3. Check that the variable m.novals has 2 rows and 3 columns.
Just like vectors, we can select out elements from a matrix; however, now we can also select out an entire row,
column or a single element. Note that leaving the row or column entry blank when selecting from a matrix
means you want all of it. For example:

print("Here is mv:")

## [1] "Here is mv:"


mv

##      [,1] [,2]
## [1,]    1    4
## [2,]    2    6
print("and here are selected rows, columns and elements:")

## [1] "and here are selected rows, columns and elements:"


mv[1,] # select the first row

## [1] 1 4
mv[,2] # select the second column

## [1] 4 6
mv[1,2] # get the second element from row 1

## [1] 4
mv[-1,] # remove the first row

## [1] 2 6
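Notice that the results above print as plain vectors and not as matrices. As a small sketch, the drop argument of the [ operator can be used when you need the result to stay a matrix; mv is rebuilt here so the example is self-contained, and the names row1 and row1m are only illustrative.

```r
mv <- matrix(c(1, 2, 4, 6), nrow = 2)

# Selecting one row returns a plain vector, so the matrix shape is lost
row1 <- mv[1, ]
is.matrix(row1)
## [1] FALSE

# drop = FALSE keeps the result as a 1 x 2 matrix
row1m <- mv[1, , drop = FALSE]
dim(row1m)
## [1] 1 2
```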
Exercise 3. Construct a 20 row, 5 column matrix with the numbers from 1 to 100 with byrow=FALSE. What is
the sum of the 12th row? (Answer = 260). What is the sum of the 4th column? (Answer = 1410). Check
that the matrix you have created is the class “matrix”.
Exercise 4. Take the matrix you created in Exercise 3 and make a new matrix from this by extracting the 15th
to 20th rows, and the 2nd and 3rd column. What is the sum of the contents of this new matrix? (Answer =
570).

Data Frames

A data frame is a table of data, just like a matrix; however, it can have columns with different types of data.
In addition, normally a data frame has labels for the rows and columns. We will find that this is often useful
for producing plots of data and labelling the values. In addition, when we load in a dataset from a file, it will
normally be created as a data frame. NOTE: some operations require a data frame to be converted to a
matrix, which is simply done using the function as.matrix(x). R has many datasets already available for
use (type data( ) to see a list), so let’s access one of these as an example. The iris dataset is a very
simple dataset with 4 measures of an iris plant (sepal and petal length and width) and the species type.
data(iris) # Ensure data is loaded
head(iris) # Look at the first few rows of the table

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

pairs(iris) # Plot variables against each other

[Figure: scatterplot matrix produced by pairs(iris), with panels for Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species]

Examine the plot from pairs(iris) - note that the species are just given the values 1,2 and 3, although the
species is a named factor. Check this by typing summary(iris). However, we can use the species name as a
label for a plot as follows. The example below plots the Species versus Petal.Width. Note that since this is
a data frame we can refer to the columns by name using the $ notation.
plot( iris$Species, iris$Petal.Width,xlab="Species",
ylab="Petal Width")

[Figure: box plots of Petal Width (y axis) for each Species (x axis): setosa, versicolor, virginica]

The row and column names for a dataframe can be accessed using the commands rownames(x) and
colnames(x).
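The sketch below pulls these ideas together on a small made-up data frame (pets and its values are invented purely for illustration). It shows the row and column names, the $ notation, and what as.matrix(x) does when the columns have different types.

```r
# A small made-up data frame with one text column and one numeric column
pets <- data.frame(animal = c("dog", "cat", "mouse"),
                   weight = c(20, 4, 0.02),
                   row.names = c("a", "b", "c"))

colnames(pets)
## [1] "animal" "weight"
rownames(pets)
## [1] "a" "b" "c"

# Each column keeps its own type
class(pets$weight)
## [1] "numeric"

# as.matrix() must settle on one type, so everything becomes text here
class(as.matrix(pets)[1, 2])
## [1] "character"
```

If you only want the numeric columns as a matrix, select those columns first and then convert.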
Exercise 5. Check that the iris data column names are Sepal.Length, ..., Species. Load the data for USArrests
and check that the row names correspond to the states of the US.

Selecting data items based on value

How do we select values from a vector, matrix or data frame based on their values? This is important since
we might want to create a subset of some data based on certain values of that data. The basic command is
called which(.), and this is used to select out the index in a data structure of values that satisfy a certain
condition. Although there are many ways to use this command, I’ll just show you the basic idea as a starting
point. The following example creates a vector of values from 50 to 60, and then finds out which elements
are greater than 55. Note that the result for the which(..) command is the index into the vector, not the
values of the vector. We can use indexes resulting from which(...) to create a new vector with just those
selected items.
vals <- 50:60
bigvals <- which(vals > 55) #Index of values > 55
bigvals

## [1] 7 8 9 10 11
newvals <- vals[bigvals]
newvals

## [1] 56 57 58 59 60
An often more useful command is subset(..), which creates the new table of data in a single command. For
example, the previous newvals could be created: newvals <- subset(vals,vals > 55). Try this out in
the console to check it works. The subset(..) command is particularly useful with dataframes and named
columns.
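As a hedged illustration of subset(..) on a data frame, the sketch below uses the built-in iris data. The threshold of 2 and the name wide are arbitrary choices for the example; select is an argument of subset that chooses which columns to keep.

```r
# Keep only the rows with wide petals, and only two of the columns.
# Inside subset() the column names can be used directly, without iris$.
wide <- subset(iris, Petal.Width > 2, select = c(Species, Petal.Width))
head(wide, n = 3)

# Every remaining row satisfies the condition
all(wide$Petal.Width > 2)
## [1] TRUE
```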

Logical Operators (Boolean decisions for selection)

There are logical operators to combine conditions. For example, & is the AND condition, | is the OR condition.
So, which(vals > 55 & vals < 59) would return just the index of those values that are 56,57 and 58. The
example below shows how to create a new table of data from the dataset USArrests where Murder > 10
AND UrbanPop < 60:
# Example using which
USrows <- which(USArrests$Murder > 10 & USArrests$UrbanPop < 60)
badstates <- USArrests[USrows,] # All columns, selected rows
badstates

##                Murder Assault UrbanPop Rape
## Alabama          13.2     236       58 21.2
## Mississippi      16.1     259       44 17.1
## North Carolina   13.0     337       45 16.1
## South Carolina   14.4     279       48 22.5
## Tennessee        13.2     188       59 26.9
# This can also be done using subset:
badstates <- subset(USArrests,
USArrests$Murder > 10 &
USArrests$UrbanPop < 60)
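To compare the two operators side by side, here is a short sketch on the vals vector used earlier. It also shows that subset(..) lets you name the columns directly, without the USArrests$ prefix; badstates2 is just an illustrative name.

```r
vals <- 50:60

# AND: both conditions must hold
vals[vals > 55 & vals < 59]
## [1] 56 57 58

# OR: either condition may hold
vals[vals < 52 | vals > 58]
## [1] 50 51 59 60

# Inside subset() the column names can be written without USArrests$
badstates2 <- subset(USArrests, Murder > 10 & UrbanPop < 60)
nrow(badstates2)
## [1] 5
```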

Exercise 6. Using the USArrests data, select out those states (rows) where the murder rate is greater
than twice the mean murder rate. Create a new dataset with those high murder states, and write to the screen
the names of the states (these are the row names). Answer: [1] “Georgia” “Mississippi”

Adding rows and columns to a table

Often when working with a table of data you want to create a new column or row and add some values to it.
The main commands to do this are cbind(..) and rbind(..), which bind columns and rows to a
table.
For example, let’s assume that we wanted to create a new column for the USArrests dataset, which is the
difference from the mean of the murder rate for each state. We need to calculate this value for each state,
and then add the set of values to the USArrests data as a new column. Note cbind takes any number of
tables or vectors and joins them together, but they need to have the same number of rows (for cbind) or
columns (for rbind).
arrests <- USArrests # Make a copy of the table
diffmeanmurder <- USArrests$Murder - mean(USArrests$Murder)
arrests <- cbind(arrests, diffmeanmurder)
head(arrests,n=3) # just show first 3

##         Murder Assault UrbanPop Rape diffmeanmurder
## Alabama   13.2     236       58 21.2          5.412
## Alaska    10.0     263       48 44.5          2.212
## Arizona    8.1     294       80 31.0          0.312
# For dataframes you can also create the column automatically:
arrests <- USArrests # Make a copy of the table
arrests$diffmeanmurder <- USArrests$Murder - mean(USArrests$Murder)
# Result not shown
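The examples above only use cbind. As a sketch of the row-wise equivalent, the code below adds one row to USArrests with rbind. The row called Exampleland and all of its values are made up purely for illustration; what matters is that its column names match those of the table.

```r
# A made-up row for illustration only; its column names match USArrests
extra <- data.frame(Murder = 5.0, Assault = 100, UrbanPop = 50, Rape = 10.0,
                    row.names = "Exampleland")

more <- rbind(USArrests, extra)
nrow(USArrests)
## [1] 50
nrow(more)
## [1] 51
tail(rownames(more), n = 1)
## [1] "Exampleland"
```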

Finally, as an example of sorting some data and plotting it out, here is some code that orders murder rate
from smallest to largest, and plots out the murder rate, with the x axis labelled by state name. Check that
you understand what the order(..) function is doing (see help(order) and help(sort)).
myusa <- USArrests[order(USArrests$Murder), ] # Reorder rows by murder rate
plot(myusa$Murder, axes=FALSE, xlab="", ylab="Murder Rate")
par(las = 3) # Force writing the x labels vertically
axis(1, at = 1:nrow(myusa),
     labels = rownames(myusa)) # Draw axis and state names
axis(2, at = 1:20) # and draw the second axis
abline(h=mean(myusa$Murder), col='red', lty=2) # and draw the mean murder rate
[Figure: murder rate (y axis) for each state, ordered from lowest (North Dakota) to highest (Georgia), with state names written vertically along the x axis and a dashed red line at the mean murder rate]

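If the difference between order(..) and sort(..) is not yet clear, a tiny sketch on a three-element vector may help. The vector x is made up for the example.

```r
x <- c(30, 10, 20)

# sort() returns the values themselves in increasing order
sort(x)
## [1] 10 20 30

# order() returns the positions that would sort the vector
order(x)
## [1] 2 3 1

# Indexing with those positions gives the sorted values, and the same
# positions can reorder the rows of a whole table
x[order(x)]
## [1] 10 20 30
```

This is why the plotting code indexes the rows of USArrests with order(..): the whole row moves with its murder rate, so the state names stay attached to the right values.
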
Sampling, Random Numbers and Loops

Random Numbers

Simulations of some process are often used to aid decision making. An important component of simulation is
to generate random numbers from known distributions to simulate actual events. Fortunately, R makes the
process of generating random numbers easy, through a number of built-in functions, such as:
• runif(N, min=0, max=1) - generates N random numbers from a uniform distribution in the range
[min,max).
• rnorm(N, mean=0, sd=1) - generates N random numbers from a Gaussian distribution with a given
mean and standard deviation
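Because these functions return different values on every call, it can help to fix the random seed while you are testing. The sketch below uses set.seed, a base R function not otherwise covered in this lab; the seed value 42 and the distribution parameters are arbitrary choices for illustration.

```r
set.seed(42)  # any fixed integer makes the draws repeatable

u <- runif(5, min = 10, max = 20)      # 5 uniform values between 10 and 20
g <- rnorm(5, mean = 100, sd = 15)     # 5 Gaussian values centred on 100

# Resetting the seed reproduces exactly the same uniform draws
set.seed(42)
u_again <- runif(5, min = 10, max = 20)
identical(u, u_again)
## [1] TRUE
```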
Exercise 7. Create 1000 uniformly sampled random numbers between 0 and 1 and check that the mean of
these values is approximately 0.5. Plot a histogram of the random numbers that you created (HINT: hist(..)
). Now create 1000 random numbers using rnorm with mean=0, sd=1, and plot the histogram. Check that
you understand why these are different and what they represent.

Sampling from a list

The sample(..) function allows repeated random samplings from a vector, either with or without replacement.
Try the following, and run the sample command several times to check that you get a different combination
of 2 animal names:
animals <- c( "dog", "cat", "mouse", "rabbit") # make a vector with 4 strings
sample(animals, 2, replace=FALSE) # pick 2 randomly, no repeats

## [1] "mouse" "rabbit"
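The replace argument matters most when you ask for more items than the vector holds. As a sketch, the code below simulates ten rolls of a six-sided die; the die, the seed and the names rolls and res are all illustrative, and tryCatch is used only to show that the second call fails.

```r
set.seed(1)  # fixed seed so the draws can be repeated

# Ten rolls of a six-sided die: more draws than values, so replace = TRUE
rolls <- sample(1:6, 10, replace = TRUE)
length(rolls)
## [1] 10

# Without replacement, asking for more items than exist is an error
res <- tryCatch(sample(1:6, 10, replace = FALSE),
                error = function(e) "error")
res
## [1] "error"
```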

Exercise 8. Change the sample command above to pick 3 animal names and use replace = TRUE. Repeat the
sample call until you get two (or more!) names that are the same.

Repeating an operation: The for loop

R provides looping mechanisms that will be familiar to anybody that has programmed before. In INFO 304,
you will only need to make use of the for(..) style of looping, which will loop over a given set of items and
perform an operation for each item. The for loop can use numbers, or pick an element one at a time from a
vector/list. The first example just loops over the numbers from 1 to 5, while the second prints out the animal
names from the previous example.
for (num in 1:5) print (num)

## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
for (name in animals) { cat(name); cat(' ') }

## dog cat mouse rabbit
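Loops are often used to build up a result and not just to print. A minimal sketch, with illustrative names squares and total, stores one value per pass in a pre-allocated vector and then adds the values up:

```r
squares <- numeric(5)          # pre-allocate a vector of five zeros
for (i in 1:5) {
  squares[i] <- i^2            # fill position i on each pass
}
squares
## [1]  1  4  9 16 25

total <- 0
for (s in squares) total <- total + s   # running sum over the elements
total
## [1] 55
```

In practice sum(squares) gives the same total in one call; the loop form is shown only to illustrate the mechanism.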


Exercise 9. The timetrial.repeated.dat dataset
Finally, for some practice, load the timetrial.repeated.dat file and visualize/check the data. Note that this is
a table (not a comma-separated file), so to load the table use the command:
timetrial <- read.table('timetrial.repeated.dat',header=TRUE)
What class of data is timetrial? What about the columns? Can you visualize the data? What does this tell
you about the dataset?
For additional help, see timetrial.pdf, which shows various visualisations/transformations and methods for this
data. If you want to repeat some of the work, please ignore the first comment regarding where to download the
data (since we have already given you the data). This set of examples was kindly provided by the University
of Minnesota.
