TRY R CHAPTER 2

Vectors

The name may sound intimidating, but a vector is simply a list of values. R relies on
vectors for many of its operations. This includes basic plots - we'll have you drawing
graphs by the end of this chapter (and it's a lot easier than you might think)!

2.

Vectors 2.1

A vector's values can be numbers, strings, logical values, or any other type, as long as
they're all the same type. Try creating a vector of numbers, like this:
> c(4, 7, 9)
[1] 4 7 9

The c function (c is short for Combine) creates a new vector by combining a list of
values.
3.
Now try creating a vector with strings:
> c('a', 'b', 'c')
[1] "a" "b" "c"

4.

Vectors cannot hold values with different modes (types). Try mixing modes and
see what happens:
> c(1, TRUE, "three")
[1] "1"     "TRUE"  "three"

All the values were converted to a single mode (characters) so that the vector can
hold them all.
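
To check which mode R settles on when values are mixed, here is a minimal sketch. The variable names are illustrative only and are not part of the course:

```r
# Mixing modes: R converts every value to the most general mode present.
nums_and_logicals <- c(1, TRUE)     # TRUE is converted to the number 1
mixed <- c(1, TRUE, "three")        # everything is converted to a string

class(nums_and_logicals)            # "numeric"
class(mixed)                        # "character"
```

Calling class on a vector is a quick way to confirm that a conversion happened before you rely on the values.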

5.

Sequence Vectors 2.2

If you need a vector with a sequence of numbers, you can create it
with start:end notation. Let's make a vector with values from 5 through 9:

> 5:9
[1] 5 6 7 8 9

6.

A more versatile way to make sequences is to call the seq function. Let's do the
same thing with seq:
> seq(5, 9)
[1] 5 6 7 8 9

7.

seq also allows you to use increments other than 1. Try it with steps of 0.5:

> seq(5, 9, 0.5)
[1] 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0

8.

Now try making a vector with integers from 9 down to 5:


> 9:5
[1] 9 8 7 6 5
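
The seq calls above pass the increment by position. As a small sketch, the same sequences can be written with named arguments, which may read more clearly. The variable names here are our own:

```r
# Named arguments make the intent of each seq() call explicit.
halves <- seq(5, 9, by = 0.5)             # same as seq(5, 9, 0.5)
down <- seq(9, 5, by = -1)                # counts down, same values as 9:5
five_points <- seq(0, 1, length.out = 5)  # ask for a count instead of a step

print(halves)
print(down)
print(five_points)
```

The length.out argument is handy when you know how many values you want rather than the distance between them.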

9.

Vector Access 2.3

We're going to create a vector with some strings in it for you, and store it in
the sentence variable.
You can retrieve an individual value within a vector by providing its numeric index in
square brackets. Try getting the third value:
> sentence <- c('walk', 'the', 'plank')
> sentence[3]
[1] "plank"

10.
Many languages start array indices at 0, but R's vector indices start at 1. Get
the first value by typing:
> sentence[1]
[1] "walk"

11.
You can assign new values within an existing vector. Try changing the third word
to "dog":
> sentence[3] <- "dog"

12.
If you add new values onto the end, the vector will grow to accommodate them.
Let's add a fourth word:
> sentence[4] <- 'to'

13.
You can use a vector within the square brackets to access multiple values. Try
getting the first and third words:
> sentence[c(1, 3)]
[1] "walk" "dog"

14.
This means you can retrieve ranges of values. Get the second through fourth
words:
> sentence[2:4]
[1] "the" "dog" "to"

15.
You can also set ranges of values; just provide the values in a vector. Add words
5 through 7:
> sentence[5:7] <- c('the', 'poop', 'deck')

16.
Now try accessing the sixth word of the sentence vector:
> sentence[6]
[1] "poop"
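
One detail worth knowing when a vector grows: if you assign past the end and skip some positions, R fills the gap with NA, its marker for a missing value. A small sketch, using a hypothetical crew vector:

```r
# Assigning beyond the end of a vector grows it; skipped positions become NA.
crew <- c('walk', 'the', 'plank')
crew[6] <- 'deck'     # positions 4 and 5 were never assigned

print(crew)
length(crew)          # 6
```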

17.

Vector Names 2.4

For this challenge, we'll make a 3-item vector for you, and store it in the ranks variable.
You can assign names to a vector's elements by passing a second vector filled with
names to the names assignment function, like this:
> ranks <- 1:3
> names(ranks) <- c("first", "second", "third")

18.
Assigning names for a vector can act as useful labels for the data. Below, you
can see what our vector looks like now.
You can also use the names to access the vector's values. Try getting the value for
the "first" rank:
> ranks
 first second  third
     1      2      3
> ranks["first"]
first
    1

19.
Now see if you can set the value for the "third" rank to something other than 3
using the name rather than the position.
> ranks["third"] <- 2

20.

Plotting One Vector 2.5

The barplot function draws a bar chart with a vector's values. We'll make a new vector
for you, and store it in the vesselsSunk variable.
Now try passing the vector to the barplot function:
> vesselsSunk <- c(4, 5, 1)
> barplot(vesselsSunk)

[Plot: bar chart of vesselsSunk with three unlabeled bars; vertical axis from 0 to 5]

21.
If you assign names to the vector's values, R will use those names as labels on
the bar plot. Let's use the names assignment function again:
> names(vesselsSunk) <- c("England", "France", "Norway")

22.
Now, if you call barplot with the vector again, you'll see the labels:
> barplot(vesselsSunk)

[Plot: bar chart with bars labeled England, France, and Norway; vertical axis from 0 to 5]

23.
Now, try calling barplot on a vector of integers ranging from 1 through 100:

> barplot(1:100)

[Plot: bar chart of 100 steadily rising bars; vertical axis from 0 to 100]

24.

Vector Math 2.6

Most arithmetic operations work just as well on vectors as they do on single values.
We'll make another sample vector for you to work with, and store it in the a variable.
If you add a scalar (a single value) to a vector, the scalar will be added to each value
in the vector, returning a new vector with the results. Try adding 1 to each element in
our vector:
> a <- c(1, 2, 3)
> a + 1
[1] 2 3 4

25.
The same is true of division, multiplication, or any other basic arithmetic. Try
dividing our vector by 2:
> a / 2
[1] 0.5 1.0 1.5

26.

Now try multiplying our vector by 2:

> a * 2
[1] 2 4 6

27.
If you add two vectors, R will add the values at matching positions.
We'll make a second vector for you to experiment with, and store it in the b variable.
Try adding it to the a vector:
> b <- c(4, 5, 6)
> a + b
[1] 5 7 9

28.

Now try subtracting b from a:

> a - b
[1] -3 -3 -3

29.
You can also take two vectors and compare each item. See which values in
the a vector are equal to those in a second vector:
> a == c(1, 99, 3)
[1]  TRUE FALSE  TRUE

Notice that R didn't test whether the whole vectors were equal; it checked each value
in the a vector against the value at the same index in our new vector.
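
If you do want a single answer about whole vectors, base R offers all and identical. A brief sketch, where the other vector is just an illustrative name:

```r
a <- c(1, 2, 3)
other <- c(1, 99, 3)

elementwise <- a == other            # one TRUE or FALSE per position
every_match <- all(elementwise)      # TRUE only if every position matched
same_vector <- identical(a, c(1, 2, 3))

print(elementwise)
print(every_match)
print(same_vector)
```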
30.
Check if each value in the a vector is less than the corresponding value in
another vector:
> a < c(1, 99, 3)
[1] FALSE  TRUE FALSE

31.
Functions that normally work with scalars can operate on each element of a
vector, too. Try getting the sine of each value in our vector:
> sin(a)
[1] 0.8414710 0.9092974 0.1411200

32.
Now try getting the square roots with sqrt:
> sqrt(a)
[1] 1.000000 1.414214 1.732051

33.

Scatter Plots 2.7

The plot function takes two vectors, one for X values and one for Y values, and draws
a graph of them.
Let's draw a graph showing the relationship of numbers and their sines.
First, we'll need some sample data. We'll create a vector for you with some fractional
values between 0 and 20, and store it in the x variable.
Now, try creating a second vector with the sines of those values:
> x <- seq(1, 20, 0.1)
> y <- sin(x)

34.
Then simply call plot with your two vectors:
> plot(x, y)

Great job! Notice on the graph that values from the first argument (x) are used for the
horizontal axis, and values from the second (y) for the vertical.

[Plot: scatter plot of y against x; x axis from 5 to 20, y axis from -1.0 to 1.0]

35.
Your turn. We'll create a vector with some negative and positive values for you,
and store it in the values variable.

We'll also create a second vector with the absolute values of the first, and store it in
the absolutes variable.
Try plotting the vectors, with values on the horizontal axis, and absolutes on the
vertical axis.
> values <- -10:10
> absolutes <- abs(values)
> plot(values, absolutes)

[Plot: scatter plot of absolutes against values; values axis from -10 to 10, absolutes axis from 0 to 10]

36.

NA Values 2.8

Sometimes, when working with sample data, a given value isn't available. But it's not
a good idea to just throw those values out. R has a value that explicitly indicates a
sample was not available: NA. Many functions that work with vectors treat this value
specially.
We'll create a vector for you with a missing sample, and store it in the a variable.
Try to get the sum of its values, and see what the result is:
> a <- c(1, 3, NA, 7, 9)
> sum(a)
[1] NA

The sum is considered "not available" by default because one of the vector's values
was NA. This is the responsible thing to do; R won't just blithely add up the numbers
without warning you about the incomplete data. We can explicitly tell sum (and many
other functions) to remove NA values before they do their calculations, however.
37.
Remember that command to bring up help for a function? Bring up
documentation for the sum function:
> help(sum)
sum                    package:base                    R Documentation

Sum of Vector Elements

Description:

     'sum' returns the sum of all the values present in its arguments.

Usage:

     sum(..., na.rm = FALSE)
...

As you see in the documentation, sum can take an optional named argument, na.rm. It's
set to FALSE by default, but if you set it to TRUE, all NA values will be removed from
the vector before the calculation is performed.
38.
Try calling sum again, with na.rm set to TRUE:
> sum(a, na.rm = TRUE)
[1] 20
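
The same na.rm argument is accepted by other summary functions such as mean, and is.na lets you see where the gaps are. A short sketch along those lines, with variable names of our own choosing:

```r
a <- c(1, 3, NA, 7, 9)

missing_flags <- is.na(a)             # TRUE where a sample is missing
missing_count <- sum(missing_flags)   # TRUE counts as 1 when summed
clean_mean <- mean(a, na.rm = TRUE)   # average of the four available values

print(missing_flags)
print(missing_count)
print(clean_mean)
```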

39.

Chapter 2 Completed

You've traversed Chapter 2 and discovered another badge!


In this chapter, we've shown you all the basics of manipulating vectors - creating and
accessing them, doing math with them, and making sequences. We've shown you how
to make bar plots and scatter plots with vectors. And we've shown you how R treats
vectors where one or more values are not available.
The vector is just the first of several data structures that R offers. See you in the next
chapter, where we'll talk about the matrix.


TRY R CHAPTER 3

Matrices

So far we've only worked with vectors, which are simple lists of values. What if you
need data in rows and columns? Matrices are here to help.

A matrix is just a fancy term for a 2-dimensional array. In this chapter, we'll show you
all the basics of working with matrices, from creating them, to accessing them, to
plotting them.

2.

Matrices 3.1

Let's make a matrix 3 rows high by 4 columns wide, with all its fields set to 0.
> matrix(0, 3, 4)
     [,1] [,2] [,3] [,4]
[1,]    0    0    0    0
[2,]    0    0    0    0
[3,]    0    0    0    0

3.

You can also use a vector to initialize a matrix's value. To fill a 3x4 matrix, you'll
need a 12-item vector. We'll make that for you now:
> a <- 1:12

4.

If we print the value of a, we'll see the vector's values, all in a single row:
> print(a)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12

5.

Now call matrix with the vector, the number of rows, and the number of
columns:
> matrix(a, 3, 4)
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
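
Notice that the values run down each column before moving to the next. If you would rather fill across rows, the matrix function has a byrow argument. A quick sketch comparing the two, with illustrative variable names:

```r
a <- 1:12

by_col <- matrix(a, nrow = 3, ncol = 4)                # default: column by column
by_row <- matrix(a, nrow = 3, ncol = 4, byrow = TRUE)  # fill row by row instead

print(by_col[1, ])   # first row when filled by column
print(by_row[1, ])   # first row when filled by row
```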

6.

The vector's values are copied into the new matrix, one by one, filling each column from top to bottom. You can also reshape the vector itself into a matrix. We'll create a new 8-item vector for you:
> plank <- 1:8

7.

The dim assignment function sets dimensions for a matrix. It accepts a vector
with the number of rows and the number of columns to assign.
Assign new dimensions to plank by passing a vector specifying 2 rows and 4 columns
(c(2,4)):
> dim(plank) <- c(2, 4)

8.

If you print plank now, you'll see that the values have shifted to form 2 rows by
4 columns:
> print(plank)
     [,1] [,2] [,3] [,4]
[1,]    1    3    5    7
[2,]    2    4    6    8

9.

The vector is no longer one-dimensional. It has been converted, in place, to a
matrix.
Now, use the matrix function to make a 5x5 matrix, with its fields initialized to any
values you like.
> matrix(1:25, 5, 5)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    6   11   16   21
[2,]    2    7   12   17   22
[3,]    3    8   13   18   23
[4,]    4    9   14   19   24
[5,]    5   10   15   20   25

10.

Matrix Access 3.2

Getting values from matrices isn't that different from vectors; you just have to provide
two indices instead of one.
Let's take another look at our plank matrix:
> print(plank)
     [,1] [,2] [,3] [,4]
[1,]    1    3    5    7
[2,]    2    4    6    8

11.
Try getting the value from the second row in the third column of plank:
> plank[2, 3]
[1] 6

12.

Now, try getting the value from the first row of the fourth column:

> plank[1, 4]
[1] 7

13.
As with vectors, to set a single value, just assign to it. Set the previous value
to 0:
> plank[1, 4] <- 0

14.
You can get an entire row of the matrix by omitting the column index (but keep
the comma). Try retrieving the second row:
> plank[2, ]
[1] 2 4 6 8

15.

To get an entire column, omit the row index. Retrieve the fourth column:

> plank[, 4]
[1] 0 8

16.
You can read multiple rows or columns by providing a vector or sequence with
their indices. Try retrieving columns 2 through 4:
> plank[, 2:4]
     [,1] [,2] [,3]
[1,]    3    5    0
[2,]    4    6    8

17.

Matrix Plotting 3.3


Text output is only useful when matrices are small. When working with more complex
data, you'll need something better. Fortunately, R includes powerful visualizations for
matrix data.
We'll start simple, with an elevation map of a sandy beach.
It's pretty flat - everything is 1 meter above sea level. We'll create a 10 by 10 matrix
with all its values initialized to 1 for you:
> elevation <- matrix(1, 10, 10)

18.
Oh, wait, we forgot the spot where we dug down to sea level to retrieve a
treasure chest. At the fourth row, sixth column, set the elevation to 0:
> elevation[4, 6] <- 0

19.
You can now do a contour map of the values simply by passing the matrix to
the contour function:
> contour(elevation)

[Plot: contour map of elevation; both axes from 0.0 to 1.0]

20.
Or you can create a 3D perspective plot with the persp function:
> persp(elevation)

[Plot: 3D perspective plot of elevation; axes labeled elevation, Y, and Z]

21.
The perspective plot looks a little odd, though. This is because persp
automatically expands the view so that your highest value (the beach surface) is at
the very top.
We can fix that by specifying our own value for the expand parameter.
> persp(elevation, expand = 0.2)

[Plot: flatter 3D perspective plot of elevation; axes labeled elevation, Y, and Z]

22.
Okay, those examples are a little simplistic. Thankfully, R includes some sample
data sets to play around with. One of these is volcano, a 3D map of a dormant New
Zealand volcano.
It's simply an 87x61 matrix with elevation values, but it shows the power of R's matrix
visualizations.
Try creating a contour map of the volcano matrix:
> contour(volcano)

[Plot: contour map of volcano; both axes from 0.0 to 1.0]

23.
Try a perspective plot (limit the vertical expansion to one-fifth again):

> persp(volcano, expand = 0.2)

[Plot: 3D perspective plot of volcano; axes labeled volcano, Y, and Z]

24.
The image function will create a heat map:
> image(volcano)

[Plot: heat map of volcano; both axes from 0.0 to 1.0]

25.

Chapter 3 Completed

Here we stand on the beach, at the end of Chapter 3. What's this, buried in the sand?
It's another badge!
In this chapter, we learned how to create matrices from scratch, and how to re-shape
a vector into a matrix. We learned how to access values within a matrix one-by-one, or
in groups. And we saw just a few of the ways to visualize a matrix's data.
None of the techniques we've used so far will help you describe your data, though.
We'll rectify that in the next chapter, where we'll talk about summary statistics.

TRY R CHAPTER 4

Summary Statistics

Simply throwing a bunch of numbers at your audience will only confuse them. Part of
a statistician's job is to explain their data. In this chapter, we'll show you some of the
tools R offers to let you do so, with minimum fuss.

2.

Mean 4.1

Determining the health of the crew is an important part of any inventory of the ship.
Here's a vector containing the number of limbs each member has left, along with their
names.
> limbs <- c(4, 3, 4, 3, 2, 4, 4, 4)
> names(limbs) <- c('One-Eye', 'Peg-Leg', 'Smitty', 'Hook', 'Scooter', 'Dan', 'Mikey',
                    'Blackbeard')

A quick way to assess our battle-readiness would be to get the average of the crew's
appendage counts. Statisticians call this the "mean". Call the mean function with
the limbs vector.
> mean(limbs)
[1] 3.5

An average closer to 4 would be nice, but this will have to do.


3.

Here's a barplot of that vector:


> barplot(limbs)

[Plot: bar chart of limbs with bars labeled One-Eye, Peg-Leg, Smitty, Hook, Scooter, Dan, Mikey, and Blackbeard; vertical axis from 0 to 4]

4.

If we draw a line on the plot representing the mean, we can easily compare the
various values to the average. The abline function can take an h parameter with a
value at which to draw a horizontal line, or a v parameter for a vertical line. When it's
called, it updates the previous plot.
Draw a horizontal line across the plot at the mean:
> abline(h = mean(limbs))

[Plot: the same bar chart with a horizontal line drawn at the mean]

5.

Median 4.2

Let's say we gain a crew member that completely skews the mean.
> limbs <- c(4, 3, 4, 3, 2, 4, 4, 14)
> names(limbs) <- c('One-Eye', 'Peg-Leg', 'Smitty', 'Hook',
                    'Scooter', 'Dan', 'Mikey', 'Davy Jones')
> mean(limbs)
[1] 4.75

Let's see how this new mean shows up on our same graph.
> barplot(limbs)
> abline(h = mean(limbs))

It may be factually accurate to say that our crew has an average of 4.75 limbs, but it's
probably also misleading.

[Plot: bar chart of limbs with bars labeled One-Eye through Davy Jones; vertical axis from 0 to 14; horizontal line at the mean]

6.

For situations like this, it's probably more useful to talk about the "median"
value. The median is calculated by sorting the values and choosing the middle one. (For
sets with an even number of values, like our eight crew members, the middle two
values are averaged.)
Call the median function on the vector:
> median(limbs)
[1] 4
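
To see where that 4 comes from, here is a rough by-hand version of the calculation. It is only a sketch for a vector with an even number of values, and the helper variable names are ours:

```r
limbs <- c(4, 3, 4, 3, 2, 4, 4, 14)

sorted <- sort(limbs)                       # 2 3 3 4 4 4 4 14
n <- length(sorted)                         # 8 values, so there is no single middle
middle_two <- sorted[c(n / 2, n / 2 + 1)]   # the 4th and 5th values
manual_median <- mean(middle_two)

print(manual_median)
print(manual_median == median(limbs))
```

The outlier of 14 sits at the far end of the sorted values, so it never reaches the middle. That is why the median is less sensitive to it than the mean.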

7.

That's more like it. Let's show the median on the plot. Draw a horizontal line
across the plot at the median.

> abline(h = median(limbs))

[Plot: the same bar chart with a second horizontal line drawn at the median]

8.

Standard Deviation 4.3

Some of the plunder from our recent raids has been worth less than what we're used
to. Here's a vector with the values of our latest hauls:
> pounds <- c(45000, 50000, 35000, 40000, 35000, 45000, 10000, 15000)
> barplot(pounds)
> meanValue <- mean(pounds)

Let's see a plot showing the mean value:


> abline(h = meanValue)

These results seem way below normal. The crew wants to make Smitty, who picked
the last couple of ships to waylay, walk the plank. But as he dangles over the water, wily
Smitty raises a question: what, exactly, is a "normal" haul?

[Plot: bar chart of pounds with a horizontal line at the mean; vertical axis from 0 to 50000]

9.

Statisticians use the concept of "standard deviation" from the mean to describe
the range of typical values for a data set. For a group of numbers, it shows how much
they typically vary from the average value. To calculate the standard deviation, you
calculate the mean of the values, then subtract the mean from each number and
square the result, then average those squares (R's sd function divides their sum by
one less than the number of values), and take the square root of that average.
If that sounds like a lot of work, don't worry. You're using R, and all you have to do is
pass a vector to the sd function. Try calling sd on the pounds vector now, and assign
the result to the deviation variable:
> deviation <- sd(pounds)

10.
We'll add a line on the plot to show one standard deviation above the mean (the
top of the normal range)...
> abline(h = meanValue + deviation)

Hail to the sailor that brought us that 50,000-pound payday!

[Plot: bar chart of pounds with lines at the mean and at one standard deviation above it]
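
If you're curious, the recipe for standard deviation described earlier can be checked by hand. This sketch assumes the sample form of the calculation, which divides by one less than the number of values, as sd does. The intermediate variable names are ours:

```r
pounds <- c(45000, 50000, 35000, 40000, 35000, 45000, 10000, 15000)

# Subtract the mean from each value and square the result.
squared_diffs <- (pounds - mean(pounds))^2

# Sum the squares, divide by n - 1, and take the square root.
manual_sd <- sqrt(sum(squared_diffs) / (length(pounds) - 1))

print(manual_sd)
print(all.equal(manual_sd, sd(pounds)))
```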

11.
Now try adding a line on the plot to show one standard deviation below the
mean (the bottom of the normal range):
> abline(h = meanValue - deviation)

We're risking being hanged by the Spanish for this? Sorry, Smitty, you're shark bait.

[Plot: bar chart of pounds with lines at the mean and at one standard deviation above and below it]

12.

Chapter 4 Completed

Land ho! You've navigated Chapter 4. And what awaits us on the shore? It's another
badge!
Summary statistics let you show how your data points are distributed, without the
need to look closely at each one. We've shown you the functions for mean, median,
and standard deviation, as well as ways to display them on your graphs.

TRY R CHAPTER 5

Factors

Often your data needs to be grouped by category: blood pressure by age range,
accidents by auto manufacturer, and so forth. R has a special collection type called
a factor to track these categorized values.

2.

Creating Factors 5.1

It's time to take inventory of the ship's hold. We'll make a vector for you with the type
of booty in each chest.
To categorize the values, simply pass the vector to the factor function:
> chests <- c('gold', 'silver', 'gems', 'gold', 'gems')
> types <- factor(chests)

3.

There are a couple of differences between the original vector and the new factor
that are worth noting. Print the chests vector:
> print(chests)
[1] "gold"   "silver" "gems"   "gold"   "gems"

4.

You see the raw list of strings, repeated values and all. Now print
the types factor:
> print(types)
[1] gold   silver gems   gold   gems
Levels: gems gold silver

Printed at the bottom, you'll see the factor's "levels" - groups of unique values. Notice
also that there are no quotes around the values. That's because they're not strings;
they're actually integer references to one of the factor's levels.
5.

Let's take a look at the underlying integers. Pass the factor to
the as.integer function:
> as.integer(types)
[1] 2 3 1 2 1

6.

You can get only the factor levels with the levels function:
> levels(types)
[1] "gems"   "gold"   "silver"
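
To tie the integers and levels together, here is a small sketch that counts the chests per level with table and rebuilds the original strings from the integer references. The counts and rebuilt names are our own:

```r
chests <- c('gold', 'silver', 'gems', 'gold', 'gems')
types <- factor(chests)

counts <- table(types)                        # how many chests fall in each level
rebuilt <- levels(types)[as.integer(types)]   # look each integer up in the levels

print(counts)
print(nlevels(types))
print(rebuilt)
```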

7.

Plots With Factors 5.2

You can use a factor to separate plots into categories. Let's graph our five chests by
weight and value, and show their type as well. We'll create two vectors for
you; weights will contain the weight of each chest, and prices will track how much the
chests are worth.
Now, try calling plot to graph the chests by weight and value.
> weights <- c(300, 200, 100, 250, 150)
> prices <- c(9000, 5000, 12000, 7500, 18000)
> plot(weights, prices)

8.

[Plot: scatter plot of prices against weights; weights axis from 100 to 300, prices axis from 6000 to 18000]

We can't tell which chest is which, though. Fortunately, we can use different plot
characters for each type by converting the factor to integers, and passing it to
the pch argument of plot.
> plot(weights, prices, pch = as.integer(types))

"Circle", "Triangle", and "Plus Sign" still aren't great descriptions for treasure, though.
Let's add a legend to show what the symbols mean.
[Plot: the same scatter plot, with a different plot character for each chest type]

9.

The legend function takes a location to draw in, a vector with label names, and a
vector with numeric plot character IDs.
> legend("topright", c("gems", "gold", "silver"), pch = 1:3)

Next time the boat's taking on water, it would be wise to dump the silver and keep the
gems!
[Plot: the same scatter plot with a legend listing gems, gold, and silver]

10.
If you hard-code the labels and plot characters, you'll have to update them
every time you change the plot factor. Instead, it's better to derive them by using
the levels function on your factor:
> legend("topright", levels(types), pch = 1:length(levels(types)))

[Plot: the same scatter plot with a legend listing gems, gold, and silver]

11.

Chapter 5 Completed

A long inland march has brought us to the end of Chapter 5. We've stumbled across
another badge!
Factors help you divide your data into groups. In this chapter, we've shown you how to
create them, and how to use them to make plots more readable.


TRY R CHAPTER 6

Data Frames

The weights, prices, and types data structures are all deeply tied together, if you think
about it. If you add a new weight sample, you need to remember to add a new price
and type, or risk everything falling out of sync. To avoid trouble, it would be nice if we
could tie all these variables together in a single data structure.
Fortunately, R has a structure for just this purpose: the data frame. You can think of a
data frame as something akin to a database table or an Excel spreadsheet. It has a
specific number of columns, each of which is expected to contain values of a
particular type. It also has an indeterminate number of rows - sets of related values
for each column.

2.

Data Frames 6.1

Our vectors with treasure chest data are perfect candidates for conversion to a data
frame. And it's easy to do. Call the data.frame function, and pass weights, prices,
and types as the arguments. Assign the result to the treasure variable:
> treasure <- data.frame(weights, prices, types)

3.

Now, try printing treasure to see its contents:


> print(treasure)
  weights prices  types
1     300   9000   gold
2     200   5000 silver
3     100  12000   gems
4     250   7500   gold
5     150  18000   gems

There's your new data frame, neatly organized into rows, with column names (derived
from the variable names) across the top.

4.

Data Frame Access 6.2

Just like matrices, it's easy to access individual portions of a data frame.

You can get individual columns by providing their index number in double-brackets.
Try getting the second column (prices) of treasure:
> treasure[[2]]
[1]  9000  5000 12000  7500 18000

5.

You could instead provide a column name as a string in double-brackets. (This is
often more readable.) Retrieve the "weights" column:
> treasure[["weights"]]
[1] 300 200 100 250 150

6.

Typing all those brackets can get tedious, so there's also a shorthand notation:
the data frame name, a dollar sign, and the column name (without quotes). Try using
it to get the "prices" column:
> treasure$prices
[1]  9000  5000 12000  7500 18000

7.

Now try getting the "types" column:


> treasure$types
[1] gold   silver gems   gold   gems
Levels: gems gold silver
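
Since data frames can be indexed much like matrices, rows can be pulled out with the same row-and-column bracket notation. The sketch below rebuilds the treasure frame so it runs on its own. The gems_only name is just for illustration:

```r
weights <- c(300, 200, 100, 250, 150)
prices <- c(9000, 5000, 12000, 7500, 18000)
types <- factor(c('gold', 'silver', 'gems', 'gold', 'gems'))
treasure <- data.frame(weights, prices, types)

# The three column notations return the same vector.
same_column <- identical(treasure[[2]], treasure$prices)

third_chest <- treasure[3, ]                        # one row, all columns
gems_only <- treasure[treasure$types == "gems", ]   # rows matching a condition

print(same_column)
print(third_chest)
print(gems_only)
```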

8.

Loading Data Frames 6.3

Typing in all your data by hand only works up to a point, obviously, which is why R was
given the capability to easily load data in from external files.
We've created a couple of data files for you to experiment with:
> list.files()
[1] "targets.csv"  "infantry.txt"

Our "targets.csv" file is in the CSV (Comma Separated Values) format exported by
many popular spreadsheet programs. Here's what its content looks like:
"Port","Population","Worth"
"Cartagena",35000,10000
"PortoBello",49000,15000
"Havana",140000,50000
"PanamaCity",105000,35000

You can load a CSV file's content into a data frame by passing the file name to
the read.csv function. Try it with the "targets.csv" file:
> read.csv("targets.csv")
        Port Population Worth
1  Cartagena      35000 10000
2 PortoBello      49000 15000
3     Havana     140000 50000
4 PanamaCity     105000 35000

9.

The "infantry.txt" file has a similar format, but its fields are separated by tab
characters rather than commas. Its content looks like this:
Port	Infantry
PortoBello	700
Cartagena	500
PanamaCity	1500
Havana	2000

For files that use separator strings other than commas, you can use
the read.table function. The sep argument defines the separator character, and you
can specify a tab character with "\t".
Call read.table on "infantry.txt", using tab separators:
> read.table("infantry.txt", sep = "\t")
          V1       V2
1       Port Infantry
2 PortoBello      700
3  Cartagena      500
4 PanamaCity     1500
5     Havana     2000

15.
Notice the "V1" and "V2" column headers? The first line is not automatically
treated as column headers with read.table. This behavior is controlled by the header
argument. Call read.table again, setting header to TRUE:
> read.table("infantry.txt", sep = "\t", header = TRUE)
        Port Infantry
1 PortoBello      700
2  Cartagena      500
3 PanamaCity     1500
4     Havana     2000
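
If you don't have the course files handy, you can try the same call against a throwaway file. This sketch writes a two-row tab-separated sample to a temporary path (the file and its contents are made up for the demonstration) and reads it back:

```r
# Write a small tab-separated file to a temporary location.
path <- tempfile(fileext = ".txt")
writeLines(c("Port\tInfantry",
             "PortoBello\t700",
             "Cartagena\t500"), path)

# header = TRUE tells read.table that the first line holds column names.
infantry_demo <- read.table(path, sep = "\t", header = TRUE)
print(infantry_demo)

unlink(path)   # remove the temporary file
```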

16.

Merging Data Frames 6.4


We want to loot the city with the most treasure and the fewest guards. Right now,
though, we have to look at both files and match up the rows. It would be nice if all the
data for a port were in one place...
R's merge function can accomplish precisely that. It joins two data frames together,
using the contents of one or more columns. First, we're going to store those file
contents in two data frames for you, targets and infantry.
The merge function takes arguments with an x frame (targets) and a y frame
(infantry). By default, it joins the frames on columns with the same name (the
two Port columns). See if you can merge the two frames:
> targets <- read.csv("targets.csv")
> infantry <- read.table("infantry.txt", sep = "\t", header = TRUE)
> merge(x = targets, y = infantry)
        Port Population Worth Infantry
1  Cartagena      35000 10000      500
2     Havana     140000 50000     2000
3 PanamaCity     105000 35000     1500
4 PortoBello      49000 15000      700
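
Relying on matching column names works here, but you can also name the join column explicitly with merge's by argument. The sketch below uses two small hand-built frames (hypothetical stand-ins for the files) whose rows are in different orders:

```r
targets_demo <- data.frame(Port = c("Cartagena", "Havana"),
                           Worth = c(10000, 50000))
guards_demo <- data.frame(Port = c("Havana", "Cartagena"),
                          Infantry = c(2000, 500))

# Rows are matched on the Port value, not on their position.
joined <- merge(x = targets_demo, y = guards_demo, by = "Port")
print(joined)
```

Spelling out by can make the join easier to follow when frames share more than one column name.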

17.

Chapter 6 Completed

Thirty paces south from the gate of the fort, and dig... we've unearthed another
badge!
When your data grows beyond a certain size, you need powerful tools to organize it.
With data frames, R gives you exactly that. We've shown you how to create and
access data frames. We've also shown you how to load frames in from files, and how
to cobble multiple frames together into a new data set.
Time to take what you've learned so far, and apply it. In the next chapter, we'll be
working with some real-world data!

TRY R CHAPTER 7

Real-World Data

So far, we've been working purely in the abstract. It's time to take a look at some real
data, and see if we can make any observations about it.

2.

Some Real World Data 7.1

Modern pirates plunder software, not silver. We have a file with the software piracy
rate, sorted by country. Here's a sample of its format:

Country,Piracy
Australia,23
Bangladesh,90
Brunei,67
China,77
...

We'll load that into the piracy data frame for you:
> piracy <- read.csv("piracy.csv")

We also have another file with GDP per capita for each country (wealth produced,
divided by population):
Rank  Country        GDP
1     Liechtenstein  141100
2     Qatar          104300
3     Luxembourg     81100
4     Bermuda        69900
...

That will go into the gdp frame:

> gdp <- read.table("gdp.txt", sep = "", header = TRUE)

We'll merge the frames on the country names:

> countries <- merge(x = gdp, y = piracy)

Let's do a plot of GDP versus piracy. Call the plot function, using the "GDP" column
of countries for the horizontal axis, and the "Piracy" column for the vertical axis:
> plot(countries$GDP, countries$Piracy)

[Plot: scatter plot of countries$Piracy against countries$GDP; GDP axis from 0 to 80000, Piracy axis from 20 to 80]

3.

It looks like there's a negative correlation between wealth and piracy - generally,
the higher a nation's GDP, the lower the percentage of software installed that's
pirated. But do we have enough data to support this connection? Is there really a
connection at all?
R can test for correlation between two vectors with the cor.test function. Try calling it
on the GDP and Piracy columns of the countries data frame:
> cor.test(countries$GDP, countries$Piracy)

	Pearson's product-moment correlation

data:  countries$GDP and countries$Piracy
t = -14.8371, df = 107, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.8736179 -0.7475690
sample estimates:
       cor
-0.8203183

The key result we're interested in is the "p-value". Conventionally, any correlation with
a p-value less than 0.05 is considered statistically significant, and this sample data's
p-value is definitely below that threshold. In other words, yes, these data do show a
statistically significant negative correlation between GDP and software piracy.
4.

We have more countries represented in our GDP data than we do our piracy rate
data. If we know a country's GDP, can we use that to estimate its piracy rate?
We can, if we calculate the linear model that best represents all our data points (with
a certain degree of error). The lm function takes a model formula, which is represented
by a response variable (piracy rate), a tilde character (~), and a predictor
variable (GDP). (Note that the response variable comes first.)
Try calculating the linear model for piracy rate by GDP, and assign it to
the line variable:
> line <- lm(countries$Piracy ~ countries$GDP)

5.

You can draw the line on the plot by passing it to the abline function. Try it now:
> abline(line)

Now, if we know a country's GDP, we should be able to make a reasonable prediction
of how common piracy is there!

[Plot: the same scatter plot of countries$Piracy against countries$GDP, with the fitted line drawn through the points]
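
To actually get a number out of a model like this, one option is to fit it with lm's data argument and then call predict. The sketch below uses a tiny made-up data frame (not the course's real figures) purely to show the shape of the calls:

```r
# Hypothetical stand-in for the countries data frame; the numbers are invented.
demo <- data.frame(GDP = c(1000, 2000, 3000, 4000),
                   Piracy = c(90, 78, 72, 60))

# Response variable first, then a tilde, then the predictor.
fit <- lm(Piracy ~ GDP, data = demo)
slope <- coef(fit)[["GDP"]]      # negative: piracy falls as GDP rises

# Estimate the piracy rate for a GDP value that is not in the data.
estimate <- predict(fit, newdata = data.frame(GDP = 2500))

print(slope)
print(estimate)
```

Writing the formula with bare column names and a data argument is what lets predict accept new GDP values later.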

6.

ggplot2 7.2

The functionality we've shown you so far is all included with R by default. (And it's
pretty powerful, isn't it?) But in case the default installation doesn't include that
function you need, there are still more libraries available on the servers of the
Comprehensive R Archive Network, or CRAN. They can add anything from new
statistical functions to better graphics capabilities. Better yet, installing any of them is
just a command away.
Let's install the popular ggplot2 graphics package. Call the install.packages function
with the package name in a string:
> install.packages("ggplot2")

7.

You can get help for a package by calling the help function and passing the
package name in the package argument. Try displaying help for the "ggplot2"
package:
> help(package = "ggplot2")
Information on package 'ggplot2'

Description:

Package:  ggplot2
Type:     Package
Title:    An implementation of the Grammar of Graphics
Version:  0.9.1

...

8.

Here's a quick demo of the power you've just added to R. To use it, let's revisit
some data from a previous chapter.

> weights <- c(300, 200, 100, 250, 150)
> prices <- c(9000, 5000, 12000, 7500, 18000)
> chests <- c('gold', 'silver', 'gems', 'gold', 'gems')
> types <- factor(chests)

The qplot function is a commonly-used part of ggplot2. We'll pass the weights and
values of our cargo to it, using the chest types vector for the color argument. (In a
regular R session, load the package with library(ggplot2) first.)

> library(ggplot2)
> qplot(weights, prices, color = types)

Not bad! An attractive grid background and colorful legend, without any of the
configuration hassle from before!
ggplot2 is just the first of many powerful packages awaiting discovery on CRAN. And of
course, there's much, much more functionality in the standard R libraries. This course
has only scratched the surface!

[Plot: qplot scatter plot of prices against weights, colored by type, with a legend for gems, gold, and silver]

13.

Chapter 7 Completed

Captain's Log: The end of chapter 7. Supplies are running low. Luckily, we've spotted
another badge!
We've covered how to take some real-world data sets, and test whether they're
correlated with cor.test. Then we learned how to show that correlation on plots, with
a linear model.
