
Statistics 5601 (Geyer, Fall 2013) Examples: R Intro

stat.umn.edu/geyer/5601/examp/intro.html

What is R?

First there was S, a general-purpose, interpreted computer language
especially designed for statistics. It came from the famous
Bell Labs.

The original implementation of S is no longer commercially
available. S together with additional functions was renamed
S-PLUS.
It is currently marketed by
TIBCO Software Inc,
where it seems to have been replaced by or buried within a product
called Spotfire.

R is free S. It is free as in "free beer"
(you can download it with no charge)
and free as in "free speech"
(you can do whatever you want with it except make
it non-free). More precisely, R is a dialect of the S language.
R and S-PLUS are more or less compatible. Roughly 90% of things
you want to do work in both. Most other things work with minor variations.
R is available from the
Comprehensive R Archive Network (CRAN).

R is the language of choice for research statistics.
If it's statistics, you can do it in R.

If you have the time and want to know more about R, the
Introduction to R
that comes with the R software is the first thing to
read, but it is way more than you need to know for this course.

What is Rweb?
Free software is amazing. Creative programmers
can use it to do anything they can think of.
There's no vendor controlling use of the software to protect their profits.

Prof. Jeff Banfield at Montana State University put R on the web. You
can run simple R commands from any computer connected to the internet.
A similar program could be easily done for S-PLUS but would be illegal
because the vendor couldn't profit from it.

The local Rweb server is at
http://rweb.stat.umn.edu/Rweb.
This link is also in the navigation section of every course web page.

There are two "interfaces" to Rweb. The simple one, found by clicking on
the Rweb link
on the main Rweb page, is the only one we will explain. It has the virtue
of being embeddable in web pages to make examples.

Here is a simple example (not having much to
do with nonparametrics, just one of the examples on the Rweb page).

To see how the example works, just click the "Submit" button.

When you have seen the example, click the "Back" button on your web
browser to return to this page.

For now, don't bother with what the example does. Just notice that
it does some calculations on some data and draws a picture.

The Relation between R and Rweb

Rweb is just R. You type R statements into a web form. You submit
them. They get executed on the server. The results get stuffed into
a web page sent back to your computer. So Rweb is just R run over the web.

So mostly we will use "R" and "Rweb" interchangeably.

One important difference between Rweb and R is that the server remembers
nothing between Rweb submissions.
The entire calculation you want done must be submitted
to Rweb in one web form.
R run on your own computer
does remember. You can build up a complicated analysis a little bit at
a time.

Thus Rweb is fairly useless for really complicated problems, but is fine
for (simple) coursework.

Variables and Assignment

Like all other computer languages, R has variables, which are referred to
by variable names. Variable names may contain any letter, digit, or the dot
(.) and cannot begin with a digit. Names are case sensitive,
thus fred, Fred, and FRED refer to
different variables.

The assignment operator in R is an arrow "<-" constructed
from two characters. An assignment statement looks like

fred <- 4

or

sally <- 2 + 2

or

a.very.long.variable.name <- sqrt(16)

Each assigns the value of the expression on the right side of
the assignment operator to the variable name on the left side. In each case
the variable gets the value 4.

In order to see any results from R, you have to execute a command that
makes output, the most common being print and plot.

When an assignment is made, you don't see anything
unless you ask explicitly.

sally <- 2 + 2
print(sally)

prints the value (4) assigned to the variable sally.

If the print statement were omitted, there wouldn't be
any point because you wouldn't see anything and Rweb wouldn't remember
the results for future use.

Actually this example can be shortened to

sally <- 2 + 2
sally

because an expression that is not an assignment usually prints its value
so

sally

does the same thing as

print(sally)

If in doubt, put in the print.

Vectors

Not all R variable values are single numbers (in fact most aren't).
Most R variables are vectors, which is R's name for a list of objects
of the same type
(often numbers but character variables and other types are possible).

There are many ways to create vectors in R. Many functions and operators
return vector values if given vector values as arguments. Here we will only
look at a few ways to create a vector and a few functions and operators
that work vectorwise.

The c Function

The R function c
(on-line help)
"combines" or "collects" all its arguments into one vector,
for example
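A minimal sketch of such a call follows. The variable names bob and more and their values are made up for illustration; they are not taken from the course page.

```r
# Combine five numbers into one numeric vector (values are illustrative).
bob <- c(3, 1, 4, 1, 5)
print(bob)

# c also flattens vector arguments into a single longer vector.
more <- c(bob, 9, 2)
print(length(more))
```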

The seq Function

The R function seq
(on-line help)
creates a sequence, for example
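A short sketch of a few common forms, with illustrative endpoints and step size chosen here rather than taken from the course page:

```r
# Integers from 1 to 10 in steps of 1.
s1 <- seq(1, 10)

# The same range in steps of 0.5, using the named argument by.
s2 <- seq(1, 10, by = 0.5)

# The colon operator is a shorthand for the first form.
s3 <- 1:10

print(s1)
print(s2)
```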

Rweb External Data Entry

Variables can also be read into Rweb from an external file,
either a file on your own computer or one on the web.
We'll only illustrate the latter. An example file is

http://www.stat.umn.edu/geyer/5601/examp/blurfle.txt

The file has the following properties.

The file is plain text. The garbage inserted into files
by many so-called "word processors" (like Microsoft Word) is not allowed.
Each variable is one column of the file.
The column heading is the name of the variable.
The rest of the column is the list of values of the variable.

This has the result that all of the variables must be vectors
of the same length.
This can usually be arranged somehow
(if necessary pad the variables that are too short with
NA values).

When a job is submitted to Rweb, the first thing it does is read
the "External Data Entry" file (if there is one) and create the variables
in it. The example blurfle.txt creates
three variables, color, x, and y
and prints them out.

R Data Entry

The same issues from the preceding section apply when you are not using
Rweb but using R on your own computer. The pattern is a little different.
Suppose you are in R. The following commands

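X <- read.table(url("http://www.stat.umn.edu/geyer/5601/examp/blurfle.txt"),
    header = TRUE)
attach(X)
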
mimic what Rweb does when it loads the data in the URL of the preceding
section. After that the statements in the Rweb form in the preceding
section work the same in your computer as they do on Rweb.
(In Rweb you can actually see the two statements shown just above in
each Rweb output page, but the read.table in Rweb does
not read from the web but from an already downloaded copy of the file.)

You can also download the file yourself to your own computer before starting
R and then just do

X <- read.table("blurfle.txt", header = TRUE)
attach(X)

This assumes the file "blurfle.txt" has been downloaded to a directory or
folder where R will look for it (the current working directory under Linux,
the user's Documents folder by default under Windows, or
the user's home directory under MacOS X; these can be changed using
the menus of the R GUI app).

Or you can create your own data file with your own data in it.

Plain Text Files

We repeat information from the Rweb data entry
section above.

The file has the following properties.

The file is plain text. The garbage inserted into files
by many so-called "word processors" (like Microsoft Word) is not allowed.
Each variable is one column of the file.
The column heading is the name of the variable.
The rest of the column is the list of values of the variable.

This has the result that all of the variables must be vectors
of the same length.
This can usually be arranged somehow
(if necessary pad the variables that are too short with
NA values).
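The file layout described above can be tried out without downloading anything. The sketch below writes a tiny file to a temporary location and reads it back. The file contents are hypothetical (they are not the contents of blurfle.txt), and the short y column is padded with NA.

```r
# Write a small plain-text data file; the data values are made up.
path <- tempfile(fileext = ".txt")
writeLines(c("color x y",
             "red 1 2.5",
             "blue 2 NA",
             "green 3 4.0"), path)

# header = TRUE says the first line holds the variable names.
X <- read.table(path, header = TRUE)
print(X)

# The padded entry is read as a missing value.
print(is.na(X$y))
```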

You cannot use Microsoft Word or any similar product to create data files
for entry into R. Just say no. It cannot be made to work. You will only
frustrate yourself if you try.

If you are careful to save in plain text format, you can use Microsoft Notepad.

An alternative is to get and use RStudio,
but that is very complicated and we are not going to teach you how to use it.

CSV (Comma Separated Values) Files

An alternative is to use Microsoft Excel or another spreadsheet program,
such as
LibreOffice Calc,
to enter your data. Then write it out as a CSV (comma separated values)
file. This is not the default format. You have to choose this
output format specially when you save the file.

Here's what a CSV file looks like

http://www.stat.umn.edu/geyer/5601/examp/blurfle.csv

It looks almost the same as blurfle.txt. The only difference
is that commas instead of whitespace separate the values.

Now

X <- read.csv("blurfle.csv")

reads the data into R.

For more info see the
on-line help page for read.table, read.csv, and several
related functions for reading in data.

Vectorwise Functions and Operators


It is an important and generally useful fact about R that most functions
and operators work vectorwise (operating on each element of the vector).

Note that multiplication needs an explicit operator * as in
most computer languages. The ^ operator is exponentiation:
bob^2 is "bob squared".
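A brief sketch of vectorwise arithmetic. The values of bob are chosen here for illustration, as are the names of the result variables.

```r
bob <- c(1, 2, 3, 4)

twice <- 2 * bob          # each element doubled
squared <- bob^2          # each element squared
total <- bob + squared    # elementwise sum of two vectors
roots <- sqrt(squared)    # functions also work on each element

print(twice)
print(squared)
print(total)
print(roots)
```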

That's all for now (admittedly too brief, see
Simple manipulations; numbers and vectors in
the Introduction to R document if you need to know more, but don't
look at it your first time through this).

Indexing Vectors

Indexing operations allow you to modify or pick out or remove specified
elements of a vector.

Integer Indexing
The simplest form of indexing uses positive integers in the range from
one to the length of the vector. For example

do what is obvious (after you get used to vector indexing). Not quite so
obvious is that subscripts work the same way on the other side of the
assignment operator.
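A small sketch of positive integer indexing, including an index on the left side of an assignment. The vector bob and its values are illustrative.

```r
bob <- c(10, 20, 30, 40, 50)

print(bob[3])         # the third element
print(bob[2:4])       # elements two through four
print(bob[c(1, 5)])   # the first and fifth elements

# An index on the left side of the assignment replaces that element.
bob[2] <- 99
print(bob)
```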

Negative Integer Indexing

Negative index values indicate "everything but"

do the same thing (why? figure it out!).
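One possible pair of expressions of this kind is sketched below, with an illustrative vector. Dropping the first element with a negative index gives the same result as keeping elements two through the last.

```r
bob <- c(10, 20, 30, 40, 50)

a <- bob[-1]               # everything but the first element
b <- bob[2:length(bob)]    # elements two through the last
print(identical(a, b))     # TRUE

# Several elements can be dropped at once.
middle <- bob[-c(1, 5)]
print(middle)
```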

Logical Indexing

Perhaps the most useful form of indexing uses logical vectors.
First the example, then the explanation.

bob[bob != 42]

is the (vector of) elements of bob not equal to 42.

(The operator != is "not equal".
Similarly <= is
"less than or equal" and >=
is "greater than or equal".)

The result of

bob != 42

is a logical vector (all elements having values TRUE
or FALSE). Indexing with such a vector picks out the elements
for which the index is TRUE.

When the logical vector is the result of a comparison (as here), it
picks out the elements for which the comparison was TRUE.
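The two steps can be seen separately in a short sketch. The values in bob and the name keep are chosen here for illustration.

```r
bob <- c(7, 42, 13, 42, 5)

# The comparison alone gives a logical vector.
keep <- bob != 42
print(keep)           # TRUE FALSE TRUE FALSE TRUE

# Indexing with it keeps the elements where keep is TRUE.
print(bob[keep])      # 7 13 5

# The comparison can also be written directly inside the brackets.
print(bob[bob <= 7])  # 7 5
```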

That's all for this web page. If you need to know more, see
Index vectors; selecting and modifying subsets of a data set in
the Introduction to R document, but don't
look at it your first time through this.

Functions

Built-in Functions

We've already mentioned a few R functions. There are lots and lots of
others. By "built-in" functions, we mean those that you don't have
to do anything special to use. Strictly speaking, R doesn't have any
"built-in" functions. Any function is like any other function.
None are more special than any other. But several "packages" called
base,
datasets,
graphics,
grDevices,
methods,
stats, and
utils
are automatically available
with no special effort.

These functions are listed in the documentation for the
base
library, and so forth. All the libraries are listed on
the
package index.

Arguments

To use an R function, you just type the function name followed by
the list of arguments in parentheses. We've already seen examples,
like

plot(x, y)

Named Arguments

Most R functions also have named arguments. The syntax
for that is

The named arguments here, main, xlab,
and ylab, can appear in any order so long as they
are after the unnamed arguments.

This makes the functions much
simpler to use. Many functions have dozens of arguments, and
you only need to use a few (the others have default values or aren't
used the way you are invoking the function).
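As a sketch of the named-argument syntax, here is a plot call that uses main, xlab, and ylab. The data and the label strings are invented for illustration. The second call reorders the named arguments and produces the same picture.

```r
# Illustrative data.
x <- 1:10
y <- x^2

# Unnamed arguments first, then named arguments.
plot(x, y, main = "An illustrative title", xlab = "x values", ylab = "y values")

# The named arguments may come in any order.
plot(x, y, ylab = "y values", main = "An illustrative title", xlab = "x values")
```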

If you actually know the order of all the arguments, then you don't
need the name. For example, the three expressions

rnorm(10, 0.0, 1.0)
rnorm(10, mean = 0.0, sd = 1.0)
rnorm(10)

all do the same thing (generate 10 independent
and identically distributed standard
normal random numbers) because the second argument is mean
and the third is sd and the defaults for these arguments
are 0.0 and 1.0, respectively.

Your choice.

Packages

Some functions are not available until the "package"
containing them is added. For example

library(exactRankTests)

adds the exactRankTests package, which does
pretty much what the name suggests. We'll use it soon.

Other than needing a library command first,
functions in such a package, such as the wilcox.exact
function in the exactRankTests library, are just like any other
functions.

The list of all packages available on our Rweb server is
here. It can also be found by going to the
main Rweb page
(follow the link on the navigation bar at the top of any 5601 web page)
then clicking on the link "HTML documentation" in the second paragraph
and then on the link "Packages" on the main R documentation page.

Many more packages can be found at
the
contributed packages page at CRAN.

Writing Your Own

Defining Functions

The function function defines new functions. For example

trim <- function(x, lower = 0.0, upper = 1.0) {
    inies <- x >= lower & x <= upper
    return(x[inies])
}

trims off the values of the argument x that are
below or above the arguments lower and upper,
respectively.

The lower = 0.0 and upper = 1.0 in the
definition specify default values for these arguments that
are used when the user does not supply values.

Let's check it out.
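A quick check along these lines might look like the sketch below. The definition of trim is repeated so the block runs on its own, and the test vector x is made up.

```r
# trim as described in the text: keep values between lower and upper.
trim <- function(x, lower = 0.0, upper = 1.0) {
    inies <- x >= lower & x <= upper
    return(x[inies])
}

# Illustrative data.
x <- c(-0.5, 0.2, 0.7, 1.3, 2.0)

print(trim(x))                # defaults: 0.2 0.7
print(trim(x, upper = 1.5))   # 0.2 0.7 1.3
print(trim(x, -1, 1))         # -0.5 0.2 0.7
```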

As the assignment suggests, an R function is just an R object
like any other R object. As such, it can be assigned a variable name
or used in any other way an R object can be used.
In this example, trim is an R variable
that happens to be a function and x is an R variable
that happens to be a numeric vector.

This allows functions to be passed as arguments to other functions,
a very useful technique that we will use often (that's the main reason
we will want to define our own functions).
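As a sketch of passing a function as an argument, the standard R functions lapply and sapply each take a function and apply it to every component of a list. The list samples and its contents are hypothetical.

```r
# One-line version of trim, repeated so the block runs on its own.
trim <- function(x, lower = 0.0, upper = 1.0) x[x >= lower & x <= upper]

# A list of illustrative samples.
samples <- list(a = c(-1, 0.5, 2), b = c(0.1, 0.9), c = c(3, 4))

# trim is passed to lapply just like any other R object.
trimmed <- lapply(samples, trim)
print(trimmed)

# A function can also be defined on the spot and passed directly.
counts <- sapply(samples, function(v) length(trim(v)))
print(counts)
```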

Returned Value

A return statement is not strictly necessary. Functions return
the value of the last expression if there is no return statement. The curly
brackets are not necessary if there is only one statement.

Thus

trim <- function(x, lower = 0.0, upper = 1.0) x[x >= lower & x <= upper]

works just as well as the other definition. But it is a lot harder to
read, and we generally won't use this trick.

Local Variables
Local variables are variables defined inside a function.
They exist only inside the function and have no influence on anything
outside the function.

The following example shows this behavior.

Inside the function x is defined to be the value
of y, but outside the function x is unchanged.
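A minimal sketch of this behavior follows. The function name fred and the numbers are invented for illustration.

```r
x <- 1

fred <- function(y) {
    x <- y        # this x is local to the function
    return(x)
}

print(fred(5))    # 5, the value of the local x
print(x)          # 1, the outside x is unchanged
```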

Global Variables

Global variables are variables defined outside a function.
They are not defined inside the function, either in the argument list
or in the body. They can, however, be used inside the function.
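A small sketch of a function that uses a global variable. The names a and scale_by_a are hypothetical.

```r
a <- 10

scale_by_a <- function(z) {
    # a is neither an argument nor a local variable,
    # so R looks it up outside the function.
    return(a * z)
}

print(scale_by_a(2))   # 20

# Changing the global variable changes what the function returns.
a <- 100
print(scale_by_a(2))   # 200
```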

This is sometimes very convenient, but can lead to confusing code.
It probably shouldn't be overused.
(Real programmers think global variables
are evil, but they are part of the R way.)

More on Functions

This section only scratches the surface. There's a lot more to be
said about R functions. The
section on writing your own functions in
the
Introduction to R book is a good place to start.

Missing Data and Computer Arithmetic

The number system used by R has two sorts of accommodation for values
the computer can't handle or at least isn't supposed to deal with.

NA: Not Available

Any data value, numeric or not, can be NA. This is what
you use for "missing data". Always use NA for this
purpose. Never use 999 or some other code that is actually
a number. Sad experience of many scientists shows this sort of code is
always forgotten at some point and the data analysis thereby ruined.
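A short sketch of how NA behaves in a calculation, using made-up data. The functions is.na and mean (with its na.rm argument) are standard R.

```r
# The second observation is missing.
x <- c(1.2, NA, 3.4)

print(is.na(x))                # FALSE TRUE FALSE

# A summary of data containing NA is itself NA ...
print(mean(x))

# ... unless the missing values are explicitly removed.
print(mean(x, na.rm = TRUE))
```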

NaN: Not a Number

This is a special value that only numeric variables can take. It is
the result of an undefined operation like 0 / 0.
It is produced by the low level arithmetic of all modern computers.
R is just going along with the standard here.

Inf: Infinity

Numeric variables can also take the values -Inf
and Inf. These are produced by the low level arithmetic
of all modern computers by operations such as -1 / 0 and
1 / 0.
R is just going along with the standard here.

You shouldn't think of these as real infinities, like in calculus, but
rather that the correct calculation, if the computer could do it, would
probably (but not certainly) be very large, larger than the
largest numbers the computer can hold (about 10^300), and of
the sign of the "infinity".
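These special values can be produced and tested directly, as in the sketch below. The variable big is illustrative. .Machine$double.xmax is the standard R constant for the largest finite number the computer can hold.

```r
print(0 / 0)      # NaN
print(1 / 0)      # Inf
print(-1 / 0)     # -Inf

# Tests for the special values.
print(is.nan(0 / 0))
print(is.finite(c(1, Inf, -Inf, NaN, NA)))

# A result too large to represent overflows to Inf.
big <- 1e300
print(big * 1e10)

# The largest finite number available.
print(.Machine$double.xmax)
```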

Control Structures

R is
a Turing complete
computer language. Anything you can do with a computer, you can do in R
(if you are a sufficiently clever programmer). For those who don't want
to use a computer except via
a WIMP interface
(mice and menus), this may seem irrelevant, but it is very important.

No computer software product, no matter how large, can implement everything.
We will see that, even in an undergraduate-master's level course like this,
there are many issues that cannot be explored using a canned program.
Thus R, unlike other statistics programs (except for S-PLUS, which is equal
in power to R), is able to explore these issues.

Those who find computer programming frightening may rest assured that the
"programming" we do will be very simple, involving no more than
writing your own functions and the two control structures
described in this section.

For Loops

One thing computers are much better at than people is mindless repetition.
The for control construct
(on-line help)
is the main way mindless repetition is done in R.
The R expression

for (i in 1:100) {
### some R statements that do some work here
}

does the same thing (whatever is done by the R statements inside) 100 times.

Here is a simple example. Suppose we want to examine the sample median
as an estimator of the mean of a normal distribution. We may know the
asymptotic distribution from theory class (or not, maybe that wasn't covered,
although we will cover it on the page about efficiency).
The following code simulates the median of a random sample of
size n from the normal distribution (it does not matter which
one), and we do this repeatedly nsim times.
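A sketch of such a simulation is given here. It is a reconstruction under stated assumptions, not the course's own code: the values of n and nsim, the seed, the use of double(nsim) to create the result vector, and the choice of the standard normal distribution are all choices made for this sketch. The curve drawn at the end uses the standard asymptotic result that the sample median of a normal sample has variance approximately pi * sigma^2 / (2 * n).

```r
set.seed(42)     # assumed seed, only so the sketch is reproducible
n <- 10          # assumed sample size
nsim <- 1000     # assumed number of simulations

# Vector to hold the simulation results.
theta.hat <- double(nsim)

for (i in 1:nsim) {
    x <- rnorm(n)                # one random sample of size n
    theta.hat[i] <- median(x)    # its sample median
}

# Histogram of the simulated sampling distribution.
hist(theta.hat, freq = FALSE)

# Density of the asymptotic normal distribution (sigma = 1 here).
curve(dnorm(x, mean = 0, sd = sqrt(pi / (2 * n))), add = TRUE)
```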

Comments

The statement that assigns theta.hat before the loop creates a vector
of length nsim to hold the simulation results (that we have
not done yet). Each time the for loop repeats, it calculates
one result (one median of a sample of size n).

The statement that assigns x simulates a random sample of
size n; see the web page
about probability distributions in R for
more about simulation of random variables.

The expression median(x) calculates the median of the sample
just simulated, and the whole statement theta.hat[i] <- median(x)
assigns this calculation to the i-th element of the vector
theta.hat.

To understand this one needs to know that each time the loop is executed
the variable i takes a different value from the list specified
in the for statement, which in this case is 1:nsim,
the vector of integers 1, 2, …, nsim.

When the loop finishes the vector theta.hat contains
nsim random (and independent) replications of the sample
median of a sample of size n.

The line following the loop draws the histogram: the sampling distribution
of the sample median for a random sample of size n from a normal
population.

The last line adds the density of the asymptotic normal distribution of
this estimator (the asymptotic normal distribution being given in theory
books).

If, Else, and Ifelse

If

Besides mindless repetition, the thing that allows computers to "think"
or at least
appear smarter than the average bear
is their ability to make "decisions" based on some criterion applied
to their current state.
The if control construct
(on-line help)
is the main way decisions are made in R.

As an example of decisions, we investigate a really bad idea that seems
to occur naturally to many people exposed to introductory statistics
(it doesn't have much to do with nonparametrics, although we will mention
it in the handout about breakdown point).

There are two kinds of two-sample t test, the old-fashioned one
that is exact under the assumption of normal populations and equal population
variance
and the newer one that is only approximate but does not need the assumption
of equal population variance
(on-line help for the t.test function).

Which to use? Perhaps we should use a test about population variances
to decide. Or perhaps we should decide to use the "exact" test only
when the test about population variances says it is OK.
Let's try that.
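A sketch of this procedure follows. It is a reconstruction under stated assumptions, not the course's own code: the sample sizes nx and ny, the standard deviations sigx and sigy, the seed, and nsim are chosen here for illustration. var.test and t.test are the standard R functions for the variance test and the t test.

```r
set.seed(42)             # assumed seed, only for reproducibility
nx <- 10                 # assumed sample sizes
ny <- 50
sigx <- 1.5              # assumed, deliberately unequal, standard deviations
sigy <- 1
nsim <- 1000

# Start with all NA so it is clear which values were never set.
tstat <- rep(NA, nsim)

for (i in 1:nsim) {
    x <- rnorm(nx, sd = sigx)
    y <- rnorm(ny, sd = sigy)
    # Use the equal-variance t test only if the variance test does not reject.
    if (var.test(x, y)$p.value > 0.05) {
        tstat[i] <- t.test(x, y, var.equal = TRUE)$statistic
    }
}

# hist ignores the NA values.
hist(tstat, freq = FALSE)
# Density the equal-variance procedure assumes for the test statistic.
curve(dt(x, df = nx + ny - 2), add = TRUE)
```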

Comments

The statement that assigns tstat before the loop creates a vector
of length nsim all of whose elements are NA.
In the following loop we will set some of them to be useful values.
We initialize to NA so it will be clear which values
were not set.

The for loop works just like the preceding example.

The if statement calculates the P-value
for the test of equality of variance, and if it is greater than 0.05,
we carry out the t test assuming equality of variance and
record the test statistic.

After the loop finishes, some of the values of the vector tstat
are independent random realizations of the null distribution of the test
statistic when the hypothesis of equality of variances is false (because
sigx and sigy are different).

The code following the loop plots the histogram of the simulation distribution
and the density of the t distribution that the procedure assumes
the test statistic has. Clearly it doesn't.

Else

There are lots of ways to "make decisions" in R.
One alternative is the else control construct
(on-line help)
that is used with the if control construct.

This example does the same thing as in the preceding section.

The only difference is that instead of initializing the
vector tstat to NA we do the assignment
to NA in the if-else control construct: if the P-value
is greater than 0.05 then we compute the test statistic and assign it to
tstat[i] as in the preceding example, otherwise we assign
tstat[i] the value NA.
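A sketch of the if-else variant. As before, this is a reconstruction, not the course's own code: the sample sizes, standard deviations, seed, and nsim are assumed values.

```r
set.seed(42)             # assumed seed, only for reproducibility
nx <- 10                 # assumed sample sizes
ny <- 50
sigx <- 1.5              # assumed standard deviations
sigy <- 1
nsim <- 1000

# No NA initialization this time; every element is set in the loop.
tstat <- double(nsim)

for (i in 1:nsim) {
    x <- rnorm(nx, sd = sigx)
    y <- rnorm(ny, sd = sigy)
    if (var.test(x, y)$p.value > 0.05) {
        tstat[i] <- t.test(x, y, var.equal = TRUE)$statistic
    } else {
        tstat[i] <- NA
    }
}

print(sum(is.na(tstat)))   # how many simulations were set to NA
```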

Ifelse

The preceding examples in this section use the if control
construct in ways that look like general purpose computer languages
such as C or Java. Here is a way that is unique to R using vector operations.

The if construct in the loop is gone.
We just save both the P-value and the t test
statistic each time through the loop.

After the loop is done we use the ifelse function
(on-line help) to make the tstat vector into what it is at the end of
the loop in the other examples.
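A sketch of the ifelse variant, again a reconstruction under assumptions. The name pval for the vector of P-values is hypothetical, as are the sample sizes, standard deviations, seed, and nsim.

```r
set.seed(42)             # assumed seed, only for reproducibility
nx <- 10                 # assumed sample sizes
ny <- 50
sigx <- 1.5              # assumed standard deviations
sigy <- 1
nsim <- 1000

pval <- double(nsim)     # hypothetical name for the saved P-values
tstat <- double(nsim)

for (i in 1:nsim) {
    x <- rnorm(nx, sd = sigx)
    y <- rnorm(ny, sd = sigy)
    # Save both results every time; no decision is made in the loop.
    pval[i] <- var.test(x, y)$p.value
    tstat[i] <- t.test(x, y, var.equal = TRUE)$statistic
}

# Vectorwise decision: keep tstat where pval > 0.05, otherwise NA.
tstat <- ifelse(pval > 0.05, tstat, NA)
```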

Logical Indexing
Another R way to make decisions that has no analog in conventional
computing languages uses logical indexing (which is described in
a section above). Here's how that works.

This example is just a little different from the others in that
after the statement that reassigns tstat using a logical index
(P-values greater than 0.05) the
vector tstat contains only the values corresponding
to P-values greater than 0.05. There are no NA
values; we have just shortened the vector to omit those cases.
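A sketch of the logical indexing variant, a reconstruction under the same assumptions as before. The name pval and all numeric settings are chosen for illustration.

```r
set.seed(42)             # assumed seed, only for reproducibility
nx <- 10                 # assumed sample sizes
ny <- 50
sigx <- 1.5              # assumed standard deviations
sigy <- 1
nsim <- 1000

pval <- double(nsim)     # hypothetical name for the saved P-values
tstat <- double(nsim)

for (i in 1:nsim) {
    x <- rnorm(nx, sd = sigx)
    y <- rnorm(ny, sd = sigy)
    pval[i] <- var.test(x, y)$p.value
    tstat[i] <- t.test(x, y, var.equal = TRUE)$statistic
}

# Keep only the statistics whose variance test did not reject.
tstat <- tstat[pval > 0.05]

print(length(tstat))     # shorter than nsim, with no NA values
```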
