Thompson

10/2002

Introduction to S-PLUS (and R)


Based on Chapters 1-4, A, B, C of Venables & Ripley (2002)

Introductory Notes
The S-PLUS 6.1 Environment
S-PLUS is an integrated suite of software facilities for data manipulation, calculation, and graphical display. It is an environment within which many classical and modern statistical techniques have been implemented. Some of these are built into S-PLUS, but many are supplied as libraries (freely downloadable) and modules (suites of S-PLUS functions and operations specific to a particular type of analysis). Some libraries come with the base software.

S is the language that S-PLUS is built around. S is an object-oriented, interpreted language. Objects are the basic building blocks in S ("everything is an object" is an oft-made quote). Most objects have classes (or types) assigned to them that determine how they are printed, summarized, plotted, and more. That is, methods (basically, functions) are tailored to the type of object they are dealing with. For example, a method may be written to find the arithmetic mean. The mean of a vector of numeric data will be computed differently from the mean of a matrix of numbers. In the latter case, the mean method might take the mean of each column of the matrix.

Java and the Mathematica kernel are other examples of interpreted languages. Because the code is not compiled first, execution is nowhere near as fast as it might be for compiled languages. However, the ease of programming far outweighs that defect. Turning ideas into functions (especially statistical ideas) is amazingly simple in comparison to doing the same in C, for example (see footnote 1).

S-PLUS includes a GUI in which a plethora of graphs can be created. Furthermore, pieces of graphs are objects, and thus can be edited and created as ordinary objects, like functions or variables. You can think of an S-PLUS graph sheet as a drawing board, where you can add and subtract objects or modify them almost to your heart's content.

The GUI has really developed since the first versions of S-PLUS that I used (version 3.3 for Windows and 3.4 for UNIX). S-PLUS has a fairly sophisticated spreadsheet for data entry or editing data objects. It also has a filing system for objects, graph sheets, scripts, and data sheets called Object Explorer. Script sheets are available for saving sets of functions and code. One can also run sets of commands from a BATCH file.
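The idea above, that a mean is computed differently for a vector than for a matrix, can be sketched with a small generic function. This is written in R's S3 style rather than with S-PLUS 6 classes, and the function name centre is made up for illustration; it is a sketch of class-based dispatch, not how the built-in mean is implemented.

```r
# A generic function: it dispatches on the class of its first argument.
centre <- function(x, ...) UseMethod("centre")

# Fallback for plain numeric vectors: one overall mean.
centre.default <- function(x, ...) mean(x, ...)

# Method for matrices: one mean per column.
centre.matrix <- function(x, ...) colMeans(x, ...)

v <- c(1, 2, 3, 6)
m <- matrix(1:6, nrow = 3)   # columns are 1:3 and 4:6

centre(v)   # 3
centre(m)   # 2 5
```

The same call, centre(x), gives a single number or a vector of column means depending only on what kind of object x is.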

The R language
R is GNU S: a re-implementation of S. The two languages are superficially similar, but do not have identical underpinnings (one big difference is in the scoping rules). Furthermore, some packages (libraries) for certain statistical analyses are written only for R (or only for S), and because of the differences between the two languages, such a package runs error-free only in R (or only in S).
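A small sketch of the scoping difference, run in R. The function names make_adder and add2 are made up for illustration. In R, a function remembers the variables that were visible where it was created. My understanding is that S-PLUS would not find n in the same way when add2 is called, but I have not run this there, so treat that half as an assumption.

```r
make_adder <- function(n) {
  # The inner function uses n, which is not one of its own arguments.
  function(x) x + n
}

add2 <- make_adder(2)   # n = 2 is remembered by the returned function
result <- add2(5)
result                  # 7 in R
```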
1. There is a program called Scompile that will compile S functions. I believe this may be similar to the compiling done in Mathematica.

R also comes with an optional GUI, and one can enter data from a simple spreadsheet environment. But R does not have the editable graphics facilities that S-PLUS does. Indeed, there is no R-PLUS. However, there is an XGOBI interface accessible from CRAN (see below) for both Unix and Windows.

There is a tremendous amount of documentation on R (the language, data import information, extensions, etc.) on the website (see below) and downloadable with the distribution (which is at version 1.6 as of this writing). As with all GNU software, there are no guarantees of full functionality, and the distribution gets updated much more frequently than the commercial S-PLUS does, although user-contributed libraries for both applications are updated and added frequently.

If there is time, I will try to include R in the S-PLUS lesson. It is occasionally discussed in V&R, and contrasted with S. If you can use S-PLUS via the command line, you should have no problems using R. However, if you are going to be developing software in R or porting functions from S to R, you need to know the differences between R and S (if you are lucky, they will show up for you).

This write-up is meant to show you around S-PLUS and point out where to find things. There is not the space to go into detail about everything (and many other books and manuals have already done that; see footnote 2), but at least you'll know where things are, and you can explore them on your own later. There is a lot to S-PLUS. I can say with all honesty that I have never been unable to do something (statistical or programming-wise) that I've wanted to do using S or S-PLUS. Of course, sometimes figuring out how to do it can take time. But I usually learn something in the process.

Getting R from the Web


Consult the URL, http://lib.stat.cmu.edu/R/CRAN/ for information about R and obtaining the binaries for Windows, Linux, or MacOS.

Getting the MASS library that goes with Venables and Ripley (2002) (V&R)
See the Appendices of V&R for information on how to download and install the MASS library (http://www.stats.ox.ac.uk/pub/MASS4/Winlibs). You can use these instructions to download any S-PLUS library. Many libraries are stored at the web site http://lib.stat.cmu.edu/S.

2. A list of books on S-PLUS can be found in the User's Guide manual that comes with S-PLUS.

Part 1: Introduction to some capabilities of S-PLUS


Getting Started with S-PLUS 6.1
Starting S-PLUS in a chosen working directory
To keep the objects from different projects separate, you should use different working directories for each project. One option is to create a shortcut icon for each project. Then, to start S-PLUS in a working directory of your choice, modify the target line in the properties tab of the shortcut menu (right-click menu) to read, for example:
"C:\Program Files\insightful\splus61\cmd\SPLUS.exe" S_PROJ=C:\data\thompsonla\mydir

However, note that when you start S-PLUS for the first time, it will ask you what directory you want to use for your working directory. As long as you don't check the box that says "don't ask me this again", it will continue to ask you for the working directory.

Starting S-PLUS in a project workspace (S-PLUS 2000 only)


Pages 438-444 of the S-PLUS User's Guide manual describe how to create a new workspace and save your objects, data sheets, scripts, and graph sheets to this directory. This concept is very similar to the project directories discussed above. The difference is that a workspace contains a snapshot of your directory and any open files, so that when you open the workspace again, you don't have to reopen all the script files and data sheets you had open when you closed S-PLUS. See the accompanying handout on workspaces, or check the STAT 5537 course web site.

Starting S-PLUS in BATCH mode


You do not have to run S-PLUS interactively. To run a set of commands (that you place into a file called infile, say) in BATCH mode, modify the target line in a shortcut to read
"C:\Program Files\insightful\splus61\...\splus.exe" /BATCH infile [outfile]

BATCH mode is convenient for calling S-PLUS and running S-PLUS commands from other applications (like R). See my example on the STAT 5537 course website.

Starting multiple instances of S-PLUS


To run multiple instances of S-PLUS, create a new shortcut icon and modify the target line in the properties tab to read, for example:
"C:\Program Files\insightful\splus61\cmd\SPLUS.exe" /MULTIPLE_INSTANCES

Each instance you start up should be invoked from this icon. Beware, though, that objects you change or add in one instance can affect what the other instance sees.
The S-PLUS Command Line


When you start S-PLUS and move to the Commands window, you see the working directory listed. This directory is where S-PLUS will store objects you create.

S-PLUS is an interactive system. You type a command, and S-PLUS returns an answer. Basically, commands are either assignments (of a value to a name) or S expressions. An S expression is a sequence of symbols or tokens interpreted by S. So, 2+2 is an expression. The value returned can be given a name and saved as an object (junk<-2+2). This object will exist in the working directory until you explicitly remove it. Assignments made to the working directory are written to the hard drive, so they survive crashes (as long as they are completed prior to the crash).

You can see what objects are stored in the working directory in two ways. Type objects() at the command line, or check the Object Explorer (check out the Visual Demo on Object Explorer). objects() can take arguments, e.g., pattern.

You do not have to store objects in your working directory. The function assign allows you to store them wherever you want on the search list (see below). You can even assign objects to the session frame, which means they will be deleted when you exit your session of S-PLUS. (See my function %<% on the STAT 5537 course website for assigning objects to the session frame.) Another alternative is to create a temporary file using the function tempfile, along with new.database (see the help file for new.database, and below).
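Here is a rough R sketch of the same ideas: an assignment, listing objects with a pattern, and storing an object somewhere other than the default place. R uses environments where S-PLUS uses databases, so the scratch environment below is only a stand-in, and the object names are made up.

```r
# A separate environment standing in for "somewhere other than the working directory".
scratch <- new.env()

junk <- 2 + 2                                  # an assignment: the value 4 gets a name
assign("junk.copy", junk, envir = scratch)     # store a copy elsewhere
assign("other", 10, envir = scratch)

objects(envir = scratch)                       # "junk.copy" "other"
objects(envir = scratch, pattern = "^junk")    # "junk.copy"

rm("other", envir = scratch)                   # objects stay until explicitly removed
remaining <- objects(envir = scratch)
```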

Quitting from the command line


Type q() at the command line.

Interrupting a running command (Press ESC)


Example: for(i in 1:10000) print(i)

Comments within an expression


Everything typed after a # on a single line is not read by S-PLUS
log(10) # This computes log(10) in base e

Interacting with the operating system (DOS or Windows for S-PLUS 2000)
One can send DOS commands while in S-PLUS using the dos function, if you are using a pre-XP version of Windows. One can invoke any Windows application using the system function, with the appropriate executable statement.
system("Notepad", multi=T)
dos("dir", output.to.S = T)

The next command calls the Borland C++ compiler


system("bcc32 -u- -6 -O2 -WDE -Ic:\\progra~1\\R\\rw1060\\src\\include convolve3.c R.lib",show.output.on.console = TRUE)

Also, see examples on the STAT 5537 course website. I used system to invoke an S-PLUS BATCH file from R. You can also do the reverse: i.e., run an R BATCH file from S-PLUS.

Restore Data Objects Button


If you make changes to an object using the assignment operator and later decide you don't want to keep the changes, use the Edit->Restore Data Objects selection to restore the object to its previous state. Also see the function undo. Whew!

Search List
When S-PLUS searches for objects or files, it searches databases or lists in a particular order. That order is returned by the search() command: search() lists, in order, the directories or lists that S-PLUS searches for objects. The database in position 1 is also where S-PLUS will write any new objects you create. The default first position is the working directory, by definition. Positions can be changed or replaced via the attach and detach commands. Thus, changing the working directory entails manipulating the search list. Here is an example of the result of search():
[1] "C:\\Program Files\\Insightful\\splus61\\users\\default" "default"
[3] "Laura"   "splus"
[5] "stat"    "data"
[7] "trellis" "nlme3"
[9] "menu"    "sgui"

To find where on the search list an object appears, use the function find. For example, to find the object (a function) rm0,
> find("rm0")
[1] "C:\\Program Files\\Insightful\\splus61\\users\\default"

Any S-PLUS object can be attached to the search list as long as it consists of named components (as does a list or data frame). See the help file for more information. To list the paths of the databases on the search list, use searchPaths.
> matrix(searchPaths(),nc=1)
      [,1]
 [1,] "C:\\Program Files\\Insightful\\splus61\\users\\default"
 [2,] "C:\\Program Files\\Insightful\\splus6\\users\\default"
 [3,] "C:\\Program Files\\Insightful\\splus6\\users\\Laura"
 [4,] "c:\\progra~1\\insightful\\splus61/library/splus"
 [5,] "c:\\progra~1\\insightful\\splus61/library/stat"
 [6,] "c:\\progra~1\\insightful\\splus61/library/data"
 [7,] "c:\\progra~1\\insightful\\splus61/library/trellis"
 [8,] "c:\\progra~1\\insightful\\splus61\\library\\nlme3"
 [9,] "c:\\progra~1\\insightful\\splus61\\library\\menu"
[10,] "c:\\progra~1\\insightful\\splus61\\library\\sgui"
[11,] "c:\\progra~1\\insightful\\splus61/library/main"
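The search-list mechanics described above carry over to R almost unchanged. The following is a sketch in R (not S-PLUS), with a made-up list demo and database name demo.db, showing that an object with named components can be attached, located, and detached.

```r
# A list with named components can be attached as a database.
demo <- list(demo.value = 42)
attach(demo, name = "demo.db")             # goes to position 2 by default

db.pos <- match("demo.db", search())       # where did it land on the search list?
found  <- exists("demo.value")             # now visible through the search list
holder <- find("demo.value")               # which database holds it

detach("demo.db", character.only = TRUE)   # take it off the search list again
gone <- !("demo.db" %in% search())
```

One difference worth knowing: in R, new objects are written to the global environment (position 1) no matter what is attached, so attach is used for reading rather than for redirecting where assignments go.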

Installing libraries
Libraries contain functions, objects, and sometimes data sets (which are stored as objects) to do specific types of tasks. Many libraries have functions that perform statistical analyses. For example, the tree library contains functions associated with fitting regression and classification trees. S-PLUS comes with some libraries already. Other libraries (contributed by users) have to be downloaded from the statlib website. See Appendix C.2 of V&R.

To load a library that comes with S-PLUS, do one of the following:
Type at the command line:
library(libname) # libname is the name of the library
In S-PLUS, from the File menu, select Load Library, and follow the instructions.

What loading a library does is place its objects onto the search list.

To have a library load every time you open a chapter (project) in S-PLUS, create a text file called S.chapters in the working directory for the project. For each library you want to load at startup in that project, type the path to the library. For example, to load MASS each time I start up in a particular chapter, the file S.chapters would look like
c:/program files/insightful/splus61/library/mass
A similar thing can be done in the Rprofile file (c:\Program Files\R\rw1061\etc\Rprofile) in R. Alternatively, issuing a library or require command within a .First() function within the working directory will load the library at start-up in that directory.

Getting Help
A help file exists for most S-PLUS commands and objects that come built-in with the system. One can access help using one of the following:
help(topic) or ?topic, typed at the command line, starts the help window
Within a script window: highlight the object and press F1
Help menu, including pdf manuals

Many user-contributed libraries also have help files for their functions. To invoke help for a function in a library, type help(function.name, library=libname) at the command line, where function.name is the name of the function on which you want help. See Appendix A of V&R. To get a description of a particular library, use library with the help argument set to the name of the library. For example, to get a description of the MASS library, do

> library(help=MASS)

History Log
A history of the commands you have given S-PLUS via the command line or the GUI is stored in text form in the history log. To see the contents of this log, do one of the following:
Press the history log button (this opens a script file containing the history log)
Access History Log under the Window menu
Press the history interactive button

In R, the command history is saved to a file called .Rhistory when the image is saved to a .RData file using the command save.image (see below).

Customizing the S-PLUS Environment


You can customize almost everything in S-PLUS. Here are some tools for basic customization.
options() command

To change the prompt to STAT_5537>:
options(prompt="STAT_5537>")
To save old options:
options.old<-options() # assign old options to options.old
options(options.old) # change back to old options

Options menu
Graph Options (later)
General Settings
Command Line

Create a .First function:
.First<-function(){
    library(MASS, first=T)
}
You may also have a .Last function for cleanup on exit.

The getenv() function lists current settings for environment variables. S_CWD is the current working directory.
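The save-and-restore pattern for options works the same way in R. Here is a small sketch using the digits option (chosen only because its effect is easy to see); in R, a call that sets an option returns the previous value, so the old setting can be captured in the same step.

```r
old <- options(digits = 4)        # set a new value; the previous value is returned
short <- format(pi)               # "3.142" under the new setting
options(old)                      # change back to the old options
restored <- getOption("digits")   # back to whatever it was before
```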

Data Import and Export


One can import and export data using the command line or menus. The following topics are related to database management.

Creating new databases


Once in S-PLUS, you can create a new database into which objects will be stored.
new.database("dirname")                 # creates a new database in the working directory
                                        # Folders named _Data and _Prefs will be created.
new.database("c:\\project1\\dirname")   # creates a database in the directory indicated

attach("dirname")          # move dirname to position 2 in search list (the default)
attach("dirname", pos=1)   # move dirname to position 1 in search list
detach("dirname")

See also mkdir and rmdir and links in the help files. Try
junk<-new.database(paste(getenv("S_CWD"),"\\STAT_5537",sep=""))
attach(junk, pos=1)

The result of a call to attach is an object of class "attached". Assigning the call to a name gives a way to refer to that attached database using the name. For example,
mydb<- attach(my.fuel.frame)
identical(mydb, database.attached("fuel.frame")) # are these identical databases?
[1] T

mydb2<- attach(directory.name) # attach a directory with .data folder
assign("x", 0, w=mydb2) # assign x with value 0 to database mydb2

Not all attached databases have to appear in the search list (And if name argument is not specified, then they appear as in the list). The purpose argument of attach allows other ways of attaching. For example, purpose="data" implies that the new database is to be used only for explicit requests, and that it will not appear in the search list (Chambers, 1998). A database attached with purpose="data" will not be used in the standard search for objects, and so you will never accidentally get an object from that database instead of the intended database.

Temporary Databases
To make a workspace that you can throw away at the end of a session, type at the command line
attach(pos=1, new.database(work.dir<-tempfile()))

There is also the option of saving objects to the session database or session frame. This is frame number 0. To save objects to the session database, use
assign(name, value, fr=0) # assigns to frame 0 (disappears when you quit S+)

A convenient operator function for doing this is

"%<-%" = function(x, y, fr = 0) {
    x <- deparse(substitute(x))
    assign(x, y, fr = fr)
}

So that the above assign statement and the following are now equivalent
name%<-%value

In R, all objects are deleted when you quit R unless they have been saved to an image via save.image(). This will save the database into a file called .RData. save.image can be called from the File menu as well, in Rgui.

Removing Objects
To remove objects from your working directory, type rm(object) at the command line. To remove objects from other databases, type remove("name of object", where=database.position) at the command line, where database.position is the position of the database in the search list. A convenient way to remove objects from the session database uses the following function
rm0<-function(x){
    remove(deparse(substitute(x)), where=0)
}

Now, rm0(e) removes the object e from the session database (the name is given unquoted, because the function converts it to a character string itself).

Important Note: With the assignment function, <-, you cannot directly modify any objects in databases other than the one in position 1. So, you can't accidentally overwrite built-in S functions! If you try to overwrite an object in one of those databases (via the assignment function), S-PLUS will make a copy of the modified object in your working directory. This modified object will mask the original until you remove it.

Reading in databases from text files


Databases can be read in from text files, spreadsheets, other formats (SAS, SPSS, etc) or from the console itself. Once read in, a database can be manipulated in lots of different ways, within a data window or via the command line. Here are some relevant functions.

The read.table function (See informative help file)


Example: Read in a text file containing the fuel.frame data from the base library.
system("Notepad", multi=T) # examine the file
fuel.frame<-read.table("a:fuel.txt", header=T, row.names="Car", sep=" ")
count.fields("a:fuel.txt")

If you use a file in a different directory, write each backslash in the path as a double backslash (\\). For example
> read.table("c:\\my documents\\fuel.txt", header=T, row.names="Car", sep=" ")

You can also use forward slashes: "c:/my documents/fuel.txt". The argument na.strings specifies which strings in the file are to be read as missing values, in place of the default NA.
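To try read.table without a floppy disk, here is a self-contained sketch in R. The file contents are made-up values standing in for fuel.txt, and the "." used to mark a missing value is my own choice for illustration.

```r
# Write a tiny stand-in for fuel.txt to a temporary file (made-up values).
path <- tempfile(fileext = ".txt")
writeLines(c("Car Weight Mileage",
             "CarA 2560 33",
             "CarB 2345 .",
             "CarC 1845 37"), path)

count.fields(path)     # every line should report the same number of fields

cars <- read.table(path, header = TRUE, row.names = "Car",
                   na.strings = ".")   # "." in the file marks a missing value
cars
unlink(path)           # remove the temporary file
```

Checking count.fields first is a cheap way to catch ragged lines before read.table complains about them.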

The scan function (See informative help file)


scan handles more complex data files, with records spanning many lines, and has more flexibility. The help file describes its features. With the what argument, you can specify the structure of the data to be read in, and each variable's type or class. In addition, you do not have to read all of the variables or all of the observations. Here, I read from a fictitious data set where the observations are labeled X1, X2, ..., X15. Each observation spans three lines; each line begins with the character id variable, and the three lines together hold 20 numeric measurements. I will read in only the first three observations, and only the middle 7 variables per observation.
X1 23.39138 31.03014 37.01177 26.36072 23.30528 24.43292 30.70835
X1 28.19607 18.46590 29.90596 25.15313 32.04329 27.00844 22.18513
X1 18.73721 25.03833 22.83223 17.65843 28.81536 NA
X2 25.05268 21.22678 25.25613 22.19962 23.02498 25.13215 32.20992
X2 26.20181 26.16561 30.77551 21.34217 25.13774 22.48904 21.76138
X2 22.12924 24.26280 21.94572 22.83311 30.93580 19.99993
X3 17.30856 24.10714 25.21277 10.54811 25.55041 31.21184 23.46410
X3 21.09150 30.82490 24.76578 27.28496 29.46369 21.13336 27.47658
X3 26.94256 28.56601 25.85390 26.00526 31.82780 26.66346

test<-scan("a:/TextData/scanExample.txt",
    what=list(NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,
              "",0,0,0,0,0,0,0,
              NULL,NULL,NULL,NULL,NULL,NULL,NULL),
    multi=T, n=23*3)
test<-test[unlist(lapply(test, function(x) !is.null(x)))]
as.data.frame(test[-1], row.names = test[[1]])
        X.1      X.2      X.3      X.4      X.5      X.6      X.7
X1 28.19607 18.46590 29.90596 25.15313 32.04329 27.00844 22.18513
X2 26.20181 26.16561 30.77551 21.34217 25.13774 22.48904 21.76138
X3 21.09150 30.82490 24.76578 27.28496 29.46369 21.13336 27.47658

See also the S Commands file on the STAT 5537 course website. Look up scan for a fairly complex example.
scan can also be used for reading interactively from standard input (i.e., the console). For example,

> junk<-matrix(scan(),nr=2,nc=2)
1: 2
2: 2
3: 4
4: 1
5:
> junk
     [,1] [,2]
[1,]    2    4
[2,]    2    1

scan can be used to read in fixed width fields. V&R discuss this on page 47 of their text. I did something similar to what they describe on pages 47-48 to extract relevant information out of a complicated subject ID string (see Part II of this manual).

Reading in databases from other formats (or text files)


To read data in from Excel, SAS, SPSS, Minitab, etc., use the importData function (previously, the import.data function). The alternative menu command (which really is importData) is File->Import Data; then specify the format. Consult the options available for each format.

Data can be read in sequential blocks using the readNextDataRows function on a data handle object. A data handle object is created using the openData command on an external file. After reading all the data you want, close the handle using the closeData function. A simple example is as follows. Suppose we have an Excel file called mydata.xls. We can open it using openData, and specify the number of rows to read with each call to readNextDataRows by giving the argument rowsToRead. We can get information on the variable names from the call to getDataInfo.
handle.mydata <- openData("mydata.xls", type="EXCEL", rowsToRead=100)
getDataInfo(handle.mydata)
# Read the first 100 observations.
mydata.100 <- readNextDataRows(handle.mydata)
# Close the external data file.
closeData(handle.mydata)

In R, consult the Data Import/Export manual.

Data Window Editor (also see below)


To call up the spreadsheet-like data editor in S-PLUS, use the Data menu-> Select Data. Then, select new for a new dataset or existing data for an existing data frame or data sheet. Alternatively, type at the command line
Edit.data(my.data.set)

or
menuSelectData(data.source = "Existing Data", existing.name = "my.data.set")

In order for a data set to be included in the list of existing data, it must be either a data frame or data sheet. To find out the class of an object, use the class function.
class(my.data.set)
[1] "data.frame"

R also has a data editor window. Call it up with menu item Edit->Data Editor, then select the name of a data frame. Or, use fix(x) where x is a data frame.

Writing out data sets to ASCII files


The function write.table will write out a data frame to an ASCII file. You can specify the separator between fields and whether the column and row names are printed. For example, to write out the fuel.frame data frame to a file called fuel.txt, type at the command line:
write.table(fuel.frame, file="fuel.txt", sep=",", dimnames.write=T)

Check what happens when you put dimnames.write=F. Consult the functions write and write.matrix for other forms of writing to external files.
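In R, write.table has no dimnames.write argument; row and column names are controlled by row.names and col.names instead. The following R sketch uses a small made-up data frame and a temporary file, and reads the file back to confirm that the names survive the round trip.

```r
df <- data.frame(Weight = c(2560, 2345), Mileage = c(33, 33),
                 row.names = c("CarA", "CarB"))   # made-up values

out <- tempfile(fileext = ".txt")
# col.names = NA writes a blank header cell above the row names.
write.table(df, file = out, sep = ",", row.names = TRUE, col.names = NA)

readLines(out)                                    # inspect the raw text
back <- read.table(out, header = TRUE, sep = ",", row.names = 1)
unlink(out)                                       # remove the temporary file
```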

Writing out data frames to other formats


One can also write out data frames to other formats, such as those read by importData (see above). The function exportData (or menu File->Export Data) in S-PLUS does this. Many of the same arguments are available as with importData. Also, one can use writeNextDataRows just as one can use readNextDataRows. The same procedure is involved with openData and closeData.

Transferring objects between computers


The functions dump and data.dump will write out S-PLUS objects to external files in assignment form. Then, the files can be read into S-PLUS again (possibly on another machine) using the functions source and data.restore, respectively. The file created with dump can be read easily by a human, but the output from data.dump cannot so easily. data.dump can only be used with data.restore, which is faster for reading back in large objects. For example, in S-PLUS
data.dump(c("fuel.frame","hills"), file="mydata.sdd") # .sdd not necessary, but useful
data.restore(file="mydata.sdd") # with a .sdd file, this line is not necessary
                                # (instead, just double-click the file's icon)
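The human-readable route (dump, then source) also exists in R. Here is a sketch with a made-up two-row data frame named hills.small; the two environments e1 and e2 stand in for two different machines, and are my own device for keeping the example self-contained.

```r
e1 <- new.env()
assign("hills.small",
       data.frame(dist = c(2.5, 6), time = c(16.1, 48.4)),   # made-up values
       envir = e1)

f <- tempfile(fileext = ".R")
dump("hills.small", file = f, envir = e1)   # writes an assignment statement as text
readLines(f)                                # the file is readable by a human

e2 <- new.env()                             # stands in for "another machine"
source(f, local = e2)                       # re-creates the object from the text
same <- all.equal(get("hills.small", envir = e1),
                  get("hills.small", envir = e2))
unlink(f)                                   # remove the temporary file
```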

In R, the functions save and load do the analogous.

save(fuel.frame, hills, file="mydata.rda")
load(file = "mydata.rda")

There are several other options for save, such as whether to save in ascii format or compressed format. See online help via ?save.

The S-PLUS GUI

Object Explorer: Also see the Visual Demo


Because everything is an object in S-PLUS, you can view and edit all your objects using the Object Explorer. The User's Guide for S-PLUS 2000 has nice sections on using the Object Explorer (p. 401 ff). Each explorer page contains folders which contain shortcuts to objects. The default page contains folders Data, Graphs, Reports, and Scripts. The Object Explorer Window is split into two panes. The left pane contains folders which give a map of the objects in the currently viewed directory. The Details view of the Explorer window is the most informative.

Creating and saving an Object Explorer


To create a new object explorer, click the Object Explorer button on the standard toolbar (this will default to list your current working directory), or click the New button on the standard toolbar, or select File->New from the menu. To save an object explorer, use File->Save or the right-click shortcut menu.

Customizing an Object Explorer

Set preferences for an Explorer using Format->Object Explorer or the right-click shortcut menu (right-click in the white space in the right pane). Consult the User's Guide for descriptions of the options. When you are done customizing, you can save your preferences as the new default Object Explorer. All new Object Explorers will then have these preferences.

Inserting and deleting Explorer Pages and folders within a page


The User's Guide has a nice tutorial on how and why to do these tasks.

Manipulating objects in an Explorer Page


Again, the User's Guide has a nice tutorial on how and why to do tasks like moving and copying objects between and within pages and explorers, creating objects, and modifying and deleting objects.

Creating a Search Path folder and using it to attach databases


One way to attach databases is through a search path folder in an object explorer. Create a new folder called SearchPath that filters on searchpath objects (see p. 409 of the User's Guide to see how). Right-click on the SearchPath icon in the Search Path folder and select Attach Database from the shortcut menu. Follow the dialog instructions.

Script Windows in S-PLUS


You can enter S commands (and run them) in a script window rather than the command window. You can then save the commands in a script file (a text file with extension .ssc). A new script file can be created using the File->New menu or by typing guiCreate("Script", Name="name of script file") at the command line.

When in a script window, type S expressions and run them by highlighting them with the mouse and pressing F10. You can also create functions in scripts and read them into S-PLUS by highlighting them and pressing F10.

Other useful tools within a script window:
Click on a function name. Then, right-click to get the Shortcut menu: Show Dialog; Expand Inplace
Find tool

To read in a set of commands in a script file (extension .ssc or .s), type source("filename.ssc") at the command line.

Opened script files can be filtered into an Explorer Page. In this way, one can drag and drop (or copy and paste) any object onto the script file, and the code used to reproduce the object will be written to the file. This is especially useful for graph objects. Both the script file and the graph sheet must be open to do this.

Other forms of script editors: ESS under EMACS


At a later date

Data Window Editor


Use Data->Select Data to open an existing data set in a spreadsheet or to create a new data set (you can also open an existing data set by double clicking the entry in the Object Explorer). After you make changes within the spreadsheet, exit the data window to save the changes. You can also save the changes into another file with extension .sdd. Within the data window, the buttons on the toolbar are self-explanatory. Arguably more interesting is the Data Menu. The Restructure->Stack (or Unstack) is a particularly useful feature. See the pictures below for an example of its use. (Note the corresponding command to this menu selection from the History Log.) Here, we have 5 observations from each of three groups. We want to change this to two columns: one groups column, and the other a measurements column.

junk3 <- menuStackColumns(target = junk3,
    target.col.spec = list("Measurement"),
    source = junk2,
    source.col.spec = list("Group1", "Group2", "Group3"),
    group.col.p = T, group.col.name = "Group")

I also used Data->Exchange to exchange the Group and Measurement columns:

junk3 <- gui.exchange.col(target = junk3, target.col.spec = list("Group"),
    source = junk3, source.col.spec = list("Measurement"))
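The same restructuring can be sketched at the command line in R with the stack function. This is R rather than the S-PLUS menu functions above, and the measurements are made-up values: three groups of five observations go from three columns to a Group column and a Measurement column.

```r
wide <- data.frame(Group1 = c(1.1, 1.2, 1.3, 1.4, 1.5),
                   Group2 = c(2.1, 2.2, 2.3, 2.4, 2.5),
                   Group3 = c(3.1, 3.2, 3.3, 3.4, 3.5))   # made-up values

long <- stack(wide)                          # two columns: values, ind
names(long) <- c("Measurement", "Group")
long <- long[, c("Group", "Measurement")]    # exchange the two columns
head(long, 3)
```

R's unstack reverses the operation, much like the Unstack menu item.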

Categorizing a continuous variable using the Data Editor


Suppose we want to categorize the variable Mileage in the data frame fuel.frame. We can do this very simply using the Data->Create Categories dialog. The five categories we will create will have the same number of observations. (In the following dialogs, note that I renamed the object fuel.frame to my.fuel.frame.)

Note that a command equivalent exists for categorizing a continuous variable (cut function). Subsetting a data set and merging two datasets can also be accomplished via the GUI.
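As a sketch of the command-line route, here is how cut can produce five categories with equal counts in R. The mileage values are made up (they are not the fuel.frame data), and using quantile to get the break points is my own choice for matching the "same number of observations" behaviour of the dialog.

```r
mileage <- c(18, 19, 20, 21, 22, 23, 24, 25, 26, 27,
             28, 29, 30, 31, 32, 33, 34, 35, 36, 37)   # made-up values

# Quintile break points give five categories with equal counts.
breaks <- quantile(mileage, probs = seq(0, 1, by = 0.2))
cat.mileage <- cut(mileage, breaks = breaks, include.lowest = TRUE)

table(cat.mileage)   # four observations in each of the five categories
```

Without include.lowest = TRUE, the smallest value would fall outside the first interval and come back as NA.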

Graphics via the GUI


Graphics can be created using the command line or the GUI. Many times I start with a command, then modify aspects of the graph using the GUI. Graphs are highly customizable once created. You can change just about everything because graphs are made up of S objects. This is called editable graphics. To ensure that all graphs you create from the GUI or from the command line are editable, set your graph options via the Options menu. See the help file for details. Here is the Two-dimensional plot palette. This describes all the basic 2D plots one can do.


You can see what each plot resembles from the Insert Menu->Graph within any graphsheet. If you insert a graph into the current graphsheet, any existing graphs on that sheet move to make room for the new graph when it is finished being created. The arrangement of the graphs on a sheet can also be changed by the user, either through the Format->Arrange Graphs Menu or just by moving with the mouse.

Here is the three-dimensional graph palette:

Trellis graphs are series of graphs conditioned on a set of categorical values. For example,

[Figure: Scatter Plot of Fuel Consumption by Weight of Car (conditioned on Mileage category). Five panels, one per Cat.Mileage category: 18.0+ thru 20.8, 20.8+ thru 23.0, 23.0+ thru 25.0, 25.0+ thru 28.0, and 28.0+ thru 37.0. x-axis: Weight; y-axis: Fuel.]

Exporting GraphSheets
The File Menu has a selection called Export Graphs that allows you to export a graph in a variety of formats. In addition, there is a clipboard button that by default sends your graph to the clipboard, but can be customized to send it to another application. Finally, there is a function called pdf.graph which will send your graphical output to a PDF file. I believe also that the Design or Hmisc libraries have a function that transforms your graph into LaTeX format. I can give more information at a later date.

Excel and PowerPoint Wizards:


See the Visual Demo

Introductory Example: Read in a dataset and look at it visually and graphically


The fuel.frame data set and cu.summary data sets that come with S-PLUS give information on makes of cars taken from the April, 1990 issue of Consumer Reports. A merged data frame contains data on 8 variables (columns) for 60 cars (rows). These are the variables in the merged data frame:

Mileage      a numeric vector of gas mileage in miles/gallon as tested by CU; contains NAs.
Type         a factor giving the general type of car. The levels are: Small, Sporty, Compact, Medium, Large, Van.
Weight       an order statistic giving the relative weights of the cars; 1 is the lightest and 111 is the heaviest.
Disp.        a numeric vector giving the engine displacement in liters.
Price        a numeric vector of list price with standard equipment, in dollars.
Country      a factor giving the country in which the car was manufactured. The levels are: Brazil, England, France, Germany, Japan, Japan/USA, Korea, Mexico, Sweden, USA.
Reliability  an ordered factor; Much worse < worse < average < better < Much better; contains NAs.
Fuel         ?

First, we merge the two data sets by their row names (the names of the cars). The command to do this is given here.
fuel5537.fr<- merge(fuel.frame,cu.summary, by=c("row.names",intersect(names(fuel.frame),names(cu.summary))))

Here are the first 5 rows of the new data frame:


> fuel5537.fr[1:5,]
          Row.names Mileage    Type Weight Disp.     Fuel Price Country Reliability
1   Acura Legend V6      20  Medium   3265   163 5.000000 24760   Japan Much better
2         Audi 80 4      27 Compact   2670   121 3.703704 18900 Germany          NA
3   Buick Century 4      21  Medium   2880   151 4.761905 13150     USA     average
4 Buick Le Sabre V6      23   Large   3325   231 4.347826 16145     USA     average
5   Buick Skylark 4      23 Compact   2640   151 4.347826 10565     USA       worse

But now we have lost the row.names. Let's get them back:

> row.names(fuel5537.fr)<-fuel5537.fr[,"Row.names"]
> fuel5537.fr<-fuel5537.fr[,-match("Row.names",names(fuel5537.fr))]

Here are the first 5 rows again:

> fuel5537.fr[1:5,]
                  Mileage    Type Weight Disp.     Fuel Price Country Reliability
  Acura Legend V6      20  Medium   3265   163 5.000000 24760   Japan Much better
        Audi 80 4      27 Compact   2670   121 3.703704 18900 Germany          NA
  Buick Century 4      21  Medium   2880   151 4.761905 13150     USA     average
Buick Le Sabre V6      23   Large   3325   231 4.347826 16145     USA     average
  Buick Skylark 4      23 Compact   2640   151 4.347826 10565     USA       worse
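The merge-and-restore pattern above can be tried on a toy example. This is only a sketch, written in R syntax (which I am assuming behaves like S-PLUS for these functions); the data frames fuel.toy and price.toy and all of their values are made up for illustration.

```r
# Two made-up data frames whose row names are car names (hypothetical values)
fuel.toy <- data.frame(Mileage = c(33, 20, 27),
                       Weight = c(2560, 3265, 2670),
                       row.names = c("Car A", "Car B", "Car C"))
price.toy <- data.frame(Price = c(8895, 24760, 18900),
                        row.names = c("Car B", "Car A", "Car D"))

# Merge on the row names; only cars present in both data frames are kept
both <- merge(fuel.toy, price.toy, by = "row.names")
names(both)    # the car names now sit in a column called Row.names

# Move the car names back into the row names and drop the extra column
row.names(both) <- as.character(both[, "Row.names"])
both <- both[, -match("Row.names", names(both))]
both
```

Cars C and D drop out because each appears in only one of the two data frames.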

We can get summaries of some of the variables by Country and Type of car from the Statistics Menu -> Data Summaries:


We will get summaries of Weight, Disp., Fuel, Price, and Reliability by Country and Type of car. The descriptive statistics to compute are chosen in the Statistics tab.


Pressing OK gives an awful lot of information (and some Country/Type combinations contain no data). The output will be given in a Report window, which you can copy or save. The choice of output routing can be changed via the Options menu. To reduce the amount of information, let's get summaries only by Type of car. Also, to be able to fit everything on the page, we'll only get numeric summaries, not factors. We will do so using the command by. First, which columns are factors? The following expression will tell you. (It looks each column up by name, so the columns must be visible on the search list, e.g. after attaching the data frame.)
> sapply(names(fuel5537.fr), function(x) is.factor(eval(parse(text=x))))
[1] F T F F F F T T

You might think that the command apply(fuel5537.fr, 2, is.factor) would work, but it does not: apply first converts the data frame to a matrix, and the factor information is lost in the conversion. So, we can do
> check.factors<-sapply(names(fuel5537.fr), function(x) is.factor(eval(parse(text=x))))
> by(fuel5537.fr[,!check.factors], fuel5537.fr$Type, summary)


fuel5537.fr$Type:Compact
      Mileage         Weight          Disp.               Fuel          Price
    Min.:21.00     Min.:2575     Min.:116.0      Min.:3.703704     Min.: 9483
 1st Qu.:23.00  1st Qu.:2663  1st Qu.:127.0   1st Qu.:3.923077  1st Qu.:10755
  Median:24.00   Median:2780   Median:135.0    Median:4.166667   Median:11588
    Mean:24.13     Mean:2821     Mean:140.4      Mean:4.167655     Mean:12853
 3rd Qu.:25.50  3rd Qu.:2928  3rd Qu.:148.5   3rd Qu.:4.347826  3rd Qu.:14195
    Max.:27.00     Max.:3110     Max.:181.0      Max.:4.761905     Max.:18900
--------------------------------------------------------------------------------
fuel5537.fr$Type:Large
      Mileage         Weight          Disp.               Fuel          Price
    Min.:18.00     Min.:3325     Min.:231.0      Min.:4.347826     Min.:14525
 1st Qu.:19.00  1st Qu.:3588  1st Qu.:266.5   1st Qu.:4.673913  1st Qu.:15335
  Median:20.00   Median:3850   Median:302.0    Median:5.000000   Median:16145
    Mean:20.33     Mean:3677     Mean:279.3      Mean:4.967794     Mean:15976
 3rd Qu.:21.50  3rd Qu.:3853  3rd Qu.:303.5   3rd Qu.:5.277778  3rd Qu.:16701
    Max.:23.00     Max.:3855     Max.:305.0      Max.:5.555556     Max.:17257
--------------------------------------------------------------------------------
fuel5537.fr$Type:Medium
      Mileage         Weight          Disp.               Fuel          Price
    Min.:20.00     Min.:2765     Min.:143.0      Min.:4.347826     Min.: 9999
 1st Qu.:21.00  1st Qu.:2975  1st Qu.:153.0   1st Qu.:4.545455  1st Qu.:13150
  Median:22.00   Median:3200   Median:180.0    Median:4.545455   Median:14980
    Mean:21.77     Mean:3196     Mean:175.8      Mean:4.601413     Mean:16201
 3rd Qu.:22.00  3rd Qu.:3450  3rd Qu.:182.0   3rd Qu.:4.761905  3rd Qu.:17899
    Max.:23.00     Max.:3610     Max.:232.0      Max.:5.000000     Max.:24760
--------------------------------------------------------------------------------
fuel5537.fr$Type:Small
      Mileage         Weight          Disp.               Fuel          Price
    Min.:25        Min.:1845     Min.: 73.00     Min.:2.702703     Min.:5866
 1st Qu.:28     1st Qu.:2260  1st Qu.: 91.00  1st Qu.:3.030303  1st Qu.:6599
  Median:32      Median:2295   Median: 97.00   Median:3.125000   Median:7399
    Mean:31        Mean:2258     Mean: 97.31     Mean:3.273380     Mean:7682
 3rd Qu.:33     3rd Qu.:2350  3rd Qu.:109.00  3rd Qu.:3.571429  3rd Qu.:8748
    Max.:37        Max.:2560     Max.:114.00     Max.:4.000000     Max.:9995
--------------------------------------------------------------------------------
fuel5537.fr$Type:Sporty
      Mileage         Weight          Disp.               Fuel          Price
    Min.:19        Min.:2170     Min.: 97.0      Min.:3.030303     Min.: 9410
 1st Qu.:24     1st Qu.:2695  1st Qu.:109.0   1st Qu.:3.571429  1st Qu.:10855
  Median:27      Median:2775   Median:133.0    Median:3.703704   Median:11545
    Mean:26        Mean:2799     Mean:164.1      Mean:3.957606     Mean:11717
 3rd Qu.:28     3rd Qu.:2885  3rd Qu.:153.0   3rd Qu.:4.166667  3rd Qu.:13071
    Max.:33        Max.:3320     Max.:305.0      Max.:5.263158     Max.:13945
--------------------------------------------------------------------------------
fuel5537.fr$Type:Van
      Mileage         Weight          Disp.               Fuel          Price
    Min.:18.00     Min.:3185     Min.:143.0      Min.:5.000000     Min.:12267
 1st Qu.:18.00  1st Qu.:3305  1st Qu.:146.0   1st Qu.:5.131579  1st Qu.:13972
  Median:19.00   Median:3665   Median:151.0    Median:5.263158   Median:14799
    Mean:18.86     Mean:3517     Mean:164.4      Mean:5.313283     Mean:14325
 3rd Qu.:19.50  3rd Qu.:3713  3rd Qu.:181.5   3rd Qu.:5.555556  3rd Qu.:14937
    Max.:20.00     Max.:3735     Max.:202.0      Max.:5.555556     Max.:15395
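Why does apply miss the factors while sapply finds them? The following sketch, in R syntax (assumed here to behave like S-PLUS on this point), shows the difference on a made-up data frame called cars.toy; the values are hypothetical. Passing the data frame itself to sapply is a shorter alternative to looking the columns up by name.

```r
# Made-up data frame with one factor column (hypothetical values)
cars.toy <- data.frame(Mileage = c(33, 20, 27, 21),
                       Type = factor(c("Small", "Medium", "Compact", "Medium")),
                       Price = c(8895, 24760, 18900, 13150))

# apply() converts the data frame to a matrix first, so no column is seen as a factor
apply(cars.toy, 2, is.factor)

# sapply() visits the columns as list components, so the factor is detected
check.factors <- sapply(cars.toy, is.factor)
check.factors

# Numeric summaries of the non-factor columns, one block per Type
res <- by(cars.toy[, !check.factors], cars.toy$Type, summary)
res
```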


Now, we get graphical summaries. We'll use the 2D palette and select the variables we want to use from the data in the data editor.

[Figure: histograms of Mileage, Weight, Fuel, and Price.]

A tip for modifying several objects at once: choose Edit->Select All Axes (or Plots, or Graphs, or Lines, etc.), then Format->Selected Objects, and make your modifications. These modifications apply to all selected objects. Another tip: the right-click menu for any selected object has an option to save those preferences as default. For example, I rarely like all four borders on a graph. So, once I remove the top and right side borders, I save the x-axis and y-axis as the default. For a trellis display of histograms of Price by Type and Country, first set the conditioning mode on, then select the number of variables to condition on (2, here). We'll use Log(Price) though.


What is the problem? Too few observations within these groups. So, we use only Type as a grouping variable.

[Figure: Trellis histograms of LogPrice, one panel per Type (Compact, Large, Medium, Small, Sporty, Van). x-axis: LogPrice.]


Side-by-Side Boxplots of LogPrice by Reliability score: Turn off conditioning mode. Select Reliability first, then LogPrice. Press Boxplot button on 2D palette.

[Figure: side-by-side boxplots of LogPrice (y-axis) by Reliability (x-axis: Much worse, worse, average, better, Much better).]


Now, bivariate plots: This is a scatterplot matrix.

[Figure: scatterplot matrix of Weight, Price, Mileage, Disp., and Fuel.]


This is the Trellis version: splom(~fuel5537.fr[,!check.factors])

Next time we will see how to add a loess curve to each plot.

[Figure: Trellis scatterplot matrix (splom) of Mileage, Weight, Disp., Fuel, Price, and LogPrice.]

I should add that for all these graphs, I was able to format them within Microsoft Word 2000 by double clicking on each graph. That is because S-PLUS supports OLE automation for object linking and embedding.

A second point is that if any of these graphs were saved (as .sgr files in my S-PLUS working directory), they would be dynamically linked to the data sets used to create them. So, when the data changes, so does the graph. This can be good or bad. If you don't want this, then embed the data into the graph. To do this, open your graphsheet (or put it in focus), go to the Graph Menu, and select Embed Data. A graphsheet can also be linked with Excel data. See the information on using S-PLUS within Excel in the User's Manuals.


3D Plots: S-PLUS has the ability to do lots of types of 3D plots, including contour plots. Here is a 3D scatter plot.

Trellis graphs work for 3D plots as well.


Brush and Spin: Do for homework!


Part II: Introduction to S objects and their properties (Language Aspects)


Assignments
To assign a name to an expression (which is an S object), use the <- operator, which is binary. For example, type name.of.object<-S.expression at the command line, where name.of.object and S.expression are the name and expression of interest. This is also the way to give an object a different name: new.name<-old.name. Binary operators like <- are really just functions taking two arguments. More about functions appears later. S is a functional language because commands are carried out via series of functions, which optionally return one or more values. Chained assignments are evaluated from right to left, so a<-b<-c<-4 assigns 4 to c first, then to b, then to a. The following are more examples of making assignments in S:
a<-b<-c<-4
a<-4; b<-4
a<-.Last.value

If you want to see the result of an assignment with only one command, then enclose the whole assignment statement within parentheses. For example,
(a<-4)
[1] 4
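As a quick illustration of these points, here is a small sketch in R syntax (assumed to match S-PLUS for assignment). The object names d, e, f, and new.name are arbitrary choices made for this example.

```r
a <- b <- d <- 4       # chained: d gets 4 first, then b, then a
(e <- a + b)           # the parentheses make the assigned value print
`<-`(f, 10)            # the operator called as an ordinary two-argument function
new.name <- e          # the same value under a second name
```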

Naming Conventions
Object names must start with a letter or period and may contain letters, numbers, and periods. S is case sensitive. Spaces are not allowed between characters. Some names are already in use by the language. Examples are s, c, C, T, and F. The last two are examples of reserved names, which cannot be assigned (an error will result). For the first three, you can assign an expression to these names, but your object will mask the original objects with those names. That doesn't mean the original objects are gone, because your assignment (via <-) will only make changes to your working directory. To unmask an object, rename your object, then remove the first object from the working directory by typing rm(c), for example, at the command line. The command masked() will return any masked names. Related to masking is the function find, which finds where an object exists in your search list.

Properties of every object


Everything in S is an object. All objects have a mode and a length.

Mode
The mode of an object describes the type of object that it is. Examples of atomic modes are logical, numeric, character, complex and NULL. These describe data objects. Some modes that describe language objects are: list, function, graphics, expression, call, <-, etc. Most objects also have a class. Beginning with S-PLUS 6.0 for Windows, all objects will have a class. The class of an object defines how methods (special types of functions) are applied to it. Classes and methods are discussed later.

Length
The meaning of the length of an object depends on its mode. The length of an array is the number of elements it has. So, the length of a two or three dimensional array is the product of its dimension sizes. The length of a list is the number of components it has. The length of a data frame is the number of columns it has (although the function ncol will return that information as well). Zero-length objects exist. The vector numeric(0) is an empty vector, like an empty container. Also, the NULL object has length zero, and, according to V&R, is like no container. The length of a function is one plus the number of arguments. The additional 1 is from the function body.
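The rules above can be checked directly. This is a sketch in R syntax with made-up objects (m, lst, dfr); I have left out the length of a function because I am not sure S-PLUS and R agree on it.

```r
m <- matrix(1:6, nrow = 2)                   # a 2 x 3 array
length(m)                                    # product of the dimension sizes

lst <- list(a = 1:3, b = "text")
length(lst)                                  # number of components

dfr <- data.frame(u = 1:4, v = c(2.5, 3, 1, 0))
length(dfr)                                  # number of columns
ncol(dfr)                                    # the same information

length(numeric(0))                           # an empty container
length(NULL)                                 # no container at all
```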

Attributes
Objects may have attributes, other objects that are attached to the main object, but are of a subordinate status. For example, a list object (a collection of objects with different modes) has a names attribute, which can be retrieved using the function call names(my.list.object). The attributes of an object can be listed (in the S sense of the word) using the function call attributes(my.object). Attributes can be changed by the user. For example, to replace the row.names attribute of a data frame (row names are the labels for each of the rows) with the row names of another data frame, type
attr(data.frame.1, "row.names")<-row.names(data.frame.2)

OR
row.names(data.frame.1)<- row.names(data.frame.2)

The second instance is an example of a replacement function. See p. 16 of V&R. It is safer to use the replacement function if it exists because it does consistency checking. The function structure can be called to create an object along with its attributes. For example, to create a simple 2x2 matrix with attribute dimnames, type
structure(matrix(c(1,2,3,4),nr=2,byrow=T), dimnames=list(c("row1","row2"),c("col1", "col2")))
     col1 col2
row1    1    2
row2    3    4
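Here is a short sketch of inspecting and replacing attributes, in R syntax (assumed to carry over to S-PLUS). The objects m, d1, and d2 are made up for the example. Note that structure applied to a plain vector with a dim attribute fills by column.

```r
# Build a matrix from a plain vector by attaching dim and dimnames attributes
m <- structure(1:4, dim = c(2, 2),
               dimnames = list(c("row1", "row2"), c("col1", "col2")))
names(attributes(m))          # which attributes are attached to m
attr(m, "dim")                # retrieve one attribute

# Copy row names from one data frame to another with the replacement function
d1 <- data.frame(x = 1:2)
d2 <- data.frame(y = 3:4, row.names = c("first", "second"))
row.names(d1) <- row.names(d2)
d1
```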

Types of Objects
Functions
We have seen examples of functions since page 1. The structure of an S function is
name.of.function<-function(formal arg1 = actual arg1, formal arg2 = actual arg2, etc) {
    function body
}

Braces {} are only required when the function body contains multiple statements. The function is called by typing its name and inserting actual arguments for (at least) the required formal arguments. When functions are called, their arguments may be given names or determined by their order. S lets you omit argument names or supply just enough of the name to match the formal argument (but, I rarely use this feature). For example, to compute the 10% trimmed mean of the whole numbers from 1 to 5, we call the function mean, which takes a numeric object (like a numeric vector).
mean(c(1,2,3,4,5), tr=.1)    # the c function denotes combine or concatenate
                             # c(1,2,3,4,5) is a numeric vector
mean(1:5, tr=.1)             # alternatively

The formal trim argument is given the actual argument, .1. Note that we did not have to spell out trim completely, and we also did not explicitly state the formal argument, x for which 1:5 is the value. mean has another argument called na.rm, which determines how missing values (NA) are handled. This argument is not required to be given a value, and, in this case, has a default value, which is F. The function body may or may not contain a last expression, which would be the returned value from the call to the function. Functions may have side effects as well, which are not returned with the call of the function, but happen as a result of calling the function. A function is allowed to do nothing. Indeed, the shortest.s function,
shortest.s<-function(){}

is valid. To create a function, just decide on a name for it, and fill in the details of what it should do. For example, to create our own function to compute a trimmed mean
mean.vector<-function(x, trim=0, na.rm=F)
{
    if(na.rm) x<-x[!is.na(x)]                   # remove NAs if na.rm=T
    x<-sort(x)                                  # order the values so the extremes sit at the ends
    elements.to.trim<-floor(trim*length(x))     # number of elements in x to trim from each end
    if(elements.to.trim > 0)
        x<-x[-c(1:elements.to.trim,(length(x)-elements.to.trim+1):length(x))]   # trim the ends
    sum(x)/length(x)
}
mean.vector(c(1,2,3,4,5,NA),trim=.2,T)
[1] 3

Which is faster? Ours or S-PLUS?

dos.time(mean.vector(c(1,2,3,4,5,NA),trim=.2,T)) - dos.time(mean(c(1,2,3,4,5,NA),trim=.2,T))
[1] 0

Try again:
> now <- proc.time()
> mean.vector(c(1, 2, 3, 4, 5, NA), trim = 0.2, T)
[1] 3
> proc.time() - now
[1] 0.05 0.00 0.05 0.00 0.00

> now <- proc.time()
> mean(c(1, 2, 3, 4, 5, NA), trim = 0.2, T)
[1] 3
> proc.time() - now
[1] 0.06 0.00 0.06 0.00 0.00

More will be said about debugging functions later. S-PLUS comes with over 2,000 built-in functions, not including the functions from contributed libraries. Descriptions of the functions grouped by purpose or type can be found in the S-PLUS manuals or from the help utility. (See extra handout on statistical and mathematical functions)

Calling conventions for functions


Not all arguments are required to be specified for a function. Also, some functions can be written to accept an arbitrary number of unnamed arguments using the dot notation (...). Read p. 55-56 of V&R for some rules about giving values to named arguments and argument matching.
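A minimal sketch of the dot notation, in R syntax: average.all is a hypothetical function invented for this example, not part of S-PLUS. In R, an argument that follows ... (here trim) must be given by its full name; I am assuming S-PLUS behaves the same way, but check V&R.

```r
# Hypothetical function: accepts any number of unnamed values through ...
average.all <- function(..., trim = 0) {
  values <- c(...)                 # gather the unnamed arguments into one vector
  mean(values, trim = trim)
}

average.all(1, 2, 3, 4, 5)
average.all(1, 2, 3, 4, 100, trim = 0.2)   # the outlier is trimmed away
```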

Vectors and Matrices


There are six basic types of vectors: logical, integer, double, single, complex, and character. A standard numeric vector is, by default, double. A character vector is a vector of character strings. Types may not be mixed within a vector. Vectors can be created using the c function:
flavors<-c("chocolate", "vanilla", "strawberry")
scores<-c(5,5,5,3,6,2,4,5,3,4)

Logical vectors can be created using logical operators:


scores >= 5
[1] T T T F T F F T F F

A list of logical and comparison operators can be obtained via the help file. Search on logical. The elements of a vector can be named and accessed by name:
names(scores)<-paste("Name",1:length(scores),sep="")
names(scores)
[1] "Name1"  "Name2"  "Name3"  "Name4"  "Name5"  "Name6"  "Name7"  "Name8"  "Name9"  "Name10"

A matrix is a vector of elements with a dim attribute that gives the dimensions of the matrix.
dim(scores)<-c(2,5)      # will fill the matrix by column
scores
     [,1] [,2] [,3] [,4] [,5]
[1,]    5    5    6    4    3
[2,]    5    3    2    5    4
class(scores)
[1] "matrix"

To fill a matrix by row instead, use the function matrix, which takes a vector and creates a matrix by filling by column (the default) or by row (byrow=T). Multi-way arrays are extensions of this concept, with a dim attribute whose length is the number of dimensions. Dimensions are given names using the dimnames attribute.
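To see the two fill orders side by side, here is a sketch in R syntax using the same scores vector; the names by.col, by.row, and arr are just labels for this example.

```r
scores <- c(5, 5, 5, 3, 6, 2, 4, 5, 3, 4)

by.col <- matrix(scores, nrow = 2)                 # default: fill column by column
by.row <- matrix(scores, nrow = 2, byrow = TRUE)   # fill row by row
by.col
by.row

arr <- array(1:24, dim = c(2, 3, 4))               # a three-way array
length(dim(arr))                                   # number of dimensions
```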

Indexing vectors, matrices, and arrays


Extract the (i,j)th element of an n x m matrix named my.matrix:
res<-my.matrix[i, j]
Extract the ith row of my.matrix (and don't drop the column dimension):
res<-my.matrix[i, , drop=F]
Extract the jth column of my.matrix (and drop the row dimension):
res<-my.matrix[ ,j]

In the second instance, dim(res) is c(1, m). In the third instance, dim(res) is NULL. Also,
Replace the (i,j)th element of an n x m matrix named my.matrix:
my.matrix[i, j]<-2
Replace the ith row of my.matrix:
my.matrix[i, ]<-rep(2, m)

Vectors and multi-way arrays are indexed in the logically analogous way. For example, my.vec[c(3,5)] gives the third and fifth elements of my.vec. The functions ncol and nrow return the number of columns and rows for a matrix (they also work for a data frame). rbind and cbind will bind row-wise or columnwise, several matrices, vectors, or data frames. Indexing in general is treated in more detail later.
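The indexing rules can be tried on a small made-up matrix. This is a sketch in R syntax; my.matrix and my.vec hold arbitrary values chosen for the example.

```r
my.matrix <- matrix(1:6, nrow = 2)            # a 2 x 3 matrix

res.row <- my.matrix[1, , drop = FALSE]       # first row, kept as a 1 x 3 matrix
dim(res.row)
res.col <- my.matrix[, 2]                     # second column, dropped to a plain vector
dim(res.col)

my.matrix[2, ] <- rep(0, ncol(my.matrix))     # replace the whole second row

my.vec <- c(10, 20, 30, 40, 50)
my.vec[c(3, 5)]                               # third and fifth elements

stacked <- rbind(my.matrix, 7:9)              # add a row
widened <- cbind(my.matrix, c(-1, -2))        # add a column
```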

Lists
Lists are actually special cases of vectors. The elements of the vector are the components of the list. A list is a hierarchical data structure composed of different data objects, called components. Lists are used to collect different types of items together in one structure. The elements of a list do not have to be the same length. The components of a list are always numbered (and sometimes named, as well). As such, they can always be extracted by their number or position, and by their name, if available. Because lists can contain lists as components, which further can contain lists as components, etc, lists are sometimes called recursive objects.

Creating lists
A list can be created using the list function. For example, suppose we randomly select some people coming out of an ice cream store and ask them what they bought and how they would rate it (1 to 5). Here is a list of the data.
> x<-list(ice.cream.choice=c("vanilla", "chocolate", "strawberry", "chocolate"),
          topping.choice=c("chocolate","strawberry","chocolate","marshmallow"),
          rating=c(5,4,4,5))
> x
$ice.cream.choice:
[1] "vanilla"    "chocolate"  "strawberry" "chocolate"

$topping.choice:
[1] "chocolate"   "strawberry"  "chocolate"   "marshmallow"

$rating:
[1] 5 4 4 5

Lists are concatenated using the c function. Let's add the number we actually approached as well. So, six people chose not to answer the survey.
x<-c(x, number.surveyed=10)
> x
$ice.cream.choice:
[1] "vanilla"    "chocolate"  "strawberry" "chocolate"

$topping.choice:
[1] "chocolate"   "strawberry"  "chocolate"   "marshmallow"

$rating:
[1] 5 4 4 5

$number.surveyed:
[1] 10

If we had set the recursive argument to TRUE, prior to concatenation with number.surveyed, the list x would have been unlisted. Thus, the result would be
x<-c(x, number.surveyed=10,recursive=T)
> x
ice.cream.choice1 ice.cream.choice2 ice.cream.choice3 ice.cream.choice4
        "vanilla"       "chocolate"      "strawberry"       "chocolate"
  topping.choice1   topping.choice2   topping.choice3   topping.choice4
      "chocolate"      "strawberry"       "chocolate"     "marshmallow"
          rating1           rating2           rating3           rating4   number.surveyed
              "5"               "4"               "4"               "5"              "10"

Note the coercion to character strings for the ratings and number surveyed. Thus, it is probably wise only to recursively concatenate when all elements are numeric. The function unlist exists by itself as well, and can be quite useful (See discussion of lapply and sapply). If we had not given names to the list components, the components would be labeled by position:
> names(x)<-NULL      # note the use of the replacement function, names
> x
[[1]]:
[1] "vanilla"    "chocolate"  "strawberry" "chocolate"

[[2]]:
[1] "chocolate"   "strawberry"  "chocolate"   "marshmallow"

[[3]]:
[1] 5 4 4 5

[[4]]:
[1] 10
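The difference between plain and recursive concatenation, and the coercion it causes, can be seen on a smaller list. This is a sketch in R syntax (assumed to mirror S-PLUS here); the list and its values are made up.

```r
x <- list(flavor = c("vanilla", "chocolate"), rating = c(5, 4))

y <- c(x, number.surveyed = 10)                        # still a list, now with 3 components
flat <- c(x, number.surveyed = 10, recursive = TRUE)   # everything flattened into one vector
flat                                                   # the numbers have become character strings

ratings <- unlist(x["rating"])                         # unlisting numeric components stays numeric
ratings
```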

Extracting and setting components from lists


We can select components of a list using [ ], as though the list were a vector. Thus, to select the second and fourth components of x, type
> x[c(2, 4)]
$topping.choice:
[1] "chocolate"   "strawberry"  "chocolate"   "marshmallow"

$number.surveyed:
[1] 10

Compare the above with


> x[[c(2, 4)]]
[1] "marshmallow"
> x[[2]]
[1] "chocolate"   "strawberry"  "chocolate"   "marshmallow"

However, to set a component to a value, we use [[ ]]. For example, to set the fourth component to 9
x[[4]]<-9
> x
$ice.cream.choice:
[1] "vanilla"    "chocolate"  "strawberry" "chocolate"

$topping.choice:
[1] "chocolate"   "strawberry"  "chocolate"   "marshmallow"

$rating:
[1] 5 4 4 5

$number.surveyed:
[1] 9

To set the fourth element of the first component to vanilla, type


x[[1]][4]<-"vanilla"

To set a component to NULL, use list.name[component.name]<-list(NULL) or list.name[[component.name]]<-list(NULL). So, to set number.surveyed to NULL use
x["number.surveyed"]<-list(NULL)

list(NULL) is required because x[component.name] returns a list with one component.

To remove a list component completely (e.g., the number surveyed), type


x[[4]]<-NULL

Or equivalently,
x[["number.surveyed"]]<-NULL

The last few commands allude to easier ways of selecting components from a named list. We can use the name information instead of the position information. So, to select the topping choices from x, type
> x$topping.choice           # selects the component
[1] "chocolate"   "strawberry"  "chocolate"   "marshmallow"
> x[["topping.choice"]]      # does the same thing
[1] "chocolate"   "strawberry"  "chocolate"   "marshmallow"
> x["topping.choice"]        # but, this command returns a list itself
$topping.choice:
[1] "chocolate"   "strawberry"  "chocolate"   "marshmallow"

As lists are vectors of list components, the vector selection ([ ]) returns the vector component. This also illustrates another way to set or change components of a list.
x$topping.preferred<-c("orange pineapple", "strawberry banana", "chocolate mint", "cherry jubilee")
x$topping.preferred[2]<-"chocolate banana"     # change second topping preferred

> x
..
$topping.preferred:
[1] "orange pineapple" "chocolate banana" "chocolate mint"   "cherry jubilee"

There is a caveat to selecting components using their names. Partial matching will be done on the quoted component name.
> x["topping"]
$topping.choice:
[1] "chocolate"   "strawberry"  "chocolate"   "marshmallow"

Naturally, problems arise if we have two components whose names begin with topping. We will get the first partial match. To get around this, one can select components using the numerical indexes only, finding them via exact matching:
> x[[match("topping.choice", names(x))]]
[1] "chocolate"   "strawberry"  "chocolate"   "marshmallow"

But,
> x[[match("topping", names(x))]]
NULL
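To recap the selection operators of this section, here is a sketch in R syntax on a fresh copy of the ice cream list. I have left partial matching out of the code because I am not certain R and S-PLUS treat it the same way; only exact matching through match is shown.

```r
x <- list(ice.cream.choice = c("vanilla", "chocolate", "strawberry", "chocolate"),
          topping.choice = c("chocolate", "strawberry", "chocolate", "marshmallow"),
          rating = c(5, 4, 4, 5))

sub.list <- x[c(1, 3)]            # [ ] returns a list of the chosen components
toppings <- x[[2]]                # [[ ]] returns the component itself
last.topping <- x[[c(2, 4)]]      # fourth element of the second component

pos <- match("topping.choice", names(x))   # exact match on the full name
same <- x[[pos]]

x$rating[1] <- 3                  # change one element of a named component
x[["rating"]] <- NULL             # remove the component completely
names(x)
```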

Initializing a list with n components (useful for simulations) can be done using the command:
my.list<-vector("list", n)
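A simulation might use such a list as follows. This is only a sketch in R syntax; the number of replicates, the sample size, the seed, and the choice of rnorm are all arbitrary assumptions for the example.

```r
n <- 5
my.list <- vector("list", n)              # n empty (NULL) components

set.seed(1)                               # arbitrary seed, for reproducibility
for (i in 1:n) {
  my.list[[i]] <- rnorm(10)               # store one simulated sample per component
}

sample.means <- sapply(my.list, mean)     # one mean per simulated sample
sample.means
```

Creating the list at its final length first avoids growing it one component at a time inside the loop.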


Factors
Factors are similar to character vectors. However, factors possess an attribute called levels, which indicates which character strings are allowed in the factor. Internally, a factor is stored as a set of integer codes (retrievable using the codes function or the levelsIndex function). The names of the levels or categories are stored in the levels attribute. (In a data window, entries which are character strings are automatically converted to factors. You can change this option to have them remain as character using the Options menu ->General Settings->Data. When creating a data frame using data.frame, enclose character columns in I() to prevent them from being converted automatically to factors). Here we factor topping.preferred.
> x$topping.preferred<-factor(x$topping.preferred)
> x
$ice.cream.choice:
[1] "vanilla"    "chocolate"  "strawberry" "chocolate"
. . .
$topping.preferred:
[1] "orange pineapple" "chocolate banana" "chocolate mint"   "cherry jubilee"
$topping.preferred: Levels:
[1] "cherry jubilee"   "chocolate banana" "chocolate mint"   "orange pineapple"
> levelsIndex(x$topping.preferred)
[1] 4 2 3 1
> attributes(x$topping.preferred)
$levels:
[1] "cherry jubilee"   "chocolate banana" "chocolate mint"   "orange pineapple"
$class:
[1] "factor"

Note that the numerical codes follow alphabetical order. To specify a different order, use the levels argument, which takes a character vector. The levels argument can include character strings not occurring in the factor.
> x$topping.preferred<-factor(x$topping.preferred, levels=c(as.character(x$topping.preferred), "peanut butter"))
> x
...
$topping.preferred:
[1] "orange pineapple" "chocolate banana" "chocolate mint"   "cherry jubilee"
$topping.preferred: Levels:
[1] "orange pineapple" "chocolate banana" "chocolate mint"   "cherry jubilee"
[5] "peanut butter"

The function unclass will now give the correct codes, whereas codes will re-code using alphabetical order.
> unclass(x$topping.preferred)
[1] 1 2 3 4
attr(, "levels"):
[1] "orange pineapple" "chocolate banana" "chocolate mint"   "cherry jubilee"
[5] "peanut butter"
> mode(unclass(x$topping.preferred))
[1] "numeric"

Factors are treated specially in many statistical modeling functions (e.g., lm, aov, glm, gam, loess, nls) and in some plotting functions. For example, using the built-in S-PLUS data frame fuel.frame, a boxplot of Mileage by Type (a factor) can be given automatically using plot:
> attach(fuel.frame)
> plot(Type, Mileage)

[Figure: boxplots of Mileage (y-axis) by Type (x-axis: Compact, Large, Medium, Small, Sporty, Van).]
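Returning to the level codes discussed earlier in this section, here is a sketch in R syntax. I use as.integer to get the codes, on the assumption that levelsIndex and codes are specific to S-PLUS. The topping names are the ones used above.

```r
topping <- c("orange pineapple", "chocolate banana", "chocolate mint", "cherry jubilee")

f.alpha <- factor(topping)                  # default: levels in alphabetical order
levels(f.alpha)
as.integer(f.alpha)                         # codes follow the alphabetical levels

# Levels given explicitly, in data order, plus one level that never occurs
f.data <- factor(topping, levels = c(topping, "peanut butter"))
as.integer(f.data)                          # codes now follow the order we supplied
table(f.data)                               # the unused level is counted as zero
```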

Ordered Factors
An ordered factor is a factor with an ordering of the levels. To create an ordered factor use the ordered function with first argument a vector. The default ordering of levels is alphabetical for characters and numerical for numbers. To change the default ordering, use the levels argument.
> ordered(x$topping.choice, levels=c("marshmallow","strawberry","chocolate"))
[1] chocolate   strawberry  chocolate   marshmallow

marshmallow < strawberry < chocolate

One can create an ordered factor from a numeric variable by using the cut function. For example, to categorize the Weight variable from the fuel.frame data set into quartiles, use
cut(fuel.frame$Weight, breaks=quantile(fuel.frame$Weight, seq(0,1,.25)), include.lowest=T)
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 4 2 4 2 1 2 2 2 1 2 2 2 3 2 3 2 2 3 3 2 2 3 3 3 4 2 3 4 3 3 4 2 4 3 2 3 4 4 4 4 3 4 4 4 4 3 4
attr(, "levels"):
[1] "1845.00+ thru 2571.25" "2571.25+ thru 2885.00" "2885.00+ thru 3231.25" "3231.25+ thru 3855.00"
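The same quartile split can be tried in R, where cut labels its intervals differently from S-PLUS. In this sketch the weight values are made up, and ordered_result is R's argument for asking cut to return an ordered factor.

```r
# Twelve made-up car weights (hypothetical values)
weight <- c(1845, 1900, 2075, 2260, 2275, 2285, 2295, 2330, 2350, 2390, 2440, 3320)

breaks <- quantile(weight, seq(0, 1, 0.25))        # minimum, quartiles, maximum
weight.cat <- cut(weight, breaks = breaks,
                  include.lowest = TRUE,           # keep the minimum in the first interval
                  ordered_result = TRUE)           # return an ordered factor
table(weight.cat)                                  # equal counts in each quartile group
```

Without include.lowest, the lightest car would fall outside the first interval and become NA.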


Subsetting Factors
A factor can be subsetted using a logical selection vector. For example, to select the types of cars with mileage over 30 miles per gallon in the fuel.frame data frame, type
> fuel.frame$Type[fuel.frame$Mileage>30]
[1] Small  Small  Small  Small  Small  Small  Small  Small  Sporty
Levels (first 5 out of 6):
[1] "Compact" "Large"   "Medium"  "Small"   "Sporty"

This retains the full set of levels, even though not all of them appear in the subset. If you only want levels that appear in the subset, type
> fuel.frame$Type[fuel.frame$Mileage>30,drop=T]
[1] Small  Small  Small  Small  Small  Small  Small  Small  Sporty

> attributes(fuel.frame$Type[fuel.frame$Mileage>30,drop=T])
$levels:
[1] "Small"  "Sporty"
$class:
[1] "factor"
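A self-contained version of the same comparison, in R syntax with made-up data (type and mileage are hypothetical vectors, not the fuel.frame columns):

```r
type <- factor(c("Small", "Sporty", "Compact", "Small", "Van"))
mileage <- c(33, 32, 24, 35, 19)

keep.all <- type[mileage > 30]                 # subset keeps every original level
only.used <- type[mileage > 30, drop = TRUE]   # subset keeps only levels that occur

levels(keep.all)
levels(only.used)
```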

Conversion of Factors to numbers


We already saw how to convert a factor to its numeric codes, using unclass. Now, suppose we create the following ordered factor with number labels, and later want to convert it to a numeric vector with those values. First we coerce the factor to a character vector, then to a numeric vector:
> my.factor<-ordered(c(1.5,2,5,3))           # the ordered factor
> my.factor
[1] 1.5 2 5 3

1.5 < 2 < 3 < 5
> as.numeric(my.factor)                      # this only returns the numeric codes
[1] 1 2 4 3
> as.numeric(as.character(my.factor))        # now we get the real values back
[1] 1.5 2.0 5.0 3.0

A more efficient way is given on p. 33 of V&R:

> as.numeric(levels(my.factor))[my.factor]
[1] 1.5 2.0 5.0 3.0

The levels function does the conversion to character. So, in the second example, we only convert length(levels(my.factor)) characters to numeric instead of length(my.factor), where the former is usually shorter than the latter (though not in this example).
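As a minimal sketch in R (the values in my.factor are made up for illustration, with repeats so that there are fewer levels than elements), the three conversions discussed above can be compared side by side:

```r
# Hypothetical ordered factor with numeric labels; some values repeat
my.factor <- ordered(c(1.5, 2, 5, 3, 2, 1.5))

codes <- as.numeric(my.factor)                    # integer codes, not the labels
slow  <- as.numeric(as.character(my.factor))      # converts every element
fast  <- as.numeric(levels(my.factor))[my.factor] # converts each level once, then indexes by code

print(codes)
print(slow)
print(fast)
```

Here only 4 levels are converted to numbers in the last form, against 6 elements in the second.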

Data Frames and Data Sheets


Data frames are probably the most typical way to store data sets. When you open the data editor window and input data, you are creating a data frame (or a data sheet). Also, data imported from other formats becomes a data sheet by default. A data frame or data sheet is a table of data in rows and columns. Typically, the rows correspond to observations and the columns correspond to variables.
Data frames are special cases of lists: they are lists with vector components. So, all the extraction techniques one can use with lists also apply to data frames. However, all the component columns of a data frame must be the same length, though not necessarily the same mode. A data sheet is like a spreadsheet, and its columns need not all be the same length.
The attributes of a data frame include row.names (the names of the rows), names (the column or variable names), and class (data.frame). Its mode is list.
> names(fuel.frame)
[1] "Weight" "Disp." "Mileage" "Fuel" "Type"

> row.names(fuel.frame)
 [1] "Eagle Summit 4"     "Ford Escort 4"       "Ford Festiva 4"   "Honda Civic 4"
 [5] "Mazda Protege 4"    "Mercury Tracer 4"    "Nissan Sentra 4"  "Pontiac LeMans 4"
 [9] "Subaru Loyale 4"    "Subaru Justy 3"      "Toyota Corolla 4" "Toyota Tercel 4"
[13] "Volkswagen Jetta 4" "Chevrolet Camaro V8" "Dodge Daytona"    "Ford Mustang V8"
...

Creating a data frame


If a set of variables or data frames have the same number of rows, they can be collected using the function data.frame.
new.data.frame<-data.frame(fuel.frame, Cat.Mileage=my.fuel.frame$Cat.Mileage)

If two or more data frames are collected together, the row names corresponding to the first data frame in the list are the row names for the new data frame. Character and logical vectors are automatically converted to factors unless they are enclosed in an identity function, I(). See V&R (2000, p. 16) for details.
The functions cbind and rbind bind data frames column-wise and row-wise, respectively: cbind requires the same number of rows, and rbind requires the same columns. This is useful when updating a data frame with new data:
rbind(old.data.frame, new.data.frame)

or adding additional variables


cbind(data.frame.with.variable.set.1, data.frame.with.variable.set.2)

However, note that cbinding a mixture of numeric and character variables results in a character matrix. To see this,
> cbind(1:26, letters[1:26])
character matrix: 26 rows, 2 columns.
     [,1] [,2]
[1,] "1"  "a"
[2,] "2"  "b"
[3,] "3"  "c"

Thus, cbind is only useful for data frames with all numeric (or all character) columns.
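The contrast between cbind on plain vectors and data.frame can be sketched in R as follows. The toy vectors num and chr are assumptions for illustration. I() is kept so that the character column stays a character vector regardless of the default; note that in current versions of R, data.frame no longer converts character vectors to factors by default, unlike the S-PLUS behavior described above.

```r
# Hypothetical toy columns
num <- 1:3
chr <- letters[1:3]

m  <- cbind(num, chr)                      # matrix: every entry coerced to character
df <- data.frame(num = num, chr = I(chr))  # data frame: each column keeps its own mode

print(mode(m))            # mode of the whole matrix
print(sapply(df, mode))   # mode of each column
```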


One can convert a list or a matrix to a data frame by using the function data.frame. For example, we convert the ice cream list, x, to a data frame.
> x.fr<-data.frame(x)
> x.fr
  ice.cream.choice topping.choice rating topping.preferred
1          vanilla      chocolate      5  orange pineapple
2        chocolate     strawberry      4  chocolate banana
3       strawberry      chocolate      4    chocolate mint
4        chocolate    marshmallow      5    cherry jubilee

One can convert a data frame to a matrix using data.matrix.

Indexing a data frame


A data frame can be indexed like a list or a matrix. We can select the second and fourth variables of x.fr using any of the following commands
> x.fr[c(2, 4)]
> x.fr[, c(2, 4)]
> x.fr[,c("topping.choice","topping.preferred")]
  topping.choice topping.preferred
1      chocolate  orange pineapple
2     strawberry  chocolate banana
3      chocolate    chocolate mint
4    marshmallow    cherry jubilee

Also,
> x.fr$topping.choice
> x.fr[, "topping.choice"]
> x.fr[[,"topping.choice"]]
[1] chocolate strawberry chocolate marshmallow

Subsetting a data frame using logical vectors


Let's subset the fuel.frame data frame (available in S-PLUS) according to Type of car.
Small.cars<-fuel.frame[fuel.frame$Type=="Small",]
                 Weight Disp. Mileage     Fuel  Type
Eagle Summit 4     2560    97      33 3.030303 Small
Ford Escort 4      2345   114      33 3.030303 Small
Ford Festiva 4     1845    81      37 2.702703 Small
Honda Civic 4      2260    91      32 3.125000 Small
Mazda Protege 4    2440   113      32 3.125000 Small
Mercury Tracer 4   2285    97      26 3.846154 Small
Nissan Sentra 4    2275    97      33 3.030303 Small
Pontiac LeMans 4   2350    98      28 3.571429 Small

If we first attach the data frame, we don't need the $ to subset. That's because attaching a list brings the list to the second position of the search list. So, unless there are other objects in our working directory with the same names as the names in the attached list, using a name in the data frame brings up the associated column. For example,
> attach(fuel.frame)
> Type
 [1] Small Small Small Small Small Small Small Small Small Small Small Small Small Sporty Sporty Sporty Sporty
[18] Sporty Sporty Sporty Sporty Sporty Compact Compact Compact Compact Compact Compact Compact Compact Compact Compact Compact Compact
...
> Small.cars<-fuel.frame[Type=="Small",]

Next we select all cars that are either Small or Sporty.


> fuel.frame[is.element(Type, c("Small","Sporty")),]
> # Or use: fuel.frame[Type %in% c("Small","Sporty"),]
                    Weight Disp. Mileage     Fuel   Type
Eagle Summit 4        2560    97      33 3.030303  Small
Ford Escort 4         2345   114      33 3.030303  Small
Ford Festiva 4        1845    81      37 2.702703  Small
Honda Civic 4         2260    91      32 3.125000  Small
Mazda Protege 4       2440   113      32 3.125000  Small
Mercury Tracer 4      2285    97      26 3.846154  Small
Nissan Sentra 4       2275    97      33 3.030303  Small
Pontiac LeMans 4      2350    98      28 3.571429  Small
Subaru Loyale 4       2295   109      25 4.000000  Small
Subaru Justy 3        1900    73      34 2.941176  Small
Toyota Corolla 4      2390    97      29 3.448276  Small
Toyota Tercel 4       2075    89      35 2.857143  Small
Volkswagen Jetta 4    2330   109      26 3.846154  Small
Chevrolet Camaro V8   3320   305      20 5.000000 Sporty
Etc

Partial matching can be done using the function grep.

Subsampling a data frame


The function seq can be used as an index to select a systematic sample from a data frame.
> fuel.frame[seq(1, nrow(fuel.frame), by=10), ] # select every 10th case

Instead of seq, we could have used the logical index rep(c(T,rep(F,9)), 6), which has one element for each of the 60 rows of fuel.frame. However, seq is faster.
We can sample from the vector of integers representing the row names to get a pseudorandom sample.
> fuel.frame[sample(1:nrow(fuel.frame), 10), ] # sample 10 cases

Removing Duplicate Rows from Data Frames


To remove duplicate rows from data frames, we can use the method of duplicated that is appropriate for data frames (formally called duplicated.data.frame). To use it on a doubled version of the fuel.frame data set, we supply its name as an argument.
> my.fuel.frame<-rbind(fuel.frame, fuel.frame)


> duplicated(my.fuel.frame)   # calls duplicated.data.frame via method dispatch
 [1] F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F T T T T T T T
[68] T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T

Then, to select only the unique rows, use
my.fuel.frame[!duplicated(my.fuel.frame),]

duplicated has methods for vectors (its original use), whose inverse is the function unique. The function unique(x) selects the unique elements of x, a vector. unique does not have a method for data frames.
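A small R sketch of the same idea, using a made-up three-row data frame df in place of fuel.frame (an assumption, since fuel.frame ships with S-PLUS, not base R). Note that, unlike the S-PLUS version described above, current R does provide a unique method for data frames, so the last line is an R-only alternative.

```r
# Hypothetical toy data frame, doubled so that every row appears twice
df <- data.frame(id = c(1, 2, 3), grp = c("a", "b", "c"))
doubled <- rbind(df, df)

dup     <- duplicated(doubled)   # TRUE for rows that repeat an earlier row
deduped <- doubled[!dup, ]       # keep only the first occurrence of each row

print(dup)
print(deduped)
print(unique(doubled))           # R only: same rows via the data frame method
```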

Merging Data Frames: (see merge function)

Coercion (going from one class to another)


We have already seen examples of going from one class to another. More examples are given in the S Commands link on the STAT 5537 course web site. In general there are functions as.xxx and is.xxx. These coerce to, or test for, class xxx. For information, consult the help files as well as the S Commands link on the course web site.

Operations and arithmetical expression on data types


Generating regular sequences
The function, seq, is the most general way to generate a sequence. It takes (among others) arguments from, to, and by. If by is not given, it defaults to by=1.
> seq(-1, 1, .2)
 [1] -1.0 -0.8 -0.6 -0.4 -0.2  0.0  0.2  0.4  0.6  0.8  1.0

You can seq along another vector by using the along argument.
> seq(along=seq(-1, 1, .2))
[1] 1 2 3 4 5 6 7 8 9 10 11

A colon can be used for incrementing or decrementing sequences with by=1: e.g., 1:5, 3:5, -10:(-1), 1:(-10). However, note that : will take precedence over arithmetical expressions, so that 1:5-1 results in c(0, 1, 2, 3, 4), which is different from 1:(5-1), which gives c(1, 2, 3, 4).
The rep function repeats its first argument a number of times given by the second argument. Here are some simple examples:
> rep("hi", times=5)
[1] "hi" "hi" "hi" "hi" "hi"
> rep(1:3, 4)
[1] 1 2 3 1 2 3 1 2 3 1 2 3
> rep(letters[1:3], 3:1)
[1] "a" "a" "a" "b" "b" "c"
> rep(letters[1:3], times=rep(4, 3))
[1] "a" "a" "a" "a" "b" "b" "b" "b" "c" "c" "c" "c"

With S-PLUS 6.1 and R, we don't need the additional rep expression for the times argument. We can use the each argument.
> rep(letters[1:3], each=4)
[1] "a" "a" "a" "a" "b" "b" "b" "b" "c" "c" "c" "c"
> rep(rep(1:4, each = 2), 2)
[1] 1 1 2 2 3 3 4 4 1 1 2 2 3 3 4 4

R has a function gl (that's an "el") that does the above and returns the result as a factor. gl(n, k) generates a sequence of length n*k consisting of the numbers 1, 2, ..., n, each repeated k times.
> gl(n=4,k=2)
[1] 1 1 2 2 3 3 4 4
Levels: 1 2 3 4

Operations on numerical vectors


Arithmetic operations on vectors are performed element by element. Applying the operators +, -, *, /, %/%, %%, ^ to two vectors gives the following results:
z <- rep(2:4, 4)
y <- rep(1:3, 4)
z + y
[1] 3 5 7 3 5 7 3 5 7 3 5 7
z + 3   # example of recycling rule: shorter vectors are
        # repeated to match the length of longer vectors
[1] 5 6 7 5 6 7 5 6 7 5 6 7
z * y
[1] 2 6 12 2 6 12 2 6 12 2 6 12
z[-1]
[1] 3 4 2 3 4 2 3 4 2 3 4
y^2
[1] 1 4 9 1 4 9 1 4 9 1 4 9
z^y
[1] 2 9 64 2 9 64 2 9 64 2 9 64

Another instance of the recycling rule is the command


cbind(1, z)
        z
 [1,] 1 2
 [2,] 1 3
 [3,] 1 4
 [4,] 1 2
 [5,] 1 3
 [6,] 1 4
 [7,] 1 2
 [8,] 1 3
 [9,] 1 4
[10,] 1 2
[11,] 1 3
[12,] 1 4

which can be convenient for specifying the intercept in the X matrix for a regression problem.

Operations on logical vectors


The symbols |, xor, and & are the element-wise "or", "exclusive-or", and "and", respectively.
z <- rep(2:4, 4)
y <- rep(1:3, 4)
z == y                # ith value equals T if z[i]=y[i]
[1] F F F F F F F F F F F F
z/3 > y | 2 * y > z   # ith value equals T if z[i]/3 > y[i] or if 2*y[i] > z[i]
[1] F T T F T T F T T F T T
z > y & 2*y > z       # ith value equals T if z[i] > y[i] and if 2*y[i] > z[i]
[1] F T T F T T F T T F T T

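There are functions any, all, and all.equal. any and all return scalar T or F values; all.equal returns T if its arguments are equal up to a tolerance, and otherwise a character string describing the difference:
> all.equal(z, y)   # default tolerance is .Machine$double.eps
[1] "Mean relative difference: 0.3333333"
> all(z > y)
[1] T
> any(y > z)
[1] F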

The function identical tests for exact equality, i.e., without a tolerance. If a logical vector is used in an arithmetical expression, then T is converted to 1 and F is converted to 0. Thus, sum(c(T,T,F)) is the number 2.

Operations on logical scalars (||, &&)


The logical operators | and & operate component-wise on vectors (i.e., on each component of a vector). On the other hand, && and || operate on scalars. As mentioned in V&R (p. 58), && evaluates the right-hand expression only if the left-hand one is true, and || evaluates the right-hand expression only if the left-hand one is false.
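The short-circuit behavior can be checked with a small R sketch. The calls to stop are placed on the right-hand side purely as a probe (an illustrative device, not something from V&R): if the right-hand side were evaluated, the line would fail.

```r
# && skips the right-hand side when the left-hand side is FALSE
a <- FALSE && stop("right-hand side was evaluated")

# || skips the right-hand side when the left-hand side is TRUE
b <- TRUE || stop("right-hand side was evaluated")

# & works component-wise on whole vectors
v <- c(TRUE, FALSE) & c(TRUE, TRUE)

print(a)
print(b)
print(v)
```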

Operations and functions on matrices


Elementwise matrix operations are: +, -, /, *, and ^. The recycling rule applies. Matrix multiplication is achieved via the operation, %*%. Matrix outer product is the operation, %o%, or the function outer. The function outer is more flexible, as one does not have to use * as the operation. For example, the call

outer(z, y, function(u,v) (u+v)/2)

takes the midpoint of each ordered pair in the regular grid of z and y values.
The function crossprod(x, y) gives the (matrix) cross-product x^T y (that is, t(x) %*% y), where x and/or y may be a matrix. With a single argument, crossprod(x) gives x^T x.
The function diag will create a diagonal matrix or extract the diagonal of a square matrix, depending on the mode of the first argument. Examples:

The function diag


> diag(2)
     [,1] [,2]
[1,]    1    0
[2,]    0    1
> diag(rep(1:3,2))
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    0    0    0    0    0
[2,]    0    2    0    0    0    0
[3,]    0    0    3    0    0    0
[4,]    0    0    0    1    0    0
[5,]    0    0    0    0    2    0
[6,]    0    0    0    0    0    3
> diag(diag(2))
[1] 1 1

The function sweep


The function sweep will sweep out an operation from the rows or columns of a matrix. For example, suppose we want to subtract the row mean from the elements in each row of a matrix X. Then,
sweep(X, MARGIN=1, STATS=rowMeans(X), FUN="-") # MARGIN=1 implies rows

Also, we can subtract the column mean from each column


X.dev<-sweep(X, MARGIN=2, STATS=colMeans(X), FUN="-")   # MARGIN=2 implies columns
x.scaled<-sweep(X.dev, 2, sqrt(colVars(X)), FUN="/")    # Equivalent to scale

The last command above is equivalent to the function scale. scale will standardize the values in each column (subtract its respective column mean and divide by column standard deviation). Also note the related functions, rowSums, colSums, rowVars, colVars. Section 4.3 of V&R contains more information on matrix functions like QR decomposition, eigen decomposition, etc.
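The equivalence between the two sweep steps and scale can be checked with a short R sketch. The matrix X is made up for illustration, and because colVars is an S-PLUS function, apply(X, 2, var) is used here as a stand-in (an assumption about how to get column variances in base R).

```r
# Hypothetical 4 x 2 data matrix
X <- matrix(c(1, 2, 3, 4, 10, 20, 30, 45), nrow = 4)

# Subtract column means, then divide by column standard deviations
X.dev    <- sweep(X, MARGIN = 2, STATS = colMeans(X), FUN = "-")
col.sd   <- sqrt(apply(X, 2, var))          # base R stand-in for sqrt(colVars(X))
X.scaled <- sweep(X.dev, 2, col.sd, FUN = "/")

# Compare with scale(), ignoring the extra attributes scale() attaches
print(all.equal(X.scaled, scale(X), check.attributes = FALSE))
```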


Functions for managing vectors


Here is a list of functions to look up: append, duplicated, match, pmatch, unique, rev, pmax, pmin, round, floor, ceiling, trunc, union, setdiff, intersect, diff

Many of these functions have methods for other classes, like data frames or lists. See if you can find out which. Note that the function match(x, table, nomatch=NA) returns a vector of the positions in table of the elements of x, but only for the first matches. For example,
> match(1,c(1,2,3,1,4))
[1] 1

For this example, one can type instead:


> 1==c(1,2,3,1,4)
[1] T F F T F

So, now you know that 1 appears in the first and fourth positions. To know all the positions in table of each element in x, where x is a vector (here, for the sake of argument), you can use sapply (I think there is something easier, but I forgot). For example, to find all the positions of 1's, all the positions of 2's, etc., in the vector c(1,2,3,1,4):
> sapply(1:4,function(x,y) x==y,y=c(1,2,3,1,4))
     [,1] [,2] [,3] [,4]
[1,]    T    F    F    F
[2,]    F    T    F    F
[3,]    F    F    T    F
[4,]    T    F    F    F
[5,]    F    F    F    T
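One possibly simpler route in R, offered as a sketch and not as the approach these notes had in mind, is to ask for the positions directly with which, or to group all positions by value with split. The vector v is the same example vector as above.

```r
v <- c(1, 2, 3, 1, 4)

ones <- which(v == 1)            # positions where v equals 1
pos  <- split(seq_along(v), v)   # list: for each distinct value, all of its positions

print(ones)
print(pos)
```

The list pos is named by the distinct values, so pos[["1"]] holds the positions of the 1's.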

What would happen if I used lapply? Could I have used apply? Note the following matches of cars between the two data sets cu.summary and fuel.frame:
> match(row.names(cu.summary),row.names(fuel.frame))
 [1] NA NA NA 1 2 3 NA NA 4 NA 5 6 NA 7 8 9 NA NA 10 11 12 NA NA 13 NA 14 NA 15 NA NA 16 17 NA NA 18 19 NA NA NA 20 NA NA 21 NA 22 NA
[47] NA NA 23 NA NA 24 25 NA 26 NA 27 28 29 NA 30 31 32 33 34 NA NA 35 36 37 38 NA NA NA 39 NA NA NA NA 40 41 NA 42 NA 43 44 45 NA NA NA NA 46
[93] 47 48 49 NA NA NA 50 NA NA 51 NA NA 52 53 NA NA 54 NA 55 56 57 58 59 60 NA

Sorting
The function sort sorts a vector in ascending order or alphabetical order. rev will then reverse the order. The function order returns the indices of a vector that will sort the original vector in ascending order. To order a vector by one variable within another, use further arguments to order. For example, to order cars by Price within Type, use
order(Type, Price)


To further order within Country, use


order(Country, Type, Price)

The function sort.list will quickly sort a list (data frame) by a single column. For example,
my.data.frame[sort.list(row.names(my.data.frame)),]
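A minimal R sketch of ordering within groups, using a made-up data frame cars with hypothetical Type and Price columns (these values are assumptions for illustration, not taken from cu.summary):

```r
# Hypothetical toy data
cars <- data.frame(Type  = c("Small", "Sporty", "Small", "Sporty"),
                   Price = c(9000, 15000, 7000, 12000))

idx    <- order(cars$Type, cars$Price)  # row indices: by Type, then by Price within Type
sorted <- cars[idx, ]                   # reorder the rows of the data frame

print(idx)
print(sorted)
```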

Missing value marker, NA; indefinite and infinite values


NA is the symbol for a missing value. is.na is the function to use to test for missing values in a vector. To set a value, x, to missing use is.na(x)<-T. You can use x<-NA as well, but not x<-"NA".
> is.na(x<-NA)
[1] T
> is.na(x<-"NA")
[1] F
# we did not really need to make the hidden assignment here

Any operation on an NA becomes an NA. So, the expression below gives NAs.
c(1, NA, 1)== NA
[1] NA NA NA

In S-PLUS, character vectors cannot have missing values. The expression NA will be interpreted as the string "NA". However, in R, NA in a character vector is a missing value. See the documentation on the functions is.finite, is.infinite, is.nan, and is.number. The symbol NaN means "Not a Number" (undefined). The symbol Inf stands for infinity.
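The points about NA can be sketched in R as follows. This reflects R's behavior, including missing values in character vectors; as noted above, S-PLUS treats the character case differently.

```r
x <- c(1, NA, 1)

cmp  <- x == NA           # comparing with NA gives NA everywhere, never TRUE or FALSE
miss <- is.na(x)          # the proper test for missing values

chr.miss <- is.na(c("a", NA))  # in R, NA inside a character vector is missing
str.miss <- is.na("NA")        # the two-letter string "NA" is not missing

print(cmp)
print(miss)
print(chr.miss)
print(str.miss)
```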

Character vector operations


Character vectors are vectors of character strings. The function character(length=n) creates a vector of n empty strings, "". The expression character(0) is an empty character vector with length 0, much like numeric(0) is an empty vector, with length 0.


The function nchar takes in a vector of character strings and returns a vector with the number of characters in each string. The function paste takes an arbitrary number of arguments (coercing them to character strings if necessary) and joins them together element by element. By default, the joined elements are separated by a blank. The separator can be changed using the sep argument. For example,
> paste("Round",1:10)
 [1] "Round 1"  "Round 2"  "Round 3"  "Round 4"  "Round 5"  "Round 6"  "Round 7"
 [8] "Round 8"  "Round 9"  "Round 10"

Also, note the use of the argument, collapse:


> paste(c("X", "Y"),1:4, sep="", collapse=" + ")
[1] "X1 + Y2 + X3 + Y4"

The function substring takes as first argument a vector (which is coerced to character vector). The next two arguments indicate the first and last positions of each string in the character vector. This defines a segment of each string to be extracted. For example, to extract the Round number from the paste("Round",1:10) vector created above, we start in the 7th position, then continue until the end of each string (thus, we don't need to give a value for the last argument).
> substring(paste("Round",1:10), first=7)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"

Searching and Matching character strings


The grep(pattern, text) function finds the indices of where pattern appears in a vector of character strings, text. The following examples appear in the help file for grep.
grep("*ia", state.name)      # find state names that end in "ia"
grep("?a*",state.name)       # find state names with second letter "a"
grep("[A-D]*",state.name)    # find state names beginning with letters A through D
grep("*[!n]",state.name)     # find state names EXCEPT those ending with "n"
grep("Ne*", state.name)      # returns all states beginning with "Ne"

Patterns can be made using the following rules


[abc]       # match single letters in a,b,c
[.,;]       # match . , ;
[a-z]       # match all lowercase letters
[!0-9]      # match all non digits
[0-9!]      # match all digits and exclamation point
[a-zA-Z]    # match all lower and uppercase letters
[!a-zA-Z]   # match any character that isn't a letter
[.!]        # match period and !
\.          # match .

charmatch(input, target, nomatch=0) returns a vector of the indices of target that are partially matched by input. Ambiguous matches return a 0.


From the help file, charmatch is useful for processing the arguments to functions. It is very similar to the pmatch function. However, pmatch does not allow a distinction between no match and an ambiguous match. The pmatch function doesn't allow a match to the empty string, while charmatch does. See the help file for examples.
regexpr(pattern, text) matches one regular expression, pattern, to a character vector, text. Basic use of regular expressions is covered in the help file. An example: Suppose we want to find all functions in the working directory that contain the letters "pow" as a whole word. regexpr will match the characters "pow" to the directory listing. What is returned is the position of the "p" in each directory file listed. If "pow" does not occur as a whole word, then a -1 is returned.
> tmp<-objects()
> tmp[ regexpr("\\<pow\\>", tmp)>0 ]   # >0 eliminates the -1 nomatches.
[1] "pow.matrix" "pow.structure"       # So we get the matches only

Regular expressions can be used in the pattern argument to objects too. R has the functions sub and gsub which resemble the sed commands
s/pattern/replacement/     # replace first instance of pattern with replacement
s/pattern/replacement/g    # g stands for global replacement (replace all instances)
in Unix. For example, to replace the R's in our Round vector with lower case r's,
> gsub("R", "r", paste("Round",1:10))
 [1] "round 1"  "round 2"  "round 3"  "round 4"  "round 5"  "round 6"  "round 7"
 [8] "round 8"  "round 9"  "round 10"
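The whole-word match and the difference between sub and gsub can be tried in R on made-up strings. The object names in objs are hypothetical, chosen so that some contain "pow" as a whole word and some do not.

```r
# Hypothetical object names
objs <- c("pow.matrix", "power", "my.pow", "plot")

hits       <- regexpr("\\<pow\\>", objs)  # start position of the match, or -1 if none
whole.word <- objs[hits > 0]              # keep only the names that matched

first.only  <- sub("o", "0", "foo boo")   # replaces the first "o" only
all.of.them <- gsub("o", "0", "foo boo")  # replaces every "o"

print(as.integer(hits))
print(whole.word)
print(first.only)
print(all.of.them)
```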

Here are some good examples of the use of character vector operations and searching and matching. I have a data set with 548 records, of which 258 are unique individuals. Several individuals contributed more than one record to this data set. Each record has a unique ID associated with it. The first few digits of the ID correspond to the individual. The last two digits correspond to the record contributed by the individual (I call these events). So, a record with ID=11209 corresponds to the 09th event from an individual identified with number 112. I want to create an event variable that indicates which event corresponds to the record for that individual. Here is how I did it. ID contains the 548 ID tags.
# First I find the largest number of events:
num.events<-max(as.numeric(substring(ID,nchar(ID)-1)))
# Now, I create a vector of strings with event labels
event.labels<-if(num.events > 9) c(paste("*0",1:9,sep=""), paste("*", 10:num.events, sep="")) else paste("*0",1:num.events,sep="")
# Now, create the event variable (kept separate from the labels, so that
# assigning to it cannot overwrite a label that is still to be matched)
event<-rep(NA, length(ID))
for(i in 1:num.events) event[grep(event.labels[i],ID)]<-i
# Get the first records in the database of each individual
my.data[match(unique(individual),individual), ]
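For comparison, here is a sketch in R that swaps the grep loop for direct parsing of the ID with substring. The five ID values are invented for illustration, and individual is derived here from the leading digits, which is an assumption about how that variable was built.

```r
# Hypothetical ID tags: leading digits identify the individual, last two the event
ID <- c("11201", "11202", "11209", "4501", "4502")
n  <- nchar(ID)

event      <- as.numeric(substring(ID, n - 1, n))  # last two digits as a number
individual <- substring(ID, 1, n - 2)              # everything before the last two digits

# Row index of the first record contributed by each individual
first.rows <- match(unique(individual), individual)

print(event)
print(individual)
print(first.rows)
```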


Formatting and Printing Objects


We've seen the paste function already, for printing character strings. The cat function prints S objects as character strings (by coercing them) to the standard output or to a specified file. The option fill allows control over the width of the output in characters (including white space). For example, compare
> cat(1,2,3,4,5,6, fill=2)
1
2
3
4
5
6
> cat(1,2,3,4,5,6, fill=4)
1 2
3 4
5 6
> cat(1,2,3,4,5,6, fill=6)
1 2 3
4 5 6

The function format coerces input to character strings, then formats it using specified number of digits or significant digits. It outputs the result in quotes, which can be removed by encasing the format output in cat or print, with argument quote=F. For example,
> format(pi^(-3:3), digits=5)
[1] " 0.032252" " 0.10132 " " 0.31831 " " 1 " " 3.1416 " " 9.8696 " "31.006 "

> cat(format(pi^(-3:3), digits=5),"\n")
 0.032252 0.10132 0.31831 1 3.1416 9.8696 31.006

> print(format(pi^(-3:3), digits=5),quote=F)
[1] 0.032252 0.10132 0.31831 1 3.1416 9.8696 31.006

If you want to reduce the number of significant digits of a number (say, called, number) to that of the default (found in options()$digits), then use format(number).

Indexing of arrays, data frames, lists, and vectors


Selecting or extracting components or subsets of arrays, data frames, and lists is done by using an index vector. The index vector can be one of the five types mentioned in V&R (p. 28): logical, a vector of positive integers or the integer codes of a factor, a vector of negative integers (in which case, you are eliminating components), a vector of character strings, or an empty vector. Not all index vectors are valid for extracting components from all types of data structures mentioned in the heading. A logical index vector can be used to index all types. However, the logical operators: ==, <, > cannot operate on a list. So, the expression
x<-list(2,3,4); x[x>3]

is not valid. One can, however, use unlist to prevent using the operator on a list:
x[unlist(x)>3] # Note that sapply can be used as well


Vectors of positive and negative integers can be used to index all four data structures. Decimal representations are truncated toward zero. A zero subscript is allowed, but will return an empty structure (empty list, empty numeric vector, etc). However, using a zero subscript to select the zeroth column of a data frame returns an error. If a subscript extends outside the range of the length or dimensions of the object, then for a vector, the extracted component is NA, for a list it is NULL, and for arrays and data frames it is an error. Out of range subscripts on the left hand side of assignments return a lengthened vector or list with intervening components set to NA and NULL, respectively. The function is.element can be used as an index vector. For example, to create missing values (NAs) in a vector out of the values 99 or 999, use
x[is.element(x, c(99, 999))] <- NA

Note that the concatenation cannot contain elements of mixed types. Thus, c(99, 999, ".") is not allowed. The command x[x=="."]<-NA would have to be used. For an object with a names component, extraction by name can be done. The following are examples using data frames.
# R
data(iris)
iris[,c("Sepal.Length","Sepal.Width")]
  Sepal.Length Sepal.Width
1          5.1         3.5
2          4.9         3.0
3          4.7         3.2
4          4.6         3.1
5          5.0         3.6

# S-PLUS
fuel.frame["Ford Escort",]
              Weight Disp. Mileage     Fuel  Type
Ford Escort 4   2345   114      33 3.030303 Small

The empty selection, [], returns the entire object (for all data structures in the heading). Replacements to subsets of vectors, arrays, data frames, and lists can also be done, via a replacement function, such as x[unlist(x)>3]<-0. Negative subscripts work on the left hand side as well. For example, to assign a 9 to all but the 4th element of a vector x, x[-4]<-9.

Indexing an array
Arrays are indexed in column-major order, meaning that the first index moves fastest. For a 3-dimensional array, this means that filling an array will start with the first matrix and fill down columns. One can index a 3-dimensional array using 3 subscripts or a single subscript. If a single subscript is used, then the selected element will be the element in the position of the index, when we start counting positions in column-major order. Compare the following ways to set two values to NA in an array. They yield the same answer. (Note the method for setting NA. We use a replacement function.)


x<-array(1:50, dim=c(2,5,5))
is.na(x)<-c(1,3)        # set the 1st and 3rd values of x to NA
x<-array(1:50, dim=c(2,5,5))
is.na(x)<-x[1,1:2,1]    # set the 1st and 2nd columns in the 1st row of the 1st
                        # matrix of x to NA
> x
, , 1
     [,1] [,2] [,3] [,4] [,5]
[1,]   NA   NA    5    7    9
[2,]    2    4    6    8   10

, , 2
     [,1] [,2] [,3] [,4] [,5]
[1,]   11   13   15   17   19
[2,]   12   14   16   18   20

, , 3
     [,1] [,2] [,3] [,4] [,5]
[1,]   21   23   25   27   29
[2,]   22   24   26   28   30

, , 4
     [,1] [,2] [,3] [,4] [,5]
[1,]   31   33   35   37   39
[2,]   32   34   36   38   40

, , 5
     [,1] [,2] [,3] [,4] [,5]
[1,]   41   43   45   47   49
[2,]   42   44   46   48   50

We can select submatrices from a 3-dimensional array. For example, to select the first 3 columns of each matrix in the array, x, above:
> x[, 1:3, ]

, , 1
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

, , 2
     [,1] [,2] [,3]
[1,]   11   13   15
[2,]   12   14   16

, , 3
     [,1] [,2] [,3]
[1,]   21   23   25
[2,]   22   24   26

, , 4
     [,1] [,2] [,3]
[1,]   31   33   35
[2,]   32   34   36

, , 5
     [,1] [,2] [,3]
[1,]   41   43   45
[2,]   42   44   46

Normally, in selecting from an array dimensions are dropped unless you tell S not to drop them. This is done using the drop=F argument as an additional index. Compare the following:
> x[1, , ]   # select first row from each matrix in the array
     [,1] [,2] [,3] [,4] [,5]
[1,]    1   11   21   31   41
[2,]    3   13   23   33   43
[3,]    5   15   25   35   45
[4,]    7   17   27   37   47
[5,]    9   19   29   39   49
> x[1, , ,drop = F]   # don't drop length-one dimensions
, , 1
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9

, , 2
     [,1] [,2] [,3] [,4] [,5]
[1,]   11   13   15   17   19

, , 3
     [,1] [,2] [,3] [,4] [,5]
[1,]   21   23   25   27   29

, , 4
     [,1] [,2] [,3] [,4] [,5]
[1,]   31   33   35   37   39

, , 5
     [,1] [,2] [,3] [,4] [,5]
[1,]   41   43   45   47   49

Note the function drop, too. We can index an array with a matrix. Here we extract the antidiagonal of a square matrix.
> x <- diag(4)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    0    0    0
[2,]    0    1    0    0
[3,]    0    0    1    0
[4,]    0    0    0    1
> n <- nrow(x)
> x[matrix(c(1:n, n:1), nr = n, nc = 2)]
[1] 0 0 0 0
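Because the antidiagonal of an identity matrix is all zeros, the effect of matrix indexing is easier to see on a matrix with distinct entries. A small R sketch, using a made-up matrix m:

```r
# Hypothetical 4 x 4 matrix with distinct entries, filled in column-major order
m <- matrix(1:16, nrow = 4)
n <- nrow(m)

idx  <- cbind(1:n, n:1)  # each row of idx is one (row, column) pair
anti <- m[idx]           # elements m[1,4], m[2,3], m[3,2], m[4,1]

print(m)
print(anti)
```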

To subtract the off-diagonal, see the matrix.coords function in S commands on the website (http://math.cl.uh.edu/~thompsonla/5537).


Vectorized calculations
Vectorized functions are functions that return a vector if the argument is a vector. These calculations operate on the entire vector instead of the individual elements in turn. Many mathematical functions and transformations are vectorized. Actually, vectorized functions will usually also return a structure like the argument if the argument is a structure. A structure in S is a class of object that adds to an ordinary vector some notion of the values being organized in space and time (Chambers, 1998). A structure can be even more general, taking a basic object and turning it into something else by adding attributes. For example,
   a matrix is a vector with a dim attribute
   an array is a vector with a dim attribute
   a factor is a character vector with a levels attribute
   a data frame is a list with a names and row.names attribute
So, for a vectorized function, f, if x is a vector, f(x) is again a vector, where the function is applied to each element in x. Furthermore, f(x) is a matrix if x is a matrix. Examples:
> log(1:10)
 [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101 2.0794415 2.1972246 2.3025851
> log(diag(4))
     [,1] [,2] [,3] [,4]
[1,]    0 -Inf -Inf -Inf
[2,] -Inf    0 -Inf -Inf
[3,] -Inf -Inf    0 -Inf
[4,] -Inf -Inf -Inf    0
> diag(3) + diag(3)
     [,1] [,2] [,3]
[1,]    2    0    0
[2,]    0    2    0
[3,]    0    0    2
> 1:10 + 1
 [1]  2  3  4  5  6  7  8  9 10 11
> 1:10 + 1:20
 [1]  2  4  6  8 10 12 14 16 18 20 12 14 16 18 20 22 24 26 28 30

What happened in the third example is that S coerces the 1 to a vector: rep(1, length(1:10)). This coercion will happen whenever the length of the longer vector is a multiple of the length of the shorter. So, in the fourth example, 1:10 is coerced to rep(1:10, 2). If the length of the longer vector is not a multiple of the length of the shorter, then an error results.
Functions that are vectorized include: mathematical transformations and operations (+, -, *, /, log, exp, sqrt, etc), logical operations (==, >, <, !, etc), and functions related to probability distributions and random number generation.
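The recycling rule can be checked directly in R with the sketch below. One difference to be aware of: when the lengths do not divide evenly, current R signals a warning and still returns a result, so the tryCatch here only records which kind of condition was raised.

```r
a <- 1:10 + 1      # the 1 is recycled ten times
b <- 1:10 + 1:20   # 1:10 is recycled twice

same <- identical(b, rep(1:10, 2) + 1:20)

# Lengths 3 and 2: 3 is not a multiple of 2
outcome <- tryCatch({ 1:3 + 1:2; "no condition" },
                    warning = function(w) "warning",
                    error   = function(e) "error")

print(a)
print(same)
print(outcome)
```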


If a function is not vectorized, one way around this is to use sapply with the vector as the first argument and the function as the second.
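A minimal R sketch of that workaround. The function f is a made-up example that is not vectorized, because if() looks at a single condition only:

```r
# Hypothetical scalar-only function: if() handles one value at a time
f <- function(u) if (u > 0) sqrt(u) else 0

# sapply calls f once per element and collects the results into a vector
res <- sapply(c(-1, 4, 9), f)

print(res)
```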

Example of simple simulation in S-PLUS that uses vectorization


Compare the following functions to compute the numerical constant, π. In each function, we try to compute, via simulation, the area of the unit circle in one of the four quadrants. This area is π/4. Then, we multiply the result by 4 to approximate π. The first function uses an explicit for loop. For each iteration, two independent uniform(0, 1) random numbers are drawn. On each iteration, we note whether the coordinates represented by these two random numbers fall within the unit circle. So, if the square root of the sum of the squares of the two numbers is less than 1, we increase the counter. The second function uses the concept of vectorization. All N simulations are sampled at once, and each operation thereafter is done on whole vectors.
pi1.f<-function(N) {
   counter<-0
   for(i in 1:N){
      x<-runif(1, 0, 1)
      y<-runif(1, 0, 1)
      if(sqrt(x^2 + y^2) < 1) counter<-counter+1
      NULL
   }
   4*counter/N
}
pi2.f<-function(N){
   x<-runif(N, 0, 1)
   y<-runif(N, 0, 1)
   z<-sqrt(x^2 + y^2)
   4*sum(z < 1)/N
}

Here are the results from 100,000 simulations, and their times in seconds.
> pi1.f(100000)
[1] 3.15316
> pi2.f(100000)
[1] 3.15028
> dos.time(pi1.f(100000))
[1] 153.07
> dos.time(pi2.f(100000))
[1] 0.54

The first function takes 2.5 minutes, whereas the second takes half a second. The next section describes functions that operate on whole objects such as lists, arrays, data frames. These can sometimes be used in place of explicit loops. However, in R, apply still uses an internal loop.


Functions operating on arrays, data frames, and lists


The apply function
The function apply(X, MARGIN, FUN, ...) applies a function (FUN) over the dimensions of an array given by MARGIN. It returns a result of dimension c(d, dim(X)[MARGIN]), where d is the length of the result that FUN returns for each slice, and MARGIN is the dimension (or dimensions) over which the function is applied. For example, we create an array of dimensions c(2, 4, 3) (so, dim(X) = c(2, 4, 3)). Then, we apply the function rowSums to the third dimension (MARGIN=3). rowSums will return a result with length 2 (so, d = 2). The result of apply will then be of dimension c(2, 3).
> (newarray <- array(c(1:8, 11:18, 111:118), dim = c(2, 4, 3),
    dimnames = list(LETTERS[1:2], letters[1:4], c("array1", "array2", "array3"))))
, , array1
  a b c d
A 1 3 5 7
B 2 4 6 8

, , array2
   a  b  c  d
A 11 13 15 17
B 12 14 16 18

, , array3
    a   b   c   d
A 111 113 115 117
B 112 114 116 118

> apply(newarray, 3, rowSums)
  array1 array2 array3
A     16     56    456
B     20     60    460
> apply(newarray, 3, colSums)
  array1 array2 array3
a      3     23    223
b      7     27    227
c     11     31    231
d     15     35    235
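To make the dimension rule c(d, dim(X)[MARGIN]) concrete, here is a short R sketch on an unnamed array of the same shape. The array X and the result names are illustrative only.

```r
X <- array(1:24, dim = c(2, 4, 3))

r1 <- apply(X, 3, rowSums)     # d = 2, so the result has dim c(2, 3)
r2 <- apply(X, 3, colSums)     # d = 4, so the result has dim c(4, 3)
r3 <- apply(X, 3, sum)         # d = 1: the result is a plain vector of length 3
r4 <- apply(X, c(1, 3), sum)   # two margins: dim is dim(X)[c(1, 3)] = c(2, 3)
```

Note that r1 and r4 hold the same numbers: summing the rows of each 2 x 4 slice is the same as summing over the second dimension.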

But, what happens when we want a matrix result returned when we apply a function to each matrix of an array?
> (newarray.ginv <- apply(newarray, 3, ginverse))
             array1 array2 array3
[1,] -1.000000e+000  -2.50 -17.50
[2,] -5.000000e-001  -1.00  -6.00
[3,] -6.730727e-016   0.50   5.50
[4,]  5.000000e-001   2.00  17.00
[5,]  8.500000e-001   2.35  17.35
[6,]  4.500000e-001   0.95   5.95
[7,]  5.000000e-002  -0.45  -5.45
[8,] -3.500000e-001  -1.85 -16.85


We could put the results into a matrix using matrix, but it takes a little dirty work. Another way is to transform the array to a list, then lapply the same function to the list. Finally, we can transform back to an array.
> newlist <- apply(newarray, 3, as.data.frame)
> (newlist.ginv <- lapply(newlist, ginverse))
$array1:
               [,1]  [,2]
[1,] -1.000000e+000  0.85
[2,] -5.000000e-001  0.45
[3,] -6.730727e-016  0.05
[4,]  5.000000e-001 -0.35
attr($array1, "rank"):
[1] 2

$array2:
     [,1]  [,2]
[1,] -2.5  2.35
[2,] -1.0  0.95
[3,]  0.5 -0.45
[4,]  2.0 -1.85
attr($array2, "rank"):
[1] 2

$array3:
      [,1]   [,2]
[1,] -17.5  17.35
[2,]  -6.0   5.95
[3,]   5.5  -5.45
[4,]  17.0 -16.85
attr($array3, "rank"):
[1] 2

# back to an array
> array(unlist(newlist.ginv), dim=c(4,2,3))
, , 1
               [,1]  [,2]
[1,] -1.000000e+000  0.85
[2,] -5.000000e-001  0.45
[3,] -6.730727e-016  0.05
[4,]  5.000000e-001 -0.35

, , 2
     [,1]  [,2]
[1,] -2.5  2.35
[2,] -1.0  0.95
[3,]  0.5 -0.45
[4,]  2.0 -1.85

, , 3
      [,1]   [,2]
[1,] -17.5  17.35
[2,]  -6.0   5.95
[3,]   5.5  -5.45
[4,]  17.0 -16.85

However, the easiest way is to use the dim and dimnames arguments of array on newarray.ginv:
array(newarray.ginv, dim=dim(newarray)[c(2,1,3)], dimnames= dimnames(newarray)[c(2,1,3)])
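The reshaping trick does not depend on ginverse. As a sketch in R, I use the transpose t as a stand-in for any function that turns each 2 x 4 slice into a 4 x 2 matrix, so the result can be checked against aperm. The array X is illustrative.

```r
X <- array(1:24, dim = c(2, 4, 3))

flat <- apply(X, 3, t)                          # each 4 x 2 result is flattened: 8 x 3
back <- array(flat, dim = dim(X)[c(2, 1, 3)])   # restore the 4 x 2 x 3 shape

# Transposing every slice is the same as swapping the first two dimensions.
ok <- all(back == aperm(X, c(2, 1, 3)))
```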


One can also sum the matrix components of an array. We show three functions for doing so. Try out the first function on newarray.
sum.array <- function(array) {
    res <- aperm(array, c(3,2,1))
    apply(res, 3, colSums)
}
> sum.array(newarray)
    A   B
a 123 126
b 129 132
c 135 138
d 141 144

A quicker way uses matrix multiplication instead of the apply function. The apply function is not as efficient as matrix multiplication when the latter can be done.
sum.array2 <- function(array) {
    res <- aperm(array, c(3,2,1))
    d <- dim(res)
    # a row vector of ones times the d[1]-row matrix sums over the first dimension
    matrix(rep(1, d[1]) %*% matrix(res, nr=d[1]), nr=d[2], dimnames=dimnames(res)[-1])
}
> sum.array2(newarray)
    A   B
a 123 126
b 129 132
c 135 138
d 141 144

A third way uses the function colSums, which has the efficiency of sum.array2 and the simplicity of sum.array.
sum.array3 <- function(array) {
    res <- aperm(array, c(3,2,1))
    colSums(res)
}
> sum.array3(newarray)
    A   B
a 123 126
b 129 132
c 135 138
d 141 144

Comparing them, we get the following (the resources function appears in Venables and Ripley (2000) and is available from their Sprog scripts online).
> resources(sum.array(newarray))
  CPU Elapsed % CPU Child Cache Working
 0.11    0.11   100     0     0    4540
> resources(sum.array2(newarray))
  CPU Elapsed % CPU Child Cache Working
 0.05    0.05   100     0     0    1713
> resources(sum.array3(newarray))
  CPU Elapsed % CPU Child Cache Working
 0.06    0.06   100     0     0    1001

The time is cut in half using sum.array2 instead of sum.array.
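For readers working in R, a sketch of the same sums follows. The first two lines mirror sum.array and sum.array3; the third is an alternative I am adding that uses the dims argument of rowSums and avoids aperm altogether. The resources function is specific to S-PLUS, so no timings are shown.

```r
X <- array(c(1:8, 11:18, 111:118), dim = c(2, 4, 3),
           dimnames = list(LETTERS[1:2], letters[1:4],
                           c("array1", "array2", "array3")))

s1 <- apply(aperm(X, c(3, 2, 1)), 3, colSums)   # as in sum.array
s3 <- colSums(aperm(X, c(3, 2, 1)))             # as in sum.array3
s4 <- t(rowSums(X, dims = 2))                   # sum over the third dimension directly
```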

The function tapply


The function tapply applies a function to parts of a vector defined by a factor or combination of factors. For example, we can get a table of frequencies for Country of manufacture by Type of Car in the cu.summary data frame.
> tapply(cu.summary$Country, cu.summary$Type, table)
$Compact:
 Brazil England France Germany Japan Japan/USA Korea Mexico Sweden USA
      0       0      1       4     3         4     0      0      3   7

$Large:
 Brazil England France Germany Japan Japan/USA Korea Mexico Sweden USA
      0       0      0       0     0         0     0      0      0   7

$Medium:
 Brazil England France Germany Japan Japan/USA Korea Mexico Sweden USA
      0       1      1       3     6         0     1      0      2  16

$Small:
 Brazil England France Germany Japan Japan/USA Korea Mexico Sweden USA
      1       0      0       1     7         4     4      2      0   3

$Sporty:
 Brazil England France Germany Japan Japan/USA Korea Mexico Sweden USA
      0       0      0       2    11         1     0      1      0  11

$Van:
 Brazil England France Germany Japan Japan/USA Korea Mexico Sweden USA
      0       0      0       1     4         0     0      0      0   5

See the examples in V&R. Note how they apply a function to the diagonals of a matrix.
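A smaller tapply sketch in R, on made-up data (the vectors below are illustrative and are not taken from cu.summary), showing grouping by one factor and by a combination of two.

```r
type    <- factor(c("Small", "Small", "Van", "Van", "Van"))
country <- factor(c("Japan", "USA", "USA", "USA", "Japan"))
price   <- c(7000, 8000, 14000, 13000, 15000)

mean.by.type <- tapply(price, type, mean)                   # one factor: a named vector
mean.by.both <- tapply(price, list(type, country), mean)    # two factors: a matrix
```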

The functions lapply, sapply, and split


Both lapply and sapply apply a function to the individual components of a list or a vector. Both work on data frames. A difference between the two functions is in the result. lapply keeps the result as a list; sapply tries to simplify the result to a vector or array. We've seen examples from both functions already. One nice use of sapply is for doing vectorized calculations when the calculation comes from a function that does not vectorize. For example, for the data frame state.x77, suppose we want to find out which states have the maximum values on each of the variables measured. The state with the maximum illiteracy will not necessarily be the same as the state with the maximum population.
We can find the maximum values for each column using the apply function: apply(state.x77, 2, max). But this only gives us the numbers, not the states which produced them. Also, using the function which on the result of apply will return NULL unless there is a single state that produces the maximum values for each variable. sapply can help.

We can sapply over the names of the variables in this matrix. The function we apply to each variable name is
function(x) which(apply(state.x77,2,max)[x]== state.x77[,x])

The variable names are given in dimnames(state.x77)[[2]]. So we use sapply with this as our first argument:
> sapply(dimnames(state.x77)[[2]],
    function(x, ref.table) which(apply(ref.table, 2, max)[x] == ref.table[, x]),
    ref.table = state.x77)
[1]  5  2 18 11  1 44 28  2

I chose to pass in the data matrix state.x77 instead of accessing it directly. I did this to show how arguments can be passed to the function used in sapply. Now, we can see which states have the maximum values on each variable.
> ind <- sapply(dimnames(state.x77)[[2]],
    function(x, ref.table) which(apply(ref.table, 2, max)[x] == ref.table[, x]),
    ref.table = state.x77)
> structure(dimnames(state.x77)[[1]][ind], names = dimnames(state.x77)[[2]])
   Population   Income  Illiteracy Life Exp    Murder HS Grad    Frost     Area
 "California" "Alaska" "Louisiana" "Hawaii" "Alabama"  "Utah" "Nevada" "Alaska"
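In R there is a shorter route to the same table, sketched below. It relies on which.max, which returns the position of the first maximum, so ties are resolved in favour of the earlier state. This is an alternative I am suggesting, not the approach used above.

```r
# Row index of the (first) maximum in each column of the built-in matrix.
idx <- apply(state.x77, 2, which.max)

# Attach the state names, labelled by variable.
top <- structure(rownames(state.x77)[idx], names = colnames(state.x77))
```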

Here's another command of interest:


structure(data.frame(apply(state.x77, 2, rank)), row.names=dimnames(state.x77)[[1]])

A use of lapply that we have not yet seen is to replace explicit looping. Here is an example of using lapply as a replacement for a for loop:
res <- vector("list", B)   # res is where we will keep the B results of the loop
# do the loop
res <- lapply(1:B, my.function, args.list)

args.list gives a list of arguments for my.function. Sometimes using lapply instead of for helps memory management, keeping it under manageable control.
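A runnable R sketch of the two forms side by side. The body of my.function, the value of B, and the contents of args.list are placeholders I made up so that the code can run.

```r
B <- 5
args.list <- list(scale = 2)                                   # hypothetical arguments
my.function <- function(i, args.list) args.list$scale * i^2    # hypothetical workhorse

# Explicit loop: pre-allocate the list, then fill it.
res.loop <- vector("list", B)
for (i in 1:B) res.loop[[i]] <- my.function(i, args.list)

# The same computation with lapply: no pre-allocation or index bookkeeping.
res.lapply <- lapply(1:B, my.function, args.list)
```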

To get a list of all the levels of the factors in the cu.summary data frame, do
> lapply(cu.summary[sapply(cu.summary, is.factor)], levels)
$Country:
 [1] "Brazil"    "England"   "France"    "Germany"   "Japan"     "Japan/USA" "Korea"
 [8] "Mexico"    "Sweden"    "USA"

$Reliability:
[1] "Much worse"  "worse"       "average"     "better"      "Much better"

$Type:
[1] "Compact" "Large"   "Medium"  "Small"   "Sporty"  "Van"

One can use lapply on a data frame, as it is a list. For example, suppose we want to apply a function that operates on vectors to all numeric columns of a data frame. We can use either lapply or apply for this; with lapply:
my.data.frame[]<-lapply(my.data.frame, function(x) if(is.numeric(x)) my.function(x) else x)

The left-hand side ensures that the row.names remain intact. The function split(data, group) takes in a vector, matrix, or data frame (data) and splits it by an index (group), a vector or factor giving the indices, returning a list. This list can then be passed to lapply. Using the combination of split then lapply is usually equivalent to using tapply alone, but frequently much faster (see discussion in V&R, p. 107).
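A short R sketch of split followed by sapply, compared with tapply on the same made-up data (the vectors are illustrative).

```r
price <- c(7000, 8000, 14000, 13000, 15000)
type  <- factor(c("Small", "Small", "Van", "Van", "Van"))

pieces    <- split(price, type)        # a list with components Small and Van
by.split  <- sapply(pieces, mean)      # apply mean to each piece
by.tapply <- tapply(price, type, mean) # the one-step equivalent
```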

Functions operating on data frames: aggregate and by


The function aggregate is very similar to tapply. It behaves as though you were using tapply on each column of a data frame. If you want to get means on all numeric variables of a data frame grouped by one or more factors, aggregate will do it. For example, we get means on Price and Mileage for cars grouped by Country, Reliability and Type, in the cu.summary data frame. Even though we set the argument na.rm to T for the mean function, we still get NA for some classifications because there were only missing values for those cross-classifications.
> check.factors <- sapply(cu.summary, is.factor)
> aggregate(cu.summary[!check.factors], cu.summary[check.factors], mean, na.rm = T)
   Country   Reliability Type       Price  Mileage
 1 USA       Much worse  Compact  9470.00 26.00000
 2 Sweden    worse       Compact 21970.00       NA
 3 USA       worse       Compact 10014.33 23.33333
 4 Germany   average     Compact 31600.00       NA
 5 Sweden    average     Compact 18450.00 23.00000
 6 USA       average     Compact  8845.00       NA
 7 Germany   better      Compact 24650.00       NA
 8 Japan     better      Compact 17879.00 21.00000
 9 Japan/USA better      Compact 12459.00 24.00000
10 USA       better      Compact 10945.00 25.00000
11 Japan     Much better Compact 11319.50 23.00000
12 Japan/USA Much better Compact 11744.00 25.33333
13 USA       Much worse  Large   20962.50 18.00000
14 USA       average     Large   21714.60 21.50000
15 Germany   Much worse  Medium  39950.00       NA
16 USA       Much worse  Medium  18955.33 23.00000
17 USA       worse       Medium  17604.25 21.66667
18 Japan     average     Medium  21498.00 23.00000
19 Sweden    average     Medium  21700.00       NA
20 USA       average     Medium  19010.29 21.66667
21 Japan     Much better Medium  21986.33 21.00000
22 Korea     Much worse  Small    7254.00 28.00000
23 USA       Much worse  Small    6995.00       NA
24 Korea     worse       Small    5899.00       NA
25 USA       worse       Small    7402.00 33.00000
26 Germany   average     Small    9995.00 26.00000
27 Mexico    average     Small    8695.00       NA
28 Korea     better      Small    6319.00 37.00000
29 Mexico    better      Small    8672.00 26.00000
30 USA       better      Small    8895.00 33.00000
31 Japan     Much better Small    8659.00 30.66667
32 Japan/USA Much better Small    8226.75 31.33333
33 USA       Much worse  Sporty  14111.29 22.00000
34 Japan     average     Sporty  12749.00 24.00000
35 USA       average     Sporty  13098.00 30.00000
36 Japan     better      Sporty  22860.00       NA
37 Japan     Much better Sporty  13745.00 30.00000
38 Japan/USA Much better Sporty  12279.00       NA
39 USA       Much worse  Van     13790.00       NA
40 USA       average     Van     13219.00 18.00000
41 Japan     Much better Van     14944.00 19.00000

> aggregate(cu.summary$Price, by = list(Type = cu.summary$Type), mean)
           Type         x
Compact Compact 15201.909
Large     Large 21499.714
Medium   Medium 21622.867
Small     Small  7736.591
Sporty   Sporty 15308.115
Van         Van 14014.300
> aggregate(cu.summary$Price, by = list(Type = cu.summary$Type), mean)$x
[1] 15201.909 21499.714 21622.867  7736.591 15308.115 14014.300
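The same idea on a tiny made-up data frame in R (the data frame toy and its values are illustrative only). Extra arguments such as na.rm are passed through to the summary function.

```r
toy <- data.frame(
  Type    = factor(c("Small", "Small", "Van", "Van")),
  Price   = c(7000, 8000, 14000, 13000),
  Mileage = c(33, NA, 18, NA))

# Means of both numeric columns within each Type, ignoring missing values.
agg <- aggregate(toy[c("Price", "Mileage")], by = list(Type = toy$Type),
                 FUN = mean, na.rm = TRUE)
```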

Control Structures
The function ifelse
S has ordinary if-then-else structures, as well as a vectorized ifelse function. ifelse(test, yes, no) takes a vector (test) and returns yes[i] if test[i]==T and no[i] otherwise. If yes or no are not as long as test, they will be repeated cyclically.
ifelse evaluates all three arguments. So, one must be careful:
> y <- c(-1, 0, 2, 3)
> ifelse(y > 0, log(y), 0)
[1] 0.0000000 0.0000000 0.6931472 1.0986123
Warning messages:
  NAs generated
> log(ifelse(y > 0, y, 1))
[1] 0.0000000 0.0000000 0.6931472 1.0986123

(Note that V&R's solution to this is different than mine)
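The same pitfall can be checked in R. The sketch below records whether a warning was raised instead of printing it; the variable names are mine.

```r
y <- c(-1, 0, 2, 3)

# log(y) is evaluated for every element, including -1, hence the warning.
warned <- tryCatch({ ifelse(y > 0, log(y), 0); FALSE },
                   warning = function(w) TRUE)

# Choosing a safe argument first and taking the log afterwards avoids it.
safe <- log(ifelse(y > 0, y, 1))
```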

The function switch


The function switch is similar to case functions in some compiled languages. The form is:


switch(test, additional arguments evaluated based on result of test)

o If test evaluates to a character string, the value of the expression is that of the matching named argument, or the default if none matches. The default is usually the last unnamed argument. Example:

switch(test.choice,
    Levene=, levene=levene(y, f),
    Cochran=, cochran=cochran(y, f),
    Bartlett=, bartlett=, bartlett(y, f))

If test.choice evaluates to "Levene" or "levene", the expression evaluates to levene(y, f). If no option is matched, the expression evaluates to bartlett(y, f), the final unnamed argument.

o To allow abbreviated names (e.g., test.choice evaluates to "Lev" or "lev"), use the pmatch function with nomatch="". Example:

switch(pmatch(test.choice, c("Levene", "levene", "Cochran", "cochran"), nomatch=""),
    # result of pmatch is coerced to character mode
    "1"=, "2"=levene(y, f),
    "3"=, "4"=cochran(y, f),
    bartlett(y, f))

o If test evaluates to a number (which is coerced to an integer using trunc), the argument evaluated is the one whose position in the list of arguments matches the integer. There is no default argument (nonmatching evaluations of test give the result NULL). So, we select an argument by its position in the list of arguments. Example:

switch(1, T, F)          # returns T
switch(2, T, F)          # returns F
switch(4, T, F, T, ,)    # returns a result with mode "missing"
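A self-contained R sketch of the character and numeric forms. The function choose.test is a stand-in I made up; it returns the name of the test that would be run rather than calling levene, cochran, or bartlett.

```r
choose.test <- function(test.choice) {
  switch(test.choice,
         Levene = , levene = "levene",     # an empty argument falls through to the next one
         Cochran = , cochran = "cochran",
         "bartlett")                       # final unnamed argument: the default
}

by.name    <- choose.test("Levene")
by.default <- choose.test("something else")

# Numeric form: select by position; out-of-range gives NULL in R.
second  <- switch(2, "a", "b")
nothing <- switch(4, "a", "b")
```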

Scalar logical operators (only deal with single values)

o && : the right-hand expression is evaluated only if the left-hand one is true
o || : the right-hand expression is evaluated only if the left-hand one is false
o example: if(any(y < 0) || any(x < 0)) stop("none of the data should be negative")

Logical summary functions

o any, all
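The pieces above can be sketched together in R. The function check.data and the variable names are illustrative.

```r
check.data <- function(x, y) {
  # any() reduces each comparison to a single TRUE/FALSE, as || requires
  if (any(y < 0) || any(x < 0)) stop("none of the data should be negative")
  TRUE
}

ok  <- check.data(1:3, c(0, 2))
msg <- tryCatch(check.data(c(1, -1), 1:2),
                error = function(e) conditionMessage(e))

# && short-circuits: z[1] > 0 is never evaluated because the left side is FALSE.
z <- NULL
guarded <- !is.null(z) && z[1] > 0
```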

Looping
Typical looping structures exist: for, while, and repeat loops. These are used in the usual way (see p. 58 of V&R). However, in S, one can loop over the items in a list. For example,
x <- list(ice.cream.choice=c("vanilla", "chocolate", "strawberry", "chocolate"),
    topping.choice=rev(c("vanilla", "chocolate", "strawberry", "chocolate")))
> x
$ice.cream.choice:
[1] "vanilla"    "chocolate"  "strawberry" "chocolate"

$topping.choice:
[1] "chocolate"  "strawberry" "chocolate"  "vanilla"

> for(i in x) print(sort(i))
[1] "chocolate"  "chocolate"  "strawberry" "vanilla"
[1] "chocolate"  "chocolate"  "strawberry" "vanilla"

Note that the above is also accomplished using lapply:

> lapply(x, sort)
$ice.cream.choice:
[1] "chocolate"  "chocolate"  "strawberry" "vanilla"

$topping.choice:
[1] "chocolate"  "chocolate"  "strawberry" "vanilla"

Memory Management in S-PLUS


Explicit loops (like for loops or while loops) are memory intensive. The reason is that S does not release any memory until these loops are finished. Some things to do to ease the computational burden of an explicit loop: 1) use lapply() instead of a for()loop. See the description of using lapply above, and 2) put the workhorse of the loop into a function and make the last statement of the loop NULL. For example,
for(i in 1:R) {
    res[i] <- f()   # f() is the workhorse containing the computations
    NULL
}

Later, I will discuss the For() loop, which starts a new S-PLUS process at each iteration of the loop so that each step is run as a top-level expression. This causes S-PLUS to release memory after each iteration of the loop. However, a new process must be called at each step, so a For loop will be useful only for very large computational tasks done at each iteration. A short loop is better left as a regular for() loop. Actually, what For does is call the program Sqpe.exe which starts up a terminal S-PLUS process. It does this while in one S-PLUS process using the MULTIPLE_INSTANCES switch (see Chapter 1). There are several arguments one can use with the For loop to change how the process is run (background or foreground) or how the commands in the loop are issued (one by one or in blocks of expressions).


Method Dispatch

Methods written for classes of objects

var(x) # x is a vector

var(X) # X is a matrix

Debugging functions

trace

debugger

browser

inspect

find.calls

fix

Edit


Part III: Command-Line Graphics and Advanced Graphics Techniques


In this chapter I briefly describe the more common command-line graphics. Much more information can be found in the on-line help screens. Most of the information here can be found in Chapter 4 of V&R.

Setting up a graphics device


There are several ways to output graphics in S-PLUS and R. A common way is via a graphics window. If a graphics window has not been opened prior to issuing a graphics command in either S-PLUS or R, then a graphics window will be opened automatically with the default settings. In S-PLUS 6 for Windows this will be a graphsheet. In R it is a graphics device similar to that opened via the commands windows() or win.graph().
To output graphics in Windows to a Windows metafile, call the commands wmf.graph (in S-PLUS) or win.metafile (in R). To output graphics to a pdf file, use pdf.graph (in S-PLUS) or pdf (in R).

You can have several graphics windows or devices open simultaneously. As they open, they are assigned a number. The current device (that is, the one to which graphics commands will be sent) is the most recently opened device by default. To change the current device to, say, a graphics window in focus, select the Make Current item in the window's system menu. Alternatively, use dev.set(which), where which is the number of the device to be made the current one. To turn off one of the devices, issue the command dev.off(which), with which being the device number. To close all devices, use graphics.off().
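A sketch of device management in R, using two pdf devices written to temporary files so that it runs without a screen. The file names and variable names are illustrative.

```r
graphics.off()                      # start with no open devices
f1 <- tempfile(fileext = ".pdf")
f2 <- tempfile(fileext = ".pdf")

pdf(f1); d1 <- dev.cur()            # first device
pdf(f2); d2 <- dev.cur()            # most recently opened device becomes current

dev.set(d1)                         # make the first device current again
plot(1:10)                          # drawn on device d1, i.e. written to f1
current <- dev.cur()

dev.off(d2)                         # close one device by number
graphics.off()                      # close everything that is still open
n.open <- length(dev.list())        # dev.list() is NULL when nothing is open
```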

Setting up and manipulating a graphsheet in S-PLUS


The function graphsheet has several options. For example, you can set the width and height of the sheet and its orientation (landscape, portrait), as well as the color scheme to use. Once you have created a blank graphsheet, you might find it useful to divide it into separate screens that can be recreated without having to recreate the whole graph. For example, suppose we have four plots to draw on one graphsheet. We first split the screen into top and bottom halves, using split.screen(c(2,1)). This results in a screen called 1 (the top) and a screen called 2 (the bottom). Next, we split the top screen into two halves, and the bottom screen into two halves. These are called, respectively, screens 3 and 4, and screens 5 and 6.
graphsheet(width=9, height=8)
split.screen(c(2,1))             # screen 1 = top; screen 2 = bottom
split.screen(c(1,2), screen=1)   # split screen 1 into two parts (3, 4)
split.screen(c(1,2), screen=2)   # split screen 2 into two parts (5, 6)

To draw the first plot in the upper left corner (screen 3), make screen 3 the focus, and call the plotting function:
screen(3)                        # ready to draw on screen 3 (upper left corner)
plot(<put arguments here>)


Do the same for the remaining screens. If you later want to change what was drawn on screen 3, you can erase it, and redraw:
erase.screen(3)                  # erase screen 3
screen(3)                        # ready to draw on screen 3 (upper left corner)
plot(<put arguments here>)       # draw something else

To close the screen numbering, issue


close.screen(all=T)

If you forget to issue close.screen(all=T), then the next set of screens you create will start with number 7, which can become very confusing.
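The screen functions also exist in R, so the whole sequence can be sketched there. Since graphsheet is specific to S-PLUS, I substitute a pdf device written to a temporary file; the plots themselves are placeholders.

```r
pdf(tempfile(fileext = ".pdf"), width = 9, height = 8)   # stand-in for graphsheet()

top.bottom <- split.screen(c(2, 1))             # screens 1 (top) and 2 (bottom)
upper <- split.screen(c(1, 2), screen = 1)      # screens 3 and 4
lower <- split.screen(c(1, 2), screen = 2)      # screens 5 and 6

screen(3); plot(1:10)                           # draw in the upper left corner
erase.screen(3)                                 # erase it ...
screen(3); plot(10:1)                           # ... and draw something else

close.screen(all.screens = TRUE)                # reset the screen numbering
dev.off()
```

The values returned by split.screen are the numbers of the screens just created, which is a convenient way to confirm the numbering described above.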
graphsheet can be used to output a graph in a format that is not a screen device. For example, set the argument format = "JPG" together with file = "mygraph.jpg" to output a graph as a JPEG file. Other formats are EPS, WMF, and TIF. To send output to the printer, use format = "printer".

You can also name a graphsheet for use with the guiModify function later. guiModify can be used to make changes to an editable graphsheet that has already been created. One of its arguments (Name) is the name of the graphsheet. If you don't set Name, the default name is GSD<number>, starting with number 2.

2D plots 3D plots Ggobi in R
