Statistics and Data with R
Statistics and Data with R: An applied approach through examples
© 2008 John Wiley & Sons, Ltd. ISBN: 978047075805
Y. Cohen and J.Y. Cohen
Statistics and Data with R :
An applied approach through examples
Yosef Cohen University of Minnesota, USA.
Jeremiah Y. Cohen Vanderbilt University, Nashville, USA.
This edition first published 2008 c 2008 John Wiley & Sons Ltd.
Registered office John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom
For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.
The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.
Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or
registered trademarks of their respective owners. The publisher is not associated with any product
or vendor mentioned in this book. This publication is designed to provide accurate and
authoritative information in regard to the subject matter covered. It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.
Library of Congress CataloginginPublication Data
Cohen, Yosef. Statistics and data with R : an applied approach through examples / Yosef Cohen, Jeremiah Cohen. p. cm. Includes bibliographical references and index. ISBN 9780470758052 (cloth) 1. Mathematical statistics—Data processing. 2. R (Computer program language) I. Cohen, Jeremiah. II. Title. QA276.45.R3C64 2008 519.50285 ^{} 2133—dc22
2008032153
A catalogue record for this book is available from the British Library.
ISBN 9780470758052
Typeset in 10/12pt Computer Modern by Laserwords Private Limited, Chennai, India Printed and bound in Great Britain by Antony Rowe Ltd, Chippenham, Wiltshire
To the memory of Gad Boneh
Contents
Preface 
xv 

Part I Data in statistics and R 

1 
Basic R 
3 

1.1 Preliminaries 
4 

1.1.1 
An R session 
4 

1.1.2 
Editing statements 
8 

1.1.3 
The functions help() , help.search() and example() 
8 

1.1.4 
Expressions 
10 

1.1.5 
Comments, line continuation and Esc 
11 

1.1.6 
source() , sink() and history() 
11 

1.2 Modes 
13 

1.3 Vectors 
14 

1.3.1 Creating vectors 
14 

1.3.2 Useful vector functions 
15 

1.3.3 Vector arithmetic 
15 

1.3.4 Character vectors 
17 

1.3.5 Subsets and index vectors 
18 

1.4 Arithmetic operators and special values 
20 

1.4.1 Arithmetic operators 
20 

1.4.2 Logical operators 
21 

1.4.3 Special values 
22 

1.5 Objects 
24 

1.5.1 Orientation 
24 

1.5.2 Object attributes 
26 

1.6 Programming 
28 

1.6.1 Execution controls 
28 

1.6.2 Functions 
30 
1.7
Packages
33
viii Contents
1.8 
Graphics 
34 
1.8.1 Highlevel plotting functions 
35 

1.8.2 Lowlevel plotting functions 
36 

1.8.3 Interactive plotting functions 
36 

1.8.4 Dynamic plotting 
36 

1.9 Customizing the workspace 
36 

1.10 Projects 
37 

1.11 A note about producing figures and output 
39 

1.11.1 openg() 
39 

1.11.2 saveg() 
40 

1.11.3 h() 
40 

1.11.4 nqd() 
40 

1.12 
Assignments 
41 
2 Data in statistics and in R 
45 

2.1 Types of data 
45 

2.1.1 Factors 
45 

2.1.2 Ordered factors 
48 

2.1.3 Numerical variables 
49 

2.1.4 Character variables 
50 

2.1.5 Dates in R 
50 

2.2 Objects that hold data 
50 

2.2.1 Arrays and matrices 
51 

2.2.2 Lists 
52 

2.2.3 Data frames 
54 

2.3 Data organization 
55 

2.3.1 Data tables 
55 

2.3.2 Relationships among tables 
57 

2.4 Data import, export and connections 
58 

2.4.1 Import and export 
58 

2.4.2 Data connections 
60 

2.5 Data manipulation 
63 

2.5.1 Flat tables and expand tables 
63 

2.5.2 Stack, unstack and reshape 
64 

2.5.3 Split, unsplit and unlist 
66 

2.5.4 Cut 
66 

2.5.5 Merge, union and intersect 
68 

2.5.6 is.element() 
69 

2.6 Manipulating strings 
71 

2.7 Assignments 
72 

3 Presenting data 
75 

3.1 Tables and the flavors of apply() 
75 

3.2 Bar plots 
77 

3.3 Histograms 
81 

3.4 Dot charts 
85 

3.5 Scatter plots 
86 
3.6
Lattice plots
88
Contents
ix
3.7 Threedimensional plots and contours 
90 
3.8 Assignments 
90 
Part II Probability, densities and distributions 

4 Probability and random variables 
97 
4.1 Set theory 
98 
4.1.1 Sets and algebra of sets 
98 
4.1.2 Set theory in R 
103 
4.2 Trials, events and experiments 
103 
4.3 Definitions and properties of probability 
108 
4.3.1 Definitions of probability 
108 
4.3.2 Properties of probability 
111 
4.3.3 Equally likely events 
112 
4.3.4 Probability and set theory 
112 
4.4 Conditional probability and independence 
113 
4.4.1 Conditional probability 
114 
4.4.2 Independence 
116 
4.5 Algebra with probabilities 
118 
4.5.1 Sampling with and without replacement 
118 
4.5.2 Addition 
119 
4.5.3 Multiplication 
120 
4.5.4 Counting rules 
120 
4.6 Random variables 
127 
4.7 Assignments 
128 
5 Discrete densities and distributions 
137 
5.1 Densities 
137 
5.2 Distributions 
141 
5.3 Properties 
143 
5.3.1 Densities 
144 
5.3.2 Distributions 
144 
5.4 Expected values 
144 
5.5 Variance and standard deviation 
146 
5.6 The binomial 
147 
5.6.1 Expectation and variance 
151 
5.6.2 Decision making with the binomial 
151 
5.7 The Poisson 
153 
5.7.1 The Poisson approximation to the binomial 
155 
5.7.2 Expectation and variance 
156 
5.7.3 Variance of the Poisson density 
157 
5.8 Estimating parameters 
161 
5.9 Some useful discrete densities 
163 
5.9.1 Multinomial 
163 
5.9.2 Negative binomial 
165 
5.9.3 Hypergeometric 
168 
5.10
Assignments
171
x
Contents
6 Continuous distributions and densities 
177 
6.1 Distributions 
177 
6.2 Densities 
180 
6.3 Properties 
181 
6.3.1 Distributions 
181 
6.3.2 Densities 
182 
6.4 Expected values 
183 
6.5 Variance and standard deviation 
184 
6.6 Areas under density curves 
185 
6.7 Inverse distributions and simulations 
187 
6.8 Some useful continuous densities 
189 
6.8.1 Double exponential (Laplace) 
189 
6.8.2 Normal 
191 
6.8.3 χ ^{2} 
193 
6.8.4 Student t 
195 
6.8.5 F 
197 
6.8.6 Lognormal 
198 
6.8.7 Gamma 
199 
6.8.8 Beta 
201 
6.9 Assignments 
203 
7 The normal and sampling densities 
205 
7.1 The normal density 
205 
7.1.1 The standard normal 
207 
7.1.2 Arbitrary normal 
210 
7.1.3 Expectation and variance of the normal 
212 
7.2 Applications of the normal 
213 
7.2.1 The normal approximation of discrete densities 
214 
7.2.2 Normal approximation to the binomial 
215 
7.2.3 The normal approximation to the Poisson 
218 
7.2.4 Testing for normality 
220 
7.3 Data transformations 
225 
7.4 Random samples and sampling densities 
226 
7.4.1 Random samples 
227 
7.4.2 Sampling densities 
228 
7.5 A detour: using R efficiently 
230 
7.5.1 Avoiding loops 
230 
7.5.2 Timing execution 
230 
7.6 The sampling density of the mean 
232 
7.6.1 The central limit theorem 
232 
7.6.2 The sampling density 
232 
7.6.3 Consequences of the central limit theorem 
234 
7.7 The sampling density of proportion 
235 
7.7.1 The sampling density 
236 
7.7.2 Consequence of the central limit theorem 
238 
7.8 The sampling density of intensity 
239 
7.8.1
The sampling density
239
Contents
xi
7.8.2 
Consequences of the central limit theorem 
241 
7.9 The sampling density of variance 
241 

7.10 Bootstrap: arbitrary parameters of arbitrary densities 
242 

7.11 Assignments 
243 

Part III Statistics 

8 Exploratory data analysis 
251 

8.1 Graphical methods 
252 

8.2 Numerical summaries 
253 

8.2.1 Measures of the center of the data 
253 

8.2.2 Measures of the spread of data 
261 

8.2.3 The Chebyshev and empirical rules 
267 

8.2.4 Measures of association between variables 
269 

8.3 Visual summaries 
275 

8.3.1 Box plots 
275 

8.3.2 Lag plots 
276 

8.4 Assignments 
277 

9 Point and interval estimation 
283 

9.1 Point estimation 
284 

9.1.1 Maximum likelihood estimators 
284 

9.1.2 Desired properties of point estimators 
285 

9.1.3 Point estimates for useful densities 
288 

9.1.4 Point estimate of population variance 
292 

9.1.5 Finding MLE numerically 
293 

9.2 Interval estimation 
294 

9.2.1 Large sample confidence intervals 
295 

9.2.2 Small sample confidence intervals 
301 

9.3 Point and interval estimation for arbitrary densities 
304 

9.4 Assignments 
307 

10 Single sample hypotheses testing 
313 

10.1 Null and alternative hypotheses 
313 

10.1.1 Formulating hypotheses 
314 

10.1.2 Types of errors in hypothesis testing 
316 

10.1.3 Choosing a significance level 
317 

10.2 Large sample hypothesis testing 
318 

10.2.1 Means 
318 

10.2.2 Proportions 
323 

10.2.3 Intensities 
324 

10.2.4 Common sense significance 
325 

10.3 Small sample hypotheses testing 
326 

10.3.1 Means 
326 

10.3.2 Proportions 
327 
10.3.3
Intensities
328
xii Contents
10.4 Arbitrary statistics of arbitrary densities 
329 

10.5 p values 
330 

10.6 Assignments 
333 

11 Power and sample size for single samples 
341 

11.1 
Large sample 
341 

11.1.1 Means 
342 

11.1.2 Proportions 
352 

11.1.3 Intensities 
356 

11.2 
Small samples 
359 

11.2.1 Means 
359 

11.2.2 Proportions 
361 

11.2.3 Intensities 
363 

11.3 Power and sample size for arbitrary densities 
365 

11.4 Assignments 
365 

12 Two samples 
369 

12.1 Large samples 
370 

12.1.1 Means 
370 

12.1.2 Proportions 
375 

12.1.3 Intensities 
379 

12.2 Small samples 
380 

12.2.1 Estimating variance and standard error 
380 

12.2.2 Hypothesis testing and confidence intervals for variance 
382 

12.2.3 Means 
384 

12.2.4 Proportions 
386 

12.2.5 Intensities 
387 

12.3 Unknown densities 
388 

12.3.1 Rank sum test 
389 

12.3.2 t vs. rank sum 
392 

12.3.3 Signed rank test 
392 

12.3.4 Bootstrap 
394 

12.4 Assignments 
396 

13 Power and sample size for two samples 
401 

13.1 Two means from normal populations 
401 

13.1.1 Power 
401 

13.1.2 Sample size 
404 

13.2 Two proportions 
406 

13.2.1 Power 
407 

13.2.2 Sample size 
409 

13.3 Two rates 
410 

13.4 Assignments 
415 

14 Simple linear regression 
417 

14.1 
Simple linear models 
417 

14.1.1 
The regression line 
418 
14.1.2
Interpretation of simple linear models
419
Contents xiii
14.2 Estimating regression coefficients 
422 

14.3 The model goodness of fit 
428 

14.3.1 The F test 
428 

14.3.2 The correlation coefficient 
433 

14.3.3 The correlation coefficient vs. the slope 
434 

14.4 Hypothesis testing and confidence intervals 
434 

14.4.1 t test for model coefficients 
435 

14.4.2 Confidence intervals for model coefficients 
435 

14.4.3 Confidence intervals for model predictions 
436 

14.4.4 t test for the correlation coefficient 
438 

14.4.5 z tests for the correlation coefficient 
439 

14.4.6 Confidence intervals for the correlation coefficient 
441 

14.5 Model assumptions 
442 

14.6 Model diagnostics 
443 

14.6.1 The hat matrix 
445 

14.6.2 Standardized residuals 
447 

14.6.3 Studentized residuals 
448 

14.6.4 The RSTUDENT residuals 
449 

14.6.5 The DFFITS residuals 
453 

14.6.6 The DFBETAS residuals 
454 

14.6.7 Cooke’s distance 
456 

14.6.8 Conclusions 
457 

14.7 Power and sample size for the correlation coefficient 
458 

14.8 Assignments 
459 

15 Analysis of variance 
463 

15.1 Oneway, fixedeffects ANOVA 
463 

15.1.1 The model and assumptions 
464 

15.1.2 The 
F test 
469 
15.1.3 Paired group comparisons 
475 

15.1.4 Comparing sets of groups 
484 

15.2 Nonparametric oneway ANOVA 
488 

15.2.1 The KruskalWallis test 
488 

15.2.2 Multiple comparisons 
491 

15.3 Oneway, randomeffects ANOVA 
492 

15.4 Twoway ANOVA 
495 

15.4.1 Twoway, fixedeffects ANOVA 
496 

15.4.2 The model and assumptions 
496 

15.4.3 Hypothesis testing and the F test 
500 

15.5 Twoway linear mixed effects models 
505 

15.6 Assignments 
509 

16 Simple logistic regression 
511 

16.1 Simple binomial logistic regression 
511 

16.2 Fitting and selecting models 
519 

16.2.1 The log likelihood function 
519 

16.2.2 Standard errors of coefficients and predictions 
521 
16.2.3
Nested models
524
xiv Contents
16.3 
Assessing goodness of fit 
525 

16.3.1 The Pearson χ ^{2} statistic 
526 

16.3.2 The deviance χ ^{2} statistic 
527 

16.3.3 The group adjusted χ ^{2} statistic 
528 

16.3.4 The ROC curve 
529 

16.4 
Diagnostics 
533 

16.4.1 Analysis of residuals 
533 

16.4.2 Validation 
536 

16.4.3 Applications of simple logistic regression to 2 × 2 tables 
536 

16.5 
Assignments 
539 

17 
Application: the shape of wars to come 
541 

17.1 
A statistical profile of the war in Iraq 
541 

17.1.1 Introduction 
542 

17.1.2 The data 
542 

17.1.3 Results 
543 

17.1.4 Conclusions 
550 

17.2 
A statistical profile of the second Intifada 
552 

17.2.1 
Introduction 
552 

17.2.2 
The data 
553 

17.2.3 
Results 
553 

17.2.4 
Conclusions 
561 

References 
563 

R Index 
569 

General Index 
583 
Preface
For the purpose of this book, let us agree that Statistics ^{1} is the study of data for some purpose. The study of the data includes learning statistical methods to analyze it and draw conclusions. Here is a question and answer session about this book. Skip the answers to those questions that do not interest you.
• Who is this book intended for?
The book is intended for students, researchers and practitioners both in and out of academia. However, no prior knowledge of statistics is assumed. Consequently, the presentation moves from very basic (but not simple) to sophisticated. Even if you know statistics and R , you may find the many many examples with a variety of real world data, graphics and analyses useful. You may use the book as a reference and, to that end, we include two extensive indices. The book includes (almost) parallel discussions of analyses with the normal density, proportions (binomial), counts (Poisson) and bootstrap methods.
• Why “Statistics and data with R ”?
Any project in which statistics is applied involves three major activities: preparing data for application of some statistical methods, applying the methods to the data and interpreting and presenting the results. The first and third activities consume by far the bulk of the time. Yet, they receive the least amount of attention in teaching and studying statistics. R is particularly useful for any of these activities. Thus, we present a balanced approach that reflects the relative amount of time one spends in these activities. The book includes over 300 hundred examples from various fields: ecology, envi ronmental sciences, medicine, biology, social sciences, law and military. Many of the examples take you through the three major activities: They begin with importing the data and preparing it for analysis (that is the reason for “and data” in the title), run the analysis (that is the reason for “Statistics” in the title) and end with presenting the results. The examples were applied through R scripts (that is the reason for “with R ” in the title).
^{1} From here on, we shall not capitalize Statistics.
xvi Preface
Our guiding principle was “what you see is what you get” (WYSIWYG). Thus, whether examples illustrate data manipulation, statistical methods or graphical pre sentation, accompanying scripts allow you to produce the results as shown. Each script is presented as code snippets or as linenumbered statements. These enhance explana tion of the scripts. Consequently, some of the scripts are not short and there are plenty of repetitions. Adhering to our goal, a wiki website, http://turtle.gis.umn.edu includes all of the scripts and data used in the book. You can download, cut and paste to reproduce all of the tables, figures and statistics that are shown. You are invited to modify the scripts to fit your needs. Albeit not a database management system, R integrates the tasks of preparing data for analysis, analyzing it and presenting it in powerful and flexible ways. Power and flexibility do not come easily and the learning curve of R is steep. To the novice— in particular those not versed in object oriented computer languages— R may seem at times cryptic. In Chapter 1 we attempt to demystify some of the things about R that at first sight might seem incomprehensible. To learn how to deal with data, analysis and presentation through R , we include over 300 examples. They are intended to give you a sense of what is required of statistical projects. Therefore, they are based on moderately large data that are not necessarily “clean”—we show you how to import, clean and prepare the data.
• What is the required knowledge and level of presentation?
No previous knowledge of statistics is assumed. In a few places (e.g. Chapter 5), previous exposure to introductory Calculus may be helpful, but is not necessary. The few references to integrals and derivatives can be skipped without missing much. This is not to say that the presentation is simple throughout—it starts simple and becomes gradually more advanced as one progresses through the book. In short, if you want to use R , there is no way around it: You have to invest time and effort to learn it. However, the rewards are: you will have complete control over what you do; you will be able to see what is happening “behind the scenes”; you will develop a goodpractices approach.
• What is the choice of topics and the order of their presentation?
Some of the topics are simple, e.g. parts of Chapters 9 to 12 and Chapter 14. Others are more advanced, e.g. Chapters 16 and 17. Our guiding principle was to cover large sample normal theory and in parallel topics about proportions and counts. For example, in Chapter 12, we discuss two sample analysis. Here we cover the normal approach where we discuss hypotheses testing, confidence intervals and power and sample size. Then we cover the same topics for proportions and then for counts. Similarly, for regression, we discuss the classical approach (Chapter 14) and then move on to logistic regression (Chapter 16). With this approach, you will quickly learn that life does not end with the normal. In fact, much of what we do is to analyze proportions and counts. In parallel, we also use Bootstrap methods when one cannot make assumptions about the underlying distribution of the data and when samples are small.
• How should I teach with this book?
The book can be covered in a twosemester course. Alternatively, Chapters 1 to 14 (along with perhaps Chapter 15) can be taught in an introductory course for seniors
Preface xvii
and firstyear graduate students in the sciences. The remaining Chapters, in a second course. The book is accompanied by a solution manual which includes solutions to most of the exercises. The solution manual is available through the book’s site.
• How should I study and use this book?
To study the book, read through each chapter. Each example provides enough code to enable you to reproduce the results (statistical analysis and graphics) exactly as shown. Scripts are explained in detail. Read through the explanation and try to produce the results, as shown. This way, you first see how a particular analysis is done and how tables and graphics may be used to summarize it. Then when you study the script, you will know what it does. Because of the adherence to the WYSIWYG principle, some of the early examples might strike you as particularly complicated. Be patient; as you progress, there are repetitions and you will soon learn the common R idioms. If you are familiar with both R and basic statistics, you may use the book as a reference. Say you wish to run a simple logistic regression. You know how to do it with R , but you wish to plot the regression on a probability scale. Then go to Chapter 16 and flip through the pages until you hit Figure 16.5 in Example 16.5. Next, refer to the example’s script and modify it to fit your needs. There are so many R examples and scripts that flipping through the pages you are likely to find one that is close to what you need. We include two indices, an R index and a general index. The R index is organized such that functions are listed as first entry and their arguments as subentries. The page references point to applications of the functions by the various examples.
• Classical vs. Bayesian statistics
This topic addresses developments in statics in the last few decades. In recent years and with advances in numerical computations, a trend among natural (and other) scientists of moving away from classical statistics (where large data are needed and no prior knowledge about it is assumed) to the socalled Bayesian statistics (where prior knowledge about the data contributes to its analysis) has become fashionable. Without getting into the sticky details, we make no judgment about the efficacy of one as opposed to the other approach. We prescribe to the following:
• With large data sets, statistical analyses using these two approaches often reach the same conclusions. Developments in bootstrap methods make “small” data sets “large”.
• One can hardly appreciate advances in Bayesian statistics without knowledge of classical statistics.
• Certain situations require applications of Bayesian statistics (in particular when realtime analysis is imperative). Others, do not.
• The analyses we present are suitable for both approaches and in most cases should give identical results. Therefore, we pursue the classical statistics approach.
• A word about typography, data and scripts
We use monospaced characters to isolate our code work with R . Monospace lines that begin with > indicate code as it is typed in an R session. Monospace lines that
xviii Preface
begin with linenumber indicate R scripts. Both scripts and data are available from the books site ( http://turtle.gis.umn.edu ).
• Do you know a good joke about statistics?
Yes. Three statisticians go out hunting. They spot a deer. The first one shoots and misses to the left. The second one shoots and misses to the right. The third one stands up and shouts, “We got him!”
Yosef Cohen and Jeremiah Cohen
Part I
Data in statistics and R
Statistics and Data with R: An applied approach through examples
© 2008 John Wiley & Sons, Ltd. ISBN: 978047075805
Y. Cohen and J.Y. Cohen
1
Basic R
Here, we learn how to work with R . R is rooted in S , a statistical computing and data visualization language originated at the Bell Laboratories (see Chambers and Hastie, 1992). The S interpreter was written in the C programming language. R pro vides a wide variety of statistical and graphical techniques in a consistent computing environment. It is also an interpreted programming environment. That is, as soon as you enter a statement, or a group of statements and hit the Enter key, R computes and responds with output when necessary. Furthermore, your whole session is kept in memory until you exit R . There are a number of excellent introductions to and tutorials for R (Becker et al., 1988; Chambers, 1998; Chambers and Hastie, 1992; Dalgaard, 2002; Fox, 2002). Venables and Ripley (2000) was written by experts and one can hardly improve upon it (see also The R Development Core Team, 2006 b ). An excellent exposition of S and thereby of R , is given in Venables and Ripley (1994). With minor differences, books, tutorials and applications written with S are rele vant in R . A good entry point to the vast resources on R is its website, http:// rproject.org . R is not a panacea. If you deal with tremendously large data sets, with mil lions of observations, many variables and numerous related data tables, you will need to rely on advanced database management systems. A particularly good one is the open source PostgreSQL ( http://www.postgresql.org ). R deals effectively— and elegantly—with data that include several hundreds of thousands of observations and scores of variables. The bulkier your system is in memory and hard disk space, the larger the data set that R can handle. With all these caveats, R is a most flexible system for dealing with data manipulation and analysis. One often spends a large amount of time preparing data for analysis and exploring statistical models. R makes such tasks easy. All of this at no cost, for R is free! This chapter introduces R . Learning R might seem like a daunting task, partic ularly if you do not have prior experience with programming. Steep as it may be, the learning curve will yield much dividend later. You should develop tolerance of
Statistics and Data with R: An applied approach through examples
©2008
John Wiley & Sons, Ltd. ISBN: 978047075805
Y. Cohen and J.Y. Cohen
4
Basic R
ambiguity. Things might seem obscure at first. Be patient and continue. With plenty of repetitions and examples, the fog will clear. Being an introductory chapter, we are occasionally compelled to make simple and sweeping statements about what R does or does not do. As you become familiar with R , you will realize that some of our statements may be slightly (but not grossly) inaccurate. In addition to R , we discuss some abstract concepts such as classes and objects. The motivation for this seemingly unrelated material is this: When you start with R , many of the commands (calls to functions) and the results they produce may mystify you. With knowledge of the gen eral concepts introduced here, your R sessions will be clearer. To get the most out of this chapter, follow this strategy: Read the whole chapter lightly and do whatever exercises and examples you find easy. Do not get stuck in a particular issue. Later, as we introduce R code, revisit the relevant topics in this chapter for further clarification. It makes little sense to read this chapter without actually implementing the code and examples in a working R session. Some of the material in this chapter is based on the excellent tutorial by Venables et al. (2003). Except for starting R , the instructions and examples apply to all operating systems that R is implemented on. System specific instructions refer to the Windows operating systems.
1.1 Preliminaries
In this section we learn how to work with R . If you have not yet done so, go to http://rproject.org and download and install R . Next, start R . If you installed R properly, click on its icon on the desktop. Otherwise, click on Start  Programs  R  R x.y.z where x.y.z refers to the version number you installed. If all else fails, find where R is installed (usually in the Program Files directory). Then go to the bin directory and click on the Rgui icon. Once you start R , you will receive a welcome statement that ends with a prompt (usually > ).
1.1.1 An R session
An R session consists of starting R , working with it and then quitting. You quit R by typing q() or (in Windows) by selecting File  Exit . ^{1} R displays the prompt at the beginning of the line as a cue for you. This is where you interact with the system (you do not type the prompt). The vast majority of our work with R will consist of executing functions. For now, we will say that a function is a small, singlepurpose executable program that R can be instructed to run. A function is identified by a name followed by parentheses. Often the parentheses are separated by words. These are called arguments. We say that we execute a function when we type the function’s name, with the parentheses (and possibly arguments) and then hit the Enter key. Let us go through a simple session. This will give you a feel for R . At any point, if you wish to quit, type q() to end the session. To quit, you can also use the File  Exit menu. At the prompt, type (recall that you do not type the prompt):
> help.start()
^{1} As a rule, while working with R , we avoid menus and graphical interface and stick with the command line interface.
Preliminaries
5
This will display the introductory help page in your browser. You can access help. start() from the Help  Html help menu. In the Manuals section of the HTML page, click on the title An Introduction to R . This will bring up a wellwritten tutorial that introduces R and its capabilities.
R as a calculator
Here are some ways to use R as a simple calculator (note the # character—it tells R to treat everything beyond it to the end of the line as is):
> 
5  
1 + 10 
# add and subtract 
[1] 14 

> 
7 * 
10 / 2 
# multiply and divide 
[1] 35
> pi
[1] 3.141593
> sqrt(2)
[1] 1.414214
> exp(1)
[1] 2.718282
# the constant pi
# square root
# e to the power of 1
The order of execution obeys the usual mathematical rules.
Assignments
Assignments in R are directional. Therefore, we need a way to distinguish the direc tion:
> 
x < 5 
# The object (variable) x holds the value 5 
> x [1] 5 
# print x 
> 6 > x
> x [1] 6
Note that to observe the value of x , we type x and follow it with Enter . To save space, we sometimes do this:
> (x < pi)
# x now holds the value 6
# assign the constant pi and print x
[1] 3.141593
Executing functions
R contains many function. We execute a function by entering its name followed by
parentheses and then Enter . Some functions need information to execute. This infor mation is passed to the function by way of arguments.
> print(x) # print() is a function. It prints its argument, x [1] 6
> ls()
# lists the objects in memory
[1] "x"
6
Basic R
rm(x)
> # remove x from memory
> # no objects in memory, therefore:
character(0)
A word of caution : R has a large number of functions. When you create objects (such as x ) avoid function names. For example, R includes a function c() ; so do not name any of your variables (also called objects) c . To discover if a function name may collide with an object name, before the assignment, type the object’s name followed by Enter , like this:
> t
function (x) UseMethod("t") <environment: namespace:base>
From this response you may surmise that there is a function named t() . Depending on the function, the response might be some code.
ls()
Vectors
We call a set of elements a vector. The elements may be floating point (decimal) numbers, integers, strings of characters and so on. Many common operations in R
require sequences. For example, to generate a sequence of 20 numbers between 0 and
19 do:
> x < 0 : 19
To see the values stored in the vector x we just created, type x and hit Enter . R prints:
> 
x 

[1] 
0 
1 
2 
3 
4 
5 
6 
7 
8 
9 10 11 12 13 14 15 16 17 
[19] 18 19
Note the way R lists the values of the vector x . The numbers in the square brackets
on the left are not part of the data. They are the index of the first element of the vector in the listed row. This helps you identify the order of the data. So there are
18 values of x listed in the first row, starting with the first element. The value of this
element is 0 and its index is 1. This index is denoted by [1] . The second row starts
with the 19th element and so on. To access the value of the fifth element, do
> x[5]
[1] 4
Behind the scenes
To demystify R , let us take a short detour. Recall that statements in R are calls to functions and that functions are identified by a name followed by parentheses (e.g. help() ). You may then ask: “To print x , I did not call a function. I just entered the name of the object. So how does R know what to do?” To fully answer this question we need to discuss objects (soon). For now, let us just say that a vector (such as x ) is an object. Objects belong to objecttypes (sometimes called classes). For example x belongs to the type vector. Objects of the same type have (inherit) the properties of
Preliminaries
7
their type. One of these properties might be a print function that knows how to print the object. For example, R knows how to print objects of type vector. By entering the name of the vector only, you are calling a function that knows how to print the vector object x . This idea extends to other types of objects. For example, you can create objects of type data.frame . Such objects are versatile. They hold different kinds of data and have a comprehensive set of functions to manipulate the data. data.frame objects have a print function different from that of vector objects. The specific way that R accomplishes the task of printing, for example, may be different for different objecttypes. However, because all objects of the same type have similar structure, a print function that is attached to a type will work as intended on any object of that type. In short, when you type x < 0 : 19 you actually create an object and enter the data that the object stores. Other actions that are common to and necessary for such objects are known to R through the objecttype.
Plots
Back to our session. Let us create a vector with some random numbers and plot it. We create a vector y from x . To each element of x , we add a random value between 10 and 20. Here is how:
> y < x + runif(20, min = 10, max = 20)
Here we assign each value of x + (a random value) to the vector y . When you assign an object to a named object that does not exist ( y in our example), R creates the object automatically. The function runif() , standing for “random uniform,” produces random numbers. The call to it includes three arguments. The first argument, 20 , tells
R to produce 20 random numbers. The second and third, min = 10 and max = 20 ,
ask that each number between the values of 10 and 20 have the same probability of occurring. The addition of the random numbers to x is done element by element. Thus, the statement above produces
y[1] < x[1] + the first random number y[2] < x[2] + the second random number
y[20] < x[20] + the 20th random number
To plot the values of y against x , type
> plot(x, y)
In response, you should get something like Figure 1.1.
Statistics
R is a statistical computing environment. It contains a large number of functions that
apply statistical analyses to data. Two familiar statistics are the mean and variance of a set (vector) of numbers:
> mean(x)
[1] 9.5
> var(x)
[1] 35
8
Basic R
Figure 1.1 plot(x,y) .
The function mean() takes—in this case—a single argument, a vector of the data ( x ). It responds with the mean of the data. The function var() responds with the variance.
1.1.2 Editing statements
You can recall and edit statements that you have already typed. R saves the statements in a session in a buffer (think of a buffer as a file in memory). This buffer is called history. You can use the ↑ and the ↓ keys to recall previous statements or move down from a previous statement. Use the ← and → arrows to move within a statement in the active line. The active line is the last line in the session window. It is the line that is executed when you hit Enter . The Delete or Backspace keys can be used to delete characters on the command line. There are other commands you can use. See the Help  Console menu.
1.1.3 The functions help() , help.search() and example()
Suppose that you need to generate uniform random numbers. You remember that there is such a function, named runif() , but you do not exactly remember how to call it. To see the syntax for the call and other information related to runif() , call the function help() with the argument set to runif() , like this:
> help(runif)
(or help('runif') ). In response, R displays a window that contains the following (some output was omitted and line numbers added):
_{1} Uniform
2
package:base
R Documentation
_{3} The Uniform Distribution
4
5 Description:
6 These functions provide information about ,,,
7
Preliminaries
9
_{8} 
Usage: 
_{9} 

_{1}_{0} 
runif(n, min=0, max=1) 
11 

_{1}_{2} 
Arguments: 
_{1}_{3} 

_{1}_{4} 
n: number of 
_{1}_{5} 
min,max: lower and upper limits of the distribution. 
_{1}_{6} 

17 

_{1}_{8} 
Details: 
19 
If 'min' or 'max' are not specified they assume 
_{2}_{0} 
'0' and '1' respectively. 
_{2}_{1} 

22 

_{2}_{3} 
References: 
_{2}_{4} 

25 

_{2}_{6} 
See Also: 
_{2}_{7} 
'.Random.seed' about random number generation, 
28 

_{2}_{9} 
Examples: 
_{3}_{0} 
u < runif(20) 
_{3}_{1} 
You can access help for all of the functions in R the same way—type help(functionname ) , where functionname is a name of a function you need help with. If you are sure that help is available for the requested topic and you get no response, enclose the topic with quotes. Functions’ help is standard, so let us go through the help text for runif() in detail. In line 1, the header of the help window for runif() declares that the function resides in a package, named base —we will talk about packages soon. The Description section starts in line 5. Since this help window provides help for all of the functions that relate to the socalled uniform distribution, the description mentions all of them. The Usage section begins on line 8. It explains how to call this function. In line 10, it says that runif() is called with the arguments n , min and max . Because n represents the number of random numbers you wish to generate and because it is not written as argumentname = argumentvalue , you should realize that this is a required argument. That is, if you omit it, R will respond with an error message. On line 12 begins a section about the Arguments . As it explains, n is the number of observations. min and max are the limits between which the ran dom numbers will be generated. Note that in the call (line 10), they are specified as min = 0 and max = 1 . This means that you do not have to call these argu ments explicitly. If you do not, then the default values will be 0 and 1. In other words, runif(n) will produce n random numbers between 0 and 1. Because it is the uniform distribution, each of the numbers between 0 and 1 is equally likely to occur. We say that n is an unnamed argument while min and max are named
10
Basic R
arguments. Usually, named arguments have default values and unnamed arguments are required. The Details section explains other issues related to the functions on this help page. There is usually a Reference section where more details about the functions may be found. The See Also section names relevant functions and finally there is an Examples section. All functions are documented according to this template. The documentation is often terse and it takes time to get used to. Suppose that you forgot the exact function name. Or, you may wish to look for a concept or a topic. Then use the function help.search() . It takes a single argument. The argument is a string and strings in R are delineated by single or double quotes. So you may look for a topic such as
> help.search('random')
This will bring up a window with a list such as this (line numbers were added and the output was edited):
_{1} 
Help files with alias or concept or title matching 'random' 

_{2} 
using fuzzy matching: 

3 

_{4} 
REAR(agce) 
Fit a autoregressive model with 
_{5} 
r2dtable(base) 
Random 2way Tables with 
_{6} 
Random.user(base) 
Usersupplied Random Number 
_{7} 
Line 4 is the beginning of a list of topics relevant to “random.” Each topic begins with the name of a function and the package it belongs to, followed by a brief explanation. For example, line 6 says that the function Random.user() resides in the package base . Through it, the user may supply his or her own random number generator. Once you identify a function of interest, you may need to study its documentation. As we saw, most of the functions documented in R have an Examples section. The examples illustrate various applications of the documented function. You can cut and paste the code from the example to the console and study the output. Alternatively, you can run all of the examples with the function example() , like this:
> example(plot)
The statement above runs all the examples that accompany the documentation of the function plot() . This is a good way to quickly learn what a function does and perhaps even modify an example’s code to fit your needs. Copy as much code as you can; do not try to reinvent the wheel.
1.1.4 Expressions
R is case sensitive. This means that, for example, x is different from X . Here x or X refer to object names. Object names can contain digits, characters and a period. You may be able to include other characters in object names, but we will not. As a rule, object names start with a letter. Endow ephemeral objects—those you use within, but not between, sessions—with short names and save on typing. Endow per sistent object—those you wish to use in future sessions—with meaningful names.
Preliminaries 11
In object names, you can separate words with a period or underscore. In some cases, you can names objects with spaces embedded in the name. To accomplish this, you will have to wrap the object name in quotes. Here are examples of correct object names:
a A hello.dolly the.fat.in.the.cat hello.dOlly Bush_gore
Note that hello.dolly and hello.dOlly are two different object names because R
is case sensitive.
You type expressions on the current (usually last) line in the console. For example, say x and y are two vectors that contain the same number of elements and the elements are numerical. Then a call to
> plot(x, y)
is an expression. If the expression is complete, R will execute it when you hit Enter and we get a plot of y vs. x . You can group expressions with braces and separate them with semicolons. All statements (separated by semicolons) on a single line are grouped. For example
> x < seq(1 : 10) ; x # two expressions in one line
[1]
1
2
3
4
5
6
7
8
9 10
creates the sequence and prints it.
1.1.5 Comments, line continuation and Esc
You can add comments almost anywhere. We will usually add them to the end of expressions. A comment begins with the hash sign # and terminates at the end of the line. For example
# different from seq(1, 10, by = 2); try it
> (x < seq(1, 10, length = 5))
[1]
1.00 3.25
5.50 7.75 10.00
> 
# also different from seq(1 : 10) 
If 
an expression is incomplete, R will allow you to complete it on the next line once 
you hit Enter . To distinguish between a line and a continuation line, R responds with
a + , instead of with a > on the next line. Here is an example (the comments explain what is going on):
> x < seq(1 : 10 # no ')' at the end
+ ) #now '(' is matched by ')'
> x
[1]
1
2
3
4
5
6
7
8
9 10
If you wish to exit a continuation line, hit the Esc key. This key also aborts executions
in R . 

1.1.6 
source() , sink() and history() 
You can store expressions in a text file, also called a script file and execute all of them at once by referring to this file with the function source() . For example, suppose
12
Basic R
we want to plot the normal (popularly known as the bell) curve. We need to create a vector that holds the values of x and then calculate the values of y based on the values of x using the formula:
y ( x ) =
1
σ _{√} 2 π _{e} − ( x − μ ) ^{2} / 2 σ ^{2}
_{.}
_{(}_{1}_{.}_{1}_{)}
Here σ and μ (the Greek letters sigma and mu) are fixed, so we choose σ = 1 and μ = 0. With these values, y ( x ) is known as the standard normal probability density. Suppose we wish to create 101 values of x between − 3 and 3. Then for each value of x , we calculate a value for y . Then we plot y vs. x . To accomplish this, open a new text file using Notepad (or any other text editor you use) and save it as normal.R (you can name it anything). Note the directory in which you saved the file. Next, type the following (without the line numbers) in the text editor:
_{1} x < seq(3, 3, length = 101)
_{2} y < dnorm(x)
3 plot(x, y, type = 'l') # 'l' stands for line
# assign standard normal values to y
menu in the console and change to the
directory where you saved normal.R . Next, type
> source('normal.R')
or click on File  Source R
statement in line 2. We use the 101 values of x to generate 101 values of y according to (1.1). The function dnorm() —for density normal—does just that. The function source() treats the content of the argument—a string that represents a file name—
and save it. Click on the File  Change
Either way, you should obtain Figure 1.2. Note the
as a sequence of commands. It will read one line at a time and execute it. Because the argument to source() must be a constant string of characters representing a file name, you must delineate the string with single or double quotes.
Figure 1.2 The standard normal (bell) curve.
The complement of source() is the function sink() . Occasionally you may find it useful to divert output from the console to a file. Let us create a vector of 100
Modes
13
elements, all initialized to zero and write the vector to a file. Type the following (without the line numbers) in your console:
_{1} 
> x < vector(mode = 'numeric', length = 100) #create the vector 

_{2} 
> sink('x.txt') 

_{3} 
> 
x 
#open a text file named x.txt #write the data to the file 
_{4} 
> 
sink() 
#close the file 
Go to the directory in which R is working now and open the file x.txt . You should see the output exactly as if you did not redirect the output away from the console. In line 1 of the code, the function vector() creates a vector of length 100. ^{2} If you do not specify mode = 'numeric' , vector() will create a vector of 100 elements all set to FALSE . The latter are logical elements that have only two values, TRUE or FALSE . Note that TRUE and FALSE are not strings (you do not enclose them in quotes and when R prints them, it does not enclose them in quotes). They are special values that are simply represented by the tokens TRUE or FALSE . You will later see that vector() and logical variables are useful. history() is another useful housekeeping function. Now that we have worked a little in the current session, type
> history(50)
In response, a new text window will pop up. It will include your last 50 statements (not output). The argument to history is any number of lines you wish to see. You can then copy part or all of the text to the clipboard and then paste to a script file. This is a good way to build scripts methodically. You try the various statements in the console, make sure they work and then copy them from a history window to your script. You can then edit them. It is also a good way to document your work. You will often spend much time looking for a particular function, or an idiom that you developed only to forget it later. Document your work.
1.2 Modes
In R , a simple unit of data occupies the same amount of memory. The type of these units is called mode. The modes in R are:
LOGICAL
character A string of characters.
Has only two values, TRUE or FALSE .
numeric 
Numbers. Can be integers or floating point numbers, called double . 
complex 
Complex numbers. We will not use this type. 
raw 
A stream of bytes. We will not use this type. 
You can test the mode of an object like this:
> x < integer() ; mode(x) [1] "numeric"
> mode(x < double()) [1] "numeric"
> mode(x < TRUE)
^{2} You can use x < numeric(100) instead of a call to vector() .
14
Basic R
[1] "logical"
> mode(x < 'a')
[1] "character"
1.3 Vectors
As we saw, R works on objects. Objects can be of a variety of types. One of the simplest objects we will use is of type vector. There are other, more sophisticated types of objects such as matrix, array, data frame and list. We will discuss these in due course. To work with data effectively, you should be proficient in manipulating objects in R . You need to know how to construct, split, merge and subset these objects. In R , a vector object consists of an ordered collection of elements of the same mode. The length of a vector is the number of its elements. As we said, all elements of a vector must be of the same mode. There is one exception to this rule. We can mix with any mode the special token NA , which stands for Not Available. It will be worth your while to remember that NA is not a value! Therefore, to work with it, you must use specific functions. We will discuss those later. We call a vector of dimension zero the empty vector. Albeit devoid of elements, empty vectors have a mode. This gives R an idea of how much additional memory to allocate to a vector when it is needed.
1.3.1 Creating vectors
Vectors can be constructed in several ways. The simplest is to assign a vector to a new symbol:
> v
> v < c(1, 5, 3) # c() concatenates its elements
You can create a vector with a call to
> v < vector()
> (v < vector(length = 10)) [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
<
1
: 10
# assign the vector of elements 1
10
to v
# vector of length 0 of logical mode
[10] FALSE
From the above we conclude that vector() produces a vector of mode logical by default. If the vector has a length, all elements are set to FALSE . To create a vector of a specific mode, simply name the mode as a function, with or without length:
> v < integer()
> (v < double(10))
[1] 0
0
0
0
0
0 0 0
0 0
> (v < character(10)) [1] "" "" "" "" "" "" "" "" "" ""
You can also create vectors by assigning vectors to them. For example the function c() (for concatenate) returns a vector. When the result of c() is assigned to an object, the object becomes a vector:
> v < c(0, 10, 1000) [1] 0 10 1000
Because a vector must include elements of a simple mode, if you concatenate vectors with c() , R will force all modes to the simplest mode. Here is an example:
Vectors
15
_{1} 
> 
(x < c('Alice', 'in', 'lalaland')) 

_{2} 
[1] "Alice" 
"in" 
"lalaland" 

_{3} 
> 
(y < c(1, 2, 3)) 

_{4} 
[1] 1 2 3 

_{5} 
> 
(z < c(x,y)) 

_{6} 
[1] "Alice" 
"in" 
"lalaland" "1" 
"2" 

_{7} 
[6] "3" 

Study this code snippet. It will pay valuable dividends later. In line 1 we assign three strings to x . Because the function c() constructs the vector and returns it, the assignment in line 1 creates the vector x . Its type is character and so is its mode. In line 3, we construct the vector y with the elements 1 , 2 and 3 . The mode(y) is numeric and the typeof(y) is double . In line 5 we concatenate (with c() ) the vectors x and y . Because all elements of a vector must be of the same mode, the mode of z is reduced to character . Thus, when we print z (lines 6 and 7), the numbers are turned into their character representation. They therefore are enclosed with quotes. This may be confusing, but once you realize that R strives to be coherent, it is no longer so. 

1.3.2 Useful vector functions 

Working with data requires manipulating vectors frequently. R provides functions that allow such manipulations. 

length() 

The length of a vector is obtained with length() : 

> 
(t.f < c(TRUE, TRUE, FALSE, TRUE)) ; length(t.f) 

[1] TRUE 
TRUE FALSE TRUE 

[1] 4 

sum() 

Here is an example where we need the length() of a vector and the sum() of its elements: 

> 
(a < 1 : 11) 

[1] 1 
2 
3 
4 
5 
6 
7 
8 
9 10 11 

> 
(average < sum(a) / length(a)) 

[1] 6 

As you might expect, mean() does this and then some (see help(mean) ). 

1.3.3 Vector arithmetic 

The binary (in the sense that they need two objects to operate on) arithmetic oper ations addition, subtraction, multiplication and division, + ,  , * and / , respectively, can be applied to vectors. In such cases, they are applied elementwise: 

1 
> 
(x < 1.2 : 6.4) 

2 
[1] 1.2 2.2 3.2 4.2 5.2 6.2 
3
>
x * 2
16
Basic R
_{4} 
[1] 
2.4 
4.4 
6.4 
8.4 10.4 12.4 

_{5} 
> 
x / 2 

_{6} 
[1] 0.6 1.1 1.6 2.1 2.6 3.1 

_{7} 
> 
x 
 1 

_{8} 
[1] 0.2 1.2 2.2 3.2 4.2 5.2 
In line 1 we create a sequence from 1.2 to 6.4 incremented by 1 and print it. In line
2 we multiply x by 2. This results in multiplying each element of x by 2. Similarly
(lines 5 and 7), division and subtraction are executed elementwise. When two vectors are of the same length and of the same mode, then the result of x / y is a vector whose first element is x[1] / y[1] , the second is x[2] / y[2] and so on:
> (x < seq(2, 10, by = 2))
[1]
2
4
6
8 10
> (y < 1 : 5)
[1] 
1 2 
3 4 5 
> x / y 

[1] 
2 2 
2 2 2 
Here, corresponding elements of x are divided by corresponding elements of y . This might seem a bit inconsistent at first. On the one hand, when dividing a vector by a scalar we obtain each element of the vector divided by the scalar. On the other, when dividing one vector by another (of the same length), we obtain elementbyelement division. The results of the operations of dividing a vector by a scalar or by a vector are actually consistent once you realize the underlying rules.
First, recall that a single value is represented as a vector of length 1. Second, when the length of x > the length of y and we use binary arithmetic, R will cycle through the elements of y until its length equals to that of x . If x ’s length is not an integer multiple of y , then R will cycle through the values of y until the length of x is met, but R will complain about it. In light of this rule, any vector is an integer multiple of the length of a vector of length 1. For example, a vector with 10 elements is 10 times as long as a vector of length 1. If you use a binary arithmetic operation on a vector of length 10 with a vector of length 3, you will get a warning message. However, if one vector is of length that is an integer multiple of the other, R will not complain. For example, if you divide a vector of length 10 by a vector of length 5, then the vector of length
5 will be duplicated once and the division will proceed, element by element. The
same rule applies when the length of x < the length of y : x ’s length is extended (by cycling through its elements) to the length of y and the arithmetic operation is done elementbyelement. Here are some examples. First, to avoid printing too many digits, we set
> options(digits = 3)
This will cause R to print no more than 3 decimal digits. Next, we create x with length 10 and y with length 3 and attempt to obtain x / y :
Vectors
17
> (x < 1 
: 
10) ; (y < 
1 
: 3) ; 
x / y 

[1] 
1 
2 
3 
4 
5 
6 
7 
8 
9 10 

[1] 1 2 3 

[1] 
1.0 
1.0 
1.0 
4.0 
2.5 
2.0 
7.0 
4.0 
3.0 10.0 
Warning message:
longer object length is not a multiple of shorter object length in: x/y
Note that because x is not an integer multiple of y , R complains. Here is what happens:
x y 
x/y 

1 1 
1 
1.0 
2 2 2 
1.0 

3 3 3 
1.0 

4 4 1 
4.0 

5 5 2 
2.5 

6 6 3 
2.0 

7 7 1 
7.0 

8 8 2 
4.0 

9 9 3 
3.0 
10 10 1 10.0
As expected, y cycles through its values until it has 10 elements (the length of x ) and we get an element by element division. The same rule applies with functions that implement vector arithmetic. Consider, for example, the square root function sqrt() . Here R follows the rule of operating on one element at a time:
>
(x < 1 : 5)
;
sqrt(x)
[1] 1
[1] 1.00 1.41 1.73 2.00 2.24
2 3
4 5
1.3.4 Character vectors
Data, reports and figures require frequent manipulation of characters. R has a rich set of functions to deal with character manipulations. Here we mention just a few. More will be introduced as the need arises. Character strings are delineated by double or single quotes. If you need to quote characters in a string, switch between double and single quotes, or preface the quote by the escape character \ . Here is an example:
> (s < c("Florida; a politician's",'nightmare')) [1] "Florida; a politician's" [2] "nightmare"
The vector s has two elements. Because we need a single quote in the first element (after the word politician ), we add ' . To create a single string from s[1] and s[2] , we paste() them:
> paste(s[1], s[2])
[1] "Florida; a politician's nightmare"
18
Basic R
By default, paste() separates its arguments with a space. If you want a different character for spacing elements of characters, use the argument sep :
> paste(s[1], s[2], sep = '')
[1] "Florida; a politician'snightmare"
The examples above demonstrate how to include a single quote in a string and how
to paste() character strings with different separators (see help(paste) for further
information).
1.3.5 Subsets and index vectors
One of the most frequent manipulations of vectors (and other classes of objects that
hold data) is extracting subsets. To extract a subset from a vector, specify the indices
of the elements you wish to extract or exclude. To demonstrate, let us create
> x < 15 : 30
Here x is a vector with elements 15 , 16 , and fifth elements of x , we do
> c(x[2], x[5]) [1] 16 19
A more succinct way to extract the second and fifth elements is to execute
> x[c(2, 5)] [1] 16 19
Here c(2, 5) is an index vector. The index vector specifies the elements to be returned. Index vectors must resolve to one of 4 modes: logical, positive integers, negative integers or strings. Examples are the best way to introduce the subject.
, 30 . To create a new vector of the second
Index vector of logical values and missing data
Here is a typical case. You collect numerical data with some cases missing. You might end up with a vector of data like this:
> (x < c(10, 20, NA, 4, NA, 2))
[1] 10 20 NA
4 NA
2
You wish to compute the mean. So you try:
> sum(x) / length(x) [1] NA
Because operations on NA result in NA , the result is NA . So we need to find a way
to extract the values that are not NA from the vector. Here is how. x above has six
elements, the third and the fifth are NA . The following statement identifies the NA elements in i :
> (i < is.na(x))
[1] FALSE FALSE TRUE FALSE TRUE FALSE
The function is.na(x) examines each element of the vector x . If it is NA , it assigns the element the value TRUE . Otherwise, it assigns it the value FALSE . Now we can use i as an index vector to extract the not NA values from x . Recall that i is a vector.
Vectors
19
If an element of x is a value, then the corresponding element of the vector i is FALSE . If an element of i has no value, then the corresponding element of i is TRUE . So i and not i , denoted by !i give:
>
i
;
!i
[1] FALSE FALSE TRUE FALSE TRUE FALSE
[1]
TRUE FALSE TRUE FALSE TRUE
Therefore, to extract the values of x that are not NA , we do
> (y < x[!i])
TRUE
[1] 10 20
4
2
Now to calculate the mean of x , a vector with some elements containing missing values, we do:
> n < length(x[!i]) ; new.x < x[!i] ; sum(new.x) / n [1] 9 The same result can be achieved with
> mean(x, na.rm = TRUE)
[1] 9 na.rm is a named argument. Its default value is FALSE . If you set it to TRUE , mean() will compute the mean of its unnamed argument ( x in our example) even though the latter contains NA .
Index vector of positive integers
Here is another way to exclude NA data:
> (x < c(160, NA, 175, NA, 180))
[1] 160
NA 175
NA 180
> (no.na < c(1, 3, 5)) [1] 1 3 5
> x[no.na]
[1] 160 175 180
Again, we use no.na as an index vector. Another example: we wish to extract elements 20 to 30 from a vector x of length 100:
> x < runif(100) # 100 random numbers between 0 and 1
> round(x[20 : 30], 3) # print rounded elements 20 to 30 [1] 0.249 0.867 0.946 0.593 0.088 0.818 0.765 0.586 0.454 [10] 0.922 0.738 First we create a vector of 100 elements, each a random number between 0 and 1. To
interpret the second statement, we go from the inner parentheses out: 20 : 30 creates
, 30 . Then x[20 : 30] extracts the
an index vector of positive integers 20 , 21 ,
desired elements. We wish to see only the first 3 significant digits of the elements. So we call round() with two arguments—the data and the number of significant digits. Here is a variation on the theme:
> (i < c(20 : 25, 28, 35 : 40)) # 20 to 25, 28 and 35 to 40 [1] 20 21 22 23 24 25 28 35 36 37 38 39 40
> round(x[i], 3) [1] 0.249 0.867 0.946 0.593 0.088 0.818 0.454 0.675 0.834 [10] 0.131 0.281 0.636 0.429
20
Basic R
Index vector of negative integers
The effect of index vectors of negative integers is the mirror of positive. To extract all elements from x except elements 20 to 30, we write x[(20 : 30)] . Here we must put 20 : 30 in parenthesis. Otherwise R thinks that we want to extract elements − 20 to 30. Negative indices are not allowed.
Index vector of strings
We measure the weight ( x ) and height ( y ) of 5 persons:
> x < c(160, NA, 175, NA, 180)
> y < c(50, 65, 65, 80, 70)
Next, we associate names with the data:
> names(x) < c("A. Smith", "B. Smith",
+ "C. Smith", "D. Smith", "E. Smith")
The function names() names the elements of x . To see the data, we columnbind it with the function cbind() :
> cbind(x, y)
x
y
A. Smith 160 50
B.
C.
D.
E. Smith 180 70
Now that the indices are named, we can extract elements by their names:
Smith
NA 65
Smith 175 65
Smith
NA 80
> 
x[c('B. Smith', 'D. Smith')] 
B. 
Smith D. Smith 
NA
NA
Observe this:
> y[c('A. Smith', 'D. Smith')] [1] NA NA
Because y has no elements named A. Smith and D. Smith , R assigns NA to such elements.
1.4 Arithmetic operators and special values
R includes the usual arithmetic operators and logical operators. It also has a set of symbols for special values, or no values at all.
1.4.1 Arithmetic operators
Arithmetic operators consist of + , − , ∗ , / and the power operator ˆ. All of these oper ate on vectors, element by element:
> x
< 1
:
3
; x^2
[1] 1 4 9
Arithmetic operators and special values 21
1.4.2 Logical operators
Logical operators include “and” and “or”, denoted by & and  . We compare values with the logical operators > , > =, < , < =, == and != standing for greater than, greater equal than, less than, less equal than, equal and not equal. ! is the negation operator. Upon evaluation, logical operators return the logical values TRUE or FALSE . If the operation cannot be accomplished, then NA is returned. With vectors, logical operators work as usual—one element at a time. Here are some examples:
> 5 == 4 & 5 == 5 [1] FALSE
> 5 != 4 & 5 == 5 [1] TRUE
> 
5 
== 4 
 
5 
== 5 

[1] TRUE 

> 
5 
!= 4 
 
5 
== 5 

[1] TRUE 

> 
5 
> 4 
; 
5 
< 
4 
; 
5 == 4 ; 
5 != 4 
[1] TRUE
[1] FALSE
[1] FALSE
[1] TRUE
Like any other operation, if you operate on vectors, the values returned are element by element comparisons among the vectors. The rules of extending vectors to equal length still stand. Thus,
> x < c(4, 5) ; y < c(5, 5)
> x
y
[1] FALSE FALSE [1] TRUE FALSE
[1] FALSE TRUE [1] TRUE FALSE
> y
;
x <
;
x
==
y
;
x != y
Here is an example that explains what happens when you compare two vectors of different lengths:
_{1} > x < c(4,5) ; y < 4
2 > cbind(x, y)
_{3} x y
_{4} [1,] 4 4
5 [2,] 5 4
_{6} > y
_{7} [1] 4
_{8} > x == y
_{9} [1] TRUE FALSE
10
>
x < y
11 [1] FALSE FALSE
12
>
x > y
13
[1] FALSE TRUE
22
Basic R
In line 1 we create two vectors, with lengths of 2 and 1. We column bind them. So R extends y with one element. When we implement the logical operations in lines 8, 10 and 12, R compares the vector 5, 4 to 4, 4 . To make sure that you get what you want, when comparing vectors, always make sure that they are of equal length.
1.4.3 Special values
Because R ’s orientation is toward data and statistical analysis, there are features to deal with logical values, missing values and results of computations that at first sight do not make sense. Sooner or later you will face these in your data and analysis. You need to know how to distinguish among these values and test for their existence. Here are the important ones.
Logical values
Logical values may be represented by the tokens TRUE or FALSE . You can specify them as T or F . However, you should avoid the shorthand notation. Here is an example why. We wish to construct a vector with three logical elements, all set to TRUE . So we do this
> T < 5
> (x < c(T, T, T)) [1] 5 5 5
Some time earlier during the session, we happened to assign 5 to T . Then, forgetting this fact, we assign c(T, T, T) to x . The result is not what we expect. Because TRUE and FALSE are reserved words, R will not permit the assignment TRUE < 5 . The tokens TRUE and FALSE are represented internally as 1 or 0. Thus,
> TRUE == 0 ; TRUE == 1
[1] FALSE [1] TRUE
> TRUE == 0.1 ; FALSE == 0 [1] FALSE [1] TRUE
NA
This token stands for “Not Available” It is used to designate missing data. In the next example, we create a vector x with five elements, the first of which is missing. To test for NA , we use the function is.na() . This function returns TRUE if an element is NA and FALSE otherwise.
> (x < c(NA, 2 : 5))
[1] NA
2
3
4
5
> (test < is.na(x))
[1] TRUE FALSE FALSE FALSE FALSE
It is important to realize that is.na() returns FALSE if the element tested for is not NA . Why? Because there are other values that are not numbers. They may result from computations that make no sense, but they are not NA .
Arithmetic operators and special values 23
NaN and Inf
These designate “Not a Number” and infinity, respectively. Division by zero does not result in a number; it therefore returns NaN . You may wish to assign Inf to a vector (for example when you wish any vector to be smaller than Inf in a comparison). In both cases, these are not NA ; they are NaN and Inf , respectively. Furthermore, Inf is a number (you can verify this with the function is.numeric() ); NaN is not. To distinguish among these possibilities, use the function is.nan() .
Distinguishing among NA , NaN and Inf
Distinguishing among these in data can be confusing. Unless interested, you may skip this topic. Consider the following vector:
> (x < c(NA, 0 / 0, Inf  Inf, Inf, 5))
[1]
NA NaN NaN Inf
5
Here 0/0 is undefined and therefore not a number. So is InfInf . Albeit not a real number, Inf is part of the set of numbers called extended real numbers. We need to distinguish among vector elements that are a number, NA , NaN and Inf in x . First, let us test for NA :
>
is.na(x) TRUE TRUE TRUE FALSE FALSE
[1]
As you can see, NA and NaN
Much more than documents.
Discover everything Scribd has to offer, including books and audiobooks from major publishers.
Cancel anytime.