# Statistics and Statistical Analysis with R

BASICS
Jose Ramon G. Albert, Ph.D.
Secretary General, National Statistical Coordination Board
Adjunct Faculty, Asian Institute of Management
Email: jrgalbert@gmail.com


Topics
1. Getting Started with R
    What is R?
    Installation
    Starting a Session; The R Workspace
    Getting Help

2. Calculations with R
    Basic Math
    Vector Arithmetic
    Matrix Operations

3. Getting Data into R
    Keyboard Input
    Database Input
    Importing Data

0. Introduction: Computing Resources
• There are a number of statistical software packages that provide all sorts of analytical and data management capabilities.
– R (www.r-project.org)
– SAS (www.sas.com)
– SPSS (www.spss.com)
– Stata (www.stata.com)
• For this training course, we shall use R (release 2.15).


0. Introduction: Computing Resources
Time Series Analysis

| Software | ARIMA | GARCH |
|---|---|---|
| R | Yes | Yes |
| SAS | Yes | Yes |
| SPSS | Yes | No |
| Stata | Yes | Yes |

Unit root/Cointegration/VAR and Multivariate Time Series: Yes Yes No Yes Yes No Yes

0. Introduction: Computing Resources
Advanced Modeling

| Software | Endogenous Covariates | Sample Selection | Longitudinal/Panel Data | Multivariate Analysis |
|---|---|---|---|---|
| R | No | No | No | Yes |
| SAS | Yes | Yes | Yes | Yes |
| SPSS | No | No | No | Yes |
| Stata | Yes | Yes | Yes | Yes |

0. Introduction: Computing Resources
Practical Issues: Price, Command Structure, Support, Ease of Teaching
R, SAS, SPSS, Stata: ++ (free) + + - -+ -- --+ + ++ ++ ++

0. Introduction: Computing Resources
R
• Many different datasets (and other "objects") available at same time
• Datasets can be of any dimension
• Functions can be modified
• Experience is interactive: you program until you get exactly what you want
• One stop shopping: almost every analytical tool you can think of is available
• R is free and will continue to exist. Nothing can make it go away, its price will never increase.
Commercial Packages
• One dataset available at a given time
• Datasets are rectangular
• Functions are proprietary
• Experience is passive: you choose an analysis and they give you everything they think you need
• Tend to have limited scope, forcing you to learn additional programs; extra options cost more and/or require you to learn a different language (e.g., SPSS Macros)
• They cost money. There is no guarantee they will continue to exist, but if they do, you can bet that their prices will always increase

0. Introduction: Computing Resources
CAVEAT:
• "Using R is a bit akin to smoking. The beginning is difficult, one may get headaches and even gag the first few times. But in the long run, it becomes pleasurable and even addictive. Yet, deep down, for those willing to be honest, there is something not fully healthy in it." --Francois Pinard

1. Getting Started with R
• To enable us to use R, we first discuss its capabilities, then we describe how to install it, then we illustrate some basic commands and how to obtain help.
• Various ways of communicating with R
– Interactively: (through console)
– Batch Processing: (through scripts)
– Point and Click: (through "add ons" Rcmdr, rattle, deducer)

1.1. What is R?
• R is a statistical programming environment for performing standard and specialized statistical tools
– "environment": intended to characterize R as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software
• R is a free open source statistical package based on the S language developed at Bell Labs (later commercially released by Mathsoft as Splus).
• R is a programming language, thus generating computer code to complete tasks is typically required.

1.1. What is R?
• Initially developed by Robert Gentleman and Ross Ihaka of University of Auckland, now maintained by the "R core development team"
– Since 1997: international R-core team ~15 people & 1000s of code writers and statisticians happy to share their libraries
• Cross platform compatibility: Windows, MacOS, Linux
• Very powerful for writing programs.
– Many statistical functions are already built in.
– Contributed packages expand the functionality to cutting edge research.

1.1. What is R?
Advantages
o Fast and free.
o State of the art: Statistical researchers provide their methods as R packages. SPSS and SAS are years behind R!
o 2nd only to MATLAB for graphics.
o Mx, WinBugs, and other programs use or will use R.
o Active user community
o Excellent for simulation, programming, computer intensive analyses, etc.
o Forces you to think about your analysis.
o Interfaces with database storage software (SQL)
Disadvantages
o Not user friendly @ start: steep learning curve, minimal GUI.
o No commercial support; figuring out correct methods or how to use a function on your own can be frustrating.
o Easy to make mistakes and not know.
o Working with large datasets is limited by RAM
o Data prep & cleaning can be messier & more mistake prone in R vs. SPSS or SAS
o Some users complain about hostility on the R listserve

1.1. What is R?
• Over 800 add-on packages (http://cran.r-project.org/src/contrib/PACKAGES.html)
– This is an enormous advantage: new techniques available without delay, and they can be performed using the R language you already know.
– Allows you to build a customized statistical program suited to your own needs.
– Downside = as the number of packages grows, it is becoming difficult to choose the best package for your needs, & QC is an issue.

1.2. Installation
• R home page: http://www.r-project.org/
• R Archive: http://cran.r-project.org/
• R FAQ (frequently asked questions about R): http://cran.r-project.org/doc/FAQ/R-FAQ.html
• R manuals: http://cran.r-project.org/manuals.html
• R seek engine: http://www.rseek.org

1.2. Installation
• Step 2: Choose a mirror site
• Step 3: Select OS for R, e.g., Windows
• Click "base"
• Click "R-2.15.1-win.exe"
• Click "OK"
• Click "Next" and keep answering resulting pop-up windows until we get FINISH window

1.3. Starting Up
1. Look for R shortcut.
2. Or click Start ► Programs ► R ► R i386 2.15.1

1.3.1. Start Up Windows
R Console: INTERACTIVE COMMAND WINDOW: commands typed here

1.3. Start Up Windows R ◦ Used for entering commands. graphing ◦ Output: results of analyses. queries.1. etc. data manipulations. analyses. are written here ◦ Toggle through previous commands by using the up and down arrow keys command window (console) 25 .

1.3.1. Start Up Windows
• The R workspace
◦ Current working environment
◦ Comprised primarily of variables, datasets, functions; all data objects are kept in memory during an interactive session.
• Most functionality is provided through built-in and user-created functions
◦ Basic functions are available by default.
◦ Other functions contained in packages

1.3.2. Menu & Tool Bars
[Figure: R console window with the MENU BAR and TOOL BAR labelled]

1.3.2. Menu & Tool Bars
In the Menu/Header bar:
• File
• Edit
• View
• Misc(ellaneous)
• Packages
• Windows
• Help

1.3.2.1. Seeking Help
In Help Option of Menu/Header bar:
• Console (keys to work on R console)
• FAQ on R
• FAQ on R for windows
• Manuals (in portable document format)
• R function
• Html help …

1.3.2.1. Seeking Help
Obtaining Html help
• We can do a search with the Menu bar: Help ► Html help
• If you want to use search engine

1.3.2.2. Tutorials
• Each of the following tutorials is in PDF format.
– P. Kuhnert & B. Venables, An Introduction to R: Software for Statistical Modeling & Computing
– J.H. Maindonald, Using R for Data Analysis and Graphics
– B. Muenchen, R for SAS and SPSS Users
– W.J. Owen, The R Guide
– D. Rossiter, Introduction to the R Project for Statistical Computing for Use at the ITC
– W.N. Venables & D.M. Smith, An Introduction to R

1.3.2.2. Tutorials
Tutorials (cont'd)
– Paul Geissler's excellent R tutorial
– Dave Roberts's Excellent Labs on Ecological Analysis
– Excellent Tutorials by David Rossiter
– Excellent tutorial on nearly every aspect of R (c/o Rob Kabacoff). MOST of these notes follow this web page format
– Introduction to R by Vincent Zoonekynd
– R Cookbook
– Data Manipulation Reference

1.3.2.2. Tutorials
Tutorials (cont'd)
– R time series tutorial
– R Concepts and Data Types presentation by Deepayan Sarkar
– Interpreting Output From lm()
– The R Wiki
– An Introduction to R
– Import / Export Manual
– R Reference Cards

1.3.3. Changing GUI Preferences
• Click on Edit ► GUI Preferences

1.3.4. Buttons
Button Functions
• Open: Opens R file.
• Load Workspace
• Save: Saves the current data.
• Copy
• Paste
• Copy and Paste
• Stop current computation
• Print

1.3.5. Opening a Script Window
• Click on File ► New Script gives you a script window.

1.3.5. Opening a Script Window
R scripts
◦ A text file containing commands that you would enter on the command line of R
◦ To place a comment in an R script, use a hash mark (#) at the beginning of the line

1.3.6. Assignments and Operations
• Arithmetic and Mathematical Operations:
– +, -, *, /, ^ are the standard arithmetic operators.
– Mod: %%
– sqrt, exp, log, log10, sin, cos, tan, …

1.3.6. Assignments and Operations
• Other Operations (listed from HIGH precedence downward):
– $ : component selection
– [ , [[ : subscripts, elements
– : : sequence operator
– %*% : matrix multiplication
– <, >, <=, >=, ==, != : inequality comparison
– ! : not
– &, &&, |, || : and, or
– ~ : formulas
– <- : assignment (or =)

1.3.6. Assignments and Operations
• Functions:
– Almost everything in R is done through functions. Here I'm only referring to numeric and character functions that are commonly used in creating or recoding variables.
– Note that while the examples here apply functions to individual variables, many can be applied to vectors and matrices as well.

1.3.6. Assignments and Operations
• Numeric Functions:

| Function | Description |
|---|---|
| abs(x) | absolute value |
| sqrt(x) | square root |
| ceiling(x) | ceiling(3.475) is 4 |
| floor(x) | floor(3.475) is 3 |
| trunc(x) | trunc(5.99) is 5 |
| round(x, digits=n) | round(3.475, digits=2) is 3.48 |
| signif(x, digits=n) | signif(3.475, digits=2) is 3.5 |
| cos(x), sin(x), tan(x) | also acos(x), cosh(x), acosh(x), etc. |
| log(x) | natural logarithm |
| log10(x) | common logarithm |
| exp(x) | e^x |

1.3.6. Assignments and Operations
• Character Functions:

| Function | Description |
|---|---|
| substr(x, start=n1, stop=n2) | Extract or replace substrings in a character vector. x <- "abcdef"; substr(x, 2, 4) is "bcd"; substr(x, 2, 4) <- "22222" is "a222ef" |
| grep(pattern, x, ignore.case=FALSE, fixed=FALSE) | Search for pattern in x. If fixed=FALSE then pattern is a regular expression. If fixed=TRUE then pattern is a text string. Returns matching indices. grep("A", c("b","A","c"), fixed=TRUE) returns 2 |
| sub(pattern, replacement, x, ignore.case=FALSE, fixed=FALSE) | Find pattern in x and replace with replacement text. If fixed=FALSE then pattern is a regular expression. If fixed=TRUE then pattern is a text string. sub("\\s", ".", "Hello There") returns "Hello.There" |
| strsplit(x, split) | Split the elements of character vector x at split. strsplit("abc", "") returns the 3 elements "a", "b", "c" |
| paste(..., sep="") | Concatenate strings after using sep string to separate them. paste("x", 1:3, sep="") returns c("x1", "x2", "x3"); paste("x", 1:3, sep="M") returns c("xM1", "xM2", "xM3"); paste("Today is", date()) |

1.3.6. Assignments and Operations
Probability Functions:
• Notations
– Probability Density Function: d
– Distribution Function: p
– Quantile function: q
– Random generation for distribution: r
• Examples:
– Normal distribution:
    dnorm(x, mean=0, sd=1, log = FALSE)
    pnorm(q, mean=0, sd=1, lower.tail = TRUE, log.p = FALSE)
    qnorm(p, mean=0, sd=1, lower.tail = TRUE, log.p = FALSE)
    rnorm(n, mean=0, sd=1)

1.3.6. Assignments and Operations
– Weibull Distribution
    dweibull(x, shape, scale = 1, log = FALSE)
    pweibull(q, shape, scale = 1, lower.tail = TRUE, log.p = FALSE)
    qweibull(p, shape, scale = 1, lower.tail = TRUE, log.p = FALSE)
    rweibull(n, shape, scale = 1)

– Log Normal Distribution
    dlnorm(x, meanlog = 0, sdlog = 1, log = FALSE)
    plnorm(q, meanlog = 0, sdlog = 1, lower.tail = TRUE, log.p = FALSE)
    qlnorm(p, meanlog = 0, sdlog = 1, lower.tail = TRUE, log.p = FALSE)
    rlnorm(n, meanlog = 0, sdlog = 1)
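The d/p/q/r naming convention can be tried out directly at the console. The following is only a minimal sketch using the normal distribution; the object names (dens, prob, quant, upper, draws) and the parameter values are arbitrary choices made for illustration.

```r
# Density, cumulative probability and quantile for a normal with mean 5, sd 2
dens  <- dnorm(5, mean = 5, sd = 2)       # height of the density at the mean
prob  <- pnorm(7, mean = 5, sd = 2)       # P(X <= 7)
quant <- qnorm(prob, mean = 5, sd = 2)    # qnorm inverts pnorm, so this returns 7

# Upper-tail probability P(X > 7) via lower.tail = FALSE
upper <- pnorm(7, mean = 5, sd = 2, lower.tail = FALSE)

set.seed(123)                             # make the random draws reproducible
draws <- rnorm(1000, mean = 5, sd = 2)    # 1000 simulated values

print(c(dens = dens, prob = prob, quant = quant, upper = upper))
print(mean(draws))                        # should be close to 5
```

The same four prefixes apply to the other distributions listed, for example dweibull, pweibull, qweibull and rweibull.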


1.3.6. Assignments and Operations
• The following gives examples of various function names in R along with additional arguments


1.3.6. Assignments and Operations
Statistical Functions:

| Excel | R |
|---|---|
| NORMSDIST | pnorm(7.2, mean=5, sd=2) |
| NORMSINV | qnorm(0.9, mean=5, sd=2) |
| LOGNORMDIST | plnorm(7.2, meanlog=5, sdlog=2) |
| LOGINV | qlnorm(0.9, meanlog=5, sdlog=2) |
| GAMMADIST | pgamma(31, shape=3, scale=5) |
| GAMMAINV | qgamma(0.95, shape=3, scale=5) |
| GAMMALN | lgamma(4) |
| WEIBULL | pweibull(6, shape=3, scale=5) |
| BINOMDIST | pbinom(2, size=20, p=0.3) |
| POISSON | ppois(2, lambda=3) |


1.3.6. Assignments and Operations
Other Useful Functions

| Function | Description | Example |
|---|---|---|
| seq(from, to, by) | generate a sequence | indices <- seq(1, 10, 2) # indices is c(1, 3, 5, 7, 9) |
| rep(x, ntimes) | repeat x n times | y <- rep(1:3, 2) # y is c(1, 2, 3, 1, 2, 3) |
| cut(x, n) | divide continuous variable in factor with n levels | y <- cut(x, 5) |

1.3.6. Assignments and Operations
• Matrix Arithmetic.
– * is element wise multiplication
– %*% is matrix multiplication
• Assignment
– To assign a value to a variable use "<-" or equal (=) character

1.3.6. Assignments and Operations
• Objects can be used in other calculations. To print object just enter name of object.
• Restrictions for name of object:
– Object names cannot contain `strange' symbols like !, +, -, #.
– A dot (.) and an underscore (_) are allowed, also a name starting with a dot.
– Object names can contain a number but cannot start with a number.
– R is case sensitive, X and x are two different objects, as well as temp and temP.

1.3.6. Assignments and Operations
The assignment operator <-
    x <- 25          # assigns the value of 25 to the variable x
    y <- 3*x         # assigns the value of 3 times x (75 in this case) to the variable y
    r <- 4
    area.circle <- pi*r^2
    area.circle
NOTE: R is case-sensitive (y ≠ Y)

1.3.6. Assignments and Operations
We can evaluate truth or falsity of expressions:
    2>1
    1>2&2>1
generate sequences (and perform operations on them)
    3*(1:5)
We can do matrix operations
    a <- 1:3
    b <- 3:5
    a*b
    a%*%b
    a%*%t(b)

1.3.6. Assignments and Operations
We can perform operations component wise on a vector
    round(sqrt(a),2)
    eigen( a%*% t(b))
    eigen( a%*% t(b))$values
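To make the difference between * and %*% concrete, here is a small sketch that reuses the vectors a and b from the slides; the result names (elementwise, inner, outer_ab) are illustrative only.

```r
a <- 1:3
b <- 3:5

elementwise <- a * b        # multiplies matching positions: 3, 8, 15
inner <- a %*% b            # 1 x 1 matrix holding sum(a * b)
outer_ab <- a %*% t(b)      # 3 x 3 matrix with entries a[i] * b[j]

print(elementwise)
print(inner)
print(dim(outer_ab))
```

Note that %*% always returns a matrix, even when the result is a single number.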

1.3.7. Workspace
• Objects that you create during an R session are held in memory; the collection of objects that you currently have is called the workspace.
• This workspace is not saved on disk unless you tell R to do so. This means that your objects are lost when you close R without saving them, or worse, when R or your system crashes on you during a session.

1.3.7. Workspace
• When you close the RGui or the R console window, the system will ask if you want to save the workspace image.
• If you select to save the workspace image then all the objects in your current R session are saved in a file .RData. This is a binary file located in the working directory of R, which is by default the installation directory of R.

1.3.7. Workspace
• During your R session you can also explicitly save the workspace image. Go to the `File' menu and then select `Save Workspace...', or use the save.image function.
    ## save to the current working directory
    save.image("basicR.RData")
    ## just checking what the current working directory is
    getwd()
    ## save to a specific file and location
    save.image("C:\\Program Files\\R\\R2.5.0\\bin\\basicR.RData")

1.3.7. Workspace
• If you have saved a workspace image and you start R the next time, it will restore the workspace. So all your previously saved objects are available again.
• You can also explicitly load a saved workspace file, which could be the workspace image of someone else. Go to the `File' menu and select `Load workspace...', or alternatively:
    load("basicR.RData")

1.3.7. Workspace
• R gets confused if you use a path in your code like
    c:\mydocuments\myfile.txt
• Note that R sees "\" as an escape character. Thus, it is better to use
    c:\\mydocuments\\myfile.txt
  or
    c:/mydocuments/myfile.txt
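The following sketch shows the two safe ways of writing a Windows path, plus file.path(), which builds the path for you. The folder and file names are the hypothetical ones used on the slide; no file is actually opened.

```r
# Forward slashes work on every platform, including Windows
p1 <- "c:/mydocuments/myfile.txt"

# Doubled backslashes are the escaped form of a single backslash
p2 <- "c:\\mydocuments\\myfile.txt"

# file.path() joins path components with "/" for you
p3 <- file.path("c:", "mydocuments", "myfile.txt")

print(p3)
print(nchar(p2))    # each "\\" in the code counts as one stored character
cat(p2, "\n")       # cat() shows the string as R actually stores it
```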

1.3.8. Seeking Help
• R has a very good help system built in.
• If you know which function you want help with simply use ?_______ with the function in the blank.
    ?hist
    args(hist)
    ?lm
    args(lm)

1.3.8. Seeking Help
• If you don't know which function to use, then use help.search("_______").
    help.search("histogram")

1.3.9. Quitting
Three Ways of Quitting from R session
1. Enter in Command Window: quit()
2. Click on File ► Exit
3. Click on Close button (X at upper right hand corner of R console window).

1.4.1. More on Vectors
We can yield number sequences or use an index to identify components of a vector:
    # An example
    z <- c(1:10)
    z[(z>8) | (z<5)]    # yields 1 2 3 4 9 10
    # How it works
    z <- c(1:10)
    z                   # combines the sequence from 1 up to 10
    z>8

1.4.1. More on Vectors
    z>8         # yields
    > F F F F F F F F T T
    z<5         # then yields
    > T T T T F F F F F F
    # while
    z>8|z<5     # results in
    > T T T T F F F F T T

1.4.1. More on Vectors
    # finally, the command
    > z[c(T,T,T,T,F,F,F,F,T,T)]
    # selects only components of the vector
    # which yield true values, viz
    > 1 2 3 4 9 10

1.4.2. Listing Objects
• To list the objects that you have in your current R session use the function ls or the function objects.
    ls()
• So to run the function ls we need to enter the name followed by an opening ( and a closing ). Entering only ls will just print the object; you will see the underlying R code of the function ls.

1.4.2. Listing Objects
• Most functions in R accept certain arguments.
• For example, one of the arguments of the function ls is pattern. To list all objects whose names contain the letter "x" (use pattern="^x" for names that start with "x"):
    x2 = 9
    y2 = 10
    ls(pattern="x")
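The pattern argument is a regular expression, so "x" and "^x" behave differently. The sketch below works in a separate environment (an assumption made here so the example does not touch your own workspace); the object names are made up for illustration.

```r
# A fresh environment keeps the example independent of your workspace
e <- new.env()
assign("x2", 9, envir = e)
assign("y2", 10, envir = e)
assign("max_x", 3, envir = e)

contains_x <- ls(envir = e, pattern = "x")     # any name containing "x"
starts_x   <- ls(envir = e, pattern = "^x")    # only names that start with "x"

print(contains_x)
print(starts_x)
```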

1.4.3. Plot
• If you assign a value to an object that already exists then the contents of the object will be overwritten with the new value (without a warning!). Use the function rm to remove one or more objects from your session.
• Let us generate two small vectors with data and a scatterplot.
    z2 <- c(1,2,3,4,5,6)
    z3 <- c(6,8,3,5,7,1)
    plot(z2,z3)
    title("My first scatterplot")
    rm(x, x2)

1.5.1. Datasets
• R comes with a number of sample datasets that you can experiment with. Type data( ) to see the available datasets. The result will depend on which packages you have loaded. Type help(datasetname) for details on a sample dataset.

1.5.2. Packages
• One of the strengths of R is that the system can easily be extended.
• The system allows you to write new functions and package those functions in a so called `R package' (or `R library').
• The R package may also contain other R objects, for example data sets or documentation.
• There is a lively R user community and many R packages have been written and made available on CRAN for other users.
• Just a few examples: there are packages for portfolio optimization, drawing maps, exporting objects to html, time series analysis, spatial statistics, and the list goes on and on.

1.5.2. Libraries
• When you download R, already a number of packages are downloaded as well.
• To use a function in an R package, that package has to be attached to the system.
• When you start R not all of the downloaded packages are attached; only seven packages are attached to the system by default.
• You can use the function search to see a list of packages that are currently attached to the system; this list is also called the search path.
    search( )

1.5.2. Libraries
• To attach another package to the system you can use the menu or the library function.
Via the menu:
• Select the `Packages' menu and select `Load package...'; a list of available packages on your system will be displayed. Select one and click `OK'; the package is now attached to your current R session.
Via the library function:
    library()
    library(MASS)
    shoes

1.5.2. Libraries
• Before you download a new package, make sure to run R as administrator
– Right click on the shortcut
– Choose "Run as administrator"

1.5.2. Libraries
• Suppose we want to install a package called Rcmdr:
– Choose Rcmdr in Packages ► Install packages menu
– Or alternatively run the command:
    install.packages("Rcmdr")
• NOTE: Install also rattle and hexbin
• More about these packages later

2. Calculations with R
To understand how R runs and how to do calculations with R, note that:
• R is an object oriented programming language.
– Almost all things in R (functions, datasets, results, etc.) are OBJECTS.
• Graphics are written out and are not stored as objects.

2. Calculations with R
• Script can be thought of as a way to make objects.
– Your goal is usually to write a script that, by its end, has created the objects (e.g., statistical results) and graphics you need.
• Objects are classified by two criteria:
– MODE: how objects are stored in R: character, numeric, logical, factor, list, & function
– CLASS: how objects are treated by functions (important to know!): [vector], matrix, array, data.frame, & hundreds of special classes created by specific functions

2. Calculations with R
[Figure: an object Z laid out with columns x1 to x6 and rows 1 to 8]
• The MODE of Z is determined automatically by the types of things stored in Z (numbers, characters, etc.). If it is a mix, mode = list.
• The CLASS of Z is either set by default, depending on how it was created, or is explicitly set by the user. You can check the object's class and change it. It determines how functions deal with Z.
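The difference between mode and class can be checked with the functions mode() and class(). The sketch below uses three small made-up objects (num_vec, fac, df); it is meant as an illustration, not as a complete account of R's type system.

```r
num_vec <- c(1.5, 2, 3)
fac <- factor(c("low", "high", "low"))
df <- data.frame(id = 1:2, label = c("a", "b"))

# mode() reports how the object is stored; class() drives how functions treat it
print(c(mode(num_vec), class(num_vec)))   # both "numeric"
print(c(mode(fac), class(fac)))           # stored as numbers, treated as a factor
print(c(mode(df), class(df)))             # stored as a list, treated as a data frame

# The class can be changed, e.g. by coercing the factor to character
fac_chr <- as.character(fac)
print(class(fac_chr))
```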

2. Calculations with R
• R has a wide variety of data types:
– scalars,
– vectors (numerical, character, logical),
– matrices,
– dataframes, and
– lists.

2.1. Scalars
    # scalar (fixed constants)
    3+8
    x <- 50
    41.3*x
• Thus, we can use R as a calculator directly
• Or alternatively,
– generate scalar objects first, perform calculations on the objects (and yield another object)
• Scalars are the most basic "vectors"

2.2. Vectors
    # null vector using concatenate function
    x <- c()
    # numeric vector
    a <- c(2,4,6,-3,12) ; a
    # character vector
    b <- c("one","two","three")
    # logical vector
    c1 <- c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE)
Note:
• Semicolon used to combine multiple statements in one line

c( seq(0.-5) # length length(d) .rep(NA.5).-1.2). to =4. a[c(3. Vectors We can refer to elements of a vector using subscripts.2).5). e subash <.2. by =0. d # replicate e = rep(NA.2.2)] # 3rd and 2nd elements of vector # sequence d=seq( from =1.

2.2. Vectors
Be careful with assignments using "<-": this is not the same as "< -"
    # assigning a value of 2 to f
    f<-2
    f
    # is f less than negative 2?
    f< -2
Note:
• For readability, good to use parentheses

2.2. Vectors
Comparison and Logical Operators: <, >, <=, >=, == and != compare values; !, &, &&, | and || negate or combine logical values.
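A short sketch of the comparison and logical operators applied to a vector follows. The vector v and the result names are invented for this example.

```r
v <- c(-3, 0, 2, 7)

gt2 <- v > 2                      # element-wise "greater than"
eq2 <- v == 2                     # equality test uses ==, not =
between <- (v >= 0) & (v < 5)     # & combines conditions element by element
outside <- (v < 0) | (v > 5)      # | is element-wise "or"
not_gt2 <- !gt2                   # ! negates every element

print(between)
print(v[between])                 # a logical vector can be used as an index
print(any(gt2))                   # is at least one element TRUE?
print(all(gt2))                   # are all elements TRUE?
```

Writing the parentheses around each condition is optional here, but it makes the intent easier to read.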

2.2. Vectors
Note:
• Important not to name your variables after existing variables or functions.
• There are existing constants, e.g., pi, but the values of such constants can be overwritten.
• There are a number of built-in R functions:
    mean(d)
    min(d)
    max(d)

2.2. Vectors
• We can sort:
    a
    a <- sort(a, decreasing=TRUE); a
• We can identify missing values with is.na function:
    a <- c(a, NA, -4, NA, 2); a
    is.na(a)
• Other helpful functions:
    unique(a)
    duplicated(a)

2.rm=TRUE) To determine number of missing data: sum(is.na(a)) Summary statistics: summary(a) . mean of a vector with missing data is missing: mean(a) mean(a. Vectors Note: by default.na.2.

Exercise 1
• Generate a vector e1 of positive even integers less than 100
• Remove the values greater than 50 and less than 90, and store these into e2
• Find the variance of e2

Solns to Exercise 1
    e1 <- seq(from = 2, to = 98, by = 2)
    e2 <- e1[e1 <= 50 | e1 >= 90]
    var(e2)

2.3. Matrices
• A matrix is a rectangular array.
• All columns in a matrix must have the same mode (numeric, character, etc.) and the same length.
• General format is
    mymatrix <- matrix(vector, nrow=r, ncol=c, byrow=FALSE, dimnames=list(char_vector_rownames, char_vector_colnames))
– byrow=TRUE indicates that the matrix should be filled by rows.
– byrow=FALSE indicates that the matrix should be filled by columns (the default).
– dimnames provides optional labels for the columns and rows.
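As a small illustration of the general format, the sketch below builds the same six numbers into a matrix twice, once filled by rows with labels and once with the default column-wise fill. The row and column labels are made up for this example.

```r
# Filled by rows, with labels for rows and columns
m_rows <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE,
                 dimnames = list(c("r1", "r2"), c("c1", "c2", "c3")))

# Default: filled by columns, no labels
m_cols <- matrix(1:6, nrow = 2, ncol = 3)

print(m_rows)
print(m_cols)
print(m_rows["r2", "c1"])    # labels can be used for subsetting
```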

2.3. Matrices
Let us generate a 2 by 3 matrix consisting of numbers 10 to 15:
    mat_a <- matrix(10:15, nrow = 2, ncol = 3); mat_a
We can perform matrix operations:
    mat_a+mat_a
    3*mat_a
We can transpose a matrix:
    mat_a; t(mat_a)

2.3. Matrices
We can identify its dimensions:
    dim(mat_a)
Or alternatively:
    nrow(mat_a)
    ncol(mat_a)

2.3. Matrices
Matrix Multiplication
    mat_a %*% t(mat_a)
which is different from:
    t(mat_a) %*% (mat_a)

2.3.1. Subsetting with Matrices
First element in Matrix
    mat_a[1,1]
First row
    mat_a[1,]
Question: how about second column?
ANSWER:
    mat_a[,2]

2.3.1. Subsetting with Matrices
Extracting 2nd and 3rd elements in first row
    mat_a[1,c(2,3)]
Extracting 2nd element in 1st row, and 3rd element in 2nd row:
    c(mat_a[1,2], mat_a[2,3])

2.3.2. Generating Matrices from Vectors
To stack two vectors, one below the other, use rbind():
    mat_b <- rbind(a,a); mat_b
If one vector is shorter than the other, its elements will be repeated until the lengths match:
    a
    d
    mat_c = rbind(a,d); mat_c

2.3.2. Generating Matrices from Vectors
• To stack two vectors, one next to each other, use cbind():
    mat_d <- cbind(a,a); mat_d
• Missing data may also be part of a matrix:
    mat_e = matrix(c(9,5,-2,NA,-10,NA), nrow = 2, ncol = 3, byrow = TRUE); mat_e
• To see if any of the elements of a matrix are missing use is.na():
    is.na(mat_e)

2.3.2. Generating Matrices from Vectors
• To see how many missing values there are, use sum and is.na functions:
    sum(is.na(mat_e))
• To obtain the element number of the matrix of the missing value(s), use which and is.na functions:
    which(is.na(mat_e))
Note: by default counting goes down the first column, then on to the next columns.
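The column-by-column counting can be seen in a small sketch. The matrix m_na below is a made-up example (it is not the mat_e of the slides); the arr.ind = TRUE argument of which() reports row and column positions instead of element numbers.

```r
# A 2 x 3 matrix with two missing values, filled by rows
m_na <- matrix(c(9, NA, 3,
                 4, NA, 6), nrow = 2, ncol = 3, byrow = TRUE)

n_missing <- sum(is.na(m_na))                 # how many values are missing
elem_no <- which(is.na(m_na))                 # element numbers, counted down the columns
pos <- which(is.na(m_na), arr.ind = TRUE)     # the same positions as (row, col) pairs

print(n_missing)
print(elem_no)
print(pos)
```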

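Exercise 2
• Find the matrix product of M_A and M_B if M_A is the 3 by 3 matrix with rows (2, 3, 7), (1, 6, 2), (3, 5, 1) and M_B is the 3 by 3 matrix with rows (3, 2, 9), (0, 7, 8), (5, 8, 2)
• Obtain the matrix inverse of the earlier matrix product with the solve function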

Solns to Exercise 2
    M_A <- matrix(c(2, 3, 7, 1, 6, 2, 3, 5, 1), nrow = 3, byrow = TRUE)
    M_B <- matrix(c(3, 2, 9, 0, 7, 8, 5, 8, 2), nrow = 3, byrow = TRUE)
    M_C = M_A %*% M_B
    M_C
    solve(M_C)

2.4. Arrays
• Extension of matrices but can have more than two dimensions.
– Vector is an array of one dimension – Matrix is a rectangular array

• See help(array) for details.
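As a brief sketch of an array with three dimensions, the example below fills a 2 x 3 x 4 array with the numbers 1 to 24; the object name arr is arbitrary.

```r
arr <- array(1:24, dim = c(2, 3, 4))   # 2 rows, 3 columns, 4 layers

print(dim(arr))
first_layer <- arr[, , 1]              # a 2 x 3 matrix holding 1 to 6
one_value <- arr[1, 2, 3]              # row 1, column 2, layer 3

print(first_layer)
print(one_value)
```

Values are filled down the rows first, then across columns, then across layers, which is the same column-wise order used for matrices.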


2.5. Dataframe
• Another generalization of a matrix, but with different columns possibly having different modes (numeric, character, factor, etc.).
    d <- 1:5
    e <- c("red", NA, "white", "blue", "red")
    f <- c(TRUE,TRUE,TRUE,FALSE,TRUE)
    mydata <- data.frame(d,e,f)
    names(mydata) <- c("ID","Color","Passed")    # variable names


2.5. Dataframe
• There are a variety of ways to identify the elements of a dataframe.
    mydata[2:3]                # columns 2 and 3 of dataframe
    mydata[c("ID","Color")]    # columns ID and Color from dataframe
    mydata$Color               # variable Color in the dataframe

2.6. Lists
• An ordered collection of objects (components).
• A list allows you to gather a variety of (possibly unrelated) objects under one name.
    # example of a list with 5 components:
    # a string, a numeric vector, a matrix, a data frame, and a scalar
    w <- list(name="Fred", mynumbers=a, mymatrix=mat_a, mydf=mydata, age=28)

2.6. Lists
• We can have a list of lists, and a list of list of lists…
• Identify elements of a list using the [[]] convention.
    w[[2]]    # 2nd component of the list
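The sketch below shows the common ways of pulling components out of a list, using a simplified stand-in for the list w above (the component values here are invented so that the example runs on its own).

```r
w <- list(name = "Fred", mynumbers = c(2, 4, 6), age = 28)

second <- w[[2]]           # the component itself: a numeric vector
by_name <- w$mynumbers     # the same component, selected by name
sub_list <- w[2]           # single brackets return a list of length 1

# A list of lists: components can themselves be lists
nested <- list(person = w, tags = list("a", "b"))
inner_age <- nested$person$age

print(second)
print(class(sub_list))
print(inner_age)
```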

2.7. Factors
• Need to tell R that a variable is nominal by making it a factor.
    # variable sex with 20 "male" entries and 30 "female" entries
    sex <- c(rep("male",20), rep("female", 30))
    summary(sex)
    sex <- factor(sex)
    # stores sex as 20 2s and 30 1s and associates
    # 1=female, 2=male internally (alphabetically)
    # R now treats sex as a nominal variable
    summary(sex)
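The internal coding of a factor can be inspected with levels() and as.integer(). The following sketch rebuilds the sex variable from the slide so that it runs on its own; the names codes and counts are illustrative.

```r
sex <- factor(c(rep("male", 20), rep("female", 30)))

lev <- levels(sex)            # levels are sorted alphabetically
codes <- as.integer(sex)      # the integer codes stored internally
counts <- table(sex)          # frequency of each level

print(lev)
print(codes[1])               # the first entry is "male"
print(counts)
```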

2.8. Getting Help
• Recall that we can get help with a function in R by making use of ? followed by the name of the function.
    ?read.table
• Alternatively:
    help(read.table)

2.8. Getting Help
• In order to get help with a package, use help(package="name").
    help(package="MASS")
• We can also search R packages for help with a topic with help.search:
    help.search("regress")


3. Getting Data into R
• Getting data into R is fairly simple, whether by keyboard entry or by importing the data from a dataset generated by other software.
– For Stata, Systat (and SPSS and SAS), use the foreign package.
– For SPSS and SAS, we could also use the Hmisc package for ease and functionality.
– See the Quick-R section on packages for information on obtaining and installing these packages.

• Examples are provided in the next slides.


3.1. Keyboard Data Entry
• Generate a dataframe from scratch
    age <- c(44, 35, 65, 42)
    sex <- c("male", "female", "male", "male")
    weight <- c(160, 110, 220, 158)
    mydata <- data.frame(age,sex,weight)


3.1. Keyboard Data Entry
• You can also use R's built in spreadsheet to enter the data interactively, as in the following example.
    # enter data using editor
    mydata2 <- data.frame(age=numeric(0), gender=character(0), weight=numeric(0))
    mydata2 <- edit(mydata2)
    # note that without the assignment in the
    # line above, the edits are not saved!

3.2. Importing Data from Excel
• Often we already have a file in Excel, and we want to analyze these data. To import these data, save the Excel file as a CSV file, then:
1. Check what is the working folder for R
    getwd()
2. Tell R in what folder the data set is stored (if different from (1)). Suppose your data set is in the ex subfolder of the C drive; enter:
    setwd("C:/ex")
3. Now use the read.table() command:
    mydata <- read.table("anscombe.csv", header=TRUE, sep=",")

3.2.1. read.table
• The read.table function will let you read in any type of delimited ASCII file.
– It can read in both numeric and character values.
– By default it works out the type of each column from its contents; in this release of R, columns containing character data are read in as factors unless you ask otherwise. It is easiest to change a column's type once the data have been read in; the mode function shows how a column is stored.
– This is by far the easiest and most reliable method of entering data into R.

3.3. Importing Internet Data
    library(foreign)    # read.dta comes from the foreign package
    cdata <- read.dta("http://www.ats.ucla.edu/stat/data/crime.dta")
– The default delimiter in read.table is the space delimiter, but this could create problems if there are missing data.
– The function will not work unless every data line has the same number of values. Thus, if there are missing data, the data lines will have different numbers of values, and you will receive an error. If there are missing values the easiest way to fix this problem is to change the type of delimiter.

3.4. Loading Data from Packages
• To use a data set available in one of the R packages, install that package (if needed). Load the package into R, using the library() function. For instance
    library("MASS")
• Extract the data set you want from that package, using the data() function.
• To list available data:
    data()


3.5. Exporting Data from R
• There are numerous methods for exporting R objects into other formats.
– For comma separated value (CSV) files, use the write.table function or the write.csv function
    write.csv(mydata, "c:/ex/mydata.csv")
– For SPSS, SAS and Stata, you will need to load the foreign package.
– For Excel, you will need the xlsReadWrite package.
    library(xlsReadWrite)
    write.xls(mydata, "c:/ex/mydata.xls", colNames = TRUE, sheet = 1, from = 1, rowNames = TRUE, naStrings = "")
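To check that an export worked, one option is to write the data to a CSV file and read it straight back. The sketch below does this with a temporary file (created with tempfile(), so no particular folder is assumed); the small data frame and the object names are invented for the example.

```r
df_out <- data.frame(age = c(44, 35), sex = c("male", "female"),
                     stringsAsFactors = FALSE)

csv_path <- tempfile(fileext = ".csv")             # temporary file location
write.csv(df_out, csv_path, row.names = FALSE)     # export without row names

df_in <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)
unlink(csv_path)                                   # remove the temporary file

print(df_in)
```

Setting row.names = FALSE avoids an extra unnamed first column when the file is read back.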

3.6. Viewing Data in R
• There are a number of functions for listing the contents of an object or dataset.
    # list objects in the working environment
    ls()
    # list the variables in mydata
    names(mydata)
    # list the structure of mydata
    str(mydata)
    factor(e)
    # list levels of factor e in mydata
    levels(mydata$e)

3.6. Viewing Data in R
• functions for listing the contents of an object or dataset (cont'd):
    # print mydata
    mydata
    # print first 2 rows of mydata
    head(mydata, n=2)
    # print last 2 rows of mydata
    tail(mydata, n=2)

3.6. Viewing Data in R
• functions for listing the contents of an object or dataset (cont'd):
    # dimensions of an object, e.g. mydata
    dim(mydata)
    # class of an object (numeric, matrix, dataframe, etc), e.g. mydata
    class(mydata)

3.7. Missing Data
• The function complete.cases() returns a logical vector indicating which cases are complete.
    # list rows of data that have missing values
    mydata[!complete.cases(mydata),]
• The function na.omit() returns the object with listwise deletion of missing values.
    # create new dataset without missing data
    newmydata <- na.omit(mydata)
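Here is a minimal, self-contained sketch of both functions. The data frame df_na is a made-up example with one missing value in each of two rows.

```r
df_na <- data.frame(age = c(44, NA, 65, 42),
                    weight = c(160, 110, NA, 158))

complete <- complete.cases(df_na)     # TRUE for rows with no missing values
incomplete_rows <- df_na[!complete, ] # rows that have at least one NA
clean <- na.omit(df_na)               # listwise deletion of incomplete rows

print(complete)
print(incomplete_rows)
print(clean)
```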

3.7. Missing Data
• In R, missing values are represented by the symbol NA (not available). Impossible values (e.g., 0/0) are represented by the symbol NaN (not a number).
• To test for missing values:
    is.na(x)    # returns TRUE if x is missing
    y <- c(1,2,3,NA)
    is.na(y)    # returns a vector (F F F T)

3.7. Missing Data
Advanced Handling of Missing Data
• Most modeling functions in R offer options for dealing with missing values. You can go beyond pairwise or listwise deletion of missing values through methods such as multiple imputation.
– Good implementations that can be accessed through R include Amelia II, Mice, and mitools.

3.8. Date Values
• Dates are represented as the number of days since 1970-01-01, with negative values for earlier dates.
    # use as.Date( ) to convert strings to dates
    mydates <- as.Date(c("2007-06-22", "2004-02-13"))
    # number of days between 6/22/07 and 2/13/04
    days <- mydates[1] - mydates[2]

3.8. Date Values
• Sys.Date( ) returns today's date.
• date() returns the current date and time.
    # print today's date
    today <- Sys.Date()
    format(today, format="%B %d %Y")

3.8. Date Values
• The following symbols can be used with the format( ) function to print dates.

| Symbol | Meaning | Example |
|---|---|---|
| %d | day as a number (01-31) | 01-31 |
| %a | abbreviated weekday | Mon |
| %A | unabbreviated weekday | Monday |
| %m | month (01-12) | 01-12 |
| %b | abbreviated month | Jan |
| %B | unabbreviated month | January |
| %y | 2-digit year | 07 |
| %Y | 4-digit year | 2007 |
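The same symbols work in both directions: as.Date() uses them to read a string, and format() uses them to print a date. The sketch below sticks to the numeric symbols, since month and weekday names depend on the locale of your computer; the date strings are arbitrary examples.

```r
# Read a date written as day/month/year
d <- as.Date("22/06/2007", format = "%d/%m/%Y")

iso <- format(d, "%Y-%m-%d")       # "2007-06-22"
short <- format(d, "%d/%m/%y")     # "22/06/07"

# Dates are stored as days since 1970-01-01
days_since_epoch <- as.numeric(as.Date("1970-01-10"))   # 9

print(iso)
print(short)
print(days_since_epoch)
```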

4. Common Bugs and Fixes
• Syntax Errors
– Incorrect spelling (of the function, variable, etc.)
– Including a "+" when copying code from the Console
– Having an extra parenthesis at the end of a function
– Using parentheses rather than brackets or vice versa
– Having an extra bracket when subsetting

4. Common Bugs and Fixes
• Trailing +
– Not closing a function call with a parenthesis
– Not closing brackets when subsetting
– Not closing a function you wrote with a squiggly brace

4. Common Bugs and Fixes
• Error in ... : requires numeric matrix/vector arguments
1. Objects are data frames, not matrices
2. Elements of the vectors are characters
• Possible solutions:
1. Coerce (a copy of) the data set to be a matrix, with the as.matrix() command
2. Coerce (a copy of) the vector to have numeric entries, with the as.numeric() command
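Both fixes can be seen in a small sketch. The data frame df_xy and the character vector chr are invented for this example; try() is used only so that the failing call does not stop the script.

```r
df_xy <- data.frame(x = c(1, 2), y = c(3, 4))

# Matrix multiplication on a data frame fails ...
attempt <- try(df_xy %*% c(1, 1), silent = TRUE)
failed <- inherits(attempt, "try-error")

# ... but works once (a copy of) the data frame is coerced to a matrix
row_sums <- as.matrix(df_xy) %*% c(1, 1)

# Numbers stored as characters must be coerced before arithmetic
chr <- c("1.5", "2", "10")
nums <- as.numeric(chr)

print(failed)
print(row_sums)
print(sum(nums))
```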

Next Topics
• Summarizing Data with R
– Numerical Summaries
– Graphs with R
• Intermediate Analysis
– Hypothesis Tests
– Regression