11/14/2016
A Brief Introduction to R
Dr. Norberto E. Milla
What is R?
• R is a language and environment for statistical
computing and graphics
• R is an open-source implementation of the S language
(S-PLUS is the commercial implementation)
• Initially developed by Robert Gentleman and Ross
Ihaka of the University of Auckland (early 1990s)
• R is written by statisticians for statisticians (and the
rest of us)
• An environment: a huge library of algorithms for data
access, data manipulation, analysis and graphics
• A community
–Thousands of contributors, 2 million users
–Resources and help in every domain
Awesome thing #1: It's FREE!
• Open Source, licensed under GPL (like Linux!)
–Free as in freedom
• Flexible and runs on a wide array of platforms,
including Windows, Unix, and Mac OS X
• Open for integration
–Data (SAS, SPSS, Stata, Excel, …)
• Broad user-base
–De-facto standard for data analysis and
teaching statistics
Awesome thing #2: Language
• Programming, not dialogs or cell formulas
–Freedom to combine methods
–Repeatable results
–Reliable and reusable
• Language designed for data analysis
–Object-oriented: vector, matrix, model, …
–Built-in library of algorithms
• Get more done, faster
Awesome thing #3: Graphics
• Functions for standard graphs
–Scatterplot, boxplot, histogram, smoothing
–Bar plot, pie chart, dot chart, …
–Image plot, 3-D surface, map, …
• Customize without limits
–Combine graph types
–Create entirely new graphics
– Use of colors
Awesome thing #4: Statistics
• All standard statistical methods built in
–Mean, median, covariance, distributions, …
–Regression, ANOVA, cross-tabulations, …
–Survival, nonlinear mixed effects, GLM, …
–Neural networks, trees, GAM, …
• Object-oriented functions
–Access all parts of the analysis results
–Combine analytic methods
• Over 3,000 contributed packages for specialized
applications (as of 2011)
Caveat
“Using R is a bit akin to smoking.
The beginning is difficult, one may
get headaches and even gag the
first few times. But in the long run,
it becomes pleasurable and even
addictive.”
--Francois Pinard
Downloading and installing R
Step 1: Go to the R homepage: http://www.r-project.org
Step 2: Select a CRAN mirror site
Step 3: Select appropriate installer based on OS
Step 4: Select “base” installer
Step 5: Download R installer
Step 6: Double-click the downloaded R installer
Step 7: In the pop-up dialog, click OK
Step 8: Click Next in each subsequent window until you reach the
Finish window
The R console
Data types in R
• R has varied data types: scalars, vectors, matrices,
data frames and lists
• A vector is a single entity consisting of an ordered
collection of elements of the same type (numeric,
character, or logical)
• A matrix is a vector with a dimension attribute, indexed
by two indices (row and column); arrays extend this to
more than two indices
• Data frames are matrix-like structures, in which the
columns can be of different types.
• Data frames are ‘data matrices’ with one row per
observational unit but with (possibly) both
numerical and categorical variables
Vector
• R is case-sensitive
• Assignment operators in R: <-, =
a<-c(1, 2, 5, 3, 6, -2, 4) # numeric vector
b=c("one", "two", "three") # character vector
c=c(TRUE, FALSE, TRUE, TRUE) # logical vector
• Elements of a vector can be referred to using
subscripts
• The following command will display the 2nd and 4th
elements of vector a
a[c(2, 4)]
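Subscripts accept more than a list of positions. The following is a minimal sketch, reusing the vector a from above, of a few common subscript forms; the result names (second_and_fourth, without_first, positives) are illustrative only.

```r
# Toy vector mirroring the slide's example
a <- c(1, 2, 5, 3, 6, -2, 4)

second_and_fourth <- a[c(2, 4)]  # elements at positions 2 and 4
first_three       <- a[1:3]      # a range of positions
without_first     <- a[-1]       # a negative index drops that position
positives         <- a[a > 0]    # a logical condition keeps matching elements

print(second_and_fourth)
print(positives)
```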
Matrix
• All columns in a matrix must have the same mode
(numeric, character, etc.) and the same length
mymatrix=matrix(vector, nrow=r, ncol=c, byrow=FALSE,
dimnames=list(char_vector_rownames, char_vector_colnames))
Example:
matrix1=matrix(1:20, 4, 5) # generates a 4x5 matrix
x=c(1:9)
rownames=c("r1","r2","r3")
colnames=c("c1","c2","c3")
matrix2=matrix(x, 3, 3, byrow=TRUE,
dimnames=list(rownames,colnames))
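Once a matrix exists, its elements are reached with two subscripts, [row, column]. This is a small sketch under the same setup as matrix2 above; the names row_labels, col_labels, and m are chosen here for illustration (they also avoid masking the built-in rownames() and colnames() functions).

```r
row_labels <- c("r1", "r2", "r3")
col_labels <- c("c1", "c2", "c3")

# byrow = TRUE fills the matrix row by row: 1 2 3 / 4 5 6 / 7 8 9
m <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE,
            dimnames = list(row_labels, col_labels))

single_cell  <- m["r2", "c3"]  # one element, addressed by names
first_column <- m[, 1]         # an empty row subscript selects all rows
first_row    <- m[1, ]         # an empty column subscript selects all columns

print(m)
print(dim(m))                  # number of rows and columns
```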
Data Frame
• In a data frame different columns can have different
modes
• Similar to SAS and SPSS data sets
• Example:
x=c(1,2,3,4)
y=c("red", "white", "red", NA)
z=c(TRUE, TRUE, FALSE, FALSE)
mydata=data.frame(x,y,z) # creates the data frame mydata
names(mydata)=c("ID", "Color", "Passed") # assigns column names to mydata
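After building a data frame it helps to inspect it. The sketch below rebuilds the same toy data with the column names set directly in data.frame(); stringsAsFactors = FALSE is an assumption added here so that Color stays a character column in every R version.

```r
mydata <- data.frame(
  ID     = c(1, 2, 3, 4),
  Color  = c("red", "white", "red", NA),
  Passed = c(TRUE, TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

str(mydata)                            # column types and a preview of values
colors    <- mydata$Color              # one column, extracted as a vector
passed_2  <- mydata[2, "Passed"]       # a single cell: row 2, column Passed
n_missing <- sum(is.na(mydata$Color))  # count of missing values in Color
```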
R built-in data editor
• One can enter data interactively into R using its
built-in spreadsheet
mydata=data.frame() # creates an empty data frame
mydata=edit(mydata) # opens the spreadsheet for data entry
Importing data from Excel
• For Excel 2003 or earlier, save the file in csv format
and use either of the following commands to import
the file into R
read.csv("D:/DMPS/R Training/QUICK-R/import1.csv",header=TRUE,sep=",")
or,
read.table("D:/DMPS/R Training/QUICK-R/import1.csv",header=TRUE,sep=",")
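To try the import without an Excel file at hand, the following sketch writes a small csv file to a temporary location and reads it back. The file contents and the names csv_path and imported are made up for illustration; only read.csv() with header and sep comes from the slide.

```r
# Write a tiny csv file to a temporary path (hypothetical sample data)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,score,group",
             "1,85,A",
             "2,90,B",
             "3,78,A"), csv_path)

# Import it the same way as in the slide
imported <- read.csv(csv_path, header = TRUE, sep = ",")
str(imported)     # check that the columns were read with sensible types

unlink(csv_path)  # remove the temporary file
```

Assign the result of read.csv() to a name, as done here; otherwise the data is only printed to the console and not kept.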
• For Excel 2007 or 2010, first load the xlsx package
using the following command
library(xlsx)
• Then use the following command to import the file
into R
read.xlsx("D:/DMPS/R Training/QUICK-R/import2.xlsx",sheetIndex=1)
or, simply
read.xlsx("D:/DMPS/R Training/QUICK-R/import2.xlsx",1)
Importing data from SPSS
• There are two packages which can be used to
import SPSS data sets into R: foreign and Hmisc
• Load the foreign package
library(foreign)
• Use the following command to import the data into
R
myspssdata=read.spss("D:/DMPS/R Training/QUICK-R/ched_complete.sav",
use.value.labels=TRUE, to.data.frame=TRUE)
Importing data from SPSS
• Save the SPSS data set in portable (*.por) format
• Load the Hmisc package
library(Hmisc)
• Use the following command to import the data into
R
myspssdata=spss.get("D:/DMPS/R Training/QUICK-R/ched_complete.por",
use.value.labels=TRUE, to.data.frame=TRUE)
Importing data from STATA
• Call in the foreign package
library(foreign)
• Use the following command to import the data into
R
mystatadata=read.dta("D:/DMPS/R Training/QUICK-R/statadata.dta",
convert.factors=TRUE)
Variable labels
• Using the edit() function we can specify the
variable labels in the R spreadsheet
• An alternative is by using the following command:
names(mydata)[3]="age" # assigns "age" as the label of the 3rd column of mydata
Value labels
• Use the factor() function for nominal data and the
ordered() function for ordinal data
• Suppose the variable v1 is coded 1, 2 or 3 and we
want to attach value labels 1=red, 2=blue and
3=green
mydata$v1=factor(mydata$v1, levels=c(1,2,3),
labels=c("red", "blue", "green"))
• Suppose the variable y is coded 1, 3 or 5 and we
want to attach value labels 1=Low, 3=Medium, and
5=High
mydata$y=ordered(mydata$y, levels=c(1,3,5),
labels=c("Low", "Medium", "High"))
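The difference between factor() and ordered() shows up when values are compared. The sketch below applies both to small made-up vectors (v1 and y here are standalone vectors, not columns of mydata) and is only meant to illustrate the labelling described above.

```r
v1 <- c(1, 3, 2, 1)  # hypothetical nominal codes
y  <- c(5, 1, 3, 5)  # hypothetical ordinal codes

v1_f <- factor(v1, levels = c(1, 2, 3),
               labels = c("red", "blue", "green"))
y_o  <- ordered(y, levels = c(1, 3, 5),
                labels = c("Low", "Medium", "High"))

print(table(v1_f))                  # frequency of each label
at_least_medium <- y_o >= "Medium"  # comparisons respect the level order
print(at_least_medium)
```

Comparisons such as >= are only meaningful for ordered factors; for an unordered factor R returns NA with a warning.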
Creating new variables
• There are three ways to create new variables from
existing variables in an R data set
• Suppose the R data set mydata has two variables, x1
and x2, and we want to create two new variables: the
sum and the mean of x1 and x2
• One way to accomplish this is as follows:
attach(mydata)
mydata$sum=x1+x2
mydata$mean=(x1+x2)/2
detach(mydata)
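The same variables can be created without attach(), which avoids accidentally picking up a stray x1 or x2 from the workspace. This sketch shows two other common approaches, direct $ references and transform(), on a made-up mydata.

```r
mydata <- data.frame(x1 = c(1, 2, 3), x2 = c(10, 20, 30))  # toy data

# Approach 1: refer to the columns explicitly with $
mydata$sum <- mydata$x1 + mydata$x2

# Approach 2: transform() evaluates the expression inside the data frame
mydata <- transform(mydata, mean = (x1 + x2) / 2)

print(mydata)
```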
Recoding variables
• Suppose we want to categorize age as follows:
>75=Old, >45 to 75=Middle Aged, and <=45=Young
• This can be done as follows:
attach(mydata)
mydata$agecat[age<=45]="Young"
mydata$agecat[age>45 & age<=75]="Middle Aged"
mydata$agecat[age>75]="Old"
detach(mydata)
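An alternative to the three assignments above is cut(), which bins a numeric variable in one call. This is a sketch on made-up ages, not the method used in the slide; note that cut() returns a factor rather than a character vector.

```r
mydata <- data.frame(age = c(30, 45, 46, 75, 76))  # toy data

# With the default right = TRUE the intervals are
# (-Inf, 45], (45, 75], (75, Inf), matching the categories above
mydata$agecat <- cut(mydata$age,
                     breaks = c(-Inf, 45, 75, Inf),
                     labels = c("Young", "Middle Aged", "Old"))

print(mydata)
```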
Renaming variables
• There are many ways to do this
• The simplest is using the fix() function
fix(mydata) # results are saved on close
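fix() needs an interactive session, so it cannot be used inside a script. As a sketch of a scriptable alternative, a column can be renamed through names(); the data and the names v1 and id below are made up for illustration.

```r
mydata <- data.frame(v1 = 1:3, v2 = c("a", "b", "c"))  # toy data

# Rename the column called "v1" to "id", wherever it is positioned
names(mydata)[names(mydata) == "v1"] <- "id"

print(names(mydata))
```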
Merging data sets
• We can merge data sets horizontally using the
merge() function
# assuming id is common to data1 and data2
newdata=merge(data1,data2,by="id")
• Vertical merging can be done using the rbind()
function
# assuming data1 and data2 have the same variables
newdata=rbind(data1,data2)
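The sketch below runs both kinds of merge on small made-up data sets. By default merge() keeps only the ids present in both inputs; all = TRUE keeps every id and fills the gaps with NA. The object names (merged, merged_all, more_rows, stacked) are illustrative.

```r
data1 <- data.frame(id = c(1, 2, 3), height = c(150, 160, 170))
data2 <- data.frame(id = c(2, 3, 4), weight = c(55, 65, 75))

merged     <- merge(data1, data2, by = "id")              # ids 2 and 3 only
merged_all <- merge(data1, data2, by = "id", all = TRUE)  # ids 1 to 4, with NA

# Vertical merge: the second data set must have the same variables
more_rows <- data.frame(id = c(5, 6), height = c(155, 165))
stacked   <- rbind(data1, more_rows)

print(merged_all)
print(stacked)
```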
Selecting variables
• The following command can be used to select
variables
newdata=mydata[c("v1","v3","v15")] # selects variables v1, v3, and v15 in mydata
Or,
newdata=mydata[c(5:10)] # selects the 5th through the 10th variables in mydata
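A small sketch of variable selection on made-up data follows. It also shows a detail that often surprises newcomers: single brackets return a data frame, while double brackets return the column itself as a vector.

```r
mydata <- data.frame(v1 = 1:3, v2 = 4:6, v3 = 7:9, v4 = 10:12)  # toy data

by_name     <- mydata[c("v1", "v3")]  # select by variable name
by_position <- mydata[2:3]            # select the 2nd and 3rd variables

one_col_df  <- mydata["v2"]           # still a data frame with one column
one_col_vec <- mydata[["v2"]]         # the column itself, as a vector
```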
Excluding/removing variables
• The following command can be used to exclude
variables in the analysis
newdata=mydata[c(-1, -3)] # removes the 1st and 3rd variables in mydata
Or,
mydata$v1=mydata$v3=NULL # deletes the variables v1 and v3 in mydata
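Negative subscripts work only with positions, not names. As a sketch, variables can be excluded by name with %in%, which leaves the original data frame untouched; the names drop and kept are illustrative.

```r
mydata <- data.frame(v1 = 1:3, v2 = 4:6, v3 = 7:9)  # toy data

drop <- c("v1", "v3")                       # variables to exclude
kept <- mydata[!(names(mydata) %in% drop)]  # keep every other column

print(names(kept))
print(names(mydata))                        # the original is unchanged
```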
Selecting observations
• Use the following commands to select observations
newdata=mydata[1:5,] # selects the first 5 observations in mydata
attach(mydata)
# selects males aged 65 and over
newdata=mydata[which(gender=="male" & age>=65),]
detach(mydata)
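The same selection can be written without attach(). The sketch below, on made-up data, shows explicit $ references and the subset() function, which evaluates the condition inside the data frame.

```r
mydata <- data.frame(gender = c("male", "female", "male", "male"),
                     age    = c(70, 68, 40, 65))  # toy data

# Explicit column references
older_males <- mydata[mydata$gender == "male" & mydata$age >= 65, ]

# subset() lets the condition use the column names directly
same_rows <- subset(mydata, gender == "male" & age >= 65)

print(older_males)
```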