R Training Deck - v1

Introduction to R
Fidelity Internal Information
Agenda
Introduction
Getting Started
Packages
Types of data
Data structure
Import/Export
Working with datasets
Functions
Dates
Crosstabs
Advanced Capabilities of R
Text Mining
Connecting to DB
Appendix
Fidelity Internal
Introduction
Developed by Robert Gentleman and Ross Ihaka
Why R?
R is free, open-source software
Comprehensive statistical and graphical programming language
R can connect to different databases, e.g. Oracle, Greenplum, Hadoop
R scripts can also be executed from SAS
Complex statistical models can be created without additional cost
R offers plenty of options for loading external data, e.g. Excel, SAS, SPSS files
R can be integrated with Tableau to enhance Tableau graphs
R is
Case sensitive
Works in RAM (data is held in memory)
Works object-wise
There are different front ends for R, e.g. RStudio and R Commander
RStudio is more popular among coders
R Commander can only be used for analysis purposes
Official website of R
http://cran.r-project.org/
RStudio - http://www.rstudio.com/ide/download/
R Blog - https://ribbit.fmr.com/groups/rcommunity
Getting Started with R Studio
R Studio Session
Script (Editor): to write and save code
Code is executed using Ctrl+Enter
Workspace: displays all the datasets loaded
Console (log/output): executes the code and shows the log
> R prompt
+ Continuation prompt
<- Assignment operator
Displays: plots, help files, packages
Variable assignment
Functions (function())
Comments (#)
Extension packages - additional functionality
Getting help (?function / help(function))
Working directory - setwd()
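A minimal sketch of the basics listed on this slide, assuming a fresh R session. The object names x, X and wd are illustrative only.

```r
# Assign a value with the <- operator
x <- 5

# R is case sensitive: x and X are two different objects
X <- 10
print(x + X)

# Help pages can be opened in two equivalent ways (left commented out
# so the example runs without opening a viewer)
# ?mean
# help(mean)

# getwd() shows the working directory; setwd("some/path") would change it
wd <- getwd()
print(wd)
```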
Packages
Packages are collections of R functions, data, and compiled code in a well-defined format
The directory where packages are stored is called the library
R comes with a standard set of packages

Others are available for download and installation
Once installed, they have to be loaded into the session to be used
install.packages() is used to download packages
library() function loads the package
When a package is being installed, R usually asks for a CRAN mirror selection
Any mirror can be selected from the list of countries displayed (preferably India)
Library name: Functions
Hmisc: wtd.mean(), wtd.quantile(), describe(), etc.; weighted mean, sd, quantiles etc.
sas7bdat: read.sas7bdat(); to import SAS files
foreign: read.spss(); to import SPSS files etc. (Excel files need a separate package such as XLConnect)
grDevices: rgb(), windowsFont(); specify the colour and font type in the chart
reshape: rename() renames variables; melt(), cast() reshape data
gtools: smartbind(); appends datasets with unequal variables
plyr: ddply()
Rcmdr: opens the window for R Commander
Getting Started with R Commander
R Commander is a GUI which can be accessed using the Rcmdr library
It is largely menu driven and very useful for performing statistical analysis
R code can also be written and executed here
Types of Variables
As R works element/object-wise, data can be stored in multiple ways. For illustration purposes, we use a variable which has a single value
Numeric
It is the default computational data type
x=5 ; class(x)
Dates in R are stored as the number of days since 1st Jan 1970
Integer
x=as.integer(5.1); class(x)
Logical
A logical value is often created via comparison between variables
x=5
x>5
[1] FALSE
Character
A character object is used to represent string values in R
y = c("Ram","Sam","Jodu","Modu")
x= as.character(5)
x = c("5")
Factors
Stores nominal values as integer codes with a set of character labels (levels)
> x = as.factor(c(2,9,7))
> as.numeric(x)
[1] 1 3 2
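The factor output above shows the internal level codes rather than the original numbers. A small sketch of how to inspect the levels and recover the values; the names codes and orig are illustrative.

```r
x <- as.factor(c(2, 9, 7))

# The levels are the sorted unique values, stored as character labels
print(levels(x))

# as.numeric() on a factor returns the level codes, not the values
codes <- as.numeric(x)
print(codes)

# Converting to character first recovers the original numbers
orig <- as.numeric(as.character(x))
print(orig)

# class() reports the type of the other examples on this slide
print(class(5))
print(class(as.integer(5.1)))
print(class(5 > 5))
```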
Data Structure in R
Vector
A set of elements of the same data type (logical, numeric, character, etc.)
x = c(2,4,7,9) #Numeric Vector
y = c("Ram","Sam","Jodu","Modu") #Character Vector
Matrix
A matrix is a collection of data elements arranged in a two-dimensional rectangular layout
The following is an example of a matrix with 3 rows and 5 columns.
Data frame
A dataset/ Combination of rows and columns (tabular)
List
A generalization of a vector and represents a collection of data objects
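The four structures above can be sketched side by side. This is only an illustration: the object names and the matrix contents (1 to 15) are assumptions, since the original matrix figure is not available.

```r
# Vector: elements of a single type
v <- c(2, 4, 7, 9)

# Matrix: 3 rows and 5 columns, filled column by column
m <- matrix(1:15, nrow = 3, ncol = 5)
print(m)

# Data frame: tabular data, columns may have different types
df <- data.frame(id = 1:4, name = c("Ram", "Sam", "Jodu", "Modu"))
str(df)

# List: components may differ in type and length
lst <- list(numbers = v, table = df, flag = TRUE)
print(names(lst))
```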
Vectors
c function (concatenate)
Var1 <- c(1,2,6)
Var2 <- c("apple", "mango", "grape")
rep (repeat) and seq (sequence) functions

seq(from = ,to = ,by = )
seq(1,10,1)
[1] 1 2 3 4 5 6 7 8 9 10
rep(x = ,times =)
rep(c(1,4),3)
[1] 1 4 1 4 1 4
value <- c(1,3,4,rep(3,4),seq(from=1,to=6,by=2))

> value
[1] 1 3 4 3 3 3 3 1 3 5
scan() function is used to enter data at the terminal
scan()
vector() creates an empty vector of a given type and length
vector(mode = "numeric", length = 10)
Subscripting/Index in R
Accessing elements is achieved through a process called indexing. Indexing may be done with:
A vector of positive integers: to indicate inclusion
A vector of negative integers: to indicate exclusion
A vector of logical values: to indicate which are in and which are out
A vector of names: if the object has a names attribute
> z <- c(8,3,0,9,9,2,1)
The first element of z;
z[1]
The first, third and fourth element; z[c(1,3,4)]
The elements of z in reverse;
z[7:1]
All elements except the first and third;
z[-c(1,3)]
To calculate the length of the vector; length(z)
Those elements of z less than 4;
z[z < 4]
> z < 4 ; [1] FALSE TRUE TRUE FALSE FALSE TRUE TRUE
mode(), class(), typeof() can also be used to understand the data type
as.numeric(), as.Date(), as.factor(), as.character() etc. convert data from one type to another
rm() is used to delete datasets in R
ls() is used to get the list of datasets in R
rm(list=ls()) can be used to delete all datasets
rm(list=ls(pattern="data")) deletes the datasets whose names contain "data"
sink("filename.txt") will create an output file. sink() can be used to stop exporting output to the file
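The indexing rules on this slide include indexing by name, which has no example of its own. A short sketch, assuming the element names a to g (they are not in the original):

```r
z <- c(8, 3, 0, 9, 9, 2, 1)

# Give each element a name, then index by name
names(z) <- c("a", "b", "c", "d", "e", "f", "g")
print(z[c("a", "d")])

# which() returns the positions where a logical condition holds
pos <- which(z < 4)
print(pos)

# Negative index: everything except the first and third elements
rest <- z[-c(1, 3)]
print(rest)
```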
Character Function
Function: Description
substr(x, start=n1, stop=n2): Extract substrings in a character vector.
x = "abcdef"
substr(x, 2, 4) is "bcd"
grep(pattern, x): Search for pattern in x and return the matching positions.
grep("A", c("b","A","c")) returns 2
sub(pattern, replacement, x), gsub(pattern, replacement, x): Find pattern in x and replace with replacement text.
sub replaces the first occurrence of a pattern, gsub replaces all occurrences.
sub(" ",".","Say Hello to World") returns "Say.Hello to World"
gsub(" ",".","Say Hello to World") returns "Say.Hello.to.World"
strsplit(x, split): Split the elements of character vector x at split.
strsplit("abc", "") returns a 3 element vector "a","b","c" (inside a list); strsplit("ab,c", ",") returns "ab","c"
paste(..., sep=""): Concatenate strings, using the sep string to separate them.
paste("x",1:3,sep="") returns c("x1","x2","x3")
paste("x",1:3,sep="M") returns c("xM1","xM2","xM3")
paste("Today is", date())
toupper(x) / tolower(x): Uppercase/Lowercase
nchar(x): Returns the length of the string
sprintf(fmt, ...): Inserts values into a format string (similar to using macro variables created with %let in SAS)
name = c("Sam","Vicky") ; amt = c(100,150.5)
sprintf("%s has %.2f dollars", name, amt) # %i would fail here because 150.5 is not a whole number
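The character functions above can be tried together in one short script. This is a sketch using the same inputs as the table; the names parts and msg are illustrative.

```r
x <- "abcdef"
print(substr(x, 2, 4))
print(grep("A", c("b", "A", "c")))
print(sub(" ", ".", "Say Hello to World"))   # first space only
print(gsub(" ", ".", "Say Hello to World"))  # every space

# strsplit() returns a list; [[1]] extracts the character vector
parts <- strsplit("ab,c", ",")[[1]]
print(parts)

print(paste("x", 1:3, sep = ""))
print(toupper(x))
print(nchar(x))

# %s formats a string, %.2f a number with two decimals
name <- c("Sam", "Vicky")
amt <- c(100, 150.5)
msg <- sprintf("%s has %.2f dollars", name, amt)
print(msg)
```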
Matrix in R
matrix() creates a matrix
dim() displays the dimensions of the matrix
length() returns the total number of elements
rbind() and cbind() can also be used to create a matrix
rbind() appends rows
cbind() appends columns
mat = cbind(1:6, seq(0.1, 0.6, by = 0.1), rep(1, 6))
mat[4, ] # index is [row no., column no.]; returns the 4th row
matrix(1:16, ncol=4)
matrix(1:16, nrow=4)
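A brief sketch of building and indexing a matrix with the functions on this slide. The byrow argument is an addition for illustration; by default R fills a matrix column by column.

```r
# Three columns bound side by side: 1..6, 0.1..0.6 and a constant 1
mat <- cbind(1:6, seq(0.1, 0.6, by = 0.1), rep(1, 6))
print(dim(mat))      # rows, columns
print(length(mat))   # total number of elements

print(mat[4, ])      # 4th row
print(mat[, 2])      # 2nd column
print(mat[4, 2])     # single cell

# Default fill is by column; byrow = TRUE fills row by row
m_col <- matrix(1:16, ncol = 4)
m_row <- matrix(1:16, ncol = 4, byrow = TRUE)
print(m_col[1, ])
print(m_row[1, ])
```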
DataFrame in R
The function data.frame converts a matrix or collection of vectors into a data frame (Data set)
data.frame can join two columns to create a data set

data1 = data.frame(seq(7,12,1), rep(c('a','b'),3))
> names(data1) = c('row no.','factor') # adds names to the columns
> row.names(data1) = 1:6 # adds names to the rows
Subscripting
data1[ ,2] returns the 2nd column of the dataset
data1[3, ] returns the 3rd row of the dataset
data1[c(1,3),] returns the 1st and the 3rd rows of the dataset
data1$factor will return the 2nd column of the dataset
Parameters of the dataset

nrow(data1) returns the number of rows in the dataset
ncol(data1) returns the number of columns in the dataset
head(data1) returns the first few rows of the dataset
colnames(data1) returns the column names of the dataset
summary(data1) returns the minimum, maximum, median
attach() and detach() can be used to avoid referring to variables through the dataset name (data1$factor), but should be used only if a single dataset is being used
d1 = rename(mtcars, c(wt = "weight", cyl = "cylinders")) #package reshape
d2 = subset(mtcars, select=c(wt,cyl)) #Keep
d2 = subset(mtcars, select=-c(wt,cyl)) #Drop
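The data frame operations on this slide can be sketched with base R only. The column name "value" and the objects keep and dropped are illustrative; rename() from the reshape package is left out so the example has no dependencies.

```r
data1 <- data.frame(seq(7, 12, 1), rep(c("a", "b"), 3))
names(data1) <- c("value", "factor")

print(nrow(data1))
print(ncol(data1))
print(data1[3, ])        # 3rd row
print(data1[c(1, 3), ])  # 1st and 3rd rows
print(data1$factor)      # 2nd column, by name

# Keep or drop columns with subset()
keep <- subset(mtcars, select = c(wt, cyl))
dropped <- subset(mtcars, select = -c(wt, cyl))
print(colnames(keep))
print(ncol(dropped))
```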
List in R
Like data frames, a list can hold a mixture of modes, and each component can be of a different length or size
> L1 <- list(x = sample(1:5, 20, rep=T),y = rep(letters[1:5], 4), z =rpois(20, 1))
> L1
$x
[1] 2 1 1 4 5 3 4 5 5 3 3 3 4 3 2 3 3 2 3 1
$y
[1] "a" "b" "c" "d" "e" "a" "b" "c" "d" "e" "a" "b" "c" "d" "e" "a" "b" "c" "d" "e"
$z
[1] 1 3 0 0 3 1 3 1 0 1 2 2 0 3 1 1 0 1 2 0
There are a number of ways of accessing the first component of a list
> L1[["x"]]
> L1$x
> L1[[1]]
[1] 2 1 1 4 5 3 4 5 5 3 3 3 4 3 2 3 3 2 3 1
> L1[1] # single brackets return a list containing the component
$x
[1] 2 1 1 4 5 3 4 5 5 3 3 3 4 3 2 3 3 2 3 1
Assigning names and adding a new component
> names(L1) <- c("Item1","Item2","Item3") #Assigning names
> L1$Item4 <- c("apple","orange","melon","grapes") #adding new component
> L1[["Item4"]] <- c("apple","orange","melon","grapes") #alternate way
> L1[[4]] <- c("apple","orange","melon","grapes") #alternate way
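A short sketch of the difference between double and single brackets when accessing list components. The seed value is arbitrary and only makes the random draws repeatable; the names a, b, c1 and sub are illustrative.

```r
set.seed(1)  # arbitrary seed, so the example gives the same list each run
L1 <- list(x = sample(1:5, 20, replace = TRUE),
           y = rep(letters[1:5], 4),
           z = rpois(20, 1))

# Three equivalent ways to get the first component as a vector
a <- L1[["x"]]
b <- L1$x
c1 <- L1[[1]]

# Single brackets keep the list wrapper
sub <- L1[1]
print(class(a))
print(class(sub))

# Rename the components and add a fourth one
names(L1) <- c("Item1", "Item2", "Item3")
L1$Item4 <- c("apple", "orange", "melon", "grapes")
print(names(L1))
```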
Exercise 1
Create a vector A with values 1, 3, 5, 21, 1, 1, 1, 1, 1 and extract the 7th element of A
Use seq and rep function
Create a variable with your name, separate the first name from the last name
Use strsplit function
Display the Mean, Median, Q1, Q3, Min and Max for each variable in dataset mtcars
mtcars is a dataset available in R for practice
summary() / describe() function can be used
The output of the above can be exported to a txt file
Use sink function
Import/Export in R
R can import various types of files, from delimited files to Excel, SAS, SPSS etc.
read.csv, read.delim, read.table etc. can be used
stringsAsFactors should be kept as FALSE (F) to avoid automatic conversion of strings to factors
data <- read.csv("file1.csv", sep = ",", header = T, stringsAsFactors = F)
data <- read.delim("test.txt", header = T, sep = "\t")
write.csv, write.table etc. can be used to export files
write.csv(data, "test.csv")
write.table(data, "test.txt")
read.spss() in the foreign library can be used to import SPSS files
read.sas7bdat() in the sas7bdat library can be used to import SAS files
Special characters should be avoided in CSV files, e.g. spaces in column names are replaced by .
All numeric variables should be kept in plain number format, not accounting, %, currency etc. format
# For correcting dates for SAS imported data
dat = read.sas7bdat("data.sas7bdat")
dat$date_new = as.Date(dat$date_var, origin="1960-01-01")
Importing Excel files
library(XLConnect)
workbook = loadWorkbook("R_Training.xlsx")
df = readWorksheet(workbook, sheet="Sheet1", header=TRUE)
getwd() # shows the working directory in which the file is looked up
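A minimal round trip with write.csv() and read.csv(), written to a temporary file so that nothing is left behind. The data frame and its column names are made up for illustration.

```r
# Small illustrative data frame
df_out <- data.frame(customer_id = 1:3,
                     name = c("Ram", "Sam", "Jodu"),
                     stringsAsFactors = FALSE)

# Export without the row-name column, then import it again
path <- tempfile(fileext = ".csv")
write.csv(df_out, path, row.names = FALSE)
df_in <- read.csv(path, header = TRUE, stringsAsFactors = FALSE)

str(df_in)

# Remove the temporary file
unlink(path)
```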
Working with datasets
subset() function returns subsets of vectors, matrices or data frames which meet conditions (acts like filters in Excel; where in SAS)
subset(airquality, Temp > 80, select = c(Ozone, Temp)) #select retains the columns to be kept
airquality is a default dataset available in R
which() can be used alternatively
order() is used to sort any dataset
It allows sorting with tie-breaking based on the 2nd, 3rd and other arguments
> x <- sample(1:5, 20, rep=T)
> y <- sample(1:5, 20, rep=T)
> z <- sample(1:5, 20, rep=T)
> xyz <- rbind(x, y, z)
> o <- order(x, y, z) ; xyz1 <- xyz[, o]
> data_order = data[order(data$customer_id),]
a <- c(rep("A", 3), rep("B", 3), rep("C",2)); b <- c(1,1,3,4,1,1,2,2); c <- rep(1,4)
df <- data.frame(a, b, c)
duplicated() is used to remove duplicates from data
df[!duplicated(cbind(df$a,df$b)), ] # removes duplicates from the dataset on the basis of a and b
df[duplicated(cbind(df$a,df$b)), ] # returns the duplicate observations by a and b
unique(df) # retains the unique observations
A dataset can be transposed simply by using t()
cast() and melt() in the reshape package can be used for reshaping the data into the required format
d11 = melt(mtcars, id = c("am","gear"))
var = colnames(mtcars) ; d10 = melt(mtcars,id = var[-10:-11])
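Sorting with a tie-breaker and removing duplicates can be sketched on the small a/b data from this slide. The object names sorted and dedup are illustrative.

```r
a <- c(rep("A", 3), rep("B", 3), rep("C", 2))
b <- c(1, 1, 3, 4, 1, 1, 2, 2)
df <- data.frame(a, b)

# Sort by a, breaking ties with b in decreasing order
sorted <- df[order(df$a, -df$b), ]
print(sorted)

# Keep only the first row of every (a, b) combination
dedup <- df[!duplicated(df[, c("a", "b")]), ]
print(dedup)

# The rows flagged as repeats
print(df[duplicated(df[, c("a", "b")]), ])
```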
Working with Datasets Contd.
merge() is used to join tables
merge(df1, df2, by="ID") #inner join
merge(df1, df2, all=TRUE) #outer join
merge(df1, df2, all.x=TRUE) #left outer join
merge(df1, df2, all.y=TRUE) #right outer join
merge(df1, df2, by=c("ID","Country")) #merge two data frames by ID and Country
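The four join types can be compared on two tiny tables. The data frames and their columns (ID, Country, Sales) are invented for this sketch.

```r
df1 <- data.frame(ID = c(1, 2, 3), Country = c("IN", "US", "UK"))
df2 <- data.frame(ID = c(2, 3, 4), Sales = c(200, 300, 400))

inner <- merge(df1, df2, by = "ID")                # IDs in both tables
outer <- merge(df1, df2, by = "ID", all = TRUE)    # IDs in either table
left  <- merge(df1, df2, by = "ID", all.x = TRUE)  # every row of df1
right <- merge(df1, df2, by = "ID", all.y = TRUE)  # every row of df2

print(inner)
print(outer)
print(left)   # Sales is NA where df2 has no matching ID
print(right)  # Country is NA where df1 has no matching ID
```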
If two datasets are in the same order, cbind() (column bind) can be used to attach any new column(s) to the dataset
Similarly rows can be added using rbind() (row bind). In case of row bind the variables should be the same (however the order can be different)
a <- c(rep("A", 3), rep("B", 3), rep("C",2)); b <- c(1,1,3,4,1,1,2,2); c <- rep(1,4)
df1 = data.frame(a,b,c)
df2 = data.frame(b,c,a)
df3 = rbind(df1,df2)
smartbind() in the gtools library can be used to append datasets having different columns
split() splits the dataset into different datasets.
g = split(mtcars,mtcars$am) ; lapply(g,sum)
Exercise 2
Import B2b_1.txt and B2b_2.csv
Join the two tables using customer_id
Create a flag for Hold_Time_Seconds_Qty > 60
Export the top 1000 rows to .csv
Random Sample
sample() draws a random sample from a vector
rnorm(), runif(), rpois() are used to draw samples from the Normal, Uniform and Poisson distributions respectively
Similarly, samples from other distributions can be drawn
# take a random sample of size 50 from a dataset mydata
# sample without replacement
mysample <- mydata[sample(1:nrow(mydata), 50, replace=FALSE),]
It is recommended to use set.seed() while working on the same project to ensure the sample does not change during the analysis
set.seed(5)
rnorm(5)
[1] -0.84085548 1.38435934 -1.25549186 0.07014277 1.71144087
set.seed(5)
rnorm(5)
[1] -0.84085548 1.38435934 -1.25549186 0.07014277 1.71144087
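The same idea applies to sampling rows from a dataset. A sketch using the built-in mtcars data in place of mydata, with a sample size of 10 chosen only for illustration:

```r
# Resetting the seed before each draw gives the same row numbers
set.seed(5)
s1 <- sample(1:nrow(mtcars), 10, replace = FALSE)
set.seed(5)
s2 <- sample(1:nrow(mtcars), 10, replace = FALSE)
print(identical(s1, s2))

# Use the row numbers to take the sample
mysample <- mtcars[s1, ]
print(dim(mysample))
```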
Apply
apply(), lapply(), sapply(), tapply(), aggregate() etc. are functions which apply another function over parts of an object, and are used for efficiency and convenience
apply() allows functions (mean, sum, etc.) to operate on sections of an array
apply(mtcars, 1, sum) #row sum
apply(mtcars, 2, sum) #column sum
lapply() and sapply() functions operate on components of a list or vector
lapply will always return a list
sapply will simplify the result into a vector or array where possible
l = list(mtcars[,10:11])
>lapply(l,table) #returns crosstab in an m x n format
>sapply(l,table) #returns crosstab in an array format
tapply() applies a function to each cell of a ragged array, that is to each (non-empty) group of values given by a unique combination of the levels of certain factors
tapply(X, INDEX, FUN = )
# X is the variable on which the function needs to be applied
# INDEX is the group by variable
# FUN is the roll up function to be applied; e.g. sum, mean, wtd.mean, etc
tapply(mtcars[,1], mtcars$gear, FUN = mean)
ddply() (package plyr)
dfx <- data.frame( group = c(rep('A', 8), rep('B', 15), rep('C', 6)),
sex = sample(c("M", "F"), size = 29, replace = TRUE),
age = runif(n = 29, min = 18, max = 54))
ddply(dfx, .(group, sex), summarize, mean = round(mean(age), 2), sd = round(sd(age), 2))
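A compact comparison of the apply family on mtcars, using base R only (ddply() is left out because it needs the plyr package). The object names are illustrative.

```r
# apply(): margin 1 = rows, margin 2 = columns
row_tot <- apply(mtcars, 1, sum)
col_tot <- apply(mtcars, 2, sum)

# lapply() returns a list, sapply() simplifies to a named vector
means_list <- lapply(mtcars[, 1:3], mean)
means_vec <- sapply(mtcars[, 1:3], mean)
print(class(means_list))
print(means_vec)

# tapply(): mean mpg for each value of gear
mpg_by_gear <- tapply(mtcars$mpg, mtcars$gear, mean)
print(mpg_by_gear)

# aggregate() gives the same roll-up as a data frame
agg <- aggregate(mpg ~ gear, data = mtcars, FUN = mean)
print(agg)
```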
Statements
ifelse statement
Ifelse statements operate on vectors of variable length
ifelse(expression, true_value, false_value)
x <- 1:10 # Creates sample data
ifelse(x<5 | x>8, x, 0)
[1] 1 2 3 4 0 0 0 0 9 10
for loop
for(variable in sequence) { statements}
x <- 1:10
z <- NULL
for(i in x) {
if(x[i] < 5) {
z <- c(z, x[i] - 1)
} else {
z <- c(z, x[i] / x[i])
}}
repeat loop
i <- 2
repeat
{ print(i)
i <- i+1
if(i > 4)
break }
if statement
If statements operate on length-one logical vectors
if(cond1) { cmd1 } else { cmd2 }
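The for loop on this slide and the vectorised ifelse() can produce the same result. A sketch comparing the two; the names z_loop, z_vec and label are illustrative.

```r
x <- 1:10

# Loop version: build the result one element at a time
z_loop <- NULL
for (i in x) {
  if (x[i] < 5) {
    z_loop <- c(z_loop, x[i] - 1)
  } else {
    z_loop <- c(z_loop, x[i] / x[i])
  }
}

# Vectorised version: one call over the whole vector
z_vec <- ifelse(x < 5, x - 1, x / x)

print(z_loop)
print(isTRUE(all.equal(z_loop, z_vec)))

# if() needs a single TRUE/FALSE, so test one element at a time
if (x[1] < 5) {
  label <- "small"
} else {
  label <- "large"
}
print(label)
```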
Dates
as.Date() #default format yyyy-mm-dd
is.Date() (package lubridate) tests whether an object is a date
date() today's date as text
Sys.Date() today's date as a Date (stored as a number)
ISOdate(year,month,day)
1st Jan 1970 is the base date in R
difftime(x, Sys.Date(), tz, units = c("auto", "secs", "mins", "hours","days", "weeks"))
mondate package: MonthsBetween(x, y), YearsBetween(x, y), DaysBetween(x, y)
Symbol: Meaning; Example
%d: day as a number; 01-31
%a: abbreviated weekday; Mon
%A: unabbreviated weekday; Monday
%m: month as a number; 01-12
%b: abbreviated month; Jan
%B: unabbreviated month; January
%y: 2-digit year; 07
%Y: 4-digit year; 2007

date1=as.Date(c("13-Oct-2013","02-Aug-2014"), "%d-%b-%Y")
date1[2]-date1[1]
seq(as.Date("2012/12/31"), length = 2, by = "-6 months")[2]
seq(as.Date("2012/12/31"), length = 2, by = "-1 years")[2]
dates <- c("02/27/92", "02/27/92", "01/14/92", "02/28/92", "02/01/92")
as.Date(dates, "%m/%d/%y")+9
#Adding days
dates = "2011-05-28"
as.Date(dates)+9
Sys.Date()+5
as.Date("2014-02-14")+5
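A short sketch of date arithmetic with base R. The dates are written in the default yyyy-mm-dd format so the result does not depend on the locale; the object names are illustrative.

```r
d1 <- as.Date("2013-10-13")
d2 <- as.Date("2014-08-02")

# Subtracting two dates gives a difference in days
gap <- d2 - d1
print(gap)

# difftime() allows other units
print(difftime(d2, d1, units = "weeks"))

# Adding a number adds days
plus5 <- as.Date("2014-02-14") + 5
print(plus5)

# seq() with a negative step goes back in time
one_year_back <- seq(as.Date("2012-12-31"), length.out = 2, by = "-1 year")[2]
print(one_year_back)

# A Date is stored as the number of days since 1970-01-01
print(as.numeric(as.Date("1970-01-10")))
```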
date = Sys.Date()
library(lubridate)
month(date)
year(date)
day(date)
quarter(date)
# Character output
quarters(date)
months(date)
weekdays(date)
x <- as.POSIXlt('2005-12-16') # a date
dput(x) #structure of the date
x$mday
Date Formats
Code: Meaning
%a: Abbreviated weekday
%A: Full weekday
%b: Abbreviated month
%B: Full month
%c: Locale-specific date and time
%d: Decimal day of the month
%H: Decimal hours (24 hour)
%I: Decimal hours (12 hour)
%j: Decimal day of the year
%m: Decimal month
%M: Decimal minute
%p: Locale-specific AM/PM
%S: Decimal second
%U: Decimal week of the year (starting on Sunday)
%w: Decimal weekday (0=Sunday)
%W: Decimal week of the year (starting on Monday)
%x: Locale-specific date
%X: Locale-specific time
%y: 2-digit year
%Y: 4-digit year
%z: Offset from GMT
%Z: Time zone (character)
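The format codes are used both to print a date with format() and to read text with as.Date(). A sketch restricted to numeric codes, which behave the same in every locale:

```r
d <- as.Date("2007-01-05")

# Date to text
print(format(d, "%d/%m/%y"))
print(format(d, "%Y"))
print(format(d, "%j"))   # day of the year, zero padded

# Text to date: the format string describes the input
parsed <- as.Date("02/27/92", "%m/%d/%y")
print(parsed)
```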
Exercise 3
Import the prdsal2.sas7bdat data
Use read.sas7bdat()
Calculate the number of observations for post 1997

Add 6 Months to the date Month/Year variable
Calculate the number of months from 01-01-2000
Crosstabs
There are multiple functions for creating cross tabs

xtabs
table
count
mytable <- table(A, B, useNA = "always") # A will be rows, B will be columns
margin.table(mytable, 1) # A frequencies (summed over B)
margin.table(mytable, 2) # B frequencies (summed over A)
prop.table(mytable) # cell percentages
prop.table(mytable, 1) # row percentages
prop.table(mytable, 2) # column percentages
table(a, b, useNA = "always") # useNA = "always" displays the frequency for missing values
mytable <- xtabs(weight ~ a + b, data = subset(dataset, cond1))
s1 = summary(mytable) # chi-square test of independence
names(s1)
"n.vars" #displays the number of factors
"n.cases" #Total cases in the table
"statistic" #Chi Sq. statistic
"parameter" #Degrees of freedom of the test
"approx.ok" #Approximation status
"p.value" #p-value for test of independence; if <=0.05 then independence is rejected (the factors are dependent); else there is no evidence against independence
"call" #the call to the xtabs function
# marginal sums and proportions can be calculated using margin.table and prop.table
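A small crosstab sketch on mtcars, using am and gear as the two factors (chosen only for illustration). With cell counts this small R may warn that the chi-square approximation is unreliable, so the p-value is shown only to illustrate the mechanics.

```r
# Rows = am (0/1), columns = gear (3/4/5)
tab <- table(am = mtcars$am, gear = mtcars$gear)
print(tab)

print(margin.table(tab, 1))   # am frequencies (summed over gear)
print(margin.table(tab, 2))   # gear frequencies (summed over am)
print(prop.table(tab, 1))     # row proportions

# xtabs() builds the same table from a formula; summary() adds the test
xt <- xtabs(~ am + gear, data = mtcars)
s1 <- summary(xt)
print(names(s1))
print(s1$p.value)
```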
Graphical Capabilities of R
Types of graphs: Functions
Scatter Plots: plot()
Line Charts: plot(), lines()
Bar Charts / Column Charts / Stacked Charts: barplot()
Histogram: hist()
Pie Charts: pie()
Dot Charts: dotchart()
Heat Maps: heatmap(), heatmap.2(), coefmap()
Box Plots: boxplot()
Polygon: polygon(), sm.density.compare()
Contour Maps
Example:
plot(cars, type="o", col="blue", ylim=c(0,12), xlim=c(0,10), xlab="Speed", sub="Plots")
lines(mtcars$gear, type="o", pch=22, lty=2, col="red") # Graph gear with red dashed line and square points
title(main="Autos", col.main="red", font.main=4) # Create a title with a red, bold/italic font
abline(a=3, b=0) # Add a horizontal reference line at y = 3
legend('topleft', names(cars), lty=1, col=c('red', 'blue'), bty='n', cex=.75)
http://www.statmethods.net/graphs/index.html
http://www.harding.edu/fmccown/r/#linecharts
Text Mining
tm is a useful package available for text mining in R
http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
Sentiment Analysis
http://www.slideshare.net/jeffreybreen/r-by-example-mining-twitter-for
Wordclouds can be generated in R using wordcloud package

install.packages("wordcloud")
install.packages("tm")
library(wordcloud)
library(tm)
wordcloud("Many years ago the great British explorer George Mallory, who was to die on Mount Everest, was asked why did he want to climb it. He said, \"Because it is there.\" Well, space is there, and we're going to climb it, and the moon and the planets are there, and new hopes for knowledge and peace are there. And, therefore, as we set sail we ask God's blessing on the most hazardous and dangerous and greatest adventure on which man has ever embarked.", random.order=F)
Connecting to databases
Prerequisites for connecting to Oracle DB

Oracle 11g
SQL Developer
Username and Password
ODBC Connection
Java
Library rJava, RJDBC
library(rJava)
library(RJDBC)
a = c("C:/Program Files (x86)/sqldeveloper/jdbc/lib/ojdbc6.jar")
jdbcDriver <- JDBC(driverClass = "oracle.jdbc.OracleDriver", classPath = a)
dbcConnection <- dbConnect(jdbcDriver, "jdbc:oracle:thin:@//odal04-can:1521/CWEP1", "username", "pwd")
test <- dbGetQuery(dbcConnection, "select count(*) from table")
Connecting to Greenplum
library(RPostgreSQL)
con <- dbConnect(PostgreSQL(), host="", user="", password="", dbname="")
dbGetQuery(con, "create table test as select count(*) as count_1 from table")
The RODBC package can also be used for connecting to databases
Mathematical Operators
Operator: Description
+: addition
-: subtraction
*: multiplication
/: division
^ or **: exponentiation
x %% y: modulus (x mod y); 5%%2 is 1
x %/% y: integer division; 5%/%2 is 2
<: less than
<=: less than or equal to
>: greater than
>=: greater than or equal to
==: exactly equal to
!=: not equal to
!x: Not x
x|y: x OR y
x&y: x AND y
isTRUE(x): test if x is TRUE
abs(x): absolute value
sqrt(x): square root
ceiling(x): ceiling(3.475) is 4
floor(x): floor(3.475) is 3
trunc(x): trunc(5.99) is 5
round(x, digits=n): round(3.475, digits=2) is 3.48
signif(x, digits=n): signif(3.475, digits=2) is 3.5
cos(x), sin(x), tan(x): also acos(x), cosh(x), acosh(x), etc.
log(x): natural logarithm
log10(x): common logarithm
exp(x): e^x
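A few of the operators above, tried on small values. The vectors p and q are illustrative.

```r
print(2^3)             # exponentiation
print(5 %% 2)          # remainder
print(5 %/% 2)         # integer division

print(ceiling(3.475))
print(floor(3.475))
print(trunc(5.99))
print(signif(3.475, digits = 2))

# Logical operators work element by element on vectors
p <- c(TRUE, FALSE)
q <- c(TRUE, TRUE)
print(p & q)
print(p | q)
print(!p)

# isTRUE() expects a single value
print(isTRUE(5 > 3))

# log() and exp() are inverses (up to floating point precision)
print(all.equal(exp(log(10)), 10))
```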
Statistical Capabilities of R
Capability: Function (Package)
Correlation: cor(), cov()
Testing of Hypothesis: pnorm, rnorm, qnorm, prop.test, t.test
Estimation
Linear Regression: lm(y~x1+x2, data=)
Logistic Regression: glm(y~x1+x2, data= , family=binomial(link="logit"))
ANOVA: anova()
Time Series Forecast: ts(), HoltWinters(), forecast() (package: forecast)
Text Mining: wordcloud(), tm() (packages: wordcloud, tm)
Cluster Analysis: dist(), hclust(), cutree(), kmeans(), pam() (packages: pvclust, fpc)
Factor Analysis: principal(), factanal() (package: psych)
Correspondence Analysis: ca() (package: ca)
CART: rpart() (package: rpart)
Random Forest: randomForest() (package: randomForest)
Sequential Pattern: (package: arules)
Generalized Additive Models: gam() (package: gam)
Discriminant Analysis: lda(y ~ x1 + x2 + x3, data= ) (package: MASS)
Neural Network: nnet(), neuralnet() (packages: nnet, neuralnet)
Design of Experiment
For more details

MASS
SAS and R
Proc Contents - sapply(data, class), str()
Proc Means - summary, mean, sd, quantile etc.
Proc Freq - xtabs, table, count
Proc SQL - sqldf
Proc Sort - order
Proc Delete - rm
Proc Reg - lm
Proc Logistic - glm
Proc Append - smartbind, rbind, cbind
Proc Print - head, tail
Proc Import - read.table
Proc Export - write.csv
Proc Transpose - t
Merge/Join - merge
Proc Univariate - summary, describe
Proc Datasets - ls
Proc Std - scale
Linear Regression
Below are the functions for building a regression equation in R
Expression: Description
lm(y~x1+x2, data=): defines a regression model object
coef(obj): regression beta coefficients
residuals(obj): residuals ei
fitted(obj): fitted values
summary(obj): analysis summary
anova(obj): ANOVA table
influence(obj): regression diagnostics
predict(obj, newdata=ndat): predictions for a new dataset
plot(obj): displays diagnostic plots
confint(obj, level=0.95): 95% confidence interval for beta coefficients
deviance(obj): residual sum of squares
vif(obj) (package: car): VIF of the independent variables
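A minimal sketch of these functions on mtcars, regressing mpg on wt and hp. The choice of variables and the new_cars data frame are assumptions made for illustration; vif() is omitted because it needs the car package.

```r
# Fit the model object
fit <- lm(mpg ~ wt + hp, data = mtcars)

print(coef(fit))                    # intercept and two slopes
print(confint(fit, level = 0.95))   # confidence intervals
print(deviance(fit))                # residual sum of squares
print(anova(fit))                   # ANOVA table

# Predict for new observations (hypothetical values)
new_cars <- data.frame(wt = c(2.5, 3.5), hp = c(100, 150))
pred <- predict(fit, newdata = new_cars)
print(pred)
```

summary(fit) prints the coefficient table with standard errors and R-squared, and plot(fit) steps through the diagnostic plots.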
Advanced R (Statistics Module)

Linear Regression
Logistic Regression
Cluster Analysis
CART
Random Forest
Neural Network
Sentiment Analysis
Support Vector Machine
Splines
Advanced R (Graphics Module)

Connecting R to tableau
Using R capabilities in tableau
Formatting graphs
Heat Maps
Identify/Locator
Slider Functions
Sentiment Analysis
Brief on Linear and Logistic Regression