Introduction to R

Fidelity Internal Information

Agenda
Introduction
Getting Started
Packages
Types of data
Data structure
Import/Export
Working with datasets
Functions
Dates
Crosstabs
Advanced Capabilities of R
Text Mining
Connecting to DB
Appendix

Introduction

Developed by Robert Gentleman and Ross Ihaka

Why R?
R is free, open-source software
A comprehensive statistical and graphical programming language
R can connect to different databases: Oracle, Greenplum, Hadoop, etc.
R scripts can also be executed from SAS
Complex statistical models can be built without additional cost
R offers plenty of options for loading external data, e.g. Excel, SAS, SPSS files
R can be integrated with Tableau to enhance Tableau graphs

R is
Case sensitive
Memory based: it works on data held in RAM
Object oriented: functions operate on whole objects

There are different front ends for R, e.g. RStudio and R Commander

RStudio is more popular among coders
R Commander can only be used for analysis purposes

Official website of R
http://cran.r-project.org/
RStudio - http://www.rstudio.com/ide/download/

R Blog - https://ribbit.fmr.com/groups/rcommunity

Getting Started with RStudio

RStudio Session

Script (Editor): to write and save code
Code is executed using Ctrl+Enter

Workspace: displays all the datasets loaded

Console (log/output): executes the code and shows the log
> R prompt
+ Continuation prompt
<- Assignment operator

Display pane: shows plots, help files and packages

Variable assignment
Functions (function())
Comments (#)
Extension packages - additional functionality
Getting help (?/help(function))
Working directory - setwd()
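
The basics listed above can be tried directly in the console. The following is a minimal sketch; the variable names are illustrative and not part of the original material.

```r
# Assign values with <- (the assignment operator)
x <- 5
y <- c(1, 2, 3)

# Call a function on an object
total <- sum(y)

# Check the current working directory (setwd("some/path") would change it)
current_dir <- getwd()

# Help for a function can be opened with ?sum or help(sum)
print(total)
```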

Packages

Packages are collections of R functions, data, and compiled code in a well-defined format

The directory where packages are stored is called the library

R comes with a standard set of packages


Others are available for download and installation
Once installed, they have to be loaded into the session to be used
install.packages() is used to download packages
library() function loads the package

When a package is being installed, R usually asks for a CRAN mirror selection
Any mirror can be selected from the list of countries displayed (preferably India)
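
The install-then-load pattern can be wrapped so that a package is only downloaded when it is missing. This is a sketch under assumptions: load_or_install is a hypothetical helper name, and "stats" is used only because it ships with R, so nothing is downloaded when the example runs.

```r
# Hypothetical helper: install a package only if it is missing, then load it
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # downloads from the selected CRAN mirror
  }
  library(pkg, character.only = TRUE)
}

# "stats" is part of every R installation, so no download happens here
load_or_install("stats")

# Directories (libraries) in which R looks for installed packages
.libPaths()
```

For a package that is not yet installed, the same call would trigger install.packages() once and simply load the package in later sessions.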

Library names and functions

Hmisc      wtd.mean(), wtd.quantile(), describe(), etc.; weighted mean, sd, quantiles, etc.
sas7bdat   read.sas7bdat(); to import SAS files
foreign    read.spss(); to import SPSS files (the package also reads Stata, dBase and similar formats)
grDevices  rgb(), windowsFont(); specify the color and font type in a chart
reshape    rename() renames variables; melt(), cast() reshape data
gtools     smartbind(); appends datasets with unequal variables
plyr       ddply()
Rcmdr      To open the window for R Commander


Getting Started with R Commander

R Commander is a GUI which can be accessed using the Rcmdr library

It is largely menu driven and very useful for performing statistical analysis

R code can also be written and executed here


Types of Variables
As R works element/object wise, data can be stored in multiple ways. For illustration, each example below uses a
variable that holds a single value

Numeric
It is the default computational data type
x=5 ; class(x)

Dates in R are stored as the number of days since 1st Jan 1970

Integer
x=as.integer(5.1); class(x)

Logical
A logical value is often created via comparison between variables
x=5
x>5
[1] FALSE

Character
A character object is used to represent string values in R
y = c("Ram","Sam","Jodu","Modu")
x= as.character(5)
x = c("5")

Factors
Stores nominal values as a vector of integer codes, with the original values kept as character labels (levels)
> x = as.factor(c(2,9,7))
> as.numeric(x)
[1] 1 3 2
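
The output above shows the internal codes of the factor, not the original numbers. A short sketch of how the original values can be recovered by converting through character first (variable names are illustrative):

```r
x <- as.factor(c(2, 9, 7))

# Levels are the sorted unique values, stored as character labels
lev <- levels(x)                       # "2" "7" "9"

# as.numeric() on a factor returns the position of each value among the levels
codes <- as.numeric(x)                 # 1 3 2

# Converting to character first recovers the original numbers
values <- as.numeric(as.character(x))  # 2 9 7
```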


Data Structure in R

Vector
A set of elements of the same data type (logical, numeric, character, etc.)

x = c(2,4,7,9)    #Numeric Vector

y = c("Ram","Sam","Jodu","Modu")    #Character Vector

Matrix
A matrix is a collection of data elements of the same type arranged in a two-dimensional rectangular layout,
for example 3 rows and 5 columns

Data frame
A dataset: a tabular combination of rows and columns, where columns may be of different types

List
A generalization of a vector that represents a collection of data objects, possibly of different types and lengths
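
The four structures can be compared side by side. This is a minimal sketch with made-up values; the object names m, df and lst are illustrative only.

```r
# Matrix: one data type, here 3 rows and 5 columns filled column by column
m <- matrix(1:15, nrow = 3, ncol = 5)

# Data frame: columns of equal length that may differ in type
df <- data.frame(name = c("Ram", "Sam"),
                 score = c(82.5, 91),
                 stringsAsFactors = FALSE)

# List: components may differ in both type and length
lst <- list(numbers = c(2, 4, 7, 9), names = c("Ram", "Sam"), table = df)

dim(m)        # rows and columns of the matrix
str(df)       # column types of the data frame
length(lst)   # number of components in the list
```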


Vectors

c function (concatenate)
Var1 <- c(1,2,6)
Var2 <- c("apple", "mango", "grape")

rep (repeat) and seq (sequence) functions

seq(from = ,to = ,by = )
> seq(1,10,1)
[1] 1 2 3 4 5 6 7 8 9 10

rep(x = ,times = )
> rep(c(1,4),3)
[1] 1 4 1 4 1 4

> value <- c(1,3,4,rep(3,4),seq(from=1,to=6,by=2))
> value
[1] 1 3 4 3 3 3 3 1 3 5

scan() is used to enter data at the terminal
scan()

vector() creates an empty vector of a given mode and length
vector(mode = "numeric", length = 10)


Subscripting/Index in R

Accessing elements is achieved through a process called indexing. Indexing may be done with:
A vector of positive integers: to indicate inclusion
A vector of negative integers: to indicate exclusion
A vector of logical values: to indicate which are in and which are out
A vector of names: if the object has a names attribute
> z <- c(8,3,0,9,9,2,1)
The first element of z: z[1]
The first, third and fourth elements: z[c(1,3,4)]
The elements of z in reverse: z[7:1]
All elements except the first and third: z[-c(1,3)]
The length of the vector: length(z)
Those elements of z less than 4: z[z < 4]
> z < 4
[1] FALSE TRUE TRUE FALSE FALSE TRUE TRUE
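
Indexing by name is listed above but not demonstrated. A small sketch follows, assuming the same vector z with names attached purely for illustration:

```r
z <- c(8, 3, 0, 9, 9, 2, 1)

# Attach a names attribute so elements can be picked by name
names(z) <- c("a", "b", "c", "d", "e", "f", "g")
picked <- z[c("a", "d")]

# which() converts a logical condition into positions
pos <- which(z < 4)
```

Here picked holds the elements named "a" and "d", and pos holds the positions of the elements smaller than 4.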

mode(), class() and typeof() can be used to check the data type of an object

as.numeric(), as.Date(), as.factor(), as.character() etc. convert data from one type to another

rm() is used to delete objects (datasets) in R

ls() lists the objects (datasets) in the workspace

rm(list=ls()) deletes all objects
rm(list=ls(pattern="data")) deletes the objects whose names contain "data"

sink("filename.txt") redirects output to a file. sink() with no arguments stops the redirection
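
A short sketch of sink() and pattern-based rm() in action. The temporary file and the object names (tmp_a, tmp_b, keep_me) are assumptions made for the illustration.

```r
# Redirect console output to a temporary file, then restore the console
out_file <- tempfile(fileext = ".txt")
sink(out_file)
print(summary(c(1, 2, 3, 4)))
sink()
captured <- readLines(out_file)   # the summary now lives in the file

# Delete only the objects whose names start with "tmp_"
tmp_a <- 1
tmp_b <- 2
keep_me <- 3
rm(list = ls(pattern = "^tmp_"))
```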


Character Functions

Function
    Description

substr(x, start=n1, stop=n2)
    Extract substrings in a character vector.
    x = "abcdef"
    substr(x, 2, 4) is "bcd"

grep(pattern, x)
    Search for pattern in x and return the positions of the matching elements.
    grep("A", c("b","A","c")) returns 2

sub(pattern, replacement, x)
gsub(pattern, replacement, x)
    Find pattern in x and replace with replacement text.
    sub replaces the first occurrence of a pattern, gsub replaces all occurrences.
    sub(" ",".","Say Hello to World")
    gsub(" ",".","Say Hello to World")

strsplit(x, split)
    Split the elements of character vector x at split.
    strsplit("abc", "") returns a list holding the 3-element vector "a","b","c"
    strsplit("ab,c", ",") returns a list holding "ab","c"

paste(..., sep="")
    Concatenate strings, using the sep string to separate them.
    paste("x",1:3,sep="") returns c("x1","x2","x3")
    paste("x",1:3,sep="M") returns c("xM1","xM2","xM3")
    paste("Today is", date())

toupper(x) / tolower(x)
    Uppercase / Lowercase

nchar(x)
    Returns the length of the string

sprintf()
    Substitutes values into a template string (similar in spirit to resolving %let macro variables in SAS)
    name = c("Sam","Vicky") ; amt = c(100,150.5)
    sprintf("%s has %.1f dollars", name, amt)    # %i fails here because 150.5 is not a whole number


Matrix in R

matrix() creates a matrix

dim() displays the dimensions of the matrix

length() returns the total number of elements

rbind() and cbind() can also be used to create a matrix

rbind() appends rows
cbind() appends columns

mat = cbind(1:6, seq(0.1, 0.6, by = 0.1), rep(1,6))

mat[4,]    # index as [Row No., Column No.]; this returns the 4th row

matrix(1:16, ncol=4)

matrix(1:16, nrow=4)
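
matrix() fills column by column unless told otherwise, which is easy to overlook. A minimal sketch of the difference, using the byrow argument of matrix() (object names are illustrative):

```r
# Default: values fill the matrix column by column
by_col <- matrix(1:6, nrow = 2)

# byrow = TRUE fills row by row instead
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)

dim(by_col)     # 2 rows, 3 columns
by_col[1, 2]    # second value of the first row under column-wise filling
by_row[1, 2]    # second value of the first row under row-wise filling
t(by_col)       # transpose: 3 rows, 2 columns
```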


DataFrame in R

The function data.frame() converts a matrix or a collection of vectors into a data frame (dataset)

data.frame() can join two columns to create a dataset

> data1 = data.frame(seq(7,12,1),rep(c('a','b'),3))
> names(data1) = c('row no.','factor') # adds names to the columns
> row.names(data1) = seq(1:6) #adds names to the rows

Subscripting
data1[ ,2] returns the 2nd column of the dataset
data1[3, ] returns the 3rd row of the dataset
data1[c(1,3),] returns the 1st and the 3rd rows of the dataset
data1$factor returns the 2nd column of the dataset

Parameters of the dataset

nrow(data1) returns the number of rows in the dataset
ncol(data1) returns the number of columns in the dataset
head(data1) returns the first few rows of the dataset
colnames(data1) returns the column names of the dataset
summary(data1) returns the minimum, maximum, median, mean and quartiles of each numeric column

attach() and detach() can be used to avoid referring to variables through the dataset name (data1$factor), but
should be used only if a single dataset is in use
d1 = rename(mtcars, c(wt = "weight", cyl = "cylinders"))    #package reshape
d2 = subset(mtcars, select=c(wt,cyl))    #Keep
d2 = subset(mtcars, select=-c(wt,cyl))    #Drop


List in R

Like data frames, a list can hold a mixture of modes, and each component can be
of a different length or size
> L1 <- list(x = sample(1:5, 20, rep=T),y = rep(letters[1:5], 4), z =rpois(20, 1))
> L1
$x
[1] 2 1 1 4 5 3 4 5 5 3 3 3 4 3 2 3 3 2 3 1
$y
[1] "a" "b" "c" "d" "e" "a" "b" "c" "d" "e" "a" "b" "c" "d" "e" "a" "b" "c" "d" "e"
$z
[1] 1 3 0 0 3 1 3 1 0 1 2 2 0 3 1 1 0 1 2 0

There are a number of ways of accessing the first component of a list

> L1[["x"]]
[1] 2 1 1 4 5 3 4 5 5 3 3 3 4 3 2 3 3 2 3 1
> L1$x
[1] 2 1 1 4 5 3 4 5 5 3 3 3 4 3 2 3 3 2 3 1
> L1[[1]]
[1] 2 1 1 4 5 3 4 5 5 3 3 3 4 3 2 3 3 2 3 1
> L1[1]    # single brackets return a list holding the component, not the vector itself
$x
[1] 2 1 1 4 5 3 4 5 5 3 3 3 4 3 2 3 3 2 3 1

Assigning names and adding a new component

> names(L1) <- c("Item1","Item2","Item3")    #Assigning name
> L1$Item4 <- c("apple","orange","melon","grapes")    #adding new component
> L1[["Item4"]] <- c("apple","orange","melon","grapes")    #alternate way
> L1[[4]] <- c("apple","orange","melon","grapes")    #alternate way
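
The difference between double and single brackets on a list can be checked directly. A minimal sketch with a small made-up list (not the L1 shown above, which is random):

```r
lst <- list(x = c(2, 1, 4), y = c("a", "b"))

# Double brackets (or $) extract the component itself
inner <- lst[["x"]]

# Single brackets return a shorter list that still wraps the component
outer <- lst["x"]

class(inner)   # "numeric"
class(outer)   # "list"

# Length of every component in one call
sizes <- sapply(lst, length)
```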


Exercise 1
Create a vector A with values 1, 3, 5, 21, 1, 1, 1, 1, 1 and extract the 7th element of A
Use the seq and rep functions

Create a variable with your name and separate the first name from the last name
Use the strsplit function

Display the Mean, Median, Q1, Q3, Min and Max for each variable in the dataset mtcars
mtcars is a dataset available in R for practice
The summary()/describe() functions can be used

Export the output of the above to a txt file
Use the sink function


Import/Export in R

R can import various types of files, from delimited files to Excel, SAS, SPSS, etc.

read.csv, read.delim, read.table etc. can be used

stringsAsFactors should be set to FALSE (F) to avoid automatic conversion of strings to factors
data <- read.csv("file1.csv", sep = ",", header = T, stringsAsFactors = F)
data <- read.delim("test.txt", header = T, sep = "\t")

write.csv, write.table etc. can be used to export files

write.csv(data, "test.csv")
write.table(data, "test.txt")

read.spss() in the foreign library can be used to import SPSS files

read.sas7bdat() in the sas7bdat library can be used to import SAS files

Special characters should be avoided in CSV files, e.g. a <space> in a column name is replaced by .
All numeric variables should be kept in plain number format, not accounting, %, currency etc. format
# For correcting dates for SAS imported data
dat = read.sas7bdat("data.sas7bdat")
dat$date_new = as.Date(dat$date_var , origin="1960-01-01")

Importing Excel files with library(XLConnect)

library(XLConnect)
workbook = loadWorkbook("R_Training.xlsx")
df = readWorksheet(workbook, sheet = "Sheet1", header = TRUE)
getwd()
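
A CSV round trip shows the effect of the import options without needing any external file. This is a sketch under assumptions: the data and the temporary file are made up for the illustration.

```r
# Write a small data frame to a temporary CSV file
csv_file <- tempfile(fileext = ".csv")
original <- data.frame(customer_id = 1:3,
                       name = c("Ram", "Sam", "Jodu"),
                       stringsAsFactors = FALSE)
write.csv(original, csv_file, row.names = FALSE)   # row.names = FALSE avoids an extra index column

# Read it back; strings stay character because stringsAsFactors = FALSE
restored <- read.csv(csv_file, header = TRUE, stringsAsFactors = FALSE)
str(restored)
```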


Working with datasets

subset() returns the subsets of vectors, matrices or data frames which meet conditions
(acts like filters in Excel; WHERE in SAS)
subset(airquality, Temp > 80, select = c(Ozone, Temp))    #select retains the columns to be kept
airquality is a default dataset available in R
which() can be used alternatively

order() is used to sort any dataset

It allows sorting with tie-breaking based on the 2nd, 3rd and further arguments
> x <- sample(1:5, 20, rep=T)
> y <- sample(1:5, 20, rep=T)
> z <- sample(1:5, 20, rep=T)
> xyz <- rbind(x, y, z)
> o <- order(x, y, z) ; xyz1 <- xyz[, o]

data_order = data[order(data$customer_id),]
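
Tie-breaking with order() can also mix ascending and descending keys. A minimal sketch on the built-in mtcars data, where negating a numeric column reverses its direction:

```r
# Sort by cylinders ascending, then by mpg descending within each cylinder group
sorted <- mtcars[order(mtcars$cyl, -mtcars$mpg), c("cyl", "mpg")]
head(sorted)
```

The minus sign works for numeric keys only; for all keys at once, order() also accepts decreasing = TRUE.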

duplicated() is used to remove duplicates from data
a <- c(rep("A", 3), rep("B", 3), rep("C",2)); b <- c(1,1,3,4,1,1,2,2); c <- rep(1,4)
df = data.frame(a,b,c)
df[!duplicated(cbind(df$a,df$b)), ]    # removes duplicates from the dataset on the basis of a and b
df[duplicated(cbind(df$a,df$b)), ]    # returns the duplicate observations by a and b
unique(df)    # retains the unique observations

A dataset can be transposed simply by using t()

cast() and melt() in the reshape library can be used for reshaping the data into the required format

d11 = melt(mtcars, id = c("am","gear"))
var = colnames(mtcars) ; d10 = melt(mtcars, id = var[-10:-11])


Working with Datasets Contd.

merge() is used to join tables

merge(df1, df2, by="ID")    #inner join
merge(df1, df2, all=TRUE)    #outer join
merge(df1, df2, all.x=TRUE)    #left outer join
merge(df1, df2, all.y=TRUE)    #right outer join
merge(df1,df2,by=c("ID","Country"))    # merge two data frames by ID and Country
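
The four join types are easiest to tell apart by counting rows on a tiny example. The data frames below are made up for the illustration; only ID values 2 and 3 appear in both.

```r
df1 <- data.frame(ID = c(1, 2, 3), Country = c("IN", "US", "UK"))
df2 <- data.frame(ID = c(2, 3, 4), Sales = c(200, 300, 400))

inner <- merge(df1, df2, by = "ID")                 # IDs present in both: 2, 3
outer <- merge(df1, df2, by = "ID", all = TRUE)     # all IDs: 1, 2, 3, 4
left  <- merge(df1, df2, by = "ID", all.x = TRUE)   # all IDs of df1: 1, 2, 3
right <- merge(df1, df2, by = "ID", all.y = TRUE)   # all IDs of df2: 2, 3, 4

# Unmatched rows are kept with NA in the columns of the other table
left[left$ID == 1, ]
```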

If two datasets are in the same row order, cbind() (column bind) can be used to attach any new
column(s) to the dataset

Similarly, rows can be added using rbind() (row bind). In a row bind the variables
should be the same (however, their order can be different)
a <- c(rep("A", 3), rep("B", 3), rep("C",2)); b <- c(1,1,3,4,1,1,2,2); c <- rep(1,4)
df1 = data.frame(a,b,c)
df2 = data.frame(b,c,a)
df3 = rbind(df1,df2)

smartbind() in the gtools library can be used to force an append; it can append datasets
having different columns

split() splits the dataset into different datasets.

g = split(mtcars,mtcars$am) ; lapply(g,sum)


Exercise 2
Import B2b_1.txt and B2b_2.csv
Join the two tables using customer_id
Create a flag for Hold_Time_Seconds_Qty > 60
Export the top 1000 rows to .csv


Random Sample

sample() draws a random sample from a vector of values

rnorm(), runif(), rpois() are used to draw samples from the Normal, Uniform and Poisson
distributions respectively
Samples from other distributions can be drawn similarly
# take a random sample of size 50 from a dataset mydata
# sample without replacement
mysample <- mydata[sample(1:nrow(mydata), 50, replace=FALSE),]

It is recommended to use set.seed() while working on a project to ensure the sample does not
change in the middle of the analysis
set.seed(5)
rnorm(5)
[1] -0.84085548 1.38435934 -1.25549186 0.07014277 1.71144087
set.seed(5)
rnorm(5)
[1] -0.84085548 1.38435934 -1.25549186 0.07014277 1.71144087


Apply

apply(), lapply(), sapply(), tapply(), aggregate() etc. apply another function across data; they are used for
efficiency and convenience

apply() allows functions (mean, sum, etc.) to operate on sections of an array

apply(mtcars, 1, sum)    #row sum
apply(mtcars, 2, sum)    #column sum

lapply() and sapply() operate on the components of a list or vector

lapply always returns a list
sapply simplifies the result into a vector or array where possible

> l = list(mtcars[,10:11])
> lapply(l,table)    #returns the crosstab in an m x n format
> sapply(l,table)    #returns the crosstab in an array format

tapply() applies a function to each cell of a ragged array, that is to each (non-empty) group of values
given by a unique combination of the levels of certain factors
tapply(X, INDEX, FUN = )
# X is the variable on which the function needs to be applied
# INDEX is the group by variable
# FUN is the roll up function to be applied; e.g. sum, mean, wtd.mean, etc.
tapply(mtcars[,1], mtcars$gear, FUN = mean)

ddply() (package plyr)
dfx <- data.frame( group = c(rep('A', 8), rep('B', 15), rep('C', 6)),
                   sex = sample(c("M", "F"), size = 29, replace = TRUE),
                   age = runif(n = 29, min = 18, max = 54))
ddply(dfx, .(group, sex), summarize, mean = round(mean(age), 2), sd = round(sd(age), 2))
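
aggregate() is named above without an example. The sketch below computes the same group means three ways on the built-in mtcars data, using only base R; the object names are illustrative.

```r
# Mean mpg per number of gears, three equivalent routes
by_gear_tapply <- tapply(mtcars$mpg, mtcars$gear, FUN = mean)               # named vector
by_gear_agg    <- aggregate(mpg ~ gear, data = mtcars, FUN = mean)          # data frame
by_gear_sapply <- sapply(split(mtcars$mpg, mtcars$gear), mean)              # named vector

by_gear_agg
```

tapply() and sapply() return named vectors, while aggregate() returns a data frame, which is often more convenient for merging back onto other tables.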


Statements
ifelse statement
ifelse statements operate on vectors of variable length
ifelse(expression, true_value, false_value)
> x <- 1:10 # Creates sample data
> ifelse(x<5 | x>8, x, 0)
[1] 1 2 3 4 0 0 0 0 9 10

for loop
for(variable in sequence) { statements }
x <- 1:10
z <- NULL
for(i in x) {
  if(x[i] < 5) {
    z <- c(z, x[i] - 1)
  } else {
    z <- c(z, x[i] / x[i])
  }
}

repeat loop
i <- 2
repeat {
  print(i)
  i <- i+1
  if(i > 4) break
}

if statement
if statements operate on length-one logical vectors
if(cond1 == TRUE) { cmd1 } else { cmd2 }
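
The for loop above can be replaced by a single vectorized ifelse() call. The sketch below runs both versions and compares them; z_loop and z_vec are illustrative names.

```r
x <- 1:10

# Loop version: grow the result one element at a time
z_loop <- NULL
for (i in x) {
  if (x[i] < 5) {
    z_loop <- c(z_loop, x[i] - 1)
  } else {
    z_loop <- c(z_loop, x[i] / x[i])
  }
}

# Vectorized version: the condition is evaluated for all elements at once
z_vec <- ifelse(x < 5, x - 1, x / x)

all.equal(z_loop, z_vec)
```

The vectorized form is usually shorter and faster, because it avoids re-allocating the result vector on every iteration.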

Dates

as.Date() #default format yyyy-mm-dd

is.Date() (e.g. in the lubridate package)

date() returns today's date and time as text

Sys.Date() returns today's date as a Date (numeric) object

ISOdate(year,month,day)

1st Jan 1970 is the base date in R

difftime(x, Sys.Date(), tz, units = c("auto", "secs", "mins", "hours","days", "weeks"))

mondate package
MonthsBetween(x, y)
YearsBetween(x, y)
DaysBetween(x, y)

Symbol   Meaning                   Example
%d       day as a number (01-31)   01-31
%a       abbreviated weekday       Mon
%A       unabbreviated weekday     Monday
%m       month (01-12)             01-12
%b       abbreviated month         Jan
%B       unabbreviated month       January
%y       2-digit year              07
%Y       4-digit year              2007


date1=as.Date(c("13-Oct-2013","02-Aug-2014"), "%d-%b-%Y")
date1[2]-date1[1]    # difference in days
seq(as.Date("2012/12/31"), length = 2, by = "-6 months")[2]    # step back 6 months
seq(as.Date("2012/12/31"), length = 2, by = "-1 years")[2]    # step back 1 year
dates <- c("02/27/92", "02/27/92", "01/14/92", "02/28/92", "02/01/92")
as.Date(dates, "%m/%d/%y")+9    #Adding days
dates = "2011-05-28"
as.Date(dates)+9
Sys.Date()+5
as.Date("2014-02-14")+5
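
difftime() and format() are listed above without a worked case. A minimal sketch follows; the dates are written in the default yyyy-mm-dd format so the example does not depend on the locale's month names.

```r
d1 <- as.Date("2013-10-13")
d2 <- as.Date("2014-08-02")

# Difference between two dates in explicit units
gap_days  <- as.numeric(difftime(d2, d1, units = "days"))
gap_weeks <- as.numeric(difftime(d2, d1, units = "weeks"))

# Format a date for display using the symbols from the table
label <- format(d1, "%d/%m/%Y")

# Dates are stored as days since the base date
base_value <- as.numeric(as.Date("1970-01-01"))
```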


library(lubridate)
date = Sys.Date()
# Numeric output (lubridate)
month(date)
year(date)
day(date)
quarter(date)
# Character output (base R)
quarters(date)
months(date)
weekdays(date)
x <- as.POSIXlt('2005-12-16') # a date
dput(x) #structure of the date
x$mday    # day of the month

Date Formats

Code   Meaning                            Code   Meaning
%a     Abbreviated weekday                %A     Full weekday
%b     Abbreviated month                  %B     Full month
%c     Locale-specific date and time      %d     Decimal day of the month
%H     Decimal hours (24 hour)            %I     Decimal hours (12 hour)
%j     Decimal day of the year            %m     Decimal month
%M     Decimal minute                     %p     Locale-specific AM/PM
%S     Decimal second                     %U     Decimal week of the year (starting on Sunday)
%w     Decimal weekday (0=Sunday)         %W     Decimal week of the year (starting on Monday)
%x     Locale-specific date               %X     Locale-specific time
%y     2-digit year                       %Y     4-digit year
%z     Offset from GMT                    %Z     Time zone (character)


Exercise 3
Import the prdsal2.sas7bdat data
Use read.sas7bdat()

Calculate the number of observations after 1997

Add 6 months to the Month/Year date variable
Calculate the number of months from 01-01-2000


Crosstabs

There are multiple functions for creating crosstabs

xtabs
table
count
mytable <- table(A, B, useNA = "always") # A will be rows, B will be columns
margin.table(mytable, 1) # A frequencies (summed over B)
margin.table(mytable, 2) # B frequencies (summed over A)
prop.table(mytable) # cell percentages
prop.table(mytable, 1) # row percentages
prop.table(mytable, 2) # column percentages
table(a, b, useNA = "always") # useNA = "always" displays the frequency of missing values

mytable <- xtabs(weight ~ a + b, data = subset(dataset, cond1))

s1 = summary(mytable) # chi-square test of independence
names(s1)
"n.vars"       #displays the number of factors
"n.cases"      #Total cases in the table
"statistic"    #Chi Sq. statistic
"parameter"    #Degrees of freedom of the test
"approx.ok"    #Approximation status
"p.value"      #p-value for the test of independence; if <= 0.05, independence is rejected and the factors are treated as dependent; otherwise there is no evidence against independence
"call"         #the xtabs call that created the table
# marginal sums and proportions can be calculated using margin.table and prop.table
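
The crosstab functions can be tried on the built-in mtcars data. This is a sketch only; the choice of cyl and gear as the two factors is an assumption made for the illustration.

```r
# Crosstab of cylinders (rows) by gears (columns)
tab <- xtabs(~ cyl + gear, data = mtcars)

row_totals <- margin.table(tab, 1)   # frequencies of cyl, summed over gear
row_pct    <- prop.table(tab, 1)     # row percentages; each row sums to 1

# Chi-square test of independence between the two factors
s1 <- summary(tab)
s1$n.cases
s1$p.value   # a small p-value suggests the factors are not independent
```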


Graphical Capabilities of R
Types of graphs                               Functions
Scatter Plots                                 plot()
Line Charts                                   plot(), lines()
Bar Charts/Column Charts/Stacked Charts       barplot()
Histogram                                     hist()
Pie Charts                                    pie()
Dot Charts                                    dotchart()
Heat Maps                                     heatmap(), heatmap.2(), coefmap()
Box Plots                                     boxplot()
Polygon                                       polygon(), sm.density.compare()
Contour Maps

Example:
plot(cars, type="o", col="blue", ylim=c(0,12), xlim=c(0,10), xlab="Speed", sub="Plots")
lines(mtcars$gear, type="o", pch=22, lty=2, col="red") # Graph gear with red dashed line and square points
title(main="Autos", col.main="red", font.main=4) # Create a title with a red, bold/italic font
abline(a=3, b=0) # Add a horizontal reference line at y = 3
legend('topleft', names(cars), lty=1, col=c('red', 'blue', 'green', 'brown'), bty='n', cex=.75)

http://www.statmethods.net/graphs/index.html
http://www.harding.edu/fmccown/r/#linecharts

Text Mining
The tm package is a useful package for text mining in R
http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf

Sentiment Analysis
http://www.slideshare.net/jeffreybreen/r-by-example-mining-twitter-for

Word clouds can be generated in R using the wordcloud package

install.packages("wordcloud")
install.packages("tm")
library(wordcloud)
library(tm)
wordcloud("Many years ago the great British explorer George Mallory, who was to die on Mount
Everest, was asked why did he want to climb it. He said, \"Because it is there.\" Well, space is there,
and we're going to climb it, and the moon and the planets are there, and new hopes for knowledge
and peace are there. And, therefore, as we set sail we ask God's blessing on the most hazardous
and dangerous and greatest adventure on which man has ever embarked.",
random.order=F)


Connecting to databases

Prerequisites for connecting to Oracle DB

Oracle 11g
SQL Developer
Username and Password
ODBC Connection
Java
Libraries rJava, RJDBC
library(rJava)
library(RJDBC)
a = c("C:/Program Files (x86)/sqldeveloper/jdbc/lib/ojdbc6.jar")
jdbcDriver <- JDBC(driverClass = "oracle.jdbc.OracleDriver", classPath = a)
dbcConnection <- dbConnect(jdbcDriver, "jdbc:oracle:thin:@//odal04-can:1521/CWEP1", "username", "pwd")
test <- dbGetQuery(dbcConnection, "select count(*) from table")

Connecting to Greenplum
library(RPostgreSQL)
con <- dbConnect(PostgreSQL(), host="", user="", password="", dbname="")

dbGetQuery(con, "create table test as select count(*) as count_1 from table")

The RODBC package can also be used for connecting to databases



Mathematical Operators

Operator                  Description
+                         addition
-                         subtraction
*                         multiplication
/                         division
^ or **                   exponentiation
x %% y                    modulus (x mod y); 5%%2 is 1
x %/% y                   integer division; 5%/%2 is 2
<                         less than
<=                        less than or equal to
>                         greater than
>=                        greater than or equal to
==                        exactly equal to
!=                        not equal to
!x                        Not x
x|y                       x OR y
x&y                       x AND y
isTRUE(x)                 test if x is TRUE
abs(x)                    absolute value
sqrt(x)                   square root
ceiling(x)                ceiling(3.475) is 4
floor(x)                  floor(3.475) is 3
trunc(x)                  trunc(5.99) is 5
round(x, digits=n)        round(3.475, digits=2) is 3.48
signif(x, digits=n)       signif(3.475, digits=2) is 3.5
cos(x), sin(x), tan(x)    also acos(x), cosh(x), acosh(x), etc.
log(x)                    natural logarithm
log10(x)                  common logarithm
exp(x)                    e^x
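
A few of the operators above behave in ways that are worth checking once by hand. The sketch below uses made-up values; note that floor() and trunc() differ for negative numbers.

```r
# Modulus, integer division and the two spellings of exponentiation
remainder <- 5 %% 2     # 1
quotient  <- 5 %/% 2    # 2
power_a   <- 2^3        # 8
power_b   <- 2**3       # 8

# floor() rounds towards minus infinity, trunc() towards zero
floor_neg <- floor(-2.5)   # -3
trunc_neg <- trunc(-2.5)   # -2

# Logical operators work element by element on vectors
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)
both   <- a & b    # TRUE FALSE FALSE
either <- a | b    # TRUE TRUE TRUE
```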

Statistical Capabilities of R

Capability                     Function                                               Package
Correlation                    cor(), cov()
Testing of Hypothesis          pnorm, rnorm, qnorm, prop.test, t.test
Estimation
Linear Regression              lm(y~x1+x2, data=)
Logistic Regression            glm(y~x1+x2, data= , family=binomial(link="logit"))
ANOVA                          anova()
Time Series Forecast           ts(), HoltWinters(), forecast()                        forecast
Text Mining                    wordcloud(), tm()                                      wordcloud, tm
Cluster Analysis               dist(), hclust(), cutree(), kmeans(), pam()            pvclust, fpc
Factor Analysis                principal(), factanal()                                psych
Correspondence Analysis        ca()                                                   ca
CART                           rpart()                                                rpart
Random Forest                  randomForest()                                         randomForest
Sequential Pattern                                                                    arules
Generalized Additive Models    gam()                                                  gam
Discriminant Analysis          lda(y ~ x1 + x2 + x3, data= )                          MASS
Neural Network                 nnet(), neuralnet()                                    nnet, neuralnet
Design of Experiment                                                                  MASS

For more details


SAS and R

Proc Contents - sapply(data,class), str()
Proc Means - summary/mean, sd, quantile etc.
Proc Freq - xtabs/table/count
Proc SQL - sqldf
Proc Sort - order
Proc Delete - rm
Proc Reg - lm
Proc Logistic - glm
Proc Append - smartbind, rbind, cbind
Proc Print - head, tail
Proc Import - read.table
Proc Export - write.csv
Proc Transpose - t
Merge/Join - merge
Proc Univariate - summary, describe
Proc Datasets - ls
Proc Std - scale


Linear Regression
Below are the functions for building a regression equation in R

Expression                   Description
lm(y~x1+x2, data=)           Defines a regression equation and returns the model object
coef(obj)                    Regression beta coefficients
residuals(obj)               Residuals e_i
fitted(obj)                  Fitted values
summary(obj)                 Analysis summary
anova(obj)                   ANOVA table
influence(obj)               Regression diagnostics
predict(obj,newdata=ndat)    Predictions for a new dataset
plot(obj)                    Displays diagnostic plots
confint(obj, level=0.95)     95% confidence interval for the beta coefficients
deviance(obj)                Residual sum of squares
vif(obj) (package: car)      VIF of the independent variables
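
The functions in the table can be chained on a small model. The following is a sketch on the built-in mtcars data; the choice of mpg as response and wt, hp as predictors, and the values in new_cars, are assumptions made for the illustration.

```r
# Fit a linear regression of mpg on weight and horsepower
fit <- lm(mpg ~ wt + hp, data = mtcars)

betas <- coef(fit)                    # intercept and two slopes
ci    <- confint(fit, level = 0.95)   # 95% confidence intervals for the betas
rss   <- deviance(fit)                # residual sum of squares

# Predict mpg for two hypothetical cars
new_cars <- data.frame(wt = c(2.5, 3.5), hp = c(100, 150))
pred <- predict(fit, newdata = new_cars)

summary(fit)
```

The heavier, more powerful hypothetical car gets the lower predicted mpg, since both fitted slopes are negative.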


Advanced R (Statistics Module)


Linear Regression
Logistic Regression
Cluster Analysis
CART
Random Forest
Neural Network
Sentiment Analysis
Support Vector Machine
Splines


Advanced R (Graphics Module)


Connecting R to Tableau
Using R capabilities in Tableau
Formatting graphs
Heat Maps
Identify/Locator
Slider Functions
Sentiment Analysis
Brief on Linear and Logistic Regression
