Agenda
Introduction
Getting Started
Packages
Types of data
Data structure
Import/Export
Working with datasets
Functions
Dates
Crosstabs
Advanced Capabilities of R
Text Mining
Connecting to DB
Appendix
Fidelity Internal
Introduction
Why R?
R is a free open source package
Comprehensive statistical and graphical programming language
R can connect to different databases: Oracle, Greenplum, Hadoop, etc.
R scripts can also be executed from SAS
Complex statistical models can be created without additional cost
R offers plenty of options for loading external data, e.g. Excel, SAS, SPSS files
R can be integrated with Tableau to enhance Tableau graphs
R is
Case Sensitive
Works in RAM
Functions object wise
Official website of R
http://cran.r-project.org/
RStudio - http://www.rstudio.com/ide/download/
R Blog - https://ribbit.fmr.com/groups/rcommunity
R Studio Session
Console
Executes the codes; log
> R prompt
+ Continuation prompt
<- Assignment Operator
Displays
Plots
Help files
Packages
Console (log/output)
Script (Editor)
Variable Assignment
Functions (function())
Comments (#)
Extension Package - additional functionality
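A minimal sketch of what a script typed in the editor pane might contain, covering the three items listed above: variable assignment, a function definition, and comments. The names `radius` and `circle_area` are illustrative and do not come from the slides.

```r
# Comments start with a hash and are ignored by R
radius <- 2                      # assignment with the <- operator

# function() creates a function object, which is assigned like any other value
circle_area <- function(r) {
  pi * r^2
}

area <- circle_area(radius)
print(area)
```

Running the script line by line sends each statement to the console, where the `>` prompt shows the result.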
Packages
Packages are collections of R functions, data, and compiled code in a well-defined format
When a package is being installed, R usually asks for a CRAN mirror selection
Any mirror can be selected from the list of country names and codes displayed (preferably India)
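The install-then-load pattern can be sketched as below. `ensure_package` is a hypothetical helper, and the mirror URL is an assumption (any CRAN mirror would do). The call uses `stats` only because it ships with R, so nothing is downloaded; a package such as `plyr` would be handled the same way.

```r
# Hypothetical helper: install a package only if it is not already available
ensure_package <- function(pkg, repos = "https://cloud.r-project.org") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = repos)   # downloads from the chosen CRAN mirror
  }
  library(pkg, character.only = TRUE)      # attach the package for this session
  invisible(TRUE)
}

ensure_package("stats")                    # "stats" ships with R
loaded <- "package:stats" %in% search()    # TRUE once the package is attached
```

Passing `repos` explicitly avoids the interactive mirror prompt, which is convenient in scripts.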
Library names
Functions
Hmisc
sas7bdat
foreign
grDevices
rgb(), windowsFont(): specify the color and font type in the chart
reshape
gtools
plyr
ddply()
Rcmdr
It is largely menu-driven and very useful for performing statistical analysis
Types of Variables
Because R works element/object wise, data can be stored in multiple ways. For illustration, we use a variable that holds a single value
Numeric
It is the default computational data type
x=5 ; class(x)
Integer
x=as.integer(5.1); class(x)
Logical
A logical value is often created via comparison between variables
x=5
x>5
[1] FALSE
Character
A character object is used to represent string values in R
y = c("Ram","Sam","Jodu","Modu")
x= as.character(5)
x = c("5")
Factors
Stores nominal values as a vector of integer codes, each mapped to a character-string level
> x = as.factor(c(2,9,7))
> as.numeric(x)
[1] 1 3 2
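The sketch below collects the five types in one place and shows why `as.numeric()` on a factor returns level codes rather than the original numbers. The variable names are illustrative.

```r
x_num <- 5                      # numeric (double) by default
x_int <- as.integer(5.1)        # the fractional part is dropped
x_lgl <- x_num > 5              # a comparison yields a logical
x_chr <- as.character(5)        # the string "5"
x_fct <- as.factor(c(2, 9, 7))  # levels are sorted: "2" "7" "9"

types <- c(class(x_num), class(x_int), class(x_lgl), class(x_chr), class(x_fct))
print(types)

codes  <- as.numeric(x_fct)                # internal level codes
values <- as.numeric(as.character(x_fct))  # original numbers recovered via the labels
```

Converting through `as.character()` first is the usual way to get the original values back from a factor.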
Data Structure in R
Vector
A set of elements of the same data type (whether they are logical, numeric, character etc. )
x = c(2,4,7,9)  #Numeric Vector
y = c("Ram","Sam","Jodu","Modu")  #Character Vector
Matrix
A matrix is a collection of data elements arranged in a two-dimensional rectangular layout
For example, matrix(1:15, nrow=3) creates a matrix with 3 rows and 5 columns.
Data frame
A dataset: a tabular combination of rows and columns
List
A generalization of a vector and represents a collection of data objects
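A minimal sketch that builds one object of each structure described above; the object names and values are illustrative.

```r
v <- c(2, 4, 7, 9)                              # vector: a single data type
m <- matrix(1:15, nrow = 3, ncol = 5)           # matrix: 3 rows, 5 columns
d <- data.frame(id = 1:4,                       # data frame: columns may differ in type
                name = c("Ram", "Sam", "Jodu", "Modu"))
l <- list(numbers = v, grid = m, people = d)    # list: components of any kind

dim(m)      # rows and columns of the matrix
str(d)      # compact description of the data frame
names(l)    # component names of the list
```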
Vectors
c function (concatenate)
Var1 <- c(1,2,6)
Var2 <- c("apple", "mango", "grape")
rep(x = ,times =)
rep(c(1,4),3)
[1] 1 4 1 4 1 4
vector(mode = , length=10)
Subscripting/Index in R
Accessing elements is achieved through a process called indexing. Indexing may be done with:
A vector of positive integers: to indicate inclusion
A vector of negative integers: to indicate exclusion
A vector of logical values: to indicate which are in and which are out
A vector of names: if the object has a names attribute
> z <- c(8,3,0,9,9,2,1)
The first element of z: z[1]
The first, third and fourth elements: z[c(1,3,4)]
The elements of z in reverse: z[7:1]
All elements except the first and third: z[-c(1,3)]
The length of the vector: length(z)
Those elements of z less than 4: z[z < 4]
> z < 4
[1] FALSE TRUE TRUE FALSE FALSE TRUE TRUE
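The four indexing styles can be tried side by side as follows. The letter names given to the elements are an assumption added to illustrate indexing by name.

```r
z <- c(8, 3, 0, 9, 9, 2, 1)

incl <- z[c(1, 3, 4)]   # positive integers: include positions 1, 3 and 4
excl <- z[-c(1, 3)]     # negative integers: exclude positions 1 and 3
low  <- z[z < 4]        # logical vector: keep elements where the test is TRUE

# Names: assign a names attribute, then index by name
names(z) <- c("a", "b", "c", "d", "e", "f", "g")
picked <- z[c("b", "g")]
print(picked)
```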
rm(list=ls(pattern="data")) removes all objects whose names contain "data"
sink("filename.txt") will create an output file. sink() with no arguments can be used to stop exporting output to the file
Character Function
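Function                        Description
grep(pattern, x)                Searches for pattern in x and returns the matching indices
sub(pattern, replacement, x)    Replaces the first match of pattern in each element of x
gsub(pattern, replacement, x)   Replaces all matches of pattern in each element of x
strsplit(x, split)              Splits the elements of x at split
paste(..., sep="")              Concatenates strings, separated by sep
toupper(x)/ tolower(x)          Uppercase/Lowercase
nchar(x)                        Number of characters in x
sprintf()                       Formats values into a string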
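A short sketch exercising each function in the table. The sample strings and the name "Ram Kumar" are made up for illustration.

```r
x <- c("apple pie", "banana split", "cherry tart")

hits   <- grep("an", x)                      # indices of elements matching the pattern
first  <- sub("a", "A", x)                   # replaces only the first match per element
every  <- gsub("a", "A", x)                  # replaces every match
parts  <- strsplit("Ram Kumar", " ")[[1]]    # splits into first and last name
joined <- paste("R", "Studio", sep = "")     # concatenates without a separator
upper  <- toupper(x[1])
len    <- nchar(x)                           # number of characters per element
msg    <- sprintf("%s has %d characters", x[1], len[1])
print(msg)
```

`strsplit()` returns a list with one component per input string, which is why `[[1]]` is needed to get the character vector.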
Matrix in R
mat = cbind(1:6, seq(0.1, 0.6, by=0.1), rep(1,6))
matrix(1:16, ncol=4)
matrix(1:16, nrow=4)
length(mat) returns the total number of elements in the matrix
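By default `matrix()` fills column by column. The sketch below contrasts that with `byrow = TRUE` and shows element access; the object names are illustrative.

```r
m_col <- matrix(1:16, ncol = 4)                # filled column by column (default)
m_row <- matrix(1:16, ncol = 4, byrow = TRUE)  # filled row by row

dim(m_col)              # number of rows and columns
length(m_col)           # total number of elements
cell <- m_col[2, 3]     # element in row 2, column 3
row2 <- m_row[2, ]      # entire second row
```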
DataFrame in R
The function data.frame converts a matrix or collection of vectors into a data frame (Data set)
Subscripting
data1[ ,2] returns the 2nd column of the dataset
data1[3, ] returns the 3rd row of the dataset
data1[c(1,3),] retains the 1st and the 3rd rows of the dataset
data1$factor will return the column named factor (here the 2nd column of the dataset)
attach() and detach() can be used to avoid referring to variables through the dataset name (data1$factor), but they should be used only if a single dataset is in use
d1 = rename(mtcars, c(wt = "weight", cyl = "cylinders"))  #package reshape
d2 = subset(mtcars, select=c(wt,cyl))  #Keep
d2 = subset(mtcars, select=-c(wt,cyl))  #Drop
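The subscripting rules can be checked on a small data frame. `data1` here is a hypothetical data set built only for this sketch, and the renaming step uses base R as an alternative to `rename()` from reshape.

```r
# data1 is a hypothetical data set built from three vectors
data1 <- data.frame(id     = 1:4,
                    factor = c("a", "b", "a", "c"),
                    score  = c(10.5, 8.2, 9.9, 7.1))

col2    <- data1[, 2]         # 2nd column as a vector
row3    <- data1[3, ]         # 3rd row as a one-row data frame
rows13  <- data1[c(1, 3), ]   # 1st and 3rd rows
by_name <- data1$factor       # the same column, selected by name

# Base R alternative to rename(): change one column name in place
names(data1)[names(data1) == "score"] <- "marks"
```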
List in R
Like data frames, lists can incorporate a mixture of modes; unlike data frames, each component can be of a different length or size
> L1 <- list(x = sample(1:5, 20, replace=TRUE), y = rep(letters[1:5], 4), z = rpois(20, 1))
> L1
$x
[1] 2 1 1 4 5 3 4 5 5 3 3 3 4 3 2 3 3 2 3 1
$y
[1] "a" "b" "c" "d" "e" "a" "b" "c" "d" "e" "a" "b" "c" "d" "e" "a" "b" "c" "d" "e"
$z
[1] 1 3 0 0 3 1 3 1 0 1 2 2 0 3 1 1 0 1 2 0
> L1[["x"]]
[1] 2 1 1 4 5 3 4 5 5 3 3 3 4 3 2 3 3 2 3 1
> L1$x
[1] 2 1 1 4 5 3 4 5 5 3 3 3 4 3 2 3 3 2 3 1
> L1[[1]]
[1] 2 1 1 4 5 3 4 5 5 3 3 3 4 3 2 3 3 2 3 1
> L1[1]  # single brackets return a list containing the component x
$x
[1] 2 1 1 4 5 3 4 5 5 3 3 3 4 3 2 3 3 2 3 1
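The difference between the access forms is easiest to see from the class of what each returns. The seed value below is arbitrary and only makes the random components reproducible.

```r
set.seed(1)  # arbitrary seed so the sampled values are reproducible
L1 <- list(x = sample(1:5, 20, replace = TRUE),
           y = rep(letters[1:5], 4),
           z = rpois(20, 1))

class(L1[["x"]])     # double brackets extract the component itself
class(L1$x)          # $ is shorthand for [["x"]]
class(L1[1])         # single brackets keep the list wrapper
sizes <- sapply(L1, length)   # every component has 20 elements here
```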
Exercise 1
Create a vector A with values 1, 3, 5, 21, 1, 1, 1, 1, 1 and extract the 7th element of A
Use the seq and rep functions
Create a variable with your name and separate the first name from the last name
Use the strsplit function
Display the Mean, Median, Q1, Q3, Min and Max for each variable in the dataset mtcars
mtcars is a dataset available in R for practice
The summary()/describe() functions can be used
Import/Export in R
R can import various types of files, from delimited text files to Excel, SAS and SPSS files
Importing Excel files
library(XLConnect)
workbook = loadWorkbook("R_Training.xlsx")
df=readWorksheet(workbook,sheet="Sheet1",header=TRUE )
getwd()
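For delimited files no extra package is needed. The sketch below is a base R round trip using temporary files, so the paths are placeholders rather than real project files.

```r
# Write a small data set to a temporary CSV file, then read it back
csv_path <- tempfile(fileext = ".csv")
write.csv(head(mtcars, 5), csv_path, row.names = TRUE)
df_in <- read.csv(csv_path, row.names = 1)   # first column holds the row names

# Tab-delimited text works the same way through write.table()/read.table()
txt_path <- tempfile(fileext = ".txt")
write.table(head(mtcars, 5), txt_path, sep = "\t", quote = FALSE)
df_txt <- read.table(txt_path, header = TRUE, sep = "\t")

dim(df_in)
dim(df_txt)
```

Relative file names are resolved against the directory reported by `getwd()`.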
The subset() function returns the subsets of vectors, matrices or data frames that meet given conditions (it acts like filters in Excel or WHERE in SAS)
subset(airquality, Temp > 80, select = c(Ozone, Temp))
airquality is a default dataset available in R
which() can be used as an alternative
data_order = data[order(data$customer_id),]  # sorts the dataset by customer_id
a <- c(rep("A", 3), rep("B", 3), rep("C",2)); b <- c(1,1,3,4,1,1,2,2); c <- rep(1,4);
df = data.frame(a,b,c)
duplicated() is used to remove duplicates from data
df[!duplicated(cbind(df$a,df$b)), ]  # removes duplicates from the data set on the basis of a and b
df[duplicated(cbind(df$a,df$b)), ]  # returns the duplicate observations by a and b
unique(df)  # retains the unique observations
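Filtering, sorting and de-duplication can be combined as in this sketch. It uses the built-in airquality data and a small made-up data frame; the object names are illustrative.

```r
# Filter rows and keep two columns
hot <- subset(airquality, Temp > 80, select = c(Ozone, Temp))

# which() selects the same rows through explicit row positions
hot2 <- airquality[which(airquality$Temp > 80), c("Ozone", "Temp")]

# Sort the filtered rows by temperature
hot_sorted <- hot[order(hot$Temp), ]

# Remove duplicates on the basis of columns a and b
df <- data.frame(a = c(rep("A", 3), rep("B", 3), rep("C", 2)),
                 b = c(1, 1, 3, 4, 1, 1, 2, 2))
dedup <- df[!duplicated(df[, c("a", "b")]), ]
print(dedup)
```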
A dataset can be transposed simply by using t()
d11 = melt(mtcars, id = c("am","gear"))
cast() and melt() in the reshape package can be used for reshaping the data into the required format
If two datasets are in the same row order, cbind() (column bind) can be used to attach any new column(s) to the dataset
Similarly, rows can be added using rbind() (row bind). For a row bind the variables must be the same (however, their order can differ)
a <- c(rep("A", 3), rep("B", 3), rep("C",2)); b <- c(1,1,3,4,1,1,2,2); c <- rep(1,4);
df1 = data.frame(a,b,c)
df2 = data.frame(b,c,a)
df3 = rbind(df1,df2)
smartbind() in the gtools library can be used to force an append. It can append datasets that have different columns
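A small sketch of the binding rules, followed by `merge()` for the case where rows are not in the same order and must be matched on a key. All data values and the `customer_id` key are made up for illustration.

```r
df1 <- data.frame(a = c("A", "B"), b = c(1, 2), c = c(1, 1))
df2 <- data.frame(b = c(3, 4), c = c(1, 1), a = c("C", "D"))  # same columns, different order

df3 <- rbind(df1, df2)                   # columns are matched by name, not position
df4 <- cbind(df1, d = c(TRUE, FALSE))    # new column for rows in the same order

# When row order differs, join on a key instead of binding
left   <- data.frame(customer_id = c(1, 2, 3), region = c("N", "S", "E"))
right  <- data.frame(customer_id = c(3, 1), calls = c(5, 2))
joined <- merge(left, right, by = "customer_id")   # keeps ids present in both
```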
Exercise 2
Import B2b_1.txt and B2b_2.csv
Join the two tables using customer_id
Create a flag for Hold_Time_Seconds_Qty > 60
Export the top 1000 rows to .csv
Random Sample
rnorm(), runif() and rpois() are used to draw samples from the Normal, Uniform and Poisson distributions respectively
Samples from other distributions can be drawn similarly
# take a random sample of size 50 from a dataset mydata
# sample without replacement
mysample <- mydata[sample(1:nrow(mydata), 50, replace=FALSE),]
It is recommended to use set.seed() while working on the same project to ensure the sample does not change in the middle of the analysis
set.seed(5)
rnorm(5)
[1] -0.84085548 1.38435934 -1.25549186 0.07014277 1.71144087
set.seed(5)
rnorm(5)
[1] -0.84085548 1.38435934 -1.25549186 0.07014277 1.71144087
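The same idea applies to sampling rows. This sketch draws ten rows of mtcars twice with the same seed; the sample size and the distribution parameters are arbitrary choices.

```r
set.seed(5)
s1 <- mtcars[sample(1:nrow(mtcars), 10, replace = FALSE), ]

set.seed(5)   # resetting the seed reproduces exactly the same draw
s2 <- mtcars[sample(1:nrow(mtcars), 10, replace = FALSE), ]

same <- identical(s1, s2)

u <- runif(3)       # three draws from Uniform(0, 1)
p <- rpois(3, 2)    # three draws from Poisson with mean 2
```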
Apply
apply(), lapply(), sapply(), tapply(), aggregate() etc. apply a function across the elements or groups of an object; they are used for efficiency and convenience
apply(X, 1, sum)  #row sum
apply(X, 2, sum)  #column sum
l = list(mtcars[,10:11])
>lapply(l,table)
>sapply(l,table)
tapply() applies a function to each cell of a ragged array, that is to each (non-empty) group of values given by a unique combination of the levels of certain factors
tapply(X, INDEX, FUN =);
# X is the variable on which the group by function needs to be applied
# INDEX is the group by variable
# FUN is the roll up function to be applied; e.g. sum, mean, wtd.mean, etc
tapply(mtcars[,1], mtcars$gear, FUN = mean);
ddply()  #package plyr
dfx <- data.frame(group = c(rep('A', 8), rep('B', 15), rep('C', 6)),
                  sex = sample(c("M", "F"), size = 29, replace = TRUE),
                  age = runif(n = 29, min = 18, max = 54))
ddply(dfx, .(group, sex), summarize, mean = round(mean(age), 2), sd = round(sd(age), 2))
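The base R members of this family can be compared on small inputs. The sketch below uses a 2 x 3 matrix and the built-in mtcars data; the object names are illustrative.

```r
m <- matrix(1:6, nrow = 2)            # 2 rows, 3 columns

row_sums <- apply(m, 1, sum)          # MARGIN = 1: apply over rows
col_sums <- apply(m, 2, sum)          # MARGIN = 2: apply over columns

cols    <- mtcars[, c("gear", "carb")]
as_list <- lapply(cols, mean)         # always returns a list
as_vec  <- sapply(cols, mean)         # simplifies to a named vector

mpg_by_gear <- tapply(mtcars$mpg, mtcars$gear, FUN = mean)   # mean mpg per gear group
agg <- aggregate(mpg ~ gear, data = mtcars, FUN = mean)       # same roll-up as a data frame
```

`tapply()` returns a named array while `aggregate()` returns a data frame, which is often easier to merge back into other data.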
Statements
ifelse statement
Ifelse statements operate on vectors of variable length
ifelse(expression, true_value, false_value)
x <- 1:10 # Creates sample data
ifelse(x<5 | x>8, x, 0)
[1] 1 2 3 4 0 0 0 0 9 10
for loop
for(variable in sequence) { statements }
x <- 1:10
z <- NULL
for(i in x) {
  if(x[i] < 5) {
    z <- c(z, x[i] - 1)
  } else {
    z <- c(z, x[i] / x[i])
  }
}
repeat loop
i <- 2
repeat {
  print(i)
  i <- i+1
  if(i > 4) break
}
if statement
If statements operate on length-one logical vectors
if(cond1 == TRUE) { cmd1 } else { cmd2 }
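The contrast between `if` (one condition at a time) and `ifelse()` (a whole vector at once) can be sketched as follows. `describe_value` is a hypothetical helper written only for this illustration.

```r
x <- 1:10

# if() needs a single TRUE/FALSE, so it is used on one value at a time
describe_value <- function(v) {
  if (v < 5) {
    "small"
  } else {
    "large"
  }
}
one <- describe_value(x[3])

# ifelse() evaluates the test for every element and returns a vector
all_labels <- ifelse(x < 5, "small", "large")

# A for loop reaches the same result element by element
loop_labels <- character(length(x))
for (i in seq_along(x)) {
  loop_labels[i] <- describe_value(x[i])
}
```

Pre-allocating `loop_labels` avoids growing the vector inside the loop, which becomes slow for long inputs.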
Dates
Is.date()
ISOdate(year,month,day)
mondate package
MonthsBetween(x, y)
YearsBetween(x, y)
DaysBetween(x, y)
Symbol    Meaning                  Example
%d        day as a number          01-31
%a        abbreviated weekday      Mon
%A        unabbreviated weekday    Monday
%m        month (01-12)            01-12
%b        abbreviated month        Jan
%B        unabbreviated month      January
%y        2-digit year             07
%Y        4-digit year             2007
date = Sys.Date()
library(lubridate)
month(date)
year(date)
day(date)
quarter(date)
# Character output
quarters(date)
months(date)
weekdays(date)
x <- as.POSIXlt('2005-12-16') # a date
dput(x) #structure of the date
x$mday
Date Formats
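Code    Meaning
%a      Abbreviated weekday
%A      Full weekday
%b      Abbreviated month
%B      Full month
%c      Locale-specific date and time
%d      Decimal day of the month
%H      Decimal hour (24-hour clock)
%I      Decimal hour (12-hour clock)
%j      Decimal day of the year
%m      Decimal month
%M      Decimal minute
%p      Locale-specific AM/PM
%S      Decimal second
%U      Decimal week of the year (weeks starting on Sunday)
%w      Decimal weekday (0 = Sunday)
%W      Decimal week of the year (weeks starting on Monday)
%x      Locale-specific Date
%X      Locale-specific Time
%y      2-digit year
%Y      4-digit year
%z      Offset from UTC
%Z      Time zone abbreviation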
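The format codes are used both to parse text into dates and to print dates back as text. A minimal base R sketch; the date itself is arbitrary.

```r
d <- as.Date("16/12/2005", format = "%d/%m/%Y")   # parse text using format codes

iso   <- format(d, "%Y-%m-%d")   # 4-digit year, month, day
short <- format(d, "%d.%m.%y")   # 2-digit year
doy   <- format(d, "%j")         # day of the year

# Month and weekday names (%b, %B, %a, %A) depend on the session locale
format(d, "%A %d %B %Y")

gap <- as.numeric(as.Date("2005-12-31") - d)   # difference in days
```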
Exercise 3
Import the prdsal2.sas7bdat data
Use read.sas7bdat()
Crosstabs
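Cross-tabulations in base R are typically built with `table()` or `xtabs()`. The sketch below is one possible illustration on the built-in mtcars data; the choice of the cyl and gear columns is an assumption and does not come from the slides.

```r
# One-way frequency table
gear_counts <- table(mtcars$gear)

# Two-way crosstab: cylinders (rows) by gears (columns)
ct <- table(cyl = mtcars$cyl, gear = mtcars$gear)
print(ct)

with_totals <- addmargins(ct)               # adds row and column totals
row_share   <- prop.table(ct, margin = 1)   # proportions within each row

# Formula interface producing the same counts
ct2 <- xtabs(~ cyl + gear, data = mtcars)
```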
Graphical Capabilities of R
Types of graphs    Functions
Scatter Plots      plot()
Line Charts        plot(), lines()
Bar Charts         barplot()
Histogram          hist()
Pie Charts         pie()
Dot Charts         dotchart()
Heat Maps          heatmap()
Box Plots          boxplot()
Polygon            polygon(), sm.density.compare()
Contour Maps       contour()
Example:
plot(cars, type="o", col="blue", ylim=c(0,12), xlim=c(0,10), xlab="Speed", sub="Plots")
lines(mtcars$gear, type="o", pch=22, lty=2, col="red")  # Graph gear with red dashed line and square points
title(main="Autos", col.main="red", font.main=4)  # Create a title with a red, bold/italic font
abline(a=3, b=0)  # Add a horizontal reference line at y = 3
legend('topleft', names(cars), lty=1, col=c('red', 'blue', 'green', 'brown'), bty='n', cex=.75)
http://www.statmethods.net/graphs/index.html
http://www.harding.edu/fmccown/r/#linecharts
Text Mining
The tm package is a useful package for text mining in R
http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
Sentiment Analysis
http://www.slideshare.net/jeffreybreen/r-by-example-mining-twitter-for
,random.order=F)
Connecting to databases
Connecting to Greenplum
library(RPostgreSQL)
Mathematical Operators
Operator               Description
+                      addition
-                      subtraction
*                      multiplication
/                      division
^ or **                exponentiation
x %% y                 modulus (x mod y)
x %/% y                integer division
<                      less than
<=                     less than or equal to
>                      greater than
>=                     greater than or equal to
==                     exactly equal to
!=                     not equal to
!x                     Not x
x|y                    x OR y
x&y                    x AND y
isTRUE(x)              test if x is TRUE
abs(x)                 absolute value
sqrt(x)                square root
ceiling(x)             ceiling(3.475) is 4
floor(x)               floor(3.475) is 3
trunc(x)               trunc(5.99) is 5
round(x, digits=n)     rounds x to n decimal places
signif(x, digits=n)    rounds x to n significant digits
log(x)                 natural logarithm
log10(x)               common logarithm
exp(x)                 e^x
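A few of the operators and functions from the table, evaluated on small numbers chosen for illustration.

```r
rem  <- 7 %% 2            # modulus: remainder of 7 / 2
quot <- 7 %/% 2           # integer division
pow  <- 2^3               # exponentiation (2 ** 3 is parsed the same way)

x <- 5
both   <- x > 3 & x < 10  # AND: both comparisons must hold
either <- x < 3 | x == 5  # OR: at least one comparison must hold
isTRUE(x == 5)

up   <- ceiling(3.475)    # rounds up to 4
down <- floor(3.475)      # rounds down to 3
cut  <- trunc(5.99)       # drops the fractional part, giving 5
root <- sqrt(16)
ln_e <- log(exp(1))       # natural logarithm of e
lg   <- log10(1000)       # common logarithm
```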
Statistical Capabilities of R
Capability                 Function                                              Package
Correlation                cor(), cov()                                          -
Testing of Hypothesis      -                                                     -
Estimation                 -                                                     -
Linear Regression          lm (y~x1+x2, data=)                                   -
Logistic Regression        glm(y~x1+x2,data= ,family=binomial(link="logit"))     -
ANOVA                      anova()                                               -
-                          -                                                     forecast
Text Mining                wordcloud(), tm()                                     wordcloud, tm
Cluster Analysis           -                                                     pvclust, fpc
Factor Analysis            principal(), factanal( )                              psych
Correspondence Analysis    ca()                                                  ca
CART                       rpart()                                               rpart
Random Forest              randomForest()                                        randomForest
Sequential Pattern         -                                                     arules
-                          gam()                                                 gam
Discriminant Analysis      -                                                     MASS
Neural Network             nnet(), neuralnet()                                   nnet, neuralnet
Design of Experiment       -                                                     MASS
SAS and R
Proc delete - rm
Proc Reg - lm
Proc Transpose - t
Merge/Join - merge
Proc datasets - ls
Linear Regression
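The functions below are used for building and examining a regression equation in R
Expression                   Description
lm(y~x1+x2, data=)           Fits the linear model
coef(obj)                    Regression coefficients
residuals(obj)               Residuals ei
fitted(obj)                  fitted values
summary(obj)                 analysis summary
anova(obj)                   ANOVA table
influence(obj)               Regression Diagnostics
predict(obj,newdata=ndat)    Predictions for new data
plot(obj)                    Diagnostic plots
confint(obj, level=0.95)     Confidence intervals for the coefficients
deviance(obj)                Residual sum of squares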
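The functions in the table can be tried on the built-in mtcars data. The model below (mpg explained by weight and horsepower) and the values in `ndat` are assumptions chosen for illustration.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # mpg explained by weight and horsepower

coef(fit)                      # intercept and slopes
head(residuals(fit))           # observed minus fitted values
confint(fit, level = 0.95)     # confidence intervals for the coefficients
rss <- deviance(fit)           # residual sum of squares for an lm fit

# Predict for new observations; ndat must contain the predictor columns
ndat <- data.frame(wt = c(2.5, 3.5), hp = c(100, 150))
pred <- predict(fit, newdata = ndat)
```

`summary(fit)` and `anova(fit)` print the coefficient table and the ANOVA table for the same object.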