You are on page 1of 20

Data Exploration

with R III
Noratiqah Mohd Ariff
Contents
• Head and Tail
• Which
• Subset
• Sort and Order
• Merge
• Aggregate
• Factor
• Table
• Apply
• Missing Values
• Paste and Split
Head and Tail
✖Obtain the first several rows of a matrix or
data frame using head, and use tail to
obtain the last several rows.
✖Example:
head(trees,3)
Girth Height Volume
1 8.3 70 10.3
2 8.6 65 10.3
3 8.8 63 10.2
tail(trees,2)
Girth Height Volume
30 18.0 80 51
31 20.6 87 77
Which
✖Return the position of the in a logical vector
which are TRUE.
✖Example:

which(mtcars$cyl==6)
[1] 1 2 4 6 10 11 30

mtcars[which(mtcars$cyl==4&mtcars$gear==5),]
mpg cyl disp hp drat wt qsec vs am gear carb
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
Subset
✖The easiest way to select variables and
observations.
✖Example:
subset(mtcars,subset=hp<100,select=mpg)
mpg
Datsun 710 22.8
Merc 240D 24.4
Merc 230 22.8
Fiat 128 32.4
Honda Civic 30.4
Toyota Corolla 33.9
Toyota Corona 21.5
Fiat X1-9 27.3
Porsche 914-2 26.0
Sort & Order
✖ The sort() function sorts a vector or factor
into ascending or descending order.
✖ Example:
sort(mtcars$mpg,decreasing=T)

✖ The order() function sorts a data frame.


✖ By default, sorting is ascending. Prepend
the sorting variable by a minus sign to
indicate descending order.
✖ Example:
mtcars[order(mtcars$mpg, -mtcars$cyl),]
Merge
✖ Join two data frames by one or more
common key variables.
✖ Example:
dataA dataB
Name Age Weight Name Height Address
1 Abu 23 78 1 Abu 80 Urban
2 Ali 25 75 2 Ahmad 87 Rural
3 Ahmad 24 80 3 Amir 79 Rural
4 Amir 22 74 4 Ali 80 Urban

dataC<-merge(dataA,dataB,by="Name")
dataC
Name Age Weight Height Address
1 Abu 23 78 80 Urban
2 Ahmad 24 80 87 Rural
3 Ali 25 75 80 Urban
4 Amir 22 74 79 Rural
Aggregate
✖ Splits the data into subsets, computes
summary statistics for each, and returns the
result in a convenient form.
✖ The by variables must be in a list (even if
there is only one).
✖ Example:
agg <-aggregate(mtcars, y=list(mtcars$cyl),
FUN=median)
agg
Group.1 mpg cyl disp hp drat wt qsec vs am gear carb
1 4 26.0 4 108.0 91.0 4.080 2.200 18.900 1 1 4 2.0
2 6 19.7 6 167.6 110.0 3.900 3.215 18.300 1 0 4 4.0
3 8 15.2 8 350.5 192.5 3.115 3.755 17.175 0 0 3 3.5
Factor
✖The output levels will tell us how many and
what factors are in the data set.
✖Example:
num<-c(1,2,2,3,1,2,3,3,1,2,3,3)
fnum<-factor(num)
> fnum
[1] 1 2 2 3 1 2 3 3 1 2 3 3
Levels: 1 2 3
✖We can also relabel the factors
fnum<-factor(num,labels=c("I","II","III"))
Table
✖Tabulate data and provide the frequency of each
factor.
✖Example:
mons <-
c("March","April","January","November","January","September","October",
"September", "November","August","January","November", "November",
"February","May","August","July","December","August","August",
"September","November","February","April")
mons <- factor(mons)
table(mons)
✖We can also reorder the table
mons <-
factor(mons,levels=c("January","February","March","April","May","June",
"July",“August","September", "October", "November", "December"),
ordered=TRUE)
Table
✖ Tabulate continuous data by first creating
intervals using the cut() function.
✖ Convert a numeric variable into a factor.
✖ The argument breaks= describe how ranges
of numbers will be converted to factor values.
✖ If breaks=(a number), the resulting factor
will be created by dividing the range of the
variable into that number of equal-length
intervals.
✖ If breaks=(a vector of values), the values in
the vector are used to determine the
breakpoints. The number of levels will be one
less than the number of values in the vector.
Table
✖ Example:
attach(women)
wfact <- cut(weight,breaks=3)
wtab <- table(wfact)

✖We can relabel the intervals


wfact <-
cut(weight,breaks=3,labels=c("Low","Medium","High"))

✖By using a vector of values.


wfact <- cut(weight,breaks=c(100,120,140,160,180))
Table
✖ Combine factors using the interaction() function.
✖Take two or more factors and create a new factor
whose levels correspond to the combinations of the
levels of the input factors.
✖By default, it includes all possible combinations
of its input factors.
✖The argument drop=T is used to retain only those
combinations for which there were observations.
✖By default, the level for the new factor are levels
of its input factors joined by a period (.). The
argument sep= is used to override this.
✖Example:
data(CO2)
newfact <- interaction(CO2$Plant,CO2$Type,drop=T,sep=";")
tfact <- table(newfact)
The sapply() and lapply() function
✖ To calculate the same function for all the
variables in a data frame.
✖ The function to be executed is represented by
the argument FUN=.
✖ The difference between the two functions
sapply() and lapply() is in the presentation of
output.
✖ Example:
data(rock)
lapply(rock, FUN=mean)
sapply(rock, FUN=mean)
The apply() function
✖ Similar to sapply() and lapply() function.
✖ Contains the argument MARGIN=. If
MARGIN=1, R will apply the function to each
row. If MARGIN=2, R will apply the function to
each column.
✖ Example:
apply(rock,MARGIN=1, FUN=mean)
The Do.call() function
✖ This function allows you to call any R
function, but instead of writing out the
arguments one by one, you can use a list to hold
the arguments of the function.
✖ The do.call() function only works with list as
the second argument.
✖ Example:
do.call(rbind,lapply(rock,FUN=mean))
[,1]
area 7187.7291667
peri 2682.2119375
shape 0.2181104
perm 415.4500000
Missing Values
✖ Important to recognize missing values as early as
possible.
✖ May be part of original data or may arise as part
of a computation.
✖ Missing values are represented by NA without
quotes.
✖ To test for missing values, use is.na function.
✖ If missing values occur because of computation, it
may display Inf, -Inf or NaN.
✖ is.nan can also be used for Inf, -Inf or NaN.
✖ Most statistical function accept argument
na.rm=T/F to remove missing values.
✖ na.omit function eliminates missing values from
data.
paste and split
✖ Simple functions to handle strings or character
types data
✖Paste characters or strings together using the
paste() or paste0() function
✖Argument “sep=” for character string to separate
terms
✖Argument collapse “collapse=” for character
strings to separate results
✖Example:
mine<-c("a","b","c")
paste(mine,"!")
paste(mine,1:3)
paste0(mine,"!")
paste(mine,1:3,sep=",")
paste(mine,1:3,collapse=",")
paste and split
✖Split characters or strings together using the
strsplit() function
✖Results will be in a list
✖Example:
Mytext<-paste(mine,1:3, "!“, sep=",")
strsplit(Mytext,"[,]")
THANK YOU

You might also like