# Lecture 3: String Functions

Ben Fanson
Simon Lisovski
Quick Refresher
1) R data types (major ones)
• numeric
• character
• factor [think of as a hybrid of numeric and character]

2) R classes (major ones)
• vector: c(), factor(), length()
• list: list(), str()

3) subsetting
• vector_name[ position_number ]

• data.frame_name[ row_number, column_number ] or
data.frame_name\$column_name

• list_name[ item_number ] or
list_name\$item_name
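The subsetting forms above can be tried on small objects. This is a minimal sketch; the objects `v`, `df`, and `lst` are hypothetical and not from the lecture.

```r
# Hypothetical example objects
v   <- c(10, 20, 30)
df  <- data.frame(id = 1:3, species = c("tryoni", "meloni", "capitata"))
lst <- list(a = 1:3, b = "text")

second_value <- v[2]        # element at position 2 of the vector
cell         <- df[3, 2]    # row 3, column 2 of the data frame
species_col  <- df$species  # one column, selected by name
sub_list     <- lst[1]      # single brackets return a list of length 1
item_a       <- lst$a       # $ returns the item itself (same as lst[["a"]])
```

Note that `lst[1]` is still a list, whereas `lst$a` and `lst[["a"]]` give the stored item.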

4) useful functions that we have encountered
• class() # get the class of an object

• names() # get column names

• unique() # list of unique items in a vector(s)

• rep() # repeat something a given number of times

• seq() # create a sequence of numbers

• 1:5 # ':' is a shortcut for creating a sequence of integers

• as.numeric(), as.character(), as.factor() # convert a vector to another data type (if possible)

• mean(), sd(), median(), min(), max() # summary statistics

• summary() # gives some statistics

• log(), sqrt(), x^2

5) importing files

Lecture Outline
1) sorting (Ben)

2) stringr package (Ben)

3) regular expressions (Ben)

4) dates (Simon)

- http://cran.r-project.org/web/packages/stringr/stringr.pdf
- http://journal.r-project.org/archive/2010-2/RJournal_2010-2_Wickham.pdf
Sorting in R
1) for vectors, use sort()
• returns sorted items
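A short sketch of `sort()` on vectors; the values below are made up for illustration.

```r
weights <- c(3.2, 1.5, 2.8, NA)   # hypothetical measurements

asc  <- sort(weights)                     # ascending; NA values are dropped by default
desc <- sort(weights, decreasing = TRUE)  # descending
abc  <- sort(c("tryoni", "capitata", "ludens"))  # character vectors sort alphabetically
```

`sort()` returns the sorted values themselves, so it suits a single vector but not a whole table.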

2) for data.frames, use order()
• returns ordered row numbers
• dataframe_name[ row_order, ]
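A minimal sketch of sorting a data frame with `order()`; the data frame `flies` and its columns are hypothetical.

```r
flies <- data.frame(
  species = c("tryoni", "capitata", "ludens"),
  count   = c(12, 30, 7)
)

row_order    <- order(flies$count)   # row numbers that put count in ascending order: 3 1 2
flies_sorted <- flies[row_order, ]   # use the row numbers to reorder the rows

# descending order in one step
flies_desc <- flies[order(flies$count, decreasing = TRUE), ]
```

`order()` returns positions rather than values, which is why it is placed in the row slot of the square brackets.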
String functions
String functions have two main purposes
1) cleaning and preparing data tables for analysis

Example (split): a single name column is split into genus and species

| name |
|---|
| Bactrocera tryoni |
| Bactrocera meloni |
| Ceratitis capitata |
| Anastrepha ludens |

| genus | species |
|---|---|
| Bactrocera | tryoni |
| Bactrocera | meloni |
| Ceratitis | capitata |
| Anastrepha | ludens |
Example (split and strip off specific text): model output terms are split into treatment and week

| model output | Estimate |
|---|---|
| trtA – wk1 | 0.23 |
| trtA – wk2 | 0.45 |
| trtB – wk1 | 0.12 |
| trtB – wk2 | 0.0001 |

| trt | Week | Estimate |
|---|---|---|
| A | 1 | 0.23 |
| A | 2 | 0.45 |
| B | 1 | 0.12 |
| B | 2 | 0.0001 |
Example (replace): replace '__' with ' '

| name (before) | name (after) |
|---|---|
| Bactrocera__tryoni | Bactrocera tryoni |
| Bactrocera__meloni | Bactrocera meloni |
| Ceratitis__capitata | Ceratitis capitata |
| Anastrepha__ludens | Anastrepha ludens |
2) writing R scripts, especially generic scripts

automating file paths/names…

paste()
1) one of the more important functions that you will use

2) concatenates two or more objects

| genus | species |
|---|---|
| Bactrocera | tryoni |
| Bactrocera | meloni |
| Ceratitis | capitata |
| Anastrepha | ludens |

| name |
|---|
| Bactrocera tryoni |
| Bactrocera meloni |
| Ceratitis capitata |
| Anastrepha ludens |
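The genus and species example can be sketched with `paste()` as follows. The folder `output/` and the prefix `summary_` in the file-name lines are hypothetical, chosen only to illustrate automating file names.

```r
genus   <- c("Bactrocera", "Bactrocera", "Ceratitis", "Anastrepha")
species <- c("tryoni", "meloni", "capitata", "ludens")

name       <- paste(genus, species)              # default separator is one space
name_under <- paste(genus, species, sep = "__")  # choose a different separator

# building file names in a generic script (hypothetical folder and prefix)
file_names <- paste0("output/", "summary_", species, ".csv")

# collapse a whole vector into a single string
one_string <- paste(species, collapse = ", ")
```

`paste()` works element by element on vectors, so one call builds all four names at once.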
stringr package

Note - As with most tasks in R, several other functions do similar things. stringr attempts
to add consistency and keep all the main functions in one place (you can just start typing str_)

stringr - Basic string operators

str_c( object1, object2, … ) # concatenate strings; same as paste(..., sep='') or paste0()
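A brief sketch of `str_c()`, assuming the stringr package is installed; the vectors are illustrative.

```r
library(stringr)

genus   <- c("Bactrocera", "Ceratitis")
species <- c("tryoni", "capitata")

joined_plain <- str_c(genus, species)             # no separator by default
joined_space <- str_c(genus, species, sep = " ")  # separator between the pieces
collapsed    <- str_c(species, collapse = "; ")   # join a vector into one string
```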

str_split_fixed( object1, pattern, num_pieces ) # break a string apart at a pattern into a fixed number of pieces (returns a character matrix)
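The name-splitting example from earlier in the lecture could look like this with `str_split_fixed()`. It is a sketch that assumes the names are separated by exactly one space.

```r
library(stringr)

name  <- c("Bactrocera tryoni", "Ceratitis capitata")
parts <- str_split_fixed(name, " ", 2)  # character matrix: one row per string, 2 columns

genus   <- parts[, 1]  # first piece of each name
species <- parts[, 2]  # second piece of each name
```

Because the result is a matrix, each column can be assigned straight into a data frame column.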

str_length( object ) # gets the number of characters in each element of a vector

str_sub( object, start, end ) # substring - extract part of a string by position

str_trim(object, side) # remove leading and/or trailing whitespace
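A small sketch of `str_length()`, `str_sub()`, and `str_trim()` used together; the input vector with stray spaces is made up for illustration.

```r
library(stringr)

x <- c("  Bactrocera tryoni ", "Ceratitis capitata")

n_chars  <- str_length(x)              # 20 18 (spaces are counted)
trimmed  <- str_trim(x)                # strips both sides by default
left     <- str_trim(x, side = "left") # strip leading whitespace only

first3 <- str_sub(trimmed, 1, 3)       # characters 1 to 3: "Bac" "Cer"
last4  <- str_sub(trimmed, -4, -1)     # negative positions count from the end: "yoni" "tata"
```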

stringr - Pattern matching

str_detect( object, pattern) # determine if the string contains the pattern
# returns TRUE or FALSE
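A minimal sketch of `str_detect()` used to filter a vector; the species names are reused from the earlier slides.

```r
library(stringr)

name <- c("Bactrocera tryoni", "Bactrocera meloni",
          "Ceratitis capitata", "Anastrepha ludens")

is_bactrocera <- str_detect(name, "Bactrocera")  # TRUE TRUE FALSE FALSE
bactrocera    <- name[is_bactrocera]             # keep only the matching elements
n_matches     <- sum(is_bactrocera)              # TRUE counts as 1, so this counts matches
```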

str_replace( string, pattern, replace_with ) # replace the first match of the pattern with another string
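The two cleaning examples from earlier in the lecture can be sketched with `str_replace()`. The `term` vector is an assumption: it imitates the model output labels but uses a plain ASCII hyphen as the separator.

```r
library(stringr)

# replace '__' with ' '
name  <- c("Bactrocera__tryoni", "Ceratitis__capitata")
clean <- str_replace(name, "__", " ")

# str_replace() changes only the first match; str_replace_all() changes every match
first_only <- str_replace("a__b__c", "__", " ")      # "a b__c"
all_of_it  <- str_replace_all("a__b__c", "__", " ")  # "a b c"

# split model terms, then strip off the 'trt' and 'wk' prefixes (hypothetical labels)
term   <- c("trtA - wk1", "trtB - wk2")
pieces <- str_split_fixed(term, " - ", 2)
trt    <- str_replace(pieces[, 1], "trt", "")
week   <- as.numeric(str_replace(pieces[, 2], "wk", ""))
```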

Regular expression
a small language built for finding patterns in strings
'^' starts with
'\$' ends with
'[0-9]' find strings that contain digits
'[a-z]' find strings that contain specific (here lowercase) letters
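The four patterns above can be tried with `str_detect()`. This is a sketch; the `id` vector is invented to give each pattern both matches and non-matches.

```r
library(stringr)

id <- c("trtA_wk1", "control", "wk2_trtB", "2014")

starts_trt <- str_detect(id, "^trt")    # TRUE  FALSE FALSE FALSE
ends_trtB  <- str_detect(id, "trtB$")   # FALSE FALSE TRUE  FALSE
has_digit  <- str_detect(id, "[0-9]")   # TRUE  FALSE TRUE  TRUE
has_lower  <- str_detect(id, "[a-z]")   # TRUE  TRUE  TRUE  FALSE

# the same patterns work in str_replace_all(), e.g. to drop every digit
no_digits <- str_replace_all(id, "[0-9]", "")
```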

Output for publication table
library(stringr) # provides str_c()
mod <- lm( mpg ~ cyl + wt + disp, data=mtcars )
# coefficient table as a data frame, with the term names in their own column
mod_tbl <- as.data.frame( summary( mod )\$coefficients )
mod_tbl\$var <- row.names( mod_tbl )
# combine estimate and standard error into one text column
mod_tbl\$est_se <- str_c( round(mod_tbl\$Estimate, 2), ' ± ', round(mod_tbl\$'Std. Error', 2) )
# combine t value and p-value into one text column
mod_tbl\$t <- str_c( round(mod_tbl\$'t value', 2), ' ( ', format.pval(mod_tbl\$'Pr(>|t|)', eps=0.001), ' )' )
mod_tbl1 <- mod_tbl[ , c('var', 'est_se', 't') ]
names(mod_tbl1) <- c('Variable', 'Estimate ± SE', 't (pvalue)')
row.names(mod_tbl1) <- NULL
mod_tbl1
Output for publication table

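library(rtf)
rtf <- RTF('test.rtf') # create a new RTF document
addTable( rtf, mod_tbl1 ) # write the table built above
done(rtf) # close and save the file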

Dates in R
Three date/time classes are built into R
Date
POSIXct - “Portable Operating System Interface [for Unix]” calendar time
POSIXlt - “Portable Operating System Interface [for Unix]” local time
Class – Date
This is the class you could use if you have only dates, BUT no times, in your data.
“Symbol” Explanation
%a Abbreviated weekday (e.g. “Mon”)
%A Full weekday (e.g. “Monday”)
%b Abbreviated month (e.g. “Jan”)
%B Full month (e.g. “January”)
….
%d Day of the month (01-31)
….
%H Hours as decimal numbers (00-23)
….
%j Day of the year (001-366)
….
Create a date:
Non-standard formats must be specified:
Calculation with dates:
NOTE:
Class Date works internally with a numeric value (the number of days since 1970-01-01):
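The steps for class Date can be sketched as follows. The dates are arbitrary examples, not taken from the lecture.

```r
# create a date from the standard format (YYYY-MM-DD)
d1 <- as.Date("2014-03-15")

# non-standard formats must be described with conversion symbols
d2 <- as.Date("25/12/2014", format = "%d/%m/%Y")

# calculation with dates
days_between <- d2 - d1   # a difftime of 285 days
later        <- d1 + 30   # 30 days later: 2014-04-14

# internally a Date is the number of days since 1970-01-01
days_since_epoch <- as.numeric(d1)
```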
Class – POSIXct
If you have times in your data, this is usually the best class to use.
Create some POSIXct objects:
Some calculations with time:
NOTE:
Internal numeric representation (seconds since 1970-01-01 00:00:00 UTC):
Be aware of time zones!
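A minimal sketch of the POSIXct steps. The times are arbitrary, and the time zone `Australia/Melbourne` is only an illustrative choice; setting `tz` explicitly avoids depending on the computer's local time zone.

```r
# create some POSIXct objects (time zone given explicitly)
t1 <- as.POSIXct("2014-03-15 08:30:00", tz = "UTC")
t2 <- as.POSIXct("2014-03-15 17:45:00", tz = "UTC")

# some calculations with time
gap_auto <- t2 - t1                           # difftime; units chosen automatically (hours here)
gap_mins <- difftime(t2, t1, units = "mins")  # ask for specific units
one_hour_later <- t1 + 3600                   # arithmetic is in seconds

# internally a POSIXct is the number of seconds since 1970-01-01 00:00:00 UTC
secs <- as.numeric(t1)

# the same instant displayed in another time zone
local_view <- format(t1, tz = "Australia/Melbourne", usetz = TRUE)
```

Without `tz`, R uses the session's time zone, so the same script can give different results on different machines.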
Class – POSIXlt
This class enables easy extraction of specific components of a time.
Create some POSIXlt objects:
NOTE:
Internal representation (a list of components such as sec, min, hour, mday, mon, year):
Extract components of a time object:
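A short sketch of extracting components from a POSIXlt object; the time is an arbitrary example.

```r
lt <- as.POSIXlt("2014-03-15 08:30:00", tz = "UTC")

# the object is stored as a list of components
component_names <- names(unclass(lt))

hr    <- lt$hour         # 8
mins  <- lt$min          # 30
day   <- lt$mday         # 15 (day of the month)
month <- lt$mon + 1      # 3  (mon counts from 0)
year  <- lt$year + 1900  # 2014 (year counts from 1900)
doy   <- lt$yday + 1     # 74 (yday counts from 0)
wday  <- lt$wday         # 6  (0 = Sunday, so 6 = Saturday)
```

The offsets on `mon`, `year`, and `yday` are easy to forget, so it helps to wrap them in named variables as above.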
function – format()
Format R Objects (not specific to POSIXct and POSIXlt objects)
“Symbol” Explanation
%a Abbreviated weekday (e.g. “Mon”)
%A Full weekday (e.g. “Monday”)
%b Abbreviated month (e.g. “Jan”)
%B Full month (e.g. “January”)
….
%d Day of the month (01-31)
….
%H Hours as decimal numbers (00-23)
….
%j Day of the year (001-366)
….
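A brief sketch of `format()` with a few conversion symbols; the time is an arbitrary example. Output involving weekday or month names depends on the locale, so no result is shown for that line.

```r
t1 <- as.POSIXct("2014-03-15 08:30:00", tz = "UTC")

as_dmy   <- format(t1, "%d/%m/%Y")      # "15/03/2014"
as_clock <- format(t1, "%H:%M")         # "08:30"
day_of_year <- format(t1, "%j")         # "074"
long_form   <- format(t1, "%A %d %B %Y")  # weekday and month names follow the locale

# format() also works on Date objects and always returns character strings
year_chr <- format(as.Date("2014-03-15"), "%Y")  # "2014"
```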
R Package ‘lubridate’

- http://www.jstatsoft.org/v40/i03/paper
NA’s OR “Missing Data”
In R, missing values are represented by the symbol NA (not available). Undefined
results (e.g., 0/0) are represented by the symbol NaN (not a number); dividing a non-zero number by zero gives Inf or -Inf.
Testing for missing values:

Excluding missing data from analysis
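Testing for and excluding missing values can be sketched as follows; the vector and data frame are hypothetical.

```r
x <- c(4.2, NA, 3.1)

# testing for missing values
missing_flags <- is.na(x)       # FALSE TRUE FALSE
n_missing     <- sum(is.na(x))  # number of missing values
nan_is_na     <- is.na(0/0)     # TRUE: NaN also counts as missing
nan_flag      <- is.nan(0/0)    # TRUE

# excluding missing data from analysis
mean_with_na <- mean(x)                # NA, because one value is missing
mean_no_na   <- mean(x, na.rm = TRUE)  # 3.65

df <- data.frame(id = 1:3, weight = c(1.2, NA, 0.8))
complete_rows <- na.omit(df)               # drop rows containing any NA
same_rows     <- df[!is.na(df$weight), ]   # same idea for a single column
```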
Next Week
Data frame manipulation
• grammar of data manipulation (dplyr package)
• restructuring dataframes
Lecture 3: Hands on Section
Getting Started