
Lecture 3: String Functions

Ben Fanson
Simon Lisovski
Quick Refresher
1) R data types (major ones)
• numeric
• character
• factor [think of as a hybrid of numeric and character]

2) R classes (major ones)
• vector: c(), factor(), length()
• data frame: data.frame() or read.table()/read.csv()/read.xls(), dim()
• list: list(), str()

Quick Refresher
3) subsetting
• vector_name[ position_number ]

• data.frame_name[ row_number, column_number ] or
data.frame_name$column_name

• list_name[[ item_number ]] or
list_name$item_name
(list_name[ item_number ] returns a one-item list, not the item itself)
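The subsetting forms above can be sketched as follows. The objects `v`, `df` and `lst` are hypothetical examples, not data from the course.

```r
# Hypothetical example objects
v   <- c(10, 20, 30)
df  <- data.frame(id = 1:3, species = c("tryoni", "meloni", "capitata"))
lst <- list(a = 1:3, b = "text")

v[2]         # second element of the vector: 20
df[2, 1]     # row 2, column 1: 2
df$species   # whole column, selected by name
lst[["b"]]   # the item itself: "text"
lst$b        # same item, selected by name
lst["b"]     # single brackets return a list of length 1
```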


Quick Refresher
4) useful functions that we have encountered
• class() # get an object's class

• names() # get column names

• unique() # list the unique items in a vector

• rep() # repeat something a given number of times

• seq() # create a sequence of numbers

• 1:5 # ':' is a shortcut for a sequence of consecutive integers
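A quick sketch of these functions on a small hypothetical vector `x` and the built-in `mtcars` data set:

```r
x <- c("a", "b", "a", "c")   # hypothetical example vector

class(x)                        # "character"
unique(x)                       # "a" "b" "c"
rep("a", times = 3)             # "a" "a" "a"
seq(from = 0, to = 10, by = 5)  # 0 5 10
1:5                             # 1 2 3 4 5
names(mtcars)[1:3]              # "mpg" "cyl" "disp"
```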




• as.numeric(), as.character(), as.factor() # convert a vector to another data type (if possible)



• mean(), sd(), median(), min(), max() # summary statistics


• summary() # gives some statistics


• log(), sqrt(), x^2 # mathematical transformations
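As a small illustration of the conversion and summary functions listed above (the vectors `x_chr` and `f` are hypothetical):

```r
x_chr <- c("1.5", "2.5", "4")            # numbers stored as text
x_num <- as.numeric(x_chr)               # 1.5 2.5 4.0

f <- as.factor(c("low", "high", "low"))
as.character(f)                          # "low" "high" "low"
as.numeric(f)                            # level codes, not the labels: 2 1 2

mean(x_num)                              # 2.666667
median(x_num)                            # 2.5
min(x_num); max(x_num)                   # 1.5 and 4
summary(x_num)                           # min, quartiles, mean, max

sqrt(x_num^2)                            # back to 1.5 2.5 4.0

# Conversion is only done "if possible": text that is not a number becomes NA
bad <- suppressWarnings(as.numeric("abc"))
```

Note that `as.numeric()` on a factor returns the internal level codes, which is a common source of mistakes; convert with `as.character()` first if the labels themselves are numbers.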

Quick Refresher
5) importing files
• read.table(filename, header=T, sep='\t') # tab-delimited files



• read.csv(filename,header=T) # comma-delimited files



• read.xls(filename, sheet=1) # requires library(gdata); reads Excel files
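A self-contained sketch of the two text-file readers: it first writes a small hypothetical table to temporary files, then reads them back. The file names and the columns `id` and `wt` are made up for illustration.

```r
# Hypothetical data written to temporary files so the example runs anywhere
dat <- data.frame(id = 1:3, wt = c(2.1, 3.4, 1.9))

tmp_csv <- tempfile(fileext = ".csv")
write.csv(dat, tmp_csv, row.names = FALSE)
dat_csv <- read.csv(tmp_csv, header = TRUE)                # comma-delimited

tmp_tab <- tempfile(fileext = ".txt")
write.table(dat, tmp_tab, sep = "\t", row.names = FALSE, quote = FALSE)
dat_tab <- read.table(tmp_tab, header = TRUE, sep = "\t")  # tab-delimited

dim(dat_csv)     # 3 rows, 2 columns
names(dat_tab)   # "id" "wt"
```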


Lecture Outline
1) sorting (Ben)

2) stringr package (Ben)

3) regular expressions (Ben)

4) dates (Simon)

Helpful references
- http://cran.r-project.org/web/packages/stringr/stringr.pdf
- http://journal.r-project.org/archive/2010-2/RJournal_2010-2_Wickham.pdf
Sorting in R
1) for vectors, use sort()
• returns sorted items




Sorting in R
2) for data.frames, use order()
• returns ordered row numbers
• dataframe_name[ row_order, ]
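A minimal sketch of both approaches. The vector `x` and the data frame `df` are hypothetical examples.

```r
# 1) vectors: sort() returns the sorted items
x <- c(3, 1, 2)
sort(x)                      # 1 2 3
sort(x, decreasing = TRUE)   # 3 2 1

# 2) data frames: order() returns row numbers, which are then used to subset
df <- data.frame(species = c("tryoni", "capitata", "ludens"),
                 n       = c(12, 5, 8))
row_order <- order(df$n)     # 2 3 1
df_sorted <- df[row_order, ] # rows rearranged from smallest to largest n

# descending order on a text column
df_desc <- df[order(df$species, decreasing = TRUE), ]
```

The comma in `df[row_order, ]` matters: the row order goes in the row position and the column position is left empty to keep all columns.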
String functions
String functions have two main purposes
1) cleaning and preparing data tables for analysis
name                   split   genus        species
Bactrocera tryoni        →     Bactrocera   tryoni
Bactrocera meloni        →     Bactrocera   meloni
Ceratitis capitata       →     Ceratitis    capitata
Anastrepha ludens        →     Anastrepha   ludens
Example: split model output labels and strip off specific text

model output   Estimate   split   trt   Week   Estimate
trtA – wk1     0.23         →     A     1      0.23
trtA – wk2     0.45         →     A     2      0.45
trtB – wk1     0.12         →     B     1      0.12
trtB – wk2     0.0001       →     B     2      0.0001
Example: replace '__' with ' '

name                    replace   name
Bactrocera__tryoni         →      Bactrocera tryoni
Bactrocera__meloni         →      Bactrocera meloni
Ceratitis__capitata        →      Ceratitis capitata
Anastrepha__ludens         →      Anastrepha ludens
2) writing R scripts, especially generic scripts

automating file paths/names…






paste()
1) one of the more important functions that you will use

2) concatenates two or more objects


genus        species    paste()   name
Bactrocera   tryoni        →      Bactrocera tryoni
Bactrocera   meloni        →      Bactrocera meloni
Ceratitis    capitata      →      Ceratitis capitata
Anastrepha   ludens        →      Anastrepha ludens
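A short sketch of `paste()` using the genus and species columns from the table above. The file names at the end are hypothetical and only illustrate automating file paths.

```r
genus   <- c("Bactrocera", "Bactrocera", "Ceratitis", "Anastrepha")
species <- c("tryoni", "meloni", "capitata", "ludens")

name <- paste(genus, species)          # default separator is a single space
name[1]                                # "Bactrocera tryoni"

paste(genus, species, sep = "_")[1]    # "Bactrocera_tryoni"
paste0("site", 1:3, ".csv")            # "site1.csv" "site2.csv" "site3.csv"
paste(species, collapse = ", ")        # one string: "tryoni, meloni, capitata, ludens"

# Hypothetical file path built from pieces
file.path("data", paste0("lecture", 3, ".csv"))   # "data/lecture3.csv"
```

`sep` controls what goes between the pasted objects, while `collapse` joins the elements of the result into a single string.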
stringr package
functions start with 'str_'










Note - As with most tasks in R, other functions (in base R and in other packages) do similar things. stringr
aims to add consistency and to keep the main string functions in one place (just start typing str_ to find them)


stringr - Basic string operators

str_c( object1, object2, …) # same as paste(..., sep='') or paste0()

str_split_fixed(object1, pattern, num_splits) # break apart a string by a pattern
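A minimal sketch of `str_split_fixed()` applied to the species names from the earlier slide, assuming the stringr package is installed. The `term` labels are a hypothetical variant of the model-output example, written with a plain hyphen.

```r
library(stringr)

name <- c("Bactrocera tryoni", "Bactrocera meloni",
          "Ceratitis capitata", "Anastrepha ludens")

# Returns a character matrix: one row per string, one column per piece
parts <- str_split_fixed(name, " ", 2)
dat <- data.frame(genus = parts[, 1], species = parts[, 2])

# Hypothetical model-output labels split on ' - '
term  <- c("trtA - wk1", "trtB - wk2")
parts2 <- str_split_fixed(term, " - ", 2)
parts2[, 1]   # "trtA" "trtB"
parts2[, 2]   # "wk1"  "wk2"
```

Because the result is always a matrix with a fixed number of columns, it slots straight into new data frame columns.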





str_length( object ) # gets the number of characters in each element of a vector

str_sub( object, start, end ) # substring - extract part of a string by position
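A brief sketch of `str_length()` and `str_sub()` on a hypothetical vector `x`, assuming stringr is installed. Negative positions count from the end of the string.

```r
library(stringr)

x <- c("Bactrocera", "Ceratitis")   # hypothetical example vector

str_length(x)        # 10 9
str_sub(x, 1, 3)     # first three characters: "Bac" "Cer"
str_sub(x, -3, -1)   # last three characters:  "era" "tis"
str_sub(x, 4)        # from position 4 to the end: "trocera" "atitis"
```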




str_trim(object, side) # remove leading and/or trailing whitespace
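A small sketch of `str_trim()`, assuming stringr is installed. The vector `x` is a hypothetical example of values with stray spaces, as often happens after importing spreadsheets.

```r
library(stringr)

x <- c("  tryoni", "meloni  ", "  capitata  ")

str_trim(x)                   # both sides (default): "tryoni" "meloni" "capitata"
str_trim(x, side = "left")    # "tryoni" "meloni  " "capitata  "
str_trim(x, side = "right")   # "  tryoni" "meloni" "  capitata"

# Stray whitespace makes otherwise equal values differ
x[1] == "tryoni"              # FALSE
str_trim(x[1]) == "tryoni"    # TRUE
```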




stringr - Pattern matching

str_detect( object, pattern) # determine whether each string contains the pattern
# returns TRUE or FALSE
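A minimal sketch of `str_detect()` used to filter a vector, assuming stringr is installed. The `name` vector reuses the species names from the earlier slides.

```r
library(stringr)

name <- c("Bactrocera tryoni", "Bactrocera meloni",
          "Ceratitis capitata", "Anastrepha ludens")

is_bactrocera <- str_detect(name, "Bactrocera")   # TRUE TRUE FALSE FALSE
name[is_bactrocera]                               # keep only the matching names
sum(is_bactrocera)                                # number of matches: 2
```

The logical vector can be used the same way to subset rows of a data frame, e.g. `df[str_detect(df$name, "Bactrocera"), ]` for a hypothetical data frame `df`.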
str_replace(string, pattern, replace_with) # replace the first match of pattern with another string (str_replace_all() replaces every match)
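A short sketch of `str_replace()` on the double-underscore names from the earlier slide, assuming stringr is installed. The string `y` is a hypothetical example showing the difference from `str_replace_all()`.

```r
library(stringr)

name <- c("Bactrocera__tryoni", "Ceratitis__capitata")
str_replace(name, "__", " ")     # "Bactrocera tryoni" "Ceratitis capitata"

# str_replace() changes only the first match in each string
y <- "a__b__c"
str_replace(y, "__", " ")        # "a b__c"
str_replace_all(y, "__", " ")    # "a b c"
```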




Regular expression
a compact pattern language built to find text in strings
'^' starts with (anchors the match to the start of the string)
'$' ends with (anchors the match to the end of the string)
'[0-9]' find strings with numbers (matches any single digit)
'[a-z]' find strings with specific letters (matches any single lowercase letter)
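The four patterns above can be tried out with `str_detect()`. This is a sketch on a hypothetical vector `x`, assuming stringr is installed; `str_extract()` is an additional stringr function, shown here only to pull out the matched text.

```r
library(stringr)

x <- c("trtA_wk1", "trtB_wk12", "control", "Site 7")   # hypothetical labels

str_detect(x, "^trt")      # starts with 'trt':        TRUE TRUE FALSE FALSE
str_detect(x, "[0-9]$")    # ends with a digit:        TRUE TRUE FALSE TRUE
str_detect(x, "[0-9]")     # contains a digit:         TRUE TRUE FALSE TRUE
str_detect(x, "[a-z]")     # contains a lowercase letter: TRUE TRUE TRUE TRUE

str_extract(x, "[0-9]+")   # first run of digits: "1" "12" NA "7"
```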







Advanced (putting it all together): Output for publication table
library(stringr)   # provides str_c()
mod <- lm( mpg ~ cyl + wt + disp, data=mtcars )
# coefficient table (Estimate, Std. Error, t value, Pr(>|t|)) as a data frame
mod_tbl <- as.data.frame( summary( mod )$coefficients )
mod_tbl$var <- row.names( mod_tbl )
# 'estimate ± SE' as a single text column
mod_tbl$est_se <- str_c( round(mod_tbl$Estimate, 2), ' ± ', round(mod_tbl$'Std. Error', 2) )
# 't ( p-value )' as a single text column; p-values below 0.001 print as '<0.001'
mod_tbl$t <- str_c( round(mod_tbl$'t value', 2), ' ( ', format.pval(mod_tbl$'Pr(>|t|)', eps=0.001), ' )' )
mod_tbl1 <- mod_tbl[ , c('var', 'est_se', 't') ]
names(mod_tbl1) <- c('Variable', 'Estimate ± SE', 't (pvalue)')
row.names(mod_tbl1) <- NULL
mod_tbl1
Output for publication table
















library(rtf)
rtf<-RTF('test.rtf')
addTable.RTF(rtf, mod_tbl1 )
done(rtf)








Dates in R
Three date/time classes are built into R:
Date
POSIXct - “Portable Operating System Interface [for Unix]” calendar time
POSIXlt - “Portable Operating System Interface [for Unix]” local time
Class – Date
Use this class if your data contain only dates and no times.
“Symbol” Explanation
%a Abbreviated weekday (e.g. “Mon”)
%A Full weekday (e.g. “Monday”)
%b Abbreviated month (e.g. “Jan”)
%B Full month (e.g. “January”)
%m Month as decimal number (01-12)
….
%d Day of the month (01-31)
….
%H Hours as decimal numbers (00-23)
%M Minutes as decimal numbers (00-59)
….
%j Day of the year (001-366)
….
Create a date:
Non-standard formats must be specified:
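A minimal sketch of creating Date objects with base R's `as.Date()`. The date strings are hypothetical examples; only numeric format symbols are used so the result does not depend on the locale.

```r
# ISO format (year-month-day) is recognised without a format string
d <- as.Date("2014-03-01")
class(d)                                          # "Date"

# Any other layout needs a format built from the symbols in the table
d2 <- as.Date("01/03/2014", format = "%d/%m/%Y")  # day/month/year
d3 <- as.Date("060-2014",  format = "%j-%Y")      # day of the year and year

d == d2   # TRUE
d == d3   # TRUE
```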
Class – Date
Calculation with dates:
NOTE:
Class Date works internally with a numeric value (the number of days since 1970-01-01):
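A sketch of date arithmetic and of the internal numeric value, using two hypothetical dates:

```r
d1 <- as.Date("2014-03-01")
d2 <- as.Date("2014-03-15")

d2 - d1                                # Time difference of 14 days
as.numeric(d2 - d1)                    # 14
d1 + 30                                # "2014-03-31"
seq(d1, by = "week", length.out = 3)   # "2014-03-01" "2014-03-08" "2014-03-15"

# Internally a Date is the number of days since 1970-01-01
unclass(d1)                            # 16130
as.numeric(as.Date("1970-01-01"))      # 0
```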
Class – POSIXct
If you have times in your data, this is usually the best class to use.
Create some POSIXct objects:
Some calculations with time:
NOTE:
Internal numeric representation (seconds since 1970-01-01 00:00:00 UTC):
Be aware of time zones!
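A minimal sketch of POSIXct objects with hypothetical times. The time zone is set explicitly to UTC so the result does not depend on the computer's settings; the second zone name is only an example and assumes it is available in the system's time zone database.

```r
t1 <- as.POSIXct("2014-03-01 08:30:00", tz = "UTC")
t2 <- as.POSIXct("2014-03-01 17:45:00", tz = "UTC")

t2 - t1                              # Time difference of 9.25 hours
difftime(t2, t1, units = "mins")     # Time difference of 555 mins
t1 + 60 * 60                         # adding seconds: one hour later

# Internally: seconds since 1970-01-01 00:00:00 UTC
as.numeric(t1)                       # 1393662600

# The same instant displayed in another time zone
format(t1, tz = "Australia/Melbourne", usetz = TRUE)
```

If `tz` is omitted, R uses the local time zone of the machine, so the same script can give different results on different computers.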
Class – POSIXlt
This class enables easy extraction of specific components of a time.
Create some POSIXlt objects:
NOTE:
Internal representation (a named list of components):
Extract components of a time object:
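A sketch of a POSIXlt object and its components, using a hypothetical time in UTC. Note that several components count from zero.

```r
lt <- as.POSIXlt("2014-03-01 08:30:15", tz = "UTC")

lt$hour          # 8
lt$min           # 30
lt$sec           # 15
lt$mday          # day of the month: 1
lt$mon           # month, counted from 0: 2 (March)
lt$year + 1900   # years are stored relative to 1900: 2014
lt$yday          # day of the year, counted from 0: 59
lt$wday          # day of the week, 0 = Sunday: 6 (Saturday)

# Internally a POSIXlt object is a list of these components
names(unclass(lt))
```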
function – format()
Format R Objects (not specific to POSIXct and POSIXlt objects)
“Symbol” Explanation
%a Abbreviated weekday (e.g. “Mon”)
%A Full weekday (e.g. “Monday”)
%b Abbreviated month (e.g. “Jan”)
%B Full month (e.g. “January”)
%m Month as decimal number (01-12)
….
%d Day of the month (01-31)
….
%H Hours as decimal numbers (00-23)
%M Minutes as decimal numbers (00-59)
….
%j Day of the year (001-366)
….
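A short sketch of `format()` with the symbols from the table, using a hypothetical time in UTC. Weekday and month names depend on the computer's language settings, so no output is shown for that line.

```r
t1 <- as.POSIXct("2014-03-01 08:30:00", tz = "UTC")

format(t1, "%d/%m/%Y")      # "01/03/2014"
format(t1, "%H:%M")         # "08:30"
format(t1, "%j")            # day of the year: "060"
format(t1, "%A %d %B %Y")   # full weekday and month names (locale-dependent)

# format() always returns character strings
class(format(t1, "%j"))     # "character"

# It is a general function, not only for dates
format(0.5, nsmall = 2)     # "0.50"
```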
Dates in R
R Package ‘lubridate’
“Dates and Times Made Easy”

- http://www.jstatsoft.org/v40/i03/paper
NAs or “Missing Data”
In R, missing values are represented by the symbol NA (not available). Undefined
numeric results (e.g., 0/0) are represented by the symbol NaN (not a number); dividing a
non-zero number by zero gives Inf or -Inf.
Testing for missing values:

Excluding missing data from analysis
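A minimal sketch of testing for and excluding missing values with base R. The vector `x` and data frame `df` are hypothetical examples.

```r
x <- c(1, NA, 3)

# Testing for missing values
is.na(x)                  # FALSE TRUE FALSE
sum(is.na(x))             # number of missing values: 1
is.na(0/0)                # TRUE: NaN also counts as missing
is.nan(0/0)               # TRUE
1/0                       # Inf, not NaN

# Excluding missing data from analysis
mean(x)                   # NA: one missing value makes the result missing
mean(x, na.rm = TRUE)     # 2

df <- data.frame(id = 1:3, wt = c(2.1, NA, 1.9))
complete.cases(df)        # TRUE FALSE TRUE
df_complete <- na.omit(df)   # drops every row that contains an NA
```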
Next Week
Data frame manipulation
• grammar of data manipulation (dplyr package)
• restructuring data frames
Lecture 3: Hands-on Section
Getting Started

1) download zip file from http://github.com/bfanson/Rcourse_proj


2) move 'R programs/Lecture3.R' into your 'Rcourse_proj/R programs' folder

3) move 'data/' folder and replace all your data files