
String

library(tidyverse)

Strings are not glamorous, high-profile components of R, but they do play a big
role in many data cleaning and preparation tasks.

# The easiest way to get stringr is to install the whole tidyverse:


install.packages("tidyverse")

# Alternatively, install just stringr:


install.packages("stringr")
Usage

x <- c("why", "you", "are", "here", ": for", "Enjoyment")

str_length(x)

str_c(x, collapse = ", ")

str_sub(x, 1, 2)
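As a rough illustration of the three calls above, the sketch below runs them on a small vector and notes what each returns. The vector `words` and the result names are made up for this example; they are not the `x` defined earlier.

```r
library(stringr)

# Hypothetical input vector, chosen only for illustration
words <- c("why", "you", "are", "here")

n_chars  <- str_length(words)              # number of characters in each string
joined   <- str_c(words, collapse = ", ")  # collapse the vector into one string
prefixes <- str_sub(words, 1, 2)           # characters 1 to 2 of each string

print(n_chars)   # 3 3 3 4
print(joined)    # "why, you, are, here"
print(prefixes)  # "wh" "yo" "ar" "he"
```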
Most string functions work with regular expressions, a concise language for describing patterns of text:

• str_subset(x, "[aeiou]")

• str_count(x, "[aeiou]")
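To make the role of the pattern argument more concrete, here is a minimal sketch using a character class (`[aeiou]`), a literal substring, and a start-of-string anchor (`^`). The vector `fruits` is an assumption introduced only for this example.

```r
library(stringr)

# Hypothetical input vector, not taken from the slides
fruits <- c("apple", "banana", "cherry")

vowel_counts <- str_count(fruits, "[aeiou]")  # vowels per string
with_an      <- str_subset(fruits, "an")      # strings containing "an"
starts_ab    <- str_subset(fruits, "^[ab]")   # strings starting with "a" or "b"

print(vowel_counts)  # 2 3 1
print(with_an)       # "banana"
print(starts_ab)     # "apple" "banana"
```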
Several main verbs work with patterns, including:

• str_detect(x, pattern) tells you if there’s any match to the pattern.

• str_detect(x, "[aeiou]")

• str_count(x, pattern) counts the number of matches.

• str_subset(x, pattern) extracts the matching components.

• str_locate(x, pattern) gives the position of the match.

• str_extract(x, pattern) extracts the text of the match.


• str_match(x, pattern) extracts parts of the match defined by parentheses.

• str_match(x, "(.)[aeiou](.)") # extract the characters on either side of the vowel
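The verbs listed above can be compared side by side on a single input. The sketch below is only an illustration: the vector `words` and the result names are assumptions for this example, and the comments describe what each call returns for this particular input.

```r
library(stringr)

# Hypothetical input vector used to compare the verbs
words <- c("why", "video", "cross", "extra", "deal", "authority")

detected  <- str_detect(words, "[aeiou]")   # logical: any vowel at all?
counted   <- str_count(words, "[aeiou]")    # integer: number of vowels
subsetted <- str_subset(words, "[aeiou]")   # only the strings that match
located   <- str_locate(words, "[aeiou]")   # matrix with start/end of first match
extracted <- str_extract(words, "[aeiou]")  # text of the first match, NA if none
matched   <- str_match(words, "(.)[aeiou](.)")  # full match plus one column per group

print(detected)   # FALSE TRUE TRUE TRUE TRUE TRUE
print(counted)    # 0 3 1 2 2 4
print(extracted)  # NA "i" "o" "e" "e" "a"
print(matched[2, ])  # "vid" "v" "d"
```

Note that "extra" contains vowels but has no vowel with a character on both sides, so its row in `matched` is all NA.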
Compared to base R

• Uses consistent function and argument names. The first argument is always the vector of strings to modify, which makes stringr work particularly well in conjunction with the pipe.

• Simplifies string operations by eliminating options that you don’t need 95% of the time.

• Produces outputs that can easily be used as inputs. This includes ensuring that missing inputs result in missing outputs, and zero-length inputs result in zero-length outputs.
letters %>%
  .[1:10] %>%
  str_pad(3, "right") %>%
  str_c(letters[2:11])
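The pipeline above can also be written step by step, which makes the intermediate values easier to inspect. The sketch below does that without the pipe and then checks the missing-input and zero-length behaviour described in the list; the variable names are assumptions for this example.

```r
library(stringr)

# Same computation as the pipeline, one step at a time
first_ten <- letters[1:10]
padded    <- str_pad(first_ten, width = 3, side = "right")  # "a  ", "b  ", ...
combined  <- str_c(padded, letters[2:11])                   # "a  b", "b  c", ...

print(combined[1])  # "a  b"

# Missing input gives missing output; zero-length input gives zero-length output
na_length    <- str_length(NA)            # NA
empty_length <- str_length(character(0))  # integer(0)
```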
In R, missing values are contagious. If you want them to print as "NA", use str_replace_na():

x <- c("abc", NA)


str_c("|-", x, "-|")
#> [1] "|-abc-|" NA
str_c("|-", str_replace_na(x), "-|")
#> [1] "|-abc-|" "|-NA-|"
str_c() is vectorized, and it automatically recycles shorter vectors to the same length as the longest:

str_c("prefix-", c("a", "b", "c"), "-suffix")
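A short sketch of what the call above returns, plus the `sep` argument for comparison. Only length-one inputs are recycled here, which is the case that should behave the same across stringr versions; as far as we know, newer releases are stricter about recycling other unequal lengths, so treat anything beyond this as an assumption to verify against your installed version.

```r
library(stringr)

# Length-one strings are recycled against the length-three vector
labelled <- str_c("prefix-", c("a", "b", "c"), "-suffix")
print(labelled)  # "prefix-a-suffix" "prefix-b-suffix" "prefix-c-suffix"

# sep is inserted between the pieces of each element
with_sep <- str_c("x", c("a", "b"), sep = "_")
print(with_sep)  # "x_a" "x_b"
```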


Objects of length 0 are silently dropped. This is particularly useful in conjunction with if:

name <- "Shantilal"
time <- "morning"
day1 <- TRUE

str_c(
  "Good ", time, " ", name,
  if (day1) " and Have A NICE DAY",
  "."
)
• str_c(c("Today", "is", "Monday"), collapse = ", ")
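To show the effect of the conditional piece in the greeting, the sketch below wraps it in a small helper and calls it both ways. The function `greet()` is hypothetical and exists only for this example. An `if` without an `else` evaluates to NULL when the condition is FALSE, and str_c() drops that NULL.

```r
library(stringr)

# Hypothetical helper wrapping the greeting, for illustration only
greet <- function(name, time, day1) {
  str_c(
    "Good ", time, " ", name,
    if (day1) " and Have A NICE DAY",  # NULL when day1 is FALSE, so it is dropped
    "."
  )
}

with_wish    <- greet("Shantilal", "morning", TRUE)
without_wish <- greet("Shantilal", "morning", FALSE)

print(with_wish)     # "Good morning Shantilal and Have A NICE DAY."
print(without_wish)  # "Good morning Shantilal."

# collapse turns a whole vector into a single string
sentence <- str_c(c("Today", "is", "Monday"), collapse = ", ")
print(sentence)      # "Today, is, Monday"
```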
# names of states
states <- rownames(USArrests)

# substr
substr(x = states, start = 1, stop = 4)
#> [1] "Alab" "Alas" "Ariz" "Arka" "Cali" "Colo" "Conn" "Dela" "Flor" "Geor"
#> [11] "Hawa" "Idah" "Illi" "Indi" "Iowa" "Kans" "Kent" "Loui" "Main" "Mary"
#> [21] "Mass" "Mich" "Minn" "Miss" "Miss" "Mont" "Nebr" "Neva" "New " "New "
#> [31] "New " "New " "Nort" "Nort" "Ohio" "Okla" "Oreg" "Penn" "Rhod" "Sout"
#> [41] "Sout" "Tenn" "Texa" "Utah" "Verm" "Virg" "Wash" "West" "Wisc" "Wyom"

# abbreviate state names
states2 <- abbreviate(states)

# remove vector names (for convenience)
names(states2) <- NULL
states2
#> [1] "Albm" "Alsk" "Arzn" "Arkn" "Clfr" "Clrd" "Cnnc" "Dlwr" "Flrd" "Gerg"
#> [11] "Hawa" "Idah" "Illn" "Indn" "Iowa" "Knss" "Kntc" "Losn" "Main" "Mryl"
#> [21] "Mssc" "Mchg" "Mnns" "Msss" "Mssr" "Mntn" "Nbrs" "Nevd" "NwHm" "NwJr"
#> [31] "NwMx" "NwYr" "NrtC" "NrtD" "Ohio" "Oklh" "Orgn" "Pnns" "RhdI" "SthC"
#> [41] "SthD" "Tnns" "Texs" "Utah" "Vrmn" "Vrgn" "Wshn" "WstV" "Wscn" "Wymn"
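For comparison with the base R call, the same prefix extraction can be sketched with stringr's str_sub(), which additionally accepts negative positions that count from the end of the string. The variable names below are assumptions for this example.

```r
library(stringr)

states <- rownames(USArrests)

# Equivalent of substr(x = states, start = 1, stop = 4)
first_four <- str_sub(states, 1, 4)

# Negative positions count backwards from the end of each string
last_four <- str_sub(states, -4, -1)

print(head(first_four, 3))  # "Alab" "Alas" "Ariz"
print(last_four[1])         # "bama"
```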
Getting the longest name

# abbreviate state names to a minimum of 5 characters
abbreviate(states, minlength = 5)

# size (in characters) of each name
state_chars <- nchar(states)

state_chars

# longest name

states[which(state_chars == max(state_chars))]
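The same lookup can be sketched with stringr's str_length() in place of nchar(). The example also shows which.max(), which returns only the first maximum and therefore behaves differently from the comparison used above when there is a tie. The variable names are assumptions for this example.

```r
library(stringr)

states <- rownames(USArrests)
state_chars <- str_length(states)

# All names that share the maximum length
longest <- states[state_chars == max(state_chars)]
print(longest)        # "North Carolina" "South Carolina"

# which.max() stops at the first maximum, so ties are not reported
first_longest <- states[which.max(state_chars)]
print(first_longest)  # "North Carolina"
```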
Some Computations
summary(nchar(states))

# histogram
hist(nchar(states), las = 1, col = "gray80", main = "Histogram",
     xlab = "number of characters in US State names")
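As a numeric companion to the histogram, the sketch below tabulates how many state names have each length and picks out the shortest names. It uses base R only, and the variable names are assumptions for this example.

```r
states <- rownames(USArrests)
state_chars <- nchar(states)

# Frequency of each name length (the counts behind the histogram)
length_counts <- table(state_chars)
print(length_counts)

# Shortest and longest lengths
length_range <- range(state_chars)
print(length_range)  # 4 14

# Names with the minimum number of characters
shortest <- states[state_chars == min(state_chars)]
print(shortest)      # "Iowa" "Ohio" "Utah"
```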
• https://stringr.tidyverse.org/

• https://www.gastonsanchez.com/r4strings/reversing.html
• USArrests
