You are on page 1of 2

Data transformation with dplyr : : CHEAT SHEET

dplyr functions work with pipes and expect tidy data. In tidy data:
A B C A B C
Manipulate Cases Manipulate Variables
&
pipes EXTRACT CASES EXTRACT VARIABLES
Row functions return a subset of rows as a new table. Column functions return a set of columns as a new vector or table.
Each variable is in Each observation, or x %>% f(y)
its own column case, is in its own row becomes f(x, y) filter(.data, …, .preserve = FALSE) Extract rows pull(.data, var = -1, name = NULL, …) Extract

Summarise Cases w
www
ww that meet logical criteria.
filter(mtcars, mpg > 20) w
www column values as a vector, by name or index.
pull(mtcars, wt)

distinct(.data, …, .keep_all = FALSE) Remove select(.data, …) Extract columns as a table.

w
www
Apply summary functions to columns to create a new table of

w
www
ww
rows with duplicate values. select(mtcars, mpg, wt)
summary statistics. Summary functions take vectors as input and distinct(mtcars, gear)
return one value (see back).
relocate(.data, …, .before = NULL, .a er = NULL)
slice(.data, …, .preserve = FALSE) Select rows

w
www
ww
summary function Move columns to new position.
by position. relocate(mtcars, mpg, cyl, .a er = last_col())
slice(mtcars, 10:15)
summarise(.data, …)

w
ww w
www
ww
Compute table of summaries. slice_sample(.data, …, n, prop, weight_by =
summarise(mtcars, avg = mean(mpg)) NULL, replace = FALSE) Randomly select rows. Use these helpers with select() and across()
Use n to select a number of rows and prop to e.g. select(mtcars, mpg:cyl)
count(.data, …, wt = NULL, sort = FALSE, name = select a fraction of rows. contains(match) num_range(prefix, range) :, e.g. mpg:cyl
NULL) Count number of rows in each group slice_sample(mtcars, n = 5, replace = TRUE) ends_with(match) all_of(x)/any_of(x, …, vars) -, e.g, -gear

w
ww
defined by the variables in … Also tally(). starts_with(match) matches(match) everything()
count(mtcars, cyl) slice_min(.data, order_by, …, n, prop,
with_ties = TRUE) and slice_max() Select rows
with the lowest and highest values.
MANIPULATE MULTIPLE VARIABLES AT ONCE
Group Cases w
www
ww
slice_min(mtcars, mpg, prop = 0.25)
across(.cols, .funs, …, .names = NULL) Summarise
slice_head(.data, …, n, prop) and slice_tail()

w
ww
Use group_by(.data, …, .add = FALSE, .drop = TRUE) to create a or mutate multiple columns in the same way.
Select the first or last rows. summarise(mtcars, across(everything(), mean))
"grouped" copy of a table grouped by columns in ... dplyr slice_head(mtcars, n = 5)
functions will manipulate each "group" separately and combine
the results. c_across(.cols) Compute across columns in

w
ww
Logical and boolean operators to use with filter() row-wise data.
== < <= is.na() %in% | xor() transmute(rowwise(UKgas), total = sum(c_across(1:2)))

w
www
ww mtcars %>% != > >= !is.na() ! &

w
group_by(cyl) %>% MAKE NEW VARIABLES
summarise(avg = mean(mpg)) See ?base::Logic and ?Comparison for help.
Apply vectorized functions to columns. Vectorized functions take
vectors as input and return vectors of the same length as output
ARRANGE CASES (see back).
Use rowwise(.data, …) to group data into individual rows. dplyr vectorized function
arrange(.data, …, .by_group = FALSE) Order
functions will compute results for each row. Also apply functions

w
www
ww
rows by values of a column or columns (low to
to list-columns. See tidyr cheat sheet for list-column workflow. high), use with desc() to order from high to low. mutate(.data, …, .keep = "all", .before = NULL,

w
www
ww
arrange(mtcars, mpg) .a er = NULL) Compute new column(s). Also
starwars %>% arrange(mtcars, desc(mpg)) add_column(), add_count(), and add_tally().

ww
www
ww
mutate(mtcars, gpm = 1 / mpg)

w
w
rowwise() %>%
mutate(film_count = length(films))
ADD CASES transmute(.data, …) Compute new column(s),

ungroup(x, …) Returns ungrouped copy of table.


add_row(.data, …, .before = NULL, .a er = NULL)
w
ww drop others.
transmute(mtcars, gpm = 1 / mpg)

w
www
ww
Add one or more rows to a table.
ungroup(g_mtcars) add_row(cars, speed = 1, dist = 1) rename(.data, …) Rename columns. Use

w
wwww rename_with() to rename with a function.
rename(cars, distance = dist)

CC BY SA Posit So ware, PBC • info@posit.co • posit.co • Learn more at dplyr.tidyverse.org • dplyr 1.0.7 • Updated: 2021-07
ft

ft

ft





ft
ft





Vectorized Functions Summary Functions Combine Tables
TO USE WITH MUTATE () TO USE WITH SUMMARISE () COMBINE VARIABLES COMBINE CASES
mutate() and transmute() apply vectorized summarise() applies summary functions to x y
functions to columns to create new columns. columns to create a new table. Summary A B C E F G A B C E F G A B C

Vectorized functions take vectors as input and


return vectors of the same length as output.
functions take vectors as input and return single
values as output.
a
b
c
t
u
v
1
2
3
+ a
b
d
t
u
w
3
2
1
= a
b
c
t
u
v
1
2
3
a
b
d
t
u
w
3
2
1
x
a t 1
b u 2
A B C

vectorized function summary function


bind_cols(…, .name_repair) Returns tables
placed side by side as a single table. Column
+ y
c v 3
d w 4 bind_rows(…, .id = NULL)
Returns tables one on top of the
lengths must be equal. Columns will NOT be DF A B C other as a single table. Set .id to
matched by id (to do that look at Relational Data x a t 1
a column name to add a column
OFFSET COUNT below), so be sure to check that both tables are
x
y
b
c
u
v
2
3 of the original table names (as
dplyr::lag() - o set elements by 1 dplyr::n() - number of values/rows ordered the way you want before binding. y d w 4 pictured).
dplyr::lead() - o set elements by -1 dplyr::n_distinct() - # of uniques
sum(!is.na()) - # of non-NA’s RELATIONAL DATA
CUMULATIVE AGGREGATE
POSITION Use a "Mutating Join" to join one table to Use a "Filtering Join" to filter one table against
dplyr::cumall() - cumulative all() columns from another, matching values with the the rows of another.
dplyr::cumany() - cumulative any() mean() - mean, also mean(!is.na()) rows that they correspond to. Each join retains a
cummax() - cumulative max() median() - median x y
di erent combination of values from the tables.
dplyr::cummean() - cumulative mean() A B C A B D

cummin() - cumulative min()


cumprod() - cumulative prod()
LOGICAL
A B C D le _join(x, y, by = NULL, copy = FALSE,
a
b
c
t
u
v
1
2
3
+ a
b
d
t
u
w
3
2
1
=
mean() - proportion of TRUE’s
cumsum() - cumulative sum() sum() - # of TRUE’s
a t 1 3
su ix = c(".x", ".y"), …, keep = FALSE,
b u 2 2
na_matches = "na") Join matching
A B C semi_join(x, y, by = NULL, copy = FALSE,
RANKING
c v 3 NA
values from y to x.
a t 1
b u 2
…, na_matches = "na") Return rows of x
ORDER that have a match in y. Use to see what
dplyr::cume_dist() - proportion of all values <= dplyr::first() - first value will be included in a join.
dplyr::dense_rank() - rank w ties = min, no gaps A B C D right_join(x, y, by = NULL, copy = FALSE,
dplyr::last() - last value a t 1 3
su ix = c(".x", ".y"), …, keep = FALSE,
dplyr::min_rank() - rank with ties = min dplyr::nth() - value in nth location of vector b u 2 2 A B C anti_join(x, y, by = NULL, copy = FALSE,
dplyr::ntile() - bins into n bins na_matches = "na") Join matching
d w NA 1
values from x to y.
c v 3
…, na_matches = "na") Return rows of x
dplyr::percent_rank() - min_rank scaled to [0,1] RANK that do not have a match in y. Use to see
dplyr::row_number() - rank with ties = "first" what will not be included in a join.
quantile() - nth quantile  A B C D inner_join(x, y, by = NULL, copy = FALSE,
MATH min() - minimum value
a t 1 3
su ix = c(".x", ".y"), …, keep = FALSE,
b u 2 2
na_matches = "na") Join data. Retain Use a "Nest Join" to inner join one table to
max() - maximum value another into a nested data frame.
+, - , *, /, ^, %/%, %% - arithmetic ops only rows with matches.
log(), log2(), log10() - logs SPREAD
<, <=, >, >=, !=, == - logical comparisons
A B C y nest_join(x, y, by = NULL, copy =
A B C D full_join(x, y, by = NULL, copy = FALSE, a t 1 <tibble [1x2]>
FALSE, keep = FALSE, name =
dplyr::between() - x >= le & x <= right IQR() - Inter-Quartile Range a t 1 3
su ix = c(".x", ".y"), …, keep = FALSE, b u 2 <tibble [1x2]>
dplyr::near() - safe == for floating point numbers mad() - median absolute deviation b u 2 2 c v 3 <tibble [1x2]> NULL, …) Join data, nesting
c v 3 NA na_matches = "na") Join data. Retain all matches from y in a single new
sd() - standard deviation d w NA 1 values, all rows.
MISCELLANEOUS var() - variance data frame column.
dplyr::case_when() - multi-case if_else()
starwars %>%
mutate(type = case_when(
Row Names COLUMN MATCHING FOR JOINS SET OPERATIONS

Tidy data does not use rownames, which store a A B C intersect(x, y, …)


height > 200 | mass > 200 ~ "large", A B.x C B.y D Use by = c("col1", "col2", …) to
variable outside of the columns. To work with the
c v 3
Rows that appear in both x and y.
species == "Droid" ~ "robot",
a t 1 t 3
specify one or more common
TRUE ~ "other") rownames, first move them into a column. b u 2 u 2
columns to match on.
c v 3 NA NA
setdi (x, y, …)
) tibble::rownames_to_column() le _join(x, y, by = "A") A B C
A B
a t 1 Rows that appear in x but not y.
dplyr::coalesce() - first non-NA values by 1 a t
C
1
A
a
B
t Move row names into col. b u 2
element  across a set of vectors a <- rownames_to_column(mtcars, 
A.x B.x C A.y B.y Use a named vector, by = c("col1" =
2 b u 2 b u a t 1 d w union(x, y, …)
dplyr::if_else() - element-wise if() + else() 3 c v 3 c v
var = "C") "col2"), to match on columns that A B C
b u 2 b u a t 1 Rows that appear in x or y.
dplyr::na_if() - replace specific values with NA c v 3 a t have di erent names in each table. b u 2
(Duplicates removed). union_all()
pmax() - element-wise max() tibble::column_to_rownames() le _join(x, y, by = c("C" = "D")) c v 3
A B C A B d w 4 retains duplicates.
pmin() - element-wise min() 1 a t t 1 a
Move col into row names. 
2 b u u 2 b
column_to_rownames(a, var = "C") A1 B1 C A2 B2 Use su ix to specify the su ix to
3 c v v 3 c a t 1 d w
give to unmatched columns that Use setequal() to test whether two data sets
b u 2 b u
have the same name in both tables. contain the exact same rows (in any order).
Also tibble::has_rownames() and c v 3 a t
tibble::remove_rownames(). le _join(x, y, by = c("C" = "D"),
su ix = c("1", "2"))

CC BY SA Posit So ware, PBC • info@posit.co • posit.co • Learn more at dplyr.tidyverse.org • dplyr 1.0.7 • Updated: 2021-07
ft
ft
ft
ft
ff
ff
ff
ff
ff
ff
ff

ff
ff

ft

ff

ff

ft


ff



You might also like