Professional Documents
Culture Documents
Data Wrangling Cheatsheet
Data Wrangling Cheatsheet
Syntax - Helpful conventions for wrangling Reshaping Data - Change the layout of a data set
dplyr::data_frame(a = 1:3, b = 4:6)
dplyr::tbl_df(iris)
Converts data to tbl class. tbl’s are easier to examine than w
ww
w w
w
ww
w
w
Combine vectors into data frame
(optimized).
data frames. R displays only the data that fits onscreen: ww
1005
A 1005
A
1013
A dplyr::arrange(mtcars, mpg)
1013
A
1010
A 1010
A
tidyr::spread(pollution, size, amount) Order rows by values of a column
1010
A
Source: local data frame [150 x 5]
+ =
A 1 A T
B 2 B F
C 3 D T
dplyr::summarise(iris, avg = mean(Sepal.Length))
dplyr::mutate(iris, sepal = Sepal.Length + Sepal. Width) Mutating Joins
Summarise data into single row of values.
Compute and append one or more new columns. x1 x2 x3
dplyr::left_join(a, b, by = "x1")
dplyr::summarise_each(iris, funs(mean)) A 1 T
dplyr::mutate_each(iris, funs(min_rank)) B 2 F
Join matching rows from b to a.
Apply summary function to each column. C 3 NA
Apply window function to each column.
dplyr::count(iris, Species, wt = Sepal.Length) x1 x3 x2
dplyr::right_join(a, b, by = "x1")
dplyr::transmute(iris, sepal = Sepal.Length + Sepal. Width) A T 1
Count number of rows with each unique value of B F 2 Join matching rows from a to b.
Compute one or more new columns. Drop original columns.
variable (with or without weights).
D T NA
x1 x2 x3 dplyr::inner_join(a, b, by = "x1")
A 1 T
summary window B 2 F Join data. Retain only rows in both sets.
function function x1
A
x2
1
x3
T
dplyr::full_join(a, b, by = "x1")
Summarise uses summary functions, functions that Mutate uses window functions, functions that take a vector of B
C
2
3
F
NA
Join data. Retain all values, all rows.
take a vector of values and return a single value, such as: values and return another vector of values, such as: D NA T
dplyr::nth mean
C 3
dplyr::dense_rank dplyr::cummean All rows in a that do not have a match in b.
Nth value of a vector. Mean value of a vector.
Ranks with no gaps. Cumulative mean y z
dplyr::n median
dplyr::min_rank cumsum x1 x2 x1 x2
# of values in a vector. Median value of a vector.
+ =
A 1 B 2
dplyr::n_distinct var Ranks. Ties get min rank. Cumulative sum B 2 C 3
Group data into rows with the same value of Species. Are values between a and b? Element-wise max x1 x2 dplyr::setdiff(y, z)
A 1
dplyr::ungroup(iris) dplyr::cume_dist pmin Rows that appear in y but not z.
Remove grouping information from data frame. Cumulative distribution. Element-wise min Binding
iris %>% group_by(Species) %>% summarise(…) iris %>% group_by(Species) %>% mutate(…)
x1
A
x2
1
Compute separate summary row for each group. Compute new variables by group.
B 2 dplyr::bind_rows(y, z)
C 3
B
C
2
3
Append z to y as new rows.
D 4
ir ir dplyr::bind_cols(y, z)
C x1 x2 x1 x2
A 1 B 2 Append z to y as new columns.
B 2 C 3
C 3 D 4 Caution: matches rows by position.
RStudio® is a trademark of RStudio, Inc. • CC BY RStudio • info@rstudio.com • 844-448-1212 • rstudio.com devtools::install_github("rstudio/EDAWR") for data sets Learn more with browseVignettes(package = c("dplyr", "tidyr")) • dplyr 0.4.0• tidyr 0.2.0 • Updated: 1/15