You are on page 1of 2

Data tidying with tidyr : : CHEAT SHEET

Tidy data is a way to organize tabular data in a


consistent data structure across packages. Reshape Data - Pivot data to reorganize values into a new layout. Expand
A table is tidy if:
A B C A B C
table4a Tables
country 1999 2000 country year cases pivot_longer(data, cols, names_to = "name", Create new combinations of variables or identify
& A
B
0.7K 2K
37K 80K
A
B
1999 0.7K
1999 37K
values_to = "value", values_drop_na = FALSE) implicit missing values (combinations of
C 212K 213K C 1999 212K "Lengthen" data by collapsing several columns variables not present in the data).
A 2000 2K
Each variable is in Each observation, or into two. Column names move to a new
B 2000 80K x
its own column case, is in its own row C 2000 213K names_to column and values to a new values_to x1 x2 x3 x1 x2 expand(data, …) Create a
column. A 1 3
B 1 4
A 1
A 2 new tibble with all possible
A B C A *B C pivot_longer(table4a, cols = 2:3, names_to ="year", B 2 3 B 1
B 2
combinations of the values
values_to = "cases") of the variables listed in …
Drop other variables.
table2 expand(mtcars, cyl, gear,
Access variables Preserve cases in country year type count country year cases pop pivot_wider(data, names_from = "name", carb)
as vectors vectorized operations A 1999 cases 0.7K A 1999 0.7K 19M
values_from = "value")
A 1999 pop 19M A 2000 2K 20M x
A 2000 cases 2K B 1999 37K 172M The inverse of pivot_longer(). "Widen" data by x1 x2 x3 x1 x2 x3 complete(data, …, fill =
Tibbles
A 1 3 A 1 3
A
B
2000
1999
pop 20M
cases 37K
B
C
2000 80K 174M
1999 212K 1T
expanding two columns into several. One column B 1 4 A 2 NA list()) Add missing possible
B 1999 pop 172M C 2000 213K 1T provides the new column names, the other the B 2 3 B 1 4
combinations of values of
AN ENHANCED DATA FRAME
B 2 3
B 2000 cases 80K values. variables listed in … Fill
Tibbles are a table format provided B 2000 pop 174M
remaining variables with NA.
C 1999 cases 212K pivot_wider(table2, names_from = type,
by the tibble package. They inherit the complete(mtcars, cyl, gear,
C 1999 pop 1T values_from = count)
data frame class, but have improved behaviors: C 2000 cases 213K carb)
C 2000 pop 1T
• Subset a new tibble with ], a vector with [[ and $.
• No partial matching when subsetting columns.
• Display concise views of the data on one screen. Split Cells - Use these functions to split or combine cells into individual, isolated values. Handle Missing Values
options(tibble.print_max = n, tibble.print_min = m, table5 Drop or replace explicit missing values (NA).
tibble.width = Inf) Control default display settings. country century year country year unite(data, col, …, sep = "_", remove = TRUE,
x
View() or glimpse() View the entire data set.
A 19 99 A 1999
na.rm = FALSE) Collapse cells across several x1 x2 x1 x2 drop_na(data, …) Drop
A 20 00 A 2000
B 19 99 B 1999 columns into a single column. A 1 A 1
rows containing NA’s in …
CONSTRUCT A TIBBLE B NA D 3
B 20 00 B 2000
unite(table5, century, year, col = "year", sep = "") C
D
NA
3
columns.
tibble(…) Construct by columns. E NA drop_na(x, x2)
tibble(x = 1:3, y = c("a", "b", "c")) Both make table3
x
this tibble country year rate country year cases pop separate(data, col, into, sep = "[^[:alnum:]]+",
tribble(…) Construct by rows. x1 x2 x1 x2 fill(data, …, .direction =
A 1999 0.7K/19M0 A 1999 0.7K 19M remove = TRUE, convert = FALSE, extra = "warn", A 1 A 1
tribble(~x, ~y, A 2000 0.2K/20M0 A 2000 2K 20M B NA B 1 "down") Fill in NA’s in …
A tibble: 3 × 2 fill = "warn", …) Separate each cell in a column
1, "a", x y B 1999 .37K/172M B 1999 37K 172 C NA C 1
columns using the next or
<int> <chr> B 2000 .80K/174M B 2000 80K 174 into several columns. Also extract(). D 3 D 3
2, "b", 1 1 a
E NA E 3 previous value.
3, "c") 2
3
2
3
b
c
separate(table3, rate, sep = "/", fill(x, x2)
into = c("cases", "pop"))
x
as_tibble(x, …) Convert a data frame to a tibble. table3
country
A
year
1999
rate
0.7K x1 x2 x1 x2 replace_na(data, replace)
A 1 A 1
enframe(x, name = "name", value = "value") country year rate A 1999 19M
Specify a value to replace
A 1999 0.7K/19M0 A 2000 2K separate_rows(data, …, sep = "[^[:alnum:].]+", B NA B 2
Convert a named vector to a tibble. Also deframe(). C NA C 2
NA in selected columns.
A 2000 0.2K/20M0 A 2000 20M
convert = FALSE) Separate each cell in a column D 3 D 3

is_tibble(x) Test whether x is a tibble. B 1999 .37K/172M B 1999 37K


into several rows.
E NA E 2 replace_na(x, list(x2 = 2))
B 2000 .80K/174M B 1999 172M
B 2000 80K
B 2000 174M separate_rows(table3, rate, sep = "/")

CC BY SA Posit So ware, PBC • info@posit.co • posit.co • Learn more at tidyr.tidyverse.org • tibble 3.1.2 • tidyr 1.1.3 • Updated: 2021–08


ft














Nested Data
A nested data frame stores individual tables as a list-column of data frames within a larger organizing data frame. List-columns can also be lists of vectors or lists of varying data types.
Use a nested data frame to:
• Preserve relationships between observations and subsets of data. Preserve the type of the variables being nested (factors and datetimes aren't coerced to character).
• Manipulate many sub-tables at once with purrr functions like map(), map2(), or pmap() or with dplyr rowwise() grouping.

CREATE NESTED DATA RESHAPE NESTED DATA TRANSFORM NESTED DATA


nest(data, …) Moves groups of cells into a list-column of a data unnest(data, cols, ..., keep_empty = FALSE) Flatten nested columns A vectorized function takes a vector, transforms each element in
frame. Use alone or with dplyr::group_by(): back to regular columns. The inverse of nest(). parallel, and returns a vector of the same length. By themselves
n_storms %>% unnest(data) vectorized functions cannot work with lists, such as list-columns.
1. Group the data frame with group_by() and use nest() to move
the groups into a list-column. unnest_longer(data, col, values_to = NULL, indices_to = NULL) dplyr::rowwise(.data, …) Group data so that each row is one
n_storms <- storms %>% Turn each element of a list-column into a row. group, and within the groups, elements of list-columns appear
group_by(name) %>% directly (accessed with [[ ), not as lists of length one. When you
nest() starwars %>% use rowwise(), dplyr functions will seem to apply functions to
2. Use nest(new_col = c(x, y)) to specify the columns to group select(name, films) %>% list-columns in a vectorized fashion.
using dplyr::select() syntax. unnest_longer(films)
n_storms <- storms %>%
name films
nest(data = c(year:long)) data data data result
Luke The Empire Strik…
<tibble [50x4]> <tibble [50x4]> fun( <tibble [50x4]> , …) result 1
"cell" contents Luke Revenge of the S… <tibble [50x4]> <tibble [50x4]> fun( <tibble [50x4]> , …) result 2
yr lat long name films Luke Return of the Jed… <tibble [50x4]> <tibble [50x4]> fun( <tibble [50x4]> , …) result 3
name yr lat long name yr lat long 1975 27.5 -79.0 Luke <chr [5]> C-3PO The Empire Strik…
Amy 1975 27.5 -79.0 Amy 1975 27.5 -79.0 1975 28.5 -79.0 C-3PO <chr [6]> C-3PO Attack of the Cl…
Amy Amy 1975 28.5 -79.0 nested data frame 1975 29.5 -79.0
1975 28.5 -79.0 R2-D2 <chr[7]> C-3PO The Phantom M…
Amy 1975 29.5 -79.0 Amy 1975 29.5 -79.0
Bob 1979 22.0 -96.0 Bob 1979 22.0 -96.0
name data
Amy <tibble [50x3]>
yr lat
1979 22.0 -96.0
long
R2-D2 The Empire Strik… Apply a function to a list-column and create a new list-column.
Bob 1979 22.5 -95.3
R2-D2 Attack of the Cl…
Bob 1979 22.5 -95.3 Bob <tibble [50x3]> 1979 22.5 -95.3
Bob 1979 23.0 -94.6 R2-D2 The Phantom M…
Bob 1979 23.0 -94.6 Zeta <tibble [50x3]> 1979 23.0 -94.6 dim() returns two
Zeta 2005 23.9 -35.6 Zeta 2005 23.9 -35.6
yr lat long
n_storms %>% values per row
Zeta 2005 24.2 -36.1 Zeta
Zeta
2005
2005
24.2
24.7
-36.1
-36.6
2005 23.9 -35.6 rowwise() %>%
Zeta 2005 24.7 -36.6
2005 24.2 -36.1 unnest_wider(data, col) Turn each element of a list-column into a mutate(n = list(dim(data))) wrap with list to tell mutate
to create a list-column
2005 24.7 -36.6
Index list-columns with [[]]. n_storms$data[[1]] regular column.
starwars %>%
CREATE TIBBLES WITH LIST-COLUMNS select(name, films) %>% Apply a function to a list-column and create a regular column.
unnest_wider(films)
tibble::tribble(…) Makes list-columns when needed.
tribble( ~max, ~seq, n_storms %>%
3, 1:3, max seq name films name ..1 ..2 ..3 rowwise() %>%
Luke <chr [5]> Luke The Empire... Revenge of... Return of... nrow() returns one
4, 1:4,
3
4
<int [3]>
<int [4]>
mutate(n = nrow(data)) integer per row
C-3PO <chr [6]> C-3PO The Empire... Attack of... The Phantom...
5, 1:5) 5 <int [5]>
R2-D2 <chr[7]> R2-D2 The Empire... Attack of... The Phantom...

tibble::tibble(…) Saves list input as list-columns.


tibble(max = c(3, 4, 5), seq = list(1:3, 1:4, 1:5)) Collapse multiple list-columns into a single list-column.
tibble::enframe(x, name="name", value="value") hoist(.data, .col, ..., .remove = TRUE) Selectively pull list components
Converts multi-level list to a tibble with list-cols. out into their own top-level columns. Uses purrr::pluck() syntax for starwars %>% append() returns a list for each
row, so col type must be list
enframe(list('3'=1:3, '4'=1:4, '5'=1:5), 'max', 'seq') selecting from lists. rowwise() %>%
mutate(transport = list(append(vehicles, starships)))
OUTPUT LIST-COLUMNS FROM OTHER FUNCTIONS starwars %>%
dplyr::mutate(), transmute(), and summarise() will output select(name, films) %>%
Apply a function to multiple list-columns.
list-columns if they return a list. hoist(films, first_film = 1, second_film = 2)
mtcars %>% starwars %>% length() returns one
integer per row
group_by(cyl) %>% name films name first_film second_film films rowwise() %>%
summarise(q = list(quantile(mpg))) Luke <chr [5]> Luke The Empire… Revenge of… <chr [3]> mutate(n_transports = length(c(vehicles, starships)))
C-3PO <chr [6]> C-3PO The Empire… Attack of… <chr [4]>
R2-D2 <chr[7]> R2-D2 The Empire… Attack of… <chr [5]>
See purrr package for more list functions.

CC BY SA Posit So ware, PBC • info@posit.co • posit.co • Learn more at tidyr.tidyverse.org • tibble 3.1.2 • tidyr 1.1.3 • Updated: 2021–08


ft
































You might also like