Data Transformation With Data - Table: Cheat Sheet

Data Transformation with data.
table : : CHEAT SHEET

Basics Manipulate columns with j Group according to by
data.table is an extremely fast and memory efficient package
for transforming data in R. It works by converting R’s native a a a dt[, j, by = .(a)] – group rows by
EXTRACT
data frame objects into data.tables with new and enhanced values in specified columns.
functionality. The basics of working with data.tables are: dt[, c(2)] – extract columns by number. Prefix
column numbers with “-” to drop. dt[, j, keyby = .(a)] – group and
dt[i, j, by] simultaneously sort rows by values
in specified columns.
Take data.table dt, b c b c dt[, .(b, c)] – extract columns by name.
subset rows using i COMMON GROUPED OPERATIONS
and manipulate columns with j,
grouped according to by. dt[, .(c = sum(b)), by = a] – summarize rows within groups.
data.tables are also data frames – functions that work with data SUMMARIZE dt[, c := sum(b), by = a] – create a new column and compute rows
frames therefore also work with data.tables. within groups.
a x dt[, .(x = sum(a))] – create a data.table with new
columns based on the summarized values of rows.
dt[, .SD[1], by = a] – extract first row of groups.
Create a data.table Summary functions like mean(), median(), min(),

max(), etc. can be used to summarize rows. dt[, .SD[.N], by = a] – extract last row of groups.
data.table(a = c(1, 2), b = c("a", "b")) – create a data.table from

scratch. Analogous to data.frame(). COMPUTE COLUMNS*
Chaining
setDT(df)* or as.data.table(df) – convert a data frame or a list to c dt[, c := 1 + 2] – compute a column based on
a data.table.
3 an expression. dt[…][…] – perform a sequence of data.table operations by
3
chaining multiple “[]”.
a a c dt[a == 1, c := 1 + 2] – compute a column

Subset rows using i 2
1
2
1
NA
3
based on an expression but only for a subset
of rows. Functions for data.tables
dt[1:2, ] – subset rows based on row numbers.
c d dt[, `:=`(c = 1 , d = 2)] – compute multiple
1 2 columns based on separate expressions. REORDER
1 2
a b a b setorder(dt, a, -b) – reorder a data.table
1 2 1 2 according to specified columns. Prefix column
a a dt[a > 5, ] – subset rows based on values in DELETE COLUMN 2 2 1 1 names with “-” for descending order.
1 1 2 2
2 6 one or more columns.
6 c dt[, c := NULL] – delete a column.
5
* SET FUNCTIONS AND :=

LOGICAL OPERATORS TO USE IN i CONVERT COLUMN TYPE data.table’s functions prefixed with “set” and the operator “:=”
work without “<-” to alter data without making copies in
< <= is.na() %in% | %like% b b dt[, b := as.integer(b)] – convert the type of a memory. E.g., the more efficient “setDT(df)” is analogous to
> >= !is.na() ! & %between% 1.5 1 column using as.integer(), as.numeric(), “df <- as.data.table(df)”.
2.6 2 as.character(), as.Date(), etc..
CC BY SA Erik Petrovski • www.petrovski.dk • Learn more with the data.table homepage or vignette • data.table version 1.11.8 • Updated: 2019-01
UNIQUE ROWS
unique(dt, by = c("a", "b")) – extract unique
BIND
Apply function to cols.
a b a b a b a b a b rbind(dt_a, dt_b) – combine rows of two
1 2 1 2 rows based on columns specified in “by”. + = data.tables.
2 2 2 2 Leave out “by” to use all columns. APPLY A FUNCTION TO MULTIPLE COLUMNS
1 2
a b a b dt[, lapply(.SD, mean), .SDcols = c("a", "b")] –
uniqueN(dt, by = c("a", "b")) – count the number of unique rows
1 4 2 5 apply a function – e.g. mean(), as.character(),
based on columns specified in “by”. a b x y a b x y cbind(dt_a, dt_b) – combine columns
2 5 which.max() – to columns specified in .SDcols
of two data.tables.
3 6 with lapply() and the .SD symbol. Also works
+ = with groups.
RENAME COLUMNS
a a a_m cols <- c("a")
a b x y setnames(dt, c("a", "b"), c("x", "y")) – rename 1 1 2 dt[, paste0(cols, "_m") := lapply(.SD, mean),
columns. .SDcols = cols] – apply a function to specified
Reshape a data.table
2 2 2
3 3 2 columns and assign the result with suffixed
variable names to the original data.
SET KEYS RESHAPE TO WIDE FORMAT
setkey(dt, a, b) – set keys to enable fast repeated lookup in
specified columns using “dt[.(value), ]” or for merging without id y a b id a_x a_z b_x b_z dcast(dt, Sequential rows
specifying merging columns using “dt_a[dt_b]”. A x 1 3 A 1 2 3 4 id ~ y,
A z 2 4 B 1 2 3 4
B x 1 3
value.var = c("a", "b")) ROW IDS
B z 2 4
dt[, c := 1:.N, by = b] – within groups, compute a
Combine data.tables
a b a b c
Reshape a data.table from long to wide format. 1 a 1 a 1 column with sequential row IDs.
2 a 2 a 2
dt A data.table. 3 b 3 b 1
JOIN id ~ y Formula with a LHS: ID columns containing IDs for
multiple entries. And a RHS: columns with values to
LAG & LEAD
a b x y a b x dt_a[dt_b, on = .(b = y)] – join spread in column headers.
1 c 3 b 3 b 3 data.tables on rows with equal values. value.var Columns containing values to fill into cells. dt[, c := shift(a, 1), by = b] – within groups,
2 a + 2 c = 1 c 2
a
1
b
a
a
1
b
a
c
NA duplicate a column with rows lagged by
3 b 1 a 2 a 1 2 a 2 a 1 specified amount.
RESHAPE TO LONG FORMAT 3 b 3 b NA
4 b 4 b 3
a b c x y z a b c x dt_a[dt_b, on = .(b = y, c > z)] – id a_x a_z b_x b_z id y a b melt(dt, 5 b 5 b 4 dt[, c := shift(a, 1, type = "lead"), by = b] –
1 c 7 3 b 4 3 b 4 3 join data.tables on rows with within groups, duplicate a column with rows
+ = id.vars = c("id"),
A 1 2 3 4 A 1 1 3
2 a 5 2 c 5 1 c 5 2 equal and unequal values. B 1 2 3 4 B 1 1 3 leading by specified amount.
3 b 6 1 a 8 NA a 8 1 A 2 2 4 measure.vars = patterns("^a", "^b"),
B 2 2 4 variable.name = "y",
value.name = c("a", "b"))
ROLLING JOIN read & write files
a id date b id date a id date b
Reshape a data.table from wide to long format.
1 A 01-01-2010 + 1 A 01-01-2013 = 2 A 01-01-2013 1 dt A data.table. IMPORT
2 A 01-01-2012 1 B 01-01-2013 2 B 01-01-2013 1 id.vars ID columns with IDs for multiple entries.
3 A 01-01-2014 measure.vars Columns containing values to fill into cells (often in fread("file.csv") – read data from a flat file such as .csv or .tsv into R.
1 B 01-01-2010
pattern form).
2 B 01-01-2012
variable.name, Names of new columns for variables and values fread("file.csv", select = c("a", "b")) – read specified columns from a
value.name derived from old headers. flat file into R.
dt_a[dt_b, on = .(id = id, date = date), roll = TRUE] – join
data.tables on matching rows in id columns but only keep the most
recent preceding match with the left data.table according to date
columns. “roll = -Inf” reverses direction. EXPORT
fwrite(dt, "file.csv") – write data to a flat file from R.
CC BY SA Erik Petrovski • www.petrovski.dk • Learn more with the data.table homepage or vignette • data.table version 1.11.8 • Updated: 2019-01

Data Transformation With Data - Table: Cheat Sheet

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Data Transformation With Data - Table: Cheat Sheet

Uploaded by

Copyright:

Available Formats

Data Transformation with data.

table : : CHEAT SHEET

Create a data.table Summary functions like mean(), median(), min(),

data.table(a = c(1, 2), b = c("a", "b")) – create a data.table from

a a c dt[a == 1, c := 1 + 2] – compute a column

* SET FUNCTIONS AND :=

You might also like