You are on page 1of 9

Understanding data.

table Rolling Joins


https://r-norberg.blogspot.com/2016/06/understanding-datatable-rolling-joins.html?m=1

Jun
6

Understanding data.table Rolling Joins


Robert Norberg

June 5, 2016

Introduction

Rolling joins in  data.table are incredibly useful, but not that well
documented. I wrote this to help myself figure out how to use them and
perhaps it can help you too.

library(data.table)

The Setup

Imagine we have an eCommerce website that uses a third party (like


PayPal) to handle payments. We track user sessions on our website and
PayPal tracks our payments, and we would like to attribute sales to user
sessions. This way we can answer all kinds of questions along the lines of
“What types of sessions lead to sales?”
Lets create some example data for a diverse group of customers.
Indecisive Isabel shops a lot before she buys anything.

isabel_website <- data.table(name = rep('Indecisive Isabel', 5),


session_start_time = as.POSIXct(c('2016-01-01 11:01', '2016-01-02 8
isabel_paypal <- data.table(name = 'Indecisive Isabel', purchase_time = as.POSIXct('2016

Spendy Sally visits the website once and makes multiple purchases.

sally_website <- data.table(name = 'Spendy Sally', session_start_time = as.POSIXct('2016


sally_paypal <- data.table(name = rep('Spendy Sally', 2), purchase_time = as.POSIXct(c('
Frequent Francis comes to the site a lot and sometimes buys things.

francis_website <- data.table(name = rep('Frequent Francis', 6),


session_start_time = as.POSIXct(c('2016-01-02 13:09', '2016-01-03 1
francis_paypal <- data.table(name = rep('Frequent Francis', 3), purchase_time = as.POSIX

Error-prone Erica has mysteriously managed to make a purchase before


ever visiting the site!

erica_website <- data.table(name = rep('Error-prone Erica', 2),


session_start_time = as.POSIXct(c('2016-01-04 19:12', '2016-01-04 2
erica_paypal <- data.table(name = 'Error-prone Erica', purchase_time = as.POSIXct('2016-

Visitor Vivian visits our website a couple times, but never makes a purchase
(so she appears in the website data, but not in the payment data).

vivian_website <- data.table(name = rep('Visitor Vivian', 2),


session_start_time = as.POSIXct(c('2016-01-01 9:10', '2016-01-09 2:
vivian_paypal <- erica_paypal[0]

And Mom sent money to my PayPal account before my website was up and
running (so she appears in the payment data, but not in the website data).

mom_website <- vivian_website[0]


mom_paypal <- data.table(name = 'Mom', purchase_time = as.POSIXct('2015-12-02 17:58'))

Combine these into two  data.tables:

website <- rbindlist(list(isabel_website, sally_website, francis_website, erica_website,


paypal <- rbindlist(list(isabel_paypal, sally_paypal, francis_paypal, erica_paypal, vivi

To keep things straight, lets give each website session a unique ID and each
payment a unique ID.

website[, session_id:=.GRP, by = .(name, session_start_time)]


paypal[, payment_id:=.GRP, by = .(name, purchase_time)]

Finally, have a look at the data.

website

## name session_start_time session_id


## 1: Indecisive Isabel 2016-01-01 11:01:00 1
## 2: Indecisive Isabel 2016-01-02 08:59:00 2
## 3: Indecisive Isabel 2016-01-05 18:18:00 3
## 4: Indecisive Isabel 2016-01-07 19:03:00 4
## 5: Indecisive Isabel 2016-01-08 19:01:00 5
## 6: Spendy Sally 2016-01-03 10:00:00 6
## 7: Frequent Francis 2016-01-02 13:09:00 7
## 8: Frequent Francis 2016-01-03 19:22:00 8
## 9: Frequent Francis 2016-01-08 08:44:00 9
## 10: Frequent Francis 2016-01-08 20:22:00 10
## 11: Frequent Francis 2016-01-10 17:36:00 11
## 12: Frequent Francis 2016-01-15 16:56:00 12
## 13: Error-prone Erica 2016-01-04 19:12:00 13
## 14: Error-prone Erica 2016-01-04 21:05:00 14
## 15: Visitor Vivian 2016-01-01 09:10:00 15
## 16: Visitor Vivian 2016-01-09 02:15:00 16

paypal

## name purchase_time payment_id


## 1: Indecisive Isabel 2016-01-08 19:10:00 1
## 2: Spendy Sally 2016-01-03 10:06:00 2
## 3: Spendy Sally 2016-01-03 10:15:00 3
## 4: Frequent Francis 2016-01-03 19:28:00 4
## 5: Frequent Francis 2016-01-08 20:33:00 5
## 6: Frequent Francis 2016-01-10 17:46:00 6
## 7: Error-prone Erica 2016-01-03 08:02:00 7
## 8: Mom 2015-12-02 17:58:00 8

The Joins

Before doing any rolling joins, I like to create a separate date/time column
in each table to join on because one of the two tables loses it’s date/time
field and I can never remember which.

website[, join_time:=session_start_time]
paypal[, join_time:=purchase_time]

Next, set keys on each table. The last key column is the one the rolling join
will “roll” on. We want to first join on  name and then within each  name,
match website sessions to purchases. So we key on  name first, then on the
newly created  join_time.

setkey(website, name, join_time)


setkey(paypal, name, join_time)

Rolling Forward

Now let’s answer the question “what website session immediately preceded
each payment?”
website[paypal, roll = T]

## name session_start_time session_id join_time


## 1: Error-prone Erica <NA> NA 2016-01-03 08:02:00
## 2: Frequent Francis 2016-01-03 19:22:00 8 2016-01-03 19:28:00
## 3: Frequent Francis 2016-01-08 20:22:00 10 2016-01-08 20:33:00
## 4: Frequent Francis 2016-01-10 17:36:00 11 2016-01-10 17:46:00
## 5: Indecisive Isabel 2016-01-08 19:01:00 5 2016-01-08 19:10:00
## 6: Mom <NA> NA 2015-12-02 17:58:00
## 7: Spendy Sally 2016-01-03 10:00:00 6 2016-01-03 10:06:00
## 8: Spendy Sally 2016-01-03 10:00:00 6 2016-01-03 10:15:00
## purchase_time payment_id
## 1: 2016-01-03 08:02:00 7
## 2: 2016-01-03 19:28:00 4
## 3: 2016-01-08 20:33:00 5
## 4: 2016-01-10 17:46:00 6
## 5: 2016-01-08 19:10:00 1
## 6: 2015-12-02 17:58:00 8
## 7: 2016-01-03 10:06:00 2
## 8: 2016-01-03 10:15:00 3

Notice several things about this result:

Each payment is matched to the closest preceding


payment. all(purchase_time > session_start_time, na.rm = T) evaluates
to TRUE.
Payments with no preceding sessions still appear in the result (that
is, nrow(result) == nrow(paypal)).
Visitor Vivian does not appear in the results because she does not
appear in the paypal table.
Mom’s “purchase” has no website session associated with it because she
has never visited the website at all.
Error-prone Erica’s mysterious purchase has no website session
associated with it because she never visited the website prior to her
purchase.
Spendy Sally’s one website session is matched to both of her purchases.

Rolling Backward

Now lets switch the order of the two tables and answer the question “which
sessions led to a purchase?” In this case, we want to match payments to
website sessions, so long as the payment occurred  after the beginning of
the website session.

paypal[website, roll = -Inf]

## name purchase_time payment_id join_time


## 1: Error-prone Erica <NA> NA 2016-01-04 19:12:00
## 2: Error-prone Erica <NA> NA 2016-01-04 21:05:00
## 3: Frequent Francis 2016-01-03 19:28:00 4 2016-01-02 13:09:00
## 4: Frequent Francis 2016-01-03 19:28:00 4 2016-01-03 19:22:00
## 5: Frequent Francis 2016-01-08 20:33:00 5 2016-01-08 08:44:00
## 6: Frequent Francis 2016-01-08 20:33:00 5 2016-01-08 20:22:00
## 7: Frequent Francis 2016-01-10 17:46:00 6 2016-01-10 17:36:00
## 8: Frequent Francis <NA> NA 2016-01-15 16:56:00
## 9: Indecisive Isabel 2016-01-08 19:10:00 1 2016-01-01 11:01:00
## 10: Indecisive Isabel 2016-01-08 19:10:00 1 2016-01-02 08:59:00
## 11: Indecisive Isabel 2016-01-08 19:10:00 1 2016-01-05 18:18:00
## 12: Indecisive Isabel 2016-01-08 19:10:00 1 2016-01-07 19:03:00
## 13: Indecisive Isabel 2016-01-08 19:10:00 1 2016-01-08 19:01:00
## 14: Spendy Sally 2016-01-03 10:06:00 2 2016-01-03 10:00:00
## 15: Visitor Vivian <NA> NA 2016-01-01 09:10:00
## 16: Visitor Vivian <NA> NA 2016-01-09 02:15:00
## session_start_time session_id
## 1: 2016-01-04 19:12:00 13
## 2: 2016-01-04 21:05:00 14
## 3: 2016-01-02 13:09:00 7
## 4: 2016-01-03 19:22:00 8
## 5: 2016-01-08 08:44:00 9
## 6: 2016-01-08 20:22:00 10
## 7: 2016-01-10 17:36:00 11
## 8: 2016-01-15 16:56:00 12
## 9: 2016-01-01 11:01:00 1
## 10: 2016-01-02 08:59:00 2
## 11: 2016-01-05 18:18:00 3
## 12: 2016-01-07 19:03:00 4
## 13: 2016-01-08 19:01:00 5
## 14: 2016-01-03 10:00:00 6
## 15: 2016-01-01 09:10:00 15
## 16: 2016-01-09 02:15:00 16

In this result

Each website session is match to the nearest following


payment. all(session_start_time > purchase_time, na.rm = T) evaluates
toTRUE.
Mom does not appear because she has no record in the website table.
Visitor Vivian’s sessions are not matched to any purchases because she
hasn’t purchased anything.
Neither of Erica’s website sessions are matched to her purchase because
it took place before both sessions.
Frequent Francis’s most recent session isn’t matched to a purchase
because she hasn’t made a purchase after that session.
All of Indecisive Isabel’s sessions are matched to her one purchase. In
fact, several purchases appear more than once.

Rolling Windows

What if we wanted to add an additional criteria to the rolling join above:


match payments to website sessions, so long as the payment occurred after
the beginning of the website session  and within 12 hours of the website
session?

twelve_hours <- 60*60*20


paypal[website, roll = -twelve_hours]

## name purchase_time payment_id join_time


## 1: Error-prone Erica <NA> NA 2016-01-04 19:12:00
## 2: Error-prone Erica <NA> NA 2016-01-04 21:05:00
## 3: Frequent Francis <NA> NA 2016-01-02 13:09:00
## 4: Frequent Francis 2016-01-03 19:28:00 4 2016-01-03 19:22:00
## 5: Frequent Francis 2016-01-08 20:33:00 5 2016-01-08 08:44:00
## 6: Frequent Francis 2016-01-08 20:33:00 5 2016-01-08 20:22:00
## 7: Frequent Francis 2016-01-10 17:46:00 6 2016-01-10 17:36:00
## 8: Frequent Francis <NA> NA 2016-01-15 16:56:00
## 9: Indecisive Isabel <NA> NA 2016-01-01 11:01:00
## 10: Indecisive Isabel <NA> NA 2016-01-02 08:59:00
## 11: Indecisive Isabel <NA> NA 2016-01-05 18:18:00
## 12: Indecisive Isabel <NA> NA 2016-01-07 19:03:00
## 13: Indecisive Isabel 2016-01-08 19:10:00 1 2016-01-08 19:01:00
## 14: Spendy Sally 2016-01-03 10:06:00 2 2016-01-03 10:00:00
## 15: Visitor Vivian <NA> NA 2016-01-01 09:10:00
## 16: Visitor Vivian <NA> NA 2016-01-09 02:15:00
## session_start_time session_id
## 1: 2016-01-04 19:12:00 13
## 2: 2016-01-04 21:05:00 14
## 3: 2016-01-02 13:09:00 7
## 4: 2016-01-03 19:22:00 8
## 5: 2016-01-08 08:44:00 9
## 6: 2016-01-08 20:22:00 10
## 7: 2016-01-10 17:36:00 11
## 8: 2016-01-15 16:56:00 12
## 9: 2016-01-01 11:01:00 1
## 10: 2016-01-02 08:59:00 2
## 11: 2016-01-05 18:18:00 3
## 12: 2016-01-07 19:03:00 4
## 13: 2016-01-08 19:01:00 5
## 14: 2016-01-03 10:00:00 6
## 15: 2016-01-01 09:10:00 15
## 16: 2016-01-09 02:15:00 16

Now Indecisive Isabel’s last session only is associated with a purchase.

The rollends Argument

Recall the first join from above, matching the preceding website session to
each payment.

website[paypal, roll = T]

## name session_start_time session_id join_time


## 1: Error-prone Erica <NA> NA 2016-01-03 08:02:00
## 2: Frequent Francis 2016-01-03 19:22:00 8 2016-01-03 19:28:00
## 3: Frequent Francis 2016-01-08 20:22:00 10 2016-01-08 20:33:00
## 4: Frequent Francis 2016-01-10 17:36:00 11 2016-01-10 17:46:00
## 5: Indecisive Isabel 2016-01-08 19:01:00 5 2016-01-08 19:10:00
## 6: Mom <NA> NA 2015-12-02 17:58:00
## 7: Spendy Sally 2016-01-03 10:00:00 6 2016-01-03 10:06:00
## 8: Spendy Sally 2016-01-03 10:00:00 6 2016-01-03 10:15:00
## purchase_time payment_id
## 1: 2016-01-03 08:02:00 7
## 2: 2016-01-03 19:28:00 4
## 3: 2016-01-08 20:33:00 5
## 4: 2016-01-10 17:46:00 6
## 5: 2016-01-08 19:10:00 1
## 6: 2015-12-02 17:58:00 8
## 7: 2016-01-03 10:06:00 2
## 8: 2016-01-03 10:15:00 3

What if we want the rolling join to handle Error-prone Erica’s case


differently? Perhaps in cases like hers, where there is a purchase with no
preceding session, we prefer the user’s first website session to be matched
to the offending purchase. We can use the  rollends argument for this. From
the  data.table documentation ( ?data.table),

rollends[1]=TRUE will roll the first value backwards if the value is before it

website[paypal, roll = T, rollends = c(T, T)]

## name session_start_time session_id join_time


## 1: Error-prone Erica 2016-01-04 19:12:00 13 2016-01-03 08:02:00
## 2: Frequent Francis 2016-01-03 19:22:00 8 2016-01-03 19:28:00
## 3: Frequent Francis 2016-01-08 20:22:00 10 2016-01-08 20:33:00
## 4: Frequent Francis 2016-01-10 17:36:00 11 2016-01-10 17:46:00
## 5: Indecisive Isabel 2016-01-08 19:01:00 5 2016-01-08 19:10:00
## 6: Mom <NA> NA 2015-12-02 17:58:00
## 7: Spendy Sally 2016-01-03 10:00:00 6 2016-01-03 10:06:00
## 8: Spendy Sally 2016-01-03 10:00:00 6 2016-01-03 10:15:00
## purchase_time payment_id
## 1: 2016-01-03 08:02:00 7
## 2: 2016-01-03 19:28:00 4
## 3: 2016-01-08 20:33:00 5
## 4: 2016-01-10 17:46:00 6
## 5: 2016-01-08 19:10:00 1
## 6: 2015-12-02 17:58:00 8
## 7: 2016-01-03 10:06:00 2
## 8: 2016-01-03 10:15:00 3

In this result, Erica’s first session is matched to her purchase, even though
the session was after her purchase. Mom’s “purchase” still has no matching
session because Mom does not appear in the  website table. So 
all(purchase_time > session_start_time, na.rm = T) no longer evaluates to 
TRUE.
What if we want to perform the same join as above, but only returning
matches for payments with sessions before  and after?

website[paypal, roll = T, rollends = c(F, F)]

## name session_start_time session_id join_time


## 1: Error-prone Erica <NA> NA 2016-01-03 08:02:00
## 2: Frequent Francis 2016-01-03 19:22:00 8 2016-01-03 19:28:00
## 3: Frequent Francis 2016-01-08 20:22:00 10 2016-01-08 20:33:00
## 4: Frequent Francis 2016-01-10 17:36:00 11 2016-01-10 17:46:00
## 5: Indecisive Isabel <NA> NA 2016-01-08 19:10:00
## 6: Mom <NA> NA 2015-12-02 17:58:00
## 7: Spendy Sally <NA> NA 2016-01-03 10:06:00
## 8: Spendy Sally <NA> NA 2016-01-03 10:15:00
## purchase_time payment_id
## 1: 2016-01-03 08:02:00 7
## 2: 2016-01-03 19:28:00 4
## 3: 2016-01-08 20:33:00 5
## 4: 2016-01-10 17:46:00 6
## 5: 2016-01-08 19:10:00 1
## 6: 2015-12-02 17:58:00 8
## 7: 2016-01-03 10:06:00 2
## 8: 2016-01-03 10:15:00 3

In this result, the purchases of Error-prone Erica and Mom are unmatched
because they have no  preceding sessions, and Spendy Sally’s two purchases
are unmatched because they have no  following website session.
Note that when  roll is set to a negative number, the meaning of the two 
rollends elements kind of flip-flops:

website[paypal, roll = -Inf, rollends = c(F, T)]


## name session_start_time session_id join_time
## 1: Error-prone Erica <NA> NA 2016-01-03 08:02:00
## 2: Frequent Francis 2016-01-08 08:44:00 9 2016-01-03 19:28:00
## 3: Frequent Francis 2016-01-10 17:36:00 11 2016-01-08 20:33:00
## 4: Frequent Francis 2016-01-15 16:56:00 12 2016-01-10 17:46:00
## 5: Indecisive Isabel 2016-01-08 19:01:00 5 2016-01-08 19:10:00
## 6: Mom <NA> NA 2015-12-02 17:58:00
## 7: Spendy Sally 2016-01-03 10:00:00 6 2016-01-03 10:06:00
## 8: Spendy Sally 2016-01-03 10:00:00 6 2016-01-03 10:15:00
## purchase_time payment_id
## 1: 2016-01-03 08:02:00 7
## 2: 2016-01-03 19:28:00 4
## 3: 2016-01-08 20:33:00 5
## 4: 2016-01-10 17:46:00 6
## 5: 2016-01-08 19:10:00 1
## 6: 2015-12-02 17:58:00 8
## 7: 2016-01-03 10:06:00 2
## 8: 2016-01-03 10:15:00 3

In this example,

Each payment is matched to the nearest following website session


(because of roll = -Inf).
Error-prone Erica’s purchase is not matched to the following session
because there is no previous session (due to rollends[1] = F).
Spendy Sally’s purchases are joined to her most recent website session,
even though it occurred before her purchases (because ofrollends[2] =
T).

You might also like