June 5, 2016
Introduction
Rolling joins in data.table are incredibly useful, but not that well
documented. I wrote this to help myself figure out how to use them and
perhaps it can help you too.
library(data.table)
The Setup
Spendy Sally visits the website once and makes multiple purchases.
Visitor Vivian visits our website a couple times, but never makes a purchase
(so she appears in the website data, but not in the payment data).
And Mom sent money to my PayPal account before my website was up and
running (so she appears in the payment data, but not in the website data).
To keep things straight, let’s give each website session a unique ID and each
payment a unique ID.
website
paypal
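To make the setup concrete, here is a minimal sketch of how two such tables could be built. All names, times, and IDs are made-up stand-ins rather than the real data, and Rita is a hypothetical extra customer (two sessions with one purchase in between) added so that every kind of join has something to match.

```r
library(data.table)

# Hypothetical stand-in data: every value below is invented for illustration.
utc <- function(x) as.POSIXct(x, tz = "UTC")

website <- data.table(
  name = c("Sally", "Vivian", "Vivian", "Erica", "Rita", "Rita"),
  session_start_time = utc(c("2016-01-03 10:00:00", "2016-01-01 09:00:00",
                             "2016-01-09 14:00:00", "2016-01-04 19:00:00",
                             "2016-01-02 08:00:00", "2016-01-06 08:00:00")),
  session_id = 1:6
)

paypal <- data.table(
  name = c("Sally", "Sally", "Erica", "Mom", "Rita"),
  purchase_time = utc(c("2016-01-03 10:06:00", "2016-01-03 10:15:00",
                        "2016-01-03 08:02:00", "2015-12-02 17:58:00",
                        "2016-01-04 12:00:00")),
  payment_id = 1:5
)

print(website)
print(paypal)
```

Vivian only shows up in website, Mom only shows up in paypal, and Erica's one session starts after her purchase.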
The Joins
Before doing any rolling joins, I like to create a separate date/time column
in each table to join on because one of the two tables loses its date/time
field and I can never remember which.
website[, join_time:=session_start_time]
paypal[, join_time:=purchase_time]
Next, set keys on each table. The last key column is the one the rolling join
will “roll” on. We want to first join on name and then within each name,
match website sessions to purchases. So we key on name first, then on the
newly created join_time.
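The keying step described above can be sketched with setkey(). The two tables here are made-up stand-ins (all values invented, Rita is a hypothetical extra customer), so treat this as an illustration rather than the real data.

```r
library(data.table)

# Hypothetical stand-in data: every value below is invented for illustration.
utc <- function(x) as.POSIXct(x, tz = "UTC")
website <- data.table(
  name = c("Sally", "Vivian", "Vivian", "Erica", "Rita", "Rita"),
  session_start_time = utc(c("2016-01-03 10:00:00", "2016-01-01 09:00:00",
                             "2016-01-09 14:00:00", "2016-01-04 19:00:00",
                             "2016-01-02 08:00:00", "2016-01-06 08:00:00")),
  session_id = 1:6
)
paypal <- data.table(
  name = c("Sally", "Sally", "Erica", "Mom", "Rita"),
  purchase_time = utc(c("2016-01-03 10:06:00", "2016-01-03 10:15:00",
                        "2016-01-03 08:02:00", "2015-12-02 17:58:00",
                        "2016-01-04 12:00:00")),
  payment_id = 1:5
)

# Copy the time columns so the originals survive the join.
website[, join_time := session_start_time]
paypal[, join_time := purchase_time]

# Key on name first; join_time goes last because the roll happens on the last key column.
setkey(website, name, join_time)
setkey(paypal, name, join_time)

key(website)
key(paypal)
```

setkey() also sorts each table by its key columns, so the row order changes after this step.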
Rolling Forward
Now let’s answer the question “what website session immediately preceded
each payment?”
website[paypal, roll = TRUE]
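Here is a self-contained sketch of that forward roll on made-up stand-in tables (all values invented, Rita is a hypothetical extra customer). It returns one row per payment, with the most recent session at or before the purchase attached where one exists.

```r
library(data.table)

# Hypothetical stand-in data: every value below is invented for illustration.
utc <- function(x) as.POSIXct(x, tz = "UTC")
website <- data.table(
  name = c("Sally", "Vivian", "Vivian", "Erica", "Rita", "Rita"),
  session_start_time = utc(c("2016-01-03 10:00:00", "2016-01-01 09:00:00",
                             "2016-01-09 14:00:00", "2016-01-04 19:00:00",
                             "2016-01-02 08:00:00", "2016-01-06 08:00:00")),
  session_id = 1:6
)
paypal <- data.table(
  name = c("Sally", "Sally", "Erica", "Mom", "Rita"),
  purchase_time = utc(c("2016-01-03 10:06:00", "2016-01-03 10:15:00",
                        "2016-01-03 08:02:00", "2015-12-02 17:58:00",
                        "2016-01-04 12:00:00")),
  payment_id = 1:5
)
website[, join_time := session_start_time]
paypal[, join_time := purchase_time]
setkey(website, name, join_time)
setkey(paypal, name, join_time)

# For each payment, roll the last session at or before it forward.
res <- website[paypal, roll = TRUE]
print(res[, .(name, session_start_time, session_id, purchase_time, payment_id)])

# Every matched session started before its purchase.
all(res$purchase_time > res$session_start_time, na.rm = TRUE)
```

With this data, both of Sally's payments pick up her single session, Rita's payment picks up her first session, and Erica and Mom get NA because no session precedes their payments.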
Rolling Backward
Now let’s switch the order of the two tables and answer the question “which
sessions led to a purchase?” In this case, we want to match payments to
website sessions, so long as the payment occurred after the beginning of
the website session.
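One way to express this is a backward roll with roll = -Inf, which matches each session to the next payment at or after the session start. The sketch below assumes that is the intended call and runs it on made-up stand-in tables (all values invented, Rita is a hypothetical extra customer).

```r
library(data.table)

# Hypothetical stand-in data: every value below is invented for illustration.
utc <- function(x) as.POSIXct(x, tz = "UTC")
website <- data.table(
  name = c("Sally", "Vivian", "Vivian", "Erica", "Rita", "Rita"),
  session_start_time = utc(c("2016-01-03 10:00:00", "2016-01-01 09:00:00",
                             "2016-01-09 14:00:00", "2016-01-04 19:00:00",
                             "2016-01-02 08:00:00", "2016-01-06 08:00:00")),
  session_id = 1:6
)
paypal <- data.table(
  name = c("Sally", "Sally", "Erica", "Mom", "Rita"),
  purchase_time = utc(c("2016-01-03 10:06:00", "2016-01-03 10:15:00",
                        "2016-01-03 08:02:00", "2015-12-02 17:58:00",
                        "2016-01-04 12:00:00")),
  payment_id = 1:5
)
website[, join_time := session_start_time]
paypal[, join_time := purchase_time]
setkey(website, name, join_time)
setkey(paypal, name, join_time)

# For each session, roll the next payment backward onto it (one row per session).
res <- paypal[website, roll = -Inf]
print(res[, .(name, session_start_time, session_id, purchase_time, payment_id)])
```

The output has one row per website session. Sessions with no later payment, including both of Vivian's, come back with NA in the payment columns.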
In this result
Rolling Windows
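The roll argument also accepts a finite number, which limits how far a value may roll. Because POSIXct times are stored as seconds, the window is given in seconds. Here is a minimal sketch, assuming a 12-hour window (an arbitrary choice for illustration) and made-up stand-in tables (all values invented, Rita is a hypothetical extra customer).

```r
library(data.table)

# Hypothetical stand-in data: every value below is invented for illustration.
utc <- function(x) as.POSIXct(x, tz = "UTC")
website <- data.table(
  name = c("Sally", "Vivian", "Vivian", "Erica", "Rita", "Rita"),
  session_start_time = utc(c("2016-01-03 10:00:00", "2016-01-01 09:00:00",
                             "2016-01-09 14:00:00", "2016-01-04 19:00:00",
                             "2016-01-02 08:00:00", "2016-01-06 08:00:00")),
  session_id = 1:6
)
paypal <- data.table(
  name = c("Sally", "Sally", "Erica", "Mom", "Rita"),
  purchase_time = utc(c("2016-01-03 10:06:00", "2016-01-03 10:15:00",
                        "2016-01-03 08:02:00", "2015-12-02 17:58:00",
                        "2016-01-04 12:00:00")),
  payment_id = 1:5
)
website[, join_time := session_start_time]
paypal[, join_time := purchase_time]
setkey(website, name, join_time)
setkey(paypal, name, join_time)

# Match each session to the next payment, but only if it comes within 12 hours.
twelve_hours <- 12 * 60 * 60  # window length in seconds (assumed for illustration)
res <- paypal[website, roll = -twelve_hours]
print(res[, .(name, session_start_time, session_id, purchase_time, payment_id)])
```

With this data only Sally's session is matched. Rita's purchase comes more than two days after her first session, so it falls outside the window.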
The rollends Argument
Recall the first join from above, matching the preceding website session to
each payment.
website[paypal, roll = TRUE]
Setting rollends[1] = TRUE rolls the first value in each group backward when the value being matched falls before it.
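A minimal sketch of that call, assuming rollends = c(TRUE, TRUE) is the setting in question, on made-up stand-in tables (all values invented, Rita is a hypothetical extra customer):

```r
library(data.table)

# Hypothetical stand-in data: every value below is invented for illustration.
utc <- function(x) as.POSIXct(x, tz = "UTC")
website <- data.table(
  name = c("Sally", "Vivian", "Vivian", "Erica", "Rita", "Rita"),
  session_start_time = utc(c("2016-01-03 10:00:00", "2016-01-01 09:00:00",
                             "2016-01-09 14:00:00", "2016-01-04 19:00:00",
                             "2016-01-02 08:00:00", "2016-01-06 08:00:00")),
  session_id = 1:6
)
paypal <- data.table(
  name = c("Sally", "Sally", "Erica", "Mom", "Rita"),
  purchase_time = utc(c("2016-01-03 10:06:00", "2016-01-03 10:15:00",
                        "2016-01-03 08:02:00", "2015-12-02 17:58:00",
                        "2016-01-04 12:00:00")),
  payment_id = 1:5
)
website[, join_time := session_start_time]
paypal[, join_time := purchase_time]
setkey(website, name, join_time)
setkey(paypal, name, join_time)

# Forward roll, but also let the first session in each group roll backward.
res <- website[paypal, roll = TRUE, rollends = c(TRUE, TRUE)]
print(res[, .(name, session_start_time, session_id, purchase_time, payment_id)])

# FALSE here: one matched session now starts after its purchase.
all(res$purchase_time > res$session_start_time, na.rm = TRUE)
```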
In this result, Erica’s first session is matched to her purchase, even though
the session was after her purchase. Mom’s “purchase” still has no matching
session because Mom does not appear in the website table. So
all(purchase_time > session_start_time, na.rm = TRUE) no longer evaluates to
TRUE.
What if we want to perform the same join as above, but return matches only
for payments that have a session both before and after?
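Setting both rollends elements to FALSE does this. Here is a minimal sketch on made-up stand-in tables (all values invented). Rita is a hypothetical extra customer whose purchase sits between two sessions, so she is the only one who still matches.

```r
library(data.table)

# Hypothetical stand-in data: every value below is invented for illustration.
utc <- function(x) as.POSIXct(x, tz = "UTC")
website <- data.table(
  name = c("Sally", "Vivian", "Vivian", "Erica", "Rita", "Rita"),
  session_start_time = utc(c("2016-01-03 10:00:00", "2016-01-01 09:00:00",
                             "2016-01-09 14:00:00", "2016-01-04 19:00:00",
                             "2016-01-02 08:00:00", "2016-01-06 08:00:00")),
  session_id = 1:6
)
paypal <- data.table(
  name = c("Sally", "Sally", "Erica", "Mom", "Rita"),
  purchase_time = utc(c("2016-01-03 10:06:00", "2016-01-03 10:15:00",
                        "2016-01-03 08:02:00", "2015-12-02 17:58:00",
                        "2016-01-04 12:00:00")),
  payment_id = 1:5
)
website[, join_time := session_start_time]
paypal[, join_time := purchase_time]
setkey(website, name, join_time)
setkey(paypal, name, join_time)

# Forward roll with neither end allowed to roll past the first or last session.
res <- website[paypal, roll = TRUE, rollends = c(FALSE, FALSE)]
print(res[, .(name, session_start_time, session_id, purchase_time, payment_id)])
```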
In this result, the purchases of Error-prone Erica and Mom are unmatched
because they have no preceding sessions, and Spendy Sally’s two purchases
are unmatched because they have no following website session.
Note that when roll is set to a negative number, the meaning of the two
rollends elements kind of flip-flops:
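As I read the data.table documentation, the default is rollends = c(FALSE, TRUE) when roll is positive and c(TRUE, FALSE) when roll is negative. The sketch below checks that on made-up stand-in tables (all values invented, Rita is a hypothetical extra customer) and then sets both ends to TRUE, which lets the last payment in a group roll forward onto a later session.

```r
library(data.table)

# Hypothetical stand-in data: every value below is invented for illustration.
utc <- function(x) as.POSIXct(x, tz = "UTC")
website <- data.table(
  name = c("Sally", "Vivian", "Vivian", "Erica", "Rita", "Rita"),
  session_start_time = utc(c("2016-01-03 10:00:00", "2016-01-01 09:00:00",
                             "2016-01-09 14:00:00", "2016-01-04 19:00:00",
                             "2016-01-02 08:00:00", "2016-01-06 08:00:00")),
  session_id = 1:6
)
paypal <- data.table(
  name = c("Sally", "Sally", "Erica", "Mom", "Rita"),
  purchase_time = utc(c("2016-01-03 10:06:00", "2016-01-03 10:15:00",
                        "2016-01-03 08:02:00", "2015-12-02 17:58:00",
                        "2016-01-04 12:00:00")),
  payment_id = 1:5
)
website[, join_time := session_start_time]
paypal[, join_time := purchase_time]
setkey(website, name, join_time)
setkey(paypal, name, join_time)

# Backward roll with the default rollends, then with the default spelled out.
default_res  <- paypal[website, roll = -Inf]
explicit_res <- paypal[website, roll = -Inf, rollends = c(TRUE, FALSE)]
identical(default_res$payment_id, explicit_res$payment_id)

# Also roll the last payment in each group forward onto later sessions.
both_ends <- paypal[website, roll = -Inf, rollends = c(TRUE, TRUE)]
print(both_ends[, .(name, session_start_time, session_id, purchase_time, payment_id)])
```

With both ends set to TRUE, Erica's session and Rita's second session pick up a payment that happened before the session started.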
In this example,