
Meta-Learning Customer Preference Dynamics

on Digital Platforms

Mingzhang Yin ∗ Khaled Boughanmi † Asim Ansari‡

February 14, 2024

Abstract

Digital media platforms need to adapt quickly to changing customer preferences to
successfully recommend and sequence content. Traditionally, platforms have leveraged
external customer data for this task, but recent privacy regulations have made this
difficult. Personalizing offerings to suit individual tastes is challenging, especially when
data on customer interactions is limited, as is the case with new customers or for new
customer sessions on the platform. In this research, we develop a novel meta-learning
approach to infer heterogeneous preference dynamics from few-shot interactions of
customers on digital platforms. Our approach splits individual-level data into context
and target sets and is specifically trained to generate personalized predictions for
each customer or session. It leverages the Transformer architecture and its attention
mechanisms to capture sequential customer behaviors, while overcoming the scalability
and adaptability issues of traditional models. Methodologically, our framework offers
solutions to the cold start challenge and infers time-varying individual parameters
effectively. We illustrate our approach using consumer listening sessions on a digital music
streaming platform. We compare the predictive performance of our approach to several
state-of-the-art static and adaptive benchmark models. Our approach demonstrates
superior accuracy and efficiency relative to these models and offers managerial insights into
individual preferences and evolving tastes over sessions. Furthermore, our model has
rich managerial applications such as content sequencing, existing session completion,
and personalized product and content recommendation, which can enhance digital
platforms’ targeting and personalization strategies.

∗ University of Florida, Assistant Professor of Marketing; email: mingzhang.yin@warrington.ufl.edu.
† Cornell University, Assistant Professor of Marketing; email: kb746@cornell.edu.
‡ Columbia University, William T. Dillard Professor of Marketing; email: maa48@gsb.columbia.edu.



1 Introduction

The ability to quickly adapt to the preferences of a customer is essential for the success
of a digital platform. This ability is important in two scenarios. First, new customers of
the platform are susceptible to churn if they do not find products that they like or if they
are unsatisfied with their early interactions with the platform. Companies, therefore, have
an incentive to personalize their offerings to align these with customer needs and thereby
cement customer loyalty. Similarly, platforms that use a freemium strategy strive to quickly
convert free users to premium subscribers to profit from the customer relationship. Second,
customer preferences can be context-dependent. For instance, a customer could arrive on
a platform like Spotify or YouTube with a particular need in mind and initiate a session.
The platform needs to quickly assess what the customer prefers in the current session to
be able to personalize offerings. Both these scenarios involve making inferences based on
limited customer interactions on the platform. This, coupled with the fact that customer
tastes are heterogeneous and dynamic, poses a systematic challenge for targeted marketing
and personalization. We call this challenge the few-shot learning problem for customer
preferences and address it in this paper.
Platforms have traditionally relied on external data such as demographics, user registra-
tion details, cookies, or acquisition characteristics [1] to tailor experiences for new users.
However, collecting external data raises privacy concerns and faces regulatory constraints,
as exemplified by the General Data Protection Regulation (GDPR) in the European Union
[2] and California Consumer Privacy Act (CCPA) in the US [3], which limit the collection
of personal data. Moreover, external data often provides weak information on customer
preferences. In contrast, user interactions are readily observable, and it is beneficial to
leverage this internal yet potentially small-sized data for personalization.
Researchers have developed numerous methods to analyze customer interactions and
infer how product features impact choices. As customers differ in their preferences, the
marketing literature has focused on methods that accommodate observed and unobserved
sources of customer heterogeneity. Hierarchical Bayes (HB) approaches [4–6] are popular
for this purpose as they allow inferences about individual-level parameters. However,
traditional HB methods encounter multiple challenges in the face of the few-shot learning
problem. First, HB methods that rely on MCMC inference are not easily scalable. This limits
their applicability in settings with many customers. Second, it is difficult to incorporate
the data of new users in HB models as this requires retraining the model with the addition
of new customers. Third, an ideal method should continuously adapt and improve with
incoming data as a customer continues the relationship. This can create a positive feedback
loop where increased interactions with the platform generate more observations, which in
turn further improves the model. Yet it is difficult for a traditional HB model to seamlessly
adjust itself on the fly to incoming data from a customer. This is particularly challenging
when data are obtained within a customer session.
Motivated by these challenges, we present a new methodology for the few-shot learning
problem in customer relationship management. Building on the meta-learning literature in
machine learning and artificial intelligence [7, 8], we propose a novel framework — the
meta-temporal process (MetaTP) — that learns preferences and predicts future behavior
using only a few initial observations. MetaTP is a neural network-based model that mimics
the human learning process – learning from past experience makes it possible to solve a
new related problem with high data efficiency. To the best of our knowledge, our work is
the first in marketing to study customer preferences with meta-learning.
We train MetaTP over multiple constructed tasks that are like the few-shot learning
problem for a new customer or session. Specifically, each task takes the relationship data of
an existing customer or session and splits it into a context set and a target set. The context
set consists of the initial few data points and the target set contains the data over the rest of
the customer relationship or session. The model is learned by fitting the target set of each
task using the personalized information from its context set. This differs from traditional
supervised learning and marketing models that fit the entire training data of a customer
or session without splitting it into context and target sets and do not specifically focus on
predicting the responses in the target set.
In contrast, MetaTP infers an efficient learning-to-learn strategy that is designed to
generalize to new tasks that may differ from existing ones. In essence, MetaTP learns a
prior distribution over customer preference functions. This prior is used along with the
context set of recently acquired customers to quickly capture their preferences and predict
how they would respond to future platform offerings and marketing actions.
In designing MetaTP, we develop a novel integration of Transformers [9] into the meta-
learning framework. Unlike most existing meta-learning algorithms that assume that the
observations within a task are exchangeable [10–13], we address the inherently sequential
nature of customer relationship or sessions data. We leverage the Transformer model,
widely recognized for its success in modeling language sequences [14, 15], as a cutting-
edge tool for modeling customer sequential patterns that are governed by the ‘grammar’ of
consumer behaviors. Furthermore, Transformer’s in-context learning ability to understand
a new task (i.e., a new customer’s data) based on short inputs [16], as exemplified by
the Generative Pre-trained Transformer (GPT)’s fast adaptation to user-provided prompts
[15], aligns seamlessly with the objectives of few-shot learning. MetaTP harnesses this
in-context learning ability to uncover preferences and predict future responses based on
initial activities.
We make several novel methodological contributions. First, we provide a new approach
for the cold start challenge. MetaTP leverages early interaction data and enables firms to
continuously benefit from a stream of incoming individual data. Second, MetaTP offers a
novel means of inferring time-varying individual parameters. Existing methods typically
model dynamic heterogeneity by fitting a time-varying mean model of the population and
assuming the individual model deviates from the mean model by an offset [17] or via a
Gaussian process [18].



In contrast, MetaTP models dynamic heterogeneity by capturing different dependence
structures between the context data and the target data using the attention mechanism of
the Transformer model. Doing so allows it to overcome limitations imposed by concentrating
individual models on a pre-specified mean model. Last, compared to an adaptive variant
of the hierarchical Bayes model (derived in this paper for the inference of new users), the
neural network-based MetaTP flexibly models nonlinear relationships and handles both
structured and unstructured data. For computation, we develop an optimization approach
to implement MetaTP efficiently with large-scale data.
We apply our model to facilitate the sequential consumption of music products on
digital platforms [19] and show that it can capture the preferences based on a few initial
observations within a session. Using real music listening session data, we show that MetaTP
yields more accurate and computationally efficient predictions of future behaviors. It
outperforms six baseline methods including (generalized) linear, nonlinear, static, and
adaptive models, across a variety of evaluation metrics to demonstrate the advantages of
our Transformer-based meta-learning framework. The model yields several substantive and
managerial insights. First, it transparently identifies the factors most relevant to future
behaviors. The Transformer’s attention mechanism reveals which tracks in a user’s history
are more informative in predicting the future behavior within the session. Our results show
that for a large proportion of sessions, the tracks played first and last in history carry more
predictive power than the tracks in the middle positions. Furthermore, we use explainable
ML methods to infer the influence of the acoustic fingerprints of historical tracks on future
behavior at the individual level. This heterogeneity information is valuable for firms to
implement customer management programs for recently acquired users.
Second, the model can be used to infer the pattern of individual-level preferences
evolution. For example, as we show in the application, the time-varying parameters reflect
the changes in music tastes and emotions over the course of a listening session, and the
firm can leverage these to improve user engagement. We then illustrate the actionable
benefits of the model for listening session completion. We show how the model can leverage
the information from a few listened tracks for product recommendation [20] by designing
future sessions that have the highest predicted listening rate. Specifically, we can match the
customers based on the similarity of their context embeddings, find the optimal order of
an existing music list, or recommend future tracks for an individual or a group in either a
one-shot or sequential manner. These applications contribute new tools for data-efficient
and session-based customer targeting and personalization [21, 22].
The rest of the paper is organized as follows. § 2 positions our paper in the broad
marketing, statistics, and machine learning literature. We develop the model and describe
the inference approach in § 3. We describe the data and empirical setup in § 4. § 5 validates
the empirical advantages of our model relative to the baselines. § 6 describes the results
and substantive findings. We illustrate the use of our approach for several managerial
applications in the penultimate section. Finally, we summarize our contributions and
identify future research directions.



2 Methodological Background

Our work is related and contributes to several areas in marketing, statistics, and machine
learning. These include the literature on meta-learning, deep learning, hierarchical Bayes,
and the cold-start problem. Below, we provide a concise overview of the related works in
these areas and position our contribution.
Meta-learning. Meta-learning, often known as learning-to-learn, has been driving the
development of machine learning and artificial intelligence since its inception [7,
23], and has achieved considerable success in areas such as computer vision [24], language
models [25], and robotics [26]. It enables algorithms to learn from past experience and solve
new problems using a small amount of data. Nevertheless, the application of meta-learning
remains underexplored in marketing. This paper builds on the development of Neural
Processes [12, 27], a family of methods combining the strengths of neural networks and
Gaussian processes. Our work is the first in marketing to study individual preferences with
meta-learning. Extant meta-learning methods often focus on classification and regression
tasks where observations within a task are identically and independently distributed (i.i.d.)
[8, 10, 28, 29]. For sequential modeling, [30] proposes the SNP method to predict dynamic
3D scenes in computer vision where the tasks are segmented by time. [31] designs the
TNP method with the attention mechanism, which assumes the predictions are invariant
to the permutation of individual data points. However, these assumptions do not hold in
the context of customer relationship dynamics. Addressing this gap, our work introduces
a novel meta-learning framework to model sequential customer behaviors on digital
platforms.

Deep Learning in Marketing. The increasing scale and complexity of data have driven
the adoption of deep learning in marketing research [32–34]. For example, [35] uses
fully connected neural networks to model user preferences in a dynamic factor model.
[36] leverage deep learning to extract emotions from livestream videos and explore their
influence on sales outcomes. Convolutional Neural Networks (CNNs) have been effectively
employed to extract product quality information from texts [37] and images [38, 39];
similarly, Recurrent Neural Networks (RNNs) have been utilized to obtain sentiment scores
from text reviews [40] and to model action sequences in video games [41]. [42] models
consumer collections in the context of music products using Hypergraph Convolutional
Neural Networks. Recently, the Transformer-based BERT model [14] has been applied to
mine brand interest and sentiment in textual comments [43] and attribute-specific valence
from reviews [44]. Instead of using Transformer as a pre-trained black box feature extractor,
our work unboxes the Transformer and integrates it as a pivotal component of the proposed
probabilistic meta-learning framework. In this way, it fully leverages the Transformer’s
flexibility, interpretability, and in-context learning ability.

Hierarchical Bayes. Meta-learning, from a statistical standpoint, can be conceptualized as
a form of Bayesian hierarchical modeling [45], bearing a close resemblance to empirical
Bayes methods [46]. It learns meta-knowledge from solving multiple tasks and stores it
in a prior distribution or through parameter initialization. Similar to hierarchical Bayes
(HB) Logit methods [47], MetaTP effectively leverages data heterogeneity by considering
customer data as tasks. While conventional HB models focus on fitting the training data
of customers, often assuming that the observations are i.i.d., MetaTP extends its scope to
solving multiple tasks and focuses on the predictive distribution of customer future data.
Thereby it learns information that is generalizable across tasks and is tailored for few-shot
data adaptation. Moreover, MetaTP leverages flexible neural networks that offer a similar
level of granularity to HB in analyzing panel data, but with enhanced capabilities that
include higher accuracy, adaptability to different data types, various interpretable structures,
and efficient inference processes.

Cold Start Recommendation. MetaTP addresses the challenging cold-start problem in
modeling customer relationships [48]. Existing strategies to tackle the cold-start prob-
lem typically rely on external customer information such as demographics or additional
acquisition data [1, 49]. However, acquiring such external data can be challenging, and
its relevance to the primary prediction task is often weak. Instead, MetaTP capitalizes on
initial customer interactions as the internal information for effective personalization. Recent
advances in computer science have seen the application of meta-learning to tackle the cold-
start problem in session-based recommendation [50, 51]. These approaches mainly encode
the meta-knowledge as the initialization for neural network parameters, which, however,
tends to obscure interpretability. MetaTP distinguishes itself by adopting an encoder-decoder
framework, a structure that not only generates interpretable individual parameters but also
enables transparent decision-making processes.

3 Model

We first introduce the consumer data structure that is needed for meta-learning and
show how to design the meta-learning tasks for model calibration. We then present the
probabilistic model for customer interactions and our inference procedure. Finally, we
describe the Transformer component that encodes the individual-level parameter dynamics.

3.1 Tasks and Data Splitting

Consider a platform with a potentially large pool of data from existing customer sessions.1
We denote the data of each existing session $i$ as $\mathcal{D}_i = \{x_{it}, y_{it}\}_{t=1}^{T_i}$, $i \in \{1, 2, \dots, N\}$, where
$N$ is the number of existing sessions. The scalar outcome $y_{it}$ is the customer behavior
of interest at interaction $t$. This could be a categorical outcome indicating the chosen
product when modeling choices. Alternatively, the response could indicate particular types
of customer activities. For example, in our application, the response captures song skipping
behavior in customer listening sessions. The vector $x_{it} \in \mathbb{R}^P$ contains the $P$ external product
covariates, and $T_i$ is the number of observed interactions of the customer within the session.
The data are longitudinal and follow a sequential pattern indexed by $t$, which denotes the
position in the sequence of interactions.

1 We use customers and sessions interchangeably. Session level data is appropriate when modeling contextual
preferences, as in our application. In contrast, a customer-level perspective is more appropriate in Customer
Relationship Management (CRM) contexts.
We focus on cold-start scenarios where the platform is interested in quickly capturing
the preferences reflected in a newly started session $k$ ($k > N$). The interactions of the
new session are denoted as $\mathcal{D}_k = \{x_{kt}, y_{kt}\}_{t=1}^{t_k}$. The number of interactions $t_k$ is small in
the beginning, and the goals of the platform are to predict future customer responses $y_{kt}$,
$t > t_k$, within the session and to customize offerings consistent with the preferences that are
operant within the session.
The cold-start problem poses several challenges. First, the data Dk of the new session
could be distributed differently from data Di of the existing sessions because preferences
are heterogeneous. Extant machine learning approaches often build on the assumption that
the data are independently and identically distributed (i.i.d.). However, the distribution
shift between new and existing sessions violates this assumption, making it improper to
apply off-the-shelf supervised learning tools. Second, the model needs to be calibrated on
the data from existing sessions and adapt to new sessions that are not part of the training
data. Third, the observed data Dk is small in size and contains scarce information for a
new session. Relevant information needs to be extracted from this limited data efficiently.
Last, the data of a customer is sequential and may exhibit complex temporal dependence
structures as preferences evolve within a session.
In meta-learning, the first three challenges are tackled by treating the data of existing
sessions as consisting of a large number of tasks. Specifically, each task $i$ is constructed by
splitting the data of each session, $\mathcal{D}_i$, into a context set $\mathcal{D}_i^c = \{x_{it}, y_{it}\}_{t=1}^{t_i}$ and a target set
$\mathcal{D}_i^q = \{x_{it}, y_{it}\}_{t=t_i+1}^{T_i}$, where $t_i < T_i$. For each session, the context set consists of the initial
interactions, while the target set contains the rest of the interactions. The size $t_i$ of the
context set is kept small. In this way, the tasks are designed to mimic the new session data
using the data of existing sessions — the context set $\mathcal{D}_i^c$ and the target set $\mathcal{D}_i^q$ mimic the
observed context and the future data of a new session, respectively. A task is “solved” if the
model adapts to the context set $\mathcal{D}_i^c$ and accurately predicts the outcomes in the target set
$\mathcal{D}_i^q$. The premise of this approach is that if a model manages to solve a variety of tasks for
existing sessions, the learned knowledge can be transferred to solve a similar task for a new
customer session. The meta-learning framework is depicted in Fig. 1.
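
To make the task construction concrete, the following minimal sketch shows how a single session's data could be split into a context set and a target set. It is illustrative only; the function name and the representation of a session as an ordered list of (covariate, outcome) pairs are assumptions, not the paper's implementation.

```python
# Illustrative sketch of task construction for meta-learning.
# Assumes each session is an ordered list of (x_t, y_t) pairs indexed by position t.
from typing import List, Tuple

def make_task(session: List[Tuple[list, int]], context_size: int = 5):
    """Split one session into a context set (first few interactions)
    and a target set (the remaining interactions)."""
    context = session[:context_size]   # D_i^c = {(x_it, y_it)}, t = 1..t_i
    target = session[context_size:]    # D_i^q = {(x_it, y_it)}, t = t_i+1..T_i
    return context, target

# Each existing session yields one few-shot task used during meta-learning:
# tasks = [make_task(s) for s in existing_sessions]
```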
Note that the multi-task design differs from conventional marketing and machine learning
approaches where the entire dataset is split only once into a pair of calibration and validation
sets and the focus is on fitting a model using the entire data Di of each customer. The use
of multiple tasks encourages a model to acquire an inductive bias that extracts maximum
information from the limited context data in order to predict the target set data. It also
prevents a model from overfitting to a particular task; instead, it learns a pattern that is
shared across tasks. The focus on predicting the interactions in the target set $\mathcal{D}_i^q$ (as opposed to
modeling the entire data $\mathcal{D}_i$, as is done in conventional marketing models) allows the model
to learn a relationship between the early and subsequent customer interactions. Hence,
constructing tasks addresses the challenges of distribution shift and data scarcity regarding
a new session as well as the preference dynamics within a session.

Figure 1: Meta-learning framework.

3.2 Probabilistic Model

We design a probabilistic model to capture the hierarchical structure arising from multiple
tasks and sequential patterns of session data. The model is specified using global parameters
θ and a set of session-specific local parameters β it , one for each task. The global parameters
θ represent the shared statistical structure between tasks. As we will describe, θ consists
of high-dimensional parameters of a Transformer model. The high dimensionality of θ
introduces the necessary flexibility that allows the model to learn the inductive bias for
few-shot prediction automatically from data, avoiding the difficulty of pre-specifying the
inductive bias in the prior distribution manually. We will use point estimates for the shared
θ since data from all tasks will pinpoint its value. At the session level, we use local random
variables β it ∈ R P to capture uncertainty and dynamic heterogeneity. Consider an example
of a user consuming music products on a digital platform. Users would have their own
unique preferences for musical styles based on the usage scenario, and such preferences are
likely to change with the progress of the listening session. The local parameters β it encode
this intrinsic session-specific and time-varying information for customer preferences.
We use an encoder and decoder framework to specify the model [12, 52]. The objective
of meta-learning is to predict the target set of a task. Meta-learning therefore uses the
posterior predictive distribution of the future data of customer $i$ at time $t$ for $t > t_i$, which
is decomposed as
$$p_\theta(y_{it} \mid x_{it}, \mathcal{D}_i^c) = \int p_{\theta_D}(y_{it} \mid x_{it}, \beta_i^t)\, p_{\theta_E}(\beta_i^t \mid \mathcal{D}_i^c, t)\, d\beta_i^t, \qquad (1)$$

where $\theta = (\theta_E, \theta_D)$. The encoder $p_{\theta_E}(\beta_i^t \mid \mathcal{D}_i^c, t)$, parameterized by $\theta_E$, models the posterior
predictive distribution of the local variables $\beta_i^t$ given the context data of the customer
and a future event time. In the design of the encoder, we leverage the Transformer model
[9], an architecture widely recognized for its success in understanding language sequences
(e.g. in BERT [14] and GPTs [15]), as a state-of-the-art tool for sequential modeling. We
will illustrate how the Transformer and our probabilistic framework are synergized in the
following section.
The decoder $p_{\theta_D}(y_{it} \mid x_{it}, \beta_i^t)$, parameterized by $\theta_D$, predicts the future outcome given
the covariates $x_{it}$ and latent variable $\beta_i^t$. The decoder can be set as a feed-forward neural
network with the concatenated $[x_{it}, \beta_i^t]$ as the inputs, $\theta_D$ as the parameters, and the
distributional parameters as the outputs. For instance, if the outcome is a discrete choice,
the output could be the logits of the categorical distribution. Alternatively, one can use a
generalized linear model (GLM) [53] for the decoder. Though a neural network decoder
has high flexibility, the black-box nature of neural networks can obscure the meaning of
$\beta_i^t$. For the sake of interpretability, we can set the decoder as $p(y_{it} \mid x_{it}, \beta_i^t) = p(y_{it} \mid x_{it}^\top \beta_i^t)$,
where the expected outcome is $\mathbb{E}[y_{it} \mid x_{it}] = g^{-1}(x_{it}^\top \beta_i^t)$ with a link function $g$. The usage
of the dot product to link the local parameters and outcome is similar to that in neural
matrix factorization [35, 54, 55]. Such a GLM takes $\beta_i^t$ as the parameter and no longer
needs additional parameters $\theta_D$. We can then interpret an element of the vector $\beta_i^t$ as the
preference intensity for a covariate in $x_{it}$ at time $t$. In our application, we use a binary logit
specification for the decoder. The GLM and neural network formulations trade off model
flexibility and interpretability, and the choice between these two can be application-specific.
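
As an illustration of the interpretable binary logit decoder used in our application, a minimal PyTorch sketch is given below. The function name and tensor shapes are illustrative assumptions; the point is that the decoder's log-likelihood depends on $x_{it}$ and $\beta_i^t$ only through their dot product.

```python
import torch

def logit_decoder_log_prob(y_it: torch.Tensor, x_it: torch.Tensor,
                           beta_it: torch.Tensor) -> torch.Tensor:
    """Binary logit decoder: p(y_it = 1 | x_it, beta_it) = sigmoid(x_it . beta_it).
    y_it: float tensor of 0/1 outcomes; x_it: (..., P) covariates; beta_it: (..., P) parameters.
    Returns the log-likelihood log p(y_it | x_it, beta_it)."""
    logits = (x_it * beta_it).sum(dim=-1)               # dot product x_it^T beta_it
    dist = torch.distributions.Bernoulli(logits=logits)
    return dist.log_prob(y_it)
```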

3.3 Inference

We rely on maximum likelihood estimation (MLE) of the predictive distribution in Eq. (1)
to obtain the global model parameters θ = (θ E , θ D ) for the encoder and decoder. The MLE
of Eq. (1), averaged over all existing sessions, becomes
$$\max_{\theta}\; \mathbb{E}\Big[\log p_\theta(y_{it} \mid x_{it}, \mathcal{D}_i^c)\Big] = \mathbb{E}\Big[\log \int p_{\theta_D}(y_{it} \mid x_{it}, \beta_i^t)\, p_{\theta_E}(\beta_i^t \mid \mathcal{D}_i^c, t)\, d\beta_i^t\Big], \qquad (2)$$
where the expectation is over uniformly distributed session $i \in \{1, \dots, N\}$ and target
position $t \in \{t_i+1, \dots, T_i\}$. In practice, the inner integral of Eq. (2) can be approximated
by Monte-Carlo estimation, which corresponds to a lower bound of the original objective
due to the concavity of the logarithmic function and Jensen’s inequality [45].



Computing the MLE can be operationalized as the following algorithm.

• Randomly sample a subset of sessions/tasks $\mathcal{A} \subset \{1, 2, \dots, N\}$;

• For each $i \in \mathcal{A}$ and target point $t \in \{t_i+1, \dots, T_i\}$, sample $\beta_i^{t,l} \sim p_{\theta_E}(\beta_i^t \mid \mathcal{D}_i^c, t)$ for
$l = 1, \dots, L$;

• Maximize the predictive likelihood using stochastic gradient descent with stepsize $\eta$:
$$\theta^{\mathrm{new}} \leftarrow \theta + \eta\, \nabla_\theta\, \frac{1}{|\mathcal{A}|}\sum_{i\in\mathcal{A}} \frac{1}{T_i - t_i}\sum_{t=t_i+1}^{T_i} \log\Bigg\{\frac{1}{L}\sum_{l=1}^{L} p_{\theta_D}\big(y_{it} \mid x_{it}, \beta_i^{t,l}(\theta_E)\big)\Bigg\}. \qquad (3)$$

Since the total number of existing sessions is large, for each step, we randomly sample
a mini-batch of customers A ⊂ {1, 2, · · · , N } and compute the stochastic gradient to
achieve high computational efficiency [56]. The algorithm loops over the steps above
until convergence. The stepsize η can be scheduled by methods such as Adam [57] and
the gradient can be computed by auto-differentiation tools such as PyTorch [58]. This
procedure describes an episodic learning paradigm with a series of tasks; each step solves
a task by performing a cycle of the posterior inference for β it given the context set, and
predicting the outcomes yi t for the target set. The parameter θ learned via episodic
learning is not task-specific but generalizable across existing and new customer sessions.
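
A minimal sketch of one episodic training step implementing the update in Eq. (3) is shown below. The `encoder` and `decoder_log_prob` callables are hypothetical stand-ins for the Transformer encoder and the decoder log-likelihood described above (assumed to return the Gaussian parameters of $\beta$ and a per-sample log-likelihood, respectively); this is not the paper's exact implementation.

```python
import torch

def training_step(encoder, decoder_log_prob, optimizer, task_batch, L=8):
    """One stochastic update of the meta-learning objective in Eq. (3).
    task_batch: list of (context, x_target, y_target) tuples for the sessions in mini-batch A."""
    loss = 0.0
    for context, x_target, y_target in task_batch:
        mu, sigma = encoder(context, x_target)          # Gaussian parameters of beta per target position
        beta = mu + sigma * torch.randn(L, *mu.shape)   # L reparameterized samples of beta
        log_p = decoder_log_prob(y_target, x_target, beta)   # shape (L, number of target points)
        # log of the Monte-Carlo average over the L samples, averaged over target positions
        obj = (torch.logsumexp(log_p, dim=0) - torch.log(torch.tensor(float(L)))).mean()
        loss = loss - obj
    loss = loss / len(task_batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```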

3.4 Few-shot Prediction

Lastly, using the optimized parameter θ̂ , the prediction of future events of a new session k
given its context data can be computed by the predictive distribution in Eq. (1). The context
data Dk for the new session k (k > N ) consists of the initial interactions with the platform.
The prediction of the future behavior $\hat{y}_{kt}$, $t > t_k$, can be obtained by taking samples $l = 1, \dots, L$
from the predictive distribution as $\beta_k^{t,l} \sim p_{\hat{\theta}_E}(\beta_k^t \mid \mathcal{D}_k, t)$, $\hat{y}_{kt}^{\,l} \sim p_{\hat{\theta}_D}(y_{kt} \mid x_{kt}, \beta_k^{t,l})$.
With continuous outcomes, we take the predicted outcome as the conditional expectation
$\mathbb{E}_{y_{kt} \mid x_{kt}, \mathcal{D}_k^c}[y_{kt}]$, approximated by $\sum_{l=1}^{L}\hat{y}_{kt}^{\,l}/L$. With discrete outcomes, we take the prediction
as the predictive mode, approximated by the most frequent outcome in the samples $\{\hat{y}_{kt}^{\,l}\}_{l=1}^{L}$.
Note the adaptation to the new sessions does not require model refitting. Hence, the
computation for new sessions is very efficient, allowing for individual adaptation on the fly.
Moreover, the arbitrary size of context set Dk permits the model to continuously improve
with the new data streaming in as the session unfolds and smoothly interpolate from the
cold start stage to the warm stage.
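
For a new session with a binary outcome, a minimal sketch of this sampling-based prediction (reusing the hypothetical `encoder` above and the binary logit decoder) could look as follows; the predictive mode is recovered as the majority vote over the $L$ sampled outcomes.

```python
import torch

@torch.no_grad()
def predict_skip(encoder, context, x_future, L=100):
    """Predict binary outcomes at future positions from the context set D_k,
    without refitting the model (Eq. (1) evaluated by Monte Carlo)."""
    mu, sigma = encoder(context, x_future)                  # posterior over beta_k^t
    beta = mu + sigma * torch.randn(L, *mu.shape)           # L samples of beta_k^t
    p_one = torch.sigmoid((x_future * beta).sum(dim=-1))    # p(y = 1 | x, beta) per sample
    y_samples = torch.bernoulli(p_one)                      # sampled outcomes, l = 1..L
    return (y_samples.mean(dim=0) > 0.5).long()             # predictive mode for binary y
```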

3.5 Transformer-based Sequential Encoder

So far, we have focused on the overall structure of the model. Now, we describe how we
use the Transformer architecture [9] to specify the temporal dependence between the context
and target sets using the encoder component $p_{\theta_E}(\beta_i^t \mid \mathcal{D}_i^c, t)$ of MetaTP. Transformer-
based models have demonstrated superior task-solving ability by adapting quickly to a short
context, such as the user-provided prompts in GPTs [15]. Our adoption of the Transformer
is motivated by this in-context learning property [16], which aligns seamlessly with our
objective of understanding preferences via a small number of initial interactions that form
the context. Below, we describe generally how we leverage the Transformer in the model
design, and offer a more detailed description in § B of the Web Appendix.
The critical component of the Transformer model is its attention mechanism. It enables
the model to focus on specific parts of the context set when predicting the target outcomes,
thereby automatically determining the relevance of particular interactions in the context
set for the future. The attention mechanism excels at capturing multi-faceted dependence
structures over a sequence. To this end, it assigns different weights to the elements of the
context set, allowing the model to allocate more ‘attention’ to the most relevant elements.
Specifically, the attention mechanism constructs a key matrix $K \in \mathbb{R}^{t \times m_k}$, a value matrix
$V \in \mathbb{R}^{t \times m_v}$, and a query matrix $Q \in \mathbb{R}^{t' \times m_k}$, where the matrices $K$ and $V$ are constructed from
the context set $\mathcal{D}_i^c$. The query matrix $Q$ is constructed from either the context set $\mathcal{D}_i^c$ or the target
set $\mathcal{D}_i^q$, depending on how we use the attention mechanism. The construction is illustrated
in § B of the Web Appendix. Here $m_k$ is the dimension of keys and queries and $m_v$ is the
dimension of values. The attention module computes the embedding $S \in \mathbb{R}^{t' \times m_v}$ for the
queries as
$$S = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{m_k}}\right)V. \qquad (4)$$
The inner product and softmax function compute the attention weights $W = \mathrm{softmax}(QK^\top/\sqrt{m_k})$, which measure the alignment between the context keys and queries.
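
For concreteness, a minimal NumPy sketch of the scaled dot-product attention in Eq. (4) is given below (a generic implementation of the formula, not the paper's code).

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Eq. (4): S = softmax(Q K^T / sqrt(m_k)) V.
    Q: (t', m_k) queries, K: (t, m_k) keys, V: (t, m_v) values."""
    m_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(m_k)                      # (t', t) alignment scores
    W = np.exp(scores - scores.max(axis=-1, keepdims=True))
    W = W / W.sum(axis=-1, keepdims=True)                # attention weights (row-wise softmax)
    return W @ V, W                                      # embedding S and weights W
```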
The Transformer model applies the attention mechanism in two major steps. The first
step is known as self-attention, which builds all the matrices $K$, $V$, $Q$ based on the context
set $\mathcal{D}_i^c$. The output of self-attention is a context set embedding $z_i^c = [z_{i1}^c, z_{i2}^c, \dots, z_{it_i}^c]$, where
each element $z_{it}^c$ is a vector depending on the individual data $\{x_{it'}, y_{it'}\}_{t' \le t}$ up to time $t$.
Next, we capitalize on cross-attention to model the relationship between the context set
and the target set. The cross-attention constructs the key, value, and query matrices based
on the context set $\mathcal{D}_i^c$, the context embedding $z_i^c$, and the target position $t$, respectively, and
learns an embedding for the target set. Finally, the embedding of a target element is mapped
to the distributional parameters $(\mu_i^t, \sigma_i^t)$ by a feed-forward neural network, and we set
the distribution of $\beta_i^t$ as $p_{\theta_E}(\beta_i^t \mid \mathcal{D}_i^c, t) = \mathcal{N}\big(\mu_i^t(\mathcal{D}_i^c, t), \mathrm{diag}\big((\sigma_i^t(\mathcal{D}_i^c, t))^2\big)\big)$, where $\sigma_i^t \in \mathbb{R}^P$
and $\mathrm{diag}\big((\sigma_i^t)^2\big)$ is a diagonal matrix with $(\sigma_i^t)^2$ on the diagonal. We choose the Gaussian
distribution because $\beta_i^t$ is continuous and a Gaussian distribution is reparameterizable [52].2
The diagonal covariance is a common mean-field assumption for latent variables in the
autoencoder models [59, 60]. We use $\theta_E$ to denote all the encoder parameters shared across
the tasks, including the parameters of the attention and feed-forward neural networks. In
practice, we apply multi-head attention to capture various types of relationships between
the context and target sets, and apply causal masking to avoid peeking into the future; we
illustrate these details in Appendix § B. The directed graphical model of the proposed MetaTP
is shown in Fig. 2, and the end-to-end methodology is summarized in Appendix § A.

2 A variable is reparameterizable if it can be expressed as a deterministic transformation of a base distribution
that does not depend on the parameters of interest, an essential property for efficient gradient-based
optimization.

Figure 2: Directed graphical model of MetaTP.

4 Data

We applied our model to the context of predicting consumer skipping of tracks in
listening sessions. The data set was released to the public domain by [61] and covers a
time span from July 15th, 2018, to September 18th, 2018, consisting of a large volume
of logged listening sessions. Each session contains 10 to 20 songs/tracks that a customer
has listened to in sequence. Specifically, we focus on the customer skipping behavior of
each track, defined as whether the track was only played briefly.
Few-shot learning is an ideal framework for predicting how skipping behavior evolves
over the course of a listening session for two major reasons. First, as the length of
a session is typically short (a maximum of 20 tracks in our data), it is imperative
for music platforms to infer customer preferences early in the session from the initial few
interactions. This would allow the platform to personalize the remainder of the session by
playing songs that could be appealing to the consumer. Second, the preference for music
often changes significantly across sessions based on unobserved external factors such as
physical activities and emotions, which makes it necessary for a model to adapt quickly under
distribution shifts.
The data set contains 150 million listening sessions, from which we retain the sessions
that are played in shuffled mode. We then randomly subsample M = 10,000 sessions,
corresponding to 50,704 tracks and 167,880 interactions. This scale allows us to estimate
all our benchmark methods in a reasonable amount of time. The outcome yi t = 1 if a track
is skipped and yi t = 0 if it is not skipped. The skipping label has a balanced distribution over
the tracks with an overall skipping rate of 0.52. Fig. 3 shows the average skipping rate of all
sessions at each track position. On average, the skipping rate increases at the beginning of a
session, then decreases between the 6th and 11th tracks, and finally tends to increase after
the 11th track, though the skipping pattern varies significantly across listening sessions.

Figure 3: Skip rate at each session position.

The data set does not include customer-specific information, such as demographics,
geographical information, or other customer identifiers. Hence, we rely on the skipping be-
haviors of the customers and the acoustic fingerprints of the tracks. The acoustic fingerprints
include Acousticness, Beat strength, Bounciness, Danceability, Energy, Flatness, Key, Mode,
Mechanism, Liveness, Loudness, Instrumentalness, Organism, Speechiness, Tempo, and
Valence, whose definitions are presented in Table D.1 of the Web Appendix. These acoustic
fingerprints describe the technical aspects of the music tracks and the felt experience of
listeners [62]. We preprocess the data by one-hot encoding the categorical features, such as
Mode and Key, using dummy variables for the categories. The continuous features, such as
Valence, Danceability, and Acousticness, are min-max normalized to be within the range
from 0 to 1 to put these on the same scale. Table E.1 of the Web Appendix contains summary
statistics of all the acoustic fingerprints.
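
A minimal preprocessing sketch with pandas and scikit-learn is shown below. The lowercase column names are illustrative placeholders; the actual feature set and definitions are in Table D.1 of the Web Appendix.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Illustrative feature lists (placeholders; see Table D.1 of the Web Appendix).
categorical = ["mode", "key"]
continuous = ["valence", "danceability", "acousticness", "energy", "tempo"]

def preprocess(tracks: pd.DataFrame) -> pd.DataFrame:
    """One-hot encode the categorical acoustic features and min-max scale the continuous ones to [0, 1]."""
    out = pd.get_dummies(tracks[categorical].astype("category"), prefix=categorical)
    out[continuous] = MinMaxScaler().fit_transform(tracks[continuous])
    return out
```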
To fit the meta-learning model, we consider each listening session as a task where the
first few consumer interactions are taken as the context set, and the rest of the tracks in
the session belong to the target set. Constructing the context and target sets using the
data within a session is more meaningful than using all the data of a customer because a
customer’s tastes may differ across sessions. Given that the minimum session length is 10,
we take the first five interactions as the context set. Note that the size of the context
set used for the model calibration does not pose any restrictions on the feasible size of the
context set for a new customer. Given this split, the mean and median skip rates are 0.48
and 0.40 for the context set, and 0.51 and 0.53 for the target set, respectively. The skip rate



of the target set is significantly higher than that of the context set by a two-sample t-test
(p ≤ 0.01).

5 Validation and Model Comparison

We randomly select 80% of the sessions, indexed by $1, 2, \dots, N$, as the training data
to construct the tasks for model training. The number of context points is $t_i = 5$ for each
training task. The remaining 20% sessions indexed by N + 1, N + 2, · · · , M are used for
model validation. Such a division is consistent with the real situation that a platform faces,
where existing data and new data are on the session level. A new session typically has
a distribution different from the existing sessions and therefore, a personalization model
needs to successfully adapt to the context data of a new session.
After the meta-learning with existing listening sessions, we can compute the posterior
predictive distribution of the future behavior for a new session as described in § 3.3. The
outcome for the music session data is binary, hence we take $\hat{y}_{kt} = \mathbb{1}[p(y_{kt} \mid x_{kt}, \mathcal{D}_k) > 0.5]$,
where $\mathbb{1}[\cdot]$ stands for the indicator function, to make skip predictions. The implementation
details of MetaTP are presented in § D of the Web Appendix.

5.1 Benchmark Models

We compare MetaTP with a variety of benchmark models. These include traditional
marketing models as well as ML approaches that are known for their predictive prowess. Our
benchmark models vary systematically in how they leverage the existing session data and the
observed data of a new session. On one end are the static methods that only fit a single
observed data of a new session. On the one end are the static methods that only fit a single
model based on existing data without adjusting to the data of new sessions. On the other end
are the adaptive methods that are developed using existing session data and then customized
to the observations from a new session. We now describe these benchmark models.
Hierarchical Bayes Logit. We first compare our model with a hierarchical Bayes Logit
model (HB), which is widely used to model heterogeneous choices in marketing. Traditional
HB methods, however, can typically only be used to understand the preferences that underlie
existing sessions. Adapting these methods to new sessions requires refitting the model to
obtain session-specific parameters for the new session via computationally expensive MCMC
methods. Given the dynamic character of our setting, we extend traditional HB methods
to yield an adaptive HB approach that can be pre-trained and still utilize new session data
efficiently.
Following the typical HB setting, each session i is associated with a set of parameters β i
which come from a multivariate normal population distribution N (µ, Σ). We use normal
priors for the population mean and a separation-strategy prior [63] for the population
covariance matrix by decomposing it as $\Sigma = \mathrm{Diag}(\tau)\, L\, \mathrm{Diag}(\tau)$, where $\tau$ is a vector of standard
deviations, $\mathrm{Diag}(\tau)$ is a diagonal matrix with diagonal vector $\tau$, and $L$ is a correlation
matrix. In particular, we use independent Half-Cauchy priors on the standard deviations in
τ and an LKJ prior [64] on L to obtain a flexible prior. The full probabilistic model is

$$y_{it} \mid x_{it}, \beta_i \sim \mathrm{Bern}(\sigma(x_{it} \cdot \beta_i)), \qquad \beta_i \sim \mathcal{N}(\mu, \mathrm{Diag}(\tau)\, L\, \mathrm{Diag}(\tau)),$$
$$\tau_p \sim \text{Half-Cauchy}(0, a), \qquad L \sim \mathrm{LKJ}(b), \qquad \mu \sim \mathcal{N}(0, c^2 I_p), \qquad (5)$$
where $\sigma(x) = 1/(1 + \exp(-x))$ is the sigmoid function, $I_p$ is an identity matrix, and $a, b, c$
are hyperparameters. With a slight abuse of notation, we denote $\theta = (\mu, \tau, L)$ as the global
variables for the HB methods shared by the sessions and $\mathcal{M} = \{(x_{it}, y_{it})\}_{i=1:N}^{t=1:T_i}$ as all the
observed data of existing sessions. The global and local latent variables are inferred by the
posterior distribution p({β i }i=1:N , θ | M ), which can be approximated by sampling methods
such as Hamiltonian Monte Carlo (HMC). It is important to note that the inferred local
variables β i are only for the existing sessions i = 1, · · · , N .
For a new session k, k > N , we derive the predictive distribution of the outcome variable
given the covariates $x_{kt}$, the observed data $\mathcal{D}_k$, and the calibration data $\mathcal{M}$ as
$$p(y_{kt} \mid x_{kt}, \mathcal{D}_k, \mathcal{M}) \propto \iint p(y_{kt} \mid x_{kt}, \beta_k)\Bigg[\prod_{(x_k', y_k') \in \mathcal{D}_k} p(y_k' \mid x_k', \beta_k)\Bigg] p(\beta_k \mid \theta)\, p(\theta \mid \mathcal{M})\, d\theta\, d\beta_k. \qquad (6)$$

The predictive distribution in Eq. (6) is derived based on the d-separations of the directed
acyclic graph for the HB model; the details are presented in § C of the Web Appendix.
Intuitively, the posterior distribution p(θ | M ) borrows information from the related sessions.
The samples of β k ∼ p(β k | θ ) are adapted to new session k by re-weighting the samples
according to their predictive ability for the context data Dk . The Adaptive HB can calibrate
the model once with all the data M , and adjust the model to new session data Dk without
re-fitting to the large data M .
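
A sketch of how Eq. (6) could be approximated with posterior draws of $\theta$ is shown below. The NumPy code and the variable names (`mu_draws`, `tau_draws`, `L_draws` for draws from the calibration run; `x_context`, `y_context`, `x_new` for the new session's data) are illustrative assumptions; in practice the likelihood weights would be accumulated in log space to avoid underflow.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adaptive_hb_predict(mu_draws, tau_draws, L_draws, x_context, y_context, x_new):
    """Approximate Eq. (6): for each posterior draw of theta, draw beta_k ~ N(mu, Diag(tau) L Diag(tau)),
    weight the draw by its likelihood on the new session's context data, and average the predictions."""
    probs, weights = [], []
    for mu, tau, L in zip(mu_draws, tau_draws, L_draws):
        cov = np.diag(tau) @ L @ np.diag(tau)
        beta_k = np.random.multivariate_normal(mu, cov)          # beta_k ~ p(beta | theta)
        p_context = sigmoid(x_context @ beta_k)                  # fit to the context set D_k
        lik = np.prod(np.where(y_context == 1, p_context, 1.0 - p_context))
        probs.append(sigmoid(x_new @ beta_k))                    # p(y_kt = 1 | x_kt, beta_k)
        weights.append(lik)
    weights = np.asarray(weights) / np.sum(weights)
    return np.average(np.asarray(probs), axis=0, weights=weights)
```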
Accordingly, we call the HB method Static HB if it only uses existing session data M but
not new session data $\mathcal{D}_k$. The predictive distribution for Static HB is
$$p(y_{kt} \mid x_{kt}, \mathcal{M}) \propto \iint p(y_{kt} \mid x_{kt}, \beta_k)\, p(\beta_k \mid \theta)\, p(\theta \mid \mathcal{M})\, d\theta\, d\beta_k. \qquad (7)$$
Eq. (7) can be approximated with MCMC samples of $\theta$ generated from calibration runs
and draws of $\beta_k \sim \mathcal{N}(\mu, \mathrm{Diag}(\tau)\, L\, \mathrm{Diag}(\tau))$ given the $\theta$ samples. The HB methods are implemented
with CmdStanPy, a Python interface for Stan [65]. We used four parallel chains to generate
posterior samples from the No-U-Turn sampler [66] and checked for convergence using the
trace plots of the unknowns, calculated the effective sample size, and monitored the mixing
rate.
Other Baselines. Besides the HB methods, we also make a comparison with Logistic
Regression (LR), Random Forest (RF), and their adaptive versions as Fine-tuned Logistic



Regression (FT-LR) and Fine-tuned Random Forest (FT-RF). The LR and RF are two
classic supervised machine learning methods, representing the (generalized) linear and
nonlinear models. They are trained on the data $\mathcal{M} = \{(x_{it}, y_{it})\}_{i=1:N}^{t=1:T_i}$, which pools all
the existing sessions, and they are not adapted to the new session data. We also evaluate
their variants that are adaptive to new sessions. Specifically, FT-LR first calibrates a logistic
regression model on all the training sessions, computes a stochastic gradient using the
new session context data, and then updates the model parameters by a one-step gradient
descent [67]. FT-RF first fits a random forest model for the training sessions and keeps the
fitted model fixed. Then it introduces additional simple tree models to fit the new customer
data and combines the fitted forest model and the tree model as an ensemble [68]. These
baselines are implemented with the scikit-learn library in Python [69] with implementation
details in § D of the Web Appendix.
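
As a hedged sketch of FT-LR's adaptation step, the code below uses scikit-learn's SGDClassifier with a logistic loss as a stand-in for the fine-tuned logistic regression: the classifier is calibrated on the pooled training sessions and then updated with a single pass over the new session's context data. The variable names are illustrative, and the paper's actual implementation details are in § D of the Web Appendix.

```python
from sklearn.linear_model import SGDClassifier

def fine_tuned_lr(X_train, y_train, X_context, y_context, X_target, eta=0.01):
    """Calibrate a logistic model on pooled existing sessions, then adapt it with a
    stochastic gradient pass over the new session's context data."""
    # loss="log_loss" in recent scikit-learn versions ("log" in older releases)
    clf = SGDClassifier(loss="log_loss", learning_rate="constant", eta0=eta, max_iter=1000)
    clf.fit(X_train, y_train)                 # static calibration on existing sessions
    clf.partial_fit(X_context, y_context)     # adaptation step on the new session's context
    return clf.predict(X_target)              # predictions for the new session's target set
```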

Table 1: Overall and first event predictive results.

                          LR      RF      Static HB  FT-LR   FT-RF   Adaptive HB  MetaTP
Overall      Accuracy     53.6%   52.5%   52.8%      56.8%   56.2%   59.3%        62.9%
             Precision    0.549   0.560   0.546      0.627   0.598   0.631        0.638
             Recall       0.755   0.565   0.725      0.489   0.571   0.588        0.708
             AUC          0.528   0.529   0.510      0.605   0.514   0.617        0.664
First Event  Accuracy     55.7%   54.4%   54.6%      63.3%   62.6%   66.3%        73.2%
             Precision    0.577   0.598   0.568      0.721   0.684   0.721        0.756
             Recall       0.762   0.570   0.892      0.545   0.617   0.653        0.766
             AUC          0.541   0.560   0.527      0.717   0.685   0.725        0.776

5.2 Future Behavior Prediction

We evaluate the prediction of the first event in the session after the context set, which
measures the short-term predictive ability, and the prediction for all future events, which
combines the short- and long-term predictive ability. The short-term scenario is useful for
making real-time expansions of the listening session, taking into account the responses of
customers on the fly. All the evaluations are performed on the new sessions $k = N+1, \dots, M$ that do
not appear in the training data. Specifically, we consider the following evaluation metrics.
For the prediction of all future events,
$$\text{Accuracy} = \frac{1}{\sum_{k=N+1}^{M}(T_k - t_k)} \sum_{k=N+1}^{M}\sum_{t=t_k+1}^{T_k} \mathbb{1}[\hat{y}_{kt} = y_{kt}],$$
$$\text{Recall} = \frac{\sum_{k=N+1}^{M}\sum_{t=t_k+1}^{T_k} \mathbb{1}[\hat{y}_{kt}=1 \text{ and } y_{kt}=1]}{\sum_{k=N+1}^{M}\sum_{t=t_k+1}^{T_k} \mathbb{1}[y_{kt}=1]}, \qquad
\text{Precision} = \frac{\sum_{k=N+1}^{M}\sum_{t=t_k+1}^{T_k} \mathbb{1}[\hat{y}_{kt}=1 \text{ and } y_{kt}=1]}{\sum_{k=N+1}^{M}\sum_{t=t_k+1}^{T_k} \mathbb{1}[\hat{y}_{kt}=1]},$$
and AUC is the area under the Receiver Operating Characteristic (ROC) curve, a widely used
metric for evaluating the overall predictive performance [70]. Similar definitions apply to the
first event prediction by fixing $t = t_k + 1$.
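
These metrics can be computed, for example, with scikit-learn by pooling the target-set outcomes of all hold-out sessions; the sketch below uses illustrative variable names and is not the paper's evaluation code.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

def target_set_metrics(y_true, y_pred, y_score):
    """Pooled evaluation over the target sets of all hold-out sessions.
    y_true/y_pred are binary skip labels/predictions; y_score is the predicted p(y = 1 | x, D_k)."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_score),
    }
```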



The predictive results are summarized in Table 1. We report the overall accuracy for
all the events in the target set and the accuracy of the first event after the context set
(i.e., predicting what will happen next), two metrics commonly used in sequential prediction
[71]. In general, the adaptive models perform significantly better than the corresponding
static models. The advantage of adaptation indicates that the new sessions exhibit strong
individual heterogeneity from the sessions in the training data, and the information of this
hidden heterogeneity can be inferred from the context set of the new sessions. The LR and
Static HB have high recalls but low precision, which means that although a large portion of
the skipped tracks are correctly predicted, many non-skipped tracks are wrongly predicted
as being skipped. Among the adaptive methods, Adaptive HB performs better than FT-LR
and FT-RF, likely because its adaptation mechanism is computed based on the principled
Bayesian predictive likelihood instead of heuristics.
MetaTP demonstrates a clear performance advantage over Adaptive HB. This perfor-
mance gain could be explained by the major differences between the two models. First,
MetaTP uses an episodic learning framework with multiple tasks that are used to predict
the target set given the context set, while the HB model focuses on modeling the entire
data within a task. Second, the cross-attention module of MetaTP captures the dynamic
heterogeneity with the local parameters β it , while the local parameter β i of Adaptive HB
is shared by all the future events. Third, MetaTP models the sequential pattern of the
data while the individual data $\{x_{it}, y_{it}\}_{t=1}^{T_i}$ of Adaptive HB are assumed exchangeable. The
advantage of sequential modeling is further corroborated by the large gap in the first-event
accuracy between MetaTP and Adaptive HB, as a sequential model can better predict events
in the near future. Last, MetaTP incorporates flexible nonlinear components such as embed-
dings, multi-head attention, and feed-forward neural networks. However, observing that RF
performs on par with LR, the flexibility of nonlinear modeling might not be the dominant
reason for MetaTP’s empirical advantage over Adaptive HB on this data. Nonetheless, for
other data sets with a more complex nonlinear relationship between the covariates and the
outcome, we envision that the advantage of MetaTP over Adaptive HB would be even more
evident because of its nonlinearity and flexibility.
It is worth noting that MetaTP outperforms Adaptive HB not only in accuracy but also
significantly in computational efficiency. Training MetaTP on 10,000 sessions takes less than
30 minutes to achieve convergence as it relies on optimization, while generating posterior
samples of HB methods takes over 10 hours before convergence. The efficiency allows
MetaTP to benefit from training on big data. For example, we tested MetaTP on a larger
random subset of data with 178,000 randomly sampled sessions. The model fitting can be
completed in around one hour, and it lifts the overall and first event predictive accuracy by
1.1% and 1.3% compared with the smaller dataset, respectively.
Figure 4: Predictive accuracy at target positions.

Next, we explore the predictive accuracy at each target position as shown in Fig. 4.
The larger the target position is, the further it is away from the observed context set and
the more difficult it is to predict skipping behavior. The result shows that the accuracy
of the adaptive models is highest at the first position and gradually decreases for further
positions. In contrast, the accuracy of static models is similar at different positions. MetaTP
performance is better than that of other baselines across all the target positions, reflecting
the advantage of the attention mechanism to capture the multi-faceted and long-term
sequential dependence structures. To capitalize on the high first-event accuracy, we also
generate prediction ŷkt by MetaTP using all the preceding new session data and we label
this approach as MetaTP-seq. That is, when predicting the skipping at the position t, the
context set consists of all the track features and the observed customer skips $\{x_{kt'}, y_{kt'}\}_{t' < t}$
before the position t within the session. The context set is dynamically updated but the
model is not re-estimated. As expected, MetaTP-seq has the highest accuracy at all positions
due to the increased context size and because it can leverage recent dynamics; we will
demonstrate its managerial application in § 7.
We further compare the adaptation ability of MetaTP and Adaptive HB by evaluating
the predictive accuracy given a different size of the context set. MetaTP is calibrated on
the tasks with five context points, and we provide a different number of context points at
the validation time without refitting the model. As shown in Fig. 5, both adaptive methods
benefit from the increased number of context points. However, the improvement of MetaTP
is faster than Adaptive HB and the performance gap increases with additional context data.
MetaTP might acquire this fast adaptation ability by solving thousands of similar few-shot
prediction tasks, which forces the posterior distribution $p_{\theta_E}(\beta_i^t \mid \mathcal{D}_i^c, t)$ to maximize the
information gain from the limited context data. Another reason for the difference might be
that the summary statistic $\beta_i$ of Adaptive HB does not change much with more context
data, but MetaTP can extract more complex dynamic patterns from a richer context set with
the time-varying parameter $\beta_i^t$.

Figure 5: Predictive accuracy given a different number of context points at the test time.

6 Results

We now demonstrate the substantive understanding of customer population and behavior
that can be uncovered through the application of MetaTP after fitting the model on the
entire training data.

6.1 Context Set Representation

The self-attention embedding $z_{it}^c \in \mathbb{R}^m$ at position $t \in \{1, 2, \dots, t_i\}$ encodes the listening
session up to position $t$. We examine what information is stored in the self-attention
embedding by focusing on $z_{it_i}^c$, which encodes the whole context set. To this end, we
create 4 context sets for sessions with different music preferences and skipping patterns,
which are shown in Table 2. The context sets consist of tracks from artists Metallica and



Table 2: Synthetic context sets for rock and classical music lovers.

Context Set 1 Context Set 2 Context Set 3 Context Set 4

Track Artist Skip Track Artist Skip Track Artist Skip Track Artist Skip

The Call Of Ktulu M 0 Charlie Brown G 0 The Call Of Ktulu M 0 Charlie Brown G 0
Charlie Brown G 1 The Call Of Ktulu M 1 Master of Puppets M 0 Anxious Moments G 0
Master of Puppets M 0 Anxious Moments G 0 To Live Is to Die M 0 Waltz for the Lonely G 0
Anxious Moments G 1 Master of Puppets M 1 Master of Puppets M 0 The Skin Horse G 0
To Live Is to Die M 0 Waltz for the Lonely G 0 Battery M 0 The Swan G 0
Waltz for the Lonely G 1 To Live Is to Die M 1 Charlie Brown G 1 The Call Of Ktulu M 1
Master of Puppets M 0 The Skin Horse G 0 Anxious Moments G 1 Master of Puppets M 1
The Skin Horse G 1 Master of Puppets M 1 Waltz for the Lonely G 1 To Live Is to Die M 1
Battery M 0 The Swan G 0 The Skin Horse G 1 Master of Puppets M 1
The Swan G 1 Battery M 1 The Swan G 1 Battery M 1

Notes. Artist “M” is Metallica, a band known for heavy metal rock music; “G” is George Winston, a pianist
known for performing instrumental music. Context Sets 1 and 3 represent rock music lovers while Context
Sets 2 and 4 represent classical music lovers. Context Sets 1 and 2 have the same skipping sequence, as do
Context Sets 3 and 4.

George Winston, who have distinct specialties in heavy metal rock music and classical music,
respectively. The listener in Context Sets 1 or 3 skips all the George Winston music but
finishes all the Metallica tracks, while the skipping is reversed for Context Sets 2 and 4. We
show the pairwise distances between the embeddings of these context sets in Fig. 6a. We find
the sessions with similar music preferences are close together, i.e., the Context Sets 2 and 4
have the smallest distance at 1.14, followed by the embedding distance between Context
Sets 1 and 3 at 1.51. The largest distance of 4.10 is between Context Sets 2 and 3 where both
the preferences and the skipping patterns $(y_{i1}, \dots, y_{it_i})$ are different. These findings suggest
that the context embedding successfully encodes the information on customer preferences
and skipping patterns.

Figure 6: Embedding of the context set. (a) Pairwise distance of synthetic context sets; (b) Clustering.



Table 3: Cluster statistics.

Cluster                       1        2        3        4         5        6        7
Size                          281      82       249      289       621      212      266
Context set listening rate    1.07%    81.00%   40.08%   100.00%   62.61%   60.43%   20.00%
Target set listening rate     38.90%   50.17%   35.72%   60.65%    59.40%   41.50%   37.53%

The meaningful embedding space can be leveraged to identify the subgroup with high
future engagement. We use Uniform Manifold Approximation and Projection (UMAP)
[72, 73] to reduce $z_{it_i}^c$ to two dimensions and visualize it in Fig. 6b. The points in Fig. 6b are
the UMAP encodings of $z_{it_i}^c$ for the 2,000 hold-out sessions. The self-attention embedding
segments the sessions into 7 well-separated clusters. We examine the listening rate of each
cluster in the context and target set, defined as the percentage of tracks not being skipped,
which are shown in Table 3. Among the segments, we find Cluster 5 of particular interest.
Cluster 5 is the largest cluster containing 29.4% of total sessions. Unlike other clusters that
mainly consist of sessions with a similar listening rate in the context set, Cluster 5 consists of
sessions with mixed listening rates ranging from 0.2 to 0.8. The average listening rate of this
cluster is 62.6% in the context set and is as high as 59.4% in the target set (the second highest
among all clusters). Hence, we call Cluster 5 the Enthusiasts with persistent engagement
patterns throughout the listening session. In comparison, Top Listeners (Cluster 4) with
100% listening rate in the context set have a future listening rate of 60.7%, only slightly
higher than that of the Enthusiasts. This indicates the Top Listener cluster may contain
passive listeners whose high listening rate can not reflect their engagement. Moderate
Listeners (Cluster 6) have a context-set listening rate similar to the Enthusiasts but have a
much lower future listening rate of 41.5%. Comparing Enthusiasts with Top Listeners and
Moderate Listeners indicates that self-attention embeddings contain essential predictive
information of subsequent behaviors beyond the basic context set listening rate. The platform
can use this embedding to identify consumers with high engagement for future tracks.
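To illustrate how such a segmentation can be obtained from the embeddings, the following minimal Python sketch projects context embeddings with UMAP and clusters the projection, then computes per-cluster listening rates. The array names, the placeholder data, and the use of k-means with 7 clusters are our own illustrative assumptions, not the exact procedure behind Fig. 6b and Table 3.

# Sketch: project hold-out context embeddings with UMAP and segment them.
# `context_emb`, `context_skips`, and `target_skips` are illustrative placeholders.
import numpy as np
import umap                      # pip install umap-learn
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
context_emb = rng.normal(size=(2000, 64))            # stands in for z^c_{i t_i}
context_skips = rng.integers(0, 2, size=(2000, 10))  # 0 = listened, 1 = skipped
target_skips = rng.integers(0, 2, size=(2000, 10))

# Two-dimensional UMAP projection of the context embeddings (Fig. 6b analogue).
proj = umap.UMAP(n_components=2, random_state=0).fit_transform(context_emb)

# Segment the projected sessions; 7 clusters mirrors the number found above.
labels = KMeans(n_clusters=7, n_init=10, random_state=0).fit_predict(proj)

# Listening rate = share of tracks NOT skipped, per cluster (Table 3 analogue).
for c in range(7):
    mask = labels == c
    print(c, mask.sum(),
          round(1 - context_skips[mask].mean(), 3),
          round(1 - target_skips[mask].mean(), 3))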

6.2 Context Information and Prediction

MetaTP makes individualized predictions by adapting to the context data. A natural
question in understanding the targeting mechanism is how different information in the
context set influences the model prediction. We answer this question by leveraging the
attention mechanism and the methods from explainable AI that are developed to comprehend
the outputs of algorithms. We explore how different context points and acoustic features
influence the skipping behavior on the next track after the context set.
Cross-attention weights. The cross-attention captures the dependency structure between
a future target set point and the observed context set. We investigate how the attention is
distributed across the context points by analyzing the individual-level attention weights.



(a) Session 1 (b) Session 2 (c) Population average

Figure 7: Attention weights: the importance of context points in decision making.

Figs. 7a and 7b show the distribution of attention weights on the five context points for two
randomly selected sessions. For both sessions, the most attention is paid to the last track
in the context set. The tracks with the highest attention for session 1 are the non-skipped
tracks, and for session 2, are a mix of skipped and non-skipped tracks.
Fig. 7c shows the attention weights averaged over the population. The highest attention is
paid to the context points that are close to the target points, which contain more information
relevant to future skipping behaviors because of their closeness in time. For example, a
customer skipping the latest tracks due to fatigue would tend to skip the future tracks if the
fatigue persists. This is consistent with the recency effect, the tendency to remember the
most recently presented information best [74]. We also notice that the first track is of
high importance, producing a U-shaped distribution of attention over the context points. This
finding may relate to the primacy effect, whereby the first items influence customers' judgments
and decisions more than the information they receive later on [75]. The behavior on the
first track might better reflect the customer's gut instinct about their music preferences. Overall,
the U-shaped attention weight distribution is aligned with the psychological tendency to be
influenced by the first and last items in a list more than by those in the middle [76].
Feature importance. The attention weights quantify influence at the track level. We now
further scrutinize which acoustic fingerprints are indicative of future behavior. We compute the
saliency map as a measure of feature-level influence [77]. It is the gradient of the prediction
with respect to each acoustic fingerprint, i.e., $\partial \hat{y}(x) / \partial x$, where
$\hat{y}(x)$ is the predicted skipping of the first track in the target set and $x$ are the acoustic
fingerprints of the context tracks.³

3 The intuition behind the saliency map is that, as in linear regression, the value of the derivative
corresponds to the regression coefficient and thereby quantifies the feature's importance. The saliency map
generalizes this idea to any nonlinear, differentiable model.
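The gradient computation itself is standard reverse-mode automatic differentiation. Below is a minimal PyTorch sketch of the saliency calculation; predict_first_target_skip is a hypothetical stand-in for MetaTP's forward pass, not the model's actual interface.

# Sketch: saliency of context acoustic fingerprints for the first target prediction.
import torch

def predict_first_target_skip(x_context, y_context):
    # Placeholder scoring function; the real model maps the context set to the
    # predicted skip probability of the first track in the target set.
    weights = torch.linspace(0.1, 1.0, x_context.shape[-1])
    return torch.sigmoid((x_context * weights).sum() + 0.5 * y_context.sum())

x_context = torch.rand(5, 16, requires_grad=True)   # 5 context tracks, 16 fingerprints
y_context = torch.randint(0, 2, (5,)).float()        # observed skips in the context set

y_hat = predict_first_target_skip(x_context, y_context)
y_hat.backward()                                      # computes d y_hat / d x via autograd

saliency = x_context.grad.abs()                       # one value per (position, fingerprint)
print(saliency.shape)                                 # torch.Size([5, 16]); cf. Fig. 8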



(a) Customer of Session 1

(b) Customer of Session 2

Figure 8: The importance of context set acoustic features.

Fig. 8 shows the saliency values for each acoustic fingerprint at each of the five context positions for
the customers of two sessions. We see that the most influential acoustic fingerprints generally
come from the latest tracks, in line with the findings from the attention weights.
Interestingly, the influential acoustic fingerprints differ significantly across sessions. For
example, what matters for the customer of Session 1 are beat strength, bounciness, energy,
and liveness, whereas for the customer of Session 2 the key fingerprints are bounciness,
instrumentalness, energy, and speechiness. These granular preferences over the acoustic
fingerprints of each track reveal personal demands for the music product: the customer of
Session 1 might be more sensitive to rhythm, while the customer of Session 2 is influenced
more by melody.



Figure 9: Preference dynamics over the population.

Notes. The solid line is the population mean; the shaded region is the standard deviation across
individuals. The x-axis is the session position.

6.3 Dynamic Preference Heterogeneity

General meta-learning methods capture the heterogeneity across individuals by adaptively
computing the task-specific parameters given the context data. MetaTP further applies
the attention mechanism to quantify the time-varying heterogeneity within an individual. Such
a preference pattern is reflected in the local parameter $\beta_{it}$ for individual $i$ at position
$t$ of a listening session. The element $\beta_{itp}$, $p \in \{\text{acousticness}, \text{energy}, \cdots, \text{valence}\}$,
can be interpreted as the preference for covariate $p$, which represents the change in the log-odds
of track skipping given a one-unit change in the covariate $x_{itp}$.



Figure 10: Dynamics of individual preference variables.

Fig. 9 shows the preference dynamics on the target set for a few covariates at the
population level. The results for all the covariates are contained in § F of the Web Appendix.
The mean and standard deviation of $\beta_{itp}$ in Fig. 9 are computed across sessions. We set
$E[y_{it} \mid x_{it}] = 1/(1 + \exp(x_{it} \cdot \beta_{it}))$ and recall that $y_{it} = 1$ means customer $i$ skips the track
at the $t$-th position. As all the processed covariates are positive, the larger $\beta_{itp}$ is, the higher
the probability of non-skipping at a given level of an acoustic fingerprint. The magnitude
of $\beta_{itp}$ reflects the degree of preference for the acoustic fingerprint $p$, and its sign
reflects whether the acoustic fingerprint is positively or negatively related to non-skipping
behavior. For example, we find that acousticness is positively related to non-skipping, while
track duration and instrumentalness exhibit a negative relation. Across time, on
average, the preference for acousticness and instrumentalness increases, and the preference
for energy decreases. We also notice that the tendency to skip long tracks increases in the
early part of a session. These dynamics indicate a pattern of customer fatigue, where the
customer, on average, begins to prefer melodic or light music over heavy metal songs or
techno tracks as a listening session progresses.
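To make the sign interpretation above concrete, consider an illustrative calculation with made-up coefficient values (not estimates from our model). Since $y_{it}=1$ denotes a skip,
$$\Pr(y_{it} = 0 \mid x_{it}) = 1 - \frac{1}{1 + \exp(x_{it} \cdot \beta_{it})} = \frac{\exp(x_{it} \cdot \beta_{it})}{1 + \exp(x_{it} \cdot \beta_{it})},$$
so a coefficient of, say, $\beta_{it,\text{acousticness}} = 0.5$ multiplies the odds of not skipping by $e^{0.5} \approx 1.65$ for a one-unit increase in acousticness, while $\beta_{it,\text{duration}} = -0.3$ multiplies those odds by $e^{-0.3} \approx 0.74$.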
Beyond the aggregated patterns, our model uncovers time-varying heterogeneous preferences.
Fig. 10 plots the individual preference variable $\beta_{itp} \sim p_\theta(\beta_{it} \mid \mathcal{D}^c_i, t)$ sampled for
3 random sessions. Taking the preference for high-energy tracks as an example, we notice
that customers exhibit different fatigue patterns. To explore the driving factors behind the
dynamic patterns, we focus on Listeners and Skippers as two subgroups of the new customer
sessions. The Listeners are defined as the sessions with the top 10% listening rate, and the
Skippers are defined as the sessions with the bottom 10% listening rate in the context set.
The preference dynamics of these two subgroups are shown in Fig. 11. We find that the
level of historical engagement in the context set significantly influences the evolution of
preference dynamics. From the duration and valence panels, the Listeners are more tolerant
of long tracks and prefer songs with negative emotional valence more than the Skippers do.
We notice that the reduced interest in the energy feature is mainly among the Listeners, and the increased



interest in instrumentalness is more evident for the Skippers. The subgroup analysis
demonstrates the heterogeneous patterns of customer fatigue and identifies the listening
rate of the context set as a driving factor.

Figure 11: (Color online) Heterogeneous preference dynamics for subgroups.

Notes. The orange curve corresponds to the high listening rate group, and the blue curve corresponds
to the low listening rate group. The dynamics are over the target set.

Having shown the predictive superiority of our model and the various substantive insights
that it generates, we now move to showcase its use in several managerial tasks.

7 Managerial Applications

We demonstrate how MetaTP can inform the managerial actions of platforms for customizing
and extending new sessions that include only a few initial interactions on the platform. This is
a challenging scenario because of the limited information in a new session. We illustrate
in this section how a platform can achieve this in our context of music personalization. As



Table 4: Example of the optimal ordering of the target set.

Original target set Artist Skip Reordered target set

Silver Tiles Matt and Kim 1 Don’t Let Him Steal Your Heart
Dear Avery The Decemberists 1 October Road
Don’t Let Him Steal Your Heart Phil Collins 0 Ordinary Just Won’t Do
Ordinary Just Won’t Do Commissioned 0 Lay It All On Me
October Road James Taylor 0 Dear Avery
Lay It All On Me Rudimental 1 Silver Tiles

the tracks are anonymized in our data, we collected an additional 192 thousand tracks using
the Spotify API, with track names, artist information, and acoustic fingerprints, to facilitate an
assessment of the appropriateness of the recommendations.

7.1 Optimal Product Sequencing

We start with a scenario where customers may have a self-selected listening list and the
platform provides a listening order for the list. As MetaTP does not assume exchangeable
data, it can optimize the sequential ordering of a list to improve customer engagement
without changing the content of the list. The Transformer model can be used to produce
the position encodings for each permutation $\pi(\cdot)$ of the future track positions
$(\pi(t_i + 1), \cdots, \pi(T_i))$. The cross-attention mechanism can then use the position encodings to
compute the predictive distribution.

For each permutation, we compute the average skipping probability for the tracks in the
target set and select the permutation with the lowest average skipping probability. An example of
the re-ordered tracks from our data set is shown in Table 4. Interestingly, we see that the
optimal order tends to place the non-skipped tracks at the early positions. This could be
explained by the primacy effect noted in § 6.2, which implies that placing the preferred
products at early positions generates a positive carry-over effect that influences subsequent
decisions of customers.
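A minimal sketch of this permutation search follows; predict_skip_probs is a placeholder for the model's predictive distribution under a given ordering, so the function body and data shapes are illustrative assumptions rather than the actual MetaTP interface.

# Sketch: brute-force search over orderings of a small target set (the list in
# Table 4 has 6 tracks, so 6! = 720 permutations is tractable).
from itertools import permutations
import numpy as np

def predict_skip_probs(context_set, target_features, order):
    # Placeholder: return the predicted skip probability of each target track
    # when tracks are played in `order` (position encodings follow the order).
    rng = np.random.default_rng(abs(hash(order)) % (2**32))
    return rng.uniform(size=len(order))

def best_ordering(context_set, target_features):
    n = len(target_features)
    best_perm, best_avg = None, np.inf
    for perm in permutations(range(n)):
        avg_skip = predict_skip_probs(context_set, target_features, perm).mean()
        if avg_skip < best_avg:
            best_perm, best_avg = perm, avg_skip
    return best_perm, best_avg

context_set = None                        # the session's observed initial interactions
target_features = np.random.rand(6, 16)   # 6 self-selected tracks, 16 fingerprints
print(best_ordering(context_set, target_features))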

7.2 Context Similarity-based Recommendation

At the beginning of a newly started session, the platform wants to quickly capture the
music preferences of the customer from the initial interactions and recommend future tracks.
One approach to completing the listening session involves matching the focal
new session to an existing session with a similar context set. Matching the listening history is



nontrivial, as it synthesizes the music acoustic fingerprints and the sequential pattern of customer
skipping behavior. The self-attention mechanism unifies these pieces of information into the
context embedding $z^c_k = (z^c_{k1}, \cdots, z^c_{k t_k})$. As the model predicts the target sequence based
on the embeddings $z^c_k$, the context embedding contains sufficient information relevant to
future behavior. We therefore use these embeddings to match sessions.

Table 5: Skipping patterns matched on the context set.

Track Artist Skip Track Artist Skip

Context
Ramblin’ Man The Allman Brothers Band 0 Glittering Prize Simple Minds 0
California - Tchad Blake Mix Phantom Planet 1 Peaches Bob Schneider 1
Riding With The King Eric Clapton 1 Majesty of Heaven Chris Tomlin 1
Life In The Fast Lane Eagles 1 I Don’t Want This Night to End Luke Bryan 1
Stolen Moments Dan Fogelberg 0 Majesty of Heaven Chris Tomlin 0

Target
Bottoms Up Trey Songz 0 Gave Me Something Jess Glynne 0
Me Myself I & Boyz II Men 0 Sister Lost Soul Alejandro Escovedo 0
Mashed Potatoes Rufus Thomas 0 Make You Miss Me Sam Hunt 0
Posse Big Kuntry King 1 Here In the Real World Alan Jackson 0
On Da Hush - Trina 1 Been There Clint Black 0
The Most Beautiful BTS 1 Sweet 16 Green Day 0
Yo Quería Cristian Castro 1 Smoke Break Carrie Underwood 0
Jezebel - Demo Version Depeche Mode 1 Any Ol’ Barstool Jason Aldean 0

Notes. The left table is the focal session, and the right table is the matched session with a low skipping
rate on the future tracks.

Specifically, for a focal new session $k$, we identify the $K$-nearest neighbors $\{\hat{z}_1, \cdots, \hat{z}_K\}$
in the embedding space from the existing sessions using the $L_2$-norm. These existing sessions
have observed skipping behaviors in their target sets. We match the focal session with
the session among the nearest neighbors that has the lowest skipping rate on the tracks
in its target set, and then recommend the tracks of the matched session's target set to
the focal session. Since the context set is predictive of the skipping behaviors on future
tracks in the target set, matching on the context sets suggests similar responses to the
subsequent tracks between the focal and matched sessions [78]. Table 5 shows an instance
of context-matched session pairs. The matched context sets consist of similar tracks with
strong rhythms and similar skipping patterns, while the skipping rate of the focal session's
future tracks is seven times higher than that of the matched session. By replacing the focal
listening session's target set with the matched session's target set, we find an average 7%
decrease in the model-predicted skipping probability on the target set.
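A minimal sketch of this matching procedure, with synthetic placeholder arrays in place of the actual embeddings and session records:

# Sketch: match a focal context embedding to existing sessions and borrow the
# target list of the least-skipped neighbor (K nearest under the L2 norm).
import numpy as np

def recommend_by_matching(focal_emb, existing_embs, existing_target_skips,
                          existing_target_tracks, K=10):
    dists = np.linalg.norm(existing_embs - focal_emb, axis=1)
    neighbors = np.argsort(dists)[:K]                       # K closest context sets
    target_rates = np.array([existing_target_skips[j].mean() for j in neighbors])
    matched = neighbors[np.argmin(target_rates)]            # lowest target skip rate
    return existing_target_tracks[matched]

rng = np.random.default_rng(0)
existing_embs = rng.normal(size=(2000, 64))
existing_target_skips = rng.integers(0, 2, size=(2000, 8))
existing_target_tracks = [["track_%d_%d" % (j, p) for p in range(8)] for j in range(2000)]
focal_emb = rng.normal(size=64)

print(recommend_by_matching(focal_emb, existing_embs, existing_target_skips,
                            existing_target_tracks))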

7.3 Model-based One-shot Session Completion

Instead of matching sessions, MetaTP can provide a model-based approach to generate
an optimal continuation that is not necessarily limited to the existing sessions in our
data. Specifically, for each session position in the target set, we can estimate the skipping


Table 6: Session completion for heterogeneous context sets.

Rock music lover Classical music lover

Track Artist Instrument. Track Artist Instrument.

Can You Bounce? Ice Cube 0.000 Omni Potens Morbid Angel 0.959
Face Facts Kottonmouth Kings 0.000 The Garden: The Garden George Winston 0.989
Einstein Tech N9ne Tech N9ne 0.375 Cinny’s Waltz Tom Waits 0.996
ATLiens OutKast 0.394 Siboney - Live Buena Vista 0.989
You Gotta Lotta That Ice Cube 0.000 O Sacred Head Jim Brickman 0.992
Juke It Snypaz 0.000 I Went From - Intro Brotha Lynch Hung 0.996
To My People Flipmode Squad 0.000 In The Kingdom of Peace Jean-Luc Ponty 0.995
F*k Somthin Webbie 0.000 Rest Your Head George Winston 0.986
Beats Knockin Jack Ü 0.000 Calliope Tom Waits 0.959
Bianca’s and Beatrice’s Tech N9ne 0.000 The Map Room: Dawn John Williams 0.969

Average - 0.077 Average - 0.983

Notes. This table shows the recommended next 10 tracks for two sessions. The left session has
Context Set 1, and the right session has Context Set 2 from Table 2. Instrumentalness measures
whether a track contains no vocals. The value of instrumentalness is the quantile among all the
recommendation candidates of 192 thousand tracks. MetaTP accurately adapts to heterogeneous
preferences from a context set with only 10 tracks.

probability for each track in the recommendation list using its acoustic fingerprints and
then complete the session using the track with the highest non-skipping probability at that
position.
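A minimal sketch of this greedy, position-by-position completion; skip_prob is a placeholder standing in for MetaTP's predictive skip probability given the context set, the candidate's acoustic fingerprints, and the position encoding.

# Sketch: one-shot session completion by picking, at each target position, the
# candidate track with the highest predicted non-skip probability.
import numpy as np

def skip_prob(context_set, track_features, position):
    # Placeholder for the model's predictive skip probability.
    return float(np.clip(track_features.mean() + 0.01 * position, 0, 1))

def complete_session(context_set, candidate_features, n_positions=10):
    completed = []
    for pos in range(n_positions):
        probs = np.array([skip_prob(context_set, f, pos) for f in candidate_features])
        probs[completed] = np.inf                   # do not recommend a track twice
        completed.append(int(np.argmin(probs)))     # highest non-skip = lowest skip
    return completed

candidates = np.random.rand(500, 16)                # candidate tracks x fingerprints
print(complete_session(context_set=None, candidate_features=candidates))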
We first evaluate how the model completes the listening sessions for a rock music lover
and a classical music lover. Specifically, we take the synthetic Context Sets 1 and 2 from Table 2,
which exhibit preferences for Metallica and George Winston tracks, respectively. Table 6
shows the completed session for the two context sets. For ease of interpretation, we show
the instrumentalness fingerprint for each track, a fingerprint that is typically low for rock
music and high for non-vocal pure music. Since instrumentalness has a skewed distribution,
we compute the quantile of each track’s instrumentalness among all the recommendation
candidates. The recommended tracks clearly match the tastes of the customers: the tracks for
the rock music lover are by artists like Tech N9ne and Ice Cube, and the tracks for the classical
music lover are by pianists and violinists like Jim Brickman and Jean-Luc Ponty. MetaTP
manages to adapt to the session heterogeneity with as few as ten tracks in the context set.
It is important to highlight that the adaptation mechanism is not pre-programmed into the
model; instead, it is meta-learned from data – an ability automatically discovered from
predicting target sets for thousands of sessions in the training tasks. In addition to the above
synthetic context sets, we show in Table F.1 of the Web Appendix, the one-shot completion
for two context sets from the observed hold-out sessions, one with a low skip rate and the
other with a high skip rate for low danceability tracks. Similar to the synthetic context
sets, we find MetaTP can recommend tracks with correctly adapted danceability levels to
complete the sessions.



Group session completion. MetaTP can also complete a music session jointly for a
group of people. Creating group playlists is popular on music streaming platforms, allowing
shared music lists to be created for occasions like social events. For example, Spotify
launched the Collaborative Playlist feature in 2008⁴ and introduced the Jam service in 2023
to provide personalized, real-time listening sessions for a group.⁵ MetaTP can
leverage the context sets of the group members and complete the session with a sequence of
tracks that appeals to all of them. Each track can be selected to maximize the
probability that no group member will skip it, as sketched after Table 7. Table 7 shows a
list of tracks recommended by MetaTP for a group of three individuals from our data with
varied context sets: one has few skips for high-valence tracks, one has a high skip rate for
low-danceability tracks, and another has a low skip rate for low-acousticness tracks. Table 7
also contains key acoustic fingerprints of the recommended tracks and the probability of
not skipping for each group member. The recommended tracks cater to the group's personalized
preferences: the tracks are generally happy, danceable, and less acoustic. The non-skip probability is
uniformly high for each group member.

Table 7: Joint session completion for a customer group.

Track Valence Danceability Acousticness Prob 1 Prob 2 Prob 3

Jimmy Collins’ Wake 0.96 0.55 0.05 0.71 0.77 0.80


Counting Stars 0.89 0.63 0.01 0.71 0.76 0.80
Tear Jerky 0.96 0.64 0.01 0.71 0.76 0.79
Good Times, Cheap Wine 0.94 0.79 0.05 0.71 0.76 0.79
I Wanna Be Your Boyfriend 0.90 0.45 0.00 0.71 0.76 0.79
So So Long 0.96 0.63 0.06 0.71 0.76 0.79
Los Vaquetones 0.96 0.76 0.34 0.71 0.76 0.79
Old Time Rock And Roll 0.94 0.54 0.10 0.70 0.76 0.81
God Save Rock n Roll 0.90 0.71 0.01 0.69 0.77 0.81
Hot Summer Night 0.96 0.68 0.04 0.70 0.77 0.80
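A minimal sketch of the selection rule used for Table 7 is given below; non_skip_prob is a placeholder for the per-member predictive probability, and operationalizing "the probability that no group member skips" as the product of the members' non-skip probabilities assumes conditional independence across members.

# Sketch: group session completion. At each position, choose the candidate that
# maximizes the joint probability that none of the members skips it.
import numpy as np

def non_skip_prob(member_context, track_features, position):
    # Placeholder for MetaTP's per-member predictive non-skip probability.
    return float(np.clip(0.5 + 0.3 * track_features.mean() - 0.01 * position, 0, 1))

def complete_group_session(member_contexts, candidate_features, n_positions=10):
    chosen = []
    for pos in range(n_positions):
        joint = np.ones(len(candidate_features))
        for ctx in member_contexts:
            joint *= np.array([non_skip_prob(ctx, f, pos) for f in candidate_features])
        joint[chosen] = -1.0                        # avoid repeating a track
        chosen.append(int(np.argmax(joint)))
    return chosen

members = [None, None, None]                        # three members' context sets
candidates = np.random.rand(500, 16)
print(complete_group_session(members, candidates))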

7.4 Adaptive Session Completion

The similarity-based and one-shot session completion methods are open-loop recommen-
dations, which operate without considering the immediate feedback or dynamic reactions of
the customers as the session is extended. We now leverage the MetaTP-seq discussed in § 5.1
to illustrate how closed-loop session completion can be done. As we do not have dynamic
4 See https://newsroom.spotify.com/2020-09-29/how-to-make-a-collaborative-playlist/
5 See https://newsroom.spotify.com/2023-09-26/spotify-jam-personalized-collaborative-listening-session-free-premium-users/



feedback for the recommended tracks in our data set, we simulate virtual sessions that are
dynamically constructed with a set of pre-specified response rules for the customers. The
MetaTP model suggests the next track to be included in the session based on the simulated
real-time customer feedback and interactions with previous tracks.

Each virtual new customer $k$ is first given one randomly selected track. The customer
response $y_{k1}$ to this track and its features $x_{k1}$ are used as the initial context set, which is then
leveraged to suggest the next track. At any position $t$ in the sequence, MetaTP takes the
data $\mathcal{D}^c_k(t) = \{x_{kt'}, y_{kt'}\}_{t'=1}^{t}$ as the context set and completes the session at position $t+1$
with the track $x_{k,t+1}$ that has the lowest skipping probability. After observing the skipping behavior $y_{k,t+1}$,
the context set is updated as $\mathcal{D}^c_k(t+1) = \mathcal{D}^c_k(t) \cup \{(x_{k,t+1}, y_{k,t+1})\}$.
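A minimal sketch of this closed loop follows; metatp_skip_prob and the candidate catalog are placeholders for the MetaTP-seq model and the track database, and customer_response implements the pre-specified rule for a synthetic High Preference customer described in the next paragraph.

# Sketch: closed-loop session completion against a simulated customer.
import numpy as np

rng = np.random.default_rng(0)
catalog = rng.uniform(size=(1000, 16))              # candidate tracks x fingerprints
FEATURE, THRESHOLD = 2, 0.5                         # e.g., the acousticness column

def metatp_skip_prob(context_x, context_y, track):
    # Placeholder: the real model re-encodes the growing context set at each step.
    liked = context_x[np.array(context_y) == 0]
    if len(liked) == 0:
        return 0.5
    return float(np.clip(np.abs(track - liked.mean(axis=0)).mean(), 0, 1))

def customer_response(track):
    # High Preference rule: skip w.p. 0.1 above the threshold, 0.9 below it.
    p_skip = 0.1 if track[FEATURE] > THRESHOLD else 0.9
    return int(rng.uniform() < p_skip)

first = catalog[rng.integers(len(catalog))]         # one randomly selected first track
context_x, context_y = [first], [customer_response(first)]

for t in range(9):                                   # extend the session track by track
    probs = [metatp_skip_prob(np.array(context_x), context_y, trk) for trk in catalog]
    nxt = catalog[int(np.argmin(probs))]
    context_x.append(nxt)                            # update D_k^c(t+1) with the new pair
    context_y.append(customer_response(nxt))

print("listening rate:", 1 - np.mean(context_y))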

Figure 12: Session completion for two heterogeneous groups.

Notes. The blue curve is for the High Preference group, and the orange curve is for the Low Preference
group.

Specifically, we study whether MetaTP can quickly uncover customer preferences based
on a limited number of responses. We use preferences pertaining to specific acoustic
fingerprints to define two groups of synthetic customers. We define a High Preference
group that prefers a high value of that acoustic fingerprint. Customers in this group are
assumed to skip a track with a probability of 0.1 if the acoustic fingerprint of the track is
above a threshold, and otherwise skip with a probability of 0.9. Similarly, a second synthetic
group (Low Preference) skips with a probability of 0.9 (or 0.1) if the acoustic fingerprint
is above (or below) the threshold. Fig. 12 shows the average acoustic fingerprint level
(e.g., acousticness and instrumentalness) of the recommended tracks for the two groups of
customers. The magnitude of the corresponding acoustic fingerprint is averaged over the top
5% of the recommended tracks and 100 customers from each group to reduce sampling
variance. The initial tracks recommended for both groups are similar, but after 3 to 4 tracks
the acousticness or instrumentalness level of the tracks offered to the High Preference
group becomes larger than that of the Low Preference group. These two types of customers
follow a hypothesized skipping pattern whose distribution may differ significantly from that of the
observed existing customers in the calibration data. Nevertheless, MetaTP can adapt to these



hypothesized customers’ unique preferences from a few observations. Compared to a static
session completion using Random Forest that does not use the new customer context data,
we find MetaTP increases the whole-session listening rate by 16.7%. Similarly, compared
to a random session completion procedure, MetaTP increases the listening rate by 17.9%.

Figure 13: Sampled sessions for the customers with high and low preferences for the acousticness
fingerprint.

Notes. Tracks associated with the blue and orange curves are for one customer from the High
Preference and Low Preference groups, respectively.

Fig. 13 visualizes a sampled session for a customer in the High Preference group and a
customer in the Low Preference group for the acousticness fingerprint. Both customers start
from the track The Twelve Days of Christmas. The subsequent session for the customer
from the High Preference group mainly consists of tracks like Space Kay, Kitchen Girl, and I Ain't
Gonna Be The First To Cry, with acousticness over 0.8, while the session for the customer
from the Low Preference group mainly consists of songs like Crazy Rap and No Hate, with
heavy electronic sounds and acousticness below 0.4.

8 Discussion

In this paper, we proposed a framework for measuring preferences and predicting future
behaviors within ongoing digital sessions by integrating meta-learning and Transformer-
based sequence modeling. The framework generates accurate future predictions for a new
customer session only using a small number of observations and can produce personalized
sequential recommendations on the fly. The ability to adapt quickly hinges on a new
design of multiple training tasks that reuses knowledge from existing sessions. The key benefits
of the model are its data efficiency, flexibility, and transparency. In the empirical



application, we demonstrate how the model can flexibly leverage different information sets
and sizes of context data. Though interpreting neural network-based models is challenging,
our framework produces a variety of interpretable preference measurements by synergiz-
ing techniques of embeddings, attention mechanisms, explainable AI, and hierarchical
probabilistic modeling.
We applied our approach to a rich dataset of music listening sessions, where music
tracks are characterized based on their acoustic fingerprints. Our findings demonstrate that
our model predicts better than both static and adaptive benchmark models. Furthermore,
our model provides meaningful embeddings and interpretable parameters that capture
the evolution of consumer preferences during listening sessions.
We then showcased several managerial applications crucial for digital platforms, enabling
them to initiate recommendations and craft listening sessions for consumers with minimal
initial interactions. These applications encompass optimal product sequencing, context-
based recommendations, model-driven one-shot session completion, and adaptive session
completion.
While our model’s application was demonstrated within the context of a digital music
platform, its framework extends beyond this domain to various digital media platforms
such as YouTube, Instagram, Facebook, and TikTok. These platforms encounter a common
challenge of adapting to the tastes of new users and sessions. Our Transformer-based model
accommodates diverse sets of covariate features, including images, videos, and audio.
Although our model offers numerous advantages over competing frameworks, it is crucial
to acknowledge its limitations. First, while we employed a linear decoder to enhance
interpretability, utilizing more sophisticated decoders, such as deep neural networks or
autoregressive flows [79], could enhance predictive performance. Second, we only use
acoustic fingerprints to inform future behaviors. If auxiliary information like customer click
patterns is observed in the context set, such information can potentially be leveraged by
the encoder. Additionally, our current approach generates a point estimate of the global
parameters; future directions could explore Bayesian estimation for all the parameters [46].
This paper opens several promising directions for future research. First, the task in
this paper is constructed by segmenting data at the session level. An interesting direction
is to consider product-based and time-based task segmentation and combine them for
meta-learning. A natural approach is to combine the context embeddings from different
types of task constructions using methods similar to neural collaborative filtering [54].
Our framework also provides opportunities for building privacy-preserved targeting and
personalization [80]. Since the tasks for our model are customer-specific, each task uses
data from exactly one customer. A master model can be distributed to customers; the
customers use locally stored data to calibrate the model by solving a single task, then send
the model updates back to the platform for aggregation and re-distribution. We expect this
federated learning approach [81, 82], in combination with our meta-learning framework,
can offer personalized models while protecting customer data privacy. Finally, MetaTP



may be leveraged to study dynamic sequences such as prices of online products [83, 84]
or new user on-boarding questionnaires. We anticipate that researchers and marketing
professionals will build on our meta-learning framework in their future research.

Web Appendix

This Web appendix has been supplied by the authors to aid in the understanding of their
paper.

A Summary of Methodologies

Algorithm 1: Meta-Temporal Processes

Input: Data $\{\mathcal{D}^c_i, \mathcal{D}^q_i\}_{i=1}^N$, encoder $p_{\theta^{(E)}}(\beta_{it} \mid \mathcal{D}^c_i, t)$, decoder $p_{\theta^{(D)}}(y_{it} \mid x_{it}, \beta_{it})$, batch size $B$
Output: Model with parameters $\theta$
1  Initialize the parameter $\theta$ randomly;
2  while not converged do
3      Sample a random set of customers $\mathcal{A} \subset \{1, 2, \cdots, N\}$ with $|\mathcal{A}| = B$;
4      for $i \in \mathcal{A}$ do
5          Compute the context embedding $z^c_i$ by self-attention on $\mathcal{D}^c_i$;
6          Compute $\mu_{it}, \sigma^2_{it}$ by cross-attention with $z^c_i$, $\mathcal{D}^c_i$, $t$;
7          Sample local variables $\beta^l_{it} \sim p_{\theta^{(E)}}(\beta_{it} \mid \mu_{it}(\mathcal{D}^c_i, t), \mathrm{diag}(\sigma^2_{it}(\mathcal{D}^c_i, t)))$, for $l = 1:L$, $t = t_i+1 : T_i$;
8      Update $\theta$ according to Eq. (3) with step size $\eta$.

B Details of the Transformer Model

We design the encoder $p_{\theta^{(E)}}(\beta_{it} \mid \mathcal{D}^c_i, t)$ using the Transformer model and the attention
mechanism therein [9, 85]. Here, we first introduce the general techniques of the attention
mechanism, and then illustrate how to leverage attention to infer the time-varying individual
preferences from music sessions. The attention mechanisms are illustrated in Fig. B.2 and are
components of the Transformer model depicted in Fig. B.1.

Figure B.1: Transformer component

Attention Mechanism. Attention is a mapping that takes the key, value, and query
matrices as inputs and outputs an embedding for each vector in the query matrix. The
key matrix $K \in \mathbb{R}^{t \times m_k}$ contains the $m_k$-dimensional keys, the value matrix $V \in \mathbb{R}^{t \times m_v}$ contains
the $m_v$-dimensional values, and the query matrix $Q \in \mathbb{R}^{t' \times m_k}$ contains the $m_k$-dimensional
queries. The number of keys and values is $t$ and the number of queries is $t'$. The output
embeddings $S$ are computed according to Eq. (4) as
$$S = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{m_k}}\right) V,$$
where $S \in \mathbb{R}^{t' \times m_v}$ contains the $m_v$-dimensional embedding for each query element. Intuitively,
each embedding is a weighted combination of the values in $V$, and the attention weights
$W^a = \mathrm{softmax}(Q K^\top / \sqrt{m_k})$ measure the alignment between the query elements and the keys.
In practice, we use multi-head attention to capture heterogeneous dependency structures
between the queries and keys. Consider $H > 1$ heads. Each head $h \in \{1, \cdots, H\}$ has its own
query $Q_h$, key $K_h$, and value $V_h$ matrices and outputs $S_h$ according to Eq. (4). The multi-head
attention output combines the outputs of all the attention heads as $S = [S_1, \cdots, S_H] W_o \in \mathbb{R}^{t' \times m}$,
where $W_o \in \mathbb{R}^{H m_v \times m}$ is an aggregation matrix. We use the attention mechanism to
first generate embeddings for the context set and then for the target set.
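For concreteness, here is a minimal PyTorch sketch of the scaled dot-product attention in Eq. (4) and the multi-head aggregation; the dimensions and random weight matrices are illustrative stand-ins for the learned projections, and the optional boolean mask shows where causal masking (footnote 6) enters.

# Sketch of Eq. (4) and the multi-head aggregation in PyTorch.
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    # W_a = softmax(Q K^T / sqrt(m_k)); S = W_a V
    logits = Q @ K.transpose(-2, -1) / math.sqrt(Q.shape[-1])
    if mask is not None:                              # e.g., a causal mask
        logits = logits.masked_fill(mask, float("-inf"))
    weights = torch.softmax(logits, dim=-1)
    return weights @ V, weights

t, t_prime, m, m_k, m_v, H = 10, 4, 64, 16, 16, 4
E_context = torch.randn(t, m)                         # encoded context set
E_query = torch.randn(t_prime, m)                     # encoded target positions

W_Q = [torch.randn(m, m_k) for _ in range(H)]         # per-head projections
W_K = [torch.randn(m, m_k) for _ in range(H)]
W_V = [torch.randn(m, m_v) for _ in range(H)]
W_o = torch.randn(H * m_v, m)                         # aggregation matrix

heads = [scaled_dot_product_attention(E_query @ W_Q[h],
                                      E_context @ W_K[h],
                                      E_context @ W_V[h])[0] for h in range(H)]
S = torch.cat(heads, dim=-1) @ W_o                    # S in R^{t' x m}
print(S.shape)                                        # torch.Size([4, 64])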
Self-attention for context embedding. The context set contains the initial session's
track features and skipping behaviors. For each context data point $(x_{it}, y_{it})$, we transform the
track features $x_{it}$ into an embedding $E^x_{it} \in \mathbb{R}^m$ by a feed-forward network, and transform
the skipping indicator $y_{it}$ (one-hot encoded) by an embedding matrix $U$. Written in matrix form,
the transformations are
$$E^x_i = \sigma(X_i W_1 + b_1) W_2 + b_2, \quad X_i = [x_{i1}, \cdots, x_{i t_i}]^\top; \qquad E^y_i = Y_i U, \quad Y_i = [y_{i1}, \cdots, y_{i t_i}]^\top. \qquad \text{(B.1)}$$



(a) Self-attention    (b) Cross-attention

Figure B.2: Illustration of the attention mechanism.

We apply the position encoding in the Transformer models [9] to define an m-dimensional



temporal encoding for each track position in the context set,
$$(E^{p,t}_i)_j = \begin{cases} \cos\left(t/10000^{\frac{j-1}{m}}\right), & \text{if } j \text{ is even} \\ \sin\left(t/10000^{\frac{j}{m}}\right), & \text{if } j \text{ is odd} \end{cases} \qquad \text{(B.2)}$$
for elements $j = 1, \cdots, m$, with the position encoding matrix $E^p_i = [E^{p,1}_i, \cdots, E^{p,t_i}_i]^\top$. The
context set of consumer $i$ is encoded as $E^c_i = E^x_i + E^y_i + E^p_i \in \mathbb{R}^{t_i \times m}$. The query, key, and
value matrices for each attention head $h$ are computed as $Q_{ih} = E^c_i W^Q_h$, $K_{ih} = E^c_i W^K_h$, and
$V_{ih} = E^c_i W^V_h$ with linear transformation weights $W^Q_h, W^K_h \in \mathbb{R}^{m \times m_k}$ and $W^V_h \in \mathbb{R}^{m \times m_v}$. We then
apply the attention mechanism, as described in the previous section, to map the inputs
$\{Q_{ih}, K_{ih}, V_{ih}\}_{h=1}^H$ to the output matrix $S \in \mathbb{R}^{t_i \times m}$. Finally, this output matrix $S$ is fed through a
position-wise feed-forward neural network $f_{\eta_0}(\cdot)$, generating the embeddings of the context
set $z^c_i = f_{\eta_0}(S)$ with $z^c_i = [z^c_{i1}, \cdots, z^c_{i t_i}] \in \mathbb{R}^{t_i \times m}$. We apply the causal masking⁶ technique by
setting the attention weights for future data to zero so that the embedding $z^c_{it}$ at session
position $t$ is independent of future tracks $\{x_{it'}, y_{it'}\}_{t' > t}$.
within the context set and encodes it in the embedding z ci . Next, we apply the cross-
attention, the attention mechanism between the context set and the target set, to learn
the dynamic pattern of future sessions given the context. The position encoding Eit is
similarly computed for the target data with t > t i . The query, key, and value matrices
Q
are computed as Q̃ih = Eit W̃h , K̃ih = Eic W̃hK , and Ṽih = z ci W̃Vh with weights W̃Qh , W̃hK ∈ Rm×mk
and W̃hV ∈ Rm×mv . Eventually, the output S̃ by the attention mechanism in Eq. (4) with
H
inputs {Q̃ih , K̃ih , Ṽih }h=1 is mapped to the mean and variance of the approximate posterior
distribution as β i ∼ N (µη1 (S̃), ση2 1 (S̃)I) where η1 are the parameters for a feed-forward
t

network. We use θ (E) to denote all the parameters shared across the tasks as θ (E) =
{W1 , W2 , b1 , b2 , U, {WQh , WhK , WhV }h=1
H
, η0 , {W̃Qh , W̃hK , W̃hV }h=1
H
, η1 }.

Figure B.1: Posterior prediction of the Adaptive HB Logit model.

6 Causal masking in Transformers is a technique used to ensure that, during training, a model only has access to present and past data, but not future data. In the context of a sequence, this means each position can only attend to positions before it (or itself), maintaining the causal or sequential order of the data.



C Derivation of Adaptive Mixed Logit Model

The Mixed Logit Model is a hierarchical Bayes model with the likelihood and prior
$$y_{it} \mid x_{it}, \beta_i \sim \mathrm{Cat}(\sigma(x_{it} \cdot \beta_i)), \quad \beta_i \sim \mathcal{N}(\mu, D(\tau) L D(\tau)), \quad \tau_p \sim \mathrm{HalfCauchy}(0, a), \quad L \sim \mathrm{LKJ}(b), \quad \mu_p \sim \mathcal{N}(0, c^2) \qquad \text{(C.1)}$$
for $i = 1, \cdots, N$ and $t = 1, \cdots, T_i$. We use HMC to draw $V$ posterior samples for each latent variable,
$$\beta_i^v, \tau^v, \mu^v, L^v \sim p(\beta_i, \tau, \mu, L \mid \mathcal{D}_1, \cdots, \mathcal{D}_N), \quad v = 1, \cdots, V. \qquad \text{(C.2)}$$

Suppose for a new customer $k > N$, we observe some additional new context data $\mathcal{D}_k$.
Let all the training data be $\mathcal{M} = \{\mathcal{D}_1, \cdots, \mathcal{D}_N\}$. Then the posterior predictive distribution
for the data in the target set of the new customer is
$$\begin{aligned}
p(y_{kt} \mid x_{kt}, \mathcal{D}_k, \mathcal{M}) &= \int p(y_{kt} \mid x_{kt}, \beta_k, \mathcal{D}_k, \mathcal{M})\, p(\beta_k \mid \mathcal{D}_k, \mathcal{M})\, d\beta_k \\
&= \int p(y_{kt} \mid x_{kt}, \beta_k)\, p(\mathcal{D}_k \mid \beta_k, \mathcal{M})\, p(\beta_k \mid \mathcal{M})\, d\beta_k \cdot \text{Const.} \\
&= \int p(y_{kt} \mid x_{kt}, \beta_k)\, p(y^c_k \mid x^c_k, \beta_k, \mathcal{M})\, p(x^c_k \mid \beta_k, \mathcal{M})\, p(\beta_k \mid \mathcal{M})\, d\beta_k \cdot \text{Const.} \\
&= \iint p(y_{kt} \mid x_{kt}, \beta_k)\, p(y^c_k \mid x^c_k, \beta_k)\, p(\beta_k \mid \theta)\, p(\theta \mid \mathcal{M})\, d\theta\, d\beta_k \cdot \text{Const.}
\end{aligned} \qquad \text{(C.3)}$$
where $\theta = (\tau, \mu, L)$. The derivation is based on the graphical model of inference and
prediction for the HB Logit model in Fig. B.1 and the corresponding d-separation rules.
The first equality holds because $x_{kt} \perp\!\!\!\perp \beta_k$ when not conditional on $y_{kt}$; the second equality
follows from the Bayes rule $p(\beta_k \mid \mathcal{D}_k, \mathcal{M}) = p(\mathcal{D}_k \mid \mathcal{M}, \beta_k)\, p(\beta_k \mid \mathcal{M}) / p(\mathcal{D}_k \mid \mathcal{M})$ and
$y_{kt} \perp\!\!\!\perp \mathcal{D}_k, \mathcal{M} \mid x_{kt}, \beta_k$, where the multiplicative constant is $1/p(\mathcal{D}_k \mid \mathcal{M})$; the third
equality is by $\mathcal{D}_k = (x^c_k, y^c_k)$; the last equality is by $y^c_k \perp\!\!\!\perp \mathcal{M} \mid x^c_k, \beta_k$ and
$x^c_k \perp\!\!\!\perp \beta_k, \mathcal{M}$, where the multiplicative constant is $p(x^c_k)/p(\mathcal{D}_k \mid \mathcal{M})$.
We draw $V$ samples $\theta^v \sim p(\theta \mid \mathcal{M})$, $\beta^v_k \sim p(\beta_k \mid \theta^v)$ and compute $p(y_{kt} \mid x_{kt}, \beta^v_k)$ and
$p(y^c_k \mid x^c_k, \beta^v_k)$. Samples from the marginal posterior $p(\theta \mid \mathcal{M})$ are distributionally the same
as the samples obtained from the full posterior $p(\{\beta_i\}_{i=1:N}, \theta \mid \mathcal{M})$ using HMC, so the samples
of $\theta$ from Eq. (C.2) can be used directly. Finally, the Monte-Carlo estimate of Eq. (C.3) is
$$\hat{p}(y_{kt} \mid x_{kt}, \mathcal{D}_k, \mathcal{M}) = \sum_{v=1}^{V} w_v\, p(y_{kt} \mid x_{kt}, \beta^v_k), \qquad w_v = \frac{p(y^c_k \mid x^c_k, \beta^v_k)}{\sum_{v'=1}^{V} p(y^c_k \mid x^c_k, \beta^{v'}_k)}. \qquad \text{(C.4)}$$


We take the predicted outcome as the one that has the highest predictive likelihood. Intuitively,
the weight $w_v$ scales according to how accurately each posterior sample $\beta^v_k$ predicts
the new individual's context data $\mathcal{D}^c_k$.
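The following NumPy sketch illustrates the reweighting in Eq. (C.4) with simulated posterior draws. The generic logistic parameterization Pr(y = 1 | x, β) = σ(x · β) used here is an illustrative assumption; the weighting and averaging steps are the same under either sign convention for the skip outcome.

# Sketch of Eq. (C.4): weight posterior draws by how well they explain the new
# customer's context set, then average their predictions for a target track.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adaptive_predictive(beta_draws, x_context, y_context, x_new):
    p1 = sigmoid(x_context @ beta_draws.T)                       # (t_c, V) Pr(y=1)
    lik = np.prod(np.where(y_context[:, None] == 1, p1, 1 - p1), axis=0)
    w = lik / lik.sum()                                          # normalized weights w_v
    return float(sigmoid(x_new @ beta_draws.T) @ w)              # weighted predictive prob.

rng = np.random.default_rng(0)
V, p, t_c = 500, 16, 10
beta_draws = rng.normal(size=(V, p))        # beta_k^v drawn from p(beta_k | theta^v)
x_context = rng.uniform(size=(t_c, p))
y_context = rng.integers(0, 2, size=t_c)
x_new = rng.uniform(size=p)
print(adaptive_predictive(beta_draws, x_context, y_context, x_new))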

D Implementation Details

For the proposed method, we set the dimension of the initial context-point encoding $E^c_i$ to
64. The dimensions $m_k, m_v$ for the key, query, and value matrices of attention are set to
16, and we use 4 attention heads. The feed-forward neural network has two layers, where
the latent layer dimension is 128 and the activation function is $\mathrm{ReLU}(x) = \max\{0, x\}$. We
use dropout with a rate of 0.1 for all the neural network parameters during model
training. We use the Adam optimizer [57] implemented in PyTorch [58] with a learning
rate of $10^{-4}$. The learning rate is decayed by a factor of 0.9 every 20 epochs. As shown in
Fig. D.1, the algorithm converges in around 2,000 iterations, where each iteration processes
200 sessions.

Figure D.1: Trace plot of model fitting.

The logistic regression and random forest are implemented using the LogisticRegression
and RandomForestClassifier modules of the Scikit-learn library in Python [69]. For the
FT-LR method, the logistic regression model is first calibrated on all the training sessions;
the pre-trained model is then fine-tuned on the new session's context data by performing
stochastic gradient descent. The method is implemented with the SGDClassifier module and
the partial_fit function therein.⁷ For FT-RF, a random forest model is pre-trained on the
training sessions with the warm_start parameter set to True, which allows fitting additional
weak learners, i.e., classification trees, to an already fitted model.⁸ We set the number
7 https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html
8 https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html



Table D.1: Definitions of Acoustic Fingerprints

Acoustic Fingerprints Definition

Duration The track duration in minutes.


Popularity How popular a track is relative to all other tracks.
Acousticness Degree to which a track is acoustic (e.g. sounds by acoustic
guitar, piano, strings).
Beat Strength Strength of beats in a track.
Bounciness Rhythmic lift of a track.
Danceability Track’s suitability for dancing.
Energy Track’s intensity and dynamism.
Flatness Signal’s lack of fluctuation.
Mechanism Mechanical sound presence in music.
Liveness Likelihood track was recorded in front of a live audience.
Loudness Track’s average decibel level.
Instrumentalness Track’s lack of vocal content.
Organism Organic sound presence in music.
Speechiness Spoken word presence in a track.
Tempo Beats per minute of a track.
Valence The degree to which a song conveys positive emotions.
Key Pitch center of a track using standard pitch class notation.
Mode Scale type (major=1, minor=0) of a track.

of trees for the pre-trained model as 20, and use an additional 2 trees to fit the context data
of a new customer. The number of trees is chosen to maximize the predictive accuracy on a
hold-out set.
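A minimal scikit-learn sketch of the two fine-tuned benchmarks follows; the arrays are synthetic placeholders, and the pre-training/fine-tuning split mirrors the description above.

# Sketch of the fine-tuned benchmarks: FT-LR via SGDClassifier.partial_fit and
# FT-RF via RandomForestClassifier with warm_start. Data are placeholders.
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train, y_train = rng.uniform(size=(5000, 16)), rng.integers(0, 2, size=5000)
X_context = rng.uniform(size=(10, 16))                 # new session's context tracks
y_context = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])   # observed skips in the context set
X_target = rng.uniform(size=(10, 16))                  # tracks to predict

# FT-LR: pre-train a logistic model, then take SGD steps on the new context data.
ft_lr = SGDClassifier(loss="log_loss", random_state=0)
ft_lr.fit(X_train, y_train)
ft_lr.partial_fit(X_context, y_context, classes=np.array([0, 1]))
print(ft_lr.predict_proba(X_target)[:, 1])

# FT-RF: pre-train 20 trees, then grow 2 extra trees on the context data.
ft_rf = RandomForestClassifier(n_estimators=20, warm_start=True, random_state=0)
ft_rf.fit(X_train, y_train)
ft_rf.n_estimators += 2
ft_rf.fit(X_context, y_context)
print(ft_rf.predict_proba(X_target)[:, 1])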

E Data summaries

The music session data consists of 16 acoustic fingerprints of the tracks; among them,
Mode and Key are coded as one-hot categorical features, and the rest are min-max standardized
numerical features, with summary statistics shown in Table E.1. The definitions of
these acoustic fingerprints are presented in Table D.1. The other features include an eight-dimensional
acoustic embedding vector provided by Spotify, as well as the duration, release
year, and US popularity estimate of a track.



Table E.1: Summary statistics of the numerical acoustic fingerprints.

Acousticness Beat strength Bounciness Danceability Energy Flatness Mechanism

Min 0 0 0 0 0 0 0
Q1 0.03 0.44 0.46 0.57 0.51 0.9 0.45
Mean 0.22 0.55 0.6 0.68 0.63 0.91 0.6
Median 0.12 0.56 0.61 0.7 0.63 0.92 0.64
Q3 0.34 0.67 0.74 0.8 0.76 0.94 0.76
Max 1 1 1 1 1 1 1
Liveness Loudness Instrumentalness Organism Speechiness Tempo Valence

Min 0 0 0 0 0 0 0
Q1 0.1 0.84 0 0.21 0.05 0.44 0.28
Mean 0.19 0.86 0.03 0.36 0.15 0.56 0.46
Median 0.13 0.87 0 0.32 0.09 0.57 0.44
Q3 0.24 0.89 0 0.49 0.21 0.66 0.63
Max 1 1 1 1 1 1 1

Figure E.1: Dynamics of preference variables for multiple individuals.

F Additional results

Table F.1 contains the one-shot session completion for two context sets from our data
set. One context set has a high skip rate and the other has a low skip rate for
low-danceability tracks. Fig. E.1 plots the individual traces of $\beta_{itp}$ around the mean curve. We
notice that the individual preference for acousticness centers around the mean curve, while
the preference for energy exhibits two types of patterns. One subgroup prefers high-energy
tracks at the beginning and gradually becomes neutral, while the other subgroup prefers


soft tracks. Fig. F.1 shows the preference dynamics for all the numerical acoustic fingerprints
over the population. Fig. F.2 shows the dynamics for the subgroups with top and bottom
10% listening rate.

Figure F.1: Preference dynamics distributed over the population.



Table F.1: One-shot Session completion.

Context set Session Completion

Track Skip Danceability Track Danceability

Beautiful Zombies 1 0.15 Ms. Sweetwater (Skit) 0.00


One and Only 0 0.45 My Man’s Gone Now 0.36
Waltz for the Lonely 0 0.51 The Map Room: Dawn 0.09
Last Days of My Bitter Heart 0 0.28 Portrait of Tracy - Live 0.46
Corona Radiata 0 0.09 Naked Moon - Live Version 0.34
Pavane 0 0.17 Calliope 0.26
Est Secretum 0 0.15 Con Te Partiro 0.08
Morning In The Bush 0 0.34 Warszawa 0.26
Spanish Love Song 0 0.32 Entr’acte - Live in Concert 0.17
Innocent Soul 0 0.09 Whatever (Strings) 0.56

Average 0.1 0.26 - 0.26


(a) Session completion for a customer with a low skip rate for low danceability tracks.

Context set Session Completion

Track Skip Danceability Track Danceability

My Kind Of Perfect 0 0.53 Cold Steel Canyons 0.73


Bagatelle in A Minor" 1 0.33 Tierra Del Fuego 0.52
Hemp 1 0.25 Sally Goodin’ 0.37
Talking Bout My Baby 1 0.25 Move To The City - Live 0.38
Mr. Longbottom Flies 1 0.07 Speed of Life - Live 0.45
Valtari 1 0.17 Spanish Lady - From Slane Castle 0.66
Final Fantasy: Main Theme 1 0.13 Outkast - New" 0.36
Cloudy This Morning 1 0.27 Eres Divina 0.73
Dolphin Dance 1 0.23 Jump Down 0.69
Real World Applications 1 0.34 Juanita -Flor De Walamo 0.70

Average 0.9 0.23 - 0.51


(b) Session completion for a customer with a high skip rate for low danceability tracks.



Figure F.2: Heterogeneous preference dynamics distributed over the high and low listening rate
groups.



References

[1] Nicolas Padilla and Eva Ascarza. Overcoming the Cold Start Problem of Customer
Relationship Management Using a Probabilistic Machine Learning Approach. Journal of
Marketing Research, 58(5):981–1006, October 2021.

[2] Paul Voigt and Axel Von dem Bussche. The EU General Data Protection Regulation
(GDPR): A Practical Guide. Springer International Publishing, Cham, 1st edition, 2017.

[3] Preston Bukaty. The california consumer privacy act (ccpa): An implementation guide.
IT Governance Ltd, 2019.

[4] Peter E Rossi, Robert E McCulloch, and Greg M Allenby. The value of purchase history
data in target marketing. Marketing Science, 15(4):321–340, 1996.

[5] Greg M Allenby and Peter E Rossi. Marketing models of consumer heterogeneity.
Journal of econometrics, 89(1-2):57–78, 1998.

[6] Asim Ansari, Skander Essegaier, and Rajeev Kohli. Internet recommendation systems.
Journal of Marketing Research, 37(3):363–375, 2000.

[7] Jürgen Schmidhuber. Evolutionary principles in self-referential learning, or on learning
how to learn: the meta-meta-... hook. PhD thesis, Technische Universität München, 1987.

[8] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast
adaptation of deep networks. In International conference on machine learning, pages
1126–1135. PMLR, 2017.

[9] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N
Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in
neural information processing systems, 30, 2017.

[10] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching
networks for one shot learning. Advances in neural information processing systems, 29,
2016.

[11] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-Agnostic Meta-Learning for Fast
Adaptation of Deep Networks. In Doina Precup and Yee Whye Teh, editors, Proceedings
of the 34th International Conference on Machine Learning, volume 70 of Proceedings of
Machine Learning Research, pages 1126–1135. PMLR, 2017.

[12] Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J Rezende,
SM Eslami, and Yee Whye Teh. Neural processes. arXiv preprint arXiv:1807.01622,
2018.



[13] Sebastian Flennerhag, Andrei A Rusu, Razvan Pascanu, Francesco Visin, Hujun Yin,
and Raia Hadsell. Meta-learning with warped gradient descent. arXiv preprint
arXiv:1909.00025, 2019.

[14] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-
training of deep bidirectional transformers for language understanding. arXiv preprint
arXiv:1810.04805, 2018.

[15] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al.
Language models are few-shot learners. Advances in neural information processing
systems, 33:1877–1901, 2020.

[16] Yu Bai, Fan Chen, Huan Wang, Caiming Xiong, and Song Mei. Transformers as
Statisticians: Provable In-Context Learning with In-Context Algorithm Selection. June
2023.

[17] Srinivasaraghavan Sriram, Pradeep K Chintagunta, and Ramya Neelamegham. Effects
of brand preference, product attributes, and marketing mix variables in technology
product markets. Marketing Science, 25(5):440–456, 2006.

[18] R Dew, A Ansari, and Y Li. Modeling dynamic heterogeneity using Gaussian processes.
Journal of Marketing Research, 2020.

[19] Dennis J Zhang, Ming Hu, Xiaofei Liu, Yuxiang Wu, and Yong Li. Netease cloud music
data. Manufacturing & Service Operations Management, 24(1):275–284, 2022.

[20] Daria Dzyabura and John R Hauser. Recommending products when consumers learn
their preference weights. Marketing Science, 38(3):417–441, 2019.

[21] Duncan Simester, Artem Timoshenko, and Spyros I Zoumpoulis. Targeting prospec-
tive customers: Robustness of machine-learning methods to typical data challenges.
Management Science, 66(6):2495–2522, 2020.

[22] Jeremy Yang, Dean Eckles, Paramveer Dhillon, and Sinan Aral. Targeting for long-term
outcomes. Management Science, Articles in Advance:1–15, 2023.

[23] Sepp Hochreiter, A Steven Younger, and Peter R Conwell. Learning to learn using
gradient descent. In Artificial Neural Networks—ICANN 2001: International Conference
Vienna, Austria, August 21–25, 2001 Proceedings 11, pages 87–94. Springer, 2001.

[24] Juan-Manuel Perez-Rua, Xiatian Zhu, Timothy M Hospedales, and Tao Xiang. In-
cremental few-shot object detection. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 13846–13855, 2020.

[25] Yujia Bao, Menghua Wu, Shiyu Chang, and Regina Barzilay. Few-shot text classification
with distributional signatures. In International Conference on Learning Representations,
2019.



[26] Jacob Beck, Risto Vuorio, Evan Zheran Liu, Zheng Xiong, Luisa Zintgraf, Chelsea
Finn, and Shimon Whiteson. A survey of meta-reinforcement learning. arXiv preprint
arXiv:2301.08028, 2023.

[27] Marta Garnelo, Dan Rosenbaum, Christopher Maddison, Tiago Ramalho, David Saxton,
Murray Shanahan, Yee Whye Teh, Danilo Rezende, and S M Ali Eslami. Conditional
neural processes. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th
International Conference on Machine Learning, volume 80 of Proceedings of Machine
Learning Research, pages 1704–1713. PMLR, 2018.

[28] Hyunjik Kim, Andriy Mnih, Jonathan Schwarz, Marta Garnelo, Ali Eslami, Dan Rosen-
baum, Oriol Vinyals, and Yee Whye Teh. Attentive neural processes. In International
Conference on Learning Representations, 2018.

[29] Mingzhang Yin, George Tucker, Mingyuan Zhou, Sergey Levine, and Chelsea Finn. Meta-
learning without memorization. In International Conference on Learning Representations,
2020.

[30] Gautam Singh, Jaesik Yoon, Youngsung Son, and Sungjin Ahn. Sequential neural
processes. Advances in Neural Information Processing Systems, 32, 2019.

[31] Tung Nguyen and Aditya Grover. Transformer neural processes: Uncertainty-aware
meta learning via sequence modeling. In International Conference on Machine Learning,
pages 16569–16594. PMLR, 2022.

[32] Xiao Liu. Deep learning in marketing: A review and research agenda. Artificial
Intelligence in Marketing, 20:239–271, 2023.

[33] Naresh K. Malhotra, K. Sudhir, and Olivier Toubia. Artificial Intelligence in Marketing,
volume 20 of Review of Marketing Research. Emerald Publishing Limited, 2023.

[34] Yi Yang, Kunpeng Zhang, and PK Kannan. Identifying market structure: A deep network
representation learning of social engagement. Journal of Marketing, 86(4):37–56,
2022.

[35] Paramveer S Dhillon and Sinan Aral. Modeling dynamic user interests: A neural matrix
factorization approach. Marketing Science, 40(6):1059–1080, 2021.

[36] Neeraj Bharadwaj, Michel Ballings, Prasad A Naik, Miller Moore, and Mustafa Murat
Arat. A new livestream retail analytics framework to assess the sales impact of
emotional displays. Journal of Marketing, 86(1):27–47, 2022.

[37] Xiao Liu, Dokyun Lee, and Kannan Srinivasan. Large-scale cross-category analysis of
consumer review content on sales conversion leveraging deep learning. Journal of
Marketing Research, 56(6):918–943, 2019.

[38] Liu Liu, Daria Dzyabura, and Natalie Mizik. Visual listening in: Extracting brand
image portrayed on social media. Marketing Science, 39(4):669–686, 2020.



[39] Shunyuan Zhang, Dokyun Lee, Param Vir Singh, and Kannan Srinivasan. What makes
a good image? airbnb demand analytics leveraging interpretable image features.
Management Science, 68(8):5644–5666, 2022.

[40] Ishita Chakraborty, Minkyung Kim, and K Sudhir. Attribute sentiment scoring with
online text reviews: Accounting for language structure and missing attributes. Journal
of Marketing Research, 59(3):600–622, 2022.

[41] Junming Yin, Zisu Wang, Yue Katherine Feng, and Yong Liu. Modeling behavioral
dynamics in digital content consumption: An attention-based neural point process
approach with applications in video games. Marketing Science, Forthcoming, 2022.

[42] Khaled Boughanmi, Asim Ansari, and Yang Li. A generative model of consumer
collections. Available at SSRN 4261182, 2023.

[43] Jochen Hartmann, Mark Heitmann, Christina Schamp, and Oded Netzer. The power
of brand selfies. Journal of Marketing Research, 58(6):1159–1177, 2021.

[44] Dinesh Puranam, Vrinda Kadiyali, and Vishal Narayan. The impact of increase in
minimum wages on consumer perceptions of service: A transformer model of online
restaurant reviews. Marketing Science, 40(5):985–1004, 2021.

[45] Jonathan Gordon, John Bronskill, Matthias Bauer, Sebastian Nowozin, and Richard
Turner. Meta-learning probabilistic inference for prediction. In International Conference
on Learning Representations, 2018.

[46] Erin Grant, Chelsea Finn, Sergey Levine, Trevor Darrell, and Thomas Griffiths. Recast-
ing gradient-based meta-learning as hierarchical bayes. In International Conference on
Learning Representations, 2018.

[47] Greg M Allenby, Peter E Rossi, and Robert E McCulloch. Hierarchical Bayes models: A
practitioners guide. SSRN Electronic Journal, 2005.

[48] Zikun Ye, Dennis J Zhang, Heng Zhang, Renyu Zhang, Xin Chen, and Zhiwei Xu.
Cold start to improve market thickness on online advertising platforms: Data-driven
algorithms and field experiments. Management Science, 69(7):3838–3860, 2023.

[49] Yijia Zhang, Zhenkun Shi, Wanli Zuo, Lin Yue, Shining Liang, and Xue Li. Joint person-
alized markov chains with social network embedding for cold-start recommendation.
Neurocomputing, 386:208–220, 2020.

[50] Jiayu Song, Jiajie Xu, Rui Zhou, Lu Chen, Jianxin Li, and Chengfei Liu. Cbml: A
cluster-based meta-learning model for session-based recommendation. In Proceedings
of the 30th ACM international conference on information & knowledge management,
pages 1713–1722, 2021.



[51] Xiaowen Huang, Jitao Sang, Jian Yu, and Changsheng Xu. Learning to learn a cold-
start sequential recommender. ACM Transactions on Information Systems (TOIS),
40(2):1–25, 2022.

[52] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint
arXiv:1312.6114, 2013.

[53] John Ashworth Nelder and Robert WM Wedderburn. Generalized linear models.
Journal of the Royal Statistical Society Series A: Statistics in Society, 135(3):370–384,
1972.

[54] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua.
Neural collaborative filtering. In Proceedings of the 26th international conference on
world wide web, pages 173–182, 2017.

[55] Steffen Rendle, Walid Krichene, Li Zhang, and John Anderson. Neural collaborative
filtering vs. matrix factorization revisited. In Proceedings of the 14th ACM Conference
on Recommender Systems, pages 240–248, 2020.

[56] Asim Ansari, Yang Li, and Jonathan Z Zhang. Probabilistic topic model for hybrid
recommender systems: A stochastic variational bayesian approach. Marketing Science,
37(6):987–1008, 2018.

[57] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv
preprint 1412.6980, 2014.

[58] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory
Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch:
An imperative style, high-performance deep learning library. Advances in neural
information processing systems, 32, 2019.

[59] Ryan Dew, Asim Ansari, and Olivier Toubia. Letting logos speak: Leveraging multiview
representation learning for data-driven branding and logo design. Marketing Science,
41(2):401–425, 2022.

[60] Alex Burnap, John R Hauser, and Artem Timoshenko. Product aesthetic design: A
machine learning augmentation. Marketing Science, 2023.

[61] Brian Brost, Rishabh Mehrotra, and Tristan Jehan. The music streaming sessions
dataset, 2020.

[62] Khaled Boughanmi and Asim Ansari. Dynamics of musical success: A machine learning
approach for multimedia data fusion. Journal of Marketing Research, 58(6):1034–1057,
2021.

[63] John Barnard, Robert McCulloch, and Xiao-Li Meng. Modeling covariance matrices in
terms of standard deviations and correlations, with application to shrinkage. Statistica
Sinica, pages 1281–1311, 2000.

[64] Daniel Lewandowski, Dorota Kurowicka, and Harry Joe. Generating random correlation matrices based on vines and extended onion method. Journal of Multivariate Analysis, 100(9):1989–2001, 2009.

[65] Bob Carpenter, Andrew Gelman, Matthew D Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus A Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan: A probabilistic programming language. Journal of Statistical Software, 76, 2017.

[66] Matthew D Hoffman and Andrew Gelman. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623, 2014.

[67] Gaël Varoquaux, Lars Buitinck, Gilles Louppe, Olivier Grisel, Fabian Pedregosa, and
Andreas Mueller. Scikit-learn: Machine learning without learning the machinery.
GetMobile: Mobile Computing and Communications, 19(1):29–33, 2015.

[68] Daniel Martin Katz, Michael J Bommarito, and Josh Blackman. A general approach for predicting the behavior of the Supreme Court of the United States. PLoS ONE, 12(4):e0174698, 2017.

[69] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[70] Christopher M Bishop and Nasser M Nasrabadi. Pattern recognition and machine
learning, volume 4. Springer, 2006.

[71] Mark Granroth-Wilding and Stephen Clark. What happens next? Event prediction using a compositional neural network model. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.

[72] Leland McInnes, John Healy, Nathaniel Saul, and Lukas Großberger. UMAP: Uniform manifold approximation and projection. Journal of Open Source Software, 3(29):861, 2018.

[73] Ryan Dew. Adaptive preference measurement with unstructured data. SSRN, 2023.

[74] Alan D Baddeley and Graham Hitch. The recency effect: Implicit learning with explicit
retrieval? Memory & Cognition, 21:146–155, 1993.

[75] Gregory J Digirolamo and Douglas L Hintzman. First impressions are lasting impressions: A primacy effect in memory for repetitions. Psychonomic Bulletin & Review, 4(1):121–124, 1997.

[76] Henry L Roediger III and Robert G Crowder. A serial position effect in recall of United States presidents. Bulletin of the Psychonomic Society, 8(4):275–278, 1976.

[77] Jeremy Yang, Juanjuan Zhang, and Yuhan Zhang. Engagement that sells: Influencer video advertising on TikTok. Available at SSRN 3815124, 2021.

[78] Donald B Rubin and Neal Thomas. Matching using estimated propensity scores:
relating theory to practice. Biometrics, pages 249–264, 1996.

[79] Chin-Wei Huang, David Krueger, Alexandre Lacoste, and Aaron Courville. Neural
autoregressive flows. In International Conference on Machine Learning, pages 2078–
2087. PMLR, 2018.

[80] Omid Rafieian and Hema Yoganarasimhan. Targeting and privacy in mobile advertising.
Marketing Science, 40(2):193–218, 2021.

[81] H Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629, 2016.

[82] Muhammad Ammad-Ud-Din, Elena Ivannikova, Suleiman A Khan, Were Oyomno, Qiang Fu, Kuan Eeik Tan, and Adrian Flanagan. Federated collaborative filtering for privacy-preserving personalized recommendation system. arXiv preprint arXiv:1901.09888, 2019.

[83] Kanishka Misra, Eric M Schwartz, and Jacob Abernethy. Dynamic online pricing with
incomplete information using multiarmed bandit experiments. Marketing Science,
38(2):226–252, 2019.

[84] Hamsa Bastani, David Simchi-Levi, and Ruihao Zhu. Meta dynamic pricing: Transfer
learning across experiments. Management Science, 68(3):1865–1881, 2022.

[85] Sneha Chaudhari, Varun Mithal, Gungor Polatkan, and Rohan Ramanath. An attentive
survey of attention models. ACM Transactions on Intelligent Systems and Technology
(TIST), 12(5):1–32, 2021.
