on Digital Platforms
Abstract
The ability to quickly adapt to the preferences of a customer is essential for the success
of a digital platform. This ability is important in two scenarios. First, new customers of
the platform are susceptible to churn if they do not find products that they like or if they
are unsatisfied with their early interactions with the platform. Companies, therefore, have
an incentive to personalize their offerings to align these with customer needs and thereby
cement customer loyalty. Similarly, platforms that use a freemium strategy strive to quickly
convert free users to premium subscribers to profit from the customer relationship. Second,
customer preferences can be context-dependent. For instance, a customer could arrive on
a platform like Spotify or YouTube with a particular need in mind and initiate a session.
The platform needs to quickly assess what the customer prefers in the current session to
be able to personalize offerings. Both these scenarios involve making inferences based on
limited customer interactions on the platform. This, coupled with the fact that customer
tastes are heterogeneous and dynamic, poses a systematic challenge for targeted marketing
and personalization. We call this challenge the few-shot learning problem for customer
preferences and address it in this paper.
Platforms have traditionally relied on external data such as demographics, user registration details, cookies, or acquisition characteristics [1] to tailor experiences for new users. However, collecting external data raises privacy concerns and faces regulatory constraints, as exemplified by the General Data Protection Regulation (GDPR) in the European Union [2] and the California Consumer Privacy Act (CCPA) in the US [3], which limit the collection
of personal data. Moreover, external data often provides weak information on customer
preferences. In contrast, user interactions are readily observable, and it is beneficial to
leverage this internal yet potentially small-sized data for personalization.
Researchers have developed numerous methods to analyze customer interactions and
infer how product features impact choices. As customers differ in their preferences, the
marketing literature has focused on methods that accommodate observed and unobserved
sources of customer heterogeneity. Hierarchical Bayes (HB) approaches [4–6] are popular
for this purpose as they allow inferences about individual-level parameters. However,
traditional HB methods encounter multiple challenges in the face of the few-shot learning
problem. First, HB methods that rely on MCMC inference are not easily scalable. This limits
their applicability in settings with many customers. Second, it is difficult to incorporate
the data of new users in HB models as this requires retraining the model with the addition
of new customers. Third, an ideal method should continuously adapt and improve with
incoming data as a customer continues the relationship. This can create a positive feedback
loop where increased interactions with the platform generate more observations, which in
turn further improves the model. Yet it is difficult for a traditional HB model to seamlessly
adjust itself on the fly to incoming data from a customer. This is particularly challenging
when data are obtained within a customer session.
Motivated by these challenges, we present a new methodology for the few-shot learning problem.
Our work is related and contributes to several areas in marketing, statistics, and machine
learning. These include the literature on meta-learning, deep learning, hierarchical Bayes,
and the cold-start problem. Below, we provide a concise overview of the related works in
these areas and position our contribution.
Meta-learning. Meta-learning, often known as learning-to-learn, has driven advances in machine learning and artificial intelligence since its inception [7, 23], and has achieved considerable success in areas such as computer vision [24], language models [25], and robotics [26]. It enables algorithms to learn from past experience and solve
new problems using a small amount of data. Nevertheless, the application of meta-learning
remains underexplored in marketing. This paper builds on the development of Neural
Processes [12, 27], a family of methods combining the strengths of neural networks and
Gaussian processes. Our work is the first in marketing to study individual preferences with
meta-learning. Extant meta-learning methods often focus on classification and regression
tasks where observations within a task are independently and identically distributed (i.i.d.)
[8, 10, 28, 29]. For sequential modeling, [30] proposes the SNP method to predict dynamic
3D scenes in computer vision where the tasks are segmented by time. [31] designs the
TNP method with the attention mechanism, which assumes the predictions are invariant
to the permutation of individual data points. However, these assumptions do not hold in
the context of customer relationship dynamics. Addressing this gap, our work introduces a novel meta-learning framework to model sequential customer behaviors on digital platforms.
Deep Learning in Marketing. The increasing scale and complexity of data have driven
the adoption of deep learning in marketing research [32–34]. For example, [35] uses
fully connected neural networks to model user preferences in a dynamic factor model.
[36] leverage deep learning to extract emotions from livestream videos and explore their
influence on sales outcomes. Convolutional Neural Networks (CNNs) have been effectively
employed to extract product quality information from texts [37] and images [38, 39];
similarly, Recurrent Neural Networks (RNNs) have been utilized to obtain sentiment scores
from text reviews [40] and to model action sequences in video games [41]. [42] models
consumer collections in the context of music products using Hypergraph Convolutional
Neural Networks. Recently, the Transformer-based BERT model [14] has been applied to
mine brand interest and sentiment in textual comments [43] and attribute-specific valence
from reviews [44]. Instead of using the Transformer as a pre-trained black-box feature extractor, our work unboxes the Transformer and integrates it as a pivotal component of the proposed probabilistic meta-learning framework. In this way, it fully leverages the Transformer's flexibility, interpretability, and in-context learning ability.
3 Model
We first introduce the consumer data structure that is needed for meta-learning and
show how to design the meta-learning tasks for model calibration. We then present the
probabilistic model for customer interactions and our inference procedure. Finally, we
describe the Transformer component that encodes the individual-level parameter dynamics.
Consider a platform with a potentially large pool of data from existing customer sessions.¹ We denote the data of each existing session $i$ as $D_i = \{x_{it}, y_{it}\}_{t=1}^{T_i}$, $i \in \{1, 2, \cdots, N\}$, where $N$ is the number of existing sessions. The scalar outcome $y_{it}$ is the customer behavior

¹ We use customers and sessions interchangeably. Session-level data is appropriate when modeling contextual preferences, as in our application. In contrast, a customer-level perspective is more appropriate in Customer Relationship Management (CRM) contexts.
prevents a model from overfitting to a particular task, but instead, learns a pattern that is shared across tasks. The focus on predicting the interactions in the target set $D_i^q$ (as opposed to modeling the entire data $D_i$, as is done in conventional marketing models) allows the model to learn a relationship between the early and subsequent customer interactions. Hence, constructing tasks addresses the challenges of distribution shift and data scarcity regarding a new session, as well as the preference dynamics within a session.
We design a probabilistic model to capture the hierarchical structure arising from multiple tasks and the sequential patterns of session data. The model is specified using global parameters $\theta$ and a set of session-specific local parameters $\beta_{it}$, one for each task. The global parameters $\theta$ represent the shared statistical structure between tasks. As we will describe, $\theta$ consists of the high-dimensional parameters of a Transformer model. The high dimensionality of $\theta$ introduces the necessary flexibility that allows the model to learn the inductive bias for few-shot prediction automatically from data, avoiding the difficulty of manually pre-specifying the inductive bias in the prior distribution. We will use point estimates for the shared $\theta$ since data from all tasks will pinpoint its value. At the session level, we use local random variables $\beta_{it} \in \mathbb{R}^P$ to capture uncertainty and dynamic heterogeneity. Consider an example of a user consuming music products on a digital platform. Users have their own unique preferences for musical styles based on the usage scenario, and such preferences are likely to change as the listening session progresses. The local parameters $\beta_{it}$ encode this intrinsic session-specific and time-varying information about customer preferences.
We use an encoder and decoder framework to specify the model [12, 52]. The objective of meta-learning is to predict the target set of a task. Meta-learning therefore uses the posterior predictive distribution of the future data of customer $i$ at time $t$ for $t > t_i$, which

of the dot product to link the local parameters and outcome is similar to that in neural matrix factorization [35, 54, 55]. Such a GLM takes $\beta_{it}$ as the parameter and no longer needs the additional parameters $\theta_D$. We can then interpret an element of the vector $\beta_{it}$ as the preference intensity for a covariate in $x_{it}$ at time $t$. In our application, we use a binary logit specification for the decoder. The GLM and neural network formulations trade off model flexibility and interpretability, and the choice between the two can be application-specific.
3.3 Inference
We rely on maximum likelihood estimation (MLE) of the predictive distribution in Eq. (1) to obtain the global model parameters $\theta = (\theta_E, \theta_D)$ for the encoder and decoder. The MLE of Eq. (1), averaged over all existing sessions, becomes

$$\max_{\theta} \; \mathbb{E}\,\log p_{\theta}(y_{it} \mid x_{it}, D_i^c) = \mathbb{E}\,\log \int p_{\theta_D}(y_{it} \mid x_{it}, \beta_{it})\, p_{\theta_E}(\beta_{it} \mid D_i^c, t)\, d\beta_{it}, \qquad (2)$$

where the expectation is over uniformly distributed sessions $i \in \{1, \cdots, N\}$ and target positions $t \in \{t_i + 1, \cdots, T_i\}$. In practice, the inner integral of Eq. (2) can be approximated by Monte Carlo estimation, which corresponds to a lower bound of the original objective due to the concavity of the logarithmic function and Jensen's inequality [45].
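To make this estimation step concrete, here is a minimal PyTorch sketch of the Monte Carlo approximation of the inner integral of Eq. (2) for a single session and target position. The encoder and decoder callables, their signatures, and the number of samples are hypothetical stand-ins rather than the paper's implementation:

```python
import torch

def mc_log_likelihood(encoder, decoder, x_t, y_t, context, t, num_samples=8):
    # Monte Carlo estimate of log p_theta(y_it | x_it, D_i^c) in Eq. (2).
    # Averaging log-probabilities over draws of beta (instead of averaging
    # probabilities) gives a lower bound via Jensen's inequality.
    mu, sigma = encoder(context, t)                  # p_thetaE(beta_it | D_i^c, t)
    total = 0.0
    for _ in range(num_samples):
        beta = mu + sigma * torch.randn_like(sigma)  # reparameterized draw of beta_it
        logits = decoder(x_t, beta)                  # binary logit decoder p_thetaD
        total = total + torch.distributions.Bernoulli(logits=logits).log_prob(y_t)
    return total / num_samples
```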
Since the total number of existing sessions is large, for each step, we randomly sample
a mini-batch of customers A ⊂ {1, 2, · · · , N } and compute the stochastic gradient to
achieve high computational efficiency [56]. The algorithm loops over the steps above
until convergence. The stepsize η can be scheduled by methods such as Adam [57] and
the gradient can be computed by auto-differentiation tools such as PyTorch [58]. This
procedure describes an episodic learning paradigm with a series of tasks; each step solves
a task by performing a cycle of the posterior inference for β it given the context set, and
predicting the outcomes yi t for the target set. The parameter θ learned via episodic
learning is not task-specific but generalizable across existing and new customer sessions.
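Under the same assumptions, the episodic learning loop can be sketched as follows; sample_task (splitting a session into a context set and one target observation) and the session container are illustrative placeholders, and mc_log_likelihood is the helper sketched above:

```python
import random
import torch

def train(encoder, decoder, sessions, num_steps, batch_size=200):
    params = list(encoder.parameters()) + list(decoder.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-4)    # stepsize scheduled by Adam
    for step in range(num_steps):
        batch = random.sample(range(len(sessions)), batch_size)  # mini-batch A
        loss = 0.0
        for i in batch:
            context, x_t, y_t, t = sample_task(sessions[i])
            loss = loss - mc_log_likelihood(encoder, decoder, x_t, y_t, context, t)
        loss = loss / batch_size
        optimizer.zero_grad()
        loss.backward()          # stochastic gradient of the objective in Eq. (2)
        optimizer.step()
```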
Lastly, using the optimized parameter $\hat{\theta}$, the prediction of future events of a new session $k$ given its context data can be computed from the predictive distribution in Eq. (1). The context data $D_k$ for the new session $k$ ($k > N$) consists of the initial interactions with the platform. The prediction of the future behavior $\hat{y}_{kt}$, $t > t_k$, can be obtained by taking samples $l = 1, \cdots, L$ from the predictive distribution as $\beta_{kt}^l \sim p_{\hat{\theta}_E}(\beta_{kt} \mid D_k, t)$ and $\hat{y}_{kt}^l \sim p_{\hat{\theta}_D}(y_{kt} \mid x_{kt}, \beta_{kt}^l)$. With continuous outcomes, we take the predicted outcome as the conditional expectation $\mathbb{E}_{y_{kt} \mid x_{kt}, D_k^c}[y_{kt}]$, approximated by $\sum_{l=1}^{L} \hat{y}_{kt}^l / L$. With discrete outcomes, we take the prediction as the predictive mode, approximated by the most frequent outcome in the samples $\{\hat{y}_{kt}^l\}_{l=1}^{L}$.
Note the adaptation to the new sessions does not require model refitting. Hence, the
computation for new sessions is very efficient, allowing for individual adaptation on the fly.
Moreover, the arbitrary size of context set Dk permits the model to continuously improve
with the new data streaming in as the session unfolds and smoothly interpolate from the
cold start stage to the warm stage.
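A sketch of this sampling-based prediction for our binary skip outcome, where the predictive mode reduces to rounding the sample mean; the encoder and decoder callables are the same hypothetical stand-ins as above:

```python
import torch

def predict_new_session(encoder, decoder, context_k, x_kt, t, L=100):
    # Adaptation to a new session k requires no refitting: only a forward
    # pass of the calibrated encoder on the context data D_k.
    mu, sigma = encoder(context_k, t)
    draws = []
    for _ in range(L):
        beta = mu + sigma * torch.randn_like(sigma)  # beta_kt^l ~ p(beta | D_k, t)
        p_skip = torch.sigmoid(decoder(x_kt, beta))  # decoder returns a logit
        draws.append(torch.bernoulli(p_skip))        # y_kt^l ~ p(y | x_kt, beta)
    samples = torch.stack(draws)
    return samples.mean(0).round()   # predictive mode for a binary outcome
```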
So far, we have focused on the overall structure of the model. Now, we describe how we use the Transformer architecture [9] to specify the temporal dependence between the context

$$S = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{m_k}}\right) V. \qquad (4)$$

The inner product and softmax function compute the attention weights $W = \mathrm{softmax}(Q K^\top / \sqrt{m_k})$, which measure the alignment between the context keys and queries.
The Transformer model applies the attention mechanism in two major steps. The first step is known as self-attention, which builds all the matrices $K$, $V$, $Q$ based on the context set $D_i^c$. The output of self-attention is a context set embedding $z_i^c = [z_{i1}^c, z_{i2}^c, \cdots, z_{it_i}^c]$, where each element $z_{it}^c$ is a vector depending on the individual data $\{x_{it'}, y_{it'}\}_{t' \le t}$ before time $t$. Next, we capitalize on cross-attention to model the relationship between the context set and the target set. The cross-attention constructs the key, value, and query matrices based on the context set $D_i^c$, the context embedding $z_i^c$, and the target position $t$, respectively, and learns an embedding for the target set. Finally, the embedding of a target element is mapped to the distributional parameters $(\mu_{it}, \sigma_{it})$ by a feed-forward neural network, and we set the distribution of $\beta_{it}$ as $p_{\theta_E}(\beta_{it} \mid D_i^c, t) = \mathcal{N}(\mu_{it}(D_i^c, t), \mathrm{diag}(\sigma_{it}^2(D_i^c, t)))$, where $\sigma_{it} \in \mathbb{R}^P$ and $\mathrm{diag}(\sigma_{it}^2(D_i^c, t))$ is a diagonal matrix with $\sigma_{it}^2$ on the diagonal. We choose the Gaussian distribution because $\beta_{it}$ is continuous and a Gaussian distribution is reparameterizable [52].² The diagonal covariance is a common mean-field assumption for latent variables in the

² A variable is reparameterizable if it can be expressed as a deterministic transformation of a base distribution that does not depend on the parameters of interest, an essential property for efficient gradient-based optimization.
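As an illustration, a minimal sketch of such a Gaussian output head with a reparameterized draw is given below; the module structure, the single linear layers, and the exponential link for the standard deviation are our assumptions, not the paper's exact specification:

```python
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    # Maps a target embedding to the mean and diagonal std of beta_it and
    # returns a reparameterized sample.
    def __init__(self, embed_dim, P):
        super().__init__()
        self.mu = nn.Linear(embed_dim, P)
        self.log_sigma = nn.Linear(embed_dim, P)

    def forward(self, s_tilde):
        mu = self.mu(s_tilde)
        sigma = torch.exp(self.log_sigma(s_tilde))  # positive std via exp
        eps = torch.randn_like(sigma)               # base noise, parameter-free
        return mu + sigma * eps, mu, sigma          # deterministic transform of eps
```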
4 Data
The data set does not include customer-specific information, such as demographics,
geographical information, or other customer identifiers. Hence, we rely on the skipping be-
haviors of the customers and the acoustic fingerprints of the tracks. The acoustic fingerprints
include Acousticness, Beat strength, Bounciness, Danceability, Energy, Flatness, Key, Mode,
Mechanism, Liveness, Loudness, Instrumentalness, Organism, Speechiness, Tempo, and
Valence, whose definitions are presented in Table D.1 of the Web Appendix. These acoustic
fingerprints describe the technical aspects of the music tracks and the felt experience of
listeners [62]. We preprocess the data by one-hot encoding the categorical features, such as
Mode and Key, using dummy variables for the categories. The continuous features, such as
Valence, Danceability, and Acousticness, are min-max normalized to be within the range
from 0 to 1 to put these on the same scale. Table E.1 of the Web Appendix contains summary
statistics of all the acoustic fingerprints.
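A short sketch of this preprocessing with pandas and scikit-learn; the lowercase column names are hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def preprocess(tracks: pd.DataFrame) -> pd.DataFrame:
    # One-hot encode the categorical fingerprints with dummy variables.
    out = pd.get_dummies(tracks, columns=["mode", "key"])
    # Min-max normalize the continuous fingerprints to the [0, 1] range.
    continuous = ["valence", "danceability", "acousticness"]  # etc.
    out[continuous] = MinMaxScaler().fit_transform(out[continuous])
    return out
```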
To fit the meta-learning model, we consider each listening session as a task where the
first few consumer interactions are taken as the context set, and the rest of the tracks in
the session belong to the target set. Constructing the context and target sets using the
data within a session is more meaningful than using all the data of a customer because a
customer’s tastes may differ across sessions. Given that the minimum session length is 10, we take the first five interactions as the context set.
set used for the model calibration does not pose any restrictions on the feasible size of the
context set for a new customer. Given this split, the mean and median skip rates are 0.48
and 0.40 for the context set, and 0.51 and 0.53 for the target set, respectively. The skip rate
where $\sigma(x) = 1/(1 + \exp(-x))$ is the sigmoid function, $I_P$ is an identity matrix, and $a, b, c$ are hyperparameters. With a slight abuse of notation, we denote $\theta = (\mu, \tau, L)$ as the global variables for the HB methods shared by the sessions and $M = \{(x_{it}, y_{it})\}_{i=1:N}^{t=1:T_i}$ as all the observed data of existing sessions. The global and local latent variables are inferred by the posterior distribution $p(\{\beta_i\}_{i=1:N}, \theta \mid M)$, which can be approximated by sampling methods such as Hamiltonian Monte Carlo (HMC). It is important to note that the inferred local variables $\beta_i$ are only for the existing sessions $i = 1, \cdots, N$.
For a new session $k$, $k > N$, we derive the predictive distribution of the outcome variable given the covariates $x_{kt}$, the observed data $D_k$, and the calibration data $M$ as

$$p(y_{kt} \mid x_{kt}, D_k, M) \propto \iint p(y_{kt} \mid x_{kt}, \beta_k) \prod_{(x'_k, y'_k) \in D_k} p(y'_k \mid x'_k, \beta_k)\, p(\beta_k \mid \theta)\, p(\theta \mid M)\, d\theta\, d\beta_k. \qquad (6)$$

The predictive distribution in Eq. (6) is derived based on the d-separations of the directed acyclic graph for the HB model; the details are presented in § C of the Web Appendix. Intuitively, the posterior distribution $p(\theta \mid M)$ borrows information from the related sessions. The samples $\beta_k \sim p(\beta_k \mid \theta)$ are adapted to the new session $k$ by re-weighting the samples according to their predictive ability on the context data $D_k$. The Adaptive HB can calibrate the model once with all the data $M$ and adjust the model to the new session data $D_k$ without re-fitting to the large data $M$.
Accordingly, we call the HB method Static HB if it only uses the existing session data $M$ but not the new session data $D_k$. The predictive distribution for Static HB is

$$p(y_{kt} \mid x_{kt}, M) \propto \iint p(y_{kt} \mid x_{kt}, \beta_k)\, p(\beta_k \mid \theta)\, p(\theta \mid M)\, d\theta\, d\beta_k. \qquad (7)$$

Eq. (7) can be approximated with MCMC samples of $\theta$ generated from the calibration runs and by drawing $\beta_k \sim \mathcal{N}(\mu, D(\tau) L D(\tau))$ with the $\theta$ samples. The HB methods are implemented with CmdStanPy, a Python interface for Stan [65]. We used four parallel chains to generate posterior samples from the No-U-Turn sampler [66] and checked for convergence using the trace plots of the unknowns, calculated the effective sample size, and monitored the mixing rate.
Other Baselines. Besides the HB methods, we also make a comparison with Logistic
Regression (LR), Random Forest (RF), and their adaptive versions as Fine-tuned Logistic
We evaluate the prediction of the first event in the session after the context set, which measures the short-term predictive ability, and the prediction of all future events, which combines short-term and long-term predictive ability. The short-term scenario is useful for making real-time expansions of the listening session, taking into account the responses of customers on the fly. All the evaluations are performed on the new sessions $k = N+1, \cdots, M$ that do not appear in the training data. Specifically, we consider the following evaluation metrics. For the prediction of all future events,

$$\mathrm{Accuracy} = \frac{1}{\sum_{k=N+1}^{M} (T_k - t_k)} \sum_{k=N+1}^{M} \sum_{t=t_k+1}^{T_k} \mathbb{1}[\hat{y}_{kt} = y_{kt}],$$

$$\mathrm{Recall} = \frac{\sum_{k=N+1}^{M} \sum_{t=t_k+1}^{T_k} \mathbb{1}[\hat{y}_{kt} = 1 \text{ and } y_{kt} = 1]}{\sum_{k=N+1}^{M} \sum_{t=t_k+1}^{T_k} \mathbb{1}[y_{kt} = 1]}, \qquad \mathrm{Precision} = \frac{\sum_{k=N+1}^{M} \sum_{t=t_k+1}^{T_k} \mathbb{1}[\hat{y}_{kt} = 1 \text{ and } y_{kt} = 1]}{\sum_{k=N+1}^{M} \sum_{t=t_k+1}^{T_k} \mathbb{1}[\hat{y}_{kt} = 1]},$$

and AUC is the area under the Receiver Operating Characteristic (ROC) curve, a widely used metric for evaluating the overall predictive performance [70]. Similar definitions apply to the first-event prediction by fixing $t = t_k + 1$.
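A sketch of these metrics computed over predictions pooled across all new sessions and target positions; the array names are placeholders and AUC uses the standard scikit-learn routine:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate(y_true, y_pred, y_score):
    # y_true, y_pred: 0/1 arrays pooled over sessions k and positions t;
    # y_score: predicted skip probabilities, used for the ROC curve.
    tp = np.sum((y_pred == 1) & (y_true == 1))
    return {
        "Accuracy": np.mean(y_pred == y_true),
        "Recall": tp / np.sum(y_true == 1),
        "Precision": tp / np.sum(y_pred == 1),
        "AUC": roc_auc_score(y_true, y_score),
    }
```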
positions. In contrast, the accuracy of static models is similar at different positions. MetaTP's performance is better than that of other baselines across all the target positions, reflecting the advantage of the attention mechanism in capturing the multi-faceted and long-term sequential dependence structures. To capitalize on the high first-event accuracy, we also generate predictions $\hat{y}_{kt}$ by MetaTP using all the preceding new session data; we label this approach MetaTP-seq. That is, when predicting the skipping at position $t$, the context set consists of all the track features and the observed customer skips $\{x_{kt'}, y_{kt'}\}_{t' < t}$ before position $t$ within the session. The context set is dynamically updated, but the model is not re-estimated. As expected, MetaTP-seq has the highest accuracy at all positions due to the increased context size and because it can leverage recent dynamics; we will demonstrate its managerial application in § 7.
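A sketch of the MetaTP-seq procedure as just described: the context set grows with every observed response, while the calibrated model is never re-estimated. The predict callable is a hypothetical wrapper around the trained model:

```python
def metatp_seq(predict, session, context_size=5):
    preds = []
    context = list(session[:context_size])
    for t in range(context_size, len(session)):
        x_t, y_t = session[t]
        preds.append(predict(context, x_t, t))  # one-step-ahead prediction
        context.append((x_t, y_t))              # observed outcome joins the context
    return preds
```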
We further compare the adaptation ability of MetaTP and Adaptive HB by evaluating the predictive accuracy given different sizes of the context set. MetaTP is calibrated on tasks with five context points, and we provide a different number of context points at validation time without refitting the model. As shown in Fig. 5, both adaptive methods benefit from an increased number of context points. However, the improvement of MetaTP is faster than that of Adaptive HB, and the performance gap increases with additional context data. MetaTP might acquire this fast adaptation ability by solving thousands of similar few-shot prediction tasks, which forces the posterior distribution $p_{\theta_E}(\beta_{it} \mid D_i^c, t)$ to maximize the information gain from the limited context data. Another reason for the difference might be that the summary statistic $\beta_i$ of Adaptive HB does not change much with more context data, whereas MetaTP can extract more complex dynamic patterns from a richer context set with the time-varying parameter $\beta_{it}$.
6 Results
Table 2: Four synthetic context sets, each listing ten tracks as Track (Artist, Skip).

Context Set 1: The Call Of Ktulu (M, 0); Charlie Brown (G, 1); Master of Puppets (M, 0); Anxious Moments (G, 1); To Live Is to Die (M, 0); Waltz for the Lonely (G, 1); Master of Puppets (M, 0); The Skin Horse (G, 1); Battery (M, 0); The Swan (G, 1).

Context Set 2: Charlie Brown (G, 0); The Call Of Ktulu (M, 1); Anxious Moments (G, 0); Master of Puppets (M, 1); Waltz for the Lonely (G, 0); To Live Is to Die (M, 1); The Skin Horse (G, 0); Master of Puppets (M, 1); The Swan (G, 0); Battery (M, 1).

Context Set 3: The Call Of Ktulu (M, 0); Master of Puppets (M, 0); To Live Is to Die (M, 0); Master of Puppets (M, 0); Battery (M, 0); Charlie Brown (G, 1); Anxious Moments (G, 1); Waltz for the Lonely (G, 1); The Skin Horse (G, 1); The Swan (G, 1).

Context Set 4: Charlie Brown (G, 0); Anxious Moments (G, 0); Waltz for the Lonely (G, 0); The Skin Horse (G, 0); The Swan (G, 0); The Call Of Ktulu (M, 1); Master of Puppets (M, 1); To Live Is to Die (M, 1); Master of Puppets (M, 1); Battery (M, 1).

Notes. Artist “M” is Metallica, a band known for heavy metal rock music; “G” is George Winston, a pianist known for performing instrumental music. Context Sets 1 and 3 represent rock music lovers, while Context Sets 2 and 4 represent classical music lovers. Context Sets 1 and 2 have the same skipping sequence, as do Context Sets 3 and 4.
George Winston, who have distinct specialties in heavy metal rock music and classical music, respectively. The listener in Context Set 1 or 3 skips all the George Winston tracks but finishes all the Metallica tracks, while the skipping is reversed for Context Sets 2 and 4. We show the pairwise distances between the embeddings of these context sets in Fig. 6a. We find that sessions with similar music preferences are close together: Context Sets 2 and 4 have the smallest distance at 1.14, followed by the embedding distance between Context Sets 1 and 3 at 1.51. The largest distance of 4.10 is between Context Sets 2 and 3, where both the preferences and the skipping patterns $(y_{i1}, \cdots, y_{it_i})$ differ. These findings suggest that the context embedding successfully encodes the information on customer preferences and skipping patterns.
[Table 3: Listening rates of Clusters 1–7 in the context and target sets.]
The meaningful embedding space can be leveraged to identify the subgroup with high
future engagement. We use Uniform Manifold Approximation and Projection (UMAP) [72, 73] to reduce $z_{it_i}^c$ to two dimensions and visualize it in Fig. 6b. The points in Fig. 6b are the UMAP encodings of $z_{it_i}^c$ for the 2000 hold-out sessions. The self-attention embedding
segments the sessions into 7 well-separated clusters. We examine the listening rate of each
cluster in the context and target set, defined as the percentage of tracks not being skipped,
which are shown in Table 3. Among the segments, we find Cluster 5 of particular interest.
Cluster 5 is the largest cluster containing 29.4% of total sessions. Unlike other clusters that
mainly consist of sessions with a similar listening rate in the context set, Cluster 5 consists of
sessions with mixed listening rates ranging from 0.2 to 0.8. The average listening rate of this
cluster is 62.6% in the context set and is as high as 59.4% in the target set (the second highest
among all clusters). Hence, we call Cluster 5 the Enthusiasts with persistent engagement
patterns throughout the listening session. In comparison, Top Listeners (Cluster 4) with
100% listening rate in the context set have a future listening rate of 60.7%, only slightly
higher than that of the Enthusiasts. This indicates that the Top Listener cluster may contain passive listeners whose high listening rate cannot reflect their engagement. Moderate
Listeners (Cluster 6) have a context-set listening rate similar to the Enthusiasts but have a
much lower future listening rate of 41.5%. Comparing Enthusiasts with Top Listeners and
Moderate Listeners indicates that self-attention embeddings contain essential predictive
information of subsequent behaviors beyond the basic context set listening rate. The platform
can use this embedding to identify consumers with high engagement for future tracks.
Figs. 7a and 7b show the distribution of attention weights on the five context points for two
randomly selected sessions. For both sessions, the most attention is paid to the last track
in the context set. The tracks with the highest attention for session 1 are the non-skipped
tracks, and for session 2, are a mix of skipped and non-skipped tracks.
Fig. 7c shows the attention weights averaged over the population. The highest attention is paid to the context points that are close to the target points, which contain more information relevant to future skipping behaviors because of the closeness in time. For example, a customer skipping the latest tracks due to fatigue would tend to skip the future tracks if the fatigue persists. This is consistent with the recency effect, people's tendency to remember the most recently presented information best [74]. We also notice that the first track is of high importance, making a U-shaped distribution of attention over the context points. This finding may relate to the primacy effect, whereby the first items influence customers' judgments and decisions more than the information they receive later on [75]. The behavior on the first track might better reflect the customer's gut instinct about their music preferences. Overall, the U-shaped attention weight distribution is aligned with the psychological tendency to be influenced by the first and last items in a list more than those in the middle [76].
Feature importance. The attention weights quantify the influence at the track level. Now, we further scrutinize which acoustic fingerprints are indicative of future behavior. The saliency map is computed as a measure of feature-level influence [77]. It is computed as the gradient of the prediction with respect to each acoustic fingerprint, i.e., $\partial \hat{y}(x)/\partial x$, where $\hat{y}(x)$ is the predicted skipping of the first track in the target set and $x$ are the acoustic fingerprints of the context tracks.³

³ The intuition behind the saliency map is that, as in linear regression, the value of the derivative corresponds to the regression coefficient, thereby quantifying the feature's importance. The saliency map generalizes this idea to any nonlinear, differentiable model.
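A sketch of this saliency computation with PyTorch autograd; model is a hypothetical wrapper that returns the scalar skip prediction for the first target track:

```python
import torch

def saliency(model, context_x, context_y, x_target, t):
    context_x = context_x.clone().requires_grad_(True)
    y_hat = model(context_x, context_y, x_target, t)  # scalar prediction y-hat(x)
    y_hat.backward()
    return context_x.grad  # one gradient per fingerprint and context position
```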
Fig. 8 shows the saliency values for each acoustic fingerprint at each of the five context positions for the customers of two sessions. We see that the most influential acoustic fingerprints generally are those from the latest tracks, aligned with the findings from the attention weights. Interestingly, the influential acoustic fingerprints differ significantly across sessions. For example, what matters for the customer of Session 1 are beat strength, bounciness, energy, and liveness, while for the customer of Session 2 they are bounciness, instrumentalness, energy, and speechiness. The granular preferences for the acoustic fingerprints of each track reveal personal demands for the music product: the customer of Session 1 might be more sensitive to rhythm, while the customer of Session 2 is influenced more by melody.
Notes. The solid line is the population mean; the shaded region is the standard deviation across
individuals. The x-axis is the session position.
Fig. 9 shows the preference dynamics on the target set for a few covariates at the population level. The results for all the covariates are contained in § F of the Web Appendix. The mean and standard deviation of $\beta_{ipt}$ in Fig. 9 are computed across sessions. We set $\mathbb{E}[y_{it} \mid x_{it}] = 1/(1 + \exp(x_{it} \cdot \beta_{it}))$ and recall that $y_{it} = 1$ means customer $i$ skips the track at the $t$-th position. As all the processed covariates are positive, the larger $\beta_{ipt}$ is, the higher the probability of non-skipping at a given level of an acoustic fingerprint. The magnitude of $\beta_{ipt}$ reflects the degree of preference for the acoustic fingerprint $p$, and the sign of $\beta_{ipt}$ reflects whether the acoustic fingerprint is positively or negatively related to non-skipping behavior. For example, we find that acousticness is positively related to non-skipping, while the track duration and instrumentalness exhibit a negative correlation. Across time, on average, the preference for acousticness and instrumentalness increases, and the preference for energy decreases. We also notice that the tendency to skip long tracks increases in the early period of a session. These dynamics indicate a pattern of customer fatigue, where the customer, on average, begins to prefer melodic or light music to heavy metal songs or techno tracks when getting further into a listening session.
Beyond the aggregated patterns, our model uncovers time-varying heterogeneous preferences. Fig. 10 plots the individual preference variable $\beta_{ipt} \sim p_\theta(\beta_{ip} \mid D_i^c, t)$ sampled for three random sessions. Taking the preference for high-energy tracks as an example, we notice that customers exhibit different fatigue patterns. To explore the driving factors behind the dynamic patterns, we focus on Listeners and Skippers as two subgroups of the new customer sessions. The Listeners are defined as the sessions with the top 10% listening rate, and the Skippers are defined as the sessions with the bottom 10% listening rate in the context set. The preference dynamics of these two subgroups are shown in Fig. 11. We find that the level of historical engagement in the context set significantly influences the evolution of the preference dynamics. From the duration and valence panels, the Listeners are more tolerant of long tracks and prefer negative emotional songs more than the Skippers. We notice that the reduced interest in the energy feature is mainly among the Listeners, and the increased
Notes. The orange curve corresponds to the high listening rate group, and the blue curve corresponds
to the low listening rate group. The dynamics are over the target set.
Having shown the predictive superiority of our model and the various substantive insights
that it generates, we now move to showcase its use in several managerial tasks.
7 Managerial Applications
We demonstrate how MetaTP can inform the managerial actions of platforms for customizing and extending new sessions that include only a few initial interactions on the platform. This is a challenging scenario because of the limited information in the new session. We illustrate in this section how a platform can achieve this in our context of music personalization.
Table 4: Re-ordering a self-selected listening list.

Original order, listed as Track (Artist, Skip): Silver Tiles (Matt and Kim, 1); Dear Avery (The Decemberists, 1); Don’t Let Him Steal Your Heart (Phil Collins, 0); Ordinary Just Won’t Do (Commissioned, 0); October Road (James Taylor, 0); Lay It All On Me (Rudimental, 1).

Optimized order: Don’t Let Him Steal Your Heart; October Road; Ordinary Just Won’t Do; Lay It All On Me; Dear Avery; Silver Tiles.
As the tracks are anonymized in our data, we collected an additional 192 thousand tracks using the Spotify API, with track names, artist information, and acoustic fingerprints, to facilitate an understanding of the appropriateness of the recommendations.
We start with a scenario where customers may have a self-selected listening list and the platform provides a listening order for the list. As MetaTP does not assume exchangeable data, it can optimize the sequential ordering of a list to improve customer engagement without changing the content of the list. The Transformer model can be used to produce the position encodings for each permutation $\pi(\cdot)$ of the future track positions $(\pi(t_i+1), \cdots, \pi(T_i))$. The cross-attention mechanism can then use the position encodings to compute the predictive distribution.
For each permutation, we compute the average skipping probability for the tracks in the
target set and select the permutation with the lowest skipping probability. An example of
the re-ordered tracks from our data set is shown in Table 4. Interestingly, we see that the
optimal order tends to place the non-skipped tracks at the early positions. This could be
explained by the primacy effect, as noticed in § 6.2, which implies that placing the preferred
products at early positions generates a positive carry-over effect that influence subsequent
decisions of customers.
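A brute-force sketch of this ordering step; skip_prob is a hypothetical wrapper that returns the model's per-track skipping probabilities under the position encodings of a given permutation. Exhaustive enumeration is factorial in the list length and is feasible only for short lists such as the six tracks in Table 4:

```python
from itertools import permutations

def best_order(skip_prob, target_tracks):
    best, best_score = None, float("inf")
    for order in permutations(target_tracks):
        score = sum(skip_prob(order)) / len(order)  # average skip probability
        if score < best_score:
            best, best_score = order, score
    return best
```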
At the beginning of a newly started session, the platform wants to quickly capture the music preferences of the customer from the initial interactions and recommend future tracks. An approach that can be used to complete the listening session involves matching the focal new session to an existing session with a similar context set. Matching the listening history is
Table 5: A context-matched session pair, listed as Track (Artist, Skip).

Focal session, context set: Ramblin’ Man (The Allman Brothers Band, 0); California - Tchad Blake Mix (Phantom Planet, 1); Riding With The King (Eric Clapton, 1); Life In The Fast Lane (Eagles, 1); Stolen Moments (Dan Fogelberg, 0).
Focal session, target set: Posse (Big Kuntry King, 1); On Da Hush (Trina, 1); The Most Beautiful (BTS, 1); Yo Quería (Cristian Castro, 1); Jezebel - Demo Version (Depeche Mode, 1).

Matched session, context set: Glittering Prize (Simple Minds, 0); Peaches (Bob Schneider, 1); Majesty of Heaven (Chris Tomlin, 1); I Don’t Want This Night to End (Luke Bryan, 1); Majesty of Heaven (Chris Tomlin, 0).
Matched session, target set: Here In the Real World (Alan Jackson, 0); Been There (Clint Black, 0); Sweet 16 (Green Day, 0); Smoke Break (Carrie Underwood, 0); Any Ol’ Barstool (Jason Aldean, 0).

Notes. The matched session has a low skipping rate on the future tracks.
Specifically, for a focal new session $k$, we identify the $K$-nearest neighbors $\{\hat{z}_1, \cdots, \hat{z}_K\}$ under the $L_2$-norm in the embedding space of the existing sessions. These existing sessions have observed skipping behaviors in their target sets. We match the focal session with the session from the nearest neighbors that has the lowest skipping rate on the tracks in its target set, and then recommend the tracks of the matched session's target set to the focal session. Since the context set is predictive of the skipping behaviors on future tracks in the target set, matching on the context sets suggests similar responses to the subsequent tracks between the focal and matched sessions [78]. Table 5 shows an instance of a context-matched session pair. The matched context sets consist of similar tracks with strong rhythms and similar skipping patterns, while the skipping rate of the focal session's future tracks is seven times higher than that of the matched session. By replacing the focal listening session's target set with the matched session's target set, we find an average 7% decrease in the model-predicted skipping probability on the target set.
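A sketch of the matching rule; z_existing holds the context embeddings of the existing sessions and skip_rates their observed target-set skip rates (the names are illustrative):

```python
import numpy as np

def match_session(z_focal, z_existing, skip_rates, K=10):
    # K-nearest neighbors of the focal embedding under the L2 norm.
    dists = np.linalg.norm(z_existing - z_focal, axis=1)
    neighbors = np.argsort(dists)[:K]
    # Among the neighbors, pick the session with the lowest target skip rate.
    return neighbors[np.argmin(skip_rates[neighbors])]
```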
Table 6: One-shot session completion, listed as Track (Artist, Instrumentalness quantile).

Completion for Context Set 1: Can You Bounce? (Ice Cube, 0.000); Face Facts (Kottonmouth Kings, 0.000); Einstein Tech N9ne (Tech N9ne, 0.375); ATLiens (OutKast, 0.394); You Gotta Lotta That (Ice Cube, 0.000); Juke It (Snypaz, 0.000); To My People (Flipmode Squad, 0.000); F*k Somthin (Webbie, 0.000); Beats Knockin (Jack Ü, 0.000); Bianca’s and Beatrice’s (Tech N9ne, 0.000).

Completion for Context Set 2: Omni Potens (Morbid Angel, 0.959); The Garden: The Garden (George Winston, 0.989); Cinny’s Waltz (Tom Waits, 0.996); Siboney - Live (Buena Vista, 0.989); O Sacred Head (Jim Brickman, 0.992); I Went From - Intro (Brotha Lynch Hung, 0.996); In The Kingdom of Peace (Jean-Luc Ponty, 0.995); Rest Your Head (George Winston, 0.986); Calliope (Tom Waits, 0.959); The Map Room: Dawn (John Williams, 0.969).

Notes. This table shows the recommended next 10 tracks for two sessions, the first with Context Set 1 and the second with Context Set 2 from Table 2. Instrumentalness measures whether a track contains no vocals. The value of instrumentalness is the quantile among all 192 thousand recommendation candidates. MetaTP accurately adapts to heterogeneous preferences from a context set with only 10 tracks.
probability for each track in the recommendation list using its acoustic fingerprints and
then complete the session using the track with the highest non-skipping probability at that
position.
We first evaluate how the model completes the listening sessions for a rock music lover and a classical music lover. Specifically, we take the synthetic Context Sets 1 and 2 from Table 2, which exhibit preferences for Metallica and George Winston tracks, respectively. Table 6 shows the completed sessions for the two context sets. For ease of interpretation, we show
shows the completed session for the two context sets. For ease of interpretation, we show
the instrumentalness fingerprint for each track, a fingerprint that is typically low for rock
music and high for non-vocal pure music. Since instrumentalness has a skewed distribution,
we compute the quantile of each track’s instrumentalness among all the recommendation
candidates. The recommended tracks clearly match the tastes of the customers – the tracks for the rock music lover are by artists like Tech N9ne and Ice Cube, and the tracks for the classical music lover are by pianists and violinists like Jim Brickman and Jean-Luc Ponty. MetaTP manages to adapt to the session heterogeneity with as few as ten tracks in the context set.
It is important to highlight that the adaptation mechanism is not pre-programmed into the
model; instead, it is meta-learned from data – an ability automatically discovered from
predicting target sets for thousands of sessions in the training tasks. In addition to the above
synthetic context sets, we show in Table F.1 of the Web Appendix the one-shot completion
for two context sets from the observed hold-out sessions, one with a low skip rate and the
other with a high skip rate for low danceability tracks. Similar to the synthetic context
sets, we find MetaTP can recommend tracks with correctly adapted danceability levels to
complete the sessions.
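A sketch of this greedy one-shot completion; skip_prob is again a hypothetical wrapper around the calibrated model:

```python
import numpy as np

def complete_session(skip_prob, context, candidates, length=10):
    playlist = []
    for t in range(len(context) + 1, len(context) + length + 1):
        # Highest non-skipping probability = lowest predicted skip probability.
        probs = np.array([skip_prob(context, x, t) for x in candidates])
        playlist.append(candidates[int(np.argmin(probs))])
    return playlist
```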
The similarity-based and one-shot session completion methods are open-loop recommendations, which operate without considering the immediate feedback or dynamic reactions of the customers as the session is extended. We now leverage the MetaTP-seq approach discussed in § 5.1 to illustrate how closed-loop session completion can be done. As we do not have dynamic
⁴ See https://newsroom.spotify.com/2020-09-29/how-to-make-a-collaborative-playlist/
⁵ See https://newsroom.spotify.com/2023-09-26/spotify-jam-personalized-collaborative-listening-session-free-premium-users/
Notes. The blue curve is for the High Preference group, and the orange curve is for the Low Preference
group.
Specifically, we study whether MetaTP can quickly uncover customer preferences based
on a limited number of responses. We use preferences pertaining to specific acoustic
fingerprints to define two groups of synthetic customers. We define a High Preference
group that prefers a high value of that acoustic fingerprint. Customers in this group are
assumed to skip a track with a probability of 0.1 if the acoustic fingerprint of this track is
above a threshold, and otherwise will skip with a probability of 0.9. Similarly, a synthetic group (Low Preference) will skip with a probability of 0.9 (or 0.1) if the acoustic fingerprint is above (or below) the threshold. Fig. 12 shows the average acoustic fingerprint level
(e.g., acousticness and instrumentalness) of the recommended tracks for the two groups of
customers. The magnitude of the corresponding acoustic fingerprint is averaged over the top
5% of the recommended tracks and 100 customers from the two groups to reduce sampling
variance. The initial tracks recommended for both groups are similar, but after 3 to 4 tracks
the acousticness or instrumentalness level of the tracks offered to the High Preference
group becomes larger than that of the Low Preference group. These two types of customers follow a hypothesized skipping pattern, whose distribution may differ significantly from that of the observed existing customers in the calibration data. Nevertheless, MetaTP can adapt to these
Figure 13: Sampled sessions for the customers with high and low preferences for the acousticness
fingerprint.
Notes. Tracks associated with the blue and orange curves are for one customer from the High
Preference and Low Preference groups, respectively.
Fig. 13 visualizes a sampled session for a customer in the High Preference group and a
customer in the Low Preference group for the acousticness fingerprint. Both customers start
from the track The Twelve Days of Christmas, and the subsequent session for the customers
from the High Preference group mainly consists of tracks like Space Kay, Kitchen Girl, I Ain’t
Gonna Be The First To Cry with acousticness over 0.8, and the session for the customers
from the Low Preference group mainly consists of songs like Crazy Rap and No Hate with
heavy electronic sounds and acousticness below 0.4.
8 Discussion
In this paper, we proposed a framework for measuring preferences and predicting future
behaviors within ongoing digital sessions by integrating meta-learning and Transformer-
based sequence modeling. The framework generates accurate future predictions for a new
customer session only using a small number of observations and can produce personalized
sequential recommendations on the fly. The ability to adapt quickly hinges on a new design of multiple training tasks that reuses the knowledge in existing sessions. The key benefits of the model are its data efficiency, flexibility, and transparent processes. In the empirical
Web Appendix
This Web appendix has been supplied by the authors to aid in the understanding of their
paper.
A Summary of Methodologies
We design the encoder $p_{\theta_E}(\beta_{it} \mid D_i^c, t)$ using the Transformer model and the attention mechanism therein [9, 85]. Here, we first introduce the general techniques of the attention mechanism, and then illustrate how to leverage attention to infer time-varying individual preferences from music sessions. The attention mechanisms, which are components of the Transformer model depicted in Fig. B.1, are illustrated in Fig. B.2.

Attention Mechanism. Attention is a mapping that takes the key, value, and query matrices as inputs and outputs an embedding for each vector in the query matrix. The key matrix $K \in \mathbb{R}^{t \times m_k}$ contains $m_k$-dimensional keys, the value matrix $V \in \mathbb{R}^{t \times m_v}$ contains $m_v$-dimensional values, and the query matrix $Q \in \mathbb{R}^{t' \times m_k}$ contains the $m_k$-dimensional queries. The number of keys and values is $t$ and the number of queries is $t'$. The output embeddings $S$ are computed according to Eq. (4) as

$$S = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{m_k}}\right) V,$$

where $S \in \mathbb{R}^{t' \times m_v}$ contains an $m_v$-dimensional embedding for each query element. Intuitively, the embedding is a weighted combination of the values in $V$, and the attention weights $W^a = \mathrm{softmax}(Q K^\top / \sqrt{m_k})$ measure the alignment between the query elements and the keys.
In practice, we use multi-head attention to capture heterogeneous dependency structures between the queries and keys. Consider $H > 1$ heads. Each head $h \in \{1, \cdots, H\}$ has its own query $Q_h$, key $K_h$, and value $V_h$ matrices and outputs $S_h$ according to Eq. (4). The multi-head attention output combines the outputs of all the attention heads as $S = [S_1, \cdots, S_H] W_o \in \mathbb{R}^{t' \times m}$, where $W_o \in \mathbb{R}^{H m_v \times m}$ is an aggregation matrix. We use the attention mechanism to first generate embeddings for the context set and then for the target set.
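A compact PyTorch sketch of this multi-head attention for unbatched inputs; the module and argument names are ours, not the paper's code:

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, m, H, m_k, m_v):
        super().__init__()
        self.H, self.m_k = H, m_k
        self.W_q = nn.Linear(m, H * m_k, bias=False)
        self.W_k = nn.Linear(m, H * m_k, bias=False)
        self.W_v = nn.Linear(m, H * m_v, bias=False)
        self.W_o = nn.Linear(H * m_v, m, bias=False)  # aggregation matrix W_o

    def forward(self, queries, keys, values):
        H, m_k = self.H, self.m_k
        Q = self.W_q(queries).view(-1, H, m_k).transpose(0, 1)  # (H, t', m_k)
        K = self.W_k(keys).view(-1, H, m_k).transpose(0, 1)     # (H, t, m_k)
        V = self.W_v(values).view(-1, H, self.W_v.out_features // H).transpose(0, 1)
        W_a = torch.softmax(Q @ K.transpose(1, 2) / m_k ** 0.5, dim=-1)  # Eq. (4)
        S = (W_a @ V).transpose(0, 1).reshape(queries.shape[0], -1)  # [S_1,...,S_H]
        return self.W_o(S)
```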
Self-attention for context embedding. The context set contains the initial session's track features and skipping behaviors. For each context datum $(x_{it}, y_{it})$, we transform the track features $x_{it}$ into an embedding $E_{it}^x \in \mathbb{R}^m$ by a feed-forward network, and transform the skipping $y_{it}$ (one-hot encoded) by an embedding matrix $U$. Written in matrix form, the transformations are
[Figure B.2: The attention mechanisms of the encoder. Panel (a), self-attention, maps the context encodings $E_{it}^c$ through query, key, and value projections and attention logits to the context embedding $z_i^c$. Panel (b), cross-attention, maps the context encodings, context embeddings, and target queries through attention weights $\tilde{W}^a$ to the local parameters $\beta_{it}$.]
We apply the position encoding in the Transformer models [9] to define an $m$-dimensional position encoding $E_i^{p,t}$, with elements indexed by $j = 1, \cdots, m$, and the position encoding matrix $E_i^p = [E_i^{p,1}, \cdots, E_i^{p,t_i}]^\top$. The context set of consumer $i$ is encoded as $E_i^c = E_i^x + E_i^y + E_i^p \in \mathbb{R}^{t_i \times m}$. The query, key, and value matrices for each attention head $h$ are computed as $Q_{ih} = E_i^c W_h^Q$, $K_{ih} = E_i^c W_h^K$, and $V_{ih} = E_i^c W_h^V$ with linear transformation weights $W_h^Q, W_h^K \in \mathbb{R}^{m \times m_k}$ and $W_h^V \in \mathbb{R}^{m \times m_v}$. We then apply the attention mechanism, as described in the previous section, to map the inputs $\{Q_{ih}, K_{ih}, V_{ih}\}_{h=1}^{H}$ to the output matrix $S \in \mathbb{R}^{t_i \times m}$. Finally, this output matrix $S$ is fed through a position-wise feed-forward neural network $f_{\eta_0}(\cdot)$, generating the embeddings of the context set $z_i^c = f_{\eta_0}(S)$ with $z_i^c = [z_{i1}^c, \cdots, z_{it_i}^c] \in \mathbb{R}^{t_i \times m}$. We apply the causal masking⁶ technique by setting the attention weights for future data to zero so that the embedding $z_{it}^c$ at session position $t$ is independent of future tracks $\{x_{it'}, y_{it'}\}_{t' > t}$.
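The position encoding and the causal mask can be sketched as follows, assuming the standard sinusoidal form of [9], an even embedding dimension m, and a mask added to the attention logits before the softmax (so the resulting weights on future positions are exactly zero):

```python
import torch

def position_encoding(t_i, m):
    # Sinusoidal encoding: sin(t / 10000^(j/m)) for even indices, cos for odd.
    pos = torch.arange(t_i, dtype=torch.float32).unsqueeze(1)
    j = torch.arange(0, m, 2, dtype=torch.float32)
    angles = pos / torch.pow(10000.0, j / m)
    E_p = torch.zeros(t_i, m)
    E_p[:, 0::2] = torch.sin(angles)
    E_p[:, 1::2] = torch.cos(angles)
    return E_p

def causal_mask(t_i):
    # Logits at (t, t') become -inf for t' > t, so position t cannot attend
    # to future tracks after the softmax.
    return torch.triu(torch.full((t_i, t_i), float("-inf")), diagonal=1)
```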
Cross-attention for query embedding. The self-attention captures the dependency within the context set and encodes it in the embedding $z_i^c$. Next, we apply cross-attention, the attention mechanism between the context set and the target set, to learn the dynamic pattern of future sessions given the context. The position encoding $E_i^t$ is similarly computed for the target data with $t > t_i$. The query, key, and value matrices are computed as $\tilde{Q}_{ih} = E_i^t \tilde{W}_h^Q$, $\tilde{K}_{ih} = E_i^c \tilde{W}_h^K$, and $\tilde{V}_{ih} = z_i^c \tilde{W}_h^V$ with weights $\tilde{W}_h^Q, \tilde{W}_h^K \in \mathbb{R}^{m \times m_k}$ and $\tilde{W}_h^V \in \mathbb{R}^{m \times m_v}$. Eventually, the output $\tilde{S}$ of the attention mechanism in Eq. (4) with inputs $\{\tilde{Q}_{ih}, \tilde{K}_{ih}, \tilde{V}_{ih}\}_{h=1}^{H}$ is mapped to the mean and variance of the approximate posterior distribution as $\beta_{it} \sim \mathcal{N}(\mu_{\eta_1}(\tilde{S}), \sigma_{\eta_1}^2(\tilde{S}) I)$, where $\eta_1$ are the parameters of a feed-forward network. We use $\theta_E$ to denote all the parameters shared across the tasks: $\theta_E = \{W_1, W_2, b_1, b_2, U, \{W_h^Q, W_h^K, W_h^V\}_{h=1}^{H}, \eta_0, \{\tilde{W}_h^Q, \tilde{W}_h^K, \tilde{W}_h^V\}_{h=1}^{H}, \eta_1\}$.

⁶ Causal masking in Transformers is a technique used to ensure that during training, a model only has access to present and past data, but not future data. In the context of a sequence, this means each position can only attend to positions before it (or itself), maintaining the causal or sequential order of the data.
The Mixed Logit Model is a hierarchical Bayes model with the likelihood and prior as

for $i = 1, \cdots, N$ and $t = 1, \cdots, T_i$. We use HMC to draw $V$ posterior samples of each latent variable,

$$\beta_i^v, \tau^v, \mu^v, L^v \sim p(\beta_i, \tau, \mu, L \mid D_1, \cdots, D_N), \quad v = 1, \cdots, V. \qquad (C.2)$$
Suppose for a new customer $k > N$, we observe some additional new context data $D_k$. Let all the training data be $M = \{D_1, \cdots, D_N\}$. Then the posterior predictive distribution for the data in the target set of the new customer is

$$\begin{aligned}
p(y_{kt} \mid x_{kt}, D_k, M) &= \int p(y_{kt} \mid x_{kt}, \beta_k, D_k, M)\, p(\beta_k \mid D_k, M)\, d\beta_k \\
&= \int p(y_{kt} \mid x_{kt}, \beta_k)\, p(D_k \mid \beta_k, M)\, p(\beta_k \mid M)\, d\beta_k \cdot \mathrm{Const.} \\
&= \int p(y_{kt} \mid x_{kt}, \beta_k)\, p(y_k^c \mid x_k^c, \beta_k, M)\, p(x_k^c \mid \beta_k, M)\, p(\beta_k \mid M)\, d\beta_k \cdot \mathrm{Const.} \\
&= \iint p(y_{kt} \mid x_{kt}, \beta_k)\, p(y_k^c \mid x_k^c, \beta_k)\, p(\beta_k \mid \theta)\, p(\theta \mid M)\, d\theta\, d\beta_k \cdot \mathrm{Const.},
\end{aligned} \qquad (C.3)$$

where $\theta = (\tau, \mu, L)$. The derivation is based on the graphical model of inference and prediction for the HB logit model in Fig. B.1 and the corresponding d-separation rules. The first equality holds because $x_{kt} \perp\!\!\!\perp \beta_k$ when not conditioning on $y_{kt}$; the second equality follows from Bayes' rule, $p(\beta_k \mid D_k, M) \propto p(D_k \mid \beta_k, M)\, p(\beta_k \mid M)$, with the constant $p(x_k^c)/p(D_k \mid M)$.
We draw $V$ samples $\theta^v \sim p(\theta \mid M)$ and $\beta_k^v \sim p(\beta_k \mid \theta^v)$, and compute $p(y_{kt} \mid x_{kt}, \beta_k^v)$ and $p(y_k^c \mid x_k^c, \beta_k^v)$. Samples from the marginal posterior $p(\theta \mid M)$ are distributionally the same as the samples obtained from the full posterior $p(\{\beta_i\}_{i=1:N}, \theta \mid M)$ using HMC, so the samples of $\theta$ from Eq. (C.2) can be used directly. Finally, the Monte Carlo estimate of Eq. (C.3) is

$$\hat{p}(y_{kt} \mid x_{kt}, D_k, M) = \sum_{v=1}^{V} w_v\, p(y_{kt} \mid x_{kt}, \beta_k^v), \qquad w_v = \frac{p(y_k^c \mid x_k^c, \beta_k^v)}{\sum_{v'=1}^{V} p(y_k^c \mid x_k^c, \beta_k^{v'})}. \qquad (C.4)$$
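A two-line numpy sketch of Eq. (C.4); the likelihood arrays are assumed to be precomputed from the HMC draws:

```python
import numpy as np

def adaptive_hb_predict(lik_target, lik_context):
    # lik_target[v] = p(y_kt | x_kt, beta_k^v); lik_context[v] = p(y_k^c | x_k^c, beta_k^v).
    w = lik_context / lik_context.sum()  # re-weight draws by their context fit
    return np.dot(w, lik_target)         # estimate of p(y_kt | x_kt, D_k, M)
```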
D Implementation Details
For the proposed method, we set the dimension of the initial encoding $E_i^c$ of the context points to 64. The dimensions $m_k, m_v$ for the key, query, and value matrices of attention are set to 16, and we use 4 attention heads. The feed-forward neural network has two layers, where the latent layer dimension is 128 and the activation function is $\mathrm{ReLU}(x) = \max\{0, x\}$. We use dropout with a rate of 0.1 for all the neural network parameters during model training. We use the Adam optimizer [57] implemented in PyTorch [58] with a learning rate of 1e-4. The learning rate is decayed by a factor of 0.9 every 20 epochs. As shown in Fig. D.1, the algorithm converges in around 2,000 iterations, where each iteration processes 200 sessions.
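These optimizer settings correspond to the following PyTorch sketch; the model argument is a placeholder, and the scheduler is stepped once per epoch so the learning rate decays by 0.9 every 20 epochs:

```python
import torch

def make_optimizer(model):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.9)
    return optimizer, scheduler  # call scheduler.step() after every epoch
```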
The logistic regression and random forest are implemented using the LogisticRegression and RandomForestClassifier modules of the Scikit-learn library in Python [69]. For the FT-LR method, the logistic regression model is first calibrated on all the training sessions; the pre-trained model is then fine-tuned on the new session context data by performing stochastic gradient descent. The method is implemented with the SGDClassifier module and the partial_fit function therein.⁷ For FT-RF, a random forest model is pre-trained on the training sessions with the warm_start parameter set to True, which allows fitting additional weak learners, i.e., classification trees, to an already fitted model.⁸ We set the number of trees for the pre-trained model to 20, and use an additional 2 trees to fit the context data of a new customer. The number of trees is chosen to maximize the predictive accuracy on a hold-out set.

⁷ https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html
⁸ https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
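The two fine-tuned baselines can be sketched with the scikit-learn calls named above; the data arrays are placeholders:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier

def fit_ft_baselines(X_train, y_train, X_context, y_context):
    # FT-LR: pre-train a logistic model with SGD, then fine-tune on the
    # new session's context data via partial_fit.
    ft_lr = SGDClassifier(loss="log_loss")
    ft_lr.fit(X_train, y_train)
    ft_lr.partial_fit(X_context, y_context)
    # FT-RF: pre-train 20 trees, then grow 2 additional trees that are
    # fitted only on the context data (warm_start=True).
    ft_rf = RandomForestClassifier(n_estimators=20, warm_start=True)
    ft_rf.fit(X_train, y_train)
    ft_rf.n_estimators += 2
    ft_rf.fit(X_context, y_context)
    return ft_lr, ft_rf
```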
E Data summaries
The music session data consists of 16 acoustic fingerprints of the tracks; among them,
Mode and Key are coded as one-hot categorical features, and the rest are min-max standard-
ized as numerical features with the summary statistics shown in Table E.1. The definition of
these acoustic fingerprints are presented in Table D.1. The other features include an eight-
dimensional acoustic embedding vector provided by Spotify, as well as duration, release
year, and US popularity estimate for a track.
Table E.1: Summary statistics of the min-max normalized acoustic fingerprints.

         Acousticness  Beat strength  Bounciness  Danceability  Energy  Flatness  Mechanism
Min      0             0              0           0             0       0         0
Q1       0.03          0.44           0.46        0.57          0.51    0.9       0.45
Mean     0.22          0.55           0.6         0.68          0.63    0.91      0.6
Median   0.12          0.56           0.61        0.7           0.63    0.92      0.64
Q3       0.34          0.67           0.74        0.8           0.76    0.94      0.76
Max      1             1              1           1             1       1         1

         Liveness  Loudness  Instrumentalness  Organism  Speechiness  Tempo  Valence
Min      0         0         0                 0         0            0      0
Q1       0.1       0.84      0                 0.21      0.05         0.44   0.28
Mean     0.19      0.86      0.03              0.36      0.15         0.56   0.46
Median   0.13      0.87      0                 0.32      0.09         0.57   0.44
Q3       0.24      0.89      0                 0.49      0.21         0.66   0.63
Max      1         1         1                 1         1            1      1
F Additional results
Table F.1 contains the one-shot session completion for two context sets from our data
set. One context set has high skip rate and the other context set has low skip rate for low
danceability tracks. Fig. E.1 plots the individual trace of βipt around the mean curve. We
notice that the individual preference for acousticness centers around the mean curve, while
the preference for energy exhibits two types of patterns. One subgroup prefers high-energy
tracks at the beginning and gradually becomes neutral, while the other subgroup prefers
[1] Nicolas Padilla and Eva Ascarza. Overcoming the cold start problem of customer relationship management using a probabilistic machine learning approach. Journal of Marketing Research, 58(5):981–1006, 2021.
[2] Paul Voigt and Axel Von dem Bussche. The EU General Data Protection Regulation
(GDPR): A Practical Guide. Springer International Publishing, Cham, 1st edition, 2017.
[3] Preston Bukaty. The California Consumer Privacy Act (CCPA): An implementation guide. IT Governance Ltd, 2019.
[4] Peter E Rossi, Robert E McCulloch, and Greg M Allenby. The value of purchase history
data in target marketing. Marketing Science, 15(4):321–340, 1996.
[5] Greg M Allenby and Peter E Rossi. Marketing models of consumer heterogeneity. Journal of Econometrics, 89(1-2):57–78, 1998.
[6] Asim Ansari, Skander Essegaier, and Rajeev Kohli. Internet recommendation systems.
Journal of Marketing Research, 37(3):363–375, 2000.
[8] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast
adaptation of deep networks. In International conference on machine learning, pages
1126–1135. PMLR, 2017.
[9] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N
Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in
neural information processing systems, 30, 2017.
[10] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching
networks for one shot learning. Advances in neural information processing systems, 29,
2016.
[11] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-Agnostic Meta-Learning for Fast
Adaptation of Deep Networks. In Doina Precup and Yee Whye Teh, editors, Proceedings
of the 34th International Conference on Machine Learning, volume 70 of Proceedings of
Machine Learning Research, pages 1126–1135. PMLR, 2017.
[12] Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J Rezende,
SM Eslami, and Yee Whye Teh. Neural processes. arXiv preprint arXiv:1807.01622,
2018.
[14] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-
training of deep bidirectional transformers for language understanding. arXiv preprint
arXiv:1810.04805, 2018.
[15] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al.
Language models are few-shot learners. Advances in neural information processing
systems, 33:1877–1901, 2020.
[16] Yu Bai, Fan Chen, Huan Wang, Caiming Xiong, and Song Mei. Transformers as statisticians: Provable in-context learning with in-context algorithm selection. Advances in Neural Information Processing Systems, 36, 2023.
[18] Ryan Dew, Asim Ansari, and Yang Li. Modeling dynamic heterogeneity using Gaussian processes. Journal of Marketing Research, 2020.
[19] Dennis J Zhang, Ming Hu, Xiaofei Liu, Yuxiang Wu, and Yong Li. Netease cloud music
data. Manufacturing & Service Operations Management, 24(1):275–284, 2022.
[20] Daria Dzyabura and John R Hauser. Recommending products when consumers learn
their preference weights. Marketing Science, 38(3):417–441, 2019.
[21] Duncan Simester, Artem Timoshenko, and Spyros I Zoumpoulis. Targeting prospec-
tive customers: Robustness of machine-learning methods to typical data challenges.
Management Science, 66(6):2495–2522, 2020.
[22] Jeremy Yang, Dean Eckles, Paramveer Dhillon, and Sinan Aral. Targeting for long-term
outcomes. Management Science, Articles in Advance:1–15, 2023.
[23] Sepp Hochreiter, A Steven Younger, and Peter R Conwell. Learning to learn using
gradient descent. In Artificial Neural Networks—ICANN 2001: International Conference
Vienna, Austria, August 21–25, 2001 Proceedings 11, pages 87–94. Springer, 2001.
[24] Juan-Manuel Perez-Rua, Xiatian Zhu, Timothy M Hospedales, and Tao Xiang. In-
cremental few-shot object detection. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 13846–13855, 2020.
[25] Yujia Bao, Menghua Wu, Shiyu Chang, and Regina Barzilay. Few-shot text classification
with distributional signatures. In International Conference on Learning Representations,
2019.
[27] Marta Garnelo, Dan Rosenbaum, Christopher Maddison, Tiago Ramalho, David Saxton,
Murray Shanahan, Yee Whye Teh, Danilo Rezende, and S M Ali Eslami. Conditional
neural processes. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th
International Conference on Machine Learning, volume 80 of Proceedings of Machine
Learning Research, pages 1704–1713. PMLR, 2018.
[28] Hyunjik Kim, Andriy Mnih, Jonathan Schwarz, Marta Garnelo, Ali Eslami, Dan Rosen-
baum, Oriol Vinyals, and Yee Whye Teh. Attentive neural processes. In International
Conference on Learning Representations, 2018.
[29] Mingzhang Yin, George Tucker, Mingyuan Zhou, Sergey Levine, and Chelsea Finn. Meta-
learning without memorization. In International Conference on Learning Representations,
2020.
[30] Gautam Singh, Jaesik Yoon, Youngsung Son, and Sungjin Ahn. Sequential neural
processes. Advances in Neural Information Processing Systems, 32, 2019.
[31] Tung Nguyen and Aditya Grover. Transformer neural processes: Uncertainty-aware
meta learning via sequence modeling. In International Conference on Machine Learning,
pages 16569–16594. PMLR, 2022.
[32] Xiao Liu. Deep learning in marketing: A review and research agenda. Artificial
Intelligence in Marketing, 20:239–271, 2023.
[33] Naresh K. Malhotra, K. Sudhir, and Olivier Toubia. Artificial Intelligence in Marketing,
volume 20 of Review of Marketing Research. Emerald Publishing Limited, 2023.
[34] Yi Yang, Kunpeng Zhang, and PK Kannan. Identifying market structure: A deep network
representation learning of social engagement. Journal of Marketing, 86(4):37–56,
2022.
[35] Paramveer S Dhillon and Sinan Aral. Modeling dynamic user interests: A neural matrix
factorization approach. Marketing Science, 40(6):1059–1080, 2021.
[36] Neeraj Bharadwaj, Michel Ballings, Prasad A Naik, Miller Moore, and Mustafa Murat
Arat. A new livestream retail analytics framework to assess the sales impact of
emotional displays. Journal of Marketing, 86(1):27–47, 2022.
[37] Xiao Liu, Dokyun Lee, and Kannan Srinivasan. Large-scale cross-category analysis of
consumer review content on sales conversion leveraging deep learning. Journal of
Marketing Research, 56(6):918–943, 2019.
[38] Liu Liu, Daria Dzyabura, and Natalie Mizik. Visual listening in: Extracting brand
image portrayed on social media. Marketing Science, 39(4):669–686, 2020.
[40] Ishita Chakraborty, Minkyung Kim, and K Sudhir. Attribute sentiment scoring with
online text reviews: Accounting for language structure and missing attributes. Journal
of Marketing Research, 59(3):600–622, 2022.
[41] Junming Yin, Zisu Wang, Yue Katherine Feng, and Yong Liu. Modeling behavioral
dynamics in digital content consumption: An attention-based neural point process
approach with applications in video games. Marketing Science, Forthcoming, 2022.
[42] Khaled Boughanmi, Asim Ansari, and Yang Li. A generative model of consumer
collections. Available at SSRN 4261182, 2023.
[43] Jochen Hartmann, Mark Heitmann, Christina Schamp, and Oded Netzer. The power
of brand selfies. Journal of Marketing Research, 58(6):1159–1177, 2021.
[44] Dinesh Puranam, Vrinda Kadiyali, and Vishal Narayan. The impact of increase in
minimum wages on consumer perceptions of service: A transformer model of online
restaurant reviews. Marketing Science, 40(5):985–1004, 2021.
[45] Jonathan Gordon, John Bronskill, Matthias Bauer, Sebastian Nowozin, and Richard
Turner. Meta-learning probabilistic inference for prediction. In International Conference
on Learning Representations, 2018.
[46] Erin Grant, Chelsea Finn, Sergey Levine, Trevor Darrell, and Thomas Griffiths. Recast-
ing gradient-based meta-learning as hierarchical Bayes. In International Conference on
Learning Representations, 2018.
[47] Greg M Allenby, Peter E Rossi, and Robert E McCulloch. Hierarchical Bayes models: A
practitioner's guide. SSRN Electronic Journal, 2005.
[48] Zikun Ye, Dennis J Zhang, Heng Zhang, Renyu Zhang, Xin Chen, and Zhiwei Xu.
Cold start to improve market thickness on online advertising platforms: Data-driven
algorithms and field experiments. Management Science, 69(7):3838–3860, 2023.
[49] Yijia Zhang, Zhenkun Shi, Wanli Zuo, Lin Yue, Shining Liang, and Xue Li. Joint person-
alized Markov chains with social network embedding for cold-start recommendation.
Neurocomputing, 386:208–220, 2020.
[50] Jiayu Song, Jiajie Xu, Rui Zhou, Lu Chen, Jianxin Li, and Chengfei Liu. CBML: A
cluster-based meta-learning model for session-based recommendation. In Proceedings
of the 30th ACM International Conference on Information & Knowledge Management,
pages 1713–1722, 2021.
[52] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint
arXiv:1312.6114, 2013.
[53] John Ashworth Nelder and Robert WM Wedderburn. Generalized linear models.
Journal of the Royal Statistical Society Series A: Statistics in Society, 135(3):370–384,
1972.
[54] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua.
Neural collaborative filtering. In Proceedings of the 26th International Conference on
World Wide Web, pages 173–182, 2017.
[55] Steffen Rendle, Walid Krichene, Li Zhang, and John Anderson. Neural collaborative
filtering vs. matrix factorization revisited. In Proceedings of the 14th ACM Conference
on Recommender Systems, pages 240–248, 2020.
[56] Asim Ansari, Yang Li, and Jonathan Z Zhang. Probabilistic topic model for hybrid
recommender systems: A stochastic variational Bayesian approach. Marketing Science,
37(6):987–1008, 2018.
[57] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv
preprint arXiv:1412.6980, 2014.
[58] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory
Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch:
An imperative style, high-performance deep learning library. Advances in Neural
Information Processing Systems, 32, 2019.
[59] Ryan Dew, Asim Ansari, and Olivier Toubia. Letting logos speak: Leveraging multiview
representation learning for data-driven branding and logo design. Marketing Science,
41(2):401–425, 2022.
[60] Alex Burnap, John R Hauser, and Artem Timoshenko. Product aesthetic design: A
machine learning augmentation. Marketing Science, 2023.
[61] Brian Brost, Rishabh Mehrotra, and Tristan Jehan. The music streaming sessions
dataset, 2020.
[62] Khaled Boughanmi and Asim Ansari. Dynamics of musical success: A machine learning
approach for multimedia data fusion. Journal of Marketing Research, 58(6):1034–1057,
2021.
[63] John Barnard, Robert McCulloch, and Xiao-Li Meng. Modeling covariance matrices in
terms of standard deviations and correlations, with application to shrinkage. Statistica
Sinica, pages 1281–1311, 2000.
[65] Bob Carpenter, Andrew Gelman, Matthew D Hoffman, Daniel Lee, Ben Goodrich,
Michael Betancourt, Marcus A Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan:
A probabilistic programming language. Journal of Statistical Software, 76, 2017.
[66] Matthew D Hoffman, Andrew Gelman, et al. The No-U-Turn Sampler: Adaptively setting
path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research,
15(1):1593–1623, 2014.
[67] Gaël Varoquaux, Lars Buitinck, Gilles Louppe, Olivier Grisel, Fabian Pedregosa, and
Andreas Mueller. Scikit-learn: Machine learning without learning the machinery.
GetMobile: Mobile Computing and Communications, 19(1):29–33, 2015.
[68] Daniel Martin Katz, Michael J Bommarito, and Josh Blackman. A general approach
for predicting the behavior of the Supreme Court of the United States. PLoS ONE,
12(4):e0174698, 2017.
[69] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand
Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent
Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine
Learning Research, 12:2825–2830, 2011.
[70] Christopher M Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[71] Mark Granroth-Wilding and Stephen Clark. What happens next? event prediction
using a compositional neural network model. In Proceedings of the AAAI Conference on
Artificial Intelligence, volume 30, 2016.
[72] Leland McInnes, John Healy, Nathaniel Saul, and Lukas Großberger. UMAP: Uniform
manifold approximation and projection. Journal of Open Source Software, 3(29):861,
2018.
[73] Ryan Dew. Adaptive preference measurement with unstructured data. SSRN, 2023.
[74] Alan D Baddeley and Graham Hitch. The recency effect: Implicit learning with explicit
retrieval? Memory & Cognition, 21:146–155, 1993.
[75] Gregory J Digirolamo and Douglas L Hintzman. First impressions are lasting impres-
sions: A primacy effect in memory for repetitions. Psychonomic Bulletin & Review,
4(1):121–124, 1997.
[76] Henry L Roediger III and Robert G Crowder. A serial position effect in recall of United
States presidents. Bulletin of the Psychonomic Society, 8(4):275–278, 1976.
[78] Donald B Rubin and Neal Thomas. Matching using estimated propensity scores:
relating theory to practice. Biometrics, pages 249–264, 1996.
[79] Chin-Wei Huang, David Krueger, Alexandre Lacoste, and Aaron Courville. Neural
autoregressive flows. In International Conference on Machine Learning, pages 2078–
2087. PMLR, 2018.
[80] Omid Rafieian and Hema Yoganarasimhan. Targeting and privacy in mobile advertising.
Marketing Science, 40(2):193–218, 2021.
[81] H Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise
Agüera y Arcas. Communication-efficient learning of deep networks from decentralized
data. arXiv preprint arXiv:1602.05629, 2016.
[83] Kanishka Misra, Eric M Schwartz, and Jacob Abernethy. Dynamic online pricing with
incomplete information using multiarmed bandit experiments. Marketing Science,
38(2):226–252, 2019.
[84] Hamsa Bastani, David Simchi-Levi, and Ruihao Zhu. Meta dynamic pricing: Transfer
learning across experiments. Management Science, 68(3):1865–1881, 2022.
[85] Sneha Chaudhari, Varun Mithal, Gungor Polatkan, and Rohan Ramanath. An attentive
survey of attention models. ACM Transactions on Intelligent Systems and Technology
(TIST), 12(5):1–32, 2021.