RecSys ’21, September 27-October 1, 2021, Amsterdam, Netherlands G. Salha-Galvan et al.
items ranking as a directed link prediction problem [53], for new nodes gradually added into this graph. Then, we solve this problem by leveraging recent advances in graph representation learning [18, 19, 67], and specifically directed graph autoencoders [30, 53]. Our proposed framework permits retrieving similar neighbors of items from node embeddings and from a gravity-inspired decoder acting as a ranking mechanism. Our work is the first transposition and analysis of gravity-inspired graph autoencoders [53] on recommendation problems. Backed by in-depth experiments on industrial data from the global music streaming service Deezer¹, we show the effectiveness of our approach at addressing a real-world cold start similar artists ranking problem, outperforming several popular baselines for cold start recommendation. Last, we publicly release our code and the industrial data from our experiments. Besides making our results reproducible, we hope that such a release of real-world resources will benefit future research on this topic.

This paper is organized as follows. In Section 2, we introduce the cold start similar items ranking problem more precisely and mention previous works on related topics. In Section 3, we present our graph-based framework to address this problem. We report and discuss our experiments on Deezer data in Section 4, and we conclude in Section 5.

2 PRELIMINARIES

2.1 Problem Formulation

Throughout this paper, we consider a catalog of n recommendable items on an online service, such as music artists in our application. Each item i is described by some side information summarized in an f-dimensional vector x_i; for artists, such a vector could for instance capture information related to their country of origin or to their music genres. These n items are assumed to be "warm", meaning that the service considers that a sufficiently large number of interactions with users, e.g. likes or streams, has been reached for these items to ensure reliable usage data analyses.

From these usage data, the service learns an n × n similarity matrix S, where the element S_ij ∈ [0, 1] captures the similarity of item j w.r.t. item i. Examples of possible usage-based similarity scores² S_ij include the percentage of users interacting with item i that also interacted with item j (e.g. users listening to or liking both items [26]), mutual information scores [56], or more complex measures derived from collaborative filtering [25, 33, 55]. Throughout this paper, we assume that similarity scores are fixed over time, which we later discuss.

Leveraging these scores, the service proposes a similar items feature comparable to the "Fans Also Like" feature described in the introduction. Specifically, along with the presentation of an item i on the service, this feature recommends a ranked list of k similar items to users. They correspond to the top-k items j such that j ≠ i and with highest similarity scores S_ij.

On a regular basis, "cold" items will appear in the catalog. While the service might have access to descriptive side information on these items, no usage data will be available upon their first online release. This hinders the computation of usage-based similarity scores, thus excluding these items from recommendations until they become warm - if ever. In this paper, we study the feasibility of effectively predicting their future similar items ranked lists, from the delivery of these items, i.e. without any usage data. This would enable offering such an important feature quicker and on a larger part of the catalog. More precisely, we answer the following research question: using only 1) the known similarity scores between warm items, and 2) the available descriptive information, how, and to which extent, can we predict the future "Fans Also Like" lists that would ultimately be computed once cold items become warm?

2.2 Related Work

While collaborative filtering methods effectively learn item proximities, e.g. via the factorization of user-item interaction matrices [33, 59], these methods usually become unsuitable for cold items without any interaction data, which are thus absent from these matrices [59]. In such a setting, the simplest strategy for similar items ranking would consist in relying on popularity metrics [55], e.g. to recommend the most listened artists. In the presence of descriptive information on cold items, one could also recommend items with the closest descriptions [1]. These heuristics are usually outperformed by hybrid models, leveraging both item descriptions and collaborative filtering on warm items [22, 23, 59, 63]. They consist in:

• learning a vector space representation (an embedding) of warm items, where proximity reflects user preferences;
• then, projecting cold items into this embedding, typically by learning a model to map descriptive vectors of warm items to their embedding vectors, and then applying this mapping to cold items' descriptive vectors.

Albeit under various formulations, this strategy has been transposed to Matrix Factorization [6, 59], Collaborative Metric Learning [23, 36] and Bayesian Personalized Ranking [2, 22]; in practice, a deep neural network often acts as the mapping model. The retrieved similar items are then the closest ones in the embedding. Other deep learning approaches were also recently proposed for item cold start, with promising performances. DropoutNet [61] processes both usage and descriptive data, and is explicitly trained for cold start through a dropout [57] simulation mechanism. MeLU [35] deploys a meta-learning paradigm to learn embeddings in the absence of usage data. CVAE [37] leverages a Bayesian generative process to sample cold embedding vectors, via a collaborative variational autoencoder.

While they constitute relevant baselines, these models do not rely on graphs, contrary to our work. Graph-based recommendation has recently grown at a fast pace (see the surveys of [64, 66]), including in industrial applications [63, 68]. Existing research widely focuses on bipartite user-item graphs [64]. Notably, STAR-GCN [69] addresses cold start by reconstructing user-item links using stacked graph convolutional networks, extending ideas from [5, 30]. Instead, recent efforts [46, 47] emphasized the relevance of leveraging - as we will - graphs connecting items together, along with their attributes. In this direction, the work closest to ours might be the recent DEAL [20] which, thanks to an alignment mechanism,

¹ https://www.deezer.com
² Details on the computation of similarities at Deezer are provided in Section 4.1 - without loss of generality, as our framework is valid for any S_ij ∈ [0, 1].
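To make the "Fans Also Like" definition of Section 2.1 concrete, the top-k list for each item can be extracted from S as follows. This is a minimal NumPy sketch: the function name and the toy similarity matrix are illustrative, not part of Deezer's actual pipeline.

```python
import numpy as np

def fans_also_like(S, k):
    """Return, for each item i, the k items j != i with highest S[i, j],
    sorted by decreasing similarity score."""
    S = S.copy()
    np.fill_diagonal(S, -np.inf)            # an item never recommends itself
    # argsort in decreasing order, keep the top-k columns per row
    return np.argsort(-S, axis=1)[:, :k]

# toy 4-item similarity matrix with scores in [0, 1]
S = np.array([[1.0, 0.9, 0.2, 0.4],
              [0.1, 1.0, 0.8, 0.3],
              [0.5, 0.6, 1.0, 0.7],
              [0.9, 0.1, 0.3, 1.0]])
print(fans_also_like(S, k=2))   # row 0 -> items 1 then 3
```

Note that, since S is generally asymmetric, item j may appear in item i's list without i appearing in j's list, which motivates the directed graph view of Section 3.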
Cold Start Similar Artists Ranking with Gravity-Inspired Graph Autoencoders RecSys ’21, September 27-October 1, 2021, Amsterdam, Netherlands
predicts links in such graphs for new nodes having only descriptive information. We will also compare to DEAL; we nonetheless point out that their work focused on undirected graphs, and did not consider ranking settings.

3 A GRAPH-BASED FRAMEWORK FOR COLD START SIMILAR ITEMS RANKING

In this section, we present our proposed graph-based framework to address the cold start similar items ranking problem.

3.1 Similar Items Ranking as a Directed Link Prediction Task

We argue that "Fans Also Like" features can naturally be summarized as a graph structure with n nodes and n × k edges. Nodes are warm recommendable items from the catalog, e.g. music artists with enough usage data according to the service's internal rules. Each item node points to its k most similar neighbors via a link, i.e. an edge. This graph is:

• directed: edges have a direction, leading to asymmetric relationships. For instance, while most fans of a little known reggae band might listen to Bob Marley (Marley thus appearing among their similar artists), Bob Marley's fans will rarely listen to this band, which is unlikely to appear back among Bob Marley's own similar artists.
• weighted: among the k neighbors of node i, some items are more similar to i than others (hence the ranking). We capture this aspect by equipping each directed edge (i, j) of the graph with a weight corresponding to the similarity score S_ij. More formally, we summarize our graph structure by the n × n adjacency matrix A, where the element A_ij = S_ij if j is one of the k most similar items w.r.t. i, and where A_ij = 0 otherwise³.
• attributed: as explained in Section 2.1, each item i is also described by a vector x_i ∈ R^f. In the following, we denote by X the n × f matrix stacking up all descriptive vectors from the graph, i.e. the i-th row of X is x_i.

Then, we model the release of a cold recommendable item in the catalog as the addition of a new node in the graph, together with its side descriptive vector. As usage data and similarity scores are unavailable for this item, it is observed as isolated, i.e. it does not point to k other nodes. In our framework, we assume that these k missing directed edges - and their weights - are actually masked. They point to the k nodes with highest similarity scores, as would be identified by the service once it collects enough usage data to consider the item as warm, according to the service's criteria. These links and their scores, ultimately revealed, are treated as ground truth in the remainder of this work.

From this perspective, the cold start similar items ranking problem consists in a directed link prediction task [40, 53, 54]. Specifically, we aim at predicting the locations and weights - i.e. estimated similarity scores - of these k missing directed edges, and at comparing predictions with the actual ground-truth edges ultimately revealed, both in terms of:

• prediction accuracy: do we retrieve the correct locations of missing edges in the graph?
• ranking quality: are the retrieved edges correctly ordered, in terms of similarity scores?

3.2 From Similar Items Graphs to Directed Node Embeddings

Locating missing links in graphs has been the objective of significant research efforts from various fields [38, 40, 53, 54]. While this problem has historically been addressed via the construction of hand-engineered node similarity metrics, such as the popular Adamic-Adar, Jaccard or Katz measures [38], significant improvements were recently achieved by methods directly learning node representations summarizing the graph structure [18, 19, 30, 31, 67]. These methods represent each node i as a vector z_i ∈ R^d (with d ≪ n) in a node embedding space where structural proximity should be preserved, learned by using matrix factorization [7], random walks [45], or graph neural networks [31, 67]. Then, they estimate the probability of a missing edge between two nodes by evaluating their proximity in this embedding [19, 30].

In this paper, we build upon these advances and thus learn node embeddings to tackle link prediction in our similar items graph. Specifically, we propose to leverage gravity-inspired graph (variational) autoencoders, recently introduced in a previous work from Deezer [53]. As detailed in Section 3.3, these models are doubly advantageous for our application:

• Foremost, they can effectively process node attribute vectors in addition to the graph, contrary to some popular alternatives such as DeepWalk [45] and standard Laplacian eigenmaps [3]. This will help us add cold nodes, isolated but equipped with descriptions, into an existing warm node embedding.
• Simultaneously, they were specifically designed to address directed link prediction from embeddings, contrary to the aforementioned alternatives or to standard graph autoencoders [30].

In the following, we first recall key concepts related to gravity-inspired graph (variational) autoencoders, in Section 3.3. Then, we transpose them to similar items ranking, in Section 3.4. To the best of our knowledge, our work constitutes the first analysis of these models on a recommendation problem and, more broadly, on an industrial application.

3.3 Gravity-Inspired Graph (Variational) Autoencoders

3.3.1 Overview. To predict the probability of a missing edge - and its weight - between two nodes i and j from some embedding vectors z_i and z_j, most existing methods rely on symmetric measures, such as the Euclidean distance between these vectors ||z_i − z_j||₂ [19] or their inner product z_i^T z_j [30]. However, due to this symmetry, predictions will be identical for both edge directions i → j and j → i. This is undesirable for directed graphs, where A_ij ≠ A_ji in general.

³ Alternatively, one could consider a dense matrix where A_ij = S_ij for all pairs (i, j). However, besides acting as a data cleaning process on lowest scores, sparsifying A speeds up computations for the encoder introduced thereafter, whose complexity evolves linearly w.r.t. the number of edges (see Section 3.4.3).
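The top-k sparsification of S into the directed, weighted adjacency matrix A described in Section 3.1 can be sketched as follows. This is illustrative NumPy code with a toy matrix; in production, A would rather be stored as a sparse matrix, since it only holds n × k nonzero entries.

```python
import numpy as np

def build_adjacency(S, k):
    """Top-k sparsification of the similarity matrix S into the directed,
    weighted adjacency matrix A: A[i, j] = S[i, j] if j is one of the k
    most similar items w.r.t. i (with j != i), and A[i, j] = 0 otherwise."""
    n = S.shape[0]
    A = np.zeros_like(S)
    scores = S.copy()
    np.fill_diagonal(scores, -np.inf)            # no self-loops
    top_k = np.argsort(-scores, axis=1)[:, :k]   # k best neighbors per row
    rows = np.repeat(np.arange(n), k)
    A[rows, top_k.ravel()] = S[rows, top_k.ravel()]
    return A  # in practice, a scipy.sparse matrix (only n * k nonzeros)

S = np.array([[1.0, 0.9, 0.2, 0.4],
              [0.1, 1.0, 0.8, 0.3],
              [0.5, 0.6, 1.0, 0.7],
              [0.9, 0.1, 0.3, 1.0]])
A = build_adjacency(S, k=2)
# A is generally asymmetric: here item 0 points to item 1, but not vice versa
```

The resulting graph is exactly the directed, weighted, attributed structure of Section 3.1 once paired with the attribute matrix X.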
Salha et al. [53] addressed this problem by equipping z_i vectors with masses, and then taking inspiration from Newton's second law of motion [43]. Considering two objects 1 and 2 with positive masses m_1 and m_2, separated by a distance r > 0, the physical acceleration a_{1→2} (resp. a_{2→1}) of 1 towards 2 (resp. 2 towards 1) due to gravity [53] is a_{1→2} = G m_2 / r² (resp. a_{2→1} = G m_1 / r²), where G denotes the gravitational constant [8]. One observes that a_{1→2} > a_{2→1} when m_2 > m_1, i.e. that the acceleration of the less massive object towards the more massive one is higher. Transposing these notions to node embedding vectors z_i augmented with masses m_i for each node i, Salha et al. [53] reconstruct asymmetric links by using a_{i→j} (resp. a_{j→i}) as an indicator of the likelihood that node i is connected to j (resp. j to i) via a directed edge, setting r = ||z_i − z_j||₂. Precisely, they use the logarithm of a_{i→j}, limiting accelerations towards very central nodes [53], and add a sigmoid activation σ(x) = 1/(1 + e^{−x}) ∈ ]0, 1[. Therefore, for weighted graphs, the output is an estimation Â_ij of some true weight A_ij ∈ [0, 1]. For unweighted graphs, it corresponds to the probability of a missing edge:

    Â_ij = σ(log a_{i→j}) = σ(log(G m_j / ||z_i − z_j||₂²)) = σ(log(G m_j) − log ||z_i − z_j||₂²),   (1)

where the term log(G m_j) is denoted m̃_j in the following.

3.3.2 Gravity-Inspired Graph AE. To learn such masses and embedding vectors from an n × n adjacency matrix A, potentially completed with an n × f node attributes matrix X, Salha et al. [53] introduced gravity-inspired graph autoencoders. They build upon the graph extensions of autoencoders (AE) [30, 58, 62], which recently appeared among the state-of-the-art approaches for (undirected) link prediction in numerous applications [17, 21, 30, 50, 52, 62]. Graph AE are a family of models aiming at encoding nodes into an embedding space from which decoding, i.e. reconstructing the graph, should ideally be possible, as, intuitively, this would indicate that such representations preserve important characteristics of the initial graph. In the case of gravity-inspired graph AE, the encoder is a graph neural network (GNN) [19] processing A and X (we discuss architecture choices in Section 3.4.1), and returning an n × (d + 1) matrix Z̃. The i-th row of Z̃ is a (d + 1)-dimensional vector z̃_i. The d first dimensions of z̃_i correspond to the embedding vector z_i of node i; the last dimension corresponds to the mass⁴ m̃_i. Then, a decoder reconstructs Â from the aforementioned acceleration formula:

    Z̃ = GNN(A, X), then ∀(i, j) ∈ {1, ..., n}², Â_ij = σ(m̃_j − log ||z_i − z_j||₂²).   (2)

During training, the GNN's weights are tuned by minimizing a reconstruction loss capturing the quality of the reconstruction Â w.r.t. the true A, using gradient descent [16]. In [53], this loss is formulated as a standard weighted cross entropy [52].

3.3.3 Gravity-Inspired Graph VAE. Introduced as a probabilistic variant of gravity-inspired graph AE, gravity-inspired graph variational autoencoders (VAE) [53] extend the graph VAE model of Kipf and Welling [30], originally designed for undirected graphs. They provide an alternative strategy to learn Z̃, assuming that z̃_i vectors are drawn from Gaussian distributions - one for each node - that must be learned. Formally, they consider the following inference model [53], acting as an encoder:

    q(Z̃ | A, X) = ∏_{i=1}^{n} q(z̃_i | A, X), with q(z̃_i | A, X) = N(z̃_i | μ_i, diag(σ_i²)).

Gaussian parameters are learned from two GNNs, i.e. μ = GNN_μ(A, X), with μ denoting the n × (d + 1) matrix stacking up mean vectors μ_i. Similarly, log σ = GNN_σ(A, X). Finally, a generative model decodes A using, as above, the acceleration formula:

    p(A | Z̃) = ∏_{i=1}^{n} ∏_{j=1}^{n} p(A_ij | z̃_i, z̃_j), with p(A_ij | z̃_i, z̃_j) = Â_ij = σ(m̃_j − log ||z_i − z_j||₂²).

During training, weights from the two GNN encoders are tuned by maximizing, as a reconstruction quality measure, a tractable variational lower bound (ELBO) of the model's likelihood [30, 53], using gradient descent. We refer to [53] for the exact formula of this ELBO loss, and to [12, 29] for details on derivations of variational bounds for VAE models.

Besides constituting generative models with powerful applications to various graph generation problems [39, 41], graph VAE models emerged as competitive alternatives to graph AE on some link prediction problems [21, 30, 52, 53]. We therefore saw value in considering both gravity-inspired graph AE and gravity-inspired graph VAE in our work.

3.4 Cold Start Similar Items Ranking using Gravity-Inspired Graph AE/VAE

In this subsection, we explain how we build upon these models to address cold start similar items ranking.

3.4.1 Encoding Cold Nodes with GCNs. In this paper, our GNN encoders will be graph convolutional networks (GCN) [31]. In a GCN with L layers, with input layer H^(0) = X corresponding to node attributes, and output layer H^(L) (with H^(L) = Z̃ for AE, and H^(L) = μ or log σ for VAE), we have H^(l) = ReLU(Ã H^(l−1) W^(l−1)) for l ∈ {1, ..., L − 1} and H^(L) = Ã H^(L−1) W^(L−1). Here, Ã denotes the out-degree normalized⁵ version of A [53]. Therefore, at each layer, each node averages the representations of its neighbors (and itself), with a ReLU activation: ReLU(x) = max(x, 0). W^(0), ..., W^(L−1) are the weight matrices to tune. More precisely, we will implement 2-layer GCNs, meaning that, for AE:

    Z̃ = Ã ReLU(Ã X W^(0)) W^(1).   (3)

For VAE, μ = Ã ReLU(Ã X W_μ^(0)) W_μ^(1) and log σ = Ã ReLU(Ã X W_σ^(0)) W_σ^(1), and Z̃ is then sampled from μ and log σ. As all outputs are n × (d + 1) matrices and X is an n × f matrix, W^(0) (or, respectively, W_μ^(0) and W_σ^(0)) is an f × d_hidden matrix, with d_hidden the hidden layer dimension, and W^(1) (or, respectively, W_μ^(1) and W_σ^(1)) is a d_hidden × (d + 1) matrix.

We rely on GCN encoders as they permit incorporating new nodes, attributed but isolated in the graph, into an existing embedding. Indeed, let us consider an autoencoder trained from some A and X, leading to optimized weights W^(0) and W^(1) for some GCN. If m ≥ 1 cold nodes appear, along with their f-dimensional description vectors:

⁴ Learning m̃_i, as defined in equation (1), is equivalent to learning m_i, but allows us to get rid of the logarithm and of the constant G in computations.
⁵ Formally, Ã = D_out^{−1}(A + I_n), where I_n is the n × n identity matrix and D_out is the diagonal out-degree matrix [53] of A + I_n.
Figure 1: Similar artists recommended on the website version of Deezer. The mobile app proposes identical recommendations.
training set corresponds to warm artists. Artists from the validation and test sets are the cold nodes: their edges are masked, and they are therefore observed as isolated in the graph. Their 56-dimensional descriptions are available. We evaluate the ability of our models at retrieving these edges, with correct weight ordering. As a measure of prediction accuracy, we will report Recall@K scores. They indicate, for various K, which proportion of the 20 ground-truth similar artists appear among the top-K artists with largest estimated weights. Moreover, as a measure of ranking quality, we will also report the widely used Mean Average Precision at K (MAP@K) and Normalized Discounted Cumulative Gain at K (NDCG@K) scores⁸.

4.1.3 Dataset and Code Release. We publicly release our industrial data (i.e. the complete graph, all weights and all descriptive vectors) as well as our source code on GitHub⁹. Besides making our results fully reproducible, such a release publicly provides a new benchmark dataset to the research community, permitting the evaluation of comparable graph-based recommender systems on real-world resources.

4.2 List of Models and Baselines

We now describe all methods considered in our experiments. All embeddings have d = 32, which we will discuss. Also, all hyperparameters mentioned thereafter were tuned by optimizing NDCG@20 scores on the validation set.

4.2.1 Gravity-Inspired Graph AE/VAE. We follow our framework from Section 3.4 to embed cold nodes. For both AE and VAE, we use 2-layer GCN encoders with a 64-dim hidden layer and a 33-dim output layer (i.e. 32-dim z_i vectors, plus the mass), trained for 300 epochs. We use the Adam optimizer [28], with a learning rate of 0.05, without dropout, performing full-batch gradient descent, and using the reparameterization trick [29] for VAE. We set λ = 5 in the decoder of (4) and discuss the impact of λ thereafter. Our adaptation of these models builds upon the TensorFlow code of Salha et al. [53].

4.2.2 Other Methods based on the Directed Artist Graph. We compare gravity-inspired graph AE/VAE to standard graph AE/VAE models [31], with a similar setting as above. These models use symmetric inner-product decoders, i.e. Â_ij = σ(z_i^T z_j), therefore ignoring directionalities. Moreover, we implement source-target graph AE/VAE, used as a baseline in [53]. They are similar to standard graph AE/VAE, except that they decompose the 32-dim vectors z_i into a source vector z_i^(s) = z_i[1:16] and a target vector z_i^(t) = z_i[17:32], and then decode edges as follows: Â_ij = σ(z_i^(s)T z_j^(t)) and Â_ji = σ(z_j^(s)T z_i^(t)) (≠ Â_ij in general). They reconstruct directed links, as gravity models do, and are therefore relevant baselines for our evaluation. Last, we also test the recent DEAL model [20] mentioned in Section 2.2, designed for inductive link prediction on new isolated but attributed nodes. We used the authors' PyTorch implementation, with the same attribute and structure encoders, alignment mechanism, loss and cosine predictions as their original model [20].

4.2.3 Other Baselines. In addition, we compare our proposed framework to four popularity-based baselines:

• Popularity: recommends the K most popular¹⁰ artists on Deezer (with K as in Section 4.1.2).
• Popularity by country: recommends the K most popular artists from the country of origin of the cold artist.
• In-degree: recommends the K artists with highest in-degrees in the graph, i.e. sums of weights pointing to them.
• In-degree by country: proceeds as In-degree, but on warm artists from the country of origin of the cold artist.

We also consider three baselines only or mainly based on musical descriptions x_i and not on usage data:

⁸ MAP@K and NDCG@K are computed as in equation (4) of [55] (averaged over all cold artists) and in equation (2) of [65], respectively.
⁹ Code and data are available on: https://github.com/deezer/similar_artists_ranking
¹⁰ Our dataset includes the popularity rank (from 1st to n-th) of warm artists. It is left out of x_i vectors, as it is usage-based and thus unavailable for cold artists.
• K-NN: recommends the K artists with closest x_i vectors, from a nearest neighbors search with Euclidean distance.
• K-NN + Popularity and K-NN + In-degree: retrieve the 200 artists with closest x_i vectors, then recommend the K most popular ones among these 200 artists, ranked according to popularity and in-degree values respectively.

We also implement SVD+DNN, which follows the "embedding+mapping" strategy from Section 2.2 by 1) computing an SVD [34] of the warm artists similarity matrix, learning 32-dim z_i vectors, 2) training a 3-layer neural network (with layers of dimension 64, 32 and 32, trained with Adam [28] and a learning rate of 0.01) to map warm x_i vectors to z_i vectors, and 3) projecting cold artists into the SVD embedding through this mapping. Last, among the deep learning approaches from Section 2.2 (CVAE, DropoutNet, MeLU, STAR-GCN), we report results from the two best methods on our dataset, namely DropoutNet [61] and STAR-GCN [69], using the authors' implementations with careful fine-tuning on validation artists¹¹. Similar artists ranking is done via a nearest neighbors search in the resulting embedding spaces.

4.3 Results

4.3.1 Performances. Table 1 reports mean performance scores for all models, along with standard deviations over 20 runs for models with randomness due to weight initialization in GCNs or neural networks. Popularity and In-degree appear as the worst baselines. Their scores significantly improve by focusing on the country of origin of cold artists (e.g. with a Recall@100 of 12.38% for Popularity by country, v.s. 0.44% for Popularity). Besides, we observe that methods based on a direct K-NN search from x_i attributes are outperformed by the more elaborate cold start methods leveraging both attributes and warm usage data. In particular, DropoutNet, as well as the graph-based DEAL, reach stronger results than SVD+DNN and STAR-GCN. They also surpass standard graph AE and VAE (e.g. with a +6.46 gain in average NDCG@20 score for DEAL w.r.t. graph AE), but not the AE/VAE extensions that explicitly model edge directionalities, i.e. source-target graph AE/VAE and, even more so, gravity-inspired graph AE/VAE, which provide the best recommendations. This emphasizes the effectiveness of our framework, both in terms of prediction accuracy (e.g. with a top 67.85% average Recall@200 for gravity-inspired graph AE) and of ranking quality (e.g. with a top 41.42% average NDCG@200 for this same method). Moreover, while embeddings from Table 1 have d = 32, we point out that gravity-inspired models remained superior in our tests with d = 64 and d = 128. Also, VAE methods tend to outperform their deterministic counterparts for standard and source-target models, while the contrary conclusion emerges for gravity-inspired models. This confirms the value of considering both settings when testing these models on novel applications.

4.3.2 On the mass parameter. In Figure 2, we visualize some artists and their estimated masses. At first glance, one might wonder whether nodes i with largest masses m̃_i, such as Bob Marley in Figure 2, simply correspond to the most popular artists on Deezer. Table 2 shows that, while masses are indeed positively correlated to popularity and to various graph-based node importance measures, these correlations are not perfect, which highlights that our models do not exactly learn any of these metrics. Furthermore, replacing masses m̃_i by any of these measures during training (i.e. optimizing z_i vectors with the mass of i being fixed, e.g. as its PageRank [44] score) diminishes performances (e.g. by more than 6 points in NDCG@200, in the case of PageRank), which confirms that jointly learning embeddings and masses is optimal. Last, qualitative investigations also tend to reveal that masses, and thus relative node attractions, vary across locations in the graph. Local influence correlates with popularity but is also impacted by various culture- or music-related factors such as countries and genres. As an illustration, the successful samba/pagode Brazilian artist Thiaguinho, out of the top-100 most popular artists of our training set, has a larger mass than American pop star Ariana Grande, who appears among the top-5 most popular ones. While numerous Brazilian pagode artists point towards Thiaguinho, American pop music is much broader and all pop artists do not point towards Ariana Grande despite her popularity.

4.3.3 Impact of attributes. So far, all models processed the complete 56-dimensional attribute vectors x_i, concatenating information on music genres, countries and moods. In Figure 3, we assess the actual impact of each of these descriptions on performances, for our gravity-inspired graph VAE. Assuming only one attribute (genres, countries or moods) is available during training, genre-aware models return the best performances. Moreover, adding moods to country vectors leads to larger gains than adding moods to genre vectors. This could reflect how some of our music genres, such as speed metal, already capture some valence or arousal characteristics. Last, Figure 3 confirms that gathering all three descriptions provides the best performances, corresponding to those reported in Table 1.

4.3.4 Popularity/diversity trade-off. Last, besides performances, the gravity-inspired decoder from equation (4) also enables us to flexibly address popularity biases when ranking similar artists. More precisely:

• Setting λ → 0 increases the relative importance of the influence part of equation (4). Thus, the model will highly rank the most massive nodes. As illustrated in Figure 4, this results in recommending more popular music artists.
• On the contrary, increasing λ diminishes the relative importance of masses in predictions, in favor of the actual node proximity. As illustrated in Figure 4, this tends to increase the recommendation of less popular content.

Setting λ = 5 leads to optimal scores in our application (e.g. a 41.42% NDCG@200 for our AE, v.s. 35.91% and 40.31% for the same model with λ = 1 and λ = 20, respectively). Balancing between popularity and diversity is often desirable for industrial-level recommender systems [55]. Gravity-inspired decoders flexibly permit such a balancing.

4.3.5 Possible improvements (on models). Despite promising results, assuming fixed similarity scores over time might sometimes be unrealistic, as some user preferences could actually evolve. Capturing such changes, e.g. through dynamic graph embeddings, might

¹¹ These last models do not process similar artist graphs, but raw user-item usage data, either as a bipartite user-artist graph or as an interaction matrix. While Deezer can not release such fine-grained data, we nonetheless provide embedding vectors from these baselines to reproduce our scores.
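Assuming, as Sections 4.2.1 and 4.3.4 suggest, that equation (4) augments the gravity decoder of equation (2) with a coefficient λ on the log-distance term, i.e. Â_ij = σ(m̃_j − λ log ||z_i − z_j||₂²), the popularity/diversity trade-off can be sketched as follows. This assumed form of (4), the function name and the random data are all illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rank_warm_artists(z_cold, Z_warm, m_warm, lam, K):
    """Rank warm artists for one cold artist with the assumed lambda-augmented
    gravity decoder: score_j = sigmoid(m_warm[j] - lam * log ||z_cold - z_j||^2)."""
    sq_dist = ((Z_warm - z_cold) ** 2).sum(axis=1)
    scores = sigmoid(m_warm - lam * np.log(sq_dist))
    return np.argsort(-scores)[:K]

rng = np.random.default_rng(1)
Z_warm, m_warm = rng.normal(size=(100, 32)), rng.normal(size=100)
z_cold = rng.normal(size=32)
# lam -> 0: masses dominate, so the most massive (popular) artists rank first;
# larger lam: embedding proximity dominates, surfacing less popular content
print(rank_warm_artists(z_cold, Z_warm, m_warm, lam=0.0, K=5))
# with lam = 0 the ranking is exactly by decreasing mass
```

In this toy run, lam = 0.0 ranks purely by mass; on Deezer data, the paper's optimal setting is λ = 5 (Section 4.3.4).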
Performances of all models (scores in %; for models with randomized training, ± denotes standard deviations over repeated runs):

| Model | Recall@20 | Recall@100 | Recall@200 | MAP@20 | MAP@100 | MAP@200 | NDCG@20 | NDCG@100 | NDCG@200 |
|---|---|---|---|---|---|---|---|---|---|
| Popularity | 0.02 | 0.44 | 1.38 | <0.01 | 0.03 | 0.12 | 0.01 | 0.17 | 0.44 |
| Popularity by country | 2.76 | 12.38 | 18.98 | 0.80 | 3.58 | 6.14 | 2.14 | 6.41 | 8.76 |
| In-degree | 0.91 | 3.43 | 6.85 | 0.15 | 0.39 | 0.86 | 0.67 | 1.69 | 2.80 |
| In-degree by country | 5.46 | 16.82 | 23.52 | 2.09 | 5.43 | 7.73 | 5.00 | 10.19 | 12.64 |
| K-NN on x_i | 4.41 | 13.54 | 19.80 | 1.14 | 3.38 | 5.39 | 4.29 | 8.83 | 11.22 |
| K-NN + Popularity | 5.73 | 15.87 | 19.83 | 1.66 | 4.32 | 5.74 | 4.86 | 10.03 | 11.76 |
| K-NN + In-degree | 7.49 | 17.29 | 18.76 | 2.78 | 5.60 | 6.18 | 7.41 | 12.48 | 13.14 |
| SVD + DNN | 6.42 ± 0.96 | 21.83 ± 1.21 | 35.01 ± 1.41 | 2.25 ± 0.67 | 6.36 ± 1.19 | 11.52 ± 1.98 | 6.05 ± 0.75 | 12.91 ± 0.92 | 17.89 ± 0.95 |
| STAR-GCN | 10.03 ± 0.56 | 31.45 ± 1.09 | 43.92 ± 1.10 | 3.10 ± 0.32 | 10.64 ± 0.54 | 16.62 ± 0.68 | 10.07 ± 0.40 | 21.17 ± 0.69 | 25.99 ± 0.75 |
| DropoutNet | 12.96 ± 0.54 | 37.59 ± 0.76 | 49.93 ± 0.82 | 4.18 ± 0.30 | 13.61 ± 0.55 | 20.12 ± 0.67 | 13.12 ± 0.68 | 25.61 ± 0.72 | 30.52 ± 0.78 |
| DEAL | 12.80 ± 0.52 | 37.98 ± 0.59 | 50.75 ± 0.72 | 4.15 ± 0.25 | 14.01 ± 0.44 | 20.92 ± 0.54 | 12.78 ± 0.53 | 25.70 ± 0.62 | 30.69 ± 0.70 |
| Graph AE | 7.30 ± 0.51 | 25.92 ± 0.95 | 40.37 ± 1.11 | 2.81 ± 0.29 | 7.97 ± 0.47 | 14.24 ± 0.67 | 6.32 ± 0.39 | 15.54 ± 0.66 | 20.94 ± 0.72 |
| Graph VAE | 10.01 ± 0.52 | 34.00 ± 1.06 | 49.72 ± 1.14 | 3.53 ± 0.27 | 11.68 ± 0.52 | 19.46 ± 0.70 | 10.09 ± 0.58 | 21.37 ± 0.73 | 27.31 ± 0.75 |
| Sour.-Targ. Graph AE | 12.21 ± 1.30 | 39.52 ± 3.53 | 56.25 ± 3.57 | 4.62 ± 0.81 | 14.67 ± 2.33 | 23.60 ± 2.85 | 12.42 ± 1.39 | 25.45 ± 3.37 | 31.80 ± 3.38 |
| Sour.-Targ. Graph VAE | 13.52 ± 0.64 | 42.68 ± 0.69 | 59.51 ± 0.76 | 5.19 ± 0.31 | 16.07 ± 0.40 | 25.48 ± 0.55 | 13.60 ± 0.73 | 27.81 ± 0.56 | 34.19 ± 0.59 |
| Gravity Graph AE | 18.33 ± 0.45 | 52.26 ± 0.90 | 67.85 ± 0.98 | 6.64 ± 0.25 | 21.19 ± 0.55 | 30.67 ± 0.68 | 18.64 ± 0.47 | 35.77 ± 0.66 | 41.42 ± 0.68 |
| Gravity Graph VAE | 16.59 ± 0.50 | 49.51 ± 0.78 | 65.70 ± 0.75 | 5.66 ± 0.35 | 19.07 ± 0.57 | 28.66 ± 0.59 | 16.74 ± 0.55 | 33.34 ± 0.66 | 39.29 ± 0.64 |
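The Recall@K, MAP@K and NDCG@K scores above can be computed as in the following generic sketch, written from the standard binary-relevance definitions of these metrics rather than from the paper's evaluation code; list contents and names are illustrative:

```python
import numpy as np

def recall_at_k(ranked, relevant, k):
    """Fraction of ground-truth items retrieved in the top-k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def ap_at_k(ranked, relevant, k):
    """Average precision at k (here normalized by min(|relevant|, k);
    conventions for the denominator vary across papers)."""
    hits, score = 0, 0.0
    for i, item in enumerate(ranked[:k]):
        if item in relevant:
            hits += 1
            score += hits / (i + 1)
    return score / min(len(relevant), k)

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the list divided by the ideal DCG."""
    dcg = sum(1.0 / np.log2(i + 2) for i, it in enumerate(ranked[:k]) if it in relevant)
    idcg = sum(1.0 / np.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / idcg

ranked = ["a", "b", "c", "d"]  # a model's similar-items ranking
relevant = {"a", "c"}          # ground-truth "Fans Also Like" items
print(recall_at_k(ranked, relevant, 2))  # 0.5: only "a" is retrieved in the top-2
```

MAP@K would then average `ap_at_k` over all test artists, and likewise for the other two metrics.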
Figure 2: Visualization of some music artists from gravity-inspired graph AE. Nodes are scaled using masses m̃i, and node separation is based on distances in the embedding, using multidimensional scaling and the force-directed layout of [15]. Red nodes correspond to Jamaican reggae artists, appearing in the same neighborhood.

Figure 3: Performances of gravity-inspired graph AE, when only trained from subsets of artist-level attributes, among countries, music genres and moods. [Bar chart reporting Recall@200, MAP@200 and NDCG@200 scores (in %, roughly 10 to 70) for models trained on: Moods, Countries, Genres, Countries + Moods, Genres + Moods, Genres + Countries, and Genres + Moods + Countries.]
permit providing even more refined recommendations. Also, during training we kept k = 20 edges for each artist, while one could consider varying k at the artist level, i.e. adding more (or fewer) edges, depending on the actual musical relevance of each link. One could also compare different metrics, besides mutual information, when constructing the ground-truth graph. Last, we currently rely on a single GCN forward pass to embed cold artists which, while being fast and simple, might also be limiting. Future studies on more elaborate approaches, e.g. to incrementally update GCN weights when new nodes appear, could also improve our current framework.

4.3.6 Possible improvements (on evaluation). Our evaluation focused on the prediction of ranked lists for cold artists. This permits filling up their "Fans Also Like/Similar Artists" sections, which was our main goal in this paper. On the other hand, future internal investigations could also aim at measuring to which extent the inclusion of new nodes in the embedding space impacts the existing ranked lists for warm artists. Such an additional evaluation, e.g. via an online A/B test on Deezer, could assess which cold artists actually enter these lists, and whether the new recommendations 1) are more diverse, according to some music or culture-based criteria, and 2) improve user engagement on the service.
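The single GCN forward pass mentioned in Section 4.3.5 can be sketched as follows. This is a simplified NumPy rendition of one graph convolutional layer [31], applied to a graph where a cold node carries attributes but no observed edges; the architecture, weights and toy data are illustrative, not the paper's exact model:

```python
import numpy as np

def gcn_layer(A_hat, H, W):
    """One GCN propagation step [31]: H' = ReLU(A_hat @ H @ W),
    where A_hat is a symmetrically normalized adjacency with self-loops."""
    return np.maximum(A_hat @ H @ W, 0.0)

# toy setting: 3 warm nodes plus 1 cold node with no observed edges
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])  # attributes
A = np.zeros((4, 4))
A[0, 1] = A[1, 0] = A[1, 2] = A[2, 1] = 1.0  # edges among warm nodes only
A += np.eye(4)                                # add self-loops
deg = A.sum(1)
A_hat = A / np.sqrt(np.outer(deg, deg))       # symmetric normalization
W = np.array([[1.0, -1.0], [0.5, 1.0]])       # stands in for trained weights

# one forward pass embeds warm and cold nodes together; the edge-less cold
# node (row 3) only aggregates its self-loop, i.e. its own attributes
Z = gcn_layer(A_hat, X, W)
```

Incrementally updating `W` as new nodes arrive, rather than freezing it after training, is exactly the kind of refinement Section 4.3.5 alludes to.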
and 2) improve user engagement on the service.
Cold Start Similar Artists Ranking with Gravity-Inspired Graph Autoencoders RecSys ’21, September 27-October 1, 2021, Amsterdam, Netherlands
Table: …on Deezer and to three node importance measures: in-degree (i.e. sum of edges coming into the node), betweenness centrality [53] and PageRank [44]. Coefficients computed on the training set.

| Node-level measures | Pearson correlation | Spearman correlation |
|---|---|---|
| Popularity Rank | 0.208 | 0.206 |
| Popularity Rank by Country | 0.290 | 0.336 |
| In-degree Centrality | 0.201 | 0.118 |
| Betweenness Centrality | 0.109 | 0.079 |
| PageRank Score | 0.272 | 0.153 |

Figure 4: Popularity rank of the most popular artist recommended to each test artist, among their top-20. Distributions obtained from gravity-inspired graph AE models with varying hyperparameters λ. [Density plot: x-axis = popularity rank, roughly 0 to 14000; y-axis = density, roughly 0 to 0.00015.]
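Coefficients such as those in the table above can be computed as in this self-contained sketch of Pearson and Spearman correlations; the simulated arrays merely stand in for Deezer's estimated masses and node-level measures:

```python
import numpy as np

def pearson(x, y):
    """Pearson linear correlation coefficient."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks
    (this double-argsort ranking assumes no ties)."""
    rank = lambda a: np.argsort(np.argsort(a)).astype(float)
    return pearson(rank(x), rank(y))

rng = np.random.default_rng(0)
masses = rng.normal(size=1000)               # stand-in for masses m̃i
popularity = masses + rng.normal(size=1000)  # toy correlated measure

p, s = pearson(masses, popularity), spearman(masses, popularity)
```

A sizeable gap between the two coefficients, as for PageRank in the table, indicates a relationship that is partly linear but not monotone across the whole range.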
5 CONCLUSION

In this paper, we modeled the challenging cold start similar items ranking problem as a link prediction task, in a directed and attributed graph summarizing information from "Fans Also Like/Similar Artists" features. We presented an effective framework to address this task, transposing recent advances on gravity-inspired graph autoencoders to recommender systems. Backed by in-depth experiments on artists from the global music streaming service Deezer, we emphasized the practical benefits of our approach, in terms of recommendation accuracy, ranking quality, and flexibility. Along with this paper, we publicly release our source code, as well as Deezer data from our experiments. We hope that this release of industrial resources will benefit future research on graph-based cold start recommendation. In particular, we already identified several directions that, in future works, should lead towards the improvement of our approach.

REFERENCES
[1] Mohammad Reza Abbasifard, Bijan Ghahremani, and Hassan Naderi. 2014. A Survey on Nearest Neighbor Search Methods. International Journal of Computer Applications 95, 25 (2014).
[2] Oren Barkan, Noam Koenigstein, Eylon Yogev, and Ori Katz. 2019. CB2CF: A Neural Multiview Content-to-Collaborative Filtering Model for Completely Cold Item Recommendations. In 13th ACM Conference on Recommender Systems. 228–236.
[3] Mikhail Belkin and Partha Niyogi. 2001. Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering. In Advances in Neural Information Processing Systems, Vol. 14. 585–591.
[4] Walid Bendada, Guillaume Salha, and Théo Bontempelli. 2020. Carousel Personalization in Music Streaming Apps with Contextual Bandits. In Fourteenth ACM Conference on Recommender Systems. 420–425.
[5] Rianne van den Berg, Thomas N Kipf, and Max Welling. 2018. Graph Convolutional Matrix Completion. KDD Deep Learning Day (2018).
[6] Léa Briand, Guillaume Salha-Galvan, Walid Bendada, Mathieu Morlon, and Viet-Anh Tran. 2021. A Semi-Personalized System for User Cold Start Recommendation on Music Streaming Apps. In 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD).
[7] Shaosheng Cao, Wei Lu, and Qiongkai Xu. 2015. GraRep: Learning Graph Representations with Global Structural Information. In CIKM. 891–900.
[8] Henry Cavendish. 1798. Experiments to Determine the Density of the Earth. Philosophical Transactions of the Royal Society of London 88, 469–526.
[9] Sam Corbett-Davies and Sharad Goel. 2018. The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning. arXiv preprint arXiv:1808.00023 (2018).
[10] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep Neural Networks for Youtube Recommendations. In 10th ACM Conf. on Rec. Sys. 191–198.
[11] Rémi Delbouys, Romain Hennequin, Francesco Piccoli, Jimena Royo-Letelier, and Manuel Moussallam. 2018. Music Mood Detection based on Audio and Lyrics with Deep Neural Net. In 19th International Society for Music Information Retrieval Conference. 370–375.
[12] Carl Doersch. 2016. Tutorial on Variational Autoencoders. arXiv preprint arXiv:1606.05908 (2016).
[13] Silvia Donker. 2019. Networking data. A network analysis of Spotify's socio-technical related artist network. IJMBR 8, 1 (2019), 67–101.
[14] Elena V. Epure, Guillaume Salha, Manuel Moussallam, and Romain Hennequin. 2020. Modeling the Music Genre Perception across Language-Bound Cultures. In 2020 Conference on Empirical Methods in Natural Language Processing. 4765–4779.
[15] Thomas Fruchterman and Edward M Reingold. 1991. Graph drawing by force-directed placement. Software: Practice and exp. 21, 11 (1991), 1129–1164.
[16] I. Goodfellow, Y. Bengio, and A. Courville. 2016. Deep Learning. MIT Press.
[17] Aditya Grover, Aaron Zweig, and Stefano Ermon. 2019. Graphite: Iterative Generative Modeling of Graphs. In Int. Conf. on Mach. Learn. 2434–2444.
[18] William L Hamilton. 2020. Graph Representation Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 14, 3 (2020), 1–159.
[19] William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Representation Learning on Graphs: Methods and Applications. IEEE Data Eng. Bul. (2017).
[20] Yu Hao, Xin Cao, Yixiang Fang, Xike Xie, and Sibo Wang. 2020. Inductive Link Prediction for Nodes Having Only Attribute Information. In 29th International Joint Conference on Artificial Intelligence. 1209–1215.
[21] Arman Hasanzadeh, Ehsan Hajiramezanali, Krishna Narayanan, Nick Duffield, Mingyuan Zhou, and Xiaoning Qian. 2019. Semi-Implicit Graph Variational Auto-Encoders. In Advances in Neural Information Processing Systems. 10712–10727.
[22] Ruining He and Julian McAuley. 2016. VBPR: Visual Bayesian Personalized Ranking from Implicit Feedback. In 30th AAAI Conf. on Art. Int. 144–150.
[23] Cheng-Kang Hsieh, Longqi Yang, Yin Cui, Tsung-Yi Lin, Serge Belongie, and Deborah Estrin. 2017. Collaborative Metric Learning. In WWW.
[24] Moyuan Huang, Wenge Rong, Tom Arjannikov, Nan Jiang, and Zhang Xiong. 2016. Bi-Modal Deep Boltzmann Machine Based Musical Emotion Classification. In 25th International Conference on Artificial Neural Networks. Springer, 199–207.
[25] Gourav Jain, Tripti Mahara, and Kuldeep Narayan Tripathi. 2020. A survey of similarity measures for collaborative filtering-based recommender system. In Soft Computing: Theories and Applications. Springer, 343–352.
[26] Maura Johnston. 2019. How "Fans Also Like" Works. Blog post on "Spotify for Artists": https://artists.spotify.com/blog/how-fans-also-like-works (2019).
[27] Alexandros Karatzoglou, Linas Baltrunas, and Yue Shi. 2013. Learning to Rank for Recommender Systems. In ACM Conference on Rec. Sys. 493–494.
[28] Diederik P Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations.
[29] Diederik P. Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations.
[30] Thomas Kipf and Max Welling. 2016. Variational Graph Auto-Encoders. In NeurIPS Workshop on Bayesian Deep Learning.
[31] Thomas Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In 5th Int. Conf. on Learning Representations.
[32] Yngvar Kjus. 2016. Musical Exploration via Streaming services: The Norwegian Experience. Popular Communication 14, 3 (2016), 127–136.
[33] Yehuda Koren and Robert Bell. 2015. Advances in Collaborative Filtering. Recommender Systems Handbook (2015), 77–118.
[34] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix Factorization Techniques for Recommender Systems. Computer 42, 8 (2009), 30–37.
[35] Hoyeop Lee, Jinbae Im, Seongwon Jang, Hyunsouk Cho, and Sehee Chung. 2019. MeLU: Meta-Learned User Preference Estimator for Cold-Start Recommendation. In 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1073–1082.
[36] Joonseok Lee, Sami Abu-El-Haija, Balakrishnan Varadarajan, and Apostol Natsev. 2018. Collaborative Deep Metric Learning for Video Understanding. In 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 481–490.
[37] Xiaopeng Li and James She. 2017. Collaborative Variational Autoencoder for Recommender Systems. In 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 305–314.
[38] David Liben-Nowell and Jon Kleinberg. 2007. The Link Prediction Problem for Social Networks. Journal of the American Society for Information Science and Technology 58, 7 (2007), 1019–1031.
[39] Qi Liu, Miltiadis Allamanis, Marc Brockschmidt, and Alexander Gaunt. 2018. Constrained Graph Variational Autoencoders for Molecule Design. In Advances in Neural Information Processing Systems. 7806–7815.
[40] Linyuan Lu and Tao Zhou. 2011. Link Prediction in Complex Networks: A Survey. Physica A: Stat. Mech. and its Applications 390, 6 (2011), 1150–1170.
[41] Tengfei Ma, Jie Chen, and Cao Xiao. 2018. Constrained Generation of Semantically Valid Graphs via Regularizing Variational Autoencoders. In Advances in Neural Information Processing Systems. 7113–7124.
[42] Rishabh Mehrotra, Mounia Lalmas, Doug Kenney, Thomas Lim-Meng, and Golli Hashemian. 2019. Jointly Leveraging Intent and Interaction Signals to Predict User Satisfaction with Slate Recommendations. In The World Wide Web Conference. 1256–1267.
[43] Isaac Newton. 1687. Philosophiae Naturalis Principia Mathematica.
[44] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The PageRank Citation Ranking: Bringing Order to the Web. Stanford InfoLab (1999).
[45] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: Online Learning of Social Representations. In KDD. 701–710.
[46] Tieyun Qian, Yile Liang, and Qing Li. 2019. Solving Cold Start Problem in Recommendation with Attribute Graph Neural Networks. arXiv preprint arXiv:1912.12398 (2019).
[47] Tieyun Qian, Yile Liang, Qing Li, and Hui Xiong. 2020. Attribute Graph Neural Networks for Strict Cold Start Recommendation. IEEE Transactions on Knowledge and Data Engineering (2020), 1–1.
[48] Dimitrios Rafailidis and Fabio Crestani. 2017. Learning to Rank with Trust and Distrust in Recommender Systems. In ACM Conf. on Rec. Sys. 5–13.
[49] James A Russell. 1980. A Circumplex Model of Affect. Journal of Personality and Social Psychology 39, 6 (1980), 1161.
[50] Guillaume Salha, Romain Hennequin, Jean-Baptiste Remy, Manuel Moussallam, and Michalis Vazirgiannis. 2021. FastGAE: Scalable Graph Autoencoders with Stochastic Subgraph Decoding. Neural Networks 142 (2021), 1–19.
[51] Guillaume Salha, Romain Hennequin, Viet Anh Tran, and Michalis Vazirgiannis. 2019. A Degeneracy Framework for Scalable Graph Autoencoders. In 28th International Joint Conference on Artificial Intelligence. 3353–3359.
[52] Guillaume Salha, Romain Hennequin, and Michalis Vazirgiannis. 2020. Simple and Effective Graph Autoencoders with One-Hop Linear Models. In 2020 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases.
[53] Guillaume Salha, Stratis Limnios, Romain Hennequin, Viet Anh Tran, and Michalis Vazirgiannis. 2019. Gravity-Inspired Graph Autoencoders for Directed Link Prediction. In 28th ACM International Conference on Information and Knowledge Management. 589–598.
[54] Daniel Schall. 2014. Link Prediction in Directed Social Networks. Social Network Analysis and Mining 4, 1 (2014), 157.
[55] Markus Schedl, Hamed Zamani, Ching-Wei Chen, Yashar Deldjoo, and Mehdi Elahi. 2018. Current Challenges and Visions in Music Recommender Systems Research. Int. Journal of Multimedia Information Retrieval 7, 2 (2018), 95–116.
[56] Hadi Shakibian and Nasrollah Charkari. 2017. Mutual Information Model for Link Prediction in Heterogeneous Complex Networks. Scientific Reports 7, 1 (2017), 1–16.
[57] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
[58] Fei Tian, Bin Gao, Qing Cui, Enhong Chen, and Tie-Yan Liu. 2014. Learning Deep Representations for Graph Clustering. In AAAI. 1293–1299.
[59] Aäron Van Den Oord, Sander Dieleman, and Benjamin Schrauwen. 2013. Deep Content-Based Music Recommendation. In Advances in Neural Information Processing Systems.
[60] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph Attention Networks. In 6th International Conference on Learning Representations.
[61] Maksims Volkovs, Guang Wei Yu, and Tomi Poutanen. 2017. DropoutNet: Addressing Cold Start in Recommender Systems. In Advances in Neural Information Processing Systems. 4957–4966.
[62] Daixin Wang, Peng Cui, and Wenwu Zhu. 2016. Structural Deep Network Embedding. In KDD. 1225–1234.
[63] Jizhe Wang, Pipei Huang, Huan Zhao, Zhibo Zhang, Binqiang Zhao, and Dik Lun Lee. 2018. Billion-Scale Commodity Embedding for E-commerce Recommendation in Alibaba. In 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 839–848.
[64] Shoujin Wang, Liang Hu, Yan Wang, Xiangnan He, Quan Z Sheng, Mehmet Orgun, Longbing Cao, Francesco Ricci, and Philip S Yu. 2021. Graph Learning Based Recommender Systems: A Review. In 30th International Joint Conference on Artificial Intelligence.
[65] Yining Wang, Liwei Wang, Yuanzhi Li, Di He, Wei Chen, and Tie-Yan Liu. 2013. A Theoretical Analysis of NDCG Ranking Measures. In 26th Annual Conference on Learning Theory, Vol. 8. 6.
[66] Shiwen Wu, Fei Sun, Wentao Zhang, and Bin Cui. 2020. Graph Neural Networks in Recommender Systems: A Survey. arXiv preprint arXiv:2011.02260 (2020).
[67] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu. 2021. A Comprehensive Survey on Graph Neural Networks. IEEE Transactions on Neural Networks and Learning Systems 32, 1 (2021).
[68] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. 2018. Graph Convolutional Neural Networks for Web-Scale Recommender Systems. In 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 974–983.
[69] Jiani Zhang, Xingjian Shi, Shenglin Zhao, and Irwin King. 2019. STAR-GCN: Stacked and Reconstructed Graph Convolutional Networks for Recommender Systems. In 28th International Joint Conference on Artificial Intelligence. 4264–4270.