ItemSage: Learning Product Embeddings for Shopping Recommendations at Pinterest
Paul Baltescu (pauldb@pinterest.com), Haoyu Chen (hchen@pinterest.com), Nikil Pancha (npancha@pinterest.com)
Pinterest, San Francisco, USA
and concatenation idea, is the most suitable for our use case given our modalities.

We should note that although multi-modal representation learning has been studied in various areas, few works [4, 33] have successfully applied it to large-scale recommendation systems. To our knowledge, this work is the first one that uses the Transformer architecture to aggregate image and text modalities to learn product representations for production-scale recommendation systems.

Table 1: Training data volume.

Surface   Engagement                 No. Examples
Closeup   Clicks + Saves             93.3M
Closeup   Checkouts + Add-to-Cart    3.7M
Search    Clicks + Saves             99.4M
Search    Checkouts + Add-to-Cart    3.5M
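To make the idea concrete, the sketch below aggregates a sequence of image and text feature embeddings with a small transformer encoder and reads the product embedding off a learned global token. The dimensions, layer count, and pooling scheme are our own choices for illustration and are not the exact ItemSage configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiModalAggregator(nn.Module):
        # Illustrative sketch only: sizes, layer count and pooling are our
        # own assumptions, not the published ItemSage architecture.
        def __init__(self, image_dim=256, text_dim=64, hidden_dim=256, out_dim=256):
            super().__init__()
            self.image_proj = nn.Linear(image_dim, hidden_dim)
            self.text_proj = nn.Linear(text_dim, hidden_dim)
            self.global_token = nn.Parameter(torch.randn(1, 1, hidden_dim))
            layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=1)
            self.head = nn.Linear(hidden_dim, out_dim)

        def forward(self, image_feats, text_feats):
            # image_feats: (B, n_img, image_dim); text_feats: (B, n_txt, text_dim)
            tokens = torch.cat(
                [self.image_proj(image_feats), self.text_proj(text_feats)], dim=1)
            tokens = torch.cat(
                [self.global_token.expand(tokens.size(0), -1, -1), tokens], dim=1)
            encoded = self.encoder(tokens)
            # The learned global token summarizes all modalities; L2-normalize it
            # so that inner products behave like cosine similarities.
            return F.normalize(self.head(encoded[:, 0]), dim=-1)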
Computing the normalization term of the full softmax loss involves a summation over the entire catalog. To make the loss computation tractable, a common technique is to approximate the normalization term by treating all the other positive examples from the same training batch as negatives and ignoring all the remaining products in the catalog. This approach is very efficient, as no additional product embeddings need to be generated to compute the loss. However, naively replacing the whole catalog $\mathcal{C}$ with $\mathcal{B}$ introduces a sampling bias into the full softmax. We address this issue by applying the logQ correction [32], which updates the logits $\langle q_{x_i}, p_y \rangle$ to $\langle q_{x_i}, p_y \rangle - \log Q_p(y \mid x_i)$, where $Q_p(y \mid x_i)$ is the probability of $y$ being included as a positive example in the training batch. The loss becomes:
L_{S_{pos}} = -\frac{1}{|\mathcal{B}|} \sum_{i=1}^{|\mathcal{B}|} \log \frac{e^{\langle q_{x_i},\, p_{y_i} \rangle - \log Q_p(y_i \mid x_i)}}{\sum_{y \in \mathcal{B}} e^{\langle q_{x_i},\, p_y \rangle - \log Q_p(y \mid x_i)}},   (2)
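As a minimal illustration (not Pinterest's training code), the corrected in-batch softmax of Equation (2) can be sketched in NumPy as follows; the function and argument names are ours, q_p would come from the streaming frequency estimates described next, and the de-duplication mask shown in Figure 4 is omitted for brevity.

    import numpy as np
    from scipy.special import logsumexp

    def in_batch_sampled_softmax(query_emb, pos_emb, q_p):
        # query_emb: (B, d) query embeddings q_{x_i}
        # pos_emb:   (B, d) embeddings of each example's positive product p_{y_i}
        # q_p:       (B,)   estimated probability that each positive appears in a batch
        logits = query_emb @ pos_emb.T              # (B, B): <q_{x_i}, p_{y_j}>
        logits = logits - np.log(q_p)[None, :]      # logQ correction per column y_j
        log_probs = logits - logsumexp(logits, axis=1, keepdims=True)
        return -np.mean(np.diag(log_probs))         # row i's positive is column i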
We estimate the probabilities $Q_p(y \mid x_i)$ in a streaming fashion using a count-min sketch that tracks the frequencies with which entities appear in the training data. The count-min sketch [3] is a probabilistic data structure for tracking the frequency of events in streams of data using sub-linear memory.
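A count-min sketch fits in a few lines; the illustrative class below (sizing and hashing choices are ours, not Pinterest's implementation) estimates a product's probability of appearing in the training stream as its approximate frequency divided by the number of examples seen.

    import numpy as np

    class CountMinSketch:
        # Minimal streaming frequency counter for illustration only.
        def __init__(self, width=2**16, depth=4, seed=0):
            self.counts = np.zeros((depth, width), dtype=np.int64)
            self.salts = np.random.default_rng(seed).integers(1, 2**31 - 1, size=depth)
            self.total = 0

        def _buckets(self, key):
            return [hash((int(salt), key)) % self.counts.shape[1] for salt in self.salts]

        def add(self, key):
            self.total += 1
            for row, col in enumerate(self._buckets(key)):
                self.counts[row, col] += 1

        def estimate_prob(self, key):
            # Each row over-counts because of hash collisions, so take the minimum.
            freq = min(self.counts[row, col] for row, col in enumerate(self._buckets(key)))
            return freq / max(self.total, 1)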
One problem we have experienced with using in-batch positives as negatives is that unengaged products in the catalog will never appear as negative labels. This treatment unfairly penalizes popular products, as they are ever more likely to be selected as negatives. To counter this effect, we adopt a mixed negative sampling approach inspired by [31]. For each training batch, we further select a set of random negatives $\mathcal{N}$ (where $|\mathcal{N}| = |\mathcal{B}|$) based on which we compute a second loss term:
L_{S_{neg}} = -\frac{1}{|\mathcal{N}|} \sum_{i=1}^{|\mathcal{N}|} \log \frac{e^{\langle q_{x_i},\, p_{y_i} \rangle - \log Q_n(y_i)}}{\sum_{y \in \mathcal{N}} e^{\langle q_{x_i},\, p_y \rangle - \log Q_n(y)}},   (3)
where $Q_n(y)$ represents the probability of randomly sampling product $y$. The loss term $L_{S_{neg}}$ helps reduce the negative contribution that popular products receive when used as negative labels. The main distinction between our approach and [31] is that we optimize for $L_{S_{pos}} + L_{S_{neg}}$, while [31] uses both $\mathcal{B}$ and $\mathcal{N}$ to calculate the normalization term and minimizes
L_{S_{mixed}} = -\frac{1}{|\mathcal{B}|} \sum_{i=1}^{|\mathcal{B}|} \log \frac{e^{\langle q_{x_i},\, p_{y_i} \rangle - \log Q_p(y_i \mid x_i)}}{Z},

Z = \sum_{y \in \mathcal{B}} e^{\langle q_{x_i},\, p_y \rangle - \log Q_p(y \mid x_i)} + \sum_{y \in \mathcal{N}} e^{\langle q_{x_i},\, p_y \rangle - \log Q_n(y)}.   (4)
Section 4.1.4 shows that we obtain better results separating the loss terms of the two negative sampling approaches.
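To make the distinction concrete, here is a small NumPy sketch contrasting the two objectives; the array names are ours and the logits are assumed to already include the appropriate logQ corrections.

    import numpy as np
    from scipy.special import logsumexp

    def separate_losses(pos_qp, pos_qn, in_batch, random_neg):
        # Our formulation (Eqs. 2 + 3): two softmaxes with separate normalizations.
        # pos_qp: (B,)   <q_i, p_{y_i}> - log Q_p(y_i | x_i)
        # pos_qn: (B,)   <q_i, p_{y_i}> - log Q_n(y_i)
        # in_batch: (B, B)   corrected logits against all in-batch positives
        # random_neg: (B, N) corrected logits against the random negatives
        l_pos = -np.mean(pos_qp - logsumexp(in_batch, axis=1))
        l_neg = -np.mean(pos_qn - logsumexp(random_neg, axis=1))
        return l_pos + l_neg

    def mixed_loss(pos_qp, in_batch, random_neg):
        # The formulation from [31] (Eq. 4): one normalization over both pools.
        z = logsumexp(np.concatenate([in_batch, random_neg], axis=1), axis=1)
        return -np.mean(pos_qp - z)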
3.5 Multi-task Learning
We implement multi-task learning by combining positive labels from all the different tasks into the same training batch (Figure 4). This technique is effective even when the query entities come from different modalities, the only difference being that the query embedding needs to be inferred with a different pretrained model.

[Figure 4: The construction of a training batch. The white squares on the diagonal indicate a mask applied to prevent using an example's positive label also as a negative. Squares of different shades denote different queries and products.]

In a training batch of size $|\mathcal{B}|$, we allocate $T_k$ positive examples for each task of interest $k \in \{1, \cdots, K\}$ such that $|\mathcal{B}| = \sum_{k=1}^{K} T_k$. Therefore, tuning the values $T_k$ is an effective way to create trade-offs between different tasks. When introducing new tasks, we find that setting $T_k = |\mathcal{B}|/K$ can be a good starting point to achieve significant improvements on the new task without hurting the performance of other tasks. We believe the lack of negative impact on existing tasks can be attributed to the correlation between tasks. For example, to purchase a product users are likely to click on it first, thus adding the click task will not hurt the performance of the purchase task.
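A minimal sketch of this allocation scheme, assuming the training examples are already grouped per task; the function name and the remainder handling are our own choices for illustration.

    import random

    def build_multitask_batch(examples_by_task, batch_size):
        # Allocate T_k = |B| / K positives per task (assumes each task has
        # at least batch_size // K examples available).
        tasks = list(examples_by_task)
        per_task = batch_size // len(tasks)
        batch = []
        for task in tasks:
            batch.extend(random.sample(examples_by_task[task], per_task))
        while len(batch) < batch_size:    # top up if |B| is not divisible by K
            task = random.choice(tasks)
            batch.append(random.choice(examples_by_task[task]))
        random.shuffle(batch)
        return batch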
3.6 Inference and Serving
Figure 5 illustrates how ItemSage embeddings are deployed to power Pinterest shopping recommendations. The embeddings are inferred in a batch workflow, which ensures that the model inference latency does not impact the end-to-end latency of the vertical recommendation systems. The inference workflow runs daily to update the embeddings based on the latest features and to extend the coverage to newly ingested products. Each new set of embeddings is used to create an HNSW index [19] for candidate generation and is also pushed to the online feature store for ranking applications. The HNSW index and the feature set are reused by all of the different vertical systems for Home, Closeup and Search recommendations. The Home and Closeup surfaces use precomputed PinSage embeddings fetched from the feature store as queries, while in Search, the query embeddings are inferred on the fly to support the long tail of previously unseen queries. The main thing to note is the simplicity of this design enabled by multi-task learning. By creating a single set of embeddings for all 3 vertical applications, we can use a single HNSW index and a single feature set to serve all of them.
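As a rough illustration of the offline "embed, index, serve" flow, the sketch below builds an approximate nearest neighbor index with the open-source hnswlib package; the parameters and the use of integer product ids are assumptions for the example, not Pinterest's production setup.

    import hnswlib
    import numpy as np

    def build_product_index(product_ids, embeddings, dim=256):
        # product_ids: integer ids; embeddings: (N, dim) float array.
        index = hnswlib.Index(space="ip", dim=dim)      # inner-product similarity
        index.init_index(max_elements=len(product_ids), ef_construction=200, M=16)
        index.add_items(np.asarray(embeddings, dtype=np.float32), product_ids)
        index.set_ef(100)                               # recall/latency trade-off at query time
        return index

    # Candidate generation for one query embedding (e.g. a PinSage pin embedding
    # on Home/Closeup, or a SearchSage query embedding on Search):
    # ids, dists = index.knn_query(query_embedding.reshape(1, -1), k=100)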
the product's images. The text-only model is trained based on the text attributes from the shopping catalog (title, description, etc.). We observe that the model trained on features from both modalities has significantly better performance than both baselines on all 8 tasks. Also, the image-only model has significantly stronger performance than the text-only model, which could be attributed to the additional information summarized into the image embeddings: (1) PinSage is a graph neural network (GNN) aggregating information from the Pinterest pin-board graph in addition to the image itself, and (2) the learnings presented in this paper regarding multi-modal learning have also been applied within PinSage to the textual metadata available with each image. As one might expect, the models trained on a single feature modality have stronger performance when the query comes from the same modality, i.e., the image-only model shows better performance on Closeup tasks, while the text-only model performs better on Search tasks.

Inspired by PinSage and related work on GNNs [9, 10], we conducted an experiment augmenting ItemSage with information obtained from the Pinterest pin-board graph [6]. Products can naturally be embedded into this graph by creating edges between a product and its own images. For each product, we performed random walks starting from its corresponding node (reusing the same configuration as PinSage [33]) and kept the 50 most frequently occurring image neighbors that are different from the product's own images. The images are mapped to their corresponding PinSage embeddings, which are appended to the sequence of features provided as input to the transformer. This extension to ItemSage showed neutral results (Table 3, Row "Feature") and increased training cost. We attribute this result to the fact that the PinSage embeddings of the product's own images already aggregate the graph information, making the explicit extension redundant.

4.1.4 Loss Ablation. In Section 3.4, we introduce our approach for sampling negative labels, which consists of two sources: (1) other positive labels from the same batch as the current example and (2) randomly sampled negative labels from the entire catalog. In this section, we compare how our model performs if the negative labels are selected from only one of these sources. The results are presented in the Row "Negative Sampling" of Table 3. We observe a steep drop in recall if only one source of negatives is used. Moreover, if we only select random negatives, the model converges to a degenerate solution (and thus we need to apply early stopping) where a few products become very popular and appear in the top 10 results of more than 10%-15% of the queries in the evaluation set, depending on the task.

We also compare our mixed negative sampling approach with the approach from [31], and observe that our approach, which introduces separate loss terms for in-batch positives and random negatives, provides an improvement of at least 3.5% on every task.

4.1.5 Task Ablation. In this section, we evaluate the effectiveness of the multi-task learning regime used to train ItemSage.

We first evaluate the recall of ItemSage embeddings against two baseline models trained on vertical-specific tasks: "Closeup" and "Search" (Table 3, Row "Surface"). Each baseline uses the same model architecture and represents the performance we would expect if we deployed separate embeddings per vertical. We find it encouraging that the model optimized for both verticals performs better on 6 out of 8 tasks. This suggests that applying multi-task learning leads to improved performance even across product verticals and, in the cases where the performance degrades, the degradation is not substantial compared to a vertical-specific model. This is a remarkable result considering that the PinSage and SearchSage embeddings are not compatible with one another. We also find it interesting that the largest improvements are seen on shopping-specific objectives (add-to-cart actions and checkouts), these actions being 30 times more sparse than clicks and saves, suggesting that the cross-vertical setup helps the model extract more information about which products are more likely to be purchased by users.

The next experiment focuses on the impact of mixing regular engagement tasks (clicks and saves) together with shopping engagement tasks (add-to-cart actions and checkouts). The results are summarized in the Row "Engagement Type" of Table 3. As expected, regardless of the task, we see an improvement in recall whenever we optimize the model on that specific task. Furthermore, we observe substantial gains in add-to-cart actions and checkouts compared to a model specialized at modeling just shopping engagement, and minimal losses on clicks and saves tasks compared to the corresponding specialized model. We believe the substantial gains (3.7%-8.7%) of the joint model on add-to-cart actions and checkouts can be explained by the significant difference in training data volume between shopping and regular engagement: the additional click and save labels help the embeddings converge towards a more robust representation.
Table 3: Ablation study for ItemSage models. Relative differences from the ItemSage model are shown in parentheses.

                                      Closeup                                    Search
                         Clicks    Saves     Add Cart  Checkouts    Clicks    Saves     Add Cart  Checkouts
ItemSage                 0.816     0.812     0.897     0.916        0.749     0.762     0.842     0.869

Feature
  Image Only             0.795     0.787     0.882     0.908        0.670     0.698     0.798     0.830
                         (-2.6%)   (-3.1%)   (-1.7%)   (-0.9%)      (-10.5%)  (-8.4%)   (-5.2%)   (-4.5%)
  Text Only              0.683     0.658     0.832     0.859        0.669     0.665     0.790     0.820
                         (-16.3%)  (-19.0%)  (-7.2%)   (-6.2%)      (-10.7%)  (-12.7%)  (-6.2%)   (-5.6%)
  Image + Text + Graph   0.814     0.812     0.893     0.905        0.743     0.767     0.842     0.860
                         (-0.2%)   (0.0%)    (-0.4%)   (-1.2%)      (-0.8%)   (+0.7%)   (0.0%)    (-1.0%)

Negative Sampling
  L_S_pos Only           0.597     0.602     0.717     0.772        0.553     0.544     0.662     0.724
                         (-26.8%)  (-25.9%)  (-20.1%)  (-15.7%)     (-26.2%)  (-28.6%)  (-21.4%)  (-16.7%)
  L_S_neg Only           0.774     0.768     0.868     0.897        0.655     0.670     0.804     0.840
                         (-5.1%)   (-5.2%)   (-3.2%)   (-2.1%)      (-12.6%)  (-12.1%)  (-4.5%)   (-3.3%)
  L_S_mixed              0.781     0.774     0.860     0.884        0.687     0.706     0.809     0.838
                         (-4.3%)   (-4.7%)   (-4.1%)   (-3.5%)      (-8.3%)   (-7.3%)   (-3.9%)   (-3.6%)

Surface
  Closeup                0.815     0.811     0.891     0.909        -         -         -         -
                         (-0.1%)   (-0.1%)   (-0.7%)   (-0.8%)
  Search                 -         -         -         -            0.760     0.766     0.830     0.861
                                                                    (+1.5%)   (+0.5%)   (-1.4%)   (-0.9%)

Engagement Type
  Clicks + Saves         0.819     0.812     0.869     0.894        0.755     0.768     0.689     0.765
                         (+0.4%)   (0.0%)    (-3.1%)   (-2.4%)      (+0.8%)   (+0.8%)   (-18.2%)  (-12.0%)
  Add Cart + Checkouts   0.503     0.503     0.850     0.882        0.382     0.392     0.768     0.793
                         (-38.4%)  (-38.1%)  (-5.2%)   (-3.7%)      (-49.0%)  (-48.6%)  (-8.8%)   (-8.7%)
4.2 Online Experiments
We report results from A/B experiments applying ItemSage in each of the main surfaces at Pinterest: Home, Closeup and Search. In these experiments, ItemSage was used to generate candidates for upstream recommendation systems via approximate nearest neighbor retrieval [19]. As shown in Section 4.1.5, ItemSage is directly applicable to Search and Closeup recommendations; to extend it to Home recommendations, we cluster the user activity history and sample pins from several clusters to reduce the problem to the same setting as producing Closeup recommendations [22]. The embeddings used in the control group are generated with the baselines introduced in Section 4.1.2: the Sum baseline is used for Home and Closeup recommendations and the Sum-MLP baseline is used in Search. The results are summarized in Table 4. We observed a strong impact on both engagement and shopping-specific key business indicators when deploying ItemSage embeddings for product recommendations on all surfaces.

Table 4: Results of A/B experiments. The three columns show the relative difference between ItemSage and the baselines in number of total clicks, average checkouts per user, and average Gross Merchandise Value (GMV) per user.

Surface   Clicks   Purchases   GMV
Home      11.34%   2.61%       2.94%
Closeup   9.97%    5.13%       6.85%
Search    2.3%     1.5%        3.7%

5 CONCLUSION
In this work, we presented our end-to-end approach for learning ItemSage, the product embeddings for shopping recommendations at Pinterest.

In contrast to other work focused on representation learning for e-commerce applications, our embeddings are able to extract information from both text and image features. Visual features are particularly effective for platforms with a strong visual component like Pinterest, while modeling text features leads to improved relevance, especially in search results.

Furthermore, we describe a procedure to make our embeddings compatible with the embeddings of other entities in the Pinterest ecosystem at the same time. Our approach enables us to deploy a single embedding version to power applications with different inputs: sequences of pins (images) for user activity based recommendations in the Home surface, single images for image based recommendations in the Closeup surface, and text queries to power search results. This leads to a 3X reduction in infrastructure costs, as we do not need to infer separate embeddings per vertical. From our experience, embedding version upgrades are a slow process, with consumers taking up to a year to completely adopt a new version before an old one can be fully deprecated. A unified embedding for all applications means less maintenance throughout this period and a more efficient, consolidated process for upgrades and deprecation.