
High Quality Related Search Query Suggestions using Deep Reinforcement Learning
Praveen Kumar Bodigutla
pbodigutla@linkedin.com
LinkedIn
Sunnyvale, California, USA

arXiv:2108.04452v1 [cs.IR] 10 Aug 2021

Abstract

The "High Quality Related Search Query Suggestions" task aims at recommending search queries which are real, accurate, diverse, relevant and engaging. Obtaining large amounts of query-quality human annotations is expensive. Prior work on supervised query suggestion models suffered from selection and exposure bias, and relied on sparse and noisy immediate user feedback (e.g., clicks), leading to low quality suggestions. Reinforcement Learning techniques employed to reformulate a query using terms from search results have limited scalability to large-scale industry applications. To recommend high quality related search queries, we train a Deep Reinforcement Learning model to predict the query a user would enter next. The reward signal is composed of long-term session-based user feedback, syntactic relatedness and the estimated naturalness of the generated query. Over the baseline supervised model, our proposed approach achieves a significant relative improvement in terms of recommendation diversity (3%), down-stream user-engagement (4.2%) and per-sentence word repetitions (82%).

CCS Concepts

• Theory of computation → Reinforcement learning; • Information systems → Query suggestion.

Keywords

query suggestions, deep reinforcement learning, text generation

1 Introduction

Related search query suggestion aims at suggesting queries that are related to the user's most recent search query. For example, the search query "machine learning jobs" is related to "machine learning". These suggestions are important for improving the usability of search engines [7]. We define high-quality related search query suggestions as query recommendations which are natural (i.e., entered by a real user), diverse, relevant, error-free and engaging.

Sequence-to-Sequence (Seq2Seq) encoder-decoder architectures are widely used to generate query suggestions [14, 23, 30, 33]. Supervised autoregressive generative models trained with ground-truth labels suffer from exposure bias [24, 33]. Maximum Likelihood Estimation (MLE) training generates repetitive sequences [19]. Machine learning approaches [14, 20] that are trained on immediate user feedback, such as clicks on recommended queries, are prone to selection bias [29]. Reinforcement Learning techniques were proposed to address these limitations [30].

"Local" Deep Reinforcement Learning (DRL) frameworks for query reformulation [23, 30] adjust a query relative to the initial search results. Reward signals used in these approaches were computed from mined search-result documents. Processing search results to reformulate the query and mining the entire collection of documents is not practical in large-scale real-world applications, due to rapidly changing and ever-growing user-generated content. Hence, we employ "global" DRL approaches [23], which are independent of the set of documents returned by the original query and instead depend on the actual queries entered by users and the feedback they provide. Furthermore, we choose the more useful approach of predicting the queries a user will enter next, over simply reformulating the current query [7]. We mine search sessions and the corresponding co-occurring query pairs from LinkedIn search engine logs. Unlike general-purpose web search, our work is focused on domain-specific search queries [28], which users enter to search for job postings, people profiles, user-groups, company pages and industry news.

DRL text-generation approaches such as policy-gradient based Sequential Generative Adversarial Networks (SeqGAN) [33] achieved excellent performance on generating creative text sequences. Most recently, on the summary generation task, fine-tuning a pre-trained supervised model with the Proximal Policy Optimization (PPO) [26] method outperformed a supervised GPT-3 [6] model 10x its size [27].

Motivated by the performance improvements achieved by DRL techniques on text-generation problems, we solve the high quality query suggestions problem by modeling the query generator as a stochastic parametrized policy. Specifically, we fine-tune the state-of-the-practice Seq2Seq Neural Machine Translation (NMT) model [3] for query generation [14], using the policy-gradient REINFORCE [31] algorithm. Seq2Seq NMT models are widely popular in industry applications, especially in low-resource environments [10].

Supervised reward-estimation models are commonly used to optimize text-generation policies [23, 27, 30, 33]. The SeqGAN model used the GAN [11] discriminator output as the reward signal. The supervised discriminator model predicted whether the generated sequence is "real". The estimated reward is passed back to the intermediate state-action steps using computationally expensive Monte Carlo search [8]. The PPO-based summary generation model [27] used estimated annotated user feedback on generated summaries as the reward signal. Immediate implicit user feedback is sparse, asking users to provide explicit feedback is intrusive [4], and reliable annotations are expensive to obtain.

In our proposed approach, the DRL future-reward is composed of three signals: 1) long-term implicit user feedback within a search session; 2) an unnatural query generation penalty; and 3) syntactic similarity between the generated query and the user's most recent search query. Leveraging implicit user feedback from search sessions, as opposed to using immediate feedback, helps maximize user engagement across search sessions, addresses the reward-sparsity problem and removes the need to obtain expensive human annotations (Section 2.2.2). We design a weakly supervised context-aware naturalness estimator model, which estimates the naturalness probability of a generated query. Similar to [33], we perform Monte Carlo search to propagate rewards. However, we reduce the computation cost considerably by performing policy roll-out only from the first time-step of the decoder (Section 2.2.1).

In summary, we employ a DRL policy-gradient technique for making high-quality related-query suggestions at scale. Our proposed approach achieves improvements in terms of recommendation diversity, down-stream user-engagement, relevance and errors per sentence. To the best of our knowledge, this is the first time a combination of long-term session-based user-feedback, unnatural-sentence penalty and syntactic-relatedness reward signals are jointly optimized to improve query suggestions' quality.

The remainder of this paper is structured as follows: Section 2 describes our proposed deep reinforcement learning approach. Section 3.2 presents the experimental setup and discusses empirical results. Section 4 concludes.

Figure 1: Deep Reinforcement Learning for text generation using the REINFORCE algorithm. The agent is an encoder-decoder Seq2Seq attention model. q^b_{i,{1:T}} is the input query, for batch index b ∈ B and Monte-Carlo sample index k ∈ K. Generated words (y_t^{1:B,1:K}) concatenated with the attention context (c_{t+1}) are passed as input to the decoder's t+1 time-step. For each generated sequence (y^{bk}_{1:T}), the policy update is done using the RL reward (R^{bk}_{1:T}), which is calculated at the end of each generated sequence. (The diagram shows the agent being updated via the REINFORCE algorithm, with the reward for each action composed of naturalness, relatedness and user-feedback signals.)
2 Approach for Improving Query Suggestions

We fine-tune a weakly supervised Sequence-to-Sequence (Seq2Seq) Neural Machine Translation (NMT) model (Seq2Seq_NMT) to initialize the query generation policy. The process then consists of two steps: 1) learn a context-aware, weakly supervised naturalness estimator; and 2) fine-tune the pre-trained supervised Seq2Seq_NMT model using the REINFORCE [31] algorithm. The future-reward is composed of user feedback in a search session (U_+), syntactic similarity (ROUGE) and the unnaturalness penalty (−η·(1−D_φ)) of the generated query given the co-occurring previous query.

2.1 Weakly Supervised Pre-Training

Variants of mono-lingual supervised Seq2Seq_NMT models are used in industry applications for query suggestions [14]. In the pre-training step, we train the supervised Seq2Seq_NMT model using co-occurring consecutive query pairs in search sessions. A search session [7] is a stream of queries entered by a user in a 5-minute time window (based on guidance from the internal search team's proprietary analysis). N−1 consecutive query pairs (q_i, q_{i+1}) are extracted from a search session consisting of a sequence of N queries (q_1, q_2, ..., q_N). Consecutive queries could be unrelated, or semantically and (or) syntactically related. Our model is weakly supervised as we use all query pairs and do not filter them using sparse click data, costly human evaluations or weak association rules. Weak supervision allows the training process to scale, minimizes selection bias and, we conjecture, improves model generalization too. For example, unlike [14], we do not apply syntactic similarity heuristics to filter query pairs, as queries could be semantically related yet syntactically dissimilar (e.g., "artificial intelligence" and "machine learning").
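As a concrete illustration of this pair-mining step, the sketch below extracts weakly supervised training pairs from a single user's time-ordered query log, under one gap-based interpretation of the 5-minute session window; the event format and function name are our own and not from the paper.

```python
from datetime import timedelta

SESSION_GAP = timedelta(minutes=5)  # session boundary assumed from the 5-minute window above

def weakly_supervised_pairs(events):
    """Yield consecutive (q_i, q_i+1) training pairs from one user's time-ordered
    (timestamp, query) search events. No click, similarity or association-rule
    filtering is applied, matching the weak-supervision setup described above."""
    session = []
    for ts, query in events:
        # Start a new session when the gap to the previous query exceeds 5 minutes.
        if session and ts - session[-1][0] > SESSION_GAP:
            session = []
        session.append((ts, query))
        if len(session) >= 2:
            yield session[-2][1], session[-1][1]
```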
The Seq2Seq_NMT encoder-decoder framework consists of a BiLSTM [12] encoder, which encodes a batch (batch size B) of input queries (q_i^{1:B}), and an LSTM [13] decoder, which generates a batched sequence of words y = (y_1^{1:B}, ..., y_T^{1:B}), where T is the sequence length. During training, we use teacher forcing [32], i.e., we use the co-occurring query (q_{i+1}^{1:B}) as input to the decoder. The context attention vector is obtained from the alignment model [3]. Categorical cross-entropy loss is minimized during training, and the hyper-parameters of the model are fine-tuned (see Section 3.2).

2.2 Fine-tuning using Deep Reinforcement Learning

This section describes the reward estimation and Deep Reinforcement Learning (DRL) model training steps used to fine-tune and improve the policy obtained via the pre-trained supervised model.

2.2.1 Deep Reinforcement Learning Model. Parameters of the DRL agent G_θ are initialized with the pre-trained Seq2Seq_NMT model (Section 2.1). The initial policy is fine-tuned using the REINFORCE policy-gradient algorithm (Figure 1). K complete sentences (y_{1:T}^{b,1:K}) generated per query (q_i^b) constitute the action space at time-step T, where b ∈ B is the index in a mini-batch of B queries. To mitigate exposure bias, generated words (y_{t−1}^{1:B,1:K}) from the previous time-step are passed as input to the next time-step t of the decoder. The future-reward (R_{D_φ}(y_{1:T}^{bk})), computed at the end of each generated sample, is back-propagated to the encoder-decoder model. Given the start state (S_0^b), comprising the input query (q_i^b) and the <START> token y_0^b, the objective of the agent is to generate related-search query suggestions (y_{1:T}^b) which maximize the objective:

$$ J(\theta) = \mathbb{E}\big[ R_{D_\phi}(y^{b}_{1:T}) \mid S^{b}_{0}, \theta \big] \qquad (1) $$

where the per-sample reward is:

$$ R_{D_\phi}(y^{bk}_{1:T}) = U_{+} + (1 - U_{+}) \cdot \big( \mathrm{ROUGE}_{q^{b}_{i},\, y^{bk}_{1:T}} - \eta \cdot (1 - D_\phi(y^{bk}_{1:T})) \big) \qquad (2) $$

The Monte Carlo (MC) approximation of the gradient, using the likelihood-ratio trick, is:

$$ \nabla_\theta J(\theta) \approx \frac{1}{K} \sum_{k \in K} \big[ R_{D_\phi}(y^{bk}_{1:T}) \cdot \nabla_\theta \log G_\theta(y^{bk}_{1:T} \mid S^{b}_{0}) \big], \quad y^{bk}_{1:T} \in \mathrm{MC}^{G_\beta}(y^{bk}_{1:T}) \qquad (3) $$

Unlike SeqGAN [33] DRL model training, where the action-value at each intermediate time-step t is evaluated by generating K samples (y_{t+1:T}^{b,1:K}), we perform MC policy roll-out from the start state alone. K queries (y_{1:T}^{b,1:K}) are generated using the roll-out policy G_β for each input query q_i^b. This modification reduces the computation cost by a factor of O(T) (O(KT²) → O(KT)) per input query. G_β is initialized with G_θ and is periodically updated during training, using a configurable schedule (see Algorithm 1). At each time-step t, the state S_t^{bk} is comprised of the input query and the tokens produced so far ({q_i^b, y_{1:t−1}^{bk}}), and the action is the next token y_t^{bk} to be selected from the stochastic policy G_θ(y_t^{bk} | S_t^{bk}).

Details of the constituents of the future-reward (U_+, ROUGE_{q^b_i, y^{bk}_{1:T}} and D_φ(y^{bk}_{1:T})) are given in the next section (Section 2.2.2). Since the expectation E[.] can be approximated by sampling methods, we then update the generator's parameters, with α as the learning rate, as:

$$ \theta \leftarrow \theta + \alpha \cdot \nabla_\theta J(\theta) \qquad (4) $$
) of a generated query, given the previous query
Require: Generator policy 𝐺𝜃 ; roll-out policy 𝐺 𝛽 ; naturalness-estimator 𝐷𝜙 ;
entered by the user as context. “AI jobs” is an example of a nat-
ural query after the user searched for “Google”, even though
 
Query-pair in a search session 𝑞𝑖1:{𝐵1:𝑇 } , 𝑞 1:
{𝑖+1 } {1:𝑇 } ; Batch size: B;
𝐵

MC sampling-strategy 𝜖 [beam-search, sampling from categorical distribution]


both queries are syntactically and semantically (jobs vs com-
1. Fine-tune supervised model using MLE with 𝑞𝑖1:{𝐵1:𝑇 } as input sequence and
pany) dissimilar. However, “AI AI jobs jobs” is unnatural and is
𝑞 1:
{𝑖+1 } {1:𝑇 } as target sequence.
𝐵

2. Initialize 𝐺𝜃 with fine-tuned supervised model. unlikely to be entered by a real user. In our DRL reward formu-
3. 𝛽 ← 𝜃
4. Train contextual-naturalness-estimator 𝐷𝜙 using negative examples generated from 𝐺𝜃
lation, we add penalty term −𝜂 ∗ (1 − 𝐷𝜙 (𝑦𝑏𝑘 1:𝑇
)) to syntactic-
5. repeat relatedness (𝑅𝑂𝑈 𝐺𝐸𝑞𝑏 ,𝑦𝑏𝑘 ) score to discourage generation of un-
6. for n steps 𝑖 1:𝑇
7. foreach 𝑏 𝜖 𝐵
1:𝐾
natural queries. Coefficient 𝜂 is the configurable penalty weight.
8. Generate "K" sequences 𝑦𝑏, 1:𝑇
using configured sampling-strategy

2.2.3 Contextual Naturalness Estimator. Our proposed contextual-


 
1:𝐾 1:𝐾 1:𝐾
𝑦𝑏,
1:𝑇
= 𝑦𝑏,
1 , 𝑦𝑏,
2 , ..., 𝑦𝑇𝑏, 1:𝐾 ∼ 𝐺 𝛽
9. Compute future-reward at the end of each generated sequence
𝑅𝑏,𝑘
1:𝑇
= 𝑈 + + (1 − 𝑈 + ) ∗ (𝑅𝑂𝑈 𝐺𝐸 (𝑞𝑏𝑖{1:𝑇 } ,𝑦𝑏,𝑘 ) − 𝜂 ∗ (1 − 𝐷𝜙 )) naturalness-estimator is a BiLSTM [12] supervised model, which
{1:𝑇 }
10. Compute gradient Δ𝐽 (𝜃 ) (Equation 3) predicts the probability a generated query is “natural” (𝐷𝜙 (𝑦𝑏𝑘 1:𝑇
)) .
11. Update 𝐺𝜃 parameters via policy-gradient (Equation 4)
12. end foreach Concatenated with query-context (𝑞𝑖 ), user entered queries (𝑞𝑖+1 )
13. end for
14. 𝛽 ← 𝜃
serve as positive examples (𝑞𝑖 ⊕𝑞𝑖+1 ) to train the model. We employ
four methods to generate negative examples (𝑞𝑖 ⊕ 𝑞𝑖 ) per each
15. until convergence
𝑛𝑒𝑔

positive example, which are: 1) With 𝑞𝑖 as input, sample query


Algorithm 1: Deep Reinforcement Learning algorithm for related-search query suggestions.
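The inner update (steps 8–11 of Algorithm 1, i.e., Equations 2–4) can be sketched as follows in TensorFlow. This is a minimal illustration, not the paper's implementation: the agent.sample and agent.log_prob methods and the three reward callables are hypothetical stand-ins, and minimizing the reward-weighted negative log-likelihood with the optimizer realizes the ascent step of Equation (4).

```python
import tensorflow as tf

def reinforce_step(agent, rollout_agent, optimizer, query_batch, k_samples, eta,
                   user_feedback_fn, rouge1_fn, naturalness_fn):
    """One hypothetical REINFORCE update: sample K queries per input query from the
    roll-out policy G_beta, score each full sequence with the reward of Eq. (2),
    and update the agent G_theta using the gradient estimate of Eq. (3)-(4)."""
    with tf.GradientTape() as tape:
        loss = 0.0
        for query in query_batch:
            sequences = rollout_agent.sample(query, k_samples)   # MC roll-out from the start state only
            for y in sequences:
                u_plus = user_feedback_fn(query, y)               # 1 if positive downstream action, else 0
                reward = u_plus + (1.0 - u_plus) * (
                    rouge1_fn(query, y) - eta * (1.0 - naturalness_fn(query, y)))  # Eq. (2)
                loss += -reward * agent.log_prob(query, y)        # reward-weighted NLL term
        loss /= (len(query_batch) * k_samples)
    grads = tape.gradient(loss, agent.trainable_variables)
    optimizer.apply_gradients(zip(grads, agent.trainable_variables))  # gradient step, Eq. (4)
    return loss
```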
2.2.2 Reward Formulation. This section describes the three components of the future-reward R_{D_φ}(y^{bk}_{1:T}): session-based positive user-feedback (U_+), syntactic similarity (ROUGE_{q^b_i, y^{bk}_{1:T}}) and the unnatural suggestion penalty (−η · (1 − D_φ(y^{bk}_{1:T}))).

• Long-term user-feedback in a search session (U_+): Viewing search results (dwell time >= 5 seconds, determined by internal domain-specific search user-behavior analysis) or performing an engaging action, such as sending a connection request or a message, applying for a job, or favoriting a group, profile or article, constitutes a positive implicit user action in a search session. Immediate user feedback is sparse. However, the percentage of positive user actions increases by an absolute 11% when the remainder of the search session after the user enters q_{i+1} is considered (computed over a set of ~100 million query pairs extracted from search sessions in a one-month window; the actual percentage of positive user actions is not included due to confidentiality). In our work we maximize session-based user feedback, as we are interested in maximizing user engagement across search sessions. For a generated query y^{bk}_{1:T}, the session-based user feedback (U_+) is "1" if a positive down-stream user action is observed in the remainder of the search session, and "0" otherwise.

• Relatedness of the generated query to the source query (ROUGE_{q^b_i, y^{bk}_{1:T}}): Despite increasing the percentage of associated positive user actions by considering the user's feedback across a search session, the label-sparsity problem is not completely mitigated. In the search query logs, when there is no positive down-stream user action associated with a generated query y^{bk}_{1:T}, we estimate the reward using a syntactic similarity measure. Reformulated queries are syntactically and semantically similar [18]. We compute the syntactic relatedness of the generated query (y^{bk}_{1:T}) with the source query (q^b_i) using the ROUGE-1 [22] score.

• Naturalness probability of the generated query (D_φ(y^{bk}_{1:T})): Users enter either natural-language search queries [5] (e.g., "jobs requiring databases expertise in the bay area") or just keywords (e.g., "bay area jobs database"). In the context of related query suggestions, we define a "natural" query as one which a real user is likely to enter. We train a contextual-naturalness-estimation model (see Section 2.2.3) to predict the naturalness probability D_φ(y^{bk}_{1:T}) of a generated query, given the previous query entered by the user as context. "AI jobs" is an example of a natural query after the user searched for "Google", even though the two queries are syntactically and semantically (jobs vs. company) dissimilar. However, "AI AI jobs jobs" is unnatural and unlikely to be entered by a real user. In our DRL reward formulation, we add the penalty term −η · (1 − D_φ(y^{bk}_{1:T})) to the syntactic-relatedness (ROUGE_{q^b_i, y^{bk}_{1:T}}) score to discourage the generation of unnatural queries. The coefficient η is a configurable penalty weight.
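To make the relatedness term concrete, the sketch below computes a unigram-overlap ROUGE-1 score over whitespace-tokenized queries. The paper specifies ROUGE-1 [22] but not which variant is used, so the F1 form here is an assumption.

```python
from collections import Counter

def rouge1_f1(source_query: str, generated_query: str) -> float:
    """Unigram-overlap ROUGE-1 F1 between two whitespace-tokenized queries."""
    ref, hyp = source_query.split(), generated_query.split()
    if not ref or not hyp:
        return 0.0
    overlap = sum((Counter(ref) & Counter(hyp)).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(hyp), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Example: rouge1_f1("machine learning", "machine learning jobs") == 0.8
```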
2.2.3 Contextual Naturalness Estimator. Our proposed contextual-naturalness-estimator is a BiLSTM [12] supervised model which predicts the probability that a generated query is "natural" (D_φ(y^{bk}_{1:T})). User-entered queries (q_{i+1}), concatenated with the query-context (q_i), serve as positive examples (q_i ⊕ q_{i+1}) to train the model. We employ four methods to generate negative examples (q_i ⊕ q_i^neg) for each positive example: 1) with q_i as input, sample a query q_i^neg from the fine-tuned Seq2Seq_NMT model's decoder (Section 2.1); 2) perturb q_{i+1} by duplicating a word at a randomly selected position within the sentence; 3) replace a word with the unknown-word token ("<UNK>") at a randomly selected position in q_{i+1}; and 4) generate a sentence by repeating a word sampled from a categorical distribution (p_{w_1}, ..., p_{w_|V|}) a randomly chosen r ∈ [1, T−1] times, where |V| is the size of the training-data vocabulary, p_{w_i} = n_{w_i} / N_{|V|}, n_{w_i} is the frequency of word w_i and N_{|V|} = Σ_{w_i ∈ |V|} n_{w_i}. Methods 1, 2 and 3 generate hard-negative examples [18], and method 4 captures popularity bias [2], a situation where popular terms are generated more often than terms in the long tail.
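A minimal sketch of these four negative-example generators is shown below; seq2seq_sampler stands in for sampling from the pre-trained Seq2Seq_NMT decoder, and the argument names are ours, not the paper's.

```python
import random

def negative_examples(q_i, q_next, seq2seq_sampler, vocab_probs, max_len):
    """Build one negative example per method for a (q_i, q_next) pair.
    Assumes q_next is non-empty; `vocab_probs` maps word -> prior probability p_w."""
    tokens = q_next.split()
    # 1) Sample a query from the fine-tuned Seq2Seq_NMT decoder given q_i.
    neg1 = seq2seq_sampler(q_i)
    # 2) Duplicate a word at a randomly selected position.
    j = random.randrange(len(tokens))
    neg2 = " ".join(tokens[:j] + [tokens[j]] + tokens[j:])
    # 3) Replace a randomly selected word with the unknown token.
    j = random.randrange(len(tokens))
    neg3 = " ".join(tokens[:j] + ["<UNK>"] + tokens[j + 1:])
    # 4) Repeat a word sampled from the unigram prior r ∈ [1, T-1] times.
    words, probs = zip(*vocab_probs.items())
    r = random.randint(1, max(1, max_len - 1))
    neg4 = " ".join(random.choices(words, weights=probs, k=1) * r)
    return [neg1, neg2, neg3, neg4]
```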

To validate our hypothesis that sampled queries from Seq2Seq_NMT are less natural than the ones provided by the user, we asked three annotators to rate 100 randomly sampled query pairs (q_1, q_2). Query q_1 was entered by the user, and q_2 was either sampled from the supervised model (46%) or entered by the user (54%) after searching for q_1. Without revealing the source of q_2, we asked the annotators to identify whether the query is "natural" (as defined in Section 2.2.2). On average, 58% of model-generated queries and 74% of real-user queries were identified as natural. The Inter-Annotator Agreement (IAA), measured using Fleiss' Kappa [25], was poor (0.04) when the annotators evaluated model-generated sentences. In comparison, when they evaluated queries entered by real users, the IAA was better (0.34) across the three annotators' ratings, and ranged from fair (0.22) to moderate (0.52) agreement between each pair of annotators. The higher IAA and the higher percentage of queries identified as "natural" imply that real-user queries are more natural and more distinguishable than queries sampled from the pre-trained Seq2Seq_NMT model.

3 Experiment Setup and Results

This section describes the experimental setup used to train and evaluate the naturalness-estimator, supervised and DRL query generation models.

3.1 Data

From user search-query logs, we randomly sampled 0.61 million (90%, train), 34k (5%, validation) and 34k (5%, test) query pairs to train the supervised Seq2Seq_NMT and DRL models. The dataset used to train the naturalness-estimator model is 5x this size (see Section 2.2.3). The maximum length of a query is 8 words and the mean length is ~2 words. The vocabulary size is 32k, and out-of-vocabulary words in the validation and test sets are replaced with the "<UNK>" unknown token.

3.2 Experimental Setup

We implemented all models in TensorFlow [1] and tuned the parameters using Ray[Tune] [21] on a Kubernetes [17] distributed cluster. As described in Section 2, the query suggestion policy is initialized with the fine-tuned Seq2Seq_NMT model. Seq2Seq_NMT model parameters are updated using the Adam [16] optimizer, and categorical cross-entropy loss is minimized during training. During inference, six queries are generated per input query (q_i) using beam-search [9] decoding (in the production environment, six related queries are suggested for each user query). Negative examples to train the two-layered BiLSTM contextual-naturalness-estimator are obtained from the pre-trained Seq2Seq_NMT model. At inference, the naturalness probability (D_φ(y^{bk}_{1:T})) is obtained from the output of a fully connected layer whose input is the last time-step's hidden state.
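For illustration, a minimal Keras sketch of such a two-layered BiLSTM classifier over the concatenated (q_i ⊕ q_{i+1}) token sequence is given below. The layer sizes, embedding dimension, maximum length and optimizer are placeholders (the actual values are tuned; see Appendix Tables 2 and 3), and the builder function name is ours.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Placeholder dimensions; the paper tunes these hyper-parameters.
VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM, MAX_LEN = 32_000, 128, 256, 16

def build_naturalness_estimator():
    """Two-layer BiLSTM mapping concatenated (q_i ⊕ q_i+1) token ids to a
    naturalness probability via a fully connected layer on the final state."""
    token_ids = layers.Input(shape=(MAX_LEN,), dtype="int32")
    x = layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(token_ids)
    x = layers.Bidirectional(layers.LSTM(HIDDEN_DIM, return_sequences=True))(x)
    x = layers.Bidirectional(layers.LSTM(HIDDEN_DIM))(x)      # last time-step's hidden state
    d_phi = layers.Dense(1, activation="sigmoid")(x)           # naturalness probability D_phi
    model = tf.keras.Model(token_ids, d_phi)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```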
The initial policy is fine-tuned using the REINFORCE policy-gradient algorithm, with the future-reward described in Section 2.2.2. During training, the K samples for MC roll-out are generated either using beam-search or by sampling from the categorical distribution of inferred word probabilities at each time-step (see Figure 1). DRL model training stability is monitored via the convergence of the reward-weighted Negative Log Likelihood, with

$$ \frac{1}{KB} \sum_{k \in K,\, b \in B} \big[ -R_{D_\phi}(y^{bk}_{1:T}) \cdot \log G_\theta(y^{bk}_{1:T}) \big] $$

as the computed loss at each model training step.

We use the SGD optimizer [15] to update the weights of the agent (Equation 4). Appendix Figure 2 shows the convergence performance of the DRL model for different values of the unnaturalness penalty (η), the number of MC samples (K) and the choice of sampling strategy. The complete set of hyper-parameters we tuned is in Appendix Table 2. The best combination of hyper-parameters is chosen based on performance on the validation set (see Appendix Table 3).

3.3 Evaluation Metrics

The binary "natural/unnatural" class-prediction performance of the contextual naturalness estimator is evaluated using the F1 score (F1 = 2·Precision·Recall / (Precision + Recall)) and Accuracy (categorical accuracy: how often predictions match the one-hot labels). We use the mean of the following metrics, calculated on the test set, to evaluate the relevance, engagement, accuracy and diversity of the generated queries.

• Sessions with positive user-action (Sessions_+@6): Long-term binary engagement metric indicating whether the recommended queries lead to a successful session. Its value is "1" if any of the six generated queries belongs to a search session in the test data with an associated down-stream positive user action (Section 2.2.2).

• Unique@6: Diversity metric indicating the percentage of unique sentences among the (six) query suggestions made per query q_i^test. Queries containing the unknown-word token ("<UNK>") are filtered out, as only high-quality suggestions are presented to the end user.

• Precision@6: Measures relevance with respect to the query a user would enter next. It is "1" if q_{i+1}^test is in the set of six query suggestions made for q_i^test, and "0" otherwise.

• Word-repetitions per sentence (Repetitions_S): Fraction of word repetitions per generated query (S). Unwanted word repetitions lead to lower quality.

• Prior Sentence Probability (P_S): P_S = Σ_{w_i ∈ S} log(p_{w_i}) measures the prior sentence probability, where p_{w_i} is the prior word probability defined in Section 2.2.3. Lower sentence probability indicates higher diversity, as the generated queries contain less frequent words.
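The per-query versions of these metrics are straightforward to compute. The sketch below shows one possible implementation for a single test query; session_has_positive_action is a hypothetical lookup into the logged test sessions, prior_word_prob is the unigram prior from Section 2.2.3, and the repetition fraction shown is one interpretation of Repetitions_S.

```python
import math

def metrics_at_6(test_query, next_query, suggestions, session_has_positive_action,
                 prior_word_prob):
    """Per-query evaluation sketch; `suggestions` holds the six generated queries."""
    kept = [s for s in suggestions if "<UNK>" not in s.split()]    # drop low-quality suggestions
    sessions_plus = int(any(session_has_positive_action(test_query, s) for s in kept))
    unique_at_6 = len(set(kept))                                    # distinct suggestions kept
    precision_at_6 = int(next_query in kept)                        # did we predict the next query?
    repetitions, log_priors = [], []
    for s in kept:
        words = s.split()
        repetitions.append(1.0 - len(set(words)) / len(words))      # fraction of repeated words
        log_priors.append(sum(math.log(prior_word_prob(w)) for w in words))  # P_S for this query
    return {
        "Sessions+@6": sessions_plus,
        "Unique@6": unique_at_6,
        "Precision@6": precision_at_6,
        "Repetitions_S": sum(repetitions) / len(repetitions) if repetitions else 0.0,
        "P_S": sum(log_priors) / len(log_priors) if log_priors else 0.0,
    }
```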
3.4 Results

The contextual-naturalness-estimator achieved 90% accuracy and 80% F1 on the test set. Table 1 shows the performance of the supervised (Seq2Seq_NMT) and the proposed DRL models on the five metrics described in the previous section. DRL_beam and DRL_sampling use the "beam-search" and "sampling from categorical distribution" MC sampling strategies, respectively (see Section 3.2). In order to assess the impact of applying heuristics to filter and improve the quality of the suggestions provided by supervised models, we also analyzed the performance of Seq2Seq_NMT^-, which is the Seq2Seq_NMT model with post-processing filters that remove suggestions with repeated words.

Model \ Metric | Sessions_+@6 | Unique@6 | Precision@6 | Repetitions_S | P_S
Seq2Seq_NMT | 0.1108 ± 0.002 | 5.8244 ± 0.0045 | 0.0456 ± 0.0025 | 2.21% ± 0.04% | -6.4442 ± 0.0149
Seq2Seq_NMT^- | 0.1101 ± 0.001 | 5.5595 ± 0.0075 | 0.0456 ± 0.0025 | 0.00% ± 0.00%*† | -6.4875 ± 0.0151†
DRL_beam | 0.1155 ± 0.002*† | 5.9606 ± 0.0023† | 0.0468 ± 0.0025* | 1.10% ± 0.03%† | -6.4897 ± 0.0140*†
DRL_sampling | 0.1149 ± 0.002† | 5.9956 ± 0.0007*† | 0.0467 ± 0.0024 | 0.40% ± 0.02%† | -6.3932 ± 0.0141

Table 1: Mean performance of the supervised Seq2Seq_NMT and DRL models across all query pairs in the test data. Cells show the mean and the 95% confidence interval calculated using the t-distribution. The best mean in each column is marked with *. † indicates a statistically significant improvement over the baseline Seq2Seq_NMT.

On the offline test data set, in comparison to the baseline Seq2Seq_NMT model, Seq2Seq_NMT^- removed query suggestions with repeated words completely; however, this heuristics-based model performed poorly in terms of diversity (a 4.5% relative drop in mean Unique@6) and the average number of successful sessions (a 0.6% relative drop in mean Sessions_+@6). On the other hand, both versions of our proposed DRL model outperformed the baseline model on all metrics.

The DRL variants achieved significant relative improvements in terms of user-engagement (mean Sessions_+@6) of up to 4.2% (0.1108 → 0.1155), query suggestions' diversity (mean Unique@6) of up to 3% (5.8244 → 5.9956), sentence-level diversity (mean Prior Sentence Probability) of up to 0.7% (−6.4442 → −6.4897), and a reduction in errors per sentence of up to 82% (2.21 → 0.40). The non-significant improvement in relevance (mean Precision@6) is not surprising, as the supervised Seq2Seq_NMT model is also trained with consecutive query pairs.

4 Conclusions

In this paper, we proposed a Deep Reinforcement Learning (DRL) framework to improve the quality of related-search query suggestions. Using long-term user feedback, syntactic relatedness and an estimated unnaturalness penalty as reward signals, we fine-tuned the supervised text-generation policy at scale with the REINFORCE policy-gradient algorithm. We showed significant improvements in recommendation diversity (3%), query correctness (82%) and user-engagement (4.2%) over industry baselines. For future work, we plan to include semantic relatedness as a reward. Since the proposed DRL framework is agnostic to the choice of encoder-decoder architecture, we also plan to fine-tune different state-of-the-art language models using our proposed DRL framework.

Acknowledgments

Thanks to Cong Gu, Ankit Goyal and the LinkedIn Big Data team for their help in setting up the DRL experiments on Kubernetes. Thanks to Souvik Ghosh and the RL Foundations team for their valuable feedback.

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. http://tensorflow.org/ Software available from tensorflow.org.
[2] Himan Abdollahpouri. 2019. Popularity Bias in Ranking and Recommendation. https://doi.org/10.1145/3306618.3314309
[3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. CoRR abs/1409.0473 (2015).
[4] Praveen Kumar Bodigutla, Longshaokan Wang, Kate Ridgeway, Joshua Levy, Swanand Joshi, Alborz Geramifard, and Spyros Matsoukas. 2019. Domain-Independent turn-level Dialogue Quality Evaluation via User Satisfaction Estimation. arXiv:cs.LG/1908.07064
[5] F. Borges, Georgios Balikas, Marc Brette, Guillaume Kempf, A. Srikantan, Matthieu Landos, Darya Brazouskaya, and Qianqian Shi. 2020. Query Understanding for Natural Language Enterprise Search. ArXiv abs/2012.06238 (2020).
[6] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. arXiv:cs.CL/2005.14165
[7] Huanhuan Cao, Daxin Jiang, Jian Pei, Qi He, Zhen Liao, Enhong Chen, and Hang Li. 2008. Context-Aware Query Suggestion by Mining Click-through and Session Data. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08). Association for Computing Machinery, New York, NY, USA, 875–883. https://doi.org/10.1145/1401890.1401995
[8] Guillaume Chaslot, Sander Bakkes, Istvan Szita, and Pieter Spronck. 2008. Monte-Carlo Tree Search: A New Framework for Game AI. In Proceedings of the Fourth AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE'08). AAAI Press, 216–217.
[9] Markus Freitag and Yaser Al-Onaizan. 2017. Beam Search Strategies for Neural Machine Translation. CoRR abs/1702.01806 (2017). arXiv:1702.01806 http://arxiv.org/abs/1702.01806
[10] Thushan Ganegedara. 2020. Is the race over for Seq2Seq models? https://towardsdatascience.com/is-the-race-over-for-seq2seq-models-adef2b24841c
[11] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2 (NIPS'14). MIT Press, Cambridge, MA, USA, 2672–2680.
[12] Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed. 2013. Hybrid speech recognition with Deep Bidirectional LSTM. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding. 273–278. https://doi.org/10.1109/ASRU.2013.6707742
[13] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Comput. 9, 8 (Nov. 1997), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
[14] Michaeel Kazi, Weiwei Guo, Huiji Gao, and Bo Long. 2020. Incorporating User Feedback into Sequence to Sequence Model Training. In CIKM '20: The 29th ACM International Conference on Information and Knowledge Management, Virtual Event, Ireland, October 19-23, 2020, Mathieu d'Aquin, Stefan Dietze, Claudia Hauff, Edward Curry, and Philippe Cudré-Mauroux (Eds.). ACM, 2557–2564. https://doi.org/10.1145/3340531.3412714
[15] J. Kiefer and J. Wolfowitz. 1952. Stochastic Estimation of the Maximum of a Regression Function. The Annals of Mathematical Statistics 23, 3 (1952), 462–466. http://www.jstor.org/stable/2236690
[16] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. CoRR abs/1412.6980 (2015).
[17] kubernetes.io. 2020. Cluster Architecture. https://kubernetes.io/docs/concepts/architecture/
[18] Seanie Lee, Dong Bok Lee, and Sung Ju Hwang. 2021. Contrastive Learning with Adversarial Perturbations for Conditional Text Generation. arXiv:cs.CL/2012.07280
[19] Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. 2016. Deep Reinforcement Learning for Dialogue Generation. CoRR abs/1606.01541 (2016). arXiv:1606.01541 http://arxiv.org/abs/1606.01541
[20] Ruirui Li, Liangda Li, Xian Wu, Yunhong Zhou, and Wei Wang. 2019. Click Feedback-Aware Query Recommendation Using Adversarial Examples. In The World Wide Web Conference (WWW '19). Association for Computing Machinery, New York, NY, USA, 2978–2984. https://doi.org/10.1145/3308558.3313412
[21] Richard Liaw, Eric Liang, Robert Nishihara, Philipp Moritz, Joseph E. Gonzalez, and Ion Stoica. 2018. Tune: A Research Platform for Distributed Model Selection and Training. arXiv preprint arXiv:1807.05118 (2018).
[22] Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out. Association for Computational Linguistics, Barcelona, Spain, 74–81. https://www.aclweb.org/anthology/W04-1013
[23] Rodrigo Nogueira and Kyunghyun Cho. 2017. Task-Oriented Query Reformulation with Reinforcement Learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Copenhagen, Denmark, 574–583. https://doi.org/10.18653/v1/D17-1061
[24] Florian Schmidt. 2019. Generalization in Generation: A closer look at Exposure Bias. In Proceedings of the 3rd Workshop on Neural Generation and Translation. Association for Computational Linguistics, Hong Kong, 157–167. https://doi.org/10.18653/v1/D19-5616
[25] Hubert J. A. Schouten. 1986. Nominal scale agreement among observers. Psychometrika 51, 3 (1986), 453–466. https://doi.org/10.1007/BF02294066
[26] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal Policy Optimization Algorithms. CoRR abs/1707.06347 (2017). arXiv:1707.06347 http://arxiv.org/abs/1707.06347
[27] Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F. Christiano. 2020. Learning to summarize from human feedback. CoRR abs/2009.01325 (2020). arXiv:2009.01325 https://arxiv.org/abs/2009.01325
[28] Thanh Tin Tang, Nick Craswell, David Hawking, Kathy Griffiths, and Helen Christensen. 2006. Quality and relevance of domain-specific search: A case study in mental health. Information Retrieval 9, 2 (2006), 207–225. https://doi.org/10.1007/s10791-006-7150-5
[29] Xuanhui Wang, Michael Bendersky, Donald Metzler, and Marc Najork. 2016. Learning to Rank with Selection Bias in Personal Search. In Proc. of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. 115–124.
[30] Xiao Wang, Craig Macdonald, and Iadh Ounis. 2020. Deep Reinforced Query Reformulation for Information Retrieval. arXiv:cs.IR/2007.07987
[31] Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8, 3 (1992), 229–256. https://doi.org/10.1007/BF00992696

[32] Ronald J. Williams and David Zipser. 1989. A Learning Algorithm for Continually
Running Fully Recurrent Neural Networks.
[33] Lantao Yu, W. Zhang, J. Wang, and Y. Yu. 2017. SeqGAN: Sequence Generative
Adversarial Nets with Policy Gradient. In AAAI.

A Appendices

Figure 2: Reward weighted Negative log-likelihood convergence performance w.r.t the training epochs of 𝐷𝑅𝐿𝑠𝑎𝑚𝑝𝑙𝑖𝑛𝑔 (top figure) and 𝐷𝑅𝐿𝑏𝑒𝑎𝑚
(bottom figure) models, for different values of learning-rate. Number of Monte Carlo (MC) samples generated per query during training and unnatural
suggestion penalty are from best parameter combination for each MC sampling-method (see Appendix Table 3).

Model | Hyper-parameters and their corresponding ranges
Seq2Seq_NMT | Batch size: [64, 128, 256, 512]; Sequence length: [4, 5, 6, 7, 8]; Dropout: [0.0–0.4]; Num RNN layers: [1, 2, 3]; Hidden-vector length: [128, 256]
Contextual Naturalness Estimator | Dropout: [0.0–0.4]
Deep Reinforcement Learning (DRL) Model | Learning rate: [1e-04, 1e-06, 1e-05, 2e-05, 3e-05, 4e-05, 5e-05]; Number of samples per input query: [1–5]; Naturalness penalty coefficient: [1.0, 0.1, 0.01, 0.001]; Sampling strategy: [Beam Search, Sampling from categorical distribution]

Table 2: Hyper-parameter value ranges used for training the query generation and contextual-naturalness-estimation models. Once the hyper-parameters of the supervised Seq2Seq_NMT model are fine-tuned, the same model architecture is used for training the supervised naturalness-estimator (encoder) model and the DRL (encoder-decoder) agent. Hence batch size, sequence length, number of RNN layers and hidden-vector length remain consistent across all three models.
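As an illustration of how such a search space might be expressed with Ray Tune (used in Section 3.2), a hedged sketch is shown below; train_drl_model is a hypothetical trainable, the config key names are ours, and the metric reported back is the validation Sessions_+@6 used as the selection criterion in Table 3.

```python
from ray import tune

def train_drl_model(config):
    """Hypothetical trainable: trains one DRL agent with `config` and reports the
    validation Sessions+@6 back to Tune (training code omitted in this sketch)."""
    sessions_plus_at_6 = 0.0  # placeholder for the real validation metric
    tune.report(sessions_plus_at_6=sessions_plus_at_6)

# Search space mirroring the DRL rows of Table 2.
drl_search_space = {
    "learning_rate": tune.grid_search([1e-04, 1e-06, 1e-05, 2e-05, 3e-05, 4e-05, 5e-05]),
    "num_mc_samples": tune.grid_search([1, 2, 3, 4, 5]),
    "naturalness_penalty": tune.grid_search([1.0, 0.1, 0.01, 0.001]),
    "sampling_strategy": tune.grid_search(["beam_search", "categorical_sampling"]),
}

analysis = tune.run(train_drl_model, config=drl_search_space,
                    metric="sessions_plus_at_6", mode="max")
```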

Model | Best set of hyper-parameters | Criteria
Seq2Seq_NMT | Batch size: 256; Sequence length: 7; Dropout: 0.2; Num RNN layers: 2; Hidden-vector length: 256 | Min categorical cross-entropy loss
Contextual Naturalness Estimator | Dropout: 0.0 | Max F1 on validation set
DRL_beam | Learning rate: 3e-05; Number of samples per input query: 4; Naturalness penalty coefficient: 1.0 | Negative Log Likelihood convergence criteria to terminate training; max Sessions_+@6 performance on the validation set to pick the best parameter combination (optimal model)
DRL_sampling | Learning rate: 5e-05; Number of samples per input query: 2; Naturalness penalty coefficient: 0.01 | Negative Log Likelihood convergence criteria to terminate training; max Sessions_+@6 performance on the validation set to pick the best parameter combination (optimal model)

Table 3: Optimal hyper-parameter values for the query generation and naturalness estimation models, with the criteria applied on validation-set performance to choose them.
