
Building Deep Structured Semantic Similarity Features to Improve the Media Search

Xugang Ye, Zijie Qi, Xiaodong He
Microsoft
{xugangye, xiaohe, zijieqi}@microsoft.com

Jingjing Li
University of Virginia
jl9rf@virginia.edu

Abstract

In media search, it is often found that for a query, a content item without any shared term in its title or concise description can still be highly relevant. Therefore, document retrieval based only on word matching can miss important content. On the other hand, even with an exact word match in its document, a content item can still be an unsatisfactory result. These observations suggest that query-document matching should be established at the semantic level rather than at the term level. In this paper, we introduce some of the latest results of applying the Deep Structured Semantic Model (DSSM) to build similarity features for document retrieval and ranking in media search. To test the new features, we perform experiments of retrieving and ranking media documents from Xbox, with and without the DSSM involved. To obtain a DSSM, we leverage Bing's large web search logs to train a seed DSSM and then tune the model using the Xbox ranker's training data. We also train the Xbox rankers with and without the DSSM similarity features. Our experimental results show that adding the DSSM similarity features significantly improves the results of document retrieval and ranking.

1. Introduction

Applying semantic matching based methods, beyond lexical matching based ones, to search problems is currently an active research direction. Latent semantic models such as latent semantic analysis (LSA) have been widely used to find the relatedness of a query and its relevant documents when term matching fails [1][4][6]. Besides LSA, probabilistic topic models such as probabilistic LSA (PLSA) [6] and Latent Dirichlet Allocation (LDA) [1] have also been proposed for semantic matching. Those models, despite their purpose and solid theoretical foundations, suffer from the fact that they are usually trained via unsupervised learning under objective functions that are only loosely connected to the relevance labels. Consequently, the semantic matching scores generated from these models, when used for document retrieval and ranking, have not performed as well as originally expected.

On the other hand, a recent semantic matching method called the Deep Structured Semantic Model (DSSM) was proposed by Huang et al. [7] in a supervised learning form that utilizes relevance labels in the model training. With the labels incorporated into its objective function, the DSSM maps a query and a relevant document into a common semantic space in which a similarity measure can be calculated, so that even if there is no shared term between the query and the document, the similarity score can still be positive. In this sense, the DSSM can better reduce the chance that relevant documents are missed during retrieval. Furthermore, it has been reported in [7][9] that applying the DSSM to web search yields superior single-feature performance. That is, the DSSM similarity score outperforms many traditional features including TF-IDF, BM25, WTM (word translation model), LSA, PLSA, and LDA. One important reason is that the DSSM similarity score contains the label information, so that not only does a highly relevant document without a term match receive a high similarity score, but a low-relevance document with a term match also receives a low similarity score as a result of penalization.

Given the encouraging results reported so far, we aim to apply the DSSM to media search to see whether the method can also improve the quality of the retrieval and ranking results in the media domain.

The rest of this paper is organized as follows. We first introduce the formulation of the DSSM. We then introduce the letter-tri-gram technique for dimension-reduced text representation, the parameter estimation method, the model structure, and how the model is integrated into a media search system. We draw conclusions after presenting experimental results on the Xbox media search data.

2. Model

2.1. Formulation
Let $\{(q_1, d_1), (q_2, d_2), \ldots\}$ denote a set of relevant query-document pairs in which $d_i$ is a document under the query $q_i$. Assuming that these relevant query-document pairs are independent, the joint probability of observing them is

$$\prod_i P(d_i \mid q_i; \Theta), \qquad (1)$$

where $P(d_i \mid q_i; \Theta)$ denotes the conditional probability that $d_i$ is relevant under $q_i$. To parameterize the joint likelihood (1), the critical thing is to properly model $P(d \mid q)$. We use the softmax function in the following form:

$$P(d \mid q; \Theta) = \frac{e^{\gamma R(q, d;\, \Theta)}}{e^{\gamma R(q, d;\, \Theta)} + \sum_{d' \in D(q)} e^{\gamma R(q, d';\, \Theta)}}, \qquad (2)$$

where $\gamma > 0$ is a smoothing factor that is empirically set, and $D(q)$ is a set of $n \geq 4$ irrelevant documents under $q$. $R(q, d; \Theta)$ is a similarity function of $q$ and $d$, parameterized by $\Theta$. One form of the similarity function is the cosine similarity, which can be expressed as

$$R(q, d; \Theta) = \frac{y_q^\top y_d}{\|y_q\| \cdot \|y_d\|}, \qquad (3)$$

where $y_q = f(q; \Theta_f)$ and $y_d = g(d; \Theta_g)$ are the semantic vectors of $q$ and $d$, respectively, and $\Theta_f$ and $\Theta_g$ are the parts of $\Theta$ corresponding to $q$ and $d$, respectively.

By taking the negative logarithm of (1), we have the differentiable loss function of the parameters $\Theta$:

$$L(\Theta) = -\sum_i \ln P(d_i \mid q_i; \Theta) = -\sum_i \ln \frac{e^{\gamma R(q_i, d_i;\, \Theta)}}{e^{\gamma R(q_i, d_i;\, \Theta)} + \sum_{d' \in D(q_i)} e^{\gamma R(q_i, d';\, \Theta)}}, \qquad (4)$$

where, as before, $D(q_i)$ is a set of $n \geq 4$ irrelevant documents under $q_i$. Therefore, the aim of the DSSM is to find the functions $f$ and $g$, parameterized by $\Theta_f$ and $\Theta_g$ respectively, such that the loss function (4) is minimized. Since the $i$-th term of (4) can also be written as

$$L_i(\Theta) = \ln\Big(1 + \sum_{d' \in D(q_i)} e^{-\gamma \Delta_{d'}}\Big), \qquad (5)$$

where $\Delta_{d'} = R(q_i, d_i; \Theta) - R(q_i, d'; \Theta)$, we can see that, intuitively, minimizing $L_i(\Theta)$ jointly maximizes the differences in similarity between the relevant document and the irrelevant documents under each query.

In the DSSM, the functions $f$ and $g$ are represented by two neural networks. For both nets, we use the $\tanh$ function as the activation function. That is, if we denote the $l$-th layer as $(x_1^l, \ldots, x_{n_l}^l)$ and the $(l+1)$-th layer as $(x_1^{l+1}, \ldots, x_{n_{l+1}}^{l+1})$, then for each $i = 1, \ldots, n_{l+1}$,

$$x_i^{l+1} = \frac{1 - e^{-2 z_i^l}}{1 + e^{-2 z_i^l}}, \qquad (6)$$

where $z_i^l = \sum_{j=1}^{n_l} w_{i,j}^l x_j^l + w_{i,0}^l$. Note that the last layers for $f$ and $g$ are $y_q$ and $y_d$, respectively. The structure is illustrated in the following Figure 1.

[Figure 1. Illustration of the DSSM. For both $q$ and $d$, it maps the high-dimensional sparse word-occurrence vector into a low-dimensional dense semantic vector.]
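To make the training objective concrete, the following is a minimal numerical sketch of equations (3)-(5) for a single query. The function names, the toy 50-dimensional vectors, and the choices $\gamma = 10$ and $n = 4$ are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def cosine(a, b):
    # Equation (3): cosine similarity of two semantic vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def dssm_loss_one_query(y_q, y_d_pos, y_d_negs, gamma=10.0):
    """Equation (5) for one query: the negative log-probability of the
    relevant document under the softmax (2), written via the deltas."""
    r_pos = cosine(y_q, y_d_pos)
    deltas = np.array([r_pos - cosine(y_q, y_d) for y_d in y_d_negs])
    return float(np.log1p(np.sum(np.exp(-gamma * deltas))))

rng = np.random.default_rng(0)
y_q = rng.normal(size=50)                         # toy 50-dim semantic vectors
y_pos = y_q + 0.1 * rng.normal(size=50)           # a document close to the query
y_negs = [rng.normal(size=50) for _ in range(4)]  # n = 4 irrelevant documents
print(f"per-query loss: {dssm_loss_one_query(y_q, y_pos, y_negs):.4f}")
```

The loss is near zero when the relevant document is much closer to the query than all the irrelevant ones, and grows as any irrelevant document catches up.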
2.2. Text Representation

Since the dimension of the sparse word-occurrence vector representation of a text stream can be very high, because the number of possible words is unbounded, we apply the letter-tri-gram (LTG) based text stream representation for the purpose of dimension reduction. To illustrate the idea, consider the English text stream "Tom Cruise's movies". We first add "#" at the head and the tail of the text stream and remove all the symbols other than the 26 English letters a-z and the 10 digits 0-9; we then change all the letters to lower case to obtain "#tomcruisesmovies#". The idea of the LTG representation is to break the stream into (#to, tom, omc, mcr, cru, rui, uis, ise, ses, esm, smo, mov, ovi, vie, ies, es#), which can be represented by a vector in a $36^3 + 2 \times 36^2 = 49248$-dimensional space. The vector has 16 entries equal to 1 and all the other 49232 entries equal to 0. If the original word-based dictionary has 500k unique words, then the LTG representation gives a more than 10-fold reduction in dimensionality. Besides dimension reduction, another benefit of the LTG based representation is that morphological variants of the same text stream are mapped to close vectors in the LTG space. This is encouraging since a query can always appear in misspelled forms. Take, for example, #bananna# vs. #bannana#: they are two misspelled forms of the word #banana# and have the same LTG words #ba, ban, ana, nan, ann, nna, na#, while the correct spelling has the LTG words #ba, ban, ana, nan, na#, with ana occurring twice.

Formally, if we denote the LTG dictionary as $(t_1, \ldots, t_N)$, then a text stream can be represented as an $N$-dimensional vector $v = (v_1, \ldots, v_N)^\top$, where $v_j$ is the count of the LTG word $t_j$. If only the 26 English letters a-z and the 10 digits 0-9 are allowed, then $N = 49248$, which is the input dimension at the first layer.
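As a concrete illustration of the preprocessing just described, here is a small sketch that normalizes a text stream and builds its LTG count vector; the helper names are ours, and storing counts in a sparse dictionary rather than a dense 49248-dimensional array is an implementation choice, not the paper's.

```python
import re
from collections import Counter

def ltg_words(text):
    # Keep only a-z and 0-9, lower-case, and add boundary markers.
    s = "#" + re.sub(r"[^a-z0-9]", "", text.lower()) + "#"
    # Slide a window of width 3 over the stream.
    return [s[i:i + 3] for i in range(len(s) - 2)]

def ltg_vector(text):
    # Sparse count vector: LTG word -> count (v_j = count of t_j).
    return Counter(ltg_words(text))

print(ltg_words("Tom Cruise's movies"))
# ['#to', 'tom', 'omc', ..., 'ies', 'es#'] -- 16 LTG words
print(ltg_vector("bananna") == ltg_vector("bannana"))
# True: the two misspellings share exactly the same LTG words
```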
2.3. Parameter Estimation
We apply the parallelized stochastic gradient descent method [12] to minimize the loss function (4). The one-step update can be expressed as

$$\Theta^{(t)} = \Theta^{(t-1)} - \eta_t \nabla L(\Theta^{(t-1)}), \qquad (7)$$

where $\eta_t > 0$ is the learning rate at the $t$-th iteration. The gradient can be written as

$$\nabla L(\Theta) = \sum_i \sum_{d' \in D(q_i)} \alpha_{i, d'} \nabla \Delta_{d'}, \qquad (8)$$

where $\alpha_{i, d'} = \dfrac{-\gamma\, e^{-\gamma \Delta_{d'}}}{1 + \sum_{d'' \in D(q_i)} e^{-\gamma \Delta_{d''}}}$. Hence computing the gradient of the loss function $L(\Theta)$ reduces to computing the gradient of the similarity function $R(q, d; \Theta)$, which depends on the structures of the two neural nets that represent the functions $f$ and $g$, respectively. We adopt the structure shown in the following Figure 2.

[Figure 2. The structure of the neural network. The input layer corresponds to the LTG space that has dimension 49248, there are two intermediate layers of dimension 100, and the output layer corresponds to a semantic space that has dimension 50.]
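To connect equation (6) with the Figure 2 architecture, below is a minimal sketch of one tower's forward pass (49248 → 100 → 100 → 50 with tanh activations). The random weight initialization is arbitrary, and the dense matrix-vector products ignore the sparsity of the input, so this is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
# One tower (f or g): layer sizes from Figure 2.
sizes = [49248, 100, 100, 50]
# W[l] holds the weights w_{i,j}^l; b[l] holds the bias terms w_{i,0}^l.
W = [0.01 * rng.normal(size=(m, n)) for n, m in zip(sizes, sizes[1:])]
b = [np.zeros(m) for m in sizes[1:]]

def semantic_vector(v):
    """Map an LTG count vector v (dimension 49248) to the
    50-dimensional semantic vector via equation (6)."""
    x = v
    for Wl, bl in zip(W, b):
        x = np.tanh(Wl @ x + bl)  # tanh(z) = (1 - e^(-2z)) / (1 + e^(-2z))
    return x

v = np.zeros(49248)
v[rng.integers(0, 49248, size=16)] = 1.0  # a toy 16-trigram LTG vector
print(semantic_vector(v).shape)           # (50,)
```

Counting the weights and biases of this structure gives exactly the 49248×100 + 100 + 100×100 + 100 + 100×50 + 50 = 4,940,050 parameters per tower cited in Section 3.1.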
2.4. Integration
The value of the DSSM is expected in two aspects: one is document retrieval; the other is document ranking. Precisely, we expect that by using the DSSM based similarity scores, relevant documents that were previously missed due to a poor or nonexistent term match can be retrieved. We are also interested in seeing whether the ranker's performance can be further lifted when the DSSM based similarity scores are added as features in addition to the existing ones. Therefore, we integrate the DSSM in both the retrieval and the ranking. The following Figure 3 shows how the DSSM is used in our media search and ranking. Note that mapping a document $d$ to its vector representation $y_d$ can be performed offline, and since a query $q$ usually contains far fewer terms, mapping it to its vector representation $y_q$ can be performed in real time [7][9].

[Figure 3. The illustration of the DSSM in search and ranking. The documents in the document base {d} are mapped offline to their representations $y_d$; an incoming query $q$ is mapped in real time to $y_q$; the retrieved documents are then ordered by a ranker that uses the DSSM features.]
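The offline/online split described above can be sketched as follows. The `embed` function is a stand-in for the trained DSSM towers, and the rule of keeping only documents with a positive similarity score anticipates the retrieval criterion used in Section 3.3; both are our simplifications.

```python
import numpy as np

def embed(text, dim=50):
    # Stand-in for the trained towers f and g; a real system would run the
    # LTG vector of `text` through the network of Figure 2.
    seed = abs(hash(text)) % (2**32)
    return np.random.default_rng(seed).normal(size=dim)

# Offline: pre-compute semantic vectors for the whole document base.
docs = ["The Last Stand, movie, 2013", "Sabotage, movie, 2014",
        "Generation Iron, movie, 2013"]
doc_vecs = {d: embed(d) for d in docs}

def dssm_retrieve(query):
    """Online: map the query to y_q in real time, keep documents whose
    cosine similarity is positive, and return them best-first."""
    y_q = embed(query)
    scored = []
    for d, y_d in doc_vecs.items():
        r = float(y_q @ y_d / (np.linalg.norm(y_q) * np.linalg.norm(y_d)))
        if r > 0:  # retrieval criterion: positive DSSM similarity
            scored.append((r, d))
    return sorted(scored, reverse=True)

print(dssm_retrieve("arnold schwarzenegger"))
```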
3. Experiments

We trained three models in our experiments. First, we trained a baseline ranker that does not have the DSSM similarity scores as features. Second, we trained a DSSM and used it to compute $R(q, d)$ for each $(q, d)$-pair in the ranker's training data. Third, we added $R(q, d)$ in different forms as similarity features to the existing features and trained a ranker that we call the DSSM ranker. Since a media document usually contains a title, a description, an actor list, and a genre, we calculate four similarity scores $R_{\text{title}}$, $R_{\text{description}}$, $R_{\text{actors}}$, and $R_{\text{genre}}$, and consequently add four new similarity features to the existing feature set. Note that the new features should be added to both the training data set and the test data set.

3.1. DSSM Training

Training a DSSM needs a huge amount of data because the size of the parameter space can be very large. As Figure 2 shows, there are 49248×100 + 100 + 100×100 + 100 + 100×50 + 50 = 4,940,050 parameters in $\Theta_f$ or $\Theta_g$. However, there are only about 298,184 labeled query-document pairs with 17,735 unique queries in our training data for the Xbox ranker. These query-document pairs came from the Xbox click-logs from December 2013 to March 2014 and were labeled using the method in [11], which combines human judgments and click-signals; the label gains range from 0 to 15. We therefore adopted the strategy of leveraging Bing's large web search logs and extracted from the year 2013's logs about 20 million clicked query-document pairs as positive examples. By random reshuffling, we obtained an additional 80 million query-document pairs that are treated as negative examples. In total, there are 100 million pairs, with one clicked document and four unclicked documents under each query. We first trained a DSSM using these 100 million pairs to obtain what we call the "seed model"; we then used the 298k labeled query-document pairs in the Xbox data to tune the seed model. Among the 298k labeled pairs, we treat those with label gains no less than 7 as positive examples and the others as negative examples. Again by randomly reshuffling the 298k query-document pairs, we obtained an additional 700k query-document pairs as negative examples, so that we have about 1 million pairs, with one positive document and four negative documents under each query, for the tuning.
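A minimal sketch of the reshuffling strategy just described: each positive (query, clicked document) pair is matched with four negatives drawn from documents clicked under other queries. The 1:4 positive-to-negative ratio follows the paper; the exact sampling scheme is our assumption.

```python
import random

def build_training_groups(positive_pairs, n_neg=4, seed=0):
    """positive_pairs: list of (query, clicked_document) pairs.
    Returns one group per query: the clicked document plus n_neg
    documents reshuffled from other queries' clicks."""
    rng = random.Random(seed)
    all_docs = [d for _, d in positive_pairs]
    groups = []
    for q, d_pos in positive_pairs:
        negs = []
        while len(negs) < n_neg:
            d = rng.choice(all_docs)
            if d != d_pos:  # do not reuse the positive as a negative
                negs.append(d)
        groups.append((q, d_pos, negs))
    return groups

pairs = [("arnold schwarzenegger", "Sabotage"), ("tom cruise", "Oblivion"),
         ("bruce willis", "Die Hard"), ("matt damon", "Elysium")]
for q, pos, negs in build_training_groups(pairs):
    print(q, "->", pos, "| negatives:", negs)
```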
3.2. Ranker Training
For the original training data of the Xbox ranker, which contains the 298k labeled query-document pairs, there are about 10k existing features built from basic information on word matching, clicks, usage, and release date, and from the advanced models including WTM, LSA, PLSA, and LDA. To show the effect of the DSSM, we trained a "control" ranker as the baseline with only the existing features, and a "treatment" ranker with the existing features plus the four new semantic similarity features $R_{\text{title}}$, $R_{\text{description}}$, $R_{\text{actors}}$, and $R_{\text{genre}}$. We used gradient-boosted decision trees [3][10] and the LambdaRank algorithm [2] for learning the rankers. We used 1000 gradient-boosted decision trees, and for each individual tree we required that there are at least 200 data points in a node.
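The rankers themselves were trained with Microsoft's gradient-boosted decision trees and LambdaRank [2][3][10]. As a rough open-source analogue (our substitution, not the authors' tooling), LightGBM's lambdarank objective accepts the same kind of configuration, with 1000 trees and at least 200 data points per node:

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
n_queries, docs_per_query, n_features = 200, 10, 20  # toy stand-ins
X = rng.normal(size=(n_queries * docs_per_query, n_features))
y = rng.integers(0, 16, size=n_queries * docs_per_query)  # label gains 0..15
group = [docs_per_query] * n_queries  # number of documents per query

# 1000 trees and >= 200 data points per node, mirroring the paper's settings;
# a linear label_gain matches the paper's 0..15 gain scale.
ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=1000,
                        min_child_samples=200, label_gain=list(range(16)))
ranker.fit(X, y, group=group)
print(ranker.predict(X[:docs_per_query]))  # scores for one query's documents
```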
3.3. Evaluation

Our evaluation approach is to test the DSSM in both the retrieval and the ranking. We set up a baseline by retrieving documents per query using the inverted-file-index method and ranking the retrieved documents using the baseline ranker. To compare with the baseline, we first changed the document retrieval method to the criterion that the maximum DSSM similarity score is positive, while still using the baseline ranker to rank the retrieved documents. We then changed the baseline ranker to the advanced one with the four new semantic similarity features. For both retrieval methods, we retrieved documents for each of 7029 unique queries sampled from the Xbox click-logs from April to May 2014 and obtained 37,948 query-document pairs. These query-document pairs were all labeled by human judges with 4-level labels: "Excellent", "Good", "Fair", and "Bad". This type of label reflects relevance, and we use 15, 7, 3, and 0 as the label gains for Excellent, Good, Fair, and Bad, respectively.

Since the evaluation data with human-generated relevance labels is limited due to the high cost of recruiting and training human judges, we also evaluate the rankers' quality from another perspective. We sampled 22,828 unique queries from the Xbox click-logs from April to May 2014 and, for both retrieval methods, retrieved documents for each sampled query to obtain 181,692 query-document pairs in total. For each query-document pair, we calculate the number of unique users who clicked under the query from April to May 2014 and use that number as the label for the pair. This type of label measures per-query user engagement. Obviously, this set of evaluation data can easily grow much larger than the one with only human labels.

For both types of labels, the mean normalized discounted cumulative gain (NDCG) [8] can be calculated. Specifically, we calculate the mean NDCG at the top-$k$ position as

$$\overline{\mathrm{NDCG}}_k = \frac{1}{m} \sum_{q} \left( \sum_{i=1}^{k} \frac{g_i^{(q)}}{\log_2(1+i)} \right) \bigg/ \left( \sum_{i=1}^{k} \frac{\bar{g}_i^{(q)}}{\log_2(1+i)} \right), \qquad (9)$$

where $\bar{g}_1^{(q)} \geq \bar{g}_2^{(q)} \geq \cdots$ denote the descending order of $g_1^{(q)}, g_2^{(q)}, \ldots$, which respectively are the label scores of the documents at positions $1, 2, \ldots$ under the query $q$; the sum runs over the queries whose results are not all zero-scored, and $m$ is the total number of such queries.
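A small sketch of equation (9), under the stated convention that queries whose results all have zero-score labels are skipped; the example gains are made up.

```python
import math

def ndcg_at_k(gains, k):
    """Equation (9) for one query: DCG of the gains in ranked order over
    the DCG of the ideally (descending) ordered gains, both cut at k."""
    dcg = sum(g / math.log2(1 + i) for i, g in enumerate(gains[:k], start=1))
    ideal = sorted(gains, reverse=True)
    idcg = sum(g / math.log2(1 + i) for i, g in enumerate(ideal[:k], start=1))
    return dcg / idcg

def mean_ndcg_at_k(per_query_gains, k):
    # Skip queries whose results all have zero-score labels.
    valid = [g for g in per_query_gains if any(x > 0 for x in g)]
    return sum(ndcg_at_k(g, k) for g in valid) / len(valid)

# Gains in ranked order for three queries (15/7/3/0 label gains).
queries = [[7, 15, 0, 3], [15, 7, 7, 0], [0, 0, 0, 0]]  # last one is skipped
print(round(mean_ndcg_at_k(queries, k=4), 4))
```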
3.4. Results

The main experimental results are summarized in Table 1 and Table 2. Table 1 corresponds to the evaluation data with human labels; Table 2 corresponds to the evaluation data labeled with the numbers of users who clicked. The mean NDCG numbers are all relative: for each $k = 1, 4, 8$, the mean NDCG value of basic retrieval + baseline ranker is scaled to 1, and the mean NDCG values of DSSM retrieval + baseline ranker and DSSM retrieval + DSSM ranker are the corresponding multipliers. It can be seen that the DSSM improves the results of both the retrieval and the ranking under the two NDCG metrics. We also performed significance tests across all queries using the paired t-test and the two-sample t-test. All the p-values are far less than 0.01.

Table 1. NDCG values by human labels

                                      Mean NDCG_1   Mean NDCG_4   Mean NDCG_8
Basic retrieval + Baseline ranker     1.0000        1.0000        1.0000
DSSM retrieval + Baseline ranker      1.0200        1.0143        1.0121
DSSM retrieval + DSSM ranker          1.0438        1.0414        1.0289

Number of queries: 7029
Number of query-document pairs: 37948

Table 2. NDCG values by number of users who clicked

                                      Mean NDCG_1   Mean NDCG_4   Mean NDCG_8
Basic retrieval + Baseline ranker     1.0000        1.0000        1.0000
DSSM retrieval + Baseline ranker      1.0186        1.0095        1.0121
DSSM retrieval + DSSM ranker          1.0361        1.0235        1.0329

Number of queries: 22828
Number of query-document pairs: 181692

To illustrate by example, the following Table 3 lists the top 10 results for the query "arnold schwarzenegger" under the two settings: basic retrieval + baseline ranker vs. DSSM retrieval + DSSM ranker. Under the setting basic retrieval + baseline ranker, the new movie "Sabotage", starring Arnold Schwarzenegger, is not successfully retrieved. On the other hand, the DSSM can capture the relation between "arnold schwarzenegger" and the character name "John Breacher Wharton" and successfully retrieve the movie. Moreover, the movie is ranked the highest by the advanced ranker. Another movie, "Generation Iron", a documentary on top bodybuilders, is also not successfully retrieved under the setting basic retrieval + baseline ranker. Again, the DSSM can capture the connection between Arnold Schwarzenegger and bodybuilding and therefore retrieve the movie; this movie is also ranked high.

For the very old movie "Conan the Barbarian", although it is a relevant result, it is much less satisfactory compared with the other results on the list. The DSSM features, which have the relevance label gain information incorporated, impose a certain penalty on this movie, so that it is excluded from the top 10 results under the setting DSSM retrieval + DSSM ranker.

Table 3. A sample comparison of two settings, Query = "arnold schwarzenegger"

Rank   Basic retrieval + Baseline ranker                 DSSM retrieval + DSSM ranker
1      The Last Stand, movie, 2013                       Sabotage, movie, 2014
2      Terminator 2: Judgment Day, movie, 1991           Escape Plan, movie, 2013
3      Total Recall, movie, 1990                         The Last Stand, movie, 2013
4      Terminator 3: Rise of the Machines, movie, 2003   Escape Plan (+Bonus), movie, 2013
5      Escape Plan, movie, 2013                          Generation Iron, movie, 2013
6      Last Stand (En Espanol), movie, 2013              The Expendables 2, movie, 2012
7      The Last Stand (Xbox Exclusive), movie, 2013      Total Recall, movie, 1990
8      The Last Action Hero, movie, 1993                 The Last Stand (Xbox Exclusive), movie, 2013
9      The Expendables 2, movie, 2012                    Terminator 2: Judgment Day, movie, 1991
10     Conan the Barbarian, movie, 1982                  Terminator 3: Rise of the Machines, movie, 2003

4. Conclusions

In this paper, we presented some results of applying the DSSM to build similarity features for document retrieval and ranking in media search. Theoretically, the advantage of the DSSM lies in capturing the relatedness between a query and a document at the semantic level rather than at the term level. Therefore, for a query, a relevant document with a poor or even no term match can still be retrieved. In addition, the DSSM accommodates supervised learning, so that the label information can be reflected in the similarity scores. To show the benefits of the DSSM similarity features, we set up a baseline that uses the inverted-file-index method for document retrieval and a ranker with only the existing features for ranking the retrieved documents. We first changed the basic retrieval method to the DSSM similarity criterion and then changed the basic ranker to the advanced one with the DSSM similarity scores as new features. We obtained the DSSM by first training a seed model from the large web search logs and then tuning the model using the ranker's training data. We used two evaluation metrics: one is the mean NDCG computed from human labels; the other is the mean NDCG computed from the numbers of users who clicked. The first measures relevance; the second measures user engagement. Our experiments have shown that the DSSM improves both the document retrieval and the ranking results.
Acknowledgments

We thank Xinying Song for building the DSSM training and evaluation pipeline. We thank the Microsoft Research Group for providing the GPU computing environment for training and tuning the DSSM. We also thank the Microsoft Aether team for providing the computing environment for training and evaluating rankers.

References

[1] D. M. Blei, A. Y. Ng, and M. I. Jordan. "Latent Dirichlet allocation," J. Machine Learning Research, 3, pp. 993-1022, 2003.
[2] C. J. Burges. "From RankNet to LambdaRank to LambdaMART: An overview," Technical Report MSR-TR-2010-82, Microsoft Research, 2010.
[3] K. Chen, R. Lu, C. K. Wong, G. Sun, L. Heck, and B. Tseng. "Trada: tree based ranking function adaptation," In CIKM, 2008.
[4] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. "Indexing by latent semantic analysis," J. American Society for Information Science, 41(6), pp. 391-407, 1990.
[5] S. T. Dumais, T. A. Letsche, M. L. Littman, and T. K. Landauer. "Automatic cross-linguistic information retrieval using latent semantic indexing," In AAAI-97 Spring Symposium Series: Cross-Language Text and Speech Retrieval, 1997.
[6] T. Hofmann. "Probabilistic latent semantic indexing," In SIGIR, 1999.
[7] P. S. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck. "Learning deep structured semantic models for web search using clickthrough data," In CIKM, 2013.
[8] K. Järvelin and J. Kekäläinen. "Cumulated gain-based evaluation of IR techniques," ACM Transactions on Information Systems (TOIS), 20(4), pp. 422-446, 2002.
[9] Y. Shen, X. He, J. Gao, L. Deng, and G. Mesnil. "A convolutional latent semantic model for web search," Technical Report MSR-TR-2014-55, Microsoft Research, 2014.
[10] J. Ye, J. H. Chow, J. Chen, and Z. Zheng. "Stochastic gradient boosted distributed decision trees," In CIKM, 2009.
[11] X. Ye, J. Li, Z. Qi, B. Peng, and D. Massey. "A generative model for generating relevance labels from human judgments and click-logs," In CIKM, pp. 1907-1910, 2014.
[12] M. Zinkevich, M. Weimer, L. Li, and A. J. Smola. "Parallelized stochastic gradient descent," In NIPS, 2010.
