Modeling contextual information in search queries and documents is a long-standing research topic in IR [11][25][12][26][2][22][24]. Classical retrieval models, such as TF-IDF and BM25, use a bag-of-words representation and cannot effectively capture the contextual information of a word. Topic models learn the topic distribution of a word by considering word occurrence information within a document or a sentence. However, the contextual information captured by such models is
often too coarse-grained to be effective for the retrieval task. For example, the word "office" in "office excel" and in "apartment office" represents two very different search intents when used in search queries, yet it is likely to be projected to the same topic. As an alternative, retrieval methods that directly model phrases (or word n-grams) and term dependencies are proposed in [12][25][26]. For example, in [25], the Markov Random Field (MRF) is used to model dependencies among terms (e.g., term n-grams and skip-grams) of the query and the document for ranking, while in [26] a latent concept expansion (LCE) model is proposed, which leverages the term-dependent information by adding n-grams and unordered n-grams as features into the log-linear ranking model. In [12], a phrase-based translation model was proposed to learn the translation probability of a multi-term phrase in a query given a phrase in a document. Since phrases capture richer contextual information than words, more precise translations can be determined. However, the phrase translation model can only score phrase-to-phrase pairs observed in the clickthrough training data, and thus generalizes poorly to new phrases.
In this study, we develop a new latent semantic model based on the convolutional neural network with a convolution-pooling structure, called the convolutional latent semantic model (CLSM), to capture the important contextual information for latent semantic modeling. Instead of using an input representation based on bag-of-words, the new model views a query or a document¹ as a sequence of words with rich contextual structure, and it retains maximal contextual information in its projected latent semantic representation. The CLSM first projects each word within its context to a low-dimensional continuous feature vector, which directly captures the contextual features at the word n-gram level (detailed in Section 3.3). Second, instead of summing over all word n-gram features uniformly, the CLSM discovers and aggregates only the salient semantic concepts to form a sentence-level feature vector (detailed in Section 3.4). Then, the sentence-level feature vector is further fed to a regular feed-forward neural network, which performs a non-linear transformation to extract high-level semantic information of the word sequence. In training, the parameters of the CLSM are learned on clickthrough data.

Our research contributions can be summarized as follows:

• We propose a novel CLSM that captures both the word n-gram level and sentence-level contextual structures for IR using carefully designed convolution and pooling operations;

• We carry out an extensive experimental study on the proposed model whereby several state-of-the-art semantic models are compared, and we achieve a significant performance improvement on a large-scale real-world Web search data set;

• We perform an in-depth case analysis on the capacity of the proposed model, through which the strength of the CLSM is clearly demonstrated.
¹ In modern search engines, a Web document is described by multiple fields [12][38], including title, body, anchor text, etc. In our experiments, we only used the title field of a Web document for ranking. In addition to providing simplicity for fast experimentation, our decision is motivated by the observation that the title field gives better single-field retrieval results than the body, although it is much shorter (as shown in Table 4). Thus it can serve as a reasonable baseline in our experiments. Nevertheless, our methods are not limited to the title field, and can be easily applied to the multi-field description.
2. RELATED WORK

2.1 Modeling Term Dependencies for IR
Although most traditional retrieval models assume the occurrences of terms to be completely independent, contextual information is crucial for detecting the particular search intent of a query term. Thus, research in this area has focused on capturing term dependencies. Early work tries to relax the independence assumption by including phrases, in addition to single terms, as indexing units [6][36]. Phrases are defined by collocations (adjacency or proximity) and selected on statistical grounds, possibly with some syntactic knowledge. Unfortunately, the experiments did not provide a clear indication of whether retrieval effectiveness can be improved in this way.

Recently, within the framework of language models for IR, various approaches that go beyond unigrams have been proposed to capture certain term dependencies, notably the bigram and trigram models [35], the dependence model [11], and the MRF-based models [25][26]. These models have shown the benefit of capturing dependencies. However, they focus on the utilization of phrases as indexing units, rather than on phrase-to-phrase semantic relationships.

The translation model-based approach proposed in [12] tries to extract phrase-to-phrase relationships from clickthrough data. Such relationships are expected to be more effective in bridging the gap between queries and documents. In particular, the phrase translation model learns a probability distribution over "translations" of multi-word phrases from documents to queries. Assuming that queries and documents are composed using two different "languages", the phrases can be viewed as bilingual phrases (or bi-phrases in short), which are consecutive multi-term sequences that can be translated from one language to another as units. In [12], it was shown that the phrase model is more powerful than word translation models [3] because words in the relationships are considered with some context words within a phrase. Therefore, more precise translations can be determined for phrases than for words. Recent studies show that this approach is highly effective when large amounts of clickthrough data are available for training [12][15]. However, as discussed before, the phrase-based translation model can only score phrase pairs observed in the training data, and cannot generalize to new phrases. In contrast, the CLSM can generalize to model any context. In our experiments reported in Section 5, we will compare the CLSM with the word-based and phrase-based translation models.
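To make the generalization limitation concrete, the following toy Python sketch shows how a lookup-based phrase translation score assigns zero probability to any phrase pair unseen in training. The phrase table, its probabilities, and the phrase pairs below are invented for illustration and are not the estimation procedure of [12].

```python
# Toy phrase-translation lookup (illustration only, not the model of [12]).
phrase_table = {
    # P(query phrase | document phrase), estimated from clickthrough data
    ("office excel", "microsoft excel download"): 0.12,
    ("apartment office", "leasing office hours"): 0.08,
}

def phrase_score(query_phrase: str, doc_phrase: str) -> float:
    # Unseen phrase pairs receive zero probability: the model cannot
    # generalize beyond pairs observed in the training data.
    return phrase_table.get((query_phrase, doc_phrase), 0.0)

print(phrase_score("office excel", "microsoft excel download"))  # 0.12
print(phrase_score("office word", "microsoft word download"))    # 0.0 (unseen)
```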
2.2 Latent Semantic Models
The most well-known linear projection model for IR is LSA [9]. It models the whole document collection using a $d \times n$ document-term matrix $C$, where $n$ is the number of documents and $d$ is the number of word types. $C$ is first factored into the product of three matrices using singular value decomposition (SVD) as $C = U \Sigma V^T$, where the orthogonal matrices $U$ and $V$ are called term and document vectors, respectively, and the diagonal elements of $\Sigma$ are singular values in descending order. Then, a low-rank matrix approximation of $C$ is generated by retaining only the $k$ biggest singular values in $\Sigma$. Now, a document (or a query) represented by a term vector $d$ can be mapped to a low-dimensional concept vector $\hat{d} = A^T d$, where the $d \times k$ matrix $A = U_k \Sigma_k^{-1}$ is called the projection matrix. In document search, the relevance score between a query and a document, represented respectively by term vectors $q$ and $d$, is assumed to be proportional to the cosine similarity score of the corresponding concept vectors $\hat{q}$ and $\hat{d}$, computed according to the projection matrix $A$.
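As a concrete illustration, the following minimal numpy sketch (with toy random data; the matrix names follow the notation above) projects term vectors through $A = U_k \Sigma_k^{-1}$ and scores a query-document pair by cosine similarity:

```python
# Minimal numpy sketch of LSA projection and cosine scoring (toy sizes;
# a real C would be built from a document collection).
import numpy as np

d, n, k = 1000, 200, 50            # vocabulary size, #documents, latent dim
C = np.random.rand(d, n)           # d x n document-term matrix (toy data)

U, s, Vt = np.linalg.svd(C, full_matrices=False)
A = U[:, :k] / s[:k]               # projection matrix A = U_k * Sigma_k^{-1}

def concept_vector(term_vec):
    return A.T @ term_vec          # map a term vector to the k-dim concept space

def relevance(q, doc):
    qh, dh = concept_vector(q), concept_vector(doc)
    return float(qh @ dh / (np.linalg.norm(qh) * np.linalg.norm(dh)))

q   = np.random.rand(d)            # toy query term vector
doc = np.random.rand(d)            # toy document term vector
print(relevance(q, doc))           # cosine similarity in concept space
```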
Generative topic models are also widely used for IR. They include Probabilistic Latent Semantic Analysis (PLSA) [19] and its extensions, such as Latent Dirichlet Allocation (LDA) [4][39]. PLSA assumes that each document has a multinomial distribution over topics (called the document-topic distribution), where each of the topics is in turn a multinomial distribution over words (called the topic-word distribution). The relevance of a query to a document is assumed to be proportional to the likelihood of generating the query by that document. Recently, topic models have been extended so that they can be trained on clickthrough data. For example, a generative model called the Bi-Lingual Topic Model (BLTM) is proposed for Web search in [15], which assumes that a query and its clicked document share the same document-topic distribution. It is shown that, by learning the model on clicked query-title pairs, the BLTM gives superior performance over PLSA [15].
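As a rough illustration of the query-likelihood view above, the sketch below scores a query under a PLSA-style model; the document-topic distribution theta and the topic-word distributions phi are toy numbers invented for this example, not learned parameters:

```python
# Toy query likelihood under a PLSA-style model (illustration only).
import numpy as np

theta = np.array([0.7, 0.3])             # P(topic | document)
phi = {                                  # P(word | topic), one entry per word
    "office": np.array([0.10, 0.02]),
    "excel":  np.array([0.05, 0.001]),
}

def query_likelihood(query_words):
    # Relevance ~ product over query words of sum_z P(w|z) * P(z|d).
    return float(np.prod([phi[w] @ theta for w in query_words]))

print(query_likelihood(["office", "excel"]))
```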
2.3 Neural-Network-based Semantic Models
Deep architectures have been shown to be highly effective in discovering from training data the hidden structures and features at different levels of abstraction useful for a variety of tasks [32][18][20][37][7][34]. Among them, the DSSM proposed in [20] is most relevant to our work. The DSSM uses a feed-forward neural network to map the raw term vector (i.e., with the bag-of-words representation) of a query or a document to its latent semantic vector, where the first layer, also known as the word hashing layer, converts the term vector to a letter-trigram vector to scale up the training. The final layer's neural activities form the
vector representation in the semantic space. In document retrieval,
the relevance score between a document and a query is the cosine
similarity of their corresponding semantic concept vectors, as in
Eq. (1). The DSSM is reported to give superior IR performance to
other semantic models.
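The word hashing step can be sketched as follows; this is a simplified illustration, since in the DSSM the letter-trigram vocabulary is built from the training corpus rather than accumulated in a generic counter:

```python
# Minimal sketch of letter-trigram word hashing (DSSM word hashing layer).
from collections import Counter

def letter_trigrams(word: str):
    w = f"#{word}#"                      # boundary markers, e.g. #good#
    return [w[i:i + 3] for i in range(len(w) - 2)]

def hash_term_vector(words):
    # A sparse letter-trigram count vector for a bag of words.
    return Counter(t for w in words for t in letter_trigrams(w))

print(letter_trigrams("good"))           # ['#go', 'goo', 'ood', 'od#']
print(hash_term_vector(["office", "excel"]))
```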
However, since the DSSM treats a query or a document as a
bag of words, the fine-grained contextual structures within the
query (or the document) are lost. In contrast, the CLSM is
designed to capture important word n-gram level and sentence-
level contextual structures that the DSSM does not. Specifically,
the CLSM directly represents local contextual features at the word
n-gram level; i.e., it projects each raw word n-gram to a low-
dimensional feature vector where semantically similar word n-
grams are projected to vectors that are close to each other in this
feature space. Moreover, instead of simply summing all local
word-n-gram features evenly, the CLSM performs a max pooling
operation to select the highest neuron activation value across all
word n-gram features at each dimension, so as to extract the
sentence-level salient semantic concepts. Meanwhile, for any
sequence of words, this operation forms a fixed-length sentence-
level feature vector, with the same dimensionality as that of the
local word n-gram features.
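The max-pooling operation just described amounts to a per-dimension max over the stacked local feature vectors, as in this toy numpy sketch (the feature values are invented; real vectors come from the convolutional layer, and the CLSM uses 300 dimensions rather than 4):

```python
# Per-dimension max over local word-n-gram feature vectors.
import numpy as np

# One feature vector per word n-gram (toy 4-dim values).
ngram_features = np.array([
    [0.1, 0.9, 0.2, 0.0],
    [0.7, 0.1, 0.3, 0.2],
    [0.2, 0.4, 0.8, 0.1],
])

# Fixed-length sentence-level vector: the max activation per dimension,
# regardless of how many n-grams the input sequence contains.
sentence_vector = ngram_features.max(axis=0)
print(sentence_vector)  # [0.7 0.9 0.8 0.2]
```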
Deep convolutional neural networks (CNN) have been applied successfully in speech, image, and natural language processing [8][41][7]. The work presented in this paper is the first successful attempt at applying CNN-like methods to IR. One main difference from the conventional CNN is that the convolution operation in our CLSM is applied implicitly on the letter-trigram representation space with the learned convolutional matrix $W_c$. The explicit convolution, with a "receptive field" of a size of three words as shown in Figure 1, is accomplished by the letter-trigram matrix $W_f$, which is fixed and not learned. Other deep learning approaches that are related to the CLSM include word-to-vector mapping (also known as word embedding) using deep neural networks learned on large amounts of raw text [1][27]. In [28], the vector representation of a word sequence is computed as a summation of the embedding vectors of all words. An alternative approach is proposed in [34], where a parsing tree for a given sentence is extracted and then mapped to a fixed-length representation using recursive auto-encoders. Recently, a neural network based DeepMatch model has also been proposed to directly capture the correspondence between two short texts without explicitly relying on semantic vector representations [23].
3. EXTRACTING CONTEXTUAL FEATURES FOR IR USING CLSM

3.1 The CLSM Architecture
The architecture of the CLSM is illustrated in Figure 1. The model contains: (1) a word-n-gram layer obtained by running a contextual sliding window over the input word sequence (i.e., a query or a document); (2) a letter-trigram layer that transforms each word-trigram into a letter-trigram representation vector; (3) a convolutional layer that extracts contextual features for each word with its neighboring words defined by a window, e.g., a word-n-gram; (4) a max-pooling layer that discovers and combines salient word-n-gram features to form a fixed-length sentence-level feature vector; and (5) a semantic layer that extracts a high-level semantic feature vector for the input word sequence. In what follows, we describe these components in detail, using the annotation illustrated in Figure 1.

Figure 1: The CLSM maps a variable-length word sequence to a low-dimensional vector in a latent semantic space. A word contextual window size (i.e., the receptive field) of three is used in the illustration. Convolution over the word sequence via the learned matrix $W_c$ is performed implicitly via the earlier layer's mapping with a local receptive field. The dimensionalities of the convolutional layer and the semantic layer are set to 300 and 128 in the illustration, respectively. The max operation across the sequence is applied to each of the 300 feature dimensions separately. (Only the first dimension is shown to avoid figure clutter.)
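To make the five layers concrete, below is a minimal, untrained numpy sketch of the forward pass. The random matrices stand in for the learned $W_c$ and $W_s$; the hashing-based letter-trigram indexing, the boundary padding tokens, and the toy trigram dimensionality are simplifications introduced for illustration only:

```python
# Minimal numpy sketch of the five CLSM layers (untrained toy weights;
# the real W_c and W_s are learned on clickthrough data, and the
# letter-trigram vocabulary is built from the corpus).
import numpy as np

rng = np.random.default_rng(0)

TRIGRAM_DIM = 500                      # toy letter-trigram vocab (real: ~30k)
CONV_DIM, SEM_DIM, WIN = 300, 128, 3   # dimensions as in Figure 1

W_c = rng.standard_normal((WIN * TRIGRAM_DIM, CONV_DIM)) * 0.01  # convolution
W_s = rng.standard_normal((CONV_DIM, SEM_DIM)) * 0.01            # semantic

def letter_trigram_vector(word):
    # (2) letter-trigram layer: hash one word into a count vector.
    w = f"#{word}#"
    v = np.zeros(TRIGRAM_DIM)
    for i in range(len(w) - 2):
        v[hash(w[i:i + 3]) % TRIGRAM_DIM] += 1   # toy hashing, not a real vocab
    return v

def clsm_forward(words):
    # (1) word-n-gram layer: sliding window of WIN words with boundary padding.
    padded = ["<s>"] + list(words) + ["</s>"]
    tri = [letter_trigram_vector(w) for w in padded]
    ngrams = [np.concatenate(tri[i:i + WIN]) for i in range(len(tri) - WIN + 1)]
    # (3) convolutional layer: tanh projection of each word-n-gram.
    h = np.tanh(np.stack(ngrams) @ W_c)          # (#ngrams, CONV_DIM)
    # (4) max-pooling layer: per-dimension max -> fixed-length vector.
    v = h.max(axis=0)                            # (CONV_DIM,)
    # (5) semantic layer: non-linear projection to the latent space.
    return np.tanh(v @ W_s)                      # (SEM_DIM,)

def relevance(query, title):
    yq, yd = clsm_forward(query), clsm_forward(title)
    return float(yq @ yd / (np.linalg.norm(yq) * np.linalg.norm(yd)))

print(relevance(["office", "excel"], ["microsoft", "excel", "download"]))
```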
3.2 Letter-trigram based Word-n-gram Representation