
Neural Networks for Information Retrieval


1
Neural Networks for Information Retrieval

Hosein Azarbonyad1 Alexey Borisov1,2 Mostafa Dehghani1 Tom Kenter3


Bhaskar Mitra4,5 Maarten de Rijke1 Christophe Van Gysel1

1 University of Amsterdam
2 Yandex

3 Booking.com

4 University College London


5 Microsoft

NN4IR@ECIR2018
2
Who’s here today and who are we?

Tom Kenter (Booking.com)
Bhaskar Mitra (Microsoft & UCL)
Maarten de Rijke (U. Amsterdam)
Christophe Van Gysel (U. Amsterdam)

3
Aims

Describe the use of neural network based methods in information retrieval

Compare architectures

Summarize where the field is, what seems to work, what seems to fail

Provide an overview of applications and directions for future development of neural methods in information retrieval

4
Structure of the tutorial

Morning
1. Preliminaries
2. Semantic matching
3. Learning to rank
4. Entities

Afternoon
5. Modeling user behavior
6. Generating responses
7. Recommender systems
8. Industry insights
9. Q & A

5
Materials

Slides available at http://nn4ir.com/ecir2018

Bibliography available at http://nn4ir.com/ecir2018

Survey 1 An introduction to neural information retrieval [Mitra and Craswell, 2017]


Survey 2 Neural information retrieval: at the end of the early years [Onal et al., 2017]

6
Outline

Morning program
Preliminaries
Semantic matching
Learning to rank
Entities

Afternoon program
Modeling user behavior
Generating responses
Recommender systems
Industry insights
Q&A

7
Outline

Morning program
Preliminaries
Semantic matching
Learning to rank
Entities

Afternoon program
Modeling user behavior
Generating responses
Recommender systems
Industry insights
Q&A

8
Outline
Morning program
Preliminaries
Feedforward neural network
Distributed representations
Recurrent neural networks
Sequence-to-sequence models
Convolutional neural networks
Semantic matching
Learning to rank
Entities
Afternoon program
Modeling user behavior
Generating responses
Recommender systems
Industry insights
Q&A
9
Preliminaries
Multi-layer perceptron a.k.a. feedforward neural network

[Figure: multi-layer perceptron. The input x = (x1, x2, x3, x4) feeds, through a weight matrix, into a hidden layer, which feeds, through another weight matrix, into the output layer. Node j at level i, x_{i,j}, combines the nodes x_{i-1,1}, ..., x_{i-1,4} of the previous level using weights and an activation function φ, e.g., the sigmoid 1/(1 + e^{-o}). The output/prediction ŷ = (ŷ1, ŷ2, ŷ3) is compared against the target y = (y1, y2, y3) with a cost function, e.g., ½(ŷ - y)².]

10
Preliminaries
Multi-layer perceptron a.k.a. feedforward neural network

[Figure: the same network, with the forward pass written out in matrix notation.]

input layer: x1 (a 1 × 4 row vector)

hidden layer: o1 = x1 W1, with [1 × 4] · [4 × 4] = [1 × 4]; activation x2 = σ(o1)

output layer/predictions: o2 = x2 W2, with [1 × 4] · [4 × 4] = [1 × 4]; activation x3 = σ(o2)

The predictions x3 are compared against the ground truth y, where σ is an activation function, e.g., the sigmoid 1/(1 + e^{-o}).
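As a minimal sketch (not part of the original slides), the forward pass above can be written in a few lines of NumPy; the weight values and the one-hot target below are made up for illustration.

```python
# Minimal sketch of the forward pass on this slide, using NumPy.
# Shapes and the sigmoid activation follow the slide; the weights are made up.
import numpy as np

def sigmoid(o):
    return 1.0 / (1.0 + np.exp(-o))

rng = np.random.default_rng(0)
x1 = rng.normal(size=(1, 4))      # input layer, shape [1 x 4]
W1 = rng.normal(size=(4, 4))      # input -> hidden weights
W2 = rng.normal(size=(4, 4))      # hidden -> output weights

x2 = sigmoid(x1 @ W1)             # hidden layer activations, [1 x 4]
x3 = sigmoid(x2 @ W2)             # output layer / predictions, [1 x 4]

y = np.array([[0.0, 1.0, 0.0, 0.0]])   # (hypothetical) one-hot ground truth
cost = 0.5 * np.sum((x3 - y) ** 2)     # squared-error cost from the previous slide
```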

11
Outline
Morning program
Preliminaries
Feedforward neural network
Distributed representations
Recurrent neural networks
Sequence-to-sequence models
Convolutional neural networks
Semantic matching
Learning to rank
Entities
Afternoon program
Modeling user behavior
Generating responses
Recommender systems
Industry insights
Q&A
12
Preliminaries
Distributed representations
I Represent units, e.g., words, as vectors
I Goal: words that are similar, e.g., in terms of meaning, should get similar embeddings

Example embeddings: newspaper = <0.08, 0.31, 0.41>, magazine = <0.09, 0.35, 0.36>, biking = <0.59, 0.25, 0.01>

Cosine similarity to determine how similar two vectors are:

cosine(\vec{v}, \vec{w}) = \frac{\vec{v}^\top \vec{w}}{\|\vec{v}\|_2 \|\vec{w}\|_2} = \frac{\sum_{i=1}^{|v|} v_i w_i}{\sqrt{\sum_{i=1}^{|v|} v_i^2} \sqrt{\sum_{i=1}^{|w|} w_i^2}}
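A minimal NumPy sketch of the cosine similarity above, using the three example vectors from this slide:

```python
# Cosine similarity between the example embeddings shown above.
import numpy as np

def cosine(v, w):
    # cosine(v, w) = (v . w) / (||v||_2 ||w||_2)
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

newspaper = np.array([0.08, 0.31, 0.41])
magazine = np.array([0.09, 0.35, 0.36])
biking = np.array([0.59, 0.25, 0.01])

print(cosine(newspaper, magazine))  # close to 1: similar meanings
print(cosine(newspaper, biking))    # noticeably lower
```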

13
Preliminaries
Distributed representations
How do we get these vectors?
I You shall know a word by the company it keeps [Firth, 1957]
I The vector of a word should be similar to the vectors of the words surrounding it

Example: "all you need is love" (one vector per word, shaped by the surrounding words)

14
Preliminaries
Embedding methods
[Figure: a word2vec-style network. The inputs are one-hot vectors of vocabulary size, one per context word. A vocabulary size × embedding size weight matrix maps them to a hidden layer of embedding size; an embedding size × vocabulary size weight matrix maps the hidden layer to a vocabulary-size layer, which is turned into a probability distribution and compared against the target distribution, a one-hot vector of the word to be predicted.]

15
Preliminaries
Probability distributions

softmax = normalize the logits:

\hat{y}_i = \frac{e^{logits[i]}}{\sum_{j=1}^{|logits|} e^{logits[j]}}

cost = cross entropy loss:

CE = -\sum_{x} p(x) \log \hat{p}(x)
   = -\sum_{i} p_{ground truth}(word = vocabulary[i]) \log p_{predictions}(word = vocabulary[i])
   = -\sum_{i} y_i \log \hat{y}_i
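A minimal NumPy sketch of the softmax and cross entropy loss above; the logits and the one-hot target are made up for illustration:

```python
# Softmax normalization of logits and the cross entropy loss from this slide.
import numpy as np

def softmax(logits):
    # subtract the max for numerical stability; does not change the result
    exp = np.exp(logits - np.max(logits))
    return exp / np.sum(exp)

def cross_entropy(y, y_hat, eps=1e-12):
    # CE = - sum_i y_i log(y_hat_i)
    return -np.sum(y * np.log(y_hat + eps))

logits = np.array([2.0, 0.5, -1.0, 0.1])       # one score per vocabulary word
y_hat = softmax(logits)                        # predicted probability distribution
y = np.array([1.0, 0.0, 0.0, 0.0])             # target distribution (one-hot)
loss = cross_entropy(y, y_hat)
```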

16
Outline
Morning program
Preliminaries
Feedforward neural network
Distributed representations
Recurrent neural networks
Sequence-to-sequence models
Convolutional neural networks
Semantic matching
Learning to rank
Entities
Afternoon program
Modeling user behavior
Generating responses
Recommender systems
Industry insights
Q&A
17
Preliminaries
Recurrent neural networks

I Recurrent neural networks (RNNs) are typically used in scenarios where a


sequence of inputs and/or outputs is being modelled
I RNNs have memory that captures information about what has been computed so
far
I RNNs are called recurrent because they perform the same task for every element of the sequence, with the output dependent on previous computations
I RNNs can, in theory, make use of information in arbitrarily long sequences – in practice, however, they are limited to looking back only a few steps

18
Preliminaries
Recurrent cell
[Figure: a recurrent cell used for language modelling. A word vector and an input vector enter the rnn cell, which computes a hidden state vector f(word vector, input vector) and an output vector. A softmax over the output gives a distribution over the vocabulary, which is compared to the target distribution (the next word) with a cross entropy loss.]
19
Preliminaries
Recurrent cell
[Figure: the same recurrent cell, with the recurrence made explicit: the cell receives the previous hidden state_{t-1} and the word vector, computes hidden state_t = f(hidden state_{t-1}, input vector), and passes it to the next time step; the softmax, target distribution (the next word) and cross entropy loss are as before.]
20
Preliminaries
Recurrent neural networks

I RNN being unrolled (or unfolded) into


full network
I Unrolling: write out network for
complete sequence

Formulas governing computation:


I x_t: input at time step t
I s_t: hidden state at time step t – the memory of the network, calculated based on the input at the current step and the previous hidden state:
s_t = f(U x_t + W s_{t-1})
I f is usually a nonlinearity, e.g., tanh or ReLU; f can also be an LSTM or GRU cell.
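A minimal NumPy sketch of one step of this vanilla RNN, with a tanh nonlinearity and a softmax output over the vocabulary; all sizes and weights below are made up:

```python
# One step of a vanilla RNN: s_t = tanh(U x_t + W s_{t-1}), softmax output o_t.
import numpy as np

vocab_size, embedding_size, hidden_size = 1000, 64, 128
rng = np.random.default_rng(0)
U = rng.normal(scale=0.01, size=(hidden_size, embedding_size))  # input -> hidden
W = rng.normal(scale=0.01, size=(hidden_size, hidden_size))     # hidden -> hidden
V = rng.normal(scale=0.01, size=(vocab_size, hidden_size))      # hidden -> output

def rnn_step(x_t, s_prev):
    s_t = np.tanh(U @ x_t + W @ s_prev)      # new hidden state (the "memory")
    logits = V @ s_t                         # unnormalized scores over the vocabulary
    o_t = np.exp(logits - logits.max())
    o_t /= o_t.sum()                         # softmax: distribution over the next word
    return s_t, o_t

s = np.zeros(hidden_size)                    # initial hidden state
x = rng.normal(size=embedding_size)          # embedding of the current word
s, o = rnn_step(x, s)
```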

Image credits: Nature 21


Preliminaries
Language modeling using RNNs

I A language model allows us to predict the probability of observing a sentence (in a given dataset) as: P(w_1, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1})
I In an RNN, set o_t = x_{t+1}: we want the output at step t to be the actual next word
I Cross-entropy loss as loss function
I Training RNN similar to training a traditional NN: backpropagation algorithm, but
with small twist:
parameters shared by all time steps, so gradient at each output depends on
calculations of previous time steps: Backpropagation Through Time

22
Preliminaries
Vanishing and exploding gradients
[Figure: RNN unrolled over five time steps, with losses L_0, ..., L_4.]
I For training RNNs, calculate gradients for U, V, W – ok for V, but for W and U ...
I Gradient for W at step 3:
\frac{\partial L_3}{\partial W} = \frac{\partial L_3}{\partial o_3} \frac{\partial o_3}{\partial s_3} \frac{\partial s_3}{\partial W} = \sum_{k=0}^{3} \frac{\partial L_3}{\partial o_3} \frac{\partial o_3}{\partial s_3} \frac{\partial s_3}{\partial s_k} \frac{\partial s_k}{\partial W}
I More generally: \frac{\partial L}{\partial s_t} = \frac{\partial L}{\partial s_m} \cdot \frac{\partial s_m}{\partial s_{m-1}} \cdot \frac{\partial s_{m-1}}{\partial s_{m-2}} \cdots \frac{\partial s_{t+1}}{\partial s_t}, where each of these factors is typically < 1
I Gradient contributions from far away steps become zero: the state at those steps doesn't contribute to what you are learning

Image credits: http://www.wildml.com/2015/10/


recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradi
23
Preliminaries
Long Short Term Memory [Hochreiter and Schmidhuber, 1997]
LSTMs designed to combat vanishing gradients through gating mechanism

i_t = σ(x_t U^i + s_{t-1} W^i)
f_t = σ(x_t U^f + s_{t-1} W^f)
o_t = σ(x_t U^o + s_{t-1} W^o)
g_t = tanh(x_t U^g + s_{t-1} W^g)
c_t = c_{t-1} ◦ f_t + g_t ◦ i_t
s_t = tanh(c_t) ◦ o_t

(◦ is elementwise multiplication)
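A minimal NumPy sketch of the LSTM cell equations above; σ is the sigmoid and all weight matrices are made up:

```python
# One step of the LSTM cell, following the gate equations on this slide.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, s_prev, c_prev, U, W):
    i = sigmoid(x_t @ U["i"] + s_prev @ W["i"])   # input gate
    f = sigmoid(x_t @ U["f"] + s_prev @ W["f"])   # forget gate
    o = sigmoid(x_t @ U["o"] + s_prev @ W["o"])   # output gate
    g = np.tanh(x_t @ U["g"] + s_prev @ W["g"])   # candidate cell content
    c_t = c_prev * f + g * i                      # new internal memory
    s_t = np.tanh(c_t) * o                        # new hidden state
    return s_t, c_t

d_in, d_hid = 8, 16
rng = np.random.default_rng(0)
U = {k: rng.normal(scale=0.1, size=(d_in, d_hid)) for k in "ifog"}
W = {k: rng.normal(scale=0.1, size=(d_hid, d_hid)) for k in "ifog"}
s, c = np.zeros(d_hid), np.zeros(d_hid)
s, c = lstm_step(rng.normal(size=d_in), s, c, U, W)
```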

Image credits:
https://commons.wikimedia.org/wiki/File:Peephole_Long_Short-Term_Memory.svg
24
Preliminaries
Gated Recurrent Units

I GRU layer quite similar to that of LSTM layer, as are the equations:

z = σ(x_t U^z + s_{t-1} W^z)
r = σ(x_t U^r + s_{t-1} W^r)
h = tanh(x_t U^h + (s_{t-1} ◦ r) W^h)
s_t = (1 − z) ◦ h + z ◦ s_{t-1}
I GRU has two gates: reset gate r and update gate z.
I Reset gate determines how to combine new input with previous memory; update
gate defines how much of the previous memory to keep around
I Set reset to all 1’s and update gate to all 0’s to get plain RNN model
I On many tasks, LSTMs and GRUs perform similarly
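A minimal NumPy sketch of the GRU equations above (weights made up):

```python
# One step of the GRU cell, following the equations on this slide.
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, s_prev, U, W):
    z = sigmoid(x_t @ U["z"] + s_prev @ W["z"])           # update gate
    r = sigmoid(x_t @ U["r"] + s_prev @ W["r"])           # reset gate
    h = np.tanh(x_t @ U["h"] + (s_prev * r) @ W["h"])     # candidate state
    return (1.0 - z) * h + z * s_prev                     # new hidden state s_t

d_in, d_hid = 8, 16
rng = np.random.default_rng(0)
U = {k: rng.normal(scale=0.1, size=(d_in, d_hid)) for k in "zrh"}
W = {k: rng.normal(scale=0.1, size=(d_hid, d_hid)) for k in "zrh"}
s = gru_step(rng.normal(size=d_in), np.zeros(d_hid), U, W)
# Setting r to all 1's and z to all 0's recovers the plain RNN update.
```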

25
Preliminaries
Bidirectional RNNs

I Bidirectional RNNs based on idea that


output at time t may depend on
previous and future elements in
sequence
I Example: predict missing word in a
sequence
I Bidirectional RNNs are two RNNs
stacked on top of each other
I Output is computed based on hidden
state of both RNNs

Image credits: http://www.wildml.com/2015/09/


recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/ 26
Outline
Morning program
Preliminaries
Feedforward neural network
Distributed representations
Recurrent neural networks
Sequence-to-sequence models
Convolutional neural networks
Semantic matching
Learning to rank
Entities
Afternoon program
Modeling user behavior
Generating responses
Recommender systems
Industry insights
Q&A
27
Preliminaries
Sequence-to-sequence models
Increasingly important: not just retrieval but also generation
I Machine translation, spoken results, chatbots, conversational interfaces, . . . , but
also snippets, query suggestion, query correction, . . .
Basic sequence-to-sequence (seq2seq) model consists of two RNNs: an encoder that
processes input and a decoder that generates output:

Encoder

Decoder

Each box represents cell of RNN (often GRU cell or LSTM cell). Encoder and decoder
can share weights or, as is more common, use a different set of parameters

Image credits: Adapted from https://www.tensorflow.org/tutorials/seq2seq 28


Preliminaries
Attention mechanism Bahdanau et al. [2014]
Normal decoder model:

h_t = f(x, h_{t-1}; θ)

Decoder with attention:

h^{dec}_t = g(x^{dec}, H^{encoder}, h^{dec}_{t-1})

[Figure: encoder-decoder; the encoder reads A B C, the decoder generates W X Y Z <eos> starting from <go>; with attention, the decoder additionally attends over all encoder states.]
29
Preliminaries
Attention mechanism
[Figure: the encoder-decoder with attention from the previous slide.]

Decoder with attention [Luong and Manning, 2016, Vinyals et al., 2015]:

h^{dec}_t = g(x^{dec}, H^{encoder}, h^{dec}_{t-1}) = W_{proj} \cdot [d_t \,\|\, \hat{h}^{dec}_t]

where \| is the concatenation operator and \hat{h}^{dec}_t = f(x^{dec}, h^{dec}_{t-1}; θ^{dec}) from above.

d_t is calculated from H^{encoder} by:

d_t = \sum_{i=1}^{n} a_{t,i} h^{encoder}_i, \quad a_t = softmax(u_t), \quad u_{t,i} = v^\top tanh(W_1 h^{encoder}_i + W_2 h^{dec}_t)
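A minimal NumPy sketch of the attention step above (scores u_t, weights a_t, context vector d_t); all dimensions and weights are made up:

```python
# Additive attention: u_{t,i} = v^T tanh(W1 h_i^enc + W2 h_t^dec),
# a_t = softmax(u_t), d_t = sum_i a_{t,i} h_i^enc.
import numpy as np

def attention(H_enc, h_dec, W1, W2, v):
    u = np.array([v @ np.tanh(W1 @ h_i + W2 @ h_dec) for h_i in H_enc])  # scores u_t
    a = np.exp(u - u.max())
    a /= a.sum()                                                          # attention weights a_t
    d = (a[:, None] * H_enc).sum(axis=0)                                  # context vector d_t
    return d, a

n, d_hid = 5, 16                      # 5 encoder states of size 16
rng = np.random.default_rng(0)
H_enc = rng.normal(size=(n, d_hid))   # encoder hidden states
h_dec = rng.normal(size=d_hid)        # current decoder hidden state
W1 = rng.normal(scale=0.1, size=(d_hid, d_hid))
W2 = rng.normal(scale=0.1, size=(d_hid, d_hid))
v = rng.normal(scale=0.1, size=d_hid)
d_t, a_t = attention(H_enc, h_dec, W1, W2, v)
```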

30
Outline
Morning program
Preliminaries
Feedforward neural network
Distributed representations
Recurrent neural networks
Sequence-to-sequence models
Convolutional neural networks
Semantic matching
Learning to rank
Entities
Afternoon program
Modeling user behavior
Generating responses
Recommender systems
Industry insights
Q&A
31
Preliminaries
Convolutional neural networks
What is a convolution? Intuition: sliding window function applied to a matrix
Example: convolution with 3 × 3 filter

Multiply values element-wise with original matrix, then sum. Slide over whole matrix.
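A minimal NumPy sketch of this sliding-window operation (a single 2D convolution with a 3 × 3 filter); the input matrix and filter values are made up:

```python
# Slide a 3 x 3 filter over a matrix: element-wise multiply each window with the
# filter and sum, producing one output value per position.
import numpy as np

def conv2d(matrix, kernel):
    kh, kw = kernel.shape
    h, w = matrix.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(matrix[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)       # toy 5 x 5 "image"
edge_filter = np.array([[1., 0., -1.],                 # a made-up 3 x 3 filter
                        [1., 0., -1.],
                        [1., 0., -1.]])
feature_map = conv2d(image, edge_filter)               # 3 x 3 output
```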

Image credits:
http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution
32
Preliminaries
Convolutional neural networks

I Each layer applies different filters and combines results
I Pooling (subsampling) layers
I During training, CNN learns values of filters
I First layer may learn to detect edges from raw pixels
I Use edges to detect simple shapes in second layer
I Use shapes to detect higher-level features, such as facial shapes, in higher layers
I Last layer: classifier using high-level features
33
Preliminaries
CNNs in text
[Figure: CNN architecture for sentiment analysis. A 7 × 5 sentence matrix ("I like this movie very much !", d = 5) is convolved with 6 filters in total (2 filters for each of 3 region sizes: 2, 3, 4), producing 6 feature maps; 1-max pooling gives 6 univariate vectors that are concatenated into a single feature vector; a softmax (with regularization in this layer) outputs 2 classes.]

I Instead of image pixels, inputs typically are word embeddings.
I For a 10 word sentence using a 100-dimensional embedding we would have a 10 × 100 matrix as our input.
I That's our "image"
I Typically 1-dimensional convolutions are used, i.e., filters slide over rows of the matrix (words).

Architecture for sentiment analysis Zhang and Wallace [2015]


34
Preliminaries
CNNs in text

Example uses in IR
I MSR: how to learn semantically meaningful representations of sentences that can
be used for Information Retrieval
I Recommending potentially interesting documents to users based on what they are
currently reading
I Sentence representations are trained based on search engine log data
I Modeling interestingness with deep neural networks Gao et al. [2014]
A latent semantic model with convolutional-pooling structure for information
retrieval [Shen et al., 2014]

35
Preliminaries
Take aways
1. Information Retrieval (IR) systems help people find the right (most useful)
information in the right (most convenient) format at the right time (when they
need it).
2. Neural Network (NN) is a function F (x; Θ) with (a large number of) parameters
Θ that maps an input object x (which can be text, image or arbitrary vector of
features) to an output object y (class label, sequence of class labels, text, image).
3. There are three main architectures (classes of F (x; Θ)): (i) feed-forward NN (FFNN),
(ii) recurrent NN (RNN), (iii) convolutional NN (CNN).
4. Embeddings are vector representations of objects in a high-dimensional space that
are learned during training. In many practical applications, these vectors reflect
similarities between objects that are important for solving the task.
5. Other stuff, such as seq2seq, vanishing and exploding gradients, LSTM and GRU,
attention mechanism, softmax, cross-entropy.
36
Outline

Morning program
Preliminaries
Semantic matching
Learning to rank
Entities

Afternoon program
Modeling user behavior
Generating responses
Recommender systems
Industry insights
Q&A

37
Semantic matching
Semantic matching

Definition
”... conduct query/document analysis to represent the meanings of query/document
with richer representations and then perform matching with the representations.” - Li
et al. [2014]

A promising area within neural IR, due to the success of semantic representations in
NLP and computer vision.

38
Outline
Morning program
Preliminaries
Semantic matching
Using pre-trained unsupervised representations for semantic matching
Learning unsupervised representations for semantic matching
Learning to match models
Learning to match using pseudo relevance
Toolkits
Learning to rank
Entities
Afternoon program
Modeling user behavior
Generating responses
Recommender systems
Industry insights
Q&A
39
Semantic matching
Unsupervised semantic matching with pre-trained representations

Word embeddings have recently gained popularity for their ability to encode semantic
and syntactic relations amongst words.

How can we use word embeddings for information retrieval tasks?

40
Semantic matching
Word embedding

Distributional Semantic Model (DSM): A model for associating words with vectors
that can capture their meaning. DSM relies on the distributional hypothesis.

Distributional Hypothesis: Words that occur in the same contexts tend to have similar
meanings [Harris, 1954].

Statistics on observed contexts of words in a corpus are quantified to derive word vectors.
I The most common choice of context: The set of words that co-occur in a context
window.
I Context-counting VS. Context-predicting [Baroni et al., 2014]

41
Semantic matching

From word embeddings to query/document embeddings


Creating representations for compound units of text (e.g., documents) from
representation of lexical units (e.g., words).

42
Semantic matching
From word embeddings to query/document embeddings
Obtaining representations of compound units of text (in comparison to the atomic
words).
Bag of embedded words: sum or average of word vectors.
I Averaging the word representations of query terms has been extensively explored
in different settings. [Vulić and Moens, 2015, Zamani and Croft, 2016b]
I Effective, but only for small units of text, e.g., a query [Mitra, 2015].
I Training word embeddings directly for the purpose of being averaged [Kenter
et al., 2016].
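A minimal sketch of the bag-of-embedded-words idea in NumPy, with made-up toy embeddings:

```python
# Represent a query (or any short text) as the average of its word vectors.
import numpy as np

embeddings = {                       # word -> vector (hypothetical values)
    "neural": np.array([0.2, 0.7, 0.1]),
    "information": np.array([0.5, 0.1, 0.3]),
    "retrieval": np.array([0.4, 0.2, 0.4]),
}

def average_embedding(text, embeddings):
    vectors = [embeddings[w] for w in text.split() if w in embeddings]
    return np.mean(vectors, axis=0)  # centroid of the word vectors

query_vector = average_embedding("neural information retrieval", embeddings)
```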

43
Semantic matching
From word embeddings to query/document embeddings
I Skip-Thought Vectors
I Conceptually similar to distributional semantics: a unit's representation is a function of its neighbouring units, except the units are sentences instead of words.

I Similar to an auto-encoding objective: encode a sentence, but decode the neighboring sentences.
I Pair of LSTM-based seq2seq models with shared encoder.
I Doc2vec (Paragraph2vec) [Le and Mikolov, 2014].
I You'll hear more about it later, in "Learning unsupervised representations from scratch". (Also you might want to take a look at Deep Learning for Semantic Composition.)
44
Semantic matching

Using similarity amongst documents, queries and terms.


Given low-dimensional representations, integrate their similarity signal within IR.

45
Semantic matching
Dual Embedding Space Model (DESM) [Nalisnick et al., 2016]
Word2vec optimizes IN-OUT dot product which captures
the co-occurrence statistics of words from the training
corpus:
- We can gain by using these two embeddings differently

I IN-IN and OUT-OUT cosine similarities are high for words that are similar by function or type (typical)
I IN-OUT cosine similarities are high between words that often co-occur in the same query or document (topical)
46
Semantic matching
Pre-trained word embeddings for document retrieval and ranking
DESM [Nalisnick et al., 2016]: Using IN-OUT similarity to model document aboutness.
I A document is represented by the centroid of its normalized word OUT vectors:

\bar{v}_{d,OUT} = \frac{1}{|d|} \sum_{t_d \in d} \frac{\vec{v}_{t_d,OUT}}{\|\vec{v}_{t_d,OUT}\|}

I Query-document similarity is the average of the cosine similarity over query words:

DESM_{IN-OUT}(q, d) = \frac{1}{|q|} \sum_{t_q \in q} \frac{\vec{v}_{t_q,IN}^\top \bar{v}_{d,OUT}}{\|\vec{v}_{t_q,IN}\| \, \|\bar{v}_{d,OUT}\|}

I IN-OUT captures more topical notion of similarity than IN-IN and OUT-OUT.
I DESM is effective at, but only at, ranking at least somewhat relevant documents.
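A minimal NumPy sketch of the DESM_IN-OUT score above; the IN and OUT embeddings below are random stand-ins, not trained word2vec vectors:

```python
# DESM_IN-OUT: average cosine between query IN vectors and the document's OUT centroid.
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def desm_in_out(query_terms, doc_terms, in_emb, out_emb):
    # document centroid of normalized OUT vectors
    d_centroid = np.mean([normalize(out_emb[t]) for t in doc_terms], axis=0)
    # average cosine similarity between each query IN vector and the centroid
    sims = [np.dot(normalize(in_emb[t]), normalize(d_centroid)) for t in query_terms]
    return float(np.mean(sims))

rng = np.random.default_rng(0)
vocab = ["cambridge", "university", "city", "river"]
in_emb = {w: rng.normal(size=8) for w in vocab}    # hypothetical IN embeddings
out_emb = {w: rng.normal(size=8) for w in vocab}   # hypothetical OUT embeddings
score = desm_in_out(["cambridge"], ["university", "city", "river"], in_emb, out_emb)
```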

47
Semantic matching
Pre-trained word embeddings for document retrieval and ranking
I NTLM [Zuccon et al., 2015]: Neural Translation Language Model
I Translation Language Model: extending query likelihood:

p(d|q) \propto p(q|d) \, p(d)

p(q|d) = \prod_{t_q \in q} p(t_q|d)

p(t_q|d) = \sum_{t_d \in d} p(t_q|t_d) \, p(t_d|d)

I Uses the similarity between term embeddings as a measure for the term-term translation probability p(t_q|t_d):

p(t_q|t_d) = \frac{\cos(\vec{v}_{t_q}, \vec{v}_{t_d})}{\sum_{t \in V} \cos(\vec{v}_t, \vec{v}_{t_d})}

48
Semantic matching
Pre-trained word embeddings for document retrieval and ranking
GLM [Ganguly et al., 2015]: Generalized Language Model
I Terms in a query are generated by sampling them independently from either the
document or the collection.
I The noisy channel may transform (mutate) a term t into a term t0 .

p(t_q|d) = \lambda \, p(t_q|d) + \alpha \sum_{t_d \in d} p(t_q, t_d|d) \, p(t_d) + \beta \sum_{t' \in N_t} p(t_q, t'|C) \, p(t') + (1 - \lambda - \alpha - \beta) \, p(t_q|C)

where N_t is the set of nearest neighbours of term t, and

p(t', t|d) = \frac{sim(\vec{v}_{t'}, \vec{v}_t) \cdot tf(t', d)}{\sum_{t_1 \in d} \sum_{t_2 \in d} sim(\vec{v}_{t_1}, \vec{v}_{t_2}) \cdot |d|}
49
Semantic matching
Pre-trained word embeddings for query term weighting
Term re-weighting using word embeddings [Zheng and Callan, 2015].
- Learning to map query terms to query term weights.
I Constructing the feature vector \vec{x}_{t_q} for term t_q using its embedding and the embeddings of the other terms in the same query q as:

\vec{x}_{t_q} = \vec{v}_{t_q} - \frac{1}{|q|} \sum_{t'_q \in q} \vec{v}_{t'_q}

I \vec{x}_{t_q} measures the semantic difference of a term to the whole query.
I Learn a model to map the feature vectors to the defined target term weights.

50
Semantic matching
Pre-trained word embeddings for query expansion

I Identify expansion terms using word2vec cosine similarity [Roy et al., 2016].
I pre-retrieval:
I Taking nearest neighbors of query terms as the expansion terms.
I post-retrieval:
I Using a set of pseudo-relevant documents to restrict the search domain for the
candidate expansion terms.
I pre-retrieval incremental:
I Using an iterative process of reordering and pruning terms from the nearest neighbors
list.
- Reorder the terms in decreasing order of similarity with the previously selected term.
I Works better than having no query expansion, but does not beat non-neural query
expansion methods.

51
Semantic matching
Pre-trained word embedding for query expansion

I Embedding-based Query Expansion [Zamani and Croft, 2016a]


Main goal: Estimating a better language model for the query using embeddings.
I Embedding-based Relevance Model:
Main goal: Semantic similarity in addition to term matching for PRF.

52
Semantic matching
Pre-trained word embedding for query expansion
Query expansion with locally-trained word embeddings [Diaz et al., 2016].

I Main idea: embeddings should be learned on topically-constrained corpora, instead of large topically-unconstrained corpora.
I Training word2vec on documents from first
round of retrieval.
I Fine-grained word sense disambiguation.
I A large number of embedding spaces can be
cached in practice.

53
Outline
Morning program
Preliminaries
Semantic matching
Using pre-trained unsupervised representations for semantic matching
Learning unsupervised representations for semantic matching
Learning to match models
Learning to match using pseudo relevance
Toolkits
Learning to rank
Entities
Afternoon program
Modeling user behavior
Generating responses
Recommender systems
Industry insights
Q&A
54
Semantic matching
Learning unsupervised representations for semantic matching

Pre-trained word embeddings can be used to obtain


I a query/document representation through compositionality, or
I a similarity signal to integrate within IR frameworks.

Can we learn unsupervised query/document representations directly for IR tasks?

55
Semantic matching
LSI, pLSI and LDA

History of latent document representations


Latent representations of documents that are learned from scratch have been around
since the early 1990s.

I Latent Semantic Indexing [Deerwester et al., 1990],


I Probabilistic Latent Semantic Indexing [Hofmann, 1999], and
I Latent Dirichlet Allocation [Blei et al., 2003].

These representations provide a semantic matching signal that is complementary to a


lexical matching signal.

56
Semantic matching
Semantic Hashing

Salakhutdinov and Hinton [2009] propose Semantic Hashing for document similarity.
I Auto-encoder trained on frequency
vectors.
I Documents are mapped to memory
addresses in such a way that
semantically similar documents are
located at nearby bit addresses.
I Documents similar to a query document can then be found by accessing addresses that differ by only a few bits from the query document address.

[Figure: schematic representation of Semantic Hashing. Taken from Salakhutdinov and Hinton [2009].]

57
Semantic matching
Distributed Representations of Documents [Le and Mikolov, 2014]

I Learn document representations based


on the words contained within each
document.
I Reported to work well on a document
similarity task.
I Attempts to integrate learned
representations into standard retrieval
models [Ai et al., 2016a,b].
Overview of the Distributed Memory document
vector model. Taken from Le and Mikolov
[2014].

58
Semantic matching
Two Doc2Vec Architectures [Le and Mikolov, 2014]

[Figures: overview of the Distributed Memory document vector model and of the Distributed Bag of Words document vector model. Taken from Le and Mikolov [2014].]
59
Semantic matching
Neural Vector Spaces for Unsupervised IR [Van Gysel et al., 2018]
I Learns query (term) and document representations directly from the document
collection.
I Outperforms existing latent vector space models and provides a semantic matching signal complementary to lexical retrieval models.
I Learns a notion of term specificity.
I Luhn significance: mid-frequency words are more important for retrieval than infrequent and frequent words.

[Figure: relation between the L2-norm of a query term representation within NVSM and its collection frequency. Taken from Van Gysel et al. [2018].]

60
Outline
Morning program
Preliminaries
Semantic matching
Using pre-trained unsupervised representations for semantic matching
Learning unsupervised representations for semantic matching
Learning to match models
Learning to match using pseudo relevance
Toolkits
Learning to rank
Entities
Afternoon program
Modeling user behavior
Generating responses
Recommender systems
Industry insights
Q&A
61
Semantic matching
Text matching as a supervised objective

Text matching is often formulated as a supervised objective where pairs of relevant or


paraphrased texts are given.

In the next few slides, we’ll go over different architectures introduced for supervised
text matching. Note that this is a mix of models originally introduced for (i) relevance
ranking, (ii) paraphrase identification, and (iii) question answering among others.

62
Semantic matching

Representation-based models
Representation-based models construct a fixed-dimensional vector representation for
each text separately and then perform matching within the latent space.

63
Semantic matching
(C)DSSM [Huang et al., 2013, Shen et al., 2014]
I Siamese network between query and document, performed on character trigrams.
I Originally introduced for learning from implicit feedback.

64
Semantic matching
ARC-I [Hu et al., 2014]
I Similar to DSSM, perform 1D convolution on text representations separately.
I Originally introduced for paraphrasing task.

65
Semantic matching

Interaction-based models
Interaction-based models compute the interaction between each individual term of
both texts. An interaction can be identity or syntactic/semantic similarity.

The interaction matrix is subsequently summarized into a matching score.

66
Semantic matching
DRMM [Guo et al., 2016]
I Compute term/document
interactions and matching
histograms using different
strategies (count, relative
count, log-count).
I Pass histograms through
feed-forward network for
every query term.
I Gating network that produces
an attention weight for every
query term; per-term scores
are then aggregated into a
relevance score using
attention weights.
67
Semantic matching
MatchPyramid [Pang et al., 2016]

I Interaction matrix between query/document


terms, followed by convolutional layers.

I After convolutions, feed-forward layers


determine matching score.

68
Semantic matching
aNMM [Yang et al., 2016]

I Compute word interaction


matrix.
I Aggregate similarities by
running multiple kernels.
I Every kernel assigns a
different weight to a
particular similarity range.
I Similarities are aggregated to
the kernel output by
weighting them according to
which bin they fall in.

69
Semantic matching
Match-SRNN [Wan et al., 2016b]

I Word interaction layer, followed by a spatial


recurrent NN.

I The RNN hidden state is updated using the


current interaction coefficient, and the hidden
state of the prefix.

70
Semantic matching
K-NRM [Xiong et al., 2017b]

I Compute word-interaction matrix,


apply k kernels to every query term
row in interaction matrix.

I This results in k-dimensional vector.

I Aggregate the query term vectors into


a fixed-dimensional query
representation.

I Later extended to convolutional


networks [Dai et al., 2018] (hybrid).

71
Semantic matching

Hybrid models
Hybrid models consist of (i) a representation component that combines a sequence of
words (e.g., a whole text, a window of words) into a fixed-dimensional representation
and (ii) an interaction component.

These two components can occur (1) in serial or (2) in parallel.

72
Semantic matching
ARC-II [Hu et al., 2014]

I Cascade approach where word representations are generated from context.


I Interaction matrix between sliding windows, where the interaction activation is
computed using a non-linear mapping.
I Originally introduced for paraphrasing task.

73
Semantic matching
MV-LSTM [Wan et al., 2016a]
I Cascade approach where input
representations for the interaction
matrix are generated using a
bi-directional LSTM.

I Differs from pure interaction-based


approaches as the LSTM builds a
representation of the context, rather
than using the representation of a
word.

I Obtains fixed-dimensional
representation by max-pooling over
query/document; followed by
feed-forward network. 74
Semantic matching
Duet [Mitra et al., 2017]
I Model has an interaction-based and a
representation-based component.

I Interaction-based component consist of a


indicator matrix showing where query terms
occur in document; followed by convolution
layers.

I Representation-based component is similar to


DSSM/ARC-I, but uses a feed-forward
network to compute the similarity signal rather
than cosine similarity.

I Both are combined at the end using a linear


combination of the scores.
75
Semantic matching
DeepRank [Pang et al., 2017]

I Focus only on exact term


occurrences in document.
I Compute interaction between
query and window
surrounding term occurrence.
I RNN or CNN then combines
per-window features (query
representation, context
representations and
interaction between
query/document term) into
matching score.

76
Outline
Morning program
Preliminaries
Semantic matching
Using pre-trained unsupervised representations for semantic matching
Learning unsupervised representations for semantic matching
Learning to match models
Learning to match using pseudo relevance
Toolkits
Learning to rank
Entities
Afternoon program
Modeling user behavior
Generating responses
Recommender systems
Industry insights
Q&A
77
Semantic matching
Beyond supervised signals: semi-supervised learning

The architectures we presented for learning to match all require labels. Typically these
labels are obtained from domain experts.

However, in information retrieval, there is the concept of pseudo relevance that gives
us a supervised signal that was obtained from unsupervised data collections.

78
Semantic matching
Pseudo test/training collections

Given a source of pseudo relevance, we can build pseudo collections for training
retrieval models [Asadi et al., 2011, Berendsen et al., 2013].

Sources of pseudo-relevance
Typically given by external knowledge about retrieval domain, such as hyperlinks,
query logs, social tags, ...

79
Semantic matching
Training neural networks using pseudo relevance

Training a neural ranker using weak supervision [Dehghani et al., 2017].

Main idea: annotate a large amount of unlabeled data using a weak annotator (pseudo-labeling) and design a model which can be trained on the weak supervision signal.
I Function approximation. (re-inventing BM25?)
I Beating BM25 using BM25!

80
Semantic matching
Training neural networks using pseudo relevance
Generating weak supervision training data for training neural IR model [MacAvaney
et al., 2017].
I Using a news corpus with article headlines acting as pseudo-queries and article
content as pseudo-documents.
I Problems:
I Hard-Negative
I Mismatched-Interaction: (example: “When Bird Flies In”, a sports article about
basketball player Larry Bird)
I Solutions:
I Ranking filter:
- top pseudo-documents are considered as negative samples.
- only pseudo-queries that are able to retrieve their pseudo-relevant documents are
used as positive samples.
I Interaction filter:
- building interaction embeddings for each pair.
- filtering out based on similarity to the template query-document pairs.
81
Semantic matching
Query expansion using neural word embeddings based on pseudo relevance
Locally trained word embeddings [Diaz et al., 2016]
I Performing topic-specific training, on a set of topic specific documents that are
collected based on their relevance to a query.
Relevance-based Word Embedding [Zamani and Croft, 2017].
I Relevance is not necessarily equal to semantic or syntactic similarity:
I “united state” as expansion terms for “Indian American museum”.
I Main idea: Defining the “context”
Using the relevance model distribution for the given query to define the context.
So the objective is to predict the words observed in the documents relevant to a
particular information need.
I The neural network will be constrained by the given weights from RM3 to learn word embeddings.

82
Outline
Morning program
Preliminaries
Semantic matching
Using pre-trained unsupervised representations for semantic matching
Learning unsupervised representations for semantic matching
Learning to match models
Learning to match using pseudo relevance
Toolkits
Learning to rank
Entities
Afternoon program
Modeling user behavior
Generating responses
Recommender systems
Industry insights
Q&A
83
Semantic matching
Document & entity representation learning toolkits

gensim : https://github.com/RaRe-Technologies/gensim [Řehůřek and


Sojka, 2010]
SERT : http://www.github.com/cvangysel/SERT [Van Gysel et al., 2017a]
cuNVSM : http://www.github.com/cvangysel/cuNVSM [Van Gysel et al., 2018]
HEM : https://ciir.cs.umass.edu/downloads/HEM [Ai et al., 2017]
MatchZoo : https://github.com/faneshion/MatchZoo [Fan et al., 2017]
K-NRM : https://github.com/AdeDZY/K-NRM [Xiong et al., 2017b]

84
Outline

Morning program
Preliminaries
Semantic matching
Learning to rank
Entities

Afternoon program
Modeling user behavior
Generating responses
Recommender systems
Industry insights
Q&A

85
Outline
Morning program
Preliminaries
Semantic matching
Learning to rank
Overview & basics
Quick refreshers
Pointwise loss
Pairwise loss
Listwise loss
Toolkits
Entities
Afternoon program
Modeling user behavior
Generating responses
Recommender systems
Industry insights
Q&A
86
Learning to rank
Learning to rank (LTR)
Definition
”... the task to automatically construct a ranking model using training data, such that
the model can sort new objects according to their degrees of relevance, preference, or
importance.” - Liu [2009]

LTR models represent a rankable item—e.g., a document—given some context—e.g., a user-issued query—as a numerical vector \vec{x} \in R^n.

The ranking model f : \vec{x} \to R is trained to map the vector to a real-valued score such that relevant items are scored higher.

We only discuss offline LTR models here–see Grotov and de Rijke [2016] for an
overview of online LTR.
87
Learning to rank
Three training objectives

Liu [2009] categorizes different LTR approaches based on training objectives:


I Pointwise approach: relevance label yq,d is a number—derived from binary or
graded human judgments or implicit user feedback (e.g., CTR). Typically, a
regression or classification model is trained to predict yq,d given ~xq,d .

I Pairwise approach: pairwise preference between documents for a query (d_i ≻_q d_j) as label. Reduces to binary classification to predict the more relevant document.

I Listwise approach: directly optimize for rank-based metric, such as


NDCG—difficult because these metrics are often not differentiable w.r.t. model
parameters.

88
Learning to rank
Features

Traditional LTR models employ hand-crafted features that encode IR insights

They can often be categorized as:


I Query-independent or static features (e.g., incoming link count and document
length)
I Query-dependent or dynamic features (e.g., BM25)
I Query-level features (e.g., query length)

89
Outline
Morning program
Preliminaries
Semantic matching
Learning to rank
Overview & basics
Quick refreshers
Pointwise loss
Pairwise loss
Listwise loss
Toolkits
Entities
Afternoon program
Modeling user behavior
Generating responses
Recommender systems
Industry insights
Q&A
90
Learning to rank
A quick refresher - Neural models for different tasks

91
Learning to rank
A quick refresher - What is the Softmax function?
In neural classification models, the softmax function is popularly used to normalize the neural network output scores across all the classes:

p(z_i) = \frac{e^{\gamma z_i}}{\sum_{z \in Z} e^{\gamma z}} \qquad (\gamma is a constant) \qquad (1)

92
Learning to rank
A quick refresher - What is Cross Entropy?

The cross entropy between two probability distributions p and q over a discrete set of events is given by,

CE(p, q) = -\sum_{i} p_i \log(q_i) \qquad (2)

If p_{correct} = 1 and p_i = 0 for all other values of i, then

CE(p, q) = -\log(q_{correct}) \qquad (3)

93
Learning to rank
A quick refresher - What is the Cross Entropy with Softmax loss?

Cross entropy with softmax is a popular loss function for classification:

L_{CE} = -\log \left( \frac{e^{\gamma z_{correct}}}{\sum_{z \in Z} e^{\gamma z}} \right) \qquad (4)

94
Outline
Morning program
Preliminaries
Semantic matching
Learning to rank
Overview & basics
Quick refreshers
Pointwise loss
Pairwise loss
Listwise loss
Toolkits
Entities
Afternoon program
Modeling user behavior
Generating responses
Recommender systems
Industry insights
Q&A
95
Learning to rank
Pointwise objectives

Regression-based or classification-based approaches are popular

Regression loss

Given ⟨q, d⟩, predict the value of y_{q,d}

E.g., square loss for binary or categorical labels,

L_{Squared} = \|y_{q,d} - f(\vec{x}_{q,d})\|^2 \qquad (5)

where y_{q,d} is the one-hot representation [Fuhr, 1989] or the actual value [Cossock and Zhang, 2006] of the label

96
Learning to rank
Pointwise objectives
Regression-based or classification-based approaches are popular

Classification loss

Given ⟨q, d⟩, predict the class y_{q,d}

E.g., cross entropy with softmax over categorical labels Y [Li et al., 2008],

L_{CE}(q, d, y_{q,d}) = -\log p(y_{q,d}|q, d) \qquad (6)
                     = -\log \left( \frac{e^{\gamma \cdot s_{y_{q,d}}}}{\sum_{y \in Y} e^{\gamma \cdot s_y}} \right) \qquad (7)

where s_{y_{q,d}} is the model's score for label y_{q,d}


97
Outline
Morning program
Preliminaries
Semantic matching
Learning to rank
Overview & basics
Quick refreshers
Pointwise loss
Pairwise loss
Listwise loss
Toolkits
Entities
Afternoon program
Modeling user behavior
Generating responses
Recommender systems
Industry insights
Q&A
98
Learning to rank
Pairwise objectives

Pairwise loss minimizes the average number of inversions in ranking—i.e., d_i ≻_q d_j but d_j is ranked higher than d_i

Given ⟨q, d_i, d_j⟩, predict the more relevant document

For ⟨q, d_i⟩ and ⟨q, d_j⟩:
Feature vectors: \vec{x}_i and \vec{x}_j
Model scores: s_i = f(\vec{x}_i) and s_j = f(\vec{x}_j)

Pairwise loss generally has the following form [Chen et al., 2009],

L_{pairwise} = \phi(s_i - s_j) \qquad (8)

where \phi can be,
I Hinge function \phi(z) = max(0, 1 − z) [Herbrich et al., 2000]
I Exponential function \phi(z) = e^{−z} [Freund et al., 2003]
I Logistic function \phi(z) = log(1 + e^{−z}) [Burges et al., 2005]
I etc.
99
Learning to rank
RankNet

RankNet [Burges et al., 2005] is a pairwise loss


function—an industry favourite [Burges, 2015]
Predicted probabilities:

p_{ij} = p(s_i > s_j) \equiv \frac{e^{\gamma \cdot s_i}}{e^{\gamma \cdot s_i} + e^{\gamma \cdot s_j}} = \frac{1}{1 + e^{-\gamma(s_i - s_j)}} \quad and \quad p_{ji} \equiv \frac{1}{1 + e^{-\gamma(s_j - s_i)}}

Desired probabilities: \bar{p}_{ij} = 1 and \bar{p}_{ji} = 0

Computing cross entropy between \bar{p} and p,

L_{RankNet} = -\bar{p}_{ij} \log(p_{ij}) - \bar{p}_{ji} \log(p_{ji}) \qquad (9)
            = -\log(p_{ij}) \qquad (10)
            = \log(1 + e^{-\gamma(s_i - s_j)}) \qquad (11)
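A minimal NumPy sketch of the RankNet loss (and its gradient) for a single document pair, following Eq. (11); the scores are made up:

```python
# RankNet loss for a pair where d_i should rank above d_j.
import numpy as np

def ranknet_loss(s_i, s_j, gamma=1.0):
    # L = log(1 + exp(-gamma * (s_i - s_j)))
    return np.log1p(np.exp(-gamma * (s_i - s_j)))

def ranknet_gradient(s_i, s_j, gamma=1.0):
    # dL/ds_i = -gamma / (1 + exp(gamma * (s_i - s_j))); the gradient w.r.t. s_j is its negative
    return -gamma / (1.0 + np.exp(gamma * (s_i - s_j)))

loss = ranknet_loss(s_i=2.3, s_j=1.1)       # small loss: the pair is ordered correctly
grad = ranknet_gradient(s_i=2.3, s_j=1.1)
```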

100
Learning to rank
Cross Entropy (CE) with Softmax over documents
An alternative loss function assumes a single relevant document d+ and compares it
against the full collection D

Probability of retrieving d^+ for q is given by the softmax function,

p(d^+|q) = \frac{e^{\gamma \cdot s(q, d^+)}}{\sum_{d \in D} e^{\gamma \cdot s(q,d)}} \qquad (12)

The cross entropy loss is then given by,

L_{CE}(q, d^+, D) = -\log p(d^+|q) \qquad (13)
                  = -\log \left( \frac{e^{\gamma \cdot s(q, d^+)}}{\sum_{d \in D} e^{\gamma \cdot s(q,d)}} \right) \qquad (14)

101
Learning to rank
Notes on Cross Entropy (CE) loss

I If we consider only a pair of relevant and non-relevant documents in the


denominator, CE reduces to RankNet
I Computing the denominator is prohibitively expensive—large body of work in NLP
on this that may be relevant to future LTR models
I Hierarchical softmax
I Sampling based approaches
I In IR, LTR models typically consider few negative candidates [Huang et al., 2013,
Mitra et al., 2017, Shen et al., 2014]

102
Learning to rank
Hierarchical Softmax

Avoid computing p(d^+|q) directly: group candidates D into a set of classes C, then predict the correct class c^+ given q, followed by predicting d^+ given ⟨c^+, q⟩ [Goodman, 2001]

p(d^+|q) = p(d^+|c^+, q) \cdot p(c^+|q) \qquad (15)

Computational cost is a function of |C| + |c^+| << |D|

Employ a hierarchy of classes [Mnih and Hinton, 2009, Morin and Bengio, 2005]

Hierarchy based on similarity between candidates [Brown et al., 1992, Le et al., 2011,
Mikolov et al., 2013], or frequency binning [Mikolov et al., 2011]

103
Learning to rank
Sampling based approaches
Alternative to computing exact softmax, estimate it using sampling based approaches

L_{CE}(q, d^+, D) = -\log \left( \frac{e^{\gamma \cdot s(q,d^+)}}{\sum_{d \in D} e^{\gamma \cdot s(q,d)}} \right) = -\gamma \cdot s(q, d^+) + \underbrace{\log \sum_{d \in D} e^{\gamma \cdot s(q,d)}}_{expensive to compute} \qquad (16)

Importance sampling [Bengio and Senécal, 2008, Bengio et al., 2003, Jean et al., 2014,
Jozefowicz et al., 2016], Noise Contrastive Estimation [Gutmann and Hyvärinen, 2010,
Mnih and Teh, 2012, Vaswani et al., 2013], negative sampling [Mikolov et al., 2013],
BlackOut [Ji et al., 2015], and others have been proposed

See [Mitra and Craswell, 2017] for detailed discussion

104
Outline
Morning program
Preliminaries
Semantic matching
Learning to rank
Overview & basics
Quick refreshers
Pointwise loss
Pairwise loss
Listwise loss
Toolkits
Entities
Afternoon program
Modeling user behavior
Generating responses
Recommender systems
Industry insights
Q&A
105
Learning to rank
Listwise

Blue: relevant Gray: non-relevant

NDCG and ERR are higher for the left ranking, but the pairwise errors are fewer for the right

Due to strong position-based discounting in IR measures, errors at higher ranks are much more problematic than at lower ranks

But listwise metrics are non-continuous and non-differentiable
[Burges, 2010]

106
Learning to rank
LambdaRank

Key observations:
I To train a model we don't need the costs themselves, only the gradients (of the costs w.r.t. model scores)
I It is desired that the gradient be bigger for pairs of documents that produce a bigger impact in NDCG by swapping positions

LambdaRank [Burges et al., 2006]


Multiply actual gradients with the change in NDCG by swapping the rank positions of
the two documents

λLambdaRank = λRankN et · |∆N DCG| (17)

107
Learning to rank
ListNet and ListMLE

According to the Luce model [Luce, 2005], given four items {d_1, d_2, d_3, d_4}, the probability of observing a particular rank-order, say [d_2, d_1, d_4, d_3], is given by:

p(\pi|s) = \frac{\phi(s_2)}{\phi(s_1) + \phi(s_2) + \phi(s_3) + \phi(s_4)} \cdot \frac{\phi(s_1)}{\phi(s_1) + \phi(s_3) + \phi(s_4)} \cdot \frac{\phi(s_4)}{\phi(s_3) + \phi(s_4)} \qquad (18)

where \pi is a particular permutation and \phi is a transformation (e.g., linear, exponential, or sigmoid) over the score s_i corresponding to item d_i
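A minimal NumPy sketch of the Luce model probability of a permutation, Eq. (18), using φ = exp as the transformation; the scores are made up:

```python
# Probability of observing a particular rank order under the Luce model.
import numpy as np

def luce_probability(scores, permutation, phi=np.exp):
    """P(pi | s): probability of observing `permutation` (a list of item indices)."""
    remaining = list(permutation)
    p = 1.0
    for item in permutation:
        denom = sum(phi(scores[i]) for i in remaining)  # normalize over items not yet placed
        p *= phi(scores[item]) / denom
        remaining.remove(item)
    return p

scores = np.array([1.2, 2.5, 0.3, 0.9])        # s_1 .. s_4 for items d_1 .. d_4
p = luce_probability(scores, [1, 0, 3, 2])     # rank order [d_2, d_1, d_4, d_3]
```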

108
Learning to rank
ListNet and ListMLE

ListNet [Cao et al., 2007]


Compute the probability distribution over all possible permutations based on model
score and ground-truth labels. The loss is then given by the K-L divergence between
these two distributions.

This is computationally very costly, computing permutations of only the top-K items
makes it slightly less prohibitive

ListMLE [Xia et al., 2008]


Compute the probability of the ideal permutation based on the ground truth. However,
with categorical labels more than one permutation is possible which makes this difficult.

109
Outline
Morning program
Preliminaries
Semantic matching
Learning to rank
Overview & basics
Quick refreshers
Pointwise loss
Pairwise loss
Listwise loss
Toolkits
Entities
Afternoon program
Modeling user behavior
Generating responses
Recommender systems
Industry insights
Q&A
110
Learning to rank
Toolkits for off-line learning to rank

RankLib : https://sourceforge.net/p/lemur/wiki/RankLib
shoelace : https://github.com/rjagerman/shoelace [Jagerman et al., 2017]
QuickRank : http://quickrank.isti.cnr.it [Capannini et al., 2016]
RankPy : https://bitbucket.org/tunystom/rankpy
pyltr : https://github.com/jma127/pyltr
jforests : https://github.com/yasserg/jforests [Ganjisaffar et al., 2011]
XGBoost : https://github.com/dmlc/xgboost [Chen and Guestrin, 2016]
SVMRank : https://www.cs.cornell.edu/people/tj/svm_light [Joachims,
2006]
sofia-ml : https://code.google.com/archive/p/sofia-ml [Sculley, 2009]
pysofia : https://pypi.python.org/pypi/pysofia

111
Outline

Morning program
Preliminaries
Semantic matching
Learning to rank
Entities

Afternoon program
Modeling user behavior
Generating responses
Recommender systems
Industry insights
Q&A

112
Entities
Entities are polysemic

“Finding entities” has multiple meanings.

Entities can be
I nodes in knowledge graphs,
I mentions in unstructured texts or queries,
I retrievable items characterized by texts.

113
Outline
Morning program
Preliminaries
Semantic matching
Learning to rank
Entities
Knowledge graph embeddings
Entity mentions in unstructured text
Entity finding

Afternoon program
Modeling user behavior
Generating responses
Recommender systems
Industry insights
Q&A

114
Entities
Knowledge graphs

[Figure: a small knowledge graph. Beyoncé Knowles, Kelly Rowland and Michelle Williams each have a "member of" edge to Destiny's Child, which has "start date" 1997 and "end date" 2005.]

Triples
(beyoncé knowles, member of, destinys child)
(kelly rowland, member of, destinys child)
(michelle williams, member of, destinys child)
(destinys child, start date, 1997)
(destinys child, end date, 2005)

...

Nice overview on using knowledge bases in IR: [Dietz et al., 2017]


115
Entities
Knowledge graphs

Tasks
I Link prediction: predict the missing h or t for a triple (h, r, t); rank entities by score. Metrics: mean rank of the correct entity, Hits@10.
I Triple classification: predict if (h, r, t) is correct. Metric: accuracy.
I Relation fact extraction from free text: use the knowledge base as weak supervision for extracting new triples. Suppose some IE system gives us (steve jobs, "was the initiator of", apple); then we want to predict the founder of relation.

Datasets
I WordNet: (car, hyponym, vehicle)
I Freebase/DBPedia: (steve jobs, founder of, apple)

116
Entities
Knowledge graphs

Knowledge graph embeddings


I TransE [Bordes et al., 2013]
I TransH [Wang et al., 2014]
I TransR [Lin et al., 2015]

117
Entities
TransE

“Translation intuition”
For a triple (h, l, t) : ~h + ~l ≈ ~t.

[Figure: entity vectors h_i, h_j are translated by the relation vector l to land (approximately) on t_i, t_j.]

118
Entities
TransE

“Translation intuition”
For a triple (h, l, t) : ~h + ~l ≈ ~t.

Margin-based ranking loss over positive examples S and corrupted negative examples S', with a distance function d:

L = \sum_{(h,l,t) \in S} \sum_{(h',l,t') \in S'} \left[ \gamma + d(\vec{h} + \vec{l}, \vec{t}) - d(\vec{h'} + \vec{l}, \vec{t'}) \right]_+

[Bordes et al., 2013]
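A minimal NumPy sketch of the TransE distance and its margin-based ranking loss for one positive/negative pair; the entity and relation vectors are made up:

```python
# TransE: h + l should be close to t; train with a margin-based ranking loss.
import numpy as np

def distance(h, l, t):
    # L2 distance between the translated head and the tail
    return np.linalg.norm(h + l - t)

def margin_loss(pos_triple, neg_triple, gamma=1.0):
    # hinge loss: positive triples should score at least `gamma` closer than corrupted ones
    return max(0.0, gamma + distance(*pos_triple) - distance(*neg_triple))

rng = np.random.default_rng(0)
dim = 16
beyonce, destinys_child, madrid = (rng.normal(size=dim) for _ in range(3))  # entity vectors
member_of = rng.normal(size=dim)                                            # relation vector

pos = (beyonce, member_of, destinys_child)       # observed triple
neg = (beyonce, member_of, madrid)               # corrupted (negative) triple
loss = margin_loss(pos, neg)
```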

119
Entities
TransE

"Translation intuition": for a triple (h, l, t): \vec{h} + \vec{l} ≈ \vec{t}.

How about:
I one-to-many relations?
I many-to-many relations?
I many-to-one relations?

[Figure: with a single relation vector r, several heads h_i, h_j or tails t_i, t_j cannot all satisfy h + r ≈ t.]

120
Entities
TransH

[Wang et al., 2014]


121
Entities
TransH

distance function

[Wang et al., 2014]


122
Entities
TransH

[Figure: the TransH constraints: the translation vector d_r is constrained to lie in the relation hyperplane; the remaining constraints are soft constraints.]

[Wang et al., 2014]

123
Entities
TransR

Use different embedding spaces for entities and relations


I 1 entity space
I multiple relation spaces
I perform translation in appropriate relation space

[Lin et al., 2015]


124
Entities
TransR

[Lin et al., 2015]


125
Entities
TransR

Relations: Rd
Entities: Rk
Mr = projection matrix: k * d

Constraints:

[Lin et al., 2015]


126
Entities
Challenges

I How about time?


E.g., some relations hold from a certain date, until a certain date.
I New entities/relationships
I Finding synonymous relationships/duplicate entities (2005, end date,
destinys child) (destinys child, disband, 2005) (destinys child,
last performance, 2005)
I Evaluation
Link prediction? Relation classification? Is this fair? As in, is this even possible in
all cases (for a human without any world knowledge)?

127
Entities
Resources: toolkits + knowledge bases
Source Code
KB2E : https://github.com/thunlp/KB2E [Lin et al., 2015]
TransE : https://everest.hds.utc.fr/
Knowledge Graphs
I Google Knowledge Graph
google.com/insidesearch/features/search/knowledge.html
I Freebase
freebase.com
I GeneOntology
geneontology.org
I WikiLinks
code.google.com/p/wiki-links

128
Outline
Morning program
Preliminaries
Semantic matching
Learning to rank
Entities
Knowledge graph embeddings
Entity mentions in unstructured text
Entity finding

Afternoon program
Modeling user behavior
Generating responses
Recommender systems
Industry insights
Q&A

129
Entities
Entity mentions

Recognition Detect mentions within unstructured text (e.g., query).

Linking Link mentions to knowledge graph entities.

Utilization Use mentions to improve search.

130
Entities
Named entity recognition

[Figure: named entity recognition as sequence labelling. Left (the task): input x = "EU rejects German call to boycott British lamb .", labels y = B-ORG O B-MISC O O O B-MISC O O, with hidden representations h in between. Right: a vanilla RNN tagging "EU rejects German call" as B-ORG O B-MISC O.]

131
Entities
Named entity recognition
I A Unified Architecture for Natural
Language Processing: Deep Neural
Networks with Multitask Learning
[Collobert and Weston, 2008]
I Natural Language Processing (Almost)
from Scratch [Collobert et al., 2011]

[Figures: learning a single model to solve multiple NLP tasks, and the feed-forward language model architecture for different NLP tasks. Taken from Collobert and Weston [2008].]
132
Entities
Named entity recognition

[Figure: BI-LSTM-CRF model. Forward and backward LSTM layers read "EU rejects German call"; a CRF layer on top outputs the tags B-ORG O B-MISC O.]

[Huang et al., 2015]


133
Entities
Entity disambiguation
I Learn representations for documents
and entities.
I Optimize a distribution of candidate
entities given a document using (a)
cross entropy or (b) pairwise loss.

[Figures: learn an initial document representation in an unsupervised pre-training stage; learn the similarity between document and entity representations using supervision. Taken from He et al. [2013].]
134
Entities
Entity linking

Learn representations for the context, the mention, the entity (using surface words) and the
entity class. Uses pre-trained word2vec embeddings. Taken from [Sun et al., 2015].

135
Entities
Entity linking

Encode Wikipedia descriptions, linked mentions in Wikipedia and fine-grained entity types. All
representations are optimized jointly. Taken from [Gupta et al., 2017].

136
Entities
Entity linking

A single mention phrase refers to various entities. Multi-Prototype Mention Embedding model
that learns multiple sense embeddings for each mention by jointly modeling words from textual
contexts and entities derived from a KB. Taken from [Cao et al., 2017].

137
Entities
Improving search using linked entities

Attention-based ranking model for word-entity duet. Learn a similarity between query and
document entities. Resulting model can be used to obtain ranking signal. Taken from [Xiong
et al., 2017a].

138
Outline
Morning program
Preliminaries
Semantic matching
Learning to rank
Entities
Knowledge graph embeddings
Entity mentions in unstructured text
Entity finding

Afternoon program
Modeling user behavior
Generating responses
Recommender systems
Industry insights
Q&A

139
Entities
Entity finding

Task definition
Rank entities satisfying a topic described by a few query terms.

Not just document search — (a) topics do not typically correspond to entity names,
(b) average textual description much longer than typical document.

Different instantiations of the task within varying domains:


I Wikipedia: INEX Entity Ranking Track [de Vries et al., 2007, Demartini et al.,
2008, 2009, 2010] (lots of text, knowledge graph, revisions)
I Enterprise search: expert finding [Balog et al., 2006, 2012] (few entities,
abundance of text per entity)
I E-commerce: product ranking [Rowley, 2000] (noisy text, customer preferences)

140
Entities
Semantic Expertise Retrieval [Van Gysel et al., 2016]
I Expert finding is a particular entity retrieval task where there is a lot of text.
I Learn representations of words and entities such that n-grams extracted from a
document predict the correct expert.

Taken from slides of Van Gysel et al. [2016].

141
Entities
Semantic Expertise Retrieval [Van Gysel et al., 2016] (cont’d)

I Expert finding is a particular entity retrieval task where there is a lot of text.
I Learn representations of words and entities such that n-grams extracted from a
document predict the correct expert.

Taken from slides of Van Gysel et al. [2016].

142
Entities
Regularities in Text-based Entity Vector Spaces [Van Gysel et al., 2017b]

To what extent do entity representation models, trained only on text, encode structural
regularities of the entity’s domain?

Goal: give insight into learned entity representations.


I Clusterings of experts correlate somewhat with groups that exist in the real world.
I Some representation methods encode co-authorship information into their vector
space.
I Rank within organizations is learned (e.g., Professors > PhD students) as senior
people typically have more published works.

143
Entities
Latent Semantic Entities [Van Gysel et al., 2016]
I Learn representations of e-commerce products and query terms for product search.
I Tackles learning objective scalability limitations from previous work.
I Useful as a semantic feature within a Learning To Rank model in addition to a
lexical matching signal.

Taken from slides of Van Gysel et al. [2016].

144
Entities
Personalized Product Search [Ai et al., 2017]

I Learn representations of e-commerce


products, query terms, and users for
personalized e-commerce search.
I Mixes supervised (relevance triples of
query, user and product) and
unsupervised (language modeling)
objectives.
I The query is represented as an
interpolation of query term and user
representations.
Personalized product search in a latent space
with query ~q, user ~u and product item ~i. Taken
from Ai et al. [2017].

145
Entities
Resources: toolkits

SERT : http://www.github.com/cvangysel/SERT [Van Gysel et al., 2017a]


HEM : https://ciir.cs.umass.edu/downloads/HEM [Ai et al., 2017]

146
Entities
Resources: further reading on entities/KGs

For more information, see the tutorial on “Utilizing Knowledge Graphs in Text-centric
Information Retrieval” [Dietz et al., 2017] presented at last year’s WSDM.

https://github.com/laura-dietz/tutorial-utilizing-kg

147
Outline

Morning program
Preliminaries
Semantic matching
Learning to rank
Entities

Afternoon program
Modeling user behavior
Generating responses
Recommender systems
Industry insights
Q&A

148
Modeling user behavior
Understanding user behavior is the key

The ability to accurately predict the behavior of a particular user allows search engines to construct optimal result pages.
149
Modeling user behavior
User behavioral signals

Actions
(e.g., click, first/last click, long click, satisfied click, repeated click)
Times between actions
(e.g., time between clicks, time to first/last click)
150
Modeling user behavior
Interpretation is difficult

Biases in user behavior — (statistically significant) differences between probability distributions of a user behavioral signal observed in different contexts

Clicks are biased towards:
I higher ranked results (position bias)
I visually salient results (attention bias)
I previously unseen results (novelty bias)

Click dwell times are biased

Times to first/last/satisfied clicks are biased

⇒ account for biases in clicks

151
Modeling user behavior
Applications of user behavior models

I Understand users

I Simulate users

I Features for ranking

I Evaluate search

152
Outline
Morning program
Preliminaries
Semantic matching
Learning to rank
Entities
Afternoon program
Modeling user behavior
Click behavior in web search
Time aspects of user behavior in web search
Web search vs. sponsored search
Take aways and future work
Generating responses
Recommender systems
Industry insights
Q&A
153
Modeling user behavior
Traditional click models

[Figure: graphical representation of the cascade click model: for each document u_r there are examination (E_r), attractiveness (A_r, with parameter α_{u_r q}) and click (C_r) variables, chained from rank r − 1 to rank r.]

Pros: Based on the probabilistic graphical model (PGM) framework


Cons: Structure of the underlying PGM has to be set manually
154
Modeling user behavior
Neural click modeling framework

[Figure: the neural click model. The vector state is initialized with the user query; then, for each result in turn, the vector state is updated with DOCUMENT r and a click on DOCUMENT r is predicted (in the example, document 1 is skipped, document 2 is clicked, document 3 is skipped).]

A neural click model for web search [Borisov et al., 2016].

Learns patterns of user behavior directly from click-through data


155
Modeling user behavior
Distributed representations (s0 , s1 , s2 , . . .)

[Figure: the same neural click model as on the previous slide.]

We model user browsing behavior as a sequence of vector states (s0 , s1 , s2 , . . .) that


describes the information consumed by the user as it evolves during a query session.

156
Modeling user behavior
Mappings I, U and Function F
[Figure: initialize the vector state with the USER QUERY; then, for each result r: update the vector state with DOCUMENT r and predict the click on DOCUMENT r.]

s_0 = I(q)
s_{r+1} = U(s_r, i_r, d_{r+1})
P(C_{r+1} = 1 | q, i_1, . . . , i_r, d_1, . . . , d_{r+1}) = F(s_{r+1})

q — user query
i_r — user interaction with document at rank r
d_r — document at rank r

157
Modeling user behavior
Neural click modeling framework → NCM^{RNN, LSTM}_{QD, QD+Q, QD+Q+D}
Representations of q, d_r and i_r

Use three sets: QD, QD+Q, QD+Q+D

Parameterization of I, U and F

I Feed-forward neural network


U Recurrent neural network (RNN, LSTM)
F Feed-forward neural network
(with one output unit and the sigmoid activation function)

Training

Stochastic gradient descent


(with AdaDelta update rules and gradient clipping)
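A minimal Python/numpy sketch of this parameterization: I and F as feed-forward layers and U as a plain RNN cell (the framework also uses LSTMs). All dimensions, the random weights and the toy query/document representations are illustrative assumptions, not the actual setup.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

dim = 64
rng = np.random.default_rng(0)
W_q = rng.normal(scale=0.01, size=(dim, dim))   # parameters of I
W_s = rng.normal(scale=0.01, size=(dim, dim))   # recurrent weights of U
W_d = rng.normal(scale=0.01, size=(dim, dim))   # input weights of U
w_f = rng.normal(scale=0.01, size=dim)          # parameters of F (sigmoid output unit)

def I(q):                    # s_0 = I(q)
    return np.tanh(W_q @ q)

def U(s, x):                 # s_{r+1} = U(s_r, i_r, d_{r+1}); x combines i_r and d_{r+1}
    return np.tanh(W_s @ s + W_d @ x)

def F(s):                    # P(C_{r+1} = 1 | ...) = F(s_{r+1})
    return sigmoid(w_f @ s)

s = I(rng.normal(size=dim))                 # toy query representation
for x in rng.normal(size=(3, dim)):         # three toy document/interaction representations
    s = U(s, x)
    print("P(click) =", float(F(s)))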

158
Modeling user behavior
Experimental setup
Dataset

Yandex Relevance Prediction dataset1


(146,278,823 query sessions)

Tasks and evaluation metrics

Click prediction task (log-likelihood, perplexity)


Relevance prediction task (NDCG)

Baselines

Dynamic Bayesian network (DBN), Dependent click model (DCM)


Click chain model (CCM), User browsing model (UBM)

159
Modeling user behavior
Results on click prediction task

Click model Perplexity Log-likelihood


DBN 1.3510 −0.2824
DCM 1.3627 −0.3613
CCM 1.3692 −0.3560
UBM 1.3431 −0.2646
NCM^RNN_QD        1.3379   −0.2564
NCM^LSTM_QD       1.3362   −0.2547
NCM^LSTM_QD+Q     1.3355   −0.2545
NCM^LSTM_QD+Q+D   1.3318   −0.2526

Differences between all pairs of click models are statistically significant (p < 0.001)

160
Modeling user behavior
Results on relevance prediction task
NDCG
Click model @1 @3 @5 @10
DBN 0.717 0.725 0.764 0.833
DCM 0.736 0.746 0.780 0.844
CCM 0.741 0.752 0.785 0.846
UBM 0.724 0.737 0.773 0.838
NCM^RNN_QD        0.762   0.759   0.791   0.851
NCM^LSTM_QD       0.756   0.759   0.789   0.850
NCM^LSTM_QD+Q     0.775   0.773   0.799   0.857
NCM^LSTM_QD+Q+D   0.755   0.755   0.787   0.847

Improvements of NCM^RNN_QD, NCM^LSTM_QD and NCM^LSTM_QD+Q over baseline click models are statistically significant (p < 0.05)
161
Modeling user behavior
Analysis
[Figure: three t-SNE projections of the vector states s_r; points are labeled with rank-related values (0–10) and form both large and small clusters.]

Learns regularities in user browsing behavior that


1. have been manually encoded in existing click models, such as ranks and distances
to previous clicks (large clusters on t-SNE projections of vector states sr )
2. can not be manually encoded in traditional click models
(small clusters on t-SNE projections of vector states sr )
162
Outline
Morning program
Preliminaries
Semantic matching
Learning to rank
Entities
Afternoon program
Modeling user behavior
Click behavior in web search
Time aspects of user behavior in web search
Web search vs. sponsored search
Take aways and future work
Generating responses
Recommender systems
Industry insights
Q&A
163
Modeling user behavior
Times between user actions
[Figure: timeline of a query session with queries (Q) and clicks (C), annotated with time-to-first-click, time-between-clicks, time-to-last-click and time-from-abandoned-query.]

I time-to-first-click (reflects quality of result presentation)

I time-between-clicks (proxy for click dwell time)

I time-to-last-click (reflects quality of search engine results)

I time-from-abandoned-query (reflects quality of search engine results in query


sessions with no clicks)

164
Modeling user behavior
How to interpret times between user actions

Average

t̂ = (1/N) ∑_{i=1}^{N} τ_i

Uncertainty: (1/2)(3 + 600) vs. (1/7)(30 + 28 + 45 + 23 + 100 + 23 + 58)

165
Modeling user behavior
How to interpret times between user actions

Average

t̂ = (1/N) ∑_{i=1}^{N} τ_i

Uncertainty: (1/2)(3 + 600) vs. (1/7)(30 + 28 + 45 + 23 + 100 + 23 + 58)

Fit distribution (e.g., exponential, gamma, Weibull)

θ̂ = arg max_θ ∏_{i=1}^{N} f(τ_i | θ)

Context bias: f_{high expectation}(τ = 15 | θ_1) vs. f_{low expectation}(τ = 10 | θ_2)
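A minimal Python sketch of the two options above on a handful of made-up observations, using scipy's maximum-likelihood fit for the gamma distribution (the toy values are assumptions).

import numpy as np
from scipy import stats

times = np.array([30, 28, 45, 23, 100, 23, 58], dtype=float)   # toy times in seconds

mean_estimate = times.mean()                      # average: (1/N) sum tau_i
k, loc, theta = stats.gamma.fit(times, floc=0)    # MLE fit: argmax prod f(tau_i | theta)

print(f"average: {mean_estimate:.1f}s, gamma shape k = {k:.2f}, scale theta = {theta:.2f}")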

165
Modeling user behavior
Detected context bias effect
0.8

with observed context bias effect 0.7


Percentage of action IDs

0.6

0.5 time-to-first-click
time-between-clicks
0.4 time-to-last-click
time-from-abandoned-query
0.3 significance level (0.05)

0.2

0.1

0.0
40 60 80 100 120 140 160 180
minimum number of observations in context groups
166
Modeling user behavior
Context-aware time modeling (naive)

Time(action, context) ∼ Gamma(k(act, ctx ), θ(act, ctx ))

167
Modeling user behavior
Context-aware time modeling

Time(action, context) ∼ Gamma(
    a_k(ctx) · k(act) + b_k(ctx),
    a_θ(ctx) · θ(act) + b_θ(ctx))

168
Modeling user behavior
Parameter estimation

Time(action, context) ∼ Gamma(
    a_k(ctx) · k(act) + b_k(ctx),
    a_θ(ctx) · θ(act) + b_θ(ctx))

1. Fix context-independent parameters


2. Optimize context-dependent parameters using neural networks
3. Fix context-dependent parameters
4. Optimize context-independent parameters using gradient descent
5. Repeat until convergence
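A minimal Python/numpy sketch of the context-dependent part: a tiny network maps context features to the multipliers and offsets (a_k, b_k, a_θ, b_θ) that rescale the context-independent Gamma parameters. Shapes, weights and the context features are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
ctx_dim, hidden = 8, 16
W1 = rng.normal(scale=0.1, size=(hidden, ctx_dim))
W2 = rng.normal(scale=0.1, size=(4, hidden))

def context_net(ctx):
    h = np.tanh(W1 @ ctx)
    a_k, b_k, a_theta, b_theta = W2 @ h
    return np.exp(a_k), b_k, np.exp(a_theta), b_theta   # keep the multipliers positive

k_act, theta_act = 2.0, 10.0             # context-independent Gamma(k, theta) for one action
ctx = rng.normal(size=ctx_dim)           # toy context features
a_k, b_k, a_theta, b_theta = context_net(ctx)

k = a_k * k_act + b_k                    # k(act, ctx)     = a_k(ctx) * k(act) + b_k(ctx)
theta = a_theta * theta_act + b_theta    # theta(act, ctx) = a_theta(ctx) * theta(act) + b_theta(ctx)
print("context-aware Gamma parameters:", k, theta)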

169
Modeling user behavior
Parameter estimation

I We do not know the form of context-dependent parameters


=⇒ neural networks

I We know the form of context-independent parameters (Gamma distribution)


=⇒ direct optimization

170
Modeling user behavior
Dataset

3 months of log data from Yandex search engine

Time between actions          Max time   # Observations

Time-to-first-click           1 min      30,747,733
Time-between-clicks           5 min      6,317,834
Time-to-last-click            5 min      30,446,973
Time-from-abandoned-query     1 min      11,523,351

171
Modeling user behavior
Evaluation tasks

Task1. Predict time between clicks


I Log-likelihood
I Root mean squared error (RMSE)

Task2. Rank results based on time between clicks


I nDCG@{1, 3, 5, 10}

172
Modeling user behavior
Task 1. Predicting time

Time model      Distribution   Log-likelihood   RMSE

Basic           exponential    −4.9219          60.73
                gamma          −4.9105          60.76
                Weibull        −4.9077          60.76
Context-aware   exponential    −4.8787          58.93
                gamma          −4.8556          58.98
                Weibull        −4.8504          58.94

173
Modeling user behavior
Results on time prediction task (time-between-clicks)
[Figure: log-likelihood (y-axis, −6.0 to −3.0) vs. time between clicks in seconds (x-axis, 1–256) for basic and context-aware exponential, gamma and Weibull time models.]
174
Modeling user behavior
Task 2. Ranking results

                               NDCG
Time model      Distribution   @1      @3      @5      @10
Average         —              0.651   0.693   0.728   0.812
Context-aware   exponential    0.668   0.710   0.743   0.820
                gamma          0.675   0.715   0.748   0.822
                Weibull        0.671   0.709   0.745   0.821

175
Modeling user behavior
Summary

I Remove context bias from time between actions

I Predict user search interactions better (Task 1)

I Use the context-independent component for better document ranking (Task 2)

176
Outline
Morning program
Preliminaries
Semantic matching
Learning to rank
Entities
Afternoon program
Modeling user behavior
Click behavior in web search
Time aspects of user behavior in web search
Web search vs. sponsored search
Take aways and future work
Generating responses
Recommender systems
Industry insights
Q&A
177
Modeling user behavior
Web search vs. sponsored search

In web search we work (mostly) with query sessions and search sessions
In sponsored search we need to consider longer user histories

Recurrent Neural Networks (RNNs) can be used not only to account for biases,
but also to infer user interests and behavioral patterns
from very long sequences of user actions

Sequential Click Prediction for Sponsored Search with Recurrent Neural


Networks [Zhang et al., 2014]

178
Modeling user behavior
Web search vs. sponsored search




[Figure: CCPM architecture: instance matrix with varied length → convolution → flexible pooling → convolution → pooling → fully connected → prediction.]

The framework of CCPM: the left part is an input instance (single ad impression or sequential ad impressions) with a varied number of elements; the dimension of the element embedding is d = 4. The architecture has convolutional layers with two feature maps each; the widths of the filters at the two layers are three.

A Convolutional Click Prediction Model [Liu et al., 2015]
179
Outline
Morning program
Preliminaries
Semantic matching
Learning to rank
Entities
Afternoon program
Modeling user behavior
Click behavior in web search
Time aspects of user behavior in web search
Web search vs. sponsored search
Take aways and future work
Generating responses
Recommender systems
Industry insights
Q&A
180
Modeling user behavior
Take aways and future work

Neural Networks — an alternative to probabilistic graphical models (PGMs)


that allows learning patterns of user behavior directly from the data

Understanding and modelling user behavior with PGMs is a mature field


We expect many ideas to be transferred from the PGM to the neural framework

Future user behavior models


I will learn patterns of user behavior directly from the data
I will take very long user history into account
I will extract signals from images, videos, user voice and background sounds
I will improve our understanding of humankind

181
Outline

Morning program
Preliminaries
Semantic matching
Learning to rank
Entities

Afternoon program
Modeling user behavior
Generating responses
Recommender systems
Industry insights
Q&A

182
Outline
Morning program
Preliminaries
Semantic matching
Learning to rank
Entities
Afternoon program
Modeling user behavior
Generating responses
One-shot dialogues
Open-ended dialogues (chit-chat)
Goal-oriented dialogues
Alternatives to RNNs
Resources
Recommender systems
Industry insights
Q&A
183
Generating responses
Tasks

I Question Answering
I Summarization
I Query Suggestion
I Reading Comprehension / Wiki Reading
I Dialogue Systems
I Goal-Oriented
I Chit-Chat

184
Generating responses
Example Scenario for machine reading task
Sandra went to the kitchen. Fred went to the kitchen. Sandra picked up the milk.
Sandra traveled to the office. Sandra left the milk. Sandra went to the bathroom.
I Where is the milk now? A: office
I Where is Sandra? A: bathroom
I Where was Sandra before the office? A: kitchen

185
Generating responses
Example Scenario for machine reading task
Sandra went to the kitchen. Fred went to the kitchen. Sandra picked up the milk.
Sandra traveled to the office. Sandra left the milk. Sandra went to the bathroom.
I Where is the milk now? A: office
I Where is Sandra? A: bathroom
I Where was Sandra before the office? A: kitchen
I’ll be going to Los Angeles shortly. I want to book a flight. I am leaving from
Amsterdam. I want the return flight to be early morning. I don’t have any extra
luggage. I wouldn’t mind extra leg room.
I What does the user want? A: Book a flight
I Where is the user flying from? A: Amsterdam
I Where is the user going to? A: Los Angeles

186
Generating responses
What is Required?

I The model needs to remember the context


I It needs to know what to look for in the context
I Given an input, the model needs to know where to look in the context
I It needs to know how to reason using this context
I It needs to handle changes in the context
A Possible Solution:
I Hidden states of RNNs have memory: run an RNN over the context and use its
representation to map questions to answers/responses.
This will not scale, as RNN states have limited ability to capture long-term dependencies:
vanishing gradients, limited state size.

187
Generating responses
Teaching Machine to Read and Comprehend

[Hermann et al., 2015]


188
Generating responses
Neural Networks with Memory

I Memory Networks
I End2End MemNNs
I Key-Value MemNNs
I Neural Turing Machines
I Stack/List/Queue Augmented RNNs

189
Generating responses
End2End Memory Networks [Sukhbaatar et al., 2015]

190
Generating responses
End2End Memory Networks [Sukhbaatar et al., 2015]

I Share the input and output embeddings or not


I What to store in memories individual words, word windows, full sentences
I How to represent the memories? Bag-of-words? RNN reading of words?
Characters?
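A minimal numpy sketch of a single end-to-end memory network hop with bag-of-words memory representations (one of the options listed above): embed the memories and the query, attend with a softmax, and read out a weighted sum. All sizes and the random embeddings are illustrative assumptions.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

vocab, dim, n_memories = 50, 16, 4
rng = np.random.default_rng(0)
A = rng.normal(scale=0.1, size=(vocab, dim))   # input memory embedding
C = rng.normal(scale=0.1, size=(vocab, dim))   # output memory embedding
B = rng.normal(scale=0.1, size=(vocab, dim))   # query embedding

memories = [rng.integers(0, vocab, size=5) for _ in range(n_memories)]  # token ids per sentence
query = rng.integers(0, vocab, size=4)

m = np.stack([A[s].sum(axis=0) for s in memories])   # memory keys (bag of words)
c = np.stack([C[s].sum(axis=0) for s in memories])   # memory values
u = B[query].sum(axis=0)                              # query representation

p = softmax(m @ u)      # attention over memories
o = p @ c               # read-out, combined with u in the full model
print("attention:", np.round(p, 2), "output shape:", o.shape)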
191
Generating responses
Attentive Memory Networks [Kenter and de Rijke, 2017]

Framing the task of conversational search as a general machine reading task.


192
Generating responses
Key-Value Memory Networks

Example:
for a KB triple [subject, relation, object], Key could be [subject,relation] and value
could be [object] or vice versa.
[Miller et al., 2016] 193
Generating responses
WikiReading [Hewlett et al., 2016, Kenter et al., 2018]

Task is based on Wikipedia data (datasets available in English, Turkish and Russian).

I Categorical: relatively small number of possible answer (e.g.: instance of, gender,
country).
I Relational: rare or totally unique answers (e.g.: date of birth, parent, capital).

194
Generating responses
WikiReading
I Answer Classification: Encoding document and question, using a softmax
classifier to assign a probability to each of the top 50k answers (limited answer vocabulary).
I Sparse BoW Baseline, Averaged Embeddings, Paragraph Vector, LSTM Reader,
Attentive Reader, Memory Network.
I Generally models with RNN and attention work better, especially at relational
properties.
I Answer Extraction (labeling/pointing) For each word in the document, compute
the probability that it is part of the answer.
I Works regardless of the vocabulary, but the answer needs to be mentioned in the
document.
I RNN Labeler: shows a complementary set of strengths, performing better on
relational properties than categorical ones
I Sequence to Sequence Encoding query and document and decoding the answer
as sequences of words or characters.
I Basic seq2seq, Placeholder seq2seq, Basic Character seq2seq,
I Unifies the classification and extraction in one model: Greater degree of balance
between relational and categorical properties.
195
Outline
Morning program
Preliminaries
Semantic matching
Learning to rank
Entities
Afternoon program
Modeling user behavior
Generating responses
One-shot dialogues
Open-ended dialogues (chit-chat)
Goal-oriented dialogues
Alternatives to RNNs
Resources
Recommender systems
Industry insights
Q&A
196
Generating responses
Dialogue systems

Dialogues/conversational agents/chat bots

Open-ended dialogues Goal-oriented dialogues


I ELIZA I Restaurant finding
I Twitterbots I Hotel reservations
I Alexa/Google home/Siri/Cortana I Set an alarm clock
I Order a pizza
I Play music
I Alexa/Google home/Siri/Cortana

Is this IR?
197
Generating responses
Dialogue systems

Chit-chat bots

[Figure: sequence-to-sequence dialogue: the user utterance "Hello how are you" is encoded and the machine decodes the response "I am fine thanks".]

Straightforward seq-to-seq [Vinyals and Le, 2015].
([Sordoni et al., 2015] is a precursor, but no RNN-to-RNN, and no LSTM). Same idea, but with attention [Shang et al., 2015].
198
Generating responses
Dialogue systems
Limitations
I ’Wrong’ optimization criterion
I Generic responses
I No way to incorporate world knowledge
I No model of conversation
I Inconsistency
I No memory of what was said earlier on

Example from [Vinyals and Le, 2015]:
Human: what is your job?
Machine: i’m a lawyer.
Human: what do you do?
Machine: i’m a doctor.

Evaluation
I Perplexity?
I BLEU/METEOR?
I Nice overview of How NOT To Evaluate Your Dialogue System [Liu et al., 2016].
I Open problem....
199
Generating responses
Dialogue systems

3 solutions
I More consistency in dialogue with hierarchical network
I Less generic responses with different optimization function
I More natural responses with GANs

200
Generating responses
Dialogue systems

Hierarchical seq-to-seq [Serban et al., 2016]. Main evaluation metric: perplexity.


201
Generating responses
Dialogue systems
Avoid generic responses
Usually: optimize the log likelihood of the predicted utterance, given the previous context:

C_LL = arg max_{u_t} log p(u_t | context) = arg max_{u_t} log p(u_t | u_0 . . . u_{t−1})

To avoid repetitive/boring answers ("I don’t know"), use the maximum mutual information
between the previous context and the predicted utterance [Li et al., 2015]:

C_MMI = arg max_{u_t} log [ p(u_t, context) / (p(u_t) p(context)) ]
      = [derivation, next page . . . ]
      = arg max_{u_t} (1 − λ) log p(u_t | context) + λ log p(context | u_t)
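A minimal Python sketch of using the last line above to rerank candidate responses; the candidate scores and λ are made-up numbers, and in practice both log-probabilities come from trained seq-to-seq models.

candidates = {
    "i don't know":        {"logp_resp_given_ctx": -2.0, "logp_ctx_given_resp": -9.0},
    "i am a lawyer":       {"logp_resp_given_ctx": -3.5, "logp_ctx_given_resp": -4.0},
    "the weather is nice": {"logp_resp_given_ctx": -6.0, "logp_ctx_given_resp": -8.0},
}
lam = 0.5

def mmi_score(s):
    # (1 - lambda) log p(u_t | context) + lambda log p(context | u_t)
    return (1 - lam) * s["logp_resp_given_ctx"] + lam * s["logp_ctx_given_resp"]

for response, scores in sorted(candidates.items(), key=lambda kv: -mmi_score(kv[1])):
    print(f"{mmi_score(scores):6.2f}  {response}")

The generic "i don't know" wins on likelihood alone, but is penalized once the backward term is mixed in.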
202
Generating responses
Dialogue systems
Bayes rule:
log p(u_t | context) = log [ p(context | u_t) p(u_t) / p(context) ]
log p(u_t | context) = log p(context | u_t) + log p(u_t) − log p(context)
log p(u_t) = log p(u_t | context) − log p(context | u_t) + log p(context)

C_MMI = arg max_{u_t} log [ p(u_t, context) / (p(u_t) p(context)) ]
      = arg max_{u_t} log [ p(u_t | context) p(context) / (p(u_t) p(context)) ]
      = arg max_{u_t} log [ p(u_t | context) / p(u_t) ]
      = arg max_{u_t} log p(u_t | context) − log p(u_t)   ← Weird, minus language model score.
      = arg max_{u_t} log p(u_t | context) − λ log p(u_t)   ← Introduce λ. Crucial step! Without this it wouldn’t work.
      = arg max_{u_t} log p(u_t | context) − λ(log p(u_t | context) − log p(context | u_t) + log p(context))
      = arg max_{u_t} (1 − λ) log p(u_t | context) + λ log p(context | u_t)

(More is needed to get it to work. See [Li et al., 2015] for more details.)
203
Generating responses
Generative adversarial network for dialogues

I Discriminator network
  I Classifier: real or generated utterance
I Generator network
  I Generate a realistic utterance

[Figure: the generator produces samples; the discriminator receives generated samples and real data and outputs p(real / generated).]


Original GAN paper [Goodfellow et al., 2014].
Conditional GANs, e.g. [Isola et al., 2016].

204
Generating responses
Generative adversarial network for dialogues

I Discriminator network
  I Classifier: real (provided) or generated utterance
I Generator network
  I Generate a realistic utterance

[Figure: the generator produces a response to the input "Hello how are you" (e.g., "I am fine thanks"); the discriminator scores utterances with p(x = real) vs. p(x = generated).]

See [Li et al., 2017] for more details.

Code available at https://github.com/jiweil/Neural-Dialogue-Generation

205
Generating responses
Dialogue systems

Open-ended dialogue systems


I Very cool, current problem
I Very hard
I Many problems
I Training data
I Evaluation
I Consistency
I Persona
I ...

206
Outline
Morning program
Preliminaries
Semantic matching
Learning to rank
Entities
Afternoon program
Modeling user behavior
Generating responses
One-shot dialogues
Open-ended dialogues (chit-chat)
Goal-oriented dialogues
Alternatives to RNNs
Resources
Recommender systems
Industry insights
Q&A
207
Generating responses
Goal-oriented
Idea Challenges
I Closed domain I Training data
I Restaurant reservations
I Finding movies
I Keeping track of dialogue history
I Have a dialogue system
I Handling of out-of-domain words or
find out what the user requests
wants I Going beyond task-specific slot filling
I Intermingling live API calls, chit chat,
information requests, etc.
I Evaluation
I Solve the task
I Naturalness
I Tone of voice
I Speed
I Error recovery
208
Generating responses
Goal-oriented as seq2seq
Memory network [Bordes and Weston, 2017]
I Simulated dataset
I Finite set of things the bot can say
  I Because of the way the dataset is constructed
I Memory networks
I Training: next utterance prediction
I Evaluation
  I response-level
  I dialogue-level

Restaurant Knowledge Base, i.e., a table, queried by API calls.
Each row = restaurant:
I cuisine (10 choices, e.g., French, Thai)
I location (10 choices, e.g., London, Tokyo)
I price range (cheap, moderate or expensive)
I rating (from 1 to 8)
For words of relevant entity types:
I add a trainable entity vector

209
Generating responses
Goal-oriented as reinforcement learning

A typical reinforcement learning system:
I States S
I Actions A
I State transition function: T : S, A → S
I Reward function: R : S, A, S → R
I Policy: π : S → A

An RL system needs an environment to interact with (e.g., real users).

Typically [Shah et al., 2016]:
I States: the agent’s interpretation of the environment: a distribution over user intents, dialogue acts and slots and their values
  I intent(buy ticket)
  I inform(destination=Atlanta)
  I ...
I Actions: possible communications, usually designed as a combination of dialogue act tags, slots and possibly slot values
  I request(departure date)
  I ...
210
Generating responses
Goal-oriented as reinforcement learning

Restaurant finding [Wen et al., 2017]:
I Neural belief tracking: distribution over possible values of a set of slots
I Delexicalisation: swap slot-values for a generic token (e.g. Chinese, Indian, Italian → FOOD TYPE)

Movie finding [Dhingra et al., 2017]:
I Simulated user
I Soft attention over database
I Neural belief tracking:
  I Multinomial distribution for every column over possible column values
  I RNN, input is dialogue so far, output softmax over possible column values
I Reward based on finding the right KB entry.

211
Generating responses
Goal-oriented

Goal-oriented models
I Currently works primarily in very small domains
I How about multiple speakers?
I Not clear what kind of architecture is best
I Reinforcement learning might be the way to go (?)
I Open research area...

212
Outline
Morning program
Preliminaries
Semantic matching
Learning to rank
Entities
Afternoon program
Modeling user behavior
Generating responses
One-shot dialogues
Open-ended dialogues (chit-chat)
Goal-oriented dialogues
Alternatives to RNNs
Resources
Recommender systems
Industry insights
Q&A
213
Generating responses
Alternatives to RNNs

RNNs are:
I Well-studied
I Robust and tried and trusted method for sequence tasks

However, RNNs have several drawbacks:
I Take time to train
I Expensive to unroll for many steps
I Not too good at catching long-term dependencies

Can we do better?
I WaveNet
I ByteNet
I Transformer

214
Generating responses
Alternatives to RNNs: WaveNet
WaveNet is originally introduced for a text-to-speech task (i.e. generating realistic
audio waves).
We try to model:
p(x) = ∏_{t=1}^{T} p(x_t | x_1, . . . , x_{t−1}).

I Stack of convolutional layers. No pooling layers.


I Output of the model has the same time dimensionality as the input.
I Output is a categorical distribution over the next value xt with a softmax layer and
it is optimized to maximize the log-likelihood of the data w.r.t. the parameters.
Based on the idea of dilated causal convolutions.

[van den Oord et al., 2016]

215
Generating responses
Alternatives to RNNs: WaveNet
Causal convolutions
[Figure: a stack of causal convolutional layers: input, three hidden layers, output.]

[van den Oord et al., 2016]

216
Generating responses
Alternatives to RNNs: WaveNet
Dilated causal convolutions

[Figure: a stack of dilated causal convolutional layers: input; hidden layers with dilation 1, 2 and 4; output layer with dilation 8.]
“At training time, the conditional predictions for all timesteps can be made in parallel because all
timesteps of ground truth x are known. When generating with the model, the predictions are se-
quential: after each sample is predicted, it is fed back into the network to predict the next sample.”

[van den Oord et al., 2016]
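A minimal numpy sketch of a dilated causal convolution with filter width 2: each output at position t only sees inputs at t and t − dilation. The signal, filter and layer stack are illustrative; the real WaveNet adds gated activations, residual and skip connections.

import numpy as np

def dilated_causal_conv(x, w, dilation):
    # x: (T,) signal, w: (2,) filter -> y: (T,), left zero-padded so the layer stays causal
    pad = np.concatenate([np.zeros(dilation), x])
    return w[0] * pad[:-dilation] + w[1] * pad[dilation:]   # y[t] = w0*x[t-d] + w1*x[t]

x = np.random.randn(16)
y = x
for d in (1, 2, 4, 8):                                      # receptive field grows exponentially
    y = np.tanh(dilated_causal_conv(y, np.array([0.5, 0.5]), d))
print(y.shape)                                              # same length as the input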


217
Generating responses
Alternatives to RNNs: ByteNet
[Figure: ByteNet: the encoder reads the source sequence s_0 . . . s_16 and the decoder, stacked on top of it, generates the target sequence t_1 . . . t_17.]

[Kalchbrenner et al., 2016]


218
Generating responses
Alternatives to RNNs: Transformer

I Positional encoding added to the input embeddings


I Key-value attention
I Multi-head self-attention
I The encoder attends over its own states
I The decoder alters between
I attending over its own inputs/states
I attending over encoder states at the same level

[Vaswani et al., 2017]
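A minimal numpy sketch of (single-head) scaled dot-product self-attention, the building block behind the bullets above; multi-head attention runs several of these in parallel on learned projections. Shapes and the random weights are illustrative assumptions.

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # (seq_len, seq_len) similarity of queries and keys
    return softmax(scores) @ V             # weighted sum of the values

seq_len, d_model = 5, 8
X = np.random.randn(seq_len, d_model)
Wq, Wk, Wv = (np.random.randn(d_model, d_model) for _ in range(3))
out = attention(X @ Wq, X @ Wk, X @ Wv)    # self-attention: Q, K and V all come from X
print(out.shape)                            # (5, 8)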

219
Generating responses
Alternatives to RNNs: Transformer

[Figure: encoder and decoder, each a stack of layers 1 . . . N; decoder layers attend over the encoder states at the same level.]

220
Outline
Morning program
Preliminaries
Semantic matching
Learning to rank
Entities
Afternoon program
Modeling user behavior
Generating responses
One-shot dialogues
Open-ended dialogues (chit-chat)
Goal-oriented dialogues
Alternatives to RNNs
Resources
Recommender systems
Industry insights
Q&A
221
Generating responses
Resources: datasets
Open-ended dialogue Goal-oriented dialogues
- Opensubtitles [Tiedemann, 2009] - MISC: A data set of information-seeking
- Twitter: http://research.microsoft.com/convo/ conversations [Thomas et al., 2017]
- Weibo: - Maluuba Frames
http://www.noahlab.com.hk/topics/ShortTextConversation http://datasets.maluuba.com/Frames

- Ubuntu Dialogue Corpus [Lowe et al., - Loqui Human-Human Dialogue Corpus


2015] https://academiccommons.columbia.edu/catalog/ac:176612
- Switchboard - bAbi (Facebook Research)
https://web.stanford.edu/~jurafsky/ws97/ https://research.fb.com/downloads/babi/
- Coarse Discourse (Google Research)
https://research.googleblog.com/2017/05/ Machine reading
coarse-discourse-dataset-for.html - bAbi QA (Facebook Research)
https://research.fb.com/downloads/babi/
- QA Corpus [Hermann et al., 2015]
https://github.com/deepmind/rc-data/
- WikiReading (Google Research)
https://github.com/google-research-datasets/wiki-reading
222
Generating responses
Resources: source code

I End-to-end memory network


https://github.com/facebook/MemNN

I Attentive Memory Networks


https://bitbucket.org/TomKenter/attentive-memory-networks-code

I Hierarchical NN [Serban et al., 2016]


https://github.com/julianser/hed-dlg, https://github.com/julianser/rnn-lm

I GAN for dialogues


https://github.com/jiweil/Neural-Dialogue-Generation

I RL for dialogue agents [Dhingra et al., 2017]


https://github.com/MiuLab/KB-InfoBot

I Transformer network
https://github.com/tensorflow/tensor2tensor

223
Outline

Morning program
Preliminaries
Semantic matching
Learning to rank
Entities

Afternoon program
Modeling user behavior
Generating responses
Recommender systems
Industry insights
Q&A

224
Outline
Morning program
Preliminaries
Semantic matching
Learning to rank
Entities
Afternoon program
Modeling user behavior
Generating responses
Recommender systems
Items and Users
Matrix factorization
Matrix factorization as a network
Side information
Richer models
Other tasks
Wrap-up
Industry insights
Q&A
225
Recommender systems
Recommender systems – The task

I Build a model that estimates how much a user will like an item.
I A typical recommendation setup has
  I a matrix with users and items
  I plus ratings of users for items reflecting past/known preferences
  and tries to predict future preferences
I This is not about rating prediction

[Karatzoglou and Hidasi, Deep Learning for Recommender Systems, RecSys ’17, 2017]
226
Recommender systems
Approaches to recommender systems

I Collaborative filtering
I Based on analyzing users’ behavior and preferences such as ratings given to movies
or books
I Content-based filtering
I Based on matching the descriptions of items and users’ profiles
I Users’ profiles are typically constructed using their previous purchases/ratings, their
submitted queries to search engines and so on
I A hybrid approach

227
Recommender systems
Warm, cold

I Cold start problem


I User cold-start problem – generate recommendations for a new user / a user for
whom very few preferences are known
I Item cold-start problem – recommending items that are new / for which very
few users have shared ratings or preferences
I Cold items/users
I Warm items/users

228
Outline
Morning program
Preliminaries
Semantic matching
Learning to rank
Entities
Afternoon program
Modeling user behavior
Generating responses
Recommender systems
Items and Users
Matrix factorization
Matrix factorization as a network
Side information
Richer models
Other tasks
Wrap-up
Industry insights
Q&A
229
Recommender systems
Matrix factorization
I The recommender system’s work horse
[Figure: the user–item rating matrix R is approximated by the product of low-rank user and item factor matrices (U, V).]

min_{u,v} ∑_{i,j∈R} (r_{i,j} − u_i^T v_j)^2 + λ(‖u_i‖^2 + ‖v_j‖^2)

230
Recommender systems
Matrix factorization

I Discover the latent features underlying the interactions between users and items
I Don’t rely on imputation to fill in missing ratings and make matrix dense
I Instead, model observed ratings directly, avoid overfitting through a regularized
model
I Minimize the regularized squared error on the set of known ratings:
min_{u,v} ∑_{i,j∈R} (r_{i,j} − u_i^T v_j)^2 + λ(‖u_i‖^2 + ‖v_j‖^2)

Popular methods for minimizing include stochastic gradient descent and


alternating least squares
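A minimal numpy sketch of minimizing the objective above with stochastic gradient descent; the toy ratings, latent dimension, learning rate and number of epochs are illustrative assumptions.

import numpy as np

ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 1, 1.0)]   # (user, item, rating)
n_users, n_items, dim, lr, lam = 3, 2, 8, 0.05, 0.1

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(n_users, dim))
V = rng.normal(scale=0.1, size=(n_items, dim))

for epoch in range(100):
    for i, j, r in ratings:
        u_i, v_j = U[i].copy(), V[j].copy()
        err = r - u_i @ v_j
        U[i] += lr * (err * v_j - lam * u_i)   # gradient step on u_i
        V[j] += lr * (err * u_i - lam * v_j)   # gradient step on v_j

print("predicted rating for (user 0, item 0):", U[0] @ V[0])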

231
Outline
Morning program
Preliminaries
Semantic matching
Learning to rank
Entities
Afternoon program
Modeling user behavior
Generating responses
Recommender systems
Items and Users
Matrix factorization
Matrix factorization as a network
Side information
Richer models
Other tasks
Wrap-up
Industry insights
Q&A
232
Recommender systems
A feed-forward neural network view

[Raimon and Basilico, Deep Learning for Recommender Systems, 2017]

233
Recommender systems
A deeper view

234
Recommender systems
Matrix factorization vs. feed-forward network

I Two models are very similar


I Embeddings, MSE loss, gradient-based optimization
I Feed-forward net can learn different embedding combinations than a dot product
I Capturing pairwise interactions through feed-forward net requires a huge amount
of data
I This approach is not superior to properly tuned traditional matrix factorization
approach
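A minimal numpy sketch contrasting the two combination functions discussed above: a dot product of user/item embeddings (matrix factorization) vs. a feed-forward layer on their concatenation. Embeddings and weights are random placeholders; in practice both are trained end-to-end with an MSE-style loss.

import numpy as np

dim = 8
user_emb = np.random.randn(dim)
item_emb = np.random.randn(dim)

# 1) matrix factorization: score = <u, v>
mf_score = user_emb @ item_emb

# 2) feed-forward combination: score = w2 . relu(W1 [u; v])
W1 = np.random.randn(16, 2 * dim) * 0.1
w2 = np.random.randn(16) * 0.1
hidden = np.maximum(0.0, W1 @ np.concatenate([user_emb, item_emb]))
nn_score = w2 @ hidden

print(mf_score, nn_score)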

235
Recommender systems
Great escape . . .

I Side information
I Richer models
I Other tasks

236
Outline
Morning program
Preliminaries
Semantic matching
Learning to rank
Entities
Afternoon program
Modeling user behavior
Generating responses
Recommender systems
Items and Users
Matrix factorization
Matrix factorization as a network
Side information
Richer models
Other tasks
Wrap-up
Industry insights
Q&A
237
Recommender systems
Side information for recommendation


238
Recommender systems
Side information for recommendation

I Textual side information


I Product description, reviews, etc.
I Extraction: RNNs, one dimensional CNNs, word embeddings, paragraph vectors
I Applications: news, products, books, publication
I Images
I Product pictures, video thumbnails
I Extraction: CNNs
I Applications: fashion, video
I Music/audio
I Extraction: CNNs and RNNs
I Applications: music

239
Recommender systems
Textual side information

I Content2vec [Nedelec et al., 2016]


I Using associated textual information for recommendations [Bansal et al., 2016]

240
Recommender systems
Textual information for improving recommendations
I Task: paper recommendation
I Item representation
I Text representation: RNN based
I Item-specific embeddings created using MF
I Final representation: item + text embeddings

241
Recommender systems
Images in recommendation
Visual Bayesian Personalized Ranking (BPR) [He and McAuley, 2016]
I Bias terms

I MF model
I Visual part:
I Pretrained CNN features
I Dimension reduction through embeddings
I BPR loss

242
Outline
Morning program
Preliminaries
Semantic matching
Learning to rank
Entities
Afternoon program
Modeling user behavior
Generating responses
Recommender systems
Items and Users
Matrix factorization
Matrix factorization as a network
Side information
Richer models
Other tasks
Wrap-up
Industry insights
Q&A
243
Recommender systems
Alternative models

I Restricted Boltzmann Machines [Salakhutdinov et al., 2007]


I Auto-encoders [Wu et al., 2016]
I Prod2vec [Grbovic et al., 2015]
I Wide + Deep models [Cheng et al., 2016]

244
Recommender systems
Restricted Boltzmann Machines – RBM
I Generative stochastic neural network
I Visible and hidden units connected by weights
I Activation probabilities:
  p(h_j = 1 | v) = σ(b^h_j + ∑_{i=1}^{m} w_{i,j} v_i)
  p(v_i = 1 | h) = σ(b^v_i + ∑_{j=1}^{n} w_{i,j} h_j)
I Training
  I Set visible units based on data, sample hidden units, then sample visible units
  I Modify weights so that the configuration of the visible units moves closer to the data
I In recommendation:
  I Visible units: ratings on the movies
  I Vector of length 5 (one entry per rating value) in each unit
  I Units corresponding to users who have not rated the movie are ignored
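A minimal numpy sketch of the conditionals above for binary units: sample the hidden units given the visible units, then reconstruct the visible units. Sizes and weights are illustrative; training (e.g., contrastive divergence) is omitted.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
m, n = 6, 4                                   # numbers of visible and hidden units
W = rng.normal(scale=0.1, size=(m, n))
b_v, b_h = np.zeros(m), np.zeros(n)

v = rng.integers(0, 2, size=m).astype(float)  # toy visible configuration
p_h = sigmoid(b_h + v @ W)                    # p(h_j = 1 | v)
h = (rng.random(n) < p_h).astype(float)       # sample hidden units
p_v = sigmoid(b_v + W @ h)                    # p(v_i = 1 | h)
print("reconstruction probabilities:", np.round(p_v, 2))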
245
Recommender systems
Auto-encoders

Auto-encoders
I One hidden layer
I Same number of input and output units
I Try to reconstruct the input on the output
I Hidden layer: compressed representation of the data
Constraining the model: improve generalization
I Sparse auto-encoders: activation of units are limited
I Denoising auto-encoders: corrupt the input

246
Recommender systems
Auto-encoders for recommendation

Reconstruct corrupted user interaction vectors [Wu


et al., 2016]
I Collaborative Denoising Auto-Encoder (CDAE)
I The link between nodes are associated with
different weights
I The links with red color are user specific
I Other weights are shared across all the users
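A minimal numpy sketch of the denoising idea: corrupt a user's interaction vector, encode it, and reconstruct the original vector. Weights are random placeholders and the user-specific input node of CDAE is omitted.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_items, hidden = 10, 4
W_enc = rng.normal(scale=0.1, size=(hidden, n_items))
W_dec = rng.normal(scale=0.1, size=(n_items, hidden))

interactions = (rng.random(n_items) < 0.3).astype(float)   # toy user history (implicit feedback)
corrupted = interactions * (rng.random(n_items) > 0.2)     # randomly drop some interactions
z = np.tanh(W_enc @ corrupted)                              # compressed representation
reconstruction = sigmoid(W_dec @ z)                         # scores for all items
print(np.round(reconstruction, 2))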

247
Recommender systems
Prod2vec and Item2vec

I Prod2vec and item2vec: Item-item co-occurrence


factorization
I User2vec: User-user co-occurrence factorization
I The two approaches can be combined [Liang
et al., 2016]

248
Recommender systems
Wide + Deep models

I Combination of two models


I Deep neural network
I On embedded item features
I In charge of generalization
I Linear model
I On embedded item feature
I And cross product of item features
I In charge of memorization on binarized
features
I [Cheng et al., 2016]

249
Outline
Morning program
Preliminaries
Semantic matching
Learning to rank
Entities
Afternoon program
Modeling user behavior
Generating responses
Recommender systems
Items and Users
Matrix factorization
Matrix factorization as a network
Side information
Richer models
Other tasks
Wrap-up
Industry insights
Q&A
250
Recommender systems
Other tasks

I Session-based recommendation
I Contextual sequence prediction
I Time-sensitive sequence prediction

I Causality in recommendations
I Recommendation as question answering
I Deep reinforcement learning for recommendations

251
Recommender systems
Session-based recommendation

I Treat recommendations as a sequence classification problem


I Input: a sequence of user actions (purchases/ratings of items)
I Output: next action
I Disjoint sessions (instead of consistent user history)

252
Recommender systems
GRU4Rec

Network structure [Hidasi et al., 2016]


I Input: one hot encoded item ID
I Output: scores over all items
I Goal: predicting the next item in the session
Adapting GRU to session-based recommendations
I Session-parallel mini-batching: to handle sessions of (very)
different length and lots of short sessions
I Sampling on the output: to handle lots of items
(inputs,outputs)

253
Recommender systems
GRU4Rec

Session-parallel mini-batches (sketched below)
I Mini-batch is defined over sessions

Output sampling
I Computing scores for all items (100K – 1M) in every step is slow
I One positive item (target) + several sampled negatives
I Fast solution: scores on mini-batch targets
I Items of the other mini-batches are negative samples for the current mini-batch
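A minimal Python sketch of session-parallel mini-batching: keep one slot per session in the mini-batch and, when a session ends, refill its slot with the next session; the targets of the other slots act as negative samples. The sessions below are made-up toy data (each at least two items long).

sessions = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10, 11]]   # toy sessions of item IDs
batch_size = 2

queue = list(sessions)
slots = [queue.pop(0) for _ in range(batch_size)]        # one active session per slot
pos = [0] * batch_size
done = False

while not done:
    # one step: per slot, the current item is the input and the next item is the target
    step = [(slots[s][pos[s]], slots[s][pos[s] + 1]) for s in range(batch_size)]
    print("(input, target) pairs:", step)
    for s in range(batch_size):
        pos[s] += 1
        if pos[s] + 1 >= len(slots[s]):                   # session finished
            if queue:
                slots[s], pos[s] = queue.pop(0), 0        # refill the slot with the next session
            else:
                done = True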

254
Recommender systems
Contextual sequence prediction

I Input: sequence of contextual user


actions, plus current context
I Output: probability of next action
I E.g. “Given all the actions a user has
taken so far, what’s the most likely video
they’re going to play right now?” [Beutel
et al., 2018]

255
Recommender systems
Time-sensitive sequence prediction

I Recommendations are actions at a


moment in time
I Proper modeling of time and system
dynamics is critical
I Experiment on a Netflix internal
dataset
I Context:
I Discrete time – Day-of-week:
Sunday, Monday, . . . Hour-of-day
I Continuous time (Timestamp)
I Predict next play (temporal split
data)

[Raimon and Basilico, Deep Learning for Recommender Systems, 2017]

256
And now, for some speculative tasks
in the recommender systems space
Answers not clear, but good potential for follow-up research
Recommender systems
Causality in recommendations – [Schnabel et al., 2016]
I Virtually all data for training recommender systems is subject to selection biases
I In movie recommendation, users typically watch and rate movies they like, rarely movies they do not like
I View recommendation from a causal inference perspective – exposing a user to an item is an intervention analogous to exposing a patient to a treatment in a medical study
I Propensity-weighted MF method – propensities P_{i,j} act as weights on loss terms:

min_{u,v} ∑_{i,j∈R} (1 / P_{i,j}) (r_{i,j} − u_i^T v_j)^2 + λ(‖u_i‖^2 + ‖v_j‖^2)

[Figure: performance of MF vs. propensity-weighted MF. As α → 0, data is increasingly missing not at random, only revealing top rated items.]

I How to incorporate this in neural networks for recommender systems?
258
Recommender systems
Recommendations as question answering – [Dodge et al., 2015]

I Conversational recommendation agent
I (1) question-answering (QA), (2) recommendation, (3) mix of recommendation and QA and (4) general dialog about the topic (chit-chat)
I Memory network jointly trained on all (four) tasks performs best
I Incorporates short and long term memory and can use local context and knowledge bases of facts
I Performance on QA needs a real boost
I Performance degraded rather than improved when training on all four tasks at once

259
Recommender systems
Deep reinforcement learning for recommendations – [Zhao et al., 2017]

I MDP-based formulations of recommender systems go back to early 2000s


I Use of reinforcement learning has two advantages
1. can continuously update strategies during interactions
2. are able to learn strategy that maximizes the long-term cumulative reward from users
I List-wise recommendation framework, which can be applied in scenarios with large
and dynamic item spaces
I Uses Actor-Critic network

I Integrate multiple orders – positional order, temporal order


I Needs proper evaluation in live environment

260
Outline
Morning program
Preliminaries
Semantic matching
Learning to rank
Entities
Afternoon program
Modeling user behavior
Generating responses
Recommender systems
Items and Users
Matrix factorization
Matrix factorization as a network
Side information
Richer models
Other tasks
Wrap-up
Industry insights
Q&A
261
Recommender systems
Wrap-up

I Current directions
I Item/user embedding
I Deep collaborative filtering
I Feature extraction from content
I Session- and context-based recommendation
I Fairness, accuracy, confidentiality, and transparency (FACT)
I Deep learning can work well for recommendations
I Matrix factorization and deep learning are very similar in classic recommendation
setup
I Lots of areas to explore

262
Recommender systems
Resources

RankSys : https://github.com/RankSys/RankSys
LibRec : https://www.librec.net
LensKit : http://lenskit.org
LibMF : https://www.csie.ntu.edu.tw/%7Ecjlin/libmf/
proNet-core : https://github.com/cnclabs/proNet-core
rival : http://rival.recommenders.net
TensorRec : https://github.com/jfkirk/tensorrec

263
Outline

Morning program
Preliminaries
Semantic matching
Learning to rank
Entities

Afternoon program
Modeling user behavior
Generating responses
Recommender systems
Industry insights
Q&A

264
Industry insights
Deep Learning in industry

I Companies have endless amounts of data!


Or do they?
I Performance
Is 0.9 accuracy/F1/etc. good enough? No? Would 0.95 be?
I Business logic/constraints
- Your model is doing great in general, but not in case X, Y and Z.
Can you keep it exactly as it is now, and fix just these cases?
I Explicit domain knowledge
E.g.: recommending product X for user Y is not applicable, as it is not available
where user Y lives.

265
Industry insights
Deep Learning in industry

I Hybrid Code Networks


Combining RNNs with domain-specific knowledge
I Smart Reply
Automated response suggestion for email

266
Industry insights
Hybrid Code Networks

Task
Dialogue system. User can converse with a system that can interact with APIs.

Combining RNNs with domain-specific knowledge


I Incorporate business logic by including modules in the system that can be programmed
I Explicitly condition actions on external knowledge (see the action-masking sketch below)

[Williams et al., 2017]
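
A toy sketch of the kind of control this enables: developer-written code emits an action mask, and the RNN's action distribution is renormalized over the allowed actions. The names and the masking rule here are made up and do not reflect the exact HCN interface:

```python
import numpy as np

def choose_action(rnn_action_probs, action_mask):
    """rnn_action_probs: distribution over action templates produced by the RNN.
    action_mask: 0/1 vector from developer-provided code encoding business rules
    (e.g. a hypothetical 'confirm order' action only allowed once an order exists)."""
    masked = rnn_action_probs * action_mask
    masked /= masked.sum()                 # renormalize over the allowed actions
    return int(np.argmax(masked))

# toy usage: the rule-based mask vetoes action 0 even though the RNN prefers it
probs = np.array([0.6, 0.3, 0.1])
mask = np.array([0.0, 1.0, 1.0])
action = choose_action(probs, mask)        # -> 1
```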

267
Industry insights
Hybrid Code Networks

(Figure: Hybrid Code Network operational loop, with an example API call Forecast(). Trapezoids refer to programmatic code provided by the software developer; shaded boxes are trainable components.)

[Williams et al., 2017]


268
Industry insights
Smart Reply
Automated response suggestion for email
Use an RNN to generate responses for any given input message.

Additional constraints
I Response quality
Ensure that the individual response options are always high
quality in language and content.
I Utility
Select multiple options to show a user so as to maximize
the likelihood that one is chosen.
I Scalability
Process millions of messages per day while remaining within
the latency requirements.
I Privacy
Develop this system without ever inspecting the data
except aggregate statistics. [Kannan et al., 2016]

269
Industry insights
Smart Reply
Response selection
I Construct a set of allowed responses R.
I Organise the elements of R into a trie.
I Conduct a left-to-right beam search, and only retain hypotheses that appear in the trie (see the sketch below).
Complexity: O(beam size × response length).

Utility/diversity
Goal: present user with diverse responses.
Instead of “No”, “No, thanks”, and “Thanks!”, we’d rather produce “No, thanks”, “Yes, please”, “Let me come back to it”.
I Manually label a couple of messages per response intent.
I Use a state-of-the-art label propagation algorithm to label all other messages in R.

(Figure: Smart Reply pipeline – preprocess email → trigger (a separate feedforward neural network classifier) decides whether to produce a response → response selection (LSTM) over permitted responses and clusters → diversity selection → suggested Smart Reply.)

[Kannan et al., 2016]
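
A small sketch of trie-restricted selection; score_next stands in for the LSTM's next-token log-probabilities, and the trie and beam interfaces are simplified assumptions rather than the production Smart Reply code:

```python
import math

def build_trie(responses):
    """responses: iterable of tokenized allowed responses (the set R)."""
    trie = {}
    for tokens in responses:
        node = trie
        for tok in tokens:
            node = node.setdefault(tok, {})
        node["<eos>"] = {}
    return trie

def beam_search_in_trie(score_next, trie, beam_size=3, max_len=10):
    """score_next(prefix) -> dict of token -> log-probability (stand-in for the LSTM)."""
    beams = [((), 0.0, trie)]                      # (prefix, log-prob, trie node)
    completed = []
    for _ in range(max_len):
        candidates = []
        for prefix, logp, node in beams:
            for tok, tok_logp in score_next(prefix).items():
                if tok not in node:                # prune: would leave the allowed set
                    continue
                if tok == "<eos>":
                    completed.append((prefix, logp + tok_logp))
                else:
                    candidates.append((prefix + (tok,), logp + tok_logp, node[tok]))
        beams = sorted(candidates, key=lambda b: -b[1])[:beam_size]
        if not beams:
            break
    return sorted(completed, key=lambda c: -c[1])

# toy usage: two allowed responses and a uniform toy scorer
trie = build_trie([("no", "thanks"), ("yes", "please")])
def score_next(prefix):
    return {"no": math.log(0.5), "yes": math.log(0.3), "please": math.log(0.1),
            "thanks": math.log(0.1), "<eos>": math.log(0.2)}
print(beam_search_in_trie(score_next, trie))
```

Each step only keeps hypotheses that are prefixes of some allowed response, which is what bounds the cost to O(beam size × response length).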
270
Industry insights
What do we learn?

I Deep learning component is a (small) part of a much larger system.


I Getting the right training data can be hard.
I The machine learned part is guided/corrected/prevented from predicting undesired
output.

271
Industry insights
Neural IR at Bing

Long history of neural IR models at Bing/Microsoft

I RankNet/LambdaRank [Burges et al., 2005, 2006]


I ListNet/ListMLE [Cao et al., 2007, Xia et al., 2008]
I DSSM/CDSSM [Huang et al., 2013, Shen et al., 2014]
I Recent representation learning models for long text [Mitra et al., 2017, Zamani
et al., 2018]

NN and GBDT are both popularly used across many teams

272
Industry insights
Neural IR at Bing

Beyond Web search, heavy use of deep learning systems for

I Speech recognition [Xiong et al., 2017c]


I Conversational models (e.g., Cortana & Zo)
I Machine translation [Hassan et al., 2018]
I Machine reading [Wang et al., 2017] and emerging Office Intelligence scenarios
(e.g., [Van Gysel et al., 2017])
I And others...

273
Industry insights
Neural IR at Bing

Some of the unique challenges and considerations:

I Supervision
I Large (explicitly/implicitly) labeled datasets are available for training deep models in
Web search
I Not available for many multi-tenant enterprise scenarios due to privacy and scalability
considerations—distance supervision and other approaches may be necessary
I Infrastructure investments
I GPU and other machine resources for experimentation; serving infrastructure
investments for running deep models in production
I Neural model based features vs. rethinking the stack with neural models as first
class citizens
I Model reuse: across tenants and different services

274
Outline

Morning program
Preliminaries
Semantic matching
Learning to rank
Entities

Afternoon program
Modeling user behavior
Generating responses
Recommender systems
Industry insights
Q&A

275
Q&A
Q&A

Tom Kenter Bhaskar Mitra Maarten de Rijke Christophe Van Gysel


Booking.com Microsoft & UCL U. Amsterdam U. Amsterdam

276
References I
Qingyao Ai, Liu Yang, Jiafeng Guo, and W. Bruce Croft. 2016a. Analysis of the Paragraph Vector Model for Information Retrieval. In ICTIR. ACM,
133–142.
Qingyao Ai, Liu Yang, Jiafeng Guo, and W Bruce Croft. 2016b. Improving language estimation with the paragraph vector model for ad-hoc retrieval.
In SIGIR. ACM, 869–872.
Qingyao Ai, Yongfeng Zhang, Keping Bi, Xu Chen, and Bruce W. Croft. 2017. Learning a Hierarchical Embedding Model for Personalized Product
Search. In SIGIR.
Nima Asadi, Donald Metzler, Tamer Elsayed, and Jimmy Lin. 2011. Pseudo test collections for learning web search ranking functions. In SIGIR.
ACM, 1073–1082.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. In ICLR.
K. Balog, L. Azzopardi, and M. de Rijke. 2006. Formal models for expert finding in enterprise corpora. In SIGIR. 43–50.
Krisztian Balog, Yi Fang, Maarten de Rijke, Pavel Serdyukov, and Luo Si. 2012. Expertise Retrieval. Found. & Tr. in Inform. Retr. 6, 2-3 (2012),
127–256.
Trapit Bansal, David Belanger, and Andrew McCallum. 2016. Ask the GRU: Multi-task Learning for Deep Text Recommendations. In RecSys.
107–114.
Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don’t count, predict! A systematic comparison of context-counting vs.
context-predicting semantic vectors.. In ACL. 238–247.
Yoshua Bengio and Jean-Sébastien Senécal. 2008. Adaptive importance sampling to accelerate training of a neural probabilistic language model.
Transactions on Neural Networks 19, 4 (2008), 713–722.
Yoshua Bengio, Jean-Sébastien Senécal, and others. 2003. Quick Training of Probabilistic Neural Nets by Importance Sampling.. In AISTATS.
Richard Berendsen, Manos Tsagkias, Wouter Weerkamp, and Maarten de Rijke. 2013. Pseudo test collections for training and tuning microblog
rankers. In SIGIR. ACM, 53–62.
Alex Beutel, Paul Covington, Sagar Jain, Can Xu, Jia Li, Vince Gatto, and H Chi. 2018. Latent Cross: Making Use of Context in Recurrent
Recommender Systems. In WSDM.
David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. JMLR 3 (2003), 993–1022.
277
References II
Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling
multi-relational data. In NIPS. 2787–2795.
Antoine Bordes and Jason Weston. 2017. Learning end-to-end goal-oriented dialog. ICLR (2017).
Alexey Borisov, Ilya Markov, Maarten de Rijke, and Pavel Serdyukov. 2016. A neural click model for web search. In WWW. International World Wide
Web Conferences Steering Committee, 531–541.
Peter F Brown, Peter V Desouza, Robert L Mercer, Vincent J Della Pietra, and Jenifer C Lai. 1992. Class-based n-gram models of natural language.
Computational linguistics 18, 4 (1992), 467–479.
Chris Burges. 2015. RankNet: A ranking retrospective. (2015).
https://www.microsoft.com/en-us/research/blog/ranknet-a-ranking-retrospective/ Accessed July 16, 2017.
Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. 2005. Learning to rank using gradient
descent. In ICML. ACM, 89–96.
Christopher JC Burges. 2010. From ranknet to lambdarank to lambdamart: An overview. Learning 11, 23-581 (2010), 81.
Christopher JC Burges, Robert Ragno, and Quoc Viet Le. 2006. Learning to rank with nonsmooth cost functions. In NIPS, Vol. 6. 193–200.
Yixin Cao, Lifu Huang, Heng Ji, Xu Chen, and Juanzi Li. 2017. Bridge text and knowledge by learning multi-prototype entity mention embedding. In
ACL, Vol. 1. 1623–1633.
Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. 2007. Learning to rank: from pairwise approach to listwise approach. In ICML. ACM,
129–136.
Gabriele Capannini, Claudio Lucchese, Franco Maria Nardini, Salvatore Orlando, Raffaele Perego, and Nicola Tonellotto. 2016. Quality versus
efficiency in document scoring with learning-to-rank models. IPM 52, 6 (2016), 1161–1177.
Tianqi Chen and Carlos Guestrin. 2016. Xgboost: A scalable tree boosting system. In KDD. ACM, 785–794.
Wei Chen, Tie-Yan Liu, Yanyan Lan, Zhi-Ming Ma, and Hang Li. 2009. Ranking measures and loss functions in learning to rank. In NIPS. 315–323.
Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa
Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, and Hemal Shah. 2016. Wide & Deep Learning for Recommender
Systems. In DLRS. 7–10.

278
References III
Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In
ICML. ACM, 160–167.
Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost)
from scratch. JMLR 12, Aug (2011), 2493–2537.
David Cossock and Tong Zhang. 2006. Subset ranking using regression. In COLT, Vol. 6. Springer, 605–619.
Zhuyun Dai, Chenyan Xiong, Jamie Callan, and Zhiyuan Liu. 2018. Convolutional Neural Networks for Soft-Matching N-Grams in Ad-hoc Search. In
WSDM. ACM, 126–134.
Arjen P. de Vries, Anne-Marie Vercoustre, James A. Thom, Nick Craswell, and Mounia Lalmas. 2007. Overview of the INEX 2007 entity ranking
track. In Focused Access to XML Documents. Springer, 245–251.
Scott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis.
JASIS 41, 6 (1990), 391–407.
Mostafa Dehghani, Hamed Zamani, Aliaksei Severyn, Jaap Kamps, and W Bruce Croft. 2017. Neural Ranking Models with Weak Supervision. In
SIGIR.
Gianluca Demartini, Arjen P. de Vries, Tereza Iofciu, and Jianhan Zhu. 2008. Overview of the INEX 2008 Entity Ranking Track. In Advances in
Focused Retrieval. Springer Berlin Heidelberg, 243–252.
Gianluca Demartini, Tereza Iofciu, and Arjen P De Vries. 2009. Overview of the INEX 2009 entity ranking track. In International Workshop of the
Initiative for the Evaluation of XML Retrieval. Springer, 254–264.
Gianluca Demartini, Tereza Iofciu, and Arjen P De Vries. 2010. Overview of the INEX 2009 Entity Ranking Track. In Focused Retrieval and
Evaluation. Springer, 254–264.
Bhuwan Dhingra, Lihong Li, Xiujun Li, Jianfeng Gao, Yun-Nung Chen, Faisal Ahmed, and Li Deng. 2017. Towards End-to-End Reinforcement
Learning of Dialogue Agents for Information Access. In ACL.
Fernando Diaz, Bhaskar Mitra, and Nick Craswell. 2016. Query expansion with locally-trained word embeddings. In ACL.
Laura Dietz, Alexander Kotov, and Edgar Meij. 2017. Utilizing Knowledge Graphs in Text-centric Information Retrieval. In WSDM. ACM, 815–816.
Jesse Dodge, Andreea Gane, Xiang Zhang, Antoine Bordes, Sumit Chopra, Alexander H. Miller, Arthur Szlam, and Jason Weston. 2015. Evaluating
Prerequisite Qualities for Learning End-to-End Dialog Systems. CoRR abs/1511.06931 (2015).
279
References IV
Yixing Fan, Liang Pang, JianPeng Hou, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. 2017. MatchZoo: A Toolkit for Deep Text Matching. arXiv
preprint arXiv:1707.07270 (2017).
John Rupert Firth. 1957. Papers in Linguistics 1934-1951. Oxford University Press.
Yoav Freund, Raj Iyer, Robert E Schapire, and Yoram Singer. 2003. An efficient boosting algorithm for combining preferences. JMLR 4, Nov (2003),
933–969.
Norbert Fuhr. 1989. Optimum polynomial retrieval functions based on the probability ranking principle. TOIS 7, 3 (1989), 183–204.
Debasis Ganguly, Dwaipayan Roy, Mandar Mitra, and Gareth JF Jones. 2015. Word embedding based generalized language model for information
retrieval. In SIGIR. ACM, 795–798.
Yasser Ganjisaffar, Rich Caruana, and Cristina Lopes. 2011. Bagging Gradient-Boosted Trees for High Precision, Low Variance Ranking Models. In
SIGIR. ACM, 85–94.
Jianfeng Gao, Patrick Pantel, Michael Gamon, Xiaodong He, and Li Deng. 2014. Modeling interestingness with deep neural networks. In Proceedings
of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2–13.
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep learning. MIT press.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014.
Generative adversarial nets. In NIPS. 2672–2680.
Joshua Goodman. 2001. Classes for fast maximum entropy training. In ICASSP, Vol. 1. IEEE, 561–564.
Mihajlo Grbovic, Vladan Radosavljevic, Nemanja Djuric, Narayan Bhamidipati, Jaikit Savla, Varun Bhagwan, and Doug Sharp. 2015. E-commerce in
Your Inbox: Product Recommendations at Scale. In KDD. 1809–1818.
Artem Grotov and Maarten de Rijke. 2016. Online Learning to Rank for Information Retrieval: SIGIR 2016 Tutorial. In SIGIR. ACM, 1215–1218.
Jiafeng Guo, Yixing Fan, Qingyao Ai, and W Bruce Croft. 2016. A Deep Relevance Matching Model for Ad-hoc Retrieval. ACM, 55–64.
Nitish Gupta, Sameer Singh, and Dan Roth. 2017. Entity linking via joint encoding of types, descriptions, and context. In EMNLP. 2681–2690.
Michael Gutmann and Aapo Hyvärinen. 2010. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models.. In
AISTATS, Vol. 1. 6.
Zellig S Harris. 1954. Distributional structure. Word 10, 2-3 (1954), 146–162.
280
References V
Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt,
William Lewis, Mu Li, and others. 2018. Achieving Human Parity on Automatic Chinese to English News Translation. arXiv preprint
arXiv:1803.05567 (2018).
Ruining He and Julian McAuley. 2016. VBPR: Visual Bayesian Personalized Ranking from Implicit Feedback. In AAAI. 144–150.
Zhengyan He, Shujie Liu, Mu Li, Ming Zhou, Longkai Zhang, and Houfeng Wang. 2013. Learning entity representation for entity disambiguation. In
ACL, Vol. 2. 30–34.
Ralf Herbrich, Thore Graepel, and Klaus Obermayer. 2000. Large margin rank boundaries for ordinal regression. (2000).
Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching
machines to read and comprehend. In NIPS.
Daniel Hewlett, Alexandre Lacoste, Llion Jones, Illia Polosukhin, Andrew Fandrianto, Jay Han, Matthew Kelcey, and David Berthelot. 2016.
WIKIREADING: A Novel Large-scale Language Understanding Task over Wikipedia. In ACL.
Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2016. Session-based recommendations with recurrent neural networks.
ICLR (2016).
S. Hochreiter and J. Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
Thomas Hofmann. 1999. Probabilistic latent semantic indexing. In SIGIR. ACM, 50–57.
Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. 2014. Convolutional neural network architectures for matching natural language sentences.
In NIPS. 2042–2050.
Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search
using clickthrough data. In CIKM. ACM, 2333–2338.
Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991 (2015).
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. 2016. Image-to-Image Translation with Conditional Adversarial Networks. arXiv
preprint arXiv:1611.07004 (2016).
Rolf Jagerman, Julia Kiseleva, and Maarten de Rijke. 2017. Modeling Label Ambiguity for Neural List-Wise Learning to Rank. In Neu-IR SIGIR
Workshop.

281
References VI
Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2014. On Using Very Large Target Vocabulary for Neural Machine
Translation. arXiv preprint arXiv:1412.2007 (2014).
Shihao Ji, SVN Vishwanathan, Nadathur Satish, Michael J Anderson, and Pradeep Dubey. 2015. Blackout: Speeding up recurrent neural network
language models with very large vocabularies. arXiv preprint arXiv:1511.06909 (2015).
Thorsten Joachims. 2006. Training linear SVMs in linear time. In KDD. ACM, 217–226.
Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. Exploring the limits of language modeling. arXiv preprint
arXiv:1602.02410 (2016).
Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. 2016. Neural Machine Translation in
Linear Time. arXiv preprint arXiv:1610.10099 (2016).
Anjuli Kannan, Karol Kurach, Sujith Ravi, Tobias Kaufmann, Andrew Tomkins, Balint Miklos, Greg Corrado, László Lukács, Marina Ganea, Peter
Young, and others. 2016. Smart Reply: Automated response suggestion for email. In KDD. ACM, 955–964.
Tom Kenter, Alexey Borisov, and Maarten de Rijke. 2016. Siamese CBOW: Optimizing Word Embeddings for Sentence Representations. In ACL.
941–951.
Tom Kenter and Maarten de Rijke. 2017. Attentive Memory Networks: Efficient Machine Reading for Conversational Search. In The First
International Workshop on Conversational Approaches to Information Retrieval (CAIR’17).
Tom Kenter, Llion Jones, and Daniel Hewlett. 2018. Byte-level Machine Reading across Morphologically Varied Languages. In AAAI.
Hai-Son Le, Ilya Oparin, Alexandre Allauzen, Jean-Luc Gauvain, and François Yvon. 2011. Structured output layer neural network language model. In
ICASSP. IEEE, 5524–5527.
Quoc V Le and Tomas Mikolov. 2014. Distributed Representations of Sentences and Documents. In ICML. 1188–1196.
Hang Li, Jun Xu, and others. 2014. Semantic matching in search. Foundations and Trends in Information Retrieval 7, 5 (2014), 343–469.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2015. A diversity-promoting objective function for neural conversation models.
In NAACL-HLT 2016. 110–119.
Jiwei Li, Will Monroe, Tianlin Shi, Alan Ritter, and Dan Jurafsky. 2017. Adversarial learning for neural dialogue generation. arXiv preprint
arXiv:1701.06547 (2017).
Ping Li, Qiang Wu, and Christopher J Burges. 2008. Mcrank: Learning to rank using multiple classification and gradient boosting. In NIPS. 897–904.
282
References VII
Dawen Liang, Jaan Altosaar, Laurent Charlin, and David M. Blei. 2016. Factorization Meets the Item Embedding: Regularizing Matrix Factorization
with Item Co-occurrence. In RecSys ’16. 59–66.
Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. 2015. Learning Entity and Relation Embeddings for Knowledge Graph Completion..
In AAAI. 2181–2187.
Chia-Wei Liu, Ryan Lowe, Iulian V Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT to evaluate your dialogue
system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In EMNLP.
Qiang Liu, Feng Yu, Shu Wu, and Liang Wang. 2015. A convolutional click prediction model. In CIKM. ACM, 1743–1746.
Tie-Yan Liu. 2009. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval 3, 3 (2009), 225–331.

Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn
dialogue systems. In SIGDIAL. 285–294.
R Duncan Luce. 2005. Individual choice behavior: A theoretical analysis. Courier Corporation.
Minh-Thang Luong and Christopher D. Manning. 2016. Achieving open vocabulary neural machine translation with hybrid word-character models. In
Proceedings of the The 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016).
S. MacAvaney, K. Hui, and A. Yates. 2017. An Approach for Weakly-Supervised Deep Information Retrieval. arXiv 1707.00189. (2017).
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint
arXiv:1301.3781 (2013).
Tomáš Mikolov, Stefan Kombrink, Lukáš Burget, Jan Černockỳ, and Sanjeev Khudanpur. 2011. Extensions of recurrent neural network language
model. In ICASSP. IEEE, 5528–5531.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS. 3111–3119.
Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. 2016. Key-Value Memory Networks for
Directly Reading Documents. In EMNLP.
Bhaskar Mitra. 2015. Exploring session context using distributed representations of queries and reformulations. In SIGIR. ACM, 3–12.
Bhaskar Mitra and Nick Craswell. 2017. An introduction to neural information retrieval. Foundations and Trends in Information Retrieval (to appear) (2017).
283
References VIII
Bhaskar Mitra, Fernando Diaz, and Nick Craswell. 2017. Learning to Match Using Local and Distributed Representations of Text for Web Search. In WWW. 1291–1299.
Andriy Mnih and Geoffrey E Hinton. 2009. A scalable hierarchical distributed language model. In NIPS. 1081–1088.
Andriy Mnih and Yee Whye Teh. 2012. A fast and simple algorithm for training neural probabilistic language models. arXiv preprint arXiv:1206.6426
(2012).
Frederic Morin and Yoshua Bengio. 2005. Hierarchical Probabilistic Neural Network Language Model.. In AISTATS, Vol. 5. 246–252.
Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. 2016. Improving Document Ranking with Dual Word Embeddings. In WWW.
Thomas Nedelec, Elena Smirnova, and Flavian Vasile. 2016. Content2vec: Specializing joint representations of product images and text for the task
of product recommendation. (2016).
Kezban Dilek Onal, Ye Zhang, Ismail Sengor Altingovde, Md Mustafizur Rahman, Pinar Karagoz, Alex Braylan, Brandon Dang, Heng-Lu Chang,
Henna Kim, Quinten McNamara, and others. 2017. Neural information retrieval: At the end of the early years. Information Retrieval Journal
(2017), 1–72.
Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Shengxian Wan, and Xueqi Cheng. 2016. Text Matching as Image Recognition.. In AAAI. 2793–2799.
Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Jingfang Xu, and Xueqi Cheng. 2017. DeepRank: A New Deep Architecture for Relevance Ranking in
Information Retrieval. In CIKM. ACM, 257–266.
Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In LREC Workshop on New Challenges for NLP
Frameworks. ELRA, Valletta, Malta, 45–50. http://is.muni.cz/publication/884893/en.
Jennifer Rowley. 2000. Product search in e-shopping: a review and research propositions. Journal of consumer marketing 17, 1 (2000), 20–35.
Dwaipayan Roy, Debjyoti Paul, Mandar Mitra, and Utpal Garain. 2016. Using Word Embeddings for Automatic Query Expansion. arXiv preprint
arXiv:1606.07608 (2016).
Ruslan Salakhutdinov and Geoffrey Hinton. 2009. Semantic hashing. Int. J. Approximate Reasoning 50, 7 (2009), 969–978.
Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey Hinton. 2007. Restricted Boltzmann Machines for Collaborative Filtering. In ICML. 791–798.
Tobias Schnabel, Adith Swaminathan, Ashudeep Singh, Navin Chandak, and Thorsten Joachims. 2016. Recommendations as Treatments: Debiasing
Learning and Evaluation. CoRR abs/1602.05352 (2016).
D. Sculley. 2009. Large scale learning to rank. In NIPS Workshop on Advances in Ranking.
284
References IX
Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C Courville, and Joelle Pineau. 2016. Building End-To-End Dialogue Systems Using
Generative Hierarchical Neural Network Models.. In AAAI. 3776–3784.
Pararth Shah, Dilek Hakkani-Tür, and Larry Heck. 2016. Interactive reinforcement learning for task-oriented dialogue management. In NIPS
Workshop on Deep Learning for Action and Interaction.
Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural responding machine for short-text conversation. In ACL.
Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. 2014. A latent semantic model with convolutional-pooling structure for
information retrieval. In CIKM. ACM, 101–110.
Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Meg Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015. A
Neural Network Approach to Context-Sensitive Generation of Conversational Responses. In ACL HLT.
Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. 2015. End-To-End Memory Networks. In NIPS.
Yaming Sun, Lei Lin, Duyu Tang, Nan Yang, Zhenzhou Ji, and Xiaolong Wang. 2015. Modeling Mention, Context and Entity with Neural Networks
for Entity Disambiguation.. In IJCAI.
Paul Thomas, Daniel McDuff, Mary Czerwinski, and Nick Craswell. 2017. MISC: A data set of information-seeking conversations. In The First
International Workshop on Conversational Approaches to Information Retrieval (CAIR’17).
Jörg Tiedemann. 2009. News from OPUS-A collection of multilingual parallel corpora with tools and interfaces. In ACL RANLP, Vol. 5. 237–248.
Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alexander Graves, Nal Kalchbrenner, Andrew Senior, and Koray
Kavukcuoglu. 2016. WaveNet: A Generative Model for Raw Audio. In Arxiv. https://arxiv.org/abs/1609.03499
Christophe Van Gysel, Maarten de Rijke, and Evangelos Kanoulas. 2016. Learning Latent Vector Spaces for Product Search. In CIKM. ACM,
165–174.
Christophe Van Gysel, Maarten de Rijke, and Evangelos Kanoulas. 2017a. Semantic Entity Retrieval Toolkit. In Neu-IR SIGIR Workshop.
Christophe Van Gysel, Maarten de Rijke, and Evangelos Kanoulas. 2017b. Structural Regularities in Expert Vector Spaces. In ICTIR. ACM.
Christophe Van Gysel, Maarten de Rijke, and Evangelos Kanoulas. 2018. Neural Vector Spaces for Unsupervised Information Retrieval. TOIS (2018).
Christophe Van Gysel, Maarten de Rijke, and Marcel Worring. 2016. Unsupervised, Efficient and Semantic Expertise Retrieval. In WWW. ACM,
1069–1079.
285
References X
Christophe Van Gysel, Bhaskar Mitra, Matteo Venanzi, Roy Rosemarin, Grzegorz Kukla, Piotr Grudzien, and Nicola Cancedda. 2017. Reply with:
Proactive recommendation of email attachments. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management.
ACM, 327–336.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is
all you need. In NIPS. 6000–6010.
Ashish Vaswani, Yinggong Zhao, Victoria Fossum, and David Chiang. 2013. Decoding with Large-Scale Neural Language Models Improves
Translation.. In EMNLP. 1387–1392.
Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015. Grammar as a foreign language. In Advances in
Neural Information Processing Systems 28 (NIPS 2015).
Oriol Vinyals and Quoc Le. 2015. A neural conversational model. In ICML Workshop on Deep Learning.
Ivan Vulić and Marie-Francine Moens. 2015. Monolingual and cross-lingual information retrieval models based on (bilingual) word embeddings. In
SIGIR. ACM, 363–372.
Shengxian Wan, Yanyan Lan, Jiafeng Guo, Jun Xu, Liang Pang, and Xueqi Cheng. 2016a. A Deep Architecture for Semantic Matching with Multiple
Positional Sentence Representations.. In AAAI.
Shengxian Wan, Yanyan Lan, Jun Xu, Jiafeng Guo, Liang Pang, and Xueqi Cheng. 2016b. Match-SRNN: Modeling the Recursive Matching Structure
with Spatial RNN. In IJCAI. AAAI Press, 2922–2928.
Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. 2017. Gated self-matching networks for reading comprehension and question
answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1.
189–198.
Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge Graph Embedding by Translating on Hyperplanes.. In AAAI.
1112–1119.
Tsung-Hsien Wen, David Vandyke, Nikola Mrksic, Milica Gasic, Lina M Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2017. A
network-based end-to-end trainable task-oriented dialogue system. In EACL.
Jason Williams, Kavosh Asadi, and Geoffrey Zweig. 2017. Hybrid Code Networks: Practical and Efficient End-To-End Dialog Control With
Supervised and Reinforcement Learning. In ACL.

286
References XI
Yao Wu, Christopher DuBois, Alice X. Zheng, and Martin Ester. 2016. Collaborative Denoising Auto-Encoders for Top-N Recommender Systems. In
WSDM. 153–162.
Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. 2008. Listwise approach to learning to rank: theory and algorithm. In ICML. ACM,
1192–1199.
Chenyan Xiong, Jamie Callan, and Tie-Yan Liu. 2017a. Word-Entity Duet Representations for Document Ranking. In SIGIR.
Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. 2017b. End-to-End Neural Ad-hoc Ranking with Kernel Pooling. In
SIGIR. ACM, 55–64.
Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Mike Seltzer, Andreas Stolcke, Dong Yu, and Geoffrey Zweig. 2017c. The Microsoft
2016 conversational speech recognition system. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on.
IEEE, 5255–5259.
Liu Yang, Qingyao Ai, Jiafeng Guo, and W Bruce Croft. 2016. aNMM: Ranking short answer texts with attention-based neural matching model. In
CIKM. ACM, 287–296.
Hamed Zamani and W. Bruce Croft. 2016a. Embedding-based Query Language Models. In ICTIR. ACM, 147–156.
Hamed Zamani and W. Bruce Croft. 2016b. Estimating Embedding Vectors for Queries. In ICTIR. ACM, 123–132.
Hamed Zamani and W Bruce Croft. 2017. Relevance-based Word Embedding. In SIGIR.
Hamed Zamani, Bhaskar Mitra, Xia Song, Nick Craswell, and Saurabh Tiwary. 2018. Neural ranking models with multiple document fields. In
Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. ACM, 700–708.
Yuyu Zhang, Hanjun Dai, Chang Xu, Jun Feng, Taifeng Wang, Jiang Bian, Bin Wang, and Tie-Yan Liu. 2014. Sequential Click Prediction for
Sponsored Search with Recurrent Neural Networks.. In AAAI. 1369–1375.
Ye Zhang and Byron Wallace. 2015. A sensitivity analysis of (and practitioners’ guide to) convolutional neural networks for sentence classification.
arXiv preprint arXiv:1510.03820 (2015).
Xiangyu Zhao, Liang Zhang, Zhuoye Ding, Dawei Yin, Yihong Zhao, and Jiliang Tang. 2017. Deep Reinforcement Learning for List-wise
Recommendations. CoRR abs/1801.00209 (2017).
Guoqing Zheng and Jamie Callan. 2015. Learning to reweight terms with distributed representations. In SIGIR. ACM, 575–584.
Guido Zuccon, Bevan Koopman, Peter Bruza, and Leif Azzopardi. 2015. Integrating and Evaluating Neural Word Embeddings in Information
Retrieval. In 20th Australasian Document Computing Symposium. ACM, Article 12, 8 pages.
287
Notation
Meaning Notation
Single query q
Single document d
Set of queries Q
Collection of documents D
Term in query q tq
Term in document d td
Full vocabulary of all terms T
Set of ranked results retrieved for query q Rq
Result tuple (document d at rank i) ⟨i, d⟩, where ⟨i, d⟩ ∈ Rq
Relevance label of document d for query q relq(d)
di is more relevant than dj for query q relq(di) > relq(dj), or di ≻q dj
Frequency of term t in document d tf(t, d)
Number of documents that contain term t df(t)
Vector representation of text z ~z
Probability function for an event E p(E)
The set of real numbers R

We adopt some neural network related notation from [Goodfellow et al., 2016] and IR related notation from [Mitra and Craswell, 2017]
288
Acknowledgments

All content represents the opinion of the author(s), which is not necessarily shared or endorsed by their
employers and/or sponsors.
289
