Figure 1: Overview of the Semantically informed multi-channel convolution recurrent neural network for distractor generation.
main of data, the downstream task tries to optimize the loss incurred during learning of the shared weights of the inner state of the pre-trained language model. All words that occur at least five times in the corpus are included, and infrequent words are denoted as UNK. In our training dataset we have encountered many such unknown words that are specific to a given domain, such as Physics, Chemistry and Biology terms, for which no word-level embeddings were available. To compensate for this loss we have also included character-level embeddings.

In the case of BERT we use the off-the-shelf pre-trained general-domain BERTBASE model with the same additional CNN-LSTM network layers on top of the architecture, similar to the other embedding models. It has 12 layers of transformer blocks, 768 hidden units, and 12 self-attention heads. The pre-trained model is fine-tuned on all three training datasets available to us.

Sentence modelling with CNN and LSTM

Once we generate both the sentence- and word-level embeddings, they form a one-dimensional array of representations. Each convolution network takes a pre-defined sequence of n-gram words as input and performs a series of linear transformations. We represent every sentence using our joint CNN and LSTM architecture. The CNN learns local features, from words to phrases, in the text, while the LSTM learns the long-term dependencies of the text. The main goal of this step is to learn good intermediate semantic representations of the sentences, which are further used for the semantic similarity task. Let us assume a sequence of words W = w1, w2, ..., wn, where each wi is represented by a word embedding of dimension d. Therefore, a sequence of n words is represented by a 1D convolution network. The width of the convolution over the sentence is k. We apply the same convolution filter to each window of size k in the sequence. This is done by computing a dot product between the concatenation of the embedding vectors in a given window and a weight vector u. Accordingly, the convolution layer applies a linear transformation to all K windows in the given sequence of vectors. To ensure a uniform dimension over all sequences of input and output vectors, we perform zero padding.

In the present distractor generation task the neighbouring words, i.e. the context, play an important role. The occurrence of a word wi largely depends upon the occurrence of its neighbouring words wi−k, wi−k+1, ..., wi−1. These factors can easily be handled by an LSTM network. However, a typical LSTM only allows the preceding elements to be considered as context, so we instead use a bidirectional LSTM, which considers both the forward and the backward sequence of elements (Zhou et al. 2016; Hochreiter and Schmidhuber 1997). Thus, corresponding to each vector output generated by the preceding CNN layer, we determine in parallel the sequence-level output representation from the Bi-LSTM units. Eventually, the last latent layer of the LSTM is taken as the semantic representation of the sentence, and the element-wise difference between these representations is used as a semantic discrepancy.
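To make the embedding step above concrete, the following sketch shows one way of combining pre-trained word embeddings with a character-level CNN so that domain-specific terms that fall back to UNK at the word level still receive a usable representation. The class name, vocabulary sizes and dimensions are illustrative assumptions, not values taken from the paper; the sketch uses PyTorch.

import torch
import torch.nn as nn

class WordCharEmbedding(nn.Module):
    # Hypothetical sketch: concatenate a word-level embedding (an UNK index is
    # assumed to exist in the word vocabulary) with a character-level CNN
    # embedding of the same token.
    def __init__(self, word_vocab=50000, char_vocab=100,
                 word_dim=300, char_dim=50, char_filters=50):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, word_dim, padding_idx=0)
        self.char_emb = nn.Embedding(char_vocab, char_dim, padding_idx=0)
        self.char_cnn = nn.Conv1d(char_dim, char_filters, kernel_size=3, padding=1)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq_len); char_ids: (batch, seq_len, max_word_len)
        w = self.word_emb(word_ids)                                  # (B, T, word_dim)
        B, T, L = char_ids.shape
        c = self.char_emb(char_ids.view(B * T, L)).transpose(1, 2)   # (B*T, char_dim, L)
        c = torch.relu(self.char_cnn(c)).max(dim=2).values           # max over characters
        return torch.cat([w, c.view(B, T, -1)], dim=-1)              # (B, T, word_dim + char_filters)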
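For the BERT variant, a minimal sketch along the same lines, assuming the Hugging Face transformers implementation of BERTBASE; only the 768-dimensional hidden size comes from the text above, and the CNN and BiLSTM sizes are again illustrative assumptions.

import torch
import torch.nn as nn
from transformers import BertModel

class BertCnnLstmEncoder(nn.Module):
    # Hypothetical sketch: pre-trained BERT-base (12 layers, 768 hidden units,
    # 12 heads) fine-tuned jointly with the same CNN-BiLSTM head used for the
    # other embedding models.
    def __init__(self, filters=128, lstm_hidden=128):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.conv = nn.Conv1d(768, filters, kernel_size=3, padding=1)
        self.bilstm = nn.LSTM(filters, lstm_hidden, batch_first=True, bidirectional=True)

    def forward(self, input_ids, attention_mask):
        h = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        h = torch.relu(self.conv(h.transpose(1, 2))).transpose(1, 2)  # (B, T, filters)
        _, (h_n, _) = self.bilstm(h)
        return torch.cat([h_n[-2], h_n[-1]], dim=-1)                  # forward and backward final states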
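The joint CNN-BiLSTM encoder and the element-wise difference described in the preceding paragraphs could be realised roughly as follows; the convolution plays the role of the window-k linear transformation with the shared weight vector u, and zero padding keeps input and output lengths equal. All names and sizes are illustrative assumptions rather than the paper's exact configuration.

import torch
import torch.nn as nn

class CnnBiLstmSentenceEncoder(nn.Module):
    # Hypothetical sketch: width-k, zero-padded 1D convolution over the embedded
    # sequence (the shared filter corresponds to the weight vector u), followed
    # by a BiLSTM whose final latent states form the sentence representation.
    def __init__(self, emb_dim=350, filters=128, lstm_hidden=128, k=3):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, filters, kernel_size=k, padding=k // 2)
        self.bilstm = nn.LSTM(filters, lstm_hidden, batch_first=True, bidirectional=True)

    def forward(self, x):
        # x: (batch, seq_len, emb_dim) word(+character) embeddings
        c = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)  # (B, T, filters)
        _, (h_n, _) = self.bilstm(c)
        return torch.cat([h_n[-2], h_n[-1]], dim=-1)                  # (B, 2 * lstm_hidden)

# The element-wise difference of two sentence vectors gives the
# semantic-discrepancy feature (usage sketch):
#   s1, s2 = encoder(x1), encoder(x2)
#   discrepancy = s1 - s2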
[Figure: Similarity Matrix computed between Sentence-S1 and Sentence-S2.]

ing from this concatenated feature vector. The first layer is activated by the ReLU function, while we apply the softmax link-function to the rest of the layers. We take cross-entropy as the loss function that measures the discrepancy between the real annotated output and the model output distribution of sentences in the corpora.
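A minimal sketch of the scoring head described above, with a ReLU-activated first dense layer and a softmax output trained with cross-entropy; in PyTorch the softmax is folded into CrossEntropyLoss, and the feature, hidden and class sizes are illustrative assumptions.

import torch
import torch.nn as nn

class ScoringHead(nn.Module):
    # Hypothetical sketch: dense layers over the concatenated feature vector.
    def __init__(self, feat_dim=512, hidden=256, n_classes=2):
        super().__init__()
        self.dense1 = nn.Linear(feat_dim, hidden)
        self.dense2 = nn.Linear(hidden, n_classes)

    def forward(self, features):
        h = torch.relu(self.dense1(features))
        return self.dense2(h)          # logits; softmax is applied inside the loss

criterion = nn.CrossEntropyLoss()      # cross-entropy between gold labels and model distribution
# loss = criterion(ScoringHead()(features), labels)  # features: (B, feat_dim), labels: (B,)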