Wei Xu
(many slides from Greg Durrett)
This Lecture
‣ Sequence-to-Sequence Model
‣ Attention Mechanism
‣ Copy Mechanism
‣ Transformer Architecture
Recap: Seq2Seq Model
(Figure: LSTM encoder reads "the movie was great"; the decoder, starting from <s>, generates "le film" while scores are computed between its state h̄_i and the encoder states h_j)
e_ij = f(h̄_i, h_j)
‣ Some function f (next slide)
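To make the recap concrete, here is a minimal PyTorch sketch of an LSTM encoder-decoder (no attention yet); the module, sizes, and toy inputs are illustrative assumptions, not code from the slides:

    import torch
    import torch.nn as nn

    class Seq2Seq(nn.Module):
        # Minimal LSTM encoder-decoder; all sizes are made up for illustration.
        def __init__(self, src_vocab, tgt_vocab, d=64):
            super().__init__()
            self.src_emb = nn.Embedding(src_vocab, d)
            self.tgt_emb = nn.Embedding(tgt_vocab, d)
            self.encoder = nn.LSTM(d, d, batch_first=True)
            self.decoder = nn.LSTM(d, d, batch_first=True)
            self.out = nn.Linear(d, tgt_vocab)

        def forward(self, src, tgt_in):
            # Encode the source sentence into states h_j plus a final state.
            enc_states, final = self.encoder(self.src_emb(src))
            # The decoder starts from the encoder's final state and reads the
            # previous target token at each step (teacher forcing in training).
            dec_states, _ = self.decoder(self.tgt_emb(tgt_in), final)
            return self.out(dec_states)       # logits over the target vocabulary

    model = Seq2Seq(src_vocab=1000, tgt_vocab=1200)
    src = torch.randint(0, 1000, (1, 4))      # e.g. ids for "the movie was great"
    tgt_in = torch.randint(0, 1200, (1, 3))   # e.g. ids for "<s> le film"
    print(model(src, tgt_in).shape)           # torch.Size([1, 3, 1200])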
Attention
(Figure: the decoder state h̄_1 attends over the encoder states to form the context vector c_1)
e_ij = f(h̄_i, h_j)
α_ij = exp(e_ij) / Σ_j' exp(e_ij')
c_i = Σ_j α_ij h_j
‣ Bahdanau+ (2014): additive: f(h̄_i, h_j) = tanh(W[h̄_i, h_j])
‣ Luong+ (2015): dot product: f(h̄_i, h_j) = h̄_i · h_j
‣ Luong+ (2015): bilinear: f(h̄_i, h_j) = h̄_i^T W h_j
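A small numpy sketch of the three scoring functions and the resulting attention weights and context vector; the dimensions and random values are illustrative, and the additive variant here collapses W to a single row (Bahdanau et al. additionally use a separate output vector on top of the tanh layer):

    import numpy as np

    d = 4
    enc = np.random.randn(5, d)      # encoder states h_j, one per source token
    dec = np.random.randn(d)         # current decoder state h_bar_i
    W_bi = np.random.randn(d, d)     # bilinear weight matrix
    w_add = np.random.randn(2 * d)   # additive weights mapping [h_bar; h] to a scalar

    e_dot = np.array([dec @ h for h in enc])                                    # Luong+ dot product
    e_bil = np.array([dec @ W_bi @ h for h in enc])                             # Luong+ bilinear
    e_add = np.array([np.tanh(w_add @ np.concatenate([dec, h])) for h in enc])  # additive

    alpha = np.exp(e_dot) / np.exp(e_dot).sum()   # softmax over source positions
    c_i = alpha @ enc                             # context vector: weighted sum of h_j
    print(alpha.round(2), c_i.shape)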
Copy Mechanism
‣ Need to count (for ordering) and also determine which tokens are in/out
‣ Content-based addressing
‣ Vocabulary contains "normal" vocab as well as words in input. Normalizes over both of these (a sketch follows below):
P(y_i = w | x, y_1, …, y_{i-1}) ∝ exp(W_w [c_i; h̄_i])       if w in vocab
P(y_i = w | x, y_1, …, y_{i-1}) ∝ exp(h_j^T V [c_i; h̄_i])   if w = x_j
‣ Pointer network (P_pointer): predict from source words, instead of target vocabulary
(Figure: example input "… portico in Pont-de-Buis …"; words such as "ecotax" and "Pont-de-Buis" illustrate predicting from the vocabulary vs. copying from the input)
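A small numpy sketch of the normalization above: scores for vocabulary words and scores for copying each source position go through one shared softmax; the matrix shapes and random inputs are illustrative assumptions:

    import numpy as np

    def copy_distribution(c_i, h_bar_i, enc, W, V):
        state = np.concatenate([c_i, h_bar_i])
        vocab_scores = W @ state          # W_w [c_i; h_bar_i] for every vocab word
        copy_scores = enc @ (V @ state)   # h_j^T V [c_i; h_bar_i] for every source position
        scores = np.concatenate([vocab_scores, copy_scores])
        probs = np.exp(scores - scores.max())
        return probs / probs.sum()        # normalized over vocab entries AND source words

    d, vocab, src_len = 4, 10, 5
    probs = copy_distribution(np.random.randn(d), np.random.randn(d),
                              np.random.randn(src_len, d),       # encoder states h_j
                              np.random.randn(vocab, 2 * d),     # W
                              np.random.randn(d, 2 * d))         # V
    print(probs.shape, probs.sum())       # (15,) 1.0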
Pointer Generator Mixture Models
‣ Define the decoder model as a mixture model of P_vocab and P_pointer
(Figure: the decoder mixes the vocabulary distribution with pointer attention over the source "… portico in Pont-de-Buis …")
Gulcehre et al. (2016), Gu et al. (2016)
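A minimal sketch of the mixture: a generation probability p_gen weights the vocabulary distribution, and the remaining mass follows the attention weights onto the source words' ids. The gating value and names are illustrative; they follow the spirit of the cited pointer-generator models rather than any exact formulation:

    import numpy as np

    def pointer_generator(p_vocab, alpha, src_ids, p_gen):
        # p_vocab: P_vocab, distribution over the target vocabulary
        # alpha:   attention weights over source positions, used as P_pointer
        # src_ids: vocabulary id of each source token, so copied mass lands on the right word
        mixed = p_gen * p_vocab
        for j, w in enumerate(src_ids):
            mixed[w] += (1.0 - p_gen) * alpha[j]
        return mixed

    p_vocab = np.full(10, 0.1)                  # uniform toy vocabulary distribution
    alpha = np.array([0.7, 0.1, 0.1, 0.1])      # attention over a 4-word source
    final = pointer_generator(p_vocab, alpha, src_ids=[3, 5, 5, 8], p_gen=0.6)
    print(final.round(2), final.sum())          # still a proper distribution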
Copying in Summarization
Sentence Encoders
‣ LSTM abstraction: maps each vector in a sentence to a new, context-aware vector
(Figure: each word of "the movie was great" is mapped to a context-aware vector)
The ballerina is very excited that she will dance in the show.
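As a concrete version of this abstraction, a minimal PyTorch sketch of a bidirectional LSTM encoder that maps each input vector to a new, context-aware vector (the embedding size and toy input are illustrative):

    import torch
    import torch.nn as nn

    words = torch.randn(1, 4, 32)        # embeddings for "the movie was great"
    encoder = nn.LSTM(32, 16, bidirectional=True, batch_first=True)
    context_aware, _ = encoder(words)    # one new vector per input word
    print(context_aware.shape)           # torch.Size([1, 4, 32]): forward + backward halves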
Transformer Uses
‣ Supervised: transformer can replace LSTM as encoder, decoder, or both; will revisit this when we discuss MT
‣ Unsupervised: transformers work better than LSTMs for unsupervised pre-training of embeddings: predict word given context words
‣ BERT (Bidirectional Encoder Representations from Transformers): pretraining transformer language models, similar to ELMo
‣ Stronger than similar methods, SOTA on ~11 tasks (including NER: 92.8 F1)
Transformers
‣ Encoder and decoder are both transformers; used in tasks such as machine translation and natural language generation
‣ Decoder consumes the previous generated token (and attends to input), but has no recurrent state
‣ Many other details to get it to work: residual connections, layer normalization, positional encoding, optimizer with learning rate schedule, label smoothing ….
Vaswani et al. (2017)
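To make the encoder-decoder picture concrete, a minimal sketch using PyTorch's built-in nn.Transformer (PyTorch 1.9+ for batch_first); the sizes, toy tensors, and mask usage are illustrative, and the training details listed above are omitted:

    import torch
    import torch.nn as nn

    model = nn.Transformer(d_model=64, nhead=4,
                           num_encoder_layers=2, num_decoder_layers=2,
                           batch_first=True)

    src = torch.randn(1, 5, 64)   # embedded + position-encoded source tokens
    tgt = torch.randn(1, 3, 64)   # embedded previously generated target tokens

    # Causal mask: each target position attends only to earlier positions,
    # since the decoder has no recurrent state and sees the whole prefix at once.
    tgt_mask = nn.Transformer.generate_square_subsequent_mask(3)

    out = model(src, tgt, tgt_mask=tgt_mask)
    print(out.shape)              # torch.Size([1, 3, 64]), one vector per target position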
Transformers
(Figure: word embeddings emb(1) … emb(4) combined with position embeddings)
‣ Augment word embedding with position embeddings, each dim is a sine/cosine wave of a different frequency. Closer points = higher dot products
‣ Works essentially as well as just encoding position as a one-hot vector
Vaswani et al. (2017)
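A small numpy sketch of the sine/cosine position embeddings from Vaswani et al. (2017); the array sizes here are illustrative, while the 10000 base follows the paper:

    import numpy as np

    def positional_encoding(num_positions, d_model):
        # Each dimension is a sine or cosine wave of a different frequency.
        pos = np.arange(num_positions)[:, None]
        dim = np.arange(0, d_model, 2)[None, :]
        angles = pos / np.power(10000.0, dim / d_model)
        pe = np.zeros((num_positions, d_model))
        pe[:, 0::2] = np.sin(angles)
        pe[:, 1::2] = np.cos(angles)
        return pe

    pe = positional_encoding(num_positions=16, d_model=8)
    # Closer positions have higher dot products than distant ones.
    print(pe[3] @ pe[4], pe[3] @ pe[12])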
Residual Connections
‣ Allow gradients to flow through a network directly, without passing through non-linear activation functions
(Figure: f(x) + x is the output to the next layer; f(x) is the non-linearity; x is the input from the previous layer)
Vaswani et al. (2017)
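A tiny PyTorch sketch of a residual connection wrapped around a non-linear block, matching the f(x) + x picture above (the particular two-layer block is an illustrative choice):

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, d):
            super().__init__()
            # f(x): the non-linear transformation being wrapped
            self.f = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

        def forward(self, x):
            # Output to the next layer is f(x) + x, so gradients can also flow
            # straight through the identity path during backpropagation.
            return self.f(x) + x

    block = ResidualBlock(d=8)
    print(block(torch.randn(2, 8)).shape)   # torch.Size([2, 8])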
Transformers
https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
Takeaways
‣ Can build MT systems with LSTM encoder-decoders, CNNs, or transformers