
Deep Learning for Machine Translation

A dramatic paradigm shift

Alberto Massidda
Who we are

● Founded in 2001;
● Branches in Milan, Rome and London;
● Market leader in enterprise-ready solutions based on Open Source tech;
● Expertise:
○ Open Source

○ DevOps

○ Public and private cloud

○ Search

○ BigData and many more...


This presentation is Open Source (yay!)

https://creativecommons.org/licenses/by-nc-sa/3.0/
Outline
1. Statistical Machine Translation
2. Neural Machine Translation
3. Domain Adaptation
4. Zero-shot translation
5. Unsupervised Neural MT
Statistical Machine Translation
Translation as the recovery of a ciphered message through the laws of probability:

1. Foreign language as a noisy channel

2. Language model and Translation model

3. Training (building the translation model)

4. Decoding (translating with the translation model)


Noisy channel model
Goal

Translate a sentence in a foreign language f into our language e:

e* = argmax_e p(e|f) = argmax_e p(e) · p(f|e)

The abstract model

1. Transmit e over a noisy channel.

2. The channel garbles the sentence and f is received.
3. Try to recover e by reasoning about:
a. how likely it is that e was the message: p(e) (source model)
b. how e gets turned into f: p(f|e) (channel model)
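
A minimal sketch of this decision rule; lm_score and tm_score are hypothetical stand-ins for the two models introduced next:

```python
def noisy_channel_decode(f, candidates, lm_score, tm_score):
    """Pick the candidate e that maximizes log p(e) + log p(f|e).
    lm_score(e) and tm_score(f, e) are assumed to return log-probabilities
    from the language model and the translation (channel) model."""
    return max(candidates, key=lambda e: lm_score(e) + tm_score(f, e))
```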
Word choice and word reordering
P(f|e) cares about words, in any order.

● “It’s too late” → “Tardi troppo è” ✓

● “It’s too late” → “È troppo tardi” ✓

● “It’s too late” → “È troppa birra” ✗

P(e) cares about word order.

● “È troppo tardi” ✓

● “Tardi troppo è” ✗
P(e) and P(f|e)

Where do these numbers come from?
Language model
P(e) comes from a Language model, a machine that assigns scores to
sentences, estimating their likelihood.
1. Record every sentence ever said in English (1 billion?).
2. If the sentence “how’s it going?” appears 76,413 times in that database, then we say:

p(“how’s it going?”) = 76,413 / 1,000,000,000
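
A minimal sketch of this counting estimate; the tiny corpus below is made up for illustration, only the final numbers mirror the slide:

```python
from collections import Counter

# Hypothetical database of recorded English sentences (1 billion in the slide's thought experiment).
corpus = ["how's it going?", "it's too late", "how's it going?"]  # ...and many more

counts = Counter(corpus)
total = len(corpus)

def p(sentence):
    """Relative-frequency estimate: occurrences of the sentence over the corpus size."""
    return counts[sentence] / total

# With the slide's numbers: 76,413 occurrences out of 1,000,000,000 sentences.
print(76413 / 1_000_000_000)  # 7.6413e-05
```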
Translation model
Next we need to worry about P(f|e), the probability of a French string f given an
English string e.

This is called a translation model.

It boils down to computing alignments between source and target languages.


Computing alignments intuition
Pairs of English and Chinese words that occur together in a parallel example may be
translations of each other.
Training Data
A parallel corpus is a collection of texts, each of which is translated into one or
more languages other than the original.

EN IT

Look at that! Guarda lì!

I've never seen anything like that! Non ho mai visto nulla di simile!

That's incredible! È incredibile!

That's terrific. È eccezionale.


Computing alignments: Expectation Maximization

This algorithm iterates over the data, progressively sharpening the latent properties of a system.

It finds a local optimum convergence point without any user supervision.

Example with a 2-sentence corpus: (b c ↔ x y) and (b ↔ y).
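
A minimal sketch of EM for word alignment in the spirit of IBM Model 1 (no NULL word), run on the toy corpus above:

```python
from collections import defaultdict

# Toy parallel corpus mirroring the slide's example: (source words, target words)
corpus = [(["b", "c"], ["x", "y"]),
          (["b"],      ["y"])]

# t[(f, e)] = p(target word f | source word e), initialized uniformly.
t = defaultdict(lambda: 0.5)

for _ in range(10):                              # EM iterations
    count = defaultdict(float)                   # expected counts c(f, e)
    total = defaultdict(float)                   # expected counts c(e)
    # E-step: collect expected alignment counts under the current t.
    for src, tgt in corpus:
        for f in tgt:
            z = sum(t[(f, e)] for e in src)      # normalize over possible source words
            for e in src:
                delta = t[(f, e)] / z
                count[(f, e)] += delta
                total[e] += delta
    # M-step: re-estimate translation probabilities from the expected counts.
    for (f, e), c in count.items():
        t[(f, e)] = c / total[e]

print(t[("y", "b")], t[("x", "c")])              # both move towards 1.0
```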
Decoding
Now it’s time to decode our string encoded by the noisy channel.

Word alignments are leveraged to build a “space” for a search algorithm.

Translating is searching in a space of options.


Translation options as a coverage set
Decoding in action
1. The algorithm builds the search space as a tree of options, sorted by p(e|f).

a. The search space is limited to a fixed size, the “beam” (see the sketch below).

2. Options are expanded highest-probability first.

a. Reordering adds a penalty.

b. The language model penalizes the output at each stage.

3. Translation stops when all source words are translated, or covered.
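
A minimal beam-search skeleton under these assumptions; expand, score and is_complete are hypothetical hooks standing in for the translation options, the reordering/LM penalties and the coverage check:

```python
def beam_search(start, expand, score, is_complete, beam_size=5, max_steps=50):
    """Generic beam-search skeleton (a sketch, not the full phrase-based decoder):
    expand(hyp) yields successor hypotheses (one more source word covered),
    score(hyp) returns the penalized log-probability used for ranking,
    is_complete(hyp) is true once every source word is covered."""
    beam = [start]
    for _ in range(max_steps):
        candidates = []
        for hyp in beam:
            if is_complete(hyp):
                candidates.append(hyp)           # keep finished hypotheses
            else:
                candidates.extend(expand(hyp))   # grow partial hypotheses
        beam = sorted(candidates, key=score, reverse=True)[:beam_size]
        if all(is_complete(h) for h in beam):    # stop: all source words covered
            break
    return beam[0]                               # best-scoring full translation
```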


Neural machine translation
NMT is based on probability too, but has some differences:

● End-to-end training: no more separate Translation + Language Models.


● Markovian assumption, instead of Naive Bayesian: words move together.

If a sentence f of length n is a sequence of words w_1, …, w_n, then p(f) is:

p(f) = ∏_{i=1..n} p(w_i | w_1, …, w_{i-1})
Neural network review: feed-forward
Weighted links determine how strongly a neuron can influence its neighbours.
The deviation between outputs and expected values drives the rebalancing of the weights.

But a feed-forward network is not suitable for mapping the temporal dependencies
between words. We need an architecture that can explicitly model sequences.
Recurrent network
Neural language model
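A minimal sketch of one step of a recurrent language model, assuming a vanilla RNN cell and a softmax output over the vocabulary (all names and shapes are illustrative):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def rnn_lm_step(x_t, h_prev, Wx, Wh, Wo):
    """One step of a vanilla RNN language model: the new hidden state
    summarizes the whole history, and the softmax over the output layer
    gives p(next word | history)."""
    h_t = np.tanh(Wx @ x_t + Wh @ h_prev)   # update hidden state
    p_next = softmax(Wo @ h_t)              # distribution over the vocabulary
    return h_t, p_next
```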
Encoder - Decoder architecture
With a source sentence f = f_1 … f_m and a target sentence e = e_1 … e_n:

p(e | f) = ∏_{i=1..n} p(e_i | e_1, …, e_{i-1}, f)   (one single sequence)

Languages are independent (vocabulary and domain), so we can split this into 2 separate RNNs:

1. c = encoder(f_1, …, f_m)   (summary vector of the source)

2. p(e_i | e_1, …, e_{i-1}, c) = decoder(e_{i-1}, g_{i-1}, c)   (each new word depends on the history and the summary)
Sequence-to-sequence (seq2seq) architecture

[Figure: an encoder RNN (states h) reads “THE WAITER TOOK THE PLATES”; its final state initializes a decoder RNN (states g) that emits “IL CAMERIERE PRESE I PIATTI”.]
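
A minimal PyTorch-style sketch of the two RNNs above (GRU cells; the sizes and the teacher-forced decoder input are illustrative assumptions):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=256, hidden=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt):
        # Encoder: the final hidden state is the fixed-size summary vector.
        _, summary = self.encoder(self.src_emb(src))
        # Decoder: generates the target conditioned on the summary and its own history.
        dec_states, _ = self.decoder(self.tgt_emb(tgt), summary)
        return self.out(dec_states)          # logits over the target vocabulary
```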


Summary vector as information bottleneck
A fixed-size representation degrades as sentence length increases.

This is because alignment is learned with many-to-many logic:
the gradient flows towards everybody for any alignment mistake.

Let’s gate the gradient flow through a context vector, a weighted average of the
source hidden states (also known as “soft search” or “attention”).

The weights are computed by a feed-forward network with a softmax activation.
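
A minimal sketch of this soft search, assuming additive (feed-forward) scoring in the spirit of Bahdanau-style attention; the weight matrices Wa, Ua, va are illustrative:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def attention(dec_state, enc_states, Wa, Ua, va):
    """Context vector = weighted average of the encoder hidden states.
    Scores come from a small feed-forward net; softmax turns them into weights."""
    scores = np.array([va @ np.tanh(Wa @ dec_state + Ua @ h) for h in enc_states])
    weights = softmax(scores)                          # e.g. [0.7, 0.05, 0.1, 0.1, 0.05]
    context = sum(w * h for w, h in zip(weights, enc_states))
    return context, weights
```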


Attention model

[Figure, five animation steps: while generating “IL CAMERIERE PRESE I PIATTI”, the attention weights over “THE WAITER TOOK THE PLATES” shift step by step, so that about 0.7 of the weight lands on the source word aligned with the target word being emitted, with the remaining 0.05-0.1 spread over the other positions.]
Neural domain adaptation
Sometimes we want our network to assume a particular style, but we don’t have enough data.
Solution: adapt an already trained network.

1. First, train the full network on general data to obtain a general model.
2. Then, train the last layers on the new data so that they stylistically influence the output.
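
A minimal PyTorch-style sketch of step 2, assuming a model shaped like the earlier Seq2Seq sketch (model.out is that sketch’s output layer, not a fixed API):

```python
import torch

def fine_tune_last_layer(model, in_domain_loader, lr=1e-4):
    """Freeze all parameters, then unfreeze and train only the output layer
    (`model.out` in the earlier Seq2Seq sketch) on the in-domain data."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.out.parameters():
        p.requires_grad = True
    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for src, tgt in in_domain_loader:            # small, in-domain parallel corpus
        logits = model(src, tgt[:, :-1])         # teacher forcing, as in the sketch
        loss = loss_fn(logits.reshape(-1, logits.size(-1)), tgt[:, 1:].reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```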
Zero-shot translation: Google Neural MT

We can use a single system for multilingual MT: just feed all the different parallel data into the same system.

Tag the input data with the desired target language (e.g. “<2DE> je suis ici”, “<2IT> I am here”): the NMT system will translate into the target language!

As a side effect, we build an internal “shared knowledge representation”.
This enables translation between unseen language pairs.

[Figure: a single GNMT model trained on FR → DE (“<2DE> je suis ici” → “Ich bin hier”) and EN → IT (“<2IT> I am here” → “Sono qui”) can also answer EN → DE, a pair it has never seen.]
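
A minimal sketch of the input-side preparation; the tag format follows the slide’s example:

```python
def tag_source(sentence, target_lang):
    """Prepend the desired target-language token, e.g. '<2DE> je suis ici'."""
    return f"<2{target_lang.upper()}> {sentence}"

print(tag_source("je suis ici", "de"))   # <2DE> je suis ici
print(tag_source("I am here", "it"))     # <2IT> I am here
```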
Unsupervised NMT
We can translate even without parallel data, using just two monolingual corpora.

Each corpus builds a latent semantic space. Similar languages build similar spaces.

Translation then becomes a geometric mapping between affine latent semantic spaces.


[Figure: a shared encoder maps a source sentence x into a latent representation z; one decoder reconstructs x̂ from z (auto-encoding), while a second decoder maps the same latent space into a target sentence y.]
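
A minimal sketch of the geometric-mapping idea using orthogonal Procrustes on independently trained word embeddings, assuming a small seed dictionary of word pairs (a simplification of the full unsupervised pipeline):

```python
import numpy as np

def fit_mapping(X, Y):
    """Orthogonal Procrustes: find W minimizing ||X W - Y|| with W orthogonal.
    X, Y: (n, d) embeddings of the n seed-dictionary word pairs,
    learned separately on the two monolingual corpora."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Usage (shapes illustrative): map a source-language vector into the target space.
# W = fit_mapping(X_seed, Y_seed); y_hat = x_vector @ W
```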
Links
https://www.tensorflow.org/tutorials/seq2seq
NMT (seq2seq) Tutorial

https://github.com/google/seq2seq
A general-purpose encoder-decoder framework for Tensorflow

https://github.com/awslabs/sockeye
seq2seq framework with a focus on NMT based on Apache MXNet

http://www.statmt.org/
Old school statistical MT reference site
Q&A
