Temporal/sequence problems
‣ How to cast as a supervised learning problem?
[Figure: USD/EUR exchange rate over time; a window of past observations is stacked into a feature vector and used to predict the next value y^(t)]
Temporal/sequence problems
‣ Language modeling: what comes next?
[Figure: language modeling; the words seen so far are mapped to a vector used to predict the next word y^(t)]
What are we missing?
‣ Sequence prediction problems can be recast in a form
amenable to feed-forward neural networks
‣ But we have to engineer how “history” is mapped to a
vector (representation); this vector is then fed into, e.g.,
a feed-forward neural network (see the sketch after this list)
- how many steps back should we look?
- how do we retain important items mentioned far back?
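A minimal sketch of this recasting, assuming numpy, a synthetic exchange-rate series, and a hypothetical window size k: each length-k history becomes one feature vector, and any feed-forward model could consume the resulting (X, y) pairs (here, plain least squares stands in for the network).

import numpy as np

def make_windows(series, k):
    # Each length-k history becomes one feature row; the next value is the target.
    X = np.array([series[t - k:t] for t in range(k, len(series))])
    y = np.array([series[t] for t in range(k, len(series))])
    return X, y

rng = np.random.default_rng(0)
rate = 1.1 + np.cumsum(rng.normal(scale=0.01, size=200))  # synthetic "USD/EUR" series
X, y = make_windows(rate, k=5)

# Simplest predictor on these features (linear least squares with a bias term);
# a feed-forward network would consume exactly the same (X, y) pairs.
w, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)
print("one-step-ahead prediction:", np.r_[rate[-5:], 1.0] @ w)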
‣ Sentiment classification
“I have seen better lectures” → -1
‣ Machine translation
“I have seen better lectures” → “Olen nähnyt parempia luentoja” (Finnish)
[Figure: the source sentence is first encoded into a vector, which is then decoded into the target sentence]
Key concepts
‣ Encoding (this lecture)
- e.g., mapping a sequence to a vector
‣ Decoding (next lecture)
- e.g., mapping a vector to a sequence
Encoding everything
[Figure: words, sentences, images, and events each encoded as real-valued vectors, e.g., the sentence “Efforts and courage are not enough without purpose and direction” (JFK)]
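For concreteness, a minimal sketch of the simplest such encoding: an embedding table that maps each word to a real-valued vector (the vocabulary, dimensions, and values here are hypothetical; in practice the table is learned).

import numpy as np

rng = np.random.default_rng(0)
vocab = {"efforts": 0, "and": 1, "courage": 2}   # hypothetical tiny vocabulary
E = rng.normal(size=(len(vocab), 3))             # one 3-d vector per word; learned in practice

def embed(word):
    return E[vocab[word]]                        # table lookup: word -> vector

print(embed("courage"))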
Example: encoding sentences
‣ Easy to introduce adjustable “lego pieces” and optimize
them for end-to-end performance
[Figure, built up step by step: an encoder “lego piece” is applied word by word to the sentence, starting from the <null> state; after each step the state summarizes the prefix read so far (“Efforts”, then “Efforts and”, and so on), and the final state gives the whole sentence as a vector]

s_t = tanh(W^{s,s} s_{t-1} + W^{s,x} x_t)

What’s in the box?
[Figure: the previous context (state) s_{t-1} and the new information x_t enter a box with parameters θ; the box outputs the new context (state) s_t: a basic RNN]

s_t = tanh(W^{s,s} s_{t-1} + W^{s,x} x_t)
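A minimal sketch of this encoder, assuming numpy and randomly initialized weights (in practice W^{s,s} and W^{s,x} are learned end-to-end):

import numpy as np

rng = np.random.default_rng(0)
d_s, d_x = 4, 3                          # state and input dimensions (arbitrary here)
W_ss = 0.5 * rng.normal(size=(d_s, d_s)) # W^{s,s}: how the old state carries over
W_sx = 0.5 * rng.normal(size=(d_s, d_x)) # W^{s,x}: how new information enters

def encode(xs):
    s = np.zeros(d_s)                    # the <null> initial state
    for x in xs:                         # the same "lego piece" at every step
        s = np.tanh(W_ss @ s + W_sx @ x)
    return s                             # the whole sequence as one vector

sentence = rng.normal(size=(9, d_x))     # e.g., 9 word vectors
print(encode(sentence))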
What’s in the box?
‣ We can make the RNN more sophisticated…
[Figure: the same box, now a simple gated RNN]

g_t = sigmoid(W^{g,s} s_{t-1} + W^{g,x} x_t)
s_t = (1 - g_t) ⊙ s_{t-1} + g_t ⊙ tanh(W^{s,s} s_{t-1} + W^{s,x} x_t)
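The same sketch adapted to the gated update (assumptions as above: numpy, random weights; ⊙ is elementwise multiplication):

import numpy as np

rng = np.random.default_rng(1)
d_s, d_x = 4, 3
W_gs, W_gx = rng.normal(size=(d_s, d_s)), rng.normal(size=(d_s, d_x))
W_ss, W_sx = rng.normal(size=(d_s, d_s)), rng.normal(size=(d_s, d_x))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_step(s_prev, x):
    g = sigmoid(W_gs @ s_prev + W_gx @ x)     # gate: how much to overwrite, per coordinate
    cand = np.tanh(W_ss @ s_prev + W_sx @ x)  # candidate new state
    return (1 - g) * s_prev + g * cand        # blend old state and candidate

print(gated_step(np.zeros(d_s), rng.normal(size=d_x)))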
What’s in the box?
‣ We can make the RNN more sophisticated…
[Figure: the same box, now an LSTM (one of many variants)]

f_t = sigmoid(W^{f,h} h_{t-1} + W^{f,x} x_t)    (forget gate)
i_t = sigmoid(W^{i,h} h_{t-1} + W^{i,x} x_t)    (input gate)
o_t = sigmoid(W^{o,h} h_{t-1} + W^{o,x} x_t)    (output gate)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W^{c,h} h_{t-1} + W^{c,x} x_t)    (memory cell)
h_t = o_t ⊙ tanh(c_t)    (visible state)
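And a sketch of one LSTM step under the same assumptions (numpy, random weights; biases omitted as in the equations above):

import numpy as np

rng = np.random.default_rng(2)
d_h, d_x = 4, 3
# One (recurrent, input) weight pair per gate/cell: f, i, o, c.
W = {k: (rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_x))) for k in "fioc"}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x):
    f = sigmoid(W["f"][0] @ h_prev + W["f"][1] @ x)                   # forget gate
    i = sigmoid(W["i"][0] @ h_prev + W["i"][1] @ x)                   # input gate
    o = sigmoid(W["o"][0] @ h_prev + W["o"][1] @ x)                   # output gate
    c = f * c_prev + i * np.tanh(W["c"][0] @ h_prev + W["c"][1] @ x)  # memory cell
    h = o * np.tanh(c)                                                # visible state
    return h, c

h, c = lstm_step(np.zeros(d_h), np.zeros(d_h), rng.normal(size=d_x))
print(h)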
Key things
‣ Neural networks for sequences: encoding
‣ RNNs, unfolded
- state evolution, gates
- relation to feed-forward neural networks
- back-propagation (conceptually)
‣ Issues: vanishing/exploding gradient
‣ LSTM (operationally)
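A small numeric illustration of the gradient issue (hypothetical weights): back-propagating through T steps of the basic RNN multiplies T Jacobians of the form diag(1 - s_t^2) W^{s,s}, and depending on the weight scale the product tends to collapse toward zero or blow up as T grows.

import numpy as np

rng = np.random.default_rng(3)
d = 4
for scale in (0.1, 2.0):                     # small vs. large recurrent weights
    W_ss = scale * rng.normal(size=(d, d))
    s, J = np.zeros(d), np.eye(d)
    for t in range(50):
        s = np.tanh(W_ss @ s + rng.normal(size=d))
        J = np.diag(1 - s**2) @ W_ss @ J     # chain rule through one tanh step
    print(f"scale={scale}: |ds_50/ds_0| = {np.linalg.norm(J):.2e}")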