
Modeling with Machine Learning: RNN (part 2)


Recall: learning to encode/decode
‣ Language modeling
  This course has been a success (?)
‣ Sentiment classification
  I have seen better lectures → -1
‣ Machine translation
  I have seen better lectures → Olen nähnyt parempia luentoja
  (encoding → decoding)
Outline (part 2)
‣ Modeling sequences: language models
- Markov models
- as neural networks
- hidden state, Recurrent Neural Networks (RNNs)
‣ Example: decoding images into sentences
Markov Models
‣ The next word in a sentence depends on the previous symbols
  already written (history = one, two, or more words)

The lecture leaves me bumfuzzled


‣ Similarly, the next character in a word depends on the previous
  characters already written

bumfuzzled

‣ We can model such kth-order dependencies between symbols with Markov models
Markov Language Models
‣ Let w ∈ V denote a word/symbol from the vocabulary V of possible words/symbols, which includes
  - an UNK symbol for any unknown word (out of vocabulary)
  - a <beg> symbol for specifying the start of a sentence
  - an <end> symbol for specifying the end of the sentence

<beg>  The  lecture  leaves  me  UNK  <end>
  w0    w1     w2      w3    w4   w5    w6

‣ In a first-order Markov model (bigram model), the next
  symbol only depends on the previous one (see the factorization below)
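To make the assumption concrete, a first-order model factorizes the sentence probability into bigram conditionals; a minimal LaTeX sketch, with w_0 and w_n standing for the <beg> and <end> markers:

```latex
P(w_1, \dots, w_n) \;=\; \prod_{i=1}^{n} P(w_i \mid w_{i-1}),
\qquad w_0 = \texttt{<beg>}, \quad w_n = \texttt{<end>}.
```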
A first order Markov model
‣ Each symbol (except <beg>) in the sequence is
predicted using the same conditional probability table
until an <end> symbol is seen
                                  w_i
  w_{i-1}        ML    course    is    UNK   <end>
  <beg>         0.7     0.1     0.1    0.1    0.0
  ML            0.1     0.5     0.2    0.1    0.1
  course        0.0     0.0     0.7    0.1    0.2
  is            0.1     0.3     0.0    0.6    0.0
  UNK           0.1     0.2     0.2    0.3    0.2
Sampling from a Markov model
‣ Starting from <beg>, repeatedly sample the next word from the row of the
  current word until <end> is drawn (see the sketch after the table)

                                  w_i
  w_{i-1}        ML    course    is    UNK   <end>
  <beg>         0.7     0.1     0.1    0.1    0.0
  ML            0.1     0.5     0.2    0.1    0.1
  course        0.0     0.0     0.7    0.1    0.2
  is            0.1     0.3     0.0    0.6    0.0
  UNK           0.1     0.2     0.2    0.3    0.2
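A minimal Python sketch of sampling from this model, assuming the conditional probability table above is stored as a nested dict (the helper name `sample_sentence` is illustrative):

```python
import random

# Conditional probability table P(w_i | w_{i-1}) from the slide.
TRANSITIONS = {
    "<beg>":  {"ML": 0.7, "course": 0.1, "is": 0.1, "UNK": 0.1, "<end>": 0.0},
    "ML":     {"ML": 0.1, "course": 0.5, "is": 0.2, "UNK": 0.1, "<end>": 0.1},
    "course": {"ML": 0.0, "course": 0.0, "is": 0.7, "UNK": 0.1, "<end>": 0.2},
    "is":     {"ML": 0.1, "course": 0.3, "is": 0.0, "UNK": 0.6, "<end>": 0.0},
    "UNK":    {"ML": 0.1, "course": 0.2, "is": 0.2, "UNK": 0.3, "<end>": 0.2},
}

def sample_sentence(max_len=20):
    """Sample words starting from <beg> until <end> (or max_len) is reached."""
    word, sentence = "<beg>", []
    for _ in range(max_len):
        nxt = random.choices(list(TRANSITIONS[word]),
                             weights=TRANSITIONS[word].values())[0]
        if nxt == "<end>":
            break
        sentence.append(nxt)
        word = nxt
    return sentence

print(sample_sentence())  # e.g. ['ML', 'course', 'is', 'UNK']
```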
Maximum likelihood estimation
‣ The goal is to maximize the probability that the model
can generate all the observed sentences (corpus S)
s ∈ S,   s = {w_1^s, w_2^s, …, w_{|s|}^s}
‣ The ML estimate is obtained as normalized counts of successive word
  occurrences (matching statistics)
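A minimal sketch of this count-and-normalize estimate, assuming the corpus is a list of already-tokenized sentences (the helper name `estimate_bigrams` and the toy corpus are illustrative, not from the slides):

```python
from collections import Counter, defaultdict

def estimate_bigrams(corpus):
    """ML estimate: P(w | v) = count(v, w) / count(v, anything)."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<beg>"] + sentence + ["<end>"]
        for prev, word in zip(tokens, tokens[1:]):
            counts[prev][word] += 1
    return {prev: {w: c / sum(nxt.values()) for w, c in nxt.items()}
            for prev, nxt in counts.items()}

corpus = [["ML", "course", "is", "UNK"], ["ML", "is", "UNK"]]
print(estimate_bigrams(corpus)["ML"])   # {'course': 0.5, 'is': 0.5}
```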
Feature based Markov Model
‣ We can also represent the Markov model as a feed-forward neural network
  (very extendable)
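A minimal sketch of this feed-forward view, assuming the previous word is one-hot encoded, multiplied by a weight matrix, and pushed through a softmax (the toy vocabulary and random weights are illustrative; a real model would learn W and b by maximum likelihood):

```python
import numpy as np

VOCAB = ["<beg>", "ML", "course", "is", "UNK", "<end>"]
V = len(VOCAB)

rng = np.random.default_rng(0)
W = rng.normal(size=(V, V))      # one column of next-word scores per previous word
b = np.zeros(V)                  # bias term ("unigram" preference)

def next_word_distribution(prev_word):
    """One-hot(previous word) -> linear layer -> softmax over the next word."""
    x = np.zeros(V)
    x[VOCAB.index(prev_word)] = 1.0   # one-hot feature vector
    z = W @ x + b                     # scores for every candidate next word
    p = np.exp(z - z.max())
    return p / p.sum()

print(next_word_distribution("ML").round(3))
```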
Temporal/sequence problems
‣ Language modeling: what comes next?

This course has been a tremendous …


[Figure: the words are one-hot encoded column vectors, e.g. for “a” and
“tremendous”; the input is x^(t) and the target y^(t) = ? is the next word.]
Temporal/sequence problems
‣ A trigram language model
[Figure: for a trigram model the input x^(t) stacks the one-hot vectors of the
two previous words (“a”, “tremendous”); y^(t) is the predicted distribution
over the next word.]
RNNs for sequences
‣ Language modeling: what comes next?

This course has been a tremendous …


[Figure: the one-hot vector for “tremendous” is the current input; the next
word (?) is to be predicted.]
s_t = tanh(W^{s,s} s_{t-1} + W^{s,x} x_t)      state
p_t = softmax(W^o s_t)                         output distribution
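A minimal numpy sketch of this recurrence, with hidden size m and vocabulary size V chosen arbitrarily; the weight names W_ss, W_sx, W_o mirror the superscripts above and the random values are illustrative, not trained:

```python
import numpy as np

m, V = 16, 1000                              # hidden size, vocabulary size (illustrative)
rng = np.random.default_rng(0)
W_ss = rng.normal(scale=0.1, size=(m, m))    # state  -> state
W_sx = rng.normal(scale=0.1, size=(m, V))    # input  -> state
W_o  = rng.normal(scale=0.1, size=(V, m))    # state  -> output scores

def rnn_step(s_prev, x_t):
    """One step: s_t = tanh(W_ss s_{t-1} + W_sx x_t), p_t = softmax(W_o s_t)."""
    s_t = np.tanh(W_ss @ s_prev + W_sx @ x_t)
    z = W_o @ s_t
    p_t = np.exp(z - z.max()); p_t /= p_t.sum()
    return s_t, p_t

s = np.zeros(m)                              # initial state
x = np.zeros(V); x[42] = 1.0                 # one-hot input for some word
s, p = rnn_step(s, x)                        # p: distribution over the next word
```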
Decoding, RNNs
‣ Our RNN now produces an output (e.g., a word) at each step as well as
  updating its state
[Diagram: the basic RNN cell takes the previous state and the previous output
(fed back in as the input x) and produces a new state together with an output
distribution, e.g. [0.1, 0.3, …, 0.2].]

s_t = tanh(W^{s,s} s_{t-1} + W^{s,x} x_t)      state
p_t = softmax(W^o s_t)                         output distribution
Decoding, LSTM
[Diagram: the LSTM cell plays the same role: the previous state and the
previous output (as the input x) go in, and a new state plus an output
distribution, e.g. [0.1, 0.3, …, 0.2], come out.]
f_t = sigmoid(W^{f,h} h_{t-1} + W^{f,x} x_t)                       forget gate
i_t = sigmoid(W^{i,h} h_{t-1} + W^{i,x} x_t)                       input gate
o_t = sigmoid(W^{o,h} h_{t-1} + W^{o,x} x_t)                       output gate
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W^{c,h} h_{t-1} + W^{c,x} x_t)    memory cell
h_t = o_t ⊙ tanh(c_t)                                              visible state
p_t = softmax(W^o h_t)                                             output distribution
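A minimal numpy sketch of these gate equations; the parameter names (W_fh, W_fx, …, W_out) mirror the superscripts above, and the random initialization is illustrative only:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_lstm(m, V, seed=0):
    """Randomly initialize weight matrices named after the superscripts above."""
    rng = np.random.default_rng(seed)
    p = {n: rng.normal(scale=0.1, size=(m, m)) for n in ["W_fh", "W_ih", "W_oh", "W_ch"]}
    p.update({n: rng.normal(scale=0.1, size=(m, V)) for n in ["W_fx", "W_ix", "W_ox", "W_cx"]})
    p["W_out"] = rng.normal(scale=0.1, size=(V, m))
    return p

def lstm_step(h_prev, c_prev, x_t, p):
    """One LSTM step following the gate equations above."""
    f = sigmoid(p["W_fh"] @ h_prev + p["W_fx"] @ x_t)                    # forget gate
    i = sigmoid(p["W_ih"] @ h_prev + p["W_ix"] @ x_t)                    # input gate
    o = sigmoid(p["W_oh"] @ h_prev + p["W_ox"] @ x_t)                    # output gate
    c = f * c_prev + i * np.tanh(p["W_ch"] @ h_prev + p["W_cx"] @ x_t)   # memory cell
    h = o * np.tanh(c)                                                   # visible state
    z = p["W_out"] @ h
    probs = np.exp(z - z.max()); probs /= probs.sum()                    # output distribution
    return h, c, probs
```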
Decoding (into a sentence)
‣ Our RNN now needs to produce an output (e.g., a word) at each step as
  well as update its state
‣ The output is fed back in as an input (to gauge what’s left)

  vector encoding of the sentence “I have seen better lectures” → initial state,
  first input = <null>
  p1 p2 p3 p4 p5 : distributions over the possible words at each step
  sampled words:   Olen   nähnyt   parempia   luentoja   <end>
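A minimal sketch of this decoding loop, assuming a step function with the same (state, input) → (state, distribution) shape as the RNN sketch above; the initial state would come from the sentence encoder, and the <null> start input is taken to be an all-zero vector:

```python
import numpy as np

def decode(initial_state, step_fn, vocab, max_len=20, rng=None):
    """Decode word by word, feeding each produced word back in as the next input."""
    rng = rng or np.random.default_rng()
    V = len(vocab)
    x = np.zeros(V)                        # <null> start input: all zeros
    s, words = initial_state, []
    for _ in range(max_len):
        s, p = step_fn(s, x)               # new state + distribution over words
        idx = rng.choice(V, p=p)           # sample (use p.argmax() for greedy decoding)
        word = vocab[idx]
        if word == "<end>":
            break
        words.append(word)
        x = np.zeros(V); x[idx] = 1.0      # the output is fed back in as the next input
    return words
```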
Mapping images to text

[Figure 3: LSTM model combined with a CNN image embedder (as defined in [30])
and word embeddings, unrolled over time. The image is fed only to the first
LSTM step; the following inputs are the word embeddings W_e S_0, …, W_e S_{N-1},
and training maximizes log p_1(S_1) + … + log p_N(S_N).]
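A small sketch of how the same decoder can be driven by an image, loosely following the figure: a hypothetical `cnn_features(image)` embedding is fed as the very first input, after which decoding proceeds word by word (`decode` and `step_fn` refer to the sketches above, and the embedding dimension is assumed to match the decoder's input size):

```python
import numpy as np

def caption_image(image, cnn_features, step_fn, vocab, state_size):
    """Feed the CNN image embedding as the first input, then decode word by word."""
    s = np.zeros(state_size)                  # start from an all-zero decoder state
    s, _ = step_fn(s, cnn_features(image))    # the image embedding is the first input
    return decode(s, step_fn, vocab)          # then the usual word-by-word decoding
```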
Examples

Figure 5. A selection of evaluation results, grouped by human rating.


Key things
‣ Markov models for sequences
  - how to formulate and estimate them, and how to sample sequences from them
‣ RNNs for generating (decoding) sequences
  - relation to Markov models
  - evolving hidden state
  - how to sample from them
‣ Decoding vectors into sequences
