dynamic objects and x_t ⊆ y_t are the sensor measurements of directly visible scene parts. We use the encoding x_t = {v_t, r_t} where {v_t^i, r_t^i} = {1, y_t^i} if the i-th element of y_t is observed and {0, 0} otherwise. Moreover, as the scene changes, the robot can observe different parts of the scene at different times.

In general, this problem cannot be solved effectively if the process y_t is purely random and unpredictable, as in such a case there is no way to determine the unobserved elements of y_t. In practice, however, the sequences y_t and x_t exhibit structural and temporal regularities which can be exploited for effective prediction. For example, objects tend to follow certain motion patterns and have a certain appearance. This allows the behaviour of the objects to be estimated while they are visible, such that this knowledge can later be used for prediction when they become occluded.

Deep tracking leverages recurrent neural networks to effectively model P(y_t | x_{1:t}). In the remainder of this section we first describe the probabilistic model assumed to underlie the sensing process before detailing how RNNs can be used for tracking.

The Model
Like many approaches to object tracking, our model is inspired by Bayesian filtering (Chen 2003). The joint distribution of the model factorises as

    P(y_{1:N}, x_{1:N}, h_{1:N}) = ∏_{t=1}^{N} P(x_t | y_t) P(y_t | h_t) P(h_t | h_{t-1}),    (1)

where
• P(h_t | h_{t-1}) denotes the hidden state transition probability capturing the dynamics of the world;
• P(y_t | h_t) models the instantaneous unoccluded sensor space;
• P(x_t | y_t) describes the actual sensing process.

The task of estimating P(y_t | x_{1:t}) can now be framed in the context of recursive Bayesian estimation (Bergman 1999) of the belief B_t which, at any point in time t, corresponds to a distribution Bel(h_t) = P(h_t | x_{1:t}). This belief can be computed recursively for the considered model as

    Bel^-(h_t) = ∫_{h_{t-1}} P(h_t | h_{t-1}) Bel(h_{t-1}),    (2)
    Bel(h_t) ∝ ∫_{y_t} P(x_t | y_t) P(y_t | h_t) Bel^-(h_t),    (3)

where Bel^-(h_t) denotes the belief predicted one time step into the future and Bel(h_t) denotes the corrected belief after the latest measurement has become available. P(y_t | x_{1:t}) can in turn be expressed simply using this belief:

    P(y_t | x_{1:t}) = ∫_{h_t} P(y_t | h_t) Bel(h_t).    (4)

1 Video is available at: https://youtu.be/cdeWCpfUGWc
2 The source code of our experiments is available at: http://mrg.robots.ox.ac.uk/mrg people/peter-ondruska/
Presented at AAAI-16 conference, February 12-17, 2016, Phoenix, Arizona USA
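The recursion in Eqs. (2)-(4) can be made concrete with a minimal sketch. The three-state transition matrix and emission probabilities below are invented purely for illustration; in deep tracking these quantities are never specified by hand but are encoded implicitly in the learned network weights:

```python
import numpy as np

# Toy instantiation of Eqs. (2)-(4): three hidden states h, a binary scene
# variable y, and an observation x that either reveals y (pixel visible) or
# carries no evidence (pixel occluded). All numbers are illustrative
# assumptions, not quantities from the paper.
T = np.array([[0.8, 0.2, 0.0],    # P(h_t | h_{t-1}); row = h_{t-1}, col = h_t
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]])
Py_h = np.array([0.1, 0.5, 0.9])  # P(y_t = 1 | h_t)

def filter_step(bel, visible, value):
    """One predict/correct cycle of the recursive Bayesian filter."""
    bel_pred = T.T @ bel                      # Eq. (2): Bel^-(h_t)
    if visible:                               # Eq. (3): weight by the evidence
        like = Py_h if value == 1 else 1.0 - Py_h
        bel = like * bel_pred
        bel /= bel.sum()
    else:                                     # occlusion: no correction possible
        bel = bel_pred
    p_y1 = float(Py_h @ bel)                  # Eq. (4): P(y_t = 1 | x_{1:t})
    return bel, p_y1

bel = np.ones(3) / 3                          # uniform initial belief
for visible, value in [(True, 1), (True, 1), (False, 0)]:
    bel, p_y1 = filter_step(bel, visible, value)
# after two positive observations and one occluded step, p_y1 stays above 0.5
```

Even in this toy, the belief carries information through the occluded step, which is exactly the behaviour the network is trained to reproduce at scale.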
[Figure: the trained network maps the stream of raw sensor inputs to filtered outputs y_1, y_2, y_3, ...]
[Figure 4: the network architecture, with an Encoder (7×7 kernel), a Belief tracker computing F(B_{t-1}, x_t) (5×5 kernels) over the 16-channel belief, and a Decoder computing P(y_t | B_t).]

Training
The network can be trained both in a supervised mode, i.e. with both y_{1:N} and x_{1:N} known, and in an unsupervised mode where only x_{1:N} is known. A common method of su- […]work overfitting. Here, we drop out the observations both spatially across the entire scene and temporally across multiple time steps in order to prevent the network from overfitting to the task of simply copying over the observed elements. This forces the network to learn the correct patterns.

An interesting observation demonstrated in the next sec- […]
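The observation-dropout scheme above can be sketched as follows. The (T, 2, H, W) tensor layout and the dropout rates are assumptions for illustration; dropping an element means writing the "unobserved" code {0, 0} into both channels:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_observations(x, p_space=0.2, p_time=0.1):
    """Drop observations from a (T, 2, H, W) stream of {v_t, r_t} images,
    both per pixel (spatial) and per whole frame (temporal). A dropped
    element is reset to the 'unobserved' encoding {0, 0}."""
    x = x.copy()
    n_frames, _, h, w = x.shape
    space_mask = rng.random((n_frames, h, w)) < p_space   # per-pixel dropout
    time_mask = rng.random(n_frames) < p_time             # whole-frame dropout
    x[space_mask[:, None, :, :].repeat(2, axis=1)] = 0.0  # zero both channels
    x[time_mask] = 0.0
    return x

x = np.ones((5, 2, 4, 4))            # 5 frames in which every pixel is observed
x_dropped = dropout_observations(x)
```

The network is then trained on the thinned stream, so it cannot solve the task by copying observed elements forward and must instead learn the underlying patterns.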
scene y_t around the robot using a 2D grid of 50×50 pixels with the robot placed towards the bottom centre. To model sensor observations x_t we ray-traced the visibility of each pixel and assigned values x_t^i = {v_t^i, r_t^i} indicating pixel visibility (by the robot) and the presence of an obstacle if the pixel is visible. An example of the input y_t and the corresponding observation x_t is depicted in Figure 3. Note that, as in the case of a real sensor, only the surface of an object as seen from a particular viewpoint is ever observed. This input was then fed to the network as a stream of 50×50 2-channel binary images for filtering.

Neural Network
To jointly model F(B_{t-1}, x_t) and P(y_t | B_t) we used a small feed-forward recurrent network as illustrated in Figure 4. This architecture has four layers and uses convolutional operations followed by a sigmoid nonlinearity as the basic information-processing step at every layer. The network has in total 11k parameters, and its hyperparameters, such as the number of channels in each layer and the size of the kernels, were set by cross-validation. The belief state B_t is represented by the 3rd layer, of size 50×50×16, which is kept between time steps.

The mapping F(B_{t-1}, x_t) is composed of a stage of input preprocessing (the Encoder) followed by a stage of hidden-state propagation (the Belief tracker). The aim of the Encoder is to analyse the sensor measurements, to perform operations such as detecting objects directly visible to the […]

[…] with the logarithm corresponding to the binary cross-entropy. One pass through the network takes 10 ms on a standard laptop, making the method suitable for real-time data filtering.

Training
We generated 10,000 sequences of length 200 time steps and trained the network for a total of 50,000 iterations using stochastic gradient descent with learning rate 0.9. The initial belief was modelled as a free parameter, jointly optimised with the network weights W_F, W_P.

Both supervised (i.e. when the ground truth of y_t is known) and unsupervised training were attempted, with almost identical results. Figure 5 illustrates the unsupervised training progress. Initially, the network produces random output; then it gradually learns to predict the correct shape of the visible objects; and finally it learns to track their position even through complete occlusion. As illustrated in the attached video, the fully trained network is able to confidently and correctly start tracking an object immediately after it has seen even only a part of it, and is then able to keep predicting its position even through long periods of occlusion.

Finally, Figure 6 shows example activations of layers from different parts of the network. It can be seen that the Encoder module learned to produce, in one of its layers, a spike at the centre of visible objects. Particularly interesting is the learned structure of the belief state. Here, the network learned to represent hypotheses of objects having different
[Figure 5 graphic: training cost (log scale, from roughly 1000 down to 100) plotted against iteration (10,000 to 50,000), and filtering outputs at t=1 to t=4 with pixels labelled true positive, false positive, false negative, true negative (visible) and true negative (occluded).]
Figure 5: The unsupervised training progress. From left to right, the network first learns to complete objects and then learns to track them through occlusions. Some situations cannot be predicted correctly, such as the occluded object in the fourth frame that was never seen before. Note that the ground truth shown for comparison was used only for evaluation, not during training.
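For concreteness, the four-layer convolutional architecture described in the Neural Network section can be sketched roughly as below. This is an illustrative guess, not the authors' code: the 2-channel input, sigmoid units and 50×50×16 belief are stated in the text, while the kernel sizes and channel counts are inferred from Figure 4 and chosen so that the parameter count lands near the stated 11k:

```python
import torch
import torch.nn as nn

class DeepTrackerSketch(nn.Module):
    """Hypothetical reconstruction of the described network:
    Encoder -> Belief tracker (recurrent, 50x50x16 belief) -> Decoder."""

    def __init__(self):
        super().__init__()
        # Encoder: analyse the 2-channel observation (visibility, obstacle)
        self.encoder = nn.Sequential(nn.Conv2d(2, 8, 7, padding=3), nn.Sigmoid())
        # Belief tracker: merge encoding with previous belief, F(B_{t-1}, x_t)
        self.tracker = nn.Sequential(nn.Conv2d(8 + 16, 16, 5, padding=2), nn.Sigmoid())
        # Decoder: per-pixel occupancy probability P(y_t | B_t)
        self.decoder = nn.Sequential(nn.Conv2d(16, 1, 5, padding=2), nn.Sigmoid())

    def forward(self, x, belief):
        enc = self.encoder(x)
        belief = self.tracker(torch.cat([enc, belief], dim=1))
        return self.decoder(belief), belief

model = DeepTrackerSketch()
belief = torch.zeros(1, 16, 50, 50)   # initial belief (a free parameter in the paper)
x = torch.zeros(1, 2, 50, 50)         # one 50x50 2-channel observation frame
y_prob, belief = model(x, belief)     # P(y_t | B_t) and the updated belief
```

Unrolling this module over a sequence and training it with the binary cross-entropy loss mentioned above would complete the sketch; the belief tensor returned at each step is what carries information through occlusions.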
motion patterns with different patterns of unit activations, and to track each pattern from frame to frame. None of this behaviour is hard-coded; it is purely a result of the network adapting to the task.

Additionally, we performed experiments changing the shape of the objects from circles to squares and also simulating sensor noise on 1% of all pixel observations³. In all cases the network correctly learned and predicted the world state, suggesting that the approach and the learning procedure are able to perform well in a variety of scenarios. We expect, however, that more complex cases will require networks with more intermediate layers, or with different units such as LSTM (Hochreiter and Schmidhuber 1997).

³ See the attached video.

Related Works
Our work concerns modelling partially-observable stochastic processes, with a particular focus on the application of tracking objects. This problem is commonly solved by Bayesian filtering, which gave rise to tracking methods for a variety of domains (Yilmaz, Javed, and Shah 2006). Commonly, in these approaches the state representation is designed by hand, and the prediction and correction operations are made tractable under a number of assumptions on the model distributions or via sampling-based methods. The Kalman filter (Kalman 1960), for example, assumes a multivariate normal distribution to represent the belief over the latent state, leading to the well-known prediction/update equations but limiting its expressive power. In contrast, the particle filter (Thrun, Burgard, and Fox 2005) foregoes any assumptions on the belief distribution and employs a sample-based approach. In addition, where multiple objects are to be tracked, correct data association is crucial for the tracking process to succeed.

In contrast to such hand-designed pipelines, we provide a novel end-to-end trainable solution that automatically learns an appropriate belief-state representation as
[Figure 6 graphic: example activation maps, one row each for the Input, Encoder, Belief tracker and Decoder.]
Figure 6: Examples of the filter activations from different parts of the network (one filter per layer). The Encoder adapts to produce a spike at the position of directly visible objects. Of particular interest is the structure of the belief B_t: as highlighted by the circles in the 2nd row, it adapted to assign different activation patterns to objects having different motion patterns (best viewed in colour).
well as the corresponding predict and update operations for an environment containing multiple objects with different appearance and potentially very complex behaviour. While we are not the first to leverage the expressive power of neural networks in the context of tracking, prior art in this domain has primarily concerned only the detection part of the tracking pipeline (Fan et al. 2010; Jänen et al. 2010).

A full neural-network pipeline requires the ability to learn and simulate the dynamics of the underlying system state in the absence of measurements, such as when a tracked object becomes occluded. Modelling complex dynamic scenes, however, requires modelling distributions over exponentially large state representations such as real-valued vectors with thousands of elements. A pivotal work on modelling high-dimensional sequences was based on Temporal Restricted Boltzmann Machines (Sutskever and Hinton 2007; Sutskever, Hinton, and Taylor 2009; Mittelman et al. 2014). This generative model makes it possible to model the joint distribution P(y, x). The downside of using an RBM, however, is the need for sampling, which makes inference and learning computationally expensive. In our work, we assume an underlying generative model but, in the spirit of Conditional Random Fields (Lafferty, McCallum, and Pereira 2001), we directly model only P(y|x). Instead of an RBM, our architecture features feed-forward recurrent neural networks (Medsker and Jain 2001; Graves 2013), making the inference and weight-gradient computations exact using a standard back-propagation learning procedure (Rumelhart, Hinton, and Williams 1988). Moreover, unlike undirected models, a feed-forward network allows the freedom to straightforwardly apply a wide variety of network architectures, such as fully-connected layers and LSTM (Hochreiter and Schmidhuber 1997), to the network design. Our architecture is similar to the recent Encoder-Recurrent-Decoder (Fragkiadaki et al. 2015) and, similarly to (Shi et al. 2015), features convolutions for spatial processing.

Conclusions
In this work we presented Deep Tracking, an end-to-end approach using recurrent neural networks to map directly from raw sensor data to an interpretable yet hidden sensor space, and employed it to predict the unoccluded state of the entire scene in a simulated 2D sensing application. The method avoids any hand-crafting of plant or sensor models and instead learns the corresponding models directly from raw, occluded sensor data. The approach was demonstrated on a synthetic dataset, where it achieved highly faithful reconstructions of the underlying world model.

As future work, our aim is to evaluate Deep Tracking on real data gathered by robots in a variety of situations, such as pedestrianised areas or in the context of autonomous driving in the presence of other traffic participants. In both situations, knowledge of the likely unoccluded scene is a pivotal requirement for robust robot decision making. We further intend to extend our approach to different modalities such as 3D point-cloud data and depth cameras.
References
[Bengio 2009] Bengio, Y. 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning 2(1):1–127.
[Bergman 1999] Bergman, N. 1999. Recursive Bayesian estimation. Department of Electrical Engineering, Linköping University, Linköping Studies in Science and Technology. Doctoral dissertation 579.
[Chen 2003] Chen, Z. 2003. Bayesian filtering: From Kalman filters to particle filters, and beyond. Statistics 182(1):1–69.
[Fan et al. 2010] Fan, J.; Xu, W.; Wu, Y.; and Gong, Y. 2010. Human tracking using convolutional neural networks. Neural Networks, IEEE Transactions on 21(10):1610–1623.
[Fragkiadaki et al. 2015] Fragkiadaki, K.; Levine, S.; Felsen, P.; and Malik, J. 2015. Recurrent network models for human dynamics. ICCV.
[Graves 2013] Graves, A. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.
[Hochreiter and Schmidhuber 1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.
[Jänen et al. 2010] Jänen, U.; Paul, C.; Wittke, M.; and Hähner, J. 2010. Multi-object tracking using feed-forward neural networks. In Soft Computing and Pattern Recognition (SoCPaR), 2010 International Conference of, 176–181. IEEE.
[Kalman 1960] Kalman, R. E. 1960. A new approach to linear filtering and prediction problems. Journal of Fluids Engineering 82(1):35–45.
[Krizhevsky, Sutskever, and Hinton 2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 1097–1105.
[Lafferty, McCallum, and Pereira 2001] Lafferty, J.; McCallum, A.; and Pereira, F. C. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data.
[Medsker and Jain 2001] Medsker, L., and Jain, L. 2001. Recurrent neural networks. Design and Applications.
[Mittelman et al. 2014] Mittelman, R.; Kuipers, B.; Savarese, S.; and Lee, H. 2014. Structured recurrent temporal restricted Boltzmann machines. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), 1647–1655.
[Rabiner 1989] Rabiner, L. R. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2):257–286.
[Rumelhart, Hinton, and Williams 1985] Rumelhart, D. E.; Hinton, G. E.; and Williams, R. J. 1985. Learning internal representations by error propagation. Technical report, DTIC Document.
[Rumelhart, Hinton, and Williams 1988] Rumelhart, D. E.; Hinton, G. E.; and Williams, R. J. 1988. Learning representations by back-propagating errors. Cognitive Modeling 5:3.
[Shi et al. 2015] Shi, X.; Chen, Z.; Wang, H.; Yeung, D.-Y.; Wong, W.-K.; and Woo, W.-c. 2015. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. arXiv preprint arXiv:1506.04214.
[Srivastava et al. 2014] Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1):1929–1958.
[Sutskever and Hinton 2007] Sutskever, I., and Hinton, G. E. 2007. Learning multilevel distributed representations for high-dimensional sequences. In International Conference on Artificial Intelligence and Statistics, 548–555.
[Sutskever, Hinton, and Taylor 2009] Sutskever, I.; Hinton, G. E.; and Taylor, G. W. 2009. The recurrent temporal restricted Boltzmann machine. In Advances in Neural Information Processing Systems, 1601–1608.
[Szegedy, Toshev, and Erhan 2013] Szegedy, C.; Toshev, A.; and Erhan, D. 2013. Deep neural networks for object detection. In Advances in Neural Information Processing Systems, 2553–2561.
[Thrun, Burgard, and Fox 2005] Thrun, S.; Burgard, W.; and Fox, D. 2005. Probabilistic robotics. MIT Press.
[Yilmaz, Javed, and Shah 2006] Yilmaz, A.; Javed, O.; and Shah, M. 2006. Object tracking: A survey. ACM Computing Surveys (CSUR) 38(4):13.