Aristeidis Tsitiridis, Cristina Conde, Isaac Martin de Diego, Jose Sanchez del Rio Saez, Jorge Raul Gomez, and
Enrique Cabello
Department of Computer Science and Statistics
King Juan Carlos University
Madrid, Spain
Abstract- In recent years, there has been a growing interest in dynamic vision sensors due to their incredible advantages in speed, computational cost and power consumption. These new vision sensors have been inspired from biological retinae and use asynchronous address-event representation for visual information instead of a series of snapshots taken from traditional frame-based devices. Spiking neurons are biologically-plausible artificial neurons that process information in sequences of time events and are particularly suited for processing address-event information. A novel and refined biologically-inspired Gabor feature approach based on spiking neural networks is presented here. This approach utilises the retina-inspired data from dynamic vision sensors with Gabor edge detection in a hierarchical structure that has been populated with Leaky-Integrate and Fire neurons that have been trained via the Remote Supervision Method. The number of active spiking neurons at each time instance depends on the number of time events. This idea provides a flexible approach that avoids unnecessary computations and complexity. The biologically-inspired model developed for this preliminary work has shown promising results and has laid the foundation for a rapid parallel object recognition model designed for the new retina-like address-event representation sensors.

Keywords-Biologically-inspired machine vision; Gabor edge detection; Spiking neural networks; Hierarchical vision model; Address-Event Representation (AER)

I. INTRODUCTION
An investigation of biological visual perception properties in primates can quickly reveal certain irrefutable advantages. It can be fast and efficient under various conditions, for example, humans can recognise faces in different poses in as little as 140 ms [1]. In addition, these cognitive operations consume relatively small amounts of energy and portray adaptation to a wide spectrum of real world situations. Remarkably, all these visual cognitive operations occur in parallel with other senses in a seamless manner. Therefore, it is not surprising that researchers across various fields, over the last years, have intensified their focus on harnessing biologically-inspired techniques for their applications.

Dynamic vision sensors (DVS) [2] [3] mimic certain biological traits of the retinae. The continuous data output of these sensors is an asynchronous stream of reflectance variations encoded as address-events. The Address-Event Representation (AER) [4], i.e. reflectance events asynchronously encoded in x, y pixel coordinates resembles the precisely timed electrical impulses or spikes of the spatially arranged optical nerves stemming from the retinae to the primary visual cortex.

The most prevalent method for constructing biologically inspired vision models requires alternating hierarchical layers following the early visual processing stages [5]. Neurons at higher layers progressively exhibit a combination of selectivity and invariance to object translations such as size, position, rotation, depth etc. In the recent past, there had been many models and variants that employed this kind of architecture such as the Neocognitron [6], Convolutional neural network [7], and Hierarchical model and X (HMAX) [8], and all of them produced promising results for a variety of object recognition tasks. However, these models had been solely applied for frame-driven visual scenarios which considerably increased their computational cost especially as the complexity of a given situation rose. The high customisation for images or frames from videos had made their integration with temporal features and consequently video in real-time, exceptionally difficult and eventually implausible with biology.

Soon after the emergence of AER retina sensors, several models were introduced to harness their approach to visual representation. A convolutional neural network was directly applied in [9] for object recognition with minor alterations. Although this work reported a frame-based architecture that was applied directly to retina sensors, its fast recognition response and rate offer a promising insight to methodologies that share similar hierarchical characteristics. Edge detection was performed by applying Gabor filter templates directly on AER images. Furthermore, this model was extended to a neuromorphic engineering application with an event-driven convolution module [10] and combined with AER sensors was employed for high-speed recognition examples. The spiking neuron model presented in [11], was a biologically plausible approach that captured temporal visual information and learned features in an unsupervised manner based on the Spike-Timing Dependent Plasticity (STDP) [12]. This hierarchical Spiking Neural Network (SNN) performed,
$$w_{ki}(t) = S^{d}(t)\left[a^{d} + \int W^{d}(s^{d})\,S^{in}(t - s^{d})\,ds^{d}\right]$$

$$W^{d}(s^{d}) = \begin{cases} +A^{d}\exp\left(-s^{d}/\tau^{d}\right) & \text{if } s^{d} > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$

$$W^{l}(s^{l}) = \begin{cases} -A^{l}\exp\left(-s^{l}/\tau^{l}\right) & \text{if } s^{l} > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$

In equations above, $W^{d}$ and $W^{l}$ are the exponential windows for the desired and learning signals. $A^{d}$ and $A^{l}$ are constants and can be positive for excitatory synapses and negative for inhibitory. $\tau^{d}$ and $\tau^{l}$ are positive time constants.

C. V1 Cells - Gabor edge detection
V1 simple cells in the primary visual cortex process incoming visual data [5] from the retinae and perform edge detection operations for subsequent layers of the visual cortex. Gabor filters have been found to match the response of V1 cells to oriented bars or gratings and, as such, have been used to model these cells [18]-[20]. Their properties are essentially
[21], [22] and can adapt and reorganise [23] during the lifetime of the visual cortex. This principle is adopted here and Gabor filter parameters in addition to their sizes are encoded with ReSuMe as discussed in the previous section.

Fig. 1. An example of a 7x7 Gabor filter at the 8 different orientations used by the model.

in the visual cortex has been reported in many past findings [25]-[27]. This topology has also been proven efficient in its adaptation to many applications in object and face recognition [7], [8]. The main objective of such a topology is the progressive creation of a view-invariant representation of objects with some important invariance properties being size, position, rotation and illumination. Similarly, the future goal of the model is to obtain enhanced object invariance properties. Hence, this view-invariant approach (Fig. 2) is also partly followed here.

Fig. 2. Algorithm diagram (input layer 128x128, S1 layer, C1 layer).
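As a rough illustration of the learning windows in equations (4) and (5), the sketch below evaluates both exponential windows and accumulates the weight change of a single synapse from lists of spike times. The function names, the amplitude and time-constant defaults, the learning rate and the omission of the constant terms are assumptions made for illustration only; none of them are taken from this paper.

```python
import math

def window_desired(s, a_d=1.0, tau_d=5.0):
    """W^d(s): +A^d * exp(-s / tau^d) for s > 0, otherwise 0 (equation 4)."""
    return a_d * math.exp(-s / tau_d) if s > 0 else 0.0

def window_learning(s, a_l=1.0, tau_l=5.0):
    """W^l(s): -A^l * exp(-s / tau^l) for s > 0, otherwise 0 (equation 5)."""
    return -a_l * math.exp(-s / tau_l) if s > 0 else 0.0

def weight_change(input_spikes, desired_spikes, output_spikes, rate=0.01):
    """Sum the window contributions for one synapse (hypothetical helper).

    Each desired spike strengthens the synapse for every input spike that
    preceded it; each actual output spike weakens it in the same way.
    All times are in the same unit as the time constants (assumed ms).
    """
    dw = 0.0
    for t_d in desired_spikes:
        dw += sum(window_desired(t_d - t_in) for t_in in input_spikes)
    for t_o in output_spikes:
        dw += sum(window_learning(t_o - t_in) for t_in in input_spikes)
    return rate * dw
```

With equal amplitudes and time constants, the two contributions cancel when the output spikes coincide with the desired spikes, which is the fixed point the learning rule is meant to reach.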
Lateral inhibition is a well-studied phenomenon of biological vision and is implemented in this work for both S1 and C1 layers. Lateral inhibition is a mechanism which promotes the activity of maximally firing spikes by reducing the activity of their neighbours. In some respects this activity can be viewed as a biological threshold technique when excessive noise is present and as such is treated here, i.e. a noise filter.

2) S1 to C1 layer
In the next layer, referred to as C1, complex cells receive local edge information from S1 layer neurons and pool their responses across all different RF sizes and orientations. The pooling operation is achieved via spatial summation [28]-[30]. Spatial summation, i.e. the integration of spike events from various retinal RF, is another established biological function of vision that explains how edge features propagate in higher layers of the visual cortex. Spatial summation is directly applied over each RF and for all orientations:
(9)

LIF neurons are pre-trained using ReSuMe with the Gabor parameters (as explained in section II.C and D) and should respond maximally to the specific orientations of edges they are tuned for. More specifically, Gabor filter amplitude values are scaled and encoded to spike trains in the time domain.

As shown in Fig. 4, each of the input neurons fires at precisely timed instances forming temporal patterns of Gabor filters. Higher vector values are translated as delayed responses in the time domain and vice versa, lower values to faster responses. Since these responses are scaled, the exact firing time of each of the input neurons directly depends on the chosen time period. All these responses are applied to ReSuMe in order to train multiple LIF neurons that after training will behave as Gabor filters.
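The encoding of Gabor filter amplitudes into spike times can be sketched as follows. The Gabor expression used here is the common cosine-modulated Gaussian form, and the aspect ratio `gamma`, the 100 ms period and both function names are assumptions for illustration; only the 7x7 size and its sigma and lambda values come from Table 1. The sketch follows the rule that higher values map to later spikes and lower values to earlier ones.

```python
import math

def gabor_kernel(size=7, sigma=2.8, lam=3.5, theta_deg=45.0, gamma=0.3):
    """Return a size x size Gabor kernel as a list of rows (assumed form)."""
    theta = math.radians(theta_deg)
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate the coordinates to the filter orientation.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            row.append(envelope * math.cos(2 * math.pi * xr / lam))
        kernel.append(row)
    return kernel

def latency_encode(kernel, period_ms=100.0):
    """Scale amplitudes to [0, period_ms]; higher values fire later."""
    flat = [v for row in kernel for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in flat]
    return [period_ms * (v - lo) / span for v in flat]

# One spike time per input neuron: 49 input neurons for a 7x7 RF.
spike_times = latency_encode(gabor_kernel())
```

Because the times are scaled to the chosen period, changing `period_ms` stretches or compresses the whole temporal pattern without changing the order in which the input neurons fire.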
Fig. 5. An example of a Gabor neuron training process of 300 epochs for a 7x7 RF size at 45°. a) The lower row of spike events (green dots) is
generated by membrane potential spikes which vary in each epoch and readjust according to a Poisson distribution (red dots). b) Synaptic weight
values of excitatory (positive) and inhibitory (negative) synapses. c) Correlation values for 300 training epochs
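The correlation values of Fig. 5c compare the spike train a neuron produces with the spike train it is meant to produce. A minimal sketch of such a score is given below, assuming the spike trains are binned into 1 ms slots and compared with a normalised dot product at zero lag; the binning, the normalisation and the function names are illustrative assumptions rather than the measure used in this work.

```python
import math

def bin_spikes(spike_times_ms, duration_ms=100.0, bin_ms=1.0):
    """Convert spike times into a binary vector with one slot per bin."""
    n_bins = int(duration_ms / bin_ms)
    train = [0] * n_bins
    for t in spike_times_ms:
        if t < 0:
            continue
        idx = int(t / bin_ms)
        if idx < n_bins:
            train[idx] = 1
    return train

def correlation_score(output_times, desired_times, duration_ms=100.0, bin_ms=1.0):
    """Normalised zero-lag correlation of two binned spike trains, in [0, 1]."""
    a = bin_spikes(output_times, duration_ms, bin_ms)
    b = bin_spikes(desired_times, duration_ms, bin_ms)
    dot = sum(x * y for x, y in zip(a, b))
    # For binary vectors the sum of squares equals the plain sum.
    norm = math.sqrt(sum(a) * sum(b))
    return dot / norm if norm else 0.0

def best_epoch(epoch_outputs, desired_times):
    """Return (epoch index, score) of the epoch with the highest score."""
    scores = [correlation_score(out, desired_times) for out in epoch_outputs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

Keeping the epoch with the highest score mirrors the idea of storing the best spikes and weights seen during training instead of the final ones.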
The number of input neuron events processed by each Gabor neuron depends on the size of the respective RF, e.g. for a small RF size of 3x3, the total number of input neurons is 9. LIF neurons are trained for 8 different orientations and for 4 different RF sizes (Table 1). This means that in this work the total number of pre-trained types of Gabor-like neurons is 32.

Table 1. RF sizes and their respective σ and λ values.

RF size   σ     λ
3x3       1.4   2
5x5       2     2.8
7x7       2.8   3.5
9x9       3.6   4.6

ReSuMe is an efficient learning rule and requires a relatively small number of epochs before neurons have reached their highest possible training score. The training score is calculated as the highest cross-correlation number between membrane potential spikes and desired output spikes. Training scores are most successful when they approach an absolute value of 1. In practice, reaching an absolute score is not important as discussed further below.

The output spike pattern is set according to a random predefined Poisson distribution (Fig. 5a). In past literature, the Poisson-like distributions are known to exist in the primary visual cortex and have often been examined in bio-inspired systems [32], [33]. Positive synaptic weight values signify excitatory synapses and conversely, negative values indicate inhibitory synapses (Fig. 5b) and with every training epoch, synaptic weight values rearrange according to the equations presented in the previous section to match the desired spike train response $S^{d}$ (equation 3). As shown in Fig. 5c, the correlation value between the desired train response $S^{d}$ and output spike train $S^{l}$ reaches a maximum after a relatively small number of training epochs. Membrane spikes are monitored throughout the training process and spikes with the highest correlation score achieved from all epochs are stored along with their respective synaptic weights.

The Poisson-like membrane spikes and synaptic weights are the necessary parameters to define Gabor simple neurons (or simple cells) tuned at that particular orientation and RF. In the S1 layer, all correlation measurement values higher than a pre-set threshold value are neglected. If an edge exists within a certain RF area of S1 cells then their membrane spikes emit patterns similar to the pre-trained spike trains. Therefore, the cross-correlation difference between these S1 unit responses and the pre-trained responses can be measured. These measurement values fluctuate from 0 to 1, and like weights, they are multiplied with the incoming signals to indicate how strong the presence of a particular edge is.

IV. RESULTS
The SNN model is first tested against ideal shapes of 0 and 1 values without the presence of noise. Fig. 6 shows a simple example of a solid fill circle processed by the model with various RF sizes in the S1 layer. For the given circle the 3x3 RF size produces the closest circular shape to the original.
Fig. 6. From left to right, the leftmost input circle image is processed with various RF sizes of Gabor neurons, at 3x3, 5x5, 7x7 and 9x9 producing the respective S1 layer results.
Moreover, it is noticeable that as the RF size increases, the quality of edge detection decreases which is caused by progressively larger RF sizes overlapping on a constant small area over different angles. Also by incorporating more information from the homogeneous space of the original circle with larger edge detection windows, thicker overlapping edges create artefacts. This is evident from edge information that has advanced further inside the circle, creating a thick uneven outline.

Further tests were conducted with other 'ideal' shapes without the presence of noise (Fig. 7). In simpler examples such as the triangle and square, edge integrity appears slightly better. However, as the number of orientations increases such as in the pentagon and star examples, corners progressively exhibit a thicker outline. Regardless of some of these tuning difficulties with RF sizes and parameterisation, all images prove the model's ability to extract spatial features from Gabor-like spiking neurons.

AER videos pose a challenging task compared to the perfect shapes that have been presented so far. The experiments centred on objects in motion without any obstructions or clutter. The video data of this work were captured under natural light conditions and contained some reflectance noise from background objects and surfaces. Contrast polarity changes in AER data vary between three event states -1, 0 and 1. The -1 state indicates that the AER sensor pixel has detected illumination reductions and vice versa for 1. Naturally, zero events indicate that no changes have been detected. Since this work only focuses on the extraction and processing of spatial features, -1 occurrences are treated simply as 1. This effectively neglects directional motion and concentrates on the edges being detected by the AER sensor. In real-world situations object edges and surfaces are not uniformly illuminated, partly due to naturally occurring shadows or the direction of the light source. Therefore, objects appear significantly distorted in AER data by salt and pepper-like noise or inhomogeneous edges. The jAER software package provides de-noise filter options but such filtering has been noticed to further distort edges and can be incompatible with the biological-like standards that were set for the model here.

Fig. 8 shows some examples from original AER video data and their C1 layer images. Noise reduction from lateral inhibition is noticeable. There are some minor improvements in the thickness of the edges and the space between them but it is apparent that the integrity of morphological information cannot be further enhanced.

S1 layer errors are mostly compensated by inhibition and spatial summation in the C1 layer. In practice, using the optimal RF sizes and spatial frequencies for S1 unit tuning on the numerous objects that can be found in the real world, is a subject of rigorous analysis [20], [23] which has been planned for this model in the near future. Furthermore, sensor related errors are expected to be improved drastically with higher spatial resolution DVS, less sensitive to noise, that are planned for production.

It is important to mention that contrary to frame-based approaches, AER-based data are processed in time windows as they occur. These time windows can be set by the user manually to as little as 1 µs. The chosen temporal resolution was set empirically to capture the object's particular speed of motion. With faster moving objects this setting would require smaller values and with variable speed monitoring, a mechanism which adjusts accordingly.

The speed of AER image processing was not the main focus of these preliminary experiments. However, given the nature of AER data, the model was exceptionally efficient in processing only meaningful information as it occurred. The model performs its operations only in sections where there are time events. This is an additional advantage over frame-based techniques which need to scan, often aimlessly, the entire image for important or meaningful visual information. Consequently, CPU load and stored data for all the experiments were minimal.
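The handling of AER events in time windows, with -1 events treated as 1, can be sketched as follows. The event tuple layout `(t_us, x, y, polarity)`, the function name and the sparse dictionary output are assumptions chosen for illustration and do not reflect the jAER data format; the 128x128 default matches the input layer size of the model.

```python
def events_to_windows(events, window_us, width=128, height=128):
    """Group AER events into time windows, ignoring polarity sign.

    events: iterable of (t_us, x, y, polarity), polarity in {-1, 0, 1}.
    Returns a dict mapping window index -> set of active (x, y) pixels.
    Windows without events never appear, so no work is spent on them.
    """
    windows = {}
    for t_us, x, y, polarity in events:
        if polarity == 0:
            continue  # zero means no change was detected
        if not (0 <= x < width and 0 <= y < height):
            continue  # discard addresses outside the sensor array
        index = int(t_us // window_us)
        # Both -1 and 1 mark the pixel as active (directional motion is dropped).
        windows.setdefault(index, set()).add((x, y))
    return windows

# Example: four events grouped into 10 microsecond windows.
example = events_to_windows(
    [(0, 1, 1, 1), (5, 1, 1, -1), (12, 2, 3, -1), (40, 0, 0, 0)],
    window_us=10,
)
```

Storing only the active pixels of non-empty windows reflects the point made above: computation and storage scale with the number of events, not with the size of the image.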
Fig. 7. C1 layer examples with simple shapes. Top row shows the original shapes and bottom row the processed results at the C1 layer.

[Fig. 8 panel titles: Retina Events 1716, Time window 162; Retina Events 368, Time window 108; Retina Events 907, Time window 127]

Fig. 8. Results from AER videos. Top row shows the original AER events in MATLAB of a triangle, a pool table 8-ball and a hand. Bottom row shows the processed C1 layer images from the model.
Naturally, in the near future with more advanced retina-like sensors of higher spatiotemporal resolutions that can additionally process spectral information, the amount of data and their processing load is expected to increase.

V. CONCLUSIONS
A biologically-plausible model for Gabor feature extraction using spiking neural networks is described in this paper. Its methodology relies on LIF neurons that have been pre-trained with the Remote Supervision Method. Subsequently, these neurons form a Gabor edge detection layer and progressively the model processes these features with alternating layers in a temporal manner. The number of LIF neurons being created to handle incoming visual information depends on the events being captured at a given time window. This flexible approach closely simulates the overall structure of the mammalian brain and exhibits adaptation to AER data. Furthermore, the model with the proposed methodology avoids the unnecessary complexity that would otherwise involve thousands of additional neurons and their respective synapses with the extra data storage required for unwanted visual information.

The model is examined with noiseless images containing simple shapes and then applied on retina-like data from an AER sensor. The preliminary investigation produced satisfactory results in the absence of noise and promising results for the actual retina-like data. More specifically with AER data, the model sufficiently identifies the edges of objects. However, edges that are separated, dotted or broken proved to be a difficult task that necessitates more advanced AER sensors or techniques. With the current version of the model, there is no provision for techniques that introduce or simulate any of the Gestalt principles found in higher cortical areas [36] and therefore these edges cannot be accurately detected or joined.

The work presented in this paper has contributed the following for the first time: a) Gabor filters are directly encoded with SNN in the time domain, b) a biologically inspired learning technique is used to teach neurons as V1 cells for AER processing and c) a hierarchical and biologically-inspired model with increased biological plausibility is applied on retina-like data. Furthermore, the model has been successful in establishing the foundation upon which future enhancements and modifications will rely for an advanced low power, low cost model which will perform rapid parallel object recognition in a biologically inspired manner, utilising SNNs together with AER sensors.

In the near future, this approach is going to be enriched with more advanced DVS of higher spatiotemporal resolutions and additional spectral information that will either be introduced by an upgraded DVS or a frame-based device in a more elaborate schema. Moreover, by incorporating extra layers in the hierarchy, more complex classification problems and practical pattern recognition scenarios will be investigated for specific AER applications in video surveillance and navigation. Particular attention will be given to future security projects and applications involving the use of fast response recognition systems.

ACKNOWLEDGEMENT
The authors would like to thank the Robotics and Computer Technology group in the University of Seville and the Institute of Microelectronics in Seville, CSIC. Finally, the authors thank the ABC4EU project for funding this work.
REFERENCES
[1] S. Yamamoto and K. Kashikura, "Speed of face recognition in humans: an event-related potentials study," 1999.
[2] P. Lichtsteiner, C. Posch, and T. Delbruck, "A 128x128 120dB 15µs Latency Asynchronous Temporal Contrast Vision Sensor," IEEE J. Solid-State Circuits, vol. 43, pp. 566-576, 2008.
[3] C. Posch, D. Matolin, and R. Wohlgenannt, "A QVGA 143 dB dynamic range frame-free PWM image sensor with lossless pixel-level video compression and time-domain CDS," in IEEE Journal of Solid-State Circuits, 2011, vol. 46, pp. 259-275.
[4] R. Berner, T. Delbruck, A. Civit-Balcells, and A. Linares-Barranco, "A 5 Meps $100 USB2.0 Address-Event Monitor-Sequencer Interface," 2007 IEEE Int. Symp. Circuits Syst., 2007.
[5] D. H. Hubel and T. N. Wiesel, "Receptive fields and functional architecture of monkey striate cortex," J. Physiol., vol. 195, no. 1, pp. 215-243, 1967.
[6] K. Fukushima, "Neocognitron: A self organizing neural network for a mechanism of pattern recognition unaffected by shift in position," Biol. Cybern., vol. 36, no. 4, pp. 193-202, 1980.
[7] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, pp. 2278-2324, 1998.
[8] M. Riesenhuber and T. Poggio, "Hierarchical models of object recognition in cortex," Nat. Neurosci., vol. 2, no. 11, pp. 1019-1025, 1999.
[9] J. A. Perez-Carrasco, C. Serrano, B. Acha, T. Serrano-Gotarredona, and B. Linares-Barranco, "Spike-based convolutional network for real-time processing," in Proceedings - International Conference on Pattern Recognition, 2010, pp. 3085-3088.
[10] L. Camuñas-Mesa, C. Zamarreño-Ramos, A. Linares-Barranco, A. J. Acosta-Jimenez, T. Serrano-Gotarredona, and B. Linares-Barranco, "An event-driven multi-kernel convolution processor module for event-driven vision sensors," IEEE J. Solid-State Circuits, vol. 47, pp. 504-517, 2012.
[11] O. Bichler, D. Querlioz, S. J. Thorpe, J. P. Bourgoin, and C. Gamrat, "Extraction of temporally correlated features from dynamic vision sensors with spike-timing-dependent plasticity," Neural Networks, vol. 32, pp. 339-348, 2012.
[12] H. Markram, J. Lubke, M. Frotscher, and B. Sakmann, "Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs," Science, vol. 275, no. 5297, pp. 213-215, 1997.
[13] B. Zhao, Q. Yu, H. Yu, S. Chen, and H. Tang, "A bio-inspired feedforward system for categorization of AER motion events," in Biomedical Circuits and Systems Conference (BioCAS), 2013, pp. 9-12.
[14] S. Thorpe and J. Gautrais, "Rank order coding," Comput. Neurosci. Trends Res., vol. 13, pp. 113-119, 1998.
[15] F. Ponulak, "ReSuMe-new supervised learning method for Spiking Neural Networks," in International Conference on Machine Learning, ICML, 2005.
[16] W. Gerstner and W. M. Kistler, "Mathematical formulations of Hebbian learning," Biol. Cybern., vol. 87, pp. 404-415, 2002.
[17] R. C. Froemke and Y. Dan, "Spike-timing-dependent synaptic modification induced by natural spike trains," Nature, vol. 416, pp. 433-438, 2002.
[18] S. Marcelja, "Mathematical description of the responses of simple cortical cells," J. Opt. Soc. Am., vol. 70, pp. 1297-1300, 1980.
[19] J. G. Daugman, "Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters," J. Opt. Soc. Am., vol. 2, no. 7, pp. 1160-1169, 1985.
[20] M. A. Webster and R. L. De Valois, "Relationship between spatial frequency and orientation tuning of striate-cortex cells," J. Opt. Soc. Am. A., vol. 2, pp. 1124-1132, 1985.
[21] C. J. McAdams and R. C. Reid, "Attention modulates the responses of simple cells in monkey primary visual cortex," J. Neurosci., vol. 25, pp. 11023-11033, 2005.
[22] N. C. Rust, O. Schwartz, J. A. Movshon, and E. P. Simoncelli, "Spatiotemporal elements of macaque V1 receptive fields," Neuron, vol. 46, pp. 945-956, 2005.
[23] M. P. Sceniak, M. J. Hawken, and R. Shapley, "Contrast-dependent changes in spatial frequency tuning of macaque V1 neurons: effects of a changing receptive field size," J. Neurophysiol., vol. 88, pp. 1363-1373, 2002.
[24] T. Serre and M. Riesenhuber, "Realistic Modeling of Simple and Complex Cell Tuning in the HMAX Model, and Implications for Invariant Object Recognition in Cortex," Methods. p. -017, 2004.
[25] D. Felleman and D. Van Essen, "Distributed hierarchical processing in the primate cerebral cortex," Cereb. Cortex, vol. 1, no. 1, pp. 1-47, 1991.
[26] J. P. Van Kleef, S. L. Cloherty, and M. R. Ibbotson, "Complex cell receptive fields: evidence for a hierarchical mechanism," J. Physiol., vol. 588, no. 18, pp. 3457-3470, 2010.
[27] V. Axelrod and G. Yovel, "Hierarchical Processing of Face Viewpoint in Human Visual Cortex," Journal of Neuroscience, vol. 32, pp. 2442-2452, 2012.
[28] E. R. Howell and R. F. Hess, "The functional area for summation to threshold for sinusoidal gratings," Vision Res., vol. 18, pp. 369-374, 1978.
[29] S. J. Anderson and D. C. Burr, "Spatial summation properties of directionally selective mechanisms in human vision," J. Opt. Soc. Am. A., vol. 8, pp. 1330-1339, 1991.
[30] S. Sukumar and S. J. Waugh, "Separate first- and second-order processing is supported by spatial summation estimates at the fovea and eccentrically," Vision Res., vol. 47, pp. 581-596, 2007.
[31] J. Mutch and D. Lowe, "Object class recognition and localisation using sparse features with limited receptive fields," Int. J. Comput. Vis., vol. 80, no. 1, pp. 45-57, 2008.
[32] E. Niebur and C. Koch, "A model for the neuronal implementation of selective visual attention based on temporal correlation among neurons," J. Comput. Neurosci., vol. 1, pp. 141-158, 1994.
[33] I. C. Lin, D. Xing, and R. Shapley, "Integrate-and-fire vs Poisson models of LGN input to V1 cortex: Noisier inputs reduce orientation selectivity," J. Comput. Neurosci., vol. 33, pp. 559-572, 2012.
[34] T. Delbruck and L. Longinotti, "jAER," http://sourceforge.net/p/jaer/wiki/Home/, 2014.
[35] T. Serrano-Gotarredona and B. Linares-Barranco, "A 128x128 1.5% contrast sensitivity 0.9% FPN 3 µs latency 4 mW asynchronous frame-free dynamic vision sensor using transimpedance preamplifiers," IEEE J. Solid-State Circuits, vol. 48, pp. 827-838, 2013.
[36] W. Ehrenstein, L. Spillmann, and V. Sarris, "Gestalt issues in modern neuroscience," Axiomathes, vol. 13, pp. 433-458, 2003.