
Self-Organised Learning in the Chialvo-Bak Model

MSc Project
Marco Brigham
Master of Science
Artificial Intelligence
School of Informatics
University of Edinburgh
2009
Abstract
A review of the Chialvo-Bak model is presented for the two-layer neural network topology. A novel Markov chain representation is proposed that yields several important analytical quantities and supports a learning convergence argument. The power law regime is re-examined under this new representation and is found to be limited to learning under small mapping changes. A parallel between the power law regime and biological neural avalanches is proposed. A mechanism is proposed to avoid the permanent tagging of synaptic weights under the selective punishment rule.
Acknowledgements
I wish to thank Dr. Mark van Rossum for his tireless support and attentive guidance, and for having accepted to supervise me in the first place.

I thank Dr. J. Michael Herrmann for the very creative and rewarding discussions on the holistic merits of the Chialvo-Bak model.

I thank Dr. Wolfgang Maass and his team at the Institute for Theoretical Computer Science at T.U. Graz for the precious feedback and fruitful discussions that followed the first talk on this MSc project.
Declaration
I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.
(Marco Brigham)
To the memory of Per Bak, whose ideas live on and inspire.
Contents
1 Introduction
  1.1 Brief literature review

2 The Two-Layer Topology
  2.1 Basic Principles and Learning
    2.1.1 Interference events
    2.1.2 Synaptic Landscape
    2.1.3 Neural avalanches
    2.1.4 Summary
  2.2 Storing Mappings
    2.2.1 Summary
    2.2.2 Appendix
  2.3 Advanced Learning
    2.3.1 Summary
    2.3.2 Appendix

3 Research Results
  3.1 δ-band Saturation
    3.1.1 Desaturation strategies
    3.1.2 Global tag threshold
    3.1.3 Summary
  3.2 Markov Chain Representation
    3.2.1 Statistical properties
    3.2.2 Markov chain representation: numerical evidence
    3.2.3 Analytical solution for (2, n_m, n_o)
    3.2.4 Alternate formulation: graph transitions
    3.2.5 Analytical solution: numerical evidence
    3.2.6 Summary
    3.2.7 Appendix
  3.3 Learning Convergence
    3.3.1 Summary
  3.4 Power-Law Behaviour and Neural Avalanches
    3.4.1 Biological interpretation
    3.4.2 Summary

4 Conclusion
Chapter 1
Introduction
The Chialvo-Bak model was introduced by P. Bak and D. Chialvo [8] in 1999, with the stated goal of identifying "some universal and simple mechanism which allows a large number of neurons to connect autonomously in a way that helps the organism to survive" [8]. Their effort resulted in a schematic brain model of self-organised learning and adaptation that operates using the principle of satisficing [1].
In common with other models authored by P. Bak is a patent minimalism of form: the models are succinctly defined by simple, local and stochastic interaction rules that reflect the most basic assumptions about the real-world system. However simple and minimalistic, these models manage to reproduce the complex, emergent behaviour that is observed in the real-world systems [2].
In the Chialvo-Bak model, the basic properties of neurons and neural networks are represented by simple, local and stochastic dynamical rules that support the processes of learning, memory and adaptation. The most basic operations in the model, the node activation and the synaptic plasticity rules, are regulated by Winner-Take-All (WTA) dynamics and by learning by synaptic depression, respectively. These mechanisms may correspond to well-accepted physiological mechanisms [14] [10] [12], which suggests the biological feasibility of the model.
The present work focused on extending the analytical understanding of the model. A Markov chain representation for the simple two-layer topology is proposed, where the states of the chain correspond to the learning states of the network. This representation provides a good statistical description of the model and supports an argument for learning convergence.
A power law tail in the learning time distribution, corresponding to an order-disorder phase transition in the model, was proposed by J.R. Wakeling [20]. This result was specific to the slow change mode, where the network is made to learn a succession of small mapping changes. The power law behaviour was re-examined under the Markov chain representation for other mapping change modes and was only reproduced in the slow change mode.
An argument is provided for drawing a parallel between the above power law behaviour and the biological neural avalanches evidenced experimentally by J. Beggs and D. Plenz [4][5] in 2003. These correspond to the propagation of spontaneous activity in neural networks with power law behaviour in the event size distribution.
The ability to store previously successful configurations is enabled by a selective punishment mechanism [8] [1], where successful synaptic weights are depressed less severely when they no longer lead to the correct mappings. The selective punishment mechanism has a known ageing effect [1] that is related to the permanent tagging of the successful synaptic weights. A mechanism to avoid the permanent tagging, in order to maintain the performance advantage of selective punishment, is proposed.
This document is organised as follows:

Chapter 1 presents a succinct introduction to the Chialvo-Bak model and the relevant literature.

Chapter 2 introduces the Chialvo-Bak model in the simple two-layer topology, covering the learning modes, the selective punishment rule and the power-law tail behaviour.

Chapter 3 presents the research results of this MSc project.

Chapter 4 presents the conclusion and future work.
1.1 Brief literature review
A brief review of the research papers related to the Chialvo-Bak model is presented below. The purpose of this review is to broadly describe the areas of the model that have already been investigated to considerable depth. Detailed descriptions of the model that support the present work are provided in Section 2.1.

The literature on the Chialvo-Bak model can be grouped into papers that follow the original formulation of the model and papers that extend the model to different working principles and dynamical rules. As the present work is closely aligned with the first approach, so is the focus of the literature review presented below.
Papers on the original Chialvo-Bak model
The Chialvo-Bak model was introduced by P. Bak and D. Chialvo [8] in 1999. In this paper, the motivations, biological constraints and ground rules are put forward, and great emphasis is placed on the biological plausibility of the model, which leads to requirements of self-organisation at different levels and of robustness to noise.
Self-organised learning and adaptation is required to reflect the ability to learn without external guidance. The apparent lack of information in the DNA to encode the physical properties of neurons and synapses and their connectivity [15] motivates the self-organisation at the connectivity level. Each neuron must learn, without genetic or external guidance, which other neurons to connect to, and this connectivity should remain flexible in order to adapt to external changes.

The ability to quickly recover from perturbations induced by biological noise is a constraint motivated by the biological reality of the organism.
Learning by synaptic depression (negative feedback) is proposed as the basis of biological learning and adaptation, supported by the following elements:

- Long-term synaptic depression (LTD) is as common in the mammalian brain as long-term synaptic potentiation (LTP) [8]. The LTD mechanism is the suggested physiological implementation of learning by synaptic depression.
- Learning by synaptic potentiation leads to very stable synaptic landscapes, from which adaptation to new configurations is difficult and slow.
- Learning new tasks or adapting to new environments is error-prone; as such, a process that acts on errors rather than on what is correct leads to faster learning.
The other pillar of the model, the Winner-Take-All (WTA) rule, is inspired by models of Self-Organised Criticality [2], as a means to drive the system to an adaptive critical state, where small perturbations can cause very large changes in the synaptic landscape. The WTA rule also plays a key role in the solution to the credit assignment problem by keeping the activity of the network low, as detailed in Section 2.1.

The synaptic plasticity changes are driven by a global signal informing on the success of the latest synaptic changes. The ability of the organism to differentiate between outcomes is deemed innate to the system and possibly a result of Darwinian selection.
A second paper by the same authors [1], published in 2000, expanded on several key aspects of the model, such as the network topologies, the memory mechanism and a new learning rule to tackle more complex problems. The performance scaling under the new learning rule was also analysed.

Several network topologies and their relevant learning rules were formally defined; these are illustrated in Figure 1.1.
Figure 1.1 The various network topologies proposed in the original model: (a) the simple layered network topology, which is the one used for the present work; (b) the lattice network topology, where nodes connect to a small number of nodes in the subsequent layer; (c) the random network topology, where nodes connect randomly to a number of other neurons, with two subsets of nodes selected as the input and the output nodes of the network. Figures reproduced from [1].
The ability to store and retrieve previously successful configurations is enabled by the selective punishment of synaptic weights, where previously successful weights are depressed less severely when they no longer lead to the correct mappings.

A small modification to the basic learning rules enables the network to learn non-linear problems such as the XOR problem or, more generally, the generalised parity problem, where the parity of any number of input neurons must be correctly calculated.

The learning of multi-step sequences is introduced, where the depression of weights related to bad sequences only occurs at the end of the last step. The generalisation and feature detection capability is also covered: the network is able to differentiate between classes of inputs requiring the same output by identifying useful features in the inputs.
In [20], J.R. Wakeling identified an order-disorder phase transition in the model that is regulated by the ratio of middle layer nodes to input and output nodes. At the phase transition, the network displays power-law behaviour with exponent 1.3 in the learning time distribution.

These order-disorder regimes are characterised by the frequency of path interference events, where already learnt mappings are accidentally destroyed while learning new mappings. The disordered phase is characterised by a high probability of interference.
In [21], J. Wakeling investigated the performance of synaptic potentiation and of selective punishment under different mapping change modes. In the slow change mode the network is made to learn a succession of mapping sets that only differ by one input-output mapping. In the flip-flop mode two different mapping sets are presented alternately to the network.

Synaptic potentiation was introduced into the plasticity rules by rewarding successful weights by a given amount while still punishing unsuccessful weights. A quantitative analysis showed that any small amount of synaptic potentiation resulted in higher average learning times, especially in the slow change mode. This is illustrated in Figure 1.2a.
The performance of selective punishment was also investigated. While no visible improvement was detected in the slow change mode, in the flip-flop mode the mechanism was very effective, as illustrated in Figure 1.2b.
Chialvo-Bak model extensions
In [7], R.J.C. Bosman, W.A. van Leeuwen and B. Wemmenhove propose an extension to the Chialvo-Bak model that includes the potentiation of successful weights. This enables faster single-mapping learning and multi-node firing in each layer, but comes at the cost of adaptation performance.
In [16], K. Klemm, S. Bornholdt and H.G. Schuster propose a stochastic approximation to the WTA rule and include a forgiveness parameter in order to only punish those weights that are consistently unsuccessful. This extension results in a more complex model that, according to the learning performance analysis in [1], does not improve on the original model.
Figure 1.2 The learning performance decreases with any amount of synaptic potentiation of the successful weights: (a) in the slow change mode, where only one mapping is changed at a time; (b) in the flip-flop mode, where the network is presented alternately with two different mappings. Reproduced from [21].
Chapter 2
The Two-Layer Topology
This chapter describes the functioning and the properties of the Chialvo-Bak model that are most relevant to the research results presented in Chapter 3. As such, it does not comprise an extensive description of the model, for which the reader is best directed to the original papers by P. Bak and D. Chialvo [8] [1].

The contents of this chapter are based on the above two papers and on the paper published by J.R. Wakeling [20] on the order-disorder phase transition of the model.
2.1 Basic Principles and Learning
The Chialvo-Bak model is characterised by the following principles:

- Winner-Take-All dynamics: Neural activity only propagates through the strongest synapses.
- Learning by synaptic depression (negative feedback): Synaptic plasticity is exclusively applied by the weakening of synaptic weights that participate in wrong decisions. These synapses are depressed.
These principles define the node activation and plasticity rules of the network, and therefore determine the dynamics and properties of the model. To illustrate this, the functioning of a Chialvo-Bak network while learning an arbitrary input-output mapping is presented below.
Consider a neural network with one input layer, one middle layer and one output layer, as illustrated in Figure 2.1. The layers have n_i, n_m and n_o nodes respectively, and the network is noted (n_i, n_m, n_o).

The nodes connect between layers with synaptic weights w, as follows:

- Input nodes i connect to all middle nodes m with weights w(m, i).
- Middle nodes m connect to all output nodes o with weights w(o, m).

The network is initialised with random weights in [0, 1].

Each node can be active or not active, corresponding to node state 1 or 0. The activation of an input layer node results in the activation of one middle layer node and one output layer node, according to the following Winner-Take-All (WTA) rule:
Figure 2.1 The two-layer network with three input nodes, four middle layer nodes and three output nodes, i.e. n_i = 3, n_m = 4 and n_o = 3. Each input node is connected to all nodes in the middle layer, and each middle layer node is connected to all nodes in the output layer.
- Input node i activates the middle layer node m with maximum w(m, i).
- Middle node m activates the output node o with maximum w(o, m).

In biological terms, the WTA rule could be implemented using lateral inhibitory connections within the same layer and excitatory connections between layers.
The activation sequence above ensures that no directed cycles are possible between nodes, qualifying the network as a feed-forward network.

For a given set of weights, the WTA rule determines the sequence of activation in the middle and output layers, which defines the active configuration C of the network. An example of an active configuration is shown in Figure 2.2, where the blue connections represent the winning weights according to the WTA rule.
Figure 2.2 The Winner-Take-All (WTA) rule specifies the active configuration C of the network. In the above graph all input nodes are shown active, whereas in the network only one input node is active at a time. In this example C = {{1, 1, 1}, {2, 2, 3}, {3, 4, 2}}, which corresponds to the input-output mapping set M = {1, 3, 2}. The blue connections represent the active weights of the configuration C.
An active configuration C associates input nodes with middle layer nodes, which in turn are associated with output nodes. As such, each C maps the input nodes to output nodes: to each input node i corresponds a mapping to the output node o = M(i).
The mapping set M contains the mappings of all the input nodes of the network. In such terms, learning an arbitrary mapping set M corresponds to the evolution of the synaptic weights from an initial active configuration C to a final active configuration C̃ that yields the required mapping set M.
The network learns an arbitrary mapping set M by applying the following synaptic plasticity rules:

1. A random input node i is selected.
2. The input node i fires and activates a middle layer node m and an output layer node o according to the WTA rule.
3. If output node o is correct, i.e. o = M(i), return to step 1.
4. Otherwise, depress the active weights w(m, i) and w(o, m) by a random amount in [0, 1] and return to step 2.
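A minimal simulation sketch of these rules, written in Python with NumPy, may help fix the notation. The function and variable names are illustrative only and are not taken from the simulation code used for the thesis; the sketch assumes fresh random weights and a single target mapping set.

    import numpy as np

    def wta(w_mi, w_om, i):
        # Winner-Take-All: input i -> strongest middle node m -> strongest output node o.
        m = int(np.argmax(w_mi[:, i]))
        o = int(np.argmax(w_om[:, m]))
        return m, o

    def learn_mapping(n_i, n_m, n_o, target, rng):
        # target[i] is the required output node M(i) for input node i.
        w_mi = rng.random((n_m, n_i))   # weights w(m, i), input -> middle
        w_om = rng.random((n_o, n_m))   # weights w(o, m), middle -> output
        depressions = 0
        # Keep presenting random inputs until every input node maps correctly.
        while any(wta(w_mi, w_om, k)[1] != target[k] for k in range(n_i)):
            i = int(rng.integers(n_i))          # 1. select a random input node
            m, o = wta(w_mi, w_om, i)           # 2. fire it through the WTA rule
            if o != target[i]:                  # 3./4. wrong output: depress both
                w_mi[m, i] -= rng.random()      #    active weights by a random
                w_om[o, m] -= rng.random()      #    amount in [0, 1]
                depressions += 1
        return depressions                      # the learning time for this mapping set

    rng = np.random.default_rng(0)
    print(learn_mapping(3, 9, 3, target=[0, 1, 2], rng=rng))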
A sequence of learning steps from M = {1, 3, 2} to the identity mapping set M̃ = {1, 2, 3} is illustrated in Figure 2.3.
Figure 2.3 The learning of the identity mapping set M̃ = {1, 2, 3} from the initial mapping set M = {1, 3, 2}. In this example, M̃ is learnt in three depressions: one depression to learn input node 2 (upper row graph sequence) and two depressions to learn input node 3 (lower row graph sequence). The blue connections represent the active weights of the configuration and the orange connections represent the depression of active weights.
A weight normalisation can be applied at the end of step 3 of the plasticity rules, by raising the weights of input node i and middle layer node m such that the winning weights are equal to one.
Step 2 of the plasticity rules requires a feedback signal informing on the suitability of the recent changes. This is provided in the form of a global feedback signal that is broadcast to the entire network whenever the latest changes are not satisfactory.
The synaptic plasticity rules could correspond to the following events at the biological level:

1. Depressing the current active level of an input node results in a new active level that is tagged for recent changes. This tagging takes the form of a chemical or hormone release that is triggered by the latest synaptic activity.
2. No further plasticity changes take place until a global feedback signal is received. This signal is broadcast to the entire network and indicates that the latest changes are not satisfactory.
3. Following an unsuccessful global feedback signal, step 1 is repeated; otherwise, no further action is taken.
The synaptic plasticity rules result in the following properties:

- For the global feedback mechanism to be efficient in directing synaptic learning, plasticity changes have to be sparse. In such conditions, the credit assignment problem [18][3] is solved, i.e. the system can determine which elements are to be punished following bad performance.
- The network signalling is in the timescale of firing patterns (i.e. milliseconds), while the tagging and feedback mechanisms are in a timescale more adapted to the scale of events in the external world (i.e. seconds to hours).
- The global feedback signal represents an external critic rather than a teacher, as no specific instructions are provided to direct the plasticity activity.
For the network to learn a random mapping set M, the middle layer size must be at least as large as the input layer size, i.e. n_m ≥ n_i, so that each input node i can have a dedicated path to the corresponding output node o = M(i).
A network with n_i input nodes, n_m middle layer nodes and n_o output nodes is noted (n_i, n_m, n_o).
2.1.1 Interference events
In the process of learning a new mapping an interference event i may occur, where the
network unlearns a previously learnt mapping.
This is the case whenever while learning a new mapping, a middle layer node that
was establishing a correct mapping for another input node is selected (assuming the
correct output node for these input nodes is dierent). This is illustrated in Figure 2.4.
2.1.2 Synaptic Landscape
An interesting consequence of learning by synaptic depression is the resulting synaptic
landscape, shown in Figure 2.5.
Figure 2.4 While learning a mapping the network may unlearn a previously correct mapping. In the above sequence, the learning of input node 3 led from M = {1, 2, 2} to M̃ = {1, 1, 3}. As such, the net number of learnt mappings remained unchanged: the output mapping of input node 3 was learnt and the output mapping of input node 2 was unlearnt.
Figure 2.5 The metastable synaptic landscape is a direct consequence of learning by synaptic depression and supports the fast adaptation property of the model. (a) The synaptic weights from input node one to the middle layer, in a network (6, 108, 6). (b) The synaptic weight distribution (100 bins) in a network (6, 108, 6).
In Figure 2.5a the metastable nature of the synaptic landscape is apparent, with the active configuration barely supporting the current mapping. This is to be contrasted with the synaptic landscape resulting from learning by synaptic potentiation, which often results in a small number of dominating synaptic weights.

In this model, learning a very different mapping set M̃ is often just a few depressions away from the currently active weights.
The particular form of the weight distribution in Figure 2.5b is due to both the WTA rule and learning by synaptic depression. As the active synapses for a given input or middle layer node are depressed by a random amount, the WTA rule will select the synapse with the current highest weight¹ for the new active configuration in each layer. This amounts to shifting all the weights by the difference between the previous highest weight and the new highest weight.

¹ The highest weight after depression may still correspond to the previous winning weight, but the probability of re-selection is lower than for any other weight.
Starting from a uniform weight distribution and repeating the above process a sufficient number of times yields the distribution in Figure 2.5b. The intermediate steps of this process are illustrated in Figure 2.6.
Figure 2.6 The synaptic weight distribution evolves from a uniform distribution at the initialisation of the network to the distribution in Figure 2.5b. (a) The distribution after adapting to eight successive random mappings, in a network (32, 1024, 32). (b) The distribution after adapting to 15 random mappings, in a network (32, 1024, 32).
2.1.3 Neural avalanches
The learning performance of the network can be measured by the number of depressions λ required to completely learn a given mapping set M. This quantity will be loosely referred to as the learning time, although no particular timescale is thereby implied.

The learning performance is known [8] to improve with increasing middle layer size, as illustrated in Figure 2.7. This is an advantage over regular back-propagation learning, where in general the performance decreases with increasing middle layer size.
Let Λ be the random variable associated with the number of depressions λ required to learn a mapping set M, and let Pr(Λ = λ) ≡ Pr(λ) be the probability of learning mapping set M in λ depressions.

The learning performance of the network (n_i, n_m, n_o) is completely determined by the learning time distribution, characterised by the probability mass function p such that

    p(λ) = Pr(Λ = λ)                  (2.1)

    Σ_{λ=0}^{∞} p(λ) = 1.             (2.2)
The basic operation for measuring p(λ) is to record the number of depressions λ required to learn the current mapping set M, increase by one unit the count of mapping sets learnt in λ depressions, present a new mapping set M̃ to the network, and so on. However, certain aspects of the simulation setup have a noticeable impact on the measured values, and these are discussed in greater detail below.
Figure 2.7 Increasing the number of middle layer nodes decreases the average learning time ⟨λ⟩ and the average number of interference events ⟨I⟩. The plots show ⟨λ⟩ and ⟨I⟩ for networks with six input nodes, seven output nodes and a varying number of middle layer nodes. (a) The average number of depressions ⟨λ⟩ required to learn a random mapping set M for different sizes of the middle layer. (b) The average number of interference events ⟨I⟩ while learning a random mapping set M for different sizes of the middle layer.
One could require the weights of the network to be reset prior to learning the next mapping set, but this would amount to measuring first-mapping learning times. Instead, the weights of the network are not reset at each new mapping set, which yields a measure that is closer to on-line learning performance.

A further distinction can be made on the degree of similarity between the new mapping set presented to the network and the previous one. These can be completely random or differ in a small number of mappings only. Borrowing from J.R. Wakeling [20], the slow mapping change mode corresponds to a single mapping change² in the new mapping set. The random mapping change mode corresponds to random mapping sets being presented to the network.

² The slow change learning mode requires n_o ≥ n_i + 1, unless the input nodes can share the same output node.
The distribution p(λ) for a network (8, ∗, 9) is shown in Figure 2.8a. The tail of the distribution (i.e. long learning times) recedes for larger middle layer sizes, which is consistent with the plot in Figure 2.7.

An interesting aspect of the model is the power law tail of p(λ) [20], as illustrated in Figure 2.8b. The power law tail is a telltale sign of scale-free behaviour, for which no single value of λ is typical of the learning time in those networks. Shorter values of λ occur more frequently than longer ones, but the latter occur frequently enough not to be singled out as exceptional.
The power-law tail of p(λ) can be understood in terms of avalanches of activation in the middle layer nodes. Borrowing terminology from statistical physics, three operating regimes are then identified:

- Sub-critical regime, for n_m >> n_i·n_o
- Critical regime, for n_m ≈ n_i·n_o
- Super-critical regime, for n_m << n_i·n_o
Figure 2.8 The learning time distributions reveal three distinct regimes: sub-critical, critical and super-critical. The critical regime exhibits power law behaviour with p(λ) ∼ λ^(−1.3) according to [20]. (a) The learning time distributions for several middle layer sizes of a network with eight input and nine output nodes; data from 1e+6 runs. (b) The learning time distributions for several networks with the critical middle layer size; data from 1e+6 runs.
In [20], J.R. Wakeling proposed that the power law tail of p(λ) corresponds to an order-disorder phase transition in the model and that the key difference in the learning dynamics of the three operating regimes is the interference probability Pr(I):

- In sub-critical networks, there are enough middle layer nodes for interference events to be quite rare and therefore learning is very quick.
- In super-critical networks, there are hardly enough middle layer nodes to learn without inducing interference events and therefore learning is extremely slow.
- The learning dynamics for critical network sizes is in between the other two regimes, with just enough interference to occasionally cause large learning times while most of the time the learning times are quite short.
However, it should be noted that the model has not been proved to be critical in the proper statistical physics sense, in order to merit such terminology. Assessing the criticality of the model in the two-layer topology is certainly challenging.

Furthermore, the approximately straight segments in the distributions of Figure 2.8b do not necessarily imply that p(λ) is a proper power law tail distribution, as very clearly explained in the paper [9] by Clauset, Shalizi and Newman. Straight segments in a log-log plot are a necessary but not sufficient condition for p(λ) to be a power law tail distribution. Due to timing constraints, however, no conclusive power law testing was completed for p(λ), and in consequence the terminology proposed in [20] is adopted throughout the document.
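As a minimal illustration of the kind of analysis suggested in [9], the sketch below estimates the tail exponent by maximum likelihood for a sample of learning times. The continuous-data estimator is used as an approximation for the discrete learning times, and the function name, the choice of x_min and the synthetic data are illustrative assumptions rather than thesis results.

    import numpy as np

    def mle_powerlaw_exponent(samples, x_min):
        # Continuous maximum-likelihood estimate of the tail exponent alpha for
        # p(x) ~ x^(-alpha), x >= x_min, following Clauset, Shalizi and Newman [9].
        tail = np.asarray(samples, dtype=float)
        tail = tail[tail >= x_min]
        n = tail.size
        alpha = 1.0 + n / np.sum(np.log(tail / x_min))
        return alpha, n

    # Illustrative usage on synthetic Pareto-tailed data with true exponent 2.3
    # (not thesis data): the estimate should come out close to 2.3.
    rng = np.random.default_rng(0)
    synthetic = rng.pareto(1.3, size=100_000) + 1.0
    alpha_hat, n_tail = mle_powerlaw_exponent(synthetic, x_min=1.0)
    print(f"estimated alpha = {alpha_hat:.2f} from {n_tail} tail samples")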
2.1.4 Summary
The key elements of this section are the following:
- Network dynamics: Defined by Winner-Take-All dynamics and learning by synaptic depression (negative feedback).
- Input-output learning: The network is able to learn an arbitrary mapping set M where to each input node i corresponds an output node o = M(i).
- Local flagging mechanism: Plasticity changes are locally marked for recent activity.
- Global feedback mechanism: Feedback is provided in the form of a global feedback signal specifying whether the most recent changes are unsatisfactory.
- Solution to the credit-assignment problem: Requires sparse network activity, so that no plasticity changes occur until a global feedback signal is received.
- Two typical timescales: The network signalling occurs in the timescale of the firing patterns, while the tagging and feedback mechanisms occur in a much longer timescale that is relevant to the scale of events in the external world.
- Interference event I: The learning of input-output mappings can be disrupted by the unlearning of previously learnt mappings.
- Metastable synaptic landscape: The active configuration is barely supported by the winning weights.
- Neural avalanches: For middle layer sizes n_m = n_i·n_o the network displays power-law behaviour in the learning time distribution p(λ).
2.2 Storing Mappings
The plasticity rules introduced in Section 2.1 enable the network to learn a random mapping set M and to quickly adapt to another mapping set M̃ whenever needed. Not much information is left [1] in the synaptic weights to reliably retrieve M at a later stage, since the active weights that supported M were depressed³ by a random amount in [0, 1] to support the new mapping set M̃.

³ More specifically, the active weights that are not shared by M and M̃.

An additional mechanism is therefore required to store the information from previously learnt mapping sets for later recall. It turns out that such a mechanism exists and amounts to depressing less severely the weights that have been successful in the past; it is called the selective punishment rule [8][1].

The selective punishment rule requires small modifications to the plasticity rules to enable the distinction between successful and unsuccessful weights:

1. A random input node i is selected.
2. Input node i fires and activates a middle layer node m and an output node o according to the WTA rule.
3. If output node o is correct, i.e. o = M(i), tag the weights w(m, i) and w(o, m) as successful and return to step 1.
4. Otherwise, depress the active weights w(m, i) and w(o, m) by:
   - a random amount in [0, 1] if w(m, i) and w(o, m) have never been successful;
   - a random amount in [0, δ] otherwise.
   Return to step 1.
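A small sketch of how the depression step changes under selective punishment is given below; the tag bookkeeping, the value of δ and all names are illustrative assumptions, not the thesis implementation.

    import random

    DELTA = 0.001   # illustrative width of the delta-band

    def tag_active(i, m, o, tagged):
        # Step 3: output o is correct for input i, remember the winning weights.
        tagged.add(('mi', m, i))
        tagged.add(('om', o, m))

    def depress_active(w_mi, w_om, i, m, o, tagged):
        # Step 4: wrong output. Weights tagged as successful at least once are
        # depressed by at most DELTA, all other weights by a random amount in [0, 1].
        cap_mi = DELTA if ('mi', m, i) in tagged else 1.0
        cap_om = DELTA if ('om', o, m) in tagged else 1.0
        w_mi[m][i] -= random.uniform(0.0, cap_mi)
        w_om[o][m] -= random.uniform(0.0, cap_om)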
In the Chialvo-Bak model, recalling a mapping set M refers to a different operation than in other neural network models. Since synaptic plasticity is required to retrieve the information stored in the synaptic weights, the network is re-adapting to a previously seen mapping set rather than recalling it. Nevertheless, in order to distinguish this from the learning rules without selective punishment, the term recall will be used.
Figure 2.9 The number of depressions λ required to first learn and then recall four random mapping sets M_1, ..., M_4 in a network (6, 36, 6). The network is presented with the mapping sets in random succession, and the value of λ is recorded for each graph, i.e. at recall = 10 the network has seen each mapping set 10 times. (a) Example of learn and recall performance without selective punishment. (b) Example of learn and recall performance with the selective punishment rule.
The selective punishment rule results in a dramatic performance increase (under the random mapping change mode), as shown in Figure 2.9. This performance increase results from the network establishing preferred paths from each input node to the output nodes required by the mappings being presented. These preferred paths are the first to be queried when the active configuration is no longer correct. A detailed example of the selective punishment dynamics is presented in the Appendix of this section.

The weights tagged by the selective punishment rule are constrained to a region⁴ within a distance δ from unity, as shown in Figure 2.10. This region is referred to as the δ-band.

⁴ The uniform distribution of weights in the δ-band in Figure 2.10 results from depressing the tagged weights by a fixed amount δ rather than a random amount in [0, δ]. In the latter case, the resulting distribution of weights in the δ-band would be similar to that of Figure 2.5b.

Figure 2.10 The weights tagged by the selective punishment rule are kept within the δ-band located in [1 − δ, 1]. This is where the memory of previously successful mappings is stored. Discrete distribution of 500 bins.

2.2.1 Summary

The key elements of this section are the following:
- Selective punishment rule: Recalling previously learnt mapping sets is enabled by depressing weights that have been successful in the past by a smaller random amount, in [0, δ], when they no longer lead to the desired mapping.
- Selective punishment dynamics: The performance increase results from the network establishing preferred paths from input nodes to output nodes as required by the learnt mapping sets. On average these preferred paths are queried much more often.
- Delta band: Contains the weights representing the memory of previously successful mappings and is located at [1 − δ, 1].
2.2.2 Appendix
The example from Figure 2.9 will be used to illustrate the dynamics of selective punishment. Suppose that M_1, ..., M_4 are presented to the network in that order and require input one to activate output nodes {1, 3, 6, 6} respectively.
After learning the mapping from input one to the output node required by M_1, the winning weights w(m_1, 1) and w(1, m_1) are tagged by the selective punishment rule.
When presented with M_2, input node one should now activate output node three, and the weights w(m_1, 1) and w(1, m_1) are depressed accordingly. This requires a series of successive depressions to bring these weights slightly below the respective second highest weights.

This succession of depressions is necessary since the average weight distance for this network is of the order of 1/n_m ≈ 0.03 for the weight set w(m, i) and of the order of 1/n_o ≈ 0.14 for the weight set w(o, m), whereas w(m_1, 1) and w(1, m_1) are now depressed by small random amounts of at most δ ≈ 0.001. This accounts for the relatively higher adaptation values when learning mappings M_2, M_3 and M_4 in Figure 2.9b, compared to Figure 2.9a.

This also illustrates the negative performance impact that synaptic potentiation would have in this network, since it would lead to large weight differences (in units of depression amounts) between the active weights and the other weights. A metastable synaptic landscape, such as the one illustrated in Figure 2.5a, is a requirement for the network to quickly converge to new mapping sets.
Eventually, either w(m_1, 1) or w(1, m_1) will be depressed below the other weights. Supposing that w(m_1, 1) is first, then input node one switches from middle layer node m_1 to another middle layer node m̃, which has probability 1/n_o of activating output node three. If that is the case, input node one has learnt the correct mapping for M_2 and the weights w(m̃, 1) and w(3, m̃) are also tagged as successful.

If middle layer node m̃ does not lead to output node three, it is depressed accordingly. Input node one is then very likely to switch back to middle layer node m_1, which will still activate output node one, unless weight w(1, m_1) has already been depressed to around the second highest weight of node m_1, in which case it has a chance of activating a different output node.

If output node three is still not found after a few more depressions, the search sequence will alternate between middle layer node m_1 and other middle layer nodes, and the output node of middle layer node m_1 will alternate between output node one and the other output nodes.
After learning the mapping sets M_1, ..., M_4 for the first time, each input node has one or more preferred paths formed by the pairs of tagged weights that lead to the required output nodes. The network will poll these weights much more frequently.

The example above is easily generalised to other input nodes and to the case where the weight w(1, m_1) reaches the second highest weight before weight w(m_1, 1) does.
A sample run in which the mapping sets M_1, ..., M_4 required input one to activate output nodes {1, 3, 6, 6} resulted in the preferred paths for input node one shown in Table 2.1:
Preferred paths from input 1

    To middle layer node m    From middle layer node m to output node o
    3                         6
    12                        2, 6
    30                        6
    31                        1
    34                        3

Table 2.1 The successful weights tagged by the selective punishment rule result in a set of preferred paths for input node 1.
The preferred path to output node two, which is not required for input node one, was added by input node two when learning mapping set M_4.
2.3 Advanced Learning
The type of learning problems that the model can tackle so far is better described as solving a routing problem: given a mapping set M, the input nodes i have to find paths to the output nodes M(i).
A more advanced type of learning consists in considering mapping sets between input node activation patterns P and the specific activation of output nodes o = F(P). As before, the state of each input node i can be active (x_i = 1) or inactive (x_i = 0), and the entire configuration of input nodes is represented by a binary vector P. For example, P = {1, 0, 0, 1, 0, 1} is an activation pattern for a network with six input nodes.

Learning the basic Boolean functions F ∈ {AND, OR, XOR, NAND, ...} is a particular example of this type of learning. The logical values of propositions p and q are represented by the states of two input nodes, and the logical value of the function F(p, q) is represented by the activation of one of the two output nodes.
The changes to the plasticity rules that are required to learn this type of problem are surprisingly small and amount to a slightly modified Winner-Take-All rule:

- The input configuration P = {x_1, ..., x_{n_i}} activates the middle layer node m with maximum weighted sum h_m = Σ_i w(m, i)·x_i.
- Middle node m activates the output node o with maximum w(o, m), as before.
A bias input node that is always active is necessary in order to compute the state where
the remaining input nodes are inactive.
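A short sketch of the modified activation rule, in the same NumPy style as before; the names, the network size and the random weights are illustrative assumptions (random weights will not in general implement XOR).

    import numpy as np

    def activate_pattern(w_mi, w_om, pattern):
        # pattern is a binary vector over the input nodes, including the always-active
        # bias node. The middle node with the largest weighted input sum wins, then the
        # usual WTA step selects the output node.
        h = w_mi @ pattern                  # h[m] = sum_i w(m, i) * x_i
        m = int(np.argmax(h))
        o = int(np.argmax(w_om[:, m]))
        return m, o

    # Example call for a network with two logical inputs plus a bias node,
    # three middle nodes and two output nodes.
    rng = np.random.default_rng(1)
    w_mi = rng.random((3, 3))
    w_om = rng.random((2, 3))
    print(activate_pattern(w_mi, w_om, np.array([1.0, 0.0, 1.0])))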
An example of the network solving the XOR problem under the above plasticity rules is shown in Figure 2.11. An example set of weights that implements a solution to the XOR problem is shown in Table 2.2 in the Appendix of this section.

Figure 2.11 An example active configuration implementing the XOR truth table.
The network can learn the XOR problem with three middle layer nodes. In general, it can learn any mapping F with n_m = 2^{n_i} middle layer nodes, by discovering for each input pattern P the corresponding middle layer node pointing to the correct output F(P). As there are as many middle layer nodes as input configurations, learning convergence is guaranteed. As can be appreciated from the XOR example, the network can learn with fewer middle layer nodes, but the exact minimum depends on F.
2.3.1 Summary
The key elements of this section are the following:
- Advanced learning capability: The model can learn mappings between input node configurations P, representing the activation state of the input nodes, and the respective output nodes o = F(P). In particular, the basic Boolean functions can be learned.
- Advanced learning plasticity rule: The middle layer node with the maximum weighted sum of the weights w(m, i) over the active input nodes is selected, and it activates the output node as before.
2.3.2 Appendix
An example set of weights that implements a solution to the XOR problem is shown in Table 2.2.
    Input to middle, w(m, i)           Middle to output, w(o, m)
    i   m   w(m, i)                    m   o   w(o, m)
    1   1   0.5                        1   1   0.1
    1   2   0.4                        1   2   0.2
    1   3   0.1
    2   1   0.3                        2   1   0.4
    2   2   0.7                        2   2   0.3
    2   3   0.9
    3   1   0.2                        3   1   0.5
    3   2   0.6                        3   2   0.6
    3   3   0.8

Table 2.2 Example weights to solve the XOR problem under the advanced learning plasticity rules.
Chapter 3
Research Results
This chapter presents the results of the research that was conducted during this MSc
project.
3.1 δ-band Saturation
The selective punishment rule enables the network to quickly re-adapt to previously learnt mappings by depressing less severely the weights that were successful at least once in the past. It is also known that the performance of this mechanism suffers from an ageing effect at large time scales [1], as illustrated in Figure 3.1.
Figure 3.1 The ageing effect on the selective punishment performance at large time scales, in a network (6, 36, 6). (a) At short time scales the selective punishment rule drastically improves the ability to quickly recall previously learnt mapping sets. (b) At long time scales the recall performance degrades progressively.
As before, the term recall will refer to the re-adaptation of the network to a previously learnt mapping set M.

The performance degradation of selective punishment is caused by the saturation of the δ-band, the region within a distance δ from unity where the weights tagged as successful are constrained to be.
For the selective punishment rule to be effective, each input node should be able to quickly sort through the tagged weights to recover the preferred path to the correct output node. Ideally, each input node would have established one preferred path for each previously learnt output node. As such, it would take a number of depressions of the order of the number of preferred paths to find the correct output node.

On the other hand, if the number of preferred paths leading to the same output node grows further, the advantage of the tagging mechanism in identifying the preferred middle layer nodes ceases to be effective.

The increase in the number of paths leading to the same output node is a consequence of all weights eventually being given a chance to participate in correct mappings. Consider the continuous raising of the weights caused by the depressions of the plasticity rules. Each raising step is the difference between the highest weight and the second highest. Eventually, all weights end up in the δ-band and are soon able to compete with the tagged weights for participation in a correct configuration. Once that occurs, one additional path to the output node is created.
The monotonic increase in the number of tagged weights leads to a saturation of the δ-band, as increasingly many weights are confined to that region of weight space. This effect can be appreciated in Figure 3.2a, and the corresponding increase of recall times is illustrated in Figure 3.2b.
Figure 3.2 As the percentage of weights tagged as correct increases, the recall times approach the performance of the regime without selective punishment. (a) The number of tagged weights increases monotonically and leads to a saturation of the δ-band. (b) The performance of the selective punishment rule degrades with successive recalls.
3.1.1 Desaturation strategies
The monotonic increase in the number of tagged weights is a consequence of the permanent tagging of the selective punishment rule. As such, a mechanism is required to reduce the tagging lifespan.

P. Bak and D. Chialvo proposed [1] a mechanism of neuron ageing to tackle this issue, where nodes are replaced at a fixed rate, their weights randomised and the tagging information removed. However, the neuron replacement rate may need to depend on the level of network activity in order to successfully counter the saturation rate of the δ-band.

Several strategies that result in non-permanent tagging were reviewed, and the first of them was selected for implementation:

- Global tag threshold: weights are untagged if not successful for more than a global threshold of depressions.
- Local tag threshold: as the previous, but the threshold depends on the past performance of each weight.
- Interference correction: weights are untagged after a threshold number of interference events.
3.1.2 Global tag threshold
The global tag threshold has been investigated in greater detail, and numerical simulations suggest that an optimal threshold value exists for each network size.

In order to ensure that the optimal tag threshold value does not depend on the activity level of the network, an increasing number of mappings was presented to networks of the same size. The performance of the optimal tag threshold value was consistent across the number of presented mapping sets, as illustrated in Figure 3.3. For the network (6, 36, 6) the optimal tag threshold value is close to 48.
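A sketch of how the global tag threshold could be layered on top of the tagging mechanism is given below; the per-weight counter, the function names and the threshold value are illustrative assumptions about one reasonable implementation, not the exact rule used for the simulations.

    TAG_THRESHOLD = 48   # illustrative value, close to the optimum found for (6, 36, 6)

    def record_failure(key, tagged, failures):
        # Called whenever a tagged weight participates in a wrong mapping. After
        # TAG_THRESHOLD failures without an intervening success, the weight loses
        # its tag and is again depressed by a random amount in [0, 1].
        failures[key] = failures.get(key, 0) + 1
        if failures[key] > TAG_THRESHOLD:
            tagged.discard(key)
            failures.pop(key, None)

    def record_success(key, tagged, failures):
        # A success (re-)tags the weight and resets its failure counter.
        tagged.add(key)
        failures[key] = 0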
Figure 3.3 For the network (6, 36, 6) the optimal global tag threshold value is close to 48. (a) The average saturation of the δ-band for several global tag threshold values is consistent across the number of presented mapping sets. (b) The average recall time ⟨λ⟩ for several global tag threshold values is consistent across the number of presented mapping sets.
For networks with a larger number of input nodes n_i the optimal tag threshold value is also higher. This was verified for the network (12, 144, 12), where the optimal tag threshold value is around 64. This is illustrated in Figure 3.4.
Figure 3.4 For the network (12, 144, 12) the optimal global tag threshold value is around 64. (a) The average saturation of the δ-band for several global tag threshold values. (b) The average recall time ⟨λ⟩ for several global tag threshold values.

The optimal tag threshold seems closely related to an optimal average number of tagged middle layer nodes and of tagged output nodes behind them, as illustrated in Figure 3.5.
Tag threshold values that are too low result in the network forgetting successful nodes too fast, as can be observed from the sharp decrease in the average recall time ⟨λ⟩ in Figure 3.5. Tag threshold values that are too high fail to get rid of path redundancy fast enough; this results in an increase of the average recall time ⟨λ⟩, as illustrated in Figure 3.4b for the tag threshold value of 80, for example. Somewhere in between lies the optimal value.

The blue line in Figure 3.5b represents the average number of tagged middle layer nodes per input node. From Table 3.1, the number of tagged middle layer nodes surpasses the number of output nodes n_o from tag threshold 32 upwards.

The green line of the same graph represents the average number of tagged output nodes per tagged middle layer node. This value decreases until tag threshold 32 is reached, as for lower thresholds the network compensates for the lack of enough tagged middle layer nodes by tagging more output nodes per tagged middle layer node. Near tag threshold 32 a minimum is reached, and the value starts growing again as higher tag threshold values cannot get rid of path redundancy fast enough.

The best tag threshold found for this network produced an average of 7.9 tagged middle layer nodes, which is nearly two nodes more than the number of output nodes n_o. The full results are presented in Table 3.1 in the Appendix of this section.
3.1.3 Summary
The key elements of this section are the following:
- δ-band saturation: The permanent tagging of successful weights results in a monotonic increase in the number of weights constrained to the δ-band region, thereby reducing the performance advantage of the selective punishment rule.
- Desaturation strategies: Several desaturation strategies are possible; they amount to capping the tagging lifetime.
    Global tag threshold    Avg. tagged middle nodes per input    Avg. tagged outputs per tagged middle node
    16                      4.8350                                1.5447
    24                      5.3833                                1.5129
    32                      6.0450                                1.4118
    40                      6.9167                                1.4135
    48                      7.9133                                1.5362
    56                      9.3767                                1.6998
    64                      10.9283                               1.9372
    72                      12.4167                               2.1066
    80                      14.4617                               2.3799

Table 3.1 Average number of tagged middle layer nodes per input node and average number of tagged output nodes per tagged middle layer node for different values of the global tag threshold in the network (6, 36, 6).
Figure 3.5 The relation between the optimal global tag threshold and the average number of tagged middle layer nodes and average number of tagged output nodes per tagged middle layer node. (a) The average recall time ⟨λ⟩ decreases sharply until the optimal global tag threshold value is reached and slowly starts increasing again beyond that point. (b) The average number of tagged middle layer nodes steadily increases with increasing global tag threshold values. The average number of tagged output nodes per tagged middle layer node has a minimum when the average number of tagged middle layer nodes is equal to the number of output nodes n_o.
- Global tag threshold: Sets a global limit on the number of times a tagged weight can be wrong before becoming untagged. There is an optimal tag threshold for each network size that is independent of the level of network activity.
- Global tag threshold dynamics: The optimal value of the global threshold is related to the average number of tagged middle layer nodes and the average number of tagged output nodes behind them.
3.2 Markov Chain Representation
A Chialvo-Bak network can be represented by a first-order Markov chain when the evolution of the network is considered as a sequence of learnt-mapping states. Such a representation is useful for deriving several statistical properties of the model analytically, such as the average learning time ⟨λ⟩, the learning time distribution p(λ) and the average number of interference events ⟨I⟩.

The Markov chain representation seems appropriate since the evolution of the network is to a large extent stochastic. The random depression amounts of the plasticity rules result in changes to the active configuration C, which is the basic macroscopic state of the network. Assuming that the transition from an active configuration C to a new active configuration C̃ is stochastic and only depends on C, the evolution of the network can be described by a first-order Markov chain¹.

¹ A k-th order Markov chain would depend on the k previous steps.
Rather than considering the evolution between active configurations C, a more meaningful basic state for the Markov chain representation is the learning state S_n of the network, i.e. the number n of currently learnt mappings. For each active configuration C the learning state S_n is determined by counting the number n of learnt mappings of C. As such, the correspondence between C and S_n maintains the Markov chain properties mentioned above.
The Markov chain has n_i + 1 states, noted S_0, S_1, ..., S_{n_i}, corresponding to having learnt from zero up to n_i mappings.
In this context, learning a new mapping set M corresponds to the Markov chain starting from an initial state S_i and evolving towards the final state S_{n_i}. The chain has a transition from S_n to S_{n+1} when an additional output node is learnt, and from S_n to S_{n−1} in the case of an interference event. This is illustrated in Figure 3.6.
Figure 3.6 The Markov chain representation considers the evolution of the network in terms of the number of learnt mappings. In this example, the network started in state S_2 and successfully reached the final state S_3 after two depressions. The evolution sequence was S_2 → S_2 → S_3. The target mapping set is M = {1, 2, 3}.
The final state S_{n_i} is special since the Markov chain stops when it arrives at state S_{n_i}, and no transitions from S_{n_i} to any other state are possible. This corresponds to the network having fully learnt the mapping set M, so that no further depressions are necessary. The fact that all states can reach the absorbing state S_{n_i}, and that no transition is possible from S_{n_i} to any other state, qualifies the Markov chain as an absorbing Markov chain.
In general, there are four possible transitions from a given state S_n. Noting S(t) the state of the system at evolution step t, if S(t) = S_n then S(t+1) ∈ {S_{n−1}, S_n, S_{n+1}, S_{n+2}}. Figure 3.7 shows an example of the transition S_n → S_{n+2}.
Figure 3.7 The network can learn up to two mappings after one depression, giving the transitions S_n → S_{n+1} and S_n → S_{n+2}, and it can unlearn a single mapping, giving S_n → S_{n−1}.
For networks with three input nodes, all possible transitions are shown in Figure 3.8, where the arrows indicate the direction of the transitions.

Figure 3.8 All possible transitions for a network with three input nodes. The arrows indicate the direction of the transitions. No transition is possible from S_3 to any of the other states.
A Markov chain is completely determined by the state transition matrix $\mathbf{A}$, which specifies in its columns the transition probabilities between states $S_n$, and the initial state probability vector $\mathbf{p}$, which specifies the initial state probabilities $\Pr(S(0))$.

The element $a_{mn}$ of the state transition matrix $\mathbf{A}$ is the probability $\Pr(S_m \mid S_n)$ of the transition $S_n \rightarrow S_m$. The element $p_k$ of the column vector $\mathbf{p}$ is the probability $\Pr(S(0) = S_k)$ of starting the chain in state $S_k$.

For the elements of $\mathbf{A}$ and $\mathbf{p}$ to represent valid probabilities the following must hold:

    \sum_{m=0}^{n_i} a_{mn} = 1   for every column index n of A        (3.1)

    \sum_{k=0}^{n_i} p_k = 1                                           (3.2)
For a network with three input nodes, $\mathbf{A}$ and $\mathbf{p}$ have the following form:

    A = ( a_00  a_01  0     0 )
        ( a_10  a_11  a_12  0 )
        ( a_20  a_21  a_22  0 )      and   p = (p_0  p_1  p_2  p_3)^T,
        ( 0     a_31  a_32  1 )

where $a_{30} = a_{02} = 0$ since the corresponding transitions are not possible (see Figure 3.8).

For example, running this network in the slow change mode, where only one mapping is changed each time, corresponds to the initial state probability vector $\mathbf{p} = (0\ 0\ 1\ 0)^T$.
For a general network $(n_i, n_m, n_o)$, $\mathbf{A}$ and $\mathbf{p}$ have the following form: $\mathbf{A}$ is the $(n_i+1) \times (n_i+1)$ band matrix in which column $n$ has non-zero entries only in rows $n-1$, $n$, $n+1$ and $n+2$, and whose last column corresponds to the absorbing state,

    A = ( a_00  a_01  0     0     ...   0                  0                  0 )
        ( a_10  a_11  a_12  0     ...   0                  0                  0 )
        ( a_20  a_21  a_22  a_23  ...   0                  0                  0 )
        ( 0     a_31  a_32  a_33  ...   0                  0                  0 )
        ( 0     0     a_42  a_43  ...   0                  0                  0 )
        ( ...   ...   ...   ...   ...   ...                ...                ... )
        ( 0     0     0     0     ...   a_{n_i-2, n_i-2}   a_{n_i-2, n_i-1}   0 )
        ( 0     0     0     0     ...   a_{n_i-1, n_i-2}   a_{n_i-1, n_i-1}   0 )
        ( 0     0     0     0     ...   a_{n_i, n_i-2}     a_{n_i, n_i-1}     1 )

    p = (p_0  p_1  ...  p_{n_i})^T,

with $a_{(n-2)\,n} = a_{(n-3)\,n} = \cdots = a_{0\,n} = 0$ and $a_{(n+3)\,n} = \cdots = a_{n_i\,n} = 0$, since the corresponding transitions are impossible (for the appropriate values of $n$).
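The band structure just described is straightforward to encode and check numerically. The following is a minimal sketch (my own illustration, not code from this project; the function names and the example matrix are hypothetical) that builds the mask of transitions allowed for a chain with $n_i$ input nodes and verifies that a candidate matrix respects both the structure and the column normalisation of Eq. (3.1).

import numpy as np

def allowed_mask(n_i):
    """Boolean mask of the transitions S_n -> S_m that are possible."""
    size = n_i + 1
    mask = np.zeros((size, size), dtype=bool)
    for n in range(n_i):                      # transient columns S_0 .. S_{n_i - 1}
        for m in (n - 1, n, n + 1, n + 2):    # unlearn one, stay, learn one or two
            if 0 <= m <= n_i:
                mask[m, n] = True
    mask[n_i, n_i] = True                     # absorbing state S_{n_i}
    return mask

def is_valid_transition_matrix(A, n_i, tol=1e-9):
    """Check Eq. (3.1) and the band structure described in the text."""
    A = np.asarray(A, dtype=float)
    column_stochastic = np.allclose(A.sum(axis=0), 1.0, atol=tol)
    respects_structure = not np.any(A[~allowed_mask(n_i)] > tol)
    return column_stochastic and respects_structure

# Example: a 4x4 matrix for a three-input network with arbitrary valid entries.
A3 = np.array([[0.2, 0.1, 0.0, 0.0],
               [0.5, 0.3, 0.1, 0.0],
               [0.3, 0.4, 0.5, 0.0],
               [0.0, 0.2, 0.4, 1.0]])
print(is_valid_transition_matrix(A3, n_i=3))  # True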
3.2.1 Statistical properties
The state transition matrix $\mathbf{A}$ and the initial state probability vector $\mathbf{p}$ make it possible to compute the statistics of the Markov chain in a very straightforward way. For detailed analytical derivations see [13] and [17], for example.
For example, the elements of the $n$-th power of $\mathbf{A}$, noted $\mathbf{A}^n$, yield the transition probabilities between states in $n$ steps, i.e. $a^{(n)}_{mk}$ is the transition probability from state $S_k$ to state $S_m$ in $n$ steps.
To motivate the above statement, consider the previous example of the network with three inputs. The probability $\Pr^{(2)}(S_2 \mid S_1)$ of going from state $S_1$ to state $S_2$ in two steps is computed as follows:

    Pr^{(2)}(S_2 | S_1) = Pr(S_2 | S_0) Pr(S_0 | S_1) + Pr(S_2 | S_1) Pr(S_1 | S_1)
                          + Pr(S_2 | S_2) Pr(S_2 | S_1) + Pr(S_2 | S_3) Pr(S_3 | S_1)        (3.3)

which in terms of the transition matrix elements is written:

    Pr^{(2)}(S_2 | S_1) = a_20 a_01 + a_21 a_11 + a_22 a_21 + a_23 a_31.                      (3.4)

The last expression is the product of the row of $\mathbf{A}$ corresponding to $S_2$ with the column corresponding to $S_1$, which is the element $(2, 1)$ of the product $\mathbf{A}\mathbf{A} \equiv \mathbf{A}^2$ of $\mathbf{A}$ with itself.
An important observation is that the above computation relied explicitly on the defining properties of the Markov chain: step $S(t+1)$ is completely determined by step $S(t)$ and the state transition probability matrix $\mathbf{A}$. This made it possible to factor $\Pr^{(2)}(S_2 \mid S_1)$ in terms of the accessible intermediate states in Eq. (3.3) and to associate the resulting probabilities with the elements of $\mathbf{A}$ in Eq. (3.4). This will be useful when interpreting the numerical evidence for the Markov chain representation.
The powers of matrix $\mathbf{A}$ lead to a straightforward computation of the learning time distribution $p(\delta)$. Since the last row of $\mathbf{A}^n$ gives the transition probabilities to the absorbing state $S_{n_i}$ in $n$ steps when starting from each of the $n_i + 1$ states, the product with $\mathbf{p}$ gives the probability of reaching the absorbing state in $n$ steps when starting from the initial state distribution $\mathbf{p}$:

    p(\delta \le n) = (0 \cdots 0\ 1)\, A^n\, p,   n \ge 0        (3.5)
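As a concrete illustration of Eq. (3.5), the following sketch (mine, with hypothetical function names, not code from this project) accumulates the probability mass in the absorbing state over successive applications of $\mathbf{A}$; the learning time distribution $p(\delta)$ then follows by differencing consecutive cumulative values.

import numpy as np

def p_learned_by(A, p, n_steps):
    """Return [p(delta <= 0), p(delta <= 1), ..., p(delta <= n_steps)]."""
    A = np.asarray(A, dtype=float)
    p = np.asarray(p, dtype=float)
    last_row = np.zeros(A.shape[0])
    last_row[-1] = 1.0                       # the (0 ... 0 1) row vector of Eq. (3.5)
    cumulative, state = [], p.copy()
    for _ in range(n_steps + 1):
        cumulative.append(float(last_row @ state))   # probability mass in S_{n_i}
        state = A @ state                            # one more evolution step
    return np.array(cumulative)

# The learning time pmf follows by differencing:
# p(delta = n) = p(delta <= n) - p(delta <= n - 1).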
The transient states matrix $\mathbf{T}$ contains the transition probabilities between non-absorbing states, i.e. all states excluding the absorbing state $S_{n_i}$:

    A = ( T  0 )
        ( F  1 )

where $\mathbf{F}$ is the sub-matrix row of transition probabilities to the final state $S_{n_i}$.

The powers of matrix $\mathbf{T}$ are obtained from $\mathbf{A}^n$, and its elements $t^{(n)}_{ij}$ correspond to the transition probability in $n$ steps from the non-absorbing state $S_j$ to the non-absorbing state $S_i$:

    A^n = ( T^n  0 )
          (  *   1 )

where $*$ represents the last row of $\mathbf{A}^n$ except the element 1.
The learning time distribution $p(\delta)$ can be expressed in terms of the sub-matrix $\mathbf{T}^n$, since the element $j$ of the last row of $\mathbf{A}^n$ is equal to one minus the sum of the $j$-th column of $\mathbf{T}^n$, i.e. $a^{(n)}_{n_i\,j} = 1 - \sum_{i=0}^{n_i - 1} a^{(n)}_{ij} = 1 - \sum_i t^{(n)}_{ij}$.

Rewriting the first part of Eq. (3.5) in terms of $\mathbf{T}^n$ yields:

    p(\delta \le n) = 1 - \mathbf{1}^T T^n \tilde{p},   n \ge 0        (3.6)

where $\tilde{\mathbf{p}} = (p_0\ p_1 \cdots p_{n_i - 1})^T$ contains the components of $\mathbf{p}$ except the probability of starting in the final state $S_{n_i}$, and $\mathbf{1}$ is the ones column vector of size $n_i$.
An absorbing Markov chain has a number of interesting properties [13]:

- The matrix $\mathbf{I} - \mathbf{T}$ has an inverse $\mathbf{N} = (\mathbf{I} - \mathbf{T})^{-1}$, which is called the Fundamental Matrix:

      N = I + T + T^2 + \cdots

- The element $n_{ij}$ of matrix $\mathbf{N}$ is the expected number of times the chain is in state $S_i$, when starting in state $S_j$, before reaching the absorbing state $S_{n_i}$.

The last property yields the expectation of the learning time $\langle\delta\rangle$, since adding the entries of column $j$ of $\mathbf{N}$ gives the expected number of steps spent in all transient states before reaching the absorbing state $S_{n_i}$, when starting in state $S_j$. Therefore,

    \langle\delta\rangle = \mathbf{1}^T N \tilde{p}        (3.7)

Higher order moments of the random variable $\delta$ can be obtained from the expression for the factorial moments [17]:

    E[\delta (\delta - 1) \cdots (\delta - n + 1)] = n!\, \mathbf{1}^T T^{n-1} N^n \tilde{p}

where $E[\cdot]$ is the statistical expectation.
The expectation of the number of interference events $\langle\iota\rangle$ is easily derived from $\mathbf{N}$, by noting that the interference probability in a given state $S_j$ is given by the element $a_{j-1\,j}$ of $\mathbf{A}$, i.e. $\Pr(\iota \mid S_j) = a_{j-1\,j}$. Therefore:

    \langle\iota\rangle = \mathbf{v}^T N \tilde{p}        (3.8)

where $\mathbf{v} = (0\ a_{01}\ \cdots\ a_{n_i-2\,n_i-1})^T$ is the column vector with the elements of $\mathbf{A}$ corresponding to interference events.
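The quantities of Eqs. (3.7) and (3.8), together with the factorial-moment expression above, can be evaluated directly from $\mathbf{A}$ and $\mathbf{p}$. The sketch below is my own illustration (not thesis code); it assumes the convention used throughout this section, namely that columns of $\mathbf{A}$ index the source state and that the last state is absorbing.

import numpy as np

def absorbing_chain_stats(A, p):
    A = np.asarray(A, dtype=float)
    p = np.asarray(p, dtype=float)
    T = A[:-1, :-1]                      # transitions among transient states
    p_t = p[:-1]                         # initial distribution over transient states
    N = np.linalg.inv(np.eye(T.shape[0]) - T)
    ones = np.ones(T.shape[0])
    mean_delta = ones @ N @ p_t          # Eq. (3.7)
    # Interference vector: from S_j the interference probability is a_{j-1, j}.
    v = np.zeros(T.shape[0])
    v[1:] = np.diagonal(A, offset=1)[:T.shape[0] - 1]
    mean_iota = v @ N @ p_t              # Eq. (3.8)
    # Second factorial moment E[delta (delta - 1)] = 2 1^T T N^2 p_t,
    # which gives the variance of the learning time.
    second_fact = 2.0 * ones @ T @ N @ N @ p_t
    var_delta = second_fact + mean_delta - mean_delta ** 2
    return mean_delta, var_delta, mean_iota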
3.2.2 Markov chain representation: numerical evidence
To test the validity of the Markov chain representation the following experiment was performed:

- A test set of networks was selected with input layer size $n_i$ ranging from one to twelve input nodes.
- For each input layer size $n_i$:
  - The output layer size was $n_i + 1$.
  - Three middle layer sizes corresponding to the three regimes (sub-critical, critical, super-critical) were used: $n_m^{sub} = 2\, n_m^{critical}$ and $n_m^{super} = n_m^{critical}/2$.
- Each network was simulated a large number of times and the following statistics were recorded:
  - state transition counts
  - initial state counts
  - $p(\delta)$, the learning time distribution
  - $\langle\delta\rangle$, the average learning time
  - $\langle\iota\rangle$, the average number of interference events
- The state transition and initial state counts are possible to extract since the network evolves in discrete steps, and at each discrete step the learning state can be computed from the active configuration.
- The measured state transition and initial state counts are normalised according to Eqs. (3.1) and (3.2), yielding² the maximum likelihood estimators [6] $\mathbf{A}_{MLE}$ and $\mathbf{p}_{MLE}$, respectively.
- For each measured $\mathbf{A}_{MLE}$ and $\mathbf{p}_{MLE}$ the quantities $p(\delta)_{pred}$, $\langle\delta\rangle_{pred}$ and $\langle\iota\rangle_{pred}$ are computed from Eqs. (3.6), (3.7) and (3.8) respectively.
- The measured values are compared to the computed values using the Normalised Root Mean Square Error (NRMSE) and the Kolmogorov-Smirnov $D$ statistic:

      NRMSE[x_{pred}] = \sqrt{\langle (x_{pred} - x)^2 \rangle} / (x_{max} - x_{min})

      D = \max_x | F_{pred}(x) - F(x) |

  where $F$ denotes the cumulative distribution of the learning time. (A sketch of these estimators and metrics is given after this list.)
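The following is a small sketch of the estimators and comparison metrics listed above (my own reading of the procedure, with hypothetical function names, not thesis code).

import numpy as np

def estimate_mle(transition_counts, initial_counts):
    """transition_counts[m, n] counts observed S_n -> S_m transitions."""
    C = np.asarray(transition_counts, dtype=float)
    col_sums = C.sum(axis=0, keepdims=True)
    col_sums[col_sums == 0] = 1.0          # e.g. no transitions observed out of S_{n_i}
    A_mle = C / col_sums                   # column-wise normalisation, Eq. (3.1)
    p_mle = np.asarray(initial_counts, dtype=float)
    p_mle = p_mle / p_mle.sum()            # Eq. (3.2)
    return A_mle, p_mle

def nrmse(predicted, measured):
    predicted, measured = np.asarray(predicted), np.asarray(measured)
    rmse = np.sqrt(np.mean((predicted - measured) ** 2))
    return rmse / (measured.max() - measured.min())

def ks_statistic(pmf_predicted, pmf_measured):
    """Kolmogorov-Smirnov D between two learning time distributions."""
    cdf_pred = np.cumsum(pmf_predicted)
    cdf_meas = np.cumsum(pmf_measured)
    return float(np.max(np.abs(cdf_pred - cdf_meas)))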
In order to obtain enough state transition counts for each state to enable a good estimation of $\mathbf{A}_{MLE}$, the mappings were presented to the network with a uniform initial state probability (except for the state $S_{n_i}$). Under this mapping change mode, the learning time distribution $p(\delta)$ has a different shape than under the slow mapping change mode, for example. Figure 3.9 shows an example of measured learning time distributions $p(\delta)$ for two sample networks and the predicted distributions $p(\delta)_{pred}$ obtained from Eq. (3.6).

The NRMSE for $\langle\delta\rangle_{pred}$ and $\langle\iota\rangle_{pred}$ was zero across the tested networks. The actual value was below the numerical accuracy of the experiment, as estimated by the inverse of the number of samples in each simulation, $1/N$. Such an NRMSE value is perhaps an indication that the underlying method is not applicable to validate the Markov chain representation.
The Kolmogorov-Smirnov $D$ statistic for $p(\delta)_{pred}$ computed from Eq. (3.6) is shown in Figure 3.10.

² The derivation of the maximum likelihood estimator for $\mathbf{A}$ follows the usual scheme of maximising the data log-likelihood, applying a Lagrange multiplier for the normalisation constraint of each column of $\mathbf{A}$.
Figure 3.9  The learning time distribution $p(\delta)$ for a uniform initial state probability (except for the state $S_{n_i}$). The black line represents the predicted distribution $p(\delta)_{pred}$ obtained from Eq. (3.6); the green line is the measured $p(\delta)$. (a) Super-critical network (4, 10, 5), 4e5 runs. (b) Sub-critical network (12, 312, 13), 1e6 runs.
Figure 3.10  The Kolmogorov-Smirnov statistic $D$ for $p(\delta)_{MLE}$ for the test network set, with networks ranging from two input nodes to 12 input nodes, in the super-critical, critical and sub-critical regimes. The statistic $D$ is computed from Eq. (3.6) and decreases with increasing input layer size for the networks considered.
For the networks considered, the Kolmogorov-Smirnov $D$ statistic was usually higher in the super-critical and critical regimes. With increasing input layer size, the statistic $D$ decreases and converges for all three regimes. The abrupt reduction for the network with two input nodes led to additional measurements with larger output layers, but the results were very similar.

The differences in the statistic $D$ indicate that the Markov chain representation is more accurate for larger networks. Smaller networks may be better represented by other types of Markov chain, or may not admit a Markov chain representation at all, for example by not respecting the Markov chain properties.

A direct estimation of the Markov chain order might clarify this question. Due to timing constraints this was not pursued. Several estimators exist for the Markov chain order, such as the BIC Markov order estimator [11] or the Peres-Shields order estimator [19].
3.2.3 Analytical solution for (2, n_m, n_o)

An analytical solution for $\mathbf{A}$ has been obtained for networks with two input nodes $(2, n_m, n_o)$, for the case where the input nodes cannot share output nodes³.
For $(2, n_m, n_o)$ the state transition matrix $\mathbf{A}$ and the initial state probability vector $\mathbf{p}$ have the form:

    A = ( a_00  a_01  0 )
        ( a_10  a_11  0 )      and   p = (p_0  p_1  p_2)^T,        (3.9)
        ( a_20  a_21  1 )
Each state $S_n$ can be further separated into sub-states $S^d_n$, corresponding to the graphs with $d$ degrees of freedom in the middle layer. The basic graphs $S^d_n$ for $(2, n_m, n_o)$ are illustrated in Figure 3.11. For simplicity, these sub-states are simply referred to as the graphs of state $S_n$.

Figure 3.11  The basic graphs for the two-input networks $(2, n_m, n_o)$. The upper index specifies the number of degrees of freedom in the middle layer, e.g. $S^1_1$ is the graph for one learnt input node and one shared middle layer node.
The main steps in this computation are the following:

1. Compute the number of active configurations $C$ for each state $S_n$ in terms of the graphs $S^d_n$.

2. Equal graph accessibility approximation: assume that in a given state $S_n$ each of the graphs $S^d_n$ is equally accessible. This is strictly the case for $S(0)$, where the distribution of graphs only depends on the relative proportion of active configurations $C$ implementing each graph. However, it is not necessarily the case for transitions from another state $S_n$ or from the same state $S_n$, where the graph-to-graph transitions may favour some particular graphs.

3. For each graph $S^d_n$ compute the probability of transition to $S_{n-1}$, $S_n$, $S_{n+1}$ and $S_{n+2}$, corresponding to unlearning, no change, learning one and learning two mappings, respectively.

³ The case where the input nodes can share output nodes may be easily derived by simplification of the present results.
When starting the network from random weights, the active configuration distribution for each state $S_n$ does not correspond to the relative frequency count of its graphs, as shown in Table 3.2.

    Graphs                             S^1_0    S^1_1    S^1_2    S^2_0    S^2_1    S^2_2
    Configuration count                6        6        0        54       36       6
    Configuration count distribution   0.0556   0.0556   0        0.5000   0.3333   0.0556
    Random graph distribution          0.1666   0.1666   0        0.3749   0.2502   0.0417

Table 3.2  The configuration count distribution and the initial graph distribution when starting from random weights do not match. The values above are for the network (2, 3, 4). The configuration count distribution is obtained by counting the relative frequency of the configurations implementing a given graph. For example, (2, 3, 4) has six possible configurations for graph $S^1_0$ out of 108 distinct configurations, which gives a relative frequency of 0.0556 for graph $S^1_0$.
The correct distribution is obtained by first counting the configurations generated from non-shared middle layer nodes and one of the shared middle layer nodes, and then multiplying by $n_o$ for each additional shared middle layer node. Such a strategy is justified by considering all the combinations of triplets $(i, m, o)$ and enforcing the WTA rule on the second and third values of the triplet. The total number of combinations is $(n_m n_o)^{n_i}$ and the WTA rule assigns them to the corresponding graphs $S^d_n$. The resulting graph distribution is the correct one, as shown in Table 3.3. This approach has also been verified successfully for the network $(3, n_m, n_o)$.

    Random graph distribution          0.1666   0.1666   0        0.3749   0.2502   0.0417
    Predicted graph distribution       0.1667   0.1667   0        0.3750   0.2500   0.0417

Table 3.3  The initial graph distribution when starting from random weights matches the predicted graph distribution, for the network (2, 3, 4). The predicted graph distribution is detailed in the text.
For the network $(2, n_m, n_o)$ one obtains the following number of configurations per graph:

    |S^1_0| = n_m n_o (n_o - 2)
    |S^1_1| = n_i n_m n_o
    |S^2_0| = n_m (n_m - 1)(n_o - 1)^2
    |S^2_1| = n_i n_m (n_m - 1)(n_o - 1)
    |S^2_2| = n_m (n_m - 1)

where $|S^d_n|$ is the order of the graph $S^d_n$, i.e. the number of distinct configurations that implement the graph.
This allows the computation of $\mathbf{p}_{rand}$ when starting from a random configuration:

    p_rand = (1/\Omega) ( |S^1_0| + |S^2_0|,   |S^1_1| + |S^2_1|,   |S^2_2| )^T

where $\Omega = |S^1_0| + |S^2_0| + |S^1_1| + |S^2_1| + |S^2_2|$.
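The counting above is simple to reproduce. The sketch below (mine, not thesis code; names are hypothetical) evaluates the graph orders and the resulting initial distributions for a two-input network; for (2, 3, 4) it recovers the predicted graph distribution of Table 3.3.

def graph_orders(n_m, n_o, n_i=2):
    """Number of distinct configurations implementing each graph of (2, n_m, n_o)."""
    return {
        "S^1_0": n_m * n_o * (n_o - 2),
        "S^1_1": n_i * n_m * n_o,
        "S^2_0": n_m * (n_m - 1) * (n_o - 1) ** 2,
        "S^2_1": n_i * n_m * (n_m - 1) * (n_o - 1),
        "S^2_2": n_m * (n_m - 1),
    }

def initial_distributions(n_m, n_o):
    orders = graph_orders(n_m, n_o)
    omega = sum(orders.values())                     # equals (n_m * n_o)**2
    graph_dist = {g: v / omega for g, v in orders.items()}
    p_rand = (graph_dist["S^1_0"] + graph_dist["S^2_0"],   # state S_0
              graph_dist["S^1_1"] + graph_dist["S^2_1"],   # state S_1
              graph_dist["S^2_2"])                         # state S_2
    return graph_dist, p_rand

print(initial_distributions(3, 4))
# graph distribution ~ 0.1667, 0.1667, 0.375, 0.25, 0.0417 (cf. Table 3.3)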
One final element needed before computing the transition probabilities is the probability $p_0$ of re-selecting the same node after a depression, and the probability $1 - p_0$ of selecting a different node. The re-selection probability is given by $p_0 = 1/(2n - 1)$, where $n$ is the size of the layer, and this value can only be related to the particular weight distribution of this model, as shown in Figure 2.5b. Numerical evidence for this value of $p_0$ is shown in Figure 3.12.
Figure 3.12  The average re-selection probability for the middle layer nodes, for the networks $(2, n_m, 10)$. The measured probability (in blue) follows the curve $1/(2 n_m - 1)$ (in green). The curve $1/n_m$ is shown for comparison (in red). Data collected for 1e4 depressions in each network.
The element $a_{01}$ of the transition matrix $\mathbf{A}$ for $(2, n_m, n_o)$ can be obtained as follows:

    a_01 = Pr(S_0 | S_1) = Pr(S^1_0 | S^1_1) Pr(S^1_1 | S_1) + Pr(S^2_0 | S^1_1) Pr(S^1_1 | S_1)

Using the equal graph accessibility approximation,

    a_01 ≈ ( Pr(S^1_0 | S^1_1) + Pr(S^2_0 | S^1_1) ) |S^1_1| / (|S^1_1| + |S^2_1|)

which with the graph transition probabilities gives:

    a_01 ≈ ( m_0 (1 - o_0)(n_o - 2)/(n_o - 1) + (1 - m_0)(1 - o_0)(n_o - 1)/n_o ) |S^1_1| / (|S^1_1| + |S^2_1|)

where $m_0$ and $o_0$ represent the re-selection probabilities of middle and output nodes, respectively.

The main elements of the last computation are detailed below, where it is assumed that the network is learning to connect input two to output two while learning the identity mapping $M = \{1, 2\}$ (the other possible mapping set requiring an inversion of the roles).
- $|S^1_1| / (|S^1_1| + |S^2_1|)$ is the equal graph accessibility approximation for graphs $S^1_1$ and $S^2_1$ in state $S_1$.

- $m_0 (1 - o_0)(n_o - 2)/(n_o - 1)$ is the transition probability from $S^1_1$ to $S^1_0$. This transition occurs whenever the same middle layer node is re-selected and the output node is not re-selected; there are $n_o - 2$ out of $n_o - 1$ such cases that enable the transition.

- $(1 - m_0)(1 - o_0)(n_o - 1)/n_o$ is the transition probability from $S^1_1$ to $S^2_0$, occurring whenever the same middle layer node is not re-selected and the output node is not re-selected; there are $n_o - 1$ out of $n_o$ cases that enable the transition.

Figure 3.13  The graph transitions from $S^1_1$ to $S^1_0$ and $S^2_0$. The target mapping is $M = \{1, 2\}$.
3.2.4 Alternate formulation: graph transitions
There is an alternative to relying on the equal graph accessibility approximation. By considering the transitions between graphs $S^d_n$ directly, rather than between states $S_n$, the need to compute the graph occupation within states is avoided altogether.

The resulting transition matrix has more elements, one column for each possible graph, but the predictions are potentially more accurate.

The elements of this new state transition matrix $\mathbf{A}_g$ and the new initial probability vector $\mathbf{p}_g$ are as follows:

    A_g = ( a_{10 10}  a_{10 20}  a_{10 11}  0          0 )
          ( a_{20 10}  a_{20 20}  a_{20 11}  0          0 )
          ( a_{11 10}  a_{11 20}  a_{11 11}  a_{11 21}  0 )
          ( a_{21 10}  a_{21 20}  a_{21 11}  a_{21 21}  0 )
          ( a_{22 10}  0          a_{22 11}  a_{22 21}  1 )

    and   p_g = (p_{10}  p_{20}  p_{11}  p_{21}  p_{22})^T,

where $a_{22\,20} = a_{10\,21} = a_{20\,21} = 0$ since no transitions are possible between those graphs. An element of $\mathbf{A}_g$ whose column corresponds to graph $S^d_n$ and whose row corresponds to graph $S^{d'}_{m}$ represents $\Pr(S^{d'}_{m} \mid S^{d}_{n})$, the transition probability from graph $S^d_n$ to graph $S^{d'}_m$.
Figure 3.14  The graph transitions for a network with two input nodes. The arrows indicate the sense of the transitions. No transition is possible from graph $S^2_2$ to any of the other graphs.
The analytical results previously obtained are directly applicable to the computation of the elements of the transition matrix $\mathbf{A}_g$. For example, the element $a_{01}$ of $\mathbf{A}$ yields the elements $a_{10\,11}$ and $a_{20\,11}$ of $\mathbf{A}_g$, as follows:

    a_{10 11} = Pr(S^1_0 | S^1_1) = m_0 (1 - o_0)(n_o - 2)/(n_o - 1)
    a_{20 11} = Pr(S^2_0 | S^1_1) = (1 - m_0)(1 - o_0)(n_o - 1)/n_o

These two elements of $\mathbf{A}_g$ correspond to the two graph transitions in Figure 3.13.
Computing $p(\delta)$ and $\langle\delta\rangle$ proceeds according to Eqs. (3.6) and (3.7) respectively. However, Eq. (3.8) is no longer applicable for computing $\langle\iota\rangle$, as the elements of $\mathbf{A}_g$ corresponding to interference events are no longer on the upper diagonal of matrix $\mathbf{A}_g$. Nevertheless, a similar computation is still possible:

    \langle\iota\rangle = \sum_{j:\ column\ of\ N} \sum_{i:\ interference\ graph\ of\ j} a^g_{ij}\, n_{ij}\, \tilde{p}^g_j        (3.10)

where $a^g_{ij}$ is an element of $\mathbf{A}_g$ corresponding to an interference transition from graph $j$ to graph $i$, $\tilde{\mathbf{p}}_g$ is the initial state vector excluding the absorbing graph, and $\tilde{p}^g_j$ is the element $j$ of $\tilde{\mathbf{p}}_g$.
3.2.5 Analytical solution: numerical evidence
To test the validity of the analytical solution, a methodology similar to that of Sub-section 3.2.2 for testing the Markov chain representation was used. In the present case, the network test set was limited to networks with two input nodes, 10 output nodes and a range of middle layer sizes.

The analytical expressions for $\mathbf{A}$ and $\mathbf{A}_g$ were used to obtain $p(\delta)_{pred}$, $\langle\delta\rangle_{pred}$ and $\langle\iota\rangle_{pred}$ from Eqs. (3.6), (3.7) and (3.8) or (3.10), respectively, and the predicted values were then compared to the ones obtained from the simulations.

The slow mapping change mode was used this time, instead of the uniform initial state probability, as the number of states is quite small and the slow change mode is able to visit them a sufficient number of times. The resulting learning time distributions $p(\delta)$ are more representative of the typical network dynamics.
Figure 3.15  The learning time distribution $p(\delta)$ in the slow change mode: (a) super-critical network (2, 10, 10) and (b) sub-critical network (2, 100, 10), each with 1e6 runs. The green and red lines are the predictions from Eq. (3.6) using $\mathbf{A}$ and $\mathbf{A}_g$ respectively.
The predicted learning time distributions $p(\delta)_{pred}$ are quite similar to the measured ones. This can be appreciated from Figure 3.15, where the predictions are compared in the super-critical and the sub-critical regimes. As with the predictions from the Markov chain validation testing, the super-critical regime distributions are less accurate than the sub-critical distributions.
Figure 3.16  The Kolmogorov-Smirnov statistic $D$ for the learning time distribution $p(\delta)$ in the slow change mode, for the networks $(2, n_m, 10)$ with 1e6 runs. The green and red lines are the statistic values obtained from Eq. (3.6) using $\mathbf{A}$ and $\mathbf{A}_g$ respectively.
The above is confirmed by the Kolmogorov-Smirnov $D$ statistic, which is higher for the super-critical and critical regimes, as shown in Figure 3.16.

The statistic $D$ decreases with increasing middle layer size and is comparable to the maximum likelihood estimations for the transition matrix obtained in Sub-section 3.2.2 for the critical and sub-critical regimes (compare Figure 3.16 with the $D$ value for network size two in Figure 3.10).

Interestingly, the predictions obtained from $\mathbf{A}_g$ do not perform better than the ones from $\mathbf{A}$, except in the sub-critical regime, where the $D$ statistic is lower for the $p(\delta)_{pred}$ obtained from $\mathbf{A}_g$. This may be particular to the two-input networks, which are somewhat singled out from the others in Figure 3.10.

In terms of $\langle\delta\rangle_{pred}$ and $\langle\iota\rangle_{pred}$, the predictions obtained from $\mathbf{A}$ are also more accurate than the ones obtained from $\mathbf{A}_g$, as shown in Table 3.4 and Figure 3.17.
Figure 3.17  Comparing $\langle\delta\rangle_{pred}$ and $\langle\iota\rangle_{pred}$ to the measured values, in the slow change mode, for the networks $(2, n_m, 10)$ with 1e6 runs. The green and red lines are the predictions from Eqs. (3.7) and (3.10) for $\mathbf{A}$ and $\mathbf{A}_g$ respectively.
The improvement in the Kolmogorov-Smirnov $D$ statistic for $\mathbf{A}_g$ in the sub-critical regime is not reflected in the NRMSE, where the predictions from $\mathbf{A}$ are consistently more accurate.

    NRMSE <delta>_pred    super-critical     critical       sub-critical
    A                     0.0839 (0.0660)    0.0577 (0)     0.0149 (0.0203)
    A_g                   0.1042 (0.0758)    0.0817 (0)     0.0264 (0.0332)

    NRMSE <iota>_pred     super-critical     critical       sub-critical
    A                     0.1323 (0.1066)    0.1339 (0)     0.0452 (0.0616)
    A_g                   0.1704 (0.1269)    0.2000 (0)     0.0875 (0.1089)

Table 3.4  The normalised root mean square error (NRMSE) for the predicted $\langle\delta\rangle_{pred}$ and $\langle\iota\rangle_{pred}$ in the super-critical, critical and sub-critical regimes. The values in brackets are the standard deviation of the NRMSE.
3.2.6 Summary
The key elements of this section are the following:
- Markov chain representation: The Chialvo-Bak network in the two-layer topology is fairly accurately represented by a first-order Markov chain, where the chain states correspond to the number of learnt maps.

- Markov chain statistics: The state transition matrix $\mathbf{A}$, obtained either analytically or by maximum likelihood estimation, easily allows the computation of $p(\delta)_{pred}$, $\langle\delta\rangle_{pred}$ and $\langle\iota\rangle_{pred}$.

- Numerical evidence for the Markov chain: The Markov chain representation is accurate in all regimes (super-critical, critical and sub-critical) for all but very small networks.

- Analytical solution for $\mathbf{A}$ for $(2, n_m, n_o)$: An approximate analytical solution was obtained for the network with two input nodes.

- Analytical solution for $\mathbf{A}_g$ for $(2, n_m, n_o)$: An analytical solution was obtained for the network with two input nodes when considering the transitions between graphs $S^d_n$ rather than between states. As such, this solution is in principle exact.

- Numerical evidence for the analytical solutions: The performance of the analytical solutions is comparable to a maximum likelihood estimation of the transition matrix $\mathbf{A}$ in the critical and sub-critical regimes. The predictions from the approximate solution had better performance than the non-approximate solution.
3.2.7 Appendix
Analytical Solution with A for (2, n_m, n_o)

The solution for the full transition matrix $\mathbf{A}$ for $(2, n_m, n_o)$ is as follows:
    a_00 ≈ |S^1_0|/(|S^1_0| + |S^2_0|) ( Pr(S^1_0 | S^1_0) + Pr(S^2_0 | S^1_0) )
         + |S^2_0|/(|S^1_0| + |S^2_0|) ( Pr(S^1_0 | S^2_0) + Pr(S^2_0 | S^2_0) )

    a_00 ≈ |S^1_0|/(|S^1_0| + |S^2_0|) [ m_0 ( o_0 + (1 - o_0)(n_o - 3)/(n_o - 1) )
                                         + (1 - m_0)(n_o - 1)/n_o ( o_0 + (1 - o_0)(n_o - 2)/(n_o - 1) ) ]
         + |S^2_0|/(|S^1_0| + |S^2_0|) [ m_0 ( o_0 + (1 - o_0)(n_o - 2)/(n_o - 1) ) ]
         + |S^2_0|/(|S^1_0| + |S^2_0|) [ (1 - m_0)(n_m - 2)/(n_m - 1) (n_o - 1)/n_o
                                         + (1 - m_0) 1/(n_m - 1) (n_o - 2)/(n_o - 1) ]

    a_10 ≈ |S^1_0|/(|S^1_0| + |S^2_0|) ( Pr(S^1_1 | S^1_0) + Pr(S^2_1 | S^1_0) )
         + |S^2_0|/(|S^1_0| + |S^2_0|) ( Pr(S^1_1 | S^2_0) + Pr(S^2_1 | S^2_0) )

    a_10 ≈ |S^1_0|/(|S^1_0| + |S^2_0|) [ m_0 (1 - o_0) 2/(n_o - 1) ]
         + |S^1_0|/(|S^1_0| + |S^2_0|) [ (1 - m_0) ( 1/n_o ( o_0 + (1 - o_0)(n_o - 2)/(n_o - 1) )
                                         + (n_o - 1)/n_o (1 - o_0) 1/(n_o - 1) ) ]
         + |S^2_0|/(|S^1_0| + |S^2_0|) [ m_0 (1 - o_0) 1/(n_o - 1)
                                         + (1 - m_0)(n_m - 2)/(n_m - 1) 1/n_o
                                         + (1 - m_0) 1/(n_m - 1) 1/(n_o - 1) ]

    a_20 ≈ |S^1_0|/(|S^1_0| + |S^2_0|) Pr(S^2_2 | S^1_0)

    a_20 ≈ |S^1_0|/(|S^1_0| + |S^2_0|) [ (1 - m_0) 1/n_o (1 - o_0) 1/(n_o - 1) ]

    a_01 ≈ |S^1_1|/(|S^1_1| + |S^2_1|) ( Pr(S^1_0 | S^1_1) + Pr(S^2_0 | S^1_1) )

    a_01 ≈ |S^1_1|/(|S^1_1| + |S^2_1|) [ m_0 (1 - o_0)(n_o - 2)/(n_o - 1) + (1 - m_0)(1 - o_0)(n_o - 1)/n_o ]

    a_11 ≈ |S^1_1|/(|S^1_1| + |S^2_1|) ( Pr(S^1_1 | S^1_1) + Pr(S^2_1 | S^1_1) )
         + |S^2_1|/(|S^1_1| + |S^2_1|) ( Pr(S^2_1 | S^2_1) + Pr(S^1_1 | S^2_1) )

    a_11 ≈ |S^1_1|/(|S^1_1| + |S^2_1|) [ m_0 ( o_0 + (1 - o_0) 1/(n_o - 1) )
                                         + (1 - m_0) ( 1/n_o (1 - o_0) + (n_o - 1)/n_o o_0 ) ]
         + |S^2_1|/(|S^1_1| + |S^2_1|) [ m_0 ( o_0 + (1 - o_0)(n_o - 2)/(n_o - 1) ) ]
         + |S^2_1|/(|S^1_1| + |S^2_1|) [ (1 - m_0)(n_m - 2)/(n_m - 1) (n_o - 1)/n_o + (1 - m_0) 1/(n_m - 1) ]

    a_21 ≈ |S^1_1|/(|S^1_1| + |S^2_1|) Pr(S^2_2 | S^1_1) + |S^2_1|/(|S^1_1| + |S^2_1|) Pr(S^2_2 | S^2_1)

    a_21 ≈ |S^1_1|/(|S^1_1| + |S^2_1|) [ (1 - m_0) 1/n_o o_0 ]
         + |S^2_1|/(|S^1_1| + |S^2_1|) [ m_0 (1 - o_0) 1/(n_o - 1) + (1 - m_0)(n_m - 2)/(n_m - 1) 1/n_o ]

    a_02 = 0,   a_12 = 0,   a_22 = 1
where $m_0$ and $o_0$ represent the re-selection probabilities of middle and output nodes, respectively.

Analytical Solution with A_g for (2, n_m, n_o)

The solution for the full transition matrix $\mathbf{A}_g$ for $(2, n_m, n_o)$ is as follows, where $m_0$ and $o_0$ again represent the re-selection probabilities of middle and output nodes, respectively.
    a_{10 10} = Pr(S^1_0 | S^1_0) = m_0 ( o_0 + (1 - o_0)(n_o - 3)/(n_o - 1) )
    a_{20 10} = Pr(S^2_0 | S^1_0) = (1 - m_0)(n_o - 1)/n_o ( o_0 + (1 - o_0)(n_o - 2)/(n_o - 1) )
    a_{11 10} = Pr(S^1_1 | S^1_0) = m_0 (1 - o_0) 2/(n_o - 1)
    a_{21 10} = Pr(S^2_1 | S^1_0) = (1 - m_0) ( 1/n_o ( o_0 + (1 - o_0)(n_o - 2)/(n_o - 1) )
                                                + (n_o - 1)/n_o (1 - o_0) 1/(n_o - 1) )
    a_{22 10} = Pr(S^2_2 | S^1_0) = (1 - m_0) 1/n_o (1 - o_0) 1/(n_o - 1)

    a_{10 20} = Pr(S^1_0 | S^2_0) = (1 - m_0) 1/(n_m - 1) (n_o - 2)/(n_o - 1)
    a_{20 20} = Pr(S^2_0 | S^2_0) = m_0 ( o_0 + (1 - o_0)(n_o - 2)/(n_o - 1) )
                                    + (1 - m_0)(n_m - 2)/(n_m - 1) (n_o - 1)/n_o
    a_{11 20} = Pr(S^1_1 | S^2_0) = (1 - m_0) 1/(n_m - 1) 1/(n_o - 1)
    a_{21 20} = Pr(S^2_1 | S^2_0) = m_0 (1 - o_0) 1/(n_o - 1) + (1 - m_0)(n_m - 2)/(n_m - 1) 1/n_o
    a_{22 20} = Pr(S^2_2 | S^2_0) = 0

    a_{10 11} = Pr(S^1_0 | S^1_1) = m_0 (1 - o_0)(n_o - 2)/(n_o - 1)
    a_{20 11} = Pr(S^2_0 | S^1_1) = (1 - m_0)(1 - o_0)(n_o - 1)/n_o
    a_{11 11} = Pr(S^1_1 | S^1_1) = m_0 ( o_0 + (1 - o_0) 1/(n_o - 1) )
    a_{21 11} = Pr(S^2_1 | S^1_1) = (1 - m_0) ( 1/n_o (1 - o_0) + (n_o - 1)/n_o o_0 )
    a_{22 11} = Pr(S^2_2 | S^1_1) = (1 - m_0) 1/n_o o_0

    a_{10 21} = Pr(S^1_0 | S^2_1) = 0
    a_{20 21} = Pr(S^2_0 | S^2_1) = 0
    a_{11 21} = Pr(S^1_1 | S^2_1) = (1 - m_0) 1/(n_m - 1)
    a_{21 21} = Pr(S^2_1 | S^2_1) = m_0 ( o_0 + (1 - o_0)(n_o - 2)/(n_o - 1) )
                                    + (1 - m_0)(n_m - 2)/(n_m - 1) (n_o - 1)/n_o
    a_{22 21} = Pr(S^2_2 | S^2_1) = m_0 (1 - o_0) 1/(n_o - 1) + (1 - m_0)(n_m - 2)/(n_m - 1) 1/n_o

    a_{10 22} = 0,   a_{20 22} = 0,   a_{11 22} = 0,   a_{21 22} = 0,   a_{22 22} = 1
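For completeness, the expressions above can be assembled into a numerical $\mathbf{A}_g$ and sanity-checked. The sketch below is my own transcription of the formulas (not thesis code); it additionally assumes, by analogy with the middle layer value quoted in Sub-section 3.2.3, that the output re-selection probability is $o_0 = 1/(2 n_o - 1)$, and it requires $n_m \ge 2$ and $n_o \ge 3$ for the fractions to be meaningful.

import numpy as np

def build_Ag(n_m, n_o):
    m0 = 1.0 / (2 * n_m - 1)          # middle layer re-selection probability
    o0 = 1.0 / (2 * n_o - 1)          # output layer re-selection probability (assumed by analogy)
    A = np.zeros((5, 5))              # graph order: S^1_0, S^2_0, S^1_1, S^2_1, S^2_2

    # Column 0: transitions from S^1_0
    A[0, 0] = m0 * (o0 + (1 - o0) * (n_o - 3) / (n_o - 1))
    A[1, 0] = (1 - m0) * (n_o - 1) / n_o * (o0 + (1 - o0) * (n_o - 2) / (n_o - 1))
    A[2, 0] = m0 * (1 - o0) * 2 / (n_o - 1)
    A[3, 0] = (1 - m0) * ((o0 + (1 - o0) * (n_o - 2) / (n_o - 1)) / n_o
                          + (n_o - 1) / n_o * (1 - o0) / (n_o - 1))
    A[4, 0] = (1 - m0) / n_o * (1 - o0) / (n_o - 1)

    # Column 1: transitions from S^2_0
    A[0, 1] = (1 - m0) / (n_m - 1) * (n_o - 2) / (n_o - 1)
    A[1, 1] = (m0 * (o0 + (1 - o0) * (n_o - 2) / (n_o - 1))
               + (1 - m0) * (n_m - 2) / (n_m - 1) * (n_o - 1) / n_o)
    A[2, 1] = (1 - m0) / (n_m - 1) / (n_o - 1)
    A[3, 1] = m0 * (1 - o0) / (n_o - 1) + (1 - m0) * (n_m - 2) / (n_m - 1) / n_o

    # Column 2: transitions from S^1_1
    A[0, 2] = m0 * (1 - o0) * (n_o - 2) / (n_o - 1)
    A[1, 2] = (1 - m0) * (1 - o0) * (n_o - 1) / n_o
    A[2, 2] = m0 * (o0 + (1 - o0) / (n_o - 1))
    A[3, 2] = (1 - m0) * ((1 - o0) / n_o + (n_o - 1) / n_o * o0)
    A[4, 2] = (1 - m0) / n_o * o0

    # Column 3: transitions from S^2_1
    A[2, 3] = (1 - m0) / (n_m - 1)
    A[3, 3] = (m0 * (o0 + (1 - o0) * (n_o - 2) / (n_o - 1))
               + (1 - m0) * (n_m - 2) / (n_m - 1) * (n_o - 1) / n_o)
    A[4, 3] = m0 * (1 - o0) / (n_o - 1) + (1 - m0) * (n_m - 2) / (n_m - 1) / n_o

    A[4, 4] = 1.0                     # S^2_2 is absorbing
    return A

A_g = build_Ag(n_m=10, n_o=10)
print(np.allclose(A_g.sum(axis=0), 1.0))   # True: every column sums to one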
3.3 Learning Convergence
In all but very small networks, the two-layer topology of the Chialvo-Bak model seems well represented by a first-order absorbing Markov chain. This is supported by the numerical testing results of the Markov chain representation.

An important property of absorbing Markov chains is the guaranteed convergence⁴ to the absorbing states. Therefore, as long as the absorbing Markov chain representation holds, the corresponding network can be expected to converge to the complete learning state.

More specifically, for a given learning rule, the network is expected to converge to the complete learning state if, under that learning rule:

- The Markov property is preserved, i.e. $\Pr(S(t+1) \mid S(t), S(t-1), \ldots, S(0)) = \Pr(S(t+1) \mid S(t))$.
- From every possible state $S_n$ there is a path to the absorbing state $S_{n_i}$, which does not need to be a direct path.

A necessary condition seems to be the requirement to have separate random depressions in the middle and output layers. Indeed, simulations showed that for the network (2, 2, 2) complete learning is not guaranteed if the middle and output layers are depressed by the same random amount. This condition seems to be needed to maintain the Markov property, but further investigation is necessary to verify this.

⁴ This property is based on a probability conservation argument. For details see [13], for example.
For any state $S_n$ to be able to reach the absorbing state $S_{n_i}$ it is sufficient that at least one path to the absorbing state exists. Such a path can be constructed if, for every state $S_n$, a transition is possible to a higher learning state $S_{n+1}$. This translates into a lower bound on the number of middle layer nodes. For the system to be in the learning state $S_n$, a total of $n$ middle layer nodes are used to support this state. Moving to the state $S_{n+1}$ requires one additional free middle layer node, and so on.

For the simple learning rule, the number of middle layer nodes should therefore be at least as large as the number of input nodes, i.e. $n_m \ge n_i$.

For the advanced learning rule, the number of middle layer nodes should be at least as large as the number of different input configurations, i.e. $n_m \ge P$, where $P$ represents the number of distinct input configurations.

Proving the above assertions would be very interesting, as it would establish the learning capability of the advanced learning rule, which is the ability to learn arbitrary binary input-to-output maps and, in particular, any boolean function.
3.3.1 Summary
The key elements of this section are the following:
- Guaranteed convergence of the Markov chain: An absorbing Markov chain is guaranteed to converge to the absorbing state. Should the Markov chain representation hold for the simple and advanced learning modes, the Chialvo-Bak network can be expected to converge to the complete learning state.

- Separate random depressions: A requirement for convergence is to have separate random depressions in the middle and output layers. This seems to be related to maintaining the Markov property.

- Lower bound on the number of middle layer nodes: Another requirement for learning convergence is to have sufficient middle layer nodes. For the simple learning rule $n_m \ge n_i$, and for the advanced learning rule $n_m \ge P$, where $P$ represents the number of distinct input configurations.
3.4 Power-Law Behaviour and Neural Avalanches
The power law tail⁵ in the learning time distribution $p(\delta)$ of the critical regime is also dependent on the mapping change mode, i.e. the way new mapping sets are presented to the network.

The power law tails of the distributions in Figure 2.8a were generated under the slow change mode, where one single mapping is changed each time. It turns out that the slow change mode seems to be the only mapping presentation mode that results in power law behaviour. If the network is provided with random mapping sets the power law disappears, as illustrated in Figure 3.18a.
Figure 3.18  The power-law tail of the distribution $p(\delta)$ disappears when mappings are presented to the network in ways other than the slow change mode. Both panels are for the network (8, 72, 9). (a) Learning under the slow change mode and the random change mode (1e6 runs): the blue line represents learning with one mapping change each time, whereas the green line represents learning random mapping sets. (b) Learning under $n$ mapping changes ($n_i \ge n \ge 1$), where the slow change mode corresponds to $n = 1$; learning under $n$ mapping changes corresponds to starting the system in state $S_{n_i - n}$ (curves for the starting states $S_1$, $S_4$, $S_6$ and $S_7$). The plot was generated from synthetic data obtained from $\mathbf{A}_{MLE}$.
In the Markov chain representation, the slow change mode is equivalent to starting the network in state $S(0) = S_{n_i - 1}$. More generally, learning under $n$ mapping changes ($n_i \ge n \ge 1$) amounts to starting the network in the state $S(0) = S_{n_i - n}$. Again, only the slow change mode generated a power law tail, as illustrated in Figure 3.18b.
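Synthetic learning time samples of this kind can be generated directly from an estimated transition matrix. The following sketch (my own, with hypothetical names, not thesis code) starts the chain in state $S_{n_i - n}$ and counts the steps until absorption.

import numpy as np

def sample_learning_times(A, start_state, n_runs, seed=None):
    """Draw learning times from the chain defined by A (columns are source states)."""
    rng = np.random.default_rng(seed)
    A = np.asarray(A, dtype=float)
    n_states = A.shape[0]
    absorbing = n_states - 1
    times = np.empty(n_runs, dtype=int)
    for r in range(n_runs):
        state, steps = start_state, 0
        while state != absorbing:
            state = rng.choice(n_states, p=A[:, state])   # one evolution step
            steps += 1
        times[r] = steps
    return times

# Example: the slow change mode corresponds to start_state = n_i - 1.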
An interesting characterisation of the critical regime is obtained by analysing the state occupation matrix $\mathbf{Q}$ that is obtained from the transition counts after running a simulation. Normalising the transition counts globally gives the frequency of each transition, and summing the entries of each column gives the state frequency. Below is an example for a network (6, 42, 7) after 1e5 runs in the slow change mode.

⁵ As mentioned in Sub-section 2.1.3, the approximately straight segments in the distributions of Figure 2.8b do not necessarily imply that $p(\delta)$ is a proper power law tail distribution. Due to timing constraints no conclusive power law testing was completed for $p(\delta)$; as such, the terminology proposed in [20] is adopted here.
    Q = ( 0.0063  0.0011  0       0       0       0      )
        ( 0.0011  0.0345  0.0059  0       0       0      )
        ( 0.0000  0.0059  0.0950  0.0154  0       0      )
        ( 0       0.0000  0.0153  0.1704  0.0262  0      )
        ( 0       0       0.0001  0.0261  0.2255  0.0328 )
        ( 0       0       0       0.0001  0.0327  0.2683 )
        ( 0       0       0       0       0.0000  0.0373 )
Summing over the columns of Q gives the relative occupation in each state. For the
network (6, 42, 7) the relative occupation of each state is shown in Table 3.5.
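The relative occupations follow directly from $\mathbf{Q}$ by summing each column. The sketch below (mine, not thesis code) reproduces this for the matrix shown above; the result matches the critical-regime row of Table 3.5 up to rounding.

import numpy as np

Q = np.array([[0.0063, 0.0011, 0.0,    0.0,    0.0,    0.0   ],
              [0.0011, 0.0345, 0.0059, 0.0,    0.0,    0.0   ],
              [0.0000, 0.0059, 0.0950, 0.0154, 0.0,    0.0   ],
              [0.0,    0.0000, 0.0153, 0.1704, 0.0262, 0.0   ],
              [0.0,    0.0,    0.0001, 0.0261, 0.2255, 0.0328],
              [0.0,    0.0,    0.0,    0.0001, 0.0327, 0.2683],
              [0.0,    0.0,    0.0,    0.0,    0.0000, 0.0373]])

occupation = Q.sum(axis=0)            # one value per source state S_0 .. S_5
occupation /= occupation.sum()        # renormalise the globally normalised counts
print(occupation.round(4))            # ~ [0.0074 0.0415 0.1163 0.2120 0.2844 0.3384]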
Analysing the columns of $\mathbf{Q}$ for larger networks yields the following picture:

- In the super-critical regime the network spends most of the time in the states neighbouring $S_{n_i/2}$.
- In the critical regime the network spends most of the time in the states from $S_{n_i/2}$ up to $S_{n_i - 1}$.
- In the sub-critical regime the network spends most of the time in the last two states, $S_{n_i - 1}$ and $S_{n_i - 2}$.
    Relative State Occupation    S_0      S_1      S_2      S_3      S_4      S_5
    super-critical               0.0648   0.1971   0.2800   0.2437   0.1449   0.0695
    critical                     0.0075   0.0416   0.1163   0.2119   0.2844   0.3384
    sub-critical                 0.0004   0.0038   0.0218   0.0849   0.2431   0.6460

Table 3.5  The relative state occupation of the network (6, 42, 7) after 1e5 runs in the slow change mode. In the super-critical regime the network spends most time in the middle states. In the critical regime the occupation spans from the middle state up to the state before absorption. In the sub-critical regime the network spends most of the time in the last two states before absorption.
3.4.1 Biological interpretation
The slow change mode may have an interesting biological interpretation. Assigning a slow temporal scale to the rate of mapping changes, slow enough that these become comparable to the perturbations in the synaptic weights resulting from biological background noise, allows the slow change mode to be re-interpreted as the network recovering from noise.

Small punctual perturbations to the synaptic weights might affect the result of the WTA rule for a given node of the active configuration. This is likely to result in the activation of a different output node, which would trigger the corrective depressions. This effect is comparable to re-assigning an output node, especially if the perturbation affects a synapse from an input node to a middle layer node. The ability of the network to recover perfectly from noise has been discussed in [8][1].

The neural avalanches, first evidenced experimentally by J. Beggs and D. Plenz [4][5], correspond to spontaneous neural activity displaying power-law distributions in the event sizes. Identifying both avalanche types is not trivial, as the biological neural networks analysed by Beggs and Plenz were not known to be performing any particular activity during the experiments, whereas the Chialvo-Bak avalanches are due to an ongoing process of synaptic plasticity.

As such, a parallel may be established between the biological neural avalanches and the avalanches in the Chialvo-Bak model under the slow change mode, with a very slow temporal scale of changes.
3.4.2 Summary
The key elements of this section are the following:
- Power law tail in the slow change mode only: The power law tail in the learning time distribution $p(\delta)$ only seems to occur in the slow change mode.

- Relative state occupation in the slow change mode: In the super-critical regime the network spends most time in the middle states, whereas in the critical regime the occupation spans from the middle state up to the state before absorption. In the sub-critical regime the network spends most of the time in the last two states before absorption.

- Parallel with neural avalanches: A parallel with the biological neural avalanches is obtained by re-interpreting the slow change mode, under a very slow temporal scale of changes, as the network recovering from noise.
Chapter 4
Conclusion
The present work focused on extending the analytical understanding of the Chialvo-Bak network in the two-layer topology.

The model seems to be accurately represented by a first-order Markov chain, where the chain states correspond to the number of learnt maps. This representation easily allows the computation of important statistics of the network, such as $p(\delta)_{pred}$, $\langle\delta\rangle_{pred}$ and $\langle\iota\rangle_{pred}$. Numerical testing found support for the Markov chain representation in all but very small networks.

A direct estimation of the Markov chain order would further contribute to the understanding of the Markov chain representation and its eventual limitations.

An analytical solution was developed for the transition matrix $\mathbf{A}$ in the case of a network with two input nodes $(2, n_m, n_o)$. The performance of the analytical solutions is comparable to the maximum likelihood estimation obtained from simulations of $(2, n_m, n_o)$ in the critical and sub-critical regimes.

Extending the analytical solution to networks with a higher number of inputs would consolidate the Markov chain representation and could lead to a better understanding of the critical regime.

Should the Markov chain representation hold in general, for the simple and advanced learning modes, then the model can be expected to converge to the complete learning state. A convergence requirement is to have separate random depressions in the middle and output layers, which seems to be related to maintaining the Markov property. A lower bound on the number of middle layer nodes is required for maintaining the absorbing property of the Markov chain.

A proper analytical proof of the learning convergence would be useful to establish the learning capability of the model.

The power law tail in the learning time distribution $p(\delta)$ seems limited to the slow change mode. The Markov chain representation provides a characterisation of the critical regime, in which the network spends most of the time in the states from the one where half the mappings are learnt up to the state before absorption. Very rarely are all the mappings unlearnt in the critical regime.
A systematic power law testing of the learning time distribution $p(\delta)$ in the critical regime is necessary to establish the power law tail nature of this distribution.

A parallel with the biological neural avalanches is proposed by re-interpreting the slow change mode, under a very slow temporal scale of changes, as the network recovering from noise rather than learning new mappings.

The permanent tagging of successful weights by the selective punishment rule results in reduced performance on large timescales. A global tag threshold, which places a global limit on the number of times a tagged weight can be wrong before becoming untagged, seems successful in eliminating the ageing effect and is independent of the level of network activity.

The local tag threshold was not developed in the present work; however, its local character potentially results in network size independence, which would be a clear advantage over the global tag threshold.
Bibliography
[1] P. Bak and D.R. Chialvo. Adaptive learning by extremal dynamics and negative feedback. Arxiv preprint cond-mat/0009211, 2000.
[2] Per Bak. How Nature Works: The Science of Self-Organized Criticality. Springer-Verlag, 1996.
[3] A. Barr, E.A. Feigenbaum, and P.R. Cohen. The Handbook of Artificial Intelligence. Addison-Wesley, Reading, MA, 1989.
[4] J.M. Beggs and D. Plenz. Neuronal avalanches in neocortical circuits. Journal of Neuroscience, 23(35):11167-11177, 2003.
[5] J.M. Beggs and D. Plenz. Neuronal avalanches are diverse and precise activity patterns that are stable for many hours in cortical slice cultures. Journal of Neuroscience, 24(22):5216-5229, 2004.
[6] P. Billingsley. Statistical Inference for Markov Processes. University of Chicago Press, 1961.
[7] R.J.C. Bosman, W.A. van Leeuwen, and B. Wemmenhove. Combining Hebbian and reinforcement learning in a minibrain model. Neural Networks, 17(1):29-36, 2004.
[8] D.R. Chialvo and P. Bak. Learning from mistakes. Neuroscience, 90(4):1137-1148, 1999.
[9] A. Clauset, C.R. Shalizi, and M.E.J. Newman. Power-law distributions in empirical data. Arxiv preprint 0706.1062, 2007.
[10] F. Crepel, N. Hemart, D. Jaillard, and H. Daniel. Long-term depression in the cerebellum. Handbook of Brain Theory and Neural Networks, 1998.
[11] I. Csiszar and P.C. Shields. The consistency of the BIC Markov order estimator. Annals of Statistics, pages 1601-1619, 2000.
[12] P. Dayan and L.F. Abbott. Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. MIT Press, 2001.
[13] C.M. Grinstead and J.L. Snell. Introduction to Probability. American Mathematical Society, 1997.
[14] M. Ito. Long-term depression. Annual Review of Neuroscience, 12(1):85-102, 1989.
[15] M.H. Johnson. Developmental Cognitive Neuroscience: An Introduction. Blackwell Publishing, 1997.
[16] K. Klemm, S. Bornholdt, and H.G. Schuster. Beyond Hebb: exclusive-OR and biological learning. Physical Review Letters, 84(13):3013-3016, 2000.
[17] G. Latouche and V. Ramaswami. Introduction to Matrix Analytic Methods in Stochastic Modeling. Society for Industrial and Applied Mathematics, 1999.
[18] M. Minsky. Steps toward artificial intelligence. Proceedings of the IRE, 49(1):8-30, 1961.
[19] Y. Peres and P. Shields. Two new Markov order estimators. Arxiv preprint math/0506080, 2005.
[20] J. Wakeling. Order-disorder transition in the Chialvo-Bak minibrain controlled by network geometry. Physica A: Statistical Mechanics and its Applications, 325(3-4):561-569, 2003.
[21] J.R. Wakeling. Adaptivity and "Per learning". Physica A: Statistical Mechanics and its Applications, 340(4):766-773, 2004.