MSc Project
Marco Brigham
THE UNIVERSITY OF EDINBURGH
Master of Science
Artificial Intelligence
School of Informatics
University of Edinburgh
2009
Abstract
A review of the Chialvo-Bak model is presented for the two-layer neural network topology. A
novel Markov chain representation is proposed that yields several important analytical quantities
and supports a learning convergence argument. The power law regime is re-examined under
this new representation and is found to be limited to learning under small mapping changes.
A parallel between the power law regime and the biological neural avalanches is proposed. A
mechanism to avoid the permanent tagging of synaptic weights by the selective punishment rule
is proposed.
Acknowledgements
I wish to thank Dr. Mark van Rossum for his tireless support and attentive guidance, and for
having accepted to supervise me in the first place.
I wish to thank Dr. J. Michael Herrmann for the very creative and rewarding discussions on the
holistic merits of the Chialvo-Bak model.
I wish to thank Dr. Wolfgang Maass and his team at the Institute for Theoretical Computer
Science at T.U. Graz for the precious feedback and fruitful discussions that followed the first
talk on this MSc project.
Declaration
I declare that this thesis was composed by myself, that the work contained herein is my own
except where explicitly stated otherwise in the text, and that this work has not been submitted
for any other degree or professional qualification except as specified.
(Marco Brigham)
To the memory of Per Bak, whose ideas live on and inspire.
Contents
1 Introduction 2
1.1 Brief literature review . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 The Two-Layer Topology 7
2.1 Basic Principles and Learning . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.1 Interference events . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.2 Synaptic Landscape . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.3 Neural avalanches . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Storing Mappings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.2 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3 Advanced Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.2 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3 Research Results 21
3.1 δ-band Saturation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.1 Desaturation strategies . . . . . . . . . . . . . . . . . . . . . . . 22
3.1.2 Global tag threshold . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 Markov Chain Representation . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.1 Statistical properties . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.2 Markov chain representation: numerical evidence . . . . . . . . . 30
3.2.3 Analytical solution for (2, n_m, n_o) . . . . . . . . . . . . . . . 33
3.2.4 Alternate formulation: graph transitions . . . . . . . . . . . . . . 36
3.2.5 Analytical solution: numerical evidence . . . . . . . . . . . . . . 37
3.2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2.7 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3 Learning Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4 Power-Law Behaviour and Neural Avalanches . . . . . . . . . . . . . . . 44
3.4.1 Biological interpretation . . . . . . . . . . . . . . . . . . . . . . . 45
3.4.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4 Conclusion 47
Chapter 1
Introduction
The Chialvo-Bak model was introduced by P. Bak and D. Chialvo [8] in 1999, with
the stated goal of identifying some universal and simple mechanism which allows
a large number of neurons to connect autonomously in a way that helps the organism
to survive [8]. Their effort resulted in a schematic brain model of self-organised
learning and adaptation that operates on the principle of satisficing [1].
Common to other models authored by P. Bak is a patent minimalism of form,
where models are succinctly defined by simple, local and stochastic interaction rules
that reflect the most basic assumptions about the real-world system. However simple and
minimalistic, these models manage to reproduce complex and emergent behaviour that
is observed in the real-world systems [2].
In the Chialvo-Bak model, the basic properties of neurons and neural networks are
represented by simple, local and stochastic dynamical rules that support the processes
of learning, memory and adaptation. The most basic operations in the model, the
node activation and synaptic plasticity rules, are regulated by Winner-Take-All
(WTA) dynamics and by learning through synaptic depression, respectively. These mechanisms
may correspond to well-accepted physiological mechanisms [14][10][12], which suggests
the biological feasibility of the model.
The present work focuses on extending the analytical understanding of the model.
A Markov chain representation for the simple two-layer topology is proposed, where
the states of the chain correspond to the learning states of the network. This representation
provides a good statistical description of the model and supports an argument
for learning convergence.
A power law tail in the learning time distribution, corresponding to an order-disorder
phase transition in the model, was proposed by J.R. Wakeling [20]. This result was
specific to the slow change mode, where the network is made to learn a succession of
small mapping changes. The power law behaviour was re-examined under the Markov
chain representation for other mapping change modes and was only reproduced in the
slow change mode.
An argument is provided for drawing a parallel between the above power law behaviour
and the biological neural avalanches, evidenced experimentally by J. Beggs and D. Plenz
[4][5] in 2003. These correspond to the propagation of spontaneous activity in neural
networks with power law behaviour in the event size distribution.
The ability to store previously successful configurations is enabled by a selective punishment
mechanism [8][1], where successful synaptic weights are depressed less severely
when no longer leading to the correct mappings. The selective punishment mechanism
has a known ageing effect [1] that is related to the permanent tagging of the successful
synaptic weights. A mechanism to avoid the permanent tagging, in order to maintain
the performance advantage of selective punishment, is proposed.
This document is organised as follows:
Chapter 1 presents a succinct introduction to the Chialvo-Bak model and the relevant
literature.
Chapter 2 introduces the Chialvo-Bak model in the simple two-layer topology, covering
the learning modes, the selective punishment rule and the power-law tail behaviour.
Chapter 3 presents the research results of this MSc project.
Chapter 4 presents the conclusion and future work.
1.1 Brief literature review
A brief review of the research papers related to the Chialvo-Bak model is presented below.
The purpose of this review is to broadly describe the areas of the model that have
already been investigated in considerable depth. Detailed descriptions of the model
that support the present work are provided in Section 2.1.
The literature on the Chialvo-Bak model can be grouped into papers that follow the
original formulation of the model and papers that extend the model to different working
principles and dynamic rules. As the present work is closely aligned with the first
approach, so is the focus of the literature review presented below.
Papers on the original Chialvo-Bak model
The Chialvo-Bak model was introduced by P. Bak and D. Chialvo [8] in 1999. In this
paper, the motivations, biological constraints and ground rules are put forward, and
great emphasis is placed on the biological plausibility of the model, which leads to
requirements of self-organisation at different levels and of robustness to noise.
Self-organised learning and adaptation is required to reflect the ability to learn without
external guidance. The apparent lack of information in the DNA to encode the physical
properties of neurons and synapses and their connectivity [15] motivates the self-organisation
at the connectivity level. Each neuron must learn, without genetic or
external guidance, to which other neurons to connect, and this connectivity should remain
flexible in order to adapt to external changes.
The ability to quickly recover from perturbations induced by biological noise is a constraint
motivated by the biological reality of the organism.
Learning by synaptic depression (negative feedback) is proposed as the basis of biological
learning and adaptation, supported by the following elements:
- Long-term synaptic depression (LTD) is as common in the mammalian brain as
long-term synaptic potentiation (LTP) [8]. The LTD mechanism is the suggested
physiological implementation of learning by synaptic depression.
- Learning by synaptic potentiation leads to very stable synaptic landscapes, from
which adaptation to new configurations is difficult and slow.
- Learning of new tasks or adapting to new environments is error prone; as such, a
process that acts on errors rather than on what is correct leads to faster learning.
The other pillar of the model, the Winner-Take-All (WTA) rule, is inspired by
models of Self-Organised Criticality [2], as a means to drive the system to an adaptive
critical state, where small perturbations can cause very large changes in the synaptic
landscape. The WTA rule also plays a key role in the solution to the credit assignment
problem by keeping the activity of the network low, as detailed in Section 2.1.
The synaptic plasticity changes are driven by a global signal reporting the success
of the latest synaptic changes. The ability of the organism to differentiate between
outcomes is deemed innate to the system and possibly the result of Darwinian selection.
A second paper by the same authors [1], published in 2000, expanded on
several key aspects of the model, such as the network topologies, the memory mechanism
and a new learning rule to tackle more complex problems. The performance
scaling under the new learning rule was also analysed.
Several network topologies and their relevant learning rules were formally defined;
these are illustrated in Figure 1.1.
(a) The simple layered network topology, which is the one used for the present work.
(b) The lattice network topology, where nodes connect to a small number of nodes in
the subsequent layer. (c) The random network topology, where nodes connect randomly
to other nodes.

The learning time distribution p(τ) is normalised such that

∑_{τ=0}^{∞} p(τ) = 1. (2.2)
The basic operation for measuring p(τ) is to record the number of depressions τ required
to learn the current mapping set M, increase by one unit the count of mapping
sets learnt in τ depressions, present a new mapping set M̃ to the network, and so on.
However, certain aspects of the simulation setup have a noticeable impact on
the measured values, and these are discussed in greater detail below.
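The measurement loop described above can be sketched as follows. The weight layout, the sweep-based update schedule and all function names are illustrative assumptions; only the outline of the procedure is given in the text, so the exact update order of the model may differ.

```python
import random

def learn_mapping(target, w_mi, w_om, max_steps=100_000):
    """Count the depressions tau needed until every input node reaches
    its target output node through the Winner-Take-All (WTA) paths.

    w_mi[m][i] and w_om[o][m] are the two layers of synaptic weights;
    target[i] is the desired output node for input node i.
    """
    n_i, n_m, n_o = len(target), len(w_mi), len(w_om)
    tau = 0
    while tau < max_steps:
        all_correct = True
        for i in range(n_i):
            m = max(range(n_m), key=lambda k: w_mi[k][i])  # WTA middle node
            o = max(range(n_o), key=lambda k: w_om[k][m])  # WTA output node
            if o != target[i]:
                # learning by synaptic depression (negative feedback)
                w_mi[m][i] -= random.random()
                w_om[o][m] -= random.random()
                tau += 1
                all_correct = False
        if all_correct:
            break
    return tau
```

Repeating this over many freshly drawn mapping sets and histogramming the returned τ values would yield an estimate of p(τ).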
Figure 2.8 The learning time distributions reveal three distinct regimes: sub-critical,
critical and super-critical. (a) The average learning time for several middle layer sizes
of a network with eight input and eight output nodes: (8,36,9), (8,72,9) and (8,144,9).
Data from 1e+6 runs. (b) The learning time distributions p(τ) for several networks with
the critical middle layer size: (8,72,9), (16,272,17), (32,1056,33) and (64,4160,65).
Data from 1e+6 runs. The critical regime exhibits power law behaviour with
p(τ) ∝ τ^(-1.3) according to [20].
- Sub-critical regime for n_m > n_i n_o
- Critical regime for n_m = n_i n_o
- Super-critical regime for n_m < n_i n_o
In [20], J.R. Wakeling proposed that the power law tail of p(τ) corresponds to an
order-disorder phase transition in the model, and that the key difference in the learning
dynamics of the three operating regimes is the interference probability:
- In sub-critical networks, there are enough middle layer nodes for interference
events to be quite rare, and therefore learning is very quick.
- In super-critical networks, there are hardly enough middle layer nodes to learn
without inducing interference events, and therefore learning is extremely slow.
- The learning dynamics for critical network sizes lies in-between the other two
regimes, with just enough interference to occasionally cause large learning times,
while most of the time the learning times are quite fast.
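The regime boundaries above can be checked against the network sizes quoted in Figure 2.8b, which all sit exactly at the critical relation n_m = n_i n_o. A quick sketch (the function name is illustrative):

```python
def regime(n_i, n_m, n_o):
    """Classify a (n_i, n_m, n_o) network by its middle layer size
    relative to the critical value n_i * n_o."""
    critical = n_i * n_o
    if n_m > critical:
        return "sub-critical"
    if n_m == critical:
        return "critical"
    return "super-critical"

# The four networks from Figure 2.8b are all at the critical size:
for n in [(8, 72, 9), (16, 272, 17), (32, 1056, 33), (64, 4160, 65)]:
    assert regime(*n) == "critical"
```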
However, it should be noted that the model has not been proved to be critical in the
proper statistical physics sense, in order to merit such terminology. Assessing the criticality
of the model in the two-layer topology is certainly challenging.
Furthermore, the approximately straight segments in the distributions of Figure 2.8b do
not necessarily imply that p(τ) is a proper power law tail distribution, as very clearly
explained in the paper [9] by Clauset, Shalizi and Newman. Straight segments in a
log-log plot are a necessary but not sufficient condition for p(τ) to be a power law tail
distribution. Due to time constraints, however, no conclusive power law testing was
completed for p(τ), and in consequence the terminology proposed in [20] is adopted
throughout the document.
2.1.4 Summary
The key elements of this section are the following:
- Network dynamics: Defined by Winner-Take-All dynamics and learning by
synaptic depression (negative feedback).
- Input-output learning: The network is able to learn an arbitrary mapping set
M, where to each input node i corresponds an output node o = M(i).
- Local flagging mechanism: Plasticity changes are locally marked for recent
activity.
- Global feedback mechanism: Feedback is provided in the form of a global
feedback signal specifying whether the most recent changes are unsatisfactory.
- Solution to the credit-assignment problem: Requires sparse network activity,
so that no plasticity changes occur until a global feedback signal is received.
- Two typical timescales: The network signalling occurs on the timescale of
the firing patterns, while the tagging and feedback mechanisms operate on a much
longer timescale that is relevant to the scale of events in the external world.
- Interference events: The learning of input-output mappings can be disrupted
by the unlearning of previously learnt mappings.
- Metastable synaptic landscape: The active configuration is barely supported
by the winning weights.
- Neural avalanches: For middle layer sizes n_m = n_i n_o the network displays
power-law behaviour in the learning time distribution p(τ).
2.2 Storing Mappings
The plasticity rules introduced in Section 2.1 enable the network to learn a random
mapping set M, and to quickly adapt to another mapping set M̃ whenever needed. Not
much information is left [1] in the synaptic weights to reliably retrieve M at a later
stage, since the active weights that supported M were depressed^3 a random amount
in [0, 1] to support the new mapping set M̃.
An additional mechanism is therefore required to store the information from previously
learnt mapping sets for later recall. It turns out that such a mechanism exists
and amounts to depressing less severely the weights that have been successful in the past;
it is called the selective punishment rule [8][1].
The selective punishment rule requires small modifications to the plasticity rules to
enable the distinction between successful and unsuccessful weights:
1. A random input node i is selected.
2. Input node i fires and activates a middle layer node m and an output node o according
to the WTA rule.
3. If output node o is correct, i.e. o = M(i), tag the weights w(m, i) and w(o, m) as
successful and return to step 1.
4. Otherwise depress the active weights w(m, i) and w(o, m) by:
   - A random amount in [0, 1] if w(m, i) and w(o, m) have never been successful.
   - A random amount in [0, δ] otherwise.
   Return to step 1.

^3 More specifically, the active weights that are not shared by M and M̃.
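The depression of step 4 can be sketched as below. The weight containers, the tag bookkeeping via a set of coordinates and the value δ = 0.001 are illustrative assumptions, not prescriptions from the text:

```python
import random

DELTA = 0.001  # illustrative delta-band width

def depress_active(w_mi, w_om, i, m, o, tagged):
    """Depress the active weights w(m, i) and w(o, m) after an incorrect
    output: by a random amount in [0, 1] if the weight was never
    successful, and in [0, DELTA] (selective punishment) otherwise."""
    for layer, grid, row, col in (("mi", w_mi, m, i), ("om", w_om, o, m)):
        bound = DELTA if (layer, row, col) in tagged else 1.0
        grid[row][col] -= random.uniform(0.0, bound)
```

Tagged weights are thus nudged down in tiny steps and stay close to the top of the weight range, which is what confines them to the δ-band.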
In the Chialvo-Bak model, recalling a mapping set M refers to a different operation
than in other neural network models. Since synaptic plasticity is required to retrieve
the information stored in the synaptic weights, the network is re-adapting to a
previously seen mapping set rather than recalling it. Nevertheless, in order to
distinguish from the learning rules without selective plasticity, the term recall will be
used.
Figure 2.9 The number of depressions τ required to first learn and then recall four
random mapping sets M_1, ..., M_4 [(6,36,6), runs: 50, rand]. (a) Example of learn
and recall performance without selective punishment. (b) Example of learn and recall
performance with the selective punishment rule. The network is presented with the
mappings in random succession, and the value of τ is recorded for each graph, i.e. at
recall = 10 the network has seen each mapping set 10 times.
The selective punishment rule results in a dramatic performance increase (under the
random mapping change mode), as shown in Figure 2.9. This performance increase
results from the network establishing preferred paths from each input node to the output
nodes required by the mappings being presented. These preferred paths are the
first to be queried when the active configuration is no longer correct. A detailed example
of the selective punishment dynamics is presented in the Appendix of this section.

The weights tagged by the selective punishment rule are constrained to a region^4 within
a distance δ from unity, as shown in Figure 2.10. This region is referred to as the δ-band.
^4 The uniform distribution of weights in the δ-band in Figure 2.10 results from depressing the
tagged weights a fixed amount δ rather than a random amount in [0, δ]. In the latter case, the resulting
distribution of weights in the δ-band would be similar to that of Figure 2.5b.

Figure 2.10 The weights tagged by the selective punishment rule are kept within the
δ-band located in [1 − δ, 1]. This is where the memory of previously successful mappings
is stored [(32,1024,32), runs: 1e+5, bins: 5e+4, select rand]. Discrete distribution of
500 bins.

2.2.1 Summary
The key elements of this section are the following:
- Selective punishment rule: Recalling previously learnt mapping sets is enabled
by depressing weights that have been successful in the past by a smaller random
amount δ when they no longer lead to the desired mapping.
- Selective punishment dynamics: The performance increase results from the
network establishing preferred paths from input nodes to output nodes as required
by the learnt mapping sets. On average these preferred paths are queried much
more often.
- δ-band: Contains the weights representing the memory of previously successful
mappings and is located at [1 − δ, 1].
2.2.2 Appendix
The example from Figure 2.9 will be used to illustrate the dynamics of selective punishment.
Suppose that M_1, ..., M_4 are presented to the network in that order and
require input one to activate output nodes {1, 3, 6, 6} respectively.
After learning the mapping from input one to the output node required by M_1, the winning
weights w(m_1, 1) and w(1, m_1) are tagged by the selective punishment rule.
When presented with M_2, input node one should now activate output node three, and
the weights w(m_1, 1) and w(1, m_1) are depressed accordingly. This results in a series
of successive depressions to bring these weights slightly below the respective second
highest weights.
This succession of depressions is necessary since the average weight distance for this
network is 1/n_m ≈ 0.03 for the weight set w(m, i) and 1/n_o ≈ 0.14 for the weight set
w(o, m), whereas w(m_1, 1) and w(1, m_1) are now depressed by small random amounts
in [0, δ] with δ ≈ 0.001. This accounts for the relatively higher adaptation values for learning
mappings M_2, M_3 and M_4 in Figure 2.9b, when compared to Figure 2.9a.
This also illustrates the negative performance impact that synaptic potentiation would
have in this network, since it would lead to large weight differences (in units of depression
amounts) between the active weights and the other weights. A metastable synaptic
landscape, such as the one illustrated in Figure 2.5a, is a requirement for the network
to quickly converge to a new mapping set.
Eventually, either w(m_1, 1) or w(1, m_1) will be depressed below the other weights.
Supposing that w(m_1, 1) is first, then input node one switches from middle layer node m_1
to another middle layer node m̃, which has probability 1/n_o of activating output node
three. If that is the case, input node one has learnt the correct mapping for M_2
and the weights w(m̃, 1) and w(3, m̃) are also tagged as successful.
If middle layer node m̃ does not lead to output node three, it is depressed accordingly.
Input node one is then very likely to switch back to middle layer node m_1, which
will still activate output node one, unless weight w(1, m_1) is already the second highest
weight of middle layer node m_1, in which case it has a chance of activating a different
output node.
If output node three is still not found after a few more depressions, the search sequence
will then alternate between middle layer node m_1 and other middle layer nodes,
and the output node of middle layer node m_1 will alternate between output node one
and the other output nodes.
After learning the mapping sets M_1, ..., M_4 for the first time, each input node has
one or more preferred paths formed by the pairs of tagged weights that lead to the
required output nodes. The network will poll these weights much more frequently.
The example above can be easily generalised to other input nodes and to the case where
weight w(1, m_1) reaches the second highest weight before weight w(m_1, 1) does.
A sample run where the mapping sets M_1, ..., M_4 required input one to activate output
nodes {1, 3, 6, 6} resulted in the preferred paths for input node one shown in Table 2.1:
Preferred paths from input 1
To middle layer node m    From middle layer node m to output node o
3                         6
12                        2, 6
30                        6
31                        1
34                        3
Table 2.1 The successful weights tagged by the selective punishment rule result in a set
of preferred paths for input node 1.
The preferred path to output node two, which is not required for input node one, was
added by input node two when learning mapping set M_4.
2.3 Advanced Learning
The type of learning problem that the model can tackle so far is better described as
solving a routing problem: given a mapping set M, the input nodes i have to find paths
to the output nodes M(i).
A more advanced type of learning consists in considering mapping sets between
input node activation patterns P and the specific activation of output nodes o = F(P).
As before, the state of each input node can be active (x_i = 1) or inactive (x_i = 0), and the
entire configuration of input nodes is represented by a binary vector P. For example,
P = {1, 0, 0, 1, 0, 1} is an activation pattern for a network with six input nodes.
Learning the basic Boolean functions F ∈ {AND, OR, XOR, NAND, ...} is a particular
example of this type of learning. The logical values of propositions p and q are
represented by the states of two input nodes, and the logical value of the function F(p, q)
is represented by the activation of one of the two output nodes.
are surprisingly small and amount to a slightly modied Winner-Take-All rule:
Input conguration 1 = {r
1
, , r
j
i
n(,, i) r(i).
Middle node : activates the output node o with maximum n(o, :), as before.
A bias input node that is always active is necessary in order to compute the state where
the remaining input nodes are inactive.
An example of the network solving the XOR problem under the above plasticity rules
is shown in Figure 2.11. An example of weights that implement a solution to the XOR
problem are shown in Table 2.2 of the Appendix of this section.
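The modified Winner-Take-All rule can be sketched as follows, assuming weights stored as nested lists and a pattern vector x whose last entry is the always-active bias node (names and layout are assumptions):

```python
def wta_pattern(w_mi, w_om, x):
    """Modified Winner-Take-All for an input activation pattern x.

    The middle layer node m maximising sum_i w(m, i) * x_i fires, and
    then the output node with the largest w(o, m) fires, as in the
    routing case.
    """
    n_m, n_o = len(w_mi), len(w_om)
    m = max(range(n_m), key=lambda k: sum(w * xi for w, xi in zip(w_mi[k], x)))
    o = max(range(n_o), key=lambda k: w_om[k][m])
    return m, o
```

Because only the first WTA step changes, the depression and tagging rules apply unchanged to the winning pair of weights.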
Figure 2.11 An example active configuration implementing the XOR truth table.

The network can learn the XOR problem with three middle layer nodes. In general, it
can learn any mapping with n_m = 2^{n_i} middle layer nodes.
Figure 3.1 The ageing effect on the selective punishment performance at large time
scales [(6,36,6), select rand], for mapping sets M_1, ..., M_4. (a) At short time scales
(runs: 50) the selective punishment rule drastically improves the ability to quickly recall
previously learnt mapping sets. (b) At long time scales (runs: 2000) the recall
performance degrades progressively.
As before, the term recall will refer to the re-adaptation of the network to a previously
learnt mapping set M.
The performance degradation of selective punishment is caused by the saturation of
the δ-band, which is the region within a distance δ from unity to which the weights
tagged as successful are constrained.
For the selective punishment rule to be effective, each input node should be able to
quickly sort through the tagged weights to recover the preferred path to the correct
output node. Ideally, each input node would have established one preferred path for
each previously learnt output node. As such, it would take a number of depressions
in the order of the number of preferred paths to find the correct output node.
On the other hand, if the number of preferred paths leading to the same output node
grows further, the advantage of the tagging mechanism in identifying the preferred middle
layer nodes ceases to be effective.
The increase in paths leading to the same output node is a consequence of all weights
eventually being given a chance to participate in correct mappings. Consider the continuous
raising of the weights caused by the depressions of the plasticity rules. Each
raising step is the difference between the highest weight and the second highest. Eventually,
all weights end up in the δ-band and are soon able to compete with the tagged
weights for participation in a correct configuration. Once that occurs, one additional
path to the output node is created.
The monotonic increase of tagged weights leads to a saturation of the δ-band, as
increasingly more weights are confined to that region of weight space. This effect can
be appreciated in Figure 3.2a, and the corresponding increase in recall times is illustrated
in Figure 3.2b.
Figure 3.2 As the percentage of weights tagged as correct increases, the recall times
approach the performance of the regime without selective punishment [(6,36,6),
runs: 100×250, map: 128, rand]. (a) δ-band saturation: the number of tagged weights
increases monotonically and leads to a saturation of the δ-band. (b) Average recall
time: the performance of the selective punishment rule degrades with successive recalls.
3.1.1 Desaturation strategies
The monotonic increase of tagged weights is a consequence of the permanent tagging
of the selective punishment rule. As such, a mechanism is required to reduce the tagging
lifespan.
P. Bak and D. Chialvo proposed [1] a mechanism of neuron ageing to tackle this issue,
where nodes are replaced at a fixed rate, their weights randomised and the tagging
information removed. However, the neuron replacement rate may need to depend on the
level of network activity in order to successfully counter the saturation rate of the
δ-band.
Several strategies that result in non-permanent tagging were reviewed, and the first
of them was selected for implementation:
- Global tag threshold: weights are untagged if not successful for more than a
global threshold of depressions.
- Local tag threshold: as the previous, but the threshold depends on the past
performance of each weight.
- Interference correction: weights are untagged after a threshold number of interference
events.
3.1.2 Global tag threshold
The global tag threshold has been investigated in greater detail, and numerical simulations
suggest that an optimum threshold value exists for each network size.
In order to ensure that the optimal tag threshold value does not depend on the activity
level of the network, an increasing number of mappings was presented to networks of
the same size. The performance of the optimal tag threshold value was consistent across
the number of presented mapping sets, as illustrated in Figure 3.3. For the network
(6, 36, 6) the optimal tag threshold value is close to 48.
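A minimal sketch of the global tag threshold bookkeeping is given below. The reset-on-success behaviour, the container layout and the function name are assumptions; 48 is used as the default threshold since that is the optimum reported for the (6, 36, 6) network:

```python
def update_tags(fail_count, tagged, key, success, threshold=48):
    """Global tag threshold bookkeeping for a single weight `key`.

    A tagged weight that fails `threshold` times without an intervening
    success loses its tag, and with it the delta-band protection of the
    selective punishment rule.
    """
    if success:
        tagged.add(key)
        fail_count[key] = 0  # a success restarts the failure count
    elif key in tagged:
        fail_count[key] = fail_count.get(key, 0) + 1
        if fail_count[key] >= threshold:
            tagged.discard(key)       # forget this preferred path
            fail_count.pop(key, None)
```

Untagged weights then revert to full-strength depressions, which drives them out of the δ-band and desaturates it.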
Figure 3.3 For the network (6, 36, 6) the optimal global tag threshold value is close to
48 [runs: 50×250, rand; threshold values 16, 32, 48, 64, 80]. (a) The average saturation
of the δ-band for the global tag threshold values is consistent across the number of
presented mapping sets. (b) The average recall time τ for the global tag threshold
values is consistent across the number of presented mapping sets.
For networks with a larger number of input nodes n_i the optimal tag threshold value is
also higher. This was verified for the network (12, 144, 12), where the optimal tag
threshold value is around 64. This is illustrated in Figure 3.4.
The optimal tag threshold seems closely related to an optimal average number of tagged
middle layer nodes and tagged output nodes behind them, as illustrated in Figure 3.5.

Figure 3.4 For the network (12, 144, 12) the optimal global tag threshold value is around
64 [runs: 50×250, rand; threshold values 32, 48, 64, 80]. (a) The average saturation of
the δ-band for the global tag threshold values. (b) The average recall time τ for the
global tag threshold values.
Tag threshold values that are too low result in the network forgetting successful nodes
too quickly, as can be observed from the sharp decrease in the average recall time τ in
Figure 3.5. Tag threshold values that are too high fail to get rid of path redundancy
fast enough, which results in an increase of the average recall time τ, as illustrated in
Figure 3.4b for the tag threshold value of 80, for example. Somewhere in-between lies
the optimal value.
The blue line in Figure 3.5b represents the average number of tagged middle layer
nodes per input node. From Table 3.1, the number of tagged middle layer nodes surpasses
the number of output nodes n_o from tag threshold 32 and above.
The green line of the same graph represents the average number of tagged output
nodes per tagged middle layer node. This value decreases until tag threshold 32,
where the network compensates for the lack of enough tagged middle layer nodes
by tagging more output nodes per tagged middle layer node. Near tag threshold 32 a
minimum is reached, and the value starts growing again as higher tag threshold values
cannot get rid of path redundancy fast enough.
The best tag threshold found for this network produced an average of 7.9 tagged middle
layer nodes, which is nearly two nodes more than the number of output nodes j
c
. The
full results are presented in Table 3.1 in the Appendix of this Section.
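The global tag threshold mechanics discussed above can be sketched as a per-weight error counter that removes the tag once the weight has been wrong a threshold number of times. All class and method names below are hypothetical illustrations, not the thesis implementation:

```python
# Sketch of the global tag threshold rule: a tagged weight accumulates
# an error count each time it lies on an unsuccessful path, and loses
# its tag once the count reaches the global threshold, so permanently
# tagged weights (and c-band saturation) are avoided.
class TaggedWeight:
    def __init__(self, threshold):
        self.threshold = threshold  # global tag threshold, e.g. 64
        self.tagged = False
        self.errors = 0

    def tag(self):
        """Tag the weight after a successful mapping, resetting errors."""
        self.tagged = True
        self.errors = 0

    def punish(self):
        """Called when the weight is on a wrong path."""
        if self.tagged:
            self.errors += 1
            if self.errors >= self.threshold:
                self.tagged = False  # forget: the weight leaves the c-band

w = TaggedWeight(threshold=64)
w.tag()
for _ in range(63):
    w.punish()
assert w.tagged          # still tagged one step before the threshold
w.punish()
assert not w.tagged      # untagged at the 64th error
```
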
3.1.3 Summary
The key elements of this section are the following:
c-band saturation: The permanent tagging of successful weights results in a monotonic increase of the weights constrained to the c-band region, thereby reducing the performance advantage of the selective punishment rule.
Desaturation strategies: Several desaturation strategies are possible; they amount to capping the tagging lifetime.

Global Threshold    Average Middle Layer    Average Output Layer
16                  4.8350                  1.5447
24                  5.3833                  1.5129
32                  6.0450                  1.4118
40                  6.9167                  1.4135
48                  7.9133                  1.5362
56                  9.3767                  1.6998
64                  10.9283                 1.9372
72                  12.4167                 2.1066
80                  14.4617                 2.3799

Table 3.1 Average number of tagged middle layer nodes per input node and average number of tagged output nodes per tagged middle layer node for different values of the global tag threshold in the network (6, 36, 6).

Figure 3.5 The relation between the optimal global tag threshold and the average number of tagged middle layer nodes and the average number of tagged output nodes per tagged middle layer node [(6,36,6), runs: 100x250, map: 128, rand].
Global tag threshold: Sets a global limit on the number of times a tagged
weight can be wrong before becoming untagged. There is an optimal tag threshold
for each network size that is independent of the level of network activity.
Global tag threshold dynamics: The optimal value of the global threshold is
related to the average number of tagged middle layer nodes and average number
of tagged output nodes behind them.
3.2 Markov Chain Representation

A Chialvo-Bak network can be represented by a first-order Markov chain when considering the evolution of the network as a sequence of learnt-mapping states. Such a representation is useful for deriving several statistical properties of the model analytically, such as the average learning time ⟨τ⟩, the learning time distribution p(τ) and the average number of interference events ⟨I⟩.

The Markov chain representation seems appropriate since the evolution of the network is to a large extent stochastic. The random amounts of depression from the plasticity rules result in changes to the active configuration C, which is the basic macroscopic state of the network. Assuming that the transition from an active configuration C to a new active configuration C' is stochastic and depends only on C, the evolution of the network can be described by a first-order Markov chain (a k-th order Markov chain would depend on the k previous steps).
Rather than considering the evolution between active configurations C, a more meaningful basic state for the Markov chain representation is the learning state S_n of the network, i.e. the number n of currently learnt mappings. For each active configuration C the learning state S_n is determined by counting the number n of learnt mappings of C. As such, the correspondence between C and S_n maintains the Markov chain properties mentioned above.

The Markov chain has n_i + 1 states, noted S_0, S_1, ..., S_{n_i}, corresponding to learning from zero up to n_i mappings.
In this context, learning a new mapping set M corresponds to the Markov chain starting from an initial state S_i and evolving towards the final state S_{n_i}. The final state is absorbing: no transition is possible out of S_{n_i}. From a state S_n the network can learn one or two mappings in a single step, giving S_{n+1} and S_{n+2}, and it can unlearn one single mapping, giving S_{n-1}.
For networks with three input nodes, all possible transitions are shown in Figure 3.8, where the arrows indicate the direction of the transitions.

Figure 3.8 All the possible transitions for a network with three input nodes. The arrows indicate the direction of the transitions. No transition is possible from S_3 to any of the other states.
A Markov chain is completely determined by the state transition matrix A, whose columns specify the transition probabilities between the states S_n, and the initial state probability vector p, which specifies the initial state probabilities Pr(S(0)).

The element a_{mn} of the state transition matrix A is the probability Pr(S_m | S_n) of the transition S_n → S_m. The element p_k of the column vector p is the probability Pr(S(0) = S_{k-1}) of starting the chain in state S_{k-1}.
For the elements of A and p to represent valid probabilities the following must hold:

Σ_{m=0..n_i} a_{mn} = 1 for any column index n of A   (3.1)

Σ_{k=0..n_i} p_k = 1   (3.2)
For a network with three input nodes, A and p have the following form:

A =
    a_00  a_01  0     0
    a_10  a_11  a_12  0
    a_20  a_21  a_22  0
    0     a_31  a_32  1

and p = (p_0 p_1 p_2 p_3)^T,

where a_30 = a_02 = 0 since the corresponding transitions are not possible (see Figure 3.8).

For example, running this network in the slow change mode, where only one mapping is changed each time, corresponds to the initial state probability vector p = (0 0 1 0)^T.
For a general network (n_i, n_m, n_o), A is an (n_i + 1) x (n_i + 1) banded matrix in which each column n has nonzero entries only in rows n-1, n, n+1 and n+2, and the last column corresponds to the absorbing state:

A =
    a_00  a_01  0     0    ...  0                 0
    a_10  a_11  a_12  0    ...  0                 0
    a_20  a_21  a_22  .         .                 .
    0     a_31  a_32  .         .                 .
    0     0     a_42  .         a_{n_i-2,n_i-1}   0
    .     .     .     .         a_{n_i-1,n_i-1}   0
    0     0     0     ...       a_{n_i,n_i-1}     1

and p = (p_0 p_1 ... p_{n_i})^T,

with a_{(n-2)n} = a_{(n-3)n} = ... = a_{0n} = 0 and a_{(n+3)n} = ... = a_{n_i n} = 0, since the corresponding transitions are impossible (for the appropriate values of n).
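The band structure described above can be sketched numerically. The following builds a random column-stochastic matrix with exactly this sparsity pattern; the values are arbitrary, only the structure matters:

```python
import numpy as np

# Sketch: each column n of A has nonzero entries only in rows
# n-1, n, n+1, n+2, and the last column is the absorbing state.
def random_banded_A(n_i, rng):
    A = np.zeros((n_i + 1, n_i + 1))
    for n in range(n_i):  # transient columns
        rows = [m for m in (n - 1, n, n + 1, n + 2) if 0 <= m <= n_i]
        A[rows, n] = rng.random(len(rows))
        A[:, n] /= A[:, n].sum()      # normalise each column, Eq. (3.1)
    A[n_i, n_i] = 1.0                 # absorbing state S_{n_i}
    return A

A = random_banded_A(6, np.random.default_rng(0))
assert np.allclose(A.sum(axis=0), 1.0)   # column-stochastic
assert A[0, 2] == 0.0                    # a_{0,2} forbidden (unlearn two)
assert A[5, 1] == 0.0                    # a_{5,1} forbidden (learn four)
```
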
3.2.1 Statistical properties

The state transition matrix A and the initial state probability vector p enable the statistics of the Markov chain to be computed in a very straightforward way. For detailed analytical derivations see [13] and [17], for example.

For example, the elements of the n-th power of A, noted A^n, yield the transition probabilities to a given state in n steps, i.e. the element a^(n)_{ij} of A^n is the transition probability from state S_j to state S_i in n steps.
To motivate the above statement, consider the previous example of the network with three inputs. The probability Pr^(2)(S_2 | S_1) of going from state S_1 to state S_2 in two steps is computed by summing over the accessible intermediate states:

Pr^(2)(S_2 | S_1) = Pr(S_2 | S_0) Pr(S_0 | S_1) + Pr(S_2 | S_1) Pr(S_1 | S_1)
                  + Pr(S_2 | S_2) Pr(S_2 | S_1) + Pr(S_2 | S_3) Pr(S_3 | S_1)   (3.3)

which in terms of the transition matrix elements is written:

Pr^(2)(S_2 | S_1) = a_20 a_01 + a_21 a_11 + a_22 a_21 + a_23 a_31,   (3.4)

where the last term vanishes since a_23 = 0. The last expression is the product of row 2 of A with column 1 of A, which is the element (2, 1) of the product A·A = A^2 of A with itself.
An important observation is that the above computation relied explicitly on the defining properties of the Markov chain: step S(t + 1) is completely determined by step S(t) and the state transition probability matrix A. This allowed Pr^(2)(S_2 | S_1) to be factored in terms of the accessible intermediate states in Eq. (3.3), and the resulting probabilities to be associated with the elements of A in Eq. (3.4). This will be useful when interpreting the numerical evidence for the Markov chain representation.
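The factorisation can be checked numerically. The sketch below uses a made-up column-stochastic matrix with the sparsity pattern of the three-input example; the numeric values are not from the thesis:

```python
import numpy as np

# Hypothetical column-stochastic transition matrix for a three-input
# network, with a_30 = a_02 = 0 and absorbing last column.
A = np.array([
    [0.50, 0.10, 0.00, 0.0],
    [0.30, 0.60, 0.15, 0.0],
    [0.20, 0.20, 0.80, 0.0],
    [0.00, 0.10, 0.05, 1.0],
])

# Two-step probability Pr^(2)(S_2 | S_1): sum over intermediate states.
p2 = sum(A[2, k] * A[k, 1] for k in range(4))

# This is exactly the element (2, 1) of A @ A.
assert np.isclose(p2, (A @ A)[2, 1])
print(p2)  # 0.30 for these made-up values
```
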
The powers of matrix A lead to a straightforward computation of the learning time distribution p(τ), since the last row of A^n gives the transition probabilities to the absorbing state S_{n_i} in n steps. Writing A in block form,

A =
    T  0
    F  1

where T is the sub-matrix of transition probabilities between the non-absorbing states and F is the sub-matrix row of transition probabilities to the final state S_{n_i}.

The powers of matrix T are obtained from A^n, and its elements t^(n)_{ij} correspond to the transition probability in n steps from the non-absorbing state S_j to the non-absorbing state S_i:

A^n =
    T^n  0
    *    1

where * represents the last row of A^n except the element 1.

The learning time distribution p(τ) can be expressed in terms of the sub-matrix T^n, since the element j of the last row of A^n is equal to one minus the sum of the j-th column of T^n, i.e. a^(n)_{n_i j} = 1 - Σ_i t^(n)_{ij}.

Rewriting the first part of Eq. (3.5) in terms of T^n yields:

p(τ ≤ n) = 1 - 1^T T^n p̃,  n ≥ 0   (3.6)

where p̃ = (p_0 p_1 ... p_{n_i - 1})^T are the components of p except the probability of starting in the final state S_{n_i}.
The last property yields the expectation of the learning time ⟨τ⟩, since adding the entries of column j of the fundamental matrix N = (I - T)^{-1} gives the expected number of steps spent in the transient states before reaching the absorbing state S_{n_i}, when starting from state S_j:

⟨τ⟩ = 1^T N p̃.   (3.7)

The average number of interference events follows from the same matrix:

⟨I⟩ = u^T N p̃,   (3.8)

where u = (0 a_01 a_12 ... a_{n_i-2, n_i-1})^T is the column vector with the elements of A corresponding to interference events.
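A minimal numerical sketch of these quantities, assuming the standard absorbing-chain expressions p(τ ≤ n) = 1 - 1^T T^n p̃, ⟨τ⟩ = 1^T N p̃ and ⟨I⟩ = u^T N p̃ with N = (I - T)^{-1}; the matrix values are made up, not taken from the thesis:

```python
import numpy as np

# Made-up column-stochastic matrix with the structure described in the
# text (absorbing last column, a_30 = a_02 = 0).
A = np.array([
    [0.50, 0.10, 0.00, 0.0],
    [0.30, 0.60, 0.15, 0.0],
    [0.20, 0.20, 0.80, 0.0],
    [0.00, 0.10, 0.05, 1.0],
])
p = np.array([0.0, 0.0, 1.0, 0.0])   # slow change mode: start in S_2

T = A[:3, :3]                        # transient-to-transient block
p_t = p[:3]                          # p without the absorbing entry
ones = np.ones(3)

def p_tau_le(n):
    """Cumulative learning time distribution, cf. Eq. (3.6)."""
    return 1.0 - ones @ np.linalg.matrix_power(T, n) @ p_t

# Fundamental matrix N = (I - T)^-1
N = np.linalg.inv(np.eye(3) - T)

# Average learning time, cf. Eq. (3.7)
tau_mean = ones @ N @ p_t

# Average interference events, cf. Eq. (3.8); u_j is the probability of
# an unlearning (interference) transition out of transient state S_j
u = np.array([0.0, A[0, 1], A[1, 2]])
I_mean = u @ N @ p_t

print(p_tau_le(1), tau_mean, I_mean)
```

For this example, p(τ ≤ 1) equals the one-step absorption probability a_32 = 0.05, as expected.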
3.2.2 Markov chain representation: numerical evidence

To test the validity of the Markov chain representation the following experiment was performed.

A test set of networks was selected with input layer size n_i ranging from one to twelve input nodes. For each input layer size n_i, the output layer size was n_i + 1, and three middle layer sizes corresponding to the three regimes (sub-critical, critical, super-critical) were used: n_m^sub = 2 n_m^critical and n_m^super = n_m^critical / 2.

Each network was simulated a large number of times and the following statistics were recorded: the state transition counts, the initial state counts, the learning time distribution p(τ), the average learning time ⟨τ⟩ and the average number of interference events ⟨I⟩. The state transition and initial state counts can be extracted since the network evolves in discrete steps, and at each discrete step the learning state can be computed from the active configuration.

The measured state transition and initial state counts are normalised according to Eqs. (3.1) and (3.2), yielding the maximum likelihood estimators [6] A_MLE and p_MLE, respectively.
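The normalisation step can be sketched as follows; the count arrays are made-up stand-ins for the recorded simulation statistics:

```python
import numpy as np

# Maximum likelihood estimation of A and p from simulation counts:
# normalise the transition counts column-wise (Eq. 3.1) and the initial
# state counts to unit sum (Eq. 3.2).
transition_counts = np.array([
    [50,  12,   0, 0],
    [30,  70,  20, 0],
    [20,  25, 110, 0],
    [ 0,  13,  10, 1],
], dtype=float)
initial_counts = np.array([5.0, 10.0, 80.0, 5.0])

A_mle = transition_counts / transition_counts.sum(axis=0, keepdims=True)
p_mle = initial_counts / initial_counts.sum()

assert np.allclose(A_mle.sum(axis=0), 1.0)   # each column is a distribution
assert np.isclose(p_mle.sum(), 1.0)
```
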
For each measured A_MLE and p_MLE the quantities p(τ)_pred, ⟨τ⟩_pred and ⟨I⟩_pred are computed from Eqs. (3.6), (3.7) and (3.8) respectively.

The measured values are compared to the computed values using the normalised root mean square error (NRMSE), i.e. the root mean square error of the predictions divided by the range x_max - x_min of the measured values, and the Kolmogorov-Smirnov statistic D, the maximum absolute difference between the measured and predicted cumulative distributions of p(τ).
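A sketch of the two comparison metrics, assuming NRMSE is the root mean square error normalised by the range of the measured values and D is the maximum distance between the two cumulative distributions; the data arrays below are made up for illustration:

```python
import numpy as np

def nrmse(measured, predicted):
    """Root mean square error normalised by the measured range."""
    rmse = np.sqrt(np.mean((measured - predicted) ** 2))
    return rmse / (measured.max() - measured.min())

def ks_statistic(p_measured, p_predicted):
    """Kolmogorov-Smirnov distance between two probability mass
    functions defined on the same support."""
    cdf_m = np.cumsum(p_measured)
    cdf_p = np.cumsum(p_predicted)
    return np.max(np.abs(cdf_m - cdf_p))

measured = np.array([10.0, 12.0, 15.0, 20.0])
predicted = np.array([11.0, 12.0, 14.0, 21.0])
pm = np.array([0.1, 0.4, 0.3, 0.2])
pp = np.array([0.2, 0.3, 0.3, 0.2])
print(nrmse(measured, predicted), ks_statistic(pm, pp))
```
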
In order to obtain enough state transition counts for each state to enable a good estimation of A_MLE, the mappings were presented to the network with a uniform initial state probability (except for the absorbing state S_{n_i}).

Figure 3.9 The learning time distribution p(τ) for a uniform initial state probability (except for the state S_{n_i}). (a) Super-critical network (4, 10, 5). (b) Sub-critical network (12, 312, 13) [runs: 1e6, uniform].

For a network with two input nodes (2, n_m, n_o), A and p have the following form:

A =
    a_00  a_01  0
    a_10  a_11  0
    a_20  a_21  1

and p = (p_0 p_1 p_2)^T.   (3.9)
Each state S_n can be further separated into sub-states S^d_n corresponding to the graphs with d degrees of freedom in the middle layer. The basic graphs S^d_n for (2, n_m, n_o) are illustrated in Figure 3.11. For simplicity, these sub-states are simply referred to as the graphs of state S_n.
Figure 3.11 The basic graphs for the two input networks (2, n_m, n_o).

The number of active configurations corresponding to each graph is:

S^1_0 = n_m n_o (n_o - 2)
S^1_1 = n_i n_m n_o
S^2_0 = n_m (n_m - 1)(n_o - 1)^2
S^2_1 = n_i n_m (n_m - 1)(n_o - 1)
S^2_2 = n_m (n_m - 1)

where the relative weight of a graph S^d_n is obtained by dividing its count by the total S^1_0 + S^2_0 + S^1_1 + S^2_1 + S^2_2.
One final element before computing the transition probabilities is to determine the probability p_0 of reselecting the same node after depression, and the probability 1 - p_0 of selecting a different node. The reselection probability is given by p_0 = 1/(2n - 1), where n is the number of nodes in the layer, and this value can only be related to the particular weight distribution of this model, as shown in Figure 2.5b. Numerical evidence for this value of p_0 is shown in Figure 3.12.

Figure 3.12 The average reselection probability for networks (2, *, 10). The measured probability (in blue) follows the curve 1/(2n - 1) (in green). The curve 1/n is shown for comparison (in red). Data collected for 1e4 depressions in each network.
The element a_01 of the transition matrix A for (2, n_m, n_o) can be obtained as follows:

a_01 = Pr(S_0 | S_1) = Pr(S^1_0 | S^1_1) Pr(S^1_1 | S_1) + Pr(S^2_0 | S^1_1) Pr(S^1_1 | S_1).

Using the equal graph accessibility approximation, Pr(S^1_1 | S_1) ≈ S^1_1 / (S^1_1 + S^2_1), so that

a_01 ≈ ( Pr(S^1_0 | S^1_1) + Pr(S^2_0 | S^1_1) ) S^1_1 / (S^1_1 + S^2_1),

where r_0 and q_0, appearing in the expressions below, represent the re-selection probabilities of middle and output nodes, respectively.

The main elements of the last computation are detailed below, where it is assumed the network is learning to connect input two to output two while learning the identity mapping M = {1, 2} (the other possible mapping set requiring an inversion of the roles).
r_0 (1 - q_0) (n_o - 2)/(n_o - 1) is the transition probability from S^1_1 to S^1_0. This transition occurs whenever the same middle layer node is re-selected and the output node is not re-selected; there are n_o - 2 out of n_o - 1 such cases that enable the transition.

(1 - r_0)(1 - q_0)(n_o - 1)/n_o is the transition probability from S^1_1 to S^2_0, which occurs whenever a different middle layer node is selected and the output node is not re-selected.

Considering the transitions between graphs rather than between states yields the graph transition matrix A^g. For (2, n_m, n_o), A^g and p^g have the following form:

A^g =
    a_{10 10}  a_{10 20}  a_{10 11}  0          0
    a_{20 10}  a_{20 20}  a_{20 11}  0          0
    a_{11 10}  a_{11 20}  a_{11 11}  a_{11 21}  0
    a_{21 10}  a_{21 20}  a_{21 11}  a_{21 21}  0
    a_{22 10}  0          a_{22 11}  a_{22 21}  1

and p^g = (p_10 p_20 p_11 p_21 p_22)^T,

where a_{22 20} = a_{10 21} = a_{20 21} = 0 since no transitions are possible between those graphs.

The element a_{dn em} represents Pr(S^d_n | S^e_m), the transition probability from graph S^e_m to graph S^d_n.
Figure 3.14 The graph transitions for a network with two input nodes. The arrows indicate the direction of the transitions. No transition is possible from graph S^2_2 to any of the other graphs.
The analytical results previously obtained are directly applicable to the computation of the elements of the transition matrix A^g. For example, the element a_01 of A yields the elements a_{10 11} and a_{20 11} of A^g, as follows:

a_{10 11} = Pr(S^1_0 | S^1_1) = r_0 (1 - q_0) (n_o - 2)/(n_o - 1)

a_{20 11} = Pr(S^2_0 | S^1_1) = (1 - r_0)(1 - q_0) (n_o - 1)/n_o

These two elements of A^g correspond to the two graph transitions in Figure 3.13.

Computing p(τ) and ⟨τ⟩ is done according to Eqs. (3.6) and (3.7) respectively. However, Eq. (3.8) is no longer applicable for computing ⟨I⟩, as the elements of A^g corresponding to interference events are no longer on the upper diagonal of the matrix A^g.
Nevertheless, a similar computation is still possible:

⟨I⟩ = Σ_j Σ_i a^g_{ij} (N p̃^g)_j   (3.10)

where j runs over the columns of N, i runs over the interference graphs reachable from graph j, a^g_{ij} is the element of A^g corresponding to an interference transition from graph j to graph i, and p̃^g is the initial state probability vector excluding the absorbing graph.
3.2.5 Analytical solution: numerical evidence

To test the validity of the analytical solution, a methodology similar to that of Sub-section 3.2.2 for testing the Markov chain representation was used. In the present case, the network test set was limited to networks with two input nodes, 10 output nodes and a range of middle layer sizes.

The analytical expressions for A and A^g were used to obtain p(τ)_pred, ⟨τ⟩_pred and ⟨I⟩_pred from Eqs. (3.6), (3.7) and (3.8) or (3.10), respectively, and the predicted values were then compared to the ones obtained from the simulations.

The slow mapping change mode was used this time, instead of the uniform initial state probability, as the number of states is quite small and the slow change mode is able to visit them a sufficient number of times. The resulting learning time distributions p(τ) are more representative of the typical network dynamics.
Figure 3.15 The learning time distribution p(τ) in the slow change mode [runs: 1e6]. (a) Super-critical network (2, 10, 10). (b) Sub-critical network (2, 100, 10). The green and red lines are the predictions from Eq. (3.6) using A and A^g respectively.
The predicted learning time distributions p(τ)_pred are quite similar to the measured ones. This can be appreciated from Figure 3.15, where the predictions are compared in the super-critical and the sub-critical regimes. As with the predictions from the Markov chain validation testing, the super-critical regime distributions are less accurate than the sub-critical distributions.
Figure 3.16 The Kolmogorov-Smirnov statistic for the learning time distribution p(τ) in the slow change mode, for networks (2, *, 10) [runs: 1e6]. The green and red lines are the statistic values obtained from Eq. (3.6) using A and A^g respectively.
The above is confirmed by the Kolmogorov-Smirnov statistic D, which is higher for the super-critical and critical regimes, as shown in Figure 3.16.

The statistic D decreases with increasing input layer size and is comparable to the maximum likelihood estimations for the transition matrix obtained in Sub-section 3.2.2 for the critical and sub-critical regimes (compare Figure 3.16 with the D value for network size two in Figure 3.10).

Interestingly, the predictions obtained from A^g do not perform better than the ones from A, except in the sub-critical regime, where the D statistic is lower for the p(τ)_pred obtained from A^g. This may be particular to the two input networks, which are somewhat singled out from the others in Figure 3.10.

In terms of ⟨τ⟩_pred and ⟨I⟩_pred, the predictions obtained from A are also more accurate than the ones obtained from A^g, as shown in Table 3.4 and Figure 3.17.
Figure 3.17 The average learning time for networks (2, *, 10) in the slow change mode [runs: 1e6]. The green and red lines are the predictions obtained from A and A^g respectively.
The improvement in the Kolmogorov-Smirnov statistic D for A^g in the sub-critical regime is not reflected in the NRMSE, where the predictions from A are consistently more accurate.
NRMSE ⟨τ⟩_pred      super-critical      critical      sub-critical
A                   0.0839 (0.0660)     0.0577 (0)    0.0149 (0.0203)
A^g                 0.1042 (0.0758)     0.0817 (0)    0.0264 (0.0332)

NRMSE ⟨I⟩_pred      super-critical      critical      sub-critical
A                   0.1323 (0.1066)     0.1339 (0)    0.0452 (0.0616)
A^g                 0.1704 (0.1269)     0.2000 (0)    0.0875 (0.1089)

Table 3.4 The normalised root mean square error (NRMSE) for the predicted ⟨τ⟩_pred and ⟨I⟩_pred in the super-critical, critical and sub-critical regimes. The values in brackets are the standard deviations of the NRMSE.
3.2.6 Summary

The key elements of this section are the following:

Markov chain representation: The Chialvo-Bak network in the two-layer topology is fairly accurately represented by a first-order Markov chain, where the chain states correspond to the number of learnt maps.

Markov chain statistics: The state transition matrix A, obtained either analytically or by maximum likelihood estimation, makes it easy to compute p(τ)_pred, ⟨τ⟩_pred and ⟨I⟩_pred.

Numerical evidence for Markov chain: The Markov chain representation is accurate in all regimes (super-critical, critical and sub-critical) for all but very small networks.

Analytical solution for A for (2, n_m, n_o): An approximate analytical solution was obtained for the network of two input nodes.

Analytical solution for A^g for (2, n_m, n_o): An analytical solution was obtained for the network of two input nodes when considering the transitions between graphs S^d_n rather than between states. As such, this solution is in principle exact.

Numerical evidence for analytical solution: The performance of the analytical solutions is comparable to a maximum likelihood estimation of the transition matrix A in the critical and sub-critical regimes. The predictions from the approximate solution performed better than those from the non-approximate solution.
3.2.7 Appendix

Analytical Solution with A for (2, n_m, n_o)

The solution for the full transition matrix A for (2, n_m, n_o) is given below, in terms of the graph weights w_10 = S^1_0/(S^1_0 + S^2_0), w_20 = S^2_0/(S^1_0 + S^2_0), w_11 = S^1_1/(S^1_1 + S^2_1) and w_21 = S^2_1/(S^1_1 + S^2_1), where r_0 and q_0 represent the re-selection probabilities of middle and output nodes, respectively.

a_00 = w_10 ( Pr(S^1_0 | S^1_0) + Pr(S^2_0 | S^1_0) ) + w_20 ( Pr(S^1_0 | S^2_0) + Pr(S^2_0 | S^2_0) )
     = w_10 ( r_0 ( q_0 + (1 - q_0)(n_o - 3)/(n_o - 1) ) + (1 - r_0)((n_o - 1)/n_o)( q_0 + (1 - q_0)(n_o - 2)/(n_o - 1) ) )
     + w_20 ( r_0 ( q_0 + (1 - q_0)(n_o - 2)/(n_o - 1) ) + (1 - r_0)( ((n_m - 2)/(n_m - 1))((n_o - 1)/n_o) + (1/(n_m - 1))((n_o - 2)/(n_o - 1)) ) )

a_10 = w_10 ( Pr(S^1_1 | S^1_0) + Pr(S^2_1 | S^1_0) ) + w_20 ( Pr(S^1_1 | S^2_0) + Pr(S^2_1 | S^2_0) )
     = w_10 ( r_0 (1 - q_0) 2/(n_o - 1) + (1 - r_0)( (1/n_o)( q_0 + (1 - q_0)(n_o - 2)/(n_o - 1) ) + ((n_o - 1)/n_o)(1 - q_0)/(n_o - 1) ) )
     + w_20 ( r_0 (1 - q_0)/(n_o - 1) + (1 - r_0)( ((n_m - 2)/(n_m - 1))(1/n_o) + (1/(n_m - 1))(1/(n_o - 1)) ) )

a_20 = w_10 Pr(S^2_2 | S^1_0) = w_10 (1 - r_0)(1/n_o)(1 - q_0)/(n_o - 1)

a_01 = w_11 ( Pr(S^1_0 | S^1_1) + Pr(S^2_0 | S^1_1) )
     = w_11 ( r_0 (1 - q_0)(n_o - 2)/(n_o - 1) + (1 - r_0)(1 - q_0)(n_o - 1)/n_o )

a_11 = w_11 ( Pr(S^1_1 | S^1_1) + Pr(S^2_1 | S^1_1) ) + w_21 ( Pr(S^1_1 | S^2_1) + Pr(S^2_1 | S^2_1) )
     = w_11 ( r_0 ( q_0 + (1 - q_0)/(n_o - 1) ) + (1 - r_0)( (1/n_o)(1 - q_0) + ((n_o - 1)/n_o) q_0 ) )
     + w_21 ( r_0 ( q_0 + (1 - q_0)(n_o - 2)/(n_o - 1) ) + (1 - r_0)( ((n_m - 2)/(n_m - 1))((n_o - 1)/n_o) + 1/(n_m - 1) ) )

a_21 = w_11 Pr(S^2_2 | S^1_1) + w_21 Pr(S^2_2 | S^2_1)
     = w_11 (1 - r_0)(1/n_o) q_0 + w_21 ( r_0 (1 - q_0)/(n_o - 1) + (1 - r_0)((n_m - 2)/(n_m - 1))(1/n_o) )

a_02 = 0    a_12 = 0    a_22 = 1
Analytical Solution with A^g for (2, n_m, n_o)

The solution for the full transition matrix A^g for (2, n_m, n_o) is as follows, where r_0 and q_0 represent the re-selection probabilities of middle and output nodes, respectively.

a_{10 10} = Pr(S^1_0 | S^1_0) = r_0 ( q_0 + (1 - q_0)(n_o - 3)/(n_o - 1) )
a_{20 10} = Pr(S^2_0 | S^1_0) = (1 - r_0)((n_o - 1)/n_o)( q_0 + (1 - q_0)(n_o - 2)/(n_o - 1) )
a_{11 10} = Pr(S^1_1 | S^1_0) = r_0 (1 - q_0) 2/(n_o - 1)
a_{21 10} = Pr(S^2_1 | S^1_0) = (1 - r_0)( (1/n_o)( q_0 + (1 - q_0)(n_o - 2)/(n_o - 1) ) + ((n_o - 1)/n_o)(1 - q_0)/(n_o - 1) )
a_{22 10} = Pr(S^2_2 | S^1_0) = (1 - r_0)(1/n_o)(1 - q_0)/(n_o - 1)

a_{10 20} = Pr(S^1_0 | S^2_0) = (1 - r_0)(1/(n_m - 1))(n_o - 2)/(n_o - 1)
a_{20 20} = Pr(S^2_0 | S^2_0) = r_0 ( q_0 + (1 - q_0)(n_o - 2)/(n_o - 1) ) + (1 - r_0)((n_m - 2)/(n_m - 1))((n_o - 1)/n_o)
a_{11 20} = Pr(S^1_1 | S^2_0) = (1 - r_0)(1/(n_m - 1))(1/(n_o - 1))
a_{21 20} = Pr(S^2_1 | S^2_0) = r_0 (1 - q_0)/(n_o - 1) + (1 - r_0)((n_m - 2)/(n_m - 1))(1/n_o)
a_{22 20} = Pr(S^2_2 | S^2_0) = 0

a_{10 11} = Pr(S^1_0 | S^1_1) = r_0 (1 - q_0)(n_o - 2)/(n_o - 1)
a_{20 11} = Pr(S^2_0 | S^1_1) = (1 - r_0)(1 - q_0)(n_o - 1)/n_o
a_{11 11} = Pr(S^1_1 | S^1_1) = r_0 ( q_0 + (1 - q_0)/(n_o - 1) )
a_{21 11} = Pr(S^2_1 | S^1_1) = (1 - r_0)( (1/n_o)(1 - q_0) + ((n_o - 1)/n_o) q_0 )
a_{22 11} = Pr(S^2_2 | S^1_1) = (1 - r_0)(1/n_o) q_0

a_{10 21} = Pr(S^1_0 | S^2_1) = 0
a_{20 21} = Pr(S^2_0 | S^2_1) = 0
a_{11 21} = Pr(S^1_1 | S^2_1) = (1 - r_0)(1/(n_m - 1))
a_{21 21} = Pr(S^2_1 | S^2_1) = r_0 ( q_0 + (1 - q_0)(n_o - 2)/(n_o - 1) ) + (1 - r_0)((n_m - 2)/(n_m - 1))((n_o - 1)/n_o)
a_{22 21} = Pr(S^2_2 | S^2_1) = r_0 (1 - q_0)/(n_o - 1) + (1 - r_0)((n_m - 2)/(n_m - 1))(1/n_o)

a_{10 22} = 0    a_{20 22} = 0    a_{11 22} = 0    a_{21 22} = 0    a_{22 22} = 1
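As a consistency check on the expressions above, each column of A^g must sum to one (Eq. 3.1). The sketch below verifies this for the columns of graphs S^1_0 and S^1_1 with exact rational arithmetic; the sample values of r_0, q_0, n_m and n_o are arbitrary:

```python
from fractions import Fraction as F

# Arbitrary sample parameters (exact rationals, so sums are exact).
r0, q0 = F(1, 3), F(1, 5)
nm, no = F(7), F(10)

# Column of graph S^1_0: a_{10,10}, a_{20,10}, a_{11,10}, a_{21,10}, a_{22,10}
col_10 = [
    r0 * (q0 + (1 - q0) * (no - 3) / (no - 1)),
    (1 - r0) * (no - 1) / no * (q0 + (1 - q0) * (no - 2) / (no - 1)),
    r0 * (1 - q0) * 2 / (no - 1),
    (1 - r0) * (1 / no * (q0 + (1 - q0) * (no - 2) / (no - 1))
                + (no - 1) / no * (1 - q0) / (no - 1)),
    (1 - r0) * (1 / no) * (1 - q0) / (no - 1),
]

# Column of graph S^1_1: a_{10,11}, a_{20,11}, a_{11,11}, a_{21,11}, a_{22,11}
col_11 = [
    r0 * (1 - q0) * (no - 2) / (no - 1),
    (1 - r0) * (1 - q0) * (no - 1) / no,
    r0 * (q0 + (1 - q0) / (no - 1)),
    (1 - r0) * ((1 - q0) / no + q0 * (no - 1) / no),
    (1 - r0) * q0 / no,
]

assert sum(col_10) == 1 and sum(col_11) == 1
```
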
3.3 Learning Convergence

In all but very small networks, the two-layer topology of the Chialvo-Bak model seems well represented by a first-order absorbing Markov chain. This is supported by the numerical testing results of the Markov chain representation.

An important property of absorbing Markov chains is the guaranteed convergence to the absorbing states (this property is based on a probability conservation argument; for details see [13], for example). Therefore, as long as the absorbing Markov chain representation holds, the corresponding network can be expected to converge to the complete learning state.

More specifically, for a given learning rule, the network is expected to converge to the complete learning state if, under that learning rule:
The Markov property is preserved, i.e. Pr(S(t+1) | S(t), S(t-1), ..., S(0)) = Pr(S(t+1) | S(t)).

From every possible state S_n there is a path to the absorbing state S_{n_i}, which does not need to be a direct path.

A necessary condition seems to be the requirement of separate random depressions in the middle and output layers. Indeed, simulations showed that for the network (2, 2, 2) complete learning is not guaranteed if the middle and output layers are depressed by the same random amount. This condition seems to be needed to maintain the Markov property, but further investigation is necessary to verify this.
For any state S_n to be able to reach the absorbing state S_{n_i} it is sufficient that at least one path to the absorbing state exists. Such a path can be constructed if, for every state S_n, a transition is possible to the higher learning state S_{n+1}. This translates into a lower bound on the number of middle layer nodes: for the system to be in the learning state S_n, a total of n middle layer nodes are used to support this state, and moving to the state S_{n+1} requires one additional free middle layer node, and so on.

For the simple learning rule, the number of middle layer nodes should therefore be at least as large as the number of input nodes, i.e. n_m ≥ n_i.

For the advanced learning rule, the number of middle layer nodes should be at least as large as the number of different input configurations, i.e. n_m ≥ K, where K represents the number of input configurations.

Proving the above assertions would be very interesting, as it would establish the learning capability of the advanced learning rule, which is the ability to learn arbitrary binary input-to-output maps and, in particular, any boolean function.
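The convergence argument can be illustrated numerically: for a column-stochastic matrix whose transient states all have a path to the absorbing state, repeated application drives the state distribution onto the absorbing state. The matrix below is a made-up example:

```python
import numpy as np

# Made-up absorbing column-stochastic matrix: every transient state has
# a path to the absorbing state S_3 (last column).
A = np.array([
    [0.50, 0.10, 0.00, 0.0],
    [0.30, 0.60, 0.15, 0.0],
    [0.20, 0.20, 0.80, 0.0],
    [0.00, 0.10, 0.05, 1.0],
])
p = np.array([1.0, 0.0, 0.0, 0.0])   # start with nothing learnt

# Iterating the chain: the mass in the transient states decays
# geometrically, so A^t p converges to the absorbing state.
dist = np.linalg.matrix_power(A, 500) @ p
assert dist[3] > 0.999               # essentially all mass on S_3
print(dist)
```
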
3.3.1 Summary

The key elements of this section are the following:

Guaranteed convergence of the Markov chain: An absorbing Markov chain is guaranteed to converge to the absorbing state. Should the Markov chain representation hold for the simple and advanced learning modes, the Chialvo-Bak model can be expected to converge to the complete learning state.

Separate random depressions: A requirement for convergence is to have separate random depressions in the middle and output layers. This seems to be related to maintaining the Markov property.

Lower bound on the number of middle layer nodes: Another requirement for learning convergence is to have sufficient middle layer nodes: for the simple learning rule n_m ≥ n_i, and for the advanced learning rule n_m ≥ K, where K represents the number of different input configurations.
3.4 Power-Law Behaviour and Neural Avalanches

The power law tail in the learning time distribution p(τ) of the critical regime is also dependent on the mapping change mode, i.e. the way new mapping sets are presented to the network.

The power law tails of the distributions in Figure 2.8a were generated under the slow change mode, where one single mapping is changed each time. It turns out that the slow change mode seems to be the only mapping presentation mode that results in power law behaviour. If the network is provided with random mapping sets the power law disappears, as illustrated in Figure 3.18a.
Figure 3.18 Learning time distributions for the network (8, 72, 9) [runs: 1e6]. (a) Learning under the slow change mode and the random change mode: the blue line represents learning with one mapping change each time, whereas the green line represents learning random mapping sets. (b) Learning under n mapping changes, where the slow change mode corresponds to n = 1.

Learning under n mapping changes corresponds to starting the system in the state S(0) = S_{n_i - n}. Again, only the slow change mode generated a power law tail, as illustrated in Figure 3.18b.
An interesting characterisation of the critical regime is obtained by analysing the state occupation matrix Q, which is obtained from the transition counts after running a simulation. Normalising the transition counts globally gives the frequency of each transition, and summing the entries of each column gives the state frequency. Below is an example for a network (6, 42, 7) after 1e5 runs in the slow change mode.

Footnote: As mentioned in Sub-section 2.1.3, the approximately straight segments in the distributions of Figure 2.8b do not necessarily imply that p(τ) is a proper power law tail distribution. Due to timing constraints no conclusive power law testing was completed for p(τ); as such, the terminology proposed in [20] is adopted here.
Q =
    0.0063  0.0011  0       0       0       0
    0.0011  0.0345  0.0059  0       0       0
    0.0000  0.0059  0.0950  0.0154  0       0
    0       0.0000  0.0153  0.1704  0.0262  0
    0       0       0.0001  0.0261  0.2255  0.0328
    0       0       0       0.0001  0.0327  0.2683
    0       0       0       0       0.0000  0.0373
Summing over the columns of Q gives the relative occupation of each state. For the network (6, 42, 7) the relative occupation of each state is shown in Table 3.5.

Analysing the columns of Q for larger networks yields the following picture:

In the super-critical regime the network spends most of the time in the states neighbouring S_{n_i/2}.

In the critical regime the network spends most of the time in the states from S_{n_i/2} up to S_{n_i-1}.

In the sub-critical regime the network spends most of the time in the last two states, S_{n_i-1} and S_{n_i-2}.
Relative State Occupation
                 S_0      S_1      S_2      S_3      S_4      S_5
super-critical   0.0648   0.1971   0.2800   0.2437   0.1449   0.0695
critical         0.0075   0.0416   0.1163   0.2119   0.2844   0.3384
sub-critical     0.0004   0.0038   0.0218   0.0849   0.2431   0.6460

Table 3.5 The relative state occupation of the network (6, 42, 7) after 1e5 runs in the slow change mode. In the super-critical regime the network spends most time in the middle states. In the critical regime the occupation runs from the middle state up to the state before absorption. In the sub-critical regime the network spends most of the time in the last two states before absorption.
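The relationship between Q and Table 3.5 can be verified directly: summing the columns of the example Q given above reproduces, up to rounding, the critical-regime row of the table:

```python
import numpy as np

# State occupation matrix Q for the network (6, 42, 7), slow change
# mode (values as listed in the text).
Q = np.array([
    [0.0063, 0.0011, 0,      0,      0,      0     ],
    [0.0011, 0.0345, 0.0059, 0,      0,      0     ],
    [0.0000, 0.0059, 0.0950, 0.0154, 0,      0     ],
    [0,      0.0000, 0.0153, 0.1704, 0.0262, 0     ],
    [0,      0,      0.0001, 0.0261, 0.2255, 0.0328],
    [0,      0,      0,      0.0001, 0.0327, 0.2683],
    [0,      0,      0,      0,      0.0000, 0.0373],
])

# Relative state occupation: sum each column of Q.
occupation = Q.sum(axis=0)

# Critical row of Table 3.5; agreement is up to the printed rounding.
critical_row = np.array([0.0075, 0.0416, 0.1163, 0.2119, 0.2844, 0.3384])
assert np.allclose(occupation, critical_row, atol=5e-4)
print(occupation)
```
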
3.4.1 Biological interpretation

The slow change mode may have an interesting biological interpretation. Assigning a slow temporal scale to the rate of mapping changes, slow enough that these would be comparable to the perturbations in the synaptic weights resulting from biological background noise, would allow the slow change mode to be re-interpreted as the network recovering from noise.

Small punctual perturbations to the synaptic weights might affect the result of the WTA rule for a given node of the active configuration. This is likely to result in the activation of a different output node, which would trigger the corrective depressions. This effect is comparable to re-assigning an output node, especially if the perturbation affects a synapse from an input node to a middle layer node. The ability of the network to recover perfectly from noise has been discussed in [8][1].
45
The neural avalanches, rst evidenced experimentally J. Beggs and D. Plenz [4][5] cor-
respond to spontaneous neural activity displaying power-law distributions in the event
sizes. Identifying both avalanche types is not trivial as the biological neural networks
analysed by Beggs and Plenz were not know to be performing any particular activity
during the experiments, whereas the Chialvo-Bak avalanches are due to an ongoing
process of synaptic plasticity.
As such, a parallel may be established between the biological neural avalanches and
the avalanches in the Chialvo-Bak model the slow change mode, under a very slow
temporal scale of changes.
3.4.2 Summary

The key elements of this section are the following:

Power law tail in the slow change mode only: The power law tail in the learning time distribution p(τ) only seems to occur in the slow change mode.

Relative state occupation in the slow change mode: In the super-critical regime the network spends most time in the middle states, whereas in the critical regime the occupation runs from the middle state up to the state before absorption. In the sub-critical regime the network spends most of the time in the last two states before absorption.

Parallel with neural avalanches: By re-interpreting the slow change mode, under a very slow temporal scale of changes, as the network recovering from noise, a parallel may be established with the biological neural avalanches.
Chapter 4
Conclusion
The present work focused on extending the analytical understanding of the Chialvo-
Bak network in the two-layer topology.
The model seems to be accurately represented by a first order Markov chain, where the
chain states correspond to the number of learnt maps. This representation allows
important statistics of the network to be computed easily, such as P_pred(s), s̄_pred
and τ_pred.
Numerical testing found support for the Markov chain representation in all but very
small networks.
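Under this representation, standard absorbing Markov chain results apply directly. As a minimal illustration (the transition probabilities below are hypothetical, not estimated from the model), the expected number of steps until all maps are learnt follows from the fundamental matrix N = (I - Q)^(-1), where Q restricts the transition matrix to the transient states [13]:

```python
import numpy as np

# Hypothetical absorbing Markov chain: states 0..3 count the learnt maps,
# and state 3 (all maps learnt) is absorbing.  The transition probabilities
# are illustrative only, not fitted to the Chialvo-Bak model.
A = np.array([
    [0.2, 0.8, 0.0, 0.0],
    [0.1, 0.3, 0.6, 0.0],
    [0.0, 0.2, 0.4, 0.4],
    [0.0, 0.0, 0.0, 1.0],
])

Q = A[:3, :3]                      # transitions among the transient states
N = np.linalg.inv(np.eye(3) - Q)   # fundamental matrix N = (I - Q)^(-1)
t = N.sum(axis=1)                  # expected steps to absorption per start state
print(t)
```

Row sums of N give, for each starting state, the expected number of learning steps before absorption, which is the kind of predicted quantity the Markov chain representation makes available in closed form.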
A direct estimation of the Markov chain order would further contribute to the
understanding of the Markov chain representation and its possible limitations.
An analytical solution was developed for the transition matrix A in the case of a
network with two input nodes, (2, n_m, n_o). The performance of the analytical
solution is comparable to the maximum likelihood estimation obtained from simulations
of (2, n_m, n_o) networks in the critical and sub-critical regimes.
Extending the analytical solution to networks with a higher number of input nodes
would consolidate the Markov chain representation and could lead to a better
understanding of the critical regime.
Should the Markov chain representation hold in general for the simple and advanced
learning modes, the model can be expected to converge to the complete learning
state. A convergence requirement is to have separate random depressions in the middle
and output layers, which seems to be related to maintaining the Markov property. A
lower bound on the number of middle layer nodes is required for maintaining the
absorbing property of the Markov chain.
A proper analytical proof of the learning convergence would be useful to establish
the learning capability of the model.
The power law tail in the learning time distribution P(s) seems limited to the slow
change mode. The Markov chain representation provides a characterisation of the
critical regime, for which the network spends most of the time between the state
where half the mappings are learnt and the state before absorption. Very rarely are
all the mappings unlearnt in the critical regime.
A systematic power law test of the learning time distribution P(s) in the critical
regime is necessary to establish the power law nature of its tail.
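Such testing could follow the maximum likelihood procedure of Clauset et al. [9]. The sketch below is illustrative only: it uses the continuous-approximation estimator with a fixed cutoff s_min, whereas the full method in [9] selects s_min by a Kolmogorov-Smirnov criterion and provides a corrected estimator for discrete data such as learning times.

```python
import numpy as np

def powerlaw_mle_alpha(samples, s_min):
    """Continuous MLE for the power-law exponent (Clauset et al. [9]):
    alpha_hat = 1 + n / sum(ln(s_i / s_min)), over the samples s_i >= s_min."""
    s = np.asarray([x for x in samples if x >= s_min], dtype=float)
    return 1.0 + len(s) / np.log(s / s_min).sum()

# Synthetic check: inverse-transform samples from p(s) ~ s^(-2.5), s >= 1.
rng = np.random.default_rng(0)
u = rng.random(100_000)
samples = (1.0 - u) ** (-1.0 / 1.5)   # true exponent alpha = 2.5
alpha_hat = powerlaw_mle_alpha(samples, s_min=1.0)
print(alpha_hat)                       # close to 2.5
```

Applying the estimator to synthetic data with a known exponent, as above, is a useful sanity check before applying it to the simulated learning time distributions.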
A parallel with the biological neural avalanches is proposed by re-interpreting the slow
change mode, under a very slow temporal scale of changes, as the network recovering
from noise rather than learning new mappings.
The permanent tagging of successful weights by the selective punishment rule results
in reduced performance over large timescales. A global tag threshold, which places a
global limit on the number of times a tagged weight can be wrong before becoming
untagged, seems successful in eliminating the ageing effect and is independent of the
level of network activity.
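A minimal sketch of the global tag threshold mechanism described above (the names and data structure are illustrative, not the implementation used in this work): each tagged weight carries a counter of wrong outcomes, and crossing the global threshold removes the tag, making the weight punishable again.

```python
class GlobalTagThreshold:
    """Tagged weights become untagged after being wrong `threshold` times."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.wrong_count = {}  # weight id -> wrong outcomes while tagged

    def tag(self, weight_id):
        self.wrong_count[weight_id] = 0

    def is_tagged(self, weight_id):
        return weight_id in self.wrong_count

    def record_wrong(self, weight_id):
        # Only tagged weights accumulate wrong outcomes.
        if weight_id in self.wrong_count:
            self.wrong_count[weight_id] += 1
            if self.wrong_count[weight_id] >= self.threshold:
                del self.wrong_count[weight_id]  # untag: punishable again

tags = GlobalTagThreshold(threshold=2)
tags.tag("w_ij")
tags.record_wrong("w_ij")
print(tags.is_tagged("w_ij"))   # True: still tagged after one wrong outcome
tags.record_wrong("w_ij")
print(tags.is_tagged("w_ij"))   # False: untagged after reaching the threshold
```

Because the counter is reset only when a weight is re-tagged, no weight stays permanently protected from punishment, which is the behaviour that eliminates the ageing effect.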
The local tag threshold was not developed in the present work; however, its local
character potentially results in network size independence, which is a clear advantage
over the global tag threshold.
Bibliography
[1] P. Bak and D.R. Chialvo. Adaptive learning by extremal dynamics and negative
feedback. Arxiv preprint cond-mat/0009211, 2000.
[2] Per Bak. How Nature Works: The Science of Self-Organized Criticality. Springer-
Verlag, 1996.
[3] A. Barr, E.A. Feigenbaum, and P.R. Cohen. The handbook of artificial intelligence.
Addison-Wesley, Reading, MA, 1989.
[4] J.M. Beggs and D. Plenz. Neuronal avalanches in neocortical circuits. Journal of
Neuroscience, 23(35):11167–11177, 2003.
[5] J.M. Beggs and D. Plenz. Neuronal avalanches are diverse and precise activity
patterns that are stable for many hours in cortical slice cultures. Journal of
Neuroscience, 24(22):5216–5229, 2004.
[6] P. Billingsley. Statistical inference for Markov processes. University of Chicago
Press, 1961.
[7] R.J.C. Bosman, W.A. van Leeuwen, and B. Wemmenhove. Combining Hebbian and
reinforcement learning in a minibrain model. Neural Networks, 17(1):29–36, 2004.
[8] D.R. Chialvo and P. Bak. Learning from mistakes. Neuroscience, 90(4):1137–1148,
1999.
[9] A. Clauset, C.R. Shalizi, and M.E.J. Newman. Power-law distributions in empirical
data. Arxiv preprint arXiv:0706.1062, 2007.
[10] F. Crepel, N. Hemart, D. Jaillard, and H. Daniel. Long-term depression in the
cerebellum. Handbook of Brain Theory and Neural Networks, 1998.
[11] I. Csiszar and P.C. Shields. The consistency of the BIC Markov order estimator.
Annals of Statistics, pages 1601–1619, 2000.
[12] P. Dayan, L.F. Abbott, and L. Abbott. Theoretical neuroscience: Computational
and mathematical modeling of neural systems. MIT Press, 2001.
[13] C.M. Grinstead and J.L. Snell. Introduction to Probability. American Mathematical
Society, 1997.
[14] M. Ito. Long-term depression. Annual Review of Neuroscience, 12(1):85–102, 1989.
[15] M.H. Johnson. Developmental Cognitive Neuroscience: An Introduction. Blackwell
Publishing, 1997.
[16] K. Klemm, S. Bornholdt, and H.G. Schuster. Beyond Hebb: exclusive-OR and
biological learning. Physical Review Letters, 84(13):3013–3016, 2000.
[17] G. Latouche and V. Ramaswami. Introduction to matrix analytic methods in
stochastic modeling. Society for Industrial Mathematics, 1999.
[18] M. Minsky. Steps toward artificial intelligence. Proceedings of the IRE, 49(1):8–30,
1961.
[19] Y. Peres and P. Shields. Two new Markov order estimators. Arxiv preprint
math/0506080, 2005.
[20] J. Wakeling. Order–disorder transition in the Chialvo–Bak minibrain controlled
by network geometry. Physica A: Statistical Mechanics and its Applications, 325(3-
4):561–569, 2003.
[21] J.R. Wakeling. Adaptivity and Per learning. Physica A: Statistical Mechanics
and its Applications, 340(4):766–773, 2004.