Article
*Correspondence: tbuschma@princeton.edu
https://doi.org/10.1016/j.neuron.2019.04.020
Our model provides a simple mechanistic explanation for the limited capacity of working memory: it is due to interference in neural representations in the shared random network. This is a natural consequence of the convergent, random connectivity between the two non-linear networks; as multiple inputs are presented to the sensory networks, their representations interfere in the random network, disrupting maintenance. This is an unavoidable consequence of the convergence and does not depend on network parameters (as we show below). In this way, our model suggests capacity limits are a necessary trade-off for the flexibility of working memory.

The model makes specific predictions about how neural activity should change as more items are held in working memory. First, increasing the number of stimulus inputs increases the overall average firing rate in the random network, saturating at the capacity limit of the network (3 or 4 items; Figure 4A). This is consistent with experimental observations of gross activity levels in prefrontal and parietal cortex; both blood-oxygen-level-dependent (BOLD) and evoked potentials increase with working memory load, saturating at an individual's capacity limit (Curtis and D'Esposito, 2003; Ma et al., 2014). Again, the model was highly correlated with experiments (r = 0.998; p = 1.97 × 10⁻⁶; experimental data from Ma et al., 2014).

Second, the model predicts maintaining multiple memories will reduce the response of selective neurons in the random network (Figure 4B). This is consistent with experimental observations, which have shown divisive-normalization-like regularization of mnemonic responses in single neurons and across the population (Buschman et al., 2011; Sprague et al., 2014). Our model provides a potential circuit mechanism for such divisive-normalization-like regularization, suggesting it is the result of balanced excitation-inhibition between networks. The low fraction of connectivity (g) and balanced excitation-inhibition means that a neuron in the random network that is selective for one stimulus is more likely to be inhibited than excited by a second stimulus (see STAR Methods). Thus, the response of selective neurons is reduced as items are added to memory (Figure 4B).

This effect can be seen across the population. Figure 4C shows the relative response of neurons in the random network to two stimuli either presented separately (x axis) or together (y axis). As noted above, the overall average response increases when two stimuli are presented. However, the response to the pair of stimuli was not a summation of the response to each stimulus alone; rather, it was a sublinear mixture (slope is 0.50). This is consistent with electrophysiological results during perception (Reynolds et al., 1999) and with divisive normalization models that predict the response to two stimuli should be the average of the response to each stimulus alone (resulting in a slope of 0.5). Divisive normalization has been observed in many cognitive domains (Carandini and Heeger, 1994, 2011). Indeed, in working memory, divisive-normalization-like regularization may explain the observed decrease in stimulus information with memory load: it reduces selectivity, which reduces information (Figure 4D; Buschman et al., 2011; Sprague et al., 2014). More broadly, our model suggests excitation-inhibition balance within a convergent, non-linear network could be a circuit mechanism for implementing divisive normalization.

Interference between memory representations also provides a simple mechanistic explanation for the reduced memory precision at higher memory loads (Ma et al., 2014; Bays and Husain, 2008). In our model, memory representations drift over time, due to the accumulation of noise from Poisson variability in neural spiking (Burak and Fiete, 2012). The increased interference in the random network at higher memory loads leads to weaker feedback, reducing the representation in the sensory network. This reduction both allows noise to have a greater impact, causing an increase in drift (Figures 3C–3F), and impairs decoding of memory representations (Bays, 2014, 2015).

Given the model's prediction that interference impairs memory performance, reducing interference between memories should improve working memory performance and accuracy. Indeed, how much two stimulus representations overlapped in the random network strongly determined the ability to accurately maintain both memories. Figure 5A shows memory performance was impaired when two memories were less correlated, reflecting greater interference between the memories in the random network (see also Figure S1C). Conversely, more correlated memories were better remembered. Surprisingly, errors in memory accumulated over time in a way that reduced interference between memories (Figure 5B). Both of these predictions are testable hypotheses that could be addressed with future electrophysiology or imaging experiments.

Stable and Dynamic Encoding of Memories
The model captures several more key electrophysiological findings related to working memory. First, we observe persistent mnemonic activity, consistent with electrophysiological results in both monkeys and humans (Funahashi et al., 1989; Fuster, 1973; Vogel and Machizawa, 2004). This is true, even on the first trial of being exposed to a stimulus, as observed in monkeys (Constantinidis and Klingberg, 2016). Second, working memory activity is distributed across multiple networks, reflecting the distributed nature of working memory representations in the brain (Christophel et al., 2017). Third, as noted above, the random nature of connections in our model yields high-dimensional, mixed-selective representations in the random network, as has been seen in prefrontal cortex (Fusi et al., 2016). Finally, as we show next, our model shows the same combination of stable and dynamic representations seen in neural data (Murray et al., 2017; Stokes, 2015).

Recent work has highlighted the dynamic nature of working memory representations. In particular, neural representations early in the trial are not well correlated with representations later in the trial (e.g., stimulus presentation versus memory delay; Stokes et al., 2013). This argues against the classic view of a static mnemonic representation. In contrast, Murray et al. (2017) found these dynamics are orthogonal to the space in which memories are represented such that, although neural responses are dynamic, their encoding is stable (see also Spaak et al., 2017).

Our model can capture both effects with two simple, biologically motivated additions: (1) a direct projection from the input into the random network and (2) weak recurrent connections within the random network (see STAR Methods). These changes did not disrupt memory performance (Figure S2A) but did increase dynamics within the random network. These dynamics can be seen in the low cross-correlation of neural representations across time (Figure 6A). Consistent with experimental observations, representations early in the trial are not well correlated with later representations, but representations within the memory delay are relatively more stable (Figure 6B).

However, despite these dynamics, the subspace in which a memory is represented is stable. Following the approach from Murray et al. (2017), we estimated the "mnemonic subspace" of the random network by finding the first two dimensions that explained the most variance of the time-averaged response to eight different stimulus inputs (Figure S2C; see STAR Methods). Projecting the response of the random network over time into this subspace showed stable representations (Figure 6C; note the subspace is not designed to minimize temporal variability). Furthermore, despite the randomness of connections, the […]

[…] memories, increasing memory performance. However, memories also drifted toward one another, decreasing memory accuracy. This is consistent with experimental work, which has shown increased stability and attractive drift when subjects held two similar items in working memory (Kiyonaga et al., 2017; Lin and Luck, 2009).

Learning Can Optimize Performance for Trained Memories but Does Not Generalize to Other Memories
Memory performance in our network is limited by interference in the random network (Figure 5). Theoretical work has suggested that random networks maximize information capacity and minimize interference when the structure of inputs is unknown (Maass et al., 2002; Jaeger and Haas, 2004; Sussillo and Abbott, 2009). However, to directly test whether connections could be optimized to maintain memories in our network structure, we trained a network to maximize memory performance for a small subset of inputs (see STAR Methods for details). Training improved performance of the network for the trained inputs (Figures 7E, S7A, and S7B). This improvement was largely due to increased correlations between the representations in the random network of the trained inputs (Figure S7C). However, learning was significantly slower when more patterns were trained (slope of performance across all loads was m = 0.021 for 1 input pattern, 0.0014 for 5 input patterns, and 0.00075 for 10 input patterns; p < 10⁻⁴ for all comparisons; see STAR Methods for details). Furthermore, learning did not generalize beyond the trained inputs: untrained inputs showed little to no improvement with training (Figure 7E, dashed lines; m = 10⁻⁶, p = 0.87; m = 10⁻⁴, p < 10⁻⁴; and m = 10⁻⁴, p = 0.026 for 1, 5, and 10 trained input patterns, respectively, tested against zero). In fact, memory performance and accuracy for non-trained inputs was reduced when they were paired with trained inputs (Figures S7D and S7E). Together, these results suggest training of working memory performance does not generalize to new inputs, consistent with experimental results that training on specific working memory tasks does not generalize to other types of information (Melby-Lervåg and Hulme, 2013).

The Network Can Maintain Memories across a Variety of Sensory Architectures
Our model demonstrates how random connections can support the flexible maintenance of memories. To highlight the flexibility […]
Further information and requests for resources and reagents should be directed to Tim Buschman (tbuschma@princeton.edu).
METHOD DETAILS
Computational model
We consider a two-layer network of Poisson spiking neurons. Model parameters were adapted from Burak and Fiete (2012). Each neuron i generates Poisson spikes based on its time-varying firing rate r_i(t). The firing rate of neuron i is a non-linear function of the weighted sum of all pre-synaptic inputs:

    r_i(t) = F( Σ_j W_ij s_j(t) )    (1)

where t_j^a are the spike times of neuron j. For simplicity, we choose the synaptic time constant τ = 10 ms to be equal for all synapses. As noted in the main text, we are using a simplified model of neural activity, neglecting cell types, refractory periods, and bursts of neural activity.
Previous work has shown that variability in spiking activity plays a significant role in the diffusion of memories over time (Burak and Fiete, 2012), which motivated our choice of spiking neurons over a rate model. All variability in spiking arises from the inhomogeneous Poisson process used to generate spike events (i.e., no noise was added).
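As a minimal sketch (not the Brian2 implementation used for the paper's simulations), the dynamics of Equation 1 can be stepped forward with Euler integration: synaptic traces decay with time constant τ and jump at Poisson spike times, and rates are read out through the nonlinearity F. The saturating F below is purely illustrative, since the exact form of F is not given in this excerpt.

```python
import numpy as np

def simulate_poisson_network(W, F, T=200, dt=1e-4, tau=1e-2, rng=None):
    """Euler simulation of Equation 1: r_i(t) = F(sum_j W_ij s_j(t)).
    Synaptic traces s_j decay with time constant tau (10 ms) and jump
    by 1 at the Poisson spike times of neuron j."""
    rng = np.random.default_rng(rng)
    N = W.shape[0]
    s = np.zeros(N)                         # synaptic traces
    rates = np.zeros((T, N))
    for t in range(T):
        r = F(W @ s)                        # instantaneous firing rates (Hz)
        spikes = rng.random(N) < r * dt     # inhomogeneous Poisson spiking
        s += -s * (dt / tau) + spikes       # exponential decay + spike jumps
        rates[t] = r
    return rates

# Illustrative saturating nonlinearity (an assumption; the paper's F is
# defined elsewhere in the STAR Methods).
F = lambda x: 100.0 / (1.0 + np.exp(-x))    # rates bounded by 100 Hz

W = np.array([[0.0, 1.0], [1.0, 0.0]])      # two mutually coupled neurons
rates = simulate_poisson_network(W, F, rng=0)
```

Note that all stochasticity enters through the Poisson spike draws, matching the statement above that no explicit noise term is added.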
The model consists of two interacting layers of neurons: a 'sensory' network and a 'random' network (Figure 1A). The sensory network consists of 8 independent ring-like sub-networks. These sub-networks are each composed of N_sensory = 512 neurons and mimic simplified sensory networks, with neurons around the ring corresponding to specific values of a continuously encoded feature, such as shape, orientation, or color (here graphically represented as color). Every neuron i has an associated angle θ_i = 2πi/N_sensory.

Consistent with biological observations (Funahashi et al., 1989; Kuffler, 1953; Kiyonaga and Egner, 2016), connections within each sensory sub-network have a center-surround structure (Figure 1A, inset), with nearby excitation and surround inhibition. The synaptic weight between any pair of neurons i, j within a sensory sub-network is rotationally invariant, depending only on θ = θ_i − θ_j through:

    W^sens_ij = w(θ) = λ + A exp(k_1(cos θ − 1)) − A exp(k_2(cos θ − 1))    (4)

where k_1 = 1 is the inverse width of the excitation kernel, k_2 = 0.25 is the inverse width of the suppression kernel, A = 2 is the amplitude, and λ = 0.28 is the baseline. Self-excitation is set to 0 (w(0) = 0). To test the robustness of the network to changes in recurrent connectivity within the sensory sub-networks, the strength of all recurrent connections (W^sens_ij) was scaled by a constant factor (Figure S4A).
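Equation 4 can be instantiated directly. A short sketch (ours, not the published code) that builds the ring weight matrix and zeroes self-excitation:

```python
import numpy as np

def sensory_weights(N=512, k1=1.0, k2=0.25, A=2.0, lam=0.28):
    """Center-surround ring connectivity of Equation 4:
    w(theta) = lam + A*exp(k1*(cos(theta)-1)) - A*exp(k2*(cos(theta)-1)),
    with self-excitation w(0) set to 0."""
    theta = 2 * np.pi * np.arange(N) / N
    dtheta = theta[:, None] - theta[None, :]          # theta_i - theta_j
    x = np.cos(dtheta) - 1.0
    W = lam + A * np.exp(k1 * x) - A * np.exp(k2 * x)
    np.fill_diagonal(W, 0.0)                          # no self-excitation
    return W

W_sens = sensory_weights()
```

With these parameters the profile is excitatory for nearby neurons (the baseline λ dominates) and inhibitory for distant neurons on the ring, as the center-surround description requires.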
All inputs were presented for 100 ms, indicated by the blue shaded region in all figures. Inputs were Gaussian, with the center of the input (μ) chosen randomly for each sensory sub-network. The width of the input (σ) was defined as a fraction of the total number of neurons in the sensory sub-network, σ = N_sensory/32, which was 16 neurons for the presented network. Inputs beyond three σ were set to 0. Therefore, the sensory input to neuron i was:

    s^ext_i = S_ext/√(2πσ²) · exp(−(i − μ)²/(2σ²))    if |i − μ| ≤ [3σ]
    s^ext_i = 0                                        otherwise    (6)

where S_ext was the strength of external sensory input (here, S_ext = 10).
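A sketch of the truncated Gaussian input of Equation 6. For simplicity this version measures distance as |i − μ| without wrapping around the ring (an assumption; the handling of ring edges is not specified above), and the center can be fixed or drawn at random.

```python
import numpy as np

def sensory_input(N=512, mu=None, S_ext=10.0, rng=None):
    """Truncated Gaussian input bump (Equation 6): width sigma = N/32
    (16 neurons for N = 512), zero beyond three sigma from the center mu."""
    rng = np.random.default_rng(rng)
    if mu is None:
        mu = int(rng.integers(N))            # random center per sub-network
    sigma = N / 32
    i = np.arange(N)
    d = np.abs(i - mu)                       # distance to the input center
    s = S_ext / np.sqrt(2 * np.pi * sigma**2) * np.exp(-d**2 / (2 * sigma**2))
    s[d > 3 * sigma] = 0.0                   # inputs beyond 3 sigma set to 0
    return s

s_ext = sensory_input(mu=256)
```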
However, to investigate the dynamics of activity within the network, two biologically motivated connections were added (only for results presented in Figures 6, S2, and S3). First, inputs were projected directly into the random network (in addition to inputs into the sensory network). A random weight matrix, W^ext_ij, defined the feedforward connectivity of sensory inputs into the random network. As with the weight matrix from the sensory network, the likelihood of an excitatory connection from any given input was g_ext (default 0.2), with weight proportional to α_ext. As with other connections in the network, inhibitory connections were added to all neurons in the random network in order to balance excitation and inhibition.
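The feedforward projection can be sketched as follows. The exact balancing scheme is specified elsewhere in the STAR Methods, so subtracting each output neuron's mean input weight (making excitation and inhibition sum to zero per neuron) is our simplifying assumption, as are the function and argument names.

```python
import numpy as np

def random_projection(N_in, N_out, g=0.2, alpha=1.0, rng=None):
    """Random feedforward weights: each input excites a random subset
    (probability g) of random-network neurons with weight alpha; a uniform
    inhibitory offset then zeroes each output neuron's summed input weight
    (a simple stand-in for the paper's excitation-inhibition balancing)."""
    rng = np.random.default_rng(rng)
    W = alpha * (rng.random((N_out, N_in)) < g)   # sparse excitation
    W -= W.mean(axis=1, keepdims=True)            # balance: each row sums to 0
    return W

W_ext = random_projection(512, 1024, rng=1)
```

Because only a fraction g of inputs are excitatory, most entries end up weakly inhibitory, consistent with the main-text observation that a selective neuron is more likely to be inhibited than excited by a second stimulus.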
where A_exc = 1.7, A_inh = 0.5, σ_exc = 3, and σ_inh = 10. Weights are set to 0 for all d < 1.2 (i.e., w(d < 1.2) = 0; this includes the center neuron itself and the 4 nearest neighbors).
One to four 2D Gaussian inputs were presented during the first 100 ms of each trial. Each input was centered on a specific neuron i, and the strength of stimulation to neuron j depended on the distance to the center d_ij through:

    s^ext_j = S_ext,2D/√(2πσ²_2D) · exp(−d²_ij/(2σ²_2D))    (10)
with
    w^m_ij = 1000            if p^m_i = p^m_j = 1
             −1              if p^m_i = p^m_j = 0
             −1/(100 N_p)    if p^m_i ≠ p^m_j
             0               if i = j                  (12)
Note the major difference from the classic Hopfield network is that neurons that are both inactive in a given pattern inhibit one another (as opposed to exciting one another in the classic Hopfield network). This change accommodates the fact that our neurons do not have negative firing rates and our patterns are not balanced (they have a non-zero expected value).
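The modified Hopfield weights can be sketched as below. The magnitudes follow our reading of Equation 12, which is partially garbled in this copy (the published code should be treated as authoritative); the signs follow the prose: co-active pairs excite, co-inactive pairs inhibit, and mismatched pairs are weakly inhibited, scaled by the number of patterns N_p.

```python
import numpy as np

def hopfield_weights(patterns):
    """patterns: (N_p, N) binary array of embedded patterns.
    Returns the summed pairwise weight matrix of Equation 12:
    co-active pairs excite, co-inactive pairs inhibit, mismatched
    pairs are weakly inhibited; the diagonal is zero."""
    patterns = np.asarray(patterns)
    N_p, N = patterns.shape
    W = np.zeros((N, N))
    for p in patterns:
        both_on = np.outer(p, p) == 1            # p_i = p_j = 1
        both_off = np.outer(1 - p, 1 - p) == 1   # p_i = p_j = 0
        Wm = np.where(both_on, 1000.0,
             np.where(both_off, -1.0, -1.0 / (100.0 * N_p)))
        W += Wm
    np.fill_diagonal(W, 0.0)                     # w = 0 for i = j
    return W

W_hop = hopfield_weights(np.array([[1, 1, 0, 0], [0, 0, 1, 1]]))
```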
Memories could be maintained in the Hopfield sensory network alone if weights were strong (λ = 600 and below, Figure S8B). To disrupt this maintenance, but keep the structure of the network, weights were reduced (λ = 800, Figure S8C). Although the Hopfield network could no longer maintain memories alone, interactions with a random network recovered the ability to sustain memories (Figure S8D).

Inputs were taken as the sum of N_L pattern vectors, depending on the chosen load (L). All active input neurons received the same stimulus drive as in previous models (S_ext = 10).
To systematically investigate how overlap in the sensory network representations impacted memory performance, we changed one pattern such that a proportion of its active neurons overlapped with another pattern (Figure S8A; example for 20%). By varying the percentage of overlap, we could quantify the impact of overlap on memory performance, as seen in Figures S8F and S8G. If two embedded memories were partially overlapping, then memory performance for one was improved, as the structure of the other embedded pattern helped stabilize the active pattern (even when it was not active; Figure S8G). However, if embedded patterns overlapped significantly (>16%), an input matching one pattern tended to spuriously activate the other, overlapping, pattern. It is important to note that overlap also induces spurious memories in typical Hopfield networks; this is a consequence of the recurrent connections within the Hopfield network and not the recurrent connections with the random network. As noted above, similar effects were seen when two sensory inputs had overlapping projections onto the random network (Figures S6E–S6G). Together, these results suggest that while some correlation in the random network improves working memory performance (Figure 5), too much correlation can lead to spurious activation of memories.
Subspace Analysis
Analysis of the stable mnemonic subspace (Figures 6, S2, and S3) followed the methods of Murray et al. (2017). To characterize the subspace used for encoding, we calculated the response of the network to 8 different inputs into the first sensory sub-network (s_1^k, k = 1…8). The number of stimulus conditions was chosen to match Murray et al. (2017), corresponding to 8 angular locations evenly separated within the sub-network (i.e., spaced by 45°). For each input condition, 500 independent trials were simulated for the same network.

We assessed the temporal dynamics of neural activity by computing the cross-correlation of responses across trials with the same input. Correlations were then averaged across input conditions, yielding the cross-correlation matrix in Figure 6A. Slices of the cross-correlation are displayed in Figure 6B to match recent findings (Murray et al., 2017; Stokes et al., 2013).
Next, we defined the mnemonic and temporal subspaces as in Murray et al. (2017). First, we averaged the activity of each neuron over trials. This resulted in a high-dimensional state space defined by the firing rate of N_rand neurons, over time T, for each of 8 input conditions (i.e., an N_rand × T × 8 matrix). Then, we defined the mnemonic subspace by estimating the principal components of the time-averaged response (an N_rand × 1 × 8 matrix). The first four PCs explained 77% and 82% of the variance in the data for load 1 and load 4, respectively. Projecting the original full time course of neural activity onto the first two principal components defined the mnemonic subspace in Figure 6C, showing the 8 clusters of time-varying activity for the 8 stimulus inputs. Figures S3C–S3E and S3H–S3J show the projection of neural activity onto the subspace composed of the third and fourth components. In Figure 6, we only used simulations for which the memory in the first sensory sub-network was maintained. Figures S3A, S3B, S3F, and S3G show the evolution of activity for both maintained and forgotten memories.
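The procedure just described can be sketched compactly; this is our simplified version of the Murray et al. (2017) analysis, using an SVD of the centered, time-averaged responses in place of a PCA library call.

```python
import numpy as np

def mnemonic_subspace(X, n_components=2):
    """X: (N, T, C) trial-averaged firing rates for N neurons, T time
    bins, and C stimulus conditions. The principal components are
    estimated from the time-averaged response (N x C), and the full
    time course is projected onto the leading components."""
    Xbar = X.mean(axis=1)                          # time-averaged: (N, C)
    Xc = Xbar - Xbar.mean(axis=1, keepdims=True)   # center across conditions
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    pcs = U[:, :n_components]                      # (N, n_components)
    proj = np.einsum('nk,ntc->ktc', pcs, X)        # trajectories in the subspace
    return pcs, proj

# Synthetic example: 20 neurons, 10 time bins, 8 stimulus conditions.
rng = np.random.default_rng(0)
cond_means = rng.normal(size=(20, 8))
X = cond_means[:, None, :] + 0.01 * rng.normal(size=(20, 10, 8))
pcs, proj = mnemonic_subspace(X)
```

Stable encoding then corresponds to each condition's trajectory staying within its own cluster when projected onto these components.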
In order to analyze how memory load impacted the mnemonic subspace (Figures 6H and S2D), we repeated this process for mem-
ory loads 2 through 8. Memory load was manipulated by providing a random input into sensory sub-networks 2 through 8, in addition
to the parametrically varying input to sub-network 1. We took into account only the simulations for which the memory in the first
sensory sub-network was maintained, although the memories in the other sensory sub-networks could have been forgotten. This
provided new mnemonic subspaces for each load.
Discriminability was computed as the distance between the neural responses to different input conditions, normalized by the standard deviation of responses (d′). Neural responses were calculated as the average response over the delay period. The response for each trial was then projected into either the mnemonic subspace defined for a stimulus presented alone (the reference subspace, load = 1) or the mnemonic subspace optimized for each memory load. d′ was calculated for each pair of inputs, along the vector connecting each cluster's barycenter. The overall discriminability was then taken as the average d′ across all pairs of inputs. Standard error was estimated by bootstrapping (1,000 draws). Discriminability was computed within the reference (load 1) subspace and
within the subspace optimized for each load. As seen in Figure 6H, decodability was nearly equal across subspaces. For complete-
ness, we also built a nearest centroid classifier as in Murray et al. (2017). Decoder accuracy was estimated by testing cross-validation
performance on withheld trials, for both the reference (load 1) subspace and the subspaces optimized for each load (Figure S2D).
Standard error was estimated by bootstrapping (1000 draws). This second analysis also shows that decodability was nearly equal
across subspaces.
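The d′ computation can be sketched as follows. Using the pooled standard deviation of the two clusters along the barycenter-connecting vector is our assumption, as the exact normalization is not spelled out above.

```python
import numpy as np
from itertools import combinations

def discriminability(proj, labels):
    """proj: (n_trials, k) delay-period responses projected into the
    mnemonic subspace; labels: condition index per trial. For each pair
    of conditions, d' is the separation of cluster means along the unit
    vector joining the two barycenters, divided by the pooled standard
    deviation; the overall score is the average over all pairs."""
    dprimes = []
    for a, b in combinations(np.unique(labels), 2):
        Xa, Xb = proj[labels == a], proj[labels == b]
        v = Xb.mean(axis=0) - Xa.mean(axis=0)
        v = v / np.linalg.norm(v)                 # barycenter-connecting direction
        za, zb = Xa @ v, Xb @ v                   # 1D projections of each trial
        pooled_sd = np.sqrt((za.var(ddof=1) + zb.var(ddof=1)) / 2.0)
        dprimes.append(abs(zb.mean() - za.mean()) / pooled_sd)
    return float(np.mean(dprimes))

# Two well-separated 2D clusters should give d' near their separation.
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=(200, 2))
b = rng.normal(0.0, 1.0, size=(200, 2)); b[:, 0] += 5.0
d = discriminability(np.vstack([a, b]), np.array([0] * 200 + [1] * 200))
```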
As noted above, all subspace analyses used the model with direct sensory input into the random network and recurrence within the
random network (as described in section ‘Sensory inputs and recurrence in the random network’).
1. 100 new networks were generated from the current ‘best’ network at each step n by rearranging 1% of bi-directional excitatory
connections (swapping them with inhibitory connections). All connections were balanced after rearrangement.
2. The performance of each new network on the trained inputs was tested by running 1000 simulations per trained input.
3. The network with the best performance was then selected as the best network for step n + 1.
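Steps 1–3 above amount to a greedy search over connection rearrangements. A sketch under simplifying assumptions: `performance_fn` stands in for running the 1,000 simulations per trained input, and re-balancing after each rearrangement is omitted.

```python
import numpy as np

def train_network(W, performance_fn, n_steps=10, n_candidates=100,
                  frac_swap=0.01, rng=None):
    """Greedy search of steps 1-3: at each step, generate candidate
    networks by swapping a small fraction of excitatory connections with
    inhibitory ones, score each candidate, and carry the best network
    forward to the next step."""
    rng = np.random.default_rng(rng)
    best_W, best_perf = W.copy(), performance_fn(W)
    for _ in range(n_steps):
        cands, perfs = [best_W], [best_perf]        # keep current best in the pool
        for _ in range(n_candidates):
            cand = best_W.copy()
            exc = np.flatnonzero(cand > 0)          # excitatory connections
            inh = np.flatnonzero(cand < 0)          # inhibitory connections
            k = max(1, int(frac_swap * exc.size))   # ~1% of excitatory connections
            e = rng.choice(exc, size=k, replace=False)
            i = rng.choice(inh, size=k, replace=False)
            we, wi = cand.flat[e].copy(), cand.flat[i].copy()
            cand.flat[e], cand.flat[i] = wi, we     # swap excitatory/inhibitory slots
            cands.append(cand)
            perfs.append(performance_fn(cand))
        j = int(np.argmax(perfs))
        best_W, best_perf = cands[j], perfs[j]      # network for step n + 1
    return best_W, best_perf

# Toy run with a stand-in objective (sum of the first row's weights).
rng0 = np.random.default_rng(0)
W0 = np.where(rng0.random((8, 8)) < 0.5, 1.0, -1.0)
perf = lambda M: float(M[0].sum())
best_W, best_perf = train_network(W0, perf, n_steps=3, n_candidates=20, rng=0)
```

Keeping the current network in the candidate pool makes performance non-decreasing across steps, matching the hill-climbing character of the procedure.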
Initially, the network was trained to optimize performance for one, five, or ten input pattern(s). Each of these 'trained' input patterns consisted of one randomly chosen input for each sub-network (i.e., they were an 8-element vector with random numbers drawn from 1…N_sensory). Memory load was varied by selecting a random subset of sub-networks to receive an input. Therefore, the number of possible inputs across the entire sensory network was much higher than 1, 5, or 10. For example, if each sub-network was trained on a single input pattern, there were 255 different inputs possible across the entire sensory network (across all loads, Σ_{L=1}^{8} C(8, L) = 2⁸ − 1 = 255). For five and ten trained input patterns, there were 1,275 and 2,550 different inputs possible, respectively.
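These counts can be checked directly; reading 1,275 and 2,550 as 255 load combinations per trained pattern is our interpretation of the text.

```python
from math import comb

# One trained pattern per sub-network: a trained input is any non-empty
# subset of the 8 sub-networks receiving its pattern.
n_single = sum(comb(8, L) for L in range(1, 9))   # sum of C(8, L) = 2**8 - 1

# Counts reported for 5 and 10 trained patterns, taken as 255 combinations
# per trained pattern (an interpretation, not stated explicitly above).
n_five, n_ten = 5 * n_single, 10 * n_single
```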
To test whether training generalized to other inputs, the performance of the ‘best’ network at each training step was tested on a set
of 100 random input patterns (50 simulations each, across all loads). The set of 100 random inputs were fixed across training steps to
facilitate comparison (Figures 7E, S7A, and S7B). Note that computational time at each training step increased as we increased the
number of trained patterns. This limited the total number of training iterations possible for higher numbers of inputs (notably 10 input
patterns). Therefore, to quantify the speed of learning across training steps we used the statsmodels toolbox to fit a line to each
training curve (Figure 7E). The line was fit to the first 50 training steps for 1, 5 and 10 input patterns. To test for significant differences
in the learning rate as a function of the number of input patterns, we built a generalized linear model, adding a categorical variable for
the number of inputs. Thus, significant differences would manifest as a significant interaction term (p value included in the main text).
Finally, we tested whether training interfered with maintenance of other memories. For simplicity, only the networks trained to opti-
mize one input vector were used. To systematically test the impact of training on memory, we varied the input to sub-network 1 (SN1),
relative to the trained input. For higher loads all other sub-networks received their trained input. As seen in Figures S7D and S7E,
training improved memory performance and accuracy for the trained input but at the cost of impaired memory for dissimilar inputs.
All simulations were done with Python 3 (using numpy and scipy) and the Brian2 simulator (Stimberg et al., 2013; Goodman et al., 2014), using exact integration with time step dt = 0.1 ms. For PCA and the nearest centroid classifier, we used the scikit-learn package (Pedregosa et al., 2011). The script of the model is available at https://github.com/buschman-lab/FlexibleWorkingMemory or by contacting TJB.