
Computational optimization of associative learning experiments

Filip Melinscak*1,2 and Dominik R. Bach1,2,3

1 Computational Psychiatry Research; Department of Psychiatry, Psychotherapy, and Psychosomatics; Psychiatric Hospital; University of Zurich
2 Neuroscience Center Zurich; University of Zurich
3 Wellcome Centre for Human Neuroimaging and Max Planck/UCL Centre for Computational Psychiatry and Ageing Research; University College London

* Correspondence: Filip Melinscak <filip.melinscak@uzh.ch>

Abstract

With computational biology striving to provide more accurate theoretical accounts of biological systems, use of increasingly complex computational models seems inevitable. However, this trend engenders a challenge of optimal experimental design: due to the flexibility of complex models, it is difficult to intuitively design experiments that will efficiently expose differences between candidate models or allow accurate estimation of their parameters. This challenge is well exemplified in associative learning research. Associative learning theory has a rich tradition of computational modeling, resulting in a growing space of increasingly complex models, which in turn renders manual design of informative experiments difficult. Here we propose a novel method for computational optimization of associative learning experiments. We first formalize associative learning experiments using a low number of tunable design variables, to make optimization tractable. Next, we combine simulation-based Bayesian experimental design with Bayesian optimization to arrive at a flexible method of tuning design variables. Finally, we validate the proposed method through extensive simulations covering both the objectives of accurate parameter estimation and model selection. The validation results show that computationally optimized experimental designs have the potential to substantially improve upon manual designs drawn from the literature, even when prior information guiding the optimization is scarce. Computational optimization of experiments may help address recent concerns over reproducibility by increasing the expected utility of studies, and it may even incentivize practices such as study pre-registration, since optimization requires a pre-specified analysis plan. Moreover, design optimization has the potential not only to improve basic research in domains such as associative learning, but also to play an important role in translational research. For example, design of behavioral and physiological diagnostic tests in the nascent field of computational psychiatry could benefit from an optimization-based approach, similar to the one presented here.

Keywords: associative learning, simulation-based Bayesian experimental design, Bayesian optimization

Author summary

To capture complex biological systems, computational biology harnesses accordingly complex models. The flexibility of such models allows them to better explain real-world data; however, this flexibility also creates a challenge in designing informative experiments. Because flexible models can, by definition, fit a variety of experimental outcomes, it is difficult to intuitively design experiments that will expose differences between such models, or allow their parameters to be estimated with accuracy. This challenge of experimental design is apparent in research on associative learning, where the tradition of modeling has produced a growing space of increasingly complex theories. Here, we propose to use computational optimization methods to design associative learning experiments. We first formalize associative learning experiments, making their optimization possible, and then we describe a Bayesian, simulation-based method of finding optimized experiments. In several simulated scenarios, we demonstrate that optimized experimental designs can substantially improve upon the utility of often-used canonical designs. Moreover, a similar approach could also be used in translational research; e.g., in the nascent field of computational psychiatry, designs of behavioral and physiological diagnostic tests could be computationally optimized.

1 Introduction

A major goal of computational biology is to find accurate theoretical accounts of biological systems. Given the complexity of biological systems, accurately describing them and predicting their behavior will likely require correspondingly complex computational models. Although the flexibility of complex models provides them with potential to account for a diverse set of phenomena, this flexibility also engenders an accompanying challenge of designing informative experiments. Indeed, flexible models can often be difficult to distinguish one from another under a variety of experimental conditions (Silk et al. 2014), and the models' parameters often cannot be estimated with high certainty (Apgar et al. 2010). Designing informative experiments entails formalizing them in terms of tunable design variables, and finding values for these variables that will allow accurate model selection and parameter estimation. This challenge is well exemplified - and yet unaddressed - in the field of associative learning research.

Associative learning is the ability of organisms to acquire knowledge about environmental contingencies between stimuli, responses, and outcomes. This form of learning may not only be crucial in explaining how animals learn to efficiently behave in their environments (Enquist, Lind, and Ghirlanda 2016), but understanding associative learning also has considerable translational value. Threat conditioning (also termed "fear conditioning"), as a special case of associative learning, is a long-standing model of anxiety disorders; accordingly, threat extinction learning underlies treatments such as exposure therapy (Vervliet, Craske, and Hermans 2013). Consequently, the phenomenon of associative learning - whether in the form of classical (Pavlovian) or operant (instrumental) conditioning - has been a subject of extensive empirical research. These empirical efforts have also been accompanied by a long tradition of computational modeling: since the seminal Rescorla-Wagner model (Rescorla and Wagner 1972), many other computational accounts of associative learning have been put forward (Pearce and Bouton 2001; Dayan and Kakade 2001; Gershman and Niv 2012; Gershman 2015; Tzovara, Korn, and Bach 2018), together with software simulators that implement these theories (Thorwart et al. 2009; Alonso, Mondragón, and Fernández 2012). As this space of models grows larger and more sophisticated, the aforementioned challenge of designing informative experiments becomes apparent. Here, we investigate how this challenge can be met through theory-driven, computational methods for optimizing associative learning experiments.

Similar methods of optimal experimental design have been developed in various subfields of biology and cognitive science concerned with formal modeling; for example, in systems biology (Vanlier et al. 2012; Liepe et al. 2013), neurophysiology (Lewi, Butera, and Paninski 2009; Sanchez et al. 2014), neuroimaging (Daunizeau et al. 2011; Lorenz et al. 2016), psychology (Myung and Pitt 2009; Rafferty, Zaharia, and Griffiths 2014), and behavioral economics (Vincent and Rainforth 2017; Balietti, Klein, and Riedl 2018). These efforts have generally demonstrated the potential benefits of computational optimization in experimental design. Yet, in research on associative learning, there is a lack of tools for design optimization, although the foundation for this development has been laid through extensive computational modeling. We therefore explore the potential of computationally optimizing experimental designs in associative learning studies, as opposed to the common practice of manually designing such experiments.

In optimizing experimental designs we usually need to rely on prior information: formalizing this notion is the essence of the Bayesian experimental design framework (Chaloner and Verdinelli 1995). Bayesian experimental design is the basis of the aforementioned methods in other domains, and it also underlies our approach. This framework formalizes the problem of experimental design as maximizing the expected utility of an experiment. With this formulation we can integrate prior knowledge with explicit study goals to arrive at optimal designs. Moreover, upon observing new experimental data, the Bayesian design framework prescribes a principled manner in which to update the optimal design. However, Bayesian experimental design requires calculations of expected utility that usually result in analytically intractable integrals. To address this issue, we adopt a simulation-based approach to utility calculations (Wang and Gelfand 2002). Using the simulation-based approach, we can flexibly choose the experiment structure, space of candidate models, goals of the study, and the planned analysis procedure, without the need to specify an analytically tractable utility function.

To illustrate the proposed design method, we apply it to the optimization of classical conditioning experiments, as a special case of associative learning. We first enable computational optimization by formalizing the structure of classical conditioning experiments and by identifying tunable design variables. Next, we propose low-dimensional design parameterizations that make optimization computationally tractable. We then generate optimized designs for three different types of study goals: namely, (1) the goal of accurately estimating parameters of a model, (2) the goal of comparing models with different learning mechanisms, and (3) the goal of comparing models with different mappings between latent states and emitted responses. Additionally, we perform the optimizations under varying levels of prior knowledge. Finally, we compare the optimized designs with commonly used manual designs drawn from the literature. This comparison reveals that optimized designs substantially outperform manual designs, highlighting the potential of the method.

2 Methods

We can decompose the problem of designing experiments into three components: formalizing, evaluating, and optimizing. We will first outline each of these components of experimental design and introduce the main concepts. The details and the application to designing associative learning experiments are covered by the later sub-sections. Here we provide a non-technical description of the method, with the more formal treatment available in the S1 Appendix.

Formalizing experiments involves specifying the experiment structure and parameterizing it in terms of design variables. Design variables are the tunable parameters of the experiment (e.g., probabilities of stimuli occurring), and the experiment structure is a deterministic or stochastic mapping from the design variables to experiment realizations (e.g., sequences of stimuli). The relationship between experiment structure, design variables, and experiment realizations is analogous to the relationship between statistical models, their parameters, and observed data, respectively. Importantly, much like statistical models, experiment structures can be parameterized in various ways, some more amenable to design optimization than others.

Evaluating experiments consists of calculating the expected utility for evaluated designs. Calculation of expected utility requires us to integrate across the distribution of datasets observable under our assumptions; however, this calculation generally results in analytically intractable integrals. To tackle this issue, we adopted the simulation-based approach to the evaluation of expected utility (Wang and Gelfand 2002). The simulation-based approach entails three steps: simulating datasets, analyzing datasets, and calculating the expected utility (fig. 1.A-C).

Simulated datasets consist both of experiment realizations and model responses (fig. 1.A). Experiment realizations are sampled from the experiment structure using the design being evaluated. In order to sample model responses, we first need to specify a candidate model space and an evaluation prior. A candidate model space is a set of generative statistical models, which encode hypotheses about possible data-generating processes underlying the phenomenon under study. An evaluation prior is a probability distribution that encodes the a priori plausibility of candidate models and their parameters. Using the evaluation prior, we can sample a candidate model and its parameters, which together comprise the ground truth. Using the ground truth model and the experiment realization, we can now sample model responses, thus completing the simulated dataset.

With the simulated dataset we can perform the planned analysis procedure (fig. 1.B). This is the procedure that we plan to conduct with the dataset obtained in the actual experiment. Any type of statistical analysis can be used here, but Bayesian analyses seem particularly appropriate, since specifying generative models and priors is already a prerequisite of the design evaluation procedure. If we employ a Bayesian analysis, an additional prior needs to be specified - the analysis prior. Unlike the evaluation prior - which should reflect the experimenter's best guess about the state of the world, and can be as informative as deemed appropriate - the analysis prior should be acceptable to a wider audience, and therefore should not be overly informative or biased (Brutti, De Santis, and Gubbiotti 2014). The analysis procedure - Bayesian or otherwise - will yield an analysis result; for example, a parameter estimate (in estimation procedures) or a discrete choice of the best-supported model (in model selection procedures).

Having performed the analysis on the simulated dataset, we can now calculate the experiment utility (fig. 1.C). The experiment utility is provided by the utility function, which encodes the goals of the study, and maps the values of design variables to utility values. Although we are interested in how the utility depends on the design, the utility function may generally depend not only on the direct properties of the design (e.g., experiment duration and cost) but also on quantities influenced by the evaluation prior or the analysis procedure, e.g., the analysis result and the ground truth. For example, when the goal is accurate parameter estimation, the utility function can be based on the discrepancy between the true parameter value (ground truth), and the obtained parameter estimate (analysis result), neither of which are direct properties of the design.

Figure 1: The simulation-based method of evaluating and optimizing experiment designs. Note: light-blue
elements denote required user inputs. (A) Dataset simulation process generates experiment realizations (e.g.,
cue-outcome sequences) and model responses (e.g., predicted physiological or behavioral responses), which
together form a simulated dataset. (B) Dataset analysis applies the user-specified analysis procedure to the
simulated dataset and produces analysis results (e.g., model evidence or parameter estimates). If the analysis
procedure is Bayesian, an additional analysis prior needs to be specified. (C) The calculation of expected
design utility requires simulating and analyzing a number of datasets. The user-defined utility function -
which can depend directly on the design or on other simulation-specific quantities - provides the utility value
for each simulated experiment. The expected design utility is obtained by averaging experiment-wise utilities.
(D) Design optimization proceeds in iterations: the optimization algorithm proposes a design from the design
space, the design is evaluated under the design prior, and the expected design utility is passed back to the
optimizer. If the optimization satisfies the termination criterion (which is one of the user-defined optimization
options), the optimizer returns the optimized design.

Finally, to obtain the expected design utility, we simulate a number of experiments using the same design, calculate the experiment utility for each experiment, and average the utility values across experiments.

Optimizing experiments requires finding the values of design variables that maximize the expected utility under our evaluation prior (please note, we will refer to the evaluation prior used in design optimization as the design prior, to avoid confusion when a design is optimized using one prior, but ultimately evaluated using a different prior). The principle of maximizing expected design utility is the core idea of Bayesian experimental design, which is simply the application of Bayesian decision theory to the problem of designing experiments (Schlaifer and Raiffa 1961; Lindley 1972; Chaloner and Verdinelli 1995). Practically, maximizing expected design utility can be achieved using optimization algorithms (fig. 1.D). We first specify the domain over which to search for the optimal design - the design space. The design space is defined by providing the range of feasible values for the design variables, which must also take into account any possible constraints on these variables. Next, we specify the optimization objective function, which is in this case the expected design utility, evaluated through the previously described simulation-based approach. Lastly, we specify the parameters of the optimization algorithm, e.g., the termination criterion. The optimization proceeds in iterations that consist of the algorithm proposing a design, and obtaining back the expected design utility. After each iteration, the termination criterion is checked: if the criterion is not fulfilled, the algorithm continues by proposing a new design (based on the previously observed utilities). Ultimately, the algorithm returns an optimized design, which - importantly - might not be the optimal design, depending on the success of the search.

2.1 Formalizing associative learning experiments

As outlined, in order to optimize associative learning experiments, we first need to formalize their structure. Here we consider the paradigm of classical conditioning, as a special case of associative learning, where associations are formed between presented cues and outcomes. There are many design variables that need to be decided in implementing a conditioning experiment, such as physical properties of the stimuli, or presentation timings. However, learning is saliently driven by statistical contingencies between cues and outcomes in the experiment; consequently, most computational models of associative learning capture the effect of these contingencies on the formation of associations. Therefore, we focus on the statistical properties of the experiment as the design variables of interest.

We formalize a classical conditioning experiment as a sequence of conditioning trials.

Figure 2: Formalizing and parameterizing the structure of classical conditioning experiments. (A) A conditioning trial can be represented as a Markov chain, parameterized by the probabilities of presenting different cues (e.g., colored shapes), and transition probabilities from cues to outcomes (e.g., delivery of shock vs. shock omission). (B) Stage-wise parameterization of conditioning experiments. Trials in each stage are generated by a Markov chain with constant transition probabilities. The transition probabilities of each stage are tunable design variables. (C) Periodic parameterization. The transition probabilities in the Markov chain are determined by square-periodic functions of time (i.e., trial number). The design variables are the two levels of the periodic function and its half-period.

183 sampled independently, and if the probability of the outcome depends only on the cue, then each trial can be

184 represented as a Markov chain (fig. 2.A). In this formalization, the cue probabilities P (CS) and conditional

185 outcome probabilities P (US|CS) are the transition probabilities of the trial-generating Markov chain. These

186 transition probabilities represent the design variables to be optimized.
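To make this concrete, a trial sequence under this formalization can be sampled as in the following MATLAB sketch; the function name and interface are illustrative rather than taken from the accompanying code.

    function [cues, outcomes] = sample_trials(nTrials, pCue, pUSgivenCue)
    % Sample a cue-outcome sequence from the trial-generating Markov chain.
    %   pCue        - 1 x K vector of cue probabilities P(CS), summing to 1
    %   pUSgivenCue - 1 x K vector of conditional outcome probabilities P(US|CS)
    cues = zeros(nTrials, 1);
    outcomes = zeros(nTrials, 1);
    cumCue = cumsum(pCue(:)');
    cumCue(end) = 1;  % guard against floating-point round-off
    for t = 1:nTrials
        cues(t) = find(rand() <= cumCue, 1, 'first');         % draw the cue index
        outcomes(t) = double(rand() < pUSgivenCue(cues(t)));   % draw the US given the cue
    end
    end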

Allowing the transition probabilities of each trial to be independently optimized results in a high-dimensional optimization problem, even with a modest number of trials. However, the majority of existing experimental designs for studying associative learning vary the cue-outcome contingencies only between experiment stages (i.e., contiguous blocks of trials), and not between individual trials. Examples of such designs include forward (Kamin) blocking (Kamin 1968), backward blocking (Shanks 1985), reversal learning (Li et al. 2011), over-expectation (Rescorla 1970), overshadowing (Pavlov 1927), conditioned inhibition (Pavlov 1927), and latent inhibition (Lubow and Moore 1959). Constraining the designs to allow only for contingency changes between stages can dramatically reduce the number of design variables, making optimization tractable. Therefore, the first design parameterization of conditioning experiments that we introduce here is the stage-wise parameterization. In this parameterization the experiment is divided into stages, with each stage featuring a trial-generating Markov chain with constant transition probabilities (fig. 2.B). The number of trials per stage is assumed to be known, and the tunable design variables are stage-wise cue probabilities P(CS) and conditional outcome probabilities P(US|CS) (some of these design variables can further be eliminated based on constraints). The aforementioned motivating designs from the literature are special cases of the stage-wise parameterization, but this parameterization also offers enough flexibility to go beyond classical designs.
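Under the stage-wise parameterization, a full experiment realization can be generated by concatenating the output of the trial sampler sketched above, stage by stage. The concrete numbers below are purely illustrative and do not correspond to any design used in the paper.

    % Illustrative two-stage design with two cues; each stage has its own
    % cue probabilities P(CS) and reinforcement probabilities P(US|CS).
    stages(1) = struct('nTrials', 40, 'pCue', [0.5 0.5], 'pUS', [0.8 0.2]);
    stages(2) = struct('nTrials', 40, 'pCue', [0.5 0.5], 'pUS', [0.2 0.8]);

    cues = []; outcomes = [];
    for s = 1:numel(stages)
        [c, o] = sample_trials(stages(s).nTrials, stages(s).pCue, stages(s).pUS);
        cues = [cues; c];          %#ok<AGROW> % append this stage's trials
        outcomes = [outcomes; o];  %#ok<AGROW>
    end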

However, if we are interested in associative learning in environments with frequent contingency changes, the stage-wise parameterization may require many stages to implement such an environment, and would thus still lead to a high-dimensional optimization problem. To address this scenario, we introduce the periodic parameterization, which is a parameterization based on periodically-varying contingencies (fig. 2.C). In this parameterization, each cue probability P(CS) and conditional outcome probability P(US|CS) is determined by a periodic function of time (i.e., trial number). Specifically, we use a square-periodic function parameterized with the two levels P1 and P2, between which the probability cycles, and the length of the half-period T. This parameterization allows expressing designs with many contingency changes using a small number of design variables, when compared with an equivalent stage-wise design; however, unlike the stage-wise parameterization, the periodic parameterization constrains contingency changes to occur at regular intervals and only between two fixed values.
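The square-periodic schedule is simply a function of the trial index; a minimal MATLAB sketch is shown below, assuming a single cue and a half-period expressed as a fraction of the total number of trials (as in the design space used later). The function name is illustrative.

    function pUS = square_periodic_schedule(nTrials, p1, p2, halfPeriodFrac)
    % P(US|CS) alternates between the levels p1 and p2, switching every
    % halfPeriodFrac * nTrials trials (rounded to at least one trial).
    halfPeriod = max(1, round(halfPeriodFrac * nTrials));
    t = (1:nTrials)';
    phase = mod(floor((t - 1) / halfPeriod), 2);    % 0 in odd half-periods, 1 in even ones
    pUS = p1 .* (phase == 0) + p2 .* (phase == 1);
    end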

2.2 Evaluating associative learning experiments

Having formalized the structure of classical conditioning experiments, we now specify the procedure to evaluate their expected utility using the simulation-based approach. To simulate a classical conditioning dataset, we first sample an experiment realization - i.e., a sequence of cues (CSs) and corresponding outcomes (USs) - using the experiment structure and the evaluated design. We then sample the ground truth model of associative learning and its parameter values from the evaluation prior. Using the sampled CS-US pairs, sampled parameter values, and the generative model, we generate the conditioned responses (CRs). These trial-wise triplets of CSs, USs, and CRs constitute a single simulated dataset.
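For instance, if the sampled ground truth were a Rescorla-Wagner learner with elemental cues, the conditioned responses could be generated roughly as follows. This is a sketch only; the models used in the scenarios below (e.g., with compound cues or other response mappings) differ in detail.

    function cr = simulate_rw_responses(cues, outcomes, alpha, noiseSd, nCues)
    % Rescorla-Wagner learner: the CR on each trial is the associative weight
    % of the presented cue plus Gaussian observation noise; the weight is then
    % updated with the delta rule.
    V = zeros(1, nCues);               % associative weights, initialized at zero
    cr = zeros(numel(cues), 1);
    for t = 1:numel(cues)
        c = cues(t);
        cr(t) = V(c) + noiseSd * randn();              % emitted (noisy) response
        V(c) = V(c) + alpha * (outcomes(t) - V(c));    % prediction-error update
    end
    end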

With this simulated dataset we can proceed to perform the analysis that we are planning to do with the real dataset that will be collected. In the scenarios presented here, we simulate and analyze only single-subject data. We do so for two reasons. First, the single-subject scenario provides more difficult conditions for statistical inference, due to the limited amount of data. Second, the single-subject analysis is especially relevant from a translational view, which considers clinical inferences that need to be made for each patient individually.

For the analysis, we can specify any procedure that is afforded by the simulated dataset, but computationally demanding analyses can lead to long design evaluation running times. In the current article, we focus on two model-based types of analyses that are common in the modeling literature on associative learning: parameter estimation within a single model, and model selection within a space of multiple models. For parameter estimation, we use the maximum a posteriori (MAP) approach with uniform priors (which is equivalent to maximum likelihood estimation (MLE)), and for model selection we use the Bayesian information criterion (BIC), which takes into account both the model fit and model complexity. Both procedures are computationally relatively inexpensive and were implemented using the existing 'mfit' toolbox for MATLAB (Gershman 2018). Fully Bayesian inference with proper analysis priors could also be used here, as long as the inference procedure is not overly computationally expensive.
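As a schematic example of the model selection step, BIC-based selection over a candidate model space could look as follows. Here, fit_model is a placeholder for a maximum-likelihood fitting routine (e.g., one built on the 'mfit' toolbox); it is not an actual 'mfit' function name, and nModels and trueModel are assumed to be defined by the simulation setup.

    % Given a simulated dataset (cues, outcomes, cr) and nModels candidates:
    n = numel(cr);                      % number of data points (trials)
    bic = zeros(nModels, 1);
    for m = 1:nModels
        [logL, k] = fit_model(m, cues, outcomes, cr);  % maximized log-likelihood and
                                                       % number of free parameters
        bic(m) = k * log(n) - 2 * logL;                % Bayesian information criterion
    end
    [~, selectedModel] = min(bic);                     % lower BIC = preferred model
    selectionError = double(selectedModel ~= trueModel);  % 0-1 loss used as (negative) utility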

Finally, with the results of the analysis on the simulated dataset, we can assess the utility of the experiment. The utility function should be determined by the goals of the study. In this article, for parameter estimation analyses, we aim to minimize the absolute error between the parameter estimate and the ground truth value, and for model selection analyses, we aim to minimize the misidentification of the ground truth model (i.e., model selection error or 0-1 loss).

2.3 Optimizing associative learning experiments

So far, we have formalized the structure of associative learning experiments, and provided a method of assessing their utility by simulation. We are now in a position to specify a method of optimizing the design of such experiments.

Although the simulation-based approach to design evaluation is instrumental in providing flexibility to the user, it can yield difficult optimization problems. Evaluating the expected design utility using stochastic simulations gives noisy estimates of utility and can often be computationally expensive. Furthermore, we generally cannot make strong assumptions about the utility function (e.g., convexity) and we usually do not have access to the derivatives of the function. A state-of-the-art optimization algorithm for such scenarios is Bayesian optimization (Brochu, Cora, and Freitas 2010). Bayesian optimization has already been effectively used in experimental design problems (Weaver et al. 2016; Balietti, Klein, and Riedl 2018), and it was therefore our optimizer of choice. Bayesian optimization is a global optimization algorithm which uses a surrogate probabilistic model of the utility function to decide which values of the optimized variables to evaluate next; this allows it to efficiently optimize over expensive utility functions, at the cost of continuously updating the surrogate model (see S1 Appendix for technical details). In the results presented here, we used the MATLAB bayesopt function (from the 'Statistics and Machine Learning Toolbox'), which implements Bayesian optimization using a Gaussian process as the surrogate model.
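A minimal sketch of such an optimization call, using the periodic parameterization from section 2.1, is shown below. Here, estimate_expected_utility refers to the schematic simulation-based evaluator sketched earlier (not a function of the accompanying code), designPrior and analysisSpec are assumed to be defined by the user, and the option values are illustrative.

    % Design variables of the periodic parameterization (single cue); the
    % half-period is expressed as a fraction of the total number of trials.
    vars = [optimizableVariable('p1', [0 1]), ...
            optimizableVariable('p2', [0 1]), ...
            optimizableVariable('halfPeriod', [0 1])];

    % bayesopt minimizes, so we negate the (noisy) expected utility estimate.
    objFun = @(x) -estimate_expected_utility( ...
        struct('p1', x.p1, 'p2', x.p2, 'halfPeriod', x.halfPeriod), ...
        32, designPrior, analysisSpec);

    results = bayesopt(objFun, vars, ...
        'MaxObjectiveEvaluations', 300, ...
        'IsObjectiveDeterministic', false);   % utility estimates are stochastic
    optimizedDesign = bestPoint(results);     % table of optimized design variables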

To fully specify the design optimization problems, we also need to define the design space, the design prior, and the optimization options. The design space over which the optimization was performed depended on the design parameterization. For the stage-wise parameterization, each transition probability could range from 0 to 1, with the constraint that the sum of transition probabilities from a given starting state is 1. For the periodic parameterization, the two probability levels P1 and P2, and the half-period T (expressed as a fraction of the total trial number) also ranged from 0 to 1. For the periodic parameterization, the design is constrained such that all the transition probabilities from a given state have the same half-period, and such that the P1 levels, and likewise the P2 levels, of the transitions from a given state each sum to 1. The design priors were determined in a problem-specific manner, and are described together with the application scenarios in the Results section. In general, the design prior should be based on the information available from the relevant literature or from existing data. Lastly, the optimization options depend on the specifics of the chosen optimization algorithm (for Bayesian optimization, details are provided in the S1 Appendix). A common optimization option is the termination criterion: we used a fixed maximum number of optimization iterations as the criterion, but the maximum elapsed time, or the number of iterations without substantial improvement, can also be used.

2.4 Workflow for evaluating and optimizing associative learning experiments

Finally, we propose a workflow for employing the described simulation-based method in the evaluation and optimization of associative learning experiments. The proposed workflow is as follows (the required user inputs are italicized; see also fig. 1):

1. Specify the assumptions and goals of the design problem. In this step we need to specify the components of the design problem which are independent of the specific designs that will be evaluated. In particular, we specify our candidate model space, evaluation priors over the models and their parameters, analysis procedure, analysis priors, and the utility function we wish to maximize.

2. Evaluate pre-existing reference designs. If there are pre-existing reference designs, it is advisable to first evaluate their expected utility. If these designs already satisfy our requirements, we do not need to proceed with design optimization. In this step we first formalize reference designs, and then we use the simulation-based approach to evaluate their expected utility under the assumptions made when specifying the design problem (step 1).

3. Optimize the design. If existing reference designs do not satisfy our requirements, we proceed with obtaining an optimized design. In this step we first specify the experiment structure and its design variables (both tunable and fixed), the design space, the design priors, and optimization options. Having specified these, we run the design optimization until the termination criterion is fulfilled, and then inspect the resulting design.

4. Evaluate the optimized design. We take the optimized values of design variables obtained in step 3, and use them to specify the optimized design. Then we obtain the expected utility of the optimized design in the same simulation-based manner as we did for the reference designs (in step 2), and inspect the obtained results.

5. Compare the reference and optimized designs. Having evaluated the reference and optimized designs through simulations (in steps 2 and 4, respectively), we can now compare their expected utilities. For example, if we are aiming for accurate parameter estimates, we can inspect the design-wise distributions of estimation errors, and if we are aiming for accurate model selection, we can inspect the design-wise confusion matrices (and corresponding model recovery rates). If the optimized design does not satisfy our requirements, we can go back to step 3; for example, we can modify the experiment space or run the optimization for longer. If the optimized design satisfies our requirements, we can proceed to its implementation.

For one of the simulated scenarios presented in the Results section (Scenario 2), we illustrate the described workflow with an executable MATLAB notebook provided on GitHub at https://github.com/fmelinscak/coale-paper.
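In outline, the workflow corresponds to a script of roughly the following shape; the wrapper function names are purely illustrative and are not the functions used in the linked repository.

    problem    = define_design_problem(modelSpace, evalPrior, analysisSpec, utilityFun); % step 1
    refUtility = evaluate_design(referenceDesign, problem);                              % step 2
    optDesign  = optimize_design(problem, designSpace, designPrior, optimOptions);       % step 3
    optUtility = evaluate_design(optDesign, problem);                                    % step 4
    compare_designs(refUtility, optUtility);                                             % step 5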

3 Results

We validated the proposed method in three scenarios, with one scenario targeting the goal of accurate parameter estimation, and the other two targeting the goal of accurate model selection. The overview of user inputs employed in these scenarios is given in table 1. In each scenario we compared optimized designs with a reference design drawn from prior work in associative learning. In order to evaluate the design's accuracy of parameter estimation or model selection, it is necessary to know the ground truth, i.e., to sample a data-generating model and its parameter values from the evaluation prior. Therefore, we have evaluated the designs through simulations, as is common in the field of experimental design. To illustrate the benefit of optimized designs, we report estimates of the difference in accuracy between the designs, and the associated 95% confidence intervals. We do not quantify the benefit of optimized designs using classical p-values because the number of simulations (sample size) can be made arbitrarily large, such that optimized designs can be made "significantly" better than reference designs even if the differences are trivially small.

Furthermore, we wanted to investigate the effect of different amounts of background knowledge, i.e., to perform a sensitivity analysis with respect to the design prior, while holding the ground truth (evaluation prior) constant. Therefore, we optimized designs both under a vague (weakly-informative) design prior and under a strongly-informative design prior. The designs were then evaluated using point evaluation priors, representing an assumed ground truth. With these two extreme design priors, we obtain, respectively, approximate lower and upper bounds on the expected design utility of optimized experiments. However, please note that while in a methodological study - like this one - it can be advantageous to disambiguate between the design prior and the evaluation prior, in practical applications these two priors should coincide, with both of them expressing the experimenter's knowledge base at the time of designing the experiment.

Table 1: Overview of user inputs for the three simulated scenarios. The role of user inputs is clarified by fig. 1. See sub-sections 3.1 and 3.2 for details. Analysis priors are used in fitting models, design priors are used to simulate data when optimizing designs, and evaluation priors are used to simulate data when evaluating both reference and optimized designs.

User inputs | Scenario 1 | Scenario 2 | Scenario 3
Candidate model space | RW | RW, KRW | RW(V), RWPH(V), RWPH(α), RWPH(V + α)
Analysis procedure | Maximum likelihood parameter estimation | Model selection using BIC | Model selection using BIC
Utility function | Absolute error of learning rate (α) estimate | Model selection accuracy | Model selection accuracy
Analysis prior | Uniform over α | Uniform over models and parameters | Uniform over models and parameters
Reference design | Acquisition followed by extinction of equal length | Backward blocking | Reversal learning
Evaluation prior | Point priors on low, middle and high α (LA, MA, HA) | Uniform over models with point priors on parameters (from Kruschke 2008) | Uniform over models with point priors on parameters (from Li et al. 2011)
Optimized experiment structure | One cue with periodically varying contingency | Two stages with three cues (A, B, AB) and stage-wise contingencies | Two stages with two cues (A, B) and stage-wise contingencies
Design space | Two contingencies (P1, P2) and the period (T) of their switching (3 variables) | For each stage s and cue X: Ps(X), Ps(US|X) (10 non-redundant variables) | For each stage s and cue X: Ps(X), Ps(US|X) (6 non-redundant variables)
Design priors | Either a point prior over α coinciding with the evaluation prior (PA) or a vague prior (VA) | Uniform over models with either a point (PP) or vague (VP) prior over parameters | Uniform over models with either a point (PP) or vague (VP) prior over parameters

3.1 Optimizing for accuracy in parameter estimation

An early, but still widely influential model of classical conditioning is the Rescorla-Wagner (RW) model (Rescorla and Wagner 1972). A parameter of particular interest in this model is the learning rate α (Behrens et al. 2007). For example, the learning rate of the RW model fitted to human learning data in volatile conditions has been found to correlate with trait anxiety (Browning et al. 2015). Therefore, in Scenario 1, we optimized designs with the goal of accurately estimating the RW learning rate α.

As ground truth, three different point evaluation priors on the learning rate were used: low alpha (LA, α = 0.1), middle alpha (MA, α = 0.2), and high alpha (HA, α = 0.3). Three types of experimental designs were compared under these three evaluation priors using the absolute estimation error metric. All three designs have the same structure, which is simple but sufficient to estimate the learning rate: the probability of the CS being followed by a US (P(US|CS)) is a periodic function of time, and it switches between two discrete values. Therefore, there are three variables that specify the design: the two reinforcement probabilities, and the period with which these two probabilities are switched.

The first evaluated design is the (manual) reference design (denoted REF), which was taken from existing literature rather than being computationally optimized. Here we chose the often-used classical conditioning design in which an association is first acquired under partial reinforcement of a CS with a US (acquisition stage), and then extinguished by ceasing the reinforcement of the CS (extinction stage) (Humphreys 1939). The probabilities of reinforcement were 0.5 and 0 in the acquisition and the extinction stages, respectively. The switch between the stages occurred only once, at the middle of the experiment. The second design is the design optimized under a vague prior over the learning rate α (denoted VA-OPT). The vague prior here is implemented as a uniform prior over α. This prior reflects the situation when there is little information on the parameter of interest at the planning stage of the experiment. Under this prior we are estimating a lower bound on the accuracy of optimized experiments. The third design is the design optimized under the exact point prior on the learning rate α (denoted PA-OPT). This prior reflects the situation where we have perfect knowledge of the ground truth learning rate, and it therefore results in different specific designs for the LA, MA, and HA evaluation priors. Although in practice the ground truth is not known, under this prior we can estimate an upper bound on the accuracy of optimized experiments.

The optimized designs were obtained by running Bayesian optimization for 300 iterations, with the absolute estimation error of the learning rate being minimized. In each iteration, the estimation error was evaluated using 32 datasets simulated from the RW model, with parameters being randomly drawn from the design prior. Using 32 CPU cores at 2.5 GHz, the average time to obtain an optimized design was 0.5 h, with the average being taken across optimizations with different design priors. The reference design and the optimized designs were evaluated in an additional 256 simulations for each combination of the evaluation prior (ground truth) and design.

The results of the design evaluation are shown in fig. 3. We first inspect the distribution of absolute estimation errors under different designs and evaluation priors (fig. 3.A). Under all designs, the estimation errors were relatively small (with almost all simulations yielding errors below 0.1). This can be attributed to a sufficient number of simulated trials and moderate levels of observation noise. Nevertheless, even in such favorable conditions, the REF design yielded the largest estimation errors, followed by the VA-OPT design and the PA-OPT design. This suggests that including even weak prior knowledge (as in the VA-OPT design) can improve parameter estimation accuracy, with more informative prior knowledge providing further improvements (as in PA-OPT).

Next, we quantified the relative differences in design utility by estimating the probability that a randomly sampled experiment from one design will yield higher utility than an experiment sampled from another design (i.e., the probability of superiority). These probabilities were estimated for all pairs of designs by computing the non-parametric common language effect sizes (CLES) (McGraw and Wong 1992), which indicate designs of equal expected utility with CLES = 50%. The 95% CIs for the CLES estimates were obtained using the percentile bootstrap method (Efron 1981), with 1000 bootstrap samples.

Figure 3: Design evaluation in Scenario 1: RW model learning rate estimation. Three designs - reference
acquisition-extinction design (REF), design optimized under a vague prior (VA-OPT), and design optimized
under a point prior (PA-OPT) - are evaluated under the low (LA), middle (MA), and high (HA) value of
the learning rate alpha. (A) Distribution of absolute errors in estimating the learning rate. (B) Pair-wise
comparison of design accuracy expressed as the probability of the first design in the pair being superior (i.e.,
having lower estimation error). Error bars indicate bootstrapped 95% CI and 50% guideline indicates designs
of equal quality. (C) Comparison of model responses (red full line) to the contingencies (black dashed line)
obtained under different designs and different evaluation priors.

The appropriateness of the percentile bootstrap method was verified by inspecting bootstrap distribution histograms, which did not exhibit substantial bias or skew. The design evaluation results are shown in fig. 3.B and they confirm that the VA-OPT design is more likely to yield lower estimation errors than the REF design, under all evaluation priors: the probabilities of VA-OPT being superior are 58.9% (95% CI: [53.9% - 63.8%]), 59.2% (95% CI: [54.1% - 64.0%]), and 58.2% (95% CI: [53.6% - 62.7%]) for the LA, MA, and HA evaluation priors, respectively. Moreover, the PA-OPT design is likely to be superior - under all evaluation priors - both to the REF design (LA - 68.8%, 95% CI: [64.2% - 73.2%]; MA - 66.3%, 95% CI: [61.8% - 70.8%]; HA - 63.3%, 95% CI: [58.4% - 68.5%]) and to the VA-OPT design (LA - 60.0%, 95% CI: [55.0% - 64.9%]; MA - 58.0%, 95% CI: [53.1% - 62.7%]; HA - 55.8%, 95% CI: [50.8% - 60.8%]).
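For reference, the probability-of-superiority estimate and its percentile-bootstrap CI can be computed along the following lines. This is an illustrative MATLAB sketch, assuming errA and errB hold the per-simulation absolute errors of the two designs being compared (ties are counted as 1/2).

    % Probability of superiority: P(error under design A < error under design B).
    probSup = @(a, b) mean(reshape(bsxfun(@lt, a(:), b(:)'), [], 1)) ...
                    + 0.5 * mean(reshape(bsxfun(@eq, a(:), b(:)'), [], 1));

    clesEstimate = probSup(errA, errB);

    % Percentile-bootstrap 95% CI (resampling each design's simulations).
    nBoot = 1000;
    clesBoot = zeros(nBoot, 1);
    nA = numel(errA); nB = numel(errB);
    for i = 1:nBoot
        clesBoot(i) = probSup(errA(randi(nA, nA, 1)), errB(randi(nB, nB, 1)));
    end
    ci = prctile(clesBoot, [2.5 97.5]);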

Finally, we inspect the obtained designs and the corresponding model responses in fig. 3.C. The REF design and the VA-OPT design are the same under all evaluation priors, since they do not take into account the ground truth learning rate, whereas the PA-OPT designs are tuned to the exact parameter value. Compared to the REF design, the VA-OPT design uses more frequent contingency switches, allowing for more accurate estimation of the learning rate under a wide range of possible values. The PA-OPT design adapts the frequency of contingency switches to the learning rate, with low learning rates being better estimated from longer periods of stability, and higher learning rates being better estimated from frequent transitions.

3.2 Optimizing for accuracy in model selection

The basic RW model explains many aspects of associative learning, but it fails to account for many others (Miller, Barnet, and Grahame 1995). For this reason, various extensions of the basic RW model have been proposed in the literature. One example is the Kalman Rescorla-Wagner (KRW) model (Dayan and Kakade 2001; Kruschke 2008; Gershman 2015), which is a probabilistic extension of the RW model. Another example is the hybrid Rescorla-Wagner-Pearce-Hall (RWPH) model (Li et al. 2011), which maintains separate dynamic learning rates (termed "associabilities") for all the cues. The growing space of theoretical models of associative learning highlights the need for experiments that can efficiently expose the differences between these accounts. Therefore, in Scenarios 2 and 3, we optimized designs to achieve the goal of accurate model selection (i.e., how often the selected winning model coincides with the ground truth model). The model space of Scenario 2 is comprised of models with different learning mechanisms, whereas the model space of Scenario 3 entails not only models with different learning mechanisms, but also models with different mappings from latent model states to conditioned responses.

Scenario 2 is based on the simulation study of Kruschke (2008), which compares the RW and the KRW model using the backward blocking design. The backward blocking design is comprised of two training stages, followed by a test stage. In the trials of the first stage, only the reinforced compound cue AB is presented. In the trials of the second stage, only one of the elements of the compound (e.g., A) is presented and reinforced. In the test stage, the single element cues - A and B - are tested separately in non-reinforced trials. This design has been used to compare the RW and KRW models because they make different predictions about test responding to the cue that was not presented in the second stage (here, cue B). According to the RW model, responding to cue B should be unchanged, as this model allows cue-specific associations to change only when the cues are presented. In contrast, the KRW model predicts that the responding to cue B at test will be diminished (blocked), because the model has assigned the credit for reinforcements to the cue A (i.e., during the first stage a negative correlation between the associative weights for element cues is learned). Due to the opposing model predictions for this design, and its previous use in the literature, we chose it as the reference (REF) design in Scenario 2.

Scenario 3 is based on the empirical study of Li et al. (2011), which compared the RW model (labeled RW(V)) and three variants of the RWPH model. The variants of the RWPH model were formulated to emit different model quantities as the conditioned response: associative weight (RWPH(V)), associabilities (RWPH(α)), or a mixture of weights and associabilities (RWPH(V + α)). In order to distinguish between these models Li et al. (2011) employed a reversal learning design, and this is the design we used as the reference (REF) design in Scenario 3. The design is comprised of two stages: in the first stage one cue is partially reinforced, while the other is not, and in the second stage the two cues switch their roles. This design is intuitively appealing for the goal of discriminating between the RW and RWPH models: at the point of contingency reversal, the learning rate in the RWPH model increases due to large prediction errors, while in the RW model the learning rate is constant.

In both Scenario 2 and 3, the structure and parameterization of the optimized designs were the same: the experiment had two stages (with an equal number of trials), and the cue-outcome contingencies were kept constant within each stage. Hence, the stage-wise design variables were the CS presentation probabilities and the probabilities of the US conditional on the CS. To keep a close comparison with the reference designs, Scenario 2 included two CSs (A and B) and their compound (AB), whereas Scenario 3 only included two elemental CSs. Similarly to Scenario 1, we obtained optimized designs under two types of design priors: a vague prior on model parameters (in the VP-OPT design) and an exact point design prior on model parameters (in the PP-OPT design). The vague priors were either uniform (for bounded parameters) or only weakly informative (see S1 Appendix for details). In Scenario 2 the point priors were set at the same values as in the simulations of Kruschke (2008). The point priors in Scenario 3 were set to the best fitting parameter values of the RWPH(V + α) model reported in the study of Li et al. (2011) (the best fitting parameters of other models are not reported in their study); for other models in this scenario, we were able to use subsets of the RWPH(V + α) parameters, since the models were nested. Importantly, even when the exact point priors on model parameters were used in the design optimization (PP-OPT designs), the simulated datasets were sampled in equal proportions from all candidate models (implying a uniform prior over models).

In both Scenario 2 and 3, we ran the Bayesian design optimization for 300 iterations, and in each iteration the utility function (model selection accuracy) was evaluated using 32 datasets simulated from each of the candidate models. Average optimization times (across different design priors) were 5.2 h in Scenario 2, and 14.1 h in Scenario 3. The optimized designs and the reference designs were evaluated in an additional 256 simulated experiments for each combination of the ground truth model and design. Additionally, we computed the responses of models fitted to the evaluation simulations. For visualization, the responses of fitted models to each CS were computed in all the trials, even though in the simulated experiments only one of the CSs was presented in each trial. The evaluation results for Scenario 2 and 3 are shown in fig. 4 and fig. 5, respectively.

For Scenario 2, model selection accuracies of the reference and optimized designs are shown in fig. 4.A. Although the REF design yields accuracies above chance level (mean: 59.6%, 95% CI: [55.2% - 63.9%]), optimized designs provide substantial improvements. Using a vague prior, the VP-OPT design provides a lower bound on the accuracy expected with an optimized design; nonetheless, even this lower bound provides near-perfect accuracy (mean: 99.8%, 95% CI: [98.9% - 100.0%]). We can quantify the relative improvement using the odds ratio (OR) as the effect size (please note: the value OR = 1.0 indicates designs of equal expected utility, and the 95% CIs were obtained using Woolf's method with the Haldane-Anscombe correction (Lawson 2004)). The VP-OPT design greatly increases the odds of correctly identifying the ground truth model (OR: 346.8, 95% CI: [48.4, 2486.4]). Moreover, with exact knowledge of the model parameters, the PP-OPT design provides an upper bound on the expected improvements. Due to the already near-perfect accuracy of the VP-OPT design, the PP-OPT design yielded similar accuracy (mean: 100.0%, 95% CI: [99.3% - 100.0%]). The relative improvement of the PP-OPT design over the REF design is substantial (OR: 696.2, 95% CI: [43.2, 11208.0]). However, due to a ceiling effect, we do not observe a clear advantage of the PP-OPT design over the VP-OPT design (OR: 3.0, 95% CI: [0.1, 74.0]).
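The odds-ratio effect size and its CI used above can be computed as in the following sketch (illustrative code, not taken from the paper's materials), where a and b are the counts of correct and incorrect selections under one design, and c and d are the corresponding counts under the other design.

    function [or, ci] = odds_ratio_woolf(a, b, c, d)
    % Odds ratio with Woolf's logit-based 95% CI; the Haldane-Anscombe
    % correction adds 0.5 to every cell when any count is zero.
    if any([a, b, c, d] == 0)
        a = a + 0.5; b = b + 0.5; c = c + 0.5; d = d + 0.5;
    end
    or = (a * d) / (b * c);
    se = sqrt(1/a + 1/b + 1/c + 1/d);          % standard error of log(OR)
    ci = exp(log(or) + [-1, 1] * 1.96 * se);   % Woolf's 95% CI
    end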

Similar results are observed when comparing the design accuracies in Scenario 3 (fig. 5.A). The accuracy of the REF design is clearly above chance level (mean: 59.7%, 95% CI: [56.6% - 62.7%]), but the designs optimized both under a vague and a point prior provide greatly improved accuracies. The VP-OPT design yields both a high accuracy in absolute terms (mean: 94.8%, 95% CI: [93.3% - 96.1%]) and a large improvement relative to the REF design (OR: 12.4, 95% CI: [9.1, 16.8]).

Figure 4: Design evaluation in Scenario 2: selection between RW and KRW models. Three designs - reference backward blocking design (REF), design optimized under a vague prior (VP-OPT), and design optimized under a point prior (PP-OPT) - are evaluated under the ground truth model being either the RW or the KRW model. (A) Model selection accuracy (mean and the Clopper-Pearson binomial 95% CI). Horizontal guideline indicates chance level. Darker bars summarize results across ground truth models. (B) Values of the design variables in the two stages of the experiment: cue probabilities P(CS) and joint cue-outcome probabilities P(CS, US). (C) Comparison of fitted model responses obtained under different designs (rows) and different ground truth models (columns). Inset labels give the average difference in BIC (± SEM) between the fit of the true model and the alternative model (more negative values indicate stronger evidence in favor of the true model).

The PP-OPT design also provides a high accuracy (mean: 95.9%, 95% CI: [94.5% - 97.0%]), and an even larger improvement relative to the REF design (OR: 15.8, 95% CI: [11.3, 22.1]); however, the improvement relative to the VP-OPT design seems to be small (OR: 1.3, 95% CI: [0.8, 1.9]).

In both Scenario 2 and 3, we can observe that the largest gain in accuracy is realized when the true model is the most complex candidate model (KRW in Scenario 2, and RWPH(V + α) in Scenario 3). With the REF designs, data simulated from these more complex models is often attributed to simpler models (e.g., see confusion matrix in fig. 5.A). Such misidentification occurs because both simple and complex models can account for the data from a simple design similarly well, but the model selection criterion (BIC in our case) penalizes complexity, thus biasing the selection towards simple models and even leading to accuracies below chance level, as seen in fig. 4.A and fig. 5.A. This suggests that simple manual designs do not provide the conditions in which complex models could make predictions sufficiently distinct from simpler models. Hence, to achieve conditions in which candidate models make sufficiently diverging predictions, the simplicity of experimental designs may need to be sacrificed.

Figure 5: Design evaluation in Scenario 3: selection between RW and RWPH models. Three designs - reference reversal learning design (REF), design optimized under a vague prior (VP-OPT), and design optimized under a point prior (PP-OPT) - are evaluated under the ground truth model being either the RW(V), RWPH(V), RWPH(α), or RWPH(V + α). (A) Model selection accuracy (mean and the Clopper-Pearson binomial 95% CI). Horizontal guideline indicates chance level. Darker bars summarize results across ground truth models. Inset shows the confusion matrix between the ground truth model (rows) and the selected model (columns). (B) Values of the design variables in the two stages of the experiment: cue probabilities P(CS) and joint cue-outcome probabilities P(CS, US). (C) Comparison of fitted RW(V) and RWPH(V + α) model responses obtained under different designs (rows) and these two models as assumed ground truth (columns). Inset labels give the average difference in BIC (± SEM) between the fit of the true model and the alternative model (more negative values indicate stronger evidence in favor of the true model). Note: the model responses are nearly identical when the RW(V) model is true, because this model is a special case of the alternative RWPH(V + α) model.

21
482 design variables between the REF, VP-OPT, and PP-OPT designs, both in Scenario 2 (fig. 4.B) and Scenario

483 3 (fig. 5.B). In both cases, the optimized designs are arguably less parsimonious than the reference designs.

484 However, the optimized designs also result in more distinct model responses, especially when the ground

485 truth model is more complex.

486 For example, in Scenario 2, model responses of the fitted RW and KRW models (fig. 4.C) are very similar for

487 cues presented in the REF design (AB in stage 1, and A in stage 2), regardless of which model is the ground

488 truth. We can observe a similar pattern in Scenario 3 (fig. 5.C), where we show the responses of the RW(V )

489 and RWPH(V + α) models, which are most often confused under the REF design. This is also reflected in the

490 BIC differences between the true and the alternative models (∆BIC), which are - under the REF design - in

491 favor of the simple model regardless of the ground truth model. In contrast, model responses with optimized

492 designs are more distinct when the ground truth model is complex, allowing for easier model discrimination

493 in both scenarios. However, when the ground truth model is simple, model responses are similar even under

494 optimized designs; yet, this is not a failure of the design optimization: when the predictions are similar,

495 the simpler model will correctly be selected, since BIC favors parsimony. Similarly, it may seem surprising

496 that the PP-OPT designs can yield evidence in favor of the true model that is weaker than in the VP-OPT

497 designs (e.g., see fig. 4.C) in terms of ∆BIC, even though the PP-OPT designs are based on stronger prior

498 knowledge and provide similar or higher accuracies than VP-OPT designs. Again, this is not a failure of the

499 optimization; instead, the procedure takes into account that our goal is maximizing the frequency of correct

500 model identifications, rather than the subtly different goal of maximizing the expected evidence in favor of

501 the true model. Altogether, these observations demonstrate that the optimization took into account not only

502 the supplied model space and design priors, but also the specific properties of our utility function.

503 4 Discussion

504 In this paper we investigated the potential of computationally optimizing experimental designs in associative

505 learning studies, with the goal of enabling accurate statistical inference from experimental data. First, we

506 formalized and parameterized the structure of classical conditioning experiments to render them amenable

507 to design optimization. Next, we brought together several existing ideas from the literature on experimental

508 design and computational optimization - in particular, simulation-based Bayesian experimental design and

509 Bayesian optimization - and we composed them into a flexible method for designing associative learning

510 studies. Finally, we evaluated the proposed method through simulations in three scenarios drawn from the

511 literature on associative learning, with design optimization spanning both the goals of accurate parameter

512 estimation and model selection. In these simulated scenarios, optimized designs outperformed manual designs

513 previously used in similar contexts: the benefits ranged from a moderate reduction in parameter estimation

514 error in Scenario 1, to substantial gains in model selection accuracies in Scenarios 2 and 3.

515 Furthermore, we highlight the role of prior knowledge in experimental design optimization. Within the

516 framework of Bayesian experimental design, we can distinguish three types of prior probability distributions

517 - the analysis, evaluation, and design prior. The analysis prior is the prior we ultimately plan to use in

518 analyzing the obtained experimental data (if we plan to use Bayesian inference). The analysis prior should

519 generally be acceptable to a wide audience, and therefore should not be overly informative or biased; for

520 analysis priors we used flat priors that were fixed within scenarios, representing a fixed, but non-prejudiced

521 analysis plan. The evaluation prior is used to simulate data for the evaluation of expected design utility: in

522 simulations the models and their parameters are sampled from the evaluation prior. The design prior is used

523 to simulate data in the process of optimizing experimental designs.

524 Although in practical applications the evaluation prior and the design prior coincide (both representing

525 the designer’s partial knowledge), here we distinguish them in order to investigate how different levels of

526 background knowledge impact the accuracy of designs under different assumed ground truths. Hence, in the

527 presented scenarios, point evaluation priors represent the exactly assumed ground truth and design priors

528 represent the varying degrees of background knowledge. For design priors we used both weakly-informative

529 (vague) and strongly-informative priors. The weakly-informative and strongly-informative design priors can

530 be thought of as providing approximate lower and upper bounds on achievable accuracy improvements,

531 respectively. As expected, strongly-informative design priors provided the highest accuracies, but, importantly,

532 even the designs optimized with only weakly-informative design priors enabled similarly accurate inferences

533 (e.g., in Scenarios 2 and 3). This result suggests that specifying the space of candidate models may in some

534 cases be sufficient prior information to reap the benefits of design optimization, even when there is only sparse

535 knowledge of plausible values of the models’ parameters.

536 4.1 Limitations and future work

537 Although the results presented here show that optimized designs can yield substantial improvements over

538 manual designs, there are some important caveats. For example, we can have floor or ceiling effects in design

539 utility. A floor effect can be observed when, e.g., the data is noisy regardless of the chosen experimental

540 design, leading to poor accuracy with both manual and optimized designs. Similarly, a ceiling effect can

541 be observed when the utility of a manual design is already high - e.g., a design manually optimized for

542 a small model space may already provide near-perfect model discrimination under low noise conditions.

543 Nevertheless, given the often large number of design variables, and the complexity of the information that

544 needs to be integrated to arrive at the optimal design, we conjecture that manual designs are often far from

545 optimal. Support for this view comes from two recent studies. Balietti, Klein, and Riedl (2018) asked experts

546 in economics and behavioral sciences to determine the values of two design variables of an economics game,

547 such that the predictions of hypothesized models of behavior in this game would be maximally different.

548 Results of simulations and empirical experiments showed that the experts suggested a more expensive and

549 less informative experimental design, compared with a computationally optimized design. The study

550 of Bakker et al. (2016) investigated researchers’ intuitions about statistical power in psychological research:

551 in an arguably simpler task of estimating statistical power of different designs, the majority of participants

552 systematically over- or underestimated power. The findings of these two studies indicate the need for formal

553 approaches.

554 Another important pair of caveats relates to the specification of the design optimization problem. First,

555 the experiment can only be optimized with respect to the design variables that affect model responses.

556 For example, in scenarios presented in this paper, we considered discrete, trial-level models of associative

557 learning, rather than continuous-time models. However, trial-level models do not capture the effects of

558 stimuli presentation timings, and therefore design variables related to timing cannot be optimized using this

559 model space. If we suspect that the phenomenon under investigation crucially depends on timings, a model

560 capturing this impact needs to be included in the model space. Second, if the specified models and their

561 parameters inadequately describe the phenomenon of interest, then the optimized designs might actually

562 perform worse than manually chosen designs. For example, if specified candidate models do not take into

563 account attentional mechanisms or constraints on cognitive resources, the resulting optimized designs might

564 either be so simple that they disengage subjects, or so complex that subjects cannot learn effectively. This

565 issue can be solved directly by amending the candidate model space and design prior to better capture the

566 studied phenomenon. Alternatively, the desired properties of the design can be achieved by including them in

567 the utility function or by appropriately constraining design variables. Overall, it is important to keep in mind

568 that design optimality is relative, not absolute: a design is only optimal with respect to the user-specified

569 model space, design prior, experiment structure, design space, analysis procedure, and utility function.

570 The problem of misspecification is, however, not particular to our approach, or even to experimental design in

571 general; every statistical procedure requires assumptions, and can yield poor results when those assumptions

572 are not met. However, in the domain of experimental design, this problem can be dealt with by iterating

573 between design optimization, data acquisition, modeling, and model checking. The posterior of one iteration

574 can be used as the design prior of the next one, and model spaces can be amended with new models,

575 if current models are inadequate in accounting for the observed data. This iterative strategy is termed

576 Adaptive Design Optimization (ADO) (Cavagnaro et al. 2010) and it can be implemented at different levels

577 of granularity - between studies, between subsamples of subjects, between individual subjects, or even between

578 trials. Implementing the ADO approach in the context of associative learning studies - especially with design

579 updates at the level of each subject or trial - will be an important challenge for future work.

580 Adapting designs at finer granularity may be particularly beneficial for designs with complex experimental

581 structures. Even global optimization algorithms like Bayesian optimization can struggle in high-dimensional

582 experiment spaces. Optimizing designs on a per-trial basis may not only reduce the number of variables that

583 need to be tuned simultaneously, but might also improve inference accuracy by incorporating new data as it

584 becomes available (Kim et al. 2016). But even with a non-adaptive design approach, a number of options are

585 available to reduce the computational burden: Bayesian optimization can be sped up using GPUs (Matthews

586 et al. 2017), and analytical bounds (Daunizeau et al. 2011) or normal approximations to the utility function

587 (Overstall, McGree, and Drovandi 2018) can be used to identify promising parts of the experiment space,

588 which can then be searched more thoroughly (see Ryan et al. (2016) for a review of computational strategies

589 in Bayesian experimental design). Moreover, although some of the scenarios presented in this paper required

590 lengthy optimizations, this time investment seems negligible compared to the typical length of data acquisition

591 in associative learning studies and to the cost of the study yielding non-informative data.

592 We have demonstrated our approach by optimizing designs for accurate single-subject inferences, which we

593 contend is especially valuable from a translational perspective, and also represents a difficult benchmark, due

594 to limited amounts of data in single-subject experiments. Group studies could also be accommodated within

595 our approach, but simulating group studies would be computationally expensive, and it would imply that the

596 design can be updated only once all the subjects’ data has been acquired. Instead, in future developments

597 of design methods for associative learning, it may be advantageous to adopt the hierarchical ADO (HADO)

598 approach (Kim et al. 2014): this approach combines hierarchical Bayesian modeling with adaptive design.

599 Even when the goal is to achieve accurate inferences at the level of single subjects, hierarchically combining

600 the new subject’s data with the previously acquired data can improve efficiency and accuracy.

601 The design optimization method was illustrated on classical conditioning, as a special case of associative

602 learning, but the extension to operant conditioning is conceptually straightforward. Whereas we formalized

603 classical conditioning experiments using Markov chains, operant conditioning experiments can be formalized

604 using Markov decision processes (MDPs). In general, optimizing MDPs may be computationally challenging,

605 as the design has to specify transition probabilities and rewards for each state-action pair, possibly resulting in

606 a high-dimensional problem. However, tractable optimization problems may also be obtained for experiments

607 that can be represented using highly structured MDPs, with only a few tunable design variables (Zhang and

608 Lee 2010; Rafferty, Zaharia, and Griffiths 2014). We expect that the crucial challenge for future work will

609 be to elucidate general correspondences between model classes and experimental structures that allow their

610 discrimination, while still being amenable to optimization (i.e., resulting in low-dimensional problems).

611 4.2 Conclusion

612 We believe that computational optimization of experimental designs is a valuable addition to the toolbox of

613 associative learning science and that it holds promise of improving the accuracy and efficiency of basic research

614 in this field. Optimizing experimental designs could also help address the recent concerns over underpowered

615 studies in cognitive sciences (Szucs and Ioannidis 2017): the usually-prescribed use of larger samples should

616 be complemented with the use of better designed experiments. Moreover, the effort in specifying the modeling

617 details prior to designing the experiment allows for the experiment to easily be pre-registered, thus facilitating

618 the adoption of emerging open science norms (Munafò et al. 2017; Nosek et al. 2018). The requirement

619 to specify candidate models and the analysis procedures may seem restrictive, but this does not preclude

620 subsequent exploratory analyses; it just delineates them from planned confirmatory analyses (Wagenmakers

621 et al. 2012). And the results of both confirmatory and exploratory analyses can then be used to further

622 optimize future experiments.

623 Finally, we believe design optimization will play a role not only in basic research on associative learning, but

624 also in translational research within the nascent field of computational psychiatry. Computational psychiatry

625 strives to link mental disorders with computational models of neural function, or with particular parameter

626 regimes of these models. Employing insights of computational psychiatry in a clinical setting will require

627 the development of behavioral or physiological diagnostic tests, based on procedures like model selection and

628 parameter estimation (Stephan et al. 2016). This presents a problem analogous to the design of experiments

629 (Ahn et al. 2019), and therefore computational optimization may be similarly useful in designing accurate

630 and efficient diagnostic tests in this domain.

631 5 Software and data availability

632 All the code necessary to reproduce the results of the paper is available at https://github.com/fmelinscak/

633 coale-paper. The accompanying Open Science Framework (OSF) project at https://osf.io/5ktaf/ holds the

634 data (i.e., simulation results) analyzed in the paper.

635 6 Acknowledgments

636 The authors would like to thank Karita Ojala, Andreea Sburlea, and Martina Melinščak for valuable discussions

637 and comments on the text.

638 7 References

639 Ahn, Woo-Young, Hairong Gu, Yitong Shen, Nathaniel Haines, Hunter A. Hahn, Julie E. Teater, Jay I.

640 Myung, and Mark A. Pitt. 2019. “Rapid, precise, and reliable phenotyping of delay discounting using a

641 Bayesian learning algorithm.” bioRxiv, June, 567412. https://doi.org/10.1101/567412.

642 Alonso, Eduardo, Esther Mondragón, and Alberto Fernández. 2012. “A Java simulator of Rescorla and Wag-

643 ner’s prediction error model and configural cue extensions.” Computer Methods and Programs in Biomedicine

644 108 (1): 346–55. https://doi.org/10.1016/J.CMPB.2012.02.004.

645 Apgar, Joshua F., David K. Witmer, Forest M. White, and Bruce Tidor. 2010. “Sloppy models, parameter

646 uncertainty, and the role of experimental design.” Molecular BioSystems 6 (10): 1890. https://doi.org/10.

647 1039/b918098b.

648 Bakker, M., C. H. J. Hartgerink, J. M. Wicherts, and H. L. J. van der Maas. 2016. “Researchers’ Intuitions

649 About Power in Psychological Research.” Psychological Science 27 (8): 1069–77. https://doi.org/10.1177/

650 0956797616647519.

651 Balietti, Stefano, Brennan Klein, and Christoph Riedl. 2018. “Fast Model-Selection through Adapting

652 Design of Experiments Maximizing Information Gain.” arXiv Preprint arXiv:1807.07024. http://arxiv.org/

653 abs/1807.07024.

654 Behrens, Timothy E J, Mark W Woolrich, Mark E Walton, and Matthew F S Rushworth. 2007. “Learning

655 the value of information in an uncertain world.” Nature Neuroscience 10 (9): 1214–21. https://doi.org/10.

656 1038/nn1954.

657 Brochu, Eric, Vlad M. Cora, and Nando de Freitas. 2010. “A Tutorial on Bayesian Optimization of Expensive

658 Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning.” arXiv

659 Preprint arXiv:1012.2599. http://arxiv.org/abs/1012.2599.

660 Browning, Michael, Timothy E Behrens, Gerhard Jocham, Jill X O’Reilly, and Sonia J Bishop. 2015. “Anx-

661 ious individuals have difficulty learning the causal statistics of aversive environments.” Nature Neuroscience

662 18 (4): 590–96. https://doi.org/10.1038/nn.3961.

663 Brutti, Pierpaolo, Fulvio De Santis, and Stefania Gubbiotti. 2014. “Bayesian-frequentist sample size deter-

664 mination: a game of two priors.” METRON 72 (2): 133–51. https://doi.org/10.1007/s40300-014-0043-2.

665 Cavagnaro, Daniel R., Jay I. Myung, Mark A. Pitt, and Janne V. Kujala. 2010. “Adaptive Design Opti-

666 mization: A Mutual Information-Based Approach to Model Discrimination in Cognitive Science.” Neural

667 Computation 22 (4): 887–905. https://doi.org/10.1162/neco.2009.02-09-959.

668 Chaloner, Kathryn, and Isabella Verdinelli. 1995. “Bayesian experimental design: A review.” Statistical

669 Science 10 (3): 273–304. http://www.jstor.org/stable/2246015.

670 Daunizeau, Jean, Kerstin Preuschoff, Karl Friston, and Klaas Stephan. 2011. “Optimizing Experimental

671 Design for Comparing Models of Brain Function.” PLoS Computational Biology 7 (11): e1002280. https:

672 //doi.org/10.1371/journal.pcbi.1002280.

673 Dayan, Peter, and Sham Kakade. 2001. “Explaining away in weight space.” In Advances in Neural Informa-

674 tion Processing Systems 14, 451–57. http://papers.nips.cc/paper/1852-explaining-away-in-weight-space.pdf.

675 Efron, Bradley. 1981. “Nonparametric Standard Errors and Confidence Intervals.” Canadian Journal of

676 Statistics 9 (2): 139–58. https://doi.org/10.2307/3314608.

677 Enquist, Magnus, Johan Lind, and Stefano Ghirlanda. 2016. “The power of associative learning and the

678 ontogeny of optimal behaviour.” Royal Society Open Science 3 (11): 160734. https://doi.org/10.1098/rsos.

679 160734.

680 Gershman, Samuel J. 2018. “’Mfit’: Simple Model-Fitting Tools.” https://github.com/sjgershm/mfit.

681 ———. 2015. “A Unifying Probabilistic View of Associative Learning.” PLOS Computational Biology 11

682 (11): e1004567. https://doi.org/10.1371/journal.pcbi.1004567.

683 Gershman, Samuel J., and Yael Niv. 2012. “Exploring a latent cause theory of classical conditioning.”

684 Learning & Behavior 40 (3): 255–68. https://doi.org/10.3758/s13420-012-0080-8.

685 Humphreys, Lloyd G. 1939. “The effect of random alternation of reinforcement on the acquisition and

686 extinction of conditioned eyelid reactions.” Journal of Experimental Psychology 25 (2): 141–58. https:

687 //doi.org/10.1037/h0058138.

688 Kamin, Leon J. 1968. “”Attention-like” processes in classical conditioning.” In Miami Symposium on the

689 Prediction of Behavior, 1967: Aversive Stimulation, edited by M. R. Jones, 9–31. Coral Gables, Florida:

690 University of Miami Press.

691 Kim, Woojae, Mark A. Pitt, Zhong-Lin Lu, and Jay I. Myung. 2016. “Planning Beyond the Next Trial in

692 Adaptive Experiments: A Dynamic Programming Approach.” Cognitive Science, December. https://doi.org/

693 10.1111/cogs.12467.

694 Kim, Woojae, Mark A. Pitt, Zhong-Lin Lu, Mark Steyvers, and Jay I. Myung. 2014. “A Hierarchical

695 Adaptive Approach to Optimal Experimental Design.” Neural Computation 26 (11): 2465–92. https://doi.

696 org/10.1162/NECO_a_00654.

697 Kruschke, John K. 2008. “Bayesian approaches to associative learning: From passive to active learning.”

698 Learning and Behavior 36 (3): 210–26. https://doi.org/10.3758/LB.36.3.210.

699 Lawson, Raef. 2004. “Small Sample Confidence Intervals for the Odds Ratio.” Communications in Statistics

700 - Simulation and Computation 33 (4): 1095–1113. https://doi.org/10.1081/SAC-200040691.

701 Lewi, Jeremy, Robert Butera, and Liam Paninski. 2009. “Sequential Optimal Design of Neurophysiology

702 Experiments.” Neural Computation 21 (3): 619–87. https://doi.org/10.1162/neco.2008.08-07-594.

703 Li, Jian, Daniela Schiller, Geoffrey Schoenbaum, Elizabeth A Phelps, and Nathaniel D Daw. 2011. “Differ-

704 ential roles of human striatum and amygdala in associative learning.” Nature Neuroscience 14 (10): 1250–2.

705 https://doi.org/10.1038/nn.2904.

706 Liepe, Juliane, Sarah Filippi, Michał Komorowski, and Michael P. H. Stumpf. 2013. “Maximizing the

707 Information Content of Experiments in Systems Biology.” PLoS Computational Biology 9 (1): e1002888.

708 https://doi.org/10.1371/journal.pcbi.1002888.

709 Lindley, Dennis Victor. 1972. Bayesian Statistics, A Review. Philadelphia: SIAM. https://doi.org/10.1137/

710 1.9781611970654.

711 Lorenz, Romy, Ricardo Pio Monti, Inês R. Violante, Christoforos Anagnostopoulos, Aldo A. Faisal, Giovanni

712 Montana, and Robert Leech. 2016. “The Automatic Neuroscientist: A framework for optimizing experi-

713 mental design with closed-loop real-time fMRI.” NeuroImage 129 (April): 320–34. https://doi.org/10.1016/

714 j.neuroimage.2016.01.032.

715 Lubow, R. E., and A. U. Moore. 1959. “Latent inhibition: The effect of nonreinforced pre-exposure to

716 the conditional stimulus.” Journal of Comparative and Physiological Psychology 52 (4): 415–19. https:

717 //doi.org/10.1037/h0046700.

718 Matthews, Alexander G. de G., Mark van der Wilk, Tom Nickson, Keisuke Fujii, Alexis Boukouvalas, Pablo

719 León-Villagrá, Zoubin Ghahramani, and James Hensman. 2017. “GPflow: A Gaussian Process Library using

720 TensorFlow.” Journal of Machine Learning Research 18 (40): 1–6. http://www.jmlr.org/papers/v18/16-537.

721 html.

722 McGraw, Kenneth O., and S. P. Wong. 1992. “A common language effect size statistic.” Psychological

723 Bulletin 111 (2): 361–65. https://doi.org/10.1037/0033-2909.111.2.361.

724 Miller, Ralph R., Robert C. Barnet, and Nicholas J. Grahame. 1995. “Assessment of the Rescorla-Wagner

725 model.” Psychological Bulletin 117 (3): 363–86. https://doi.org/10.1037/0033-2909.117.3.363.

726 Munafò, Marcus R., Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers,

727 Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis.

728 2017. “A manifesto for reproducible science.” Nature Human Behaviour 1 (1): 0021. https://doi.org/10.

729 1038/s41562-016-0021.

730 Myung, Jay I., and Mark A. Pitt. 2009. “Optimal experimental design for model discrimination.” Psycho-

731 logical Review 116 (3): 499–518. https://doi.org/10.1037/a0016104.

732 Nosek, Brian A., Charles R. Ebersole, Alexander C. DeHaven, and David T. Mellor. 2018. “The pre-

733 registration revolution.” Proceedings of the National Academy of Sciences 115 (11): 2600–2606. https:

734 //doi.org/10.1073/pnas.1708274114.

735 Overstall, Antony M., James M. McGree, and Christopher C. Drovandi. 2018. “An approach for finding fully

736 Bayesian optimal designs using normal-based approximations to loss functions.” Statistics and Computing 28

737 (2): 343–58. https://doi.org/10.1007/s11222-017-9734-x.

738 Pavlov, I P. 1927. Conditioned reflexes: an investigation of the physiological activity of the cerebral cortex.

739 Oxford, England: Oxford Univ. Press.

740 Pearce, John M, and Mark E Bouton. 2001. “Theories of Associative Learning in Animals.” Annual Review

741 of Psychology 52 (1): 111–39. https://doi.org/10.1146/annurev.psych.52.1.111.

742 Rafferty, Anna N., Matei Zaharia, and Thomas L. Griffiths. 2014. “Optimally designing games for be-

743 havioural research.” Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences

744 470 (2167): 20130828. https://doi.org/10.1098/rspa.2013.0828.

745 Rescorla, Robert A. 1970. “Reduction in the effectiveness of reinforcement after prior excitatory conditioning.”

746 Learning and Motivation 1 (4): 372–81. https://doi.org/10.1016/0023-9690(70)90101-3.

747 Rescorla, Robert A, and Allan R Wagner. 1972. “A theory of Pavlovian conditioning: Variations in the

748 effectiveness of reinforcement and nonreinforcement.” Classical Conditioning II Current Research and Theory

749 21 (6): 64–99. https://doi.org/10.1101/gr.110528.110.

750 Ryan, Elizabeth G., Christopher C. Drovandi, James M. McGree, and Anthony N. Pettitt. 2016. “A Review

751 of Modern Computational Algorithms for Bayesian Optimal Design.” International Statistical Review 84 (1):

752 128–54. https://doi.org/10.1111/insr.12107.

753 Sanchez, Gaëtan, Jean Daunizeau, Emmanuel Maby, Olivier Bertrand, Aline Bompas, and Jérémie Mat-

754 tout. 2014. “Toward a New Application of Real-Time Electrophysiology: Online Optimization of Cognitive

755 Neurosciences Hypothesis Testing.” Brain Sciences 4 (1): 49–72. https://doi.org/10.3390/brainsci4010049.

756 Schlaifer, R, and H Raiffa. 1961. Applied statistical decision theory. Wiley.

757 Shanks, David R. 1985. “Forward and Backward Blocking in Human Contingency Judgement.” The Quarterly

758 Journal of Experimental Psychology Section B 37 (1b): 1–21. https://doi.org/10.1080/14640748508402082.

759 Silk, Daniel, Paul D. W. Kirk, Chris P. Barnes, Tina Toni, and Michael P. H. Stumpf. 2014. “Model Selection

760 in Systems Biology Depends on Experimental Design.” Edited by Burkhard Rost. PLoS Computational

761 Biology 10 (6): e1003650. https://doi.org/10.1371/journal.pcbi.1003650.

762 Stephan, Klaas E, Dominik R Bach, Paul C Fletcher, Jonathan Flint, Michael J Frank, Karl J Friston,

763 Andreas Heinz, et al. 2016. “Charting the landscape of priority problems in psychiatry, part 1: classification

764 and diagnosis.” The Lancet Psychiatry 3 (1): 77–83. https://doi.org/10.1016/S2215-0366(15)00361-2.

765 Szucs, Denes, and John P. A. Ioannidis. 2017. “Empirical assessment of published effect sizes and power

766 in the recent cognitive neuroscience and psychology literature.” PLOS Biology 15 (3): e2000797. https:

767 //doi.org/10.1371/journal.pbio.2000797.

768 Thorwart, Anna, Holger Schultheis, Stephan König, and Harald Lachnit. 2009. “ALTSim: A MATLAB

769 simulator for current associative learning theories.” Behavior Research Methods 41 (1): 29–34. https://doi.

770 org/10.3758/BRM.41.1.29.

771 Tzovara, Athina, Christoph W. Korn, and Dominik R. Bach. 2018. “Human Pavlovian fear conditioning

772 conforms to probabilistic learning.” PLOS Computational Biology 14 (8): e1006243. https://doi.org/10.

773 1371/journal.pcbi.1006243.

774 Vanlier, J, C A Tiemann, P A J Hilbers, and N A W van Riel. 2012. “A Bayesian approach to targeted

775 experiment design.” Bioinformatics 28 (8): 1136–42. https://doi.org/10.1093/bioinformatics/bts092.

776 Vervliet, Bram, Michelle G. Craske, and Dirk Hermans. 2013. “Fear Extinction and Relapse:

777 State of the Art.” Annual Review of Clinical Psychology 9 (1): 215–48. https://doi.org/10.1146/

778 annurev-clinpsy-050212-185542.

779 Vincent, Benjamin T, and Tom Rainforth. 2017. “The DARC Toolbox: automated, flexible, and efficient

780 delayed and risky choice experiments using Bayesian adaptive design.” PsyArXiv. https://doi.org/10.31234/

781 osf.io/yehjb.

782 Wagenmakers, Eric-Jan, Ruud Wetzels, Denny Borsboom, Han L J van der Maas, and Rogier A Kievit.

783 2012. “An Agenda for Purely Confirmatory Research.” Perspectives on Psychological Science 7 (6): 632–38.

784 https://doi.org/10.1177/1745691612463078.

785 Wang, Fei, and Alan E. Gelfand. 2002. “A simulation-based approach to Bayesian sample size determination

786 for performance under a given model and for separating models.” Statistical Science 17 (2): 193–208. https:

787 //doi.org/10.1214/ss/1030550861.

788 Weaver, Brian P, Brian J Williams, Christine M Anderson-Cook, and David M Higdon. 2016. “Computa-

789 tional Enhancements to Bayesian Design of Experiments Using Gaussian Processes.” Bayesian Analysis 11

790 (1): 191–213. https://doi.org/10.1214/15-BA945.

791 Zhang, Shunan, and Michael D. Lee. 2010. “Optimal experimental design for a class of bandit problems.”

792 Journal of Mathematical Psychology 54 (6): 499–508. https://doi.org/10.1016/J.JMP.2010.08.002.

793 Supporting information legends

794 • S1 Appendix. Formal approach to optimizing associative learning experiments.

1 S1 Appendix: Formal approach to optimizing associative learning

2 experiments

3 Filip Melinscak* 1,2 and Dominik R. Bach1,2,3

1
4 Computational Psychiatry Research; Department of Psychiatry, Psychotherapy, and
5 Psychosomatics; Psychiatric Hospital; University of Zurich
2
6 Neuroscience Center Zurich; University of Zurich
3
7 Wellcome Centre for Human Neuroimaging and Max Planck/UCL Centre for
8 Computational Psychiatry and Ageing Research; University College London

*
9 Correspondence: Filip Melinscak <filip.melinscak@uzh.ch>

10 In this Appendix we first provide a technical treatment of simulation-based Bayesian experimental design.

11 Next, we describe Bayesian optimization in the context of experimental design. Lastly, we give the full

12 specification of the associative learning models used in simulated scenarios described in the main text of the

13 article (Scenarios 1, 2, and 3).

14 1 Bayesian experimental design

15 Bayesian experimental design is the application of Bayesian decision theory to the problem of designing

16 optimal experiments, i.e., finding the optimal values of tunable design variables. The problem is formalized

17 as a maximization of expected design utility EU (η), with respect to design variables η:

\eta^* = \arg\max_{\eta \in H} EU(\eta), \qquad (1)

18 where η ∗ denotes optimal values for the design variables and H denotes the space of possible designs (i.e.,

19 design space). Although we are ultimately interested in the dependence of the experiment utility on the

20 design variables (captured by the utility function U (η)), often it is easier to express the goal of the study by

21 using a more general utility function with additional direct dependence on study outcomes (observed data d

22 and analysis results r), and possibly on the unobserved ground truth generative model m and its parameters

23 θm , yielding U (r, d, θm , m, η). Evaluating a particular design requires calculating expected design utility, i.e.,

24 marginalizing across the uncertain variables in the utility function (here we assume that analysis results r

25 are deterministically dependent on data d):

EU(\eta) = E_{d, \theta_m, m}\!\left[ U(r, d, \theta_m, m, \eta) \right]
         = \int_{D} \sum_{m \in M} \int_{\Theta_m} U(r, d, \theta_m, m, \eta)\, p(d, \theta_m, m \mid \eta)\, d\theta_m\, dd
         = \int_{D} \sum_{m \in M} \int_{\Theta_m} U(r, d, \theta_m, m, \eta)\, p(d \mid \theta_m, m, \eta)\, p_e(\theta_m, m)\, d\theta_m\, dd, \qquad (2)

26 where D is the space of datasets, M is the space of candidate models, Θm is the parameter space of model

27 m, p(d | θm , m, η) is the conditional probability of the dataset, and pe (θm , m) is the joint prior probability

28 of the model and its parameters (which we term the evaluation prior). Here we made the assumption that

29 the evaluation prior is independent of the design variables (i.e., pe (θm , m | η) = pe (θm , m) for all η); this

30 assumption needs to be carefully considered on a case-by-case basis, and might not hold if, for example,

31 different values of design variables result in experiments which engage different cognitive mechanisms.

32 1.1 Evaluating design utility through simulation

33 Unfortunately, the expected design utility EU (η), calculated according to eq. 2, will have no closed-form

34 solution for almost any design problem of practical interest. To tackle this issue, we adopt the simulation-

35 based approach to design evaluation, i.e., Monte Carlo integration (Wang and Gelfand 2002). Each simulation

36 iteration consists of three steps: simulating a dataset, analyzing the simulated dataset, and calculating the

37 utility of the simulated experiment.

38 To simulate a dataset, we first sample the ground truth model m and its parameters θm from the evaluation

39 prior pe (θm , m). Next, using the evaluated values of design variables η and the ground truth, we sample the

40 dataset d from the data probability distribution p(d | θm , m, η). In classical conditioning, experiment realiza-

41 tions (i.e., stimuli and outcomes, together denoted x) are independent of model responses (i.e., conditioned

42 responses, denoted y); hence, we can decompose the dataset simulation into sampling an experiment realiza-

43 tion from the experiment structure pexp (x | η) and sampling the model responses from the model likelihood

44 function pm (y | x, θm ), thus yielding the full simulated dataset d = {x, y}.
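As a concrete illustration of sampling an experiment realization from pexp (x | η), the following minimal sketch (not the paper's implementation) draws trial-wise stimuli and outcomes for a simplified two-stage, single-cue design parameterized by per-stage cue probabilities P(CS) and joint cue-outcome probabilities P(CS, US), the kind of design variables shown in figs. 4.B and 5.B of the main text. Trial counts, variable names, and the assumption that the outcome occurs only together with the cue are illustrative.

    % Illustrative sketch: sample one realization x = {CS, US} of a two-stage,
    % single-cue design from per-stage design variables pCS and pCSUS.
    nTrialsPerStage = [40 40];          % assumed number of trials per stage
    pCS   = [0.8 0.5];                  % P(CS) in stage 1 and stage 2
    pCSUS = [0.6 0.1];                  % P(CS, US) in stage 1 and stage 2
    CS = []; US = [];
    for stage = 1:2
        n  = nTrialsPerStage(stage);
        cs = rand(n, 1) < pCS(stage);                        % cue present on each trial?
        us = cs & (rand(n, 1) < pCSUS(stage) / pCS(stage));  % outcome only with the cue (simplification)
        CS = [CS; cs]; US = [US; us];
    end
    x = struct('CS', CS, 'US', US);     % one experiment realization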

45 We can now perform the planned analysis on the simulated dataset. We conceptualize this with a functional

46 mapping from the data to the results, i.e., r = a(d), where a(·) represents the planned analysis (e.g., model

47 comparison or parameter estimation) and r represents the analysis results (e.g., the selected winning model,

48 the parameter point estimate, or a posterior distribution over parameters and/or models). In scenarios where

49 we consider parameter estimation within a single model, we can obtain the maximum a posteriori (MAP)

50 point estimate of the parameter as the analysis result r, using Bayes’ rule:

r = \hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} \frac{p(y \mid x, \theta)\, p_a(\theta)}{\int p(y \mid x, \theta)\, p_a(\theta)\, d\theta}, \qquad (3)

51 where pa (θ) is the analysis prior, which, importantly, does not need to coincide with the evaluation prior

52 pe (θ). Since the evaluation prior describes the experimenter’s state of knowledge at the planning stage, and the

53 analysis prior needs to be acceptable even to the skeptical consumer of the analysis results, the evaluation

54 prior will usually be more informative than the analysis prior. In scenarios presented in the paper, we used

55 the uniform prior as a non-informative analysis prior pa (θ) ∝ 1; this makes the MAP parameter estimate

56 equal to the maximum (log)likelihood estimate (MLE):

\hat{\theta}_{\mathrm{MAP}} = \hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \log p(y \mid x, \theta). \qquad (4)
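In practice, this estimate is obtained by numerical optimization of the log-likelihood. A minimal sketch is given below (illustrative only: negLogLik is an assumed function handle returning the negative log-likelihood, and the bounds are placeholders; the scenarios in section 2 instead use the mfit_optimize function from the ‘mfit’ toolbox).

    % Illustrative sketch: MLE under the flat analysis prior, by minimizing the
    % negative log-likelihood within the permissible parameter domain.
    lb = 0; ub = 1;                       % assumed parameter bounds
    theta0 = 0.5 * (lb + ub);             % starting point for the search
    thetaHat = fmincon(negLogLik, theta0, [], [], [], [], lb, ub);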

57 In model selection scenarios, we can similarly obtain the analysis result by employing Bayes’ rule to select

58 the a posteriori most probable model, based on maximizing model evidence:

r = \hat{m} = \arg\max_{m} p(m \mid y, x)
            = \arg\max_{m} p(y \mid x, m)
            = \arg\max_{m} \int p_m(y \mid x, \theta_m)\, p_a(\theta_m \mid m)\, d\theta_m, \qquad (5)

59 where we assumed a uniform analysis prior over the models, i.e., pa (m) ∝ 1. Again, the integral necessary to

60 calculate model evidence will generally not be tractable, but can be approximated in various ways (Penny et

61 al. 2004).
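One widely used approximation - consistent with the BIC-based model selection adopted in the scenarios below (eq. 19) - replaces the log evidence by its BIC approximation,

    \log p(y \mid x, m) \approx \log p_m(y \mid x, \hat{\theta}_m) - \frac{k_m}{2} \log n = -\tfrac{1}{2}\, \mathrm{BIC}_m,

where \hat{\theta}_m is the maximum likelihood estimate, k_m is the number of free parameters of model m, and n is the number of data points; selecting the model with the highest approximate evidence is then equivalent to selecting the model with the lowest BIC.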

62 Having obtained the analysis result from the simulated dataset, we can now calculate the experiment utility

63 U (r, d, θm , m, η) of the current simulation iteration. In general, the utility function is dependent on the goal

64 of the study and therefore its form is up to the user; here we provide two utility functions as examples - one

65 for parameter estimation scenarios and one for model selection scenarios. In scenarios featuring parameter

66 estimation within a single model, we decided to minimize the absolute parameter estimation error (i.e.,

67 maximizing negative absolute error), due to its robustness to outliers:

U(r = \hat{\theta}, \theta) = -\lvert \theta - \hat{\theta} \rvert. \qquad (6)

68 Similarly, in scenarios featuring model selection, we decided to maximize model selection accuracy:



U(r = \hat{m}, m) = \begin{cases} 1 & \text{if } \hat{m} = m, \\ 0 & \text{otherwise.} \end{cases} \qquad (7)

69 Thus far, we have described calculating the experiment utility for a single simulation. To obtain the Monte

70 Carlo approximation of the expected design utility EU (η) (eq. 2), we run NS simulations and average

71 experiment utilities across them:

EU(\eta) \approx \frac{1}{N_S} \sum_{i=1}^{N_S} U(r^{(i)}, d^{(i)}, \theta_m^{(i)}, m^{(i)}, \eta). \qquad (8)
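In code, one evaluation of a candidate design can therefore be organized as in the following minimal sketch, where sample_ground_truth, simulate_experiment, analyze, and utility_fun are placeholder function handles standing for the user-supplied prior, experiment structure with model likelihood, analysis procedure, and utility function, respectively.

    % Illustrative sketch of the Monte Carlo estimate in eq. 8.
    function EU = estimate_expected_utility(eta, nSim, ...
            sample_ground_truth, simulate_experiment, analyze, utility_fun)
        u = zeros(nSim, 1);
        for i = 1:nSim
            [m, theta] = sample_ground_truth();          % draw model and parameters from the prior
            d = simulate_experiment(eta, m, theta);      % simulate dataset d = {x, y} under design eta
            r = analyze(d);                              % planned analysis (estimation or selection)
            u(i) = utility_fun(r, d, theta, m, eta);     % utility of this simulated experiment
        end
        EU = mean(u);                                    % Monte Carlo approximation of EU(eta)
    end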

72 1.2 Maximizing design utility through Bayesian optimization

73 Using the Monte Carlo approximation to the expected design utility (eq. 8), we can now search the design

74 space H for the optimal design η ∗ that maximizes expected utility EU (η) (eq. 1). We refer to the evaluation

75 prior pe (θm , m) used during optimization as the design prior pd (θm , m). In practice, the design prior and

76 the evaluation prior will generally coincide, since we want to use the best existing information to design

77 the experiment and to evaluate it. However, the distinction between the design prior and the evaluation

78 prior allows us to investigate the effect of varying levels of prior knowledge included in the optimization (i.e.,

79 different design priors), while evaluating the obtained designs using the same evaluation prior (which here

80 represents the ground truth and will usually be a point prior).

81 To find the values of design variables that maximize expected design utility, we employ the Bayesian opti-

82 mization (BO) algorithm (Brochu, Cora, and Freitas 2010), as implemented by the bayesopt function in the

83 ‘Statistics and Machine Learning Toolbox’ in MATLAB R2018a (version 9.4.0.813654). Since optimization

84 algorithms are usually formulated in terms of minimizing an objective function g(x), we equate the objective

85 function with negative expected design utility, i.e., g(x) = −EU (η). BO is a global optimization algorithm

86 which minimizes a deterministic or stochastic objective function g(x), with respect to x within a bounded

87 domain. BO is especially appropriate for optimization problems where the objective function is expensive to

88 evaluate and where the gradients of the objective function are not available; both of these conditions are true

89 for simulation-based experimental design. Although the BO algorithm has polynomial time complexity, for

90 expensive objective functions and for typical numbers of iterations, the running time will be dominated by

91 the evaluations of the objective function, thus justifying the use of a relatively costly search algorithm.

92 In order to efficiently search for the optimal x, BO approximates the true objective function g(x) using

93 a probabilistic model f (x), called the surrogate function (or response surface). A common choice for the

94 surrogate function (also featured in the MATLAB BO implementation) is a Gaussian process:

f(x) \sim \mathcal{GP}\!\left( \mu(x), k(x, x'; \theta) \right), \qquad (9)

95 where µ(x) is the mean function and k(x, x′ ; θ) is the covariance kernel function with hyperparameters θ. BO

96 implementation in MATLAB uses the ARD Matern 5/2 form of the kernel function. In order to choose the

97 next solution x to evaluate, the surrogate function needs to be supplemented with an acquisition function

98 u(x), which governs the exploration-exploitation trade-off. The acquisition function represents the utility of

99 evaluating a solution x, based on the current posterior probability distribution over the objective function,

100 Q(f ). Out of various acquisition functions that are available in the MATLAB implementation of BO, we have

101 chosen the ‘expected improvement plus’ option. The expected improvement acquisition functions evaluate the

102 expected decrease in the objective function, ignoring values that cause an increase in the objective. Expected

103 improvement (EI) is given by:

EI(x, Q) = E_Q\!\left[ \max\!\left( 0, \mu_Q(x_{\mathrm{best}}) - f(x) \right) \right], \qquad (10)

104 where xbest is the solution with the lowest posterior mean value µQ (xbest ). The EI-plus version of the

105 acquisition function dynamically modifies the kernel function to prevent over-exploitation of a particular area

106 of the solution space. This modification is governed by the ‘exploration ratio’ option, which was set to the

107 default value of 0.5.

108 Having specified the surrogate and acquisition functions, the BO algorithm proceeds as follows. The algorithm

109 is first initialized by evaluating yi = g(xi ) for a fixed number of solutions xi sampled randomly within

110 variable bounds (i indexes objective function evaluations); we used 10 solutions in initialization. Hereafter,

111 the algorithm repeats the following steps in each iteration t:

112 1. Update the surrogate function using Bayes’ rule, i.e., calculate the posterior over functions, condi-

113 tioned on the objective values observed thus far (Q(f | xi , yi , for i = 1, . . . , t)).

114 2. Find the new solution xt+1 that maximizes the acquisition function u(x) and evaluate the objective

115 function for this solution to obtain yt+1 = g(xt+1 ).

116 As the stopping criterion for the algorithm, we used a fixed number of iterations (300), regardless of the

117 elapsed time. The BO algorithm returns both the solution at which the lowest objective function value was

118 observed and the solution at which the final surrogate function has the lowest mean. As the optimized

119 design, we chose the solution based on the surrogate function.
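As a minimal sketch of this setup (the design variables, their bounds, and the utility estimator are illustrative assumptions; estimate_expected_utility refers to the sketch given after eq. 8, and the constraint P(CS, US) ≤ P(CS) is not enforced here), the optimization can be expressed with bayesopt as follows.

    % Illustrative sketch: minimize the negative expected design utility with
    % Bayesian optimization, using the settings described above.
    vars = [optimizableVariable('pCS1',   [0 1])
            optimizableVariable('pCSUS1', [0 1])
            optimizableVariable('pCS2',   [0 1])
            optimizableVariable('pCSUS2', [0 1])];
    objfun = @(tbl) -estimate_expected_utility( ...
        [tbl.pCS1 tbl.pCSUS1 tbl.pCS2 tbl.pCSUS2], 1000, ...
        sample_ground_truth, simulate_experiment, analyze, utility_fun);
    results = bayesopt(objfun, vars, ...
        'AcquisitionFunctionName', 'expected-improvement-plus', ...
        'ExplorationRatio', 0.5, ...
        'NumSeedPoints', 10, ...
        'MaxObjectiveEvaluations', 300, ...
        'IsObjectiveDeterministic', false);
    etaOpt = results.XAtMinEstimatedObjective;   % design chosen from the final surrogate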

120 2 Associative learning models

121 2.1 Scenario 1

122 In Scenario 1 we consider the case of estimating the learning rate parameter of the Rescorla-Wagner (RW)

123 model (Rescorla and Wagner 1972). For this model, the update rule for the associative weights between the

124 stimuli s and outcome o is:

w_{t+1} = w_t + \alpha \delta_t s_t, \qquad (11)

125 where wt is the vector of associative weights at the end of trial t, α is the learning rate, δt is the prediction

126 error, and st is the vector of binary indicator variables for the stimuli (i.e., conditioned stimuli, CSs). The

127 prediction error is given by:

\delta_t = o_t - w_t^{\top} s_t, \qquad (12)

128 where ot is the binary indicator variable for the outcome (i.e., unconditioned stimulus, US). The mapping

129 from the associative weights wt , to the model responses yt (i.e., conditioned responses, CRs) is assumed to

130 be linear with added Gaussian noise:


y_t = \beta_0 + \beta_1 s_t^{\top} w_t + \epsilon, \qquad \epsilon \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_y^2). \qquad (13)

131 The parameters of the model are therefore α, w0 , β0 , β1 , and σy . We make the simplifying assumption that

132 all initial values of associative weights are the same (w0 = w0 1). In all the simulations, parameters other

133 than α were kept constant (w0 = 0 , β0 = 0, β1 = 1, σy = 0.2). The parameters were estimated using

134 the maximum likelihood approach, which is equivalent to the maximum a posteriori approach with uniform

135 priors on the whole domain of permissible values for the parameter (eq. 4). To perform the estimation we

136 used the mfit_optimize function from the ‘mfit’ toolbox (Gershman 2018).
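For concreteness, a minimal sketch (not the code used in the paper) of simulating RW model responses for one experiment realization is given below; CS is an nTrials-by-nCues binary matrix, US is an nTrials-by-1 binary vector, and computing the response on trial t from the weights before that trial's update is an assumption of this sketch.

    % Illustrative sketch: simulate RW responses according to eqs. 11-13.
    function y = simulate_rw(CS, US, alpha, w0, beta0, beta1, sigma_y)
        [nTrials, nCues] = size(CS);
        w = w0 * ones(nCues, 1);                                 % initial associative weights
        y = zeros(nTrials, 1);
        for t = 1:nTrials
            s = CS(t, :)';                                       % cue indicators on trial t
            y(t) = beta0 + beta1 * (s' * w) + sigma_y * randn;   % response mapping (eq. 13)
            delta = US(t) - s' * w;                              % prediction error (eq. 12)
            w = w + alpha * delta * s;                           % RW update (eq. 11)
        end
    end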

137 Three evaluation priors, with different values for α were used: low alpha (LA, α = 0.1), middle alpha (MA,

138 α = 0.2), and high alpha (HA, α = 0.3). The designs were optimized using either a point prior (PA-OPT

139 designs) or a vague prior (VA-OPT designs) on alpha. The PA-OPT design priors coincided with evaluation

140 priors, and the VA-OPT design prior was a uniform distribution on the [0, 1] interval. The designs were

141 optimized for accuracy in estimating the learning rate α, with the utility function being specified as the

142 negative absolute estimation error (eq. 6).

143 2.2 Scenario 2

144 Scenario 2 involves model comparison between the RW model and its probabilistic extension, the Kalman RW

145 (KRW) model (Dayan and Kakade 2001; Kruschke 2008; Gershman 2015). The RW model is implemented the

146 same as in Scenario 1 (equations 11–13). The KRW model maintains a probabilistic representation of associative

147 weights using a multivariate Gaussian distribution, with mean vector wt and covariance matrix Σt . These

148 state variables are updated using the Kalman filter equations:

w_{t+1} = w_t + k_t \delta_t, \qquad (14)

\Sigma_{t+1} = \Sigma_t + \tau^2 I - k_t s_t^{\top} (\Sigma_t + \tau^2 I), \qquad (15)

150 where τ 2 is the variance that governs the diffusion of weights over time, and kt is the Kalman gain, which

151 acts as a stimulus-specific learning rate. Kalman gain is given by:

k_t = \frac{(\Sigma_t + \tau^2 I)\, s_t}{s_t^{\top} (\Sigma_t + \tau^2 I)\, s_t + \sigma_o^2}, \qquad (16)

152 where σo2 is the variance of the outcome noise distribution. The model responses yt for KRW are obtained

153 as a linear mapping of associative weights wt (eq. 13). Hence, the parameters of the model are w0 , Σ0 , β0 ,
154 β1 , τ² , σo , and σy . We simplify the parameterization by assuming w0 = w0 1 and Σ0 = σw² I.
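A minimal sketch of a single KRW update, following eqs. 14-16, is given below (illustrative only; w is the mean weight vector, Sigma its covariance, s the binary cue vector, o the outcome, tau2 the diffusion variance, and sigma_o2 the outcome noise variance).

    % Illustrative sketch: one Kalman filter update of the KRW model.
    function [w, Sigma] = krw_update(w, Sigma, s, o, tau2, sigma_o2)
        P = Sigma + tau2 * eye(length(w));       % diffused weight covariance
        k = (P * s) / (s' * P * s + sigma_o2);   % Kalman gain (eq. 16)
        delta = o - w' * s;                      % prediction error
        w = w + k * delta;                       % mean update (eq. 14)
        Sigma = P - k * (s' * P);                % covariance update (eq. 15)
    end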

155 The evaluation prior was a uniform prior over RW and KRW models, combined with a point prior on

156 model parameters, with the parameter values taken from the simulation study of Kruschke (2008). For RW,
157 parameter values were w0 = 0 and α = 0.08. For KRW, we set the parameters at w0 = 0, log σw² = 0 (σw² = 1),

158 log τ 2 = −20 (τ 2 ≈ 0), log σo2 = 1.3863 (σo2 ≈ 4). For both models, the parameters of the model response

159 mapping were set at β0 = 0, β1 = 1, and σy = 0.2.

160 The design prior was either a precise point prior coinciding with the evaluation prior (PP-OPT), or a vague

161 prior (VP-OPT). The vague prior for the RW model was implemented as:

\alpha \sim U(0, 1), \qquad \sigma_y \sim U(0.05, 0.5). \qquad (17)

162 For the KRW model, the vague prior was:

\log \sigma_w^2 \sim \mathcal{N}(0, 2), \qquad \log \sigma_o^2 \sim \mathcal{N}(0, 2), \qquad \sigma_y \sim U(0.05, 0.5). \qquad (18)

163 The rest of the parameters of both models were fixed to the same values as in the evaluation prior.

164 Parameter estimates in Scenario 2 were obtained using a maximum likelihood approach, as in Scenario 1.

165 Models were compared using the Bayesian Information Criterion (BIC; Schwarz 1978):

\mathrm{BIC} = -2 \log(\hat{L}) + \log(n)\, k, \qquad (19)

166 where L̂ is the value of the likelihood function for the maximum likelihood parameter estimates, n is the

167 number of data points, and k is the number of estimated parameters. The model with the lowest BIC was

168 selected as the winning model. When conducting design optimization, the proportion of cases in which the winning

169 model was also the ground truth model - i.e., model selection accuracy - was transformed to the log-odds scale

170 in order to better conform with the normality assumption we used in optimization (we assumed a Gaussian

171 process as a surrogate model of the utility function in Bayesian optimization).
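A minimal sketch of this selection step and of the log-odds transform is given below (illustrative variable names; the negative log-likelihoods, parameter counts, number of trials, and the logical vector of correct selections are assumed to come from the fitting and simulation steps).

    % Illustrative sketch: BIC comparison (eq. 19) and the log-odds transform.
    bic = @(negLogL, n, k) 2 * negLogL + k * log(n);   % with log(L_hat) = -negLogL
    BIC_rw  = bic(negLogL_rw,  nTrials, k_rw);
    BIC_krw = bic(negLogL_krw, nTrials, k_krw);
    [~, winner] = min([BIC_rw, BIC_krw]);              % lowest BIC wins (1 = RW, 2 = KRW)
    accuracy = mean(correct);                          % selection accuracy across simulations
    accuracy = min(max(accuracy, eps), 1 - eps);       % guard against 0/1 before the logit
    logOdds  = log(accuracy / (1 - accuracy));         % value modeled by the GP surrogate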

172 2.3 Scenario 3

173 Scenario 3 again features model comparison between the RW model and its variant - namely, the Rescorla-

174 Wagner-Pearce-Hall (RWPH) hybrid model (Li et al. 2011). Whereas the RW model assumes a fixed learning

175 rate, the RWPH model assumes a stimulus-specific learning rate that is modulated over time depending on

176 observed prediction errors. In particular, the RW weight updating is supplemented with the Pearce-Hall rule

177 for updating the learning rates (i.e., associabilities):




\alpha_{t+1}^{(s)} = \begin{cases} \eta \lvert \delta_t \rvert + (1 - \eta)\, \alpha_t^{(s)} & \text{if } s_t = 1, \\ \alpha_t^{(s)} & \text{if } s_t = 0, \end{cases} \qquad (20)

178 where α_t^{(s)} is the associability of stimulus s at trial t and η is the forgetting factor. The corresponding weight

179 update equation is:


w_{t+1}^{(s)} = w_t^{(s)} + \kappa\, \alpha_t^{(s)} \delta_t s_t, \qquad (21)

180 where κ is the fixed learning rate. Additionally, in Scenario 3 we included variants of the RWPH model

181 that have different mappings between the latent states of the models (i.e., weights and associabilities) and

182 the model responses (i.e., conditioned responses). The most general version of the model, RWPH(V + α),

183 features a linear mapping from a mixture of active weights and associabilities:

y_t = \beta_0 + \beta_1 s_t^{\top} \left( \lambda w_t + (1 - \lambda)\, \alpha_t \right) + \epsilon, \qquad (22)

184 where λ is the mixing factor. The parameters of the model are α0 , w0 , η, β0 , β1 , λ, and σy . Again, we make

185 the simplifying assumption about the initial associabilities and weights (all initial associabilities and weights

186 are equal to α0 and w0 , respectively). The other variations of the RWPH model - RWPH(V ) and RWPH(α)

187 - are obtained by setting the output mixing factor λ to 1 or 0, respectively.
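A minimal sketch of one trial of the RWPH(V + α) model, following eqs. 20-22, is given below (illustrative only; s is the binary cue vector, o the outcome, and computing the response from the pre-update weights and associabilities is an assumption of this sketch).

    % Illustrative sketch: one RWPH hybrid trial (response, weight and associability updates).
    function [w, alpha, y] = rwph_step(w, alpha, s, o, eta, kappa, lambda, beta0, beta1, sigma_y)
        y = beta0 + beta1 * (s' * (lambda * w + (1 - lambda) * alpha)) + sigma_y * randn;  % eq. 22
        delta = o - w' * s;                                            % prediction error
        w = w + kappa * delta * (alpha .* s);                          % eq. 21 (active cues only)
        alpha(s == 1) = eta * abs(delta) + (1 - eta) * alpha(s == 1);  % eq. 20
    end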

188 As in Scenario 2, the evaluation prior was a uniform prior over all compared models (RW, RWPH(V ),

189 RWPH(α), RWPH(V +α)), combined with a point prior on model parameters. The parameters of RWPH(V +

190 α) update equations were set to the maximum likelihood values obtained in the empirical study of Li et al.

191 (2011); specifically, w0 = 0, α0 = 0.926, η = 0.166, and κ = 0.857. For the other models - which are nested

192 within the RWPH(V +α) model - corresponding subsets of these maximum likelihood values were used. The study

193 of Li et al. (2011) did not specify the maximum likelihood values of output mapping parameters, so they

194 were arbitrarily set at β0 = 0, β1 = 1, and σy = 0.2. The mixing factor λ was set to 0.5 in RWPH(V + α),

195 to 0 in RWPH(α), and to 1 in RWPH(V ) and RW.

196 The design prior - as in Scenario 2 - was either a point prior coinciding with the evaluation prior (PP-OPT),

197 or a vague prior (VP-OPT). The vague prior for the RWPH(V + α) model was:

\alpha_0 \sim U(0, 1), \qquad \eta \sim U(0, 1), \qquad \kappa \sim U(0, 1), \qquad \lambda \sim U(0, 1), \qquad \sigma_y \sim U(0.05, 0.5). \qquad (23)

198 For the other models we used the appropriate subset of priors. The rest of the parameters of all the mod-

199 els were fixed to the same values as in the evaluation prior. Model fitting, model selection, and design

200 optimization were performed in the same manner as in Scenario 2.

201 3 References

202 Brochu, Eric, Vlad M. Cora, and Nando de Freitas. 2010. “A Tutorial on Bayesian Optimization of Expensive

203 Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning.” arXiv

204 Preprint arXiv:1012.2599. http://arxiv.org/abs/1012.2599.

205 Dayan, Peter, and Sham Kakade. 2001. “Explaining away in weight space.” In Advances in Neural Informa-

206 tion Processing Systems 14, 451–57. http://papers.nips.cc/paper/1852-explaining-away-in-weight-space.pdf.

207 Gershman, Samuel J. 2018. “’Mfit’: Simple Model-Fitting Tools.” https://github.com/sjgershm/mfit.

208 ———. 2015. “A Unifying Probabilistic View of Associative Learning.” PLOS Computational Biology 11

209 (11): e1004567. https://doi.org/10.1371/journal.pcbi.1004567.

210 Kruschke, John K. 2008. “Bayesian approaches to associative learning: From passive to active learning.”

211 Learning and Behavior 36 (3): 210–26. https://doi.org/10.3758/LB.36.3.210.

212 Li, Jian, Daniela Schiller, Geoffrey Schoenbaum, Elizabeth A Phelps, and Nathaniel D Daw. 2011. “Differ-

213 ential roles of human striatum and amygdala in associative learning.” Nature Neuroscience 14 (10): 1250–2.

214 https://doi.org/10.1038/nn.2904.

215 Penny, WD, KE Stephan, A Mechelli, and KJ Friston. 2004. “Comparing dynamic causal models.” Neu-

216 roImage 22 (3): 1157–72. https://doi.org/10.1016/j.neuroimage.2004.03.026.

217 Rescorla, Robert A, and Allan R Wagner. 1972. “A theory of Pavlovian conditioning: Variations in the

218 effectiveness of reinforcement and nonreinforcement.” Classical Conditioning II Current Research and Theory

219 21 (6): 64–99. https://doi.org/10.1101/gr.110528.110.

220 Schwarz, Gideon. 1978. “Estimating the Dimension of a Model.” The Annals of Statistics 6 (2): 461–64.

221 https://doi.org/10.1214/aos/1176344136.

222 Wang, Fei, and Alan E. Gelfand. 2002. “A simulation-based approach to Bayesian sample size determination

223 for performance under a given model and for separating models.” Statistical Science 17 (2): 193–208. https:

224 //doi.org/10.1214/ss/1030550861.

