
Computational optimization of associative learning experiments

Filip Melinscak*1,2 and Dominik R. Bach1,2,3

1 Computational Psychiatry Research; Department of Psychiatry, Psychotherapy, and Psychosomatics; Psychiatric Hospital; University of Zurich
2 Neuroscience Center Zurich; University of Zurich
3 Wellcome Centre for Human Neuroimaging and Max Planck/UCL Centre for Computational Psychiatry and Ageing Research; University College London

* Correspondence: Filip Melinscak <filip.melinscak@uzh.ch>

Abstract

With computational biology striving to provide more accurate theoretical accounts of biological systems, use of increasingly complex computational models seems inevitable. However, this trend engenders a challenge of optimal experimental design: due to the flexibility of complex models, it is difficult to intuitively design experiments that will efficiently expose differences between candidate models or allow accurate estimation of their parameters. This challenge is well exemplified in associative learning research. Associative learning theory has a rich tradition of computational modeling, resulting in a growing space of increasingly complex models, which in turn renders manual design of informative experiments difficult. Here we propose a novel method for computational optimization of associative learning experiments. We first formalize associative learning experiments using a low number of tunable design variables, to make optimization tractable. Next, we combine simulation-based Bayesian experimental design with Bayesian optimization to arrive at a flexible method of tuning design variables. Finally, we validate the proposed method through extensive simulations covering both the objectives of accurate parameter estimation and model selection. The validation results show that computationally optimized experimental designs have the potential to substantially improve upon manual designs drawn from the literature, even when prior information guiding the optimization is scarce. Computational optimization of experiments may help address recent concerns over reproducibility by increasing the expected utility of studies, and it may even incentivize practices such as study pre-registration, since optimization requires a pre-specified analysis plan. Moreover, design optimization has the potential not only to improve basic research in domains such as associative learning, but also to play an important role in translational research. For example, design of behavioral and physiological diagnostic tests in the nascent field of computational psychiatry could benefit from an optimization-based approach, similar to the one presented here.

Keywords: associative learning, simulation-based Bayesian experimental design, Bayesian optimization

Author summary

To capture complex biological systems, computational biology harnesses accordingly complex models. The flexibility of such models allows them to better explain real-world data; however, this flexibility also creates a challenge in designing informative experiments. Because flexible models can, by definition, fit a variety of experimental outcomes, it is difficult to intuitively design experiments that will expose differences between such models, or allow their parameters to be estimated with accuracy. This challenge of experimental design is apparent in research on associative learning, where the tradition of modeling has produced a growing space of increasingly complex theories. Here, we propose to use computational optimization methods to design associative learning experiments. We first formalize associative learning experiments, making their optimization possible, and then we describe a Bayesian, simulation-based method of finding optimized experiments. In several simulated scenarios, we demonstrate that optimized experimental designs can substantially improve upon the utility of often-used canonical designs. Moreover, a similar approach could also be used in translational research; e.g., in the nascent field of computational psychiatry, designs of behavioral and physiological diagnostic tests could be computationally optimized.

1 Introduction

A major goal of computational biology is to find accurate theoretical accounts of biological systems. Given the complexity of biological systems, accurately describing them and predicting their behavior will likely require correspondingly complex computational models. Although the flexibility of complex models provides them with potential to account for a diverse set of phenomena, this flexibility also engenders an accompanying challenge of designing informative experiments. Indeed, flexible models can often be difficult to distinguish one from another under a variety of experimental conditions (Silk et al. 2014), and the models' parameters often cannot be estimated with high certainty (Apgar et al. 2010). Designing informative experiments entails formalizing them in terms of tunable design variables, and finding values for these variables that will allow accurate model selection and parameter estimation. This challenge is well exemplified - and yet unaddressed - in the field of associative learning research.

Associative learning is the ability of organisms to acquire knowledge about environmental contingencies between stimuli, responses, and outcomes. This form of learning may not only be crucial in explaining how animals learn to efficiently behave in their environments (Enquist, Lind, and Ghirlanda 2016), but understanding associative learning also has considerable translational value. Threat conditioning (also termed "fear conditioning"), as a special case of associative learning, is a long-standing model of anxiety disorders; accordingly, threat extinction learning underlies treatments such as exposure therapy (Vervliet, Craske, and Hermans 2013). Consequently, the phenomenon of associative learning - whether in the form of classical (Pavlovian) or operant (instrumental) conditioning - has been a subject of extensive empirical research. These empirical efforts have also been accompanied by a long tradition of computational modeling: since the seminal Rescorla-Wagner model (Rescorla and Wagner 1972), many other computational accounts of associative learning have been put forward (Pearce and Bouton 2001; Dayan and Kakade 2001; Gershman and Niv 2012; Gershman 2015; Tzovara, Korn, and Bach 2018), together with software simulators that implement these theories (Thorwart et al. 2009; Alonso, Mondragón, and Fernández 2012). As this space of models grows larger and more sophisticated, the aforementioned challenge of designing informative experiments becomes apparent. Here, we investigate how this challenge can be met through theory-driven, computational methods for optimizing associative learning experiments.

Similar methods of optimal experimental design have been developed in various subfields of biology and cognitive science concerned with formal modeling; for example, in systems biology (Vanlier et al. 2012; Liepe et al. 2013), neurophysiology (Lewi, Butera, and Paninski 2009; Sanchez et al. 2014), neuroimaging (Daunizeau et al. 2011; Lorenz et al. 2016), psychology (Myung and Pitt 2009; Rafferty, Zaharia, and Griffiths 2014), and behavioral economics (Vincent and Rainforth 2017; Balietti, Klein, and Riedl 2018). These efforts have generally demonstrated the potential benefits of computational optimization in experimental design. Yet, in research on associative learning, there is a lack of tools for design optimization, although the foundation for this development has been laid through extensive computational modeling. We therefore explore the potential of computationally optimizing experimental designs in associative learning studies, as opposed to the common practice of manually designing such experiments.

In optimizing experimental designs we usually need to rely on prior information: formalizing this notion is the essence of the Bayesian experimental design framework (Chaloner and Verdinelli 1995). Bayesian experimental design is the basis of the aforementioned methods in other domains, and it also underlies our approach. This framework formalizes the problem of experimental design as maximizing the expected utility of an experiment. With this formulation we can integrate prior knowledge with explicit study goals to arrive at optimal designs. Moreover, upon observing new experimental data, the Bayesian design framework prescribes a principled manner in which to update the optimal design. However, Bayesian experimental design requires calculations of expected utility that usually result in analytically intractable integrals. To address this issue, we adopt a simulation-based approach to utility calculations (Wang and Gelfand 2002). Using the simulation-based approach, we can flexibly choose the experiment structure, space of candidate models, goals of the study, and the planned analysis procedure, without the need to specify an analytically tractable utility function.

To illustrate the proposed design method, we apply it to the optimization of classical conditioning experiments, as a special case of associative learning. We first enable computational optimization by formalizing the structure of classical conditioning experiments and by identifying tunable design variables. Next, we propose low-dimensional design parameterizations that make optimization computationally tractable. We then generate optimized designs for three different types of study goals: namely, (1) the goal of accurately estimating parameters of a model, (2) the goal of comparing models with different learning mechanisms, and (3) the goal of comparing models with different mappings between latent states and emitted responses. Additionally, we perform the optimizations under varying levels of prior knowledge. Finally, we compare the optimized designs with commonly used manual designs drawn from the literature. This comparison reveals that optimized designs substantially outperform manual designs, highlighting the potential of the method.

2 Methods

We can decompose the problem of designing experiments into three components: formalizing, evaluating, and optimizing. We will first outline each of these components of experimental design and introduce the main concepts. The details and the application to designing associative learning experiments are covered by the later sub-sections. Here we provide a non-technical description of the method, with the more formal treatment available in the S1 Appendix.

Formalizing experiments involves specifying the experiment structure and parameterizing it in terms of design variables. Design variables are the tunable parameters of the experiment (e.g., probabilities of stimuli occurring), and the experiment structure is a deterministic or stochastic mapping from the design variables to experiment realizations (e.g., sequences of stimuli). The relationship between experiment structure, design variables, and experiment realizations is analogous to the relationship between statistical models, their parameters, and observed data, respectively. Importantly, much like statistical models, experiment structures can be parameterized in various ways, some more amenable to design optimization than others.

Evaluating experiments consists of calculating the expected utility for evaluated designs. Calculation of expected utility requires us to integrate across the distribution of datasets observable under our assumptions; however, this calculation generally results in analytically intractable integrals. To tackle this issue, we adopted the simulation-based approach to the evaluation of expected utility (Wang and Gelfand 2002). The simulation-based approach entails three steps: simulating datasets, analyzing datasets, and calculating the expected utility (fig. 1.A-C).

Simulated datasets consist both of experiment realizations and model responses (fig. 1.A). Experiment realizations are sampled from the experiment structure using the design being evaluated. In order to sample model responses, we first need to specify a candidate model space and an evaluation prior. A candidate model space is a set of generative statistical models, which encode hypotheses about possible data-generating processes underlying the phenomenon under study. An evaluation prior is a probability distribution that encodes the a priori plausibility of candidate models and their parameters. Using the evaluation prior, we can sample a candidate model and its parameters, which together comprise the ground truth. Using the ground truth model and the experiment realization, we can now sample model responses, thus completing the simulated dataset.

With the simulated dataset we can perform the planned analysis procedure (fig. 1.B). This is the procedure that we plan to conduct with the dataset obtained in the actual experiment. Any type of statistical analysis can be used here, but Bayesian analyses seem particularly appropriate, since specifying generative models and priors is already a prerequisite of the design evaluation procedure. If we employ a Bayesian analysis, an additional prior needs to be specified - the analysis prior. Unlike the evaluation prior - which should reflect the experimenter's best guess about the state of the world, and can be as informative as deemed appropriate - the analysis prior should be acceptable to a wider audience, and therefore should not be overly informative or biased (Brutti, De Santis, and Gubbiotti 2014). The analysis procedure - Bayesian or otherwise - will yield an analysis result; for example, a parameter estimate (in estimation procedures) or a discrete choice of the best-supported model (in model selection procedures).

Having performed the analysis on the simulated dataset, we can now calculate the experiment utility (fig. 1.C). The experiment utility is provided by the utility function, which encodes the goals of the study, and maps the values of design variables to utility values. Although we are interested in how the utility depends on the design, the utility function may generally depend not only on the direct properties of the design (e.g., experiment duration and cost) but also on quantities influenced by the evaluation prior or the analysis procedure, e.g., the analysis result and the ground truth. For example, when the goal is accurate parameter estimation, the utility function can be based on the discrepancy between the true parameter value (ground truth), and the obtained parameter estimate (analysis result), neither of which are direct properties of the design.

Figure 1: The simulation-based method of evaluating and optimizing experiment designs. Note: light-blue
elements denote required user inputs. (A) Dataset simulation process generates experiment realizations (e.g.,
cue-outcome sequences) and model responses (e.g., predicted physiological or behavioral responses), which
together form a simulated dataset. (B) Dataset analysis applies the user-specified analysis procedure to the
simulated dataset and produces analysis results (e.g., model evidence or parameter estimates). If the analysis
procedure is Bayesian, an additional analysis prior needs to be specified. (C) The calculation of expected
design utility requires simulating and analyzing a number of datasets. The user-defined utility function -
which can depend directly on the design or on other simulation-specific quantities - provides the utility value
for each simulated experiment. The expected design utility is obtained by averaging experiment-wise utilities.
(D) Design optimization proceeds in iterations: the optimization algorithm proposes a design from the design
space, the design is evaluated under the design prior, and the expected design utility is passed back to the
optimizer. If the optimization satisfies the termination criterion (which is one of the user-defined optimization
options), the optimizer returns the optimized design.

Finally, to obtain the expected design utility, we simulate a number of experiments using the same design, calculate the experiment utility for each experiment, and average the utility values across experiments.

Optimizing experiments requires finding the values of design variables that maximize the expected utility under our evaluation prior (please note, we will refer to the evaluation prior used in design optimization as the design prior, to avoid confusion when a design is optimized using one prior, but ultimately evaluated using a different prior). The principle of maximizing expected design utility is the core idea of Bayesian experimental design, which is simply the application of Bayesian decision theory to the problem of designing experiments (Schlaifer and Raiffa 1961; Lindley 1972; Chaloner and Verdinelli 1995). Practically, maximizing expected design utility can be achieved using optimization algorithms (fig. 1.D). We first specify the domain over which to search for the optimal design - the design space. The design space is defined by providing the range of feasible values for the design variables, which must also take into account any possible constraints on these variables. Next, we specify the optimization objective function, which is in this case the expected design utility, evaluated through the previously described simulation-based approach. Lastly, we specify the parameters of the optimization algorithm, e.g., the termination criterion. The optimization proceeds in iterations that consist of the algorithm proposing a design, and obtaining back the expected design utility. After each iteration, the termination criterion is checked: if the criterion is not fulfilled, the algorithm continues by proposing a new design (based on the previously observed utilities). Ultimately, the algorithm returns an optimized design, which - importantly - might not be the optimal design, depending on the success of the search.

2.1 Formalizing associative learning experiments

As outlined, in order to optimize associative learning experiments, we first need to formalize their structure. Here we consider the paradigm of classical conditioning, as a special case of associative learning, where associations are formed between presented cues and outcomes. There are many design variables that need to be decided in implementing a conditioning experiment, such as physical properties of the stimuli, or presentation timings. However, learning is saliently driven by statistical contingencies between cues and outcomes in the experiment; consequently, most computational models of associative learning capture the effect of these contingencies on the formation of associations. Therefore, we focus on the statistical properties of the experiment as the design variables of interest.

We formalize a classical conditioning experiment as a sequence of conditioning trials.

Figure 2: Formalizing and parameterizing the structure of classical conditioning experiments. (A) A conditioning trial can be represented as a Markov chain, parameterized by the probabilities of presenting different cues (e.g., colored shapes), and transition probabilities from cues to outcomes (e.g., delivery of shock vs. shock omission). (B) Stage-wise parameterization of conditioning experiments. Trials in each stage are generated by a Markov chain with constant transition probabilities. The transition probabilities of each stage are tunable design variables. (C) Periodic parameterization. The transition probabilities in the Markov chain are determined by square-periodic functions of time (i.e., trial number). The design variables are the two levels of the periodic function and its half-period.

183 sampled independently, and if the probability of the outcome depends only on the cue, then each trial can be

184 represented as a Markov chain (fig. 2.A). In this formalization, the cue probabilities P (CS) and conditional

185 outcome probabilities P (US|CS) are the transition probabilities of the trial-generating Markov chain. These

186 transition probabilities represent the design variables to be optimized.
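To make this concrete, a trial sequence under this formalization can be sampled as in the following MATLAB sketch; the function name and interface are illustrative rather than taken from the accompanying code.

    function [cues, outcomes] = sample_trials(nTrials, pCue, pUSgivenCue)
    % Sample a cue-outcome sequence from the trial-generating Markov chain.
    %   pCue        - 1 x K vector of cue probabilities P(CS), summing to 1
    %   pUSgivenCue - 1 x K vector of conditional outcome probabilities P(US|CS)
    cues = zeros(nTrials, 1);
    outcomes = zeros(nTrials, 1);
    cumCue = cumsum(pCue(:)');
    cumCue(end) = 1;  % guard against floating-point round-off
    for t = 1:nTrials
        cues(t) = find(rand() <= cumCue, 1, 'first');         % draw the cue index
        outcomes(t) = double(rand() < pUSgivenCue(cues(t)));   % draw the US given the cue
    end
    end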

Allowing the transition probabilities of each trial to be independently optimized results in a high-dimensional optimization problem, even with a modest number of trials. However, the majority of existing experimental designs for studying associative learning vary the cue-outcome contingencies only between experiment stages (i.e., contiguous blocks of trials), and not between individual trials. Examples of such designs include forward (Kamin) blocking (Kamin 1968), backward blocking (Shanks 1985), reversal learning (Li et al. 2011), over-expectation (Rescorla 1970), overshadowing (Pavlov 1927), conditioned inhibition (Pavlov 1927), and latent inhibition (Lubow and Moore 1959). Constraining the designs to allow only for contingency changes between stages can dramatically reduce the number of design variables, making optimization tractable. Therefore, the first design parameterization of conditioning experiments that we introduce here is the stage-wise parameterization. In this parameterization the experiment is divided into stages, with each stage featuring a trial-generating Markov chain with constant transition probabilities (fig. 2.B). The number of trials per stage is assumed to be known, and the tunable design variables are stage-wise cue probabilities P(CS) and conditional outcome probabilities P(US|CS) (some of these design variables can further be eliminated based on constraints). The aforementioned motivating designs from the literature are special cases of the stage-wise parameterization, but this parameterization also offers enough flexibility to go beyond classical designs.
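Under the stage-wise parameterization, a full experiment realization can be generated by concatenating the output of the trial sampler sketched above, stage by stage. The concrete numbers below are purely illustrative and do not correspond to any design used in the paper.

    % Illustrative two-stage design with two cues; each stage has its own
    % cue probabilities P(CS) and reinforcement probabilities P(US|CS).
    stages(1) = struct('nTrials', 40, 'pCue', [0.5 0.5], 'pUS', [0.8 0.2]);
    stages(2) = struct('nTrials', 40, 'pCue', [0.5 0.5], 'pUS', [0.2 0.8]);

    cues = []; outcomes = [];
    for s = 1:numel(stages)
        [c, o] = sample_trials(stages(s).nTrials, stages(s).pCue, stages(s).pUS);
        cues = [cues; c];          %#ok<AGROW> % append this stage's trials
        outcomes = [outcomes; o];  %#ok<AGROW>
    end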

However, if we are interested in associative learning in environments with frequent contingency changes, the stage-wise parameterization may require many stages to implement such an environment, and would thus still lead to a high-dimensional optimization problem. To address this scenario, we introduce the periodic parameterization, which is a parameterization based on periodically-varying contingencies (fig. 2.C). In this parameterization, each cue probability P(CS) and conditional outcome probability P(US|CS) is determined by a periodic function of time (i.e., trial number). Specifically, we use a square-periodic function parameterized with the two levels P1 and P2, between which the probability cycles, and the length of the half-period T. This parameterization allows expressing designs with many contingency changes using a small number of design variables, when compared with an equivalent stage-wise design; however, unlike the stage-wise parameterization, the periodic parameterization constrains contingency changes to occur at regular intervals and only between two fixed values.
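The square-periodic schedule is simply a function of the trial index; a minimal MATLAB sketch is shown below, assuming a single cue and a half-period expressed as a fraction of the total number of trials (as in the design space used later). The function name is illustrative.

    function pUS = square_periodic_schedule(nTrials, p1, p2, halfPeriodFrac)
    % P(US|CS) alternates between the levels p1 and p2, switching every
    % halfPeriodFrac * nTrials trials (rounded to at least one trial).
    halfPeriod = max(1, round(halfPeriodFrac * nTrials));
    t = (1:nTrials)';
    phase = mod(floor((t - 1) / halfPeriod), 2);    % 0 in odd half-periods, 1 in even ones
    pUS = p1 .* (phase == 0) + p2 .* (phase == 1);
    end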

2.2 Evaluating associative learning experiments

Having formalized the structure of classical conditioning experiments, we now specify the procedure to evaluate their expected utility using the simulation-based approach. To simulate a classical conditioning dataset, we first sample an experiment realization - i.e., a sequence of cues (CSs) and corresponding outcomes (USs) - using the experiment structure and the evaluated design. We then sample the ground truth model of associative learning and its parameter values from the evaluation prior. Using the sampled CS-US pairs, sampled parameter values, and the generative model, we generate the conditioned responses (CRs). These trial-wise triplets of CSs, USs, and CRs constitute a single simulated dataset.
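For instance, if the sampled ground truth were a Rescorla-Wagner learner with elemental cues, the conditioned responses could be generated roughly as follows. This is a sketch only; the models used in the scenarios below (e.g., with compound cues or other response mappings) differ in detail.

    function cr = simulate_rw_responses(cues, outcomes, alpha, noiseSd, nCues)
    % Rescorla-Wagner learner: the CR on each trial is the associative weight
    % of the presented cue plus Gaussian observation noise; the weight is then
    % updated with the delta rule.
    V = zeros(1, nCues);               % associative weights, initialized at zero
    cr = zeros(numel(cues), 1);
    for t = 1:numel(cues)
        c = cues(t);
        cr(t) = V(c) + noiseSd * randn();              % emitted (noisy) response
        V(c) = V(c) + alpha * (outcomes(t) - V(c));    % prediction-error update
    end
    end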

With this simulated dataset we can proceed to perform the analysis that we are planning to do with the real dataset that will be collected. In the scenarios presented here, we simulate and analyze only single-subject data. We do so for two reasons. First, the single-subject scenario provides more difficult conditions for statistical inference, due to the limited amount of data. Second, the single-subject analysis is especially relevant from a translational view, which considers clinical inferences that need to be made for each patient individually.

For the analysis, we can specify any procedure that is afforded by the simulated dataset, but computationally demanding analyses can lead to long design evaluation running times. In the current article, we focus on two model-based types of analyses that are common in the modeling literature on associative learning: parameter estimation within a single model, and model selection within a space of multiple models. For parameter estimation, we use the maximum a posteriori (MAP) approach with uniform priors (which is equivalent to maximum likelihood estimation (MLE)), and for model selection we use the Bayesian information criterion (BIC), which takes into account both the model fit and model complexity. Both procedures are computationally relatively inexpensive and were implemented using the existing 'mfit' toolbox for MATLAB (Gershman 2018). Fully Bayesian inference with proper analysis priors could also be used here, as long as the inference procedure is not overly computationally expensive.
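As a schematic example of the model selection step, BIC-based selection over a candidate model space could look as follows. Here, fit_model is a placeholder for a maximum-likelihood fitting routine (e.g., one built on the 'mfit' toolbox); it is not an actual 'mfit' function name, and nModels and trueModel are assumed to be defined by the simulation setup.

    % Given a simulated dataset (cues, outcomes, cr) and nModels candidates:
    n = numel(cr);                      % number of data points (trials)
    bic = zeros(nModels, 1);
    for m = 1:nModels
        [logL, k] = fit_model(m, cues, outcomes, cr);  % maximized log-likelihood and
                                                       % number of free parameters
        bic(m) = k * log(n) - 2 * logL;                % Bayesian information criterion
    end
    [~, selectedModel] = min(bic);                     % lower BIC = preferred model
    selectionError = double(selectedModel ~= trueModel);  % 0-1 loss used as (negative) utility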

Finally, with the results of the analysis on the simulated dataset, we can assess the utility of the experiment. The utility function should be determined by the goals of the study. In this article, for parameter estimation analyses, we aim to minimize the absolute error between the parameter estimate and the ground truth value, and for model selection analyses, we aim to minimize the misidentification of the ground truth model (i.e., model selection error or 0-1 loss).

2.3 Optimizing associative learning experiments

So far, we have formalized the structure of associative learning experiments, and provided a method of assessing their utility by simulation. We are now in a position to specify a method of optimizing the design of such experiments.

Although the simulation-based approach to design evaluation is instrumental in providing flexibility to the user, it can yield difficult optimization problems. Evaluating the expected design utility using stochastic simulations gives noisy estimates of utility and can often be computationally expensive. Furthermore, we generally cannot make strong assumptions about the utility function (e.g., convexity) and we usually do not have access to the derivatives of the function. A state-of-the-art optimization algorithm for such scenarios is Bayesian optimization (Brochu, Cora, and Freitas 2010). Bayesian optimization has already been effectively used in experimental design problems (Weaver et al. 2016; Balietti, Klein, and Riedl 2018), and it was therefore our optimizer of choice. Bayesian optimization is a global optimization algorithm which uses a surrogate probabilistic model of the utility function to decide which values of the optimized variables to evaluate next; this allows it to efficiently optimize over expensive utility functions, at the cost of continuously updating the surrogate model (see S1 Appendix for technical details). In the results presented here, we used the MATLAB bayesopt function (from the 'Statistics and Machine Learning Toolbox'), which implements Bayesian optimization using a Gaussian process as the surrogate model.
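A minimal sketch of such an optimization call, using the periodic parameterization from section 2.1, is shown below. Here, estimate_expected_utility refers to the schematic simulation-based evaluator sketched earlier (not a function of the accompanying code), designPrior and analysisSpec are assumed to be defined by the user, and the option values are illustrative.

    % Design variables of the periodic parameterization (single cue); the
    % half-period is expressed as a fraction of the total number of trials.
    vars = [optimizableVariable('p1', [0 1]), ...
            optimizableVariable('p2', [0 1]), ...
            optimizableVariable('halfPeriod', [0 1])];

    % bayesopt minimizes, so we negate the (noisy) expected utility estimate.
    objFun = @(x) -estimate_expected_utility( ...
        struct('p1', x.p1, 'p2', x.p2, 'halfPeriod', x.halfPeriod), ...
        32, designPrior, analysisSpec);

    results = bayesopt(objFun, vars, ...
        'MaxObjectiveEvaluations', 300, ...
        'IsObjectiveDeterministic', false);   % utility estimates are stochastic
    optimizedDesign = bestPoint(results);     % table of optimized design variables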

To fully specify the design optimization problems, we also need to define the design space, the design prior, and the optimization options. The design space over which the optimization was performed depended on the design parameterization. For the stage-wise parameterization, each transition probability could range from 0 to 1, with the constraint that the sum of transition probabilities from a given starting state is 1. For the periodic parameterization, the two probability levels P1 and P2, and the half-period T (expressed as a fraction of the total trial number) also ranged from 0 to 1. For the periodic parameterization, the design is constrained such that all the transition probabilities from a given state have the same half-period, and such that the P1 levels, and likewise the P2 levels, of the transitions from a given state each sum to 1. The design priors were determined in a problem-specific manner, and are described together with the application scenarios in the Results section. In general, the design prior should be based on the information available from the relevant literature or from existing data. Lastly, the optimization options depend on the specifics of the chosen optimization algorithm (for Bayesian optimization, details are provided in the S1 Appendix). A common optimization option is the termination criterion: we used a fixed maximum number of optimization iterations as the criterion, but the maximum elapsed time, or the number of iterations without substantial improvement, can also be used.

2.4 Workflow for evaluating and optimizing associative learning experiments

Finally, we propose a workflow for employing the described simulation-based method in the evaluation and optimization of associative learning experiments. The proposed workflow is as follows (the required user inputs are italicized; see also fig. 1):

1. Specify the assumptions and goals of the design problem. In this step we need to specify the components of the design problem which are independent of the specific designs that will be evaluated. In particular, we specify our candidate model space, evaluation priors over the models and their parameters, analysis procedure, analysis priors, and the utility function we wish to maximize.

2. Evaluate pre-existing reference designs. If there are pre-existing reference designs, it is advisable to first evaluate their expected utility. If these designs already satisfy our requirements, we do not need to proceed with design optimization. In this step we first formalize reference designs, and then we use the simulation-based approach to evaluate their expected utility under the assumptions made when specifying the design problem (step 1).

3. Optimize the design. If existing reference designs do not satisfy our requirements, we proceed with obtaining an optimized design. In this step we first specify the experiment structure and its design variables (both tunable and fixed), the design space, the design priors, and optimization options. Having specified these, we run the design optimization until the termination criterion is fulfilled, and then inspect the resulting design.

4. Evaluate the optimized design. We take the optimized values of design variables obtained in step 3, and use them to specify the optimized design. Then we obtain the expected utility of the optimized design in the same simulation-based manner as we did for the reference designs (in step 2), and inspect the obtained results.

5. Compare the reference and optimized designs. Having evaluated the reference and optimized designs through simulations (in steps 2 and 4, respectively), we can now compare their expected utilities. For example, if we are aiming for accurate parameter estimates, we can inspect the design-wise distributions of estimation errors, and if we are aiming for accurate model selection, we can inspect the design-wise confusion matrices (and corresponding model recovery rates). If the optimized design does not satisfy our requirements, we can go back to step 3; for example, we can modify the experiment space or run the optimization for longer. If the optimized design satisfies our requirements, we can proceed to its implementation.

For one of the simulated scenarios presented in the Results section (Scenario 2), we illustrate the described workflow with an executable MATLAB notebook provided on GitHub at https://github.com/fmelinscak/coale-paper.
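In outline, the workflow corresponds to a script of roughly the following shape; the wrapper function names are purely illustrative and are not the functions used in the linked repository.

    problem    = define_design_problem(modelSpace, evalPrior, analysisSpec, utilityFun); % step 1
    refUtility = evaluate_design(referenceDesign, problem);                              % step 2
    optDesign  = optimize_design(problem, designSpace, designPrior, optimOptions);       % step 3
    optUtility = evaluate_design(optDesign, problem);                                    % step 4
    compare_designs(refUtility, optUtility);                                             % step 5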

3 Results

We validated the proposed method in three scenarios, with one scenario targeting the goal of accurate parameter estimation, and the other two targeting the goal of accurate model selection. The overview of user inputs employed in these scenarios is given in table 1. In each scenario we compared optimized designs with a reference design drawn from prior work in associative learning. In order to evaluate the design's accuracy of parameter estimation or model selection, it is necessary to know the ground truth, i.e., to sample a data-generating model and its parameter values from the evaluation prior. Therefore, we have evaluated the designs through simulations, as is common in the field of experimental design. To illustrate the benefit of optimized designs, we report estimates of the difference in accuracy between the designs, and the associated 95% confidence intervals. We do not quantify the benefit of optimized designs using classical p-values because the number of simulations (sample size) can be made arbitrarily large, such that optimized designs can be made "significantly" better than reference designs even if the differences are trivially small.

Furthermore, we wanted to investigate the effect of different amounts of background knowledge, i.e., to perform a sensitivity analysis with respect to the design prior, while holding the ground truth (evaluation prior) constant. Therefore, we optimized designs both under a vague (weakly-informative) design prior and under a strongly-informative design prior. The designs were then evaluated using point evaluation priors, representing an assumed ground truth. With these two extreme design priors, we obtain, respectively, approximate lower and upper bounds on the expected design utility of optimized experiments. However, please note that while in a methodological study - like this one - it can be advantageous to disambiguate between the design prior and the evaluation prior, in practical applications these two priors should coincide, with both of them expressing the experimenter's knowledge base at the time of designing the experiment.

Table 1: Overview of user inputs for the three simulated scenarios. The role of user inputs is clarified by fig. 1. See sub-sections 3.1 and 3.2 for details. Analysis priors are used in fitting models, design priors are used to simulate data when optimizing designs, and evaluation priors are used to simulate data when evaluating both reference and optimized designs.

User inputs | Scenario 1 | Scenario 2 | Scenario 3
Candidate model space | RW | RW, KRW | RW(V), RWPH(V), RWPH(α), RWPH(V + α)
Analysis procedure | Maximum likelihood parameter estimation | Model selection using BIC | Model selection using BIC
Utility function | Absolute error of learning rate (α) estimate | Model selection accuracy | Model selection accuracy
Analysis prior | Uniform over α | Uniform over models and parameters | Uniform over models and parameters
Reference design | Acquisition followed by extinction of equal length | Backward blocking | Reversal learning
Evaluation prior | Point priors on low, middle and high α (LA, MA, HA) | Uniform over models with point priors on parameters (from Kruschke 2008) | Uniform over models with point priors on parameters (from Li et al. 2011)
Optimized experiment structure | One cue with periodically varying contingency | Two stages with three cues (A, B, AB) and stage-wise contingencies | Two stages with two cues (A, B) and stage-wise contingencies
Design space | Two contingencies (P1, P2) and the period (T) of their switching (3 variables) | For each stage s and cue X: Ps(X), Ps(US|X) (10 non-redundant variables) | For each stage s and cue X: Ps(X), Ps(US|X) (6 non-redundant variables)
Design priors | Either a point prior over α coinciding with the evaluation prior (PA) or a vague prior (VA) | Uniform over models with either a point (PP) or vague (VP) prior over parameters | Uniform over models with either a point (PP) or vague (VP) prior over parameters

3.1 Optimizing for accuracy in parameter estimation

An early, but still widely influential model of classical conditioning is the Rescorla-Wagner (RW) model (Rescorla and Wagner 1972). A parameter of particular interest in this model is the learning rate α (Behrens et al. 2007). For example, the learning rate of the RW model fitted to human learning data in volatile conditions has been found to correlate with trait anxiety (Browning et al. 2015). Therefore, in Scenario 1, we optimized designs with the goal of accurately estimating the RW learning rate α.

As ground truth, three different point evaluation priors on the learning rate were used: low alpha (LA, α = 0.1), middle alpha (MA, α = 0.2), and high alpha (HA, α = 0.3). Three types of experimental designs were compared under these three evaluation priors using the absolute estimation error metric. All three designs have the same structure, which is simple but sufficient to estimate the learning rate: the probability of the CS being followed by a US (P(US|CS)) is a periodic function of time, and it switches between two discrete values. Therefore, there are three variables that specify the design: the two reinforcement probabilities, and the period with which these two probabilities are switched.

The first evaluated design is the (manual) reference design (denoted REF), which was taken from existing literature rather than being computationally optimized. Here we chose the often-used classical conditioning design in which an association is first acquired under partial reinforcement of a CS with a US (acquisition stage), and then extinguished by ceasing the reinforcement of the CS (extinction stage) (Humphreys 1939). The probabilities of reinforcement were 0.5 and 0 in the acquisition and the extinction stages, respectively. The switch between the stages occurred only once, at the middle of the experiment. The second design is the design optimized under a vague prior over the learning rate α (denoted VA-OPT). The vague prior here is implemented as a uniform prior over α. This prior reflects the situation when there is little information on the parameter of interest at the planning stage of the experiment. Under this prior we are estimating a lower bound on the accuracy of optimized experiments. The third design is the design optimized under the exact point prior on the learning rate α (denoted PA-OPT). This prior reflects the situation where we have perfect knowledge of the ground truth learning rate, and it therefore results in different specific designs for the LA, MA, and HA evaluation priors. Although in practice the ground truth is not known, under this prior we can estimate an upper bound on the accuracy of optimized experiments.

The optimized designs were obtained by running Bayesian optimization for 300 iterations, with the absolute estimation error of the learning rate being minimized. In each iteration, the estimation error was evaluated using 32 datasets simulated from the RW model, with parameters being randomly drawn from the design prior. Using 32 CPU cores at 2.5 GHz, the average time to obtain an optimized design was 0.5 h, with the average being taken across optimizations with different design priors. The reference design and the optimized designs were evaluated in an additional 256 simulations for each combination of the evaluation prior (ground truth) and design.

The results of the design evaluation are shown in fig. 3. We first inspect the distribution of absolute estimation errors under different designs and evaluation priors (fig. 3.A). Under all designs, the estimation errors were relatively small (with almost all simulations yielding errors below 0.1). This can be attributed to a sufficient number of simulated trials and moderate levels of observation noise. Nevertheless, even in such favorable conditions, the REF design yielded the largest estimation errors, followed by the VA-OPT design and the PA-OPT design. This suggests that including even weak prior knowledge (as in the VA-OPT design) can improve parameter estimation accuracy, with more informative prior knowledge providing further improvements (as in PA-OPT).

Next, we quantified the relative differences in design utility by estimating the probability that a randomly sampled experiment from one design will yield higher utility than an experiment sampled from another design (i.e., the probability of superiority). These probabilities were estimated for all pairs of designs by computing the non-parametric common language effect sizes (CLES) (McGraw and Wong 1992), which indicate designs of equal expected utility with CLES = 50%. The 95% CIs for the CLES estimates were obtained using the percentile bootstrap method (Efron 1981), with 1000 bootstrap samples.

Figure 3: Design evaluation in Scenario 1: RW model learning rate estimation. Three designs - reference
acquisition-extinction design (REF), design optimized under a vague prior (VA-OPT), and design optimized
under a point prior (PA-OPT) - are evaluated under the low (LA), middle (MA), and high (HA) value of
the learning rate alpha. (A) Distribution of absolute errors in estimating the learning rate. (B) Pair-wise
comparison of design accuracy expressed as the probability of the first design in the pair being superior (i.e.,
having lower estimation error). Error bars indicate bootstrapped 95% CI and 50% guideline indicates designs
of equal quality. (C) Comparison of model responses (red full line) to the contingencies (black dashed line)
obtained under different designs and different evaluation priors.

The appropriateness of the percentile bootstrap method was verified by inspecting bootstrap distribution histograms, which did not exhibit substantial bias or skew. The design evaluation results are shown in fig. 3.B and they confirm that the VA-OPT design is more likely to yield lower estimation errors than the REF design, under all evaluation priors: the probabilities of VA-OPT being superior are 58.9% (95% CI: [53.9% - 63.8%]), 59.2% (95% CI: [54.1% - 64.0%]), and 58.2% (95% CI: [53.6% - 62.7%]) for the LA, MA, and HA evaluation priors, respectively. Moreover, the PA-OPT design is likely to be superior - under all evaluation priors - both to the REF design (LA - 68.8%, 95% CI: [64.2% - 73.2%]; MA - 66.3%, 95% CI: [61.8% - 70.8%]; HA - 63.3%, 95% CI: [58.4% - 68.5%]) and to the VA-OPT design (LA - 60.0%, 95% CI: [55.0% - 64.9%]; MA - 58.0%, 95% CI: [53.1% - 62.7%]; HA - 55.8%, 95% CI: [50.8% - 60.8%]).
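For reference, the probability-of-superiority estimate and its percentile-bootstrap CI can be computed along the following lines. This is an illustrative MATLAB sketch, assuming errA and errB hold the per-simulation absolute errors of the two designs being compared (ties are counted as 1/2).

    % Probability of superiority: P(error under design A < error under design B).
    probSup = @(a, b) mean(reshape(bsxfun(@lt, a(:), b(:)'), [], 1)) ...
                    + 0.5 * mean(reshape(bsxfun(@eq, a(:), b(:)'), [], 1));

    clesEstimate = probSup(errA, errB);

    % Percentile-bootstrap 95% CI (resampling each design's simulations).
    nBoot = 1000;
    clesBoot = zeros(nBoot, 1);
    nA = numel(errA); nB = numel(errB);
    for i = 1:nBoot
        clesBoot(i) = probSup(errA(randi(nA, nA, 1)), errB(randi(nB, nB, 1)));
    end
    ci = prctile(clesBoot, [2.5 97.5]);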

Finally, we inspect the obtained designs and the corresponding model responses in fig. 3.C. The REF design and the VA-OPT design are the same under all evaluation priors, since they do not take into account the ground truth learning rate, whereas the PA-OPT designs are tuned to the exact parameter value. Compared to the REF design, the VA-OPT design uses more frequent contingency switches, allowing for more accurate estimation of the learning rate under a wide range of possible values. The PA-OPT design adapts the frequency of contingency switches to the learning rate, with low learning rates being better estimated from longer periods of stability, and higher learning rates being better estimated from frequent transitions.

3.2 Optimizing for accuracy in model selection

The basic RW model explains many aspects of associative learning, but it fails to account for many others (Miller, Barnet, and Grahame 1995). For this reason, various extensions of the basic RW model have been proposed in the literature. One example is the Kalman Rescorla-Wagner (KRW) model (Dayan and Kakade 2001; Kruschke 2008; Gershman 2015), which is a probabilistic extension of the RW model. Another example is the hybrid Rescorla-Wagner-Pearce-Hall (RWPH) model (Li et al. 2011), which maintains separate dynamic learning rates (termed "associabilities") for all the cues. The growing space of theoretical models of associative learning highlights the need for experiments that can efficiently expose the differences between these accounts. Therefore, in Scenarios 2 and 3, we optimized designs to achieve the goal of accurate model selection (i.e., how often the selected winning model coincides with the ground truth model). The model space of Scenario 2 is comprised of models with different learning mechanisms, whereas the model space of Scenario 3 entails not only models with different learning mechanisms, but also models with different mappings from latent model states to conditioned responses.

Scenario 2 is based on the simulation study of Kruschke (2008), which compares the RW and the KRW model using the backward blocking design. The backward blocking design is comprised of two training stages, followed by a test stage. In the trials of the first stage, only the reinforced compound cue AB is presented. In the trials of the second stage, only one of the elements of the compound (e.g., A) is presented and reinforced. In the test stage, the single element cues - A and B - are tested separately in non-reinforced trials. This design has been used to compare the RW and KRW models because they make different predictions about test responding to the cue that was not presented in the second stage (here, cue B). According to the RW model, responding to cue B should be unchanged, as this model allows cue-specific associations to change only when the cues are presented. In contrast, the KRW model predicts that the responding to cue B at test will be diminished (blocked), because the model has assigned the credit for reinforcements to the cue A (i.e., during the first stage a negative correlation between the associative weights for element cues is learned). Due to the opposing model predictions for this design, and its previous use in the literature, we chose it as the reference (REF) design in Scenario 2.

Scenario 3 is based on the empirical study of Li et al. (2011), which compared the RW model (labeled RW(V)) and three variants of the RWPH model. The variants of the RWPH model were formulated to emit different model quantities as the conditioned response: associative weight (RWPH(V)), associabilities (RWPH(α)), or a mixture of weights and associabilities (RWPH(V + α)). In order to distinguish between these models Li et al. (2011) employed a reversal learning design, and this is the design we used as the reference (REF) design in Scenario 3. The design is comprised of two stages: in the first stage one cue is partially reinforced, while the other is not, and in the second stage the two cues switch their roles. This design is intuitively appealing for the goal of discriminating between the RW and RWPH models: at the point of contingency reversal, the learning rate in the RWPH model increases due to large prediction errors, while in the RW model the learning rate is constant.

In both Scenario 2 and 3, the structure and parameterization of the optimized designs were the same: the experiment had two stages (with an equal number of trials), and the cue-outcome contingencies were kept constant within each stage. Hence, the stage-wise design variables were the CS presentation probabilities and the probabilities of the US conditional on the CS. To keep a close comparison with the reference designs, Scenario 2 included two CSs (A and B) and their compound (AB), whereas Scenario 3 only included two elemental CSs. Similarly to Scenario 1, we obtained optimized designs under two types of design priors: a vague prior on model parameters (in the VP-OPT design) and an exact point design prior on model parameters (in the PP-OPT design). The vague priors were either uniform (for bounded parameters) or only weakly informative (see S1 Appendix for details). In Scenario 2 the point priors were set at the same values as in the simulations of Kruschke (2008). The point priors in Scenario 3 were set to the best fitting parameter values of the RWPH(V + α) model reported in the study of Li et al. (2011) (the best fitting parameters of other models are not reported in their study); for other models in this scenario, we were able to use subsets of the RWPH(V + α) parameters, since the models were nested. Importantly, even when the exact point priors on model parameters were used in the design optimization (PP-OPT designs), the simulated datasets were sampled in equal proportions from all candidate models (implying a uniform prior over models).

In both Scenario 2 and 3, we ran the Bayesian design optimization for 300 iterations, and in each iteration the utility function (model selection accuracy) was evaluated using 32 datasets simulated from each of the candidate models. Average optimization times (across different design priors) were 5.2 h in Scenario 2, and 14.1 h in Scenario 3. The optimized designs and the reference designs were evaluated in an additional 256 simulated experiments for each combination of the ground truth model and design. Additionally, we computed the responses of models fitted to the evaluation simulations. For visualization, the responses of fitted models to each CS were computed in all the trials, even though in the simulated experiments only one of the CSs was presented in each trial. The evaluation results for Scenario 2 and 3 are shown in fig. 4 and fig. 5, respectively.

For Scenario 2, model selection accuracies of the reference and optimized designs are shown in fig. 4.A. Although the REF design yields accuracies above chance level (mean: 59.6%, 95% CI: [55.2% - 63.9%]), optimized designs provide substantial improvements. Using a vague prior, the VP-OPT design provides a lower bound on the accuracy expected with an optimized design; nonetheless, even this lower bound provides near-perfect accuracy (mean: 99.8%, 95% CI: [98.9% - 100.0%]). We can quantify the relative improvement using the odds ratio (OR) as the effect size (please note: the value OR = 1.0 indicates designs of equal expected utility, and the 95% CIs were obtained using Woolf's method with the Haldane-Anscombe correction (Lawson 2004)). The VP-OPT design greatly increases the odds of correctly identifying the ground truth model (OR: 346.8, 95% CI: [48.4, 2486.4]). Moreover, with exact knowledge of the model parameters, the PP-OPT design provides an upper bound on the expected improvements. Due to the already near-perfect accuracy of the VP-OPT design, the PP-OPT design yielded similar accuracy (mean: 100.0%, 95% CI: [99.3% - 100.0%]). The relative improvement of the PP-OPT design over the REF design is substantial (OR: 696.2, 95% CI: [43.2, 11208.0]). However, due to a ceiling effect, we do not observe a clear advantage of the PP-OPT design over the VP-OPT design (OR: 3.0, 95% CI: [0.1, 74.0]).
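The odds-ratio effect size and its CI used above can be computed as in the following sketch (illustrative code, not taken from the paper's materials), where a and b are the counts of correct and incorrect selections under one design, and c and d are the corresponding counts under the other design.

    function [or, ci] = odds_ratio_woolf(a, b, c, d)
    % Odds ratio with Woolf's logit-based 95% CI; the Haldane-Anscombe
    % correction adds 0.5 to every cell when any count is zero.
    if any([a, b, c, d] == 0)
        a = a + 0.5; b = b + 0.5; c = c + 0.5; d = d + 0.5;
    end
    or = (a * d) / (b * c);
    se = sqrt(1/a + 1/b + 1/c + 1/d);          % standard error of log(OR)
    ci = exp(log(or) + [-1, 1] * 1.96 * se);   % Woolf's 95% CI
    end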

Similar results are observed when comparing the design accuracies in Scenario 3 (fig. 5.A). The accuracy of the REF design is clearly above chance level (mean: 59.7%, 95% CI: [56.6% - 62.7%]), but the designs optimized both under a vague and a point prior provide greatly improved accuracies. The VP-OPT design yields both a high accuracy in absolute terms (mean: 94.8%, 95% CI: [93.3% - 96.1%]) and a large improvement relative to the REF design (OR: 12.4, 95% CI: [9.1, 16.8]).

Figure 4: Design evaluation in Scenario 2: selection between RW and KRW models. Three designs - reference backward blocking design (REF), design optimized under a vague prior (VP-OPT), and design optimized under a point prior (PP-OPT) - are evaluated under the ground truth model being either the RW or the KRW model. (A) Model selection accuracy (mean and the Clopper-Pearson binomial 95% CI). Horizontal guideline indicates chance level. Darker bars summarize results across ground truth models. (B) Values of the design variables in the two stages of the experiment: cue probabilities P(CS) and joint cue-outcome probabilities P(CS, US). (C) Comparison of fitted model responses obtained under different designs (rows) and different ground truth models (columns). Inset labels give the average difference in BIC (± SEM) between the fit of the true model and the alternative model (more negative values indicate stronger evidence in favor of the true model).

The PP-OPT design also provides a high accuracy (mean: 95.9%, 95% CI: [94.5% - 97.0%]), and an even larger improvement relative to the REF design (OR: 15.8, 95% CI: [11.3, 22.1]); however, the improvement relative to the VP-OPT design seems to be small (OR: 1.3, 95% CI: [0.8, 1.9]).

In both Scenario 2 and 3, we can observe that the largest gain in accuracy is realized when the true model is the most complex candidate model (KRW in Scenario 2, and RWPH(V + α) in Scenario 3). With the REF designs, data simulated from these more complex models is often attributed to simpler models (e.g., see confusion matrix in fig. 5.A). Such misidentification occurs because both simple and complex models can account for the data from a simple design similarly well, but the model selection criterion (BIC in our case) penalizes complexity, thus biasing the selection towards simple models and even leading to accuracies below chance level, as seen in fig. 4.A and fig. 5.A. This suggests that simple manual designs do not provide the conditions in which complex models could make predictions sufficiently distinct from simpler models. Hence, to achieve conditions in which candidate models make sufficiently diverging predictions, the simplicity of experimental designs may need to be sacrificed.

Figure 5: Design evaluation in Scenario 3: selection between RW and RWPH models. Three designs - reference reversal learning design (REF), design optimized under a vague prior (VP-OPT), and design optimized under a point prior (PP-OPT) - are evaluated under the ground truth model being either the RW(V), RWPH(V), RWPH(α), or RWPH(V + α). (A) Model selection accuracy (mean and the Clopper-Pearson binomial 95% CI). Horizontal guideline indicates chance level. Darker bars summarize results across ground truth models. Inset shows the confusion matrix between the ground truth model (rows) and the selected model (columns). (B) Values of the design variables in the two stages of the experiment: cue probabilities P(CS) and joint cue-outcome probabilities P(CS, US). (C) Comparison of fitted RW(V) and RWPH(V + α) model responses obtained under different designs (rows) and these two models as assumed ground truth (columns). Inset labels give the average difference in BIC (± SEM) between the fit of the true model and the alternative model (more negative values indicate stronger evidence in favor of the true model). Note: the model responses are nearly identical when the RW(V) model is true, because this model is a special case of the alternative RWPH(V + α) model.

21
482 design variables between the REF, VP-OPT, and PP-OPT designs, both in Scenario 2 (fig. 4.B) and Scenario

483 3 (fig. 5.B). In both cases, the optimized designs are arguably less parsimonious than the reference designs.

484 However, the optimized designs also result in more distinct model responses, especially when the ground

485 truth model is more complex.

486 For example, in Scenario 2, model responses of the fitted RW and KRW models (fig. 4.C) are very similar for

487 cues presented in the REF design (AB in stage 1, and A in stage 2), regardless of which model is the ground

488 truth. We can observe a similar pattern in Scenario 3 (fig. 5.C), where we show the responses of the RW(V )

489 and RWPH(V + α) models, which are most often confused under the REF design. This is also reflected in the

490 BIC differences between the true and the alternative models (∆BIC), which are - under the REF design - in

491 favor of the simple model regardless of the ground truth model. In contrast, model responses with optimized

492 designs are more distinct when the ground truth model is complex, allowing for easier model discrimination

493 in both scenarios. However, when the ground truth model is simple, model responses are similar even under

494 optimized designs; yet, this is not a failure of the design optimization: when the predictions are similar,

495 the simpler model will correctly be selected, since BIC favors parsimony. Similarly, it may seem surprising

496 that the PP-OPT designs can yield evidence in favor of the true model that is weaker than in the VP-OPT

497 designs (e.g., see fig. 4.C) in terms of ∆BIC, even though the PP-OPT designs are based on stronger prior

498 knowledge and provide similar or higher accuracies than VP-OPT designs. Again, this is not a failure of the

499 optimization; instead, the procedure takes into account that our goal is maximizing the frequency of correct

500 model identifications, rather than the subtly different goal of maximizing the expected evidence in favor of

501 the true model. Altogether, these observations demonstrate that the optimization took into account not only

502 the supplied model space and design priors, but also the specific properties of our utility function.

503 4 Discussion

504 In this paper we investigated the potential of computationally optimizing experimental designs in associative

505 learning studies, with the goal of enabling accurate statistical inference from experimental data. First, we

506 formalized and parameterized the structure of classical conditioning experiments to render them amenable

507 to design optimization. Next, we brought together several existing ideas from the literature on experimental

508 design and computational optimization - in particular, simulation-based Bayesian experimental design and

509 Bayesian optimization - and we composed them into a flexible method for designing associative learning

510 studies. Finally, we evaluated the proposed method through simulations in three scenarios drawn from the

511 literature on associative learning, with design optimization spanning both the goals of accurate parameter

512 estimation and model selection. In these simulated scenarios, optimized designs outperformed manual designs

513 previously used in similar contexts: the benefits ranged from a moderate reduction in parameter estimation

514 error in Scenario 1, to substantial gains in model selection accuracies in Scenarios 2 and 3.

515 Furthermore, we highlight the role of prior knowledge in experimental design optimization. Within the

516 framework of Bayesian experimental design, we can distinguish three types of prior probability distributions

517 - the analysis, evaluation, and design prior. The analysis prior is the prior we ultimately plan to use in

518 analyzing the obtained experimental data (if we plan to use Bayesian inference). The analysis prior should

519 generally be acceptable to a wide audience, and therefore should not be overly informative or biased; for

520 analysis priors we used flat priors that were fixed within scenarios, representing a fixed, but non-prejudiced

521 analysis plan. The evaluation prior is used to simulate data for the evaluation of expected design utility: in

522 simulations the models and their parameters are sampled from the evaluation prior. The design prior is used

523 to simulate data in the process of optimizing experimental designs.

524 Although in practical applications the evaluation prior and the design prior coincide (both representing

525 the designer’s partial knowledge), here we distinguish them in order to investigate how different levels of

526 background knowledge impact the accuracy of designs under different assumed ground truths. Hence, in the

527 presented scenarios, point evaluation priors represent the exactly assumed ground truth and design priors

528 represent the varying degrees of background knowledge. For design priors we used both weakly-informative

529 (vague) and strongly-informative priors. The weakly-informative and strongly-informative design priors can

530 be thought of as providing approximate lower and upper bounds on achievable accuracy improvements,

531 respectively. As expected, strongly-informative design priors provided the highest accuracies, but, importantly,

532 even the designs optimized with only weakly-informative design priors enabled similarly accurate inferences

533 (e.g., in Scenarios 2 and 3). This result suggests that specifying the space of candidate models may in some

534 cases be sufficient prior information to reap the benefits of design optimization, even when there is only sparse

535 knowledge of plausible values of the models’ parameters.

536 4.1 Limitations and future work

537 Although the results presented here show that optimized designs can yield substantial improvements over

538 manual designs, there are some important caveats. For example, we can have floor or ceiling effects in design

539 utility. A floor effect can be observed when, e.g., the data is noisy regardless of the chosen experimental

540 design, leading to poor accuracy with both manual and optimized designs. Similarly, a ceiling effect can

541 be observed when the utility of a manual design is already high - e.g., a design manually optimized for

542 a small model space may already provide near-perfect model discrimination under low noise conditions.

543 Nevertheless, given the often large number of design variables, and the complexity of the information that

544 needs to be integrated to arrive at the optimal design, we conjecture that manual designs are often far from

545 optimal. Support for this view comes from two recent studies. Balietti, Klein, and Riedl (2018) asked experts

546 in economics and behavioral sciences to determine the values of two design variables of an economics game,

547 such that the predictions of hypothesized models of behavior in this game would be maximally different.

548 Results of simulations and empirical experiments showed that the experts suggested a more expensive and

549 less informative experimental design, compared with a computationally optimized design. The study

550 of Bakker et al. (2016) investigated researchers’ intuitions about statistical power in psychological research:

551 in an arguably simpler task of estimating statistical power of different designs, the majority of participants

552 systematically over- or underestimated power. The findings of these two studies indicate the need for formal

553 approaches.

554 Another important pair of caveats relates to the specification of the design optimization problem. First,

555 the experiment can only be optimized with respect to the design variables that affect model responses.

556 For example, in scenarios presented in this paper, we considered discrete, trial-level models of associative

557 learning, rather than continuous-time models. However, trial-level models do not capture the effects of

558 stimuli presentation timings, and therefore design variables related to timing cannot be optimized using this

559 model space. If we suspect that the phenomenon under investigation crucially depends on timings, a model

560 capturing this impact needs to be included in the model space. Second, if the specified models and their

561 parameters inadequately describe the phenomenon of interest, then the optimized designs might actually

562 perform worse than manually chosen designs. For example, if specified candidate models do not take into

563 account attentional mechanisms or constraints on cognitive resources, the resulting optimized designs might

564 either be so simple that they disengage subjects, or so complex that subjects cannot learn effectively. This

565 issue can be solved directly by amending the candidate model space and design prior to better capture the

566 studied phenomenon. Alternatively, the desired properties of the design can be achieved by including them in

567 the utility function or by appropriately constraining design variables. Overall, it is important to keep in mind

568 that design optimality is relative, not absolute: a design is only optimal with respect to the user-specified

569 model space, design prior, experiment structure, design space, analysis procedure, and utility function.

570 The problem of misspecification is, however, not particular to our approach, or even to experimental design in

571 general; every statistical procedure requires assumptions, and can yield poor results when those assumptions

572 are not met. However, in the domain of experimental design, this problem can be dealt with by iterating

573 between design optimization, data acquisition, modeling, and model checking. The posterior of one iteration

574 can be used as the design prior of the next one, and model spaces can be amended with new models,

575 if current models are inadequate in accounting for the observed data. This iterative strategy is termed

576 Adaptive Design Optimization (ADO) (Cavagnaro et al. 2010) and it can be implemented at different levels

577 of granularity - between studies, between subsamples of subjects, between individual subjects, or even between

578 trials. Implementing the ADO approach in the context of associative learning studies - especially with design

579 updates at the level of each subject or trial - will be an important challenge for future work.

580 Adapting designs at finer granularity may be particularly beneficial for designs with complex experimental

581 structures. Even global optimization algorithms like Bayesian optimization can struggle in high-dimensional

582 experiment spaces. Optimizing designs on a per-trial basis may not only reduce the number of variables that

583 need to be tuned simultaneously, but might also improve inference accuracy by incorporating new data as it

584 becomes available (Kim et al. 2016). But even with a non-adaptive design approach, a number of options are

585 available to reduce the computational burden: Bayesian optimization can be sped up using GPUs (Matthews

586 et al. 2017), and analytical bounds (Daunizeau et al. 2011) or normal approximations to the utility function

587 (Overstall, McGree, and Drovandi 2018) can be used to identify promising parts of the experiment space,

588 which can then be searched more thoroughly (see Ryan et al. (2016) for a review of computational strategies

589 in Bayesian experimental design). Moreover, although some of the scenarios presented in this paper required

590 lengthy optimizations, this time investment seems negligible compared to the typical length of data acquisition

591 in associative learning studies and to the cost of the study yielding non-informative data.

592 We have demonstrated our approach by optimizing designs for accurate single-subject inferences, which we

593 contend is especially valuable from a translational perspective, and also represents a difficult benchmark, due

594 to limited amounts of data in single-subject experiments. Group studies could also be accommodated within

595 our approach, but simulating group studies would be computationally expensive, and it would imply that the

596 design can be updated only once all the subjects’ data has been acquired. Instead, in future developments

597 of design methods for associative learning, it may be advantageous to adopt the hierarchical ADO (HADO)

598 approach (Kim et al. 2014): this approach combines hierarchical Bayesian modeling with adaptive design.

599 Even when the goal is to achieve accurate inferences at the level of single subjects, hierarchically combining

600 the new subject’s data with the previously acquired data can improve efficiency and accuracy.

601 The design optimization method was illustrated on classical conditioning, as a special case of associative

602 learning, but the extension to operant conditioning is conceptually straightforward. Whereas we formalized

603 classical conditioning experiments using Markov chains, operant conditioning experiments can be formalized

604 using Markov decision processes (MDPs). In general, optimizing MDPs may be computationally challenging,

605 as the design has to specify transition probabilities and rewards for each state-action pair, possibly resulting in

606 a high-dimensional problem. However, tractable optimization problems may also be obtained for experiments

607 that can be represented using highly structured MDPs, with only a few tunable design variables (Zhang and

608 Lee 2010; Rafferty, Zaharia, and Griffiths 2014). We expect that the crucial challenge for future work will

609 be to elucidate general correspondences between model classes and experimental structures that allow their

610 discrimination, while still being amenable to optimization (i.e., resulting in low-dimensional problems).

611 4.2 Conclusion

612 We believe that computational optimization of experimental designs is a valuable addition to the toolbox of

613 associative learning science and that it holds promise of improving the accuracy and efficiency of basic research

614 in this field. Optimizing experimental designs could also help address the recent concerns over underpowered

615 studies in cognitive sciences (Szucs and Ioannidis 2017): the usually-prescribed use of larger samples should

616 be complemented with the use of better designed experiments. Moreover, the effort in specifying the modeling

617 details prior to designing the experiment allows for the experiment to easily be pre-registered, thus facilitating

618 the adoption of emerging open science norms (Munafò et al. 2017; Nosek et al. 2018). The requirement

619 to specify candidate models and the analysis procedures may seem restrictive, but this does not preclude

620 subsequent exploratory analyses; it just delineates them from planned confirmatory analyses (Wagenmakers

621 et al. 2012). And the results of both confirmatory and exploratory analyses can then be used to further

622 optimize future experiments.

623 Finally, we believe design optimization will play a role not only in basic research on associative learning, but

624 also in translational research within the nascent field of computational psychiatry. Computational psychiatry

625 strives to link mental disorders with computational models of neural function, or with particular parameter

626 regimes of these models. Employing insights of computational psychiatry in a clinical setting will require

627 the development of behavioral or physiological diagnostic tests, based on procedures like model selection and

628 parameter estimation (Stephan et al. 2016). This presents a problem analogous to the design of experiments

629 (Ahn et al. 2019), and therefore computational optimization may be similarly useful in designing accurate

630 and efficient diagnostic tests in this domain.

631 5 Software and data availability

632 All the code necessary to reproduce the results of the paper is available at https://github.com/fmelinscak/

633 coale-paper. The accompanying Open Science Framework (OSF) project at https://osf.io/5ktaf/ holds the

634 data (i.e., simulation results) analyzed in the paper.

635 6 Acknowledgments

636 The authors would like to thank Karita Ojala, Andreea Sburlea, and Martina Melinščak for valuable discussions

637 and comments on the text.

638 7 References

639 Ahn, Woo-Young, Hairong Gu, Yitong Shen, Nathaniel Haines, Hunter A. Hahn, Julie E. Teater, Jay I.

640 Myung, and Mark A. Pitt. 2019. “Rapid, precise, and reliable phenotyping of delay discounting using a

641 Bayesian learning algorithm.” bioRxiv, June, 567412. https://doi.org/10.1101/567412.

642 Alonso, Eduardo, Esther Mondragón, and Alberto Fernández. 2012. “A Java simulator of Rescorla and Wag-

643 ner’s prediction error model and configural cue extensions.” Computer Methods and Programs in Biomedicine

644 108 (1): 346–55. https://doi.org/10.1016/J.CMPB.2012.02.004.

645 Apgar, Joshua F., David K. Witmer, Forest M. White, and Bruce Tidor. 2010. “Sloppy models, parameter

646 uncertainty, and the role of experimental design.” Molecular BioSystems 6 (10): 1890. https://doi.org/10.

647 1039/b918098b.

648 Bakker, M., C. H. J. Hartgerink, J. M. Wicherts, and H. L. J. van der Maas. 2016. “Researchers’ Intuitions

649 About Power in Psychological Research.” Psychological Science 27 (8): 1069–77. https://doi.org/10.1177/

650 0956797616647519.

651 Balietti, Stefano, Brennan Klein, and Christoph Riedl. 2018. “Fast Model-Selection through Adapting

652 Design of Experiments Maximizing Information Gain.” arXiv Preprint arXiv:1807.07024. http://arxiv.org/

653 abs/1807.07024.

654 Behrens, Timothy E J, Mark W Woolrich, Mark E Walton, and Matthew F S Rushworth. 2007. “Learning

655 the value of information in an uncertain world.” Nature Neuroscience 10 (9): 1214–21. https://doi.org/10.

656 1038/nn1954.

657 Brochu, Eric, Vlad M. Cora, and Nando de Freitas. 2010. “A Tutorial on Bayesian Optimization of Expensive

658 Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning.” arXiv

659 Preprint arXiv:1012.2599. http://arxiv.org/abs/1012.2599.

660 Browning, Michael, Timothy E Behrens, Gerhard Jocham, Jill X O’Reilly, and Sonia J Bishop. 2015. “Anx-

661 ious individuals have difficulty learning the causal statistics of aversive environments.” Nature Neuroscience

662 18 (4): 590–96. https://doi.org/10.1038/nn.3961.

663 Brutti, Pierpaolo, Fulvio De Santis, and Stefania Gubbiotti. 2014. “Bayesian-frequentist sample size deter-

664 mination: a game of two priors.” METRON 72 (2): 133–51. https://doi.org/10.1007/s40300-014-0043-2.

665 Cavagnaro, Daniel R., Jay I. Myung, Mark A. Pitt, and Janne V. Kujala. 2010. “Adaptive Design Opti-

666 mization: A Mutual Information-Based Approach to Model Discrimination in Cognitive Science.” Neural

667 Computation 22 (4): 887–905. https://doi.org/10.1162/neco.2009.02-09-959.

668 Chaloner, Kathryn, and Isabella Verdinelli. 1995. “Bayesian experimental design: A review.” Statistical

669 Science 10 (3): 273–304. http://www.jstor.org/stable/2246015.

670 Daunizeau, Jean, Kerstin Preuschoff, Karl Friston, and Klaas Stephan. 2011. “Optimizing Experimental

671 Design for Comparing Models of Brain Function.” PLoS Computational Biology 7 (11): e1002280. https:

672 //doi.org/10.1371/journal.pcbi.1002280.

673 Dayan, Peter, and Sham Kakade. 2001. “Explaining away in weight space.” In Advances in Neural Informa-

674 tion Processing Systems 14, 451–57. http://papers.nips.cc/paper/1852-explaining-away-in-weight-space.pdf.

675 Efron, Bradley. 1981. “Nonparametric Standard Errors and Confidence Intervals.” Canadian Journal of

676 Statistics 9 (2): 139–58. https://doi.org/10.2307/3314608.

677 Enquist, Magnus, Johan Lind, and Stefano Ghirlanda. 2016. “The power of associative learning and the

678 ontogeny of optimal behaviour.” Royal Society Open Science 3 (11): 160734. https://doi.org/10.1098/rsos.

679 160734.

680 Gershman, Samuel J. 2018. “’Mfit’: Simple Model-Fitting Tools.” https://github.com/sjgershm/mfit.

681 ———. 2015. “A Unifying Probabilistic View of Associative Learning.” PLOS Computational Biology 11

682 (11): e1004567. https://doi.org/10.1371/journal.pcbi.1004567.

683 Gershman, Samuel J., and Yael Niv. 2012. “Exploring a latent cause theory of classical conditioning.”

684 Learning & Behavior 40 (3): 255–68. https://doi.org/10.3758/s13420-012-0080-8.

685 Humphreys, Lloyd G. 1939. “The effect of random alternation of reinforcement on the acquisition and

686 extinction of conditioned eyelid reactions.” Journal of Experimental Psychology 25 (2): 141–58. https:

687 //doi.org/10.1037/h0058138.

688 Kamin, Leon J. 1968. “”Attention-like” processes in classical conditioning.” In Miami Symposium on the

689 Prediction of Behavior, 1967: Aversive Stimulation, edited by M. R. Jones, 9–31. Coral Gables, Florida:

690 University of Miami Press.

691 Kim, Woojae, Mark A. Pitt, Zhong-Lin Lu, and Jay I. Myung. 2016. “Planning Beyond the Next Trial in

692 Adaptive Experiments: A Dynamic Programming Approach.” Cognitive Science, December. https://doi.org/

693 10.1111/cogs.12467.

694 Kim, Woojae, Mark A. Pitt, Zhong-Lin Lu, Mark Steyvers, and Jay I. Myung. 2014. “A Hierarchical

695 Adaptive Approach to Optimal Experimental Design.” Neural Computation 26 (11): 2465–92. https://doi.

696 org/10.1162/NECO_a_00654.

697 Kruschke, John K. 2008. “Bayesian approaches to associative learning: From passive to active learning.”

698 Learning and Behavior 36 (3): 210–26. https://doi.org/10.3758/LB.36.3.210.

699 Lawson, Raef. 2004. “Small Sample Confidence Intervals for the Odds Ratio.” Communications in Statistics

700 - Simulation and Computation 33 (4): 1095–1113. https://doi.org/10.1081/SAC-200040691.

701 Lewi, Jeremy, Robert Butera, and Liam Paninski. 2009. “Sequential Optimal Design of Neurophysiology

702 Experiments.” Neural Computation 21 (3): 619–87. https://doi.org/10.1162/neco.2008.08-07-594.

703 Li, Jian, Daniela Schiller, Geoffrey Schoenbaum, Elizabeth A Phelps, and Nathaniel D Daw. 2011. “Differ-

704 ential roles of human striatum and amygdala in associative learning.” Nature Neuroscience 14 (10): 1250–2.

705 https://doi.org/10.1038/nn.2904.

706 Liepe, Juliane, Sarah Filippi, Michał Komorowski, and Michael P. H. Stumpf. 2013. “Maximizing the

707 Information Content of Experiments in Systems Biology.” PLoS Computational Biology 9 (1): e1002888.

708 https://doi.org/10.1371/journal.pcbi.1002888.

709 Lindley, Dennis Victor. 1972. Bayesian Statistics, A Review. Philadelphia: SIAM. https://doi.org/10.1137/

710 1.9781611970654.

711 Lorenz, Romy, Ricardo Pio Monti, Inês R. Violante, Christoforos Anagnostopoulos, Aldo A. Faisal, Giovanni

712 Montana, and Robert Leech. 2016. “The Automatic Neuroscientist: A framework for optimizing experi-

713 mental design with closed-loop real-time fMRI.” NeuroImage 129 (April): 320–34. https://doi.org/10.1016/

714 j.neuroimage.2016.01.032.

715 Lubow, R. E., and A. U. Moore. 1959. “Latent inhibition: The effect of nonreinforced pre-exposure to

716 the conditional stimulus.” Journal of Comparative and Physiological Psychology 52 (4): 415–19. https:

717 //doi.org/10.1037/h0046700.

718 Matthews, Alexander G. de G., Mark van der Wilk, Tom Nickson, Keisuke Fujii, Alexis Boukouvalas, Pablo

719 León-Villagrá, Zoubin Ghahramani, and James Hensman. 2017. “GPflow: A Gaussian Process Library using

720 TensorFlow.” Journal of Machine Learning Research 18 (40): 1–6. http://www.jmlr.org/papers/v18/16-537.

721 html.

722 McGraw, Kenneth O., and S. P. Wong. 1992. “A common language effect size statistic.” Psychological

723 Bulletin 111 (2): 361–65. https://doi.org/10.1037/0033-2909.111.2.361.

724 Miller, Ralph R., Robert C. Barnet, and Nicholas J. Grahame. 1995. “Assessment of the Rescorla-Wagner

725 model.” Psychological Bulletin 117 (3): 363–86. https://doi.org/10.1037/0033-2909.117.3.363.

726 Munafò, Marcus R., Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers,

727 Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis.

728 2017. “A manifesto for reproducible science.” Nature Human Behaviour 1 (1): 0021. https://doi.org/10.

729 1038/s41562-016-0021.

730 Myung, Jay I., and Mark A. Pitt. 2009. “Optimal experimental design for model discrimination.” Psycho-

731 logical Review 116 (3): 499–518. https://doi.org/10.1037/a0016104.

732 Nosek, Brian A., Charles R. Ebersole, Alexander C. DeHaven, and David T. Mellor. 2018. “The pre-

733 registration revolution.” Proceedings of the National Academy of Sciences 115 (11): 2600–2606. https:

734 //doi.org/10.1073/pnas.1708274114.

735 Overstall, Antony M., James M. McGree, and Christopher C. Drovandi. 2018. “An approach for finding fully

736 Bayesian optimal designs using normal-based approximations to loss functions.” Statistics and Computing 28

737 (2): 343–58. https://doi.org/10.1007/s11222-017-9734-x.

738 Pavlov, I P. 1927. Conditioned reflexes: an investigation of the physiological activity of the cerebral cortex.

739 Oxford, England: Oxford Univ. Press.

740 Pearce, John M, and Mark E Bouton. 2001. “Theories of Associative Learning in Animals.” Annual Review

741 of Psychology 52 (1): 111–39. https://doi.org/10.1146/annurev.psych.52.1.111.

742 Rafferty, Anna N., Matei Zaharia, and Thomas L. Griffiths. 2014. “Optimally designing games for be-

743 havioural research.” Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences

744 470 (2167): 20130828. https://doi.org/10.1098/rspa.2013.0828.

745 Rescorla, Robert A. 1970. “Reduction in the effectiveness of reinforcement after prior excitatory conditioning.”

746 Learning and Motivation 1 (4): 372–81. https://doi.org/10.1016/0023-9690(70)90101-3.

747 Rescorla, Robert A, and Allan R Wagner. 1972. “A theory of Pavlovian conditioning: Variations in the

748 effectiveness of reinforcement and nonreinforcement.” Classical Conditioning II Current Research and Theory

749 21 (6): 64–99. https://doi.org/10.1101/gr.110528.110.

750 Ryan, Elizabeth G., Christopher C. Drovandi, James M. McGree, and Anthony N. Pettitt. 2016. “A Review

751 of Modern Computational Algorithms for Bayesian Optimal Design.” International Statistical Review 84 (1):

752 128–54. https://doi.org/10.1111/insr.12107.

753 Sanchez, Gaëtan, Jean Daunizeau, Emmanuel Maby, Olivier Bertrand, Aline Bompas, and Jérémie Mat-

754 tout. 2014. “Toward a New Application of Real-Time Electrophysiology: Online Optimization of Cognitive

755 Neurosciences Hypothesis Testing.” Brain Sciences 4 (1): 49–72. https://doi.org/10.3390/brainsci4010049.

756 Schlaifer, R, and H Raiffa. 1961. Applied statistical decision theory. Wiley.

757 Shanks, David R. 1985. “Forward and Backward Blocking in Human Contingency Judgement.” The Quarterly

758 Journal of Experimental Psychology Section B 37 (1b): 1–21. https://doi.org/10.1080/14640748508402082.

759 Silk, Daniel, Paul D. W. Kirk, Chris P. Barnes, Tina Toni, and Michael P. H. Stumpf. 2014. “Model Selection

760 in Systems Biology Depends on Experimental Design.” Edited by Burkhard Rost. PLoS Computational

761 Biology 10 (6): e1003650. https://doi.org/10.1371/journal.pcbi.1003650.

762 Stephan, Klaas E, Dominik R Bach, Paul C Fletcher, Jonathan Flint, Michael J Frank, Karl J Friston,

763 Andreas Heinz, et al. 2016. “Charting the landscape of priority problems in psychiatry, part 1: classification

764 and diagnosis.” The Lancet Psychiatry 3 (1): 77–83. https://doi.org/10.1016/S2215-0366(15)00361-2.

765 Szucs, Denes, and John P. A. Ioannidis. 2017. “Empirical assessment of published effect sizes and power

766 in the recent cognitive neuroscience and psychology literature.” PLOS Biology 15 (3): e2000797. https:

767 //doi.org/10.1371/journal.pbio.2000797.

768 Thorwart, Anna, Holger Schultheis, Stephan König, and Harald Lachnit. 2009. “ALTSim: A MATLAB

769 simulator for current associative learning theories.” Behavior Research Methods 41 (1): 29–34. https://doi.

770 org/10.3758/BRM.41.1.29.

771 Tzovara, Athina, Christoph W. Korn, and Dominik R. Bach. 2018. “Human Pavlovian fear conditioning

772 conforms to probabilistic learning.” PLOS Computational Biology 14 (8): e1006243. https://doi.org/10.

773 1371/journal.pcbi.1006243.

774 Vanlier, J, C A Tiemann, P A J Hilbers, and N A W van Riel. 2012. “A Bayesian approach to targeted

775 experiment design.” Bioinformatics 28 (8): 1136–42. https://doi.org/10.1093/bioinformatics/bts092.

776 Vervliet, Bram, Michelle G. Craske, and Dirk Hermans. 2013. “Fear Extinction and Relapse:

777 State of the Art.” Annual Review of Clinical Psychology 9 (1): 215–48. https://doi.org/10.1146/

778 annurev-clinpsy-050212-185542.

779 Vincent, Benjamin T, and Tom Rainforth. 2017. “The DARC Toolbox: automated, flexible, and efficient

780 delayed and risky choice experiments using Bayesian adaptive design.” PsyArXiv. https://doi.org/10.31234/

781 osf.io/yehjb.

782 Wagenmakers, Eric-Jan, Ruud Wetzels, Denny Borsboom, Han L J van der Maas, and Rogier A Kievit.

783 2012. “An Agenda for Purely Confirmatory Research.” Perspectives on Psychological Science 7 (6): 632–38.

784 https://doi.org/10.1177/1745691612463078.

785 Wang, Fei, and Alan E. Gelfand. 2002. “A simulation-based approach to Bayesian sample size determination

786 for performance under a given model and for separating models.” Statistical Science 17 (2): 193–208. https:

787 //doi.org/10.1214/ss/1030550861.

788 Weaver, Brian P, Brian J Williams, Christine M Anderson-Cook, and David M Higdon. 2016. “Computa-

789 tional Enhancements to Bayesian Design of Experiments Using Gaussian Processes.” Bayesian Analysis 11

790 (1): 191–213. https://doi.org/10.1214/15-BA945.

791 Zhang, Shunan, and Michael D. Lee. 2010. “Optimal experimental design for a class of bandit problems.”

792 Journal of Mathematical Psychology 54 (6): 499–508. https://doi.org/10.1016/J.JMP.2010.08.002.

793 Supporting information legends

794 • S1 Appendix. Formal approach to optimizing associative learning experiments.

1 S1 Appendix: Formal approach to optimizing associative learning

2 experiments

3 Filip Melinscak* 1,2 and Dominik R. Bach1,2,3

1
4 Computational Psychiatry Research; Department of Psychiatry, Psychotherapy, and
5 Psychosomatics; Psychiatric Hospital; University of Zurich
2
6 Neuroscience Center Zurich; University of Zurich
3
7 Wellcome Centre for Human Neuroimaging and Max Planck/UCL Centre for
8 Computational Psychiatry and Ageing Research; University College London

*
9 Correspondence: Filip Melinscak <filip.melinscak@uzh.ch>

10 In this Appendix we first provide a technical treatment of simulation-based Bayesian experimental design.

11 Next, we describe Bayesian optimization in the context of experimental design. Lastly, we give the full

12 specification of the associative learning models used in simulated scenarios described in the main text of the

13 article (Scenarios 1, 2, and 3).

14 1 Bayesian experimental design

15 Bayesian experimental design is the application of Bayesian decision theory to the problem of designing

16 optimal experiments, i.e., finding the optimal values of tunable design variables. The problem is formalized

17 as a maximization of expected design utility EU (η), with respect to design variables η:

\eta^* = \arg\max_{\eta \in H} EU(\eta), \qquad (1)

18 where η ∗ denotes optimal values for the design variables and H denotes the space of possible designs (i.e.,

19 design space). Although we are ultimately interested in the dependence of the experiment utility on the

20 design variables (captured by the utility function U (η)), often it is easier to express the goal of the study by

21 using a more general utility function with additional direct dependence on study outcomes (observed data d

22 and analysis results r), and possibly on the unobserved ground truth generative model m and its parameters

23 θm , yielding U (r, d, θm , m, η). Evaluating a particular design requires calculating expected design utility, i.e.,

24 marginalizing across the uncertain variables in the utility function (here we assume that analysis results r

25 are deterministically dependent on data d):

EU(\eta) = E_{d, \theta_m, m}\!\left[ U(r, d, \theta_m, m, \eta) \right]
         = \int_{D} \sum_{m \in M} \int_{\Theta_m} U(r, d, \theta_m, m, \eta)\, p(d, \theta_m, m \mid \eta)\, d\theta_m\, dd
         = \int_{D} \sum_{m \in M} \int_{\Theta_m} U(r, d, \theta_m, m, \eta)\, p(d \mid \theta_m, m, \eta)\, p_e(\theta_m, m)\, d\theta_m\, dd, \qquad (2)

26 where D is the space of datasets, M is the space of candidate models, Θm is the parameter space of model

27 m, p(d | θm , m, η) is the conditional probability of the dataset, and pe (θm , m) is the joint prior probability

28 of the model and its parameters (which we term the evaluation prior). Here we made the assumption that

29 the evaluation prior is independent of the design variables (i.e., pe (θm , m | η) = pe (θm , m) for all η); this

30 assumption needs to be carefully considered on a case-by-case basis, and might not hold if, for example,

31 different values of design variables result in experiments which engage different cognitive mechanisms.

32 1.1 Evaluating design utility through simulation

33 Unfortunately, the expected design utility EU (η), calculated according to eq. 2, will have no closed-form

34 solution for almost any design problem of practical interest. To tackle this issue, we adopt the simulation-

35 based approach to design evaluation, i.e., Monte Carlo integration (Wang and Gelfand 2002). Each simulation

36 iteration consists of three steps: simulating a dataset, analyzing the simulated dataset, and calculating the

37 utility of the simulated experiment.

38 To simulate a dataset, we first sample the ground truth model m and its parameters θm from the evaluation

39 prior pe (θm , m). Next, using the evaluated values of design variables η and the ground truth, we sample the

40 dataset d from the data probability distribution p(d | θm , m, η). In classical conditioning, experiment realiza-

41 tions (i.e., stimuli and outcomes, together denoted x) are independent of model responses (i.e., conditioned

42 responses, denoted y); hence, we can decompose the dataset simulation into sampling an experiment realiza-

43 tion from the experiment structure pexp (x | η) and sampling the model responses from the model likelihood

44 function pm (y | x, θm ), thus yielding the full simulated dataset d = {x, y}.
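As a concrete illustration of sampling an experiment realization from pexp (x | η), the following minimal sketch (not the paper's implementation) draws trial-wise stimuli and outcomes for a simplified two-stage, single-cue design parameterized by per-stage cue probabilities P(CS) and joint cue-outcome probabilities P(CS, US), the kind of design variables shown in figs. 4.B and 5.B of the main text. Trial counts, variable names, and the assumption that the outcome occurs only together with the cue are illustrative.

    % Illustrative sketch: sample one realization x = {CS, US} of a two-stage,
    % single-cue design from per-stage design variables pCS and pCSUS.
    nTrialsPerStage = [40 40];          % assumed number of trials per stage
    pCS   = [0.8 0.5];                  % P(CS) in stage 1 and stage 2
    pCSUS = [0.6 0.1];                  % P(CS, US) in stage 1 and stage 2
    CS = []; US = [];
    for stage = 1:2
        n  = nTrialsPerStage(stage);
        cs = rand(n, 1) < pCS(stage);                        % cue present on each trial?
        us = cs & (rand(n, 1) < pCSUS(stage) / pCS(stage));  % outcome only with the cue (simplification)
        CS = [CS; cs]; US = [US; us];
    end
    x = struct('CS', CS, 'US', US);     % one experiment realization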

45 We can now perform the planned analysis on the simulated dataset. We conceptualize this with a functional

46 mapping from the data to the results, i.e., r = a(d), where a(·) represents the planned analysis (e.g., model

47 comparison or parameter estimation) and r represents the analysis results (e.g., the selected winning model,

48 the parameter point estimate, or a posterior distribution over parameters and/or models). In scenarios where

49 we consider parameter estimation within a single model, we can obtain the maximum a posteriori (MAP)

50 point estimate of the parameter as the analysis result r, using Bayes’ rule:

r = \hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} \frac{p(y \mid x, \theta)\, p_a(\theta)}{\int p(y \mid x, \theta)\, p_a(\theta)\, d\theta}, \qquad (3)

51 where pa (θ) is the analysis prior, which, importantly, does not need to coincide with the evaluation prior

52 pe (θ). Since the evaluation prior describes the experimenter’s state of knowledge at the planning stage, and the

53 analysis prior needs to be acceptable even to the skeptical consumer of the analysis results, the evaluation

54 prior will usually be more informative than the analysis prior. In scenarios presented in the paper, we used

55 the uniform prior as a non-informative analysis prior pa (θ) ∝ 1; this makes the MAP parameter estimate

56 equal to the maximum (log)likelihood estimate (MLE):

\hat{\theta}_{\mathrm{MAP}} = \hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \log p(y \mid x, \theta). \qquad (4)
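In practice, this estimate is obtained by numerical optimization of the log-likelihood. A minimal sketch is given below (illustrative only: negLogLik is an assumed function handle returning the negative log-likelihood, and the bounds are placeholders; the scenarios in section 2 instead use the mfit_optimize function from the ‘mfit’ toolbox).

    % Illustrative sketch: MLE under the flat analysis prior, by minimizing the
    % negative log-likelihood within the permissible parameter domain.
    lb = 0; ub = 1;                       % assumed parameter bounds
    theta0 = 0.5 * (lb + ub);             % starting point for the search
    thetaHat = fmincon(negLogLik, theta0, [], [], [], [], lb, ub);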

57 In model selection scenarios, we can similarly obtain the analysis result by employing Bayes’ rule to select

58 the a posteriori most probable model, based on maximizing model evidence:

r = \hat{m} = \arg\max_{m} p(m \mid y, x)
            = \arg\max_{m} p(y \mid x, m)
            = \arg\max_{m} \int p_m(y \mid x, \theta_m)\, p_a(\theta_m \mid m)\, d\theta_m, \qquad (5)

59 where we assumed a uniform analysis prior over the models, i.e., pa (m) ∝ 1. Again, the integral necessary to

60 calculate model evidence will generally not be tractable, but can be approximated in various ways (Penny et

61 al. 2004).
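One widely used approximation - consistent with the BIC-based model selection adopted in the scenarios below (eq. 19) - replaces the log evidence by its BIC approximation,

    \log p(y \mid x, m) \approx \log p_m(y \mid x, \hat{\theta}_m) - \frac{k_m}{2} \log n = -\tfrac{1}{2}\, \mathrm{BIC}_m,

where \hat{\theta}_m is the maximum likelihood estimate, k_m is the number of free parameters of model m, and n is the number of data points; selecting the model with the highest approximate evidence is then equivalent to selecting the model with the lowest BIC.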

62 Having obtained the analysis result from the simulated dataset, we can now calculate the experiment utility

63 U (r, d, θm , m, η) of the current simulation iteration. In general, the utility function is dependent on the goal

64 of the study and therefore its form is up to the user; here we provide two utility functions as examples - one

65 for parameter estimation scenarios and one for model selection scenarios. In scenarios featuring parameter

66 estimation within a single model, we decided to minimize the absolute parameter estimation error (i.e.,

67 maximizing negative absolute error), due to its robustness to outliers:

U(r = \hat{\theta}, \theta) = -\lvert \theta - \hat{\theta} \rvert. \qquad (6)

68 Similarly, in scenarios featuring model selection, we decided to maximize model selection accuracy:



U(r = \hat{m}, m) = \begin{cases} 1 & \text{if } \hat{m} = m, \\ 0 & \text{otherwise.} \end{cases} \qquad (7)

69 Thus far, we have described calculating the experiment utility for a single simulation. To obtain the Monte

70 Carlo approximation of the expected design utility EU (η) (eq. 2), we run NS simulations and average

71 experiment utilities across them:

EU(\eta) \approx \frac{1}{N_S} \sum_{i=1}^{N_S} U(r^{(i)}, d^{(i)}, \theta_m^{(i)}, m^{(i)}, \eta). \qquad (8)
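In code, one evaluation of a candidate design can therefore be organized as in the following minimal sketch, where sample_ground_truth, simulate_experiment, analyze, and utility_fun are placeholder function handles standing for the user-supplied prior, experiment structure with model likelihood, analysis procedure, and utility function, respectively.

    % Illustrative sketch of the Monte Carlo estimate in eq. 8.
    function EU = estimate_expected_utility(eta, nSim, ...
            sample_ground_truth, simulate_experiment, analyze, utility_fun)
        u = zeros(nSim, 1);
        for i = 1:nSim
            [m, theta] = sample_ground_truth();          % draw model and parameters from the prior
            d = simulate_experiment(eta, m, theta);      % simulate dataset d = {x, y} under design eta
            r = analyze(d);                              % planned analysis (estimation or selection)
            u(i) = utility_fun(r, d, theta, m, eta);     % utility of this simulated experiment
        end
        EU = mean(u);                                    % Monte Carlo approximation of EU(eta)
    end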

72 1.2 Maximizing design utility through Bayesian optimization

73 Using the Monte Carlo approximation to the expected design utility (eq. 8), we can now search the design

74 space H for the optimal design η ∗ that maximizes expected utility EU (η) (eq. 1). We refer to the evaluation

75 prior pe (θm , m) used during optimization as the design prior pd (θm , m). In practice, the design prior and

76 the evaluation prior will generally coincide, since we want to use the best existing information to design

77 the experiment and to evaluate it. However, the distinction between the design prior and the evaluation

78 prior allows us to investigate the effect of varying levels of prior knowledge included in the optimization (i.e.,

79 different design priors), while evaluating the obtained designs using the same evaluation prior (which here

80 represents the ground truth and will usually be a point prior).

81 To find the values of design variables that maximize expected design utility, we employ the Bayesian opti-

82 mization (BO) algorithm (Brochu, Cora, and Freitas 2010), as implemented by the bayesopt function in the

83 ‘Statistics and Machine Learning Toolbox’ in MATLAB R2018a (version 9.4.0.813654). Since optimization

84 algorithms are usually formulated in terms of minimizing an objective function g(x), we equate the objective

85 function with negative expected design utility, i.e., g(x) = −EU (η). BO is a global optimization algorithm

86 which minimizes a deterministic or stochastic objective function g(x), with respect to x within a bounded

87 domain. BO is especially appropriate for optimization problems where the objective function is expensive to

88 evaluate and where the gradients of the objective function are not available; both of these conditions are true

89 for simulation-based experimental design. Although the BO algorithm has polynomial time complexity, for

90 expensive objective functions and for typical numbers of iterations, the running time will be dominated by

91 the evaluations of the objective function, thus justifying the use of a relatively costly search algorithm.

92 In order to efficiently search for the optimal x, BO approximates the true objective function g(x) using

93 a probabilistic model f (x), called the surrogate function (or response surface). A common choice for the

94 surrogate function (also featured in the MATLAB BO implementation) is a Gaussian process:

f(x) \sim \mathcal{GP}\!\left( \mu(x), k(x, x'; \theta) \right), \qquad (9)

95 where µ(x) is the mean function and k(x, x′ ; θ) is the covariance kernel function with hyperparameters θ. BO

96 implementation in MATLAB uses the ARD Matern 5/2 form of the kernel function. In order to choose the

97 next solution x to evaluate, the surrogate function needs to be supplemented with an acquisition function

98 u(x), which governs the exploration-exploitation trade-off. The acquisition function represents the utility of

99 evaluating a solution x, based on the current posterior probability distribution over the objective function,

100 Q(f ). Out of various acquisition functions that are available in the MATLAB implementation of BO, we have

101 chosen the ‘expected improvement plus’ option. The expected improvement acquisition functions evaluate the

102 expected decrease in the objective function, ignoring values that cause an increase in the objective. Expected

103 improvement (EI) is given by:

EI(x, Q) = E_Q\!\left[ \max\!\left( 0, \mu_Q(x_{\mathrm{best}}) - f(x) \right) \right], \qquad (10)

104 where xbest is the solution with the lowest posterior mean value µQ (xbest ). The EI-plus version of the

105 acquisition function dynamically modifies the kernel function to prevent over-exploitation of a particular area

106 of the solution space. This modification is governed by the ‘exploration ratio’ option, which was set to the

107 default value of 0.5.

108 Having specified the surrogate and acquisition functions, the BO algorithm proceeds as follows. The algorithm

109 is first initialized by evaluating yi = g(xi ) for a fixed number of solutions xi sampled randomly within

110 variable bounds (i indexes objective function evaluations); we used 10 solutions in initialization. Hereafter,

111 the algorithm repeats the following steps in each iteration t:

112 1. Update the surrogate function using Bayes’ rule, i.e., calculate the posterior over functions, condi-

113 tioned on the objective values observed thus far (Q(f | xi , yi , for i = 1, . . . , t)).

114 2. Find the new solution xt+1 that maximizes the acquisition function u(x) and evaluate the objective

115 function for this solution to obtain yt+1 = g(xt+1 ).

116 As the stopping criterion for the algorithm, we used a fixed number of iterations (300), regardless of the

117 elapsed time. The BO algorithm returns both the solution at which the lowest objective function value was

118 observed and the solution at which the final surrogate function has the lowest mean. As the optimized

119 design, we chose the solution based on the surrogate function.
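As a minimal sketch of this setup (the design variables, their bounds, and the utility estimator are illustrative assumptions; estimate_expected_utility refers to the sketch given after eq. 8, and the constraint P(CS, US) ≤ P(CS) is not enforced here), the optimization can be expressed with bayesopt as follows.

    % Illustrative sketch: minimize the negative expected design utility with
    % Bayesian optimization, using the settings described above.
    vars = [optimizableVariable('pCS1',   [0 1])
            optimizableVariable('pCSUS1', [0 1])
            optimizableVariable('pCS2',   [0 1])
            optimizableVariable('pCSUS2', [0 1])];
    objfun = @(tbl) -estimate_expected_utility( ...
        [tbl.pCS1 tbl.pCSUS1 tbl.pCS2 tbl.pCSUS2], 1000, ...
        sample_ground_truth, simulate_experiment, analyze, utility_fun);
    results = bayesopt(objfun, vars, ...
        'AcquisitionFunctionName', 'expected-improvement-plus', ...
        'ExplorationRatio', 0.5, ...
        'NumSeedPoints', 10, ...
        'MaxObjectiveEvaluations', 300, ...
        'IsObjectiveDeterministic', false);
    etaOpt = results.XAtMinEstimatedObjective;   % design chosen from the final surrogate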

120 2 Associative learning models

121 2.1 Scenario 1

122 In Scenario 1 we consider the case of estimating the learning rate parameter of the Rescorla-Wagner (RW)

123 model (Rescorla and Wagner 1972). For this model, the update rule for the associative weights between the

124 stimuli s and outcome o is:

w_{t+1} = w_t + \alpha \delta_t s_t, \qquad (11)

125 where wt is the vector of associative weights at the end of trial t, α is the learning rate, δt is the prediction

126 error, and st is the vector of binary indicator variables for the stimuli (i.e., conditioned stimuli, CSs). The

127 prediction error is given by:

\delta_t = o_t - w_t^{\top} s_t, \qquad (12)

128 where ot is the binary indicator variable for the outcome (i.e., unconditioned stimulus, US). The mapping

129 from the associative weights wt , to the model responses yt (i.e., conditioned responses, CRs) is assumed to

130 be linear with added Gaussian noise:


y_t = \beta_0 + \beta_1 s_t^{\top} w_t + \epsilon, \qquad \epsilon \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_y^2). \qquad (13)

131 The parameters of the model are therefore α, w0 , β0 , β1 , and σy . We make the simplifying assumption that

132 all initial values of associative weights are the same (w0 = w0 1). In all the simulations, parameters other

133 than α were kept constant (w0 = 0 , β0 = 0, β1 = 1, σy = 0.2). The parameters were estimated using

134 the maximum likelihood approach, which is equivalent to the maximum a posteriori approach with uniform

135 priors on the whole domain of permissible values for the parameter (eq. 4). To perform the estimation we

136 used the mfit_optimize function from the ‘mfit’ toolbox (Gershman 2018).
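For concreteness, a minimal sketch (not the code used in the paper) of simulating RW model responses for one experiment realization is given below; CS is an nTrials-by-nCues binary matrix, US is an nTrials-by-1 binary vector, and computing the response on trial t from the weights before that trial's update is an assumption of this sketch.

    % Illustrative sketch: simulate RW responses according to eqs. 11-13.
    function y = simulate_rw(CS, US, alpha, w0, beta0, beta1, sigma_y)
        [nTrials, nCues] = size(CS);
        w = w0 * ones(nCues, 1);                                 % initial associative weights
        y = zeros(nTrials, 1);
        for t = 1:nTrials
            s = CS(t, :)';                                       % cue indicators on trial t
            y(t) = beta0 + beta1 * (s' * w) + sigma_y * randn;   % response mapping (eq. 13)
            delta = US(t) - s' * w;                              % prediction error (eq. 12)
            w = w + alpha * delta * s;                           % RW update (eq. 11)
        end
    end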

137 Three evaluation priors, with different values for α were used: low alpha (LA, α = 0.1), middle alpha (MA,

138 α = 0.2), and high alpha (HA, α = 0.3). The designs were optimized using either a point prior (PA-OPT

139 designs) or a vague prior (VA-OPT designs) on alpha. The PA-OPT design priors coincided with evaluation

140 priors, and the VA-OPT design prior was a uniform distribution on the [0, 1] interval. The designs were

141 optimized for accuracy in estimating the learning rate α, with the utility function being specified as the

142 negative absolute estimation error (eq. 6).

143 2.2 Scenario 2

144 Scenario 2 involves model comparison between the RW model and its probabilistic extension, the Kalman RW

145 (KRW) model (Dayan and Kakade 2001; Kruschke 2008; Gershman 2015). The RW model is implemented the

146 same as in Scenario 1 (equations 11–13). The KRW model maintains a probabilistic representation of associative

147 weights using a multivariate Gaussian distribution, with mean vector wt and covariance matrix Σt . These

148 state variables are updated using the Kalman filter equations:

w_{t+1} = w_t + k_t \delta_t, \qquad (14)

\Sigma_{t+1} = \Sigma_t + \tau^2 I - k_t s_t^{\top} (\Sigma_t + \tau^2 I), \qquad (15)

150 where τ 2 is the variance that governs the diffusion of weights over time, and kt is the Kalman gain, which

151 acts as a stimulus-specific learning rate. Kalman gain is given by:

k_t = \frac{(\Sigma_t + \tau^2 I)\, s_t}{s_t^{\top} (\Sigma_t + \tau^2 I)\, s_t + \sigma_o^2}, \qquad (16)

152 where σo2 is the variance of the outcome noise distribution. The model responses yt for KRW are obtained

153 as a linear mapping of associative weights wt (eq. 13). Hence, the parameters of the model are w0 , Σ0 , β0 ,
154 β1 , τ² , σo , and σy . We simplify the parameterization by assuming w0 = w0 1 and Σ0 = σw² I.
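A minimal sketch of a single KRW update, following eqs. 14-16, is given below (illustrative only; w is the mean weight vector, Sigma its covariance, s the binary cue vector, o the outcome, tau2 the diffusion variance, and sigma_o2 the outcome noise variance).

    % Illustrative sketch: one Kalman filter update of the KRW model.
    function [w, Sigma] = krw_update(w, Sigma, s, o, tau2, sigma_o2)
        P = Sigma + tau2 * eye(length(w));       % diffused weight covariance
        k = (P * s) / (s' * P * s + sigma_o2);   % Kalman gain (eq. 16)
        delta = o - w' * s;                      % prediction error
        w = w + k * delta;                       % mean update (eq. 14)
        Sigma = P - k * (s' * P);                % covariance update (eq. 15)
    end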

155 The evaluation prior was a uniform prior over RW and KRW models, combined with a point prior on

156 model parameters, with the parameter values taken from the simulation study of Kruschke (2008). For RW,
157 parameter values were w0 = 0 and α = 0.08. For KRW, we set the parameters at w0 = 0, log σw² = 0 (σw² = 1),

158 log τ 2 = −20 (τ 2 ≈ 0), log σo2 = 1.3863 (σo2 ≈ 4). For both models, the parameters of the model response

159 mapping were set at β0 = 0, β1 = 1, and σy = 0.2.

160 The design prior was either a precise point prior coinciding with the evaluation prior (PP-OPT), or a vague

161 prior (VP-OPT). The vague prior for the RW model was implemented as:

\alpha \sim U(0, 1), \qquad \sigma_y \sim U(0.05, 0.5). \qquad (17)

162 For the KRW model, the vague prior was:

\log \sigma_w^2 \sim \mathcal{N}(0, 2), \qquad \log \sigma_o^2 \sim \mathcal{N}(0, 2), \qquad \sigma_y \sim U(0.05, 0.5). \qquad (18)

163 The rest of the parameters of both models were fixed to the same values as in the evaluation prior.

164 Parameter estimates in Scenario 2 were obtained using a maximum likelihood approach, as in Scenario 1.

165 Models were compared using the Bayesian Information Criterion (BIC; Schwarz 1978):

\mathrm{BIC} = -2 \log(\hat{L}) + \log(n)\, k, \qquad (19)

166 where L̂ is the value of the likelihood function for the maximum likelihood parameter estimates, n is the

167 number of data points, and k is the number of estimated parameters. The model with the lowest BIC was

168 selected as the winning model. When conducting design optimization, the proportion of cases in which the winning

169 model was also the ground truth model - i.e., model selection accuracy - was transformed to the log-odds scale

170 in order to better conform with the normality assumption we used in optimization (we assumed a Gaussian

171 process as a surrogate model of the utility function in Bayesian optimization).
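A minimal sketch of this selection step and of the log-odds transform is given below (illustrative variable names; the negative log-likelihoods, parameter counts, number of trials, and the logical vector of correct selections are assumed to come from the fitting and simulation steps).

    % Illustrative sketch: BIC comparison (eq. 19) and the log-odds transform.
    bic = @(negLogL, n, k) 2 * negLogL + k * log(n);   % with log(L_hat) = -negLogL
    BIC_rw  = bic(negLogL_rw,  nTrials, k_rw);
    BIC_krw = bic(negLogL_krw, nTrials, k_krw);
    [~, winner] = min([BIC_rw, BIC_krw]);              % lowest BIC wins (1 = RW, 2 = KRW)
    accuracy = mean(correct);                          % selection accuracy across simulations
    accuracy = min(max(accuracy, eps), 1 - eps);       % guard against 0/1 before the logit
    logOdds  = log(accuracy / (1 - accuracy));         % value modeled by the GP surrogate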

172 2.3 Scenario 3

173 Scenario 3 again features model comparison between the RW model and its variant - namely, the Rescorla-

174 Wagner-Pearce-Hall (RWPH) hybrid model (Li et al. 2011). Whereas the RW model assumes a fixed learning

175 rate, the RWPH model assumes a stimulus-specific learning rate that is modulated over time depending on

176 observed prediction errors. In particular, the RW weight updating is supplemented with the Pearce-Hall rule

177 for updating the learning rates (i.e., associabilities):




\alpha_{t+1}^{(s)} = \begin{cases} \eta \lvert \delta_t \rvert + (1 - \eta)\, \alpha_t^{(s)} & \text{if } s_t = 1, \\ \alpha_t^{(s)} & \text{if } s_t = 0, \end{cases} \qquad (20)

178 where α_t^{(s)} is the associability of stimulus s at trial t and η is the forgetting factor. The corresponding weight

179 update equation is:


w_{t+1}^{(s)} = w_t^{(s)} + \kappa\, \alpha_t^{(s)} \delta_t s_t, \qquad (21)

180 where κ is the fixed learning rate. Additionally, in Scenario 3 we included variants of the RWPH model

181 that have different mappings between the latent states of the models (i.e., weights and associabilities) and

182 the model responses (i.e., conditioned responses). The most general version of the model, RWPH(V + α),

183 features a linear mapping from a mixture of active weights and associabilities:

y_t = \beta_0 + \beta_1 s_t^{\top} \left( \lambda w_t + (1 - \lambda)\, \alpha_t \right) + \epsilon, \qquad (22)

184 where λ is the mixing factor. The parameters of the model are α0 , w0 , η, β0 , β1 , λ, and σy . Again, we make

185 the simplifying assumption about the initial associabilities and weights (all initial associabilities and weights

186 are equal to α0 and w0 , respectively). The other variations of the RWPH model - RWPH(V ) and RWPH(α)

187 - are obtained by setting the output mixing factor λ to 1 or 0, respectively.
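A minimal sketch of one trial of the RWPH(V + α) model, following eqs. 20-22, is given below (illustrative only; s is the binary cue vector, o the outcome, and computing the response from the pre-update weights and associabilities is an assumption of this sketch).

    % Illustrative sketch: one RWPH hybrid trial (response, weight and associability updates).
    function [w, alpha, y] = rwph_step(w, alpha, s, o, eta, kappa, lambda, beta0, beta1, sigma_y)
        y = beta0 + beta1 * (s' * (lambda * w + (1 - lambda) * alpha)) + sigma_y * randn;  % eq. 22
        delta = o - w' * s;                                            % prediction error
        w = w + kappa * delta * (alpha .* s);                          % eq. 21 (active cues only)
        alpha(s == 1) = eta * abs(delta) + (1 - eta) * alpha(s == 1);  % eq. 20
    end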

188 As in Scenario 2, the evaluation prior was a uniform prior over all compared models (RW, RWPH(V ),

189 RWPH(α), RWPH(V +α)), combined with a point prior on model parameters. The parameters of RWPH(V +

190 α) update equations were set to the maximum likelihood values obtained in the empirical study of Li et al.

191 (2011); specifically, w0 = 0, α0 = 0.926, η = 0.166, and κ = 0.857. For the other models - which are nested

192 within the RWPH(V +α) model - corresponding subsets of these maximum likelihood values were used. The study

193 of Li et al. (2011) did not specify the maximum likelihood values of output mapping parameters, so they

194 were arbitrarily set at β0 = 0, β1 = 1, and σy = 0.2. The mixing factor λ was set to 0.5 in RWPH(V + α),

195 to 0 in RWPH(α), and to 1 in RWPH(V ) and RW.

196 The design prior - as in Scenario 2 - was either a point prior coinciding with the evaluation prior (PP-OPT),

197 or a vague prior (VP-OPT). The vague prior for the RWPH(V + α) model was:

\alpha_0 \sim U(0, 1), \qquad \eta \sim U(0, 1), \qquad \kappa \sim U(0, 1), \qquad \lambda \sim U(0, 1), \qquad \sigma_y \sim U(0.05, 0.5). \qquad (23)

198 For the other models we used the appropriate subset of priors. The rest of the parameters of all the mod-

199 els were fixed to the same values as in the evaluation prior. Model fitting, model selection, and design

200 optimization were performed in the same manner as in Scenario 2.

201 3 References

202 Brochu, Eric, Vlad M. Cora, and Nando de Freitas. 2010. “A Tutorial on Bayesian Optimization of Expensive

203 Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning.” arXiv

204 Preprint arXiv:1012.2599. http://arxiv.org/abs/1012.2599.

205 Dayan, Peter, and Sham Kakade. 2001. “Explaining away in weight space.” In Advances in Neural Informa-

206 tion Processing Systems 14, 451–57. http://papers.nips.cc/paper/1852-explaining-away-in-weight-space.pdf.

207 Gershman, Samuel J. 2018. “’Mfit’: Simple Model-Fitting Tools.” https://github.com/sjgershm/mfit.

208 ———. 2015. “A Unifying Probabilistic View of Associative Learning.” PLOS Computational Biology 11

209 (11): e1004567. https://doi.org/10.1371/journal.pcbi.1004567.

210 Kruschke, John K. 2008. “Bayesian approaches to associative learning: From passive to active learning.”

211 Learning and Behavior 36 (3): 210–26. https://doi.org/10.3758/LB.36.3.210.

212 Li, Jian, Daniela Schiller, Geoffrey Schoenbaum, Elizabeth A Phelps, and Nathaniel D Daw. 2011. “Differ-

213 ential roles of human striatum and amygdala in associative learning.” Nature Neuroscience 14 (10): 1250–2.

214 https://doi.org/10.1038/nn.2904.

215 Penny, WD, KE Stephan, A Mechelli, and KJ Friston. 2004. “Comparing dynamic causal models.” Neu-

216 roImage 22 (3): 1157–72. https://doi.org/10.1016/j.neuroimage.2004.03.026.

217 Rescorla, Robert A, and Allan R Wagner. 1972. “A theory of Pavlovian conditioning: Variations in the

218 effectiveness of reinforcement and nonreinforcement.” Classical Conditioning II Current Research and Theory

219 21 (6): 64–99. https://doi.org/10.1101/gr.110528.110.

220 Schwarz, Gideon. 1978. “Estimating the Dimension of a Model.” The Annals of Statistics 6 (2): 461–64.

221 https://doi.org/10.1214/aos/1176344136.

222 Wang, Fei, and Alan E. Gelfand. 2002. “A simulation-based approach to Bayesian sample size determination

223 for performance under a given model and for separating models.” Statistical Science 17 (2): 193–208. https:

224 //doi.org/10.1214/ss/1030550861.

