Professional Documents
Culture Documents
2.4 Choice function tion based on a hinge loss. Furthermore, they add two ad-
ditional terms to the loss function: (i) an L2 regularization
Individuals are often confronted with the situation of term; (ii) a multidimensional scaling (MDS) loss to ensure
choosing between several options (alternatives). These al- that options close to each other in the inputs space X will
ternatives can be goods that are going to be purchased, can- also be close in the embedding space Rne . This loss func-
didates in elections, food etc. tion is then used to learn a (deep) multi-layer perceptron to
We model options, that an agent has to choose, as real- represent the embedding.
valued vectors x ∈ Rnx and identify the sets of options as
finite subsets of Rnx . Let Q denote the set of all such finite 2.5 Contributions
subsets of Rnx .
Definition 3. A choice function C is a set-valued operator In this work, we devise a novel multi-objective PBO based
on sets of options. More precisely, it is a map C : Q → Q on choice functions. We follow the interpretation of choice
such that, for any set of options A ∈ Q, the corresponding functions as set function that select non-dominated sets for
value of C is a subset C(A) of A (see for instance [35]). an unknown latent function. First we derive a Bayesian
framework to learn the function from a dataset of observed
The interpretation of choice function is as follows. For choices. This framework is based on a Gaussian Process
a given option set A ∈ Q, the statement that an option prior on the unknown latent function vector.2 We then build
xj ∈ A is rejected from A (that is, xj ∈ / C(A)) means that an acquisition function to select the best next options to
there is at least one option xi ∈ A that an agent strictly evaluate. We compare this method against an oracle that
prefers over xj . The set of rejected options is denoted by knows the true value of the latent functions and we show
R(A) and is equal to A\C(A). Therefore choice functions that, by only working with choice function evaluations, we
represent non-binary choice models, so they are more gen- converge to the same results.
eral than preferences.
It is important to stress again that the statement xj ∈
/ C(A)
implies there is at least one option xi ∈ A that an agent 3 Bayesian learning of Choice functions
strictly prefers over xj . However, the agent is not required
to tell us which option(s) in C(A) they strictly prefer to xj . In this work we consider options x ∈ Rnx and, for x ∈
This makes choice functions a very easy-to-use tool to ex- Rnx , we model each latent function in the vector f (x) =
press choices. As depicted in Figure 2, the agent needs to [f1 (x), . . . , fne (x)]> as an independent GP [2]:
tell us only the options they selected (in green) without pro-
viding any justification for their choices (we do not know fj (x) ∼ GPj (0, kj (x, x0 )), j = 1, 2, . . . , ne . (1)
which option in the green set dominates x4 ).
Each GP is fully specified by its kernel function kj (·, ·),
g2 which specifies the covariance of the latent function be-
tween any two points. In all experiments in this paper, the
x3
GP kernel is Matern 3/2 [2].
x4 x2
x1
x5 3.1 Likelihood for general Choice functions
via C(Ak ) implies that: sistent with the choice function implied relations. Given
Ak , C(Ak ), the likelihood function p(C(Ak ), Ak |f ) is one
¬ min(fd (xi ) − fd (xj )) < 0, ∀i ∈ Ik , ∀j ∈ Jk , when (2)-(3) hold and zero otherwise.
d∈D
(2) In practice not all choices might be coherent and we can
treat this case by considering that the latent vector function
min(fd (xp ) − fd (xi )) < 0, ∀i, p ∈ Ik , p 6= i. (3)
d∈D f (xi ) = [f1 (xi ), . . . , fne (xi )]> is corrupted by a Gaus-
These equation express the conditions in Definition 1. Con- sian noise vi with zero mean vector and covariance σ 2 Ine .3
dition (2) means that, for each option xj ∈ Jk , it’s not true Then we require conditions (2) and (3) to only hold proba-
(¬ stands for logical negation) that all options in Ik are bilistically. This leads to the following likelihood function
worse than xj , i.e. there is at least an option in Ik which for the pair Ak , C(Ak ):
is better than xj . Condition (3) means that, for each op-
tion in Ik , there is no better option in Ik . This requires that
the latent functions values of the options should be con-
Z Y !
Y
p(C(Ak ), Ak |f ) = 1− I(−∞,0) min(fd (xi ) + vdi − fd (xj ) − vdj ) N (vi ; 0, σ 2 Id )dvi N (vj ; 0, σ 2 Id )dvj
d∈D
j∈Jk i∈Ik
Y Z
1 − I[0,∞) min(fd (xp ) + vdp − fd (xi ) − vdi ) N (vp ; 0, σ 2 Id )N (vi ; 0, σ 2 Id )dvp dvi , (4)
d∈D
i,p∈Ik ,p6=i
where vdi denotes the d-th component of the vector vi and argument only depends on vdj . We can thus write the
IB is the indicator function of the set B. We now provide above multidimensional integral as a sum of products of
two results which allows us to simplify (4). We first com- univariate integrals which can be computed efficiently, for
pute the integral in the third product in (4). instance by using a Gaussian quadrature rule.
Lemma 4. Therefore, the likelihood of the choices D =
{(Ak , C(Ak )) : for k = 1, . . . , N } given the latent
Z
I[0,∞) min(fd (xp ) + vdp − fd (xj ) − vdj )
d∈D vector function f can then be written as follows.
N (vp ; 0, σ Id )N (vj ; 0, σ 2 Id )dvp dvj
2
(5) Theorem 6. The likelihood is
Y fd (xp ) − fd (xj )
= Φ √ . N
2σ
Y
d∈D p(D|f ) = p(C(Ak ), Ak |f ) (7)
k=1
All proofs are in the supplementary material. We now fo-
cus on the first integral in (4), which can be simplified as with
follows.
Lemma 5. p(C(Ak ), Ak |f )
Z Y !
fd (xi ) − fd (xj ) − vdj
Y Z Y Y
I(−∞,0) min(fd (xi ) + vdi − fd (xj ) − vdj ) = 1− 1− Φ
i∈Ik
d∈D σ
j∈Jk i∈Ik d∈D
!
N (vi ; 0, σ 2 Id )dvi N (vj ; 0, σ 2 Id )dvj 2
N (vj ; 0, σ Id )dvj
Z Y" Y fd (xi ) − fd (xj ) − vdj
#
= 1− Φ !
σ Y Y fd (xp ) − fd (xj )
i∈Ik d∈D 1− Φ √ .
2σ
N (vj ; 0, σ 2 Id )dvj . i,p∈Ik ,p6=i d∈D
(6) (8)
best option. In this case, the likelihood (8) simplifies to which returns the posterior probability that the agent
Z Y chooses the options C(A∗ ) from the set of options A∗ .
f (xi ) − f (xj ) − vj Given a finite set X (that is A∗ = X ), we can use (12)
p(C(Ak ), Ak |f ) = Φ
σ to compute the probability that a subset of X is the set
j∈Jk
of non-dominated options. This provides an estimate of
N (vj ; 0, σ 2 )dvj .
(9) the Pareto-set X nd based on the learned GP-based latent
The above likelihood is equal to the batch likelihood de- model.
rived in [19, Eq.3] and reduces to the likelihood derived
in [6] when |R(Ak )| = 1 (that is the batch 2 case, i.e., 4 Latent dimension selection
|Ak | = 2). This shows that the likelihood in (8) encom-
passes batch preference-based models. In the previous section, we provided a Bayesian model
for learning a choice function. The model Mne is con-
3.2 Posterior ditional on the pre-defined latent dimension ne (that is,
the dimension of the vector of the latent functions f (x) =
The posterior probability of f is [f1 (x), . . . , fne (x)]> ). Although, it is sometimes reason-
N able to assume the number of criteria defining the choice
p(f ) Y function to be known (and so the dimension ne ), it is crucial
p(f |D) = p(C(Ak ), Ak |f ), (10)
p(D) to develop a statistical method to select ne . We propose a
k=1
forward selection method. We start learning the model M1
where the prior over the component of f is defined in (1), and we increase the dimension ne in a stepwise manner (so
the likelihood is defined
R in (8) and the probability of the ev- learning M2 , M3 , . . . ) until some model selection crite-
idence is p(D) = p(D|f )p(f )df . The posterior p(f |D) is rion is optimised. Criteria like AIC and BIC are inappro-
intractable because it is neither Gaussian nor Skew Gaus- priate for the proposed GP-based choice function model,
sian Process (SkewGP) distributed. In this paper we pro- since its nonparametric nature implies that the number of
pose an approximation schema for the posterior similar to parameters increases also with the size of the data (as ne ×
the one proposed in [12, 14]. In [12], an analytical for- m). We propose to use instead the Pareto Smoothed Im-
mulation of the posterior is available, the marginal likeli- portance sampling Leave-One-Out cross-validation (PSIS-
hood is approximated with a lower bound and inferences LOO) [41]. Exact cross-validation requires re-fitting the
are computed with an efficient rejection-free slice sampler model with different training sets. Instead, PSIS-loo can be
[37]. In [12, 14] such approximation schema showed better computed easily using the samples from the posterior.
performance in active learning and BO tasks than LP and
EP. Here we do not have an analytical formulation for the We define the Bayesian LOO estimate of out-of-sample
posterior therefore we use a Variational (ADVI) approxi- predictive fit for the model in (10):
mation [38] of the posterior to learn the hyperparameters θ N
of the kernel and, then, for fixed hyperparameters, we com- X
ϕ= p(zk |z−k ), (13)
pute the posterior of p(f |D, θ) via elliptical slice sampling
k=1
(ess) [39].4
where zk = (C(Ak ), Ak ), z−k = {(C(Ai ), Ai )}N
i=1,i6=k ,
3.3 Prediction and Inferences Z
Let A = {x∗1 , . . . , x∗m } be a set including m test points
∗ p(zk |z−k ) = p(zk |f )p(f |z−k )df . (14)
and f ∗ = [f (x∗1 ), . . . , f (x∗m )]> . The conditional predictive
∗ As derived in [42], we can evaluate (14) using the samples
distribution p(f |f ) is Gaussian and, therefore,
Z from the full posterior, that is f (s) ∼ p(f |{zk , z−k }) =
p(f |D) = p(f ∗ |f )p(f |D)df
∗
(11) p(f |D) for s = 1, . . . , S.5 We first define the importance
weights:
can be easily approximated as a sum using the samples
(s) 1 p(f (s) |z−k )
from p(f |D). In choice function learning, we are interested wk = ∝
in the inference: p(zk |f (s) ) p(f (s) |{zk , z−k })
Z
and then approximate (14) as:
P (C(A∗ ), A∗ |D) = p(C(A∗ ), A∗ |f ∗ )p(f ∗ |D)df ∗ ,
PS (s)
(12) s=1 wk p(zk |f (s) )
p(zk |z−k ) ≈ PS (s)
. (15)
4
We implemented our model in PyMC3 [40], which provides s=1 wk
implementations of ADVI and ess. Details about number of iter-
5
ations and tuning are reported in the supplementary. We compute these samples using elliptical slice sampling.
Alessio Benavoli, Dario Azzimonti, Dario Piga
It can be noticed that (15) is a function of p(zk |f (s) ) only, agent on a five options set.6
which can easily be computed from the posterior samples.
Unfortunately, a direct use of (15) induces instability be-
6 Numerical experiments
cause the importance weights can have high variance. To
address this issue, [41] applies a simple smoothing proce-
First, we assume ne = no (that is we assume that the la-
dure to the importance weights using a Pareto distribution
tent dimension is known) and evaluate the performance of
(see [41] for details). In Section 6.3, we will show that the
our algorithm on (1) the tasks of learning choice functions;
proposed PSIS-LOO-based forward procedure works well
(2) the use of choice functions in multi-objective BO. Sec-
in practice.
ond, we evaluate the latent dimension selection procedure
discussed in Section 4 on simulated and real datasets.
5 Choice-based Bayesian Optimisation
6.1 Choice functions learning
In the previous sections, we have introduced a GP-based
model to learn latent choice functions from choice data. Toy experiment We develop a simple toy experiment
We will now focus on the acquisition component of as a controlled setting for our initial assessment. We
Bayesian optimization. In choice-based BO, we never consider the bi-dimensional vector function g(x) =
observe the actual values of the functions. The data is [cos(2x), −sin(2x)]> with x ∈ R.
(X , {C(Ak ), Ak }Nk=1 }), where X is the set of the m train-
1.00
ing inputs (options), Ak is a subset of X and C(Ak ) ⊆ Ak 0.75
is the choice-set for the given options Ak . We denote the 0.50
0.25
Pareto-set estimated using the GP-based model as X̂ nd . In g 0.00
g1
g2
0.25
choice-based BO, the objective is to seek a new input point 0.50
x. Since g can only be queried via a choice function, this 0.75
1.00
is obtained by optimizing w.r.t. x an acquisition function 4 2 0 2 4
x
α(x, X̂ nd ), where X̂ nd is the current (estimated) Pareto- We use g to define a choice function. For instance, consider
set. We define the acquisition function α(x, X̂ nd ), with the the set of options Ak = {−1, 0, 2}, given that
aim to find a point that dominates the points in X̂ nd . That
is, given the set of options A∗ = x ∪ X̂ nd , we aim to find g(−1) = [−0.416, −0.909]
x such that C(A∗ ) = {x}. g(0) = [1, 0]
The acquisition function must also consider the trade-off g(2) = [−0.65, 0.75]
between exploration and exploitation. Therefore, we pro-
pose an acquisition function α(x, X̂ nd ) which is equal to we have that C(Ak ) = {0, 2} and R(Ak ) = Ak \C(Ak ) =
the γ% (in the experiments we use γ = 95) Upper Credi- {−1}. In fact, one can notice that [1, 0] dominates
ble Bound (UCB) of p(C(A∗ ), A∗ |f ∗ ) with f ∗ ∼ p(f ∗ |D), [−0.416, −0.909] on both the objectives, and [1, 0] and
A∗ = x ∪ X̂ nd and C(A∗ ) = {x}. [−0.65, 0.75] are incomparable. We sample 200 inputs xi
Note that the requirement for our acquisition function is at random in [−4.5, 4.5] and, using the above approach, we
strong. We could also define α̃(x, X̂ nd ) with different ob- generate
jectives in mind. For example we could seek to find a point
• N = 50 random subsets {Ak }N k=1 of the 200
x which allows to reject at least one option in X̂ nd . We
points each one of size |Ak | = 3 (respectively
opted for UCB over the probability p(C(A∗ ), A∗ |f ∗ ) be-
|Ak | = 5) and computed the corresponding choice
cause it leads to a fast to evaluate acquisition function. In
pairs (C(Ak ), Ak ) based on g;
particular we only need to compute one probability for each
• N = 150 random subsets {Ak }N k=1 each one of size
new function evaluation. In future work we will study al-
|Ak | = 3 (respectively |Ak | = 5) and computed the
ternative approaches and the trade-off between more costly
corresponding choice pairs (C(Ak ), Ak ) based on g;
acquisition function evaluations and faster convergence.
for a total of four different datasets. Fixing the latent di-
After computing the maximum of the the acquisition func-
mension ne = 2, we then compute the posterior means and
tion, denoted with xnew , consistently with the definition
95% credible intervals of the latent functions learned us-
of the acquisition function, we should query the agent to
ing the model introduced in Section 3.2. The four posterior
express their choice among the set of options in A∗ =
plots are shown in Figure 3. By comparing the 1st with
xnew ∪ X̂ nd . However, X̂ nd can be a very large set and
the 3rd plot and the 2nd with the 4th plot, it can be noticed
human cognitive ability cannot efficiently compare more
how the posterior means become more accurate (and the
than five options. Therefore, by using the GP-based latent
model, we select four options in X̂ nd which have the high- 6
Details about the procedure we use to select these 4 options
est probability of being dominated by xnew and query the are reported in the supplementary material.
Alessio Benavoli, Dario Azzimonti, Dario Piga
credible interval smaller) at the increase of the size dataset Choice-GP100 Choice-GP300 Oracle-GP
enb 0.74 0.77 0.77
(from N=50 to N=150 choice-sets). By comparing the 1st jura 0.44 0.47 0.53
with the 2nd plot and the 3rd with the 4th plot, it is evident real-estate 0.50 0.60 0.64
that estimating the latent function becomes more complex slump 0.26 0.39 0.45
at the increase of |Ak |. The reason is not difficult to un-
derstand. Given Ak , R(Ak ) includes the set of rejected As before, it can be noticed that the increase of N , the ac-
options. These are options that are dominated by (at least) curacy of Choice-GP gets closer to that of Oracle-GP.
one of the options in C(Ak ), but we do not know which
one(s). This uncertainty increases with the size of |Ak |, 6.2 Bayesian Optimisation
which makes the estimation problem more difficult.
We have considered for g(x) five standard multi-objective
Considering the same 200 inputs xi generated at random benchmark functions: Branin-Currin (nx = 2, no = 2),
in [−4.5, 4.5], we add Gaussian noise to g with σ = 0.1 ZDT1 (nx = 4, no = 2), ZDT2 (nx = 3, no = 2), DTLZ1
and generate two new training datasets with |Ak | = 3 and (nx = 3, no = 2), Kursawe (nx = 3, no = 2) and Vehicle-
(i) N = 100; (ii) N = 300. We aim to compute the Safety7 (nx = 5, no = 3). These are minimization prob-
predictive accuracy of the GP-based model for inferring lems, which we converted into maximizations so that the
C(Al ) from the set of options Al on additional 300 un- acquisition function in Section 5 is well-defined. We com-
seen pairs {C(Al ), Al }. We denote the model learned us- pare the Choice-GP BO (with ne = no ) approach proposed
ing the N = 100 and N = 300 dataset respectively by in this paper against ParEGO.8 For ParEGO, we assume the
Choice-GP100 and Choice-GP300. We compare their ac- algorithm can query directly g(x) and, therefore, we refer
curacy with that of an independent GP regression model to it as Oracle-ParEGO.9 Conversely, Choice-GP BO can
which has direct access to g: that is we modelled each di- only query g(x) via choice functions. We select |Ak | = 5
mension of g as an independent GP and compute the poste- and use UCB as acquisition function for Choice-GP BO.
rior of the components of g using GP regression from 200 We also consider a quasi-random baseline that selects can-
data pairs (xi , g1 (xi )) and, respectively, (xi , g2 (xi )). We didates from a Sobol sequence denoted as ‘Sobol‘. We
refer to this model as Oracle-GP: “Oracle” because it can evaluate optimization performance on the five benchmark
directly query g. The accuracy is: problems in terms of log-hypervolume difference, which
Choice-GP100 Choice-GP300 Oracle-GP is defined as the difference between the hypervolume of
0.54 0.72 0.77 the true Pareto front10 and the hypervolume of the approx-
averaged over 5 repetitions of the above data generation imate Pareto front based on the observed data X . Each
process. It can be noticed that the increase of N , the accu- experiment starts with 20 initial (randomly selected) input
racy of Choice-GP gets closer to that of Oracle-GP. This points which are used to initialise Oracle-ParEGO. We gen-
confirms the goodness of the learning framework devel- erate 7 pairs {C(Ak ), Ak } of size |Ak | = 5 by randomly
oped in this paper. selecting 7 subsets Ak of these 20 points. These choices
{C(Ak ), Ak }7k=1 are used to initialise Choice-GP BO. A
total budget of 80 iterations are run for both the algorithms.
Real-datasets We now focus on four benchmark datasets Further, each experiment is repeated 15 times with different
for multi-output regression problems. Table 1 displays the initialization. In these experiments we optimize the kernel
characteristics of the considered datasets. hyperparameters by maximising the marginal likelihood
Dataset #Instances #Attributes #Outputs
for Oracle-ParEGO and its variational approximation for
enb 768 6 2 Choice-GP. Figure 4 reports the performance of the three
jura 359 6 2 methods. Focusing on Branin-Currin, DTLZ1, Kursawe,
real-estate 414 5 2
slump 103 7 2 and Vehicle-Safety, it can be noticed how Choice-GP BO
convergences to the performance of the Oracle-ParEGO at
Table 1: Characteristics of the datasets.
the increase of the number of iterations. The convergence is
More details on the used datasets are in the supplementary clearly slower because Choice-GP BO uses qualitative data
material. By using 5-fold cross-validation, we divide the (choice functions) while Oracle-ParEGO uses quantitative
dataset in training and testing pairs. The target values in data (it has directly access to g). However, the overall per-
the training set are used to generate choice functions based formance shows that the proposed approach is very effec-
pairs (C(Ak ), Ak ) with |Ak | = 3 and (i) N = 100; (ii)
7
N = 300. From the test dataset, we generated N = 200 The problem of determining the thickness of five reinforced
components of a vehicle’s frontal frame [43]. This problem was
pairs. As before we denote the model learned using the
previously considered as benchmark in [44].
N = 100 and N = 300 dataset respectively by Choice- 8
We use the BoTorch implementation [45].
GP1 and Choice-GP3 and compare their accuracy against 9
The most recent MO BO approaches mentioned in Section 1
that of Oracle-GP (learned on the training dataset by inde- outperform ParEGO. We use ParEGO only as an Oracle reference.
10
pendent GP regression). The accuracy is: This known for the six benchmarks.
Alessio Benavoli, Dario Azzimonti, Dario Piga
3
2
2 2
2
1
1 1 1
f 0
f f 0 f
0 0
1 1 1 1
E[f1] E[f1] E[f1] E[f1]
2 E[f2] 2 E[f2] 2 E[f2] 2 E[f2]
4 2 0 2 4 4 2 0 2 4 4 2 0 2 4 4 2 0 2 4
x x x x
N = 50, |Ak | = 3 N = 50, |Ak | = 5 N = 150, |Ak | = 3 N = 150, |Ak | = 5
Figure 3: Posterior mean and 95% credible intervals of the two latent functions for the four artificial datasets.
1.00
0.75
1.4 0.75
0.50 0.50
1.2
0.25 0.25
1.0
0.00
0.00
0.8 0.25
Sobol Sobol 0.25 Sobol
Oracle-ParEGO 0.50 Oracle-ParEGO Oracle-ParEGO
0.6 Choice-BO Choice-BO Choice-BO
0.75 0.50
0 10 20 30 40 50 60 70 80 0 10 20 30 40 50 60 70 80 0 10 20 30 40 50 60 70 80
number of iterations number of iterations number of iterations
1.3 1.9
3.2
1.2 1.8
3.0
2.8 1.7
1.1
2.6 1.6
1.0
2.4 1.5
Sobol 0.9 Sobol Sobol
2.2 Oracle-ParEGO Oracle-ParEGO 1.4 Oracle-ParEGO
Choice-BO Choice-BO Choice-BO
2.0
0 10 20 30 40 50 60 70 80 0 10 20 30 40 50 60 70 80 0 10 20 30 40 50 60 70 80
number of iterations number of iterations number of iterations
Figure 4: Results over 15 repetitions. x-axis denotes the number of iterations and y-axis the log-hypervolume difference.
Alessio Benavoli, Dario Azzimonti, Dario Piga
[11] J. González, Z. Dai, A. Damianou, and N. D. [22] A. J. Keane, “Statistical improvement criteria for use
Lawrence, “Preferential bayesian optimization,” in in multiobjective design optimization,” AIAA journal,
Proceedings of the 34th International Conference vol. 44, no. 4, pp. 879–891, 2006.
on Machine Learning-Volume 70, pp. 1282–1291,
JMLR. org, 2017. [23] W. Ponweiser, T. Wagner, D. Biermann, and
M. Vincze, “Multiobjective optimization on a limited
[12] A. Benavoli, D. Azzimonti, and D. Piga, “Preferen- budget of evaluations using model-assisted s-metric
tial Bayesian optimisation with Skew Gaussian Pro- selection,” in International Conference on Parallel
cesses,” in 2021 Genetic and Evolutionary Computa- Problem Solving from Nature, pp. 784–794, Springer,
tion Conference Companion (GECCO ’21 Compan- 2008.
ion), July 10–14, 2021, Lille, France, (New York, NY, [24] M. T. Emmerich, A. H. Deutz, and J. W. Klinkenberg,
USA), ACM, 2021. “Hypervolume-based expected improvement: Mono-
tonicity properties and exact computation,” in 2011
[13] A. Benavoli, D. Azzimonti, and D. Piga, “Skew Gaus- IEEE Congress of Evolutionary Computation (CEC),
sian Processes for Classification,” Machine Learning, pp. 2147–2154, IEEE, 2011.
vol. 109, p. 1877–1902, 2020.
[25] V. Picheny, “Multiobjective optimization using gaus-
[14] A. Benavoli, D. Azzimonti, and D. Piga, “A unified sian process emulators via stepwise uncertainty re-
framework for closed-form nonparametric regression, duction,” Statistics and Computing, vol. 25, no. 6,
classification, preference and mixed problems with pp. 1265–1280, 2015.
skew gaussian processes,” 2021.
[26] D. Hernández-Lobato, J. Hernandez-Lobato, A. Shah,
[15] E. Brochu, N. de Freitas, and A. Ghosh, “Active and R. Adams, “Predictive entropy search for multi-
preference learning with discrete choice data,” in objective bayesian optimization,” in International
Advances in neural information processing systems, Conference on Machine Learning, pp. 1492–1501,
pp. 409–416, 2008. PMLR, 2016.
[27] T. Wada and H. Hino, “Bayesian optimization for
[16] M. Zoghi, S. Whiteson, R. Munos, and M. Rijke,
multi-objective optimization and multi-point search,”
“Relative upper confidence bound for the k-armed du-
arXiv preprint arXiv:1905.02370, 2019.
eling bandit problem,” in Proceedings of the 31st In-
ternational Conference on Machine Learning, (Be- [28] S. Belakaria and A. Deshwal, “Max-value entropy
jing, China), pp. 10–18, 2014. search for multi-objective bayesian optimization,” in
International Conference on Neural Information Pro-
[17] D. Sadigh, A. D. Dragan, S. Sastry, and S. A. Seshia, cessing Systems (NeurIPS), 2019.
“Active preference-based learning of reward func-
tions,” in Robotics: Science and Systems, 2017. [29] E. Zitzler, L. Thiele, M. Laumanns, C. M. Fon-
seca, and V. G. Da Fonseca, “Performance assess-
[18] A. Bemporad and D. Piga, “Global optimization ment of multiobjective optimizers: An analysis and
based on active preference learning with radial ba- review,” IEEE Transactions on evolutionary compu-
sis functions,” Machine Learning, vol. 110, no. 2, tation, vol. 7, no. 2, pp. 117–132, 2003.
pp. 417–448, 2021.
[30] I. Couckuyt, D. Deschrijver, and T. Dhaene, “Fast cal-
[19] E. Siivola, A. K. Dhaka, M. R. Andersen, J. Gon- culation of multiobjective probability of improvement
zalez, P. G. Moreno, and A. Vehtari, “Preferen- and expected improvement criteria for pareto opti-
tial batch bayesian optimization,” arXiv preprint mization,” Journal of Global Optimization, vol. 60,
arXiv:2003.11435, 2020. no. 3, pp. 575–594, 2014.
[31] I. Hupkens, A. Deutz, K. Yang, and M. Emmerich,
[20] J. Knowles, “Parego: A hybrid algorithm with on- “Faster exact algorithms for computing expected hy-
line landscape approximation for expensive multiob- pervolume improvement,” in international conference
jective optimization problems,” IEEE Transactions on on evolutionary multi-criterion optimization, pp. 65–
Evolutionary Computation, vol. 10, no. 1, pp. 50–66, 79, Springer, 2015.
2006.
[32] M. Emmerich, K. Yang, A. Deutz, H. Wang,
[21] B. Paria, K. Kandasamy, and B. Póczos, “A flexible and C. M. Fonseca, “A multicriteria generalization
framework for multi-objective bayesian optimization of bayesian global optimization,” in Advances in
using random scalarizations,” in Uncertainty in Arti- Stochastic and Deterministic Global Optimization,
ficial Intelligence, pp. 766–776, PMLR, 2020. pp. 229–242, Springer, 2016.
Alessio Benavoli, Dario Azzimonti, Dario Piga
[33] K. Yang, M. Emmerich, A. Deutz, and C. M. Fonseca, Conference on Artificial Intelligence and Statistics,
“Computing 3-d expected hypervolume improvement (Chia, Italy), pp. 541–548, PMLR, 13–15 May 2010.
and related integrals in asymptotically optimal time,”
in International Conference on Evolutionary Multi- [40] J. Salvatier, T. V. Wiecki, and C. Fonnesbeck, “Prob-
Criterion Optimization, pp. 685–700, Springer, 2017. abilistic programming in python using pymc3,” PeerJ
[34] K. Yang, M. Emmerich, A. Deutz, and T. Bäck, Computer Science, vol. 2, p. e55, 2016.
“Multi-objective bayesian global optimization us-
ing expected hypervolume improvement gradient,” [41] A. Vehtari, A. Gelman, and J. Gabry, “Practi-
Swarm and evolutionary computation, vol. 44, cal bayesian model evaluation using leave-one-out
pp. 945–956, 2019. cross-validation and waic,” Statistics and computing,
vol. 27, no. 5, pp. 1413–1432, 2017.
[35] F. Aleskerov, D. Bouyssou, and B. Monjardet, Util-
ity maximization, choice and preference, vol. 16.
Springer Science & Business Media, 2007. [42] A. E. Gelfand, D. K. Dey, and H. Chang, “Model de-
termination using predictive distributions with imple-
[36] K. Pfannschmidt and E. Hüllermeier, “Learning mentation via sampling-based methods,” tech. rep.,
choice functions via pareto-embeddings,” in German Stanford Univ CA Dept of Statistics, 1992.
Conference on Artificial Intelligence (Künstliche In-
telligenz), pp. 327–333, Springer, 2020.
[43] R. Yang, N. Wang, C. Tho, J. Bobineau, and B. Wang,
[37] A. Gessner, O. Kanjilal, and P. Hennig, “Inte- “Metamodeling development for vehicle frontal im-
grals over Gaussians under Linear Domain Con- pact simulation,” 2005.
straints,” in Proceedings of the Twenty Third Inter-
national Conference on Artificial Intelligence and [44] S. Daulton, M. Balandat, and E. Bakshy, “Differen-
Statistics (S. Chiappa and R. Calandra, eds.), vol. 108, tiable expected hypervolume improvement for par-
pp. 2764–2774, PMLR, 26–28 Aug 2020. allel multi-objective bayesian optimization,” arXiv
[38] A. Kucukelbir, R. Ranganath, A. Gelman, and D. M. preprint arXiv:2006.05078, 2020.
Blei, “Automatic variational inference in stan,” arXiv [45] M. Balandat, B. Karrer, D. R. Jiang, S. Daulton,
preprint arXiv:1506.03431, 2015. B. Letham, A. G. Wilson, and E. Bakshy, “BoTorch:
A Framework for Efficient Monte-Carlo Bayesian
[39] I. Murray, R. Adams, and D. MacKay, “Elliptical slice Optimization,” in Advances in Neural Information
sampling,” in Proceedings of the 13th International Processing Systems 33, 2020.