TABLE I. Success rate (SR) statistics for each algorithm

                SA      RR      RVNS    SD
  Average SR    0.088   0.086   0.062   0.058
  Pr(SR > 0)    0.438   0.442   0.345   0.326
  Pr(SR > 0.5)  0.041   0.041   0.041   0.039
  Pr(SR = 1)    0.041   0.038   0.022   0.011

TABLE II. Mean cost-time (CT) statistics for each algorithm

                SA      RR      RVNS    SD
  Average CT    2.202   2.085   2.679   2.775
  Pr(CT < 2.5)  0.873   0.962   0.275   0.197
  Pr(CT < 2.0)  0.127   0.162   0.046   0.044
  Pr(CT < 1.0)  0.041   0.041   0.036   0.036
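The table statistics follow directly from the per-instance success rates (an instance's SR is the fraction of its 20 runs that reached cost zero). A minimal sketch, with made-up values rather than the paper's data and a hypothetical helper name `summarise_sr`:

```python
# Sketch: computing Table I-style summary statistics from per-instance
# success rates. The input values here are illustrative only.
def summarise_sr(per_instance_sr):
    n = len(per_instance_sr)
    return {
        "Average SR":   sum(per_instance_sr) / n,
        "Pr(SR > 0)":   sum(1 for s in per_instance_sr if s > 0) / n,
        "Pr(SR > 0.5)": sum(1 for s in per_instance_sr if s > 0.5) / n,
        "Pr(SR = 1)":   sum(1 for s in per_instance_sr if s == 1) / n,
    }

# Hypothetical per-instance SRs for one algorithm:
stats = summarise_sr([0.0, 0.05, 1.0, 0.6])
print(stats["Pr(SR > 0)"])  # 0.75
```

The CT statistics in Table II can be computed the same way, with the comparisons reversed (e.g. counting instances with mean CT below each threshold).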
[Fig. 3 and Fig. 4 appear here. Fig. 3 shows two scatter plots of the instances in the PC1-PC2 plane: panel (a) with a "Solved by" legend (0-4 algorithms), panel (b) with an "Easy"/"Hard" description legend. Fig. 4 shows the distribution of mean CT for SA, RR, RVNS and SD.]

Fig. 3. Visualisation of instances using the 1st and 2nd principal components of the full feature space, showing (a) the number of algorithms able to solve the instance at least once and (b) the classification based on logic solvers.

Fig. 4. Distribution of mean cost-time for each algorithm.
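A projection like the one in Fig. 3 can be obtained with a standard principal component analysis of the instance feature matrix. A sketch of the computation via SVD, assuming a hypothetical feature matrix `X` (the paper's actual instance features are not reproduced here):

```python
import numpy as np

# Sketch: project instances onto the first two principal components.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 12))   # placeholder: 100 instances x 12 features

Xc = X - X.mean(axis=0)          # centre each feature
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs = Xc @ Vt[:2].T              # (PC1, PC2) coordinates per instance

print(pcs.shape)  # (100, 2)
```

Each instance can then be plotted at its `(PC1, PC2)` coordinates and coloured by the number of solving algorithms or by its "easy"/"hard" description.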
A. Exploratory Data Analysis

1) Success Rate (SR): Table I shows that for approximately two-fifths of the problem instances, SA and RR found the solution with cost equal to zero (i.e., solved the instance) at least once in the 20 runs. RVNS and SD found that solution at least once in approximately one-third of the problem instances. 654 of the instances were solved by at least one of the algorithms, while 175 of the instances were solved by all of them. Fig. 3 shows the number of algorithms that were able to solve a given instance at least once. Generally, the instances that three or four algorithms found solvable correspond to data sources which were described as "easy". However, there are also instances, such as those furthest left, which are described as easy but which none of the considered algorithms can solve. This can be explained by the fact that logic-based solvers employ search techniques that are markedly different from those employed by metaheuristic solvers.

2) Mean Cost-Time (CT): The cost-time value for a problem instance solved to optimality (i.e., one that achieved a cost value of zero) is less than one. Table II and Fig. 4 show the probability distribution of the mean cost-time and the average for each algorithm. Instances with CT < 2 are those which were frequently solved by a given algorithm.

B. Predicting Performance

Using the success rate as the performance metric, we considered two absolute thresholds for good performance: SR > 0 and SR > 0.5. We also considered two absolute thresholds for the mean cost-time value: CT < 2 and CT < 2.5. In the feature selection algorithm (SIFTED), the number of features is specified by the user. For each of the thresholds, we ran MATILDA multiple times with a different number of features to evaluate how this affects the selected features and the quality of the trained SVMs. To determine the number of features for each threshold, we considered the average model accuracy, precision and recall from 10-fold cross-validation. We also considered the F1 score, which is the harmonic mean of precision and recall. Based on these metrics, the models with 10 features were selected for all thresholds, except SR > 0.5, for which we used 6 features.

Each subfigure in Fig. 5 visualises performance with respect to a performance metric. The bars are grouped by whether good performance is specified in terms of SR or CT, and by the choice of threshold. The first four bars correspond to the performance of the SVM that predicts whether each algorithm will achieve good performance on a problem instance. The last bar, coloured purple, depicts the performance of the model trained to select the best-performing algorithm out of the four considered, based on the rules outlined in Section III-D. This model is called the Selector.

When SR is used to define good performance, higher thresholds result in increased class imbalance between good (class 1) and bad (class 0) performance. Because of the small proportion of class 1 instances, the classes are easily separable. Consequently, the SVMs trained on SR > 0.5 have near-perfect accuracy and recall. When good performance is defined as SR > 0, the trained SVMs still perform reasonably well, but the error rate is higher.

The choice of threshold is also important when good performance is defined in terms of CT. When the threshold is 2, which is closer to the average CT for SA and RR, the SVMs trained to predict whether SA and RR will achieve lower performance compared to the corresponding SVMs for
[Fig. 5 and Fig. 6 appear here. Fig. 5 comprises four bar charts (Accuracy, F1 Score, Precision, Recall, each in per cent, axis 0-100), with bars grouped by threshold: CT < 2, CT < 2.5, SR > 0 and SR > 0.5. Fig. 6 comprises four instance-space projections onto the Z1-Z2 plane, one per threshold: (a) SR > 0, (b) SR > 0.5, (c) CT < 2 and (d) CT < 2.5; the legends report the number of instances for which each algorithm is selected, e.g. SA (351), None (621), SA (41), RR (28), None (959).]

Fig. 5. Average model evaluation metrics as percentages for SVMs trained to predict good performance and models to select the best algorithm by MATILDA.

Fig. 6. Projection of instance space showing the best algorithm and the number of instances for which an algorithm is selected as best.
instance and predicted that for 658 of the instances, there will be no difference in the performance of the algorithms. Table IV shows that this model has relatively poor accuracy and precision (estimated through 10-fold cross-validation). So while the selections are rather similar to those made by MATILDA, MATILDA also has the benefit of better predicting good performance to support decision making.
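The observation that the two models' selections are "rather similar" can be quantified as a simple per-instance agreement rate. A sketch with hypothetical selection labels, not the paper's data:

```python
# Sketch: fraction of instances on which two selectors agree.
def agreement_rate(selections_a, selections_b):
    matches = sum(a == b for a, b in zip(selections_a, selections_b))
    return matches / len(selections_a)

# Hypothetical per-instance choices from two models:
rate = agreement_rate(["SA", "None", "RR", "None"],
                      ["SA", "None", "SA", "None"])
print(rate)  # 0.75
```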
When training a logistic regression model based on mean