Norbert Jankowski
1 Introduction
To talk about complexity and its connections with meta-learning, we first need to define the learning problem and the learning machine. A learning problem P is represented by a pair: data D ∈ D and a model space M. Finding the best solution for P is usually an NP-hard problem, so we must be satisfied with suboptimal solutions found by learning machines, defined as processes

L : K_L × D → M,    (1)

where K_L is the space of configuration parameters of L.
The task of finding the best possible model for fixed data D can thus be replaced by the task of finding the best possible learning machine L.

Note that learning machines may be simple or complex. A complex learning machine is composed of several other learning machines:

L = L_1 ∘ L_2 ∘ . . . ∘ L_k.    (2)

Examples include the composition of a data transformation machine with a classifier machine, or a committee of classifiers composed of several machines. Additionally, it is not necessary to distinguish between learning and non-learning machines: test procedures that estimate measures of learning machines (accuracy, adaptation, etc.) can also be seen as machines. Time and space complexity may then apply to a more sophisticated structure consisting of learning machines together with testing procedures.
A meta-learning algorithm (MLA) is just another type of learning algorithm (with different data and model spaces). The most characteristic feature of meta-learning is that it learns how to learn. Typically, meta-learning uses conclusions drawn from other learning tasks formulated and evaluated during the meta-learning run. This is why meta-learning algorithms repeat the following actions in a loop: formulation and starting of test tasks, then collection of results and meta-knowledge. For a review of different meta-learning approaches, see [1].
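The loop described above can be sketched as follows. This is a minimal illustration only; all names (`candidate_configs`, `evaluate`, the budget handling) are assumptions for this sketch, not the paper's actual API:

```python
import time

def meta_learning_loop(candidate_configs, evaluate, time_limit):
    """Sketch of the meta-learning loop: repeatedly formulate and run
    test tasks, collecting results (meta-knowledge) until the time
    budget is exhausted."""
    meta_knowledge = []               # collected (config, score) pairs
    deadline = time.monotonic() + time_limit
    for config in candidate_configs:  # test tasks, in some chosen order
        if time.monotonic() >= deadline:
            break                     # respect the global time budget
        score = evaluate(config)      # run the test task for this machine
        meta_knowledge.append((config, score))
    # return the best configuration found plus everything learned on the way
    best = max(meta_knowledge, key=lambda pair: pair[1])
    return best, meta_knowledge
```

The interesting questions, addressed below, are how to order `candidate_configs` and how to bound each test task's cost.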
The goal of meta-learning may be defined as: maximizing the probability of finding the best possible solution to a given problem P within a given search space, in as short a time as possible.
Consequently, a meta-learning algorithm should carefully order the testing tasks during the progress of the search, based on information about complexity (as will be shown), and build meta-knowledge from the experience of past tests. It is essential to avoid test tasks that would consume too much time, since that would decrease the probability of finding the most interesting solutions. This is why the complexity approach is presented in the remainder of the text.
Complexity measures by Solomonoff, Kolmogorov and Levin. To obtain the right order
in the searching queue of learning machines, a complexity measure should be used.
Solomonoff [2,3,4] and Kolmogorov [5] (see also [6]) defined complexity by:

C_K(P) = min_p { l_p : program p prints P and halts },    (3)

while Levin added a time term:

C_L(P) = min_p { c_L(p) : program p prints P and then halts within time t_p },    (4)

where

c_L(p) = l_p + log t_p.    (5)
In the case of Levin's definition (Eq. 4) it is possible to implement [7] the Levin Universal Search (LUS) [8,6], but the problem is that this algorithm is NP-hard. This means that, in practice, it is impossible to find an exact solution to the optimization problem.
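As a small numerical illustration of Eq. 5 (assuming base-2 logarithms and measuring program length in bits):

```python
import math

def levin_cost(length_bits, runtime_steps):
    # c_L(p) = l_p + log2(t_p), per Eq. 5 (base-2 log is an assumption here)
    return length_bits + math.log2(runtime_steps)

# A program running 1024 times longer gains only +10 in Levin cost:
fast = levin_cost(100, 2**10)   # 100 + 10 = 110.0
slow = levin_cost(100, 2**20)   # 100 + 20 = 120.0
assert slow - fast == 10.0
```

This mild penalty on runtime is exactly the property criticized later in the text.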
Generators flow defines the search space. The strategy of meta-learning differs from that of LUS. LUS searches an infinite space of all possible programs. Meta-learning uses a functional definition of the search space, which is not infinite; the search space is, indeed, strongly limited. The generators flow is assumed to generate machine configurations which are rational from the point of view of the given problem P.

A generators flow is a directed graph of machine generators. Thanks to the graph connections, the definition of an appropriate graph is natural; however, a deep description of this topic would certainly exceed the page limit. The generators flow is defined individually for each learning problem P, although similar configurations of the generators flow may be shared, for example, by classifiers. A generators flow should reflect all accessible knowledge about problem P and provide several configurations of machines whose performance should be tested; this increases the probability of finding a better solution. Thanks to such a definition of the generators flow, the search space is finite, although it may consist of many configurations, usually so many as to make it impossible to validate all of them. To validate appropriate configurations, the complexity measure will be used to order the test tasks based on machine configurations.
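A toy sketch of how a generators flow can enumerate a finite set of configurations, loosely mirroring the patterns that appear in Table 1. The generator structure and all names here are illustrative assumptions, not the system's actual implementation:

```python
from itertools import product

# A few base classifier configurations and ranking methods (cf. Table 1):
classifiers = ["kNN (Euclidean)", "NBC", "SVMClassifier [KernelProvider]"]
rankings = ["RankingCC", "RankingFScore"]

def generators_flow():
    # plain classifiers
    for clf in classifiers:
        yield clf
    # feature-selection variants: ranking -> FeatureSelection -> classifier
    for rank, clf in product(rankings, classifiers):
        yield f"[[[{rank}], FeatureSelection], [{clf}], TransformAndClassify]"

configs = list(generators_flow())
# the search space is finite: 3 + 2*3 = 9 configurations in this toy flow
assert len(configs) == 9
```

In the real system further generators (e.g. a ParamSearch wrapper) multiply this set, but it remains finite.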
Complexity in the context of machines. Complexity cannot be computed from the learning machines themselves; it must be computed from configurations of learning machines and descriptions of their inputs, because meta-learning has to know the complexity before the machine is ready for use in order to organize the test order. In most cases, there is no direct analytical way of computing a machine's complexity on the basis of its configuration. The halting problem directly implies the impossibility of calculating the time and memory usage of a learning process in general. Therefore, approximators of complexity have to be constructed.
Assume that we have restricted the learning problem P with a time limit t, and a meta-learning algorithm postulates learning machines m_1, m_2, . . . , m_q, with testing times t_1, t_2, . . . , t_q respectively. When not all of the tests can be done within the time limit (∑_i t_i > t), and if we assume that each test is equally promising, then to maximize the expected value of the best test result obtained within the time, we need to run as many tests as possible. Therefore, an optimal order of the tests is an order i_1, i_2, . . . , i_m of nondecreasing testing times:

t_{i_1} ≤ t_{i_2} ≤ . . . ≤ t_{i_m},    (6)

and the choice of the first m shortest tests, such that

m = arg max_{1 ≤ k ≤ q} { k : ∑_{l=1,...,k} t_{i_l} ≤ t },    (7)

is an optimal choice of the respective learning machines. The proof of this property is trivial: thanks to the sorted permutation in Eq. 6, the m from the above equation is highest, and this directly maximizes the probability of finding the best solution under the above assumptions.
This implies that meta-learning has to use time in its definition of complexity:

c_t(p) = t_p.    (8)

This property is very important because it clearly answers the main problem of an MLA: the choice and order of learning machine tests.
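The selection rule of Eqs. 6 and 7 can be sketched directly. This is a minimal illustration of the ordering property, not the actual MLA scheduler:

```python
def order_and_select(test_times, budget):
    """Sketch of Eqs. 6-7: sort candidate tests by nondecreasing time and
    keep the longest prefix whose total time fits within the budget."""
    order = sorted(range(len(test_times)), key=lambda i: test_times[i])
    selected, total = [], 0.0
    for i in order:
        if total + test_times[i] > budget:
            break  # the next-shortest test no longer fits
        total += test_times[i]
        selected.append(i)
    return selected  # indices of the tests to run, shortest first

# five candidate tests, budget 10: the three shortest (1 + 2 + 4 = 7) fit
assert order_and_select([4, 9, 1, 7, 2], 10) == [2, 4, 0]
```

Running the shortest tests first maximizes the number m of completed tests, which under the equal-promise assumption maximizes the chance of hitting the best solution.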
Even if we do not assume that learning machines are equally promising, the complexity measure may be adjusted to maximize the expected value of test results (see Eq. 11).
The optimal-ordering property of the test tasks means that ordering programs (machines) only on the basis of their length (as defined in Eq. 3) is not rational in the context of learning machines (or non-learning machines). The problem with using Levin's additional time term (Eq. 5) in real applications is that it is not rigorous enough in respecting time. For example, a program running 1024 times longer than another one may have only slightly higher complexity (just +10). Note that in the case of Levin's complexity (Eq. 5), one more unit of space is equivalent to doubling the time. This is not acceptable in the practice of computational intelligence.

On the other hand, total rejection of the length l_p is also not recommended, because machines always have to fit in memory, which is limited too.
In learning machines, time complexity typically exceeds memory complexity significantly. This is why the measure of complexity should combine time and memory together:

c_a(p) = l_p + t_p / log t_p.    (9)

Although the time part may seem to be respected less than the memory part, this is not really so because, as mentioned above, time complexity almost always significantly exceeds memory complexity. Thus, in fact, the time part dominates anyway, so the order of performing tests is very sensitive to time requirements. What's more, the orderings of machines based on time t_p and based on t_p / log t_p are equivalent. The length component l_p just keeps in mind the limited memory resources, and the whole c_a defines a balance between the two complexities.
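A quick numerical check of Eq. 9 and of the claim that ordering by t_p / log t_p agrees with ordering by t_p (natural logarithms are assumed here; any fixed base gives the same ordering):

```python
import math

def c_a(length, t):
    # Eq. 9: combined complexity  c_a(p) = l_p + t_p / log(t_p)
    return length + t / math.log(t)

# t / log t is strictly increasing for t > e, so ordering tasks by the
# time part alone agrees with ordering them by raw runtime:
times = [10, 100, 1000, 10**6]
parts = [t / math.log(t) for t in times]
assert parts == sorted(parts)
```

The monotonicity of t / log t (for t > e) is why dividing by log t softens, but never reverses, the time-based ordering.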
What machines are we interested in the complexity of? The generators flow provides machine configurations which, one by one, are subsequently nested inside separate copies of a test template. The test template is used to define the goal of the problem P. It may be defined in different ways depending on the problem P; for example, it may calculate an average error. However, the common denominator of all testing procedures is that they consist of learning of the given machine and testing of that machine.

A good definition of the testing procedure should naturally reflect the usage of the tested machine. By the same token, the complexity of such a test procedure with the tested machine nested inside reflects the complete behaviour of the machine really well: one part of the complexity formula reflects the complexity of learning the given machine, and the rest reflects the complexity of computing the test (for example, a classification or approximation test). The costs of learning are very important because without learning there is no model. The complexity of the testing part is also very important, because it reflects the simplicity of further use of the model. Some machines learn quickly but require more effort to make use of their outputs (like kNN classifiers), while others learn for a long time and after that may be exploited very efficiently (like many neural networks). Therefore, the test procedure should be as similar to the whole life cycle of a machine as possible (and, of course, as trustworthy as possible).
D_L : K_L × M^+ → R^2 × M^+,    (12)

where the domain is composed of the configuration space K_L and the space of meta-inputs M^+, and the results are the time complexity, the memory complexity and appropriate meta-outputs.
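A hypothetical rendering of an evaluator with the shape of Eq. 12. All names and the cost model below are assumptions for illustration, not the paper's actual interface:

```python
from dataclasses import dataclass

@dataclass
class MetaInputs:
    # a toy meta-description of the machine's inputs (e.g. a dataset)
    n_instances: int
    n_features: int

def knn_evaluator(config: dict, meta: MetaInputs):
    """Maps (configuration, meta-inputs) to (time complexity,
    memory complexity, meta-outputs), in the spirit of Eq. 12."""
    # assumed cost model: kNN test cost grows with n_instances * n_features
    time_cx = float(meta.n_instances * meta.n_features)
    memory_cx = float(meta.n_instances * meta.n_features)
    meta_outputs = {"model_size": meta.n_instances}
    return time_cx, memory_cx, meta_outputs
```

The point is only the signature: complexity is predicted from the configuration and meta-level input descriptions, before the machine itself is ever built.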
The problem is not as easy as the form of the function in Eq. 12 suggests. Finding the right function for a given learning machine L may be impossible. This is caused by the unpredictable influence of some configuration elements and of some inputs (meta-inputs) on the machine complexity. Configuration elements are not always as simple as scalar values; in some cases they are represented by functions or by sub-configurations. A similar problem concerns meta-inputs. In many cases, meta-inputs cannot be represented by a simple chain of scalar values. Often, meta-inputs need their own complexity determination tool to reflect their functional form. For example, a committee of machines, which plays the role of a classifier, will use other classifiers (inputs) as slave machines. It means that the committee will use the classifiers' outputs, and the complexity of using the outputs depends on the outputs, not on the committee itself. This shows that sometimes the behaviour of meta-inputs/outputs is not trivial, and proper complexity determination requires another encapsulation.
1 Machines which provide necessary outputs.
Meta Evaluators. To enable such a high level of generality, the concept of meta-evaluators has been introduced. The general goal of a meta-evaluator is to enable complexity computation: every learning machine gets its own meta-evaluator.

Because of the recurrent nature of machines (and machine configurations), and because of the nontriviality of inputs (whose roles sometimes have complex functional forms), meta-evaluators are constructed not only for machines, but also for outputs and other elements with a nontrivial influence on machine complexity.
[Flowchart figure: learning the evaluator's approximators. Starting from an initialization {startEnv, envScenario}, a data collection loop generates the next observation configuration o_c and runs machines according to o_c; when no further configuration can be generated, each approximator is trained and the series of approximators for the evaluator is returned.]
3 Example of an application
The generators flow used in my experiments is presented in Fig. 2. Without going into the details of the presented generators flow (because of the page limit), assume that it generates the machines presented in Table 1.
1 kNN (Euclidean)
2 kNN [MetricMachine (EuclideanOUO)]
3 kNN [MetricMachine (Mahalanobis)]
4 NBC
5 SVMClassifier [KernelProvider]
6 LinearSVMClassifier [LinearKernelProvider]
7 [ExpectedClass, kNN [MetricMachine (EuclideanOUO)]]
8 [LVQ, kNN (Euclidean)]
9 Boosting (10x) [NBC]
10 [[[RankingCC], FeatureSelection], [kNN (Euclidean)], TransformAndClassify]
11 [[[RankingCC], FeatureSelection], [kNN [MetricMachine (EuclideanOUO)]], TransformAndClassify]
12 [[[RankingCC], FeatureSelection], [kNN [MetricMachine (Mahalanobis)]], TransformAndClassify]
13 [[[RankingCC], FeatureSelection], [NBC], TransformAndClassify]
14 [[[RankingCC], FeatureSelection], [SVMClassifier [KernelProvider]], TransformAndClassify]
15 [[[RankingCC], FeatureSelection], [LinearSVMClassifier [LinearKernelProvider]], TransformAndClassify]
16 [[[RankingCC], FeatureSelection], [ExpectedClass, kNN [MetricMachine (EuclideanOUO)]], TransformAndClassify]
17 [[[RankingCC], FeatureSelection], [LVQ, kNN (Euclidean)], TransformAndClassify]
18 [[[RankingCC], FeatureSelection], [Boosting (10x) [NBC]], TransformAndClassify]
19 [[[RankingFScore], FeatureSelection], [kNN (Euclidean)], TransformAndClassify]
20 [[[RankingFScore], FeatureSelection], [kNN [MetricMachine (EuclideanOUO)]], TransformAndClassify]
21 [[[RankingFScore], FeatureSelection], [kNN [MetricMachine (Mahalanobis)]], TransformAndClassify]
22 [[[RankingFScore], FeatureSelection], [NBC], TransformAndClassify]
23 [[[RankingFScore], FeatureSelection], [SVMClassifier [KernelProvider]], TransformAndClassify]
24 [[[RankingFScore], FeatureSelection], [LinearSVMClassifier [LinearKernelProvider]], TransformAndClassify]
25 [[[RankingFScore], FeatureSelection], [ExpectedClass, kNN [MetricMachine (EuclideanOUO)]], TransformAndClassify]
26 [[[RankingFScore], FeatureSelection], [LVQ, kNN (Euclidean)], TransformAndClassify]
27 [[[RankingFScore], FeatureSelection], [Boosting (10x) [NBC]], TransformAndClassify]
28 ParamSearch [kNN (Euclidean)]
29 ParamSearch [kNN [MetricMachine (EuclideanOUO)]]
30 ParamSearch [kNN [MetricMachine (Mahalanobis)]]
31 ParamSearch [NBC]
32 ParamSearch [SVMClassifier [KernelProvider]]
33 ParamSearch [LinearSVMClassifier [LinearKernelProvider]]
34 ParamSearch [ExpectedClass, kNN [MetricMachine (EuclideanOUO)]]
35 ParamSearch [LVQ, kNN (Euclidean)]
36 ParamSearch [Boosting (10x) [NBC]]
37 ParamSearch [[[RankingCC], FeatureSelection], [kNN (Euclidean)], TransformAndClassify]
38 ParamSearch [[[RankingCC], FeatureSelection], [kNN [MetricMachine (EuclideanOUO)]], TransformAndClassify]
39 ParamSearch [[[RankingCC], FeatureSelection], [kNN [MetricMachine (Mahalanobis)]], TransformAndClassify]
40 ParamSearch [[[RankingCC], FeatureSelection], [NBC], TransformAndClassify]
41 ParamSearch [[[RankingCC], FeatureSelection], [SVMClassifier [KernelProvider]], TransformAndClassify]
42 ParamSearch [[[RankingCC], FeatureSelection], [LinearSVMClassifier [LinearKernelProvider]], TransformAndClassify]
43 ParamSearch [[[RankingCC], FeatureSelection], [ExpectedClass, kNN [MetricMachine (EuclideanOUO)]], TransformAndClassify]
44 ParamSearch [[[RankingCC], FeatureSelection], [LVQ, kNN (Euclidean)], TransformAndClassify]
45 ParamSearch [[[RankingCC], FeatureSelection], [Boosting (10x) [NBC]], TransformAndClassify]
46 ParamSearch [[[RankingFScore], FeatureSelection], [kNN (Euclidean)], TransformAndClassify]
47 ParamSearch [[[RankingFScore], FeatureSelection], [kNN [MetricMachine (EuclideanOUO)]], TransformAndClassify]
48 ParamSearch [[[RankingFScore], FeatureSelection], [kNN [MetricMachine (Mahalanobis)]], TransformAndClassify]
49 ParamSearch [[[RankingFScore], FeatureSelection], [NBC], TransformAndClassify]
50 ParamSearch [[[RankingFScore], FeatureSelection], [SVMClassifier [KernelProvider]], TransformAndClassify]
51 ParamSearch [[[RankingFScore], FeatureSelection], [LinearSVMClassifier [LinearKernelProvider]], TransformAndClassify]
52 ParamSearch [[[RankingFScore], FeatureSelection], [ExpectedClass, kNN [MetricMachine (EuclideanOUO)]], TransformAndClassify]
53 ParamSearch [[[RankingFScore], FeatureSelection], [LVQ, kNN (Euclidean)], TransformAndClassify]
54 ParamSearch [[[RankingFScore], FeatureSelection], [Boosting (10x) [NBC]], TransformAndClassify]
Table 1: Machine configurations produced by the generators flow of Fig. 2 and the
enumerated sets of classifiers and rankings.
[Figure 2: The generators flow used in the experiments: a directed graph of machine generators, including a rankings generator and a feature selection generator connected to the output.]
Benchmarks. Because complexity analysis is not the main thread of this article, I present examples on two datasets selected from the UCI machine learning repository [9]: vowel and splice.
Diagram description. The results obtained for the benchmarks are presented in the form of diagrams: Fig. 3 and Fig. 4. The diagrams present information about the times of starting, stopping and breaking of each task, the complexities (global, time and memory) of each test task, the order of the test tasks (according to their complexities; compare Table 1), and the accuracy of each tested machine.
In the middle of the diagram (see the first diagram in Fig. 3) there is a column with task ids (the same ids as in Table 1), but the order of rows in the diagram reflects the complexities of the test tasks: the most complex tasks are placed at the top and the tasks of smallest complexity are visualized at the bottom. For example, in Fig. 3, at the bottom, one can see task ids 4 and 31, which correspond to the Naive Bayes Classifier and the ParamSearch [NBC] classifier. At the top, one can see task ids 54 and 45, the most complex ParamSearch test tasks of this benchmark. The task order in the second example (Fig. 4) is completely different.
On the right side of the Task-id column, there is a plot presenting the starting, stopping and breaking times of each test task. For an example of a restarted task, look at Fig. 3: at the topmost position (task id 54) there are two horizontal bars corresponding to the two periods of the task run. The break means that the task was started, broken because it exceeded its allocated time, and restarted when the tasks of larger complexities got their turn. Breaks occur for tasks whose complexity prediction was too optimistic. The two diagrams (in Figs. 3 and 4) readily show that the amount of inaccurately predicted time complexity is quite small (there are very few broken bars). Note that when a task is broken, its subtasks that have already been computed are not recalculated during the test-task restart (thanks to the machine unification mechanism and machine cache [10]). At the bottom, the Time line axis can be seen. The time scope is the [0, 1] interval, showing times relative to the start and end of the whole MLA computation.
On the left side of the Task-id column, the accuracies of the classification test tasks and their approximated complexities are presented. At the bottom is the Accuracy axis, ranging from 0 (on the right) to 1 (on the left). Each test task has its own gray bar starting at 0 and finishing exactly at the point corresponding to its accuracy, so the accuracies of all the tasks are easily visible and comparable. Longer bars indicate higher accuracies.

The leftmost column of the diagram presents the ranks of the test tasks (the ranking of the accuracies).
Between the column with the task ids and the accuracy ranks, on top of the gray bars corresponding to the accuracies, some thin solid lines can be seen. The lines start at the right side (just like the accuracy bars) and extend to the left according to their magnitudes. For each task, the three lines correspond to the total complexity (the upper line), the memory complexity (the middle line) and the time complexity (the lower line)4. All three complexities are the approximated complexities (see Eqs. 10 and 11). The approximated complexities presented on the left side of the diagram can easily be compared visually to the time schedule obtained in real time on the right side of the diagram. Longer lines mean higher complexities: the longest line spreads to the maximum width and the others are proportionally shorter, so the complexity lines at the top of the diagram are long while the lines at the bottom are almost invisible. It can be seen that sometimes the time complexity of a task is smaller while the total complexity is larger, and vice versa. For example, see tasks 42 and 48 in Fig. 3 (Table 1 describes the task numbers).
If you would like to see more results, or at better resolution, please download the document at http://www.is.umk.pl/norbert/publications/11-njCmplxEval-Figs.pdf.
4 Summary
References
1. Brazdil, P., Giraud-Carrier, C., Soares, C., Vilalta, R.: Metalearning: Applications to Data
Mining. Springer (2009)
2. Solomonoff, R.: A preliminary report on a general theory of inductive inference. Technical
Report V-131, Cambridge, Ma.: Zator Co. (1960)
4 In the case of time complexity, t / log t is plotted, not the time t itself.
[Figures 3 and 4: Test-task schedule diagrams for the two benchmarks. Each diagram lists task ids (cf. Table 1) ordered by approximated complexity, with accuracy bars and complexity lines (Complexity axis) on the left and the start/stop/break schedule (Time line axis, scaled to [0, 1]) on the right; the Accuracy axis ranges from 1.0 (left) to 0.0 (right).]
3. Solomonoff, R.: A formal theory of inductive inference. Part I. Information and Control 7(1) (1964) 1-22
4. Solomonoff, R.: A formal theory of inductive inference. Part II. Information and Control 7(2) (1964) 224-254
5. Kolmogorov, A.N.: Three approaches to the quantitative definition of information. Problems of Information Transmission 1 (1965) 1-7
6. Li, M., Vitányi, P.: An Introduction to Kolmogorov Complexity and Its Applications. Texts and Monographs in Computer Science. Springer-Verlag (1993)
7. Jankowski, N.: Applications of Levin's universal optimal search algorithm. In Kącki, E., ed.: System Modeling Control '95. Volume 3., Łódź, Poland, Polish Society of Medical Informatics (May 1995) 34-40
8. Levin, L.A.: Universal sequential search problems. Problems of Information Transmission (translated from Problemy Peredachi Informatsii (Russian)) 9 (1973)
9. Merz, C.J., Murphy, P.M.: UCI repository of machine learning databases (1998) http://www.ics.uci.edu/~mlearn/MLRepository.html
10. Grąbczewski, K., Jankowski, N.: Saving time and memory in computational intelligence system with machine unification and task spooling. Knowledge-Based Systems 24(5) (2011) 570-588