Algorithmic Learning Theory
14th International Conference, ALT 2003
Sapporo, Japan, October 17-19, 2003
Proceedings
Volume Editors
Ricard Gavaldà
Technical University of Catalonia
Department of Software (LSI)
Jordi Girona Salgado 1-3, 08034 Barcelona, Spain
E-mail: gavalda@[Link]
Klaus P. Jantke
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH
Im Stadtwald, Geb. 43.8, 66125 Saarbrücken, Germany
E-mail: jantke@[Link]
Eiji Takimoto
Tohoku University
Graduate School of Information Sciences
Sendai 980-8579, Japan
E-mail: t2@[Link]
A catalog record for this book is available from the Library of Congress.
ISSN 0302-9743
ISBN 3-540-20291-9 Springer-Verlag Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are
liable for prosecution under the German Copyright Law.
Springer-Verlag Berlin Heidelberg New York
a member of BertelsmannSpringer Science+Business Media GmbH
[Link]
Preface
This volume contains the papers presented at the 14th Annual Conference on
Algorithmic Learning Theory (ALT 2003), which was held in Sapporo (Japan)
during October 17–19, 2003. The main objective of the conference was to provide
an interdisciplinary forum for discussing the theoretical foundations of machine
learning as well as their relevance to practical applications. The conference was
co-located with the 6th International Conference on Discovery Science (DS 2003).
The volume includes 19 technical contributions that were selected by the
program committee from 37 submissions. It also contains the ALT 2003 invited
talks presented by Naftali Tishby (Hebrew University, Israel) on “Efficient Data
Representations that Preserve Information,” by Thomas Zeugmann (University
of Lübeck, Germany) on “Can Learning in the Limit be Done Efficiently?”, and
by Genshiro Kitagawa (Institute of Statistical Mathematics, Japan) on “Sig-
nal Extraction and Knowledge Discovery Based on Statistical Modeling” (joint
invited talk with DS 2003). Furthermore, this volume includes abstracts of the
invited talks for DS 2003 presented by Thomas Eiter (Vienna University of Tech-
nology, Austria) on “Abduction and the Dualization Problem” and by Akihiko
Takano (National Institute of Informatics, Japan) on “Association Computation
for Information Access.” The complete versions of these papers were published
in the DS 2003 proceedings (Lecture Notes in Artificial Intelligence Vol. 2843).
ALT has been awarding the E. Mark Gold Award for the most outstanding
paper by a student author since 1999. This year the award was given to Sandra
Zilles for her paper “Intrinsic Complexity of Uniform Learning.”
This conference was the 14th in a series of annual conferences established in
1990. Continuation of the ALT series is supervised by its steering committee, con-
sisting of: Thomas Zeugmann (Univ. of Lübeck, Germany), Chair, Arun Sharma
(Univ. of New South Wales, Australia), Co-chair, Naoki Abe (IBM T.J. Wat-
son Research Center, USA), Klaus Peter Jantke (DFKI, Germany), Phil Long
(National Univ. of Singapore), Hiroshi Motoda (Osaka Univ., Japan), Akira
Maruoka (Tohoku Univ., Japan), Luc De Raedt (Albert-Ludwigs-Univ., Ger-
many), Takeshi Shinohara (Kyushu Institute of Technology, Japan), and Osamu
Watanabe (Tokyo Institute of Technology, Japan).
We would like to thank all individuals and institutions who contributed to
the success of the conference: the authors for submitting papers, the invited
speakers for accepting our invitation and lending us their insight into recent
developments in their research areas, as well as the sponsors for their generous
financial support.
Furthermore, we would like to express our gratitude to all program committee
members for their hard work in reviewing the submitted papers and participating
in on-line discussions. We are also grateful to the external referees whose reviews
made a considerable contribution to this process.
We are also grateful to the DS 2003 Chairs Yuzuru Tanaka (Hokkaido Uni-
versity, Japan), Gunter Grieser (Technical University of Darmstadt, Germany)
and Akihiro Yamamoto (Hokkaido University, Japan) for their efforts in coordi-
nating with ALT 2003, and to Makoto Haraguchi and Yoshiaki Okubo (Hokkaido
University, Japan) for their excellent work on the local arrangements. Last but
not least, Springer-Verlag provided excellent support in preparing this volume.
Organization
Conference Chair
Klaus P. Jantke DFKI GmbH Saarbrücken, Germany
Program Committee
Ricard Gavaldà (Co-Chair) Tech. Univ. of Catalonia, Spain
Eiji Takimoto (Co-Chair) Tohoku Univ., Japan
Hiroki Arimura Kyushu Univ., Japan
Shai Ben-David Technion, Israel
Nicolò Cesa-Bianchi Univ. di Milano, Italy
Nello Cristianini UC Davis, USA
François Denis LIF, Univ. de Provence, France
Kouichi Hirata Kyutech, Japan
Sanjay Jain Nat. Univ. Singapore, Singapore
Stephen Kwek Univ. Texas, San Antonio, USA
Phil Long Genome Inst. Singapore, Singapore
Yasubumi Sakakibara Keio Univ., Japan
Rocco Servedio Columbia Univ., USA
Hans-Ulrich Simon Ruhr-Univ. Bochum, Germany
Frank Stephan Univ. Heidelberg, Germany
Christino Tamon Clarkson Univ., USA
Local Arrangements
Makoto Haraguchi (Chair) Hokkaido Univ., Japan
Yoshiaki Okubo Hokkaido Univ., Japan
Subreferees
Kazuyuki Amano
Dana Angluin
Tijl De Bie
Laurent Brehelin
Christian Choffrut
Pedro Delicado
Claudio Gentile
Rémi Gilleron
Sally Goldman
Joshua Goodman
Colin de la Higuera
Hiroki Ishizaka
Jeffrey Jackson
Satoshi Kobayashi
Jean-Yves Marion
Andrei E. Romashchenko
Hiroshi Sakamoto
Kengo Sato
Sponsoring Institutions
The Japanese Ministry of Education, Culture, Sports, Science and Technology
The Suginome Memorial Foundation, Japan
Table of Contents
Invited Papers
Abduction and the Dualization Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Thomas Eiter
Regular Contributions
Inductive Inference
Intrinsic Complexity of Uniform Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Sandra Zilles
Online Prediction
Criterion of Calibration for Transductive Confidence Machine
with Limited Feedback . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
Ilia Nouretdinov, Vladimir Vovk
Abduction and the Dualization Problem
Thomas Eiter
R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 1–2, 2003.
c Springer-Verlag Berlin Heidelberg 2003
is that for certain formulas, logical consequence from T efficiently reduces to deciding
consequence from char(T ) (which is easy) and thus admits tractable inference. In fact,
finding some abductive explanation for a query literal is polynomial in this setting, while
this is well-known to be NP-hard under formula-based representation.
Computing all abductive explanations for a query literal, which arises in different
contexts, is known to be polynomial-time equivalent (in a precise sense) to the problem
of dualizing a Boolean function given by a monotone CNF. The latter problem, Monotone
Dualization, is, with respect to complexity, a somewhat mysterious problem that has
resisted a precise classification in terms of well-established complexity classes for more
than 20 years. Currently, no polynomial total-time algorithm solving this problem
is known; on the other hand, there is also no stringent evidence that such an algorithm is
unlikely to exist (like, e.g., coNP-hardness of the associated decision problem whether,
given two monotone CNFs ϕ and ψ, they represent dual functions). On the contrary,
results in the 1990s provided some hints that the problem is closer to polynomial total
time: as shown by Fredman and Khachiyan, the decisional variant can be solved in
quasi-polynomial time, i.e., in time O(n^log n). This was recently refined to solvability
in polynomial time with limited nondeterminism, i.e., using a poly-logarithmic number
of bit guesses.
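The decision problem just mentioned can at least be checked by brute force on small instances. The sketch below is our own illustration (not part of the talk's abstract): a monotone CNF is represented as a list of clauses, each a set of variable indices, and the duality condition ψ(x) = ¬ϕ(¬x) is tested on every assignment. This is exponential in the number of variables and therefore only a reference check, not an algorithm of the kind discussed above.

```python
from itertools import product

def eval_cnf(cnf, assignment):
    # A monotone CNF is a list of clauses; each clause is a set of
    # variable indices (all literals are positive).
    return all(any(assignment[v] for v in clause) for clause in cnf)

def are_dual(phi, psi, n):
    # phi and psi represent dual functions iff for every assignment x
    # over n variables, psi(x) == not phi(complement of x).
    for bits in product([False, True], repeat=n):
        negated = tuple(not b for b in bits)
        if eval_cnf(psi, bits) != (not eval_cnf(phi, negated)):
            return False
    return True

# x0 AND x1 (CNF [{0}, {1}]) is dual to x0 OR x1 (CNF [{0, 1}]).
print(are_dual([{0}, {1}], [{0, 1}], 2))  # True
```

The 3-variable majority function, written as the monotone CNF (x0 ∨ x1)(x0 ∨ x2)(x1 ∨ x2), is a standard example of a self-dual function and also passes this check against itself.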
Apart from this peculiarity, Monotone Dualization has been recognized as an im-
portant problem since there are a large number of other problems in Computer Science
which are known to be polynomial-time equivalent to this problem. It has a role similar to
the one of SAT for the class NP: A polynomial total-time algorithm for Monotone Dual-
ization implies polynomial total-time algorithms for all the polynomial-time equivalent
problems.
We will consider some possible extensions of the results for abductive explanations
which are polynomial-time equivalent to Monotone Dualization. Besides generating
all abductive explanations for a literal, there are many other problems in Knowledge
Discovery and Data Mining which are polynomial-time equivalent or closely related to
Monotone Dualization, including learning with oracles, computation of infrequent and
frequent sets, and key generation. We shall give a brief account of such problems, and
finally will conclude with some open problems and issues for future research.
The results presented are joint work with Kazuhisa Makino, Osaka University.
Association Computation for Information Access
Akihiko Takano
The full version of this paper is published in the Proceedings of the 6th International
Conference on Discovery Science, Lecture Notes in Artificial Intelligence Vol. 2843.
Efficient Data Representations that Preserve Information
Naftali Tishby
School of Computer Science and Engineering and Center for Neural Computation
The Hebrew University, Jerusalem 91904, Israel
tishby@[Link]
Can Learning in the Limit Be Done Efficiently?
Thomas Zeugmann
1 Introduction
R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 17–38, 2003.
c Springer-Verlag Berlin Heidelberg 2003
(in case of language learning from positive data). Alternatively, one can also
study language learning from both positive and negative data.
Most of the work done in the field has been aimed at the following goals: showing
which general collections of function classes or language classes are learnable,
characterizing those collections of classes that can be learned, studying the impact
of various postulates on the behavior of learners on their learning power,
and dealing with the influence of various parameters on the efficiency of learning.
However, defining an appropriate measure for the complexity of learning in the
limit has turned out to be quite difficult (cf. Pitt [31]). Moreover, whenever learning
in the limit is done, in general one never knows whether or not the learner
has already converged. The reason is that it is in general undecidable
whether or not convergence has already occurred, and even when it is decidable,
deciding it is practically infeasible. Thus, there is always an uncertainty which
may not be tolerable in many applications of learning.
Therefore, different learning models have been proposed. In particular, Valiant's [46]
model of probably approximately correct (abbr. PAC) learning has
been very influential. This model puts strong emphasis on
the efficiency of learning and avoids the problem of convergence altogether. In the
PAC model, the learner receives a finite labeled sample of the target concept and
outputs, with high probability, a hypothesis that is approximately correct. The
sample is drawn with respect to an unknown probability distribution, and both the
error of and the confidence in the hypothesis are measured with respect
to this distribution. Thus, if a class is PAC learnable, one obtains nice
performance guarantees. Unfortunately, many interesting concept classes are not
PAC learnable.
Consequently, one has to look for other models of learning, or one is back to
learning in the limit. So, let us assume that learning in the limit is our method
of choice. What we would like to present in this survey is a rather general way to
transform learning in the limit into stochastic finite learning. It should also be
noted that our ideas may be beneficial even in the case that the considered concept
class is PAC learnable.
Furthermore, we aim to outline how a thorough study of the limit learnability
of concept classes may contribute to supporting our new approach. We exemplify
the research undertaken mainly by looking at the class of all pattern
languages introduced by Angluin [1]. As Salomaa [37] has put it, "Patterns are
everywhere," and thus we believe that our research is worth the effort undertaken.
There are several problems that have to be addressed when dealing with
the learnability of pattern languages. First, the nice thing about patterns is that
they are very intuitive. Therefore, it seems desirable to design learners that output
patterns as their hypotheses. Unfortunately, membership is known to be NP-complete
for the pattern languages (cf. [1]). Thus, many of the usual approaches
used in machine learning directly lead to infeasible learning algorithms. As
a consequence, we shall ask what kind of appropriate hypothesis spaces can be
used at all to learn the pattern languages, and what the appropriate learning
strategies are.
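The NP-completeness of membership can be made tangible with a brute-force matcher. The sketch below is our own (the survey prescribes no encoding): a pattern is a list of tokens, constants as single characters and variables as 'x1', 'x2', ..., and the requirement that every occurrence of a variable be replaced by the same non-empty string maps directly onto regex backreferences, whose matching may take exponential time.

```python
import re

def pattern_to_regex(pattern):
    # Compile a pattern into an anchored regex.  Each variable's first
    # occurrence becomes a capturing group '(.+)' (non-empty
    # substitution); later occurrences become backreferences, forcing
    # the same substitution everywhere.
    parts, group_of = [], {}
    for tok in pattern:
        if tok.startswith('x'):
            if tok in group_of:
                parts.append('\\%d' % group_of[tok])
            else:
                group_of[tok] = len(group_of) + 1
                parts.append('(.+)')
        else:
            parts.append(re.escape(tok))
    return '^' + ''.join(parts) + '$'

# L(x1 0 x1): all strings of the form w0w with w non-empty.
rx = pattern_to_regex(['x1', '0', 'x1'])
print(bool(re.match(rx, '101')))  # True
print(bool(re.match(rx, '111')))  # False
```

The backtracking hidden in the regex engine is exactly where the worst-case exponential cost lives; no efficient general matcher is expected unless P = NP.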
2 Preliminaries
Gold’s [12] model of learning in the limit allows one to formalize a rather general
class of learning problems, i.e., learning from examples. For defining this model
we assume any recursively enumerable set X and refer to it as the learning
domain. By ℘(X ) we denote the power set of X . Let C ⊆ ℘(X ) , and let c ∈ C
be non-empty; then we refer to C and c as a concept class and a concept,
respectively. Let c be a concept, and let t = (xj)j∈N be any infinite sequence of
elements xj ∈ c such that range(t) =df {xj | j ∈ N} = c . Then t is said to be
a positive presentation or, synonymously, a text for c . By text(c) we denote the
set of all positive presentations for c . Moreover, let t be a positive presentation,
and let y ∈ N . Then, we set ty = x0, . . . , xy , i.e., ty is the initial segment of t
of length y + 1 , and ty^+ =df {xj | j ≤ y} . We refer to ty^+ as the content of ty .
By LimTxt we denote the collection of all concept classes C that are learnable
in the limit from text¹. Note that instead of LimTxt sometimes TxtEx is used.
Note that Definition 1 does not contain any requirement concerning efficiency.
Before we deal with efficiency, we want to point to another crucial
parameter of our learning model, i.e., the hypothesis space H . Since our goal
is algorithmic learning, we can consider the special case that X = N and let C
be any subset of the collection of all recursively enumerable sets over N . Let
Wi = domain(ϕi) . In this case, (Wj)j∈N is the most general hypothesis space.
Within this setting many learning problems can be described. Moreover,
this setting has been used to study the general capabilities of different learning
models which can be obtained by suitable modifications of Definition 1. There are
numerous papers performing studies along this line of research (cf., e.g., [16,30]
and the references therein). On the one hand, the results obtained considerably
broaden our general understanding of algorithmic learning. On the other hand,
one has also to ask what kind of consequences one may derive from these results
for practical learning problems. This is a non-trivial question, since the setting
of learning recursively enumerable languages is very rich. Thus, it is conceivable
¹ If learning from informant is considered, we use LimInf to denote the collection of
all concept classes C that are learnable in the limit from informant.
that several of the phenomena observed hold in this setting due to the fact that too
many sets are recursively enumerable and there are no counterparts within the
world of efficient computability.
As a first step to address this question we mainly consider the scenario
that indexable concept classes with uniformly decidable membership have to
be learned (cf. Angluin [2]). A class of non-empty concepts C is said to be
an indexable class with uniformly decidable membership provided there are
an effective enumeration c0 , c1 , c2 , ... of all and only the concepts in C and a
recursive function f such that for all j ∈ N and all elements x ∈ X we have
f(j, x) = 1 if x ∈ cj , and f(j, x) = 0 otherwise.
In the following we refer to indexable classes with uniformly decidable mem-
bership as to indexable classes for short. Furthermore, we call any enumeration
(cj )j∈N of C with uniformly decidable membership problem an indexed family.
Since the paper of Angluin [2], learning of indexable concept classes has attracted
much attention (cf., e.g., Zeugmann and Lange [51]). Let us briefly present
some well-known indexable classes. Let Σ be any finite alphabet of symbols,
and let X be the free monoid over Σ , i.e., X = Σ∗ . We set Σ+ = Σ∗ \ {λ} ,
where λ denotes the empty string. As usual, we refer to subsets L ⊆ X as
languages. Then the sets of all regular languages, context-free languages, and
context-sensitive languages are indexable classes.
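To make the definition concrete, here is a minimal sketch (our toy example over X = N, not one of the classes above): an indexed family is fully specified by a single recursive function deciding membership uniformly in the index j and the element x.

```python
def membership(j, x):
    # Uniformly decidable membership for a toy indexed family over
    # X = N: concept c_j is the set of all multiples of j + 1.
    # Every c_j is non-empty (it contains 0), as the definition of an
    # indexed family requires.
    return 1 if x % (j + 1) == 0 else 0

# c_2 = multiples of 3
print([x for x in range(10) if membership(2, x)])  # [0, 3, 6, 9]
```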
far. In contrast to that, we next define iterative IIMs. An iterative IIM is only
allowed to use its last guess and the next element of the positive presentation of
the target concept for computing its actual guess. Conceptually, an iterative
IIM M defines a sequence (Mn)n∈N of machines, each of which takes as its input
the output of its predecessor.
Definition 2 (Wiehagen [47]). Let C be a concept class, let c be a concept,
let H = (hj)j∈N be a hypothesis space, and let a ∈ N ∪ {∗} . An IIM M
ItLimTxt H -infers c iff for every t = (xj)j∈N ∈ text(c) the following conditions
are satisfied:
(1) for all n ∈ N , Mn(t) is defined, where M0(t) =df M(x0) and for all
n ≥ 0 : Mn+1(t) =df M(Mn(t), xn+1) ,
(2) the sequence (Mn(t))n∈N converges to a number j such that c = hj .
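Definition 2 can be illustrated on a deliberately simple case (our toy example, not from the survey): for the concept class cj = {0, . . . , j} over N, the correct hypothesis is the maximum element occurring in the text, so the new guess indeed depends only on the previous guess and the current text element.

```python
def M(prev, x):
    # Iterative IIM: M_0(t) = M(x_0) and M_{n+1}(t) = M(M_n(t), x_{n+1}).
    # Toy class over N: c_j = {0, ..., j}; the hypothesis j is the
    # largest element observed so far, so the last guess and the next
    # text element suffice.
    return x if prev is None else max(prev, x)

h = None
for x in [3, 0, 7, 2, 7, 5]:   # a finite prefix of a text for c_7
    h = M(h, x)
print(h)  # 7: the learner has converged to the correct index
```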
Let C be any concept class, and let M be any IIM that learns C in the
limit. Then, for every c ∈ C and every text t for c , let
Conv(M, t) =df the least number n such that M(tn) = M(tm) for all m ≥ n
denote the stage of convergence of M on t (cf. [12]). Note that Conv (M, t) = ∞
if M does not learn the target concept from its text t . Moreover, by TM (tn )
we denote the time to compute M (tn ) . We measure this time as a function of
the length of the input and call it the update time. Finally, the total learning
time taken by the IIM M on successive input t is defined as
TT(M, t) =df Σ_{n=1}^{Conv(M,t)} TM(tn) .
Clearly, if M does not learn the target concept from text t then the total
learning time is infinite.
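These notions can be traced on a small example (again our own illustration, not the survey's): consider a learner for the class of finite sets whose hypothesis after tn is simply the content of tn. It converges at the first position from which no new element appears, and taking n + 1 as a unit-cost stand-in for TM(tn) makes the total learning time easy to compute.

```python
def conv_and_total_time(text_prefix):
    # Hypothesis after t_n is the content of t_n, so within this
    # prefix the learner converges at the first n whose guess already
    # equals the final one.  T_M(t_n) is modelled as n + 1 (reading
    # the input once at unit cost per element).
    guesses, seen = [], set()
    for x in text_prefix:
        seen.add(x)
        guesses.append(frozenset(seen))
    final = guesses[-1]
    conv = next(n for n, g in enumerate(guesses) if g == final)
    total_time = sum(n + 1 for n in range(conv + 1))  # TT(M, t) up to Conv
    return conv, total_time

# prefix of a text for the finite language {1, 2, 4}
print(conv_and_total_time([1, 2, 1, 4, 2, 4]))  # (3, 10)
```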
Two more remarks are in order here. First, it has been argued elsewhere that
within the learning in the limit paradigm a learning algorithm is invoked only
when the current hypothesis has some problem with the latest observed data.
However, such a viewpoint implicitly assumes that membership in the target
concept is decidable in time polynomial in the length of the actual input. This
may not be the case. Thus, directly testing consistency would immediately lead to
a non-polynomial update time, provided membership is not known to be in P .
Second, Pitt [31] addresses the question with respect to what parameter one
should measure the total learning time. In the definition given above this param-
eter is the length of all examples seen so far. Clearly, now one could try to play
with this parameter by waiting for a large enough input before declaring suc-
cess. However, when dealing with the learnability of non-trivial concept classes,
in the worst-case the total learning time will be anyhow unbounded. Thus, it
does not make much sense to deal with the worst-case. Instead, we shall study
the expected total learning time. In such a setting one cannot simply wait for
long enough inputs. Therefore, using the definition of total learning time given
above seems to be reasonable.
Next, we define important concept classes which we are going to consider
throughout this survey.
Following Angluin [1] we define patterns and pattern languages as follows. Let
A = {0, 1, . . .} be any non-empty finite alphabet containing at least two ele-
ments. By A∗ we denote the free monoid over A . The set of all finite non-null
strings of symbols from A is denoted by A+ , i.e., A+ = A∗ \ {λ} , where
3 Results
Within this section we ask whether or not the pattern languages and finite unions
thereof can be learned efficiently. The principal learnability of the pattern lan-
guages from text with respect to the hypothesis space Pat has been established
by Angluin [1]. However, her algorithm is based on computing descriptive pat-
terns for the data seen so far. Here a pattern π is said to be descriptive (for
the set S of strings contained in the input provided so far) if π can generate
all strings contained in S and no other pattern with this property generates a
proper subset of the language generated by π . Since no efficient algorithm is
known for computing descriptive patterns, and finding a descriptive pattern of
maximum length is NP-hard, the update time of her algorithm is practically intractable.
There are also serious difficulties when trying to learn the pattern languages
within the PAC model introduced by Valiant [46]. In the original model, the sam-
ple complexity depends exclusively on the VC dimension of the target concept
class and the error and confidence parameters ε and δ , respectively. Recently,
Mitchell et al. [25] have shown that even the class of all one-variable pattern
languages has infinite VC dimension. Consequently, even this special subclass
of PAT is not uniformly PAC learnable. Moreover, Schapire [40] has shown
that the pattern languages are not PAC learnable in the generalized model provided
P/poly ≠ NP/poly , with respect to every hypothesis space for PAT that is
uniformly polynomially evaluable. Though this result highlights the difficulty of
PAC learning PAT it has no clear application to the setting considered in this
paper, since we aim to learn PAT with respect to the hypothesis space Pat .
Since the membership problem for this hypothesis space is N P -complete, it is
not polynomially evaluable (cf. [1]).
In contrast, Kearns and Pitt [18] have established a PAC learning algorithm
for the class of all k -variable pattern languages. Positive examples are gener-
ated with respect to arbitrary product distributions while negative examples are
allowed to be generated with respect to any distribution. In their algorithm the
length of substitution strings is required to be polynomially related to the length
of the target pattern. Finally, they use as hypothesis space all unions of
polynomially many patterns that have k or fewer variables² . The overall learning
time of their PAC learning algorithm is polynomial in the length of the target
² More precisely, the number of allowed unions is at most poly(|π|, s, 1/ε, 1/δ, |A|) ,
where π is the target pattern, s is the bound on the length of substitution strings,
ε and δ are the usual error and confidence parameters, respectively, and A is the
alphabet of constants over which the patterns are defined.
pattern, the bound for the maximum length of substitution strings, 1/ε , 1/δ ,
and |A| . The constant in the achieved running time depends doubly exponentially
on k , and thus their algorithm rapidly becomes impractical as k increases.
Finally, Lange and Wiehagen [19] have proposed an inconsistent but iterative
and conservative algorithm that learns PAT with respect to Pat . We shall
study this algorithm below in much more detail.
But before doing so, we aim to figure out under which circumstances iterative
learning of PAT is possible at all. A first answer is given by the following
theorems from Case et al. [9]. Note that Pat is a non-redundant hypothesis
space for PAT .
Theorem 1 (Case et al. [9]). Let C be any concept class, and let H =
(hj )j∈N be any non-redundant hypothesis space for C . Then, every IIM M that
ItLimTxt H -infers C is conservative.
Proof. Suppose the converse, i.e., there are a concept c ∈ C , a text t =
(xj)j∈N ∈ text(c) , and a y ∈ N such that, for j = M∗(ty) and k = M∗(ty+1) =
M(j, xy+1) , both j ≠ k and ty+1^+ ⊆ hj are satisfied. The latter implies xy+1 ∈
hj , and thus we may consider the following text t̃ ∈ text(hj) . Let t̂ = (x̂j)j∈N
be any text for hj and let t̃ = x̂0 , xy+1 , x̂1 , xy+1 , x̂2 , . . . Since M has to learn
hj from t̃ there must be a z ∈ N such that M∗(t̃z+r) = j for all r ≥ 0 . But
M∗(t̃2z+1) = M(j, xy+1) = k ≠ j , a contradiction.
Next, we point to another peculiarity of PAT , i.e., it meets the superset
condition defined as follows. Let C be any indexable class. C meets the superset
condition if, for all c, c′ ∈ C , there is some ĉ ∈ C that is a superset of both c
and c′ .
Theorem 2 (Case et al. [9]). Let C be any indexable class meeting the
superset condition, and let H = (hj)j∈N be any non-redundant hypothesis space
for C . Then, every consistent IIM M that ItLimTxt H -infers C may be used
to decide the inclusion problem for H .
Proof. Let X be the underlying learning domain, and let (wj)j∈N be an
effective enumeration of all elements in X . Then, for every i ∈ N , ti = (xij)j∈N
is the following computable text for hi . Let z be the least index such that
wz ∈ hi . Recall that, by definition, hi ≠ ∅ , since H is an indexed family, and
thus wz must exist. Then, for all j ∈ N , we set xij = wj , if wj ∈ hi , and
xij = wz , otherwise.
We claim that the following algorithm Inc decides, for all i, k ∈ N , whether
or not hi ⊆ hk .
Algorithm Inc: “On input i, k ∈ N do the following:
Determine the least y ∈ N with i = M∗(tiy) . Test whether or not tiy^+ ⊆ hk .
In case it is, output ‘Yes,’ and stop. Otherwise, output ‘No,’ and stop.”
Clearly, since H is an indexed family and ti is a computable text, Inc is
an algorithm. Moreover, M learns hi on every text for it, and H is a non-
redundant hypothesis space. Hence, M has to converge on text ti to i , and
therefore Inc has to terminate.
Proof. Part (2) is obvious. Part (1) is easy for finite L . For infinite L , it
follows from the lemma below.
Lemma 1. Let k ∈ N+ , let L ⊆ A+ be any language, and suppose t =
(sj)j∈N ∈ text(L). Then,
(1) Club(t0^+, k) can be obtained effectively from s0 , and Club(tn+1^+, k) is
effectively obtainable from Club(tn^+, k) and sn+1 (* note the iterative nature *).
Using Lemma 1 it is easy to verify that Mn+1 (t) = M (Mn (t), sn+1 ) can be
obtained effectively from Mn (t) and sn+1 . Therefore, M ItLimTxt -identifies
PAT (k) .
So far, the general theory provided substantial insight into the iterative learn-
ability of the pattern languages. But still, we do not know anything about the
number of examples needed until successful learning and the total amount of time
to process them. Therefore, we address this problem in the following subsection.
As we have already mentioned, it does not make much sense to study the worst-
case behavior of learning algorithms with respect to their total learning time.
The reason for this phenomenon should be clear, since an arbitrary text may
provide the information needed for learning very late. Therefore, in the follow-
ing we always assume a class D of admissible probability distributions over the
relevant learning domain. Ideally, this class should be parameterized. Then, the
data fed to the learner are generated randomly with respect to one of the probability
distributions from the class D of underlying probability distributions. Furthermore,
we introduce a random variable CONV for the stage of convergence. Note
that CONV can be also interpreted as the total number of examples read by the
IIM M until convergence. The first major step to be performed consists now
in determining the expectation E[CONV ] . Clearly, E[CONV ] should be finite
for all concepts c ∈ C and all distributions D ∈ D . Second, one has to deal
with tail bounds for CONV . The easiest way to perform this step is to use
Markov's inequality, i.e., we always know that
Pr(CONV ≥ t · E[CONV ]) ≤ 1/t for all t ∈ N+ .
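Markov's inequality can be checked empirically. The simulation below is purely illustrative (not from the survey): it draws CONV from a geometric distribution, a plausible stand-in for the number of examples a learner reads until convergence, and compares the observed tail probabilities with the 1/t bound.

```python
import random

random.seed(0)

def geometric(p):
    # Number of independent trials until the first success.
    k = 1
    while random.random() >= p:
        k += 1
    return k

p, n = 0.2, 100_000
conv = [geometric(p) for _ in range(n)]
mean = sum(conv) / n                      # close to E[CONV] = 1/p = 5
for t in (2, 3, 5):
    tail = sum(c >= t * mean for c in conv) / n
    print(t, round(tail, 3), "<=", round(1 / t, 3))
```

As expected, the observed tails are well below 1/t; for a geometric distribution the true tails decay exponentially, which is why sharper bounds than Markov's are often available.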
However, quite often one can obtain much better tail bounds. If the underly-
ing learner is known to be conservative and rearrangement-independent we always
Second, β is the conditional probability that two random strings that get
substituted into π are identical under the condition that both have length 1 , i.e.,
β = Pr( u = u′ | |u| = |u′| = 1 ) = ( Σa∈A d(a)² ) / ( Σa∈A d(a) )² .
Note that we have omitted the assumption of a text to exhaust the target lan-
guage. Instead, we only demand the data sequence fed to the learner to contain
“enough” information to recognize the target pattern. The meaning of “enough”
is mainly expressed by the parameter α .
The model of computation as well as the representation of patterns we assume
is the same as in Angluin [1]. In particular, we assume a random access machine
that performs a reasonable menu of operations each in unit time on registers of
length O(log n) bits, where n is the input length.
Lange and Wiehagen’s [19] algorithm (abbr. LWA) works as follows. Let hn
be the hypothesis computed after reading s1 , . . . , sn , i.e., hn = M (s1 , . . . , sn ) .
The algorithm computes the new hypothesis only from the latest example
and the old hypothesis. If the latest example is longer than the old hypothesis,
the example is ignored, i.e., the hypothesis does not change. If the latest ex-
ample is shorter than the old hypothesis, the old hypothesis is ignored and the
new example becomes the new hypothesis. If, however, |hn−1| = |sn| , the new
hypothesis is the union of hn−1 and sn . The union ℓ = π ∪ s of a canonical
pattern π and a string s of the same length is defined position by position by
ℓ(i) = π(i), if π(i) = s(i) ;
ℓ(i) = xj , if π(i) ≠ s(i) and there is some k < i with ℓ(k) = xj , s(k) = s(i) ,
and π(k) = π(i) ;
ℓ(i) = xm , otherwise, where m = #var(ℓ(1) . . . ℓ(i − 1)) ,
where ℓ(0) = λ for notational convenience. Note that the resulting pattern is
again canonical.
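The union operation can be transcribed directly from the definition. The data representation below (constants as one-character strings, variables as pairs ('x', m)) is our own choice, not the paper's; consistent with Theorem 7, the loop does constant expected work per position.

```python
def union(pi, s):
    # Position-wise union of a canonical pattern pi and a string s of
    # equal length.  Positions where pattern and string agree are
    # kept; mismatching positions become variables, and two positions
    # share a variable exactly when they exhibited the same
    # (pattern symbol, string symbol) pair earlier.  Fresh variables
    # are numbered in order of first occurrence, which keeps the
    # resulting pattern canonical.
    result, var_of = [], {}
    for p, c in zip(pi, s):
        if p == c:
            result.append(p)
        else:
            key = (p, c)
            if key not in var_of:
                var_of[key] = len(var_of)
            result.append(('x', var_of[key]))
    return result

# Union of the hypotheses '0110' and '0000': positions 1 and 2 carry
# the same mismatch ('1' vs '0'), so they share one variable.
print(union(list('0110'), list('0000')))  # ['0', ('x', 0), ('x', 0), '0']
```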
If the target pattern does not contain any variable then the LWA converges
after having read the first example. Hence, this case is trivial and we therefore
assume in the following always k ≥ 1 , i.e., the target pattern has to contain
at least one variable. Our next theorem analyzes the complexity of the union
operation.
Theorem 7 (Rossmanith and Zeugmann [36]). The union operation can be
computed in linear time.
Furthermore, the following bound for the stage of convergence for every target
pattern from Patk can be shown.
Theorem 8 (Rossmanith and Zeugmann [36]).
E[CONV ] = O( (1/α^k) · log1/β(k) ) for all k ≥ 2 .
Hence, by Theorem 7, the expected total learning time can be estimated by
E[TT ] = O( (1/α^k) · E[Λ] · log1/β(k) ) for all k ≥ 2 .
For a better understanding of the bound obtained we evaluate it for the
uniform distribution and compare it to the minimum number of examples needed
for learning a pattern language via the LWA.
Theorem 9 (Rossmanith and Zeugmann [36]). E[TT ] = O(2^k |π| log|A|(k))
for the uniform distribution and all k ≥ 2 .
Theorem 10 (Zeugmann [50]). To learn a pattern π ∈ Patk the LWA needs
exactly ⌈log|A|(|A| + k − 1)⌉ + 1 examples in the best case.
The main difference between the two bounds just given is the factor 2^k ,
which precisely reflects the time the LWA has to wait until it has seen the first
shortest string of the target pattern language. Moreover, in the best case the
LWA is processing shortest examples only. Thus, we introduce MC to denote
the number of minimum length examples read until convergence. Then, one can
show that
E[MC ] ≤ (2 ln(k) + 3)/ln(1/β) + 2 .
Note that Theorem 8 is shown by using the bound for E[MC] just given.
More precisely, we have E[CONV] = (1/α^k) · E[MC] . Now, we are ready to
transform the LWA into a stochastic finite learner.
Theorem 11 (Rossmanith and Zeugmann [36]). Let α∗ , β∗ ∈ (0, 1) . Assume
D to be a class of admissible probability distributions over A+ such that α ≥
α∗ , β ≤ β∗ and E[Λ] finite for all distributions D ∈ D . Then (PAT , D) is
stochastically finitely learnable with high confidence from text.
Proof. Let D ∈ D , and let δ ∈ (0, 1) be arbitrarily fixed. Furthermore, let
t = s1 , s2 , s3 , . . . be any randomly generated text with respect to D for the
target pattern language. The wanted learner M uses the LWA as a subroutine.
Additionally, it has a counter for memorizing the number of examples already
seen. Now, we exploit the fact that the LWA produces a sequence (τn )n∈N+ of
hypotheses such that |τn | ≥ |τn+1 | for all n ∈ N+ .
The learner runs the LWA until for the first time C many examples have
been processed, where
    C = (1/α∗)^{|τ|} · ( (2 ln(|τ|) + 3) / ln(1/β∗) + 2 )        (A)
Therefore, after having processed C many examples the LWA has already
converged on average. The desired confidence is then an immediate consequence
of Corollary 6.
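Numerically, the two bounds driving this construction are easy to evaluate. A small sketch (all names are ours; `tau_len` stands for |τ|, and `alpha_star`, `beta_star` for the assumed prior bounds α∗, β∗ on the distribution parameters):

```python
import math

def expected_mc_bound(k, beta):
    # upper bound on E[MC]: (2 ln(k) + 3) / ln(1/beta) + 2
    return (2 * math.log(k) + 3) / math.log(1 / beta) + 2

def sample_bound(tau_len, alpha_star, beta_star):
    # bound (A): C = (1/alpha_star)^|tau| * ((2 ln(|tau|) + 3)/ln(1/beta_star) + 2)
    return (1 / alpha_star) ** tau_len * expected_mc_bound(tau_len, beta_star)
```

For α∗ = β∗ = 1/2 the bound C grows essentially like 2^{|τ|}, mirroring the factor 2^k discussed after Theorem 9.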
The latter theorem admits a nice corollary, which we state next. Making the
same assumption as Kearns and Pitt [18], i.e., assuming the additional
prior knowledge that the target pattern belongs to Pat_k , the complexity of the
stochastic finite learner given above can be considerably improved. The resulting
learning time is linear in the expected string length, and the constant depending
on k grows only exponentially in k in contrast to the doubly exponentially
growing constant in Kearns and Pitt’s [18] algorithm. Moreover, in contrast
to their learner, our algorithm learns from positive data only, and outputs a
hypothesis that is correct for the target language with high probability.
Again, for the sake of presentation we shall assume k ≥ 2 . Moreover, if the
prior knowledge k = 1 is available, then there is also a much better stochastic
finite learner for PAT 1 (cf. [34]).
Corollary 12. Let α∗ , β∗ ∈ (0, 1) . Assume D to be a class of admissible
probability distributions over A+ such that α ≥ α∗ , β ≤ β∗ and E[Λ] finite
for all distributions D ∈ D . Furthermore, let k ≥ 2 be arbitrarily fixed. Then
there exists a learner M such that
(1) M learns (PAT k , D) stochastically finitely with high confidence from text,
and
(2) The running time of M is O( α̂∗^k · E[Λ] · log_{1/β∗}(k) · log_2(1/δ) ).
(* Note that α̂∗^k and log_{1/β∗}(k) are now constants. *)
4 Conclusions
The present paper surveyed results recently obtained concerning the iterative
learnability of the class of all pattern languages and finite unions thereof. In
particular, it could be shown that there are strong dependencies between iterative
learning, the class of admissible hypothesis spaces, and additional requirements
on the learner such as consistency, conservativeness and the decidability of the
inclusion problem for the hypothesis space chosen. Looking at these results, we
have seen that the LWA is in some sense optimal.
Moreover, by analyzing the average-case behavior of Lange and Wiehagen’s
pattern language learning algorithm with respect to its total learning time and
by establishing exponentially shrinking tail bounds for a rather rich class of
limit learners, we have been able to transform the LWA into a stochastic finite
learner. The price paid is the incorporation of a bit of prior knowledge concerning
the class of underlying probability distributions. When applied to the class of
all k -variable pattern languages, where k is a priori known, the resulting total
learning time is linear in the expected string length.
Thus, the present paper provides evidence that analyzing the average-case
behavior of limit learners with respect to their total learning time may be consid-
ered as a promising path towards a new theory of efficient algorithmic learning.
Recently obtained results along the same path, as outlined in Erlebach et al. [11]
as well as in Reischuk and Zeugmann [32,34] provide further support for the
fruitfulness of this approach.
In particular, in Reischuk and Zeugmann [32,34] we have shown that one-
variable pattern languages are learnable for basically all meaningful distributions
within an optimal linear total learning time on the average. Furthermore, this
learner can also be modified to maintain the incremental behavior of Lange and
Wiehagen’s [19] algorithm. Instead of memorizing the pair (PRE, SUF) , it can
also store just the two or three examples from which the prefix PRE and the suffix
SUF of the target pattern have been computed. While it is no longer iterative, it
is still a bounded example memory learner. A bounded example memory learner
is essentially an iterative learner that is additionally allowed to memorize an a
priori bounded number of examples (cf. [9] for a formal definition).
While the one-variable pattern language learner from [34] is highly practical,
our stochastic finite learner for the class of all pattern languages is still not good
enough for practical purposes. But the results surveyed point to possible
directions for potential improvements. However, much more effort seems necessary to
design a stochastic finite learner for PAT (k) .
Additionally, we have applied our techniques to design a stochastic finite
learner for the class of all concepts describable by a monomial which is based
on Haussler’s [14] Wholist algorithm. Here we have assumed the examples to be
binomially distributed. The sample size of our stochastic finite learner is mainly
bounded by log(1/δ) log n , where δ is again the confidence parameter and n
is the dimension of the underlying Boolean learning domain. Thus, the bound
obtained is exponentially better than the bound provided within the PAC model.
Our approach also differs from U-learnability introduced by Muggleton [27].
First of all, our learner is fed with positive examples only, while in Muggle-
ton’s [27] model examples labeled with respect to their containment in the target
language are provided. Next, we do not make any assumption concerning the dis-
tribution of the target patterns. Furthermore, we do not measure the expected
total learning time with respect to a given class of distributions over the targets
and a given class of distributions for the sampling process, but exclusively in
dependence on the length of the target. Finally, we require exact learning and
not approximately correct learning.
References
1. D. Angluin, Finding Patterns common to a Set of Strings, Journal of Computer
and System Sciences 21, 1980, 46–62.
2. D. Angluin, Inductive inference of formal languages from positive data, Informa-
tion and Control 45, 1980, 117–135.
36 T. Zeugmann
3. D. Angluin and C.H. Smith. Inductive inference: Theory and methods. Computing
Surveys 15, No. 3, 1983, 237–269.
4. D. Angluin and C.H. Smith. Formal inductive inference. “Encyclopedia of Ar-
tificial Intelligence” (St.C. Shapiro, Ed.), Vol. 1, pp. 409–418, Wiley-Interscience
Publication, New York.
5. S. Arikawa, T. Shinohara and A. Yamamoto, Learning elementary formal systems,
Theoretical Computer Science 95, 97–113, 1992.
6. L. Blum and M. Blum, Toward a mathematical theory of inductive inference,
Information and Control 28, 125–155, 1975.
7. A. Blumer, A. Ehrenfeucht, D. Haussler and M. Warmuth, Learnability and the
Vapnik-Chervonenkis Dimension, Journal of the ACM 36 (1989), 929–965.
8. I. Bratko and S. Muggleton, Applications of inductive logic programming, Com-
munications of the ACM, 1995.
9. J. Case, S. Jain, S. Lange and T. Zeugmann, Incremental Concept Learning for
Bounded Data Mining, Information and Computation 152, No. 1, 1999, 74–110.
10. R. Daley and C.H. Smith. On the Complexity of Inductive Inference. Information
and Control 69 (1986), 12–40.
11. T. Erlebach, P. Rossmanith, H. Stadtherr, A. Steger and T. Zeugmann, Learning
one-variable pattern languages very efficiently on average, in parallel, and by asking
queries, Theoretical Computer Science 261, No. 1–2, 2001, 119–156.
12. E.M. Gold, Language identification in the limit, Information and Control 10
(1967), 447–474.
13. S.A. Goldman, M.J. Kearns and R.E. Schapire, Exact identification of circuits
using fixed points of amplification functions. SIAM Journal of Computing 22,
1993, 705–726.
14. D. Haussler, Bias, version spaces and Valiant’s learning framework, “Proc. 8th Na-
tional Conference on Artificial Intelligence,” pp. 564–569, San Mateo, CA: Morgan
Kaufmann, 1987.
15. D. Haussler, M. Kearns, N. Littlestone and M.K. Warmuth, Equivalence of models
for polynomial learnability. Information and Computation 95 (1991), 129–161.
16. S. Jain, D. Osherson, J.S. Royer and A. Sharma, “Systems That Learn: An Intro-
duction to Learning Theory,” MIT-Press, Boston, Massachusetts, 1999.
17. T. Jiang, A. Salomaa, K. Salomaa and S. Yu, Inclusion is undecidable for pat-
tern languages, in “Proceedings 20th International Colloquium on Automata,
Languages and Programming,” (A. Lingas, R. Karlsson, and S. Carlsson, Eds.),
Lecture Notes in Computer Science, Vol. 700, pp. 301–312, Springer-Verlag, Berlin,
1993.
18. M. Kearns and L. Pitt, A polynomial-time algorithm for learning k-variable pattern
languages from examples. in “Proc. Second Annual ACM Workshop on Computa-
tional Learning Theory” (pp. 57–71). San Mateo, CA: Morgan Kaufmann, 1989.
19. S. Lange and R. Wiehagen, Polynomial-time inference of arbitrary pattern lan-
guages. New Generation Computing 8 (1991), 361–370.
20. S. Lange and T. Zeugmann, Language learning in dependence on the space of
hypotheses. in “Proc. of the 6th Annual ACM Conference on Computational
Learning Theory,” (L. Pitt, Ed.), pp. 127–136, ACM Press, New York, 1993.
21. S. Lange and T. Zeugmann, Set-driven and Rearrangement-independent Learning
of Recursive Languages, Mathematical Systems Theory 29 (1996), 599–634.
22. S. Lange and T. Zeugmann, Incremental Learning from Positive Data, Journal of
Computer and System Sciences 53(1996), 88–103.
23. N. Lavrač and S. Džeroski, “Inductive Logic Programming: Techniques and Ap-
plications,” Ellis Horwood, 1994.
Sandra Zilles
Universität Kaiserslautern,
FB Informatik, Postfach 3049, 67653 Kaiserslautern, Germany,
zilles@[Link]
1 Introduction
Inductive inference is concerned with algorithmic learning of recursive functions.
In the model of learning in the limit, cf. [7], a learner successful for a class of
recursive functions must eventually find a correct program for any function in
the class from a gradually growing sequence of its values. The learner is under-
stood as a machine – called inductive inference machine or IIM – reading finite
sequences of input-output pairs of a target function, and returning programs as
its hypotheses, see also [2]. The underlying programming system is then called
a hypothesis space.
Studying the potential of such IIMs in general leads to the question whether
– given a description of a class of functions – a corresponding successful IIM can
be synthesized computationally from this description. This idea is generalized in
the notion of uniform learning: we consider a collection C0 , C1 , . . . of learning
problems – which may be seen as a decomposition of a class C = C0 ∪ C1 ∪ . . .
– and ask for some kind of meta-IIM tackling the whole collection of learning
problems. As an input, such a meta-IIM gets a description of one of the learning
R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 39–53, 2003.
© Springer-Verlag Berlin Heidelberg 2003
40 S. Zilles
tions of singleton sets may constitute complete classes in uniform learning. Still,
the characterization of completeness here reveals a weakness of the general idea
of intrinsic complexity, namely – as in the non-uniform case – complete classes
have a low algorithmic complexity (see Theorem 7). All in all, this shows that
intrinsic complexity, as in [4], is on the one hand a useful approach, because it
can be adapted to match the intuitively desired results in uniform learning. On
the other hand, the doubts in [8] are corroborated.
2 Preliminaries
2.1 Notations
Knowledge of basic notions used in mathematics and computability theory is
assumed, cf. [15]. N is the set of natural numbers. The cardinality of a set
X is denoted by card X. Partial-recursive functions always operate on natural
numbers. If f is a function, f (n) ↑ indicates that f (n) is undefined. Our target
objects for learning will always be recursive functions, i. e. total partial-recursive
functions. R denotes the set of all recursive functions.
If α is a finite tuple of numbers, then |α| denotes its length. Finite tuples
are coded, i. e. if f (0), . . . , f (n) are defined, a number f [n] represents the tuple
(f (0), . . . , f (n)), called an initial segment of f . f [n] ↑ means that f (x) ↑ for
some x ≤ n. For convenience, a function may be written as a sequence of values
or as a set of input-output pairs. A sequence σ = x0 , x1 , x2 , . . . converges to
x, iff xn = x for all but finitely many n; we write lim(σ) = x. For example let
f (n) = 7 for n ≤ 2, f (n) ↑ otherwise; g(n) = 7 for all n. Then f = 7^3 ↑^∞ =
{(0, 7), (1, 7), (2, 7)}, g = 7^∞ = {(n, 7) | n ∈ N}; lim(g) = 7, and f ⊆ g. For
n ∈ N, the notion f =n g means that for all x ≤ n either f (x) ↑ and g(x) ↑ or
f (x) = g(x). A set C of functions is dense, iff for any f ∈ C, n ∈ N there is
some g ∈ C satisfying f =n g, but f ≠ g.
Recursive functions – our target objects for learning – require appropriate
representation schemes, to be used as hypothesis spaces. Partial-recursive enu-
merations serve for that purpose: any (n + 1)-place partial-recursive function ψ
enumerates the set Pψ := {ψi | i ∈ N} of n-place partial-recursive functions,
where ψi (x) := ψ(i, x) for all x = (x1 , . . . , xn ). Then ψ is called a numbering.
Given f ∈ Pψ , any index i satisfying ψi = f is a ψ-program of f .
Following [6], we call a family (di )i∈N of natural numbers limiting r. e., iff
there is a recursive numbering d̂ such that lim(d̂i ) = di for all i ∈ N.
By the following result, special sets describing only singleton recursive cores
are not uniformly Ex -learnable (restrictedly). For Claim 2 cf. a proof in [16].
It has turned out that even UEx -learnable subsets of these description sets
are not in UEx (or rUEx ), if additional demands concerning the sequence of
hypotheses are imposed, see [17]. This suggests that description sets representing only
singletons may form hardest problems in uniform learning; analogously descrip-
tion sets representing only a fixed singleton recursive core may form hardest
problems in restricted uniform learning. Hopefully, this intuition can be ex-
pressed by a notion of intrinsic complexity of uniform learning.
first reduction with the operators of a second reduction. The idea above cannot
guarantee that: assume D1 is reducible to D2 via Θ1 and Ξ1 ; D2 is reducible to
D3 via Θ2 and Ξ2 . If Θ1 maps (d1 , f1 ) to (δ2 , f2 ), then which description d in the
sequence δ2 should form an input (d, f2 ) for Θ2 ? It is in general impossible to
detect the limit d2 of the sequence δ2 , and any description d ≠ d2 might change
the output of Θ2 .
So it is inevitable to let Θ operate on sequences of descriptions and on func-
tions, i. e., Θ maps pairs (δ1 , f1 ), where δ1 is a sequence of descriptions, to pairs
(δ2 , f2 ).
Definition 9 Let Θ be a total function operating on pairs of functions. Θ is a re-
cursive meta-operator, iff the following properties hold for all functions δ, δ′, f, f′:
1. if δ ⊆ δ′, f ⊆ f′, as well as Θ(δ, f ) = (γ, g) and Θ(δ′, f′) = (γ′, g′), then
γ ⊆ γ′ and g ⊆ g′;
2. if n, y ∈ N, Θ(δ, f ) = (γ, g), and γ(n) = y (or g(n) = y, resp.), then there
are initial segments δ0 ⊆ δ and f0 ⊆ f such that (γ0 , g0 ) = Θ(δ0 , f0 ) fulfils
γ0 (n) = y (g0 (n) = y, resp.);
3. if δ, f are finite and Θ(δ, f ) = (γ, g), one can effectively (in δ, f ) enumerate γ, g.
This finally allows for the following definition of UEx -reducibility.
Note that this definition expresses intrinsic complexity in the sense that a
meta-IIM for D1 can be computed from a meta-IIM for D2 , if D1 is UEx -
reducible to D2 . Moreover, as has been demanded in advance, the resulting
reducibility is transitive:
The question is, whether this notion of intrinsic complexity expresses the
intuitions formulated in advance, e. g., that there are UEx -complete description
sets representing only singleton recursive cores. Before answering this question
consider an illustrative example.
Example 4 Let d ∈ N fulfil Rd = Cfsup . Then the set {d} is UEx -complete.
Proof. Obviously, {d} ∈ UEx . To show that each description set in UEx is UEx -
reducible to {d}, fix D1 ∈ UEx and let M be a corresponding meta-IIM as in
Definition 6. It remains to define a recursive meta-operator Θ and a recursive
operator Ξ appropriately.
Given initial segments δ1 and α, let Θ just modify the sequence of hypotheses
returned by the meta-IIM M , if the first input parameter is gradually taken from
the sequence δ1 and the second input parameter is gradually taken from the
sequence α. The modification is to increase each hypothesis by 1 and to change
each repetition of hypotheses into a zero output. A formal definition is omitted.
Moreover, given an initial segment σ = (s0 , . . . , sn ), let Ξ(σ) look for the
maximal m ≤ n such that at least one of the values τsm (x), x ≤ n, is de-
fined within n steps and greater than 0. In case m does not exist, Ξ(σ) =
Ξ(s0 , . . . , sn−1 ). Otherwise, let y ≤ n be maximal such that τsm (y) has already
been computed and is greater than 0. Then Ξ(σ) is Ξ(s0 , . . . , sn−1 ) extended by
the value τsm (y) − 1.
Now D1 is UEx -reducible to {d} via Θ, Ξ; details are omitted.
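The search performed by Ξ can be sketched as follows. This is a sketch under assumptions: the numbering τ is modelled by a step-bounded evaluator `tau_eval(p, x, steps)` returning `None` when τ_p(x) does not halt within the given number of steps; this interface and all names are illustrative, not the paper's.

```python
def make_xi(tau_eval):
    # tau_eval(p, x, steps) -> value of tau_p(x) if it halts within
    # `steps` steps, else None (assumed interface, not the paper's).
    def xi(sigma):
        if not sigma:
            return ()
        n = len(sigma) - 1
        prefix = xi(sigma[:-1])
        # maximal m <= n such that some tau_{s_m}(x), x <= n, is defined
        # within n steps and greater than 0
        for m in range(n, -1, -1):
            values = [tau_eval(sigma[m], x, n) for x in range(n + 1)]
            if any(v is not None and v > 0 for v in values):
                # maximal y with tau_{s_m}(y) computed and greater than 0
                y = max(x for x in range(n + 1)
                        if values[x] is not None and values[x] > 0)
                return prefix + (values[y] - 1,)
        return prefix  # m does not exist: Xi(sigma) = Xi(s_0, ..., s_{n-1})
    return xi
```

The step bound n makes each extension step effective, which is exactly what property 3 of Definition 9 requires of a recursive operator.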
That decompositions of Ex -complete classes may also be not UEx -complete,
is shown in Section 3.3. Example 4 moreover serves for proving the completeness
of other sets, if Lemma 5 – an immediate consequence of Lemma 3 – is applied.
It is easy to define Θ such that, if α does not end with 0, then Θ(δ1 , α0^∞ ) =
(δ2 , α0^∞ ), where δ2 converges to some g(i) with αi = α. Let Ξ(σ) = σ for all σ.
Then {d} is UEx -reducible to {g(i) | i ∈ N} via Θ and Ξ. Details are omitted.
ad 2. Fix an r. e. family (αi )i∈N of all initial segments; fix h ∈ R with τh(i) =
αi 0^∞ for all i ∈ N. Then ϕ_0^{g(h(i))} = αi 0^∞ and ϕ_{x+1}^{g(h(i))} = ↑^∞
for i, x ∈ N. As above,
Let j ∈ N be minimal such that ψi =n ψj . (* Note that, for all but finitely
many n, the index j will be the minimal ψ-program of ψi . *)
Return d̂i (n) := d̂j (n). (* lim(d̂i ) = dj , for j minimal with ψi = ψj . *)
Finally, let di be given by the limit of the function d̂i , in case a limit exists.
Fix i ∈ N. Then there is a minimal j with ψi = ψj . By definition, the limit
di of d̂i exists and di = dj ∈ D. Moreover, as ψj ∈ Rdj , the function ψi is in Rdi .
As ψ and (di )i∈N allow us to apply Theorem 7, the set D is UEx -complete.
Thus certain decompositions of Ex -complete classes remain UEx -complete,
and UEx -complete description sets always represent decompositions of supersets
of Ex -complete classes. Example 9 illustrates how to apply the above characteri-
zations of UEx -completeness. A similar short proof may be given for Example 6.
Intrinsic Complexity of Uniform Learning 49
Proof. (g(i))i∈N is a (limiting) r. e. family such that χi ∈ Rg(i) for all i ∈ N and
Pχ is Ex -complete. Corollary 8 implies that {g(i) | i ∈ N} is UEx -complete.
Recall that, intuitively, sets describing just one singleton recursive core may
be rUEx -complete. This is affirmed by Example 12, the proof of which is omitted.
Fix the least elements d0 , d′0 of A0 , d0 < d′0 . Let I0 := {d0 }, I′0 := {d′0 }. Let
e0 ∈ A \ (I0 ∪ I′0 ) be minimal such that f0 ∈ Re0 . (* e0 exists, because A
contains infinitely many descriptions d with ϕ_0^d = f0 . *)
Let D0 := I0 ∪ {e0 }. (* The disjoint sets D0 and I′0 both intersect with A0 ;
some recursive core described by D0 equals {f0 }. *)
Fix the least elements dk+1 , d′k+1 of Ak+1 \ (Dk ∪ I′k ), dk+1 < d′k+1 . (* These
have not been touched in the definition of D0 , . . . , Dk yet. *)
Let Ik+1 := Dk ∪ {dk+1 }, I′k+1 := I′k ∪ {d′k+1 }. Let ek+1 ∈ A \ (Ik+1 ∪ I′k+1 ) be
minimal such that fk+1 ∈ Rek+1 . (* ek+1 exists, because A contains infinitely
many descriptions d with ϕ_0^d = fk+1 . *)
Let Dk+1 := Ik+1 ∪ {ek+1 }. (* The disjoint sets Dk+1 and I′k+1 both intersect
with Ak+1 ; some recursive core described by Dk+1 equals {fk+1 }. *)
Choose D := ⋃_{k∈N} Dk ⊂ A, so D does not contain any infinite limiting r. e. set.
As ϕ_{x+1}^d = ↑^∞ for all d ∈ D, x ∈ N, we have D ∈ rUEx . Moreover, C is the union
of all cores described by D. It remains to prove that D is not rUEx -complete.
Assume D is rUEx -complete. Then some limiting r. e. set {di | i ∈ N} ⊆ D
and some ψ ∈ R fulfil the conditions of Theorem 13. In particular, {(di , ψi ) |
i ∈ N} is infinite. As D does not contain any infinite limiting r. e. set, the set
{di | i ∈ N} is finite. card Rdi = 1 for i ∈ N implies that {ψi | i ∈ N} is finite, too;
thus {(di , ψi ) | i ∈ N} is finite – a contradiction. So D is not rUEx -complete.
The reason why each UEx -/rUEx -complete set D contains a limiting r. e. sub-
set representing a decomposition of an r. e. class is that certain properties of
UEx -complete sets are ‘transferred’ by meta-operators Θ. This corroborates the
possible interpretation that our approach of intrinsic complexity just makes a
References
1. Baliga, G.; Case, J.; Jain, S. (1999); The synthesis of language learners, Information
and Computation 152:16–43.
2. Blum, L.; Blum, M. (1975); Toward a mathematical theory of inductive inference,
Information and Control 28:125–155.
3. Case, J.; Smith, C. (1983); Comparison of identification criteria for machine in-
ductive inference, Theoretical Computer Science 25:193–220.
4. Freivalds, R.; Kinber, E.; Smith, C. (1995); On the intrinsic complexity of learning,
Information and Computation 123:64–71.
5. Garey, M.; Johnson, D. (1979); Computers and Intractability – A Guide to the
Theory of NP-Completeness, Freeman and Company.
6. Gold, E. M. (1965); Limiting recursion, Journal of Symbolic Logic 30:28–48.
7. Gold, E. M. (1967); Language identification in the limit, Information and Control
10:447–474.
8. Jain, S.; Kinber, E.; Papazian, C.; Smith, C.; Wiehagen, R. (2003); On the intrinsic
complexity of learning recursive functions, Information and Computation 184:45–
70.
9. Jain, S.; Kinber, E.; Wiehagen, R. (2000); Language learning from texts: Degrees
of intrinsic complexity and their characterizations, Proc. 13th Annual Conference
on Computational Learning Theory, Morgan Kaufmann, 47–58.
10. Jain, S.; Sharma, A. (1996); The intrinsic complexity of language identification,
Journal of Computer and System Sciences 52:393–402.
11. Jain, S.; Sharma, A. (1997); The structure of intrinsic complexity of learning,
Journal of Symbolic Logic 62:1187–1201.
12. Jantke, K. P. (1979); Natural properties of strategies identifying recursive func-
tions, Elektronische Informationsverarbeitung und Kybernetik 15:487–496.
13. Kapur, S.; Bilardi, G. (1992); On uniform learnability of language families, Infor-
mation Processing Letters 44:35–38.
14. Osherson, D.; Stob, M.; Weinstein, S. (1988); Synthesizing inductive expertise,
Information and Computation 77:138–161.
15. Rogers, H. (1987); Theory of Recursive Functions and Effective Computability,
MIT Press.
16. Zilles, S. (2001); On the synthesis of strategies identifying recursive functions,
Proc. 14th Annual Conference on Computational Learning Theory, LNAI 2111,
pp. 160–176, Springer-Verlag.
17. Zilles, S. (2001); On the comparison of inductive inference criteria for uniform
learning of finite classes, Proc. 12th Int. Conference on Algorithmic Learning The-
ory, LNAI 2225, pp. 251–266, Springer-Verlag.
On Ordinal VC-Dimension and Some Notions of
Complexity
1 Introduction
The notion of VC-dimension is a key concept in PAC-learning [6,12,13]. The
notion of finite telltale is a key concept in Inductive inference [3,8]. It can be
claimed that VC-dimension is to PAC-learning what finite telltales are to Induc-
tive inference. Both provide a characterization of learnability, for fundamental
classes of learning paradigms, in the respective settings. Both take the form of
a condition where finiteness is a key requirement, in frameworks that deal with
infinite objects. In logical learning paradigms of identification in the limit, it
has been shown that the finite telltale condition can be seen as a generalization
of the compactness property, the latter being the hallmark of, equivalently, fi-
nite learning or deductive inference [10]. The finite telltale condition can even be
generalized and be interpreted as a property of β-weak compactness, that charac-
terizes classification with less than β mind changes [10]. There are extensions of
R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 54–68, 2003.
© Springer-Verlag Berlin Heidelberg 2003
VC dimension to infinite domains [4]. But there is little essential connection
between VC-dimension and some fundamental concepts from Inductive inference:
the relevance of the concept of VC-dimension seems to be closely related to the
existence of probability distributions over the sample space. Though connections
exist between PAC learning and Inductive inference (e.g., [5]), it does not seem
that VC-dimension has any chance to play a key role in learning paradigms of In-
ductive inference that do not introduce probability distributions over the sample
space. We will show that VC-dimension can still provide a perfect characteriza-
tion for the problem of predicting whether a possible datum is true or false in
the underlying world, in the realm of Inductive inference. But for such a char-
acterization to be possible, the condition that the set of possible data is closed
under boolean operators has to be imposed. The fact that we work in a logical
setting is of course essential to express this condition in a simple, meaningful
and natural way. The notions and main results are stated with no computabil-
ity condition on the procedures that analyze the data and output hypotheses.
This is necessary to obtain perfect equivalences, and provides strong evidence
that the concepts involved are naturally connected. When computability is a
requirement, the relationships become more complex. Our aim is to encourage
the study of the connections between ordinal VC-dimension and predictive com-
plexity in paradigms of Inductive inference. Obtaining perfect connections for
ideal, unconstrained paradigms suggests that further work in the same direction
should be carried out, in the context of more realistic or constrained paradigms.
Moreover, the results will be illustrated with examples that always involve effec-
tive procedures. We proceed as follows. We introduce some background notions
and notation in Section 2. We define the various complexity measures in Section
3. We study the relationships between these complexity measures in Section 4.
We conclude in Section 5.
2 Background
The class of ordinals is denoted by Ord. Let a set X be given. The set of finite
sequences of members of X, including the empty sequence (), is represented
by X∗. Given σ ∈ X∗, the set of members of X that occur in σ is denoted
by rng(σ). Given a finite or an infinite sequence σ of members of X and a
natural number i that, in case σ is finite, is at most equal to the length of σ,
we represent by σ|i the initial segment of σ of length i. Concatenation between
sequences is represented by , and sequences consisting of a unique element are
often identified with that element. We write ⊂ (resp. ⊆) for strict (resp.
nonstrict) inclusion between sets, as well as for the notion of a finite sequence
being a strict (resp. nonstrict) initial segment of another sequence. We also
use the notation ⊃. Let two sets X, Y and a partial function f from X into Y
be given. Given x ∈ X, we write f (x)↓ when f (x) is defined, and f (x)↑
otherwise. Given two members x, x′ of X, we write f (x) = f (x′ ) when both
f (x) and f (x′ ) are defined and equal; we write f (x) ≠ f (x′ ) otherwise. Let
R be a binary relation over a set X. Recall that R is well-founded iff every
nonempty subset of X has an R-minimal element.
56 E. Martin, A. Sharma, and F. Stephan
Let us introduce the logical learning paradigms and their constituents. We denote
by V a countable vocabulary, i.e., a countable set of function symbols (possibly
including constants) and predicate symbols. Let us adopt a convention. If we
say that V contains 0 and s, then 0 denotes a constant and s a unary function
symbol. Moreover, given a nonnull n ∈ N, n is used as an abbreviation for the
term obtained from 0 by n applications of s to 0.
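Under this convention the numeral terms can be built mechanically; a minimal sketch (the string encoding of terms is an illustrative choice of ours):

```python
def numeral(n):
    # the term obtained from 0 by n applications of s,
    # e.g. numeral(3) = "s(s(s(0)))"
    term = "0"
    for _ in range(n):
        term = f"s({term})"
    return term
```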
We denote by D a nonempty set of first-order sentences (closed formulas) over
V that represent data. Three important cases are sets of closed atoms (to model
learning from texts), sets of closed literals (i.e., closed atoms and their negations,
to model learning from informants), and sets of sentences closed under boolean
operators, a natural example being the set of quantifier free sentences. Note that
quantifier free sentences convey no more information than closed literals. Still,
the assumption that D is closed under boolean operators will play a key role in
this paper. We denote by some symbol to be used when no datum is presented.
Given a member σ of (D ∪ {}) , we set cnt(σ) = rng(σ) ∩ D.
We denote by W a nonempty set of structures over V. An important case is
given by the set of all Herbrand structures, i.e., structures over V each of whose
individuals interprets a unique closed term. (When we consider Herbrand struc-
tures, V contains at least one constant). Given a member M of W and a set E
of formulas, the E-diagram of M, denoted DiagE (M), is the set of all members
of E that are true in M. We say that a set T of first-order formulas is consistent
in W iff T has a model in W. Given a set T of first-order formulas over V, we
denote by Mod(T ) the set of models of T , and by ModW (T ) the set of models
of T in W (i.e., ModW (T ) = Mod(T ) ∩ W).
We denote by P the triple (V, D, W). We call P a logical paradigm. This is a
simplification of the notion of logical paradigm investigated in [9,10]. Learning
paradigms in the numerical setting are naturally cast into the logical setting as
follows. Set V = {0, s, P } where P is a unary predicate symbol. Let E be the
set {P (n) : n ∈ N}. If C is the set of languages to be learnt, we define W as
the set of Herbrand structures whose E-diagrams are {P (n) : n ∈ L} where L
varies over C. The choice of D depends on the type of data: D = E when data
are positive, D = E ∪ {¬ψ : ψ ∈ E} when data are positive or negative.
We now define the various concepts of complexity we need in this paper, starting
with VC-dimension. The notion of a set of hypotheses shattering a set of data
takes the following form when hypotheses are represented as structures and data
as formulas.
Definition 1. Let a set E of formulas be given. We say that W shatters E iff
E is finite and, for every subset D′ of E, DiagE (M) = D′ for some M ∈ W.
Traditionally, the VC-dimension is defined as the greatest number n such
that some set consisting of n distinct elements is shattered [13]. When such an n
does not exist, the VC-dimension is considered to be either undefined or infinite.
We extend the notion of VC-dimension from natural numbers to ordinals as
follows.
Definition 2. Let X be the set of nonempty subsets of D that W shatters. Let
R be the restriction of ⊃ to X. The VC-dimension of P is equal to the length
of R if R is well-founded, and undefined otherwise.
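In the classical finite case (natural-number VC-dimension), shattering and the dimension can be checked directly. A sketch under assumptions: W is modelled by the finite family of E-diagrams of its structures, and data by hashable values; all names are ours:

```python
from itertools import combinations

def shatters(diagrams, E):
    # W shatters E iff E is finite and every subset of E is realized
    # as Diag_E(M) = d ∩ E for some diagram d in `diagrams`
    E = frozenset(E)
    realized = {frozenset(d & E) for d in diagrams}
    return all(frozenset(sub) in realized
               for r in range(len(E) + 1)
               for sub in combinations(E, r))

def vc_dimension(diagrams, data):
    # greatest n such that some n-element subset of `data` is shattered
    best = 0
    for r in range(1, len(data) + 1):
        if any(shatters(diagrams, set(sub)) for sub in combinations(data, r)):
            best = r
    return best
```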
As an intuitive interpretation, the VC-dimension of P is determined by the
following game, where we assume for simplicity that D is infinite. Consider two
players Anke and Boris. Anke has to output an increasing sequence of nonempty
finite subsets of D and Boris a decreasing sequence of ordinals.
The game terminates after Boris has output 0. If Anke’s last set is shattered,
Anke wins the game. Otherwise Boris wins the game. The VC-dimension of P
58 E. Martin, A. Sharma, and F. Stephan
is the smallest nonnull ordinal α for which Boris has a winning strategy. If this
ordinal does not exist, the VC-dimension is undefined.
Given a structure M, we call environment (in P) an infinite sequence e of mem-
bers of D ∪ {#}, where # is a pause symbol, such that for all ϕ ∈ D, ϕ occurs in e iff ϕ ∈ DiagD (M). So
environments correspond to texts when D is the set of closed atoms, and to
informants when D is the set of closed literals. Identification in the limit and
the corresponding notion of complexity are defined next.
Definition 3. An identifier (for P) is a partial function from (D ∪ {#})∗ into
{DiagD (M) : M ∈ W}. Let an identifier f be given.
– We say that a member σ of (D ∪ {#})∗ is consistent in W just in case there
exists M ∈ W such that for all ϕ ∈ D that occur in σ, M |= ϕ.
– We say that f is successful (in P) iff for every M ∈ W and for every
environment e for M, f (e|k ) = DiagD (M) for almost every k ∈ N.
– Let X be the set of all σ ∈ (D ∪ {#})∗ such that σ is consistent in W and
f (τ )↓ for some initial segment τ of σ. We denote by Rf the binary relation
over X such that for all σ, τ ∈ X, Rf (σ, τ ) holds iff τ ⊂ σ and f (σ) ≠ f (τ ).
The identification complexity of P is the least ordinal of the form |Rf |, where f
is an identifier that is successful in P and Rf is well-founded; if such an f does
not exist, the identification complexity of P is undefined.
If the identification complexity of P is equal to nonnull ordinal β, then the
D-diagrams of the members of W are identifiable in the limit with less than β
mind changes. (Note that the usual notion of mind change complexity considers
the least ordinal β such that at most, rather than less than, β mind changes
are sometimes necessary for the procedure to converge [1,2,7]. There are good
theoretical reasons for preferring the ‘less than’ formulation.) Note that if some
identifier is successful in P, then there are countably many D-diagrams of mem-
bers of W only. The following is a characterization of identification complexity
based on a generalization of Angluin’s finite telltale characterization of learnability
in the limit [3].
Proposition 4. The identification complexity of P is defined and equal to non-
null ordinal β iff there exists a sequence (βM )M∈W of ordinals smaller than β
and for all M ∈ W, there exists a finite AM ⊆ DiagD (M) such that for all
N ∈ W:
(∗) if AM ⊆ DiagD (N) and βM ≤ βN then DiagD (N) = DiagD (M).
Definition 5. A selector (for P) is a partial function f from {0, 1}∗ into D. Let
a selector f be given.
– We say that a member σ of {0, 1}∗ is consistent with W and f just in case
there exists M ∈ W such that for all τ ∈ {0, 1}∗ and i ∈ {0, 1} with τ i ⊆ σ,
f (τ )↓, and M |= f (τ ) iff i = 1.
– We say that f is successful (in P) iff for every M ∈ W, there is a string
t of finitely many 0’s and infinitely many 1’s such that every finite initial
segment of t is consistent with W and f and for all ϕ ∈ DiagD (M), either
f (σ) = ϕ or f (σ) = ¬ϕ for some finite initial segment σ of t.
– Let X be the set of all σ ∈ {0, 1}∗ that are consistent with W and f , and
that end with a 0. We denote by Rf the binary relation over X such that
for all σ, τ ∈ X, Rf (σ, τ ) holds iff τ ⊂ σ.
The selective complexity of P is the least ordinal of the form |Rf |, where f is a
selector that is successful in P and Rf is well-founded; if such an f does not exist,
the selective complexity of P is undefined.
The last notion of complexity we define is based on predictors. Whereas selec-
tors have control over the formulas the truth of which they want to predict, a
predictor has to take the members of D as they come.
Definition 6. A predictor (for P) is a partial function f from (D × {0, 1})∗ × D
into {0, 1}. Let a predictor f be given.
– Let σ = ((ϕ0 , ε0 ), . . . , (ϕp , εp )) ∈ (D × {0, 1})∗ be given. We say that σ is
consistent with W and f iff there exists M ∈ W such that for all i ≤ p:
• f (σ|i , ϕi )↓, and
• M |= ϕi iff εi = 1.
If σ is consistent with W and f , we call the number of i ≤ p such that
f (σ|i , ϕi ) ≠ εi the number of mispredictions that f makes on σ (in P).
– We say that f is successful (in P) iff for every member M of W and every
t ∈ (D×{0, 1})N , the following holds. Assume that every finite initial segment
of t is consistent with W and f . Then there exists n ∈ N such that for all
finite initial segments σ of t, f makes at most n mispredictions on σ.
– Let X be the set of all σ ∈ (D × {0, 1})∗ that are consistent with W and
f , and on which f makes at least one misprediction. We denote by Rf the
binary relation over X such that for all σ, τ ∈ X, Rf (σ, τ ) holds iff τ ⊂ σ
and f makes more mispredictions on σ than on τ .
The predictive complexity of P is the least ordinal of the form |Rf |, where f is
a predictor that is successful in P and Rf is well-founded; if such an f does not
exist, the predictive complexity of P is undefined.
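The bookkeeping of Definition 6 is straightforward to simulate. The sketch below is plain Python and not from the paper; a formula is any hashable value, a predictor any callable, and σ a finite list of (formula, truth value) pairs.

```python
def mispredictions(predictor, sigma):
    """Count the positions i with predictor(sigma|i, phi_i) != eps_i, as in
    Definition 6.  sigma is a list of (formula, truth_value) pairs;
    predictor(history, formula) returns the predicted value 0 or 1 given
    the pairs seen so far."""
    errors = 0
    for i, (phi, eps) in enumerate(sigma):
        if predictor(sigma[:i], phi) != eps:
            errors += 1
    return errors

# A toy predictor that repeats the last observed truth value (1 at the start).
def follow_last(history, phi):
    return history[-1][1] if history else 1
```

On σ = ((a,1), (b,0), (c,0), (d,1)) the toy predictor errs exactly at b and d, so it makes two mispredictions.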
Clearly, if the predictive complexity of P is defined, then the selective com-
plexity of P is defined, and at most equal to the former. The three notions of
complexity we have introduced can, similarly to VC-dimension, be interpreted
in terms of the outcome of a game between Anke and Boris. For instance, for
predictive complexity, Anke selects formulas from D and Boris has to make a
prediction whether the formula holds (in the unknown world M) or not. Anke
tells Boris whether the prediction is correct. If not, Boris has to count down an
ordinal counter. The predictive complexity is then the least ordinal to start with
for which Boris has a winning strategy.
Notation 7. Let α ∈ Ord be given, and suppose that Γβ has been defined for
all β < α. Let Γα be the set of all subsets U of W such that for all ϕ ∈ D,
U ∩ Mod(ϕ) or U ∩ Mod(¬ϕ) belongs to ⋃β<α Γβ ∪ {∅}.
Property 8. The predictive complexity of P is defined iff W ∈ ⋃α∈Ord Γα ; in
this case it is equal to the least ordinal α with W ∈ Γα .
Lemma 10. Let a sequence (Mα )α∈Ord be inductively defined as follows. For
all α ∈ Ord, Mα is the set of all finite subsets D of D such that W shatters D
and for all β < α, Mβ contains a proper superset of D.
We are now in a position to state and prove one of the main results of the
paper, which relates VC-dimension to predictive complexity.
Proposition 11. Suppose that D is closed under boolean operators. Then the
VC-dimension of P is defined iff the predictive complexity of P is defined; more-
over, if they are defined then they are equal.
On Ordinal VC-Dimension and Some Notions of Complexity 61
– ∀x(x < s(x)) ∧ ∀xyz((x < y ∧ y < z) → (x < z)) ∧ ∀xy(x < y → ¬(y < x));
– ∀x1 . . . xn y1 . . . yn ((⋁i<n (y1 = x1 ∧ . . . ∧ yi = xi ∧ yi+1 < xi+1 ) ∧
P (x1 , . . . , xn )) → P (y1 , . . . , yn )).
Denote by W the topological space over W generated by the sets of the form
ModW (ϕ), where ϕ varies over D. To establish relationships between predictive,
selective, and identification complexities, we usually need to assume that W is
compact. This is the case for the next result.
Proposition 13. Assume that D is closed under negation and that W is com-
pact. If the identification complexity of P is defined and equal to ordinal α, then
the predictive complexity of P is defined and at most equal to ω × α.
and fix an enumeration (Dp )p<n of Fσ . Note that for all distinct p, p′ < n,
ModW (Dp ) ∩ ModW (Dp′ ) = ∅. Given m < n, set σm = 1^{m−1} 0 1^{n−m−1} . Now
for all σ ∈ Z, m < n and p < n, set hi (σ σm |p ) = ¬⋀ Dp ; for all σ ∈ Z and
m < n, set hi (σ σm ) = ϕi if Dm |= ϕi , and hi (σ σm ) = ¬ϕi otherwise. Set
h = ⋃i∈N hi . It is easily verified that h succeeds in P, and that the length of Rh
is defined and at most equal to α + 1.
By Proposition 13, β ≤ ω × (α + 1). So to complete the proof of the propo-
sition, it suffices to show that ω × α ≤ β. Remember the definition of the se-
quence (Γα )α∈Ord from Notation 7. By Property 8, W ∈ Γβ . Define an identi-
fier f as follows. Let σ ∈ (D ∪ {#})∗ be given. Let γ be the least ordinal with
ModW (cnt(σ)) ∈ Γγ .
– If there exists τ ⊂ σ such that f (τ ) is defined and equal to the D-diagram
of a model of cnt(σ) in W, then f (σ) = f (τ ).
– Otherwise, and if there exists M ∈ W such that for all ϕ ∈ D, M is a model
of ϕ iff ModW (cnt(σ) ∪ {ϕ}) ∉ ⋃γ′<γ Γγ′ , then f (σ) = DiagD (M).
– Otherwise f (σ)↑.
Let σ ∈ (D ∪ {#})∗ be such that f (σ)↓, and let γ be the least ordinal with
ModW (cnt(σ)) ∈ Γγ . Suppose for a contradiction that γ is neither 0 nor a limit
ordinal. It then follows from the definition of Γγ that there exists ϕ ∈ D such that
Γγ−1 contains neither ModW (cnt(σ) ∪ {ϕ}) nor ModW (cnt(σ) ∪ {¬ϕ}), which
is impossible by the definition of f . Let λ be the number of limit ordinals less
than or equal to β. We infer easily from the previous remark that α + 1 ≤ λ + 1,
hence ω × α ≤ ω × λ ≤ β, and we are done.
We illustrate the previous results, especially the bounds that have been ob-
tained, with a few examples.
Example 16. Let V consist of 0, s, and a unary predicate symbol P . Let W
be the set of Herbrand models of ∀x(P (x) ∧ P (s(x)) → P (s(s(x)))) ∧ ∃y(P (y) ∧
P (s(y))). Then the identification complexity of P is 1 but the predictive com-
plexity of P is undefined.
Example 17. Assume that D is closed under Boolean operators. If the VC-
dimension of P is a nonnull n ∈ N then both the identification and the selective
complexity of P are 1, and the predictive complexity of P is n.
easily verified that the VC-dimension and the predictive complexity of P are at
most equal to ω. To see that they are at least equal to ω, let a nonnull n ∈ N be
given, and let p0 , p1 , . . . , p2^n −1 be an enumeration of the first 2^n prime numbers.
For all k < n, let qk be the product of all pi , 0 ≤ i < 2^n , such that the (k + 1)st
bit in the binary representation of i is equal to 1. It is immediately verified that
W shatters {P (q0 ), P (q1 ), . . . , P (qn−1 )}, implying that the VC-dimension of P
is at least equal to n, hence the predictive complexity of P is also at least equal
to n.
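The combinatorial core of this construction can be checked directly: a prime pi divides qk exactly when the (k + 1)st bit of i is 1, so the divisibility patterns of the 2^n primes realize all 2^n subsets of {q0 , . . . , qn−1 }. A small Python verification (an illustration, not part of the paper):

```python
def first_primes(m):
    """The first m primes, by trial division (adequate for small m)."""
    primes, cand = [], 2
    while len(primes) < m:
        if all(cand % p for p in primes):
            primes.append(cand)
        cand += 1
    return primes

def shattering_witness(n):
    """Build q_0,...,q_{n-1} as in the example: q_k is the product of those
    p_i, 0 <= i < 2**n, whose (k+1)st binary bit is 1.  Return True iff the
    divisibility patterns (p | q_0, ..., p | q_{n-1}) over all 2**n primes
    realize every one of the 2**n possible patterns."""
    ps = first_primes(2 ** n)
    qs = [1] * n
    for k in range(n):
        for i, p in enumerate(ps):
            if (i >> k) & 1:
                qs[k] *= p
    patterns = {tuple(q % p == 0 for q in qs) for p in ps}
    return len(patterns) == 2 ** n
```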
Proposition 22. Suppose that D is the closure under negation of some set of
sentences D′ , and set P′ = (V, D′ , W). Assume that W is compact, the iden-
tification complexity of P is defined and equal to nonnull ordinal α, and some
identifier is successful in P′ . Then the identification complexity of P′ is defined
and smaller than ω^α .
Proof. The proof is trivial if {DiagD (M) : M ∈ W} is finite, so suppose oth-
erwise. Let (ψi )i∈N be a repetition free enumeration of D′ . Let X be the set of
finite sequences of the form (ϕ0 , . . . , ϕn−1 ), n ∈ N, where for all i < n, ϕi = ψi
or ϕi = ¬ψi , such that {ϕ0 , . . . , ϕn−1 } is consistent in W. Let f be a canonical
identifier for P. Let a finite subset E of D be consistent in W. Denote by UE the
set of all ⊆-minimal members σ of X such that f (σ) is defined and contains E.
We infer that Ucnt(τ ) is nonempty and distinct from Ucnt(σ) , which we know
implies that βcnt(σ) > βcnt(τ ) . It then follows that the height of Rf is defined and
at most equal to ω^α . Moreover, since W is compact, the reasoning in Proposition
15 shows that the identification complexity of P′ is not a limit ordinal, hence is
smaller than ω^α , as wanted.
It was shown in [11] that every class of languages that is finitely identifiable
from informants is also identifiable in the limit from texts. But such a relationship
does not generalize to identifiability from informants with one mind change at
most. Indeed, the class C consisting of N and all finite initial segments of N is not
learnable in the limit from texts, as proved in [8], whereas C is clearly learnable
from informants with one mind change at most. Cast into the logical framework,
5 Conclusion
In ideal paradigms of inductive inference, finite tell-tale conditions offer char-
acterizations of identification in the limit or of classification, with or without
a mind change bound. Assuming that the set of data is closed under Boolean
operators, the VC-dimension offers a characterization of prediction. An extra
topological assumption of compactness makes it possible to provide a complete
picture of the relationship between VC-dimension and other notions of complexity,
including mind change bound complexity.
References
1. Ambainis, A., Jain, S., Sharma, A.: Ordinal mind change complexity of language
identification. Theoretical Computer Science. 220(2) pp. 323–343 (1999)
2. Ambainis, A., Freivalds, R., Smith, C.: Inductive Inference with Procrastination:
Back to Definitions. Fundamenta Informaticae. 40 pp. 1–16 (1999)
3. Angluin, D.: Inductive Inference of Formal Languages from Positive Data. Infor-
mation and Control 45 pp. 117–135 (1980)
4. Ben-David, S., Gurvits, L.: A note on VC-Dimension and Measure of Sets of Reals.
Combinatorics Probability and Computing 9, 391–405 (2000)
5. Ben-David, S., Jacovi, M.: On Learning in the Limit and Non-Uniform (ε, δ)-
Learning. In Proceedings of the Sixth Conference on Computational Learning The-
ory. ACM Press pp. 209–217 (1993)
6. Blumer, A., Ehrenfeucht, A., Haussler, D., Warmuth, M.: Learnability and the
Vapnik-Chervonenkis Dimension. J. ACM 36(4) pp. 929–965 (1989)
7. Freivalds, R., Smith, C.: On the role of procrastination for machine learning. In-
form. Comput. 107(2) pp. 237–271 (1993)
8. Gold, E.: Language Identification in the Limit. Information and Control. 10 (1967)
9. Martin, E., Sharma, A., Stephan, F.: A General Theory of Deduction, Induction,
and Learning. In Jantke, K., Shinohara, A.: Proceedings of the Fourth International
Conference on Discovery Science. Springer-Verlag, LNAI 2226 pp. 228–242 (2001)
10. Martin, E., Sharma, A., Stephan, F.: Logic, Learning, and Topology in a Common
Framework. In Cesa-Bianchi, N., Numao, M., Reischuk, R.: Proc. of the 13th Intern.
Conf. on Alg. Learning Theory. Springer-Verlag, LNAI 2533 pp. 248–262 (2002)
11. Sharma, A.: A note on batch and incremental learnability. Journal of Computer
and System Sciences 56 pp. 272–276 (1998)
12. Valiant, L.: A Theory of the Learnable. Commun. ACM 27(11) pp. 1134–1142
(1984)
13. Vapnik, V., Chervonenkis, A.: On the uniform convergence of relative frequencies
of events to their probabilities. Theory of Probabilities and its Applications 16(2)
pp. 264–280 (1971)
Learning of Erasing Primitive Formal Systems
from Positive Examples
1 Introduction
An elementary formal system, EFS for short, is a kind of logic program over
strings consisting of finitely many axioms. A pattern is a finite string of constant
symbols and variables. A pattern is regular, if each variable appears in the pattern
at most once. In EFSs, patterns are used for terms in logic programming.
For example, Γ = {p(ab) ←, p(axb) ← p(x)} is an EFS with two axioms,
where p is a unary predicate symbol, a and b are constant symbols and x is a
variable, where patterns ab and axb are used as terms.
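The language defined by this Γ can be computed by a simple closure: starting from ab, the second axiom wraps any derived string w into awb, so L(Γ ) = {a^n b^n : n ≥ 1}. A short Python sketch (an illustration; the length bound only keeps the enumeration finite):

```python
def efs_language(max_len):
    """Strings derivable in the example EFS {p(ab) <-, p(axb) <- p(x)} up to
    the given length bound: start from "ab" and repeatedly apply the
    induction step w |-> "a" + w + "b"."""
    derived = {"ab"}
    frontier = {"ab"}
    while frontier:
        frontier = {"a" + w + "b" for w in frontier
                    if len(w) + 2 <= max_len} - derived
        derived |= frontier
    return derived
```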
R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 69–83, 2003.
c Springer-Verlag Berlin Heidelberg 2003
70 J. Uemura and M. Sato
τπ = τ {x := π | x appears in τ },
where c(π) is the string obtained from π by substituting the empty string for all
variables.
For Γ = (π, τ ), τ = x means L(Γ ) = L(π). That is, the induction step
p(x) ← p(x) is redundant, and thus we assume τ ≠ x.
A string w ∈ Σ + is a multiple string of a string u, called a component, if
w = u^l for some l ≥ 2, and is a multiple string if there is a component of w.
A component u for w is maximal if there is no component u′ for w satisfying
|u′ | > |u|.
We denote by PFS the set of all PFSs except for PFSs Γ = (ε, τ ) such that
τε is a multiple string.
Definition 3. A PFS Γ is reduced if L(Γ ′ ) ⊊ L(Γ ) for any Γ ′ ⊊ Γ .
3 Inductive Inference
We first give the notion of identification in the limit from positive examples ([4]).
A language class L = L0 , L1 , · · · over Σ is an indexed family of recursive
languages if there is a computable function f : N × Σ ∗ → {0, 1} such that
f (i, w) = 1 if w ∈ Li , otherwise 0, where N = {i | i ≥ 0}. The function f is
called a membership function. Hereafter we confine ourselves to indexed families
of recursive languages.
An infinite sequence of strings w1 , w2 , · · · over Σ is a positive presentation of
a language L, if L = {wn | n ≥ 1} holds. An inference machine is an effective
procedure M that runs in stages 1, 2, · · · , and requests an example and produces
a hypothesis in N based on the examples so far received. Let M be an inference
machine and σ = w1 , w2 , · · · be an infinite sequence of strings. We denote by hn
the hypothesis produced by M at stage n after the examples w1 , · · · , wn are fed
to M . M on input σ converges to h if there is an integer n0 ∈ N such that hn = h
for every n ≥ n0 . M identifies in the limit or infers a language L from positive
examples, if for any positive presentation σ of L, M on input σ converges to h
with L = Lh . A class of languages L is inferable from positive examples if there
is an inference machine that infers any language in L from positive examples.
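As a concrete illustration (a sketch, not a strategy from the paper, and not a universal learner), the classic identification-by-enumeration machine outputs at each stage the least index whose language contains all examples received so far; for the toy family Li = {w ∈ N : w ≤ i} it converges on every positive presentation.

```python
def limit_learner(membership, presentation, stages):
    """Gold-style inference machine sketch: at each stage request one example
    and output the least index i with all examples seen so far in L_i.
    membership(i, w) is the membership function of the indexed family.
    Returns the hypothesis sequence (None when no consistent index is found
    within the search bound; the bound is adequate for the toy family below,
    where examples are natural numbers)."""
    seen, hypotheses = [], []
    pres = iter(presentation)
    for _ in range(stages):
        seen.append(next(pres))
        bound = max(seen) + stages + 1   # finite search bound for the demo
        h = next((i for i in range(bound)
                  if all(membership(i, w) for w in seen)), None)
        hypotheses.append(h)
    return hypotheses
```

On the presentation 1, 0, 2, 2, 2, 2 of L2 the machine outputs 1, 1, 2, 2, 2, 2 and has converged to the correct index 2.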
Angluin [1] gave a theorem characterizing a language class L as being in-
ferable from positive examples if and only if there exists an effective procedure
that, given any index i, enumerates a finite tell-tale set of the language Li .
It means that the language L(Γ ) does not have any finite tell-tale set within the
class.
Angluin [1] gave a very useful sufficient condition for inferability called finite
thickness. The class of erasing regular pattern languages discussed in this paper
was shown to have finite thickness by Shinohara [11], as was the class of nonerasing
pattern languages (Angluin [1]). Wright [16] introduced another sufficient condition for
inferability called finite elasticity more general than finite thickness ([6]). A class
L has finite elasticity, if there is no infinite sequence of strings w0 , w1 , · · · and no
infinite sequence of languages L1 , L2 , · · · in L satisfying {w0 , w1 , · · · , wn−1 } ⊆
Ln but wn ∉ Ln for every n ≥ 1. Finite elasticity has a good property in the sense
that it is closed under various class operations such as union, intersection and
so on (Wright[16], Moriyama & Sato [5], Sato [10]). Shinohara [13] proved that
the class of nonerasing length-bounded EFS languages generated by at most k
clauses has finite elasticity, and so is inferable from positive examples. Mukouchi
[8] showed similarly that the class of erasing RFS languages generated by at most k
clauses has finite elasticity, provided that all regular patterns in heads
of induction steps are in canonical form. Without the canonical form condition,
however, finite elasticity no longer holds, as shown below.
Theorem 1. The class PFSL does not have finite elasticity.
Proof. Define PFSs Γn = (ε, τn ) (n ≥ 1) as follows: τn = a(x1 x2 · · · xn )b for
n ≥ 1, where a, b ∈ Σ, a ≠ b. Then we can show that
the infinite sequence (wn )n≥0 of strings and the infinite sequence (Γn )n≥1 of
PFSs satisfy the above conditions, where w0 = ε, wn = a(ab)^n b (n ≥ 1). Hence
the class PFSL has infinite elasticity.
Moriyama & Sato [5] introduced a notion of M-finite thickness by generalizing
finite thickness. For a nonempty finite set S ⊆ Σ ∗ , we define
4 Reduced PFSs
This section gives a characterization of a reduced PFS. Hereafter, we assume
that |Σ| ≥ 3. Then by Lemma 1, for any regular patterns π and τ ,
π ⪯ τ ⇐⇒ L(π) ⊆ L(τ ).
Let Γ = (π, τ ) be a PFS. If τ ∈ Σ ∗ , clearly L(Γ ) = L(π) ∪ {τ }. Thus the
PFS Γ is reduced if and only if τ ∉ L(π).
Hereafter, we consider a PFS Γ = (π, τ ) with var(τ ) ≠ ∅, where var(τ ) =
{x1 , x2 , · · · , xn } is the set of variables contained in τ . We define Γτ = ⋃∞t=1 Γt ,
where Γt is recursively defined as follows: Γ1 = {τπ } and for each t ≥ 2,
Γt = Γt−1 ∪ {τ {x1 := ξ1 , x2 := ξ2 , · · · , xn := ξn } | ξi ∈ Γt−1 ∪ {π}, i = 1, 2, · · · , n}.
Clearly every ξ in Γτ always contains π as a substring whenever var(τ ) ≠ ∅. By
the definitions of L(Γ ) and the above Γτ , it follows that:
Lemma 2. Let Γ = (π, τ ) be a PFS. Then L(Γ ) = L(π) ∪ L(Γτ ) holds, where
L(Γτ ) = ⋃ξ∈Γτ L(ξ).
By the definition of a reduced PFS, Γ is reduced if and only if L(π) ⊊ L(Γ ),
i.e., L(Γτ ) ⊈ L(π). Clearly if π ∈ Σ ∗ , then Γ is always reduced.
For a pattern π with var(π) ≠ ∅, let us denote by Aπ the longest constant
prefix and by Bπ the longest constant suffix of π. For instance, Aπ = ab and
Bπ = ε if π = abx1 aax2 .
Lemma 3. Let Γ = (π, τ ) be a PFS with var(π) ≠ ∅. For any ξ ∈ Γτ , there is
a pair (i0 , j0 ) of positive integers such that
Aξ = (Aτ )^{i0} Aπ , Bξ = Bπ (Bτ )^{j0} .
For two strings v, w ∈ Σ ∗ , v ⊑p w denotes that v is a prefix of w, and v ⊑s w
means that v is a suffix of w. Moreover, w ⊑p v ∗ means that w ⊑p v^i for some
i ≥ 0. Similarly we define w ⊑s v ∗ .
By Pref, we denote the set of pairs (v, w) of strings satisfying v p w or
w p v. Similarly we define the set Suff.
(i) ∃ i0 ≥ 1 s.t. w ⊑p v^{i0} w ⇐⇒ w ⊑p v ∗ ⇐⇒ ∀i ≥ 0, w ⊑p v^i w,
(ii) ∃ j0 ≥ 1 s.t. w ⊑s wv^{j0} ⇐⇒ w ⊑s v ∗ ⇐⇒ ∀j ≥ 0, w ⊑s wv^j .
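The tests w ⊑p v∗ and w ⊑s v∗ are decidable by a single comparison, since w is a prefix of v^i for some i iff w is a prefix of v repeated ⌈|w|/|v|⌉ times. A Python sketch (an illustration, not from the paper):

```python
def is_prefix_of_power(w, v):
    """Decide w prefixes some power of v: is w a prefix of v**i for some
    i >= 0?  It suffices to compare w with v repeated ceil(len(w)/len(v))
    times."""
    if w == "":
        return True
    if v == "":
        return False
    reps = -(-len(w) // len(v))          # ceiling division
    return (v * reps).startswith(w)

def is_suffix_of_power(w, v):
    """The dual suffix test, by reversing both strings."""
    return is_prefix_of_power(w[::-1], v[::-1])
```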
Theorem 4. Let Γ = (π, τ ) be a PFS. Then the following statements are equiv-
alent:
(i) Γ is reduced. (ii) L(π) ∩ L(Γτ ) = ∅.
(iii) [Aπ ⊑p A∗τ and Aτ ≠ ε] or [Bπ ⊑s Bτ∗ and Bτ ≠ ε], if π ∉ Σ ∗ .
Aπ ⊑p (Aτ )^i Aπ , Bπ ⊑s Bπ (Bτ )^j , i, j = 0, 1, 2, · · · .
(∗) Aπ ⊑p (Aτ )^i Aπ , i = 1, 2, · · · .
Suppose that γ ⋠ π and L(γ) ⊈ L(Γτ ). By L(γ) ⊆ L(Γ ), it implies that L(γ) ∩
L(π) ≠ ∅ and L(γ) ∩ L(ξ) ≠ ∅ for some ξ ∈ Γτ . Then by Lemma 4,
(∗ ∗ ∗) Aξ = (Aτ )^{i0} Aπ .
By (∗), Aπ ⋢p Aξ .
Claim A. If |Aπ | ≤ |Aγ |, then Aπ ⊑p Aξ .
The proof of Claim A. If |Aπ | ≤ |Aγ |, by (∗∗), it follows that Aπ ⊑p Aγ .
If |Aξ | ≤ |Aγ |, by (∗∗), Aξ ⊑p Aγ holds, and so Aπ ⊑p Aξ holds.
Otherwise, i.e., |Aξ | > |Aγ |, by (∗∗), Aγ ⊑p Aξ holds, and so, Aπ ⊑p Aξ
holds.
As mentioned above, since Aπ ⋢p Aξ , |Aπ | > |Aγ | must hold.
Claim B. If |Aπ | > |Aγ |, then L(γ) ⊈ L(Γ ).
The proof of Claim B. By the assumption of our claim, both of the lengths of
Aπ and Aξ are larger than |Aγ |. Let a and b be the (|Aγ | + 1)-th constant symbols
of Aπ and Aξ from the left, respectively. Since |Σ| ≥ 3, there is a symbol c ∈ Σ such that
c ≠ a, b. Let x be the leftmost variable in γ and let
γc = γ{x := cx}.
Then L(γc ) ⊆ L(γ) and Aγc = (Aγ )c hold. Clearly (Aγc , (Aτ )^j Aπ ) ∉ Pref (j =
0, 1, · · · ) holds, and so by Lemma 4, we have L(γc ) ∩ L(π) = ∅. Furthermore, by
Lemma 3, for any ξ′ ∈ Γτ , it implies that L(γc ) ∩ L(ξ′ ) = ∅. Thus L(γc ) ⊈ L(Γ ).
That is, L(γ) ⊈ L(Γ ).
Claim B leads to a contradiction, because L(γ) ⊆ L(Γ ).
Lemma 8. Let Γ = (π, τ ) be a reduced PFS and let γ be a pattern with L(γ) ⊆
L(Γτ ). Then
L(γ) ∩ L(τπ ) ≠ ∅ ⇐⇒ γ ⪯ τπ .
Lemma 10. Let Γ = (α, β) be a reduced PFS and let π be a pattern. Then
L(Γ ) ⊆ L(π) ⇐⇒ α ⪯ π and βα ⪯ π.
In the present section, we show that the class PFSL is inferable from positive
examples. To this end, we show that the class has M -finite thickness and that
each language in it has a finite tell-tale set.
For a pattern π, S(π) denotes the set of strings obtained from π by substituting,
for each variable, either the empty string or a single constant symbol in Σ.
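S(π) is finite and easy to enumerate. In the sketch below (an illustration; the list-based pattern encoding is ours, not the paper's), constants are strings, variables are 1-tuples, and each variable is uniformly replaced by the empty string or by one constant symbol:

```python
from itertools import product

def S(pattern, sigma):
    """Enumerate S(pi): substitute, for each variable of the pattern, either
    the empty string or a single constant symbol from sigma (making the same
    choice at every occurrence of that variable)."""
    variables = list(dict.fromkeys(t for t in pattern if isinstance(t, tuple)))
    results = set()
    for choice in product([""] + list(sigma), repeat=len(variables)):
        sub = dict(zip(variables, choice))
        results.add("".join(t if isinstance(t, str) else sub[t]
                            for t in pattern))
    return results
```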
The following result plays an essential role in our problem on finite tell-tale
set.
Lemma 12. Let w be a nonempty string in Σ + .
(i) If w is a nonmultiple string, then there do not exist strings u, v ≠ ε
satisfying w = uv = vu or wu = vw.
(ii) If u is a maximal component for w, then wv1 = v2 w implies vi ∈ u∗ for
i = 1, 2.
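Components and maximal components are computable by testing each divisor of |w| (a sketch, not from the paper):

```python
def components(w):
    """All components of w: strings u with w == u * l for some l >= 2,
    i.e. all repeating units whose length properly divides len(w)."""
    n = len(w)
    return [w[:d] for d in range(1, n // 2 + 1)
            if n % d == 0 and w[:d] * (n // d) == w]

def maximal_component(w):
    """The longest component of w, or None when w is a nonmultiple string."""
    cs = components(w)
    return max(cs, key=len) if cs else None
```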
Now we consider a PFS Γ = (w, τ ) for w ∈ Σ + . We introduce the following
strings:
ri (τ ) = τ {xi := τw , xj := w | j ≠ i}, i = 1, · · · , n,
where var(τ ) = {x1 , · · · , xn } for n ≥ 1.
Then we have
Note that the string w is the unique shortest string of L(w, τ ) and S(Γ ), and
s (= τw ) is the unique second shortest string of them.
By Lemma 12, it follows that:
Lemma 13. Let Γ = (w, τ ) be a reduced PFS with w ∈ Σ + and var(τ ) =
{x1 , · · · , xn } (n ≥ 1). If s is a nonmultiple string, then ri (τ ) ≠ rj (τ ) for i ≠ j.
T (Γ ) ⊆ L(Γ ′ ) ⊆ L(Γ ).
Proof. We assume that there is a PFS Γ ′ = (α, β) satisfying the inclusion re-
lations in our theorem. Then clearly w and s (= τw ) are the unique shortest
string and the unique second shortest string of L(Γ ′ ). It implies that α = w
and βα = s. Let τ = v0 x1 v1 x2 · · · xn vn and β = v0′ y1 · · · yn′ vn′ ′ for some
vi , vj′ ∈ Σ ∗ (i = 0, · · · , n, j = 0, · · · , n′ ).
(i) A case that s is a nonmultiple string.
In this case, by Lemma 13, ri (τ ) ≠ rj (τ ) for i ≠ j. Clearly the strings ri (τ )
are the third shortest strings of L(Γ ) and S(Γ ). Similarly the strings rj′ (β) are
those of L(Γ ′ ). By our assumption,
{ε, s} ∪ {r̂i (τ ), ti (τ ) | i = 1, · · · , k} ⊆ T (Γ ).
T (Γ ) ⊆ L(Γ ′ ) ⊆ L(Γ ).
Proof. We assume that there is a PFS Γ ′ = (α, β) satisfying the inclusion re-
lations in our theorem. Then clearly α = ε and u0 u1 · · · uk = u0′ u1′ · · · uk′ ′ (= s),
where τ = u0 X1 u1 X2 · · · Xk uk and β = u0′ Y1 · · · Yk′ uk′ ′ for some u0 , u0′ , uk , uk′ ′ ∈
Σ ∗ , ui , uj′ ∈ Σ + (i = 1, · · · , k − 1, j = 1, · · · , k ′ − 1) and Xi , Yj ∈ X + for each
i, j.
Claim A. k = k ′ and ui = ui′ for i = 1, · · · , k.
Since s is a nonmultiple string, the proof of Claim A can be done by showing
r̂i (τ ) ≠ r̂j (τ ) for i ≠ j similarly to the proof of Theorem 7.
Similarly we can prove Xi = Yi for each i.
Proof. Let S ⊆ Σ ∗ be a nonempty finite set. Let lmin and lmax be the shortest
length and the longest length of strings in S, respectively.
Claim A. MIN(S, PFSL) < ∞ holds.
The proof of Claim A. Let L(Γ ) ∈ MIN(S, PFSL) for Γ = (π, τ ). Without
loss of generality, we can assume that Γ is reduced.
As mentioned in §2, every w ∈ L(Γ ) satisfies |w| ≥ min{|c(π)|, |c(τπ )|}.
A case of π ≠ ε. In this case, |c(π)| and |c(τπ )| are less than or equal to lmax .
Since π is of canonical form and the number of variables in τ is bounded by lmax ,
there are at most finitely many such PFSs.
A case of π = ε. In this case, c(τ ) = τπ holds. Similarly to the above, we have
|c(τ )| ≤ lmax , and thus there are at most finitely many such constant strings. Let
us put |c(τ )| = l. Then as easily seen, lengths of strings in L(ε, τ ) are multiples
of l. Thus lmax = kl for some k ≥ 0.
Let us put τ = w0 X1 w1 · · · wn−1 Xn wn . A variable x ∈ var(τ ) is nonerasable
w.r.t. S if S ⊈ L(ε, τ {x := ε}). We can assume that every variable in τ is
nonerasable w.r.t. S, and such a τ is called a nonerasable pattern w.r.t. S.
Clearly |Xi | ≤ k holds for every i. Hence there are at most finitely many PFSs
in MIN(S, PFSL).
Claim B. For any PFS Γ , if S ⊆ L(Γ ), then L(Γ ′ ) ⊆ L(Γ ) for some Γ ′ ∈
MIN(S, PFSL).
The proof of Claim B. Let S ⊆ L(Γ ) for Γ = (π, τ ). We can assume that
Γ is reduced, and L(Γ ) ∉ MIN(S, PFSL). Then we have S ⊆ L(Γ ′ ) ⊊ L(Γ )
for some reduced PFS Γ ′ = (α, β). Similarly to the proof of Claim A, it can be
shown that there are at most finitely many such L(Γ ′ ) containing the set S.
By the above and Theorem 10, we obtain the following main theorem.
References
Frank Balbach
1 Introduction
In inductive inference a scenario is investigated where a learner receives more
and more data about a target object and outputs a sequence of hypotheses. The
learner is successful if its sequence of hypotheses eventually converges to a single
description for the target object. Usually a learner is required to learn all objects
from a (possibly infinite) class.
This model of learning in the limit [11], applied to classes of recursive func-
tions, has been studied thoroughly. Thereby, many variations of the basic model
have been developed [4,19,1,8,12,10,9,13]. All these models are referred to as “in-
ference types.” They often differ in the constraints placed on the intermediate
hypotheses or in the way the sequence of hypotheses has to converge.
Common to all inference types, however, is the need to interpret the hypothe-
ses a learner (“strategy”) outputs. This is usually done by means of a hypothesis
space. Its design is thus of key importance to the learning success. In the induc-
tive inference of recursive functions, hypothesis spaces are numberings of partial
recursive functions. Hypotheses are represented by indices in such numberings.
It is well known that Gödel numberings [15] (acceptable numberings [14])
serve, in many inference types, as universal hypothesis spaces. That is, any solv-
able learning problem can be solved within such a numbering.
Of course, knowing that a solution exists is not sufficient in practical appli-
cations. One rather needs to know how to construct an appropriate learner.
R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 84–98, 2003.
c Springer-Verlag Berlin Heidelberg 2003
Changing the Inference Type – Keeping the Hypothesis Space 85
The inductive inference paradigm can also be viewed from a more practical
point of view. Here, the hypothesis spaces correspond to certain output formats,
or description languages, for hypotheses. The inference types find their counter-
part in various requirements put on the behavior of a learning algorithm and/or
the quality of its hypotheses.
Consider a typical practical learning algorithm that uses a certain description
language for hypotheses and learns according to some requirements.
In practice, requirements are often changing. The output format might be
fixed, however. The first natural question then is whether the algorithm can
still handle the new requirements or whether there is at all an algorithm that
can cope with the new situation. However, the new requirements could be too
demanding for such an algorithm to exist.
But what happens if a learning algorithm for the new problem is known that
(unfortunately) uses a different output format? Is this additional fact enough to
conclude that there must be such an algorithm for the previously fixed format,
too?
In general the possible answers are “yes, this conclusion can be drawn”, “no,
it cannot”, or “it depends...” In case of a positive answer the immediate next
question is how to find such an algorithm. Can it even be constructed effectively
from the previous one?
86 F. Balbach
The following pages give answers to these questions in the general frame-
work of inductive inference of recursive functions, thereby revealing that all
three kinds of answers do indeed occur, depending on the inference types under
consideration.
Section 3 addresses the “no” answers, Sect. 4 presents some “yes” answers,
and Sect. 5 some intermediate results. Finally, Sect. 6 gives an overview of all
results obtained.
2 Preliminaries
Notations not explained herein follow standard conventions [15]. By IN we de-
note the set {0, 1, 2, . . .} of natural numbers; inclusion and proper inclusion are
denoted by ⊆ and ⊂, respectively; card A is the cardinality of the set A. We
denote the set difference of A and B by A \ B.
For n ∈ IN the set of all n-ary partial recursive functions over IN will be
written P n , the set of all recursive functions Rn . As abbreviation for P 1 and
R1 we use P and R, respectively. For f ∈ P, x ∈ IN we write f (x) ↓, if f is
defined on input x and f (x) ↑ otherwise. Functions f, g ∈ P fulfill f =n g iff
{(x, f (x)) | x ≤ n and f (x) ↓} = {(x, g(x)) | x ≤ n and g(x) ↓}.
For i ∈ IN and n ≥ 1, in denotes the n-tuple (i, . . . , i). Functions can be
identified with the sequence of their values, e. g. f = 1n 0∞ means f (x) = 1 for
0 ≤ x < n and f (x) = 0 for x ≥ n.
If for f ∈ P and n ∈ IN the values f (0), . . . , f (n) are defined, we will write
f n for the initial segment (f (0), . . . , f (n)) and implicitly identify every f n with
a natural number via a computable bijective coding function.
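One concrete choice for such a coding (a sketch; the paper only requires the coding to be computable and bijective, and a bijection can be arranged along the same lines) folds the standard Cantor pairing function over the segment and tags the result with its length:

```python
def cantor_pair(x, y):
    """The standard bijection N x N -> N."""
    return (x + y) * (x + y + 1) // 2 + y

def code_segment(values):
    """Code a finite initial segment (f(0), ..., f(n)) as one natural number:
    fold the pairing over the values, then pair with the length so that
    segments of different lengths never collide."""
    c = 0
    for v in values:
        c = cantor_pair(c, v)
    return cantor_pair(len(values), c)
```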
For a tuple α = (α0 , . . . , αn ) over IN and a class U ⊆ R, we write α ⊑ U iff
there is an f ∈ U such that f n = α.
In order to abbreviate certain statements, the symbols ∧ (“and”), =⇒ (“im-
plies”) and ⇐⇒ (“iff”) as well as the quantifiers ∀ (“for all”), ∀∞ (“for all but
finitely many”), and ∃ (“exists”) will be used.
Let ψ ∈ P 2 . Then ψ is called numbering and Pψ := {ψi | i ∈ IN} denotes
the set of the functions enumerated by ψ; for i ∈ IN the function ψi is defined
by ψi (x) := ψ(i, x) for all x. A numbering ϕ ∈ P 2 is called Gödel numbering
(acceptable numbering) iff (1) Pϕ = P and (2) ∀ψ ∈ P 2 ∃c ∈ R ∀i [ψi = ϕc(i) ].
We use ϕ to denote a fixed Gödel numbering. For a function ϕi we will write Si
if the function plays the role of a learning strategy (see below).
Let Φ be a Blum complexity measure [6] for ϕ. For i ∈ IN we will write
ϕi (x) ↓n instead of Φi (x) ↓≤ n.
In the basic learning model, a strategy S ∈ P learns a class U ⊆ R with
respect to a hypothesis space ψ ∈ P 2 . The strategy receives one after another
initial segments f^n of a function f ∈ U as input and generates a sequence of hypotheses S(f^n) as output. Each hypothesis is interpreted as the function ψ_{S(f^n)}.
The basic inference type, learning in the limit [11], gives the learning strategy
the freedom to output whatever it wants, as long as it reaches a point beyond
which the output remains constant as well as correct.
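For intuition, learning in the limit can be illustrated with the classic identification-by-enumeration idea. The sketch below is our own toy example (hypothesis space ψ_i = 1^i 0^∞ and all names are our choices, not taken from the paper); the strategy outputs the least index consistent with the data seen so far:

```python
# Toy illustration of learning in the limit: the class {1^i 0^inf | i in IN}
# with hypothesis space psi_i = 1^i 0^inf. The strategy outputs the least
# index consistent with the data seen so far and eventually stabilizes on a
# correct index for every function of the class.

def psi(i, x):
    """psi_i(x) = 1 for x < i, and 0 otherwise."""
    return 1 if x < i else 0

def strategy(segment):
    """Least i such that psi_i agrees with the given initial segment
    (only called on segments of functions from the class above)."""
    i = 0
    while any(psi(i, x) != v for x, v in enumerate(segment)):
        i += 1
    return i

# Target f = 1^3 0^inf: the hypothesis sequence converges to index 3.
f = [1, 1, 1, 0, 0, 0]
hypotheses = [strategy(f[: n + 1]) for n in range(len(f))]
# hypotheses == [1, 2, 3, 3, 3, 3]
```

After the point of convergence the output never changes and names a correct index, which is exactly the limit-learning requirement.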
Changing the Inference Type – Keeping the Hypothesis Space 87
Lemma 9.
(1) ∀ψ ∈ P 2 [FINψ ⊆ CPψ ⊆ TOTALψ ⊆ CONSψ ⊆ LIMψ ⊆ BCψ ],
(2) ∀ψ ∈ P 2 [CONS-CPψ ⊆ CONS-TOTALψ ⊆ TOTALψ ],
(3) ∀ψ ∈ P 2 [CONS-CPψ ⊆ CPψ ].
Note, however, that I ⊆ J is not sufficient for ∀ψ ∈ P 2 [Iψ ⊆ Jψ ].
Lemma 10.
(1) ∃ψ ∈ P² [TOTALψ ⊈ CONS-TOTALψ],
(2) ∃ψ ∈ P² [CPψ ⊈ CONS-TOTALψ].
Proof. Let U ∈ TOTAL. The hypothesis space ψ will be defined via diagonaliza-
tion against all (TOTAL) learning strategies. The functions in ψ will be grouped
into consecutive blocks Zj of increasing size. Within the j-th block, which con-
tains j + 2 functions, diagonalization against the j + 1 strategies S0 , . . . , Sj
happens.
The functions in the block will be defined, argument by argument, to equal
ϕj . Meanwhile the output of the strategies S0 , . . . , Sj on initial segments of ϕj
is watched. As soon as an Si is found to output a hypothesis z from within the
j-th block, the definition of ψz is stopped, resulting in ψz ∈ P \ R. From now
on neither Si nor ψz are taken into account during the ongoing definition of the
j-th block.
The formal algorithm for the j-th block is given below.
L_0^{(j)} := {0, . . . , j}
G_0^{(j)} := Z_j
x := 0
While ϕ_j^x ↓ do:
(1) For all z ∈ G_x^{(j)}: ψ_z(x) := ϕ_j(x)
(2) For all z ∈ Z_j \ G_x^{(j)}: ψ_z(x) := ↑
(3) P_x^{(j)} := {(ℓ, y) | ℓ ∈ L_x^{(j)} ∧ y ≤ x ∧ S_ℓ(ϕ_j^y) ↓_x ∈ G_x^{(j)}}
(4) G_{x+1}^{(j)} := G_x^{(j)} \ {S_ℓ(ϕ_j^y) | (ℓ, y) ∈ P_x^{(j)} ∧ y = min{z | (ℓ, z) ∈ P_x^{(j)}}}
(5) L_{x+1}^{(j)} := L_x^{(j)} \ {ℓ | ∃y [(ℓ, y) ∈ P_x^{(j)}]}
(6) x := x + 1
Note that, since the indices outnumber the strategies, there must remain at
least one z ∈ Zj such that ψz = ϕj .
To prove U ∉ TOTALψ, we assume an S_i such that U ∈ TOTALψ(S_i).
Since U is infinite, there must be an f ∈ U such that S_i converges on f to
an index k ∈ Z_j for some j ≥ i. All total functions in the j-th block equal ϕ_j,
hence f = ψ_k = ϕ_j. Therefore, S_i outputs on ϕ_j almost always indices in Z_j.
Thus, either S_i or k (or both) will be "eliminated." In either case, S_i outputs a
non-total hypothesis on f ∈ U, contradicting the assumption.
In order to show U ∈ CONSψ, let R ∈ P be a strategy such that U ∈
CONS-TOTALϕ(R). A strategy T learning U consistently in ψ works as follows:

T(f^n) := min G_n^{(R(f^n))}.

For f ∈ U, ϕ_{R(f^n)} is total, hence G_n^{(R(f^n))} exists and can be computed using
the algorithm above.
Let f ∈ U and j = lim R(f^n) be the final hypothesis of R on f. Then
T converges to the least k ∈ Z_j not "eliminated." It follows that ψ_k =
ϕ_j, hence T converges correctly. That the intermediate hypotheses of T are
consistent is a consequence of R being a consistent strategy for U in ϕ.
If consistent learnability of a class is not sufficient for its total learnability,
then neither LIM nor BC learnability are. Moreover, consistent learnability can-
not be sufficient for learning in a CONS-TOTAL, CP, CONS-CP, or FIN way.
Corollary 12. Let I ∈ {CONS, LIM, BC} and J ∈ {FIN, CONS-CP, TOTAL,
CONS-TOTAL, CP} be inference types and let U ∈ J be an infinite class. Then
a hypothesis space ψ ∈ P² exists such that U ∈ Iψ \ Jψ.
A closer look at the proof of Theorem 11 reveals that the constructed hypoth-
esis space ψ does not depend on the class U . In ψ no infinite class U ∈ TOTAL
can be learned totally. But any such U can be learned consistently in ψ.
Even more is true: The hypothesis space ψ does in fact allow for the full
learning power of consistent learning as well as of limit learning in general.
Behaviorally correct learning, on the other hand, does not increase the learning
power further.
Corollary 13. There is a hypothesis space ψ satisfying
(1) TOTALψ = {U ⊆ R | card U < ∞},
(2) CONSψ = CONS,
(3) LIMψ = LIM,
(4) BCψ = LIM.
Proof. Let ψ be the hypothesis space constructed in the proof of Theorem 11.
(1) is obvious, as well as the ⊆-part of (2). In order to prove CONSψ ⊇ CONS
let U ∈ CONS be a class and R ∈ P a strategy such that U ∈ CONSϕ (R). We
define T as in the proof of Theorem 11 with the only difference that R is a CONS
strategy for U .
The proof of (3) proceeds similarly to that of (2). In order to show U ∈ LIMψ
for a U ∈ LIM we use a strategy R ∈ R with U ∈ LIMϕ(R) and define T as
follows:

T(f^n) := "For m = 0, . . . , n compute the sets G_m^{(R(f^m))} for at most n steps.
Let m be the greatest index such that the computation is finished.
Output min G_m^{(R(f^m))}."

This modification is necessary since ϕ_{R(f^n)} need not be defined up to n,
hence G_n^{(R(f^n))} need not exist.
(4) LIM = LIMψ ⊆ BCψ follows from (3) and the definitions of BC and
LIM. In order to show BCψ ⊆ LIM we assume a U ∈ BCψ with U ∉ LIM. Let
S ∈ R be such that U ∈ BCψ(S). Then there must be a function g ∈ U such
that card {S(g^n) | n ∈ IN} = ∞ (otherwise one could amalgamate the finitely
many indices of each function of U and learn U in the limit, in contradiction to
our assumption).
Recall the grouping of the indices of ψ in blocks Z_j. The hypotheses of S on
g reach infinitely many such blocks. We assume without loss of generality that
S outputs on g at most one index from each such block, that is,
∀n [S(g^n) ∉ ⋃_{m<n} Z_{S(g^m)}]. (One can construct an S with this property
from an S without it.)
"survives."
Now there must be an x such that (i, m) ∈ P_x^{(j)} because S_i(ϕ_j^m) = k. Hence,
k will be selected for "elimination" in step (4) since (i, m) ∈ P_x^{(j)} and m =
min{z | (i, z) ∈ P_x^{(j)}} (remember that, by assumption on S, g^m is the only
∀i ∈ IN: (1) ϕ_{ρ(i)} ∈ R,
(2) ∀ψ ∈ P² ∀U ⊆ R [U ∈ LIMψ(ϕ_i) =⇒ U ∈ LIMψ(ϕ_{ρ(i)})].
That is, for every LIM strategy ϕi , ρ effectively constructs an everywhere defined
LIM strategy ϕρ(i) with at least the same learning power.
We abbreviate the strategies from the last lemma by S̄_i := ϕ_{ρ(i)}.
Theorem 15. Let U ∈ LIM be an infinite class. Then there is a hypothesis
space ψ ∈ P 2 such that U ∈ BCψ \ LIMψ .
Proof. This proof uses the same blockwise grouping of indices as the proof of
Theorem 11. However, diagonalization happens against all strategies S̄_i instead
of S_i. For every j ∈ IN the functions with indices z ∈ Z_j are defined as follows:

ψ_z(x) := ϕ_j(x), if ∀i ≤ j ∃n ≥ x [ϕ_j^n ↓ ∧ S̄_i(ϕ_j^n) ≠ z], and ψ_z(x) := ↑ otherwise.
Proof. Let R ∈ R such that U ∈ FINϕ (R). We need to modify the construction
of the proof of Theorem 11 in such a way that the class U will be taken into
account. This is inevitable, as will be shown in Theorem 18.
The construction of the j-th block starts by watching R on the function ϕj .
Only when (and if) R converges finitely to the index j, a construction very similar
to that of the proof of Theorem 11 takes place. For the sake of completeness the
algorithm constructing the block with indices from Zj is given below.
x := 0
While both (A) ϕ_j^x ↓ and (B) R(ϕ_j^x) = ? do:
    For all z ∈ Z_j: ψ_z(x) := ϕ_j(x)
    x := x + 1

Case 1: (A) and (B) hold for all x. Then ∀z ∈ Z_j [ψ_z = ϕ_j] and R does
not learn ϕ_j, hence ϕ_j ∉ U.
Case 2: There is an x_0 such that (A) does not hold. Then
∀z ∈ Z_j ∀x ≥ x_0 [ψ_z(x) ↑] and therefore ∀z ∈ Z_j [ψ_z ∉ U].
Case 3: There is an x_0 such that (A) holds, but (B) does not, i. e.
ϕ_j^{x_0} ↓ and R(ϕ_j^{x_0}) ∈ IN. Then ∀z ∈ Z_j [ψ_z =^{x_0−1} ϕ_j] follows
and two cases are distinguished:
Case 3.1: R(ϕ_j^{x_0}) ≠ j. Then define for all z ∈ Z_j and x ≥ x_0: ψ_z(x) := ↑.
Case 3.2: R(ϕ_j^{x_0}) = j. Then the construction goes on in the following way.
    L_{x_0}^{(j)} := {0, . . . , j}
    G_{x_0}^{(j)} := Z_j
    x := x_0
    While ϕ_j^x ↓ do:
        Perform steps (1) to (6) as in the proof of Theorem 11.
The proofs of U ∈ TOTALψ and U ∉ FINψ are not very different from the
corresponding ones in Theorem 11, but somewhat more technical, and will be
omitted due to space constraints.
is a class of total functions since S only outputs indices of total functions. Furthermore,
the test whether ψ_{S(0^n 1^m)} extends 0^n 1 can be carried out effectively for
m = 1, 2, . . . since the indices to be simulated are total. Thus U_2 is well-defined
and recursively enumerable.
For every n there is exactly one function in U_2 which extends 0^n 1; thus U_2
is infinite. Furthermore, U_2 is in FINψ: on input which does not start with 0^n 1,
one outputs "?"; on input that starts with 0^n 1, one outputs the ψ-index of the
unique function in U_2 which extends 0^n 1.
4 Positive Results
The previous section might look somewhat discouraging. However, not every
property (I → J) can be proved to have minimal scope. On the contrary, there are
inference types I and J such that the scope of (I → J) is maximal.
Theorem 19. If U ∈ FIN and ψ ∈ P 2 such that U ∈ CPψ , then U ∈ FINψ .
Proof. Let R, S ∈ R be such that U ∈ FINϕ(R) and U ∈ CPψ(S). Define

T(f^n) := "If R(f^n) ∈ IN ∧ ∃m ≤ n [ψ_{S(f^m)}^n ↓ = f^n] then:
    m := min{m ≤ n | ψ_{S(f^m)}^n ↓ = f^n}.
    Output S(f^m)."
Proof. The proof is similar to that of Theorem 19. Let R ∈ P be such that
U ∈ CONS-CPϕ (R) and S ∈ R such that U ∈ CPψ (S). Define
CONS-CP   +   +   +   +                +   +            +
CP        +   +   +   −(U ∈ CONS-CP)   +   +            +
CONS      −   −   −   −                −   +            +
LIM       −   −   −   −                −   −(U ∈ CP)    +
BC        −   −   −   −                −   −            −
Table 1 tries to present the numerous results of this paper in a clear manner.
Results stated in that table, but not proved within this paper, can be obtained
via techniques similar to those presented in the previous sections. Note that the
scope of (I → J ) has not been fully characterized for every such property.
1 Introduction
In many data mining applications, one is faced with the situation that a binary
classification of elements with a large number of attributes only depends on a
small subset of these attributes. A central task is then to infer these relevant
attributes from a given input sample consisting of a series of examples X(k) =
(x1 (k), . . . , xn (k)) with classifications y(k) for k = 1, 2, . . . , m, i.e., one wants
to find a set of variables xi1 , . . . , xid such that the sample can be explained
by a function f : {0, 1}n → {0, 1} that depends only on these d variables. A
function f is said to explain the sample, if f (x1 (k), . . . , xn (k)) = y(k) for all
k. Moreover, since real data usually contain noise, it is of particular interest to
design algorithms that in some sense behave ‘robustly’ with respect to input
disturbances.
When inferring relevant attributes, two natural questions can be asked:
Supported by DFG research grant Re 672/3.
R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 99–113, 2003.
c Springer-Verlag Berlin Heidelberg 2003
100 J. Arpe and R. Reischuk
on-line setting, one tries to minimize the number of examples for which the
current hypothesis turns out to be wrong. Several ways are known to
convert on-line algorithms with low mistake bounds into efficient PAC learning
algorithms (see [4,14,13]). In this paper, we consider the finite exact learning
model: From a randomly selected sample of small size, we have to compute a
single hypothesis that with high probability has to be correct (with accuracy 1).
Recently, Mossel, O'Donnell, and Servedio [16] have introduced an algorithm
that exactly learns the class of concepts f with n input variables and d relevant
attributes (also called d-juntas) under uniform distribution with confidence 1 − δ
in time (n^d)^{ω/(ω+1)} · poly(n, 2^d, log(1/δ)), where ω < 2.376 is the matrix multiplication
exponent. The Target Ranking algorithm we introduce runs in time
O(m2 n) on samples of size m. In order to achieve confidence 1 − δ, we roughly
need c · log(1/δ) · log n examples, where c depends on the base function f˜ (i.e.,
the restriction of f to its relevant variables), the number of relevant attributes
d, and the probability distribution according to which the examples are drawn.
In particular, restricting to the uniform distribution, for arbitrary f satisfying
a certain statistical property, c can be bounded by poly(2d ). In this case we
are able to exactly infer the relevant attributes with confidence 1 − δ in time
n · poly(log n, 2d , log(1/δ)).
Due to space limitations, most proofs have to be omitted. Details are pre-
sented in [7].
2 Preliminaries
A concept is a Boolean function f : {0, 1}n → {0, 1}, a concept class is a set of
concepts. A concept f : {0, 1}n → {0, 1} depends on variable xi , if the two (n−1)-
ary subfunctions fxi =0 and fxi =1 with variable xi fixed to 0 and 1 respectively
are not identical. If f depends on xi , then attribute xi is called relevant for f ,
otherwise irrelevant. We denote the set of relevant (resp. irrelevant) attributes
by V + (f ) (resp. V − (f )). If f is clear from the context, we just write V + and
V − . We denote by f˜ the restriction of f to its relevant variables and call it
the base function of f . An example is a vector (x1 , . . . , xn ; y) ∈ {0, 1}n+1 . It
is an example for f , if y = f (x1 , . . . , xn ). The values of x1 , . . . , xn are called
variable or attribute assignments, whereas the value for y is called a label. A
sequence (x1 (k), . . . , xn (k); y(k)) (k = 1, . . . , m) of examples for f is called a
sample for f of size m, and f is said to explain the sample. A sample T is a
sequence of examples such that there exists some f that explains the sample. If f
depends only on variables from the set {xi1 , . . . , xid }, then we also say that these
variables explain T. A sample is stored in a matrix, each line of which represents
one example:

T = (X; y) = ( x₁(1) · · · xₙ(1) | y(1) )
             (   ⋮          ⋮    |  ⋮   )  ∈ {0, 1}^{m×(n+1)}, where
             ( x₁(m) · · · xₙ(m) | y(m) )
X is the submatrix consisting of the variable assignments in the examples, and
y is the column vector containing the labels of the examples. A sample T may
contain a certain combination of attributes several times. Then, of course, it is
necessary that for k ≠ l the implication (1) holds.
Indeed, if (1) does not hold for some k ≠ l, then by definition, T is not a
sample. In the noisy case, however, it may well be that different combinations of
attributes yield different labels, but due to false measurements of the attributes,
the values for x1 , . . . , xn all look the same.
We assume that the examples of a sample T are drawn according to a fixed
probability distribution p : {0, 1}n → [0, 1], and we say that T is generated
according to p.
3 Approximability
Note that in order to find a small set of explaining attributes for an INFRA
instance, we do not have to explicitly define a corresponding concept f, but it is
enough to find a set of attributes x_{i₁}, . . . , x_{i_d} such that for all k, l ∈ {1, . . . , m} with
y(k) ≠ y(l) there exists r ∈ {1, . . . , d} with x_{i_r}(k) ≠ x_{i_r}(l), by Proposition 1.
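This criterion can be checked directly; the following sketch is our own rendering (function and variable names are hypothetical) of a test whether a candidate attribute set explains a sample:

```python
# Check whether a candidate set of attributes explains a sample, via the
# criterion of Proposition 1: every pair of examples with different labels
# must differ in at least one chosen attribute.

def explains(attrs, X, y):
    """attrs: column indices; X: list of 0/1 rows; y: list of 0/1 labels."""
    m = len(X)
    for k in range(m):
        for l in range(k + 1, m):
            if y[k] != y[l] and all(X[k][i] == X[l][i] for i in attrs):
                return False
    return True

# Here y equals x0, so x0 alone explains the sample but x1 does not:
X = [[0, 1], [1, 1], [1, 0]]
y = [0, 1, 1]
assert explains([0], X, y)
assert not explains([1], X, y)
```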
In order to obtain results on the approximability of INFRA, we consider the
well-studied Set Cover problem. Note that Proposition 1 yields a reduction
from INFRA to Set Cover. Based on this fact, Akutsu and Bao [2] have proved
the following theorem:
Theorem 1 ([2]). INFRA can be approximated in polynomial time within a
factor of 2 ln m + 1.
Robust Inference of Relevant Attributes 103
We apply some modifications of this algorithm and analyze their effects. The
strategy is based on a ranking of the sets S1 , . . . , Sn by their cardinalities which
is done by the procedure Rank Sets, see Fig. 2.
for i = 1 to n do
    S_i := {{k, l} ∈ S | x_i(k) ≠ x_i(l)}
compute π : {1, . . . , n} → {1, . . . , n} such that |S_{π(1)}| ≥ |S_{π(2)}| ≥ · · · ≥ |S_{π(n)}|
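A direct Python rendering of Rank Sets might look as follows (a sketch; the representation of S as a set of index pairs and all names are our own choices):

```python
# Rank Sets: collect, for each attribute x_i, the set S_i of example pairs
# {k, l} with different labels that x_i separates, then order the attributes
# by decreasing |S_i|. S is represented as a list of frozenset index pairs.

def rank_sets(X, y):
    m, n = len(X), len(X[0])
    S = [frozenset({k, l}) for k in range(m) for l in range(k + 1, m)
         if y[k] != y[l]]
    sets = [{e for e in S if X[min(e)][i] != X[max(e)][i]} for i in range(n)]
    pi = sorted(range(n), key=lambda i: len(sets[i]), reverse=True)
    return pi, sets

X = [[0, 0], [1, 0], [1, 1]]
y = [0, 1, 1]
pi, sets = rank_sets(X, y)
# x0 separates both label-crossing pairs, x1 only one, so pi == [0, 1]
```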
The results may be worse in some cases, since the new greedy approach is
based on a single static ranking. However, we show that the ranking still yields
Note that, given a sample, all three algorithms terminate after a finite number
of steps since by property (1), each pair {k, l} ∈ S belongs to some Si . Clearly,
all algorithms presented here compute a cover of S. Thus by the reduction given
in Proposition 1, the algorithms work correctly for the optimization problem,
i.e., they output sets of variables that explain the input sample. It is not hard
to construct instances showing that in general none of the algorithms is superior
to the others in terms of finding small sets of explaining variables.
It may be the case that an input sample for some concept f can be explained
by a proper subset of the relevant variables for f . In case the number d of
relevant variables is a priori known, we can overcome this problem by giving d
as additional input to the algorithms and output the d variables with the largest
(resp., smallest) sets Si . This is done by the Target Ranking and the Modest
Target Ranking algorithms (see Fig. 4).
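A minimal sketch of this variant under the assumptions above (our own rendering, not the paper's code): given d, it returns the d attributes with the largest sets S_i, while Modest Target Ranking would take the smallest.

```python
# Target Ranking (sketch): given the number d of relevant attributes,
# output the d attributes with the largest sets S_i; Modest Target Ranking
# would instead take the d attributes with the smallest sets.

def target_ranking(X, y, d):
    m, n = len(X), len(X[0])
    def size(i):
        return sum(1 for k in range(m) for l in range(k + 1, m)
                   if y[k] != y[l] and X[k][i] != X[l][i])
    return sorted(range(n), key=size, reverse=True)[:d]

# y depends only on x0 here, and x0 wins the ranking:
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]
assert target_ranking(X, y, 1) == [0]
```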
[Figure: for an attribute x_i, the example nodes are partitioned into four classes
K_i^{(0,0)}, K_i^{(0,1)}, K_i^{(1,0)}, K_i^{(1,1)} according to the pair (x_i(k), y(k));
the edges in S_i join classes with different labels and different x_i-values, while
the edges not in S_i join classes with different labels but equal x_i-values.]
Proof. The proof requires lengthy calculations and case distinctions. It is based
on standard Chernoff bound techniques and can be found in [7].
The following theorem provides very general conditions that guarantee the suc-
cess of the ranking algorithms with respect to a concept f :
Theorem 3. Let f : {0, 1}n → {0, 1} depend on xi1 , . . . , xid , let T be a sample
for f generated according to a probability distribution p : {0, 1}n → [0, 1], and
let c > 0.
Proof. We only prove part (a), since (b) can be done analogously. Let
t = ½ (min_{x_i ∈ V⁺} α_i + max_{x_j ∈ V⁻} α_j) m². Then for x_i ∈ V⁺ it holds that

Pr(|S_i| ≤ t) ≤ Pr(|S_i| ≤ (α_i − ε/2) m²) ≤ Pr(| |S_i| − α_i m² | ≥ (ε/2) m²)
            ≤ 8 e^{−(1/3)(ε/2)² m} = 8 e^{−(1/12) ε² m},

and analogously Pr(|S_j| ≥ t) ≤ 8 e^{−(1/12) ε² m} for x_j ∈ V⁻. Hence

Pr( min_{x_i ∈ V⁺} |S_i| > max_{x_j ∈ V⁻} |S_j| ) ≥ Pr( min_{x_i ∈ V⁺} |S_i| > t ∧ max_{x_j ∈ V⁻} |S_j| < t )
  = 1 − Pr( ∃x_i ∈ V⁺: |S_i| ≤ t ∨ ∃x_j ∈ V⁻: |S_j| ≥ t )
  ≥ 1 − ( Σ_{x_i ∈ V⁺} Pr(|S_i| ≤ t) + Σ_{x_j ∈ V⁻} Pr(|S_j| ≥ t) )
  ≥ 1 − 8 n e^{−(1/12) ε² m}.

If m ≥ (12/ε²)((c + 1) ln n + ln 8), then 8 n e^{−(1/12) ε² m} ≤ n^{−c}, thus the claim follows.
As Theorem 3 is stated for a general setting, let us now consider some typical
input distributions and simplify its conditions in these cases.
(a) If q ≤ 1/2, then it holds that α⁺ > α⁻. Thus the success ratio for Target
Ranking may be raised arbitrarily close to 1 by choosing a large enough
sample size m ∈ O(log n).
(b) If q = 1/2, then it holds that α⁺ − α⁻ = 2^{−2d−1} > 0. Thus the success ratio for
Target Ranking may be raised arbitrarily close to 1 by choosing a large
enough sample size m ∈ O(log n), with the constant being of order 2^{4d}.
(c) If q > 1/2, then for sufficiently large d, we have α⁺ < α⁻. Thus the success
ratio for Modest Target Ranking may be raised arbitrarily close to 1 by
choosing a large enough sample size m ∈ O(log n).
Sketch of proof: A straightforward calculation yields α⁺ − α⁻ = (d−1 choose t−1)² · 2^{−2d−1}.
Now Theorem 3 yields the claim.
If t = d in the previous theorem, then f = AND, and we recover our result from
Theorem 4, part (b). Moreover, under uniformly distributed inputs, the gap
between α⁺ and α⁻ for threshold functions is smallest for t ∈ {1, d}. The largest
such gap is reached for t = ⌈d/2⌉, the majority function. Since (d−1 choose ⌈d/2⌉−1) ∈ Θ((1/√d) · 2^d),
we have α⁺ − α⁻ ∈ Θ(d^{−1}). Applying Theorem 3, this proves the following
Corollary 2 (Majority function). Let f : {0, 1}^n → {0, 1} such that its base
function f̃ : {0, 1}^d → {0, 1} is the majority function. Then Target Ranking
is successful with high probability, provided that m ∈ Ω(d² · log n).
For symmetric Boolean functions, one cannot always guarantee α⁺ ≠ α⁻,
even for UDA-generated samples. A simple counterexample is the parity function
f(x₁, . . . , xₙ) = (x_{i₁} + . . . + x_{i_d}) mod 2, for which α_i = 1/8 for all i ∈
{1, . . . , n}, no matter whether x_i ∈ V⁺ or x_i ∈ V⁻. Thus the ranking strategies
do not work for the parity function. We provide an alternative solution for such
concepts in Sect. 8.
7 Robust Inference
As real data usually contain noise, our ultimate goal is to handle cases in which
the attribute values underlie certain disturbances. More precisely, we assume
that in each input example, attribute x_i is flipped with probability δ_i, i.e., an
algorithm obtains a possibly flipped value x̂_i(k) instead of the correct value x_i(k).
We call the resulting set of disturbed examples a δ-disturbed sample, where
δ = (δ₁, . . . , δₙ). Note that this assumption introduces a linear number of faults
(with respect to the number of attributes).
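Generating such a δ-disturbed sample is straightforward; the sketch below is our own illustration (the fixed seed and the extreme flip rates 0 and 1 are chosen only to make the effect deterministic and visible):

```python
# Generating a delta-disturbed sample (sketch): each attribute value is
# flipped independently with probability delta_i.

import random

def disturb(X, delta, rng):
    return [[x ^ (1 if rng.random() < delta[i] else 0)
             for i, x in enumerate(row)] for row in X]

rng = random.Random(0)
X = [[0, 1, 0], [1, 1, 1]]
delta = [0.0, 1.0, 0.0]
Xd = disturb(X, delta, rng)
assert [row[1] for row in Xd] == [0, 0]   # delta=1: column always flipped
assert [row[0] for row in Xd] == [0, 1]   # delta=0: column unchanged
```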
Fortunately, it can be shown that the ranking algorithms still perform well,
if they are given such disturbed samples. The key idea in this case is to examine
how much the sets Si computed by the ranking algorithms deviate from the Si ’s
intended by the real data. We denote the sets derived from the disturbed data
by Ŝi . Furthermore, for i ∈ {1, . . . , n}, let
Sketch of proof: We use the inequality | |Ŝ_i| − α_i m² | ≤ | |Ŝ_i| − |S_i| | + | |S_i| − α_i m² |
and compute the probability that each of the summands on the right-hand side is
bounded by (ε/2) m². Combinatorial investigations yield |Ŝ_i| − |S_i| ≤ m |F_i| + ½ |F_i|².
In particular, if |F_i| ≤ (1/3) ε m, then |Ŝ_i| − |S_i| ≤ ½ ε m². From standard Chernoff
bounds, it follows that Pr(|F_i| ≥ (1/3) ε m) ≤ e^{−(1/3)·(ε/6)·m} = e^{−(1/18) ε m}, since δ_i < ε/6 (|F_i|
Besides the general information theoretic problem that a sample may already
be explained by a proper subset of the relevant variables, just the opposite
phenomenon can occur due to disturbances: Ŝ may not be covered by Ŝ1 , . . . , Ŝn ,
so Greedy, Greedy Ranking, and Modest Ranking – as introduced in
Sect. 4 – do not terminate on the corresponding input samples. Therefore, when
faced with the disturbed situation, we modify the algorithms as follows: all edges
that do not belong to any of the computed Ŝ_i's are ignored, i.e., we compute
a new set Ŝ_new = Ŝ \ {{k, l} ∈ Ŝ | ∀i ∈ {1, . . . , n} : {k, l} ∉ Ŝ_i}. The edges
removed are exactly those connecting two example nodes with identical attribute
values but different labels. All algorithms make use of this set Ŝ_new instead of
Ŝ. However, this modification does not affect our analysis, so we continue by
writing Ŝ. In the noisy scenario, Lemma 1 has to be modified as follows:
Lemma 6. Let f be a concept depending on d variables, and T a δ-disturbed
sample for f such that Target Ranking is successful on (T, d). If Greedy
Ranking outputs at most d variables on input T , then it is correct. Otherwise,
the first d variables output by Greedy Ranking are the relevant ones.
We now state our main theorem for the case of disturbed samples:
Theorem 7. Let f : {0, 1}n → {0, 1} with relevant variables xi1 , . . . , xid , and
let δ = (δ1 , . . . , δn ) ∈ [0, 1]n . Let T be a δ-disturbed sample for f generated
according to a probability distribution p : {0, 1}n → [0, 1], and let c > 0.
(a) If min{α_i | x_i ∈ V⁺} > max{α_j | x_j ∈ V⁻} and δ_k ≤ (1/12) ε for all k ∈
{1, . . . , n}, then with probability 1 − n^{−c}, Target Ranking is successful
on input (T, d), provided that
m ≥ 48 ε^{−2} ((c + 1) ln n + ln 9),
We would like to stress that the algorithms have not been modified in any way in
order to overcome the disturbances. In particular, we do not have to assume that
the algorithms have any knowledge about the error probabilities δ1 , . . . , δn . Even
more, the sample size required for Target Ranking only has to be enlarged
by a factor of 4 in order to obtain the same success probability in the case of a
(small) constant percentage of errors in the input sample.
Throughout this section we identify {0, 1} with the two-element field GF(2) and
denote by ⊕ the sum operation in this field. Furthermore, we define |ξ| = Σ_{i=1}^{n} ξ_i
for ξ ∈ {0, 1}^n (here the sum is taken in Z). Let f : {0, 1}^n → {0, 1} be defined by
f (x1 , . . . , xn ) = xi1 ⊕ . . . ⊕ xid for some set of variable indices I = {i1 , . . . , id } ⊆
{1, . . . , n}. Since we have seen at the end of Sect. 5 that ranking the variables
according to their occurrences in the functional relations graph does not work for
the parity function, we present a different algorithm Parity Infer to find the
relevant variables. The idea is simply to compute a solution of a system of linear
equations associated with the input sample and then to infer from this solution
a set of variables that can explain the sample.
Let us again differentiate between the two aspects: on the one hand, the optimization
problem INFRA(⊕), obtained by restricting the instances and the solutions of INFRA to
samples for concepts whose base functions are parity functions (such functions
can be uniquely described by the set V⁺ of relevant variables), and on the
other hand, finding exactly the relevant variables of a given but unknown parity
function, provided that the sample size is large enough.
Let T = (X; y) ∈ {0, 1}m×(n+1) . There is a one-to-one correspondence be-
tween solutions V + of the INFRA(⊕) instance T and the solutions ξ ∈ {0, 1}n
for the system of linear equations Xξ = y given by ξi = 1 iff xi ∈ V + . The
task of finding an optimal solution for an INFRA(⊕) instance is equivalent to
finding a solution ξ of Xξ = y with minimum |ξ|. Since {xi | i ∈ I} is a solution
for T , the system has at least one solution. Moreover, if X has full rank (i.e.,
rank(X) = n), then there is a unique solution which is of course also an optimal
solution in this case.
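The linear-algebra core of this approach is Gaussian elimination over GF(2); the following self-contained sketch is our own rendering of the idea behind Parity Infer (not the paper's code) and computes one solution ξ of Xξ = y:

```python
# Solving X xi = y over GF(2) by Gaussian elimination (sketch). Returns one
# solution vector xi (free variables set to 0), whose 1-positions are the
# candidate relevant variables. Assumes the system is consistent, which holds
# whenever the input is a sample for some parity function.

def solve_gf2(X, y):
    m, n = len(X), len(X[0])
    rows = [row[:] + [y[k]] for k, row in enumerate(X)]  # augmented matrix
    pivots = []
    r = 0
    for col in range(n):
        piv = next((i for i in range(r, m) if rows[i][col]), None)
        if piv is None:
            continue
        rows[r], rows[piv] = rows[piv], rows[r]
        for i in range(m):
            if i != r and rows[i][col]:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[r])]
        pivots.append(col)
        r += 1
    xi = [0] * n
    for i, col in enumerate(pivots):
        xi[col] = rows[i][n]
    return xi

# Sample for the parity f(x) = x0 xor x2 (n = 3): the solution recovers V+.
X = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
y = [1, 0, 1, 0]
assert solve_gf2(X, y) == [1, 0, 1]
```

When X has full rank n, as in Theorem 9, the solution returned is the unique one.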
There is a well-known correspondence between the INFRA(⊕) and the Near-
est Codeword problem. A Nearest Codeword instance consists of a matrix
112 J. Arpe and R. Reischuk
A ∈ {0, 1}n×r and a vector b ∈ {0, 1}n . A solution is a vector x ∈ {0, 1}r , and
the goal is to minimize the Hamming distance of Ax and b (i.e., |Ax ⊕ b|). The
obvious reduction is approximation factor preserving. Using a result of [6], this
implies
Theorem 8. For any ε > 0, INFRA(⊕) cannot be approximated within a factor
of 2^{log^{1−ε} m} unless NP ⊆ DTIME(n^{polylog(n)}).
Despite this negative result, INFRA(⊕) can be solved efficiently on the aver-
age. We show that under certain assumptions the variables detected by Parity
Infer are exactly the relevant ones with high probability.
Theorem 9. Let f : {0, 1}^n → {0, 1} such that its base function f̃ is a parity
function, and let T = (X; y) ∈ {0, 1}^{m×n} × {0, 1}^m be a UDA-generated sample
for f. If m ≥ n + k(2 ln k + 1) with k = c log n + 1 for some c > 0, then with
probability 1 − n^{−c}, Xξ = y has exactly one solution ξ.
For inferring relevant Boolean valued attributes we have presented ranking algo-
rithms, which are modifications of greedy algorithms proposed earlier. We have
extended a negative approximability result to the restriction of only Boolean
values and have improved a lower bound by using Feige's result. General criteria
for the success of our algorithms have been established in terms of
some statistical values (depending on the concept considered and the probability
distribution). These results have been applied to a series of typical input
distributions and specific functions.
In case of monotone functions, a straightforward modification of our strategy
restricts edge set Si to those edges {k, l} with xi (k) = y(k) = 0 and xi (l) =
y(l) = 1. This halves the values αj for xj ∈ V − and thus may satisfy (or
improve) the conditions of the main theorems for certain monotone functions.
Next, we have investigated the case of noisy attribute values. We have shown
that our algorithms still succeed with high probability, if their input contains a
(small) constant fraction of wrong values. This desirable robustness property is
achieved without requiring any specific knowledge about the likelihood of errors.
One direction of future research could be to extend these results to more
complex Boolean functions such as DNF formulas with a constant number of
monomials. Furthermore, the case of robustly inferring relevant attributes of
parity functions remains open. Another generalization would be the case that
attributes and/or labels may take values from sets with more than two elements.
Given an input instance, Greedy Ranking always outputs a proper solu-
tion that is capable of explaining the sample. However, if the input has some
disturbances, Greedy Ranking might indeed stop only after having chosen
significantly more than the real number of relevant attributes. In such situa-
tions, one might be interested in algorithms that – rather than computing an
exact solution for the given input data – output a simple solution fitting to an
input instance that is in some sense 'near' to the input instance. Following
Occam's razor, such a simple solution may be much more likely to explain the real
phenomenon. Currently, we are working on a general framework for this setting.
References
1. R. Agrawal, T. Imielinski, and A. Swami, Mining Association Rules between Sets
of Items in Large Databases. Proc. 1993 ACM SIGMOD Conf., 207–216.
2. T. Akutsu, F. Bao, Approximating Minimum Keys and Optimal Substructure
Screens. Proc. 2nd COCOON, Springer LNCS 1090 (1996), 290–299.
3. T. Akutsu, S. Miyano, and S. Kuhara, A Simple Greedy Algorithm for Finding
Functional Relations: Efficient Implementation and Average Case Analysis. TCS
292(2) (2003), 481–495. (See also Proc. 3rd DS, Springer LNAI 1967 (2000), 86–98.)
4. D. Angluin, Queries and Concept Learning. Machine Learning 2(4) (1988), 319–
342, Kluwer Academic Publishers, Boston.
5. D. Angluin and P. Laird, Learning from noisy examples. Machine Learning 2(4)
(1988), 343–370, Kluwer Academic Publishers, Boston.
6. S. Arora, L. Babai, J. Stern, and Z. Sweedyk, The Hardness of Approximate Op-
tima in Lattices, Codes, and Systems of Linear Equations, J. CSS 54 (1997), 317–
331.
7. J. Arpe, R. Reischuk, Robust Inference of Relevant Attributes. Techn. Report,
SIIM-TR-A 03-12, Univ. Lübeck, 2003, available at
[Link]
8. A. Blum, L. Hellerstein, and N. Littlestone, Learning in the Presence of Finitely
or Infinitely Many Irrelevant Attributes. Proc. 4th COLT ’91, 157–166.
9. A. Blum, P. Langley, Selection of Relevant Features and Examples in Machine
Learning. Artificial Intelligence 97(1–2), 245–271 (1997).
10. U. Feige, A Threshold of ln n for Approximating Set Cover. J. ACM 45 (1998),
634–652.
11. S. Goldman, H. Sloan, Can PAC Learning Algorithms Tolerate Random Attribute
Noise? Algorithmica 14 (1995), 70–84.
12. D. Johnson, Approximation Algorithms for Combinatorial Problems. J. CSS 9
(1974), 256–278.
13. N. Littlestone, Learning Quickly When Irrelevant Attributes Abound: A New
Linear-threshold Algorithm. Machine Learning 4(2) (1988), 285–318, Kluwer Aca-
demic Publishers, Boston.
14. N. Littlestone, From On-line to Batch Learning. Proc. 2nd COLT 1989, 269–284.
15. H. Mannila, K. Räihä, On the Complexity of Inferring Functional Dependencies.
Discrete Applied Mathematics 40 (1992), 237–243.
16. E. Mossel, R. O’Donnell, R. Servedio, Learning Juntas. Proc. STOC ’03, 206–212.
17. L. Valiant, Projection Learning. Machine Learning 37(2) (1999), 115–130, Kluwer
Academic Publishers, Boston.
Efficient Learning of Ordered and Unordered
Tree Patterns with Contractible Variables
Abstract. Due to the rapid growth of tree structured data such as Web
documents, efficient learning from tree structured data is becoming more
and more important. In order to represent structural features common
to such tree structured data, we propose a term tree, which is a rooted
tree pattern consisting of tree structures and labeled variables. A vari-
able is a labeled hyperedge, which can be replaced with any tree. A
contractible variable is an erasing variable which is adjacent to a leaf. A
contractible variable may be replaced with a singleton vertex. A usual
variable, called an uncontractible variable, is replaced with a tree of size
at least 2. In this paper, we deal with ordered and unordered term trees
with contractible and uncontractible variables such that all variables
have mutually distinct variable labels. First we give a polynomial time
algorithm for deciding whether or not a given term tree matches a given
tree. Let Λ be a set of edge labels. Second, when Λ has more than one
edge label, we give a polynomial time algorithm for finding a minimally
generalized ordered term tree which explains all given tree data. Lastly,
when Λ has infinitely many edge labels, we give a polynomial time al-
gorithm for finding a minimally generalized unordered term tree which
explains all given tree data. These results imply that the classes of or-
dered and unordered term trees are polynomial time inductively inferable
from positive data.
1 Introduction
Due to the rapid growth of semistructured data such as Web documents, information extraction from semistructured data is becoming more and more important. Web documents such as HTML/XML files have no rigid structure and are called semistructured data. According to the Object Exchange Model [1], we treat semistructured data as tree structured data. Tree structured data such as HTML/XML files are represented by rooted trees with edge labels. In order to represent a tree structured pattern common to such tree structured data, we
R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 114–128, 2003.
c Springer-Verlag Berlin Heidelberg 2003
Efficient Learning of Ordered and Unordered Tree Patterns 115
Fig. 1. Ordered term trees t1 , t2 and t3 and ordered trees T1 , T2 and T3 . An uncon-
tractible (resp. contractible) variable is represented by a single (resp. double) lined box
with lines to its elements. The label inside a box is the variable label of the variable.
proposed an ordered term tree and unordered term tree, which are rooted trees
with structured variables [12,13].
Many semistructured data have irregularities such as missing or erroneous data. In the Object Exchange Model, the data attached to leaves are essential information, and such data are represented as subtrees. On the other hand, in analyzing tree structured data, knowledge (or patterns) that is sensitive to slight differences among such data is often meaningless. For example, patterns extracted from HTML/XML files are affected by attributes of tags, which can be regarded as noise. Therefore we introduce a new kind of variable, called a contractible variable, which is an erasing variable adjacent to a leaf. A contractible variable can be replaced with any tree, including a singleton vertex. A usual variable, called an uncontractible variable, is replaced with a tree which consists of at least 2 vertices. A term tree with only uncontractible variables is very sensitive to noise. By introducing contractible variables, we can find term trees that are robust against such noise. Shinohara [11] initiated the study of the learnability of extended pattern languages of strings with erasing variables. Since this pioneering work, researchers in the field of computational learning theory have been interested in classes of string or tree pattern languages with erasing variables that are polynomial time learnable. Recently, Uemura et al. [16] showed that classes of unions of erasing regular pattern languages are polynomial time learnable from positive data. In this paper, we study the learnability of classes of tree structured patterns with restricted erasing variables, called contractible variables.
A term tree t is said to be regular if all variable labels in t are mutually
distinct. The term tree language of an ordered term tree t is the set of all ordered
trees which are obtained from t by substituting ordered trees for variables in t.
116 Y. Suzuki et al.
The language reflects the expressive power of an ordered term tree t. We say that a regular ordered term tree t explains given tree structured data S if the term tree language of t contains all trees in S. A minimally generalized regular ordered term tree t explaining S is a regular ordered term tree t such that t explains S and the language of t is minimal among all term tree languages which contain all trees in S. For example, the term tree t3 in Fig. 1 is a minimally generalized regular ordered term tree explaining T1, T2 and T3. The term tree t2 is also a minimally generalized regular ordered term tree with no contractible variable explaining T1, T2 and T3. On the other hand, t1 is overgeneralized and meaningless, since t1 explains any tree of size at least 2. An ordered term tree using contractible and uncontractible variables can express the structural features of ordered trees more precisely than an ordered term tree using only uncontractible variables. For this reason, we consider that in Fig. 1, t3 is a more precise term tree than t2.
In a similar way, we define the term tree language of an unordered term tree
and a minimally generalized regular unordered term tree explaining given tree
structured data S.
Let Λ be a set of edge labels used in tree structured data. We denote by OTT^c_Λ (resp. UTT^c_Λ) the set of all regular ordered (resp. unordered) term trees with contractible and uncontractible variables. For a set S, the number of elements in S is denoted by |S|. First, we give a polynomial time algorithm for deciding whether or not a given regular ordered (resp. unordered) term tree explains an ordered (resp. unordered) tree, where |Λ| ≥ 1. Second, when |Λ| ≥ 2, we give a polynomial time algorithm for finding a minimally generalized regular ordered term tree in OTT^c_Λ which explains all given data. Lastly, when |Λ| is infinite, we give a polynomial time algorithm for finding a minimally generalized regular unordered term tree in UTT^c_Λ which explains all given data. These results imply that both OTT^c_Λ where |Λ| ≥ 2 and UTT^c_Λ where |Λ| is infinite are polynomial time inductively inferable from positive data.
A term tree is different from other representations of tree structured patterns
such as in [2,3,5] in that a term tree has structured variables which can be sub-
stituted by arbitrary trees. As related works, we proved the learnability of some
classes of term tree languages with no contractible variable. In [13,14], we showed
that some fundamental classes of regular ordered term tree languages are poly-
nomial time inductively inferable from positive data. And in [7,9,12], we showed
that the class of regular unordered term tree languages with infinitely many edge
labels is polynomial time inductively inferable from positive data. Moreover, we
showed in [8] that some classes of regular ordered term tree languages are exactly learnable in polynomial time using queries. In [15], we showed that the class of regular ordered term trees with contractible variables and no edge labels is polynomial time inductively inferable from positive data. Asai et al. [6] studied a data mining problem for semistructured data by modeling semistructured data as labeled ordered trees and presented an efficient algorithm for finding all frequent ordered tree patterns from semistructured data. In [10], we gave a data mining method for semistructured data using ordered term trees.
f and g are ordered term trees, for any internal vertex u in f which has more than one child, and for any two children u′ and u″ of u, u′ <^f_u u″ if and only if ϕ(u′) <^g_{ϕ(u)} ϕ(u″).
For unordered term trees, we introduce a new definition of bindings which
is different from the original definition we gave in [9,12]. The reason why we
introduce the new one is explained after we define substitutions of term trees.
Definition 4 (Bindings of term trees). Let g be a term tree with at least two vertices and x a variable label in X or X^c. Let σ = [u, u′] be a list of two vertices in g, where u is the root of g and u′ is a leaf of g. The form x := [g, σ] is called a binding for x. If x is a contractible variable label in X^c, g may be a tree with a singleton vertex u, and thus σ = [u, u]. This is the only case in which a tree with a singleton vertex is allowed in a binding.
Original definition of bindings of unordered term trees [9,12]: Let g be an unordered term tree with at least two vertices. Let σ = [u, u′] be a list of two vertices in g, where u is the root of g and u′ is a vertex of g (u ≠ u′). The form x := [g, σ] is called a binding for x.
Definition 5 (Substitutions of term trees). Let f and g be two ordered (resp. unordered) term trees. A new ordered (resp. unordered) term tree f{x := [g, σ]} is obtained by applying the binding x := [g, σ] to f in the following way. Let e = [v, v′] be a variable in f with the variable label x. Let g′ be one copy of g and w, w′ the vertices of g′ corresponding to u, u′ of g, respectively. For the variable e = [v, v′], we attach g′ to f by removing the variable e from H_f and by identifying the vertices v, v′ with the vertices w, w′ of g′, respectively. If g is a tree with a singleton vertex, i.e., u = u′, then v′ becomes identical to v after the binding. A substitution θ is a finite collection of bindings {x_1 := [g_1, σ_1], …, x_n := [g_n, σ_n]}, where the x_i are mutually distinct variable labels in X and the g_i are term trees. The term tree fθ, called the instance of f by θ, is obtained by applying all the bindings x_i := [g_i, σ_i] to f simultaneously. We define the root of the resulting term tree fθ as the root of f.
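To make Definition 5 concrete, the following Python sketch applies a single binding to a term tree. The nested-dictionary encoding, the helper `node`, and the `leaf_path` convention are our own illustrative assumptions, not the paper's formalism; the ordered-specific bookkeeping of Definition 6 is omitted.

```python
import copy

def node(*children):
    # a vertex given by its list of outgoing (kind, label, subtree)
    # entries, where kind is "edge" or "var"
    return {"children": list(children)}

def apply_binding(f, x, g, leaf_path):
    """Apply the binding x := [g, [root, leaf]] to term tree f:
    the variable's parent port is identified with g's root and its
    child port with the leaf of g reached via leaf_path (a list of
    child indices).  leaf_path == [] encodes the singleton-vertex
    case allowed for contractible variables."""
    f = copy.deepcopy(f)

    def walk(t):
        new_children = []
        for kind, label, sub in t["children"]:
            if kind == "var" and label == x:
                gcopy = copy.deepcopy(g)
                leaf = gcopy
                for j in leaf_path:            # descend to the leaf w'
                    leaf = leaf["children"][j][2]
                # identify the child port v' with w': the subtree below
                # v' now hangs below the designated leaf of the copy
                leaf["children"].extend(copy.deepcopy(sub)["children"])
                # identify the parent port v with g's root w
                new_children.extend(gcopy["children"])
            else:
                walk(sub)
                new_children.append((kind, label, sub))
        t["children"] = new_children
        return t

    return walk(f)
```

For a regular term tree the variable label x occurs exactly once, so exactly one hyperedge is rewritten per binding.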
A variable with a binding under the original definition is equivalent to a pair consisting of one uncontractible variable h with a binding under the new definition and one contractible variable whose parent port is the child port of h. However, a contractible variable cannot be expressed by any variable with bindings under the original definition. Therefore, by introducing contractible variables and the new definition of bindings, we can express richer unordered tree structured patterns.
Further, we have to give a new total ordering <^{fθ}_v for every vertex v of fθ. These orderings are defined in a natural way.
Definition 6 (Child orderings on an instance of an ordered term tree). Let f be an ordered term tree and θ a substitution. Suppose that v has more than one child, and let u′ and u″ be two children of v in fθ. If v is the parent port of variables [v, v_1], …, [v, v_k] of f with v_1 <^f_v ⋯ <^f_v v_k, we have the following four cases. Let g_i be the term tree which is substituted for [v, v_i] for i = 1, …, k.
Case 1: If u′, u″ ∈ V_f and u′ <^f_v u″, then u′ <^{fθ}_v u″. Case 2: If u′, u″ ∈ V_{g_i} and u′ <^{g_i}_v u″ for some i, then u′ <^{fθ}_v u″. Case 3: If u′ ∈ V_{g_i}, u″ ∈ V_f, and v_i <^f_v u″ (resp. u″ <^f_v v_i), then u′ <^{fθ}_v u″ (resp. u″ <^{fθ}_v u′). Case 4: If u′ ∈ V_{g_i}, u″ ∈ V_{g_j} (i ≠ j), and v_i <^f_v v_j, then u′ <^{fθ}_v u″. If v is not the parent port of any variable, then u′, u″ ∈ V_f, and therefore u′ <^{fθ}_v u″ if u′ <^f_v u″.
For example, let t3 be a term tree in Fig. 1 and θ = {x := [g1 , [u1 , v1 ]], y :=
[g2 , [u2 , v2 ]], z := [g3 , [u3 , u3 ]]} be a substitution, where g1 , g2 and g3 are trees in
Fig. 1. Then the instance t3 θ of the term tree t3 by θ is the tree T3 .
Without loss of generality, we assume that for any ordered term tree, there is no pair of contractible variables [v, v′]^c and [v, v″]^c such that v″ is the immediate right sibling of v′. For a similar reason, we also assume that for any unordered term tree, any vertex v which is not a leaf has at most one contractible variable whose parent port is v. An ordered term tree with no variable is called a ground ordered term tree, which is an ordered tree.
OT_Λ denotes the set of all ground ordered term trees whose edge labels are in Λ. Similarly, we define a ground unordered term tree. UT_Λ denotes the set of all ground unordered term trees whose edge labels are in Λ. OTT^c_Λ denotes the set of all ordered term trees with contractible and uncontractible variables whose edge labels are in Λ. UTT^c_Λ denotes the set of all unordered term trees with contractible and uncontractible variables whose edge labels are in Λ.
Definition 7 (Term tree languages). Let Λ be a set of edge labels. For an ordered term tree t ∈ OTT^c_Λ, the ordered term tree language L^O_Λ(t) of t is defined as {s ∈ OT_Λ | s ≡ tθ for a substitution θ}. For an unordered term tree t ∈ UTT^c_Λ, the unordered term tree language L^U_Λ(t) of t is defined as {s ∈ UT_Λ | s ≡ tθ for a substitution θ}.
A minimally generalized ordered term tree explaining a given set of ordered trees S ⊆ OT_Λ is an ordered term tree t such that S ⊆ L^O_Λ(t) and there is no term tree t′ satisfying S ⊆ L^O_Λ(t′) ⊊ L^O_Λ(t). Similarly, we define a minimally generalized unordered term tree explaining a given set of unordered trees S ⊆ UT_Λ. We give polynomial time algorithms for the following problems for (TT, T) ∈ {(OTT^c_Λ, OT_Λ), (UTT^c_Λ, UT_Λ)}.
For a class C, Angluin [4] and Shinohara [11] showed that if C has finite thickness, and the membership problem and the MINL problem for C are solvable in polynomial time, then C is polynomial time inductively inferable from positive data. Let Λ be a finite or infinite alphabet of edge labels. In this paper, we consider the classes OTT^c_Λ and UTT^c_Λ as targets of inductive inference.
It is easy to see that the classes OTT^c_Λ and UTT^c_Λ have finite thickness. In Section 3, we give polynomial time algorithms for the Membership problems for OTT^c_Λ and UTT^c_Λ, and in Section 4, we give polynomial time algorithms for the Minimal Language problems for OTT^c_Λ (|Λ| ≥ 2) and UTT^c_Λ (|Λ| = ∞). Therefore, we obtain the following main result.
Theorem 1. The classes OTT^c_Λ (|Λ| ≥ 2) and UTT^c_Λ (|Λ| = ∞) are polynomial time inductively inferable from positive data.
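The inference scheme behind Theorem 1 can be sketched as follows: with finite thickness together with polynomial-time membership and MINL procedures, a learner that recomputes a minimally generalized hypothesis whenever the current one fails on a new positive example converges. In the Python sketch below, `member` and `minl` are placeholders for the paper's algorithms; the toy instantiation uses finite sets only to make the loop runnable, and everything here is our own illustration.

```python
def infer_from_positive_data(examples, member, minl):
    """Yield a hypothesis after each positive example.
    member(h, t): does the language of hypothesis h contain t?
    minl(S):      a minimally generalized hypothesis explaining S.
    """
    seen = []
    hypothesis = None
    for t in examples:
        seen.append(t)
        # only re-generalize when the current hypothesis is refuted
        if hypothesis is None or not member(hypothesis, t):
            hypothesis = minl(seen)
        yield hypothesis

# toy instantiation: "languages" are plain finite sets of items
member = lambda h, t: t in h
minl = lambda S: set(S)
hypotheses = list(infer_from_positive_data([1, 2, 2, 3], member, minl))
```

Finite thickness guarantees that only finitely many candidate languages are consistent with any finite sample, which is what makes this simple loop converge in the limit.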
[Fig. 4 appears here: the ordered term tree pairs g_i and t_i (1 ≤ i ≤ 3) used to define canonical ordered term trees.]
Definition 8. Let g be an ordered term tree in OTT^c_Λ for |Λ| ≥ 2. The ordered term tree g is said to be a canonical ordered term tree if g has no occurrence of t_i (1 ≤ i ≤ 3) (Fig. 4).
Any ordered term tree g is transformed into the canonical term tree by repeatedly replacing all occurrences of g_i with t_i (1 ≤ i ≤ 3). We denote by c(g) the canonical ordered term tree transformed from g. We note that L^O_Λ(c(g)) = L^O_Λ(g).
Proof. Let c(g) = (V_{c(g)}, E_{c(g)}, H_{c(g)}) and c(t) = (V_{c(t)}, E_{c(t)}, H_{c(t)}). Let V′_{c(g)} = {v ∈ V_{c(g)} | v is not a child port of a contractible variable} and V′_{c(t)} = {v ∈ V_{c(t)} | v is not a child port of a contractible variable}. For a vertex v which is not the root of a term tree, we denote by p(v) the parent of v. For any v ∈ V_{c(g)}, either {p(v), v} ∈ E_{c(g)}, [p(v), v] ∈ H_{c(g)}, or [p(v), v]^c ∈ H_{c(g)}. We note that v ∈ V_{c(g)} − V′_{c(g)} if and only if [p(v), v]^c ∈ H_{c(g)}. Since g ≈ t, we have c(g) ≈ c(t). Therefore there is a bijection ξ from V′_{c(g)} to V′_{c(t)} such that {p(v), v} ∈ E_{c(g)} or [p(v), v] ∈ H_{c(g)} if and only if {ξ(p(v)), ξ(v)} ∈ E_{c(t)} or [ξ(p(v)), ξ(v)] ∈ H_{c(t)}. Since Λ contains at least two edge labels, we can easily show the following claim.
Claim 1. If [p(v), v] ∈ H_{c(g)} then [ξ(p(v)), ξ(v)] ∈ H_{c(t)}.
In the following three claims, we assume that [p(v), v]^c ∈ H_{c(g)}.
Claim 2. Suppose that p(v) has at least two children and v is the leftmost (resp. rightmost) child of p(v). Let w be the immediate right (resp. left) sibling of v. Then one of the following two statements holds: (1) there exists a leftmost (resp. rightmost) child v′ of ξ(p(v)) such that [ξ(p(v)), v′]^c ∈ H_{c(t)}, or (2) [ξ(p(v)), ξ(w)] ∈ H_{c(t)}.
Proof of Claim 2. We note that {p(v), w} is an edge in E_{c(g)}, since c(g) is the canonical ordered term tree of g. We assume that {ξ(p(v)), ξ(w)} is an edge in E_{c(t)}. Let α be the edge label of {ξ(p(v)), ξ(w)} and β an edge label in Λ − {α}. Let g^β be the ground term tree obtained by replacing all contractible and uncontractible variables with edges labeled with β. This substitution does not increase the number of internal vertices of c(g). Thus, if there is a substitution θ such that g^β ≡ c(t)θ, the vertex of g^β corresponding to p(v) of c(g) must correspond to the vertex of c(t)θ corresponding to ξ(p(v)) of c(t). Therefore, if the vertex v′ stated in this claim does not exist, g^β does not belong to L^O_Λ(c(t)), since the edge label of {ξ(p(v)), ξ(w)} is α. (End of Proof)
We can show the next two claims in a similar way to Claim 2.
Claim 3. If v is the only child of p(v), then either ξ(p(v)) has exactly one child v′ such that [ξ(p(v)), v′]^c ∈ H_{c(t)}, or the parent of ξ(p(v)) exists and satisfies [p(ξ(p(v))), ξ(p(v))] ∈ H_{c(t)}.
Claim 4. Suppose that p(v) has at least three children and v has the immediate left sibling w_ℓ and immediate right sibling w_r. Then one of the following three statements holds: (1) there exists a child v′ of ξ(p(v)) between ξ(w_ℓ) and ξ(w_r) such that [ξ(p(v)), v′]^c ∈ H_{c(t)}, (2) [ξ(p(v)), ξ(w_ℓ)] ∈ H_{c(t)}, or (3) [ξ(p(v)), ξ(w_r)] ∈ H_{c(t)}.
From these claims, the lemma follows immediately. □
      if S ⊆ L^O_Λ(t_2 := t′ − [u, w_r]^c) then t′ := t_2
    end
    else
      if S ⊆ L^O_Λ(t′ := tR(c_i)^{ℓ,r}_λ) then begin
        if S ⊆ L^O_Λ(t_1 := t′ − [u, w_ℓ]^c) then t′ := t_1;
        if S ⊆ L^O_Λ(t_2 := t′ − [u, w_r]^c) then t′ := t_2;
        t := t′
      end;
  return t
end;
Fig. 5. Algorithm MINL-OTT^c: For an ordered term tree t, we denote by t − [u, v]^c the term tree obtained by removing a contractible variable [u, v]^c.
Proof. Obviously t′ ∈ L^O_Λ(t). Let t″ be the ordered term tree obtained by replacing all edges of s(t′) with uncontractible variables. Then t″ ≈ t′ and L^O_Λ(t″) ⊆ L^O_Λ(t′). Let θ be a substitution such that t″ ≡ tθ, and θ′ a substitution obtained by replacing all edges appearing in θ with uncontractible variables. Then t″ ≡ tθ′. Since Variable-Extension does not add any further variable to t, we have tθ′ ≡ t. Therefore t″ ≡ t, and hence t ≈ t′. □
Let u be a vertex of an ordered term tree which is not the root of the term tree, and let p(u) be the parent of u. Let λ be an element of Λ. Let w_ℓ and w_r be new children of p(u) which become the immediate left and right siblings of u, respectively. If u is a leaf, let w_d be a new child of u. We suppose that [p(u), u] is an uncontractible variable. Then we define the following two operations.
R(u)^{ℓ,r}_λ: Replace [p(u), u] with {p(u), u} labeled with λ, [p(u), w_ℓ]^c and [p(u), w_r]^c.
R(u)^{ℓ,r,d}_λ: Replace [p(u), u] with {p(u), u} labeled with λ, [p(u), w_ℓ]^c, [p(u), w_r]^c and [u, w_d]^c.
Edge-Replacing (Fig. 5): Let t be the output of Variable-Extension for an input S. This procedure visits all vertices of t in the reverse order of the breadth-first search of t and applies the above two operations R to t. If S ⊆ L^O_Λ(tR) then t := tR. Finally, it eliminates contractible variables as much as possible.
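The greedy pattern used by Edge-Replacing — tentatively apply an operation and keep the result only if the refined hypothesis still explains all of S — can be sketched generically. In the sketch below, `explains` stands in for the paper's membership algorithm and the operations are arbitrary callables; the encoding is our own illustration, not the paper's pseudocode.

```python
def greedy_refine(t, S, operations, explains):
    """Apply candidate refinement operations in order, keeping each
    result only if the refined hypothesis still explains every
    example in S.
    operations: functions mapping a hypothesis to a refined
                hypothesis, or None when not applicable.
    explains(t, s): membership test for example s in the language
                of hypothesis t."""
    for op in operations:
        candidate = op(t)
        if candidate is not None and all(explains(candidate, s) for s in S):
            t = candidate   # the refinement is safe; commit to it
    return t
```

With hypotheses modeled as finite sets and operations that drop single elements, the loop keeps exactly the removals that preserve consistency with S, mirroring how contractible variables are eliminated as much as possible.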
Lemma 4. Let t ∈ OTT^c_Λ (|Λ| ≥ 2) be the output of the algorithm MINL-OTT^c for an input S. Let t′ be a term tree satisfying S ⊆ L^O_Λ(t′) ⊆ L^O_Λ(t). Then c(t′) ≡ c(t).
Definition 9. Let g be an unordered term tree in UTT^c_Λ (|Λ| = ∞). The unordered term tree g is said to be a canonical unordered term tree if g has no occurrence of t_i (1 ≤ i ≤ 3) (Fig. 6).
We can easily see that any unordered term tree g is transformed into the canonical unordered term tree by repeatedly replacing all occurrences of g_i with t_i (1 ≤ i ≤ 3). We denote by c(g) the canonical unordered term tree transformed from g. We show the following lemma in a similar way to Lemma 2.
[Fig. 6 appears here: the unordered term tree pairs g_i and t_i (1 ≤ i ≤ 3) used to define canonical unordered term trees.]
Lemma 6. Let g and t be two unordered term trees in UTT^c_Λ (|Λ| = ∞). If g ≈ t and L^U_Λ(g) ⊆ L^U_Λ(t), then c(g) ≡ c(t).
The algorithm MINL-UTT^c (Fig. 7) solves the Minimal Language Problem for UTT^c_Λ. The procedure Variable-Extension (Fig. 7) extends an unordered term tree t by adding uncontractible variables as much as possible while S ⊆ L^U_Λ(t) holds. We can show the following lemma in a similar way to Lemma 3.
Fig. 7. Algorithm MINL-UTT^c: For an unordered term tree t, we denote by t + [u, v]^c the term tree obtained by adding a contractible variable [u, v]^c, and by t − [u, v]^c the term tree obtained by removing a contractible variable [u, v]^c.
References
1. S. Abiteboul, P. Buneman, and D. Suciu. Data on the Web: From Relations to
Semistructured Data and XML. Morgan Kaufmann, 2000.
2. S. Amer-Yahia, S. Cho, L. V. S. Lakshmanan, and D. Srivastava. Minimization of
Tree Pattern Queries. Proc. ACM SIGMOD 2001, pages 497–508, 2001.
3. T. R. Amoth, P. Cull, and P. Tadepalli. Exact learning of unordered tree patterns
from queries. Proc. COLT-99, ACM Press, pages 323–332, 1999.
4. D. Angluin. Finding patterns common to a set of strings. Journal of Computer
and System Sciences, 21:46–62, 1980.
5. H. Arimura, H. Sakamoto, and S. Arikawa. Efficient learning of semi-structured
data from queries. Proc. ALT-2001, Springer-Verlag, LNAI 2225, pages 315–331,
2001.
6. T. Asai, K. Abe, S. Kawasoe, H. Arimura, H. Sakamoto, and S. Arikawa. Efficient
Substructure Discovery from Large Semi-structured Data. Proc. of the Second
SIAM International Conference on Data Mining, pages 158–174, 2002.
7. S. Matsumoto, Y. Hayashi, and T. Shoudai. Polynomial time inductive inference
of regular term tree languages from positive data. Proc. ALT-97, Springer-Verlag,
LNAI 1316, pages 212–227, 1997.
8. S. Matsumoto, T. Shoudai, T. Miyahara, and T. Uchida. Learning of finite unions
of tree patterns with internal structured variables from queries. Proc. AI-2002,
Springer-Verlag, LNAI 2557, pages 523–534, 2002.
9. T. Miyahara, T. Shoudai, T. Uchida, K. Kuboyama, K. Takahashi, and H. Ueda.
Discovering New Knowledge from Graph Data Using Inductive Logic Programming.
Proc. ILP-99, Springer-Verlag, LNAI 1634, pages 222–233, 1999.
10. T. Miyahara, Y. Suzuki, T. Shoudai, T. Uchida, S. Hirokawa, K. Takahashi, and
H. Ueda. Extraction of Tag Tree Patterns with Contractible Variables from Irregu-
lar Semistructured data. Proc. PAKDD-2003, Springer-Verlag, LNAI 2637, pages
430–436, 2003.
11. T. Shinohara. Polynomial time inference of extended regular pattern languages.
Proc. RIMS Symposium on Software Science and Engineering, Springer-Verlag,
LNCS 147, pages 115–127, 1982.
12. T. Shoudai, T. Uchida, and T. Miyahara. Polynomial time algorithms for finding
unordered tree patterns with internal variables. Proc. FCT-2001, Springer-Verlag,
LNCS 2138, pages 335–346, 2001.
13. Y. Suzuki, R. Akanuma, T. Shoudai, T. Miyahara, and T. Uchida. Polynomial time
inductive inference of ordered tree patterns with internal structured variables from
positive data. Proc. COLT-2002, Springer-Verlag, LNAI 2375, pages 169–184,
2002.
14. Y. Suzuki, T. Shoudai, T. Uchida, and T. Miyahara. Ordered term tree languages
which are polynomial time inductively inferable from positive data. Proc. ALT-
2002, Springer-Verlag, LNAI 2533, pages 188–202, 2002.
15. Y. Suzuki, T. Shoudai, S. Matsumoto, and T. Uchida. Efficient Learning of Unlabeled
Term Trees with Contractible Variables from Positive Data. Proc. ILP-2003,
Springer-Verlag, LNAI (to appear), 2003.
16. J. Uemura and M. Sato. Compactness and Learning of Classes of Unions of Erasing
Regular Pattern Languages. Proc. ALT-2002, Springer-Verlag, LNAI 2533, pages
293–307, 2002.
On the Learnability of Erasing Pattern
Languages in the Query Model
1 Introduction
A pattern is a finite string of constant and variable symbols (cf. Angluin [2]).
The erasing language generated by a pattern p is the set of all strings that can
be obtained by substituting strings of constant symbols (including the empty
one!) for the variables in p.1 Thereby, each occurrence of a variable has to be
substituted by the same string.
The erasing pattern languages have found a lot of attention within the past
two decades both in the formal language theory community (see, e. g., Salo-
maa [15,16], Jiang et al. [9]) and in the learning theory community (see, e. g.,
¹ The term ‘erasing’ is coined to distinguish these languages from those pattern languages originally defined in Angluin [2], where it is forbidden to replace variables by the empty string.
R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 129–143, 2003.
c Springer-Verlag Berlin Heidelberg 2003
130 S. Lange and S. Zilles
Shinohara [17], Erlebach et al. [6], Mitchell [12], Nessel and Lange [13], Reiden-
bach [14]). The learning scenarios studied include Gold’s [7] model of learning
in the limit and Angluin’s [3] model of learning with queries. Besides that, in-
teresting applications have been outlined. For example, learning algorithms for
particular subclasses of erasing pattern languages have been used to solve prob-
lems in molecular biology (see Arikawa et al. [5]).
The present paper focusses on the learnability of the erasing pattern languages and natural subclasses thereof in Angluin’s [3,4] model of learning with queries. The paper extends the work of Nessel and Lange [13], the first systematic study in this context.
In contrast to Gold’s [7] model of learning in the limit, Angluin’s [3] model
deals with ‘one-shot’ learning. Here, a learning algorithm (henceforth called
query learner) has the option to ask queries in order to receive information
about an unknown language. The queries will truthfully be answered by an ora-
cle. After asking at most finitely many queries, the learner is supposed to output
its one and only hypothesis. This hypothesis is required to correctly describe the
unknown language.
The present paper contains a couple of new results, which illustrate the power
and limitations of query learners in the context of learning the class of all eras-
ing pattern languages and natural subclasses thereof. Along the line of former
studies, the capabilities of polynomial-time query learners (i. e. learners that are
constrained to ask at most polynomially many queries before returning their
hypothesis) are studied as well.
In addition, a problem is addressed that has mainly been ignored so far.
The present paper provides the first systematic study concerning the strength
of query learners that are – in contrast to standard query learners – allowed
to query languages that are themselves not object of learning. As it turns out,
these new learners often outperform standard learners, concerning their principal
learning capability as well as their efficiency.
Moreover, the learning power of non-standard query learners is compared to
the capabilities of Gold-style language learners. As a result of this comparison,
quite interesting coincidences between Gold-style language learning and query
learning – in the more general setting of learning indexable classes of recursive
languages – have been observed. One of them allows for a new approach to
the long-standing open question of whether or not the erasing pattern languages
(over a finite alphabet with at least three constant symbols) are Gold-style learn-
able from only positive examples. To be more precise, the erasing pattern lan-
guages are learnable in the non-standard query model (using a particular type
of queries, namely restricted superset queries), iff they are Gold-style learnable
from only positive examples by a conservative learner (i. e. a learner that strictly
avoids overgeneralized hypotheses).
Next, we summarize the known results on query learning of all erasing
pattern languages or natural subclasses thereof.
Among the different types of queries investigated in the past (see, e. g., An-
gluin [3,4]), we consider the following ones:
On the Learnability of Erasing Pattern Languages in the Query Model 131
Membership queries. The input is a string w, and the answer is ‘yes’ or ‘no’, depending on whether or not w belongs to the target language L.
Restricted subset queries. The input is a language L′. If L′ ⊆ L, the answer is ‘yes’. Otherwise, the answer is ‘no’.
Restricted superset queries. The input is a language L′. If L ⊆ L′, the answer is ‘yes’. Otherwise, the answer is ‘no’.
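The three query types can be phrased as a single truthful oracle interface. In the following sketch — our own illustration, not part of the paper — the target language is given extensionally as a finite set of words, a simplification, since erasing pattern languages are in general infinite.

```python
class Oracle:
    """Truthfully answers the three query types for a fixed target
    language L, represented here as a finite set of words."""

    def __init__(self, target):
        self.target = set(target)

    def membership(self, w):
        # 'yes' iff the string w belongs to L
        return w in self.target

    def restricted_subset(self, candidate):
        # 'yes' iff L' <= L for the queried language L'
        return set(candidate) <= self.target

    def restricted_superset(self, candidate):
        # 'yes' iff L <= L' for the queried language L'
        return self.target <= set(candidate)
```

A query learner interacts with the target exclusively through such an interface and, after at most finitely many queries, outputs its one and only hypothesis.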
In the original model of learning with queries (cf. Angluin [3]), the query learner is constrained to choose the input language L′ exclusively from the class of languages to be learned. Our study involves a further approach, in which this constraint is weakened by allowing the learner to query languages that are themselves not objects of learning.
The following table summarizes the results obtained and compares them to
the previously known results. The focus is on the learnability of the class of
all erasing pattern languages and the following subclasses thereof: the so-called
regular, k-variable, and non-cross erasing pattern languages.2 The items in the
table have to be interpreted as follows. The item ‘No’ indicates that queries
of the specified type are insufficient to learn the corresponding language class,
while the item ‘Yes’ indicates that the corresponding class is learnable using
queries of this type. The superscript † marks results that can be found in, or
easily derived from, Angluin [3], Matsumoto and Shinohara [11], and Nessel and
Lange [13], respectively.
                              Type of erasing pattern languages
  Type of queries    all   regular  1-variable  const.-free  k-variable  const.-free  non-cross
                                                1-variable               k-variable
  membership         No†   Yes†     No          Yes          No          No           No
  restr. subset      No    Yes      No          Yes          No          No           No
  restr. superset    No†   Yes†     No†         No†          No†         No†          No†
If query learners are allowed to choose input languages that are themselves
not object of learning, their learning capabilities change remarkably, particularly
when the learner is allowed to ask restricted superset queries. It seems as if this
type of queries is especially tailored to accumulate learning-relevant information
about erasing pattern languages. Note that the superscript ‡ marks immediate
outcomes of the table above.
                              Type of erasing pattern languages
  Type of extra
  queries            all    regular  1-variable  const.-free  k-variable  const.-free  non-cross
                                                 1-variable               k-variable
  restr. subset      No     Yes‡     No          Yes‡         No          No           No
  restr. superset    Open   Yes‡     Yes         Yes          Yes         Yes          Yes
Of particular interest is also the complexity of a successful query learner M, cf. Angluin [3]. M learns a class polynomially if, for each target language L
² A pattern p is regular provided that p does not contain any variable more than once. Moreover, p is said to be a k-variable pattern if it contains at most k variables, while it is said to be non-cross if there are variables x_1, …, x_n and indices e_1, …, e_n such that p = x_1^{e_1} ⋯ x_n^{e_n}.
2 Preliminaries
In the following, Σ denotes a fixed finite alphabet, the set of constant symbols.
Moreover, X = {x1 , x2 , x3 , . . .} is a countable, infinite set of variables. To distin-
guish constant symbols from variables, it is assumed that Σ and X are disjoint.
By Σ ∗ we refer to the set of all finite strings over Σ (words, for short), where ε
denotes the empty string or empty word, respectively. A pattern is a non-empty
string over Σ ∪ X .
Several special types of patterns are distinguished. Let p be a pattern. If
p ∈ X*, then p is said to be a constant-free pattern. p is a regular pattern if
each variable in p occurs at most once. If p contains at most k variables, then
p is a k-variable pattern. Moreover, p is said to be a non-cross pattern, if it is
constant-free and there are some n ≥ 1 and indices e_1, . . . , e_n ≥ 1 such that p
equals x_1^{e_1} · · · x_n^{e_n}.
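To make the non-cross condition concrete: a constant-free pattern is non-cross exactly when every variable occupies one contiguous block of positions. A minimal Python sketch (the list representation and the function name are our own illustration, not from the paper):

```python
def is_non_cross(pattern):
    """Check whether a constant-free pattern (given as a sequence of
    variable names) has the non-cross form x1^{e1} x2^{e2} ... xn^{en},
    i.e. each variable occupies a single contiguous block."""
    seen = set()
    prev = None
    for v in pattern:
        if v != prev:
            if v in seen:        # the variable reappears after its block ended
                return False
            seen.add(v)
            prev = v
    return True
```

For instance, `x1 x1 x2` is non-cross, while `x1 x2 x1` is not, since `x1` reappears after the block of `x2`.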
For a pattern p, the erasing pattern language Lε (p) generated by p is the set
of all words obtained by substituting all variables in p by strings in Σ ∗ . Thereby,
each occurrence of a variable in p has to be replaced by the same word.
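The substitution semantics can be made concrete with a small brute-force membership test. The following Python sketch is our own illustration (the encoding of patterns as lists of one-character constants and integer variables is an assumption); it tries every consistent substitution, including the empty word, which is what makes the language “erasing”:

```python
def matches(pattern, word, binding=None):
    """Decide whether `word` belongs to the erasing pattern language of
    `pattern`.  A pattern is a list whose items are either one-character
    constant strings or integer variable names; every occurrence of a
    variable must be replaced by the same (possibly empty) word.
    Brute force, exponential in the number of variables."""
    if binding is None:
        binding = {}
    if not pattern:
        return word == ""
    head, rest = pattern[0], pattern[1:]
    if isinstance(head, str):                  # constant symbol: match literally
        return word.startswith(head) and matches(rest, word[len(head):], binding)
    if head in binding:                        # variable already bound
        s = binding[head]
        return word.startswith(s) and matches(rest, word[len(s):], binding)
    for i in range(len(word) + 1):             # unbound variable: try every prefix,
        binding[head] = word[:i]               # including the empty word (erasing!)
        if matches(rest, word[i:], binding):
            return True
    del binding[head]
    return False
```

For example, with pattern x_1 x_1 (encoded as `[1, 1]`), the word `abab` is accepted via x_1 ↦ `ab`, whereas `aba` is rejected.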
Below, we generally assume that the underlying alphabet Σ consists of at
least three elements.³ a, b, c always denote elements of Σ.
The erasing pattern languages and natural subclasses thereof will provide
the target objects for learning. The formal learning model analyzed is called
³ As results in Shinohara [17] and Nessel and Lange [13] impressively show, this as-
sumption remarkably reduces the complexity of the proofs needed to establish learn-
ability results in the context of learning the erasing pattern languages and subclasses
thereof. However, some of the learnability results presented below may no longer re-
main valid if this assumption is dropped. A detailed discussion of this issue is outside
the scope of the present paper.
On the Learnability of Erasing Pattern Languages in the Query Model 133
learning with queries, see Angluin [3,4]. In this model, the learner has access to
an oracle that truthfully answers queries of a specified kind. A query learner M
is an algorithmic device that, depending on the replies to the queries previously
made, either computes a new query, or computes a hypothesis and halts. M learns
a target language L using a certain type of queries provided that it eventually
halts and that its one and only hypothesis correctly describes L. Furthermore,
M learns a target language class C using a certain type of queries if it learns
every L ∈ C using queries of the specified type. As a rule, when learning a target
class C, M is not allowed to query languages not belonging to C (cf. Angluin [3]).
As in Angluin [3], the complexity of a query learner is measured by the total
number of queries to be asked in the worst case. The relevant parameter is the
length of the minimal description of the target language.
Below, only indexable classes of erasing pattern languages are considered.
Note that a class of recursive languages is said to be an indexable class if there
is an effective enumeration (L_i)_{i≥0} of all and only the languages in that class
with uniformly decidable membership. Such an enumeration is called an indexing.
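To illustrate what uniformly decidable membership means, here is a toy indexing of our own, unrelated to pattern languages: a single procedure decides w ∈ L_i uniformly in both the index i and the word w.

```python
def member(i, w):
    """Toy indexing (L_i)_{i>=0} over the alphabet {a, b}: L_i contains
    exactly the words of length i.  One single procedure decides
    membership uniformly in the index i and the word w."""
    return len(w) == i and set(w) <= {"a", "b"}
```
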
We first present results related to Angluin’s [3] original model. Here the learner
is only allowed to query languages that are themselves objects of learning.
The first result points to the general weakness of query learners when arbi-
trary erasing pattern languages have to be identified.
Theorem 1. The class of all erasing pattern languages is (i) not learnable using
membership queries, (ii) not learnable using restricted subset queries, and, (iii)
not learnable using restricted superset queries.
Proof. Assertions (i) and (iii) are results from Nessel and Lange [13].
To prove Assertion (ii), assume that a query learner M identifies the class of
all erasing pattern languages using restricted subset queries. Then it is possible
to show that M fails to identify either L_ε(x_1^2) or all but finitely many of the
languages L_ε(x_1^2 x_2^z) for z ≥ 2. □
The observed weakness has one origin: the query learners are only allowed to
output a single hypothesis, which has to be correct. To see this, consider the
following relaxation of the learning model on hand. Suppose that a query learner
M has the freedom to output a hypothesis in each learning step, after asking a
query and receiving the corresponding answer. Similarly to Gold’s [7] model of
learning in the limit, a query learner is now successful if the sequence of its
hypotheses stabilizes on a correct one. Accordingly, we say that M learns in the
limit using queries.
Theorem 2. The class of all erasing pattern languages is (i) learnable in the
limit using membership queries, (ii) learnable in the limit using restricted subset
queries, and, (iii) learnable in the limit using restricted superset queries.
However, let us come back to the original learning model, in which the first
hypothesis of the query learner has to be correct. As Theorem 1 shows, positive
results can only be achieved if the scope is limited to proper subclasses of the
erasing pattern languages.
Suppose that a subclass of the erasing pattern languages is fixed. Naturally,
one may ask whether – similarly to Theorems 1 and 2 – the learnability of this
class is independent of the type of queries actually considered. However, this is
generally not the case, as our next theorem shows.
Theorem 3. Fix two different query types among the following: membership,
restricted subset, and restricted superset queries. Then there is a class of
erasing pattern languages that is learnable using the first type of queries, but
not learnable using the second type of queries.
Proof. As the first table above shows, the class of all erasing pattern languages
generated by constant-free 1-variable patterns is learnable with membership or
restricted subset queries, but not learnable with restricted superset queries.
Moreover, it is not hard to verify that the class which contains L_ε(a) and
all languages L_ε(a x_1^z), where z is a prime number, is learnable using restricted
superset queries, but neither learnable using membership queries nor learnable
using restricted subset queries.
Next, the class containing L_ε(x_1^2) and all languages L_ε(x_1^2 x_2^2 x_3^z), z ≥ 2, is
learnable with membership queries, but not with restricted subset queries.
A class learnable with restricted subset queries, but not with membership
queries, can be constructed via diagonalization. For that purpose, fix an effective
enumeration (M_i)_{i≥0} of all query learners using membership queries and posing
each query at most once.⁴ Let z_i denote the i-th prime number for all i ≥ 0.
Given i ≥ 0, let L_{2i} = L_ε(x_1^{z_i} a). Moreover, simulate the learner M_i. If M_i
queries a word w ∈ Σ*, provide the answer ‘yes’ iff w ∈ L_ε(x_1^{z_i} a); provide the
answer ‘no’, otherwise. In case M_i never returns a hypothesis in this scenario,
let L_{2i+1} = L_{2i} = L_ε(x_1^{z_i} a). In case M_i returns a hypothesis, let l be the length
of the longest word M_i has queried in the corresponding scenario. Then define
L_{2i+1} = L_ε(x_1^{z_i} a x_2^{z_l}). Finally, let C consist of all languages L_i for i ≥ 0.
Note that (L_i)_{i≥0} is an indexing for C; membership is decidable as follows.
Assume w ∈ Σ* and j ≥ 0 are given. If j = 2i for some i ≥ 0, then w ∈ L_j iff
w ∈ L_ε(x_1^{z_i} a). If j = 2i + 1 for some i ≥ 0 and w ∈ L_{2i}, then w ∈ L_j. If j = 2i + 1
and w ∉ L_{2i}, then let A = {l ≥ 0 | w ∈ L_ε(x_1^{z_i} a x_2^{z_l})}. A is finite and can be
computed from w and i. Simulate M_i as above in the definition of L_{2i+1}. If M_i
does not return a hypothesis, then, since no query is posed twice, M_i queries a
word of a length not in A. Thus there is no l ∈ A with L_j = L_ε(x_1^{z_i} a x_2^{z_l}); in
particular, w ∉ L_j. If M_i returns a hypothesis, one can determine the length l*
of the longest word M_i has queried. In this case w ∈ L_j iff l* ∈ A.
Next, we show that C is learnable using restricted subset queries. A learner M
for C may first query the languages L_ε(x_1^{z_0} a), L_ε(x_1^{z_1} a), L_ε(x_1^{z_2} a), . . . , until the
answer ‘yes’ is received for the first time, say as a reply to the query L_ε(x_1^{z_i} a) =
L_{2i}. Then M queries the language L_{2i+1}. In case the answer is ‘yes’, let M return
the hypothesis L_{2i+1}. Otherwise, let M return the hypothesis L_{2i}. It is not hard
to verify that M is a successful query learner for C.

⁴ Note that any query learner can be normalized to pose each query at most once
without affecting its learning capabilities.
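The learner M just described can be sketched in a few lines. The oracle is modeled as a hypothetical predicate (our own interface, not from the paper) answering restricted subset queries against the indexing (L_j)_{j≥0}, where index 2i stands for L_ε(x_1^{z_i} a) and index 2i+1 for its companion diagonal language:

```python
def learn_C(subset_of_target):
    """Sketch of the restricted-subset-query learner M for the class C
    from the proof of Theorem 3.  `subset_of_target(j)` plays the
    oracle: it returns True iff the j-th language of the indexing is
    contained in the target language."""
    i = 0
    while not subset_of_target(2 * i):     # probe L(x_1^{z_i} a) for i = 0, 1, 2, ...
        i += 1
    # L_{2i} is contained in the target; decide between L_{2i} and L_{2i+1}.
    return 2 * i + 1 if subset_of_target(2 * i + 1) else 2 * i
```

Since L_{2i} ⊆ L_{2i+1} by construction, a target L_{2i+1} answers ‘yes’ to both probes, while a target L_{2i} answers ‘yes’ only to the first.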
It remains to verify that C is not learnable using membership queries. As-
sume to the contrary that C is learnable using membership queries, say by the
learner M_i for some i ≥ 0. Then M_i identifies the language L_{2i} = L_ε(x_1^{z_i} a). In
particular, if its queries are answered truthfully respecting L_{2i}, M_i must return
a hypothesis correctly describing L_{2i} after finitely many queries. Let l be the
length of the longest word M_i queries in the corresponding learning scenario.
Then, by definition, L_{2i+1} = L_ε(x_1^{z_i} a x_2^{z_l}). Note that a word of length up to l
belongs to L_{2i} iff it belongs to L_{2i+1}. Thus all queries in the learning scenario of
M_i for L_{2i} are answered truthfully also for the language L_{2i+1} ≠ L_{2i}. Since M_i
correctly identifies L_{2i}, M_i fails to learn L_{2i+1}. This yields a contradiction. □
Next, we systematically investigate the learnability of some prominent sub-
classes of the erasing pattern languages in Angluin’s [3] model.
Theorem 4. The class of all regular erasing pattern languages is (i) learnable
using membership queries, (ii) learnable using restricted subset queries, and, (iii)
learnable using restricted superset queries.
Proof. For a proof of Assertions (i) and (iii) see Nessel and Lange [13]. Adapting
their ideas, one can also prove (ii). □
Theorem 5. The class of all 1-variable erasing pattern languages is (i) not
learnable using membership queries, (ii) not learnable using restricted subset
queries, and, (iii) not learnable using restricted superset queries.
Proof. (i) and (iii) are due to Nessel and Lange [13]. To verify (ii), note that the
class of all languages L_ε(a x_1^z b), z ≥ 0, is not learnable using restricted subset
queries, even if it is allowed to query any 1-variable erasing pattern language. □
To prove Theorems 6 to 9, methods similar to the ones above can be used.
For the results concerning restricted superset queries, ideas from Nessel and
Lange [13] can be exploited. Further details are omitted.
Theorem 6. The class of all constant-free 1-variable erasing pattern languages
is (i) learnable using membership queries, (ii) learnable using restricted subset
queries, and, (iii) not learnable using restricted superset queries.
Theorem 7. The class of all k-variable erasing pattern languages is (i) not
learnable using membership queries, (ii) not learnable using restricted subset
queries, and, (iii) not learnable using restricted superset queries.
Theorem 8. The class of all constant-free k-variable erasing pattern languages
is (i) not learnable using membership queries, (ii) not learnable using restricted
subset queries, and, (iii) not learnable using restricted superset queries.
Theorem 9. The class of all non-cross erasing pattern languages is (i) not
learnable using membership queries, (ii) not learnable using restricted subset
queries, and, (iii) not learnable using restricted superset queries.
Having analyzed the learnability of natural subclasses of the class of all erasing
pattern languages in the (extended) query model, we now turn our attention
to the question of which of the learnable classes can even be learned efficiently,
i.e., with polynomially many queries. In particular, it is of interest whether or
not the permission to query extra languages may speed up learning.
As it turns out, there are subclasses that are not learnable in the original
model, but are efficiently learnable with extra queries, see Theorem 13, Asser-
tion (iv). Thus, extra restricted superset queries may bring the maximal benefit
imaginable. In contrast, extra restricted subset queries do not help to speed up
learning of the prominent subclasses of the erasing pattern languages considered
above.
Theorem 13. (i) Polynomially many restricted superset queries suffice to learn
the class of all regular erasing pattern languages.
(ii) Polynomially many membership queries suffice to learn the class of all
constant-free 1-variable erasing pattern languages.
(iii) Polynomially many restricted subset queries suffice to learn the class of all
constant-free 1-variable erasing pattern languages.
(iv) Polynomially many extra restricted superset queries suffice to learn the
classes of all regular, of all 1-variable, and of all non-cross erasing pattern lan-
guages, respectively.
Proof. (i) is due to Nessel and Lange [13]. The proofs of (ii) and (iii) are omitted.
Results by Nessel and Lange [13] help to verify Assertion (iv) for the case of
regular erasing pattern languages. Details are omitted.
The more involved proof of Assertion (iv) for the case of non-cross erasing
pattern languages is just sketched:
Assume that the target language L equals L_ε(p) for some non-cross pattern
p = x_1^{e_1} · · · x_n^{e_n}. A query learner M successful for all non-cross erasing pattern
languages may operate as follows.
1. M poses the query Σ* \ {a}. If the answer is ‘no’, then M returns the
hypothesis L = L_ε(x_1) and stops; otherwise M acts as described in 2.
2. The queries {w | |w| = j} for j = 1, 2, . . . help to determine the minimal
exponent e in {e_1, . . . , e_n}. Knowing e, M executes 3.
3. M poses the query L_ε(x_1^e). If the answer is ‘yes’, then M returns the hy-
pothesis L = L_ε(x_1^e) and stops; otherwise M acts as described in 4.
4. The queries (L_ε(x_1^e) ∩ {w | |w| ≤ j}) ∪ {w | |w| > j} for j = e, e+1, . . . help to
determine further candidates for elements in {e_1, . . . , e_n}. Queries concerning
special words in a selected class of (at most e_1 + · · · + e_n) 2-variable erasing
pattern languages help to exactly compute a next exponent e′. Knowing e′,
M executes 5.
5. The queries Σ* \ {w}, for particular words w ∈ Σ* in order of growing length,
help to determine in which order the exponents e and e′ appear in p.
equally large sets V_1 and V_2; which of these sets is taken under consideration
in each splitting step only depends on the query V_1 ∪ {w | |w| = l}. If v is
computed, M goes on as in 3.
3. M poses the query L_ε(v). On answer ‘yes’, M returns the hypothesis L_ε(v)
and stops. On answer ‘no’, M queries all the languages L_ε(p_i), 1 ≤ i ≤ l + 1,
where p_i is the pattern resulting from x_1 v_1 x_2 v_2 · · · x_l v_l x_{l+1} if the variable
x_i is deleted. Thus M can detect exactly those positions in v where the only
variable has to occur (at least once). Knowing the positions of the variables,
M goes on as in 4.
4. By posing the queries {v} ∪ {w | |w| ≥ l + j} for j = 1, 2, . . . , M finds out the
number j* of occurrences of the variable x_1 in p. Afterwards, special queries
concerning the words of length l + j* help to find out the multiplicity of x_1
in the positions computed in 3. Finally, a hypothesis for L_ε(p) is returned.
All in all, this method is successful for all 1-variable erasing pattern languages,
and uses only polynomially many queries. Instead of formalizing the details, we
try to illustrate the idea with an example.
Assume Σ = {a, b, c} and the target language is L_ε(a x_1^3 b x_1^2). Then the cor-
responding learning scenario can be described by the following table.
Theorem 14. Polynomially many queries do not suffice to learn the class of all
regular erasing pattern languages with either membership queries, or restricted
subset queries, or extra restricted subset queries.
Proof. Note that, for any n ≥ 0, there are at least |Σ|^n distinct regular patterns
such that each pair of corresponding erasing pattern languages is disjoint. By a
result in Angluin [3], given n ≥ 0, any query learner identifying each of these
|Σ|^n regular erasing pattern languages using membership or restricted subset
queries must make |Σ|^n − 1 queries in the worst case. Angluin’s proof can be
adapted to the case of learning with extra restricted subset queries. Concerning
membership queries and restricted subset queries, Theorem 14 has also been
verified by Nessel and Lange [13]. □
It remains open whether or not, for any k ≥ 3, the class of all k-variable
erasing pattern languages, or at least the class of all constant-free k-variable
erasing pattern languages, is learnable using polynomially many extra restricted
superset queries. So far, we have only been able to show that Theorem 13,
Assertion (iv) generalizes to the case of learning the class of all 2-variable
erasing pattern languages. A method similar to, but slightly extending, the one
in the proof above for 1-variable erasing pattern languages can be used. The
relevant details are omitted.
tively) identifies C from text with respect to H. Note that ConsvTxt ⊂ LimTxt,
cf. Zeugmann and Lange [18].
For the next theorem, let xSupQ denote the collection of all indexable classes
which are learnable with extra restricted superset queries.
in the n-th step, then M′ will receive the answer ‘no’ if t_n^+ ⊈ L_i (i.e., if
L_i ⊉ t_n^+), and the answer ‘yes’, otherwise. If M returns a hypothesis i in the
n-th step, then the hypothesis M′(t_{n+1}) is computed as follows:
• Let L_{i_1^+}, . . . , L_{i_m^+} be the queries answered with ‘yes’ in the currently
simulated scenario.
• Compute the canonical index i′ of the intersection L_i ∩ L_{i_1^+} ∩ · · · ∩ L_{i_m^+}
of the languages with indices in {i, i_1^+, . . . , i_m^+}.
• Return the hypothesis M′(t_{n+1}) = i′.
It is not hard to verify that M′ learns C in the limit from text; the relevant
details are omitted. Moreover, as we will see next, M′ avoids overgeneralized
hypotheses; that means, if t is a text for some L ∈ C, n ≥ 0, and M′(t_n) = i′,
then L_{i′} ⊅ L. Therefore, M′ can easily be transformed into a learner M′′ which
identifies the class C conservatively in the limit from text.⁵
To prove that M′ learns C in the limit from text without overgeneralizations,
assume to the contrary that there is an L ∈ C, a text t for L, and an n ≥ 0,
such that the hypothesis i′ = M′(t_n) fulfills L_{i′} ⊃ L. Then i′ ≠ 0. By definition
of M′, there must be a learning scenario S for M in which the queries answered
with ‘yes’ are L_{i_1^+}, . . . , L_{i_m^+}, the queries answered with ‘no’ are L_{i_1^-}, . . . , L_{i_k^-},
the hypothesis is i, and L_{i′} = L_i ∩ L_{i_1^+} ∩ · · · ∩ L_{i_m^+}. So each of the languages
L_{i_1^+}, . . . , L_{i_m^+} is a superset of L.
By definition of M′, L_{i_j^-} ⊉ t_n^+ for 1 ≤ j ≤ k. Therefore none of the languages
L_{i_1^-}, . . . , L_{i_k^-} is a superset of L. So the answers in the learning scenario S above
are truthful respecting the language L. As M learns C with extra restricted
superset queries, the hypothesis i must be correct for L, i.e., L_i = L. This yields
L_{i′} ⊆ L, in contradiction to L_{i′} ⊃ L.
So M′ learns C in the limit from text without overgeneralizations, which –
by the argumentation above – implies C ∈ ConsvTxt. □
As an immediate consequence of xSupQ = ConsvTxt and the fact that
ConsvTxt is a proper subset of LimTxt, we obtain the following corollary.
Theorem 15 and Corollary 1 are of relevance for the open question of whether
or not the class of all erasing pattern languages is learnable in the limit from text,
if the underlying alphabet consists of at least three symbols. Obviously, if this
class is learnable with extra restricted superset queries, then the open question
can be answered in the affirmative. Conversely, if it is not learnable with extra
restricted superset queries, then it is not conservatively learnable in the limit
from text. Of course, the latter would not yet imply that the open question can
be answered in the negative. Still, it would at least suggest that this is the case,
since until now there is no ‘natural’ class known that separates LimTxt from
ConsvTxt.

⁵ Note that a result by Zeugmann and Lange [18] states that any indexable class which
is learnable in the limit from text without overgeneralizations belongs to ConsvTxt.
References
1. D. Angluin. Inductive inference of formal languages from positive data. Information
and Control, 45:117–135, 1980.
2. D. Angluin. Finding patterns common to a set of strings. Journal of Computer
and System Sciences, 21:46–62, 1980.
3. D. Angluin. Queries and concept learning. Machine Learning, 2:319–342, 1988.
4. D. Angluin. Queries revisited. Proc. Int. Conf. on Algorithmic Learning Theory,
LNAI 2225, 12–31, Springer, 2001.
5. S. Arikawa, S. Miyano, A. Shinohara, S. Kuhara, Y. Mukouchi, T. Shinohara.
A machine discovery from amino acid sequences by decision trees over regular
patterns. New Generation Computing, 11:361–375, 1993.
6. T. Erlebach, P. Rossmanith, H. Stadtherr, A. Steger, T. Zeugmann. Learning one-
variable pattern languages very efficiently on average, in parallel, and by asking
questions. Proc. Int. Conf. on Algorithmic Learning Theory, LNAI 1316, 260–276,
Springer, 1997.
7. E. M. Gold. Language identification in the limit. Information and Control, 10:447–
474, 1967.
8. J. E. Hopcroft, J. D. Ullman. Introduction to Automata Theory, Languages, and
Computation. Addison-Wesley Publishing Company, 1979.
9. T. Jiang, A. Salomaa, K. Salomaa, S. Yu. Decision problems for patterns. Journal
of Computer and System Sciences, 50:53–63, 1995.
10. S. Lange, T. Zeugmann. Types of monotonic language learning and their charac-
terization. Proc. ACM Workshop on Computational Learning Theory, 377–390,
ACM Press, 1992.
11. S. Matsumoto, A. Shinohara. Learning pattern languages using queries. Proc.
European Conf. on Computational Learning Theory, LNAI 1208, 185–197, Springer,
1997.
12. A. Mitchell. Learnability of a subclass of extended pattern languages. Proc. ACM
Workshop on Computational Learning Theory, 64–71, ACM Press, 1998.
13. J. Nessel, S. Lange. Learning erasing pattern languages with queries. Proc. Int.
Conf. on Algorithmic Learning Theory, LNAI 1968, 86–100, Springer, 2000.
14. D. Reidenbach. A negative result on inductive inference of extended pattern lan-
guages. Proc. Int. Conf. on Algorithmic Learning Theory, LNAI 2533, 308–320,
Springer, 2002.
15. A. Salomaa. Patterns (the formal language theory column). EATCS Bulletin,
54:46–62, 1994.
16. A. Salomaa. Return to patterns (the formal language theory column). EATCS
Bulletin, 55:144–157, 1995.
17. T. Shinohara. Polynomial time inference of extended regular pattern languages.
Proc. RIMS Symposium on Software Science and Engineering, LNCS 147, 115–127,
Springer, 1983.
18. T. Zeugmann, S. Lange. A guided tour across the boundaries of learning recursive
languages. Algorithmic Learning for Knowledge-Based Systems, LNAI 961, 190–
258, Springer, 1995.
Learning of Finite Unions of Tree Patterns with
Repeated Internal Structured Variables from
Queries
1 Introduction
In the field of Web mining, Web documents such as HTML/XML files have
tree structures and are called tree structured data. Tree structured data can be
naturally represented by rooted trees T such that every internal vertex in T has
ordered children, no vertex has a label, and every edge has a label [1]. We
are interested in extracting a set (or a union) of tree structured patterns which
explains heterogeneous tree structured data having no rigid structure. With this
motivation, in this paper we consider the polynomial time learnability of finite
unions of tree structured patterns in the query learning model of Angluin [5].
A term tree is a rooted tree pattern which consists of tree structures, ordered
children and internal structured variables [10,13]. A variable in a term tree is a
list of two vertices, and it can be substituted by an arbitrary tree.

R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 144–158, 2003.
© Springer-Verlag Berlin Heidelberg 2003

Query Learning of Finite Unions of Tree Patterns with Repeated Variables 145

For example,
the term tree t in Fig. 1 is a tree pattern explaining the tree T in Fig. 1, because
T can be obtained from t by substituting the variables x_1, x_2, x_3 and x_4 by the
trees g_1, g_2, g_3 and g_4 in Fig. 1, respectively. In [2,3], Amoth et al. presented the
into-matching semantics and introduced the class of ordered tree patterns and
ordered forests under this semantics. Such an ordered tree pattern is a standard
tree pattern, which is also called a first order term in formal logic. Since a term
tree may have variables consisting of two internal vertices (e.g. the variable x_2 in
Fig. 1), a term tree is more powerful than an ordered tree pattern. For example,
in Fig. 1, the tree pattern f(b, x, g(a, z), y) can be represented by the term tree s,
but the term tree t cannot be represented by any standard tree pattern, because
of the existence of internal structured variables, represented by x_2 and x_3 in t.
Arimura et al. [7] presented ordered gapped tree patterns and ordered gapped
forests under the into-matching semantics introduced in [3]. An ordered gapped
tree pattern is incomparable to a term tree, since a gap-variable in an ordered
gapped tree pattern does not exactly correspond to an internal variable in a
term tree.
A variable with a variable label x in a term tree t is said to be repeated if x
occurs in t more than once. In this paper, we treat term trees with repeated in-
ternal variables. In [7], Arimura et al. discussed the polynomial time learnability
of ordered gapped forests without repeated gap-variables in the query learning
model. In this paper, we discuss the polynomial time learnability of finite unions
of term trees with repeated variables in the query learning model. For a tree
T which represents tree structured data such as Web documents, string data
such as tags or texts are assigned to the edges of T. Hence, we naturally assume
that the set of edge labels is infinite. Let Λ be a set of strings
used in tree structured data. Then, our target class of learning is the set OTF_Λ
of all finite sets of term trees all of whose edges are labeled with elements in
Λ. A term tree t is said to be regular (or repetition-free) if all variable labels
in t are mutually distinct. The term tree language of a term tree t, denoted by
L_Λ(t), is the set of all labeled trees which are obtained from t by substituting
arbitrary labeled trees for all variables in t. The language represented by a finite
set of term trees R = {t_1, t_2, . . . , t_m} in OTF_Λ is the finite union of m term tree
languages L_Λ(R) = L_Λ(t_1) ∪ L_Λ(t_2) ∪ · · · ∪ L_Λ(t_m).
In the query learning model of Angluin [5], a learning algorithm is said to
exactly learn a target finite set R* of term trees if, after using some queries, it
outputs a finite set R of term trees such that L_Λ(R) = L_Λ(R*) and halts. In
this paper, we first present a polynomial time algorithm which exactly learns
any finite set in OTF_Λ having m* term trees by using superset and restricted
equivalence queries. Next, we show that there exists no polynomial time learning
algorithm for finite unions of term trees using restricted equivalence, mem-
bership and subset queries. This result indicates the hardness of learning finite
unions of term trees in the query learning model.
In the query learning model, many researchers [2,3,7,10] showed the exact
learnability of several kinds of tree structured patterns (e.g. query learning
of ordered forests under onto-matching semantics [6], of unordered forests
under into-matching semantics [2,3], of ordered gapped forests [7], and of regular
term trees [11]).

146 S. Matsumoto et al.

In [10], we showed the polynomial time exact learnability of fi-
nite unions of regular term trees using restricted subset queries and equivalence
queries. Concerning other learning models, in [13] we showed that the class of
single regular term trees is polynomial time inductively inferable from positive
data. Further, we gave a data mining method for semistructured data, based on
a learning algorithm for regular term trees [12]. Further related works are
discussed in the Conclusion.
This paper is organized as follows. In Sections 2 and 3, we explain term trees
and the query learning model. In Section 4, we show that the class OTF_Λ is
exactly learnable in polynomial time by using superset and restricted equivalence
queries. In Section 5, we show the hardness of learning unions of term trees
in the query learning model.
2 Preliminaries
For a term tree t and its vertices v_1 and v_i, a path from v_1 to v_i is a sequence
v_1, v_2, . . . , v_i of distinct vertices of t such that for any j with 1 ≤ j < i, there
exists an edge or a variable which consists of v_j and v_{j+1}. If there is an edge or a
variable which consists of v and v′ such that v lies on the path from the root to
v′, then v is said to be the parent of v′ and v′ is a child of v. We use the notation
[v, v′] to represent a variable {v, v′} ∈ H_t such that v is the parent of v′. Then
we call v the parent port of [v, v′] and v′ the child port of [v, v′]. A term tree t is
called ordered if every internal vertex u in t has a total ordering on all children
of u. We define the size of t as the number of vertices in t and denote it by |t|.
For a set S, the number of elements in S, called the size of S, is denoted by |S|.
For example, the ordered term tree t = (V_t, E_t, H_t) in Fig. 1 is defined as
follows: V_t = {v_1, . . . , v_{11}}, E_t = {{v_1, v_2}, {v_2, v_3}, {v_1, v_4}, {v_7, v_8}, {v_1, v_{10}},
{v_{10}, v_{11}}} with the root v_1 and the sibling relation displayed in Fig. 1, and
H_t = {[v_4, v_5], [v_1, v_6], [v_6, v_7], [v_6, v_9]}.
For any ordered term tree t, a vertex u of t, and two children u′ and u′′ of u,
we write u′ <_u^t u′′ if u′ is smaller than u′′ in the order of the children of u. We
assume that every edge and variable of an ordered term tree is labeled with some
word from specified languages. A label of a variable is called a variable label. Λ
and X denote a set of edge labels and a set of variable labels, respectively, where
Λ ∩ X = ∅. In the following, we simply call an ordered term tree a term tree. In
particular, a term tree t = (V_t, E_t, H_t) is called regular if all variables in H_t have
mutually distinct variable labels in X. We denote by OTT_Λ (resp. µOTT_Λ) the
set of all term trees
Query Learning of Finite Unions of Tree Patterns with Repeated Variables 147
Fig. 1. A term tree t explains a tree T. A term tree s represents the tree pattern
f(b, x, g(a, z), y). A variable is represented by a box with lines to its elements. The
label inside a box is the variable label of the variable.
(resp. regular term trees) with Λ as a set of edge labels, and by OTF_Λ (resp.
µOTF_Λ) the set of all finite sets of term trees (resp. regular term trees) with Λ
as a set of edge labels. An ordered term tree with no variable is called a ground
term tree and is considered to be a tree with ordered children. OT_Λ denotes the
set of all ground term trees with Λ as a set of edge labels.
Let f = (V_f, E_f, H_f) and g = (V_g, E_g, H_g) be term trees. We say that f
and g are isomorphic, denoted by f ≡ g, if there is a bijection ϕ from V_f to V_g
such that (i) the root of f is mapped to the root of g by ϕ, (ii) {u, u′} ∈ E_f if
and only if {ϕ(u), ϕ(u′)} ∈ E_g and the two edges have the same edge label, (iii)
[u, u′] ∈ H_f if and only if [ϕ(u), ϕ(u′)] ∈ H_g and the two variables have the same
variable label, and (iv) for any vertex u in f which has more than one child, and
for any two children u′ and u′′ of u, u′ <_u^f u′′ if and only if ϕ(u′) <_{ϕ(u)}^g ϕ(u′′).
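Because all children are ordered, deciding f ≡ g reduces to a recursive, order-respecting comparison of labels and kinds. A minimal Python sketch (the tuple encoding of term trees is our own assumption, not the paper's):

```python
def isomorphic(f, g):
    """f and g encode ordered term trees: a vertex is the ordered list
    of its outgoing connections, each a triple (kind, label, subtree)
    with kind 'edge' or 'var'.  Ordered isomorphism is a recursive,
    order-respecting comparison of kinds, labels and subtrees."""
    if len(f) != len(g):
        return False
    return all(kf == kg and lf == lg and isomorphic(sf, sg)
               for (kf, lf, sf), (kg, lg, sg) in zip(f, g))
```

Note that swapping two children breaks isomorphism here, reflecting condition (iv) of the definition.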
Let f and g be term trees with at least two vertices. Let h = [v, v′] be a
variable in f with the variable label x and σ = [u, u′] a list of two distinct vertices
in g, where u is the root of g and u′ is a leaf of g. The form x := [g, σ] is called
a binding for x. A new term tree f′ = f{x := [g, σ]} is obtained by applying the
binding x := [g, σ] to f in the following way. Let e_1 = [v_1, v_1′], . . . , e_m = [v_m, v_m′]
be the variables in f with the variable label x. Let g_1, . . . , g_m be m copies of
g and u_i, u_i′ the vertices of g_i corresponding to u, u′ of g, respectively. For each
variable e_i = [v_i, v_i′], we attach g_i to f by removing the variable e_i from H_f and
by identifying the vertices v_i, v_i′ with the vertices u_i, u_i′ of g_i.
148 S. Matsumoto et al.
Fig. 2. The new ordering on vertices in the term tree f = f {x := [g, [u0 , u1 ]]}.
3 Learning Model
In this section, we show the learnability of finite unions of term tree languages
in the query learning model. In Subsection 4.1, we introduce some notations. In
Subsection 4.2, we show that any set in OTFΛ is exactly identifiable in polynomial
time using superset queries if the size of a target is known. In Subsection 4.3,
we show that any set in OTFΛ is exactly identifiable in polynomial time using
superset and restricted equivalence queries even if the size of a target is unknown.
The following property, called compactness, plays an important role in the learning of unions of languages [7,8]. Because of Lemma 1, we assume throughout this paper that |Λ| is infinite.
Fig. 3. For regular term trees t1, t2, t3 and t4, LΛ(t1) ⊆ LΛ(t2), LΛ(t4) ⊆ LΛ(t3), t1 ⋠ t2 and t4 ⋠ t3.
Lemma 1. Let r be a term tree in OTTΛ, R a set in OTFΛ and |Λ| infinite. Then r ⪯ r′ for some r′ ∈ R if and only if LΛ(r) ⊆ LΛ(R).
Proof. Let wr be a ground term tree obtained from r by substituting, for its variables, edges whose labels are mutually distinct and do not appear in any term tree in R. Since wr ∈ LΛ(r), if LΛ(r) ⊆ LΛ(R), then there exists a term tree r′ ∈ R such that wr ∈ LΛ(r′). Since none of the substituted labels appears in R, we have r ⪯ r′ by inverting the substitution. The converse direction is immediate. □
For example, let Λ = {a, b}. In Fig. 3 we have LΛ(t1) ⊆ LΛ(t2) and t1 ⋠ t2; thus, if |Λ| = 2, then compactness does not hold. Moreover, let Λ = {a, b, c}. Then LΛ(t4) ⊆ LΛ(t3) and t4 ⋠ t3. Hence |Λ| must be infinite for compactness to hold.
We now introduce operations that increase the number of variables in a term tree.
Query Learning of Finite Unions of Tree Patterns with Repeated Variables 151
Fig. 4. Operations that increase the number of variables in a term tree: each transformation introduces a new variable (labeled y or z) into a term tree with subtrees T1–T4.
Note that |r′| > |r| and r ≺ r′ for any r′ ∈ ES(r). The number of non-isomorphic term trees in ES(r) is at most 3|r|.
In this section, we assume that the size of R∗ is known in advance; let |R∗| = m∗. We show that the algorithm LEARN KNOWN in Fig. 5 exactly identifies any set R∗ ∈ OTFΛ in polynomial time using superset queries. In LEARN KNOWN, Rhypo denotes a hypothesis set which is included in OTFΛ and Rnocheck denotes a set of regular term trees which are not yet checked by the algorithm LEARN OTT. Note that Rnocheck ∈ µOTFΛ and each regular term tree in Rnocheck consists of variables only.
We denote by rin (resp. Rin) a term tree (resp. a set of term trees) given to the algorithm LEARN OTT in Fig. 6. By Lemma 2, the algorithm LEARN OTT always takes as input a term tree rin such that r∗ ⪯ rin and |r∗| = |rin| for some r∗ ∈ R∗. Note that ES(rin) ⊆ Rin and (Rnocheck − {rin}) ⊆ (Rhypo − {rin}) ⊆ Rin.
We show that if there exists a term tree r∗ ∈ R∗ such that r∗ ≡ rin, then the algorithm outputs rin. Otherwise, the algorithm calls itself recursively, passing a term tree r such that r ≺ rin and |r| = |rin|.
We introduce some notation. Let r be a term tree in OTTΛ, α an edge label and x, y variable labels appearing in r. We denote by Xr the set of all variable labels appearing in r, and by ρe(r, x, α) (resp. ρv(r, x, y)) the term tree obtained from r by replacing all variables having the variable label x with edges having the edge label α (resp. with variables having the variable label y). For a subset ∆ of Λ, we define the set RS∆(r) = {ρe(r, x, α) | x ∈ Xr, α ∈ ∆} ∪ {ρv(r, x, y) | x, y ∈ Xr, x ≠ y}. If r ∈ OTΛ, then we define RS∆(r) = ∅. Note that r′ ≺ r and |r′| = |r| for any r′ ∈ RS∆(r), and the number of non-isomorphic term trees in RS∆(r) is at most |r|·|∆| + |r|².
In the algorithm LEARN OTT, let t1, t2, …, ti, … and ∆1, ∆2, …, ∆i, … (i ≥ 1) be the sequence of counterexamples returned by the superset queries in line 7 and the sequence of finite subsets of Λ obtained in line 9, respectively. Let ∆0 be the finite subset of Λ obtained in line 6. We suppose that at each stage i ≥ 0, LEARN OTT makes a superset query SupR∗(Rin ∪ RS∆i(rin)) and receives a counterexample ti+1. First we assume that rin ≡ r∗ for some r∗ ∈ R∗.
Then we have the following lemma.
Lemma 3. For any i ≥ 0, LΛ(R∗) ⊈ LΛ(Rin ∪ RS∆i(rin)).
Proof. If rin has no variable, it is clear. We assume that rin has variables. The proof is by induction on the number of iterations i ≥ 0 of the while-loop. In case i = 0: since |Λ| is infinite, there exists an edge label in Λ − ∆0. Thus we can construct a term tree r with r ∈ LΛ(rin) − LΛ(Rin ∪ RS∆0(rin)). Since LΛ(rin) ⊆ LΛ(R∗), it follows that LΛ(R∗) ⊈ LΛ(Rin ∪ RS∆0(rin)).
We assume inductively that the result holds for any number of iterations of the while-loop less than i. By the inductive hypothesis, LΛ(R∗) ⊈ LΛ(Rin ∪ RS∆i−1(rin)); thus the counterexample ti is obtained. Since |Λ| is infinite, there exists an edge label in Λ − ∆i, and we can construct a term tree r ∈ LΛ(R∗) − LΛ(Rin ∪ RS∆i(rin)). Therefore, LΛ(R∗) ⊈ LΛ(Rin ∪ RS∆i(rin)). □
Next we assume that rin ≢ r∗ for any r∗ ∈ R∗. Let r∗1, …, r∗ℓ be the term trees in R∗ such that r∗i ≺ rin and |r∗i| = |rin| for each i ∈ {1, …, ℓ}, where ℓ ≤ m∗. Since LΛ(R∗) ⊆ LΛ(Rin ∪ {rin}) and ES(rin) ⊆ Rin, we have LΛ(R∗ − {r∗1, …, r∗ℓ}) ⊆ LΛ(Rin). Then we have the following lemma.
Lemma 4. There exists a subset S of {r∗1, …, r∗ℓ} such that |S| ≥ i + 1 and LΛ(S) ⊆ LΛ(RS∆i(rin)).
From the above lemma, for some i ≤ m, LΛ({r∗1, …, r∗ℓ}) ⊆ LΛ(RS∆i(rin)). It follows that LΛ(R∗) ⊆ LΛ(Rin ∪ RS∆i(rin)).
Thus, by Lemmas 3 and 4, the algorithm LEARN OTT exactly identifies the set {r∗1, …, r∗ℓ}. The algorithm is called recursively at most O(|rin|²) times to identify one term tree in {r∗1, …, r∗ℓ}; thus, it is called recursively at most O(ℓ|rin|²) times in all.
The while-loop of lines 7–11 is repeated at most O(m) times. Since ES(rin) ⊆ Rin, |ti| = |rin| for any i. Thus, in the foreach-loop of lines 15–17, |∆| ≤ |t1| + … + |tm| = m|rin|. The loop uses at most O(m|rin|²) superset queries. The number of superset queries needed to identify the set {r∗1, …, r∗ℓ} is therefore at most O(ℓm|rin|⁴). The algorithm LEARN KNOWN uses at most O(|rin|²) superset queries to obtain a term tree rin. Thus, the number of superset queries the algorithm needs to identify a target R∗ is at most O(m∗²n⁴ + 1), where n is the maximum size of term trees in R∗.
From the above, we have the following theorem.
Theorem 2. The algorithm LEARN OTF of Fig. 7 exactly identifies any set R∗ ∈ OTFΛ in polynomial time using at most O(m∗³n⁴ + 1) superset queries and at most O(m∗ + 1) restricted equivalence queries, where m∗ is the size of R∗ and n is the maximum size of term trees in R∗.
In this section, we show the insufficiency of restricted equivalence, membership and subset queries for learning OTFΛ in the query learning model. We use the following lemma.
Lemma 5 (Lovász [9]). Let UTn be the number of rooted unordered trees of size n with no edge labels. Then 2ⁿ < UTn < 4ⁿ for n ≥ 6.
We denote by OTn the set of all rooted ordered trees of size n with no edge labels. From the above lemma, we have |OTn| ≥ 2ⁿ for n ≥ 6.
By Lemma 5 and Lemma 1 in [5], we have Theorem 3.
Theorem 3. Any learning algorithm that exactly identifies all finite sets of the
term trees of size n using restricted equivalence, membership and subset queries
must make more than 2n queries in the worst case, where n ≥ 6 and |Λ| ≥ 1.
Proof. We denote by Sn the class of singleton sets of ground term trees of size n. The class Sn is a subclass of OTFΛ. For any distinct L and L′ in Sn, L ∩ L′ = ∅. Moreover, the empty set is included in OTFΛ. Thus, by Lemma 5 and Lemma 1 in [5], any learning algorithm that exactly identifies all finite sets of term trees of size n using restricted equivalence, membership and subset queries must make more than 2ⁿ queries in the worst case, where n ≥ 6 and |Λ| ≥ 1. □
The results are summarized as follows:

µOTTΛ — Exact learning: Yes [11], using membership queries and a positive example (|Λ| ≥ 2). Inductive inference from positive data: Yes [13], in polynomial time (|Λ| ≥ 1).
µOTFΛ — Exact learning: Yes [10], using restricted subset and equivalence queries (|Λ| is infinite). Inductive inference from positive data: Open.
OTFΛ — Exact learning: sufficient [This work] with superset and restricted equivalence queries (|Λ| is infinite); insufficient [This work] with restricted equivalence, membership and subset queries (|Λ| ≥ 1). Inductive inference from positive data: Open.
6 Conclusions
We have studied the learnability of OTFΛ in the query learning model. In Section 4, we have shown that any finite set R∗ ∈ OTFΛ is exactly identifiable using at most O(m∗³n⁴ + 1) superset queries and at most O(m∗ + 1) restricted equivalence queries, where m∗ = |R∗|, n is the maximum size of term trees in R∗ and |Λ| is infinite. In Section 5, we have shown that it is hard to exactly identify any set in OTFΛ efficiently using restricted equivalence, membership and subset queries.
References
1. S. Abiteboul, P. Buneman, and D. Suciu. Data on the Web: From Relations to
Semistructured Data and XML. Morgan Kaufmann, 2000.
2. T. R. Amoth, P. Cull, and P. Tadepalli. Exact learning of tree patterns from
queries and counterexamples. Proc. COLT-98, ACM Press, pages 175–186, 1998.
3. T. R. Amoth, P. Cull, and P. Tadepalli. Exact learning of unordered tree patterns
from queries. Proc. COLT-99, ACM Press, pages 323–332, 1999.
4. D. Angluin. Finding patterns common to a set of strings. Journal of Computer and
System Sciences, 21:46–62, 1980.
5. D. Angluin. Queries and concept learning. Machine Learning, 2:319–342, 1988.
6. H. Arimura, H. Ishizaka, and T. Shinohara. Learning unions of tree patterns using
queries. Theoretical Computer Science, 185(1):47–62, 1997.
7. H. Arimura, H. Sakamoto, and S. Arikawa. Efficient learning of semi-structured
data from queries. Proc. ALT-2001, Springer-Verlag, LNAI 2225, pages 315–331,
2001.
8. H. Arimura, T. Shinohara, and S. Otsuki. Polynomial time algorithm for finding
finite unions of tree pattern languages. Proc. NIL-91, Springer-Verlag, LNAI 659,
pages 118–131, 1993.
9. L. Lovász. Combinatorial Problems and Exercises, chapter Two classical enumeration problems in graph theory. North-Holland Publishing Company, 1979.
10. S. Matsumoto, T. Shoudai, T. Miyahara, and T. Uchida. Learning of finite unions
of tree patterns with internal structured variables from queries. Proc. AI-2002,
Springer LNAI 2557, pages 523–534, 2002.
11. S. Matsumoto, T. Shoudai, T. Miyahara, and T. Uchida. Learning unions of term
tree languages using queries. Proceedings of LA Summer Symposium, July 2002,
pages 21–1 – 21–10, 2002.
12. T. Miyahara, Y. Suzuki, T. Shoudai, T. Uchida, K. Takahashi, and H. Ueda.
Discovery of frequent tag tree patterns in semistructured web documents. Proc.
PAKDD-2002, Springer-Verlag, LNAI 2336, pages 341–355, 2002.
13. Y. Suzuki, R. Akanuma, T. Shoudai, T. Miyahara, and T. Uchida. Polynomial time
inductive inference of ordered tree patterns with internal structured variables from
positive data. Proc. COLT-2002, Springer-Verlag, LNAI 2375, pages 169–184,
2002.
Kernel Trick Embedded Gaussian Mixture Model
1 Introduction
The kernel trick is an efficient method for nonlinear data analysis, used early on by the Support Vector Machine (SVM) [18]. It has been pointed out that the kernel trick can be used to develop a nonlinear generalization of any algorithm that can be cast in terms of dot products. In recent years, the kernel trick has been successfully introduced into various machine learning algorithms, such as Kernel Principal Component Analysis (Kernel PCA) [14,15], Kernel Fisher Discriminant (KFD) [11], Kernel Independent Component Analysis (Kernel ICA) [7] and so on.
However, in many cases we need to obtain risk minimization results and to incorporate prior knowledge, both of which are easily provided within a Bayesian probabilistic framework. This has motivated combining the kernel trick with Bayesian methods, an approach called the Bayesian Kernel Method [16]. Since the Bayesian Kernel Method works in a probabilistic framework, it can realize Bayesian-optimal decisions and estimate confidence or reliability easily with probabilistic criteria such as Maximum A Posteriori [5].
Recently some research has been done in this field. Kwok combined the evidence framework with SVM [10], and Gestel et al. [8] incorporated a Bayesian framework into SVM and KFD. Both of these works apply a Bayesian framework to known kernel methods. On the other hand, some researchers proposed
R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 159–174, 2003.
© Springer-Verlag Berlin Heidelberg 2003
160 J. Wang, J. Lee, and C. Zhang
new Bayesian methods with the kernel trick embedded, among which one of the most influential works is the Relevance Vector Machine (RVM) proposed by Tipping [17].
This paper also addresses Bayesian Kernel Methods. We embed the kernel trick into the Expectation-Maximization (EM) algorithm [3] and derive a new parameter estimation algorithm for the Gaussian Mixture Model (GMM) in feature space. The resulting model is called the kernel Gaussian Mixture Model (kGMM).
The rest of this paper is organized as follows. Section 2 reviews some back-
ground knowledge, and Section 3 describes the kernel Gaussian Mixture Model
and the corresponding parameter estimation algorithm. Experiments and results
are presented in Section 4. Conclusions are drawn in the final section.
2 Preliminaries
In this section, we review some background knowledge including the kernel trick,
GMM based on EM algorithm and Bayesian Kernel Method.
Consider a (nonlinear) map

Φ : X → H, x ↦ φ(x)  (1)

A similarity measure is then defined from the dot product in the space H as follows:

k(x, x′) := φ(x) · φ(x′)  (2)
where the kernel function k should satisfy Mercer's condition [18]. This allows us to treat learning algorithms with linear algebra and analytic geometry. Generally speaking, on the one hand the kernel trick lets us deal with data in the high-dimensional dot product space H, named the feature space, via the map associated with k. On the other hand, it avoids the expensive computation in the feature space by employing the kernel function k instead of directly computing dot products in H.
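As a tiny concrete illustration of this point (a sketch, not from the paper): for the degree-2 polynomial kernel k(x, x′) = (x · x′)² on R², the explicit feature map is φ(x) = (x₁², √2·x₁x₂, x₂²), and evaluating k directly yields the same dot product without ever forming φ.

```python
import numpy as np

# Explicit feature map for the degree-2 polynomial kernel on R^2:
# phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), so that phi(x).phi(x') = (x.x')^2.
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, xp):
    # The kernel trick: evaluate the feature-space dot product
    # without computing phi explicitly.
    return float(np.dot(x, xp)) ** 2

x, xp = np.array([1.0, 2.0]), np.array([3.0, 0.5])
assert np.isclose(np.dot(phi(x), phi(xp)), k(x, xp))  # both equal 16.0
```

For higher degrees or the RBF kernel the explicit map grows huge or infinite-dimensional, which is exactly why the kernel-side evaluation matters.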
As an elegant approach to nonlinear analysis, the kernel trick has been used in many other algorithms such as Kernel Fisher Discriminant [11], Kernel PCA [14,15], Kernel ICA [7] and so on.
A Gaussian Mixture Model has the density

p(x|Θ) = Σ_{i=1}^{M} αi Gi(x|θi)  (3)

where x ∈ R^d is a random variable, the parameters Θ = (α1, …, αM; θ1, …, θM) satisfy Σ_{i=1}^{M} αi = 1, αi ≥ 0, and Gi(x|θi) is a Gaussian probability density function:

Gl(x|θl) = (1 / ((2π)^{d/2} |Σl|^{1/2})) exp(−(1/2)(x − µl)ᵀ Σl⁻¹ (x − µl))  (4)

where θl = (µl, Σl).
GMM can be viewed as a generative model [12] or a latent variable model [6]: it assumes that the data set X = {xi}_{i=1}^{N} is generated by M Gaussian components, and introduces latent variables Z = {zi}_{i=1}^{N} whose values indicate which component generated each sample; that is, zi = l if sample xi was generated by the l-th component. The parameters of GMM can then be estimated by the EM algorithm [2].
The EM algorithm for GMM is an iterative procedure which estimates the new parameters in terms of the old parameters by the following updating formulas:

αl^(t) = (1/N) Σ_{i=1}^{N} p(l|xi, Θ^(t−1))

µl^(t) = Σ_{i=1}^{N} xi p(l|xi, Θ^(t−1)) / Σ_{i=1}^{N} p(l|xi, Θ^(t−1))

Σl^(t) = Σ_{i=1}^{N} (xi − µl^(t))(xi − µl^(t))ᵀ p(l|xi, Θ^(t−1)) / Σ_{i=1}^{N} p(l|xi, Θ^(t−1))  (5)

p(l|xi, Θ^(t−1)) = αl^(t−1) G(xi|θl^(t−1)) / Σ_{j=1}^{M} αj^(t−1) G(xi|θj^(t−1)),  l = 1, …, M  (6)

where Θ^(t−1) = (α1^(t−1), …, αM^(t−1); θ1^(t−1), …, θM^(t−1)) are the parameters of the (t−1)-th iteration and Θ^(t) = (α1^(t), …, αM^(t); θ1^(t), …, θM^(t)) are the parameters of the t-th iteration.
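One EM iteration implementing updating formulas (5) and (6) can be sketched in NumPy as follows (an illustrative sketch, not the authors' implementation):

```python
import numpy as np

def em_step(X, alphas, mus, covs):
    """One EM iteration for a GMM: E-step (6), then M-step (5)."""
    N, d = X.shape
    M = len(alphas)
    # E-step: responsibilities p(l | x_i, Theta^(t-1)), formula (6)
    resp = np.zeros((N, M))
    for l in range(M):
        diff = X - mus[l]
        inv = np.linalg.inv(covs[l])
        norm = ((2 * np.pi) ** (d / 2)) * np.sqrt(np.linalg.det(covs[l]))
        dens = np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1)) / norm
        resp[:, l] = alphas[l] * dens
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: updating formulas (5)
    Nl = resp.sum(axis=0)
    new_alphas = Nl / N
    new_mus = (resp.T @ X) / Nl[:, None]
    new_covs = []
    for l in range(M):
        diff = X - new_mus[l]
        new_covs.append((resp[:, l, None] * diff).T @ diff / Nl[l])
    return new_alphas, new_mus, new_covs
```

Iterating em_step until the parameters stabilize gives the usual maximum-likelihood GMM fit in input space; the kGMM of Section 3 replaces these input-space quantities with kernel-matrix computations.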
GMM has been successfully applied in many fields, such as parametric clustering, density estimation and so on. However, it cannot give a simple but satisfactory clustering result on data sets with complex structure [13], as shown in Figure 1. One alternative is to perform GMM-based clustering in another space instead of the original data space.
Fig. 1. Data set of two concentric circles with 1,000 samples, points marked by ‘×’
belong to one cluster and marked by ‘·’ belong to the other. (a) is the partition result
by traditional GMM, (b) is the result achieved by kGMM using polynomial kernel of
degree 2; (c) shows the probability that each point belongs to the outer circle. The
whiter the point is, the higher the probability is.
We embed the kernel trick into the Gaussian Mixture Model; this work thus belongs to the second category of Bayesian Kernel Methods, namely new Bayesian methods with the kernel trick embedded.
As mentioned before, the Gaussian Mixture Model cannot obtain simple but satisfactory results on data sets with complex structure, so we employ the kernel trick to realize a Bayesian Kernel version of GMM. Our basic idea is to embed the kernel trick into the parameter estimation procedure of GMM. In this section, we first describe GMM in feature space, then present some properties of the feature space, formulate the Kernel Gaussian Mixture Model and the corresponding parameter estimation algorithm, and finally discuss the algorithm.
Kernel Trick Embedded Gaussian Mixture Model 163
GMM in the feature space, via a map φ(·) associated with the kernel function k, can easily be rewritten as

p(φ(x)|Θ) = Σ_{i=1}^{M} αi G(φ(x)|θi)  (7)

and the EM updating formulas (5) and (6) can be replaced by the following:
αl^(t) = (1/N) Σ_{i=1}^{N} p(l|φ(xi), Θ^(t−1))

µl^(t) = Σ_{i=1}^{N} φ(xi) p(l|φ(xi), Θ^(t−1)) / Σ_{i=1}^{N} p(l|φ(xi), Θ^(t−1))

Σl^(t) = Σ_{i=1}^{N} (φ(xi) − µl^(t))(φ(xi) − µl^(t))ᵀ p(l|φ(xi), Θ^(t−1)) / Σ_{i=1}^{N} p(l|φ(xi), Θ^(t−1))  (8)

p(l|φ(xi), Θ^(t−1)) = αl^(t−1) G(φ(xi)|θl^(t−1)) / Σ_{j=1}^{M} αj^(t−1) G(φ(xi)|θj^(t−1)),  l = 1, …, M  (9)
However, computing GMM directly with formulas (8) and (9) in a high-dimensional feature space is computationally expensive and thus impractical. We employ the kernel trick to overcome this difficulty. In the following, we give some properties based on the Mercer kernel trick for estimating the GMM parameters in feature space. For convenience, we first fix the notation in feature space and then present three properties.
Notation. In all formulas, bold capital letters denote matrices, bold italic letters denote vectors, and italic lower-case letters denote scalars. Subscript l refers to the l-th Gaussian component, and superscript t to the t-th iteration of the EM procedure. Aᵀ denotes the transpose of matrix A. Other notations are shown in Table 1.
Ĝl(φ(xj)|θl) = [1 / ((2π)^{dφ/2} (Π_{e=1}^{dφ} λle)^{1/2})] exp(−(1/2) Σ_{e=1}^{dφ} ye²/λle) × [1 / (2πρ)^{(N−dφ)/2}] exp(−ε²(xj)/(2ρ))  (12)

where ye = φ̃(xj)ᵀ Vle (here Vle is the e-th eigenvector of Σl), ρ is the weight ratio, and ε²(xj) is the residual reconstruction error.
On the right side of Equation (12), the first factor is computed in the principal subspace and the second factor in its orthogonal complement subspace.
The optimal value of ρ can be determined by minimizing a cost function. From
an information-theoretic point of view, the cost function should be the Kullback-
Leibler divergence between the true density Gl (φ(xj )|θl ) and its approximation
Ĝl (φ(xj )|θl ).
J(ρ) = ∫ G(φ(x)|θl) log [G(φ(x)|θl) / Ĝ(φ(x)|θl)] dφ(x)  (13)

which evaluates to

J(ρ) = (1/2) Σ_{e=dφ+1}^{N} [λle/ρ − 1 + log(ρ/λle)]
Solving the equation ∂J/∂ρ = 0 yields the optimal value

ρ∗ = (1/(N − dφ)) Σ_{e=dφ+1}^{N} λle
According to Property 2, Σl and K̃l have the same nonzero eigenvalues; by the properties of symmetric matrices we obtain

ρ∗ = (1/(N − dφ)) (‖Σl‖F² − Σ_{e=1}^{dφ} λle²) = (1/(N − dφ)) (‖K̃l‖F² − Σ_{e=1}^{dφ} λle²)  (14)
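Once the eigenvalues of Σl (equivalently, of the centered kernel matrix K̃l) are available, the KL-optimal weight derived above, the mean of the eigenvalues left out of the principal subspace, is a one-liner to compute. A small sketch with illustrative names:

```python
import numpy as np

# rho* = (1/(N - d_phi)) * sum_{e > d_phi} lambda_le : the mean of the
# eigenvalues excluded from the principal subspace.
def rho_star(eigvals, d_phi):
    eigvals = np.sort(np.asarray(eigvals))[::-1]   # descending order
    N = len(eigvals)
    return eigvals[d_phi:].sum() / (N - d_phi)

assert np.isclose(rho_star([4.0, 2.0, 1.0, 0.5, 0.5], 3), 0.5)
```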
The residual reconstruction error ε²(xj) = ‖φ̃(xj)‖² − Σ_{e=1}^{dφ} ye² can easily be obtained by employing the kernel trick:

ε²(xj) = k(xj, xj) − Σ_{e=1}^{dφ} ye²  (15)
According to Lemma 3 in the Appendix,

ye = φ̃(xj)ᵀ Vle = Vle · φ̃(xj) = βᵀ Γl  (16)
With these three properties, we can formulate our kernel Gaussian Mixture Model.
In feature space, kernel matrices replace the mean vectors and covariance matrices of input space in representing each Gaussian component; thus the parameters of each component are kernel matrices rather than mean vectors and covariance matrices. In fact, it is also intractable to compute mean vectors and covariance matrices in feature space, because the feature space has a very high or even infinite dimension.
Step 0. Initialize all pli^(0) (l = 1, …, M; i = 1, …, N); set t = 0, tmax, and the stopping condition to false.
Step 1. While the stopping condition is false, set t = t + 1 and do Steps 2–7.
Step 2. Compute αl^(t), wli^(t), Wl^(t), (Wl^(t))′, Kl^(t) and (Kl^(t))′ according to the notations in Table 1.
Step 3. Compute the matrices K̃l^(t), (K̃l^(t))′ via Property 1.
Step 4. Compute the largest dφ eigenvalues and eigenvectors of the centered kernel matrices K̃l^(t).
Step 5. Compute Ĝl(φ(xj)|θl^(t)) via Property 3.
Step 6. Compute all posterior probabilities pli^(t) via (9).
Step 7. Test the stopping condition: if t > tmax or Σ_{l=1}^{M} Σ_{i=1}^{N} (pli^(t) − pli^(t−1))² < ε, set the stopping condition to true; otherwise loop back to Step 1.
where λi, βi(y) are the eigenvalue and eigenfunction, and k is a given kernel function. The integral can be approximated by the Monte Carlo method with a subset of m samples {xi}_{i=1}^{m} (m ≪ N, m ≫ dφ) drawn according to p(x):

∫ k(x, y) p(x) Vi(x) dx ≈ (1/N) Σ_{j=1}^{N} k(xj, y) Vi(xj)  (18)

(1/N) Σ_{j=1}^{N} k(xj, xk) Vi(xj) = λ̂i Vi(xk)  (19)
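Formulas (18) and (19) underlie Nyström-style speedups: eigenvectors are computed on a small landmark sample and then extended to all points via the kernel. A sketch under that assumption (function and variable names are illustrative, not from the paper):

```python
import numpy as np

# Nystrom-style extension: diagonalize the small m x m kernel matrix on the
# landmark sample, then extend its eigenvectors to all n points via (18).
def nystrom_extend(K_mm, K_nm):
    m = K_mm.shape[0]
    lam, V = np.linalg.eigh(K_mm)          # ascending eigenvalues
    lam, V = lam[::-1], V[:, ::-1]         # reorder to descending
    V_ext = (K_nm @ V) / lam               # extend columns to all n points
    return lam / m, V_ext                  # approximate operator eigenvalues

# Sanity check: with landmarks = all points, the extension reproduces the
# eigenvector equation K v = lambda v.
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
lam_hat, V_ext = nystrom_extend(K, K)
assert np.allclose(K @ V_ext[:, 0], 6 * lam_hat[0] * V_ext[:, 0])
```

In practice only the leading dφ columns are kept, and near-zero eigenvalues are dropped before the division.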
Comparison with Related Works. There is other work related to ours; a major example is the spectral clustering algorithm [21]. Spectral clustering can be regarded as using an RBF-based kernel method to extract features and then clustering by K-means. Compared with spectral clustering, the proposed kGMM has at least two advantages: (1) kGMM provides results in a probabilistic framework and can incorporate prior information easily; (2) kGMM can be used in supervised learning problems as a density estimation method. These advantages encourage us to apply the proposed kGMM.
4 Experiments
In this section, two experiments are performed to validate the proposed kGMM
compared with traditional GMM. Firstly kGMM is employed as unsupervised
learning or clustering method on synthetic data set. Secondly kGMM is em-
ployed as supervised density estimation method for real-world handwritten digit
recognition.
Fig. 2. Data set consists of 1,000 samples. Points marked by ‘×’ belong to one cluster
and marked by ‘·’ belong to the other. (a) is the partition result by traditional GMM;
(b) is the result achieved by kGMM; (c) shows the probability that each point belongs
to the left-right cluster. The whiter the point is, the higher the probability is.
images of size 16×16, divided into a training set of 7,219 images and a test set of 2,007 images.
The original input data is just the vectorized digit image, i.e., the input feature space has dimensionality 256. Optionally, we can perform linear discriminant analysis (LDA) to reduce the dimensionality; if LDA is performed, the feature space has dimensionality 39.
For each category ω, a density p(x|ω) is estimated using a 4-component GMM on the training set. To classify a test sample x, we use the Bayesian decision rule

ω∗ = arg max_ω p(x|ω)P(ω),  ω = 1, …, 10  (20)
k(x, x′) = exp(−γ‖x − x′‖²)
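To make decision rule (20) concrete, here is a hedged sketch: the per-class densities p(x|ω) are stood in for by simple kernel density estimates built from the Gaussian kernel above (the experiments instead use per-class GMM/kGMM densities; all names here are illustrative):

```python
import numpy as np

def rbf(x, xp, gamma=1.0):
    # Gaussian kernel k(x, x') = exp(-gamma * ||x - x'||^2)
    return np.exp(-gamma * np.sum((x - xp) ** 2))

def classify(x, class_samples, priors, gamma=1.0):
    """Rule (20): pick the class maximizing p(x|omega) P(omega),
    with p(x|omega) approximated by a kernel density estimate."""
    scores = [p * np.mean([rbf(x, s, gamma) for s in S])
              for S, p in zip(class_samples, priors)]
    return int(np.argmax(scores))

A = [np.array([0.0, 0.0]), np.array([0.1, 0.0])]   # toy class 0 samples
B = [np.array([3.0, 3.0]), np.array([3.1, 3.0])]   # toy class 1 samples
assert classify(np.array([0.05, 0.0]), [A, B], [0.5, 0.5]) == 0
assert classify(np.array([3.0, 3.1]), [A, B], [0.5, 0.5]) == 1
```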
5 Conclusion
In this paper, we present a kernel Gaussian Mixture Model and derive a parameter estimation algorithm by embedding the kernel trick into the EM algorithm. Furthermore, we adopt a Monte Carlo sampling technique to speed up kGMM on large-scale problems, making it more practical and efficient.
Compared with most classical kernel methods, kGMM solves problems in a probabilistic framework. Moreover, it tackles nonlinear problems better than the traditional GMM. Experimental results on synthetic and real-world data sets show that the proposed approach performs satisfactorily.
Our future work will focus on incorporating prior knowledge such as invariance into kGMM and enriching its applications.
References
1. Achlioptas, D., McSherry, F. and Schölkopf, B.: Sampling techniques for kernel
methods. In Advances in Neural Information Processing System (NIPS) 14, MIT
Press, Cambridge MA (2002)
2. Bilmes, J. A.: A Gentle Tutorial on the EM Algorithm and its Application to
Parameter Estimation for Gaussian Mixture and Hidden Markov Models, Technical
Report, UC Berkeley, ICSI-TR-97-021 (1997)
3. Bishop, C. M.: Neural Networks for Pattern Recognition, Oxford University Press.
(1995)
4. Dahmen, J., Keysers, D., Ney, H. and Güld, M.O.: Statistical Image Object Recog-
nition using Mixture Densities. Journal of Mathematical Imaging and Vision, 14(3)
(2001) 285–296
5. Duda, R. O., Hart, P. E. and Stork, D. G.: Pattern Classification, New York: John
Wiley & Sons Press, 2nd Edition. (2001)
6. Everitt, B. S.: An Introduction to Latent Variable Models, London: Chapman and
Hall. (1984)
7. Francis R. B. and Michael I. J.: Kernel Independent Component Analysis, Journal
of Machine Learning Research, 3, (2002) 1–48
8. Gestel, T. V., Suykens, J.A.K., Lanckriet, G., Lambrechts, A., Moor, B. De and
Vanderwalle J.: Bayesian framework for least squares support vector machine clas-
sifiers, gaussian processs and kernel fisher discriminant analysis. Neural Computa-
tion, 15(5) (2002) 1115–1148
9. Herbrich, R., Graepel, T. and Campbell, C.: Bayes Point Machines: Estimating
the Bayes Point in Kernel Space. In Proceedings of International Joint Conference
on Artificial Intelligence Workshop on Support Vector Machines, (1999) 23–27
10. Kwok, J. T.: The Evidence Framework Applied to Support Vector Machines, IEEE
Trans. on NN, Vol. 11 (2000) 1162–1173.
11. Mika, S., Rätsch, G., Weston, J., Schölkopf, B. and Müller, K.R.: Fisher discrim-
inant analysis with kernels. IEEE Workshop on Neural Networks for Signal Pro-
cessing IX, (1999) 41–48
12. Mjolsness, E. and Decoste, D.: Machine Learning for Science: State of the Art and Future Prospects, Science, Vol. 293 (2001)
13. Roberts, S. J.: Parametric and Non-Parametric Unsupervised Cluster Analysis, Pattern Recognition, Vol. 30, No. 2 (1997) 261–272
14. Schölkopf, B., Smola, A.J. and Müller, K.R.: Nonlinear Component Analysis as a Kernel Eigenvalue Problem, Neural Computation, 10(5), (1998) 1299–1319
15. Schölkopf, B., Mika, S., Burges, C. J. C., Knirsch, P., Müller, K. R., Raetsch, G.
and Smola, A.: Input Space vs. Feature Space in Kernel-Based Methods, IEEE
Trans. on NN, Vol 10. No. 5, (1999) 1000–1017
16. Schölkopf, B. and Smola, A. J.: Learning with Kernels: Support Vector Machines,
Regularization and Beyond, MIT Press, Cambridge MA (2002)
17. Tipping, M. E.: Sparse Bayesian Learning and the Relevance Vector Machine,
Journal of Machine Learning Research. (2001)
18. Vapnik, V.: The Nature of Statistical Learning Theory, 2nd Edition, Springer-
Verlag, New York (1997)
19. Williams, C. and Seeger, M.: Using the Nyström Method to Speed Up Kernel
Machines. In T. K. Leen, T. G. Diettrich, and V. Tresp, editors, Advances in Neural
Information Processing Systems (NIPS)13. MIT Press, Cambridge MA (2001)
20. Shawe-Taylor, J., Williams, C., Cristianini, N. and Kandola, J.: On the Eigenspectrum of the Gram Matrix and Its Relationship to the Operator Eigenspectrum, N. Cesa-Bianchi et al. (Eds.): ALT 2002, LNAI 2533, Springer-Verlag, Berlin Heidelberg (2002) 23–40
21. Ng, A. Y., Jordan, M. I. and Weiss, Y.: On Spectral Clustering: Analysis and an
algorithm, Advance in Neural Information Processing Systems (NIPS) 14, MIT
Press, Cambridge MA (2002)
22. Moghaddam, B. and Pentland, A.: Probabilistic visual learning for object repre-
sentation, IEEE Trans. on PAMI, Vol. 19, No. 7 (1997) 696–710
Appendix
K̃ = K − WK − KW + WKW  (a1)

where W = wwᵀ.
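For uniform weights wi = 1/√N, formula (a1) reduces to the familiar Kernel PCA centering. A short sketch verifying this numerically with a linear kernel (so the centered kernel can be checked against explicitly centered features):

```python
import numpy as np

# K_tilde = K - WK - KW + WKW with W = w w^T, formula (a1).
def center_kernel(K, w):
    W = np.outer(w, w)
    return K - W @ K - K @ W + W @ K @ W

# With uniform weights w_i = 1/sqrt(N), this equals the kernel matrix of
# the mean-centered features (linear kernel makes the check easy).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
K = X @ X.T
w = np.full(5, 1 / np.sqrt(5))
Xc = X - X.mean(axis=0)
assert np.allclose(center_kernel(K, w), Xc @ Xc.T)
```

In kGMM the weights are non-uniform (they come from the responsibilities), which is exactly what the general form of (a1) accommodates.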
(2) If K′ is an N × N projecting kernel matrix such that K′ij = φ(xi) · wjφ(xj), and K̃′ is an N × N matrix centered in the feature space such that K̃′ij = φ̃(xi) · wjφ̃(xj), then

K̃′ = K′ − WK′ − K′W + WKW′  (a2)

Indeed, expanding the centered entries,

K̃ij = wiφ(xi)ᵀ wjφ(xj) − wi Σ_{k=1}^{N} wk (wkφ(xk)ᵀ wjφ(xj)) − Σ_{k=1}^{N} wk (wiφ(xi)ᵀ wkφ(xk)) wj + wi Σ_{k=1}^{N} Σ_{n=1}^{N} wkwn (wkφ(xk)ᵀ wnφ(xn)) wj
= Kij − wi Σ_{k=1}^{N} wk Kkj − Σ_{k=1}^{N} wk Kik wj + wi Σ_{k=1}^{N} Σ_{n=1}^{N} wkwn Kkn wj
Σ = Σ_{i=1}^{N} φ̃(xi)φ̃(xi)ᵀ wi²  (a3)
V = Σ_{i=1}^{N} βi wi φ̃(xi)  (a5)
λ Σ_{i=1}^{N} βi (wkφ̃(xk) · wiφ̃(xi)) = Σ_{i=1}^{N} βi Σ_{j=1}^{N} (wkφ̃(xk) · wjφ̃(xj)) (wjφ̃(xj) · wiφ̃(xi))

that is, λK̃β = K̃²β, and hence

λβ = K̃β  (a6)
V · φ̃(xj) = Σ_{i=1}^{N} βi wi (φ̃(xi) · φ̃(xj)) = βᵀΓ  (a7)
1 Introduction
Machine learning algorithms rely to a large extent on the availability of a good
representation of the data, which is often the result of human design choices.
More specifically, a ‘suitable’ distance measure between data items needs to be
specified, so that a meaningful notion of ‘similarity’ is induced. The notion of
‘suitable’ is inevitably task dependent, since the same data might need very
different representations for different learning tasks.
R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 175–189, 2003.
© Springer-Verlag Berlin Heidelberg 2003
176 T. De Bie, M. Momma, and N. Cristianini
This means that automating the task of choosing a representation will necessarily require some type of information (e.g., some of the labels, or less refined forms of information about the task at hand). Labels may be too expensive, whereas a less refined and more readily available source of information, known as side-information, can be used. For example, one may want to define a metric over the space of movie descriptions using data about customer associations (such as sets of movies liked by the same customer, as in [9]) as side-information.
This type of side-information is commonplace in marketing data, recommendation systems, bioinformatics and web data. Many recent papers have dealt with these and related problems: some impose extra constraints without learning a metric, as in the constrained K-means algorithm [5]; others learn a metric implicitly, like [9] and [13], or explicitly, as in [15]. In particular, [15] provides a conceptually elegant algorithm based on semi-definite programming (SDP) for learning the metric in the data space from side-information, an algorithm that unfortunately has complexity O(d⁶) for d-dimensional data¹.
In this paper we present an algorithm for the problem of finding a suitable metric, using side-information that consists of n example pairs (xi^(1), xi^(2)), i = 1, …, n, belonging to the same but unknown class. Furthermore, we place our algorithm in a general framework into which the methods described in [14] and [13] would also fit. More specifically, we show how these methods can all be related to Linear Discriminant Analysis (LDA, see [8] or [7]).
For reference, we first give a brief review of LDA. Next we show how our method can be derived as an approximation to LDA when only side-information is available. Furthermore, we provide a derivation similar to the one in [15] in order to show the correspondence between the two approaches. Empirical results include a toy example and UCI data sets also used in [15].
Notation. All vectors are assumed to be column vectors. The identity matrix of dimension d is denoted $I_d$. With $\mathbf{0}$, we denote a matrix or a vector of appropriate size, containing all zero elements. The vector $\mathbf{1}$ is a vector of appropriate dimension containing all 1's. A prime denotes a transpose.
To denote the side-information, consisting of n pairs $(x_i^{(1)}, x_i^{(2)})$ for which it is known that $x_i^{(1)}$ and $x_i^{(2)} \in \mathbb{R}^d$ belong to the same class, we will use the matrices $X^{(1)} \in \mathbb{R}^{n \times d}$ and $X^{(2)} \in \mathbb{R}^{n \times d}$. These contain $x_i^{(1)}$ and $x_i^{(2)}$ as their $i$th rows:
$$X^{(1)} = \begin{pmatrix} x_1^{(1)\prime} \\ x_2^{(1)\prime} \\ \vdots \\ x_n^{(1)\prime} \end{pmatrix} \quad \text{and} \quad X^{(2)} = \begin{pmatrix} x_1^{(2)\prime} \\ x_2^{(2)\prime} \\ \vdots \\ x_n^{(2)\prime} \end{pmatrix}.$$
This means that for any $i = 1, \ldots, n$, it is known that the samples at the $i$th rows of $X^{(1)}$ and $X^{(2)}$ belong to the same class.

¹ The authors of [15] see this problem, and they try to circumvent it by developing a gradient descent algorithm instead of using standard Newton algorithms for solving SDP problems. However, this may lead to convergence problems, especially for data sets in high-dimensional spaces.

Efficiently Learning the Metric with Side-Information 177
For ease of notation (but without loss of generality) we will construct the full data matrix² $X \in \mathbb{R}^{2n \times d}$ as $X = \begin{pmatrix} X^{(1)} \\ X^{(2)} \end{pmatrix}$. When we want to denote the sample corresponding to the $i$th row of $X$ without regard to the side-information, it is denoted $x_i \in \mathbb{R}^d$ (without superscript, and $i = 1, \ldots, 2n$). The data matrix is assumed to be centered, that is $\mathbf{1}'X = \mathbf{0}$ (the mean of each column is zero). We use $w \in \mathbb{R}^d$ to denote a weight vector in this d-dimensional data space.
Although the labels for the samples are not known in our problem setting, we will consider the label matrix $Z \in \mathbb{R}^{2n \times c}$ corresponding to $X$ in our derivations. (The number of classes is denoted by c.) It is defined as (where $Z_{i,j}$ indicates the element at row i and column j):
within-class scatter matrix $C_W = \sum_{i=1}^{c} \sum_{j : x_j \in C_i} (x_j - m_i)(x_j - m_i)'$. Since the labels are not known in our problem setting, we will only use these quantities in our derivations, not in our final results.
In this section, we will show how the LDA formulation which requires labels
can be adapted for cases where no labels but only side-information is available.
The resulting formulation can be seen as an approximation of LDA with labels
available. This will lead to an efficient algorithm to learn a metric: given the side-
information, solving just a generalized eigenproblem is sufficient to maximize the
expected separation between the clusters.
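As an illustration, the whole step can be carried out with one call to a symmetric-definite eigensolver. The specific eigenproblem $(C_{12}+C_{21})w = \lambda(C_{11}+C_{22})w$ is our reading of the derivation in the appendix, and the toy data below are invented; treat this as a sketch rather than the authors' implementation.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)

# Invented toy side-information: n pairs (x_i^(1), x_i^(2)) known to
# belong to the same (unknown) class.
n, d = 100, 5
X1 = rng.normal(size=(n, d))
X2 = X1 + 0.1 * rng.normal(size=(n, d))   # same-class partners, perturbed

# Center the full data matrix X = [X1; X2], as the derivation assumes.
X = np.vstack([X1, X2])
X -= X.mean(axis=0)
X1c, X2c = X[:n], X[n:]

# Scatter matrices C_ab = X^(a)' X^(b).
C11, C22, C12 = X1c.T @ X1c, X2c.T @ X2c, X1c.T @ X2c

# A single symmetric-definite generalized eigenproblem -- no SDP needed:
#   (C12 + C21) w = lambda (C11 + C22) w
lam, W = eigh(C12 + C12.T, C11 + C22)
lam, W = lam[::-1], W[:, ::-1]            # dominant directions first
```

For this strongly paired toy data the leading eigenvalues are close to 1; the cost is a single dense eigendecomposition, $O(d^3)$, rather than the $O(d^6)$ of the SDP approach.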
² In all derivations, the only data samples involved are the ones that appear in the side-information. It is not until the empirical results section that data not involved in the side-information is dealt with: the side-information is used to learn the metric, and only subsequently is this metric used to cluster any other available sample. We also assume no sample appears twice in the side-information.
178 T. De Bie, M. Momma, and N. Cristianini
2.1 Motivation
This formulation for LDA will be the starting point for our derivations.
Dimensionality Selection. As with LDA, one will generally not use only the dominant eigenvector, but a dominant eigenspace to project the data on. The number of eigenvectors used should depend on the signal-to-noise ratio along these components: when it is too low, noise effects will cause poor performance of a subsequent clustering, so we need an estimate of the noise level. This is provided by the negative eigenvalues: they allow us to make a good estimate of the noise level present in the data, thus motivating the strategy adopted in this paper: retain only the k directions corresponding to eigenvalues larger than the largest absolute value of the negative eigenvalues.
Since we will project the data onto the k dominant eigenvectors $w$, this finally boils down to using the distance measure
$$d^2(x_i, x_j) = \left(W'(x_i - x_j)\right)'\left(W'(x_i - x_j)\right) = \|x_i - x_j\|^2_{WW'}.$$
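A sketch of this selection rule and the induced distance, on invented data where only the first two coordinates carry the shared class signal (the eigenproblem form $(C_{12}+C_{21})w = \lambda(C_{11}+C_{22})w$ is our assumption for the sketch):

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)

# Invented toy data: the first 2 coordinates carry the class signal shared
# by each pair, the remaining 4 coordinates are independent noise.
n, d_signal, d_noise = 200, 2, 4
S = rng.normal(size=(n, d_signal))
X1 = np.hstack([S, rng.normal(size=(n, d_noise))])
X2 = np.hstack([S + 0.1 * rng.normal(size=(n, d_signal)),
                rng.normal(size=(n, d_noise))])
X = np.vstack([X1, X2])
X -= X.mean(axis=0)
X1c, X2c = X[:n], X[n:]

C11, C22, C12 = X1c.T @ X1c, X2c.T @ X2c, X1c.T @ X2c
lam, V = eigh(C12 + C12.T, C11 + C22)
lam, V = lam[::-1], V[:, ::-1]            # dominant eigenvalues first

# Retain only directions whose eigenvalue exceeds the noise level,
# estimated as the largest absolute value among the negative eigenvalues.
noise_level = max(-lam.min(), 0.0)
k = int((lam > noise_level).sum())
W = V[:, :k]

def dist2(xi, xj, W=W):
    """Squared learned distance d^2(xi, xj) = ||W'(xi - xj)||^2."""
    z = W.T @ (xi - xj)
    return float(z @ z)
```

On this data the signal directions get eigenvalues near 1 while the noise directions scatter around 0, so the threshold typically keeps the two informative directions.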
3 Remarks
Actually, $X^{(1)}$ and $X^{(2)}$ do not have to belong to the same space; they can be of a different kind: it is sufficient that corresponding samples in $X^{(1)}$ and $X^{(2)}$ belong to the same class to do something similar as above. Of course we then need different weight vectors in the two spaces: $w^{(1)}$ and $w^{(2)}$. Following a similar reasoning as above, in Appendix C we provide an argumentation that solving the CCA eigenproblem
$$\begin{pmatrix} 0 & C_{12} \\ C_{21} & 0 \end{pmatrix} \begin{pmatrix} w^{(1)} \\ w^{(2)} \end{pmatrix} = \lambda \begin{pmatrix} C_{11} & 0 \\ 0 & C_{22} \end{pmatrix} \begin{pmatrix} w^{(1)} \\ w^{(2)} \end{pmatrix}$$
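A numerical sketch of this CCA eigenproblem for two views of different dimensionality; the data are invented, and the block construction is one straightforward way to pose the problem:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(3)

# Two views of different dimensionality sharing a 1-D latent signal
# (invented data for the illustration).
n, d1, d2 = 300, 4, 6
s = rng.normal(size=(n, 1))
X1 = np.hstack([s, rng.normal(size=(n, d1 - 1))])
X2 = np.hstack([s + 0.1 * rng.normal(size=(n, 1)),
                rng.normal(size=(n, d2 - 1))])
X1 -= X1.mean(axis=0)
X2 -= X2.mean(axis=0)

C11, C22, C12 = X1.T @ X1, X2.T @ X2, X1.T @ X2

# CCA as a generalized eigenproblem:
#   [ 0   C12 ] [w1]          [ C11  0  ] [w1]
#   [ C21  0  ] [w2] = lambda [ 0   C22 ] [w2]
A = np.block([[np.zeros((d1, d1)), C12],
              [C12.T, np.zeros((d2, d2))]])
B = np.block([[C11, np.zeros((d1, d2))],
              [np.zeros((d2, d1)), C22]])
lam, V = eigh(A, B)
rho = lam.max()   # top canonical correlation; eigenvalues come in +/- pairs
```

Because both views contain the same latent signal, the top canonical correlation comes out close to 1 here.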
– When the groups of samples known to belong to the same class are larger than 2 (let us call them $X^{(i)}$ again, but now i is not restricted to only 1 or 2): this can be handled very analogously to our previous derivation. Therefore we just state the resulting generalized eigenvalue problem:
$$\left(\sum_k X^{(k)}\right)'\left(\sum_k X^{(k)}\right) w = \lambda \left(\sum_k X^{(k)\prime} X^{(k)}\right) w$$
– Also in case we are dealing with more than 2 data sets of a different nature (e.g., analogous to [14]: we could have more than 2 data sets, each consisting of a text corpus in a different language), but for which it is known that corresponding samples belong to the same class (as described in the previous subsection), the problem is easily shown to reduce to the extension of CCA to more data spaces, as is e.g. used in [1]. Space restrictions do not permit us to go into this.
– It is possible to keep this approach completely general, allowing for any type of side-information in the form of constraints expressing, for any number of samples, that they belong to the same class or, on the contrary, do not belong to the same class. Knowledge of some of the labels can also be exploited. For doing this, we have to use a different parameterization for Z than the one used in this paper. In principle any prior distribution on the parameters can also be taken into account. However, sampling techniques will be necessary to estimate the expected value of the LDA cost function in these cases. We will not go into this in the current paper.
the α’s corresponding to the weight vectors w are found as the generalized
eigenvectors of
This suggests that it will be possible to extend the approach to learning non-linear metrics with side-information as well.
4 Empirical Results
The empirical results reported in this paper will be for clustering problems with
the type of side-information described above. Thus, with our method we learn
a suitable metric based on a set of samples for which the side-information is
known, i.e. X(1) and X(2) . Subsequently a K-means clustering of all samples
(including those that are not in X(1) or X(2) ) is performed, making use of the
metric that is learned.
4.2 Regularization
To deal with inaccuracies, numerical instabilities and influences of finite sample
size, we apply regularization to the generalized eigenvalue problem. This is done
in the same spirit as for CCA in [1], namely by adding a diagonal to the scatter
matrices C11 and C22 . This is justified thanks to the CCA-based derivation of
our algorithm. To train the regularization parameter, a cost function described
below is minimized via 10-fold cross validation.
In choosing the right regularization parameter, there are two things to consider. Firstly, we want the clustering to be good: the side-information should be reflected as well as possible by the clustering. Secondly, we want this clustering to be informative: we don't want one very large cluster with the others very small (fulfilling the side-information would then be too easy). Therefore, the cross-validation cost minimized here is the probability of the measured performance on the test-set side-information, given the sizes of the clusters found. (More exactly, we maximized the difference between this performance and its expected value, divided by its standard deviation.) This approach incorporates both considerations in a natural way.
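The regularization step can be sketched as follows; the function and the way the parameter γ enters are our own illustration (the paper selects the parameter by the cross-validated cost described above), not the authors' code.

```python
import numpy as np
from scipy.linalg import eigh

def learn_metric_regularized(X1, X2, gamma):
    """Side-information eigenproblem with ridge-regularized scatter
    matrices C11 + gamma*I and C22 + gamma*I (gamma is the regularization
    parameter; the paper selects it by 10-fold cross-validation)."""
    d = X1.shape[1]
    X = np.vstack([X1, X2])
    X -= X.mean(axis=0)
    X1c, X2c = X[:len(X1)], X[len(X1):]
    C11 = X1c.T @ X1c + gamma * np.eye(d)
    C22 = X2c.T @ X2c + gamma * np.eye(d)
    C12 = X1c.T @ X2c
    lam, V = eigh(C12 + C12.T, C11 + C22)
    return lam[::-1], V[:, ::-1]

rng = np.random.default_rng(4)
X1 = rng.normal(size=(50, 4))
X2 = X1 + 0.2 * rng.normal(size=(50, 4))
lam_weak, _ = learn_metric_regularized(X1, X2, gamma=0.1)
lam_strong, _ = learn_metric_regularized(X1, X2, gamma=100.0)
# Stronger regularization shrinks the spectrum toward zero.
```

Adding γI to the denominator matrices also guarantees positive definiteness when the sample is small relative to the dimension, which is the numerical-stability motivation mentioned above.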
Fig. 1. A toy example in which the two clusters each consist of two distinct, widely separated clouds of samples. Ordinary K-means obviously has a very low accuracy of 0.5, whereas when some side-information is taken into account as described in this paper, the performance goes up to 0.92.
5 Conclusions
Table 1. Accuracies on UCI data sets, for different numbers of connected components. (The more side-information, the fewer connected components. The fraction f is the number of connected components divided by the total number of samples.)
Table 2. Accuracies on the wine and the protein data sets, as a function of the ratio of constraints.
CCA and LDA, that try to identify interesting subspaces for a given task. This
often comes as an advantage, since algorithms like K-means and constrained
K-means will run faster on lower dimensional data.
References
1. F. R. Bach and M. I. Jordan. Kernel independent component analysis. Journal of
Machine Learning Research, 3:1–48, 2002.
2. M. Barker and W.S. Rayens. Partial least squares for discrimination. Journal of
Chemometrics, 17:166–173, 2003.
3. M. S. Bartlett. Further aspects of the theory of multiple regression. Proc. Camb.
Philos. Soc., 34:33–40, 1938.
4. M. Borga, T. Landelius, and H. Knutsson. A Unified Approach to PCA, PLS,
MLR and CCA. Report LiTH-ISY-R-1992, ISY, SE-581 83 Linköping, Sweden,
November 1997.
5. P. Bradley, K. Bennett, and A. Demiriz. Constrained K-means clustering. Technical Report MSR-TR-2000-65, Microsoft Research, 2000.
6. N. Cristianini, J. Shawe-Taylor, A. Elisseeff, and J. Kandola. On kernel-target
alignment. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances
in Neural Information Processing Systems 14, Cambridge, MA, 2002. MIT Press.
7. R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley &
Sons, Inc., 2nd edition, 2000.
8. R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals
of Eugenics, 7, Part II:179–188, 1936.
9. T. Hofmann. What people don’t want. In European Conference on Machine Learn-
ing (ECML), 2002.
10. R. A. Horn and C. R. Johnson. Topics in Matrix Analysis. Cambridge University
Press, 1991.
11. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semi-definite programming. Technical Report CSD-02-1206, Division of Computer Science, University of California, Berkeley, 2002.
12. R. Rosipal, L.J. Trejo, and B. Matthews. Kernel PLS-SVC for linear and nonlinear
classification. In (to appear) Proceedings of the Twentieth International Conference
on Machine Learning, 2003.
13. J.-P. Vert and M. Kanehisa. Graph-driven features extraction from microarray data using diffusion kernels and CCA. In Advances in Neural Information Processing Systems 15, Cambridge, MA, 2003. MIT Press.
14. A. Vinokourov, N. Cristianini, and J. Shawe-Taylor. Inferring a semantic repre-
sentation of text via cross-language correlation analysis. In Advances in Neural
Information Processing Systems 15, Cambridge, MA, 2003. MIT Press.
15. E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning, with
application to clustering with side-information. In Advances in Neural Information
Processing Systems 15, Cambridge, MA, 2003. MIT Press.
$$\|Lw_Z\|^2 = 1$$
The Lagrangian of this constrained optimization problem is:
$$\mathcal{L} = w'X^{(1)\prime}Lw_Z + w'X^{(2)\prime}Lw_Z - \lambda\, w'(X^{(1)\prime}X^{(1)} + X^{(2)\prime}X^{(2)})w - \mu\, w_Z'L'Lw_Z$$
It is well known that solving for the dominant generalized eigenvector is equivalent to maximizing the Rayleigh quotient:
$$\frac{w'(X^{(1)}+X^{(2)})'\,L(L'L)^{\dagger}L'\,(X^{(1)}+X^{(2)})\,w}{w'(X^{(1)\prime}X^{(1)}+X^{(2)\prime}X^{(2)})\,w}. \qquad (7)$$
Until now, for the given side-information, there is still an exact equivalence be-
tween LDA and maximizing this Rayleigh quotient. The important difference
between the standard LDA cost function and (7) however, is that in the latter
the side-information is imposed explicitly by using the reduced parameterization
for Z in terms of L.
The Expected Cost Function. As pointed out, we do not know the term between [·]. What we will do then is compute the expected value of the cost function (7) by averaging over all possible label matrices $Z = \begin{pmatrix} L \\ L \end{pmatrix}$, possibly weighted with any symmetric⁴ a priori probability for the label matrices. Since the only part that depends on the label matrix is the factor between [·], and since it appears linearly in the cost function, we just need to compute the expectation of this factor. This expectation is proportional to $I - \frac{\mathbf{1}\mathbf{1}'}{n}$. To see this we only have to use symmetry arguments (all values on the diagonal should be equal to each other, and all other values should be equal to each other), and the observation that $L$ is centered and thus $L(L'L)^{\dagger}L'\mathbf{1} = \mathbf{0}$. Now, since we assume that the data matrix $X$ containing the samples in the side-information is centered too, $(X^{(1)}+X^{(2)})'\frac{\mathbf{1}\mathbf{1}'}{n}(X^{(1)}+X^{(2)})$ is equal to the null matrix. Thus the expected value of $(X^{(1)}+X^{(2)})'L(L'L)^{\dagger}L'(X^{(1)}+X^{(2)})$ is proportional to $(X^{(1)}+X^{(2)})'(X^{(1)}+X^{(2)})$. The expected value of the LDA cost function in equation (7), where the expectation is taken over all possible label assignments $Z$ constrained by the side-information, is then shown to be
$$\frac{w'(C_{11}+C_{12}+C_{22}+C_{21})w}{w'(C_{11}+C_{22})w} = 1 + \frac{w'(C_{12}+C_{21})w}{w'(C_{11}+C_{22})w}$$
The vector w maximizing this cost is the dominant generalized eigenvector of
the eigenvalue problem to be solved does not change either, which is of course a
desirable property.)
$$\max_W \operatorname{trace}\left(X^{(1)} W W' X^{(2)\prime}\right) \quad \text{s.t.} \quad \dim(W) = k, \quad W' \begin{pmatrix} X^{(1)} \\ X^{(2)} \end{pmatrix}' \begin{pmatrix} X^{(1)} \\ X^{(2)} \end{pmatrix} W = I_k$$
$$\lambda_1 + \ldots + \lambda_k = \max_{P'P = I} \operatorname{trace}(P'HP).$$
where L corresponds to the common label matrix for X(1) and X(2) (both cen-
tered). In a similar way as previous derivation, this can be shown to amount to
solving the eigenvalue problem:
$$\begin{pmatrix} 0 & X^{(1)\prime}L(L'L)^{-1}L'X^{(2)} \\ X^{(2)\prime}L(L'L)^{-1}L'X^{(1)} & 0 \end{pmatrix} \begin{pmatrix} w^{(1)} \\ w^{(2)} \end{pmatrix} = \lambda \begin{pmatrix} C_{11} & 0 \\ 0 & C_{22} \end{pmatrix} \begin{pmatrix} w^{(1)} \\ w^{(2)} \end{pmatrix}$$
1 Introduction
A variety of machine learning and statistical inference problems focus on su-
pervised learning from labeled training data. In such problems, convexity often
plays a central role in formulating the loss function to be minimized during
training. For example, a standard approach to formulating a training loss is
to distinguish a preferred value from a set of candidate prediction values, and
measure prediction error by a convex error measure. Examples of this include
least squares regression, decision tree learning, boosting, on-line learning, max-
imum likelihood for exponential models, logistic regression, maximum entropy,
support vector machines, statistical signal processing (e.g. Burg’s spectral esti-
mation for speech signal analysis and image reconstruction) and optimal portfo-
lio selection. Such problems can often be naturally cast as convex optimization
problems involving a Bregman divergence [5,10,23], which can lead to new al-
gorithms, analytical tools, and insights derived from the powerful methods of
convex analysis [2,3,7,13]. Training algorithms that solve these problems can be
cast as implementing a minimum Bregman divergence (MB) principle.
However, in practice, many of the natural patterns we wish to classify are
the result of causal processes that have hidden hierarchical structure—yielding
data that does not report the value of latent variables. For example, in natural
language learning the observed data rarely reports the value of hidden semantic
variables or syntactic structure, in speech signal analysis gender information is
R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 190–204, 2003.
c Springer-Verlag Berlin Heidelberg 2003
Learning Continuous Latent Variable Models with Bregman Divergences 191
not explicitly marked, etc. Obtaining fully labeled data is tedious or impossi-
ble in most realistic cases. This motivates us to propose a class of unsupervised
statistical learning algorithms that are still formulated in terms of minimizing a
Bregman divergence, except that we must now change the problem formulation
to respect hidden variables. In this paper we propose training algorithms for
solving the latent minimum Bregman divergence (LMB) principle: given a set
of training data and features that one would like to match in the training data,
compute a model that minimizes a convex objective function (a Bregman diver-
gence) subject to a set of non-linear constraints that take into account possible
latent structure.
Our treatment of the LMB principle closely parallels the results presented in
[24] for the Kullback-Leibler divergence, but the extension proposed here is not
trivial. For probabilistic models under the Kullback-Leibler divergence, we can
show an equivalence between satisfying the constraints (i.e. achieving feasibility)
and locally maximizing the likelihood under a log-linear assumption. Thus, in
this case, we can resort to the EM algorithm [14] to develop a practical tech-
nique for finding feasible solutions and proving convergence. However, general
Bregman divergences raise a more difficult technical challenge because the EM
approach breaks down for these generalized entropy measures. In this paper, we
will overcome this difficulty by using an alternating minimization approach [9] in a non-trivial way; see Figure 1. Thus, beyond the generalized KL divergence used for unsupervised boosting in clustering [25], the techniques of this paper can also handle a broader class of functions, such as the Itakura-Saito distortion [17] for speech signal analysis.
[Fig. 1 diagram: a nested hierarchy of divergences — Shannon entropy, K-L divergence, generalized K-L divergence, joint convex Bregman divergence, Bregman divergence — with the EM algorithm (LME principle) attached to the K-L divergence, the AM algorithm (LMB principle) to the joint convex Bregman divergences, and unsupervised boosting to the generalized K-L divergence.]
Fig. 1. The AM algorithm proposed in this paper is valid for the family of jointly convex Bregman divergences, whereas the EM algorithm proposed in [24] is only valid for the K-L divergence. The unsupervised boosting case deals with the generalized K-L divergence, and thus can be solved by using the AM algorithm to find feasible solutions under the latent minimum Bregman divergence principle.
192 S. Wang and D. Schuurmans
where
$$\Delta_\phi(p(x); q(x)) = \phi(p(x)) - \phi(q(x)) - \phi'(q(x))\,(p(x) - q(x))$$
and $\phi'$ denotes the derivative of $\phi$.² That is, the Bregman divergence $B_\phi$ measures the discrepancy between two distributions p and q by integrating the difference between $\phi$ evaluated at p and $\phi$'s first-order Taylor expansion about q, evaluated at p, over $\mathcal{X}$.
To strengthen the interpretation of Bφ (p; q) as a measure of distance, we
make the following assumptions.
• ∆φ (u, v) is strictly convex in u and in v separately, but also satisfies the
stronger property that it is jointly convex in u, v. Thus our choice of Bregman
divergence Bφ (p; q) is strictly convex in p and in q separately, and also jointly
convex. This assumption lies at the heart of the analysis below.
• Bφ (p; q) is lower-semi-continuous in p and q jointly.
• For any fixed $p \in S$, the level sets $\{q : B_\phi(p; q) \le \epsilon\}$ are bounded.
• If Bφ (pk ; q k ) → 0 and pk or q k is bounded, then pk → q k and q k → pk .
• If p ∈ S and q k → p, then Bφ (p; q k ) → 0.
Examples
1. Let $\phi(t) = t \log t$ be defined on $I = [0, \infty)$. Then $\phi'(t) = \log t + 1$, and
$$B_\phi(p; q) = D(p; q) = \int_{x \in \mathcal{X}} \left( p(x) \log \frac{p(x)}{q(x)} - p(x) + q(x) \right) \mu(dx)$$
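For the discrete (counting-measure) case this identity is easy to check numerically; the distributions below are invented for the sketch:

```python
import numpy as np

def bregman_phi(p, q, phi, dphi):
    """Discrete Bregman divergence (counting measure):
    B_phi(p; q) = sum_x [ phi(p(x)) - phi(q(x)) - phi'(q(x)) (p(x) - q(x)) ]."""
    return float(np.sum(phi(p) - phi(q) - dphi(q) * (p - q)))

phi = lambda t: t * np.log(t)       # phi(t) = t log t on (0, inf)
dphi = lambda t: np.log(t) + 1.0    # phi'(t) = log t + 1

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.25, 0.25, 0.5])

# For this phi the Bregman divergence equals the generalized KL divergence
# sum_x [ p log(p/q) - p + q ].
gen_kl = float(np.sum(p * np.log(p / q) - p + q))
assert abs(bregman_phi(p, q, phi, dphi) - gen_kl) < 1e-12
```

The algebra behind the assertion: $\phi(p) - \phi(q) - \phi'(q)(p - q) = p\log p - p\log q - p + q$, term by term.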
¹ The machine learning community [8,13,18,19] is familiar with the discrete case, since for supervised learning there are a finite number of sample points (training examples), so we can write the constraints as pertaining to a finite-dimensional vector. However, in unsupervised learning we are usually dealing with continuous variables, and therefore instead of a vector we are working with an infinite-dimensional space.
² In this paper, µ denotes a given σ-finite measure on $\mathcal{X}$. If $\mathcal{X}$ is finite or countably infinite, then µ is the counting measure, and integrals reduce to sums. If $\mathcal{X}$ is a subset of a finite-dimensional space, µ is the Lebesgue measure. If $\mathcal{X}$ is a combination of both cases, µ will be a combination of both measures.
Then we have $v \circ_\phi q_0 = L_\phi(q_0, v)$ such that $(L_\phi(q_0, v))(x) = (\phi')^{-1}\left(\phi'(q_0(x)) - v(x)\right)$ for all x. Also, the map $(v, q_0) \mapsto v \circ_\phi q_0$ is an additive model for S.
By adopting an additive model restriction, we can make valuable progress
toward formulating a practical algorithm for approximately satisfying the LMB
principle.
In the following, we use a doubly iterative projection algorithm to obtain
feasible additive solutions, and also provide a characterization of its convergence
properties and information geometry.
$$B_\phi(p; q) - B_\phi(p^*; q)$$
$$= \int_{x \in \mathcal{X}} \left[ \phi(p(x)) - \phi(p^*(x)) - \phi'(q(x))\,(p(x) - p^*(x)) \right] \mu(dx)$$
$$= \int_{x \in \mathcal{X}} \left[ \phi(p(x)) - \phi(p^*(x)) - \phi'(p^*(x))(p(x) - p^*(x)) + \left(\phi'(p^*(x)) - \phi'(q(x))\right)(p(x) - p^*(x)) \right] \mu(dx)$$
$$= B_\phi(p; p^*) + \int_{x \in \mathcal{X}} \left(\phi'(p^*(x)) - \phi'(q(x))\right)(p(x) - p^*(x))\, \mu(dx)$$
for all x. Since $p^*$ minimizes $B_\phi(p; q)$ over the convex set $\mathcal{P}$, we must have
$$\nabla_{p^*} B_\phi(p^*; q) \cdot (p - p^*) \ge 0$$
$$\nabla_{q^*} B_\phi(p; q^*) \cdot (v - q^*) \ge 0$$
Thus
Thus we obtain
Given these two lemmas, following [15] we obtain the following convergence
result.
Theorem 1. The alternating minimization algorithm (AM) converges. That is,
p1 , p2 , ... converges to some p∞ ∈ P, and q 1 , q 2 , ... converges to some q ∞ ∈ Q,
such that
The proof of this theorem follows the same line of argument as that of Theorem 2.17 in [15].
where $\mathcal{C}$ denotes the set of nonlinear constraints the model should satisfy, $\mathcal{M}$ denotes the set of distributions whose observed marginal distribution matches the observed empirical distribution, $\mathcal{G}_a$ denotes the set of distributions whose features' expectations are constant, $\mathcal{E}$ denotes the set of additive models, and
$$\Omega = \left\{ \lambda \in \mathbb{R}^N : L_\phi\left(q_0, \sum_{i=1}^{N} \lambda_i f_i(x)\right) < \infty \right\}$$
i=1
Now by choosing the closed convex set M̄ to play the role of P in the previous
discussion, and choosing the closed convex set Ē to play the role of Q, we can
define the corresponding forward projection and backward projection operators,
and then use these to iterate toward feasible LMB solutions.
First, to derive a backward projection operator, take a current pkλ playing
the role of q k in the previous discussion, and use this to determine a distribution
p∗ ∈ M̄ that minimizes
$$B_\phi(p^*; p_\lambda^k) = \min_{p \in \bar{\mathcal{M}}} B_\phi(p; p_\lambda^k)$$
That is, $p^*$ is the backward projection of $p_\lambda^k$ onto $\bar{\mathcal{M}}$. To solve for $p^*$, one can formulate the Lagrangian $\Lambda(p, \alpha)$
$$\Lambda(p, \alpha) = B_\phi(p; p_\lambda^k) + \sum_{y \in \tilde{\mathcal{Y}}} \alpha_y \left( \int_{z \in \mathcal{Z}} p(y, z)\, \mu(dz) - \tilde{p}(y) \right)$$
Now since
$$\frac{\partial}{\partial p(x)} \Lambda(p, \alpha) = \phi'(p(x)) - \phi'(p_\lambda^k(x)) + \alpha_y$$
it is not hard to see that the solution $p^*$ must satisfy
$$p^*(x) = (\phi')^{-1}\left( \phi'(p_\lambda^k(x)) - \alpha_y \right)$$
Thus in many cases we can implement the backward projection step for AM merely by calculating the conditional distribution $p_\lambda^k(z|y)$ of the current model. In general, one has to solve for the Lagrange multipliers that satisfy (5), yielding a general form of the backward projection $p^*(x) = \tilde{p}(y)\, p^k_{\alpha_y, \lambda}(z|y)$. In this case, instead of using the original conditional distribution $p(z|y)$ on the right-hand side of the constraints, Eqn. (4), a modified conditional distribution $p_{\alpha_y}(z|y)$, which is a function of $p(z|y)$, has to be used in the problem formulation of the LMB principle.
Next, to formulate the forward projection step, we exploit the following
lemma.
Now since
$$\frac{\partial}{\partial p(x)} \Psi(p, \lambda) = \phi'(p(x)) - \phi'(q_0(x)) + \sum_{i=1}^{N} \lambda_i f_i(x)$$
$$= B_\phi(p^k; q_0) - B_\phi(p^k; p_\lambda)$$
$$A(q, \lambda) \stackrel{\text{def}}{=} \sum_{i=1}^{N} \lambda_i \int_{x \in \mathcal{X}} f_i(x)\, p^k(x)\, \mu(dx) - \int_{x \in \mathcal{X}} \sum_{i=1}^{N} \frac{|f_i(x)|}{f(x)}\, l_\phi\!\left(q,\ \sigma_i(x) f(x) \lambda_i\right) \mu(dx) \qquad (7)$$
and
$$\lim_{t \to \infty} q_t^k = q_\infty^k = \arg\min_{q \in \bar{\mathcal{E}}} B_\phi(p^k; q) \qquad (10)$$
The rest of the proof follows exactly the proof of Proposition 4.4 of [13].
We are then able to find feasible solutions for the LMB principle by using
an algorithm that combines the previous AM algorithm with a nested IS loop
to calculate the forward projection.
AM-IS algorithm:
This alternating procedure will halt at a point where the three manifolds $\mathcal{C}$, $\mathcal{E}$ and $\mathcal{G}_a$ have a common intersection, since we reach a stationary point in that case. Due to the nonlinearity of the manifold $\mathcal{C}$, the intersection is not unique, and multiple feasible solutions may exist.
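A schematic of such an alternating projection loop under the KL divergence (a jointly convex Bregman divergence); the two sets used here, a marginal-constraint set and the independent distributions, are deliberately simple stand-ins, chosen so that both projections have closed forms:

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(2)

# Two simple convex sets of joint distributions over (y, z):
#   P: distributions whose y-marginal equals the observed p_tilde
#   Q: independent distributions a(y) b(z)
# Under KL, both projections have closed forms.
p_tilde = np.array([0.6, 0.4])            # observed marginal over y
q = rng.random((2, 3))
q /= q.sum()                              # initial model over (y, z)

for _ in range(50):
    # backward projection onto P: p(y, z) = p_tilde(y) * q(z | y)
    p = p_tilde[:, None] * (q / q.sum(axis=1, keepdims=True))
    # forward projection onto Q: product of the marginals of p
    q = np.outer(p.sum(axis=1), p.sum(axis=0))

# The two sequences meet: the divergence between them drops to zero.
assert kl(p, q) < 1e-8
```

In this toy instance both sets are linear, so the loop actually reaches a common point after two iterations; with the nonlinear manifold $\mathcal{C}$ above, one only gets convergence to some feasible point, which need not be unique.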
We are now ready to prove the main result of this section: that AM-IS can be shown to converge and hence is guaranteed to yield feasible solutions to the LMB principle.
Proof. By Lemmas 5 and 6, choosing the closed convex set $\bar{\mathcal{M}}$ to play the role of $\mathcal{P}$ in Theorem 2 and the closed convex set $\bar{\mathcal{E}}$ to play the role of $\mathcal{Q}$ in Theorem 2, the conclusion immediately follows.
Unlike the standard K-L divergence, for which the EM-IS algorithm can be shown to monotonically increase the likelihood during each iteration [24], monotonic improvement will not necessarily hold under general Bregman divergences.
Proof. For all $p_{\lambda^*} \in \bar{\mathcal{C}} \cap \bar{\mathcal{E}}$, pick $p(x) = \tilde{p}(y)\,p_{\alpha_y,\lambda^*}(z|y)$. Obviously $p \in \bar{\mathcal{M}}$. Now we show that for all $p_\lambda \in \bar{\mathcal{E}}$
$$B_\phi\big(\tilde{p}(y)p_{\alpha_y,\lambda^*}(z|y);\, p_\lambda(x)\big) = B_\phi\big(\tilde{p}(y)p_{\alpha_y,\lambda^*}(z|y);\, p_{\lambda^*}(x)\big) + B_\phi\big(p_{\lambda^*}(x);\, p_\lambda(x)\big)$$
Plugging in
$$p_\lambda(x) = L\!\left(q_0, \sum_{i=1}^{N} \lambda_i f_i(x)\right) = (\phi')^{-1}\!\left(\phi'(q_0) + \sum_{i=1}^{N} \lambda_i f_i(x)\right)$$
$$p_{\lambda^*}(x) = L\!\left(q_0, \sum_{i=1}^{N} \lambda_i^* f_i(x)\right) = (\phi')^{-1}\!\left(\phi'(q_0) + \sum_{i=1}^{N} \lambda_i^* f_i(x)\right)$$
6 Summary
References
1. H. Bauschke and J. Borwein, “Joint and Separate Convexity of the Bregman Dis-
tance,” in: Inherently Parallel Algorithms in Feasibility and Optimization and Their
Applications, Elsevier, 2001, pp. 23–36
Joel Ratsaby
University College London, Gower Street, London WC1E 6BT, United Kingdom
[Link]@[Link]
1 Introduction
R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 205–220, 2003.
c Springer-Verlag Berlin Heidelberg 2003
206 J. Ratsaby
¹ According to the probabilistic data-generation model mentioned above, only regions in the probability-1 support of the mixture distribution f(x) have a well-defined class membership.
² In that case, technically, if there does not exist a k in A such that $g(k) = \inf_{k'} g(k')$, then we can always find arbitrarily close approximating elements $k_n$, i.e., $\forall \epsilon > 0\ \exists N(\epsilon)$ such that for $n > N(\epsilon)$ we have $|g(k_n) - \inf_{k'} g(k')| < \epsilon$.
A Stochastic Gradient Descent Algorithm for Structural Risk Minimisation 207
where throughout the paper we assume the a priori probabilities are known to
the learner (see Assumption 1 below).
The loss L(c) depends on the unknown underlying probability distributions; hence, realistically, for a learning algorithm to work it needs to use only an estimate of L(c). For a finite class C of classifiers the empirical loss $L_m(c)$ is a consistent estimator of L(c) uniformly for all $c \in C$; hence, provided that the sample size m is sufficiently large, an algorithm that minimises $L_m(c)$ over C will yield a classifier $\hat{c}$ whose loss $L(\hat{c})$ is an arbitrarily good approximation of the true minimum Bayes loss, denoted here as $L^*$, provided that the optimal Bayes classifier is contained in C. The Vapnik-Chervonenkis theory (Vapnik, 1982) characterises the condition for such uniform estimation over an infinite class C of classifiers. The condition basically states that the class needs to have a finite complexity, or richness, known as the Vapnik-Chervonenkis dimension, which is defined as follows: for a class H of functions from a set X to {0, 1} and a set $S = \{x_1, \ldots, x_l\}$ of l points in X, denote $H_{|S} = \{[h(x_1), \ldots, h(x_l)] : h \in H\}$. Then the Vapnik-Chervonenkis dimension of H, denoted $VC(H)$, is the largest l such that for some S the cardinality $|H_{|S}| = 2^l$. The method known as empirical risk minimisation represents a general learning approach which, for learning classification, minimises the 0/1 empirical loss; provided that the hypothesis class has a finite VC dimension, the method yields a classifier $\hat{c}$ whose loss is asymptotically arbitrarily close to the minimum $L^*$.
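The shattering definition can be checked directly on small examples. The brute-force helper below is our own, and the threshold class is a standard illustration with VC dimension 1:

```python
from itertools import combinations

def restrictions(H, S):
    """H|S: the set of labelings [h(x1), ..., h(xl)] realized on S by H."""
    return {tuple(h(x) for x in S) for h in H}

def vc_dimension(H, points):
    """Largest l such that some l-subset S of `points` is shattered,
    i.e. |H|S| = 2^l (brute force, for tiny examples only)."""
    vc = 0
    for l in range(1, len(points) + 1):
        if any(len(restrictions(H, S)) == 2 ** l
               for S in combinations(points, l)):
            vc = l
    return vc

# Threshold classifiers h_t(x) = 1 iff x >= t have VC dimension 1:
# a single point is shattered, but no pair can receive the labeling (1, 0).
thresholds = [lambda x, t=t: int(x >= t) for t in (-1.0, 0.5, 2.0, 10.0)]
print(vc_dimension(thresholds, [0.0, 1.0, 3.0]))   # -> 1
```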
As is often the case in real learning algorithms, the hypothesis class can be
rich and may practically have an infinite VC-dimension, for instance, the class
of all two layer neural networks with a variable number of hidden nodes. The
method of Structural Risk Minimisation (SRM) was introduced by Vapnik (1982)
in order to learn such classes via empirical risk minimisation.
For the purpose of reviewing existing results we limit our discussion for the remainder of this section to the case of two-category classification; thus we use m and k as scalars representing the total sample size and class VC-dimension, respectively. Let us denote by $C_k$ a class of classifiers having a VC-dimension of k, and let $c^*_k$ be the classifier which minimises the loss L(c) over $C_k$, i.e., $c^*_k = \arg\min_{c \in C_k} L(c)$. The standard setting for SRM considers the overall class C of classifiers as an infinite union of finite VC-dimension classes, i.e., $C = \bigcup_{k=1}^{\infty} C_k$; see for instance Vapnik (1982), Devroye et. al. (1996), Shawe-Taylor et. al. (1996),
Lugosi & Nobel (1999), Ratsaby et. al. (1996). The best performing classifier in
C, denoted as $c^*$, is defined as $c^* = \arg\min_{1 \le k \le \infty} L(c^*_k)$. Similarly, denote by $\hat{c}_k$ the empirically-best classifier in $C_k$, i.e., $\hat{c}_k = \arg\min_{c \in C_k} L_m(c)$. Denote by $k^*$ the minimal complexity of a class which contains $c^*$; depending on the problem and on the type of classifiers used, $k^*$ may even be infinite, as in the case when the Bayes classifier is not contained in C. The complexity $k^*$ may be thought of as the intrinsic complexity of the Bayes classifier.
The idea behind SRM is to minimise not the pure empirical loss $L_m(c_k)$ but a penalised version taking the form $L_m(c_k) + \epsilon(m, k)$, where $\epsilon(m, k)$ is some increasing function of k, sometimes referred to as a complexity penalty. The classifier chosen by the criterion is then defined by
The term $\epsilon(m, k)$ is proportional to the worst-case deviations between the true loss and the empirical loss uniformly over all functions in $C_k$. More recently there has been interest in data-dependent penalty terms for structural risk minimisation which do not have an explicit complexity factor k but are related to the class $C_k$ by being defined as a supremum of some empirical quantity over $C_k$, for instance the maximum discrepancy criterion (Bartlett et. al., 2002) or the Rademacher complexity (Koltchinskii, 2001).
We take the penalty to be as in Vapnik (1982) (see also Devroye et. al. (1996)): $\epsilon(m, k) = \text{const}\sqrt{\frac{k \ln m}{m}}$, where again const stands for an absolute constant which for our purpose is not important. This bound is central to the computations of the paper³.
It can be shown (Devroye et. al., 1996) that for the two-pattern classification case the error rate of the SRM-chosen classifier $\hat{c}^*$ (which implicitly depends on the random sample of size m, since it is obtained by minimising the sum in (1)) satisfies
$$L(\hat{c}^*) > L(c^*) + \text{const}\sqrt{\frac{k^* \ln m}{m}} \qquad (2)$$
infinitely often with probability 0, where again $c^*$ is the Bayes classifier, which is assumed to be in C, and $k^*$ is its intrinsic complexity. The nice feature of SRM is that the selected classifier $\hat{c}^*$ automatically locks onto the minimal error rate, as if the unknown $k^*$ were known beforehand.
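In code, the SRM choice reduces to an argmin over the penalized criterion $L_m(c_k) + \epsilon(m, k)$; the loss values below are hypothetical and the constant is set to 1 for illustration:

```python
import numpy as np

def srm_select(empirical_losses, m, const=1.0):
    """Pick k minimizing the penalized criterion
    L_m(c_k) + const * sqrt(k * ln(m) / m)."""
    ks = np.arange(1, len(empirical_losses) + 1)
    penalty = const * np.sqrt(ks * np.log(m) / m)
    scores = np.asarray(empirical_losses) + penalty
    return int(ks[np.argmin(scores)]), scores

# Empirical loss typically decreases with complexity k while the penalty
# grows; SRM balances the two (hypothetical loss values).
losses = [0.30, 0.12, 0.10, 0.095, 0.094]
k_hat, scores = srm_select(losses, m=1000)
print(k_hat)   # -> 2
```

Note how the tiny empirical improvements beyond k = 2 are outweighed by the growing penalty, which is exactly the overfitting control SRM is designed to provide.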
3 Multicategory Classification
classifier $c_i$ the loss is defined as $L(c_i) = \mathbb{E}_i \mathbf{1}_{\{c_i(x) \ne 1\}}$, and the empirical loss is $L_{i,m_i}(c_i) = \frac{1}{m_i} \sum_{j=1}^{m_i} \mathbf{1}_{\{c_i(x_j) \ne 1\}}$, which is based on a subsample $\{(x_j, i)\}_{j=1}^{m_i}$
The above definition simply states a lower bound requirement on the rate of
increase of Tφ (N ). We now state the uniform strong law of large numbers for
the class of well-defined classifiers.
Lemma 1. (Uniform SLLN for multicategory classifier class) For any $k \in \mathbb{Z}_+^M$ let $G_k$ be a class of well-defined classifiers. Consider any sequence-generating procedure as in Definition 2 which generates $m(n)$, $n = 1, \ldots, \infty$. Let the empirical loss be defined based on examples $\{(x_j, y_j)\}_{j=1}^{m(n)}$, each drawn i.i.d. according to an unknown underlying distribution over $\mathbb{R}^d \times \{1, \ldots, M\}$. Then for arbitrary $0 < \delta < 1$, $\sup_{c \in G_k} |L_{m(n)}(c) - L(c)| \le \text{const}\,\epsilon(m(n), k, \delta)$ with probability $1 - \delta$, and the events $\sup_{c \in G_k} |L_{m(n)}(c) - L(c)| > \text{const}\,\epsilon(m(n), k)$, $n = 1, 2, \ldots$, occur infinitely often with probability 0, where $m(n)$ is any sequence generated by the procedure.
Lemma 2. Based on m examples $\{(x_j, y_j)\}_{j=1}^{m}$, each drawn i.i.d. according to an unknown underlying distribution over $\mathbb{R}^d \times \{1, \ldots, M\}$, let $\hat{c}^*$ be the chosen classifier of complexity $\hat{k}$. Consider a sequence of samples $\zeta^{m(n)}$ with in-
⁴ We will henceforth adopt the convention that a vector sequence $\hat{k}_n \to k^*$, a.s., means that every component of $\hat{k}_n$ converges to the corresponding component of $k^*$, a.s., as $m \to \infty$.
where F̂n = {k : L̃n(ĉn,k) = min_{r∈ZZ^M_+} L̃n(ĉn,r)} and for any ĉn,k which minimises Lm(n)(c) over all c ∈ Gk we define the penalised empirical loss as L̃n(ĉn,k) =
A Stochastic Gradient Descent Algorithm for Structural Risk Minimisation 213
Lm(n)(ĉn,k) + ε(m(n), k), where Lm(n) stands for the empirical loss based on the sample-size vector m(n) at time n.
The second minimisation step is done via a query rule which selects the particular pattern class from which to draw examples as one which minimises the stochastic criterion ε(·, k̂n) with respect to the sample size vector m(n). The complexity k̂n of ĉ∗n will be shown later to converge to k∗, hence ε(·, k̂n) serves as a consistent estimator of the criterion ε(·, k∗). We choose an adaptation step which changes one component of m at a time, namely, it increases the component mjmax(n) which corresponds to the direction of maximum descent of the criterion ε(·, k̂n) at time n. This may be written as

m(n + 1) = m(n) + ∆ e_{jmax(n)},
where the positive integer ∆ denotes some fixed minimisation step-size and for
any integer i ∈ {1, 2, . . . , M }, ei denotes an M -dimensional elementary vector
with 1 in the ith component and 0 elsewhere. Thus at time n the new minimi-
sation step produces a new value m(n + 1) which is used for drawing additional
examples according to specific sample sizes mi (n + 1), 1 ≤ i ≤ M .
Learning Algorithm XSRM (Extended SRM)
Let: mi (0) = const > 0, 1 ≤ i ≤ M .
mi (0)
Given: (a) M uniform-size samples {ζ mi (0) }Mi=1 , where ζ
mi (0)
= {(xj , ‘i’)}j=1 ,
and xj are drawn i.i.d. according to underlying class-conditional probabil-
ity densities fi (x). (b) A sequence of classes Gk , k ∈ ZZ M + , of well-defined
classifiers. (c) A constant minimisation step-size ∆ > 0. (d) Known a priori
probabilities pj , 1 ≤ j ≤ M (for defining Lm ).
Initialisation: (Time n = 0) Based on ζ^{mi(0)}, 1 ≤ i ≤ M, determine a set of candidate classifiers ĉ0,k minimising the empirical loss Lm(0) over Gk, k ∈ ZZ^M_+, respectively. Determine ĉ∗0 according to (3) and denote its complexity vector by k̂0.
Output: ĉ∗0 .
Call Procedure NM: m(1) := N M (0).
Let n = 1.
While (still more available examples) Do:
1. Based on the sample ζ m(n) , determine the empirical minimisers ĉn,k for
each class Gk . Determine ĉ∗n according to (3) and denote its complexity
vector by k̂n .
2. Output: ĉ∗n .
3. Call Procedure NM: m(n + 1) := N M (n).
4. n := n + 1.
End Do
Procedure New Minimisation (NM)
Input: Time n.
– jmax(n) := argmax_{1≤j≤M} pj ε(mj(n), k̂n,j)/mj(n); if the argmax is attained by more than one index, choose any one of them.
– m(n + 1) := m(n) + ∆ e_{jmax(n)}.
The algorithm alternates between the standard minimisation step (3) and the new minimisation step (4) repetitively until exhausting the total sample size m, which for full generality is assumed to be unknown a priori.
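The query rule can be illustrated in isolation. In this sketch the complexity estimates k̂ are held fixed (an artificial simplification; in Algorithm XSRM they are re-estimated by the SRM step at every round), and the priors, complexities, and step-size are made-up values.

```python
import math

def eps(m_i, k_i, const=1.0):
    # per-class penalty eps(m_i, k_i) = const * sqrt(k_i * ln m_i / m_i)
    return const * math.sqrt(k_i * math.log(m_i) / m_i)

def nm_step(m, k_hat, p, delta=1):
    """One New-Minimisation step: increase, by delta, the subsample of the
    class giving the steepest descent of the criterion eps(., k_hat)."""
    scores = [p[j] * eps(m[j], k_hat[j]) / m[j] for j in range(len(m))]
    j_max = scores.index(max(scores))
    m = list(m)
    m[j_max] += delta
    return m

# Hypothetical run: three pattern classes with fixed complexity estimates.
m, k_hat, p = [3, 3, 3], [1, 4, 9], [1 / 3, 1 / 3, 1 / 3]
for n in range(300):
    m = nm_step(m, k_hat, p)
# Classes of higher estimated complexity end up with larger subsamples.
```

Running the rule equalises the per-class scores, so the resulting allocation grows with the estimated complexity of each class.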
While for any fixed i ∈ {1, 2, . . . , M} the examples {(xj, i)}_{j=1}^{mi(n)} accumulated up until time n are all i.i.d. random variables, the total sample {(xj, yj)}_{j=1}^{m(n)} consists of dependent random variables since based on the new minimisation the
consists of dependent random variables since based on the new minimisation the
choice of the particular class-conditional probability distribution used to draw
examples at each time instant l depends on the sample accumulated up until
time l − 1. It turns out that this dependency does not alter the results of Lemma
2. This follows from the proof of Lemma 2 and from the bound of Lemma 1
which holds even if the sample is i.i.d. only when conditioned on a pattern class
since it is the weighted average of the individual bounds corresponding to each
of the pattern classes. Therefore together with the next lemma this implies that
Lemma 2 applies to Algorithm XSRM.
Lemma 3. Algorithm XSRM is a sequence-generating procedure.
The outline of the proof is deferred to Appendix C. Next, we state the main
theorem of the paper.
Theorem 1. Assume that the Bayes complexity k∗ is an unknown M-dimensional vector of finite positive integers. Let the step size ∆ = 1 in Algorithm XSRM, resulting in a total sample size which increases with discrete time as m(n) = n. Then the random sequence of classifiers ĉ∗n produced by Algorithm XSRM is such that the events L(ĉ∗n) > const ε(m(n), k∗) or ‖m(n) − m∗(n)‖_{l1^M} > 1 occur infinitely often with probability 0, where m∗(n) is the solution to the constrained minimisation of ε(m, k∗) over all m of magnitude |m| = m(n).
Remark 1. In the limit of large n the bound const ε(m(n), k∗) is almost minimum (the minimum being at m∗(n)) with respect to all vectors m ∈ ZZ^M_+ of size m(n). Note that this rate is achieved by Algorithm XSRM without the knowledge of the intrinsic complexity k∗ of the Bayes classifier. Compare this for instance to uniform querying, where at each time n one queries for subsamples of the same size ∆/M from every pattern class. This leads to a different (deterministic) sequence m(n) = (∆/M)[1, 1, . . . , 1] n ≡ ∆̄ n and in turn to a sequence of classifiers ĉn whose loss L(ĉn) ≤ const ε(∆̄ n, k∗), as n → ∞, where here the upper bound is not even asymptotically minimal. A similar argument holds if the proportions are based on the a priori pattern class probabilities, since in general letting mi = pi m does not necessarily minimise the upper bound. In Ratsaby (1998), empirical results show the inferiority of uniform sampling compared to an online approach based on Algorithm XSRM.
6 Proving Theorem 1
First let us solve the following constrained minimisation problem. Fix a total sample size m and minimise the error ε(m, k∗) under the constraint that Σ_{i=1}^M mi = m. This amounts to minimising ε(m, k∗) + λ(Σ_{i=1}^M mi − m) over m and λ. Denote the gradient by g(m, k∗) = ∇ε(m, k∗). Then the above is equivalent to solving g(m, k∗) + λ[1, 1, . . . , 1] = 0 for m and λ. The vector-valued function g(m, k∗) may be approximated by g(m, k∗) ≈ [−p1 ε(m1, k1∗)/(2m1), −p2 ε(m2, k2∗)/(2m2), . . . , −pM ε(mM, kM∗)/(2mM)], where we used the approximation 1 − 1/ln mi ≈ 1 for 1 ≤ i ≤ M. We then obtain the set of equations 2λ∗ m∗i = pi ε(m∗i, ki∗), 1 ≤ i ≤ M, and λ∗ = ε(m∗, k∗)/(2m). We are interested not in obtaining a solution for a fixed m but in obtaining, using local gradient information, a sequence of solutions for the sequence of minimisation problems corresponding to an increasing sequence of total sample-size values m(t).
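The stationarity equations 2λ∗ m∗i = pi ε(m∗i, ki∗) can be solved numerically by nested bisection when the mi are treated as real-valued. The following sketch is illustrative only; const = 1 and the instance values are arbitrary choices.

```python
import math

def eps_i(m, k):
    # per-class term eps(m_i, k_i) = sqrt(k_i * ln m_i / m_i)  (const = 1)
    return math.sqrt(k * math.log(m) / m)

def m_of_lambda(lam, p, k, lo=2.0, hi=1e9):
    # Solve 2*lam*m = p*eps_i(m, k) for m by bisection: the left-hand side
    # is increasing in m, the right-hand side decreasing (for m >= 3).
    for _ in range(200):
        mid = (lo + hi) / 2
        if 2 * lam * mid < p * eps_i(mid, k):
            lo = mid
        else:
            hi = mid
    return lo

def optimal_allocation(m_total, p, k):
    """Outer bisection on the multiplier lambda until the stationary
    allocations m_i(lambda) sum to m_total; returns the real-valued m*."""
    lo, hi = 1e-12, 1.0
    ms = []
    for _ in range(200):
        lam = (lo + hi) / 2
        ms = [m_of_lambda(lam, p[i], k[i]) for i in range(len(k))]
        if sum(ms) > m_total:
            lo = lam          # allocations too large: increase lambda
        else:
            hi = lam
    return ms

# Hypothetical instance: two classes, equal priors, complexities 2 and 8.
ms = optimal_allocation(1000, [0.5, 0.5], [2, 8])
```

As the equations predict, the class with the larger intrinsic complexity receives the larger share of the total sample.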
Applying the New Minimisation procedure of Algorithm XSRM to the deterministic criterion ε(m, k∗), we have an adaptation rule which modifies the sample size vector m(t) at time t in the direction of steepest descent of ε(m, k∗). This yields j∗(t) = argmax_{1≤j≤M} pj ε(mj(t), kj∗)/mj(t), which means we let mj∗(t)(t + 1) = mj∗(t)(t) + ∆ while the remaining components of m(t) remain unchanged, i.e., mj(t + 1) = mj(t) for all j ≠ j∗(t). The next lemma states that this rule achieves the desired result, namely, the deterministic sequence m(t) converges to the optimal trajectory m∗(t).
Lemma 4. For any initial point m(0) ∈ IR^M satisfying mi(0) ≥ 3, and for a fixed positive ∆, there exists some finite integer 0 < N < ∞ such that for all discrete time t > N the trajectory m(t) corresponding to a repeated application of the adaptation rule mj∗(t)(t + 1) = mj∗(t)(t) + ∆ is no farther than ∆ (in the l1^M norm) from the optimal trajectory m∗(t).
Outline of Proof: Recall that ε(m, k∗) = Σ_{i=1}^M pi ε(mi, ki∗), where ε(mi, ki) = √(ki ln mi / mi), 1 ≤ i ≤ M. The derivative ∂ε(m, k∗)/∂mi ≈ −pi ε(mi, ki∗)/(2mi). Denote by xi = pi ε(mi, ki∗)/(2mi), and note that dxi/dmi ≈ −(3/2) xi/mi, 1 ≤ i ≤ M. There is a one-to-one
correspondence between the vectors x and m, thus we may refer to the optimal trajectory also in x-space. Consider the set T = {x = c[1, 1, . . . , 1] ∈ IR^M_+ : c ∈ IR+} and refer also to the corresponding set in m-space as T. Define the Lyapunov function V(x(t)) = V(t) = (xmax(t) − xmin(t))/xmin(t), where for any vector x ∈ IR^M_+, xmax = max_{1≤i≤M} xi and xmin = min_{1≤i≤M} xi, and write mmax, mmin for the elements of m with the same index as xmax, xmin, respectively. Denote by V̇ the derivative of V with respect to t. Using standard analysis it can be shown that if x ∉ T then V(x) > 0 and V̇(x) < 0, while if x ∈ T then V(x) = 0 and V̇(x) = 0. This means that as long as m(t) is not on the optimal trajectory, V(t) decreases. To show that the trajectory is an attractor, V(t) is shown to decrease fast enough to zero using the fact that V(t) ≤ const t^{−3/2}. Hence as t → ∞, the distance between m(t) and the set T satisfies dist(m(t), T) → 0, where dist(x, T) = inf_{y∈T} ‖x − y‖_{l1^M} and l1^M denotes the l1 vector norm on IR^M. It is then easy to show that for all large t, m(t) is no farther than ∆ from m∗(t).
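The Lyapunov argument can be checked numerically. The sketch below runs the discrete adaptation rule on a made-up two-class instance (const = 1, ∆ = 1) and confirms that V(t) = (xmax(t) − xmin(t))/xmin(t) shrinks from its initial value towards zero.

```python
import math

def x_coord(m_i, k_i, p_i):
    # x_i = p_i * eps(m_i, k_i) / (2 m_i), with eps(m, k) = sqrt(k ln m / m)
    return p_i * math.sqrt(k_i * math.log(m_i) / m_i) / (2 * m_i)

def lyapunov(m, k, p):
    xs = [x_coord(m[i], k[i], p[i]) for i in range(len(m))]
    return (max(xs) - min(xs)) / min(xs)

m, k, p = [3, 3], [1, 9], [0.5, 0.5]
v_start = lyapunov(m, k, p)
for t in range(2000):
    # steepest-descent rule: grow the component with the largest x_i
    j = max(range(len(m)), key=lambda i: x_coord(m[i], k[i], p[i]))
    m[j] += 1
v_end = lyapunov(m, k, p)
```

After a transient, the two coordinates of x stay within one increment's effect of each other, so V(t) settles near zero exactly as the lemma asserts.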
We now show that the same adaptation rule may also be used in the setting
where k ∗ is unknown. The next lemma states that even when k ∗ is unknown, it
is possible, by using Algorithm XSRM, to generate a stochastic sequence which
asymptotically converges to the optimal m∗ (n) trajectory (again, the use of n
instead of t just means we have a random sequence m(n) and not a deterministic
sequence m(t) as was investigated above).
or ‖m(n) − m∗(n)‖_{l1^M} > ∆ occurs infinitely often with probability 0. The statement of the lemma then follows.
Appendix
In this section we outline the proofs. Complete proofs appear in Ratsaby (2003).
(with a different constant) does not hold infinitely often with probability 0. We will refer to this as the uniform strong law of large numbers result, and we note that this was defined earlier as ε(m, r).
This result is used together with an application of the union bound, which reduces the probability P(sup_{c∈Ck} |L(c) − Lm(c)| > ε(m, k, δ′)) to Σ_{i=1}^M P(∃c ∈ Cki : |L(c) − Li,mi(c)| > ε(mi, ki, δ′)), which is bounded from above by Mδ′. The
first part of the lemma then follows since the class of well defined classifiers Gk
is contained in the class Ck . For the second part of the lemma, by the premise
consider any fixed complexity vector k and any sequence-generating procedure
φ according to Definition 2. Define the following set of sample size vector se-
quences: AN ≡ {m(n) : n > N, m(n) is generated by φ}. As the space is dis-
crete, note that for any finite N , the set AN contains all possible paths except a
finite number of length-N paths. The proof proceeds by showing that the events En ≡ {sup_{c∈Gk} L(c) − Lm(n)(c) > ε(m(n), k, δ) : m(n) generated by φ} occur infinitely often with probability 0. To show this, we first choose δ to be δ∗m = 1/ max_{1≤j≤M} mj², and then reduce P(∃m(n) ∈ AN : sup_{c∈Gk} L(c) − Lm(n)(c) > ε(m(n), k, δ∗m(n))) to Σ_{j=1}^M Σ_{mj > Tφ(N)} 1/mj². Then use the fact that m(n) ∈ AN implies there exists a point m such that min_{1≤j≤M} mj > Tφ(N), where Tφ(N) is increasing with N; hence the set {mj : mj > Tφ(N)} strictly shrinks, 1 ≤ j ≤ M, which implies that the above double sum strictly decreases with increasing N. It then follows that lim_{N→∞} P(∃m(n) ∈ AN : sup_{c∈Gk} L(c) − Lm(n)(c) > ε(m(n), k)) = 0, which implies the events En occur i.o. with probability 0.
‖k̂‖∞ ≠ ‖k∗‖∞ i.o. with probability zero, but where k̂ does not necessarily equal k∗, and that k̂ → k∗ (componentwise) a.s. as m → ∞ (or equivalently with n → ∞, as the sequence m(n) is increasing), where k∗ = argmin_{k∈F∗} ‖k‖∞ is not necessarily unique but all of whose components are finite. This proves the
first part of the lemma. The proof of the second part of the lemma follows similarly to the proof of Lemma 1. Start with P(∃m(n) ∈ AN : L(ĉ∗n) > ε(m(n), k∗)), which after some manipulation is shown to be bounded from above by the sum Σ_{j=1}^M Σ_{kj=1}^∞ P(∃mj > Tφ(N) : L(ĉkj) > Lj,mj(ĉkj) + ε(mj, kj)). Then make use of the uniform strong law result (see the first paragraph of Appendix A) and choose a const such that ε(mj, kj) = const √(kj ln mj / mj) ≥ 3 √(kj ln(e mj) / mj). Using the upper bound on the growth function (cf. Vapnik (1982), Section 6.9; Devroye et al. (1996), Theorem 13.3), we have for some absolute constant κ > 0, P(L(ĉkj) > Lj,mj(ĉkj) + ε(mj, kj)) ≤ κ mj^{kj} e^{−mj ε²(mj, kj)}, which is bounded from above by κ (1/mj²) e^{−3kj} for kj ≥ 1. The bound on the double sum then becomes 2κ Σ_{j=1}^M Σ_{mj > Tφ(N)} 1/mj², which is strictly decreasing with N as in the proof of Lemma 1. It follows that the events {L(ĉ∗n) > ε(m(n), k∗)} occur infinitely often with probability 0.
Note that for this proof we cannot use Lemma 1 or parts of Lemma 2 since
they are conditioned on having a sequence-generating procedure. Our approach
here relies on the characteristics of the SRM-selected complexity k̂n which is
shown to be bounded uniformly over n based on Assumption 1. It follows that
by the stochastic adaptation step of Algorithm XSRM the generated sample size
sequence m(n) is not only increasing but with a minimum rate of increase as
in Definition 2. This establishes that Algorithm XSRM is a sequence-generating
procedure. The proof starts by showing that for an increasing sequence m(n), as
in Definition 1, for all n there is some constant 0 < ρ < ∞ such that k̂n ∞ < ρ.
It then follows that for all n, k̂n is bounded by a finite constant independent of
n. So for a sequence generated by the new minimisation procedure in Algorithm XSRM, the quantities pj ε(mj(n), k̂n,j)/mj(n) are bounded by pj ε(mj(n), k̃j)/mj(n), for some finite k̃j, 1 ≤ j ≤ M, respectively. It can be shown by simple analysis of the function ε(m, k) that for a fixed k the ratio of ∂²ε(mj, kj)/∂mj² to ∂²ε(mi, ki)/∂mi² converges to a constant dependent on ki and kj with increasing mi, mj. Hence the adaptation step which always
increases one of the sub-samples yields increments of ∆mi and ∆mj which are
no farther apart than a constant multiple of each other for all n, for any pair
1 ≤ i, j ≤ M . Hence for a sequence m(n) generated by Algorithm XSRM the
following is satisfied: it is increasing in the sense of Definition 1, namely, for
all N > 0 there exists a Tφ (N ) such that for all n > N every component
mj (n) > Tφ (N ), 1 ≤ j ≤ M . Furthermore, its rate of increase is bounded from
below, namely, there exists a const > 0 such that for all N, N′ > 0 satisfying Tφ(N′) = Tφ(N) + 1, we have |N − N′| ≤ const. It follows that Algorithm XSRM is a sequence-generating procedure according to Definition 2.
References
Anthony M., Bartlett P. L., (1999), “Neural Network Learning: Theoretical Founda-
tions”, Cambridge University Press, UK.
Bartlett P. L., Boucheron S., Lugosi G., (2002) Model Selection and Error Estimation,
Machine Learning, Vol. 48, No.1–3, p. 85–113.
Devroye L., Györfi L., Lugosi G., (1996), “A Probabilistic Theory of Pattern Recognition”, Springer-Verlag.
Koltchinskii V., (2001), Rademacher Penalties and Structural Risk Minimization, IEEE Trans. on Info. Theory, Vol. 47, No. 5, p. 1902–1914.
Lugosi G., Nobel A., (1999), Adaptive Model Selection Using Empirical Complexities.
Annals of Statistics, Vol. 27, pp.1830–1864.
Ratsaby J., (1998), Incremental Learning with Sample Queries, IEEE Trans. on PAMI,
Vol. 20, No. 8, p.883–888.
Ratsaby J., (2003), On Learning Multicategory Classification with Sample Queries,
Information and Computation, Vol. 185, No. 2, p. 298–327.
Ratsaby J., Meir R., Maiorov V., (1996), Towards Robust Model Selection using Estimation and Approximation Error Bounds, Proc. 9th Annual Conference on Computational Learning Theory, p. 57, ACM, New York, NY.
Shawe-Taylor J., Bartlett P., Williamson R., Anthony M., (1996), A Framework for
Structural Risk Minimisation. NeuroCOLT Technical Report Series, NC-TR-96-032,
Royal Holloway, University of London.
Valiant L. G., (1984), A Theory of the Learnable, Comm. ACM, Vol. 27, No. 11, p. 1134–1142.
Vapnik V.N., (1982), “Estimation of Dependences Based on Empirical Data”, Springer-
Verlag, Berlin.
On the Complexity of Training a Single
Perceptron with Programmable Synaptic Delays
Jiřı́ Šı́ma
R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 221–233, 2003.
c Springer-Verlag Berlin Heidelberg 2003
is usually employed. In this case, the output protocol can be defined so that
N with weights w0, . . . , wn and delays d1, . . . , dn computes a neuron function yN : [−1, 1]^n −→ {0, 1}, defined for every input (x1, . . . , xn) ∈ [−1, 1]^n as yN(x1, . . . , xn) = 1 iff there exists a time instant t ≥ 0 such that y(t) = 1.
Similarly, the logistic sigmoid
σL(ξ) = 1/(1 + e^{−ξ}),    (5)
which is well known from back-propagation learning [26], produces analog outputs y(t) ∈ [0, 1], whereas the output protocol can specify a time instant tout ≥ 0 when the resulting output is read, that is, yN(x1, . . . , xn) = y(tout). Unless
otherwise stated we assume that neuron N employs the Heaviside activation
function (4).
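The computation of yN can be sketched as follows, under the assumption (not restated in this section) that the spike from input i influences N exactly during an interval Di = [di, di + 1); since the excitation is then piecewise constant, it suffices to test one time instant per breakpoint.

```python
def y_N(x, w0, w, d):
    """Sketch of the neuron function yN, assuming the spike from input i
    influences N during D_i = [d_i, d_i + 1) (this interval convention is
    an assumption of the sketch). The excitation
    xi(t) = w0 + sum over {i : t in D_i} of w_i * x_i
    is piecewise constant, so checking each breakpoint suffices."""
    n = len(x)
    breakpoints = sorted({0.0} | {d[i] for i in range(n)}
                         | {d[i] + 1.0 for i in range(n)})
    for t in breakpoints:
        if t < 0:
            continue
        xi = w0 + sum(w[i] * x[i] for i in range(n) if d[i] <= t < d[i] + 1)
        if xi >= 0:
            return 1          # y(t) = 1 for some t >= 0
    return 0

# With all delays zero the unit degenerates to a classical perceptron:
assert y_N([1.0, -1.0], w0=-1.0, w=[2.0, 1.0], d=[0.0, 0.0]) == 1
assert y_N([1.0, 1.0], w0=-1.0, w=[-2.0, -2.0], d=[0.0, 0.0]) == 0
```

The disjunctive character of the output protocol (output 1 if the excitation ever crosses the threshold) is what the reductions below exploit.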
By restricting certain parameters in the preceding definition of N we obtain
several computational units which are widely used in neurocomputing. For the
classical perceptrons [25] all synaptic delays are zero, i.e. di = 0 for i = 1, . . . , n,
and also tout = 0 when the logistic sigmoid (5) is employed [26]. Or assuming
the spikes with a uniform firing rate, e.g. xi ∈ {0, 1} for i = 1, . . . , n, neuron N
coincides with a simplified model of a spiking neuron with binary coded inputs
which was introduced and analyzed in [20]. Hence, the computational power of
N computing the Boolean functions is the same as that of the spiking neuron
with binary coded inputs [27] (cf. Section 4). In addition, the VC-dimension
Θ(n log n) of the spiking neuron still applies to N with n analog inputs as can
easily be verified by following the argument in [20]. From this point of view, N
represents a generalization of the spiking neuron in which the temporal delays are
combined with the firing rates of perceptron units.
It follows that biological motivations for spiking neurons [10,19] partially
apply also to neuron N . For example, it is known that the synaptic delays are
fixed weights [20]. This implies that neuron N with binary delays is not prop-
erly PAC learnable unless RP = N P . The result generalizes also to bounded
delay values di ∈ {0, 1, . . . , c} for fixed c ≥ 2. For the spiking neurons with
unbounded delays, however, NP-hardness of the consistency problem was listed
among open problems [20].
In this section we prove that the consistency problem is NP-hard for a single
perceptron N with arbitrary delays, which partially answers the previous open
question provided that several levels of firing rates are allowed. For this purpose
a synchronization technique is introduced whose main idea can be described as
follows. The consistency of a negative example (x1, . . . , xn; 0) means that for every subset of inputs I ⊆ {1, . . . , n} whose spikes may simultaneously influence N (i.e. ⋂_{i∈I} Di ≠ ∅) the corresponding excitation must satisfy w0 + Σ_{i∈I} wi xi < 0. At the same time, by using the consistency of other (mostly positive) training examples we can enforce w0 + Σ_{i∈J} wi xi ≥ 0 for some J ⊆ {1, . . . , n}. In this way we ensure that N is not simultaneously influenced by the spikes from inputs J, that is, ⋂_{i∈J} Di = ∅, which is then exploited for the synchronization of the input spikes.
Proof. In order to achieve the NP-hardness result, the following variant of the
set splitting problem which is known to be NP-complete [9] will be reduced to
CPN in polynomial time.
The 3SSP problem was also used for proving the result restricted to binary
delays [20]. The above-described synchronization technique generalizes the proof
to arbitrary delays.
Given a 3SSP instance S, C, we construct a training set T for neuron N with
n inputs where n = 2p + 2. The input firing rates of training examples exploit only seven levels from {−1, −1/4, −1/8, 0, 3/8, 3/4, 1} ⊆ [−1, 1]. A list of training examples
which are included in training set T follows:
(0, . . . , 0, 3/4, 0, . . . , 0; 1), with 3/4 at position 2i − 1, for i = 1, . . . , p,    (7)

(0, . . . , 0, −1/4, 0, . . . , 0; 1), with −1/4 at position 2i, for i = 1, . . . , p,    (8)

(0, . . . , 0, 3/8, −1/8, 0, . . . , 0; 0), with 3/8, −1/8 at positions 2i − 1, 2i, for i = 1, . . . , p,    (9)

(0, . . . , 0, −1/4, 0; 1), with −1/4 at position 2p + 1,    (10)

(0, . . . , 0, −1/4; 1), with −1/4 at position 2p + 2,    (11)

(0, . . . , 0, −1/8, −1/8; 0), with −1/8, −1/8 at positions 2p + 1, 2p + 2,    (12)

(0, . . . , 0, 1, 0, . . . , 0; 1), with 1 at position 2i − 1, for i = 1, . . . , p,    (13)

(0, . . . , 0, 1, 0, . . . , 0, 1, 1; 0), with 1 at positions 2i − 1, 2p + 1, 2p + 2, for i = 1, . . . , p,    (14)

(0, . . . , 0, −1, 0, . . . , 0; 1), with −1 at position 2i, for i = 1, . . . , p,    (15)

(0, . . . , 0, −1, 0, . . . , 0, 1, 1; 0), with −1 at position 2i and 1 at positions 2p + 1, 2p + 2, for i = 1, . . . , p,    (16)

and

(0, . . . , 0, 1, 1, 0, . . . , 0, 1, 1, 0, . . . , 0, 1, 1, 0, . . . , 0; 0), with 1 at positions 2i − 1, 2i, 2j − 1, 2j, 2k − 1, 2k, for each c = {si, sj, sk} ∈ C.    (17)
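The construction of T is easily made explicit. The following sketch builds the list of examples (7)–(17) for a given p and triple collection C (1-based indices), showing that the reduction is polynomial; it is an illustration, not code from the paper.

```python
from fractions import Fraction as F

def build_T(p, C):
    """Training examples (7)-(17) for a 3SSP instance with S = {s_1, ..., s_p}
    and a collection C of triples (1-based index triples); neuron N has
    n = 2p + 2 inputs."""
    n = 2 * p + 2

    def ex(vals, label):
        # vals maps a 1-based input position to its firing rate
        x = [F(0)] * n
        for pos, v in vals.items():
            x[pos - 1] = F(v)
        return (x, label)

    T = []
    for i in range(1, p + 1):
        T.append(ex({2 * i - 1: F(3, 4)}, 1))                          # (7)
        T.append(ex({2 * i: F(-1, 4)}, 1))                             # (8)
        T.append(ex({2 * i - 1: F(3, 8), 2 * i: F(-1, 8)}, 0))         # (9)
        T.append(ex({2 * i - 1: 1}, 1))                                # (13)
        T.append(ex({2 * i - 1: 1, 2 * p + 1: 1, 2 * p + 2: 1}, 0))    # (14)
        T.append(ex({2 * i: -1}, 1))                                   # (15)
        T.append(ex({2 * i: -1, 2 * p + 1: 1, 2 * p + 2: 1}, 0))       # (16)
    T.append(ex({2 * p + 1: F(-1, 4)}, 1))                             # (10)
    T.append(ex({2 * p + 2: F(-1, 4)}, 1))                             # (11)
    T.append(ex({2 * p + 1: F(-1, 8), 2 * p + 2: F(-1, 8)}, 0))        # (12)
    for (i, j, k) in C:                                                # (17)
        T.append(ex({2 * i - 1: 1, 2 * i: 1, 2 * j - 1: 1,
                     2 * j: 1, 2 * k - 1: 1, 2 * k: 1}, 0))
    return T

T = build_T(p=3, C=[(1, 2, 3)])   # 7p + 3 + |C| examples over n = 8 inputs
```

Every firing rate used lies in the seven levels named above, and the size of T is linear in p + |C|.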
Clearly,
D2i−1 ∩ D2i = ∅ (22)
for i = 1, . . . , p + 1 according to (20) and (21). It can easily be checked that
N with parameters (18)–(21) is consistent with training examples (7)–(16). For
instance, for any positive training example (7), excitation ξ(t) = −1 + 2 · (3/4) ≥ 0 when t ∈ D2i−1, which is sufficient for N to output 1. Or for any negative training example (9), excitation ξ(t) = −1 + 2 · (3/8) < 0 for all t ∈ D2i−1 and ξ(t) = −1 − 4 · (−1/8) < 0 for all t ∈ D2i, whereas ξ(t) = −1 < 0 for t ≥ 2, which
implies that N outputs desired 0. The verification for the remaining training
examples (7)–(16) is similar. Furthermore, D2i−1 ∩ D2j−1 ∩ D2k−1 = ∅ holds for
any c = {si, sj, sk} ∈ C according to (20) since c ⊄ S1. Hence, for a negative training example (17) corresponding to c, the excitation ξ(t) ≤ −1 + 2 · 1 + 2 · 1 − 4 · 1 < 0
for every t ≥ 0 due to (22) which produces zero output. This completes the
argument for the CPN instance to be solvable.
On the other hand, assume that there exist weights w0 , . . . , wn and delays
d1 , . . . , dn for N such that N is consistent with training examples (7)–(17). Any
consistent negative example ensures
w0 < 0 (23)
since the excitation must satisfy ξ(t) < 0 also for t ∉ ⋃_{i=1}^n Di. Hence, it follows from (7) and (8) that w0 + (3/4) w2i−1 ≥ 0 and w0 − (1/4) w2i ≥ 0, respectively, which sums up to

w0 + (3/8) w2i−1 − (1/8) w2i ≥ 0 for i = 1, . . . , p.    (24)
On the other hand, by comparing inequality (24) with the consistency of negative
examples (9) we conclude that
w0 − (1/8) w2p+1 − (1/8) w2p+2 ≥ 0    (26)
which implies
D2p+1 ∩ D2p+2 = ∅ (27)
when the consistency of negative example (12) is required.
Furthermore, positive training examples (13) ensure
It follows from (25), (27), (29), and (30) that for each 1 ≤ i ≤ p
due to (25), but inequalities (23) and (28) imply the opposite. Similarly,
D2i−1 = D2j−1 = D2k−1 = D2p+2 for c ⊆ S2 because of (32) and (31) pro-
viding contradiction (33). This completes the proof that the 3SSP instance is
solvable.
Within the PAC framework, the NP-hardness of this problem implies that the
neuron does not allow robust learning (i.e. probably approximately optimal learn-
ing for any training task) unless RP = N P [14].
For the perceptrons with zero delays, the complexity of the approximation
problem has been resolved. Several authors proved that the approximation prob-
lem is NP-complete in this case [14,23] even if the bias is assumed to be zero [2,
16]. This means that the perceptrons with zero delays do not allow robust learn-
ing unless RP = N P . In addition, it is NP-hard to achieve a fixed error that is
a constant multiple of the optimum [4]. These results were also generalized to
analog outputs, e.g. for the logistic sigmoid (5) it is NP-hard to minimize the
training error under the L1 [15] or L2 [28] norm within a given absolute bound
or within 1 of its infimum.
In this section the approximation problem is proved to be NP-hard for percep-
tron N with arbitrary delays. The proof exploits only binary firing rates, which
means the result is also valid for spiking neurons with binary coded inputs.
Proof. The following vertex cover problem, which is known to be NP-complete [18], will be reduced to APN in polynomial time:
A similar reduction was originally used for the NP-hardness result concerning the approximate training of an ordinary perceptron with zero synaptic delays [14]. The technique generalizes to arbitrary delays.
Thus given a VCP instance G = (V, E), k with n = |V | vertices V =
{v1 , . . . , vn } and r = |E| edges we construct a training set T for neuron N
with n inputs. Training set T contains the following m = n + r examples:
(0, . . . , 0, 1, 0, . . . , 0; 1), with 1 at position i, for i = 1, . . . , n,    (34)

(0, . . . , 0, 1, 0, . . . , 0, 1, 0, . . . , 0; 0), with 1 at positions i and j, for each {vi, vj} ∈ E,    (35)
which can be constructed in polynomial time in terms of the size of the VCP in-
stance. Moreover, in this APN instance at most k inconsistent training examples
are allowed.
It will be shown that the VCP instance has a solution iff the corresponding
APN instance is solvable. So first assume that there exists a vertex cover U ⊆ V
of size |U| ≤ k. Define the weights and delays for N as follows:

w0 = −1,    (36)

wi = −1 if vi ∈ U, and wi = 1 if vi ∉ U, for i = 1, . . . , n,    (37)

di = 0 for i = 1, . . . , n.    (38)
Obviously, negative examples (35) corresponding to edges {vi, vj} ∈ E produce excitations either ξ(t) = −3 when both endpoints are in U or ξ(t) = −1 when only one endpoint is in U, for t ∈ [0, 1), while ξ(t) = w0 = −1 for t ≥ 1, which means N outputs the desired 0. Furthermore, the positive examples (34) that correspond to vertices vi ∉ U give excitations ξ(t) = 0 for t ∈ [0, 1) and hence N classifies them correctly. On the other hand, N is not consistent with the positive examples (34) corresponding to vertices vi ∈ U, since ξ(t) = −2 for t ∈ [0, 1) and ξ(t) = −1 for t ≥ 1. Nevertheless, the size of vertex cover U is at most k, which also
upper-bounds the number of inconsistent training examples. This completes the
argument for the APN instance to be solvable.
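The forward direction just argued can be replayed on a toy graph: build the training set (34)–(35), take a vertex cover U, assign the weights (36)–(38), and count inconsistencies. Everything below is an illustrative sketch; the graph and cover are made up.

```python
def apn_instance(n, edges):
    """Training set T of examples (34)-(35) for a graph with n vertices and
    the given 0-based edge list."""
    T = [([1 if j == i else 0 for j in range(n)], 1) for i in range(n)]   # (34)
    for (i, j) in edges:                                                  # (35)
        T.append(([1 if l in (i, j) else 0 for l in range(n)], 0))
    return T

# Toy graph: path 0 - 1 - 2; vertex 1 alone covers both edges.
edges = [(0, 1), (1, 2)]
U = {1}
T = apn_instance(3, edges)
w0, w = -1, [-1 if i in U else 1 for i in range(3)]   # weights (36)-(37)

def out(x):
    # zero delays (38): the unit reduces to an ordinary perceptron
    return 1 if w0 + sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0

errors = sum(out(x) != y for x, y in T)
assert errors <= len(U)        # at most |U| <= k inconsistent examples
```

Only the positive example of the covered vertex is misclassified, matching the bound in the argument above.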
On the other hand, assume that there exist weights w0 , . . . , wn and delays
d1 , . . . , dn making N consistent with all but at most k training examples (34)–
(35). Define U ⊆ V so that U contains vertex vi for each inconsistent positive
example (34) corresponding to vi . In addition, U includes just one of vi and vj
(chosen arbitrarily) for each inconsistent negative example (35) corresponding
to edge {vi , vj }. Clearly, |U | ≤ k since there are at most k inconsistent training
examples. It will be proved that U is a vertex cover for G. On the contrary,
assume that there is an edge {vi, vj} ∈ E such that vi, vj ∉ U. It follows from the
definition of U that N is consistent with the negative example (35) corresponding
to edge {vi , vj }, which implies
ξ(t) = w0 < 0 for t ∉ Di ∪ Dj,    (39)
and it is consistent with the positive examples (34) corresponding to vertices
vi , vj , which ensures
ξ(t) = w0 + wi ≥ 0 for t ∈ Di (40)
ξ(t) = w0 + wj ≥ 0 for t ∈ Dj (41)
because of (39). By summing inequalities (39)–(41), we obtain
w0 + wi + wj > 0 . (42)
On the other hand, by comparing inequalities (40) and (41) with the consistency
of the negative example (35) corresponding to edge {vi , vj } we conclude that
Di = Dj (synchronization technique) and hence
ξ(t) = w0 + wi + wj < 0 for t ∈ Di = Dj , (43)
which contradicts inequality (42). This completes the proof that U is a solution
of VCP.
In this section we deal with the representation (membership) problem for the
spiking neurons with binary coded inputs:
Representation Problem for Spiking Neuron N (RPN)
Instance: A Boolean function f in DNF (disjunctive normal form).
Question: Is f computable by a single spiking neuron N , i.e. are there weights
w0 , . . . , wn and delays d1 , . . . , dn for N such that yN (x) = f (x) for every x ∈
{0, 1}n ?
The representation problem for perceptrons with zero delays, known as the linear
separability problem, was proved to be co-NP-complete [13]. We generalize the
co-NP-hardness result for spiking neurons with arbitrary delays. On the other
hand, the RPN is clearly in Σ2p, whereas its hardness for Σ2p (or for NP), which would imply [1] that the spiking neurons with arbitrary delays are not learnable with membership and equivalence queries (unless NP = co-NP), remains an open problem.
Moreover, it was shown [20] that the class of n-variable Boolean functions
computable by spiking neurons is strictly contained in the class DLLT that con-
sists of functions representable as disjunctions of O(n) Boolean linear threshold
functions over n variables (from the class LT containing functions computable
by threshold gates) where the smallest number of threshold gates is called the
threshold number [11]. For example, class DLLT corresponds to two-layer net-
works with linear number of hidden perceptrons (with zero delays) and one
output OR gate. It was shown [27] that the threshold number of spiking neurons
with n inputs is at most n − 1 and can be lower-bounded by n/2. On the other
hand, there exists a Boolean function with threshold number 2 that cannot be
computed by a single spiking neuron [27]. We prove that a modified version of
RPN, denoted as DLLT-RPN, whose instances are Boolean functions f from
DLLT (instead of DNF) is also co-NP-hard. This means that it is hard to decide
whether a given n-variable Boolean function expressed as a disjunction of O(n)
threshold gates can be computed by a single spiking neuron.
Theorem 3. RPN and DLLT-RPN are co-NP-hard and belong to Σ2p .
DNF ⋁_{j=1}^m ((Cj ∧ xj) ∨ (Cj ∧ x̄j)), where x1, . . . , xm are m new variables. Clearly,
in the new DNF formula the number of monomials is linear in terms of the num-
ber of variables. Moreover, any monomial can obviously be computed by a single
threshold gate.
Thus given a TP (DLLT-TP) instance g over n variables x1 , . . . , xn , we
construct a corresponding RPN (DLLT-RPN) instance f over n + 2 variables
x1 , . . . , xn , y1 , y2 in polynomial time as follows:
For a TP instance g, function f is actually in DNF as required for the RPN. For a DLLT-TP instance g = ⋁_{j=1}^m gj with gj from LT, formula (44) contains terms gj ∧ y1 that are negations of ḡj ∨ ȳ1; these belong to LT since class LT is closed under negation [21], and a summand W(1 − y1) with a sufficiently large weight W can be added to the weighted sum for ḡj to evaluate ḡj ∨ ȳ1. This implies that f is from DLLT, representing a DLLT-RPN instance.
It will be shown that the TP (DLLT-TP) instance has a solution iff the
corresponding RPN (DLLT-RPN) instance is solvable. So first assume that g
is a tautology. Hence f given by (44) can be equivalently rewritten as y1 ∨ y2
which is trivially computable by a spiking neuron. On the other hand, assume
that there exists a ∈ {0, 1}n such that g(a) = 0. In this case, f (a, y1 , y2 ) reduces
to XOR(y1 , y2 ) which cannot be implemented by a single spiking neuron [20].
For proving that RPN ∈ Σ2p (similarly for DLLT-RPN), consider an alternating algorithm for the RPN that, given f in DNF, guesses polynomial-size representations [20] of weights and delays for spiking neuron N first in its existential state, and then verifies yN(x) = f(x) for every x ∈ {0, 1}^n (yN(x) can be computed in polynomial time since there is only a linear number of time intervals to check) in its universal state.
5 Conclusion
References
1. Aizenstein, H., Hegedüs, T., Hellerstein, L., Pitt, L.: Complexity Theoretic Hard-
ness Results for Query Learning. Computational Complexity 7 (1) (1998) 19–53
2. Amaldi, E.: On the complexity of training perceptrons. In: Kohonen, T., Mäkisara,
K., Simula, O., Kangas, J. (eds.): Proceedings of the ICANN’91 First Interna-
tional Conference on Artificial Neural Networks. Elsevier Science Publisher, North-
Holland, Amsterdam (1991) 55–60
3. Anthony, M., Bartlett, P.L.: Neural Network Learning: Theoretical Foundations.
Cambridge University Press, Cambridge, UK (1999)
4. Arora, S., Babai, L., Stern, J., Sweedyk, Z.: The hardness of approximate optima in
lattices, codes, and systems of linear equations. Journal of Computer and System
Sciences 54 (2) (1997) 317–331
5. Blumer, A., Ehrenfeucht, A., Haussler, D., Warmuth, M.K.: Learnability and the
Vapnik-Chervonenkis dimension. Journal of the ACM 36 (4) (1989) 929–965
6. Bohte, M., Kok, J.N., La Poutré, H.: Spike-prop: error-backpropagation in multi-
layer networks of spiking neurons. In: Proceedings of the ESANN’2000 European
Symposium on Artificial Neural Networks. D-Facto Publications, Brussels (2000)
419–425
7. Cook, S.A.: The complexity of theorem-proving procedures. In: Proceedings of the
STOC’71 Third Annual ACM Symposium on Theory of Computing. ACM Press,
New York (1971) 151–158
8. Fahlman, S.E., Lebiere, C.: The cascade-correlation learning architecture. In:
Touretzky, D.S. (ed.): Advances in Neural Information Processing Systems
(NIPS’89), Vol. 2. Morgan Kaufmann, San Mateo (1990) 524–532
9. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory
of NP-Completeness. W.H. Freeman, San Francisco (1979)
10. Gerstner, W., Kistler, W.M.: Spiking Neuron Models: Single Neurons, Populations,
Plasticity. Cambridge University Press, Cambridge, UK (2002)
11. Hammer, P.L., Ibaraki, T., Peled, U.N.: Threshold numbers and threshold comple-
tions. In: Hansen, P. (ed.): Studies on Graphs and Discrete Programming, Annals
of Discrete Mathematics 11, Mathematics Studies, Vol. 59. North-Holland, Ams-
terdam (1981) 125–145
12. Haykin, S.: Neural Networks: A Comprehensive Foundation. 2nd edn. Prentice-
Hall, Upper Saddle River, NJ (1999)
13. Hegedüs, T., Megiddo, N.: On the geometric separability of Boolean functions.
Discrete Applied Mathematics 66 (3) (1996) 205–218
14. Höffgen, K.-U., Simon, H.-U., Van Horn, K.S.: Robust trainability of single neu-
rons. Journal of Computer and System Sciences 50 (1) (1995) 114–125
15. Hush, D.R.: Training a sigmoidal node is hard. Neural Computation 11 (5) (1999)
1249–1260
16. Johnson, D.S., Preparata, F.P.: The densest hemisphere problem. Theoretical Com-
puter Science 6 (1) (1978) 93–107
17. Judd, J.S.: Neural Network Design and the Complexity of Learning. The MIT
Press, Cambridge, MA (1990)
18. Karp, R.M.: Reducibility among combinatorial problems. In: Miller, R.E.,
Thatcher, J.W. (eds.): Complexity of Computer Computations. Plenum Press, New
York (1972) 85–103
19. Maass, W., Bishop, C.M. (eds.): Pulsed Neural Networks. The MIT Press, Cam-
bridge, MA (1999)
On the Complexity of Training a Single Perceptron 233
20. Maass, W., Schmitt, M.: On the complexity of learning for spiking neurons with
temporal coding. Information and Computation 153 (1) (1999) 26–46
21. Parberry, I.: Circuit Complexity and Neural Networks. The MIT Press, Cambridge,
MA (1994)
22. Pitt, L., Valiant, L.G.: Computational limitations on learning from examples. Jour-
nal of the ACM 35 (4) (1988) 965–984
23. Roychowdhury, V.P., Siu, K.-Y., Kailath, T.: Classification of linearly non-
separable patterns by linear threshold elements. IEEE Transactions on Neural
Networks 6 (2) (1995) 318–331
24. Roychowdhury, V.P., Siu, K.-Y., Orlitsky, A. (eds.): Theoretical Advances in Neu-
ral Computation and Learning. Kluwer Academic Publishers, Boston (1994)
25. Rosenblatt, F.: The perceptron: A probabilistic model for information storage and
organization in the brain. Psychological Review 65 (6) (1958) 386–408
26. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-
propagating errors. Nature 323 (1986) 533–536
27. Schmitt, M.: On computing Boolean functions by a spiking neuron. Annals of
Mathematics and Artificial Intelligence 24 (1-4) (1998) 181–191
28. Šı́ma, J.: Training a single sigmoidal neuron is hard. Neural Computation 14 (11)
(2002) 2709–2728
29. Vidyasagar, M.: A Theory of Learning and Generalization. Springer-Verlag, Lon-
don (1997)
Learning a Subclass of Regular Patterns in
Polynomial Time
1 Introduction
The pattern languages were formally introduced by Angluin [1]. A pattern lan-
guage is (by definition) one generated by all the positive length substitution
instances in a pattern, such as, for example,
abxycbbzxa
— where the variables (for substitutions) are x, y, z and the constants/terminals
are a, b, c .
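Membership in such a pattern language can be sketched with regular expressions plus backreferences: the first occurrence of a variable becomes a capture group (.+) (enforcing positive-length substitutions), and every later occurrence becomes a backreference. This is an illustrative helper of ours, not part of the paper; the name pattern_to_regex is our own:

```python
import re

def pattern_to_regex(pattern, variables):
    """Compile a pattern into a regex: the first occurrence of a variable
    becomes a capture group (.+), every later occurrence becomes a
    backreference to that group; constants are matched literally."""
    seen, parts = {}, []
    for ch in pattern:
        if ch in variables:
            if ch in seen:
                parts.append("\\%d" % seen[ch])
            else:
                seen[ch] = len(seen) + 1
                parts.append("(.+)")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts) + "$")

rx = pattern_to_regex("abxycbbzxa", variables=set("xyz"))
# Substituting x=a, y=b, z=c yields "ababcbbcaa", which is in the language:
print(bool(rx.match("ababcbbcaa")))  # -> True
print(bool(rx.match("abc")))         # -> False
```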
Supported in part by NSF grant number CCR-0208616 and USDA IFAFS grant
number 01-04145.
Supported in part by NUS grant number R252-000-127-112.
R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 234–246, 2003.
© Springer-Verlag Berlin Heidelberg 2003
Since then, much work has been done on pattern languages and extended pat-
tern languages which also allow empty substitutions as well as on various special
cases of the above (cf., e.g., [1,6,7,10,12,21,20,22,23,26,19,29] and the references
therein). Furthermore, several authors have also studied finite unions of pattern
languages (or extended pattern languages), unbounded unions thereof and also
of important subclasses of (extended) pattern languages (see, for example, [11,
5,27,3,32]).
Nix [18] as well as Shinohara and Arikawa [28,29] outline interesting appli-
cations of pattern inference algorithms. For example, pattern language learning
algorithms have been successfully applied toward some problems in molecular bi-
ology (see [25,29]). Pattern languages and finite unions of pattern languages turn
out to be subclasses of Smullyan’s [30] Elementary Formal Systems (EFSs), and
Arikawa, Shinohara and Yamamoto [2] show that the EFSs can also be treated as
a logic programming language over strings. The investigations of the learnabil-
ity of subclasses of EFSs are interesting because they yield corresponding results
about the learnability of subclasses of logic programs. Hence, these results are
also of relevance for Inductive Logic Programming (ILP) [17,13,4,15]. Miyano et
al. [16] intensively studied the polynomial-time learnability of EFSs.
In the following we explain the main philosophy behind our research as well
as the ideas by which it emerged. As far as learning theory is concerned, pattern
languages are a prominent example of non-regular languages that can be learned
in the limit from positive data (cf. [1]). Gold [9] has introduced the correspond-
ing learning model. Let L be any language; then a text for L is any infinite sequence of strings that eventually contains all strings of L , and nothing else. The information given to the learner consists of successively growing initial segments of a text. Processing these segments, the learner has to output hypotheses about L .
The hypotheses are chosen from a prespecified set called hypothesis space. The
sequence of hypotheses has to converge to a correct description of the target
language.
Angluin [1] provides a learner for the class of all pattern languages that is
based on the notion of descriptive patterns. Here a pattern π is said to be
descriptive (for the set S of strings contained in the input provided so far) if
π can generate all strings contained in S and no other pattern having this
property generates a proper subset of the language generated by π . But no
efficient algorithm is known for computing descriptive patterns. Thus, unless
such an algorithm is found, it is even infeasible to compute a single hypothesis
in practice by using this approach.
Another important special case extensively studied are the regular pattern
languages introduced by Shinohara [26]. These are generated by the regular pat-
terns, i.e., patterns in which each variable that appears, appears only once. The
learners designed by Shinohara [26] for regular pattern languages and extended
regular pattern languages are also computing descriptive patterns for the data
seen so far. These descriptive patterns are computable in time polynomial in the
length of all examples seen so far.
But when applying these algorithms in practice, another problem comes into
play, i.e., all the learners mentioned above are only known to converge in the
limit to a correct hypothesis for the target language. But the stage of convergence
is not decidable. Thus, a user never knows whether or not the learning process
is already finished. Such an uncertainty may not be tolerable in practice.
Consequently, one has tried to learn the pattern languages within Valiant’s
[31] PAC model. Schapire [24] could show that the whole class of pattern lan-
guages is not learnable within the PAC model unless P/poly = N P/poly for
any hypothesis space that allows a polynomially decidable membership problem.
Since membership is N P -complete for the pattern languages, his result does not
exclude the learnability of all pattern languages in an extended PAC model, i.e.,
a model in which one is allowed to use the set of all patterns as hypothesis space.
However, Kearns and Pitt [10] have established a PAC learning algorithm
for the class of all k -variable pattern languages, i.e., languages generated by
patterns in which at most k different variables occur. Positive examples are
generated with respect to arbitrary product distributions while negative exam-
ples are allowed to be generated with respect to any distribution. Additionally,
the length of substitution strings has been required to be polynomially related
to the length of the target pattern. Finally, their algorithm uses as hypothesis
space all unions of polynomially many patterns that have k or fewer variables1 .
The overall learning time of their PAC learning algorithm is polynomial in the
length of the target pattern, the bound for the maximum length of substitution
strings, 1/ε , 1/δ , and |Σ| . The constant in the achieved running time depends doubly exponentially on k , and thus their algorithm rapidly becomes impractical when k increases.
As far as the class of extended regular pattern languages is concerned, Miyano
et al. [16] showed the consistency problem to be N P -complete. Thus, the class
of all extended regular pattern languages is not polynomial-time PAC learnable
unless RP = N P for any learner that uses the regular patterns as hypothesis
space.
This is even true for REGPAT1 , i.e., the set of all extended regular pattern
languages where the length of constant strings is 1 (see below for a formal
1 More precisely, the number of allowed unions is at most poly(|π|, s, 1/ε, 1/δ, |Σ|) , where π is the target pattern, s is the bound on the length of substitution strings, ε and δ are the usual error and confidence parameters, respectively, and Σ is the alphabet of constants over which the patterns are defined.
definition). The latter result follows from [16] via an equivalence proof to the
common subsequence languages studied in [14].
In the present paper we also study the special cases of learning the extended
regular pattern languages. On the one hand, they already allow non-trivial appli-
cations. On the other hand, it is by no means easy to design an efficient learner
for these classes of languages as noted above. Therefore, we aim to design an
efficient learner for an interesting subclass of the extended regular pattern lan-
guages which we define next.
Let Lang(π) be the extended pattern language generated by pattern π . For
c > 0 , let REGPATc be the set of all Lang(π) such that π is a pattern of the
form x0 α1 x1 α2 x2 . . . αm xm , where each αi is a string of terminals of length c
and x0 , x1 , x2 , . . . , xm are distinct variables.
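Because every variable in such a pattern occurs exactly once and may be erased, Lang(π) is simply Σ∗ α1 Σ∗ α2 . . . αm Σ∗ , and membership reduces to a greedy in-order search for the blocks α1 , . . . , αm . A minimal sketch (our code, not from the paper; regpat_member is our name):

```python
def regpat_member(w, alphas):
    """Membership in Lang(x0 a1 x1 ... am xm): since each variable occurs
    once and may be erased, w belongs iff a1, ..., am occur in w in order
    and without overlap; a greedy left-to-right scan decides this."""
    pos = 0
    for a in alphas:
        i = w.find(a, pos)
        if i < 0:
            return False
        pos = i + len(a)  # blocks must not overlap
    return True

print(regpat_member("xxabyycdzz", ["ab", "cd"]))  # -> True
print(regpat_member("xxcdyyabzz", ["ab", "cd"]))  # -> False
```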
We consider polynomial time learning of REGPATc for various data presen-
tations and for natural and plausible probability distributions on the input data.
As noted above, even REGPAT1 is not polynomial-time PAC learnable unless
RP = N P . Thus, one has to restrict the class of all probability distributions.
The conceptual idea is then as follows.
We explain it here for the case mainly studied in this paper, learning from
text (in our above notation). One looks again at the whole learning process as
learning in the limit. So, the data presented to the learner are growing initial
segments of a text. But now, we do not allow any text. Instead every text is
drawn according to some fixed probability distribution. Next, one determines
the expected number of examples needed by the learner until convergence. Let
E denote this expectation. Assuming prior knowledge about the underlying probability distribution, E can be expressed in terms of quantities that the learner may use to calculate it. Using Markov's inequality, one easily sees that the
probability to exceed this expectation by a factor of t is bounded by 1/t . Thus,
we introduce, as in the PAC model, a confidence parameter δ . Given δ , one
needs roughly (1/δ) · E many examples to converge with probability at least
1 − δ . Knowing this, there is of course no need to compute any intermediate hypotheses. Instead, the learner first draws as many examples as needed and then computes just one hypothesis from them. This hypothesis is output, and
by construction we know it to be correct with probability at least 1 − δ . Thus,
we arrive at a learning model which we call probabilistically exact learning (cf.
Definition 5 below). Clearly, in order to have an efficient learner one also has to
ensure that this hypothesis can be computed in time polynomial in the length
of all strings seen. For arriving at an overall polynomial-time learner, it must also be ensured that E is polynomially bounded in a suitable parameter. We use
the number of variables occurring in the regular target pattern, c (the length
of substitution strings) and a term describing knowledge about the probability
distribution as such a parameter.
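The sample-size consequence of Markov's inequality above is a one-liner: drawing ⌈E/δ⌉ examples suffices for convergence with probability at least 1 − δ. A sketch with a hypothetical helper name of our choosing:

```python
import math

def sample_size(E, delta):
    """Examples to draw so that, by Markov's inequality, the number T of
    examples needed until convergence (with expectation E) is exceeded
    with probability at most delta: P(T >= E/delta) <= E/(E/delta) = delta."""
    return math.ceil(E / delta)

print(sample_size(100, 0.05))  # -> 2000
```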
For REGPATc , we have results for three different models of data presenta-
tion. The data are drawn according to the distribution prob described below.
238 J. Case et al.
2 Preliminaries
Let N = {0, 1, 2, . . .} denote the set of natural numbers, and let N+ = N \ {0} .
For any set S , we write |S| to denote the cardinality of S .
Let Σ be any non-empty finite set of constants such that |Σ| ≥ 2 and let V
be a countably infinite set of variables such that Σ ∩ V = ∅ . By Σ ∗ we denote
the free monoid over Σ . The set of all finite non-null strings of symbols from
Σ is denoted by Σ + , i.e., Σ + = Σ ∗ \ {λ} , where λ denotes the empty string.
As above, Σ c denotes the set of strings over Σ with length c . We let a, b, . . .
Note that we are considering extended (or erasing) pattern languages, i.e., a
variable may be substituted with the empty string λ . Though allowing empty
substitutions may seem a minor generalization, it is not. Learning erasing pat-
tern languages is more difficult for the case considered within this paper than
learning non-erasing ones. For the general case of arbitrary pattern languages,
already Angluin [1] showed the non-erasing pattern languages to be learnable
from positive data. However, the erasing pattern languages are not learnable
from positive data if |Σ| = 2 (cf. Reidenbach [19]).
(a) reg_c^m = {π | π = x0 α1 x1 α2 x2 . . . αm xm , where each αi ∈ Σ c } .
(b) reg_c = ∪m reg_c^m .
(c) REGPATc = {Lang(π) | π ∈ reg_c } .
3 Main Result
Proof. Since any regular pattern π has a variable at the end, the proposition
follows.
We now present our algorithm for learning REGPATc . The algorithm has
prior knowledge about the function r from Lemma 2 and the function f from
Proposition 2. It takes as input c , δ and knowledge about the probability
distribution by getting pol .
Let A be the set of all examples and let Aj (j ∈ {1, 2, . . . , n}) be the examples whose index is j modulo n ; that is, the (k ∗ n + j)-th example from A goes to Aj , where k is an integer and j ∈ {1, 2, . . . , n} .
Let i = 1 , π0 = x0 , X0 = {λ} and go to Step (2).
(2) For β ∈ Σ c , let Yi,β = Xi−1 βΣ ∗ .
If A ∩ Xi−1 = ∅ , then let m = i − 1 and go to Step (3).
Choose αi as the β ∈ Σ c such that |Ai ∩ Yi,β | > |Ai ∩ Yi,β ′ | for all β ′ ∈ Σ c \ {β} (if there is no such β , then abort the algorithm).
Let Xi be the set of all strings σ such that σ is in Σ ∗ α1 Σ ∗ α2 Σ ∗ . . . Σ ∗ αi ,
but no proper prefix τ of σ is in Σ ∗ α1 Σ ∗ α2 Σ ∗ . . . Σ ∗ αi .
Let πi = πi−1 αi xi , let i = i + 1 and go to Step (2).
(3) Output the pattern πm = x0 α1 x1 α2 x2 . . . αm xm and halt.
End
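The loop above can be sketched in code. This is a simplified illustration of Steps (2)–(3) only: it pools all examples instead of splitting them into A1 , . . . , An , and it takes a plurality vote over candidate blocks rather than the strict majority test of the algorithm, so it is a sketch under our assumptions, not the algorithm itself; all function names are ours:

```python
from collections import Counter

def earliest_end(w, alphas):
    """Earliest position in w at which alpha_1, ..., alpha_i can all be
    matched in order without overlap; None if they cannot."""
    pos = 0
    for a in alphas:
        j = w.find(a, pos)
        if j < 0:
            return None
        pos = j + len(a)
    return pos

def learn_regpat(examples, c):
    alphas = []
    while True:
        ends = {w: earliest_end(w, alphas) for w in examples}
        # Exit test of Step (2): some example already lies in X_{i-1},
        # i.e. its earliest match ends exactly at its last symbol.
        if any(e == len(w) for w, e in ends.items()):
            return alphas  # Step (3): pattern x0 a1 x1 ... am xm
        # |A ∩ Y_{i,b}|: an example is in X_{i-1} b Σ* iff the length-c
        # block right after its unique X_{i-1}-prefix equals b.
        votes = Counter(w[e:e + c] for w, e in ends.items()
                        if e is not None and len(w) >= e + c)
        if not votes:
            return None  # abort, as in the algorithm
        alphas.append(votes.most_common(1)[0][0])

print(learn_regpat(["abcd", "abcde", "xabcd", "abcdy"], c=2))  # -> ['ab', 'cd']
```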
Note that since the shortest example is strictly shorter than c ∗ n it holds
that n ≥ 1 . Furthermore, if π = x0 , then the probability that a string drawn is
λ is at least 1/pol(0) . A lower bound for this is 1/(2 ∗ |Σ|c ∗ f (n) ∗ pol(f (n))) ,
whatever n is, due to the fact that pol is monotonically increasing. Thus λ
appears with probability 1 − δ/n in the set An and thus in the set A . So the
algorithm is correct for the case that π = x0 .
It remains to consider the case where π is of the form x0 α1 x1 α2 x2 . . . αm xm
for some m ≥ 1 where all αi are in Σ c .
C(i, αi , h) ≥ C(i, β, h) + |Σ|h /(2 ∗ |Σ|c ∗ f (m))
for all β ∈ Σ c \ {αi } . In particular,
prob(Yi,β ∩ Lang(π)) + 1/(2 ∗ |Σ|c ∗ pol(f (m)) ∗ f (m)) ≤ prob(Yi,αi ∩ Lang(π)) .
Proof. Let D(i, β, h) = C(i, β, h)/|Σ|h , for all h and β ∈ Σ c . Proposition 1 and Claim 3 give the claimed bound as follows. Since every string in Lang(π) is in some set Yi,β , it holds that D(i, αi , f (m)) ≥ 1/(2 ∗ |Σ|c ) . Furthermore, D(i, αi , h) = 0 for all h < c since m > 0 and π does not generate the empty string. Thus there is an h ∈ {1, 2, ..., f (m)} with
D(i, αi , h) − D(i, αi , h − 1) ≥ 1/(2 ∗ |Σ|c ∗ f (m)) .
For this h , it holds that
D(i, αi , h) ≥ D(i, β, h) + 1/(2 ∗ |Σ|c ∗ f (m)) .
We now show that the learner presented above indeed probabilistically ex-
actly learns Lang(π) , for π ∈ regc .
A loop invariant of Step (2) is that, with probability at least 1 − δ ∗ (i − 1)/n , the
pattern πi−1 is a prefix of the desired pattern π . This certainly holds before
entering Step (2) for the first time.
Case 1. i ∈ {1, 2, ..., m} .
≥ prob(Yi,β ∩ Lang(π)) + 1/(2 ∗ |Σ|c ∗ f (m) ∗ pol(f (m))) ,
Case 2. i = m + 1 .
that
D(π, h) − D(π, h − 1) ≥ 1/(2 ∗ f (m)) ≥ 1/(2 ∗ |Σ|c ∗ f (n)) .
Note that h ≤ f (n) since f is increasing. It follows that
prob(Xm ) ≥ 1/(2 ∗ |Σ|c ∗ f (n) ∗ pol(f (n))) .
To get a polynomial time bound for the learner, we note the following. It is easy to show that there is a polynomial q(m, 1/δ ′ ) which, with sufficiently high probability (1 − δ ′ , for any fixed δ ′ ), bounds the parameter n of the learning algorithm. Thus, with probability at least 1 − δ − δ ′ the whole algorithm is successful in time and example-number polynomial in m , 1/δ and 1/δ ′ . Thus, for
We are hoping in the future (not as part of the present paper) to run our
algorithm on molecular biology data to see if it can quickly provide useful an-
swers.
References
1. D. Angluin. Finding patterns common to a set of strings. Journal of Computer
and System Sciences, 21:46–62, 1980.
2. S. Arikawa, T. Shinohara, and A. Yamamoto. Learning elementary formal systems.
Theoretical Computer Science, 95:97–113, 1992.
3. T. Shinohara and H. Arimura. Inductive inference of unbounded unions of pattern
languages from positive data. Theoretical Computer Science, 241:191–209, 2000.
4. I. Bratko and S. Muggleton. Applications of inductive logic programming. Com-
munications of the ACM, 1995.
5. A. Brāzma, E. Ukkonen, and J. Vilo. Discovering unbounded unions of regular
pattern languages from positive examples. In Proceedings of the 7th International
Symposium on Algorithms and Computation (ISAAC’96), volume 1178 of Lecture
Notes in Computer Science, pages 95–104, Springer, 1996.
6. J. Case, S. Jain, S. Kaufmann, A. Sharma, and F. Stephan. Predictive learning
models for concept drift. Theoretical Computer Science, 268:323–349, 2001. Special
Issue for ALT’98.
7. J. Case, S. Jain, S. Lange, and T. Zeugmann. Incremental concept learning for
bounded data mining. Information and Computation, 152(1):74–110, 1999.
8. T. Erlebach, P. Rossmanith, H. Stadtherr, A. Steger, and T. Zeugmann. Learning
one-variable pattern languages very efficiently on average, in parallel, and by asking
queries. Theoretical Computer Science, 261(1):119–156, 2001.
9. E.M. Gold. Language identification in the limit. Information and Control, 10:447–474, 1967.
10. M. Kearns and L. Pitt. A polynomial-time algorithm for learning k -variable
pattern languages from examples. In R. Rivest, D. Haussler and M. K. War-
muth (Eds.), Proceedings of the Second Annual ACM Workshop on Computational
Learning Theory, pages 57–71, Morgan Kaufmann Publishers Inc., 1989.
11. P. Kilpeläinen, H. Mannila, and E. Ukkonen. MDL learning of unions of simple
pattern languages from positive examples. In Paul Vitányi, editor, Second European
Conference on Computational Learning Theory, volume 904 of Lecture Notes in
Artificial Intelligence, pages 252–260. Springer, 1995.
12. S. Lange and R. Wiehagen. Polynomial time inference of arbitrary pattern lan-
guages. New Generation Computing, 8:361–370, 1991.
13. N. Lavrač and S. Džeroski. Inductive Logic Programming: Techniques and Appli-
cations. Ellis Horwood, 1994.
14. S. Matsumoto and A. Shinohara. Learnability of subsequence languages. Informa-
tion Modeling and Knowledge Bases VIII, pages 335–344, IOS Press, 1997.
15. T. Mitchell. Machine Learning. McGraw Hill, 1997.
16. S. Miyano, A. Shinohara and T. Shinohara. Polynomial-time learning of elementary
formal systems. New Generation Computing, 18:217–242, 2000.
17. S. Muggleton and L. De Raedt. Inductive logic programming: Theory and methods.
Journal of Logic Programming, 19/20:669–679, 1994.
18. R. Nix. Editing by examples. Technical Report 280, Department of Computer
Science, Yale University, New Haven, CT, USA, 1983.
19. D. Reidenbach. A Negative Result on Inductive Inference of Extended Pattern
Languages. In N. Cesa-Bianchi and M. Numao, editors, Algorithmic Learning
Theory, 13th International Conference, ALT 2002, Lübeck, Germany, November
2002, Proceedings, pages 308–320. Springer, 2002.
20. R. Reischuk and T. Zeugmann. Learning one-variable pattern languages in linear
average time. In Proceedings of the Eleventh Annual Conference on Computational
Learning Theory, pages 198–208. ACM Press, 1998.
21. P. Rossmanith and T. Zeugmann. Stochastic Finite Learning of the Pattern Languages. Machine Learning, 44(1/2):67–91, 2001. Special Issue on Automata Induction, Grammar Inference, and Language Acquisition.
22. A. Salomaa. Patterns (The Formal Language Theory Column). EATCS Bulletin,
54:46–62, 1994.
23. A. Salomaa. Return to patterns (The Formal Language Theory Column). EATCS
Bulletin, 55:144–157, 1994.
24. R. Schapire. Pattern languages are not learnable. In M.A. Fulk and J. Case, editors, Proceedings, 3rd Annual ACM Workshop on Computational Learning Theory, pages 122–129, Morgan Kaufmann Publishers, Inc., 1990.
25. S. Shimozono, A. Shinohara, T. Shinohara, S. Miyano, S. Kuhara, and S. Arikawa.
Knowledge acquisition from amino acid sequences by machine learning system
BONSAI. Trans. Information Processing Society of Japan, 35:2009–2018, 1994.
26. T. Shinohara. Polynomial time inference of extended regular pattern languages.
In RIMS Symposia on Software Science and Engineering, Kyoto, Japan, volume
147 of Lecture Notes in Computer Science, pages 115–127. Springer-Verlag, 1982.
27. T. Shinohara. Inferring unions of two pattern languages. Bulletin of Informatics and Cybernetics, 20:83–88, 1983.
28. T. Shinohara and S. Arikawa. Learning data entry systems: An application of
inductive inference of pattern languages. Research Report 102, Research Institute
of Fundamental Information Science, Kyushu University, 1983.
29. T. Shinohara and S. Arikawa. Pattern inference. In Klaus P. Jantke and Steffen
Lange, editors, Algorithmic Learning for Knowledge-Based Systems, volume 961 of
Lecture Notes in Artificial Intelligence, pages 259–291. Springer, 1995.
30. R. Smullyan. Theory of Formal Systems, Annals of Mathematical Studies, No. 47.
Princeton, NJ, 1961.
31. L.G. Valiant. A theory of the learnable. Communications of the ACM 27:1134–
1142, 1984.
32. K. Wright. Identification of unions of languages drawn from an identifiable class.
In R. Rivest, D. Haussler, and M.K. Warmuth, editors, Proceedings of the Sec-
ond Annual Workshop on Computational Learning Theory, pages 328–333. Morgan
Kaufmann Publishers, Inc., 1989.
33. T. Zeugmann. Lange and Wiehagen’s pattern language learning algorithm: An
average-case analysis with respect to its total learning time. Annals of Mathematics
and Artificial Intelligence, 23(1–2):117–145, 1998.
Identification with Probability One of Stochastic
Deterministic Linear Languages
1 Introduction
R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 247–258, 2003.
© Springer-Verlag Berlin Heidelberg 2003
248 C. de la Higuera and J. Oncina
depend on being able to replace finite state models such as Hidden Markov Mod-
els by stochastic context-free grammars. Yet the problem of learning this type
of grammar from strings has rarely been addressed. The usual way of dealing
with the problem still consists in first learning a structure, and then estimating
the probabilities [Bak79].
In the more theoretical setting of learning from both examples and counter-examples, classes of grammars that are more general than the regular grammars, but restricted to cases where both determinism and linearity apply, have been studied [dlHO02].
On the other hand, learning (deterministic) regular stochastic grammars has
received a lot of attention over the past 10 years. A well known algorithm for
this task is ALERGIA [CO94], which has been improved by different authors
[YLT00,CO99], and applied to different tasks [WA02].
We synthesize in this paper both types of results and propose a novel class of
stochastic languages that we call stochastic deterministic linear languages. We
prove that each language of the class admits an equivalence relation of finite
index, thus leading to a canonical normal form. We propose an algorithm that
works in polynomial time with respect to the learning data. It can identify with
probability one any language in the class.
In section 2 the necessary definitions are given. We prove in section 3 the
existence of a small normal form, and give in section 4 a learning algorithm that
can learn grammars in normal form.
2 Definitions
Note that the expressions for u−1 L and Lu−1 are equivalent to {(v, pv ) : pv = p(uv|L)/p(uΣ ∗ |L)} and {(v, pv ) : pv = p(vu|L)/p(Σ ∗ u|L)} respectively, but avoid division-by-zero problems.
Of course, if u is a common prefix (resp. a common suffix) of L then p(uΣ ∗ |L) = 1 (resp. p(Σ ∗ u|L) = 1 ) and u−1 L = {(v, pv ) : (uv, pv ) ∈ L} (resp. Lu−1 = {(v, pv ) : (vu, pv ) ∈ L} ).
We denote the longest common suffix reduction of a stochastic language L
by L ↓ = {(u, p) : z = lcs(L), (uz, p) ∈ L}, where lcs(L) = lcs{u : (u, p) ∈ L}.
Note that if L is a stochastic language then, for every u , the languages u−1 L , Lu−1 and L ↓ are also stochastic languages.
A stochastic deterministic linear (SDL) grammar G = (Σ, V, R, S, p) consists of Σ , V , S as for context-free grammars, a finite set R of derivation rules, each of the form X → aY w or X → λ , such that X → aY w, X → aZv ∈ R ⇒ Y = Z ∧ w = v , and a real function p : R → ]0, 1] giving the probability of each derivation rule.
The probability p(S →∗ w) that the grammar G generates the string w is defined recursively as
p(X →∗ avw) = p(X → aY w) · p(Y →∗ v) ,
where Y is the only variable such that X → aY w ∈ R (if no such variable exists, then p(X → aY w) = 0 is assumed). It can be shown that if, for every A ∈ V , the probabilities p(A → α) of all rules A → α ∈ R sum to 1 and G does not contain useless symbols, then G defines a stochastic language.
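Determinism makes this recursion directly computable: for a string s = avw there is at most one applicable rule X → aY w, selected by the first symbol of s. A sketch under our own representation (rules maps (X, a) to (probability, Y, w), and (X, None) to the probability of X → λ):

```python
def gen_prob(rules, X, s):
    """p(X ->* s) for an SDL grammar: the rule is determined by the first
    symbol of s, and s must have the form a v w for the rule X -> aYw."""
    if s == "":
        return rules.get((X, None), 0.0)   # probability of X -> λ, if any
    r = rules.get((X, s[0]))
    if r is None:
        return 0.0
    prob, Y, w = r
    rest = s[1:]
    if len(rest) < len(w) or (w and not rest.endswith(w)):
        return 0.0
    return prob * gen_prob(rules, Y, rest[:len(rest) - len(w)])

# Example grammar: S -> a S b with probability 0.4, S -> λ with 0.6;
# it generates a^n b^n with probability 0.6 * 0.4^n.
rules = {("S", "a"): (0.4, "S", "b"), ("S", None): 0.6}
print(round(gen_prob(rules, "S", "aabb"), 6))  # -> 0.096
```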
250 C. de la Higuera and J. Oncina
A slightly different function tail that works over sequences is now introduced.
This function will be used to define a function right to work over sequences.
Definition 6. Let Sn be a finite sequence of strings. Then, for all x ∈ Pref(Sn ) ,
tailSn (x) = lcs(x−1 Sn ) if x ≠ λ , and tailSn (x) = λ if x = λ .
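Since the longest common suffix of a set of strings is just the common prefix of their reversals, tailSn takes only a few lines. Here x−1 Sn is taken as the suffixes of sample strings starting with x; the helper names and the behaviour outside Pref(Sn ) are our choices, not the paper's:

```python
import os

def lcs(strings):
    """Longest common suffix: common prefix of the reversed strings."""
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]

def tail(sample, x):
    """tail_Sn(x): λ for x = λ, else lcs of x^{-1}Sn, the suffixes of the
    sample strings that start with x (empty result if none do)."""
    if x == "":
        return ""
    quotients = [w[len(x):] for w in sample if w.startswith(x)]
    return lcs(quotients) if quotients else ""

S = ["abba", "abca", "aba"]
print(tail(S, "ab"))  # -> a
```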
It should be noticed that the above definition ensures that the functions
nextSn and rightSn can be computed in time polynomial in the size of Sn . We now
prove that the above definition allows functions nextSn and rightSn to converge
in the limit, to the intended functions nextL and rightL :
Lemma 2. Let L be a SDL language, for each sample Sn of L containing a set
D ⊆ {x : (x, p) ∈ L} such that:
1. ∀x ∈ SpL ∀a ∈ Σ : xa ∈ Pref(L) ⇒ ∃xaw ∈ D.
2. ∀x ∈ SpL ∀a ∈ Σ : CSFL (xa) = ∅ ⇒ tailD (xa) = tailL (xa)
then ∀x, y ∈ Sp(L),
1. nextSn (x) = nextL (x)
2. rightSn (xa) = rightL (xa)
Lemma 3. With probability one, nextSn (x) = nextL (x) and rightSn (xa) =
rightL (xa) ∀x ∈ Sp(L) except for finitely many values of n.
254 C. de la Higuera and J. Oncina
Proof. Given a SDL language, there exists (at least one) set D with non-null probability, and such a D is unique for each SDL language. Then with probability 1 any sufficiently large sample contains such a set D , and the above lemma yields the result.
In order to evaluate the equivalence relation equiv(x, y) ⇐⇒ x ≡L y ⇐⇒
CSFL (x) = CSFL (y) we have to check if two stochastic languages are equivalent
from a finite sample Sn .
To do that, instead of comparing the probabilities of each string of the sample,
we are going to compare the probabilities of their prefixes. This strategy (also
used in ALERGIA [CO94] and RLIPS [CO99]) allows to distinguish different
probabilities faster, as more information is always available about a prefix than
about a whole string. It is therefore easy to establish the equivalence between
the various definitions:
Proposition 3. Two stochastic languages L1 and L2 are equal iff
p(aΣ ∗ |w−1 L1 ) = p(aΣ ∗ |w−1 L2 )∀a ∈ Σ, ∀w ∈ Σ ∗
Proof. L1 = L2 =⇒ ∀w ∈ Σ ∗ : p(w|L1 ) = p(w|L2 ) =⇒ w−1 L1 = w−1 L2 =⇒ ∀z ∈ Σ ∗ : p(z|w−1 L1 ) = p(z|w−1 L2 ) .
Conversely, L1 ≠ L2 =⇒ ∃w ∈ Σ ∗ : p(w|L1 ) ≠ p(w|L2 ) . Let w = az ; as p(az|L) = p(aΣ ∗ |L) p(z|a−1 L) , then p(aΣ ∗ |L1 ) p(z|a−1 L1 ) ≠ p(aΣ ∗ |L2 ) p(z|a−1 L2 ) .
Now we have 2 cases:
1. p(aΣ ∗ |L1 ) ≠ p(aΣ ∗ |L2 ) , and the proposition is shown.
2. p(aΣ ∗ |L1 ) = p(aΣ ∗ |L2 ) ; then p(z|a−1 L1 ) ≠ p(z|a−1 L2 ) .
This can be applied recursively unless w = λ . In that case we have that ∃w ∈ Σ ∗ : p(w|L1 ) ≠ p(w|L2 ) ∧ p(wΣ ∗ |L1 ) = p(wΣ ∗ |L2 ) . But since Σx∈Σ ∗ p(x|Li ) = 1 , it follows that ∃a ∈ Σ such that p(waΣ ∗ |L1 ) ≠ p(waΣ ∗ |L2 ) . Thus p(aΣ ∗ |w−1 L1 ) ≠ p(aΣ ∗ |w−1 L2 ) .
As a consequence,
x ≡L y ⇐⇒ p(aΣ ∗ |(xz)−1 L) = p(aΣ ∗ |(yz)−1 L)∀a ∈ Σ, z ∈ Σ ∗
If, instead of the whole language, we only have a finite sample Sn , we estimate the probabilities by counting the appearances of the strings and compare them using a confidence range.
Definition 8. Let f /n be the observed frequency of a Bernoulli variable of probability p . We denote by εα (n) a function such that p(|f /n − p| < εα (n)) > 1 − α (the Hoeffding bound yields one such function).
Lemma 4. Let f1 /n1 and f2 /n2 be two observed frequencies of a Bernoulli variable of probability p . Then:
p( |f1 /n1 − f2 /n2 | < εα (n1 ) + εα (n2 ) ) > (1 − α)2 .
Proof. p(|f1 /n1 − f2 /n2 | < εα (n1 ) + εα (n2 )) ≥ p(|f1 /n1 − p| + |f2 /n2 − p| < εα (n1 ) + εα (n2 )) ≥ p(|f1 /n1 − p| < εα (n1 ) ∧ |f2 /n2 − p| < εα (n2 )) > (1 − α)2 .
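With the Hoeffding bound, εα (n) = sqrt(ln(2/α)/(2n)) satisfies Definition 8, and the test of Lemma 4 becomes a one-line comparison. A sketch with our own helper names:

```python
import math

def eps(n, alpha):
    """Hoeffding-based ε_α(n): since P(|f/n - p| >= t) <= 2·exp(-2·n·t²)
    for a Bernoulli mean estimate, t = sqrt(ln(2/α)/(2n)) gives
    P(|f/n - p| < ε_α(n)) > 1 - α."""
    return math.sqrt(math.log(2.0 / alpha) / (2.0 * n))

def compatible(f1, n1, f2, n2, alpha):
    """Lemma 4 test: both observed frequencies plausibly estimate the same
    p when |f1/n1 - f2/n2| < ε_α(n1) + ε_α(n2)."""
    return abs(f1 / n1 - f2 / n2) < eps(n1, alpha) + eps(n2, alpha)

print(compatible(48, 100, 260, 500, 0.05))  # -> True  (0.48 vs 0.52)
print(compatible(10, 100, 90, 100, 0.05))   # -> False (0.10 vs 0.90)
```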
Definition 9.
equivSn (x, y) ⇐⇒ ∀z ∈ Σ ∗ : xz ∈ Pref(Sn ) ∧ yz ∈ Pref(Sn ), ∀a ∈ Σ :
| cn (xzaΣ ∗ )/cn (xzΣ ∗ ) − cn (yzaΣ ∗ )/cn (yzΣ ∗ ) | < εα (cn (xzΣ ∗ )) + εα (cn (yzΣ ∗ )) ∧
| cn (xz)/cn (xzΣ ∗ ) − cn (yz)/cn (yzΣ ∗ ) | < εα (cn (xzΣ ∗ )) + εα (cn (yzΣ ∗ ))
This does not correspond to an infinite number of tests but only to those for
which xz or yz is a prefix in Sn . Each of these tests returns the correct answer
with probability greater than (1 − α)2 . Because the number of checks grows with
| Pref(L)| we will allow the parameter α to depend on n.
Theorem 2. Let the parameter αn be such that Σn n ∗ αn is finite. Then, with probability one, (x ≡L y) = equivSn (x, y) except for finitely many values of n .
required. It might even be the case that the entire class of context-free
grammars is identifiable in the limit with probability one by polynomial
algorithms.
An open problem whose answer would, in our view, be of real help for
further research in the field is that of coming up with a new learning criterion
for polynomial distribution learning. This should in a certain way better match
the idea of polynomial identification with probability one.
References
6 Appendix
The propositions from Section 3 aim at establishing that a small canonical form exists
for each SDL grammar. The following proofs follow the ideas of [dlHO02].
1 Introduction
In this paper we consider the problem of prediction: given some training data and
a new object xn we would like to predict its label yn. We use the randomised
on-line version of Transductive Confidence Machine as the basic method of prediction;
first we explain why we are interested in this method and then formulate the
main question of this paper.
Transductive Confidence Machine (TCM) [3,4] is a prediction method giving
“p-values” py for any possible value y of the unknown label yn ; the p-values
satisfy the following property (proven in, e.g., [1]): if the data satisfies the i.i.d.
assumption, which means that the data is generated independently by the same
mechanism, the probability that pyn < δ does not exceed δ for any threshold
δ ∈ (0, 1) (the validity property).
There are different ways of presenting the p-values. The one used in [3] only
works in the case of pattern recognition: the prediction algorithm outputs a
“most likely” label (y with the largest py ) together with confidence (one minus
the second largest py ) and credibility (the largest py ). Alternatively, the predic-
tion algorithm can be given a threshold δ as an input and its answer will be that
the label yn should lie in the set of such y that py > δ; this scenario of set (or
region) prediction was used in [5,2] and will be used in this paper. The validity
property says that the set prediction will be wrong with probability at most δ.
Therefore, we can guarantee some maximal probability of error; the downside is
that the set prediction can consist of more than one element.
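The validity property can be checked by a small simulation; here we model the true label's p-value as uniform on [0, 1], which is the exact behaviour of the randomised TCM (for the deterministic version the error probability is only bounded above by δ):

```python
import random

random.seed(0)
delta = 0.05
trials = 100_000

# For the randomised TCM the p-value of the true label is uniform on
# [0, 1]; a set-prediction error occurs exactly when that p-value is
# <= delta, so the long-run error rate approaches delta.
errors = sum(random.random() <= delta for _ in range(trials))
print(errors / trials)  # close to 0.05
```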
R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 259–267, 2003.
c Springer-Verlag Berlin Heidelberg 2003
260 I. Nouretdinov and V. Vovk
f(z1, . . . , zn) = (α1, . . . , αn),
where
Criterion of Calibration for Transductive Confidence Machine 261
3 Restricted TCM
In practice we are likely to have the true labels yn only for a subset of steps n;
moreover, even for this subset yn may be given with a delay. In this paper we
consider the following scheme. We are given a function L : N → ℕ defined on
an infinite set N ⊆ ℕ and required to satisfy
L(n) ≤ n
and random numbers θ1 , θ2 , . . . (independent from each other and anything else);
the error sequence of the ghost rTCM is denoted e1 , e2 , . . . (remember that an
error is encoded as 1 and the absence of error as 0). The ghost rTCM is given
all labels and each label is given without delay. Notice that the input sequence
zL(n1 ) , zL(n2 ) , . . . to the ghost rTCM is also distributed according to P ∞ .
Set, for each n = 1, 2, . . . ,
dn = P{en = 1 | z1 , . . . , zn−1 }
(it is clear that, for each k, dn will be the same for all n = nk−1 + 1, . . . , nk ) and
dk = P{ek = 1 | z1, . . . , zk−1}.
Proof. For any ε > 0, there exists K such that (nk − nk−1)/nk−1 < ε for any k ≥ K.
Therefore,
$$\frac{n_1^2 + (n_2 - n_1)^2 + \cdots + (n_k - n_{k-1})^2}{n_k^2}
\le \frac{n_K^2}{n_k^2} + \frac{(n_{K+1} - n_K)^2 + \cdots + (n_k - n_{k-1})^2}{n_k^2}$$
$$\le \frac{n_K^2}{n_k^2} + \frac{n_{K+1} - n_K}{n_K}\,\frac{n_{K+1} - n_K}{n_k}
+ \frac{n_{K+2} - n_{K+1}}{n_{K+1}}\,\frac{n_{K+2} - n_{K+1}}{n_k} + \cdots
+ \frac{n_k - n_{k-1}}{n_{k-1}}\,\frac{n_k - n_{k-1}}{n_k}$$
$$\le \frac{n_K^2}{n_k^2} + \varepsilon\,\frac{(n_{K+1} - n_K) + \cdots + (n_k - n_{k-1})}{n_k} \le 2\varepsilon$$
from some k on.
Now it is easy to finish the proof of the first part of the theorem. In combi-
nation with Chebyshev’s inequality and Lemma 2, Corollary 1 implies that
$$\frac{(e_1 - \delta) n_1 + (e_2 - \delta)(n_2 - n_1) + \cdots + (e_k - \delta)(n_k - n_{k-1})}{n_k} \to 0$$
in probability; using the notation k(n) := min{k : nk ≥ n}, we can rewrite this
as
$$\frac{1}{n_k} \sum_{n=1}^{n_k} \bigl(e_{k(n)} - \delta\bigr) \to 0. \qquad (2)$$
(α1 , . . . , αk ) = (1 − ζ1 , . . . , 1 − ζk )
if ζ1 + · · · + ζk is odd.
It follows from the central limit theorem that
$$\frac{\#\{i = 1, \ldots, k : z_i = 1\}}{k} \in (0.4,\, 0.6) \qquad (6)$$
with probability more than 99% for k large enough. Let δ = 5%. Consider some
k ∈ {1, 2, . . . }; we will show that dk deviates significantly from δ with probability
more than 99% for sufficiently large k; namely, that dk is significantly greater
than δ if z1 + · · · + zk−1 is odd (intuitively, in this case both potential labels are
strange) and dk is significantly less than δ if z1 + · · · + zk−1 is even (intuitively,
both potential labels are typical). Formally:
– If z1 + · · · + zk−1 is odd, then
– If z1 + · · · + zk−1 is even, then
and
$$\sum_{k=0}^{K-1} d_{n_k} (n_{k+1} - n_k) = n_K (\delta + o(1));$$
i.e.,
$$(d_{n_K} - \delta)(n_{K+1} - n_K) = o(n_{K+1}).$$
In combination with (7) and (1), this implies nK+1 − nK = o(nK+1 ), i.e.,
nK+1 /nK → 1 as K → ∞.
References
1. Ilia Nouretdinov, Thomas Melluish, and Vladimir Vovk. Ridge Regression Confi-
dence Machine. In Proceedings of the 18th International Conference on Machine
Learning, 2001.
2. Daniil Ryabko, Vladimir Vovk, and Alex Gammerman. Online region prediction
with real teachers. Submitted for publication.
3. Craig Saunders, Alex Gammerman, and Vladimir Vovk. Transduction with confi-
dence and credibility. In Proceedings of the 16th International Joint Conference on
Artificial Intelligence, pp. 722–726, 1999.
4. Vladimir Vovk, Alex Gammerman, Craig Saunders. Machine-learning applications
of algorithmic randomness. Proceedings of the 16th International Conference on
Machine Learning, San Francisco, CA: Morgan Kaufmann, pp. 444–453, 1999.
5. Vladimir Vovk. On-line Confidence Machines are well-calibrated. Proceedings of the
43rd Annual Symposium on Foundations of Computer Science, IEEE Computer
Society, 2002.
Well-Calibrated Predictions from Online
Compression Models
Vladimir Vovk
1 Introduction
Transductive Confidence Machine (TCM) was introduced in [1,2] as a practi-
cally meaningful way of providing information about reliability of the predic-
tions made. In [3] it was shown that TCM’s confidence information is valid in
a strong non-asymptotic sense under the standard assumption that the exam-
ples are exchangeable. In §2 we define a general class of models, called “on-line
compression models”, which include not only the exchangeability model but also
the Gaussian model, the Markov model, and many other interesting models. An
on-line compression model (OCM) is an automaton (usually infinite) for sum-
marizing statistical information efficiently. It is usually impossible to restore the
statistical information from OCM’s summary (so OCM performs lossy compres-
sion), but it can be argued that the only information lost is noise, since one of
our requirements is that the summary should be a “sufficient statistic”. In §3
we construct “confidence transducers” and state the main result of the paper
(proved in Appendix A) showing that the confidence information provided by
confidence transducers is valid in a strong sense. In the last three sections, §4–6,
we consider three interesting examples of on-line compression models: exchange-
ability, Gaussian and Markov models. The idea of compression modelling was
the main element of Kolmogorov’s programme for applications of probability [4],
which is discussed in Appendix B.
R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 268–282, 2003.
c Springer-Verlag Berlin Heidelberg 2003
Well-Calibrated Predictions from Online Compression Models 269
1. Σ is a measurable space called the summary space; its elements are called
summaries; □ ∈ Σ is a summary called the empty summary;
2. Z is a measurable space from which the examples zi are drawn;
3. Fn , n = 1, 2, . . . , are functions of the type Σ × Z → Σ called forward
functions;
4. Bn , n = 1, 2, . . . , are kernels of the type Σ → Σ × Z called backward kernels;
in other words, each Bn is a function Bn (A | σ) which depends on σ ∈ Σ and
a measurable set A ⊆ Σ × Z such that
– for each σ, Bn (A | σ) as a function of A is a probability distribution in
Σ × Z;
– for each A, Bn (A | σ) is a measurable function of σ;
it is required that Bn be a reverse to Fn in the sense that
$$B_n\bigl(F_n^{-1}(\sigma) \mid \sigma\bigr) = 1$$
for each σ ∈ Fn (Σ × Z). We will sometimes write Bn (σ) for the probability
distribution A → Bn (A | σ).
Next we explain briefly the intuitions behind this formal definition and introduce
some further notation.
An OCM is a way of summarizing statistical information. At the beginning
we do not have any information, which is represented by the empty summary
σ0 := □. When the first example z1 arrives, we update our summary to σ1 :=
F1 (σ0 , z1 ), etc.; when example zn arrives, we update the summary to σn :=
Fn (σn−1 , zn ). This process is represented in Figure 1. Let tn be the nth statistic
in the OCM, which maps the sequence of the first n examples z1 , . . . , zn to σn :
t1 (z1 ) := F1 (σ0 , z1 );
tn (z1 , . . . , zn ) := Fn (tn−1 (z1 , . . . , zn−1 ), zn ), n = 2, 3, . . . .
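The online update σn := Fn(σn−1, zn) can be sketched with one concrete choice of forward function — the bag-of-examples summary used by the exchangeability model of §4 (a toy illustration, with `Counter` playing the role of a bag):

```python
from collections import Counter

def F(summary, z):
    """Forward function: fold the new example into the summary (here the
    summary is the bag of examples seen so far)."""
    s = summary.copy()
    s[z] += 1
    return s

sigma = Counter()            # the empty summary
for z in [1, 0, 1, 1]:       # examples z1, ..., z4 arriving online
    sigma = F(sigma, z)
print(sigma)                 # Counter({1: 3, 0: 1})
```

The final `sigma` is exactly tn(z1, . . . , zn): the order of arrival is forgotten, which is the "lossy compression" the text describes.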
[Fig. 1: the examples z1, z2, . . . , zn are fed into the chain of forward updates □ → σ1 → σ2 → · · · → σn−1 → σn]
$$P_n(A_1 \times \cdots \times A_n \mid \sigma_n) := \int \cdots \int B_1(A_1 \mid \sigma_1)\, B_2(d\sigma_1, A_2 \mid \sigma_2) \cdots B_n(d\sigma_{n-1}, A_n \mid \sigma_n)$$
[Diagram: the backward kernels generate z1, z2, . . . , zn from the chain □ ← σ1 ← σ2 ← · · · ← σn−1 ← σn]
and
σn := tn (z1 , . . . , zn ), σn−1 := tn−1 (z1 , . . . , zn−1 ) .
The randomised version is obtained by replacing (2) with
$$\lim_{n \to \infty} \frac{E_n}{n} = \delta.$$
272 V. Vovk
$$\lim_{n \to \infty} \frac{E_n}{n} \le \delta.$$
4 Exchangeability Model
In this section we discuss the only special case of OCM studied from the point
of view of prediction with confidence so far: the exchangeability model. In the next
two sections we will consider two other models, Gaussian and Markov; many
more models are considered in [6], Chapter 4. For defining specific OCM, we will
specify their statistics tn and conditional distributions Pn ; these will uniquely
identify Fn and Bn .
The exchangeability model has statistics

tn(z1, . . . , zn) := ⟨z1, . . . , zn⟩;

given the value of the statistic, all orderings have the same probability 1/n!. Formally, the set of bags ⟨z1, . . . , zn⟩ of size n is defined as Zn equipped with the σ-algebra of symmetric (i.e., invariant under permutations of components) events; the distribution on the orderings is given by (zπ(1), . . . , zπ(n)), where (z1, . . . , zn) is a fixed ordering and π is a random permutation (each permutation is chosen with probability 1/n!).
The main results of [3] and [7] are special cases of Theorem 1.
5 Gaussian Model
$$t_n(z_1, \ldots, z_n) := (\bar z_n, R_n), \qquad \bar z_n := \frac{1}{n} \sum_{i=1}^{n} z_i, \quad R_n := (z_1 - \bar z_n)^2 + \cdots + (z_n - \bar z_n)^2,$$
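Treating Rn as the sum of squared deviations shown above, the statistic (z̄n, Rn) can be maintained online with Welford's recurrence (a sketch; the paper itself does not prescribe an update algorithm):

```python
def update(n, mean, R, z):
    """One step of Welford's recurrence for the statistic (mean, R),
    where R is the running sum of squared deviations from the mean."""
    n += 1
    delta = z - mean
    mean += delta / n
    R += delta * (z - mean)
    return n, mean, R

n, mean, R = 0, 0.0, 0.0
for z in [2.0, 4.0, 6.0]:
    n, mean, R = update(n, mean, R, z)
print(mean, R)  # 4.0 8.0
```

This makes the Gaussian summary an online compression in the literal sense: two numbers suffice, however many examples have been seen.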
Let us give an explicit expression for the predictive region for the Gaussian
model and an individual strangeness measure
(it is easy to see that this individual strangeness measure is equivalent, in the
sense of leading to the same p-values, to |zn − z n |, as well as to several other
natural expressions, including (5)). Under Pn (dz1 , . . . , dzn | σ), the expression
$$\frac{(n - 1)(n - 2)}{n} \cdot \frac{z_n - \bar z_{n-1}}{R_{n-1}} \qquad (5)$$
Therefore, we obtained the usual predictive regions based on the t-test (as in [9]
or, in more detail, [10]); now, however, we can see that the errors of this standard
procedure (applied in the on-line fashion) are independent.
6 Markov Model
The Gaussian OCM, considered in the previous section, is narrower than the
exchangeability OCM. The OCM considered in this section is interesting in that
it goes beyond exchangeability.
In this section we always assume that the example space Z is finite. The
following notation for digraphs will be used: in(v)/out(v) stand for the number
of arcs entering/leaving vertex v; nu,v is the number of arcs leading from vertex
u to vertex v.
The Markov summary of a data sequence z1 . . . zn is the following digraph
with two vertices marked:
– the set of vertices is Z (the state space of the Markov chain);
– the vertex z1 is marked as the source and the vertex zn is marked as the
sink (these two vertices are not necessarily distinct);
– the arcs of the digraph are the transitions zi zi+1 , i = 1, . . . , n − 1; the arc
zi zi+1 has zi as its tail and zi+1 as its head.
It is clear that in any such digraph all vertices v satisfy in(v) = out(v) with the
possible exception of the source and sink (unless they coincide), for which we
then have out(source) = in(source) + 1 and in(sink) = out(sink) + 1. We will call
a digraph with this property a Markov graph if the arcs with the same tail and
head are indistinguishable (for example, we do not distinguish two Eulerian paths
that only differ in the order in which two such arcs are passed); its underlying
digraph will have the same structure, but all its arcs will be considered to have
their own identity.
More formally, the Markov model (Σ, □, Z, F, B) is defined as follows:
– Z is a finite set; its elements (examples) are also called states; one of the
states is designated as the initial state;
– Σ is the set of all Markov graphs with the vertex set Z;
– □ is the Markov graph with no arcs and with both source and sink at the
designated initial state;
– Fn (σ, z) is the Markov graph obtained from σ by adding an arc from σ’s sink
to z and making z the new sink;
– let σ ↓ z, where σ is a Markov graph and z is one of σ’s vertices, be the
Markov graph obtained from σ by removing an arc from z to σ’s sink (σ ↓ z
does not exist if there is no arc from z to σ’s sink) and moving the sink to z,
and let N (σ) be the number of Eulerian paths from the source to the sink in
the Markov graph σ; Bn (σ) is (σ ↓ z, sink) with probability N (σ ↓ z)/N (σ),
where sink is σ’s sink and z ranges over the states for which σ ↓ z is defined.
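The forward function Fn of this model is easy to sketch; the class below is a hypothetical representation (not from the paper) storing the arc multiplicities n_{u,v} together with the marked source and sink:

```python
class MarkovGraph:
    """Hypothetical container for the Markov summary: arc multiplicities
    n_{u,v} plus the marked source and sink vertices."""

    def __init__(self, initial_state):
        self.arcs = {}               # (tail, head) -> multiplicity
        self.source = initial_state  # the empty summary: no arcs,
        self.sink = initial_state    # source and sink at the initial state

    def forward(self, z):
        """F_n(sigma, z): add an arc from the current sink to z and make
        z the new sink."""
        key = (self.sink, z)
        self.arcs[key] = self.arcs.get(key, 0) + 1
        self.sink = z

g = MarkovGraph(0)
for z in [0, 1, 1, 0]:
    g.forward(z)
print(g.arcs)  # {(0, 0): 1, (0, 1): 1, (1, 1): 1, (1, 0): 1}
```

After processing the data the arc multiplicities are exactly the transition counts n_{u,v}; the order in which equal transitions occurred is forgotten.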
(we need the minus sign because lower probability makes an example stranger).
To give a computationally efficient representation of the confidence transducer
corresponding to this individual strangeness measure, we need the following two
graph-theoretic results, versions of the BEST theorem and the Matrix-Tree the-
orem, respectively.
Lemma 1. In any Markov graph σ = (V, E) the number of Eulerian paths from
the source to the sink equals

$$T(\sigma)\, \frac{\mathrm{out}(\mathrm{sink}) \prod_{v \in V} (\mathrm{out}(v) - 1)!}{\prod_{u,v \in V} n_{u,v}!},$$

where T(σ) is the number of spanning out-trees in the underlying digraph centred
at the source.
Lemma 2. To find the number T (σ) of spanning out-trees rooted at the source
in the underlying digraph of a Markov graph σ with vertices z1 , . . . , zn (z1 being
the source),
These two lemmas immediately follow from Theorems VI.24 and VI.28 in [11].
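For very small graphs, N(σ) can also be computed directly from its definition by backtracking over the remaining arc multiplicities; this is a brute-force cross-check on the closed-form count of Lemma 1, not the efficient method the lemmas provide:

```python
def count_eulerian_paths(arcs, source, sink):
    """N(sigma): number of Eulerian paths from source to sink in a Markov
    graph given as {(tail, head): multiplicity}; parallel arcs with the
    same tail and head are indistinguishable, so each multiset of arcs
    is traversed once."""
    items = sorted(arcs.items())
    keys = [k for k, _ in items]

    def rec(v, counts):
        if not any(counts):
            return 1 if v == sink else 0
        total = 0
        for i, (u, w) in enumerate(keys):
            if u == v and counts[i] > 0:
                remaining = counts[:i] + (counts[i] - 1,) + counts[i + 1:]
                total += rec(w, remaining)
        return total

    return rec(source, tuple(c for _, c in items))

# Two Eulerian paths from source 0 to sink 0: 0->0->1->0 and 0->1->0->0.
print(count_eulerian_paths({(0, 0): 1, (0, 1): 1, (1, 0): 1}, 0, 0))  # 2
```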
Fig. 3. TCM predicting the binary Markov chain with transition probabilities P(1 | 0) =
P(0 | 1) = 1% at significance level 2%; the cumulative numbers of errors (predictive
regions not covering the true label), uncertain (i.e., containing more than one label)
and empty predictive regions are shown
It is now easy to obtain an explicit formula for prediction in the binary case
Z = {0, 1}. First we notice that
$$B_n(\{(\sigma \downarrow z, \mathrm{sink})\} \mid \sigma) = \frac{N(\sigma \downarrow z)}{N(\sigma)} = \frac{T(\sigma \downarrow z)\, n_{z,\mathrm{sink}}}{T(\sigma)\, \mathrm{out}(\mathrm{sink})}$$
(all nu,v refer to the numbers of arcs in σ and sink is σ’s sink; we set N (σ ↓ z) =
T (σ ↓ z) := 0 when σ ↓ z does not exist). The following simple corollary from
the last formula is sufficient for computing the probabilities Bn in the binary
case:
$$B_n(\{(\sigma \downarrow \mathrm{sink}, \mathrm{sink})\} \mid \sigma) = \frac{n_{\mathrm{sink},\mathrm{sink}}}{\mathrm{out}(\mathrm{sink})}.$$
This gives us the following formulas for the TCM in the binary Markov model
(remember that the individual strangeness measure is (6)). Suppose the current
summary is given by a Markov graph with ni,j arcs going from vertex i to vertex
j (i, j ∈ {0, 1}) and let f : [0, 1] → [0, 1] be the function that squashes [0.5, 1] to
1:

$$f(p) := \begin{cases} p & \text{if } p < 0.5, \\ 1 & \text{otherwise.} \end{cases}$$
$$f\left(\frac{n_{0,0} + 1}{n_{0,0} + n_{0,1} + 1}\right), \qquad f\left(\frac{n_{1,0}}{n_{1,0} + n_{1,1}}\right), \qquad (7)$$

$$f\left(\frac{n_{1,1} + 1}{n_{1,1} + n_{1,0} + 1}\right), \qquad f\left(\frac{n_{0,1}}{n_{0,1} + n_{0,0}}\right).$$
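The function f and the four quantities of display (7) can be evaluated as follows (a sketch with made-up arc counts; how the four values pair up with the candidate labels depends on the current summary and is spelled out in the surrounding text, so no pairing is assumed here):

```python
def f(p):
    """The squashing function defined above: identity below 0.5, else 1."""
    return p if p < 0.5 else 1.0

# Hypothetical arc counts n_{i,j} of the current binary Markov graph.
n = {(0, 0): 3, (0, 1): 2, (1, 0): 2, (1, 1): 1}

# The four expressions of display (7) and its companion.
values = [
    f((n[(0, 0)] + 1) / (n[(0, 0)] + n[(0, 1)] + 1)),
    f(n[(1, 0)] / (n[(1, 0)] + n[(1, 1)])),
    f((n[(1, 1)] + 1) / (n[(1, 1)] + n[(1, 0)] + 1)),
    f(n[(0, 1)] / (n[(0, 1)] + n[(0, 0)])),
]
print(values)  # [1.0, 1.0, 1.0, 0.4]
```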
Figure 3 shows the result of a computer simulation; as expected, the error line
is close to a straight line with slope close to the significance level.
References
1. Saunders, C., Gammerman, A., Vovk, V.: Transduction with confidence and credi-
bility. In: Proceedings of the Sixteenth International Joint Conference on Artificial
Intelligence. (1999) 722–726
2. Vovk, V., Gammerman, A., Saunders, C.: Machine-learning applications of algo-
rithmic randomness. In: Proceedings of the Sixteenth International Conference on
Machine Learning, San Francisco, CA, Morgan Kaufmann (1999) 444–453
3. Vovk, V.: On-line Confidence Machines are well-calibrated. In: Proceedings of
the Forty Third Annual Symposium on Foundations of Computer Science, IEEE
Computer Society (2002) 187–196
4. Kolmogorov, A.N.: Combinatorial foundations of information theory and the cal-
culus of probabilities. Russian Mathematical Surveys 38 (1983) 29–40
5. Cox, D.R., Hinkley, D.V.: Theoretical Statistics. Chapman and Hall, London
(1974)
6. Bernardo, J.M., Smith, A.F.M.: Bayesian Theory. Wiley, Chichester (2000)
We will use the notation p− := p− (σ̃, z̃) and p+ := p+ (σ̃, z̃) where (σ̃, z̃) is
any borderline example. Notice that the Bn -measure of strange examples is p− ,
the Bn -measure of ordinary examples is 1−p+ , and the Bn -measure of borderline
examples is p+ − p− .
By the definition of rCT, pn ≤ δ if the pair (σn−1 , zn ) is strange, pn > δ if
the pair is ordinary, and pn ≤ δ with probability
$$\frac{\delta - p_-}{p_+ - p_-} \qquad (9)$$
if the pair is borderline; indeed, in this case
pn = p− + θn (p+ − p− ) ,
and so pn ≤ δ is equivalent to
$$\theta_n \le \frac{\delta - p_-}{p_+ - p_-}.$$
Therefore, the overall probability that pn ≤ δ is
$$p_- + \frac{\delta - p_-}{p_+ - p_-}\,(p_+ - p_-) = \delta.$$
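The computation above can be confirmed by simulation: drawing the kind of pair (strange, borderline, ordinary) with the stated Bn-measures and randomising the borderline case with θ makes pn ≤ δ occur with probability exactly δ. The numerical values of p− and p+ below are assumptions for illustration:

```python
import random

random.seed(1)
p_minus, p_plus, delta = 0.02, 0.10, 0.05  # assumed values, p- <= delta <= p+
trials = 200_000

hits = 0
for _ in range(trials):
    u = random.random()
    if u < p_minus:                    # strange pair: p_n <= delta always
        hits += 1
    elif u < p_plus:                   # borderline: p_n = p- + theta (p+ - p-)
        theta = random.random()
        if p_minus + theta * (p_plus - p_minus) <= delta:
            hits += 1
    # ordinary pair: p_n > delta, never counted

print(hits / trials)  # close to delta = 0.05
```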
The other basic result that we will need is the following lemma.
Lemma 4. For any trial n = 1, 2, . . . , pn is Gn−1 -measurable.
Proof. Fix a trial n and δ ∈ [0, 1]. We are required to prove that the event
{pn ≤ δ} is Gn−1 -measurable. This follows from the definition, (3): pn is defined
in terms of σn−1 , zn and θn .
Fix temporarily a positive integer N. First we prove that, for any n = 1, . . . , N
and any δ1 , . . . , δn ∈ [0, 1],
$$P_{G_n}\{p_n \le \delta_n, \ldots, p_1 \le \delta_1\}
= \mathbb{E}_{G_n}\Bigl(\mathbb{E}_{G_{n-1}}\bigl(\mathbb{I}_{\{p_n \le \delta_n\}}\, \mathbb{I}_{\{p_{n-1} \le \delta_{n-1}, \ldots, p_1 \le \delta_1\}}\bigr)\Bigr)$$
$$= \mathbb{E}_{G_n}\Bigl(\mathbb{I}_{\{p_n \le \delta_n\}}\, \mathbb{E}_{G_{n-1}}\bigl(\mathbb{I}_{\{p_{n-1} \le \delta_{n-1}, \ldots, p_1 \le \delta_1\}}\bigr)\Bigr)
= \mathbb{E}_{G_n}\bigl(\mathbb{I}_{\{p_n \le \delta_n\}}\bigr)\, \delta_{n-1} \cdots \delta_1 = \delta_n \delta_{n-1} \cdots \delta_1$$
$$P\{p_N \le \delta_N, \ldots, p_1 \le \delta_1\} = \delta_N \cdots \delta_1.$$
(Kolmogorov’s only two publications on his programme are [4,13]; the work
reported in [14]–[17] was done under his supervision by his PhD students.)
After 1965 Kolmogorov and Martin-Löf worked on the information-theoretic
approach to probability applications independently of each other, but arrived at
similar concepts and definitions. In 1973 [18] Martin-Löf introduced the notion
of repetitive structure, later studied by Lauritzen [19]. Martin-Löf’s theory of
repetitive structures has features C and U of Kolmogorov’s programme but not
features A and D. An extra feature of repetitive structures is their on-line
character: the conditional probability distributions are required to be consistent and
the sufficient statistic can usually be updated recursively as new data arrives.
The absence of algorithmic complexity and randomness from Martin-Löf’s
theory does not look surprising; e.g., it is argued in [20] that these algorithmic
notions are powerful sources of intuition, but for stating mathematical results in
their strongest and most elegant form it is often necessary to “translate” them
into a non-algorithmic form.
A more serious deviation from Kolmogorov’s ideas seems to be the absence
of “direct inferences”. The goal in the theory of repetitive structures is to derive
standard statistical models from repetitive structures (in the asymptotic on-
line setting the difference between Kolmogorov-type and standard models often
disappears); to apply repetitive structures to reality one still needs to go through
statistical models. In our approach (see Theorem 1 above or the optimality
results in [21,22]) statistical models become irrelevant.
Freedman and Diaconis independently came up with ideas similar to Kol-
mogorov’s (Freedman’s first paper in this direction was published in 1962); they
were inspired by de Finetti’s theorem and the Krylov-Bogolyubov approach to
ergodic theory.
Kolmogorov only considered the three models we discuss in §4–6, but many
other models have been considered by later authors (see, e.g., [6]).
The difference between standard statistical modelling and Kolmogorov’s
modelling discussed in [17] is not important for the purpose of one-step-ahead
forecasting in the exchangeable case (in particular, for both exchangeability and
Gaussian models of this paper; see [23]); it becomes important, however, in the
Markov case. The theory of prediction with confidence has a dual goal: valid-
ity (there should not be too many errors) and quality (there should not be too
many uncertain predictions). In the asymmetric Markov case, although we have
the validity result (Theorem 1), there is little hope of obtaining an optimality
result analogous to those of [21,22]. A manifestation of the difference between
the two approaches to modelling is, e.g., the fact that (7) involves the ratio
n1,0 /(n1,0 + n1,1 ) rather than something like n0,1 /(n0,0 + n0,1 ).
1 Introduction
In the last several decades new powerful machine-learning algorithms have ap-
peared. A serious shortcoming of most of these algorithms, however, is that
they do not directly provide any measures of confidence in the predictions they
output. Two of the most important traditional ways to obtain such confidence
information are provided by PAC theory (a typical result that can be used is Lit-
tlestone and Warmuth’s theorem; see, e.g., [3]) and Bayesian theory. The former
is discussed in detail in [9] and the latter is discussed in [8], but disadvantages of
the traditional approaches can be summarized as follows: PAC bounds are valid
under the general iid assumption but are too weak for typical problems encoun-
tered in practice to give meaningful results; Bayesian bounds give practically
meaningful results, but are only valid under strong extra assumptions.
Vovk [4,16,14,11,12,17] proposed a practical (as confirmed by numerous em-
pirical studies reported in those papers) method of computing confidence infor-
mation valid under the general iid assumption. Vovk’s Transductive Confidence
Machine (TCM) is based on a specific formula
$$p = \frac{|\{i : \alpha_i \ge \alpha_{l+1}\}|}{l + 1},$$
where αi are numbers representing some measures of strangeness (cf. (1) in
Section 2). A natural question is whether there are better ways to produce
valid confidence information. In this paper (Sections 3 and 6) we show that the
first-order answer is “no”: no way of producing valid confidence information is
drastically better than TCM. We present our results in terms of Kolmogorov’s
theory of algorithmic complexity and randomness.
R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 283–297, 2003.
c Springer-Verlag Berlin Heidelberg 2003
284 I. Nouretdinov, V. V’yugin, and A. Gammerman
Suppose we have two sets: the training set (x1 , y1 ), . . . , (xl , yl ) and the test set
(xl+1 , yl+1 ) containing only one example. The unlabelled examples xi are drawn
from a set X and the labels yi are drawn from a finite set Y; we assume that |Y|
is small (i.e., we consider the problem of classification with a small number of
classes)¹. The examples (xi, yi) are assumed to be generated by some probability
distribution P (same for all examples) independently of each other; we call this
the iid assumption.
Set Z := X × Y. For any l a sequence z^l = z1, . . . , zl defines a multiset B
of all elements of this sequence, where each element z ∈ B is supplied with its
arity n(z) = |{j : zj = z}|. We call a multiset B of this type a bag. Its size |B| is
defined as the sum of the arities of all its elements. The bag defined by a sequence
z^l is also called the configuration of this sequence; to be precise, define the standard
representation of this bag as the set con(z^l) = {(z1, n(z1)), . . . , (zl, n(zl))}.
In this paper we discuss four natural ways of predicting with confidence,
which we call Randomness Predictors, Exchangeability Predictors, Invariant Ex-
changeability Predictors, and Transductive Confidence Machines. We start with
the latter (following the papers mentioned above).
An individual strangeness measure is a family of functions An , n = 1, 2, . . .,
such that each An maps every pair (B, z), where B is a bag of n − 1 elements of
Z and z is an element of Z, to a real (typically non-negative) number An (B, z).
(Intuitively, An (B, z) measures how different z is from the elements of B). The
Transductive Confidence Machine associated with An works as follows: when
given the data
(x1 , y1 ), . . . , (xl , yl ), xl+1
(the training set and the known component xl+1 of the test example), every
potential classification y of xl+1 is assigned the p-value
$$p(y) := \frac{|\{i : \alpha_i \ge \alpha_{l+1}\}|}{l + 1}, \qquad (1)$$
where
αi := Al+1 (con(z1 , . . . , zi−1 , zi+1 , . . . , zl+1 ), zi ),
zj = (xj , yj ) (except zl+1 = (xl+1 , y)), and con(z1 , . . . , zi−1 , zi+1 , . . . , zl+1 ) is a
bag. TCM’s output p : Y → [0, 1] can be further packaged in two different ways:
– we can output arg maxy p(y) as the prediction and say that 1 − p2 , where p2
is the second largest p-value, is the confidence and that the largest p-value
p1 is the credibility;
– or we can fix some conventional threshold δ (such as 1% or 5%) and output
as our prediction (predictive region) the set of all y such that p(y) > δ.
¹ By |Y| we mean the cardinality of the set Y.
Transductive Confidence Machine Is Universal 285
The essence of TCM is formula (1). The following simple example illustrates a
definition of an individual strangeness measure in the spirit of the 1-Nearest
Neighbour Algorithm (we assume that objects are vectors in a Euclidean space).
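A minimal sketch of such a TCM on one-dimensional data, taking as strangeness αi the ratio "distance to the nearest same-label example over distance to the nearest different-label example" (one common nearest-neighbour choice; the paper's own example formula is not reproduced here):

```python
def tcm_pvalues(train, x_new, labels):
    """TCM p-values of formula (1) with a 1-NN-style strangeness on
    1-D objects: alpha_i = (distance to nearest same-label example) /
    (distance to nearest different-label example)."""
    def alpha(bag, z):
        same = [abs(z[0] - w[0]) for w in bag if w is not z and w[1] == z[1]]
        diff = [abs(z[0] - w[0]) for w in bag if w[1] != z[1]]
        num = min(same) if same else float('inf')
        den = min(diff) if diff else float('inf')
        return num / den

    pvals = {}
    for y in labels:
        seq = train + [(x_new, y)]          # complete the test example with y
        alphas = [alpha(seq, z) for z in seq]
        pvals[y] = sum(a >= alphas[-1] for a in alphas) / len(seq)
    return pvals

train = [(1.0, 0), (1.2, 0), (5.0, 1), (5.3, 1)]
print(tcm_pvalues(train, 1.1, [0, 1]))  # {0: 0.8, 1: 0.2}
```

The object 1.1 sits inside the label-0 cluster, so label 0 gets the large p-value (high credibility) and label 1 the small one.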
3 Specific Randomness
Next we define Randomness Predictors (RP). First, we consider a typical
example from statistics. Let Zn be a sample space and Qn a sequence of
probability distributions in Zn, where n = 1, 2, . . .. Let fn(ω) be a statistic, i.e.,
a sequence of real-valued functions from Zn to the set of real numbers. The
function

tn(ω) = Qn{α : fn(α) ≥ fn(ω)}

is called a p-value and satisfies

Qn{ω : tn(ω) ≤ γ} ≤ γ    (2)

for any real number γ. Outcomes ω with a small p-value have small
probability; they should be considered almost impossible from the
standpoint of the holder of the measure Qn.
The notion of p-value can easily be extended to the case where for any n we
consider a class of probability distributions Qn in Zn:

for all γ. We fix the properties (2) and (4) as basic for the following definitions.
Suppose that for any n a probability distribution Qn in Zn is given. We say that a sequence
of functions tn(ω) : Zn → [0, 1] is a Qn-randomness test (p-test) if it satisfies
inequality (2) for any γ. Analogously, suppose that for any n a class Qn of probability
distributions in Zn is given. We say that a sequence of functions tn(ω) is a
Qn-randomness test if inequality (4) holds for any γ. We call inequality (2)
or (4) the validity property of a test.
We will consider two important statistical models on the sequence of sample
spaces Zn . The iid model Qiid n , n = 1, 2, . . ., is defined for any n by the class
4 Algorithmic Randomness
We refer readers to [7] for details of the theory of Kolmogorov complexity and
algorithmic randomness. We will also consider a logarithmic scale for tests (log-tests of randomness)².
The proof of this proposition uses the well-known idea of universality from
Kolmogorov's theory of algorithmic randomness.
We fix some optimal uniform randomness log-test dn(ω|P, q). The value
dn(ω|P, q) is called the randomness level of the sequence ω with respect to P.
The parameter q will be used only in Section 4.2, for technical reasons. In what
follows we usually fix some q ∈ S and omit this variable from the notation of the test.
Using the direct scale we consider the optimal uniform randomness test

$$d^{\mathcal{Q}}_n(z_1, \ldots, z_n) = \inf_{P \in \mathcal{Q}_n(\mathbf{Z}^n)} d(z_1, \ldots, z_n \mid P), \qquad (10)$$
if the data set z^l = (x1, y1), . . . , (xl, yl), xl+1 is random and the set Y is small.
This shows the universality of TCM: the optimal IEP (equivalently, TCM; see
Proposition 1) is about as good as the optimal RP. The precise statement involves
a multiplicative constant C; this is inevitable since randomness and exchangeability
levels are only defined to within a constant factor (in the direct scale). We
will prove this assertion in the following way. Approximate equality (11) will be
split into two:
Proposition 3. Let dpn (z n |P, q) be the optimal p-log-test and din (z n |P, q) be the
optimal i-log-test. Then
Proof. Let dpn (z n |P, q) be the optimal p-log-test and din (z n |P, q) be the optimal
optimal i-log-test. Then the proposition asserts⁷
The first inequality ≤ is obvious. To prove the second one note that the lower
semicomputable function
$$\psi(z^n \mid P, m) = \begin{cases} m - 1 & \text{if } d^p(z^n \mid P) \ge m, \\ -1 & \text{otherwise} \end{cases}$$
is an i-log-test. Indeed,

$$\int 2^{\psi(z^n \mid P, m)}\, dP(z^n) = 2^{m-1} \int_{z^n :\, d^p(z^n \mid P) \ge m} dP(z^n) + 2^{-1} \int dP(z^n) \le 1.$$
The relation between conditional and unconditional i-tests is presented by
the following proposition.
Proposition 4. Let k ∈ N. Then
Proof. The first inequality is obvious. To prove the second inequality let us note
that the function
$$\psi(z^n \mid P) = \log \sum_{k=2}^{\infty} k^{-2}\, 2^{d^i(z^n \mid P, k)}$$
is an i-log-test. Indeed, it is lower semicomputable and

$$\int 2^{\psi(z^n \mid P)}\, dP(z^n) \le \sum_{k=2}^{\infty} k^{-2} \int 2^{d^i(z^n \mid P, k)}\, dP(z^n) \le 1.$$
The main result of the theory is that an optimal F exists such that KF(x|q) ≤
KF′(x|q) + O(1) holds for any method F′ of decoding. For a detailed definition
and the main properties of Kolmogorov (conditional) complexity K(x|q) we refer
the reader to the book [7].
In the following we consider the prefix modification of Kolmogorov complexity [7]. This means that only prefix methods of decoding are considered: if F(p, q)
and F(p′, q) are defined then the strings p and p′ are incomparable.
Kolmogorov defined in [6] the notion of deficiency of randomness of an ele-
ment x of a finite set D
It is easy to verify that K(x|D) ≤ log |D| + O(1) and that the number of x ∈ D
such that d(x|D) > m does not exceed 2⁻ᵐ|D|. Earlier, in [5], he also defined an
m-Bernoulli sequence as a sequence x satisfying

$$K(x \mid n, k) \ge \log \binom{n}{k} - m,$$
where n is the length of x and k is the number of ones in it.
For any finite sequence xn = x1 , . . . , xn ∈ Zn consider a permutation set
i.e. the set of all sequences with the same configuration as xn (set of all permu-
tations of xn). For any permutation set Ξ we consider the measure

$$Q_{\Xi}(z^n) = \begin{cases} 1/|\Xi| & \text{if } z^n \in \Xi, \\ 0 & \text{otherwise} \end{cases}$$
concentrated in the set Ξ of all sequences with the same configuration. An
optimal uniform log-test d(xn |QΞ(xn ) , q) for the class {QΞ : ∃z n ∈ Zn (Ξ =
Ξ(z n ))} can be defined in the spirit of Proposition 2.
The next proposition shows that the deficiency of exchangeability can be
characterized in a fashion free of probability concepts.
Proposition 5.⁸ It holds

$$d^{i,exch}(z^n \mid q) = \log |\Xi(z^n)| - K(z^n \mid \Xi(z^n), q) + O(1). \qquad (17)$$
Proof. We prove (17) and that it is also equal to di (z n |QΞ(zn ) , q) + O(1). Let us
prove that the function
dˆi (z n |q) = log |Ξ(z n )| − K(z n |Ξ(z n ), q)
is a uniform i-log-test of exchangeability. Indeed, let P̂(Ξ) = Σ_{z^n ∈ Ξ} P(z^n).
Then for any exchangeable measure P ∈ P(Zn )
$$\int 2^{\hat d^i(z^n \mid q)}\, dP(z^n) = \sum_{z^n \in \mathbf{Z}^n} 2^{-K(z^n \mid \Xi(z^n), q)}\, \hat P(\Xi(z^n)) = \sum_{\Xi} \hat P(\Xi) \sum_{z^n \in \Xi} 2^{-K(z^n \mid \Xi, q)} \le 1$$
Then dˆi (z n |q) ≤ di (z n |P, q) + O(1) for any exchangeable measure P , and so, we
have
dˆi (z n |q) ≤ di,exch (z n |q) + O(1) ≤ di (z n |QΞ(zn ) , q) + O(1).
Let us check the converse inequality. Let Ξ = Ξ(z n ). We have
d_{i,exch}(z^n|q) = inf_{P ∈ Q_exch} d_i(z^n|q, P) ≤ log |Ξ| − K(z^n|q, Q_Ξ) = d̂_i(z^n|q) + O(1).
Here we take into account that K(z^n|q, Q_Ξ) = K(z^n|q, Ξ) + O(1), which follows
from the fact that the measure Q_Ξ and the configuration Ξ are computationally
equivalent.
Let D be a bag of elements of Z and let x ∈ D have arity k(x). Then we can
assign a probability P(x) = k(x)/|D| to each element x of the bag and a positive
integer l_x such that 2^{−l_x−1} ≤ P(x) ≤ 2^{−l_x}. It follows from the Kraft
inequality ∑_x 2^{−l_x} ≤ 1 that a corresponding decodable prefix code exists, and
so K(x|D) ≤ log(|D|/k(x)) + O(1). Let us define the randomness deficiency of
x with respect to a bag D
d(x|D) = log(|D|/k(x)) − K(x|D). (18)
We have |{x : d(x|D) ≥ m}| ≤ 2−m |D| for any m.
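The Kraft-code construction behind the bound K(x|D) ≤ log(|D|/k(x)) + O(1) is effectively computable (unlike K itself); a hedged Python sketch (the function name is ours) assigns lengths l_x = ⌈−log₂ P(x)⌉ and verifies Kraft's inequality and the length bound:

```python
from math import ceil, log2
from collections import Counter

def bag_code_lengths(bag):
    """Assign each distinct x a prefix-code length l_x = ceil(-log2 P(x)),
    P(x) = k(x)/|D|; Kraft's inequality then guarantees a decodable
    prefix code, giving K(x|D) <= log(|D|/k(x)) + O(1)."""
    D = len(bag)
    arity = Counter(bag)
    return {x: ceil(-log2(k / D)) for x, k in arity.items()}, arity, D

lengths, arity, D = bag_code_lengths(['a', 'a', 'a', 'b', 'b', 'c'])

# Kraft inequality over the distinct codewords
assert sum(2.0 ** -l for l in lengths.values()) <= 1.0
# each length is within one bit of the ideal log(|D|/k(x))
for x, l in lengths.items():
    assert l <= log2(D / arity[x]) + 1
```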
The following proposition implies that the optimal invariant exchangeability
log-test d_{i,invexc} of a training set (x_1, y_1), . . . , (x_l, y_l) and a testing example
(x_{l+1}, y) coincides with the generalized Kolmogorov deficiency of randomness of
the testing example (x_{l+1}, y) with respect to the configuration of the whole sequence.
Proposition 6. Let (u_1, . . . , u_{l+1}) ∈ Z^{l+1}. Then
d_{i,invexc}(u_1, . . . , u_{l+1}) = d(u_{l+1}|con(u_1, . . . , u_{l+1})) + O(1).
The proof of this proposition is analogous to the proof of Proposition 5.
⁸ The same relation holds for d_{p,exch}(z^n|q) if we replace the prefix variant of Kolmogorov complexity by its plain variant.
Transductive Confidence Machine Is Universal 293
Let us define
d_{i,exch}(z^l, x_{l+1}) = min_{y∈Y} d_{i,exch}(z^l, (x_{l+1}, y)). (19)
The following theorem implies that if a training set is random⁹ (with respect
to some exchangeable measure) then EP and IEP are almost the same notion.
Theorem 1. It holds
d_{i,invexc}(z^l, (x_{l+1}, y)) − O(1) ≤ d_{i,exch}(z^l, (x_{l+1}, y)) ≤ d_{i,invexc}(z^l, (x_{l+1}, y))
+ 2 log(d_{i,invexc}(z^l, (x_{l+1}, y)) + d_{i,exch}(z^l, x_{l+1})) + 2 log |Y| + O(1).
The following theorem shows that the difference between RP and IEP is not
essential in the most interesting case where a training set and an unlabelled test
example are random with respect to some i.i.d. probability distribution.
Theorem 2. It holds
dp,iid (z l , (xl+1 , y)) + O(1) ≥ dp,invexc (z l , (xl+1 , y)) ≥ dp,iid (z l , (xl+1 , y))
−4d_{p,iid}(z^l, x_{l+1}) − 2 log d_{p,iid}(z^l, x_{l+1}) − 4 log |Y| − O(1). (21)
7 Appendix
Let k be the arity of (x_{l+1}, y) in con(z^l, (x_{l+1}, y)) and k̄ be the arity of (x_{l+1}, ȳ)
in con(z^l, (x_{l+1}, ȳ)). By definition k|Ξ| = k̄|Ξ̄|. Then from (22) we obtain
By the well-known equality for the complexity of a pair [7] we have
d_{i,exch}(z^l, (x_{l+1}, y)) = d_{i,exch}(z^l, (x_{l+1}, ȳ)) + K((z^l, (x_{l+1}, ȳ))|Ξ̄) + log k̄
− log k − K(z^l|x_{l+1}, y, K((x_{l+1}, y)|Ξ), Ξ) − K((x_{l+1}, y)|Ξ) + O(1). (24)
We have
|con(z l , (xl+1 , y))| = |con(z l , (xl+1 , ȳ))| = l + 1
Let m be the ordinal number of the pair (x_{l+1}, ȳ) in the list z^l, (x_{l+1}, ȳ) sorted
in order of decreasing arity. Then m ≤ (l + 1)/k̄.
We recall an important relation between iid and exchangeability tests from [13].
Proposition 7. It holds
di,exch (z n ) + O(1) ≥ di,iid (z n ) − di,iid (Ξ(z n )) − 2 log di,iid (Ξ(z n )), (27)
where z n ∈ Zn .
Proof omitted.
¹⁰ Here we use the inequality K(x|q) ≤ K(x|f(q)) + O(1), which holds for any computable function f (see [7]).
dp,iid (Ξ(z n , (xn+1 , y))) ≤ dp,iid (z n , xn+1 ) + 2 log |Y| + O(1). (28)
Proof omitted.
Proof omitted.
Lemma 3. Let dp be the optimal uniform randomness p-log-test. Then for any
P ∈ P(Z) there exists a P1 ∈ P(Z) such that
References
1. J.M. Bernardo, A.F.M. Smith. Bayesian Theory. Wiley, Chichester, 2000.
2. [Link], [Link]. Theoretical Statistics. Chapman and Hall, London, 1974.
3. N. Cristianini, J. Shawe-Taylor. An Introduction to Support Vector Machines and
Other Kernel-based Methods. Cambridge University Press, Cambridge, 2000.
4. A. Gammerman, V. Vapnik, V. Vovk. Learning by transduction. In Proceedings of
UAI’1998, pages 148–156, San Francisco, Morgan Kaufmann.
5. A.N. Kolmogorov. Three approaches to the quantitative definition of information.
Problems Inform. Transmission, 1(1):4–7, 1965.
6. A.N. Kolmogorov. Combinatorial foundations of information theory and the calculus
of probabilities. Russian Math. Surveys, 38(4):29–40, 1983.
7. M. Li, P. Vitányi. An Introduction to Kolmogorov Complexity and Its Applications.
Springer, New York, 2nd edition, 1997.
8. T. Melluish, C. Saunders, I. Nouretdinov, V. Vovk. Comparing the Bayes and
typicalness [Link] Proceedings of ECML’2001; [Link] version published
as a CLRC technical report TR-01-05; see [Link]
9. I. Nouretdinov, V. Vovk, M. Vyugin, A. Gammerman. Pattern recognition and
density estimation under the general i.i.d. assumption. In David Helmbold and
Bob Williamson, editors, Proceedings of COLT’ 2001, pages 337–353.
10. H. Rogers. Theory of Recursive Functions and Effective Computability. McGraw-Hill,
New York, 1967.
11. C. Saunders, A. Gammerman, V. Vovk. Transduction with confidence and credi-
bility. In Proceedings of the 16th IJCAI, pages 722–726, 1999.
12. C. Saunders, A. Gammerman, V. Vovk. Computationally efficient transductive
machines. In Proceedings of ALT’00, 2000.
13. V. Vovk. On the concept of the Bernoulli property. Russian Mathematical Surveys,
41:247–248, 1986.
14. V. Vovk, A. Gammerman. Statistical applications of algorithmic randomness. In
Bulletin of the International Statistical Institute: The 52nd Session, Contributed
Papers, volume LVIII, book 3, pages 469–470, 1999.
15. V. Vovk, A. Gammerman. Algorithmic randomness for machine learning.
Manuscript, 2001.
16. V. Vovk, A. Gammerman, C. Saunders. Machine-learning applications of algorith-
mic randomness. In Proceedings of the 16th ICML, pages 444–453, 1999.
17. V. Vovk. On-Line Confidence Machines Are Well-Calibrated. In Proceedings of
FOCS’02, pages 187–196, 2002.
18. I. Nouretdinov, V. Vovk, V. V’yugin, A. Gammerman. Transductive confidence machine
is universal. CLRC technical report [Link]
On the Existence and Convergence of
Computable Universal Priors
Marcus Hutter
1 Introduction
All induction problems can be phrased as sequence prediction tasks. This is, for
instance, obvious for time series prediction, but also includes classification tasks.
Having observed data x1 ,...,xt−1 at times 1,...,t−1, the task is to predict the t-th
symbol x_t from sequence x = x_1 ... x_{t−1}. The key concept for attacking general
induction problems is Occam’s razor and, to a lesser extent, Epicurus’ principle of multiple
explanations. The former/latter may be interpreted as to keep the simplest/all
theories consistent with the observations x1 ...xt−1 and to use these theories to
predict xt . Solomonoff [Sol64,Sol78] formalized and combined both principles in
his universal prior M (x) which assigns high/low probability to simple/complex
environments, hence implementing Occam and Epicurus. Solomonoff’s [Sol78]
central result is that if the probability µ(xt |x1 ...xt−1 ) of observing xt at time
This work was supported by SNF grant 2000-61847.00 to Jürgen Schmidhuber.
R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 298–312, 2003.
© Springer-Verlag Berlin Heidelberg 2003
On the Existence and Convergence of Computable Universal Priors 299
sequences. We abbreviate lim_{n→∞}[f(n) − g(n)] = 0 by f(n) → g(n) for n → ∞ and say
f converges to g, without implying that lim_{n→∞} g(n) itself exists. We write
f(x) ≽ g(x) for g(x) = O(f(x)), i.e. if ∃c > 0 : f(x) ≥ c·g(x) ∀x.
2 Computability Concepts
We define several computability concepts weaker than can be captured by halting
Turing machines.
Definition 1 (Computable functions). We consider functions f : IN → IR:
f is finitely computable or recursive iff there are Turing machines T_1, T_2 with
output interpreted as natural numbers and f(x) = T_1(x)/T_2(x),
f is approximable iff ∃ finitely computable φ(·,·) with limt→∞ φ(x,t) = f (x).
f is lower semi-computable or enumerable iff additionally φ(x,t) ≤ φ(x,t+1).
f is upper semi-computable or co-enumerable iff [−f] is lower semi-computable.
f is semi-computable iff f is lower- or upper semi-computable.
f is estimable iff f is lower- and upper semi-computable.
If f is estimable we can finitely compute an ε-approximation of f by upper and
lower semi-computing f and terminating when differing by less than ε. This
means that there is a Turing machine which, given x and ε, finitely computes ŷ
such that |ŷ−f (x)| < ε. Moreover it gives an interval estimate f (x) ∈ [ŷ−ε,ŷ+ε].
An estimable integer-valued function is finitely computable (take any ε<1). Note
that if f is only approximable or semi-computable we can still come arbitrar-
ily close to f (x) but we cannot devise a terminating algorithm which produces
an ε-approximation. In the case of lower/upper semi-computability we can at
least finitely compute lower/upper bounds to f (x). In case of approximabil-
ity, the weakest computability form, even this capability is lost. In analogy to
lower/upper semi-computability one may think of notions like lower/upper es-
timability but they are easily shown to coincide with estimability. The following
implications are valid:
recursive = finitely computable ⇒ estimable ⇒ { enumerable = lower semi-computable, co-enumerable = upper semi-computable } ⇒ semi-computable ⇒ approximable
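The terminating ε-approximation of an estimable f described above can be sketched in Python (a toy illustration, not from the paper: the two generators are assumed stand-ins for the lower and upper semi-computations of the value f = 2/3):

```python
from fractions import Fraction

def eps_approximate(lower, upper, eps):
    """Finitely compute an eps-approximation of an estimable f(x): run the
    lower and upper semi-computations in parallel and stop as soon as the
    enclosing interval [lo, hi] is shorter than eps."""
    lo, hi = next(lower), next(upper)
    while hi - lo >= eps:
        lo = max(lo, next(lower))
        hi = min(hi, next(upper))
    return (lo + hi) / 2          # within eps of f(x), since f(x) in [lo, hi]

def lower_gen():                  # partial sums 1/2 + 1/8 + 1/32 + ... -> 2/3
    s, q = Fraction(0), Fraction(1, 2)
    while True:
        s += q
        q /= 4
        yield s

def upper_gen():                  # 1 - 1/4 - 1/16 - ... -> 2/3 from above
    s, q = Fraction(1), Fraction(1, 4)
    while True:
        yield s
        s -= q
        q /= 4

y = eps_approximate(lower_gen(), upper_gen(), Fraction(1, 1000))
assert abs(y - Fraction(2, 3)) < Fraction(1, 1000)
```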
The prefix Kolmogorov complexity K(x) is defined as the length of the shortest
binary program p ∈ {0,1}* for which a universal prefix Turing machine U (with
binary program tape and X-ary output tape) outputs string x ∈ X*, and similarly
K(x|y) in case of side information y [LV97]:
K(x) := min_p{l(p) : U(p) = x},  K(x|y) := min_p{l(p) : U(p, y) = x}.
Solomonoff [Sol64,Sol78] (with a flaw fixed by Levin [ZL70]) defined (earlier) the
closely related quantity, the universal prior M (x). It is defined as the probability
that the output of a universal Turing machine starts with x when provided with
fair coin flips on the input tape. Formally, M can be defined as
M(x) := ∑_{p : U(p)=x*} 2^{−l(p)} (1)
where the sum is over all so-called minimal programs p for which U outputs a
string starting with x (indicated by the ∗). Before we can discuss the stochastic
properties of M we need the concept of (semi)measures for strings.
The infinite sum can only be finite if the difference M (0|x<t )−µ(0|x<t ) tends to
zero for t → ∞ with µ probability 1 (see Definition 4(i) and [Hut01] or Section
6 for general alphabet). This holds for any computable probability distribution
µ. The reason for the astonishing property of a single (universal) function
converging to any computable probability distribution lies in the fact that the sets
of µ-random sequences differ for different µ. Past data x_{<t} are exploited to get
a (with t → ∞) improving estimate M (xt |x<t ) of µ(xt |x<t ).
The universality property (Theorem 1) is the central ingredient in the proof
of (2). The proof involves the construction of a semimeasure ξ whose dominance
is obvious. The hard part is to show its enumerability and equivalence to M . Let
M be the (countable) set of all enumerable semimeasures and define
ξ(x) := ∑_{ν∈M} 2^{−K(ν)} ν(x); then dominance ξ(x) ≥ 2^{−K(ν)} ν(x) ∀ν ∈ M (3)
5 Universal (Semi)Measures
What is so special about the set of all enumerable semimeasures M^semi_enum? The
larger we choose M, the less restrictive is the assumption that M should contain
the true distribution µ, which will be essential throughout the paper. Why not
restrict to the still rather general class of estimable or finitely computable
(semi)measures? It is clear that for every countable set M, the universal or
mixture distribution
ξ(x) := ξ_M(x) := ∑_{ν∈M} w_ν ν(x) with ∑_{ν∈M} w_ν ≤ 1 and w_ν > 0 (4)
dominates all ν ∈ M. This dominance is necessary for the desired convergence ξ →
µ similarly to (2). The question is what properties ξ possesses. The distinguishing
property of M^semi_enum is that ξ is itself an element of M^semi_enum. When concerned
with predictions, ξ_M ∈ M is not by itself an important property, but whether ξ
is computable in one of the senses of Definition 1. We define
is transitive (but not necessarily reflexive) in the sense that M1 ≽ M2 ≽ M3
implies M1 ≽ M3 and M0 ⊇ M1 ≽ M2 ⊇ M3 implies M0 ≽ M3. For the
computability concepts introduced in Section 2 we have the following proper set
inclusions
M^msr_comp ⊂ M^msr_est ≡ M^msr_enum ⊂ M^msr_appr
    ∩             ∩             ∩             ∩
M^semi_comp ⊂ M^semi_est ⊂ M^semi_enum ⊂ M^semi_appr
where M^msr_c stands for the set of all probability measures of appropriate
computability type c ∈ {comp = finitely computable, est = estimable,
enum = enumerable, appr = approximable}, and similarly for semimeasures
M^semi_c. From an enumeration of a measure ρ one can construct a co-enumeration
by exploiting ρ(x_{1:n}) = 1 − ∑_{y_{1:n} ≠ x_{1:n}} ρ(y_{1:n}). This shows that every enumerable
measure is also co-enumerable, hence estimable, which proves the identity ≡
above.
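This co-enumeration trick can be illustrated concretely; in the sketch below (all names are ours) the stage-t lower bounds φ(y, t) ↗ ρ(y) of a toy enumerable measure yield decreasing upper bounds on ρ(x):

```python
from itertools import product

def co_enumerate(phi, x, n, t):
    """Upper bound on a measure rho(x), |x| = n, from stage-t lower
    bounds phi(y, t), via rho(x) = 1 - sum_{y != x} rho(y)."""
    return 1.0 - sum(phi(y, t) for y in product('01', repeat=n) if y != x)

# toy enumerable measure: uniform on {0,1}^n, lower semi-computed by
# phi(y, t) = (1 - 2^-t) * 2^-n, which increases to 2^-n in t
def phi(y, t):
    return (1 - 2.0 ** -t) * 2.0 ** -len(y)

n, x = 3, ('0', '1', '1')
uppers = [co_enumerate(phi, x, n, t) for t in range(1, 20)]

# the upper bounds decrease towards rho(x) = 2^-3 = 0.125
assert all(a >= b for a, b in zip(uppers, uppers[1:]))
assert uppers[-1] >= 2.0 ** -n
assert uppers[-1] - 2.0 ** -n < 1e-3
```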
With this notation, Theorem 1 implies M^semi_enum ≽ M^semi_enum. Transitivity allows
us to conclude, for instance, that M^semi_appr ≽ M^msr_comp, i.e. that there is an approximable
semimeasure which dominates all computable measures.
304 M. Hutter
ρ \ M          semimeasure                       measure
               comp.   est.    enum.   appr.     comp.   est.    appr.
semi  comp.    no^iii  no^iii  no^iii  no^iv     no^iii  no^iii  no^iv
semi  est.     no^iii  no^iii  no^iii  no^iv     no^iii  no^iii  no^iv
semi  enum.    yes^i   yes^i   yes^i   no^iv     yes^i   yes^i   no^iv
semi  appr.    yes^i   yes^i   yes^i   no^iv     yes^i   yes^i   no^iv
msr   comp.    no^iii  no^iii  no^iii  no^iv     no^iii  no^iii  no^iv
msr   est.     no^iii  no^iii  no^iii  no^iv     no^iii  no^iii  no^iv
msr   appr.    yes^ii  yes^ii  yes^ii  no^iv     yes^ii  yes^ii  no^iv
If we ask for a universal (semi)measure which at least satisfies the weakest form
of computability, namely being approximable, we see that the largest dominated
set among the 7 sets defined above is the set of enumerable semimeasures. This
is the reason why M^semi_enum plays a special role. On the other hand, M^semi_enum is not
the largest set dominated by an approximable semimeasure, and indeed no such
largest set exists. One may, hence, ask for “natural” larger sets M. One such set,
the sequence leaves and enters ỹ_k infinitely often. If ỹ_k is left (y_k^{t−1} = ỹ_k ≠ y_k^t) we have
µ_t(y_{<k} ỹ_k) = µ_t(y_{<k} y_k^{t−1}) > (2/3)µ_t(y_{<k}) → (2/3)µ(y_{<k}) for t → ∞. If ỹ_k is entered (y_k^{t−1} ≠ ỹ_k = y_k^t) we have
µ_t(y_{<k} ỹ_k) = µ_t(y_{<k} y_k^t) = min_{x_k} µ_t(y_{<k} x_k) ≤ (1/2)[µ_t(y_{<k}0) + µ_t(y_{<k}1)] ≤ (1/2)µ_t(y_{<k}) → (1/2)µ(y_{<k}).
Hence µ_t(y_{<k} ỹ_k) oscillates infinitely often between > (2/3)µ(y_{<k}) and ≤ (1/2)µ(y_{<k}), which contradicts
the assumption that µ_t converges. Hence the assumption of a non-convergent y_k^t was wrong. With y_k^t also the measure ρ_t(y_{1:n}^t) :=
1 (and ρ_t(x) = 0 for all other x which are not prefixes of y_{1:∞}^t) converges. For all
sufficiently large t we have y_{1:n}^t = y_{1:n}, hence µ_t(y_{1:n}) = µ_t(y_{1:n}^t) ≤ (2/3)µ_t(y_{<n}) ≤ ... ≤ (2/3)^n.
Since µ(y_{1:n}) ≤ (2/3)^n does not dominate ρ(y_{1:n}) = 1 (∀t > t_0), we have µ ⋡ ρ. Since
µ ∈ M^semi_appr was arbitrary and ρ is an approximable measure, we get M^semi_appr ⋡ M^msr_appr.
6 Posterior Convergence
We have investigated in detail the computational properties of various mixture
distributions ξ. A mixture ξM multiplicatively dominates all distributions in
M. We have mentioned that dominance implies posterior convergence. In this
section we present in more detail what dominance implies and what it does not.
Convergence of ξ(xt |x<t ) to µ(xt |x<t ) with µ-probability 1 tells us that
ξ(xt |x<t ) is close to µ(xt |x<t ) for sufficiently large t and “most” sequences x1:∞ .
It says nothing about the speed of convergence, nor whether convergence is true
for any particular sequence (of measure 0). Convergence in mean sum defined
below is intended to capture the rate of convergence, Martin-Löf randomness is
used to capture convergence properties for individual sequences.
Martin-Löf randomness is a very important concept of randomness of in-
dividual sequences, which is closely related to Kolmogorov complexity and
Solomonoff’s universal prior. Levin gave a characterization equivalent to Martin-
Löf’s original definition [Lev73]:
One can show that a µ.M.L. random sequence x1:∞ passes all thinkable effective
randomness tests, e.g. the law of large numbers, the law of the iterated logarithm,
etc. In particular, the set of all µ.M.L. random sequences has µ-measure 1. The
following generalization is natural when considering general Bayes-mixtures ξ as
in this work:
Typically, ξ is a mixture over some M as defined in (3), in which case the reverse
inequality ξ(x)µ(x) is also true (for all x). For finite M or if ξ ∈M, the defini-
tion of µ/ξ-randomness depends only on M, and not on the specific weights used
On the Existence and Convergence of Computable Universal Priors 307
The latter strengthens the result ξ(xt |x<t )/µ(xt |x<t )→1 w.p.1 derived by Gács
in [LV97, Th.5.2.2] in that it also provides the “speed” of convergence.
Note also the subtle difference between the two convergence results. For any
sequence x_{1:∞} (possibly constant and not necessarily µ-random), µ(x_t|x_{<t}) −
ξ(x_t|x_{<t}) converges to zero w.p.1 (referring to x_{1:∞}), but no statement is
possible for ξ(x_t|x_{<t})/µ(x_t|x_{<t}), since lim inf µ(x_t|x_{<t}) could be zero. On the
other hand, if we stay on the µ-random sequence, we have
ξ(x_t|x_{<t})/µ(x_t|x_{<t}) → 1 (whether inf µ(x_t|x_{<t}) tends to zero or not does not
matter). Indeed, it is easy to see that ξ(1|0_{<t})/µ(1|0_{<t}) ∝ t → ∞ diverges for
M = {µ,ν}, µ(1|x_{<t}) := (1/2)t^{−3} and ν(1|x_{<t}) := (1/2)t^{−2}, although 0_{1:∞} is µ-random.
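This two-environment example is finitely computable, so the linear divergence of the ratio can be checked directly (a sketch; `xi` is the mixture with equal weights):

```python
def mu_cond(t):   # mu(1 | x_{<t}) = t^-3 / 2
    return 0.5 * t ** -3

def nu_cond(t):   # nu(1 | x_{<t}) = t^-2 / 2
    return 0.5 * t ** -2

def ratio(T):
    """xi(1|0_{<T}) / mu(1|0_{<T}) for the mixture xi = (mu + nu)/2."""
    mu_p = nu_p = 1.0                    # mu(0_{<T}) and nu(0_{<T})
    for t in range(1, T):
        mu_p *= 1 - mu_cond(t)
        nu_p *= 1 - nu_cond(t)
    xi_1 = 0.5 * mu_p * mu_cond(T) + 0.5 * nu_p * nu_cond(T)
    xi_0 = 0.5 * mu_p + 0.5 * nu_p
    return (xi_1 / xi_0) / mu_cond(T)

# the ratio grows roughly linearly in t, i.e. it diverges
r100, r200 = ratio(100), ratio(200)
assert r200 > 1.5 * r100
```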
Proof. For a probability distribution y_i ≥ 0 with ∑_i y_i = 1 and a semi-distribution
z_i ≥ 0 with ∑_i z_i ≤ 1 and i ∈ {1,...,N}, the Hellinger distance h(y,z) := ∑_i(√y_i − √z_i)²
is upper bounded by the relative entropy d(y,z) := ∑_i y_i ln(y_i/z_i) (with 0 ln(0/z) := 0). This can
be seen as follows: For arbitrary 0 ≤ y ≤ 1 and 0 ≤ z ≤ 1 we define
f(y, z) := y ln(y/z) − (√y − √z)² + z − y = 2y·g(√(z/y)) with g(t) := −ln t + t − 1 ≥ 0.
This shows f ≥ 0, and hence ∑_i f(y_i, z_i) ≥ 0, which implies
∑_i y_i ln(y_i/z_i) − ∑_i(√y_i − √z_i)² ≥ ∑_i y_i − ∑_i z_i ≥ 1 − 1 = 0.
where we have used E[Et [..]] = E[..] and exchanged the t-sum with the expectation
E, which transforms to a product inside the logarithm. In the last equality we have
used the chain rule for µ and ξ. Using universality ξ(x1:n ) ≥ wµ µ(x1:n ) yields the final
inequality. Finally
E_t[(√(ξ_t/µ_t) − 1)²] = ∑_{x_t} µ_t(√(ξ_t/µ_t) − 1)² = ∑_{x_t}(√ξ_t − √µ_t)² = h_t(x_{<t}) ≤ d_t(x_{<t}).
Taking the expectation E and the sum ∑_{t=1}^n, and chaining the result with (5), yields Theorem 5. □
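The inequality h(y, z) ≤ d(y, z) between the Hellinger sum and the relative entropy, which drives the proof above, can be spot-checked numerically (a sketch with random distributions y and semi-distributions z):

```python
import random
from math import sqrt, log

def hellinger(y, z):
    """Hellinger sum h(y,z) = sum_i (sqrt(y_i) - sqrt(z_i))^2."""
    return sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(y, z))

def rel_entropy(y, z):
    """Relative entropy d(y,z) = sum_i y_i ln(y_i/z_i), with 0 ln(0/z) = 0."""
    return sum(a * log(a / b) for a, b in zip(y, z) if a > 0)

random.seed(0)
N = 5
for _ in range(1000):
    y = [random.random() for _ in range(N)]
    s = sum(y)
    y = [a / s for a in y]                          # probability distribution
    z = [random.uniform(1e-9, 0.2) for _ in range(N)]  # semi-distribution
    assert sum(z) <= 1.0
    assert hellinger(y, z) <= rel_entropy(y, z) + 1e-12
```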
If M were recursive, then this would imply posterior M → µ and M/µ → 1 for
every µ.M.L. random sequence x_{1:∞}, since every sequence is M.M.L. random.
Since M is not recursive Vovk’s theorem cannot be applied and it is not obvious
how to generalize it. So the question of individual convergence remains open.
More generally, one may ask whether ξM →µ for every µ/ξ-random sequence. It
turns out that this is true for some M, but false for others.
i) If x1:∞ is µ/ξMΘD random with µ ∈ MΘD , then ξMΘD (xt |x<t ) → µ(xt |x<t ),
¹ The formulation of their theorem is quite misleading in general: “Let µ be a positive
recursive measure. If the length of y is fixed and the length of x grows to infinity,
then M(y|x)/µ(y|x) → 1 with µ-probability one. The infinite sequences ω with prefixes
x satisfying the displayed asymptotics are precisely [‘⇒’ and ‘⇐’] the µ-random
sequences.” First, for off-sequence y convergence w.p.1 does not hold (xy must be
demanded to be a prefix of ω). Second, the proof of ‘⇐’ has loopholes (see main
text). Last, ‘⇒’ is given without proof and is probably wrong. Also the assertion in
[LV97, Th.5.2.1] that S_t := E[∑_{x_t}(µ(x_t|x_{<t}) − M(x_t|x_{<t}))²] converges to zero faster
than 1/t cannot be made, since S_t may not decrease monotonically. For example,
for a_t := 1/√t if t is a cube and 0 otherwise, we have ∑_{t=1}^∞ a_t < ∞, but a_t ≠ o(1/t).
ii) There are µ ∈ M_{Θ_G} and µ/ξ_{M_{Θ_G}}-random x_{1:∞} for which ξ_{M_{Θ_G}}(x_t|x_{<t}) ↛
µ(x_t|x_{<t}).
Our original/main motivation for studying µ/ξ-randomness is the implication of
Theorem 6 that convergence M → µ on all M.L. random sequences cannot be decided from M being a mixture distribution
or from the universality property (Theorem 1) alone. Further structural
properties of M^semi_enum have to be employed. For Bernoulli sequences, convergence
on µ.ξ_{M_Θ}-random sequences is related to denseness of M_Θ. Maybe a denseness characterization
of M^semi_enum can solve the question of M.L. convergence of M. The property
M ∈ M^semi_enum is also not sufficient to resolve this question, since there are M ∋ ξ
for which ξ → µ on µ.ξ-random sequences, and M ∋ ξ for which ξ ↛ µ on µ.ξ-random sequences. Theorem 6 can be generalized to
i.i.d. sequences over general finite alphabet X.
The idea to prove (ii) is to construct a sequence x_{1:∞} which is µ_{θ_0}/ξ-random
and µ_{θ_1}/ξ-random for θ_0 ≠ θ_1. This is possible if and only if Θ contains a gap
and θ_0 and θ_1 are the boundaries of the gap. Obviously ξ cannot converge to both θ_0
and θ_1, thus proving non-convergence. For no θ ∈ [0,1] will this x_{1:∞} be µ_θ M.L.-random.
Finally, the proof of Theorem 6 makes essential use of the mixture
representation of ξ, as opposed to the proof of Theorem 5, which only needs
dominance ξ ≽ M.
Proof. Let X = {0,1} and M = {µ_θ : θ ∈ Θ} with countable Θ ⊂ [0,1] and µ_θ(1|x_{1:n}) =
θ = 1 − µ_θ(0|x_{1:n}), which implies
µ_θ(x_{1:n}) = θ^{n_1}(1 − θ)^{n−n_1}, n_1 := x_1 + ... + x_n, θ̂ ≡ θ̂_n := n_1/n.
θ̂ depends on n; all other used/defined θ will be independent of n. ξ is defined in the
standard way as
ξ(x_{1:n}) = ∑_{θ∈Θ} w_θ µ_θ(x_{1:n}) ⇒ ξ(x_{1:n}) ≥ w_θ µ_θ(x_{1:n}), (6)
where ∑_θ w_θ = 1 and w_θ > 0 ∀θ. In the following let µ = µ_{θ_0} ∈ M be the true environment.
ω = x1:∞ is µ/ξ-random ⇔ ∃cω : ξ(x1:n ) ≤ cω ·µθ0 (x1:n ) ∀n (7)
For binary alphabet it is sufficient to establish whether ξ(1|x_{1:n}) → θ_0 ≡ µ(1|x_{1:n}) for
n → ∞ on µ/ξ-random x_{1:∞} in order to decide ξ(x_n|x_{<n}) → µ(x_n|x_{<n}). We need the following
posterior representation of ξ:
posterior representation of ξ:
ξ(1|x_{1:n}) = ∑_{θ∈Θ} w_n^θ µ_θ(1|x_{1:n}), w_n^θ := w_θ µ_θ(x_{1:n})/ξ(x_{1:n}) ≤ w_θ µ_θ(x_{1:n})/(w_{θ_0} µ_{θ_0}(x_{1:n})), ∑_{θ∈Θ} w_n^θ = 1 (8)
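For a concrete finite grid Θ (a computable stand-in for the countable dense Θ of part (i) below), the posterior weights (8) can be computed directly; the sketch below (all names are ours) illustrates the concentration w_n^{θ0} → 1 established in the proof:

```python
import math
import random

def posterior_weights(x, thetas, w):
    """Posterior weights w_n^theta = w_theta * mu_theta(x_{1:n}) / xi(x_{1:n})
    of the Bernoulli mixture (8), via log-likelihoods for numerical stability."""
    n, n1 = len(x), sum(x)
    logw = [math.log(wt) + n1 * math.log(th) + (n - n1) * math.log(1 - th)
            for wt, th in zip(w, thetas)]
    m = max(logw)
    un = [math.exp(l - m) for l in logw]
    s = sum(un)
    return [u / s for u in un]

# grid Theta with uniform prior, true parameter theta0 = 0.3
thetas = [i / 20 for i in range(1, 20)]
w = [1.0 / len(thetas)] * len(thetas)
random.seed(1)
x = [1 if random.random() < 0.3 else 0 for _ in range(5000)]

wn = posterior_weights(x, thetas, w)
i0 = thetas.index(0.3)
assert wn[i0] > 0.8                       # w_n^{theta0} concentrates
xi_pred = sum(wt * th for wt, th in zip(wn, thetas))
assert abs(xi_pred - 0.3) < 0.03          # xi(1|x_{1:n}) approaches theta0
```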
µ_θ(x_{1:n})/µ_{θ'}(x_{1:n}) = e^{n[D(θ̂_n||θ') − D(θ̂_n||θ)]}, where D(θ̂||θ) := θ̂ ln(θ̂/θ) + (1−θ̂) ln((1−θ̂)/(1−θ)) (9)
is the relative entropy between θ̂ and θ, which is continuous in θ̂ and θ, and is 0 if and
only if θ̂ = θ. We also need the following implication for sets Ω ⊆ Θ:
If w_n^θ ≤ w_θ g_θ(n) → 0 (n → ∞) and g_θ(n) ≤ c ∀θ ∈ Ω, then ∑_{θ∈Ω} w_n^θ µ_θ(1|x_{1:n}) ≤ ∑_{θ∈Ω} w_n^θ → 0, (10)
which follows from boundedness w_θ ≤ 1 and µ_θ(1|x_{1:n}) ≤ 1. We now prove Theorem 6.
We leave the special considerations necessary when 0,1 ∈ Θ to the reader and assume,
henceforth, 0,1 ∉ Θ.
(i) Let Θ be a countable dense subset of (0,1) and x1:∞ be µ/ξ-random. Using (6)
and (7) in (9) for θ ∈ Θ to be determined later we can bound
e^{n[D(θ̂_n||θ_0) − D(θ̂_n||θ)]} = µ_θ(x_{1:n})/µ_{θ_0}(x_{1:n}) ≤ c_ω/w_θ =: c < ∞ (11)
Let us assume that θ̂ ≡ θ̂_n ↛ θ_0. This implies that there exists a cluster point θ̃ ≠ θ_0
of the sequence θ̂_n, i.e. θ̂_n is infinitely often in an ε-neighborhood of θ̃, e.g. D(θ̂_n||θ̃) ≤ ε
for infinitely many n. θ̃ ∈ [0,1] may be outside Θ. Since θ̃ ≠ θ_0 this implies that θ̂_n
must be “far” away from θ_0 infinitely often. E.g. for ε = ¼(θ̃ − θ_0)², using D(θ̂||θ̃) +
D(θ̂||θ_0) ≥ (θ̃ − θ_0)², we get D(θ̂||θ_0) ≥ 3ε. We now choose θ ∈ Θ so near to θ̃ that
|D(θ̂||θ) − D(θ̂||θ̃)| ≤ ε (here we use the denseness of Θ). Chaining all inequalities we get
D(θ̂||θ_0) − D(θ̂||θ) ≥ 3ε − ε − ε = ε > 0. This, together with (11), implies e^{nε} ≤ c for infinitely
many n, which is impossible. Hence the assumption θ̂_n ↛ θ_0 was wrong.
Now, θ̂_n → θ_0 implies that for arbitrary θ ≠ θ_0, θ ∈ Θ, and for sufficiently large n
there exists δ_θ > 0 such that D(θ̂_n||θ) ≥ 2δ_θ (since D(θ_0||θ) ≠ 0) and D(θ̂_n||θ_0) ≤ δ_θ. This
implies
w_n^θ ≤ (w_θ/w_{θ_0}) e^{n[D(θ̂_n||θ_0) − D(θ̂_n||θ)]} ≤ (w_θ/w_{θ_0}) e^{−nδ_θ} → 0 (n → ∞),
where we have used (8) and (9) in the first inequality; the second inequality holds
for sufficiently large n. Hence ∑_{θ≠θ_0} w_n^θ → 0 by (10) and w_n^{θ_0} → 1 by normalization (8),
which finally gives
ξ(1|x_{1:n}) = w_n^{θ_0} µ_{θ_0}(1|x_{1:n}) + ∑_{θ≠θ_0} w_n^θ µ_θ(1|x_{1:n}) → µ_{θ_0}(1|x_{1:n}) (n → ∞).
(ii) We first consider the case Θ = {θ_0, θ_1}: Let us choose θ̄ := ln((1−θ_0)/(1−θ_1)) / ln((θ_1/θ_0)·((1−θ_0)/(1−θ_1)))
(potentially ∉ Θ) in the (KL) middle of θ_0 and θ_1, such that
D(θ̄||θ_0) = D(θ̄||θ_1), 0 < θ_0 < θ̄ < θ_1 < 1, (12)
and choose x_{1:∞} such that θ̂_n := n_1/n satisfies |θ̂_n − θ̄| ≤ 1/n (⇒ θ̂_n → θ̄ for n → ∞).
Using |D(θ̂||θ) − D(θ̄||θ)| ≤ c|θ̂ − θ̄| ∀θ, θ̂, θ̄ ∈ [θ_0, θ_1] (with c = ln(θ_1(1−θ_0)/(θ_0(1−θ_1))) < ∞) twice in (9) we
get
µ_{θ_1}(x_{1:n})/µ_{θ_0}(x_{1:n}) = e^{n[D(θ̂_n||θ_0) − D(θ̂_n||θ_1)]} ≤ e^{n[D(θ̄||θ_0) + c|θ̂_n−θ̄| − D(θ̄||θ_1) + c|θ̂_n−θ̄|]} ≤ e^{2c} (13)
where we have used (12) in the last inequality. Now, (13) and (8) lead to
w_n^{θ_0} = w_{θ_0} µ_{θ_0}(x_{1:n})/ξ(x_{1:n}) = [1 + (w_{θ_1} µ_{θ_1}(x_{1:n}))/(w_{θ_0} µ_{θ_0}(x_{1:n}))]^{−1} ≥ [1 + (w_{θ_1}/w_{θ_0}) e^{2c}]^{−1} =: c_0 > 0, (14)
which shows that x_{1:∞} is µ_{θ_0}/ξ-random by (7). Exchanging θ_0 ↔ θ_1 in (13) and (14)
we similarly get w_n^{θ_1} ≥ c_1 > 0, which implies (using w_n^{θ_0} + w_n^{θ_1} = 1)
ξ(1|x_{1:n}) = ∑_{θ∈{θ_0,θ_1}} w_n^θ µ_θ(1|x_{1:n}) = w_n^{θ_0}·θ_0 + w_n^{θ_1}·θ_1 ≠ θ_0 = µ_{θ_0}(1|x_{1:n}). (15)
This shows ξ(1|x_{1:n}) ↛ µ(1|x_{1:n}). For general Θ with a gap, in the sense that there
exist 0 < θ_0 < θ_1 < 1 with [θ_0, θ_1] ∩ Θ = {θ_0, θ_1}, one can show that all θ ≠ θ_0, θ_1 give
asymptotically no contribution to ξ(1|x_{1:n}), i.e. (15) still applies. □
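The gap construction of part (ii) can be simulated numerically (a sketch; the concrete θ0, θ1 and prior weights are our arbitrary choices): compute the KL-middle θ̄ of (12), build a sequence with |θ̂_n − θ̄| ≤ 1/n, and observe that both posterior weights stay bounded away from zero, so ξ(1|x_{1:n}) stays away from θ0:

```python
import math

def D(p, q):
    """Binary relative entropy D(p||q)."""
    def term(a, b):
        return 0.0 if a == 0 else a * math.log(a / b)
    return term(p, q) + term(1 - p, 1 - q)

theta0, theta1 = 0.2, 0.6
# KL-middle point of (12): D(bar||theta0) == D(bar||theta1)
bar = math.log((1 - theta0) / (1 - theta1)) / math.log(
    (theta1 / theta0) * ((1 - theta0) / (1 - theta1)))
assert abs(D(bar, theta0) - D(bar, theta1)) < 1e-12
assert theta0 < bar < theta1

# greedy sequence keeping the empirical frequency within 1/n of bar
x, n1 = [], 0
for n in range(1, 5001):
    bit = 1 if (n1 + 1) / n - bar <= bar - n1 / n else 0
    n1 += bit
    x.append(bit)
    assert abs(n1 / n - bar) <= 1.0 / n + 1e-12

# posterior weights (8): likelihood ratio bounded as in (13),
# so both weights stay bounded away from 0
n, w0, w1 = len(x), 0.5, 0.5
logL0 = n1 * math.log(theta0) + (n - n1) * math.log(1 - theta0)
logL1 = n1 * math.log(theta1) + (n - n1) * math.log(1 - theta1)
r = math.exp(logL1 - logL0)
wn0 = 1.0 / (1.0 + (w1 / w0) * r)
wn1 = 1.0 - wn0
assert wn0 > 0.01 and wn1 > 0.01
```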
8 Conclusions
For a hierarchy of four computability definitions, we completed the classifica-
tion of the existence of computable (semi)measures dominating all computable
(semi)measures. Dominance is an important property of a prior, since it im-
plies rapid convergence of the corresponding posterior with probability one.
A strengthening would be convergence for all Martin-Löf (M.L.) random se-
quences. This seems natural, since M.L. randomness can be defined in terms of
Solomonoff’s prior M , so there is a close connection. Contrary to what was be-
lieved before, the question of posterior convergence M/µ→1 for all M.L. random
sequences is still open. We introduced a new flexible notion of µ/ξ-randomness
which contains Martin-Löf randomness as a special case. Though this notion
may have a wider range of application, the main purpose for its introduction
was to show that standard proof attempts of M/µ → 1 (in the M.L. sense) based on dominance
only must fail. This follows from the derived result that the validity of ξ/µ → 1
for µ/ξ-random sequences depends on the Bayes mixture ξ.
References
[Doo53] J. L. Doob. Stochastic Processes. John Wiley & Sons, New York, 1953.
[Hut01] M. Hutter. Convergence and error bounds of universal prediction for general
alphabet. Proceedings of the 12th European Conference on Machine Learning
(ECML-2001), pages 239–250, 2001.
[Hut03] M. Hutter. Sequence prediction based on monotone complexity. In Proceedings
of the 16th Conference on Computational Learning Theory (COLT-2003).
[Lam87] M. van Lambalgen. Random Sequences. PhD thesis, Univ. Amsterdam, 1987.
[Lev73] L. A. Levin. On the notion of a random sequence. Soviet Math. Dokl.,
14(5):1413–1416, 1973.
[LV97] M. Li and P. M. B. Vitányi. An introduction to Kolmogorov complexity and
its applications. Springer, 2nd edition, 1997.
[Sch71] C. P. Schnorr. Zufälligkeit und Wahrscheinlichkeit. Springer, Berlin, 1971.
[Sch02] J. Schmidhuber. Hierarchies of generalized Kolmogorov complexities and
nonenumerable universal measures computable in the limit. International
Journal of Foundations of Computer Science, 13(4):587–612, 2002.
[Sol64] R. J. Solomonoff. A formal theory of inductive inference: Part 1 and 2. Inform.
Control, 7:1–22, 224–254, 1964.
[Sol78] R. J. Solomonoff. Complexity-based induction systems: comparisons and con-
vergence theorems. IEEE Trans. Inform. Theory, IT-24:422–432, 1978.
[VL00] P. M. B. Vitányi and M. Li. Minimum description length induction, Bayesian-
ism, and Kolmogorov complexity. IEEE Transactions on Information Theory,
46(2):446–464, 2000.
[Vov87] V. G. Vovk. On a randomness criterion. Soviet Mathematics Doklady,
35(3):656–660, 1987.
[Wan96] Y. Wang. Randomness and Complexity. PhD thesis, Univ. Heidelberg, 1996.
[ZL70] A. K. Zvonkin and L. A. Levin. The complexity of finite objects and the
development of the concepts of information and randomness by means of the
theory of algorithms. Russian Mathematical Surveys, 25(6):83–124, 1970.
Author Index