Algorithmic Learning Theory

The document contains the proceedings of the 14th International Conference on Algorithmic Learning Theory (ALT 2003) held in Sapporo, Japan, from October 17-19, 2003. It includes 19 selected technical contributions, invited talks, and details about the conference organization and awards, notably the E. Mark Gold Award given to Sandra Zilles. The volume aims to provide a forum for discussing theoretical foundations of machine learning and its practical applications.


Lecture Notes in Artificial Intelligence 2842
Subseries of Lecture Notes in Computer Science

Springer-Verlag
Berlin · Heidelberg · New York · Hong Kong · London · Milan · Paris · Tokyo

Algorithmic Learning Theory

14th International Conference, ALT 2003
Sapporo, Japan, October 17–19, 2003
Proceedings
Volume Editors
Ricard Gavaldà
Technical University of Catalonia
Department of Software (LSI)
Jordi Girona Salgado 1-3, 08034 Barcelona, Spain
E-mail: gavalda@[Link]

Klaus P. Jantke
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH
Im Stadtwald, Geb. 43.8, 66125 Saarbrücken, Germany
E-mail: jantke@[Link]

Eiji Takimoto
Tohoku University
Graduate School of Information Sciences
Sendai 980-8579, Japan
E-mail: t2@[Link]

Cataloging-in-Publication Data applied for

A catalog record for this book is available from the Library of Congress.

Bibliographic information published by Die Deutsche Bibliothek


Die Deutsche Bibliothek lists this publication in the Deutsche Nationalbibliografie;
detailed bibliographic data is available in the Internet at <[Link]

CR Subject Classification (1998): I.2.6, I.2.3, F.1, F.2, F.4.1, I.7

ISSN 0302-9743
ISBN 3-540-20291-9 Springer-Verlag Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are
liable for prosecution under the German Copyright Law.
Springer-Verlag Berlin Heidelberg New York
a member of BertelsmannSpringer Science+Business Media GmbH

[Link]

© Springer-Verlag Berlin Heidelberg 2003


Printed in Germany
Typesetting: Camera-ready by author, data conversion by PTP-Berlin GmbH
Printed on acid-free paper SPIN: 10963852 06/3142 543210
Preface

This volume contains the papers presented at the 14th Annual Conference on
Algorithmic Learning Theory (ALT 2003), which was held in Sapporo (Japan)
during October 17–19, 2003. The main objective of the conference was to provide
an interdisciplinary forum for discussing the theoretical foundations of machine
learning as well as their relevance to practical applications. The conference was
co-located with the 6th International Conference on Discovery Science (DS 2003).
The volume includes 19 technical contributions that were selected by the program
committee from 37 submissions. It also contains the ALT 2003 invited talks presented
by Naftali Tishby (Hebrew University, Israel) on "Efficient Data Representations that
Preserve Information," by Thomas Zeugmann (University of Lübeck, Germany) on
"Can Learning in the Limit be Done Efficiently?", and by Genshiro Kitagawa (Institute
of Statistical Mathematics, Japan) on "Signal Extraction and Knowledge Discovery
Based on Statistical Modeling" (joint invited talk with DS 2003). Furthermore, this
volume includes abstracts of the invited talks for DS 2003 presented by Thomas Eiter
(Vienna University of Technology, Austria) on "Abduction and the Dualization Problem"
and by Akihiko Takano (National Institute of Informatics, Japan) on "Association
Computation for Information Access." The complete versions of these papers were
published in the DS 2003 proceedings (Lecture Notes in Artificial Intelligence Vol. 2843).
ALT has been awarding the E. Mark Gold Award for the most outstanding
paper by a student author since 1999. This year the award was given to Sandra
Zilles for her paper “Intrinsic Complexity of Uniform Learning.”
This conference was the 14th in a series of annual conferences established in 1990.
Continuation of the ALT series is supervised by its steering committee, consisting of:
Thomas Zeugmann (Univ. of Lübeck, Germany), Chair; Arun Sharma (Univ. of New
South Wales, Australia), Co-chair; Naoki Abe (IBM T.J. Watson Research Center,
USA); Klaus Peter Jantke (DFKI, Germany); Phil Long (National Univ. of Singapore);
Hiroshi Motoda (Osaka Univ., Japan); Akira Maruoka (Tohoku Univ., Japan); Luc De
Raedt (Albert-Ludwigs-Univ., Germany); Takeshi Shinohara (Kyushu Institute of
Technology, Japan); and Osamu Watanabe (Tokyo Institute of Technology, Japan).
We would like to thank all individuals and institutions who contributed to
the success of the conference: the authors for submitting papers, the invited
speakers for accepting our invitation and lending us their insight into recent
developments in their research areas, as well as the sponsors for their generous
financial support.
Furthermore, we would like to express our gratitude to all program committee
members for their hard work in reviewing the submitted papers and participating
in on-line discussions. We are also grateful to the external referees whose reviews
made a considerable contribution to this process.

We are also grateful to the DS 2003 Chairs Yuzuru Tanaka (Hokkaido University,
Japan), Gunter Grieser (Technical University of Darmstadt, Germany), and Akihiro
Yamamoto (Hokkaido University, Japan) for their efforts in coordinating with ALT 2003,
and to Makoto Haraguchi and Yoshiaki Okubo (Hokkaido University, Japan) for their
excellent work on the local arrangements. Last but not least, Springer-Verlag provided
excellent support in preparing this volume.

August 2003

Ricard Gavaldà
Klaus P. Jantke
Eiji Takimoto
Organization

Conference Chair
Klaus P. Jantke DFKI GmbH Saarbrücken, Germany

Program Committee
Ricard Gavaldà (Co-Chair) Tech. Univ. of Catalonia, Spain
Eiji Takimoto (Co-Chair) Tohoku Univ., Japan
Hiroki Arimura Kyushu Univ., Japan
Shai Ben-David Technion, Israel
Nicolò Cesa-Bianchi Univ. di Milano, Italy
Nello Cristianini UC Davis, USA
François Denis LIF, Univ. de Provence, France
Kouichi Hirata Kyutech, Japan
Sanjay Jain Nat. Univ. Singapore, Singapore
Stephen Kwek Univ. Texas, San Antonio, USA
Phil Long Genome Inst. Singapore, Singapore
Yasubumi Sakakibara Keio Univ., Japan
Rocco Servedio Columbia Univ., USA
Hans-Ulrich Simon Ruhr-Univ. Bochum, Germany
Frank Stephan Univ. Heidelberg, Germany
Christino Tamon Clarkson Univ., USA

Local Arrangements
Makoto Haraguchi (Chair) Hokkaido Univ., Japan
Yoshiaki Okubo Hokkaido Univ., Japan

Subreferees
Kazuyuki Amano, Dana Angluin, Tijl De Bie, Laurent Brehelin, Christian Choffrut,
Pedro Delicado, Claudio Gentile, Rémi Gilleron, Sally Goldman, Joshua Goodman,
Colin de la Higuera, Hiroki Ishizaka, Jeffrey Jackson, Satoshi Kobayashi, Jean-Yves
Marion, Andrei E. Romashchenko, Hiroshi Sakamoto, Kengo Sato, Dale Schuurmans,
Chema Sempere, Shinichi Shimozono, Takeshi Shinohara, Robert Sloan, Lee Wee Sun,
Hisao Tamaki, Marc Tommasi, Takashi Yokomori

Sponsoring Institutions
The Japanese Ministry of Education, Culture, Sports, Science and Technology
The Suginome Memorial Foundation, Japan
Table of Contents

Invited Papers

Abduction and the Dualization Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Thomas Eiter

Signal Extraction and Knowledge Discovery Based on Statistical Modeling . . . . . 3
Genshiro Kitagawa

Association Computation for Information Access . . . . . . . . . . . . . . . . . . . . . . 15
Akihiko Takano

Efficient Data Representations That Preserve Information . . . . . . . . . . . . . . 16
Naftali Tishby

Can Learning in the Limit Be Done Efficiently? . . . . . . . . . . . . . . . . . . . . . . . 17
Thomas Zeugmann

Regular Contributions

Inductive Inference

Intrinsic Complexity of Uniform Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Sandra Zilles

On Ordinal VC-Dimension and Some Notions of Complexity . . . . . . . . . . . 54
Eric Martin, Arun Sharma, Frank Stephan

Learning of Erasing Primitive Formal Systems from Positive Examples . . . 69
Jin Uemura, Masako Sato

Changing the Inference Type – Keeping the Hypothesis Space . . . . . . . . . . 84
Frank Balbach

Learning and Information Extraction

Robust Inference of Relevant Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Jan Arpe, Rüdiger Reischuk

Efficient Learning of Ordered and Unordered Tree Patterns with Contractible
Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
Yusuke Suzuki, Takayoshi Shoudai, Satoshi Matsumoto,
Tomoyuki Uchida, Tetsuhiro Miyahara

Learning with Queries

On the Learnability of Erasing Pattern Languages in the Query Model . . . 129
Steffen Lange, Sandra Zilles

Learning of Finite Unions of Tree Patterns with Repeated Internal Structured
Variables from Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
Satoshi Matsumoto, Yusuke Suzuki, Takayoshi Shoudai,
Tetsuhiro Miyahara, Tomoyuki Uchida

Learning with Non-linear Optimization

Kernel Trick Embedded Gaussian Mixture Model . . . . . . . . . . . . . . . . . . . . . 159
Jingdong Wang, Jianguo Lee, Changshui Zhang

Efficiently Learning the Metric with Side-Information . . . . . . . . . . . . . . . . . . 175
Tijl De Bie, Michinari Momma, Nello Cristianini

Learning Continuous Latent Variable Models with Bregman Divergences . . . 190
Shaojun Wang, Dale Schuurmans

A Stochastic Gradient Descent Algorithm for Structural Risk Minimisation . . . 205
Joel Ratsaby

Learning from Random Examples

On the Complexity of Training a Single Perceptron with Programmable
Synaptic Delays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
Jiří Šíma

Learning a Subclass of Regular Patterns in Polynomial Time . . . . . . . . . . . 234
John Case, Sanjay Jain, Rüdiger Reischuk, Frank Stephan, Thomas Zeugmann

Identification with Probability One of Stochastic Deterministic Linear
Languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
Colin de la Higuera, Jose Oncina

Online Prediction

Criterion of Calibration for Transductive Confidence Machine with Limited
Feedback . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
Ilia Nouretdinov, Vladimir Vovk

Well-Calibrated Predictions from Online Compression Models . . . . . . . . . . 268
Vladimir Vovk

Transductive Confidence Machine Is Universal . . . . . . . . . . . . . . . . . . . . . . . . 283
Ilia Nouretdinov, Vladimir V'yugin, Alex Gammerman

On the Existence and Convergence of Computable Universal Priors . . . . . 298
Marcus Hutter

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
Abduction and the Dualization Problem

Thomas Eiter

Institut für Informationssysteme, Technische Universität Wien,
Favoritenstraße 9-11, A-1040 Wien, Austria
eiter@[Link]

Abduction is a fundamental mode of reasoning which was extensively studied by C.S.
Peirce, who also introduced the term for inference of explanations for observed phenomena.
Abduction has taken on increasing importance in Artificial Intelligence (AI) and
related disciplines, where it has been recognized as an important principle of common-sense
reasoning. It has applications in many areas of AI and Computer Science, including
diagnosis, database updates, planning, natural language understanding, and learning,
to name just a few.
In a logic-based setting, abduction can be seen as the task of finding, given a set of
formulas Σ (the background theory), a formula χ (the query), and a set of formulas
A (the abducibles or hypotheses), a minimal subset E of A such that Σ plus E is
satisfiable and logically entails χ (i.e., an explanation). In many application scenarios Σ
is a propositional Horn theory, χ is a literal or a conjunction of literals, and the abducibles
A are certain literals of interest. For use in practice, computing abductive explanations
in this setting is an important problem, for which well-known early systems such as
Poole's Theorist or assumption-based Truth Maintenance Systems were devised in
the 1980s. Since then, there has been a growing literature on this subject.
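The Horn setting above can be made concrete with a small sketch (an illustrative toy, not any of the systems mentioned; the function names are made up for this example). For a background theory of definite Horn clauses, entailment of an atom is decidable by forward chaining, and a subset-minimal explanation can be found by greedily shrinking the abducible set:

```python
def entails(clauses, facts, goal):
    """Forward chaining over definite Horn clauses given as (body, head) pairs."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in clauses:
            if head not in derived and all(b in derived for b in body):
                derived.add(head)
                changed = True
    return goal in derived

def minimal_explanation(clauses, abducibles, goal):
    """Greedily shrink the abducible set to a subset-minimal explanation E
    such that clauses plus E entail goal; returns None if none exists."""
    if not entails(clauses, abducibles, goal):
        return None
    explanation = set(abducibles)
    for a in sorted(explanation):          # fixed order, for reproducibility
        if entails(clauses, explanation - {a}, goal):
            explanation.discard(a)
    return explanation
```

For instance, with clauses battery_ok ∧ switch_on → light and generator_on → light and all three antecedents abducible, the greedy pass returns the single-abducible explanation {generator_on}. Note this yields one subset-minimal explanation; enumerating all of them is exactly the harder problem discussed below.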
Besides computing some arbitrary explanation for a query, the problem of generating
several or all explanations has received increasing attention in recent years. This problem is
important since often one would like to select one out of a set of alternative explanations
according to a preference or plausibility relation; this relation may be based on subjective
intuition which is difficult to formalize and thus cannot be implemented by an algorithm.
In general, a query may have exponentially many explanations, and thus generating all
explanations inevitably requires exponential time in general, even in propositional logic.
It is then of interest to know whether generating all explanations is feasible in polynomial
total time (aka output-polynomial time), i.e., in time polynomial in the combined size
of the input and the output. Furthermore, if exponential resources are prohibitive, it is of
interest to know whether a few explanations (e.g., polynomially many) can be generated
in polynomial time.
In recent and ongoing work, we have investigated the computational complexity of
generating all abductive explanations, and compiled a number of interesting results for
charting the tractability / intractability frontier of this problem. In this talk, we shall
recall some of the results and then focus on abduction from Horn theories represented
by their characteristic models. In this setting, the background theory T is represented by
a set of so-called characteristic models, char(T), rather than by formulas. The benefit
is that for certain formulas, logical consequence from T efficiently reduces to deciding
consequence from char(T) (which is easy) and thus admits tractable inference. In fact,
finding some abductive explanation for a query literal is polynomial in this setting, while
this is well-known to be NP-hard under formula-based representation.

R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 1–2, 2003.
© Springer-Verlag Berlin Heidelberg 2003
Computing all abductive explanations for a query literal, which arises in different
contexts, is known to be polynomial-time equivalent (in a precise sense) to the problem
of dualizing a Boolean function given by a monotone CNF. The latter problem, Monotone
Dualization, is, with respect to complexity, a somewhat mysterious problem which has
resisted a precise classification in terms of well-established complexity classes for more
than 20 years. Currently, no polynomial total-time algorithm solving this problem
is known; on the other hand, there is also no stringent evidence that such an algorithm is
unlikely to exist (like, e.g., coNP-hardness of the associated decision problem whether,
given two monotone CNFs ϕ and ψ, they represent dual functions). On the contrary,
results in the 1990s provided some hints that the problem is closer to polynomial total-time:
as shown by Fredman and Khachiyan, the decisional variant can be solved in
quasi-polynomial time, i.e., in time O(n^{log n}). This was recently refined to solvability
in polynomial time with limited nondeterminism, i.e., using a poly-logarithmic number
of bit guesses.
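The decision problem itself is easy to state from first principles: two monotone Boolean functions f and g are dual iff f(x) = ¬g(¬x) for every assignment x. The sketch below (hypothetical helper names; brute force over all 2^n assignments, so purely illustrative — the whole point of the results above is to do far better) checks this condition for monotone CNFs whose clauses are sets of positive variables:

```python
from itertools import product

def eval_monotone_cnf(cnf, assignment):
    """Evaluate a monotone CNF; each clause is a set of (positive) variables."""
    return all(any(assignment[v] for v in clause) for clause in cnf)

def are_dual(phi, psi, variables):
    """Decide duality by testing f_phi(x) == not f_psi(complement of x)
    on all 2^n assignments -- exponential, for illustration only."""
    for bits in product([False, True], repeat=len(variables)):
        x = dict(zip(variables, bits))
        neg_x = {v: not b for v, b in x.items()}
        if eval_monotone_cnf(phi, x) != (not eval_monotone_cnf(psi, neg_x)):
            return False
    return True
```

For example, (a ∨ b) and (a)(b) (i.e., a ∧ b) are dual, while (a ∨ b) is not self-dual.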
Apart from this peculiarity, Monotone Dualization has been recognized as an important
problem since a large number of other problems in Computer Science are known to be
polynomial-time equivalent to it. It has a role similar to that of SAT for the class NP:
a polynomial total-time algorithm for Monotone Dualization implies polynomial
total-time algorithms for all the polynomial-time equivalent problems.
We will consider some possible extensions of the results for abductive explanations
which are polynomial-time equivalent to Monotone Dualization. Besides generating
all abductive explanations for a literal, there are many other problems in Knowledge
Discovery and Data Mining which are polynomial-time equivalent or closely related to
Monotone Dualization, including learning with oracles, computation of infrequent and
frequent sets, and key generation. We shall give a brief account of such problems, and
finally conclude with some open problems and issues for future research.
The results presented are joint work with Kazuhisa Makino, Osaka University.
Association Computation for Information Access

Akihiko Takano

National Institute of Informatics
Hitotsubashi, Chiyoda, Tokyo 101-8430 Japan
aki@[Link]

Abstract. GETA (Generic Engine for Transposable Association) is a software system
that provides efficient generic computation for association. It enables the quantitative
analysis of various proposed methods based on association, such as measuring similarity
among documents or words. A scalable implementation of GETA can handle large
corpora of twenty million documents, and provides the implementation basis for
effective next-generation information access.
DualNAVI is an information retrieval system that successfully demonstrates the power
and flexibility of GETA-based computation for association. It provides users with rich
interaction both in document space and in word space. Its dual-view interface always
returns the retrieved results in two views: a list of titles for document space and a
"Topic Word Graph" for word space. The two views are tightly coupled by their
cross-reference relation, which invites the user to further interaction. The two-stage
approach in the associative search, which is the key to its efficiency, also facilitates
content-based correlation among databases. In this paper we describe the basic features
of GETA and DualNAVI.


The full version of this paper is published in the Proceedings of the 6th International
Conference on Discovery Science, Lecture Notes in Artificial Intelligence Vol. 2843.

R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, p. 15, 2003.
© Springer-Verlag Berlin Heidelberg 2003
Efficient Data Representations That Preserve Information

Naftali Tishby

School of Computer Science and Engineering and Center for Neural Computation
The Hebrew University, Jerusalem 91904, Israel
tishby@[Link]

Abstract. A fundamental issue in computational learning theory, as well as in biological
information processing, is the best possible relationship between model representation
complexity and its prediction accuracy. Clearly, we expect more complex models that
require longer data representation to be more accurate. Can one provide a quantitative,
yet general, formulation of this trade-off? In this talk I will discuss this question from
the perspective of Shannon's Information Theory. I will argue that this trade-off can be
traced back to the basic duality between source and channel coding, and is also related
to the notion of "coding with side information". I will review some of the theoretical
achievability results for such relevant data representations and discuss our algorithms
for extracting them. I will then demonstrate the application of these ideas to the analysis
of natural language corpora and speculate on possibly universal aspects of human
language that they reveal.
Based on joint work with Ran Bacharach, Gal Chechik, Amir Globerson, Amir Navot,
and Noam Slonim.

R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, p. 16, 2003.
© Springer-Verlag Berlin Heidelberg 2003
Can Learning in the Limit Be Done Efficiently?

Thomas Zeugmann

Institut für Theoretische Informatik, Universität zu Lübeck,
Wallstraße 40, 23560 Lübeck, Germany
thomas@[Link]

Abstract. Inductive inference can be considered one of the fundamental paradigms of
algorithmic learning theory. We survey recently obtained results and show their impact
on potential applications.
Since the main focus is on the efficiency of learning, we also deal with postulates of
naturalness and their impact on the efficiency of limit learners. In particular, we look at
the learnability of the class of all pattern languages and ask whether or not one can
design a learner within the paradigm of learning in the limit that is nevertheless efficient.
To achieve this goal, we deal with iterative learning and its interplay with the hypothesis
spaces allowed. This interplay also has a severe impact on the postulates of naturalness
satisfiable by any learner.
Finally, since a limit learner is only supposed to converge in the limit, one never knows
at any particular learning stage whether or not the learner has already succeeded. The
resulting uncertainty may be prohibitive in many applications. We survey results that
resolve this problem by outlining a new learning model, called stochastic finite learning.
Though pattern languages can neither be finitely inferred from positive data nor
PAC-learned, our approach can be extended to a stochastic finite learner that exactly
infers all pattern languages from positive data with high confidence.

1 Introduction

Inductive inference can be considered as one of the fundamental paradigms of
algorithmic learning theory. In particular, inductive inference of recursive functions
and of recursively enumerable languages has been studied intensively within the
last four decades (cf., e.g., [3,4,30,16]). The basic model considered within this
framework is learning in the limit, which can be informally described as follows.
The learner receives more and more data about the target and maps these data
to hypotheses. Of special interest is the investigation of scenarios in which the
sequence of hypotheses stabilizes to an accurate and finite description (e.g., a
grammar, a program) of the target. Clearly, then some form of learning must
have taken place. Here by data we mean either any infinite sequence of
argument–value pairs (in the case of learning recursive functions) such that all
arguments appear eventually, or any infinite sequence of all members of the target language
(in the case of language learning from positive data). Alternatively, one can also
study language learning from both positive and negative data.

R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 17–38, 2003.
© Springer-Verlag Berlin Heidelberg 2003
Most of the work done in the field has been aimed at the following goals: showing
what general collections of function classes or language classes are learnable,
characterizing those collections of classes that can be learned, studying the impact
of several postulates on the behavior and learning power of learners, and dealing
with the influence of various parameters on the efficiency of learning.
However, defining an appropriate measure for the complexity of learning in the
limit has turned out to be quite difficult (cf. Pitt [31]). Moreover, whenever learning
in the limit is done, in general one never knows whether or not the learner has
already converged. This is because, in general, it is undecidable whether or not
convergence has already occurred; and even when it is decidable, deciding it is
practically infeasible. Thus, there is always an uncertainty which may not be
tolerable in many applications of learning.
Therefore, different learning models have been proposed. In particular, Valiant's [46]
model of probably approximately correct (abbr. PAC) learning has been very
influential. As a matter of fact, this model puts strong emphasis on the efficiency
of learning and avoids the problem of convergence altogether. In the PAC model,
the learner receives a finite labeled sample of the target concept and outputs, with
high probability, a hypothesis that is approximately correct. The sample is drawn
with respect to an unknown probability distribution, and the error of, as well as
the confidence in, the hypothesis are measured with respect to this distribution,
too. Thus, if a class is PAC learnable, one obtains nice performance guarantees.
Unfortunately, many interesting concept classes are not PAC learnable.
Consequently, one has to look for other models of learning, or one is back to
learning in the limit. So, let us assume that learning in the limit is our method
of choice. What we would like to present in this survey is a rather general way to
transform learning in the limit into stochastic finite learning. It should also be
noted that our ideas may be beneficial even in the case that the considered concept
class is PAC learnable.
Furthermore, we aim to outline how a thorough study of the limit learnability
of concept classes may nicely contribute to supporting our new approach. We
exemplify the research undertaken by mainly looking at the class of all pattern
languages introduced by Angluin [1]. As Salomaa [37] has put it, "Patterns are
everywhere," and thus we believe that our research is worth the effort undertaken.
There are several problems that have to be addressed when dealing with the
learnability of pattern languages. First, the nice thing about patterns is that
they are very intuitive. Therefore, it seems desirable to design learners outputting
patterns as their hypotheses. Unfortunately, membership is known to be
NP-complete for the pattern languages (cf. [1]). Thus, many of the usual approaches
used in machine learning will directly lead to infeasible learning algorithms. As
a consequence, we shall ask what kind of appropriate hypothesis spaces can be
used at all to learn the pattern languages, and what the appropriate learning
strategies are.
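To make the membership problem concrete: a pattern is a string over constants and variables, and a word belongs to the pattern's (non-erasing) language iff some substitution of nonempty strings for the variables yields the word. The backtracking matcher below is a toy sketch (uppercase-symbols-as-variables is merely an assumed convention here, and the function name is made up); it tries every possible binding and thus takes exponential time in the worst case, in line with the NP-completeness just mentioned:

```python
def in_pattern_language(pattern, word, subst=None):
    """Naive membership test for non-erasing pattern languages.
    `pattern` is a list of symbols; uppercase symbols act as variables."""
    if subst is None:
        subst = {}
    if not pattern:
        return word == ""
    head, rest = pattern[0], pattern[1:]
    if not head.isupper():                    # constant: must match literally
        return word.startswith(head) and \
            in_pattern_language(rest, word[len(head):], subst)
    if head in subst:                         # variable already bound
        s = subst[head]
        return word.startswith(s) and \
            in_pattern_language(rest, word[len(s):], subst)
    for k in range(1, len(word) + 1):         # try every nonempty prefix
        subst[head] = word[:k]
        if in_pattern_language(rest, word[k:], subst):
            return True
        del subst[head]
    return False
```

For the pattern XaX (variable X, constant a), the word "bab" is in the language (substitute X = b), while "baab" is not, since every word generated by XaX has odd length.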

In particular, we shall deal with the problem of redundancy in the hypothesis
space chosen, with consistency, conservativeness, and iterative learning. Here
consistency means that the intermediate hypotheses output by the learner
correctly reflect the data seen so far. Conservativeness addresses the problem
of avoiding overgeneralization, i.e., preventing the learner from guessing a proper
superset of the target language. These requirements are naturally arising desiderata,
but this does not mean that they can be fulfilled. In iterative learning, the
learning machine, in making a conjecture, has access only to its previous conjecture
and the latest data item coming in. Iterative learning is also a natural requirement
whenever learning in the limit is concerned, since no practical learner can
process all examples provided so far at every learning stage; it may not even be
able to store them.
Finally, we address the question of how efficiently the overall learning process can
be performed, and how we can get rid of the uncertainty of not knowing whether
or not the learner has already converged.

2 Preliminaries

Unspecified notation follows Rogers [35]. By N = {0, 1, 2, . . .} we denote the
set of all natural numbers. We set N+ = N \ {0}. The cardinality of a set S is
denoted by |S|. Let ∅, ∈, ⊂, ⊆, ⊃, and ⊇ denote the empty set, element
of, proper subset, subset, proper superset, and superset, respectively.
Let ϕ0, ϕ1, ϕ2, . . . denote any fixed acceptable programming system for all
(and only) the partial recursive functions over N (cf. Rogers [35]). Then ϕk is
the partial recursive function computed by program k.
In the following subsection we define the main learning models considered
within this paper.

2.1 Learning in the Limit

Gold's [12] model of learning in the limit allows one to formalize a rather general
class of learning problems, i.e., learning from examples. For defining this model
we assume any recursively enumerable set X and refer to it as the learning
domain. By ℘(X) we denote the power set of X. Let C ⊆ ℘(X), and let c ∈ C
be non-empty; then we refer to C and c as a concept class and a concept,
respectively. Let c be a concept, and let t = (xj)j∈N be any infinite sequence of
elements xj ∈ c such that range(t) =df {xj | j ∈ N} = c. Then t is said to be
a positive presentation or, synonymously, a text for c. By text(c) we denote the
set of all positive presentations for c. Moreover, let t be a positive presentation,
and let y ∈ N. Then we set ty = x0, . . . , xy, i.e., ty is the initial segment of t
of length y + 1, and t+y =df {xj | j ≤ y}. We refer to t+y as the content of ty.
Furthermore, let σ = x0, . . . , xn−1 be any finite sequence. Then we use |σ|
to denote the length n of σ, and let content(σ) and σ+, respectively, denote
the content of σ. Additionally, let t be a text and let τ be a finite sequence;
then we use σ ⋄ t and σ ⋄ τ to denote the sequence obtained by concatenating σ
onto the front of t and τ, respectively.
Alternatively, one can also consider complete presentations or, synonymously,
informants. Let c be a concept; then any sequence i = (xj, bj)j∈N of labeled
examples, where bj ∈ {+, −}, such that {xj | j ∈ N} = X,
i+ = {xj | (xj, bj) = (xj, +), j ∈ N} = c, and i− = {xj | (xj, bj) = (xj, −), j ∈
N} = X \ c is called an informant for c. For the sake of presentation, the
following definitions are only given for the text case; the generalization to the
informant case should be obvious. We sometimes use the term data sequence to
refer to both text and informant, respectively.
An inductive inference machine (abbr. IIM) is an algorithm that takes as
input larger and larger initial segments of a text and outputs, after each input, a
hypothesis from a prespecified hypothesis space H = (hj)j∈N. The indices j are
regarded as suitable finite encodings of the concepts described by the hypotheses.
A hypothesis j is said to describe a concept c iff c = hj.
Definition 1. Let C be any concept class, and let H = (hj)j∈N be a hypothesis
space for it. C is called learnable in the limit from text iff there is an IIM
M such that for every c ∈ C and every text t for c,

(1) for all n ∈ N+, M(tn) is defined,
(2) there is a j such that c = hj and for all but finitely many n ∈ N+,
M(tn) = j.

By LimTxt we denote the collection of all concept classes C that are learnable
in the limit from text.¹ Note that instead of LimTxt, sometimes TxtEx is used.
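Definition 1 can be illustrated by the classic identification-by-enumeration strategy: after each datum, conjecture the least index whose concept contains everything seen so far. The sketch below is an illustrative toy (the class c_j = {0, . . . , j} and all names are assumptions for this example, not taken from the survey); on any text for c_k the conjectures stabilize on k, so the class is learned in the limit:

```python
def enumerative_learner(text, member):
    """Identification by enumeration: after each datum, output the least
    index j such that the j-th concept contains everything seen so far."""
    seen = set()
    for x in text:
        seen.add(x)
        j = 0
        while not all(member(j, y) for y in seen):
            j += 1
        yield j

# Toy indexable class with uniformly decidable membership: c_j = {0, 1, ..., j}.
member = lambda j, y: y <= j
```

On the finite text prefix 0, 3, 1, 2, 3 of a text for c_3, the conjectures are 0, 3, 3, 3, 3: after the datum 3 appears, the least consistent index is 3 and never changes again.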
Note that Definition 1 does not contain any requirement concerning efficiency.
Before we are going to deal with efficiency, we want to point to another crucial
parameter of our learning model, i.e., the hypothesis space H . Since our goal
is algorithmic learning, we can consider the special case that X = N and let C
be any subset of the collection of all recursively enumerable sets over N . Let
Wi = domain(ϕi ) , where ϕi denotes the i -th partial recursive function in a standard enumeration. In this case, (Wj )j∈N is the most general hypothesis space.
Within this setting many learning problems can be described. Moreover,
this setting has been used to study the general capabilities of different learning
models which can be obtained by suitable modifications of Definition 1. There are
numerous papers performing studies along this line of research (cf., e.g., [16,30]
and the references therein). On the one hand, the results obtained considerably
broaden our general understanding of algorithmic learning. On the other hand,
one has also to ask what kind of consequences one may derive from these results
for practical learning problems. This is a non-trivial question, since the setting
of learning recursively enumerable languages is very rich. Thus, it is conceivable
that several of the phenomena observed hold in this setting due to the fact that too many sets are recursively enumerable and have no counterparts within the world of efficient computability.

¹ If learning from informant is considered, we use LimInf to denote the collection of all concept classes C that are learnable in the limit from informant.

Can Learning in the Limit Be Done Efficiently? 21
As a first step to address this question we mainly consider the scenario
that indexable concept classes with uniformly decidable membership have to
be learned (cf. Angluin [2]). A class of non-empty concepts C is said to be an indexable class with uniformly decidable membership provided there are an effective enumeration c0 , c1 , c2 , . . . of all and only the concepts in C and a recursive function f such that, for all j ∈ N and all elements x ∈ X ,

    f (j, x) = 1 , if x ∈ cj , and f (j, x) = 0 , otherwise.

In the following we refer to indexable classes with uniformly decidable membership simply as indexable classes. Furthermore, we call any enumeration (cj )j∈N of C with uniformly decidable membership problem an indexed family.
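As a toy illustration of the definition (a hypothetical family, not one from the paper), take cj to be the set of all multiples of j + 1 over N; the witnessing recursive function f is then a one-liner.

```python
# Uniformly decidable membership for the hypothetical indexed family
# c_j = { multiples of j+1 } over N (an illustration, not from the paper).
def f(j, x):
    """Recursive f with f(j, x) = 1 iff x is in c_j, and 0 otherwise."""
    return 1 if x % (j + 1) == 0 else 0
```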
Since the paper of Angluin [2], learning of indexable concept classes has attracted much attention (cf., e.g., Zeugmann and Lange [51]). Let us briefly present some well-known indexable classes. Let Σ be any finite alphabet of symbols, and let X be the free monoid over Σ , i.e., X = Σ∗ . We set Σ+ = Σ∗ \ {λ} , where λ denotes the empty string. As usual, we refer to subsets L ⊆ X as languages. Then the class of all regular languages, the class of all context-free languages, and the class of all context-sensitive languages are indexable classes.
Next, let Xn = {0, 1}^n be the set of all n -bit Boolean vectors. We consider X = ⋃_{n≥1} Xn as learning domain. Then, the set of all concepts expressible as a monomial, a k -CNF, a k -DNF, and a k -decision list form indexable classes.
When learning indexable classes C , it is generally assumed that the hy-
pothesis space H has to be an indexed family, too. We distinguish class pre-
serving learning and class comprising learning defined by C = range(H) and
C ⊆ range(H) , respectively. When dealing with class preserving learning, one
has the freedom to choose as hypothesis space a possibly different enumeration of
the target class C . In contrast, when class comprising learning is concerned, the
hypothesis space may enumerate, additionally, languages not belonging to C .
Note that, in general, one has to allow class comprising hypothesis spaces to
obtain the maximum possible learning power (cf. Lange and Zeugmann [20,22]).
Finally, we call a hypothesis space redundant if it is larger than necessary, i.e., there is at least one hypothesis in H not describing any concept from the target class, or one concept possesses at least two different descriptions in H . Thus, non-redundant hypothesis spaces are as small as possible.
Formally, a hypothesis space H = (hj )j∈N is non-redundant for some target concept class C iff range(H) = C and hi ≠ hj for all i, j ∈ N with i ≠ j . Otherwise, H is a redundant hypothesis space for C .
Next, let us come back to the issue of efficiency. Looking at Definition 1
we see that an IIM M has always access to the whole history of the learning
process, i.e., in order to compute its actual guess M is fed all examples seen so
far. In contrast to that, next we define iterative IIMs. An iterative IIM is only
allowed to use its last guess and the next element in the positive presentation of
the target concept for computing its actual guess. Conceptually, an iterative IIM M defines a sequence (Mn )n∈N of machines, each of which takes as its input the output of its predecessor.
Definition 2 (Wiehagen [47]). Let C be a concept class, let c be a concept, and let H = (hj )j∈N be a hypothesis space. An IIM M ItLimTxt H -infers c iff for every t = (xj )j∈N ∈ text(c) the following conditions are satisfied:

(1) for all n ∈ N , Mn (t) is defined, where M0 (t) =df M (x0 ) and, for all n ≥ 0 , Mn+1 (t) =df M (Mn (t), xn+1 ) ,
(2) the sequence (Mn (t))n∈N converges to a number j such that c = hj .

Finally, M ItLimTxt H -infers C iff, for each c ∈ C , M ItLimTxt H -infers c .
In the latter definition Mn (t) denotes the (n+1) th hypothesis output by M
when successively fed the text t . Thus, it is justified to make the following
convention. Let σ = x0 , . . . , xn be any finite sequence of elements over the
relevant learning domain. Moreover, let C be any concept class over X , and
let M be any IIM that iteratively learns C . Then we denote by My (σ) the
(y + 1) th hypothesis output by M when successively fed σ provided y ≤ n ,
and there exists a concept c ∈ C with σ + ⊆ c . Furthermore, we let M∗ (σ)
denote M|σ|−1 (σ) .
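To illustrate Definition 2, here is a minimal sketch (again for the hypothetical class cj = {0, . . . , j}, not an algorithm from the paper): the iterative learner may store only its last guess, never the text itself, and updates that guess from the newest datum alone.

```python
# An iterative IIM for the hypothetical class c_j = {0, ..., j}:
# the learner keeps only its previous hypothesis, not the whole text.
def M0(x0):
    """First hypothesis M_0(t) = M(x_0), computed from the first datum."""
    return x0

def M(prev_guess, x):
    """M_{n+1}(t) = M(M_n(t), x_{n+1}): last guess plus one new datum."""
    return max(prev_guess, x)

text = [2, 0, 4, 1, 4, 3]          # a text prefix for c_4 = {0, 1, 2, 3, 4}
guess = M0(text[0])
for x in text[1:]:
    guess = M(guess, x)            # constant memory: one integer suffices
```

The constant-memory update is exactly what distinguishes iterative learning from Definition 1, where M re-reads the whole initial segment at every step.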
Moreover, whenever learning a concept class from text, a major problem
one has to deal with is avoiding or detecting overgeneralization. An overgener-
alization occurs if the learner is guessing a superconcept of the target concept.
Clearly, such an overgeneralized guess cannot be detected by using the incoming
positive data only. Therefore, one may be tempted to disallow overgeneralized
guesses at all. Learners behaving thus are called conservative. Intuitively speaking, a conservative IIM maintains its actual hypothesis at least as long as it has not seen data contradicting it. More formally, an IIM M is said to be conservative iff for all concepts c in the target class C , all texts t for c , and all y, z ∈ N : if M (ty ) ≠ M (ty+z ) , then t+y+z ⊄ hM (ty ) .
Another property of learners quite often found in the literature is consistency.
Informally, a learner is called consistent if all its intermediate hypotheses do
correctly reflect the data seen so far. More formally, an IIM M is said to be
consistent iff t+ x ⊆ hM (tx ) for all x ∈ N and every text t for every concept c
in the target class C .
Whenever one talks about the efficiency of learning besides the storage needed
by the learner one has also to consider the time complexity of the learner. When
talking about the time complexity of learning, it does not suffice to consider the
time needed to compute the actual guess. What really counts in applications is
the overall time needed until successful learning. Therefore, following Daley and
Smith [10] we define the total learning time as follows.
Let C be any concept class, and let M be any IIM that learns C in the
limit. Then, for every c ∈ C and every text t for c , let

    Conv (M, t) =df the least number m ∈ N+ such that M (tn ) = M (tm ) for all n ≥ m

denote the stage of convergence of M on t (cf. [12]). Note that Conv (M, t) = ∞ if M does not learn the target concept from its text t . Moreover, by TM (tn ) we denote the time to compute M (tn ) . We measure this time as a function of the length of the input and call it the update time. Finally, the total learning time taken by the IIM M on successive input t is defined as

    TT (M, t) =df Σ_{n=1}^{Conv (M,t)} TM (tn ) .

Clearly, if M does not learn the target concept from text t then the total
learning time is infinite.
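The two quantities can be illustrated by a small simulation (assumptions: a toy enumeration learner for the hypothetical class cj = {0, . . . , j}, the update time TM(tn) modeled simply as the length n of the input, and convergence occurring within the given finite text; none of this is the paper's concrete analysis).

```python
# Empirical stage of convergence and total learning time for a toy learner.
def M(seg):
    """Toy IIM for c_j = {0, ..., j}: the least consistent index is max(seg)."""
    return max(seg)

def conv(M, text):
    """Least m (1-based) with M(t_n) = M(t_m) for all n >= m within `text`
    (assumes the learner has converged before the text runs out)."""
    outs = [M(text[:n + 1]) for n in range(len(text))]
    m = len(outs) - 1
    while m > 0 and outs[m - 1] == outs[-1]:
        m -= 1
    return m + 1

def total_learning_time(M, text):
    """TT(M, t) = sum of update times T_M(t_n) for n = 1, ..., Conv(M, t),
    here modeling T_M(t_n) as n, the number of examples processed."""
    c = conv(M, text)
    return sum(n + 1 for n in range(c))

text = [1, 0, 3, 2, 3, 3]          # learner converges at stage 3 (on seeing 3)
```

On this text the guesses are 1, 1, 3, 3, 3, 3, so Conv = 3 and TT = 1 + 2 + 3 = 6.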
Two more remarks are in order here. First, it has been argued elsewhere that
within the learning in the limit paradigm a learning algorithm is invoked only
when the current hypothesis has some problem with the latest observed data.
However, such a viewpoint implicitly assumes that membership in the target
concept is decidable in time polynomial in the length of the actual input. This
may not be the case. Thus, directly testing consistency would immediately lead to
a non-polynomial update time provided membership is not known to be in P .
Second, Pitt [31] addresses the question with respect to what parameter one
should measure the total learning time. In the definition given above this param-
eter is the length of all examples seen so far. Clearly, now one could try to play
with this parameter by waiting for a large enough input before declaring suc-
cess. However, when dealing with the learnability of non-trivial concept classes,
in the worst-case the total learning time will be anyhow unbounded. Thus, it
does not make much sense to deal with the worst-case. Instead, we shall study
the expected total learning time. In such a setting one cannot simply wait for
long enough inputs. Therefore, using the definition of total learning time given
above seems to be reasonable.
Next, we define important concept classes which we are going to consider
throughout this survey.

2.2 The Pattern Languages

Following Angluin [1] we define patterns and pattern languages as follows. Let
A = {0, 1, . . .} be any non-empty finite alphabet containing at least two ele-
ments. By A∗ we denote the free monoid over A . The set of all finite non-null
strings of symbols from A is denoted by A+ , i.e., A+ = A∗ \ {λ} , where
λ denotes the empty string. Let X = {xi | i ∈ N} be an infinite set of variables such that A ∩ X = ∅ . Patterns are non-empty strings over A ∪ X , e.g.,
01, 0x0 111, 1x0 x0 0x1 x2 x0 are patterns. The length of a string s ∈ A∗ and of
a pattern π is denoted by |s| and |π| , respectively. A pattern π is in canon-
ical form provided that if k is the number of different variables in π then the
variables occurring in π are precisely x0 , . . . , xk−1 . Moreover, for every j with
0 ≤ j < k − 1 , the leftmost occurrence of xj in π is to the left of the leftmost occurrence of xj+1 . The examples given above are patterns in canonical form. In the
sequel we assume, without loss of generality, that all patterns are in canonical
form. By Pat we denote the set of all patterns in canonical form.
If k is the number of different variables in π then we refer to π as to
a k -variable pattern. By Pat k we denote the set of all k -variable patterns.
Furthermore, let π ∈ Pat k , and let u0 , . . . , uk−1 ∈ A+ ; then we denote by
π[x0 /u0 , . . . , xk−1 /uk−1 ] the string w ∈ A+ obtained by substituting uj for
each occurrence of xj , j = 0, . . . , k − 1 , in the pattern π . For example, let
π = 0x0 1x1 x0 . Then π[x0 /10, x1 /01] = 01010110 . The tuple (u0 , . . . , uk−1 ) is
called a substitution. Furthermore, if |u0 | = · · · = |uk−1 | = 1 , then we refer to
(u0 , . . . , uk−1 ) as a shortest substitution. Let π ∈ Pat k ; we define the language generated by pattern π by L(π) = {π[x0 /u0 , . . . , xk−1 /uk−1 ] | u0 , . . . , uk−1 ∈ A+ } . By PAT k we denote the set of all k -variable pattern languages. Finally, PAT = ⋃_{k∈N} PAT k denotes the set of all pattern languages over A .
Furthermore, we let Q range over finite sets of patterns and define L(Q) = ⋃_{π∈Q} L(π) , i.e., the union of all pattern languages generated by patterns from Q . Moreover, we use Pat(k) and PAT (k) to denote the family of all unions of at most k canonical patterns and the family of all unions of at most k pattern languages, respectively. That is, Pat(k) = {Q | Q ⊆ Pat, |Q| ≤ k} and PAT (k) = {L | (∃Q ∈ Pat(k))[L = L(Q)]} . Finally, let L ⊆ A+ be a language, and let k ∈ N+ ; we define Club(L, k) = {Q | |Q| ≤ k, L ⊆ L(Q), (∀Q′ )[Q′ ⊂ Q ⇒ L ⊄ L(Q′ )]} . Club stands for consistent least upper bounds.
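The definitions of substitution and of L(π) can be made concrete with a short sketch (not an algorithm from the paper): patterns are represented as token lists with variables as integers and constants as characters, and membership in L(π) is decided by naive backtracking over non-empty substitution strings, which is exponential in general, in line with the NP-completeness of the membership problem mentioned later.

```python
# Patterns as token lists: ints are variables x_j, strings are constants.
def substitute(pattern, subs):
    """pi[x_0/u_0, ..., x_{k-1}/u_{k-1}]: replace variable j by subs[j]."""
    return "".join(tok if isinstance(tok, str) else subs[tok] for tok in pattern)

def in_language(pattern, w, binding=None):
    """True iff w is in L(pi): every variable maps to some non-empty string,
    consistently across all of its occurrences (naive backtracking search)."""
    binding = {} if binding is None else binding
    if not pattern:
        return w == ""
    tok, rest = pattern[0], pattern[1:]
    if isinstance(tok, str):                       # constant symbol
        return w.startswith(tok) and in_language(rest, w[len(tok):], binding)
    if tok in binding:                             # variable seen before
        u = binding[tok]
        return w.startswith(u) and in_language(rest, w[len(u):], binding)
    for i in range(1, len(w) + 1):                 # try every non-empty prefix
        binding[tok] = w[:i]
        if in_language(rest, w[i:], binding):
            return True
        del binding[tok]                           # backtrack
    return False

# pi = 0 x0 1 x1 x0, and pi[x0/10, x1/01] = 01010110 as in the text above.
pi = ["0", 0, "1", 1, 0]
```

Note how the repeated variable x0 forces the matcher to check consistency of both occurrences, the source of the combinatorial blow-up.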
The pattern languages have been intensively investigated (cf., e.g., Salo-
maa [37,38], and Shinohara and Arikawa [43] for an overview). Nix [29] as well as
Shinohara and Arikawa [43] outlined interesting applications of pattern inference
algorithms. For example, pattern language learning algorithms have been suc-
cessfully applied for solving problems in molecular biology (cf., e.g., Shimozono
et al. [39], Shinohara and Arikawa [43]).
As it turned out, pattern languages and finite unions of pattern languages are
subclasses of Smullyan’s [45] elementary formal systems (abbr. EFS). Arikawa et
al. [5] have shown that EFS can also be treated as a logic programming language
over strings. Recently, the techniques for learning finite unions of pattern lan-
guages have been extended to show the learnability of various subclasses of EFS
(cf. Shinohara [42]). The investigations of the learnability of subclasses of EFSs
are interesting because they yield corresponding results about the learnability
of subclasses of logic programs. Hence, these results are also of relevance for
Inductive Logic Programming (ILP) [28,23,8,24]. Miyano et al. [26] intensively
studied the polynomial-time learnability of EFSs.
Therefore, we may consider the learnability of pattern languages and of unions thereof as a nice test bed for seeing what kind of results one may obtain by considering the corresponding learning problems within the setting of learning in the limit.

3 Results

Within this section we ask whether or not the pattern languages and finite unions
thereof can be learned efficiently. The principal learnability of the pattern lan-
guages from text with respect to the hypothesis space Pat has been established
by Angluin [1]. However, her algorithm is based on computing descriptive pat-
terns for the data seen so far. Here a pattern π is said to be descriptive (for
the set S of strings contained in the input provided so far) if π can generate
all strings contained in S and no other pattern with this property generates a
proper subset of the language generated by π . Since no efficient algorithm is
known for computing descriptive patterns, and finding a descriptive pattern of
maximum length is N P -hard, its update time is practically intractable.
There are also serious difficulties when trying to learn the pattern languages
within the PAC model introduced by Valiant [46]. In the original model, the sam-
ple complexity depends exclusively on the VC dimension of the target concept
class and the error and confidence parameters ε and δ , respectively. Recently,
Mitchell et al. [25] have shown that even the class of all one-variable pattern
languages has infinite VC dimension. Consequently, even this special subclass
of PAT is not uniformly PAC learnable. Moreover, Schapire [40] has shown
that pattern languages are not PAC learnable in the generalized model provided P/poly ≠ N P/poly with respect to every hypothesis space for PAT that is
uniformly polynomially evaluable. Though this result highlights the difficulty of
PAC learning PAT it has no clear application to the setting considered in this
paper, since we aim to learn PAT with respect to the hypothesis space Pat .
Since the membership problem for this hypothesis space is N P -complete, it is
not polynomially evaluable (cf. [1]).
In contrast, Kearns and Pitt [18] have established a PAC learning algorithm
for the class of all k -variable pattern languages. Positive examples are gener-
ated with respect to arbitrary product distributions while negative examples are
allowed to be generated with respect to any distribution. In their algorithm the
length of substitution strings is required to be polynomially related to the length
of the target pattern. Finally, they use as hypothesis space all unions of polynomially many patterns that have k or fewer variables.² The overall learning time of their PAC learning algorithm is polynomial in the length of the target pattern, the bound for the maximum length of substitution strings, 1/ε , 1/δ , and |A| . The constant in the running time achieved depends doubly exponentially on k , and thus their algorithm becomes rapidly impractical when k increases.

² More precisely, the number of allowed unions is at most poly(|π|, s, 1/ε, 1/δ, |A|) , where π is the target pattern, s is the bound on the length of substitution strings, ε and δ are the usual error and confidence parameters, respectively, and A is the alphabet of constants over which the patterns are defined.
Finally, Lange and Wiehagen [19] have proposed an inconsistent but iterative
and conservative algorithm that learns PAT with respect to Pat . We shall
study this algorithm below in much more detail.
But before doing it, we aim to figure out under which circumstances iterative
learning of PAT is possible at all. A first answer is given by the following
theorems from Case et al. [9]. Note that Pat is a non-redundant hypothesis
space for PAT .
Theorem 1 (Case et al. [9]). Let C be any concept class, and let H =
(hj )j∈N be any non-redundant hypothesis space for C . Then, every IIM M that
ItLimTxt H -infers C is conservative.
Proof. Suppose the converse, i.e., there are a concept c ∈ C , a text t = (xj )j∈N ∈ text(c) , and a y ∈ N such that, for j = M∗ (ty ) and k = M∗ (ty+1 ) = M (j, xy+1 ) , both j ≠ k and t+y+1 ⊆ hj are satisfied. The latter implies xy+1 ∈ hj , and thus we may consider the following text t̃ ∈ text(hj ) . Let t̂ = (x̂j )j∈N be any text for hj , and let t̃ = x̂0 , xy+1 , x̂1 , xy+1 , x̂2 , . . . Since M has to learn hj from t̃ , there must be a z ∈ N such that M∗ (t̃z+r ) = j for all r ≥ 0 . But M∗ (t̃2z+1 ) = M (j, xy+1 ) = k ≠ j , a contradiction. □

Next, we point to another peculiarity of PAT , i.e., it meets the superset
condition defined as follows. Let C be any indexable class. C meets the superset condition if, for all c, c′ ∈ C , there is some ĉ ∈ C that is a superset of both c and c′ .
Theorem 2 (Case et al. [9]). Let C be any indexable class meeting the
superset condition, and let H = (hj )j∈N be any non-redundant hypothesis space
for C . Then, every consistent IIM M that ItLimTxt H -infers C may be used
to decide the inclusion problem for H .
Proof. Let X be the underlying learning domain, and let (wj )j∈N be an
effective enumeration of all elements in X . Then, for every i ∈ N , ti = (xij )j∈N
is the following computable text for hi . Let z be the least index such that
wz ∈ hi . Recall that, by definition, hi ≠ ∅ , since H is an indexed family, and
thus wz must exist. Then, for all j ∈ N , we set xij = wj , if wj ∈ hi , and
xij = wz , otherwise.
We claim that the following algorithm Inc decides, for all i, k ∈ N , whether
or not hi ⊆ hk .
Algorithm Inc: “On input i, k ∈ N do the following:
Determine the least y ∈ N with i = M∗ (tiy ) . Test whether or not ti,+y ⊆ hk . In case it is, output ‘Yes,’ and stop. Otherwise, output ‘No,’ and stop.”
Clearly, since H is an indexed family and ti is a computable text, Inc is
an algorithm. Moreover, M learns hi on every text for it, and H is a non-
redundant hypothesis space. Hence, M has to converge on text ti to i , and
therefore Inc has to terminate.
It remains to verify the correctness of Inc . Let i, k ∈ N .
Clearly, if Inc outputs ‘No,’ a string s ∈ hi \ hk has been found, and hi ⊄ hk follows.
Next, consider the case that Inc outputs ‘Yes.’ Suppose to the contrary that hi ⊄ hk . Then there is some s ∈ hi \ hk . Now, consider M when fed the text t = tiy ⋄ tk . Since ti,+y ⊆ hk , t is a text for hk . Since M learns hk , there is some r ∈ N such that k = M∗ (tiy ⋄ tkr ) . By assumption, there are some ĉ ∈ C with hi ∪ hk ⊆ ĉ and some text t̂ for ĉ having the initial segment tiy ⋄ s ⋄ tkr . By Theorem 1, M is conservative. Since s ∈ hi and i = M∗ (t̂y ) , we obtain M∗ (t̂y+1 ) = M (i, s) = i . Consequently, M∗ (tiy ⋄ s ⋄ tkr ) = M∗ (tiy ⋄ tkr ) = k . Finally, since s ∈ t̂+y+r+2 , k = M∗ (tiy ⋄ s ⋄ tkr ) , and s ∉ hk , M fails to consistently learn ĉ from text t̂ , a contradiction. This proves the theorem. □

Taking into account that the inclusion problem for Pat is undecidable (cf. Jiang et al. [17]) and that PAT meets the superset condition, since L(x0 ) = A+ , we immediately arrive, by Theorem 2, at the following corollary.
Corollary 3 (Case et al. [9]). If an IIM M ItLimTxt Pat -learns PAT
then M is inconsistent.
As a matter of fact, the latter corollary generalizes to all non-redundant hy-
pothesis spaces for PAT . All the ingredients to prove this can be found in Zeug-
mann et al. [52]. Consequently, if one wishes to learn the pattern languages or
unions of pattern languages iteratively, then either redundant hypothesis spaces
or inconsistent learners cannot be avoided.
As for unions, the first result goes back to Shinohara [41] who proved the
class of all unions of at most two pattern languages to be in LimTxt Pat(2) .
Wright [49] extended this result to PAT (k) ∈ LimTxt Pat(k) for all k ≥ 1 .
Moreover, Theorem 4.2 in Shinohara and Arimura [44], together with a lemma from Blum and Blum [6], shows that ⋃_{k∈N} PAT (k) is not LimTxt H -inferable for every hypothesis space H .
The iterative learnability of PAT (k) has been established by Case et al. [9].
Our learner is also consistent. Thus, the hypothesis space used had to be designed
to be redundant. We only sketch the proof here.
Theorem 4.
(1) Club(L, k) is finite for all L ⊆ A+ and all k ∈ N+ ,
(2) If L ∈ PAT (k) , then Club(L, k) is non-empty and contains a set Q such that L(Q) = L .

Proof. Part (2) is obvious. Part (1) is easy for finite L . For infinite L , it
follows from the lemma below.
Lemma 1. Let k ∈ N+ , let L ⊆ A+ be any language, and suppose t = (sj )j∈N ∈ text(L) . Then,

(1) Club(t+0 , k) can be obtained effectively from s0 , and Club(t+n+1 , k) is effectively obtainable from Club(t+n , k) and sn+1 (* note the iterative nature *).
(2) The sequence Club(t+0 , k), Club(t+1 , k), . . . converges to Club(L, k) .

Putting it all together, one directly gets the following theorem.


Theorem 5. For all k ≥ 1 , PAT (k) ∈ ItLimTxt .
Proof. Let can(·) be some computable bijection from finite classes of finite sets of patterns onto N . Let pad be a 1–1 padding function such that, for all x, y ∈ N , Wpad(x,y) = Wx . For a finite class S of sets of patterns, let g(S) denote a grammar, obtained effectively from S , for ⋃_{Q∈S} L(Q) .
Let L ∈ PAT (k) , and let t = (sj )j∈N ∈ text(L) . The desired IIM M is defined as follows. We set M0 (t) = M (s0 ) = pad(g(Club(t+0 , k)), can(Club(t+0 , k))) , and for all n > 0 , let

    Mn+1 (t) = M (Mn (t), sn+1 ) = pad(g(Club(t+n+1 , k)), can(Club(t+n+1 , k))) .

Using Lemma 1 it is easy to verify that Mn+1 (t) = M (Mn (t), sn+1 ) can be obtained effectively from Mn (t) and sn+1 . Therefore, M ItLimTxt -identifies PAT (k) . □

So far, the general theory provided substantial insight into the iterative learn-
ability of the pattern languages. But still, we do not know anything about the
number of examples needed until successful learning and the total amount of time
to process them. Therefore, we address this problem in the following subsection.

3.1 Stochastic Finite Learning

As we have already mentioned, it does not make much sense to study the worst-
case behavior of learning algorithms with respect to their total learning time.
The reason for this phenomenon should be clear, since an arbitrary text may
provide the information needed for learning very late. Therefore, in the follow-
ing we always assume a class D of admissible probability distributions over the
relevant learning domain. Ideally, this class should be parameterized. Then, the data fed to the learner are generated randomly with respect to one of the probability distributions from the class D . Furthermore, we introduce a random variable CONV for the stage of convergence. Note
that CONV can be also interpreted as the total number of examples read by the
IIM M until convergence. The first major step to be performed consists now
in determining the expectation E[CONV ] . Clearly, E[CONV ] should be finite
for all concepts c ∈ C and all distributions D ∈ D . Second, one has to deal
with tail bounds for CONV . The easiest way to perform this step is to use Markov’s inequality, i.e., we always know that

    Pr(CONV ≥ t · E[CONV ]) ≤ 1/t    for all t ∈ N+ .
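Markov's inequality can be checked numerically; the sketch below models CONV as a geometric random variable with success probability p = 1/4 (a pure assumption for demonstration — the inequality holds for any non-negative random variable), so that E[CONV] = 4.

```python
# Empirical check of Markov's inequality for a modeled stage of convergence.
import random

def geometric(p, rng):
    """Number of Bernoulli(p) trials up to and including the first success."""
    k = 1
    while rng.random() >= p:
        k += 1
    return k

rng = random.Random(0)
p = 0.25                                   # modeled so that E[CONV] = 1/p = 4
samples = [geometric(p, rng) for _ in range(20_000)]
empirical_mean = sum(samples) / len(samples)

def empirical_tail(t):
    """Empirical Pr(CONV >= t * E[CONV]), using the exact mean E[CONV] = 4."""
    return sum(s >= 4 * t for s in samples) / len(samples)
```

For the geometric distribution the true tails decay exponentially, so the empirical tails fall well below the 1/t guarantee, foreshadowing the sharper bound of Theorem 6.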
However, quite often one can obtain much better tail bounds. If the underly-
ing learner is known to be conservative and rearrangement-independent we always
get exponentially shrinking tail bounds. A learner is said to be rearrangement-


independent if its output depends exclusively on the range and length of its
input (cf. [21] and the references therein). These tail bounds are established by
the following theorem.
Theorem 6 (Rossmanith and Zeugmann [36]). Let CONV be the sample complexity of a conservative and rearrangement-independent learning algorithm. Then

    Pr(CONV ≥ 2t · E[CONV ]) ≤ 2^−t    for all t ∈ N .
Theorem 6 puts the importance of rearrangement-independent and conser-


vative learners into the right perspective. As long as the learnability of indexed
families is concerned, these results have a wide range of potential applications,
since every conservative learner can be transformed into a learner that is both
conservative and rearrangement-independent provided the hypothesis space is
appropriately chosen (cf. Lange and Zeugmann [21]).
Furthermore, since the distribution of CONV decreases geometrically for
all conservative and rearrangement-independent learning algorithms, all higher
moments of CONV exist in this case, too. Thus, instead of applying Theorem 6 directly, one can hope for further improvements by applying even sharper tail bounds, using, for example, Chebyshev’s inequality.
Additionally, a stochastic finite learner takes a confidence parameter δ as input. But in contrast to learning in the limit, the learner itself decides how many examples it wants to read. Then it computes a hypothesis, outputs it, and stops. The hypothesis output is correct for the target with probability at least 1 − δ .
The explanation given so far describes how it works, but not why it does. Intuitively, the stochastic finite learner simulates the limit learner until an upper bound for twice the expected total number of examples needed until convergence has been met. By Markov’s inequality, the limit learner has then converged with probability at least 1/2 . All that is left is to decrease the probability of failure. This is done by using the tail bounds for CONV . Applying
Theorem 6, one easily sees that increasing the sample complexity by a factor
of O(log 1δ ) results in a probability of 1 − δ for having reached the stage of
convergence. If Theorem 6 is not applicable, one can still use Markov’s inequality
but then the sample complexity needed will increase by a factor of 1/δ .
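The mechanism just described can be sketched schematically (the learner M, the bound E_conv_bound on E[CONV], and the toy target below are hypothetical illustrations, not the paper's concrete algorithm):

```python
# A schematic stochastic finite learner built on top of an iterative
# limit learner M, using the exponential tail bound of Theorem 6.
from itertools import cycle
from math import ceil, log2

def stochastic_finite_learn(M, data_stream, E_conv_bound, delta):
    """Read m = 2 * ceil(log2(1/delta)) * E_conv_bound examples: by
    Pr(CONV >= 2t * E[CONV]) <= 2^(-t), the limit learner has converged
    within m examples with probability >= 1 - delta. Then output a single
    hypothesis and stop."""
    t = max(1, ceil(log2(1 / delta)))
    m = 2 * t * E_conv_bound           # sample size: only a log(1/delta) factor
    guess = None
    for _ in range(m):
        guess = M(guess, next(data_stream))
    return guess

# Toy use: iterative max-learner for c_j = {0, ..., j} on a text for c_4.
M = lambda prev, x: x if prev is None else max(prev, x)
stream = cycle([2, 0, 4, 1, 3])
hypothesis = stochastic_finite_learn(M, stream, E_conv_bound=10, delta=0.05)
```

If only Markov's inequality were available, the same wrapper would need m proportional to 1/δ instead of log(1/δ) examples.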
It remains to explain how the stochastic finite learner can calculate the upper
bound for E[CONV ] . This is precisely the point where we need the parameter-
ization of the class D of underlying probability distributions. Since in general,
it is not known which distribution from D has been chosen, one has to assume
a bit of prior knowledge or domain knowledge provided by suitable upper and/or
lower bounds for the parameters involved. A more serious difficulty is to incor-
porate the unknown target concept into this estimate. This step depends on
the concrete learning problem on hand, and requires some extra effort. We shall
exemplify it below.
Now we are ready to formally define stochastic finite learning.
Definition 3 ([33,34,36]). Let D be a set of probability distributions on the


learning domain, C a concept class, H a hypothesis space for C , and δ ∈
(0, 1) . (C, D) is said to be stochastically finitely learnable with δ -confidence
with respect to H iff there is an IIM M that for every c ∈ C and every
D ∈ D performs as follows. Given any random data sequence θ for c generated
according to D , M stops after having seen a finite number of examples and
outputs a single hypothesis h ∈ H . With probability at least 1−δ (with respect
to distribution D ) h has to be correct, that is c = h .
If stochastic finite learning can be achieved with δ -confidence for every δ > 0 then we say that (C, D) can be learned stochastically finitely with high confidence.
Note that there are subtle differences between our model and PAC learning.
By its definition, stochastic finite learning is not completely distribution inde-
pendent. A bit of additional knowledge concerning the underlying probability
distributions is required. Thus, from that perspective, stochastic finite learning
is weaker than the PAC-model. On the other hand, we do not measure the qual-
ity of the hypothesis with respect to the underlying probability distribution.
Instead, we require the hypothesis computed to be exactly correct with high
probability. Note that exact identification with high confidence has been consid-
ered within the PAC paradigm, too (cf., e.g., Goldman et al. [13]). Conversely,
we also can easily relax the requirement to learn probably exactly correct but
whenever possible we shall not do it.
Furthermore, in the uniform PAC model as introduced in Valiant [46] the
sample complexity depends exclusively on the VC dimension of the target con-
cept class and the error and confidence parameters ε and δ , respectively. This
model has been generalized by allowing the sample size to depend on the concept
complexity, too (cf., e.g., Blumer et al. [7] and Haussler et al. [15]). Provided
no upper bound for the concept complexity of the target concept is given, such
PAC learners decide themselves how many examples they wish to read (cf. [15]).
This feature is also adopted to our setting of stochastic finite learning. However,
all variants of PAC learning we are aware of require that all hypotheses from the
relevant hypothesis space are uniformly polynomially evaluable. Though this re-
quirement may be necessary in some cases to achieve (efficient) stochastic finite
learning, it is not necessary in general as we shall see below.
Next, let us exemplify our model by looking at the concept class of all pattern
languages. The results presented below have been obtained by Zeugmann [50]
and Rossmanith and Zeugmann [36]. Our stochastic finite learner uses Lange
and Wiehagen’s [19] pattern language learner as a main ingredient. We consider
here learning from positive data only.
Recall that every string of a particular pattern language is generated by at
least one substitution. Therefore, it is convenient to consider probability distri-
butions over the set of all possible substitutions. That is, if π ∈ Pat k , then it suffices to consider any probability distribution D over A+ × · · · × A+ ( k times). For (u0 , . . . , uk−1 ) ∈ A+ × · · · × A+ we denote by D(u0 , . . . , uk−1 ) the probability that variable x0 is substituted by u0 , variable x1 is substituted by u1 , . . . , and variable xk−1 is substituted by uk−1 .
In particular, we mainly consider a special class of distributions, i.e., product distributions. Let k ∈ N+ ; then the class of all product distributions for Pat k is defined as follows. For each variable xj , 0 ≤ j ≤ k − 1 , we assume an arbitrary probability distribution Dj over A+ on substitution strings. Then we call D = D0 × · · · × Dk−1 a product distribution over A+ × · · · × A+ , i.e., D(u0 , . . . , uk−1 ) = ∏_{j=0}^{k−1} Dj (uj ) . Moreover, we call a product distribution regular if D0 = · · · = Dk−1 . Throughout this paper, we restrict ourselves to regular distributions. We therefore use d to denote the distribution over A+ on substitution strings, i.e., D(u0 , . . . , uk−1 ) = ∏_{j=0}^{k−1} d(uj ) . We call a regular distribution admissible if d(a) > 0 for at least two different elements a ∈ A . As a special case of an admissible distribution we consider the uniform distribution over A+ , i.e., d(u) = 1/(2 · |A|)^ℓ for all strings u ∈ A+ with |u| = ℓ .
We will express all estimates with the help of the following parameters: E[Λ] ,
α and β , where Λ is a random variable for the length of the examples drawn. α
and β are defined below. To get concrete bounds for a concrete implementation
one has to obtain c from the algorithm and has to compute E[Λ] , α , and β
from the admissible probability distribution D . Let u0 , . . . , uk−1 be indepen-
dent random variables with distribution d for substitution strings. Whenever
the index i of ui does not matter, we simply write u or u′ .
The two parameters α and β are now defined via d . First, α is simply the
probability that u has length 1, i.e.,

   α = Pr(|u| = 1) = Σ_{a∈A} d(a) .

Second, β is the conditional probability that two random strings that get
substituted into π are identical under the condition that both have length 1 ,
i.e.,

   β = Pr( u = u′ | |u| = |u′| = 1 ) = Σ_{a∈A} d(a)^2 / ( Σ_{a∈A} d(a) )^2 .
Note that we have omitted the assumption that a text exhausts the target
language. Instead, we only demand that the data sequence fed to the learner
contains “enough” information to recognize the target pattern. The meaning of
“enough” is mainly expressed by the parameter α .
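For the uniform distribution introduced above, both parameters can be evaluated directly from their definitions. The following sketch (the concrete four-letter alphabet is an assumption chosen for illustration) checks numerically that α = 1/2 and β = 1/|A| in this case:

```python
# Numerical sanity check of alpha and beta for the uniform distribution
# d(u) = 1/(2^l * |A|^l) with |u| = l.  The alphabet below is an assumption.
A = ['a', 'b', 'c', 'd']                     # example alphabet, |A| = 4

def d(u):
    l = len(u)
    return 1.0 / (2 ** l * len(A) ** l)

# alpha: probability that a substitution string has length 1
alpha = sum(d(a) for a in A)

# beta: conditional probability that two independent length-1 substitution
# strings coincide, i.e. sum d(a)^2 / (sum d(a))^2
beta = sum(d(a) ** 2 for a in A) / sum(d(a) for a in A) ** 2

print(alpha, beta)                           # 0.5 0.25  (= 1/2 and 1/|A|)
```

These are the values used later when the general bounds are specialized to the uniform distribution.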
The model of computation as well as the representation of patterns we assume
is the same as in Angluin [1]. In particular, we assume a random access machine
that performs a reasonable menu of operations each in unit time on registers of
length O(log n) bits, where n is the input length.
Lange and Wiehagen’s [19] algorithm (abbr. LWA) works as follows. Let hn
be the hypothesis computed after reading s1 , . . . , sn , i.e., hn = M (s1 , . . . , sn ) .
32 T. Zeugmann

Then h1 = s1 and for all n > 1 :

   hn = hn−1 ,        if |hn−1 | < |sn |
        sn ,          if |hn−1 | > |sn |
        hn−1 ∪ sn ,   if |hn−1 | = |sn |

The algorithm computes the new hypothesis only from the latest example
and the old hypothesis. If the latest example is longer than the old hypothesis,
the example is ignored, i.e., the hypothesis does not change. If the latest ex-
ample is shorter than the old hypothesis, the old hypothesis is ignored and the
new example becomes the new hypothesis. If, however, |hn−1 | = |sn | the new
hypothesis is the union of hn−1 and sn . The union u = π ∪ s of a canonical
pattern π and a string s of the same length is defined as

   u(i) = π(i) ,  if π(i) = s(i)
          xj ,    if π(i) ≠ s(i) & ∃ k < i : [ u(k) = xj , s(k) = s(i), π(k) = π(i) ]
          xm ,    otherwise, where m = #var(u(1) . . . u(i − 1))

where u(0) = λ for notational convenience. Note that the resulting pattern is
again canonical.
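As an illustration, the LWA update and the union operation can be sketched in Python. This is a hypothetical reconstruction, not the authors' implementation; the list-based pattern representation (one-character strings for constants, integers for canonical variables) is our assumption.

```python
# Hypothetical sketch of the LWA: patterns are lists whose entries are either
# constant symbols (one-character strings) or canonical variable indices (ints).

def union(pi, s):
    """Union of a canonical pattern pi and a string s of the same length."""
    assert len(pi) == len(s)
    result, next_var = [], 0
    for i in range(len(pi)):
        if pi[i] == s[i]:                      # positions agree: keep the symbol
            result.append(pi[i])
        else:
            for k in range(i):                 # reuse a variable if an earlier
                if (isinstance(result[k], int) # position looked exactly the same
                        and s[k] == s[i] and pi[k] == pi[i]):
                    result.append(result[k])
                    break
            else:                              # otherwise take a fresh variable
                result.append(next_var)
        if isinstance(result[-1], int) and result[-1] == next_var:
            next_var += 1                      # one more distinct variable used
    return result

def lwa_step(h, s):
    """One LWA update step on hypothesis h and next positive example s."""
    if h is None or len(h) > len(s):           # shorter example replaces h
        return list(s)
    if len(h) < len(s):                        # longer example is ignored
        return h
    return union(h, s)                         # equal length: take the union

print(union(list("abab"), "acac"))             # ['a', 0, 'a', 0]
```

For instance, feeding the examples "abab" and "acac" to `lwa_step` yields the canonical pattern a x0 a x0, exactly as the union definition prescribes.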
If the target pattern does not contain any variable then the LWA converges
after having read the first example. Hence, this case is trivial and we therefore
always assume k ≥ 1 in the following, i.e., the target pattern has to contain
at least one variable. Our next theorem analyzes the complexity of the union
operation.
Theorem 7 (Rossmanith and Zeugmann [36]). The union operation can be
computed in linear time.
Furthermore, the following bound for the stage of convergence for every target
pattern from Pat k can be shown.
Theorem 8 (Rossmanith and Zeugmann [36]). E[CONV ] = O( (1/α^k) · log_{1/β}(k) ) for all k ≥ 2 .
Hence, by Theorem 7, the expected total learning time can be estimated by
E[T T ] = O( (1/α^k) · E[Λ] · log_{1/β}(k) ) for all k ≥ 2 .
For a better understanding of the bound obtained we evaluate it for the
uniform distribution and compare it to the minimum number of examples needed
for learning a pattern language via the LWA.
Theorem 9 (Rossmanith and Zeugmann [36]). E[T T ] = O(2^k · |π| · log_{|A|}(k))
for the uniform distribution and all k ≥ 2 .
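As a consistency check (our derivation, not spelled out in the survey): for the uniform distribution the definitions above give α = 1/2 and β = 1/|A|, and E[Λ] = O(|π|) since each substitution string has expected length 2 under this distribution. Plugging these values into the estimate following Theorem 8 recovers the bound of Theorem 9:

```latex
% Specializing E[TT] = O((1/alpha^k) E[Lambda] log_{1/beta}(k)) to the
% uniform distribution, where alpha = 1/2, beta = 1/|A|, E[Lambda] = O(|pi|):
\[
  E[TT] \;=\; O\!\left(\frac{1}{\alpha^{k}}\, E[\Lambda]\, \log_{1/\beta}(k)\right)
        \;=\; O\!\left(2^{k}\, |\pi|\, \log_{|A|}(k)\right).
\]
```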
Theorem 10 (Zeugmann [50]). To learn a pattern π ∈ Pat k the LWA needs
exactly log_{|A|}(|A| + k − 1) + 1 examples in the best case.
The main difference between the two bounds just given is the factor 2^k ,
which precisely reflects the time the LWA has to wait until it has seen the first
shortest string from the target pattern language. Moreover, in the best case the
LWA is processing shortest examples only. Thus, we introduce MC to denote
the number of minimum length examples read until convergence. Then, one can
show that

   E[MC ] ≤ (2 ln(k) + 3) / ln(1/β) + 2 .

Note that Theorem 8 is shown by using the bound for E[MC ] just given.
More precisely, we have E[CONV ] = (1/α^k) · E[MC ] . Now, we are ready to
transform the LWA into a stochastic finite learner.
Theorem 11 (Rossmanith and Zeugmann [36]). Let α∗ , β∗ ∈ (0, 1) . Assume
D to be a class of admissible probability distributions over A+ such that α ≥ α∗ ,
β ≤ β∗ , and E[Λ] is finite for all distributions D ∈ D . Then (PAT , D) is
stochastically finitely learnable with high confidence from text.
Proof. Let D ∈ D , and let δ ∈ (0, 1) be arbitrarily fixed. Furthermore, let
t = s1 , s2 , s3 , . . . be any randomly generated text with respect to D for the
target pattern language. The wanted learner M uses the LWA as a subroutine.
Additionally, it has a counter for memorizing the number of examples already
seen. Now, we exploit the fact that the LWA produces a sequence (τn )n∈N+ of
hypotheses such that |τn | ≥ |τn+1 | for all n ∈ N+ .
The learner runs the LWA until for the first time C many examples have
been processed, where

   C = (1/α∗)^{|τ|} · ( (2 ln(|τ |) + 3) / ln(1/β∗) + 2 )        (A)

and τ is the actual output made by the LWA.


Finally, in order to achieve the desired confidence, the learner sets γ = ⌈log(1/δ)⌉
and runs the LWA for a total of 2 · γ · C examples. This is the reason we need
the counter for the number of examples processed. Now, it outputs the last
hypothesis τ produced by the LWA, and stops thereafter.
Clearly, the learner described above is finite. Let L be the target language
and let π ∈ Pat k be the unique pattern such that L = L(π) . It remains to
argue that L(π) = L(τ ) with probability at least 1 − δ .
First, the bound in (A) is an upper bound for the expected number of exam-
ples needed for convergence by the LWA that has been established in Theorem 8
(via the reformulation using E[MC ] given above). On the one hand, this follows
from our assumptions about the allowed α and β as well as from the fact that
|τ | ≥ |π| for every hypothesis output. On the other hand, the learner does not
know k , but the estimate #var (π) ≤ |π| is sufficient. Note that we have to use
in (A) the bound for E[MC ] given above, since the target pattern may contain
zero or one different variables.
Therefore, after having processed C many examples the LWA has already
converged on average. The desired confidence is then an immediate consequence
of Corollary 6. □
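The sample-size computation used by this learner can be sketched as follows. This is a hedged reconstruction for illustration: the function names are ours, and the base-2 logarithm in γ is an assumption suggested by the doubling argument in the proof.

```python
# Sketch of the sample-size computation from the proof of Theorem 11.
import math

def sample_bound(tau_len, alpha_star, beta_star):
    """Bound (A): C = (1/alpha_*)^{|tau|} * ((2 ln|tau| + 3)/ln(1/beta_*) + 2)."""
    return ((1.0 / alpha_star) ** tau_len
            * ((2 * math.log(tau_len) + 3) / math.log(1.0 / beta_star) + 2))

def total_examples(tau_len, alpha_star, beta_star, delta):
    """The finite learner processes 2 * gamma * C examples in total,
    where gamma = ceil(log2(1/delta)) yields confidence 1 - delta."""
    gamma = math.ceil(math.log2(1.0 / delta))
    return 2 * gamma * sample_bound(tau_len, alpha_star, beta_star)
```

For instance, with α∗ = 1/2 and β∗ = 1/4 (matching the uniform distribution over a four-letter alphabet) the bound grows like 2^{|τ|}, which mirrors the factor 2^k discussed after Theorem 10.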
The latter theorem allows a nice corollary which we state next. Making the
same assumption as done by Kearns and Pitt [18], i.e., assuming the additional
prior knowledge that the target pattern belongs to Pat k , the complexity of the
stochastic finite learner given above can be considerably improved. The resulting
learning time is linear in the expected string length, and the constant depending
on k grows only exponentially in k in contrast to the doubly exponentially
growing constant in Kearns and Pitt’s [18] algorithm. Moreover, in contrast
to their learner, our algorithm learns from positive data only, and outputs a
hypothesis that is correct for the target language with high probability.
Again, for the sake of presentation we shall assume k ≥ 2 . Moreover, if the
prior knowledge k = 1 is available, then there is also a much better stochastic
finite learner for PAT 1 (cf. [34]).
Corollary 12. Let α∗ , β∗ ∈ (0, 1) . Assume D to be a class of admissible
probability distributions over A+ such that α ≥ α∗ , β ≤ β∗ , and E[Λ] is finite
for all distributions D ∈ D . Furthermore, let k ≥ 2 be arbitrarily fixed. Then
there exists a learner M such that

(1) M learns (PAT k , D) stochastically finitely with high confidence from text,
and
(2) The running time of M is O( (1/α∗)^k · E[Λ] · log_{1/β∗}(k) · log₂(1/δ) ) .
    (Note that (1/α∗)^k and log_{1/β∗}(k) are now constants.)

4 Conclusions

The present paper surveyed results recently obtained concerning the iterative
learnability of the class of all pattern languages and finite unions thereof. In
particular, it could be shown that there are strong dependencies between iterative
learning, the class of admissible hypothesis spaces and additional requirements
to the learner such as consistency, conservativeness and the decidability of the
inclusion problem for the hypothesis space chosen. Looking at these results, we
have seen that the LWA is in some sense optimal.
Moreover, by analyzing the average-case behavior of Lange and Wiehagen’s
pattern language learning algorithm with respect to its total learning time and
by establishing exponentially shrinking tail bounds for a rather rich class of
limit learners, we have been able to transform the LWA into a stochastic finite
learner. The price paid is the incorporation of a bit of prior knowledge concerning
the class of underlying probability distributions. When applied to the class of
all k -variable pattern languages, where k is a priori known, the resulting total
learning time is linear in the expected string length.
Thus, the present paper provides evidence that analyzing the average-case
behavior of limit learners with respect to their total learning time may be consid-
ered as a promising path towards a new theory of efficient algorithmic learning.
Recently obtained results along the same path, as outlined in Erlebach et al. [11]
as well as in Reischuk and Zeugmann [32,34], provide further support for the
fruitfulness of this approach.
In particular, in Reischuk and Zeugmann [32,34] we have shown that one-
variable pattern languages are learnable for basically all meaningful distributions
within an optimal linear total learning time on the average. Furthermore, this
learner can also be modified to maintain the incremental behavior of Lange and
Wiehagen’s [19] algorithm. Instead of memorizing the pair (PRE, SUF) , it can
also store just the two or three examples from which the prefix PRE and the suffix
SUF of the target pattern has been computed. While it is no longer iterative, it
is still a bounded example memory learner. A bounded example memory learner
is essentially an iterative learner that is additionally allowed to memorize an a
priori bounded number of examples (cf. [9] for a formal definition).
While the one-variable pattern language learner from [34] is highly practical,
our stochastic finite learner for the class of all pattern languages is still not good
enough for practical purposes. But our results surveyed point to possible direc-
tions for potential improvements. However, much more effort seems necessary to
design a stochastic finite learner for PAT (k) .
Additionally, we have applied our techniques to design a stochastic finite
learner for the class of all concepts describable by a monomial which is based
on Haussler’s [14] Wholist algorithm. Here we have assumed the examples to be
binomially distributed. The sample size of our stochastic finite learner is mainly
bounded by log(1/δ) log n , where δ is again the confidence parameter and n
is the dimension of the underlying Boolean learning domain. Thus, the bound
obtained is exponentially better than the bound provided within the PAC model.
Our approach also differs from U-learnability introduced by Muggleton [27].
First of all, our learner is fed with positive examples only, while in Muggle-
ton’s [27] model examples labeled with respect to their containment in the target
language are provided. Next, we do not make any assumption concerning the dis-
tribution of the target patterns. Furthermore, we do not measure the expected
total learning time with respect to a given class of distributions over the targets
and a given class of distributions for the sampling process, but exclusively in
dependence on the length of the target. Finally, we require exact learning and
not approximately correct learning.

References
1. D. Angluin, Finding Patterns common to a Set of Strings, Journal of Computer
and System Sciences 21, 1980, 46–62.
2. D. Angluin, Inductive inference of formal languages from positive data, Informa-
tion and Control 45, 1980, 117–135.
3. D. Angluin and C.H. Smith. Inductive inference: Theory and methods. Computing
Surveys 15, No. 3, 1983, 237–269.
4. D. Angluin and C.H. Smith. Formal inductive inference. “Encyclopedia of Ar-
tificial Intelligence” (St.C. Shapiro, Ed.), Vol. 1, pp. 409–418, Wiley-Interscience
Publication, New York.
5. S. Arikawa, T. Shinohara and A. Yamamoto, Learning elementary formal systems,
Theoretical Computer Science 95, 97–113, 1992.
6. L. Blum and M. Blum, Toward a mathematical theory of inductive inference,
Information and Control 28, 125–155, 1975.
7. A. Blumer, A. Ehrenfeucht, D. Haussler and M. Warmuth, Learnability and the
Vapnik-Chervonenkis Dimension, Journal of the ACM 36 (1989), 929–965.
8. I. Bratko and S. Muggleton, Applications of inductive logic programming, Com-
munications of the ACM, 1995.
9. J. Case, S. Jain, S. Lange and T. Zeugmann, Incremental Concept Learning for
Bounded Data Mining, Information and Computation 152, No. 1, 1999, 74–110.
10. R. Daley and C.H. Smith. On the Complexity of Inductive Inference. Information
and Control 69 (1986), 12–40.
11. T. Erlebach, P. Rossmanith, H. Stadtherr, A. Steger and T. Zeugmann, Learning
one-variable pattern languages very efficiently on average, in parallel, and by asking
queries, Theoretical Computer Science 261, No. 1–2, 2001, 119–156.
12. E.M. Gold, Language identification in the limit, Information and Control 10
(1967), 447–474.
13. S.A. Goldman, M.J. Kearns and R.E. Schapire, Exact identification of circuits
using fixed points of amplification functions. SIAM Journal of Computing 22,
1993, 705–726.
14. D. Haussler, Bias, version spaces and Valiant’s learning framework, “Proc. 8th Na-
tional Conference on Artificial Intelligence,” pp. 564–569, San Mateo, CA: Morgan
Kaufmann, 1987.
15. D. Haussler, M. Kearns, N. Littlestone and M.K. Warmuth, Equivalence of models
for polynomial learnability. Information and Computation 95 (1991), 129–161.
16. S. Jain, D. Osherson, J.S. Royer and A. Sharma, “Systems That Learn: An Intro-
duction to Learning Theory,” MIT-Press, Boston, Massachusetts, 1999.
17. T. Jiang, A. Salomaa, K. Salomaa and S. Yu, Inclusion is undecidable for pat-
tern languages, in “Proceedings 20th International Colloquium on Automata,
Languages and Programming,” (A. Lingas, R. Karlsson, and S. Carlsson, Eds.),
Lecture Notes in Computer Science, Vol. 700, pp. 301–312, Springer-Verlag, Berlin,
1993.
18. M. Kearns and L. Pitt, A polynomial-time algorithm for learning k-variable
pattern languages from examples, in “Proc. Second Annual ACM Workshop on Computa-
tional Learning Theory,” pp. 57–71, San Mateo, CA: Morgan Kaufmann, 1989.
19. S. Lange and R. Wiehagen, Polynomial-time inference of arbitrary pattern lan-
guages. New Generation Computing 8 (1991), 361–370.
20. S. Lange and T. Zeugmann, Language learning in dependence on the space of
hypotheses. in “Proc. of the 6th Annual ACM Conference on Computational
Learning Theory,” (L. Pitt, Ed.), pp. 127–136, ACM Press, New York, 1993.
21. S. Lange and T. Zeugmann, Set-driven and Rearrangement-independent Learning
of Recursive Languages, Mathematical Systems Theory 29 (1996), 599–634.
22. S. Lange and T. Zeugmann, Incremental Learning from Positive Data, Journal of
Computer and System Sciences 53(1996), 88–103.
23. N. Lavrač and S. Džeroski, “Inductive Logic Programming: Techniques and Ap-
plications,” Ellis Horwood, 1994.
24. T. Mitchell. “Machine Learning,” McGraw Hill, 1997.


25. A. Mitchell, A. Sharma, T. Scheffer and F. Stephan, The VC-dimension of Sub-
classes of Pattern Languages, in “Proc. 10th International Conference on Algorith-
mic Learning Theory,” (O. Watanabe and T. Yokomori, Eds.), Lecture Notes in
Artificial Intelligence, Vol. 1720, pp. 93–105, Springer-Verlag, Berlin, 1999.
26. S. Miyano, A. Shinohara and T. Shinohara, Polynomial-time learning of elementary
formal systems, New Generation Computing, 18:217–242, 2000.
27. S. Muggleton, Bayesian Inductive Logic Programming, in “Proc. 7th Annual ACM
Conference on Computational Learning Theory” (M. Warmuth, Ed.), pp. 3–11,
ACM Press, New York, 1994.
28. S. Muggleton and L. De Raedt, Inductive logic programming: Theory and methods,
Journal of Logic Programming, 19/20:669–679, 1994.
29. R.P. Nix, Editing by examples, Yale University, Dept. Computer Science, Technical
Report 280, 1983.
30. D.N. Osherson, M. Stob and S. Weinstein, “Systems that Learn, An Introduction to
Learning Theory for Cognitive and Computer Scientists,” MIT-Press, Cambridge,
Massachusetts, 1986.
31. L. Pitt, Inductive Inference, DFAs and Computational Complexity, in “Proc. 2nd
Int. Workshop on Analogical and Inductive Inference” (K.P. Jantke, Ed.), Lecture
Notes in Artificial Intelligence, Vol. 397, pp. 18–44, Springer-Verlag, Berlin, 1989.
32. R. Reischuk and T. Zeugmann, Learning One-Variable Pattern Languages in Lin-
ear Average Time, in “Proc. 11th Annual Conference on Computational Learning
Theory - COLT’98,” July 24th - 26th, Madison, pp. 198–208, ACM Press, 1998.
33. R. Reischuk and T. Zeugmann, A Complete and Tight Average-Case Analysis
of Learning Monomials, in “Proc. 16th International Symposium on Theoretical
Aspects of Computer Science,” (C. Meinel and S. Tison, Eds.), Lecture Notes in
Computer Science, Vol. 1563, pp. 414–423, Springer-Verlag, Berlin, 1999.
34. R. Reischuk and T. Zeugmann, An Average-Case Optimal One-Variable Pattern
Language Learner, Journal of Computer and System Sciences 60, No. 2, 2000,
302–335.
35. H. Rogers, Jr., “Theory of Recursive Functions and Effective Computability,”
McGraw–Hill, New York, 1967.
36. P. Rossmanith and T. Zeugmann. Stochastic Finite Learning of the Pattern Lan-
guages, Machine Learning 44, No. 1-2, 2001, 67–91.
37. A. Salomaa, Patterns (The Formal Language Theory Column), EATCS Bulletin 54,
46–62, 1994.
38. A. Salomaa, Return to patterns (The Formal Language Theory Column), EATCS
Bulletin 55, 144–157, 1994.
39. S. Shimozono, A. Shinohara, T. Shinohara, S. Miyano, S. Kuhara and S. Arikawa,
Knowledge acquisition from amino acid sequences by machine learning system
BONSAI, Trans. Information Processing Society of Japan 35, 2009–2018, 1994.
40. R.E. Schapire, Pattern languages are not learnable, In M.A. Fulk & J. Case
(Eds.), Proceedings of the Third Annual ACM Workshop on Computational Learn-
ing Theory, (pp. 122–129). San Mateo, CA: Morgan Kaufmann, (1990).
41. T. Shinohara, Inferring unions of two pattern languages, Bulletin of Informatics
and Cybernetics 20, 83–88, 1983.
42. T. Shinohara, Inductive inference of monotonic formal systems from positive data,
New Generation Computing 8, 371–384, 1991.
43. T. Shinohara and S. Arikawa, Pattern inference, in “Algorithmic Learning for
Knowledge-Based Systems,” (K.P. Jantke and S. Lange, Eds.), Lecture Notes in
Artificial Intelligence, Vol. 961, pp. 259–291, Springer-Verlag, Berlin, 1995.
44. T. Shinohara and H. Arimura, Inductive inference of unbounded unions of pattern
languages from positive data, in “Proceedings 7th International Workshop on
Algorithmic Learning Theory,” (S. Arikawa and A.K. Sharma, Eds.), Lecture Notes
in Artificial Intelligence, Vol. 1160, pp. 256–271, Springer-Verlag, Berlin, 1996.
45. R. Smullyan, “Theory of Formal Systems,” Annals of Mathematical Studies,
No. 47. Princeton, NJ, 1961.
46. L.G. Valiant, A Theory of the Learnable, Communications of the ACM 27 (1984),
1134–1142.
47. R. Wiehagen. Limes-Erkennung rekursiver Funktionen durch spezielle Strategien.
Journal of Information Processing and Cybernetics (EIK) 12, 1976, 93–99.
48. R. Wiehagen and T. Zeugmann, Ignoring Data may be the only Way to Learn
Efficiently, Journal of Experimental and Theoretical Artificial Intelligence 6 (1994),
131–144.
49. K. Wright, Identification of unions of languages drawn from an identifiable
class, in “Proceedings of the 2nd Workshop on Computational Learning The-
ory,” (R. Rivest, D. Haussler, and M. Warmuth, Eds.), pp. 328–333, San Mateo,
CA: Morgan Kaufmann, 1989.
50. T. Zeugmann, Lange and Wiehagen’s Pattern Language Learning Algorithm: An
Average-case Analysis with respect to its Total Learning Time, Annals of Mathe-
matics and Artificial Intelligence 23, No. 1–2, 1998, 117–145.
51. T. Zeugmann and S. Lange, A guided tour across the boundaries of learning
recursive languages, in “Algorithmic Learning for Knowledge-Based Systems,”
(K.P. Jantke and S. Lange, Eds.), Lecture Notes in Artificial Intelligence, Vol. 961,
pp. 190–258, Springer-Verlag, Berlin, 1995.
52. T. Zeugmann, S. Lange and S. Kapur, Characterizations of monotonic and dual
monotonic language learning, Information and Computation 120, 155–173, 1995.
Intrinsic Complexity of Uniform Learning

Sandra Zilles

Universität Kaiserslautern,
FB Informatik, Postfach 3049, 67653 Kaiserslautern, Germany,
zilles@[Link]

Abstract. Inductive inference is concerned with algorithmic learning
of recursive functions. In the model of learning in the limit a learner
successful for a class of recursive functions must eventually find a pro-
gram for any function in the class from a gradually growing sequence of
its values. This approach is generalized in uniform learning, where the
problem of synthesizing a successful learner for a class of functions from
a description of this class is considered.
A common reduction-based approach for comparing the complexity of
learning problems in inductive inference is intrinsic complexity. In this
context, reducibility between two classes is expressed via recursive op-
erators transforming target functions in one direction and sequences of
corresponding hypotheses in the other direction.
The present paper is the first one concerned with intrinsic complexity
of uniform learning. The relevant notions are adapted and illustrated
by several examples. Characterizations of complete classes finally allow
for various insightful conclusions. The connection to intrinsic complexity
of non-uniform learning is revealed within several analogies concerning
firstly the role and structure of complete classes and secondly the general
interpretation of the notion of intrinsic complexity.

1 Introduction
Inductive inference is concerned with algorithmic learning of recursive functions.
In the model of learning in the limit, cf. [7], a learner successful for a class of
recursive functions must eventually find a correct program for any function in
the class from a gradually growing sequence of its values. The learner is under-
stood as a machine – called inductive inference machine or IIM – reading finite
sequences of input-output pairs of a target function, and returning programs as
its hypotheses, see also [2]. The underlying programming system is then called
a hypothesis space.
Studying the potential of such IIMs in general leads to the question whether
– given a description of a class of functions – a corresponding successful IIM can
be synthesized computationally from this description. This idea is generalized in
the notion of uniform learning: we consider a collection C0 , C1 , . . . of learning
problems – which may be seen as a decomposition of a class C = C0 ∪ C1 ∪ . . .
– and ask for some kind of meta-IIM tackling the whole collection of learning
problems. As an input, such a meta-IIM gets a description of one of the learning

R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 39–53, 2003.
© Springer-Verlag Berlin Heidelberg 2003
40 S. Zilles

problems Ci (in our context a class Ci of recursive functions) in the collection.


The meta-IIM is then supposed to develop a successful IIM for Ci . Besides studies
on uniform learning of classes of recursive functions, cf. [12,16], this topic has also
been investigated in the context of learning formal languages, see in particular
[1,13,14].
Since we consider IIMs as tackling a given problem, namely the problem of
identifying all elements in a particular class of recursive functions, the complex-
ity of such IIMs might express how hard a learning problem is. For instance, the
class of all constant functions allows for a simple and straightforward identifica-
tion method; for other classes successful methods might seem more complicated.
But this does not involve any rule allowing us to compare two learning problems
with respect to their difficulty. So a formal approach for comparing the complex-
ity of learning problems (i. e. of classes of recursive functions) is desirable.
Different aspects have been analysed in this context. One approach is, e. g.,
mind change complexity measured by the maximal number of hypothesis changes
a machine needs to identify a function in the given class, see [3]. But since in
general this number of mind changes is unbounded, other notions of complexity
might be of interest.
Various subjects in theoretical computer science deal with comparing the
complexity of decision problems, e. g. regarding decidability as such, see [15], or
the possible efficiency of decision algorithms, see [5]. In general Problem A is
at most as hard as Problem B, if A is reducible to B under a given reduction.
Each such reduction involves a notion of complete (hardest solvable) problems.
Besides studies concerning language learning, see [9,10,11], in [4] an approach
for reductions in the context of learning recursive functions is introduced. This
subject, intrinsic complexity, has been further analysed in [8] with a focus on
complete classes. It has turned out that, for learning in the limit, a class is
complete, iff it contains a dense r. e. subclass. Here the aspect of high topological
complexity (density) contrasts with the aspect of low algorithmic complexity of
r. e. sets, which is somewhat striking and has caused discussions on whether this
particular approach of intrinsic complexity is adequate.
The present paper deals with intrinsic complexity in the context of uniform
learning. Assume some new reduction expresses such an idea of intrinsic com-
plexity. If a class C of functions is complete in the initial sense, natural ques-
tions are (i) whether C can be decomposed into a uniformly learnable collection
C0 , C1 , . . . , which is not a hardest problem in uniform learning, and (ii) whether
there are also inappropriate decompositions of C, i. e. collections of highest com-
plexity in uniform learning.
Below a notion of intrinsic complexity for uniform learning is developed and
the corresponding complete classes are characterized. The obtained structure of
degrees of complexity matches recent results on uniform learning: it has been
shown that even decompositions into singleton classes can yield problems too
hard for uniform learning in Gold’s model. This suggests that collections rep-
resenting singleton classes may sometimes form hardest problems in uniform
learning. Indeed, the notion developed below expresses this intuition, i. e.
collections of singleton sets may constitute complete classes in uniform learning. Still,
the characterization of completeness here reveals a weakness of the general idea
of intrinsic complexity, namely – as in the non-uniform case – complete classes
have a low algorithmic complexity (see Theorem 7). All in all, this shows that
intrinsic complexity, as in [4], is on the one hand a useful approach, because it
can be adapted to match the intuitively desired results in uniform learning. On
the other hand, the doubts in [8] are corroborated.

2 Preliminaries
2.1 Notations
Knowledge of basic notions used in mathematics and computability theory is
assumed, cf. [15]. N is the set of natural numbers. The cardinality of a set
X is denoted by card X. Partial-recursive functions always operate on natural
numbers. If f is a function, f (n) ↑ indicates that f (n) is undefined. Our target
objects for learning will always be recursive functions, i. e. total partial-recursive
functions. R denotes the set of all recursive functions.
If α is a finite tuple of numbers, then |α| denotes its length. Finite tuples
are coded, i. e. if f (0), . . . , f (n) are defined, a number f [n] represents the tuple
(f (0), . . . , f (n)), called an initial segment of f . f [n] ↑ means that f (x) ↑ for
some x ≤ n. For convenience, a function may be written as a sequence of values
or as a set of input-output pairs. A sequence σ = x0 , x1 , x2 , . . . converges to
x, iff xn = x for all but finitely many n; we write lim(σ) = x. For example let
f (n) = 7 for n ≤ 2, f (n) ↑ otherwise; g(n) = 7 for all n. Then f = 7^3 ↑^∞ =
{(0, 7), (1, 7), (2, 7)}, g = 7^∞ = {(n, 7) | n ∈ N}; lim(g) = 7, and f ⊆ g. For
n ∈ N, the notion f =n g means that for all x ≤ n either f (x) ↑ and g(x) ↑ or
f (x) = g(x). A set C of functions is dense, iff for any f ∈ C, n ∈ N there is
some g ∈ C satisfying f =n g, but f ≠ g.
Recursive functions – our target objects for learning – require appropriate
representation schemes, to be used as hypothesis spaces. Partial-recursive enu-
merations serve for that purpose: any (n + 1)-place partial-recursive function ψ
enumerates the set Pψ := {ψi | i ∈ N} of n-place partial-recursive functions,
where ψi (x) := ψ(i, x) for all x = (x1 , . . . , xn ). Then ψ is called a numbering.
Given f ∈ Pψ , any index i satisfying ψi = f is a ψ-program of f .
Following [6], we call a family (di )i∈N of natural numbers limiting r. e., iff
there is a recursive numbering d such that lim(di ) = di for all i ∈ N.

2.2 Learning in the Limit and Intrinsic Complexity


Below, let τ be a fixed acceptable numbering, serving as a hypothesis space. The
learner is a total computable device called IIM (inductive inference machine)
working in steps. The input of an IIM M in step n is an initial segment f [n]
of some f ; the output M (f [n]) is interpreted as a τ -program. In learning in the
limit, M is successful for f , if the sequence M (f ) := (M (f [n]))n∈N of hypotheses
is admissible for f :
Definition 1 [4] Let f, σ ∈ R. σ is admissible for f , iff σ converges and lim(σ)
is a τ -program for f .
Now a class of recursive functions is learnable in the limit (Ex -learnable; Ex
is short for explanatory), if a single IIM is successful for all functions in the class.
Definition 2 [7,2] A class C ⊆ R is Ex -learnable (C ∈ Ex ), iff there is an IIM
M such that, for any f ∈ C, the sequence M (f ) is admissible for f . M is then
called an Ex -learner or an IIM for C.
The class of constant functions and the class Cfsup = {α0^∞ | α is an initial
segment} of recursive functions of finite support are in Ex , but intuitively, the
latter is harder to learn. A reduction-based approach for comparing the learning
complexity is proposed in [4], using the notion of recursive operators.
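To make the Ex-learning notion concrete, here is an illustrative sketch (not from the paper) of a learner in the style of an IIM for Cfsup; for simplicity, hypotheses are represented by the finite set of non-zero input-output pairs instead of τ-programs.

```python
# Illustrative sketch of an IIM-style learner for C_fsup, the class of
# recursive functions of finite support.  A hypothesis is represented by the
# finite dict of non-zero values rather than by a tau-program.

def iim_fsup(segment):
    """On input f[n] = (f(0), ..., f(n)), hypothesize the function that
    agrees with the segment and is 0 everywhere else."""
    return {x: y for x, y in enumerate(segment) if y != 0}

# On any function of finite support the sequence of hypotheses converges:
# once the segment covers the whole support, the hypothesis never changes.
f = [0, 3, 0, 5, 0, 0, 0]                    # f = 0 3 0 5 0^infinity
hypotheses = [iim_fsup(f[:n + 1]) for n in range(len(f))]
print(hypotheses[-1])                        # {1: 3, 3: 5}
```

After the segment reaches position 3 the hypothesis stabilizes, so this learner identifies every function of finite support in the limit.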
Definition 3 [15,8] Let Θ be a total function operating on functions. Θ is a
recursive operator, iff for all functions f, g and all numbers n, y ∈ N:
1. if f ⊆ g, then Θ(f ) ⊆ Θ(g);
2. if Θ(f )(n) = y, then Θ(f ′ )(n) = y for some initial segment f ′ ⊆ f ;
3. if f is finite, then one can effectively (in f ) enumerate Θ(f ).
Reducing a class C1 of functions to a class C2 of functions requires two oper-
ators: the first one maps C1 into C2 ; the second maps any admissible sequence
for a mapped function in C2 to an admissible sequence for the associated original
function in C1 .
Definition 4 [4] Let C1 , C2 ∈ Ex . C1 is Ex -reducible to C2 , iff there are recur-
sive operators Θ, Ξ such that all functions f ∈ C1 fulfil the following conditions:
1. Θ(f ) belongs to C2 ,
2. if σ is admissible for Θ(f ), then Ξ(σ) is admissible for f .
Note that if C1 is Ex -reducible to C2 , then an IIM for C1 can be deduced from any
IIM for C2 ; e. g. by [4], each class in Ex is Ex -reducible to Cfsup . As usual, this
reduction yields complete classes, i. e. learnable classes of highest complexity.
Definition 5 [4] A class C ∈ Ex is Ex-complete, iff each class C′ ∈ Ex is Ex-reducible to C.
By the remark above, the class Cfsup is Ex -complete. Note that Cfsup is
r. e. and dense – a relevant property for characterizing Ex -complete classes:
Theorem 1 [8] A class C ∈ Ex is Ex -complete iff it has an r. e. dense subset.
Ex-complete classes have subsets which are dense, i. e. topologically complex, but r. e., i. e. algorithmically non-complex. The latter is astonishing, since there are dense classes which are not Ex-complete, cf. [8], so they do not contain r. e. dense subsets. These classes are algorithmically more complex than Cfsup,
but belong to a lower degree of intrinsic complexity. R. e. subsets as in Theorem 1
are obtained by mapping r. e. Ex -complete classes – such as Cfsup – to C with
the help of an operator Θ. So perhaps this approach of intrinsic complexity just
makes a class complete, if it is a suitable ‘target’ for recursive operators. This
may be considered as a weakness of the notion of intrinsic complexity.
Intrinsic Complexity of Uniform Learning 43

2.3 Uniform Learning in the Limit


Uniform learning views the approach of Ex -learning on a meta-level; it is not
only concerned with the existence of methods solving specific learning problems,
but with the problem of synthesizing such methods. So the focus is on families of
learning problems (here families of classes of recursive functions). Given a repre-
sentation or description of a class of recursive functions, the aim is to effectively
determine an adequate learner, i. e. to compute a program for a successful IIM
learning the class.
For a formal definition of uniform learning it is necessary to agree on a scheme
for describing classes of recursive functions (i. e. describing learning problems).
For that purpose we fix a three-place acceptable numbering ϕ. If d ∈ N, the
numbering ϕd is the function resulting from ϕ, if the first input is fixed by d.
Then any number d corresponds to a two-place numbering ϕd enumerating the
set Pϕd of partial-recursive functions. Now it is conceivable to consider the subset
of all total functions in Pϕd as a learning problem which is uniquely determined
by the number d. Thus each number d acts as a description of the set Rd , where

Rd := {ϕdi | i ∈ N and ϕdi is recursive} = Pϕd ∩ R for any d ∈ N .

Rd is called the recursive core of the numbering ϕd. So any set D = {d0, d1, . . . } can be regarded as a set of descriptions, i. e. a collection of learning problems Rd0, Rd1, . . . In this context, D is called a description set.
A meta-IIM M is an IIM with two inputs: (i) a description d of a recursive
core Rd , and (ii) an initial segment f [n] of some f ∈ R. Then Md is the IIM
resulting from M , if the first input is fixed by d. A meta-IIM M can be seen
as mapping descriptions d to IIMs Md ; it is a successful uniform learner for a
set D, in case Md learns Rd for all d ∈ D; i. e. given any description in D, M
develops a suitable learner for the corresponding recursive core.
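Operationally, a meta-IIM is a two-argument machine, and fixing its first argument yields the IIM Md; in programming terms this is currying. A toy Python sketch (the description set and its learning task are hypothetical, chosen only to make the currying visible):

```python
from functools import partial

def meta_iim(d, segment):
    """Toy meta-IIM: assume (hypothetically) that the description d
    describes the singleton core containing only the constant function
    with value d.  The learner may then ignore the data and always
    conjecture 'constant d'."""
    return ("const", d)

# Fixing the first input by d = 3 yields the ordinary IIM M_3:
M_3 = partial(meta_iim, 3)
```

The point is only the shape of the definition: M maps each description d to a learner Md, exactly as in the text above.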

Definition 6 Let D ⊆ N. D is uniformly Ex-learnable (D ∈ UEx), iff there is a meta-IIM M such that, for any d ∈ D, the IIM Md is an Ex-learner for Rd.

As a numbering ϕd enumerates a superset of Rd, a meta-IIM might also use ϕd as a hypothesis space for Rd. This involves a new notion of admissible sequences.

Definition 7 Let d ∈ N, f ∈ Rd, σ ∈ R. σ is r-admissible for d and f, iff σ converges and lim(σ) is a ϕd-program for f.

This approach yields just a special (restricted) case of uniform Ex-learning, because ϕd-programs can be uniformly translated into τ-programs.

Definition 8 Let D ⊆ N. D is uniformly Ex-learnable restrictedly (D ∈ rUEx), iff there is a meta-IIM M such that, for any d ∈ D and any function f ∈ Rd, the sequence Md(f) is r-admissible for d and f.

By the following result, special sets describing only singleton recursive cores
are not uniformly Ex -learnable (restrictedly). For Claim 2 cf. a proof in [16].

Theorem 2 1. [12,16] {d ∈ N | card Rd = 1} ∉ UEx.
2. Fix s ∈ R. Then {d ∈ N | Rd = {s}} ∉ rUEx.

It has turned out that even UEx-learnable subsets of these description sets are not in UEx (or rUEx), if additional demands concerning the sequence of hypotheses are posed, see [17]. This suggests that description sets representing only singletons may form hardest problems in uniform learning; analogously, description sets representing only a fixed singleton recursive core may form hardest problems in restricted uniform learning. Hopefully, this intuition can be expressed by a notion of intrinsic complexity of uniform learning.

3 Intrinsic Complexity of Uniform Learning


3.1 Intrinsic Complexity of UEx -Learning
The crucial notion now concerns the reduction between description sets D1
and D2 . As in the non-uniform model, a meta-IIM for D1 should be computable
from a meta-IIM for D2 , if D1 is reducible to D2 . We first focus on UEx -learning;
the restricted variant will be discussed later on. A first idea for UEx -reducibility
might be to demand the existence of operators Θ and Ξ such that for d1 ∈ D1
and f1 ∈ Rd1

Θ transforms (d1 , f1 ) into a pair (d2 , f2 ) with d2 ∈ D2 and f2 ∈ Rd2 ;

where Ξ maps any admissible sequence for f2 to an admissible sequence for f1 .


Unfortunately, this does not allow us to reduce every set in UEx to a set
describing only singleton recursive cores: suppose Rd = Cfsup . As the set D1 =
{d} is uniformly Ex -learnable, it should be reducible to a set D2 representing
only singleton recursive cores, say via Θ and Ξ as above. Now for any initial
segment α, there are d2 ∈ D2 and f2 ∈ Rd2 such that Θ(d, α0∞ ) = (d2 , f2 ).
The usual notion of an operator yields an n > 0 and a subfunction σ ⊆ f2 such
that Θ(d, α0n ) = (d2 , σ). As card Rd2 = 1, this implies Θ(d, α0n β0∞ ) = (d2 , f2 )
for all initial segments β. In particular, there are f, f′ ∈ Rd such that f ≠ f′, but Θ(d, f) = Θ(d, f′) = (d2, f2). By assumption, Ξ maps each admissible sequence for f2 to a sequence admissible for both f and f′. The latter is of course impossible, so this approach does not meet our purpose.
The problem above is that the description d2 , once it is output by Θ on input
of (d1 , f1 [m]), can never be changed depending on the values of f1 to be read.
Hence, Θ should be allowed to return a sequence of descriptions, when fed a pair
(d1 , f1 ). As an improved approach, it is conceivable to demand that for d1 ∈ D1
and f1 ∈ Rd1
Θ transforms (d1 , f1 ) into a pair (δ2 , f2 ) .

Here δ2 is a sequence converging to some d2 ∈ D2 with f2 ∈ Rd2. Moreover, Ξ maps any admissible sequence for f2 to an admissible sequence for f1.
Still this approach bears a problem. Intuitively, reducibility should be transitive. In general, such a transitivity is achieved by connecting the operators of a first reduction with the operators of a second reduction. The idea above cannot
guarantee that: assume D1 is reducible to D2 via Θ1 and Ξ1 ; D2 is reducible to
D3 via Θ2 and Ξ2 . If Θ1 maps (d1 , f1 ) to (δ2 , f2 ), then which description d in the
sequence δ2 should form an input (d, f2 ) for Θ2 ? It is in general impossible to
detect the limit d2 of the sequence δ2 , and any description d = d2 might change
the output of Θ2 .
So it is inevitable to let Θ operate on sequences of descriptions and on func-
tions, i. e., Θ maps pairs (δ1 , f1 ), where δ1 is a sequence of descriptions, to pairs
(δ2 , f2 ).
Definition 9 Let Θ be a total function operating on pairs of functions. Θ is a recursive meta-operator, iff the following properties hold for all functions δ, δ′, f, f′:
1. if δ ⊆ δ′, f ⊆ f′, as well as Θ(δ, f) = (γ, g) and Θ(δ′, f′) = (γ′, g′), then γ ⊆ γ′ and g ⊆ g′;
2. if n, y ∈ N, Θ(δ, f) = (γ, g), and γ(n) = y (or g(n) = y, resp.), then there are initial segments δ0 ⊆ δ and f0 ⊆ f such that (γ0, g0) = Θ(δ0, f0) fulfils γ0(n) = y (g0(n) = y, resp.);
3. if δ, f are finite and Θ(δ, f) = (γ, g), one can effectively (in δ, f) enumerate γ, g.
This finally allows for the following definition of UEx -reducibility.

Definition 10 Let D1, D2 ∈ UEx. Fix a recursive meta-operator Θ and a recursive operator Ξ. D1 is UEx-reducible to D2 via Θ and Ξ, iff for any d1 ∈ D1, any f1 ∈ Rd1, and any initial segment δ1 there are functions δ2 and f2 satisfying:
1. Θ(δ1 d1∞, f1) = (δ2, f2),
2. δ2 converges to some description d2 ∈ D2 such that f2 ∈ Rd2,
3. if σ is admissible for f2, then Ξ(σ) is admissible for f1.
D1 is UEx-reducible to D2, iff D1 is UEx-reducible to D2 via some Θ and Ξ.

Note that this definition expresses intrinsic complexity in the sense that a
meta-IIM for D1 can be computed from a meta-IIM for D2 , if D1 is UEx -
reducible to D2 . Moreover, as has been demanded in advance, the resulting
reducibility is transitive:

Lemma 3 If D1, D2, D3 are description sets such that D1 is UEx-reducible to D2 and D2 is UEx-reducible to D3, then D1 is UEx-reducible to D3.

The notion of completeness can be adapted from the usual definitions.

Definition 11 A description set D ∈ UEx is UEx-complete, iff each description set D′ ∈ UEx is UEx-reducible to D.

The question is whether this notion of intrinsic complexity expresses the
intuitions formulated in advance, e. g., that there are UEx -complete description
sets representing only singleton recursive cores. Before answering this question
consider an illustrative example.

The following example states that there is a single description d of an Ex-complete set such that the description set {d} is UEx-complete. On the one hand, this
might be surprising, because a description set consisting of just one index rep-
resenting an Ex -learnable class might be considered rather simple and thus not
complete for uniform learning. But on the other hand, this result is not contrary
to the intuition, that the hardest problems in non-uniform learning may remain
hardest, when considered in the context of meta-learning. The reason is that
the complexity is still of highest degree, if the corresponding class of recursive
functions is not decomposed appropriately.

Example 4 Let d ∈ N fulfil Rd = Cfsup . Then the set {d} is UEx -complete.

Proof. Obviously, {d} ∈ UEx . To show that each description set in UEx is UEx -
reducible to {d}, fix D1 ∈ UEx and let M be a corresponding meta-IIM as in
Definition 6. It remains to define a recursive meta-operator Θ and a recursive
operator Ξ appropriately.
Given initial segments δ1 and α, let Θ just modify the sequence of hypotheses
returned by the meta-IIM M , if the first input parameter is gradually taken from
the sequence δ1 and the second input parameter is gradually taken from the
sequence α. The modification is to increase each hypothesis by 1 and to change
each repetition of hypotheses into a zero output. A formal definition is omitted.
Moreover, given an initial segment σ = (s0, . . . , sn), let Ξ(σ) look for the maximal m ≤ n such that at least one of the values τsm(x), x ≤ n, is defined within n steps and greater than 0. In case m does not exist, Ξ(σ) = Ξ(s0, . . . , sn−1). Otherwise, let y ≤ n be maximal such that τsm(y) has already been computed and is greater than 0. Then Ξ(σ) equals Ξ(s0, . . . , sn−1) extended by the value τsm(y) − 1.
Now D1 is UEx -reducible to {d} via Θ, Ξ; details are omitted. 
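The hypothesis transformation behind Θ, and the information Ξ recovers, can be sketched in Python. This is a simplification under two stated assumptions: hypotheses are plain numbers rather than τ-programs, and the step-bounded τ-program evaluation inside Ξ is replaced by reading off the last non-zero value directly:

```python
def encode_hypotheses(hyps):
    """Core of Theta in Example 4: increment each hypothesis by 1 and
    turn every repetition of the previous hypothesis into a 0.  If the
    hypothesis sequence converges, the encoded sequence has finite
    support, i.e. it looks like a member of Cfsup."""
    out, prev = [], None
    for h in hyps:
        out.append(0 if h == prev else h + 1)
        prev = h
    return out

def last_hypothesis(encoded):
    """What Xi recovers in the limit: the last non-zero value,
    decremented, is the limit hypothesis of the original sequence."""
    return [v for v in encoded if v > 0][-1] - 1
```

A converging hypothesis sequence such as 2 2 5 5 5 ... is encoded as 3 0 6 0 0 ..., from which the limit hypothesis 5 can be read back off.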
That decompositions of Ex -complete classes may also be not UEx -complete,
is shown in Section 3.3. Example 4 moreover serves for proving the completeness
of other sets, if Lemma 5 – an immediate consequence of Lemma 3 – is applied.

Lemma 5 Let D1, D2 ∈ UEx. If D1 is UEx-complete and UEx-reducible to D2, then D2 is UEx-complete.

Lemma 5 and Example 4 simplify the proofs of further examples, finally revealing that there are indeed UEx-complete description sets representing singleton recursive cores only.

Example 6 1. Let (αi)i∈N be an r. e. family of all initial segments. Let g ∈ R fulfil ϕ^{g(i)}_0 = αi 0∞ and ϕ^{g(i)}_{x+1} = ↑∞ for i, x ∈ N. Then the description set {g(i) | i ∈ N} is UEx-complete.
2. Let g ∈ R fulfil ϕ^{g(i)}_0 = τi and ϕ^{g(i)}_{x+1} = ↑∞ for i, x ∈ N. Then the description set {g(i) | i ∈ N} is UEx-complete.

Proof. ad 1. Obviously, {g(i) | i ∈ N} ∈ UEx. Now we reduce the UEx-complete set {d} from Example 4 to {g(i) | i ∈ N}. Lemma 5 then proves Assertion 1.

It is easy to define Θ such that, if α does not end with 0, then Θ(δ1 , α0∞ ) =
(δ2 , α0∞ ), where δ2 converges to some g(i) with αi = α. Let Ξ(σ) = σ for all σ.
Then {d} is UEx -reducible to {g(i) | i ∈ N} via Θ and Ξ. Details are omitted.
ad 2. Fix an r. e. family (αi)i∈N of all initial segments; fix h ∈ R with τh(i) = αi 0∞ for all i ∈ N. Then ϕ^{g(h(i))}_0 = αi 0∞ and ϕ^{g(h(i))}_{x+1} = ↑∞ for i, x ∈ N. As above, the set {g(h(i)) | i ∈ N} is UEx-complete; so is its superset {g(i) | i ∈ N}. 


Just as the properties of Cfsup are characteristic for Ex -completeness, the
properties of description sets representing decompositions of Cfsup are charac-
teristic for UEx -completeness, as is stated in Theorem 7 and Corollary 8.

Theorem 7 Let D ∈ UEx. D is UEx-complete, iff there are a recursive numbering ψ and a limiting r. e. family (di)i∈N of descriptions in D such that:
1. ψi belongs to Rdi for all i ∈ N;
2. Pψ is dense.

Proof. Fix a description set D in UEx .


Necessity. Assume D is UEx -complete. Fix any one-one recursive numbering χ
such that Pχ = Cfsup . Moreover fix g ∈ R which, given any i, x ∈ N, fulfils
ϕ^{g(i)}_0 = χi and ϕ^{g(i)}_x = ↑∞, if x > 0. Then the description set {g(i) | i ∈ N} is UEx-complete, as can be verified similarly to Example 6. Lemma 5 then implies


that {g(i) | i ∈ N} is UEx -reducible to D, say via Θ and Ξ.
Fix a one-one r. e. family (αi )i∈N of all finite tuples over N. For i ∈ N, i
coding the pair (x, y), define (δi , ψi ) := Θ(αy g(x)∞ , χx ). By definition, ψ is a
recursive numbering and, for all i ∈ N, the sequence δi converges to some di ∈ D
such that ψi ∈ Rdi . Hence (di )i∈N is a limiting r. e. family of descriptions in D.
It remains to verify Property 2.
For that purpose fix i, n ∈ N. By definition, if i encodes (x, y), we obtain Θ(αy g(x)∞, χx) = (δi, ψi). The properties of Θ yield some m ∈ N such that Θ(αy g(x)m, χx[m]) = (δi′, α′) for some δi′, α′ with δi′ ⊆ δi and ψi[n] ⊆ α′ ⊆ ψi.
Because of the particular properties of χ, there is some x′ ∈ N, x′ ≠ x, such that χx′ =m χx, but χx′ ≠ χx. Moreover, there is some y′ ∈ N such that αy′ = αy g(x)m. If j encodes (x′, y′), this yields Θ(αy g(x)m g(x′)∞, χx′) = (δj, ψj), where α′ ⊆ ψj. In particular ψj =n ψi.
Assume ψi = ψj. Suppose σ is any admissible sequence for ψi. Then σ is admissible for ψj. This implies that Ξ(σ) is admissible for both χx and χx′. As χx ≠ χx′, this is impossible. So ψi ≠ ψj.
Sufficiency. Assume D, ψ, and (di )i∈N fulfil the conditions of Theorem 7. Let d
denote a numbering associated to the limiting r. e. family (di )i∈N . The results in
the context of non-uniform learning help to show that D is UEx -complete:
By assumption, Pψ is a dense r. e. subset of R. Theorem 1 then implies that Pψ is Ex-complete, so Cfsup is Ex-reducible to Pψ, say via Θ′, Ξ′.
Using Θ′ and Ξ′ one can show that the UEx-complete set {d} from Example 4 is UEx-reducible to D. This implies that D is UEx-complete, too. Note that Rd = Cfsup.

It remains to define a recursive meta-operator Θ and a recursive operator Ξ appropriately. If δ1 and α1 are finite tuples over N, define Θ(δ1, α1) as follows.

Compute Θ′(α1) = α2 and n = |α2|.
For all x < n, let ix be minimal such that α2[x] ⊆ ψix.
Return Θ(δ1, α1) = ((di0(0), di1(1), . . . , din−1(n − 1)), α2) (if n = 0, then the first component of Θ(δ1, α1) is the empty sequence).

Clearly, if f1 ∈ R, then Θ(δ1, f1) = (δ2, Θ′(f1)) for some sequence δ2. Moreover, let Ξ := Ξ′.
Finally, to verify that {d} is UEx -reducible to D, fix a sequence δ1 and a
function f1 ∈ Rd .
First, note that f2 = Θ′(f1) ∈ Pψ. Let i be the minimal ψ-program of Θ′(f1) = f2. As ψ is recursive, for all x ∈ N the minimal ix satisfying f2[x] ⊆ ψix can be computed. Additionally, lim(ix)x∈N = i. Note that (di(n))n∈N converges to di. Hence Θ(δ1, f1) = (δ2, f2), where f2 ∈ Pψ and δ2 converges to di, given f2 = ψi. In particular, f2 ∈ Rdi.
Second, if σ is admissible for f2, then Ξ′(σ) is admissible for f1.
So {d} is UEx -reducible to D via Θ and Ξ, and thus D is UEx -complete. 

Corollary 8 Let D ∈ UEx. D is UEx-complete, iff there are a recursive numbering ψ and a limiting r. e. family (di)i∈N of descriptions in D such that:
1. ψi belongs to Rdi for all i ∈ N;
2. Pψ is Ex-complete.

Proof. Necessity. The assertion follows from Theorem 1 and Theorem 7.


Sufficiency. Let D ∈ UEx . Assume ψ and (di )i∈N fulfil the conditions above.
Let d be a recursive numbering corresponding to the limiting r. e. family (di )i∈N .
By Property 2, Pψ is Ex-complete; thus, by Theorem 1, there exists a dense r. e. subclass C ⊆ Pψ. Let ψ′ be a one-one recursive numbering with Pψ′ = C; in particular, Pψ′ is dense. It remains to find a limiting r. e. family (di′)i∈N of descriptions in D such that ψi′ ∈ Rdi′ for all i ∈ N. For that purpose define a corresponding numbering d′. Given i, n ∈ N, define di′(n) as follows.

Let j ∈ N be minimal such that ψi′ =n ψj. (* Note that, for all but finitely many n, the index j will be the minimal ψ-program of ψi′. *)
Return di′(n) := dj(n). (* lim(di′) = dj, for j minimal with ψi′ = ψj. *)

Finally, let di′ be given by the limit of the function (di′(n))n∈N, in case this limit exists. Fix i ∈ N. Then there is a minimal j with ψi′ = ψj. By definition, the limit of (di′(n))n∈N exists and di′ = dj ∈ D. Moreover, as ψj ∈ Rdj, the function ψi′ is in Rdi′. As ψ′ and (di′)i∈N allow us to apply Theorem 7, the set D is UEx-complete. 
Thus certain decompositions of Ex -complete classes remain UEx -complete,
and UEx -complete description sets always represent decompositions of supersets
of Ex -complete classes. Example 9 illustrates how to apply the above characteri-
zations of UEx -completeness. A similar short proof may be given for Example 6.

Example 9 Fix a recursive numbering χ such that Pχ is dense. Let g ∈ R fulfil ϕ^{g(i)}_0 = χi and ϕ^{g(i)}_{x+1} = ↑∞ for i, x ∈ N. Then {g(i) | i ∈ N} is UEx-complete.

Proof. (g(i))i∈N is a (limiting) r. e. family such that χi ∈ Rg(i) for all i ∈ N and
Pχ is Ex -complete. Corollary 8 implies that {g(i) | i ∈ N} is UEx -complete. 

3.2 Intrinsic Complexity of rUEx -Learning


Adapting the formalism of intrinsic complexity for restricted uniform learning,
we have to be careful concerning the operator Ξ. In UEx -learning, the current
description d has no effect on whether a sequence is admissible for a function
or not. For restricted learning this is different. Therefore, to communicate the
relevant information to Ξ, it is inevitable to include a description from D2 in
the input of Ξ. That means, Ξ should operate on pairs (δ2 , σ) rather than on
sequences σ only. Since only the limit of the function output by Ξ is relevant for
the reduction, this idea can be simplified. It suffices if Ξ operates correctly on the
inputs d2 and σ, where d2 is the limit of δ2 . Then an operator on the pair (δ2 , σ)
is obtained from Ξ by returning the sequence (Ξ(δ2 (0)σ[0]), Ξ(δ2 (1)σ[1]), . . . ).
Its limit will equal the limit of Ξ(d2 σ).
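The lifting described above can be sketched in Python. Here xi stands for an arbitrary operator taking a single description paired with a hypothesis segment; this is a toy rendering of the construction, not its formal definition:

```python
def lift(xi, delta, sigma):
    """Lift an operator xi, defined on a single description d paired with
    a hypothesis segment, to a *sequence* of descriptions delta: the n-th
    output is xi(delta(n), sigma[0..n]).  Once delta has stabilised on
    its limit d2, the outputs agree with xi(d2, ...), so the limit of the
    lifted sequence equals the limit of xi(d2, sigma)."""
    return [xi(delta[n], tuple(sigma[:n + 1])) for n in range(len(sigma))]
```

With a toy operator and a description sequence that stabilises after one step, the lifted outputs stabilise on the same value that fixing the limit description would yield.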

Definition 12 Let D1, D2 ∈ rUEx. Fix a recursive meta-operator Θ and a recursive operator Ξ. D1 is rUEx-reducible to D2 via Θ and Ξ, iff for any d1 ∈ D1, any f1 ∈ Rd1, and any initial segment δ1 there are functions δ2 and f2 satisfying:
1. Θ(δ1 d1∞, f1) = (δ2, f2),
2. δ2 converges to some description d2 ∈ D2 such that f2 ∈ Rd2,
3. if σ is r-admissible for d2 and f2, then Ξ(d2 σ) is r-admissible for d1 and f1.
D1 is rUEx-reducible to D2, iff D1 is rUEx-reducible to D2 via some Θ and Ξ.

Completeness is defined as usual. As in the UEx-case, rUEx-reducibility is transitive; so the rUEx-completeness of one set may help to verify the rUEx-completeness of others.

Lemma 10 If D1, D2, D3 are description sets such that D1 is rUEx-reducible to D2 and D2 is rUEx-reducible to D3, then D1 is rUEx-reducible to D3.

Lemma 11 Let D1, D2 ∈ rUEx. If D1 is rUEx-complete and rUEx-reducible to D2, then D2 is rUEx-complete.

Recall that, intuitively, sets describing just one singleton recursive core may
be rUEx -complete. This is affirmed by Example 12, the proof of which is omitted.

Example 12 Let s, g ∈ R such that ϕ^{g(i)}_i = s and ϕ^{g(i)}_x = ↑∞, if i, x ∈ N, x ≠ i. Then {g(i) | i ∈ N} is rUEx-complete, but not UEx-complete.

Example 12 helps to characterize rUEx-completeness. In particular, it shows that the demand ‘Pψ is dense’ has to be dropped.

Theorem 13 Let D ∈ rUEx. D is rUEx-complete, iff there are a recursive numbering ψ and a limiting r. e. family (di)i∈N of descriptions in D such that:
1. ψi belongs to Rdi for all i ∈ N;
2. for each i, n ∈ N there are infinitely many j ∈ N satisfying ψi =n ψj and (di, ψi) ≠ (dj, ψj).
Proof. Fix a description set D in rUEx .
Necessity. Assume D is rUEx -complete. Lemma 11 implies that the description
set {g(i) | i ∈ N} from Example 12 is rUEx -reducible to D, say via Θ and Ξ.
Fix a one-one r. e. family (αi )i∈N of all finite tuples over N. For i ∈ N, i coding
the pair (x, y), define (δi , ψi ) := Θ(αy g(x)∞ , s). By definition, ψ is a recursive
numbering and, for all i ∈ N, the sequence δi converges to some di ∈ D such
that ψi ∈ Rdi . Hence (di )i∈N is a limiting r. e. family of descriptions in D.
It remains to verify Property 2.
For that purpose fix i, n ∈ N. By definition, if i encodes (x, y), we have
Θ(αy g(x)∞ , s) = (δi , ψi ). The properties of Θ yield some m ∈ N such that
Θ(αy g(x)m, s) = (δi′, α′) for some δi′ and α′ with δi′ ⊆ δi and ψi[n] ⊆ α′ ⊆ ψi.
Now choose any x′ ∈ N such that x′ ≠ x. Moreover, there is some y′ ∈ N such that αy′ = αy g(x)m. If j encodes (x′, y′), this yields Θ(αy g(x)m g(x′)∞, s) = (δj, ψj), where α′ ⊆ ψj. In particular ψj =n ψi.
Assume (di, ψi) = (dj, ψj). Suppose σ is any sequence that is r-admissible for di and ψi. Then Ξ(di σ) is r-admissible both for g(x) and s and for g(x′) and s. As x is the only ϕ^{g(x)}-program for s and x′ is the only ϕ^{g(x′)}-program for s, the latter is impossible. So (di, ψi) ≠ (dj, ψj).
Repeating this argument for all x′ with x′ ≠ x yields the desired property.
Sufficiency. First note: if (di)i∈N is a limiting r. e. family and ψ any recursive numbering such that {(di, ψi) | i ∈ N} is an infinite set, then there are a limiting r. e. family (di′)i∈N and a recursive numbering ψ′ such that {(di′, ψi′) | i ∈ N} ⊆ {(di, ψi) | i ∈ N} and i ≠ j implies (di′, ψi′) ≠ (dj′, ψj′). Details are omitted.
So let D, ψ, (di)i∈N fulfil the demands of Theorem 13 and assume wlog that i ≠ j implies (di, ψi) ≠ (dj, ψj). Let d be the numbering associated to the limiting r. e. family (di)i∈N. We show that the set {g(i) | i ∈ N} from Example 12 is rUEx-reducible to D; so Lemma 11 implies that D is rUEx-complete.
For that purpose fix a one-one numbering η ∈ R such that Pη equals the
set Cconst := {αi∞ | α is a finite tuple over N and i ∈ N} of all recursive finite
variants of constant functions.
Using a construction from [8] we define an operator Θ′ mapping Pη into Pψ. In parallel, a function θ is constructed to mark used indices. Let Θ′(η0) := ψ0 and θ(0) = 0. If i > 0, let Θ′(ηi) be defined as follows.
For x < i, let mx be maximal with ηi =mx ηx. Let m := maxx<i(mx). Let k < i be minimal with mk = m. (* Among the functions η0, . . . , ηi−1, none agree with ηi on a longer initial segment than ηk does. *)
Compute the set H := {j ∈ N | j ∉ {θ(0), . . . , θ(i − 1)} and ψj =m Θ′(ηk)}. (* H is the set of unused ψ-programs of functions agreeing with Θ′(ηk) on the first m + 1 values. *)

Choose h = min(H); return Θ′(ηi) := ψh; moreover let θ(i) := h. (* Because of Property 2, the index h exists. As ψ is recursive, h is found effectively. *)

Note that Θ′ is a recursive operator mapping Pη into Pψ. θ is a recursive function that maps each number i to the index h used in the construction of Θ′(ηi) = ψh. θ is one-one, yet it may happen that Θ′(ηi) = Θ′(ηj), but θ(i) ≠ θ(j) for some i, j ∈ N.
It remains to define a recursive meta-operator Θ and a recursive operator Ξ
such that {g(i) | i ∈ N} is rUEx -reducible to D via Θ and Ξ. If δ is an infinite
sequence, define Θ(δ, s) as follows.
For each x ∈ N, let jx ∈ N be minimal such that ηjx =x δ. Let ix := θ(jx).
Return Θ(δ, s) := ((di0(0), di1(1), . . . ), Θ′(δ)).

Clearly, the output of Θ depends only on δ. If δ converges, then Θ(δ, s) = (δ′, f′), where f′ ∈ Pψ and δ′ converges to some description di such that i = θ(j) for the minimal number j satisfying ηj = δ. To define an operator Ξ, compute Ξ(dσ) for d ∈ N, σ ∈ R as follows.
For x ∈ N let X := {y ≤ x | dy(x) = d and, for all z ≤ x, if ϕdσ(x)(z) is defined in x steps of computation, then ϕdσ(x)(z) = ψy(z)}. If X is empty, let ix = 0, otherwise let ix be the minimum of X. (* In the limit, the only i satisfying di = d and ϕdlim(σ) = ψi is found – provided that i exists. *)
For each x ∈ N, compute jx ∈ N with θ(jx) = ix. (* In the limit, a number j with θ(j) = i, di = d, and Θ′(ηj) = ψi is found – provided that j exists. *)
Return Ξ(dσ) := (g−1(ηj0(0)), g−1(ηj1(1)), g−1(ηj2(2)), . . . ). (* g−1 denotes the function inverse to g. Ξ(dσ) converges to g−1(l), where l is the limit of ηj with Θ′(ηj) = ψi and di equals d – provided that i and j exist. *)
To show that {g(i) | i ∈ N} is rUEx -reducible to D via Θ and Ξ, fix some
δ1 ∈ R converging to some description d ∈ {g(i) | i ∈ N}.
First, by the remarks below the definition of Θ, we obtain Θ(δ1 , s) = (δ2 , f2 ),
where f2 ∈ Pψ and δ2 converges to some description di such that i = θ(j) for
the minimal j satisfying ηj = δ1 . This implies f2 = ψi . In particular, f2 ∈ Rdi .
Second, if σ ∈ R is r -admissible for di and ψi , then Ξ(di σ) converges to
g −1 (d) (by the note in the definition of Ξ and d = lim(ηj )). Recall that g −1 (d)
is the only ϕd -program of s, whenever d ∈ {g(i) | i ∈ N}. Hence Ξ(di σ) is
r -admissible for d and s.
So {g(i) | i ∈ N} is rUEx -reducible to D and finally D is rUEx -complete. 
As an immediate consequence of Theorems 7 and 13 we have:

Corollary 14 Let D ∈ rUEx . If D is UEx -complete, then D is rUEx -complete.

3.3 Algorithmic Structure of Complete Classes in Uniform Learning


Theorems 7 and 13 suggest a weakness of the notion of intrinsic complexity,
similar to the non-uniform case: though UEx-/rUEx-complete sets involve a topologically complex structure, expressed by Property 2, this goes along with the demand for a limiting r. e. subset combined with an r. e. subset Pψ of the
union of all represented recursive cores. The latter again can be seen as a non-
complex algorithmic structure.
Now Theorem 15 shows that there are non-complete description sets, for
which the properties of Theorems 7 and 13 can be fulfilled, but only if the
demand for limiting r. e. sets is dropped. These sets are algorithmically more
complex than our examples of UEx -complete sets, but they belong to a lower
degree of intrinsic complexity.

Theorem 15 Let C ⊆ R. Then there is a set D ∈ rUEx such that
1. C equals the union of all recursive cores described by D,
2. D is not rUEx-complete (and hence not UEx-complete).

Proof. Fix a list A0, A1, . . . of all infinite limiting r. e. sets such that ϕ^d_0 ∈ C and ϕ^d_{x+1} = ↑∞ for all i, x ∈ N and d ∈ Ai. Let A := ∪i∈N Ai and C = {f0, f1, . . . }.

Define a set D0 as follows.

Fix the least elements d0, d0′ of A0, d0 < d0′. Let I0 := {d0}, I0′ := {d0′}. Let e0 ∈ A \ (I0 ∪ I0′) be minimal such that f0 ∈ Re0. (* e0 exists, because A contains infinitely many descriptions d with ϕ^d_0 = f0. *)
Let D0 := I0 ∪ {e0}. (* The disjoint sets D0 and I0′ both intersect with A0; some recursive core described by D0 equals {f0}. *)

Moreover, for any k ∈ N, define a set Dk+1 as follows.

Fix the least elements dk+1, dk+1′ of Ak+1 \ (Dk ∪ Ik′), dk+1 < dk+1′. (* These have not been touched in the definition of D0, . . . , Dk yet. *)
Let Ik+1 := Dk ∪ {dk+1}, Ik+1′ := Ik′ ∪ {dk+1′}. Let ek+1 ∈ A \ (Ik+1 ∪ Ik+1′) be minimal such that fk+1 ∈ Rek+1. (* ek+1 exists, because A contains infinitely many descriptions d with ϕ^d_0 = fk+1. *)
Let Dk+1 := Ik+1 ∪ {ek+1}. (* The disjoint sets Dk+1 and Ik+1′ both intersect with Ak+1; some recursive core described by Dk+1 equals {fk+1}. *)

Choose D := ∪k∈N Dk ⊂ A, so D does not contain any infinite limiting r. e. set. As ϕ^d_{x+1} = ↑∞ for all d ∈ D, x ∈ N, we have D ∈ rUEx. Moreover, C is the union
of all cores described by D. It remains to prove that D is not rUEx -complete.
Assume D is rUEx -complete. Then some limiting r. e. set {di | i ∈ N} ⊆ D
and some ψ ∈ R fulfil the conditions of Theorem 13. In particular, {(di , ψi ) |
i ∈ N} is infinite. As D does not contain any infinite limiting r. e. set, the set
{di | i ∈ N} is finite. card Rdi = 1 for i ∈ N implies that {ψi | i ∈ N} is finite, too;
thus {(di , ψi ) | i ∈ N} is finite – a contradiction. So D is not rUEx -complete. 
The reason why each UEx-/rUEx-complete set D contains a limiting r. e. subset representing a decomposition of an r. e. class is that certain properties of UEx-complete sets are ‘transferred’ by meta-operators Θ. This corroborates the possible interpretation that our approach of intrinsic complexity just makes a class complete, if it is a suitable ‘target’ for recursive meta-operators – similar to the non-uniform case.
By the way, Theorem 15 shows that every Ex-complete class C has a decomposition represented by a description set which is not UEx-complete – answering a question in Section 3.1.

Acknowledgements. My thanks are due to the referees and to Steffen Lange for their comments correcting and improving a former version of this paper, moreover to Frank Stephan for a very helpful discussion on the technical details.

References
1. Baliga, G.; Case, J.; Jain, S. (1999); The synthesis of language learners, Information
and Computation 152:16–43.
2. Blum, L.; Blum, M. (1975); Toward a mathematical theory of inductive inference,
Information and Control 28:125–155.
3. Case, J.; Smith, C. (1983); Comparison of identification criteria for machine in-
ductive inference, Theoretical Computer Science 25:193–220.
4. Freivalds, R.; Kinber, E.; Smith, C. (1995); On the intrinsic complexity of learning,
Information and Computation 123:64–71.
5. Garey, M.; Johnson, D. (1979); Computers and Intractability – A Guide to the
Theory of NP-Completeness, Freeman and Company.
6. Gold, E. M. (1965); Limiting recursion, Journal of Symbolic Logic 30:28–48.
7. Gold, E. M. (1967); Language identification in the limit, Information and Control
10:447–474.
8. Jain, S.; Kinber, E.; Papazian, C.; Smith, C.; Wiehagen, R. (2003); On the intrinsic
complexity of learning recursive functions, Information and Computation 184:45–
70.
9. Jain, S.; Kinber, E.; Wiehagen, R. (2000); Language learning from texts: Degrees
of intrinsic complexity and their characterizations, Proc. 13th Annual Conference
on Computational Learning Theory, Morgan Kaufmann, 47–58.
10. Jain, S.; Sharma, A. (1996); The intrinsic complexity of language identification,
Journal of Computer and System Sciences 52:393–402.
11. Jain, S.; Sharma, A. (1997); The structure of intrinsic complexity of learning,
Journal of Symbolic Logic 62:1187–1201.
12. Jantke, K. P. (1979); Natural properties of strategies identifying recursive func-
tions, Elektronische Informationsverarbeitung und Kybernetik 15:487–496.
13. Kapur, S.; Bilardi, G. (1992); On uniform learnability of language families, Infor-
mation Processing Letters 44:35–38.
14. Osherson, D.; Stob, M.; Weinstein, S. (1988); Synthesizing inductive expertise,
Information and Computation 77:138–161.
15. Rogers, H. (1987); Theory of Recursive Functions and Effective Computability,
MIT Press.
16. Zilles, S. (2001); On the synthesis of strategies identifying recursive functions,
Proc. 14th Annual Conference on Computational Learning Theory, LNAI 2111,
pp. 160–176, Springer-Verlag.
17. Zilles, S. (2001); On the comparison of inductive inference criteria for uniform
learning of finite classes, Proc. 12th Int. Conference on Algorithmic Learning The-
ory, LNAI 2225, pp. 251–266, Springer-Verlag.
On Ordinal VC-Dimension and Some Notions of Complexity

Eric Martin¹, Arun Sharma², and Frank Stephan²

¹ School of Computer Science and Engineering, UNSW Sydney, NSW 2052, Australia
  emartin@[Link]
² National ICT Australia, UNSW Sydney, NSW 2052, Australia
  {[Link],[Link]}@[Link]

Abstract. We generalize the classical notion of VC-dimension to ordinal VC-dimension, in the context of logical learning paradigms. Logical learning paradigms encompass the numerical learning paradigms commonly studied in Inductive inference. A logical learning paradigm is defined as a set W of structures over some vocabulary, and a set D of first-order formulas that represent data. The sets of models of ϕ in W, where ϕ varies over D, generate a natural topology W over W.
We show that if D is closed under boolean operators, then the notion of
ordinal VC-dimension offers a perfect characterization for the problem
of predicting the truth of the members of D in a member of W, with an
ordinal bound on the number of mistakes. This shows that the notion
of VC-dimension has a natural interpretation in Inductive Inference,
when cast into a logical setting. We also study the relationships between
predictive complexity, selective complexity—a variation on predictive
complexity—and mind change complexity. The assumptions that D is
closed under boolean operators and that W is compact often play a
crucial role to establish connections between these concepts.

Keywords: Inductive inference, logical paradigms, VC-dimension, predictive complexity, mind change bounds.

1 Introduction
The notion of VC-dimension is a key concept in PAC-learning [6,12,13]. The
notion of finite telltale is a key concept in Inductive inference [3,8]. It can be
claimed that VC-dimension is to PAC-learning what finite telltales are to Induc-
tive inference. Both provide a characterization of learnability, for fundamental
classes of learning paradigms, in the respective settings. Both take the form of
a condition where finiteness is a key requirement, in frameworks that deal with
infinite objects. In logical learning paradigms of identification in the limit, it
has been shown that the finite telltale condition can be seen as a generalization
of the compactness property, the latter being the hallmark of, equivalently, fi-
nite learning or deductive inference [10]. The finite telltale condition can even be
generalized and be interpreted as a property of β-weak compactness, that charac-
terizes classification with less than β mind changes [10]. There are extensions of

R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 54–68, 2003.
© Springer-Verlag Berlin Heidelberg 2003
VC-dimension to infinite domains [4]. But there are few essential connections between VC-dimension and some fundamental concepts from Inductive inference:
the relevance of the concept of VC-dimension seems to be closely related to the
existence of probability distributions over the sample space. Though connections
exist between PAC learning and Inductive inference (e.g., [5]), it does not seem
that VC-dimension has any chance to play a key role in learning paradigms of In-
ductive inference that do not introduce probability distributions over the sample
space. We will show that VC-dimension can still provide a perfect characteriza-
tion for the problem of predicting whether a possible datum is true or false in
the underlying world, in the realm of Inductive inference. But for such a char-
acterization to be possible, the condition that the set of possible data is closed
under boolean operators has to be imposed. The fact that we work in a logical
setting is of course essential to express this condition in a simple, meaningful
and natural way. The notions and main results are stated with no computabil-
ity condition on the procedures that analyze the data and output hypotheses.
This is necessary to obtain perfect equivalences, and provides strong evidence
that the concepts involved are naturally connected. When computability is a
requirement, the relationships become more complex. Our aim is to encourage
the study of the connections between ordinal VC-dimension and predictive com-
plexity in paradigms of Inductive inference. Obtaining perfect connections for
ideal, unconstrained paradigms suggests that further work in the same direction
should be carried out, in the context of more realistic or constrained paradigms.
Moreover, the results will be illustrated with examples that always involve effec-
tive procedures. We proceed as follows. We introduce some background notions
and notation in Section 2. We define the various complexity measures in Section
3. We study the relationships between these complexity measures in Section 4.
We conclude in Section 5.

2 Background

The class of ordinals is denoted by Ord. Let a set X be given. The set of finite sequences of members of X, including the empty sequence (), is represented by X*. Given a σ ∈ X*, the set of members of X that occur in σ is denoted by rng(σ). Given a finite or an infinite sequence σ of members of X and a natural number i that, in case σ is finite, is at most equal to the length of σ, we represent by σ|i the initial segment of σ of length i. Concatenation between sequences is represented by ⋄, and sequences consisting of a unique element are often identified with that element. We write ⊂ (resp. ⊆) for strict (resp. nonstrict) inclusion between sets, as well as for the notion of a finite sequence being a strict (resp. nonstrict) initial segment of another sequence. We also use the notation ⊃. Let two sets X, Y and a partial function f from X into Y be given. Given x ∈ X, we write f (x) =↓ when f (x) is defined, and f (x) =↑ otherwise. Given two members x, x′ of X, we write f (x) = f (x′) when both f (x) and f (x′) are defined and equal; we write f (x) ≠ f (x′) otherwise. Let R be a binary relation over a set X. Recall that R is well-founded iff every
nonempty subset Y of X contains an element x such that no member y of Y satisfies R(y, x). Suppose that R is well-founded. We then denote by ρR the unique function from X into Ord such that for all x ∈ X:
ρR (x) = sup{ρR (y) + 1 : y ∈ X, R(y, x)}.
The length of R, written as |R|, is the least ordinal not in the range of ρR . Note
that |R| = 0 iff X = ∅. For example, Figure 1 depicts a finite binary relation
R of length 5. In this diagram, an arrow joins a point x to a point y iff R(x, y)
holds. For all points x in the field of R, the value of ρR (x) is indicated.

[Figure 1: diagram not reproduced in this text; each point in the field of R is labeled with its value of ρR .]

Fig. 1. A finite binary relation of length 5
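Since the recursion defining ρR is purely combinatorial for finite relations, it can be illustrated directly in code. The following Python sketch (our illustration; the relation `chain` is a hypothetical example, not the relation depicted in Figure 1) computes ρR and the length |R| for a finite well-founded relation given as a set of pairs.

```python
def rho(relation, field):
    """Compute the rank function rho_R of a finite well-founded relation R.

    relation: set of pairs (y, x) meaning that R(y, x) holds.
    field:    the underlying finite set X.
    Returns a dict mapping each x in X to
    rho_R(x) = sup{rho_R(y) + 1 : y in X, R(y, x)}
    (the sup of the empty set being 0).
    """
    preds = {x: [y for (y, z) in relation if z == x] for x in field}
    rank = {}

    def r(x):
        if x not in rank:
            rank[x] = max((r(y) + 1 for y in preds[x]), default=0)
        return rank[x]

    for x in field:
        r(x)
    return rank

def length(relation, field):
    """|R| = least ordinal not in the range of rho_R; for a finite relation
    this is max(rho_R) + 1, and 0 when the field X is empty."""
    if not field:
        return 0
    return max(rho(relation, field).values()) + 1

# A chain 0 R 1 R 2 R 3 R 4: the ranks are 0, 1, 2, 3, 4, so the length
# is 5, the same as the length of the relation depicted in Figure 1.
chain = {(i, i + 1) for i in range(4)}
print(length(chain, set(range(5))))  # -> 5
```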

Let us introduce the logical learning paradigms and their constituents. We denote
by V a countable vocabulary, i.e., a countable set of function symbols (possibly
including constants) and predicate symbols. Let us adopt a convention. If we
say that V contains 0 and s, then 0 denotes a constant and s a unary function
symbol. Moreover, given a nonnull n ∈ N, n is used as an abbreviation for the term obtained from 0 by n applications of s.
We denote by D a nonempty set of first-order sentences (closed formulas) over
V that represent data. Three important cases are sets of closed atoms (to model
learning from texts), sets of closed literals (i.e., closed atoms and their negations,
to model learning from informants), and sets of sentences closed under boolean
operators, a natural example being the set of quantifier free sentences. Note that
quantifier free sentences convey no more information than closed literals. Still,
the assumption that D is closed under boolean operators will play a key role in
this paper. We denote by # some symbol to be used when no datum is presented.
Given a member σ of (D ∪ {#})*, we set cnt(σ) = rng(σ) ∩ D.
We denote by W a nonempty set of structures over V. An important case is
given by the set of all Herbrand structures, i.e., structures over V each of whose
individuals interprets a unique closed term. (When we consider Herbrand struc-
tures, V contains at least one constant). Given a member M of W and a set E
of formulas, the E-diagram of M, denoted DiagE (M), is the set of all members
of E that are true in M. We say that a set T of first-order formulas is consistent
in W iff T has a model in W. Given a set T of first-order formulas over V, we
denote by Mod(T ) the set of models of T , and by ModW (T ) the set of models
of T in W (i.e., ModW (T ) = Mod(T ) ∩ W).
We denote by P the triple (V, D, W). We call P a logical paradigm. This is a
simplification of the notion of logical paradigm investigated in [9,10]. Learning
paradigms in the numerical setting are naturally cast into the logical setting as
follows. Set V = {0, s, P } where P is a unary predicate symbol. Let E be the
set {P (n) : n ∈ N}. If C is the set of languages to be learnt, we define W as
the set of Herbrand structures whose E-diagrams are {P (n) : n ∈ L} where L
varies over C. The choice of D depends on the type of data: D = E when data
are positive, D = E ∪ {¬ψ : ψ ∈ E} when data are positive or negative.
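As a concrete illustration of this translation, the following toy encoding (all names and representation choices are ours) identifies a language L ⊆ N with the structure whose E-diagram is {P(n) : n ∈ L}, represents the atom P(n) by the pair ('P', n), and writes the pause symbol as '#'.

```python
PAUSE = "#"  # placeholder used when no datum is presented

def diagram(L):
    """D-diagram of the structure encoding L, for D = {P(n) : n in N},
    restricted here to n < 10 so that the returned set is finite."""
    return {("P", n) for n in L if n < 10}

def cnt(sigma):
    """cnt(sigma) = rng(sigma) ∩ D: the data occurring in the sequence."""
    return {x for x in sigma if x != PAUSE}

evens = {0, 2, 4, 6, 8}
# A finite initial segment of a text (environment) for the language of evens:
sigma = [("P", 0), PAUSE, ("P", 2), ("P", 2), PAUSE]
print(cnt(sigma) <= diagram(evens))  # the data seen so far fit 'evens' -> True
```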

3 The Notions of Complexity

We now define the various concepts of complexity we need in this paper, starting
with VC-dimension. The notion of a set of hypotheses shattering a set of data
takes the following form when hypotheses are represented as structures and data
as formulas.
Definition 1. Let a set E of formulas be given. We say that W shatters E iff
E is finite and for all subsets D of E, DiagE (M) = D for some M ∈ W.
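Under the representation used in the examples above (a structure M identified with the set of data it makes true, so that DiagE(M) is simply M ∩ E), Definition 1 can be checked mechanically. The sketch below is our own encoding, not notation from the paper.

```python
from itertools import chain, combinations

def shatters(W, E):
    """Does W shatter the finite set E of formulas?  Each structure M in W
    is represented by the set of formulas it makes true, so Diag_E(M) is
    just M ∩ E; W shatters E iff every subset of E arises this way."""
    E = frozenset(E)
    diagrams = {frozenset(M) & E for M in W}
    subsets = {frozenset(s) for s in chain.from_iterable(
        combinations(E, r) for r in range(len(E) + 1))}
    return subsets <= diagrams

# W realizes every subset of {a, b} but not every subset of {a, b, c}.
W = [{"a"}, {"b"}, {"a", "b"}, set(), {"a", "b", "c"}]
print(shatters(W, {"a", "b"}))       # -> True
print(shatters(W, {"a", "b", "c"}))  # -> False
```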
Traditionally, the VC-dimension is defined as the greatest number n such
that some set consisting of n distinct elements is shattered [13]. When such an n
does not exist, the VC-dimension is considered to be either undefined or infinite.
We extend the notion of VC-dimension from natural numbers to ordinals as
follows.
Definition 2. Let X be the set of nonempty subsets of D that W shatters. Let
R be the restriction of ⊃ to X. The VC-dimension of P is equal to the length
of R if R is well-founded, and undefined otherwise.
As an intuitive interpretation, the VC-dimension of P is determined by the
following game, where we assume for simplicity that D is infinite. Consider two
players Anke and Boris. Anke has to output an increasing sequence of nonempty
finite subsets of D and Boris a decreasing sequence of ordinals.

– In round 1, Anke outputs a nonempty finite subset of D and Boris outputs an ordinal α.
– If in the n-th round Anke has output a finite set D ⊆ D and Boris has output a nonnull ordinal β, then the following is done in round n + 1.
  • Anke outputs a finite set E with D ⊂ E ⊆ D.
  • Boris outputs an ordinal strictly below β.

The game terminates after Boris has output 0. If Anke’s last set is shattered,
Anke wins the game. Otherwise Boris wins the game. The VC-dimension of P
is the smallest nonnull ordinal α for which Boris has a winning strategy. If this ordinal does not exist, the VC-dimension is undefined.
Given a structure M, we call environment (in P) an infinite sequence e of members of D ∪ {#} such that for all ϕ ∈ D, ϕ occurs in e iff ϕ ∈ DiagD (M). So
environments correspond to texts when D is the set of closed atoms, and to
informants when D is the set of closed literals. Identification in the limit and
the corresponding notion of complexity are defined next.

Definition 3. An identifier (for P) is a partial function from (D ∪ {#})* into {DiagD (M) : M ∈ W}. Let an identifier f be given.

– We say that a member σ of (D ∪ {#})* is consistent in W just in case there exists M ∈ W such that for all ϕ ∈ D that occur in σ, M |= ϕ.
– We say that f is successful (in P) iff for every M ∈ W and for every environment e for M, f (e|k ) = DiagD (M) for almost every k ∈ N.
– Let X be the set of all σ ∈ (D ∪ {#})* such that σ is consistent in W and f (τ ) =↓ for some initial segment τ of σ. We denote by Rf the binary relation over X such that for all σ, τ ∈ X, Rf (σ, τ ) holds iff τ ⊂ σ and f (σ) ≠ f (τ ).
The identification complexity of P is the least ordinal of the form |Rf |, where f
is an identifier that is successful in P and Rf is well-founded; if such an f does
not exist, the identification complexity of P is undefined.
If the identification complexity of P is equal to a nonnull ordinal β, then the
D-diagrams of the members of W are identifiable in the limit with less than β
mind changes. (Note that the usual notion of mind change complexity considers
the least ordinal β such that at most, rather than less than, β mind changes
are sometimes necessary for the procedure to converge [1,2,7]. There are good
theoretical reasons for preferring the ‘less than’ formulation.) Note that if some
identifier is successful in P, then there are countably many D-diagrams of members of W only. The following is a characterization of identification complexity, based on a generalization of Angluin's finite telltale characterization of learnability in the limit [3].
Proposition 4. The identification complexity of P is defined and equal to a nonnull ordinal β iff there exists a sequence (βM )M∈W of ordinals smaller than β such that for all M ∈ W, there exists a finite AM ⊆ DiagD (M) such that for all N ∈ W:
(∗) if AM ⊆ DiagD (N) and βM ≤ βN then DiagD (N) = DiagD (M).

The next notion of complexity we define is based on selectors. Intuitively, a selector for P is a procedure that, for any given M ∈ W, selects in the limit all formulas from DiagD (M) or their negations, and is correct cofinitely many times. In order to be able to achieve this, the selector needs to receive, for every selected formula ϕ, feedback whether ϕ is or is not true in M. Thus one can represent the selector as a partial function where at stage n the input is from {0, 1}^n and represents the feedback on the previous selections.

Definition 5. A selector (for P) is a partial function f from {0, 1}* into D. Let a selector f be given.

– We say that a member σ of {0, 1}* is consistent with W and f just in case there exists M ∈ W such that for all τ ∈ {0, 1}* and i ∈ {0, 1} with τ ⋄ i ⊆ σ, f (τ ) =↓, and M |= f (τ ) iff i = 1.
– We say that f is successful (in P) iff for every M ∈ W, there is a string t of finitely many 0's and infinitely many 1's such that every finite initial segment of t is consistent with W and f and for all ϕ ∈ DiagD (M), either f (σ) = ϕ or f (σ) = ¬ϕ for some finite initial segment σ of t.
– Let X be the set of all σ ∈ {0, 1}* that are consistent with W and f , and that end with a 0. We denote by Rf the binary relation over X such that for all σ, τ ∈ X, Rf (σ, τ ) holds iff τ ⊂ σ.

The selective complexity of P is the least ordinal of the form |Rf |, where f is a selector that is successful in P and Rf is well-founded; if such an f does not exist, the selective complexity of P is undefined.
The last notion of complexity we define is based on predictors. Whereas selec-
tors have control over the formulas the truth of which they want to predict, a
predictor has to take the members of D as they come.

Definition 6. A predictor (for P) is a partial function f from (D × {0, 1})* × D into {0, 1}. Let a predictor f be given.

– Let σ = ((ϕ0 , ε0 ), . . . , (ϕp , εp )) ∈ (D × {0, 1})* be given. We say that σ is consistent with W and f iff there exists M ∈ W such that for all i ≤ p:
  • f (σ|i , ϕi ) =↓, and
  • M |= ϕi iff εi = 1.
  If σ is consistent with W and f , we call the number of i ≤ p such that f (σ|i , ϕi ) ≠ εi the number of mispredictions that f makes on σ (in P).
– We say that f is successful (in P) iff for every member M of W and every t ∈ (D × {0, 1})^N , the following holds. Assume that every finite initial segment of t is consistent with W and f . Then there exists n ∈ N such that for all finite initial segments σ of t, f makes at most n mispredictions on σ.
– Let X be the set of all σ ∈ (D × {0, 1})* that are consistent with W and f , and on which f makes at least one misprediction. We denote by Rf the binary relation over X such that for all σ, τ ∈ X, Rf (σ, τ ) holds iff τ ⊂ σ and f makes more mispredictions on σ than on τ .

The predictive complexity of P is the least ordinal of the form |Rf |, where f is a predictor that is successful in P and Rf is well-founded; if such an f does not exist, the predictive complexity of P is undefined.
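For a finite class W, a successful predictor with a small misprediction count can be sketched with the classical halving idea from Littlestone's mistake-bound model: predict according to the majority of the structures still consistent with the feedback received so far. This is our illustration only, not one of the predictors constructed in the paper; every misprediction at least halves the number of candidates, so at most log2 |W| mispredictions occur on any feedback sequence consistent with W.

```python
from math import log2

def halving_predictor(worlds, stream):
    """Predict the truth value of each incoming formula by majority vote
    over the structures still consistent with the feedback so far.

    worlds: list of structures, each a dict mapping formulas to booleans.
    stream: list of (formula, truth) pairs whose truths come from some
            member of `worlds` (the adversary stays consistent with W).
    Returns the number of mispredictions made."""
    candidates = list(worlds)
    mistakes = 0
    for formula, truth in stream:
        votes = sum(1 for M in candidates if M[formula])
        prediction = votes * 2 >= len(candidates)   # majority says True?
        if prediction != truth:
            mistakes += 1
        # keep only the structures that agree with the revealed truth
        candidates = [M for M in candidates if M[formula] == truth]
    return mistakes

# Four structures over two formulas; the adversary answers per worlds[3].
worlds = [{"p": a, "q": b} for a in (False, True) for b in (False, True)]
stream = [("p", True), ("q", True), ("p", True), ("q", True)]
m = halving_predictor(worlds, stream)
print(m, m <= log2(len(worlds)))  # at most log2(4) = 2 mispredictions
```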
Clearly, if the predictive complexity of P is defined, then the selective com-
plexity of P is defined, and at most equal to the former. The three notions of
complexity we have introduced can, similarly to VC-dimension, be interpreted
in terms of the outcome of a game between Anke and Boris. For instance, for
predictive complexity, Anke selects formulas from D and Boris has to make a
prediction whether the formula holds (in the unknown world M) or not. Anke
tells Boris whether the prediction is correct. If not, Boris has to count down an
ordinal counter. The predictive complexity is then the least ordinal to start with
for which Boris has a winning strategy.

4 Relationships between the Complexity Measures

Let us first introduce an alternative definition of the predictive complexity, followed by a definition and a lemma that are just technical tools for proving some of the main propositions.

Notation 7. Let α ∈ Ord be given, and suppose that Γβ has been defined for all β < α. Let Γα be the set of all subsets U of W such that for all ϕ ∈ D, U ∩ Mod(ϕ) or U ∩ Mod(¬ϕ) belongs to ⋃β<α Γβ ∪ {∅}.


Property 8. The predictive complexity of P is defined iff W ∈ ⋃α∈Ord Γα ; in this case it is equal to the least ordinal α with W ∈ Γα .

Definition 9. A shatterer (for P) is a sequence (Yα )α∈Ord of sets of subsets of D with the following properties.

– W shatters all members of ⋃α∈Ord Yα ;
– for all α, β ∈ Ord with β ≤ α and for all D, E ⊆ D with D ⊆ E, if E ∈ Yα
then D ∈ Yβ ;
– for all α, β ∈ Ord with β < α and for all E ∈ Yα , Yβ contains a proper
superset of E.

Lemma 10. Let a sequence (Mα )α∈Ord be inductively defined as follows. For
all α ∈ Ord, Mα is the set of all finite subsets D of D such that W shatters D
and for all β < α, Mβ contains a proper superset of D.

– (Mα )α∈Ord is a shatterer.
– For all shatterers (Yα )α∈Ord and α ∈ Ord, Yα ⊆ Mα .
– There exists a least ordinal α with Mα = Mα+1 ∪ {∅}; the VC-dimension of
P is defined and equal to α if Mα = {∅}, and undefined otherwise.
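For a finite paradigm the sequence (Mα) of Lemma 10 stabilizes after finitely many steps and can be computed directly. The sketch below uses our earlier encoding (each structure identified with the subset of D it makes true) and assumes W is nonempty, so the empty set is always shattered; for finite classes the ordinal VC-dimension computed this way coincides with the classical one.

```python
from itertools import chain, combinations

def powerset(s):
    s = list(s)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(s, r) for r in range(len(s) + 1))]

def shattered_sets(W, D):
    """All subsets E of the finite data set D that W shatters, each
    structure M in W being identified with the subset of D it makes true."""
    out = []
    for E in powerset(D):
        diagrams = {frozenset(M) & E for M in W}
        if all(S in diagrams for S in powerset(E)):
            out.append(E)
    return out

def ordinal_vc_dimension(W, D):
    """Iterate the sequence (M_alpha) of Lemma 10 until it reaches {∅};
    the number of steps is the (here finite) VC-dimension of P.
    Assumes W is nonempty, so that ∅ is shattered and the loop terminates."""
    M = set(shattered_sets(W, D))
    alpha = 0
    while M != {frozenset()}:
        # M_{alpha+1}: shattered sets with a proper superset in M_alpha
        M = {E for E in M if any(E < F for F in M)}
        alpha += 1
    return alpha

# W = all subsets of {a, b}, viewed as diagrams over D = {a, b, c}:
# the classical and the ordinal VC-dimension are both 2.
W = [set(), {"a"}, {"b"}, {"a", "b"}]
print(ordinal_vc_dimension(W, {"a", "b", "c"}))  # -> 2
```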

We are now in a position to state and prove one of the main results of the
paper, which relates VC-dimension to predictive complexity.

Proposition 11. Suppose that D is closed under boolean operators. Then the
VC-dimension of P is defined iff the predictive complexity of P is defined; more-
over, if they are defined then they are equal.

Proof. Given α ∈ Ord, let Γα denote the set of subsets of W as defined in Notation 7. For all α ∈ Ord, set Zα = {U ⊆ W : U ∉ ⋃β<α Γβ ∪ {∅}}. For all ordinals α, let Yα be the set of all finite subsets E of D such that for all D ⊆ E, {M ∈ W : DiagE (M) = D} ∈ Zα .
We show that (Yα )α∈Ord is a shatterer. Using the facts that ∅ ∉ Zα and Γβ ⊆ Γα for all α, β ∈ Ord with β ≤ α, it is immediately verified that the first two conditions in Definition 9 are satisfied. For the third condition, let α, β ∈ Ord with β < α and E ∈ Yα be given. Given D ⊆ E, set ϕD = ⋀D ∧ ¬⋁(E \ D). By the definition of Yα , ModW (ϕD ) ∈ Zα . Moreover, ModW (ϕD ) ∉ Γβ . Hence there exists ψD ∈ D such that ModW (ϕD ∧ ψD ) ∈ Zβ and ModW (ϕD ∧ ¬ψD ) ∈ Zβ . Set ϕ = ⋁{ϕD ∧ ψD : D ⊆ E}. It is immediately verified that E ∪ {ϕ} ∈ Yβ .
Assume that the predictive complexity of P is undefined. Let an ordinal α be given. Then W ∉ Γα , hence W ∈ Zα+1 , hence ∅ ∈ Yα+1 . This implies that Yα contains a non-empty set, and by Lemma 10, the VC-dimension of P is not equal to α. It follows that the VC-dimension of P is undefined.
Assume that the predictive complexity of P is defined and equal to ordinal
γ. So W ∈ Γγ ∩ Zγ . Let ϕ ∈ D be given, and suppose that {ϕ} ∈ Yγ . Then both
ModW (ϕ) and ModW (¬ϕ) are members of Zγ , which is in contradiction with
W ∈ Γγ . Hence Yγ = {∅}, and by Lemma 10 again, the VC-dimension of P is at
least equal to γ. It is easily verified that the VC-dimension of P is defined and
at most equal to γ, which completes the proof of the proposition.

The assumption that D is closed under negation is essential in Proposition 11, as shown in the next result.

Proposition 12. For every nonnull n ∈ N, there exists a finite vocabulary V′ and a finite set B of formulas over V′ with the following property. Suppose that V = V′ , W is the set of Herbrand models of B, and D is the set of positive quantifier-free sentences. Then the VC-dimension of P is equal to 1 and the predictive complexity of P is equal to ω × n.

Proof. Let a nonnull n ∈ N be given. Let V′ consist of a unary function symbol s, a binary predicate symbol <, an n-ary predicate symbol P and equality. Let B consist of the following formulas, which express that P is a predicate on N^n that is downward closed for the lexicographic ordering.

– ∀x(x < s(x)) ∧ ∀xyz((x < y ∧ y < z) → (x < z)) ∧ ∀xy(x < y → ¬(y < x));
– ∀x1 . . . xn y1 . . . yn ((⋁i<n (y1 = x1 ∧ . . . ∧ yi = xi ∧ yi+1 < xi+1 ) ∧ P (x1 , . . . , xn )) → P (y1 , . . . , yn )).

Clearly, the VC-dimension of P is equal to 1. It can be verified that the predictive complexity of P is equal to ω × n.

Denote by W the topological space over W generated by the sets of the form
ModW (ϕ), where ϕ varies over D. To establish relationships between predictive,
selective, and identification complexities, we usually need to assume that W is
compact. This is the case for the next result.

Proposition 13. Assume that D is closed under negation and that W is com-
pact. If the identification complexity of P is defined and equal to ordinal α, then
the predictive complexity of P is defined and at most equal to ω × α.

Proof. Suppose that the identification complexity of P is defined and equal to ordinal α. Choose a canonical identifier f for P. Let a predictor h be defined as follows. Given a member σ = ((ϕ1 , ε1 ), . . . , (ϕn , εn )) of (D × {0, 1})*, put σ̃ = (ψ1 , . . . , ψn ) where for all nonnull i ≤ n, ψi = ϕi if εi = 1 and ψi = ¬ϕi otherwise. Let σ ∈ (D × {0, 1})* and ϕ ∈ D be given. If ModW (rng(σ̃)) ⊆ Mod(ϕ) then h(σ, ϕ) = 1. If ModW (rng(σ̃)) ⊆ Mod(¬ϕ) then h(σ, ϕ) = 0. If neither ModW (rng(σ̃)) ⊆ Mod(ϕ) nor ModW (rng(σ̃)) ⊆ Mod(¬ϕ), and if f (σ̃) is defined and is a model of ϕ, then h(σ, ϕ) = 1. Otherwise h(σ, ϕ) = 0. It is immediately verified that h is successful in P. The proof of the proposition is completed if we show that the length of Rh is at most equal to ω × α. Let a member of the field of Rh of the form σ ⋄ (ϕ, 0) be given. Let X be the set of all τ ∈ (D ∪ {#})* such that rng(σ̃) ⊆ rng(τ ), {ϕ, ¬ϕ} ∩ rng(τ ) ≠ ∅, τ is consistent in W, and f (τ ) =↓. By the choice of f , and since the restriction of W to ModW (rng(σ̃)) is compact, there exists a finite F ⊆ X such that ModW (rng(σ̃)) = ⋃{ModW (rng(τ )) : τ ∈ F }. Let n be the sum of the lengths of the members of F . It is easy to verify that ρRh (σ ⋄ (ϕ, 0)) ≤ ω × supτ ∈F ρRf (τ ) + n. This implies immediately that |Rh | is at most equal to ω × α, as wanted.

When D is closed under negation, identification and selective complexities are very close concepts, as shown next.
Proposition 14. Suppose that D is closed under negation. If the selective com-
plexity of P is defined and equal to ordinal α, then the identification complexity
of P is defined and at most equal to α + 1.

Proof. Let a selector g be successful in P and such that the length of Rg is defined and equal to ordinal α. Without loss of generality, we can assume that for all σ, τ ∈ {0, 1}*, if σ ⊂ τ then g(σ) ≠ g(τ ). Let an identifier f be defined as follows. Let σ = (ϕ0 , . . . , ϕn ) ∈ (D ∪ {#})* be such that cnt(σ) is consistent in W. We define a p ∈ N, a member σ′ of {0, 1}* and a finite sequence (ψi )i<p of members of D as follows. Let i ∈ N be given, and suppose that ψj has been defined for all j < i. Let τ = (ε0 , . . . , εi−1 ) be the unique member of {0, 1}* such that for all j < i, εj = 1 if ψj occurs in σ, and εj = 0 if ¬ψj occurs in σ. If g(τ ) =↓ and either g(τ ) or ¬g(τ ) occurs in σ, then ψi = g(τ ). Otherwise p = i and we set σ′ = τ . Let X be the set of all formulas occurring in σ together with all formulas of the form g(σ′ ⋄ η) where η ∈ {1}* and g(σ′ ⋄ η) =↓. Set f (σ) = X. It is easily verified that f is successful in P and that the length of Rf is defined and at most equal to α + 1. This completes the proof of the proposition.

Assuming that D is closed under boolean operators, W is countable and W is compact, we can then put together the results that have been obtained previously and derive new connections, resulting in a complete picture of the relationships between all notions of complexity.

Proposition 15. Suppose that D is closed under boolean operators, W is countable and W is compact. Then there exist ordinals α, β such that:
– ω × α ≤ β ≤ ω × (α + 1);
– the VC-dimension of P is equal to β;
– the predictive complexity of P is equal to β;
– the identification complexity of P is equal to α + 1;
– the selective complexity of P is equal to α or α + 1.
Proof. We first show that the VC-dimension of P is defined, which, together with Propositions 11 and 14, implies immediately that the predictive complexity of P, the identification complexity of P, and the selective complexity of P are also defined.
defined. Suppose for a contradiction that there exists an infinite subset X of D
such that for all finite E ⊆ X and D ⊆ E, D = DiagE (M) for some M ∈ W.
Since W is compact, it follows that for all Y ⊆ X, Y = DiagX (M) for some
M ∈ W, which contradicts the assumption that W is countable. Thus every
infinite subset of D has a finite subset that W does not shatter, and the VC-
dimension of P is defined, as wanted.
We now show that the identification complexity of P is a successor ordinal. Choose a canonical identifier f for P. It suffices to show that the length of Rf is not a limit ordinal. Suppose otherwise for a contradiction. Let X be the set of all finite subsets D of D such that f (D) =↓. Since W is compact and f is successful in P, there exists a finite subset F of X such that W = ⋃{ModW (D) : D ∈ F }. This implies that |Rf | is equal to sup{ρRf (D) + 1 : D ∈ F }, hence is smaller than |Rf | since F is finite and |Rf | is a limit ordinal. Contradiction.
Denote by α + 1 the identification complexity of P, and by β the predictive complexity of P. By Proposition 11, the VC-dimension of P is equal to β. By Proposition 14, the selective complexity of P is at least equal to α. We show that the selective complexity of P is at most equal to α + 1. Choose a set-driven identifier f such that f is successful in P and the length of Rf is defined and equal to ordinal α + 1. Fix an enumeration (ϕi )i∈N of D. We define by induction a sequence (hi )i∈N of selectors with finite domains such that for all distinct i, j ∈ N, hi and hj have disjoint domains. Let i ∈ N be given, and suppose that hj has been defined for all j < i. If i = 0 put Z = {()}. If i ≠ 0, let Z be the set of all ⊆-maximal members of the domain of hi−1 . Let σ ∈ Z be given. If σ = () let σ̃ = σ. If σ is nonempty and equal to (ε0 , . . . , εk ), denote by σ̃ the sequence (ψ0 , . . . , ψk ) of members of D such that for all j ≤ k, ψj = f (σ|j ) if εj = 1, and ψj = ¬f (σ|j ) otherwise. If ModW (rng(σ̃)) ⊆ Mod(ϕi ) then set hi (σ) = ϕi . If ModW (rng(σ̃)) ⊆ Mod(¬ϕi ) then set hi (σ) = ¬ϕi . Suppose that neither ModW (rng(σ̃)) ⊆ Mod(ϕi ) nor ModW (rng(σ̃)) ⊆ Mod(¬ϕi ). Let Xσ be the set of all τ ∈ (D ∪ {#})* such that rng(σ̃) ⊆ rng(τ ), {ϕi , ¬ϕi } ∩ rng(τ ) ≠ ∅, τ is consistent in W, and f (τ ) =↓. By the choice of f , and since the restriction of W to ModW (rng(σ̃)) is compact, there exists a finite Fσ ⊆ Xσ such that ModW (rng(σ̃)) = ⋃{ModW (rng(τ )) : τ ∈ Fσ }. Without loss of generality, we can assume that there exist ψ0 , . . . , ψk ∈ D such that for all τ ∈ Fσ and for all j ≤ k, either ψj or ¬ψj occurs in τ , and every formula that occurs in τ is of the form ψj or ¬ψj for some j ≤ k. Let n be the cardinality of Fσ , and fix an enumeration (Dp )p<n of Fσ . Note that for all distinct p, p′ < n, ModW (Dp ) ∩ ModW (Dp′ ) = ∅. Given m < n, set σ^m = 1^m ⋄ 0 ⋄ 1^(n−m−1) . Now for all σ ∈ Z, m < n and p < n, set hi (σ ⋄ σ^m |p ) = ¬⋀Dp ; for all σ ∈ Z and m < n, set hi (σ ⋄ σ^m ) = ϕi if Dm |= ϕi , and hi (σ ⋄ σ^m ) = ¬ϕi otherwise. Set h = ⋃i∈N hi . It is easily verified that h succeeds in P, and that the length of Rh is defined and at most equal to α + 1.
By Proposition 13, β ≤ ω × (α + 1). So to complete the proof of the proposition, it suffices to show that ω × α ≤ β. Remember the definition of the sequence (Γα )α∈Ord from Notation 7. By Property 8, W ∈ Γβ . Define an identifier f as follows. Let σ ∈ (D ∪ {#})* be given. Let γ be the least ordinal with ModW (cnt(σ)) ∈ Γγ .
– If there exists τ ⊂ σ such that f (τ ) is defined and equal to the D-diagram of a model of cnt(σ) in W, then f (σ) = f (τ ).
– Otherwise, and if there exists M ∈ W such that for all ϕ ∈ D, M is a model of ϕ iff ModW (cnt(σ) ∪ {ϕ}) ∉ ⋃γ′<γ Γγ′ , then f (σ) = DiagD (M).
– Otherwise f (σ) =↑.

Let σ ∈ (D ∪ {#})* be such that f (σ) =↓, and let γ be the least ordinal with ModW (cnt(σ)) ∈ Γγ . Suppose for a contradiction that γ is neither 0 nor a limit ordinal. It then follows from the definition of Γγ that there exists ϕ ∈ D such that Γγ−1 contains neither ModW (cnt(σ) ∪ {ϕ}) nor ModW (cnt(σ) ∪ {¬ϕ}), which is impossible by the definition of f . Let λ be the number of limit ordinals less than or equal to β. We infer easily from the previous remark that α + 1 ≤ λ + 1, hence ω × α ≤ ω × λ ≤ β, and we are done.

We illustrate the previous results, especially the bounds that have been ob-
tained, with a few examples.
Example 16. Let V consist of 0, s, and a unary predicate symbol P . Let W be the set of Herbrand models of ∀x(P (x) ∧ P (s(x)) → P (s(s(x)))) ∧ ∃y(P (y) ∧ P (s(y))). Then the identification complexity of P is 1 but the predictive complexity of P is undefined.

Example 17. Assume that D is closed under Boolean operators. If the VC-
dimension of P is a nonnull n ∈ N then both the identification and the selective
complexity of P are 1, and the predictive complexity of P is n.

Proposition 18. Suppose that V consists of 0, s, and a unary predicate symbol P . Set D = {P (n) : n ∈ N}. Assume that W is the set of Herbrand structures whose D-diagrams are {P (nk) : n ∈ N} for k ∈ N. Then the identification complexity of P is 2, and both the VC-dimension and the predictive complexity of P are ω.

Proof. It is easily verified that an identifier can be successful in P by first hypothesizing {P(0)}, and in the case of an environment for the structure whose D-diagram is T = {P(nk) : n ∈ N} for some nonnull k ∈ N, by eventually changing {P(0)} to T. It follows that the identification complexity of P is 2.

On Ordinal VC-Dimension and Some Notions of Complexity 65

It is easily verified that the VC-dimension and the predictive complexity of P are at most equal to ω. To see that they are at least equal to ω, let a nonnull n ∈ N be
given, and let p0, p1, . . . , p2^n−1 be an enumeration of the first 2^n prime numbers. For all k < n, let qk be the product of all pi, 0 ≤ i < 2^n, such that the (k + 1)st bit in the binary representation of i is equal to 1. It is immediately verified that W shatters {P(q0), P(q1), . . . , P(qn−1)}, implying that the VC-dimension of P is at least equal to n, hence the predictive complexity of P is also at least equal to n.
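The shattering construction in this proof is concrete enough to check mechanically for small n. The following Python sketch (all helper names are ours, not the paper's) builds q0, . . . , qn−1 from the first 2^n primes and verifies that the structures of W — whose D-diagrams are the sets of multiples of a fixed number — realize all 2^n membership patterns on {P(q0), . . . , P(qn−1)}:

```python
from itertools import count

def first_primes(m):
    """Return the first m primes by naive trial division."""
    primes = []
    for c in count(2):
        if all(c % p for p in primes):
            primes.append(c)
            if len(primes) == m:
                return primes

def shatter_witnesses(n):
    """q_k = product of those p_i whose (k+1)st binary bit is 1,
    so that p_i divides q_k exactly when bit k of i is set."""
    ps = first_primes(2 ** n)
    qs = []
    for k in range(n):
        prod = 1
        for i, p in enumerate(ps):
            if (i >> k) & 1:
                prod *= p
        qs.append(prod)
    return ps, qs

def is_shattered(n):
    """Every subset of {q_0, ..., q_{n-1}} is cut out by the structure
    whose diagram is the set of multiples of some prime p_i."""
    ps, qs = shatter_witnesses(n)
    patterns = {tuple(q % p == 0 for q in qs) for p in ps}
    return len(patterns) == 2 ** n
```

For n = 3, this checks all 8 membership patterns against the first 8 primes.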

Example 19. Suppose that V consists of 0, s, and a unary predicate symbol P. Assume that W is the set of Herbrand structures M such that for all n ∈ N, if M |= P(n) then there exist at most n natural numbers m such that M |= P(m). Suppose that D is the set of quantifier-free sentences. Then the identification complexity of P is ω + 1, the selective complexity of P is ω, and the VC-dimension of P is ω^2.
Note that if, in the previous example, we remove from W the structure M such that M |= ∀x¬P(x)—resulting in a noncompact topological space W—then the identification complexity becomes ω; but in both cases, the traditional notion of mind change complexity is ω.
Proposition 20. Let a countable ordinal α be given. There exist a finite vocabulary V′ and two sets K and E of sentences over V′ such that if V = V′, D = E, and W is the set of Herbrand models of K, then the identification complexity of P is α + 1, the selective complexity of P is α, and both the VC-dimension and the predictive complexity of P are ω × α.

Proof. Suppose that V consists of a constant, a unary function symbol, a unary predicate symbol P, and a binary predicate symbol ≤. Identify the set X of closed terms with the set of ordinals smaller than α, writing X = {cβ : β < α}. Let D be the closure of {P(cβ) : β < α} under Boolean operators. Suppose that W is the set of Herbrand models of

{cβ ≤ cγ : β, γ < α, β ≤ γ} ∪ {∀x∀y((P(x) ∧ x ≤ y) → P(y))}.



We define by induction, for all members σ of (D ∪ {♯})∗, three ordinals: counterσ, aboveσ, and belowσ. For the empty sequence, set counter = α, above = ω^α and below = 0. Let σ ∈ (D ∪ {♯})∗ and x ∈ D ∪ {♯} be given, and suppose that counterσ, aboveσ and belowσ have been defined. Let aboveσx be the minimal ordinal β such that cβ occurs in σx if such a β exists, and let aboveσx be ω^α otherwise. Let belowσx be the maximal ordinal β such that cβ occurs in σx if such a β exists, and let belowσx be 0 otherwise. If there exist (unique) ordinals β, γ such that β < counterσ, aboveσ = ω^β × γ and belowσ + ω^β ≥ aboveσ, then set counterσx = β; otherwise set counterσx = counterσ. Let f be the identifier defined as follows. Let σ ∈ (D ∪ {♯})∗ be given. If aboveσ = ω^α then f(σ) is the D-diagram of the member M of W such that M |= ∀x¬P(x); otherwise, f(σ) is the D-diagram of the member M of W such that M |= ∀x(P(x) ↔ x ≥ c_{aboveσ}).
66 E. Martin, A. Sharma, and F. Stephan

It is immediately verified that f is successful in P, and that the length of Rf is defined and equal to α + 1. It is also easily verified that the identification complexity of P cannot be smaller than α + 1, and that the selective complexity of P is α.
Since W is a compact topological space, it follows from Propositions 11 and 15 that the VC-dimension and the predictive complexity of P are equal, and at most equal to ω × α. So to complete the proof of the proposition, it suffices to show that the VC-dimension of P is at least equal to ω × α. Suppose for a contradiction that the VC-dimension of P is equal to an ordinal α′ with α′ < ω × α. The argument uses the presentation of the VC-dimension as a game between Anke and Boris. At round n, Anke defines a set of ordinals On and a formula ψn, then outputs a set En, before Boris outputs an ordinal βn. Set O0 = {0}, ψ0 = P(c0), E0 = {ψ0}, and β0 = α′. Let n ∈ N be given and suppose that On, ψn, En and βn have been defined. Let ordinal γ and m ∈ N be such that βn = ω × γ + m. Set On+1 = On ∪ {δ + ω^γ + 2m : δ ∈ On}. Let ψn+1 be a formula expressing that the cardinality of the members x of On+1 that have property P is odd. Let En+1 = {ψ0, . . . , ψn+1}. Note that On+1 expands On with exactly one ordinal between two ordinals in On with no ordinal in On in between, plus an ordinal greater than all ordinals in On. It is easily verified that for all n ∈ N, W shatters En. Hence Anke wins the game, a contradiction.
If the previous example were modified by taking {P(cβ) : β < α} as the set of possible data, resulting in a new paradigm P′, then the identification complexity of P′ would clearly be equal to ω^α + 1. More generally, there exists a relationship between the identification complexities of two paradigms that differ in their sets of possible data—one set of data being closed under negation in one paradigm, and not in the other. This relationship is considered in the next proposition, and the previous example shows that the “exponential” difference between the identification complexities of both paradigms is almost as large as it can be. First note that Angluin’s finite telltale condition takes the following form in the logical framework.
Lemma 21. Some identifier is successful in P iff for all M ∈ W, there exists a finite A ⊆ DiagD(M) such that no N ∈ W satisfies A ⊆ DiagD(N) ⊂ DiagD(M).

Proposition 22. Suppose that D is the closure under negation of some set of sentences D′, and set P′ = (V, D′, W). Assume that W is compact, the identification complexity of P is defined and equal to a nonnull ordinal α, and some identifier is successful in P′. Then the identification complexity of P′ is defined and smaller than ω^α.
Proof. The proof is trivial if {DiagD′(M) : M ∈ W} is finite, so suppose otherwise. Let (ψi)i∈N be a repetition-free enumeration of D′. Let X be the set of finite sequences of the form (ϕ0, . . . , ϕn−1), n ∈ N, where for all i < n, ϕi = ψi or ϕi = ¬ψi, such that {ϕ0, . . . , ϕn−1} is consistent in W. Let f be a canonical identifier for P. Let a finite subset E of D be consistent in W. Denote by UE the set of all ⊆-minimal members σ of X such that f(σ) is defined and contains E.
Suppose for a contradiction that UE is infinite. By König’s lemma, there exists a sequence (ξi)i∈N of formulas such that for all n ∈ N, (ξ0, . . . , ξn−1) = σ|n for some σ ∈ UE. Note that E ⊆ rng(σ) for cofinitely many members σ of UE, hence E ⊆ {ξi : i ∈ N}. Moreover, for all distinct σ, τ ∈ UE, neither σ ⊆ τ nor τ ⊆ σ, hence σ ⊄ {ξi : i ∈ N} for cofinitely many members σ of UE. Hence f((ξ0, . . . , ξn)) is either undefined or does not contain E for infinitely many n ∈ N. By compactness of W, {ξi}i∈N is the D-diagram of a member of W. But this is in contradiction with the fact that f is successful in P. So we have verified that UE is finite. Note that since f is successful in P, ModW(E) is included in ⋃{ModW(cnt(σ)) : σ ∈ UE}. Let (σ1, . . . , σp) be a repetition-free enumeration of UE with ρRf(σ1) ≥ . . . ≥ ρRf(σp). Set βE = ω^ρRf(σ1) + . . . + ω^ρRf(σp).
Let finite E, F ⊆ D be such that E ⊆ F and UF is nonempty and distinct from UE. Let τ ∈ UF \ UE be given. Then there exists i ∈ {1, . . . , p} such that σi ⊂ τ, f(σi) ≠ f(τ), and σi ∉ UF. Let D be the set of all members of UF that strictly extend σi. Clearly, ρRf(σi) > ρRf(γ) for all γ ∈ D. Moreover, the factor ω^ρRf(σi) in the sum βE is replaced in the sum βF by the factors ω^ρRf(γ), with γ varying over D. Due to the powers of ω, it is easily verified that the extra factors ω^ρRf(γ), γ ∈ D, ‘weigh less’ than ω^ρRf(σi). Since every σi ∈ UE is either also in UF or replaced in βF by a set D with the properties just described, we conclude that βE > βF.
Fix a repetition-free enumeration (Ti)i∈N of {DiagD′(M) : M ∈ W}. By Lemma 21, choose for all i ∈ N a finite subset Ai of Ti such that no member M of W satisfies Ai ⊆ DiagD′(M) ⊂ Ti. Let Y be the set of all σ ∈ (D′ ∪ {♯})∗ such that Ai ⊆ cnt(τ) ⊆ Ti for some i ∈ N and some initial segment τ of σ. For all σ ∈ Y, denote by σ̂ the ⊆-maximal initial segment of σ such that Ai ⊆ cnt(σ̂) ⊆ Ti for some i ∈ N. Now define an identifier f′ for P′ as follows. Let σ ∈ (D′ ∪ {♯})∗ be given. If σ ∉ Y then f′(σ)↑; otherwise, f′(σ) = Ti where i ∈ N is least with Ai ⊆ cnt(σ̂) ⊆ Ti. Let σ, τ ∈ (D′ ∪ {♯})∗ be such that σ ⊂ τ, and f′(σ) and f′(τ) are defined but distinct. Then both σ and τ belong to Y. Clearly, there exists i ∈ N such that cnt(σ) ⊆ Ti but cnt(τ) ⊈ Ti, which implies that

Ti ∈ ⋃{ModW(cnt(γ)) : γ ∈ Ucnt(τ)} \ ⋃{ModW(cnt(γ)) : γ ∈ Ucnt(σ)}.
We infer that Ucnt(τ) is nonempty and distinct from Ucnt(σ), which we know implies that βcnt(σ) > βcnt(τ). It then follows that the height of Rf′ is defined and at most equal to ω^α. Moreover, since W is compact, the reasoning in Proposition 15 shows that the identification complexity of P′ is not a limit ordinal, hence is smaller than ω^α, as wanted.
It was shown in [11] that every class of languages that is finitely identifiable from informants is also identifiable in the limit from texts. But such a relationship does not generalize to identifiability from informants with one mind change at most. Indeed, the class C consisting of N and all finite initial segments of N is not learnable in the limit from texts, as proved in [8], whereas C is clearly learnable from informants with one mind change at most. Cast into the logical framework, this provides an example of V, W, D, and D′ such that D is the closure of D′ under negation, W is compact, the identification complexity of P is equal to 1, but no identifier is successful in P′ = (V, D′, W). This shows that Proposition 22 does not hold if the assumption that some identifier is successful in P′ is dropped.

5 Conclusion
In ideal paradigms of inductive inference, finite tell-tale conditions offer characterizations of identification in the limit or of classification, with or without a mind change bound. Assuming that the set of data is closed under Boolean operators, the VC-dimension offers a characterization of prediction. An extra topological assumption of compactness makes it possible to provide a complete picture of the relationship between the VC-dimension and other notions of complexity, including mind change complexity.

References
1. Ambainis, A., Jain, S., Sharma, A.: Ordinal mind change complexity of language
identification. Theoretical Computer Science. 220(2) pp. 323–343 (1999)
2. Ambainis, A., Freivalds, R., Smith, C.: Inductive Inference with Procrastination:
Back to Definitions. Fundamenta Informaticae. 40 pp. 1–16 (1999)
3. Angluin, D.: Inductive Inference of Formal Languages from Positive Data. Information and Control 45 pp. 117–135 (1980)
4. Ben-David, S., Gurvits, L.: A note on VC-Dimension and Measure of Sets of Reals. Combinatorics, Probability and Computing 9 pp. 391–405 (2000)
5. Ben-David, S., Jacovi, M.: On Learning in the Limit and Non-Uniform (ε, δ)-Learning. In Proceedings of the Sixth Conference on Computational Learning Theory. ACM Press pp. 209–217 (1993)
6. Blumer, A., Ehrenfeucht, A., Haussler, D., Warmuth, M.: Learnability and the
Vapnik-Chervonenkis Dimension. J. ACM 36(4) pp. 929–965 (1989)
7. Freivalds, R., Smith, C.: On the role of procrastination for machine learning. In-
form. Comput. 107(2) pp. 237–271 (1993)
8. Gold, E.: Language Identification in the Limit. Information and Control 10 pp. 447–474 (1967)
9. Martin, E., Sharma, A., Stephan, F.: A General Theory of Deduction, Induction,
and Learning. In Jantke, K., Shinohara, A.: Proceedings of the Fourth International
Conference on Discovery Science. Springer-Verlag, LNAI 2226 pp. 228–242 (2001)
10. Martin, E., Sharma, A., Stephan, F.: Logic, Learning, and Topology in a Common
Framework. In Cesa-Bianchi, N., Numao, M., Reischuk, R.: Proc. of the 13th Intern.
Conf. on Alg. Learning Theory. Springer-Verlag, LNAI 2533 pp. 248–262 (2002)
11. Sharma, A.: A note on batch and incremental learnability. Journal of Computer
and System Sciences 56 pp. 272–276 (1998)
12. Valiant, L.: A Theory of the Learnable. Commun. ACM 27(11) pp. 1134–1142
(1984)
13. Vapnik, V., Chervonenkis, A.: On the uniform convergence of relative frequencies
of events to their probabilities. Theory of Probabilities and its Applications 16(2)
pp. 264–280 (1971)
Learning of Erasing Primitive Formal Systems
from Positive Examples

Jin Uemura and Masako Sato

Department of Mathematics and Information Sciences
Osaka Prefecture University, Sakai, Osaka 599-8531, Japan
{jin, sato}@[Link]

Abstract. An elementary formal system, EFS for short, introduced by Smullyan is a kind of logic program over strings, and is regarded as a grammar to generate a language. Arikawa and his colleagues introduced some subclasses of EFSs which correspond to the Chomsky hierarchy, and showed that they constitute a useful framework for language learning. This paper considers a subclass of EFSs, called primitive EFSs, in view of inductive inference in the Gold framework from positive examples. Shinohara showed that the class of languages generated by primitive EFSs is inferable from positive examples, where ε substitutions, i.e., substitutions that may substitute the empty string for variables, are not allowed. In the present paper, we consider allowing ε substitutions, and call such EFSs erasing EFSs. It is unknown whether or not the class of erasing pattern languages is learnable from positive examples. An erasing pattern language is a language generated by an erasing EFS with just one axiom.
We first show that the class PFSL of languages generated by erasing
primitive EFSs does not have finite elasticity, but has M-finite thickness.
The notions of finite elasticity and M-finite thickness were introduced
by Wright, and Moriyama and Sato, respectively, to present sufficient
conditions for learnability from positive examples. Moriyama and Sato
showed that a language class with M-finite thickness is learnable from
positive examples if and only if for each language in the class, there is
a finite tell-tale set of the language. Then we show that the class PFSL is
learnable from positive examples by presenting a finite tell-tale set for
each language in the class.

1 Introduction

An elementary formal system, EFS for short, is a kind of logic program over strings consisting of finitely many axioms. A pattern is a finite string of constant symbols and variables. A pattern is regular if each variable appears in the pattern at most once. In EFSs, patterns are used as terms in logic programming.
For example, Γ = {p(ab) ←, p(axb) ← p(x)} is an EFS with two axioms, where p is a unary predicate symbol, a and b are constant symbols, and x is a variable; the patterns ab and axb are used as terms.

R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 69–83, 2003.
© Springer-Verlag Berlin Heidelberg 2003

An EFS generates a language of constant strings obtained by applying substitutions for variables and Modus Ponens to axioms in the EFS. In the above example, the language generated by Γ is L(Γ) = {a^n b^n | n ≥ 1}.
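This generation process is easy to simulate: the provable atoms p(w) form the least fixed point of the two axioms, which can be enumerated bottom-up below a length cutoff. A minimal sketch (the function name and the cutoff are ours, not the paper's):

```python
def generated_language(max_len):
    """Bottom-up closure of Gamma = {p(ab) <-, p(axb) <- p(x)}:
    start from the base axiom and repeatedly apply the second one,
    keeping only strings no longer than max_len."""
    lang = {"ab"}
    while True:
        new = {"a" + w + "b" for w in lang if len(w) + 2 <= max_len} - lang
        if not new:
            return lang
        lang |= new
```

Up to length 8 this yields exactly ab, aabb, aaabbb and aaaabbbb, matching L(Γ) = {a^n b^n | n ≥ 1}.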
The framework of EFSs was introduced by Smullyan [14] to develop his recursive function theory. Arikawa and his colleagues [2] introduced some subclasses of EFSs whose language classes correspond to the Chomsky hierarchy. Among them, the class of length-bounded EFSs, which generates the class of context-sensitive languages, was especially investigated from the viewpoint of learning languages from positive examples in the Gold framework [4] (Shinohara [13], Moriyama & Sato [5], Mukouchi & Arikawa [7], Sato [10]).
Wright [16] introduced the notion of finite elasticity as a sufficient condition for learnability from positive examples. Shinohara [13] showed that the class of languages generated by length-bounded EFSs with at most n axioms has finite elasticity, where ε substitutions, i.e., substitutions that may substitute the empty string for variables, are not allowed.
In the present paper, we consider allowing ε substitutions, and call such EFSs erasing EFSs. As easily seen, the language generated by an erasing EFS is in general different from that generated by the nonerasing EFS with the same axioms.
The language of an EFS Γ = {p(π) ←} is called an erasing or extended pattern language if Γ is erasing, and a nonerasing pattern language otherwise. It is well known that the class of nonerasing pattern languages is learnable from positive examples ([1]), but this is unknown for the class of erasing pattern languages. Note that it was shown that the latter class is not learnable from positive examples when the number of constant symbols is two ([9]). The present authors [15] proved a compactness theorem for bounded unions of erasing regular pattern languages which plays an essential role in designing an efficient learning algorithm for unions.
An EFS is regular if each axiom in the EFS is of the form p(π) ← q1(x1), · · · , qn(xn), where π is a regular pattern, p and the qi's are unary predicate symbols, and x1, · · · , xn are all of the mutually distinct variables appearing in π. The class of languages generated by erasing regular formal systems, RFSs for short, corresponds to that of context-free languages ([2]). Recently Mukouchi [8] has shown that the class of languages generated by erasing RFSs with at most n axioms has finite elasticity, and so is learnable from positive examples, provided every regular pattern π in the axioms is of a restricted form called canonical, i.e., contains no successive variables.
In this paper, we deal with a subclass of RFSs called primitive formal systems, PFSs for short, consisting of exactly two axioms of the forms p(π) ← and p(τ) ← p(x1), · · · , p(xn), where τ may be of noncanonical form, but some condition is imposed when π = ε. The class of nonerasing PFSs was first introduced by Shinohara [12], and is a proper subclass of RFSs. Note that Mukouchi [8] has shown that the class of languages generated by erasing EFSs with two axioms whose patterns are not always regular is not learnable from positive examples when |Σ| ≥ 2.

Unfortunately, the class PFSL of languages generated by erasing PFSs is shown not to have finite elasticity. Moriyama and Sato [5] introduced a notion of M-finite thickness, and showed that a class with M-finite thickness is learnable from positive examples if and only if for each language in the class, there is a finite tell-tale set of the language. As well as finite elasticity, M-finite thickness is known to have some good properties such as closure under various class operations, but M-finite thickness alone is not a sufficient condition for learnability ([10]).
We first investigate the inclusion problem L(Γ′) ⊆ L(Γ) for given PFSs Γ′ and Γ. Then, we show that the class of languages generated by PFSs has M-finite thickness, as does the class of nonerasing length-bounded EFSs. Finally we show that the class PFSL is learnable from positive examples by showing that for each PFS Γ, there is a finite tell-tale set of the language generated by Γ.

2 Erasing PFS Languages

Let Σ be a finite alphabet, X be a countable set of variables, and Π be a set of predicate symbols. We assume these sets Σ, X, Π are mutually disjoint. Each predicate symbol is associated with a positive integer called its arity.
A pattern is a string (possibly the empty string ε) over Σ ∪ X. An atom is an expression of the form p(π1, · · · , πn), where p is a predicate symbol with arity n and π1, · · · , πn are patterns. A definite clause is a clause of the form A ← B1, · · · , Bm (m ≥ 0), where A, B1, · · · , Bm are atoms. The atom A is called the head and the part B1, · · · , Bm the body of the definite clause.

Definition 1. An elementary formal system, EFS for short, is a finite set of definite clauses. For an EFS Γ, each definite clause in Γ is called an axiom of Γ.

A substitution is a homomorphism from patterns to patterns that maps each symbol a ∈ Σ to itself. We permit ε substitutions, which may map some variables to the empty string. By πθ, we denote the image of a pattern π under a substitution θ. For an atom A = p(π1, · · · , πn) and a clause C = A ← B1, · · · , Bm, we define Aθ = p(π1θ, · · · , πnθ) and Cθ = Aθ ← B1θ, · · · , Bmθ.
A definite clause C is provable from an EFS Γ, denoted by Γ ⊢ C, if C is obtained by finitely many (possibly 0) applications of substitutions and Modus Ponens. We define the language L(Γ, p) = {w ∈ Σ∗ | Γ ⊢ p(w)}, where p is a unary predicate symbol.
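For illustration, substitutions can be applied directly if patterns are represented as lists of symbols — one-character constants and variable names x1, x2, . . . (this representation and the helper name are our own convention, not the paper's):

```python
def apply_substitution(pattern, theta):
    """Apply substitution theta to a pattern given as a list of symbols.
    theta maps variable names to patterns; unmapped variables stay as-is.
    Mapping a variable to [] realizes an erasing (epsilon) substitution."""
    result = []
    for s in pattern:
        if s.startswith("x"):            # variable occurrence
            result.extend(theta.get(s, [s]))
        else:                            # constant symbol
            result.append(s)
    return result
```

For the pattern axb, erasing x yields ab, while x := ab yields aabb.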

Definition 2. An EFS Γ is a simple formal system, an SFS for short, if each clause in Γ is of the form p(π) ← q1(x1), · · · , qn(xn), where p and the qi's are unary predicate symbols and x1, · · · , xn are mutually distinct variables appearing in π. A pattern π is regular if each variable appears in π at most once. An SFS Γ is a regular formal system, an RFS for short, if all patterns in heads of clauses in
Γ are regular. An RFS Γ is a primitive formal system, a PFS for short, if it contains exactly two clauses of the forms

p(π) ← and p(τ) ← p(x1), p(x2), · · · , p(xn),

where x1, x2, · · · , xn are all of the variables appearing in τ. The former is called the base step and the latter the induction step of Γ.
A language L is an erasing EFS (resp., SFS, RFS or PFS) language if L = L(Γ, p) for some erasing EFS (resp., SFS, RFS or PFS) Γ and some unary predicate symbol.
A language L is an erasing regular pattern language if L = L(Γ, p) for some RFS Γ = {p(π) ←}.
It can be shown that the class of erasing RFS languages corresponds to that of context-free languages (Arikawa et al. [2]). Thus a PFS language is context-free, but not always regular. In fact, the following PFS language is not regular.

Example 1. Let us consider the PFS Γ = {p(ε) ←, p(axb) ← p(x)}, where Σ = {a, b}. Then, as easily seen, L(Γ, p) = {a^n b^n | n ≥ 0}.
In this paper, we deal with the class of PFS languages. In what follows, we fix a unary predicate symbol, say p, and denote L(Γ, p) simply by L(Γ). Moreover, we denote a PFS Γ = {p(π) ←, p(τ) ← p(x1), · · · , p(xn)} and L(Γ) by Γ = (π, τ) and L(π, τ), respectively. Similarly, by L(π) we denote L({p(π) ←}).
For patterns π and τ, we introduce binary relations ⪯ and ≡ as follows: π ⪯ τ if π = τθ for some substitution θ, and π ≡ τ if π ⪯ τ and τ ⪯ π. A renaming of variables is a substitution θ such that xθ ∈ X, and x ≠ y implies xθ ≠ yθ for any x, y ∈ X. In this paper, we identify patterns equivalent to each other by renaming. Thus ax1bx2 = ax2bx1 and ax1bx2 ≡ ax1x2bx3 but ax1bx2 ≠ ax1x2bx3. A pattern π is of canonical form if π ≡ τ implies |π| ≤ |τ| for any pattern τ, π contains exactly n variables x1, · · · , xn for some integer n, and the leftmost occurrence of xi is to the left of the leftmost occurrence of xi+1 for each i, where |π| is the length of the pattern π.
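For small patterns, the relation π ⪯ τ can be decided by brute force: search for a substitution θ with π = τθ, letting each variable of τ absorb a (possibly empty) factor of π, consistently across repeated occurrences. A sketch under the same list representation of patterns (helper name ours):

```python
def instance_of(pi, tau):
    """Decide pi <= tau, i.e. pi = tau theta for some (possibly erasing)
    substitution theta, by backtracking over factorizations of pi."""
    def match(i, j, theta):
        if j == len(tau):
            return i == len(pi)
        s = tau[j]
        if not s.startswith("x"):                 # constant: literal match
            return i < len(pi) and pi[i] == s and match(i + 1, j + 1, theta)
        if s in theta:                            # bound variable: reuse binding
            b = theta[s]
            return pi[i:i + len(b)] == b and match(i + len(b), j + 1, theta)
        for k in range(i, len(pi) + 1):           # try every factor, incl. empty
            if match(k, j + 1, {**theta, s: pi[i:k]}):
                return True
        return False
    return match(0, 0, {})
```

For instance, ab ⪯ ax1b via the erasing binding x1 := ε, while ba ⋠ ax1b.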
Lemma 1 (Shinohara [11]). Suppose that |Σ| ≥ 3. Let π and τ be regular patterns. Then

(i) π ⪯ τ ⇐⇒ L(π) ⊆ L(τ), (ii) π ≡ τ ⇐⇒ L(π) = L(τ).

By the definition of L(π, τ), it follows that L(π, τ) = L(π) ∪ L′, where L′ is the set of strings obtained by at least one application of the induction step of Γ.
Concerning an erasing regular pattern language L(π), Shinohara [11] showed that there is a unique canonical pattern π′ equivalent to any regular pattern π. Clearly the canonical pattern π′ has the form w0x1w1x2 · · · wn−1xnwn (w0, wn ∈ Σ∗, wi ∈ Σ+ (i = 1, · · · , n − 1)), and L(π) = L(π′), provided that |Σ| ≥ 3. Hereafter, we assume that |Σ| ≥ 3 and that the pattern π in a PFS Γ = (π, τ) is of canonical form. On the other hand,
concerning τ in the induction step, we cannot assume τ to be of canonical form. Indeed, let Γ = (aa, bx1x2b). As easily seen, b(aa)(aa)b ∈ L(Γ). On the other hand, for Γ′ = (aa, bx1b), we have b(aa)(aa)b ∉ L(Γ′). Thus L(Γ) ≠ L(Γ′).
For Γ = (π, τ), we define the following particular pattern

τπ = τ{x := π | x appears in τ},

where the variables of the copies of π substituted for the variables of τ are taken to be mutually distinct, so that τπ is always regular. Then we have, for every w ∈ L(Γ),

|w| ≥ min{|c(π)|, |c(τπ)|},

where c(π) is the string obtained from π by substituting the empty string for all variables.
For Γ = (π, τ), τ = x means L(Γ) = L(π). That is, the induction step p(x) ← p(x) is redundant, and thus we assume τ ≠ x.
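Under the same list representation (helper names ours), τπ and c(π) are direct to compute; for the earlier example Γ = (aa, bx1x2b), this gives τπ = b(aa)(aa)b and c(τπ) = baaaab:

```python
def is_var(s):
    return s.startswith("x")

def c(pattern):
    """c(pi): the string obtained by erasing every variable of pi."""
    return "".join(s for s in pattern if not is_var(s))

def tau_pi(tau, pi):
    """tau_pi: substitute a variable-disjoint copy of pi for each variable
    of tau; fresh names keep the result regular when pi is regular."""
    counter = [0]
    def fresh_copy():
        out = []
        for t in pi:
            if is_var(t):
                counter[0] += 1
                out.append(f"x_fresh{counter[0]}")
            else:
                out.append(t)
        return out
    result = []
    for s in tau:
        result.extend(fresh_copy() if is_var(s) else [s])
    return result
```

The bound |w| ≥ min{|c(π)|, |c(τπ)|} can then be checked on generated strings.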
A string w ∈ Σ+ is a multiple string of a string u, called a component, if w = u^l for some l ≥ 2, and is a multiple string if there is a component of w. A component u of w is maximal if there is no component u′ of w satisfying |u′| > |u|.
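A maximal component, when one exists, can be found by trying candidate lengths from the largest proper divisor of |w| downwards (helper name ours):

```python
def maximal_component(w):
    """Return the longest u with w = u^l for some l >= 2,
    or None if w is not a multiple string."""
    n = len(w)
    for d in range(n // 2, 0, -1):      # candidate component lengths
        if n % d == 0 and w == w[:d] * (n // d):
            return w[:d]
    return None
```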
We denote by PFS the set of all PFSs except for those PFSs Γ = (ε, τ) such that τε is a multiple string.
Definition 3. A PFS Γ is reduced if L(Γ′) ⊊ L(Γ) for any Γ′ ⊊ Γ.

3 Inductive Inference

We first give the notion of identification in the limit from positive examples ([4]).
A language class L = L0 , L1 , · · · over Σ is an indexed family of recursive
languages if there is a computable function f : N × Σ ∗ → {0, 1} such that
f (i, w) = 1 if w ∈ Li , otherwise 0, where N = {i | i ≥ 0}. The function f is
called a membership function. Hereafter we confine ourselves to indexed families
of recursive languages.
An infinite sequence of strings w1 , w2 , · · · over Σ is a positive presentation of
a language L, if L = {wn | n ≥ 1} holds. An inference machine is an effective
procedure M that runs in stages 1, 2, · · · , and requests an example and produces
a hypothesis in N based on the examples so far received. Let M be an inference
machine and σ = w1 , w2 , · · · be an infinite sequence of strings. We denote by hn
the hypothesis produced by M at stage n after the examples w1 , · · · , wn are fed
to M . M on input σ converges to h if there is an integer n0 ∈ N such that hn = h
for every n ≥ n0 . M identifies in the limit or infers a language L from positive
examples, if for any positive presentation σ of L, M on input σ converges to h
with L = Lh . A class of languages L is inferable from positive examples if there
is an inference machine that infers any language in L from positive examples.
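The protocol above can be made concrete with Gold's classic identification-by-enumeration strategy: conjecture the least index whose language contains all examples seen so far. This is only a sketch of the protocol, not a learner for any particular class from this paper, and the helper names are ours:

```python
def identify_by_enumeration(membership, presentation):
    """An inference machine sketch: after each example, output the least
    index i such that all examples seen so far lie in L_i (as decided by
    the membership function). On suitable classes, the sequence of
    conjectures converges to a correct index."""
    seen = []
    for w in presentation:
        seen.append(w)
        i = 0
        while not all(membership(i, x) for x in seen):
            i += 1
        yield i
```

For example, with L_i = {w : |w| ≤ i}, a presentation of L_2 drives the conjectures to 2.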
Angluin [1] gave a characterizing theorem: a language class L is inferable from positive examples if and only if there exists an effective procedure that enumerates, for every language L in L, a set SL such that SL ⊆ L, SL is finite, and SL ⊈ L′ for all L′ ∈ L with L′ ⊊ L. The set SL is called a finite tell-tale set of L.
Mukouchi [8] has shown that the class of SFS languages defined by at most two clauses is not inferable from positive examples by considering an infinite sequence of SFSs as follows:

Γn = {p(x1x2 · · · xn−1xnxnxn−1 · · · x2x1) ←}, (n ≥ 1).

We should note that the pattern x1x2 · · · xn−1xnxnxn−1 · · · x2x1 is not regular, and thus the above EFSs are not RFSs, but SFSs. They showed L(Γ1) ⊊ L(Γ2) ⊊ · · · ⊊ L(Γ) and ⋃n≥1 L(Γn) = L(Γ) under |Σ| ≥ 2, where

Γ = {p(x1x1) ←, p(x1x2x1) ← p(x2)}.

It means that the language L(Γ) does not have any finite tell-tale set within the class.
Angluin [1] gave a very useful sufficient condition for inferability called finite thickness. The class of erasing regular pattern languages discussed in this paper was shown to have finite thickness by Shinohara [11], as was the class of nonerasing pattern languages (Angluin [1]). Wright [16] introduced another sufficient condition for inferability called finite elasticity, more general than finite thickness ([6]). A class L has finite elasticity if there is no infinite sequence of strings w0, w1, · · · and no infinite sequence of languages L1, L2, · · · in L satisfying {w0, w1, · · · , wn−1} ⊆ Ln but wn ∉ Ln for every n ≥ 1. Finite elasticity has a good property in the sense that it is closed under various class operations such as union, intersection and so on (Wright [16], Moriyama & Sato [5], Sato [10]). Shinohara [13] proved that the class of nonerasing length-bounded EFS languages generated by at most k clauses has finite elasticity, and so is inferable from positive examples. Mukouchi [8] has shown similarly that the class of erasing RFS languages generated by at most k clauses has finite elasticity, provided that all regular patterns in heads of induction steps are of canonical form. Without the condition of canonical form, however, the inferability is not valid, as shown below.
Theorem 1. The class PFSL does not have finite elasticity.

Proof. Define PFSs Γn = (ε, τn) (n ≥ 1) as follows: τn = a(x1x2 · · · xn)b for n ≥ 1, where a, b ∈ Σ, a ≠ b. Then we can show that

{ε, ab, a(ab)^2 b, · · · , a(ab)^k b} ⊆ L(Γk) but a(ab)^{k+1} b ∉ L(Γk) for k ≥ 1.

Thus the infinite sequence (wn)n≥0 of strings and the infinite sequence (Γn)n≥1 of PFSs satisfy the above conditions, where w0 = ε and wn = a(ab)^n b (n ≥ 1). Hence the class PFSL has infinite elasticity.
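The two membership claims in this proof can be confirmed by brute force for small k, since L(Γk) restricted to a length bound is the least fixed point of w1, . . . , wk ↦ a·w1 · · · wk·b. A sketch (the helper name and the cutoff are ours):

```python
from itertools import product

def lang_up_to(k, max_len):
    """L(Gamma_k) for Gamma_k = (epsilon, a x1...xk b), restricted to
    strings of length <= max_len: least set containing the empty string
    and closed under w1,...,wk |-> a w1...wk b."""
    lang = {""}
    changed = True
    while changed:
        changed = False
        for ws in product(sorted(lang), repeat=k):
            s = "a" + "".join(ws) + "b"
            if len(s) <= max_len and s not in lang:
                lang.add(s)
                changed = True
    return lang
```

For k = 2 this confirms that ε, ab, a(ab)b and a(ab)^2 b are generated, while a(ab)^3 b is not.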

Moriyama & Sato [5] introduced a notion of M-finite thickness by generalizing finite thickness. For a nonempty finite set S ⊆ Σ∗, we define

MIN(S, L) = {L ∈ L | L is a minimal language of S within L},

where L is a minimal language of S within L if no language L′ ∈ L satisfies S ⊆ L′ ⊊ L.
Definition 4. A class L has M-finite thickness if for any nonempty finite set S ⊆ Σ∗, (i) ♯MIN(S, L) < ∞, and (ii) S ⊆ L ∈ L implies that there is a language L′ ∈ MIN(S, L) such that L′ ⊆ L.
M-finite thickness alone is not a sufficient condition for inferability from positive examples, but, like finite elasticity, it is closed under various class operations such as union, concatenation and so on ([5,10]).
Theorem 2 (Moriyama & Sato [5]). If a class L has M-finite thickness, then the class L is inferable from positive examples if and only if for each language L ∈ L, there is a finite tell-tale set of L.

4 Reduced PFSs
This section gives a characterization of a reduced PFS. Hereafter, we assume that |Σ| ≥ 3. Then by Lemma 1, for any regular patterns π and τ,

π ⪯ τ ⇐⇒ L(π) ⊆ L(τ).

Let Γ = (π, τ) be a PFS. If τ ∈ Σ∗, clearly L(Γ) = L(π) ∪ {τ}. Thus the PFS Γ is reduced if and only if τ ⋠ π.
Hereafter, we consider a PFS Γ = (π, τ) with var(τ) ≠ φ, where var(τ) = {x1, x2, · · · , xn} is the set of variables contained in τ. We define Γτ = ⋃t≥1 Γt, where Γt is recursively defined as follows: Γ1 = {τπ} and, for each t ≥ 2,

Γt = Γt−1 ∪ {τ{x1 := ξ1, x2 := ξ2, · · · , xn := ξn} | ξi ∈ Γt−1 ∪ {π}, i = 1, 2, · · · , n}.

Clearly every ξ in Γτ contains π as a substring whenever var(τ) ≠ φ. By the definitions of L(Γ) and the above Γτ, it follows that:

Lemma 2. Let Γ = (π, τ) be a PFS. Then L(Γ) = L(π) ∪ L(Γτ) holds, where L(Γτ) = ⋃ξ∈Γτ L(ξ).
By the definition of a reduced PFS, Γ is reduced if and only if L(π) ⊊ L(Γ), i.e., L(Γτ) ⊈ L(π). Clearly if π ∈ Σ∗, then Γ is always reduced.
For a pattern π with var(π) ≠ φ, let us denote by Aπ the longest constant prefix and by Bπ the longest constant suffix of π. For instance, Aπ = ab and Bπ = ε if π = abx1aax2.
Lemma 3. Let Γ = (π, τ) be a PFS with var(π) ≠ φ. For any ξ ∈ Γτ, there is a pair (i0, j0) of positive integers such that

Aξ = (Aτ)^{i0} Aπ, Bξ = Bπ(Bτ)^{j0}.
For two strings v, w ∈ Σ∗, v ⊑p w denotes that v is a prefix of w, and v ⊑s w means that v is a suffix of w. Moreover, w ⊑p v∗ means that w ⊑p v^i for some i ≥ 0. Similarly we define w ⊑s v∗.
By Pref, we denote the set of pairs (v, w) of strings satisfying v ⊑p w or w ⊑p v. Similarly we define the set Suff.
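These prefix/suffix relations are all directly computable. A sketch (function names ours), using the fact that ⌈|w|/|v|⌉ copies of v are enough to test w ⊑p v∗:

```python
def prefix_of_power(w: str, v: str) -> bool:
    """w ⊑p v*: w is a prefix of v^i for some i >= 0."""
    if not w:
        return True
    if not v:
        return False
    i = -(-len(w) // len(v))      # ceil(|w| / |v|) copies of v suffice
    return (v * i).startswith(w)

def suffix_of_power(w: str, v: str) -> bool:
    """w ⊑s v*: w is a suffix of v^j for some j >= 0."""
    if not w:
        return True
    if not v:
        return False
    j = -(-len(w) // len(v))
    return (v * j).endswith(w)

def in_pref(v: str, w: str) -> bool:
    """(v, w) ∈ Pref: v ⊑p w or w ⊑p v."""
    return w.startswith(v) or v.startswith(w)
```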
76 J. Uemura and M. Sato

Lemma 4. Let π and τ be regular patterns containing variables. Then
L(π) ∩ L(τ) ≠ φ ⇐⇒ (Aπ, Aτ) ∈ Pref, (Bπ, Bτ) ∈ Suff.
Lemma 5. Let π = Aπ π′ Bπ and τ = Aτ τ′ Bτ, where π′ and τ′ begin and end with a variable. If π′ is a substring of τ′, then
L(τ) ⊆ L(π) ⇐⇒ Aπ ⊑p Aτ, Bπ ⊑s Bτ.
By the above lemmas, the next result immediately follows:
Theorem 3. Let Γ = (π, τ ) be a PFS. If Aπ = Bπ = ε or Aτ = Bτ = ε, then
Γ is not reduced.
Remember that the pattern π for a PFS Γ = (π, τ ) is assumed to be of
canonical form. Thus by the above theorem, it follows that:
Corollary 1. Let Γ = (π, τ) be a reduced PFS with var(τ) ≠ φ. Then no pattern ξ ∈ Γτ contains successive variables.
By the above corollary, every pattern in Γτ can be assumed to be of canonical form for a reduced PFS Γ = (π, τ).
Lemma 6. For w ∈ Σ∗ and v ∈ Σ+,
(i) ∃i0 ≥ 1 s.t. w ⊑p v^{i0} w ⇐⇒ w ⊑p v∗ ⇐⇒ ∀i ≥ 0, w ⊑p v^i w,
(ii) ∃j0 ≥ 1 s.t. w ⊑s w v^{j0} ⇐⇒ w ⊑s v∗ ⇐⇒ ∀j ≥ 0, w ⊑s w v^j.
Theorem 4. Let Γ = (π, τ) be a PFS. Then the following statements are equivalent:
(i) Γ is reduced. (ii) L(π) ∩ L(Γτ) = φ.
(iii) [Aπ ⋢p Aτ∗ and Aτ ≠ ε] or [Bπ ⋢s Bτ∗ and Bτ ≠ ε], provided π ∉ Σ∗.
Proof. If π ∈ Σ∗, clearly (i) and (ii) are equivalent. Hereafter we assume π ∉ Σ∗, i.e., π contains some variables.
(i) ⇒ (iii). Assume that Γ is reduced and that, contrary to (iii), [Aπ ⊑p Aτ∗ or Aτ = ε] and [Bπ ⊑s Bτ∗ or Bτ = ε]. If Aτ = Bτ = ε, then by Theorem 3, Γ is not reduced, a contradiction.
The case Aπ ⊑p Aτ∗ and Bπ ⊑s Bτ∗. By Lemma 6, we have
Aπ ⊑p (Aτ)^i Aπ, Bπ ⊑s Bπ (Bτ)^j, i, j = 0, 1, 2, · · · .
Let ξ ∈ Γτ be an arbitrary pattern. Then by Lemma 3, there are integers i0, j0 ≥ 1 such that Aξ = (Aτ)^{i0} Aπ and Bξ = Bπ (Bτ)^{j0}. It implies that Aπ ⊑p Aξ and Bπ ⊑s Bξ hold. Since π is a substring of ξ, by Lemma 5 we get L(ξ) ⊆ L(π). It means that L(Γτ) ⊆ L(π) holds, a contradiction.
Similarly we can prove the cases [Aπ ⊑p Aτ∗ and Bτ = ε] and [Aτ = ε and Bπ ⊑s Bτ∗].
(iii) ⇒ (ii). Assume, to the contrary, that L(Γτ) ∩ L(π) ≠ φ. Then there is a pattern ξ ∈ Γτ satisfying L(ξ) ∩ L(π) ≠ φ. By Lemma 4,
(Aπ, Aξ) ∈ Pref, (Bπ, Bξ) ∈ Suff.
Learning of Erasing Primitive Formal Systems from Positive Examples 77

Since ξ ∈ Γτ, similarly to the above, there are i0, j0 ≥ 1 satisfying Aξ = (Aτ)^{i0} Aπ and Bξ = Bπ (Bτ)^{j0}. It means that Aπ ⊑p (Aτ)^{i0} Aπ. By Lemma 6(i), Aπ ⊑p Aτ∗ holds. Similarly we have Bπ ⊑s Bτ∗. It contradicts the assumption of (iii).
(ii) ⇒ (i) is clear. □
By the above theorem and Theorem 3, it follows immediately that:
Corollary 2. Given a PFS Γ = (π, τ), it is decidable in time O(|π| + |τ|) whether Γ is reduced.
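The criterion of Theorem 4(iii) yields this linear-time test directly. A sketch for the case var(τ) ≠ φ (the flat string representation with variables marked `X`, and the helper names, are ours):

```python
def _prefix_of_power(w, v):
    # w is a prefix of v^i for some i >= 0
    if not w:
        return True
    if not v:
        return False
    return (v * (-(-len(w) // len(v)))).startswith(w)

def _suffix_of_power(w, v):
    # w is a suffix of v^j for some j >= 0
    if not w:
        return True
    if not v:
        return False
    return (v * (-(-len(w) // len(v)))).endswith(w)

def is_reduced(pi: str, tau: str, var: str = "X") -> bool:
    """Gamma = (pi, tau) with var(tau) != {} is reduced iff pi is
    constant, or [A_pi not a prefix of A_tau* and A_tau != eps], or
    [B_pi not a suffix of B_tau* and B_tau != eps] (Theorem 4(iii))."""
    if var not in pi:                  # pi in Sigma*: always reduced
        return True
    a_pi, b_pi = pi.split(var)[0], pi.rsplit(var)[-1]
    a_tau, b_tau = tau.split(var)[0], tau.rsplit(var)[-1]
    left = a_tau != "" and not _prefix_of_power(a_pi, a_tau)
    right = b_tau != "" and not _suffix_of_power(b_pi, b_tau)
    return left or right
```

Each constant prefix/suffix and each power test is linear in the pattern lengths, matching the O(|π| + |τ|) bound.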

5 Inclusion Problem for Erasing PFS Languages
In this section, we deal with the inclusion problem for erasing PFS languages.
Lemma 7. Let Γ = (π, τ) be a reduced PFS and let γ be a pattern such that L(γ) ⊆ L(Γ). Then
γ ⋠ π ⇐⇒ L(γ) ⊆ L(Γτ).
Proof. We only prove the case π ∉ Σ∗, since the case π ∈ Σ∗ is easily shown.
Since Γ is reduced, by Theorem 4, L(π) ∩ L(Γτ) = φ holds, and so (⇐) is valid. Moreover, [Aπ ⋢p Aτ∗ and Aτ ≠ ε] or [Bπ ⋢s Bτ∗ and Bτ ≠ ε] holds.
(⇒) We consider only the case Aπ ⋢p Aτ∗ and Aτ ≠ ε; the other case can be proved similarly. In this case, by Lemma 6,
(∗) Aπ ⋢p (Aτ)^i Aπ, i = 1, 2, · · · .
Suppose that γ ⋠ π and L(γ) ⊈ L(Γτ). By L(γ) ⊆ L(Γ), it follows that L(γ) ∩ L(π) ≠ φ and L(γ) ∩ L(ξ) ≠ φ for some ξ ∈ Γτ. Then by Lemma 4,
(∗∗) (Aγ, Aπ) ∈ Pref, (Aγ, Aξ) ∈ Pref.
Since ξ ∈ Γτ, by Lemma 3 there is an integer i0 ≥ 1 such that
(∗∗∗) Aξ = (Aτ)^{i0} Aπ.
By (∗), Aπ ⋢p Aξ.
Claim A. If |Aπ| ≤ |Aγ|, then Aπ ⊑p Aξ.
The proof of Claim A. If |Aπ| ≤ |Aγ|, then by (∗∗), it follows that Aπ ⊑p Aγ.
If |Aξ| ≤ |Aγ|, then by (∗∗), Aξ ⊑p Aγ holds; since Aπ and Aξ are both prefixes of Aγ and |Aπ| < |Aξ|, Aπ ⊑p Aξ holds.
Otherwise, i.e., if |Aξ| > |Aγ|, then by (∗∗), Aγ ⊑p Aξ holds, and so Aπ ⊑p Aξ holds.
As mentioned above, since Aπ ⋢p Aξ, |Aπ| > |Aγ| must hold.
Claim B. If |Aπ| > |Aγ|, then L(γ) ⊈ L(Γ).
The proof of Claim B. By the assumption of the claim, the lengths of both Aπ and Aξ are larger than |Aγ|. Let a and b be their (|Aγ| + 1)-th constant symbols from the left, respectively. Since ♯Σ ≥ 3, there is a symbol c ∈ Σ such that c ≠ a, b. Let x be the leftmost variable of γ and let
γc = γ{x := cx}.
Then L(γc) ⊆ L(γ) and Aγc = Aγ c hold. Clearly (Aγc, (Aτ)^j Aπ) ∉ Pref (j = 0, 1, · · · ) holds, and so by Lemma 4, we have L(γc) ∩ L(π) = φ. Furthermore, by Lemma 3, it follows that L(γc) ∩ L(ξ′) = φ for any ξ′ ∈ Γτ. Thus L(γc) ⊈ L(Γ), that is, L(γ) ⊈ L(Γ).
Claim B leads to a contradiction because L(γ) ⊆ L(Γ). □
Lemma 8. Let Γ = (π, τ) be a reduced PFS and let γ be a pattern with L(γ) ⊆ L(Γτ). Then
L(γ) ∩ L(τπ) ≠ φ ⇐⇒ γ ⪯ τπ.
The proof can be done similarly to the proof of Lemma 7.
Lemma 9. Let Γ′ = (α, β) be a reduced PFS and let π be a pattern. Then
L(Γβ) ⊆ L(π) ⇐⇒ βα ⪯ π, ββα ⪯ π.
Lemma 10. Let Γ′ = (α, β) be a reduced PFS and let π be a pattern. Then
L(Γ′) ⊆ L(π) ⇐⇒ α ⪯ π, βα ⪯ π.
By Lemma 9 and Lemma 10, we have the following:
Theorem 5. Let Γ = (π, τ) and Γ′ = (α, β) be reduced PFSs. If L(Γ′) ⊆ L(Γ), the following equivalences are valid.
(i) L(Γ′) ⊆ L(π) ⇐⇒ α ⪯ π, βα ⪯ π,
(ii) L(α) ⊆ L(π), L(Γβ) ⊆ L(Γτ) ⇐⇒ α ⪯ π, βα ⋠ π,
(iii) L(α) ⊆ L(Γτ), L(Γβ) ⊆ L(π) ⇐⇒ α ⋠ π, βα ⪯ π, ββα ⪯ π.
6 Learnability of the Class PFSL

In the present section, we show that the class PFSL is inferable from positive examples. In order to do this, we show that the class has M-finite thickness and that each of its languages has a finite tell-tale set.
6.1 Finite Tell-Tale Sets for PFS Languages

For a pattern π, S(π) denotes the set of strings obtained from π by substituting, for each variable, either the empty string or a single constant symbol in Σ.
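Since every variable of a regular pattern occurs exactly once, S(π) is finite and easy to enumerate. A sketch (representation ours, variables marked `X`):

```python
from itertools import product

def s_of(pattern: str, sigma: str, var: str = "X") -> set:
    """S(pi): substitute each variable by the empty string or by a
    single constant symbol from sigma, in every possible way."""
    parts = pattern.split(var)
    choices = [""] + list(sigma)
    result = set()
    for subst in product(choices, repeat=len(parts) - 1):
        s = parts[0]
        for ch, part in zip(subst, parts[1:]):
            s += ch + part
        result.add(s)
    return result
```

For example, with Σ = {a, b, c} and π = aXb, this yields the four strings ab, aab, abb, acb.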

Lemma 11 (Shinohara [11]). Suppose that ♯Σ ≥ 3. For any regular pattern π, there is no regular pattern τ satisfying S(π) ⊆ L(τ) ⊊ L(π).
For a reduced PFS Γ = (π, τ), we introduce the following finite subset of L(Γ):
T(Γ) = L(Γ) ∩ Σ^{≤|ττπ|},
where Σ^{≤l} denotes the set of strings of length at most l for each l ≥ 0.
We first consider a PFS Γ = (π, τ) with var(π) ≠ φ. Clearly
S(π) ∪ S(τπ) ∪ S(ττπ) ⊆ T(Γ),
where τπ = ττπ = τ if τ ∈ Σ∗. Note that the string c(π) is the unique shortest string of L(Γ) and of L(π) if var(τ) ≠ φ, and that c(τπ) is that of L(Γτ).
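Computing T(Γ) requires deciding membership in erasing regular pattern languages. For a single pattern this can be done greedily, since the constant blocks must occur in the string disjointly and in order; a sketch (representation ours, variables marked `X`):

```python
def in_lang(w: str, pattern: str, var: str = "X") -> bool:
    """w ∈ L(pi) for an erasing regular pattern pi: the constant
    blocks of pi must occur in w disjointly and in order, with the
    first block a prefix and the last block a suffix."""
    blocks = pattern.split(var)
    if len(blocks) == 1:              # no variables: L(pi) = {pi}
        return w == pattern
    if not w.startswith(blocks[0]):
        return False
    pos = len(blocks[0])
    for b in blocks[1:-1]:            # greedy leftmost matching
        i = w.find(b, pos)
        if i < 0:
            return False
        pos = i + len(b)
    last = blocks[-1]
    return w.endswith(last) and len(w) - len(last) >= pos
```

Greedy leftmost matching is safe here: matching each middle block as early as possible leaves the most room for the remaining blocks.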
Theorem 6. Let Γ = (π, τ) be a reduced PFS with var(π) ≠ φ. Then there does not exist a PFS Γ′ such that
T(Γ) ⊆ L(Γ′) ⊊ L(Γ).
Proof. The case τ ∈ Σ∗ is easily proved, so we consider the case var(τ) ≠ φ.
We assume T(Γ) ⊆ L(Γ′) ⊊ L(Γ) for some PFS Γ′ = (α, β). We may assume that Γ′ is reduced.
Clearly c(π) and c(α) are the shortest strings of L(Γ) and L(Γ′), respectively, and by the above assumption, c(π) = c(α) (= w).
Claim A. α = π and S(τπ) ∪ S(ττπ) ⊆ L(Γβ) ⊊ L(Γτ).
The proof of Claim A. If α ⋠ π, then by Lemma 7 we have L(α) ⊆ L(Γτ). It means that w ∈ L(Γτ), a contradiction. Thus α ⪯ π, and so L(α) ⊆ L(π).
If α ≺ π, then α ⪯ π{x := ε} for some x ∈ var(π) because c(π) = c(α). If the variable x is neither the first nor the last symbol of π, then π = π1(axb)π2 for some a, b ∈ Σ and some patterns π1, π2. Let c ∈ Σ be such that c ≠ a, b, and let
w′ = π{x := c, x′ := ε for all x′ ≠ x}.
Clearly w′ ∈ S(π) and |w′| = |w| + 1. As easily seen, w′ ∉ L(α) holds. By T(Γ) ⊆ L(Γ′), w′ ∈ L(Γβ) must hold. The length of the shortest string c(βα) of L(Γβ) is larger than |w|, and it is equal to |w| + 1 if and only if β = dx or β = xd for some d ∈ Σ. In either case, w′ ≠ dw, wd, and so w′ ∉ L(Γ′). It is a contradiction. Similarly we can argue when the variable x is the first or the last symbol of π.
Hence α ≡ π must hold. Since both π and α are of canonical form, it implies that π = α. Therefore we have
S(τπ) ∪ S(ττπ) ⊆ L(Γβ) ⊊ L(Γτ).
Claim B. βπ = τπ.
The proof of Claim B. By the inclusions of Claim A, c(βπ) = c(τπ) holds. If βπ ⋠ τπ, then by Lemma 8 we have L(βπ) ∩ L(τπ) = φ. It contradicts c(βπ) ∈ L(βπ) ∩ L(τπ). Hence βπ ⪯ τπ holds.
Similarly, by Claim A, S(τπ) ⊆ L(Γβ) holds. If τπ ⋠ βπ, then by Lemma 8, L(τπ) ∩ L(βπ) = φ, a contradiction. Therefore τπ ⪯ βπ holds.
Consequently, we obtain βπ ≡ τπ. In terms of Corollary 1, both βπ and τπ are patterns of canonical form, and thus βπ = τπ. Since π contains variables, it is easily shown that β = τ. It means that L(Γ′) = L(Γ) holds, a contradiction. □
The following result plays an essential role in our problem of finite tell-tale sets.
Lemma 12. Let w be a nonempty string in Σ+.
(i) If w is a nonmultiple string, then there do not exist strings u, v ≠ ε satisfying w = uv = vu or wu = vw.
(ii) If u is a maximal component for w, then wv1 = v2w implies vi ∈ u∗ for i = 1, 2.
Now we consider a PFS Γ = (w, τ) for w ∈ Σ+. We introduce the following strings:
ri(τ) = τ{xi := τw, xj := w | j ≠ i}, i = 1, · · · , n,
where var(τ) = {x1, · · · , xn} for n ≥ 1. Then we have
{w, s} ∪ {ri(τ) | i = 1, · · · , n} ⊆ T(Γ), where s = τw.
Note that the string w is the unique shortest string of L(w, τ) and S(Γ), and s is the unique second shortest string of them.
By Lemma 12, it follows that:
Lemma 13. Let Γ = (w, τ) be a reduced PFS with w ∈ Σ+ and var(τ) = {x1, · · · , xn} (n ≥ 1). If s is a nonmultiple string, then ri(τ) ≠ rj(τ) for i ≠ j.
Lemma 14. Let Γ = (w, τ) be a reduced PFS with w ∈ Σ+ and var(τ) = {x1, · · · , xn} (n ≥ 1), and let u be a maximal component for s. If w ∈ u+, then ri(τ) = rj(τ) for all i and j. Otherwise, ri(τ) ≠ rj(τ) for i ≠ j.
Theorem 7. Let Γ = (w, τ) be a reduced PFS with w ∈ Σ+. Then there does not exist a PFS Γ′ such that
T(Γ) ⊆ L(Γ′) ⊊ L(Γ).
Proof. We assume that there is a PFS Γ′ = (α, β) satisfying the inclusion relations of the theorem. Then clearly w and s (= τw) are the unique shortest string and the unique second shortest string of L(Γ′). It implies that α = w and βα = s. Let τ = v0 x1 v1 x2 · · · xn vn and β = v′0 y1 v′1 · · · yn′ v′n′ for some vi, v′j ∈ Σ∗ (i = 0, · · · , n, j = 0, · · · , n′).
(i) The case that s is a nonmultiple string.
In this case, by Lemma 13, ri(τ) ≠ rj(τ) for i ≠ j. Clearly the strings ri(τ) are the third shortest strings of L(Γ) and S(Γ). Similarly, the strings rj(β) are those of L(Γ′). By our assumption,
(∗) {ri(τ) | i = 1, · · · , n} = {rj(β) | j = 1, · · · , n′}.
Since these strings are distinct, as mentioned above, we have n = n′.
We show that vi = v′i for i = 0, · · · , n. Assume that vi = v′i for i = 0, · · · , i0 − 2 and vi0−1 ≠ v′i0−1 for some i0 ≥ 1. Then ri(τ) = ri(β) for i = 1, · · · , i0 − 1. By the above, there is an integer j ≥ i0 satisfying ri0(τ) = rj(β). Since 2|s| = |ri0(τ)| + |w|, the substring s appears in both ri0(τ) and rj(β) at different but overlapping places. It means that sw′ = w′′s for some nonempty strings w′ and w′′. Applying Lemma 12, this contradicts the assumption on s. Hence vi = v′i holds for every i. Therefore we have Γ = Γ′, a contradiction.
(ii) The case that a string u is a maximal component for s.
By Lemma 14, if w ∈ u+, then ri(τ) = rj(τ) for any i, j, and moreover vi, v′i ∈ u∗. It means that L(Γ) = L(Γ′) ⊆ u+, a contradiction.
By Lemma 14, if w ∉ u+, then ri(τ) ≠ rj(τ) for i ≠ j. Similarly to case (i), we obtain n = n′ and (∗). Furthermore, we can show that ww′ = w′′w for some nonempty strings w′ and w′′, a contradiction. □
Finally, we consider a PFS Γ = (ε, τ), provided τε is a nonmultiple string.
We define the following strings: for τ = u0 X1 u1 · · · uk−1 Xk uk with u0, uk ∈ Σ∗, ui ∈ Σ+ (i = 1, · · · , k − 1), and Xi ∈ X+ for i = 1, · · · , k,
r̂i(τ) = τ{Xi := s, Xj := ε, j ≠ i}, ti(τ) = τ{Xi := s^{|Xi|}, Xj := ε, j ≠ i},
for i = 1, · · · , k, where s = τε; here Xi := s means the substitution x := s, x′ := ε (x′ ≠ x) for some x in Xi and every other x′ in Xi, and Xi := s^{|Xi|} denotes x := s for every x in Xi. Then clearly
{ε, s} ∪ {r̂i(τ), ti(τ) | i = 1, · · · , k} ⊆ T(Γ).
Theorem 8. Let Γ = (ε, τ) be a reduced PFS, where τε is a nonmultiple string. Then there does not exist a PFS Γ′ = (α, β) such that
T(Γ) ⊆ L(Γ′) ⊊ L(Γ).
Proof. We assume that there is a PFS Γ′ = (α, β) satisfying the inclusion relations of the theorem. Then clearly α = ε and u0 u1 · · · uk = u′0 u′1 · · · u′k′ (= s), where τ = u0 X1 u1 X2 · · · Xk uk and β = u′0 Y1 · · · Yk′ u′k′ for some u0, u′0, uk, u′k′ ∈ Σ∗, ui, u′j ∈ Σ+ (i = 1, · · · , k − 1, j = 1, · · · , k′ − 1) and Xi, Yj ∈ X+ for each i, j.
Claim A. k = k′ and ui = u′i for i = 1, · · · , k.
Since s is a nonmultiple string, the proof of Claim A can be done by showing r̂i(τ) ≠ r̂j(τ) for i ≠ j, similarly to the proof of Theorem 7.
Similarly, we can prove that Xi = Yi for each i. □
Theorem 9. Let Γ be a reduced PFS in PFS. Then the set T(Γ) is a finite tell-tale set of the language L(Γ).
6.2 M-Finite Thickness

Theorem 10. The class PFSL has M-finite thickness.
Proof. Let S ⊆ Σ∗ be a nonempty finite set. Let lmin and lmax be the shortest length and the longest length of strings in S, respectively.
Claim A. ♯MIN(S, PFSL) < ∞ holds.
The proof of Claim A. Let L(Γ) ∈ MIN(S, PFSL) for Γ = (π, τ). Without loss of generality, we can assume that Γ is reduced.
As mentioned in §2, every w ∈ L(Γ) satisfies |w| ≥ min{|c(π)|, |c(τπ)|}.
The case π ≠ ε. In this case, |c(π)| and |c(τπ)| are less than or equal to lmax. Since π is of canonical form and the number of variables in τ is bounded by lmax, there are at most finitely many such PFSs.
The case π = ε. In this case, c(τ) = τπ holds. Similarly to the above, we have |c(τ)| ≤ lmax, and thus there are at most finitely many such constant strings. Let us put |c(τ)| = l. Then, as easily seen, the lengths of strings in L(ε, τ) are multiples of l. Thus lmax = kl for some k ≥ 0.
Let us put τ = w0 X1 w1 · · · wn−1 Xn wn. A variable x ∈ var(τ) is nonerasable w.r.t. S if S ⊈ L(ε, τ{x := ε}). We can assume that every variable in τ is nonerasable w.r.t. S; such a τ is called a nonerasable pattern w.r.t. S. Clearly |Xi| ≤ k holds for every i. Hence there are at most finitely many PFSs in MIN(S, PFSL).
Claim B. For any PFS Γ, if S ⊆ L(Γ), then L(Γ′) ⊆ L(Γ) for some L(Γ′) ∈ MIN(S, PFSL).
The proof of Claim B. Let S ⊆ L(Γ) for Γ = (π, τ). We can assume that Γ is reduced and L(Γ) ∉ MIN(S, PFSL). Then we have S ⊆ L(Γ′) ⊊ L(Γ) for some reduced PFS Γ′ = (α, β). Similarly to the proof of Claim A, it can be shown that there are at most finitely many such L(Γ′) containing the set S. □
By the above results and Theorem 2, we obtain the following main theorem.

Theorem 11. The class PFSL is inferable from positive examples.

References

1. D. Angluin: Inductive inference of formal languages from positive data, Information and Control, 45, 117–135, (1980).
2. S. Arikawa, T. Shinohara and A. Yamamoto: Learning elementary formal systems, Theoretical Computer Science, 95, 97–113, (1992).
3. H. Arimura, T. Shinohara and S. Otsuki: Finding minimal generalizations for
unions of pattern languages and its application to inductive inference from pos-
itive data, Lecture Notes in Computer Science, 775, 646–660, (1994).
4. E.M. Gold: Language identification in the limit, Information and Control, 10, 447–474, (1967).
5. T. Moriyama and M. Sato: Properties of language classes with finite elasticity,
IEICE Transactions on Information and Systems, E78-D(5), 532–538, (1995).
6. T. Motoki, T. Shinohara and K. Wright: The correct definition of finite elasticity: Corrigendum to identification of unions, Proceedings of the 4th Annual Workshop on Computational Learning Theory, 375, (1991).
7. Y. Mukouchi and S. Arikawa: Towards a mathematical theory of machine discovery
from facts, Theoretical Computer Science, 137, 53–84, (1995).
8. Y. Mukouchi: Note on Learnability of Subclasses of Erasing Elementary Formal
Systems from Positive Examples, in preparation.
9. D. Reidenbach: A Negative Result on Inductive Inference of Extended Pattern Languages, Lecture Notes in Artificial Intelligence, 2533, 308–320, (2002).
10. M. Sato: Inductive Inference of Formal Languages, Bulletin of Informatics and
Cybernetics, 27(1), 85–106, (1995).
11. T. Shinohara: Polynomial time inference of extended regular pattern languages,
Lecture Notes in Computer Science, 147, 115–127, (1982).
12. T. Shinohara: Inductive inference of formal systems from positive data, Bulletin of Informatics and Cybernetics, 22, 9–18, (1986).
13. T. Shinohara: Rich classes inferable from positive data, Information and Computation, 108, 175–186, (1994).
14. R.M. Smullyan: “Theory of Formal Systems,” Princeton University Press, 1961.
15. J. Uemura and M. Sato: Compactness and Learning of Unions of Erasing Regular
Pattern Languages, Lecture Notes in Artificial Intelligence, 2533, 293–307, (2002).
16. K. Wright: Identification of unions of languages drawn from positive data, in Pro-
ceedings of the 2nd Annual Workshop on Computational Learning Theory, 328–
333, (1989).
Changing the Inference Type – Keeping the Hypothesis Space

Frank Balbach

Institut für Theoretische Informatik, Universität zu Lübeck
Wallstraße 40, 23560 Lübeck, Germany
balbach@[Link]

Abstract. In inductive inference all learning takes place in hypothesis spaces. We investigate for which classes of recursive functions learnability according to an inference type I implies learnability according to a different inference type J within the same hypothesis space.
Several classical inference types are considered. Among FIN, CONS-CP, and CP the above implication is true, for all relevant classes, independently of the hypothesis space.
On the other hand, it is proved that for many other pairs (I, J) hypothesis spaces exist that allow full I learning power, but limit that of J to finite classes.
Only in a few cases (e.g. LIM vs. CONS) does the result depend on the actual class to be learned.

1 Introduction
In inductive inference a scenario is investigated where a learner receives more
and more data about a target object and outputs a sequence of hypotheses. The
learner is successful if its sequence of hypotheses eventually converges to a single
description for the target object. Usually a learner is required to learn all objects
from a (possibly infinite) class.
This model of learning in the limit [11], applied to classes of recursive func-
tions, has been studied thoroughly. Thereby, many variations of the basic model
have been developed [4,19,1,8,12,10,9,13]. All these models are referred to as “in-
ference types.” They often differ in the constraints placed on the intermediate
hypotheses or in the way the sequence of hypotheses has to converge.
Common to all inference types, however, is the need to interpret the hypothe-
ses a learner (“strategy”) outputs. This is usually done by means of a hypothesis
space. Its design is thus of key importance to the learning success. In the induc-
tive inference of recursive functions, hypothesis spaces are numberings of partial
recursive functions. Hypotheses are represented by indices in such numberings.
It is well known that Gödel numberings [15] (acceptable numberings [14])
serve, in many inference types, as universal hypothesis spaces. That is, any solv-
able learning problem can be solved within such a numbering.
Of course, knowing that a solution exists is not sufficient in practical appli-
cations. One rather needs to know how to construct an appropriate learner.

R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 84–98, 2003.
© Springer-Verlag Berlin Heidelberg 2003
Constructing learning strategies is usually easier in hypothesis spaces specifically designed for a certain learning task than it is in Gödel numberings.
A first approach to such hypothesis spaces is given by numbering theoretical
characterizations [19,9]. As an example, let us consider:
A class U of recursive functions is learnable in the limit iff there is a numbering ψ such that (a) ψ contains all functions of U and (b) for different indices i, j one can uniformly compute an upper bound d(i, j) on an argument on which the i-th and the j-th function in ψ differ.
This characterization provides a sufficient criterion for hypothesis spaces ψ
to be suitable for the learning of U . Moreover, learners for classes U can be built
uniformly from the hypothesis space ψ and the function d.
Specialized hypothesis spaces are also necessary in combination with general
types of strategies, like identification by enumeration [11]. In its basic form this
strategy outputs the least index whose associated function is consistent with
the data known to the strategy so far. Therefore it often needs to decide the
consistency of an index with some known data, which usually cannot be done in
Gödel numberings [20].
Despite their advantages regarding the construction of learning strategies,
specialized non-Gödel numberings suffer from several drawbacks. One of them is
a loss of flexibility. Changing the class to be learned or switching to a different
type of learning, thereby leaving the hypothesis space unchanged, might lead to
an unsolvable new learning problem.
In this paper we will concentrate on the consequences of switching to another
inference type after fixing the hypothesis space.

The inductive inference paradigm can also be viewed from a more practical
point of view. Here, the hypothesis spaces correspond to certain output formats,
or description languages, for hypotheses. The inference types find their counter-
part in various requirements put on the behavior of a learning algorithm and/or
the quality of its hypotheses.
Consider a typical practical learning algorithm that uses a certain description language for hypotheses and learns according to some requirements.
In practice, requirements are often changing. The output format might be
fixed, however. The first natural question then is whether the algorithm can
still handle the new requirements or whether there is at all an algorithm that
can cope with the new situation. However, the new requirements could be too
demanding for such an algorithm to exist.
But what happens if a learning algorithm for the new problem is known that
(unfortunately) uses a different output format? Is this additional fact enough to
conclude that there must be such an algorithm for the previously fixed format,
too?
In general the possible answers are “yes, this conclusion can be drawn”, “no,
it cannot”, or “it depends...” In case of a positive answer the immediate next
question is how to find such an algorithm. Can it even be constructed effectively
from the previous one?
The following pages give answers to these questions in the general frame-
work of inductive inference of recursive functions, thereby revealing that all
three kinds of answers do indeed occur, depending on the inference types under
consideration.
Section 3 addresses the “no” answers, Sect. 4 presents some “yes” answers,
and Sect. 5 some intermediate results. Finally, Sect. 6 gives an overview of all
results obtained.

2 Preliminaries
Notations not explained herein follow standard conventions [15]. By IN we de-
note the set {0, 1, 2, . . .} of natural numbers; inclusion and proper inclusion are
denoted by ⊆ and ⊂, respectively; card A is the cardinality of the set A. We
denote the set difference of A and B by A \ B.
For n ∈ IN the set of all n-ary partial recursive functions over IN will be
written P n , the set of all recursive functions Rn . As abbreviation for P 1 and
R1 we use P and R, respectively. For f ∈ P, x ∈ IN we write f (x) ↓, if f is
defined on input x and f (x) ↑ otherwise. Functions f, g ∈ P fulfill f =n g iff
{(x, f (x)) | x ≤ n and f (x) ↓} = {(x, g(x)) | x ≤ n and g(x) ↓}.
For i ∈ IN and n ≥ 1, in denotes the n-tuple (i, . . . , i). Functions can be
identified with the sequence of their values, e. g. f = 1n 0∞ means f (x) = 1 for
0 ≤ x < n and f (x) = 0 for x ≥ n.
If for f ∈ P and n ∈ IN the values f (0), . . . , f (n) are defined, we will write
f n for the initial segment (f (0), . . . , f (n)) and implicitly identify every f n with
a natural number via a computable bijective coding function.
For a tuple α = (α0 , . . . , αn ) over IN and a class U ⊆ R, we write α ⊑ U iff there is an f ∈ U such that f n = α.
In order to abbreviate certain statements, the symbols ∧ (“and”), =⇒ (“im-
plies”) and ⇐⇒ (“iff”) as well as the quantifiers ∀ (“for all”), ∀∞ (“for all but
finitely many”), and ∃ (“exists”) will be used.
Let ψ ∈ P 2 . Then ψ is called numbering and Pψ := {ψi | i ∈ IN} denotes
the set of the functions enumerated by ψ; for i ∈ IN the function ψi is defined
by ψi (x) := ψ(i, x) for all x. A numbering ϕ ∈ P 2 is called Gödel numbering
(acceptable numbering) iff (1) Pϕ = P and (2) ∀ψ ∈ P 2 ∃c ∈ R ∀i [ψi = ϕc(i) ].
We use ϕ to denote a fixed Gödel numbering. For a function ϕi we will write Si
if the function plays the role of a learning strategy (see below).
Let Φ be a Blum complexity measure [6] for ϕ. For i ∈ IN we will write
ϕi (x) ↓n instead of Φi (x) ≤ n.
In the basic learning model, a strategy S ∈ P learns a class U ⊆ R with
respect to a hypothesis space ψ ∈ P 2 . The strategy receives one after another
initial segments f n of a function f ∈ U as input and generates a sequence of hy-
potheses S(f n ) as output. Each hypothesis is interpreted as the function ψS(f n ) .
The basic inference type, learning in the limit [11], gives the learning strategy the freedom to output whatever it wants, as long as it reaches a point beyond which the output remains constant as well as correct.
Definition 1. A class U ⊆ R is said to be learned in the limit with respect to (or in) a hypothesis space ψ ∈ P 2 by a strategy S ∈ P iff
(1) ∀f ∈ U ∀n [S(f n ) ↓],
(2) ∀f ∈ U ∃i [ψi = f ∧ ∀∞ n [S(f n ) = i]].
This fact will be written U ∈ LIMψ (S). Furthermore, LIMψ := {U | U ⊆ R ∧ ∃S ∈ P [U ∈ LIMψ (S)]} denotes the set of LIM learnable classes with respect to ψ, and LIM := ⋃ψ∈P 2 LIMψ the entire set of LIM learnable classes.
Various inference types are built from LIM by adding conditions to the inter-
mediate hypotheses. Probably the most natural one is the consistency condition
demanding that the hypothesized function always agrees with the data already
known to the strategy [2,7]. It has been intensively studied [3,18,20,16].
In any case, the hypothesized functions have to be total in order to be correct,
hence it is natural to demand that all hypotheses refer to total functions. This
is called total learning [19]. It is restricted even further within the so called
class preserving learning [5] where the hypothesized functions are required to be
members of the class to be learned. Both, total and class preserving learning,
can be combined with the consistency condition [12].
Definition 2. Let U ⊆ R be a class, ψ ∈ P 2 a hypothesis space, and S ∈ P a
strategy such that U ∈ LIMψ (S).
(1) U ∈ CONSψ (S) iff ∀f ∈ U ∀n [f =n ψS(f n ) ] .
(2) U ∈ TOTALψ (S) iff ∀f ∈ U ∀n [ψS(f n ) ∈ R] .
(3) U ∈ CPψ (S) iff ∀f ∈ U ∀n [ψS(f n ) ∈ U ] .
(4) U ∈ CONS-TOTALψ (S) iff ∀f ∈ U ∀n [f =n ψS(f n ) ∈ R] .
(5) U ∈ CONS-CPψ (S) iff ∀f ∈ U ∀n [f =n ψS(f n ) ∈ U ] .
CPψ , TOTALψ , CONSψ , CONS-TOTALψ , and CONS-CPψ as well as CP,
TOTAL, CONS, CONS-TOTAL, and CONS-CP are defined in analogy to LIMψ
and LIM.
Instead of convergence to a single correct hypothesis, the behaviorally correct
learning in the limit only requires that a learning strategy outputs almost always
arbitrary, but correct hypotheses [3].
Definition 3. A class U is said to be learned behaviorally correct in the limit
by a strategy S with respect to a hypothesis space ψ (written: U ∈ BCψ (S)) iff
(1) ∀f ∈ U ∀n [S(f n ) ↓],
(2) ∀f ∈ U ∀∞ n [ψS(f n ) = f ].
BCψ and BC are defined analogously to the previous inference types.
Convergence of a different kind takes place in the finite learning model, also
called one-shot learning. Here the strategy may, on each function, output the
special hypothesis “?” finitely many times until it outputs a correct one [11].
Definition 4. A class U is said to be learned finitely by a strategy S with respect
to a hypothesis space ψ (written: U ∈ FINψ (S)) iff
(1) ∀f ∈ U ∀n [S(f n ) ↓],
(2) ∀f ∈ U ∃n (a) ∀x < n [S(f x ) = ?],
    (b) ∃i [ψi = f ∧ ∀x ≥ n [S(f x ) = i]].
FINψ and FIN are defined in the obvious way.
For strategies S (except for BC strategies), the convergence point on a learned function f is denoted by Conv(S, f ) := min{n | ∀x ≥ n [S(f x ) = S(f n )]}, and the final hypothesis of S on f by lim S(f ) := limn→∞ S(f n ).
The convergence point for BC strategies depends on the hypothesis space ψ
and is defined by Convψ (S, f ) := min{n | ∀x ≥ n [ψS(f x ) = f ]}.
For every inference type introduced here, Gödel numberings present a uni-
versal hypothesis space insofar as every learnable class can be learned in any
such numbering [13].
Lemma 5. Let U ⊆ R be a class of recursive functions, I ∈ {LIM, CONS,
TOTAL, CP, CONS-TOTAL, CONS-CP, BC, FIN} an inference type, and ϕ
an arbitrary Gödel numbering. Then Iϕ = I.
Some inference types have another property, namely that the learning goal
can always be achieved by a strategy defined on every input.
Lemma 6. Let I ∈ {LIM, TOTAL, CP, BC, FIN} be an inference type and
S ∈ P, U ⊆ R, ψ ∈ P 2 satisfying U ∈ Iψ (S). Then there is a strategy T ∈ R
such that U ∈ Iψ (T ).
Both lemmata will be used implicitly in the next sections.
The set of classes contained in total numberings is denoted by NUM := {U |
∃ψ ∈ R2 [U ⊆ Pψ ]}. It is known [11] that every class U ∈ NUM enumerated
by ψ ∈ R2 , i. e. U ⊆ Pψ , can be learned with respect to ψ by the strategy
Enumψ (f n ) := min{i | ψi =n f }.
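Over a total numbering given as a computable two-place function, Enumψ can be sketched as follows (names ours; the search terminates only when the target function actually occurs in the numbering):

```python
def enum_strategy(psi, data):
    """Identification by enumeration: return the least index i whose
    function psi_i is consistent with the observed values
    data = [f(0), ..., f(n)]."""
    i = 0
    while True:
        if all(psi(i, x) == y for x, y in enumerate(data)):
            return i
        i += 1
```

For instance, over the numbering of constant functions ψ(i, x) = i, the strategy converges immediately to the constant itself.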
The relations between the inference types in terms of learnable classes are
described in the following theorem [3,18,12].
Theorem 7.
(1) CONS-CP ⊂ CP ⊂ TOTAL = CONS-TOTAL ⊂ CONS ⊂ LIM ⊂ BC,
(2) FIN ⊂ CP, and (3) NUM ⊂ TOTAL.
(4) Inference types whose relations are not explicitly stated are incomparable.
The question formulated in the introduction can now be expressed more formally: Does U ∈ J satisfy ∀ψ ∈ P 2 [U ∈ Iψ =⇒ U ∈ Jψ ]?
We give a name to the condition contained therein.
Definition 8. Let I, J ∈ {BC, LIM, FIN, CONS, TOTAL, CP, CONS-CP,
CONS-TOTAL} be inference types. A class U ⊆ R has the property (satisfies
the condition) (I → J ) iff ∀ψ ∈ P 2 [U ∈ Iψ =⇒ U ∈ Jψ ].
Note that all finite classes U satisfy (I → J ) for all introduced inference types
I and J . The set {U | U ∈ I ∩ J and U satisfies (I → J )} will be called scope
of (I → J ). Classes U ∉ I ∩ J are not considered since it is obvious whether
they satisfy (I → J ) or not. A scope containing exactly the finite classes will be
called minimal, a scope equal to I ∩ J will be called maximal.
Often (I → J ) is fulfilled for all I ∩ J simply because ∀ψ ∈ P 2 [Iψ ⊆ Jψ ].
The next lemma states when this happens.
Lemma 9.
(1) ∀ψ ∈ P 2 [FINψ ⊆ CPψ ⊆ TOTALψ ⊆ CONSψ ⊆ LIMψ ⊆ BCψ ],
(2) ∀ψ ∈ P 2 [CONS-CPψ ⊆ CONS-TOTALψ ⊆ TOTALψ ],
(3) ∀ψ ∈ P 2 [CONS-CPψ ⊆ CPψ ].
Note, however, that I ⊆ J is not sufficient for ∀ψ ∈ P 2 [Iψ ⊆ Jψ ].
Lemma 10.
(1) ∃ψ ∈ P 2 [TOTALψ ⊈ CONS-TOTALψ ],
(2) ∃ψ ∈ P 2 [CPψ ⊈ CONS-TOTALψ ].
3 Negative Results and Biased Hypothesis Spaces


We will first consider the transition from TOTAL learning to CONS learning.
Both requirements are rather natural and lie close to each other in the hierarchy
(see Theorem 7 and Lemma 9). Hence, one would not expect any problems if one
wants to learn a TOTAL class in a TOTAL way with respect to a hypothesis
space suitable for learning this class consistently.
However, this expectation is wrong. As soon as the class to be learned is infi-
nite, the hypothesis space could be a “bad” one preventing the TOTAL learning
of the class. This is the subject of the following Theorem 11.
Theorem 11. Let U ∈ TOTAL be an infinite class. Then a hypothesis space
ψ ∈ P 2 exists such that U ∈ CONSψ \ TOTALψ .

Proof. Let U ∈ TOTAL. The hypothesis space ψ will be defined via diagonaliza-
tion against all (TOTAL) learning strategies. The functions in ψ will be grouped
into consecutive blocks Zj of increasing size. Within the j-th block, which con-
tains j + 2 functions, diagonalization against the j + 1 strategies S0 , . . . , Sj
happens.
The functions in the block will be defined, argument by argument, to equal
ϕj . Meanwhile the output of the strategies S0 , . . . , Sj on initial segments of ϕj
is watched. As soon as an Si is found to output a hypothesis z from within the
j-th block, the definition of ψz is stopped, resulting in ψz ∈ P \ R. From now
on neither Si nor ψz are taken into account during the ongoing definition of the
j-th block.
The formal algorithm for the j-th block is given below.
L_0^(j) := {0, . . . , j}
G_0^(j) := Z_j
x := 0
While ϕ_j(x) ↓ do:
(1) For all z ∈ G_x^(j): ψ_z(x) := ϕ_j(x)
(2) For all z ∈ Z_j \ G_x^(j): ψ_z(x) := ↑
(3) P_x^(j) := {(ℓ, y) | ℓ ∈ L_x^(j) ∧ y ≤ x ∧ S_ℓ(ϕ_j^y) ↓_x ∈ G_x^(j)}
(4) G_{x+1}^(j) := G_x^(j) \ {S_ℓ(ϕ_j^y) | (ℓ, y) ∈ P_x^(j) ∧ y = min{z | (ℓ, z) ∈ P_x^(j)}}
(5) L_{x+1}^(j) := L_x^(j) \ {ℓ | ∃y [(ℓ, y) ∈ P_x^(j)]}
(6) x := x + 1
90 F. Balbach
Note that, since the indices outnumber the strategies, there must remain at
least one z ∈ Zj such that ψz = ϕj .
To prove U ∉ TOTALψ , we assume an Si such that U ∈ TOTALψ (Si ).
Since U is infinite, there must be an f ∈ U such that Si converges on f to
an index k ∈ Zj for a j ≥ i. All total functions in the j-th block equal ϕj ,
hence f = ψk = ϕj . Therefore, Si outputs on ϕj almost always indices in Zj .
Thus, either Si or k (or both) will be “eliminated.” In either case, Si outputs a
non-total hypothesis on f ∈ U , contradicting the assumption.
In order to show U ∈ CONSψ , let R ∈ P be a strategy such that U ∈
CONS-TOTALϕ (R). A strategy T learning U consistently in ψ works as follows:

T(f^n) := min G_n^{(R(f^n))} .

For f ∈ U , ϕ_{R(f^n)} is total, hence G_n^{(R(f^n))} exists and can be computed using
the algorithm above.
Let f ∈ U and j = lim R(f^n) be the final hypothesis of R on f . Then
T converges to the least k ∈ Z_j not “eliminated.” It follows that ψ_k =
ϕ_j , hence T converges correctly. That the intermediate hypotheses of T are
consistent is a consequence of R being a consistent strategy for U in ϕ. □

If consistent learnability of a class is not sufficient for its total learnability,
then neither LIM nor BC learnability are. Moreover, consistent learnability can-
not be sufficient for learning in a CONS-TOTAL, CP, CONS-CP, or FIN way.

Corollary 12. Let I ∈ {CONS, LIM, BC} and J ∈ {FIN, CONS-CP, TOTAL,
CONS-TOTAL,CP} be inference types and let U ∈ J be an infinite class. Then
a hypothesis space ψ ∈ P 2 exists such that U ∈ Iψ \ Jψ .
A closer look at the proof of Theorem 11 reveals that the constructed hypoth-
esis space ψ does not depend on the class U . In ψ no infinite class U ∈ TOTAL
can be learned totally. But any such U can be learned consistently in ψ.
Even more is true: The hypothesis space ψ does in fact allow for the full
learning power of consistent learning as well as of limit learning in general.
Behaviorally correct learning, on the other hand, does not increase the learning
power further.
Corollary 13. There is a hypothesis space ψ satisfying
(1) TOTALψ = {U ⊆ R | card U < ∞},
(2) CONSψ = CONS,
(3) LIMψ = LIM,
(4) BCψ = LIM.
Proof. Let ψ be the hypothesis space constructed in the proof of Theorem 11.
(1) is obvious, as well as the ⊆-part of (2). In order to prove CONSψ ⊇ CONS
let U ∈ CONS be a class and R ∈ P a strategy such that U ∈ CONSϕ (R). We

define T as in the proof of Theorem 11 with the only difference that R is a CONS
strategy for U .
The proof of (3) proceeds similarly to that of (2). In order to show U ∈ LIMψ
for a U ∈ LIM we use a strategy R ∈ R with U ∈ LIMϕ (R) and define T as
follows:

T(f^n) := “For m = 0, . . . , n compute the sets G_m^{(R(f^m))} for at most n steps.
Let m be the greatest index such that the computation is finished.
Output min G_m^{(R(f^m))}.”

This modification is necessary since ϕ_{R(f^n)} need not be defined up to n,
hence G_n^{(R(f^n))} need not exist.
(4) LIM = LIMψ ⊆ BCψ follows from (3) and the definitions of BC and
LIM. In order to show BCψ ⊆ LIM we assume a U ∈ BCψ with U ∉ LIM. Let
S ∈ R be such that U ∈ BCψ (S). Then there must be a function g ∈ U such
that card {S(g^n) | n ∈ IN} = ∞ (otherwise one could amalgamate the finitely
many indices of each function of U and learn U in the limit, in contradiction to
our assumption).
Recall the grouping of the indices of ψ in blocks Z_j . The hypotheses of S on
g reach infinitely many such blocks. We assume without loss of generality that
S outputs on g at most one index from each such block, that is, S(g^n) never
lies in a block already hit by one of S(g^0), . . . , S(g^{n−1}). (One can construct
an S′ with this property from an S without it by simulating S and, if S tries to
output on g into an already used block, by computing S on longer initial
segments of g until an unused block is reached.)

Let i be such that S_i = S. Now an m ≥ Conv_ψ(S, g) and a j ≥ i exist such
that k := S(g^m) ∈ Z_j . Then ψ_k = ψ_{S(g^m)} = g = ϕ_j and ∀x [k ∈ G_x^{(j)}], that is, k
“survives.”

Now there must be an x such that (i, m) ∈ P_x^{(j)} because S_i(ϕ_j^m) = k. Hence,
k will be selected for “elimination” in step (4) since (i, m) ∈ P_x^{(j)} and m =
min{z | (i, z) ∈ P_x^{(j)}} (remember that, by the assumption on S, g^m is the only
segment where S outputs a hypothesis from Z_j ). This contradicts the conclusion
above that k “survives.” □


Note that part (1) of Corollary 13 remains true if TOTAL is substituted by
CONS-TOTAL, CP, CONS-CP, or FIN. Obviously, ψ is biased towards CONS
and LIM and against most other inference types considered.
The diagonalization technique from the proof of Theorem 11 can, suitably
modified, be used for other inference types as well.
In order to prove the next theorem we need a well known lemma that provides
a stronger version of Lemma 6 for LIM.
Lemma 14. There is a function ρ ∈ R such that for all i ∈ IN:
(1) ϕ_ρ(i) ∈ R,
(2) ∀ψ ∈ P 2 ∀U ⊆ R [U ∈ LIMψ (ϕ_i) =⇒ U ∈ LIMψ (ϕ_ρ(i))].

That is, for every LIM strategy ϕi , ρ effectively constructs an everywhere defined
LIM strategy ϕρ(i) with at least the same learning power.
We abbreviate the strategies from the last lemma by S′_i := ϕ_ρ(i).
Theorem 15. Let U ∈ LIM be an infinite class. Then there is a hypothesis
space ψ ∈ P 2 such that U ∈ BCψ \ LIMψ .

Proof. This proof uses the same blockwise grouping of indices as the proof of
Theorem 11. However, diagonalization happens against all strategies S′_i instead
of S_i . For every j ∈ IN the functions with indices z ∈ Z_j are defined as follows:

ψ_z(x) := ϕ_j(x), if ∀i ≤ j ∃n ≥ x [ϕ_j^n ↓ ∧ S′_i(ϕ_j^n) ≠ z],
          ↑,      otherwise.

Provided ϕ_j ∈ R, the function ψ_z equals ϕ_j iff every strategy S′_0, . . . , S′_j
outputs on ϕ_j infinitely often a hypothesis different from z. If, on the other
hand, one of those strategies converges on ϕ_j to z, then ψ_z ∉ R. Furthermore
there is a z ∈ Z_j with ψ_z = ϕ_j .

Assume U ∈ LIMψ (S′_i) for an i ∈ IN. Then an f ∈ U and a j ≥ i exist such
that k := lim S′_i(f) ∈ Z_j . Hence ψ_k = f = ϕ_j holds. But then, according to the
definition of ψ_k , S′_i does not converge to k, a contradiction.
In order to prove U ∈ BCψ , let R ∈ R be a LIM strategy for U in ϕ. A BC
strategy for U in ψ can be defined in the following way.
T(f^n) := “(1) For all z ∈ Z_{R(f^n)} and x ≤ n compute each ψ_z(x) for
at most n steps.
(2) Output a z such that ψ_z has a longest initial segment
computed in (1).”

Let f ∈ U and j := lim R(f). Then T outputs on f almost always indices of Z_j .
Let m be the length of the longest initial segment of all non-total functions
from Z_j . Let n ≥ Conv(R, f). Eventually n will be great enough for step (1)
to compute an initial segment longer than m, since the total function ϕ_j = f
appears in the j-th block. Then T will output an index z of a total function
ψ_z = ϕ_j = f and therefore learns f behaviorally correctly. □

The proof not only shows the scope of (BC → LIM) to be minimal.
Again, it yields a biased hypothesis space, this time favouring BC and disliking
LIM.
Corollary 16. There is a hypothesis space ψ satisfying
(1) LIMψ = {U ⊆ R | card U < ∞},
(2) BCψ = BC.
We will now turn to the property (TOTAL → FIN). Once again we apply
the same proof technique. But this time we only get a proof for the minimality
of the scope. We are not provided with a biased hypothesis space.
Theorem 17. Let U ∈ FIN be an infinite class. Then a hypothesis space ψ ∈ P 2
exists such that U ∈ TOTALψ \ FINψ .

Proof. Let R ∈ R such that U ∈ FINϕ (R). We need to modify the construction
of the proof of Theorem 11 in such a way that the class U will be taken into
account. This is inevitable, as will be shown in Theorem 18.
The construction of the j-th block starts by watching R on the function ϕj .
Only when (and if) R converges finitely to the index j, a construction very similar
to that of the proof of Theorem 11 takes place. For the sake of completeness the
algorithm constructing the block with indices from Zj is given below.

x := 0
While both (A) ϕ_j(x) ↓ and (B) R(ϕ_j^x) = ? do:
  For all z ∈ Z_j : ψ_z(x) := ϕ_j(x)
  x := x + 1

Case 1: (A) and (B) hold for all x. Then ∀z ∈ Z_j [ψ_z = ϕ_j ] and R does
not learn ϕ_j , hence ϕ_j ∉ U .
Case 2: There is x_0 such that (A) does not hold. Then
∀z ∈ Z_j ∀x ≥ x_0 [ψ_z(x) ↑] and therefore ∀z ∈ Z_j [ψ_z ∉ U ].
Case 3: There is x_0 such that (A) holds, but (B) does not, i.e.
ϕ_j^{x_0} ↓ and R(ϕ_j^{x_0}) ∈ IN. Then ∀z ∈ Z_j [ψ_z =^{x_0−1} ϕ_j ] follows
and two cases are distinguished:
Case 3.1: R(ϕ_j^{x_0}) ≠ j. Then define for all z ∈ Z_j and x ≥ x_0 : ψ_z(x) := ↑.
Case 3.2: R(ϕ_j^{x_0}) = j. Then the construction goes on in the following way.

L_{x_0}^{(j)} := {0, . . . , j}
G_{x_0}^{(j)} := Z_j
x := x_0
While ϕ_j(x) ↓ do:
  Perform steps (1) to (6) as in the proof of Theorem 11.

The proofs of U ∈ TOTALψ and U ∉ FINψ are not very different from the
corresponding ones in Theorem 11, but somewhat more technical, and will be
omitted due to space constraints. □


The question whether there is a biased hypothesis space, i.e. a ψ such that
TOTALψ = TOTAL and FINψ = {U | card U < ∞}, has been answered by an
anonymous reviewer of this paper’s submitted version. The proof is also due to
this reviewer.
Theorem 18. If ψ is a hypothesis space with TOTALψ = TOTAL, then there
is an infinite class in FINψ .

Proof. Let U_1 = {0^n 1^∞ | n ∈ IN}. U_1 is in TOTAL and thus in TOTALψ . Let S
be a TOTALψ strategy for U_1 . Now

U_2 = {ψ_{S(0^n 1^m)} | n ∈ IN and m is the first number such that m > 0
       and ψ_{S(0^n 1^m)} extends 0^n 1 as a function}

is a class of total functions since S only outputs indices of total functions. Furthermore,
the test whether ψ_{S(0^n 1^m)} extends 0^n 1 can be carried out effectively for
m = 1, 2, . . ., since the functions to be simulated are total. Thus U_2 is well-defined
and recursively enumerable.
For every n there is exactly one function in U_2 which extends 0^n 1; thus U_2
is infinite. Furthermore, U_2 is in FINψ : on input which does not start with 0^n 1,
one outputs “?”; on input that starts with 0^n 1, one outputs the ψ-index of the
unique function in U_2 which extends 0^n 1. □


Among all properties (I → J ) so far proven to have minimal scope, the
property (TOTAL → FIN) is the only one where no biased hypothesis space
exists.

4 Positive Results
The previous section might look somewhat discouraging. However, not every
property (I → J ) can be proved to have minimal scope. On the contrary, there are
inference types I and J such that the scope of (I → J ) is maximal.
Theorem 19. If U ∈ FIN and ψ ∈ P 2 such that U ∈ CPψ , then U ∈ FINψ .

Proof. Let R, S ∈ R be such that U ∈ FINϕ (R) and U ∈ CPψ (S). Define

T(f^n) := “If R(f^n) = ? then output ‘?’.
  If R(f^n) ∈ IN ∧ ∀m ≤ n ¬[ψ_{S(f^m)}^n ↓ = f^n] then output ‘?’.
  If R(f^n) ∈ IN ∧ ∃m ≤ n [ψ_{S(f^m)}^n ↓ = f^n] then:
    m′ := min{m ≤ n | ψ_{S(f^m)}^n ↓ = f^n},
    output S(f^{m′}).”

On f ∈ U the strategy T outputs “?” until an n with n ≥ Conv(R, f) and
ψ_{S(f^n)} =^n f is reached. Since R is a FIN strategy that has already converged
on f , there can be no function g ≠ f in U such that g =^n f (otherwise R would
converge on such a g to the same final hypothesis as it does on f , a contradiction).
But if there is only one function in U starting with f^n, this function must be
identical to ψ_{S(f^n)}, for S is a CP strategy. It follows that S(f^n) is a correct
hypothesis for f with respect to ψ. □


Theorem 19 addresses the transition from class preserving learning to finite
learning. It not only says that this transition is possible, it also provides
an algorithm to construct a finite learning strategy from a class preserving one.
Thus it presents a positive answer to the introductory question on the effective
constructability of such learning algorithms.
The next theorem gives a very similar result concerning (CP → CONS-CP).

Theorem 20. If U ∈ CONS-CP and ψ ∈ P 2 such that U ∈ CPψ , then U ∈


CONS-CPψ .

Proof. The proof is similar to that of Theorem 19. Let R ∈ P be such that
U ∈ CONS-CPϕ (R) and S ∈ R such that U ∈ CPψ (S). Define

T(f^n) := “If ψ_{S(f^n)} =^n f then output S(f^n).
  If ψ_{S(f^n)} ≠^n f then:
    g := ϕ_{R(f^n)},
    find m > n with ψ_{S(g^m)} =^n f and output S(g^m).”

Let f ∈ U and n ∈ IN. Then ψ_{S(f^n)} ∈ U follows, since S is a class preserving
strategy. Furthermore “ψ_{S(f^n)} =^n f ?” is decidable. If this condition is satisfied,
T(f^n) = S(f^n) is a class preserving consistent hypothesis; else for g := ϕ_{R(f^n)}
we get g ∈ U and g =^n f because of the properties of R. Since S learns all
functions of U in the limit, an m > n must exist such that ψ_{S(g^m)} = g =^n f .
This condition can easily be checked because S outputs only total hypotheses
with respect to ψ on functions g ∈ U . Thus, such an m can be found effectively
and S(g^m) is a consistent and class preserving hypothesis.
It remains to show that T learns every f ∈ U in the limit. For n ≥ Conv(S, f )
the first condition in the definition of T is satisfied and T (f n ) = S(f n ). Hence,
T converges on f to the same final hypothesis as S. 

Corollary 21. If U ∈ CONS-CP and ψ ∈ P 2 such that U ∈ FINψ , then U ∈
CONS-CPψ .
Corollary 22. If U ∈ FIN and ψ ∈ P 2 such that U ∈ CONS-CPψ , then U ∈
FINψ .
Corollary 22 is dual to Corollary 21. Thus, for all classes U ∈ CONS-CP∩FIN
and all hypothesis spaces ψ the equivalence U ∈ FINψ ⇐⇒ U ∈ CONS-CPψ
holds. Moreover, CP can be added to this equivalence.
Corollary 23. For all U ∈ CONS-CP ∩ FIN and for all ψ ∈ P 2 ,
U ∈ FINψ ⇐⇒ U ∈ CONS-CPψ ⇐⇒ U ∈ CPψ .
The equivalence described in the last corollary is remarkable in two ways.
First, it concerns an inference type and its consistent variant and, second, it
concerns two inference types that are incomparable regarding their learning
power. It should be noted, however, that this equivalence is only valid within
a relatively small area, FIN ∩ CONS-CP. Nevertheless, none of the other
introduced inference types can be added to the statement of Corollary 23.
Hence, there is indeed a close relationship between FIN, CONS-CP, and CP.

5 An Intermediate Result — CONS vs. LIM


So far, every examined property (I → J ) had either minimal or maximal scope.
This need not be so for every pair of inference types. In order to show this, we
will turn our attention to the condition (LIM → CONS).
All classes U ∈ CP satisfy (LIM → CONS), as has already been proved [16].

Theorem 24. If U ∈ CP and ψ ∈ P 2 such that U ∈ LIMψ , then U ∈ CONSψ .


Naturally the question arises, whether (LIM → CONS) holds not only for
the classes U ∈ CP, but for all U ∈ CONS. The next theorem gives a negative
answer to this question. It shows that even in NUM there are classes which do
not satisfy (LIM → CONS). This is remarkable since NUM classes tend to be
easily learnable. After all, there are always total numberings that can be used as
hypothesis spaces for them. However, the properties (I → J ) take into account
all numberings, whether total or not.
This is not unrealistic, even in the case of NUM classes, because deciding
whether a class is embedded in a total numbering can be much harder than
finding a non-total hypothesis space suitable for learning the class [17].
Theorem 25. There is a class U ∈ NUM and a hypothesis space ψ ∈ P 2 such
that U ∈ LIMψ \ CONSψ .
Proof. For the construction of ψ let η ∈ R 2 be a numbering of V := {f ∈ R |
∀^∞ n [f(n) = 0]} \ {0^∞} with the property ∀i ∀j [i ≠ j =⇒ η_i ≠ η_j ].
For all i ∈ IN set ψ_{2i+1} := η_i and ψ_{2i}(0) := i. For all i ∈ IN and x ≥ 1 define

ψ_{2i}(x) :=  i, if S_i(i^{x+1}) ∈ 2IN \ {2i},   (=: (A))
              i, if S_i(i^{x+1}) ∈ 2IN + 1,       (=: (B))
              ↑, if S_i(i^{x+1}) = 2i,            (=: (C))
              ↑, if S_i(i^{x+1}) ↑ .              (=: (D))
Define U := V ∪ {ψ_{2i} | ψ_{2i} = i^∞}. Clearly, U ∈ NUM. Furthermore, U is
learnable in the limit with respect to ψ by the strategy T defined as follows:

T(f^n) :=  2 · f(0),             if f(0) = . . . = f(n),
           2 · Enum_η(f^n) + 1,  otherwise.
To prove U ∉ CONSψ , assume a strategy S_i such that U ∈ CONSψ (S_i). Considering
the behavior of S_i on initial segments of i^∞, and thereby the definition
of ψ_{2i}, we distinguish four cases:
Case 1: There is x ≥ 1 such that (A) happens. Then S_i(i^{x+1}) = 2k for a
k ≠ i. Hence, ψ_{S_i(i^{x+1})}(0) = ψ_{2k}(0) = k ≠ i and S_i outputs an inconsistent
hypothesis on i^{x+1}, an initial segment of a function in U , a contradiction.
Case 2: There is x ≥ 1 such that (C) happens. Then S_i(i^{x+1}) = 2i, but
ψ_{2i}(x) ↑. Thus, S_i is inconsistent on i^{x+1}, a segment of a function in U , a contradiction.
Case 3: There is x ≥ 1 such that (D) happens. Then S_i is undefined on
i^{x+1}, a segment of a function in U , a contradiction.
Case 4: For all x ≥ 1 (B) happens. Then ψ_{2i} = i^∞ ∈ U and S_i outputs
almost always odd hypotheses on i^∞. Since the odd indices in ψ belong to non-constant
functions, S_i does not converge to a correct hypothesis on i^∞ ∈ U , a
contradiction. □

The hypothesis space ψ constructed in the last proof satisfies the conditions
(a) and (b) of the characterization theorem given in the introduction. Hence, ψ
is a somewhat more “natural” hypothesis space (cf. proof of Theorem 11) biased
towards an inference type, namely LIM, although only for a certain class U .

Table 1. Overview of the scope of (I → J ). A + means maximal scope, a − minimal
scope. For a set M of classes, + M means the scope is a superset of M; − M means
the intersection of the scope and M contains only the finite classes. Finally, − U means
that a counter-example for the maximality of the scope exists.

I \ J        FIN        CONS-CP  CP              CONS-TOTAL  TOTAL  CONS  LIM

FIN          +          +        +               − U         +      +     +
                                                 + CONS-CP
CONS-CP      +          +        +               +           +      +     +
CP           +          +        +               − U         +      +     +
                                                 + CONS-CP
CONS-TOTAL   − NUM      − NUM    − NUM           +           +      +     +
             − CONS-CP  − FIN    − FIN∩CONS-CP
TOTAL        −          − NUM    − NUM           − U         +      +     +
                        − FIN    − FIN           + CONS-CP
CONS         −          −        −               −           −      +     +
LIM          −          −        −               −           −      − U   +
                                                                    + CP
BC           −          −        −               −           −      −     −

6 Overview of Results Concerning (I → J )

Table 1 tries to present the numerous results of this paper in a clear manner.
Results stated in that table, but not proved within this paper, can be obtained
via techniques similar to those presented in the previous sections. Note that the
scope of (I → J ) has not been fully characterized for every such property.

Acknowledgments. This paper is based on my diploma thesis at the Univer-


sity of Kaiserslautern. It is a pleasure for me to thank Sandra Zilles and Rolf
Wiehagen for their continuous support and helpful advice. Many thanks also to
Thomas Zeugmann for many valuable hints and insights.
Finally, I wish to thank the members of the Program Committee of the
ALT 2003 for carefully reading the paper. In particular I am indebted to the
anonymous reviewer who provided the proof of Theorem 18.

References

1. D. Angluin, C. Smith. Inductive inference: theory and methods. Computing
Surveys 15, 237–269, 1983.
2. J. Barzdin. Inductive inference of automata, functions and programs, Proceedings
International Congress of Math., 455–460, Vancouver, 1974.
3. J.M. Barsdin. Dve Teoremui o predjelnom sintjese funkzii. Teorija algorithmov i
programm I 82–88, Latviiskii. Gosudarstvenyi univ., Riga 1974.

4. J.M. Barsdin, R.W. Freiwald: Prognosirovanje i predjelnyi sinfjes effektivno


peretschislimyich klassov funkzii. Teorija algorithmov i programm I 101–111, Latvi-
iskii Gosudarstvenyi universitjet, Riga 1974.
5. H.-R. Beick. Einige qualitative Aspekte bei der Erkennung von Klassen allgemein
rekursiver Funktionen. Diplomarbeit, Humboldt-Universität, Berlin, 1979.
6. M. Blum. A machine independent theory of the complexity of recursive functions,
Journal of the Association for Computing Machinery, Vol. 14, 322–336, April 1967.
7. L. Blum, M. Blum. Toward a Mathematical Theory of Inductive Inference, Infor-
mation and Control 28, 125–155, 1975.
8. J. Case, C. Smith. Comparison of Identification Criteria for Machine Inductive
Inference, Theoretical Computer Science 25, 193–220, 1983.
9. R. Freivalds, E. B. Kinber, R. Wiehagen. How Inductive Inference Strategies Dis-
cover Their Errors, Information and Computation 118, 208–226, 1995.
10. R. Freivalds. Inductive inference of recursive functions: Qualitative theory, (J.
Bārzdiņš and D. Bjorner, Eds.) Baltic Computer Science, LNCS 502, 77–110,
Springer-Verlag, 1991.
11. E. M. Gold. Language identification in the limit, Information and Control 10,
447–474, 1967.
12. K. P. Jantke, H.-R. Beick. Combining Postulates of Naturalness in Inductive Infer-
ence, Elektronische Informationsverarbeitung und Kybernetik 17, 465–484, 1981.
13. S. Jain, D. Osherson, J. S. Royer, A. Sharma. Systems that Learn: An Introduction
to Learning Theory, second edition, MIT Press, Cambridge, Massachusetts, 1999.
14. M. Machtey, P. Young. An Introduction to the General Theory of Algorithms,
North-Holland, New York, 1978.
15. H. Rogers. Theory of Recursive Functions and Effective Computability. McGraw–
Hill, New York, 1967.
16. W. Stein. Konsistentes und inkonsistentes Lernen im Limes. Dissertation, Univer-
sität Kaiserslautern, 1998.
17. F. Stephan, T. Zeugmann. Learning Classes of Approximations to Non-Recursive
Functions, Theoretical Computer Science Vol. 288, Issue 2, 309–341, 2002. (Special
Issue ALT ’99).
18. R. Wiehagen. Limes-Erkennung rekursiver Funktionen durch spezielle Strategien,
Elektronische Informationsverarbeitung und Kybernetik 12 1/2, 93–99, 1976.
19. R. Wiehagen. Zur Theorie der algorithmischen Erkennung, Dissertation B, Sektion
Mathematik, Humboldt-Universität, Berlin, 1978.
20. R. Wiehagen, T. Zeugmann. Learning and Consistency, (K. P. Jantke, S. Lange,
Eds.) Alg. Learning for Knowledge-Based Systems, LNAI 961, 1–24, Springer, 1995.
Robust Inference of Relevant Attributes

Jan Arpe and Rüdiger Reischuk

Institut für Theoretische Informatik, Universität zu Lübeck


Wallstr. 40, 23560 Lübeck, Germany
{arpe/reischuk}@[Link]

Abstract. Given n Boolean input variables representing a set of attributes,
we consider Boolean functions f (i.e., binary classifications of
tuples) that actually depend only on a small but unknown subset of these
variables/attributes, in the following called relevant. The goal is to determine
the relevant attributes given a sequence of examples: input vectors
X and corresponding classifications f (X). We analyze two simple greedy
strategies and prove that they are able to achieve this goal for various
kinds of Boolean functions and various input distributions according to
which the examples are drawn at random.
This generalizes results obtained by Akutsu, Miyano, and Kuhara for the
uniform distribution. The analysis also provides explicit upper bounds
on the number of necessary examples. They depend on the distribution
and combinatorial properties of the function to be inferred.
Our second contribution is an extension of these results to the situation
where attribute noise is present, i.e., a certain number of input bits xi
may be wrong. This is a typical situation, e.g., in medical research or
computational biology, where not all attributes can be measured reliably.
We show that even in such an error-prone situation, reliable inference of
the relevant attributes can be performed, because our greedy strategies
are robust even against a linear number of errors.

1 Introduction

In many data mining applications, one is faced with the situation that a binary
classification of elements with a large number of attributes only depends on a
small subset of these attributes. A central task is then to infer these relevant
attributes from a given input sample consisting of a series of examples X(k) =
(x1 (k), . . . , xn (k)) with classifications y(k) for k = 1, 2, . . . , m, i.e., one wants
to find a set of variables xi1 , . . . , xid such that the sample can be explained
by a function f : {0, 1}n → {0, 1} that depends only on these d variables. A
function f is said to explain the sample, if f (x1 (k), . . . , xn (k)) = y(k) for all
k. Moreover, since real data usually contain noise, it is of particular interest to
design algorithms that in some sense behave ‘robustly’ with respect to input
disturbances.
When inferring relevant attributes, two natural questions can be asked:

Supported by DFG research grant Re 672/3.

R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 99–113, 2003.

c Springer-Verlag Berlin Heidelberg 2003

1. Given a fixed sample of an unknown concept, what is the minimum number
of variables that explain the sample?
2. How many examples does one need to generate in order to find out the actual
relevant attributes?
The first question gives rise to an optimization problem introduced in Sect. 3,
whereas the second one can be considered as an algorithmic learning problem.
In both cases, however, the key task is to infer relevant variables from a sample.
Thus our goal is to design efficient algorithms that find a small set of variables
explaining the input sample.
Akutsu and Bao [2] proposed a greedy algorithm based on a well-known
greedy strategy for the Set Cover problem (see [12]). Akutsu, Miyano, and
Kuhara [3] describe an efficient implementation of this approach and give an
average case analysis of the algorithm for two special types of functions, namely
AND and OR of arbitrary literals, under the uniform distribution of input exam-
ples. In Sect. 4, we simplify the greedy strategy. We call this strategy Greedy
Ranking and show that its performance is similar to the one obtained in [3].
In Sect. 5, the average case analysis of [3] is generalized in two respects: to a
broader class of functions and to weaker assumptions on the input distributions.
It turns out that a modification of our approach, namely taking the smallest sets
of the ranking, may also be useful for some classes of functions and input dis-
tributions. We call this strategy Modest Ranking, since its ‘modest’ behavior
of first selecting the smallest sets is in contrast to the greedy strategy of taking
the largest sets. We apply these very general results to some typical input distri-
butions and some specific functions of major interest (e.g., monomials, clauses,
and threshold functions) in Sect. 6.
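As a rough illustration of the Set Cover flavour of such ranking strategies (a hypothetical sketch, not the exact Greedy Ranking of Sect. 4; the function name and data are invented), one can greedily pick the variable that separates the most still-unseparated pairs of differently labeled examples:

```python
from itertools import combinations

def greedy_attributes(X, y):
    """Set-cover-style greedy selection of a small set of variables that
    explains the sample: S is the set of example pairs with different
    labels; each round picks the variable differing on the most still
    uncovered pairs (cf. the greedy Set Cover heuristic)."""
    n = len(X[0])
    uncovered = {(k, l) for k, l in combinations(range(len(X)), 2)
                 if y[k] != y[l]}
    chosen = []
    while uncovered:
        i = max(range(n),
                key=lambda i: sum(X[k][i] != X[l][i] for k, l in uncovered))
        chosen.append(i)
        uncovered = {(k, l) for k, l in uncovered if X[k][i] == X[l][i]}
    return chosen

# Sample labelled by f(x) = x0 AND x2 (variables 0-indexed):
X = [(0,0,0,0), (1,0,1,1), (1,1,0,0), (0,1,1,0), (1,0,1,0)]
y = [x[0] & x[2] for x in X]
print(greedy_attributes(X, y))  # a small set of variables explaining the sample
```

Note that on few examples the greedy choice need not coincide with the truly relevant attributes; the average-case analyses below bound how many random examples suffice for that.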
After these investigations we turn to the ‘real case’ of samples that contain
partial errors. In Sect. 7, we assume that for each attribute there is a certain
(generally unknown) error probability δi that the value for this attribute xi in
an input vector is flipped. This noise model called product random attribute
noise ([11]) has been applied to the PAC learning model as well. Note that
it is quite different from the classification noise model [5]. We show that for
some δ > 0 depending only on combinatorial properties of the function f to be
inferred and on the probability distribution according to which the samples are
generated, one can tolerate any constant fractions δi ≤ δ of such erroneous bits
and still infer the relevant attributes successfully with high probability using
the ranking strategies. In addition to their general ability of robustly inferring
relevant attributes, the number of examples needed to handle disturbed inputs
only grows at most by a factor of 4.
Finally, in Sect. 8, we consider a different approach for the parity function,
since the ranking strategies do not work in this case.
Inferring relevant attributes is related to finding association rules (also called
functional dependencies/relations) – a well-studied problem (e.g., see [1,15]).
In the variant considered in this paper, the target attribute Y is fixed as in
[2,3]. The goal of efficiently inferring concepts with many irrelevant attributes
(so-called attribute-efficient learning) has attracted much attention in the past
(e.g., see [13,8,9,17]). Most authors consider the mistake-bounded model. In this

on-line setting, one tries to minimize the number of examples for which the
current hypothesis turns out to be wrong. There are several ways known how to
convert on-line algorithms with low mistake bounds into efficient PAC learning
algorithms (see [4,14,13]). In this paper, we consider the finite exact learning
model: From a randomly selected sample of small size, we have to compute a
single hypothesis that with high probability has to be correct (with accuracy 1).
Recently, Mossel, O’Donnell, and Servedio [16] have introduced an algorithm
that exactly learns the class of concepts f with n input variables and d relevant
attributes (also called d-juntas) under uniform distribution with confidence 1 − δ
in time (n^d)^{ω/(ω+1)} · poly(n, 2^d, log(1/δ)), where ω < 2.376 is the matrix multiplication
exponent. The Target Ranking algorithm we introduce runs in time
O(m^2 n) on samples of size m. In order to achieve confidence 1 − δ, we roughly
need c · log(1/δ) · log n examples, where c depends on the base function f̃ (i.e.,
the restriction of f to its relevant variables), the number of relevant attributes
d, and the probability distribution according to which the examples are drawn.
In particular, restricting to the uniform distribution, for arbitrary f satisfying
a certain statistical property, c can be bounded by poly(2^d). In this case we
are able to exactly infer the relevant attributes with confidence 1 − δ in time
n · poly(log n, 2^d, log(1/δ)).
Due to space limitations, most proofs have to be omitted. Details are pre-
sented in [7].

2 Preliminaries
A concept is a Boolean function f : {0, 1}n → {0, 1}, a concept class is a set of
concepts. A concept f : {0, 1}n → {0, 1} depends on variable xi , if the two (n−1)-
ary subfunctions fxi =0 and fxi =1 with variable xi fixed to 0 and 1 respectively
are not identical. If f depends on xi , then attribute xi is called relevant for f ,
otherwise irrelevant. We denote the set of relevant (resp. irrelevant) attributes
by V + (f ) (resp. V − (f )). If f is clear from the context, we just write V + and
V − . We denote by f˜ the restriction of f to its relevant variables and call it
the base function of f . An example is a vector (x1 , . . . , xn ; y) ∈ {0, 1}n+1 . It
is an example for f , if y = f (x1 , . . . , xn ). The values of x1 , . . . , xn are called
variable or attribute assignments, whereas the value for y is called a label. A
sequence (x1 (k), . . . , xn (k); y(k)) (k = 1, . . . , m) of examples for f is called a
sample for f of size m, and f is said to explain the sample. A sample T is a
sequence of examples such that there exists some f that explains the sample. If f
depends only on variables from the set {xi1 , . . . , xid }, then we also say that these
variables explain T. A sample is stored in a matrix, each line of which represents
one example:

    T = (X; y) = ( x1(1) . . . xn(1) | y(1) )
                 (  ...        ...   |  ...  )
                 ( x1(m) . . . xn(m) | y(m) )   ∈ {0,1}^{m×(n+1)},

where
X is the submatrix consisting of the variable assignments in the examples, and
y is the column vector containing the labels of the examples. A sample T may
contain a certain combination of attributes several times. Then, of course, it is
necessary that for k ≠ l the following implication holds:
102 J. Arpe and R. Reischuk

    X(k) = X(l)  =⇒  y(k) = y(l).                                        (1)

Indeed, if (1) does not hold for some k ≠ l, then by definition, T is not a
sample. In the noisy case, however, it may well be that different combinations of
attributes yield different labels, but due to false measurements of the attributes,
the values for x1 , . . . , xn all look the same.
We assume that the examples of a sample T are drawn according to a fixed
probability distribution p : {0, 1}n → [0, 1], and we say that T is generated
according to p.

Definition 1. Let (X; y) ∈ {0,1}^{m×(n+1)} be a sample. The corresponding functional relations graph is a bipartite labeled graph defined as follows. The vertices are {1, . . . , m}, the edges are S = {{k, l} | y(k) ≠ y(l)}. Each edge {k, l} is labeled by the set of variables xi such that xi(k) ≠ xi(l). The set of edges with a label containing variable xi is denoted by Si = {{k, l} ∈ S | xi(k) ≠ xi(l)}.

Proposition 1. Let T = (X; y) ∈ {0,1}^{m×(n+1)} be a sample and {i1, . . . , id} ⊆ {1, . . . , n}. Then the following statements are equivalent:
(a) xi1 , . . . , xid explain T.
(b) For each pair k, l ∈ {1, . . . , m} such that y(k) ≠ y(l) there exists r ∈ {1, . . . , d} such that xir(k) ≠ xir(l).
(c) S = Si1 ∪ . . . ∪ Sid.
3 Approximability

Consider the following optimization problem:

Inference of Relevant Attributes (INFRA)

Instance:  sample T = (X; y) = (x1(k), . . . , xn(k); y(k))_{k=1,...,m} ∈ {0,1}^{m×(n+1)}
Solution:  a function f : {0,1}^n → {0,1} such that T is a sample for f
           (i.e., y(k) = f(x1(k), . . . , xn(k)) for all k ∈ {1, . . . , m})
Measure:   |V+(f)|
Goal:      minimize |V+(f)|

Note that in order to find a small set of explaining attributes for an INFRA
instance, we do not have to explicitly define a corresponding concept f; by
Proposition 1, it is enough to find a set of attributes xi1, . . . , xid such that for all
k, l ∈ {1, . . . , m} with y(k) ≠ y(l) there exists r ∈ {1, . . . , d} with xir(k) ≠ xir(l).
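By Proposition 1(b), testing whether a candidate set of attributes explains a sample needs nothing beyond pairwise comparisons. A minimal sketch in Python (the function name and the toy sample are ours, not from the paper):

```python
def explains(sample, indices):
    """Proposition 1(b): a set of attributes explains a sample iff every
    pair of examples with different labels differs in one of them."""
    for x_k, y_k in sample:
        for x_l, y_l in sample:
            if y_k != y_l and all(x_k[i] == x_l[i] for i in indices):
                return False
    return True

# Toy sample for f(x) = x0 AND x1 with one irrelevant attribute x2.
sample = [((0, 0, 1), 0), ((1, 0, 0), 0), ((0, 1, 1), 0), ((1, 1, 0), 1)]
```

Here `explains(sample, [0, 1])` holds, while no single attribute explains this sample on its own.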
In order to obtain results on the approximability of INFRA, we consider the
well-studied Set Cover problem. Note that Proposition 1 yields a reduction
from INFRA to Set Cover. Based on this fact, Akutsu and Bao [2] have proved
the following theorem:
Theorem 1 ([2]). INFRA can be approximated in polynomial time within a
factor of 2 ln m + 1.
Robust Inference of Relevant Attributes 103

The next claim is a slightly stronger version of Theorem 8 in [2], since we
consider the special case of INFRA for Boolean functions.
Proposition 2. Set Cover is reducible to INFRA via a polynomial time com-
putable approximation factor preserving reduction.
Applying a result from [10], we obtain the following lower bound:
Theorem 2. For any ε > 0, INFRA cannot be approximated within a factor
of (1 − ε) ln m unless NP ⊆ DTIME(n^{O(log log n)}).
Therefore, when faced with the INFRA problem, the best one can hope for
are efficient approximation algorithms with a nonconstant approximation ratio
or fast algorithms providing correct results for ‘most’ inputs. In the rest of this
paper, we investigate the latter challenge.

4 From Greedy to Ranking


Let us start with the algorithm discussed in [3] which is presented in Fig. 1. It
makes use of the reduction from INFRA to Set Cover given by Proposition 1
and applies a well-known greedy approach to the Set Cover instance obtained.
Johnson [12] first analyzed this approach for Set Cover.

input (x1(k), . . . , xn(k); y(k))_{k=1,...,m}

V := {x1, x2, . . . , xn};  S := {{k, l} | y(k) ≠ y(l)}
while S ≠ ∅ do
    for i = 1 to n do
        Si := {{k, l} ∈ S | xi(k) ≠ xi(l)}
    find an xi ∈ V with maximum |Si|
    output xi
    S := S \ Si;  V := V \ {xi}

Fig. 1. Algorithm Greedy
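A direct, unoptimized rendering of Algorithm Greedy, assuming a sample is given as a list of (attribute tuple, label) pairs; all identifiers below are our own:

```python
from itertools import combinations

def greedy(sample):
    """Algorithm Greedy (Fig. 1): repeatedly output the variable whose set
    S_i covers the most remaining edges of the functional relations graph."""
    m, n = len(sample), len(sample[0][0])
    S = {(k, l) for k, l in combinations(range(m), 2)
         if sample[k][1] != sample[l][1]}            # edges: label-discordant pairs
    V = set(range(n))
    output = []
    while S:
        best, best_Si = None, set()
        for i in sorted(V):                          # deterministic tie-breaking
            Si = {(k, l) for (k, l) in S
                  if sample[k][0][i] != sample[l][0][i]}
            if len(Si) > len(best_Si):
                best, best_Si = i, Si
        if best is None:                             # impossible for a proper sample
            break
        output.append(best)
        S -= best_Si
        V.discard(best)
    return output

sample = [((0, 0, 1), 0), ((1, 0, 0), 0), ((0, 1, 1), 0), ((1, 1, 0), 1)]
```

On this toy sample the algorithm outputs [0, 1], a set of explaining variables.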

We apply some modifications of this algorithm and analyze their effects. The
strategy is based on a ranking of the sets S1 , . . . , Sn by their cardinalities which
is done by the procedure Rank Sets, see Fig. 2.

for i = 1 to n do
    Si := {{k, l} ∈ S | xi(k) ≠ xi(l)}
compute π : {1, . . . , n} → {1, . . . , n} such that |Sπ(1)| ≥ |Sπ(2)| ≥ . . . ≥ |Sπ(n)|

Fig. 2. Procedure Rank Sets

The results may be worse in some cases, since the new greedy approach is
based on a single static ranking. However, we show that the ranking still yields
properties similar to Greedy and in addition performs quite robustly when
confronted with attribute noise.
Greedy Ranking (see Fig. 3) outputs the variables xi with maximum |Si |
until these Si ’s cover the whole edge set S. In contrast to Greedy, Greedy
Ranking does not recompute the sets Si in each step. Given a concept f , the
Greedy Ranking algorithm works correctly, if with high probability, the sets
Si for xi ∈ V + are larger than the sets Si for xi ∈ V − . On the other hand,
if the converse is the case, i.e., if the sets Si for relevant variables are likely to
be smaller than the sets Si for the irrelevant variables, we should make use of
an algorithm that outputs the variables corresponding to the smallest sets Si .
Instead of being greedy, this algorithm rather behaves modestly, so we call it
Modest Ranking (see also Fig. 3).

input (x1(k), . . . , xn(k); y(k))_{k=1,...,m}

Rank Sets
S := {{k, l} | y(k) ≠ y(l)}
i := 1                      / i := n
while S ≠ ∅ do
    output xπ(i)
    S := S \ Sπ(i)
    i := i + 1              / i := i − 1

Fig. 3. Algorithms Greedy/Modest Ranking

Note that, given a sample, all three algorithms terminate after a finite number
of steps since by property (1), each pair {k, l} ∈ S belongs to some Si . Clearly,
all algorithms presented here compute a cover of S. Thus by the reduction given
in Proposition 1, the algorithms work correctly for the optimization problem,
i.e., they output sets of variables that explain the input sample. It is not hard
to construct instances showing that in general none of the algorithms is superior
to the others in terms of finding small sets of explaining variables.
It may be the case that an input sample for some concept f can be explained
by a proper subset of the relevant variables for f . In case the number d of
relevant variables is a priori known, we can overcome this problem by giving d
as additional input to the algorithms and output the d variables with the largest
(resp., smallest) sets Si . This is done by the Target Ranking and the Modest
Target Ranking algorithms (see Fig. 4).

input (x1(k), . . . , xn(k); y(k))_{k=1,...,m}, d

Rank Sets
S := {{k, l} | y(k) ≠ y(l)}
for i = 1 to d do
    output xπ(i)            / output xπ(n−i+1)

Fig. 4. Algorithms (Modest) Target Ranking
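Rank Sets together with the two target variants of Fig. 4 can be sketched as follows (our rendering; a sample is again a list of (attribute tuple, label) pairs):

```python
from itertools import combinations

def rank_sets(sample):
    """Procedure Rank Sets (Fig. 2): compute |S_i| for every variable and
    return the variable indices sorted by decreasing |S_i|."""
    m, n = len(sample), len(sample[0][0])
    sizes = [0] * n
    for k, l in combinations(range(m), 2):
        if sample[k][1] != sample[l][1]:             # edge {k, l} of S
            for i in range(n):
                if sample[k][0][i] != sample[l][0][i]:
                    sizes[i] += 1                    # edge belongs to S_i
    return sorted(range(n), key=lambda i: -sizes[i])

def target_ranking(sample, d):
    """Output the d variables with the largest sets S_i."""
    return rank_sets(sample)[:d]

def modest_target_ranking(sample, d):
    """Output the d variables with the smallest sets S_i."""
    return rank_sets(sample)[-d:]

# Full truth table of f(x) = x0 AND x1 with one irrelevant attribute x2.
sample = [((a, b, c), a & b) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
```

Here `target_ranking(sample, 2)` recovers the relevant variables {x0, x1}.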


Moreover, if we only have an a priori upper bound d on the number of
relevant variables, then the target ranking algorithms output a set of d variables
such that the relevant ones are most likely among them.
Definition 2. Given a concept f with relevant variables xi1 , . . . , xid and a sam-
ple T for f , we say that an algorithm succeeds in a step, if the output generated
in that step is a relevant variable of f . The algorithm is said to be correct, if it
is successful in all steps it makes. It is complete, if it finds all relevant variables.
Finally, an algorithm is said to be successful, if it is both correct and complete.
The following properties are easy to show:
Lemma 1. Let f depend on d variables, and let T be a sample for f .
(a) If Target Ranking (resp., Modest Target Ranking) is correct on
input (T, d), then it is successful, too.
(b) If Target Ranking (resp., Modest Target Ranking) is complete on
input (T, d), then it is also successful.
(c) If Target Ranking (resp., Modest Target Ranking) is successful on
input (T, d), then Greedy Ranking (resp., Modest Ranking) is correct
on input T .
In order to uniquely recognize the relevance of some variable xir , there has
to be an edge in the functional relations graph whose only relevant label is
xir . Thus, independently of the used learning algorithm, a necessary (but not
sufficient) condition to infer the relevance of xir is the occurrence of two examples
k, l in the input sample with xir (k) = 0 and xir (l) = 1, but with identical
values for all other relevant attributes.
By the birthday paradox, already for the uniform distribution roughly √(2^{d−1})
examples are necessary to guarantee such an occurrence. This shows that in order
for any algorithm to be complete, Ω(2^{d/2}) examples have to be provided for
information-theoretic reasons.

5 Probabilistic Analysis of the Ranking Strategies


Let f : {0,1}^n → {0,1} be a concept with relevant variables xi1, . . . , xid, and p
be a probability distribution. For x, y ∈ {0,1}, we denote by ‘xi = x’ the set of
examples with xi = x, and by ‘f = y’ the set of examples with f(x1, . . . , xn) = y.
For i ∈ {1, . . . , n} and x, y ∈ {0,1} define the probability αi^{(x,y)} that a randomly
drawn example (x1, . . . , xn; f(xi1, . . . , xid)) has xi = x and f(xi1, . . . , xid) = y,
i.e., αi^{(x,y)} = Pr(xi = x ∧ f = y).
Let T = (X; y) be a sample of size m for f generated according to p. Define
K = {1, . . . , m} and Ki^{(x,y)} = {k ∈ K | xi(k) = x and y(k) = y}. The situation
is depicted in Fig. 5. Since all examples are identically distributed, |Ki^{(x,y)}| can
be considered as a binomially distributed random variable with parameters αi^{(x,y)}
and m. Analogously to the αi^{(x,y)}'s, we denote by βi^{(x,y)} the corresponding relative
frequencies in the input sample, i.e., βi^{(x,y)} = |Ki^{(x,y)}| / |K|, and define

    αi = αi^{(0,0)} αi^{(1,1)} + αi^{(1,0)} αi^{(0,1)}   and   βi = βi^{(0,0)} βi^{(1,1)} + βi^{(1,0)} βi^{(0,1)}.

It holds that |Si| = βi m^2, and for large m, we get the approximation |Si| ≈ αi m^2.
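In implementation terms, the identity |Si| = βi m^2 means that |Si| = |Ki^{(0,0)}|·|Ki^{(1,1)}| + |Ki^{(1,0)}|·|Ki^{(0,1)}|, so all the |Si| can be obtained from four counters per variable in a single pass over the sample, in O(mn) time instead of enumerating all pairs. A sketch (ours):

```python
def set_sizes(sample):
    """Compute every |S_i| from the class counts |K_i^{(x,y)}|:
    |S_i| = |K_i^{(0,0)}| * |K_i^{(1,1)}| + |K_i^{(1,0)}| * |K_i^{(0,1)}|."""
    n = len(sample[0][0])
    counts = [[[0, 0], [0, 0]] for _ in range(n)]    # counts[i][x][y]
    for x_vec, y in sample:
        for i, x in enumerate(x_vec):
            counts[i][x][y] += 1
    return [c[0][0] * c[1][1] + c[1][0] * c[0][1] for c in counts]

# Sanity check on the truth table of f(x) = x0 AND x1 (x2 irrelevant).
sample = [((a, b, c), a & b) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
```

On this sample the two relevant variables get the largest values (8, 8 versus 6 for the irrelevant one).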
[Fig. 5 shows the partition of K with respect to variable xi: the four classes Ki^{(0,0)}, Ki^{(0,1)}, Ki^{(1,0)}, Ki^{(1,1)} according to the values of xi(k) and y(k). The edges in Si connect Ki^{(0,0)} with Ki^{(1,1)} and Ki^{(1,0)} with Ki^{(0,1)}; the remaining edges of S are not in Si.]
Fig. 5. Partition of K with respect to variable xi

Lemma 2. For fixed i ∈ {1, . . . , n} and arbitrary 0 ≤ δ ≤ 1/10, it holds that

    Pr( | |Si| − αi m^2 | ≥ δ m^2 ) ≤ 8 e^{−(1/3) δ^2 m}.

Proof. The proof requires lengthy calculations and case distinctions. It is based
on standard Chernoff bound techniques and can be found in [7].

The following theorem provides very general conditions that guarantee the suc-
cess of the ranking algorithms with respect to a concept f :

Theorem 3. Let f : {0, 1}n → {0, 1} depend on xi1 , . . . , xid , let T be a sample
for f generated according to a probability distribution p : {0, 1}n → [0, 1], and
let c > 0.

(a) If min{αi | xi ∈ V+} > max{αj | xj ∈ V−}, then with probability 1 − n^{−c},
    Target Ranking is successful on input (T, d), provided that

        m ≥ 12 ε^{−2} ((c + 1) ln n + ln 8),

    where ε = min{ min{αi | xi ∈ V+} − max{αj | xj ∈ V−} , 1/5 }.
(b) If max{αi | xi ∈ V+} < min{αj | xj ∈ V−}, then with probability 1 − n^{−c},
    Modest Target Ranking is successful on input (T, d), provided that

        m ≥ 12 ε^{−2} ((c + 1) ln n + ln 8),

    where ε = min{ min{αj | xj ∈ V−} − max{αi | xi ∈ V+} , 1/5 }.

Proof. We only prove part (a), since (b) can be done analogously. Let
t = (1/2)( min_{xi∈V+} αi + max_{xj∈V−} αj ) m^2. Then for xi ∈ V+ it holds that

    Pr(|Si| ≤ t) ≤ Pr( |Si| ≤ (αi − ε/2) m^2 ) ≤ Pr( | |Si| − αi m^2 | ≥ (ε/2) m^2 )
                ≤ 8 e^{−(1/3)(ε/2)^2 m} = 8 e^{−(1/12) ε^2 m},

where the last inequality is due to Lemma 2. Similarly, for xj ∈ V − ,


    Pr(|Sj| ≥ t) ≤ 8 e^{−(1/12) ε^2 m}.
Target Ranking is successful on input (T, d) iff it is correct in all of its d
steps. This is exactly the case if the largest d sets Si correspond to the relevant
variables, i.e., if min_{xi∈V+} |Si| > max_{xj∈V−} |Sj|. We have

    Pr( min_{xi∈V+} |Si| > max_{xj∈V−} |Sj| )
        ≥ Pr( min_{xi∈V+} |Si| > t  ∧  max_{xj∈V−} |Sj| < t )
        = 1 − Pr( ∃ xi ∈ V+ : |Si| ≤ t  ∨  ∃ xj ∈ V− : |Sj| ≥ t )
        ≥ 1 − ( Σ_{xi∈V+} Pr(|Si| ≤ t) + Σ_{xj∈V−} Pr(|Sj| ≥ t) )
        ≥ 1 − 8n e^{−(1/12) ε^2 m}.

If m ≥ 12 ε^{−2} ((c + 1) ln n + ln 8), then 8n e^{−(1/12) ε^2 m} ≤ n^{−c}, thus the claim follows.
As Theorem 3 is stated for a general setting, let us now consider some typical
input distributions and simplify its conditions in these cases.

• Independent Attributes (IA)


Suppose that the values for the xi ’s (i = 1, . . . , n) are generated independently
of each other, say with Pr(xi = 1) = pi ∈ [0, 1] (thus Pr(xi = 0) = 1 − pi ). Then
we say that the sample is IA(p1 , . . . , pn )-generated.
Lemma 3. Let T be an IA(p1 , . . . , pn )-generated sample for f . Then, for xj ∈
V − , we have αj = 2pj (1 − pj ) Pr(f = 0) Pr(f = 1).
• Independent Equiprobable Attributes (IEA)
If T is IA(p1 , . . . , pn )-generated with p1 = . . . = pn = q, then we say that T is
IEA(q)-generated.
Lemma 4. Let T be an IEA(q)-generated sample for f .
(a) For each xj ∈ V − , it holds αj = 2q(1−q) Pr(f = 0) Pr(f = 1). In particular,
αj is independent of xj ∈ V − . We denote the common value of these αj ’s
by α− in this case.
(b) If f is symmetric, then the αi ’s with xi ∈ V + are also independent of i. We
denote the common value of these αi ’s by α+ in this case.
From the previous lemma and Theorem 3 we immediately obtain the following
result on the successfulness of the target ranking algorithms when applied to
symmetric Boolean functions.
Corollary 1. For f with a symmetric base function f˜, three cases can occur:
– α+ > α− : O(log n) input examples suffice such that Target Ranking is
successful with high probability.
– α+ < α− : O(log n) input examples suffice such that Modest Target
Ranking is successful with high probability.
– α+ = α− : No success ratios can be guaranteed for the ranking algorithms,
regardless of how many input examples are provided.
• Uniformly Distributed Attributes (UDA)
If the examples are uniformly distributed, i.e., if a sample T is IEA(1/2)-generated,
then we say that T is UDA-generated.
6 Inferring Specific Concepts

We now consider several basic Boolean functions.


Theorem 4 (AND-function). Let {i1 , . . . , id } ⊆ {1, . . . , n}, let f : {0, 1}n →
{0, 1} be defined by f (x1 , . . . , xn ) = xi1 ∧ . . . ∧ xid , and let T be an IEA(q)-
generated sample for f .

(a) If q ≤ 1/2, then it holds that α+ > α−. Thus the success ratio for Target
    Ranking may be raised arbitrarily close to 1 by choosing a large enough
    sample size m ∈ O(log n).
(b) If q = 1/2, then it holds that α+ − α− = 2^{−2d−1} > 0. Thus the success ratio for
    Target Ranking may be raised arbitrarily close to 1 by choosing a large
    enough sample size m ∈ O(log n) with the constant being of order 2^{4d}.
(c) If q > 1/2, then for sufficiently large d, we have α+ < α−. Thus the success
    ratio for Modest Target Ranking may be raised arbitrarily close to 1 by
    choosing a large enough sample size m ∈ O(log n).

The same ideas apply to the OR-function with q substituted by 1 − q.


Sketch of proof: We have Pr(f = 1) = q^d and Pr(f = 0) = 1 − q^d, thus α− =
2q^{d+1}(1 − q)(1 − q^d) by Lemma 4 (a). Furthermore, αi^{(0,0)} = 1 − q, αi^{(1,1)} = q^d, and
αi^{(0,1)} = 0, yielding α+ = q^d(1 − q). Hence, α+ > α− ⇐⇒ q(1 − q^d) < 1/2, from
which (a) and (c) follow. Similarly, (b) can be shown by plugging in q = 1/2.
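The quantities in this sketch are easy to confirm by exact enumeration of {0,1}^n with IEA(q) weights (a verification sketch; all names below are ours):

```python
from itertools import product

def alphas(f, n, q):
    """Exact alpha_i under IEA(q): accumulate the joint probabilities
    Pr(x_i = x, f = y) over all assignments, then combine the four joints."""
    a = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(n)]   # a[i][x][y]
    for assign in product((0, 1), repeat=n):
        p = 1.0
        for bit in assign:
            p *= q if bit else 1.0 - q
        y = f(assign)
        for i, x in enumerate(assign):
            a[i][x][y] += p
    return [ai[0][0] * ai[1][1] + ai[1][0] * ai[0][1] for ai in a]

n, d, q = 5, 3, 0.5
def AND(x):
    return int(all(x[:d]))                             # base function: AND of x0..x2

al = alphas(AND, n, q)
alpha_plus, alpha_minus = al[0], al[n - 1]             # relevant vs. irrelevant
```

For q = 1/2 the computed gap alpha_plus − alpha_minus comes out as 2^(−2d−1), matching part (b) of Theorem 4.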

Theorem 5 (Monomials). Let {i1, . . . , id} ⊆ {1, . . . , n}, let lr ∈ {xir , ¬xir}
for each r ∈ {1, . . . , d}, and let f : {0,1}^n → {0,1} be defined by f(x1, . . . , xn) =
l1 ∧ . . . ∧ ld. Let T be a UDA-generated sample for f. Then Target Ranking is
successful with high probability provided that a sample of size m ∈ Ω(2^{4d} · log n)
is given.

Sketch of proof: The analysis is similar to the one in Theorem 4 for q = 1/2. In
particular, α+ − α− = 2^{−2d−1} > 0. Now the claim follows from Theorem 3.
Akutsu, Miyano, and Kuhara [3] showed a similar result for the Greedy algorithm.
Note that for monomials under the uniform input distribution, 2^d rows are
necessary in order to obtain (in expectation) at least one example with label 1.
(If there is no such example, then the sample can be explained by the constant
zero function.)
It is easy to see that if we are able to infer the relevant attributes of a function f,
we can do the same for its negation. In particular, the result for monomials translates
to clauses. The case of negating individual attribute values is more complex. At
least in the case of the uniform distribution, the inferability is not affected.
Theorem 6 (Threshold functions). Let {i1, . . . , id} ⊆ {1, . . . , n}, 1 ≤ t ≤ d,
and f : {0,1}^n → {0,1} be defined by f(x1, . . . , xn) = 1 iff Σ_{r=1}^{d} xir ≥ t. Let
T be a UDA-generated sample for f. Then Target Ranking is successful with
high probability, provided that m ∈ Ω( C(d, t)^{−4} · 2^{4d} · log n ).
Sketch of proof: A straightforward calculation yields α+ − α− = C(d−1, t−1)^2 · 2^{−2d−1}.
Now Theorem 3 yields the claim.
If t = d in the previous theorem, then f = AND, and we recover our result from
Theorem 4, part (b). Moreover, under uniformly distributed inputs, the gap
between α+ and α− for threshold functions is smallest for t ∈ {1, d}. The largest
such gap is reached for t = ⌈d/2⌉, the majority function. Since C(d−1, ⌈d/2⌉−1) ∈ Θ((1/√d) · 2^d),
we have α+ − α− ∈ Θ(d^{−1}). Applying Theorem 3, this proves the following

Corollary 2 (Majority function). Let f : {0,1}^n → {0,1} such that its base
function f˜ : {0,1}^d → {0,1} is the majority function. Then Target Ranking
is successful with high probability, provided that m ∈ Ω(d^2 · log n).
For symmetric Boolean functions, one cannot always guarantee α+ ≠ α−,
even for UDA-generated samples. A simple counter-example is the parity function
f(x1, . . . , xn) = (xi1 + . . . + xid) mod 2, for which αi = 1/8 for all i ∈
{1, . . . , n}, no matter whether xi ∈ V+ or xi ∈ V−. Thus the ranking strategies
do not work for the parity function. We provide an alternative solution for such
concepts in Sect. 8.
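This failure mode is easy to confirm by brute force under the uniform distribution (our check; it assumes d ≥ 2):

```python
from itertools import product

def alpha(f, n, i):
    """alpha_i under the uniform distribution, via the four class counts."""
    c = [[0, 0], [0, 0]]                               # c[x][y]
    for a in product((0, 1), repeat=n):
        c[a[i]][f(a)] += 1
    N = 2 ** n
    return (c[0][0] * c[1][1] + c[1][0] * c[0][1]) / N**2

n, d = 5, 3
def parity(x):
    return sum(x[:d]) % 2                              # parity of the relevant bits

values = [alpha(parity, n, i) for i in range(n)]       # identical for every variable
```

All n values equal 1/8, so ranking by |Si| cannot separate relevant from irrelevant variables here.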

7 Robust Inference

As real data usually contain noise, our ultimate goal is to handle cases in which
the attribute values underlie certain disturbances. More precisely, we assume
that in each input example, attribute xi is flipped with probability δi, i.e., an
algorithm obtains the complemented value x̄i(k) instead of the correct value xi(k) with probability δi.
We call the resulting set of disturbed examples a δ-disturbed sample, where
δ = (δ1 , . . . , δn ). Note that this assumption introduces a linear number of faults
(with respect to the number of attributes).
Fortunately, it can be shown that the ranking algorithms still perform well,
if they are given such disturbed samples. The key idea in this case is to examine
how much the sets Si computed by the ranking algorithms deviate from the Si ’s
intended by the real data. We denote the sets derived from the disturbed data
by Ŝi . Furthermore, for i ∈ {1, . . . , n}, let

Fi = {k ∈ {1, . . . , m} | the input table contains x̄i(k) instead of xi(k)} .

The following lemma is analogous to Lemma 2:

Lemma 5. Let i ∈ {1, . . . , n} with δi ≤ 1/30. Then, for ε such that 6δi ≤ ε ≤ 1/5,
it holds that

    Pr( | |Ŝi| − αi m^2 | ≥ ε m^2 ) ≤ 9 e^{−(1/12) ε^2 m}.

     
Sketch of proof: We use the inequality ||Ŝi| − αi m^2| ≤ ||Ŝi| − |Si|| + ||Si| − αi m^2|
and compute the probability that each of the summands on the right hand side is
bounded by (ε/2) m^2. Combinatorial investigations yield ||Ŝi| − |Si|| ≤ m|Fi| + (1/2)|Fi|^2.
In particular, if |Fi| ≤ (1/3) ε m, then ||Ŝi| − |Si|| ≤ (1/2) ε m^2. From standard Chernoff
bounds, it follows that Pr( |Fi| ≥ (ε/3) m ) ≤ e^{−(1/3)·(ε/6)·m} = e^{−(1/18) ε m}, since δi < ε/6
(|Fi| can be considered as a binomially distributed random variable with parameters
δi and m). Now

    Pr( | |Ŝi| − αi m^2 | ≥ ε m^2 )
        ≤ Pr( | |Ŝi| − |Si| | ≥ (ε/2) m^2  ∨  | |Si| − αi m^2 | ≥ (ε/2) m^2 )
        ≤ Pr( |Fi| ≥ (ε/3) m ) + 8 e^{−(1/3)(ε/2)^2 m}
        ≤ e^{−(1/18) ε m} + 8 e^{−(1/12) ε^2 m} ≤ 9 e^{−(1/12) ε^2 m},

where we make use of Lemma 2 and the fact that (1/18) ε m > (1/12) ε^2 m for ε ≤ 1/5.

Besides the general information theoretic problem that a sample may already
be explained by a proper subset of the relevant variables, just the opposite
phenomenon can occur due to disturbances: Ŝ may not be covered by Ŝ1 , . . . , Ŝn ,
so Greedy, Greedy Ranking, and Modest Ranking – as introduced in
Sect. 4 – do not terminate on the corresponding input samples. Therefore, when
faced with the disturbed situation, we modify the algorithms as follows: All edges
that do not belong to any of the computed Ŝi's are ignored, i.e., we compute
a new set Ŝnew = Ŝ \ {{k, l} ∈ Ŝ | ∀i ∈ {1, . . . , n} : {k, l} ∉ Ŝi }. The edges
removed are exactly those connecting two example nodes with identical attribute
values but different labels. All algorithms make use of this set Ŝnew instead of
Ŝ. However, this modification does not affect our analysis, so we continue by
writing Ŝ. In the noisy scenario, Lemma 1 has to be modified as follows:
Lemma 6. Let f be a concept depending on d variables, and T a δ-disturbed
sample for f such that Target Ranking is successful on (T, d). If Greedy
Ranking outputs at most d variables on input T , then it is correct. Otherwise,
the first d variables output by Greedy Ranking are the relevant ones.
We now state our main theorem for the case of disturbed samples:
Theorem 7. Let f : {0, 1}n → {0, 1} with relevant variables xi1 , . . . , xid , and
let δ = (δ1 , . . . , δn ) ∈ [0, 1]n . Let T be a δ-disturbed sample for f generated
according to a probability distribution p : {0, 1}n → [0, 1], and let c > 0.
(a) If min{αi | xi ∈ V+} > max{αj | xj ∈ V−} and δk ≤ (1/12) ε for all k ∈
    {1, . . . , n}, then with probability 1 − n^{−c}, Target Ranking is successful
    on input (T, d), provided that

        m ≥ 48 ε^{−2} ((c + 1) ln n + ln 9),

    where ε = min{ min{αi | xi ∈ V+} − max{αj | xj ∈ V−} , 2/5 }.


(b) If max{αi | xi ∈ V+} < min{αj | xj ∈ V−} and δk ≤ (1/12) ε for all k ∈
    {1, . . . , n}, then Modest Target Ranking is successful on input (T, d),
    provided that

        m ≥ 48 ε^{−2} ((c + 1) ln n + ln 9),

    where ε = min{ min{αj | xj ∈ V−} − max{αi | xi ∈ V+} , 2/5 }.

Proof. Extension of the analysis in the proof of Theorem 3. See [7].



We would like to stress that the algorithms have not been modified in any way in
order to overcome the disturbances. In particular, we do not have to assume that
the algorithms have any knowledge about the error probabilities δ1 , . . . , δn . Even
more, the sample size required for Target Ranking only has to be enlarged
by factor 4 in order to obtain the same success probability in case of a (small)
constant percentage of errors in the input sample.

8 Inferring Relevant Attributes of the Parity Function

Throughout this section we identify {0, 1} with the two-element field GF(2) and
denote by ⊕ the sum operation in this field. Furthermore, we define |ξ| = Σ_{i=1}^{n} ξi
for ξ ∈ {0,1}^n (here the sum is taken in Z). Let f : {0,1}^n → {0,1} be defined by
f (x1 , . . . , xn ) = xi1 ⊕ . . . ⊕ xid for some set of variable indices I = {i1 , . . . , id } ⊆
{1, . . . , n}. Since we have seen at the end of Sect. 5 that ranking the variables
according to their occurrences in the functional relations graph does not work for
the parity function, we present a different algorithm Parity Infer to find the
relevant variables. The idea is simply to compute a solution of a system of linear
equations associated with the input sample and then to infer from this solution
a set of variables that can explain the sample.

input (X; y) ∈ {0,1}^{m×(n+1)}

solve Xξ = y
if there is no solution
    then output ‘sample contains wrong data’
    else choose any solution ξ; output all xi's such that ξi = 1

Fig. 6. Algorithm Parity Infer
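Parity Infer amounts to Gaussian elimination over GF(2). A self-contained sketch operating on 0/1 lists (our implementation; the paper only specifies the algorithm at the level of Fig. 6):

```python
from itertools import product

def parity_infer(X, y):
    """Solve X xi = y over GF(2). Return the indices of the relevant
    variables (those with xi_i = 1), or None if the system is inconsistent."""
    m, n = len(X), len(X[0])
    rows = [list(X[k]) + [y[k]] for k in range(m)]     # augmented matrix
    pivots, r = [], 0
    for c in range(n):
        pivot = next((k for k in range(r, m) if rows[k][c]), None)
        if pivot is None:
            continue                                   # no pivot in this column
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for k in range(m):
            if k != r and rows[k][c]:
                rows[k] = [a ^ b for a, b in zip(rows[k], rows[r])]
        pivots.append((r, c))
        r += 1
    if any(not any(row[:n]) and row[n] for row in rows):
        return None                                    # a 0 = 1 row: wrong data
    xi = [0] * n                                       # free variables set to 0
    for rr, c in pivots:
        xi[c] = rows[rr][n]
    return [i for i in range(n) if xi[i]]

# Example: f = x0 xor x2 on four variables, sample = all 16 assignments.
X = [list(a) for a in product((0, 1), repeat=4)]
y = [a[0] ^ a[2] for a in X]
```

Here X has full rank, so the solution is unique and `parity_infer(X, y)` returns exactly [0, 2]; flipping a single label makes the solver report an inconsistency.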

Let us again differentiate between the two aspects: the optimization problem
INFRA(⊕) obtained by restricting the instances and the solutions of INFRA to
samples for concepts whose base functions are parity functions – such functions
can be uniquely described by the set V + of relevant variables – and on the
other hand finding exactly the relevant variables of a given but unknown parity
function provided that the sample size is large enough.
Let T = (X; y) ∈ {0, 1}m×(n+1) . There is a one-to-one correspondence be-
tween solutions V + of the INFRA(⊕) instance T and the solutions ξ ∈ {0, 1}n
for the system of linear equations Xξ = y given by ξi = 1 iff xi ∈ V + . The
task of finding an optimal solution for an INFRA(⊕) instance is equivalent to
finding a solution ξ of Xξ = y with minimum |ξ|. Since {xi | i ∈ I} is a solution
for T , the system has at least one solution. Moreover, if X has full rank (i.e.,
rank(X) = n), then there is a unique solution which is of course also an optimal
solution in this case.
There is a well-known correspondence between INFRA(⊕) and the Nearest
Codeword problem. A Nearest Codeword instance consists of a matrix
A ∈ {0,1}^{n×r} and a vector b ∈ {0,1}^n. A solution is a vector x ∈ {0,1}^r, and
the goal is to minimize the Hamming distance of Ax and b (i.e., |Ax ⊕ b|). The
obvious reduction is approximation factor preserving. Using a result of [6], this
implies
Theorem 8. For any ε > 0, INFRA(⊕) cannot be approximated within a factor
of 2^{log^{1−ε} m} unless NP ⊆ DTIME(n^{polylog(n)}).
Despite this negative result, INFRA(⊕) can be solved efficiently on the aver-
age. We show that under certain assumptions the variables detected by Parity
Infer are exactly the relevant ones with high probability.
Theorem 9. Let f : {0,1}^n → {0,1} such that its base function f˜ is a parity
function, and let T = (X; y) ∈ {0,1}^{m×n} × {0,1}^m be a UDA-generated sample
for f. If m ≥ n + k(2 ln k + 1) with k = c log n + 1 for some c > 0, then with
probability 1 − n^{−c}, Xξ = y has exactly one solution ξ.

Corollary 3. Under the conditions described in Theorem 9, using sample size
m = n + O(c log n log log n), Parity Infer is successful with probability 1 − n^{−c},
where c may be chosen arbitrarily large.

9 Conclusions and Further Research

For inferring relevant Boolean valued attributes we have presented ranking algo-
rithms, which are modifications of greedy algorithms proposed earlier. We have
extended a negative approximability result to the restriction of only Boolean
values and have improved a lower bound by using Feige’s result. General cri-
teria for the successfulness of our algorithms have been established in terms of
some statistical values (depending on the concept considered and the probabil-
ity distribution). These results have been applied to a series of typical input
distributions and specific functions.
In case of monotone functions, a straightforward modification of our strategy
restricts edge set Si to those edges {k, l} with xi (k) = y(k) = 0 and xi (l) =
y(l) = 1. This halves the values αj for xj ∈ V − and thus may satisfy (or
improve) the conditions of the main theorems for certain monotone functions.
Next, we have investigated the case of noisy attribute values. We have shown
that our algorithms still succeed with high probability, if their input contains a
(small) constant fraction of wrong values. This desirable robustness property is
achieved without requiring any specific knowledge about the likelihood of errors.
One direction of future research could be to extend these results to more
complex Boolean functions such as DNF formulas with a constant number of
monomials. Furthermore, the case of robustly inferring relevant attributes of
parity functions remains open. Another generalization would be the case that
attributes and/or labels may take values from sets with more than two elements.
Given an input instance, Greedy Ranking always outputs a proper solu-
tion that is capable of explaining the sample. However, if the input has some
disturbances, Greedy Ranking might indeed stop only after having chosen
significantly more than the real number of relevant attributes. In such situa-
tions, one might be interested in algorithms that – rather than computing an
exact solution for the given input data – output a simple solution fitting to an
input instance that is in some sense ‘near’ to the input instance. Following Oc-
cam’s razor such a simple solution may be much more likely to explain the real
phenomenon. Currently, we are working on a general framework for this setting.

References
1. R. Agrawal, T. Imielinski, and A. Swami, Mining Association Rules between Sets
of Items in Large Databases. Proc. 1993 ACM SIGMOD Conf., 207–216.
2. T. Akutsu, F. Bao, Approximating Minimum Keys and Optimal Substructure
Screens. Proc. 2nd COCOON, Springer LNCS 1090 (1996), 290–299.
3. T. Akutsu, S. Miyano, and S. Kuhara, A Simple Greedy Algorithm for Finding
Functional Relations: Efficient Implementation and Average Case Analysis. TCS
292(2) (2003), 481–495. (see also Proc.3rd DS, Springer LNAI 1967 (2000), 86–98.)
4. D. Angluin, Queries and Concept Learning. Machine Learning 2(4) (1988), 319–
342, Kluwer Academic Publishers, Boston.
5. D. Angluin and P. Laird, Learning from noisy examples. Machine Learning 2(4)
(1988), 343–370, Kluwer Academic Publishers, Boston.
6. S. Arora, L. Babai, J. Stern, and Z. Sweedyk, The Hardness of Approximate Op-
tima in Lattices, Codes, and Systems of Linear Equations, J. CSS 54 (1997), 317–
331.
7. J. Arpe, R. Reischuk, Robust Inference of Relevant Attributes. Techn. Report,
SIIM-TR-A 03-12, Univ. Lübeck, 2003, available at
[Link]
8. A. Blum, L. Hellerstein, and N. Littlestone, Learning in the Presence of Finitely
or Infinitely Many Irrelevant Attributes. Proc. 4th COLT ’91, 157–166.
9. A. Blum, P. Langley, Selection of Relevant Features and Examples in Machine
Learning. Artificial Intelligence 97(1–2), 245–271 (1997).
10. U. Feige, A Threshold of ln n for Approximating Set Cover. J. ACM 45 (1998),
634–652.
11. S. Goldman, H. Sloan, Can PAC Learning Algorithms Tolerate Random Attribute
Noise? Algorithmica 14 (1995), 70–84.
12. D. Johnson, Approximation Algorithms for Combinatorial Problems. J. CSS 9
(1974), 256–278.
13. N. Littlestone, Learning Quickly When Irrelevant Attributes Abound: A New
Linear-threshold Algorithm. Machine Learning 4(2) (1988), 285–318, Kluwer Aca-
demic Publishers, Boston.
14. N. Littlestone, From On-line to Batch Learning. Proc. 2nd COLT 1989, 269–284.
15. H. Mannila, K. Räihä, On the Complexity of Inferring Functional Dependencies.
Discrete Applied Mathematics 40 (1992), 237–243.
16. E. Mossel, R. O’Donnell, R. Servedio, Learning Juntas. Proc. STOC ’03, 206–212.
17. L. Valiant, Projection Learning. Machine Learning 37(2) (1999), 115–130, Kluwer
Academic Publishers, Boston.
Efficient Learning of Ordered and Unordered
Tree Patterns with Contractible Variables

Yusuke Suzuki¹, Takayoshi Shoudai¹, Satoshi Matsumoto², Tomoyuki Uchida³,
and Tetsuhiro Miyahara³

¹ Department of Informatics, Kyushu University, Kasuga 816-8580, Japan
  {y-suzuki,shoudai}@[Link]
² Department of Mathematical Sciences, Tokai University, Hiratsuka 259-1292, Japan
  matumoto@[Link]
³ Faculty of Information Sciences, Hiroshima City University, Hiroshima 731-3194, Japan
  {uchida@cs,miyahara@its}.[Link]

Abstract. Due to the rapid growth of tree structured data such as Web
documents, efficient learning from tree structured data becomes more
and more important. In order to represent structural features common
to such tree structured data, we propose a term tree, which is a rooted
tree pattern consisting of tree structures and labeled variables. A vari-
able is a labeled hyperedge, which can be replaced with any tree. A
contractible variable is an erasing variable which is adjacent to a leaf. A
contractible variable may be replaced with a singleton vertex. A usual
variable, called an uncontractible variable, is replaced with a tree of size
at least 2. In this paper, we deal with ordered and unordered term trees
with contractible and uncontractible variables such that all variables
have mutually distinct variable labels. First we give a polynomial time
algorithm for deciding whether or not a given term tree matches a given
tree. Let Λ be a set of edge labels. Second, when Λ has more than one
edge label, we give a polynomial time algorithm for finding a minimally
generalized ordered term tree which explains all given tree data. Lastly,
when Λ has infinitely many edge labels, we give a polynomial time al-
gorithm for finding a minimally generalized unordered term tree which
explains all given tree data. These results imply that the classes of or-
dered and unordered term trees are polynomial time inductively inferable
from positive data.

1 Introduction
Due to the rapid growth of semistructured data such as Web documents, Information Extraction from semistructured data becomes more and more important. Web documents such as HTML/XML files have no rigid structure and are called semistructured data. According to Object Exchange Model [1], we treat semistructured data as tree structured data. Tree structured data such as HTML/XML files are represented by rooted trees with edge labels. In order to represent a tree structured pattern common to such tree structured data, we

R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 114–128, 2003.

© Springer-Verlag Berlin Heidelberg 2003

Fig. 1. Ordered term trees t1, t2 and t3 and ordered trees T1, T2 and T3. An uncontractible (resp. contractible) variable is represented by a single (resp. double) lined box with lines to its elements. The label inside a box is the variable label of the variable.

proposed an ordered term tree and unordered term tree, which are rooted trees
with structured variables [12,13].
Many semistructured data have irregularities such as missing or erroneous data. In Object Exchange Model, the data attached to leaves are essential information, and such data are represented as subtrees. On the other hand, in analyzing tree structured data, knowledge (or patterns) sensitive to slight differences among such data is often meaningless. For example, patterns extracted from HTML/XML files are affected by attributes of tags, which can be regarded as noise. Therefore we introduce a new kind of variable, called a contractible variable, that is an erasing variable which is adjacent to a leaf. A contractible variable can be replaced with any tree, including a singleton vertex. A usual variable, called an uncontractible variable, is replaced with a tree which consists of at least 2 vertices. A term tree with only uncontractible variables is very sensitive to noise. By introducing contractible variables, we can find term trees robust against such noise. Shinohara [11] started to study the learnabilities of extended pattern languages of strings with erasing variables. Since this pioneering work, researchers in the field of computational learning theory have been interested in classes of string or tree pattern languages with erasing variables which are polynomial time learnable. Recently Uemura et al. [16] showed that classes of unions of erasing regular pattern languages are polynomial time learnable from positive data. In this paper, we study the learnabilities of classes of tree structured patterns with restricted erasing variables, called contractible variables.
A term tree t is said to be regular if all variable labels in t are mutually
distinct. The term tree language of an ordered term tree t is the set of all ordered
trees which are obtained from t by substituting ordered trees for variables in t.
The language shows the representing power of an ordered term tree t. We say that a regular ordered term tree t explains given tree structured data S if the term tree language of t contains all trees in S. A minimally generalized regular ordered term tree explaining S is a regular ordered term tree t such that t explains S and the language of t is minimal among all term tree languages which contain all trees in S. For example, the term tree t3 in Fig. 1 is a minimally generalized regular ordered term tree explaining T1, T2 and T3. And t2 is also a minimally generalized regular ordered term tree with no contractible variable explaining T1, T2 and T3. On the other hand, t1 is overgeneralized and meaningless, since t1 explains any tree of size at least 2. An ordered term tree using contractible and uncontractible variables, rather than one using only uncontractible variables, can express the structural features of ordered trees more faithfully. For this reason, we consider that in Fig. 1, t3 is a more precise term tree than t2. In a similar way, we define the term tree language of an unordered term tree and a minimally generalized regular unordered term tree explaining given tree structured data S.
Let Λ be a set of edge labels used in tree structured data. We denote by OTT^c_Λ (resp. UTT^c_Λ) the set of all regular ordered (resp. unordered) term trees with contractible and uncontractible variables. For a set S, the number of elements in S is denoted by |S|. First we give a polynomial time algorithm for deciding whether or not a given regular ordered (resp. unordered) term tree explains an ordered (resp. unordered) tree, where |Λ| ≥ 1. Second, when |Λ| ≥ 2, we give a polynomial time algorithm for finding a minimally generalized regular ordered term tree in OTT^c_Λ which explains all given data. Lastly, when |Λ| is infinite, we give a polynomial time algorithm for finding a minimally generalized regular unordered term tree in UTT^c_Λ which explains all given data. These results imply that both OTT^c_Λ where |Λ| ≥ 2 and UTT^c_Λ where |Λ| is infinite are polynomial time inductively inferable from positive data.
A term tree is different from other representations of tree structured patterns
such as in [2,3,5] in that a term tree has structured variables which can be sub-
stituted by arbitrary trees. As related works, we proved the learnability of some
classes of term tree languages with no contractible variable. In [13,14], we showed
that some fundamental classes of regular ordered term tree languages are poly-
nomial time inductively inferable from positive data. And in [7,9,12], we showed
that the class of regular unordered term tree languages with infinitely many edge
labels is polynomial time inductively inferable from positive data. Moreover, we
showed in [8] that some classes of regular ordered term tree languages are ex-
actly learnable in polynomial time using queries. In [15], we showed that the
class of regular ordered term trees with contractible variables and no edge label is polynomial time inductively inferable from positive data. Asai et al. [6] studied a data mining problem for semistructured data by modeling semistructured
data as labeled ordered trees and presented an efficient algorithm for finding all
frequent ordered tree patterns from semistructured data. In [10], we gave a data
mining method from semistructured data using ordered term trees.
2 Ordered and Unordered Term Trees


Definition 1 (Ordered term trees and unordered term trees). Let T =
(VT , ET ) be a rooted tree with ordered children or unordered children, which
has a set VT of vertices and a set ET of edges. We call a rooted tree with ordered
(resp. unordered) children an ordered tree (resp. an unordered tree). Let Eg and
Hg be a partition of ET , i.e., Eg ∪ Hg = ET and Eg ∩ Hg = ∅. And let Vg = VT .
A triplet g = (Vg , Eg , Hg ) is called an ordered term tree if T is an ordered
tree, and called an unordered term tree if T is an unordered tree. We call an
element in Vg , Eg and Hg a vertex, an edge and a variable, respectively. Below
we say a term tree if we do not have to distinguish between ordered term trees
and unordered term trees.
We assume that every edge and variable of a term tree is labeled with some words from specified languages. A label of a variable is called a variable label. Λ and X denote a set of edge labels and a set of variable labels, respectively, where Λ ∩ X = ∅. For a term tree g and its vertices v1 and vi, a path from v1 to vi is a sequence v1, v2, ..., vi of distinct vertices of g such that for any j with 1 ≤ j < i, there exists an edge or a variable which consists of vj and vj+1. If there is an edge or a variable which consists of v and v′ such that v lies on the path from the root to v′, then v is said to be the parent of v′ and v′ is a child of v. We use a notation [v, v′] to represent a variable {v, v′} ∈ Hg such that v is the parent of v′. Then we call v the parent port of [v, v′] and v′ the child port of [v, v′].

Definition 2 (Regular term tree). A term tree g is regular if all variables in Hg have mutually distinct variable labels in X. In this paper, we discuss regular term trees only. Thus we assume that all term trees in this paper are regular.

Definition 3 (Contractible variables). Let X^c be a distinguished subset of X. We call variable labels in X^c contractible variable labels. A contractible variable label can be attached only to a variable whose child port is a leaf. We call a variable with a contractible variable label a contractible variable; a contractible variable may be replaced with a tree consisting of a singleton vertex. We state the formal definitions later. We call a variable which is not a contractible variable an uncontractible variable. In order to distinguish a contractible variable from an uncontractible variable, we denote by [v, v′]^c (resp. [v, v′]) a contractible variable (resp. an uncontractible variable).
For an ordered term tree g, all children of every internal vertex u in g have a total ordering, denoted by <^g_u. Let f = (Vf, Ef, Hf) and g = (Vg, Eg, Hg) be two ordered (or unordered) term trees. We say that f and g are isomorphic, denoted by f ≡ g, if there is a bijection ϕ from Vf to Vg such that (i) the root of f is mapped to the root of g by ϕ, (ii) {u, v} ∈ Ef if and only if {ϕ(u), ϕ(v)} ∈ Eg and the two edges have the same edge label, (iii) [u, v] ∈ Hf if and only if [ϕ(u), ϕ(v)] ∈ Hg, in particular, [u, v]^c ∈ Hf if and only if [ϕ(u), ϕ(v)]^c ∈ Hg, and (iv) in addition, if f and g are ordered term trees, for any internal vertex u in f which has more than one child, and for any two children u′ and u″ of u, u′ <^f_u u″ if and only if ϕ(u′) <^g_ϕ(u) ϕ(u″).
For unordered term trees, we introduce a new definition of bindings which
is different from the original definition we gave in [9,12]. The reason why we
introduce the new one is explained after we define substitutions of term trees.
Definition 4 (Bindings of term trees). Let g be a term tree with at least two vertices and x a variable label in X or X^c. Let σ = [u, u′] be a list of two vertices in g where u is the root of g and u′ is a leaf of g. The form x := [g, σ] is called a binding for x. If x is a contractible variable label in X^c, g may be a tree with a singleton vertex u, and thus σ = [u, u]. This is the only case in which a tree with a singleton vertex is allowed in a binding.

Original definition of bindings of unordered term trees [9,12]: Let g be an unordered term tree with at least two vertices. Let σ = [u, u′] be a list of two vertices in g where u is the root of g and u′ is a vertex of g (u ≠ u′). The form x := [g, σ] is called a binding for x.
Definition 5 (Substitutions of term trees). Let f and g be two ordered (resp. unordered) term trees. A new ordered (resp. unordered) term tree f{x := [g, σ]} is obtained by applying the binding x := [g, σ] to f in the following way. Let e = [v, v′] be a variable in f with the variable label x. Let g′ be one copy of g and w, w′ the vertices of g′ corresponding to u, u′ of g, respectively. For the variable e = [v, v′], we attach g′ to f by removing the variable e from Hf and by identifying the vertices v, v′ with the vertices w, w′ of g′, respectively. If g is a tree with a singleton vertex, i.e., u = u′, then v becomes identical to v′ after the binding. A substitution θ is a finite collection of bindings {x1 := [g1, σ1], ..., xn := [gn, σn]}, where the xi are mutually distinct variable labels in X and the gi are term trees. The term tree fθ, called the instance of f by θ, is obtained by applying all the bindings xi := [gi, σi] to f simultaneously. We define the root of the resulting term tree fθ as the root of f.
A variable with a binding of the original definition is equivalent to a pair consisting of one uncontractible variable h with a binding of the new definition and one contractible variable whose parent port is the child port of h. However, a contractible variable cannot be expressed by any variable with bindings of the original definition. Therefore, by introducing contractible variables and the new definition of bindings, we can express richer unordered tree structured patterns. Further, we have to give a new total ordering <^fθ_v on the children of every vertex v of fθ. These orderings are defined in a natural way.
Definition 6 (Child orderings on an instance of an ordered term tree). Let f be an ordered term tree and θ a substitution. Suppose that a vertex v of fθ has more than one child, and let u′ and u″ be two children of v in fθ. If v is the parent port of variables [v, v1], ..., [v, vk] of f with v1 <^f_v ... <^f_v vk, we have the following four cases. Let gi be the term tree which is substituted for [v, vi] for i = 1, ..., k. Case 1: If u′, u″ ∈ Vf and u′ <^f_v u″, then u′ <^fθ_v u″. Case 2: If u′, u″ ∈ V_gi and u′ <^gi_v u″ for some i, then u′ <^fθ_v u″. Case 3: If u′ ∈ V_gi, u″ ∈ Vf, and vi <^f_v u″ (resp. u″ <^f_v vi), then u′ <^fθ_v u″ (resp. u″ <^fθ_v u′). Case 4: If u′ ∈ V_gi, u″ ∈ V_gj (i ≠ j), and vi <^f_v vj, then u′ <^fθ_v u″. If v is not a parent port of any variable, then u′, u″ ∈ Vf, and therefore u′ <^fθ_v u″ if u′ <^f_v u″.
For example, let t3 be a term tree in Fig. 1 and θ = {x := [g1 , [u1 , v1 ]], y :=
[g2 , [u2 , v2 ]], z := [g3 , [u3 , u3 ]]} be a substitution, where g1 , g2 and g3 are trees in
Fig. 1. Then the instance t3 θ of the term tree t3 by θ is the tree T3 .
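As an informal illustration (our own sketch, not part of the paper's formalism), the binding mechanism of Definitions 4 and 5 can be written in code. The dictionary representation, the kind tags ("edge", "var", "cvar") and the function apply_binding are assumptions made for this example; for brevity, spliced children are appended rather than inserted at the sibling position required by Definition 6.

```python
def apply_binding(f, x, g, root_g, leaf_g):
    """Apply the binding x := [g, (root_g, leaf_g)] to the term tree f.

    f and g map each vertex to a list of (child, kind, label) entries,
    where kind is "edge", "var" (uncontractible) or "cvar" (contractible).
    """
    # Locate the variable [v, v2] of f labeled x.
    v, v2 = next((p, c) for p, cs in f.items()
                 for (c, kind, lab) in cs
                 if kind in ("var", "cvar") and lab == x)
    # Remove the variable; keep every other adjacency of f.
    out = {p: [e for e in cs if not (e[0] == v2 and e[2] == x)]
           for p, cs in f.items()}
    if root_g == leaf_g:               # singleton tree: legal only for "cvar";
        out.pop(v2, None)              # the child port v2 merges into v.
        return out
    # Identify g's root with v and g's chosen leaf with v2; rename the rest.
    ren = {w: (v if w == root_g else v2 if w == leaf_g else "g:" + w)
           for w in g}
    for p, cs in g.items():
        out.setdefault(ren[p], [])
        out[ren[p]].extend((ren[c], kind, lab) for (c, kind, lab) in cs)
    return out
```

For instance, substituting a two-edge chain for an uncontractible variable hangs the chain between the parent port and the child port, while a singleton binding for a contractible variable simply erases the variable and its leaf child port.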
Without loss of generality, we assume that for any ordered term tree, there is no pair of contractible variables [v, v′]^c and [v, v″]^c such that v″ is the immediately right sibling of v′. For a similar reason, we also assume that for any unordered term tree, any vertex v which is not a leaf has at most one contractible variable whose parent port is v. An ordered term tree with no variable is called a ground ordered term tree, which is an ordered tree. OT_Λ denotes the set of all ground ordered term trees whose edge labels are in Λ. Similarly we define a ground unordered term tree. UT_Λ denotes the set of all ground unordered term trees whose edge labels are in Λ. OTT^c_Λ denotes the set of all ordered term trees with contractible and uncontractible variables whose edge labels are in Λ. UTT^c_Λ denotes the set of all unordered term trees with contractible and uncontractible variables whose edge labels are in Λ.
Definition 7 (Term tree languages). Let Λ be a set of edge labels. For an ordered term tree t ∈ OTT^c_Λ, the ordered term tree language L^O_Λ(t) of t is defined as {s ∈ OT_Λ | s ≡ tθ for a substitution θ}. For an unordered term tree t ∈ UTT^c_Λ, the unordered term tree language L^U_Λ(t) of t is defined as {s ∈ UT_Λ | s ≡ tθ for a substitution θ}.

A minimally generalized ordered term tree explaining a given set of ordered trees S ⊆ OT_Λ is an ordered term tree t such that S ⊆ L^O_Λ(t) and there is no term tree t′ satisfying that S ⊆ L^O_Λ(t′) ⊊ L^O_Λ(t). Similarly, we define a minimally generalized unordered term tree explaining a given set of unordered trees S ⊆ UT_Λ. We give polynomial time algorithms for the following problems for (TT, T) ∈ {(OTT^c_Λ, OT_Λ), (UTT^c_Λ, UT_Λ)}.

Membership Problem for TT.
Instance: A term tree t ∈ TT and a tree T ∈ T.
Question: Is there a substitution θ such that T ≡ tθ?

Minimal Language Problem (MINL) for TT.
Instance: A nonempty set of trees S ⊆ T.
Question: Find a minimally generalized term tree t ∈ TT explaining S.

For a class C, Angluin [4] and Shinohara [11] showed that if C has finite thickness, and the membership problem and the MINL problem for C are solvable in polynomial time, then C is polynomial time inductively inferable from positive data. Let Λ be a finite or infinite alphabet of edge labels. In this paper, we consider the classes OTT^c_Λ and UTT^c_Λ as targets of inductive inference.
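The Angluin/Shinohara schema can be illustrated with a deliberately tiny class (our own toy example, not from the paper): patterns are integer intervals, membership is a comparison, and MINL returns the tightest interval containing the examples. The inference machine simply conjectures a minimally generalized pattern after each positive example.

```python
def minl(examples):
    """Minimally generalized pattern for the toy class: the tightest interval."""
    return (min(examples), max(examples))

def member(pattern, x):
    """Polynomial-time membership test for the toy class."""
    lo, hi = pattern
    return lo <= x <= hi

def infer(stream):
    """Conjecture a minimal language after each positive example."""
    seen = []
    for x in stream:
        seen.append(x)
        yield minl(seen)
```

Because the class has finite thickness and both subproblems are cheap, the sequence of conjectures converges to a correct hypothesis on any presentation of positive data; the paper establishes the same three properties for term tree classes.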
Procedure Ordered-C-Set-Attaching(v, Rule(t));
input: v: a vertex of T, Rule(t): the C-set-attaching rule of t;
begin
  CS(v) := ∅; Let c1, ..., cm be all ordered children of v in T;
  foreach I(u′) ⇐ J(c′1), ..., J(c′m′) in Rule(t) do
    if there is a sequence 0 = j0 ≤ j1 ≤ ... ≤ ji ≤ ... ≤ jm′−1 ≤ jm′ = m such that
      1. if J(c′i) = I(c′i) then ji − ji−1 = 1 and I(c′i) ∈ CS(cji),
      2. if J(c′i) = (I(c′i)) then CS(cki) has I(c′i) or (I(c′i)) for some ki (ji−1 < ki ≤ ji)
    for all i = 1, ..., m′ // we have no condition on ji when J(c′i) = I(∅).
    then CS(v) := CS(v) ∪ {I(u′)};
  foreach (I(u′)) ⇐ (I(u′)) in Rule(t) do
    if there is a set in CS(c1), ..., CS(cm) which has I(u′) or (I(u′)) then
      CS(v) := CS(v) ∪ {(I(u′))};
  foreach I(u′) ⇐ I(u′) in Rule(t) do CS(v) := CS(v) ∪ {I(u′)}
end;

Fig. 2. Procedure Ordered-C-Set-Attaching for |Λ| = 1. We can easily extend this procedure to the case of |Λ| ≥ 2 by checking edge labels of a term tree and a tree in applying C-set-attaching rules.

It is easy to see that the classes OTT^c_Λ and UTT^c_Λ have finite thickness. In Section 3, we give polynomial time algorithms for Membership Problems for OTT^c_Λ and UTT^c_Λ. And in Section 4, we give polynomial time algorithms for Minimal Language Problems for OTT^c_Λ (|Λ| ≥ 2) and UTT^c_Λ (|Λ| = ∞). Therefore, we show the following main result.

Theorem 1. The classes OTT^c_Λ (|Λ| ≥ 2) and UTT^c_Λ (|Λ| = ∞) are polynomial time inductively inferable from positive data.

3 An Efficient Matching Algorithm for Term Trees

In this section, we give polynomial time algorithms for Membership Problem for OTT^c_Λ and UTT^c_Λ by extending the matching algorithm in [9,13]. Let t = (Vt, Et, Ht) be a term tree and T a tree. We assume that all vertices of a term tree t are associated with mutually distinct numbers, called vertex identifiers. We denote by I(u′) the vertex identifier of u′ ∈ Vt. A correspondence set, C-set for short, is a set of vertex identifiers which are with or without parentheses. A vertex identifier with parentheses shows that the vertex is a child port of a variable.

We employ the dynamic programming method. Our matching algorithms proceed by constructing C-sets for each vertex of a given tree T in the bottom-up manner, that is, from the leaves to the root of T. At first, we construct the C-set-attaching rule of a vertex u′ of t as follows. Let c′1, ..., c′m′ be all ordered (or unordered) children of u′. The C-set-attaching rule of u′ is of the form I(u′) ⇐ J(c′1), ..., J(c′m′), where J(c′i) = I(c′i) if {u′, c′i} ∈ Et, J(c′i) = I(∅) if [u′, c′i]^c ∈ Ht, and J(c′i) = (I(c′i)) otherwise. I(∅) is a special symbol which shows that c′i
Procedure Unordered-C-Set-Attaching(v, Rule(t));
input: v: a vertex of T, Rule(t): the C-set-attaching rule of t;
begin
  CS(v) := ∅; Let c1, ..., cm be all unordered children of v in T;
  foreach I(u′) ⇐ J(c′1), ..., J(c′m′) in Rule(t) do begin
    E := {{I(c′i), CS(cj)} | I(c′i) ∈ CS(cj)} ∪
         {{I(c′i), CS(cj)} | (I(c′i)) ∈ CS(cj) and J(c′i) = (I(c′i))} (1 ≤ i ≤ m′, 1 ≤ j ≤ m);
    Let G be a bipartite graph ({I(c′1), ..., I(c′m′)}, {CS(c1), ..., CS(cm)}, E);
    if J(c′i) = I(c′i) for all i = 1, ..., m′ then begin
      if there is a perfect matching for G then CS(v) := CS(v) ∪ {I(u′)}
    end else if there is an index i (1 ≤ i ≤ m′) such that J(c′i) = I(∅) then begin
      if there is a matching of size m′ − 1 for G then CS(v) := CS(v) ∪ {I(u′)}
    end else
      if there is a matching of size m′ for G then CS(v) := CS(v) ∪ {I(u′)}
  end;
  foreach (I(u′)) ⇐ (I(u′)) in Rule(t) do
    if there is a set in CS(c1), ..., CS(cm) which has I(u′) or (I(u′)) then
      CS(v) := CS(v) ∪ {(I(u′))};
  foreach I(u′) ⇐ I(u′) in Rule(t) do CS(v) := CS(v) ∪ {I(u′)}
end;

Fig. 3. Procedure Unordered-C-Set-Attaching for |Λ| = 1. We can easily extend this procedure to the case of |Λ| ≥ 2.

is the child port of a contractible variable. The C-set-attaching rule of t, denoted by Rule(t), is defined as follows.

Rule(t) = {I(u′) ⇐ J(c′1), ..., J(c′m′) | the C-set-attaching rule of an inner vertex u′}
∪ {(I(u′)) ⇐ (I(u′)) | u′ is the child port of an uncontractible variable}
∪ {I(u′) ⇐ I(u′) | u′ has just one child and connects to the child with a contractible variable}.

Initially we attach

CS = {I(ℓ) | ℓ is a leaf of t that is not a child port of a contractible variable, or ℓ has just one child and connects to it with a contractible variable}

to all leaves of T. By using Ordered-C-Set-Attaching (Fig. 2) for Membership Problem for OTT^c_Λ and Unordered-C-Set-Attaching (Fig. 3) for Membership Problem for UTT^c_Λ, we repeatedly attach a C-set to each vertex of a given tree T in the bottom-up manner, that is, from the leaves to the root of T. When we cannot apply the procedure to any vertex any more, if the C-set of the root of T has the vertex identifier of the root of t, then we conclude that t matches T.
Theorem 2. For each TT ∈ {OTT^c_Λ, UTT^c_Λ}, Membership Problem for TT is solvable in polynomial time.
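The matching tests inside Unordered-C-Set-Attaching reduce to bipartite matching between the identifiers of a rule's children and the C-sets of the tree vertex's children. A standard augmenting-path matcher suffices; the sketch below is our own (left vertices are indices 0..len(adj)−1, adj[i] lists the compatible right indices) and is not tuned for the complexity bounds of the paper.

```python
def max_matching(adj, n_right):
    """Maximum bipartite matching by augmenting paths (Kuhn's algorithm).

    adj[i] lists the right vertices compatible with left vertex i.
    Returns the size of a maximum matching.
    """
    match_r = [-1] * n_right          # right vertex -> matched left vertex

    def augment(i, seen):
        # Try to match left vertex i, possibly re-routing earlier matches.
        for j in adj[i]:
            if j not in seen:
                seen.add(j)
                if match_r[j] == -1 or augment(match_r[j], seen):
                    match_r[j] = i
                    return True
        return False

    return sum(augment(i, set()) for i in range(len(adj)))
```

In the procedure's terms: a rule whose children are all connected by edges requires a perfect matching, while a rule with one I(∅) child (the port of a contractible variable) is satisfied already by a matching of size m′ − 1.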
Fig. 4. For i = 1, 2, 3, g_i ≢ t_i, g_i ⪯ t_i, t_i ⋠ g_i, and L^O_Λ(g_i) = L^O_Λ(t_i). The digit k (resp. ≥ k) in a box near u shows that the number of children of u is equal to k (resp. is more than or equal to k). An arrow shows that its right vertex is the immediately right child of its left vertex.

4 An Algorithm for Finding a Minimally Generalized Term Tree

Let g and t be ordered (or unordered) term trees. We write g ⪯ t if there exists a substitution θ such that g ≡ tθ. For ordered (resp. unordered) term trees g = (V, E, H) and g′ = (V′, E′, H′), we say that g′ is an ordered (resp. unordered) term subtree of g if V′ ⊆ V, E′ ⊆ E, and H′ ⊆ H. For an ordered (resp. unordered) term tree t, an occurrence of t in g is an ordered (resp. unordered) term subtree of g which is isomorphic to t.

For any ordered (resp. unordered) term tree g, we denote by s(g) the ordered (resp. unordered) term tree obtained from g by replacing all edges of g with uncontractible variables and all contractible variables of g with singleton vertices. For any two ordered (or unordered) term trees g and t, we write g ≈ t if s(g) is isomorphic to s(t).
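The map s(·) is straightforward to compute under a simple encoding of our own (not the paper's): a term tree maps each vertex to a list of (child, kind, label) entries with kind in "edge", "var", "cvar". Variable labels are dropped in the skeleton, a simplification that suffices for comparing the shapes of regular term trees.

```python
def s(g):
    """Skeleton of a term tree g: edges become uncontractible variables,
    contractible variables (and their leaf child ports) disappear."""
    out = {}
    dropped = set()
    for p, cs in g.items():
        kept = []
        for (c, kind, lab) in cs:
            if kind == "cvar":
                dropped.add(c)               # child port of a contractible variable
            else:
                kept.append((c, "var", None))  # forget labels, keep the shape
        out[p] = kept
    for c in dropped:                        # erase the now-detached leaves
        out.pop(c, None)
    return out
```

With this encoding, g ≈ t can be tested by comparing s(g) and s(t) up to tree isomorphism (here, dict equality when the vertex names coincide).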

4.1 An Algorithm for Ordered Term Trees

Lemma 1. Let g_i and t_i (1 ≤ i ≤ 3) be the ordered term trees in OTT^c_Λ described in Fig. 4. Let t be an ordered term tree in OTT^c_Λ which has at least one occurrence of t_i (1 ≤ i ≤ 3). For one of the occurrences of t_i, we make a new ordered term tree g by replacing the occurrence of t_i with g_i. Then L^O_Λ(g) = L^O_Λ(t).

Definition 8. Let g be an ordered term tree in OTT^c_Λ for |Λ| ≥ 2. The ordered term tree g is said to be a canonical ordered term tree if g has no occurrence of t_i (1 ≤ i ≤ 3) (Fig. 4).

Any ordered term tree g is transformed into a canonical term tree by replacing all occurrences of t_i with g_i (1 ≤ i ≤ 3) repeatedly. We denote by c(g) the canonical ordered term tree transformed from g. We note that L^O_Λ(c(g)) = L^O_Λ(g).

Lemma 2. Let g and t be two ordered term trees in OTT^c_Λ (|Λ| ≥ 2). If g ≈ t and L^O_Λ(g) ⊆ L^O_Λ(t) then c(g) ⪯ c(t).

Proof. Let c(g) = (V_c(g), E_c(g), H_c(g)) and c(t) = (V_c(t), E_c(t), H_c(t)). Let V′_c(g) = {v ∈ V_c(g) | v is not a child port of a contractible variable} and V′_c(t) = {v ∈ V_c(t) | v is not a child port of a contractible variable}. For a vertex v which is not the root of a term tree, we denote by p(v) the parent of v. For any v ∈ V′_c(g), either {p(v), v} ∈ E_c(g), [p(v), v] ∈ H_c(g), or [p(v), v]^c ∈ H_c(g). We note that v ∈ V_c(g) − V′_c(g) if and only if [p(v), v]^c ∈ H_c(g). Since g ≈ t, we have c(g) ≈ c(t). Therefore there is a bijection ξ from V′_c(g) to V′_c(t) such that {p(v), v} ∈ E_c(g) or [p(v), v] ∈ H_c(g) if and only if {ξ(p(v)), ξ(v)} ∈ E_c(t) or [ξ(p(v)), ξ(v)] ∈ H_c(t). Since Λ contains at least two edge labels, we can easily show the following claim.

Claim 1. If [p(v), v] ∈ H_c(g) then [ξ(p(v)), ξ(v)] ∈ H_c(t).

In the following three claims, we assume that [p(v), v]^c ∈ H_c(g).

Claim 2. Suppose that p(v) has at least two children and v is the leftmost (resp. rightmost) child of p(v). Let w be the immediately right (resp. left) sibling of v. Then one of the following two statements holds: (1) there exists the leftmost (resp. rightmost) child v′ of ξ(p(v)) such that [ξ(p(v)), v′]^c ∈ H_c(t), or (2) [ξ(p(v)), ξ(w)] ∈ H_c(t).

Proof of Claim 2. We note that {p(v), w} is an edge in E_c(g) since c(g) is the canonical ordered term tree of g. We assume that {ξ(p(v)), ξ(w)} is an edge in E_c(t). Let α be the edge label of {ξ(p(v)), ξ(w)}, and let β be an edge label in Λ − {α}. Let g^β be the ground term tree obtained by replacing all contractible and uncontractible variables with edges labeled with β. This substitution does not increase the number of internal vertices of c(g). Thus, if there is a substitution θ such that g^β ≡ c(t)θ, the vertex of g^β which is p(v) of c(g) must correspond to the vertex of c(t)θ which is ξ(p(v)) of c(t). Therefore, if there does not exist the vertex v′ stated in this claim, g^β does not belong to L^O_Λ(c(t)) since the edge label of {ξ(p(v)), ξ(w)} is α. (End of Proof)

We can show the next two claims in a similar way to Claim 2.

Claim 3. If v is the only child of p(v), then ξ(p(v)) has exactly one child v′ such that [ξ(p(v)), v′]^c ∈ H_c(t), or there exists the parent of ξ(p(v)) such that [p(ξ(p(v))), ξ(p(v))] ∈ H_c(t).

Claim 4. Suppose that p(v) has at least three children and v has the immediately left sibling w_ℓ and immediately right sibling w_r. Then one of the following three statements holds: (1) there exists a child v′ of ξ(p(v)) between ξ(w_ℓ) and ξ(w_r) such that [ξ(p(v)), v′]^c ∈ H_c(t), (2) [ξ(p(v)), ξ(w_ℓ)] ∈ H_c(t), or (3) [ξ(p(v)), ξ(w_r)] ∈ H_c(t).

From these claims, we can immediately show this lemma. □

The algorithm MINL-OTT^c (Fig. 5) solves Minimal Language Problem for OTT^c_Λ (|Λ| ≥ 2) correctly. The algorithm consists of two procedures.

Variable-Extension (Fig. 5): The aim of this procedure is to output an ordered term tree t consisting of only uncontractible variables such that there is no ordered term tree t′ consisting of only uncontractible variables with S ⊆ L^O_Λ(t′) ⊊ L^O_Λ(t). Thus this procedure extends an ordered term tree t by adding uncontractible variables as much as possible while S ⊆ L^O_Λ(t) holds.
Algorithm MINL-OTT^c(S);
input: a set of trees S ⊆ OT_Λ;
begin
  t := ({u, v}, ∅, {[u, v]}); Let q be a list initialized to be [[u, v]];
  Variable-Extension(t, S, q);
  Edge-Replacing(t, S, r_t), where r_t is the root of t;
  output t
end.

Procedure Variable-Extension(t, S, q);
input: an ordered term tree t, a set of trees S, a queue of variables q;
begin
  while q is not empty do begin
    [u, v] := q[1]; Let w1, w2, and w3 be new vertices;
    // w1 becomes a vertex between u and v.
    if S ⊆ L^O_Λ(t′ := (V_t ∪ {w1}, E_t, H_t ∪ {[u, w1], [w1, v]} − {[u, v]})) then
      begin t := t′; q := q&[[w1, v]]; continue end else q := q[2..];
    // w2 and w3 become the immediately left and right siblings of v, respectively.
    if S ⊆ L^O_Λ(t′ := (V_t ∪ {w2}, E_t, H_t ∪ {[u, w2]})) then
      begin t := t′; q := q&[[u, w2]] end;
    if S ⊆ L^O_Λ(t′ := (V_t ∪ {w3}, E_t, H_t ∪ {[u, w3]})) then
      begin t := t′; q := q&[[u, w3]] end
  end; return t
end;

Procedure Edge-Replacing(t, S, u);
input: an ordered term tree t, a set of trees S, a vertex u;
begin
  if u is a leaf then return;
  Let c1, ..., ck be the children of u;
  for i := 1 to k do Edge-Replacing(t, S, ci);
  for i := 1 to k do
    foreach edge label λ ∈ Λ_S do
      if ci is a leaf then
        if S ⊆ L^O_Λ(t′ := tR(ci)^{ℓ,r,d}_λ) then begin
          if S ⊆ L^O_Λ(t1 := t′ − [u, w_ℓ]^c) then t′ := t1;
          if S ⊆ L^O_Λ(t2 := t′ − [u, w_r]^c) then t′ := t2;
          if S ⊆ L^O_Λ(t3 := t′ − [u, w_d]^c) then t′ := t3;
          t := t′
        end
      else
        if S ⊆ L^O_Λ(t′ := tR(ci)^{ℓ,r}_λ) then begin
          if S ⊆ L^O_Λ(t1 := t′ − [u, w_ℓ]^c) then t′ := t1;
          if S ⊆ L^O_Λ(t2 := t′ − [u, w_r]^c) then t′ := t2;
          t := t′
        end;
  return t
end;

Fig. 5. Algorithm MINL-OTT^c: For an ordered term tree t, we denote by t − [u, v]^c the term tree obtained by removing a contractible variable [u, v]^c.
Lemma 3. Let t ∈ OTT^c_Λ (|Λ| ≥ 2) be the output of Variable-Extension for an input S, and let t′ be a minimally generalized ordered term tree explaining S. If S ⊆ L^O_Λ(t′) ⊆ L^O_Λ(t) then t′ ≈ t.

Proof. Obviously t′ ⪯ t. Let t″ be the ordered term tree obtained by replacing all edges of s(t′) with uncontractible variables. Then t″ ≈ t′ and L^O_Λ(t′) ⊆ L^O_Λ(t″). Let θ be a substitution such that t″ ≡ tθ, and θ′ the substitution obtained by replacing all edges appearing in θ with uncontractible variables. Then t″ ≡ tθ′. Since Variable-Extension does not add a variable to t any more, tθ′ ≡ t. Therefore t″ ≡ t, and hence t′ ≈ t. □
Let u be a vertex of an ordered term tree which is not the root of the term tree, and let p(u) be the parent of u. Let λ be an element of Λ. Let w_ℓ and w_r be new children of p(u) which become the immediately left and right siblings of u, respectively. If u is a leaf, let w_d be a new child of u. We suppose that [p(u), u] is an uncontractible variable. Then we define the following two operations.

R(u)^{ℓ,r}_λ: Replace [p(u), u] with the edge {p(u), u} labeled with λ, and add the contractible variables [p(u), w_ℓ]^c and [p(u), w_r]^c.
R(u)^{ℓ,r,d}_λ: Replace [p(u), u] with the edge {p(u), u} labeled with λ, and add the contractible variables [p(u), w_ℓ]^c, [p(u), w_r]^c and [u, w_d]^c.

Edge-Replacing (Fig. 5): Let t be the output of Variable-Extension for an input S. This procedure visits all vertices of t in the reverse order of the breadth-first search of t, and it applies the above two operations R to t. If S ⊆ L^O_Λ(tR) then t := tR. It then eliminates contractible variables as much as possible.
Lemma 4. Let t ∈ OTT^c_Λ (|Λ| ≥ 2) be the output of the algorithm MINL-OTT^c for an input S. Let t′ be a term tree satisfying that S ⊆ L^O_Λ(t′) ⊆ L^O_Λ(t). Then c(t′) ≡ c(t).

Theorem 3. Minimal Language Problem for OTT^c_Λ (|Λ| ≥ 2) is solvable in polynomial time.
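The two-phase strategy of the algorithm — first generalize as far as the data allows, then greedily specialize while every example is still explained — can be mimicked on a much simpler pattern class. The analogue below is entirely our own construction (fixed-length strings with a single-character wildcard '?'), intended only to show the greedy shape of the search, not the tree machinery.

```python
def explains(pattern, s):
    """Membership test: '?' matches any single character."""
    return len(pattern) == len(s) and all(p in ("?", c) for p, c in zip(pattern, s))

def minl_strings(examples, alphabet="ab"):
    """Greedy MINL analogue: start from the most general pattern of the
    right shape, then specialize wildcards while all examples remain explained."""
    n = len(examples[0])
    pattern = ["?"] * n                       # most general hypothesis
    for i in range(n):
        for letter in alphabet:               # try to specialize position i
            trial = pattern[:i] + [letter] + pattern[i + 1:]
            if all(explains(trial, s) for s in examples):
                pattern = trial
                break
    return "".join(pattern)
```

As in Edge-Replacing, each specialization step is kept only when the membership test still succeeds on every positive example, so the final pattern's language is minimal among this toy class.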

4.2 An Algorithm for Unordered Term Trees

Lemma 5. Let g_i and t_i (1 ≤ i ≤ 3) be the unordered term trees in UTT^c_Λ (|Λ| = ∞) in Fig. 6. Let t be a term tree in UTT^c_Λ which has at least one occurrence of t_i (1 ≤ i ≤ 3). For one of the occurrences of t_i, we make a new unordered term tree g by replacing the occurrence of t_i with g_i. Then L^U_Λ(g) = L^U_Λ(t).

Definition 9. Let g be an unordered term tree in UTT^c_Λ (|Λ| = ∞). The unordered term tree g is said to be a canonical unordered term tree if g has no occurrence of t_i (1 ≤ i ≤ 3) (Fig. 6).

We can easily see that any unordered term tree g is transformed into a canonical unordered term tree by replacing all occurrences of t_i with g_i (1 ≤ i ≤ 3) repeatedly. We denote by c(g) the canonical unordered term tree transformed from g. We show the following lemma in a similar way to Lemma 2.
126 Y. Suzuki et al.

[Diagrams of the term-tree pairs g_1/t_1, g_2/t_2, and g_3/t_3 omitted; each is drawn with root u, children v_1, . . . , v_n, and additional vertices w, v, w_d.]
Fig. 6. For i = 1, 2, 3, g_i ≢ t_i and L^U_Λ(g_i) = L^U_Λ(t_i). Let u be a vertex of g_i and c_1, . . . , c_k the children of u. We suppose that at least one child among c_1, . . . , c_k is connected to u by a contractible variable or an uncontractible variable.

Lemma 6. Let g and t be two unordered term trees in UTT^c_Λ (|Λ| = ∞). If g ≈ t and L^U_Λ(g) ⊆ L^U_Λ(t), then c(g) ≡ c(t).
The algorithm MINL-UTT^c (Fig. 7) solves Minimal Language Problem for UTT^c_Λ. The procedure Variable-Extension (Fig. 7) extends an unordered term tree t by adding uncontractible variables as much as possible while S ⊆ L^U_Λ(t) holds. We can show the following lemma in a similar way to Lemma 3.

Lemma 7. Let t ∈ UTT^c_Λ (|Λ| = ∞) be the output of Variable-Extension for an input S. Let t′ be a minimally generalized unordered term tree explaining S. If S ⊆ L^U_Λ(t′) ⊆ L^U_Λ(t), then t′ ≈ t.

Let c_1, . . . , c_k be vertices of an unordered term tree which are not the root of the term tree and u the parent of c_1, . . . , c_k. Let λ be an element of Λ. Let w be a new child of u. If c_i is a leaf, let w_d be a new child of c_i. We suppose that [u, c_i] is an uncontractible variable. Edge-Replacing (Fig. 7) adds a contractible variable [u, w]^c and applies the following two operations R(c_i) to a temporary unordered term tree. If S ⊆ L^U_Λ(tR(c_i)) then t := tR(c_i), and it eliminates the contractible variable [u, w]^c if possible.

R(c_i)_λ : Replace [u, c_i] with the edge {u, c_i} labeled with λ.
R(c_i)^d_λ : Replace [u, c_i] with the edge {u, c_i} labeled with λ and add [c_i, w_d]^c.
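On a concrete encoding, the two operations amount to rewriting one connection plus, in the second case, adding one new contractible variable. The tuple-based representation below is an assumption of this sketch, not the paper's formalism: a term tree is modelled as a list of connections, each ("edge", u, v, label), ("cvar", u, v) for a contractible variable [u, v]^c, or ("uvar", u, v) for an uncontractible variable [u, v].

```python
# Sketch of the replacement operations R(c_i)_lambda and R(c_i)^d_lambda
# on a tuple-encoded unordered term tree (encoding is ours, illustrative).

def replace_basic(t, u, ci, lam):
    # R(c_i)_lambda: replace the uncontractible variable [u, c_i] with the
    # edge {u, c_i} labeled with lam; all other connections stay untouched.
    return [("edge", u, ci, lam) if conn == ("uvar", u, ci) else conn
            for conn in t]

def replace_with_child(t, u, ci, lam, wd):
    # R(c_i)^d_lambda: as above, and additionally attach a new contractible
    # variable [c_i, w_d]^c below the former leaf c_i.
    return replace_basic(t, u, ci, lam) + [("cvar", ci, wd)]
```

The algorithm would keep the rewritten tree only when all example trees in S still belong to its language; that check is outside this sketch.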

Lemma 8. Let t ∈ UTT^c_Λ (|Λ| = ∞) be the output of the algorithm MINL-UTT^c for an input S. Let t′ be an unordered term tree satisfying S ⊆ L^U_Λ(t′) ⊆ L^U_Λ(t). Then c(t′) ≡ c(t).

Theorem 4. Minimal Language Problem for UTT^c_Λ (|Λ| = ∞) is solvable in polynomial time.
Efficient Learning of Ordered and Unordered Tree Patterns 127

Algorithm MINL-UTT^c(S);
input: a set of trees S ⊆ UT_Λ;
begin
  Let Λ_S be the set of edge labels which appear in S;
  t := Variable-Extension(S);
  Edge-Replacing(t, S, r_t), where r_t is the root of t;
  output t
end.

Procedure Variable-Extension(S);
input: a set of trees S;
begin
  d := 0; t := ({r}, ∅, ∅);
  t := Breadth-Extension(t, S, r);
  max-depth := the maximum depth of the trees in S;
  d := d + 1;
  while d ≤ max-depth − 1 do begin
    v := a vertex at depth d which is not yet visited;
    t := Breadth-Extension(t, S, v);
    while there exists a sibling of v which is not yet visited do begin
      Let v′ be a sibling of v which is not yet visited;
      t := Breadth-Extension(t, S, v′)
    end;
    d := d + 1
  end;
  return t
end;

Procedure Breadth-Extension(t, S, v);
input: an unordered term tree t, a set of trees S, a vertex v;
begin
  t′ := Depth-Extension(t, S, v);
  while t ≠ t′ do begin
    t := t′;
    t′ := Depth-Extension(t, S, v)
  end;
  return t
end;

Procedure Edge-Replacing(t, S, u);
input: an unordered term tree t, a set of trees S, and a vertex u;
begin
  if u is a leaf then return;
  Let c_1, . . . , c_k be the children of u;
  for i := 1 to k do Edge-Replacing(t, S, c_i);
  Let w be a new child of u;
  t := t + [u, w]^c;
  for i := 1 to k do
    foreach edge label λ ∈ Λ_S do
      if c_i is a leaf then begin
        if S ⊆ L^U_Λ(t′ := tR(c_i)_λ) then t := t′;
        if S ⊆ L^U_Λ(t′ := tR(c_i)^d_λ) then t := t′
      end
      else
        if S ⊆ L^U_Λ(t′ := tR(c_i)_λ) then t := t′
  end;
  t′ := t − [u, w]^c;
  if S ⊆ L^U_Λ(t′) then t := t′;
  return t
end;

Procedure Depth-Extension(t, S, v);
input: an unordered term tree t, a set of trees S, a vertex v;
begin
  Let t be (V_t, ∅, H_t);
  Let v′ be a new vertex and [v, v′] a new variable;
  t′ := (V_t ∪ {v′}, ∅, H_t ∪ {[v, v′]});
  while S ⊆ L^U_Λ(t′) do begin
    t := t′; v := v′;
    Let v′ be a new vertex and [v, v′] a new variable;
    t′ := (V_t ∪ {v′}, ∅, H_t ∪ {[v, v′]})
  end;
  return t
end;

Fig. 7. Algorithm MINL-UTT^c: For an unordered term tree t, we denote by t + [u, v]^c the term tree obtained by adding a contractible variable [u, v]^c, and by t − [u, v]^c the term tree obtained by removing a contractible variable [u, v]^c.

References
1. S. Abiteboul, P. Buneman, and D. Suciu. Data on the Web: From Relations to
Semistructured Data and XML. Morgan Kaufmann, 2000.
2. S. Amer-Yahia, S. Cho, L. V. S. Lakshmanan, and D. Srivastava. Minimization of
Tree Pattern Queries. Proc. ACM SIGMOD 2001, pages 497–508, 2001.
3. T. R. Amoth, P. Cull, and P. Tadepalli. Exact learning of unordered tree patterns
from queries. Proc. COLT-99, ACM Press, pages 323–332, 1999.
4. D. Angluin. Finding patterns common to a set of strings. Journal of Computer
and System Science, 21:46–62, 1980.
5. H. Arimura, H. Sakamoto, and S. Arikawa. Efficient learning of semi-structured
data from queries. Proc. ALT-2001, Springer-Verlag, LNAI 2225, pages 315–331,
2001.
6. T. Asai, K. Abe, S. Kawasoe, H. Arimura, H. Sakamoto, and S. Arikawa. Efficient
Substructure Discovery from Large Semi-structured Data. Proc. of the Second
SIAM International Conference on Data Mining, pages 158–174, 2002.
7. S. Matsumoto, Y. Hayashi, and T. Shoudai. Polynomial time inductive inference
of regular term tree languages from positive data. Proc. ALT-97, Springer-Verlag,
LNAI 1316, pages 212–227, 1997.
8. S. Matsumoto, T. Shoudai, T. Miyahara, and T. Uchida. Learning of finite unions
of tree patterns with internal structured variables from queries. Proc. AI-2002,
Springer-Verlag, LNAI 2557, pages 523–534, 2002.
9. T. Miyahara, T. Shoudai, T. Uchida, K. Kuboyama, K. Takahashi, and H. Ueda.
Discovering New Knowledge from Graph Data Using Inductive Logic Programming.
Proc. ILP-99, Springer-Verlag, LNAI 1634, pages 222–233, 1999.
10. T. Miyahara, Y. Suzuki, T. Shoudai, T. Uchida, S. Hirokawa, K. Takahashi, and
H. Ueda. Extraction of Tag Tree Patterns with Contractible Variables from Irregu-
lar Semistructured data. Proc. PAKDD-2003, Springer-Verlag, LNAI 2637, pages
430–436, 2003.
11. T. Shinohara. Polynomial time inference of extended regular pattern languages.
Proc. RIMS Symposium on Software Science and Engineering, Springer-Verlag,
LNCS 147, pages 115–127, 1982.
12. T. Shoudai, T. Uchida, and T. Miyahara. Polynomial time algorithms for finding
unordered tree patterns with internal variables. Proc. FCT-2001, Springer-Verlag,
LNCS 2138, pages 335–346, 2001.
13. Y. Suzuki, R. Akanuma, T. Shoudai, T. Miyahara, and T. Uchida. Polynomial time
inductive inference of ordered tree patterns with internal structured variables from
positive data. Proc. COLT-2002, Springer-Verlag, LNAI 2375, pages 169–184,
2002.
14. Y. Suzuki, T. Shoudai, T. Uchida, and T. Miyahara. Ordered term tree languages
which are polynomial time inductively inferable from positive data. Proc. ALT-
2002, Springer-Verlag, LNAI 2533, pages 188–202, 2002.
15. Y. Suzuki, T. Shoudai, S. Matsumoto and T. Uchida. Efficient Learning of Unla-
beled Term Trees with Contractible Variables from Positive Data. Proc. ILP-2003,
Springer-Verlag, LNAI (to appear), 2003.
16. J. Uemura and M. Sato. Compactness and Learning of Classes of Unions of Erasing
Regular Pattern Languages. Proc. ALT-2002, Springer-Verlag, LNAI 2533, pages
293–307, 2002.
On the Learnability of Erasing Pattern
Languages in the Query Model

Steffen Lange¹ and Sandra Zilles²

¹ Deutsches Forschungszentrum für Künstliche Intelligenz,
Stuhlsatzenhausweg 3, 66123 Saarbrücken, Germany,
lange@[Link]
² Universität Kaiserslautern,
FB Informatik, Postfach 3049, 67653 Kaiserslautern, Germany,
zilles@[Link]

Abstract. A pattern is a finite string of constant and variable symbols.
The erasing language generated by a pattern p is the set of all strings
that can be obtained by substituting (possibly empty) strings of constant
symbols for the variables in p.
The present paper deals with the problem of learning the erasing pat-
tern languages and natural subclasses thereof within Angluin’s model
of learning with queries. The paper extends former studies along this
line of research. It provides new results concerning the principal learn-
ing capabilities of query learners as well as the power and limitations of
polynomial-time query learners.
In addition, the paper focusses on a quite natural extension of Angluin’s
original model. In this extended model, the query learner is allowed to
query languages which are themselves not object of learning. Query learn-
ers of the latter type are often more powerful and more efficient than
standard query learners. Moreover, when studying this new model in a
more general context, interesting relations to Gold’s model of language
learning from only positive data have been elaborated.

1 Introduction

A pattern is a finite string of constant and variable symbols (cf. Angluin [2]).
The erasing language generated by a pattern p is the set of all strings that can
be obtained by substituting strings of constant symbols (including the empty
one!) for the variables in p.1 Thereby, each occurrence of a variable has to be
substituted by the same string.
The erasing pattern languages have found a lot of attention within the past
two decades both in the formal language theory community (see, e. g., Salo-
maa [15,16], Jiang et al. [9]) and in the learning theory community (see, e. g.,
¹ The term ‘erasing’ is coined to distinguish these languages from those pattern languages originally defined in Angluin [2], where it is forbidden to replace variables by the empty string.

R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 129–143, 2003.
© Springer-Verlag Berlin Heidelberg 2003
130 S. Lange and S. Zilles

Shinohara [17], Erlebach et al. [6], Mitchell [12], Nessel and Lange [13], Reiden-
bach [14]). The learning scenarios studied include Gold’s [7] model of learning
in the limit and Angluin’s [3] model of learning with queries. Besides that, in-
teresting applications have been outlined. For example, learning algorithms for
particular subclasses of erasing pattern languages have been used to solve prob-
lems in molecular biology (see Arikawa et al. [5]).
The present paper focusses on the learnability of the erasing pattern lan-
guages and natural subclasses thereof in Angluin’s [3,4] model of learning with
queries. The paper extends the work of Nessel and Lange [13]; the first systematic
study in this context.
In contrast to Gold’s [7] model of learning in the limit, Angluin’s [3] model
deals with ‘one-shot’ learning. Here, a learning algorithm (henceforth called
query learner) has the option to ask queries in order to receive information
about an unknown language. The queries will truthfully be answered by an ora-
cle. After asking at most finitely many queries, the learner is supposed to output
its one and only hypothesis. This hypothesis is required to correctly describe the
unknown language.
The present paper contains a couple of new results, which illustrate the power
and limitations of query learners in the context of learning the class of all eras-
ing pattern languages and natural subclasses thereof. Along the line of former
studies, the capabilities of polynomial-time query learners (i. e. learners that are
constrained to ask at most polynomially many queries before returning their
hypothesis) are studied as well.
In addition, a problem is addressed that has mainly been ignored so far.
The present paper provides the first systematic study concerning the strength
of query learners that are – in contrast to standard query learners – allowed
to query languages that are themselves not object of learning. As it turns out,
these new learners often outperform standard learners, concerning their principal
learning capability as well as their efficiency.
Moreover, the learning power of non-standard query learners is compared to
the capabilities of Gold-style language learners. As a result of this comparison,
quite interesting coincidences between Gold-style language learning and query
learning – in the more general setting of learning indexable classes of recursive
languages – have been observed. One of them allows for a new approach to
the long-standing open question of whether or not the erasing pattern languages
(over a finite alphabet with at least three constant symbols) are Gold-style learn-
able from only positive examples. To be more precise, the erasing pattern lan-
guages are learnable in the non-standard query model (using a particular type
of queries, namely restricted superset queries), iff they are Gold-style learnable
from only positive examples by a conservative learner (i. e. a learner that strictly
avoids overgeneralized hypotheses).
Next, we summarize the results established so far on query learning of the class of all erasing pattern languages and natural subclasses thereof.
Among the different types of queries investigated in the past (see, e. g., An-
gluin [3,4]), we consider the following ones:
On the Learnability of Erasing Pattern Languages in the Query Model 131

Membership queries. The input is a string w and the answer is ‘yes’ or ‘no’, respectively, depending on whether or not w belongs to the target language L.
Restricted subset queries. The input is a language L′. If L′ ⊆ L, the answer is ‘yes’. Otherwise, the answer is ‘no’.
Restricted superset queries. The input is a language L′. If L ⊆ L′, the answer is ‘yes’. Otherwise, the answer is ‘no’.
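For intuition, the three query types can be mocked over finite languages. The `Oracle` class below is an illustrative sketch of ours (finite sets stand in for the generally infinite pattern languages, and no counterexamples are returned, hence "restricted"):

```python
# A toy oracle for the three query types over finite languages.
# Class and method names are illustrative, not from the paper.

class Oracle:
    def __init__(self, target):
        self.target = set(target)  # the unknown target language L

    def membership(self, w):
        # 'yes' iff the string w belongs to L
        return w in self.target

    def restricted_subset(self, l_prime):
        # 'yes' iff L' is a subset of L; no counterexample is revealed
        return set(l_prime) <= self.target

    def restricted_superset(self, l_prime):
        # 'yes' iff L is a subset of L'
        return self.target <= set(l_prime)

oracle = Oracle({"", "ab", "abab"})
```

A learner interacts only through these three boolean answers; everything it infers about L must be pieced together from them.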
In the original model of learning with queries (cf. Angluin [3]), the query
learner is constrained to choose the input language L′ exclusively from the class
of languages to be learned. Our study involves a further approach, in which
this constraint is weakened by allowing the learner to query languages that are
themselves not object of learning.
The following table summarizes the results obtained and compares them to
the previously known results. The focus is on the learnability of the class of
all erasing pattern languages and the following subclasses thereof: the so-called
regular, k-variable, and non-cross erasing pattern languages.2 The items in the
table have to be interpreted as follows. The item ‘No’ indicates that queries
of the specified type are insufficient to learn the corresponding language class,
while the item ‘Yes’ indicates that the corresponding class is learnable using
queries of this type. The superscript † refers to results, which can be found or
easily derived from results in Angluin [3], Matsumoto and Shinohara [11], and
Nessel and Lange [13], respectively.
Type of erasing pattern languages

Type of queries    all   regular  1-variable  const.-free  k-variable  const.-free  non-cross
                                              1-variable               k-variable
membership         No†   Yes†     No          Yes          No          No           No
restr. subset      No    Yes      No          Yes          No          No           No
restr. superset    No†   Yes†     No†         No†          No†         No†          No†
If query learners are allowed to choose input languages that are themselves
not object of learning, their learning capabilities change remarkably, particularly
when the learner is allowed to ask restricted superset queries. It seems as if this
type of queries is especially tailored to accumulate learning-relevant information
about erasing pattern languages. Note that the superscript ‡ marks immediate
outcomes of the table above.
Type of erasing pattern languages

Type of extra      all   regular  1-variable  const.-free  k-variable  const.-free  non-cross
queries                                       1-variable               k-variable
restr. subset      No    Yes‡     No          Yes‡         No          No           No
restr. superset    Open  Yes‡     Yes         Yes          Yes         Yes          Yes
Of particular interest is also the complexity of a successful query learner M ,
cf. Angluin [3]. M learns a class polynomially, if, for each target language L
² A pattern p is regular provided that p does not contain any variable more than once. Moreover, p is said to be a k-variable pattern, if it contains at most k variables, while it is said to be non-cross, if there are variables x_1, . . . , x_n and indices e_1, . . . , e_n such that p = x_1^{e_1} · · · x_n^{e_n}.

in the class, the total number of queries to be asked by M in the worst case is polynomial in the length of the minimal description for L. The table below
summarizes the corresponding results. The first (second) row displays the types
of queries (not) suitable for polynomial learning of a particular class; the third
row marks open problems. Here MemQ (SubQ,SupQ) is short for membership
(restricted subset, restricted superset) queries; the prefix x denotes extra queries.
The superscript † refers to results by Nessel and Lange [13]. Note that the results
on non-learnability are not displayed.
Type of erasing pattern languages

                   all    regular  1-variable  const.-free  k-variable  const.-free  non-cross
                                               1-variable               k-variable
polynomially              SupQ†,   xSupQ       MemQ, SubQ   xSupQ,      xSupQ,       xSupQ
learnable                 xSupQ                             if k = 2    if k = 2
learnable, not            MemQ†,
polynomially              xSubQ
open               xSupQ                                    xSupQ,      xSupQ,
                                                            if k > 2    if k > 2

2 Preliminaries

In the following, Σ denotes a fixed finite alphabet, the set of constant symbols.
Moreover, X = {x1 , x2 , x3 , . . .} is a countable, infinite set of variables. To distin-
guish constant symbols from variables, it is assumed that Σ and X are disjoint.
By Σ ∗ we refer to the set of all finite strings over Σ (words, for short), where ε
denotes the empty string or empty word, respectively. A pattern is a non-empty
string over Σ ∪ X .
Several special types of patterns are distinguished. Let p be a pattern. If
p ∈ X ∗ , then p is said to be a constant-free pattern. p is a regular pattern, if
each variable in p occurs at most once. If p contains at most k variables, then
p is a k-variable pattern. Moreover, p is said to be a non-cross pattern, if it is
constant-free and there are some n ≥ 1 and indices e1 , . . . , en ≥ 1 such that p
equals x_1^{e_1} · · · x_n^{e_n}.
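These pattern classes are easy to recognize mechanically. A sketch in Python, under the assumption that a pattern is given as a list of symbols and that variables are named "x1", "x2", . . .; all function names are ours, not the paper's:

```python
# Predicates for the special pattern types defined above. The encoding is an
# assumption of this sketch: a pattern is a list of symbols, a variable is a
# string "x" followed by digits, and everything else is a constant.

def is_variable(s):
    return s.startswith("x") and s[1:].isdigit()

def is_constant_free(p):
    return all(is_variable(s) for s in p)

def is_regular(p):
    vs = [s for s in p if is_variable(s)]
    return len(vs) == len(set(vs))  # no variable occurs more than once

def is_k_variable(p, k):
    return len({s for s in p if is_variable(s)}) <= k

def is_non_cross(p):
    # constant-free and of the form x1^e1 ... xn^en: the variables occur in
    # maximal contiguous blocks, and no variable re-appears after its block
    if not is_constant_free(p):
        return False
    seen = []
    for s in p:
        if not seen or seen[-1] != s:
            if s in seen:
                return False  # variable returns after a different one
            seen.append(s)
    return True
```

For instance, ["x1", "x1", "x2"] is non-cross while ["x1", "x2", "x1"] is not.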
For a pattern p, the erasing pattern language Lε (p) generated by p is the set
of all words obtained by substituting all variables in p by strings in Σ ∗ . Thereby,
each occurrence of a variable in p has to be replaced by the same word.
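Membership in an erasing pattern language can be decided by backtracking over substitutions, trying the empty word first. The function below is an illustrative sketch of ours (same list-of-symbols encoding as before, variables named "x1", "x2", . . .); the general membership problem for pattern languages is known to be NP-complete, so exponential worst-case running time is to be expected.

```python
# Backtracking membership test: is w in the erasing language of pattern p?
# Variables may be bound to any word, including the empty one, but must be
# bound consistently across all their occurrences.

def in_erasing_language(p, w):
    def is_var(s):
        return s.startswith("x") and s[1:].isdigit()

    def match(i, j, binding):
        if i == len(p):
            return j == len(w)            # pattern used up: word must be too
        s = p[i]
        if not is_var(s):                 # a constant must match literally
            return j < len(w) and w[j] == s and match(i + 1, j + 1, binding)
        if s in binding:                  # variable already bound: reuse it
            v = binding[s]
            return w.startswith(v, j) and match(i + 1, j + len(v), binding)
        for k in range(j, len(w) + 1):    # try every substitution, from ε up
            binding[s] = w[j:k]
            if match(i + 1, k, binding):
                return True
            del binding[s]
        return False

    return match(0, 0, {})
```

For example, "bba" and "a" belong to Lε(x1 x1 a) (via x1 ↦ b and x1 ↦ ε), while "ba" does not.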
Below, we generally assume that the underlying alphabet Σ consists of at
least three elements.3 a, b, c always denote elements of Σ.
The erasing pattern languages and natural subclasses thereof will provide
the target objects for learning. The formal learning model analyzed is called
³ As results in Shinohara [17] and Nessel and Lange [13] impressively show, this assumption remarkably reduces the complexity of the proofs needed to establish learnability results in the context of learning the erasing pattern languages and subclasses thereof. However, some of the learnability results presented below may no longer remain valid, if this assumption is skipped. A detailed discussion of this issue is outside the scope of the paper on hand.

learning with queries, see Angluin [3,4]. In this model, the learner has access to
an oracle that truthfully answers queries of a specified kind. A query learner M
is an algorithmic device that, depending on the reply on the queries previously
made, either computes a new query or a hypothesis and halts. M learns a target
language L using a certain type of queries provided that it eventually halts and
that its one and only hypothesis correctly describes L. Furthermore, M learns
a target language class C using a certain type of queries, if it learns every L ∈ C
using queries of the specified type. As a rule, when learning a target class C, M
is not allowed to query languages not belonging to C (cf. Angluin [3]).
As in Angluin [3], the complexity of a query learner is measured by the total
number of queries to be asked in the worst-case. The relevant parameter is the
length of the minimal description for the target language.
Below, only indexable classes of erasing pattern languages are considered.
Note that a class of recursive languages is said to be an indexable class, if there
is an effective enumeration (Li )i≥0 of all and only the languages in that class that
has uniformly decidable membership. Such an enumeration is called an indexing.

3 Strength and Weakness of Query Learners

3.1 Learning in the Original Query Model

We first present results related to Angluin’s [3] original model. Here the learner
is only allowed to query languages that are themselves object of learning.
The first result points to the general weakness of query learners when arbi-
trary erasing pattern languages have to be identified.
Theorem 1. The class of all erasing pattern languages is (i) not learnable using
membership queries, (ii) not learnable using restricted subset queries, and, (iii)
not learnable using restricted superset queries.
Proof. Assertions (i) and (iii) are results from Nessel and Lange [13].
To prove Assertion (ii), assume that a query learner M identifies the class of all erasing pattern languages using restricted subset queries. Then it is possible to show that M fails to identify either Lε(x_1^2) or all but finitely many of the languages Lε(x_1^2 x_2^z) for z ≥ 2. □
The observed weakness has one origin: the query learners are only allowed to
output one hypothesis, which has to be correct. To see this, consider the following
relaxation of the learning model on hand. Suppose that a query learner M has
the freedom to output in each learning step, after asking a query and receiving
the corresponding answer, a hypothesis. Similarly to Gold’s [7] model of learning
in the limit, a query learner is now successful, if the sequence of its hypotheses
stabilizes on a correct one. Accordingly, we say that M learns in the limit using
queries.
Theorem 2. The class of all erasing pattern languages is (i) learnable in the
limit using membership queries, (ii) learnable in the limit using restricted subset
queries, and, (iii) learnable in the limit using restricted superset queries.

However, let us come back to the original learning model, in which the first
hypothesis of the query learner has to be correct. As Theorem 1 shows, positive
results can only be achieved, if the scope is limited to proper subclasses of the
erasing pattern languages.
Suppose that a subclass of the erasing pattern languages is fixed. Naturally,
one may ask whether – similarly to Theorems 1 and 2 – the learnability of this
class does not depend on the type of queries actually considered. However, this
is generally not the case as our next theorem shows.
Theorem 3. Fix two different query types from the following ones: member-
ship, restricted subset, and restricted superset queries. Then there is a class of
erasing pattern languages, which is learnable using the first type of queries, but
not learnable using the second type of queries.

Proof. Scanning the first table above, the class of all erasing pattern languages
generated by constant-free 1-variable patterns is learnable with membership or
restricted subset queries, but not learnable with restricted superset queries.
Moreover, it is not hard to verify that the class which contains Lε(a) and all languages Lε(a x_1^z), where z is a prime number, is learnable using restricted superset queries, but not learnable using membership queries and not learnable using restricted subset queries.
Next, the class containing Lε(x_1^2) and all languages Lε(x_1^2 x_2^2 x_3^z), z ≥ 2, is learnable with membership queries, but not with restricted subset queries.
A class learnable with restricted subset queries, but not with membership
queries can be constructed via diagonalization. For that purpose fix an effective
enumeration (Mi )i≥0 of all query learners using membership queries and posing
each query at most once.⁴ Let z_i denote the i-th prime number for all i ≥ 0.
Given i ≥ 0, let L_{2i} = Lε(x_1^{z_i} a). Moreover, simulate the learner M_i. If M_i queries a word w ∈ Σ*, provide the answer ‘yes’ iff w ∈ Lε(x_1^{z_i} a); provide the answer ‘no’, otherwise. In case M_i never returns a hypothesis in this scenario, let L_{2i+1} = L_{2i} = Lε(x_1^{z_i} a). In case M_i returns a hypothesis, let l be the length of the longest word M_i has queried in the corresponding scenario. Then define L_{2i+1} = Lε(x_1^{z_i} a x_2^{z_l}). Finally, let C consist of all languages L_i for i ≥ 0.
Note that (L_i)_{i≥0} is an indexing for C; membership is decidable as follows: assume w ∈ Σ* and j ≥ 0 are given. If j = 2i for some i ≥ 0, then w ∈ L_j iff w ∈ Lε(x_1^{z_i} a). If j = 2i + 1 for some i ≥ 0 and w ∈ L_{2i}, then w ∈ L_j. If j = 2i + 1 and w ∉ L_{2i}, then let A = {l ≥ 0 | w ∈ Lε(x_1^{z_i} a x_2^{z_l})}. A is finite and can be computed from w and i. Simulate M_i as above in the definition of L_{2i+1}. If M_i does not return a hypothesis, then, since no query is posed twice, M_i queries a word of a length not in A. Thus there is no l ∈ A with L_j = Lε(x_1^{z_i} a x_2^{z_l}); in particular w ∉ L_j. If M_i returns a hypothesis, one can determine the length l* of the longest word M_i has queried. In this case w ∈ L_j iff l* ∈ A.
Next, we show that C is learnable using restricted subset queries. A learner M for C may first query the languages Lε(x_1^{z_0} a), Lε(x_1^{z_1} a), Lε(x_1^{z_2} a), . . . , until the
⁴ Note that any query learner can be normalized to pose each query at most once without affecting its learning capabilities.

answer ‘yes’ is received for the first time, say as a reply to the query Lε(x_1^{z_i} a) = L_{2i}. Then M queries the language L_{2i+1}. In case the answer is ‘yes’, let M return the hypothesis L_{2i+1}. Otherwise, let M return the hypothesis L_{2i}. It is not hard
to verify that M is a successful query learner for C.
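The two-stage learner just described can be simulated directly. The sketch below abstracts the oracle as a black box; the truthful oracle encodes the observation that, for distinct primes z_i and z_j, Lε(x_1^{z_j} a) ⊆ Lε(x_1^{z_i} a) holds iff z_i divides z_j (consider the word b^{z_j} a), hence iff i = j, and that the second-stage query L_{2i+1} ⊆ L is answered ‘yes’ iff the target is L_{2i+1} itself. All identifiers are illustrative.

```python
# Simulation of the restricted-subset-query learner for the class C of the
# proof. The oracle is a black-box function answering the learner's queries;
# names and the query encoding are ours, illustrative only.

def learn(oracle):
    # Stage 1: probe Lε(x1^{z_0} a), Lε(x1^{z_1} a), ... until 'yes'.
    i = 0
    while not oracle(("L_2i", i)):
        i += 1
    # Stage 2: one further query for L_{2i+1} decides between the two
    # candidate hypotheses L_{2i} and L_{2i+1}.
    return 2 * i + 1 if oracle(("L_2i+1", i)) else 2 * i

def truthful_oracle(target_index):
    # Truthful restricted-subset answers when the target is L_{target_index}.
    i, odd = divmod(target_index, 2)
    def oracle(query):
        kind, j = query
        if kind == "L_2i":
            return j == i    # subset holds only for the matching prime index
        return bool(odd)     # L_{2i+1} ⊆ target iff the target is L_{2i+1}
    return oracle
```

Against any truthful oracle for a language in C, the learner halts after finitely many queries with the index of the target.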
It remains to verify that C is not learnable using membership queries. Assume to the contrary that C is learnable using membership queries, say by the learner M_i for some i ≥ 0. Then M_i identifies the language L_{2i} = Lε(x_1^{z_i} a). In particular, if its queries are answered truthfully respecting L_{2i}, M_i must return a hypothesis correctly describing L_{2i} after finitely many queries. Let l be the length of the longest word M_i queries in the corresponding learning scenario. Then, by definition, L_{2i+1} = Lε(x_1^{z_i} a x_2^{z_l}). Note that a word of length up to l belongs to L_{2i} iff it belongs to L_{2i+1}. Thus all queries in the learning scenario of M_i for L_{2i} are answered truthfully also for the language L_{2i+1} ≠ L_{2i}. Since M_i correctly identifies L_{2i}, M_i fails to learn L_{2i+1}. This yields a contradiction. □
Next, we systematically investigate the learnability of some prominent sub-
classes of the erasing pattern languages in Angluin’s [3] model.
Theorem 4. The class of all regular erasing pattern languages is (i) learnable
using membership queries, (ii) learnable using restricted subset queries, and, (iii)
learnable using restricted superset queries.
Proof. For a proof of Assertions (i) and (iii) see Nessel and Lange [13]. Adapting
their ideas one can also prove (ii). □
Theorem 5. The class of all 1-variable erasing pattern languages is (i) not
learnable using membership queries, (ii) not learnable using restricted subset
queries, and, (iii) not learnable using restricted superset queries.
Proof. (i) and (iii) are due to Nessel and Lange [13]. To verify (ii), note that the class of all languages Lε(a x_1^z b), z ≥ 0, is not learnable using restricted subset queries, even if it is allowed to query any 1-variable erasing pattern language. □
To prove Theorems 6 to 9, similar methods as above can be used. For the
results concerning restricted superset queries, ideas from Nessel and Lange [13]
can be exploited. Further details are omitted.
Theorem 6. The class of all constant-free 1-variable erasing pattern languages
is (i) learnable using membership queries, (ii) learnable using restricted subset
queries, and, (iii) not learnable using restricted superset queries.
Theorem 7. The class of all k-variable erasing pattern languages is (i) not
learnable using membership queries, (ii) not learnable using restricted subset
queries, and, (iii) not learnable using restricted superset queries.
Theorem 8. The class of all constant-free k-variable erasing pattern languages
is (i) not learnable using membership queries, (ii) not learnable using restricted
subset queries, and, (iii) not learnable using restricted superset queries.
Theorem 9. The class of all non-cross erasing pattern languages is (i) not
learnable using membership queries, (ii) not learnable using restricted subset
queries, and, (iii) not learnable using restricted superset queries.

3.2 Learning with Extra Queries


As it turns out, there are not so many natural subclasses of the erasing pat-
tern languages that are learnable using restricted subset and restricted superset
queries, respectively. But where does the observed weakness stem from? Does it
result from the complexity of the considered language classes? The following in-
vestigations seem to prove that this is not the case. Instead it seems as if, at least
in some cases, the query learners are simply not allowed to ask the ‘appropriate’
queries.
In the extended model, the query learner is not constrained to query just
languages belonging to the target class. That means, the learner and the oracle
have the ability to communicate additional queries and corresponding answers.
However, in a reasonable model, there has to be an a priori agreement of how
to formulate the queries. For that purpose, we assume that the query languages
are selected from an a priori fixed indexable class of recursive languages.
As we will see below, this may severely increase the learning power concerning
natural subclasses of the erasing pattern languages. Still, if the class of all erasing
pattern languages is considered, a benefit resulting from extra queries has not
been verified yet. Indeed, concerning restricted subset queries, it is clear that
extra queries do not help.
Theorem 10. The class of all erasing pattern languages is not learnable using
extra restricted subset queries.
It remains open whether or not the class of all erasing pattern languages is
learnable using extra restricted superset queries. The relevance of this problem
is discussed in the last section.
Extra restricted superset queries improve the power of query learners remark-
ably. Due to the space constraints the corresponding proof is omitted.
Theorem 11. The classes of all regular, of all k-variable, and of all non-cross
erasing pattern languages, respectively, are learnable using extra restricted su-
perset queries.
In contrast, extra restricted subset queries do not help for learning the natural
subclasses of the erasing pattern languages considered. Note that there are still
subclasses which are learnable using restricted subset queries if and only if the
learner may ask extra languages. An example can be found in the demonstration
of Theorem 3, third paragraph.
Theorem 12. (i) The classes of all 1-variable, of all constant-free k-variable
(k ≥ 2) and of all non-cross erasing pattern languages, respectively, are not
learnable using extra restricted subset queries.
(ii) The classes of all regular and of all constant-free 1-variable erasing pattern
languages, respectively, are learnable using extra restricted subset queries.
Proof. Assertion (ii) is an immediate consequence of Theorems 4 and 6. To prove
Assertion (i), note that neither the class consisting of Lε(a) and all languages Lε(a x_1^z), z ≥ 1, nor the class consisting of Lε(x_1^2) and all languages Lε(x_1^2 x_2^z), z ≥ 2, are learnable with extra restricted subset queries. □

4 Efficiency of Query Learners

Having analyzed the learnability of natural subclasses of the class of all erasing pattern languages in the (extended) query model, we now turn our attention to the question of which of the learnable classes can even be learned efficiently, i.e., with polynomially many queries. In particular, it is of interest whether or not the permission to query extra languages may speed up learning.
As it turns out, there are subclasses which are not learnable in the original model, but are even efficiently learnable with extra queries; see Theorem 13, Assertion (iv). Thus, extra restricted superset queries may bring the maximal benefit imaginable. In contrast, extra restricted subset queries do not help to speed up learning of the prominent subclasses of the erasing pattern languages considered above.

Theorem 13. (i) Polynomially many restricted superset queries suffice to learn the class of all regular erasing pattern languages.
(ii) Polynomially many membership queries suffice to learn the class of all constant-free 1-variable erasing pattern languages.
(iii) Polynomially many restricted subset queries suffice to learn the class of all constant-free 1-variable erasing pattern languages.
(iv) Polynomially many extra restricted superset queries suffice to learn the classes of all regular, of all 1-variable, and of all non-cross erasing pattern languages, respectively.

Proof. (i) is due to Nessel and Lange [13]. The proofs of (ii) and (iii) are omitted. Results by Nessel and Lange [13] help to verify Assertion (iv) for the case of regular erasing pattern languages. Details are omitted.
The more involved proof of Assertion (iv) for the case of non-cross erasing pattern languages is just sketched:
Assume that the target language L equals Lε(p) for some non-cross pattern p = x1^e1 · · · xn^en. A query learner M successful for all non-cross erasing pattern languages may operate as follows.

1. M poses the query Σ* \ {a}. If the answer is ‘no’, then M returns the hypothesis L = Lε(x1) and stops; otherwise M acts as described in 2.
2. The queries {w | |w| ≠ j} for j = 1, 2, . . . help to determine the minimal exponent e in {e1, . . . , en}. Knowing e, M executes 3.
3. M poses the query Lε(x1^e). If the answer is ‘yes’, then M returns the hypothesis L = Lε(x1^e) and stops; otherwise M acts as described in 4.
4. The queries (Lε(x1^e) ∩ {w | |w| ≤ j}) ∪ {w | |w| > j} for j = e, e + 1, . . . help to determine further candidates for elements in {e1, . . . , en}. Queries concerning special words in a selected class of (at most e1 + · · · + en) 2-variable erasing pattern languages help to exactly compute a next exponent e′. Knowing e′, M executes 5.
5. The queries Σ* \ {w}, for particular words w ∈ Σ* in order of growing length, help to determine in which order the exponents e and e′ appear in p.
138 S. Lange and S. Zilles

Afterwards, M executes (slightly modified versions of) steps 3 to 5 in order to find further exponents, until the correct structure of p is output.
All in all, this method is successful for all non-cross erasing pattern languages, but uses only polynomially many extra restricted superset queries. Instead of formalizing the details, we illustrate the idea with an example.
Assume Σ = {a, b, c} and the target language is Lε(x1^4 x2^2 x3^8). Then the corresponding learning scenario can be described by the following table.

Step  Query                                               Reply
1     Σ* \ {a}                                            ‘yes’
2     {w | |w| ≠ 1}                                       ‘yes’
      {w | |w| ≠ 2}                                       ‘no’
      (* e = 2. *)
3     Lε(x1^2)                                            ‘no’
      (* There is a second exponent e′. *)
4     (Lε(x1^2) ∩ {w | |w| ≤ 2}) ∪ {w | |w| > 2}          ‘yes’
      ...
      (Lε(x1^2) ∩ {w | |w| ≤ 5}) ∪ {w | |w| > 5}          ‘yes’
      (Lε(x1^2) ∩ {w | |w| ≤ 6}) ∪ {w | |w| > 6}          ‘no’
      (* e′ = 4. *)
5     Σ* \ {a^2 b^4}                                      ‘yes’
      Σ* \ {a^4 b^2}                                      ‘no’
      (* e′ appears only before e in p. *)
3     Lε(x1^4 x2^2)                                       ‘no’
      (* There is a third exponent e′′. *)
4     (Lε(x1^4 x2^2) ∩ {w | |w| ≤ 6}) ∪ {w | |w| > 6}     ‘yes’
      ...
      (Lε(x1^4 x2^2) ∩ {w | |w| ≤ 9}) ∪ {w | |w| > 9}     ‘yes’
      (Lε(x1^4 x2^2) ∩ {w | |w| ≤ 10}) ∪ {w | |w| > 10}   ‘no’
      (* Candidates for e′′: 6, 8. Interesting words: a^6 b^4, a^2 b^8. *)
      Σ* \ {a^6 b^4}                                      ‘yes’
      Σ* \ {a^2 b^8}                                      ‘no’
      (* e′′ = 8, appears after e in p; step 5 is not necessary. *)
3     Lε(x1^4 x2^2 x3^8)                                  ‘yes’
Output: Lε(x1^4 x2^2 x3^8)
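The query semantics used in step 2 above can be simulated concretely. The following sketch is our own illustration, not part of the paper's construction: it computes the set of word lengths occurring in Lε(x1^e1 · · · xn^en) (all non-negative integer combinations of the exponents) and replays the superset queries {w | |w| ≠ j}; the first answer ‘no’ reveals the minimal exponent e.

```python
def achievable_lengths(exponents, limit):
    """Word lengths of L_eps(x1^e1 ... xn^en): substituting a word of length
    k_i for x_i yields a word of length sum(e_i * k_i), so the lengths are
    exactly the non-negative integer combinations of the exponents."""
    lengths = {0}
    frontier = {0}
    while frontier:
        frontier = {l + e for l in frontier for e in exponents if l + e <= limit}
        frontier -= lengths
        lengths |= frontier
    return lengths

def minimal_exponent(exponents, limit=100):
    """Replay step 2: the query {w : |w| != j} is a superset of the target
    language iff no word of the target has length j; the first 'no' occurs
    at the minimal exponent e."""
    lengths = achievable_lengths(exponents, limit)
    j = 1
    while j not in lengths:  # oracle answers 'yes': keep increasing j
        j += 1
    return j
```

For the example target Lε(x1^4 x2^2 x3^8), the simulated queries answer ‘yes’ for j = 1 and ‘no’ for j = 2, matching the first rows of the table.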

It remains to prove Assertion (iv) for 1-variable erasing pattern languages:
Assume the target language is L = Lε(p) for some 1-variable pattern p. Let v be the shortest word in L, v = v1 · · · vl for v1, . . . , vl ∈ Σ. A query learner M successful for all 1-variable erasing pattern languages may operate as follows:
1. With the help of the queries Σ* \ {a} and Σ* \ {b}, the learner M can find out whether or not L = Lε(x1). If yes, then M returns the hypothesis L = Lε(x1) and stops; otherwise M acts as described in 2.
2. The queries {w | |w| ≠ j} for j = 0, 1, 2, . . . help to compute the length l of v. To compute v itself, the |Σ|^l candidates for v are recursively split into two equally large sets V1 and V2; which of these sets is taken under consideration in each splitting step only depends on the query V1 ∪ {w | |w| ≠ l}. If v is computed, M goes on as in 3.
3. M poses the query Lε(v). On answer ‘yes’, M returns the hypothesis Lε(v) and stops. On answer ‘no’, M queries all the languages Lε(pi), 1 ≤ i ≤ l + 1, where pi is the pattern resulting from x1 v1 x2 v2 · · · xl vl xl+1 when the variable xi is deleted. Thus M can detect exactly those positions in v where the only variable has to occur (at least once). Knowing the positions of the variables, M goes on as in 4.
4. By posing the queries {v} ∪ {w | |w| ≥ l + j} for j = 1, 2, . . ., M finds out the number j* of occurrences of the variable x1 in p. Afterwards, special queries concerning the words of length l + j* help to find out the multiplicity of x1 in the positions computed in 3. Finally, a hypothesis for Lε(p) is returned.
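Step 2's halving strategy is an ordinary binary search driven by superset queries. The following is a minimal sketch of our own; `sup_oracle(v1)` is a hypothetical stand-in that truthfully answers the restricted superset query for V1 ∪ {w | |w| ≠ l}, which is a superset of L exactly when the unique shortest word v lies in V1.

```python
from itertools import product

def find_shortest_word(sigma, l, sup_oracle):
    """Identify the unique shortest word v (of known length l) of the target
    language with about l * log2(|sigma|) superset queries."""
    candidates = [''.join(p) for p in product(sigma, repeat=l)]
    while len(candidates) > 1:
        v1 = candidates[:len(candidates) // 2]
        # query V1 ∪ {w : |w| != l}: a superset of L iff v ∈ V1
        if sup_oracle(v1):
            candidates = v1       # v lies in the first half
        else:
            candidates = candidates[len(candidates) // 2:]
    return candidates[0]

# toy oracle for a target whose shortest word is 'ab'
oracle = lambda v1: 'ab' in v1
```

With Σ = {a, b, c} and l = 2, this oracle reproduces the three splitting queries shown in the example table below.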

All in all, this method is successful for all 1-variable erasing pattern languages, but uses only polynomially many queries. Instead of formalizing the details, we illustrate the idea with an example.
Assume Σ = {a, b, c} and the target language is Lε(a x1^3 b x1^2). Then the corresponding learning scenario can be described by the following table.

Step  Query                                 Reply
1     Σ* \ {a}                              ‘yes’
2     {w | |w| ≠ 0}                         ‘yes’
      {w | |w| ≠ 1}                         ‘yes’
      {w | |w| ≠ 2}                         ‘no’
      (* l = 2. v ∈ {aa, ab, ac, ba, bb, bc, ca, cb, cc}. *)
      {aa, ab, ac, ba} ∪ {w | |w| ≠ 2}      ‘yes’
      {aa, ab} ∪ {w | |w| ≠ 2}              ‘yes’
      {aa} ∪ {w | |w| ≠ 2}                  ‘no’
      (* v = ab. *)
3     Lε(ab)                                ‘no’
      Lε(a x2 b x3)                         ‘yes’
      Lε(x1 a b x3)                         ‘no’
      Lε(x1 a x2 b)                         ‘no’
      (* p = a x1^e1 b x1^e2 for some e1, e2 ≥ 1. *)
4     {ab} ∪ {w | |w| ≥ 3}                  ‘yes’
      ...
      {ab} ∪ {w | |w| ≥ 7}                  ‘yes’
      {ab} ∪ {w | |w| ≥ 8}                  ‘no’
      (* j* = 5. Test a^2 b a^4, a^3 b a^3, a^4 b a^2, a^5 b a. *)
      Σ* \ {a^2 b a^4}                      ‘yes’
      Σ* \ {a^3 b a^3}                      ‘yes’
      Σ* \ {a^4 b a^2}                      ‘no’
Output: Lε(a x1^3 b x1^2)

Further details are omitted. □

Theorem 14. Polynomially many queries do not suffice to learn the class of all
regular erasing pattern languages with either membership queries, or restricted
subset queries, or extra restricted subset queries.

Proof. Note that, for any n ≥ 0, there are at least |Σ|^n distinct regular patterns such that each pair of corresponding erasing pattern languages is disjoint. By a result of Angluin [3], given n ≥ 0, any query learner identifying each of these |Σ|^n regular erasing pattern languages using membership or restricted subset queries must make |Σ|^n − 1 queries in the worst case. Angluin’s proof can be adapted to the case of learning with extra restricted subset queries. Concerning membership queries and restricted subset queries, Theorem 14 has also been verified by Nessel and Lange [13]. □
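The counting argument can be replayed on a toy scale. The sketch below is our own rendering of Angluin's adversary, restricted to membership queries over finite candidate languages; the pairwise disjoint candidates play the role of the |Σ|^n regular erasing pattern languages (for instance, the constant patterns of length n have pairwise disjoint singleton languages).

```python
def disjoint_adversary(candidates):
    """Adversary for pairwise-disjoint candidate languages: a queried word
    lies in at most one candidate, so answering 'no' whenever consistent
    eliminates at most one candidate per query; hence N - 1 queries are
    forced before a single candidate remains."""
    alive = [set(c) for c in candidates]

    def answer(w):
        nonlocal alive
        survivors = [c for c in alive if w not in c]
        if survivors:        # 'no' is still consistent with some target
            alive = survivors
            return 'no'
        return 'yes'         # w lies in every remaining candidate

    answer.remaining = lambda: len(alive)
    return answer
```

With four singleton candidates, three queries are answered ‘no’ before the target is pinned down, matching the |Σ|^n − 1 lower bound in miniature.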
It remains open whether or not, for any k ≥ 3, the class of all k-variable erasing pattern languages, or at least the class of all constant-free k-variable erasing pattern languages, is learnable using polynomially many extra restricted superset queries. Until now, we have only been successful in showing that Theorem 13, Assertion (iv), generalizes to the case of learning the class of all 2-variable erasing pattern languages. A method similar to, but slightly extending, the one used in the proof above for 1-variable erasing pattern languages can be used. The relevant details are omitted.

5 Connections to Gold-Style Learning

Comparing query learning to the standard models of Gold-style language learning from positive examples requires some more notions. These will be kept short; see, e.g., Gold [7], Angluin [1], and Zeugmann and Lange [18] for more details.
Let L be a language. Any infinite sequence t = (wj)j≥0 with {wj | j ≥ 0} = L is called a text for L. For any n ≥ 0, tn denotes the initial segment w0, . . . , wn, and t+n denotes the set {w0, . . . , wn}.
Let C be an indexable class, let H = (Li)i≥0 be a hypothesis space, and let L ∈ C. An inductive inference machine (IIM) is an algorithmic device that reads longer and longer initial segments of a text and, from time to time, outputs numbers as its hypotheses. An IIM M returning some i is construed to hypothesize the language Li. Given a text t for L, M identifies L from t with respect to H if the sequence of hypotheses output by M, when fed t, stabilizes on a number i (i.e., past some point M always outputs the hypothesis i) with Li = L. M identifies C from text with respect to H if it identifies every L ∈ C from every corresponding text. We say that C can be conservatively identified with respect to H iff there is an IIM M that identifies C from text with respect to H and that performs exclusively justified mind changes, i.e., if M, on some text t, outputs hypotheses i and later i′ ≠ i, then M must have seen some word w ∉ Li before it outputs i′. In other words, M may only change its hypothesis when it has found hard evidence that it is wrong.
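A conservative learner is easy to sketch for a toy indexable family. In the illustration below (our own, not from the paper), the family is L_i = {a, aa, . . . , a^(i+1)}; the learner keeps its hypothesis i until a word outside Li appears in the text, so every mind change is justified.

```python
def conservative_iim(text):
    """Conservative learner for the toy family L_i = {a, aa, ..., a^(i+1)}:
    returns the sequence of hypotheses produced on the given text."""
    hyp = 0
    history = []
    for w in text:
        if len(w) > hyp + 1:   # w is outside L_hyp: a justified mind change
            hyp = len(w) - 1   # least index whose language covers all data seen
        history.append(hyp)
    return history
```

On a text for L_2 = {a, aa, aaa}, the hypothesis stabilizes as soon as the first word of length 3 has been seen.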
LimTxt (Consv Txt) denotes the collection of all indexable classes C′ for which there are an IIM M′ and a hypothesis space H′ such that M′ (conservatively) identifies C′ from text with respect to H′. Note that Consv Txt ⊂ LimTxt, cf. Zeugmann and Lange [18].
For the next theorem, let xSupQ denote the collection of all indexable classes which are learnable with extra restricted superset queries.

Theorem 15. Consv Txt = xSupQ.

Proof. “Consv Txt ⊆ xSupQ”:
Fix C ∈ Consv Txt. Then there are an indexing (Li)i≥0 and an IIM M such that M Consv Txt-identifies C with respect to (Li)i≥0. Obviously, if L ∈ C and t is a text for L, then M never returns an index i with L ⊂ Li on any segment of t.
Now the underlying indexable class used for the queries contains all languages in (Li)i≥0 and all languages Li \ {w} for i ≥ 0 and w ∈ Σ*. A learner M′ identifying any L ∈ C with extra restricted superset queries may work as follows. M′ looks for a superset of L and uses queries on variants of this superset to construct a text for L.
First, to find a superset of L, M′ poses the queries L0, L1, . . . , until the answer ‘yes’ is received for the first time, say upon the query Lk. (* Note that L ⊆ Lk. *)
Second, to effectively enumerate a text t for L, M′ determines the set T of all words w ∈ Σ* for which the query Lk \ {w} is answered with ‘no’. Since T = L and T is recursively enumerable in k, any effective enumeration of T yields a text for L.
Third, to compute its hypothesis, M′ executes steps 0, 1, 2, . . . until it receives a stop signal. In general, step n, n ≥ 0, consists of the following instructions:
Determine i := M(tn), where t is a fixed effective enumeration of the set T. Pose the query Li. If the answer is ‘no’, execute step n + 1. Otherwise hypothesize i and stop. (* In the latter case, as M never hypothesizes a proper superset of L, M′ returns an index for L. *)
Further details are omitted.
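For finite toy languages, the three phases of M′ can be simulated directly. The sketch below is our own illustration: the family, the set-based superset oracle, and the IIM interface are hypothetical stand-ins, and finiteness sidesteps the effectivity issues handled in the proof.

```python
def xsupq_from_consv(family, target, iim):
    """family: list of finite sets L_i; target: some member L of the class;
    iim(segment) returns the conservative learner's hypothesis index.
    The restricted superset oracle is evaluated directly on sets."""
    sup = lambda q: q >= target              # restricted superset query
    # Phase 1: find the first index k with L_k ⊇ L.
    k = next(i for i, lang in enumerate(family) if sup(lang))
    # Phase 2: T = {w ∈ L_k : query L_k \ {w} answered 'no'}; then T = L.
    text = sorted(w for w in family[k] if not sup(family[k] - {w}))
    # Phase 3: feed the text to the IIM; return its first hypothesis i
    # whose query L_i is answered 'yes'.
    for n in range(1, len(text) + 1):
        i = iim(text[:n])
        if sup(family[i]):
            return i
    return i
```

For example, with the family [{'a'}, {'a','b'}, {'a','b','c'}] and an IIM that outputs the least index consistent with the data seen, the simulated learner returns index 1 for the target {'a','b'}.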
“xSupQ ⊆ Consv Txt”:
Fix an indexable class C ∈ xSupQ. Then there are an indexing (Li)i≥0 and a query learner M such that M identifies C with extra restricted superset queries respecting (Li)i≥0. A new indexing (L′i)i≥0 is defined as follows:
– L′0 is the empty language.
– If i′ is the canonical index of the finite set {i1, . . . , in}, then L′i′ = Li1 ∩ · · · ∩ Lin.
An IIM M′ identifying C in the limit from text with respect to the hypothesis space (L′i)i≥0, given a text t, may work as follows.
M′(t0) := 0.
To compute M′(tn+1), the learner M′ simulates a query learning scenario with M for n steps of computation. If M does not return a hypothesis in the n-th step, then M′(tn+1) := M′(tn). Additionally, if M poses the query Li in the n-th step, then M will receive the answer ‘no’ if t+n \ Li ≠ ∅ (i.e., if Li ⊉ t+n), and the answer ‘yes’ otherwise. If M returns a hypothesis i in the n-th step, then the hypothesis M′(tn+1) is computed as follows:
• Let Li1+ , . . . , Lim+ be the queries answered with ‘yes’ in the currently simulated scenario.
• Compute the canonical index i′ of the set {i, i1+ , . . . , im+ }.
• Return the hypothesis M′(tn+1) = i′.

It is not hard to verify that M′ learns C in the limit from text; the relevant details are omitted. Moreover, as we will see next, M′ avoids overgeneralized hypotheses; that means, if t is a text for some L ∈ C, n ≥ 0, and M′(tn) = i′, then L′i′ ⊅ L. Therefore, M′ can easily be transformed into a learner M′′ which identifies the class C conservatively in the limit from text.5
To prove that M′ learns C in the limit from text without overgeneralizations, assume to the contrary that there are an L ∈ C, a text t for L, and an n ≥ 0 such that the hypothesis i′ = M′(tn) fulfills L′i′ ⊃ L. Then i′ ≠ 0. By the definition of M′, there must be a learning scenario S for M, in which

– M poses queries Li1− , . . . , Lik− , Li1+ , . . . , Lim+ (in some particular order);
– the queries Li1− , . . . , Lik− are answered with ‘no’;
– the queries Li1+ , . . . , Lim+ are answered with ‘yes’;
– afterwards M returns the hypothesis i.

Hence i′ is the canonical index of the set {i, i1+ , . . . , im+ }. This implies L′i′ = Li ∩ Li1+ ∩ · · · ∩ Lim+ . So each of the languages Li1+ , . . . , Lim+ is a superset of L. By the definition of M′, Lij− ⊉ t+n for 1 ≤ j ≤ k. Therefore none of the languages Li1− , . . . , Lik− is a superset of L. So the answers in the learning scenario S above are truthful respecting the language L. As M learns C with extra restricted superset queries, the hypothesis i must be correct for L, i.e., Li = L. This yields L′i′ ⊆ L, in contradiction to L′i′ ⊃ L.
So M′ learns C in the limit from text without overgeneralizations, which, by the argumentation above, implies C ∈ Consv Txt. □
As an immediate consequence of xSupQ = Consv Txt and the fact that
Consv Txt is a proper subset of LimTxt we obtain the following corollary.

Corollary 1. xSupQ ⊂ LimTxt.

Theorem 15 and Corollary 1 are relevant to the open question of whether or not the class of all erasing pattern languages is learnable in the limit from text if the underlying alphabet consists of at least three symbols. Obviously, if this class is learnable with extra restricted superset queries, then the open question can be answered in the affirmative. Conversely, if it is not learnable with extra restricted superset queries, then it is not conservatively learnable in the limit from text. Of course, the latter would not yet imply that the open question can be answered in the negative. Still, it would at least suggest that this is the case, since until now no ‘natural’ class is known that separates LimTxt from Consv Txt.

5 Note that a result by Zeugmann and Lange [18] states that any indexable class which is learnable in the limit from text without overgeneralizations belongs to Consv Txt.

References
1. D. Angluin. Inductive inference of formal languages from positive data. Informa-
tion and Control, 45:117–135, 1980.
2. D. Angluin. Finding patterns common to a set of strings. Journal of Computer
and System Sciences, 21:46–62, 1980.
3. D. Angluin. Queries and concept learning. Machine Learning, 2:319–342, 1988.
4. D. Angluin. Queries revisited. Proc. Int. Conf. on Algorithmic Learning Theory,
LNAI 2225, 12–31, Springer, 2001.
5. S. Arikawa, S. Miyano, A. Shinohara, S. Kuhara, Y. Mukouchi, T. Shinohara.
A machine discovery from amino acid sequences by decision trees over regular
patterns. New Generation Computing, 11:361–375, 1993.
6. T. Erlebach, P. Rossmanith, H. Stadtherr, A. Steger, T. Zeugmann. Learning one-
variable pattern languages very efficiently on average, in parallel, and by asking
questions. Proc. Int. Conf. on Algorithmic Learning Theory, LNAI 1316, 260–276,
Springer, 1997.
7. E. M. Gold. Language identification in the limit. Information and Control, 10:447–
474, 1967.
8. J. E. Hopcroft, J. D. Ullman. Introduction to Automata Theory, Languages, and
Computation. Addison-Wesley Publishing Company, 1979.
9. T. Jiang, A. Salomaa, K. Salomaa, S. Yu. Decision problems for patterns. Journal
of Computer and System Sciences, 50:53–63, 1995.
10. S. Lange, T. Zeugmann. Types of monotonic language learning and their char-
acterization. Proc. ACM Workshop on Computational Learning Theory, 377–390.
ACM Press, 1992.
11. S. Matsumoto, A. Shinohara. Learning pattern languages using queries. Proc.
European Conf. on Computational Learning Theory, LNAI 1208, 185–197, Springer,
1997.
12. A. Mitchell. Learnability of a subclass of extended pattern languages. Proc. ACM Workshop on Computational Learning Theory, 64–71, ACM Press, 1998.
13. J. Nessel, S. Lange. Learning erasing pattern languages with queries. Proc. Int.
Conf. on Algorithmic Learning Theory, LNAI 1968, 86–100, Springer, 2000.
14. D. Reidenbach. A negative result on inductive inference of extended pattern lan-
guages. Proc. Int. Conf. on Algorithmic Learning Theory, LNAI 2533, 308–320,
Springer, 2002.
15. A. Salomaa. Patterns (the formal language theory column). EATCS Bulletin,
54:46–62, 1994.
16. A. Salomaa. Return to patterns (the formal language theory column). EATCS
Bulletin, 55:144–157, 1995.
17. T. Shinohara. Polynomial time inference of extended regular pattern languages.
Proc. RIMS Symposium on Software Science and Engineering, LNCS 147, 115–127,
Springer, 1983.
18. T. Zeugmann, S. Lange. A guided tour across the boundaries of learning recursive
languages. Algorithmic Learning for Knowledge-Based Systems, LNAI 961, 190–
258, Springer, 1995.
Learning of Finite Unions of Tree Patterns with
Repeated Internal Structured Variables from
Queries

Satoshi Matsumoto1, Yusuke Suzuki2, Takayoshi Shoudai2, Tetsuhiro Miyahara3, and Tomoyuki Uchida3
1
Department of Mathematical Sciences, Tokai University, Hiratsuka 259-1292, Japan
matumoto@[Link]
2
Department of Informatics, Kyushu University, Kasuga 816-8580, Japan
{y-suzuki,shoudai}@[Link]
3
Faculty of Information Sciences, Hiroshima City University, Hiroshima 731-3194,
Japan
uchida@[Link]
miyahara@[Link]

Abstract. In the field of Web mining, a Web page can be represented by a rooted tree T such that every internal vertex of T has ordered children and string data such as tags or texts are assigned to the edges of T. A term tree is an ordered tree pattern, which has ordered tree structures and variables, and is suited for representing tree structured patterns in Web pages. A term tree t is allowed to have a repeated variable, which occurs in t more than once. In this paper, we consider the learnability of finite unions of term trees with repeated variables in the query learning model of Angluin (1988). We present polynomial time learning algorithms for finite unions of term trees with repeated variables using superset and restricted equivalence queries. Moreover, we show that there exists no polynomial time learning algorithm for finite unions of term trees using restricted equivalence, membership and subset queries. This result indicates the hardness of learning finite unions of term trees in the query learning model.

1 Introduction
In the field of Web mining, Web documents such as HTML/XML files have
tree structures and are called tree structured data. Tree structured data can be
naturally represented by a rooted tree T such that every internal vertex of T has ordered children, every vertex has no label, and every edge has a label [1]. We
are interested in extracting a set (or a union) of tree structured patterns which
explains heterogeneous tree structured data having no rigid structure. With this motivation, in this paper we consider the polynomial time learnability of finite unions of tree structured patterns in the query learning model of Angluin [5].
A term tree is a rooted tree pattern which consists of tree structures, ordered
children and internal structured variables [10,13]. A variable in a term tree is a

R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 144–158, 2003.
© Springer-Verlag Berlin Heidelberg 2003
Query Learning of Finite Unions of Tree Patterns with Repeated Variables 145

list of two vertices and it can be substituted by an arbitrary tree. For example,
the term tree t in Fig. 1 is a tree pattern explaining the tree T in Fig. 1 because
T can be obtained from t by substituting variables x1 , x2 , x3 and x4 by trees
g1, g2, g3 and g4 in Fig. 1, respectively. In [2,3], Amoth et al. presented the into-matching semantics and introduced the classes of ordered tree patterns and ordered forests with this semantics. Such an ordered tree pattern is a standard
tree pattern, which is also called a first order term in formal logic. Since a term
tree may have variables consisting of two internal vertices (e.g. the variable x2 in
Fig. 1), a term tree is more powerful than an ordered tree pattern. For example,
in Fig. 1, the tree pattern f (b, x, g(a, z), y) can be represented by the term tree s,
but the term tree t cannot be represented by any standard tree pattern because
of the existence of internal structured variables represented by x2 and x3 in t.
Arimura et al. [7] presented ordered gapped tree patterns and ordered gapped
forests under into-matching semantics introduced in [3]. An ordered gapped tree
pattern is incomparable to a term tree, since a gap-variable in an ordered gapped
tree pattern does not exactly correspond to an internal variable in a term tree.
A variable with a variable label x in a term tree t is said to be repeated if x occurs in t more than once. In this paper, we treat term trees with repeated internal variables. In [7], Arimura et al. discussed the polynomial time learnability
of ordered gapped forests without repeated gap-variables in the query learning
model. In this paper, we discuss the polynomial time learnability of finite unions
of term trees with repeated variables in the query learning model. For a tree T which represents tree structured data such as Web documents, string data such as tags or texts are assigned to the edges of T. Hence, we naturally assume that the set of edge labels is infinite. Let Λ be a set of strings
used in tree structured data. Then, our target class of learning is the set OTFΛ
of all finite sets of term trees all of whose edges are labeled with elements in
Λ. A term tree t is said to be regular (or repetition-free) if all variable labels
in t are mutually distinct. The term tree language of a term tree t, denoted by
LΛ (t), is the set of all labeled trees which are obtained from t by substituting
arbitrary labeled trees for all variables in t. The language represented by a finite
set of term trees R = {t1 , t2 , . . . , tm } in OTFΛ is the finite union of m term tree
languages LΛ (R) = LΛ (t1 ) ∪ LΛ (t2 ) ∪ . . . ∪ LΛ (tm ).
In the query learning model of Angluin [5], a learning algorithm is said to
exactly learn a target finite set R∗ of term trees if it outputs a finite set R of
term trees such that LΛ (R) = LΛ (R∗ ) and halts, after it uses some queries. In
this paper, firstly, we present a polynomial time algorithm which exactly learns
any finite set in OTFΛ having m∗ term trees by using superset and restricted
equivalence queries. Next, we show that there exists no polynomial time learning algorithm for finite unions of term trees by using restricted equivalence, membership and subset queries. This result indicates the hardness of learning finite unions of term trees in the query learning model.
In the query learning model, many researchers [2,3,7,10] showed the exact learnability of several kinds of tree structured patterns (e.g., query learning of ordered forests under onto-matching semantics [6], of unordered forests under into-matching semantics [2,3], of ordered gapped forests [7], and of regular term trees [11]). In [10], we showed the polynomial time exact learnability of finite unions of regular term trees using restricted subset queries and equivalence queries. As for other learning models, in [13] we showed that the class of single regular term trees is polynomial time inductively inferable from positive data. Further, we gave a data mining method for semistructured data, based on a learning algorithm for regular term trees [12]. Further related works are discussed in the Conclusion.
This paper is organized as follows. In Sections 2 and 3, we explain term trees and the query learning model. In Section 4, we show that the class OTFΛ is exactly learnable in polynomial time by using superset and restricted equivalence queries. In Section 5, we show the hardness of learning unions of term trees in the query learning model.

2 Preliminaries

Definition 1. Let T = (VT, ET) be a rooted tree with ordered children which has a set VT of vertices and a set ET of edges. Let Et and Ht be a partition of ET, i.e., Et ∪ Ht = ET and Et ∩ Ht = ∅. And let Vt = VT. A triplet t = (Vt, Et, Ht) is called a term tree, and elements of Vt, Et and Ht are called a vertex, an edge and a variable, respectively.

For a term tree t and its vertices v1 and vi, a path from v1 to vi is a sequence v1, v2, . . . , vi of distinct vertices of t such that for any j with 1 ≤ j < i, there exists an edge or a variable which consists of vj and vj+1. If there is an edge or a variable which consists of v and v′ such that v lies on the path from the root to v′, then v is said to be the parent of v′ and v′ is a child of v. We use the notation [v, v′] to represent a variable {v, v′} ∈ Ht such that v is the parent of v′. Then we call v the parent port of [v, v′] and v′ the child port of [v, v′]. A term tree t is called ordered if every internal vertex u in t has a total ordering on all children of u. We define the size of t as the number of vertices in t and denote it by |t|. For a set S, the number of elements in S, called the size of S, is denoted by |S|.
For example, the ordered term tree t = (Vt , Et , Ht ) in Fig. 1 is defined as
follows. Vt = {v1 , . . . , v11 }, Et = {{v1 , v2 }, {v2 , v3 }, {v1 , v4 }, {v7 , v8 }, {v1 , v10 },
{v10 , v11 }} with the root v1 and the sibling relation displayed in Fig. 1. Ht =
{[v4 , v5 ], [v1 , v6 ], [v6 , v7 ], [v6 , v9 ]}.
For any ordered term tree t, a vertex u of t, and two children u′ and u′′ of u, we write u′ <^t_u u′′ if u′ is smaller than u′′ in the order of the children of u. We assume that every edge and variable of an ordered term tree is labeled with some words from specified languages. A label of a variable is called a variable label. Λ and X denote a set of edge labels and a set of variable labels, respectively, where Λ ∩ X = ∅. For brevity, we simply call an ordered term tree a term tree. In particular, a term tree t = (Vt, Et, Ht) is called regular if all variables in Ht have mutually distinct variable labels in X. We denote by OTT Λ (resp. µOTT Λ) the set of all term trees

[Figure 1 is omitted from this extraction. It shows the term tree t, the tree T obtained from t, the term tree s, and the ground term trees g1, g2, g3, g4 referred to in the text.]

Fig. 1. A term tree t explains a tree T. A term tree s represents the tree pattern f(b, x, g(a, z), y). A variable is represented by a box with lines to its elements. The label inside a box is the variable label of the variable.

(resp. regular term trees) with Λ as a set of edge labels, and by OTFΛ (resp.
µOTFΛ ) the set of all finite sets of term trees (resp. regular term trees) with Λ
as a set of edge labels. An ordered term tree with no variable is called a ground
term tree and considered to be a tree with ordered children. OT Λ denotes the
set of all ground term trees with Λ as a set of edge labels.
Let f = (Vf, Ef, Hf) and g = (Vg, Eg, Hg) be term trees. We say that f and g are isomorphic, denoted by f ≡ g, if there is a bijection ϕ from Vf to Vg such that (i) the root of f is mapped to the root of g by ϕ, (ii) {u, u′} ∈ Ef if and only if {ϕ(u), ϕ(u′)} ∈ Eg and the two edges have the same edge label, (iii) [u, u′] ∈ Hf if and only if [ϕ(u), ϕ(u′)] ∈ Hg and the two variables have the same variable label, and (iv) for any vertex u in f which has more than one child, and for any two children u′ and u′′ of u, u′ <^f_u u′′ if and only if ϕ(u′) <^g_ϕ(u) ϕ(u′′).
Let f and g be term trees with at least two vertices. Let h = [v, v′] be a variable in f with the variable label x and σ = [u, u′] a list of two distinct vertices in g, where u is the root of g and u′ is a leaf of g. The form x := [g, σ] is called a binding for x. A new term tree f′ = f{x := [g, σ]} is obtained by applying the binding x := [g, σ] to f in the following way. Let e1 = [v1, v1′], . . . , em = [vm, vm′] be the variables in f with the variable label x. Let g1, . . . , gm be m copies of g and ui, ui′ the vertices of gi corresponding to u, u′ of g, respectively. For each variable ei = [vi, vi′], we attach gi to f by removing the variable ei from Hf and by identifying the vertices vi, vi′ with the vertices ui, ui′ of gi.
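The binding operation can be sketched on a simple vertex encoding. The encoding below is our own (the paper defines term trees abstractly): a vertex is a list of (kind, label, child) entries, so edges and variables keep their sibling order, and each occurrence of x receives a fresh copy of g, with g's root identified with the parent port and the designated leaf with the child port.

```python
import copy

def apply_binding(tree, x, g, leaf_path):
    """Replace every variable labelled x by a copy of g.  A vertex is a list
    of (kind, label, child) entries with kind in {'edge', 'var'}; leaf_path
    is the sequence of child indices from g's root to the leaf u' (non-empty,
    since g has at least two vertices)."""
    result = []
    for kind, label, child in tree:
        child = apply_binding(child, x, g, leaf_path)
        if kind == 'var' and label == x:
            gi = copy.deepcopy(g)      # a fresh copy of g per occurrence of x
            leaf = gi
            for idx in leaf_path:      # walk down to the leaf u'
                leaf = leaf[idx][2]
            leaf.extend(child)         # identify u' with the child port v'
            result.extend(gi)          # identify g's root with the parent port v
        else:
            result.append((kind, label, child))
    return result
```

For instance, binding x := [g, [u, u′]], where g is the two-edge chain a, b, turns a term tree consisting of a variable above an edge c into the chain a, b, c.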

[Figure 2 is omitted from this extraction. It shows the term trees f and g and the instance f′ = f{x := [g, [u0, u1]]} with its new child ordering.]

Fig. 2. The new ordering on vertices in the term tree f′ = f{x := [g, [u0, u1]]}.

A substitution θ is a finite collection of bindings {x1 := [g1, σ1], . . . , xn := [gn, σn]}, where the xi are mutually distinct variable labels in X. The term tree fθ, called the instance of f by θ, is obtained by applying all the bindings xi := [gi, σi] to f simultaneously. We define a new total ordering <^fθ_v on the children of every vertex v in fθ in the following natural way. Suppose that v has more than one child and let u′ and u′′ be two children of v in fθ. If v is the parent port of variables [v, v1], . . . , [v, vk] of f with v1 <^f_v v2 <^f_v · · · <^f_v vk, we have the following four cases. Let gi be the term tree which is substituted for [v, vi] for i = 1, . . . , k.

1. If v, u′, u′′ ∈ Vg for a substituted copy g and u′ <^g_v u′′, then u′ <^fθ_v u′′.
2. If v, u′, u′′ ∈ Vf and u′ <^f_v u′′, then u′ <^fθ_v u′′.
3. If u′ ∈ Vgi, u′′ ∈ Vf, and vi <^f_v u′′ (resp. u′′ <^f_v vi), then u′ <^fθ_v u′′ (resp. u′′ <^fθ_v u′).
4. If u′ ∈ Vgi, u′′ ∈ Vgj (i ≠ j), and vi <^f_v vj, then u′ <^fθ_v u′′. If v is not a parent port of any variable, then u′, u′′ ∈ Vf, and therefore we have u′ <^fθ_v u′′ if u′ <^f_v u′′.
In Fig. 2, we give an example of the new ordering on vertices in a term tree. We define the root of the resulting term tree fθ as the root of f. Consider the examples in Fig. 1, where a term tree t is given. Let θ = {x1 := [g1, [u1, w1]], x2 := [g2, [u2, w2]], x3 := [g3, [u3, w3]], x4 := [g4, [u4, w4]]} be a substitution, where g1, g2, g3, and g4 are ground term trees in Fig. 1. Then the instance tθ of the term tree t by θ is isomorphic to the tree T in Fig. 1. Let t and t′ be term trees. We write t ⪯ t′ if there exists a substitution θ such that t ≡ t′θ. If t ⪯ t′ and t ≢ t′, then we write t ≺ t′. The term tree language LΛ(t) of a term tree t ∈ OTT Λ is {s ∈ OT Λ | s ⪯ t}. For a set H of term trees, we define LΛ(H) = ∪t∈H LΛ(t), and LΛ(H) is called the term tree language defined by H. In particular, we define LΛ(∅) = ∅.
Query Learning of Finite Unions of Tree Patterns with Repeated Variables 149

3 Learning Model

In this paper, let R∗ be a set in OTFΛ which is to be identified. We say that the set R∗ is a target. Without loss of generality, we assume that LΛ(R∗) ≠ LΛ(R∗ − {r}) for any r ∈ R∗, i.e., no member of R∗ is redundant.
We introduce the query learning model due to Angluin [5]. In this model, learning algorithms can access oracles that answer specific kinds of queries about the unknown set LΛ(R∗). We consider the following oracles.
1. Superset oracle SupR∗: The input is a set R in OTFΛ. If LΛ(R) ⊇ LΛ(R∗), then the output is "yes". Otherwise, it returns a counterexample t ∈ LΛ(R∗) − LΛ(R). The query is called a superset query.
2. Restricted equivalence oracle rEquivR∗: The input is a set R in OTFΛ. The output is "yes" if LΛ(R) = LΛ(R∗) and "no" otherwise. The query is called a restricted equivalence query.
3. Membership oracle MemR∗: The input is a ground term tree t in OTΛ. The output is "yes" if t ∈ LΛ(R∗), and "no" otherwise. The query is called a membership query.
4. Subset oracle SubR∗: The input is a set R in OTFΛ. The output is "yes" if LΛ(R) ⊆ LΛ(R∗). Otherwise, it returns a counterexample t ∈ LΛ(R) − LΛ(R∗). The query is called a subset query.
A learning algorithm A collects information about LΛ (R∗ ) by using queries
and outputs a set R in OTFΛ . We say that a learning algorithm A exactly
identifies a target R∗ in polynomial time using a certain type of queries if A halts
in polynomial time and outputs a set R ∈ OTFΛ such that LΛ (R) = LΛ (R∗ )
using queries of the specified type.
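The interaction between a learner and these oracles can be made concrete with a small runnable sketch. Below, languages are represented as plain Python sets and the target is a finite set; the function names (`superset`, `restricted_equivalence`, `membership`, `subset`) mirror the four oracles above, but the set-based representation and the simple learner are our illustration, not the paper's term tree machinery.

```python
# Minimal sketch of the query learning model over finite languages,
# with languages represented as plain Python sets (an illustrative
# simplification of the paper's setting).

def make_oracles(target):
    """Build the four oracles for a fixed target language (a finite set)."""
    def superset(hypothesis):
        # "yes" if the hypothesis covers the target, else a counterexample
        # from target - hypothesis.
        missing = target - hypothesis
        return ("yes", None) if not missing else ("no", next(iter(missing)))

    def restricted_equivalence(hypothesis):
        # "yes"/"no" only, never a counterexample.
        return "yes" if hypothesis == target else "no"

    def membership(element):
        return "yes" if element in target else "no"

    def subset(hypothesis):
        extra = hypothesis - target
        return ("yes", None) if not extra else ("no", next(iter(extra)))

    return superset, restricted_equivalence, membership, subset


def learn_with_superset(superset, restricted_equivalence):
    """Exact identification loop: add each counterexample until the
    superset oracle accepts, then confirm with restricted equivalence."""
    hypothesis = set()
    answer, counterexample = superset(hypothesis)
    while answer == "no":
        hypothesis.add(counterexample)
        answer, counterexample = superset(hypothesis)
    assert restricted_equivalence(hypothesis) == "yes"
    return hypothesis


target = {"t1", "t2", "t3"}
sup, req, mem, sub = make_oracles(target)
print(learn_with_superset(sup, req) == target)  # True
```

For finite targets, every counterexample of the superset oracle is an element of the target, so the loop adds only correct elements and halts with exactly the target.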

4 Learning Using Superset and Restricted Equivalence


Queries

In this section, we show the learnability of finite unions of term tree languages
in the query learning model. In Subsection 4.1, we introduce some notations. In
Subsection 4.2, we show that any set in OTFΛ is exactly identifiable in polynomial
time using superset queries if the size of a target is known. In Subsection 4.3,
we show that any set in OTFΛ is exactly identifiable in polynomial time using
superset and restricted equivalence queries even if the size of a target is unknown.

4.1 Compactness and Extensions of Term Trees

The following property, called compactness, plays an important role in the learning of unions of languages [7,8]. In light of Lemma 1 below, we assume throughout this paper that |Λ| is infinite.

Fig. 3. For regular term trees t1, t2, t3 and t4: LΛ(t1) ⊆ LΛ(t2) and LΛ(t4) ⊆ LΛ(t3), but t1 ⋠ t2 and t4 ⋠ t3.

Lemma 1. Let r be a term tree in OTTΛ, R a set in OTFΛ, and |Λ| infinite. Then r ⪯ r′ for some r′ ∈ R if and only if LΛ(r) ⊆ LΛ(R).

Proof. Let wr be the ground term tree obtained from r by substituting, for each variable, an edge whose label is fresh: the labels are mutually distinct and appear in no term tree in R. Since wr ∈ LΛ(r), if LΛ(r) ⊆ LΛ(R), then there exists a term tree r′ ∈ R such that wr ∈ LΛ(r′). Since none of the substituted labels appears in R, we obtain r ⪯ r′ by inverting the substitution. The converse direction is immediate. □

Let Λ = {a, b}. In Fig. 3, we then have LΛ(t1) ⊆ LΛ(t2) but t1 ⋠ t2; thus, if |Λ| = 2, compactness does not hold. Moreover, let Λ = {a, b, c}. Then LΛ(t4) ⊆ LΛ(t3) but t4 ⋠ t3. These examples show why we require |Λ| to be infinite for compactness to hold.
We now introduce operations that increase the number of variables in a term tree.

Fig. 4. Regular term trees T2, T4 and T6 are obtained by extensions from T1, T3 and T5 respectively, that is, T1 ⇒^1_{[v1,v2]} T2, T3 ⇒^2_{[v1,v2]} T4 and T5 ⇒^3_{[v1,v2]} T6. It is clear that T2 ≺ T1, T4 ≺ T3, T6 ≺ T5, |T2| > |T1|, |T4| > |T3| and |T6| > |T5|.

Definition 2. Let r = (Vr, Er, Hr) be a term tree in µOTTΛ, v1, v2 ∈ Vr, and let v3 ∉ Vr be a new vertex. We define three extensions of term trees as the following operations:

1. If [v1, v2] ∈ Hr, then Vr := Vr ∪ {v3} and Hr := Hr ∪ {[v1, v3], [v3, v2]} − {[v1, v2]}. The variables [v1, v3] and [v3, v2] have mutually distinct variable labels which do not appear in r.
2. If [v1, v2] ∈ Hr, then Vr := Vr ∪ {v3}, Hr := Hr ∪ {[v1, v3]}, and v2 becomes the next sibling of v3. The variable [v1, v3] has a variable label which does not appear in r.
3. If [v1, v2] ∈ Hr, then Vr := Vr ∪ {v3}, Hr := Hr ∪ {[v1, v3]}, and v3 becomes the next sibling of v2. The variable [v1, v3] has a variable label which does not appear in r.
Let r = (Vr, Er, Hr) be a regular term tree and r′ the regular term tree obtained from r by the i-th extension of Definition 2 applied to a variable h ∈ Hr. Then we write r ⇒^i_h r′. For example, in Fig. 4, the regular term trees T2, T4 and T6 are obtained by extensions from the regular term trees T1, T3 and T5 respectively, that is, T1 ⇒^1_{[v1,v2]} T2, T3 ⇒^2_{[v1,v2]} T4 and T5 ⇒^3_{[v1,v2]} T6. Then we have T2 ≺ T1, T4 ≺ T3, T6 ≺ T5, |T2| > |T1|, |T4| > |T3| and |T6| > |T5|. We define ES(r) as follows:

ES(r) = {r′ ∈ µOTTΛ | r ⇒^i_h r′ for some h ∈ Hr and some i ∈ {1, 2, 3}}.

Note that |r′| > |r| and r′ ≺ r for any r′ ∈ ES(r). The number of non-isomorphic term trees in ES(r) is at most 3|r|.
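To make the three extensions concrete, here is a toy sketch under the simplifying assumption (which matches Rnocheck in LEARN KNOWN) that the term tree consists of variables only. A tree is a dict mapping each vertex to its ordered list of children, every parent/child pair is a variable, and all names are illustrative, not the paper's.

```python
# Toy sketch of the three extension operations on a variables-only tree.
import itertools

_counter = itertools.count()

def fresh():
    """Generate a fresh vertex name not used before."""
    return f"v{next(_counter)}"

def extend(tree, parent, child, kind):
    """Apply extension 1, 2 or 3 to the variable [parent, child]."""
    new = {u: list(cs) for u, cs in tree.items()}
    v3 = fresh()
    pos = new[parent].index(child)
    if kind == 1:          # interpose v3 on the variable [parent, child]
        new[parent][pos] = v3
        new[v3] = [child]
    elif kind == 2:        # add v3 so that child is the next sibling of v3
        new[parent].insert(pos, v3)
        new[v3] = []
    else:                  # kind == 3: add v3 as the next sibling of child
        new[parent].insert(pos + 1, v3)
        new[v3] = []
    return new

def variables(tree):
    """All parent/child pairs, i.e. all variables of the tree."""
    return [(u, c) for u, cs in tree.items() for c in cs]

def ES(tree):
    """All trees obtained by one extension applied to one variable."""
    return [extend(tree, u, c, k) for (u, c) in variables(tree) for k in (1, 2, 3)]

r = {"root": ["a"], "a": []}          # a single variable [root, a]
print(len(ES(r)))                     # 3 = 3 * (number of variables)
```

Each extension adds exactly one vertex and one variable, so |r′| = |r| + 1, and applying all three extensions to every variable yields at most 3|Hr| candidates, in line with the 3|r| bound above.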

4.2 The Size of a Target Is Known

In this subsection, we assume that the size of R∗ is known in advance, and let |R∗| = m∗. We show that the algorithm LEARN KNOWN in Fig. 5 exactly identifies any set R∗ ∈ OTFΛ in polynomial time using superset queries. In LEARN KNOWN, Rhypo denotes a hypothesis set, which is included in OTFΛ, and Rnocheck denotes the set of regular term trees that have not yet been checked by the algorithm LEARN OTT. Note that Rnocheck ∈ µOTFΛ and each regular term tree in Rnocheck consists of variables only.

Lemma 2. Let R be a set in µOTFΛ, r a term tree in R, and R′ a set in OTFΛ. If LΛ(R′) ⊆ LΛ(R) and LΛ(R′) ⊈ LΛ(R − {r}) ∪ LΛ(ES(r)), then there exists a term tree r′ ∈ R′ such that r′ ⪯ r and |r′| = |r|.

Proof. Let rc be a ground term tree in LΛ(R′) − (LΛ(R − {r}) ∪ LΛ(ES(r))). Since rc ∈ LΛ(R′), there exists a term tree r′ in R′ such that rc ∈ LΛ(r′). We assume r′ ⋠ r. By Lemma 1, there exists a term tree r″ in R such that r′ ⪯ r″; since r′ ⋠ r, we have r″ ≠ r. Then rc ∈ LΛ(r′) ⊆ LΛ(r″) ⊆ LΛ(R − {r}). This is a contradiction. Thus, we have r′ ⪯ r. Since r′ ⪯ r, it is clear that |r′| ≥ |r|. We assume |r′| > |r|. Then rc ∈ LΛ(r′) ⊆ LΛ(ES(r)). This is a contradiction. Thus, we have |r′| = |r|. □

We denote by rin (resp. Rin ) a term tree (resp. a set of term trees) given to
the algorithm LEARN OTT in Fig. 6. By Lemma 2, the algorithm LEARN OTT
always takes as input a term tree rin such that r∗  rin and |r∗ | = |rin | for some
r∗ ∈ R∗ . Note that ES(rin ) ⊆ Rin and (Rnocheck − {rin }) ⊆ (Rhypo − {rin }) ⊆
Rin .
We show that if there exists a term tree r∗ ∈ R∗ such that r∗ ≡ rin , then
the algorithm outputs rin . Otherwise, the algorithm calls itself recursively and
gives a term tree r such that r ≺ rin and |r| = |rin |.
We give some notations. Let r be a term tree in OTT Λ , α an edge label and
x, y variable labels appearing in r. We denote by Xr the set of all variable labels
appearing in r, and by ρe (r, x, α) (resp. ρv (r, x, y)) the term tree obtained from
r by replacing all variables having the variable label x with edges having the

Algorithm LEARN KNOWN

Given: An oracle SupR∗ for the target R∗ ∈ OTFΛ and an integer m with m ≥ m∗;
Output: A set R ∈ OTFΛ with LΛ(R) = LΛ(R∗);
begin
  Let Rhypo := ∅;
  if SupR∗(Rhypo) = "yes" then output Rhypo;
  else begin
    Let r = ({u, v}, ∅, {[u, v]}) ∈ µOTTΛ and R = {r};
    Let Rhypo := Rnocheck := R;
    while Rnocheck ≠ ∅ do begin
      foreach r ∈ Rnocheck do
        if SupR∗((Rhypo − {r}) ∪ ES(r)) = "yes" then begin
          Rhypo := (Rhypo − {r}) ∪ ES(r);
          Rnocheck := (Rnocheck − {r}) ∪ ES(r);
          /* Remove redundant term trees in ES(r). */
          foreach r′ ∈ ES(r) do
            if SupR∗(Rhypo − {r′}) = "yes" then begin
              Rhypo := Rhypo − {r′};
              Rnocheck := Rnocheck − {r′};
            end;
        end
        else begin
          R′ := LEARN OTT(m, (Rhypo − {r}) ∪ ES(r), r);
          Rhypo := (Rhypo − {r}) ∪ R′ ∪ ES(r);
          Rnocheck := (Rnocheck − {r}) ∪ ES(r);
          /* Remove redundant term trees in ES(r). */
          foreach r′ ∈ ES(r) do
            if SupR∗(Rhypo − {r′}) = "yes" then begin
              Rhypo := Rhypo − {r′};
              Rnocheck := Rnocheck − {r′};
            end;
        end;
    end;
    output Rhypo;
  end;
end.

Fig. 5. Algorithm LEARN KNOWN

edge label α (resp. with variables having the variable label y). For a subset ∆ of Λ, we define the set RS∆(r) as follows:

RS∆(r) = {ρe(r, x, α) ∈ OTTΛ | x ∈ Xr and α is an edge label in ∆}
       ∪ {ρv(r, x, y) ∈ OTTΛ | x, y ∈ Xr and x ≠ y}.

If r ∈ OTΛ, then we define RS∆(r) = ∅. Note that r′ ≺ r and |r′| = |r| for any r′ ∈ RS∆(r), and the number of non-isomorphic term trees in RS∆(r) is at most |r| · |∆| + |r|^2.
In the algorithm LEARN OTT, let t1, t2, . . ., ti, . . . and ∆1, ∆2, . . ., ∆i, . . . (i ≥ 1) be the sequence of counterexamples returned by the superset queries in line 7 and the sequence of finite subsets of Λ obtained in line 9, respectively. Let ∆0 be the finite subset of Λ obtained in line 6. We suppose that at each stage i ≥ 0, LEARN OTT makes a superset query SupR∗(Rin ∪ RS∆i(rin)) and receives a counterexample t_{i+1}. First we assume that rin ≡ r∗ for some r∗ ∈ R∗. Then we have the following lemma.

Lemma 3. For any i ≥ 0, LΛ(R∗) ⊈ LΛ(Rin ∪ RS∆i(rin)).

Proof. If rin has no variable, the claim is clear, so we assume that rin has variables. The proof is by induction on the number of iterations i ≥ 0 of the while-loop. In case i = 0: since |Λ| is infinite, there exists an edge label in Λ − ∆0. Thus, we can construct a ground term tree r with r ∈ LΛ(rin) − LΛ(Rin ∪ RS∆0(rin)). Since LΛ(rin) ⊆ LΛ(R∗), it follows that LΛ(R∗) ⊈ LΛ(Rin ∪ RS∆0(rin)).
We assume inductively that the result holds for any number of iterations of the while-loop less than i. By the inductive hypothesis, LΛ(R∗) ⊈ LΛ(Rin ∪ RS∆i−1(rin)); thus, the counterexample ti is obtained. Since |Λ| is infinite, there exists an edge label in Λ − ∆i, so again we can construct a ground term tree r ∈ LΛ(rin) − LΛ(Rin ∪ RS∆i(rin)). Therefore, LΛ(R∗) ⊈ LΛ(Rin ∪ RS∆i(rin)). □

Next we assume that rin ≢ r∗ for any r∗ ∈ R∗. Let r∗1, . . . , r∗ℓ be the term trees in R∗ such that r∗i ≺ rin and |r∗i| = |rin| for each i ∈ {1, . . . , ℓ}, where ℓ ≤ m∗. Since LΛ(R∗) ⊆ LΛ(Rin ∪ {rin}) and ES(rin) ⊆ Rin, we have LΛ(R∗ − {r∗1, . . . , r∗ℓ}) ⊆ LΛ(Rin). Then we have the following lemma.

Lemma 4. For each stage i ≥ 0, there exists a subset S of {r∗1, . . . , r∗ℓ} such that |S| ≥ i + 1 and LΛ(S) ⊆ LΛ(RS∆i(rin)).

Proof. The proof is by induction on the number of iterations i ≥ 0 of the while-loop. In case i = 0: let t be a ground term tree given by SupR∗(Rin) as a counterexample in line 5. Then t ∈ LΛ(r∗) for some r∗ ∈ {r∗1, . . . , r∗ℓ}. Since ∆0 is the set of edge labels appearing in t, we have r∗ ⪯ r for some r ∈ RS∆0(rin). Thus, LΛ({r∗}) ⊆ LΛ(RS∆0(rin)).
We assume inductively that the result holds for any number of iterations of the while-loop less than i. By the inductive hypothesis, there exists a subset S of {r∗1, . . . , r∗ℓ} such that |S| ≥ i and LΛ(S) ⊆ LΛ(RS∆i−1(rin)). Since LΛ(R∗) ⊈ LΛ(Rin ∪ RS∆i−1(rin)), the counterexample ti is obtained. Since LΛ(S) ⊆ LΛ(RS∆i−1(rin)), there exists a term tree r∗ ∈ {r∗1, . . . , r∗ℓ} − S such that ti ∈ LΛ(r∗). As ∆i contains all edge labels appearing in ti, we have r∗ ⪯ r for some r ∈ RS∆i(rin). Thus, there exists a subset S′ of {r∗1, . . . , r∗ℓ} with S ∪ {r∗} ⊆ S′ such that |S′| ≥ i + 1 and LΛ(S′) ⊆ LΛ(RS∆i(rin)). □

From the above lemma, for some i ≤ m, LΛ({r∗1, . . . , r∗ℓ}) ⊆ LΛ(RS∆i(rin)). It follows that LΛ(R∗) ⊆ LΛ(Rin ∪ RS∆i(rin)).

Thus, by Lemmas 3 and 4, the algorithm LEARN OTT exactly identifies the set {r∗1, . . . , r∗ℓ}. The algorithm is called recursively at most O(|rin|^2) times to identify one term tree in {r∗1, . . . , r∗ℓ}; thus, it is called recursively at most O(ℓ|rin|^2) times in all.
The while-loop of lines 7–11 is repeated at most O(m) times. Since ES(rin) ⊆ Rin, |ti| = |rin| for any i. Thus, in the foreach-loop of lines 15–17, |∆| ≤ |t1| + · · · + |tm| = m|rin|. The loop uses at most O(m|rin|^2) superset queries. The number of superset queries needed to identify the set {r∗1, . . . , r∗ℓ} is at most O(ℓm|rin|^4). The algorithm LEARN KNOWN uses at most O(|rin|^2) superset queries to obtain a term tree rin. Thus, the number of superset queries the algorithm needs to identify a target R∗ is at most O(m^2 n^4 + 1), where n is the maximum size of term trees in R∗.
From the above, we have the following theorem.

Theorem 1. If the algorithm LEARN KNOWN of Fig. 5 takes an integer m with m ≥ |R∗| as input, then it exactly identifies a set R∗ ∈ OTFΛ in polynomial time using at most O(m^2 n^4 + 1) superset queries, where n is the maximum size of term trees in R∗.

4.3 The Size of a Target Is Unknown

In this subsection, we assume that the size of a target is unknown. Even so, by running LEARN KNOWN with increasing values of m and checking each result with a restricted equivalence query (Fig. 7), Theorem 1 yields the following theorem.

Theorem 2. The algorithm LEARN OTF of Fig. 7 exactly identifies any set R∗ ∈ OTFΛ in polynomial time using at most O(m∗^3 n^4 + 1) superset queries and at most O(m∗ + 1) restricted equivalence queries, where m∗ is the size of R∗ and n is the maximum size of term trees in R∗.

5 Hardness Result on Learnability

In this section, we show that restricted equivalence, membership and subset queries are insufficient for efficient learning of OTFΛ in the query learning model. We use the following lemma.

Lemma 5 (László Lovász [9]). Let UTn be the number of all rooted unordered trees with no edge labels of size n. Then 2^n < UTn < 4^n, where n ≥ 6.

We denote by OTn the number of all rooted ordered trees with no edge labels of size n. From the above lemma, we have OTn ≥ 2^n, where n ≥ 6.
By Lemma 5 and Lemma 1 in [5], we have Theorem 3.

Algorithm LEARN OTT

Given: An oracle SupR∗ for the target R∗ ∈ OTFΛ, a positive integer m,
  a set Rin in OTFΛ and a term tree rin in OTTΛ
  such that LΛ(R∗) ⊆ LΛ(Rin ∪ {rin}) and LΛ(R∗) ⊈ LΛ(Rin);
Output: A set S in OTFΛ;
begin
1.  S := ∅;
2.  if rin ∈ OTΛ then S := {rin}
3.  else begin
4.    Let n := 0;
5.    Let t be a counterexample given by SupR∗(Rin);
6.    Let ∆ be the set of all edge labels in t;
7.    while SupR∗(Rin ∪ RS∆(rin)) ≠ "yes" and n ≤ m do begin
8.      Let t′ be a counterexample and ∆′ the set of all edge labels in t′;
9.      ∆ := ∆ ∪ ∆′;
10.     n := n + 1;
11.   end;
12.   if n > m then S := {rin}
13.   else begin
14.     /* Remove redundant term trees in RS. */
15.     foreach r ∈ RS∆(rin) do
16.       if SupR∗(Rin ∪ RS∆(rin) − {r}) = "yes" then
17.         RS∆(rin) := RS∆(rin) − {r};
18.     foreach r ∈ RS∆(rin) do begin
19.       S′ := LEARN OTT(m, Rin ∪ RS∆(rin) − {r}, r);
20.       S := S ∪ S′;
21.     end;
22.   end;
23. end;
24. output S;
end.

Fig. 6. Algorithm LEARN OTT

Theorem 3. Any learning algorithm that exactly identifies all finite sets of term trees of size n using restricted equivalence, membership and subset queries must make more than 2^n queries in the worst case, where n ≥ 6 and |Λ| ≥ 1.

Proof. We denote by Sn the class of singleton sets of ground term trees of size n. The class Sn is a subclass of OTFΛ. For any distinct L and L′ in Sn, L ∩ L′ = ∅; moreover, the empty set is included in OTFΛ. Thus, by Lemma 5 and Lemma 1 in [5], any learning algorithm that exactly identifies all finite sets of term trees of size n using restricted equivalence, membership and subset queries must make more than 2^n queries in the worst case, where n ≥ 6 and |Λ| ≥ 1. □

Algorithm LEARN OTF

Given: Oracles SupR∗ and rEquivR∗ for the target R∗ ∈ OTFΛ;
Output: A set R ∈ OTFΛ with LΛ(R) = LΛ(R∗).
begin
  m := 0;
  R := ∅;
  repeat
    m := m + 1;
    R := LEARN KNOWN(m) using SupR∗;
  until rEquivR∗(R) = "yes";
  output R;
end.

Fig. 7. Algorithm LEARN OTF
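The control flow of LEARN OTF can be mirrored in runnable form. In this sketch the target is abstracted to a finite set and LEARN KNOWN is replaced by a stub that succeeds once the size bound m reaches |target|; it only illustrates the outer repeat-loop and the O(m∗ + 1) restricted equivalence queries, not the term tree algorithm itself.

```python
# Runnable mirror of the Fig. 7 outer loop with a stubbed LEARN KNOWN.

def learn_otf(learn_known, restricted_equivalence):
    """Increase the size bound m until restricted equivalence accepts."""
    m = 0
    while True:
        m += 1
        hypothesis = learn_known(m)
        if restricted_equivalence(hypothesis) == "yes":
            return hypothesis, m  # m counts the restricted equivalence queries

target = {"r1", "r2", "r3"}

def learn_known_stub(m):
    # Stand-in for LEARN KNOWN: correct only once m >= |target|.
    return set(sorted(target)[:m]) if m < len(target) else set(target)

def req(h):
    return "yes" if h == target else "no"

hypothesis, calls = learn_otf(learn_known_stub, req)
print(hypothesis == target, calls)  # True 3
```

Because m grows by one per iteration and the loop stops as soon as m reaches m∗, the number of restricted equivalence queries is m∗, matching the O(m∗ + 1) bound of Theorem 2.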

Table 1. Our results and future works

          | Exact Learning                                     | Inductive Inference from positive data
----------+----------------------------------------------------+---------------------------------------
µOTTΛ     | Yes [11]: membership & a positive example          | Yes [13]: polynomial time (|Λ| ≥ 1)
          | (|Λ| ≥ 2)                                          |
µOTFΛ     | Yes [10]: restricted subset & equivalence          | Open
          | (|Λ| is infinite)                                  |
OTFΛ      | sufficient [This work]: superset & restricted      | Open
          | equivalence (|Λ| is infinite);                     |
          | insufficient [This work]: restricted equivalence,  |
          | membership, subset (|Λ| ≥ 1)                       |

6 Conclusions

We have studied the learnability of OTFΛ in the query learning model. In Section 4, we have shown that any finite set R∗ ∈ OTFΛ is exactly identifiable using at most O(m∗^3 n^4 + 1) superset queries and at most O(m∗ + 1) restricted equivalence queries, where m∗ = |R∗|, n is the maximum size of term trees in R∗ and |Λ| is infinite. In Section 5, we have shown that it is hard to exactly identify any set in OTFΛ efficiently using restricted equivalence, membership and subset queries.

We have shown the learnabilities of µOTTΛ and µOTFΛ in the query learning model [10,11]. Suzuki et al. [13] showed the learnability of µOTTΛ in the framework of polynomial time inductive inference from positive data [4]. Thus, we will study the learnabilities of µOTFΛ and OTFΛ in the same framework. We summarize our results and future works in Table 1.

References
1. S. Abiteboul, P. Buneman, and D. Suciu. Data on the Web: From Relations to
Semistructured Data and XML. Morgan Kaufmann, 2000.
2. T. R. Amoth, P. Cull, and P. Tadepalli. Exact learning of tree patterns from
queries and counterexamples. Proc. COLT-98, ACM Press, pages 175–186, 1998.
3. T. R. Amoth, P. Cull, and P. Tadepalli. Exact learning of unordered tree patterns
from queries. Proc. COLT-99, ACM Press, pages 323–332, 1999.
4. D. Angluin. Finding patterns common to a set of strings. Journal of Computer and System Sciences, 21:46–62, 1980.
5. D. Angluin. Queries and concept learning. Machine Learning, 2:319–342, 1988.
6. H. Arimura, H. Ishizaka, and T. Shinohara. Learning unions of tree patterns using
queries. Theoretical Computer Science, 185(1):47–62, 1997.
7. H. Arimura, H. Sakamoto, and S. Arikawa. Efficient learning of semi-structured
data from queries. Proc. ALT-2001, Springer-Verlag, LNAI 2225, pages 315–331,
2001.
8. H. Arimura, T. Shinohara, and S. Otsuki. Polynomial time algorithm for finding
finite unions of tree pattern languages. Proc. NIL-91, Springer-Verlag, LNAI 659,
pages 118–131, 1993.
9. László Lovász. Combinatorial Problems and Exercises, chapter: Two classical enumeration problems in graph theory. North-Holland Publishing Company, 1979.
10. S. Matsumoto, T. Shoudai, T. Miyahara, and T. Uchida. Learning of finite unions
of tree patterns with internal structured variables from queries. Proc. AI-2002,
Springer LNAI 2557, pages 523–534, 2002.
11. S. Matsumoto, T. Shoudai, T. Miyahara, and T. Uchida. Learning unions of term
tree languages using queries. Proceedings of LA Summer Symposium, July 2002,
pages 21–1 – 21–10, 2002.
12. T. Miyahara, Y. Suzuki, T. Shoudai, T. Uchida, K. Takahashi, and H. Ueda.
Discovery of frequent tag tree patterns in semistructured web documents. Proc.
PAKDD-2002, Springer-Verlag, LNAI 2336, pages 341–355, 2002.
13. Y. Suzuki, R. Akanuma, T. Shoudai, T. Miyahara, and T. Uchida. Polynomial time
inductive inference of ordered tree patterns with internal structured variables from
positive data. Proc. COLT-2002, Springer-Verlag, LNAI 2375, pages 169–184,
2002.
Kernel Trick Embedded Gaussian Mixture Model

Jingdong Wang, Jianguo Lee, and Changshui Zhang

State Key Laboratory of Intelligent Technology and Systems, Department of Automation, Tsinghua University, Beijing 100084, P. R. China
{wangjd01,lijg01}@[Link]
zcs@[Link]

Abstract. In this paper, we present a kernel trick embedded Gaussian Mixture Model (GMM), called the kernel GMM. The basic idea is to embed the kernel trick into the EM algorithm and derive a parameter estimation algorithm for GMM in feature space. The kernel GMM can be viewed as a Bayesian Kernel Method. Compared with most classical kernel methods, the proposed method solves problems in a probabilistic framework. Moreover, it can tackle nonlinear problems better than the traditional GMM. To avoid the great computational cost that most kernel methods incur on large scale data sets, we also employ a Monte Carlo sampling technique to speed up the kernel GMM so that it is more practical and efficient. Experimental results on synthetic and real-world data sets demonstrate that the proposed approach has satisfying performance.

1 Introduction
The kernel trick is an efficient method for nonlinear data analysis, first used by the Support Vector Machine (SVM) [18]. It has been pointed out that the kernel trick can be used to develop a nonlinear generalization of any algorithm that can be cast in terms of dot products. In recent years, the kernel trick has been successfully introduced into various machine learning algorithms, such as Kernel Principal Component Analysis (Kernel PCA) [14,15], Kernel Fisher Discriminant (KFD) [11], Kernel Independent Component Analysis (Kernel ICA) [7] and so on.
However, in many cases we are required to obtain risk minimization results and incorporate prior knowledge, both of which are easily provided within a Bayesian probabilistic framework. This motivates combining the kernel trick with Bayesian methods, which is called the Bayesian Kernel Method [16]. Because the Bayesian Kernel Method works in a probabilistic framework, it can realize Bayesian optimal decisions and estimate confidence or reliability easily with probabilistic criteria such as Maximum-A-Posteriori [5] and so on.
Recently some research has been done in this field. Kwok combined the evidence framework with SVM [10], and Gestel et al. [8] incorporated a Bayesian framework with SVM and KFD. Both of these works apply a Bayesian framework to known kernel methods. On the other hand, some researchers have proposed

R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 159–174, 2003.
© Springer-Verlag Berlin Heidelberg 2003

new Bayesian methods with the kernel trick embedded, among which one of the most influential works is the Relevance Vector Machine (RVM) proposed by Tipping [17].
This paper also addresses the Bayesian Kernel Method. We embed the kernel trick into the Expectation-Maximization (EM) algorithm [3] and derive a new parameter estimation algorithm for the Gaussian Mixture Model (GMM) in the feature space. The entire model is called the kernel Gaussian Mixture Model (kGMM).
The rest of this paper is organized as follows. Section 2 reviews some back-
ground knowledge, and Section 3 describes the kernel Gaussian Mixture Model
and the corresponding parameter estimation algorithm. Experiments and results
are presented in Section 4. Conclusions are drawn in the final section.

2 Preliminaries
In this section, we review some background knowledge including the kernel trick,
GMM based on EM algorithm and Bayesian Kernel Method.

2.1 Kernel Trick


The Mercer kernel trick was first applied in SVM. The idea is that we can implicitly map input data into a high-dimensional feature space via a nonlinear function:

  Φ : X → H,  x ↦ φ(x)    (1)

A similarity measure is then defined from the dot product in the space H as follows:

  k(x, x′) ≜ (φ(x) · φ(x′))    (2)

where the kernel function k should satisfy Mercer's condition [18]. This allows us to handle learning algorithms using linear algebra and analytic geometry. Generally speaking, on the one hand the kernel trick lets us deal with data in the high-dimensional dot product space H, named the feature space, via the map associated with k. On the other hand, it avoids the expensive computation in feature space by employing the kernel function k instead of directly computing dot products in H.
Being an elegant way to perform nonlinear analysis, the kernel trick has been used in many other algorithms such as Kernel Fisher Discriminant [11], Kernel PCA [14,15], Kernel ICA [7] and so on.
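As a quick numeric check of this idea (a textbook example, not taken from this paper): for the homogeneous polynomial kernel k(x, x′) = (x · x′)^2 in two dimensions, the explicit feature map φ(x) = (x1^2, √2·x1x2, x2^2) satisfies k(x, x′) = φ(x) · φ(x′), so dot products in H can be evaluated without ever constructing φ.

```python
# The kernel trick in miniature: evaluating a feature-space dot product
# via the kernel function alone.
import math

def k(x, xp):
    """Homogeneous polynomial kernel of degree 2: (x . x')^2."""
    return (x[0] * xp[0] + x[1] * xp[1]) ** 2

def phi(x):
    """Explicit degree-2 feature map for the kernel above."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

x, xp = (1.0, 2.0), (3.0, -1.0)
print(k(x, xp))              # 1.0
print(dot(phi(x), phi(xp)))  # 1.0 -- the same value, computed in H
```

The two printed values agree by construction, which is exactly what lets kernel methods work in H without materializing φ.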

2.2 GMM Based on EM Algorithm


GMM is a kind of mixture density models, which assumes that each component
of the probabilistic model is a Gaussian density. That is to say:
Kernel Trick Embedded Gaussian Mixture Model 161

M
p(x|Θ) = αi Gi (x|θi ) (3)
i=1
 
where x ∈ Rd is a random variable, parameters Θ = α1 , · · · , αM ; θ1 , · · · θM
M
satisfy i=1 αi = 1, αi ≥ 0 and Gi (x|θi ) is a Gaussian probability density
function:

1  1 
Gl (x|θl ) = 1/2
exp − (x − µl )T Σl−1 (x − µl ) (4)
(2π)d/2 |Σl | 2

where θl = (µl , Σl ).
GMM can be viewed as a generative model [12] or a latent variable model [6] that assumes the data set X = {x_i}_{i=1}^{N} is generated by M Gaussian components, and introduces latent variables Z = {z_i}_{i=1}^{N} whose values indicate which component generates each datum. That is to say, we assume that if sample x_i is generated by the lth component, then z_i = l. The parameters of GMM can then be estimated by the EM algorithm [2].
The EM algorithm for GMM is an iterative procedure which estimates the new parameters in terms of the old parameters by the following updating formulas:

  α_l^{(t)} = (1/N) Σ_{i=1}^{N} p(l|x_i, Θ^{(t−1)})

  µ_l^{(t)} = Σ_{i=1}^{N} x_i p(l|x_i, Θ^{(t−1)}) / Σ_{i=1}^{N} p(l|x_i, Θ^{(t−1)})

  Σ_l^{(t)} = Σ_{i=1}^{N} (x_i − µ_l^{(t)})(x_i − µ_l^{(t)})^T p(l|x_i, Θ^{(t−1)}) / Σ_{i=1}^{N} p(l|x_i, Θ^{(t−1)})    (5)

where l represents the lth Gaussian component, and

  p(l|x_i, Θ^{(t−1)}) = α_l^{(t−1)} G(x_i|θ_l^{(t−1)}) / Σ_{j=1}^{M} α_j^{(t−1)} G(x_i|θ_j^{(t−1)}),  l = 1, · · · , M    (6)

Θ^{(t−1)} = (α_1^{(t−1)}, · · · , α_M^{(t−1)}; θ_1^{(t−1)}, · · · , θ_M^{(t−1)}) are the parameters of the (t−1)th iteration and Θ^{(t)} = (α_1^{(t)}, · · · , α_M^{(t)}; θ_1^{(t)}, · · · , θ_M^{(t)}) are the parameters of the tth iteration.
GMM has been successfully applied in many fields, such as parametric clustering, density estimation and so on. However, it cannot give a simple yet satisfactory clustering result on data sets with complex structure [13], as shown in Figure 1. One alternative is to perform GMM-based clustering in another space instead of the original data space.
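The updating formulas (5) and (6) translate directly into a compact NumPy sketch (our own illustration; variable names follow the text):

```python
# One EM iteration for a full-covariance GMM, following (5) and (6).
import numpy as np

def em_step(X, alpha, mu, Sigma):
    N, d = X.shape
    M = len(alpha)
    # E-step: posteriors p(l | x_i, Theta^(t-1)) as in (6).
    resp = np.empty((N, M))
    for l in range(M):
        diff = X - mu[l]
        inv = np.linalg.inv(Sigma[l])
        norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma[l]))
        resp[:, l] = alpha[l] * np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1)) / norm
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: the updating formulas (5).
    Nl = resp.sum(axis=0)
    alpha = Nl / N
    mu = (resp.T @ X) / Nl[:, None]
    Sigma = np.array([(resp[:, l, None] * (X - mu[l])).T @ (X - mu[l]) / Nl[l]
                      for l in range(M)])
    return alpha, mu, Sigma

# Two well-separated Gaussian clusters; EM recovers their means.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
alpha = np.array([0.5, 0.5])
mu = np.array([[-1.0, 0.0], [1.0, 0.0]])
Sigma = np.array([np.eye(2), np.eye(2)])
for _ in range(20):
    alpha, mu, Sigma = em_step(X, alpha, mu, Sigma)
print(np.sort(mu[:, 0]))  # the two means separate toward -3 and 3
```

On data with simple, well-separated structure this converges quickly; the point of the paper is precisely that on complex structure (Figure 1) this update in the original space is not enough.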


Fig. 1. Data set of two concentric circles with 1,000 samples, points marked by ‘×’
belong to one cluster and marked by ‘·’ belong to the other. (a) is the partition result
by traditional GMM, (b) is the result achieved by kGMM using polynomial kernel of
degree 2; (c) shows the probability that each point belongs to the outer circle. The
whiter the point is, the higher the probability is.

2.3 Bayesian Kernel Method

The Bayesian Kernel Method can be viewed as a combination of Bayesian methods and kernel methods, and it inherits merits from both. It can tackle problems nonlinearly like a kernel method, and it obtains estimation results within a probabilistic framework like classical Bayesian methods. Much work has been done in this field, typically in three different ways.

– Interpreting known kernel methods, such as SVM, within a Bayesian framework, as in [10,8];
– Employing kernel methods in traditional Bayesian methods such as Gaussian
Processes and Laplacian Processes [16];
– Proposing new methods in Bayesian framework with kernel trick embedded
such as Relevance Vector Machine (RVM) [17] and Bayes Point Machine [9].

We intend to embed the kernel trick into the Gaussian Mixture Model; this work belongs to the second category of Bayesian Kernel Methods.

3 Kernel Trick Embedded GMM

As mentioned before, the Gaussian Mixture Model cannot obtain simple yet satisfactory results on data sets with complex structure, so we consider employing the kernel trick to realize a Bayesian Kernel version of GMM. Our basic idea is to embed the kernel trick into the parameter estimation procedure of GMM. In this section, we first describe GMM in feature space, then present some properties of the feature space, next formulate the kernel Gaussian Mixture Model and the corresponding parameter estimation algorithm, and finally make some discussions on the algorithm.

3.1 GMM in Feature Space

GMM in feature space, via a map φ(·) associated with the kernel function k, can easily be rewritten as

  p(φ(x)|Θ) = Σ_{i=1}^{M} α_i G(φ(x)|θ_i)    (7)

and the EM updating formulas in (5) and (6) can be replaced by the following:

  α_l^{(t)} = (1/N) Σ_{i=1}^{N} p(l|φ(x_i), Θ^{(t−1)})

  µ_l^{(t)} = Σ_{i=1}^{N} φ(x_i) p(l|φ(x_i), Θ^{(t−1)}) / Σ_{i=1}^{N} p(l|φ(x_i), Θ^{(t−1)})

  Σ_l^{(t)} = Σ_{i=1}^{N} (φ(x_i) − µ_l^{(t)})(φ(x_i) − µ_l^{(t)})^T p(l|φ(x_i), Θ^{(t−1)}) / Σ_{i=1}^{N} p(l|φ(x_i), Θ^{(t−1)})    (8)

where l represents the lth Gaussian component, and

  p(l|φ(x_i), Θ^{(t−1)}) = α_l^{(t−1)} G(φ(x_i)|θ_l^{(t−1)}) / Σ_{j=1}^{M} α_j^{(t−1)} G(φ(x_i)|θ_j^{(t−1)}),  l = 1, · · · , M    (9)

However, computing GMM directly with formulas (8) and (9) in a high-dimensional feature space is computationally expensive and thus impractical. We consider employing the kernel trick to overcome this difficulty. In the following section, we give some properties based on the Mercer kernel trick for estimating the GMM parameters in feature space.

3.2 Properties in Feature Space

For convenience, we first give the notation used in feature space, and then present three properties.

Notations. In all the formulas, bold and capital letters are for matrixes, italic
and bold letters are for vectors, and italic and lower case are for scalars. Subscript
l represents the lth Gaussian component, superscript t represents the tth iteration
of the EM procedure. AT represents transpose of matrix A. Other notations are
shown in Table 1.
164 J. Wang, J. Lee, and C. Zhang

Table 1. Notation List

  p_{li}^{(t)} = p(l|φ(x_i), Θ^{(t)})                      Posterior that φ(x_i) belongs to the lth component.
  w_{li}^{(t)} = ( p_{li}^{(t)} / Σ_{j=1}^{N} p_{lj}^{(t)} )^{1/2}   (w_{li}^{(t)})^2 represents the ratio with which φ(x_i) is occupied by the lth Gaussian component.
  µ_l^{(t)} = Σ_{i=1}^{N} φ(x_i)(w_{li}^{(t)})^2           Mean vector of the lth Gaussian component.
  φ̃_l(x_i) = φ(x_i) − µ_l^{(t)}                           Centered image of φ(x_i).
  Σ_l^{(t)} = Σ_{i=1}^{N} φ̃(x_i)φ̃(x_i)^T (w_{li}^{(t)})^2   Covariance matrix of the lth Gaussian component.
  (K_l^{(t)})_{ij} = (w_{li} φ(x_i) · w_{lj} φ(x_j))       Kernel matrix.
  (K̃_l^{(t)})_{ij} = (w_{li} φ̃(x_i) · w_{lj} φ̃(x_j))     Centered kernel matrix.
  (K_l^{(t)})′_{ij} = (φ(x_i) · w_{lj} φ(x_j))             Projecting kernel matrix.
  (K̃_l^{(t)})′_{ij} = (φ̃(x_i) · w_{lj} φ̃(x_j))           Centered projecting kernel matrix.
  λ_{le}^{(t)}, V_{le}^{(t)}                               Eigenvalue and eigenvector of Σ_l^{(t)}.
  λ_{le}^{(t)}, β_{le}^{(t)}                               Eigenvalue and eigenvector of K̃_l^{(t)}.

Properties in Feature Space. According to Lemma 1 in the Appendix, we can get the first property.

[Property 1] The centered kernel matrix K̃_l^{(t)} and the centered projecting kernel matrix (K̃_l^{(t)})′ are computed by the following formulas:

  K̃_l^{(t)} = K_l^{(t)} − W_l^{(t)} K_l^{(t)} − K_l^{(t)} W_l^{(t)} + W_l^{(t)} K_l^{(t)} W_l^{(t)}

  (K̃_l^{(t)})′ = (K_l^{(t)})′ − (W_l^{(t)})′ K_l^{(t)} − (K_l^{(t)})′ W_l^{(t)} + (W_l^{(t)})′ K_l^{(t)} W_l^{(t)}    (10)

where W_l^{(t)} = w_l^{(t)} (w_l^{(t)})^T, (W_l^{(t)})′ = 1_N (w_l^{(t)})^T, w_l^{(t)} = [w_{l1}^{(t)}, · · · , w_{lN}^{(t)}]^T, and 1_N is an N-dimensional column vector with all entries equal to 1.
This property presents the way to center the kernel matrix and the projecting kernel matrix. According to Lemma 2 in the Appendix and Property 1, we can obtain the second property.
[Property 2] The feature-space covariance matrix $\Sigma_l^{(t)}$ and the centered kernel matrix $\tilde K_l^{(t)}$ have the same nonzero eigenvalues, and the following equivalence relation holds:

$$\Sigma_l^{(t)} V_{le}^{(t)} = \lambda_{le}^{(t)} V_{le}^{(t)} \;\Longleftrightarrow\; \tilde K_l^{(t)} \beta_{le}^{(t)} = \lambda_{le}^{(t)} \beta_{le}^{(t)} \tag{11}$$

where $V_{le}^{(t)}$ and $\beta_{le}^{(t)}$ are eigenvectors of $\Sigma_l^{(t)}$ and $\tilde K_l^{(t)}$ respectively.
This property enables us to compute the eigenvectors from the centered kernel matrix instead of from the feature-space covariance matrix as in Equation (8).
With the first two properties, we do not need to compute the means and covariance matrices in feature space; we perform the eigen-decomposition of the centered kernel matrix instead.
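Property 2 can also be checked numerically in the same toy setting (a sketch with our own variable names; the linear kernel keeps feature space explicit):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 8, 4
Phi = rng.normal(size=(N, d))
w = rng.random(N)
w = np.sqrt(w / w.sum())

mu = (w**2) @ Phi
A = w[:, None] * (Phi - mu)       # row i holds w_i * centered phi(x_i)

Sigma = A.T @ A                   # feature-space covariance (d x d)
K_tilde = A @ A.T                 # centered kernel matrix (N x N)

top_sigma = np.sort(np.linalg.eigvalsh(Sigma))[::-1]
top_ktilde = np.sort(np.linalg.eigvalsh(K_tilde))[::-1][:d]
print(np.allclose(top_sigma, top_ktilde))   # True: the nonzero spectra coincide
```

This is the familiar fact that $A^T A$ and $A A^T$ share their nonzero eigenvalues, which is exactly what makes the kernel-side eigenproblem usable.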
Kernel Trick Embedded Gaussian Mixture Model 165

Feature space has a very high dimensionality, which makes it intractable to compute the complete Gaussian probability density $G_l(\phi(x_j)\mid\theta_l)$ directly. However, Moghaddam and Pentland's work [22] offers some motivation: it divides the original high-dimensional space into two parts, the principal subspace and the orthogonal complement subspace. The principal subspace has dimension $d_\phi$. Thus the complete Gaussian density can be approximated by

$$\hat G_l(\phi(x_j)\mid\theta_l) = \frac{1}{(2\pi)^{d_\phi/2}\big(\prod_{e=1}^{d_\phi}\lambda_{le}\big)^{1/2}} \exp\Big(-\frac{1}{2}\sum_{e=1}^{d_\phi}\frac{y_e^2}{\lambda_{le}}\Big) \times \frac{1}{(2\pi\rho)^{(N-d_\phi)/2}} \exp\Big(-\frac{\varepsilon^2(x_j)}{2\rho}\Big) \tag{12}$$
where ye = φ̃(xj )T Vle (here Vle is the e-th eigenvector of Σl ), ρ is the weight
ratio, and ε2 (xj ) is the residual reconstruction error.
On the right side of Equation (12), the first factor is computed from the principal subspace, and the second factor from the orthogonal complement subspace.
The optimal value of ρ can be determined by minimizing a cost function. From
an information-theoretic point of view, the cost function should be the Kullback-
Leibler divergence between the true density Gl (φ(xj )|θl ) and its approximation
Ĝl (φ(xj )|θl ).

$$J(\rho) = \int G(\phi(x)\mid\theta_l)\,\log\frac{G(\phi(x)\mid\theta_l)}{\hat G(\phi(x)\mid\theta_l)}\, d\phi(x) \tag{13}$$

Plugging (12) into the above equation, it can easily be shown that

$$J(\rho) = \frac{1}{2}\sum_{e=d_\phi+1}^{N}\Big[\frac{\lambda_{le}}{\rho} - 1 + \log\frac{\rho}{\lambda_{le}}\Big]$$

Solving the equation $\partial J/\partial\rho = 0$ yields the optimal value

$$\rho^* = \frac{1}{N-d_\phi}\sum_{e=d_\phi+1}^{N}\lambda_{le}$$

According to Property 2, $\Sigma_l$ and $\tilde K_l$ have the same nonzero eigenvalues; employing the properties of symmetric matrices, we obtain

$$\rho^* = \frac{1}{N-d_\phi}\Big(\big\|\Sigma_l^{1/2}\big\|_F - \sum_{e=1}^{d_\phi}\lambda_{le}\Big) = \frac{1}{N-d_\phi}\Big(\big\|\tilde K_l^{1/2}\big\|_F - \sum_{e=1}^{d_\phi}\lambda_{le}\Big) \tag{14}$$

where $\|\cdot\|_F$ is the Frobenius matrix norm defined as $\|A\|_F = \mathrm{trace}(AA^T)$.
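The claim that the KL cost is minimized by the mean of the discarded eigenvalues is easy to verify numerically; the sketch below (our own illustration, with made-up eigenvalues) grid-searches $J(\rho)$:

```python
import numpy as np

lam_tail = np.array([0.9, 0.4, 0.2, 0.1])   # hypothetical discarded eigenvalues

def J(rho):
    # J(rho) = 1/2 * sum_e [ lambda_e/rho - 1 + log(rho/lambda_e) ]
    return 0.5 * np.sum(lam_tail[:, None] / rho - 1.0
                        + np.log(rho / lam_tail[:, None]), axis=0)

rhos = np.linspace(0.01, 5.0, 20001)
rho_star = rhos[np.argmin(J(rhos))]
print(rho_star, lam_tail.mean())   # the minimizer matches the arithmetic mean 0.4
```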


166 J. Wang, J. Lee, and C. Zhang



The residual reconstruction error, $\varepsilon^2(x_j) = \|\tilde\phi(x_j)\|^2 - \sum_{e=1}^{d_\phi} y_e^2$, is easily obtained by employing the kernel trick:

$$\varepsilon^2(x_j) = k(x_j, x_j) - \sum_{e=1}^{d_\phi} y_e^2 \tag{15}$$
According to Lemma 3 in the Appendix,

$$y_e = \tilde\phi(x_j)^T V_{le} = \big(V_{le}\cdot\tilde\phi(x_j)\big) = \beta_{le}^T\,\Gamma_l \tag{16}$$

where $\Gamma_l$ is the $j$-th column of the centered projecting kernel matrix $(\tilde K_l)'$.


It should notice here the centered kernel matrix K̃l is used to obtain eigen-
values λl and eigenvectors βl , whereas projecting kernel matrix K̃l is used to
compute ye as in Equation (16). And both these two matrixes cannot be omit-
ted in the training procedure.
Employing all these results, we obtain the third property.
[Property 3] The Gaussian probability density function $G_l(\phi(x_j)\mid\theta_l)$ can be approximated by $\hat G_l(\phi(x_j)\mid\theta_l)$ as shown in (12), where $\rho$, $\varepsilon^2(x_j)$ and $y_e$ are given in (14), (15) and (16) respectively.
We stress that the approximation of $G_l(\phi(x_j)\mid\theta_l)$ by $\hat G_l(\phi(x_j)\mid\theta_l)$ is complete, since it represents not only the principal subspace but also the orthogonal complement subspace.
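Equation (12) can be sketched compactly in the log-domain (our own helper function, assuming the eigenvalues $\lambda_{le}$ of $\tilde K_l$ and the projections $y_e$ are already available). When the discarded eigenvalues are all equal, the two-factor approximation coincides with the full Gaussian, which the example checks:

```python
import numpy as np

def approx_log_density(lams, ys, k_jj, d_phi):
    """Log of Eq. (12): exact Gaussian in the d_phi-dim principal subspace,
    isotropic Gaussian with variance rho in the orthogonal complement."""
    N = len(lams)
    lam_top, y_top = lams[:d_phi], ys[:d_phi]
    rho = lams[d_phi:].sum() / (N - d_phi)       # Eq. (14)
    eps2 = k_jj - np.sum(y_top**2)               # Eq. (15)
    lp = -0.5 * (np.sum(y_top**2 / lam_top)
                 + d_phi * np.log(2 * np.pi) + np.log(lam_top).sum())
    lp -= 0.5 * (eps2 / rho + (N - d_phi) * np.log(2 * np.pi * rho))
    return lp

lams = np.array([4.0, 2.0, 1.0, 0.5, 0.25, 0.25])   # hypothetical spectrum
ys = np.sqrt(lams)                                   # hypothetical projections
lp = approx_log_density(lams, ys, k_jj=lams.sum(), d_phi=4)
lp_full = -0.5 * (np.sum(ys**2 / lams) + 6 * np.log(2 * np.pi) + np.log(lams).sum())
print(np.isclose(lp, lp_full))   # True
```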
According to these three properties, we can draw the following conclusions.

– The Mercer kernel trick is introduced to compute dot products in the high-dimensional feature space indirectly.
– The probability that a sample belongs to the l-th component can be computed through the centered kernel matrix and the centered projecting kernel matrix, instead of through the mean vector and covariance matrix as in traditional GMM.
– A full eigen-decomposition of the centered kernel matrix is not needed: the Gaussian density can be approximated using only its largest $d_\phi$ principal components, and the approximation depends only weakly on $d_\phi$ since it is complete and optimal.

With these three properties, we can formulate our kernel Gaussian Mixture
Model.

3.3 Kernel GMM and the Parameter Estimation Algorithm

In feature space, kernel matrices replace the means and covariance matrices of input space to represent the Gaussian components. So the parameters of each component are not the mean vector and covariance matrix, but the kernel matrices. In fact, it is also intractable to compute mean vectors and covariance matrices in feature space, because feature space has a very high or even infinite dimension. Fortunately, with the kernel trick embedded, computing with the kernel matrix is quite feasible, since the dimension of the principal subspace is bounded by the data size N.
Consequently, the parameters that kernel GMM needs to estimate are the prior probability $\alpha_l^{(t)}$, the centered kernel matrix $\tilde K_l^{(t)}$, the centered projecting kernel matrix $(\tilde K_l^{(t)})'$ and $w_l^{(t)}$ (see Table 1). That is to say, the M-component kernel GMM is determined by the parameters $\theta_l = \{\alpha_l^{(t)}, w_l^{(t)}, \tilde K_l^{(t)}, (\tilde K_l^{(t)})'\}$, $l = 1, \cdots, M$.
According to the properties in the previous sections, the EM algorithm for parameter estimation of kGMM can be summarized as in Table 2. Assuming the number of Gaussian components is M, we initialize the posterior probability $p_{li}$ that each sample belongs to each Gaussian component. The algorithm terminates when it converges or when the preset maximum number of iterations is reached.

Table 2. Parameter estimation algorithm for kGMM

Step 0. Initialize all $p_{li}^{(0)}$ ($l = 1, \cdots, M$; $i = 1, \cdots, N$); set $t = 0$, set $t_{max}$ and set the stopping condition to false.
Step 1. While the stopping condition is false: $t = t + 1$, do Steps 2–7.
Step 2. Compute $\alpha_l^{(t)}$, $w_{li}^{(t)}$, $W_l^{(t)}$, $(W_l^{(t)})'$, $K_l^{(t)}$ and $(K_l^{(t)})'$ according to the notation in Table 1.
Step 3. Compute the matrices $\tilde K_l^{(t)}$, $(\tilde K_l^{(t)})'$ via Property 1.
Step 4. Compute the largest $d_\phi$ eigenvalues and eigenvectors of the centered kernel matrices $\tilde K_l^{(t)}$.
Step 5. Compute $\hat G_l(\phi(x_j)\mid\theta_l^{(t)})$ via Property 3.
Step 6. Compute all posterior probabilities $p_{li}^{(t)}$ via (9).
Step 7. Test the stopping condition: if $t > t_{max}$ or $\sum_{l=1}^{M}\sum_{i=1}^{N}\big(p_{li}^{(t)} - p_{li}^{(t-1)}\big)^2 < \varepsilon$, set the stopping condition to true; otherwise loop back to Step 1.
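The control flow above can be sketched as follows. With a linear kernel the feature space coincides with the input space, so the kernel-matrix steps reduce to ordinary weighted means and variances; this 1-D toy (entirely our own construction, including the rough initialization) only illustrates the loop structure of Steps 0-7:

```python
import numpy as np

rng = np.random.default_rng(4)
# Toy 1-D data: two well-separated clusters
x = np.concatenate([rng.normal(-2.0, 0.5, 40), rng.normal(2.0, 0.5, 40)])
N, M, t_max = len(x), 2, 50

# Step 0: initialize responsibilities (here from a rough split of the line)
p = np.vstack([x < 0, x >= 0]).astype(float) + 0.1
p /= p.sum(axis=0)

for t in range(t_max):                       # Step 1
    alpha = p.mean(axis=1)                   # Step 2: mixing weights
    w2 = p / p.sum(axis=1, keepdims=True)    # (w_li)^2 as in Table 1
    log_g = np.empty((M, N))
    for l in range(M):                       # Steps 3-5 (trivial in 1-D)
        mu = w2[l] @ x
        var = w2[l] @ (x - mu) ** 2 + 1e-9
        log_g[l] = -0.5 * ((x - mu) ** 2 / var + np.log(2 * np.pi * var))
    log_post = np.log(alpha)[:, None] + log_g
    log_post -= log_post.max(axis=0)
    p_new = np.exp(log_post)
    p_new /= p_new.sum(axis=0)               # Step 6: posteriors
    if np.sum((p_new - p) ** 2) < 1e-10:     # Step 7: stopping condition
        p = p_new
        break
    p = p_new

means = np.sort([(p / p.sum(1, keepdims=True))[l] @ x for l in range(M)])
print(means)   # close to [-2, 2]
```

In the kernel case Steps 3-5 would instead center the kernel matrices and evaluate the approximate density of Property 3.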

3.4 Discussion of the Algorithm


Computational Cost and Speedup Techniques for Large-Scale Problems. With the kernel trick employed, the computational cost of kernel eigen-decomposition based methods is dominated by the eigen-decomposition step. The cost therefore depends mainly on the size of the kernel matrix, i.e. the size of the data set. If the size N is not very large (e.g. $N \le 1{,}000$), obtaining the full eigen-decomposition is not a problem. If N is large, however, the computation quickly becomes prohibitive: for $N > 5{,}000$ the full eigen-decomposition cannot be completed even within hours on the fastest current PCs. Unfortunately, N is usually very large in problems such as data mining.
Fortunately, as pointed out in Section 3.2, we do not need the full eigen-decomposition for the components of kGMM; we only need to estimate the largest $d_\phi$ nonzero eigenvalues and the corresponding eigenvectors. For large-scale problems we can assume $d_\phi \ll N$. Several techniques can be adopted to estimate the largest $d_\phi$ components for kernel methods. The first is based on traditional orthogonal iteration or Lanczos iteration. The second makes the kernel matrix sparse by sampling techniques [1]. The third applies the Nyström method to speed up kernel machines [19].
However, these three techniques are somewhat complicated. In this paper we adopt a much simpler but practical technique proposed by Taylor et al. [20]. It assumes that the samples forming the kernel matrix of each component are drawn from an underlying density p(x), so the problem can be written as a continuous eigenproblem:

$$\int k(x, y)\, p(x)\, \beta_i(x)\, dx = \lambda_i \beta_i(y) \tag{17}$$

where $\lambda_i$, $\beta_i(y)$ are an eigenvalue and eigenfunction, and k is a given kernel function. The integral can be approximated by the Monte Carlo method using a subset of m samples $\{x_i\}_{i=1}^{m}$ ($m \ll N$, $m \gg d_\phi$) drawn according to p(x):

$$\int k(x, y)\, p(x)\, \beta_i(x)\, dx \approx \frac{1}{m}\sum_{j=1}^{m} k(x_j, y)\, \beta_i(x_j) \tag{18}$$

Plugging in $y = x_k$ for $k = 1, \cdots, m$, we obtain a matrix eigenproblem:

$$\frac{1}{m}\sum_{j=1}^{m} k(x_j, x_k)\, \beta_i(x_j) = \hat\lambda_i\, \beta_i(x_k) \tag{19}$$

where $\hat\lambda_i$ is the approximation of the eigenvalue $\lambda_i$.
This approximation has been shown to be feasible and to have a bounded error. We apply it to our parameter estimation algorithm for large-scale problems ($N > 1{,}000$). In our algorithm, the underlying density of component l is approximated by $\hat G_l(\phi(x)\mid\theta_l^{(t)})$. We sample a subset of size m according to $\hat G_l(\phi(x)\mid\theta_l^{(t)})$ and perform the full eigen-decomposition on this subset to obtain the largest $d_\phi$ eigen-components.
With this Monte Carlo sampling technique, the computational cost on large-scale problems is reduced greatly; the memory needed by the parameter estimation algorithm is also reduced greatly. This makes the proposed kGMM efficient and practical.

Comparison with Related Work. Other work is also related to ours, most notably the spectral clustering algorithm [21]. Spectral clustering can be regarded as using an RBF-based kernel method to extract features and then clustering by K-means. Compared with spectral clustering, the proposed kGMM has at least two advantages: (1) kGMM provides results in a probabilistic framework and can incorporate prior information easily; (2) kGMM can be used in supervised learning problems as a density estimation method. These advantages encourage the application of the proposed kGMM.

Misunderstanding. We address a possible misunderstanding of the proposed model. One might suppose that simply running GMM in the reduced-dimension space obtained by kernel PCA achieves the same result as kGMM: project the data onto the first $d_\phi$ dimensions of the feature space by kernel PCA, then perform GMM parameter estimation in that principal subspace. However, the choice of a proper $d_\phi$ is then critical, and the performance depends entirely on it. If $d_\phi$ is too large, the estimation is not feasible, because probability estimation demands a number of samples that is large in comparison with the dimension $d_\phi$, and the computational cost increases greatly at the same time. On the contrary, a small $d_\phi$ keeps the estimated parameters from representing the data well.
The proposed kGMM does not have this problem, since the approximated density function is complete and optimal under the minimum Kullback-Leibler divergence criterion. Moreover, kGMM allows different components to have different $d_\phi$. All this improves the flexibility and broadens the applicability of the proposed kGMM.

4 Experiments
In this section, two experiments are performed to validate the proposed kGMM against the traditional GMM. First, kGMM is employed as an unsupervised learning (clustering) method on synthetic data sets. Second, kGMM is employed as a supervised density estimation method for real-world handwritten digit recognition.

4.1 Synthetic Data Clustering Using Kernel GMM


To provide an intuitive comparison between the proposed kGMM and traditional GMM, we first conduct experiments on synthetic 2-D data sets.
The data sets, each with 1,000 samples, are depicted in Figure 1 and Figure 2. For traditional GMM, all samples are used to estimate the parameters of a two-component mixture of Gaussians. When the algorithm stops, each sample is assigned to one of the components or clusters according to its posterior. The clustering results of traditional GMM are shown in Figure 1(a) and Figure 2(a); they are obviously not satisfying.
However, using kGMM with a polynomial kernel of degree 2, $d_\phi = 4$ for each Gaussian component and the same clustering scheme as traditional GMM, we achieve the promising results shown in Figure 1(b) and Figure 2(b). Besides, kGMM provides probabilistic information, as in Figure 1(c) and Figure 2(c), which most classical kernel methods cannot provide.

4.2 USPS Data-Set Recognition Using Kernel GMM


Kernel GMM is also applied to a real-world problem, the US Postal Service
(USPS) handwritten digit recognition. The data set consists of 9,226 grayscale


Fig. 2. Data set consists of 1,000 samples. Points marked by ‘×’ belong to one cluster
and marked by ‘·’ belong to the other. (a) is the partition result by traditional GMM;
(b) is the result achieved by kGMM; (c) shows the probability that each point belongs
to the left-right cluster. The whiter the point is, the higher the probability is.

images of size 16×16, divided into a training set of 7,219 images and a test set of 2,007 images.
The original input data is simply the vector form of the digit image, i.e., the input feature space has dimensionality 256. Optionally, a linear discriminant analysis (LDA) can be performed to reduce the dimensionality of the feature space; if LDA is performed, the dimensionality becomes 39.
For each category ω, a density p(x|ω) is estimated using a 4-component GMM on the training set. To classify a test sample x, we use the Bayesian decision rule

$$\omega^* = \arg\max_{\omega}\; p(x\mid\omega)P(\omega), \qquad \omega = 1, \cdots, 10 \tag{20}$$

where P(ω) is the prior probability of category ω. In this experiment we set P(ω) = 1/10; that is, all categories have equal prior probability.
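With equal priors, the prior drops out of the argmax and the rule reduces to picking the class with the highest (log-)likelihood; a tiny sketch with made-up numbers of our own:

```python
import numpy as np

# log p(x | omega) for two hypothetical test samples over 3 classes
log_p = np.log(np.array([
    [0.01, 0.20, 0.05],
    [0.30, 0.02, 0.10],
]))
omega_star = np.argmax(log_p, axis=1)   # Eq. (20) with P(omega) constant
print(omega_star)   # [1 0]
```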
For comparison, kGMM adopts the same experimental scheme as traditional GMM, except that it uses an RBF kernel function

$$k(x, x') = \exp(-\gamma\|x - x'\|^2)$$

with γ = 0.0015, a Gaussian mixture component number of 2, and $d_\phi = 40$ for each Gaussian component.
The experimental results of GMM in the original input space, GMM in the LDA space (from [4]) and kGMM are shown in Table 3. We can see that kGMM, with fewer components, performs clearly better than traditional GMM with or without LDA. Although the kGMM result is not the state-of-the-art result on USPS, it can still be improved by incorporating invariance prior knowledge using tangent distance, as in [4].

Table 3. Comparison of results on the USPS data set

Method       Best error rate
GMM          8.0%
LDA+GMM      6.7%
Kernel GMM   4.3%

5 Conclusion
In this paper we have presented a kernel Gaussian Mixture Model and derived a parameter estimation algorithm by embedding the kernel trick into the EM algorithm. Furthermore, we adopt a Monte Carlo sampling technique to speed up kGMM on large-scale problems, making it more practical and efficient.
Compared with most classical kernel methods, kGMM solves problems in a probabilistic framework. Moreover, it tackles nonlinear problems better than the traditional GMM. Experimental results on synthetic and real-world data sets show that the proposed approach performs satisfactorily.
Our future work will focus on incorporating prior knowledge, such as invariance, into kGMM and on enriching its applications.

Acknowledgements. The authors would like to thank the anonymous reviewers for their helpful comments, and Jason Xu for helpful conversations about this work.

References
1. Achlioptas, D., McSherry, F. and Schölkopf, B.: Sampling techniques for kernel
methods. In Advances in Neural Information Processing System (NIPS) 14, MIT
Press, Cambridge MA (2002)
2. Bilmes, J. A.: A Gentle Tutorial on the EM Algorithm and its Application to
Parameter Estimation for Gaussian Mixture and Hidden Markov Models, Technical
Report, UC Berkeley, ICSI-TR-97-021 (1997)
3. Bishop, C. M.: Neural Networks for Pattern Recognition, Oxford University Press.
(1995)
4. Dahmen, J., Keysers, D., Ney, H. and Güld, M.O.: Statistical Image Object Recog-
nition using Mixture Densities. Journal of Mathematical Imaging and Vision, 14(3)
(2001) 285–296
5. Duda, R. O., Hart, P. E. and Stork, D. G.: Pattern Classification, New York: John
Wiley & Sons Press, 2nd Edition. (2001)
6. Everitt, B. S.: An Introduction to Latent Variable Models, London: Chapman and
Hall. (1984)
7. Bach, F. R. and Jordan, M. I.: Kernel Independent Component Analysis, Journal of Machine Learning Research, 3, (2002) 1–48
8. Gestel, T. V., Suykens, J.A.K., Lanckriet, G., Lambrechts, A., De Moor, B. and Vandewalle, J.: Bayesian framework for least squares support vector machine classifiers, Gaussian processes and kernel Fisher discriminant analysis. Neural Computation, 15(5) (2002) 1115–1148

9. Herbrich, R., Graepel, T. and Campbell, C.: Bayes Point Machines: Estimating the Bayes Point in Kernel Space. In Proceedings of the International Joint Conference on Artificial Intelligence Workshop on Support Vector Machines, (1999) 23–27
10. Kwok, J. T.: The Evidence Framework Applied to Support Vector Machines, IEEE
Trans. on NN, Vol. 11 (2000) 1162–1173.
11. Mika, S., Rätsch, G., Weston, J., Schölkopf, B. and Müller, K.R.: Fisher discriminant analysis with kernels. IEEE Workshop on Neural Networks for Signal Processing IX, (1999) 41–48
12. Mjolsness, E. and Decoste, D.: Machine Learning for Science: State of the Art and Future Prospects, Science, Vol. 293 (2001)
13. Roberts, S. J.: Parametric and Non-Parametric Unsupervised Cluster Analysis, Pattern Recognition, Vol. 30, No. 2, (1997) 261–272
14. Schölkopf, B., Smola, A.J. and Müller, K.R.: Nonlinear Component Analysis as a Kernel Eigenvalue Problem, Neural Computation, 10(5), (1998) 1299–1319
15. Schölkopf, B., Mika, S., Burges, C. J. C., Knirsch, P., Müller, K. R., Raetsch, G.
and Smola, A.: Input Space vs. Feature Space in Kernel-Based Methods, IEEE
Trans. on NN, Vol 10. No. 5, (1999) 1000–1017
16. Schölkopf, B. and Smola, A. J.: Learning with Kernels: Support Vector Machines,
Regularization and Beyond, MIT Press, Cambridge MA (2002)
17. Tipping, M. E.: Sparse Bayesian Learning and the Relevance Vector Machine,
Journal of Machine Learning Research. (2001)
18. Vapnik, V.: The Nature of Statistical Learning Theory, 2nd Edition, Springer-
Verlag, New York (1997)
19. Williams, C. and Seeger, M.: Using the Nyström Method to Speed Up Kernel Machines. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems (NIPS) 13. MIT Press, Cambridge MA (2001)
20. Taylor, J. S., Williams, C., Cristianini, N. and Kandola, J.: On the Eigenspectrum of the Gram Matrix and Its Relationship to the Operator Eigenspectrum, in N. Cesa-Bianchi et al. (Eds.): ALT 2002, LNAI 2533, Springer-Verlag, Berlin Heidelberg (2002) 23–40
21. Ng, A. Y., Jordan, M. I. and Weiss, Y.: On Spectral Clustering: Analysis and an
algorithm, Advance in Neural Information Processing Systems (NIPS) 14, MIT
Press, Cambridge MA (2002)
22. Moghaddam, B. and Pentland, A.: Probabilistic visual learning for object repre-
sentation, IEEE Trans. on PAMI, Vol. 19, No. 7 (1997) 696–710

Appendix

For convenience, the subscripts denoting Gaussian components and the superscripts denoting iterations are omitted in this part.
[Lemma 1] Suppose $\phi(\cdot)$ is a mapping function satisfying the Mercer conditions as in Equation (1) (Section 2), with N training samples $X = \{x_i\}_{i=1}^{N}$, and let w be an N-dimensional column vector, $w = [w_1, \cdots, w_N]^T \in \mathbb{R}^N$.
Define $\mu = \sum_{i=1}^{N} \phi(x_i) w_i^2$ and $\tilde\phi(x_i) = \phi(x_i) - \mu$.
Then the following consequences hold true:
(1) If K is the N × N kernel matrix such that $K_{ij} = \big(w_i\phi(x_i)\cdot w_j\phi(x_j)\big)$, and $\tilde K$ is the N × N matrix centered in the feature space, such that $\tilde K_{ij} = \big(w_i\tilde\phi(x_i)\cdot w_j\tilde\phi(x_j)\big)$, then:

$$\tilde K = K - WK - KW + WKW \tag{a1}$$

where $W = ww^T$.
(2) If $K'$ is the N × N projecting kernel matrix such that $K'_{ij} = \big(\phi(x_i)\cdot w_j\phi(x_j)\big)$, and $\tilde K'$ is the N × N matrix centered in the feature space, such that $\tilde K'_{ij} = \big(\tilde\phi(x_i)\cdot w_j\tilde\phi(x_j)\big)$, then

$$\tilde K' = K' - W'K - K'W + W'KW \tag{a2}$$

where $W' = \mathbf{1}_N w^T$ and $\mathbf{1}_N$ is an N-dimensional column vector with all entries equal to 1.
Proof: (1)

$$\begin{aligned}
\tilde K_{ij} &= \big(w_i\tilde\phi(x_i)\cdot w_j\tilde\phi(x_j)\big) = w_i\tilde\phi(x_i)^T w_j\tilde\phi(x_j)\\
&= w_i\Big(\phi(x_i)-\sum_{k=1}^{N}\phi(x_k)w_k^2\Big)^T w_j\Big(\phi(x_j)-\sum_{k=1}^{N}\phi(x_k)w_k^2\Big)\\
&= \big(w_i\phi(x_i)^T w_j\phi(x_j)\big) - \sum_{k=1}^{N} w_i w_k\big(w_k\phi(x_k)^T w_j\phi(x_j)\big)\\
&\quad - \sum_{k=1}^{N} w_k\big(w_i\phi(x_i)^T w_k\phi(x_k)\big)w_j + \sum_{k=1}^{N}\sum_{n=1}^{N} w_i w_k w_n\big(w_k\phi(x_k)^T w_n\phi(x_n)\big)w_j\\
&= K_{ij} - \sum_{k=1}^{N} w_i w_k K_{kj} - \sum_{k=1}^{N} K_{ik}w_k w_j + \sum_{k=1}^{N}\sum_{n=1}^{N} w_i w_k K_{kn} w_n w_j
\end{aligned}$$

from which we obtain the compact expression (a1).
(2) is proved similarly.
[Lemma 2] Suppose Σ is a covariance matrix such that

$$\Sigma = \sum_{i=1}^{N} \tilde\phi(x_i)\tilde\phi(x_i)^T w_i^2 \tag{a3}$$

Then the following hold:
(1) $\Sigma V = \lambda V \Leftrightarrow \tilde K\beta = \lambda\beta$.
(2) $V = \sum_{i=1}^{N} \beta_i w_i \tilde\phi(x_i)$.
Proof:
(1) First we prove "⇒". If $\Sigma V = \lambda V$, the solution V lies in the space spanned by $w_1\tilde\phi(x_1), \cdots, w_N\tilde\phi(x_N)$, and we have two useful consequences. First, we may consider the equivalent system

$$\lambda\big(w_k\tilde\phi(x_k)\cdot V\big) = \big(w_k\tilde\phi(x_k)\cdot \Sigma V\big), \quad \text{for all } k = 1, \cdots, N \tag{a4}$$

and second, there exist coefficients $\beta_i$ ($i = 1, \cdots, N$) such that

$$V = \sum_{i=1}^{N} \beta_i w_i \tilde\phi(x_i) \tag{a5}$$

Combining (a4) and (a5), we get

$$\lambda\sum_{i=1}^{N}\beta_i\big(w_k\tilde\phi(x_k)\cdot w_i\tilde\phi(x_i)\big) = \sum_{i=1}^{N}\beta_i\sum_{j=1}^{N}\big(w_k\tilde\phi(x_k)\cdot w_j\tilde\phi(x_j)\big)\big(w_j\tilde\phi(x_j)\cdot w_i\tilde\phi(x_i)\big)$$

which reads

$$\lambda\tilde K\beta = \tilde K^2\beta$$

where $\beta = [\beta_1, \cdots, \beta_N]^T$. Since $\tilde K$ is a symmetric matrix, it has a set of eigenvectors spanning the whole space; thus

$$\lambda\beta = \tilde K\beta \tag{a6}$$

"⇐" is proved similarly.
(2) is easy to prove, so the proof is omitted.
[Lemma 3] If $x_j \in \mathbb{R}^d$ is a sample, with $\tilde\phi(x_j) = \phi(x_j) - \mu$, then

$$\big(V\cdot\tilde\phi(x_j)\big) = \sum_{i=1}^{N}\beta_i w_i\big(\tilde\phi(x_i)\cdot\tilde\phi(x_j)\big) = \beta^T\Gamma \tag{a7}$$

where Γ is the j-th column of the centered projecting matrix $\tilde K'$.
Proof: This lemma follows from (a5).

Efficiently Learning the Metric with Side-Information

Tijl De Bie¹, Michinari Momma², and Nello Cristianini³

¹ Department of Electrical Engineering ESAT-SCD, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, 3001 Leuven, Belgium. [Link]@[Link]
² Department of Decision Sciences and Engineering Systems, Rensselaer Polytechnic Institute, Troy, NY 12180, USA. mommam@[Link]
³ Department of Statistics, University of California, Davis, Davis, CA 95616, USA. nello@[Link]

Abstract. A crucial problem in machine learning is to choose an appropriate representation of the data, in a way that emphasizes the relations we are interested in. In many cases this amounts to finding a suitable metric in the data space. In the supervised case, Linear Discriminant Analysis (LDA) can be used to find an appropriate subspace in which the data structure is apparent. Other ways to learn a suitable metric are found in [6] and [11]. Recently, however, significant attention has been devoted to the problem of learning a metric in the semi-supervised case. In particular, the work by Xing et al. [15] has demonstrated how semi-definite programming (SDP) can be used to directly learn a distance measure that satisfies constraints in the form of side-information. They obtain a significant increase in clustering performance with the new representation. The approach is very interesting; however, the computational complexity of the method severely limits its applicability to real machine learning tasks. In this paper we present an alternative solution for incorporating side-information that specifies pairs of examples belonging to the same class. The approach is based on LDA and is solved via an efficient eigenvalue problem. The performance reached is very similar, but the complexity is only O(d³) instead of O(d⁶), where d is the dimensionality of the data. We also show how our method can be extended to deal with more general types of side-information.

1 Introduction
Machine learning algorithms rely to a large extent on the availability of a good
representation of the data, which is often the result of human design choices.
More specifically, a ‘suitable’ distance measure between data items needs to be
specified, so that a meaningful notion of ‘similarity’ is induced. The notion of
‘suitable’ is inevitably task dependent, since the same data might need very
different representations for different learning tasks.

R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 175–189, 2003.

c Springer-Verlag Berlin Heidelberg 2003

This means that automating the task of choosing a representation necessarily requires some type of information (e.g. some of the labels, or less refined forms of information about the task at hand). Labels may be too expensive, while a less refined and more readily available source of information, known as side-information, can be used instead. For example, one may want to define a metric over the space of movie descriptions, using data about customer associations (such as sets of movies liked by the same customer in [9]) as side-information.
This type of side-information is commonplace in marketing data, recommen-
dation systems, bioinformatics and web data. Many recent papers have dealt
with these and related problems; some by imposing extra constraints without
learning a metric, as in the constrained K-means algorithm [5], others by im-
plicitly learning a metric, like [9], [13] or explicitly by [15]. In particular, [15]
provides a conceptually elegant algorithm based on semi-definite programming
(SDP) for learning the metric in the data space based on side-information, an
algorithm that unfortunately has complexity O(d6 ) for d-dimensional data1 .
In this paper we present an algorithm for the problem of finding a suitable metric, using side-information that consists of n example pairs $(x_i^{(1)}, x_i^{(2)})$, $i = 1, \ldots, n$, belonging to the same but unknown class. Furthermore, we place our algorithm in a general framework in which the methods described in [14] and [13] would also fit. More specifically, we show how these methods can all be related to Linear Discriminant Analysis (LDA, see [8] or [7]).
For reference, we will first give a brief review of LDA. Next we show how
our method can be derived as an approximation for LDA in case only side-
information is available. Furthermore, we provide a derivation similar to the
one in [15] in order to show the correspondence between the two approaches.
Empirical results include a toy example, and UCI data sets also used in [15].

Notation. All vectors are assumed to be column vectors. $I_d$ denotes the identity matrix of dimension d. By 0 we denote a matrix or a vector of appropriate size, containing all zero elements. The vector 1 is a vector of appropriate dimension containing all 1's. A prime (') denotes a transpose.
To denote the side-information consisting of n pairs $(x_i^{(1)}, x_i^{(2)})$ for which it is known that $x_i^{(1)}$ and $x_i^{(2)} \in \mathbb{R}^d$ belong to the same class, we will use the matrices $X^{(1)} \in \mathbb{R}^{n\times d}$ and $X^{(2)} \in \mathbb{R}^{n\times d}$. These contain $x_i^{(1)\prime}$ and $x_i^{(2)\prime}$ as their i-th rows:

$$X^{(1)} = \begin{bmatrix} x_1^{(1)\prime} \\ x_2^{(1)\prime} \\ \cdots \\ x_n^{(1)\prime} \end{bmatrix} \quad\text{and}\quad X^{(2)} = \begin{bmatrix} x_1^{(2)\prime} \\ x_2^{(2)\prime} \\ \cdots \\ x_n^{(2)\prime} \end{bmatrix}.$$

This means that for any i = 1, . . . , n, it is
1
The authors of [15] see this problem, and they try to circumvent it by developing a
gradient descent algorithm instead of using standard Newton algorithms for solving
SDP problems. However, this may lead to convergence problems, especially for data
sets in large dimensional spaces.

known that the samples at the ith rows of X(1) and X(2) belong to the same class.
For ease of notation (but without loss of generality) we construct the full data matrix² $X \in \mathbb{R}^{2n\times d}$ as $X = \begin{bmatrix} X^{(1)} \\ X^{(2)} \end{bmatrix}$. When we want to denote the sample corresponding to the i-th row of X without regard to the side-information, it is denoted as $x_i \in \mathbb{R}^d$ (without superscript, and i = 1, . . . , 2n). The data matrix should be centered, that is $\mathbf{1}'X = 0$ (the mean of each column is zero). We use $w \in \mathbb{R}^d$ to denote a weight vector in this d-dimensional data space.
Although the labels of the samples are not known in our problem setting, we will consider the label matrix $Z \in \mathbb{R}^{2n\times c}$ corresponding to X in our derivations. (The number of classes is denoted by c.) It is defined as (where $Z_{i,j}$ indicates the element at row i and column j):

$$Z_{i,j} = \begin{cases} 1 & \text{when the class of the sample } x_i \text{ is } j \\ 0 & \text{otherwise} \end{cases}$$

followed by a centering to make all column sums equal to zero: $Z \leftarrow Z - \frac{1}{2n}\mathbf{1}\mathbf{1}'Z$.
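For concreteness, this construction of a centered label matrix looks as follows in numpy (hypothetical labels of our own):

```python
import numpy as np

labels = np.array([0, 2, 1, 2, 0, 1])   # hypothetical labels for 2n = 6 samples
Z = np.eye(3)[labels]                   # one-hot indicator matrix, 2n x c
Z = Z - Z.mean(axis=0)                  # centering: column sums become zero
print(np.allclose(Z.sum(axis=0), 0.0))  # True
```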
We use $w_Z \in \mathbb{R}^c$ to denote a weight vector in the c-dimensional label space. The matrices $C_{ZX} = C_{XZ}' = Z'X$, $C_{ZZ} = Z'Z$, $C_{XX} = X'X$ are called total scatter matrices of X or Z with X or Z. The total scatter matrices for the subset data matrices $X^{(k)}$, k = 1, 2, are indexed by integers: $C_{kl} = X^{(k)\prime}X^{(l)}$.
Again, if the labels were known, we could identify the sets $C_i = \{\text{all } x_j \text{ in class } i\}$. Then we could also compute the following quantities for the samples in X: the number of samples in each class, $n_i = |C_i|$; the class means $m_i = \frac{1}{n_i}\sum_{j: x_j\in C_i} x_j$; the between-class scatter matrix $C_B = \sum_{i=1}^{c} n_i m_i m_i'$; and the within-class scatter matrix $C_W = \sum_{i=1}^{c}\sum_{j: x_j\in C_i}(x_j - m_i)(x_j - m_i)'$. Since the labels are not known in our problem setting, we will use these quantities only in our derivations, not in our final results.

2 Learning the Metric

In this section, we will show how the LDA formulation which requires labels
can be adapted for cases where no labels but only side-information is available.
The resulting formulation can be seen as an approximation of LDA with labels
available. This will lead to an efficient algorithm to learn a metric: given the side-
information, solving just a generalized eigenproblem is sufficient to maximize the
expected separation between the clusters.

2
In all derivations, the only data samples involved are the ones that appear in the side-
information. It is not until the empirical results section that also data not involved
in the side-information is dealt with: the side-information is used to learn the metric,
and only subsequently, this metric is used to cluster any other available sample. We
also assume no sample appears twice in the side-information.

2.1 Motivation

Canonical Correlation Analysis (CCA) Formulation of Linear Discriminant Analysis (LDA) for Classification. Given a data matrix X and a label matrix Z, LDA [8] provides a way to find a projection of the data that has the largest ratio of between-class variance to within-class variance. This can be formulated as the maximization of the Rayleigh quotient $\rho(w) = \frac{w'C_B w}{w'C_W w}$. At the optimum $\nabla_w \rho = 0$, and w is the eigenvector corresponding to the largest eigenvalue of the generalized eigenvalue problem $C_B w = \rho\, C_W w$. Furthermore, it is shown that LDA can also be computed by performing CCA between the data and the label matrix ([3],[2],[12]). In other words, LDA maximizes the correlation between a projection of the coordinates of the data points and a projection of their class labels. This means the following CCA generalized eigenvalue problem formulation can be used:

$$\begin{bmatrix} 0 & C_{XZ} \\ C_{ZX} & 0 \end{bmatrix}\begin{bmatrix} w \\ w_Z \end{bmatrix} = \lambda \begin{bmatrix} C_{XX} & 0 \\ 0 & C_{ZZ} \end{bmatrix}\begin{bmatrix} w \\ w_Z \end{bmatrix}$$

The optimization problem corresponding to CCA is (as shown in e.g. [4]):

$$\max_{w,\, w_Z}\; w'X'Z w_Z \quad \text{s.t.} \quad \|Xw\|^2 = 1 \;\text{ and }\; \|Zw_Z\|^2 = 1 \tag{1}$$

This formulation of LDA will be the starting point for our derivations.

Maximizing the Expected LDA Cost Function. In the problem setting at


hand however, we do not know the label matrix Z. Thus we can not perform
LDA in its basic form. However, the side-information that given pairs of samples
(1) (2)
(xi , xi ) belong to the same class (and thus have the same –but unknown–
label ) is available. (This side-information is further denoted by splitting X into
two matrices X(1) and X(2) as defined in the notation paragraph.)
Using a parameterization of the label matrix Z that explicitly realizes these
constraints given by the side-information, we derive a cost function that is equiv-
alent to the LDA cost function but that is written in terms of this parameter-
ization. Then we maximize the expected value of this LDA cost function, where
the expectation is taken over these parameters under a reasonable symmetry as-
sumption. The derivation can be found in Appendix A. Furthermore it is shown
in Appendix A that this expected LDA cost function is maximized by solving
for the dominant eigenvector of:

(C12 + C21 )w = λ(C11 + C22 )w (2)



where Ckl = X(k) X(l) .

In Appendix B we provide an alternative derivation leading to the same eigenvalue
problem. This derivation is based on a cost function that is close to the
cost function used in [15].
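To make the procedure concrete, the eigenproblem (2) can be solved with an off-the-shelf symmetric generalized eigensolver. The following Python/NumPy sketch is our own illustration (the function name and data layout are assumptions, not the authors' code):

```python
import numpy as np
from scipy.linalg import eigh

def learn_directions(X1, X2):
    """Solve (C12 + C21) w = lambda (C11 + C22) w, with C_kl = X_k^T X_l.

    X1, X2: (n, d) arrays of paired samples known to share a class label
    (assumed centered). Returns eigenvalues in descending order and the
    matching generalized eigenvectors as columns.
    """
    C11, C22 = X1.T @ X1, X2.T @ X2
    C12 = X1.T @ X2
    A = C12 + C12.T            # C12 + C21 (symmetric by construction)
    B = C11 + C22              # symmetric, positive definite for n > d
    vals, vecs = eigh(A, B)    # symmetric generalized eigenproblem
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]
```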
Efficiently Learning the Metric with Side-Information 179

2.2 Interpretation and Dimensionality Selection

Interpretation. Given the eigenvector w, the corresponding eigenvalue \lambda is
equal to \frac{w^\top (C_{12}+C_{21}) w}{w^\top (C_{11}+C_{22}) w}. The numerator w^\top (C_{12} + C_{21}) w is twice the covariance
of the projections X^{(1)}w with X^{(2)}w (up to a factor equal to the number of
samples in X^{(k)}). The denominator normalizes with the sum of their variances
(up to the same factor). This means \lambda is very close to the correlation between
X^{(1)}w and X^{(2)}w (it becomes equal to their correlation when the variances
of X^{(1)}w and X^{(2)}w are equal, which will often be close to true as both X^{(1)}
and X^{(2)} are drawn from the same distribution). This makes sense: we want
Xw = \begin{pmatrix} X^{(1)} \\ X^{(2)} \end{pmatrix} w, and thus both X^{(1)}w and X^{(2)}w, to be strongly correlated
with a projection Z w_Z of their (common but unknown) labels in Z on w_Z (see
equation (1); this is what we actually wanted to optimize, but could not do
exactly since Z is not known). Now, when we want X^{(1)}w and X^{(2)}w to be
strongly correlated with the same labels, they necessarily have to be strongly
correlated with each other.
Some of the eigenvalues may be negative, however. This means that along
these eigenvectors, samples that should be co-clustered according to the side-
information are anti-correlated. This can only be caused by features in the data
that are irrelevant for the clustering problem at hand (which can be seen as
noise).

Dimensionality Selection. As with LDA, one will generally not only use the
dominant eigenvector, but a dominant eigenspace to project the data on. The
number of eigenvectors used should depend on the signal-to-noise ratio along
these components: when it is too low, noise effects will cause poor performance
of a subsequent clustering. So we need an estimate of the noise level.
This is provided by the negative eigenvalues: they allow us to make a good
estimate of the noise level present in the data, motivating the strategy
adopted in this paper: retain only the k directions corresponding to eigenvalues
larger than the largest absolute value of the negative eigenvalues.
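This retention rule can be sketched as follows (a hypothetical helper of ours, not part of the paper):

```python
import numpy as np

def select_dimensionality(eigenvalues):
    """Return indices of directions to retain: eigenvalues strictly larger
    than the noise level, estimated as the largest absolute value among
    the negative eigenvalues."""
    ev = np.asarray(eigenvalues, dtype=float)
    negatives = ev[ev < 0]
    noise_level = np.abs(negatives).max() if negatives.size else 0.0
    keep = np.flatnonzero(ev > noise_level)
    return keep[np.argsort(ev[keep])[::-1]]   # sorted by eigenvalue, descending
```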

2.3 The Metric Corresponding to the Subspace Used

Since we will project the data onto the k dominant eigenvectors w, this finally
boils down to using the distance measure

    d^2(x_i, x_j) = \bigl( W^\top (x_i - x_j) \bigr)^\top \bigl( W^\top (x_i - x_j) \bigr)
                  = \| x_i - x_j \|^2_{W W^\top},

where W is the matrix containing the k eigenvectors as its columns.


Normalization of the different eigenvectors could be done so as to make the
variance equal to 1 along each of the directions. However, as can be understood
from the interpretation in 2.2, a better separation can be expected along directions
with a high eigenvalue \lambda. Therefore, we applied the heuristic of scaling each
of the eigenvectors by multiplying them with their corresponding eigenvalue. In
doing so, a subsequent clustering algorithm like K-means will preferentially find cluster
separations orthogonal to directions that are likely to separate well (which is
desirable).
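Putting the projection and the eigenvalue-scaling heuristic together, the metric might be implemented as in this sketch (names and interface are ours):

```python
import numpy as np

def scaled_metric(W, eigenvalues):
    """Scale each retained eigenvector (column of W) by its eigenvalue and
    return the scaled projection matrix together with the squared distance
    d^2(x, y) = ||x - y||^2_{Ws Ws^T}."""
    Ws = np.asarray(W, float) * np.asarray(eigenvalues, float)  # scale columns
    def d2(x, y):
        diff = Ws.T @ (np.asarray(x, float) - np.asarray(y, float))
        return float(diff @ diff)
    return Ws, d2
```

Data projected as `X @ Ws` can then be fed directly to K-means.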

2.4 Computational Complexity

Operations to be carried out in this algorithm are the computation of the d × d
scatter matrices and the solution of a symmetric generalized eigenvalue problem of
size d. The computational complexity of this problem is thus O(d^3). Since the
approach in [15] is basically an SDP with d^2 parameters, its complexity is O(d^6).
Thus a massive speedup can be achieved.

3 Remarks

3.1 Relation with Existing Literature

Actually, X^{(1)} and X^{(2)} do not have to belong to the same space; they can be
of a different kind: for something similar to the above, it is sufficient that
corresponding samples in X^{(1)} and X^{(2)} belong to the same class. Of course, we then
need different weight vectors in the two spaces: w^{(1)} and w^{(2)}. Following a similar
reasoning as above, in Appendix C we provide an argumentation that solving
the CCA eigenproblem

    \begin{pmatrix} 0 & C_{12} \\ C_{21} & 0 \end{pmatrix}
    \begin{pmatrix} w^{(1)} \\ w^{(2)} \end{pmatrix}
    = \lambda
    \begin{pmatrix} C_{11} & 0 \\ 0 & C_{22} \end{pmatrix}
    \begin{pmatrix} w^{(1)} \\ w^{(2)} \end{pmatrix}

is closely related to LDA as well.


This is exactly what is done in [14] and [13] (in both papers in a kernel
induced feature space).

3.2 More General Types of Side-Information

Using similar approaches, more general types of side-information may also be utilized.
We will only briefly mention them:

– When the groups of samples known to belong to the same class contain more
  than two samples (let us call them X^{(i)} again, but now i is not restricted to only
  1 or 2). This can be handled very analogously to our previous derivation.
  Therefore we just state the resulting generalized eigenvalue problem:

      \Bigl( \sum_k X^{(k)} \Bigr)^{\!\top} \Bigl( \sum_k X^{(k)} \Bigr) w
      = \lambda \Bigl( \sum_k X^{(k)\top} X^{(k)} \Bigr) w

– Also in case we are dealing with more than 2 data sets of a different
  nature (e.g., analogous to [14]: we could have more than 2 data sets, each
  consisting of a text corpus in a different language), but for which corresponding
  samples are known to belong to the same class (as described in the
  previous subsection), the problem is easily shown to reduce to the extension
  of CCA towards more data spaces, as is e.g. used in [1]. Space restrictions
  do not permit us to go into this.
– It is possible to keep this approach completely general, allowing for any type
  of side-information in the form of constraints expressing that any number of
  samples belong to the same class or, on the contrary, do not belong
  to the same class. Knowledge of some of the labels can also be exploited.
  For this, we have to use a different parameterization for Z than the one used
  in this paper. In principle, any prior distribution on the parameters can also
  be taken into account. However, sampling techniques will be necessary to
  estimate the expected value of the LDA cost function in these cases. We will
  not go into this in the current paper.

3.3 The Dual Eigenproblem

As a last remark, the dual or kernelized version of the generalized eigenvalue
problem can be derived as follows. The solution w can be expressed in the form
w = \begin{pmatrix} X^{(1)} \\ X^{(2)} \end{pmatrix}^{\!\top} \alpha, where \alpha \in \mathbb{R}^{2n} is a vector containing the dual variables.
Now, with Gram matrices K_{kl} = X^{(k)} X^{(l)\top}, and after introducing the notation

    G_1 = \begin{pmatrix} K_{11} \\ K_{21} \end{pmatrix}
    \quad \text{and} \quad
    G_2 = \begin{pmatrix} K_{12} \\ K_{22} \end{pmatrix},

the \alpha's corresponding to the weight vectors w are found as the generalized
eigenvectors of

    (G_1 G_2^\top + G_2 G_1^\top)\, \alpha = \lambda\, (G_1 G_1^\top + G_2 G_2^\top)\, \alpha.

This suggests that it will be possible to extend the approach to learning non-linear
metrics with side-information as well.
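As a sketch of this dual problem (our own helper; we add a small ridge to keep the right-hand side positive definite, in the spirit of the regularization of Section 4.2):

```python
import numpy as np
from scipy.linalg import eigh

def dual_directions(K11, K12, K21, K22, ridge=1e-8):
    """Solve (G1 G2^T + G2 G1^T) a = lambda (G1 G1^T + G2 G2^T) a,
    where G1 = [K11; K21] and G2 = [K12; K22] are stacked Gram blocks."""
    G1 = np.vstack([K11, K21])
    G2 = np.vstack([K12, K22])
    A = G1 @ G2.T + G2 @ G1.T
    B = G1 @ G1.T + G2 @ G2.T + ridge * np.eye(G1.shape[0])
    vals, vecs = eigh(A, B)
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]
```

With linear kernels K_{kl} = X^{(k)} X^{(l)\top}, the leading dual eigenvalue coincides (up to the ridge) with the leading primal eigenvalue of (2).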

4 Empirical Results

The empirical results reported in this paper will be for clustering problems with
the type of side-information described above. Thus, with our method we learn
a suitable metric based on a set of samples for which the side-information is
known, i.e. X(1) and X(2) . Subsequently a K-means clustering of all samples
(including those that are not in X(1) or X(2) ) is performed, making use of the
metric that is learned.

4.1 Evaluation of Clustering Performance


We use the same measure of accuracy as is used in [15]. Defining I(x_i, x_j)
as the function that is 1 when x_i and x_j are clustered into the same cluster by the
algorithm (and 0 otherwise),

    \mathrm{Acc}
    = \frac{\sum_k \sum_{i,\, j>i;\; x_i, x_j \in C_k} I(x_i, x_j)}
           {2 \sum_k \sum_{i,\, j>i;\; x_i, x_j \in C_k} 1}
    + \frac{\sum_{i,\, j>i;\; \neg\exists k:\, x_i, x_j \in C_k} \bigl(1 - I(x_i, x_j)\bigr)}
           {2 \sum_{i,\, j>i;\; \neg\exists k:\, x_i, x_j \in C_k} 1}.
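Reading C_k as the ground-truth classes, this is the average of the within-class and between-class pairwise agreement rates. A sketch (our own illustration, not the authors' code):

```python
import numpy as np

def pairwise_accuracy(true_labels, pred_labels):
    """Mean of (i) the fraction of same-class pairs placed in one cluster and
    (ii) the fraction of different-class pairs placed in different clusters."""
    t = np.asarray(true_labels)
    p = np.asarray(pred_labels)
    iu = np.triu_indices(len(t), k=1)        # all pairs i < j
    same_t = (t[:, None] == t[None, :])[iu]
    same_p = (p[:, None] == p[None, :])[iu]
    pos = same_p[same_t].mean()              # I(x_i, x_j) on same-class pairs
    neg = (~same_p[~same_t]).mean()          # 1 - I on different-class pairs
    return 0.5 * (pos + neg)
```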

4.2 Regularization
To deal with inaccuracies, numerical instabilities and influences of finite sample
size, we apply regularization to the generalized eigenvalue problem. This is done
in the same spirit as for CCA in [1], namely by adding a diagonal to the scatter
matrices C11 and C22 . This is justified thanks to the CCA-based derivation of
our algorithm. To train the regularization parameter, a cost function described
below is minimized via 10-fold cross validation.
In choosing the right regularization parameter, there are two considerations:
firstly, we want the clustering to be good, meaning that the side-information
should be reflected as well as possible by the clustering. Secondly, we
want this clustering to be informative: we do not want one very large
cluster with the others very small (the side-information would then be
too easy to fulfil). Therefore, the cross-validation cost minimized here is
the probability of the measured performance on the test-set side-information,
given the sizes of the clusters found. (More exactly, we maximized the difference
between this performance and its expected value, divided by its standard
deviation.) This approach incorporates both considerations in a natural way.

4.3 Performance on a Toy Data Set


The effectiveness of the method is illustrated on a toy example in which
each of the clusters consists of two parts lying far apart (Figure 1). Standard
K-means has an accuracy of 0.50 on this data set, while the method developed
here gives an accuracy of 0.92.

4.4 Performance on Some UCI Data Sets


The empirical results on some UCI data sets, reported in table 1, are comparable
to the results in [15]. The first column contains the K-means clustering accuracy
without any side-information and preprocessing, averaged over 30 different initial
conditions. In the second column, results are given for small side-information
leaving 90 percent of the connected components3 , in the third column for large
3
We use the notion connected component as defined in [15]. That is, for given side-
information, a set of samples makes up one connected component, if between each
pair of samples in this set, there exists a path via edges corresponding to pairs given
in the side-information. For no side-information given, the number of connected
components is thus equal to the total number of samples.
Fig. 1. A toy example in which the two clusters each consist of two distinct clouds
of samples that are widely separated. Ordinary K-means obviously has a very low
accuracy of 0.5, whereas when some side-information is taken into account as described
in this paper, the performance goes up to 0.92.

side-information leaving 70 percent of the connected components. For these two
columns, averages over 30 randomizations are shown. The side-information is
generated by randomly picking pairs of samples belonging to the same cluster.
The number in brackets indicates the standard deviation over these 30
randomizations.
Table 2 contains the accuracy on the UCI wine data set and on the protein
data set, for different amounts of side-information. To quantify the amount of
side-information, we used (as in [15]) the number of pairs in the side-information
divided by the total number of pairs of samples belonging to the same class (the
ratio of constraints).
These results are comparable with those reported in [15]. As in [15], constrained
K-means [5] will allow for a further improvement. (It is important to
note that constrained K-means by itself does not learn a metric, i.e., the side-
information is not used to learn which directions in the data space are important
in the clustering process. Rather, it imposes constraints ensuring that the
clustering result does not contradict the side-information.)

5 Conclusions

Finding a good representation of the data is of crucial importance in many


machine learning tasks. However, without any assumptions or side-information,
there is no way to find the ‘right’ metric for the data. We thus presented a way

Table 1. Accuracies on UCI data sets, for different numbers of connected components.
(The more side-information, the fewer connected components. The fraction f is
the number of connected components divided by the total number of samples.)

Data set       f = 1        f = 0.9      f = 0.7
wine           0.69 (0.00)  0.92 (0.05)  0.95 (0.03)
protein        0.62 (0.02)  0.71 (0.04)  0.72 (0.06)
ionosphere     0.58 (0.02)  0.69 (0.09)  0.75 (0.05)
diabetes       0.56 (0.02)  0.60 (0.02)  0.61 (0.02)
balance        0.56 (0.02)  0.66 (0.01)  0.67 (0.03)
iris           0.83 (0.06)  0.92 (0.03)  0.92 (0.04)
soy            0.80 (0.08)  0.85 (0.09)  0.91 (0.10)
breast cancer  0.83 (0.01)  0.89 (0.02)  0.91 (0.02)

Table 2. Accuracies on the wine and the protein data sets, as a function of the ratio
of constraints.

ratio of constr.  accuracy for wine    ratio of constr.  accuracy for protein
0 0.69 (0.00) 0 0.62 (0.03)
0.0015 0.73 (0.08) 0.012 0.59 (0.04)
0.0023 0.78 (0.11) 0.019 0.60 (0.05)
0.0034 0.87 (0.08) 0.028 0.62 (0.04)
0.0051 0.91 (0.05) 0.041 0.67 (0.05)
0.0075 0.93 (0.05) 0.060 0.71 (0.05)
0.011 0.96 (0.05) 0.099 0.75 (0.05)
0.017 0.97 (0.017) 0.14 0.77 (0.05)
0.025 0.97 (0.018) 0.21 0.79 (0.06)
0.037 0.98 (0.015) 0.31 0.78 (0.07)

to learn an appropriate metric based on examples of co-clustered pairs of points.
This type of side-information is often much less expensive or easier to obtain
than full information about the labels.
The proposed method is justified in two ways: as the maximization of the expected
value of a Rayleigh quotient corresponding to LDA, and via a second derivation
showing connections with previous work on this type of problem. The result is
a very efficient algorithm that is much faster than the algorithm derived in [15]
while showing similar performance.
Importantly, the method is put in a more general context, showing it is only
one example of a broad class of algorithms that are able to incorporate different
forms of side-information. We point out how the method can be extended to
deal with essentially any kind of side-information.
Furthermore, the result of the algorithm presented here is a lower dimensional
representation of the data, just like in other dimensionality reduction methods
such as PCA (Principal Component Analysis), PLS (Partial Least Squares),

CCA and LDA, that try to identify interesting subspaces for a given task. This
often comes as an advantage, since algorithms like K-means and constrained
K-means will run faster on lower dimensional data.

Acknowledgements. TDB is a Research Assistant with the Fund for Scientific
Research - Flanders (F.W.O.-Vlaanderen). MM is supported by NSF grant
IIS-9979860. This paper was written during a scientific visit of TDB and MM at
[Link]. The authors would like to thank Roman Rosipal, Pieter Abbeel and
Eric Xing for useful discussions.

References
1. F. R. Bach and M. I. Jordan. Kernel independent component analysis. Journal of
Machine Learning Research, 3:1–48, 2002.
2. M. Barker and W.S. Rayens. Partial least squares for discrimination. Journal of
Chemometrics, 17:166–173, 2003.
3. M. S. Bartlett. Further aspects of the theory of multiple regression. Proc. Camb.
Philos. Soc., 34:33–40, 1938.
4. M. Borga, T. Landelius, and H. Knutsson. A Unified Approach to PCA, PLS,
MLR and CCA. Report LiTH-ISY-R-1992, ISY, SE-581 83 Linköping, Sweden,
November 1997.
5. P. Bradley, K. Bennett, and Ayhan Demiriz. Constrained K-means clustering.
Technical Report MSR-TR-2000-65, Microsoft Research, 2000.
6. N. Cristianini, J. Shawe-Taylor, A. Elisseeff, and J. Kandola. On kernel-target
alignment. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances
in Neural Information Processing Systems 14, Cambridge, MA, 2002. MIT Press.
7. R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley &
Sons, Inc., 2nd edition, 2000.
8. R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals
of Eugenics, 7, Part II:179–188, 1936.
9. T. Hofmann. What people don’t want. In European Conference on Machine Learn-
ing (ECML), 2002.
10. R. A. Horn and C. R. Johnson. Topics in Matrix Analysis. Cambridge University
Press, 1991.
11. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning
the kernel matrix with semi-definite programming. Technical Report CSD-02-1206,
Division of Computer Science, University of California, Berkeley, 2002.
12. R. Rosipal, L.J. Trejo, and B. Matthews. Kernel PLS-SVC for linear and nonlinear
classification. In (to appear) Proceedings of the Twentieth International Conference
on Machine Learning, 2003.
13. J.-P. Vert and M. Kanehisa. Graph-driven features extraction from microarray
data using diffusion kernels and cca. In Advances in Neural Information Processing
Systems 15, Cambridge, MA, 2003. MIT Press.
14. A. Vinokourov, N. Cristianini, and J. Shawe-Taylor. Inferring a semantic repre-
sentation of text via cross-language correlation analysis. In Advances in Neural
Information Processing Systems 15, Cambridge, MA, 2003. MIT Press.
15. E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning, with
application to clustering with side-information. In Advances in Neural Information
Processing Systems 15, Cambridge, MA, 2003. MIT Press.

Appendix A: Derivation Based on LDA


Parameterization. As explained before, the side-information is such that we
get pairs of samples (x_i^{(1)}, x_i^{(2)}) which have the same class label. Using this side-
information we stack the corresponding vectors x_i^{(1)} and x_i^{(2)} at the same row in
their respective matrices

    X^{(1)} = \begin{pmatrix} x_1^{(1)\top} \\ x_2^{(1)\top} \\ \vdots \\ x_n^{(1)\top} \end{pmatrix}
    \quad \text{and} \quad
    X^{(2)} = \begin{pmatrix} x_1^{(2)\top} \\ x_2^{(2)\top} \\ \vdots \\ x_n^{(2)\top} \end{pmatrix}.

The full matrix containing all samples for which side-information is available is then
equal to X = \begin{pmatrix} X^{(1)} \\ X^{(2)} \end{pmatrix}. Now, since we know each row of X^{(1)} has the same label as the
corresponding row of X^{(2)}, a parameterization of the label matrix Z is easily
found to be Z = \begin{pmatrix} L \\ L \end{pmatrix}. Note that Z is centered iff L is centered. The matrix
L is in fact just the label matrix of both X^{(1)} and X^{(2)} on themselves. (We
want to stress that L is not known, but is used in the equations as an unknown matrix
parameter for now.)

The Rayleigh Quotient Cost Function That Incorporates the Side-
Information. Using this parameterization we apply LDA on the matrix
\begin{pmatrix} X^{(1)} \\ X^{(2)} \end{pmatrix} with the label matrix \begin{pmatrix} L \\ L \end{pmatrix} to find the optimal directions for separation
of the classes. For this we use the CCA formulation of LDA. This means
we want to solve the CCA optimization problem (1), where we substitute the
values for Z and X:

    \max_{w, w_Z} \; w^\top \begin{pmatrix} X^{(1)} \\ X^{(2)} \end{pmatrix}^{\!\top} \begin{pmatrix} L \\ L \end{pmatrix} w_Z
    = \max_{w, w_Z} \; w^\top X^{(1)\top} L w_Z + w^\top X^{(2)\top} L w_Z        (3)

    \text{s.t.} \quad \left\| \begin{pmatrix} X^{(1)} \\ X^{(2)} \end{pmatrix} w \right\|^2
    = \| X^{(1)} w \|^2 + \| X^{(2)} w \|^2 = 1        (4)

    \| L w_Z \|^2 = 1
The Lagrangian of this constrained optimization problem is:

    \mathcal{L} = w^\top X^{(1)\top} L w_Z + w^\top X^{(2)\top} L w_Z
    - \lambda\, w^\top (X^{(1)\top} X^{(1)} + X^{(2)\top} X^{(2)}) w
    - \mu\, w_Z^\top L^\top L w_Z

Differentiating with respect to w_Z and w and equating to 0 yields

    \nabla_{w_Z} \mathcal{L} = 0 \;\Rightarrow\; L^\top (X^{(1)} + X^{(2)}) w = 2\mu\, L^\top L w_Z        (5)
    \nabla_{w} \mathcal{L} = 0 \;\Rightarrow\; (X^{(1)} + X^{(2)})^\top L w_Z = 2\lambda\, (X^{(1)\top} X^{(1)} + X^{(2)\top} X^{(2)}) w        (6)

From (5) we find that w_Z = \frac{1}{2\mu} (L^\top L)^\dagger L^\top (X^{(1)} + X^{(2)}) w. Filling this into equation
(6) and redefining \lambda := 4\lambda\mu gives

    (X^{(1)} + X^{(2)})^\top \bigl[ L (L^\top L)^\dagger L^\top \bigr] (X^{(1)} + X^{(2)})\, w
    = \lambda\, (X^{(1)\top} X^{(1)} + X^{(2)\top} X^{(2)})\, w.

It is well known that solving for the dominant generalized eigenvector is equivalent
to maximizing the Rayleigh quotient:

    \frac{w^\top (X^{(1)} + X^{(2)})^\top \bigl[ L (L^\top L)^\dagger L^\top \bigr] (X^{(1)} + X^{(2)})\, w}
         {w^\top (X^{(1)\top} X^{(1)} + X^{(2)\top} X^{(2)})\, w}.        (7)

Until now, for the given side-information, there is still an exact equivalence between
LDA and maximizing this Rayleigh quotient. The important difference
between the standard LDA cost function and (7), however, is that in the latter
the side-information is imposed explicitly by using the reduced parameterization
for Z in terms of L.

The Expected Cost Function. As pointed out, we do not know the term
between [\,\cdot\,]. What we will do then is compute the expected value of the cost
function (7) by averaging over all possible label matrices Z = \begin{pmatrix} L \\ L \end{pmatrix}, possibly
weighted with any symmetric⁴ a priori probability for the label matrices. Since
the only part that depends on the label matrix is the factor between [\,\cdot\,], and since
it appears linearly in the cost function, we just need to compute the expectation
of this factor. This expectation is proportional to I - \frac{\mathbf{1}\mathbf{1}^\top}{n}. To see this we only have
to use symmetry arguments (all values on the diagonal should be equal to each
other, and all other values should be equal to each other), and the observation
that L is centered and thus \bigl[ L (L^\top L)^\dagger L^\top \bigr] \mathbf{1} = \mathbf{0}. Now, since we assume that the
data matrix X containing the samples in the side-information is centered too,
(X^{(1)} + X^{(2)})^\top \frac{\mathbf{1}\mathbf{1}^\top}{n} (X^{(1)} + X^{(2)}) is equal to the null matrix. Thus the expected value
of (X^{(1)} + X^{(2)})^\top \bigl[ L (L^\top L)^\dagger L^\top \bigr] (X^{(1)} + X^{(2)}) is proportional to (X^{(1)} +
X^{(2)})^\top (X^{(1)} + X^{(2)}). The expected value of the LDA cost function in equation (7), where the
expectation is taken over all possible label assignments Z constrained by the
side-information, is then shown to be

    \frac{w^\top (C_{11} + C_{12} + C_{22} + C_{21}) w}{w^\top (C_{11} + C_{22}) w}
    = 1 + \frac{w^\top (C_{12} + C_{21}) w}{w^\top (C_{11} + C_{22}) w}
The vector w maximizing this cost is the dominant generalized eigenvector of

    (C_{12} + C_{21})\, w = \lambda\, (C_{11} + C_{22})\, w

where C_{kl} = X^{(k)\top} X^{(l)}.
(Note that the side-information is symmetric in the sense that one could
replace an example pair (x_i^{(1)}, x_i^{(2)}) with (x_i^{(2)}, x_i^{(1)}) without losing any information.
However, this operation changes neither C_{12} + C_{21} nor C_{11} + C_{22}, so that
⁴ That is, the a priori probability of a label assignment L is the same as the probability
of the label assignment PL, where P can be any permutation matrix. Remember
every row of L corresponds to the label of a pair of points in the side-information.
Thus, this invariance means we have no discriminating prior information on which
pair belongs to which of the classes. Using this ignorant prior is clearly the most
reasonable choice, since we assume only the side-information is given here.

the eigenvalue problem to be solved does not change either, which is of course a
desirable property.)

Appendix B: Alternative Derivation


More in the spirit of [15], we can derive the algorithm by solving the constrained
optimization problem (where dim(W) = k means that the dimensionality of W
is k, that is, W has k columns):

    \max_W \; \operatorname{trace}\bigl( X^{(1)} W W^\top X^{(2)\top} \bigr)
    \quad \text{s.t.} \quad \dim(W) = k,
    \quad W^\top \begin{pmatrix} X^{(1)} \\ X^{(2)} \end{pmatrix}^{\!\top}
                 \begin{pmatrix} X^{(1)} \\ X^{(2)} \end{pmatrix} W = I_k

so as to find a subspace of dimension k that optimizes the correlation between
samples belonging to the same class.
This can be reformulated as

    \max_W \; \operatorname{trace}\bigl( W^\top (C_{12} + C_{21}) W \bigr)
    \quad \text{s.t.} \quad \dim(W) = k,
    \quad W^\top (C_{11} + C_{22}) W = I_k.

Solving this optimization problem amounts to solving for the eigenvectors
corresponding to the k largest eigenvalues of the generalized eigenvalue problem (2)
described above.
The proof involves the following theorem by Ky Fan (see e.g. [10]):

Theorem 5.01. Let H be a symmetric matrix with eigenvalues \lambda_1 > \lambda_2 > \dots >
\lambda_n, and the corresponding eigenvectors U = (u_1, u_2, \dots, u_n). Then

    \lambda_1 + \dots + \lambda_k = \max_{P^\top P = I} \operatorname{trace}(P^\top H P).

Moreover, the optimal P^* is given by P^* = (u_1, \dots, u_k) Q, where Q is an arbitrary
orthogonal matrix.
Since (C_{11} + C_{22}) is positive definite, we can take P = (C_{11} + C_{22})^{1/2} W,
so that the constraint W^\top (C_{11} + C_{22}) W = I_k becomes P^\top P = I_k. Also put
H = (C_{11} + C_{22})^{-1/2} (C_{12} + C_{21}) (C_{11} + C_{22})^{-1/2}, so that the objective function
becomes \operatorname{trace}(P^\top H P). Applying the Ky Fan theorem and choosing Q = I_k
leads to the fact that P^* = (u_1, \dots, u_k), with u_1, \dots, u_k the k eigenvectors
corresponding to the k largest eigenvalues of H. Thus, the optimal W^* = (C_{11} +
C_{22})^{-1/2} P^*. For P^* an eigenvector of H = (C_{11} + C_{22})^{-1/2} (C_{12} + C_{21}) (C_{11} +
C_{22})^{-1/2}, this W^* is exactly the generalized eigenvector (corresponding to the
same eigenvalue) of (2). The result is thus exactly the same as obtained in the
derivation in Appendix A.
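This whitening argument is easy to check numerically; the sketch below (our own function, not from the paper) follows the Appendix B route and recovers the generalized eigenvectors of (2):

```python
import numpy as np
from scipy.linalg import eigh, fractional_matrix_power

def whitened_solution(A, M, k):
    """Appendix B route: with P = M^{1/2} W, diagonalize
    H = M^{-1/2} A M^{-1/2} and map back W* = M^{-1/2} P*.
    A symmetric (here C12 + C21), M symmetric positive definite (C11 + C22)."""
    M_inv_sqrt = fractional_matrix_power(M, -0.5).real
    H = M_inv_sqrt @ A @ M_inv_sqrt
    vals, U = eigh(H)
    order = np.argsort(vals)[::-1][:k]
    return vals[order], M_inv_sqrt @ U[:, order]  # columns solve A w = lambda M w
```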

Appendix C: Connection to Literature

If we replace w in optimization problem (3) subject to (4) once by w^{(1)} and once
by w^{(2)}:

    \max_{w^{(1)}, w^{(2)}} \; w^{(1)\top} X^{(1)\top} L w_Z + w^{(2)\top} X^{(2)\top} L w_Z

    \text{s.t.} \quad \| X^{(1)} w^{(1)} \|^2 + \| X^{(2)} w^{(2)} \|^2 = 1

    \| L w_Z \|^2 = 1

where L corresponds to the common label matrix for X^{(1)} and X^{(2)} (both centered).
In a similar way as the previous derivation, this can be shown to amount to
solving the eigenvalue problem:

    \begin{pmatrix} 0 & X^{(1)\top} L (L^\top L)^{-1} L^\top X^{(2)} \\
                    X^{(2)\top} L (L^\top L)^{-1} L^\top X^{(1)} & 0 \end{pmatrix}
    \begin{pmatrix} w^{(1)} \\ w^{(2)} \end{pmatrix}
    = \lambda
    \begin{pmatrix} C_{11} & 0 \\ 0 & C_{22} \end{pmatrix}
    \begin{pmatrix} w^{(1)} \\ w^{(2)} \end{pmatrix}

which again corresponds to a Rayleigh quotient. Since here too we do
not know the matrix L, we again take the expected value (as in Appendix A).
This leads to an expected Rayleigh quotient that is maximized by solving the
generalized eigenproblem corresponding to CCA:

    \begin{pmatrix} 0 & C_{12} \\ C_{21} & 0 \end{pmatrix}
    \begin{pmatrix} w^{(1)} \\ w^{(2)} \end{pmatrix}
    = \lambda
    \begin{pmatrix} C_{11} & 0 \\ 0 & C_{22} \end{pmatrix}
    \begin{pmatrix} w^{(1)} \\ w^{(2)} \end{pmatrix}.
Learning Continuous Latent Variable Models
with Bregman Divergences

Shaojun Wang¹ and Dale Schuurmans²

¹ Department of Statistics, University of Toronto, Canada
² School of Computer Science, University of Waterloo, Canada

Abstract. We present a class of unsupervised statistical learning


algorithms that are formulated in terms of minimizing Bregman
divergences—a family of generalized entropy measures defined by convex
functions. We obtain novel training algorithms that extract hidden latent
structure by minimizing a Bregman divergence on training data, subject
to a set of non-linear constraints which consider hidden variables. An
alternating minimization procedure with nested iterative scaling is pro-
posed to find feasible solutions for the resulting constrained optimization
problem. The convergence of this algorithm along with its information
geometric properties are characterized.

Index Terms — statistical machine learning, unsupervised learning,


Bregman divergence, information geometry, alternating minimization,
forward projection, backward projection, iterative scaling.

1 Introduction
A variety of machine learning and statistical inference problems focus on su-
pervised learning from labeled training data. In such problems, convexity often
plays a central role in formulating the loss function to be minimized during
training. For example, a standard approach to formulating a training loss is
to distinguish a preferred value from a set of candidate prediction values, and
measure prediction error by a convex error measure. Examples of this include
least squares regression, decision tree learning, boosting, on-line learning, max-
imum likelihood for exponential models, logistic regression, maximum entropy,
support vector machines, statistical signal processing (e.g. Burg’s spectral esti-
mation for speech signal analysis and image reconstruction) and optimal portfo-
lio selection. Such problems can often be naturally cast as convex optimization
problems involving a Bregman divergence [5,10,23], which can lead to new al-
gorithms, analytical tools, and insights derived from the powerful methods of
convex analysis [2,3,7,13]. Training algorithms that solve these problems can be
cast as implementing a minimum Bregman divergence (MB) principle.
However, in practice, many of the natural patterns we wish to classify are
the result of causal processes that have hidden hierarchical structure—yielding
data that does not report the value of latent variables. For example, in natural
language learning the observed data rarely reports the value of hidden semantic
variables or syntactic structure, in speech signal analysis gender information is

R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 190–204, 2003.

c Springer-Verlag Berlin Heidelberg 2003

not explicitly marked, etc. Obtaining fully labeled data is tedious or impossi-
ble in most realistic cases. This motivates us to propose a class of unsupervised
statistical learning algorithms that are still formulated in terms of minimizing a
Bregman divergence, except that we must now change the problem formulation
to respect hidden variables. In this paper we propose training algorithms for
solving the latent minimum Bregman divergence (LMB) principle: given a set
of training data and features that one would like to match in the training data,
compute a model that minimizes a convex objective function (a Bregman diver-
gence) subject to a set of non-linear constraints that take into account possible
latent structure.
Our treatment of the LMB principle closely parallels the results presented in
[24] for the Kullback-Leibler divergence, but the extension proposed here is not
trivial. For probabilistic models under the Kullback-Leibler divergence, we can
show an equivalence between satisfying the constraints (i.e. achieving feasibility)
and locally maximizing the likelihood under a log-linear assumption. Thus, in
this case, we can resort to the EM algorithm [14] to develop a practical tech-
nique for finding feasible solutions and proving convergence. However, general
Bregman divergences raise a more difficult technical challenge because the EM
approach breaks down for these generalized entropy measures. In this paper, we
will overcome this difficulty by using an alternating minimization approach [9]
in a non-trivial way, see Figure 1. Thus, beyond the generalized KL divergence
used for unsupervised boosting in clustering [25], the techniques of this paper
can also handle a broader class of functions, such as the Itakura-Siato distortion
[17] for speech signal analysis.

[Figure 1 diagram: nested divergence families (Shannon entropy, K-L divergence,
generalized K-L divergence, jointly convex Bregman divergences, Bregman divergences),
with the EM algorithm / LME principle attached to the K-L divergence, the AM
algorithm / LMB principle attached to the jointly convex Bregman divergences, and
unsupervised boosting at the generalized K-L divergence.]

Fig. 1. The AM algorithm proposed in this paper is valid for the family of jointly convex
Bregman divergences, whereas the EM algorithm proposed in [24] is only valid for the K-L
divergence. The unsupervised boosting case deals with the generalized K-L divergence,
and thus can be solved by using the AM algorithm to find feasible solutions under the latent
minimum Bregman divergence principle.

2 The LMB Principle


To express a joint (probability) model, let X \in \mathcal{X} denote the complete data,
Y \in \mathcal{Y} the observed incomplete data, and Z \in \mathcal{Z} the missing data. That is,
X = (Y, Z). Let \phi(t): \mathbb{R} \to \mathbb{R} be a strictly convex function on an interval I \subset \mathbb{R},
and differentiable in the interior of I. Define a closed convex set S,
where S is typically assumed to be the set of (probability) distributions (or
positive measures) p over \mathcal{X}. For functions p and q on \mathcal{X} with values in I, a
Bregman divergence¹ [4,7,10,11,12,13,18] is a generalized entropy measure that
is associated with the convex function \phi:

    B_\phi(p; q) = \int_{x \in \mathcal{X}} \Delta_\phi\bigl( p(x); q(x) \bigr)\, \mu(dx)

where

    \Delta_\phi\bigl( p(x); q(x) \bigr) = \phi(p(x)) - \phi(q(x)) - \phi'(q(x)) \bigl( p(x) - q(x) \bigr)

and \phi' denotes the derivative of \phi.² That is, the Bregman divergence B_\phi measures
the discrepancy between two distributions p and q by integrating the difference
between \phi evaluated at p and \phi's first-order Taylor expansion about q, evaluated
at p, over \mathcal{X}.
To strengthen the interpretation of Bφ (p; q) as a measure of distance, we
make the following assumptions.
• \Delta_\phi(u; v) is strictly convex in u and in v separately, and also satisfies the
  stronger property that it is jointly convex in (u, v). Thus our choice of Bregman
  divergence B_\phi(p; q) is strictly convex in p and in q separately, and also jointly
  convex. This assumption lies at the heart of the analysis below.
• B_\phi(p; q) is lower semi-continuous in p and q jointly.
• For any fixed p \in S, the level sets \{ q : B_\phi(p; q) \le \epsilon \} are bounded.
• If B_\phi(p^k; q^k) \to 0 and p^k or q^k is bounded, then p^k \to q^k and q^k \to p^k.
• If p \in S and q^k \to p, then B_\phi(p; q^k) \to 0.

Examples
1. Let φ(t) = t log t be defined on I = [0, ∞). Then φ′(t) = log t + 1, and

   Bφ(p; q) = D(p; q) = ∫_{x∈X} [ p(x) log(p(x)/q(x)) − p(x) + q(x) ] µ(dx)
¹ The machine learning community [8,13,18,19] is familiar with the discrete case, since
for supervised learning there are a finite number of sample points (training exam-
ples), so we can write the constraints as pertaining to a finite-dimensional vector.
However, in unsupervised learning we are usually dealing with continuous variables,
and therefore instead of a vector, we are working with an infinite-dimensional space.
² In this paper, µ denotes a given σ-finite measure on X. If X is finite or countably
infinite, then µ is the counting measure, and integrals reduce to sums. If X is a subset
of a finite-dimensional space, µ is the Lebesgue measure. If X is a combination of
both cases, µ will be a combination of both measures.
Learning Continuous Latent Variable Models with Bregman Divergences 193

which is the generalized KL divergence. This is the objective function of
the primal problem for AdaBoost [20]. When p and q are restricted to be
probability measures, it becomes the KL divergence, the objective function
of the primal problem for LogitBoost [16,20]. Furthermore, when q is chosen
to be uniform, it becomes the Shannon entropy.
2. Let φ(t) = t² be defined on I = (−∞, ∞). Then φ′(t) = 2t, and

   Bφ(p; q) = ‖p − q‖²_{L²(µ)}

   which is the measure of energy.


3. Let φ(t) = − log t be defined on I = (0, ∞). Then φ′(t) = −1/t, and

   Bφ(p; q) = ∫_{x∈X} [ log(q(x)/p(x)) + p(x)/q(x) − 1 ] µ(dx)

   which is the Itakura-Saito distortion that arises in the spectral analysis of
   speech signals. When q = 1, it becomes the Burg entropy [17].
4. Let φ(t) = t log t + (1 − t) log(1 − t) be defined on I = [0, 1]. Then
   φ′(t) = log(t/(1 − t)), and

   Bφ(p; q) = ∫_{x∈X} [ p(x) log(p(x)/q(x)) + (1 − p(x)) log((1 − p(x))/(1 − q(x))) ] µ(dx)

   which is the Bernoulli entropy. When q = 1/2, it becomes the Fermi-Dirac
   entropy.
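On a finite state space with counting measure these divergences reduce to sums, so the four examples above can be checked numerically; a minimal sketch (the toy vectors p, q and all function names are ours, not the paper's):

```python
import math

def bregman(p, q, phi, dphi):
    # B_phi(p; q) = sum_x phi(p(x)) - phi(q(x)) - phi'(q(x)) (p(x) - q(x))
    return sum(phi(pi) - phi(qi) - dphi(qi) * (pi - qi) for pi, qi in zip(p, q))

# Example 1: phi(t) = t log t  ->  generalized KL divergence
phi1, dphi1 = lambda t: t * math.log(t), lambda t: math.log(t) + 1
# Example 2: phi(t) = t^2     ->  squared L2 distance ("energy")
phi2, dphi2 = lambda t: t * t, lambda t: 2 * t
# Example 3: phi(t) = -log t  ->  Itakura-Saito distortion
phi3, dphi3 = lambda t: -math.log(t), lambda t: -1.0 / t

p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]

kl = sum(pi * math.log(pi / qi) - pi + qi for pi, qi in zip(p, q))
assert abs(bregman(p, q, phi1, dphi1) - kl) < 1e-12

energy = sum((pi - qi) ** 2 for pi, qi in zip(p, q))
assert abs(bregman(p, q, phi2, dphi2) - energy) < 1e-12

itakura = sum(math.log(qi / pi) + pi / qi - 1 for pi, qi in zip(p, q))
assert abs(bregman(p, q, phi3, dphi3) - itakura) < 1e-12
```

Each check confirms that the generic three-term form of ∆φ collapses to the named closed form for that choice of φ.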

To formulate the minimum Bregman divergence principle, assume we have a


finite set of features f1(x), ..., fN(x) which correspond to sufficient statistics in a
log-linear model, weak learners in boosting, or basis functions in non-parametric
estimation. Given a set of complete data points X̃ = (Ỹ, Z̃), where Ỹ are ob-
served “descriptions” and Z̃ are observed “labels”, the minimum Bregman di-
vergence principle (MB) is:
MB principle. Choose a conditional distribution p(z|y) to minimize

   min_{p(z|y)∈S} Bφ( p̃(y)p(z|y); q0(x) )                                   (1)

subject to the constraints

   Σ_{y∈Ỹ} p̃(y) ∫_{z∈Z} fi(x) p(z|y) µ(dz) = Σ_{x∈X̃} p̃(x)fi(x)   for i = 1, ..., N   (2)

where x = (y, z), q0 ∈ S is a default distribution chosen so that φ′(q0) = 0
(quite often we set q0 to be uniform), and p̃(x) and p̃(y) denote the empirical
distributions of the complete and marginal data respectively.
In general, iterative scaling [11,12,13,18] is used to obtain the (global) optimal
solution for the MB principle.

In contrast to the MB principle, if the labels Z̃ are unobserved, we propose the
latent minimum Bregman divergence principle (LMB) as follows.

LMB principle. Choose a joint distribution p(x) to minimize

   min_{p∈S} Bφ(p; q0)                                                      (3)

subject to the constraints

   ∫_{x∈X} fi(x) p(x) µ(dx) = Σ_{y∈Ỹ} p̃(y) ∫_{z∈Z} fi(x) p(z|y) µ(dz),   for i = 1, ..., N   (4)

where x = (y, z), q0 ∈ S is a default distribution chosen so that φ′(q0) = 0, and
quite often we set q0 to be uniform. Here p̃(y) is the empirical distribution of
the observed data and p(z|y) is the conditional distribution of latent variables
given the observed data.
the observed data and p(z|y) is the conditional distribution of latent variables
given the observed data.
Note that the conditional p(z|y) implicitly encodes the latent structure and
is a nonlinear mapping of p(x). That is, p(z|y) = p(y, z)/∫_{z′∈Z} p(y, z′) µ(dz′) =
p(x)/∫_{x′=(y,z′)} p(x′) µ(dx′), where x = (y, z) and x′ = (y, z′). Clearly p(z|y) is
a non-linear function of p(x) because of the division. This means we are faced
with minimizing an objective (3) subject to a system of non-linear constraints
(4). Therefore, even though the objective function (3) is convex, no unique min-
imum can be guaranteed to exist. In fact, maxima and saddle points may exist.
Nevertheless, one can still attempt to derive an iterative training procedure that
finds feasible solutions to the LMB problem. With such a subroutine in hand,
one can then heuristically solve the LMB principle by gathering several feasible
solutions (by starting with different initial points) and then choosing the feasible
p that obtains the smallest Bregman divergence.
To illustrate how the LMB principle is related to unsupervised learning, as-
sume we are given a collection of unlabeled examples from which we wish to
construct a linear combination of weak “decision stumps” to create a “strong”
predictive model for clustering. In this case, we can formulate the problem
as minimizing the generalized K-L divergence of an unnormalized exponential
model defined in terms of the features (the decision stumps) subject to the (non-
linear) constraints that the model matches the generalized feature expectations.
Below we focus on developing an iterative algorithm for finding feasible so-
lutions. In general, solving (3) subject to (4) is quite complex. Since the original
problem does not yield a simple closed form solution for p, we instead look for
an approximate solution. First, we restrict the model to have an additive form.
Definition 1. [19] Let S ⊂ X be a set of measures. An additive model for S is
defined by an operation L : X × S → S satisfying the homomorphism property
L(r1 + r2, s) = L(r1, L(r2, s)) for all r1, r2 ∈ X and s ∈ S.

Lemma 1. [19] Given a convex function φ : ℝ → ℝ, let Bφ be the Bregman
divergence defined on measures p ∈ S. Define the Legendre transform v ◦φ q0 by

   v ◦φ q0 = arg min_{p∈S} Bφ(p; q0) − v · p

Then we have v ◦φ q0 = Lφ(q0, v) such that (Lφ(q0, v))(x) = (φ′)⁻¹( φ′(q0(x)) −
v(x) ) for all x. Also the map (v, q0) → v ◦φ q0 is an additive model for S.
By adopting an additive model restriction, we can make valuable progress
toward formulating a practical algorithm for approximately satisfying the LMB
principle.
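For Example 1, φ(t) = t log t, the map of Lemma 1 has the closed form (Lφ(q0, v))(x) = q0(x)e^{−v(x)}, so the homomorphism property of Definition 1 can be verified directly; a small sketch under that assumption (all vectors are our own toy values):

```python
import math

def L_phi(q, v):
    # (L_phi(q, v))(x) = (phi')^{-1}(phi'(q(x)) - v(x)); for phi(t) = t log t,
    # phi'(t) = log t + 1, so this reduces to q(x) * exp(-v(x))
    return [qi * math.exp(-vi) for qi, vi in zip(q, v)]

q0 = [1.0, 1.0, 1.0]
r1 = [0.3, -0.2, 0.5]
r2 = [-0.1, 0.4, 0.2]

# homomorphism property of Definition 1: L(r1 + r2, s) = L(r1, L(r2, s))
lhs = L_phi(q0, [a + b for a, b in zip(r1, r2)])
rhs = L_phi(L_phi(q0, r2), r1)
assert all(abs(a - b) < 1e-12 for a, b in zip(lhs, rhs))
```

The exponential form makes the additive structure transparent: adding the arguments multiplies the corresponding exponential factors.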
In the following, we use a doubly iterative projection algorithm to obtain
feasible additive solutions, and also provide a characterization of its convergence
properties and information geometry.

3 Preliminaries: Convergence of Alternating Projections


We present a generalization of the alternating projection method of Csiszar and
Tusnady [9] for Bregman divergences, and show how this technique can be used
to find feasible solutions for the LMB principle. In developing our method we
need to derive a slightly more general convergence result than [9], which is due
to [15]. These results were originally shown in [6,15] for the discrete case; here we
extend them to continuous variables.
Since projections onto closed convex sets may be thought of as solutions of
minimum divergence problems, we begin by introducing suitable definitions for
the Bregman divergence.
Definition 2. (Forward projection) Suppose Q ⊂ S is a nonempty closed
convex set, and let p ∈ S. We define the forward projection of p onto Q as the
unique element q ∗ ∈ Q such that Bφ (p; q ∗ ) = minq∈Q {Bφ (p; q)}.

Definition 3. (Backward projection) Suppose P ⊂ S is a nonempty closed


convex set, and let q ∈ S. We define the backward projection of q onto P as the
unique element p∗ ∈ P such that Bφ (p∗ ; q) = minp∈P {Bφ (p; q)}.
We can then define the alternating projection algorithm associated with the
Bregman divergence.
Alternating minimization (AM) algorithm. Consider two nonempty
closed convex sets P, Q ⊂ S.
Initialization: Let q0 ∈ Q be an arbitrary distribution such that there exists
p ∈ P with Bφ(p; q0) < ∞.
Iterative step: Given qk, find pk by backward projection onto P:

   pk = arg min_{p∈P} Bφ(p; qk)

Then calculate qk+1 by forward projection onto Q:

   qk+1 = arg min_{q∈Q} Bφ(pk; q)

Repeat until convergence.
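With φ(t) = t², both projections are ordinary Euclidean projections, so the AM iteration can be exercised on two simple convex sets; a sketch in which the sets P (a hyperplane) and Q (a box) are our own illustrative choices:

```python
def proj_hyperplane(q, target=1.0):
    # backward projection onto P = {p : sum(p) = target}; for phi(t) = t^2
    # the Bregman projection is the ordinary Euclidean projection
    shift = (target - sum(q)) / len(q)
    return [qi + shift for qi in q]

def proj_box(p, lo=0.0, hi=0.3):
    # forward projection onto Q = [lo, hi]^n, again Euclidean for phi(t) = t^2
    return [min(max(pi, lo), hi) for pi in p]

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

q = [0.05, 0.1, 0.25]            # arbitrary starting point in Q
gaps = []
for _ in range(50):
    p = proj_hyperplane(q)        # p^k     = argmin over P of B(p; q^k)
    q = proj_box(p)               # q^{k+1} = argmin over Q of B(p^k; q)
    gaps.append(dist2(p, q))

# the divergence between successive iterates never increases, and the limit
# pair attains the minimum of ||p - q||^2 over P x Q (here q -> [0.3]*3)
assert all(a >= b - 1e-12 for a, b in zip(gaps, gaps[1:]))
assert all(abs(qi - 0.3) < 1e-9 for qi in q)
```

Because the two sets here are disjoint, the iterates converge to the closest pair rather than to a common point, matching the min-over-both-sets characterization.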


To prove that this procedure converges, we first demonstrate the “three points”
and “four points” properties for the Bregman divergence.

Lemma 2. (Three points property) Consider a Bregman divergence Bφ on


two nonempty closed convex sets P, Q ⊂ S. Let q ∈ Q be such that Bφ(p; q) < ∞
for all p ∈ P, and let p∗ = arg min_{p∈P} Bφ(p; q). Then for all p ∈ P we have

   Bφ(p; q) − Bφ(p∗; q) ≥ Bφ(p; p∗)

Proof. By definition of Bregman divergence, we have

   Bφ(p; q) − Bφ(p∗; q)
   = ∫_{x∈X} [ φ(p(x)) − φ(p∗(x)) − φ′(q(x))(p(x) − p∗(x)) ] µ(dx)
   = ∫_{x∈X} [ φ(p(x)) − φ(p∗(x)) − φ′(p∗(x))(p(x) − p∗(x))
             + (φ′(p∗(x)) − φ′(q(x)))(p(x) − p∗(x)) ] µ(dx)
   = Bφ(p; p∗) + ∫_{x∈X} (φ′(p∗(x)) − φ′(q(x)))(p(x) − p∗(x)) µ(dx)

Denote the partial gradient of Bφ with respect to its first argument as
∇p Bφ(p; q), and note that ∂Bφ(p; q)/∂p(x) = φ′(p(x)) − φ′(q(x)). Therefore we
have

   (∇p Bφ(p∗; q))(x) = φ′(p∗(x)) − φ′(q(x))

for all x. Since p∗ minimizes Bφ(p; q) over the convex set P, we must have

   ∇p Bφ(p∗; q) · (p − p∗) ≥ 0

The result then follows since

   ∫_{x∈X} (φ′(p∗(x)) − φ′(q(x)))(p(x) − p∗(x)) µ(dx) = ∇p Bφ(p∗; q) · (p − p∗)

Lemma 3. (Four points property) Consider a jointly convex Bregman
divergence Bφ on two nonempty closed convex sets P, Q ⊂ S. Let p ∈ P be such
that Bφ(p; q) < ∞ for all q ∈ Q, and let q∗ = arg min_{q∈Q} Bφ(p; q). Then for all
u ∈ P, v ∈ Q we have

   Bφ(u; q∗) ≤ Bφ(u; p) + Bφ(u; v)

Proof. By the joint convexity assumption of ∆φ(p(x); q(x)), we have

   ∆φ(u(x); v(x)) ≥ ∆φ(p(x); q∗(x)) + [∂/∂p(x)] ∆φ(p(x); q∗(x)) (u(x) − p(x))
                   + [∂/∂q∗(x)] ∆φ(p(x); q∗(x)) (v(x) − q∗(x))

for all x. Therefore

   Bφ(u; v) ≥ Bφ(p; q∗) + ∇p Bφ(p; q∗) · (u − p) + ∇q∗ Bφ(p; q∗) · (v − q∗)



Since q∗ minimizes Bφ(p; q) over the convex set Q, we have

   ∇q∗ Bφ(p; q∗) · (v − q∗) ≥ 0

Thus

   Bφ(u; v) − Bφ(p; q∗) − ∇p Bφ(p; q∗) · (u − p) ≥ 0

On the other hand, by the definition of Bregman divergence, we have

   Bφ(u; p) − Bφ(u; q∗) = ∫_{x∈X} [ φ(q∗(x)) − φ(p(x)) + φ′(q∗(x))(u(x) − q∗(x))
                                   − φ′(p(x))(u(x) − p(x)) ] µ(dx)
   = −Bφ(p; q∗) − ∫_{x∈X} (φ′(p(x)) − φ′(q∗(x)))(u(x) − p(x)) µ(dx)
   = −Bφ(p; q∗) − ∇p Bφ(p; q∗) · (u − p)

Thus we obtain

   Bφ(u; p) + Bφ(u; v) − Bφ(u; q∗) = Bφ(u; v) − Bφ(p; q∗) − ∇p Bφ(p; q∗) · (u − p)
                                   ≥ 0

Given these two lemmas, following [15] we obtain the following convergence
result.

Theorem 1. The alternating minimization algorithm (AM) converges. That is,
p1, p2, ... converges to some p∞ ∈ P, and q1, q2, ... converges to some q∞ ∈ Q,
such that

   Bφ(p∞; q∞) = min_{p∈P, q∈Q} Bφ(p; q)

The proof of this theorem follows the same line of argument as that of Theorem
2.17 given in [15].

4 The AM-IS Algorithm for Learning Latent Structure


We now extend this alternating minimization algorithm to finding feasible so-
lutions to the LMB principle. To understand the algorithm and its information
geometry, we first define some useful sub-manifolds in S.

   C = { p ∈ S : ∫_{x∈X} fi(x) p(x) µ(dx) = Σ_{y∈Ỹ} p̃(y) ∫_{z∈Z} fi(x) p(z|y) µ(dz), i = 1...N }

   M = { p ∈ S : ∫_{z∈Z} p(y, z) µ(dz) = p̃(y), for each y ∈ Ỹ }

   Ga = { p ∈ S : ∫_{x∈X} p(x)fi(x) µ(dx) = ai, i = 1...N }

   E = { pλ ∈ S : pλ(x) = Lφ( q0, Σ_{i=1}^N λi fi(x) ), λ ∈ Ω }

where C denotes the set of nonlinear constraints the model should satisfy, M
denotes the set of distributions whose observed marginal distribution matches
the observed empirical distribution, Ga denotes the set of distributions whose
feature expectations are constant, E denotes the set of additive models, and

   Ω = { λ ∈ ℝ^N : Lφ( q0, Σ_{i=1}^N λi fi(x) ) < ∞ }

Now by choosing the closed convex set M̄ to play the role of P in the previous
discussion, and choosing the closed convex set Ē to play the role of Q, we can
define the corresponding forward and backward projection operators, and then
use these to iterate toward feasible LMB solutions.
First, to derive a backward projection operator, take a current pkλ playing
the role of qk in the previous discussion, and use this to determine a distribution
p∗ ∈ M̄ that minimizes

   Bφ(p∗; pkλ) = min_{p∈M̄} Bφ(p; pkλ)

That is, p∗ is the backward projection of pkλ onto M̄. To solve for p∗, one can
formulate the Lagrangian Λ(p, α)

   Λ(p, α) = Bφ(p; pkλ) + Σ_{y∈Ỹ} αy ( ∫_{z∈Z} p(y, z) µ(dz) − p̃(y) )

Now since

   ∂Λ(p, α)/∂p(x) = φ′(p(x)) − φ′(pkλ(x)) + αy

it is not hard to see that the solution p∗ must satisfy

   p∗(x) = (φ′)⁻¹( φ′(pkλ(x)) − αy )

for all y ∈ Ỹ, where αy is chosen so that

   ∫_{z∈Z} p∗(y, z) µ(dz) = ∫_{z∈Z} (φ′)⁻¹( φ′(pkλ(y, z)) − αy ) µ(dz) = p̃(y)    (5)

Lemma 4. For φ corresponding to Examples 1 and 2 above, the backward
projection of pkλ onto M̄ is given by the closed form solution p∗(x) = p̃(y)pkλ(z|y)
for all x = (y, z).

Proof. Note that for φ(t) = t log t (Example 1) or φ(t) = t² (Example 2) we have

   φ′(pkλ(x)) − φ′(pkλ(y)) + φ′(p̃(y)) = φ′( pkλ(z|y)p̃(y) )

Therefore, if we let αy = φ′(pkλ(y)) − φ′(p̃(y)) we obtain

   p∗(x) = (φ′)⁻¹( φ′(pkλ(x)) − αy )
         = (φ′)⁻¹( φ′(pkλ(z|y)p̃(y)) )
         = p̃(y)pkλ(z|y)

Thus in many cases we can implement the backward projection step for AM
merely by calculating the conditional distribution pkλ(z|y) of the current model.
In general, one has to solve for the Lagrange multipliers that satisfy (5) to yield
a general form of the backward projection p∗(x) = p̃(y)pkαy,λ(z|y). In this case,
instead of using the original conditional distribution p(z|y) on the right-hand
side of the constraints, Eqn (4), a modified conditional distribution pαy(z|y),
which is a function of p(z|y), has to be used in the problem formulation of the
LMB principle.
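For the generalized K-L case of Lemma 4, the backward projection is just p∗(x) = p̃(y)pkλ(z|y); a discrete sketch (the joint table and marginals are invented for illustration):

```python
import math

# joint model p^k(y, z) on a small discrete grid (all numbers invented)
pk = {('y1', 'z1'): 0.1, ('y1', 'z2'): 0.3,
      ('y2', 'z1'): 0.4, ('y2', 'z2'): 0.2}
p_tilde = {'y1': 0.7, 'y2': 0.3}     # empirical marginal of the observed y

# backward projection of Lemma 4: p*(y, z) = p~(y) p^k(z|y)
marg = {}
for (y, z), v in pk.items():
    marg[y] = marg.get(y, 0.0) + v
p_star = {(y, z): p_tilde[y] * v / marg[y] for (y, z), v in pk.items()}

# p* lies in M-bar: its y-marginal matches the empirical distribution
for y in p_tilde:
    s = sum(v for (yy, _), v in p_star.items() if yy == y)
    assert abs(s - p_tilde[y]) < 1e-12

def gkl(p, q):
    # generalized K-L divergence on the counting measure (Example 1)
    return sum(p[k] * math.log(p[k] / q[k]) - p[k] + q[k] for k in p)

# any other element of M-bar sits farther from p^k in Bregman divergence
alt = dict(p_star)
alt[('y1', 'z1')] += 0.05
alt[('y1', 'z2')] -= 0.05
assert gkl(p_star, pk) < gkl(alt, pk)
```

The last comparison illustrates the backward-projection property: among all joints with the empirical y-marginal, p∗ is the closest to pk.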
Next, to formulate the forward projection step, we exploit the following
lemma.

Lemma 5. For any pk ∈ M̄, the forward projection of pk onto Ē is equivalent
to solving the minimization

   min_{p∈Ga} Bφ(p; q0)   where   ai = ∫_{x∈X} pk(x)fi(x) µ(dx)           (6)

Proof. To find the solution of (6), form the Lagrangian Ψ(p, λ)

   Ψ(p, λ) = Bφ(p; q0) + Σ_{i=1}^N λi ( ∫_{x∈X} p(x)fi(x) µ(dx) − ∫_{x∈X} pk(x)fi(x) µ(dx) )

Now since

   ∂Ψ(p, λ)/∂p(x) = φ′(p(x)) − φ′(q0(x)) + Σ_{i=1}^N λi fi(x)

any solution must satisfy

   pλ(x) = (φ′)⁻¹( φ′(q0(x)) − Σ_{i=1}^N λi fi(x) )

Plugging into Ψ, we are left with the problem of maximizing

   Ψ(pλ, λ) = ∫_{x∈X} [ φ(pλ(x)) − φ(q0(x)) − φ′(q0(x))(pλ(x) − q0(x)) ] µ(dx)
             + Σ_{i=1}^N λi ( ∫_{x∈X} pλ(x)fi(x) µ(dx) − ∫_{x∈X} pk(x)fi(x) µ(dx) )
            = Bφ(pk; q0) − Bφ(pk; pλ)

which is equivalent to minimizing Bφ(pk; pλ) over pλ ∈ Ē, the forward projection
of pk onto Ē.
To solve the minimization problem specified in (6) one can use iterative scal-
ing. By using an auxiliary function to bound the change in Bregman divergence
from below, the iterative scaling algorithm can be derived. Following [13], define
an auxiliary function as follows:

Definition 4. We call A : S × ℝ^N → ℝ an auxiliary function for pk and f if it
satisfies the following conditions:
1. A(q, λ) is continuous in q and A(q, 0) = 0.
2. Bφ(pk; q) − Bφ(pk; Lφ(q, Σ_{i=1}^N fi(x)λi)) ≥ A(q, λ).
3. If λ = 0 is a maximum of A(q, λ), then

   ∫_{x∈X} fi(x) q(x) µ(dx) = ∫_{x∈X} fi(x) pk(x) µ(dx),   i = 1, ..., N

Maximizing this auxiliary function we obtain new parameters λ′ = λ + ∆λ and
a new model given by

   qλ+∆λ = L( qλ, Σ_{i=1}^N ∆λi fi(x) )
         = L( L( q0, Σ_{i=1}^N λi fi(x) ), Σ_{i=1}^N ∆λi fi(x) )
         = L( q0, Σ_{i=1}^N (λi + ∆λi) fi(x) )

When ∆λ = 0, we have that qλ = arg min_{p∈Ga} Bφ(p; q0).


Lemma 6. Define f̄(x) = Σ_{i=1}^N |fi(x)|, σi(x) = sign(fi(x)) and lφ(q, v) =
sup_{p∈S} v · p − Bφ(p; q). Then

   A(q, λ) ≝ Σ_{i=1}^N λi ∫_{x∈X} fi(x) pk(x) µ(dx)                        (7)
            − ∫_{x∈X} Σ_{i=1}^N (|fi(x)| / f̄(x)) lφ( q, σi(x)f̄(x)λi ) µ(dx)

is an auxiliary function for pk and f, and the corresponding iterative scaling
update scheme is given by

   qt+1k = Lφ( qtk, Σ_{i=1}^N ∆λi fi(x) )                                  (8)

where ∆λi, i = 1, ..., N, satisfies

   ∫_{x∈X} fi(x) Lφ( qtk, σi(x)f̄(x)∆λi ) µ(dx) = ∫_{x∈X} fi(x) pk(x) µ(dx)   (9)

and

   lim_{t→∞} qtk = q∞k = arg min_{q∈Ē} Bφ(pk; q)                           (10)
t→∞ q∈Ē

Proof. Following [13], which considers discrete state distributions, we consider
the continuous case. The proof is essentially identical.
We verify that the function defined in (7) satisfies the three properties of the
definition. Property (1) holds since lφ(q, 0) = 0. Property (2) follows from the
convexity of lφ:

   lφ( q, Σ_{i=1}^N λi fi(x) ) = lφ( q, Σ_{i=1}^N σi(x)|fi(x)|λi )         (11)
   ≤ ∫_{x∈X} Σ_{i=1}^N (|fi(x)| / f̄(x)) lφ( q, σi(x)f̄(x)λi ) µ(dx)        (12)

The rest of the proof follows exactly the proof of Proposition 4.4 of [13].
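In the generalized K-L case with nonnegative features satisfying Σi fi(x) = 1 pointwise, solving (9) gives the classical generalized iterative scaling multiplier (ai/Ei)^{fi(x)}; a sketch on an unnormalized three-point model (all feature values and targets are our own):

```python
# unnormalized exponential model q(x) = q0(x) exp(-sum_i lambda_i f_i(x)),
# fit by generalized iterative scaling; nonnegative features with f1 + f2 = 1
xs = [0, 1, 2]
f1 = [0.2, 0.5, 0.9]
f2 = [1.0 - v for v in f1]
pk = [0.3, 0.5, 0.4]                       # target measure from a backward step
a = [sum(f[x] * pk[x] for x in xs) for f in (f1, f2)]   # target expectations

q = [1.0, 1.0, 1.0]                        # q0 uniform (unnormalized)
for _ in range(2000):
    E = [sum(f[x] * q[x] for x in xs) for f in (f1, f2)]
    # full parallel update: each feature contributes the factor (a_i/E_i)^{f_i(x)}
    q = [q[x] * (a[0] / E[0]) ** f1[x] * (a[1] / E[1]) ** f2[x] for x in xs]

E = [sum(f[x] * q[x] for x in xs) for f in (f1, f2)]
assert all(abs(Ei - ai) < 1e-6 for Ei, ai in zip(E, a))
```

At the fixed point the model's feature expectations equal the targets ai, which is exactly the stopping condition (9) demands.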

We are then able to find feasible solutions for the LMB principle by using
an algorithm that combines the previous AM algorithm with a nested IS loop
to calculate the forward projection.

AM-IS algorithm:

Backward projection: Compute pk(x) = p̃(y) pkαy,λ(z|y), which yields ai =
∫_{x∈X} pk(x)fi(x) µ(dx), i = 1, ..., N, for the forward projection step.

Forward projection: Perform iterations of the full parallel update IS as in (8)
and (9) to obtain the parameter values λ∞ = (λ∞1, ..., λ∞N) and set pkλ(x) = q∞k(x).

This alternating procedure will halt at a point where the three manifolds C,
E and Ga have a common intersection, since we reach a stationary point in that
case. Due to the nonlinearity of the manifold C, the intersection is not unique,
and multiple feasible solutions may exist.
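The whole AM-IS loop can likewise be sketched in the generalized K-L setting, combining the backward projection of Lemma 4 with a nested GIS forward projection; the tiny observed/latent space, the features, and all numbers below are our own illustrative choices, not the paper's:

```python
# AM-IS sketch for the generalized K-L case on a tiny discrete problem
ys, zs = ['y1', 'y2'], ['z1', 'z2']
xs = [(y, z) for y in ys for z in zs]
f1 = {('y1', 'z1'): 0.1, ('y1', 'z2'): 0.6,
      ('y2', 'z1'): 0.8, ('y2', 'z2'): 0.3}
f2 = {x: 1.0 - f1[x] for x in xs}        # f1 + f2 = 1 pointwise, so GIS applies
p_tilde = {'y1': 0.6, 'y2': 0.4}
q = {x: 1.0 for x in xs}                 # q0 uniform (unnormalized)

def expectations(r):
    return [sum(f[x] * r[x] for x in xs) for f in (f1, f2)]

for _ in range(200):
    # backward projection (Lemma 4): p^k(y, z) = p~(y) q(z|y)
    marg = {y: sum(q[(y, z)] for z in zs) for y in ys}
    p = {(y, z): p_tilde[y] * q[(y, z)] / marg[y] for (y, z) in xs}
    a = expectations(p)
    # forward projection: nested GIS until q matches the expectations a_i
    for _ in range(200):
        E = expectations(q)
        q = {x: q[x] * (a[0] / E[0]) ** f1[x] * (a[1] / E[1]) ** f2[x]
             for x in xs}

# feasibility for the LMB constraints (4): model feature expectations equal
# the p~(y)-weighted conditional expectations under the same model
marg = {y: sum(q[(y, z)] for z in zs) for y in ys}
for f in (f1, f2):
    lhs = sum(f[x] * q[x] for x in xs)
    rhs = sum(p_tilde[y] * sum(f[(y, z)] * q[(y, z)] / marg[y] for z in zs)
              for y in ys)
    assert abs(lhs - rhs) < 1e-5
```

The final check is exactly constraint (4): a halted run of the doubly iterative procedure yields a feasible, though not necessarily optimal, solution.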
We are now ready to prove the main result of this section that AM-IS can
be shown to converge and hence is guaranteed to yield feasible solutions to the
LMB principle.

Theorem 2. The AM-IS algorithm asymptotically yields feasible solutions to


the LMB principle for additive models.

Proof. By lemmas 5 and 6 and choose the closed convex set M̄ to play the role
of P in Theorem 2, and choose the closed convex set Ē to play the role of Q in
Theorem 2. The conclusion immediately follows.

Unlike the standard K-L divergence for which the EM-IS algorithm can be
shown to monotonically increase likelihood during each iteration [24], monotonic
improvement will not necessarily hold under the Bregman divergences.

5 Information Geometry of AM-IS


We gain further insight by considering the well-known Pythagorean theorem [13]
for additive models, which in the complete data case states that if there exists
pλ∗ ∈ Ḡa ∩ Ē then

   Bφ(p; pλ) = Bφ(p; pλ∗) + Bφ(pλ∗; pλ)   for all p ∈ Ḡa and pλ ∈ Ē

In the incomplete data case, this theorem needs to be modified to incorporate
the effect of latent variables. Unlike the case in [24], in general, M is not
a sub-manifold of C̄, and thus the information-geometric interpretation of the
Pythagorean theorem needs to be slightly modified.
Theorem 3. (Pythagorean Property) For all pλ ∈ Ē and pλ∗ ∈ C̄ ∩ Ē, there exists
a p(x) ∈ M̄ such that

   Bφ(p; pλ) = Bφ(p; pλ∗) + Bφ(pλ∗; pλ)                                    (13)

Proof. For all pλ∗ ∈ C̄ ∩ Ē, pick p(x) = p̃(y)pαy,λ∗(z|y). Obviously p ∈ M̄. Now
we show that for all pλ ∈ Ē

   Bφ( p̃(y)pαy,λ∗(z|y); pλ(x) ) = Bφ( p̃(y)pαy,λ∗(z|y); pλ∗(x) ) + Bφ( pλ∗(x); pλ(x) )

Establishing the above equation is equivalent to showing

   ∫_{x∈X} [ φ(p̃(y)pαy,λ∗(z|y)) − φ(pλ(x)) − φ′(pλ(x))( p̃(y)pαy,λ∗(z|y) − pλ(x) ) ] µ(dx)
   = ∫_{x∈X} [ φ(p̃(y)pαy,λ∗(z|y)) − φ(pλ∗(x)) − φ′(pλ∗(x))( p̃(y)pαy,λ∗(z|y) − pλ∗(x) ) ] µ(dx)
   + ∫_{x∈X} [ φ(pλ∗(x)) − φ(pλ(x)) − φ′(pλ(x))( pλ∗(x) − pλ(x) ) ] µ(dx)

Cancelling common terms leaves

   ∫_{x∈X} φ′(pλ(x)) p̃(y)pαy,λ∗(z|y) µ(dx)
   = ∫_{x∈X} φ′(pλ∗(x))( p̃(y)pαy,λ∗(z|y) − pλ∗(x) ) µ(dx)
   + ∫_{x∈X} φ′(pλ(x)) pλ∗(x) µ(dx)

Plugging

   pλ(x) = L( q0, Σ_{i=1}^N λi fi(x) ) = (φ′)⁻¹( φ′(q0) + Σ_{i=1}^N λi fi(x) )
   pλ∗(x) = L( q0, Σ_{i=1}^N λ∗i fi(x) ) = (φ′)⁻¹( φ′(q0) + Σ_{i=1}^N λ∗i fi(x) )

into the above equation, we then have

   ∫_{x∈X} ( φ′(q0) + Σ_{i=1}^N λi fi(x) ) p̃(y)pαy,λ∗(z|y) µ(dx)
   = ∫_{x∈X} ( φ′(q0) + Σ_{i=1}^N λ∗i fi(x) ) ( p̃(y)pαy,λ∗(z|y) − pλ∗(x) ) µ(dx)
   + ∫_{x∈X} ( φ′(q0) + Σ_{i=1}^N λi fi(x) ) pλ∗(x) µ(dx)

Cancelling the common terms, we are then left with

   Σ_{i=1}^N (λi − λ∗i) [ Σ_{y∈Ỹ} p̃(y) ∫_{z∈Z} pαy,λ∗(z|y)fi(x) µ(dz) − ∫_{x∈X} pλ∗(x)fi(x) µ(dx) ] = 0

The term inside the brackets is 0 since pλ∗(x) ∈ C̄ ∩ Ē.

6 Summary

There are a number of iterative methods for performing Bregman divergence
projections onto convex sets that can be used to illustrate existing supervised
machine learning techniques. In this paper, we have presented a class of unsuper-
vised statistical learning algorithms formulated in terms of minimizing Bregman
divergences subject to a set of non-linear constraints that involve hidden vari-
ables. We have proposed a new alternating minimization algorithm with nested
iterative scaling that asymptotically finds feasible solutions to this constrained
optimization problem, and characterized its convergence and information geometry
properties.
We are developing this framework to provide analytical tools for transforming
current supervised machine learning techniques into unsupervised counterparts.
In general, a greedy search procedure [25] similar to that in AdaBoost can be devel-
oped to automatically extract hidden latent structure. Preliminary experimental
results on unsupervised boosting for clustering and gender-independent speech
signal analysis are encouraging.

Acknowledgement: This work is supported by MITACS and NSERC.

References

1. H. Bauschke and J. Borwein, “Joint and Separate Convexity of the Bregman Dis-
tance,” in: Inherently Parallel Algorithms in Feasibility and Optimization and Their
Applications, Elsevier, 2001, pp. 23–36

2. J. Borwein and A. Lewis, “Duality relationships for entropy-like minimization prob-


lems,” SIAM J. Control Optim., Vol. 29, No. 2, pp. 325–338, 1991
3. J. Borwein and A. Lewis, Convex Analysis and Nonlinear Optimization, Springer, 2000
4. L. Bregman, “The relaxation method of finding the common point of convex sets
and its applications to the solution of problems in convex programming,” USSR
Computational Mathematics and Mathematical Physics, Vol. 7, pp. 200–217, 1967
5. A. Buja and W. Stuetzle, “Degrees of Boosting,” manuscript, 2002
6. C. Byrne and Y. Censor, “Proximity function minimization using multiple Breg-
man projections with applications to split feasibility and Kullback-Leibler distance
minimization,” Annals of Operations Research, Vol. 105, pp. 77–98, 2001
7. Y. Censor and S. Zenios, Parallel Optimization: Theory, Algorithms, and Applica-
tions, Oxford University Press, 1997
8. M. Collins, R. Schapire and Y. Singer, “Logistic regression, AdaBoost and Bregman
distances,” Machine Learning, Vol. 48, No. 1–3, pp. 253–285, 2002
9. I. Csiszar and G. Tusnady, “Information geometry and alternating minimization
procedures,” Statistics and Decisions, Supplement Issue 1, pp. 205–237, 1984
10. I. Csiszar, “Why least squares and maximum entropy?” The Annals of Statistics,
Vol. 19, No. 4, pp. 2032–2066, 1991
11. I. Csiszar, “Generalized projections for non-negative functions,” Acta Mathematica
Hungarica, Vol. 68, No. 1–2, pp. 161–185, 1995
12. I. Csiszar, “Maxent, mathematics, and information theory,” Maximum Entropy
and Bayesian Methods, Edited by K. Hanson and R. Silver, pp. 35–50, Kluwer,
1996
13. S. Della Pietra, V. Della Pietra and J. Lafferty, “Duality and auxiliary functions
for Bregman distances,” Technical Report CMU-CS-01-109, CMU, 2001
14. A. Dempster, N. Laird and D. Rubin, “Maximum likelihood estimation from in-
complete data via the EM algorithm,” J. Royal Stat. Soc. B, Vol. 39, pp 1–38,
1977
15. P. Eggermont and V. LaRiccia, “On EM-like algorithms for minimum distance es-
timation,” Technical Report, Mathematical Sciences, University of Delaware, 1998
16. J. Friedman, T. Hastie and R. Tibshirani, “Additive logistic regression: A statistical
view of boosting,” Annals of Statistics, Vol. 28, No. 2, pp. 337–407, 2000
17. R. Johnson and J. Shore, “Which is the better entropy expression for speech pro-
cessing: -S log S or log S?,” IEEE Transactions on Acoustics, Speech, and Signal
Processing, Vol. 32, No. 1, pp. 129–137, 1984
18. J. Lafferty, S. Della Pietra and V. Della Pietra, “Statistical learning algorithms
based on Bregman distances,” Canadian Workshop on Info. Theory, pp. 77–80,
1997
19. J. Lafferty, “Additive models, boosting, and inference for generalized divergences,”
Annual Conference on Computational Learning Theory, pp. 125–133, 1999
20. G. Lebanon and J. Lafferty, “Boosting and maximum likelihood for exponential
models,” In Advances in Neural Information Processing Systems (NIPS), 14, 2002
21. D. Luenberger, Optimization by Vector Space Methods, John Wiley & Sons, 1969
22. V. Vapnik, The Nature of Statistical Learning Theory, Springer, 2000
23. T. Zhang, “Statistical behavior and consistency of classification methods based on
convex risk minimization,” to appear in Annals of Statistics, 2004
24. S. Wang, D. Schuurmans and Y. Zhao, “The latent maximum entropy principle,”
manuscript, 2002
25. S. Wang, D. Schuurmans, A. Ghodsi and J. Rosenthal, “Unsupervised Boosting
with the Latent Maximum Entropy Principle,” manuscript, 2003
A Stochastic Gradient Descent Algorithm for
Structural Risk Minimisation

Joel Ratsaby

University College London, Gower Street, London WC1E 6BT, United Kingdom
[Link]@[Link]

Abstract. Structural risk minimisation (SRM) is a general complexity


regularization method which automatically selects the model complexity
that approximately minimises the misclassification error probability of
the empirical risk minimiser. It does so by adding a complexity penalty
term ε(m, k) to the empirical risk of the candidate hypotheses and then,
for any fixed sample size m, minimising the sum with respect to the
model complexity variable k.
When learning multicategory classification there are M subsamples mi,
corresponding to the M pattern classes with a priori probabilities pi, 1 ≤
i ≤ M. Using the usual representation of a multi-category classifier as
M individual boolean classifiers, the penalty becomes Σ_{i=1}^M pi ε(mi, ki).
If the mi are given then the standard SRM trivially applies here by
minimizing the penalised empirical risk with respect to ki, i = 1, . . . , M.
However, in situations where the total sample size Σ_{i=1}^M mi needs to be
minimal, one needs to also minimize the penalised empirical risk with
respect to the variables mi, i = 1, . . . , M. The obvious problem is that
the empirical risk can only be defined after the subsamples (and hence
their sizes) are given (known).
Utilising an on-line stochastic gradient descent approach, this paper over-
comes this difficulty and introduces a sample-querying algorithm which
extends the standard SRM principle. It minimises the penalised empiri-
cal risk not only with respect to the ki , as the standard SRM does, but
also with respect to the mi , i = 1, . . . , M .
The challenge here is in defining a stochastic empirical criterion which
when minimised yields a sequence of subsample-size vectors which
asymptotically achieve the Bayes’ optimal error convergence rate.

1 Introduction

Consider the general problem of learning classification with M pattern classes


each with a class conditional probability density fi (x), 1 ≤ i ≤ M , x ∈ IRd ,
and a priori probabilities pi , 1 ≤ i ≤ M . The functions fi (x), 1 ≤ i ≤ M ,
are assumed to be unknown while the pi are assumed to be known or unknown
depending on the particular setting. The learner observes randomly drawn i.i.d.
examples each consisting of a pair of a feature vector x ∈ IRd and a label y ∈
{1, 2, . . . , M }, which are obtained by first drawing y from {1, . . . , M } according

R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 205–220, 2003.

c Springer-Verlag Berlin Heidelberg 2003

to a discrete probability distribution {p1 , . . . , pM } and then drawing x according


to the selected probability density fy (x).
Denoting by c(x) a classifier which represents a mapping c : IRd → {1, 2, . . . ,
M}, the misclassification error of c is defined as the probability of misclassifi-
cation of a randomly drawn x with respect to the underlying mixture probability
density function f(x) = Σ_{i=1}^M pi fi(x). This misclassification error is commonly
represented as the expected 0/1-loss, or simply as the loss, L(c) = E 1{c(x)≠y(x)},
of c, where expectation is taken with respect to f(x) and y(x) denotes the true
label (or class origin) of the feature vector x. In general y(x) is a random variable
depending on x, and only when the fi(x) have non-overlapping probability-1
supports is y(x) a deterministic function¹. The aim is to learn, based
on a finite randomly drawn labelled sample, the optimal classifier known as the
Bayes classifier which by definition has minimum loss. In this paper we pose the
following question:
Question: Can the learning accuracy be improved if labeled examples are
independently randomly drawn according to the underlying class conditional
probability distributions but the pattern classes, i.e., the example labels, are
chosen not necessarily according to their a priori probabilities ?
We answer this in the affirmative by showing that there exists a tuning of the
subsample proportions which minimizes a loss criterion. The tuning is relative
to the intrinsic complexity of the Bayes-classifier.
Before continuing let us introduce some notation. We write const to denote
absolute constants or constants which do not depend on other variables in the
mathematical expression. We denote by {(xj, yj)}_{j=1}^m an i.i.d. sample of labelled
examples, where m denotes the total sample size; the yj, 1 ≤ j ≤ m, are drawn i.i.d.,
taking the integer value ‘i’ with probability pi, 1 ≤ i ≤ M, while the cor-
responding xj are drawn according to the class conditional probability density
fyj(x). Denote by mi the number of examples having a y-value of ‘i’. Denote
by m = [m1, . . . , mM] the sample size vector and let |m| = Σ_{i=1}^M mi ≡ m.
The notation argmin_{k∈A} g(k) for a set A means the subset (of possibly more
than one element) whose elements have the minimum value of g over A. A
slight abuse of notation will be made by using it for countable sets, where
the notation means the subset of elements k such that² g(k) = inf_{k′} g(k′).
The loss L(c) is expressed in terms of the class-conditional losses, Li(c), as
L(c) = Σ_{i=1}^M pi Li(c) where Li(c) = Ei 1{c(x)≠i}, and Ei is the expectation with
respect to the density fi(x). The empirical counterparts of the loss and conditional
loss are Lm(c) = Σ_{i=1}^M pi Li,mi(c) where Li,mi(c) = (1/mi) Σ_{j:yj=i} 1{c(xj)≠i}

¹ According to the probabilistic data-generation model mentioned above, only regions
in the probability-1 support of the mixture distribution f(x) have a well-defined class
membership.
² In that case, technically, if there does not exist a k in A such that g(k) = inf_{k′} g(k′)
then we can always find arbitrarily close approximating elements kn, i.e., ∀ε > 0
∃N(ε) such that for n > N(ε) we have |g(kn) − inf_{k′} g(k′)| < ε.

where throughout the paper we assume the a priori probabilities are known to
the learner (see Assumption 1 below).
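The decomposition Lm(c) = Σi pi Li,mi(c) is what lets the learner choose the subsample sizes mi freely when the priors pi are known; a minimal sketch (the class-conditional densities, threshold classifier, and subsample sizes are our own choices, not the paper's):

```python
import random

# empirical loss L_m(c) = sum_i p_i * L_{i,m_i}(c) assembled from per-class
# subsamples whose sizes m_i need not follow the a priori probabilities p_i
random.seed(0)
p = [0.9, 0.1]                    # known a priori class probabilities
m = [30, 30]                      # subsample sizes chosen by the learner

def draw(i):
    # class-conditional samples: class 0 ~ N(0,1), class 1 ~ N(2,1) (our choice)
    return random.gauss(0.0 if i == 0 else 2.0, 1.0)

def c(x):
    # a fixed threshold classifier standing in for a candidate hypothesis
    return 0 if x < 1.0 else 1

L_cond = []
for i in range(len(p)):
    xs = [draw(i) for _ in range(m[i])]
    L_cond.append(sum(1 for x in xs if c(x) != i) / m[i])

L_emp = sum(pi * Li for pi, Li in zip(p, L_cond))
assert 0.0 <= L_emp <= 1.0
```

Note that each Li,mi is estimated from its own subsample, so the labels need not be drawn according to the pi, which is precisely the freedom the Question exploits.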

2 Structural Risk Minimisation

The loss $L(c)$ depends on the unknown underlying probability distributions; hence, realistically, a learning algorithm can work only with an estimate of $L(c)$. For a finite class $C$ of classifiers the empirical loss $L_m(c)$ is a consistent estimator of $L(c)$ uniformly for all $c \in C$; hence, provided that the sample size $m$ is sufficiently large, an algorithm that minimises $L_m(c)$ over $C$ will yield a classifier $\hat{c}$ whose loss $L(\hat{c})$ is an arbitrarily good approximation of the true minimum Bayes loss, denoted here as $L^*$, provided that the optimal Bayes classifier is contained in $C$. The Vapnik-Chervonenkis theory (Vapnik, 1982) characterises the condition for such uniform estimation over an infinite class $C$ of classifiers. The condition basically states that the class needs to have a finite complexity, or richness, known as the Vapnik-Chervonenkis dimension and defined as follows: for a class $H$ of functions from a set $X$ to $\{0,1\}$ and a set $S = \{x_1, \ldots, x_l\}$ of $l$ points in $X$, denote by $H_{|S} = \{[h(x_1), \ldots, h(x_l)] : h \in H\}$. Then the Vapnik-Chervonenkis dimension of $H$, denoted by $VC(H)$, is the largest $l$ such that the cardinality $|H_{|S}| = 2^l$ for some set $S$ of $l$ points. The method known as empirical risk minimisation represents a general learning approach which, for learning classification, minimises the 0/1 empirical loss; provided that the hypothesis class has a finite VC dimension, the method yields a classifier $\hat{c}$ with a loss asymptotically arbitrarily close to the minimum $L^*$.
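The shattering condition in this definition can be checked by brute force for tiny finite cases; a sketch (the threshold class and all helper names are my own illustration):

```python
def restrictions(hypotheses, points):
    """H|_S: the set of distinct behaviour vectors [h(x_1), ..., h(x_l)]."""
    return {tuple(h(x) for x in points) for h in hypotheses}

def is_shattered(hypotheses, points):
    """S is shattered iff |H|_S| = 2^l, i.e. every labelling is realised."""
    return len(restrictions(hypotheses, points)) == 2 ** len(points)

# Threshold classifiers h_t(x) = 1{x >= t}, a class of VC dimension 1:
thresholds = [lambda x, t=t: int(x >= t) for t in (0.5, 1.5, 2.5, 3.5)]
print(is_shattered(thresholds, [1.0]))        # True: one point is shattered
print(is_shattered(thresholds, [1.0, 2.0]))   # False: labelling (1, 0) is unrealisable
```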
As is often the case in real learning algorithms, the hypothesis class can be rich and may, for practical purposes, have an infinite VC dimension, for instance the class of all two-layer neural networks with a variable number of hidden nodes. The method of Structural Risk Minimisation (SRM) was introduced by Vapnik (1982) in order to learn such classes via empirical risk minimisation.
For the purpose of reviewing existing results we limit our discussion for the remainder of this section to the case of two-category classification; thus we use $m$ and $k$ as scalars representing the total sample size and class VC dimension, respectively. Let us denote by $C_k$ a class of classifiers having a VC dimension of $k$ and let $c_k^*$ be the classifier which minimises the loss $L(c)$ over $C_k$, i.e., $c_k^* = \mathrm{argmin}_{c \in C_k} L(c)$. The standard setting for SRM considers the overall class $C$ of classifiers as an infinite union of finite VC-dimension classes, i.e., $C = \bigcup_{k=1}^{\infty} C_k$; see for instance Vapnik (1982), Devroye et al. (1996), Shawe-Taylor et al. (1996), Lugosi & Nobel (1999), Ratsaby et al. (1996). The best performing classifier in $C$, denoted as $c^*$, is defined as $c^* = \mathrm{argmin}_{1 \leq k \leq \infty} L(c_k^*)$. Similarly, denote by $\hat{c}_k$ the empirically-best classifier in $C_k$, i.e., $\hat{c}_k = \mathrm{argmin}_{c \in C_k} L_m(c)$. Denoting by $k^*$ the minimal complexity of a class which contains $c^*$, then depending on the problem and on the type of classifiers used, $k^*$ may even be infinite, as in the case when the Bayes classifier is not contained in $C$. The complexity $k^*$ may be thought of as the intrinsic complexity of the Bayes classifier.
208 J. Ratsaby

The idea behind SRM is to minimise not the pure empirical loss $L_m(c_k)$ but a penalised version taking the form $L_m(c_k) + \epsilon(m,k)$, where $\epsilon(m,k)$ is some increasing function of $k$ and is sometimes referred to as a complexity penalty. The classifier chosen by the criterion is then defined by

$\hat{c}^* = \mathrm{argmin}_{1 \leq k \leq \infty} \left( L_m(\hat{c}_k) + \epsilon(m,k) \right).$   (1)

The term $\epsilon(m,k)$ is proportional to the worst-case deviation between the true loss and the empirical loss uniformly over all functions in $C_k$. More recently there has been interest in data-dependent penalty terms for structural risk minimisation which do not have an explicit complexity factor $k$ but are related to the class $C_k$ by being defined as a supremum of some empirical quantity over $C_k$, for instance the maximum discrepancy criterion (Bartlett et al., 2002) or the Rademacher complexity (Kultchinskii, 2001).
We take the penalty to be as in Vapnik (1982) (see also Devroye et al. (1996)): $\epsilon(m,k) = \mathrm{const} \sqrt{\frac{k \ln m}{m}}$, where again const stands for an absolute constant which for our purpose is not important. This bound is central to the computations of the paper³.
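As a sketch of how criterion (1) is evaluated with this penalty (the absolute constant is set to 1 and the loss values are made up for illustration; `srm_select` is not a name from the paper):

```python
import math

def srm_select(empirical_losses, m, const=1.0):
    """Pick the complexity k minimising L_m(c_hat_k) + eps(m, k), with the
    Vapnik penalty eps(m, k) = const * sqrt(k ln m / m). `empirical_losses[k]`
    is the empirical loss of the ERM classifier in the class C_k."""
    def penalised(k):
        return empirical_losses[k] + const * math.sqrt(k * math.log(m) / m)
    return min(empirical_losses, key=penalised)

# Typical shape: empirical loss falls with k while the penalty rises.
losses = {1: 0.30, 2: 0.12, 4: 0.05, 8: 0.04, 16: 0.04}
print(srm_select(losses, m=1000))   # -> 4: extra fit stops paying for the penalty
```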
It can be shown (Devroye et al., 1996) that for the two-pattern classification case the error rate of the SRM-chosen classifier $\hat{c}^*$ (which implicitly depends on the random sample of size $m$ since it is obtained by minimising the sum in (1)) satisfies

$L(\hat{c}^*) > L(c^*) + \mathrm{const} \sqrt{\frac{k^* \ln m}{m}}$   (2)

infinitely often with probability 0, where again $c^*$ is the Bayes classifier, which is assumed to be in $C$, and $k^*$ is its intrinsic complexity. The nice feature of SRM is that the selected classifier $\hat{c}^*$ automatically locks onto the minimal error rate as if the unknown $k^*$ was known beforehand.

3 Multicategory Classification

A classifier $c(x)$ may be represented as a vector of $M$ boolean classifiers $b_i(x)$, where $b_i(x) = 1$ if $x$ is a pattern drawn from class '$i$' and $b_i(x) = 0$ otherwise. A union of such boolean classifiers forms a well-defined classifier $c(x)$ if for each $x \in \mathbb{R}^d$, $b_i(x) = 1$ for exactly one $i$, i.e., $\bigcup_{i=1}^M \{x : b_i(x) = 1\} = \mathbb{R}^d$ and $\{x : b_i(x) = 1\} \cap \{x : b_j(x) = 1\} = \emptyset$ for $1 \leq i \neq j \leq M$. We also refer to these boolean classifiers as the component classifiers $c_i(x)$, $1 \leq i \leq M$, of a vector classifier $c(x)$. The loss of a classifier $c$ is just the weighted average of the losses of the component classifiers, i.e., $L(c) = \sum_{i=1}^M p_i L(c_i)$, where for a boolean
³ There is actually an improved bound due to Talagrand, cf. Anthony & Bartlett (1999) Section 4.6, but when adapted for almost sure statements it yields $O\left(\sqrt{\frac{k + \ln m}{m}}\right)$, which for our work is insignificantly better than $O\left(\sqrt{\frac{k \ln m}{m}}\right)$.
classifier $c_i$ the loss is defined as $L(c_i) = E_i 1_{\{c_i(x) \neq 1\}}$, and the empirical loss is $L_{i,m_i}(c_i) = \frac{1}{m_i} \sum_{j=1}^{m_i} 1_{\{c_i(x_j) \neq 1\}}$, which is based on a subsample $\{(x_j, i)\}_{j=1}^{m_i}$ drawn i.i.d. from pattern class '$i$'.
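The well-definedness condition on the boolean components can be probed numerically; a small sketch with made-up interval components (all names are illustrative):

```python
def is_well_defined(components, probe_points):
    """Partition check on a finite probe set: exactly one b_i fires at each x."""
    return all(sum(b(x) for b in components) == 1 for x in probe_points)

def to_vector_classifier(components):
    """c(x) = the unique i with b_i(x) = 1 (classes numbered from 1)."""
    def c(x):
        for i, b in enumerate(components, start=1):
            if b(x):
                return i
        raise ValueError("no component claims x: c is ill-defined there")
    return c

# Three interval components partitioning the real line:
b1 = lambda x: int(x < 0)
b2 = lambda x: int(0 <= x < 1)
b3 = lambda x: int(x >= 1)
c = to_vector_classifier([b1, b2, b3])
print(is_well_defined([b1, b2, b3], [-2.0, 0.5, 3.0]))  # True
print(c(0.5))                                           # 2
```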


The class $C$ of classifiers is decomposed into a structure $S = S_1 \times S_2 \times \cdots \times S_M$, where $S_i$ is a nested structure (cf. Vapnik (1982)) of classes $B_{k_i}$, $k_i = 1, 2, \ldots$, of boolean classifiers $b_i(x)$, i.e., $S_1 = B_1, B_2, \ldots, B_{k_1}, \ldots$, $S_2 = B_1, B_2, \ldots, B_{k_2}, \ldots$, up to $S_M = B_1, B_2, \ldots, B_{k_M}, \ldots$, where $k_i \in \mathbb{Z}_+$ denotes the VC dimension of $B_{k_i}$ and $B_{k_i} \subseteq B_{k_i + 1}$, $1 \leq i \leq M$. For any fixed positive integer vector $k \in \mathbb{Z}_+^M$ consider the class of vector classifiers $C_k = B_{k_1} \times B_{k_2} \times \cdots \times B_{k_M}$. Define by $G_k$ the subclass of $C_k$ of classifiers $c$ that are well-defined (in the sense mentioned above).
For vectors $m$ and $k$ in $\mathbb{Z}_+^M$, define $\epsilon(m, k) \equiv \sum_{i=1}^M p_i\, \epsilon(m_i, k_i)$, where as before $\epsilon(m_i, k_i) = \mathrm{const} \sqrt{\frac{k_i \ln m_i}{m_i}}$. For any $0 < \delta < 1$, we denote by $\epsilon(m_i, k_i, \delta) = \sqrt{\frac{k_i \ln m_i + \ln \frac{1}{\delta}}{m_i}}$ and $\epsilon(m, k, \delta) = \sum_{i=1}^M p_i\, \epsilon(m_i, k_i, \delta)$. Lemma 1 below states an upper bound on the deviation between the empirical loss and the loss uniformly over all classifiers in a class $G_k$ and is a direct application of Theorem 6.7 of Vapnik (1982). Before we state it, it is necessary to define what is meant by an increasing sequence of vectors $m$.

Definition 1. (Increasing sample-size sequence) A sequence $m(n)$ of sample-size vectors is said to increase if: (a) at every $n$, there exists a $j$ such that $m_j(n+1) > m_j(n)$ and $m_i(n+1) \geq m_i(n)$ for $1 \leq i \neq j \leq M$, and (b) there exists an increasing function $T(N)$ such that for all $N > 0$, $n > N$ implies every component $m_i(n) > T(N)$, $1 \leq i \leq M$.
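Condition (a) can be checked mechanically on any finite prefix of a sequence (condition (b) is asymptotic and cannot be checked this way); a small illustrative predicate:

```python
def satisfies_condition_a(m_seq):
    """Condition (a) of Definition 1: between consecutive sample-size vectors,
    some component strictly increases and no component decreases."""
    for prev, nxt in zip(m_seq, m_seq[1:]):
        if any(b < a for a, b in zip(prev, nxt)):
            return False       # a component shrank
        if not any(b > a for a, b in zip(prev, nxt)):
            return False       # no strict increase at this step
    return True

print(satisfies_condition_a([[1, 1], [2, 1], [2, 2], [3, 2]]))  # True
print(satisfies_condition_a([[1, 1], [1, 1]]))                  # False
```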

Definition 1 implies that for all 1 ≤ i ≤ M , mi (n) → ∞ as n → ∞. We


will henceforth use the notation m → ∞ to denote such an ever-increasing
sequence m(n) with respect to an implicit discrete indexing variable n. The
relevance of Definition 1 will become clearer later, in particular when considering
Lemma 3.

Definition 2. (Sequence generating procedure) A sequence generating procedure $\phi$ is one which generates increasing sequences $m(n)$ with a fixed function $T_\phi(N)$ as in Definition 1 and which also satisfies the following: for all $N, N' \geq 1$ such that $T_\phi(N') = T_\phi(N) + 1$, we have $|N' - N| \leq \mathrm{const}$, where const depends only on $\phi$.

The above definition simply states a lower bound requirement on the rate of
increase of Tφ (N ). We now state the uniform strong law of large numbers for
the class of well-defined classifiers.
Lemma 1. (Uniform SLLN for multicategory classifier class) For any $k \in \mathbb{Z}_+^M$ let $G_k$ be a class of well-defined classifiers. Consider any sequence-generating procedure as in Definition 2 which generates $m(n)$, $n = 1, \ldots, \infty$. Let the empirical loss be defined based on examples $\{(x_j, y_j)\}_{j=1}^{m(n)}$, each drawn i.i.d. according to an unknown underlying distribution over $\mathbb{R}^d \times \{1, \ldots, M\}$. Then for arbitrary $0 < \delta < 1$, $\sup_{c \in G_k} \left| L_{m(n)}(c) - L(c) \right| \leq \mathrm{const}\, \epsilon(m(n), k, \delta)$ with probability $1 - \delta$, and the events $\sup_{c \in G_k} \left| L_{m(n)}(c) - L(c) \right| > \mathrm{const}\, \epsilon(m(n), k)$, $n = 1, 2, \ldots$, occur infinitely often with probability 0, where $m(n)$ is any sequence generated by the procedure.

The outline of the proof is in Appendix A. We henceforth denote by $c_k^*$ the optimal classifier in $G_k$, i.e., $c_k^* = \mathrm{argmin}_{c \in G_k} L(c)$, and $\hat{c}_k = \mathrm{argmin}_{c \in G_k} L_m(c)$ is the empirical minimiser over the class $G_k$.
In Section 2 we mentioned that the intrinsic unknown complexity $k^*$ of the Bayes classifier is automatically learned by minimising the penalised empirical loss over the complexity variable $k$. If an upper bound of the form of (2), but based on a vector $m$, could be derived for the multicategory case, then a second minimisation step, this time over the sample-size vector $m$, would improve the SRM error convergence rate. The main result of this paper (Theorem 1) shows that through a stochastic gradient descent such minimisation improves the standard SRM bound from $\epsilon(m, k^*)$ to $\epsilon(m^*, k^*)$, where $m^*$ minimises $\epsilon(m, k^*)$ over all possible vectors $m$ whose magnitude $|m|$ equals the given total sample size $m$. The technical challenge is to obtain this without assuming knowledge of $k^*$. Our approach is to estimate $k^*$ and minimise an estimated criterion. We only provide sketches of proofs for the stated lemmas and theorem. The full proofs are in Ratsaby (2003).
Concerning the convergence mode of random variables, upper bounds are based on the uniform strong law of large numbers, see Appendix A. Such bounds originated in the work of Vapnik (1982), for instance his Theorem 6.7. Throughout the current paper, almost sure statements are made by a standard application of the Borel-Cantelli lemma. For instance, taking $m$ to be a scalar, the statement $\sup_{b \in B} |L(b) - L_m(b)| \leq \mathrm{const} \sqrt{\frac{r \log m + \log \frac{1}{\delta}}{m}}$ with probability at least $1 - \delta$ for any $\delta > 0$ is alternatively stated as follows by letting $\delta_m = \frac{1}{m^2}$: for the sequence of random variables $L_m(b)$, uniformly over all $b \in B$, the events $L(b) > L_m(b) + \mathrm{const} \sqrt{\frac{r \log m + \log \frac{1}{\delta_m}}{m}}$ occur infinitely often with probability 0. Concerning our, perhaps loose, use of the word optimal: whenever not explicitly stated, optimality of a classifier or of a procedure or algorithm is only with respect to minimisation of the criterion, namely, the upper bound on the loss.

4 Standard SRM Loss Bounds

We will henceforth make the following assumption.

Assumption 1. The Bayes loss $L^* = 0$ and there exists a classifier $c_k$ in the structure $S$ with $L(c_k) = L^*$ such that $k_i < \infty$, $1 \leq i \leq M$. The a priori pattern class probabilities $p_i$, $1 \leq i \leq M$, are known to the learner.
Assumption 1 essentially amounts to the Probably Approximately Correct (PAC) framework, Valiant (1984), Devroye et al. (1996) Section 12.7, but with a more relaxed constraint on the complexity of the hypothesis class $C$, since it is permitted to have an infinite VC dimension. Also, in practice the a priori pattern class probabilities can be estimated easily. In assuming that the learner knows the $p_i$, $1 \leq i \leq M$, one approach would have the learner allocate sub-sample sizes according to $m_i = p_i m$, followed by doing structural risk minimisation.
However, this does not necessarily minimise the upper bound on the loss of the SRM-selected classifier and hence is inferior in this respect to Principle 1, which is stated later. We note that if the classifier class were fixed and the intrinsic complexity $k^*$ of the Bayes classifier known in advance, then because of Assumption 1 one would resort to a bound of the form $O(k \log m / m)$ and not the weaker bound that has a square root; see ch. 4.5 in Anthony & Bartlett (1999). However, as mentioned before, not knowing $k^*$, and hence using structural risk minimisation as opposed to empirical risk minimisation over a fixed class, leads to using the weaker bound for the complexity penalty.
We next provide some additional definitions needed for the remainder of the paper. Consider the set $F^* = \{\mathrm{argmin}_{k \in \mathbb{Z}_+^M} L(c_k^*)\} = \{k : L(c_k^*) = L^* = 0\}$, which may contain more than one vector $k$. Following Assumption 1 we may define the Bayes classifier $c^*$ as the particular classifier $c_{k^*}^*$ whose complexity is minimal, i.e., $k^* = \mathrm{argmin}_{k \in F^*} \|k\|_\infty$, where $\|k\|_\infty = \max_{1 \leq i \leq M} |k_i|$. Note again that there may be more than one such $k^*$. The significance of specifying the Bayes classifier up to its complexity, rather than just saying it is any classifier having a loss $L^*$, will become apparent later in the paper.
For an empirical minimiser classifier $\hat{c}_k$ define the penalised empirical loss (cf. Devroye et al. (1996)) $\tilde{L}_m(\hat{c}_k) = L_m(\hat{c}_k) + \epsilon(m, k)$. Consider the set $\hat{F} = \{\mathrm{argmin}_{k \in \mathbb{Z}_+^M} \tilde{L}_m(\hat{c}_k)\}$, which may contain more than one vector $k$. In standard structural risk minimisation (Vapnik, 1982) the selected classifier is any one whose complexity index $k \in \hat{F}$. This will be modified later when we introduce an algorithm which relies on the convergence of the complexity $\hat{k}$ to some finite limiting complexity value with increasing⁴ $m$. The selected classifier is therefore defined to be one whose complexity satisfies $\hat{k} = \mathrm{argmin}_{k \in \hat{F}} \|k\|_\infty$. This minimal-complexity SRM-selected classifier will be denoted as $\hat{c}_{\hat{k}}$ or simply as $\hat{c}^*$. We will sometimes write $\hat{k}_n$ and $\hat{c}_n$ for the complexity and for the SRM-selected classifier, respectively, in order to explicitly show the dependence on discrete time $n$.
The next lemma states that the complexity $\hat{k}$ converges to some (not necessarily unique) $k^*$ corresponding to the Bayes classifier $c^*$.

Lemma 2. Based on $m$ examples $\{(x_j, y_j)\}_{j=1}^m$, each drawn i.i.d. according to an unknown underlying distribution over $\mathbb{R}^d \times \{1, \ldots, M\}$, let $\hat{c}^*$ be the chosen classifier of complexity $\hat{k}$. Consider a sequence of samples $\zeta^{m(n)}$ with increasing sample-size vectors $m(n)$ obtained by a sequence-generating procedure as in Definition 2. Then (a) the corresponding complexity sequence $\hat{k}_n$ converges a.s. to $k^*$, which from Assumption 1 has finite components. (b) For any sample $\zeta^{m(n)}$ in the sequence, the loss of the corresponding classifier $\hat{c}_n^*$ satisfies $L(\hat{c}_n^*) > \mathrm{const}\, \epsilon(m(n), k^*)$ infinitely often with probability 0.
⁴ We will henceforth adopt the convention that a vector sequence $\hat{k}_n \to k^*$, a.s., means that every component of $\hat{k}_n$ converges to the corresponding component of $k^*$, a.s., as $m \to \infty$.
The outline of the proof is in Appendix B. For the more general case of $L^* > 0$ (but two-category classifiers) the upper bound becomes $L^* + \mathrm{const}\, \epsilon(m, k^*)$, cf. Devroye et al. (1996). It is an open question whether in this case it is possible to guarantee convergence of $\hat{k}_n$, or some variation of it, to a finite limiting value.
The previous lemma bounds the loss of the SRM-selected classifier $\hat{c}^*$. As suggested earlier, we wish to extend the SRM approach to do an additional minimisation step by minimising the loss of $\hat{c}^*$ with respect to the sample size vector $m$. In this respect, the subsample proportions may be tuned to the intrinsic Bayes complexity $k^*$, thereby yielding an improved error rate for $\hat{c}^*$. This is stated next:
Principle 1. Choose $m$ to minimise the criterion $\epsilon(m, k^*)$ with respect to all $m$ such that $\sum_{i=1}^M m_i = m$, the latter being the a priori total sample size allocated for learning.
In general there may be other proposed criteria, just as there are many criteria for model selection based on minimisation of different upper bounds. Note that if $k^*$ were known, then an optimal sample size $m^* = [m_1^*, \ldots, m_M^*]$ could be computed which yields a classifier $\hat{c}^*$ with the best (lowest) deviation $\mathrm{const}\, \epsilon(m^*, k^*)$ away from the Bayes loss. The difficulty is that $k^* = [k_1^*, \ldots, k_M^*]$ is usually unknown since it depends on the underlying unknown probability densities $f_i(x)$, $1 \leq i \leq M$. To overcome this we will minimise an estimate of $\epsilon(\cdot, k^*)$ rather than the criterion $\epsilon(\cdot, k^*)$ itself.

5 The Extended SRM Algorithm


In this section we extend the SRM learning algorithm to include a stochastic
gradient descent step. The idea is to interleave the standard minimisation step
of SRM with a new step which asymptotically minimises the penalised empirical
loss with respect to the sample size. As before, m(n) denotes a sequence of
sample-size vectors indexed by an integer n ≥ 0 representing discrete time. When
referring to a particular ith component of the vector m(n) we write mi (n).
The algorithm initially starts with uniform sample size proportions $m_1 = m_2 = \cdots = m_M = \mathrm{const} > 0$; then at each time $n \geq 1$ it selects the classifier $\hat{c}_n^*$ defined as

$\hat{c}_n^* = \mathrm{argmin}_{\hat{c}_{n,k} : k \in \hat{F}_n} \|k\|_\infty$   (Standard Minimisation Step)   (3)

where $\hat{F}_n = \{k : \tilde{L}_n(\hat{c}_{n,k}) = \min_{r \in \mathbb{Z}_+^M} \tilde{L}_n(\hat{c}_{n,r})\}$ and, for any $\hat{c}_{n,k}$ which minimises $L_{m(n)}(c)$ over all $c \in G_k$, we define the penalised empirical loss as $\tilde{L}_n(\hat{c}_{n,k}) = L_{m(n)}(\hat{c}_{n,k}) + \epsilon(m(n), k)$, where $L_{m(n)}$ stands for the empirical loss based on the sample-size vector $m(n)$ at time $n$.
The second minimisation step is done via a query rule which selects the particular pattern class from which to draw examples as one which minimises the stochastic criterion $\epsilon(\cdot, \hat{k}_n)$ with respect to the sample size vector $m(n)$. The complexity $\hat{k}_n$ of $\hat{c}_n^*$ will be shown later to converge to $k^*$; hence $\epsilon(\cdot, \hat{k}_n)$ serves as a consistent estimator of the criterion $\epsilon(\cdot, k^*)$. We choose an adaptation step which changes one component of $m$ at a time; namely, it increases the component $m_{j_{\max}}(n)$ which corresponds to the direction of maximum descent of the criterion $\epsilon(\cdot, \hat{k}_n)$ at time $n$. This may be written as

$m(n+1) = m(n) + \Delta\, e_{j_{\max}}$   (New Minimisation Step)   (4)

where the positive integer $\Delta$ denotes some fixed minimisation step-size and, for any integer $i \in \{1, 2, \ldots, M\}$, $e_i$ denotes an $M$-dimensional elementary vector with 1 in the $i$th component and 0 elsewhere. Thus at time $n$ the new minimisation step produces a new value $m(n+1)$ which is used for drawing additional examples according to specific sample sizes $m_i(n+1)$, $1 \leq i \leq M$.
Learning Algorithm XSRM (Extended SRM)
Let: $m_i(0) = \mathrm{const} > 0$, $1 \leq i \leq M$.
Given: (a) $M$ uniform-size samples $\{\zeta^{m_i(0)}\}_{i=1}^M$, where $\zeta^{m_i(0)} = \{(x_j, \text{'}i\text{'})\}_{j=1}^{m_i(0)}$ and the $x_j$ are drawn i.i.d. according to the underlying class-conditional probability densities $f_i(x)$. (b) A sequence of classes $G_k$, $k \in \mathbb{Z}_+^M$, of well-defined classifiers. (c) A constant minimisation step-size $\Delta > 0$. (d) Known a priori probabilities $p_j$, $1 \leq j \leq M$ (for defining $L_m$).
Initialisation: (Time $n = 0$) Based on $\zeta^{m_i(0)}$, $1 \leq i \leq M$, determine a set of candidate classifiers $\hat{c}_{0,k}$ minimising the empirical loss $L_{m(0)}$ over $G_k$, $k \in \mathbb{Z}_+^M$, respectively. Determine $\hat{c}_0^*$ according to (3) and denote its complexity vector by $\hat{k}_0$.
Output: $\hat{c}_0^*$.
Call Procedure NM: $m(1) := NM(0)$.
Let $n = 1$.
While (still more available examples) Do:
1. Based on the sample $\zeta^{m(n)}$, determine the empirical minimisers $\hat{c}_{n,k}$ for each class $G_k$. Determine $\hat{c}_n^*$ according to (3) and denote its complexity vector by $\hat{k}_n$.
2. Output: $\hat{c}_n^*$.
3. Call Procedure NM: $m(n+1) := NM(n)$.
4. $n := n + 1$.
End Do
Procedure New Minimisation (NM)
Input: Time $n$.
– $j_{\max}(n) := \mathrm{argmax}_{1 \leq j \leq M} \frac{p_j\, \epsilon(m_j(n), \hat{k}_{n,j})}{m_j(n)}$, where if there is more than one argmax then choose any one.
– Obtain $\Delta$ new i.i.d. examples from class $j_{\max}(n)$. Denote them by $\zeta_n$.
– Update Sample: $\zeta^{m_{j_{\max}(n)}(n+1)} := \zeta^{m_{j_{\max}(n)}(n)} \cup \zeta_n$, while $\zeta^{m_i(n+1)} := \zeta^{m_i(n)}$ for $1 \leq i \neq j_{\max}(n) \leq M$.
– Return Value: $m(n) + \Delta\, e_{j_{\max}(n)}$.
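A runnable skeleton of the XSRM loop under the query rule above (the standard minimisation step, i.e. the choice of $\hat{k}_n$, is abstracted into a supplied function, and all names and the toy numbers are my own illustration):

```python
import math

def xsrm_query_rule(m, k_hat, priors, const=1.0):
    """New minimisation step: choose the class j maximising
    p_j * eps(m_j, k_j) / m_j -- the steepest-descent direction of eps(m, k)."""
    def score(j):
        eps_j = const * math.sqrt(k_hat[j] * math.log(m[j]) / m[j])
        return priors[j] * eps_j / m[j]
    return max(range(len(m)), key=score)

def run_xsrm(m0, k_hat_of, priors, steps, delta=1):
    """Skeleton of the XSRM loop: the standard minimisation step (the SRM choice
    of complexity at sample sizes m) is abstracted into k_hat_of; the query
    rule then decides which pattern class to sample next."""
    m = list(m0)
    for _ in range(steps):
        k_hat = k_hat_of(m)                    # standard minimisation step (3)
        j = xsrm_query_rule(m, k_hat, priors)  # new minimisation step (4)
        m[j] += delta                          # draw delta new examples from class j
    return m

# With equal priors, the class with larger estimated complexity is queried more:
final = run_xsrm([3, 3], k_hat_of=lambda m: [9, 1], priors=[0.5, 0.5], steps=40)
print(final)
```

Keeping the empirical-minimisation part behind `k_hat_of` makes the query-rule behaviour visible in isolation: the subsample sizes drift toward the harder class without the total budget changing by more than $\Delta$ per step.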

The algorithm alternates between the standard minimisation step (3) and
the new minimisation step (4) repetitively until exhausting the total sample size
m which for most generality is assumed to be unknown a priori.
While for any fixed $i \in \{1, 2, \ldots, M\}$ the examples $\{(x_j, i)\}_{j=1}^{m_i(n)}$ accumulated up until time $n$ are all i.i.d. random variables, the total sample $\{(x_j, y_j)\}_{j=1}^{m(n)}$
consists of dependent random variables since based on the new minimisation the
choice of the particular class-conditional probability distribution used to draw
examples at each time instant l depends on the sample accumulated up until
time l − 1. It turns out that this dependency does not alter the results of Lemma
2. This follows from the proof of Lemma 2 and from the bound of Lemma 1
which holds even if the sample is i.i.d. only when conditioned on a pattern class
since it is the weighted average of the individual bounds corresponding to each
of the pattern classes. Therefore together with the next lemma this implies that
Lemma 2 applies to Algorithm XSRM.
Lemma 3. Algorithm XSRM is a sequence-generating procedure.
The outline of the proof is deferred to Appendix C. Next, we state the main
theorem of the paper.
Theorem 1. Assume that the Bayes complexity $k^*$ is an unknown $M$-dimensional vector of finite positive integers. Let the step size $\Delta = 1$ in Algorithm XSRM, resulting in a total sample size which increases with discrete time as $m(n) = n$. Then the random sequence of classifiers $\hat{c}_n^*$ produced by Algorithm XSRM is such that the events $L(\hat{c}_n^*) > \mathrm{const}\, \epsilon(m(n), k^*)$ or $\|m(n) - m^*(n)\|_{l_1^M} > 1$ occur infinitely often with probability 0, where $m^*(n)$ is the solution to the constrained minimisation of $\epsilon(m, k^*)$ over all $m$ of magnitude $|m| = m(n)$.

Remark 1. In the limit of large $n$ the bound $\mathrm{const}\, \epsilon(m(n), k^*)$ is almost minimum (the minimum being at $m^*(n)$) with respect to all vectors $m \in \mathbb{Z}_+^M$ of size $m(n)$. Note that this rate is achieved by Algorithm XSRM without knowledge of the intrinsic complexity $k^*$ of the Bayes classifier. Compare this for instance to uniform querying, where at each time $n$ one queries for subsamples of the same size $\frac{\Delta}{M}$ from every pattern class. This leads to a different (deterministic) sequence $m(n) = \frac{\Delta}{M}[1, 1, \ldots, 1]\, n \equiv \bar{\Delta} n$, and in turn to a sequence of classifiers $\hat{c}_n$ whose loss satisfies $L(\hat{c}_n) \leq \mathrm{const}\, \epsilon(\bar{\Delta} n, k^*)$ as $n \to \infty$, where here the upper bound is not even asymptotically minimal. A similar argument holds if the proportions are based on the a priori pattern class probabilities, since in general letting $m_i = p_i m$ does not necessarily minimise the upper bound. In Ratsaby (1998), empirical results show the inferiority of uniform sampling compared to an online approach based on Algorithm XSRM.
6 Proving Theorem 1

The proof of Theorem 1 is based on Lemma 2 and on two additional lemmas, Lemma 4 and Lemma 5, which deal with the convergence property of the new minimisation step of Algorithm XSRM. The proof is outlined in Appendix D. Our approach is to show that the adaptation step used in the new minimisation step follows from the minimisation of the deterministic criterion $\epsilon(m, k^*)$ with a known $k^*$. Letting $t$, as well as $n$, denote discrete time $t = 1, 2, \ldots$, we adopt the notation $m(t)$ for a deterministic sample size sequence governed by the deterministic criterion $\epsilon(m, k^*)$, where $k^*$ is taken to be known. We write $m(n)$ to denote the stochastic sequence governed by the random criterion $\epsilon(m, \hat{k}_n)$. Thus $t$ or $n$ distinguishes between a deterministic or a stochastic sample sequence, $m(t)$ or $m(n)$, respectively. We start with the following definition.

Definition 3. (Optimal trajectory) Let $m(t)$ be any positive integer-valued function of $t$ which denotes the total sample size at time $t$. The optimal trajectory is a set of vectors $m^*(t) \in \mathbb{Z}_+^M$ indexed by $t \in \mathbb{Z}_+$, defined as $m^*(t) = \mathrm{argmin}_{\{m \in \mathbb{Z}_+^M : |m| = m(t)\}}\, \epsilon(m, k^*)$.

First let us solve the following constrained minimisation problem. Fix a total sample size $m$ and minimise the error $\epsilon(m, k^*)$ under the constraint that $\sum_{i=1}^M m_i = m$. This amounts to minimising $\epsilon(m, k^*) + \lambda \left( \sum_{i=1}^M m_i - m \right)$ over $m$ and $\lambda$. Denote the gradient by $g(m, k^*) = \nabla \epsilon(m, k^*)$. Then the above is equivalent to solving $g(m, k^*) + \lambda [1, 1, \ldots, 1] = 0$ for $m$ and $\lambda$. The vector-valued function $g(m, k^*)$ may be approximated by $g(m, k^*) \simeq \left[ -\frac{p_1 \epsilon(m_1, k_1^*)}{2 m_1}, -\frac{p_2 \epsilon(m_2, k_2^*)}{2 m_2}, \ldots, -\frac{p_M \epsilon(m_M, k_M^*)}{2 m_M} \right]$, where we used the approximation $1 - \frac{1}{\ln m_i} \simeq 1$ for $1 \leq i \leq M$. We then obtain the set of equations $2 \lambda^* m_i^* = p_i\, \epsilon(m_i^*, k_i^*)$, $1 \leq i \leq M$, and $\lambda^* = \frac{\epsilon(m^*, k^*)}{2m}$. We are interested not in obtaining a solution for a fixed $m$ but in obtaining, using local gradient information, a sequence of solutions for the sequence of minimisation problems corresponding to an increasing sequence of total sample-size values $m(t)$.
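The stationarity condition $2\lambda^* m_i^* = p_i\, \epsilon(m_i^*, k_i^*)$ says the quantities $p_i\, \epsilon(m_i, k_i)/(2 m_i)$ equalise at the optimum. The following sketch (with made-up $k^*$ and priors, and the constant set to 1) checks numerically that greedy unit steps in the steepest-descent direction drive them toward equality:

```python
import math

def eps(m_i, k_i, const=1.0):
    """The class-conditional penalty eps(m_i, k_i) = const * sqrt(k_i ln m_i / m_i)."""
    return const * math.sqrt(k_i * math.log(m_i) / m_i)

def stationarity(m, k, priors):
    """The quantities x_i = p_i * eps(m_i, k_i) / (2 m_i); the Lagrange condition
    2*lambda*m_i = p_i*eps(m_i, k_i) forces them all to be equal at the optimum."""
    return [p * eps(mi, ki) / (2 * mi) for p, mi, ki in zip(priors, m, k)]

k_star, priors = [9, 4, 1], [0.4, 0.4, 0.2]
m = [3, 3, 3]
for _ in range(3000):                          # greedy unit steps (delta = 1)
    xs = stationarity(m, k_star, priors)
    m[xs.index(max(xs))] += 1                  # steepest-descent component

xs = stationarity(m, k_star, priors)
print(m)                     # more samples go to classes with larger p_i * sqrt(k_i)
print(max(xs) / min(xs))     # ratio near 1: the x_i are nearly equalised
```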
Applying the New Minimisation procedure of Algorithm XSRM to the deterministic criterion $\epsilon(m, k^*)$, we have an adaptation rule which modifies the sample size vector $m(t)$ at time $t$ in the direction of steepest descent of $\epsilon(m, k^*)$. This yields $j^*(t) = \mathrm{argmax}_{1 \leq j \leq M} \frac{p_j\, \epsilon(m_j(t), k_j^*)}{m_j(t)}$, which means we let $m_{j^*(t)}(t+1) = m_{j^*(t)}(t) + \Delta$ while the remaining components of $m(t)$ remain unchanged, i.e., $m_j(t+1) = m_j(t)$, $\forall j \neq j^*(t)$. The next lemma states that this rule achieves the desired result, namely, the deterministic sequence $m(t)$ converges to the optimal trajectory $m^*(t)$.

Lemma 4. For any initial point $m(0) \in \mathbb{R}^M$ satisfying $m_i(0) \geq 3$, and for a fixed positive $\Delta$, there exists some finite integer $0 < N' < \infty$ such that for all discrete time $t > N'$ the trajectory $m(t)$ corresponding to a repeated application of the adaptation rule $m_{j^*(t)}(t+1) = m_{j^*(t)}(t) + \Delta$ is no farther than $\Delta$ (in the $l_1^M$-norm) from the optimal trajectory $m^*(t)$.
Outline of Proof: Recall that $\epsilon(m, k^*) = \sum_{i=1}^M p_i\, \epsilon(m_i, k_i^*)$ where $\epsilon(m_i, k_i^*) = \mathrm{const} \sqrt{\frac{k_i^* \ln m_i}{m_i}}$, $1 \leq i \leq M$. The derivative $\frac{\partial \epsilon(m, k^*)}{\partial m_i} \simeq -p_i \frac{\epsilon(m_i, k_i^*)}{2 m_i}$. Denote by $x_i = p_i \frac{\epsilon(m_i, k_i^*)}{2 m_i}$, and note that $\frac{d x_i}{d m_i} \simeq -\frac{3}{2} \frac{x_i}{m_i}$, $1 \leq i \leq M$. There is a one-to-one correspondence between the vectors $x$ and $m$; thus we may refer to the optimal trajectory also in $x$-space. Consider the set $T = \{x = c[1, 1, \ldots, 1] \in \mathbb{R}_+^M : c \in \mathbb{R}_+\}$ and refer to $T'$ as the corresponding set in $m$-space. Define the Lyapunov function $V(x(t)) = V(t) = \frac{x_{\max}(t) - x_{\min}(t)}{x_{\min}(t)}$, where for any vector $x \in \mathbb{R}_+^M$, $x_{\max} = \max_{1 \leq i \leq M} x_i$ and $x_{\min} = \min_{1 \leq i \leq M} x_i$, and write $m_{\max}$, $m_{\min}$ for the elements of $m$ with the same index as $x_{\max}$, $x_{\min}$, respectively. Denote by $\dot{V}$ the derivative of $V$ with respect to $t$. Using standard analysis it can be shown that if $x \notin T$ then $V(x) > 0$ and $\dot{V}(x) < 0$, while if $x \in T$ then $V(x) = 0$ and $\dot{V}(x) = 0$. This means that as long as $m(t)$ is not on the optimal trajectory, $V(t)$ decreases. To show that the trajectory is an attractor, $V(t)$ is shown to decrease fast enough to zero using the fact that $V(t) \leq \mathrm{const}\, \frac{1}{t^{3/2}}$. Hence as $t \to \infty$, the distance between $m(t)$ and the set $T'$, $\mathrm{dist}(m(t), T')$, goes to 0, where $\mathrm{dist}(x, T) = \inf_{y \in T} \|x - y\|_{l_1^M}$ and $l_1^M$ denotes the $l_1$ vector norm on $\mathbb{R}^M$. It is then easy to show that for all large $t$, $m(t)$ is farther from $m^*(t)$ by no more than $\Delta$.
We now show that the same adaptation rule may also be used in the setting
where k ∗ is unknown. The next lemma states that even when k ∗ is unknown, it
is possible, by using Algorithm XSRM, to generate a stochastic sequence which
asymptotically converges to the optimal m∗ (n) trajectory (again, the use of n
instead of t just means we have a random sequence m(n) and not a deterministic
sequence m(t) as was investigated above).

Lemma 5. Fix any $\Delta \geq 1$ as a step size used by Algorithm XSRM. Given a sample size vector sequence $m(n)$, $n \to \infty$, generated by Algorithm XSRM, assume that $\hat{k}_n \to k^*$ almost surely. Let $m^*(n)$ be the optimal trajectory as in Definition 3. Then the events $\|m(n) - m^*(n)\|_{l_1^M} > \Delta$ occur infinitely often with probability 0.

Outline of Proof: From Lemma 3, $m(n)$ generated by Algorithm XSRM is an increasing sample-size sequence. Therefore by Lemma 2 we have $\hat{k}_n \to k^*$, a.s., as $n \to \infty$. This means that $P(\exists n > N : |\hat{k}_n - k^*| > \epsilon) = \delta_N(\epsilon)$, where $\delta_N(\epsilon) \to 0$ as $N \to \infty$. It follows that for all $\delta > 0$ there is a finite $N(\delta, \epsilon) \in \mathbb{Z}_+$ such that with probability $1 - \delta$, for all $n \geq N(\epsilon, \delta)$, $\hat{k}_n = k^*$. It follows that with the same probability, for all $n \geq N$, the criterion $\epsilon(m, \hat{k}_n) = \epsilon(m, k^*)$ uniformly over all $m \in \mathbb{Z}_+^M$, and hence the trajectory $m(n)$ taken by Algorithm XSRM, governed by the criterion $\epsilon(\cdot, \hat{k}_n)$, equals the trajectory $m(t)$, $t \in \mathbb{Z}_+$, taken by minimising the deterministic criterion $\epsilon(\cdot, k^*)$. Moreover, this probability of $1 - \delta$ goes to 1 as $N \to \infty$ by the a.s. convergence of $\hat{k}_n$ to $k^*$. By Lemma 4, there exists an $N' < \infty$ such that for all discrete time $t > N'$, $\|m(t) - m^*(t)\|_{l_1^M} \leq \Delta$. Let $N'' = \max\{N, N'\}$; then $P\left(\exists n > N'' : \hat{k}_n \neq k^* \text{ or } \|m(t)|_{t=n} - m^*(t)|_{t=n}\|_{l_1^M} > \Delta\right) = \delta_{N''}$, where $\delta_{N''} \to 0$ as $N'' \to \infty$. The latter means that the events $\{\hat{k}_n \neq k^* \text{ or } \|m(n) - m^*(n)\|_{l_1^M} > \Delta\}$ occur infinitely often with probability 0. The statement of the lemma then follows.

Appendix

In this section we outline the proofs. Complete proofs appear in Ratsaby (2003).

A Proof Outline of Lemma 1

For a class of boolean classifiers $B_r$ of VC-dimension $r$ it is known (cf. Devroye et al. (1996) ch. 6, Vapnik (1982) Theorem 6.7) that a bound on the deviation between the loss and the empirical loss uniformly over all classifiers $b \in B_r$ is $\sup_{b \in B_r} |L(b) - L_m(b)| \leq \mathrm{const} \sqrt{\frac{r \ln m + \ln \frac{1}{\delta}}{m}}$ with probability $1 - \delta$, where $m$ denotes the size of the random sample used for calculating the empirical loss $L_m(b)$. Choosing for instance $\delta_m = \frac{1}{m^2}$ implies that the bound $\mathrm{const} \sqrt{\frac{r \ln m}{m}}$ (with a different constant) is violated infinitely often with probability 0. We will refer to this as the uniform strong law of large numbers result, and we note that this quantity was defined earlier as $\epsilon(m, r)$.
This result is used together with an application of the union bound, which reduces the probability $P\left(\sup_{c \in C_k} |L(c) - L_m(c)| > \epsilon(m, k, \delta')\right)$ to $\sum_{i=1}^M P\left(\exists c \in C_{k_i} : |L(c) - L_{i,m_i}(c)| > \epsilon(m_i, k_i, \delta')\right)$, which is bounded from above by $M \delta'$. The first part of the lemma then follows since the class of well-defined classifiers $G_k$ is contained in the class $C_k$. For the second part of the lemma, by the premise consider any fixed complexity vector $k$ and any sequence-generating procedure $\phi$ according to Definition 2. Define the following set of sample size vector sequences: $A_N \equiv \{m(n) : n > N,\ m(n) \text{ is generated by } \phi\}$. As the space is discrete, note that for any finite $N$, the set $A_N$ contains all possible paths except a finite number of length-$N$ paths. The proof proceeds by showing that the events $E_n \equiv \{\sup_{c \in G_k} (L(c) - L_{m(n)}(c)) > \epsilon(m(n), k, \delta) : m(n) \text{ generated by } \phi\}$ occur infinitely often with probability 0. To show this, we first choose for $\delta$ the value $\delta_m = \frac{1}{\max_{1 \leq j \leq M} m_j^2}$, and then reduce $P\left(\exists m(n) \in A_N : \sup_{c \in G_k} (L(c) - L_{m(n)}(c)) > \epsilon(m(n), k, \delta_{m(n)})\right)$ to $\sum_{j=1}^M \sum_{m_j > T_\phi(N)} \frac{1}{m_j^2}$. Then use the fact that $m(n) \in A_N$ implies there exists a point $m$ such that $\min_{1 \leq j \leq M} m_j > T_\phi(N)$, where $T_\phi(N)$ is increasing with $N$; hence the smallest element of the set $\{m_j : m_j > T_\phi(N)\}$ is strictly increasing, $1 \leq j \leq M$, which implies that the above double sum strictly decreases with increasing $N$. It then follows that $\lim_{N \to \infty} P\left(\exists m(n) \in A_N : \sup_{c \in G_k} (L(c) - L_{m(n)}(c)) > \epsilon(m(n), k)\right) = 0$, which implies the events $E_n$ occur i.o. with probability 0.
B Proof Outline of Lemma 2

First we sketch the proof of the convergence of $\hat{k} \to k^*$, where $k^*$ is some vector of minimal norm over all vectors $k$ for which $L(c_k^*) = 0$. We henceforth denote, for a vector $k \in \mathbb{Z}_+^M$, $\|k\|_\infty = \max_{1 \leq i \leq M} |k_i|$. All convergence statements are made with respect to the increasing sequence $m(n)$. The indexing variable $n$ is sometimes left hidden for simpler notation.
The set $\hat{F}$ defined in Section 4 may be rewritten as $\hat{F} = \{k : \tilde{L}(\hat{c}_k) = \tilde{L}(\hat{c}^*)\}$. The cardinality of $\hat{F}$ is finite, since for all $k$ having at least one component $k_i$ larger than some constant, $\tilde{L}(\hat{c}_k) > \tilde{L}(\hat{c}^*)$ because $\epsilon(m, k)$ will be larger than $\tilde{L}(\hat{c}^*)$; this implies that the set of $k$ for which $\tilde{L}(\hat{c}_k) \leq \tilde{L}(\hat{c}^*)$ is finite. Now for any $\alpha > 0$, define $\hat{F}_\alpha = \{k : \tilde{L}(\hat{c}_k) \leq \tilde{L}(\hat{c}^*) + \alpha\}$. Recall that $F^*$ was defined in Section 4 as $F^* = \{k : L(c_k^*) = L^* = 0\}$, and define $F_\alpha^* = \{k : L(c_k^*) \leq L^* + \alpha\}$, where the Bayes loss is $L^* = 0$. Recall that the chosen classifier $\hat{c}^*$ has a complexity $\hat{k} = \mathrm{argmin}_{k \in \hat{F}} \|k\|_\infty$. By Assumption 1, there exists a $k^* = \mathrm{argmin}_{k \in F^*} \|k\|_\infty$ all of whose components are finite. The proof proceeds by first showing that the inclusion $\hat{F} \subseteq F_{\epsilon(m, k^*)}^*$ fails infinitely often with probability 0, then proving that $k^* \in \hat{F}$ and that for all $m$ large enough, $k^* = \mathrm{argmin}_{k \in F_{\epsilon(m, k^*)}^*} \|k\|_\infty$. It then follows that $\|\hat{k}\|_\infty \neq \|k^*\|_\infty$ i.o. with probability zero, but where $\hat{k}$ does not necessarily equal $k^*$, and that $\hat{k} \to k^*$ (componentwise) a.s. as $m \to \infty$ (or equivalently, with $n \to \infty$, as the sequence $m(n)$ is increasing), where $k^* = \mathrm{argmin}_{k \in F^*} \|k\|_\infty$ is not necessarily unique but all of whose components are finite. This proves the first part of the lemma. The proof of the second part of the lemma follows similarly to the proof of Lemma 1. Start with $P\left(\exists m(n) \in A_N : L(\hat{c}_n^*) > \epsilon(m(n), k^*)\right)$, which after some manipulation is shown to be bounded from above by the sum $\sum_{j=1}^M \sum_{k_j=1}^\infty P\left(\exists m_j > T_\phi(N) : L(\hat{c}_{k_j}) > L_{j,m_j}(\hat{c}_{k_j}) + \epsilon(m_j, k_j)\right)$. Then make use of the uniform strong law result (see the first paragraph of Appendix A) and choose a const such that $\epsilon(m_j, k_j) = \mathrm{const} \sqrt{\frac{k_j \ln m_j}{m_j}} \geq \sqrt{3} \sqrt{\frac{k_j \ln(e m_j)}{m_j}}$. Using the upper bound on the growth function, cf. Vapnik (1982) Section 6.9, Devroye et al. (1996) Theorem 13.3, we have for some absolute constant $\kappa > 0$, $P\left(L(\hat{c}_{k_j}) > L_{j,m_j}(\hat{c}_{k_j}) + \epsilon(m_j, k_j)\right) \leq \kappa\, m_j^{k_j} e^{-m_j \epsilon^2(m_j, k_j)}$, which is bounded from above by $\kappa \frac{1}{m_j^2} e^{-3 k_j}$ for $k_j \geq 1$. The bound on the double sum then becomes $2 \kappa \sum_{j=1}^M \sum_{m_j > T_\phi(N)} \frac{1}{m_j^2}$, which is strictly decreasing with $N$ as in the proof of Lemma 1. It follows that the events $\{L(\hat{c}_n^*) > \epsilon(m(n), k^*)\}$ occur infinitely often with probability 0.

C Proof Outline of Lemma 3

Note that for this proof we cannot use Lemma 1 or parts of Lemma 2 since
they are conditioned on having a sequence-generating procedure. Our approach
here relies on the characteristics of the SRM-selected complexity k̂n which is
shown to be bounded uniformly over n based on Assumption 1. It follows that
A Stochastic Gradient Descent Algorithm for Structural Risk Minimisation 219

by the stochastic adaptation step of Algorithm XSRM the generated sample size
sequence m(n) is not only increasing but with a minimum rate of increase as
in Definition 2. This establishes that Algorithm XSRM is a sequence-generating
procedure. The proof starts by showing that for an increasing sequence m(n), as
in Definition 1, for all n there is some constant 0 < ρ < ∞ such that ‖k̂_n‖_∞ < ρ. It then follows that for all n, k̂_n is bounded by a finite constant independent of n. So for a sequence generated by the new minimisation procedure in Algorithm XSRM, the quantities p_j · ∂ε(m_j(n), k̂_{n,j})/∂m_j(n) are bounded by p_j · ∂ε(m_j(n), k̃_j)/∂m_j(n), for some finite k̃_j, 1 ≤ j ≤ M, respectively. It can be shown by simple analysis of the function ε(m, k) that for a fixed k the ratio of ∂²ε(m_j, k_j)/∂m_j² to ∂²ε(m_i, k_i)/∂m_i² converges to a constant dependent
on ki and kj with increasing mi , mj . Hence the adaptation step which always
increases one of the sub-samples yields increments of ∆mi and ∆mj which are
no farther apart than a constant multiple of each other for all n, for any pair
1 ≤ i, j ≤ M . Hence for a sequence m(n) generated by Algorithm XSRM the
following is satisfied: it is increasing in the sense of Definition 1, namely, for
all N > 0 there exists a Tφ (N ) such that for all n > N every component
mj (n) > Tφ (N ), 1 ≤ j ≤ M . Furthermore, its rate of increase is bounded from
below, namely, there exists a const > 0 such that for all N, N  > 0 satisfying
Tφ (N  ) = Tφ (N ) + 1, then |N  − N | ≤ const. It follows that Algorithm XSRM
is a sequence-generating procedure according to Definition 2.

D Proof Outline of Theorem 1


The classifier ĉ*_n is chosen according to (3) based on a sample of size vector m(n) generated by Algorithm XSRM, which is a sequence-generating procedure (by Lemma 3). From Lemma 2, L(ĉ*_n) > const·ε(m(n), k*) i.o. with probability 0, and since Δ = 1, it follows from Lemma 5 that ‖m(n) − m*(n)‖_{l₁^M} > 1 i.o. with probability 0, where m*(n) = argmin_m ε(m, k*) over all m with the same total sample size as m(n).

References
Anthony M., Bartlett P. L., (1999), "Neural Network Learning: Theoretical Foundations", Cambridge University Press, UK.
Bartlett P. L., Boucheron S., Lugosi G., (2002), Model Selection and Error Estimation, Machine Learning, Vol. 48, No. 1–3, p. 85–113.
Devroye L., Györfi L., Lugosi G., (1996), "A Probabilistic Theory of Pattern Recognition", Springer-Verlag.
Koltchinskii V., (2001), Rademacher Penalties and Structural Risk Minimization, IEEE Trans. on Info. Theory, Vol. 47, No. 5, p. 1902–1914.
Lugosi G., Nobel A., (1999), Adaptive Model Selection Using Empirical Complexities, Annals of Statistics, Vol. 27, p. 1830–1864.
Ratsaby J., (1998), Incremental Learning with Sample Queries, IEEE Trans. on PAMI, Vol. 20, No. 8, p. 883–888.
Ratsaby J., (2003), On Learning Multicategory Classification with Sample Queries, Information and Computation, Vol. 185, No. 2, p. 298–327.

Ratsaby J., Meir R., Maiorov V., (1996), Towards Robust Model Selection using Estimation and Approximation Error Bounds, Proc. 9th Annual Conference on Computational Learning Theory, p. 57, ACM, New York, N.Y.
Shawe-Taylor J., Bartlett P., Williamson R., Anthony M., (1996), A Framework for Structural Risk Minimisation, NeuroCOLT Technical Report Series, NC-TR-96-032, Royal Holloway, University of London.
Valiant L. G., (1984), A Theory of the Learnable, Comm. ACM, Vol. 27, No. 11, p. 1134–1142.
Vapnik V. N., (1982), "Estimation of Dependences Based on Empirical Data", Springer-Verlag, Berlin.
On the Complexity of Training a Single
Perceptron with Programmable Synaptic Delays

Jiří Šíma

Department of Theoretical Computer Science, Institute of Computer Science,
Academy of Sciences of the Czech Republic, P. O. Box 5, 182 07 Prague 8, Czech Republic
sima@[Link]

Abstract. We consider a single perceptron N with synaptic delays which generalizes a simplified model for a spiking neuron where not only the time that a pulse needs to travel through a synapse is taken into account but also the input firing rates may have more different levels. A synchronization technique is introduced so that the results concerning the learnability of spiking neurons with binary delays also apply to N with arbitrary delays. In particular, the consistency problem for N with programmable delays and its approximation version prove to be NP-hard. It follows that the perceptrons with programmable synaptic delays are not properly PAC-learnable and the spiking neurons with arbitrary delays do not allow robust learning unless RP = NP. In addition, we show that the representation problem for N, which is the question whether an n-variable Boolean function given in DNF (or as a disjunction of O(n) threshold gates) can be computed by a spiking neuron, is co-NP-hard.

1 Perceptrons with Synaptic Delays


Neural networks establish an important class of learning models that are widely applied to solving practical artificial intelligence tasks [12]. We consider only a single (perceptron) neuron N having n analog inputs that are encoded by firing rates x_1, …, x_n ∈ [−1, 1]. Here the input values are normalized, but any bounded domain [−a, a] for a positive real a can replace [−1, 1] without loss of generality [21]. As usual, each input i (1 ≤ i ≤ n) is associated with a real synaptic weight w_i. In addition, N receives i's analog input in the form of a unit-length rectangular pulse (spike) of height |x_i| (for x_i < 0 upside down). This pulse travels through i's synapse in continuous time producing a synaptic time delay d_i ≥ 0, a nonnegative real parameter individual for each input 1 ≤ i ≤ n. Taking these delays into account, the current input x_i(t) ∈ [−1, 1] from input i to N at continuous time t ≥ 0 can be expressed as

    x_i(t) = x_i  for t ∈ D_i ,  and  x_i(t) = 0  otherwise,    (1)

Research partially supported by project LN00A056 of The Ministry of Education of
the Czech Republic.

R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 221–233, 2003.
© Springer-Verlag Berlin Heidelberg 2003
222 J. Šı́ma

where D_i = [d_i, d_i + 1) is a time interval of unit length during which N is influenced by the spike from input i. This determines the real excitation

    ξ(t) = w_0 + Σ_{i=1}^n w_i x_i(t)    (2)

for N at time instant t ≥ 0 as a weighted sum of current inputs including a real bias weight w_0.
The real output y(t) of N at continuous time t ≥ 0 is computed by applying an activation function σ : ℝ → ℝ to the excitation, i.e.

    y(t) = σ(ξ(t)) .    (3)

For binary outputs y(t) ∈ {0, 1} the Heaviside activation function

    σ(ξ) = 1  for ξ ≥ 0 ,  and  σ(ξ) = 0  for ξ < 0    (4)

is usually employed. In this case, the output protocol can be defined so that N with weights w_0, …, w_n and delays d_1, …, d_n computes a neuron function y_N : [−1, 1]^n → {0, 1} defined for every input (x_1, …, x_n) ∈ [−1, 1]^n by y_N(x_1, …, x_n) = 1 iff there exists a time instant t ≥ 0 such that y(t) = 1. Similarly, the logistic sigmoid

    σ_L(ξ) = 1 / (1 + e^{−ξ}) ,    (5)

which is well known from back-propagation learning [26], produces analog outputs y(t) ∈ [0, 1], whereas the output protocol can specify a time instant t_out ≥ 0 when the resulting output is read, that is, y_N(x_1, …, x_n) = y(t_out). Unless otherwise stated we assume that neuron N employs the Heaviside activation function (4).
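Because each x_i(t) is constant on D_i = [d_i, d_i + 1) and zero elsewhere, the excitation (2) is piecewise constant in t, so the existential quantifier over t ≥ 0 in the output protocol can be decided by testing finitely many candidate times. A minimal sketch in Python (the function name and example values are illustrative, not from the paper):

```python
def neuron_output(w0, w, d, x):
    """Decide y_N(x) for bias w0, weights w, delays d, and firing rates x.

    Spike i contributes w[i] * x[i] to the excitation exactly on
    D_i = [d_i, d_i + 1); hence the excitation is piecewise constant
    and only needs to be tested at the interval endpoints.
    """
    breakpoints = sorted({0.0} | set(d) | {di + 1.0 for di in d})
    for t in breakpoints:
        xi = w0 + sum(wi * xv for wi, xv, di in zip(w, x, d)
                      if di <= t < di + 1)
        if xi >= 0:  # Heaviside activation (4) fires at some t >= 0
            return 1
    return 0
```

For instance, with w0 = −1, w = (2, −4), d = (0, 1), the input (3/4, −1/4) is accepted (ξ(0) = 1/2 ≥ 0), while (3/8, 1/4) is rejected since ξ(t) < 0 on every piece.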
By restricting certain parameters in the preceding definition of N we obtain
several computational units which are widely used in neurocomputing. For the
classical perceptrons [25] all synaptic delays are zero, i.e. di = 0 for i = 1, . . . , n,
and also tout = 0 when the logistic sigmoid (5) is employed [26]. Or assuming
the spikes with a uniform firing rate, e.g. xi ∈ {0, 1} for i = 1, . . . , n, neuron N
coincides with a simplified model of a spiking neuron with binary coded inputs
which was introduced and analyzed in [20]. Hence, the computational power of
N computing the Boolean functions is the same as that of the spiking neuron
with binary coded inputs [27] (cf. Section 4). In addition, the VC-dimension
Θ(n log n) of the spiking neuron still applies to N with n analog inputs as can
easily be verified by following the argument in [20]. From this point of view, N
represents a generalization of the spiking neuron in which the temporal delays are
combined with the firing rates of perceptron units.
It follows that biological motivations for spiking neurons [10,19] partially
apply also to neuron N . For example, it is known that the synaptic delays are
On the Complexity of Training a Single Perceptron 223

tuned in biological neural systems through a variety of mechanisms. On the other


hand, the underlying computational model is still sufficiently simple providing
easy silicon implementation in pulsed VLSI [19].
In this paper we deal with the computational complexity of training a single neuron N with programmable synaptic delays. The article is organized as
follows. In Section 2, the so-called consistency problem proves to be NP-hard
for N , which implies that the perceptrons with delays are not properly PAC-
learnable unless RP = NP. Furthermore, it is shown in Section 3 that even the approximate training can be hard for N with binary firing rates, which means that the spiking neurons with binary coded inputs do not allow robust learning if RP ≠ NP. In addition, the representation problem for spiking neurons is proved to be co-NP-hard in Section 4.

2 A Single Perceptron with Delays Is Not Learnable


The computational complexity of training a neuron can be analyzed by using the
consistency (loading) problem [17] which is the problem of finding the neuron
parameters for a training task so that the neuron function is perfectly con-
sistent with all training data. For example, an efficient algorithm for the con-
sistency problem is required within the proper PAC learning framework [5] be-
sides the polynomial VC-dimension that common neural network models usually
possess [3,24,29]. Therefore, several learning heuristics have been proposed for networks of spiking neurons, e.g. spike-propagation [6]. On the other hand, NP-hardness of this problem implies that the neuron is not properly PAC learnable (i.e. for any training data that can be loaded into the neuron) under the generally accepted complexity-theoretic assumption RP ≠ NP [22]. An almost exhaustive list of such NP-hardness results for feedforward perceptron networks was presented in [28].
We define a training set

    T = {(x_k ; b_k) : x_k = (x_{k1}, …, x_{kn}) ∈ [−1, 1]^n, b_k ∈ {0, 1}, k = 1, …, m}    (6)

containing m training examples, each composed of an n-dimensional input x_k from [−1, 1]^n labeled with the desired scalar output value b_k from {0, 1}, corresponding to negative and positive examples. The decision version of the consistency problem is formulated as follows:
Consistency Problem for Neuron N (CPN)
Instance: A training set T for N having n inputs.
Question: Are there weights w0 , . . . , wn and delays d1 , . . . , dn for N such that
yN (x) = b for every training example (x; b) ∈ T ?
For ordinary perceptrons with zero delays, i.e. di = 0 for i = 1, . . . , n, the
consistency problem is solvable in polynomial time by linear programming al-
though this problem restricted to binary weights is NP-complete [22]. However,
already for binary delays di ∈ {0, 1} the consistency problem becomes NP-
complete, even for spiking neurons having binary firing rates xi ∈ {0, 1} and

fixed weights [20]. This implies that neuron N with binary delays is not properly PAC learnable unless RP = NP. The result generalizes also to bounded
delay values di ∈ {0, 1, . . . , c} for fixed c ≥ 2. For the spiking neurons with
unbounded delays, however, NP-hardness of the consistency problem was listed
among open problems [20].
In this section we prove that the consistency problem is NP-hard for a single
perceptron N with arbitrary delays, which partially answers the previous open
question provided that several levels of firing rates are allowed. For this purpose
a synchronization technique is introduced whose main idea can be described as
follows. The consistency of a negative example (x_1, …, x_n ; 0) means that for every subset of inputs I ⊆ {1, …, n} whose spikes may simultaneously influence N (i.e. ∩_{i∈I} D_i ≠ ∅) the corresponding excitation must satisfy w_0 + Σ_{i∈I} w_i x_i < 0. At the same time, by using the consistency of other (mostly positive) training examples we can enforce w_0 + Σ_{i∈J} w_i x_i ≥ 0 for some J ⊆ {1, …, n}. In this way we ensure that N is not simultaneously influenced by the spikes from inputs J, that is, ∩_{i∈J} D_i = ∅, which is then exploited for the synchronization of the input spikes.

Theorem 1. CPN is NP-hard.

Proof. In order to achieve the NP-hardness result, the following variant of the
set splitting problem which is known to be NP-complete [9] will be reduced to
CPN in polynomial time.

3-Set-Splitting Problem (3SSP)

Instance: A finite set S = {s_1, …, s_p} and a collection C = {c_ℓ ⊆ S ; |c_ℓ| = 3, ℓ = 1, …, r} of three-element subsets c_ℓ of S.
Question: Is there a partition of S into two disjoint subsets S_1 and S_2, i.e. S = S_1 ∪ S_2 where S_1 ∩ S_2 = ∅, such that c_ℓ ⊄ S_1 and c_ℓ ⊄ S_2 for every ℓ = 1, …, r?
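For intuition, the splitting condition can be decided by brute force on small instances; a hypothetical helper (exponential in p, for illustration only):

```python
from itertools import product

def has_splitting(p, C):
    """Decide 3SSP for S = {1, ..., p} and triples C of element indices:
    is there a 2-partition of S leaving no triple inside one part?"""
    for mask in product([0, 1], repeat=p):
        s1 = {i + 1 for i in range(p) if mask[i] == 0}
        # A triple is split iff it lies neither inside S1 nor inside S2 = S \ S1.
        if all(not set(c) <= s1 and not set(c).isdisjoint(s1) for c in C):
            return True
    return False
```

For example, a single triple is always splittable, whereas the seven lines of the Fano plane admit no splitting.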

The 3SSP problem was also used for proving the result restricted to binary
delays [20]. The above-described synchronization technique generalizes the proof
to arbitrary delays.
Given a 3SSP instance ⟨S, C⟩, we construct a training set T for neuron N with n inputs where n = 2p + 2. The input firing rates of training examples exploit only seven levels from {−1, −1/4, −1/8, 0, 3/8, 3/4, 1} ⊆ [−1, 1]. A list of training examples which are included in training set T follows; in each example the positions of the nonzero firing rates within the input vector are stated explicitly:

    (0, …, 0, 3/4, 0, …, 0 ; 1)  with 3/4 at position 2i − 1, for i = 1, …, p ,    (7)

    (0, …, 0, −1/4, 0, …, 0 ; 1)  with −1/4 at position 2i, for i = 1, …, p ,    (8)

    (0, …, 0, 3/8, −1/8, 0, …, 0 ; 0)  with 3/8, −1/8 at positions 2i − 1, 2i, for i = 1, …, p ,    (9)

    (0, …, 0, −1/4, 0 ; 1)  with −1/4 at position 2p + 1 ,    (10)

    (0, …, 0, −1/4 ; 1)  with −1/4 at position 2p + 2 ,    (11)

    (0, …, 0, −1/8, −1/8 ; 0)  with −1/8, −1/8 at positions 2p + 1, 2p + 2 ,    (12)

    (0, …, 0, 1, 0, …, 0 ; 1)  with 1 at position 2i − 1, for i = 1, …, p ,    (13)

    (0, …, 0, 1, 0, …, 0, 1, 1 ; 0)  with 1 at positions 2i − 1, 2p + 1, 2p + 2, for i = 1, …, p ,    (14)

    (0, …, 0, −1, 0, …, 0 ; 1)  with −1 at position 2i, for i = 1, …, p ,    (15)

    (0, …, 0, −1, 0, …, 0, 1, 1 ; 0)  with −1, 1, 1 at positions 2i, 2p + 1, 2p + 2, for i = 1, …, p ,    (16)

and

    (0, …, 0, 1, 1, 0, …, 0, 1, 1, 0, …, 0, 1, 1, 0, …, 0 ; 0)  with 1 at positions 2i − 1, 2i, 2j − 1, 2j, 2k − 1, 2k ,    (17)

for each c_ℓ = {s_i, s_j, s_k} ∈ C (1 ≤ ℓ ≤ r).
The number of training examples is |T| = 7p + r + 3, and hence, the construction of T can be done in polynomial time in terms of the size of ⟨S, C⟩.
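The construction (7)–(17) is mechanical and can be replayed as code; a sketch of the reduction (function and variable names are illustrative, not from the paper):

```python
from fractions import Fraction as F

def build_training_set(p, C):
    """Training set T per (7)-(17) for S = {s_1, ..., s_p} and a list C
    of triples of 1-based element indices."""
    n = 2 * p + 2
    def ex(entries, b):
        # entries maps a 1-based input position to its firing rate
        x = [F(0)] * n
        for pos, v in entries.items():
            x[pos - 1] = v
        return (tuple(x), b)
    T = []
    for i in range(1, p + 1):
        T.append(ex({2*i - 1: F(3, 4)}, 1))                       # (7)
        T.append(ex({2*i: F(-1, 4)}, 1))                          # (8)
        T.append(ex({2*i - 1: F(3, 8), 2*i: F(-1, 8)}, 0))        # (9)
        T.append(ex({2*i - 1: F(1)}, 1))                          # (13)
        T.append(ex({2*i - 1: F(1), n - 1: F(1), n: F(1)}, 0))    # (14)
        T.append(ex({2*i: F(-1)}, 1))                             # (15)
        T.append(ex({2*i: F(-1), n - 1: F(1), n: F(1)}, 0))       # (16)
    T.append(ex({n - 1: F(-1, 4)}, 1))                            # (10)
    T.append(ex({n: F(-1, 4)}, 1))                                # (11)
    T.append(ex({n - 1: F(-1, 8), n: F(-1, 8)}, 0))               # (12)
    for (i, j, k) in C:                                           # (17)
        T.append(ex({2*i - 1: F(1), 2*i: F(1),
                     2*j - 1: F(1), 2*j: F(1),
                     2*k - 1: F(1), 2*k: F(1)}, 0))
    return T
```

For p elements and r triples this yields |T| = 7p + r + 3 examples over n = 2p + 2 inputs, matching the count above.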
Now, the correctness of the reduction will be verified, i.e. it will be shown
that the 3SSP instance has a solution iff the corresponding CPN instance is
solvable. So first assume that there exists a solution S1 , S2 of the 3SSP instance.
Define the weights and delays for N as follows:

    w_0 = −1 ,  w_{2p+1} = w_{2p+2} = −4 ,    (18)

    w_{2i−1} = 2 ,  w_{2i} = −4  for i = 1, …, p ,    (19)

    d_{2i−1} = 0 if s_i ∈ S_1, d_{2i−1} = 1 if s_i ∈ S_2 ;  d_{2i} = 1 − d_{2i−1}  for i = 1, …, p ,    (20)

    d_{2p+1} = 0 ,  d_{2p+2} = 1 .    (21)

Clearly,

    D_{2i−1} ∩ D_{2i} = ∅    (22)

for i = 1, …, p + 1 according to (20) and (21). It can easily be checked that N with parameters (18)–(21) is consistent with training examples (7)–(16). For instance, for any positive training example (7), excitation ξ(t) = −1 + 2 · (3/4) ≥ 0 when t ∈ D_{2i−1}, which is sufficient for N to output 1. Or for any negative training example (9), excitation ξ(t) = −1 + 2 · (3/8) < 0 for all t ∈ D_{2i−1} and ξ(t) = −1 − 4 · (−1/8) < 0 for all t ∈ D_{2i}, whereas ξ(t) = −1 < 0 for t ≥ 2, which implies that N outputs the desired 0. The verification for the remaining training examples (7)–(16) is similar. Furthermore, D_{2i−1} ∩ D_{2j−1} ∩ D_{2k−1} = ∅ holds for any c_ℓ = {s_i, s_j, s_k} ∈ C according to (20) since c_ℓ ⊄ S_1 and c_ℓ ⊄ S_2. Hence, for a negative training example (17) corresponding to c_ℓ, excitation ξ(t) ≤ −1 + 2 · 1 + 2 · 1 − 4 · 1 < 0 for every t ≥ 0 due to (22), which produces zero output. This completes the argument for the CPN instance to be solvable.
On the other hand, assume that there exist weights w_0, …, w_n and delays d_1, …, d_n for N such that N is consistent with training examples (7)–(17). Any consistent negative example ensures

    w_0 < 0    (23)

since the excitation must satisfy ξ(t) < 0 also for t ∉ ∪_{i=1}^n D_i. Hence, it follows from (7) and (8) that w_0 + (3/4)·w_{2i−1} ≥ 0 and w_0 − (1/4)·w_{2i} ≥ 0, respectively, which sums up to

    w_0 + (3/8)·w_{2i−1} − (1/8)·w_{2i} ≥ 0  for i = 1, …, p .    (24)

On the other hand, by comparing inequality (24) with the consistency of negative examples (9) we conclude that

    D_{2i−1} ∩ D_{2i} = ∅  for i = 1, …, p .    (25)

Similarly, positive training examples (10) and (11) compel the inequality

    w_0 − (1/8)·w_{2p+1} − (1/8)·w_{2p+2} ≥ 0    (26)

which implies

    D_{2p+1} ∩ D_{2p+2} = ∅    (27)

when the consistency of negative example (12) is required.
Furthermore, positive training examples (13) ensure

    w_0 + w_{2i−1} ≥ 0  for i = 1, …, p    (28)

which, confronted with the consistency of negative examples (14), implies

    D_{2i−1} ⊆ D_{2p+1} ∪ D_{2p+2}  for i = 1, …, p .    (29)



Similarly, the simultaneous consistency of positive examples (15) and negative examples (16) gives

    D_{2i} ⊆ D_{2p+1} ∪ D_{2p+2}  for i = 1, …, p .    (30)

It follows from (25), (27), (29), and (30) that for each 1 ≤ i ≤ p

    either (D_{2i−1} = D_{2p+1} and D_{2i} = D_{2p+2})
    or (D_{2i−1} = D_{2p+2} and D_{2i} = D_{2p+1}) ,    (31)

which represents the synchronization of the input spikes according to (27).
Thus define the splitting of S = S_1 ∪ S_2 as

    S_1 = {s_i ∈ S ; D_{2i−1} = D_{2p+1}} ,  S_2 = S \ S_1 .    (32)

It will be proved that S_1, S_2 is a solution of the 3SSP. On the contrary, assume that there is c_ℓ = {s_i, s_j, s_k} ∈ C such that c_ℓ ⊆ S_1 or c_ℓ ⊆ S_2. According to definition (32), D_{2i−1} = D_{2j−1} = D_{2k−1} = D_{2p+1} holds for c_ℓ ⊆ S_1. Hence, the consistency of a corresponding negative example (17) would require

    w_0 + w_{2i−1} + w_{2j−1} + w_{2k−1} < 0    (33)

due to (25), but inequalities (23) and (28) imply the opposite. Similarly, D_{2i−1} = D_{2j−1} = D_{2k−1} = D_{2p+2} for c_ℓ ⊆ S_2 because of (32) and (31), providing contradiction (33). This completes the proof that the 3SSP instance is solvable.

Corollary 1. If RP ≠ NP, then a single perceptron N with programmable synaptic delays is not properly PAC-learnable.

3 A Spiking Neuron Does Not Allow Robust Learning


A single perceptron N with delays can compute only very simple neuron func-
tions. Therefore the consistency problem introduced in Section 2 does not fre-
quently have a solution: there are no weight and delay parameters such that the
neuron function is consistent with all training data. In this case, one would be
satisfied with a good approximation in practice, that is with the neuron param-
eters yielding a small training error. For example, in the incremental learning
algorithms (e.g. [8]) that adapt single neurons before these are wired to a neu-
ral network, an efficient procedure for minimizing the training error is crucial
to keep the network size small for successful generalization. Thus the decision
version for the approximation problem is formulated as follows:
Approximation Problem for Neuron N (APN)
Instance: A training set T for N and a positive integer k.
Question: Are there weights w_0, …, w_n and delays d_1, …, d_n for N such that y_N(x) ≠ b for at most k training examples (x; b) ∈ T ?

Within the PAC framework, the NP-hardness of this problem implies that the neuron does not allow robust learning (i.e. probably approximately optimal learning for any training task) unless RP = NP [14].
For the perceptrons with zero delays, the complexity of the approximation
problem has been resolved. Several authors proved that the approximation prob-
lem is NP-complete in this case [14,23] even if the bias is assumed to be zero [2,
16]. This means that the perceptrons with zero delays do not allow robust learn-
ing unless RP = N P . In addition, it is NP-hard to achieve a fixed error that is
a constant multiple of the optimum [4]. These results were also generalized to
analog outputs, e.g. for the logistic sigmoid (5) it is NP-hard to minimize the
training error under the L1 [15] or L2 [28] norm within a given absolute bound
or within 1 of its infimum.
In this section the approximation problem is proved to be NP-hard for percep-
tron N with arbitrary delays. The proof exploits only binary firing rates, which
means the result is also valid for spiking neurons with binary coded inputs.

Theorem 2. APN for N with binary firing rates is NP-hard.

Proof. The following vertex cover problem that is known to be NP-complete [18]
will be reduced to APN in polynomial time:

Vertex Cover Problem (VCP)


Instance: A graph G = (V, E) and a positive integer k ≤ |V|.
Question: Is there a vertex cover U ⊆ V of size |U| ≤ k, that is, a set such that for each edge {u, v} ∈ E at least one of u and v belongs to U ?

A similar reduction was originally used for the NP-hardness result concerning the approximate training of an ordinary perceptron with zero synaptic delays [14]. The technique generalizes to arbitrary delays.
Thus given a VCP instance ⟨G = (V, E), k⟩ with n = |V| vertices V = {v_1, …, v_n} and r = |E| edges, we construct a training set T for neuron N with n inputs. Training set T contains the following m = n + r examples:

    (0, …, 0, 1, 0, …, 0 ; 1)  with 1 at position i, for i = 1, …, n ,    (34)

    (0, …, 0, 1, 0, …, 0, 1, 0, …, 0 ; 0)  with 1 at positions i and j, for each {v_i, v_j} ∈ E ,    (35)

which can be constructed in polynomial time in terms of the size of the VCP instance. Moreover, in this APN instance at most k inconsistent training examples are allowed.
It will be shown that the VCP instance has a solution iff the corresponding APN instance is solvable. So first assume that there exists a vertex cover U ⊆ V
It will be shown that the VCP instance has a solution iff the corresponding
APN instance is solvable. So first assume that there exists a vertex cover U ⊆ V

of size at most k ≥ |U | vertices. Define the weights and delays for N as follows:
w0 = −1 , (36)

−1 if vi ∈ U
wi = for i = 1, . . . , n , (37)
1 if vi ∈
 U
di = 0 for i = 1, . . . , n . (38)
Obviously, negative examples (35) corresponding to edges {v_i, v_j} ∈ E produce excitations either ξ(t) = −3 when both endpoints are in U or ξ(t) = −1 when only one endpoint is in U, for t ∈ [0, 1), while ξ(t) = w_0 = −1 for t ≥ 1, which means N outputs the desired 0. Furthermore, the positive examples (34) that correspond to vertices v_i ∉ U give excitations ξ(t) = 0 for t ∈ [0, 1) and hence N classifies them correctly. On the other hand, N is not consistent with the positive examples (34) corresponding to vertices v_i ∈ U since ξ(t) = −2 for t ∈ [0, 1) and ξ(t) = −1 for t ≥ 1. Nevertheless, the size of vertex cover U is at most k, which also upper-bounds the number of inconsistent training examples. This completes the argument for the APN instance to be solvable.
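This forward direction can be replayed numerically. Since all delays are zero by (38), every spike occupies [0, 1) and the neuron degenerates to an ordinary perceptron, so no sweep over time is needed. A sketch (function and variable names are illustrative):

```python
def apn_error_count(vertices, edges, cover):
    """Count examples of (34)-(35) misclassified by the parameters
    (36)-(38) derived from a vertex cover `cover` of the graph."""
    n = len(vertices)
    w0 = -1
    w = [-1 if v in cover else 1 for v in vertices]
    # All delays are zero (38): every spike occupies [0, 1), so the
    # neuron fires iff the plain weighted sum is nonnegative.
    def y(x):
        return 1 if w0 + sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0
    errors = 0
    for i in range(n):                       # positive examples (34)
        x = [0] * n
        x[i] = 1
        errors += (y(x) != 1)
    for (u, v) in edges:                     # negative examples (35)
        x = [0] * n
        x[vertices.index(u)] = 1
        x[vertices.index(v)] = 1
        errors += (y(x) != 0)
    return errors
```

On a triangle with cover {a, b}, exactly the two positive examples for the cover vertices are misclassified, so the error count equals |U| = 2 ≤ k.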
On the other hand, assume that there exist weights w_0, …, w_n and delays d_1, …, d_n making N consistent with all but at most k training examples (34)–(35). Define U ⊆ V so that U contains vertex v_i for each inconsistent positive example (34) corresponding to v_i. In addition, U includes just one of v_i and v_j (chosen arbitrarily) for each inconsistent negative example (35) corresponding to edge {v_i, v_j}. Clearly, |U| ≤ k since there are at most k inconsistent training examples. It will be proved that U is a vertex cover for G. On the contrary, assume that there is an edge {v_i, v_j} ∈ E such that v_i, v_j ∉ U. It follows from the definition of U that N is consistent with the negative example (35) corresponding to edge {v_i, v_j}, which implies

    ξ(t) = w_0 < 0  for t ∉ D_i ∪ D_j ,    (39)

and it is consistent with the positive examples (34) corresponding to vertices v_i, v_j, which ensures

    ξ(t) = w_0 + w_i ≥ 0  for t ∈ D_i    (40)
    ξ(t) = w_0 + w_j ≥ 0  for t ∈ D_j    (41)

because of (39). By summing inequalities (39)–(41), we obtain

    w_0 + w_i + w_j > 0 .    (42)

On the other hand, by comparing inequalities (40) and (41) with the consistency of the negative example (35) corresponding to edge {v_i, v_j} we conclude that D_i = D_j (synchronization technique) and hence

    ξ(t) = w_0 + w_i + w_j < 0  for t ∈ D_i = D_j ,    (43)

which contradicts inequality (42). This completes the proof that U is a solution of VCP.

Corollary 2. If RP ≠ NP, then a single spiking neuron with binary coded inputs and arbitrary delays does not allow robust learning.

4 The Representation Problem for Spiking Neurons

In this section we deal with the representation (membership) problem for the
spiking neurons with binary coded inputs:
Representation Problem for Spiking Neuron N (RPN)
Instance: A Boolean function f in DNF (disjunctive normal form).
Question: Is f computable by a single spiking neuron N , i.e. are there weights
w0 , . . . , wn and delays d1 , . . . , dn for N such that yN (x) = f (x) for every x ∈
{0, 1}n ?
The representation problem for perceptrons with zero delays, known as the linear
separability problem, was proved to be co-NP-complete [13]. We generalize the
co-NP-hardness result for spiking neurons with arbitrary delays. On the other hand, the RPN is clearly in Σ_2^p, whereas its hardness for Σ_2^p (or for NP), which would imply [1] that the spiking neurons with arbitrary delays are not learnable with membership and equivalence queries (unless NP = co-NP), remains an open problem.
Moreover, it was shown [20] that the class of n-variable Boolean functions
computable by spiking neurons is strictly contained in the class DLLT that con-
sists of functions representable as disjunctions of O(n) Boolean linear threshold
functions over n variables (from the class LT containing functions computable
by threshold gates) where the smallest number of threshold gates is called the
threshold number [11]. For example, class DLLT corresponds to two-layer networks with a linear number of hidden perceptrons (with zero delays) and one
output OR gate. It was shown [27] that the threshold number of spiking neurons
with n inputs is at most n − 1 and can be lower-bounded by n/2. On the other
hand, there exists a Boolean function with threshold number 2 that cannot be
computed by a single spiking neuron [27]. We prove that a modified version of
RPN, denoted as DLLT-RPN, whose instances are Boolean functions f from
DLLT (instead of DNF) is also co-NP-hard. This means that it is hard to decide
whether a given n-variable Boolean function expressed as a disjunction of O(n)
threshold gates can be computed by a single spiking neuron.
Theorem 3. RPN and DLLT-RPN are co-NP-hard and belong to Σ_2^p.

Proof. The tautology problem that is known to be co-NP-complete [7] will be


reduced to RPN in polynomial time in a similar way as it was done for the linear
separability problem [13]:
Tautology Problem (TP)
Instance: A Boolean function g in DNF.
Question: Is g a tautology, i.e. g(x) = 1 for every x ∈ {0, 1}n ?
For the DLLT-RPN, a modified version of TP, denoted as DLLT-TP, whose instances are Boolean functions g from DLLT, will be exploited. For proving that the DLLT-TP remains co-NP-complete, any TP instance ∨_{j=1}^m C_j with m monomials (conjunctions of literals over n variables) can be equivalently rewritten in DNF as ∨_{j=1}^m ((C_j ∧ x'_j) ∨ (C_j ∧ x̄'_j)), where x'_1, …, x'_m are m new variables. Clearly, in the new DNF formula the number of monomials is linear in terms of the number of variables. Moreover, any monomial can obviously be computed by a single threshold gate.
Thus given a TP (DLLT-TP) instance g over n variables x1 , . . . , xn , we
construct a corresponding RPN (DLLT-RPN) instance f over n + 2 variables
x1 , . . . , xn , y1 , y2 in polynomial time as follows:

f (x1 , . . . , xn , y1 , y2 ) = (g(x1 , . . . , xn ) ∧ y1 ) ∨ (y1 ∧ ȳ2 ) ∨ (ȳ1 ∧ y2 ) . (44)

For TP instance g, function f is actually in DNF as required for the RPN. For DLLT-TP instance g = ∨_{j=1}^m g_j with g_j from LT, formula (44) contains terms g_j ∧ y_1 that are equivalent to ¬(ḡ_j ∨ ȳ_1), which belongs to LT since class LT is closed under negation [21] and a summand W(1 − y_1) with a sufficiently large weight W can be added to the weighted sum for ḡ_j to evaluate ḡ_j ∨ ȳ_1. This implies that f is from DLLT, representing a DLLT-RPN instance.
It will be shown that the TP (DLLT-TP) instance has a solution iff the
corresponding RPN (DLLT-RPN) instance is solvable. So first assume that g
is a tautology. Hence f given by (44) can be equivalently rewritten as y1 ∨ y2
which is trivially computable by a spiking neuron. On the other hand, assume
that there exists a ∈ {0, 1}n such that g(a) = 0. In this case, f (a, y1 , y2 ) reduces
to XOR(y1 , y2 ) which cannot be implemented by a single spiking neuron [20].
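The behaviour of reduction (44) can be checked by brute force on small instances; a sketch (helper names are illustrative, not from the paper):

```python
from itertools import product

def f_from_g(g, n):
    """Build f per (44) from an n-variable Boolean function g."""
    def f(*args):
        xs, y1, y2 = args[:n], args[n], args[n + 1]
        return (g(*xs) and y1) or (y1 and not y2) or (not y1 and y2)
    return f

# If g is a tautology, f collapses to y1 OR y2 on every input.
g = lambda x1, x2: (x1 or not x1)            # a 2-variable tautology
f = f_from_g(g, 2)
assert all(f(x1, x2, y1, y2) == (y1 or y2)
           for x1, x2, y1, y2 in product([False, True], repeat=4))

# If g(a) = 0 for some a, then f(a, y1, y2) is XOR(y1, y2),
# which a single spiking neuron cannot implement [20].
g0 = lambda x1, x2: (x1 and x2)
f0 = f_from_g(g0, 2)
assert all(f0(False, False, y1, y2) == (y1 != y2)
           for y1, y2 in product([False, True], repeat=2))
```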
For proving that RPN ∈ Σ_2^p (similarly for DLLT-RPN), consider an alternating algorithm for the RPN that, given f in DNF, guesses polynomial-size representations [20] of weights and delays for spiking neuron N first in its existential state, and then verifies y_N(x) = f(x) for every x ∈ {0, 1}^n (y_N(x) can be computed in polynomial time since there are only a linear number of time intervals to check) in its universal state.

5 Conclusion

The computational complexity of training a single perceptron with programmable synaptic delays, a model that covers certain aspects of spiking neurons (with binary coded inputs), has been analyzed. We have developed a synchronization technique that generalizes the known non-learnability results to arbitrary synaptic delays. In particular, we have proved that the perceptrons with delays are not properly PAC-learnable and the spiking neurons do not allow robust learning unless RP = NP. This represents a step towards solving an open issue concerning the PAC-learnability of spiking neurons with arbitrary delays.
In addition, we have shown that it is co-NP-hard to decide whether a disjunc-
tion of O(n) threshold gates, which is known to implement any spiking neuron,
can reversely be computed by a single spiking neuron. An open issue remains
for further research whether the spiking neurons are learnable with membership
and equivalence queries.
232 J. Šı́ma

References
1. Aizenstein, H., Hegedüs, T., Hellerstein, L., Pitt, L.: Complexity Theoretic Hard-
ness Results for Query Learning. Computational Complexity 7 (1) (1998) 19–53
2. Amaldi, E.: On the complexity of training perceptrons. In: Kohonen, T., Mäkisara,
K., Simula, O., Kangas, J. (eds.): Proceedings of the ICANN’91 First Interna-
tional Conference on Artificial Neural Networks. Elsevier Science Publisher, North-
Holland, Amsterdam (1991) 55–60
3. Anthony, M., Bartlett, P.L.: Neural Network Learning: Theoretical Foundations.
Cambridge University Press, Cambridge, UK (1999)
4. Arora, S., Babai, L., Stern, J., Sweedyk, Z.: The hardness of approximate optima in
lattices, codes, and systems of linear equations. Journal of Computer and System
Sciences 54 (2) (1997) 317–331
5. Blumer, A., Ehrenfeucht, A., Haussler, D., Warmuth, M.K.: Learnability and the
Vapnik-Chervonenkis dimension. Journal of the ACM 36 (4) (1989) 929–965
6. Bohte, M., Kok, J.N., La Poutré, H.: Spike-prop: error-backpropagation in multi-
layer networks of spiking neurons. In: Proceedings of the ESANN’2000 European
Symposium on Artificial Neural Networks. D-Facto Publications, Brussels (2000)
419–425
7. Cook, S.A.: The complexity of theorem-proving procedures. In: Proceedings of the
STOC’71 Third Annual ACM Symposium on Theory of Computing. ACM Press,
New York (1971) 151–158
8. Fahlman, S.E., Lebiere, C.: The cascade-correlation learning architecture. In:
Touretzky, D.S. (ed.): Advances in Neural Information Processing Systems
(NIPS’89), Vol. 2. Morgan Kaufmann, San Mateo (1990) 524–532
9. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory
of NP-Completeness. W.H. Freeman, San Francisco (1979)
10. Gerstner, W., Kistler, W.M.: Spiking Neuron Models: Single Neurons, Populations,
Plasticity. Cambridge University Press, Cambridge, UK (2002)
11. Hammer, P.L., Ibaraki, T., Peled, U.N.: Threshold numbers and threshold comple-
tions. In: Hansen, P. (ed.): Studies on Graphs and Discrete Programming, Annals
of Discrete Mathematics 11, Mathematics Studies, Vol. 59. North-Holland, Ams-
terdam (1981) 125–145
12. Haykin, S.: Neural Networks: A Comprehensive Foundation. 2nd edn. Prentice-
Hall, Upper Saddle River, NJ (1999)
13. Hegedüs, T., Megiddo, N.: On the geometric separability of Boolean functions.
Discrete Applied Mathematics 66 (3) (1996) 205–218
14. Höffgen, K.-U., Simon, H.-U., Van Horn, K.S.: Robust trainability of single neu-
rons. Journal of Computer and System Sciences 50 (1) (1995) 114–125
15. Hush, D.R.: Training a sigmoidal node is hard. Neural Computation 11 (5) (1999)
1249–1260
16. Johnson, D.S., Preparata, F.P.: The densest hemisphere problem. Theoretical Com-
puter Science 6 (1) (1978) 93–107
17. Judd, J.S.: Neural Network Design and the Complexity of Learning. The MIT
Press, Cambridge, MA (1990)
18. Karp, R.M.: Reducibility among combinatorial problems. In: Miller, R.E.,
Thatcher, J.W. (eds.): Complexity of Computer Computations. Plenum Press, New
York (1972) 85–103
19. Maass, W., Bishop, C.M. (eds.): Pulsed Neural Networks. The MIT Press, Cam-
bridge, MA (1999)
On the Complexity of Training a Single Perceptron 233

20. Maass, W., Schmitt, M.: On the complexity of learning for spiking neurons with
temporal coding. Information and Computation 153 (1) (1999) 26–46
21. Parberry, I.: Circuit Complexity and Neural Networks. The MIT Press, Cambridge,
MA (1994)
22. Pitt, L., Valiant, L.G.: Computational limitations on learning from examples. Jour-
nal of the ACM 35 (4) (1988) 965–984
23. Roychowdhury, V.P., Siu, K.-Y., Kailath, T.: Classification of linearly non-
separable patterns by linear threshold elements. IEEE Transactions on Neural
Networks 6 (2) (1995) 318–331
24. Roychowdhury, V.P., Siu, K.-Y., Orlitsky, A. (eds.): Theoretical Advances in Neu-
ral Computation and Learning. Kluwer Academic Publishers, Boston (1994)
25. Rosenblatt, F.: The perceptron: A probabilistic model for information storage and
organization in the brain. Psychological Review 65 (6) (1958) 386–408
26. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-
propagating errors. Nature 323 (1986) 533–536
27. Schmitt, M.: On computing Boolean functions by a spiking neuron. Annals of
Mathematics and Artificial Intelligence 24 (1-4) (1998) 181–191
28. Šı́ma, J.: Training a single sigmoidal neuron is hard. Neural Computation 14 (11)
(2002) 2709–2728
29. Vidyasagar, M.: A Theory of Learning and Generalization. Springer-Verlag, Lon-
don (1997)
Learning a Subclass of Regular Patterns in
Polynomial Time

John Case1,⋆ , Sanjay Jain2,⋆⋆ , Rüdiger Reischuk3 , Frank Stephan4 , and
Thomas Zeugmann3
1 Dept. of Computer and Information Sciences, University of Delaware, Newark, DE
19716-2586, USA
case@[Link]
2 School of Computing, National University of Singapore, Singapore 117543
sanjay@[Link]
3 Institute for Theoretical Informatics, University at Lübeck, Wallstr. 40, 23560
Lübeck, Germany
{reischuk, thomas}@[Link]
4 Mathematisches Institut, Universität Heidelberg, Im Neuenheimer Feld 294, 69120
Heidelberg, Germany
fstephan@[Link]

Abstract. Presented is an algorithm (for learning a subclass of erasing
regular pattern languages) which can be made to run with arbitrarily
high probability of success on extended regular languages generated by
patterns π of the form x0 α1 x1 . . . αm xm for unknown m but known
c , from a number of examples polynomial in m (and exponential in c ),
where x0 , . . . , xm are variables and where α1 , . . . , αm are each strings
of constants or terminals of length c . This assumes that the algorithm
randomly draws samples under natural and plausible assumptions on the
distribution.
The more general-looking case of extended regular patterns which alter-
nate between a variable and a fixed-length constant string, beginning and
ending with either a variable or a constant string, is handled similarly.

1 Introduction

The pattern languages were formally introduced by Angluin [1]. A pattern lan-
guage is (by definition) one generated by all the positive length substitution
instances in a pattern, such as, for example,
abxycbbzxa
— where the variables (for substitutions) are x, y, z and the constants/terminals
are a, b, c .
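As an aside (our illustration, not from the paper): membership in the language of this example pattern can be tested with a backreferencing regular expression, since each variable must receive a non-empty substitution and the repeated variable x must receive the same one both times. The example strings below are hypothetical.

```python
import re

# Sketch: the pattern a b x y c b b z x a with non-empty substitutions,
# where group 1 = x, group 2 = y, group 3 = z, and the backreference \1
# forces both occurrences of x to agree.
pattern = re.compile(r"ab(.+)(.+)cbb(.+)\1a")

# Hypothetical member: x = "c", y = "a", z = "b" gives ab c a cbb b c a.
assert pattern.fullmatch("abcacbbbca") is not None
# Disagreeing occurrences of x ("c" vs "d") are rejected.
assert pattern.fullmatch("abcacbbbda") is None
```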
⋆ Supported in part by NSF grant number CCR-0208616 and USDA IFAFS grant
number 01-04145.
⋆⋆ Supported in part by NUS grant number R252-000-127-112.

R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 234–246, 2003.
© Springer-Verlag Berlin Heidelberg 2003

Since then, much work has been done on pattern languages and extended pat-
tern languages which also allow empty substitutions as well as on various special
cases of the above (cf., e.g., [1,6,7,10,12,21,20,22,23,26,19,29] and the references
therein). Furthermore, several authors have also studied finite unions of pattern
languages (or extended pattern languages), unbounded unions thereof and also
of important subclasses of (extended) pattern languages (see, for example, [11,
5,27,3,32]).

Nix [18] as well as Shinohara and Arikawa [28,29] outline interesting appli-
cations of pattern inference algorithms. For example, pattern language learning
algorithms have been successfully applied toward some problems in molecular bi-
ology (see [25,29]). Pattern languages and finite unions of pattern languages turn
out to be subclasses of Smullyan’s [30] Elementary Formal Systems (EFSs), and
Arikawa, Shinohara and Yamamoto [2] show that the EFSs can also be treated as
a logic programming language over strings. The investigations of the learnabil-
ity of subclasses of EFSs are interesting because they yield corresponding results
about the learnability of subclasses of logic programs. Hence, these results are
also of relevance for Inductive Logic Programming (ILP) [17,13,4,15]. Miyano et
al. [16] intensively studied the polynomial-time learnability of EFSs.

In the following we explain the main philosophy behind our research as well
as the ideas by which it emerged. As far as learning theory is concerned, pattern
languages are a prominent example of non-regular languages that can be learned
in the limit from positive data (cf. [1]). Gold [9] has introduced the correspond-
ing learning model. Let L be any language; then a text for L is any infinite
sequence of strings eventually containing all strings of L , and nothing else. The
information given to the learner consists of successively growing initial segments of a
text. Processing these segments, the learner has to output hypotheses about L .
The hypotheses are chosen from a prespecified set called hypothesis space. The
sequence of hypotheses has to converge to a correct description of the target
language.

Angluin [1] provides a learner for the class of all pattern languages that is
based on the notion of descriptive patterns. Here a pattern π is said to be
descriptive (for the set S of strings contained in the input provided so far) if
π can generate all strings contained in S and no other pattern having this
property generates a proper subset of the language generated by π . But no
efficient algorithm is known for computing descriptive patterns. Thus, unless
such an algorithm is found, it is even infeasible to compute a single hypothesis
in practice by using this approach.

Therefore, one has considered restricted versions of pattern language learning


in which the number k of different variables is fixed, in particular the case of
a single variable. Angluin [1] gives a learner for one-variable pattern languages
with update time O(ℓ⁴ log ℓ) , where ℓ is the sum of the lengths of all examples
seen so far. Note that this algorithm also computes descriptive patterns, even
of maximum length.

Another important special case extensively studied are the regular pattern
languages introduced by Shinohara [26]. These are generated by the regular pat-
terns, i.e., patterns in which each variable that appears, appears only once. The
learners designed by Shinohara [26] for regular pattern languages and extended
regular pattern languages are also computing descriptive patterns for the data
seen so far. These descriptive patterns are computable in time polynomial in the
length of all examples seen so far.
But when applying these algorithms in practice, another problem comes into
play, i.e., all the learners mentioned above are only known to converge in the
limit to a correct hypothesis for the target language. But the stage of convergence
is not decidable. Thus, a user never knows whether or not the learning process
is already finished. Such an uncertainty may not be tolerable in practice.
Consequently, one has tried to learn the pattern languages within Valiant’s
[31] PAC model. Schapire [24] could show that the whole class of pattern languages
is not learnable within the PAC model unless P/poly = N P/poly for
any hypothesis space that allows a polynomially decidable membership problem.
Since membership is N P -complete for the pattern languages, his result does not
exclude the learnability of all pattern languages in an extended PAC model, i.e.,
a model in which one is allowed to use the set of all patterns as hypothesis space.
However, Kearns and Pitt [10] have established a PAC learning algorithm
for the class of all k -variable pattern languages, i.e., languages generated by
patterns in which at most k different variables occur. Positive examples are
generated with respect to arbitrary product distributions while negative exam-
ples are allowed to be generated with respect to any distribution. Additionally,
the length of substitution strings has been required to be polynomially related
to the length of the target pattern. Finally, their algorithm uses as hypothesis
space all unions of polynomially many patterns that have k or fewer variables.¹
The overall learning time of their PAC learning algorithm is polynomial in the
length of the target pattern, the bound for the maximum length of substitution
strings, 1/ε , 1/δ , and |Σ| . The constant in the running time achieved depends
doubly exponentially on k , and thus their algorithm becomes rapidly impractical
when k increases.
As far as the class of extended regular pattern languages is concerned, Miyano
et al. [16] showed the consistency problem to be N P -complete. Thus, the class
of all extended regular pattern languages is not polynomial-time PAC learnable
unless RP = N P for any learner that uses the regular patterns as hypothesis
space.
This is even true for REGPAT1 , i.e., the set of all extended regular pattern
languages where the length of constant strings is 1 (see below for a formal
¹ More precisely, the number of allowed unions is at most poly(|π|, s, 1/ε, 1/δ, |Σ|) ,
where π is the target pattern, s the bound on the length of substitution strings,
ε and δ are the usual error and confidence parameters, respectively, and Σ is the
alphabet of constants over which the patterns are defined.

definition). The latter result follows from [16] via an equivalence proof to the
common subsequence languages studied in [14].
In the present paper we also study the special cases of learning the extended
regular pattern languages. On the one hand, they already allow non-trivial appli-
cations. On the other hand, it is by no means easy to design an efficient learner
for these classes of languages as noted above. Therefore, we aim to design an
efficient learner for an interesting subclass of the extended regular pattern lan-
guages which we define next.
Let Lang(π) be the extended pattern language generated by pattern π . For
c > 0 , let REGPATc be the set of all Lang(π) such that π is a pattern of the
form x0 α1 x1 α2 x2 . . . αm xm , where each αi is a string of terminals of length c
and x0 , x1 , x2 , . . . , xm are distinct variables.
We consider polynomial time learning of REGPATc for various data presen-
tations and for natural and plausible probability distributions on the input data.
As noted above, even REGPAT1 is not polynomial-time PAC learnable unless
RP = N P . Thus, one has to restrict the class of all probability distributions.
Then, the conceptual idea is as follows.
We explain it here for the case mainly studied in this paper, learning from
text (in our above notation). One looks again at the whole learning process as
learning in the limit. So, the data presented to the learner are growing initial
segments of a text. But now, we do not allow any text. Instead every text is
drawn according to some fixed probability distribution. Next, one determines
the expected number of examples needed by the learner until convergence. Let
E denote this expectation. Assuming prior knowledge about the underlying
probability distribution, E can be expressed in terms the learner may use
conceptually to calculate E . Using Markov’s inequality, one easily sees that the
probability to exceed this expectation by a factor of t is bounded by 1/t . Thus,
we introduce, as in the PAC model, a confidence parameter δ . Given δ , one
needs roughly (1/δ) · E many examples to converge with probability at least
1 − δ . Knowing this, there is of course no need to compute any intermediate
hypotheses. Instead, now the learner firstly draws as many examples as needed
and then it computes just one hypothesis from it. This hypothesis is output, and
by construction we know it to be correct with probability at least 1 − δ . Thus,
we arrive at a learning model which we call probabilistically exact learning (cf.
Definition 5 below). Clearly, in order to have an efficient learner one also has to
ensure that this hypothesis can be computed in time polynomial in the length
of all strings seen. For arriving at an overall polynomial-time learner, it must be
also ensured that E is polynomially bounded in a suitable parameter. We use
the number of variables occurring in the regular target pattern, c (the length
of substitution strings) and a term describing knowledge about the probability
distribution as such a parameter.
For REGPATc , we have results for three different models of data presenta-
tion. The data are drawn according to the distribution prob described below.

The three models are as follows. Due to space limitations, we present
herein the details and verification of our algorithm for the first model only. The
journal version of this paper will present more details. Σ is the terminal alpha-
bet. For natural numbers c > 0 , Σ c is Σ ∗ restricted to strings of length c .
(1) For drawing of examples according to prob for learning a pattern lan-
guage generated by π : one draws terminal string σ according to distribution
prob over Σ ∗ until σ ∈ Lang(π) is obtained. Then σ is returned to the learner.
(2) One draws σ according to prob and gives (σ, χLang(π) (σ)) to the
learner.
(3) As in (2), but one gives σ to the learner in the case that σ ∈ Lang(π) ,
and gives a pause-symbol to the learner otherwise.
For this paper, the natural and plausible assumptions on prob are the fol-
lowing.
(i) prob(Σ c ) ≥ prob(Σ c+1 ) for all c ;
(ii) prob(σ) = prob(Σ c )/|Σ c | , where σ ∈ Σ c ;
(iii) there is an increasing polynomial pol such that prob(Σ c ) ≥ 1/pol(c) for
all c .
Our algorithm is presented in detail in Section 3 below. The complexity
bounds are described more exactly there, but, basically, the algorithm can be
made to run with arbitrarily high probability of success on extended regular
languages generated by patterns π of the form x0 α1 x1 ...αm xm for unknown
m but known c , from a number of examples polynomial in m (and exponential
in c ), where α1 , ..., αm ∈ Σ c .
N.B. Having our patterns defined as starting and ending with variables is not
crucial (since one can handle patterns starting or ending with constants easily by
just looking at the data and seeing whether they have a common suffix or prefix). Our
results hold more generally for patterns alternating variables and fixed-length
constant strings, where the variables are not repeated. Our statements above
and in Section 3 below involving variables at the front and end are more for ease
of presentation of the proof.

2 Preliminaries

Let N = {0, 1, 2, . . .} denote the set of natural numbers, and let N+ = N \ {0} .
For any set S , we write |S| to denote the cardinality of S .
Let Σ be any non-empty finite set of constants such that |Σ| ≥ 2 and let V
be a countably infinite set of variables such that Σ ∩ V = ∅ . By Σ ∗ we denote
the free monoid over Σ . The set of all finite non-null strings of symbols from
Σ is denoted by Σ + , i.e., Σ + = Σ ∗ \ {λ} , where λ denotes the empty string.
As above, Σ c denotes the set of strings over Σ with length c . We let a, b, . . .

range over constant symbols. x, y, z, x1 , x2 , . . . range over variables. Following


Angluin [1], we define patterns and pattern languages as follows.

Definition 1. A term is an element of (Σ ∪ V )∗ . A ground term (or a word ,


or a string) is an element of Σ ∗ . A pattern is a non-empty term.

A substitution is a homomorphism from terms to terms that maps each sym-


bol a ∈ Σ to itself. The image of a term π under a substitution θ is denoted
πθ . We next define the language generated by a pattern.

Definition 2. The language generated by a pattern π is defined as Lang(π) =


{πθ ∈ Σ ∗ | θ is a substitution } .
We set PAT = {Lang(π) | π is a pattern} .

Note that we are considering extended (or erasing) pattern languages, i.e., a
variable may be substituted with the empty string λ . Though allowing empty
substitutions may seem a minor generalization, it is not. Learning erasing pat-
tern languages is more difficult for the case considered within this paper than
learning non-erasing ones. For the general case of arbitrary pattern languages,
already Angluin [1] showed the non-erasing pattern languages to be learnable
from positive data. However, the erasing pattern languages are not learnable
from positive data if |Σ| = 2 (cf. Reidenbach [19]).

Definition 3 (Shinohara [26]). A pattern π is said to be regular if it is of the


form x0 α1 x1 α2 x2 . . . αm xm , where αi ∈ Σ + and xi is the i -th variable.
We set REGPAT = {Lang(π) | π is a regular pattern} .

Definition 4. Suppose c ∈ N+ . We define

(a) regc^m = {π | π = x0 α1 x1 α2 x2 . . . αm xm , where each αi ∈ Σ c } .
(b) regc = ∪_m regc^m .
(c) REGPATc = {Lang(π) | π ∈ regc } .
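Since each variable in a pattern from regc occurs only once and may be erased, a string σ belongs to Lang(π) exactly when α1 , . . . , αm occur in σ in this order as non-overlapping substrings, which a greedy left-to-right scan decides. A minimal sketch (our illustration; the helper name `generates` is ours):

```python
def generates(alphas, sigma):
    """Decide sigma ∈ Lang(x0 α1 x1 ... αm xm): the constant blocks must
    appear in order as non-overlapping substrings; greedy (leftmost)
    matching is complete for this test."""
    pos = 0
    for a in alphas:
        i = sigma.find(a, pos)
        if i < 0:
            return False
        pos = i + len(a)  # end of the shortest prefix in Σ*α1 ... Σ*αi
    return True

# π = x0 ab x1 ba x2 over Σ = {a, b}, i.e. π ∈ reg2
assert generates(["ab", "ba"], "abba")       # all variables erased
assert generates(["ab", "ba"], "babaabbab")
assert not generates(["ab", "ba"], "baab")   # blocks out of order
```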

Next, we define the learning model considered in this paper. As already


explained in the Introduction, our model differs to a certain extent from the
PAC model introduced by Valiant [31] which is distribution independent. In our
model, a bit of background knowledge concerning the class of allowed probability
distributions is allowed. So, we have a stronger assumption, but also a stronger
requirement, i.e., instead of learning an approximation for the target concept,
our learner is required to learn it exactly. Moreover, the class of erasing regular
pattern languages is known not to be PAC learnable (cf. [16] and the discussion
within the Introduction).

Definition 5. A learner M is said to probabilistically exactly learn a class


L of pattern languages according to probability distribution prob , if for all
δ , 0 < δ < 1 , for some polynomial q , when learning a Lang(π) ∈ L , with
probability at least 1 − δ , M draws at most q(|π|, 1/δ) examples according to
the probability distribution prob , and then outputs a pattern π′ , such that
Lang(π) = Lang(π′) .

As far as drawing of examples according to prob for learning a pattern


language generated by π is concerned, we assume the following model (the first
model discussed in the Introduction): one draws σ according to distribution
prob over Σ ∗ , until σ ∈ Lang(π) is obtained. Then σ is returned to the
learner. (Note: prob is thus defined over Σ ∗ .)
The other two models we mentioned in the Introduction are:
(2) There is a basic distribution prob and one draws σ according to prob
and gives (σ, χLang(π) (σ)) to the learner.
(3) As in (2), but one gives σ to the learner in the case that σ ∈ Lang(π) ,
and gives a pause-symbol to the learner otherwise.
We note that our proof works for models (2) and (3) above too.
For this paper, the assumptions on prob are (as in the Introduction) the
following.
(i) prob(Σ c ) ≥ prob(Σ c+1 ) for all c ∈ N ;
(ii) prob(σ) = prob(Σ c )/|Σ c | , where σ ∈ Σ c ;
(iii) there is an increasing polynomial pol with prob(Σ c ) ≥ 1/pol(c) and
pol(c) ≠ 0 for all c ∈ N .
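One concrete family of distributions satisfying (i)–(iii) (our illustration, not prescribed by the paper) takes prob(Σ c ) = 1/((c + 1)(c + 2)) , which telescopes to total mass 1, is non-increasing in c , and admits pol(c) = (c + 1)(c + 2) ; strings of equal length are drawn uniformly. A sampling sketch with hypothetical helper names:

```python
import random

def prob_length(c):
    # prob(Σ^c) = 1/((c+1)(c+2)): sums to 1 over c >= 0 (telescoping),
    # is non-increasing (assumption (i)), and is >= 1/pol(c) for the
    # increasing polynomial pol(c) = (c+1)(c+2) (assumption (iii)).
    return 1.0 / ((c + 1) * (c + 2))

def draw(alphabet, rng):
    # Sample a length by walking the cumulative distribution, then a
    # uniform string of that length (assumption (ii)).
    u, c, acc = rng.random(), 0, 0.0
    while acc + prob_length(c) < u:
        acc += prob_length(c)
        c += 1
    return "".join(rng.choice(alphabet) for _ in range(c))

rng = random.Random(0)
samples = [draw("ab", rng) for _ in range(2000)]
# Half the mass sits on the empty string, since prob(Σ^0) = 1/2.
frac_empty = sum(s == "" for s in samples) / len(samples)
assert 0.4 < frac_empty < 0.6
```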

3 Main Result

In this section we will show that REGPATc is probabilistically exactly learnable


according to probability distributions prob satisfying the constraints described
above.

Lemma 1. (based on Chernoff Bounds) Suppose X, Y ⊆ Σ ∗ , that δ, ε are properly
between 0 and 1/2 , and that prob(X) ≥ prob(Y ) + ε . Let e be the base of the natural
logarithm. Then, if one draws at least
(2 − log(δ)) / (ε² ∗ log e)
many examples from Σ ∗ according to the probability distribution prob , then
with probability at least 1 − δ , more elements of X than of Y show up. The
number 2/(ε² ∗ δ) is an upper bound for this number.

More generally, the following holds.



Lemma 2. One can define a function r(ε, δ, k) which is polynomial in k , 1/ε , 1/δ
such that for all sets X, Z, Y1 , Y2 , . . . , Yk ⊆ Σ ∗ , the following holds.
If prob(X) − prob(Yi ) ≥ ε for i = 1, 2, . . . , k , and prob(Z) ≥ ε , and one
draws ≥ r(ε, δ, k) many examples from Σ ∗ according to the distribution prob ,
then with probability at least 1 − δ
(a) there is at least one example from Z .
(b) there are strictly more examples in X than in any of the sets Y1 , . . . , Yk .

Proposition 1. For every regular pattern π and all m ∈ N , |Lang(π) ∩
Σ m+1 | ≥ |Σ| ∗ |Lang(π) ∩ Σ m | .

Proof. Any regular pattern π ends with a variable, so appending an arbitrary
symbol s ∈ Σ to a generated string of length m yields a generated string of
length m + 1 , and distinct pairs (σ, s) yield distinct strings; hence the
proposition follows.
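Proposition 1 can be verified exhaustively on a toy instance; the sketch below (our illustration) counts the generated strings of each length for π = x0 a x1 b x2 over Σ = {a, b} :

```python
from itertools import product

def generates(alphas, sigma):
    # Greedy left-to-right check that the blocks occur in order.
    pos = 0
    for a in alphas:
        i = sigma.find(a, pos)
        if i < 0:
            return False
        pos = i + len(a)
    return True

def count(alphas, letters, h):
    # |Lang(π) ∩ Σ^h| by exhaustive enumeration (only viable for tiny h).
    return sum(generates(alphas, "".join(w))
               for w in product(letters, repeat=h))

# π = x0 a x1 b x2 over Σ = {a, b}: the trailing variable x2 absorbs any
# appended symbol, so the count grows at least by a factor |Σ| per length.
for h in range(1, 8):
    assert count(["a", "b"], "ab", h + 1) >= 2 * count(["a", "b"], "ab", h)
```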

Proposition 2. For any fixed constant c ∈ N+ and any alphabet Σ , there is a


polynomial f such that for every π ∈ regc^m , at least half of the strings of length
f (m) are generated by π .

Proof. Suppose that π = x0 α1 x1 α2 x2 . . . αm xm , and α1 , α2 , ..., αm ∈ Σ c .


Clearly, there is a length d ≥ c such that for every τ ∈ Σ c , at least half of the
strings in Σ d contain τ as a substring, that is, are in the set ∪_{k=0}^{d−c} Σ^k τ Σ^(d−k−c) .
strings in Σ d contain τ as a substring, that is, are in the set k=0 Σ k τ Σ d−k−c .
Now let f (m) = d ∗ m² . We show that given π as above, at least half of the
strings of length f (m) are generated by π .
In order to see this, draw a string σ ∈ Σ^(d∗m²) according to a fair |Σ|-sided
coin such that all symbols are equally likely. Divide σ into m equal parts of
length d ∗ m . The i -th part contains αi as a substring with probability at least
1 − 2^(−m) , and thus the whole string is generated by π with probability at least
1 − m ∗ 2^(−m) . Note that 1 − m ∗ 2^(−m) ≥ 1/2 for all m , and thus f (m) meets
the specification.
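Proposition 2 can likewise be sanity-checked by Monte-Carlo sampling; the sketch below (our illustration, with the hypothetical choice d = 2 , so f (3) = 18 ) estimates the fraction of uniform strings of length f (m) generated by a pattern with m = 3 single-letter blocks:

```python
import random

def generates(alphas, sigma):
    # Greedy check that the constant blocks occur in order in sigma.
    pos = 0
    for a in alphas:
        i = sigma.find(a, pos)
        if i < 0:
            return False
        pos = i + len(a)
    return True

# π = x0 a x1 b x2 a x3 over Σ = {a, b}: m = 3, c = 1, and with the
# hypothetical choice d = 2 the proof's length is f(m) = d * m * m = 18.
rng = random.Random(1)
alphas, length, trials = ["a", "b", "a"], 2 * 3 * 3, 2000
hits = sum(generates(alphas, "".join(rng.choice("ab") for _ in range(length)))
           for _ in range(trials))
assert hits / trials >= 0.5  # Proposition 2 promises at least one half
```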

We now present our algorithm for learning REGPATc . The algorithm has
prior knowledge about the function r from Lemma 2 and the function f from
Proposition 2. It takes as input c , δ and knowledge about the probability
distribution by getting pol .

Learner (c, δ, pol)


(1) Read examples until an n is found such that the shortest example is strictly
shorter than c ∗ n and the total number of examples (including repetitions)
is at least
n ∗ r( 1/(2 ∗ |Σ|^c ∗ f (n) ∗ pol(f (n))) , δ/n , |Σ|^c ) .

Let A be the set of all examples and Aj (j ∈ {1, 2, . . . , n}) be the examples
whose index is j modulo n ; so the (k ∗ n + j) -th example from A goes to
Aj where k is an integer and j ∈ {1, 2, ..., n} .
Let i = 1 , π0 = x0 , X0 = {λ} and go to Step (2).
(2) For β ∈ Σ c , let Yi,β = Xi−1 βΣ ∗ .
If A ∩ Xi−1 = ∅ , then let m = i − 1 and go to Step (3).
Choose αi as the β ∈ Σ c such that |Ai ∩ Yi,β | > |Ai ∩ Yi,β′ | for all β′ ∈
Σ c − {β} (if there is no such β , then abort the algorithm).
Let Xi be the set of all strings σ such that σ is in Σ ∗ α1 Σ ∗ α2 Σ ∗ . . . Σ ∗ αi ,
but no proper prefix τ of σ is in Σ ∗ α1 Σ ∗ α2 Σ ∗ . . . Σ ∗ αi .
Let πi = πi−1 αi xi , let i = i + 1 and go to Step (2).
(3) Output the pattern πm = x0 α1 x1 α2 x2 . . . αm xm and halt.
End
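The counting machinery aside, the inner loop of the learner can be sketched as follows (our simplification, with hypothetical helper names: αi is chosen by a plurality vote over all examples rather than by the strict majority test on the subsample Ai , so the probabilistic guarantee of Step (1) is not reproduced here):

```python
from collections import Counter
from itertools import product

def greedy_end(alphas, sigma):
    """End position of the shortest prefix of sigma lying in
    Σ*α1 Σ*α2 ... Σ*αi (None if sigma has no such prefix)."""
    pos = 0
    for a in alphas:
        j = sigma.find(a, pos)
        if j < 0:
            return None
        pos = j + len(a)
    return pos

def infer_pattern(examples, alphabet, c):
    # Sketch of Step (2): grow the list α1, α2, ... until some example is
    # fully consumed by its greedy prefix, i.e. lies in X_{i-1}.
    alphas = []
    while True:
        ends = [greedy_end(alphas, s) for s in examples]
        if any(e == len(s) for e, s in zip(ends, examples) if e is not None):
            return alphas
        # Vote for the block β ∈ Σ^c that follows the X_{i-1}-prefix.
        votes = Counter(s[e:e + c] for e, s in zip(ends, examples)
                        if e is not None and e + c <= len(s))
        best = max(("".join(b) for b in product(alphabet, repeat=c)),
                   key=lambda b: votes[b])
        alphas.append(best)

# Toy run: target π = x0 ab x1 ba x2 over Σ = {a, b} with c = 2.
examples = ["abba", "aabbba", "ababba", "babbaabba", "abbaab"]
assert infer_pattern(examples, "ab", 2) == ["ab", "ba"]
```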

Note that since the shortest example is strictly shorter than c ∗ n it holds
that n ≥ 1 . Furthermore, if π = x0 , then the probability that a string drawn is
λ is at least 1/pol(0) . A lower bound for this is 1/(2 ∗ |Σ|^c ∗ f (n) ∗ pol(f (n))) ,
whatever n is, due to the fact that pol is monotonically increasing. Thus λ
appears with probability at least 1 − δ/n in the set An and thus in the set A . So the
algorithm is correct for the case that π = x0 .
It remains to consider the case where π is of the form x0 α1 x1 α2 x2 . . . αm xm
for some m ≥ 1 where all αi are in Σ c .

Claim. Suppose any pattern π = x0 α1 x1 α2 x2 . . . αm xm ∈ regc^m . Furthermore, let


πi−1 = x0 α1 x1 ...αi−1 xi−1 . Let the sets Yi,β , Xi be as defined in the algorithm
and let C(i, β, h) be the cardinality of Yi,β ∩ Lang(π) ∩ Σ h .
Then, for all h > 0 and all β ∈ Σ c \ {αi } , we have C(i, β, h) ≤ |Σ| ∗
C(i, αi , h − 1) ≤ C(i, αi , h) .

Proof. Let σ ∈ Yi,β ∩ Lang(π) . Note that σ has a unique prefix σi ∈ Xi−1 .


Furthermore, there exist s ∈ Σ , η, τ ∈ Σ ∗ such that
(i) σ = σi βsητ and
(ii) βsη is the shortest possible string such that βsη ∈ Σ ∗ αi .
The existence of s is due to the fact that β ≠ αi and β, αi both have
the length c . So the position of αi in σ must be at least one symbol behind
the one of β . If the difference is more than a symbol, η is used to take these
additional symbols.
Now consider the mapping t from Lang(π) ∩ Yi,β to Lang(π) ∩ Yi,αi which
replaces βs in the above representation of σ by αi – thus t(σ) = σi αi ητ . The
mapping t is |Σ| -to- 1 since it replaces the constant β by αi and erases s
(the information about which element of Σ the value s was is lost).
Clearly, σi but no proper prefix of σi is in Xi−1 . So σi αi is in Xi−1 αi . The
positions of αi+1 , . . . , αm in σ lie in the part covered by τ , since σi βsη

is the shortest prefix of σ generated by πi−1 αi . Since πi−1 generates σi and


xi αi+1 xi+1 ...αm xm generates ητ , it follows that π generates t(σ) . Hence,
t(σ) ∈ Lang(π) . Furthermore, t(σ) ∈ Σ h−1 since the mapping t omits one
symbol. Also, clearly t(σ) ∈ Xi−1 αi Σ ∗ = Yi,αi . Thus, for β ≠ αi , β ∈ Σ c , it
holds that C(i, β, h) ≤ |Σ| ∗ C(i, αi , h − 1) . By combining with Proposition 1,
C(i, αi , h) ≥ |Σ| ∗ C(i, αi , h − 1) ≥ C(i, β, h) .

Claim. If m ≥ i then there is a length h ≤ f (m) such that
C(i, αi , h) ≥ C(i, β, h) + |Σ|^h / (2 ∗ |Σ|^c ∗ f (m))
for all β ∈ Σ c \ {αi } . In particular,
prob(Yi,β ∩ Lang(π)) + 1/(2 ∗ |Σ|^c ∗ pol(f (m)) ∗ f (m)) ≤ prob(Yi,αi ∩ Lang(π)) .

Proof. Let D(i, β, h) = C(i, β, h)/|Σ|^h , for all h and β ∈ Σ c . Proposition 1 and
Claim 3 give that

D(i, β, h) ≤ D(i, αi , h − 1) ≤ D(i, αi , h).

Since every string in Lang(π) is in some set Yi,β , it holds that D(i, αi , f (m)) ≥
1/(2 ∗ |Σ|^c) . Furthermore, D(i, αi , h) = 0 for all h < c since m > 0 and π does
not generate the empty string. Thus there is an h ∈ {1, 2, . . . , f (m)} with
D(i, αi , h) − D(i, αi , h − 1) ≥ 1/(2 ∗ |Σ|^c ∗ f (m)) .
For this h , it holds that
D(i, αi , h) ≥ D(i, β, h) + 1/(2 ∗ |Σ|^c ∗ f (m)) .

The second part of the claim follows, by noting that


prob(Σ h ) ≥ 1/pol(h) ≥ 1/pol(f (m)) .

We now show that the learner presented above indeed probabilistically ex-
actly learns Lang(π) , for π ∈ regc .
A loop (Step (2)) invariant is that with probability at least 1 − δ ∗ (i − 1)/n , the
pattern πi−1 is a prefix of the desired pattern π . This certainly holds before
entering Step (2) for the first time.
Case 1. i ∈ {1, 2, ..., m} .

By assumption, i ≤ m and πi−1 is with probability at least 1 − δ ∗ (i − 1)/n a
prefix of π , that is, α1 , . . . , αi−1 are selected correctly.
Since αi exists and every string generated by π is in Xi−1 Σ ∗ αi Σ ∗ ,
no element of Lang(π) and thus no element of A is in Xi−1 , and the
algorithm does not stop too early.
If β = αi and β′ ≠ αi , then
prob(Yi,β ∩ Lang(π)) ≥ prob(Yi,β′ ∩ Lang(π)) + 1/(2 ∗ |Σ|^c ∗ f (m) ∗ pol(f (m))) ,

by Claim 3. By Lemma 2, αi is identified correctly with probability at


least 1 − δ/n from the data in Ai . It follows that the body of the loop
in Step (2) is executed correctly with probability at least 1 − δ/n and
the loop-invariant is preserved.

Case 2. i = m + 1 .

By Step (1) of the algorithm, the shortest example is strictly shorter


than c ∗ n and of length at least c ∗ m by construction. Thus, we already know
m < n.
With probability at least 1 − δ ∗ (n − 1)/n the previous loops in Step (2) have
gone through successfully and πm = π . Consider the mapping t which
omits from every string the last symbol. Now σ ∈ Xm iff σ ∈ Lang(π)
and t(σ) ∉ Lang(π) . Let D(π, h) be the weighted number of strings
generated by π of length h , that is, D(π, h) = |Σ h ∩ Lang(π)| / |Σ|^h . Since
D(π, f (m)) ≥ 1/2 and D(π, 0) = 0 , there is an h ∈ {1, 2, . . . , f (m)} such
that
D(π, h) − D(π, h − 1) ≥ 1/(2 ∗ f (m)) ≥ 1/(2 ∗ |Σ|^c ∗ f (n)) .
Note that h ≤ f (n) since f is increasing. It follows that

prob(Xm ) ≥ 1/(2 ∗ |Σ|^c ∗ f (n) ∗ pol(f (n)))

and thus with probability at least 1 − δ/n a string from Xm is in Am ,


and in particular in A (by Lemma 2). Thus the algorithm terminates
after going through Step (2) m times with the correct output with
probability at least 1 − δ .
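As an aside, the density D(π, h) = |Σh ∩ Lang(π)|/|Σ|h used above is easy to check by brute force on toy instances. The sketch below is purely illustrative and makes an assumption the paper does not: it represents a pattern's language by an equivalent regular expression (here the hypothetical pattern x1 a b x2, whose language over Σ = {a, b} is `.*ab.*`).

```python
import re
from itertools import product

def density(lang_regex, sigma, h):
    """D(pi, h): fraction of the |sigma|**h strings of length h
    that belong to Lang(pi), computed by exhaustive enumeration."""
    hits = sum(1 for t in product(sigma, repeat=h)
               if re.fullmatch(lang_regex, "".join(t)))
    return hits / len(sigma) ** h

# Hypothetical pattern x1 a b x2 over {a, b}: D(pi, 0) = 0 and D grows with h.
print(density(r".*ab.*", "ab", 0))   # 0.0
print(density(r".*ab.*", "ab", 3))   # 0.5  (4 of the 8 strings contain "ab")
```

The enumeration is exponential in h, so this is only usable for the kind of small sanity check done here.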
To get a polynomial time bound for the learner, we note the following. It is easy to show that there is a polynomial q(m, 1/δ′) which, with sufficiently high probability (1 − δ′, for any fixed δ′), bounds the parameter n of the learning algorithm. Thus, with probability at least 1 − δ′ − δ, the whole algorithm is successful in time and example-number polynomial in m, 1/δ, 1/δ′. Thus, for
Learning a Subclass of Regular Patterns in Polynomial Time 245
any given δ′′, by choosing δ′ = δ = δ′′/2, one can get the desired polynomial time algorithm.
We hope in the future (not as part of the present paper) to run our algorithm on molecular biology data to see if it can quickly provide useful answers.
References
1. D. Angluin. Finding patterns common to a set of strings. Journal of Computer
and System Sciences, 21:46–62, 1980.
2. S. Arikawa, T. Shinohara, and A. Yamamoto. Learning elementary formal systems.
Theoretical Computer Science, 95:97–113, 1992.
3. T. Shinohara and H. Arimura. Inductive inference of unbounded unions of pattern
languages from positive data. Theoretical Computer Science, 241:191–209, 2000.
4. I. Bratko and S. Muggleton. Applications of inductive logic programming. Com-
munications of the ACM, 1995.
5. A. Brāzma, E. Ukkonen, and J. Vilo. Discovering unbounded unions of regular
pattern languages from positive examples. In Proceedings of the 7th International
Symposium on Algorithms and Computation (ISAAC’96), volume 1178 of Lecture
Notes in Computer Science, pages 95–104, Springer, 1996.
6. J. Case, S. Jain, S. Kaufmann, A. Sharma, and F. Stephan. Predictive learning
models for concept drift. Theoretical Computer Science, 268:323–349, 2001. Special
Issue for ALT’98.
7. J. Case, S. Jain, S. Lange, and T. Zeugmann. Incremental concept learning for
bounded data mining. Information and Computation, 152(1):74–110, 1999.
8. T. Erlebach, P. Rossmanith, H. Stadtherr, A. Steger, and T. Zeugmann. Learning
one-variable pattern languages very efficiently on average, in parallel, and by asking
queries. Theoretical Computer Science, 261(1):119–156, 2001.
9. E.M. Gold. Language identification in the limit. Information and Control, 10:447–474, 1967.
10. M. Kearns and L. Pitt. A polynomial-time algorithm for learning k -variable
pattern languages from examples. In R. Rivest, D. Haussler and M. K. War-
muth (Eds.), Proceedings of the Second Annual ACM Workshop on Computational
Learning Theory, pages 57–71, Morgan Kaufmann Publishers Inc., 1989.
11. P. Kilpeläinen, H. Mannila, and E. Ukkonen. MDL learning of unions of simple
pattern languages from positive examples. In Paul Vitányi, editor, Second European
Conference on Computational Learning Theory, volume 904 of Lecture Notes in
Artificial Intelligence, pages 252–260. Springer, 1995.
12. S. Lange and R. Wiehagen. Polynomial time inference of arbitrary pattern lan-
guages. New Generation Computing, 8:361–370, 1991.
13. N. Lavrač and S. Džeroski. Inductive Logic Programming: Techniques and Appli-
cations. Ellis Horwood, 1994.
14. S. Matsumoto and A. Shinohara. Learnability of subsequence languages. Informa-
tion Modeling and Knowledge Bases VIII, pages 335–344, IOS Press, 1997.
15. T. Mitchell. Machine Learning. McGraw Hill, 1997.
16. S. Miyano, A. Shinohara and T. Shinohara. Polynomial-time learning of elementary
formal systems. New Generation Computing, 18:217–242, 2000.
17. S. Muggleton and L. De Raedt. Inductive logic programming: Theory and methods.
Journal of Logic Programming, 19/20:669–679, 1994.
18. R. Nix. Editing by examples. Technical Report 280, Department of Computer
Science, Yale University, New Haven, CT, USA, 1983.
19. D. Reidenbach. A Negative Result on Inductive Inference of Extended Pattern
Languages. In N. Cesa-Bianchi and M. Numao, editors, Algorithmic Learning
Theory, 13th International Conference, ALT 2002, Lübeck, Germany, November
2002, Proceedings, pages 308–320. Springer, 2002.
20. R. Reischuk and T. Zeugmann. Learning one-variable pattern languages in linear
average time. In Proceedings of the Eleventh Annual Conference on Computational
Learning Theory, pages 198–208. ACM Press, 1998.
21. P. Rossmanith and T. Zeugmann. Stochastic finite learning of the pattern languages. Machine Learning, 44(1/2):67–91, 2001. Special Issue on Automata Induction, Grammar Inference, and Language Acquisition.
22. A. Salomaa. Patterns (The Formal Language Theory Column). EATCS Bulletin,
54:46–62, 1994.
23. A. Salomaa. Return to patterns (The Formal Language Theory Column). EATCS
Bulletin, 55:144–157, 1994.
24. R. Schapire. Pattern languages are not learnable. In M.A. Fulk and J. Case, edi-
tors, Proceedings, 3rd Annual ACM Workshop on Computational Learning Theory,
pages 122–129, Morgan Kaufmann Publishers, Inc., 1990.
25. S. Shimozono, A. Shinohara, T. Shinohara, S. Miyano, S. Kuhara, and S. Arikawa.
Knowledge acquisition from amino acid sequences by machine learning system
BONSAI. Trans. Information Processing Society of Japan, 35:2009–2018, 1994.
26. T. Shinohara. Polynomial time inference of extended regular pattern languages.
In RIMS Symposia on Software Science and Engineering, Kyoto, Japan, volume
147 of Lecture Notes in Computer Science, pages 115–127. Springer-Verlag, 1982.
27. T. Shinohara. Inferring unions of two pattern languages. Bulletin of Informatics
and Cybernetics, 20:83–88, 1983.
28. T. Shinohara and S. Arikawa. Learning data entry systems: An application of
inductive inference of pattern languages. Research Report 102, Research Institute
of Fundamental Information Science, Kyushu University, 1983.
29. T. Shinohara and S. Arikawa. Pattern inference. In Klaus P. Jantke and Steffen
Lange, editors, Algorithmic Learning for Knowledge-Based Systems, volume 961 of
Lecture Notes in Artificial Intelligence, pages 259–291. Springer, 1995.
30. R. Smullyan. Theory of Formal Systems, Annals of Mathematical Studies, No. 47.
Princeton, NJ, 1961.
31. L.G. Valiant. A theory of the learnable. Communications of the ACM 27:1134–
1142, 1984.
32. K. Wright. Identification of unions of languages drawn from an identifiable class.
In R. Rivest, D. Haussler, and M.K. Warmuth, editors, Proceedings of the Sec-
ond Annual Workshop on Computational Learning Theory, pages 328–333. Morgan
Kaufmann Publishers, Inc., 1989.
33. T. Zeugmann. Lange and Wiehagen’s pattern language learning algorithm: An
average-case analysis with respect to its total learning time. Annals of Mathematics
and Artificial Intelligence, 23(1–2):117–145, 1998.
Identification with Probability One of Stochastic
Deterministic Linear Languages

Colin de la Higuera1 and Jose Oncina2
1
EURISE, Université de Saint-Etienne, 23 rue du Docteur Paul Michelon,
42023 Saint-Etienne, France
cdlh@[Link],
[Link]
2
Departamento de Lenguajes y Sistemas Informáticos,
Universidad de Alicante, Ap.99. E-03080 Alicante, Spain
oncina@[Link],
[Link]

Abstract. Learning context-free grammars is generally considered a
very hard task. This is even more the case when learning has to be
done from positive examples only. In this context one possibility is to
learn stochastic context-free grammars, by making the implicit assump-
tion that the distribution of the examples is given by such an object.
Nevertheless this is still a hard task for which no algorithm is known.
We use recent results to introduce a proper subclass of linear grammars,
called deterministic linear grammars, for which we prove that a small
canonical form can be found. This has been a successful condition for a
learning algorithm to be possible. We propose an algorithm for this class
of grammars and we prove that our algorithm works in polynomial time,
and structurally converges to the target in the paradigm of identification
in the limit with probability 1. Although this does not ensure that only a
polynomial size sample is necessary for learning to be possible, we argue
that the criterion means that no added (hidden) bias is present.

1 Introduction

Context-free grammars are known to have greater modeling capacity than regular grammars or finite state automata. Learning these grammars is also
harder but considered an important and challenging task. Yet without external
help such as a knowledge of the structure of the strings [Sak92] only clever but
limited heuristics have been proposed [LS00,NMW97].
When no counter-examples exist, or when the actual problem is that of building a language model, stochastic context-free grammars have been proposed. In
a number of applications (computational biology [SBH+ 94] and speech recog-
nition [WA02] are just two typical examples), it is speculated that success will

The author thanks the Generalitat Valenciana for partial support of this work
through project CETIDIB/2002/173.
R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 247–258, 2003.
© Springer-Verlag Berlin Heidelberg 2003
depend on being able to replace finite state models such as Hidden Markov Mod-
els by stochastic context-free grammars. Yet the problem of learning this type
of grammar from strings has rarely been addressed. The usual way of dealing
with the problem still consists in first learning a structure, and then estimating
the probabilities [Bak79].
In the more theoretical setting of learning from both examples and counter-examples, classes of grammars that are more general than the regular grammars, but restricted to cases where both determinism and linearity apply, have been studied [dlHO02].
On the other hand, learning (deterministic) regular stochastic grammars has
received a lot of attention over the past 10 years. A well known algorithm for
this task is ALERGIA [CO94], which has been improved by different authors
[YLT00,CO99], and applied to different tasks [WA02].
We synthesize in this paper both types of results and propose a novel class of
stochastic languages that we call stochastic deterministic linear languages. We
prove that each language of the class admits an equivalence relation of finite
index, thus leading to a canonical normal form. We propose an algorithm that
works in polynomial time with respect to the learning data. It can identify with
probability one any language in the class.
In section 2 the necessary definitions are given. We prove in section 3 the
existence of a small normal form, and give in section 4 a learning algorithm that
can learn grammars in normal form.

2 Definitions

2.1 Languages and Grammars

An alphabet Σ is a finite nonempty set of symbols. Σ ∗ denotes the set of all
finite strings over Σ. A language L over Σ is a subset of Σ ∗ . In the following,
unless stated otherwise, symbols are indicated by a, b, c, . . . , strings by u, v, . . . ,
and the empty string by λ. The length of a string u will be denoted |u|.
Let u, v ∈ Σ ∗ , u−1 v = w such that v = uw (undefined if u is not a prefix of
v) and uv −1 = w such that u = wv (undefined if v is not a suffix of u). Let L
be a language and u ∈ Σ ∗ , u−1 L = {v : uv ∈ L} and Lu−1 = {v : vu ∈ L}.
Let L be a language, the prefix set is Pref(L) = {x : xy ∈ L}. The longest
common suffix (lcs(L)) of L is the longest string u such that (Lu−1 )u = L.
A context-free grammar G is a quadruple (Σ, V, R, S) where Σ is a finite
alphabet (of terminal symbols), V is a finite alphabet (of variables or non-
terminals), R ⊂ V ×(Σ ∪V )∗ is a finite set of production rules, and S(∈ V ) is the
starting symbol. We will denote uT v → uwv when (T, w) ∈ R; →∗ is the reflexive and transitive closure of →. If there exist u0, . . . , uk such that u0 → · · · → uk we will write u0 →k uk. We denote by LG(T) the language {w ∈ Σ∗ : T →∗ w}.
Two grammars are equivalent if they generate the same language. A context-free
grammar G = (Σ, V, R, S) is linear if R ⊂ V × (Σ ∗ V Σ ∗ ∪ Σ ∗ ).
2.2 Stochastic Languages
A stochastic language L over Σ is defined by a probability density function over Σ∗ giving the probability p(w|L) that the string w ∈ Σ∗ appears in the language. To be consistent, a necessary condition is that Σx∈Σ∗ p(x|L) = 1.
When convenient, we are going to represent a stochastic language as a set of
pairs: L = {(u, p(u|L)) : p(u|L) > 0}. Consequently (u, pu ) ∈ L =⇒ p(u|L) > 0.
Also to avoid unnecessary notations we will allow the empty set ∅ to be a
stochastic language (paired with an arbitrary function).
The probability of any subset X ⊆ Σ∗ is given by

p(X|L) = Σu∈X p(u|L).

Let L be a stochastic language and u ∈ Σ∗:

Pref(L) = {u : (uv, p) ∈ L},
Sf(L) = {u : (vu, p) ∈ L},
uL = {(uv, p) : (v, p) ∈ L},
Lu = {(vu, p) : (v, p) ∈ L},
u−1 L = {(v, pv) : (uv, p(uΣ∗|L) pv) ∈ L},
Lu−1 = {(v, pv) : (vu, p(Σ∗u|L) pv) ∈ L}.

Note that the expressions for u−1 L and Lu−1 are equivalent to {(v, pv) : pv = p(uv|L)/p(uΣ∗|L)} and {(v, pv) : pv = p(vu|L)/p(Σ∗u|L)} respectively, but avoid division-by-zero problems.
Of course, if u is a common prefix (resp. a common suffix) of L then p(uΣ∗|L) = 1 (resp. p(Σ∗u|L) = 1) and u−1 L = {(v, pv) : (uv, pv) ∈ L} (resp. Lu−1 = {(v, pv) : (vu, pv) ∈ L}).
We denote the longest common suffix reduction of a stochastic language L
by L ↓ = {(u, p) : z = lcs(L), (uz, p) ∈ L}, where lcs(L) = lcs{u : (u, p) ∈ L}.
Note that if L is a stochastic language then, for every u, u−1 L, Lu−1 and L ↓ are also stochastic languages.
A stochastic deterministic linear (SDL) grammar G = (Σ, V, R, S, p) consists of Σ, V, S as for context-free grammars, a finite set R of derivation rules, each of the form X → aY w or X → λ, such that X → aY w, X → aZv ∈ R ⇒ Y = Z ∧ w = v, and a real function p : R → ]0, 1] giving the probability of each rule.
The probability p(S →∗ w) that the grammar G generates the string w is defined recursively as

p(X →∗ avw) = p(X → aY w) p(Y →∗ v),

where Y is the only variable such that X → aY w ∈ R (if no such variable exists, then p(X → aY w) = 0 is assumed). It can be shown that if ∀A ∈ V, Σα p(A → α) = 1 and G does not contain useless symbols, then G
defines a stochastic deterministic linear language LG through the probabilities p(w|LG) = p(S →∗ w).
Let X be a variable in the SDL grammar G = (Σ, V, R, S, p); then LG(X) = {(u, pu) : p(X →∗ u) = pu}.
A non-stochastic version of the above definition is studied in [dlHO02]: it corresponds to a very general class of linear grammars that includes, for instance, grammars for all regular languages, palindrome languages and {aⁿbⁿ : n ∈ ℕ}.
In the same paper a more general form of deterministic linear grammars was
proposed, equivalent to the form we use to support our grammars here. Extension
of these results to general deterministic linear grammars will not be done in this
paper.

3 A Canonical Form for Stochastic Deterministic Linear Grammars

For a class of stochastic languages to be identifiable in the limit with probability one, a reasonable assumption is that there exists some small canonical form for
any language representable in the class. We prove in this section that such is
indeed the case for stochastic deterministic linear grammars.
The purpose of this section is to reach a computable normal form for SDL
grammars. For this we first define a normal form for these grammars (called
advanced as the longest common suffixes appear as soon as possible), and then
construct such a grammar from any deterministic linear language.

Definition 1 (Advanced form). A stochastic deterministic linear grammar G = (Σ, V, R, S, p) is in advanced form if:

1. ∀(T, aT′w) ∈ R, w = lcs(a−1 LG(T));
2. all non-terminal symbols are accessible: ∀T ∈ V ∃u, v ∈ Σ∗ : S →∗ uT v, and useful: ∀T ∈ V, LG(T) ≠ ∅;
3. ∀T, T′ ∈ V, LG(T) = LG(T′) ⇒ T = T′.

We build the canonical form from the language so as to ensure uniqueness:

Definition 2 (Common suffix-free language equivalence). Given a stochastic language L we define recursively the common suffix-free languages CSFL(·), and the associated equivalence relation, as follows:

CSFL(λ) = L
CSFL(xa) = (a−1 CSFL(x)) ↓
x ≡L y ⇐⇒ CSFL(x) = CSFL(y)
Proposition 1. The equivalence relation ≡L has a finite index.

Proof. See the appendix.
Definition 3 (A canonical grammar). Given any stochastic deterministic linear language L, the canonical grammar associated with L is GL = (Σ, V, R, SCSFL(λ), p) where:

V = {SCSFL(x) : CSFL(x) ≠ ∅}
R = {SCSFL(x) → a SCSFL(xa) lcs(a−1 CSFL(x)) : CSFL(xa) ≠ ∅} ∪ {SCSFL(x) → λ : λ ∈ CSFL(x)}
p(SCSFL(x) → a SCSFL(xa) w) = p(aΣ∗w | CSFL(x)) = p(aΣ∗ | CSFL(x))
p(SCSFL(x) → λ) = p(λ | CSFL(x))

Proposition 1 allows this construction to terminate. The correctness of the construction is a consequence of:

Proposition 2. Let L be a SDL language and let GL = (Σ, V, R, S, p) be its associated canonical grammar. Then L = LGL(S).

Proof. See the appendix.

Theorem 1. Given a SDL grammar G = (Σ, VG, RG, SG, pG), let GL = (Σ, VGL, RGL, SGL, pGL) be the canonical grammar that generates L = LG(SG). Then:
1. GL is advanced;
2. |VGL| ≤ |VG| + 1.

Proof. We prove that GL is advanced by showing that conditions 1 to 3 of Definition 1 hold. The proof of the second part is a consequence of Lemma 5 and Proposition 4; both results are given and proved in the appendix: they state that the number of classes of ≡L, and thus the number of variables in the canonical grammar, is bounded in terms of the number of non-terminals of the original grammar.

4 Learning SDL Grammars
As SDL languages admit a small canonical form it will be sufficient to have an
algorithm that can identify a grammar in this type of canonical form.
We are going to divide the task of learning in two steps:
1. Identify the topology of the grammar, that is, the rules of type A → aBv, without the probabilities.
2. Add the A → λ type rules and assign the probabilities.
The second step can be done by counting the use of the different rules while
parsing a sample (maximum likelihood estimation); alternatively, as this does
not achieve identification, techniques based on Stern-Brocot trees can be used
in a similar way as in [dlHT00]. Hence we are going to concentrate on the first
step.
Definition 4. Let L be a SDL language, and ≼ a length-lexicographic order relation over Σ∗. The shortest prefix set of L is SpL = {x ∈ Pref(L) : CSFL(x) ≠ ∅ ∧ (y ≡L x ⇒ x ≼ y)}.
Note that, in a canonical grammar, we have a one-to-one relation between strings in Sp and non-terminals of the grammar. We shall thus use the strings in Sp as identifiers for the non-terminal symbols. To describe the algorithm we shall imagine that we have access to an unlimited oracle that knows the language L and to which we can address the following queries:

nextL(x) = {xa ∈ Pref(L) : a ∈ Σ}
equivL(x, y) ⇐⇒ x ≡L y
rightL(xa) = lcs(a−1 CSFL(x))

Algorithm 1 visits the prefixes of the language L in length-lexicographic order, and constructs the canonical grammar corresponding to Definition 3. If a prefix xa is visited and no previous equivalent non-terminal has been found (and placed in Sp), this prefix is added to Sp as a new non-terminal and the corresponding rule is added to the grammar. If there exists an equivalent non-terminal y in Sp, then the corresponding rule is added, but the strings of which xa is a prefix will not be visited (they will not be added to W). When the algorithm finishes, Sp contains all the shortest prefixes of the language.
Algorithm 1 is clearly polynomial in the size of set W , provided the auxiliary
functions are polynomial.
A stochastic sample S of the stochastic language L is an infinite sequence of
strings generated according to the probability distribution p(w|L). We denote
with Sn the sequence of the n first strings (not necessarily different) in S, which
will be used as input for the algorithm. The number of occurrences in Sn of the string x will be denoted by cn(x) and, for any subset X ⊆ Σ∗, cn(X) = Σx∈X cn(x). Note that in the context of the algorithm, nextL(x), rightL(xa) and equivL(xa, y) are only computed when x and y are in SpL. Therefore the size of W is bounded by the number of prefixes of Sn. In order to use Algorithm 1 with a sample Sn instead of an oracle with access to the whole language L,

Algorithm 1 Computing G using functions next, right and equiv
Require: functions next, right and equiv, language L
Ensure: L(G) = L with G = (Σ, V, R, Sλ)
  Sp = {λ}; V = {Sλ}
  W = nextL(λ)
  while W ≠ ∅ do
    xa = min W
    W = W − {xa}
    if ∃y ∈ Sp : equivL(xa, y) then
      add Sx → a Sy rightL(xa) to R
    else
      Sp = Sp ∪ {xa}; V = V ∪ {Sxa}
      W = W ∪ nextL(xa)
      add Sx → a Sxa rightL(xa) to R
    end if
  end while
the three functions must be implemented as functions of Sn (nextSn(·), rightSn(·) and equivSn(·, ·)) rather than of L, so that they give the same results as nextL(x), rightL(xa) and equivL(xa, y) when x, y ∈ SpL and n tends to infinity.
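Algorithm 1 itself is a short breadth-first loop. A minimal executable sketch, parameterized by the three query functions (whether they come from an oracle or from a sample); the naming and the rule encoding (x, a, y, w) for Sx → a Sy w are our own:

```python
def learn_topology(next_fn, right_fn, equiv_fn):
    """Algorithm 1: build Sp and the type A -> aBv rules of the canonical
    grammar, visiting prefixes in length-lexicographic order."""
    sp = [""]                                   # Sp = {lambda}
    rules = []                                  # (x, a, y, w) encodes S_x -> a S_y w
    W = set(next_fn(""))
    while W:
        xa = min(W, key=lambda s: (len(s), s))  # length-lexicographic minimum
        W.remove(xa)
        x, a = xa[:-1], xa[-1]
        y = next((s for s in sp if equiv_fn(xa, s)), None)
        if y is None:                           # no equivalent non-terminal yet
            sp.append(xa)
            W |= set(next_fn(xa))               # expand only new non-terminals
            y = xa
        rules.append((x, a, y, right_fn(xa)))
    return sp, rules
```

For instance, for a hypothetical language over {a} in which every a-extension is equivalent to λ, the oracle answers make the loop produce the single self-looping rule Sλ → a Sλ.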
In order to simplify notations we introduce:
Definition 5. Let L be a SDL language. Then, for all x such that CSFL(x) ≠ ∅:

tailL(x) = lcs(x−1 L) if x ≠ λ,    tailL(λ) = λ.

A slightly different function tail that works over sequences is now introduced.
This function will be used to define a function right to work over sequences.
Definition 6. Let Sn be a finite sequence of strings. Then, for all x ∈ Pref(Sn):

tailSn(x) = lcs(x−1 Sn) if x ≠ λ,    tailSn(λ) = λ.

Lemma 1. Let GL = (Σ, V, R, S, p) be the canonical grammar of a SDL language L. Then, ∀x : CSFL(x) ≠ ∅,

lcs(a−1 CSFL(x)) = (tailL(xa))(tailL(x))−1.

Proof. The proof is similar to that of Lemma 4(1) of [dlHO02].


Definition 7.

nextSn(x) = {xa : ∃xay ∈ Sn}
rightSn(xa) = tailSn(xa) tailSn(x)−1
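Concretely, the sample-based functions reduce to longest-common-suffix computations on the sample. A sketch, with the finite sample as a list of strings (the function names are ours):

```python
import os

def lcs(strings):
    """Longest common suffix of a set of strings."""
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]

def tail(sample, x):
    """tail_{S_n}(x) = lcs(x^{-1} S_n) for x != lambda, and lambda otherwise."""
    if x == "":
        return ""
    return lcs([w[len(x):] for w in sample if w.startswith(x)])

def next_fn(sample, x):
    """next_{S_n}(x): one-letter extensions of x occurring as prefixes in S_n."""
    return {w[: len(x) + 1] for w in sample if w.startswith(x) and len(w) > len(x)}

def right_fn(sample, xa):
    """right_{S_n}(xa) = tail(xa) with the suffix tail(x) removed (cf. Lemma 1)."""
    t_xa, t_x = tail(sample, xa), tail(sample, xa[:-1])
    return t_xa[: len(t_xa) - len(t_x)] if t_x else t_xa

sample = ["ab", "aab", "aaab"]
print(next_fn(sample, ""))      # {'a'}
print(right_fn(sample, "a"))    # 'b'
```

Both functions run in time polynomial in the size of the sample, as the text requires.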
It should be noticed that the above definition ensures that the functions nextSn and rightSn can be computed in time polynomial in the size of Sn. We now prove that this definition allows the functions nextSn and rightSn to converge in the limit to the intended functions nextL and rightL:
Lemma 2. Let L be a SDL language. For each sample Sn of L containing a set D ⊆ {x : (x, p) ∈ L} such that:
1. ∀x ∈ SpL ∀a ∈ Σ : xa ∈ Pref(L) ⇒ ∃xaw ∈ D;
2. ∀x ∈ SpL ∀a ∈ Σ : CSFL(xa) ≠ ∅ ⇒ tailD(xa) = tailL(xa);
we have, ∀x, y ∈ SpL:
1. nextSn(x) = nextL(x)
2. rightSn(xa) = rightL(xa)

Proof. Point 1 is clear by definition, and point 2 is a consequence of Lemma 1.

Lemma 3. With probability one, nextSn(x) = nextL(x) and rightSn(xa) = rightL(xa) for all x ∈ SpL, except for finitely many values of n.
Proof. Given a SDL language, there exists (at least one) set D as in Lemma 2 with non-null probability. Then, with probability one, any sufficiently large sample contains such a set D, and the above lemma yields the result.
In order to evaluate the equivalence relation equiv(x, y) ⇐⇒ x ≡L y ⇐⇒ CSFL(x) = CSFL(y) we have to check, from a finite sample Sn, whether two stochastic languages are equivalent.
To do that, instead of comparing the probabilities of each string of the sample, we are going to compare the probabilities of their prefixes. This strategy (also used in ALERGIA [CO94] and RLIPS [CO99]) makes it possible to distinguish different probabilities faster, as more information is always available about a prefix than about a whole string. It is therefore easy to establish the equivalence between the various definitions:
the various definitions:
Proposition 3. Two stochastic languages L1 and L2 are equal iff

p(aΣ∗|w−1 L1) = p(aΣ∗|w−1 L2)  ∀a ∈ Σ, ∀w ∈ Σ∗.

Proof. L1 = L2 ⇒ ∀w ∈ Σ∗ : p(w|L1) = p(w|L2) ⇒ w−1 L1 = w−1 L2 ⇒ ∀z ∈ Σ∗ : p(z|w−1 L1) = p(z|w−1 L2).
Conversely, L1 ≠ L2 ⇒ ∃w ∈ Σ∗ : p(w|L1) ≠ p(w|L2). Let w = az; as p(az|L) = p(aΣ∗|L) p(z|a−1 L), then p(aΣ∗|L1) p(z|a−1 L1) ≠ p(aΣ∗|L2) p(z|a−1 L2). Now we have two cases:
1. p(aΣ∗|L1) ≠ p(aΣ∗|L2), and the proposition is shown.
2. p(aΣ∗|L1) = p(aΣ∗|L2); then p(z|a−1 L1) ≠ p(z|a−1 L2).
This can be applied recursively until w = λ. In that case we have that ∃w ∈ Σ∗ : p(w|L1) ≠ p(w|L2) ∧ p(wΣ∗|L1) = p(wΣ∗|L2). But since Σx∈Σ∗ p(x|Li) = 1, it follows that ∃a ∈ Σ such that p(waΣ∗|L1) ≠ p(waΣ∗|L2). Thus p(aΣ∗|w−1 L1) ≠ p(aΣ∗|w−1 L2).
As a consequence,

x ≡L y ⇐⇒ p(aΣ∗|(xz)−1 L) = p(aΣ∗|(yz)−1 L)  ∀a ∈ Σ, ∀z ∈ Σ∗.

If instead of the whole language we have a finite sample Sn, we are going to estimate the probabilities by counting the appearances of the strings and comparing them using a confidence range.
Definition 8. Let f/n be the observed frequency of a Bernoulli variable of probability p. We denote by εα(n) a function such that p(|f/n − p| < εα(n)) > 1 − α (the Hoeffding bound yields one such function).

Lemma 4. Let f1/n1 and f2/n2 be two observed frequencies of a Bernoulli variable of probability p. Then:

p(|f1/n1 − f2/n2| < εα(n1) + εα(n2)) > (1 − α)².
Proof. p(|f1/n1 − f2/n2| < εα(n1) + εα(n2)) ≥ p(|f1/n1 − p| + |f2/n2 − p| < εα(n1) + εα(n2)) ≥ p(|f1/n1 − p| < εα(n1) ∧ |f2/n2 − p| < εα(n2)) > (1 − α)².

Definition 9.
equivSn(x, y) ⇐⇒ ∀z ∈ Σ∗ : xz ∈ Pref(Sn) ∧ yz ∈ Pref(Sn), ∀a ∈ Σ:

|cn(xzaΣ∗)/cn(xzΣ∗) − cn(yzaΣ∗)/cn(yzΣ∗)| < εα(cn(xzΣ∗)) + εα(cn(yzΣ∗)) ∧
|cn(xz)/cn(xzΣ∗) − cn(yz)/cn(yzΣ∗)| < εα(cn(xzΣ∗)) + εα(cn(yzΣ∗))

This does not correspond to an infinite number of tests, but only to those for which xz or yz is a prefix in Sn. Each of these tests returns the correct answer with probability greater than (1 − α)². Because the number of checks grows with |Pref(Sn)|, we will allow the parameter α to depend on n.
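As a concrete instantiation (our assumption; the paper leaves εα abstract), the Hoeffding bound gives εα(n) = sqrt(ln(2/α)/(2n)), and each individual comparison of Definition 9 becomes a two-sided frequency test in the spirit of Lemma 4:

```python
import math

def eps(alpha, n):
    """Hoeffding half-width: P(|f/n - p| >= eps(alpha, n)) <= alpha."""
    return math.sqrt(math.log(2.0 / alpha) / (2.0 * n))

def compatible(f1, n1, f2, n2, alpha):
    """One comparison of Definition 9: accept that both observed frequencies
    may come from the same Bernoulli parameter (Lemma 4 bounds the error)."""
    return abs(f1 / n1 - f2 / n2) < eps(alpha, n1) + eps(alpha, n2)

print(compatible(50, 100, 55, 100, 0.05))   # True: small gap, wide intervals
print(compatible(10, 100, 90, 100, 0.05))   # False: frequencies clearly differ
```

Note that the test only rejects once the gap exceeds both confidence half-widths, which is exactly why it errs with probability at most 1 − (1 − α)².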
Theorem 2. Let the parameter αn be such that Σn≥0 n·αn is finite. Then, with probability one, (x ≡L y) = equivSn(x, y) except for finitely many values of n.

Proof. In order to compute equivSn(x, y), a maximum of 2|Pref(Sn)| tests are made, each with confidence above (1 − αn)². Let An be the event that at least one of the equivalence tests fails ((x ≡L y) ≠ equivSn(x, y)) when using Sn as a sample. Then Pr(An) < 4αn|Pref(Sn)|. According to the Borel–Cantelli lemma [Fel68], if Σn≥0 Pr(An) < ∞ then, with probability one, only finitely many events An take place. As the expected size of Pref(Sn) cannot grow faster than linearly with n, it is sufficient that Σn≥1 n·αn < ∞.

5 Discussion and Conclusion

We have described a type of stochastic grammars that corresponds to a large class of languages including regular languages, palindrome languages, linear LL(1) languages and other typical linear languages such as {aⁿbⁿ : n ≥ 0}. The existence
of a canonical form for any grammar in the class is proved, and an algorithm
that can learn stochastic deterministic linear grammars is given. This algorithm
works in polynomial time and can identify the structure and the probabilities
when these are rational (see [dlHT00] for details).
It is nevertheless easy to construct a grammar for which learning is practically doomed: with high probability, not enough examples will be available to
notice that some lethal merge should not take place. A counterexample can be
constructed by simulating parity functions with a grammar. So somehow the
paradigm we are using of polynomial identification in the limit with probability
one seems too weak. But on the other hand it is intriguing to notice that the
combination of the two criteria of polynomial runtime and identification in the
limit with probability one does not seem to result in a very strong condition: it is
for instance unclear if a non-effective enumeration algorithm might also meet the required criteria. It might even be the case that the entire class of context-free
grammars may be identifiable in the limit with probability one by polynomial
algorithms.
An open problem for which in our mind an answer would be of real help for
further research in the field is that of coming up with a new learning criterion
for polynomial distribution learning. This should in a certain way better match the idea of polynomial identification with probability one.

References

[Bak79] J. K. Baker. Trainable grammars for speech recognition. In Speech Communication Papers for the 97th Meeting of the Acoustical Society of America, pages 547–550, 1979.
[CO94] R. Carrasco and J. Oncina. Learning stochastic regular grammars by means
of a state merging method. In Proceedings of ICGI’94, number 862 in LNAI,
pages 139–150. Springer Verlag, 1994.
[CO99] R. C. Carrasco and J. Oncina. Learning deterministic regular grammars
from stochastic samples in polynomial time. RAIRO (Theoretical Infor-
matics and Applications), 33(1):1–20, 1999.
[dlHO02] C. de la Higuera and J. Oncina. Learning deterministic linear languages. In
Proceedings of COLT 2002, number 2375 in LNAI, pages 185–200, Berlin,
Heidelberg, 2002. Springer-Verlag.
[dlHT00] C. de la Higuera and F. Thollard. Identification in the limit with probability one of stochastic deterministic finite automata. In Proceedings of ICGI 2000, volume 1891 of LNAI, pages 15–24. Springer-Verlag, 2000.
[Fel68] W. Feller. An Introduction to Probability Theory and Its Applications,
volume 1 and 2. John Wiley & Sons, Inc., New York, 3rd edition, 1968.
[LS00] P. Langley and S. Stromsten. Learning context-free grammars with a sim-
plicity bias. In Proceedings of ECML 2000, volume 1810 of LNCS, pages
220–228. Springer-Verlag, 2000.
[NMW97] C. Nevill-Manning and I. Witten. Identifying hierarchical structure in sequences: A linear-time algorithm. Journal of Artificial Intelligence Research, 7:67–82, 1997.
[Sak92] Y. Sakakibara. Efficient learning of context-free grammars from positive
structural examples. Information and Computation, 97:23–60, 1992.
[SBH+94] Y. Sakakibara, M. Brown, R. Hughey, I. Mian, K. Sjölander, R. Underwood, and D. Haussler. Stochastic context-free grammars for tRNA modeling. Nucleic Acids Research, 22:5112–5120, 1994.
[WA02] Y. Wang and A. Acero. Evaluation of spoken language grammar learning in the ATIS domain. In Proceedings of ICASSP, 2002.
[YLT00] M. Young-Lai and F. W. Tompa. Stochastic grammatical inference of text
database structure. Machine Learning, 40(2):111–137, 2000.

6 Appendix

Propositions from section 3 aim at establishing that a small canonical form exists
for each SDL grammar. The following proofs follow the ideas from [dlHO02].
6.1 Proof of Proposition 1
In order to prove the propositions we have to establish more definitions.
To define another equivalence relation over Σ∗, when given a stochastic deterministic linear grammar, we first associate in a unique way prefixes of strings in the language with non-terminals:
Definition 10. Let G = (Σ, V, R, S, p) be a SDL grammar. With every string x we associate the unique non-terminal [x]G = T such that S →∗ xT u; we extend LG to be a total function by setting LG([x]G) = ∅ when the non-terminal T does not exist.
We use this definition to give another equivalence relation over Σ ∗ , when
given a SDL grammar:
Definition 11. Let G = (Σ, V, R, S, p) be a SDL grammar. We define the associated common suffix-free languages CSFG(·), and the associated equivalence relation, as follows:

CSFG(λ) = LG(S)
CSFG(xa) = LG([xa]G) ↓
x ≡G y ⇐⇒ CSFG(x) = CSFG(y)

≡G is clearly an equivalence relation, in which all strings x such that [x]G is undefined are in a unique class. The following lemma establishes that ≡G has finite index when G is a stochastic deterministic linear grammar:
Lemma 5. If [x]G = [y]G, x ≠ λ and y ≠ λ, then x ≡G y. Hence, if G contains n non-terminals, ≡G has at most n + 2 classes.

The proof is straightforward. There can be at most two possible extra classes, corresponding to λ (when it is alone in its class) and to the undefined class.
Lemma 6. Let G = (Σ, V, R, S, p) be a SDL grammar. If X →∗ xY w, then

(x−1 L(X)) ↓ = L(Y) ↓.

Proof. It is enough to prove (a−1 L(X)) ↓ = L(Y) ↓ if X → aY w ∈ R, which is clear by double inclusion.

Proposition 4. Let G = (Σ, V, R, S, p) be a SDL grammar, and denote L = LG(S). Then, ∀x ∈ Σ∗, either CSFL(x) = CSFG(x) or CSFL(x) = ∅.

Proof. By induction on the length of x.
Base: x = λ; then CSFL (x) = L = CSFG (x).
Induction step: suppose the proposition is true for all strings of length up to k,
and consider a string xa of length k + 1. CSFL (xa) = (a−1 CSFL (x)) ↓ (by
definition 2). If CSFL (x) = ∅, then CSFL (xa) = ∅. If not, CSFL (x) = CSFG (x)
by the induction hypothesis, so CSFL (xa) = (a−1 CSFL (x)) ↓ = (a−1 CSFG (x)) ↓,
and there are two sub-cases:
258 C. de la Higuera and J. Oncina

– if x = λ, CSFG (x) = LG ([x]G ), so CSFL (xa) = (a−1 LG ([x]G )) ↓;
– if x ≠ λ, CSFG (x) = LG ([x]G ) ↓, so CSFL (xa) = (a−1 (LG ([x]G ) ↓)) ↓ (by
definition 11) = (a−1 LG ([x]G )) ↓.
In both cases it follows that CSFL (xa) = (a−1 LG ([x]G )) ↓ = LG ([xa]G ) ↓ (by
lemma 6) = CSFG (xa).

Corollary 1 (proof of proposition 1). Let G = (Σ, V, R, S, p) be a stochastic
deterministic linear grammar. Then ≡LG (S) has finite index.
Proof. A consequence of lemma 5 and proposition 4.

6.2 Proof of Proposition 2


To avoid extra notations, we will denote (as in definition 10) by [x] the non-
terminal corresponding to x in the associated grammar (formally SCSFL (x) or
[x]GL ).
The proof that GL generates L is established through the following more
general result (of which it is the special case x = λ):
Proposition 5. ∀x ∈ Σ ∗ , LGL ([x]) = CSFL (x).
Proof. We prove it by double inclusion.
∀x ∈ Σ ∗ , CSFL (x) ⊆ LGL ([x])
Proof by induction on the length of all strings in CSFL (x).
Base case |w| = 0 ⇒ w = λ. If (λ, p) ∈ CSFL (x), by construction of the
rules, [x] → λ and p([x] → λ) = p so (λ, p) ∈ LGL ([x]).
Suppose now (induction hypothesis) that
∀x ∈ Σ∗ , ∀w ∈ Σᵏ : (w, p) ∈ CSFL (x) ⇒ (w, p) ∈ LGL ([x]).
Let w = auv be such that |w| = k + 1, (auv, p) ∈ CSFL (x), and let
v = lcs(a−1 CSFL (x)). As CSFL (xa) = (a−1 CSFL (x)) ↓, there exists pu such
that (u, pu ) ∈ CSFL (xa), and then p = pu p(aΣ∗ | CSFL (x)). As by construction
[x] → a[xa]v and p([x] → a[xa]v) = p(aΣ∗ | CSFL (x)), and, by the induction
hypothesis (|u| ≤ k), (u, pu ) ∈ LGL ([xa]), then (auv, p) ∈ LGL ([x]).
∀x ∈ Σ∗ , LGL ([x]) ⊆ CSFL (x)
Proof by induction on the order (k) of the derivation:
∀x ∈ Σ∗ , ∀k ∈ N, ∀w ∈ Σ∗ , [x] →ᵏ w ⇒ (w, p([x] →ᵏ w)) ∈ CSFL (x).
Base case: [x] →¹ w. This case is only possible if w = λ, and, by construction,
such a rule is in the grammar because (λ, p(λ | CSFL (x))) ∈ CSFL (x).
Suppose now (induction hypothesis) that for any n ≤ k:
∀x ∈ Σ∗ , ∀w ∈ Σ∗ : [x] →ⁿ w ⇒ ∃p : (w, p) ∈ CSFL (x).
Take w ∈ Σ∗ such that [x] →ᵏ⁺¹ w; then [x] → a[xa]v →ᵏ w = auv with
[xa] →ᵏ u and p = p([x] → a[xa]v) pu , where pu = p([xa] →ᵏ u). By the
induction hypothesis we know that (u, pu ) ∈ CSFL (xa) = (a−1 CSFL (x)) ↓ =
{(t, pt ) : (atv, pa pt ) ∈ CSFL (x)}, where pa = p(aΣ∗ | CSFL (x)) and v =
lcs(a−1 CSFL (x)). As by construction p([x] → a[xa]v) = p(aΣ∗ | CSFL (x)),
then (w, p) = (auv, p([x] → a[xa]v) pu ) ∈ CSFL (x).
Criterion of Calibration for Transductive
Confidence Machine with Limited Feedback

Ilia Nouretdinov and Vladimir Vovk

Department of Computer Science


Royal Holloway, University of London
{ilia,vovk}@[Link]

Abstract. This paper is concerned with the problem of on-line pre-


diction in the situation where some data is unlabelled and can never be
used for prediction, and even when data is labelled, the labels may arrive
with a delay. We construct a modification of randomised Transductive
Confidence Machine for this case and prove a necessary and sufficient
condition for its predictions being calibrated, in the sense that in the
long run they are wrong with a prespecified probability under the as-
sumption that the data is generated independently by the same distribution.
The condition for calibration turns out to be very weak: feedback should
be given on more than a logarithmic fraction of steps.

1 Introduction

In this paper we consider the problem of prediction: given some training data and
a new object xn we would like to predict its label yn . We use the randomised on-
line version of Transductive Confidence Machine as basic method of prediction;
first we explain why we are interested in this method and then formulate the
main question of this paper.
Transductive Confidence Machine (TCM) [3,4] is a prediction method giving
“p-values” py for any possible value y of the unknown label yn ; the p-values
satisfy the following property (proven in, e.g., [1]): if the data satisfies the i.i.d.
assumption, which means that the data is generated independently by the same
mechanism, the probability that pyn < δ does not exceed δ for any threshold
δ ∈ (0, 1) (the validity property).
There are different ways of presenting the p-values. The one used in [3] only
works in the case of pattern recognition: the prediction algorithm outputs a
“most likely” label (y with the largest py ) together with confidence (one minus
the second largest py ) and credibility (the largest py ). Alternatively, the predic-
tion algorithm can be given a threshold δ as an input and its answer will be that
the label yn should lie in the set of such y that py > δ; this scenario of set (or
region) prediction was used in [5,2] and will be used in this paper. The validity
property says that the set prediction will be wrong with probability at most δ.
Therefore, we can guarantee some maximal probability of error; the downside is
that the set prediction can consist of more than one element.

R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 259–267, 2003.

c Springer-Verlag Berlin Heidelberg 2003

Randomised TCM (rTCM), which is described below, is valid in a stronger


sense than pure TCM: the error probability is equal to δ.
In on-line TCM [5] it is supposed that machine learning is performed step-
by-step: on the nth step TCM predicts the new label yn using knowledge of the
new object xn and all the previous objects with their labels; after that the true
information about yn becomes available and TCM can use it on the next step
n + 1. In the paper [5] it was proven that the probability of error on each step is
again δ; moreover, errors on different steps are independent of each other, so the
mean percentage of errors asymptotically tends to δ (the calibration property).
In principle, it is easy to be calibrated in set prediction; what makes TCMs
interesting is that they output few uncertain predictions (predictions containing
more than one label). This can be demonstrated both empirically on standard
benchmark data sets (see, e.g., [5]) and theoretically: a simple Nearest Neigh-
bours rTCM produces asymptotically no more uncertain predictions than any
other calibrated algorithm for set prediction.
The interest of this paper is a more general case of on-line TCM prediction,
where only some subsequence of labels is available, possibly with a delay; a neces-
sary and sufficient condition for calibration in probability is given in Theorem 1
below. Originally, we stated this result assuming that true labels were given
without delay, but then we noticed that Daniil Ryabko’s [2] device of “ghost
rTCM” (in our terminology) makes it possible to add delays without any extra
work.

2 Online Randomised TCM


Now we describe (mainly following [5]) how on-line rTCM works.
Suppose we observe a sequence z1 , z2 , . . . , zn , . . . of examples, where zi =
(xi , yi ) ∈ Z = X × Y, xi ∈ X are objects to be labelled and yi ∈ Y are the
labels; X and Y are arbitrary measurable spaces.
“On-line” means that for any n we try to predict yn using

z1 = (x1 , y1 ), . . . , zn−1 = (xn−1 , yn−1 ), xn .

The method is as follows. We need a symmetric function

f (z1 , . . . , zn ) = (α1 , . . . , αn ).

“Symmetric” means that if we change order of z1 , . . . , zn , the order of α1 , . . . , αn


will change in the same way. In other words, there must exist a function F such
that
αi = F (⟦z1 , . . . , zi−1 , zi+1 , . . . , zn ⟧, zi ),
where ⟦· · ·⟧ denotes a multiset. The output of on-line rTCM is a set Yn of pre-
dictions for yn ; a label y is included in Yn if and only if

#{i : αi > αn } + θn #{i : αi = αn } > nδ,

where

(α1 , . . . , αn ) = f (z1 , . . . , zn−1 , (xn , y)),


θn ∈ [0, 1] are random numbers distributed uniformly and independently of each
other and everything else, and δ > 0 is a given threshold (called significance
level ). We will be concerned with the error sequence e1 , . . . , en , . . . , where en = 0
if the true value yn is in Yn , and en = 1 otherwise.
In the paper [5] it is proven that for any probability distribution P in the set
Z of pairs zi = (xi , yi ), the corresponding (e1 , e2 , . . . ) is a Bernoulli sequence:
for each i, ei ∈ {0, 1}, ei = 1 with probability δ, and all ei are independent.
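The method above can be sketched in a few lines of Python; the 1-NN ratio strangeness below is just one admissible symmetric function (not one prescribed by the paper), and one-dimensional objects are assumed for simplicity:

```python
import random

def nn_strangeness(zs):
    """A symmetric strangeness function: for each example, the distance to the
    nearest same-label example divided by the distance to the nearest
    other-label example (larger = stranger)."""
    alphas = []
    for i, (xi, yi) in enumerate(zs):
        same = [abs(xi - xj) for j, (xj, yj) in enumerate(zs) if j != i and yj == yi]
        diff = [abs(xi - xj) for j, (xj, yj) in enumerate(zs) if j != i and yj != yi]
        d_same = min(same) if same else float('inf')
        d_diff = min(diff) if diff else float('inf')
        alphas.append(d_same / d_diff if d_diff > 0 else float('inf'))
    return alphas

def rtcm_step(train, x_new, labels, delta, rng=random):
    """One step of on-line randomised TCM: return the set prediction Y_n."""
    region = set()
    for y in labels:
        zs = train + [(x_new, y)]
        alphas = nn_strangeness(zs)
        a_n = alphas[-1]
        greater = sum(1 for a in alphas if a > a_n)
        equal = sum(1 for a in alphas if a == a_n)
        # include y iff #{i: alpha_i > alpha_n} + theta_n #{i: alpha_i = alpha_n} > n*delta
        if greater + rng.random() * equal > len(zs) * delta:
            region.add(y)
    return region
```

For instance, with training set [(0.0, 0), (0.1, 0), (1.0, 1), (1.1, 1)], new object 0.05 and δ = 0.2, the prediction region is {0} for every value of θn.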

3 Restricted TCM

In practice we are likely to have the true labels yn only for a subset of steps n;
moreover, even for this subset yn may be given with a delay. In this paper we
consider the following scheme. We are given a function L : N → IN defined on
an infinite set N ⊆ IN and required to satisfy

L(n) ≤ n

for all n ∈ N and

m ≠ n =⇒ L(m) ≠ L(n)

for all m ∈ N and n ∈ N ; a function satisfying these properties will be called
the teaching schedule. The teaching schedule L describes the way the data is
disclosed to us: at the end of step n we are given the label yL(n) for the object
xL(n) . The elements of L’s domain N in the increasing order will be denoted ni :
N = {n1 , n2 , . . . } and n1 < n2 < · · · .
We transform the on-line randomised TCM algorithm to what we call the L-
restricted rTCM. We again use a symmetric function f (ζ1 , . . . , ζk ) = (α1 , . . . , αk )
and for any n = nk−1 + 1, . . . , nk and any y ∈ Y we include y in Yn if and only
if
#{i = 1, . . . , k : αi > αk } + θn #{i = 1, . . . , k : αi = αk } > kδ,
where
(α1 , . . . , αk ) = f (zL(n1 ) , . . . , zL(nk−1 ) , (xn , y)),
θn are random numbers and δ is a given significance level. As before, the error
sequence is: en = 1 if yn ∉ Yn and en = 0 otherwise.
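The two defining properties of a teaching schedule can be checked mechanically on a finite part of its domain; a small sketch (the dict-based representation of L is our own choice):

```python
def is_teaching_schedule(L):
    """Check L(n) <= n and injectivity of L on a finite part of its domain,
    with L given as a dict n -> L(n)."""
    values = list(L.values())
    return all(L[n] <= n for n in L) and len(set(values)) == len(values)
```

For example, {1: 1, 3: 2, 7: 5} qualifies, while {2: 3} violates L(n) ≤ n and {1: 1, 2: 1} violates injectivity.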
Let U be the uniform distribution in [0, 1]. If a probability distribution P
in Z generates the examples zi , the distribution (P × U )∞ generates zi and
the random numbers θi and therefore determines the distribution of all random
variables, such as the errors ei , considered in this paper.
We say that a restricted rTCM is (well-)calibrated in probability if the corre-
sponding error sequence e1 , e2 , . . . has the property that

(e1 + · · · + en )/n → δ

in (P × U )∞ -probability for any significance level δ and distribution P in Z. (Re-
member that, by definition, ξ1 , ξ2 , . . . converges to a constant c in Q-probability
if

limn→∞ Q {|ξn − c| > ε} = 0

for any ε > 0.)


Our aim is to prove the following statement.
Theorem 1. Let L be a teaching schedule with domain N = {n1 , n2 , . . . }, where
n1 , n2 , . . . is an increasing infinite sequence of positive integers.

– If limk→∞ (nk /nk−1 ) = 1, any L-restricted rTCM is calibrated in probability.


– If limk→∞ (nk /nk−1 ) = 1 does not hold, there exists an L-restricted rTCM
which is not calibrated in probability.

In words, the theorem asserts that the restricted rTCM is guaranteed to be


calibrated in probability if and only if the growth rate of nk is sub-exponential.

4 Proof That nk /nk−1 → 1 Is Sufficient

We start from a simple general lemma about martingale differences.


Lemma 1. If ξ1 , ξ2 , . . . is a martingale difference sequence w.r. to σ-algebras
F1 , F2 , . . . such that, for all i ≥ 1,

E(ξi² | Fi−1 ) ≤ 1,

and w1 , w2 , . . . is a sequence of positive numbers, then

E( ((w1 ξ1 + · · · + wn ξn )/(w1 + · · · + wn ))² ) ≤ (w1² + · · · + wn² )/(w1 + · · · + wn )² .

Proof. Since elements of a martingale difference sequence are uncorrelated, we
have

E((w1 ξ1 + · · · + wn ξn )²) = Σ1≤i≤n wi² E(ξi² ) + 2 Σ1≤i<j≤n wi wj E(ξi ξj ) ≤ Σ1≤i≤n wi² .

Fix a probability distribution P in Z generating the examples zi ; let P stand


for (P × U )∞ (the probability distribution generating the examples zi and the
random numbers θi ) and E stand for the expected value w.r. to P.
Along with the original L-restricted rTCM making errors e1 , e2 , . . . we also
consider the ghost rTCM (introduced in [2]), which uses the same alpha function
as the L-restricted rTCM but is fed with the examples

z′1 := zL(n1 ) , z′2 := zL(n2 ) , . . .


and random numbers θ′1 , θ′2 , . . . (independent from each other and anything else);
the error sequence of the ghost rTCM is denoted e′1 , e′2 , . . . (remember that an
error is encoded as 1 and the absence of error as 0). The ghost rTCM is given
all labels and each label is given without delay. Notice that the input sequence
zL(n1 ) , zL(n2 ) , . . . to the ghost rTCM is also distributed according to P ∞ .
Set, for each n = 1, 2, . . . ,

dn = P{en = 1 | z1 , . . . , zn−1 }

(it is clear that, for each k, dn will be the same for all n = nk−1 + 1, . . . , nk ) and

d′k = P{e′k = 1 | z′1 , . . . , z′k−1 } .

Notice that, for all k = 1, 2, . . . ,

dnk = d′k . (1)

Corollary 1. For each k,

E( ((e′1 − δ)n1 + (e′2 − δ)(n2 − n1 ) + · · · + (e′k − δ)(nk − nk−1 ))/nk )²
≤ (n1² + (n2 − n1 )² + · · · + (nk − nk−1 )²)/nk² .

Proof. It is sufficient to apply Lemma 1 to w1 = n1 , w2 = n2 − n1 , . . . , wk =
nk − nk−1 , the independent zero-mean (by the result of [5] described at the end
of §2) random variables ξi = e′i − δ, and the trivial σ-algebras.

Corollary 2. For each k,

E( ((e′1 − d′1 )n1 + (e′2 − d′2 )(n2 − n1 ) + · · · + (e′k − d′k )(nk − nk−1 ))/nk )²
≤ (n1² + (n2 − n1 )² + · · · + (nk − nk−1 )²)/nk² .

Proof. Use Lemma 1 for w1 = n1 , w2 = n2 − n1 , . . . , wk = nk − nk−1 , ξk = e′k − d′k ,
and the σ-algebras Fk generated by z′1 , . . . , z′k−1 .

Corollary 3. For each k,

E( ((e1 − d1 ) + (e2 − d2 ) + · · · + (enk − dnk ))/nk )² ≤ 1/nk .

Proof. Apply Lemma 1 to wi = 1, ξi = ei − di , and the σ-algebras Fi generated
by z1 , . . . , zi .

Lemma 2. If limk→∞ nk+1 /nk = 1 for some increasing sequence of positive
integers n1 , n2 , . . . , nk , . . . , then

limk→∞ (n1² + (n2 − n1 )² + · · · + (nk − nk−1 )²)/nk² = 0.

Proof. For any ε > 0, there exists K such that (nk − nk−1 )/nk−1 < ε for any k ≥ K.
Therefore,
(n1² + (n2 − n1 )² + · · · + (nk − nk−1 )²)/nk²
≤ nK²/nk² + ((nK+1 − nK )² + · · · + (nk − nk−1 )²)/nk²
≤ nK²/nk² + ((nK+1 − nK )/nK )·((nK+1 − nK )/nk ) + ((nK+2 − nK+1 )/nK+1 )·((nK+2 − nK+1 )/nk )
+ · · · + ((nk − nk−1 )/nk−1 )·((nk − nk−1 )/nk )
≤ nK²/nk² + ε·((nK+1 − nK ) + · · · + (nk − nk−1 ))/nk ≤ 2ε

from some k on.
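Lemma 2 is easy to probe numerically; in the sketch below the schedules nk = k² (sub-exponential) and nk = 2ᵏ (exponential) are our own illustrative choices:

```python
def ratio(ns):
    """(n1^2 + (n2-n1)^2 + ... + (nk-n_{k-1})^2) / nk^2 for a schedule ns."""
    diffs = [ns[0]] + [b - a for a, b in zip(ns, ns[1:])]
    return sum(d * d for d in diffs) / ns[-1] ** 2

poly = [k * k for k in range(1, 2001)]   # n_k = k^2: n_{k+1}/n_k -> 1, ratio -> 0
expo = [2 ** k for k in range(1, 31)]    # n_k = 2^k: ratio stays near 1/3
```

For the quadratic schedule the ratio is already below 10⁻³ at k = 2000, while for the exponential schedule it never drops below roughly 1/3.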
Now it is easy to finish the proof of the first part of the theorem. In combi-
nation with Chebyshev’s inequality and Lemma 2, Corollary 1 implies that
((e′1 − δ)n1 + (e′2 − δ)(n2 − n1 ) + · · · + (e′k − δ)(nk − nk−1 ))/nk → 0

in probability; using the notation k(n) := min{k : nk ≥ n}, we can rewrite this
as

(1/nk ) Σn=1..nk (e′k(n) − δ) → 0. (2)

Similarly, (1) and Corollary 2 imply

(1/nk ) Σn=1..nk (e′k(n) − d′k(n) ) = (1/nk ) Σn=1..nk (e′k(n) − dn ) → 0 (3)

and Corollary 3 implies

(1/nk ) Σn=1..nk (en − dn ) → 0 (4)

(all convergences are in probability). Combining (2)–(4), we obtain

(1/nk ) Σn=1..nk (en − δ) → 0; (5)

the condition nk+1 /nk → 1 allows us to replace nk with n in (5).



5 Proof That nk /nk−1 → 1 Is Necessary

As a first step, we construct the example space Z, the probability distribution
P in Z and an rTCM for which the d′k deviate consistently from δ. Let X = {0},
Y = {0, 1}, so zi is, essentially, always 0 or 1. The probability P is defined by
P {0} = P {1} = 1/2. Define the alpha function (α1 , . . . , αk ) = f (ζ1 , . . . , ζk ) as
follows:

(α1 , . . . , αk ) = (ζ1 , . . . , ζk )

if ζ1 + · · · + ζk is even and

(α1 , . . . , αk ) = (1 − ζ1 , . . . , 1 − ζk )

if ζ1 + · · · + ζk is odd.
It follows from the central limit theorem that

#{i = 1, . . . , k : z′i = 1}/k ∈ (0.4, 0.6) (6)
with probability more than 99% for k large enough. Let δ = 5%. Consider some
k ∈ {1, 2, . . . }; we will show that d′k deviates significantly from δ with probability
more than 99% for sufficiently large k; namely, that d′k is significantly greater
than δ if z′1 + · · · + z′k−1 is odd (intuitively, in this case both potential labels are
strange) and d′k is significantly less than δ if z′1 + · · · + z′k−1 is even (intuitively,
both potential labels are typical). Formally:

– If z′1 + · · · + z′k−1 is odd, then

  z′k = 1 =⇒ z′1 + · · · + z′k−1 + z′k is even =⇒ αk = z′k = 1,
  z′k = 0 =⇒ z′1 + · · · + z′k−1 + z′k is odd =⇒ αk = 1 − z′k = 1;

  in both cases we have αk = 1 and, therefore, with probability more than 99%,

  d′k = P {θk #{i = 1, . . . , k : αi = 1} ≤ kδ}
      = kδ/#{i = 1, . . . , k : αi = 1} ≥ kδ/(0.7k) = (10/7)δ.

– If z′1 + · · · + z′k−1 is even, then

  z′k = 1 =⇒ z′1 + · · · + z′k−1 + z′k is odd =⇒ αk = 1 − z′k = 0,
  z′k = 0 =⇒ z′1 + · · · + z′k−1 + z′k is even =⇒ αk = z′k = 0;

  in both cases αk = 0 and, therefore, with probability more than 99%,

  d′k = P {#{i = 1, . . . , k : αi = 1} + θk #{i = 1, . . . , k : αi = 0} ≤ kδ}
      ≤ P {0.3k ≤ kδ} = 0.

To summarise, for large enough k,

|d′k − δ| = |dnk − δ| > δ/3 (7)

with probability more than 99%.
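The dichotomy in the two bullet points can be verified exactly in code: for the parity alpha function, the conditional error probability d′k (computed below by averaging over the two equally likely values of z′k and over θk) vanishes when the prefix sum is even and is roughly 2δ when it is odd. The concrete prefixes are our own test data:

```python
def d_k(prefix, delta=0.05):
    """Exact error probability at step k = len(prefix) + 1 for the
    parity-based alpha function of Section 5."""
    k = len(prefix) + 1
    total = 0.0
    for zk in (0, 1):                 # the new example; P{0} = P{1} = 1/2
        zs = prefix + [zk]
        alphas = zs if sum(zs) % 2 == 0 else [1 - z for z in zs]
        a_k = alphas[-1]
        above = sum(1 for a in alphas if a > a_k)
        equal = sum(1 for a in alphas if a == a_k)
        # error iff (above + theta*equal)/k <= delta, theta uniform on [0, 1]
        p_err = min(1.0, max(0.0, (k * delta - above) / equal))
        total += 0.5 * p_err
    return total
```

With a prefix of length 999 containing 499 ones (odd sum), d′k is about 0.1 = 2δ; with 500 ones (even sum) it is exactly 0.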


Suppose that

(1/n) Σi=1..n (ei − δ) → 0 (8)

in probability; we will deduce that nk /nk−1 → 1. By (4) (remember that Corol-
lary 3 and, therefore, (4) do not depend on the condition nk /nk−1 → 1) and (8)
we have

(1/n) Σi=1..n (di − δ) → 0;

we can rewrite this in the form

Σi=1..n di = n(δ + o(1))

(all o(1) are in probability). This equality implies

Σk=0..K dnk (nk+1 − nk ) = nK+1 (δ + o(1))

and

Σk=0..K−1 dnk (nk+1 − nk ) = nK (δ + o(1));

subtracting the last equality from the penultimate one we obtain

dnK (nK+1 − nK ) = (nK+1 − nK )δ + o(nK+1 ),

i.e.,

(dnK − δ)(nK+1 − nK ) = o(nK+1 ).

In combination with (7) and (1), this implies nK+1 − nK = o(nK+1 ), i.e.,
nK+1 /nK → 1 as K → ∞.

References
1. Ilia Nouretdinov, Thomas Melluish, and Vladimir Vovk. Ridge Regression Confi-
dence Machine. In Proceedings of the 18th International Conference on Machine
Learning, 2001.
2. Daniil Ryabko, Vladimir Vovk, and Alex Gammerman. Online region prediction
with real teachers. Submitted for publication.

3. Craig Saunders, Alex Gammerman, and Vladimir Vovk. Transduction with confi-
dence and credibility. In Proceedings of the 16th International Joint Conference on
Artificial Intelligence, pp. 722–726, 1999.
4. Vladimir Vovk, Alex Gammerman, and Craig Saunders. Machine-learning appli-
cations of algorithmic randomness. In Proceedings of the 16th International Con-
ference on Machine Learning, pp. 444–453. San Francisco, CA: Morgan Kaufmann,
1999.
5. Vladimir Vovk. On-line Confidence Machines are well-calibrated. In Proceedings
of the 43rd Annual Symposium on Foundations of Computer Science. IEEE Com-
puter Society, 2002.
Well-Calibrated Predictions from Online
Compression Models

Vladimir Vovk

Computer Learning Research Centre, Department of Computer Science,


Royal Holloway, University of London, Egham, Surrey TW20 0EX, England,
vovk@[Link],
[Link]

Abstract. It has been shown recently that Transductive Confidence


Machine (TCM) is automatically well-calibrated when used in the on-line
mode and provided that the data sequence is generated by an exchange-
able distribution. In this paper we strengthen this result by relaxing
the assumption of exchangeability of the data-generating distribution to
the much weaker assumption that the data agrees with a given “on-line
compression model”.

1 Introduction
Transductive Confidence Machine (TCM) was introduced in [1,2] as a practi-
cally meaningful way of providing information about reliability of the predic-
tions made. In [3] it was shown that TCM’s confidence information is valid in
a strong non-asymptotic sense under the standard assumption that the exam-
ples are exchangeable. In §2 we define a general class of models, called “on-line
compression models”, which include not only the exchangeability model but also
the Gaussian model, the Markov model, and many other interesting models. An
on-line compression model (OCM) is an automaton (usually infinite) for sum-
marizing statistical information efficiently. It is usually impossible to restore the
statistical information from OCM’s summary (so OCM performs lossy compres-
sion), but it can be argued that the only information lost is noise, since one of
our requirements is that the summary should be a “sufficient statistic”. In §3
we construct “confidence transducers” and state the main result of the paper
(proved in Appendix A) showing that the confidence information provided by
confidence transducers is valid in a strong sense. In the last three sections, §4–6,
we consider three interesting examples of on-line compression models: exchange-
ability, Gaussian and Markov models. The idea of compression modelling was
the main element of Kolmogorov’s programme for applications of probability [4],
which is discussed in Appendix B.

2 Online Compression Models


We are interested in making predictions about a sequence of examples z1 , z2 , . . .
output by Nature. Typically we will want to say something about example zn ,

R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 268–282, 2003.

c Springer-Verlag Berlin Heidelberg 2003

n = 1, 2, . . . , given the previous examples z1 , . . . , zn−1 . In this section we will


discuss an assumption that we might be willing to make about the examples,
and in the next section the actual prediction algorithms.
An on-line compression model is a 5-tuple M = (Σ, □, Z, (Fn ), (Bn )), where:

1. Σ is a measurable space called the summary space; its elements are called
   summaries; □ ∈ Σ is a summary called the empty summary;
2. Z is a measurable space from which the examples zi are drawn;
3. Fn , n = 1, 2, . . . , are functions of the type Σ × Z → Σ called forward
   functions;
4. Bn , n = 1, 2, . . . , are kernels of the type Σ → Σ × Z called backward kernels;
   in other words, each Bn is a function Bn (A | σ) which depends on σ ∈ Σ and
   a measurable set A ⊆ Σ × Z such that
   – for each σ, Bn (A | σ) as a function of A is a probability distribution in
     Σ × Z;
   – for each A, Bn (A | σ) is a measurable function of σ;
   it is required that Bn be a reverse to Fn in the sense that
   Bn (Fn⁻¹ (σ) | σ) = 1
   for each σ ∈ Fn (Σ × Z). We will sometimes write Bn (σ) for the probability
   distribution A → Bn (A | σ).

Next we explain briefly the intuitions behind this formal definition and introduce
some further notation.
An OCM is a way of summarizing statistical information. At the beginning
we do not have any information, which is represented by the empty summary
σ0 := □. When the first example z1 arrives, we update our summary to σ1 :=
F1 (σ0 , z1 ), etc.; when example zn arrives, we update the summary to σn :=
Fn (σn−1 , zn ). This process is represented in Figure 1. Let tn be the nth statistic
in the OCM, which maps the sequence of the first n examples z1 , . . . , zn to σn :

t1 (z1 ) := F1 (σ0 , z1 );
tn (z1 , . . . , zn ) := Fn (tn−1 (z1 , . . . , zn−1 ), zn ), n = 2, 3, . . . .

The value tn (z1 , . . . , zn ) is a summary of the full data sequence z1 , . . . , zn avail-


able at the end of trial n; our definition requires that the summaries should be
computable on-line: the function Fn updates σn−1 to σn .
Condition 3 in the definition of OCM reflects its on-line character, as ex-
plained in the previous paragraph. We want, however, the system of summariz-
ing statistical information represented by the OCM to be efficient, so that no
useful information is lost. This is reflected in Condition 4: the distribution Pn
of the more detailed description (σn−1 , zn ) given the less detailed σn is known
and so does not carry any information about the distribution generating the
examples z1 , z2 , . . . ; in other words, σn contains the same useful information as
(σn−1 , zn ), and the extra information in (σn−1 , zn ) is noise. This intuition would
be captured in statistical terminology (see, e.g., [5], §2.2) by saying that σn is

□ —z1→ σ1 —z2→ σ2 → · · · → σn−1 —zn→ σn

Fig. 1. Using the forward functions Fn to compute σn from z1 , . . . , zn

a “sufficient statistic” of z1 , . . . , zn (although this expression does not have a


formal meaning in our present context, since we do not have a full statistical
model {Pθ : θ ∈ Θ}).
Analogously to Figure 1, we can compute the distribution of the data
sequence z1 , . . . , zn from σn (see Figure 2). Formally, using the kernels
Bn (dσn−1 , dzn | σn ), we can define the conditional distribution Pn of z1 , . . . , zn
given σn by the formula

Pn (A1 × · · · × An | σn ) := ∫ · · · ∫ B1 (A1 | σ1 ) B2 (dσ1 , A2 | σ2 ) · · ·
Bn−1 (dσn−2 , An−1 | σn−1 ) Bn (dσn−1 , An | σn ) (1)

for each product set A1 × · · · × An , Ai ⊆ Z, i = 1, . . . , n.

□ ←z1— σ1 ←z2— σ2 ← · · · ← σn−1 ←zn— σn

Fig. 2. Using the backward kernels Bn to extract the distribution of z1 , . . . , zn
from σn

We say that a probability distribution P in Z∞ agrees with the OCM
(Σ, □, Z, (Fn ), (Bn )) if, for each n, Bn (A | σ) is a version of the conditional prob-
ability, w.r. to P , that (tn−1 (z1 , . . . , zn−1 ), zn ) ∈ A given tn (z1 , . . . , zn ) = σ and
given the values of zn+1 , zn+2 , . . . .

3 Confidence Transducers and the Main Result


A randomised transducer is a function f of the type (Z × [0, 1])∗ → [0, 1].
It is called “transducer” because it can be regarded as mapping each in-
put sequence (z1 , θ1 , z2 , θ2 , . . . ) in (Z × [0, 1])∞ (the examples zi are comple-
mented by random numbers θi ) into the output sequence (p1 , p2 , . . . ) defined

by pn := f (z1 , θ1 , . . . , zn , θn ), n = 1, 2, . . . ; we will say that p1 , p2 , . . . are the


p-values produced by the transducer. We say that the transducer f is valid w.r.
to an OCM M if the output p-values p1 p2 . . . are always distributed according
to the uniform distribution U ∞ in [0, 1]∞ , provided the input examples z1 z2 . . .
are generated by a probability distribution that agrees with M and θ1 θ2 . . . are
generated, independently of z1 z2 . . . , from U ∞ . If we drop the dependence on
the random numbers θn , we obtain the notion of deterministic transducer.
Any sequence of measurable functions An : Σ × Z → IR, n = 1, 2, . . . , is called
an individual strangeness measure w.r. to the OCM M = (Σ, □, Z, (Fn ), (Bn )).
The confidence transducer associated with (An ) is the deterministic transducer
where pn are defined as

pn := Bn ({(σ, z) ∈ Σ × Z : An (σ, z) ≥ An (σn−1 , zn )} | σn ) (2)

and
σn := tn (z1 , . . . , zn ), σn−1 := tn−1 (z1 , . . . , zn−1 ) .
The randomised version is obtained by replacing (2) with

pn := Bn ({(σ, z) ∈ Σ × Z : An (σ, z) > An (σn−1 , zn )} | σn )


+ θn Bn ({(σ, z) ∈ Σ × Z : An (σ, z) = An (σn−1 , zn )} | σn ) . (3)

A confidence transducer in an OCM M is a confidence transducer associated


with some individual strangeness measure w.r. to M .

Theorem 1. Suppose the examples zn ∈ Z, n = 1, 2, . . . , are generated from


a probability distribution P that agrees with an on-line compression model. Any
randomised confidence transducer in that model is valid (will produce independent
p-values pn distributed uniformly in [0, 1]).

Confidence transducers can be used for “prediction with confidence”. Suppose


each example zn consists of two components, xn (the object) and yn (the label);
at trial n we are given xn and the goal is to predict yn ; for simplicity, we will
assume that the label space Y from which the labels are drawn is finite.
One mode of prediction with confidence is “region prediction” (as in [3]).
Suppose we are given a significance level δ > 0 (the maximum probability of error
we are prepared to tolerate). When given xn , we can output as the predictive
region Γn ⊆ Y the set of labels y such that yn = y would lead to a p-value pn > δ.
(When a confidence transducer is applied in this mode, we will sometimes refer
to it as a TCM.) If error at trial n is defined as yn ∉ Γn , then by Theorem 1
errors at different trials are independent and the probability of error at each
trial is δ, assuming the pn are produced by a randomised confidence transducer.
In particular, such region predictors are well-calibrated, in the sense that the
number En of errors made in the first n trials satisfies

limn→∞ En /n = δ.

This implies that if the pn are produced by a deterministic confidence transducer,
we will still have the conservative version of this property,

lim supn→∞ En /n ≤ δ.

An alternative way of presenting the confidence transducer’s output (used
in [2] and several other papers) is reporting, after seeing xn , a predicted label
ŷn ∈ arg maxy∈Y pn (y), the confidence 1 − pn^(2) and the credibility pn^(1) , where
pn (y) is the p-value that would be obtained if yn = y, pn^(1) is the largest value
among the pn (y) and pn^(2) is the second largest value among the pn (y).

4 Exchangeability Model

In this section we discuss the only special case of OCM studied from the point
of view of prediction with confidence so far: the exchangeable model. In the next
two sections we will consider two other models, Gaussian and Markov; many
more models are considered in [6], Chapter 4. For defining specific OCM, we will
specify their statistics tn and conditional distributions Pn ; these will uniquely
identify Fn and Bn .
The exchangeability model has statistics

tn (z1 , . . . , zn ) := ⟦z1 , . . . , zn ⟧ ;

given the value of the statistic, all orderings have the same probability 1/n!. For-
mally, the set of bags ⟦z1 , . . . , zn ⟧ of size n is defined as Zn equipped with the σ-
algebra of symmetric (i.e., invariant under permutations of components) events;
the distribution on the orderings is given by (zπ(1) , . . . , zπ(n) ), where z1 , . . . , zn is
a fixed ordering and π is a random permutation (each permutation is chosen
with probability 1/n!).
The main results of [3] and [7] are special cases of Theorem 1.
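For the exchangeability model, the randomised p-value (3) specialises to the familiar smoothed conformal p-value. A sketch follows; the uniformity check at the end, with its sample size and tolerance, is our own illustration (each round draws a fresh exchangeable bag):

```python
import random

def smoothed_p(alphas, theta):
    """Smoothed p-value (3) specialised to the exchangeability model:
    the last alpha plays the role of An(sigma_{n-1}, z_n)."""
    a_n = alphas[-1]
    n = len(alphas)
    above = sum(1 for a in alphas if a > a_n)
    equal = sum(1 for a in alphas if a == a_n)
    return (above + theta * equal) / n

# Under exchangeability the p-values are uniform on [0, 1]:
rng = random.Random(0)
ps = []
for _ in range(2000):
    alphas = [rng.gauss(0.0, 1.0) for _ in range(20)]   # i.i.d., hence exchangeable
    ps.append(smoothed_p(alphas, rng.random()))
mean = sum(ps) / len(ps)
```

The empirical mean of the p-values comes out close to 1/2, as uniformity predicts.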

5 Gaussian Model

In the Gaussian model, Z := IR, the statistics are

tn (z1 , . . . , zn ) := (z̄n , Rn ) ,
z̄n := (1/n) Σi=1..n zi ,  Rn := √((z1 − z̄n )² + · · · + (zn − z̄n )²) ,

and Pn (dz1 , . . . , dzn | σ) is the uniform distribution in tn⁻¹ (σ) (in other words, it
is the uniform distribution in the (n − 2)-dimensional sphere in IRn with centre
(z̄n , . . . , z̄n ) ∈ IRn of radius Rn lying inside the hyperplane (1/n)(z1 + · · · + zn ) = z̄n ).

Let us give an explicit expression of the predictive region for the Gaussian
model and individual strangeness measure

An (tn−1 , zn ) = An ((z̄n−1 , Rn−1 ), zn ) := |zn − z̄n−1 | (4)

(it is easy to see that this individual strangeness measure is equivalent, in the
sense of leading to the same p-values, to |zn − z̄n |, as well as to several other
natural expressions, including (5)). Under Pn (dz1 , . . . , dzn | σ), the expression

√((n − 1)(n − 2)/n) · (zn − z̄n−1 )/Rn−1 (5)

has Student’s t-distribution with n − 2 degrees of freedom (assuming n > 2; see,
e.g., [8], §29.4). If t(δ) is the value defined by P{|tn−2 | > t(δ) } = δ (where tn−2 has
Student’s t-distribution with n − 2 degrees of freedom), the predictive interval
corresponding to individual strangeness measure (4) is the set of z satisfying

|z − z̄n−1 | ≤ t(δ) √(n/((n − 1)(n − 2))) Rn−1 .

Therefore, we obtained the usual predictive regions based on the t-test (as in [9]
or, in more detail, [10]); now, however, we can see that the errors of this standard
procedure (applied in the on-line fashion) are independent.
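The interval above is the classical Student predictive interval for the next observation. A sketch follows; to keep it dependency-free, the caller supplies the two-sided critical value t(δ) for n − 2 degrees of freedom (e.g. from tables):

```python
import math

def gaussian_predictive_interval(past, t_crit):
    """Predictive interval {z : |z - zbar_{n-1}| <= t_crit * sqrt(n/((n-1)(n-2))) * R_{n-1}},
    where n = len(past) + 1 counts the new example as well."""
    n = len(past) + 1
    mean = sum(past) / (n - 1)                            # zbar_{n-1}
    r = math.sqrt(sum((z - mean) ** 2 for z in past))     # R_{n-1}
    half = t_crit * math.sqrt(n / ((n - 1) * (n - 2))) * r
    return mean - half, mean + half
```

With past = [1, 2, 3] and the 5% two-sided critical value 4.303 for 2 degrees of freedom this gives roughly 2 ± 4.97, agreeing with the textbook interval x̄ ± t s √(1 + 1/m) for a sample of size m = 3.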

6 Markov Model
The Gaussian OCM, considered in the previous section, is narrower than the
exchangeability OCM. The OCM considered in this section is interesting in that
it goes beyond exchangeability.
In this section we always assume that the example space Z is finite. The
following notation for digraphs will be used: $\mathrm{in}(v)$ and $\mathrm{out}(v)$ stand for the number of arcs entering and leaving vertex $v$, respectively; $n_{u,v}$ is the number of arcs leading from vertex
The Markov summary of a data sequence z1 . . . zn is the following digraph
with two vertices marked:
– the set of vertices is Z (the state space of the Markov chain);
– the vertex z1 is marked as the source and the vertex zn is marked as the
sink (these two vertices are not necessarily distinct);
– the arcs of the digraph are the transitions zi zi+1 , i = 1, . . . , n − 1; the arc
zi zi+1 has zi as its tail and zi+1 as its head.
It is clear that in any such digraph all vertices v satisfy in(v) = out(v) with the
possible exception of the source and sink (unless they coincide), for which we
then have out(source) = in(source) + 1 and in(sink) = out(sink) + 1. We will call
a digraph with this property a Markov graph if the arcs with the same tail and
head are indistinguishable (for example, we do not distinguish two Eulerian paths
274 V. Vovk

that only differ in the order in which two such arcs are passed); its underlying
digraph will have the same structure but all its arcs will be considered to have
their own identity.
More formally, the Markov model (Σ, □, Z, F, B) is defined as follows:

– Z is a finite set; its elements (examples) are also called states; one of the
states is designated as the initial state;
– Σ is the set of all Markov graphs with the vertex set Z;
– □ is the Markov graph with no arcs and with both source and sink at the
designated initial state;
– Fn (σ, z) is the Markov graph obtained from σ by adding an arc from σ’s sink
to z and making z the new sink;
– let σ ↓ z, where σ is a Markov graph and z is one of σ’s vertices, be the
Markov graph obtained from σ by removing an arc from z to σ’s sink (σ ↓ z
does not exist if there is no arc from z to σ’s sink) and moving the sink to z,
and let N (σ) be the number of Eulerian paths from the source to the sink in
the Markov graph σ; Bn (σ) is (σ ↓ z, sink) with probability N (σ ↓ z)/N (σ),
where sink is σ’s sink and z ranges over the states for which σ ↓ z is defined.
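As a concrete illustration (a sketch of ours, not code from the paper; it follows the direct definition of the Markov summary above, with source $z_1$ and sink $z_n$, rather than the $F_n$ formalization with a designated initial state), a Markov summary reduces to the arc counts $n_{u,v}$ together with the marked source and sink:

```python
# Sketch (ours): the Markov summary of a data sequence z_1 ... z_n.
from collections import Counter

def markov_summary(zs):
    """Arc counts n_{u,v} for the transitions z_i -> z_{i+1}, plus source and sink."""
    arcs = Counter(zip(zs, zs[1:]))    # one arc per consecutive pair
    return dict(arcs), zs[0], zs[-1]   # (counts, source z_1, sink z_n)
```

For instance, `markov_summary("0110")` gives the counts `{('0','1'): 1, ('1','1'): 1, ('1','0'): 1}` with source `'0'` and sink `'0'` (here the source and sink coincide).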

We will take as the individual strangeness measure

An (σ, z) := −Bn ({(σ, z)} | Fn (σ, z)) (6)

(we need the minus sign because lower probability makes an example stranger).
To give a computationally efficient representation of the confidence transducer
corresponding to this individual strangeness measure, we need the following two
graph-theoretic results, versions of the BEST theorem and the Matrix-Tree the-
orem, respectively.

Lemma 1. In any Markov graph σ = (V, E) the number of Eulerian paths from
the source to the sink equals

$$T(\sigma)\, \mathrm{out}(\mathrm{sink}) \, \frac{\prod_{v \in V} (\mathrm{out}(v) - 1)!}{\prod_{u,v \in V} n_{u,v}!},$$

where T (σ) is the number of spanning out-trees in the underlying digraph centred
at the source.

Lemma 2. To find the number T (σ) of spanning out-trees rooted at the source
in the underlying digraph of a Markov graph σ with vertices z1 , . . . , zn (z1 being
the source),

– create the n × n matrix with the elements $a_{i,j} = -n_{z_i,z_j}$;
– change the diagonal elements so that each column sums to 0;
– compute the co-factor of $a_{1,1}$.

These two lemmas immediately follow from Theorems VI.24 and VI.28 in [11].
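Lemma 2's recipe can be carried out in a few lines; a sketch (ours, not code from the paper), using an exact determinant over rationals to avoid floating-point error:

```python
# Sketch (ours): T(sigma) via Lemma 2 (Matrix-Tree theorem).
from fractions import Fraction

def det(m):
    """Determinant by Gaussian elimination over exact fractions."""
    m = [[Fraction(x) for x in row] for row in m]
    n, sign, prod = len(m), 1, Fraction(1)
    for col in range(n):
        piv = next((r for r in range(col, n) if m[r][col] != 0), None)
        if piv is None:
            return Fraction(0)
        if piv != col:
            m[col], m[piv] = m[piv], m[col]
            sign = -sign
        prod *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return sign * prod

def count_out_trees(vertices, n_arcs):
    """T(sigma): vertices[0] is the source; n_arcs[(u, v)] = n_{u,v}."""
    k = len(vertices)
    a = [[-n_arcs.get((u, v), 0) for v in vertices] for u in vertices]
    for j in range(k):                          # make each column sum to 0
        a[j][j] = -sum(a[i][j] for i in range(k) if i != j)
    minor = [row[1:] for row in a[1:]]          # co-factor of a_{1,1}
    return int(det(minor))
```

For example, for a two-vertex graph with two parallel arcs from the source to the other vertex and one arc back, `count_out_trees` returns 2 (one out-tree per parallel arc, since arcs of the underlying digraph have their own identity).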

[Figure 3 here: the cumulative numbers of errors, uncertain predictions, and empty predictions (vertical axis, 0–250) plotted against the number of examples (horizontal axis, 0–10000).]

Fig. 3. TCM predicting the binary Markov chain with transition probabilities P(1 | 0) =
P(0 | 1) = 1% at significance level 2%; the cumulative numbers of errors (predictive
regions not covering the true label), uncertain (i.e., containing more than one label)
and empty predictive regions are shown

It is now easy to obtain an explicit formula for prediction in the binary case
Z = {0, 1}. First we notice that

$$B_n(\{(\sigma \downarrow z, \mathrm{sink})\} \mid \sigma) = \frac{N(\sigma \downarrow z)}{N(\sigma)} = \frac{T(\sigma \downarrow z)\, n_{z,\mathrm{sink}}}{T(\sigma)\, \mathrm{out}(\mathrm{sink})}$$

(all $n_{u,v}$ refer to the numbers of arcs in σ and sink is σ's sink; we set $N(\sigma \downarrow z) = T(\sigma \downarrow z) := 0$ when σ ↓ z does not exist). The following simple corollary from the last formula is sufficient for computing the probabilities $B_n$ in the binary case:

$$B_n(\{(\sigma \downarrow \mathrm{sink}, \mathrm{sink})\} \mid \sigma) = \frac{n_{\mathrm{sink},\mathrm{sink}}}{\mathrm{out}(\mathrm{sink})}.$$

This gives us the following formulas for the TCM in the binary Markov model
(remember that the individual strangeness measure is (6)). Suppose the current
summary is given by a Markov graph with $n_{i,j}$ arcs going from vertex $i$ to vertex $j$ ($i, j \in \{0, 1\}$) and let $f : [0, 1] \to [0, 1]$ be the function that squashes $[0.5, 1]$ to 1:

$$f(p) := \begin{cases} p & \text{if } p < 0.5 \\ 1 & \text{otherwise.} \end{cases}$$

If the current sink is 0, the p-value corresponding to the next example 0 is

$$f\left(\frac{n_{0,0} + 1}{n_{0,0} + n_{0,1} + 1}\right)$$

and the p-value corresponding to the next example 1 is (with 0/0 := 1)

$$f\left(\frac{n_{1,0}}{n_{1,0} + n_{1,1}}\right). \qquad (7)$$

If the current sink is 1, the p-value corresponding to the next example 1 is

$$f\left(\frac{n_{1,1} + 1}{n_{1,1} + n_{1,0} + 1}\right)$$

and the p-value corresponding to the next example 0 is (with 0/0 := 1)

$$f\left(\frac{n_{0,1}}{n_{0,1} + n_{0,0}}\right).$$
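These four formulas can be put together in a few lines; a sketch (ours, with helper names of our own choosing):

```python
# Sketch (ours, not code from the paper): TCM p-values for the binary
# Markov model; n[(i, j)] holds the arc counts n_{i,j} of the current summary.
def f(p):
    """Squash [0.5, 1] to 1."""
    return p if p < 0.5 else 1.0

def ratio(a, b):
    """Division with the convention 0/0 := 1."""
    return 1.0 if b == 0 else a / b

def p_values(n, sink):
    """Return (p-value of next example 0, p-value of next example 1)."""
    s, o = sink, 1 - sink              # current sink and the other state
    p_same = f((n[s, s] + 1) / (n[s, s] + n[s, o] + 1))
    p_other = f(ratio(n[o, s], n[o, s] + n[o, o]))
    return (p_same, p_other) if s == 0 else (p_other, p_same)
```

For instance, with $n_{0,0} = 3$, $n_{0,1} = 1$, $n_{1,0} = 1$, $n_{1,1} = 9$ and current sink 0, the p-values are 1.0 for the next example 0 and 0.1 for the next example 1; at significance level 2% both labels would be kept, giving an uncertain prediction.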

Figure 3 shows the result of a computer simulation; as expected, the error line
is close to a straight line with slope equal to the significance level.

Acknowledgments. I am grateful to Per Martin-Löf, Glenn Shafer, Alex Gammerman, Phil Dawid, and participants in the workshop “Statistical Learning in
Classification and Model Selection” (January 2003, Eurandom) for useful discus-
sions. The anonymous referees’ comments helped to improve the presentation.
Gregory Gutin’s advice about graph theory is gratefully appreciated.
This work was partially supported by EPSRC (grant GR/R46670/01), BB-
SRC (grant 111/BIO14428), and EU (grant IST-1999-10226).

References
1. Saunders, C., Gammerman, A., Vovk, V.: Transduction with confidence and credi-
bility. In: Proceedings of the Sixteenth International Joint Conference on Artificial
Intelligence. (1999) 722–726
2. Vovk, V., Gammerman, A., Saunders, C.: Machine-learning applications of algo-
rithmic randomness. In: Proceedings of the Sixteenth International Conference on
Machine Learning, San Francisco, CA, Morgan Kaufmann (1999) 444–453
3. Vovk, V.: On-line Confidence Machines are well-calibrated. In: Proceedings of
the Forty Third Annual Symposium on Foundations of Computer Science, IEEE
Computer Society (2002) 187–196
4. Kolmogorov, A.N.: Combinatorial foundations of information theory and the cal-
culus of probabilities. Russian Mathematical Surveys 38 (1983) 29–40
5. Cox, D.R., Hinkley, D.V.: Theoretical Statistics. Chapman and Hall, London
(1974)
6. Bernardo, J.M., Smith, A.F.M.: Bayesian Theory. Wiley, Chichester (2000)

7. Vovk, V., Nouretdinov, I., Gammerman, A.: Testing exchangeability on-line. In:
Proceedings of the Twentieth International Conference on Machine Learning. (2003)
8. Cramér, H.: Mathematical Methods of Statistics. Princeton University Press,
Princeton, NJ (1946)
9. Wilks, S.S.: Determination of sample sizes for setting tolerance limits. Annals of
Mathematical Statistics 12 (1941) 91–96
10. Guttman, I.: Statistical Tolerance Regions: Classical and Bayesian. Griffin, London
(1970)
11. Tutte, W.T.: Graph Theory. Cambridge University Press, Cambridge (2001)
12. Shiryaev, A.N.: Probability. Second edn. Springer, New York (1996)
13. Kolmogorov, A.N.: Logical basis for information theory and probability theory.
IEEE Transactions on Information Theory IT-14 (1968) 662–664
14. Martin-Löf, P.: The definition of random sequences. Information and Control 9
(1966) 602–619
15. Asarin, E.A.: Some properties of Kolmogorov δ-random finite sequences. Theory
of Probability and its Applications 32 (1987) 507–508
16. Asarin, E.A.: On some properties of finite objects random in the algorithmic sense.
Soviet Mathematics Doklady 36 (1988) 109–112
17. Vovk, V.: On the concept of the Bernoulli property. Russian Mathematical Surveys
41 (1986) 247–248
18. Martin-Löf, P.: Repetitive structures and the relation between canonical and micro-
canonical distributions in statistics and statistical mechanics. In Barndorff-Nielsen,
O., Blæsild, P., Schou, G., eds.: Proceedings of Conference on Foundational Ques-
tions in Statistical Inference, Aarhus (1974) 271–294
19. Lauritzen, S.L.: Extremal Families and Systems of Sufficient Statistics. Volume 49
of Lecture Notes in Statistics. Springer, New York (1988)
20. Vovk, V., Shafer, G.: Kolmogorov’s contributions to the foundations of probability.
Problems of Information Transmission 39 (2003) 21–31
21. Vovk, V.: Asymptotic optimality of Transductive Confidence Machine. In: Proceed-
ings of the Thirteenth International Conference on Algorithmic Learning Theory.
Volume 2533 of Lecture Notes in Artificial Intelligence. (2002) 336–350
22. Vovk, V.: Universal well-calibrated algorithm for on-line classification. In: Pro-
ceedings of the Sixteenth Annual Conference on Learning Theory. (2003)
23. Nouretdinov, I., V’yugin, V., Gammerman, A.: Transductive Confidence Machine
is universal. In Gavaldà, R., Jantke, K.P., Takimoto, E., eds.: Proceedings of
the Fourteenth International Conference on Algorithmic Learning Theory. Volume
2842 of Lecture Notes in Artificial Intelligence. Berlin, Springer (2003)

A Appendix: Proof of Theorem 1


We will use the notation EF for the conditional expectation w.r. to a σ-algebra
F; if necessary, the underlying probability distribution will be given as an upper
index. Similarly, PF will stand for the conditional probability w.r. to F. In this
appendix we will use the following properties of conditional expectation (see,
e.g., [12], §II.7.4):
A. If G and F are σ-algebras, G ⊆ F, ξ and η are bounded F-measurable
random variables, and η is G-measurable, EG (ξη) = η EG (ξ) a.s.

B. If $\mathcal{G}$ and $\mathcal{F}$ are σ-algebras, $\mathcal{G} \subseteq \mathcal{F}$, and ξ is a random variable, $\mathbf{E}^{\mathcal{G}}(\mathbf{E}^{\mathcal{F}}(\xi)) = \mathbf{E}^{\mathcal{G}}(\xi)$ a.s.; in particular, $\mathbf{E}(\mathbf{E}^{\mathcal{F}}(\xi)) = \mathbf{E}(\xi)$.

Proof of the Theorem

This proof is a generalization of the proof of Theorem 1 in [3], with the same
basic idea: to show that (p1 , . . . , pN ) is distributed as U N (it is easy to get rid of
the assumption of a fixed horizon N ), we reverse the time. Let P be the distri-
bution generating the examples; it is assumed to agree with the OCM. Imagine
that the sample (z1 , . . . , zN ) is generated in two steps: first, the summary σN
is generated from some probability distribution (namely, the image of the dis-
tribution P generating z1 , z2 , . . . under the mapping tN ), and then the sample
(z1 , . . . , zN ) is chosen randomly from PN (· | σN ). Already the second step en-
sures that, conditionally on knowing σN (and, therefore, unconditionally), the
sequence (pN , . . . , p1 ) is distributed as U N . Indeed, roughly speaking (i.e., ig-
noring borderline effects), pN will be the p-value corresponding to the statistic
AN and so distributed, at least approximately, as U (see, e.g., [5], §3.2); when
the pair (σN −1 , zN ) is disclosed, the value pN will be settled; conditionally on
knowing σN −1 and zN , pN −1 will also be distributed as U , and so on.
We start the formal proof by defining the σ-algebra Gn , n = 0, 1, 2, . . . ,
as the one on the sample space (Z × [0, 1])∞ generated by the random elements
σn , zn+1 , θn+1 , zn+2 , θn+2 , . . . . In particular, G0 (the most informative σ-algebra)
coincides with the original σ-algebra on (Z × [0, 1])∞ ; G0 ⊇ G1 ⊇ · · · .
Fix a randomised confidence transducer f ; it will usually be left implicit
in our notation. Let pn be the random variable f (z1 , θ1 , . . . , zn , θn ) for each
n = 1, 2, . . . ; P will refer to the probability distribution P × U ∞ (over examples
zn and random numbers θn ) and E to the expectation w.r. to P. The proof will
be based on the following lemma.
Lemma 3. For any trial n and any δ ∈ [0, 1],

$$\mathbf{P}^{\mathcal{G}_n}\{p_n \le \delta\} = \delta. \qquad (8)$$
Proof. Let us fix a summary σn of the first n examples (z1 , . . . , zn ) ∈ Zn ; we
will omit the condition “ | σn ”. For every pair (σ̃, z̃) from Fn−1 (σn ) define
p+ (σ̃, z̃) := Bn {(σ, z) : An (σ, z) ≥ An (σ̃, z̃)} ,
p− (σ̃, z̃) := Bn {(σ, z) : An (σ, z) > An (σ̃, z̃)} .
It is clear that always p− ≤ p+ . Notice that the semi-closed intervals
[p− (σ̃, z̃), p+ (σ̃, z̃)), (σ̃, z̃) ∈ Σ × Z, either coincide or are disjoint; it is also
easy to see that they “lie next to each other”, in the sense that their union is
also a semi-closed interval (namely, [0, 1)).
Let us say that a pair (σ̃, z̃) is
– strange if p+ (σ̃, z̃) ≤ δ
– ordinary if p− (σ̃, z̃) > δ
– borderline if p− (σ̃, z̃) ≤ δ < p+ (σ̃, z̃).

We will use the notation p− := p− (σ̃, z̃) and p+ := p+ (σ̃, z̃) where (σ̃, z̃) is
any borderline example. Notice that the Bn -measure of strange examples is p− ,
the Bn -measure of ordinary examples is 1−p+ , and the Bn -measure of borderline
examples is p+ − p− .
By the definition of rCT, pn ≤ δ if the pair (σn−1 , zn ) is strange, pn > δ if
the pair is ordinary, and pn ≤ δ with probability
$$\frac{\delta - p^-}{p^+ - p^-} \qquad (9)$$
if the pair is borderline; indeed, in this case

$$p_n = p^- + \theta_n (p^+ - p^-),$$

and so pn ≤ δ is equivalent to
$$\theta_n \le \frac{\delta - p^-}{p^+ - p^-}.$$
Therefore, the overall probability that pn ≤ δ is
$$p^- + \frac{\delta - p^-}{p^+ - p^-}\,(p^+ - p^-) = \delta.$$
The other basic result that we will need is the following lemma.
Lemma 4. For any trial n = 1, 2, . . . , pn is Gn−1 -measurable.
Proof. Fix a trial n and δ ∈ [0, 1]. We are required to prove that the event
{pn ≤ δ} is Gn−1 -measurable. This follows from the definition, (3): pn is defined
in terms of σn−1 , zn and θn .
Fix temporarily positive integer N . First we prove that, for any n = 1, . . . , N
and any δ1 , . . . , δn ∈ [0, 1],

$$\mathbf{P}^{\mathcal{G}_n}\{p_n \le \delta_n, \dots, p_1 \le \delta_1\} = \delta_n \cdots \delta_1. \qquad (10)$$

The proof is by induction on n. For n = 1, (10) immediately follows from Lemma 3. For n > 1 we obtain, making use of Lemmas 3 and 4, properties A and B of conditional expectations, and the inductive assumption:

$$\mathbf{P}^{\mathcal{G}_n}\{p_n \le \delta_n, \dots, p_1 \le \delta_1\} = \mathbf{E}^{\mathcal{G}_n}\left(\mathbf{E}^{\mathcal{G}_{n-1}}\left(I_{\{p_n \le \delta_n\}}\, I_{\{p_{n-1} \le \delta_{n-1}, \dots, p_1 \le \delta_1\}}\right)\right)$$
$$= \mathbf{E}^{\mathcal{G}_n}\left(I_{\{p_n \le \delta_n\}}\, \mathbf{E}^{\mathcal{G}_{n-1}}\left(I_{\{p_{n-1} \le \delta_{n-1}, \dots, p_1 \le \delta_1\}}\right)\right) = \mathbf{E}^{\mathcal{G}_n}\left(I_{\{p_n \le \delta_n\}}\right) \delta_{n-1} \cdots \delta_1 = \delta_n \delta_{n-1} \cdots \delta_1$$

($I_E$ being the indicator of event $E$) almost surely.


By property B, (10) immediately implies

$$\mathbf{P}\{p_N \le \delta_N, \dots, p_1 \le \delta_1\} = \delta_N \cdots \delta_1.$$

Therefore, we have proved that the distribution of the random sequence $p_1 p_2 \cdots \in [0, 1]^\infty$ coincides with $U^\infty$ on the σ-algebra $\mathcal{F}_N$ generated by the
first N coordinate random variables p1 , . . . , pN . It is well known (see, e.g., [12],
Theorem II.3.3) that this implies that the distribution of p1 p2 . . . coincides with
U ∞ on all measurable sets in [0, 1]∞ .

B Appendix: Kolmogorov’s Programme and Repetitive Structures

In this section we briefly discuss Kolmogorov’s programme for applications of probability and two related developments originated by Martin-Löf and Freedman; in particular, we formally define a version of the notion of repetitive structure which is in a sense isomorphic to our notion of OCM.

B.1 Kolmogorov’s Programme

The standard approach to modelling uncertainty is to choose a family of probability distributions (a statistical model), one of which is believed to be the true
distribution generating, or explaining in a satisfactory way, the data. (In some
applications of probability theory, the true distribution is assumed to be known,
and so the statistical model is a one-element set. In Bayesian statistics, the sta-
tistical model is complemented by another element, a prior distribution on the
distributions in the model.) All modern applications of probability depend on
this scheme.
In 1965–1970 Kolmogorov suggested a different approach to modelling un-
certainty based on information theory; its purpose was to provide a more direct
link between the theory and applications of probability. His main idea was that
“practical conclusions of probability theory can be substantiated as implications
of hypotheses of limiting, under given constraints, complexity of the phenom-
ena under study” [4]. The main features of Kolmogorov’s programme can be
described as follows:

C (Compression): One fixes a “sufficient statistic” for the data. This is a function of the data that extracts, intuitively, all useful information from the
data. This can be the number of ones in a binary sequence (the “Bernoulli
model” [13,14]), the number of ones after ones, ones after zeros, zeros after
ones and zeros after zeros in a binary sequence (the “Markov model” [4]),
the sample average and sample variance of a sequence of real numbers (the
“Gaussian model” [15,16]).
A (Algorithmic): If the value of the sufficient statistic is known, the infor-
mation left in the data is noise. This is formalized in terms of Kolmogorov
complexity: the complexity of the data under the constraint given by the
value of the sufficient statistic should be maximal (in other words, the data
should be algorithmically random given the value of the sufficient statistic).

U (Uniformity): Semantically, the requirement of algorithmic randomness in the previous item means that the conditional distribution of the data given
the sufficient statistic is uniform.
D (Direct): It is preferable to deduce properties of data sets directly from
the assumption of limiting complexity, without a detour through standard
statistical models (examples of such direct inferences are given in [15,16] and
hinted at in [4]), especially that Kolmogorov’s models are not completely
equivalent to standard statistical models [17].

(Kolmogorov’s only two publications on his programme are [4,13]; the work
reported in [14]–[17] was done under his supervision by his PhD students.)
After 1965 Kolmogorov and Martin-Löf worked on the information-theoretic
approach to probability applications independently of each other, but arrived at
similar concepts and definitions. In 1973 [18] Martin-Löf introduced the notion
of repetitive structure, later studied by Lauritzen [19]. Martin-Löf’s theory of
repetitive structures has features C and U of Kolmogorov’s programme but not
features A and D. An extra feature of repetitive structures is their on-line character: the conditional probability distributions are required to be consistent and
the sufficient statistic can usually be updated recursively as new data arrives.
The absence of algorithmic complexity and randomness from Martin-Löf’s
theory does not look surprising; e.g., it is argued in [20] that these algorithmic
notions are powerful sources of intuition, but for stating mathematical results in
their strongest and most elegant form it is often necessary to “translate” them
into a non-algorithmic form.
A more serious deviation from Kolmogorov’s ideas seems to be the absence
of “direct inferences”. The goal in the theory of repetitive structures is to derive
standard statistical models from repetitive structures (in the asymptotic on-
line setting the difference between Kolmogorov-type and standard models often
disappears); to apply repetitive structure to reality one still needs to go through
statistical models. In our approach (see Theorem 1 above or the optimality
results in [21,22]) statistical models become irrelevant.
Freedman and Diaconis independently came up with ideas similar to Kol-
mogorov’s (Freedman’s first paper in this direction was published in 1962); they
were inspired by de Finetti’s theorem and the Krylov-Bogolyubov approach to
ergodic theory.
Kolmogorov only considered the three models we discuss in §4–6, but many
other models have been considered by later authors (see, e.g., [6]).
The difference between standard statistical modelling and Kolmogorov’s
modelling discussed in [17] is not important for the purpose of one-step-ahead
forecasting in the exchangeable case (in particular, for both exchangeability and
Gaussian models of this paper; see [23]); it becomes important, however, in the
Markov case. The theory of prediction with confidence has a dual goal: valid-
ity (there should not be too many errors) and quality (there should not be too
many uncertain predictions). In the asymmetric Markov case, although we have
the validity result (Theorem 1), there is little hope of obtaining an optimality
result analogous to those of [21,22]. A manifestation of the difference between

the two approaches to modelling is, e.g., the fact that (7) involves the ratio
n1,0 /(n1,0 + n1,1 ) rather than something like n0,1 /(n0,0 + n0,1 ).

B.2 Repetitive Structures


Let Σ and Z be measurable spaces (of “summaries” and “examples”, respec-
tively). An OCM-repetitive structure consists of the following two elements:
– a system of statistics (measurable functions) tn : Zn → Σ, n = 1, 2, . . . ;
– a system of kernels Pn : Σ → Zn , n = 1, 2, . . . .
These two elements are required to satisfy the following consistency require-
ments:
Agreement between Pn and tn: for each σ ∈ tn(Zn), the probability distribution Pn(· | σ) is concentrated on the set $t_n^{-1}(\sigma)$;
Consistency of tn over n: for all integers n > 1, tn (z1 , . . . , zn ) is determined
by tn−1 (z1 , . . . , zn−1 ) and zn , in the sense that the function tn is measurable
w.r. to the σ-algebra generated by tn−1 and zn .
Consistency of Pn over n: for all integers n > 1, all σ ∈ tn (Zn ), all
τ ∈ tn−1 (Zn−1 ), and all z ∈ Z, Pn−1 (· | τ ) should be a version of the
conditional distribution of z1 , . . . , zn−1 when z1 , . . . , zn is generated from
Pn (dz1 , . . . , dzn | σ) and it is known that tn−1 (z1 , . . . , zn−1 ) = τ and zn = z.
Remark 1. We say “OCM-repetitive structures” instead of “repetitive struc-
tures” since the latter are defined by different authors differently. Martin-Löf [18]
is only interested in uniform Pn , does not have the condition that tn should be
computable from tn−1 and zn among his requirements, and his requirement of
consistency of Pn over n involves conditioning on tn−1 = τ only (not on zn = z).
Lauritzen’s ([19], p. 207) repetitive structures do not involve any probabili-
ties (which enter the picture through parametric “projective statistical fields”).
Bernardo and Smith [6] do not use this term at all.
The notions of OCM and OCM-repetitive structure are very close. If M = (Σ, □, Z, (Fn), (Bn)) is an OCM, then M′ := (Z, Σ, (tn), (Pn)), as defined in §2, is an OCM-repetitive structure. If M = (Z, Σ, (tn), (Pn)) is an OCM-repetitive structure, an OCM M′ := (Σ, □, Z, (Fn), (Bn)) can be defined as follows:
– Fn is a measurable function mapping tn−1 (z1 , . . . , zn−1 ) and zn to
tn (z1 , . . . , zn ), for all (z1 , . . . , zn ) ∈ Zn (the existence of such Fn follows
from the consistency of tn over n);
– Bn (dσn−1 , dzn | σn ) is the image of the distribution Pn (dz1 , . . . , dzn | σn )
under the mapping (z1 , . . . , zn ) → (σn−1 , zn ), where σn−1 :=
tn−1 (z1 , . . . , zn−1 ).
If M is an OCM-repetitive structure, (M′)′ is essentially the same as M, and if M is an OCM, (M′)′ is essentially the same as M.
In our examples (exchangeability, Gaussian and Markov models) we found it
more convenient to start from the statistics tn and distributions Pn ; the condi-
tions of consistency were obviously satisfied in those cases.
Transductive Confidence Machine Is Universal

Ilia Nouretdinov, Vladimir V’yugin, and Alex Gammerman

Computer Learning Research Centre, Royal Holloway, University of London,
Egham, Surrey TW20 0EX, England

Abstract. Vovk’s Transductive Confidence Machine (TCM) is a practical prediction algorithm giving, in addition to its predictions, confidence
information valid under the general iid assumption. The main result of
this paper is that the prediction method used by TCM is universal un-
der a natural definition of what “valid” means: any prediction algorithm
providing valid confidence information can be replaced, without losing
much of its predictive performance, by a TCM. We use as the main tool
for our analysis the Kolmogorov theory of complexity and algorithmic
randomness.

1 Introduction

In the last several decades new powerful machine-learning algorithms have ap-
peared. A serious shortcoming of most of these algorithms, however, is that
they do not directly provide any measures of confidence in the predictions they
output. Two of the most important traditional ways to obtain such confidence
information are provided by PAC theory (a typical result that can be used is Lit-
tlestone and Warmuth’s theorem; see, e.g., [3]) and Bayesian theory. The former
is discussed in detail in [9] and the latter is discussed in [8], but disadvantages of
the traditional approaches can be summarized as follows: PAC bounds are valid
under the general iid assumption but are too weak for typical problems encoun-
tered in practice to give meaningful results; Bayesian bounds give practically
meaningful results, but are only valid under strong extra assumptions.
Vovk [4,16,14,11,12,17] proposed a practical (as confirmed by numerous em-
pirical studies reported in those papers) method of computing confidence infor-
mation valid under the general iid assumption. Vovk’s Transductive Confidence
Machine (TCM) is based on a specific formula

$$p = \frac{|\{i : \alpha_i \ge \alpha_{l+1}\}|}{l + 1},$$
where αi are numbers representing some measures of strangeness (cf. (1) in
Section 2). A natural question is whether there are better ways to produce
valid confidence information. In this paper (Sections 3 and 6) we show that the
first-order answer is “no”: no way of producing valid confidence information is
drastically better than TCM. We present our results in terms of Kolmogorov’s
theory of algorithmic complexity and randomness.

R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 283–297, 2003.

© Springer-Verlag Berlin Heidelberg 2003
284 I. Nouretdinov, V. V’yugin, and A. Gammerman

2 Prediction Using TCM

Suppose we have two sets: the training set (x1 , y1 ), . . . , (xl , yl ) and the test set
(xl+1 , yl+1 ) containing only one example. The unlabelled examples xi are drawn
from a set X and the labels yi are drawn from a finite set Y; we assume that |Y|
is small (i.e., we consider the problem of classification with a small number of
classes) 1 . The examples (xi , yi ) are assumed to be generated by some probability
distribution P (same for all examples) independently of each other; we call this
the iid assumption.
Set Z := X × Y. For any l, a sequence $z^l = z_1, \dots, z_l$ defines a multiset B of all elements of this sequence, where each element z ∈ B is supplied with its arity n(z) = |{j : zj = z}|. We call a multiset B of this type a bag. Its size |B| is defined as the sum of the arities of all its elements. The bag defined by a sequence $z^l$ is also called the configuration of this sequence; to be precise, define the standard representation of this bag as the set $\mathrm{con}(z^l) = \{(z_1, n(z_1)), \dots, (z_l, n(z_l))\}$.
In this paper we discuss four natural ways of predicting with confidence,
which we call Randomness Predictors, Exchangeability Predictors, Invariant Ex-
changeability Predictors, and Transductive Confidence Machines. We start with
the latter (following the papers mentioned above).
An individual strangeness measure is a family of functions An , n = 1, 2, . . .,
such that each An maps every pair (B, z), where B is a bag of n − 1 elements of
Z and z is an element of Z, to a real (typically non-negative) number An (B, z).
(Intuitively, An (B, z) measures how different z is from the elements of B). The
Transductive Confidence Machine associated with An works as follows: when
given the data
(x1 , y1 ), . . . , (xl , yl ), xl+1

(the training set and the known component xl+1 of the test example), every
potential classification y of xl+1 is assigned the p-value

$$p(y) := \frac{|\{i : \alpha_i \ge \alpha_{l+1}\}|}{l + 1}, \qquad (1)$$

where
αi := Al+1 (con(z1 , . . . , zi−1 , zi+1 , . . . , zl+1 ), zi ),

zj = (xj , yj ) (except zl+1 = (xl+1 , y)), and con(z1 , . . . , zi−1 , zi+1 , . . . , zl+1 ) is a
bag. TCM’s output p : Y → [0, 1] can be further packaged in two different ways:

– we can output arg maxy p(y) as the prediction and say that 1 − p2 , where p2
is the second largest p-value, is the confidence and that the largest p-value
p1 is the credibility;
– or we can fix some conventional threshold δ (such as 1% or 5%) and output
as our prediction (predictive region) the set of all y such that p(y) > δ.
1 By |Y| we mean the cardinality of the set Y.

The essence of TCM is formula (1). The following simple example illustrates a definition of individual strangeness measure in the spirit of the 1-Nearest Neighbour Algorithm (we assume that objects are vectors in a Euclidean space):

$$\alpha_i = \frac{\min_{j \ne i :\, y_j = y_i} d(x_i, x_j)}{\min_{j \ne i :\, y_j \ne y_i} d(x_i, x_j)},$$

where $d$ is the Euclidean distance (i.e., an object is considered strange if it is in the midst of objects labelled in a different way and is far from the objects labelled in the same way). For other examples of TCM (and corresponding algorithms computing $\alpha_i$), see the papers referred to above.
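A minimal sketch (ours, not from the paper) combining formula (1) with this 1-Nearest Neighbour strangeness measure, simplified to one-dimensional objects; the helper names are ours:

```python
# Sketch (ours): TCM p-values via the 1-NN strangeness measure, 1-D objects.
def strangeness(i, zs):
    """(distance to nearest same-label object) / (distance to nearest other-label object)."""
    x, y = zs[i]
    same = min((abs(x - xj) for j, (xj, yj) in enumerate(zs)
                if j != i and yj == y), default=float('inf'))
    diff = min((abs(x - xj) for j, (xj, yj) in enumerate(zs)
                if j != i and yj != y), default=float('inf'))
    return same / diff                 # assumes both label classes occur

def tcm_p_values(train, x_new, labels):
    """p(y) = |{i : alpha_i >= alpha_{l+1}}| / (l + 1) for each candidate label y."""
    p = {}
    for y in labels:
        zs = train + [(x_new, y)]
        alphas = [strangeness(i, zs) for i in range(len(zs))]
        p[y] = sum(a >= alphas[-1] for a in alphas) / len(zs)
    return p
```

On the toy training set (0, 'a'), (1, 'a'), (10, 'b'), (11, 'b') with test object 2, this yields p('a') = 0.4 and p('b') = 0.2, so the prediction is 'a', with confidence 1 − 0.2 = 0.8 and credibility 0.4.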

3 Specific Randomness
Next we define Randomness Predictors (RP). First, we consider a typical example from statistics. Let Zn be a sample space and Qn be a sequence of probability distributions in Zn, n = 1, 2, . . .. Let fn(ω) be a sequence of statistics, i.e., real-valued functions on Zn. The function

$$t_n(\omega) = Q_n\{\alpha : f_n(\alpha) \ge f_n(\omega)\}$$

is called the p-value and satisfies

$$Q_n\{\omega : t_n(\omega) \le \gamma\} \le \gamma \qquad (2)$$

for any real number γ. Outcomes ω with a small p-value have small probability; they should be considered as almost impossible from the standpoint of the holder of the measure Qn.
The notion of p-value can be easily extended to the case where for any n we
consider a class of probability distributions Qn in Zn :

$$t_n(\omega) = \sup_{Q \in \mathcal{Q}_n} Q\{\alpha : f_n(\alpha) \ge f_n(\omega)\}. \qquad (3)$$

This function satisfies

$$\sup_{Q \in \mathcal{Q}_n} Q\{\omega : t_n(\omega) \le \gamma\} \le \gamma \qquad (4)$$

for all γ. We fix the properties (2) and (4) as basic for the following definitions.
Suppose that for any n a probability distribution Qn in Zn is given. We say that a sequence of functions tn(ω) : Zn → [0, 1] is a Qn-randomness test (p-test) if it satisfies inequality (2) for any γ. Analogously, suppose that for any n a class Qn of probability distributions in Zn is given. We say that a sequence of functions tn(ω) is a Qn-randomness test if inequality (4) holds for any γ. We call inequality (2) or (4) the validity property of a test.
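For a finite sample space the validity property can be checked exhaustively. A toy sketch (ours, not from the paper), with the statistic being the number of ones under a product measure for a biased coin:

```python
# Toy sketch (ours): the p-value t(w) = Q{f >= f(w)} on a finite sample space,
# with an exhaustive check of the validity property (2).
from itertools import product

def p_test(space, q, f):
    """t(w) for a finite distribution q mapping outcomes to probabilities."""
    return {w: sum(q[a] for a in space if f(a) >= f(w)) for w in space}

# example: three tosses of a biased coin, statistic = number of ones
space = list(product([0, 1], repeat=3))
q = {w: 0.2 ** sum(w) * 0.8 ** (3 - sum(w)) for w in space}
t = p_test(space, q, sum)

# validity: Q{w : t(w) <= gamma} <= gamma for every threshold gamma
for gamma in sorted(set(t.values())):
    assert sum(q[w] for w in space if t[w] <= gamma) <= gamma + 1e-9
```

Here the bound (2) holds with equality at every achievable threshold, since the statistic induces nested rejection regions.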
We will consider two important statistical models on the sequence of sample
spaces Zn . The iid model Qiid n , n = 1, 2, . . ., is defined for any n by the class

of all probability distributions in Zn of the form $Q_n = P^n$, where $P$ is some probability distribution in Z and $P^n$ is its n-fold product. Instead of (1), we now define the Randomness Predictor (RP)

$$p(y) := t_{l+1}(z^l, (x_{l+1}, y)), \qquad (5)$$

where $t_{l+1}$ is a $\mathcal{Q}^{iid}_{l+1}$-randomness test and $z^l = (x_1, y_1), \dots, (x_l, y_l)$. Using this function, we can define the corresponding predictive region, or the prediction, confidence, and credibility, as above.
The exchangeability model $\mathcal{Q}^{exch}_n$, n = 1, 2, . . ., uses exchangeable probability distributions. A probability distribution P in Zn is exchangeable if, for any permutation π : {1, . . . , n} → {1, . . . , n} and any data sequence $z_1, \dots, z_n \in \mathbf{Z}^n$, $P(z_1, \dots, z_n) = P(z_{\pi(1)}, \dots, z_{\pi(n)})$.
A sequence of functions tn : Zn → [0, 1], n = 1, 2, . . ., is an exchangeability
test if, for every n, any exchangeable probability distribution P in Zn , and any
γ ∈ [0, 1],
P {(z1 , . . . , zn ) ∈ Zn | tn (z1 , . . . , zn ) ≤ γ} ≤ γ. (6)
If we now define p(y) by the same formula (5), we obtain the notion of an
Exchangeability Predictor (EP).
If we further require tn to be invariant, in the sense that tn (z1 , . . . , zn ) does
not change if any zi and zj , i, j = 1, . . . , n − 1, are swapped, then we arrive at
the notion of an Invariant Exchangeability Predictor (IEP).
Our first proposition asserts that TCM and IEP are essentially the same
notion. Formally, we identify TCM, RP, EP, and IEP with the functions mapping
(x1 , y1 ), . . . , (xl , yl ), xl+1 to the function p = p(x1 ,y1 ),...,(xl ,yl ),xl+1 : Y → [0, 1],
according to (1) or (5), respectively. We say that a predictor (TCM, RP, EP, or
IEP) p^A_{z^l ,x_{l+1}} (y) is (at least) as good as a predictor p^B_{z^l ,x_{l+1}} (y) if, for any training
set z^l = (x1 , y1 ), . . . , (xl , yl ), any unlabelled test example x_{l+1} , and any label y,

p^A_{z^l ,x_{l+1}} (y) ≤ p^B_{z^l ,x_{l+1}} (y).     (7)
We say that a class A (such as TCM or RP) of predictors is as good as a


class B of predictors if for any B ∈ B there exists A ∈ A such that A is as good
as B (i.e., if every predictor in B can be replaced by an equally good or better
predictor in A).
Proposition 1. Transductive Confidence Machines are as good as Invariant
Exchangeability Predictors, and vice versa.
Proof. For simplicity we will assume that X is finite. First we show that Trans-
ductive Confidence Machines are Invariant Exchangeability Predictors; we only
need to check the validity property
P {z l , (xl+1 , y) | pzl ,xl+1 (y) ≤ γ} ≤ γ
for the values p(y) = pzl ,xl+1 (y) computed according to (1), where P is an
exchangeable distribution which generates (x1 , y1 ), . . . , (xl , yl ), (xl+1 , y), and
Transductive Confidence Machine Is Universal 287

z^l = (x1 , y1 ), . . . , (xl , yl ). Invariance is obvious. The inequality p_{z^l ,x_{l+1}} (y) ≤ γ means
that α_{l+1} is among the top 100γ% of the list α1 , . . . , α_{l+1} , where each element
is repeated according to its arity; validity follows from the fact that all
permutations of the αi are P -equiprobable.
To show that Invariant Exchangeability Predictors can be replaced by Trans-
ductive Confidence Machines, we have to explicitly construct α’s. Suppose we
are given an IEP generated by an invariant exchangeability test t. If B is a bag
in Z of size l and z ∈ Z, define

A_{l+1} (B, z) = 1/t_{l+1} (z1 , . . . , zl , z),

where z1 , . . . , zl is a list of all elements of B with repetitions (in any order;
because of invariance, the order does not matter). The corresponding TCM will
be as good as the IEP. □
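The validity argument in the first half of the proof — p(y) ≤ γ exactly when α_{l+1} is among the top 100γ% of the strangeness scores — can be checked numerically. The following is a minimal sketch of our own (the function name, the choice of iid scores, and the Monte-Carlo check are illustrative assumptions, not code from the paper):

```python
import random

random.seed(0)

def tcm_p_value(alphas):
    """p-value of the last example: the fraction of strangeness scores
    alpha_1, ..., alpha_{l+1} that are at least alpha_{l+1}."""
    return sum(a >= alphas[-1] for a in alphas) / len(alphas)

# Monte-Carlo check of validity: under an exchangeable distribution
# (here: iid scores) the event {p <= gamma} has probability <= gamma.
gamma, trials, hits = 0.1, 20000, 0
for _ in range(trials):
    alphas = [random.random() for _ in range(20)]  # iid, hence exchangeable
    hits += tcm_p_value(alphas) <= gamma
print(hits / trials)  # typically close to, and not above, gamma
```

With continuous iid scores the p-value is uniform on {1/(l+1), . . . , 1}, so P{p ≤ 0.1} is exactly 0.1 for l + 1 = 20; in a TCM the αi would be the strangeness values produced by the invariant function A.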
It is clear that EP are as good as IEP and that RP are as good as EP. In the
next sections we will see that the opposite relations also hold, in some weaker
sense. To prove this we need the notion of an optimal randomness test from the theory
of algorithmic randomness.

4 Algorithmic Randomness

4.1 Uniform Tests for Randomness

We refer readers to [7] for details of the theory of Kolmogorov complexity and
algorithmic randomness. We will also consider a logarithmic scale for tests (log-
tests of randomness)²:

dn (x|P, q) = − log tn (x|P, q),

where tn (x|P, q) is a randomness test satisfying (2) or (4).


In this case the validity property (2) of a test must be replaced by

P {z^n : dn (z^n |P, q) ≥ m} ≤ 2^{−m}     (8)

for all n, m, P and q, where z n = z1 , . . . , zn ∈ Zn . So, in the following sections


we consider log-tests, for example, iid log-tests or log-tests of exchangeability.
We present our final results (Corollaries 2 and 3) for tests defined in the direct
scale (4).
Let us define a notion of optimal uniform randomness test. Recall that Z :=
X × Y. Let N be the set of all positive integers and Q the set of all
rational numbers. We consider the discrete topology on these sets. Let R be the
set of all real numbers. We also need a certain “computationally effective” topology
² In the following all logarithms are to the base 2. Below q is a parameter from some
set S of constructive objects. Any algorithm computing values of the test uses q as
input.
in the set P(Zn ) of all probability distributions in Zn . This topology is generated
by the intervals

{P ∈ P(Zn ) : a1 < P (ω1 ) < b1 , . . . , ak < P (ωk ) < bk },

where ωi ∈ Zn and ai < bi ∈ Q, i = 1, . . . , k, k ∈ N. An open set U is called
effectively open if it can be represented as a union of a recursively enumerable
set of intervals. A family of real-valued functions dn (ω|P, q) from ω ∈ Zn and
P ∈ P(Zn ), q ∈ S to R ∪ {+∞} is called lower semicomputable if the set

{(n, ω, r, P, q) : n ∈ N, ω ∈ Zn , r ∈ Q, q ∈ S, r < dn (ω|P, q)}     (9)

is effectively open in the product topology of the set³ D = N × Zn × Q × P(Zn ) × S.
This means that some algorithm, given n, q and finite approximations of P ,
enumerates from below rational approximations of the value of the test.

Proposition 2. There exists an optimal uniform randomness log-test dn , which
means the following: dn is lower semicomputable and, for any other lower semi-
computable uniform randomness log-test d′n ,⁴

dn (ω|P, q) + O(1) ≥ d′n (ω|P, q).

The proof of this proposition uses the well-known idea of universality from the
Kolmogorov theory of algorithmic randomness.
We fix some optimal uniform randomness log-test dn (ω|P, q). The value
dn (ω|P, q) is called the randomness level of the sequence ω with respect to P .
The parameter q will be used only in Section 4.2, for technical reasons. In the following
we usually fix some q ∈ S and omit this variable from the notation of the test.
Using the direct scale we consider the optimal uniform randomness test

δn (ω|P ) = 2^{−dn (ω|P )}

satisfying (2). This test is minimal up to a multiplicative constant in the class of
all upper semicomputable tests satisfying (2).⁵
It is easy to verify that Proposition 2 and all considerations above remain
valid if we restrict ourselves to P ∈ Q^iid_n or to P ∈ Q^exch_n . So, we can consider
uniform optimal tests of randomness with respect to the classes of iid or exchange-
able probability distributions. More precisely, analogously to the definition (3)
³ This topology is generated by intervals, which can be considered as constructive
objects (more precisely, any such interval has a standard constructive representation).
⁴ a(x1 , . . . , xn ) ≤ b(x1 , . . . , xn ) + O(1) or a(x1 , . . . , xn ) − O(1) ≤ b(x1 , . . . , xn ) means
that a constant c ≥ 0 exists such that a(x1 , . . . , xn ) ≤ b(x1 , . . . , xn ) + c holds for
all values (of free variables) x1 , . . . , xn . a(x1 , . . . , xn ) = b(x1 , . . . , xn ) + O(1) means
that both a(x1 , . . . , xn ) ≤ b(x1 , . . . , xn ) + O(1) and a(x1 , . . . , xn ) ≥ b(x1 , . . . , xn ) − O(1) hold.
Relations with the product sign are treated analogously, using multiplicative factors.
⁵ A definition of upper semicomputability can be obtained from the definition of lower
semicomputability (9) by replacing < with >.
we define the optimal log-test with respect to a sequence Q = {Qn } of classes of
probability distributions in Zn , n = 1, 2, . . .,

d^Q_n (z1 , . . . , zn ) = inf_{P ∈Qn} d_n (z1 , . . . , zn |P ),     (10)

where z1 , . . . , zn ∈ Zn . So, an optimal iid-log-test d^iid_n (z1 , . . . , zn ) corresponds to
the iid model Q^iid_n . In the direct scale the iid-test is represented as

δ^iid_n (z1 , . . . , zn ) = 2^{−d^iid_n (z1 ,...,zn )} .

Analogously, an optimal uniform exchangeability log-test d^exch_n (δ^exch_n ) is de-
fined. To define the optimal invariant exchangeability test d^invexc_n (δ^invexc_n ) we
consider in Proposition 2 only invariant log-tests. These optimal tests d^iid_n (δ^iid_n ),
d^exch_n (δ^exch_n ) and d^invexc_n (δ^invexc_n ) determine the Optimal Randomness Predic-
tor, Optimal Exchangeability Predictor, and Optimal Invariant Exchangeability
Predictor, respectively.
The main goal of this paper is to prove the following approximate equality⁶

δ^invexc (z^l , (x_{l+1} , y)) ≈ δ^iid (z^l , (x_{l+1} , y))     (11)

if the data set z l = (x1 , y1 ), . . . , (xl , yl ), xl+1 is random and the set Y is small.
This shows the universality of TCM: the optimal IEP (equivalently, TCM; see
Proposition 1) is about as good as the optimal RP. The precise statement involves
a multiplicative constant C; this is inevitable since randomness and exchange-
ability levels are only defined to within a constant factor (in direct scale). We
will prove this assertion in the following way. Approximate equality (11) will be
split into two:

δ^invexc (z^l , (x_{l+1} , y)) ≈ δ^exch (z^l , (x_{l+1} , y)),     (12)

δ^exch (z^l , (x_{l+1} , y)) ≈ δ^iid (z^l , (x_{l+1} , y))     (13)

(Theorem 1, Section 5 and Theorem 2, Section 6 below).

4.2 p- and i-tests

The definition of an i-test is obtained if we replace the validity property (8) by
the stronger requirement

∫ 2^{d^i_n (z^n |P,q)} dP (z^n ) ≤ 1.     (14)
We call a log-test satisfying (8) a p-log-test. It is easy to verify that Proposition 2
holds for i-tests; relations (11), (12) and (13) are also valid for i-tests. By the Cheby-
shev inequality each i-test is a p-test. The following proposition gives the relation
between the optimal p- and i-tests.
⁶ In the following we omit the lower index l + 1 in the notation of the test.
Proposition 3. Let dpn (z n |P, q) be the optimal p-log-test and din (z n |P, q) be the
optimal i-log-test. Then

di (z n |P ) − O(1) ≤ dp (z n |P ) ≤ di (z n |P, d(z n |P )) + O(1).

Proof. Let dpn (z n |P, q) be the optimal p-log-test and din (z n |P, q) be the optimal
i-log-test. Then the proposition asserts⁷

di (z n |P ) − O(1) ≤ dp (z n |P ) ≤ di (z n |P, d(z n |P )) + O(1).

The first inequality is obvious. To prove the second one, note that the lower
semicomputable function

m − 1 if dp (z n |P ) ≥ m,
ψ(z |P, m) =
n
−1 otherwise
is an i-log-test. Indeed,
  
2ψ(z |P,m) dP (z n ) = 2−1 dP (z n ) ≤ 1.
n
2m−1 dP (z n ) +
z n :dp (z n |P )≥m

Then by definition of optimal i-test

di (z n |P, m) ≥ ψ(z n |P, m) − O(1).

Putting m = dp (z n |P ) in this inequality we obtain

di (z n |P, d(z n |P )) ≥ dp (z n |P ) − O(1).



The relation between conditional and unconditional i-tests is presented by
the following proposition.
Proposition 4. Let k ∈ N. Then

di (z n |P ) − O(1) ≤ di (z n |P, k) ≤ di (z n |P ) + 2 log k + O(1).

Proof. The first inequality is obvious. To prove the second inequality, note
that the function

ψ(z^n |P ) = log Σ_{k=2}^{∞} k^{−2} 2^{d^i (z^n |P,k)}

is an i-log-test. Indeed, it is lower semicomputable and
  
∫ 2^{ψ(z^n |P )} dP (z^n ) = Σ_{k=2}^{∞} k^{−2} ∫ 2^{d^i (z^n |P,k)} dP (z^n ) ≤ Σ_{k=2}^{∞} k^{−2} ≤ 1.

Then d^i (z^n |P ) + O(1) ≥ ψ(z^n |P ) ≥ d^i (z^n |P, k) − 2 log k. □



From Propositions 3 and 4 we obtain the following.

⁷ We omit the lower index, i.e. we write d(z^n |P, q) instead of dn (z^n |P, q). We also omit
the parameter q when it is not used.
Corollary 1. dp (z n |P ) + O(1) ≥ di (z n |P ) ≥ dp (z n |P ) − 2 log dp (z n |P ) − O(1).


We use i-tests since they simplify our proofs, but we formulate our main result,
Theorem 2, for p-tests. In the following, the p- and i-variants of the optimal tests for
classes of probability distributions, namely d^{p,iid} and d^{i,iid} , d^{p,exch} and d^{i,exch} ,
d^{p,invexc} and d^{i,invexc} , will be considered.

4.3 Randomness with Respect to Exchangeable Probability Distributions
Any computable function F (p, q) (method of decoding), where p is a binary
string and q ∈ S, defines a measure of (plain) Kolmogorov complexity

KF (x|q) = min{|p| : F (p, q) = x}.

The main result of the theory is that an optimal F exists such that KF (x|q) ≤
KF′ (x|q) + O(1) holds for any method F′ of decoding. For the detailed definition
and main properties of Kolmogorov (conditional) complexity K(x|q) we refer
the reader to the book [7].
In the following we consider the prefix modification of Kolmogorov complex-
ity [7]. This means that only prefix methods of decoding are considered: if F (p, q)
and F (p′ , q) are defined then the strings p and p′ are incomparable.
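The prefix (self-delimiting) requirement can be stated operationally: no halting program is a proper prefix of another. A tiny sketch of our own illustrating the check (the function name is an assumption, not the paper's notation):

```python
def is_prefix_free(codewords):
    """True iff no codeword is a proper prefix of another.
    After lexicographic sorting, it suffices to compare neighbours:
    if a is a prefix of b, it is also a prefix of a's successor."""
    words = sorted(codewords)
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

print(is_prefix_free(["0", "10", "110", "111"]))  # True
print(is_prefix_free(["0", "01"]))                # False
```

A prefix-free (Kraft-satisfying) code is exactly what prefix methods of decoding produce, which is why prefix complexity obeys the Kraft inequality used below.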
Kolmogorov defined in [6] the notion of deficiency of randomness of an ele-
ment x of a finite set D

d(x|D) = log |D| − K(x|D). (15)

It is easy to verify that K(x|D) ≤ log |D| + O(1) and that the number of x ∈ D
such that d(x|D) > m does not exceed 2^{−m} |D|. Earlier, in [5], he also defined an
m-Bernoulli sequence as a sequence x satisfying

K(x|n, k) ≥ log C(n, k) − m,

where C(n, k) is the binomial coefficient, n is the length of x and k is the number
of ones in it.
For any finite sequence x^n = x1 , . . . , xn ∈ Zn consider the permutation set

Ξ(x^n ) = {z^n : con(z^n ) = con(x^n )},     (16)

i.e. the set of all sequences with the same configuration as x^n (the set of all
permutations of x^n ). For any permutation set Ξ we consider the measure QΞ
given by QΞ (z^n ) = 1/|Ξ| if z^n ∈ Ξ, and QΞ (z^n ) = 0 otherwise, which is
concentrated on the set Ξ of all sequences with the same configuration. An
optimal uniform log-test d(x^n |QΞ(x^n ) , q) for the class {QΞ : ∃z^n ∈ Zn (Ξ = Ξ(z^n ))}
can be defined in the spirit of Proposition 2.
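Since Ξ(x^n ) consists of all distinct orderings of the bag con(x^n ), its size is a multinomial coefficient; under QΞ each such ordering gets probability 1/|Ξ|. A small sketch of our own (the function name is illustrative):

```python
from math import factorial
from collections import Counter

def permutation_set_size(seq):
    """|Xi(x^n)|: the number of distinct sequences with the same
    configuration (bag of elements with multiplicities) as seq."""
    size = factorial(len(seq))
    for arity in Counter(seq).values():
        size //= factorial(arity)  # divide out orderings of equal elements
    return size

print(permutation_set_size("aab"))  # 3!/2! = 3
```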
The next proposition shows that the deficiency of exchangeability can be
characterized in a fashion free of probability concepts.
Proposition 5.⁸ It holds
di,exch (z n |q) = log |Ξ(z n )| − K(z n |Ξ(z n ), q) + O(1). (17)
Proof. We prove (17) and that it is also equal to di (z n |QΞ(zn ) , q) + O(1). Let us
prove that the function
dˆi (z n |q) = log |Ξ(z n )| − K(z n |Ξ(z n ), q)

is a uniform i-log-test of exchangeability. Indeed, let P̂ (Ξ(z^n )) = Σ_{z̃^n ∈Ξ(z^n )} P (z̃^n ).
Then for any exchangeable measure P ∈ P(Zn )

∫ 2^{d̂^i (z^n |q)} dP (z^n ) = Σ_{z^n ∈Zn} 2^{−K(z^n |Ξ(z^n ),q)} P̂ (Ξ(z^n )) = Σ_Ξ P̂ (Ξ) Σ_{z^n ∈Ξ} 2^{−K(z^n |Ξ,q)} ≤ 1,

where the inner sum is at most 1 by the Kraft inequality for prefix complexity.

Then dˆi (z n |q) ≤ di (z n |P, q) + O(1) for any exchangeable measure P , and so, we
have
dˆi (z n |q) ≤ di,exch (z n |q) + O(1) ≤ di (z n |QΞ(zn ) , q) + O(1).
Let us check the converse inequality. Let Ξ = Ξ(z^n ). We have

d^{i,exch} (z^n |q) = inf_{P ∈Q^exch} d^i (z^n |q, P ) ≤ log |Ξ| − K(z^n |q, QΞ ) = d̂^i (z^n |q) + O(1).

Here we take into account that K(z^n |q, QΞ ) = K(z^n |q, Ξ) + O(1), which follows
from the fact that the measure QΞ and the configuration Ξ are computationally
equivalent. □

Let D be a bag of elements of Z and let x ∈ D have arity k(x). Then we can
assign a probability P (x) = k(x)/|D| to each element x of the bag and a positive
integer lx such that 2^{−lx −1} ≤ P (x) ≤ 2^{−lx } . It follows from the Kraft
inequality Σ_x 2^{−lx −1} ≤ 1 that a corresponding decodable prefix code exists, and
so K(x|D) ≤ log(|D|/k(x)) + O(1). Let us define the randomness deficiency of
x with respect to a bag D by

d(x|D) = log(|D|/k(x)) − K(x|D).     (18)

We have |{x : d(x|D) ≥ m}| ≤ 2^{−m} |D| for any m.
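The coding step behind K(x|D) ≤ log(|D|/k(x)) + O(1) can be made concrete: the ideal code lengths log(|D|/k(x)) correspond to the probabilities k(x)/|D| and therefore satisfy the Kraft inequality with equality. A toy sketch of our own (the bag and variable names are illustrative):

```python
from math import log2
from collections import Counter

bag = list("aaabbc")          # a bag D with arities k(a)=3, k(b)=2, k(c)=1
k = Counter(bag)
size = len(bag)               # |D| = 6

# ideal code length for each distinct element: log(|D|/k(x))
lengths = {x: log2(size / kx) for x, kx in k.items()}

# Kraft inequality: sum over distinct x of 2^{-l_x} = sum k(x)/|D| = 1
kraft_sum = sum(2 ** -l for l in lengths.values())
print(kraft_sum)  # 1.0 up to floating-point rounding
```

Rounding these ideal lengths up to integers costs at most one bit per codeword, which is absorbed into the O(1) term in the text.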
The following proposition implies that the optimal invariant exchangeabil-
ity log-test d^{i,invexc} of a training set (x1 , y1 ), . . . , (xl , yl ) and a test example
(x_{l+1} , y) coincides with the generalized Kolmogorov deficiency of randomness of
the test example (x_{l+1} , y) with respect to the configuration of the whole sequence.

Proposition 6. Let u1 , . . . , ul+1 ∈ Z^{l+1} . Then

d^{i,invexc} (u1 , . . . , ul+1 ) = d(ul+1 |con(u1 , . . . , ul+1 )) + O(1).

The proof of this proposition is analogous to the proof of Proposition 5.
⁸ The same relation holds for d^{p,exch} (z^n |q) if we replace the prefix variant of Kol-
mogorov complexity by its plain variant.
5 EP and IEP Are Equivalent

Let us define

d^{i,exch} (z^l , x_{l+1} ) = min_{y∈Y} d^{i,exch} (z^l , (x_{l+1} , y)).     (19)
The following theorem implies that if a training set is random⁹ (with respect
to some exchangeable measure) then EP and IEP are almost the same notion.
Theorem 1. It holds

d^{i,invexc} (z^l , (x_{l+1} , y)) − O(1) ≤ d^{i,exch} (z^l , (x_{l+1} , y)) ≤ d^{i,invexc} (z^l , (x_{l+1} , y))
+ 2 log d^{i,invexc} (z^l , (x_{l+1} , y)) + d^{i,exch} (z^l , x_{l+1} ) + 2 log |Y| + O(1),

where z^l = (x1 , y1 ), . . . , (xl , yl ) is a training set and (x_{l+1} , y) is a test example.


The proof of this theorem is based on a relation for the complexity of a pair [7]
and is presented in Section 7.1.
In the direct scale of the definition of a test we have
Corollary 2. For any ε > 0

O(1) δ^{i,invexc} (z^l , (x_{l+1} , y)) ≥ δ^{i,exch} (z^l , (x_{l+1} , y))
≥ (δ^{i,invexc} (z^l , (x_{l+1} , y)))^{1+ε} δ^{i,exch} (z^l , x_{l+1} ) |Y|^{−2} / O(1),

where z^l = (x1 , y1 ), . . . , (xl , yl ) is a training set and (x_{l+1} , y) is a test example.

6 RP and IEP Are Equivalent

In this section we use p-tests. Let us define

d^{p,iid} (z^l , x_{l+1} ) = min_{y∈Y} d^{p,iid} (z^l , (x_{l+1} , y)).     (20)

The following theorem shows that the difference between RP and IEP is not
essential in the most interesting case where a training set and an unlabelled test
example are random with respect to some iid probability distribution.
Theorem 2. It holds

d^{p,iid} (z^l , (x_{l+1} , y)) + O(1) ≥ d^{p,invexc} (z^l , (x_{l+1} , y)) ≥ d^{p,iid} (z^l , (x_{l+1} , y))
− 4 d^{p,iid} (z^l , x_{l+1} ) − 2 log d^{p,iid} (z^l , x_{l+1} ) − 4 log |Y| − O(1),     (21)

where z^l = (x1 , y1 ), . . . , (xl , yl ) is a training set, x_{l+1} is an unlabelled test
example, and y is a label.
The proof of this theorem is based on Theorem 1, Propositions 7 and 8, and
Corollary 1 (see Section 7.3).
⁹ In other words, we suppose that the value of the optimal log-test on the training set is small.
Corollary 3. Let ε > 0. Then

δ^{p,iid} (z^l , (x_{l+1} , y)) / O(1) ≤ δ^{p,invexc} (z^l , (x_{l+1} , y))
≤ (δ^{p,iid} (z^l , (x_{l+1} , y)))^{1−ε} |Y|^4 (δ^{p,iid} (z^l , x_{l+1} ))^{−(4+ε)} O(1),

where z^l = (x1 , y1 ), . . . , (xl , yl ) is a training set, x_{l+1} is an unlabelled test
example, and y is a label.

Acknowledgments. Volodya Vovk initiated this work and proposed the ideas of
the main theorems. The authors are deeply grateful to him for valuable discussions.

7 Appendix

7.1 Proof of Theorem 1

Let z^l = (x1 , y1 ), . . . , (xl , yl ) be a training set and (x_{l+1} , y) a test example.
By definition (19), for any z^l and x_{l+1} there exists ȳ such that d^{i,exch} (z^l , x_{l+1} ) =
d^{i,exch} (z^l , (x_{l+1} , ȳ)).
Let Ξ be the set of all permutations of z^l , (x_{l+1} , y) and Ξ̄ the set of all
permutations of z^l , (x_{l+1} , ȳ). By Proposition 5 we have

d^{i,exch} (z^l , (x_{l+1} , y)) = log |Ξ| − K(z^l , (x_{l+1} , y)|Ξ) + O(1),
d^{i,exch} (z^l , (x_{l+1} , ȳ)) = log |Ξ̄| − K(z^l , (x_{l+1} , ȳ)|Ξ̄) + O(1).     (22)

Let k be the arity of (x_{l+1} , y) in con(z^l , (x_{l+1} , y)) and k̄ the arity of (x_{l+1} , ȳ)
in con(z^l , (x_{l+1} , ȳ)). By definition k|Ξ| = k̄|Ξ̄|. Then from (22) we obtain

d^{i,exch} (z^l , (x_{l+1} , y)) = d^{i,exch} (z^l , (x_{l+1} , ȳ)) + log k̄ − log k
+ K(z^l , (x_{l+1} , ȳ)|Ξ̄) − K(z^l , (x_{l+1} , y)|Ξ) + O(1).     (23)

By the well-known equality for the complexity of a pair [7] we have

K(z^l , (x_{l+1} , y)|Ξ) = K((x_{l+1} , y)|Ξ) + K(z^l |x_{l+1} , y, K(x_{l+1} , y|Ξ), Ξ) + O(1).

Then (23) is transformed into

d^{i,exch} (z^l , (x_{l+1} , y)) = d^{i,exch} (z^l , (x_{l+1} , ȳ)) + K(z^l , (x_{l+1} , ȳ)|Ξ̄) + log k̄
− log k − K(z^l |x_{l+1} , y, K(x_{l+1} , y|Ξ), Ξ) − K((x_{l+1} , y)|Ξ) + O(1).     (24)

We have

|con(z^l , (x_{l+1} , y))| = |con(z^l , (x_{l+1} , ȳ))| = l + 1.

Let m be the ordinal number of the pair (x_{l+1} , ȳ) in the list z^l , (x_{l+1} , ȳ) sorted
in order of decreasing arity. Then m ≤ (l + 1)/k̄.
Let us prove the following inequality between complexities:

K(z^l , (x_{l+1} , ȳ)|Ξ̄) ≤ K(z^l |x_{l+1} , y, d((x_{l+1} , y)|con(z^l , (x_{l+1} , y))), Ξ)
+ 2 log d((x_{l+1} , y)|con(z^l , (x_{l+1} , y))) + log(l + 1) − log k̄ + 2 log |Y| + O(1).

Indeed, let p be a program which, conditional on x_{l+1} , y, d((x_{l+1} , y)|con(z^l , (x_{l+1} , y)))
and Ξ, computes z^l . We add to p the binary codes of m, y and
d((x_{l+1} , y)|con(z^l , (x_{l+1} , y))). Using Ξ̄ we can restore con(z^l , (x_{l+1} , ȳ)), and
then by m we restore x_{l+1} and ȳ. Using this information we can also transform
Ξ̄ into Ξ. Hence, from the program p, Ξ̄, the binary codes of m, y and
d((x_{l+1} , y)|con(z^l , (x_{l+1} , y))) we can compute z^l , x_{l+1} and ȳ.
By definition,

d((x_{l+1} , y)|con(z^l , (x_{l+1} , y))) = log(l + 1) − log k − K((x_{l+1} , y)|con(z^l , (x_{l+1} , y))).     (25)

Evidently, con(z^l , (x_{l+1} , y)) and Ξ are computationally equivalent. By
(25) and Proposition 6 the value of K(x_{l+1} , y|Ξ) can be computed from
d((x_{l+1} , y)|con(z^l , (x_{l+1} , y))), Ξ and the pair (x_{l+1} , y). Then we have¹⁰

K(z^l |x_{l+1} , y, d((x_{l+1} , y)|con(z^l , (x_{l+1} , y))), Ξ)
≤ K(z^l |x_{l+1} , y, K(x_{l+1} , y|Ξ), Ξ) + O(1).     (26)

Then by (26), (24) and (25) we obtain

d^{i,exch} (z^l , (x_{l+1} , y)) ≤ d^{i,exch} (z^l , (x_{l+1} , ȳ)) + log k̄ − log k
+ log(l + 1) − log k̄ − K((x_{l+1} , y)|con(z^l , (x_{l+1} , y)))
+ 2 log d((x_{l+1} , y)|con(z^l , (x_{l+1} , y))) + 2 log |Y| + O(1).

To obtain the final result we apply Proposition 6. □
7.2 iid and Exchangeability Tests

We recall an important relation between iid and exchangeability tests from [13].

Proposition 7. It holds

di,exch (z n ) + O(1) ≥ di,iid (z n ) − di,iid (Ξ(z n )) − 2 log di,iid (Ξ(z n )), (27)

where z n ∈ Zn .

Proof omitted.
¹⁰ Here we use the inequality K(x|q) ≤ K(x|f (q)) + O(1), which holds for any computable
function f (see [7]).
7.3 Proof of Theorem 2


Proposition 8. Let z n = (x1 , y1 ), . . . , (xn , yn ). Then

dp,iid (Ξ(z n , (xn+1 , y))) ≤ dp,iid (z n , xn+1 ) + 2 log |Y| + O(1). (28)

For simplicity of presentation we consider only the case where all elements of
z^n = (x1 , y1 ), . . . , (xn , yn ) are distinct and Y = {0, 1}.

Lemma 1. Let z n ∈ Zn . Then

dp,iid (Ξ(z n )) ≤ dp,iid (z n ) + O(1).

Proof omitted.

Lemma 2. Suppose that

P1 (x, y) = (n/(n + 1)) P (x, y) + (1/(n + 1)) P (x, 1 − y),
and U is the epimorphism

U (z n , (xn+1 , yn+1 )) = Ξ(z n , (xn+1 , 1 − yn+1 )),

where z^n = (x1 , y1 ), . . . , (xn , yn ). Then for any class L of permutation sets

P^{n+1} (U^{−1} (L)) ≤ P_1^{n+1} (L).

Proof omitted.

Lemma 3. Let dp be the optimal uniform randomness p-log-test. Then for any
P ∈ P(Z) there exists a P1 ∈ P(Z) such that

dp (Ξ(z n , (xn+1 , y))|P1n+1 ) ≤ dp (z n , (xn+1 , 1 − y)|P n+1 ) + O(1).

Proof. The measure P1 can be defined as in Lemma 2. We know that
P^{n+1} (U^{−1} (L)) ≤ P_1^{n+1} (L) for any class L of permutation sets, and the state-
ment has the form d^p (W |P_1^{n+1} ) ≤ d^p (v|P^{n+1} ) + O(1), where W is a permutation
set and v ∈ U^{−1} (W ). Indeed, d′ (v|P^{n+1} ) := d^p (U (v)|P_1^{n+1} ) is a uniform
test of randomness; let us check the validity property:

P^{n+1} (v : d′ (v|P^{n+1} ) ≥ m) = P^{n+1} (v : d^p (U (v)|P_1^{n+1} ) ≥ m)
≤ P_1^{n+1} (W : d^p (W |P_1^{n+1} ) ≥ m) ≤ 2^{−m}

for any m. Since d′ is a p-log-test, we have

d^p (W |P_1^{n+1} ) = d^p (U (v)|P_1^{n+1} ) = d′ (v|P^{n+1} ) ≤ d^p (v|P^{n+1} ) + O(1).

To obtain the statement of the lemma we put v = z^n , (x_{n+1} , 1 − y) and W =
Ξ(z^n , (x_{n+1} , y)). □
Lemma 4. d^{p,iid} (Ξ(z^n , (x_{n+1} , y))) ≤ d^{p,iid} (z^n , (x_{n+1} , 1 − y)) + O(1).

Proof. By Lemma 3 we have, for some P and P1 ,

d^{p,iid} (z^n , (x_{n+1} , 1 − y)) = d^p (z^n , (x_{n+1} , 1 − y)|P^{n+1} ) + O(1)
≥ d^p (Ξ(z^n , (x_{n+1} , y))|P_1^{n+1} ) − O(1) ≥ d^{p,iid} (Ξ(z^n , (x_{n+1} , y))) − O(1). □

Proof of Proposition 8. Taking into account definition (20), we obtain inequal-
ity (28) as a direct corollary of Lemmas 1 and 4. □

Proof of Theorem 2. Inequality (21) is a direct corollary of Theorem 1, Propo-
sition 8 and Corollary 1. □

References
1. J.M. Bernardo, A.F.M. Smith. Bayesian Theory. Wiley, Chichester, 2000.
2. D.R. Cox, D.V. Hinkley. Theoretical Statistics. Chapman and Hall, London, 1974.
3. N. Cristianini, J. Shawe-Taylor. An Introduction to Support Vector Machines and
Other Kernel-based Methods. Cambridge University Press, Cambridge, 2000.
4. A. Gammerman, V. Vapnik, V. Vovk. Learning by transduction. In Proceedings of
UAI’1998, pages 148–156, San Francisco: Morgan Kaufmann.
5. A.N. Kolmogorov. Three approaches to the quantitative definition of information.
Problems Inform. Transmission, 1965, 1(1), p. 4–7.
6. A.N. Kolmogorov. Combinatorial foundations of information theory and the calcu-
lus of probabilities. Russian Math. Surveys, 1983, 38(4), p. 29–40.
7. M. Li, P. Vitányi. An Introduction to Kolmogorov Complexity and Its Applications.
Springer, New York, 2nd edition, 1997.
8. T. Melluish, C. Saunders, I. Nouretdinov, V. Vovk. Comparing the Bayes and
typicalness frameworks. In Proceedings of ECML’2001. Full version published
as a CLRC technical report TR-01-05; see [Link].
9. I. Nouretdinov, V. Vovk, M. Vyugin, A. Gammerman. Pattern recognition and
density estimation under the general i.i.d. assumption. In David Helmbold and
Bob Williamson, editors, Proceedings of COLT’ 2001, pages 337–353.
10. H. Rogers. Theory of Recursive Functions and Effective Computability. New York:
McGraw-Hill, 1967.
11. C. Saunders, A. Gammerman, V. Vovk. Transduction with confidence and credi-
bility. In Proceedings of the 16th IJCAI, pages 722–726, 1999.
12. C. Saunders, A. Gammerman, V. Vovk. Computationally efficient transductive
machines. In Proceedings of ALT’00, 2000.
13. V. Vovk. On the concept of the Bernoulli property. Russian Mathematical Surveys,
41:247–248, 1986.
14. V. Vovk, A. Gammerman. Statistical applications of algorithmic randomness. In
Bulletin of the International Statistical Institute, The 52nd Session, Contributed
Papers, volume LVIII, book 3, pages 469–470, 1999.
15. V. Vovk, A. Gammerman. Algorithmic randomness for machine learning.
Manuscript, 2001.
16. V. Vovk, A. Gammerman, C. Saunders. Machine-learning applications of algorith-
mic randomness. In Proceedings of the 16th ICML, pages 444–453, 1999.
17. V. Vovk. On-Line Confidence Machines Are Well-Calibrated. In Proceedings of
FOCS’02, pages 187–196, 2002.
18. I. Nouretdinov, V. Vovk, V. V’yugin, A. Gammerman. Transductive confidence ma-
chine is universal. CLRC technical report; [Link].
On the Existence and Convergence of
Computable Universal Priors

Marcus Hutter

IDSIA, Galleria 2, CH-6928 Manno-Lugano, Switzerland


marcus@[Link]
[Link]

Abstract. Solomonoff unified Occam’s razor and Epicurus’ principle of


multiple explanations to one elegant, formal, universal theory of induc-
tive inference, which initiated the field of algorithmic information theory.
His central result is that the posterior of his universal semimeasure M
converges rapidly to the true sequence generating posterior µ, if the lat-
ter is computable. Hence, M is eligible as a universal predictor in case of
unknown µ. We investigate the existence and convergence of computable
universal (semi)measures for a hierarchy of computability classes: finitely
computable, estimable, enumerable, and approximable. For instance, M
is known to be enumerable, but not finitely computable, and to dominate
all enumerable semimeasures. We define seven classes of (semi)measures
based on these four computability concepts. Each class may or may not
contain a (semi)measure which dominates all elements of another class.
The analysis of these 49 cases can be reduced to four basic cases, two of
them being new. We also investigate more closely the types of conver-
gence, possibly implied by universality: in difference and in ratio, with
probability 1, in mean sum, and for Martin-Löf random sequences. We
introduce a generalized concept of randomness for individual sequences
and use it to exhibit difficulties regarding these issues.

1 Introduction

All induction problems can be phrased as sequence prediction tasks. This is, for
instance, obvious for time series prediction, but also includes classification tasks.
Having observed data x1 ,...,xt−1 at times 1,...,t−1, the task is to predict the t-th
symbol xt from sequence x=x1 ...xt−1 . The key concept to attack general induc-
tion problems is Occam’s razor and to a less extent Epicurus’ principle of multiple
explanations. The former/latter may be interpreted as to keep the simplest/all
theories consistent with the observations x1 ...xt−1 and to use these theories to
predict xt . Solomonoff [Sol64,Sol78] formalized and combined both principles in
his universal prior M (x) which assigns high/low probability to simple/complex
environments, hence implementing Occam and Epicurus. Solomonoff’s [Sol78]
central result is that if the probability µ(xt |x1 ...xt−1 ) of observing xt at time

This work was supported by SNF grant 2000-61847.00 to Jürgen Schmidhuber.

R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842, pp. 298–312, 2003.

c Springer-Verlag Berlin Heidelberg 2003
t, given past observations x1 ...xt−1 is a computable function, then the univer-


sal posterior M (xt |x1 ...xt−1 ) converges rapidly for t → ∞ to the true posterior
µ(xt |x1 ...xt−1 ), hence M represents a universal predictor in case of unknown µ.
One representation of M is as a weighted sum of all enumerable “defective”
probability measures, called semimeasures (see Definition 2). The (from this
representation obvious) dominance M (x) ≥ const.×µ(x) for all computable µ
is the central ingredient in the convergence proof. What is so special about
the class of all enumerable semimeasures M^semi_enum ? The larger we choose M,
the less restrictive is the essential assumption that M should contain the true
distribution µ. Why not restrict to the still rather general class of estimable or
finitely computable (semi)measures? For every countable class M and ξM (x) :=
Σ_{ν∈M} wν ν(x) with wν > 0, the important dominance ξM (x) ≥ wν ν(x) ∀ν ∈ M
is satisfied. The question is what properties the mixture ξM possesses. The
distinguishing property of M = ξ_{M^semi_enum} is that it is itself an element of M^semi_enum .
On the other hand, for prediction ξM ∈M is not by itself an important property.
What matters is whether ξM is computable (in one of the senses defined) to avoid
getting into the (un)realm of non-constructive math.
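The dominance ξM (x) ≥ wν ν(x) is immediate because every term in the mixture is nonnegative; a toy sketch of our own with a two-element class (the weights and measures are illustrative, not from the paper):

```python
# two "environments" on the alphabet {0, 1}
nu1 = {'0': 0.5, '1': 0.5}
nu2 = {'0': 0.9, '1': 0.1}
w = [0.7, 0.3]               # prior weights w_nu > 0 (here summing to 1)

# mixture xi_M(x) = sum_nu w_nu * nu(x)
xi = {x: w[0] * nu1[x] + w[1] * nu2[x] for x in '01'}

# multiplicative dominance: xi(x) >= w_nu * nu(x) for every nu and x,
# since dropping the other nonnegative terms can only decrease the sum
for x in '01':
    assert xi[x] >= w[0] * nu1[x]
    assert xi[x] >= w[1] * nu2[x]
```

The same one-line argument yields M (x) ≥ const. × µ(x) for Solomonoff's M viewed as a mixture over all enumerable semimeasures.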
The intention of this work is to investigate the existence, computability
and convergence of universal (semi)measures for various computability classes:
finitely computable ⊂ estimable ⊂ enumerable ⊂ approximable (see Definition
1). For instance, M (x) is enumerable, but not finitely computable. The research
in this work was motivated by recent generalizations of Kolmogorov complexity
and Solomonoff’s prior by Schmidhuber [Sch02] to approximable (and others not
here discussed) cases.
Contents. In Section 2 we review various computability concepts and discuss
their relation. In Section 3 we define the prefix Kolmogorov complexity K, the
concept of (semi)measures, Solomonoff’s universal prior M , and explain its uni-
versality. Section 4 summarizes Solomonoff’s major convergence result, discusses
general mixture distributions and the important universality property – mul-
tiplicative dominance. In Section 5 we define seven classes of (semi)measures
based on four computability concepts. Each class may or may not contain a
(semi)measure which dominates all elements of another class. We reduce the
analysis of these 49 cases to four basic cases. Domination (essentially by M ) is
known to be true for two cases. The two new cases do not allow for domina-
tion. In Section 6 we investigate more closely the type of convergence implied
by universality. We summarize the result on posterior convergence in difference
(ξ −µ → 0) and improve the previous result [LV97] on the convergence in ratio
ξ/µ → 1 by showing rapid convergence without use of Martingales. In Section 7
we investigate whether convergence for all Martin-Löf random sequences could
hold. We define a generalized concept of randomness for individual sequences
and use it to show that proofs based on universality cannot decide this question.
Section 8 concludes the paper.
Notation. We denote strings of length n over finite alphabet X by x=x1 x2 ...xn
with xt ∈ X and further abbreviate x1:n := x1 x2 ...xn−1 xn and x<n := x1 ...xn−1 ,
 for the empty string, l(x) for the length of string x, and ω = x1:∞ for infinite
sequences. We abbreviate limn→∞ [f (n)−g(n)] = 0 by f (n) −→ g(n) (n → ∞) and say
f converges to g, without implying that limn→∞ g(n) itself exists. We write
f (x) ≥× g(x) for g(x) = O(f (x)), i.e. if ∃c > 0 : f (x) ≥ c·g(x) ∀x.

2 Computability Concepts
We define several computability concepts weaker than can be captured by halting
Turing machines.
Definition 1 (Computable functions). We consider functions f : IN → IR:
f is finitely computable or recursive iff there are Turing machines T1 , T2 with
output interpreted as natural numbers and f (x) = T1 (x)/T2 (x);
f is approximable iff ∃ finitely computable φ(·,·) with limt→∞ φ(x,t) = f (x);
f is lower semi-computable or enumerable iff additionally φ(x,t) ≤ φ(x,t+1);
f is upper semi-computable or co-enumerable iff [−f ] is lower semi-computable;
f is semi-computable iff f is lower- or upper semi-computable;
f is estimable iff f is lower- and upper semi-computable.
If f is estimable we can finitely compute an ε-approximation of f by upper and
lower semi-computing f and terminating when differing by less than ε. This
means that there is a Turing machine which, given x and ε, finitely computes ŷ
such that |ŷ−f (x)| < ε. Moreover it gives an interval estimate f (x) ∈ [ŷ−ε,ŷ+ε].
An estimable integer-valued function is finitely computable (take any ε<1). Note
that if f is only approximable or semi-computable we can still come arbitrar-
ily close to f (x) but we cannot devise a terminating algorithm which produces
an ε-approximation. In the case of lower/upper semi-computability we can at
least finitely compute lower/upper bounds to f(x). In the case of approximability,
the weakest computability form, even this capability is lost. In analogy to
lower/upper semi-computability one may think of notions like lower/upper
estimability, but they are easily shown to coincide with estimability. The following
implications are valid:

recursive = finitely computable ⇒ estimable ⇒ enumerable = lower semi-computable ⇒ semi-computable ⇒ approximable,
and likewise estimable ⇒ co-enumerable = upper semi-computable ⇒ semi-computable.

In the following we use the term computable synonymously with finitely computable,
but sometimes also generically for one of the computability forms of Definition 1.
What we call estimable is often just called computable, but it makes sense
to separate the concepts of finite computability and estimability in this work,
since the former is conceptually easier and some previous results have only been
proved for this case.
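The terminating ε-approximation procedure for estimable f described above can be sketched in code. This is a minimal sketch; the callables `lower` and `upper`, standing in for the two semi-computations of f, are a hypothetical interface and not part of the paper:

```python
def epsilon_approximate(lower, upper, x, eps):
    """Approximate an estimable f at x to accuracy eps.

    lower(x, t) is assumed to increase to f(x) and upper(x, t) to
    decrease to f(x) as t grows (the two semi-computations of f).
    Terminates because upper(x, t) - lower(x, t) -> 0 as t -> infinity.
    """
    t = 0
    while True:
        lo, hi = lower(x, t), upper(x, t)
        if hi - lo < 2 * eps:
            # the midpoint of [lo, hi] is within eps of f(x)
            return (lo + hi) / 2
        t += 1
```

For integer-valued f, calling this with eps < 1/2 and rounding the result to the nearest integer recovers finite computability, matching the remark above.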
On the Existence and Convergence of Computable Universal Priors 301

3 The Universal Prior M

The prefix Kolmogorov complexity K(x) is defined as the length of the shortest
binary program p ∈ {0,1}* for which a universal prefix Turing machine U (with
binary program tape and X-ary output tape) outputs string x ∈ X*, and similarly
K(x|y) in case of side information y [LV97]:

K(x) = min{l(p) : U (p) = x}, K(x|y) = min{l(p) : U (p, y) = x}

Solomonoff [Sol64,Sol78] (with a flaw fixed by Levin [ZL70]) earlier defined the
closely related quantity, the universal prior M(x). It is defined as the probability
that the output of a universal Turing machine starts with x when provided with
fair coin flips on the input tape. Formally, M can be defined as

M(x) := Σ_{p : U(p)=x*} 2^{−l(p)}    (1)

where the sum is over all so-called minimal programs p for which U outputs a
string starting with x (indicated by the *). Before we can discuss the stochastic
properties of M we need the concept of (semi)measures for strings.
properties of M we need the concept of (semi)measures for strings.

Definition 2 (Continuous (Semi)measures). µ(x) denotes the probability
that a sequence starts with string x. We call µ ≥ 0 a (continuous) semimeasure
if µ(ε) ≤ 1 and µ(x) ≥ Σ_{a∈X} µ(xa), and a (probability) measure if equality holds.

We have Σ_{a∈X} M(xa) < M(x) because there are programs p which output x,
not followed by any a ∈ X. They just stop after printing x or continue forever
without any further output. Together with M(ε) ≤ 1 this shows that M is a
semimeasure, but not a probability measure. We can now state the fundamental
property of M [Sol78]:
property of M [Sol78]:

Theorem 1 (Universality of M). The universal prior M is an enumerable
semimeasure which multiplicatively dominates all enumerable semimeasures in
the sense that M(x) ⪰ 2^{−K(ρ)}·ρ(x) for all enumerable semimeasures ρ. M is
enumerable, but not estimable or finitely computable.

The Kolmogorov complexity of a function like ρ is defined as the length of the


shortest self-delimiting code of a Turing machine computing this function in
the sense of Definition 1. Up to a multiplicative constant, M assigns higher
probability to all x than any other computable probability distribution.
It is possible to normalize M to a true probability measure Mnorm [Sol78,
LV97] with dominance still being true, but at the expense of giving up enu-
merability (Mnorm is still approximable). M is more convenient when studying
algorithmic questions, but a true probability measure like Mnorm is more con-
venient when studying stochastic questions.
4 Universal Sequence Prediction


In which sense does M incorporate Occam's razor and Epicurus' principle of
multiple explanations? Since the shortest programs p dominate the sum in M,
M(x) is roughly equal to 2^{−K(x)} (more precisely, M(x) = 2^{−K(x)+O(K(l(x)))}), i.e. M assigns high
probability to simple strings.
probability to simple strings. More useful is to think of x as being the observed
history. We see from (1) that every program p consistent with history x is al-
lowed to contribute to M (Epicurus). On the other hand shorter programs give
significantly larger contribution (Occam). How does all this affect prediction? If
M (x) describes our (subjective) prior belief in x, then M (y|x) := M (xy)/M (x)
must be our posterior belief in y. From the symmetry of algorithmic informa-
tion K(xy) ≈ K(y|x)+K(x), and M (x) ≈ 2−K(x) and M (xy) ≈ 2−K(xy) we get
M (y|x) ≈ 2−K(y|x) . This tells us that M predicts y with high probability iff y
has an easy explanation, given x (Occam & Epicurus).
The above qualitative discussion should not create the impression that M(x)
and 2^{−K(x)} always lead to predictors of comparable quality. Indeed, in the
online/incremental setting, K(y) = O(1) invalidates the consideration above. The
proof of (2) below, for instance, depends on M being a semimeasure and the
chain rule being exactly true, neither of which is satisfied by 2^{−K(x)}. See [Hut03]
for a more detailed analysis.
Sequence prediction algorithms try to predict the continuation xt ∈ X of
a given sequence x1 ...xt−1 . We assume that the true sequence is drawn from a
computable probability distribution µ, i.e. the true (objective) probability of x1:t
is µ(x1:t ). The probability of xt given x<t hence is µ(xt |x<t ) = µ(x1:t )/µ(x<t ).
Solomonoff's [Sol78] central result is that M converges to µ. More precisely, for
binary alphabet, he showed that

Σ_{t=1}^∞ Σ_{x<t ∈ {0,1}^{t−1}} µ(x<t) (M(0|x<t) − µ(0|x<t))² ≤ ½ ln 2 · K(µ) + O(1) < ∞.    (2)

The infinite sum can only be finite if the difference M(0|x<t) − µ(0|x<t) tends to
zero for t → ∞ with µ-probability 1 (see Definition 4(i), and [Hut01] or Section
6 for general alphabet). This holds for any computable probability distribution
µ. The reason for the astonishing property of a single (universal) function
converging to any computable probability distribution lies in the fact that the sets
of µ-random sequences differ for different µ. Past data x<t are exploited to get
a (with t → ∞) improving estimate M(xt|x<t) of µ(xt|x<t).
The universality property (Theorem 1) is the central ingredient in the proof
of (2). The proof involves the construction of a semimeasure ξ whose dominance
is obvious. The hard part is to show its enumerability and equivalence to M. Let
M be the (countable) set of all enumerable semimeasures and define

ξ(x) := Σ_{ν∈M} 2^{−K(ν)} ν(x);  then dominance  ξ(x) ≥ 2^{−K(ν)} ν(x)  ∀ν ∈ M    (3)

is obvious. Is ξ lower semi-computable? To answer this question one has to be
more precise. Levin [ZL70] has shown that the set of all lower semi-computable
semimeasures is enumerable (with repetitions). For this (ordered multi)set
M = M^semi_enum := {ν1, ν2, ν3, ...} and K(νi) := K(i) one can easily see that ξ is lower semi-computable. Finally, proving M(x) ⪰ ξ(x) also establishes universality of M (see
[Sol78,LV97] for details; ξ ⪰ M also holds).
The advantage of ξ over M is that it immediately generalizes to arbitrary
weighted sums of (semi)measures for arbitrary countable M.
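As a concrete toy illustration of such weighted sums and of the convergence discussed around (2), the following sketch mixes three Bernoulli measures. The class {0.2, 0.5, 0.8}, the equal weights, and the sample size are illustrative choices, not from the paper:

```python
import random

def bernoulli(theta):
    """Measure nu(x) = probability that a sequence starts with bit list x."""
    def nu(x):
        ones = sum(x)
        return theta**ones * (1 - theta)**(len(x) - ones)
    return nu

M = [0.2, 0.5, 0.8]                    # a tiny countable class of measures
w = {th: 1.0 / len(M) for th in M}     # weights w_nu > 0 summing to 1

def xi(x):                             # mixture xi(x) = sum_nu w_nu nu(x)
    return sum(w[th] * bernoulli(th)(x) for th in M)

random.seed(0)
mu = bernoulli(0.8)                    # true environment mu in M
x = [1 if random.random() < 0.8 else 0 for _ in range(500)]

# dominance: xi(x) >= w_mu * mu(x) for every prefix
assert all(xi(x[:n]) >= w[0.8] * mu(x[:n]) for n in range(100))

# posterior xi(1|x_<t) approaches the true mu(1|x_<t) = 0.8
posterior = xi(x + [1]) / xi(x)
```

Running this, `posterior` ends up very close to 0.8: the mixture weight concentrates on the component that assigns highest likelihood to the observed history.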

5 Universal (Semi)Measures

What is so special about the set of all enumerable semimeasures M^semi_enum? The
larger we choose M, the less restrictive is the assumption that M should contain
the true distribution µ, which will be essential throughout the paper. Why not
restrict to the still rather general class of estimable or finitely computable
(semi)measures? It is clear that for every countable set M, the universal or
mixture distribution
 
ξ(x) := ξM(x) := Σ_{ν∈M} wν ν(x)  with  Σ_{ν∈M} wν ≤ 1 and wν > 0    (4)

dominates all ν ∈ M. This dominance is necessary for the desired convergence ξ → µ,
similarly to (2). The question is what properties ξ possesses. The distinguishing
property of M^semi_enum is that ξ is itself an element of M^semi_enum. When concerned
with predictions, ξM ∈ M is not by itself an important property; what matters is whether ξ
is computable in one of the senses of Definition 1. We define

M1 ⪰ M2 :⇔ there is an element of M1 which dominates all elements of M2
:⇔ ∃ρ ∈ M1 ∀ν ∈ M2 ∃wν > 0 ∀x : ρ(x) ≥ wν ν(x).

⪰ is transitive (but not necessarily reflexive) in the sense that M1 ⪰ M2 ⪰ M3
implies M1 ⪰ M3, and M0 ⊇ M1 ⪰ M2 ⊇ M3 implies M0 ⪰ M3. For the
computability concepts introduced in Section 2 we have the following proper set
inclusions:

M^msr_comp  ⊂  M^msr_est  ≡  M^msr_enum  ⊂  M^msr_appr
     ∩              ∩               ∩               ∩
M^semi_comp ⊂  M^semi_est  ⊂  M^semi_enum  ⊂  M^semi_appr

where M^msr_c stands for the set of all probability measures of appropriate
computability type c ∈ {comp = finitely computable, est = estimable,
enum = enumerable, appr = approximable}, and similarly for semimeasures
M^semi_c. From an enumeration of a measure ρ one can construct a co-enumeration
by exploiting ρ(x1:n) = 1 − Σ_{y1:n ≠ x1:n} ρ(y1:n). This shows that every enumerable
measure is also co-enumerable, hence estimable, which proves the identity ≡
above.
With this notation, Theorem 1 implies M^semi_enum ⪰ M^semi_enum. Transitivity allows
us to conclude, for instance, that M^semi_appr ⪰ M^msr_comp, i.e. that there is an approximable
semimeasure which dominates all computable measures.
The standard "diagonalization" way of proving M1 ⋡ M2 is to take an arbitrary
µ ∈ M1, "increase" it to ρ such that µ ⋡ ρ, and show that ρ ∈ M2. There
are 7×7 combinations of (semi)measure classes M1 and M2 for which M1 ⪰ M2 could
be true or false. There are four basic cases, explicated in the following theorem,
from which the other combinations displayed in Table 3 follow by transitivity.

Theorem 2 (Universal (semi)measures). A semimeasure ρ is said to be
universal for M if it multiplicatively dominates all elements of M in the sense
∀ν ∃wν > 0 : ρ(x) ≥ wν ν(x) ∀x. The following holds true:

o) ∃ρ : {ρ} ⪰ M: For every countable set of (semi)measures M, there is a
(semi)measure which dominates all elements of M.
i) M^semi_enum ⪰ M^semi_enum: The class of enumerable semimeasures contains a universal element.
ii) M^msr_appr ⪰ M^semi_enum: There is an approximable measure which dominates all
enumerable semimeasures.
iii) M^semi_est ⋡ M^msr_comp: There is no estimable semimeasure which dominates all
computable measures.
iv) M^semi_appr ⋡ M^msr_appr: There is no approximable semimeasure which dominates all
approximable measures.

Table 3 (Existence of universal (semi)measures). The entry in row r and column
c indicates whether there is an r-able (semi)measure ρ for the set M which contains all
c-able (semi)measures, where r,c ∈ {comput, estimat, enumer, approxim}. Enumerable
measures are estimable; this is why the enum. row and column are missing in the
measure case. The superscript indicates from which part of Theorem 2 the answer
follows: for the bold-face entries directly, for the others using transitivity of ⪰.

                   M:       semimeasure                      measure
  ρ                 comp.   est.    enum.   appr.    comp.   est.    appr.
  semi-    comp.    no^iii  no^iii  no^iii  no^iv    no^iii  no^iii  no^iv
  measure  est.     no^iii  no^iii  no^iii  no^iv    no^iii  no^iii  no^iv
           enum.    yes^i   yes^i   yes^i   no^iv    yes^i   yes^i   no^iv
           appr.    yes^i   yes^i   yes^i   no^iv    yes^i   yes^i   no^iv
  measure  comp.    no^iii  no^iii  no^iii  no^iv    no^iii  no^iii  no^iv
           est.     no^iii  no^iii  no^iii  no^iv    no^iii  no^iii  no^iv
           appr.    yes^ii  yes^ii  yes^ii  no^iv    yes^ii  yes^ii  no^iv

If we ask for a universal (semi)measure which at least satisfies the weakest form
of computability, namely being approximable, we see that the largest dominated
set among the 7 sets defined above is the set of enumerable semimeasures. This
is the reason why M^semi_enum plays a special role. On the other hand, M^semi_enum is not
the largest set dominated by an approximable semimeasure, and indeed no such
largest set exists. One may, hence, ask for "natural" larger sets M. One such set,
namely the set of cumulatively enumerable semimeasures MCEM, has recently
been discovered by Schmidhuber [Sch02], for which even ξCEM ∈ MCEM holds.
Theorem 2 also holds for discrete (semi)measures P : IN → [0,1] with
Σ_{x∈IN} P(x) ≤ 1 (with equality for measures). Theorem 2(i) is Levin's major result [LV97, Th.4.3.1 &
Th.4.5.1], and (ii) is due to Solomonoff [Sol78]. The proof of M^semi_comp ⋡ M^semi_comp in
[LV97, p249] contains minor errors and is not extensible to (iii), and the proof in
[LV97, p276] only applies to infinite alphabet and not to the binary/finite case
considered here.
Proof. We present proofs for binary alphabet X = {0,1} only; they naturally
generalize to arbitrary finite alphabet. argmin_x f(x) is the x that minimizes
f(x); ties are broken in an arbitrary but computable way (e.g. by taking the smallest x).
(o) ρ(x) := Σ_{ν∈M} wν ν(x) with wν > 0 obviously dominates all ν ∈ M (with
domination constant wν). With Σν wν = 1 and all ν being (semi)measures, ρ is also a
(semi)measure.
(i) See [LV97, Th.4.5.1].
(ii) Let ξ be a universal element in M^semi_enum. We define [Sol78]

ξnorm(x1:n) := Π_{t=1}^n ξ(x1:t) / [ξ(x<t 0) + ξ(x<t 1)].

By induction one can show that ξnorm is a measure and that
ξnorm(x) ≥ ξ(x) ∀x, hence ξnorm ⪰ ξ ⪰ M^semi_enum. As a ratio of enumerable functions, ξnorm
is still approximable, hence M^msr_appr ⪰ M^semi_enum.
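The normalization step in (ii) can be made concrete. A minimal sketch for binary strings follows; the toy semimeasure used to exercise it is an arbitrary stand-in, not from the paper:

```python
def xi_norm(xi, x):
    """Solomonoff normalization of a binary semimeasure xi:
    xi_norm(x_1:n) = prod_t xi(x_1:t) / (xi(x_<t 0) + xi(x_<t 1)).
    The result is a measure satisfying xi_norm(x) >= xi(x)."""
    p = 1.0
    for t in range(1, len(x) + 1):
        p *= xi(x[:t]) / (xi(x[:t-1] + "0") + xi(x[:t-1] + "1"))
    return p

# toy semimeasure: xi(x) = 3^-l(x); it "loses mass" at every step,
# since xi(x0) + xi(x1) = (2/3) * xi(x) < xi(x)
toy = lambda x: 3.0 ** -len(x)
```

On this toy input, xi_norm redistributes the lost mass: xi_norm(x) = 2^{-l(x)}, which is a proper measure and dominates 3^{-l(x)}.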
(iii) Let µ ∈ M^semi_comp. We recursively define the sequence x*_{1:∞} by
x*_k := argmin_{xk} µ(x*_{<k} xk), and the measure ρ by ρ(x*_{1:k}) = 1 ∀k and ρ(x) = 0 for all x which are
not prefixes of x*_{1:∞}. Exploiting the fact that a minimum is smaller than an average
and that µ is a semimeasure, we get µ(x*_{1:k}) = min_{xk} µ(x*_{<k} xk) ≤ ½[µ(x*_{<k}0) + µ(x*_{<k}1)] ≤
½µ(x*_{<k}). Hence µ(x*_{1:n}) ≤ (½)^n = (½)^n ρ(x*_{1:n}), which demonstrates that µ does not
dominate ρ. Since µ ∈ M^semi_comp was arbitrary and ρ is a computable measure, this implies
M^semi_comp ⋡ M^msr_comp.
Assume now that there is an estimable semimeasure σ ⪰ M^msr_comp. We construct
a finitely computable function µ from σ as follows. Choose an initial ε > 0 and finitely
compute an ε-approximation σ̂ of σ(x). If σ̂ > 4ε define µ(x) := σ̂, else halve ε and
repeat the process. Since σ(x) > 0 (otherwise it could not dominate, e.g., 2^{−l(x)}) the
loop terminates after finite time, so µ is finitely computable. Inserting σ̂ = µ(x) and
ε < ¼σ̂ = ¼µ(x) into |σ(x) − σ̂| < ε we get |σ(x) − µ(x)| < ¼µ(x), which implies ¾µ(x) ≤
σ(x) ≤ 5/4·µ(x). Unfortunately µ is not a semimeasure, but it still satisfies the weaker
inequality µ(x0) + µ(x1) ≤ 4/3·[σ(x0) + σ(x1)] ≤ 4/3·σ(x) ≤ 4/3 · 5/4·µ(x) = 5/3·µ(x). This is sufficient
for the first half of the proof of (iii) to go through with ½ replaced by ½ · 5/3 = 5/6 < 1,
which shows that µ ⋡ M^msr_comp. But this contradicts µ ≥ 4/5·σ ⪰ M^msr_comp, showing that our
assumed estimable semimeasure σ does not exist, i.e. M^semi_est ⋡ M^msr_comp.
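The diagonalization in the first half of (iii) is easy to run against any concrete computable µ. A minimal sketch; the Bernoulli µ below is an arbitrary test case, not from the paper:

```python
def diagonal_sequence(mu, n):
    """x*_k := argmin over the next bit of mu(x*_<k x_k), ties -> smallest bit.
    Since a minimum is at most the average and mu is a semimeasure,
    mu(x*_1:n) <= 2^-n, while the measure rho concentrated on x*_1:inf
    assigns probability 1 to every prefix -- so mu does not dominate rho."""
    x = ""
    for _ in range(n):
        x += min("01", key=lambda b: mu(x + b))
    return x

# test case: mu = Bernoulli(0.7) measure; the argmin always picks bit 0
mu = lambda x: 0.7 ** sum(map(int, x)) * 0.3 ** (len(x) - sum(map(int, x)))
```

Here the diagonal sequence is 000..., and mu assigns it probability 0.3^n, far below the 2^{-n} bound and negligible against rho's probability 1.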
(iv) Assume µ ∈ M^semi_appr ⪰ M^msr_appr. We construct an approximable measure ρ which
is not dominated by µ, thus contradicting the assumption. Let µ1, µ2, ... be a sequence
of recursive functions converging to µ. We recursively (in t and n) define sequences
y_n^1, y_n^2, ... converging to y_n, and from them ρ1, ρ2, ... converging to ρ. Let y_n^1 = 0 ∀n. If
µt(y^t_{<n} y_n^{t−1}) > ⅔·µt(y^t_{<n}) then y_n^t := argmin_{xn} µt(y^t_{<n} xn), else y_n^t := y_n^{t−1}. We show that
y_n^t converges for t → ∞ by assuming the contrary and deriving a contradiction. Assume
that k is the smallest n for which y_n^t does not converge. Since y_n^t → y_n for all n < k and y_n^t ∈ {0,1} is
discrete, there is a t0 such that y^t_{<k} = y_{<k} ∀t > t0. Assume t > t0 in the following. Since
y_k^t ∈ {0,1}, some value, say ỹk, is assumed infinitely often. Non-convergence implies that
the sequence leaves and enters ỹk infinitely often. If ỹk is left (y_k^{t−1} = ỹk ≠ y_k^t) we have

µt(y_{<k} ỹk) = µt(y^t_{<k} y_k^{t−1}) > ⅔·µt(y^t_{<k}) = ⅔·µt(y_{<k}) → ⅔·µ(y_{<k})  (t → ∞).

If ỹk is entered (y_k^{t−1} ≠ ỹk = y_k^t) we have

µt(y_{<k} ỹk) = µt(y^t_{<k} y_k^t) = min_{xk} µt(y^t_{<k} xk) ≤ ½[µt(y^t_{<k}0) + µt(y^t_{<k}1)] ≤ ½µt(y^t_{<k}) = ½µt(y_{<k}) → ½µ(y_{<k}).

Hence µt(y_{<k} ỹk) oscillates infinitely often between > ⅔µ(y_{<k}) and ≤ ½µ(y_{<k}), which
contradicts the assumption that µt converges. Hence the assumption of a non-convergent
y_k^t was wrong. With y_k^t, also the measure ρt(y^t_{1:n}) := 1 (and ρt(x) = 0 for all other x
which are not prefixes of y^t_{1:∞}) converges. For all sufficiently large t we have
y^t_{1:n} = y_{1:n}, hence µt(y_{1:n}) = µt(y^t_{1:n}) ≤ ⅔µt(y^t_{<n}) ≤ ... ≤ (⅔)^n.
Since µ(y_{1:n}) ≤ (⅔)^n does not dominate ρ(y_{1:n}) = 1, we have µ ⋡ ρ. Since
µ ∈ M^semi_appr was arbitrary and ρ is an approximable measure, we get M^semi_appr ⋡ M^msr_appr. □

6 Posterior Convergence
We have investigated in detail the computational properties of various mixture
distributions ξ. A mixture ξM multiplicatively dominates all distributions in
M. We have mentioned that dominance implies posterior convergence. In this
section we present in more detail what dominance does and does not imply.
Convergence of ξ(xt|x<t) to µ(xt|x<t) with µ-probability 1 tells us that
ξ(xt|x<t) is close to µ(xt|x<t) for sufficiently large t and "most" sequences x1:∞.
It says nothing about the speed of convergence, nor whether convergence is true
for any particular sequence (of measure 0). Convergence in mean sum, defined
below, is intended to capture the rate of convergence; Martin-Löf randomness is
used to capture convergence properties for individual sequences.
Martin-Löf randomness is a very important concept of randomness of in-
dividual sequences, which is closely related to Kolmogorov complexity and
Solomonoff’s universal prior. Levin gave a characterization equivalent to Martin-
Löf’s original definition [Lev73]:

Theorem 4 (Martin-Löf random sequences). A sequence x1:∞ is µ-Martin-Löf
random (µ.M.L.) iff there is a constant c such that M(x1:n) ≤ c·µ(x1:n) for all n.

One can show that a µ.M.L. random sequence x1:∞ passes all thinkable effective
randomness tests, e.g. the law of large numbers, the law of the iterated logarithm,
etc. In particular, the set of all µ.M.L. random sequences has µ-measure 1. The
following generalization is natural when considering general Bayes-mixtures ξ as
in this work:

Definition 3 (µ/ξ-random sequences). A sequence x1:∞ is called µ/ξ-random
(µ.ξ.r.) iff there is a constant c such that ξ(x1:n) ≤ c·µ(x1:n) for all n.

Typically, ξ is a mixture over some M as defined in (3), in which case the reverse
inequality ξ(x) ⪰ µ(x) is also true (for all x). For finite M, or if ξ ∈ M, the definition
of µ/ξ-randomness depends only on M and not on the specific weights used
in ξ. For M = M^semi_enum, µ/ξ-randomness is just µ.M.L. randomness. The larger
M, the more patterns are recognized as non-random. Roughly speaking, those
regularities characterized by some ν ∈ M are recognized by µ/ξ-randomness,
i.e. for M ⊂ M^semi_enum some µ/ξ-random strings may not be M.L. random. Other
randomness concepts, e.g. those by Schnorr, Ko, van Lambalgen, Lutz, Kurtz,
von Mises, Wald, and Church (see [Wan96,Lam87,Sch71]), could possibly also
be characterized in terms of µ/ξ-randomness for particular choices of M.
A classical (non-random) real-valued sequence at is defined to converge to a*,
short at → a*, iff ∀ε ∃t0 ∀t ≥ t0 : |at − a*| < ε. We are interested in convergence
properties of random sequences zt(ω) for t → ∞ (e.g. zt(ω) = ξ(ωt|ω<t) − µ(ωt|ω<t)). We
denote µ-expectations by E. The expected value of a function f : X^t → IR, dependent
on x1:t, independent of xt+1:∞, and possibly undefined on a set of µ-measure
0, is E[f] = Σ′_{x1:t∈X^t} µ(x1:t) f(x1:t). The prime denotes that the sum is restricted
to x1:t with µ(x1:t) ≠ 0. Similarly we use P[..] to denote the µ-probability of event
[..]. We define four convergence concepts for random sequences.
Definition 4 (Convergence of random sequences). Let z1(ω), z2(ω), ... be
a sequence of real-valued random variables. zt is said to converge for t → ∞ to the
random variable z*(ω)
i) with probability 1 (w.p.1) :⇔ P[{ω : zt → z*}] = 1,
ii) in mean sum (i.m.s.) :⇔ Σ_{t=1}^∞ E[(zt − z*)²] < ∞,
iii) for every µ-Martin-Löf random sequence (µ.M.L.) :⇔
∀ω : [∃c ∀n : M(ω1:n) ≤ c·µ(ω1:n)] implies zt(ω) → z*(ω) for t → ∞,
iv) for every µ/ξ-random sequence (µ.ξ.r.) :⇔
∀ω : [∃c ∀n : ξ(ω1:n) ≤ c·µ(ω1:n)] implies zt(ω) → z*(ω) for t → ∞.
In statistics, (i) is the "default" characterization of convergence of random sequences.
Convergence i.m.s. (ii) is very strong: it provides a rate of convergence
in the sense that the expected number of times t in which zt deviates more
than ε from z* is finite and bounded by Σ_{t=1}^∞ E[(zt − z*)²]/ε². Nothing can be
said about for which t these deviations occur. If, additionally, |zt − z*| were monotone
decreasing, then |zt − z*| = o(t^{−1/2}) could be concluded. (iii) uses Martin-Löf's
notion of randomness of individual sequences to define convergence M.L. Since
this work deals with general Bayes-mixtures ξ, we generalized in (iv) the definition
of convergence M.L. based on M to convergence µ.ξ.r. based on ξ in a
natural way. One can show that convergence i.m.s. implies convergence w.p.1.
Also convergence M.L. implies convergence w.p.1. Universality of ξ implies the
following posterior convergence results:
Theorem 5 (Convergence of ξ to µ). Let there be sequences x1 x2 ... over a
finite alphabet X drawn with probability µ(x1:n) for the first n symbols, where
µ ∈ M is a measure and M a countable set of (semi)measures. The universal/mixture
posterior probability ξ(xt|x<t) of the next symbol xt given x<t is related to the
true posterior probability µ(xt|x<t) in the following way:

Σ_{t=1}^n E[(√(ξ(xt|x<t)/µ(xt|x<t)) − 1)²] ≤ Σ_{t=1}^n E[Σ_{xt} (√(ξ(xt|x<t)) − √(µ(xt|x<t)))²] ≤ ln wµ^{−1}
where wµ is the weight (4) of µ in ξ. This implies

ξ(xt|x<t) → µ(xt|x<t) for any xt, and ξ(xt|x<t)/µ(xt|x<t) → 1, both i.m.s. for t → ∞.

The latter strengthens the result ξ(xt|x<t)/µ(xt|x<t) → 1 w.p.1 derived by Gács
in [LV97, Th.5.2.2] in that it also provides the "speed" of convergence.
Note also the subtle difference between the two convergence results. For any
fixed sequence x1:∞ (possibly constant and not necessarily µ-random), µ(xt|x<t) −
ξ(xt|x<t) converges to zero w.p.1 (referring to the sampled sequence), but no statement is
possible for ξ(xt|x<t)/µ(xt|x<t), since lim inf µ(xt|x<t) could be zero. On the
other hand, if we stay on the µ-random sequence itself (xt the actually sampled symbol), we have
ξ(xt|x<t)/µ(xt|x<t) → 1 (whether inf µ(xt|x<t) tends to zero or not does not
matter). Indeed, it is easy to see that ξ(1|0<t)/µ(1|0<t) ∝ t → ∞ diverges for
M = {µ,ν}, µ(1|x<t) := ½t^{−3} and ν(1|x<t) := ½t^{−2}, although 01:∞ is µ-random.

Proof. For a probability distribution yi ≥ 0 with Σi yi = 1 and a semi-distribution
zi ≥ 0 with Σi zi ≤ 1, i ∈ {1,...,N}, the Hellinger distance h(y,z) := Σi (√yi − √zi)²
is upper bounded by the relative entropy d(y,z) = Σi yi ln(yi/zi) (with 0·ln(0/z) := 0). This can
be seen as follows: For arbitrary 0 ≤ y ≤ 1 and 0 ≤ z ≤ 1 we define

f(y,z) := y ln(y/z) − (√y − √z)² + z − y = 2y·g(√(z/y)) with g(t) := −ln t + t − 1 ≥ 0.

This shows f ≥ 0, and hence Σi f(yi, zi) ≥ 0, which implies

Σi yi ln(yi/zi) − Σi (√yi − √zi)² ≥ Σi yi − Σi zi ≥ 1 − 1 = 0.
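The inequality h(y,z) ≤ d(y,z) can be spot-checked numerically on random probability vectors y and semi-distributions z. This is merely a sanity check, not part of the proof:

```python
import math
import random

def h(y, z):
    """Squared Hellinger-type distance from the proof."""
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(y, z))

def d(y, z):
    """Relative entropy, with 0 ln(0/z) := 0."""
    return sum(a * math.log(a / b) for a, b in zip(y, z) if a > 0)

random.seed(1)
for _ in range(1000):
    y = [random.random() + 1e-12 for _ in range(4)]
    sy = sum(y)
    y = [a / sy for a in y]                 # probability: sum y_i = 1
    z = [random.random() + 1e-12 for _ in range(4)]
    sz = sum(z)
    scale = random.uniform(0.1, 1.0)
    z = [b / sz * scale for b in z]         # semi-distribution: sum z_i <= 1
    assert h(y, z) <= d(y, z) + 1e-12       # Hellinger bounded by KL
```

The slack d − h is at least 1 − Σz, so the bound gets tight only when z is itself (nearly) a probability distribution.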

The (conditional) µ-expectations of a function f : X^t → IR are defined as

E[f] = Σ′_{x1:t∈X^t} µ(x1:t) f(x1:t) and Et[f] := E[f|x<t] = Σ′_{xt∈X} µ(xt|x<t) f(x1:t),

where Σ′ sums over all xt or x1:t for which µ(x1:t) ≠ 0. If we insert X = {1,...,N},
N = |X|, i = xt, yi = µt := µ(xt|x<t), and zi = ξt := ξ(xt|x<t) into h and d, we get (w.p.1)

ht(x<t) := Σ_{xt} (√µt − √ξt)² ≤ dt(x<t) := Σ_{xt} µt ln(µt/ξt) = Et[ln(µt/ξt)].
Taking the expectation E and the sum Σ_{t=1}^n we get

Σ_{t=1}^n E[dt(x<t)] = Σ_{t=1}^n E[Et[ln(µt/ξt)]] = E[Σ_{t=1}^n ln(µt/ξt)] = E[ln(µ(x1:n)/ξ(x1:n))] ≤ ln wµ^{−1}    (5)

where we have used E[Et[..]] = E[..] and exchanged the t-sum with the expectation
E; the sum transforms to a product inside the logarithm, and in the last equality we have
used the chain rule for µ and ξ. Using universality ξ(x1:n) ≥ wµ·µ(x1:n) yields the final
inequality. Finally,

Et[(√(ξt/µt) − 1)²] = Σ_{xt} µt (√(ξt/µt) − 1)² = Σ_{xt} (√ξt − √µt)² = ht(x<t) ≤ dt(x<t).

Taking the expectation E and the sum Σ_{t=1}^n and chaining the result with (5) yields
Theorem 5. □
7 Convergence in Martin-Löf Sense

An interesting open question is whether ξ converges to µ (in difference or ratio)


individually for all Martin-Löf random sequences. Clearly, convergence µ.M.L.
may at most fail for a set of sequences with µ-measure zero. A convergence M.L.
result would be particularly interesting and natural for Solomonoff’s universal
prior M , since M.L. randomness can be defined in terms of M (see Theorem 4).
Attempts to convert the bounds in Theorem 5 to effective µ.M.L. randomness
tests fail, since M(xt|x<t) is not enumerable. The proof of M/µ → 1 M.L. given in
[LV97, Th.5.2.2] and [VL00, Th.10] is incomplete.¹ The implication "M(x1:n) ≤
c·µ(x1:n) ∀n ⇒ lim_{n→∞} M(x1:n)/µ(x1:n) exists" has been used, but not proven,
and may indeed be wrong. Theorem 4 only implies sup_n M(x1:n)/µ(x1:n) < ∞
for M.L. random sequences x1:∞, and [Doo53, pp. 324–325] implies only that
lim_{n→∞} M(x1:n)/µ(x1:n) exists w.p.1, and not µ.M.L.
Vovk [Vov87] shows for two finitely computable semimeasures µ and ρ,
and x1:∞ being µ and ρ M.L. random, that

Σ_{t=1}^∞ Σ_{xt} (µ(xt|x<t) − ρ(xt|x<t))² < ∞ and Σ_{t=1}^∞ (ρ(xt|x<t)/µ(xt|x<t) − 1)² < ∞.

If M were recursive, then this would imply posterior M → µ and M/µ → 1 for
every µ.M.L. random sequence x1:∞ , since every sequence is M .M.L. random.
Since M is not recursive Vovk’s theorem cannot be applied and it is not obvious
how to generalize it. So the question of individual convergence remains open.
More generally, one may ask whether ξM →µ for every µ/ξ-random sequence. It
turns out that this is true for some M, but false for others.

Theorem 6 (µ/ξ-convergence of ξ to µ). Let X = {0,1} be binary and
MΘ := {µθ : µθ(1|x<t) = θ ∀t, θ ∈ Θ} be the set of Bernoulli(θ) distributions
with parameters θ ∈ Θ. Let ΘD be a countable dense subset of [0,1], e.g. [0,1] ∩ IQ,
and let ΘG be a countable subset of [0,1] with a gap, in the sense that there
exist 0 < θ0 < θ1 < 1 such that [θ0,θ1] ∩ ΘG = {θ0,θ1}, e.g. ΘG = {¼,½} or
ΘG = ([0,¼] ∪ [½,1]) ∩ IQ. Then:

i) If x1:∞ is µ/ξ_{MΘD}-random with µ ∈ MΘD, then ξ_{MΘD}(xt|x<t) → µ(xt|x<t),
¹ The formulation of their Theorem is quite misleading in general: "Let µ be a positive
recursive measure. If the length of y is fixed and the length of x grows to infinity,
then M(y|x)/µ(y|x) → 1 with µ-probability one. The infinite sequences ω with prefixes
x satisfying the displayed asymptotics are precisely ['⇒' and '⇐'] the µ-random
sequences." First, for off-sequence y convergence w.p.1 does not hold (xy must be
demanded to be a prefix of ω). Second, the proof of '⇐' has loopholes (see main
text). Last, '⇒' is given without proof and is probably wrong. Also the assertion in
[LV97, Th.5.2.1] that St := E[Σ_{xt} (µ(xt|x<t) − M(xt|x<t))²] converges to zero faster
than 1/t cannot be made, since St may not decrease monotonically. For example,
for at := 1/√t if t is a cube and 0 otherwise, we have Σ_{t=1}^∞ at < ∞, but at ≠ o(1/t).
ii) There are µ ∈ MΘG and µ/ξ_{MΘG}-random x1:∞ for which ξ_{MΘG}(xt|x<t) ↛ µ(xt|x<t).
Our original/main motivation for studying µ/ξ-randomness is the implication of
Theorem 6 that M → µ M.L. cannot be decided from M being a mixture distribution
or from the universality property (Theorem 1) alone. Further structural
properties of M^semi_enum have to be employed. For Bernoulli sequences, convergence
µ.ξ_{MΘ}.r. is related to denseness of MΘ. Maybe a denseness characterization
of M^semi_enum can solve the question of convergence M.L. of M. The property
M ∈ M^semi_enum is also not sufficient to resolve this question, since there are mixtures ξ
with M ⪰ ξ for which ξ → µ µ.ξ.r., and mixtures ξ with M ⪰ ξ for which ξ ↛ µ µ.ξ.r.
Theorem 6 can be generalized to i.i.d. sequences over general finite alphabet X.
The idea for proving (ii) is to construct a sequence x1:∞ which is µθ0/ξ-random
and µθ1/ξ-random for θ0 ≠ θ1. This is possible if and only if Θ contains a gap
and θ0 and θ1 are the boundaries of the gap. Obviously ξ cannot converge to both θ0
and θ1, thus proving non-convergence. For no θ ∈ [0,1] will this x1:∞ be µθ M.L.-random.
Finally, the proof of Theorem 6 makes essential use of the mixture
representation of ξ, as opposed to the proof of Theorem 5, which only needs
dominance ξ ⪰ M.
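The gap construction can be simulated numerically. In the sketch below, the parameter choices θ0 = ¼, θ1 = ½, the equal prior weights, and the greedy frequency-tracking rule for generating x1:∞ are illustrative choices, not from the paper. The simulation keeps the empirical frequency θ̂n near the KL-middle θ̄, so that both posterior weights stay bounded away from zero and ξ(1|x1:n) stays strictly between θ0 and θ1:

```python
import math

th0, th1 = 0.25, 0.5                     # gap boundaries, Theta_G = {1/4, 1/2}
# KL-middle point: D(th_bar||th0) = D(th_bar||th1)
th_bar = math.log((1 - th0) / (1 - th1)) / math.log(th1 * (1 - th0) / (th0 * (1 - th1)))

ones = 0
for n in range(1, 2001):
    # greedy bit choice keeps |ones/n - th_bar| < 1/n
    b = min((0, 1), key=lambda bit: abs((ones + bit) / n - th_bar))
    ones += b
    # log-likelihoods of the two Bernoulli components
    ll0 = ones * math.log(th0) + (n - ones) * math.log(1 - th0)
    ll1 = ones * math.log(th1) + (n - ones) * math.log(1 - th1)
    r = math.exp(ll1 - ll0)              # mu_th1/mu_th0, stays bounded in [1/3, 3]
    w0n = 1.0 / (1.0 + r)                # posterior weight of th0 (equal priors)
    w1n = 1.0 - w0n
    xi1 = w0n * th0 + w1n * th1          # xi(1|x_1:n)

assert 1/3 - 1e-9 < r < 3 + 1e-9         # both components stay "random" for x_1:n
assert th0 + 0.05 < xi1 < th1 - 0.05     # xi(1|x_1:n) converges to neither th0 nor th1
```

Because the likelihood ratio r never leaves a fixed bounded interval, x1:∞ is simultaneously µθ0/ξ-random and µθ1/ξ-random, and the posterior prediction is trapped strictly between the two parameters.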
Proof. Let X = {0,1} and M = {µθ : θ ∈ Θ} with countable Θ ⊂ [0,1] and µθ(1|x1:n) =
θ = 1 − µθ(0|x1:n), which implies

µθ(x1:n) = θ^{n1}(1−θ)^{n−n1}, n1 := x1 + ... + xn, θ̂ ≡ θ̂n := n1/n.

θ̂ depends on n; all other used/defined θ will be independent of n. ξ is defined in the
standard way as

ξ(x1:n) = Σ_{θ∈Θ} wθ µθ(x1:n) ⇒ ξ(x1:n) ≥ wθ µθ(x1:n),    (6)

where Σθ wθ = 1 and wθ > 0 ∀θ. In the following let µ = µθ0 ∈ M be the true environment.

ω = x1:∞ is µ/ξ-random ⇔ ∃cω : ξ(x1:n) ≤ cω·µθ0(x1:n) ∀n    (7)
For binary alphabet it is sufficient to establish whether ξ(1|x1:n) → θ0 ≡ µ(1|x1:n) for
µ/ξ-random x1:∞ in order to decide whether ξ(xn|x<n) → µ(xn|x<n). We need the following
posterior representation of ξ:

ξ(1|x1:n) = Σ_{θ∈Θ} w_n^θ µθ(1|x1:n),  w_n^θ := wθ·µθ(x1:n)/ξ(x1:n) ≤ wθ·µθ(x1:n)/(wθ0·µθ0(x1:n)),  Σ_{θ∈Θ} w_n^θ = 1    (8)

The ratio µθ/µθ0 can be represented as follows:

µθ(x1:n)/µθ0(x1:n) = e^{n[D(θ̂n||θ0) − D(θ̂n||θ)]} where D(θ̂||θ) = θ̂ ln(θ̂/θ) + (1−θ̂) ln((1−θ̂)/(1−θ))    (9)

is the relative entropy between θ̂ and θ, which is continuous in θ̂ and θ, and is 0 if and
only if θ̂ = θ. We also need the following implication for sets Ω ⊆ Θ:

If w_n^θ ≤ wθ·gθ(n) → 0 (n → ∞) and gθ(n) ≤ c ∀θ ∈ Ω, then Σ_{θ∈Ω} w_n^θ µθ(1|x1:n) ≤ Σ_{θ∈Ω} w_n^θ → 0,    (10)

which follows from boundedness Σθ wθ ≤ 1 and µθ ≤ 1. We now prove Theorem 6.
We leave the special considerations necessary when 0,1 ∈ Θ to the reader and assume,
henceforth, 0,1 ∉ Θ.
(i) Let Θ be a countable dense subset of (0,1) and x1:∞ be µ/ξ-random. Using (6)
and (7) in (9), for θ ∈ Θ to be determined later, we can bound

e^{n[D(θ̂n||θ0) − D(θ̂n||θ)]} = µθ(x1:n)/µθ0(x1:n) ≤ cω/wθ =: c < ∞    (11)

Let us assume that θ̂ ≡ θ̂n ↛ θ0. This implies that there exists a cluster point θ̃ ≠ θ0
of the sequence θ̂n, i.e. θ̂n is infinitely often in an ε-neighborhood of θ̃, e.g. D(θ̂n||θ̃) ≤ ε
for infinitely many n. θ̃ ∈ [0,1] may be outside Θ. Since θ̃ ≠ θ0, this implies that θ̂n
must be "far" away from θ0 infinitely often: e.g. for ε = ¼(θ̃ − θ0)², using D(θ̂||θ̃) +
D(θ̂||θ0) ≥ (θ̃ − θ0)², we get D(θ̂||θ0) ≥ 3ε. We now choose θ ∈ Θ so near to θ̃ that
|D(θ̂||θ) − D(θ̂||θ̃)| ≤ ε (here we use denseness of Θ). Chaining all inequalities we get
D(θ̂||θ0) − D(θ̂||θ) ≥ 3ε − ε − ε = ε > 0. This, together with (11), implies e^{nε} ≤ c for infinitely
many n, which is impossible. Hence, the assumption θ̂n ↛ θ0 was wrong.
Now, θ̂n → θ0 implies that for arbitrary θ ≠ θ0, θ ∈ Θ, and for sufficiently large n
there exists δθ > 0 such that D(θ̂n||θ) ≥ 2δθ (since D(θ0||θ) ≠ 0) and D(θ̂n||θ0) ≤ δθ. This
implies

w_n^θ ≤ (wθ/wθ0)·e^{n[D(θ̂n||θ0) − D(θ̂n||θ)]} ≤ (wθ/wθ0)·e^{−nδθ} → 0 (n → ∞),

where we have used (8) and (9) in the first inequality; the second inequality holds
for sufficiently large n. Hence Σ_{θ≠θ0} w_n^θ → 0 by (10), and w_n^{θ0} → 1 by normalization (8),
which finally gives

ξ(1|x1:n) = w_n^{θ0} µθ0(1|x1:n) + Σ_{θ≠θ0} w_n^θ µθ(1|x1:n) → µθ0(1|x1:n).

(ii) We first consider the case Θ = {θ_0, θ_1}: Let us choose

   θ̄ := ln((1−θ_0)/(1−θ_1)) / ln(θ_1(1−θ_0)/(θ_0(1−θ_1)))

(potentially ∉ Θ) in the (KL) middle of θ_0 and θ_1, such that

   D(θ̄||θ_0) = D(θ̄||θ_1),   0 < θ_0 < θ̄ < θ_1 < 1,   (12)

and choose x_{1:∞} such that θ̂_n := n_1/n satisfies |θ̂_n − θ̄| ≤ 1/n (⇒ θ̂_n → θ̄ for n→∞).
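Expanding the definition of D shows that condition (12) is linear in θ̄, which yields the closed form quoted above:

```latex
D(\bar\theta\|\theta_0)-D(\bar\theta\|\theta_1)
  = \bar\theta\ln\tfrac{\theta_1}{\theta_0}
  + (1-\bar\theta)\ln\tfrac{1-\theta_1}{1-\theta_0} = 0
\;\Longleftrightarrow\;
\bar\theta\ln\tfrac{\theta_1(1-\theta_0)}{\theta_0(1-\theta_1)}
  = \ln\tfrac{1-\theta_0}{1-\theta_1}
\;\Longleftrightarrow\;
\bar\theta
  = \frac{\ln\frac{1-\theta_0}{1-\theta_1}}
         {\ln\frac{\theta_1(1-\theta_0)}{\theta_0(1-\theta_1)}}.
```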
Using |D(θ̂||θ) − D(θ̄||θ)| ≤ c|θ̂ − θ̄| for all θ, θ̂, θ̄ ∈ [θ_0,θ_1]
(with c = ln(θ_1(1−θ_0)/(θ_0(1−θ_1))) < ∞) twice in (9) we get

   µ_{θ_1}(x_{1:n}) / µ_{θ_0}(x_{1:n}) = e^{n[D(θ̂_n||θ_0) − D(θ̂_n||θ_1)]}
   ≤ e^{n[D(θ̄||θ_0) + c|θ̂_n−θ̄| − D(θ̄||θ_1) + c|θ̂_n−θ̄|]} ≤ e^{2c},   (13)
where we have used (12) and |θ̂_n − θ̄| ≤ 1/n in the last inequality. Now, (13) and (8) lead to

   w^n_{θ_0} = w_{θ_0} µ_{θ_0}(x_{1:n}) / ξ(x_{1:n})
   = [1 + (w_{θ_1}/w_{θ_0}) · µ_{θ_1}(x_{1:n})/µ_{θ_0}(x_{1:n})]^{−1}
   ≥ [1 + (w_{θ_1}/w_{θ_0}) e^{2c}]^{−1} =: c_0 > 0,   (14)

which shows that x_{1:∞} is µ_{θ_0}/ξ-random by (7). Exchanging θ_0 ↔ θ_1 in (13) and (14)
we similarly get w^n_{θ_1} ≥ c_1 > 0, which implies (using w^n_{θ_0} + w^n_{θ_1} = 1)

   ξ(1|x_{1:n}) = Σ_{θ∈{θ_0,θ_1}} w^n_θ µ_θ(1|x_{1:n}) = w^n_{θ_0}·θ_0 + w^n_{θ_1}·θ_1 ≠ θ_0 = µ_{θ_0}(1|x_{1:n}).   (15)

This shows ξ(1|x_{1:n}) ↛ µ(1|x_{1:n}). For general Θ with a gap, in the sense that there
exist 0 < θ_0 < θ_1 < 1 with [θ_0,θ_1] ∩ Θ = {θ_0,θ_1}, one can show that all θ ≠ θ_0,θ_1 give
asymptotically no contribution to ξ(1|x_{1:n}), i.e. (15) still applies. □
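The gap construction of part (ii) can likewise be checked numerically. The sketch below uses illustrative choices (θ_0 = 0.2, θ_1 = 0.6, equal prior weights, and a finite horizon n); none of these numbers come from the text. A deterministic sequence whose relative frequency tracks the KL-middle point θ̄ keeps both posterior weights bounded away from 0, so ξ(1|x_{1:n}) stays strictly between θ_0 and θ_1 and cannot converge to µ_{θ_0}(1|·) = θ_0.

```python
import math

# Illustration of part (ii): Theta = {theta0, theta1} with a gap.
theta0, theta1 = 0.2, 0.6
w0 = w1 = 0.5                              # prior weights w_theta0, w_theta1

# theta_bar from (12): D(theta_bar||theta0) = D(theta_bar||theta1)
theta_bar = math.log((1 - theta0) / (1 - theta1)) / math.log(
    theta1 * (1 - theta0) / (theta0 * (1 - theta1)))

def D(p, q):
    # binary relative entropy D(p||q)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# deterministic x_{1:n} with |n1/n - theta_bar| <= 1/n
n, n1 = 20000, 0
logm0 = logm1 = 0.0                        # log mu_theta0, log mu_theta1
for t in range(1, n + 1):
    xt = 1 if n1 < theta_bar * t else 0    # greedy digit keeps n1/t near theta_bar
    n1 += xt
    logm0 += math.log(theta0 if xt else 1 - theta0)
    logm1 += math.log(theta1 if xt else 1 - theta1)

# posterior weights w_theta^n via (8); ratio form avoids underflow
r = (w1 / w0) * math.exp(logm1 - logm0)
wn0 = 1 / (1 + r)
wn1 = 1 - wn0
xi_pred = wn0 * theta0 + wn1 * theta1      # xi(1|x_{1:n}) as in (15)
print(wn0, wn1, xi_pred)
```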
312 M. Hutter

8 Conclusions
For a hierarchy of four computability definitions, we completed the classification
of the existence of computable (semi)measures dominating all computable
(semi)measures. Dominance is an important property of a prior, since it implies
rapid convergence of the corresponding posterior with probability one.
A strengthening would be convergence for all Martin-Löf (M.L.) random sequences.
This seems natural, since M.L. randomness can be defined in terms of
Solomonoff's prior M, so there is a close connection. Contrary to what was
believed before, the question of posterior convergence M/µ → 1 for all M.L. random
sequences is still open. We introduced a new flexible notion of µ/ξ-randomness
which contains Martin-Löf randomness as a special case. Though this notion
may have a wider range of application, the main purpose for its introduction
was to show that standard proof attempts of M/µ → 1 on all M.L. random
sequences based on dominance only must fail. This follows from the derived
result that the validity of ξ/µ → 1 for µ/ξ-random sequences depends on the
Bayes mixture ξ.

References
[Doo53] J. L. Doob. Stochastic Processes. John Wiley & Sons, New York, 1953.
[Hut01] M. Hutter. Convergence and error bounds of universal prediction for general
alphabet. In Proceedings of the 12th European Conference on Machine Learning
(ECML-2001), pages 239–250, 2001.
[Hut03] M. Hutter. Sequence prediction based on monotone complexity. In Proceedings
of the 16th Conference on Computational Learning Theory (COLT-2003).
[Lam87] M. van Lambalgen. Random Sequences. PhD thesis, Univ. Amsterdam, 1987.
[Lev73] L. A. Levin. On the notion of a random sequence. Soviet Math. Dokl.,
14(5):1413–1416, 1973.
[LV97] M. Li and P. M. B. Vitányi. An introduction to Kolmogorov complexity and
its applications. Springer, 2nd edition, 1997.
[Sch71] C. P. Schnorr. Zufälligkeit und Wahrscheinlichkeit. Springer, Berlin, 1971.
[Sch02] J. Schmidhuber. Hierarchies of generalized Kolmogorov complexities and
nonenumerable universal measures computable in the limit. International
Journal of Foundations of Computer Science, 13(4):587–612, 2002.
[Sol64] R. J. Solomonoff. A formal theory of inductive inference: Part 1 and 2. Inform.
Control, 7:1–22, 224–254, 1964.
[Sol78] R. J. Solomonoff. Complexity-based induction systems: comparisons and con-
vergence theorems. IEEE Trans. Inform. Theory, IT-24:422–432, 1978.
[VL00] P. M. B. Vitányi and M. Li. Minimum description length induction, Bayesian-
ism, and Kolmogorov complexity. IEEE Transactions on Information Theory,
46(2):446–464, 2000.
[Vov87] V. G. Vovk. On a randomness criterion. Soviet Mathematics Doklady,
35(3):656–660, 1987.
[Wan96] Y. Wang. Randomness and Complexity. PhD thesis, Univ. Heidelberg, 1996.
[ZL70] A. K. Zvonkin and L. A. Levin. The complexity of finite objects and the
development of the concepts of information and randomness by means of the
theory of algorithms. Russian Mathematical Surveys, 25(6):83–124, 1970.
Author Index

Arpe, Jan 99
Balbach, Frank 84
Case, John 234
Cristianini, Nello 175
De Bie, Tijl 175
Eiter, Thomas 1
Gammerman, Alex 283
Higuera, Colin de la 247
Hutter, Marcus 298
Jain, Sanjay 234
Kitagawa, Genshiro 3
Lange, Steffen 129
Lee, Jianguo 159
Martin, Eric 54
Matsumoto, Satoshi 114, 144
Miyahara, Tetsuhiro 114, 144
Momma, Michinari 175
Nouretdinov, Ilia 259, 283
Oncina, Jose 247
Ratsaby, Joel 205
Reischuk, Rüdiger 99, 234
Sato, Masako 69
Schuurmans, Dale 190
Sharma, Arun 54
Shoudai, Takayoshi 114, 144
Šíma, Jiří 221
Stephan, Frank 54, 234
Suzuki, Yusuke 114, 144
Takano, Akihiko 15
Tishby, Naftali 16
Uchida, Tomoyuki 114, 144
Uemura, Jin 69
V'yugin, Vladimir 283
Vovk, Vladimir 259, 268
Wang, Jingdong 159
Wang, Shaojun 190
Zeugmann, Thomas 17, 234
Zhang, Changshui 159
Zilles, Sandra 39, 129
