
Lecture Notes in Artificial Intelligence 5809

Edited by R. Goebel, J. Siekmann, and W. Wahlster

Subseries of Lecture Notes in Computer Science


Ricard Gavaldà · Gábor Lugosi
Thomas Zeugmann · Sandra Zilles (Eds.)

Algorithmic
Learning Theory

20th International Conference, ALT 2009


Porto, Portugal, October 3–5, 2009
Proceedings

Series Editors
Randy Goebel, University of Alberta, Edmonton, Canada
Jörg Siekmann, University of Saarland, Saarbrücken, Germany
Wolfgang Wahlster, DFKI and University of Saarland, Saarbrücken, Germany

Volume Editors
Ricard Gavaldà
Universitat Politècnica de Catalunya
LARCA Research Group, Departament de Llenguatges i Sistemes Informàtics
Jordi Girona Salgado 1-3, 08034 Barcelona, Spain
E-mail: gavalda@lsi.upc.edu

Gábor Lugosi
ICREA and Universitat Pompeu Fabra, Department of Economics
Ramon Trias Fargas 25-27, 08005 Barcelona, Spain
E-mail: gabor.lugosi@gmail.com

Thomas Zeugmann
Hokkaido University, Division of Computer Science
N-14, W-9, Sapporo 060-0814, Japan
E-mail: thomas@ist.hokudai.ac.jp

Sandra Zilles
University of Regina, Department of Computer Science
Regina, Saskatchewan, Canada S4S 0A2
E-mail: zilles@cs.uregina.ca

Library of Congress Control Number: 2009934440

CR Subject Classification (1998): I.2, I.2.6, K.3.1, F.2, G.2, I.2.2, I.5.3

LNCS Sublibrary: SL 7 – Artificial Intelligence

ISSN 0302-9743
ISBN-10 3-642-04413-1 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-04413-7 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer. Violations are liable
to prosecution under the German Copyright Law.
springer.com
© Springer-Verlag Berlin Heidelberg 2009
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper SPIN: 12760312 06/3180 543210
Preface

This volume contains the papers presented at the 20th International Conference
on Algorithmic Learning Theory (ALT 2009), which was held in Porto, Portugal,
October 3–5, 2009. The conference was co-located with the 12th International
Conference on Discovery Science (DS 2009). The technical program of ALT 2009
contained 26 papers selected from 60 submissions, and 5 invited talks. The in-
vited talks were presented during the joint sessions of both conferences.
ALT 2009 was the 20th conference in the ALT series, established in Japan
in 1990. Its predecessor, the series Analogical and Inductive Inference, was
held in 1986, 1989, and 1992, was co-located with ALT in 1994, and subsequently
merged with ALT. ALT maintains its strong connections to Japan,
but has also been held in other countries, such as Australia, Germany, Hungary,
Italy, Singapore, Spain, and the USA. The ALT series is supervised by its Steer-
ing Committee: Naoki Abe (IBM Thomas J. Watson Research Center, Yorktown,
USA), Shai Ben-David (University of Waterloo, Canada), Phil Long (Google,
Mountain View, USA), Gábor Lugosi (Pompeu Fabra University, Barcelona,
Spain), Akira Maruoka (Ishinomaki Senshu University, Japan), Takeshi Shino-
hara (Kyushu Institute of Technology, Iizuka, Japan), Frank Stephan (National
University of Singapore, Republic of Singapore), Einoshin Suzuki (Kyushu Uni-
versity, Fukuoka, Japan), Eiji Takimoto (Kyushu University, Fukuoka, Japan),
György Turán (University of Illinois at Chicago, USA, and University of Szeged,
Hungary), Osamu Watanabe (Tokyo Institute of Technology, Japan), Thomas
Zeugmann (Chair, Hokkaido University, Japan), and Sandra Zilles (Publicity
Chair, University of Regina, Canada). The ALT web pages have been set up
(together with Frank Balbach and Jan Poland) and are maintained by Thomas
Zeugmann.
The present volume contains the texts of the 26 papers presented at ALT
2009, divided into groups of papers on online learning, learning graphs, active
learning and query learning, statistical learning, inductive inference, and semi-
supervised and unsupervised learning. The volume also contains abstracts of the
invited talks:
– Sanjoy Dasgupta (University of California, San Diego, USA): The Two Faces
of Active Learning
– Hector Geffner (Universitat Pompeu Fabra, Barcelona, Spain): Inference and
Learning in Planning
– Jiawei Han (University of Illinois at Urbana-Champaign, USA): Mining
Heterogeneous Information Networks by Exploring the Power of Links
– Yishay Mansour (Tel Aviv University, Israel): Learning and Domain
Adaptation
– Fernando C.N. Pereira (Google, Mountain View, USA): Learning on the Web
Papers presented at DS 2009 are contained in the DS 2009 proceedings.

The E. Mark Gold Award has been presented annually at the ALT conferences
since 1999, for the most outstanding student contribution. This year, the award
was given to Hanna Mazzawi for the paper Reconstructing Weighted Graphs with
Minimal Query Complexity, co-authored by Nader Bshouty.
We would like to thank the many people and institutions who contributed
to the success of the conference. Thanks to the authors of the papers for their
submissions, and to the invited speakers for presenting exciting overviews of
important recent research developments. We are very grateful to the sponsors
of the conference for their generous financial support: University of Porto, Ar-
tificial Intelligence and Decision Support Laboratory, Center for Research in
Advanced Computing Systems, Portuguese Science and Technology Foundation,
Portuguese Artificial Intelligence Association, SAS, Alberta Ingenuity Centre for
Machine Learning, and Division of Computer Science, Hokkaido University.
We are grateful to the members of the Program Committee for ALT 2009.
Their hard work in reviewing and discussing the papers made sure that we
had an interesting and strong program. We also thank the subreferees assist-
ing the Program Committee. Special thanks go to the local arrangement chair
João Gama (University of Porto). We would like to thank the Discovery Sci-
ence conference for its ongoing collaboration with ALT, which makes it possible
to provide a well-rounded picture of the current theoretical and practical ad-
vances in machine learning and the related areas. In particular, we are grateful
to the conference chair João Gama (University of Porto) and Program Commit-
tee chairs Vítor Santos Costa (University of Porto) and Alípio Jorge (University
of Porto) for their cooperation. Last but not least, we thank Springer for their
support in preparing and publishing this volume of the Lecture Notes in Artificial
Intelligence series.

August 2009 Ricard Gavaldà


Gábor Lugosi
Thomas Zeugmann
Sandra Zilles
Organization

Conference Chair
Ricard Gavaldà Universitat Politècnica de Catalunya,
Barcelona, Spain

Program Committee
Peter Auer University of Leoben, Austria
José L. Balcázar Universitat Politècnica de Catalunya,
Barcelona, Spain
Shai Ben-David University of Waterloo, Canada
Avrim Blum Carnegie Mellon University, Pittsburgh, USA
Nader Bshouty Technion, Haifa, Israel
Claudio Gentile Università degli Studi dell’Insubria, Varese,
Italy
Peter Grünwald Centrum voor Wiskunde en Informatica (CWI),
Amsterdam, The Netherlands
Roni Khardon Tufts University, Medford, USA
Phil Long Google, Mountain View, USA
Gábor Lugosi ICREA and Pompeu Fabra University,
Barcelona, Spain (Chair)
Massimiliano Pontil University College London, UK
Alexander Rakhlin UC Berkeley, USA
Shai Shalev-Shwartz Toyota Technological Institute at Chicago, USA
Hans Ulrich Simon Ruhr-Universität Bochum, Germany
Frank Stephan National University of Singapore, Singapore
Csaba Szepesvári University of Alberta, Edmonton, Canada
Eiji Takimoto Kyushu University, Fukuoka, Japan
Sandra Zilles University of Regina, Canada (Chair)

Local Arrangements
João Gama University of Porto, Portugal

Subreferees
Jacob Abernethy
Andreas Argyriou
Marta Arias
John Case
Nicolò Cesa-Bianchi
Jiang Chen
Alexander Clark
Sanjoy Dasgupta
Tom Diethe
Ran El-Yaniv
Tim van Erven
Steve Hanneke
Kohei Hatano
Tamir Hazan
Colin de la Higuera
Jeffrey Jackson
Sanjay Jain
Sham Kakade
Jyrki Kivinen
Wouter Koolen
Timo Kötzing
Lucy Kuncheva
Steffen Lange
Alex Leung
Guy Lever
Tyler Lu
Eric Martin
Mario Martin
Samuel Moelius III
Rémi Munos
Francesco Orabona
Ronald Ortner
Dávid Pál
Joel Ratsaby
Nicola Rebagliati
Lev Reyzin
Sivan Sabato
Ohad Shamir
Robert Sloan
Jun’ichi Takeuchi
Christino Tamon
György Turán
Vladimir Vovk
Yiming Ying
Thomas Zeugmann

Sponsoring Institutions
University of Porto
Artificial Intelligence and Decision Support Laboratory
Center for Research in Advanced Computing Systems
Portuguese Science and Technology Foundation
Portuguese Artificial Intelligence Association
SAS
Alberta Ingenuity Centre for Machine Learning
Division of Computer Science, Hokkaido University
Table of Contents

Invited Papers
The Two Faces of Active Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Sanjoy Dasgupta
Inference and Learning in Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Hector Geffner
Mining Heterogeneous Information Networks by Exploring the Power
of Links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Jiawei Han
Learning and Domain Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Yishay Mansour
Learning on the Web . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Fernando C.N. Pereira

Regular Contributions

Online Learning
Prediction with Expert Evaluators’ Advice . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Alexey Chernov and Vladimir Vovk
Pure Exploration in Multi-armed Bandits Problems . . . . . . . . . . . . . . . . . . 23
Sébastien Bubeck, Rémi Munos, and Gilles Stoltz
The Follow Perturbed Leader Algorithm Protected from Unbounded
One-Step Losses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Vladimir V. V’yugin
Computable Bayesian Compression for Uniformly Discretizable
Statistical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Łukasz Dębowski

Calibration and Internal No-Regret with Random Signals . . . . . . . . . . . . . 68
Vianney Perchet
St. Petersburg Portfolio Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
László Györfi and Péter Kevei

Learning Graphs
Reconstructing Weighted Graphs with Minimal Query Complexity . . . . . 97
Nader H. Bshouty and Hanna Mazzawi

Learning Unknown Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110


Nicolò Cesa-Bianchi, Claudio Gentile, and Fabio Vitale

Completing Networks Using Observed Data . . . . . . . . . . . . . . . . . . . . . . . . . 126


Tatsuya Akutsu, Takeyuki Tamura, and Katsuhisa Horimoto

Active Learning and Query Learning


Average-Case Active Learning with Costs . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
Andrew Guillory and Jeff Bilmes

Canonical Horn Representations and Query Learning . . . . . . . . . . . . . . . . . 156


Marta Arias and José L. Balcázar

Learning Finite Automata Using Label Queries . . . . . . . . . . . . . . . . . . . . . . 171


Dana Angluin, Leonor Becerra-Bonache, Adrian Horia Dediu, and
Lev Reyzin

Characterizing Statistical Query Learning: Simplified Notions and


Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
Balázs Szörényi

An Algebraic Perspective on Boolean Function Learning . . . . . . . . . . . . . . 201


Ricard Gavaldà and Denis Thérien

Statistical Learning
Adaptive Estimation of the Optimal ROC Curve and a Bipartite
Ranking Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
Stéphan Clémençon and Nicolas Vayatis

Complexity versus Agreement for Many Views: Co-regularization for


Multi-view Semi-supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
Odalric-Ambrym Maillard and Nicolas Vayatis

Error-Correcting Tournaments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247


Alina Beygelzimer, John Langford, and Pradeep Ravikumar

Inductive Inference
Difficulties in Forcing Fairness of Polynomial Time Inductive
Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
John Case and Timo Kötzing

Learning Mildly Context-Sensitive Languages with Multidimensional


Substitutability from Positive Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
Ryo Yoshinaka

Uncountable Automatic Classes and Learning . . . . . . . . . . . . . . . . . . . . . . . 293


Sanjay Jain, Qinglong Luo, Pavel Semukhin, and Frank Stephan

Iterative Learning from Texts and Counterexamples Using Additional


Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
Sanjay Jain and Efim Kinber

Incremental Learning with Ordinal Bounded Example Memory . . . . . . . . 323


Lorenzo Carlucci

Learning from Streams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338


Sanjay Jain, Frank Stephan, and Nan Ye

Semi-supervised and Unsupervised Learning


Smart PAC-Learners . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
Hans Ulrich Simon

Approximation Algorithms for Tensor Clustering . . . . . . . . . . . . . . . . . . . . . 368


Stefanie Jegelka, Suvrit Sra, and Arindam Banerjee

Agnostic Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384


Maria Florina Balcan, Heiko Röglin, and Shang-Hua Teng

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399


The Two Faces of Active Learning

Sanjoy Dasgupta

University of California, San Diego

The active learning model is motivated by scenarios in which it is easy to amass
vast quantities of unlabeled data (images and videos off the web, speech signals
from microphone recordings, and so on) but costly to obtain their labels. Like
supervised learning, the goal is ultimately to learn a classifier. But like unsuper-
vised learning, the data come unlabeled. More precisely, the labels are hidden,
and each of them can be revealed only at a cost. The idea is to query the labels
of just a few points that are especially informative about the decision bound-
ary, and thereby to obtain an accurate classifier at significantly lower cost than
regular supervised learning.
There are two distinct narratives for explaining when active learning is helpful.
The first has to do with efficient search through the hypothesis space: perhaps
one can always explicitly select query points whose labels will significantly shrink
the set of plausible classifiers (those roughly consistent with the labels seen so
far)? The second argument for active learning has to do with exploiting cluster
structure in data. Suppose, for instance, that the unlabeled points form five
nice clusters; with luck, these clusters will be pure and only five labels will be
necessary!
Both these scenarios are hopelessly optimistic. But I will show that they each
motivate realistic models that can effectively be exploited by active learning
algorithms. These algorithms have provable label complexity bounds that are in
some cases exponentially lower than for supervised learning. I will also present
experiments with these algorithms, to illustrate their behavior and get a sense
of the gulf that still exists between the theory and practice of active learning.
This is joint work with Alina Beygelzimer, Daniel Hsu, John Langford, and
Claire Monteleoni.

Inference and Learning in Planning

Hector Geffner

ICREA & Universitat Pompeu Fabra
C/Roc Boronat 138, E-08018 Barcelona, Spain
hector.geffner@upf.edu
http://www.tecn.upf.es/~hgeffner

Abstract. Planning is concerned with the development of solvers for a
wide range of models where actions must be selected for achieving goals.
In these models, actions may be deterministic or not, and full or partial
sensing may be available. In the last few years, significant progress has
been made, resulting in algorithms that can produce plans effectively in a
variety of settings. These developments have to do with the formulation
and use of general inference techniques and transformations. In this in-
vited talk, I’ll review the inference techniques used for solving individual
planning instances from scratch, and discuss the use of learning methods
and transformations for obtaining more general solutions.

Mining Heterogeneous Information Networks by
Exploring the Power of Links

Jiawei Han

Department of Computer Science
University of Illinois at Urbana-Champaign
hanj@cs.uiuc.edu

Abstract. Knowledge is power, but for interrelated data, knowledge is
often hidden in massive links in heterogeneous information networks. We
explore the power of links in mining heterogeneous information networks
through several interesting tasks, including link-based object distinction, ve-
racity analysis, multidimensional online analytical processing of hetero-
geneous information networks, and rank-based clustering. Some recent
results of our research that explore the crucial information hidden in links
will be introduced, including (1) Distinct for object distinction analysis,
(2) TruthFinder for veracity analysis, (3) Infonet-OLAP for online analyt-
ical processing of information networks, and (4) RankClus for integrated
ranking-based clustering. We also discuss some of our on-going studies
in this direction.

Learning and Domain Adaptation

Yishay Mansour

Blavatnik School of Computer Science,
Tel Aviv University
Tel Aviv, Israel
mansour@tau.ac.il

Abstract. Domain adaptation is a fundamental learning problem where
one wishes to use labeled data from one or several source domains to
learn a hypothesis performing well on a different, yet related, domain for
which no labeled data is available. This generalization across domains is
a very significant challenge for many machine learning applications and
arises in a variety of natural settings, including NLP tasks (document
classification, sentiment analysis, etc.), speech recognition (speakers and
noise or environment adaptation) and face recognition (different lighting
conditions, different population composition).
The learning theory community has only recently started to analyze
domain adaptation problems. In the talk, I will overview some recent
theoretical models and results regarding domain adaptation.
This talk is based on joint works with Mehryar Mohri and Afshin
Rostamizadeh.

1 Introduction

It is almost standard in machine learning to assume that the training and test
instances are drawn from the same distribution. This assumption is explicit in
the standard PAC model [19] and other theoretical models of learning, and it is a
natural assumption since when the training and test distributions substantially
differ there can be no hope for generalization. However, in practice, there are
several crucial scenarios where the two distributions are similar but not identical,
and therefore effective learning is potentially possible. This is the motivation for
domain adaptation.
The problem of domain adaptation arises in a variety of applications in natu-
ral language processing [6,3,9,4,5], speech processing [11,7,16,18,8,17], computer
vision [15], and many other areas. Quite often, little or no labeled data is avail-
able from the target domain, but labeled data from a source domain somewhat
similar to the target as well as large amounts of unlabeled data from the target
domain are at one’s disposal. The domain adaptation problem then consists of
leveraging the source labeled and target unlabeled data to derive a hypothesis
performing well on the target domain.
The first theoretical analysis of the domain adaptation problem was presented
by [1], who gave VC-dimension-based generalization bounds for adaptation in


classification tasks. Perhaps the most significant contribution of that work was
the definition and application of a distance between distributions, the d_A
distance, that is particularly relevant to the problem of domain adaptation and
which can be estimated from finite samples for a finite VC dimension, as
previously shown by [10]. This work was later extended by [2], who also gave a
bound on the error rate of a hypothesis derived from a weighted combination of
the source data sets for the specific case of empirical risk minimization. More
refined generalization bounds, which apply to more general tasks including
regression and general loss functions, appear in [12]. From an algorithmic
perspective, it is natural to re-weight the empirical distribution to better
reflect the target distribution; efficient algorithms for this re-weighting task
were given in [12].
A more complex variant of this problem arises in sentiment analysis and other
text classification tasks where the learner receives information from several do-
main sources that he can combine to make predictions about a target domain.
As an example, often appraisal information about a relatively small number of
domains such as movies, books, restaurants, or music may be available, but little
or none is accessible for more difficult domains such as travel. This is known as
the multiple source adaptation problem. Instances of this problem can be found
in a variety of other natural language and image processing tasks.
The problem of adaptation with multiple sources was introduced and analyzed
in [13,14]. The problem is formalized as follows. For each source domain
i ∈ [1, k], the learner receives the distribution Q_i of the input points, as well
as a hypothesis h_i with loss at most ε on that source. The task consists of
combining the k hypotheses h_i, i ∈ [1, k], to derive a hypothesis h with a loss
as small as possible with respect to the target distribution P. Unfortunately, a
simple convex combination of the k source hypotheses h_i can perform very
poorly; for example, there are cases where any such convex combination would
incur a classification error of one half, even when each source hypothesis h_i
makes no error on its domain Q_i (see [13]). In contrast, distribution weighted
combinations of the source hypotheses, which are combinations of the source
hypotheses weighted by the source distributions, perform very well. In [13] it
was shown that, remarkably, for any fixed target function, there exists a
distribution weighted combination of the source hypotheses whose loss is at
most ε with respect to any mixture P of the k source distributions Q_i. For the
case where the target distribution P is arbitrary, generalization bounds, based
on the Rényi divergence between the source and target distributions, were
derived in [14].
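For concreteness (our rendering, following the notation of [13], with z a vector of non-negative source weights summing to 1), a distribution weighted combination has the form

h_z(x) = ∑_{i=1}^k ( z_i Q_i(x) / ∑_{j=1}^k z_j Q_j(x) ) h_i(x) .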

References

1. Ben-David, S., Blitzer, J., Crammer, K., Pereira, F.: Analysis of representations
for domain adaptation. In: Proceedings of NIPS 2006 (2006)
2. Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., Wortman, J.: Learning bounds
for domain adaptation. In: Proceedings of NIPS 2007 (2007)
3. Blitzer, J., Dredze, M., Pereira, F.: Biographies, Bollywood, Boom-boxes and
Blenders: Domain Adaptation for Sentiment Classification. In: ACL 2007 (2007)

4. Chelba, C., Acero, A.: Adaptation of maximum entropy capitalizer: Little data can
help a lot. Computer Speech & Language 20(4), 382–399 (2006)
5. Daumé III, H., Marcu, D.: Domain adaptation for statistical classifiers. Journal of
Artificial Intelligence Research 26, 101–126 (2006)
6. Dredze, M., Blitzer, J., Talukdar, P.P., Ganchev, K., Graca, J., Pereira,
F.: Frustratingly Hard Domain Adaptation for Parsing. In: CoNLL 2007 (2007)
7. Gauvain, J.-L., Lee, C.-H.: Maximum a posteriori estimation for multivariate Gaus-
sian mixture observations of Markov chains. IEEE Transactions on Speech and
Audio Processing 2(2), 291–298 (1994)
8. Jelinek, F.: Statistical Methods for Speech Recognition. MIT Press, Cambridge
(1998)
9. Jiang, J., Zhai, C.X.: Instance Weighting for Domain Adaptation in NLP. In: Pro-
ceedings of ACL 2007 (2007)
10. Kifer, D., Ben-David, S., Gehrke, J.: Detecting change in data streams. In: Pro-
ceedings of the 30th International Conference on Very Large Data Bases (2004)
11. Leggetter, C.J., Woodland, P.C.: Maximum likelihood linear regression for speaker
adaptation of continuous density hidden Markov models. Computer Speech and
Language, 171–185 (1995)
12. Mansour, Y., Mohri, M., Rostamizadeh, A.: Domain adaptation: Learning bounds
and algorithms. In: COLT (2009)
13. Mansour, Y., Mohri, M., Rostamizadeh, A.: Domain adaptation with multiple
sources. In: Proceedings of NIPS 2008 (2008)
14. Mansour, Y., Mohri, M., Rostamizadeh, A.: Multiple source adaptation and the
Rényi divergence. In: Uncertainty in Artificial Intelligence, UAI (2009)
15. Martínez, A.M.: Recognizing imprecisely localized, partially occluded, and expres-
sion variant faces from a single sample per class. IEEE Trans. Pattern Anal. Mach.
Intell. 24(6), 748–763 (2002)
16. Della Pietra, S., Della Pietra, V., Mercer, R.L., Roukos, S.: Adaptive language
modeling using minimum discriminant estimation. In: HLT 1991: Proceedings of
the workshop on Speech and Natural Language, pp. 103–106 (1992)
17. Roark, B., Bacchiani, M.: Supervised and unsupervised PCFG adaptation to novel
domains. In: Proceedings of HLT-NAACL (2003)
18. Rosenfeld, R.: A Maximum Entropy Approach to Adaptive Statistical Language
Modeling. Computer Speech and Language 10, 187–228 (1996)
19. Valiant, L.G.: A theory of the learnable. Communication of the ACM 27(11),
1134–1142 (1984)
Learning on the Web

Fernando C.N. Pereira

University of Pennsylvania, USA

It is commonplace to say that the Web has changed everything. Machine learning
researchers often say that their projects and results respond to that change with
better methods for finding and organizing Web information. However, not much
of the theory, or even the current practice, of machine learning takes the Web
seriously. We continue to devote much effort to refining supervised learning, but
the Web reality is that labeled data is hard to obtain, while unlabeled data is
inexhaustible. We cling to the iid assumption, while all the Web data generation
processes drift rapidly and involve many hidden correlations. Much of our theory
and many of our algorithms assume data representations of fixed dimension, while
in fact the dimensionality of data, for example the number of distinct words in
text, grows with data size. While there has been much work recently on learning
with sparse representations, the actual patterns of sparsity on the Web have
received little attention. Those patterns might be very relevant to the communication costs
attention. Those patterns might be very relevant to the communication costs
of distributed learning algorithms, which are necessary at Web scale, but little
work has been done on this.
Nevertheless, practical machine learning is thriving on the Web. Statistical
machine translation has developed non-parametric algorithms that learn how
to translate by mining the ever-growing volume of source documents and their
translations that are created on the Web. Unsupervised learning methods infer
useful latent semantic structure from the statistics of term co-occurrences in
Web documents. Image search achieves improved ranking by learning from user
responses to search results. In all those cases, Web scale demanded distributed
algorithms.
I will review some of those practical successes to try to convince you that
they are not just engineering feats, but also rich sources of new fundamental
questions that we should be investigating.

Prediction with Expert Evaluators’ Advice

Alexey Chernov and Vladimir Vovk

Computer Learning Research Centre, Department of Computer Science,
Royal Holloway, University of London, Egham, Surrey TW20 0EX, UK
{chernov,vovk}@cs.rhul.ac.uk

Abstract. We introduce a new protocol for prediction with expert ad-
vice in which each expert evaluates the learner’s and his own performance
using a loss function that may change over time and may be different
from the loss functions used by the other experts. The learner’s goal is
to perform better or not much worse than each expert, as evaluated by
that expert, for all experts simultaneously. If the loss functions used by
the experts are all proper scoring rules and all mixable, we show that
the defensive forecasting algorithm enjoys the same performance guar-
antee as that attainable by the Aggregating Algorithm in the standard
setting and known to be optimal. This result is also applied to the case
of “specialist” experts. In this case, the defensive forecasting algorithm
reduces to a simple modification of the Aggregating Algorithm.

1 Introduction
We consider the problem of online sequence prediction. A process generates
outcomes ω_1, ω_2, . . . step by step. At each step t, a learner tries to guess this
step’s outcome, announcing his prediction γ_t. Then the actual outcome ω_t is
revealed. The quality of the learner’s prediction is measured by a loss function:
the learner’s loss at step t is λ(γ_t, ω_t).
Prediction with expert advice is a framework that does not make any assump-
tions about the generating process. The performance of the learner is compared
to the performance of several other predictors called experts. At each step, each
expert gives his prediction γ_t^n, then the learner produces his own prediction γ_t
(possibly based on the experts’ predictions at this and all the previous steps and
the outcomes at all the previous steps), and the accumulated losses are updated
for the learner and for the experts. There are many algorithms for the learner
in this framework; for a review, see [1].
In practical applications of the algorithms for prediction with expert advice,
choosing the loss function is often difficult. There may be no natural quantitative
measure of loss, just the vague concept that the closer the prediction to the
outcome the better. In such cases one usually selects from among several common
loss functions, such as the square loss function (reflecting the idea of least squares
methods) or the log loss function (which has an information theory background).
A similar issue arises when experts themselves are prediction algorithms that
optimize some losses internally. Then it is unfair to these experts when the
learner competes with them according to a “foreign” loss function.


This paper introduces a new version of the framework of prediction with


expert advice where there is no single fixed loss function but some loss function
is linked to each expert. The performance of the learner is compared to the
performance of each expert according to the loss function linked to that expert.
Informally speaking, each expert has to be convinced that the learner performs
almost as well as, or better than, that expert himself.
We prove that a known algorithm for the learner, the defensive forecasting
algorithm [2], can be applied in the new setting and gives the same performance
guarantee as that attainable in the standard setting, provided all loss functions
are proper scoring rules.
Another framework to which our methods can be fruitfully applied is that of
“specialist experts”: see, e.g., [3]. We generalize some of the known results in the
case of mixable loss functions.
To keep the presentation as simple as possible, we restrict ourselves to binary
outcomes from {0, 1}, predictions from [0, 1], and a finite number of experts. We
formulate our results for mixable loss functions only. However, these results can
be easily transferred to more general settings (non-binary outcomes, arbitrary
prediction spaces, countably many experts, second-guessing experts, etc.) where
the methods of [2] work. For a fuller version of this paper, see [4].

2 Prediction with Simple Experts’ Advice

In this preliminary section we recall the standard protocol of prediction with
expert advice and some known results.
Let {0, 1} be the set of possible outcomes ω, [0, 1] be the set of possible pre-
dictions γ, and λ : [0, 1] × {0, 1} → [0, ∞] be the loss function. The loss function
λ and parameter N (the number of experts) specify the game of prediction with
expert advice. The game is played by Learner, Reality, and N experts, Expert
1 to Expert N , according to the following protocol.

Prediction with expert advice

L_0 := 0.
L_0^n := 0, n = 1, . . . , N.
FOR t = 1, 2, . . . :
  Expert n announces γ_t^n ∈ [0, 1], n = 1, . . . , N.
  Learner announces γ_t ∈ [0, 1].
  Reality announces ω_t ∈ {0, 1}.
  L_t := L_{t−1} + λ(γ_t, ω_t).
  L_t^n := L_{t−1}^n + λ(γ_t^n, ω_t), n = 1, . . . , N.
END FOR

The goal of Learner is to keep his loss L_t smaller, or at least not much greater,
than the loss L_t^n of Expert n, at each step t and for all n = 1, . . . , N.

We only consider loss functions that have the following properties:

Assumption 1: λ(γ, 0) and λ(γ, 1) are continuous in γ ∈ [0, 1] (with respect to
the standard (Aleksandrov) topology on [0, ∞]).
Assumption 2: There exists γ ∈ [0, 1] such that λ(γ, 0) and λ(γ, 1) are both
finite.
Assumption 3: There exists no γ ∈ [0, 1] such that λ(γ, 0) and λ(γ, 1) are both
infinite.

The superprediction set for a loss function λ is

Σ_λ := { (x, y) ∈ [0, ∞)^2 | ∃γ: λ(γ, 0) ≤ x and λ(γ, 1) ≤ y } .   (1)

By Assumption 2, this set is non-empty. For each learning rate η > 0, let E_η :
[0, ∞]^2 → [0, 1]^2 be the homeomorphism defined by E_η(x, y) := (e^{−ηx}, e^{−ηy}).
The loss function λ is called η-mixable if the set E_η(Σ_λ) is convex. It is called
mixable if it is η-mixable for some η > 0.
Theorem 1 (Vovk and Watkins). If a loss function λ is η-mixable, then
there exists a strategy for Learner that guarantees that in the game of prediction
with expert advice with N experts and the loss function λ it holds, for all T and
for all n = 1, . . . , N, that

L_T ≤ L_T^n + (1/η) ln N .   (2)

The bound is optimal: if λ is not η-mixable, then no strategy for Learner can
guarantee (2).
For the proof and other details, see [1], [5], [6], or [7, Theorem 8]; one of the algo-
rithms guaranteeing (2) is the Aggregating Algorithm (AA). As shown in [2], one
can take the defensive forecasting algorithm instead of the AA in the theorem.

3 Proper Scoring Rules


A loss function λ is a proper scoring rule if for any π, π′ ∈ [0, 1] it holds that

πλ(π, 1) + (1 − π)λ(π, 0) ≤ πλ(π′, 1) + (1 − π)λ(π′, 0) .

The interpretation is that the prediction π is an estimate of the probability that
ω = 1. The definition says that the expected loss with respect to a probability
distribution is minimal if the prediction is the true probability of 1. Informally,
a proper scoring rule encourages a forecaster (Learner or one of the experts) to
announce his true subjective probability that the next outcome will be 1. (See
[8] and [9] for detailed reviews.)
Simple examples of proper scoring rules are provided by the two most common
loss functions: the log loss function

λ(γ, ω) := − ln(ωγ + (1 − ω)(1 − γ))



(i.e., λ(γ, 0) = − ln(1 − γ) and λ(γ, 1) = − ln γ) and the square loss function

λ(γ, ω) := (ω − γ)^2 .

A trivial but, for us, important generalization of the log loss function is

λ(γ, ω) := −(1/η) ln(ωγ + (1 − ω)(1 − γ)) ,   (3)
where η is a positive constant. The generalized log loss function is also a proper
scoring rule (in general, multiplying a proper scoring rule by a positive constant
again yields a proper scoring rule).
It is well known that the log loss function is 1-mixable and the square loss
function is 2-mixable (see, e.g., [1], Section 3.6), and it is easy to check that the
generalized log loss function (3) is η-mixable.
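The following numeric sanity check (our illustration; the grids, tolerances, and names are arbitrary choices) tests the properness definition above and the mixability definition of Sect. 2 on these loss functions.

import math
import numpy as np

log_loss = lambda g, w: -math.log(g if w else 1.0 - g)
square_loss = lambda g, w: (w - g) ** 2

def is_proper(loss, grid=np.linspace(0.01, 0.99, 99)):
    # proper: for each pi, gamma = pi minimizes the expected loss
    # pi*loss(gamma, 1) + (1 - pi)*loss(gamma, 0) over gamma
    def expected(pi, g):
        return pi * loss(g, 1) + (1 - pi) * loss(g, 0)
    return all(expected(pi, pi) <= min(expected(pi, g) for g in grid) + 1e-12
               for pi in grid)

def looks_eta_mixable(loss, eta, tol=1e-3):
    # eta-mixable iff E_eta(superprediction set) is convex; equivalently,
    # the midpoint of any two points of the transformed prediction curve
    # must be dominated coordinate-wise by some point of the curve
    fine = np.linspace(0.001, 0.999, 1999)
    curve = np.array([(math.exp(-eta * loss(g, 0)),
                       math.exp(-eta * loss(g, 1))) for g in fine])
    for p in curve[::100]:          # coarse subsample of candidate pairs
        for q in curve[::100]:
            mid = (p + q) / 2
            if not np.any((curve[:, 0] >= mid[0] - tol)
                          & (curve[:, 1] >= mid[1] - tol)):
                return False
    return True

print(is_proper(log_loss), is_proper(square_loss))    # expected: True True
print(looks_eta_mixable(square_loss, 2.0),
      looks_eta_mixable(square_loss, 8.0))            # expected: True False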
We will often say “proper loss function” meaning a loss function that is a
proper scoring rule. Our main interest will be in loss functions that are both
mixable and proper. Let L be the set of all such loss functions. It is geometri-
cally obvious that any mixable loss function can be made proper by removing
inadmissible predictions (i.e., predictions γ that are strictly worse than some
other predictions) and reparameterizing the admissible predictions.

4 Prediction with Expert Evaluators’ Advice


In this section we consider a very general protocol of prediction with expert
advice. The intuition behind special cases of this protocol will be discussed in
the following sections.

Prediction with expert evaluators’ advice

FOR t = 1, 2, . . . :
  Expert n announces γ_t^n ∈ [0, 1], η_t^n > 0, and an η_t^n-mixable λ_t^n ∈ L,
    n = 1, . . . , N.
  Learner announces γ_t ∈ [0, 1].
  Reality announces ω_t ∈ {0, 1}.
END FOR

The main mathematical result of this paper is the following.


Theorem 2. Learner has a strategy (e.g., the defensive forecasting algorithm
described below) that guarantees that in the game of prediction with N expert
evaluators’ advice it holds, for all T and for all n = 1, . . . , N, that

∑_{t=1}^T η_t^n (λ_t^n(γ_t, ω_t) − λ_t^n(γ_t^n, ω_t)) ≤ ln N .

The description of the defensive forecasting algorithm and the proof of the the-
orem will be given in Sect. 7.

Corollary 1. For any η > 0, Learner has a strategy that guarantees

∑_{t=1}^T λ_t^n(γ_t, ω_t) ≤ ∑_{t=1}^T λ_t^n(γ_t^n, ω_t) + (ln N)/η ,   (4)

for all T and all n = 1, . . . , N, in the game of prediction with N expert evaluators’
advice in which the experts are required to always choose η-mixable loss functions
λ_t^n.
This corollary is more intuitive than Theorem 2 as (4) compares the cumulative
losses suffered by Learner and each expert.
In the following sections we will discuss two interesting special cases of The-
orem 2 and Corollary 1.

5 Prediction with Constant Expert Evaluators’ Advice


In the game of this section, as in the previous one, the experts are “expert
evaluators”: each of them measures Learner’s and his own performance using
his own loss function, supposed to be mixable and proper. The difference is that
now each expert is linked to a fixed loss function. The game is specified by N
loss functions λ^1, . . . , λ^N.

Prediction with constant expert evaluators’ advice

L_0^{(n)} := 0, n = 1, . . . , N.
L_0^n := 0, n = 1, . . . , N.
FOR t = 1, 2, . . . :
  Expert n announces γ_t^n ∈ [0, 1], n = 1, . . . , N.
  Learner announces γ_t ∈ [0, 1].
  Reality announces ω_t ∈ {0, 1}.
  L_t^{(n)} := L_{t−1}^{(n)} + λ^n(γ_t, ω_t), n = 1, . . . , N.
  L_t^n := L_{t−1}^n + λ^n(γ_t^n, ω_t), n = 1, . . . , N.
END FOR

There are two changes in the protocol as compared to the basic protocol of pre-
diction with expert advice in Sect. 2. The accumulated loss L_t^n of each expert is
now calculated according to his own loss function λ^n. For Learner, there is no
single accumulated loss anymore. Instead, the loss L_t^{(n)} of Learner is calculated
separately against each expert, according to that expert’s loss function λ^n. Infor-
mally speaking, each expert evaluates his own performance and the performance
of Learner according to the expert’s own (but publicly known) criteria.
In the standard setting of prediction with expert advice it is often said that
Learner’s goal is to compete with the best expert in the pool. In the new setting,
we cannot speak about the best expert: the experts’ performance is evaluated by
different loss functions, and thus the losses may be measured on different scales.
But it still makes sense to consider bounds on the regret L_t^{(n)} − L_t^n for each n.

Theorem 2 immediately implies the following performance guarantee for the
defensive forecasting algorithm in our current setting.

Corollary 2. Suppose that each λ^n is a proper loss function that is η^n-mixable
for some η^n > 0, n = 1, . . . , N. Then Learner has a strategy that guarantees that
in the game of prediction with N experts’ advice and loss functions λ^1, . . . , λ^N
it holds, for all T and for all n = 1, . . . , N, that

L_T^{(n)} ≤ L_T^n + (ln N)/η^n .

Notice that Corollary 2 contains the bound (2) of Theorem 1 as a special case
(the assumption that λ is proper is innocuous in the context of Theorem 1).

Multiobjective Prediction with Expert Advice


To conclude this section, let us consider another variant of the protocol with
several loss functions. As mentioned in the introduction, sometimes we have
experts’ predictions but are not given a single loss function, only several possible
candidates. The most cautious way to generate Learner’s predictions is to ensure
that the regret is small against all experts and according to all loss functions.
The following protocol formalizes this task. Now we have N experts and M loss
functions λ^1, . . . , λ^M.

Multiobjective prediction with expert advice

L_0^{(m)} := 0, m = 1, . . . , M.
L_0^{n,m} := 0, n = 1, . . . , N and m = 1, . . . , M.
FOR t = 1, 2, . . . :
  Expert n announces γ_t^n ∈ [0, 1], n = 1, . . . , N.
  Learner announces γ_t ∈ [0, 1].
  Reality announces ω_t ∈ {0, 1}.
  L_t^{(m)} := L_{t−1}^{(m)} + λ^m(γ_t, ω_t), m = 1, . . . , M.
  L_t^{n,m} := L_{t−1}^{n,m} + λ^m(γ_t^n, ω_t), n = 1, . . . , N and m = 1, . . . , M.
END FOR

Corollary 3. Suppose that each λ^m is an η^m-mixable proper loss function, for
some η^m > 0, m = 1, . . . , M. There is a strategy for Learner that guarantees that,
in the multiobjective game of prediction with N experts and the loss functions
λ^1, . . . , λ^M,

L_T^{(m)} ≤ L_T^{n,m} + (ln MN)/η^m   (5)

for all T, all n = 1, . . . , N, and all m = 1, . . . , M.

Proof. This follows easily from Corollary 2. For each n ∈ {1, . . . , N}, let us
construct M new experts (n, m); Expert (n, m) predicts as Expert n and is
linked to the loss function λ^m. Applying Corollary 2 to these MN experts, we
get the bound (5). ⊓⊔

6 Prediction with Specialist Experts’ Advice


The experts of this section are allowed to “sleep”, i.e., abstain from giving advice
to Learner at some steps. We will be assuming that there is only one loss function
λ, although the generalization to the case of N loss functions λ^1, . . . , λ^N, each
linked to an expert, is straightforward. The loss function λ does not need to be proper
(but it is still required to be mixable).
Let a be any object that does not belong to [0, 1]; intuitively, it will stand for
an expert’s decision to abstain.

Prediction with specialist experts’ advice

L_0^{(n)} := 0, n = 1, . . . , N.
L_0^n := 0, n = 1, . . . , N.
FOR t = 1, 2, . . . :
  Expert n announces γ_t^n ∈ ([0, 1] ∪ {a}), n = 1, . . . , N.
  Learner announces γ_t ∈ [0, 1].
  Reality announces ω_t ∈ {0, 1}.
  L_t^{(n)} := L_{t−1}^{(n)} + I{γ_t^n ≠ a} λ(γ_t, ω_t), n = 1, . . . , N.
  L_t^n := L_{t−1}^n + I{γ_t^n ≠ a} λ(γ_t^n, ω_t), n = 1, . . . , N.
END FOR

The indicator function I{γ_t^n ≠ a} of the event γ_t^n ≠ a is defined to be 1 if γ_t^n ≠ a
and 0 if γ_t^n = a. Therefore, L_t^{(n)} and L_t^n refer to the cumulative losses of Learner
and Expert n over the steps when Expert n is awake. Now Learner’s goal is to
do as well as each expert on the steps chosen by that expert.
Corollary 4. Let λ be a loss function that is η-mixable for some η > 0. Then
Learner has a strategy that guarantees that in the game of prediction with N
specialist experts’ advice and loss function λ it holds, for all T and for all
n = 1, . . . , N, that

L_T^{(n)} ≤ L_T^n + (ln N)/η .   (6)

Proof. Without loss of generality the loss function λ may be assumed to be
proper (as we said earlier, this can be achieved by a reparameterization of the
predictions). The protocol of this section then becomes a special case of the
protocol of Sect. 4 in which at each step each expert outputs η_t^n = η and either
λ_t^n = λ (when he is awake) or λ_t^n = 0 (when he is asleep). (Alternatively, we
could allow zero learning rates and make each expert output λ_t^n = λ and either
η_t^n = η, when he is awake, or η_t^n = 0, when he is asleep.) ⊓⊔


7 Defensive Forecasting Algorithm and the Proof of Theorem 2

In this section we prove Theorem 2. Our proof is constructive: we explicitly
describe the defensive forecasting algorithm achieving the bound in Theorem 2.

We will use the more intuitive notation π_t, rather than γ_t, for the algorithm’s
predictions (to emphasize the interpretation of predictions as probabilities; cf.
the discussion of proper scoring rules in Sect. 3).

The Algorithm
For each n = 1, . . . , N, let us define the function

Q^n : ([0, 1]^N × (0, ∞)^N × L^N × [0, 1] × {0, 1})^* → [0, ∞] ,
Q^n(γ_1^•, η_1^•, λ_1^•, π_1, ω_1, . . . , γ_T^•, η_T^•, λ_T^•, π_T, ω_T)
  := ∏_{t=1}^T e^{η_t^n (λ_t^n(π_t, ω_t) − λ_t^n(γ_t^n, ω_t))} ,   (7)

where γ_t^n, η_t^n, and λ_t^n are the components of γ_t^• := (γ_t^1, . . . , γ_t^N),
η_t^• := (η_t^1, . . . , η_t^N), and λ_t^• := (λ_t^1, . . . , λ_t^N), respectively. As usual, the
empty product ∏_{t=1}^0 is interpreted as 1, so that Q^n() = 1.
The functions Q^n will usually be applied to γ_t^•, the predictions made by all
the N experts at step t; to η_t^•, the learning rates chosen by the experts at step t;
and to λ_t^•, the loss functions used by the experts at step t. Notice that Q^n does
not depend on the predictions, learning rates, and loss functions of the experts
other than Expert n.
Set

Q := (1/N) ∑_{n=1}^N Q^n

and

f_t(π, ω) := Q(γ_1^•, η_1^•, λ_1^•, π_1, ω_1, . . . , γ_{t−1}^•, η_{t−1}^•, λ_{t−1}^•, π_{t−1}, ω_{t−1}, γ_t^•, η_t^•, λ_t^•, π, ω)
  − Q(γ_1^•, η_1^•, λ_1^•, π_1, ω_1, . . . , γ_{t−1}^•, η_{t−1}^•, λ_{t−1}^•, π_{t−1}, ω_{t−1}) ,   (8)

where (π, ω) ranges over [0, 1] × {0, 1}; the expression ∞ − ∞ is understood as,
say, 0. The defensive forecasting algorithm is defined in terms of the functions f_t.
Defensive forecasting algorithm

FOR t = 1, 2, . . . :
  Read the experts’ predictions γ_t^• = (γ_t^1, . . . , γ_t^N) ∈ [0, 1]^N,
    learning rates η_t^• = (η_t^1, . . . , η_t^N) ∈ (0, ∞)^N,
    and loss functions λ_t^• = (λ_t^1, . . . , λ_t^N) ∈ L^N.
  Define f_t : [0, 1] × {0, 1} → [−∞, ∞] by (8).
  If f_t(0, 1) ≤ 0, predict π_t := 0 and go to R.
  If f_t(1, 0) ≤ 0, predict π_t := 1 and go to R.
  Otherwise (if both f_t(0, 1) > 0 and f_t(1, 0) > 0),
    take any π satisfying f_t(π, 0) = f_t(π, 1) and predict π_t := π.
  R: Read Reality’s move ω_t ∈ {0, 1}.
END FOR

The existence of a π satisfying f_t(π, 0) = f_t(π, 1), when required by the algorithm,
will be proved in Lemma 1 below. We will see that in this case the function
f_t(π) := f_t(π, 1) − f_t(π, 0) takes values of opposite signs at π = 0 and π = 1.
Therefore, a root of f_t(π) = 0 can be found by, e.g., bisection (see [10], Chap. 9,
for a review of bisection and more efficient methods, such as Brent’s).
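To make the root-finding step concrete, here is a minimal sketch of one step of the algorithm (ours, with arbitrary names; it assumes that all the loss functions are finite on the whole of [0, 1], as the square loss is, so that every exponential below is finite). The list G holds the experts’ current exponents ∑_s η_s^n (λ_s^n(π_s, ω_s) − λ_s^n(γ_s^n, ω_s)), so that Q^n = exp(G[n]).

import math

def defensive_forecast_step(preds, etas, losses, G, iters=60):
    """One step: preds[n] = gamma_t^n, etas[n] = eta_t^n, losses[n] =
    lambda_t^n (a function of (gamma, omega)), G[n] = current exponent
    of Q^n.  Returns pi_t such that the supermartingale Q cannot
    increase whatever omega_t turns out to be."""
    def f(pi, omega):
        # the increment f_t(pi, omega) of Q = (1/N) sum_n exp(G_n),
        # cf. (8), if Learner plays pi and Reality answers omega
        return sum(math.exp(g + e * (lam(pi, omega) - lam(p, omega)))
                   - math.exp(g)
                   for g, e, lam, p in zip(G, etas, losses, preds)) / len(G)
    if f(0.0, 1) <= 0:
        return 0.0
    if f(1.0, 0) <= 0:
        return 1.0
    # otherwise f(pi, 1) - f(pi, 0) is positive at pi = 0 and negative
    # at pi = 1; bisect for a root, where f(pi, 0) = f(pi, 1) <= 0 by (11)
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if f(mid, 1) - f(mid, 0) > 0 else (lo, mid)
    return (lo + hi) / 2

After Reality announces ω_t, each exponent G[n] is updated by adding η_t^n (λ_t^n(π_t, ω_t) − λ_t^n(γ_t^n, ω_t)).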

Reductions
The most important property of the defensive forecasting algorithm is that it
produces predictions π_t such that the sequence

Q_t := Q(γ_1^•, η_1^•, λ_1^•, π_1, ω_1, . . . , γ_t^•, η_t^•, λ_t^•, π_t, ω_t)   (9)

is non-increasing. This property will be proved later; for now, we will only check
that it implies the bound on the regret term given in Theorem 2. Since the initial
value Q_0 of Q is 1, we have Q_t ≤ 1 for all t. And since Q^n ≥ 0 for all n, we have
Q^n ≤ N Q for all n. Therefore, Q_t^n, defined by (9) with Q^n in place of Q, is at
most N at each step t. By the definition of Q^n this means that

∑_{t=1}^T η_t^n (λ_t^n(π_t, ω_t) − λ_t^n(γ_t^n, ω_t)) ≤ ln N ,

which is the bound claimed in the theorem.
In the proof of the inequalities Q_0 ≥ Q_1 ≥ · · · we will follow [2] (for a
presentation adapted to the binary case, see [11]). The key fact we use is that Q
is a game-theoretic supermartingale (see below). Let us define this notion and
prove its basic properties.
Let E be any non-empty set. A function S : (E × [0, 1] × {0, 1})^* → (−∞, ∞]
is called a supermartingale (omitting “game-theoretic”) if, for any T, any
e_1, . . . , e_T ∈ E, any π_1, . . . , π_T ∈ [0, 1], and any ω_1, . . . , ω_{T−1} ∈ {0, 1}, it holds
that

π_T S(e_1, π_1, ω_1, . . . , e_{T−1}, π_{T−1}, ω_{T−1}, e_T, π_T, 1)
  + (1 − π_T) S(e_1, π_1, ω_1, . . . , e_{T−1}, π_{T−1}, ω_{T−1}, e_T, π_T, 0)
  ≤ S(e_1, π_1, ω_1, . . . , e_{T−1}, π_{T−1}, ω_{T−1}) .   (10)

Remark 1. The standard measure-theoretic notion of a supermartingale is ob-
tained when the arguments π_1, π_2, . . . in (10) are replaced by the forecasts pro-
duced by a fixed forecasting system. See, e.g., [12] for details. Game-theoretic
supermartingales are referred to as “superfarthingales” in [13].
A supermartingale S is called forecast-continuous if, for all T ∈ {1, 2, . . .}, all
e_1, . . . , e_T ∈ E, all π_1, . . . , π_{T−1} ∈ [0, 1], and all ω_1, . . . , ω_T ∈ {0, 1},

S(e_1, π_1, ω_1, . . . , e_{T−1}, π_{T−1}, ω_{T−1}, e_T, π, ω_T)

is a continuous function of π ∈ [0, 1]. The following lemma (proved and used in
similar contexts by, e.g., Levin [14] and Takemura [15]) states the property of
forecast-continuous supermartingales that is most important for us.

Lemma 1. Let S be a forecast-continuous supermartingale. For any T and
for any values of the arguments e_1, . . . , e_T ∈ E, π_1, . . . , π_{T−1} ∈ [0, 1], and
ω_1, . . . , ω_{T−1} ∈ {0, 1}, there exists π ∈ [0, 1] such that, for both ω = 0 and
ω = 1,

S(e_1, π_1, ω_1, . . . , e_{T−1}, π_{T−1}, ω_{T−1}, e_T, π, ω)
  ≤ S(e_1, π_1, ω_1, . . . , e_{T−1}, π_{T−1}, ω_{T−1}) .

Proof. Define a function f : [0, 1] × {0, 1} → (−∞, ∞] by

f(π, ω) := S(e_1, π_1, ω_1, . . . , e_{T−1}, π_{T−1}, ω_{T−1}, e_T, π, ω)
  − S(e_1, π_1, ω_1, . . . , e_{T−1}, π_{T−1}, ω_{T−1})

(the subtrahend is assumed finite: there is nothing to prove when it is infinite).
Since S is a forecast-continuous supermartingale, f(π, ω) is continuous in π and

πf(π, 1) + (1 − π)f(π, 0) ≤ 0   (11)

for all π ∈ [0, 1]. In particular, f(0, 0) ≤ 0 and f(1, 1) ≤ 0.
Our goal is to show that for some π ∈ [0, 1] we have f(π, 1) ≤ 0 and f(π, 0) ≤ 0.
If f(0, 1) ≤ 0, we can take π = 0. If f(1, 0) ≤ 0, we can take π = 1. Assume
that f(0, 1) > 0 and f(1, 0) > 0. Then the difference f(π) := f(π, 1) − f(π, 0) is
positive for π = 0 and negative for π = 1. By the intermediate value theorem,
f(π) = 0 for some π ∈ (0, 1). By (11) we have f(π, 1) = f(π, 0) ≤ 0. ⊓⊔

The fact that the sequence (9) is non-increasing follows from the fact (established
below) that Q is a forecast-continuous supermartingale (when restricted to the
allowed moves of the players). The pseudocode for the defensive forecasting
algorithm and the paragraph following it are extracted from the proof of Lemma 1,
as applied to the supermartingale Q.
The weighted sum of finitely many forecast-continuous supermartingales taken
with positive weights is again a forecast-continuous supermartingale. Therefore,
the proof will be complete if we check that Q^n is a supermartingale under the
restriction that λ_t^n is η_t^n-mixable for all n and t (it is forecast-continuous by
Assumption 1). But before we can do this, we will need to do some preparatory
work in the next subsection.

Geometry of Mixability and Proper Loss Functions


Assumption 1 and the compactness of [0, 1] imply that the superprediction set (1)
is closed. Along with the superprediction set, we will also consider the prediction
set

Π_λ := { (x, y) ∈ [0, ∞)^2 | ∃γ: λ(γ, 0) = x and λ(γ, 1) = y } .

In many cases (e.g., if λ is proper), the prediction set is the boundary of the
superprediction set. The prediction set can also be defined as the set of points

Λ_γ := (λ(γ, 0), λ(γ, 1))   (12)

in ℝ^2, where γ ranges over the prediction space [0, 1].

Let us fix a constant η > 0. The prediction set of the generalized log loss
function (3) is the curve {(x, y) | e^{−ηx} + e^{−ηy} = 1} in ℝ^2. For each π ∈ (0, 1),
the π-point of this curve is Λ_π, i.e., the point

(−(1/η) ln(1 − π), −(1/η) ln π) .

Since the generalized log loss function is proper, the minimum of (1 − π)x + πy
(geometrically, of the dot product of (1 − π, π) and (x, y)) on the curve
e^{−ηx} + e^{−ηy} = 1 is attained at the π-point; in other words, the tangent to
e^{−ηx} + e^{−ηy} = 1 at the π-point is orthogonal to the vector (1 − π, π).
A shift of the curve e^{−ηx} + e^{−ηy} = 1 is the curve e^{−η(x−α)} + e^{−η(y−β)} = 1
for some α, β ∈ ℝ (i.e., it is a parallel translation of e^{−ηx} + e^{−ηy} = 1 by the
vector (α, β)). The π-point of this shift is the point (α, β) + Λ_π, where Λ_π is the
π-point of the original curve e^{−ηx} + e^{−ηy} = 1. This provides us with a coordinate
system on each shift of e^{−ηx} + e^{−ηy} = 1 (π ∈ (0, 1) serves as the coordinate of
the corresponding π-point).
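For completeness, here is the short Lagrange-multiplier computation behind the properness claim above (our verification, not part of the original text). Minimizing (1 − π)x + πy subject to e^{−ηx} + e^{−ηy} = 1 gives the condition

(1 − π, π) = μ (−ηe^{−ηx}, −ηe^{−ηy}) ;

summing the two components and using the constraint yields −μη = 1, hence e^{−ηx} = 1 − π and e^{−ηy} = π, i.e., (x, y) = (−(1/η) ln(1 − π), −(1/η) ln π) = Λ_π.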
It will be convenient to use the geographical expressions “Northeast” and
“Southwest”. A point (x_1, y_1) is Northeast of a point (x_2, y_2) if x_1 ≥ x_2 and
y_1 ≥ y_2. A set A ⊆ ℝ^2 is Northeast of a shift of e^{−ηx} + e^{−ηy} = 1 if each point of
A is Northeast of some point of the shift. Similarly, a point is Northeast of a shift
of e^{−ηx} + e^{−ηy} = 1 (or of a straight line with a negative slope) if it is Northeast
of some point on that shift (or line). “Northeast” is replaced by “Southwest”
when the inequalities are ≤ rather than ≥, and we add the attribute “strictly”
when the inequalities are strict.
It is easy to see that the loss function is η-mixable if and only if for each
point (a, b) on the boundary of the superprediction set there exists a shift of
e^{−ηx} + e^{−ηy} = 1 passing through (a, b) such that the superprediction set lies
to the Northeast of the shift. This follows from the fact that the shifts of
e^{−ηx} + e^{−ηy} = 1 correspond to the straight lines with negative slope under the
homeomorphism E_η: indeed, the preimage of ax + by = c, where a > 0, b > 0,
and c > 0, is ae^{−ηx} + be^{−ηy} = c, which is the shift of e^{−ηx} + e^{−ηy} = 1 by the
vector

(−(1/η) ln(a/c), −(1/η) ln(b/c)) .
A similar statement for the property of being proper is:

Lemma 2. Suppose the loss function λ is η-mixable. It is a proper loss function
if and only if for each π the superprediction set is to the Northeast of the shift
of e^{−ηx} + e^{−ηy} = 1 passing through Λ_π (as defined by (12)) and having Λ_π as its
π-point.

Proof. The “if” part is obvious, so we will only prove the “only if” part. Let
λ be η-mixable and proper. Suppose there exists π such that the shift A_1 of
e^{−ηx} + e^{−ηy} = 1 passing through Λ_π and having Λ_π as its π-point has some
superpredictions strictly to its Southwest. Let s be such a superprediction, and
let A_2 be the tangent to A_1 at the point Λ_π. The image E_η(A_1) is a straight
line in [0, 1]^2, and the curve E_η(A_2) touches E_η(A_1) at E_η(Λ_π) and lies on the
same side of E_η(A_1) as E_η(s). Any point p in the open interval (E_η(s), E_η(Λ_π))
that is close enough to E_η(Λ_π) will be strictly Northeast of E_η(A_2). The point
E_η^{−1}(p) will then be a superprediction (by the η-mixability of λ) that is strictly
Southwest of A_2. This contradicts λ being a proper loss function, since A_2 is the
straight line passing through Λ_π and orthogonal to (1 − π, π). ⊓⊔


Proof of the Supermartingale Property


Let E ⊆ ([0, 1]N × (0, ∞)N × LN ) consist of sequences
 1 
γ , . . . , γ N , η 1 , . . . , η N , λ1 , . . . , λN

such that λn is η n -mixable for all n = 1, . . . , N . We will only be interested in the


restriction of Qn and Q to (E × [0, 1] × {0, 1})∗; these restrictions are denoted
with the same symbols.
The following lemma completes the proof of Theorem 2. We will prove it with-
out calculations, unlike the proofs (of different but somewhat similar properties)
presented in [2] (and, specifically for the binary case, in [11]).
Lemma 3. The function Qn defined on (E × [0, 1] × {0, 1})∗ by (7) is a super-
martingale.

Proof. It suffices to check that it is always true that
$$\pi_T \exp\left(\eta^n_T\left(\lambda^n_T(\pi_T,1) - \lambda^n_T(\gamma^n_T,1)\right)\right) + (1-\pi_T)\exp\left(\eta^n_T\left(\lambda^n_T(\pi_T,0) - \lambda^n_T(\gamma^n_T,0)\right)\right) \le 1.$$
To simplify the notation, we omit the indices n and T ; this does not lead to any ambiguity. Using the notation (a, b) := Λπ = (λ(π, 0), λ(π, 1)) and (x, y) := Λγ = (λ(γ, 0), λ(γ, 1)), we can further simplify the last inequality to
$$(1-\pi)\exp\left(\eta(a-x)\right) + \pi\exp\left(\eta(b-y)\right) \le 1.$$
In other words, it suffices to check that the (super)prediction set lies to the Northeast of the shift
$$\exp\left(-\eta\left(x - a - \frac{1}{\eta}\ln(1-\pi)\right)\right) + \exp\left(-\eta\left(y - b - \frac{1}{\eta}\ln\pi\right)\right) = 1 \qquad(13)$$
of the curve e−ηx + e−ηy = 1. The vector by which (13) is shifted is
$$\left(a + \frac{1}{\eta}\ln(1-\pi),\; b + \frac{1}{\eta}\ln\pi\right),$$
and so (a, b) is the π-point of that shift. This completes the proof of the lemma: by Lemma 2, the superprediction set indeed lies to the Northeast of that shift. □
8 Defensive Forecasting for Specialist Experts and the AA
In this section we will find a more explicit version of defensive forecasting in the case of specialist experts. Our algorithm will achieve a slightly more general version of the bound (6); namely, we will replace the ln N in (6) by − ln pn , where pn is an a priori chosen weight for Expert n: all pn are non-negative and sum to 1. Without loss of generality all pn will be assumed positive (our algorithm can always be applied to the subset of experts with positive weights). Let At be the set of awake experts at time t: At := {n ∈ {1, . . . , N } | γtn ≠ a}.
Let λ be an η-mixable loss function. By the definition of mixability there
exists a function Σ(u1 , . . . , uk , γ1 , . . . , γk ) (called a substitution function) such
that:

– the domain of Σ consists of all sequences (u1 , . . . , uk , γ1 , . . . , γk ), for all k = 0, 1, 2, . . ., of numbers ui ∈ [0, 1] summing to 1, u1 + · · · + uk = 1, and predictions γ1 , . . . , γk ∈ [0, 1];
– Σ takes values in the prediction space [0, 1];
– for any (u1 , . . . , uk , γ1 , . . . , γk ) in the domain of Σ, the prediction γ := Σ(u1 , . . . , uk , γ1 , . . . , γk ) satisfies
$$\forall\omega\in\{0,1\}:\quad e^{-\eta\lambda(\gamma,\omega)} \ge \sum_{i=1}^{k} e^{-\eta\lambda(\gamma_i,\omega)}\, u_i. \qquad(14)$$
Fix such a function Σ. Notice that its value Σ() on the empty sequence can be
chosen arbitrarily, that the case k = 1 is trivial, and that the case k = 2 in fact
covers the cases k = 3, k = 4, etc.

Defensive forecasting algorithm for specialist experts

$w^n_0 := p_n$, n = 1, . . . , N .
FOR t = 1, 2, . . . :
  Read the list At of awake experts and their predictions γtn ∈ [0, 1], n ∈ At .
  Predict $\pi_t := \Sigma\left((u^n_{t-1})_{n\in A_t},\,(\gamma^n_t)_{n\in A_t}\right)$, where $u^n_{t-1} := w^n_{t-1}\big/\sum_{n\in A_t} w^n_{t-1}$.
  Read the outcome ωt ∈ {0, 1}.
  Set $w^n_t := w^n_{t-1}\, e^{\eta(\lambda(\pi_t,\omega_t)-\lambda(\gamma^n_t,\omega_t))}$ for all n ∈ At .
END FOR

This algorithm is a simple modification of the AA, and it becomes the AA when the experts are always awake. Its main difference from the AA is in the way the experts’ weights are updated. The weights of the sleeping experts are not changed, whereas the weights of the awake experts are multiplied by $e^{\eta(\lambda(\pi_t,\omega_t)-\lambda(\gamma^n_t,\omega_t))}$. Therefore, Learner’s loss serves as the benchmark: the weight of an awake expert who performs better than Learner goes up, the weight of an awake expert who performs worse than Learner goes down, and the weight
of a sleeping expert does not change. In the case of the log loss function, this
algorithm was found by Freund et al. [3]; in this special case, Freund et al. derive
the same performance guarantee as we do.
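
To make the algorithm above concrete, here is a minimal Python sketch (ours, not the paper’s) for the special case of the log loss λ(γ, ω) = −ω ln γ − (1 − ω) ln(1 − γ) with η = 1, for which the weighted mean of the awake experts’ predictions is a valid substitution function Σ: inequality (14) then holds with equality. All function and variable names are hypothetical.

```python
import math

def log_loss(g, w):
    """Log loss lambda(g, w); g is clipped away from 0 and 1 for numerical safety."""
    g = min(max(g, 1e-12), 1 - 1e-12)
    return -math.log(g) if w == 1 else -math.log(1 - g)

def defensive_forecast(expert_predictions, outcomes, p=None):
    """Sketch of defensive forecasting for specialist experts, specialized to
    the log loss with eta = 1, where Sigma((u_i), (g_i)) = sum_i u_i g_i.

    expert_predictions: list over time of dicts {expert index: prediction in [0, 1]};
                        an expert absent from a dict is asleep at that step.
    outcomes: list of outcomes in {0, 1}.
    p: optional list of prior weights p_n (positive, summing to 1).
    """
    N = 1 + max(n for preds in expert_predictions for n in preds)
    w = list(p) if p is not None else [1.0 / N] * N      # w_0^n := p_n
    forecasts = []
    for preds, omega in zip(expert_predictions, outcomes):
        awake = sorted(preds)                            # the set A_t of awake experts
        total = sum(w[n] for n in awake)
        u = {n: w[n] / total for n in awake}             # normalized weights u_{t-1}^n
        pi = sum(u[n] * preds[n] for n in awake)         # pi_t := Sigma((u), (gamma))
        forecasts.append(pi)
        for n in awake:                                  # sleeping experts keep w unchanged
            w[n] *= math.exp(log_loss(pi, omega) - log_loss(preds[n], omega))
    return forecasts
```

On steps where every expert is awake, this sketch reduces to the Aggregating Algorithm for the log loss.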

Derivation of the Algorithm

In this derivation we will need the following notation. For each history of the game, let $A^n$, n ∈ {1, . . . , N }, be the set of steps at which Expert n is awake:
$$A^n := \{t \in \{1, 2, \ldots\} \mid n \in A_t\}.$$
For each positive integer k, [k] stands for the set {1, . . . , k}.
The method of defensive forecasting (as used in the proof of Corollary 4)
requires that at step T we should choose π = πT such that, for each ω ∈ {0, 1},
$$\begin{aligned}
&\sum_{n\in A_T} p_n\, e^{\eta(\lambda(\pi,\omega)-\lambda(\gamma^n_T,\omega))} \prod_{t\in[T-1]\cap A^n} e^{\eta(\lambda(\pi_t,\omega_t)-\lambda(\gamma^n_t,\omega_t))} + \sum_{n\in A^c_T} p_n \prod_{t\in[T-1]\cap A^n} e^{\eta(\lambda(\pi_t,\omega_t)-\lambda(\gamma^n_t,\omega_t))} \\
&\qquad\le \sum_{n\in[N]} p_n \prod_{t\in[T-1]\cap A^n} e^{\eta(\lambda(\pi_t,\omega_t)-\lambda(\gamma^n_t,\omega_t))},
\end{aligned}$$
where $A^c_T$ stands for the complement of $A_T$ in [N ]: $A^c_T := [N]\setminus A_T$. This inequality is equivalent to
$$\sum_{n\in A_T} p_n\, e^{\eta(\lambda(\pi,\omega)-\lambda(\gamma^n_T,\omega))} \prod_{t\in[T-1]\cap A^n} e^{\eta(\lambda(\pi_t,\omega_t)-\lambda(\gamma^n_t,\omega_t))} \le \sum_{n\in A_T} p_n \prod_{t\in[T-1]\cap A^n} e^{\eta(\lambda(\pi_t,\omega_t)-\lambda(\gamma^n_t,\omega_t))}$$

and can be rewritten as
$$\sum_{n\in A_T} e^{\eta(\lambda(\pi,\omega)-\lambda(\gamma^n_T,\omega))}\, u^n_{T-1} \le 1, \qquad(15)$$
where $u^n_{T-1} := w^n_{T-1}\big/\sum_{n\in A_T} w^n_{T-1}$ are the normalized weights
$$w^n_{T-1} := p_n \prod_{t\in[T-1]\cap A^n} e^{\eta(\lambda(\pi_t,\omega_t)-\lambda(\gamma^n_t,\omega_t))}.$$

Comparing (15) and (14), we can see that it suffices to set
$$\pi := \Sigma\left((u^n_{T-1})_{n\in A_T},\,(\gamma^n_T)_{n\in A_T}\right).$$
Acknowledgements

The anonymous reviewers’ comments were very helpful in weeding out mistakes
and improving presentation (although some of their suggestions could only be
used for the full version of the paper [4], not restricted by the page limit). This
work was supported in part by EPSRC grant EP/F002998/1. We are grateful to
the anonymous Eurocrat who coined the term “expert evaluator”.

References
1. Cesa-Bianchi, N., Lugosi, G.: Prediction, Learning, and Games. Cambridge
University Press, Cambridge (2006)
2. Chernov, A., Kalnishkan, Y., Zhdanov, F., Vovk, V.: Supermartingales in predic-
tion with expert advice. In: Freund, Y., Györfi, L., Turán, G., Zeugmann, T. (eds.)
ALT 2008. LNCS (LNAI), vol. 5254, pp. 199–213. Springer, Heidelberg (2008)
3. Freund, Y., Schapire, R.E., Singer, Y., Warmuth, M.K.: Using and combining pre-
dictors that specialize. In: Proceedings of the Twenty Ninth Annual ACM Sympo-
sium on Theory of Computing, New York, Association for Computing Machinery,
pp. 334–343 (1997)
4. Chernov, A., Vovk, V.: Prediction with expert evaluators’ advice. Technical Report
arXiv:0902.4127 [cs.LG], arXiv.org e-Print archive (2009)
5. Haussler, D., Kivinen, J., Warmuth, M.K.: Sequential prediction of individual
sequences under general loss functions. IEEE Transactions on Information The-
ory 44, 1906–1925 (1998)
6. Vovk, V.: A game of prediction with expert advice. Journal of Computer and
System Sciences 56, 153–173 (1998)
7. Vovk, V.: Derandomizing stochastic prediction strategies. Machine Learning 35,
247–282 (1999)
8. Dawid, A.P.: Probability forecasting. In: Kotz, S., Johnson, N.L., Read, C.B. (eds.)
Encyclopedia of Statistical Sciences, vol. 7, pp. 210–218. Wiley, New York (1986)
9. Gneiting, T., Raftery, A.E.: Strictly proper scoring rules, prediction, and estima-
tion. Journal of the American Statistical Association 102, 359–378 (2007)
10. Press, W.H., Flannery, B.P., Teukolsky, S.A., Vetterling, W.T.: Numerical Recipes
in C, 2nd edn. Cambridge University Press, Cambridge (1992)
11. Vovk, V.: Defensive forecasting for optimal prediction with expert advice. Technical
Report arXiv:0708.1503 [cs.LG], arXiv.org e-Print archive (August 2007)
12. Shafer, G., Vovk, V.: Probability and Finance: It’s Only a Game! Wiley, New York
(2001)
13. Dawid, A.P., Vovk, V.: Prequential probability: principles and properties.
Bernoulli 5, 125–162 (1999)
14. Levin, L.A.: Uniform tests of randomness. Soviet Mathematics Doklady 17,
337–340 (1976)
15. Vovk, V., Takemura, A., Shafer, G.: Defensive forecasting. In: Cowell, R.G.,
Ghahramani, Z. (eds.) Proceedings of the Tenth International Workshop
on Artificial Intelligence and Statistics, Savannah Hotel, Barbados, Society
for Artificial Intelligence and Statistics, January 6-8, pp. 365–372 (2005),
http://www.gatsby.ucl.ac.uk/aistats/
Pure Exploration in Multi-armed Bandits Problems

Sébastien Bubeck1, Rémi Munos1, and Gilles Stoltz2,3

1 INRIA Lille, SequeL Project, France
2 Ecole normale supérieure, CNRS, Paris, France
3 HEC Paris, CNRS, Jouy-en-Josas, France

Abstract. We consider the framework of stochastic multi-armed bandit problems and study the possibilities and limitations of strategies that perform an on-
line exploration of the arms. The strategies are assessed in terms of their simple
regret, a regret notion that captures the fact that exploration is only constrained by
the number of available rounds (not necessarily known in advance), in contrast to
the case when the cumulative regret is considered and when exploitation needs to
be performed at the same time. We believe that this performance criterion is suited
to situations when the cost of pulling an arm is expressed in terms of resources
rather than rewards. We discuss the links between the simple and the cumulative
regret. The main result is that the required exploration–exploitation trade-offs are
qualitatively different, in view of a general lower bound on the simple regret in
terms of the cumulative regret.

1 Introduction

Learning processes usually face an exploration versus exploitation dilemma, since they
have to get information on the environment (exploration) to be able to take good actions
(exploitation). A key example is the multi-armed bandit problem [Rob52], a sequential
decision problem where, at each stage, the forecaster has to pull one out of K given
stochastic arms and gets a reward drawn at random according to the distribution of
the chosen arm. The usual assessment criterion of a strategy is given by its cumulative
regret, the sum of differences between the expected reward of the best arm and the
obtained rewards. Typical good strategies, like the UCB strategies of [ACBF02], trade
off between exploration and exploitation.
Our setting is as follows. The forecaster may sample the arms a given number of
times n (not necessarily known in advance) and is then asked to output a recommenda-
tion, formed by a probability distribution over the arms. He is evaluated by his simple
regret, that is, the difference between the average payoff of the best arm and the average
payoff obtained by his recommendation. The distinguishing feature from the classical
multi-armed bandit problem is that the exploration phase and the evaluation phase are
separated. We now illustrate why this is a natural framework for numerous applications.
Historically, the first occurrence of multi-armed bandit problems was given by med-
ical trials. In the case of a severe disease, ill patients only are included in the trial and
the cost of picking the wrong treatment is high (the associated reward would equal a
large negative value). It is important to minimize the cumulative regret, since the test
and cure phases coincide. However, for cosmetic products, there exists a test phase


separated from the commercialization phase, and one aims at minimizing the regret of
the commercialized product rather than the cumulative regret in the test phase, which
is irrelevant. (Here, several formulæ for a cream are considered and some quantitative
measurement, like skin moisturization, is performed.)
The pure exploration problem addresses the design of strategies making the best possible use of available numerical resources (e.g., CPU time) in order to optimize the performance of some decision-making task. That is, it occurs in situations with a preliminary exploration phase in which costs are not measured in terms of rewards but rather in terms of resources, which come in limited budget. A motivating example concerns recent works on computer-go (e.g., the MoGo program of [GWMT06]). A given amount of time, i.e., of CPU cycles, is allotted to the player to explore the possible outcomes of sequences of plays and to output a final decision. An efficient exploration of the search
space is obtained by considering a hierarchy of forecasters minimizing some cumulative
regret – see, for instance, the UCT strategy of [KS06] and the BAST strategy of [CM07].
However, the cumulative regret does not seem to be the right way to base the strategies
on, since the simulation costs are the same for exploring all options, bad and good ones.
This observation was actually the starting point of the notion of simple regret and of this
work. A final related example is the maximization of some function f , observed with
noise, see, e.g., [Kle04, BMSS09]. Whenever evaluating f at a point is costly (e.g., in
terms of numerical or financial costs), the issue is to choose as adequately as possible
where to query the value of this function in order to have a good approximation to the
maximum. The pure exploration problem considered here addresses exactly the design
of adaptive exploration strategies making the best use of available resources in order to
make the most precise prediction once all resources are consumed.
As a remark, it also turns out that in all examples considered above, we may impose the further restriction that the forecaster does not know ahead of time the amount of available resources (time, budget, or the number of patients to be included); that is, we seek anytime performance. The problem of pure exploration presented above was referred
to as “budgeted multi-armed bandit problem” in [MLG04], where another notion of re-
gret than simple regret is considered. [Sch06] solves the pure exploration problem in a
minmax sense for the case of two arms only and rewards given by probability distribu-
tions over [0, 1]. [EDMM02] and [MT04] consider a related setting where forecasters

Parameters: K probability distributions for the rewards of the arms, ν1 , . . . , νK


For each round t = 1, 2, . . . ,

(1) the forecaster chooses ϕt ∈ P{1, . . . , K} and pulls It at random according to ϕt ;


(2) the environment draws the reward Yt for that action (also denoted by XIt ,TIt (t) with
the notation introduced in the text);
(3) the forecaster outputs a recommendation ψt ∈ P{1, . . . , K};
(4) If the environment sends a stopping signal, then the game takes an end; otherwise, the
next round starts.

Fig. 1. The pure exploration problem for multi-armed bandits


perform exploration during a random number of rounds T and aim at identifying an
ε–best arm. They study the possibilities and limitations of policies achieving this goal
with overwhelming 1 − δ probability and indicate in particular upper and lower bounds
on (the expectation of) T . Another related problem in the statistical literature is the
identification of the best arm (with high probability). However, the binary assessment
criterion used there (the forecaster is either right or wrong in recommending an arm)
does not capture the possible closeness in performance of the recommended arm com-
pared to the optimal one, which the simple regret does. Unlike the latter, this criterion
is not suited for a distribution-free analysis.

2 Problem Setup, Notation

We consider a sequential decision problem given by stochastic multi-armed bandits. K ≥ 2 arms, denoted by j = 1, . . . , K, are available and the j–th of them is parameterized by a fixed (unknown) probability distribution νj over [0, 1] with expectation μj ; at those rounds when it is pulled, its associated reward is drawn at random according to νj , independently of all previous rewards. For each arm j and all time rounds n ≥ 1, we denote by Tj (n) the number of times j was pulled from rounds 1 to n, and by Xj,1 , Xj,2 , . . . , Xj,Tj (n) the sequence of associated rewards.
The forecaster has to deal simultaneously with two tasks, a primary one and an as-
sociated one. The associated task consists in exploration, i.e., the forecaster should in-
dicate at each round t the arm It to be pulled. He may resort to a randomized strategy,
which, based on past rewards, prescribes a probability distribution ϕt ∈ P{1, . . . , K}
(where we denote by P{1, . . . , K} the set of all probability distributions over the in-
dexes of the arms). In that case, It is drawn at random according to the probability
distribution ϕt and the forecaster gets to see the associated reward Yt , also denoted by
XIt ,TIt (t) with the notation above. The sequence (ϕt ) is referred to as an allocation
strategy. The primary task is to output at the end of each round t a recommendation
ψt ∈ P{1, . . . , K} to be used to form a randomized play in a one-shot instance if/when
the environment sends some stopping signal meaning that the exploration phase is over.
The sequence (ψt ) is referred to as a recommendation strategy. Figure 1 summarizes
the description of the sequential game and points out that the information available to
the forecaster for choosing ϕt , respectively ψt , is formed by the Xj,s for j = 1, . . . , K
and s = 1, . . . , Tj (t − 1), respectively, s = 1, . . . , Tj (t).
As we are only interested in the performances of the recommendation strategy (ψt ),
we call this problem the pure exploration problem for multi-armed bandits and evaluate
the strategies through their simple regrets. The simple regret rt of a recommendation
ψt = (ψj,t )j=1,...,K is defined as the expected regret on a one-shot instance of the
game, if a random action is taken according to ψt . Formally,
$$r_t = r(\psi_t) = \mu^* - \mu_{\psi_t}, \quad\text{where}\quad \mu^* = \mu_{j^*} = \max_{j=1,\ldots,K}\mu_j \quad\text{and}\quad \mu_{\psi_t} = \sum_{j=1,\ldots,K}\psi_{j,t}\,\mu_j$$
denote respectively the expectations of the rewards of the best arm j ∗ (a best arm, if there are several of them with same maximal expectation) and of the recommendation
ψt . A useful notation in the sequel is the gap Δj = μ∗ − μj between the maximal expected reward and the one of the j–th arm, as well as the minimal gap
$$\Delta = \min_{j:\Delta_j>0}\Delta_j.$$

A quantity of related interest is the cumulative regret at round n, which is defined as $R_n = \sum_{t=1}^{n}\left(\mu^* - \mu_{I_t}\right)$. A popular treatment of the multi-armed bandit problems is to construct forecasters ensuring that ERn = o(n), see, e.g., [LR85] or [ACBF02], and even Rn = o(n) a.s., as follows, e.g., from [ACBFS02, Theorem 6.3] together with a martingale argument. The quantities $\tilde r_t = \mu^* - \mu_{I_t}$ are sometimes called instantaneous regrets. They differ from the simple regrets rt and in particular, $R_n = \tilde r_1 + \cdots + \tilde r_n$ is in general not equal to r1 + . . . + rn . Theorem 1, among others, will however indicate some connections between rn and Rn .

Goal and structure of the paper: We study the links between simple and cumulative
regrets. Intuitively, an efficient allocation strategy for the simple regret should rely on
some exploration–exploitation trade-off. Our main contribution (Theorem 1, Section 3)
is a lower bound on the simple regret in terms of the cumulative regret suffered in the
exploration phase, showing that the trade-off involved in the minimization of the simple
regret is somewhat different from the one for the cumulative regret. It in particular
implies that the uniform allocation is a good benchmark when n is large. In Sections 4
and 5, we show how one can nonetheless fight this negative result. For instance,
some strategies designed for the cumulative regret can outperform (for moderate values
of n) strategies with exponential rates of convergence for their simple regret.

3 The Smaller the Cumulative Regret, the Larger the Simple Regret
It is immediate that for the recommendation formed by the empirical distribution of
plays of Figure 3, that is, ψn = (δI1 + . . . + δIn )/n, the regrets satisfy rn = Rn /n;
therefore, upper bounds on ERn lead to upper bounds on Ern . We show here that upper
bounds on ERn also lead to lower bounds on Ern : the smaller the guaranteed upper
bound on ERn , the larger the lower bound on Ern , no matter what the recommendation
strategies ψn are.
This is interpreted as a variation of the “classical” trade-off between exploration and
exploitation. Here, while the recommendation strategies ψn rely only on the exploitation
of the results of the preliminary exploration phase, the design of the allocation policies
ϕn consists in an efficient exploration of the arms. To guarantee this efficient explo-
ration, past payoffs of the arms have to be considered and thus, even in the exploration
phase, some exploitation is needed. Theorem 1 and its corollaries aim at quantifying
the needed respective amount of exploration and exploitation. In particular, to have an
asymptotic optimal rate of decrease for the simple regret, each arm should be sampled
a linear number of times, while for the cumulative regret, it is known that the forecaster
should not do so more than a logarithmic number of times on the suboptimal arms.
Formally, our main result is as follows. It is strong in the sense that we get lower
bounds for all possible sets of Bernoulli distributions {ν1 , . . . , νK } over the rewards.
Theorem 1 (Main result). For all allocation strategies (ϕt ) and all functions ε : {1, 2, . . .} → R such that
for all (Bernoulli) distributions ν1 , . . . , νK on the rewards, there exists a constant C ≥ 0 with ERn ≤ Cε(n),
the simple regret of all recommendation strategies (ψt ) based on the allocation strategies (ϕt ) is such that
for all sets of K ≥ 3 (distinct, Bernoulli) distributions on the rewards, all different from a Dirac distribution at 1, there exists a constant D ≥ 0 and an ordering ν1 , . . . , νK of the considered distributions with
$$Er_n \ge \frac{\Delta}{2}\, e^{-D\varepsilon(n)}.$$
Corollary 1. For all allocation strategies (ϕt ), all recommendation strategies (ψt ), and all sets of K ≥ 3 (distinct, Bernoulli) distributions on the rewards, there exist two constants β > 0 and γ ≥ 0 such that, up to the choice of a good ordering of the considered distributions,
$$Er_n \ge \beta\, e^{-\gamma n}.$$

Theorem 1 is proved below and Corollary 1 follows from the fact that the cumulative regrets are always bounded by n. To better grasp the point of the theorem, one should keep in mind that the typical (distribution-dependent) rate of growth of the cumulative regrets of good algorithms, e.g., UCB1 of [ACBF02], is ε(n) = ln n. This, as asserted in [LR85], is the optimal rate. But the recommendation strategies based on such allocation strategies are bound to suffer a simple regret that decreases at best polynomially fast. We state this result for the slight modification UCB(p) of UCB1 stated in Figure 2; its proof relies on noting that UCB(p) achieves a cumulative regret bounded by a large enough distribution-dependent constant times ε(n) = p ln n.

Corollary 2. The allocation strategy (ϕt ) given by the forecaster UCB(p) of Figure 2 ensures that for all recommendation strategies (ψt ) and all sets of K ≥ 3 (distinct, Bernoulli) distributions on the rewards, there exist two constants β > 0 and γ ≥ 0 (independent of p) such that, up to the choice of a good ordering of the considered distributions,
$$Er_n \ge \beta\, n^{-\gamma p}.$$

Proof. The intuitive version of the proof of Theorem 1 is as follows. The basic idea
is to consider a tie case when the best and worst arms have zero empirical means; it
happens often enough (with a probability at least exponential in the number of times we
pulled these arms) and results in the forecaster basically having to pick another arm and
suffering some regret. Permutations are used to control the case of untypical or naive
forecasters that would despite all pull an arm with zero empirical mean, since they force
a situation when those forecasters choose the worst arm instead of the best one.
Formally, we fix the allocation strategies (ϕt ) and a corresponding function ε such that the assumption of the theorem is satisfied. We consider below a set of K ≥ 3 (distinct) Bernoulli distributions; actually, we only use below that their parameters are (up to a first ordering) such that 1 > μ1 > μ2 ≥ μ3 ≥ . . . ≥ μK ≥ 0 and μ2 > μK (thus, μ2 > 0).
Another layer of notation is needed. It depends on permutations σ of {1, . . . , K}. To
have a gentle start, we first describe the notation when the permutation is the identity,
σ = id. We denote by P and E the probability and expectation with respect to the
K-tuple of distributions over the arms ν1 , . . . , νK . For i = 1 (respectively, i = K),
we denote by Pi,id and Ei,id the probability and expectation with respect to the K-
tuples formed by δ0 , ν2 , . . . , νK (respectively, δ0 , ν2 , . . . , νK−1 , δ0 ), where δ0 denotes
the Dirac measure on 0. For a given permutation σ, we consider similar notation up
to a reordering. Pσ and Eσ refer to the probability and expectation with respect to the
K-tuple of distributions over the arms formed by the νσ−1 (1) , . . . , νσ−1 (K) . Note in
particular that the j–th best arm is located in the σ(j)–th position. Now, we denote
for i = 1 (respectively, i = K) by Pi,σ and Ei,σ the probability and expectation with
respect to the K-tuple formed by the νσ−1 (j) , except that we replaced the best of them,
located in the σ(1)–th position, by a Dirac measure on 0 (respectively, the best and
worst of them, located in the σ(1)–th and σ(K)–th positions, by Dirac measures on 0).
We provide a proof in six steps.
Step 1. Lower bounds the maximum of the simple regrets obtained by reordering by an average,
$$\max_\sigma E_\sigma r_n \ge \frac{1}{K!}\sum_\sigma E_\sigma r_n \ge \frac{\mu_1-\mu_2}{K!}\sum_\sigma E_\sigma\left[1-\psi_{\sigma(1),n}\right],$$
where we used that under Pσ , the index of the best arm is σ(1) and the minimal regret for playing any other arm is at least μ1 − μ2 .
Step 2. Rewrites each term of the sum over σ as the product of three simple terms. We
use first that P1,σ is the same as Pσ , except that it ensures that arm σ(1) has zero reward
throughout. Denoting by
$$C_{j,n} = \sum_{t=1}^{T_j(n)} X_{j,t}$$
the cumulative reward of the j–th arm till round n, one then gets
$$E_\sigma\left[1-\psi_{\sigma(1),n}\right] \ge E_\sigma\left[\left(1-\psi_{\sigma(1),n}\right)\mathbb{I}_{\{C_{\sigma(1),n}=0\}}\right] = E_\sigma\left[1-\psi_{\sigma(1),n}\,\middle|\,C_{\sigma(1),n}=0\right] P_\sigma\left\{C_{\sigma(1),n}=0\right\} = E_{1,\sigma}\left[1-\psi_{\sigma(1),n}\right] P_\sigma\left\{C_{\sigma(1),n}=0\right\}.$$

Second, iterating the argument from P1,σ to PK,σ ,
$$E_{1,\sigma}\left[1-\psi_{\sigma(1),n}\right] \ge E_{1,\sigma}\left[1-\psi_{\sigma(1),n}\,\middle|\,C_{\sigma(K),n}=0\right] P_{1,\sigma}\left\{C_{\sigma(K),n}=0\right\} = E_{K,\sigma}\left[1-\psi_{\sigma(1),n}\right] P_{1,\sigma}\left\{C_{\sigma(K),n}=0\right\}$$
and therefore,
$$E_\sigma\left[1-\psi_{\sigma(1),n}\right] \ge E_{K,\sigma}\left[1-\psi_{\sigma(1),n}\right]\, P_{1,\sigma}\left\{C_{\sigma(K),n}=0\right\}\, P_\sigma\left\{C_{\sigma(1),n}=0\right\}. \qquad(1)$$
Step 3. Deals with the second term in the right-hand side of (1):
$$P_{1,\sigma}\left\{C_{\sigma(K),n}=0\right\} = E_{1,\sigma}\left[(1-\mu_K)^{T_{\sigma(K)}(n)}\right] \ge (1-\mu_K)^{E_{1,\sigma} T_{\sigma(K)}(n)},$$
where the equality can be seen by conditioning on I1 , . . . , In and then taking the expectation, whereas the inequality is a consequence of Jensen’s inequality. Now, the expected number of times the sub-optimal arm σ(K) is pulled under P1,σ is bounded by the regret, by the very definition of the latter: $(\mu_2-\mu_K)\,E_{1,\sigma} T_{\sigma(K)}(n) \le E_{1,\sigma} R_n$. Since by hypothesis (and by taking the maximum of K! values), there exists a constant C such that for all σ, $E_{1,\sigma} R_n \le C\,\varepsilon(n)$, we finally get
$$P_{1,\sigma}\left\{C_{\sigma(K),n}=0\right\} \ge (1-\mu_K)^{C\varepsilon(n)/(\mu_2-\mu_K)}.$$

Step 4. Lower bounds the third term in the right-hand side of (1) as
$$P_\sigma\left\{C_{\sigma(1),n}=0\right\} \ge (1-\mu_1)^{C\varepsilon(n)/\mu_2}.$$
We denote by Wn = (I1 , Y1 , . . . , In , Yn ) the history of actions pulled and obtained payoffs up to time n. What follows is reminiscent of the techniques used in [MT04]. We are interested in realizations wn = (i1 , y1 , . . . , in , yn ) of the history such that whenever σ(1) was played, it got a null reward. (We denote by tj (t) the realization of Tj (t) corresponding to wn , for all j and t.) The likelihood of such a wn under Pσ is $(1-\mu_1)^{t_{\sigma(1)}(n)}$ times the one under P1,σ . Thus,
$$P_\sigma\left\{C_{\sigma(1),n}=0\right\} = \sum P_\sigma\left\{W_n = w_n\right\} = \sum (1-\mu_1)^{t_{\sigma(1)}(n)}\, P_{1,\sigma}\left\{W_n = w_n\right\} = E_{1,\sigma}\left[(1-\mu_1)^{T_{\sigma(1)}(n)}\right],$$
where the sums are over those histories wn such that the realizations of the payoffs obtained by the arm σ(1) equal xσ(1),s = 0 for all s = 1, . . . , tσ(1) (n). The argument is concluded as before, first by Jensen’s inequality and then, by using that $\mu_2\, E_{1,\sigma} T_{\sigma(1)}(n) \le E_{1,\sigma} R_n \le C\,\varepsilon(n)$ by definition of the regret and the hypothesis put on its control.
Step 5. Resorts to a symmetry argument to show that, as far as the first term of the right-hand side of (1) is concerned,
$$\sum_\sigma E_{K,\sigma}\left[1-\psi_{\sigma(1),n}\right] \ge \frac{K!}{2}.$$
Since PK,σ only depends on σ(2), . . . , σ(K − 1), we denote by Pσ(2),...,σ(K−1) the common value of these probability distributions when σ(1) and σ(K) vary (and a similar notation for the associated expectation). We can thus group the permutations σ two by two according to these (K −2)–tuples, one of the two permutations being defined by σ(1) equal to one of the two elements of {1, . . . , K} not present in the (K − 2)–tuple, and the other one being such that σ(1) equals the other such element. Formally,
$$\sum_\sigma E_{K,\sigma}\,\psi_{\sigma(1),n} = \sum_{j_2,\ldots,j_{K-1}} E_{j_2,\ldots,j_{K-1}}\left[\sum_{j\in\{1,\ldots,K\}\setminus\{j_2,\ldots,j_{K-1}\}} \psi_{j,n}\right] \le \sum_{j_2,\ldots,j_{K-1}} E_{j_2,\ldots,j_{K-1}}\left[1\right] = \frac{K!}{2},$$
where the summations over j2 , . . . , jK−1 are over all possible (K −2)–tuples of distinct elements in {1, . . . , K}.
Step 6. Simply puts all pieces together and lower bounds $\max_\sigma E_\sigma r_n$ by
$$\frac{\mu_1-\mu_2}{K!}\sum_\sigma E_{K,\sigma}\left[1-\psi_{\sigma(1),n}\right] P_\sigma\left\{C_{\sigma(1),n}=0\right\} P_{1,\sigma}\left\{C_{\sigma(K),n}=0\right\} \ge \frac{\mu_1-\mu_2}{2}\left((1-\mu_K)^{C/(\mu_2-\mu_K)}\,(1-\mu_1)^{C/\mu_2}\right)^{\varepsilon(n)}.$$

4 Upper Bounds on the Simple Regret
In this section, we aim at qualifying the implications of Theorem 1 by pointing out that it should be interpreted as a result for large n only. For moderate values of n, strategies not pulling each arm a linear number of times in the exploration phase can have interesting simple regrets. To do so, we consider only two natural and widely used allocation strategies. The first one is the uniform allocation, which we use as a simple benchmark; it pulls each arm a linear number of times. The second one is UCB(p) (a variant of UCB1 where the quantile factor may be a parameter); it is designed for the classical exploration–exploitation dilemma (i.e., it minimizes the cumulative regret) and pulls suboptimal arms a logarithmic number of times only. Of course, fancier allocation strategies should also be considered at a later stage, but since the aim of this paper is to study the links between cumulative and simple regrets, we restrict our attention to the two discussed above.
In addition to these allocation strategies we consider three recommendation strategies, the ones that recommend respectively the empirical distribution of plays, the empirical best arm, or the most played arm. They are formally defined in Figures 2 and 3. Table 1 summarizes the distribution-dependent and distribution-free bounds we could prove so far (the difference between the two families of bounds is whether the constants can depend or not on the unknown distributions νj ). It shows that two interesting pairs of strategies are, on one hand, the uniform allocation together with the choice of the empirical best arm, and on the other hand, UCB(p) together with the choice of the most played arm. The first pair was perhaps expected, the second one might be considered more surprising. We only state here upper bounds on the simple regrets of these two pairs and omit the other ones. The distribution-dependent lower bound is stated in Corollary 1 and the distribution-free lower bound follows from a straightforward adaptation of the proof of the lower bound on the cumulative regret in [ACBFS02].
Parameters: K arms

Uniform allocation — Plays all arms one after the other
For each round t = 1, 2, . . . , use ϕt = δ[t mod K] , where [t mod K] denotes the value of t modulo K.

UCB(p) — Plays each arm once and then the one with the best upper confidence bound
Parameter: quantile factor p
For rounds t = 1, . . . , K, play ϕt = δt .
For each round t = K + 1, K + 2, . . . ,
(1) compute, for all j = 1, . . . , K, the quantities $\hat\mu_{j,t-1} = \frac{1}{T_j(t-1)}\sum_{s=1}^{T_j(t-1)} X_{j,s}$;
(2) use $\varphi_t = \delta_{j^*_{t-1}}$, where $j^*_{t-1} \in \operatorname*{argmax}_{j=1,\ldots,K}\left(\hat\mu_{j,t-1} + \sqrt{\frac{p\ln(t-1)}{T_j(t-1)}}\right)$ (ties broken by choosing, for instance, the arm with smallest index).

Fig. 2. Two allocation strategies

Table 1. Distribution-dependent (top) and distribution-free (bottom) bounds on the expected simple regret of the considered pairs of allocation (lines) and recommendation (columns) strategies. Lower bounds are also indicated. The □ symbols denote universal constants, whereas the ■ symbols denote distribution-dependent constants.

Distribution-dependent
              EDP              EBA               MPA
Uniform                        ■ e^{−■ n}
UCB(p)        ■ (p ln n)/n     ■ n^{−■}          ■ n^{2(1−p)}
Lower bound   ■ e^{−■ n}

Distribution-free
              EDP                    EBA                  MPA
Uniform                              □ √(K ln K / n)
UCB(p)        □ √(pK ln n / n)       □ / √(p ln n)        □ √(pK ln n / n)
Lower bound   □ √(K / n)
Table 1 indicates that while for distribution-dependent bounds, the asymptotic optimal rate of decrease in the number n of rounds for simple regrets is exponential, for distribution-free bounds, the rate worsens to 1/√n. A similar situation arises for the cumulative regret, see [LR85] (optimal ln n rate for distribution-dependent bounds) versus [ACBFS02] (optimal √n rate for distribution-free bounds).
Parameters: the history I1 , . . . , In of played actions and of their associated rewards Y1 , . . . , Yn , grouped according to the arms as Xj,1 , . . . , Xj,Tj (n) , for j = 1, . . . , K

Empirical distribution of plays (EDP)
Draws a recommendation using the probability distribution
$$\psi_n = \frac{1}{n}\sum_{t=1}^{n}\delta_{I_t}.$$

Empirical best arm (EBA)
Only considers arms j with Tj (n) ≥ 1, computes their associated empirical means
$$\hat\mu_{j,n} = \frac{1}{T_j(n)}\sum_{s=1}^{T_j(n)} X_{j,s},$$
and forms a deterministic recommendation (conditionally to the history),
$$\psi_n = \delta_{J^*_n}, \quad\text{where}\quad J^*_n \in \operatorname*{argmax}_{j}\,\hat\mu_{j,n}$$
(ties broken in some way).

Most played arm (MPA)
Forms a deterministic recommendation (conditionally to the history),
$$\psi_n = \delta_{J^*_n}, \quad\text{where}\quad J^*_n \in \operatorname*{argmax}_{j=1,\ldots,K} T_j(n)$$
(ties broken in some way).

Fig. 3. Three recommendation strategies
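
To illustrate how these pieces fit together (this sketch is ours and not part of the paper), the following Python code implements the two highlighted pairs, uniform allocation with EBA and UCB(p) with MPA, and estimates their simple regrets on Bernoulli arms by Monte Carlo. Ties are broken by smallest index, and all names are hypothetical.

```python
import math, random

def pull(mu):
    """Draw a Bernoulli reward with mean mu."""
    return 1.0 if random.random() < mu else 0.0

def uniform_eba(mus, n):
    """Uniform allocation (Figure 2) + empirical best arm (Figure 3)."""
    K = len(mus)
    counts, sums = [0] * K, [0.0] * K
    for t in range(n):
        j = t % K                          # phi_t = delta_[t mod K]
        sums[j] += pull(mus[j]); counts[j] += 1
    return max(range(K), key=lambda j: sums[j] / counts[j] if counts[j] else float("-inf"))

def ucb_mpa(mus, n, p=2.0):
    """UCB(p) allocation (Figure 2) + most played arm (Figure 3)."""
    K = len(mus)
    counts, sums = [0] * K, [0.0] * K
    for t in range(n):
        if t < K:
            j = t                          # play each arm once
        else:                              # best upper confidence bound
            j = max(range(K), key=lambda i: sums[i] / counts[i]
                    + math.sqrt(p * math.log(t) / counts[i]))
        sums[j] += pull(mus[j]); counts[j] += 1
    return max(range(K), key=lambda j: counts[j])

def expected_simple_regret(recommend, mus, n, runs=10_000):
    """Monte Carlo estimate of E r_n = mu^* - E mu_{psi_n}."""
    best = max(mus)
    return sum(best - mus[recommend(mus, n)] for _ in range(runs)) / runs

if __name__ == "__main__":
    mus = [0.1] * 18 + [0.5, 0.9]          # the right-hand setting of Figure 4
    for name, rec in (("Uniform+EBA", uniform_eba), ("UCB(2)+MPA", ucb_mpa)):
        print(name, expected_simple_regret(rec, mus, n=200))
```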

4.1 A Simple Benchmark: The Uniform Allocation Strategy
As explained above, the combination of the uniform allocation with the recommendation indicating the empirical best arm forms an important theoretical benchmark. This section states its theoretical properties: the rate of decrease of its simple regret is exponential in a distribution-dependent sense and equals the optimal (up to a logarithmic term) 1/√n rate in the distribution-free case. In Proposition 1, we propose two distribution-dependent bounds; the first one is sharper in the case when there are few arms, while the second one is suited for large n. Their simple proof is omitted; it relies on concentration inequalities, namely, Hoeffding’s inequality and McDiarmid’s inequality. The distribution-free bound of Corollary 3 is obtained not as a corollary of Proposition 1, but as a consequence of its proof. Its simple proof is also omitted.

Proposition 1. The uniform allocation strategy associated to the recommendation given by the empirical best arm ensures that the simple regrets are bounded as follows:
$$Er_n \le \sum_{j:\Delta_j>0}\Delta_j\, e^{-\Delta_j^2\lfloor n/K\rfloor/2} \qquad\text{for all } n \ge K;$$
$$Er_n \le \left(\max_{j=1,\ldots,K}\Delta_j\right)\exp\left(-\frac{1}{8}\left\lfloor\frac{n}{K}\right\rfloor\Delta^2\right) \qquad\text{for all } n \ge \left(1+\frac{8\ln K}{\Delta^2}\right)K.$$

Corollary 3. The uniform allocation strategy associated to the recommendation given by the empirical best arm (at round K⌊n/K⌋) ensures that the simple regrets are bounded in a distribution-free sense, for n ≥ K, as
$$\sup_{\nu_1,\ldots,\nu_K} Er_n \le 2\sqrt{\frac{2K\ln K}{n}}.$$

4.2 Analysis of UCB(p) Combined with MPA

A first (distribution-dependent) bound is stated in Theorem 2; the bound does not involve any quantity depending on the Δj , but it only holds for rounds n large enough, a statement that does involve the Δj . Its interest is first that it is simple to read, and second, that the techniques used to prove it easily imply a second (distribution-free) bound, stated in Theorem 3, which is comparable to Corollary 3.
Theorem 2. For p > 1, the allocation strategy given by UCB(p) associated to the recommendation given by the most played arm ensures that the simple regrets are bounded in a distribution-dependent sense by
$$Er_n \le \frac{K^{2p-1}}{p-1}\, n^{2(1-p)}$$
for all n sufficiently large, e.g., such that $n \ge K + \frac{4Kp\ln n}{\Delta^2}$ and n ≥ K(K + 2).
The polynomial rate in the upper bound above is not a coincidence according to the
lower bound exhibited in Corollary 2. Here, surprisingly enough, this polynomial rate
of decrease is distribution-free (but in compensation, the bound is only valid after a
distribution-dependent time). This rate illustrates Theorem 1: the larger p, the larger
the (theoretical bound on the) cumulative regret of UCB(p) but the smaller the simple
regret of UCB(p) associated to the recommendation given by the most played arm.
Theorem 3. For p > 1, the allocation strategy given by UCB(p) associated to the recommendation given by the most played arm ensures that the simple regrets are bounded for all n ≥ K(K + 2) in a distribution-free sense by
$$Er_n \le \sqrt{\frac{4Kp\ln n}{n-K}} + \frac{K^{2p-1}}{p-1}\, n^{2(1-p)} = O\left(\sqrt{\frac{Kp\ln n}{n}}\right).$$

Remark 1. We can rephrase the results of [KS06] as using UCB1 as an allocation strat-
egy and forming a recommendation according to the empirical best arm. In particular,
[KS06, Theorem 5] provides a distribution-dependent bound on the probability of not
picking the best arm with this procedure and can be used to derive the following bound
on the simple regret:
$$Er_n \le \sum_{j:\Delta_j>0}\frac{4}{\Delta_j}\left(\frac{1}{n}\right)^{\rho\Delta_j^2/2}$$
for all n ≥ 1. The leading constants 1/Δj and the distribution-dependent exponent make it not as useful as the one presented in Theorem 2. The best distribution-free bound we could get from this bound was of the order of 1/√(ln n), to be compared to the asymptotically optimal 1/√n rate stated in Theorem 3.

Proofs of Theorems 2 and 3
Lemma 1. For p > 1, the allocation strategy given by UCB(p) associated to the recommendation given by the most played arm ensures that the simple regrets are bounded in a distribution-dependent sense as follows. For all a1 , . . . , aK such that a1 + . . . + aK = 1 and aj ≥ 0 for all j, with the additional property that for all suboptimal arms j and all optimal arms j ∗ , one has aj ≤ aj∗ , the following bound holds:
$$Er_n \le \frac{1}{p-1}\sum_{j\ne j^*}(a_j n)^{2(1-p)}$$
for all n sufficiently large, e.g., such that, for all suboptimal arms j,
$$a_j n \ge 1 + \frac{4p\ln n}{\Delta_j^2} \quad\text{and}\quad a_j n \ge K+2.$$
Proof. We first prove that whenever the most played arm Jn∗ is different from an optimal arm j ∗ , then at least one of the suboptimal arms j is such that Tj (n) ≥ aj n. To do so, we prove the converse and assume that Tj (n) < aj n for all suboptimal arms. Then,
$$\sum_{i=1}^{K} a_i n = n = \sum_{i=1}^{K} T_i(n) < \sum_{j^*} T_{j^*}(n) + \sum_{j} a_j n,$$
where, in the inequality, the first summation is over the optimal arms, the second one over the suboptimal ones. Therefore, we get
$$\sum_{j^*} a_{j^*}\, n < \sum_{j^*} T_{j^*}(n)$$
and there exists at least one optimal arm j ∗ such that Tj∗ (n) > aj∗ n. Since by definition of the vector (a1 , . . . , aK ), one has aj ≤ aj∗ for all suboptimal arms, it comes that Tj (n) < aj n ≤ aj∗ n < Tj∗ (n) for all suboptimal arms, and the most played arm Jn∗ is thus an optimal arm. Thus, using that Δj ≤ 1 for all j,
$$Er_n = E\Delta_{J^*_n} \le \sum_{j:\Delta_j>0} P\left\{T_j(n) \ge a_j n\right\}.$$

A side-result extracted from the proof of [ACBF02, Theorem 1] states that for all suboptimal arms j and all rounds t ≥ K + 1,
$$P\left\{I_t = j \text{ and } T_j(t-1) \ge \ell\right\} \le 2\,t^{1-2p} \qquad\text{whenever}\quad \ell \ge \frac{4p\ln n}{\Delta_j^2}. \qquad(2)$$
This yields that for a suboptimal arm j, and since by the assumptions on n and the aj the choice $\ell = \lceil a_j n\rceil - 1$ satisfies $\ell \ge K+1$ and $\ell \ge (4p\ln n)/\Delta_j^2$,
$$P\left\{T_j(n) \ge a_j n\right\} \le \sum_{t=\lceil a_j n\rceil}^{n} P\left\{T_j(t-1) = \lceil a_j n\rceil - 1 \text{ and } I_t = j\right\} \le \sum_{t=\lceil a_j n\rceil}^{n} 2\,t^{1-2p} \le \frac{1}{p-1}\,(a_j n)^{2(1-p)}, \qquad(3)$$
where we used a union bound for the second inequality and (2) for the third inequality. A summation over all suboptimal arms j concludes the proof.

Proof (of Theorem 2). We apply Lemma 1 with the uniform choice aj = 1/K and
recall that Δ is the minimum of the Δj > 0.

Proof (of Theorem 3). We start the proof by using that $\sum_j \psi_{j,n} = 1$ and Δj ≤ 1 for all j, and can thus write
$$Er_n = E\Delta_{J^*_n} = \sum_{j=1}^{K}\Delta_j\,E\psi_{j,n} \le \varepsilon + \sum_{j:\Delta_j>\varepsilon}\Delta_j\,E\psi_{j,n}.$$
Since Jn∗ = j only if Tj (n) ≥ n/K, that is, $\psi_{j,n} = \mathbb{I}_{\{J^*_n=j\}} \le \mathbb{I}_{\{T_j(n)\ge n/K\}}$, we get
$$Er_n \le \varepsilon + \sum_{j:\Delta_j>\varepsilon}\Delta_j\,P\left\{T_j(n)\ge\frac{n}{K}\right\}.$$
Applying (3) with aj = 1/K leads to
$$Er_n \le \varepsilon + \frac{K^{2(p-1)}\,n^{2(1-p)}}{p-1}\sum_{j:\Delta_j>\varepsilon}\Delta_j,$$
where ε is chosen such that for all Δj > ε, the condition $\lceil n/K\rceil - 1 \ge (4p\ln n)/\Delta_j^2$ is satisfied ($\lceil n/K\rceil - 1 \ge K+1$ being satisfied by the assumption on n and K). The conclusion thus follows from taking, for instance, $\varepsilon = \sqrt{(4pK\ln n)/(n-K)}$ and upper bounding all remaining Δj by 1.

5 Conclusions: Comparison of the Bounds, Simulation Study

We now explain why, in some cases, the bound provided by our theoretical analysis in Lemma 1 is better than the bound stated in Proposition 1. The central point in the argument is that the bound of Lemma 1 is of the form ■ n^{2(1−p)} for some distribution-dependent constant ■; that is, it has a distribution-free convergence rate. In comparison, the bound of Proposition 1 involves the gaps Δj in the rate of convergence. Some care is needed in the comparison, since the bound for UCB(p) holds only for n large enough, but it is easy to find situations where, for moderate values of n, the bound exhibited for the sampling with UCB(p) is better than the one for the uniform allocation. These situations typically involve a rather large number K of arms; in the latter case, the uniform allocation strategy samples each arm only about n/K times, whereas the UCB strategy rapidly focuses its exploration on the best arms. A general argument is proposed in the extended version [BMS09, Appendix B]. We only consider here one numerical example
[Figure: two panels plotting the expectation of the simple regret (Y–axis) against the allocation budget (X–axis), for UCB(2) with the empirical best arm, UCB(2) with the most played arm, and uniform sampling with the empirical best arm. Left panel: νi = B(1/2), i = 1..19; ν20 = B(0.66). Right panel: νi = B(0.1), i = 1..18; ν19 = B(0.5); ν20 = B(0.9).]

Fig. 4. Simple regret of different pairs of allocation and recommendation strategies, for K = 20 arms with Bernoulli distributions of parameters indicated on top of each graph; X–axis: number of samples, Y–axis: expectation of the simple regret (the smaller, the better)

extracted from there, see the right part of Figure 4. For moderate values of n (at least
when n is about 6 000), the bounds associated to the sampling with UCB(p) are better
than the ones associated to the uniform sampling.
To make the story described in this paper short, we can distinguish three regimes:
– for large values of n, uniform exploration is better (as shown by a combination of
the lower bound of Corollary 2 and of the upper bound of Proposition 1);
– for moderate values of n, sampling with UCB(p) is preferable, as discussed just
above;
– for small values of n, the best bounds to use seem to be the distribution-free bounds,
which are of the same order of magnitude for the two strategies.
Of course, these statements involve distribution-dependent quantifications (to determine
which n are small, moderate, or large).
We propose two simple experiments to illustrate our theoretical analysis; each of them was run on 10^4 instances of the problem and we plotted the average simple regrets. (More experiments can be found in [BMS09].) The first one corresponds in some sense to the worst case alluded to at the beginning of Section 4. It shows that for small values of n (e.g., n ≤ 80 in the left plot of Figure 4), the uniform allocation strategy is very competitive. Of course the range of these values of n can be made arbitrarily large by decreasing the gaps. The second one corresponds to the numerical example described earlier in this section.
We mostly illustrate here the small and moderate n regimes. (This is because for large
n, the simple regrets are usually very small, even below computer precision.) Because
of these chosen ranges, we do not see yet the uniform allocation strategy getting better
than UCB–based strategies. This has an important impact on the interpretation of the
lower bound of Theorem 1. While its statement is in finite time, it should be interpreted
as providing an asymptotic result only.
6 Pure Exploration for Bandit Problems in Topological Spaces
These results are of theoretical interest. We summarize them very briefly; statements
and proofs can be found in the extended version [BMS09]. Therein, we consider the
X –armed bandit problem with bounded payoffs of, e.g., [Kle04, BMSS09] and (re-
)define the notions of cumulative and simple regrets. The topological set X is a large
possibly non-parametric space but the associated mean-payoff function is continuous.
We show that, without any assumption on X , there exists a strategy with cumulative re-
gret ERn = o(n) if and only if there exist an allocation and a recommendation strategy
with simple regret Ern = o(1). We then use this equivalence to characterize the metric
spaces X in which the cumulative regret ERn can always be made o(n): they are given
by the separable spaces. Thus, here, in addition to its natural interpretation, the simple
regret appears as a tool for proving results on the cumulative regret.

References
[ACBF02] Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed
bandit problem. Machine Learning Journal 47, 235–256 (2002)
[ACBFS02] Auer, P., Cesa-Bianchi, N., Freund, Y., Schapire, R.: The non-stochastic multi-
armed bandit problem. SIAM Journal on Computing 32(1), 48–77 (2002)
[BMS09] Bubeck, S., Munos, R., Stoltz, G.: Pure exploration for multi-armed
bandit problems. Technical report, HAL report hal-00257454 (2009),
http://hal.archives-ouvertes.fr/hal-00257454/en
[BMSS09] Bubeck, S., Munos, R., Stoltz, G., Szepesvari, C.: Online optimization in X –
armed bandits. In: Advances in Neural Information Processing Systems, vol. 21
(2009)
[CM07] Coquelin, P.-A., Munos, R.: Bandit algorithms for tree search. In: Proceedings
of the 23rd Conference on Uncertainty in Artificial Intelligence (2007)
[EDMM02] Even-Dar, E., Mannor, S., Mansour, Y.: PAC bounds for multi-armed bandit
and Markov decision processes. In: Kivinen, J., Sloan, R.H. (eds.) COLT 2002.
LNCS (LNAI), vol. 2375, pp. 255–270. Springer, Heidelberg (2002)
[GWMT06] Gelly, S., Wang, Y., Munos, R., Teytaud, O.: Modification of UCT with patterns
in Monte-Carlo go. Technical Report RR-6062, INRIA (2006)
[Kle04] Kleinberg, R.: Nearly tight bounds for the continuum-armed bandit problem. In:
18th Advances in Neural Information Processing Systems (2004)
[KS06] Kocsis, L., Szepesvari, C.: Bandit based Monte-carlo planning. In: Fürnkranz,
J., Scheffer, T., Spiliopoulou, M. (eds.) ECML 2006. LNCS (LNAI), vol. 4212,
pp. 282–293. Springer, Heidelberg (2006)
[LR85] Lai, T.L., Robbins, H.: Asymptotically efficient adaptive allocation rules. Ad-
vances in Applied Mathematics 6, 4–22 (1985)
[MLG04] Madani, O., Lizotte, D., Greiner, R.: The budgeted multi-armed bandit prob-
lem. In: Proceedings of the 17th Annual Conference on Computational Learning
Theory, pp. 643–645 (2004); Open problems session
[MT04] Mannor, S., Tsitsiklis, J.N.: The sample complexity of exploration in the multi-
armed bandit problem. Journal of Machine Learning Research 5, 623–648
(2004)
[Rob52] Robbins, H.: Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society 58, 527–535 (1952)
[Sch06] Schlag, K.: Eleven tests needed for a recommendation. Technical Report
ECO2006/2, European University Institute (2006)
The Follow Perturbed Leader Algorithm
Protected from Unbounded One-Step Losses

Vladimir V. V’yugin

Institute for Information Transmission Problems, Russian Academy of Sciences, Bol’shoi Karetnyi per. 19, Moscow GSP-4, 127994, Russia
vyugin@iitp.ru

Abstract. In this paper the sequential prediction problem with expert advice is considered for the case when the losses of the experts suffered at each step can be unbounded. We present a modification of the Kalai and Vempala algorithm of following the perturbed leader where the weights depend on past losses of the experts. New notions of a volume and a scaled fluctuation of a game are introduced. We present an algorithm protected from unrestrictedly large one-step losses. This algorithm has the optimal performance in the case when the scaled fluctuations of the one-step losses of the experts of the pool tend to zero.
1 Introduction

Expert algorithms are used for online prediction, repeated decision making, and repeated game playing. Starting with the Weighted Majority Algorithm (WM) of Littlestone and Warmuth [6] and Vovk’s [11] Aggregating Algorithm, the theory of Prediction with Expert Advice has developed rapidly in recent times. Most authors have concentrated on predicting binary sequences and have used specific loss functions, like the absolute loss and the square and logarithmic losses. Arbitrary losses are less common. A survey can be found in the book of Lugosi and Cesa-Bianchi [7].
In this paper, we consider a different general approach, the “Follow the Perturbed Leader” (FPL) algorithm, now called Hannan’s algorithm [3], [5], [7]. Under this approach we only choose the decision that has fared best in the past (the leader). In order to cope with the adversary, some randomization is implemented by adding a perturbation to the total loss prior to selecting the leader. The goal of the learner’s algorithm is to perform almost as well as the best expert in hindsight in the long run. The resulting FPL algorithm has the same performance guarantees as WM-type algorithms for fixed learning rate and bounded one-step losses, save for a factor of √2.
Prediction with Expert Advice considered in this paper proceeds as follows. We are asked to perform sequential actions at times t = 1, 2, . . . , T . At each time step t, experts i = 1, . . . , N receive the results of their actions in the form of their losses sit , non-negative real numbers.
At the beginning of step t, Learner, observing the cumulative losses $s^i_{1:t-1} = s^i_1 + \cdots + s^i_{t-1}$ of all experts i = 1, . . . , N , makes a decision to follow one of these
experts, say Expert i. At the end of step t Learner receives the same loss sit as
Expert i at step t and suffers Learner’s cumulative loss s1:t = s1:t−1 + sit .
In the traditional framework, we suppose that one-step losses of all experts
are bounded, for example, 0 ≤ sit ≤ 1 for all i and t.
A well-known simple example of a game with two experts shows that Learner can perform much worse than each expert: let the current losses of the two experts on steps t = 0, 1, . . . , 6 be $s^1_{0,1,\ldots,6} = (\frac{1}{2}, 0, 1, 0, 1, 0, 1)$ and $s^2_{0,1,\ldots,6} = (0, 1, 0, 1, 0, 1, 0)$. Evidently, the “Follow the Leader” algorithm always chooses the wrong prediction.
When the experts’ one-step losses are bounded, this problem has been solved using randomization of the experts’ cumulative losses. The method of following the perturbed leader was discovered by Hannan [3]. Kalai and Vempala [5] rediscovered this method and published a simple proof of the main result of Hannan. They called an algorithm of this type FPL (Following the Perturbed Leader).
The FPL algorithm outputs the prediction of an expert i which minimizes
$$s^i_{1:t-1} - \frac{1}{\varepsilon}\,\xi^i_t,$$
where $\xi^i_t$, i = 1, . . . , N , t = 1, 2, . . ., is a sequence of i.i.d. random variables distributed according to the exponential distribution with the density p(x) = exp{−x}, and ε is a learning rate.
Kalai and Vempala [5] show that the expected cumulative loss of the FPL algorithm has the upper bound
$$E(s_{1:t}) \le (1+\varepsilon)\min_{i=1,\ldots,N} s^i_{1:t} + \frac{\log N}{\varepsilon},$$
where ε, 0 < ε < 1, is the learning rate and N is the number of experts.
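
As a quick illustration (ours, not the paper’s), the FPL rule above takes only a few lines of Python; run on the two-expert game described earlier, following the unperturbed leader always picks the wrong expert, while the perturbed version hedges between the two. The names below are hypothetical.

```python
import random

def fpl_choose(cum_losses, eps):
    """Follow the Perturbed Leader: pick an expert minimizing
    s^i_{1:t-1} - (1/eps) * xi^i, with xi^i i.i.d. Exp(1)."""
    return min(range(len(cum_losses)),
               key=lambda i: cum_losses[i] - random.expovariate(1.0) / eps)

# The two-expert game from the text: losses[t] = (s^1_t, s^2_t).
losses = [(0.5, 0), (0, 1), (1, 0), (0, 1), (1, 0), (0, 1), (1, 0)]
cum, total = [0.0, 0.0], 0.0
for step in losses:
    i = fpl_choose(cum, eps=0.5)   # Learner follows the perturbed leader
    total += step[i]
    cum[0] += step[0]; cum[1] += step[1]
print("Learner:", total, " best expert:", min(cum))
```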
Hutter and Poland [4] presented further developments of the FPL algorithm for a countable class of experts, arbitrary weights, and adaptive learning rates. Also, the FPL algorithm is usually considered for bounded one-step losses: 0 ≤ sit ≤ 1 for all i and t.
Most papers on prediction with expert advice either consider bounded losses
or assume the existence of a specific loss function (see [7]). We allow losses at
any step to be unbounded. The notion of a specific loss function is not used.
The setting allowing unbounded one-step losses does not have wide coverage in the literature; we can only refer the reader to [1], [2], [9].
Poland and Hutter [9] have studied the games where the one-step losses of all experts at each step t are bounded from above by an increasing sequence Bt given in advance. They presented a learning algorithm which is asymptotically consistent for Bt = t1/16 .
Allenberg et al. [2] have considered polynomially bounded one-step losses for a modified version of the Littlestone and Warmuth algorithm [6] under partial monitoring. In the full-information case, their algorithm has the expected regret $2\sqrt{N\ln N}\,(T+1)^{\frac{1}{2}(1+a+\beta)}$ in the case where the one-step losses of all experts i = 1, 2, . . . , N at each step t have the bound (sit )2 ≤ ta , where a > 0, and β > 0 is
a parameter of the algorithm.1 They have proved that this algorithm is Hannan
consistent if
$$\frac{1}{T}\sum_{t=1}^{T}\max_{1\le i\le N}(s^i_t)^2 < cT^a$$
for all T , where c > 0 and 0 < a < 1.
In this paper, we also consider the case where the loss grows “faster than polynomial, but slower than exponential”.
We present a modification of the Kalai and Vempala [5] algorithm of following the perturbed leader (FPL) for the case of unrestrictedly large one-step expert losses sit not bounded in advance. This algorithm uses adaptive weights depending on the past cumulative losses of the experts.
We analyze the asymptotic consistency of our algorithms using a nonstandard scaling. We introduce the new notions of the volume of a game, $v_t = \sum_{j=1}^{t}\max_i s^i_j$, and the scaled fluctuation of the game, fluc(t) = Δvt /vt , where Δvt = vt − vt−1 .
We show in Theorem 1 that the algorithm of following the perturbed leader with adaptive weights constructed in Section 2 is asymptotically consistent in the mean in the case when vt → ∞ and Δvt = o(vt ) as t → ∞, with a computable bound. Specifically, if fluc(t) ≤ γ(t) for all t, where γ(t) is a computable function such that γ(t) = o(1) as t → ∞, our algorithm has the expected regret
$$2\sqrt{(e^2-1)(1+\ln N)}\;\sum_{t=1}^{T}(\gamma(t))^{1/2}\,\Delta v_t,$$
where e = 2.72 . . . is the base of the natural logarithm.
In particular, this algorithm is asymptotically consistent (in the mean) in a modified sense:
$$\limsup_{T\to\infty}\frac{1}{v_T}\,E\left(s_{1:T} - \min_{i=1,\ldots,N} s^i_{1:T}\right) \le 0, \qquad(1)$$
where s1:T is the total loss of our algorithm on steps 1, 2, . . . , T , and E(s1:T ) is its expectation.
Proposition 1 of Section 2 shows that if the condition Δvt = o(vt ) is violated, the cumulative loss of any probabilistic prediction algorithm can be much larger than the loss of the best expert of the pool.
In Section 2 we present some sufficient conditions under which our learning algorithm is Hannan consistent.2
In a particular case, Corollary 1 of Theorem 1 says that our algorithm is asymptotically consistent (in the modified sense) in the case when the one-step losses of all experts at each step t are bounded by ta , where a is a positive real number. We prove this result under the extra assumption that the volume of the game grows slowly: $\liminf_{t\to\infty} v_t/t^{a+\delta} > 0$, where δ > 0 is arbitrary. Corollary 1 shows that our algorithm is also Hannan consistent when δ > 1/2.
1 Allenberg et al. [2] considered losses −∞ < sit < ∞.
2 This means that (1) holds with probability 1, where E is omitted.
At the end of Section 2 we consider some applications of our algorithm for the case of standard time-scaling.

2 The Follow Perturbed Leader Algorithm with Adaptive Weights
We consider a game of prediction with expert advice with unbounded one-step
losses. At each step t of the game, all N experts receive one-step losses sit ∈
[0, +∞), i = 1, . . . N , and the cumulative loss of the ith expert after step t is
equal to
si1:t = si1:t−1 + sit .
A probabilistic learning algorithm of choosing an expert outputs at any step
t the probabilities P {It = i} of following the ith expert given the cumulative
losses si1:t−1 of the experts i = 1, . . . N in hindsight.
Probabilistic algorithm of choosing an expert
FOR t = 1, . . . , T
  Given the past cumulative losses of the experts si1:t−1 , i = 1, . . . , N , choose an expert i with probability P {It = i}.
  Receive the one-step loss sit of Expert i at step t and suffer the one-step loss st = sit of the master algorithm.
ENDFOR
The performance of this probabilistic algorithm is measured by its expected regret
$$E\left(s_{1:T} - \min_{i=1,\ldots,N} s^i_{1:T}\right),$$
where the random variable s1:T is the cumulative loss of the master algorithm, si1:T , i = 1, . . . , N , are the cumulative losses of the expert algorithms, and E is the mathematical expectation (with respect to the probability distribution generated by the probabilities P {It = i}, i = 1, . . . , N , on the first T steps of the game).3
In the case of bounded one-step expert losses, sit ∈ [0, 1], and a convex loss function, the well-known learning algorithms have expected regret $O(\sqrt{T\log N})$ (see Lugosi and Cesa-Bianchi [7]).
A probabilistic algorithm is called asymptotically consistent in the mean if

    lim sup_{T→∞} (1/T) E(s_{1:T} − min_{i=1,...,N} s^i_{1:T}) ≤ 0.    (2)
A probabilistic learning algorithm is called Hannan consistent if

    lim sup_{T→∞} (1/T) (s_{1:T} − min_{i=1,...,N} s^i_{1:T}) ≤ 0    (3)

almost surely, where s_{1:T} is its random cumulative loss.

³ For simplicity, we suppose that the experts are oblivious, i.e., they cannot use in their
work the random actions of the learning algorithm. The inequality (12) and the limit
(13) of Theorem 1 below can be easily reformulated and proved for non-oblivious
experts.
In this section we study the asymptotic consistency of probabilistic learning
algorithms in the case of unbounded one-step losses.

Notice that when 0 ≤ s^i_t ≤ 1, all expert algorithms have total loss ≤ T on the first
T steps. This is not true for the unbounded case, and there is no reason to
divide the expected regret (2) by T. We change the standard time scaling of (2)
and (3) to a new scaling based on a new notion of volume of a game. We modify
the definition (2) of the normalized expected regret as follows. Define the volume
of a game at step t:

    v_t = Σ_{j=1}^t max_i s^i_j.

Evidently, v_{t−1} ≤ v_t and max_i s^i_{1:t} ≤ v_t ≤ N max_i s^i_{1:t} for all t.
A probabilistic learning algorithm is called asymptotically consistent in the
mean (in the modified sense) in a game with N experts if

    lim sup_{T→∞} (1/v_T) E(s_{1:T} − min_{i=1,...,N} s^i_{1:T}) ≤ 0.    (4)

A probabilistic algorithm is called Hannan consistent (in the modified sense) if

    lim sup_{T→∞} (1/v_T) (s_{1:T} − min_{i=1,...,N} s^i_{1:T}) ≤ 0    (5)

almost surely.
Notice that the notions of asymptotic consistency in the mean and Hannan
consistency may be non-equivalent for unbounded one-step losses.

A game is called non-degenerate if v_t → ∞ (or, equivalently, max_i s^i_{1:t} → ∞)
as t → ∞.
Denote Δv_t = v_t − v_{t−1}. The number

    fluc(t) = Δv_t/v_t = (max_i s^i_t)/v_t    (6)

is called the scaled fluctuation of the game at step t.

By definition 0 ≤ fluc(t) ≤ 1 for all t (put 0/0 = 0).
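
For readers who want to experiment, these quantities are straightforward to compute from a table of one-step losses; the following small Python helper is our own illustration, not part of the paper:

```python
def volume_and_fluc(losses):
    """Given losses[t][i] = s_t^i, return the list of pairs (v_t, fluc(t)),
    where v_t = sum_{j<=t} max_i s_j^i and fluc(t) = (v_t - v_{t-1})/v_t
    (with the convention 0/0 = 0)."""
    v_prev, out = 0.0, []
    for step in losses:
        v = v_prev + max(step)
        out.append((v, (v - v_prev) / v if v > 0 else 0.0))
        v_prev = v
    return out
```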
The following simple proposition shows that each probabilistic learning
algorithm fails to be asymptotically optimal in some game for which fluc(t) does
not tend to 0 as t → ∞. For simplicity, we consider the case of two experts.
Proposition 1. For any probabilistic algorithm of choosing an expert and for
any ε such that 0 < ε < 1, there exist two experts such that v_t → ∞ as t → ∞,
fluc(t) ≥ 1 − ε, and

    (1/v_t) E(s_{1:t} − min_{i=1,2} s^i_{1:t}) ≥ (1/2)(1 − ε)

for all t.
Proof. Given a probabilistic algorithm of choosing an expert and ε such that
0 < ε < 1, define recursively the one-step losses s^1_t and s^2_t of expert 1 and expert 2
at any step t = 1, 2, ... as follows. By s^1_{1:t} and s^2_{1:t} denote the cumulative losses
of these experts incurred at steps ≤ t, and let v_t be the corresponding volume,
where t = 1, 2, ....

Define v_0 = 1 and M_t = 4v_{t−1}/ε for all t ≥ 1. For t ≥ 1, define s^1_t = 0 and
s^2_t = M_t if P{I_t = 1} ≤ 1/2, and define s^1_t = M_t and s^2_t = 0 otherwise.

Let s_t be the one-step loss of the master algorithm and s_{1:t} be its cumulative loss
at step t ≥ 1. We have

    E(s_{1:t}) ≥ E(s_t) = s^1_t P{I_t = 1} + s^2_t P{I_t = 2} ≥ (1/2) M_t

for all t ≥ 1. Also, since v_t = v_{t−1} + M_t = (1 + 4/ε)v_{t−1} and min_i s^i_{1:t} ≤ v_{t−1}, the
normalized expected regret of the master algorithm is bounded from below:

    (1/v_t) E(s_{1:t} − min_i s^i_{1:t}) ≥ (2/ε − 1)/(1 + 4/ε) ≥ (1/2)(1 − ε)

for all t. By definition,

    fluc(t) = M_t/(v_{t−1} + M_t) = 1/(1 + ε/4) ≥ 1 − ε

for all t. □
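
The construction in this proof is easy to simulate. The sketch below (Python; the interface master_prob1 is our own hypothetical convention, not from the paper) plays the adversarial two-expert game against an arbitrary master algorithm and checks the lower bound:

```python
def adversarial_game(master_prob1, eps, T):
    """Play the two-expert construction from the proof of Proposition 1.

    master_prob1 maps the list of past expert loss pairs to P{I_t = 1};
    the expert that the master is more likely to follow receives the
    one-step loss M_t = 4 v_{t-1} / eps."""
    s1 = s2 = 0.0         # cumulative losses of experts 1 and 2
    v = 1.0               # volume, with v_0 = 1 as in the proof
    exp_master = 0.0      # expected cumulative loss E(s_{1:t}) of the master
    history = []
    for _ in range(T):
        M = 4.0 * v / eps
        p1 = master_prob1(history)
        l1, l2 = (0.0, M) if p1 <= 0.5 else (M, 0.0)
        exp_master += p1 * l1 + (1.0 - p1) * l2   # >= M_t / 2
        s1, s2 = s1 + l1, s2 + l2
        v += M            # v_t = (1 + 4/eps) v_{t-1}, so fluc(t) >= 1 - eps
        history.append((l1, l2))
    return (exp_master - min(s1, s2)) / v          # normalized expected regret

# Example: a follow-the-leader master; the bound of Proposition 1 holds.
ftl = lambda h: 1.0 if sum(a for a, _ in h) <= sum(b for _, b in h) else 0.0
print(adversarial_game(ftl, eps=0.1, T=20) >= 0.5 * (1 - 0.1))   # True
```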
Let γ(t) be a computable non-increasing real function such that 0 < γ(t) < 1
for all t and γ(t) → 0 as t → ∞; for example, γ(t) = 1/t^δ, where δ > 0. Define

    α_t = (1/2)(1 − ln((1 + ln N)/(e² − 1))/ln γ(t))    (7)

and

    μ_t = (γ(t))^{α_t} = √((e² − 1)/(1 + ln N)) (γ(t))^{1/2}    (8)

for all t, where e = 2.72... is the base of the natural logarithm.⁴

Without loss of generality we suppose that γ(t) < min{1, (e² − 1)/(1 + ln N)}
for all t. Then 0 < α_t < 1 for all t.
We consider an FPL algorithm with the variable learning rate

    ε_t = 1/(μ_t v_{t−1}),    (9)

where μ_t is defined by (8) and the volume v_{t−1} depends on the experts' actions
on steps < t. By definition v_t ≥ v_{t−1} and μ_t ≤ μ_{t−1} for t = 1, 2, .... Also, by
definition μ_t → 0 as t → ∞.
⁴ The choice of the optimal value of α_t will be explained later. It will be obtained by
minimization of the corresponding member of the sum (44).
Let ξ^1_t, ..., ξ^N_t, t = 1, 2, ..., be a sequence of i.i.d. random variables distributed
according to the density p(x) = exp{−x}, x ≥ 0. In what follows we omit the lower
index t. We suppose without loss of generality that s^i_0 = v_0 = 0 for all i and
ε_0 = ∞. The FPL algorithm is defined as follows:
FPL algorithm PROT

FOR t = 1, ..., T
  Choose an expert with the minimal perturbed cumulated loss on steps < t:

      I_t = argmin_{i=1,2,...,N} {s^i_{1:t−1} − (1/ε_t) ξ^i}.    (10)

  Receive the one-step losses s^i_t for experts i = 1, ..., N, and receive the one-step
  loss s^{I_t}_t of the master algorithm.
ENDFOR

Let s_{1:T} = Σ_{t=1}^T s^{I_t}_t be the cumulative loss of the FPL algorithm on steps ≤ T.
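
A direct implementation of PROT is short; the following Python sketch (our own, not from the paper; the loss feed and the choice of γ are the caller's assumptions) follows (8)–(10) literally:

```python
import math
import random

def prot(losses, gamma, N):
    """FPL algorithm PROT with the adaptive learning rate
    eps_t = 1/(mu_t v_{t-1}) of (9), where mu_t is the closed form (8).
    losses[t-1][i] is s_t^i, and gamma(t) must upper-bound fluc(t).
    Returns the realized cumulative loss s_{1:T} of the master."""
    mu_const = math.sqrt((math.e ** 2 - 1) / (1 + math.log(N)))
    cum = [0.0] * N           # s^i_{1:t-1}
    v_prev = 0.0              # v_{t-1}; v_0 = 0 gives eps_1 = infinity
    total = 0.0
    for t, step in enumerate(losses, start=1):
        mu_t = mu_const * math.sqrt(gamma(t))             # (8)
        inv_eps = mu_t * v_prev                           # 1/eps_t
        xi = [random.expovariate(1.0) for _ in range(N)]  # density exp{-x}
        i_t = min(range(N), key=lambda i: cum[i] - inv_eps * xi[i])  # (10)
        total += step[i_t]
        cum = [c + s for c, s in zip(cum, step)]
        v_prev += max(step)   # v_t = v_{t-1} + max_i s_t^i
    return total
```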
The following theorem shows that if the game is non-degenerate and Δv_t =
o(v_t) as t → ∞ with a computable bound, then the FPL algorithm with variable
learning rate (9) is asymptotically consistent.
Theorem 1. Let γ(t) be a computable non-increasing positive real function such
that γ(t) → 0 as t → ∞. Let also the game be non-degenerate and such that

    fluc(t) ≤ γ(t)    (11)

for all t. Then the expected cumulated loss of the FPL algorithm PROT with
variable learning rate (9) is bounded, for all T, by

    E(s_{1:T}) ≤ min_i s^i_{1:T} + 2√((e² − 1)(1 + ln N)) Σ_{t=1}^T (γ(t))^{1/2} Δv_t.    (12)
Also, the algorithm PROT is asymptotically consistent in the mean:

    lim sup_{T→∞} (1/v_T) E(s_{1:T} − min_{i=1,...,N} s^i_{1:T}) ≤ 0.    (13)

The algorithm PROT is Hannan consistent, i.e.,

    lim sup_{T→∞} (1/v_T) (s_{1:T} − min_{i=1,...,N} s^i_{1:T}) ≤ 0    (14)

almost surely, if

    Σ_{t=1}^∞ (γ(t))² < ∞.    (15)
Proof. In the proof of this theorem we follow the proof scheme of [4] and [5].
Let α_t be the sequence of real numbers defined by (7); recall that 0 < α_t < 1
for all t. The analysis of the optimality of the FPL algorithm is based on an
intermediate predictor IFPL (Infeasible FPL) with the learning rate ε'_t defined
by (16).
IFPL algorithm

FOR t = 1, ..., T
  Define the learning rate

      ε'_t = 1/(μ_t v_t),  where μ_t = (γ(t))^{α_t},    (16)

  v_t is the volume of the game at step t and α_t is defined by (7).
  Choose an expert with the minimal perturbed cumulated loss on steps ≤ t:

      J_t = argmin_{i=1,2,...,N} {s^i_{1:t} − (1/ε'_t) ξ^i}.

  Receive the one-step loss s^{J_t}_t of the IFPL algorithm.
ENDFOR
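
For comparison with (10), the IFPL selection rule can be written in the same style as the sketch above (ours; infeasible in real time because it already uses s^i_{1:t} and v_t):

```python
def ifpl_choice(cum_t, v_t, mu_t, xi):
    """J_t = argmin_i { s^i_{1:t} - (1/eps'_t) xi^i } with 1/eps'_t = mu_t v_t,
    cf. (16); here cum_t[i] = s^i_{1:t}."""
    inv_eps = mu_t * v_t
    return min(range(len(cum_t)), key=lambda i: cum_t[i] - inv_eps * xi[i])
```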
The IFPL algorithm predicts under the knowledge of s^i_{1:t}, i = 1, ..., N (and
v_t), which may not be available at the beginning of step t. Using the unknown
value of ε'_t is the main distinctive feature of our version of IFPL.

For any t, we have I_t = argmin_i {s^i_{1:t−1} − (1/ε_t) ξ^i} and J_t = argmin_i {s^i_{1:t} − (1/ε'_t) ξ^i} =
argmin_i {s^i_{1:t−1} + s^i_t − (1/ε'_t) ξ^i}.
The expected one-step and cumulated losses of the FPL and IFPL algorithms
at steps t and T are denoted by

    l_t = E(s^{I_t}_t) and r_t = E(s^{J_t}_t),
    l_{1:T} = Σ_{t=1}^T l_t and r_{1:T} = Σ_{t=1}^T r_t,

respectively, where s^{I_t}_t is the one-step loss of the FPL algorithm at step t,
s^{J_t}_t is the one-step loss of the IFPL algorithm, and E denotes the mathematical
expectation.
Lemma 1. The cumulated expected losses of the FPL and IFPL algorithms with
learning rates defined by (9) and (16) satisfy the inequality

    l_{1:T} ≤ r_{1:T} + (e² − 1) Σ_{t=1}^T (γ(t))^{1−α_t} Δv_t    (17)

for all T, where α_t is defined by (7).
Proof. Let c_1, ..., c_N be nonnegative real numbers and

    m_j = min_{i≠j} {s^i_{1:t−1} − (1/ε_t) c_i},
    m'_j = min_{i≠j} {s^i_{1:t} − (1/ε'_t) c_i} = min_{i≠j} {s^i_{1:t−1} + s^i_t − (1/ε'_t) c_i}.
Let m_j = s^{j_1}_{1:t−1} − (1/ε_t) c_{j_1} and m'_j = s^{j_2}_{1:t} − (1/ε'_t) c_{j_2} = s^{j_2}_{1:t−1} + s^{j_2}_t − (1/ε'_t) c_{j_2}.
By definition, and since j_2 ≠ j, we have

    m_j = s^{j_1}_{1:t−1} − (1/ε_t) c_{j_1} ≤ s^{j_2}_{1:t−1} − (1/ε_t) c_{j_2} ≤ s^{j_2}_{1:t−1} + s^{j_2}_t − (1/ε_t) c_{j_2} =    (18)
    (s^{j_2}_{1:t} − (1/ε'_t) c_{j_2}) + ((1/ε'_t) − (1/ε_t)) c_{j_2} = m'_j + ((1/ε'_t) − (1/ε_t)) c_{j_2}.    (19)
We compare the conditional probabilities P{I_t = j | ξ^i = c_i, i ≠ j} and
P{J_t = j | ξ^i = c_i, i ≠ j}. The following chain of equalities and inequalities is
valid:

    P{I_t = j | ξ^i = c_i, i ≠ j} =
    P{s^j_{1:t−1} − (1/ε_t) ξ^j ≤ m_j | ξ^i = c_i, i ≠ j} =
    P{ξ^j ≥ ε_t (s^j_{1:t−1} − m_j) | ξ^i = c_i, i ≠ j} =
    P{ξ^j ≥ ε'_t (s^j_{1:t−1} − m_j) + (ε_t − ε'_t)(s^j_{1:t−1} − m_j) | ξ^i = c_i, i ≠ j} ≤    (20)
    P{ξ^j ≥ ε'_t (s^j_{1:t−1} − m_j) + (ε_t − ε'_t)(s^j_{1:t−1} − s^{j_2}_{1:t−1} + (1/ε_t) c_{j_2}) | ξ^i = c_i, i ≠ j} =    (21)
    exp{−(ε_t − ε'_t)(s^j_{1:t−1} − s^{j_2}_{1:t−1})} ×    (22)
    P{ξ^j ≥ ε'_t (s^j_{1:t−1} − m_j) + (ε_t − ε'_t)(1/ε_t) c_{j_2} | ξ^i = c_i, i ≠ j} ≤    (23)
    exp{−(ε_t − ε'_t)(s^j_{1:t−1} − s^{j_2}_{1:t−1})} ×
    P{ξ^j ≥ ε'_t (s^j_{1:t} − s^j_t − m'_j − ((1/ε'_t) − (1/ε_t)) c_{j_2}) +    (24)
        (ε_t − ε'_t)(1/ε_t) c_{j_2} | ξ^i = c_i, i ≠ j} =    (25)
    exp{−(ε_t − ε'_t)(s^j_{1:t−1} − s^{j_2}_{1:t−1}) + ε'_t s^j_t} ×    (26)
    P{ξ^j ≥ ε'_t (s^j_{1:t} − m'_j) | ξ^i = c_i, i ≠ j} =
    exp{−(1/(μ_t v_{t−1}) − 1/(μ_t v_t))(s^j_{1:t−1} − s^{j_2}_{1:t−1}) + s^j_t/(μ_t v_t)} ×    (27)
    P{ξ^j > (1/(μ_t v_t))(s^j_{1:t} − m'_j) | ξ^i = c_i, i ≠ j} ≤
    exp{−(Δv_t/(μ_t v_t))(s^j_{1:t−1} − s^{j_2}_{1:t−1})/v_{t−1} + Δv_t/(μ_t v_t)} ×    (28)
    P{ξ^j > (1/(μ_t v_t))(s^j_{1:t} − m'_j) | ξ^i = c_i, i ≠ j} =
    exp{(Δv_t/(μ_t v_t))(1 − (s^j_{1:t−1} − s^{j_2}_{1:t−1})/v_{t−1})} P{J_t = j | ξ^i = c_i, i ≠ j}.    (29)
Here the inequality (20)–(21) follows from (18) and ε_t ≥ ε'_t. We have used twice,
in the change from (21) to (22) and in the change from (25) to (26), the equality
P{ξ > a + b} = e^{−b} P{ξ > a} for any random variable ξ distributed according to
the exponential law. The step from (23) to (24) follows from (19). In the change
from (27) to (28) we have used the equality v_t − v_{t−1} = Δv_t and the inequality
s^j_t ≤ Δv_t for all j and t.
The expression in the exponent of (29) is bounded, since

    −(s^j_{1:t−1} − s^{j_2}_{1:t−1})/v_{t−1} ≤ 1,    (30)

because s^i_{1:t−1}/v_{t−1} ≤ 1 and s^i_{1:t−1} ≥ 0 for all t and i.
Therefore, we obtain

    P{I_t = j | ξ^i = c_i, i ≠ j} ≤
    exp{2Δv_t/(μ_t v_t)} P{J_t = j | ξ^i = c_i, i ≠ j} ≤    (31)
    exp{2(γ(t))^{1−α_t}} P{J_t = j | ξ^i = c_i, i ≠ j}.    (32)

Since the inequality (32) holds for all c_i, it also holds unconditionally:

    P{I_t = j} ≤ exp{2(γ(t))^{1−α_t}} P{J_t = j}    (33)

for all t = 1, 2, ... and j = 1, ..., N.
Using the inequality exp{2x} ≤ 1 + (e² − 1)x for all x such that 0 ≤ x ≤ 1, we
obtain from (33) the upper bound

    l_t = E(s^{I_t}_t) = Σ_{j=1}^N s^j_t P(I_t = j) ≤
    exp{2(γ(t))^{1−α_t}} Σ_{j=1}^N s^j_t P(J_t = j) = exp{2(γ(t))^{1−α_t}} E(s^{J_t}_t) =
    exp{2(γ(t))^{1−α_t}} r_t ≤ (1 + (e² − 1)(γ(t))^{1−α_t}) r_t.    (34)
Since r_t ≤ Δv_t for all t, the inequality (34) implies

    l_{1:T} ≤ r_{1:T} + (e² − 1) Σ_{t=1}^T (γ(t))^{1−α_t} Δv_t

for all T. Lemma 1 is proved. □
The following lemma, which is an analogue of the result from [5], gives a
bound for the IFPL algorithm.

Lemma 2. The expected cumulative loss of the IFPL algorithm with the learning
rate (16) is bounded by

    r_{1:T} ≤ min_i s^i_{1:T} + (1 + ln N) Σ_{t=1}^T (γ(t))^{α_t} Δv_t    (35)

for all T, where α_t is defined by (7).
Proof. The proof is along the lines of the proof from Hutter and Poland [4], with
the exception that now the sequence ε'_t is not monotonic.

Let, in this proof, s_t = (s^1_t, ..., s^N_t) be the vector of one-step losses and s_{1:t} =
(s^1_{1:t}, ..., s^N_{1:t}) be the vector of cumulative losses of the expert algorithms. Also, let
ξ = (ξ^1, ..., ξ^N) be a vector whose coordinates are random variables.

Recall that ε'_t = 1/(μ_t v_t), μ_t ≤ μ_{t−1} for all t, and v_0 = 0, ε'_0 = ∞.
Define s̃_{1:t} = s_{1:t} − (1/ε'_t) ξ for t = 1, 2, ..., and consider the corresponding
vector of one-step losses s̃_t = s_t − ξ ((1/ε'_t) − (1/ε'_{t−1})).
For any vector s denote

    M(s) = argmin_{d∈D} {d · s},

where D = {(1, 0, ..., 0), ..., (0, ..., 0, 1)} is the set of N unit vectors of dimension N
and "·" is the inner product of two vectors.
We first show that

    Σ_{t=1}^T M(s̃_{1:t}) · s̃_t ≤ M(s̃_{1:T}) · s̃_{1:T}.    (36)

For T = 1 this is obvious. For the induction step from T − 1 to T we need to
show that

    M(s̃_{1:T}) · s̃_T ≤ M(s̃_{1:T}) · s̃_{1:T} − M(s̃_{1:T−1}) · s̃_{1:T−1}.

This follows from s̃_{1:T} = s̃_{1:T−1} + s̃_T and

    M(s̃_{1:T}) · s̃_{1:T−1} ≥ M(s̃_{1:T−1}) · s̃_{1:T−1}.
We rewrite (36) as follows:

    Σ_{t=1}^T M(s̃_{1:t}) · s_t ≤ M(s̃_{1:T}) · s̃_{1:T} + Σ_{t=1}^T M(s̃_{1:t}) · ξ ((1/ε'_t) − (1/ε'_{t−1})).    (37)

By the definition of M we have

    M(s̃_{1:T}) · s̃_{1:T} ≤ M(s_{1:T}) · (s_{1:T} − ξ/ε'_T) = min_{d∈D} {d · s_{1:T}} − M(s_{1:T}) · ξ/ε'_T.    (38)

The expectation of the last term in (38) is equal to 1/ε'_T = μ_T v_T.
The second term of (37) can be rewritten as

    Σ_{t=1}^T M(s̃_{1:t}) · ξ ((1/ε'_t) − (1/ε'_{t−1})) = Σ_{t=1}^T (μ_t v_t − μ_{t−1} v_{t−1}) M(s̃_{1:t}) · ξ.    (39)
We will use the following inequality for the mathematical expectation E:

    0 ≤ E(M(s̃_{1:t}) · ξ) ≤ E(M(−ξ) · ξ) = E(max_i ξ^i) ≤ 1 + ln N.    (40)
The proof of this inequality uses ideas of Lemma 1 from [4]. We have, for the
exponentially distributed random variables ξ^i, i = 1, ..., N:

    P{max_i ξ^i ≥ a} = P{∃i (ξ^i ≥ a)} ≤ Σ_{i=1}^N P{ξ^i ≥ a} = N exp{−a}.    (41)

Since for any non-negative random variable η, E(η) = ∫_0^∞ P{η ≥ y} dy, by (41)
we have

    E(max_i ξ^i − ln N) ≤ ∫_0^∞ P{max_i ξ^i − ln N ≥ y} dy ≤ ∫_0^∞ N exp{−y − ln N} dy = 1.

Therefore, E(max_i ξ^i) ≤ 1 + ln N.
By (40) the expectation of (39) has the upper bound

    Σ_{t=1}^T E(M(s̃_{1:t}) · ξ)(μ_t v_t − μ_{t−1} v_{t−1}) ≤ (1 + ln N) Σ_{t=1}^T μ_t Δv_t.

Here we have used the inequality μ_t ≤ μ_{t−1} for all t.
Since E(ξ^i) = 1 for all i, the expectation of the last term in (38) is equal to

    E(M(s_{1:T}) · ξ/ε'_T) = 1/ε'_T = μ_T v_T.    (42)
Combining the bounds (37)–(39) and (42), we obtain

    r_{1:T} = E(Σ_{t=1}^T M(s̃_{1:t}) · s_t) ≤ min_i s^i_{1:T} − μ_T v_T + (1 + ln N) Σ_{t=1}^T μ_t Δv_t
            ≤ min_i s^i_{1:T} + (1 + ln N) Σ_{t=1}^T μ_t Δv_t.    (43)

Lemma 2 is proved. □

We now finish the proof of the theorem.
The inequality (17) of Lemma 1 and the inequality (35) of Lemma 2 imply

    E(s_{1:T}) ≤ min_i s^i_{1:T} + Σ_{t=1}^T ((e² − 1)(γ(t))^{1−α_t} + (1 + ln N)(γ(t))^{α_t}) Δv_t    (44)

for all T. The optimal value (7) of α_t can be easily obtained by minimization of
each member of the sum (44) over α_t. With this choice μ_t equals (8) and (44) is
equivalent to (12).
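
The minimization invoked here is elementary; for completeness, a worked computation (our addition, with the shorthand A = e² − 1, B = 1 + ln N, γ = γ(t)) recovers (7), (8) and the coefficient in (12):

```latex
% Each summand of (44) is f(alpha) * Delta v_t with
f(\alpha) = A\,\gamma^{1-\alpha} + B\,\gamma^{\alpha},
\qquad
f'(\alpha) = \ln\gamma\,\bigl(B\,\gamma^{\alpha} - A\,\gamma^{1-\alpha}\bigr).
% Since 0 < gamma < 1, f'(alpha_t) = 0 iff gamma^{1-2 alpha_t} = B/A, i.e.
\alpha_t
  = \frac{1}{2}\Bigl(1 - \frac{\ln(B/A)}{\ln\gamma}\Bigr)
  = \frac{1}{2}\Bigl(1 - \frac{\ln\frac{1+\ln N}{e^{2}-1}}{\ln\gamma(t)}\Bigr),
% which is (7). At alpha_t the two summands coincide, hence
\mu_t = \gamma(t)^{\alpha_t} = \sqrt{A/B}\,\gamma(t)^{1/2},
\qquad
f(\alpha_t) = 2\sqrt{AB}\,\gamma(t)^{1/2}
            = 2\sqrt{(e^{2}-1)(1+\ln N)}\,\gamma(t)^{1/2},
% recovering (8) and the factor in (12).
```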
We have Σ_{t=1}^T Δv_t = v_T for all T, v_t → ∞ and γ(t) → 0 as t → ∞. Then, by
the Toeplitz lemma [10],

    (1/v_T) · 2√((e² − 1)(1 + ln N)) Σ_{t=1}^T (γ(t))^{1/2} Δv_t → 0

as T → ∞. Therefore, the FPL algorithm PROT is asymptotically consistent in
the mean, i.e., the relation (13) of Theorem 1 is proved.
We use some version of the strong law of large numbers to formulate a
sufficient condition for Hannan consistency of the algorithm PROT.
Lemma 3. Let g(x) be a positive nondecreasing real function such that x/g(x)
and g(x)/x² are non-increasing for x > 0 and g(x) = g(−x) for all x. Let the
assumptions of Theorem 1 hold and

    Σ_{t=1}^∞ g(Δv_t)/g(v_t) < ∞.    (45)

Then the FPL algorithm PROT is Hannan consistent, i.e., (5) holds as T → ∞
almost surely.
Proof. We use Theorem 11 from Petrov [8] (Chapter IX, Section 2), which gives
sufficient conditions for the strong law of large numbers to hold for a sequence
of independent unbounded random variables.

Let a_t be a nondecreasing sequence of real numbers such that a_t → ∞ as
t → ∞, and let X_t be a sequence of independent random variables such that
E(X_t) = 0 for t = 1, 2, .... Let also g(x) satisfy the assumptions of Lemma 3. By
Theorem 11 from Petrov [8] the inequality

    Σ_{t=1}^∞ E(g(X_t))/g(a_t) < ∞    (46)

implies

    (1/a_T) Σ_{t=1}^T X_t → 0    (47)

as T → ∞ almost surely.
Put X_t = s_t − E(s_t), where s_t is the loss of the FPL algorithm PROT at step
t, and a_t = v_t for all t. By definition |X_t| ≤ Δv_t for all t. Then (46) is valid, and
by (47)

    (1/v_T)(s_{1:T} − E(s_{1:T})) = (1/v_T) Σ_{t=1}^T (s_t − E(s_t)) → 0

as T → ∞ almost surely. This limit and the limit (13) imply (14).

By Lemma 3 the algorithm PROT is Hannan consistent, since (15) implies
(45) for g(x) = x². Theorem 1 is proved. □
The authors of [2] and [9] considered polynomially bounded one-step losses. We
consider a specific example of the bound (44) for the polynomial case.
Corollary 1. Assume that s^i_t ≤ t^a for all t and i = 1, ..., N, and

    lim inf_{t→∞} v_t/t^{a+δ} > 0,

where a and δ are positive real numbers. Let also, in the algorithm PROT,
γ(t) = t^{−δ} and μ_t = (γ(t))^{α_t}, where α_t is defined by (7). Then

– (i) the algorithm PROT is asymptotically consistent in the mean for any
  a > 0 and δ > 0;
– (ii) this algorithm is Hannan consistent for any a > 0 and δ > 1/2;
– (iii) the expected loss of this algorithm is bounded by

    E(s_{1:T}) ≤ min_i s^i_{1:T} + 2√((e² − 1)(1 + ln N)) T^{1−δ/2+a}    (48)

as T → ∞.
This corollary follows directly from Theorem 1, where condition (15) of
Theorem 1 holds for δ > 1/2.

If δ = 1, the regret from (48) is asymptotically equivalent to the regret from
Allenberg et al. [2] (see Section 1).
For a = 0 we have the case of a bounded loss function (0 ≤ s^i_t ≤ 1 for all i
and t). The FPL algorithm PROT is asymptotically consistent in the mean if
v_t ≥ β(t) for all t, where β(t) is an arbitrary positive unbounded non-decreasing
computable function (we can take γ(t) = 1/β(t) in this case). This algorithm is
Hannan consistent if (15) holds, i.e., if

    Σ_{t=1}^∞ (β(t))^{−2} < ∞.

For example, this condition is satisfied for β(t) = t^{1/2} ln t, since Σ_t 1/(t ln² t) < ∞
by the integral test.
Theorem 1 is also valid for the standard time scaling, i.e., when v_T = T for
all T, and when the losses of the experts are bounded, i.e., a = 0. Then Δv_t = 1,
one can take γ(t) = fluc(t) = 1/t, and the expected regret has the upper bound

    2√((e² − 1)(1 + ln N)) Σ_{t=1}^T (γ(t))^{1/2} ≤ 4√((e² − 1)(1 + ln N) T),

since Σ_{t=1}^T t^{−1/2} ≤ 2√T. This bound is similar to the bounds from [4] and [5].
Acknowledgments

This research was partially supported by the Russian Foundation for Fundamental
Research, grants 09-07-00180-a and 09-01-00709-a.
References
1. Cesa-Bianchi, N., Mansour, Y., Stoltz, G.: Improved second-order bounds for pre-
diction with expert advice. Machine Learning 66(2-3), 321–352 (2007)
2. Allenberg, C., Auer, P., Györfi, L., Ottucsák, G.: Hannan consistency in on-line
learning in case of unbounded losses under partial monitoring. In: Balcázar, J.L.,
Long, P.M., Stephan, F. (eds.) ALT 2006. LNCS (LNAI), vol. 4264, pp. 229–243.
Springer, Heidelberg (2006)
3. Hannan, J.: Approximation to Bayes risk in repeated plays. In: Dresher, M., Tucker,
A.W., Wolfe, P. (eds.) Contributions to the Theory of Games, vol. 3, pp. 97–139.
Princeton University Press, Princeton (1957)
4. Hutter, M., Poland, J.: Prediction with expert advice by following the perturbed
leader for general weights. In: Ben-David, S., Case, J., Maruoka, A. (eds.) ALT
2004. LNCS (LNAI), vol. 3244, pp. 279–293. Springer, Heidelberg (2004)
5. Kalai, A., Vempala, S.: Efficient algorithms for online decisions. In: Schölkopf, B.,
Warmuth, M.K. (eds.) COLT/Kernel 2003. LNCS (LNAI), vol. 2777, pp. 26–40.
Springer, Heidelberg (2003); Extended version in Journal of Computer and System
Sciences 71, 291–307 (2005)
6. Littlestone, N., Warmuth, M.K.: The weighted majority algorithm. Information
and Computation 108, 212–261 (1994)
7. Lugosi, G., Cesa-Bianchi, N.: Prediction, Learning and Games. Cambridge
University Press, New York (2006)
8. Petrov, V.V.: Sums of independent random variables. Ergebnisse der Mathematik
und ihrer Grenzgebiete, Band 82. Springer, Heidelberg (1975)
9. Poland, J., Hutter, M.: Defensive universal learning with experts. For general
weight. In: Jain, S., Simon, H.U., Tomita, E. (eds.) ALT 2005. LNCS (LNAI),
vol. 3734, pp. 356–370. Springer, Heidelberg (2005)
10. Shiryaev, A.N.: Probability. Springer, Berlin (1980)
11. Vovk, V.G.: Aggregating strategies. In: Fulk, M., Case, J. (eds.) Proceedings of
the 3rd Annual Workshop on Computational Learning Theory, San Mateo, CA,
pp. 371–383. Morgan Kaufmann, San Francisco (1990)
Computable Bayesian Compression for
Uniformly Discretizable Statistical Models

Łukasz Dębowski

Centrum Wiskunde & Informatica, 1098 XG Amsterdam, The Netherlands

Abstract. Supplementing Vovk and V'yugin's 'if' statement, we show
that Bayesian compression provides the best enumerable compression for
parameter-typical data if and only if the parameter is Martin-Löf random
with respect to the prior. The result is derived for uniformly discretizable
statistical models, introduced here. They feature the crucial property
that given a discretized parameter, we can compute how much data is
needed to learn its value with little uncertainty. Exponential families and
certain nonparametric models are shown to be uniformly discretizable.

1 Introduction

Algorithmic information theory inspires an appealing interpretation of Bayesian


inference [1,2,3,4]. Literally, a fixed individual parameter cannot have the prop-
erty of being distributed according to a distribution but, when it is represented
as a sequence of digits, the parameter is almost surely algorithmically random.
Thus, if you believe that a parameter obeys a prior, it may rather mean that
you suppose that the parameter is algorithmically random with respect to the
prior. We want to argue that this interpretation is valid.
We will assume that the parameter θ is, in some sense, effectively identifi-
able. Then one can disprove that a finite prefix of a fixed, not fully known θ
is algorithmically random by estimating the prefix and showing that there ex-
ists a shorter description of that prefix. Hence, Bayesian beliefs seem admissible
scientific hypotheses according to the Popperian philosophy, cf. [1].
Secondly, it follows that the Bayesian measure ∫ P_θ dQ(θ) gives the best enu-
merable compression of P_θ-typical data if and only if parameter θ is algorith-
mically random with respect to prior Q. This statement is useful when Pθ is
not computable for a fixed θ. Moreover, once we know where Bayesian compres-
sion fails, we should systematically adjust the prior to our hypotheses about the
algorithmic complexity of θ in an application.
As we will show, this ‘if and only if ’ result can be foreseen using the chain
rule for prefix Kolmogorov complexity of finite objects [5], [6, Theorem 3.9.1]. The
chain rule allows one to relate randomness deficiencies for finite prefixes of the data
and of the parameter in some specific statistical models, which we call uniformly
discretizable. That yields a somewhat weaker ‘if and only if ’ statement. Subse-
quently, the statement can be strengthened using the dual chain rule for impos-
sibility levels of infinite sequences [1, Theorem 1] and extensions of Lambalgen’s

theorem for conditionally random sequences [7], [4, Theorem 4.2 and 5.3]. The
condition of uniform discretization can be completely removed from the ‘if ’ part
and relaxed to an effective identifiability of the parameter in the ‘only if ’ part.
Namely, given a prefix of the parameter, we must be able to compute how much
data is needed to learn its value with a fixed uncertainty.
The organization of this paper is as follows. In Section 2, we discuss quality of
Bayesian compression for individual parameters and we derive the randomness
deficiency bounds for prefixes of the parameter and the parameter-typical data.
These bounds hold for the newly introduced class of uniformly discretizable sta-
tistical models. In Section 3, we show that exponential families are uniformly
discretizable. The assumptions on the prior and the proof look familiar to statis-
ticians working in minimum description length (MDL) inference [8,9]. An exam-
ple of a ‘nonparametric’ uniformly discretizable model appears in Section 4. In
the final Section 5, we prove that countable mixtures of uniformly discretizable
models are uniformly discretizable if the Bayesian estimator consistently chooses
the right submodel for the data.
The definition of uniformly discretizable models is given below. Condition (3)
says that the parameter may be discretized to m ≥ μ(n) digits for the sake of
approximating the ‘true’ probability of data xn . Condition (4) asserts that the
parameter, discretized to m digits, can be predicted for all but finitely many m
given data xn of length n ≥ ν(m). Functions μ and ν depend on a model.
To fix our notation in advance, we use a countable alphabet X and a finite
Y = {0, 1, ..., D − 1}, D > 1. The logarithm to base D is written as log. An italic
x ∈ X⁺ is a string, a boldface x ∈ X^N is an infinite sequence. The n-th symbol
of x is written as x_n ∈ X and x^n is the prefix of x of length n: x = x_1x_2x_3... and
x^n = x_1x_2...x_n. A capital boldface Y : X^∗ → R denotes a distribution of strings
normalized lengthwise, i.e., 0 ≤ Y(x), Σ_a Y(xa) 1{|a|=n} = Y(x), and Y(λ) = 1
for the empty string λ. There is a unique measure on measurable sets of infinite
sequences x ∈ X^N, also denoted by Y, such that Y({x : x^n = x for n = |x|}) =
Y(x). The quantifier 'n-eventually' means 'for all but finitely many n ∈ N'.

Definition 1. Fix a measurable subset Θ ⊂ Y^N. Let P : X^∗ × Θ ∋ (x, θ) ↦
P_θ(x) ∈ R be a probability kernel, i.e., P_θ : X^∗ → R is a probability measure for
each θ ∈ Θ and the mapping θ ↦ P_θ is measurable. Let also Q : Y^∗ → R be
a probability measure on Θ, i.e., Q(Θ) = 1. A Bayesian statistical model (P, Q)
is called (μ, ν)-uniformly discretizable if it satisfies the following.

(i) Define the measure T : X^∗ × Y^∗ → R as

    T(x, θ) := ∫_{A(θ)} P_{θ'}(x) dQ(θ'),    (1)

where A(θ) := {θ' ∈ Θ : θ is a prefix of θ'}, and denote its other marginal

    Y(x) := T(x, λ) = ∫ P_θ(x) dQ(θ).    (2)

(ii) Function μ : N → R is nondecreasing and we require that for all θ ∈ Θ,
P_θ-almost all x, and m ≥ μ(n),

    lim_{n→∞} log[Q(θ^m) P_θ(x^n)/T(x^n, θ^m)] / log m = 0.    (3)

(iii) Function ν : N → R is nondecreasing and we require that for all θ ∈ Θ,
P_θ-almost all x, and n ≥ ν(m),

    lim_{m→∞} T(x^n, θ^m)/Y(x^n) = 1.    (4)

Remark: A Bayesian model (P̃, Q̃) with a kernel P̃ : X^∗ × Θ̃ → R and a measure
Q̃ on Θ̃ will be called (ρ, μ, ν)-uniformly discretizable if (P, Q) is (μ, ν)-uniformly
discretizable for a bijection ρ : Θ̃ → Θ, P_θ(x) := P̃_{ρ^{−1}(θ)}(x), and Q := Q̃ ∘ ρ^{−1}.
We will write '(ρ, μ(n), ν(m))-uniformly discretizable' when there are no
convenient symbols for the functions μ and ν.
A few words of comment on this definition are due. By condition (3), the support
of the prior Q equals Θ, i.e., Q(θ^m) > 0 for all m and θ ∈ Θ. Condition (4) admits
a consistent estimator if there is a function σ : X^∗ → N, where ν(σ(x^n)) ≤ n,
σ(x^{n+1}) ≥ σ(x^n), and lim_n σ(x^n) = ∞. Define the discrete maximum likelihood
estimator MLE(x; σ) := argmax_{θ∈Y^m} T(x, θ) with m = σ(x). The estimator is
called consistent if MLE(x^n; σ) = θ^{σ(x^n)} n-eventually for all θ ∈ Θ and P_θ-
almost all x. This property is indeed satisfied.
Four models presented in Sections 3 and 4 satisfy a stronger condition.
Definition 2. A (μ, ν)-uniformly discretizable model is called μ-uniformly
discretizable if ν is recursive and μ(ν(m)) ≤ m^α for an α > 0.

These models feature log μ(n) close to the logarithm of the Shannon redundancy
−log Y(x^n) + log P_θ(x^n). A heuristic rationale is as follows. If we had μ ∘ ν = id,
−log Q(θ^m) = Ω(m), and we put n = ν(m), then

    |−log Y(x^n) + log P_θ(x^n) + log Q(θ^m)| = o(log m)

and hence μ(n) = m = O(−log Y(x^n) + log P_θ(x^n)). Whereas −log Q(θ^m) =
Ω(m) is a reasonable assumption, we rather observe μ(ν(m)) > m.
The present approach allows only discrete data. We hope, however, that uni-
formly discretizable models can be generalized to nondiscrete data so that con-
sistency and algorithmic optimality of Bayesian procedures in density estimation
could be characterized in a similar fashion, cf. [10]. Another interesting path of
development is to integrate the algorithmic perspective on Bayesianism with the
present MDL framework [8,9], where normalized maximum likelihood codes are
discussed. By the algorithmic optimality of Bayesian compression, the normal-
ized maximum likelihood measure, if it can be defined properly, should converge
to the Bayesian measure ∫ P_θ dQ(θ) in log-loss. We also suppose that reason-
able luckiness functions, introduced to guarantee existence of modified normal-
ized maximum likelihood codes [9, Section 11.3], may be close to algorithmic
information between the parameter and the data.
2 Bounds for the Data and Parameter Complexity

We will use a universal computer with an oracle, which can compute certain
functions R → R. To make it clear which these are, we adopt the following
definitions, cf. [11], [6, Sections 1.7 and 3.1], [1, Section 2], [4, Section 5]:

(i) A universal computer is an appropriate finite state machine that interacts


with infinite tapes. The machine can move along the tapes in discrete
steps, read and write on them single symbols from the finite set Y, and
announce the end of computation. We fix three one-sided tapes. At the
beginning of computation, tape α contains a program, i.e., a string from
a prefix-free subset of Y+ , and tape β contains an oracle, i.e., an element
of (0Y∗ ) ∪ (1YN ). At the end of computation, tape γ contains an output,
i.e., a string from Y∗ .
(ii) The prefix Kolmogorov complexity K(y) of a string y ∈ Y∗ is the minimal
length of such a program on tape α that y is output on tape γ provided
no symbol is read from tape β.
(iii) The conditional complexity K(y|δ) for y ∈ Y∗ and δ ∈ Y∗ ∪ YN is the
minimal length of such a program on tape α that y is output on tape γ
given 0δ or 1δ, respectively, as an oracle on tape β.
(iv) A function f : Y∗ ∪YN → Y∗ is recursive if there is such a program z ∈ Y+
that string f (y) is output for all oracles y ∈ Y∗ ∪ YN .
(v) Function φ is a prefix code if it is an injection and its image is prefix-free.
(vi) For certain prefix codes φ_W : W → Y^∗ and φ_U : U → Y^∗ ∪ Y^N and
arbitrary w ∈ W and u ∈ U, we put K(w) := K(φ_W(w)) and K(w|u) :=
K(φ_W(w)|φ_U(u)). Fixing φ_{Y^∗} and φ_{Y^∗∪Y^N} as identity functions, f : U → W
is called recursive if so is φ_W ∘ f ∘ φ_U^{−1}.
(vii) Numbers obey special conventions. Integers are Elias-coded [12], whereas
φ_Q(p/q) := φ_Z(p)φ_N(q) for every irreducible fraction p/q. To convert a real
number from (−∞, ∞) into a one-sided sequence, we assume that φ_R(r) =
θ satisfies [1 + exp(−r)]^{−1} = Σ_{i=1}^∞ θ_i D^{−i}. This solves the problem of real
arguments. A real-valued function f : W → R is called enumerable if
there is a recursive function g : W × N → Q nondecreasing in k such that
lim_k g(w, k) = f(w). Under a stronger condition, f is called recursive: there
must be a recursive function h : W × N → Q such that |f(w) − h(w, k)| < 1/k.
(viii) Pairs (w, u) enjoy the code φ_{W×U}(w, u) := φ_W(w)φ_U(u). This code cannot
be used if w is real. In Proposition 2 of Section 3, where we need to
string real vectors, Cantor's code is used instead.
(ix) The concepts mentioned above are analogously extended to partial
functions. Special care must be taken to assume computability of their
domains, which is important to guarantee that the inverse of the Shannon-
Fano-Elias code, used in Theorem 1, is recursive.
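
As a small illustration of item (vii), the following Python sketch (ours, not from the paper) computes the first digits θ_1θ_2... of φ_R(r); the greedy digit extraction is an illustrative choice among the at most two expansions:

```python
import math

def phi_R(r, D=2, n_digits=16):
    """First n_digits base-D digits theta_1, theta_2, ... of phi_R(r),
    defined by [1 + exp(-r)]^{-1} = sum_{i>=1} theta_i D^{-i}."""
    s = 1.0 / (1.0 + math.exp(-r))   # maps (-inf, inf) onto (0, 1)
    digits = []
    for _ in range(n_digits):
        s *= D
        d = min(int(s), D - 1)
        digits.append(d)
        s -= d
    return digits

print(phi_R(0.0))   # 1/2 in binary: [1, 0, 0, ...]
```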

Last but not least, a semimeasure U is a function X^∗ → R that satisfies
0 ≤ U(x), Σ_a U(xa) 1{|a|=n} ≤ U(x), and U(λ) ≤ 1. The symbol <∗ denotes
inequality up to a multiplicative constant.
The impossibility level

    I(x; Y) := inf_{n∈N} D^{−K(x^n)}/Y(x^n)    (5)

is a natural measure of randomness deficiency for a sequence x ∈ X^N with respect
to a recursive measure Y, cf. [1], [6, Def. 4.5.10 and Thm. 4.5.5]. The respective
set of Y-Martin-Löf random sequences

    L_Y := {x : I(x; Y) < ∞}    (6)

has two important properties.


Firstly, LY is the maximal set of sequences on which no enumerable semimea-
sure outperforms a recursive measure Y more than by a multiplicative constant.
Let M be the universal enumerable semimeasure [6, Section 4.5.1]. By [2, The-
orem 1 and Lemma 3], we have
∗ M (xn ) ∗ M (xn ) ∗
I(x; Y ) < lim inf n
< sup n)
< [I(x; Y )]1+ (7)
n→∞ Y (x ) n∈N Y (x

for a fixed > 0 and recursive Y . By the definition of M , U (xn ) < M (xn ) for
any enumerable (semi)measure U . Hence supn∈N U (xn )/Y (xn ) < ∞ if x ∈ LY .
Moreover, LY = LU if Y and U are mutually equivalent recursive measures, i.e.,
supn∈N U (xn )/Y (xn ) < ∞ ⇐⇒ supn∈N Y (xn )/U (xn ) < ∞ for all x ∈ XN .
Secondly, the set LY has full measure Y . The fact is well-known, cf. e.g. [1,
Remark 2], and it can be seen easily using the auxiliary statement below, which
strengthens Barron’s result [13, Theorem 3.1]. Whereas Y (LY ) = 1 follows for
|B(·)| = K(·), we shall use this lemma later also for |B(·)| = K(·|θ).
Lemma 1 (no hypercompression). Let B : X^∗ → Y^+ be a prefix code. Then

    |B(x^n)| + log Y(x^n) > 0    (8)

n-eventually for Y-almost all sequences x.

Proof. Consider the function W(x) := D^{−|B(x)|}. By the Markov inequality,

    Y((8) is false) = Y(W(x^n)/Y(x^n) ≥ 1) ≤ E_{x∼Y}[W(x^n)/Y(x^n)] = Σ_x 1{|x|=n} W(x).

Hence Σ_n Y((8) is false) ≤ Σ_x D^{−|B(x)|} ≤ 1 < ∞ by the Kraft inequality. The
claim now follows from the Borel-Cantelli lemma. □

Now let Y be a recursive Bayesian measure (2). In a prototypical case,
measures P_θ are not enumerable Q-almost surely. But the data that are almost
surely typical for these measures can be optimally compressed with the effectively
computable measure Y. That is, P_θ(L_Y) = 1 holds Q-almost everywhere,
as implied by the following statement.
Lemma 2 (cf. [14, Section 9]). For a measurable set X of infinite sequences,
equality Y(X) = 1 for Y = ∫ P_θ dQ(θ) implies P_θ(X) = 1 for Q-almost all θ.

Proof. Let G_n := {θ ∈ Θ : P_θ(X) ≥ 1 − 1/n}. We have 1 = Y(X) ≤ Q(G_n) +
Q(Θ \ G_n)(1 − 1/n) = 1 − n^{−1} Q(Θ \ G_n). Thus Q(G_n) = 1. By σ-additivity,
Q(G) = inf_n Q(G_n) = 1 follows for G := {θ ∈ Θ : P_θ(X) = 1} = ∩_n G_n. □

Notably, the Bayesian compressor can be shown optimal exactly when the pa-
rameter is incompressible. Strictly speaking, we will obtain Pθ (LY ) = 1 if and
only if θ is Martin-Löf random with respect to Q. This holds, of course, under
some tacit assumptions. For instance, if we take Pθ ≡ Y then Pθ (LY ) = 1 for all
θ ∈ Θ. We may thus suppose that the ‘if and only if ’ statement holds provided
the parameter can be effectively identified. The following two propositions form
the first step to see what assumptions are needed exactly.
Lemma 3. For a computer-dependent constant A, we have

    K(x|θ) ≤ A + K(x|θ^m, K(θ^m)) + K(K(θ^m)) + K(m).    (9)

Proof. A certain program for computing x given θ operates as follows. It first
calls a subroutine of length K(m) to compute m and a subroutine of length
K(K(θ^m)) to compute K(θ^m). Then it reads the prefix θ^m of θ and passes θ^m
and K(θ^m) to a subroutine of length K(x|θ^m, K(θ^m)) which returns x. □

Theorem 1. Let (P, Q) be a Bayesian statistical model with a recursive prior
Q : Y^∗ → R and a recursive kernel P : X^∗ × Θ → R.

(i) If (3) holds for P_θ-almost all x then

    K(x^n) + log Y(x^n) ≥ K(θ^m) + log Q(θ^m) − 3 log m + o(log m)    (10)

is also true for P_θ-almost all x.

(ii) If (4) holds for a recursive τ : Y^∗ → N and n = τ(θ^m) then

    K(x^n) + log Y(x^n) ≤ K(θ^m) + log Q(θ^m) + O(1).    (11)

Proof. (i) For P_θ-almost all x we have both (3) and

    K(x^n|θ) + log P_θ(x^n) ≥ 0    (12)

n-eventually, by Lemma 1 for |B(·)| = K(·|θ). Applying Lemma 3 to these
sequences yields

    K(x^n|θ^m, K(θ^m)) + log T(x^n, θ^m) − log Q(θ^m)
        ≥ −K(K(θ^m)) − K(m) + o(log m) = −2 log m + o(log m)

because K(θ^m) ≤ m + log m + o(log m) and K(m) ≤ log m + o(log m). Since

    K(x^n|θ^m, K(θ^m)) + K(θ^m) = K(x^n, θ^m) + O(1)    (13)

by the chain rule for prefix complexity [6, Theorem 3.9.1], we obtain

    K(x^n, θ^m) + log T(x^n, θ^m) ≥ K(θ^m) + log Q(θ^m) − 2 log m + o(log m).

In the following, we apply (13) with x^n and θ^m switched, and observe that

    K(θ^m|x^n, K(x^n)) ≤ A + K(m) − log [T(x^n, θ^m)/Y(x^n)]

follows by conditional Shannon-Fano-Elias coding of θ^m of an arbitrary length
given x^n, cf. [15, Section 5.9]. Hence (10) holds for P_θ-almost all x.

(ii) By conditional Shannon-Fano-Elias coding of x^n given θ^m we obtain

    K(x^n, θ^m) ≤ A + K(θ^m) − log [T(x^n, θ^m)/Q(θ^m)].    (14)

(This time, we need not specify the length of x^n separately since it can be
computed from θ^m.) Substituting (4) into (14) and chaining the result with
K(x^n) ≤ A + K(x^n, θ^m) yields (11). □

Theorem 1 applies to uniformly discretizable models if we plug in m ≥ μ(n) and
τ(θ^m) ≥ ν(m). Hence we obtain the first, less elegant dichotomy.

Proposition 1. Let (P, Q) be a μ-uniformly discretizable model with a recursive
prior Q : Y^∗ → R and a recursive kernel P : X^∗ × Θ → R. We have

    P_θ(L_{Y, log μ(n)}) = 1 if θ ∈ L_{Q, log n},  and  P_θ(L_{Y, log μ(n)}) = 0 if θ ∉ L_{Q, log n},    (15)

where the sets of (Y, g(n))-random sequences are defined as

    L_{Y, g(n)} := {x : inf_{n∈N} [K(x^n) + log Y(x^n)]/g(n) > −∞}.    (16)

In particular, L_{Y,1} = L_Y.
Theorem 1(ii) suffices to prove P_θ(L_Y) = 0 for θ ∉ L_Q, but to show P_θ(L_Y) =
1 in the other case we need a stronger statement than Theorem 1(i). Here we can
rely on the chain rule for conditional impossibility levels by Vovk and V'yugin
[1, Theorem 1] and extensions of Lambalgen's theorem for conditionally random
sequences by Takahashi [4]. For a recursive kernel P, let us define by analogy
the conditional impossibility level

    I(x; P|θ) := inf_{n∈N} D^{−K(x^n|θ)}/P_θ(x^n)    (17)

and the set of conditionally random sequences

    L_{P|θ} := {x ∈ X^N : I(x; P|θ) < ∞}.    (18)
We have P_θ(L_{P|θ}) = 1 for all θ by Lemma 1, as used in (12). Adjusting the
proof of [6, Theorem 4.5.5] to computation with an oracle, we can show that the
definition of I(x; P|θ) given here is equivalent to the one given by [1], cf. [6,
Def. 4.5.10]. Hence

    inf_{θ∈Θ} [I(x; P|θ) I(θ; Q)] <∗ I(x; Y) <∗ inf_{θ∈Θ} I(x; P|θ) [I(θ; Q)]^{1+ε}    (19)

holds for Y = ∫ P_θ dQ(θ) and ε > 0 by [1, Corollary 4].
Inequality (19) and Theorem 1(ii) imply the main claim of this article.

Theorem 2. Let (P, Q) be a Bayesian statistical model with a recursive prior
Q : Y^∗ → R and a recursive kernel P : X^∗ × Θ → R. Suppose that (4) holds
for all θ ∈ Θ, P_θ-almost all x, and n = τ(θ^m), where τ : Y^∗ → N is recursive.
Then we have

    P_θ(L_Y) = 1 if θ ∈ L_Q,  and  P_θ(L_Y) = 0 if θ ∉ L_Q.    (20)

The upper part of (20) can be strengthened as the decomposition L_Y = ∪_{θ∈L_Q} L_{P|θ},
which holds for all recursive P and Q [4, Cor. 4.3 & Thm. 5.3]. (Our definition
of a recursive P corresponds to 'uniformly computable' in [4].) We suppose that,
under the assumption of Theorem 2, the sets L_{P|θ} are disjoint for θ ∈ Θ. This would
strengthen the lower part of (20).

3 The Case of Exponential Families

As shown in [16], k-parameter exponential families exhibit the Shannon redundancy
−log Y(x^n) + log P_θ(x^n) = (k/2) log n + Θ(log log n). Here we shall prove that these
models are uniformly discretizable with μ(n) = ⌈(k/2 + ε) log n⌉ and ν(m) =
D^{(2/k+ε)m}, respectively. The result is established under a familiar condition. Namely,
a prior Q̃ on Θ̃ ⊂ R^k is universally lower-bounded by the Lebesgue measure λ if
for each ϑ ∈ Θ̃ there exist an open set C ∋ ϑ and a w > 0 such that Q̃(E) ≥ wλ(E)
for every measurable E ⊂ C. This condition implies that Θ̃ is the support of Q̃ and is
satisfied, in particular, if Q̃ and λ restricted to Θ̃ are mutually equivalent.

Let us write the components of vectors ϑ, ϑ' ∈ R^k as ϑ = (ϑ_1, ϑ_2, ..., ϑ_k) and
their Euclidean distance as |ϑ' − ϑ| := √(Σ_{l=1}^k (ϑ'_l − ϑ_l)²).

Example 1 (an exponential family). Let the kernel P̃ : X^∗ × Θ̃ ∋ (x, ϑ) ↦
P̃_ϑ(x) ∈ R represent a regular k-parameter exponential family. That is:

(i) Certain functions p : X → (0, ∞) and T : X → R^k satisfy Σ_{x∈X} p(x) < ∞
and ∀β ∈ R^k \ {0} ∀c ∈ R ∃x ∈ X: Σ_{l=1}^k β_l T_l(x) ≠ c (i.e., T has affinely
independent components).

(ii) Let Z(β) := Σ_{x∈X} p(x) exp(Σ_{l=1}^k β_l T_l(x)) and define the measures

    P̃_β(x^n) := Π_{i=1}^n p(x_i) exp(Σ_{l=1}^k β_l T_l(x_i) − ln Z(β))

for β ∈ B := {β ∈ R^k : Z(β) < ∞}.

(iii) We require that B is open. (It is not empty since 0 ∈ B.) Under this
condition, ϑ(·) : B ∋ β ↦ ϑ(β) := E_{x∼P̃_β} T(x_i) ∈ R^k is a twice differentiable
injection [17], [9]. Thus assume Θ̃ := ϑ(B) and put P̃_ϑ := P̃_{β(ϑ)} for
β(·) := ϑ^{−1}(·).

Additionally, let the prior Q̃ be universally lower-bounded by the Lebesgue
measure on R^k and let it satisfy Q̃(Θ̃) = 1.
Proposition 2. Use Cantor's code ρ := ρ_s ∘ ρ_n, where ρ_n : Θ̃ → (0,1)^k is
a differentiable injection and ρ_s : (0,1)^k → Y^N satisfies ρ_s(y) = θ_1θ_2θ_3... for any
vector y ∈ (0,1)^k with components y_l = Σ_{i=1}^∞ θ_{(i−1)k+l} D^{−i}. Then the model
(P̃, Q̃) is (ρ, ⌈(k/2 + ε) log n⌉, D^{(2/k+ε)m})-uniformly discretizable for ε > 0.

Proof. Let Θ := ρ(Θ̃), P_θ(x) := P̃_{ρ^{−1}(θ)}(x), Q := Q̃ ∘ ρ^{−1}, and A(θ) :=
{θ' ∈ Θ : θ is a prefix of θ'}. Consider a θ ∈ Θ. Firstly, let m ≥ ⌈(k/2 + ε) log n⌉.
We have (21) for ϑ = ρ^{−1}(θ) and A_n = ρ^{−1}(A(θ^m)). Hence (3) holds by
Theorem 3(i) below. Secondly, let n ≥ D^{(2/k+ε)m}. We have (23) for ϑ = ρ^{−1}(θ)
and B_n = ρ^{−1}(A(θ^m)). Hence (4) follows by Theorem 3(ii). □
The statement below may look more familiar to statisticians.

Theorem 3. Fix a ϑ ∈ Θ̃ for the model specified in Example 1.

(i) If we take sufficiently small measurable sets A_n ⊂ Θ̃ which satisfy

    lim_{n→∞} sup_{ϑ'∈A_n} |ϑ' − ϑ| / √(n^{−1} ln ln n) = 0    (21)

and put P̃_n(x) := ∫_{A_n} P̃_{ϑ'}(x) dQ̃(ϑ') / ∫_{A_n} dQ̃(ϑ'), then

    lim_{n→∞} [log P̃_n(x^n) − log P̃_ϑ(x^n)] / ln ln n = 0    (22)

for P̃_ϑ-almost all x.

(ii) On the other hand, if we take sufficiently large measurable sets

    B_n ⊃ {ϑ' ∈ Θ̃ : |ϑ' − ϑ| ≤ n^{−1/2+α}}    (23)

for an arbitrary α ∈ (0, 1/2), then

    lim_{n→∞} [log ∫ P̃_{ϑ'}(x^n) dQ̃(ϑ') − log ∫_{B_n} P̃_{ϑ'}(x^n) dQ̃(ϑ')] = 0    (24)

for P̃_ϑ-almost all x.

Proof. (i) Function ϑ̂(xn ) := n−1 ni=1 T (xi ) is the maximum likelihood esti-
mator of ϑ, in the usual sense. Thus the Taylor expansion for any ϑ ∈ Θ̃ yields
k
log P̃ϑ̂(xn ) (xn ) − log P̃ϑ (xn ) = n l,m=1 Rlm (ϑ)Slm (ϑ), (25)
1
where Slm (ϑ) := (ϑl − ϑ̂l (xn ))(ϑm − ϑ̂m (xn )) and Rlm (ϑ) := 0 (1 − t)Ilm (tϑ +
(1 − t)ϑ̂(xn ))dt, whereas the observed Fisher information matrix Ilm (ϑ) :=
−n−1 ∂ϑl ∂ϑm log P̃ϑ (xn ) does not depend on n and xn . Consequently,

log P̃ϑ (xn ) − log P̃ϑ (xn ) =



n kl,m=1 [Rlm (ϑ ) [Slm (ϑ) − Slm (ϑ)] + [Rlm (ϑ ) − Rlm (ϑ)] Slm (ϑ)] .

With Cn denote the intersection


 of Θ̃ and the smallest ball containing An and
n  n 
ϑ̂(x ). Let dn := ϑ − ϑ̂(x ) and an := supϑ ∈An |ϑ − ϑ|. Hence we bound
  k  + 
  −
log P̃n (xn ) − log P̃ϑ (xn ) ≤ n l,m=1 |Rlm |an (2dn + an ) + |Rlm
+
− Rlm |d2n ,

+ −
where Rlm := supϑ ∈Cn Rlm (ϑ ) and Rlm := inf ϑ ∈Cn Rlm (ϑ ). By continuity of

Fisher information Ilm (ϑ) as a function of ϑ, Rlm+
and Rlm tend to Ilm (ϑ) for
n → ∞. On the other hand, the law of iterated logarithm

ϑ̂l (xn ) − ϑl
lim sup √ =1 (26)
n→∞ σl 2n−1 ln ln n

is satisfied for P̃ϑ -almost all x with variance σl2 := Varx∼P̃ϑ Tl (xi ) since the
maximum likelihood estimator is unbiased, i.e., E x∼P̃ϑ ϑ̂(xn ) = ϑ. Consequently,
we obtain (22) for (21).
(ii) The proof applies the Laplace approximation as in [18] or in the proof of
Theorem 8.1 of [9, pages 248–251]. First of all, we have

    log ∫ P̃_{ϑ'}(x^n) dQ̃(ϑ') − log ∫_{B_n} P̃_{ϑ'}(x^n) dQ̃(ϑ')
        ≤ [∫_{Θ̃\B_n} P̃_{ϑ'}(x^n) dQ̃(ϑ')] / [∫_{B_n} P̃_{ϑ'}(x^n) dQ̃(ϑ')].

In the following, we consider a sufficiently large n. Because of the law of the
iterated logarithm (26), ϑ̂(x^n) belongs to B_n for P̃_ϑ-almost all x. Hence the
robustness property and the convexity of the Kullback-Leibler divergence for
exponential families [9, Eq. (19.12) and Proposition 19.2] imply a bound for the
numerator:

    ∫_{Θ̃\B_n} P̃_{ϑ'}(x^n) dQ̃(ϑ') ≤ sup_{ϑ'∈Θ̃\B_n} P̃_{ϑ'}(x^n) ≤ sup_{ϑ'∈∂B_n} P̃_{ϑ'}(x^n),

where ∂B_n is the boundary of B_n. Using (25) gives further

    sup_{ϑ'∈∂B_n} P̃_{ϑ'}(x^n) ≤ P̃_{ϑ̂(x^n)}(x^n) exp(−nR⁻δ²)

with R⁻ := inf_{ϑ'∈B_n} [Σ_{l,m=1}^k R_{lm}(ϑ') S_{lm}(ϑ')] / |ϑ' − ϑ̂(x^n)|² and δ := inf_{ϑ'∈∂B_n}
|ϑ' − ϑ̂(x^n)|. Since the prior is universally lower-bounded by the Lebesgue
measure, (25) implies a bound for the denominator:

    ∫_{B_n} P̃_{ϑ'}(x^n) dQ̃(ϑ') ≥ w P̃_{ϑ̂(x^n)}(x^n) ∫_{|t|<δ} exp(−nR⁺|t|²) dt,

where w > 0 and R⁺ := sup_{ϑ'∈B_n} [Σ_{l,m=1}^k R_{lm}(ϑ') S_{lm}(ϑ')] / |ϑ' − ϑ̂(x^n)|². Hence
we obtain an inequality for the ratio:

    [∫_{Θ̃\B_n} P̃_{ϑ'}(x^n) dQ̃(ϑ')] / [∫_{B_n} P̃_{ϑ'}(x^n) dQ̃(ϑ')]
        ≤ [√(nR⁺) exp(−nR⁻δ²/2)] / [w ∫_{|t|<δ√(nR⁺)} exp(−|t|²) dt].

The right-hand side tends to zero as n → ∞, since δ = Ω(n^{−1/2+α}) whereas
R⁺ and R⁻ tend to strictly positive constants by continuity and strict positive
definiteness of the Fisher information matrix. □

4 Less Standard Examples

In this section we shall present less standard examples of statistical models. We
begin with two very simple models.

Example 2 (the data are the parameter). Put P_θ(x^n) := 1{x^n = θ^n} for X = Y
and let Q(θ) > 0 for θ ∈ Y^∗. This model is (n, m)-uniformly discretizable.

Example 3 (a singleton model). Each parameter θ is random with respect to
the prior Q concentrated on this parameter, Θ = {θ}. The respective singleton
model (P, Q) is (0, 0)-uniformly discretizable.

Now, a slightly more complex instance. Consider a class of stationary processes
(X_i)_{i∈Z} of the form X_i := (K_i, θ_{K_i}), where the variables K_i are independent and
distributed according to the hyperbolic distribution

    P(K_i = k) = p(k) := k^{−1/β}/ζ(1/β),  k ∈ N,    (27)

with a fixed β ∈ (0, 1). This family of processes was introduced to model the
logical consistency of texts in natural language [19]. The distribution of the
variables X_i is equal to the measure P(X_i ∈ ·) = P_θ for the following Bayesian
model.
Example 4 (an accessible description model). Put

    P_θ(x^n) := Π_{i=1}^n p(k_i) 1{z_i = θ_{k_i}}    (28)

for x_i = (k_i, z_i) ∈ N × Y and let Q(θ) > 0 for θ ∈ Y^∗.


For this model, the Shannon information between the data and the parameter
equals E_{(x,θ)∼T}[−log Y(x^n) + log P_θ(x^n)] = Θ(n^β) asymptotically if Q(θ) = D^{−|θ|},
cf. [19, Theorem 10]. As a consequence of the next statement, the accessible
description model (28) is (n^υ, m^{1/λ})-uniformly discretizable for

    υ > 2β/(1 − β) and λ < β.

Proposition 3. For independent variables (K_i)_{i∈Z} with the distribution (27),

    {K_1, K_2, ..., K_n} \ {1, 2, ..., ⌊n^υ⌋} = ∅,    (29)
    {1, 2, ..., ⌊n^λ⌋} \ {K_1, K_2, ..., K_n} = ∅,    (30)

n-eventually almost surely.


Proof. To establish the first claim, put Un := nυ  and observe

= ∅) ≤ ∞
P ({K1 , K2 , ..., Kn } \ {1, 2, ..., Un }  j=Un +1 P (j ∈ {K1 , K2 , ..., Kn })
∞ n
∞
= j=Un +1 1 − (1 − p(j)) ≤ j=Un +1 np(j)
 ∞
n−1−
1−1/β
n n Un
≤ k −1/β dk = ≤ for an > 0.
ζ(1/β) Un ζ(1/β) 1/β − 1 ζ(1/β)(1/β − 1)
∞
Hence n=1 P ({K1 , K2 , ..., Kn } \ {1, 2, ..., Un }  = ∅) < ∞ so (29) holds by the
Borel-Cantelli lemma. As for the second claim, put Ln := nλ  and observe
 n
= ∅) ≤ L
P ({1, 2, ..., Ln } \ {K1 , K2 , ..., Kn }  j=1 P (j ∈ {K1 , K2 , ..., Kn })
L n n n
= j=1 (1 − p(j)) ≤ Ln (1 − p(Ln )) = Ln exp [n log (1 − p(Ln ))]
≤ Ln exp [−np(Ln )] ≤ nβ exp [−n ] for an > 0.
∞
Hence n=1 P ({1, 2, ..., Ln } \ {K1 , K2 , ..., Kn } 
= ∅) < ∞ so (30) is also satisfied
by the Borel-Cantelli lemma. 
To use the above statement for the Bayesian model, notice first that P_θ(x^n) > 0
for P_θ-almost all x. Hence the equalities z_i = θ_{k_i} and

    T(x^n, θ^m) = Σ_{y^M∈Y^M} [Π_{i=1}^n p(k_i) 1{z_i = y_{k_i}}] [Π_{k=1}^m 1{θ_k = y_k}] Q(y^M)
    = P_θ(x^n) Σ_{y^M∈Y^M} [Π_{k∈{k_1,...,k_n}∪{1,...,m}} 1{θ_k = y_k}] Q(y^M)

hold for P_θ-almost all x with M := max{m, k_1, k_2, ..., k_n}. Consequently,

    Q(θ^m) P_θ(x^n) = T(x^n, θ^m) if {k_1, ..., k_n} \ {1, ..., m} = ∅,    (31)
    T(x^n, θ^m) = Y(x^n) if {1, ..., m} \ {k_1, ..., k_n} = ∅.    (32)

Thus the model given in Example 4 is (n^υ, m^{1/λ})-uniformly discretizable.
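
The combinatorial conditions behind (31) and (32) are easy to probe numerically. The sketch below (ours; it uses numpy's Zipf sampler for the distribution (27), and the values of υ and λ are an illustrative choice) estimates how often (29) and (30) hold at a fixed n:

```python
import numpy as np

def check_coverage(beta=0.5, n=10_000, trials=100, seed=0):
    """Empirical frequencies of the events (29) and (30) for i.i.d. K_i
    with p(k) proportional to k^{-1/beta}; upsilon and lam are picked just
    inside the admissible region upsilon > 2 beta/(1 - beta), lam < beta."""
    rng = np.random.default_rng(seed)
    upsilon = 2 * beta / (1 - beta) + 0.5
    lam = beta - 0.2
    ok29 = ok30 = 0
    for _ in range(trials):
        k = rng.zipf(1.0 / beta, size=n)              # distribution (27)
        ok29 += k.max() <= int(n ** upsilon)          # event (29)
        ok30 += np.isin(np.arange(1, int(n ** lam) + 1), k).all()  # event (30)
    return ok29 / trials, ok30 / trials

print(check_coverage())   # both frequencies approach 1 as n grows
```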


The last example is not uniformly discretizable. It stems from the observation
that any probability measure on X^∞ can be encoded with a single sequence from
Y^∞. Such a parameter is not identifiable, however.
Example 5 (a model that contains all distributions). For simplicity let X = N
and Y = {0, 1}. The link between θ and P_θ will be established by imposing the
equalities P_θ(λ) = 1 and

    P_θ(x^n) = [P_θ(x^{n−1}) − Σ_{y<x_n} P_θ(x^{n−1}y)] · Σ_{k=1}^∞ θ_{φ(x^n,k)} 2^{−k},    (33)

where a recursive bijection φ : N^+ × N → N is used. It is easy to see that P_θ is
a probability measure on X^∞ for each θ. Conversely, each probability measure
on X^∞ equals P_θ for at least one θ.

Let the prior be the uniform measure Q(θ) := 2^{−|θ|}. Then the Bayesian measure
Y = ∫ P_θ(x) dQ(θ) is recursive and equals

    Y(x^n) = (1/2)[Y(x^{n−1}) − Σ_{y<x_n} Y(x^{n−1}y)]  ⟹  log₂ Y(x^n) = −Σ_{i=1}^n x_i.

Measure Y is not only optimal for all Q-random θ, in the sense of P_θ(L_Y) = 1,
but it is also optimal for a certain θ ∉ L_Q that satisfies P_θ = Y. On the other
hand, by the asymptotic equipartition property, P_θ(L_Y) = 0 for stationary
measures P_θ that have an entropy rate different from that of Y [15, Section 15.7].
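
To see the recursion (33) in action, here is a Python sketch (ours). The bijection φ is realized lazily by assigning a fresh index to each pair (prefix, k) on demand (an injective stand-in, enough for this demonstration), and θ is a lazily sampled random bit sequence; both are illustrative assumptions:

```python
import random
from functools import lru_cache

random.seed(0)
theta_bits, phi_index = {}, {}

def theta(i):                       # lazily sampled bits of theta
    return theta_bits.setdefault(i, random.randint(0, 1))

def phi(prefix, k):                 # fresh index per (prefix, k); injective
    return phi_index.setdefault((prefix, k), len(phi_index))

def factor(prefix, K=30):
    # sum_{k>=1} theta_{phi(prefix, k)} 2^{-k}, truncated at K terms
    return sum(theta(phi(prefix, k)) * 2.0 ** -k for k in range(1, K + 1))

@lru_cache(maxsize=None)
def P(prefix):
    """P_theta(x^n) via the recursion (33); prefix is a tuple over X = N."""
    if not prefix:
        return 1.0                  # P_theta(lambda) = 1
    head, x_n = prefix[:-1], prefix[-1]
    mass_left = P(head) - sum(P(head + (y,)) for y in range(1, x_n))
    return mass_left * factor(prefix)

print(P((1,)), P((2,)), P((1, 1)))  # nonnegative; each level's masses sum below P(head)
```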

5 Countable Unions of Models

Bayesian mixtures of uniformly discretizable models are uniformly discretizable
under the additional condition (34), which says that Bayesian model selection is
consistent for each θ ∈ Θ. Let us write θ_k^m := θ_k θ_{k+1} ... θ_m. Moreover, define T^i
and Y^i via (1)–(2) for the models (P^i, Q^i) substituted for (P, Q), respectively.
Theorem 4. Let the models (P^i, Q^i) be (μ_i, ν_i)-uniformly discretizable with
kernels P^i_ϑ(x) for ϑ ∈ Θ^i and i ∈ A, a countable set. For a prefix code c : A → Y^+,
put Θ := ∪_{i∈A} c(i)Θ^i. Consecutively, denote idx(θ) := i and trn(θ) := ϑ for
θ = c(i)ϑ ∈ Θ. Define the kernel P_θ(x) := P^{idx(θ)}_{trn(θ)}(x) for θ ∈ Θ and the prior
Q := Σ_{i∈A} w(i)(Q^i ∘ trn) for Σ_{i∈A} w(i) = 1 and w(i) > 0. The model (P, Q)
is (μ, ν)-uniformly discretizable provided

    μ(n) := sup_{i∈A}(|c(i)| + μ_i(n)) < ∞,
    ν(m) := sup_{i∈A} ν_i(m − |c(i)|) < ∞,

and

    lim_{n→∞} Y(x^n)/Y^i(x^n) = w(i)    (34)

for i = idx(θ), P_θ-almost all x, and all θ ∈ Θ.


Remark: Assuming recursive models and mutually singular P^i_ϑ, the convergence
(34) may fail only for θ that are not Q-random, cf. [20]. Put X :=
{x : lim_n Y(x^n)/Y^i(x^n) = w(i)}. By ordinary martingale convergence,
Y^i(X) = 1, whereas by the convergence of recursive martingales [4, Theorem 3.1],
X ⊃ L_{Y^i}. Next, by [4, Cor. 4.3 & Thm. 5.3], we obtain L_{Y^i} ⊃ L_{P^i|ϑ} for
Q^i-random ϑ. Hence P_θ(X) = 1 if θ ∈ L_Q and (35) holds true, in view of
Theorem 5 below.

Proof. Let i = idx(θ). Observe that T(x^n, θ^m) = w(i) T^i(x^n, θ^m_{|c(i)|+1}) and
Q(θ^m) = w(i) Q^i(θ^m_{|c(i)|+1}) if m ≥ |c(i)|. Hence for P_θ-almost all x and
m ≥ μ(n), we have

    |log [Q(θ^m) P_θ(x^n) / T(x^n, θ^m)]|
        = |log [Q^i(θ^m_{|c(i)|+1}) P^i_{trn(θ)}(x^n) / T^i(x^n, θ^m_{|c(i)|+1})]| = o(log m).

On the other hand, for P_θ-almost all x and n ≥ ν(m),

    lim_{m→∞} T(x^n, θ^m)/Y(x^n)
        = lim_{m→∞} [(w(i) Y^i(x^n)/Y(x^n)) · (T^i(x^n, θ^m_{|c(i)|+1})/Y^i(x^n))] = 1.    □

A complementary result says that the set of random parameters with respect to
the mixture is the union of the respective sets for the combined models.

Theorem 5. Consider the models from Theorem 4 and suppose that the Q^i satisfy

    Q^i(θ^k)/Q^i(θ^m) ≥ a c^{k−m}    (35)

for all k ≥ m ≥ 0 and certain constants c < 1 and a > 0. Then for g(n) = Ω(1)
we have θ ∈ L_{Q,g(n)} if and only if trn(θ) ∈ L_{Q^{idx(θ)}, g(n)}.

Proof. Let i = idx(θ). The claim is true if

    |K(θ^m) + log Q(θ^m) − K(θ^{|c(i)|+m}_{|c(i)|+1}) − log Q^i(θ^{|c(i)|+m}_{|c(i)|+1})| = O(1)

for m ≥ |c(i)|. The latter condition is satisfied since |K(θ^m) − K(θ^{|c(i)|+m}_{|c(i)|+1})| ≤
|c(i)| + O(1), whereas |log Q(θ^m) − log Q^i(θ^{|c(i)|+m}_{|c(i)|+1})| ≤ |log w(i)| + O(|c(i)|) by
Q(θ^m) = w(i) Q^i(θ^m_{|c(i)|+1}) and (35). □

These propositions may be useful when we seek a compressor that is optimal
for all random and certain nonrandom parameters with respect to a given prior.
A possible solution is to find priors against which the originally considered
nonrandom parameters are random. Suppose that these priors and the original prior
yield uniformly discretizable models and that consistent Bayesian selection among
these models is feasible. Then Theorems 2, 4, and 5 guarantee that the Bayesian
mixture of all considered models achieves the best enumerable compression for
all requested parameters and not so many others!
Acknowledgement
I would like to thank P. Grünwald, P. Harremoës, and J. Mielniczuk for discus-
sions. Cordial acknowledgements are due to an anonymous referee for suggest-
ing relevant references. They helped to improve this paper considerably. The
research, supported under the PASCAL II Network of Excellence, IST-2002-
506778, was done during the author’s leave from the Institute of Computer
Science, Polish Academy of Sciences.

References
1. Vovk, V.G., V’yugin, V.V.: On the empirical validity of the Bayesian method. J.
Roy. Statist. Soc. B 55, 253–266 (1993)
2. Vovk, V.G., V’yugin, V.V.: Prequential level of impossibility with some applica-
tions. J. Roy. Statist. Soc. B 56, 115–123 (1994)
3. Vitányi, P., Li, M.: Minimum description length induction, Bayesianism and
Kolmogorov complexity. IEEE Trans. Inform. Theor. 46, 446–464 (2000)
4. Takahashi, H.: On a definition of random sequences with respect to conditional
probability. Inform. Comput. 206, 1375–1382 (2008)
5. Gács, P.: On the symmetry of algorithmic information. Dokl. Akad. Nauk SSSR 15,
1477–1480 (1974)
6. Li, M., Vitányi, P.M.B.: An Introduction to Kolmogorov Complexity and Its
Applications, 2nd edn. Springer, Heidelberg (1997)
7. van Lambalgen, M.: Random Sequences. PhD thesis, Universiteit van Amsterdam
(1987)
8. Barron, A., Rissanen, J., Yu, B.: The minimum description length principle in
coding and modeling. IEEE Trans. Inform. Theor. 44, 2743–2760 (1998)
9. Grünwald, P.D.: The Minimum Description Length Principle. MIT Press,
Cambridge (2007)
10. Yu, B., Speed, T.P.: Data compression and histograms. Probab. Theor. Rel.
Fields 92, 195–229 (1992)
11. Hopcroft, J.E., Ullman, J.D.: Introduction to Automata Theory, Languages and
Computation. Addison-Wesley, Reading (1979)
12. Elias, P.: Universal codeword sets and representations for the integers. IEEE Trans.
Inform. Theor. 21, 194–203 (1975)
13. Barron, A.R.: Logically Smooth Density Estimation. PhD thesis, Stanford Univer-
sity (1985)
14. Dawid, A.: Statistical theory: The prequential approach. J. Roy. Statist. Soc. A 147,
278–292 (1984)
15. Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley, Chichester
(1991)
16. Li, L., Yu, B.: Iterated logarithmic expansions of the pathwise code lengths for
exponential families. IEEE Trans. Inform. Theor. 46, 2683–2689 (2000)
17. Barndorff-Nielsen, O.E.: Information and Exponential Families. Wiley, Chichester
(1978)
18. Jeffreys, H.: Theory of Probability, 3rd edn. Oxford University Press, Oxford (1961)
19. Dębowski, Ł.: On the vocabulary of grammar-based codes and the logical consis-
tency of texts (2008) E-print, http://arxiv.org/abs/0810.3125
20. Csiszar, I., Shields, P.C.: The consistency of the BIC Markov order estimator. Ann.
Statist. 28, 1601–1619 (2000)
Calibration and Internal No-Regret
with Random Signals

Vianney Perchet

Équipe Combinatoire et Optimisation, FRE 3232 CNRS,


Université Pierre et Marie Curie - Paris 6, 175 rue du Chevaleret, 75013 Paris
vianney.perchet@normalesup.org

Abstract. A calibrated strategy can be obtained by performing a strategy
that has no internal regret in some auxiliary game. Such a strategy
can be constructed explicitly with the use of Blackwell's approachability
theorem, in another auxiliary game. We establish the converse: a strategy
that approaches a convex B-set can be derived from the construction
of a calibrated strategy.

We develop these tools in the framework of a game with partial monitoring,
where players do not observe the actions of their opponents but
receive random signals, to define a notion of internal regret and construct
strategies that have no such regret.

1 Introduction

Consider an agent trying to predict a sequence of outcomes. For example, a


meteorologist announces each day the probability that it will rain the following
day. He will do this with a given accuracy (for instance, he chooses between
{0, 0.1, 0.2, . . . , 1}). The predictions will be considered successful if on the days
when the meteorologist forecasts 0.5, nearly half of these days are rainy and half
sunny. And this should be true for every possible prediction. Foster and Vohra [6]
called this property calibration and proved the existence of calibrated strategies,
without any assumption on the sequence of outcomes and on the knowledge of
the predictor.
The first section deals with the connections between three tools: calibration,
approachability and no-regret. The notion of regret in full monitoring has been
introduced by Hannan [9]: a player has asymptotically no external regret if his
average payoff could not have been better by knowing in advance the empirical
distribution of moves of the other players. Hannan [9] proved the existence of such
strategies and Blackwell [4] gave an alternative proof using his approachability
theorem. Foster and Vohra [7] (see also Fudenberg and Levine [8]) extended
Hannan's result by proving the existence of strategies with no internal regret,
which is a more precise notion: a player has asymptotically no internal regret
if, for each of his actions, he has no external regret on the set of stages where
he played it.
prediction and regret.


A calibrated strategy can be obtained through the construction of a strategy with no internal regret in an auxiliary game (see Sorin [17]). And this construction can be done explicitly using Blackwell's approachability theorem [3] for an orthant in IR^d (see Hart and Mas-Colell [10]). We will provide a kind of converse result: we derive an explicit construction of an approachability strategy for a convex B-set through the use of a calibrated strategy, in some auxiliary game.
In the second section, we consider repeated games with partial monitoring,
where players do not observe the action of their opponents, but only receive
random signals and we focus on strategies that have no regret, in the following
sense. A player has asymptotically no external regret if his average payoff could
not have been better by knowing in advance the empirical distribution of sig-
nals (see Rustichini [15]). The existence of strategies with no external regret was
proved by Rustichini [15] and Lugosi, Mannor and Stoltz [14] constructed ex-
plicitly such strategies. Lehrer and Solan [13] defined a notion of internal regret
in the partial monitoring framework and proved the existence of strategies with
no such regret. We will generalize these results by constructing strategies that
have no regret, for a more precise notion of regret.

2 Full Monitoring Case: Approachability Implies Calibration

This section is devoted to the full monitoring case. We recall the main results
about calibration of Foster and Vohra [6], approachability of Blackwell [3] and
regret of Hart and Mas-Colell [10]. We will prove some of these results in detail, since they give the main ideas behind the construction of strategies in the partial monitoring framework given in Section 4.

2.1 Calibration

Let S be a finite set of states. We consider a two-person repeated game where, at stage n ∈ IN, Nature (Player 2) chooses a state s_n ∈ S and Predictor (Player 1) chooses μ_n ∈ Δ(S), the set of probability distributions over S. We assume that μ_n belongs to a finite set M = {μ(l), l ∈ L}. Let ε > 0 be such that for every probability μ ∈ Δ(S) there exists μ(l) ∈ M with ‖μ − μ(l)‖ ≤ ε, where Δ(S) is seen as a subset of IR^{|S|}. Then M is called an ε-grid of Δ(S). With these notations, the prediction at stage n is the choice of an element l_n ∈ L, called the type of that stage.
The choices of l_n and s_n are functions of the past observations (or the finite history) h_{n−1} = (l_1, s_1, . . . , l_{n−1}, s_{n−1}) and may be random. Explicitly, the set of finite histories is denoted by H = ∪_{n∈IN} (L × S)^n, with (L × S)^0 = ∅, and a strategy σ of Player 1 (resp. τ of Player 2) is a function from H to Δ(L) (resp. Δ(S)); σ(h_n) (resp. τ(h_n)) is the law of l_{n+1} (resp. s_{n+1}) after h_n. A couple of strategies (σ, τ) generates a probability, denoted by IP_{σ,τ}, over (L × S)^{IN}, the set of plays endowed with the cylinder σ-field.
We will use the following notations. For any families {a_m ∈ IR^d, l_m ∈ L}_{m∈IN} and n ∈ IN, N_n(l) = {1 ≤ m ≤ n : l_m = l} is the set of stages of type l (before the n-th), ā_n(l) = Σ_{m∈N_n(l)} a_m / |N_n(l)| is the average of {a_m} on this set, and ā_n = Σ_{m=1}^n a_m / n is the average over all the stages (before the n-th).

Definition 1 (Foster-Vohra [6]). A strategy σ of Player 1 is calibrated (with respect to the ε-grid M) if for every l ∈ L and every strategy τ of Player 2:

$$\limsup_{n\to+\infty} \frac{|N_n(l)|}{n}\left(\|\bar{s}_n(l) - \mu(l)\|^2 - \varepsilon^2\right) \le 0, \qquad \mathbb{P}_{\sigma,\tau}\text{-a.s.}$$
In words, a strategy of Player 1 is calibrated if, on the set of stages where μ(l)
is forecast, the empirical distribution of states is asymptotically close to μ(l) (as
long as the frequency of l is not too small). Foster-Vohra [6] proved the existence
of such strategies with an algorithm based on the Expected Brier Score.

2.2 Approachability
We will prove that calibration will follow from no-regret and that no-regret
will follow from approachability (following respectively Sorin [17] and Hart and
Mas-Colell [10]). We present here the notion of approachability introduced by
Blackwell [3].
Consider a two-person repeated game in discrete time with vector payoffs,
where at stage n ∈ IN, Player 1 (resp. Player 2) chooses the action in ∈ I
(resp. jn ∈ J), with both I and J finite. The corresponding vector payoff is
ρ_n = ρ(i_n, j_n) where ρ : I × J → IR^d. As usual, a strategy σ (resp. τ) of Player 1 (resp. Player 2) is a function from the set of finite histories H = ∪_{n∈IN} (I × J)^n to Δ(I) (resp. Δ(J)).
For a closed set E ⊂ IR^d and δ ≥ 0, we denote by E^δ = {z ∈ IR^d : d_E(z) ≤ δ} the δ-neighborhood of E and by Π_E(z) = {e ∈ E : d_E(z) = ‖z − e‖} the set of closest points to z in E, where d_E(z) = inf_{e∈E} ‖z − e‖.
Definition 2. i) A closed set E ⊂ IR^d is approachable by Player 1 if for every ε > 0 there exist a strategy σ of Player 1 and N ∈ IN such that for every strategy τ of Player 2 and every n ≥ N:

$$\mathbb{E}_{\sigma,\tau}\left[d_E(\bar\rho_n)\right] \le \varepsilon \quad\text{and}\quad \mathbb{P}_{\sigma,\tau}\left(\sup_{n\ge N} d_E(\bar\rho_n) \ge \varepsilon\right) \le \varepsilon.$$

Such a strategy is called an approachability strategy of E.


ii) A set E is excludable by Player 2 if there exists δ > 0 such that the complement of E^δ is approachable by Player 2.
In words, a set E ⊂ IRd is approachable by Player 1, if he has a strategy such
that the average payoff converges almost surely to E, uniformly with respect to
the strategies of Player 2.
Blackwell [3] gave a sufficient geometric condition for a closed set E to be
approachable by Player 1. Denote by P 1 (x) = {ρ(x, y), y ∈ Δ(J)}, the set of
expected payoffs compatible with x ∈ Δ(I) and define similarly P 2 (y).
Definition 3. A closed subset E of IR^d is a B-set if, for every z ∈ IR^d, there exist p ∈ Π_E(z) and x (= x(z, p)) ∈ Δ(I) such that the hyperplane through p and perpendicular to z − p separates z from P^1(x), or formally:

$$\forall z \in \mathbb{R}^d,\ \exists p \in \Pi_E(z),\ \exists x \in \Delta(I):\quad \langle \rho(x,y) - p,\ z - p\rangle \le 0,\quad \forall y \in \Delta(J). \tag{1}$$

Informally, from any point z outside E there are a closest point p and a probability x ∈ Δ(I) such that, whatever the choice of Player 2, the expected payoff and z are on different sides of the hyperplane through p and perpendicular to z − p. In fact, this definition (and the following theorem) does not require that J be finite: one can assume that Player 2 chooses an outcome vector U ∈ [−1, 1]^{|I|}, so that the expected payoff is ⟨x, U⟩.

Theorem 1 (Blackwell [3]). If E is a B-set, then E is approachable by Player 1. Moreover, the strategy σ of Player 1 defined by σ(h_n) = x(ρ̄_n) is such that, for every strategy τ of Player 2:

$$\mathbb{E}_{\sigma,\tau}\left[d_E(\bar\rho_n)^2\right] \le \frac{4B}{n} \quad\text{and}\quad \mathbb{P}_{\sigma,\tau}\left(\sup_{n\ge N} d_E(\bar\rho_n) \ge \eta\right) \le \frac{8B}{\eta^2 N}, \tag{2}$$

with B = sup_{i,j} ‖ρ(i, j)‖².

In the case of a convex set C, there is a complete characterization:

Corollary 1 (Blackwell [3]). A closed convex set C ⊂ IR^d is approachable by Player 1 if and only if:

$$P^2(y) \cap C \neq \emptyset, \qquad \forall y \in \Delta(J). \tag{3}$$

In particular, a closed convex set C is either approachable by Player 1, or excludable by Player 2.

Remark 1. Corollary 1 implies that there are (at least) two different ways to prove that a convex set is approachable. The first one, called the direct proof, consists in proving that C is a B-set, while the second one, called the indirect proof, consists in proving that C is not excludable by Player 2, which reduces to finding, for every y ∈ Δ(J), some x ∈ Δ(I) such that ρ(x, y) ∈ C.

2.3 Approachability Implies Internal No-Regret

Consider a two-person repeated game in discrete time, where at stage n ∈ IN Player 1 chooses i_n ∈ I as above and Player 2 chooses a vector U_n ∈ [−1, 1]^c (with c = |I|). The associated payoff is U_n^{i_n}, the i_n-th coordinate of U_n. The internal regret of the stage is the matrix R_n = R(i_n, U_n), where the function R : I × [−1, 1]^c → IR^{c²} is defined by:

$$R(i, U)^{(i',j)} = \begin{cases} 0 & \text{if } i' \neq i,\\ U^j - U^i & \text{otherwise.}\end{cases}$$
j i
With this definition, the average internal regret R̄_n is given by:

$$\bar{R}_n = \left(\frac{\sum_{m\in N_n(i)} \left(U_m^j - U_m^i\right)}{n}\right)_{i,j\in I} = \left(\frac{|N_n(i)|}{n}\left(\bar{U}_n(i)^j - \bar{U}_n(i)^i\right)\right)_{i,j\in I}.$$

Definition 4 (Foster and Vohra [7]). A strategy σ of Player 1 is internally consistent if for any strategy τ of Player 2:

$$\limsup_{n\to\infty} \bar{R}_n \le 0, \qquad \mathbb{P}_{\sigma,\tau}\text{-a.s.}$$

The existence of such strategies has been proved by Foster and Vohra [7] and Fudenberg and Levine [8].
Theorem 2. There exist internally consistent strategies.
Note that an internally consistent strategy can be obtained by constructing a strategy that approaches the negative orthant Ω = IR_−^{c²} in the auxiliary game where the vector payoff at stage n is R_n.
The proof of Hart and Mas-Colell [10] of the fact that Ω is a B-set relies on the following two lemmas: Lemma 1 gives a geometric property of Ω and Lemma 2 gives a property of the function R.
Lemma 1. Let Π_Ω(·) be the projection onto Ω. Then, for every A ∈ IR^{c²}:

$$\langle \Pi_\Omega(A),\ A - \Pi_\Omega(A)\rangle = 0. \tag{4}$$

Proof. Note that since Ω = IR_−^{c²}, one has A⁺ = A − Π_Ω(A), where A⁺_{ij} = max(A_{ij}, 0), and similarly A⁻ = Π_Ω(A). The result is just a rewriting of ⟨A⁻, A⁺⟩ = 0. □
For every non-negative (c × c)-matrix A = (a_{ij})_{i,j∈I}, λ ∈ Δ(I) is an invariant probability of A if for every i ∈ I:

$$\sum_{j\in I} \lambda(j)\, a_{ji} = \lambda(i) \sum_{j\in I} a_{ij}.$$

The existence of an invariant probability follows from the similar result for
Markov chains.
Lemma 2. Let A = (a_{ij})_{i,j∈I} be a non-negative matrix. Then for every λ, invariant probability of A, and every U ∈ IR^c:

$$\langle A,\ \mathbb{E}_\lambda[R(\cdot, U)]\rangle = 0. \tag{5}$$

Proof. The (i, j)-th coordinate of E_λ[R(·, U)] is λ(i)(U^j − U^i), therefore:

$$\langle A,\ \mathbb{E}_\lambda[R(\cdot, U)]\rangle = \sum_{i,j\in I} a_{ij}\,\lambda(i)\left(U^j - U^i\right),$$

and the coefficient of each U^i is Σ_{j∈I} a_{ji}λ(j) − λ(i) Σ_{j∈I} a_{ij} = 0, because λ is an invariant measure of A. Therefore ⟨A, E_λ[R(·, U)]⟩ = 0. □

Proof of Theorem 2. Summing equations (4) (with A = R̄_n) and (5) (with A = R̄_n⁺) gives:

$$\left\langle \mathbb{E}_{\lambda_n}[R(\cdot, U)] - \Pi_\Omega(\bar{R}_n),\ \bar{R}_n - \Pi_\Omega(\bar{R}_n)\right\rangle = 0,$$

for every λ_n invariant probability of R̄_n⁺ and every U ∈ [−1, 1]^I.
Define the strategy σ of Player 1 by σ(h_n) = λ_n. The expected payoff at stage n + 1 (given h_n and U_{n+1} = U) is E_{λ_n}[R(·, U)], so Ω is a B-set and is approachable by Player 1. □
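The only computational step in this construction is producing an invariant probability λ_n of the nonnegative matrix R̄_n⁺. The following Python sketch (ours; `invariant_probability` and the sample matrix are hypothetical) uses the standard device of embedding the matrix into a lazy Markov chain and power-iterating to its stationary distribution, which satisfies exactly the invariance condition above.

```python
import numpy as np

def invariant_probability(A, iters=2000):
    """Invariant probability lam of a nonnegative matrix A, i.e.
    sum_j lam[j]*A[j, i] = lam[i]*sum_j A[i, j] for every i.
    We embed A into a lazy Markov chain Q and power-iterate."""
    A = np.asarray(A, dtype=float)
    c = A.shape[0]
    mu = 2.0 * A.sum(axis=1).max()       # any mu >= max row sum works
    if mu == 0.0:                        # no positive regret yet
        return np.full(c, 1.0 / c)
    Q = A / mu
    Q += np.diag(1.0 - Q.sum(axis=1))    # self-loops make rows sum to 1
    lam = np.full(c, 1.0 / c)
    for _ in range(iters):
        lam = lam @ Q                    # converges to the stationary law
    return lam

# the strategy of the proof: at each stage play the mixed action
# lam_n computed from the positive part of the average regret matrix
R_plus = np.array([[0.0, 0.3, 0.0],
                   [0.1, 0.0, 0.0],
                   [0.2, 0.4, 0.0]])
print(invariant_probability(R_plus))     # approximately (0.25, 0.75, 0)
```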


Remark 2. The construction of the strategy is based on approachability properties, therefore the convergence is uniform with respect to the strategies of Player 2. Theorem 1 implies that for every η > 0 and for every strategy τ of Player 2:

$$\mathbb{P}_{\sigma,\tau}\left(\exists n \ge N,\ \exists i,j \in I:\ \frac{|N_n(i)|}{n}\left(\bar{U}_n(i)^j - \bar{U}_n(i)^i\right) > \eta\right) = O\left(\frac{1}{\eta^2 N}\right)$$

$$\text{and}\qquad \mathbb{E}_{\sigma,\tau}\left[\sup_{i,j\in I}\ \frac{|N_n(i)|}{n}\left(\bar{U}_n(i)^j - \bar{U}_n(i)^i\right)^+\right] = O\left(\frac{1}{\sqrt{n}}\right).$$

2.4 Internal Regret Implies Calibration

Sorin [17] proved that the construction of a calibrated strategy can be reduced to the construction of an internally consistent strategy. The proof relies on the following lemma:

Lemma 3. Let (a_m)_{m∈IN} be a sequence in IR^d and α, β two points in IR^d. Then for every n ∈ IN*:

$$\frac{\sum_{m=1}^{n}\left(\|a_m - \beta\|_2^2 - \|a_m - \alpha\|_2^2\right)}{n} = \|\bar{a}_n - \beta\|_2^2 - \|\bar{a}_n - \alpha\|_2^2, \tag{6}$$

with ‖·‖₂ the L²-norm of IR^d.

Proof. Develop the sums in equation (6) to get the result. □




Now, we can prove the following:

Theorem 3 (Foster and Vohra [6]). For every finite grid of Δ(S), there exist calibrated strategies of Player 1.

Proof. We start with the framework described in Section 2.1. Consider the auxiliary two-person game with vector payoffs defined as follows. At stage n ∈ IN, Player 1 (resp. Player 2) chooses the action l_n ∈ L (resp. s_n ∈ S), which generates the payoff R_n = R(l_n, U_n), where R is the function of Section 2.3 (with L in the role of I and c = |L|) and

$$U_n = -\left(\|s_n - \mu(l)\|_2^2\right)_{l\in L} \in \mathbb{R}^c.$$
By definition of R and using Lemma 3, for every n ∈ IN*:

$$\bar{R}_n^{l,k} = \frac{|N_n(l)|}{n} \cdot \frac{\sum_{m\in N_n(l)}\left(\|s_m - \mu(l)\|_2^2 - \|s_m - \mu(k)\|_2^2\right)}{|N_n(l)|} = \frac{|N_n(l)|}{n}\left(\|\bar{s}_n(l) - \mu(l)\|_2^2 - \|\bar{s}_n(l) - \mu(k)\|_2^2\right).$$

Let σ be an internally consistent strategy in this auxiliary game; then for every l ∈ L and k ∈ L:

$$\limsup_{n\to\infty} \frac{|N_n(l)|}{n}\left(\|\bar{s}_n(l) - \mu(l)\|_2^2 - \|\bar{s}_n(l) - \mu(k)\|_2^2\right) \le 0, \qquad \mathbb{P}_{\sigma,\tau}\text{-a.s.}$$

Since {μ(k), k ∈ L} is an ε-grid of Δ(S), for every l ∈ L and every n ∈ IN* there exists k ∈ L such that ‖s̄_n(l) − μ(k)‖₂² ≤ ε², hence:

$$\limsup_{n\to\infty} \frac{|N_n(l)|}{n}\left(\|\bar{s}_n(l) - \mu(l)\|_2^2 - \varepsilon^2\right) \le 0, \qquad \mathbb{P}_{\sigma,\tau}\text{-a.s.} \qquad\square$$


Remark 3. We have proved that σ is such that, for every l ∈ L, s̄_n(l) is closer to μ(l) than to any other μ(k), as soon as |N_n(l)|/n is not too small.
The fact that s_n belongs to a finite set S and that the {μ(l)} are probabilities over S is irrelevant: one can show that for any finite set {a(l) ∈ IR^d, l ∈ L}, Player 1 has a strategy σ such that for any bounded sequence (a_m)_{m∈IN} in IR^d and for every l and k:

$$\limsup_{n\to\infty} \frac{|N_n(l)|}{n}\left(\|\bar{a}_n(l) - a(l)\|^2 - \|\bar{a}_n(l) - a(k)\|^2\right) \le 0.$$

3 Calibration Implies Approachability


The proof of Theorem 3 shows that the construction of a calibrated strategy can
be obtained through an approachability strategy of an orthant in an auxiliary
game.
Conversely, we will show that the approachability of a convex B-set can be
reduced to the existence of a calibrated strategy in an auxiliary game, and so
give a new proof of Corollary 1.
Alternative proof of Corollary 1. The idea of the proof is very natural: given ε > 0, we construct a finite covering {Y(l), l ∈ L} of Δ(J) and associate to Y(l) a probability x(l) ∈ Δ(I) such that ρ(x(l), y) ∈ C^ε for every y ∈ Y(l). Player 1 will always choose his action according to one of the {x(l)}. Assume that on the stages when Player 1 played x(l), the empirical action of Player 2 is in Y(l); then the average payoff on these stages is in the convex set C^ε (by linearity of ρ). And if this property holds for every l ∈ L, then the overall average payoff is also in C^ε (by convexity).
Formally, assume that condition (3) is satisfied, rephrased as:

$$\forall y \in \Delta(J),\ \exists x (= x_y) \in \Delta(I):\quad \rho(x_y, y) \in C. \tag{7}$$

Since ρ is multilinear, and therefore continuous on Δ(I) × Δ(J), for every ε > 0 there exists δ > 0 such that:

$$\forall y, y' \in \Delta(J):\quad \|y - y'\|_2 \le 2\delta \implies \rho(x_y, y') \in C^\varepsilon.$$

We introduce the auxiliary game Γ where Player 2 chooses an action (or state) j ∈ J and Player 1 forecasts it, using {y(l), l ∈ L}, a finite grid of Δ(J) whose diameter is smaller than δ. Let σ be a calibrated strategy for Player 1, so that j̄_n(l), the empirical distribution of actions of Player 2 on N_n(l), is asymptotically δ-close to y(l).
Define the strategy of Player 1 in the initial game by performing σ and, if l_n = l, by playing according to x_{y(l)} = x(l) ∈ Δ(I), as given in (7). Since the choices of actions of the two players are independent, ρ̄_n(l) will be close to ρ(x(l), j̄_n(l)), hence close to ρ(x(l), y(l)), and finally close to C^ε, as soon as |N_n(l)| is not too small.
Indeed, by construction of σ, for every η > 0 there exists N¹ ∈ IN such that, for every strategy τ of Player 2:

$$\mathbb{P}_{\sigma,\tau}\left(\forall l \in L,\ \forall n \ge N^1:\ \frac{|N_n(l)|}{n}\left(\|\bar{\jmath}_n(l) - y(l)\|_2^2 - \delta^2\right) \le \eta\right) \ge 1 - \eta. \tag{8}$$

The Hoeffding-Azuma inequality for sums of bounded martingale differences (see [2,11]) implies that for any η ∈ (0, 1), with probability at least 1 − η,

$$\left|\bar\rho_n(l) - \rho\left(x(l), \bar{\jmath}_n(l)\right)\right| \le \sqrt{\frac{2}{|N_n(l)|}\ln\frac{2}{\eta}},$$

and therefore there exists N² ∈ IN such that for every l ∈ L:

$$\mathbb{P}_{\sigma,\tau}\left(\forall m \ge n:\ \left|\bar\rho_m(l) - \rho\left(x(l), \bar{\jmath}_m(l)\right)\right| \le \eta\ \Big|\ |N_n(l)| \ge N^2\right) \ge 1 - \eta. \tag{9}$$

Equations (8) and (9), taken with η ≤ ε/|L|, imply that, with probability at least 1 − 2ε, for every n ≥ max{N¹, |L|N²/ε} we have |ρ̄_n(l) − ρ(x(l), j̄_n(l))| ≤ η ≤ ε, and if N_n(l)/n ≥ ε/|L| then |N_n(l)| ≥ N², so ‖j̄_n(l) − y(l)‖² ≤ 2δ², and therefore d_C(ρ̄_n(l)) ≤ 2ε.


Since C is a convex set, d_C(·) is convex, and with probability at least 1 − 2ε:

$$d_C(\bar\rho_n) = d_C\left(\sum_{l\in L} \frac{|N_n(l)|}{n}\,\bar\rho_n(l)\right) \le \sum_{l\in L} \frac{|N_n(l)|}{n}\, d_C\left(\bar\rho_n(l)\right) \le \sum_{l:\ N_n(l)/n \ge \varepsilon/|L|} \frac{|N_n(l)|}{n}\, d_C\left(\bar\rho_n(l)\right) + \sum_{l:\ N_n(l)/n < \varepsilon/|L|} \frac{|N_n(l)|}{n} \le 2\varepsilon + \varepsilon = 3\varepsilon.$$
Therefore C is approachable by Player 1.
On the other hand, if there exists y such that P²(y) ∩ C = ∅, then Player 2 can approach P²(y) by playing at every stage according to y. Therefore C is not approachable by Player 1. □


Remark 4. Blackwell’s proof of this result is not explicit. He showed that the
condition (7) implies that C is a B-set and his proof relies on the use of Von
Neumann’s minmax theorem. In words, let z be a fixed point outside C. Assume
that if Player 1 knows y ∈ Δ(J) the law of the action of Player 2, then there is a
law xy ∈ Δ(I) such that the expected payoff ρ(xy , y) and z are in different sides
of the hyperplane described in the definition of a B-set. The minmax theorem
implies that there exists x ∈ Δ(I) such that for every y ∈ Δ(I), z and ρ(x, y)
are on different sides and therefore C is a B-set. This gives the existence of an
approachability strategy of C.
One of the major interest in calibration, is that it transforms this implicit
proof into an explicit constructive proof: while performing a calibrated strategy
(in an auxiliary game where J plays the role of the set of states), Player 1 can
enforce the property that, for every l ∈ L, the average move of Player 2 is almost
y(l) on Nn (l). So he just has (and could not do better) to play xy(l) on these
stages.

Remark 5. 1) The Hoeffding-Azuma inequality for sums of bounded martingale differences implies that for every strategy τ of Player 2:

$$\mathbb{E}_{\sigma,\tau}\left[\sup_{l\in L}\ \frac{|N_n(l)|}{n}\left|\bar\rho_n(l) - \rho\left(x(l), \bar{y}_n(l)\right)\right|\right] = O\left(\sqrt{\frac{\ln n}{n}}\right).$$

The strategy σ is based on approachability properties and on the Hoeffding-Azuma inequality, so one can show that:

$$\mathbb{E}_{\sigma,\tau}\left[d_C(\bar\rho_n) - \varepsilon\right] \le O\left(\sqrt{\frac{\ln n}{n}}\right).$$

2) To deduce that ρn is in C ε from the fact that ρn (l) is in C ε for every l ∈ L,


it is necessary that C (or dC (·)) is convex.

4 Internal Regret in the Partial Monitoring Framework

Consider a two-person repeated game in discrete time. At stage n ∈ IN, Player 1 (resp. Player 2) chooses i_n ∈ I (resp. j_n ∈ J), which generates the payoff ρ_n = ρ(i_n, j_n) with ρ : I × J → IR. Player 1 does not observe this payoff; he just receives a signal s_n ∈ S whose law is s(i_n, j_n), with s : I × J → Δ(S). The three sets I, J and S are finite, the two functions ρ and s are extended multilinearly to Δ(I) × Δ(J), and we define s : Δ(J) → Δ(S)^I by s(y) = (s(i, y))_{i∈I}, where Δ(S)^I is the set of vectors of probabilities over S. We call any such vector a flag. As usual, a strategy σ of Player 1 (resp. τ of Player 2) is a function from the set of finite histories of Player 1, H¹ = ∪_{n∈IN} (I × S)^n, to Δ(I) (resp. from H² = ∪_{n∈IN} (I × S × J)^n to Δ(J)). A couple (σ, τ) generates a probability IP_{σ,τ} over H = (I × S × J)^{IN}.

4.1 External Regret

Rustichini [15] defined the regret in the partial monitoring framework as follows: a strategy σ of Player 1 has no external regret if, IP_{σ,τ}-a.s.:

$$\limsup_{n\to+\infty}\ \max_{x\in\Delta(I)}\ \min_{\substack{y\in\Delta(J)\\ s(y)=\overline{s(j_n)}}}\ \rho(x, y) - \bar\rho_n \le 0,$$

where $\overline{s(j_n)} \in \Delta(S)^I$ is the average flag. In words, the average payoff of Player 1 could not have been uniformly better had he known the average distribution of flags before the beginning of the game.
In this framework, given a flag μ ∈ Δ(S)^I, the function min_{y∈s^{−1}(μ)} ρ(·, y) may not be linear. So the best response of Player 1 might not be a pure action in I but a mixed action x ∈ Δ(I), and any pure action in the support of x may be a bad response. Note that this also appears in Rustichini's definition, since the maximum is taken over Δ(I) and not just over I, as in the usual definition of external regret in full monitoring.

4.2 Internal Regret

We consider here a generalization of the previous framework: at stage n ∈ IN, Player 2 chooses a flag μ_n ∈ Δ(S)^I while Player 1 chooses an action i_n and receives a signal s_n whose law is the i_n-th coordinate of μ_n. Given a flag μ and x ∈ Δ(I), Player 1 evaluates the payoff through an evaluation function G : Δ(I) × Δ(S)^I → IR, which is not necessarily linear.
There are two requirements to define internal regret: we have to define a finite partition of IN, and for every element of that partition Player 1 must choose a point in Δ(I) that is a best response (or at least an ε-best response) to some flag. Hence we have to distinguish the stages not as a function of the action played, but as a function of the law of the action. We also assume that the strategy of Player 1 can be described by a finite family {x(l) ∈ Δ(I), l ∈ L} such that, at stage n ∈ IN, Player 1 chooses a type l_n and the law of his action i_n is x(l_n).
Definition 5 (Lehrer-Solan [13]). For every n ∈ IN and every l ∈ L, the average internal regret of type l at stage n is

$$\bar{R}_n(l) = \sup_{x\in\Delta(I)}\ \left[G\left(x, \bar\mu_n(l)\right) - G\left(\bar\imath_n(l), \bar\mu_n(l)\right)\right].$$

A strategy σ of Player 1 is (L, ε)-internally consistent if for every strategy τ of Player 2:

$$\limsup_{n\to+\infty}\ \frac{|N_n(l)|}{n}\left(\bar{R}_n(l) - \varepsilon\right) \le 0, \quad \forall l \in L, \qquad \mathbb{P}_{\sigma,\tau}\text{-a.s.}$$
Remark 6. Note that this definition is not intrinsic (unlike in the full monitoring case), since it depends on the choice of {x(l), l ∈ L} and is based solely on the potential observations (i.e., the sequences of flags (μ_n)_{n∈IN}) of Player 1.

In order to construct (L, ε)-internally consistent strategies, some regularity of G is required:

Assumption 1. For every ε > 0, there exist two finite families {μ(l) ∈ Δ(S)^I, x(l) ∈ Δ(I), l ∈ L} and η, δ > 0 such that:

1. Δ(S)^I ⊂ ∪_{l∈L} B(μ(l), δ);
2. for every l ∈ L, if ‖x − x(l)‖ ≤ 2η and ‖μ − μ(l)‖ ≤ 2δ, then x ∈ BR_ε(μ),

where BR_ε(μ) = {x ∈ Δ(I) : G(x, μ) ≥ sup_{z∈Δ(I)} G(z, μ) − ε} is the set of ε-best responses to μ ∈ Δ(S)^I and B(μ, δ) = {μ' ∈ Δ(S)^I : ‖μ' − μ‖ ≤ δ}.

In words, Assumption 1 states that G is regular with respect to μ and with respect to x: given ε, the set of flags can be covered by a finite number of balls centered at the {μ(l)}, such that x(l), a best response to μ(l), is an ε-best response to any μ in this ball. And if x is close enough to x(l), then x is also an ε-best response to any μ close to μ(l).

Theorem 4. If G fulfills Assumption 1, there exist (L, ε)-internally consistent strategies.

Some parts of the proof are quite technical, but the insight is very simple, so we first give the main ideas. Assume that, in the one-stage game, μ ∈ Δ(S)^I is observed by Player 1; then there exists x ∈ Δ(I) such that x ∈ BR(μ). Using a minmax argument, as Blackwell did for the proof of Corollary 1, one could prove that Player 1 has an (L, ε)-internally consistent strategy (as did Lehrer and Solan [13]).
The idea is to use calibration, as in the alternative proof of Corollary 1, to transform this implicit proof into a constructive proof. Fix ε > 0 and assume for the moment that Player 1 observes each μ_n. Consider the game where Player 1 predicts the sequence (μ_n)_{n∈IN} using the δ-grid {μ(l), l ∈ L} given by Assumption 1. A calibrated strategy of Player 1 chooses a sequence (l_n)_{n∈IN} in such a way that μ̄_n(l) is asymptotically δ-close to μ(l). Hence Player 1 just has to play according to x(l) ∈ BR_ε(μ(l)) on these stages.
Indeed, since the choices of action are independent, ī_n(l) will be asymptotically η-close to x(l), and the regularity of G will then imply that ī_n(l) ∈ BR_ε(μ̄_n(l)), and so the strategy will be (L, ε)-internally consistent.
The only issue is that in the current framework the signal depends on the action of Player 1, who does not observe μ_n. The existence of calibrated strategies is therefore not straightforward. However, it is well known that, up to a slight perturbation of x(l), the information available to Player 1 after a long time is close to μ̄_n(l) (as in the multi-armed bandit problem and some calibration and no-regret frameworks; see Chapter 6 in [5] for a survey of these techniques).
For every x ∈ Δ(I), define x^η ∈ Δ(I), the η-perturbation of x, by x^η = (1 − η)x + ηu, with u the uniform probability over I, and for every stage n of type l define s̃_n by:

$$\tilde{s}_n = \left(0, \ldots, 0, \frac{s_n}{x^\eta(l)[i_n]}, 0, \ldots, 0\right),$$

with x^η(l)[i_n] the weight put by x^η(l) on i_n (the only nonzero block is the i_n-th, and s_n is seen as the corresponding Dirac mass), and denote by s̃_n(l) the average of {s̃_m} over m ∈ N_n(l).
Lemma 4. For every θ > 0, there exists N ∈ IN such that, for every l ∈ L:

$$\mathbb{P}_{\sigma,\tau}\left(\forall m \ge n:\ \|\tilde{s}_m(l) - \bar\mu_m(l)\| \le \theta\ \Big|\ |N_n(l)| \ge N\right) \ge 1 - \theta.$$

Proof. Since for every n ∈ IN the choices of i_n and μ_n are independent:

$$\mathbb{E}_{\sigma,\tau}\left[\tilde{s}_n \,\middle|\, h_{n-1}, l_n, \mu_n\right] = \sum_{i\in I}\sum_{s\in S} \mu_n^i[s]\, x^\eta(l_n)[i]\left(0, \ldots, \frac{s}{x^\eta(l_n)[i]}, \ldots, 0\right) = \sum_{i\in I}\left(0, \ldots, \mu_n^i, \ldots, 0\right) = \left(\mu_n^1, \ldots, \mu_n^I\right) = \mu_n.$$

Therefore s̃_n(l) is an unbiased estimator of μ̄_n(l), and the Hoeffding-Azuma inequality implies that for every θ > 0 there exists N ∈ IN such that, for every l ∈ L:

$$\mathbb{P}_{\sigma,\tau}\left(\forall m \ge n:\ \|\tilde{s}_m(l) - \bar\mu_m(l)\| \le \theta\ \Big|\ |N_n(l)| \ge N\right) \ge 1 - \theta. \qquad\square$$
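The unbiasedness above is the familiar importance-weighting trick of the bandit literature: divide the observed signal by the probability of the action that produced it. A small self-contained simulation (our illustration, with a fixed flag and hypothetical names) shows the average of the vectors s̃ converging to μ:

```python
import random

def estimate_flag(x_eta, mu, rounds=200000):
    """Unbiased estimation of a flag mu in Delta(S)^I from single signals.
    x_eta: the perturbed sampling law over actions; mu: one signal law
    per action (used here only to simulate the signals)."""
    I, S = len(mu), len(mu[0])
    est = [[0.0] * S for _ in range(I)]
    for _ in range(rounds):
        i = random.choices(range(I), weights=x_eta)[0]       # action i_n
        s = random.choices(range(S), weights=mu[i])[0]       # signal s_n
        est[i][s] += 1.0 / x_eta[i]     # s_tilde: zero outside block i
    return [[v / rounds for v in row] for row in est]

mu = [[0.7, 0.3], [0.2, 0.8]]          # true flag: one law per action
print(estimate_flag([0.5, 0.5], mu))   # approaches mu as rounds grows
```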


Assume now that Player 1 uses a calibrated strategy to predict the sequence of s̃_n (this game is in full monitoring); then he knows that asymptotically s̃_n(l) is closer to μ(l) than to any μ(k) (as soon as the frequency of l is big enough), therefore it is δ-close to μ(l). Lemma 4 implies that μ̄_n(l) is asymptotically close to s̃_n(l), and therefore 2δ-close to μ(l).
Note that instead of trying to compute the sequence of payoffs from the signals, we consider an auxiliary game defined on the signal space (i.e., the observations), so that this new game is in fact (almost) in full monitoring.

Proof of Theorem 4. Consider the families {x(l) ∈ Δ(I), μ(l) ∈ Δ(S)^I, l ∈ L} and δ > 0 given by Assumption 1 for a fixed ε > 0.
Let Γ' be the auxiliary repeated game where at stage n Player 1 (resp. Player 2) chooses l_n ∈ L (resp. μ_n ∈ Δ(S)^I). Given these choices, i_n (resp. s_n) is drawn according to x^η(l_n) (resp. μ_n^{i_n}). By Lemma 4, for every θ > 0 there exists N¹ ∈ IN such that for every l ∈ L:

$$\mathbb{P}_{\sigma,\tau}\left(\forall m \ge n:\ \|\tilde{s}_m(l) - \bar\mu_m(l)\| \le \theta\ \Big|\ |N_n(l)| \ge N^1\right) \ge 1 - \theta. \tag{10}$$

Let σ be a calibrated strategy associated to (s̃_n)_{n∈IN} in Γ'. For every θ > 0, there exists N² ∈ IN such that, with IP_{σ,τ}-probability greater than 1 − θ:

$$\forall n \ge N^2,\ \forall l, k \in L: \qquad \frac{|N_n(l)|}{n}\left(\|\tilde{s}_n(l) - \mu(l)\|^2 - \|\tilde{s}_n(l) - \mu(k)\|^2\right) \le \theta. \tag{11}$$
n

Since {μ(k), k ∈ L} is a grid of Δ(S)^I, for every n ∈ IN and l ∈ L there exists k ∈ L such that ‖s̃_n(l) − μ(k)‖ ≤ δ. Therefore, combining equations (10) and (11), for every θ > 0 there exists N³ ∈ IN such that:

$$\mathbb{P}_{\sigma,\tau}\left(\forall n \ge N^3,\ \forall l \in L:\ \frac{|N_n(l)|}{n}\left(\|\bar\mu_n(l) - \mu(l)\|^2 - \delta^2\right) \le \theta\right) \ge 1 - \theta. \tag{12}$$
For every stage of type l ∈ L, i_n is drawn according to x^η(l), and by definition ‖x^η(l) − x(l)‖ ≤ η. Therefore the Hoeffding-Azuma inequality implies that for every θ > 0 there exists N⁴ ∈ IN such that:

$$\mathbb{P}_{\sigma,\tau}\left(\forall n \ge N^4,\ \forall l \in L:\ \frac{|N_n(l)|}{n}\left(\|\bar\imath_n(l) - x(l)\| - \eta\right) \le \theta\right) \ge 1 - \theta. \tag{13}$$
Combining equations (12) and (13) and using Assumption 1, for every θ > 0 there exists N ∈ IN such that for every strategy τ of Player 2:

$$\mathbb{P}_{\sigma,\tau}\left(\forall n \ge N,\ \forall l \in L:\ \frac{|N_n(l)|}{n}\left(\bar{R}_n(l) - \varepsilon\right) \le \theta\right) \ge 1 - \theta, \tag{14}$$

and σ is (L, ε)-internally consistent. □

Remark 7. The strategy constructed is based on δ-calibration and the Hoeffding-Azuma inequality, therefore one can show that:

$$\mathbb{E}_{\sigma,\tau}\left[\sup_{l\in L}\ \frac{|N_n(l)|}{n}\left(\bar{R}_n(l) - \varepsilon\right)\right] \le O\left(\sqrt{\frac{\ln n}{n}}\right).$$

4.3 Back to Payoff Space

Assumption 1 can be fulfilled under some continuity assumptions on G:

Proposition 1. Let G : Δ(I) × Δ(S)^I → IR be such that for every μ ∈ Δ(S)^I, G(·, μ) is continuous, and the family of functions {G(x, ·), x ∈ Δ(I)} is equicontinuous. Then G fulfills Assumption 1.

Proof. Since {G(x, ·), x ∈ Δ(I)} is equicontinuous and Δ(S)^I compact, for every ε > 0 there exists δ > 0 such that:

$$\forall x \in \Delta(I),\ \forall \mu, \mu' \in \Delta(S)^I: \qquad \|\mu - \mu'\| \le 2\delta \implies |G(x, \mu) - G(x, \mu')| \le \frac{\varepsilon}{2}.$$

Let {μ(l), l ∈ L} be a finite δ-grid of Δ(S)^I and, for every l ∈ L, let x(l) ∈ BR(μ(l)), so that G(x(l), μ(l)) = max_{z∈Δ(I)} G(z, μ(l)). Since G(·, μ(l)) is continuous, there exists η(l) > 0 such that:

$$\|x - x(l)\| \le \eta(l) \implies |G(x, \mu(l)) - G(x(l), \mu(l))| \le \varepsilon/2.$$


Define η = min_{l∈L} η(l) and let x ∈ Δ(I), μ ∈ Δ(S)^I and l ∈ L be such that ‖x − x(l)‖ ≤ η and ‖μ − μ(l)‖ ≤ δ; then:

$$G(x, \mu) \ge G(x, \mu(l)) - \frac{\varepsilon}{2} \ge G(x(l), \mu(l)) - \varepsilon = \max_{z\in\Delta(I)} G(z, \mu(l)) - \varepsilon,$$

and x ∈ BR_ε(μ). □




This proposition implies that the evaluation function used by Rustichini fulfills Assumption 1 (Lugosi, Mannor and Stoltz [14]). Before proving that, we introduce 𝒮, the range of s, which is a closed convex subset of Δ(S)^I, and Π_𝒮(·), the projection onto it.

Corollary 2. Define G : Δ(I) × Δ(S)^I → IR by:

$$G(x, \mu) = \begin{cases} \inf_{y\in s^{-1}(\mu)} \rho(x, y) & \text{if } \mu \in \mathcal{S},\\ G\left(x, \Pi_{\mathcal{S}}(\mu)\right) & \text{otherwise.}\end{cases}$$

Then G fulfills Assumption 1.



Proof. The function s can be extended linearly to IR^{|J|} by s(y) = Σ_{j∈J} y(j) s(j), where y = (y(j))_{j∈J}. Therefore, by Aubin and Frankowska [1] (Theorem 2.2.1, page 57), the multivalued application s^{−1} : 𝒮 → Δ(J) is λ-Lipschitz, and since Π_𝒮 is 1-Lipschitz (because 𝒮 is convex), G(x, ·) is also λ-Lipschitz for every x ∈ Δ(I). Therefore {G(x, ·), x ∈ Δ(I)} is equicontinuous. For every μ ∈ Δ(S)^I, G(·, μ) is 1-Lipschitz (see [14]), therefore continuous. Hence, by Proposition 1, G fulfills Assumption 1. □


Concluding Remarks

The definitions and proofs rely only on Assumption 1: it is not relevant to assume that Player 1 faces only one opponent, nor that the action set of his opponent is finite. The only requirement is that, given his information (a probability in Δ(I) and a flag in Δ(S)^I), Player 1 can evaluate his payoff, no matter how this payoff is obtained: for example, we could have assumed that Player 2 chooses at each stage an (unobserved) outcome vector U ∈ [−1, 1]^{|I|} and Player 1 chooses a coordinate, which is his observed payoff.
In the full monitoring framework, many improvements have been made in the past years on calibration and regret (see for instance [12,16,18]). Here, we aimed to clarify the links between the original notions of approachability, internal regret and calibration in order to extend applications (in particular, to get rid of the finiteness of J), to define internal regret (with signals) as calibration over an appropriate space, and to give a proof derived from no internal regret (in full monitoring), itself derived from the approachability of an orthant in this space.
Acknowledgments. I deeply thank my advisor Sylvain Sorin for his great help and numerous comments. I also acknowledge helpful remarks from Eilon Solan and Gilles Stoltz.
References
1. Aubin, J.-P., Frankowska, H.: Set-valued Analysis. Birkhäuser Boston Inc., Basel
(1990)
2. Azuma, K.: Weighted sums of certain dependent random variables. Tôhoku Math.
J. 19(2), 357–367 (1967)
3. Blackwell, D.: An analog of the minimax theorem for vector payoffs. Pacific J.
Math. 6, 1–8 (1956)
4. Blackwell, D.: Controlled random walks. In: Proceedings of the International
Congress of Mathematicians, 1954, Amsterdam, vol. III, pp. 336–338 (1956)
5. Cesa-Bianchi, N., Lugosi, G.: Prediction, Learning, and Games. Cambridge Uni-
versity Press, Cambridge (2006)
6. Foster, D.P., Vohra, R.V.: Asymptotic calibration. Biometrika 85, 379–390 (1998)
7. Foster, D.P., Vohra, R.V.: Regret in the on-line decision problem. Games Econom.
Behav. 29, 7–35 (1999)
8. Fudenberg, D., Levine, D.K.: Conditional universal consistency. Games Econom.
Behav. 29, 104–130 (1999)
9. Hannan, J.: Approximation to Bayes risk in repeated play. In: Contributions to the
theory of Games. Annals of Mathematics Studies, vol. 3(39), pp. 97–139. Princeton
University Press, Princeton (1957)
10. Hart, S., Mas-Colell, A.: A simple adaptive procedure leading to correlated equi-
librium. Econometrica 68, 1127–1150 (2000)
11. Hoeffding, W.: Probability inequalities for sums of bounded random variables.
J. Amer. Statist. Assoc. 58, 13–30 (1963)
12. Lehrer, E.: A wide range no-regret theorem. Games Econom. Behav. 42, 101–115
(2003)
13. Lehrer, E., Solan, E.: Learning to play partially-specified equilibrium (manuscript,
2007)
14. Lugosi, G., Mannor, S., Stoltz, G.: Strategies for prediction under imperfect
monitoring. Math. Oper. Res. 33, 513–528 (2008)
15. Rustichini, A.: Minimizing regret: the general case. Games Econom. Behav. 29,
224–243 (1999)
16. Sandroni, A., Smorodinsky, R., Vohra, R.V.: Calibration with many checking rules.
Math. Oper. Res. 28, 141–153 (2003)
17. Sorin, S.: Lectures on Dynamics in Games. Unpublished Lecture Notes (2008)
18. Vovk, V.: Non-asymptotic calibration and resolution. Theoret. Comput. Sci. 387,
77–89 (2007)
St. Petersburg Portfolio Games

László Györfi and Péter Kevei

Department of Computer Science and Information Theory
Budapest University of Technology and Economics
Magyar Tudósok körútja 2., Budapest, Hungary, H-1117
gyorfi@szit.bme.hu
Analysis and Stochastics Research Group
Hungarian Academy of Sciences
Aradi vértanúk tere 1, Szeged, Hungary, H-6720
kevei@math.u-szeged.hu

Abstract. We investigate the performance of constantly rebalanced portfolios when the random vectors of the market process {X_i} are independent and each of them is distributed as (X^(1), X^(2), . . . , X^(d), 1), d ≥ 1, where X^(1), X^(2), . . . , X^(d) are nonnegative i.i.d. random variables. Under general conditions we show that the optimal strategy is the uniform one, (1/d, . . . , 1/d, 0), at least for d large enough. In the case of St. Petersburg components we compute the average growth rate and the optimal strategy for d = 1, 2. In order to make the problem non-trivial, a commission factor is introduced and tuned to result in zero growth rate on any individual St. Petersburg component. One of the interesting observations made is that a combination of two components of zero growth can result in a strictly positive growth. For d ≥ 3 we prove that the uniform strategy is the best, and we obtain tight asymptotic results for the growth rate.

1 Constantly Rebalanced Portfolio

Consider a hypothetical investor who can access d financial instruments (asset, bond, cash, return of a game, etc.), and who can rebalance his wealth in each round according to a portfolio vector b = (b^(1), . . . , b^(d)). The j-th component b^(j) of b denotes the proportion of the investor's capital invested in financial instrument j. We assume that the portfolio vector b has nonnegative components that sum up to 1. The nonnegativity assumption means that short selling is not allowed, while the latter condition means that our investor neither consumes nor deposits new cash into his portfolio, but reinvests it in each round. The set of portfolio vectors is denoted by

$$\Delta_d = \left\{ \mathbf{b} = \left(b^{(1)}, \ldots, b^{(d)}\right);\ b^{(j)} \ge 0,\ \sum_{j=1}^{d} b^{(j)} = 1 \right\}.$$


The first author acknowledges the support of the Computer and Automation Research Institute of the Hungarian Academy of Sciences. The work was supported in part by the Hungarian Scientific Research Fund, Grant T-048360.
The behavior of the market is given by the sequence of return vectors {x_n}, x_n = (x_n^(1), . . . , x_n^(d)), such that the j-th component x_n^(j) of the return vector x_n denotes the amount obtained after investing a unit capital in the j-th financial instrument in the n-th round.
Let S_0 denote the investor's initial capital. Then at the beginning of the first round S_0 b_1^(j) is invested into financial instrument j, and it results in return S_0 b_1^(j) x_1^(j); therefore at the end of the first round the investor's wealth becomes

$$S_1 = S_0 \sum_{j=1}^{d} b_1^{(j)} x_1^{(j)} = S_0\,\langle \mathbf{b}_1, \mathbf{x}_1\rangle,$$

where ⟨· , ·⟩ denotes inner product. For the second round, b_2 is the new portfolio and S_1 is the new initial capital, so

$$S_2 = S_1 \cdot \langle \mathbf{b}_2, \mathbf{x}_2\rangle = S_0 \cdot \langle \mathbf{b}_1, \mathbf{x}_1\rangle \cdot \langle \mathbf{b}_2, \mathbf{x}_2\rangle.$$

By induction, for round n the initial capital is S_{n−1}, therefore

$$S_n = S_{n-1}\,\langle \mathbf{b}_n, \mathbf{x}_n\rangle = S_0 \prod_{i=1}^{n} \langle \mathbf{b}_i, \mathbf{x}_i\rangle. \tag{1}$$

Of course the problem is to find the optimal investment strategy for the long run, that is, to maximize S_n in some sense. The best strategy depends on the optimality criterion. A naive attitude is to maximize the expected return in each round. This leads to the risky strategy of investing all the money into the financial instrument j with E X_n^(j) = max{E X_n^(i) : i = 1, 2, . . . , d}, where X_n = (X_n^(1), X_n^(2), . . . , X_n^(d)) is the market vector in the n-th round. Since the random variable X_n^(j) can be 0 with positive probability, repeated application of this strategy leads to quick bankruptcy. The underlying phenomenon is the simple fact that E(S_n) may increase exponentially while S_n → 0 almost surely. A more delicate optimality criterion was introduced by Breiman [3]: in each round we maximize the expectation E ln⟨b, X_n⟩ over b ∈ Δ_d. This is the so-called log-optimal portfolio, which is optimal under general conditions [3].
If the market process {X_i} is memoryless, i.e., it is a sequence of independent and identically distributed (i.i.d.) random return vectors, then the log-optimal portfolio vector is the same in each round:

$$\mathbf{b}^* := \arg\max_{\mathbf{b}\in\Delta_d} \mathbb{E}\{\ln \langle \mathbf{b}, \mathbf{X}_1\rangle\}.$$

In the case of a constantly rebalanced portfolio (CRP) we fix a portfolio vector b ∈ Δ_d. In this special case, according to (1) we get S_n = S_0 ∏_{i=1}^n ⟨b, x_i⟩, and so the average growth rate of this portfolio selection is

$$\frac{1}{n}\ln S_n = \frac{1}{n}\ln S_0 + \frac{1}{n}\sum_{i=1}^{n} \ln\langle \mathbf{b}, \mathbf{x}_i\rangle;$$

therefore, without loss of generality, we can assume in the sequel that the initial capital is S_0 = 1.
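As a toy illustration of (1) and of the average growth rate (our own sketch, not from the paper), the following Python code evaluates a CRP on a two-instrument market; it already exhibits the phenomenon studied below, namely that rebalancing between a zero-growth risky instrument and cash can produce strictly positive growth:

```python
import math

def crp_growth_rate(b, returns):
    """Average growth rate (1/n) ln S_n of a constantly rebalanced
    portfolio b over a sequence of return vectors, with S_0 = 1."""
    log_wealth = 0.0
    for x in returns:
        log_wealth += math.log(sum(bj * xj for bj, xj in zip(b, x)))
    return log_wealth / len(returns)

# two instruments: a risky one that alternately doubles and halves, and cash
returns = [(2.0, 1.0), (0.5, 1.0)] * 500
print(crp_growth_rate((0.5, 0.5), returns))   # > 0: rebalancing gains
print(crp_growth_rate((1.0, 0.0), returns))   # = 0: buy-and-hold
```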
The optimality of b* means that if S_n* = S_n(b*) denotes the capital after round n achieved by a log-optimal portfolio strategy b*, then for any portfolio strategy b with finite E{ln⟨b, X_1⟩}, with capital S_n = S_n(b), and for any memoryless market process {X_n}_{n=1}^∞:

$$\lim_{n\to\infty} \frac{1}{n}\ln S_n \le \lim_{n\to\infty} \frac{1}{n}\ln S_n^* \qquad\text{almost surely},$$

and the maximal asymptotic average growth rate is

$$\lim_{n\to\infty} \frac{1}{n}\ln S_n^* = W^* := \mathbb{E}\{\ln\langle \mathbf{b}^*, \mathbf{X}_1\rangle\} \qquad\text{almost surely}.$$
The proof of the optimality is a simple consequence of the strong law of large numbers. Introduce the notation

$$W(\mathbf{b}) = \mathbb{E}\{\ln\langle \mathbf{b}, \mathbf{X}_1\rangle\}.$$

Then the strong law of large numbers implies that

$$\frac{1}{n}\ln S_n = \frac{1}{n}\sum_{i=1}^{n}\ln\langle \mathbf{b}, \mathbf{X}_i\rangle = W(\mathbf{b}) + \frac{1}{n}\sum_{i=1}^{n}\left(\ln\langle \mathbf{b}, \mathbf{X}_i\rangle - \mathbb{E}\{\ln\langle \mathbf{b}, \mathbf{X}_i\rangle\}\right) \to W(\mathbf{b}) \qquad\text{almost surely}.$$

Similarly,

$$\lim_{n\to\infty}\frac{1}{n}\ln S_n^* = W(\mathbf{b}^*) = \max_{\mathbf{b}} W(\mathbf{b}) \qquad\text{almost surely}.$$

In connection with CRPs in a more general setup we refer to Kelly [8] and Breiman [3].
In the following we assume that the i.i.d. random vectors {X_i} have the general form X = (X^(1), X^(2), . . . , X^(d), X^(d+1)), where X^(1), X^(2), . . . , X^(d) are nonnegative i.i.d. random variables and X^(d+1) is the cash, that is, X^(d+1) ≡ 1, and d ≥ 1. Then the concavity of the logarithm and the symmetry of the first d components immediately imply that the log-optimal portfolio has the form b = (b, b, . . . , b, 1 − db), where of course 0 ≤ b ≤ 1/d. When does b = 1/d correspond to the optimal strategy, that is, when should we play with all our money? In our special case W has the form

$$W(b) = \mathbb{E}\,\ln\left(b\sum_{i=1}^{d} X^{(i)} + 1 - bd\right).$$
Denote Z_d = Σ_{i=1}^d X^(i). Interchanging the order of integration and differentiation, we obtain

$$\frac{d}{db} W(b) = \frac{d}{db}\, \mathbb{E}\,\ln\left(b\sum_{i=1}^{d} X^{(i)} + 1 - bd\right) = \mathbb{E}\left[\frac{Z_d - d}{b Z_d + 1 - bd}\right].$$

For b = 0 we have W'(0) = E(Z_d) − d, which is nonnegative if and only if E(X^(1)) ≥ 1. This implies the intuitively clear statement that we should take any risk at all if and only if the expectation of the game is not less than one. Otherwise the optimal strategy is to keep all our wealth in cash. The function W(·) is concave, therefore the maximum is at b = 1/d if W'(1/d) ≥ 0, which means that

$$\mathbb{E}\left[\frac{d}{Z_d}\right] \le 1. \tag{2}$$

According to the strong law of large numbers, d/Z_d → 1/E(X^(1)) a.s. as d → ∞; thus, under some additional assumptions on the underlying variables, E(d/Z_d) → 1/E(X^(1)) as d → ∞. Therefore if E(X^(1)) > 1, then for d large enough the optimal strategy is (1/d, . . . , 1/d, 0).
In the latter computations we tacitly assumed some regularity conditions, namely that we can interchange the order of differentiation and integration and that we can take the L¹-limit instead of the almost sure limit. One can show that these conditions are satisfied if the underlying random variables have strictly positive infimum. We skip the technical details.

2 St. Petersburg Game

2.1 Iterated St. Petersburg Game

Consider the simple St. Petersburg game, where the player invests 1$ and a fair coin is tossed until a tail first appears, ending the game. If the first tail appears in step k then the payoff X is 2^k, and the probability of this event is 2^{−k}:

$$\mathbb{P}\{X = 2^k\} = 2^{-k}. \tag{3}$$

The distribution function of the gain is

$$F(x) = \mathbb{P}\{X \le x\} = \begin{cases} 0, & \text{if } x < 2,\\[2pt] 1 - \dfrac{1}{2^{\lfloor \log_2 x\rfloor}} = 1 - \dfrac{2^{\{\log_2 x\}}}{x}, & \text{if } x \ge 2, \end{cases} \tag{4}$$

where ⌊x⌋ is the usual integer part of x and {x} stands for its fractional part.
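For simulation purposes the distribution (3) is easy to sample; the following sketch (ours) also makes the infinite expectation visible, as the empirical mean keeps growing with the sample size:

```python
import random

def st_petersburg():
    """One simple St. Petersburg payoff: toss a fair coin until the
    first tail; if it appears at toss k, the payoff is 2**k."""
    k = 1
    while random.random() < 0.5:   # heads with probability 1/2
        k += 1
    return 2 ** k

for n in (10**3, 10**5):           # E X = infinity: the mean diverges
    print(n, sum(st_petersburg() for _ in range(n)) / n)
```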
Since E{X} = ∞, this game has delicate properties (cf. Aumann [1], Bernoulli [2], Haigh [7], and Samuelson [10]). In the literature, the repeated St. Petersburg game (also called the iterated St. Petersburg game) usually means the multi-period game consisting of a sequence of simple St. Petersburg games, where in each round the player invests 1$. Let X_n denote the payoff of the n-th simple game.
Assume that the sequence {X_n}_{n=1}^∞ is i.i.d. After n rounds the player's gain in the repeated game is S̄_n = Σ_{i=1}^n X_i; then

$$\lim_{n\to\infty} \frac{\bar{S}_n}{n\log_2 n} = 1$$

in probability, where log₂ denotes the logarithm to base 2 (cf. Feller [6]). Moreover,

$$\liminf_{n\to\infty} \frac{\bar{S}_n}{n\log_2 n} = 1 \quad\text{a.s.}\qquad\text{and}\qquad \limsup_{n\to\infty} \frac{\bar{S}_n}{n\log_2 n} = \infty \quad\text{a.s.}$$

(cf. Chow and Robbins [4]). Introducing the notation X_n* = max_{1≤i≤n} X_i for the largest payoff, and

$$S_n^* = \sum_{i=1}^{n} X_i - X_n^* = \bar{S}_n - X_n^*$$

for the sum with the largest payoff withheld, one has that

$$\lim_{n\to\infty} \frac{S_n^*}{n\log_2 n} = 1 \quad\text{a.s.}$$

(cf. Csörgő and Simons [5]).

2.2 Sequential St. Petersburg Game

According to the previous results, S̄_n ≈ n log₂ n. Next we introduce the sequential St. Petersburg game, which has exponential growth. In the sequential St. Petersburg game the player starts with initial capital S_0 = 1$, there is a sequence of simple St. Petersburg games, and for each simple game the player reinvests his capital. If S_{n−1}^{(c)} is the capital after the (n − 1)-th simple game, then the invested capital is S_{n−1}^{(c)}(1 − c), while S_{n−1}^{(c)} c is the proportional cost of the simple game, with commission factor 0 < c < 1. It means that after the n-th round the capital is

$$S_n^{(c)} = S_{n-1}^{(c)}(1 - c) X_n = S_0 (1 - c)^n \prod_{i=1}^{n} X_i = (1 - c)^n \prod_{i=1}^{n} X_i.$$

Because of its multiplicative definition, S_n^{(c)} has an exponential trend:

$$S_n^{(c)} = 2^{n W_n^{(c)}} \approx 2^{n W^{(c)}},$$

with average growth rate

$$W_n^{(c)} := \frac{1}{n}\log_2 S_n^{(c)}$$

and with asymptotic average growth rate

$$W^{(c)} := \lim_{n\to\infty} \frac{1}{n}\log_2 S_n^{(c)}.$$
Let’s calculate the the asymptotic average growth rate. Because of

1 1 
n
Wn(c) = log2 Sn(c) = n log2 (1 − c) + log2 Xi ,
n n i=1

the strong law of large numbers implies that

1
n
W (c) = log2 (1 − c) + lim log2 Xi = log2 (1 − c) + E{log2 X1 }
n→∞ n
i=1

a.s., so W (c) can be calculated via expected log-utility (cf. Kenneth [9]). A com-
mission factor c is called fair if

W (c) = 0,

so the growth rate of the sequential game is 0. Let’s calculate the fair c:


log2 (1 − c) = −E{log2 X1 } = − k · 2−k = −2,
k=1

i.e., c = 3/4.
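A quick numerical sanity check of the fair factor (our illustration): since E{log₂ X} = 2 is finite, the strong law applies to W_n^{(c)}, so a Monte Carlo estimate of W^{(c)} at c = 3/4 should be close to 0.

```python
import math, random

def payoff():                       # simple St. Petersburg payoff
    k = 1
    while random.random() < 0.5:
        k += 1
    return 2 ** k

c = 0.75
# exact value: W(c) = log2(1 - c) + E log2 X = -2 + 2 = 0
n = 100000
w = sum(math.log2((1 - c) * payoff()) for _ in range(n)) / n
print(w)   # close to 0: zero growth at the fair commission factor
```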

2.3 Portfolio Game with One or Two St. Petersburg Components

Consider the portfolio game where a fraction of the capital is invested in simple fair St. Petersburg games and the rest is kept in cash, i.e., it is a CRP problem with the return vector

$$\mathbf{X} = \left(X^{(1)}, \ldots, X^{(d)}, X^{(d+1)}\right) = \left(X_1', \ldots, X_d', 1\right)$$

(d ≥ 1), such that the first d i.i.d. components of the return vector X are of the form

$$\mathbb{P}\{X' = 2^{k-2}\} = 2^{-k} \tag{5}$$

(k ≥ 1), while the last component is the cash. (Here X' = (1 − c)X = X/4 is the payoff of a fair simple game after commission.) The main aim is to calculate the largest growth rate W_d*.

Proposition 1. We have that W_1* = 0.149 and W_2* = 0.289.

Proof. For d = 1, fix a portfolio vector b = (b, 1 − b), with 0 ≤ b ≤ 1. The asymptotic average growth rate of this portfolio game is

$$W(b) = \mathbb{E}\{\log_2\langle \mathbf{b}, \mathbf{X}\rangle\} = \mathbb{E}\{\log_2(bX' + 1 - b)\} = \mathbb{E}\{\log_2(b(X/4 - 1) + 1)\}.$$

The function log₂ is concave, therefore W(b) is concave, too, so W(0) = 0 (keep everything in cash) and W(1) = 0 (the simple game is fair) imply that W(b) > 0 for all 0 < b < 1. Let us calculate max_b W(b). We have

$$W(b) = \sum_{k=1}^{\infty} \log_2\left(b\left(2^k/4 - 1\right) + 1\right)\cdot 2^{-k} = \log_2(1 - b/2)\cdot 2^{-1} + \sum_{k=3}^{\infty} \log_2\left(b\left(2^{k-2} - 1\right) + 1\right)\cdot 2^{-k}.$$

Figure 1 shows the curve of the average growth rate of the portfolio game. The function W(·) attains its maximum at b = 0.385, that is,

$$\mathbf{b}^* = (0.385,\ 0.615),$$

where the growth rate is W_1* = W(0.385) = 0.149. It means that if in each round of the game one reinvests 38.5% of his capital, so that the real investment is 9.6% while the cost is 28.9%, then the capital is multiplied by 2^{0.149} ≈ 1.109 per round; i.e., the portfolio game with two components of zero growth rate (fair St. Petersburg game and cash) results in a growth of approximately 10.9% per round.

Fig. 1. The growth rate W(b) for one St. Petersburg component, plotted for b in [0, 1]
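The maximization behind Figure 1 can be reproduced with a few lines of Python (a sketch of ours; the truncation at 60 terms is our choice, harmless since the k-th term is of order k 2^{−k}):

```python
import math

def W(b, terms=60):
    """W(b) = sum_{k>=1} log2(b*(2**k/4 - 1) + 1) * 2**-k,
    the growth rate with one fair St. Petersburg component."""
    return sum(math.log2(b * (2**k / 4 - 1) + 1) * 2**-k
               for k in range(1, terms))

best_b = max((i / 1000 for i in range(1000)), key=W)
print(best_b, W(best_b))   # approximately 0.385 and 0.149
```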

Consider next d = 2. At the end of Section 1 we proved that the log-optimal portfolio vector has the form b = (b, b, 1 − 2b), with 0 ≤ b ≤ 1/2. The asymptotic average growth rate of this portfolio game is

$$W(b) = \mathbb{E}\{\log_2\langle \mathbf{b}, \mathbf{X}\rangle\} = \mathbb{E}\{\log_2(bX_1' + bX_2' + 1 - 2b)\} = \mathbb{E}\{\log_2(b((X_1 + X_2)/4 - 2) + 1)\}.$$
Fig. 2. The growth rate for two St. Petersburg components

Figure 2 shows the curve of the average growth rate of the portfolio game. Numerically we can determine that the maximum is attained at b = 0.364, so b* = (0.364, 0.364, 0.272), where the growth rate is W_2* = W(0.364) = 0.289. □

2.4 Portfolio Game with at Least Three St. Petersburg Components

Consider the portfolio game with d ≥ 3 St. Petersburg components. We saw that the log-optimal portfolio has the form b = (b, . . . , b, 1 − db) with b ≤ 1/d.

Proposition 2. For d ≥ 3, we have

$$\mathbf{b}^* = (1/d, \ldots, 1/d, 0).$$

Proof. Using the notations at the end of Section 1, we have to prove the inequality

$$\frac{d}{db} W(1/d) \ge 0.$$

According to (2) this is equivalent to

$$\mathbb{E}\left[\frac{d}{X_1' + \cdots + X_d'}\right] \le 1.$$

For d = 3, 4, 5 one can check this inequality numerically; see the sketch after the proof. It remains to prove the proposition for d ≥ 6, which means that

$$1 \ge \mathbb{E}\left[\frac{1}{\frac{1}{d}\sum_{i=1}^{d} X_i'}\right]. \tag{6}$$
We use induction on d. Assume that (6) holds up to d − 1. Choose integers d₁ ≥ 3 and d₂ ≥ 3 such that d = d₁ + d₂. Then

$$\frac{1}{\frac{1}{d}\sum_{i=1}^{d} X_i'} = \frac{1}{\frac{d_1}{d}\cdot\frac{1}{d_1}\sum_{i=1}^{d_1} X_i' + \frac{d_2}{d}\cdot\frac{1}{d_2}\sum_{i=d_1+1}^{d} X_i'},$$

therefore the Jensen inequality (applied to the convex function z ↦ 1/z) implies that

$$\frac{1}{\frac{1}{d}\sum_{i=1}^{d} X_i'} \le \frac{d_1}{d}\cdot\frac{1}{\frac{1}{d_1}\sum_{i=1}^{d_1} X_i'} + \frac{d_2}{d}\cdot\frac{1}{\frac{1}{d_2}\sum_{i=d_1+1}^{d} X_i'},$$

and so

$$\mathbb{E}\left[\frac{1}{\frac{1}{d}\sum_{i=1}^{d} X_i'}\right] \le \frac{d_1}{d}\,\mathbb{E}\left[\frac{1}{\frac{1}{d_1}\sum_{i=1}^{d_1} X_i'}\right] + \frac{d_2}{d}\,\mathbb{E}\left[\frac{1}{\frac{1}{d_2}\sum_{i=d_1+1}^{d} X_i'}\right] \le \frac{d_1}{d} + \frac{d_2}{d} = 1,$$

where the last inequality follows from the induction hypothesis. □
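The base cases d = 3, 4, 5 invoked above can be checked by a short Monte Carlo computation (our sketch; it is reliable because each component is at least 1/2, so the integrand d/(X'₁ + ⋯ + X'_d) is bounded by 2):

```python
import random

def component():
    """One portfolio component: 2**(k-2) with probability 2**-k."""
    k = 1
    while random.random() < 0.5:
        k += 1
    return 2 ** (k - 2)

random.seed(0)
n = 200000
for d in (3, 4, 5):
    est = sum(d / sum(component() for _ in range(d)) for _ in range(n)) / n
    print(d, est)   # estimates of E[d/(X'_1+...+X'_d)], all below 1
```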

2.5 Portfolio Game with Many St. Petersburg Components

For d ≥ 3 the best portfolio is the uniform one, with asymptotic average growth rate

$$W_d^* = \mathbb{E}\log_2\left(\frac{1}{d}\sum_{i=1}^{d} X_i'\right) = \mathbb{E}\log_2\left(\frac{1}{4d}\sum_{i=1}^{d} X_i\right).$$

First we compute this growth rate numerically for small values of d; then we determine the exact asymptotic growth rate as d → ∞.
For arbitrary d ≥ 2, by (3) we may write

$$\mathbb{E}\log_2\left(\sum_{i=1}^{d} X_i\right) = \sum_{i_1, i_2, \ldots, i_d = 1}^{\infty} \frac{\log_2\left(2^{i_1} + 2^{i_2} + \cdots + 2^{i_d}\right)}{2^{i_1 + i_2 + \cdots + i_d}}.$$

A straightforward calculation shows that for d ≤ 8, summing from 1 to 20 in each index independently, that is, taking only 20^d terms, the error is less than 1/1000. Here are the first few values:

d      1      2      3      4      5      6      7      8
W_d*   0.149  0.289  0.421  0.526  0.606  0.669  0.721  0.765

Notice that W_1* and W_2* come from Section 2.3.
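For d ≥ 3 the table can also be approximated by direct simulation of the uniform portfolio (a sketch of ours; for d = 1, 2 the optimum is not uniform, so those two entries require the maximization of Section 2.3 instead):

```python
import math, random

def payoff():                        # simple St. Petersburg payoff 2**k
    k = 1
    while random.random() < 0.5:
        k += 1
    return 2 ** k

random.seed(0)
n = 100000
for d in range(3, 9):                # W_d^* = E log2( sum X_i / (4d) )
    w = sum(math.log2(sum(payoff() for _ in range(d)) / (4 * d))
            for _ in range(n)) / n
    print(d, round(w, 3))
```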
Now we return to the asymptotic results.

Theorem 1. For the asymptotic behavior of the average growth rate we have

$$-\frac{0.8}{\ln 2}\,\frac{1}{\log_2 d}\ \le\ W_d^* - \left(\log_2\log_2 d - 2\right)\ \le\ \frac{\log_2\log_2 d + 4}{\ln 2\,\log_2 d}.$$

Proof. Because of

$$W_d^* = \mathbb{E}\log_2\left(\frac{1}{4d}\sum_{i=1}^{d} X_i\right) = \mathbb{E}\log_2\frac{\sum_{i=1}^{d} X_i}{d\log_2 d} + \log_2\log_2 d - 2,$$

we have to show that

$$-\frac{0.8}{\ln 2}\,\frac{1}{\log_2 d}\ \le\ \mathbb{E}\log_2\frac{\sum_{i=1}^{d} X_i}{d\log_2 d}\ \le\ \frac{\log_2\log_2 d + 4}{\ln 2\,\log_2 d}.$$
Concerning the upper bound in the theorem, use the decomposition

$$\log_2\frac{\sum_{i=1}^{d} X_i}{d\log_2 d} = \log_2\frac{\sum_{i=1}^{d} \tilde{X}_i}{d\log_2 d} + \log_2\frac{\sum_{i=1}^{d} X_i}{\sum_{i=1}^{d} \tilde{X}_i},$$

where

$$\tilde{X}_i = \begin{cases} X_i, & \text{if } X_i \le d\log_2 d,\\ d\log_2 d, & \text{otherwise.}\end{cases}$$

We prove that

$$\mathbb{E}\log_2\frac{\sum_{i=1}^{d} \tilde{X}_i}{d\log_2 d} \le \frac{\log_2\log_2 d + 2}{\ln 2\,\log_2 d} \tag{7}$$

and

$$0 \le \mathbb{E}\log_2\frac{\sum_{i=1}^{d} X_i}{\sum_{i=1}^{d} \tilde{X}_i} \le \frac{2}{\ln 2\,\log_2 d}. \tag{8}$$
For (8), we have that

$$\mathbb{P}\left\{\log_2\frac{\sum_{i=1}^{d} X_i}{\sum_{i=1}^{d} \tilde{X}_i} \ge x\right\} = \mathbb{P}\left\{\sum_{i=1}^{d} X_i \ge 2^x \sum_{i=1}^{d} \tilde{X}_i\right\} \le \mathbb{P}\{\exists\, i \le d:\ X_i \ge 2^x \tilde{X}_i\} = \mathbb{P}\{\exists\, i \le d:\ X_i \ge 2^x \min\{X_i, d\log_2 d\}\} = \mathbb{P}\{\exists\, i \le d:\ X_i \ge 2^x d\log_2 d\} \le d\,\mathbb{P}\{X \ge 2^x d\log_2 d\} \le d\,\frac{2}{2^x d\log_2 d},$$

where we used that P{X ≥ x} ≤ 2/x, which is an immediate consequence of (4). Therefore

$$\mathbb{E}\log_2\frac{\sum_{i=1}^{d} X_i}{\sum_{i=1}^{d} \tilde{X}_i} = \int_0^{\infty} \mathbb{P}\left\{\log_2\frac{\sum_{i=1}^{d} X_i}{\sum_{i=1}^{d} \tilde{X}_i} \ge x\right\} dx \le \int_0^{\infty} \frac{2}{2^x \log_2 d}\, dx = \frac{2}{\ln 2\,\log_2 d},$$

and the proof of (8) is finished. For (7), put l = ⌊log₂(d log₂ d)⌋. Then for the expectation of the truncated variable we have

$$\mathbb{E}(\tilde{X}_1) = \sum_{k=1}^{l} 2^k\,\frac{1}{2^k} + d\log_2 d \sum_{k=l+1}^{\infty}\frac{1}{2^k} = l + \frac{d\log_2 d}{2^{\lfloor\log_2(d\log_2 d)\rfloor}} \le l + 2.$$

Thus,

$$\mathbb{E}\log_2\frac{\sum_{i=1}^{d} \tilde{X}_i}{d\log_2 d} = \frac{1}{\ln 2}\,\mathbb{E}\ln\frac{\sum_{i=1}^{d} \tilde{X}_i}{d\log_2 d} \le \frac{1}{\ln 2}\,\mathbb{E}\left[\frac{\sum_{i=1}^{d} \tilde{X}_i}{d\log_2 d} - 1\right] = \frac{1}{\ln 2}\left(\frac{\mathbb{E}\{\tilde{X}_1\}}{\log_2 d} - 1\right) \le \frac{1}{\ln 2}\left(\frac{l + 2}{\log_2 d} - 1\right) = \frac{1}{\ln 2}\left(\frac{\lfloor\log_2(d\log_2 d)\rfloor + 2}{\log_2 d} - 1\right) \le \frac{1}{\ln 2}\left(\frac{\log_2 d + \log_2\log_2 d + 2}{\log_2 d} - 1\right) = \frac{1}{\ln 2}\,\frac{\log_2\log_2 d + 2}{\log_2 d},$$

where the first inequality uses ln z ≤ z − 1.

Concerning the lower bound in the theorem, consider the decomposition

$$\log_2\frac{\sum_{i=1}^{d} X_i}{d\log_2 d} = \left(\log_2\frac{\sum_{i=1}^{d} X_i}{d\log_2 d}\right)^+ - \left(\log_2\frac{\sum_{i=1}^{d} X_i}{d\log_2 d}\right)^-.$$

On the one hand, for arbitrary ε > 0 we have

$$\mathbb{P}\left\{\frac{\sum_{i=1}^{d} X_i}{d\log_2 d} \le 2^x\right\} \le \mathbb{P}\left\{\text{for all } i \le d:\ X_i \le 2^x d\log_2 d\right\} = \left(\mathbb{P}\{X \le 2^x d\log_2 d\}\right)^d \le \left(1 - \frac{1}{2^x d\log_2 d}\right)^d \le e^{-\frac{1}{2^x\log_2 d}} \le 1 - \frac{1-\varepsilon}{2^x\log_2 d}$$

for d large enough, where we used the inequality e^{−z} ≤ 1 − (1 − ε)z, which holds for z ≤ −ln(1 − ε). Thus

$$\mathbb{P}\left\{\frac{\sum_{i=1}^{d} X_i}{d\log_2 d} > 2^x\right\} \ge \frac{1-\varepsilon}{2^x\log_2 d},$$

which implies that

$$\mathbb{E}\left\{\left(\log_2\frac{\sum_{i=1}^{d} X_i}{d\log_2 d}\right)^+\right\} = \int_0^{\infty} \mathbb{P}\left\{\log_2\frac{\sum_{i=1}^{d} X_i}{d\log_2 d} > x\right\} dx = \int_0^{\infty} \mathbb{P}\left\{\frac{\sum_{i=1}^{d} X_i}{d\log_2 d} > 2^x\right\} dx \ge \int_0^{\infty} \frac{1-\varepsilon}{2^x\log_2 d}\, dx = \frac{1}{\log_2 d}\,\frac{1-\varepsilon}{\ln 2}.$$

Since ε is arbitrary, we obtain

$$\mathbb{E}\left\{\left(\log_2\frac{\sum_{i=1}^{d} X_i}{d\log_2 d}\right)^+\right\} \ge \frac{1}{\log_2 d}\,\frac{1}{\ln 2}.$$

For the estimation of the negative part we use another truncation method. Now we cut the variable at d, so put

$$\hat{X}_i = \begin{cases} X_i, & \text{if } X_i \le d,\\ d, & \text{otherwise.}\end{cases}$$

Introduce also the notations Ŝ_d = Σ_{i=1}^d X̂_i and c_d = E(X̂_1)/log₂ d. Similar computations as before show that

$$\mathbb{E}(\hat{X}_1) = \lfloor\log_2 d\rfloor + \frac{d}{2^{\lfloor\log_2 d\rfloor}} = \log_2 d + 2^{\{\log_2 d\}} - \{\log_2 d\}$$

and

$$\mathbb{E}\left(\hat{X}_1^2\right) \le 2\left(2^{\lfloor\log_2 d\rfloor} - 1\right) + \frac{d^2}{2^{\lfloor\log_2 d\rfloor}} < d\left(2^{1-\{\log_2 d\}} + 2^{\{\log_2 d\}}\right) \le 3d,$$

where we used that 2√2 ≤ 2^{1−y} + 2^y ≤ 3 for y ∈ [0, 1]; this can be proved easily. Simple analysis shows again that 0.9 ≤ 2^y − y ≤ 1 for y ∈ [0, 1], and so for c_d − 1 we obtain

$$\frac{0.9}{\log_2 d} < c_d - 1 < \frac{1}{\log_2 d}.$$
 
Since Σ_{i=1}^d X_i ≥ Σ_{i=1}^d X̂_i, we have

$$\mathbb{E}\left\{\left(\log_2\frac{\sum_{i=1}^{d} X_i}{d\log_2 d}\right)^-\right\} \le \mathbb{E}\left\{\left(\log_2\frac{\sum_{i=1}^{d} \hat{X}_i}{d\log_2 d}\right)^-\right\}.$$

Noticing that

$$\log_2\frac{\hat{S}_d}{d\log_2 d} > \log_2\frac{2d}{d\log_2 d} = 1 - \log_2\log_2 d,$$

we obtain

$$\mathbb{E}\left\{\left(\log_2\frac{\hat{S}_d}{d\log_2 d}\right)^-\right\} = \int_{-\log_2\log_2 d}^{0} \mathbb{P}\left\{\log_2\frac{\hat{S}_d}{d\log_2 d} \le x\right\} dx,$$

thus we have to estimate the tail probabilities of Ŝ_d.


According to Bernstein's inequality, for x < 0 we have

$$\mathbb{P}\left\{\log_2\frac{\hat{S}_d}{d\log_2 d} \le x\right\} = \mathbb{P}\left\{\frac{\hat{S}_d - \mathbb{E}(\hat{S}_d)}{d\log_2 d} \le 2^x - c_d\right\} \le \exp\left\{-\frac{\frac{d^2\log_2^2 d\,(c_d - 2^x)^2}{2}}{d\,\mathbb{E}[\hat{X}_1^2] + \frac{d^2\log_2 d\,(c_d - 2^x)}{3}}\right\} \le \exp\left\{-\frac{\log_2^2 d\,(c_d - 2^x)^2}{6 + \frac{2}{3}\log_2 d\,(c_d - 2^x)}\right\}.$$

Let γ > 0 be fixed; we specify it later. For x < −γ and d large enough the last upper bound is at most d^{−(1−2^{−γ})²}, therefore

$$\int_{-\log_2\log_2 d}^{-\gamma} \mathbb{P}\left\{\log_2\frac{\hat{S}_d}{d\log_2 d} \le x\right\} dx \le \frac{\log_2\log_2 d}{d^{(1-2^{-\gamma})^2}}.$$

We give an estimation for the integral on [−γ, 0]:

$$\int_{-\gamma}^{0} \mathbb{P}\left\{\log_2\frac{\hat{S}_d}{d\log_2 d} \le x\right\} dx \le \int_0^{\gamma} \exp\left\{-\frac{\log_2^2 d\,(c_d - 2^{-x})^2}{6 + \frac{2}{3}\log_2 d\,(c_d - 2^{-x})}\right\} dx = \frac{1}{\ln 2}\int_0^{\gamma\ln 2} \exp\left\{-\frac{\log_2^2 d\,(c_d - e^{-x})^2}{6 + \frac{2}{3}\log_2 d\,(c_d - e^{-x})}\right\} dx.$$

For arbitrarily fixed ε > 0 we choose γ > 0 such that 1 − x ≤ e^{−x} ≤ 1 − (1 − ε)x for 0 ≤ x ≤ γ ln 2. Using also our estimates for c_d − 1, we may write

$$\exp\left\{-\frac{\log_2^2 d\,(c_d - e^{-x})^2}{6 + \frac{2}{3}\log_2 d\,(c_d - e^{-x})}\right\} \le \exp\left\{-\frac{\log_2^2 d\,\left(0.9/\log_2 d + (1-\varepsilon)x\right)^2}{6 + \frac{2}{3}\log_2 d\,\left(1/\log_2 d + x\right)}\right\},$$

and, continuing the estimation of the integral (substituting x → x/log₂ d),

$$\frac{1}{\ln 2}\int_0^{\gamma\ln 2} \exp\left\{-\frac{\log_2^2 d\,\left(0.9/\log_2 d + (1-\varepsilon)x\right)^2}{6 + \frac{2}{3}\log_2 d\,\left(1/\log_2 d + x\right)}\right\} dx = \frac{1}{\ln 2}\,\frac{1}{\log_2 d}\int_0^{\log_2 d\,\gamma\ln 2} \exp\left\{-\frac{(0.9 + (1-\varepsilon)x)^2}{6 + \frac{2}{3}(1 + x)}\right\} dx \le \frac{1}{\ln 2}\,\frac{1}{\log_2 d}\int_0^{\infty} \exp\left\{-\frac{(0.9 + (1-\varepsilon)x)^2}{6 + \frac{2}{3}(1 + x)}\right\} dx \le \frac{1.7}{\ln 2}\,\frac{1}{\log_2 d},$$

where the last inequality holds if ε is small enough.
Summarizing, we have

$$\mathbb{E}\left\{\left(\log_2\frac{\sum_{i=1}^{d} \hat{X}_i}{d\log_2 d}\right)^-\right\} = \int_{-\log_2\log_2 d}^{0} \mathbb{P}\left\{\log_2\frac{\hat{S}_d}{d\log_2 d} \le x\right\} dx \le \frac{\log_2\log_2 d}{d^{(1-2^{-\gamma})^2}} + \frac{1.7}{\ln 2}\,\frac{1}{\log_2 d} \le \frac{1.8}{\ln 2}\,\frac{1}{\log_2 d}$$

for d large enough. Together with the estimation of the positive part, this proves the theorem. □

References
1. Aumann, R.J.: The St. Petersburg paradox: A discussion of some recent comments.
Journal of Economic Theory 14, 443–445 (1977)
2. Bernoulli, D.: Exposition of a new theory on the measurement of risk. Economet-
rica 22, 22–36 (1954); Originally published in 1738; translated by L. Sommer
3. Breiman, L.: Optimal gambling systems for favorable games. In: Proc. Fourth
Berkeley Symp. Math. Statist. Prob., vol. 1, pp. 65–78. Univ. California Press,
Berkeley (1961)
4. Chow, Y.S., Robbins, H.: On sums of independent random variables with infinite
moments and “fair” games. Proc. Nat. Acad. Sci. USA 47, 330–335 (1961)
5. Csörgő, S., Simons, G.: A strong law of large numbers for trimmed sums, with
applications to generalized St. Petersburg games. Statistics and Probability Let-
ters 26, 65–73 (1996)
6. Feller, W.: Note on the law of large numbers and “fair” games. Ann. Math.
Statist. 16, 301–304 (1945)
7. Haigh, J.: Taking Chances. Oxford University Press, Oxford (1999)
8. Kelly, J.L.: A new interpretation of information rate. Bell System Technical Jour-
nal 35, 917–926 (1956)
9. Kenneth, A.J.: The use of unbounded utility functions in expected-utility maxi-
mization: Response. Quarterly Journal of Economics 88, 136–138 (1974)
10. Samuelson, P.: The St. Petersburg paradox as a divergent double limit. Interna-
tional Economic Review 1, 31–37 (1960)
Reconstructing Weighted Graphs with Minimal Query Complexity

Nader H. Bshouty and Hanna Mazzawi

Technion - Israel Institute of Technology
{bshouty,hanna}@cs.technion.ac.il

Abstract. In this paper we consider the problem of reconstructing a hidden weighted graph using additive queries. We prove the following: let G be a weighted hidden graph with n vertices and m edges such that the weights on the edges are bounded between n^{−a} and n^b for any positive constants a and b. For any m there exists a non-adaptive algorithm that finds the edges of the graph using

$$O\left(\frac{m\log n}{\log m}\right)$$

additive queries. This solves the open problem in [S. Choi, J. H. Kim, Optimal Query Complexity Bounds for Finding Graphs, Proc. of the 40th Annual ACM Symposium on Theory of Computing, 749–758, 2008].
Choi and Kim's proof holds for m ≥ (log n)^α for a sufficiently large constant α and uses graph theory. We use an algebraic approach for the problem. Our proof is simple and holds for any m.

1 Introduction

In this paper we consider the following problem of reconstructing weighted graphs using additive queries: Let G = (V, E, w) be a weighted hidden graph where E ⊆ V × V, w : E → {i | n^{−a} ≤ i ≤ n^b}, and n is the number of vertices in V. Denote by m the size of E. Suppose that the set of vertices V is known and the set of edges E is unknown. Given a set of vertices S ⊆ V, an additive query, Q(S), returns the sum of weights in the subgraph induced by S. That is,

$$Q(S) = \sum_{e \in E \cap (S\times S)} w(e).$$

Our goal is to exactly reconstruct the set of edges using additive queries.
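For concreteness, here is a minimal Python model of the additive-query oracle (our own illustrative representation of G; nothing in it is part of the paper's algorithm):

```python
def additive_query(edges, S):
    """Q(S): total weight of the edges with both endpoints in S.
    edges maps frozenset({u, v}) -> weight."""
    S = set(S)
    return sum(w for e, w in edges.items() if e <= S)

edges = {frozenset({1, 2}): 0.5, frozenset({2, 3}): 2.0}
print(additive_query(edges, {1, 2, 3}))   # 2.5
print(additive_query(edges, {1, 3}))      # 0.0
```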
One can distinguish between two types of algorithms for this problem. Adaptive algorithms take into account the outcomes of previous queries, while non-adaptive algorithms make all queries in advance, before any answer is known. In this paper we consider non-adaptive algorithms. Our concern is the query complexity, that is, the number of queries needed in order to reconstruct the graph.

The problem of reconstructing graphs using additive queries has been moti-
vated by applications in bioinformatics. Assume we have a set of labeled chemi-
cals, and we are able to tell how many pairs react when mixing several of those
chemicals together. We can represent the problem as a graph, where the chemi-
cals are the vertices and two chemicals that react with each other are connected
with an edge. The goal is to reconstruct this reactions graph using as few exper-
iments as possible.
One concrete example for reconstructing a hidden graph is in genome sequenc-
ing. Obtaining the genome sequence is important for the study of organisms. To
obtain the sequence, one common approach is to obtain short reads from the
genome sequence. These reads are assembled to contigs, which are contiguous
fragments that cover the genome sequence with possible gaps. Given these con-
tigs, the goal is to determine their relative place in the genome sequence. The
process of ordering the contigs is done using the multiplex PCR method. This method, given a group of contigs, determines the number of adjacent contigs in the original genome sequence. Assuming that the genome sequence is circular,
the problem of ordering the contigs using the multiplex PCR method is equiva-
lent to reconstructing a hidden Hamiltonian cycle using queries [6,12].
The graph reconstruction problem has seen significant progress in the
past decade. For unweighted graphs the information theoretic lower bound gives
$$\Omega\left(\frac{m\log\frac{n^2}{m}}{\log m}\right)$$
for the query complexity of any adaptive algorithm for this problem. A tight
upper bound was proved for some subclasses of unweighted graphs (Hamiltonian
graphs, matchings, stars, cliques, etc.) [13,12,11,6], for unweighted graphs with
$\Omega(dn)$ edges where the degree of each vertex is bounded by d [11], and for graphs with
$\Omega(n^2)$ edges [11]; the former result was then extended to d-degenerate unweighted
graphs with $\Omega(dn)$ edges [13], i.e., graphs whose edges can be oriented so that
the out-degree of each vertex is bounded by d. A recent
paper by Choi and Kim [8] gave a tight upper bound for all unweighted graphs.
For reconstructing weighted graphs, Choi and Kim [8] proved the follow-
ing: If $m > (\log n)^{\alpha}$ for a sufficiently large $\alpha$, then there exists a non-adaptive
algorithm for reconstructing a weighted graph whose weights are bounded
between $n^{-a}$ and $n^b$, for any positive constants a and b, using
$$O\left(\frac{m\log n}{\log m}\right)$$
queries.
In this paper, we close the gap in m and prove that for any m there exists a
non-adaptive algorithm that reconstructs the hidden graph using
$$O\left(\frac{m\log n}{\log m}\right)$$
queries. This matches the information theoretic lower bound.
In our analysis, we apply algebraic techniques to this problem, which
simplifies the correctness proofs.
The paper is organized as follows: In Section 2, we present notation, basic
tools and some background. In Section 3, we prove the existence of an algorithm
for a discretized version of the problem. In Section 4, we present the algorithm for
the general problem and prove its correctness. Finally, Section 5 contains open problems.
2 Preliminaries
In this section we present some background, basic tools and notation.
2.1 Notation and Preliminary Results
We denote by $\mathbb{R}$ the set of real numbers and by $\mathbb{R}^+$ the set of positive real numbers.
For an integer c, we denote by [c] the set $\{1, 2, \ldots, c\}$. For $s_1, s_2 \in \mathbb{R}$, we denote
by $[s_1, s_2]$ the set of real numbers between $s_1$ and $s_2$, that is, $\{j \mid s_1 \le j \le s_2\}$.
Given $t \in \mathbb{R}^+$ such that $s_1/t$ and $s_2/t$ are integers, we denote by $[s_1, t, s_2]$ the
set $\{s_1, s_1 + t, s_1 + 2t, \ldots, s_2 - 2t, s_2 - t, s_2\}$.
Let G = (V, E, w) be a weighted graph, where E ⊆ V ×V and w : E → [s1 , s2 ]
for some s1 , s2 ∈ R+ . Throughout the paper, we denote by n the size of V and
by m the size of E.
Given a matrix or a vector M , the weight wt(M ) of M is the number of
nonzero entries in M . Given an entry a of a matrix or a vector and s ∈ R+ , we
say that a is s-heavy if |a| ≥ s. For a matrix or a vector x, we denote by ψs (x)
the number of entries in x that are s-heavy.
We now prove two lemmas that will be used in this paper.
Lemma 1. Let B be a symmetric matrix. There are at least
$$\psi_s(B) - \frac{wt(B)}{d}$$
rows in B where each row is of weight at most d and contains an s-heavy entry.
Proof. There are at most wt(B)/d rows of weight more than d. Also, there are at
least $\psi_s(B)$ rows where each row contains at least one s-heavy entry. Therefore,
there are at least $\psi_s(B) - wt(B)/d$ rows in B where each row is of weight at
most d and contains an s-heavy entry. □
Let ι be a function on non-negative integers defined as follows: ι(0) = 1 and
ι(i) = i for i > 0.
Lemma 2. Let $m_1, m_2, \ldots, m_t$ be integers in $[m] \cup \{0\}$ such that
$$m_1 + m_2 + \cdots + m_t = \ell \ge t.$$
Then
$$\prod_{i=1}^{t} \iota(m_i) \ge m^{(\ell - t)/(m-1)}.$$
Proof. Notice that when $1 < m_1 \le m_2 < m$ then $\iota(m_1 - 1)\iota(m_2 + 1) = (m_1 - 1)(m_2 + 1) < m_1 m_2 = \iota(m_1)\iota(m_2)$. Also, when $m_1 = 0$ and $1 < m_2 < m$ then
$\iota(m_1 + 1)\iota(m_2 - 1) = m_2 - 1 < m_2 = \iota(m_1)\iota(m_2)$. Therefore the minimal value
of $\iota(m_1)\iota(m_2)\cdots\iota(m_t)$ is attained when for every $0 < i < j \le t$ we either have
$m_i \in \{1, m\}$ or $m_j \in \{1, m\}$. This is equivalent to: all $m_i \in \{1, m\}$ except at
most one. This implies that at least $(\ell - t)/(m - 1)$ of the $m_i$'s are equal to m. □
2.2 Algebraic View of the Problem

In this subsection we show that our problem is equivalent to reconstructing a
bilinear function $x^T A y$ from substitution queries with $x, y \in \{0,1\}^n$, where A is
an $n \times n$ symmetric matrix with 2m nonzero entries [13].
Let G = (V, E, w) be a non-directed weighted graph where $V = \{v_1, v_2, \ldots, v_n\}$,
$E \subseteq V \times V$ and $w : E \to [s_1, s_2]$. Let $A_G = (a_{ij}) \in \mathbb{R}^{n \times n}$ be the adjacency matrix
of G, that is, $a_{ij}$ equals $w((i, j))$ if $(i, j) \in E$ and equals zero otherwise. Given a
set of vertices $S \subseteq V$ define the vector a where $a_i$ equals "1" if $v_i \in S$ and "0"
otherwise. Then, we have
$$Q(S) = \frac{a^T A_G a}{2}.$$
Now, let $z = x * y = (x_i y_i) \in \{0,1\}^n$ and $x_1 = x - z$, $y_1 = y - z$. Since $A_G$ is
symmetric we have
$$x^T A_G y = \frac{x^T A_G x}{2} + \frac{y^T A_G y}{2} + \frac{(x_1 + y_1)^T A_G (x_1 + y_1)}{2} - x_1^T A_G x_1 - y_1^T A_G y_1.$$
Therefore, the problem of reconstructing the set of edges of the graph G using
additive queries is equivalent to finding the non-zero entries in its adjacency
matrix $A_G$ using queries of the form
$$f(x, y) = x^T A_G y,$$
where $x, y \in \{0,1\}^n$.
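As an illustration of this reduction (our sketch, with an explicitly stored graph standing in for the hidden one), the bilinear query f(x, y) can be simulated with five additive queries, using $u^T A_G u = 2\,Q(\mathrm{supp}(u))$ for any $u \in \{0,1\}^n$:

def Q(edges, S):
    """Additive query: sum of the weights of edges with both endpoints in S."""
    S = set(S)
    return sum(w for e, w in edges.items() if e <= S)

def bilinear_query(edges, x, y):
    """Simulate f(x, y) = x^T A_G y with five additive queries, using the
    symmetric decomposition above. Vertices are the indices 0, ..., n-1."""
    z = [xi * yi for xi, yi in zip(x, y)]    # z  = x * y (componentwise)
    x1 = [xi - zi for xi, zi in zip(x, z)]   # x1 = x - z
    y1 = [yi - zi for yi, zi in zip(y, z)]   # y1 = y - z (x1, y1 have disjoint supports)
    supp = lambda v: {i for i, vi in enumerate(v) if vi}
    # Each quadratic term u^T A_G u equals 2 Q(supp(u)), so the identity becomes:
    return (Q(edges, supp(x)) + Q(edges, supp(y))
            + Q(edges, supp(x1) | supp(y1))
            - 2 * Q(edges, supp(x1)) - 2 * Q(edges, supp(y1)))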
2.3 Basic Probability

In this subsection we give some preliminary results in probability that will be used
in this paper.
We start with the Chernoff bound.
Lemma 3. Let $X_1, \ldots, X_t$ be independent Poisson trials such that $X_i \in \{0,1\}$
and $E[X_i] = p_i$. Let $P = \sum_{i=1}^{t} p_i$ and $X = \sum_{i=1}^{t} X_i$. Then
$$\Pr[X \le (1 - \lambda)P] \le e^{-\lambda^2 P / 2}.$$
The following can be derived from the Littlewood–Offord theorem [14,10]. We prove
it in the appendix for completeness.
Lemma 4. Let $a_1, a_2, \ldots, a_t$, $b_1, b_2, \ldots, b_t$ and $s \ge 0$ be real numbers such that
$b_i - a_i \ge s$ for all i, and let $X_1, \ldots, X_t$ be independent random variables. Suppose
that there is $p_i$ where $0 < p_i < 1$ such that $\Pr[X_i \le a_i] \ge p_i$ and $\Pr[X_i \ge b_i] \ge p_i$
for all i. Then, there is a constant c such that for any real number r and integer
$\rho \ge 1$ we have
$$\Pr\left[r \le \sum_{i=1}^{t} X_i < r + \rho s\right] \le \frac{c\rho}{\sqrt{\sum_{i=1}^{t} p_i}}.$$
The following lemmas will be used in this paper.
Lemma 5. Let $a \in \mathbb{R}^n$ be a vector. Then, there is a constant c such that for
any integer $\rho \ge 1$ and a randomly (uniformly) chosen vector $x \in \{0,1\}^n$ we
have
$$\Pr\left[|a^T x| \le \rho s\right] \le \frac{c\rho}{\sqrt{\psi_s(a)}}.$$
Proof. Let $X_i = a_i x_i$. Then $X_i = a_i$ with probability 1/2 and $X_i = 0$ with
probability 1/2. The lemma now follows immediately from Lemma 4. □
Lemma 6. Let $b \in \mathbb{R}^n$ where $\psi_s(b) > 0$. Then, for a randomly (uniformly)
chosen vector $x \in \{0,1\}^n$ we have
$$\Pr[|b^T x| \ge s/2] \ge 1/2.$$
Proof. Suppose w.l.o.g. $|b_1| \ge s$. For any fixed $x_2, \ldots, x_n \in \{0,1\}$ we have $x^T b = b_1 x_1 + b_0$ for some $b_0 \in \mathbb{R}$. Now this takes the value $b_0$ for $x_1 = 0$ and $b_0 + b_1$
for $x_1 = 1$. Since $|(b_0 + b_1) - b_0| = |b_1| \ge s$, one of them is at least s/2 in absolute value. □
Corollary 1. Let $B \in \mathbb{R}^{n \times n}$ be a matrix where $\psi_s(B) > 0$. Then, for randomly
(uniformly) chosen vectors $x, y \in \{0,1\}^n$ we have
$$\Pr[|x^T B y| \ge s/4] \ge 1/4.$$
Proof. By Lemma 6 the probability that By contains an s/2-heavy entry is at
least 1/2. Assuming it does, by Lemma 6 again, the probability that $|x^T B y| \ge s/4$ is
at least 1/2. This implies the result. □
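A quick Monte Carlo sanity check of Corollary 1 (our illustration; the matrix below has a single s-heavy entry, essentially the hardest case for the bound):

import numpy as np

rng = np.random.default_rng(0)
n, s = 50, 1.0
B = np.zeros((n, n))
B[3, 7] = B[7, 3] = s        # symmetric, psi_s(B) = 2 > 0

trials = 100_000
hits = sum(abs(rng.integers(0, 2, n) @ B @ rng.integers(0, 2, n)) >= s / 4
           for _ in range(trials))
print(hits / trials)          # about 0.44 here, comfortably >= 1/4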
3 Reconstructing Graphs

In this section we give an upper bound for a discretized version of the problem and
then show how to solve the general problem.
Let $\mathcal{G}$ be the set of all graphs with n vertices and m edges such that the
weights of the edges are from the set $[s_1, s_1/(8m^2), s_2]$, that is, the weights are
bounded between $s_1$ and $s_2$ and are multiples of $s_1/(8m^2)$. For the class $\mathcal{G}$ we prove the
following.
Theorem 1. There exists a set of queries $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_t, y_t)\}$
of size
$$t = O\left(\frac{m\left(\log n + \log\frac{s_2}{s_1}\right)}{\log m}\right)$$
where $x_i, y_i \in \{0,1\}^n$, such that for all $G, G' \in \mathcal{G}$ with $E(G) \neq E(G')$ there
exists $i \in [t]$ such that
$$|x_i^T A_G y_i - x_i^T A_{G'} y_i| > s_1/(8m).$$
To prove Theorem 1 we will prove that there exists a set of queries S of size t
such that for every $B = A_G - A_{G'}$, where $G, G' \in \mathcal{G}$ and $E(G) \neq E(G')$, there
exists $(x, y) \in S$ such that $|x^T B y| > s_1/(8m)$.
We divide the analysis into two cases. The first case is when the matrix B is the difference
of adjacency matrices of graphs that are "close" to each other, i.e., B has only
few heavy entries. The second case is when the matrix B is the difference of two
adjacency matrices of graphs that are "far" from each other, i.e., B has many
heavy entries.
First notice the following properties of B:

P1. Since G and G' each contain at most m edges, we have $wt(B) \le 2m$.
P2. Since the weights of the edges are in $[s_1, s_1/(8m^2), s_2]$, the entries of B are
in $[-s_2, s_1/(8m^2), s_2]$.
P3. Since $E(G) \neq E(G')$, at least one of the entries of B is $s_1$-heavy.

We denote by $\mathcal{B}$ the class of symmetric matrices that satisfy (P1)–(P3). Then the
first case will be $\mathcal{B}_1 = \{B \in \mathcal{B} \mid \psi_{s_1/(8m)}(B) \le m^{3/4}\}$ and the second case will
be $\mathcal{B}_2 = \{B \in \mathcal{B} \mid \psi_{s_1/(8m)}(B) > m^{3/4}\}$.
3.1 Eliminating All Close Graphs

In this subsection we analyze the case where the matrix B has few heavy entries.
A similar analysis appears in [8] and is presented here for completeness. We prove
the following.
Lemma 7. Let $\mathcal{B}_1 = \{B \in \mathcal{B} \mid \psi_{s_1/(8m)}(B) \le m^{3/4}\}$. There exists a set of
queries $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_t, y_t)\}$ of size
$$t = \frac{m\left(\log n + \log\frac{s_2}{s_1}\right)}{\log m}$$
such that for every $B \in \mathcal{B}_1$ there exists $i \in [t]$ such that
$$|x_i^T B y_i| > s_1/(8m).$$
Proof. We start by proving a weaker claim. Let
$$\mathcal{B}_1' = \{B' \in \mathcal{B}_1 \mid B' \in ([-s_2, s_1/(8m^2), -s_1/(8m)] \cup \{0\} \cup [s_1/(8m), s_1/(8m^2), s_2])^{n \times n}\}.$$
We first prove that there exists a set of queries $S' = \{(x_1, y_1), (x_2, y_2), \ldots, (x_t, y_t)\}$
such that for every $B' \in \mathcal{B}_1'$ we have $i \in [t]$ such that
$$|x_i^T B' y_i| \ge s_1/4.$$
By (P3) and Corollary 1, for a randomly chosen query (x, y) we have
$$\Pr[|x^T B' y| \le s_1/4] \le 3/4.$$
The size of $\mathcal{B}_1'$ is bounded by
$$|\mathcal{B}_1'| \le \binom{n^2}{m^{3/4}} \left(\frac{8 s_2 m^2}{s_1}\right)^{m^{3/4}} < (4/3)^t.$$
Therefore
$$\Pr\left[(\exists B' \in \mathcal{B}_1')(\forall i)\ |x_i^T B' y_i| < s_1/4\right] < (4/3)^t (3/4)^t < 1,$$
and the weaker claim follows.
Now, we argue that $S'$ is the set of queries we are looking for. Let $B \in \mathcal{B}_1$, and
remove all entries smaller than $s_1/(8m)$ in absolute value. Denote the new matrix
by $B^*$. Notice that $B^* \in \mathcal{B}_1'$, therefore we have $i \in [t]$ such that
$$|x_i^T B^* y_i| \ge s_1/4. \quad (1)$$
Also note that
$$|x_i^T B y_i| = |x_i^T B^* y_i + x_i^T (B - B^*) y_i|. \quad (2)$$
Since $B - B^*$ has at most $2m - 1$ non-zero entries and each non-zero entry is
bounded by $s_1/(8m)$ in absolute value, we have
$$|x_i^T (B - B^*) y_i| < (s_1/(8m))(2m - 1) = s_1/4 - s_1/(8m). \quad (3)$$
By (1), (2) and (3) we get
$$|x_i^T B y_i| > s_1/(8m).$$
Therefore the result follows. □
3.2 Eliminating All Far Graphs

In this section we first prove the following.
Lemma 8. Let
$$U = \{u \mid u \in [-s_2, s_1/(8m^2), s_2]^n,\ wt(u) < m^{3/4}\ \text{and}\ \psi_{s_1/(8m)}(u) > 0\}.$$
Then, for every
$$t = \Omega\left(\frac{m\left(\log n + \log\frac{s_2}{s_1}\right)}{\log m}\right)$$
there exists a set of vectors $Y = \{y_1, y_2, \ldots, y_t\} \subset \{0,1\}^n$ such that for every
$u \in U$ the size of $Y_u = \{i \mid |y_i^T u| > s_1/(16m)\}$ is greater than t/4.
Proof. By Lemma 6, for any $u \in U$ and a randomly chosen $y \in \{0,1\}^n$ we have
$$\Pr[|y^T u| \ge s_1/(16m)] \ge 1/2.$$
Therefore, if we randomly and independently choose $y_1, y_2, \ldots, y_t \in \{0,1\}^n$ we have
$E[|Y_u|] \ge t/2$. By the Chernoff bound, the probability that $|Y_u| \le t/4$ is
$$\Pr[|Y_u| \le t/4] < e^{-t/16}.$$
The probability that for all $u \in U$ we have $|Y_u| > t/4$ is
$$\Pr[\forall u \in U : |Y_u| > t/4] = 1 - \Pr[\exists u \in U : |Y_u| \le t/4] \ge 1 - |U| e^{-t/16}.$$
Finally, note that
$$|U| < \binom{n}{m^{3/4}} \left(\frac{16 s_2 m^2}{s_1}\right)^{m^{3/4}} < e^{t/16},$$
and therefore $1 - |U| e^{-t/16} > 0$. □
Lemma 9. Let $\mathcal{B}_2 = \{B \in \mathcal{B} \mid \psi_{s_1/(8m)}(B) > m^{3/4}\}$. There exists a set of
queries $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_t, y_t)\}$ of size
$$t = O\left(\frac{m\left(\log n + \log\frac{s_2}{s_1}\right)}{\log m}\right)$$
such that for every $B \in \mathcal{B}_2$ there is $i \in [t]$ such that
$$|x_i^T B y_i| > s_1/(8m).$$
Proof. Define U as in Lemma 8. That is,
$$U = \{u \mid u \in [-s_2, s_1/(8m^2), s_2]^n,\ wt(u) < m^{3/4}\ \text{and}\ \psi_{s_1/(8m)}(u) > 0\}.$$
Let $Y = \{y_1, y_2, \ldots, y_t\}$ be the set of vectors that satisfies the condition in Lemma 8.
By Lemma 1, for $d = m^{3/4}$, there are at least
$$\psi_{s_1/(8m)}(B) - \frac{wt(B)}{m^{3/4}}$$
rows in B that are in U. By property P1 and since $B \in \mathcal{B}_2$, we have at least
$$\psi_{s_1/(8m)}(B) - \frac{wt(B)}{m^{3/4}} \ge m^{3/4} - 2m^{1/4}$$
rows in B that are in U. Let $k = m^{3/4} - 2m^{1/4}$ and let $B_U$ be a $k \times n$ matrix
whose rows are any k rows of B that are in U. By Lemma 8,
$$\sum_{i=1}^{t} \psi_{s_1/(16m)}(B_U y_i) \ge \frac{kt}{4}.$$
By Lemma 2,
$$\prod_{i=1}^{t} \iota\left(\psi_{s_1/(16m)}(B_U y_i)\right) \ge k^{(k-4)t/(4k-4)} \ge m^{c_1 t},$$
for some constant $c_1 > 0$.
By Lemma 5, if we randomly and independently choose $x_1, x_2, \ldots, x_t$, the proba-
bility that none of the queries $(x_i, y_i)$ satisfies $|x_i^T B y_i| > s_1/(8m)$ is bounded by
$$\Pr\left[\forall i \in [t] : |x_i^T B y_i| \le 2\,\frac{s_1}{16m}\right] \le \prod_{i=1}^{t} \frac{c}{\sqrt{\iota\left(\psi_{s_1/(16m)}(B y_i)\right)}}
= \frac{c^t}{\sqrt{\prod_{i=1}^{t} \iota\left(\psi_{s_1/(16m)}(B y_i)\right)}}
\le \frac{c^t}{\sqrt{\prod_{i=1}^{t} \iota\left(\psi_{s_1/(16m)}(B_U y_i)\right)}}
\le \left(\frac{c}{m^{c_1/2}}\right)^t = m^{-\alpha t},$$
where $c > 1$ and $\alpha > 0$ are some constants.
Since
$$m^{-\alpha t} < \frac{1}{\binom{n^2}{2m}\left(\frac{8 s_2 m^2}{s_1}\right)^{2m}} < \frac{1}{|\mathcal{B}_2|},$$
the result follows. □

4 The Algorithm

In the previous section we showed that there exists a set of queries
$$S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_t, y_t)\}$$
such that for every $G^*, G' \in \mathcal{G}$ with $E(G^*) \neq E(G')$ we have $i \in [t]$ such that
$$|x_i^T A_{G^*} y_i - x_i^T A_{G'} y_i| > s_1/(8m).$$
Recall that $\mathcal{G}$ is the set of all graphs with n vertices and m edges, where the
weights of the edges are from the set $[s_1, s_1/(8m^2), s_2]$.
Now, for reconstructing the edges of the graph we use the same algorithm as
in [8]. The algorithm is presented in Figure 1.
The query complexity of the algorithm is obvious. As for the correctness,
given a graph G, define $G' \in \mathcal{G}$ to be the graph obtained from G by rounding
each edge weight to the closest multiple of $s_1/(8m^2)$.
Obviously, since $G - G'$ differs in at most m edge weights and each difference is
bounded by $s_1/(16m^2)$, we have that for all $i \in [t]$
$$|x_i^T A_G y_i - x_i^T A_{G'} y_i| \le s_1 m/(16m^2) = s_1/(16m).$$
Algorithm Edge Reconstruct

1. For all $(x_i, y_i) \in S$:
2.   Ask $x_i^T A_G y_i$.
3. End for.
4. For all $G' \in \mathcal{G}$:
5.   Define $D(G') = (d_1, d_2, \ldots, d_t)$ where $d_i = |x_i^T A_G y_i - x_i^T A_{G'} y_i|$.
6.   If $\psi_{s_1/(16m)}(D(G')) = 0$:
7.     Return $G'$.
8.   End if.
9. End for.

Fig. 1. Algorithm for reconstructing the set of edges of G
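A direct Python transcription of Fig. 1 (a sketch only: it assumes the class G and the query set S are given explicitly, and enumerating G takes exponential time, which is consistent with the algorithm being non-constructive):

import numpy as np

def edge_reconstruct(candidates, queries, answers, threshold):
    """Return the first candidate adjacency matrix A' (a graph G' in G) whose
    query answers are all within `threshold` of the hidden graph's answers,
    i.e. psi_threshold(D(G')) = 0 with threshold = s1/(16m).

    candidates: iterable of n x n symmetric numpy arrays (the class G).
    queries:    list of (x, y) pairs, x and y being 0/1 numpy vectors.
    answers:    answers[i] = x_i^T A_G y_i for the hidden graph G.
    """
    for A in candidates:
        D = [abs(x @ A @ y - ans) for (x, y), ans in zip(queries, answers)]
        if all(d < threshold for d in D):   # no threshold-heavy entry in D(G')
            return A
    return None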
On the other hand, for any graph $G^* \in \mathcal{G}$ that differs from $G'$ in at least one
edge, we have
$$|x_i^T A_G y_i - x_i^T A_{G^*} y_i| = |x_i^T A_G y_i - x_i^T A_{G'} y_i - (x_i^T A_{G^*} y_i - x_i^T A_{G'} y_i)|. \quad (4)$$
Now, since $G', G^* \in \mathcal{G}$, we have $i \in [t]$ such that
$$|x_i^T A_{G^*} y_i - x_i^T A_{G'} y_i| > s_1/(8m). \quad (5)$$
By (4) and (5), together with the fact that $|x_i^T A_G y_i - x_i^T A_{G'} y_i| \le s_1/(16m)$, we
get
$$|x_i^T A_G y_i - x_i^T A_{G^*} y_i| > s_1/(16m).$$
5 Conclusions and Open Problems

In this paper, we proved the existence of an optimal non-adaptive algorithm for
reconstructing the edge set of a hidden weighted graph, given that the weights of
the edges are real numbers bounded between $n^{-a}$ and $n^b$ for any positive constants
a and b. An open question is: Can we remove the condition on the weights? That is, is there
an algorithm for reconstructing a hidden weighted graph where the weights of
the edges are (unbounded) real numbers?
Also, while the problem of finding an optimal constructive polynomial-time al-
gorithm for reconstructing a hidden graph was solved for the adaptive case [15],
the problem is still open in the non-adaptive case. That is, the problem of finding
an explicit construction for algorithms in the non-adaptive setting is still open.
References
1. Aigner, M.: Combinatorial Search. John Wiley and Sons, Chichester (1988)
2. Alon, N., Asodi, V.: Learning a Hidden Subgraph. SIAM J. Discrete Math. 18(4),
697–712 (2005)
3. Alon, N., Beigel, R., Kasif, S., Rudich, S., Sudakov, B.: Learning a Hidden Match-
ing. SIAM J. Comput. 33(2), 487–501 (2004)
4. Angluin, D., Chen, J.: Learning a Hidden Graph Using O(log n) Queries per Edge.
In: Shawe-Taylor, J., Singer, Y. (eds.) COLT 2004. LNCS (LNAI), vol. 3120, pp.
210–223. Springer, Heidelberg (2004)
5. Angluin, D., Chen, J.: Learning a Hidden Hypergraph. Journal of Machine Learning
Research 7, 2215–2236 (2006)
6. Bouvel, M., Grebinski, V., Kucherov, G.: Combinatorial Search on Graphs Moti-
vated by Bioinformatics Applications: A Brief Survey. In: Kratsch, D. (ed.) WG
2005. LNCS, vol. 3787, pp. 16–27. Springer, Heidelberg (2005)
7. Bshouty, N.H.: Optimal Algorithms for the Coin Weighing Problem with a Spring
Scale. In: COLT (2009)
8. Choi, S., Kim, J.H.: Optimal Query Complexity Bounds for Finding Graphs. In:
STOC, pp. 749–758 (2008)
9. Du, D., Hwang, F.K.: Combinatorial Group Testing and Its Applications. Series on
Applied Mathematics, vol. 3. World Scientific (1993)
10. Erdös, P.: On a lemma of Littlewood and Offord. Bulletin of the American Math-
ematical Society 51, 898–902 (1945)
11. Grebinski, V., Kucherov, G.: Optimal Reconstruction of Graphs Under the Addi-
tive Model. Algorithmica 28(1), 104–124 (2000)
12. Grebinski, V., Kucherov, G.: Reconstructing a Hamiltonian cycle by querying the
graph: Application to DNA physical mapping. Discrete Applied Mathematics 88,
147–165 (1998)
13. Grebinski, V.: On the Power of Additive Combinatorial Search Model. In: Hsu,
W.-L., Kao, M.-Y. (eds.) COCOON 1998. LNCS, vol. 1449, pp. 194–203. Springer,
Heidelberg (1998)
14. Littlewood, J.E., Offord, A.C.: On the number of real roots of a random algebraic
equation. III. Mat. Sbornik 12, 277–285 (1943)
15. Mazzawi, H.: Optimally Reconstructing Weighted Graphs Using Queries
(manuscript)
16. Reyzin, L., Srivastava, N.: Learning and Verifying Graphs using Queries with a
Focus on Edge Counting. In: Hutter, M., Servedio, R.A., Takimoto, E. (eds.) ALT
2007. LNCS (LNAI), vol. 4754, pp. 285–297. Springer, Heidelberg (2007)
17. Sperner, E.: Ein Satz über Untermengen einer endlichen Menge. Math. Z. 27,
544–548 (1928)
6 Appendix

In this Appendix we prove Lemma 4.
We first prove a few preliminary results.
Lemma 10. Let $X_1, X_2, \ldots, X_t$ be independent random variables such that for each i
there is $s_i > s$ with $\Pr[X_i = s_i] = 1/2$ and $\Pr[X_i = 0] = 1/2$. Let $\lambda_1, \ldots, \lambda_t \in \{-1, 1\}$. Then
there is a constant c such that
$$\max_r \Pr[r \le \lambda_1 X_1 + \lambda_2 X_2 + \cdots + \lambda_t X_t < r + s] \le \frac{c}{\sqrt{t}}.$$
Proof. Consider the lattice $L = \prod_i \{0, s_i\}$ with the partial order $\prec\ = \prod_i \prec_i$,
where $0 \prec_i s_i$ if $\lambda_i = 1$ and $s_i \prec_i 0$ if $\lambda_i = -1$. It is easy to see that the set of all
solutions of $r \le \lambda_1 X_1 + \lambda_2 X_2 + \cdots + \lambda_t X_t < r + s$ is an anti-chain in L. The result
then follows from Sperner's theorem [17], since the largest anti-chain in L contains
at most $\binom{t}{\lfloor t/2 \rfloor} = O(2^t/\sqrt{t})$ of the $2^t$ equally likely points of L. □
Lemma 11. Let $X_1, X_2, \ldots, X_t$ be independent random variables such that for each i
there is $s_i > s$ with $\Pr[X_i = s_i] = p_i$ and $\Pr[X_i = 0] = 1 - p_i$. Let $\lambda_1, \ldots, \lambda_t \in \{-1, 1\}$. Then
there is a constant c such that
$$\max_r \Pr[r \le \lambda_1 X_1 + \lambda_2 X_2 + \cdots + \lambda_t X_t < r + s] \le \frac{c}{\sqrt{\sum_{i=1}^{t} \min(p_i, 1 - p_i)}}.$$
Proof. We assume w.l.o.g. that $p_i \le 1/2$, and therefore $\min(p_i, 1 - p_i) = p_i$;
otherwise, we can replace $X_i$ with $s_i - X_i$.
Let $Y_i$ be a random variable that is equal to 1 with probability $2p_i$ and 0 with
probability $1 - 2p_i$. Let $Z_i$ be a random variable that is equal to $s_i$ with prob-
ability 1/2 and 0 with probability 1/2. It is easy to see that $X_i$ has the same
distribution as $Y_i Z_i$ (with $Y_i$ and $Z_i$ independent). Let
$Y = Y_1 + \cdots + Y_t$ and $P = \sum_{i=1}^{t} p_i$. Notice that $E[Y] = 2P$. Then by Lemma 10
and the Chernoff bound we have
$$\Pr[r \le \lambda_1 X_1 + \cdots + \lambda_t X_t < r + s] = \Pr[r \le \lambda_1 Y_1 Z_1 + \cdots + \lambda_t Y_t Z_t < r + s]$$
$$\le \Pr[r \le \lambda_1 Y_1 Z_1 + \cdots + \lambda_t Y_t Z_t < r + s \mid Y \ge P] + \Pr[Y < P]$$
$$\le \frac{c}{\sqrt{P}} + e^{-P/4}.$$
This completes the proof. □
Lemma 12. Let $X_1$ and $X_2$ be independent random variables and s be any fixed real number.
Suppose $X_1$ takes values $x_1, x_2, \ldots, x_\ell$ with probabilities $p_1, \ldots, p_\ell$, respectively.
Let $Y_1$ be a random variable that takes values $y, x_3, x_4, \ldots, x_\ell$ with probabilities
$p_1 + p_2, p_3, \ldots, p_\ell$. Then for $y = x_1$ or $y = x_2$ we have
$$\max_r \Pr[r \le X_1 + X_2 < r + s] \le \max_r \Pr[r \le Y_1 + X_2 < r + s].$$
Proof. Suppose $\max_r \Pr[r \le X_1 + X_2 < r + s] = p_0$ and $\Pr[r_0 \le X_1 + X_2 < r_0 + s] = p_0$. Now we choose $y = x_2$ if
$$\Pr[r_0 - x_1 \le X_2 < r_0 - x_1 + s] \le \Pr[r_0 - x_2 \le X_2 < r_0 - x_2 + s] \quad (6)$$
and $y = x_1$ otherwise. Suppose (6) is true. Then $y = x_2$ and
$$\max_r \Pr[r \le X_1 + X_2 < r + s] = \Pr[r_0 \le X_1 + X_2 < r_0 + s]$$
$$= \sum_x \Pr[X_1 = x] \Pr[r_0 - x \le X_2 < r_0 - x + s]$$
$$= p_1 \Pr[r_0 - x_1 \le X_2 < r_0 - x_1 + s] + p_2 \Pr[r_0 - x_2 \le X_2 < r_0 - x_2 + s] + \sum_{x \notin \{x_1, x_2\}} \Pr[X_1 = x] \Pr[r_0 - x \le X_2 < r_0 - x + s]$$
$$\le (p_1 + p_2) \Pr[r_0 - x_2 \le X_2 < r_0 - x_2 + s] + \sum_{x \notin \{x_1, x_2\}} \Pr[X_1 = x] \Pr[r_0 - x \le X_2 < r_0 - x + s]$$
$$= \Pr[Y_1 = x_2] \Pr[r_0 - x_2 \le X_2 < r_0 - x_2 + s] + \sum_{x \notin \{x_1, x_2\}} \Pr[Y_1 = x] \Pr[r_0 - x \le X_2 < r_0 - x + s]$$
$$= \Pr[r_0 \le Y_1 + X_2 < r_0 + s]$$
$$\le \max_r \Pr[r \le Y_1 + X_2 < r + s]. \qquad \square$$
We call this transformation a merging of the two values $x_1$ and $x_2$ of the
random variable $X_1$ into one value in $Y_1$. We prove the following property of the
merging transformation.
Lemma 13. Let X be a random variable such that $\Pr[X > s] \ge p$ and also
$\Pr[X < 0] \ge p$. Then we can merge the values of X into two values $s_1$ and $s_2$ of
a random variable Y such that

1. $s_2 - s_1 \ge s/2$;
2. $\Pr[Y = s_1] \ge p$ and $\Pr[Y = s_2] \ge p$.
Proof. We first merge all the values that are greater than or equal to s, then
all the values that are less than or equal to 0, and then those that are strictly between 0
and s. We get a random variable Z that takes the three values $s'$, $s''$ and $s'''$ where
$s' \le 0 < s'' < s \le s'''$, $\Pr[Z = s'] \ge p$ and $\Pr[Z = s'''] \ge p$. Now either
$s'' - s' \ge s/2$ or $s''' - s'' \ge s/2$. If $s'' - s' \ge s/2$ then we merge $s''$ and $s'''$, and
if $s''' - s'' \ge s/2$ we merge $s'$ and $s''$. □
Now we prove our main result.
Lemma 14. Let $a_1, a_2, \ldots, a_t$, $b_1, b_2, \ldots, b_t$ and $s \ge 0$ be real numbers such that
$b_i - a_i \ge s$ for all i, and let $X_1, \ldots, X_t$ be independent random variables. Suppose
that there is $p_i$ with $0 < p_i < 1$ such that $\Pr[X_i \le a_i] \ge p_i$ and $\Pr[X_i \ge b_i] \ge p_i$
for all i. Then, there is a constant c such that for any real number r and integer
$\rho \ge 1$ we have
$$\Pr\left[r \le \sum_{i=1}^{t} X_i < r + \rho s\right] \le \frac{c\rho}{\sqrt{\sum_{i=1}^{t} p_i}}.$$
Proof. By Lemma 13 we can merge the values of each $X_i$ into a new random
variable $Y_i$ that takes two values $a_i'$ and $b_i'$ where $b_i' - a_i' \ge s/2$, $\Pr[Y_i = a_i'] \ge p_i$,
$\Pr[Y_i = b_i'] \ge p_i$ and
$$\max_r \Pr\left[r \le \sum_{i=1}^{t} X_i < r + \frac{s}{2}\right] \le \max_r \Pr\left[r \le \sum_{i=1}^{t} Y_i < r + \frac{s}{2}\right].$$
We assume without loss of generality that $a_i' = 0$ for all i; otherwise consider
the random variables $Y_i - a_i'$. Now by Lemma 11 the right-hand side is at most
$c/\sqrt{\sum_{i=1}^{t} p_i}$, and since the interval $[r, r + \rho s)$ can be covered by $2\rho$ intervals of
length s/2, the result follows. □
Learning Unknown Graphs

Nicolò Cesa-Bianchi¹, Claudio Gentile², and Fabio Vitale³

¹ Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano, Italy
cesa-bianchi@dsi.unimi.it
² Dipartimento di Informatica e Comunicazione, Università dell'Insubria, Varese, Italy
claudio.gentile@uninsubria.it
³ Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano, Italy
fabio.vitale@unimi.it
Abstract. Motivated by a problem of targeted advertising in social net-
works, we introduce and study a new model of online learning on labeled
graphs where the graph is initially unknown, and the algorithm is free
to choose the next vertex to predict. After observing that natural non-
adaptive exploration/prediction strategies (like depth-first with major-
ity vote) badly fail on simple binary labeled graphs, we introduce an
adaptive strategy that performs well under the hypothesis that the ver-
tices of the unknown graph (i.e., the members of the social network)
can be partitioned into a few well-separated clusters within which labels
are roughly constant (i.e., members in the same cluster tend to prefer the
same products). Our algorithm is efficiently implementable and provably
competitive against the best of these partitions.

Keywords: online learning, graph prediction, unknown graph,


clustering.

1 Introduction
Consider the advertising problem of targeting each member of a social network
(where ties between individuals indicate a certain degree of similarity in tastes
and interests) with the product he/she is most likely to buy. Unlike previous ap-
proaches to this problem —see, e.g., [20]— we consider the more interesting sce-
nario where the network and the preferences of network members for the products
in a given set are initially unknown, apart from those of a single “seed member”.
We assume there exists a mechanism to explore the social network by discovering
new members connected (i.e., with similar interests) to members that are already
known. This mechanism can be implemented in different ways, e.g., by providing
incentives or rewards to members with undiscovered connections. Alternatively,
if the network is hosted by a social network service (like Facebook™), the service
provider itself may release the needed pieces of information. Since each discovery
of a new member is presumably costly, the goal of the marketing strategy is to

minimize the number of new members not being offered their preferred product.
In this respect the task may then be formulated as the following sequential prob-
lem: At each step t find the member qt , among those whose preferred product we
already know, who is most likely to have undiscovered connections that have the
same preferred product as qt . Once this member qt is identified, we obtain (through
the above-mentioned mechanism) a connection it to whom we may advertise qt ’s
preferred product. In order to make the problem easier for the advertising agent,
we make the simplifying assumption that once a product is advertised to a mem-
ber the agent may observe the member’s true preference, and thus know whether
the decision made was optimal.
This social network advertising task can be naturally cast as a graph predic-
tion problem where an agent sequentially explores the vertices and edges of an
unknown graph with unknown labels (i.e., product preferences) assigned to its
vertices. The online exploration proceeds as follows: At each time step t, the
agent selects a known node $q_t$ having unexplored edges, receives a new vertex
$i_t$ adjacent to $q_t$, and is required to output a prediction $\hat{y}_t$ for the (unknown)
label $y_t$ associated with $i_t$. Then $y_t$ is revealed, and the algorithm incurs a loss
$\ell(\hat{y}_t, y_t)$ measuring the discrepancy between prediction and true label. Thus, in
some sense, the agent is learning to explore the graph along directions that, given
past observations, look easier to predict. Our basic measure of performance is the
agent's cumulative loss $\ell(\hat{y}_1, y_1) + \cdots + \ell(\hat{y}_n, y_n)$ over a sequence of n predictions.
In order to leverage on the assumption that connected members tend to prefer
the same products [20], we design agent strategies that perform well to the ex-
tent that the underlying graph labeling y = (y1 , . . . , yn ) is regular. That is, the
graph can be partitioned into a small number of weakly interconnected clusters
(subgroups of network members) such that labels in each cluster are all roughly
similar. In the case of binary labels and zero-one loss, a common measure of label
regularity for an n-vertex graph G with labels $y = (y_1, \ldots, y_n) \in \{-1, +1\}^n$ is
the cutsize $\Phi_G(y)$. This is the number of edges (i, j) in G whose endpoint ver-
tices have disagreeing labels, $y_i \neq y_j$. The cumulative loss bound we prove in this
paper holds for general (real-valued) labels and is expressed in terms of a mea-
sure of regularity that, in the special case of binary labels, is often significantly
smaller than the cutsize ΦG (y), and never larger than 2ΦG (y). Furthermore, un-
like ΦG (y), which may be even quadratic in the number of nodes, our measure
of label regularity is never vacuous (i.e., it is never larger than n). In the paper
we also show that the algorithm achieving this bound is suitable to large scale
applications because of its small time and memory requirements.

1.1 Related Work
Online prediction of labeled graphs has been also studied in a “transductive”
learning model, different from the one studied here. In this model the graph
G (without labels) is known in advance, and the task is to sequentially pre-
dict the unknown labels of an adversarially chosen permutation of G’s vertices.
A technique proposed in [10] for transductive binary prediction is to embed the
graph into a linear space using the kernel defined by the Laplacian pseudoinverse
112 N. Cesa-Bianchi, C. Gentile, and F. Vitale

—see [16, 19], and then run the standard (kernel) Perceptron algorithm for pre-
dicting the vertex labels. This approach guarantees that the number of mistakes
is bounded by a quantity that depends linearly on the cutsize ΦG (y). Further
results involving the prediction of node labels in graphs with known structure
include [2, 3, 6, 9, 11, 12, 13, 14, 15, 17].
Our exploration/prediction model also bears some similarities to the graph
exploration problem introduced in [8], where the measure of performance is the
overall number of edge traversals sufficient to ensure that each edge has been tra-
versed at least once. Unlike that approach, we do not charge any cost for visits
of the same node beyond the first visit. Moreover, in our setting depth-first
exploration performs badly on simple graphs with binary labels (see discus-
sion in Sect. 2), whereas depth-first traversal is optimal in the setting of [8]
for any undirected graph —see [1]. Finally, as we explain in Sect. 3, our ex-
ploration/prediction algorithm incrementally builds a spanning tree whose total
cost is equal to the algorithm’s cumulative loss. The problem of constructing a
minimum spanning tree online is also considered in [18], although only for graphs
with random edge costs.

2 The Exploration/Prediction Model
Let G = (V, E) be an unknown undirected and connected graph with vertex set
V = {1, 2, . . . , n} and edge set E ⊆ V × V . We use y = (y1 , . . . , yn ) ∈ Y n to
denote an unknown assignment of labels yi ∈ Y to the vertices i ∈ V , where Y
is a given label space, e.g., Y = R or Y = {−1, +1}.
We consider the following protocol between a graph exploration/prediction
algorithm and an adversary. Initially, the algorithm receives an arbitrary vertex
i0 ∈ V and its corresponding label y0 . For all subsequent steps t = 1, . . . , n − 1,
let Vt−1 ⊆ V be the set of vertices visited in the first t − 1 steps, where we
conventionally set V0 = {i0 }. Then:
1. The algorithm chooses node qt ∈ Vt−1 ; at this time the algorithm knows
that qt has unexplored edges (i.e., edges connecting qt to unseen nodes in
V \ Vt−1 ), though the number and destination of such edges is currently
unknown to the algorithm.
2. The adversary chooses a node it ∈ V \ Vt−1 that is adjacent to qt ;
3. All edges (it , j) ∈ E connecting it to previously visited vertices j ∈ Vt−1 are
revealed (including edge (qt , it ));
4. The algorithm predicts the label $y_t$ of $i_t$ with $\hat{y}_t$;
5. The label yt is revealed, and the algorithm incurs a loss.
At each step t = 1, . . . , n − 1, the loss of the algorithm is $\ell(\hat{y}_t, y_t)$, where $\ell :
\mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$ is a fixed and known function measuring the discrepancy between
$\hat{y}_t$ and $y_t$. For example, if $\mathcal{Y} = \mathbb{R}$, then we may set $\ell(\hat{y}_t, y_t) = |\hat{y}_t - y_t|$. The
algorithm's goal is to minimize its cumulative loss $\ell(\hat{y}_1, y_1) + \cdots + \ell(\hat{y}_n, y_n)$.
Note that the edges $(q_t, i_t)$, for t = 1, . . . , n − 1, form a spanning tree for G.
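The protocol can be summarized by the following Python interface sketch (ours; all method names such as choose_query and reveal_edges are hypothetical, introduced only to fix the order of the interaction):

def run_protocol(learner, adversary, i0, labels, loss):
    """One run of the exploration/prediction protocol on an n-vertex graph.

    labels: dict vertex -> true label y_i; i0: the seed vertex given for free.
    """
    explored = {i0}                                  # V_0 = {i_0}
    total = 0.0
    for _ in range(len(labels) - 1):                 # steps t = 1, ..., n-1
        q = learner.choose_query(explored)           # q_t in V_{t-1}, has unexplored edges
        i = adversary.choose_node(q, explored)       # i_t in V \ V_{t-1}, adjacent to q_t
        edges = adversary.reveal_edges(i, explored)  # all edges (i_t, j), j in V_{t-1}
        y_hat = learner.predict(q, i, edges)         # prediction for the label of i_t
        total += loss(y_hat, labels[i])              # y_t revealed, loss incurred
        explored.add(i)                              # V_t = V_{t-1} + {i_t}
    return total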
It is important to note that standard nonadaptive graph exploration strate-
gies (combined with simple prediction rules) are suboptimal in this setting. For
this purpose, consider the strategy depthFirst, performing a depth-first visit
of G (partially driven by the adversarial choice of $i_t$) and predicting the label of
$i_t$ through the adjacent node $q_t$ in the spanning tree generated by the visit. In
the binary classification case, when $\mathcal{Y} = \{-1, +1\}$ and $\ell(\hat{y}, y) = \mathbb{I}_{\{\hat{y} \neq y\}}$ (zero-one
loss), the graph cutsize $\Phi_G(y)$ is an obvious mistake bound achieved by such a
strategy. Figure 1 shows an example where depthFirst makes $\Omega(|V|)$ mistakes.
This high number of mistakes is not due to the choice of the prediction rule. In-
deed, the same large number of mistakes is achieved by variants of depthFirst
where the predicted label is determined by the majority vote of all labels (or
just of the mistaken ones) among the adjacent nodes seen so far. This holds even
when the graph labeling is consistent with the majority vote predictor based on
the entire graph. Similar examples can be constructed to show that visiting the
graph in breadth-first order can cause $\Omega(|V|)$ mistakes.

Fig. 1. A binary labeled graph with three clusters where depthFirst can make $\Omega(|V|)$
mistakes. Edges are either arrow edges or grey edges. Arrow edges indicate predictions,
and numbers on such edges denote the adversarial order of presentation. For instance,
edge 3 (connecting a −1 node to a +1 node) says that depthFirst uses the −1 label
associated with the start node (the current $q_t$ node) to predict the +1 label associated
with the end node (the current $i_t$ node). As a matter of fact, in this example depth-
First could also predict $y_t$ through a majority vote of the labels of previously observed
nodes that are adjacent to $i_t$. Dark grey nodes are the mistaken nodes (for simplicity,
ties are mistakes in this figure). Notice that in the dotted area we could add as many
(mistaken) nodes as we like, thus making the graph cutsize $\Phi_G(y)$ arbitrarily close to
|V|. These nodes would still be mistaken even if the majority vote were restricted to
previously mistaken (and adjacent) nodes. This is because depthFirst is forced to err
on the left-most node of the right-most cluster.
These algorithms fail mainly because their exploration strategy is oblivious to
the sequence of revealed labels. Next, we show an adaptive exploration strategy
that takes advantage of the revealed structure of the labeled graph in order to
make a substantially smaller number of mistakes. Our algorithm cga (Clustered
Graph Algorithm) learns the next “good” node qt ∈ Vt−1 to explore, and achieves
a cumulative loss bound based on a notion of cluster/labeling regularity called
114 N. Cesa-Bianchi, C. Gentile, and F. Vitale

merging degree. This notion arises naturally as a by-product of our analysis, and
can be considered a natural measure of cluster similarity of independent interest.

3 Regular Partitions and the Clustered Graph Algorithm

We are interested in designing exploration/prediction strategies that work well
to the extent that the underlying graph G can be partitioned into a small number
of weakly connected regions (the "clusters") such that labels on the vertices in
each cluster are similar. Before defining this property formally, we need a few
key auxiliary definitions.
Given a path $s_1, \ldots, s_d$ in G, a notion of path length $\lambda(s_1, \ldots, s_d)$ can be
defined which is naturally related to the prediction loss. A reasonable choice
might be $\lambda(s_1, \ldots, s_d) = \max_{k=2,\ldots,d} \ell(s_{k-1}, s_k)$, where we conventionally write
$\ell(s_{k-1}, s_k)$ instead of $\ell(y_{s_{k-1}}, y_{s_k})$ when the labeling is understood from the con-
text. Note that, in the binary classification case, if the nodes $s_1, \ldots, s_d$ are either
all positive or all negative, then $\lambda(s_1, \ldots, s_d) = 0$. In general, we say that $\lambda$ is a
path length assignment if it satisfies
$$\lambda(s_1, \ldots, s_{d-1}, s_d) \ge \lambda(s_1, \ldots, s_{d-1}) \ge 0 \quad (1)$$
for each path s1 , . . . , sd−1 , sd in G. As we see in Sect. 5, condition (1) helps in
designing efficient algorithms.
Given a path length assignment $\lambda$, denote by $P_t(i, j)$ the set of all paths
connecting node i to node j in $G_t = (V_t, E_t)$, the subgraph containing all nodes
$V_t$ and edges $E_t$ that have been observed during the first t steps. The distance
$d_t(i, j)$ between i and j is the length of the shortest path between i and j in $G_t$,
i.e., $d_t(i, j) = \min_{\pi \in P_t(i,j)} \lambda(\pi)$. A partition $\mathcal{P}$ of V in subsets (or clusters) C
is regular if, for all $C \in \mathcal{P}$ and for all $i \in C$, $\max_{j \in C} d(i, j) < \min_{k \notin C} d(i, k)$,
where d(i, j), without subscript, denotes the length of the shortest path between
i and j in the whole graph G. See Fig. 2 for an example.
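For concreteness, here is a small Python check of the regularity condition (our sketch; it assumes the matrix of shortest-path lengths d(i, j) under the chosen path length assignment has already been computed):

def is_regular(partition, d):
    """Return True iff every node is strictly closer to all nodes of its own
    cluster than to any node outside it.

    partition: list of sets of vertices; d: dict (i, j) -> shortest-path length.
    """
    all_nodes = set().union(*partition)
    for C in partition:
        for i in C:
            inside = max((d[i, j] for j in C if j != i), default=0.0)
            outside = min((d[i, k] for k in all_nodes - C), default=float("inf"))
            if inside >= outside:
                return False
    return True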
In a regular partition each node is closer to every node in its cluster than to
any node outside. When $-d(\cdot, \cdot)$ is taken as similarity function, our notion
of regular partition becomes equivalent to the Apresjan clusters in [4] and to
the strict separation property of [5]. It is easy to see that according to (1) all
subgraphs induced by the clusters of a regular partition are connected graphs.
Note that every labeled graph G = (V, E) has at least two regular partitions,
since both the trivial partitions $\mathcal{P} = \{V\}$ and $\mathcal{P} = \{\{1\}, \{2\}, \ldots, \{|V|\}\}$ are
regular. Moreover, if labels are binary then the notion of regular partition is
equivalent to the natural partition made up of the smallest number of clusters
C, each one including only nodes with the same label.
We now introduce an algorithm, cga, that takes advantage of regular par-
titions. As we show in Sect. 4, the cumulative loss of cga can be expressed in
terms of the best regular partition of G with respect to the unknown labeling
$y \in \mathbb{R}^n$.
At each time step t, cga sets $\hat{y}_t$ to be the (known) label $y_{q_t}$ of the selected
vertex $q_t \in V_{t-1}$. Hence, the algorithm's cumulative loss is the cost of the span-
ning tree with edges $\{(q_t, i_t) : t = 1, \ldots, |V| - 1\}$, where edge $(q_t, i_t)$ has cost
Fig. 2. Two copies of a graph with real labels $y_i$ associated with each vertex i. On the
left, a shortest path connecting the two nodes enclosed in double circles is shown. The
path length is $\max_k \ell(s_{k-1}, s_k)$, where $\ell(i, j) = |y_i - y_j|$. The thick black edge is incident
to the nodes achieving the max in the path length expression. On the right, the vertices
of the same graph have been clustered to form a regular partition. The diameter of a
cluster C (the maximum of the pairwise distances between nodes of C) is denoted by
d. Similarly, $\bar{d}$ denotes the minimum of the distances d(i, j) over $i \in C$ and
$j \in V \setminus C$. Note that $\bar{d}$ is determined by one of the thick black edges connecting C with
the rest of the graph, while d is determined by the two nodes incident to the thick gray
edge. The partition is regular, hence $d < \bar{d}$ holds for each cluster.
$\ell(i, j) = \ell(y_i, y_j)$. The key to controlling this cost, however, is the specific rule
the algorithm uses to select the next $q_t$ based on $G_{t-1}$. The approach we propose
is simple. If there exists a regular partition of G with few elements, then it does
not really matter how the spanning tree is built within each element, since the
cost of all these different trees will be small anyway. What matters the most
is the cost of the edges of the spanning tree that join two distinct elements of
the partition. In order to keep this cost small, our algorithm learns to select $q_t$
so as to avoid going back to the same region many times. This is based on the
following notions.
Fix an arbitrary subset $C \subseteq V$. The inner border
$\partial C$ of C is the set of all nodes $i \in C$ that are adjacent
to a node $j \notin C$ (the dark grey nodes in the picture
at the side). The outer border $\bar{\partial} C$ of C is the set of
all the nodes $j \notin C$ that are adjacent to at least one
node in the inner border of C (the light grey nodes).
We are now ready to define the exploration/
prediction rule of our algorithm. At each time t, cga selects and predicts the
label of a node adjacent to the node in the inner border of $V_{t-1}$ which is closest
to the previously predicted node $i_{t-1}$. Formally,

yt = yqt where qt = argmin dt−1 (it−1 , q). (2)


q∈∂V t−1
Fig. 3. The behavior of cga displayed on the binary labeled graph of Fig. 1. The length
of a path $s_1, \ldots, s_d$ is measured by $\max_k \ell(s_{k-1}, s_k)$ and the loss is the zero-one loss.
The pictorial conventions are as in Fig. 1. As in that figure, the cutsize $\Phi_G(y)$ of this
graph can be made as close to |V| as we like, yet cga makes 4 mistakes. For the sake of
comparison, recall that the various versions of depthFirst can be forced to err $\Phi_G(y)$
times on this graph.

We say that cluster C is exhausted at time t if at time t the algorithm has
already selected all nodes in C together with its outer border, i.e., if $C \cup \bar{\partial} C \subseteq V_t$.
In the special but important case when labels are binary and the path length is
$\lambda(s_1, \ldots, s_d) = \max_k \ell(s_{k-1}, s_k)$ (with $\ell$ the zero-one loss), the choice of node $q_t$
in (2) can be defined as follows. If the cluster C where $i_{t-1}$ lies is not exhausted
at the beginning of time t, then cga picks any node $q_t$ connected to $i_{t-1}$ by
a path all contained in $V_{t-1} \cap C$. On the other hand, if C is exhausted, cga
chooses an arbitrary node in $V_{t-1}$. Figure 3 contains a pictorial explanation of
the behavior of cga, as compared to depthFirst on the same binary labeled
graph as in Fig. 1. As we argue in the next section (Lemma 1 in Sect. 4), a key
property of cga is that when choosing $q_t$ causes the algorithm to move out of a
cluster of a regular partition, then the cluster must have been exhausted.
This suggests a fundamental difference between cga and simple algorithms
like depthFirst. Evidence of that is provided by comparing Fig. 1 to Fig. 3.
cga is seen to make a constant number of binary prediction mistakes on simple
graphs where depthFirst makes order of |V | mistakes.
The next definition provides our main measure of graph label regularity, which
we relate to cga's predictive ability. Given a regular partition $\mathcal{P}$ of the vertices
V of an undirected, connected and labeled graph G = (V, E), for each $C \in \mathcal{P}$
the merging degree $\delta(C)$ of cluster C is defined as $\delta(C) = \min\{|\partial C|, |\bar{\partial} C|\}$.
The overall merging degree of the partition, denoted by $\delta(\mathcal{P})$, is given by
$\delta(\mathcal{P}) = \sum_{C \in \mathcal{P}} \delta(C)$.
The merging degree δ(C) of a cluster C ∈ P quantifies the amount of inter-
action between C and the remaining clusters in P. For instance, in Fig. 3 the
left-most cluster has merging degree 1, the middle one has merging degree 2, and
the right-most one has merging degree 1. Hence this figure shows a case in which
the mistake bound of our algorithm is tight. Note that the middle cluster has
Learning Unknown Graphs 117

merging degree 2 no matter how we increase the number of negatively labeled


nodes in the dotted area (together with the corresponding outbound edges).
In the binary case, it is not difficult to compare the merging degree of a
partition to the graph cutsize. Since at least one edge contributing to the cutsize
$\Phi_G(y)$ must be incident to each node in an inner or outer border of a cluster,
$\delta(\mathcal{P})$ is never larger than $2\Phi_G(y)$. On the other hand, as suggested for example
by Fig. 3, $\delta(\mathcal{P})$ is often much smaller than $\Phi_G(y)$ (observe that $\delta(\mathcal{P})$ is never larger
than n, while $\Phi_G(y)$ can even be quadratic on dense graphs). Finally, as hinted
again by Fig. 3, $\delta(\mathcal{P})$ is typically more robust to noise as compared to $\Phi_G(y)$. For
instance, if we flip the label of the left-most node of the cluster on the right, the
merging degree of the depicted partition gets affected only by a small amount,
whereas the cutsize can decrease significantly.
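Computing the merging degree of a cluster is straightforward; the following Python sketch (ours, for illustration) derives both borders from the edge set:

def merging_degree(C, edges):
    """delta(C) = min(|inner border of C|, |outer border of C|).

    C: set of vertices; edges: set of frozenset({u, v}) pairs (undirected)."""
    crossing = [e for e in edges if len(e & C) == 1]   # edges leaving C
    inner = {next(iter(e & C)) for e in crossing}      # endpoints inside C
    outer = {next(iter(e - C)) for e in crossing}      # endpoints outside C
    return min(len(inner), len(outer))

# Example: the path 1 - 2 - 3 - 4 with clusters {1, 2} and {3, 4}.
edges = {frozenset({1, 2}), frozenset({2, 3}), frozenset({3, 4})}
print(merging_degree({1, 2}, edges))  # 1: inner border {2}, outer border {3}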

4 Analysis
This section contains the analysis of cga’s predictive performance. The compu-
tational complexity analysis is contained in Sect. 5. For the sake of presentation,
we single out the binary classification case since it is an important special case
of our setting.
Fix an undirected and connected graph G = (V, E). The following lemma is
a key property of our algorithm.
Lemma 1. Assume cga is run on a graph G with labeling $y \in \mathcal{Y}^n$, and pick
any time step t > 0. Let $\mathcal{P}$ be a regular partition and assume $i_{t-1} \in C$, where C
is any cluster in $\mathcal{P}$. Then C is exhausted at time t − 1 if and only if $q_t \notin C$.
Proof. First, assume C is exhausted at time t − 1, i.e., $C \cup \bar{\partial} C \subseteq V_{t-1}$. Then
all nodes in C have been visited, and no node in C has unexplored edges. This
implies $C \cap \partial V_{t-1} = \emptyset$ and that the selection rule (2) makes the algorithm pick
$q_t$ outside of C. Assume now $q_t \notin C$. Since each cluster is a connected subgraph,
if the labels are binary the prediction rule ensures that cluster C is exhausted.
In the general case (when labels are not binary) we can prove by contradiction
that C is exhausted by analyzing the following two cases:

1. There exists $j \in C \setminus V_{t-1}$. Since the subgraph in cluster C is connected, there
is a path in C connecting $i_{t-1}$ to j such that at least one node $q' \in C$ on this
path: (a) has unexplored edges, and (b) belongs to $V_{t-1}$ (i.e., $q' \in \partial V_{t-1}$),
and (c) is connected to $i_{t-1}$ by a path all contained in $C \cap V_{t-1}$. Since the
partition is regular, $q'$ is closer to $i_{t-1}$ than to any node outside of C. Hence,
by construction (see (2)), the algorithm would choose this $q'$ instead of $q_t$
(due to (c) above), thereby leading to a contradiction.
2. There exists $j \in \bar{\partial} C \setminus V_{t-1}$. Again, since the subgraph in cluster C is con-
nected, there is a path in C connecting $i_{t-1}$ to a node in $\partial C$ adjacent to j.
Then we fall back into the previous case, since at least one node $q'$ on this
path: (a) has unexplored edges, and (b) belongs to $V_{t-1}$, and (c) is connected
to $i_{t-1}$ by a path all contained in $C \cap V_{t-1}$. □
We begin by analyzing the special case of binary labels and zero-one loss.
Theorem 1. If cga is run on an undirected and connected graph G with binary
labels, then the total number m of mistakes satisfies $m \le \delta(\mathcal{P})$, where $\mathcal{P}$ is the
partition of V made up of the smallest number of clusters, each including only
nodes with the same label.¹
The key idea of the proof of this theorem is the following. Fix a cluster $C \in \mathcal{P}$.
In each time step t when both $q_t$ and $i_t$ belong to C, a mistake never occurs. The
remaining time steps are of two kinds only: (1) incoming lossy steps, where node
$i_t$ belongs to the inner border of C; (2) outgoing lossy steps, where $i_t$ belongs to
the outer border of C. With each such step we can thus uniquely associate a node
$i_t$ in either the inner or the outer border of C. The overall loss involving C, however,
is typically much smaller than the sum of the border cardinalities. This is because, in
general, in each given cluster incoming and outgoing steps alternate, since the
algorithm first enters and then leaves the cluster. Hence, incoming and outgoing
steps must occur the same number of times, and their sum must then be at most
twice the minimum of the sizes of the borders (what we called the merging degree of the
cluster). The only exception to this alternating pattern occurs when a cluster
gets exhausted. In this case an incoming step is not followed by any outgoing
step for the exhausted cluster.

Proof of Theorem 1. Index by $1, \ldots, |\mathcal{P}|$ the clusters in $\mathcal{P}$. We abuse the
notation and use $\mathcal{P}$ also to denote the set of cluster indices. Let k(t) be the
index of the cluster to which $i_t$ belongs, i.e., $i_t \in C_{k(t)}$. We say that step t is a
lossy step if $\hat{y}_t \neq y_t$, i.e., the label of $q_t$ is different from the label of $i_t$. A step t
in which a mistake occurs is incoming for cluster i (denoted by $* \to i$) if $q_t \notin C_i$
and $i_t \in C_i$, and it is outgoing for cluster i (denoted by $i \to *$) if $q_t \in C_i$ and
$i_t \notin C_i$. An outgoing step for cluster $C_i$ is regular if the previous step in which
the algorithm made a mistake is incoming for $C_i$. All other outgoing steps are
called irregular. Let $M_{\to i}$ (resp. $M_{i\to}^{reg}$) be the set of all incoming (resp. regular
outgoing) lossy steps for cluster $C_i$. Also, let $M_{i\to}^{irr}$ be the set of all irregular
outgoing lossy steps for $C_i$.
For each $i \in \mathcal{P}$, define an injective mapping $\mu_i : M_{i\to}^{reg} \to M_{\to i}$ as follows (see
Fig. 4 for reference): each lossy step t in $M_{i\to}^{reg}$ is mapped to the previous step
$t' = \mu_i(t)$ in which a mistake occurred. Lemma 1 ensures that such a step must be
incoming for i, since t is a regular outgoing step. This shows that $|M_{i\to}^{reg}| \le |M_{\to i}|$.
Now, let t be any irregular outgoing step for some cluster, let $t'$ be the last lossy
step occurring before time t, and set $j = k(t')$. The very definition of an irregular
lossy step, combined with Lemma 1, allows us to conclude that $t'$ is the last lossy
step involving cluster $C_j$. This implies that $t'$ cannot be followed by an outgoing
lossy step $j \to *$. Hence $t'$ is not in the image of $\mu_j$, and the previous inequality
for $|M_{i\to}^{reg}|$ can be refined as $|M_{i\to}^{reg}| \le |M_{\to i}| - I_i$. Here $I_i$ is the indicator function
of the following event: "the very last lossy step $\bar{t}$ such that either $q_{\bar{t}}$ or $i_{\bar{t}}$ belongs
to $C_i$ is incoming for $C_i$". We now claim that $\sum_{i \in \mathcal{P}} I_i \ge \sum_{i \in \mathcal{P}} |M_{i\to}^{irr}|$. In fact,
if we let t be an irregular lossy step and i be the index of the cluster for which
the previous lossy step $t'$ is incoming, the fact that t is irregular implies that $C_i$
must be exhausted between time $t'$ and time t, which in turn implies that $I_i = 1$,
since $t'$ must be the very last lossy step involving cluster $C_i$. Hence
$$m = \sum_{i \in \mathcal{P}} |M_{i\to}^{reg} \cup M_{i\to}^{irr}| \le \sum_{i \in \mathcal{P}} \left(|M_{\to i}| - I_i + |M_{i\to}^{irr}|\right) \le \sum_{i \in \mathcal{P}} |M_{\to i}|. \quad (3)$$

¹ Note that such a $\mathcal{P}$ is a regular partition of V. Moreover, one can show that for this
partition the bound in the theorem is never vacuous.

Next, for each $i \in \mathcal{P}$ we define two further injective mappings that associate with
each incoming lossy step $* \to i$ a vertex in the inner border of $C_i$ and a vertex in
the outer border of $C_i$. This shows that $|M_{\to i}| \le \min\{|\partial C_i|, |\bar{\partial} C_i|\} = \delta(C_i)$ for
each $i \in \mathcal{P}$. Together with (3) this completes the proof (see again Fig. 4
for a pictorial explanation).

Fig. 4. Sequence (starting from the left) of incoming and regular outgoing lossy steps
involving a given cluster $C_i$. We only show the border nodes contributing to lossy steps.
We map injectively each regular outgoing lossy step t to the previous (incoming) lossy
step $\mu_i(t)$. We also map injectively each incoming lossy step s to the node $\nu_1(s)$ in the
inner border, whose label was predicted at time s. Finally, we map injectively s also to
the node $\nu_2(s)$ in the outer border that caused the previous (outgoing) lossy step for
the same cluster.

The first injective mapping $\nu_1 : M_{\to i} \to \partial C_i$ is easily defined: $\nu_1(t) = i_t \in \partial C_i$.
This is an injection because the algorithm can incur loss on a vertex at most
once. The second injective mapping $\nu_2 : M_{\to i} \to \bar{\partial} C_i$ is defined in the following
way. Let $M_{\to i}$ be equal to $\{t_1, \ldots, t_k\}$, with $t_1 < \cdots < t_k$. If $t = t_1$ then $\nu_2(t)$
is simply $q_t \in \bar{\partial} C_i$. If instead $t = t_j$ with $j \ge 2$, then $\nu_2(t) = i_{t'} \in \bar{\partial} C_i$, where
$t'$ is an outgoing lossy step $i \to *$ lying between $t_{j-1}$ and $t_j$. Note that cluster
$C_i$ cannot be exhausted after step $t_{j-1}$, since another incoming lossy step $* \to i$
occurs at time $t_j > t_{j-1}$. Combined with Lemma 1, this guarantees the existence
of such a $t'$. Moreover, no subsequent outgoing lossy step $i \to *$ can mispredict
the same label $y_{i_{t'}}$. □
As we already noted, the edges (qt , it ) produced during the online functioning of
the algorithm form a spanning tree T for G. Therefore cga’s number of mistakes
m is always equal to ΦT (y). This shows that an obvious lower bound on m is the
total number of clusters |P|, i.e., the cost of the minimum spanning tree for G.
In fact, it is not difficult to prove that an adaptive adversary can always force
any algorithm working within our learning protocol to make Ω(|P|) mistakes.
This simple observation can be strengthened so as to match the upper bound in
Theorem 1.

Theorem 2. For all undirected and connected graphs G with n nodes and de-
gree bounded by a constant, for all K < n, and for any (randomized) explo-
ration/prediction strategy, there exists a labeling y of G’s vertices such that the
strategy makes at least K/2 mistakes (in expectation) with respect to the algo-
rithm’s internal randomization, while δ(P) = O(K).

The above lower bound, whose proof is omitted due to space limitations, can
actually be shown to hold even in cases when G does not have bounded degree
nodes, like cliques or general trees.
We now turn to the general case of nonbinary labels. The following definitions
are useful for expressing the cumulative loss bound of our algorithm. Let $\mathcal{P}$ be a
regular partition of the vertex set V and fix a cluster $C \in \mathcal{P}$. We say that edge
$(q_t, i_t)$ causes an inter-cluster loss at time t if one of the two nodes of this
edge lies in $\partial C$ and the other lies in $\bar{\partial} C$. Edge $(q_t, i_t)$ causes an intra-cluster
loss when both $q_t$ and $i_t$ are in C. We denote by $\ell(C)$ the largest inter-cluster
loss in C, i.e.,
$$\ell(C) = \max_{i \in \partial C,\ j \in \bar{\partial} C,\ (i,j) \in E} \ell(y_i, y_j).$$
Also, $\ell_{\mathcal{P}}^{\max}$ is the maximum inter-cluster loss in the whole graph G, i.e.,
$\ell_{\mathcal{P}}^{\max} = \max_{C \in \mathcal{P}} \ell(C)$. We also set for brevity $\bar{\ell}_{\mathcal{P}} = |\mathcal{P}|^{-1} \sum_{C \in \mathcal{P}} \ell(C)$.
Finally, we define $\varepsilon(C) = \max_{T_C} \sum_{(i,j) \in E(T_C)} \ell(y_i, y_j)$, where the max is over all
spanning trees $T_C$ of C and $E(T_C)$ is the edge set of $T_C$. Note that $\varepsilon(C)$ bounds
from above² the total loss incurred in all steps t where $q_t$ and $i_t$ both belong to C.
In the above definitions, $\ell(C)$ is a measure of the connectivity of C to the remaining
clusters, $\varepsilon(C)$ is a measure of "internal cohesion" of C, while $\ell_{\mathcal{P}}^{\max}$ and $\bar{\ell}_{\mathcal{P}}$ give
global distance measures among the clusters within $\mathcal{P}$.
The following theorem shows that cga's cumulative loss can be bounded
in terms of the regular partition $\mathcal{P}$ that best trades off total intra-cluster loss
(expressed by $\varepsilon(C)$) against total inter-cluster loss (expressed by $\delta(C)$ times
the largest inter-cluster loss $\ell(C)$). It is important to stress that cga never
explicitly computes this optimal partition: it is the selection rule for $q_t$ in (2)
that guarantees this optimal behavior.

Theorem 3. If cga is run on an undirected and connected graph G with arbi-
trary real labels, then the cumulative loss can be bounded as
$$\sum_{t=1}^{n} \ell(\hat{y}_t, y_t) \le \min_{\mathcal{P}} \left\{ |\mathcal{P}| \left(\ell_{\mathcal{P}}^{\max} - \bar{\ell}_{\mathcal{P}}\right) + \sum_{C \in \mathcal{P}} \left(\varepsilon(C) + \ell(C)\,\delta(C)\right) \right\}, \quad (4)$$
where the minimum is over all regular partitions $\mathcal{P}$ of V.

² cga's cumulative loss is $\sum_{t=1}^{|V|-1} \ell(q_t, i_t)$, where the edges $(q_t, i_t)$, $t = 1, \ldots, |V|-1$,
form a spanning tree for G; hence the subset of such edges which are incident to
nodes in C forms a spanning forest for C. Our definition of $\varepsilon(C)$ takes into account
that the total loss associated with the edge set of a spanning tree $T_C$ for C is at least
as large as the total loss associated with the edge set $E(\mathcal{F})$ of any spanning forest
$\mathcal{F}$ for C such that $E(\mathcal{F}) \subseteq E(T_C)$.

Remark 1. If $\ell$ is the zero-one loss, then the bound in (4) reduces to
$$\sum_{t=1}^{n} \ell(\hat{y}_t, y_t) \le \min_{\mathcal{P}} \sum_{C \in \mathcal{P}} \left(\varepsilon(C) + \delta(C)\right). \quad (5)$$

This shows that in the binary case the total number of mistakes can also be
bounded by the maximum number of edges connecting different clusters that
can be part of a spanning tree for G. In the binary case (5) achieves its min-
imum either on the trivial partition P = {V } or on the partition made up of
the smallest number of clusters C, each one including only nodes with the same
label (as in Theorem 1). In most cases, the nontrivial regular partition is the
minimizer of (5), so that the intra-cluster term ε(C) disappears. Then the bound
only includes the sum of merging degrees (w.r.t. that partition), thereby recov-
ering the bound in Theorem 1. However, in certain degenerate cases, the trivial
partition P = {V } turns out to be the best one. In such a case, the right-hand
side of (5) becomes ε(V ) which, in turn, is bounded by ΦG (y).

The proof of Theorem 3 is similar to the one for the binary case, hence we
only emphasize the main differences. Let $\mathcal{P}$ be a regular partition of V. Clearly,
no matter how each $C \in \mathcal{P}$ is explored, the contribution to the total loss of
$\ell(q_t, i_t)$ for $q_t, i_t \in C$ is bounded by $\varepsilon(C)$. The remaining losses contributed by
any cluster C are of two kinds only: losses on incoming steps, where the node $i_t$
belongs to the inner border of C, and losses on outgoing steps, where $i_t$ belongs
to the outer border of C. As in the binary case, with each such step we can
thus associate a node in the inner and the outer border of C, since incoming
and outgoing steps alternate for each cluster. The exception is when a cluster is
exhausted which, at first glance, seems to require adding an extra term as big
as $\ell_{\mathcal{P}}^{\max}$ times the size $|\mathcal{P}|$ of the partition (this term could have a significant
impact for certain graphs). However, as explained in the proof below, $\ell_{\mathcal{P}}^{\max}$ can
be replaced by the potentially much smaller term $\ell_{\mathcal{P}}^{\max} - \bar{\ell}_{\mathcal{P}}$. In fact, in certain
cases this extra term disappears, and the final bound we obtain is just (5).
Proof of Theorem 3. Fix an arbitrary regular partition $\mathcal{P}$ of V and index by
$1, \ldots, |\mathcal{P}|$ the clusters in it. We abuse the notation and use $\mathcal{P}$ also to denote the
set of cluster indices. We crudely upper bound the total loss incurred during
intra-cluster lossy steps by $\sum_{C \in \mathcal{P}} \varepsilon(C)$. Hence, in the rest of the proof we focus
on bounding the total loss incurred during inter-cluster lossy steps only. We say
that step t is a lossy step if $\ell(q_t, i_t) > 0$, and we distinguish between intra-
cluster lossy steps (when $q_t$ and $i_t$ belong to the same cluster) and inter-cluster
lossy steps (when $q_t$ and $i_t$ belong to different clusters). We define incoming and
outgoing (regular and irregular) inter-cluster lossy steps for a given cluster $C_i$
(and the relative sets $M_{\to i}$, $M_{i\to}^{reg}$ and $M_{i\to}^{irr}$) as in the binary case proof, as well
as the injective mapping $\mu_i$. In the binary case we bounded $|M^{\mathrm{reg}}_{i\to}|$ by $|M_{\to i}| - I_i$. In a similar fashion, we now bound $\sum_{t \in M^{\mathrm{reg}}_{i\to}} \ell_t$ by $\hat\ell(C_i)\big(|M_{\to i}| - I_i\big)$, where we set for brevity $\ell_t = \ell(q_t, i_t)$. We can write

$$\sum_{i \in \mathcal P} \sum_{t \in M^{\mathrm{reg}}_{i\to} \cup M^{\mathrm{irr}}_{i\to}} \ell_t \;\le\; \sum_{i \in \mathcal P} \hat\ell(C_i)\big(|M_{\to i}| - I_i\big) + \hat\ell^{\,\mathcal P}_{\max} \sum_{i \in \mathcal P} |M^{\mathrm{irr}}_{i\to}|$$
$$\le\; \sum_{i \in \mathcal P} \hat\ell(C_i)\,|M_{\to i}| + \sum_{j \in \mathcal P \,:\, I_j = 1} \big(\hat\ell^{\,\mathcal P}_{\max} - \hat\ell(C_j)\big)$$
$$\le\; \sum_{i \in \mathcal P} \hat\ell(C_i)\,|M_{\to i}| + |\mathcal P|\,\big(\hat\ell^{\,\mathcal P}_{\max} - \bar\ell_{\mathcal P}\big),$$

where the second inequality follows from $\sum_{i \in \mathcal P} I_i \ge \sum_{i \in \mathcal P} |M^{\mathrm{irr}}_{i\to}|$ (as for the regular partition considered in the binary case). The proof is concluded after defining the two injective mappings $\nu_1$ and $\nu_2$ as in the binary case, and bounding again $|M_{\to i}|$ through $\delta(C_i)$. □

5 Computational Complexity
In this section we briefly describe an efficient implementation of cga, and discuss
some improvements for the special case of binary labels. This implementation
shows that cga is especially useful when dealing with large scale applications.
Recall that the path length assignment λ is a parameter of the algorithm and satisfies (1). In order to develop a consistent argument about cga's time and space requirements, we need to make assumptions on the time it takes to compute this function. If we are given the distance between any pair of nodes i and j, and the loss $\ell(j, j')$ for any j′ adjacent to j, we assume we are able to compute in constant time the length of the shortest path $i, \dots, j, j'$. This assumption is easily seen to hold for many natural path length assignments λ over graphs, for instance $\lambda(s_1, \dots, s_d) = \max_k \ell(s_{k-1}, s_k)$ and $\lambda(s_1, \dots, s_d) = \sum_k \ell(s_{k-1}, s_k)$ — note that both fulfill (1).
Because of the above assumption on the path length λ, in the general case of real labels cga can be implemented using the well-known Dijkstra's algorithm for single-source shortest paths (see, e.g., [7, Ch. 21]). After all nodes in $V_{t-1}$ and all edges incident to $i_t$ have been revealed, cga computes the distance between $i_t$ and any other node in $V_{t-1}$ by invoking Dijkstra's algorithm on the subgraph $G_t$, so that cga can easily find node $q_{t+1}$. If Dijkstra's algorithm is implemented with Fibonacci heaps [7, Ch. 25], the total time required for predicting all |V| labels is $O(|V||E| + |V|^2 \log |V|)$. (In practice, the actual running time is often far less, since at each time step t Dijkstra's algorithm can be stopped as soon as the node of $\partial V_{t-1}$ nearest to $i_t$ in $G_t$ has been found.) On the other hand, the space complexity is always linear in the size of G.
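To make this concrete, the query-selection step can be sketched in Python as follows, under the additive assignment $\lambda(s_1, \dots, s_d) = \sum_k \ell(s_{k-1}, s_k)$; the data layout, the function name, and the use of a plain binary heap instead of a Fibonacci heap are our own illustrative choices, not part of the paper.

import heapq

def nearest_labeled_node(adj, source, labeled):
    """One query-selection step of cga (general real-label case).

    adj: dict mapping each revealed node to a list of (neighbor, loss)
         pairs; the per-edge loss plays the role of the edge length under
         the additive assignment lambda = sum of losses.
    source: the newly revealed node i_t.
    labeled: set of already labeled nodes (V_{t-1}).
    Returns the labeled node nearest to i_t, i.e. the next query q_{t+1}.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                      # stale heap entry
        if u in labeled and u != source:
            return u                      # nearest labeled node found
        for v, loss in adj.get(u, ()):
            nd = d + loss
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return None                           # i_t not yet connected to a labeled node

The early return on the first labeled node popped from the heap is exactly the early-stopping optimization mentioned parenthetically above.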
We now sketch the binary case. The additional assumption $\lambda(s_1, \dots, s_d) = \max_k \ell(s_{k-1}, s_k)$ allows us to exploit the simple structure of regular partitions.
Coarsely speaking, we maintain information about the current inner border and
clusters, and organize this information in a balanced tree, connecting the nodes
lying in the same cluster through specially designed lists.
In order to describe this implementation, it is important to observe that, since
the graph is revealed incrementally, it might be the case that a single cluster C
in G at time t happens to be split into several disconnected parts in Gt . We
call sub-cluster each maximal set of nodes that are part of the same uniformly
labeled and connected subgraph of Gt . The main data structures we use (further
details are omitted due to space limitations) for organizing the nodes observed
so far by the algorithm combine the following:
– A self-balancing binary search tree T containing the labeled nodes in Vt . We
will refer to nodes in Vt and to nodes in T interchangeably.
– Given a sub-cluster C, all nodes in $C \cap \partial V_t$ are connected via a special
list called border sub-cluster list. The remaining nodes in C are connected
through a list called internal sub-cluster list.
– All nodes in each sub-cluster C ⊆ Vt are linked to a special time-varying
set called sub-cluster record. This record enables access to the first and last
element of both the border and the internal sub-cluster list of C. The sub-
cluster record also contains the size of C.
The above data structures are intended to support the following main operations,
which are executed in the following order at each time step t, just after the
algorithm has selected qt : (1) insertion of it ; when it is chosen by the adversary
cga also receives the list N (it ) of all nodes in Vt−1 adjacent to it ; (2) merging of
subclusters required after the disclosure of yt ; (3) update of border and internal
sub-cluster lists (since some nodes in $\partial V_{t-1}$ are not in $\partial V_t$); (4) choice of $q_{t+1}$.
The merging operation can be implemented as union-by-rank in standard union-find data structures (see, e.g., [7, Ch. 22]). The overall running time for |V| nodes is smaller than $O(|V| \log |V|)$. In fact, the dominating cost in the time complexity is the cost of reaching, at each time step t, the nodes of $V_{t-1}$ adjacent to $i_t$. Each of these neighbors of $i_t$ can be bijectively associated with an edge of E, the height of the tree T being at most logarithmic in |V|. Hence the overall running time for predicting |V| labels is $O(|E| \log |V| + |V| \log |V|) = O(|E| \log |V|)$, which is the best one can hope for (an obvious lower bound is |E|) up to a logarithmic factor.
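The merging step itself can be sketched with a generic union-by-rank union-find structure in Python; this is only the standard building block the paragraph refers to, not the authors' exact data structure, which additionally threads border and internal sub-cluster lists through each sub-cluster record.

class SubClusterUnionFind:
    """Union-by-rank union-find for merging sub-clusters.

    Each revealed node starts as its own sub-cluster; when the label of
    i_t is disclosed, its sub-cluster is merged with those of the equally
    labeled neighbors.  Path compression is a standard extra that keeps
    find() effectively constant time.
    """

    def __init__(self):
        self.parent, self.rank, self.size = {}, {}, {}

    def make_set(self, v):
        self.parent[v], self.rank[v], self.size[v] = v, 0, 1

    def find(self, v):
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]   # path compression
            v = self.parent[v]
        return v

    def union(self, u, v):
        ru, rv = self.find(u), self.find(v)
        if ru == rv:
            return ru
        if self.rank[ru] < self.rank[rv]:   # attach the shallower tree
            ru, rv = rv, ru                 # to the deeper one
        self.parent[rv] = ru
        self.size[ru] += self.size[rv]
        if self.rank[ru] == self.rank[rv]:
            self.rank[ru] += 1
        return ru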
As for space complexity, it is important to stress that on every step t the
algorithm first stores and then “throws away” the received node list $N(i_t)$ (in the
worst case, the length of N (it ) is linear in |V |). The space complexity is therefore
O(|V |). This optimal use of space is one of the most important practical strengths
of cga, since the algorithm never needs to store the whole graph seen so far.

6 Conclusions and Ongoing Research


We have presented a first step towards the study of problems related to learning
(labeled) graph exploration strategies. This is a significant departure from more
standard approaches assuming prior knowledge of the underlying graph structure (e.g., [2, 3, 6, 9, 10, 11, 12, 13, 14, 17] and references therein).
We are currently investigating to what extent our approach can be extended
to weighted graphs. In order to exploit the benefits of edge weights, our pro-
tocol in Sect. 2 could be modified to let cga observe the weights of all edges
incident to the current node. Whenever the weights of intra-cluster edges are
heavier than those of inter-cluster ones, our algorithm can take advantage of the
additional weight information. This calls for an analysis being able to capture
the interaction between node labels and edge weights.

Acknowledgments. We would like to thank the ALT 2009 reviewers for their
comments which greatly improved the presentation of this paper. This work
was supported in part by the PASCAL2 Network of Excellence under EC grant
216886. This publication only reflects the authors’ views.

References
[1] Albers, S., Henzinger, M.: Exploring unknown environments. SIAM Journal on
Computing 29(4), 1164–1188 (2000)
[2] Blum, A., Chawla, S.: Learning from labeled and unlabeled data using graph
mincuts. In: Proc. 18th ICML. Morgan Kaufmann, San Francisco (2001)
[3] Blum, A., Lafferty, J., Rwebangira, M., Reddy, R.: Semi-supervised learning using
randomized mincuts. In: Proc. 21st ICML. ACM Press, New York (2004)
[4] Bryant, D., Berry, V.: A structured family of clustering and tree construction
methods. Advances in Applied Mathematics 27, 705–732 (2001)
[5] Balcan, M.-F., Blum, A., Vempala, S.: A discriminative framework for clustering via
similarity functions. In: Proc. 40th STOC. ACM Press, New York (2008)
[6] Cesa-Bianchi, N., Gentile, C., Vitale, F.: Fast and optimal prediction of a labeled
tree. In: Proc. 22nd COLT. Omnipress (2009)
[7] Cormen, T.H., Leiserson, C.E., Rivest, R.L.: Introduction to Algorithms. MIT
Press, Cambridge (1990)
[8] Deng, X., Papadimitriou, C.H.: Exploring an unknown graph. In: Proc. 31st
FOCS, pp. 355–361. IEEE Press, Los Alamitos (1990)
[9] Hanneke, S.: An analysis of graph cut size for transductive learning. In: Proc. 23rd
ICML, pp. 393–399. ACM Press, New York (2006)
[10] Herbster, M., Pontil, M.: Prediction on a graph with the Perceptron. In: NIPS,
vol. 19, pp. 577–584. MIT Press, Cambridge (2007)
[11] Herbster, M.: Exploiting cluster-structure to predict the labeling of a graph. In:
Freund, Y., Györfi, L., Turán, G., Zeugmann, T. (eds.) ALT 2008. LNCS (LNAI),
vol. 5254, pp. 54–69. Springer, Heidelberg (2008)
[12] Herbster, M., Lever, G., Pontil, M.: Online prediction on large diameter graphs.
In: NIPS, vol. 22. MIT Press, Cambridge (2009)
[13] Herbster, M., Pontil, M., Rojas-Galeano, S.: Fast prediction on a tree. In: NIPS,
vol. 22. MIT Press, Cambridge (2009)
[14] Herbster, M., Lever, G.: Predicting the labelling of a graph via minimum p-
seminorm interpolation. In: Proc. 22nd COLT. Omnipress (2009)
[15] Joachims, T.: Transductive Learning via Spectral Graph Partitioning. In: Proc.
20th ICML, pp. 305–312. AAAI Press, Menlo Park (2003)
Learning Unknown Graphs 125

[16] Kondor, I., Lafferty, J.: Diffusion kernels on graphs and other discrete input spaces.
In: Proc. 19th ICML, pp. 315–322. Morgan Kaufmann, San Francisco (2002)
[17] Pelckmans, J., Shawe-Taylor, J., Suykens, J., De Moor, B.: Margin based trans-
ductive graph cuts using linear programming. In: Proc. 11th AISTAT. JMLR
Proceedings Series, pp. 360–367 (2007)
[18] Remy, J., Souza, A., Steger, A.: On an online spanning tree problem in randomly
weighted graphs. Combinatorics, Probability and Computing 16, 127–144 (2007)
[19] Smola, A., Kondor, I.: Kernels and regularization on graphs. In: Schölkopf, B.,
Warmuth, M.K. (eds.) COLT/Kernel 2003. LNCS (LNAI), vol. 2777, pp. 144–
158. Springer, Heidelberg (2003)
[20] Yang, W.S., Dia, J.B.: Discovering cohesive subgroups from social networks for
targeted advertising. Expert Systems with Applications 34, 2029–2038 (2008)
Completing Networks Using Observed Data

Tatsuya Akutsu¹, Takeyuki Tamura¹, and Katsuhisa Horimoto²

¹ Bioinformatics Center, Institute for Chemical Research, Kyoto University,
Gokasho, Uji, Kyoto 611-0011, Japan
{takutsu,tamura}@kuicr.kyoto-u.ac.jp
² Computational Biology Research Center,
2-42 Aomi, Koto-ku, Tokyo 135-0064, Japan
k.horimoto@aist.go.jp

Abstract. This paper studies problems of completing a given Boolean network (Boolean circuit) so that the input/output behavior is consistent with given examples, where we only consider acyclic networks. These problems arise in the study of inference of signaling networks using reporter proteins. We prove that these problems are NP-complete in general and that a basic version remains NP-complete even for tree-structured networks. On the other hand, we show that these problems can be solved in polynomial time for partial k-trees of bounded (constant) indegree if a logarithmic number of examples are given.

1 Introduction
Inference of biological networks, which include genetic networks, protein-protein
interaction networks and signaling networks, is an important topic in bioinfor-
matics and computational systems biology. For inference of genetic networks,
extensive studies have been done in this decade. The objective of this prob-
lem is, given a series of gene expression profiles (a series of states of all genes under various environments and/or time steps), to infer a function along with input genes that regulates each gene, where a set of such functions constitutes a genetic network. In inference of genetic networks, it is assumed that the states of all genes are observable under each environment and/or each time step, though some noise may exist. This assumption is reasonable because we can observe
expression levels of all genes (or almost all genes) by using such technologies as
DNA microarray and DNA chip.
However, this assumption is not reasonable when we want to infer signaling
networks (i.e., signaling pathways). In this case, we need to observe activity levels
or quantities of proteins. Unfortunately, it is quite difficult to observe such kind
of data, especially in living organisms. Reporter proteins (or reporter genes) are
usually employed, each of which is associated with one or a few kinds of proteins [16]. However, both designing reporter proteins and introducing reporter proteins into cells are hard tasks. In particular, introducing multiple types of reporter

⋆ This work is partially supported by the Cell Array Project from NEDO, Japan and by a Grant-in-Aid ‘Systems Genomics’ from MEXT, Japan.


proteins is quite hard. Therefore, we can only assume that the activity levels
of one or a few kinds of proteins under various environments are observed in
analysis of signaling networks. While it is almost impossible to infer the network
only from such little information, we can utilize knowledge in literature and
databases. Thus, it is reasonable to assume that we have a preliminary network
model of the target signaling pathway where some parts are unclear or invalid.
Using observed data on the activity levels of a single or a few types of proteins
under various environments, it may be possible to modify a preliminary network
model so that it is consistent with the observed data. Among many ways to
modify the network model, it is reasonable to make the minimum modification,
by following the principle of Occam’s razor. This motivates us to study network
completion problems.
In this paper, we assume a Boolean network model [12] as a model of biological
networks because it is a fundamental model, a lot of theoretical and practical
studies have been done [12], and it has also been applied to analysis of signaling
networks [10]. We assume that the network topology is given (i.e., a set of input
nodes to each node is known) and Boolean functions are already assigned to
a subset of nodes. We also assume that a set of nodes is divided into external
nodes, internal nodes and output nodes, where only the activity levels of exter-
nal and output nodes can be observed. Output nodes correspond to proteins
whose activity levels are observed by reporter proteins, where we mainly con-
sider the case that there exists only one output node because it is very difficult
to introduce multiple reporter proteins. External nodes correspond to proteins
whose activity levels are controlled by stimuli given from outside of the cell
(e.g., environment), where these nodes can also be regarded as input nodes to
the network. Furthermore, we assume that the network is acyclic because the
state of the output node may not be determined uniquely if there exist cycles.
Therefore, we can assume that the state of output node is determined (through
internal nodes) from the states of external nodes. Then, a basic version of the
network completion problem is to determine Boolean functions for unassigned
nodes so that the resulting network is consistent with given set of examples ( i.e.,
a series of external and output states). We also consider variants of the problem
in which Boolean functions are assigned to all nodes but the minimum number
of modifications (e.g., modification of Boolean functions, deletions of edges) are
allowed. We show that these problems are NP-complete and the basic version
remains NP-complete even for tree structured networks. On the other hand, we
show that these problems can be solved in polynomial time for partial k-trees of
bounded (constant) indegree if a logarithmic number of examples are given.
There exist several related studies. As mentioned before, a lot of studies have
been done on inference of genetic networks from gene expression profiles. How-
ever, most of such studies are based on statistical or heuristic approaches and only
a few studies have been done from viewpoints of computational and/or sample
complexities. Akutsu et al. proposed a strategy to identify a genetic network un-
der a Boolean model using disruption and overexpression of multiple genes [2].
They analyzed combinatorial and computational complexities and showed that it
is possible to identify a network in polynomial time using $O(n^{2D})$ experiments,
where n is the number of nodes (i.e., number of genes) and D is the maximum
indegree (i.e., fan-in) of a network. They also analyzed the average case sample
complexity to identify a Boolean network when random gene expression patterns
are given and showed that O(log n) patterns are enough if D is bounded by a con-
stant [1]. Ideker et al. also studied a similar Boolean model and gave more practical
strategies for acyclic networks using information theoretic criteria [11]. Mochizuki
used a Boolean model and developed a method to estimate an upper bound of the
number of steady states from the network topology and observed data without
inferring Boolean functions [14]. In these studies, it is assumed that states of all
nodes are observable. Angluin et al. considered another model (called learning by
value injection queries) in which the state of only one node (i.e., output node) is
observable while the states of an arbitrary subset of nodes can be specified [4].
They showed that this problem is NP-hard in general and needs an exponential
number of queries in the worst case. They also showed that this problem can be
solved in polynomial time using a polynomial number of queries if the networks
are in class NC1 or AC0. This framework was further extended to analog circuits
[5] and probabilistic circuits [6]. This framework is close to ours, but is different in
two respects; value injection is possible for arbitrary nodes in their model whereas
value injection is not allowed in our framework (because it is not biologically plau-
sible to perform value injection queries to many genes/proteins); network topology
is not given in their framework whereas it is given in our framework. Of course, a
lot of studies have been done on exact and approximate learning of Boolean func-
tions [13]. However, the results are not directly applicable to our problems because
we assume that the network topology is given and a Boolean function assigned to
each node is arbitrary but of small indegree.

2 Problem Definitions

In this paper, we only consider acyclic Boolean networks as in [4,5,6]. Though


the states of nodes in a usual Boolean network are updated synchronously, we
need not consider time steps because the states of all nodes in acyclic Boolean
networks are determined uniquely from the states of external nodes and thus an
acyclic Boolean network is equivalent to an acyclic Boolean circuit.
As a model of signaling networks, we define a Boolean network with external,
internal and output nodes as follows. A Boolean network G(V, F ) consists of a
set V = {v1 , . . . , vn } of nodes and a list F = (f1 , . . . , fn ) of Boolean functions,
where each node takes a Boolean value (i.e., 0 or 1), and a Boolean function
fi (vi1 , . . . , vil ) with inputs from specified nodes vi1 , . . . , vil is assigned to each of
internal and output nodes $v_i$. We use $\bar x$ to denote the negation of x, and use ∧, ∨
and ⊕ to denote AND, OR and XOR, respectively. We use IN (vi ) to denote the
set of input nodes vi1 , . . . , vil to vi . We allow that some vij are not relevant (i.e.,
these vij do not directly affect the state of vi ). For each G(V, F ), we associate a
directed graph G(V, E) defined by E = {(vj , vi ) | vj ∈ IN (vi )}. We use deg(vi )
to denote the indegree of vi (i.e., |IN (vi )| = deg(vi )). In this paper, we assume
that G(V, E) is acyclic. We also assume that there exists only one output node,
where some of the results can be extended for multiple output nodes. We assume
w.l.o.g. that v1 , . . . , vh are external nodes (whose indegrees are 0) and vn is the
output node. Each node takes either 0 or 1 and the state of node vi is denoted by
v̂i . For an internal or output node vi , v̂i is determined by v̂i = fi (v̂i1 , . . . , v̂ili ).
We have assumed so far that all fi are known. However, fi may not be known
for some nodes vi whereas IN (vi ) are known. Such a node is called an incomplete
node. A Boolean network is called incomplete if it contains an incomplete node,
otherwise it is called complete.
An (h + 1)-dimensional 0-1 vector e is called an example, where the first h
entries correspond to the external nodes and the last entry corresponds to the
output node. An example e is called positive if eh+1 = 1, otherwise it is called
negative. A complete Boolean network G(V, F) is consistent with e if $\hat v_n = e_{h+1}$ holds under the condition that $\hat v_i = e_i$ holds for i = 1,...,h.

Definition 1. BNCMPL-1
Instance: An incomplete Boolean network G(V, F) and a set of examples $\{e^1, \dots, e^m\}$.
Question: Is there an assignment of Boolean functions $f_i$ to the incomplete nodes so that the resulting network G(V, F′) is consistent with all examples?
An assignment satisfying the above condition is called a completion. In the above,
a set of nodes to which Boolean functions are assigned is specified. However,
existing knowledge about the target network may contain mistakes. In such a
case, it might be useful to modify Boolean functions for the minimum number
of nodes while keeping the network structure. Therefore, we define a variant of
the network completion problem as follows.
Definition 2. BNCMPL-2
Instance: A complete Boolean network G(V, F), a set of examples $\{e^1, \dots, e^m\}$, and a positive integer L.
Question: Is there an assignment of Boolean functions $f_i$ to at most L nodes so that the resulting network G(V, F′) is consistent with all examples?
In this definition, we allow the algorithm to override the complete nodes (i.e., other Boolean functions can be assigned to nodes for which Boolean functions are already assigned). As a variant of BNCMPL-2, we can consider the problem of minimizing the number L of nodes to which another Boolean function should be assigned. This variant can be solved by solving BNCMPL-2 from L = 0 to n. Note that deletion of an edge can be regarded as a modification of a Boolean function and thus can be handled in BNCMPL-2 because we allow that some nodes in $IN(v_i)$ are not relevant.¹
In this paper, we assume in most cases that the maximum indegree is bounded
by a constant D. This assumption is reasonable because it is quite hard in general
to learn Boolean functions with many inputs, and $O(2^n)$ bits are required to represent a Boolean function if an arbitrary Boolean function is allowed.
¹ All the results in this paper are valid even if all nodes in $IN(v_i)$ must be relevant.
3 Hardness Results
First, we show that BNCMPL-1 is NP-complete even if only one positive ex-
ample is given.
Proposition 1. BNCMPL-1 is NP-complete even if one positive example is
given and D = 2.
Proof. Since it is obvious that BNCMPL-1 is in NP, we show that it is NP-hard by means of a polynomial time reduction from 3-SAT [9] (see also Fig. 1).
Let $c_1, \dots, c_M$ be clauses over Boolean variables $x_1, \dots, x_N$. Then, 3-SAT is the problem of deciding whether there is an assignment of 0-1 values to $x_1, \dots, x_N$ that satisfies all the clauses (i.e., the values of all clauses are 1).
We construct an incomplete network G(V, F) as follows², where we first assume that nodes with large indegree are allowed. Let $V = \{v_1, \dots, v_{2N+M+1}\}$ and let $\{v_1, \dots, v_N\}$ be the set of external nodes. Let clause $c_i$ be defined as $c_i = g_i(x_{i_1}, x_{i_2}, x_{i_3})$. For each node $v_{2N+i}$ (i = 1,...,M), we assign the Boolean function $v_{2N+i} = g_i(v_{N+i_1}, v_{N+i_2}, v_{N+i_3})$. For the output node $v_{2N+M+1}$, we assign the Boolean function defined by $v_{2N+M+1} = v_{2N+1} \wedge v_{2N+2} \wedge \cdots \wedge v_{2N+M}$. For each i = 1,...,N, let $v_{N+i}$ be an incomplete node such that $IN(v_{N+i}) = \{v_i\}$. Therefore, either $v_{N+i} = v_i$ or $v_{N+i} = \bar v_i$ is assigned to $v_{N+i}$³. Finally, we let e = (1, 1, 1, ..., 1). Then, it is straightforward to see that there exists a completed network G(V, F′) if and only if there exists a satisfying assignment for $c_1, \dots, c_M$. It is also seen that the reduction can be done in polynomial time.

[Figure 1 omitted: the constructed network, with external nodes $v_1, \dots, v_4$ at the bottom, incomplete nodes $v_5, \dots, v_8$ above them, clause nodes $v_9$, $v_{10}$, $v_{11}$, and the output node $v_{12}$ at the top.]
Fig. 1. Reduction from the 3-SAT instance $\{x_1 \vee x_2 \vee x_3,\; x_1 \vee x_3 \vee x_4,\; x_2 \vee x_3 \vee x_4\}$ to BNCMPL-1. $v_{4+i} = v_i$ (resp. $v_{4+i} = \bar v_i$) corresponds to $x_i = 1$ (resp. $x_i = 0$) for i = 1,...,4.

² The construction can be simplified if we use internal nodes with degree 0.
³ Since we allow non-relevant input nodes, it is also possible that $v_{N+i} = 0$ or $v_{N+i} = 1$ is assigned. All the results in this paper are valid even if such an assignment is taken into account.
Furthermore, we can modify the construction for the case of D = 2 by encoding each Boolean function assigned to $v_{2N+i}$ for i = 1,...,M+1 using Boolean functions of arity 2 along with additional nodes. □
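For concreteness, the construction of Proposition 1 can be written out as a small Python routine; the DIMACS-style clause encoding and the node naming below are our own illustrative choices. The complete nodes use the same (inputs, function) format as the evaluation sketch in Section 2, so a candidate completion can be checked against the single example e by first assigning functions to the incomplete nodes.

def reduction_from_3sat(clauses, N):
    """Build the incomplete network of Proposition 1.

    clauses: list of 3-tuples of nonzero ints, DIMACS style: literal k > 0
             means x_k, and k < 0 means the negation of x_k.
    N:       number of variables.
    Returns (network, incomplete, example): network maps each complete node
    to (inputs, f), incomplete lists the unassigned nodes v_{N+1}..v_{2N},
    and example is the single all-ones example e = (1,...,1).
    """
    M = len(clauses)
    network, incomplete = {}, []
    for i in range(1, N + 1):              # v_{N+i}: incomplete, IN = {v_i}
        incomplete.append(("v", N + i))
    for i, clause in enumerate(clauses, start=1):
        inputs = tuple(("v", N + abs(k)) for k in clause)
        signs = tuple(k > 0 for k in clause)
        # clause node v_{2N+i} computes g_i: an OR of possibly negated inputs
        f = lambda vals, s=signs: int(any(v == int(b) for v, b in zip(vals, s)))
        network[("v", 2 * N + i)] = (inputs, f)
    out_inputs = tuple(("v", 2 * N + i) for i in range(1, M + 1))
    network[("v", 2 * N + M + 1)] = (out_inputs, lambda vals: int(all(vals)))
    return network, incomplete, (1,) * (N + 1)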
Though the above result is rather straightforward, we can strengthen the above proposition for tree-structured networks.

Theorem 1. BNCMPL-1 is NP-complete even if the network has a tree structure and D = 2.
Proof. We show a polynomial time reduction from 3-SAT to this special case. Unlike the proof of Prop. 1, we use examples to encode clauses.
Let $c_1, \dots, c_M$ be clauses over $x_1, \dots, x_N$ in an instance of 3-SAT. From this instance, we construct G(V, F) as follows (see Fig. 2). Let $V = \{v_1, \dots, v_{6N+1}\}$. For convenience, we let $y_i = v_i$, $z_i = v_{N+i}$, $w_i = v_{2N+i}$, $p_i = v_{3N+i}$, $q_i = v_{4N+i}$, and $r_i = v_{5N+i}$, where $y_i$, $z_i$, and $w_i$ are external nodes. For the output node $v_{6N+1}$, we assign $v_{6N+1} = r_1 \vee r_2 \vee \cdots \vee r_N$. For each of $q_i$ and $r_i$, we assign $q_i = p_i \oplus z_i$ and $r_i = q_i \wedge w_i$. Each $p_i$ is an incomplete node with only one input $y_i$. Clearly, the resulting network has a tree structure.
Next, we create M examples, where all examples are positive. For each clause $c_j = l_{j_1} \vee l_{j_2} \vee l_{j_3}$, we create an example $e^j$ such that
– for i = 1,...,N, $e^j_i = e^j_{2N+i} = 1$ if $x_i$ appears in $c_j$ as a positive or negative literal, and $e^j_i = e^j_{2N+i} = 0$ otherwise,
– for i = 1,...,N, $e^j_{N+i} = 1$ iff $x_i$ appears in $c_j$ as a negative literal,
– $e^j_{3N+1} = 1$.
Then, we show below that there exists a completed network G(V, F′) if and only if there exists a satisfying assignment for $c_1, \dots, c_M$.
Assume that there exists a satisfying assignment $(b_1, \dots, b_N)$. For each i = 1,...,N, we assign $p_i = y_i$ to $p_i$ if $b_i = 1$, and $p_i = \bar y_i$ otherwise. Then, the state of $r_i$ for

[Figure 2 omitted: for each variable $x_i$, a subnetwork with external nodes $y_i$, $z_i$, $w_i$, the incomplete node $p_i$ (with input $y_i$), $q_i = p_i \oplus z_i$ (XOR), and $r_i = q_i \wedge w_i$ (AND), feeding the OR output node.]
Fig. 2. Reduction from a 3-SAT instance to BNCMPL-1 for trees. For each variable $x_i$, the subnetwork shown in this figure is constructed.
example $e^j$ is equal to the state of the literal corresponding to $x_i$ if $x_i$ appears in clause $c_j$; otherwise it is 0. Therefore, we can see that the state of $v_{6N+1}$ is 1 for all examples. Conversely, assume that there exists a required completion. For each i = 1,...,N, we let $b_i = 1$ if $p_i = y_i$ is assigned to $p_i$, and $b_i = 0$ otherwise. Then, we can see that all clauses are satisfied.
The above reduction can clearly be done in polynomial time. As in the proof of Prop. 1, we can encode the output node using nodes of indegree 2. □
In the above, we assumed that negation nodes can be used. The following theorem states that BNCMPL-1 remains NP-complete even if only AND/OR nodes are allowed, where the use of 3-Coloring was inspired by [15].

Theorem 2. BNCMPL-1 is NP-complete even if only AND/OR nodes of D = 2 are allowed.

Proof. We show a polynomial time reduction from 3-Coloring [9]. 3-Coloring is, given an undirected graph $G_0(V_0, E_0)$ with N vertices, to decide whether or not there exists a mapping χ from the set of vertices of $G_0$ to {1, 2, 3} such that $\chi(x_i) \neq \chi(x_j)$ holds for all $\{x_i, x_j\} \in E_0$.
From $G_0(V_0, E_0)$, we construct an incomplete network G(V, F) consisting of AND/OR nodes as follows (see also Fig. 3). First, we create the following nodes:
– $c_1$, $c_2$, $c_3$ and the output node o,
– $y_i$, $z_i$, $p_i$, $q_i$, $c^1_i$, $c^2_i$, $c^3_i$, $w^{12}_i$, $w^{13}_i$, $w^{23}_i$, $r^{12}_i$, $r^{13}_i$, $r^{23}_i$ for i = 1,...,N,
– $s^{pq}_{i,j}$ for all (p, q) ∈ {(1,2), (1,3), (2,1), (2,3), (3,1), (3,2)} and for all $\{i, j\} \in E_0$ where i < j,
where $c_1$, $c_2$, $c_3$, the $y_i$ s, the $z_i$ s, and the $w^{hk}_i$ s are external nodes, and the $c^p_i$ s are incomplete nodes. For each $c^p_i$, we let $IN(c^p_i) = \{c_p, y_i\}$. For each non-external node, the assignment of a Boolean function is done by $p_i = c^1_i \vee c^2_i \vee c^3_i$, $q_i = z_i \wedge p_i$, $r^{hk}_i = w^{hk}_i \wedge c^h_i \wedge c^k_i$, $s^{pq}_{i,j} = c^p_i \wedge c^q_j \wedge y_i \wedge y_j$, and $o = \big(\bigvee_i q_i\big) \vee \big(\bigvee_{i,h,k} r^{hk}_i\big) \vee \big(\bigvee_{i,j,p,q} s^{pq}_{i,j}\big)$.

[Figure 3 omitted: the three gadget subnetworks (i), (ii) and (iii), drawn with the external inputs set as in the corresponding examples below and with required output values 1, 0 and 1, respectively.]
Fig. 3. Reduction from 3-Coloring to BNCMPL-1. Parts (i), (ii) and (iii) put the constraints that at least one color is assigned to each node, that two colors cannot be assigned to the same node, and that different colors must be assigned to neighboring nodes, respectively.
For each example e, the entry corresponding to each external/output node v is denoted by e(v). Then, the examples are given as follows.

(i) For each i ∈ {1, . . . , N }, we create e such that e(zi ) = e(yi ) = e(o) = 1, and
e(v) = 0 for the other v.
(ii) For each i ∈ {1, . . . , N}, we create e such that $e(w^{pq}_i) = e(y_i) = 1$, and e(v) = 0 for the other v.
(iii) For each {i, j} ∈ E0 where i < j, we create e such that e(yi ) = e(yj ) =
e(o) = 1, and e(v) = 0 for the other v.

Then, we can show that there exists a valid 3-coloring for G0 (V0 , E0 ) if and only
if there is a required completion for G(V, F ). Furthermore, this reduction can be
done in polynomial time. As in the proof of Prop. 1, we can encode each node
with indegree more than 2 using nodes of indegree 2. □
We can also prove that BNCMPL-2 is NP-complete, where the proof is a bit involved because the Boolean functions assigned to any subset of nodes of cardinality at most L can be modified.

Theorem 3. BNCMPL-2 is NP-complete.

Proof. BNCMPL-2 is clearly in NP. In order to prove NP-hardness, we replace each $T(r_i)$ in the proof of Thm. 1 by a large (but polynomial-size) acyclic subnetwork $G_i$ with an output node $r'_i$ (see also Fig. 4), where T(v) denotes the subtree of a tree T induced by v and its descendants.
Let N′ = 2N + 1. In order to construct $G_i$, we make N′ copies of $T(r_i)$ where $y_i$, $z_i$, $w_i$ and $p_i$ are shared by all copies, and $p_i = y_i$ is initially assigned to each $p_i$ (recall that there exists no incomplete node in BNCMPL-2). Let $r^j_i$ be the jth copy of $r_i$. Each $G_i$ consists of N + 1 copies of an identical subnetwork: $C^1_i, C^2_i, \dots, C^{N+1}_i$. The subnetwork C has N′ inputs and N′ outputs. The value of each output is 1 if at least N + 1 of the inputs are 1, otherwise it is 0. Therefore, each $C^k_i$ has N′ copies of a majority circuit. Since a majority circuit can be realized by an acyclic network using a polynomial number of logic gates with fan-in 2 (using half adders), the size of each $C^j_i$ is polynomial in N. The inputs of $C^1_i$ are the $r^j_i$ s. The inputs of $C^{j+1}_i$ are the outputs of $C^j_i$. The first output (it can be arbitrary) of $C^{N+1}_i$ is an input to the output node o, which is the disjunction of all its inputs. Since the size of $C^j_i$ is polynomial, the whole network G(V, F) has polynomial size and can be constructed in polynomial time. In addition to the same examples as in Thm. 1, one negative example $e^{M+1}$ is given, which is defined by $e^{M+1}_i = 0$ for all i = 1,...,3N+1.
Hereafter, we show that there exists a required completion for G(V, F), $\{e^1, \dots, e^{M+1}\}$, and L = N iff there exists a satisfying assignment for 3-SAT. Assume that $(b_1, \dots, b_N)$ is a satisfying assignment for 3-SAT. For each i such that $b_i = 0$, we replace $p_i = y_i$ assigned to $p_i$ with $p_i = \bar y_i$. It is to be noted that we need to replace at most N Boolean functions. By these modifications, we obtain a completed network G(V, F′). Hereafter, $\hat v(e)$ denotes the state of node v in G(V, F′) when example e is given. Let $c_i = l_{i_1} \vee l_{i_2} \vee l_{i_3}$ where i ∈ {1,...,M}.
[Figure 4 omitted: the subnetwork $G_i$, with the shared nodes $y_i$, $z_i$, $w_i$ and $p_i$ at the bottom, the copies $q^1_i, \dots, q^{N'}_i$ and $r^1_i, \dots, r^{N'}_i$ above them, the stacked subnetworks $C^1_i, \dots, C^{N+1}_i$, and the outputs $r'_1, \dots, r'_N$ feeding the OR output node.]
Fig. 4. Reduction from 3-SAT to BNCMPL-2

Since $c_i$ is satisfied by $(b_1, \dots, b_N)$, one of $l_{i_1}$, $l_{i_2}$ and $l_{i_3}$ must be 1 for each i. We assume w.l.o.g. that $l_{i_1}$ takes value 1. Then, we can see that $\hat r^j_{i_1}(e^i) = 1$ holds for all j = 1,...,N′, from which $\hat r'_{i_1}(e^i) = 1$ and $\hat o(e^i) = 1$ follow. Furthermore, we can see that $\hat o(e^{M+1}) = 0$. Therefore, there exists a required completion.
Conversely, assume that there exists a required completion G(V, F′). Then, we create an assignment $(b_1, \dots, b_N)$ by letting $b_i = 0$ iff $p_i = \bar y_i$ is assigned to $p_i$. If no nodes other than the $p_i$ s are changed⁴, we can see that $(b_1, \dots, b_N)$ is a satisfying assignment, as in the proof of Thm. 1. Therefore, we consider the case that some nodes other than the $p_i$ s are changed. Since at most N nodes are changed, we can see that at least N + 1 outputs of each $C^j_i$ take the value $d_i(e^j)$ defined by

$$d_i(e^j) = \begin{cases} 1, & \text{if } \hat w_i(e^j) = 1 \text{ and } \big((\hat p_i(e^j) = 1 \wedge \hat z_i(e^j) = 0) \vee (\hat p_i(e^j) = 0 \wedge \hat z_i(e^j) = 1)\big),\\ 0, & \text{otherwise.} \end{cases}$$

We can also see that at least one $C^j_i$ remains unchanged for each i. If $C^{N+1}_i$ is unchanged, we can see that $\hat r'_i(e^j) = d_i(e^j)$ holds for all j. If $C^{N+1}_i$ is changed, there must exist an unchanged $C^j_i$ for some j < N + 1. Then, we can see that for each i ∈ {1,...,N}, one of the following holds: $\hat r'_i(e^j) = d_i(e^j)$ for all j; $\hat r'_i(e^j) = \overline{d_i(e^j)}$ for all j; $\hat r'_i(e^j) = 0$ for all j; $\hat r'_i(e^j) = 1$ for all j.

⁴ We say that a node is changed if the assigned Boolean function is replaced.
Therefore, $\hat o(e^j)$ can be represented by a Boolean function of $d_1(e^j), \dots, d_N(e^j)$. Here we note that for each $e^j$ (j < M+1), $\hat r'_i(e^j) \neq \hat r'_i(e^{M+1})$ holds for at least 1 and at most 3 of the $r'_i$ s (among the N $r'_i$ s). Furthermore, we can see that such $r'_i$ s are included in $\{r'_{j_1}, r'_{j_2}, r'_{j_3}\}$ where $c_j = l_{j_1} \vee l_{j_2} \vee l_{j_3}$. Since $e^j_{3N+1} \neq e^{M+1}_{3N+1}$ holds, $\hat r'_i(e^j) \neq \hat r'_i(e^{M+1})$ holds for at least one $i \in \{j_1, j_2, j_3\}$. For such i, $d_i(e^j) \neq d_i(e^{M+1})$ holds. Since $d_i(e^{M+1}) = 0$, we have $d_i(e^j) = 1$. Although the output node may be changed, $\hat o(e^j) = d_1(e^j) \vee \cdots \vee d_N(e^j)$ always holds for any j in the resulting network G(V, F′). Hence, $\hat p_i(e^j) = 1$ holds if $\hat z_i(e^j) = 0$, otherwise (i.e., $\hat z_i(e^j) = 1$) $\hat p_i(e^j) = 0$ holds. If $\hat z_i(e^j) = 0$ holds, $x_i$ appears in $c_j$ positively. Since $\hat p_i(e^j) = 1$ holds in this case, $b_i = 1$ holds and thus $c_j$ is satisfied. Similarly, if $\hat z_i(e^j) = 1$ holds, we can see that $b_i = 0$ holds and $c_j$ is satisfied. Therefore, there exists a satisfying assignment for 3-SAT.
The above reduction can clearly be done in polynomial time. We can encode the output node using nodes of indegree two.⁵ □
⁵ It should be noted that the theorem still holds if the nodes encoding the output node are changed by completion.


4 Algorithms for Tree-Like Networks

In Section 3, we showed that BNCMPL-1 is NP-complete even for tree-structured networks of bounded indegree (i.e., the maximum indegree is bounded by a constant). However, we can show that BNCMPL-1 and BNCMPL-2 can be solved in polynomial time for tree-structured and tree-like networks of bounded constant indegree if the number of examples is small (i.e., O(log n)). Considering the case of a small number of examples is meaningful because only a small amount of data is usually available in the analysis of signaling networks.

4.1 Algorithms for Tree-Structured Networks

In this subsection, we present an algorithm for tree-structured networks, which is to be extended to partial k-trees in Section 4.2. The algorithm is based on dynamic programming and is similar to that in [3], with some additional ideas. We give the algorithm as a part of the proof of the following theorem.

Theorem 4. BNCMPL-1 is solved in polynomial time if the network structure is a rooted tree of bounded indegree and the number of examples is O(log n).

Proof. We assume for simplicity that all non-external nodes are of indegree 2, while the proof and algorithm can be extended to any trees of bounded constant indegree D while keeping the polynomial time complexity.
Let $F_v$ be an assignment of functions to the unassigned nodes in T(v). Then, $\hat v(e^j, F_v)$ denotes the state of v under assignment $F_v$ when an example $e^j$ is given. Let a be a 0-1 vector of size m (recall that m is the number of examples),
where the ith coordinate of a corresponds to the ith example. We define a dynamic programming table S[v, a] by

$$S[v, a] = \begin{cases} 1, & \text{if there exists } F_v \text{ such that } \hat v(e^j, F_v) = a_j \text{ for all } j = 1, \dots, m,\\ 0, & \text{otherwise.} \end{cases}$$

It is to be noted that for each node v, we examine all possible combinations of 0-1 values against all m examples. Therefore, the size (i.e., the number of entries) of S is $n \cdot 2^m$ and thus is polynomial if m = O(log n).

The table S[v, a] can be computed as follows. Recall that an external node $v_i$ corresponds to the ith entry of each example. Thus, for an external node $v_i$, $S[v_i, a]$ is computed by

$$S[v_i, a] = \begin{cases} 1, & \text{if } a_j = e^j_i \text{ holds for all } j = 1, \dots, m,\\ 0, & \text{otherwise.} \end{cases}$$

For a non-external and non-incomplete node v, let $f_v$ be the Boolean function assigned to v in G(V, F). For a non-external node v, let $v^L$ and $v^R$ be the left and right children of v, respectively. Then, S[v, a] is computed by

$$S[v, a] = \begin{cases} 1, & \text{if there exists } (f, a^L, a^R) \text{ such that } S[v^L, a^L] = 1,\ S[v^R, a^R] = 1,\\ & \text{and } f(a^L_j, a^R_j) = a_j \text{ for all } j = 1, \dots, m,\\ 0, & \text{otherwise,} \end{cases}$$

where f is an arbitrary Boolean function with arity 2 if v is an incomplete node, and $f = f_v$ otherwise. Since the number of possible Boolean functions f is $2^{2^2} = 16$ ($2^{2^D}$ for the case of maximum indegree D), S[v, a] can be computed in $O(m \cdot 2^m \cdot 2^m) = O(m \cdot 2^{2m})$ time per entry. Since there are $n \cdot 2^m$ entries, the total time for constructing the dynamic programming table is $O(mn \cdot 2^{3m})$. Once this table is constructed, we finally check whether or not $S[v_n, a] = 1$ holds for the a such that $a_j = e^j_{h+1}$ holds for all j = 1,...,m. It is straightforward to see that this algorithm correctly works in $O(mn \cdot 2^{3m})$ time. □
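A compact Python rendering of this dynamic program is given below for the indegree-2 case; it represents the table S[v, ·] of each node by the set of achievable answer vectors a, and the tree encoding is our own illustrative choice.

from itertools import product

def bncmpl1_tree(root, examples, h):
    """Theorem 4 DP on a rooted binary tree.

    A node is a dict: external nodes are {'ext': i} (0-based index into the
    first h entries of each example); internal/output nodes are
    {'left': .., 'right': .., 'f': g} where g maps two bits to a bit, or
    'f': None if the node is incomplete.  examples is a list of
    (h+1)-tuples of 0/1.  Returns True iff a consistent completion exists.
    """
    m = len(examples)
    all_f = [lambda x, y, t=t: t[2 * x + y]        # the 16 binary Boolean
             for t in product((0, 1), repeat=4)]   # functions as truth tables

    def table(v):
        """Set of answer vectors a with S[v, a] = 1."""
        if 'ext' in v:
            return {tuple(e[v['ext']] for e in examples)}
        fs = all_f if v['f'] is None else [v['f']]
        left, right = table(v['left']), table(v['right'])
        return {tuple(f(aL[j], aR[j]) for j in range(m))
                for f in fs for aL in left for aR in right}

    target = tuple(e[h] for e in examples)         # required output states
    return target in table(root)

Representing S[v, ·] as a set of vectors is equivalent to the 0-1 table above; for m = O(log n), each set has at most $2^m$ = poly(n) elements.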

We can modify the above proof for BNCMPL-2.
Theorem 5. BNCMPL-2 is solved in polynomial time if the network structure
is a rooted tree of bounded indegree and the number of examples is O(log n).
Proof. For simplicity, we assume that all non-external nodes are of indegree 2. Let $F_v$ be an assignment of Boolean functions to nodes in T(v). Let $size(F_v)$ be the number of nodes such that the Boolean function assigned by $F_v$ is different from the original one. We define a DP table S[v, a, l] for l = 0,...,L by

$$S[v, a, l] = \begin{cases} 1, & \text{if there exists } F_v \text{ such that } \hat v(e^j, F_v) = a_j \text{ for all } j = 1, \dots, m\\ & \text{and } size(F_v) = l,\\ 0, & \text{otherwise.} \end{cases}$$

Then, for an external node $v_i$, $S[v_i, a, l]$ is computed by

$$S[v_i, a, l] = \begin{cases} 1, & \text{if } a_j = e^j_i \text{ holds for all } j = 1, \dots, m, \text{ and } l = 0,\\ 0, & \text{otherwise.} \end{cases}$$

For a non-external node v, S[v, a, l] is computed by

$$S[v, a, l] = \begin{cases} 1, & \text{if there is } (a^L, a^R, l^L, l^R) \text{ such that } S[v^L, a^L, l^L] = S[v^R, a^R, l^R] = 1,\\ & f_v(a^L_j, a^R_j) = a_j \text{ for all } j = 1, \dots, m, \text{ and } l = l^L + l^R,\\ 1, & \text{if there is } (f, a^L, a^R, l^L, l^R) \text{ such that } S[v^L, a^L, l^L] = S[v^R, a^R, l^R] = 1,\\ & f(a^L_j, a^R_j) = a_j \text{ for all } j = 1, \dots, m, \text{ and } l = l^L + l^R + 1,\\ 0, & \text{otherwise.} \end{cases}$$

It is straightforward to see that this algorithm correctly works in $O(mn^2 L^3 \cdot 2^{3m})$ time. □


It is also possible to minimize the number of errors (i.e., to minimize $|\{j \mid e^j_{h+1} \neq \hat v_n(e^j)\}|$) for BNCMPL-1 (resp. BNCMPL-2) by examining all $S[v_n, a]$ (resp. $S[v_n, a, l]$) if m = O(log n).

4.2 Algorithms for Partial k-Trees

We can extend the above-mentioned algorithms to partial k-trees. A partial k-tree is a graph with treewidth at most k, where the treewidth is defined via tree decompositions [8]. A tree decomposition of a graph G(V, E) is a pair $\langle T(V_T, E_T), (B_t)_{t \in V_T} \rangle$, where $T(V_T, E_T)$ is a rooted tree and $(B_t)_{t \in V_T}$ is a family of subsets of V such that (see also Fig. 5):
– for every v ∈ V, $B^{-1}(v) = \{t \in V_T \mid v \in B_t\}$ is nonempty and connected in T,
– for every edge {u, v} ∈ E, there exists $t \in V_T$ such that $u, v \in B_t$.
The width of the decomposition is defined as $\max_{t \in V_T}(|B_t| - 1)$ and the treewidth is the minimum of the widths among all the tree decompositions of G.
We present an algorithm for BNCMPL-1 on partial k-trees as the main part
of the proof of the following theorem, where we assume that k is a constant.

[Figure 5 omitted: a graph G(V, E) whose vertices are grouped into bags A, B, C, D, E, shown next to the corresponding rooted tree $T(V_T, E_T)$ over these bags, with root A and leaves D and E.]
Fig. 5. Example of a tree decomposition with treewidth 2


Theorem 6. BNCMPL-1 is solved in polynomial time if the network structure is a partial k-tree of bounded indegree and the number of examples is O(log n).

Proof. For simplicity, we assume that all non-external nodes are of indegree 2. For each non-external node $v_i$, $\langle f, a, a^L, a^R \rangle$ is called a secondary assignment if $f(a^L_j, a^R_j) = a_j$ holds for all j = 1,...,m, where $f = f_i$ if $v_i$ is a complete node (otherwise f is arbitrary)⁶. For each external node $v_i$, $\langle a \rangle$ is called a secondary assignment if $a_j = e^j_i$ holds for all j = 1,...,m. Let $A_v = \langle f, a, a^L, a^R \rangle$ and $A_u = \langle g, b, b^L, b^R \rangle$ (or $A_u = \langle b \rangle$ if u is an external node) be secondary assignments for v and u, respectively. We say that $A_v$ is consistent with $A_u$ if u is the first input node of v and $a^L = b$; u is the second input node of v and $a^R = b$; or u is not an input node of v (see Fig. 6).
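These two notions can be sketched in Python as follows, reusing the node encoding of the earlier dynamic-programming sketch; the tuple encoding of secondary assignments is our own illustrative choice.

from itertools import product

def secondary_assignments(node, examples, m, all_f):
    """Enumerate secondary assignments for a node (indegree 2 assumed).

    For an internal node these are tuples (f, a, aL, aR) with
    f(aL[j], aR[j]) == a[j] for every example j; all_f is the list of
    candidate binary Boolean functions (all 16 if the node is incomplete,
    otherwise just [f_v]).  For an external node the single answer vector
    determined by the examples is returned.
    """
    if 'ext' in node:
        yield (tuple(e[node['ext']] for e in examples),)
        return
    for f in all_f:
        for aL in product((0, 1), repeat=m):
            for aR in product((0, 1), repeat=m):
                a = tuple(f(aL[j], aR[j]) for j in range(m))
                yield (f, a, aL, aR)

def consistent_pair(A_v, v, A_u, u):
    """A_v is consistent with A_u in the sense defined above."""
    f, a, aL, aR = A_v
    b = A_u[0] if len(A_u) == 1 else A_u[1]   # answer vector of u
    if u is v['left']:
        return aL == b
    if u is v['right']:
        return aR == b
    return True                               # u is not an input node of v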
Once the concept of secondary assignment is defined, we can apply the standard dynamic programming technique for partial k-trees [8]. Suppose that G is a partial k-tree when G is regarded as an undirected graph. We can compute a tree decomposition $T(V_T, E_T)$ of width k using Bodlaender's algorithm [7,8]. For $t \in V_T$, B(t) denotes the set of nodes in G defined by $B(t) = \bigcup_{t' \in des(t)} B_{t'}$, where des(t) is the set consisting of t and its descendants in T. For a set of nodes $U = \{v_{i_1}, \dots, v_{i_{|U|}}\}$ in G, let $A(U) = \langle A_{i_1}, \dots, A_{i_{|U|}} \rangle$ be a tuple of secondary assignments, where $A_{i_j}$ is a secondary assignment for $v_{i_j}$. We define a set of consistent tuples Ass(U) by

$$Ass(U) = \{A(U) \mid A_{i_j} \text{ is consistent with } A_{i_{j'}} \text{ for all } j \neq j'\}.$$

We also define Ass(t) for $t \in V_T$ by $Ass(t) = Ass(B_t)$. Furthermore, we define a set of extensible and consistent tuples ExtAss(t) by

$$ExtAss(t) = \{A(B_t) \mid A(B_t) \text{ is a sub-tuple of some } A(B(t)) \in Ass(B(t))\}.$$

That is, ExtAss(t) is the set of consistent tuples for $B_t$ each of which can be extended to a consistent tuple for B(t).

[Figure 6 omitted: a node v with children $v^L$ and $v^R$, annotated with three concrete secondary assignments $A_1$, $A_2$ and $A_3$ given as 4-bit answer vectors.]
Fig. 6. All of $A_1$, $A_2$, and $A_3$ are consistent secondary assignments. $A_1$ and $A_2$ are consistent, whereas $A_1$ and $A_3$ are not consistent.

⁶ We use the term ‘secondary assignment’ in order to discriminate it from the ‘assignment’ defined in Section 2.
For each node $t \in V_T$, Ass(t) can be computed in $O(km \cdot 2^{3(k+1)m})$ time because the number of possible secondary assignments for each node in G is $O(2^{3m})$, and thus the number of possible tuples is $O(2^{3(k+1)m})$, and O(km) time is enough to check the consistency of a tuple, where we assume that the maximum indegree is bounded by 2. If t is a leaf in T, we let ExtAss(t) := Ass(t). Otherwise, let $t_1, \dots, t_{g_t}$ be the children of t in T, and assume that the $ExtAss(t_i)$ have already been computed. For two tuples $A(B_t)$ for $B_t$ and $A(B_{t_i})$ for $B_{t_i}$, $A(B_t)$ and $A(B_{t_i})$ are said to be compatible if the same secondary assignments are assigned to v for each $v \in B_t \cap B_{t_i}$. Then, we can compute ExtAss(t) by

$$ExtAss(t) := \{A(B_t) \mid A(B_t) \in Ass(t) \text{ is compatible with some } A(B_{t_i}) \in ExtAss(t_i) \text{ for all } i = 1, \dots, g_t\}.$$

Then, it is straightforward to see that BNCMPL-1 has a required completion iff $ExtAss(r) \neq \{\}$, where r is the root of T. For the example of Fig. 5, we compute ExtAss(C) from Ass(D) and Ass(E), ExtAss(B) from ExtAss(C), and ExtAss(A) from ExtAss(B). Note that the output node can be located outside $B_r$ (e.g., it is not located in A but in D in Fig. 5).

Clearly, ExtAss(t) can be computed in $O(2^{6(k+1)m} \cdot km\, g_t)$ time per t. Since $\sum_t g_t = O(n)$, the total computation time is $O((2^{6(k+1)m} \cdot km + q(k)) \cdot n)$, where $O(q(k) \cdot n)$ is the time complexity of Bodlaender's algorithm, which works in linear time for a constant k. If m = O(log n), this time complexity is polynomial. Furthermore, we can extend the algorithm and analysis to the case of maximum indegree bounded by a constant D. □

We can extend this result to BNCMPL-2; the details are omitted.
Corollary 1. BNCMPL-2 is solved in polynomial time if the network structure
is a partial k-tree of bounded indegree and the number of examples is O(log n).

5 Concluding Remarks
In this paper, we have studied problems of completing networks from example
data. We have shown that the problems are NP-complete in general but can be
solved in polynomial time for partial k-trees of bounded indegree if a logarithmic
number of examples are given.
Extension of the model and algorithms to networks with cycles is an important direction for future work because real biological networks contain cycles. For that purpose, it might be helpful to use feedback vertex sets, because a network becomes acyclic if the vertices in a feedback vertex set are removed. Other future work includes extending BNCMPL-2 to handle insertions of edges, analyzing PAC-type learning models [13] as well as probabilistic extensions, and developing practical algorithms.

Acknowledgment
We would like to thank Atsushi Mochizuki, Ryoko Morioka and Shigeru Saito
for helpful discussions.
References
1. Akutsu, T., Miyano, S., Kuhara, S.: Identification of genetic networks from a small
number of gene expression patterns under the Boolean network model. In: Proc.
Pacific Symposium on Biocomputing 1999, pp. 17–28 (1999)
2. Akutsu, T., Kuhara, S., Maruyama, O., Miyano, S.: Identification of genetic
networks by strategic gene disruptions and gene overexpressions under a Boolean
model. Theoretical Computer Science 298, 235–251 (2003)
3. Akutsu, T., Hayashida, M., Ching, W.-K., Ng, M.K.: Control of Boolean networks:
Hardness results and algorithms for tree structured networks. Journal of Theoret-
ical Biology 244, 670–679 (2007)
4. Angluin, D., Aspnes, J., Chen, J., Wu, Y.: Learning a circuit by injecting values.
In: Proc. 38th Annual ACM Symposium on Theory of Computing, pp. 584–593
(2006)
5. Angluin, D., Aspnes, J., Chen, J., Reyzin, L.: Learning large-alphabet and analog
circuits with value injection queries. Machine Learning 72, 113–138 (2008)
6. Angluin, D., Aspnes, J., Chen, J., Eisenstat, D., Reyzin, L.: Learning acyclic prob-
abilistic circuits using test paths. In: Proc. 21st Annual Conference on Learning
Theory, pp. 169–180 (2008)
7. Bodlaender, H.L.: A linear-time algorithm for finding tree-decompositions of small
treewidth. SIAM Journal on Computing 25, 1305–1317 (1996)
8. Flum, J., Grohe, M.: Parameterized Complexity Theory. Springer, Berlin (2006)
9. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory
of NP-Completeness. W.H. Freeman and Co., New York (1979)
10. Gupta, S., Bisht, S.S., Kukreti, R., Jain, S., Brahmachari, S.K.: Boolean net-
work analysis of a neurotransmitter signaling pathway. Journal of Theoretical
Biology 244, 463–469 (2007)
11. Ideker, T.E., Thorsson, V., Karp, R.M.: Discovery of regulatory interactions
through perturbation: inference and experimental design. In: Proc. Pacific Sympo-
sium on Biocomputing 2000, pp. 302–313 (2000)
12. Kauffman, S.A.: The Origins of Order: Self-organization and Selection in Evolution.
Oxford Univ. Press, NY (1993)
13. Kearns, M.J., Vazirani, U.V.: An Introduction to Computational Learning Theory.
MIT Press, Cambridge (1994)
14. Mochizuki, A.: Structure of regulatory networks and diversity of gene expression
patterns. Journal of Theoretical Biology 250, 307–321 (2008)
15. Pitt, L., Valiant, L.G.: Computational limitations on learning from examples.
Journal of the ACM 35, 965–984 (1988)
16. Tokumoto, Y., Horimoto, K., Miyake, J.: TRAIL inhibited the cyclic AMP re-
sponsible element mediated gene expression. Biochemical and Biophysical Research
Communications 381, 533–536 (2009)
Average-Case Active Learning with Costs

Andrew Guillory¹,⋆ and Jeff Bilmes²

¹ Computer Science and Engineering, University of Washington
guillory@cs.washington.edu
² Electrical Engineering, University of Washington
bilmes@ee.washington.edu

Abstract. We analyze the expected cost of a greedy active learning algorithm. Our analysis extends previous work to a more general setting in which different queries have different costs. Moreover, queries may have more than two possible responses and the distribution over hypotheses may be non-uniform. Specific applications include active learning with label costs, active learning for multiclass and partial label queries, and batch mode active learning. We also discuss an approximate version, of interest when there are very many queries.

1 Motivation

We first motivate the problem by describing it informally. Imagine two people


are playing a variation of twenty questions. Player 1 selects an object from a
finite set, and it is up to player 2 to identify the selected object by asking
questions chosen from a finite set. We assume for every object and every question
the answer is unambiguous: each question maps each object to a single answer.
Furthermore, each question has associated with it a cost, and the goal of player 2
is to identify the selected object using a sequence of questions with minimal cost.
There is no restriction that the questions are yes or no questions. Presumably,
complicated, more specific questions have greater costs. It doesn’t violate the
rules to include a single question enumerating all the objects (Is the object a
dog or a cat or an apple or...), but for the game to be interesting it should be
possible to identify the object using a sequence of less costly questions.
With player 1 the human expert and player 2 the learning algorithm, we can
think of active learning as a game of twenty questions. The set of objects is
the hypothesis class, the selected object is the optimal hypothesis with respect
to a training set, and the questions available to player 2 are label queries for
data points in the finite sized training set. Assuming the data set is separable,
label queries are unambiguous questions (i.e. each question has an unambiguous
answer). By restricting the hypothesis class to be a set of possible labelings of

⋆ This material is based upon work supported by the National Science Foundation under grant IIS-0535100 and by an ONR MURI grant N000140510388.


the training set (i.e. the effective hypothesis class for some other possibly infinite
hypothesis class), we can also ensure there is a unique zero-error hypothesis. If
we set all question costs to 1, we recover the traditional active learning problem
of identifying the target hypothesis using a minimal number of labels.
However, this framework is also general enough to cover a variety of active
learning scenarios outside of traditional binary classification.
– Active learning with label costs. If different data points are more or
less costly to label, we can model these differences using non uniform label
costs. For example, if a longer document takes longer to label than a shorter
document, we can make costs proportional to document length. The goal is
then to identify the optimal hypothesis as quickly as possible as opposed to
using as few labels as possible. This notion of label cost is different than the
often studied notion of misclassification cost. Label cost refers to the cost of
acquiring a label at training time where misclassification cost refers to the
cost of incorrectly predicting a label at test time.
– Active learning for multiclass and partial label queries. We can di-
rectly ask for the label of a point (Is the label of this point “a”, “b”, or
“c”?), or we can ask less specific questions about the label (Is the label of
this point “a” or some other label?). We can also mix these question types,
presumably making less specific questions less costly. These kinds of partial
label queries are particularly important when examples have structured la-
bels. In a parsing problem, a partial label query could ask for the portion of
a parse tree corresponding to a small phrase in a long sentence.
– Batch mode active learning. Questions can also be queries for multiple
labels. In the extreme case, there can be a question corresponding to every
subset of possible single data point questions. Batch label queries only help
the algorithm reduce total label cost if the cost of querying for a batch of
labels is in some cases less than the sum of the corresponding individual
label costs. This is the case if there is a constant additive cost overhead
associated with asking a question or if we want to minimize time spent
labeling and there are multiple labelers who can label examples in parallel.
Beyond these specific examples, this setting applies to any active learning prob-
lem for which different user interactions have different costs and are unambiguous
as we have defined. For example, we can ask questions concerning the percentage
of positive and negative examples according to the optimal classifier (Does the
optimal classifier label more than half of the data set positive?). This abstract
setting also has applications outside of machine learning.
– Information Retrieval. We can think of a question asking strategy as an
index into the set of objects which can then be used for search. If we make the
cost of a question the expected computational cost of computing the answer
for a given object, then a question asking strategy with low cost corresponds
to an index with fast search time. For example, if objects correspond to
points in $\mathbb{R}^n$ and questions correspond to axis-aligned hyperplanes, a question
asking strategy is a kd-tree.
– Compression. A question asking strategy produces a unique sequence of


responses for each object. If we make the cost of a question the log of the
number of possible responses to that question, then a question asking strat-
egy with low cost corresponds to a code book for the set of objects with
small code length [5].
Interpreted in this way, active learning, information retrieval, and compression
can be thought of as variations of the same problem in which we minimize
interaction cost, computation cost, and code length respectively.
In this work we consider this general problem for average-case cost. The object
is selected at random and the goal is to minimize the expected cost of identifying
the selected object. The distribution from which the object is drawn is known
but may not be uniform. Previous work [11, 6, 1, 3, 4] has shown simple greedy
algorithms are approximately optimal in certain more restrictive settings. We
extend these results to our more general setting.

2 Preliminaries
We first review the main result of Dasgupta [6] which our first bound extends. We
assume we have a finite set of objects (for example, hypotheses) H with |H| = n. A randomly chosen $h^* \in H$ is our target object, with a known positive π(h) defining the distribution over H by which $h^*$ is drawn. We assume $\min_h \pi(h) > 0$ and |H| > 1. We also assume there is a finite set of questions $q_1, q_2, \dots, q_m$, each of which has a positive cost $c_1, c_2, \dots, c_m$. Each question $q_i$ maps each object to a response from a finite set of answers $A \triangleq \bigcup_{h,i} \{q_i(h)\}$, and asking $q_i$ reveals $q_i(h^*)$, eliminating from consideration all objects h for which $q_i(h) \neq q_i(h^*)$. An active learning algorithm continues asking questions until $h^*$ has been identified (i.e., we have eliminated all but one of the elements from H). We assume this is possible for any element in H. The goal of the learning algorithm is to identify $h^*$ with questions incurring as little cost as possible. Our result bounds the expected cost of identifying $h^*$.
We assume that the distribution π, the hypothesis class H, the questions qi, and the costs ci are known. Any deterministic question asking strategy (e.g. a deterministic active learning algorithm taking in this known information) produces a decision tree in which internal nodes are questions and the leaves are elements of H. The cost of a query tree T with respect to a distribution π, C(T, π), is defined to be the expected cost of identifying h∗ when h∗ is chosen according to π. We can write C(T, π) as C(T, π) = Σ_{h∈H} π(h) cT(h), where cT(h) is the cost to identify h as the target object: cT(h) is simply the sum of the costs of the questions along the path from the root of T to h. We define πS to be π restricted and normalized w.r.t. S: for s ∈ S, πS(s) = π(s)/π(S), and for s ∉ S, πS(s) = 0. Tree cost decomposes nicely.

Lemma 1. For any tree T and any S = ∪i S^i with S^i ∩ S^j = ∅ for all i ≠ j and S ≠ ∅,

C(T, πS) = Σ_i πS(S^i) C(T, π_{S^i})
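As a concrete illustration of C(T, π) and cT(h), the following short sketch (our own ad hoc encoding of query trees, not the paper's) computes the expected cost of a query tree recursively; a leaf is an object h and an internal node carries its question's cost.

def tree_cost(T, pi, path_cost=0.0):
    # Leaf: contributes pi(h) * c_T(h), where c_T(h) is the accumulated cost.
    if not isinstance(T, tuple):
        return pi.get(T, 0.0) * path_cost
    c_i, _question, children = T        # internal node: (cost, name, branches)
    return sum(tree_cost(sub, pi, path_cost + c_i) for sub in children.values())

# Ask q1 (cost 1, "is h = 0?") first, then q2 (cost 2) on the 'no' branch.
T = (1.0, 'q1', {'yes': 0, 'no': (2.0, 'q2', {'yes': 1, 'no': 2})})
print(tree_cost(T, {0: 0.5, 1: 0.25, 2: 0.25}))  # 0.5*1 + 0.25*3 + 0.25*3 = 2.0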
Algorithm 1. Cost Sensitive Greedy Algorithm

1: S ⇐ H
2: repeat
3:   i = argmax_i Δi(S, πS)/ci
4:   S ⇐ {s ∈ S : qi(s) = qi(h∗)}
5: until |S| = 1

We define the version space to be the subset of H consistent with the answers we have received so far. Questions eliminate elements from the version space. For a question qi and a particular version space S ⊆ H, we define S^j ≜ {s ∈ S : qi(s) = j}. With this notation the dependence on qi is suppressed but understood by context. As shorthand, for a distribution π we define π(S) = Σ_{s∈S} π(s). On average, asking question qi shrinks the absolute mass of S with respect to a distribution π by

Δi(S, π) ≜ Σ_{j∈A} (π(S^j)/π(S)) (Σ_{k≠j} π(S^k)) = π(S) − Σ_{j∈A} π(S^j)²/π(S)

We call this quantity the shrinkage of qi with respect to (S, π). We note Δi(S, π) is only defined if π(S) > 0. If qi has cost ci, we call Δi(S, π)/ci the shrinkage-cost ratio of qi with respect to (S, π).
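The following sketch (ours; the representation of questions as functions from objects to answers is an assumption made for illustration) computes the shrinkage and runs the cost sensitive greedy strategy of Algorithm 1.

def mass(pi, S):
    return sum(pi[s] for s in S)

def shrinkage(q, S, pi):
    # Delta_i(S, pi) = pi(S) - sum_j pi(S^j)^2 / pi(S)
    groups = {}
    for s in S:                                 # split S into the classes S^j
        groups.setdefault(q(s), []).append(s)
    total = mass(pi, S)
    return total - sum(mass(pi, Sj) ** 2 for Sj in groups.values()) / total

def greedy_identify(questions, costs, H, pi, target):
    S, spent = set(H), 0.0
    while len(S) > 1:
        piS = {s: pi[s] / mass(pi, S) for s in S}   # pi_S: restricted, normalized
        i = max(range(len(questions)),              # largest shrinkage-cost ratio
                key=lambda j: shrinkage(questions[j], S, piS) / costs[j])
        spent += costs[i]
        S = {s for s in S if questions[i](s) == questions[i](target)}
    return S.pop(), spent

H, pi = [0, 1, 2, 3], {h: 0.25 for h in range(4)}
questions, costs = [lambda h: h >= 2, lambda h: h % 2 == 1], [1.0, 1.0]
print(greedy_identify(questions, costs, H, pi, target=3))  # (3, 2.0)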
In previous work [6, 1, 3], the greedy algorithm analyzed is the algorithm that
at each step chooses the question qi that maximizes the shrinkage with respect to
the current version space Δi (S, πS ). In our generalized setting, we define the cost
sensitive greedy algorithm to be the active learning algorithm which at each step
asks the question with the largest shrinkage-cost ratio Δi (S, πS )/ci where S is
the current version space. We call the tree generated by this method the greedy
query tree. See Algorithm 1. Adler and Heeringa [1] also analyzed a cost-sensitive
method for the restricted case of questions with two responses and uniform π,
and our method is equivalent to theirs in this case. The main result of Dasgupta
[6] is that, on average, with unit costs and yes/no questions, the greedy strategy
is not much worse than any other strategy. We repeat this result here.
Theorem 1 (Theorem 3 of [6]). If |A| = 2 and ci = 1 for all i, then for any π the greedy query tree T^g has cost at most

C(T^g, π) ≤ 4 C∗ ln(1/ min_{h∈H} π(h))

where C∗ = min_T C(T, π).


For a uniform π, the log term becomes ln |H|, so the approximation factor grows with the log of the number of objects. In the non uniform case, the greedy algorithm can do significantly worse. However, Kosaraju et al. [11] and Chakaravarthy et al. [3] show a simple rounding method can be used to remove the dependence on π. We first give an extension of Theorem 1 to our more general setting. We then show how we can remove the dependence on π using a similar rounding method. Interestingly, in our setting this rounding method introduces a dependence on the costs, so neither bound is strictly better, although together they generalize all previous results.

3 Cost Independent Bound

Theorem 2. For any π the greedy query tree T^g has cost at most

C(T^g, π) ≤ 12 C∗ ln(1/ min_{h∈H} π(h))

where C∗ ≜ min_T C(T, π).
What is perhaps surprising about this bound is that the quality of approximation
does not depend on the costs themselves. The proof follows part of the strategy
used by Dasgupta [6]. The general approach is to show that if the average cost
of some question tree is low, then there must be at least one question with
high shrinkage-cost ratio. We then use this to form the basis of an inductive
argument. However, this simple argument fails when only a few objects have
high probability mass.
We start by showing the shrinkage of qi monotonically decreases as we elimi-
nate elements from S.
Lemma 2 (extension of Lemma 6 of [6] to non binary queries). If T ⊆ S ⊆ H and T ≠ ∅, then for all i and π, Δi(T, π) ≤ Δi(S, π).
Proof. For |S| = 1 the result is immediate since |T| ≥ 1 and therefore S = T. We show that if |S| ≥ 2, removing any single element a ∈ S \ T from S does not increase the shrinkage, i.e., Δi(S − {a}, π) ≤ Δi(S, π). The lemma then follows since we can remove all of S \ T from S an element at a time. Assume w.l.o.g. a ∈ S^k for some k. Here let A′ ≜ A \ {k}.

Δi(S − {a}, π) = (π(S^k) − π(a))(π(S) − π(S^k)) / (π(S) − π(a)) + Σ_{j∈A′} π(S^j)(π(S) − π(S^j) − π(a)) / (π(S) − π(a))

We show that this is term by term less than or equal to

Δi(S, π) = π(S^k)(π(S) − π(S^k)) / π(S) + Σ_{j∈A′} π(S^j)(π(S) − π(S^j)) / π(S)

For the first term,

(π(S^k) − π(a))(π(S) − π(S^k)) / (π(S) − π(a)) ≤ π(S^k)(π(S) − π(S^k)) / π(S)

because π(S) ≥ π(S^k) and π(a) ≥ 0. For any other term in the summation,

π(S^j)(π(S) − π(S^j) − π(a)) / (π(S) − π(a)) ≤ π(S^j)(π(S) − π(S^j)) / π(S)

because π(S) − π(S^j) ≥ π(a) ≥ 0 and π(S) > π(a).
Obviously, the same result holds when we consider shrinkage-cost ratios.
Corollary 1. If T ⊆ S ⊆ H and T ≠ ∅, then for any i and π, Δi(T, π)/ci ≤ Δi(S, π)/ci.

We define the collision probability of a distribution v over Z to be CP(v) ≜ Σ_{z∈Z} v(z)². This is exactly the probability that two samples from v will be the same, and it quantifies the extent to which mass is concentrated on only a few points (similar to inverse entropy).
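In code, in the same illustrative style as the earlier sketches:

def collision_probability(v):
    # CP(v) = sum_z v(z)^2
    return sum(p ** 2 for p in v.values())

print(collision_probability({h: 0.25 for h in range(4)}))  # 0.25 (uniform on 4)
print(collision_probability({0: 0.9, 1: 0.1}))             # 0.82 (concentrated)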
If no question has a large shrinkage-cost ratio and the collision probability is low, then the expected cost of any query tree must be high.
Lemma 3 (extension of Lemma 7 of [6] to non binary queries and non uniform costs). For any set S and distribution v over S, if Δi(S, v)/ci < Δ/c for all i, then for any R ⊆ S with R ≠ ∅ and any query tree T whose leaves include R,

C(T, vR) ≥ (c/Δ) v(R)(1 − CP(vR))

Proof. We prove the lemma with induction on |R|. For |R| = 1, CP(vR) = 1 and the right hand side of the inequality is zero. For |R| > 1, we lower bound the cost of any query tree on R. At its root, any query tree chooses some qi with cost ci that divides the version space into R^j for j ∈ A. Using the inductive hypothesis we can then write the cost of a tree as

C(T, vR) ≥ ci + Σ_{j∈A} vR(R^j) (c/Δ)(v(R^j)(1 − CP(v_{R^j})))
        = ci + (c/Δ) v(R) Σ_{j∈A} (vR(R^j)² − vR(R^j)² CP(v_{R^j}))
        = ci + (c/Δ) v(R)(1 − 1 + Σ_{j∈A} vR(R^j)² − CP(vR))

Here we used

Σ_{j∈A} vR(R^j)² CP(v_{R^j}) = Σ_{j∈A} vR(R^j)² Σ_{r∈R^j} v_{R^j}(r)² = Σ_{r∈R} vR(r)² = CP(vR)

We now note v(R)(1 − Σ_{j∈A} vR(R^j)²) = v(R) − Σ_{j∈A} v(R^j)²/v(R) = Δi(R, v), so

C(T, vR) ≥ ci + (c/Δ) v(R)(1 − CP(vR)) − (c/Δ) Δi(R, v)
        = (c/Δ) v(R)(1 − CP(vR)) + (Δ ci − Δi(R, v) c)/Δ

Using Corollary 1, Δi(R, v)/ci ≤ Δi(S, v)/ci ≤ Δ/c, so Δ ci − Δi(R, v) c ≥ 0 and therefore

C(T, vR) ≥ (c/Δ) v(R)(1 − CP(vR))

which completes the induction.


This lower bound on the cost of a tree translates into a lower bound on the shrinkage-cost ratio of the question chosen by the greedy tree.

Corollary 2 (extension of Corollary 8 of [6] to non binary queries and non uniform costs). For any S ⊆ H with S ≠ ∅ and query tree T whose leaves contain S, there must be a question qi with Δi(S, πS)/ci ≥ (1 − CP(πS))/C(T, πS).

Proof. Suppose this is not the case. Then there is some Δ/c < (1 − CP(πS))/C(T, πS) such that Δi(S, πS)/ci ≤ Δ/c for all i. By Lemma 3 (with v ≜ πS, R ≜ S),

C(T, πS) ≥ (c/Δ) πS(S)(1 − CP(πS)) > πS(S) C(T, πS) = C(T, πS)

which is a contradiction.
A special case which poses some difficulty for the main proof is when for some
S ⊆ H we have CP(πS ) > 1/2. First note that if CP(πS ) > 1/2 one object h0 has
more than half the mass of S. In the lemma below, we use R ≜ S \ {h0}. Also let δi be the relative mass of the hypotheses in R that are distinct from h0 w.r.t. question qi: δi ≜ πR({r ∈ R : qi(h0) ≠ qi(r)}). In other words, when question
qi is asked, R is divided into a set of hypotheses that agree with h0 (these have
relative mass 1 − δi ) and a set of hypotheses that disagree with h0 (these have
relative mass δi ). Dasgupta [6] also treats this as a special case. However, in
the more general setting treated here the situation is more subtle. For yes or
no questions, the question chosen by the greedy query tree is also the question
that removes the most mass from R. In our setting this is not necessarily the
case. The left of Figure 1 shows a counter example. However, we can show the
fraction of mass removed from R by the greedy query tree is at least half the
fraction removed by any other question. Furthermore, to handle costs, we must
instead consider the fraction of mass removed from R per unit cost.

Fig. 1. Left: Counter example showing that when a single hypothesis h0 contains more
than half the mass, the query with maximum shrinkage is not necessarily the query
that separates the most mass from h0 . Right: Notation for this case.
In this lemma we use π{h0} to denote the distribution which puts all mass on h0. The cost of identifying h0 in a tree T∗ is then C∗(h0) ≜ C(T∗, π{h0}).

Lemma 4. Consider any S ⊆ H and π with CP(πS) > 1/2 and π(h0) > 1/2. Let C∗(h0) = C(T∗, π{h0}) for any T∗ whose leaves contain S. Some question qi has δi/ci ≥ 1/C∗(h0).

Proof. There is always a set of questions, indexed by a set I, with total cost Σ_{i∈I} ci ≤ C∗(h0) that distinguish h0 from R within S. In particular, the set of questions used to identify h0 in T∗ satisfies this. Since the set identifies h0, Σ_{i∈I} δi ≥ 1, which implies

Σ_{i∈I} (ci/C∗(h0)) (δi/ci) ≥ 1/C∗(h0)

Because ci/C∗(h0) ∈ (0, 1] and Σ_{i∈I} ci/C∗(h0) ≤ 1, there must be a qi such that δi/ci ≥ 1/C∗(h0).

Having shown that some question always reduces the relative mass of R by 1/C∗(h0) per unit cost, we now show that the greedy query tree reduces the mass of R by at least half as much per unit cost.

Lemma 5. Consider any π and S ⊆ H with CP(πS) > 1/2, π(h0) > 1/2, and a corresponding subtree T^g_S in the greedy tree. Let C∗(h0) = C(T∗, π{h0}) for any T∗ whose leaves contain S. The question qi chosen by T^g_S has δi/ci ≥ 1/(2C∗(h0)).

Proof. We prove this by showing that the fraction removed from R per unit cost by the greedy query tree's question is at least half that of any other question. Combining this with Lemma 4, we get the desired result.

We can write the shrinkage of qi in terms of δi. Here let A′ ≜ A \ {qi(h0)}. Since π(S^{qi(h0)}) = π(h0) + (1 − δi)π(R), and π(S) − π(S^{qi(h0)}) = δi π(R), we have that

Δi(S, πS) = (πS(h0) + (1 − δi)πS(R)) δi πS(R) + Σ_{j∈A′} πS(S^j)(πS(S) − πS(S^j))

We use Σ_{j∈A′} πS(S^j) = δi πS(R). We can then upper bound the shrinkage using πS(S) − πS(S^j) ≤ 1:

Δi(S, πS) ≤ (πS(h0) + (1 − δi)πS(R)) δi πS(R) + δi πS(R) ≤ 2 δi πS(R)

and lower bound the shrinkage using πS(h0) > 1/2 and πS(S) − πS(S^j) ≥ πS(h0) + (1 − δi)πS(R) for any j ∈ A′:

Δi(S, πS) ≥ 2 (πS(h0) + (1 − δi)πS(R)) δi πS(R) ≥ δi πS(R)


Let qi be any question and qj be the question chosen by the greedy tree, so that Δj(S, πS)/cj ≥ Δi(S, πS)/ci. Using the upper and lower bounds we derived, we then know 2 δj πS(R)/cj ≥ δi πS(R)/ci and can conclude 2 δj/cj ≥ δi/ci. Combining this with Lemma 4, δj/cj ≥ 1/(2C∗(h0)).

The main theorem immediately follows from the next theorem.

Theorem 3. If T∗ is any query tree for π and T^g is the greedy query tree for π, then for any S ⊆ H corresponding to the subtree T^g_S of T^g,

C(T^g_S, πS) ≤ 12 C(T∗, πS) ln(π(S) / min_{h∈S} π(h))

Proof. In this proof we use C∗(S) as a shorthand for C(T∗, πS). Also, we use min(S) for min_{s∈S} π(s). We proceed with induction on |S|. For |S| = 1, C(T^g_S, πS) is zero and the claim holds. For |S| > 1, we consider two cases.

Case one: CP(πS) ≤ 1/2. At the root of T^g_S, the greedy query tree chooses some qi with cost ci that reduces the version space to S^j when qi(h∗) = j. Let π(S⁺) ≜ max{π(S^j) : j ∈ A}. Using the inductive hypothesis,

C(T^g_S, πS) = ci + Σ_{j∈A} πS(S^j) C(T_{S^j}, π_{S^j})
             ≤ ci + Σ_{j∈A} 12 πS(S^j) C∗(S^j) ln(π(S^j)/min(S^j))
             ≤ ci + 12 (Σ_{j∈A} πS(S^j) C∗(S^j)) ln(π(S⁺)/min(S))

Now using Lemma 1, π(S⁺) = π(S) πS(S⁺), and then ln(1 − x) ≤ −x,

C(T^g_S, πS) ≤ ci + 12 C∗(S) ln(π(S)/min(S)) + 12 C∗(S) ln πS(S⁺)
             ≤ ci + 12 C∗(S) ln(π(S)/min(S)) − 12 C∗(S)(1 − πS(S⁺))

πS(S⁺) ≥ Σ_{j∈A} πS(S^j)², because this sum is an expectation and πS(S⁺) ≥ πS(S^j) for all j. From this follows

C(T^g_S, πS) ≤ ci + 12 C∗(S) ln(π(S)/min(S)) − 12 C∗(S)(1 − Σ_{j∈A} πS(S^j)²)
             = ci + 12 C∗(S) ln(π(S)/min(S)) − 12 C∗(S) ci (1 − Σ_{j∈A} πS(S^j)²)/ci
Since 1 − Σ_{j∈A} πS(S^j)² is Δi(S, πS), by Corollary 2 (applied with T = T∗) and using CP(πS) ≤ 1/2 in the last step,

C(T^g_S, πS) ≤ ci + 12 C∗(S) ln(π(S)/min(S)) − 12 C∗(S) ci (1 − CP(πS))/C∗(S)
             = ci + 12 C∗(S) ln(π(S)/min(S)) − 12 (1 − CP(πS)) ci
             ≤ 12 C∗(S) ln(π(S)/min(S))
which completes this case.
Case two: CP(πS) > 1/2. The hypothesis with more than half the mass, h0, lies at some depth D in the greedy tree T^g_S. Counting the root of T^g_S as depth 0, D ≥ 1. At depth d > 0, let q0, q1, ..., q_{d−1} be the questions asked so far, c0, c1, ..., c_{d−1} be the costs of these questions, and Cd = Σ_{i=0}^{d−1} ci be the total cost incurred. At the root, C0 = 0.

At depth d < D, we define Rd to be the set of objects other than h0 that are still in the version space along the path to h0: R0 ≜ S \ {h0} and, for d > 0, Rd ≜ R_{d−1} \ {h : q_{d−1}(h) ≠ q_{d−1}(h0)}. In other words, Rd is R_{d−1} with the objects that disagree with h0 on q_{d−1} removed. All of the objects in Rd have the same response as h0 for q0, q1, ..., q_{d−1}. The right of Figure 1 shows this case.
We first bound the mass remaining in Rd as a function of the label cost incurred so far. For d > 0, using Lemma 5,

π(Rd) ≤ π(R0) Π_{i=0}^{d−1} (1 − ci/(2C∗(h0))) ≤ π(R0) e^{−Cd/(2C∗(h0))}

Using this bound, we can bound CD, the cost of identifying h0 (i.e. c_{T^g_S}(h0)). First note that π(R_{D−1}) ≥ min(R0), since at least one object is left in R_{D−1}. Combining this with the upper bound on the mass of Rd, we have, if D − 1 > 0,

C_{D−1} ≤ 2 C∗(h0) ln(π(R0)/min(R0))

This clearly also holds if D − 1 = 0, since C0 = 0. We now only need to bound the cost of the final question (the question asked at level D − 1). If the final question had cost greater than 2C∗(h0), then by Lemma 5, this question would reduce the mass of the set containing h0 to less than π(h0). This is a contradiction, so the final question must have cost no greater than 2C∗(h0). Thus

CD ≤ 2 C∗(h0) ln(π(R0)/min(R0)) + 2 C∗(h0)

We use A′_{d−1} ≜ A \ {q_{d−1}(h0)}. Let S^j_d be the set of objects removed from R_{d−1} with the question at depth d − 1 such that q_{d−1}(s) = j for every s ∈ S^j_d; that is, R_{d−1} = Rd ∪ ⋃_{j∈A′_{d−1}} S^j_d. Let Sd = ⋃_{j∈A′_{d−1}} S^j_d. The right of Figure 1 illustrates this notation. A useful variation of Lemma 1 we use in the following is that for S = S¹ ∪ S² and S¹ ∩ S² = ∅, π(S) C∗(S) = π(S¹) C∗(S¹) + π(S²) C∗(S²).
We can write

π(S) C(T^g_S, πS) = π(h0) CD + Σ_{d=1}^{D} Σ_{j∈A′_{d−1}} π(S^j_d)(Cd + C(T_{S^j_d}, π_{S^j_d}))    (a)
                  ≤ π(h0) CD + Σ_{d=1}^{D} π(Sd) Cd + Σ_{d=1}^{D} Σ_{j∈A′_{d−1}} 12 π(S^j_d) C∗(S^j_d) ln(π(S^j_d)/min(S^j_d))    (b)
                  ≤ π(h0) CD + π(R0) CD + 12 π(R0) C∗(R0) ln(π(R0)/min(R0))    (c)
                  ≤ 2 π(h0) CD + 12 π(R0) C∗(R0) ln(π(R0)/min(R0))    (d)

Here (a) decomposes the total cost into the cost of identifying h0 and the cost of each branch leaving the path to h0; for each of these branches the total cost is the cost incurred so far plus the cost of the tree rooted at that branch. (b) uses the inductive hypothesis, (c) uses Si ∩ Sj = ∅ for all i ≠ j and ⋃_d Sd = R0, and (d) uses π(R0) < π(h0). Continuing,

π(S) C(T^g_S, πS) ≤ 4 π(h0) C∗(h0)(ln(π(R0)/min(R0)) + 1) + 12 π(R0) C∗(R0) ln(π(R0)/min(R0))    (a)
                  ≤ 4 π(h0) C∗(h0)(ln(π(S)/min(S)) + 1) + 12 π(R0) C∗(R0) ln(π(S)/min(S))    (b)

where (a) uses our bound on CD and (b) uses R0 ⊂ S. Finally,

π(S) C(T^g_S, πS) ≤ 12 π(h0) C∗(h0) ln(π(S)/min(S)) + 12 π(R0) C∗(R0) ln(π(S)/min(S))
                  = 12 π(S) C∗(S) ln(π(S)/min(S))

where the first step uses π(S) > 2 min(S), so that ln(π(S)/min(S)) > ln 2 > 0.5 and hence 4(ln(π(S)/min(S)) + 1) ≤ 12 ln(π(S)/min(S)), and the last step uses the variation of Lemma 1 noted above. Dividing both sides by π(S) gives the desired result.

4 Distribution Independent Bound

We now show the dependence on π can be removed using a variation of the rounding trick used by Kosaraju et al. [11] and Chakaravarthy et al. [3]. The intuition behind this trick is that we can round up small values of π to obtain a distribution π′ in which ln(1/ min_{h∈H} π′(h)) = O(ln n), while ensuring that for any tree T, C(T, π)/C(T, π′) is bounded above and below by a constant. Here n = |H|. When the greedy algorithm is applied to this rounded distribution, the resulting tree gives an O(log n) approximation to the optimal tree for the original distribution. In our cost sensitive setting, the intuition remains the same, but the introduction of costs changes the result.
Let cmax ≜ maxi ci and cmin ≜ mini ci. In this discussion, we consider irreducible query trees, which we define to be query trees containing only questions with non-zero shrinkage. Greedy query trees will always have this property, as will optimal query trees. This property lets us assume any path from the root to a leaf has at most n nodes, with cost at most cmax n, because at least one hypothesis is eliminated by each question. Define π′ to be the distribution obtained from π by adding cmin/(cmax n³) mass to any hypothesis h for which π(h) < cmin/(cmax n³), and subtracting the corresponding mass from a single hypothesis hj for which π(hj) ≥ 1/n (there must be at least one such hypothesis). By construction, we have that mini π′(hi) ≥ cmin/(cmax n³); a sketch of this construction follows.
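The sketch below uses our own dictionary representation of π; the names are illustrative.

def round_distribution(pi, c_min, c_max):
    n = len(pi)
    eps = c_min / (c_max * n ** 3)
    rounded, added = dict(pi), 0.0
    for h, p in pi.items():
        if p < eps:                  # add eps mass to every tiny hypothesis
            rounded[h] = p + eps
            added += eps
    heavy = next(h for h, p in pi.items() if p >= 1.0 / n)
    rounded[heavy] -= added          # rebalance from a heavy hypothesis
    return rounded                   # now min_h rounded[h] >= eps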
We can also bound the amount by which the cost of a tree changes as a result of rounding:
Lemma 6. For any irreducible query tree T and π,

(1/2) C(T, π) ≤ C(T, π′) ≤ (3/2) C(T, π)

Proof. For the first inequality, let h′ be the hypothesis we subtract mass from when rounding. The cost to identify h′, cT(h′), is at most cmax n. Since we subtract at most cmin/(cmax n²) mass in total and cT(h′) ≤ cmax n, we then have

C(T, π′) ≥ C(T, π) − (cmin/(cmax n²)) cT(h′) ≥ C(T, π) − cmin/n ≥ (1/2) C(T, π)

The last step uses C(T, π) > cmin and n > 2. For the second inequality, we add at most cmin/(cmax n³) mass to each hypothesis and Σ_h cT(h) < cmax n², so

C(T, π′) ≤ C(T, π) + (cmin/(cmax n³)) Σ_{h∈H} cT(h) ≤ C(T, π) + cmin/n ≤ (3/2) C(T, π)

The last step again uses C(T, π) > cmin and n > 2.
We can finally give a bound on the greedy algorithm applied to π′, in terms of n and cmax/cmin.

Theorem 4. For any π, the greedy query tree T^g for π′ has cost at most

C(T^g, π) ≤ O(C∗ ln(n cmax/cmin))

where C∗ ≜ min_T C(T, π).

Proof. Let T′ be an optimal tree for π′ and T∗ be an optimal tree for π. Using Theorem 2, mini π′(hi) ≥ cmin/(cmax n³), and Lemma 6 (noting that ln(cmax n³/cmin) ≤ 3 ln(n cmax/cmin)),

C(T^g, π) ≤ 2 C(T^g, π′) ≤ 72 C(T′, π′) ln(n cmax/cmin)
          ≤ 72 C(T∗, π′) ln(n cmax/cmin) ≤ 108 C(T∗, π) ln(n cmax/cmin)
5 ε-Approximate Algorithm

Some of the non traditional active learning scenarios involve a large number of possible questions. For example, in the batch active learning scenario we describe, there may be a question corresponding to every subset of single data point questions. In these scenarios, it may not be possible to exactly find the question with largest shrinkage-cost ratio. It is not hard to extend our analysis to a strategy that at each step finds a question qi with

Δi(S, πS)/ci ≥ (1 − ε) max_j Δj(S, πS)/cj

for ε ∈ [0, 1). One can show that ε > 0 only introduces a 1/(1 − ε) factor into the bound. Kosaraju et al. [11] report a similar extension to their result.
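One way to realize this in the style of the earlier sketches is to maximize over a candidate pool that is known, by a problem-specific argument, to contain a (1 − ε)-approximate maximizer; the wrapper below is our own illustration, reusing shrinkage() from the Section 2 sketch, and is not a construction from the paper.

def eps_greedy_choice(questions, costs, candidates, S, piS):
    # If candidates contains a (1 - eps)-maximizer of the shrinkage-cost
    # ratio over all questions, the bound degrades by only 1/(1 - eps).
    return max(candidates,
               key=lambda i: shrinkage(questions[i], S, piS) / costs[i])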

6 Related Work
Table 1 summarizes previous results analyzing greedy approaches to this prob-
lem. A number of these results were derived independently in different contexts.
Our work gives the first approximation result for the general setting in which
there are more than two possible responses to questions, non uniform question
costs, and a non uniform distribution over objects. We give bounds for two al-
gorithms, one with performance independent of the query costs and one with
performance independent of the distribution over objects. Together these two
bounds match all previous bounds for less general settings. We also note that
Kosaraju et al. [11] only mention an extension to non binary queries (Remark
1), and our work is the first to give a full proof of an O(log n) bound for the case
of non binary queries and non uniform distributions over objects.
Our work and the work we extend are examples of exact active learning. We
seek to exactly identify a target hypothesis from a finite set using a sequence of
queries. Other work considers active learning where it suffices to identify with
high probability a hypothesis close to the target hypothesis [7, 2]. The exact and
approximate problems can sometimes be related [10].

Table 1. Summary of approximation ratios achieved by related work. Here n is the number of objects, k is the number of possible responses, ci are the question costs, and π is the distribution over objects.

                            k > 2   Non uniform ci   Non uniform π   Result
Kosaraju et al. [11]          Y           N                Y         O(log n)
Dasgupta [6]                  N           N                Y         O(log(1/ minh π(h)))
Adler and Heeringa [1]        N           Y                N         O(log n)
Chakaravarthy et al. [3]      Y           N                Y         O(log k log n)
Chakaravarthy et al. [4]      Y           N                N         O(log n)
This paper                    Y           Y                Y         O(log(1/ minh π(h)))
This paper                    Y           Y                Y         O(log(n maxi ci / mini ci))
Most theoretical work in active learning assumes unit costs and simple label
queries. An exception, Hanneke [9] also considers a general learning framework
in which queries are arbitrary and have known costs associated with them. In
fact, the setting used by Hanneke [9] is more general in that questions are al-
lowed to have more than one valid answer for each hypothesis. Hanneke [9]
gives worst-case upper and lower bounds in terms of a quantity called the Gen-
eral Identification Cost and related quantities. There are interesting parallels
between our average-case analysis and this worst-case result.
Practical work incorporating costs in active learning [12, 8] has also considered
methods that maximize a benefit-cost ratio similar in spirit to the method used
here. However, Settles et al. [12] suggest this strategy may not be sufficient for
practical cost savings.

7 Open Problems
Chakaravarthy et al. [3] show it is NP-hard to approximate the optimal query
tree within a factor of Ω(log n) for binary queries and non uniform π. This hard-
ness result is with respect to the number of objects. Some open questions remain.
For the more general setting with non uniform query costs, is there an algorithm
with an approximation ratio independent of both π and ci ? The simple round-
ing technique we use seems to require dependence on ci , but a more advanced
method could avoid this dependence. Also, can the Ω(log n) hardness result be
extended to the more restrictive case of uniform π? It would also be interesting
to extend our analysis to allow for questions to have more than one valid answer
for each hypothesis. This would allow queries which ask for a positively labeled
example from a set of examples. Such an extension appears non trivial, as a
straightforward extension assuming the given answer is randomly chosen from
the set of valid answers produces a tree in which the mass of hypotheses is split
across multiple branches, affecting the approximation.
Much work also remains in the analysis of other active learning settings with
general queries and costs. Of particular practical interest are extensions to ag-
nostic algorithms that converge to the correct hypothesis under no assumptions
[7, 2]. Extensions to treat label costs, partial label queries, and batch mode ac-
tive learning are all of interest, and these learning algorithms could potentially
be extended to treat these three sub problems at once using a similar setting.
For some of these algorithms, even without modification we can guarantee
the method does no worse than passive learning with respect to label cost. In
particular, Dasgupta et al. [7] and Beygelzimer et al. [2] both give algorithms
that iterate through T examples, at each step requesting a label with probability
pt. These algorithms are shown to not do much worse (in terms of generalization
error) than the passive algorithm which requests every label. Because the al-
gorithm queries for labels for a subset of T i.i.d. examples, the label cost of
the algorithm is also no worse than the passive algorithm requesting T random
labels. It remains an open problem however to show these algorithms can do
better than passive learning in terms of label cost (most likely this will require
modifications to the algorithm or additional assumptions).

References
[1] Adler, M., Heeringa, B.: Approximating optimal binary decision trees. In: Goel,
A., Jansen, K., Rolim, J.D.P., Rubinfeld, R. (eds.) APPROX and RANDOM 2008.
LNCS, vol. 5171, pp. 1–9. Springer, Heidelberg (2008)
[2] Beygelzimer, A., Dasgupta, S., Langford, J.: Importance weighted active learning.
In: ICML (2009)
[3] Chakaravarthy, V.T., Pandit, V., Roy, S., Awasthi, P., Mohania, M.: Decision
trees for entity identification: approximation algorithms and hardness results. In:
PODS (2007)
[4] Chakaravarthy, V.T., Pandit, V., Roy, S., Sabharwal, Y.: Approximating decision
trees with multiway branches. In: ICALP (2009)
[5] Cover, T.M., Thomas, J.A.: Elements of Information Theory, 2nd edn. Wiley-
Interscience, Hoboken (2006)
[6] Dasgupta, S.: Analysis of a greedy active learning strategy. In: NIPS (2004)
[7] Dasgupta, S., Hsu, D., Monteleoni, C.: A general agnostic active learning algo-
rithm. In: NIPS (2007)
[8] Haertel, R., Seppi, K.D., Ringger, E.K., Carroll, J.L.: Return on investment for
active learning. In: NIPS Workshop on Cost-Sensitive Learning (2008)
[9] Hanneke, S.: The cost complexity of interactive learning (unpublished, 2006), http://www.cs.cmu.edu/~shanneke/docs/2006/cost-complexity-working-notes.pdf
[10] Hanneke, S.: Teaching dimension and the complexity of active learning. In:
Bshouty, N.H., Gentile, C. (eds.) COLT 2007. LNCS (LNAI), vol. 4539, pp. 66–81.
Springer, Heidelberg (2007)
[11] Kosaraju, S.R., Przytycka, T.M., Borgstrom, R.: On an optimal split tree problem.
In: Dehne, F., Gupta, A., Sack, J.-R., Tamassia, R. (eds.) WADS 1999. LNCS,
vol. 1663, pp. 157–168. Springer, Heidelberg (1999)
[12] Settles, B., Craven, M., Friedland, L.: Active learning with real annotation costs.
In: NIPS Workshop on Cost-Sensitive Learning (2008)
Canonical Horn Representations and Query Learning

Marta Arias¹ and José L. Balcázar²

¹ LARCA Research Group, Departament LSI, Universitat Politècnica de Catalunya, Spain
marias@lsi.upc.edu
² Departamento de Matemáticas, Estadística y Computación, Universidad de Cantabria, Spain
joseluis.balcazar@unican.es

Abstract. We describe an alternative construction of an existing canonical representation for definite Horn theories, the Guigues-Duquenne basis (or GD basis), which minimizes a natural notion of implicational size.
sis (or GD basis), which minimizes a natural notion of implicational size.
We extend the canonical representation to general Horn, by providing a
reduction from definite to general Horn CNF. We show how this repre-
sentation relates to two topics in query learning theory: first, we show
that a well-known algorithm by Angluin, Frazier and Pitt that learns
Horn CNF always outputs the GD basis independently of the counterex-
amples it receives; second, we build strong polynomial certificates for
Horn CNF directly from the GD basis.

1 Introduction
The present paper is the result of an attempt to better understand the classic
algorithm by Angluin, Frazier, and Pitt [2] that learns propositional Horn for-
mulas. A number of intriguing questions remain open regarding this algorithm;
in particular, we were puzzled by the following one: along a run of the algo-
rithm, queries made by the algorithm depend heavily upon the counterexamples
selected as answers to the previous queries. It is therefore natural to expect
the outcome of the algorithm to depend on the answers received along the run.
However, attempts at providing an example of such behavior consistently fail.
In this paper we prove that such attempts must in fact fail: we describe a
canonical representation of Horn functions in terms of implications, and show
that the algorithm of [2] always outputs this particular representation. It turns
out that this canonical representation is well-known in the field of Formal Con-
cepts, and bears the name of the authors that, to the best of our knowledge,
first described it: the Guigues-Duquenne basis or GD basis [7, 12]. In addition,
the GD basis has the important quality of being of minimum size.
The GD basis is defined for definite Horn formulas only. We extend the notion
of GD basis to general Horn formulas by means of a reduction from general to
⋆ Work partially supported by MICINN projects SESAAME-BAR (TIN2008-06582-C03-01) and FORMALISM (TIN2007-66523).

definite Horn formulas. This reduction allows us to lift the characterization of the output of AFP as the generalized GD basis. Furthermore, the generalized GD representation provides the basis for building strong polynomial certificates with p(m, n) = m and q(m, n) = (m+1 choose 2) + m + 1 = (m+2 choose 2) for the class of general Horn formulas, extending a similar construction from [4] which applied only to definite Horn.
Some of the technical lemmas and theorems in this paper are based on previous
results of [12, 7]; we credit this fact appropriately throughout this presentation.
As a general overview, we have adopted the following: the “bullet” operator (•) of Section 3.1 is directly taken from [12], the “star” operator (∗) is standard in
the field of study of Closure Spaces, and the GD basis comes from the Formal
Concept Analysis literature. We consider that our contribution here is threefold:
first, to understand, translate, and interpret the results from these other fields;
second, to recognize the connection of these results to our own; third, to draw
new insights into our topic of study thanks to the fruitful combination of our
own intuitions and knowledge and the adoption of these outside results.
Due to the space limit, a number of proofs, mostly of simple lemmas, have
been omitted or just sketched. A longer version containing all proofs is available
from the authors’ webpages.

2 Preliminaries

We work within the standard framework in logic, where one is given an indexable
set X of propositional variables of cardinality n, Boolean functions are subsets
of the Boolean hypercube {0, 1}n, and these functions are represented by logical
formulas over the variable set in the standard way. Assignments are partially
ordered bitwise according to 0 ≤ 1 (the usual partial order of the hypercube);
the notation is x ≤ y. Readers not familiar with standard definitions of assign-
ment, assignment satisfaction or formula entailment (|=), literal, term, clause,
etc. should consult a standard textbook, e.g., [6]. A particularity of our work
is that we identify assignments x ∈ {0, 1}n with variable subsets α ⊆ X in the
standard way by connecting the variable subsets with the bits that are set to
1 in the assignments. We denote this explicitly when necessary with the func-
tions x = BITS(α) and α = ONES(x). Therefore, x |= α iff α ⊆ ONES(x) iff
BITS(α) ≤ x.
We are only concerned with Horn functions, and their representations using
conjunctive normal form (CNF). A Horn CNF formula is a conjunction of Horn
clauses. A clause is a disjunction of literals. A clause is definite Horn if it contains
exactly one positive literal, and it is negative if all its literals are negative. A
clause is Horn if it is either definite Horn or negative.
Horn clauses are generally viewed as implications where the negative liter-
als form the antecedent of the implication (a positive term), and the singleton
consisting of the positive literal, if it exists, forms the consequent of the clause.
Note that both can be empty; if the consequent is empty, then we are dealing
with a negative Horn clause. Furthermore, we allow our representations of Horn
CNF to deviate slightly from the standard in that we represent clauses sharing
the same antecedent together in one implication. Namely, an implication α → β,
where both α and β are possibly empty sets of propositional variables, is to be interpreted as the conjunction of definite Horn clauses ∧_{b∈β} (α → b) if β ≠ ∅, and as the negative clause α → ∅ if β = ∅.¹ A semantically equivalent interpretation is to see both sets of variables α and β as positive terms; the Horn formula in its standard form is obtained by distributivity on the variables of β. Note that x |= ∅ for any assignment x; however, this is not the case with respect to the right hand sides of nondefinite Horn clauses since, there, by convention, β = ∅ stands for the unsatisfiable.
We refer to our generalized notion of conjunction of clauses sharing the an-
tecedent as implication; the term clause retains its classical meaning (namely,
a disjunction of literals). Notice that an implication may not be a clause, e.g.
(a → bc) corresponds in classical notation to the formula ¬a ∨ (b ∧ c). Thus,
(a → bc), (ab → c) and (ab → ∅) are Horn implications but only the latter
two are Horn clauses. Furthermore, we often use sets to denote conjunctions, as
we do with positive terms, also at other levels: a generic (implicational) CNF ∧_i (αi → βi) is often denoted in this text by {(αi → βi)}i. Parentheses are
mostly optional and generally used for ease of reading.
Clearly, an assignment x ∈ {0, 1}^n satisfies the implication α → β, denoted x |= α → β, if it either fails the antecedent or satisfies the consequent, that is, x ⊭ α or x |= β respectively, where now we are interpreting both α and β as positive terms.
A Horn function admits several syntactically different Horn CNF representa-
tions; in this case, we say that these representations are equivalent. Such rep-
resentations are also known as theories or bases for the Boolean function they
represent. The size of a Horn function is the minimum number of clauses that a
Horn CNF representing it must have. The implication size of a Horn function is defined analogously, but allowing formulas to have implications instead of clauses.
Clearly, every clause is an implication, and thus the implication size of a given Horn function is always at most its standard size as measured in the number of clauses. Not all Boolean functions are Horn. The following semantic characterization is a well-known classic result, proved in the context of propositional Horn logic e.g. in [10]:
Theorem 1. A Boolean function admits a Horn CNF basis if and only if the set of assignments that satisfy it is closed under bit-wise intersection.
An implication in a Horn CNF H is redundant if it can be removed from H
without changing the Horn function represented. A Horn CNF is irredundant if
it does not contain any redundant implication. Notice that an irredundant H
¹ Notice that this differs from an alternative, older interpretation [11], nowadays obsolete, in which α → β represents the clause (¬x1 ∨ · · · ∨ ¬xk ∨ y1 ∨ · · · ∨ yl), where α = {x1, . . . , xk} and β = {y1, . . . , yl}. Though identical in syntax, the semantics
are different; in particular, ours can only represent a conjunction of definite Horn
clauses whereas the other represents a general possibly non-Horn clause.

may still contain other sorts of redundancies, such as consequents larger than
strictly necessary. Such redundancies are not contemplated in this paper.

Forward chaining. We describe the well-known method of forward chaining for definite Horn functions [6]. Notice that it directly extends to our compressed representation, where consequents of clauses can contain more than one variable. Given a definite Horn CNF H = {αi → βi}i and a subset of propositional variables α, we construct chains of subsets of propositional variables α = α(0) ⊂ α(1) ⊂ · · · ⊂ α(k) = α∗. Each α(i) with i > 0 is obtained from its predecessor α(i−1) in the following way: if BITS(α(i−1)) satisfies all implications in H, then the process can stop with α(i−1) = α∗. If, on the other hand, BITS(α(i−1)) violates some implication αj → βj ∈ H, then α(i) is set to α(i−1) ∪ βj.

Similarly, one can construct an increasing chain of assignments x = x(0) < x(1) < · · · < x(k) = x∗ using our bijection α(i) = ONES(x(i)) and x(i) = BITS(α(i)) for all i.
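A minimal sketch of this procedure in code (our own representation: variable subsets as Python sets, a definite Horn CNF as a list of (antecedent, consequent) pairs):

def closure(alpha, H):
    # Forward chaining: compute the closure alpha* of the variable set alpha.
    closed = set(alpha)
    changed = True
    while changed:
        changed = False
        for ante, cons in H:
            # BITS(closed) violates ante -> cons iff ante holds but cons does not
            if ante <= closed and not cons <= closed:
                closed |= cons
                changed = True
    return closed

H = [({'a'}, {'b'}), ({'b'}, {'c'}), ({'a', 'd'}, {'e'})]
print(sorted(closure({'a'}, H)))       # ['a', 'b', 'c']           (a* = abc)
print(sorted(closure({'a', 'd'}, H)))  # ['a', 'b', 'c', 'd', 'e'] ((ad)* = abcde)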
See [6] as a general reference for the following well-known results. Theorem 3 in particular refers to the fact that the forward chaining procedure is a sound and complete deduction method for definite Horn CNF.

Theorem 2. The objects x∗ and α∗ are well-defined and computed by the forward chaining procedure regardless of the order in which implications in H are chosen. Moreover, x∗ and α∗ depend only on the underlying function being represented, and not on the particular choice of representation H; and for each x(i) and α(i) along the way, we have that (x(i))∗ = x∗ and (α(i))∗ = α∗.

Theorem 3. Let h be a definite Horn function, and let α be an arbitrary variable subset. Then h |= α → b if and only if b ∈ α∗.

Closure operator and equivalence classes. It is easy to see that the ∗ operator is extensive (that is, x ≤ x∗ and α ⊆ α∗), monotonic (if x ≤ y then x∗ ≤ y∗, and if α ⊆ β then α∗ ⊆ β∗) and idempotent (x∗∗ = x∗, and α∗∗ = α∗) for all assignments x, y and variable sets α, β; that is, ∗ is a closure operator [4]. Thus, we refer to x∗ as the closure of x w.r.t. a definite Horn function. It should be always clear from the text with respect to which definite Horn function we are taking the closure, hence it is omitted from the notation used. An assignment x is said to be closed iff x∗ = x, and similarly for variable sets. Furthermore, it is not hard to see that closed elements are always positive (by construction via the forward chaining procedure, they must satisfy all implications), and assignments that are not closed are always negative (similarly, they must violate some implication). That is: x |= H if and only if x∗ = x. This closure operator induces a partition over the set of assignments {0, 1}^n in the following straightforward way: two assignments x and y belong to the same class if x∗ = y∗. This notion of equivalence class carries over as expected to the power set of propositional variables: the subsets α and β belong to the same class if α∗ = β∗. It is worth noting that each equivalence class consists of a possibly empty set of assignments that are not closed and a single closed set, its representative.
3 The Guigues-Duquenne Basis for Definite Horn

In this section we characterize and show how to build a canonical basis for
definite Horn functions that is of minimum implication size. Our construction
is based on the notion of saturation, a notion that has been used already in the
context of Horn functions and seems very natural [4, 5]. It turns out that this
canonical form is, in essence, the Guigues-Duquenne basis (the GD basis) which
was introduced in [7]. Here, we introduce it in a form that is, to our knowledge,
novel, although it is relatively close to the approach of [12].
We begin by defining saturation and then prove several interesting properties
that serve as the basis for our work.

Definition 1. Let B = {αi → βi}i be a basis for some definite Horn function.

– We say that B is left-saturated if the following two conditions hold:
  1. BITS(αi) ⊭ αi → βi, for all i;
  2. BITS(αi) |= αj → βj, for all i ≠ j.
  Alternatively, it can be more succinctly described by the following equivalence: a basis {αi → βi}i is left-saturated if i = j ⇔ BITS(αi) ⊭ αj → βj.
– We say that B is right-saturated if, for all i, βi = αi∗. Accordingly, we denote right-saturated bases with {αi → αi∗}i.
– We say that a basis B is saturated iff it is left- and right-saturated.

Example 1. Let H = {a → b, b → c, ad → e}.

– H is not left-saturated: for example, the antecedent of ad → e is such that BITS(ad) ⊭ a → b. One can already see that by including b in the antecedent of the third clause, one avoids this particular violation.
– H is not right-saturated because a∗ = abc and, for example, the implication a → b is missing a and c from its right-hand-side.
– The equivalent H′ = {a → abc, b → bc, abcd → abcde} is saturated.
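Both conditions of Definition 1 can be tested mechanically; the checker below (our sketch, reusing closure() from the forward chaining sketch of Section 2) confirms the claims of Example 1.

def satisfies(x, ante, cons):
    # Does BITS(x) satisfy ante -> cons ?
    return not ante <= x or cons <= x

def is_left_saturated(B):
    # i = j  <=>  BITS(alpha_i) violates alpha_j -> beta_j
    return all(satisfies(B[i][0], *B[j]) == (i != j)
               for i in range(len(B)) for j in range(len(B)))

def is_right_saturated(B):
    return all(set(cons) == closure(ante, B) for ante, cons in B)

H  = [({'a'}, {'b'}), ({'b'}, {'c'}), ({'a', 'd'}, {'e'})]
H2 = [({'a'}, {'a', 'b', 'c'}), ({'b'}, {'b', 'c'}),
      ({'a', 'b', 'c', 'd'}, {'a', 'b', 'c', 'd', 'e'})]
print(is_left_saturated(H), is_right_saturated(H))    # False False
print(is_left_saturated(H2), is_right_saturated(H2))  # True True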

Lemma 1. Let B = {αi → βi}i be a basis for some definite Horn function h.

1. If B is left-saturated then B is irredundant.
2. If B is irredundant, then BITS(αi) ⊭ αi → βi for all i.
3. If B is saturated, then BITS(αi) ⊭ h and BITS(αi∗) |= h hold for all i.
4. If B is saturated, then αi ⊆ αj ⇒ αi∗ ⊂ αj, for all i ≠ j.

Lemma 2. Let B = {αi → αi∗}i be an irredundant, right-saturated basis. Then, B is left-saturated if and only if the following implication is true for all i ≠ j: αi ⊂ αj ⇒ αi∗ ⊆ αj.

The following Lemma is a variant of a result of [12] translated into our notation. We include the proof that is, in fact, missing from [12].

Lemma 3. Let B = {αi → αi∗}i be a saturated basis for a definite Horn function. Then for all i and β it holds that (β ⊆ αi and β∗ ⊂ αi∗) ⇒ β∗ ⊆ αi.

Proof. Let us assume that the conditions of the implication are true, namely, that β ⊆ αi and β∗ ⊂ αi∗. We proceed by cases: if β is closed, then β∗ = β and the implication is trivially true, since β ⊆ αi clearly implies β∗ ⊆ αi when β∗ = β. Otherwise, β is not closed. Let β = β(0) ⊂ β(1) ⊂ · · · ⊂ β(k) = β∗ be the series of elements constructed by the forward chaining procedure described in Section 2. We argue that if β(l) ⊆ αi and β(l) ⊂ β∗, then β(l+1) ⊆ αi as well. By repeatedly applying this fact to all the elements along the chain, we arrive at the desired conclusion, namely, β∗ ⊆ αi. Let β(l) be such that β(l) ⊆ αi and β(l) ⊂ β∗. Thus β(l) violates some implication (αk → αk∗) ∈ B. Our forward chaining procedure assigns β(l+1) to β(l) ∪ αk∗. The following inequalities hold: αk ⊆ β(l) because β(l) ⊭ αk → αk∗, and β(l) ⊆ αi by assumption; hence αk ⊆ αi. Using Lemma 2, and noticing the fact that, actually, αk ⊂ αi since β(l) ⊂ αi (otherwise we could not have β∗ ⊂ αi∗), we conclude that αk∗ ⊆ αi. We have that αk∗ ⊆ αi and β(l) ⊆ αi so that β(l+1) = β(l) ∪ αk∗ ⊆ αi as required.
The next result characterizes our version of the canonical basis based on the
notion of saturation. The proof does rely heavily on Lemma 3, which is adapted
from a result from [12]. The connection to saturation and our proof technique
are indeed novel.
Theorem 4. Definite Horn functions have a unique saturated basis.

Proof. Let B1 and B2 be two equivalent and saturated bases. Let a → a∗ be an arbitrary implication in B1. We show that a → a∗ ∈ B2 as well. By symmetry, this implies that B1 = B2.

By Lemma 1(2), we have that BITS(a) ⊭ B1 and thus BITS(a) must violate some implication b → b∗ ∈ B2; hence it must hold that b ⊆ a but b∗ ⊈ a. The rest of the proof is concerned with showing that assuming b ⊂ a leads to a contradiction. If so, then b = a and thus a → a∗ ∈ B2 as desired.

Let us assume then that b ⊂ a, so that, by monotonicity, b∗ ⊆ a∗. If b∗ ⊂ a∗, then we can use Lemma 3 with αi = a and β = b and conclude that b∗ ⊆ a, contradicting the fact that BITS(a) ⊭ (b → b∗). Thus, it should be that b∗ = a∗. Now, consider b → a∗ ∈ B2. Clearly b is negative (otherwise, b = b∗, and then b → b∗ is redundant) and thus it must violate some implication c → c∗ ∈ B1, namely, c ⊆ b but c∗ ⊈ b. If c = b, then we have a → a∗ ∈ B1 and c → c∗ ∈ B1 with c ⊂ a and c∗ = b∗ = a∗, contradicting the fact that B1 is irredundant. Thus, c ⊂ b and so c∗ ⊆ b∗. If c∗ ⊂ b∗, then we use Lemma 3 as before but with αi = b and β = c, and we conclude that c∗ ⊆ b. Again, this means that BITS(b) |= c → c∗, contradicting the fact that b violates this implication. So the only remaining case is c∗ = b∗, but this means that we have the implications a → a∗ ∈ B1 and c → c∗ ∈ B1 with c ⊂ a but a∗ = c∗, which again makes B1 redundant.
3.1 Constructing the GD Basis


So far, our definition of saturation only tests whether a given basis is actually
saturated; we study now a saturation process to obtain the GD basis. New
definitions are needed. Let H be any Horn CNF, and α any variable subset. Let
H(α) be those clauses of H whose antecedents fall in the same equivalence class as α, namely, H(α) = {αi → βi | αi → βi ∈ H and α∗ = αi∗}.

Given a Horn function H and a variable subset α, we introduce a new operator •: α• is the closure of α with respect to the subset of clauses H \ H(α). That is, in order to compute α• one does forward chaining starting with α, but one is not allowed to use the clauses in H(α). This operator has been used in the literature before in related contexts, for example in [12].

Example 2. Let H = {a → b, a → c, c → d}. Then, (ac)∗ = abcd but (ac)• = acd, since H(ac) = {a → b, a → c} and we are only allowed to use the clause c → d when computing (ac)•.
Computing the GD basis of a definite Horn H. First, saturate every clause C = α → β in H by replacing it with the implication α• → α∗. Then, remove possibly redundant implications, namely: (1) remove implications s.t. α• = α∗, (2) remove duplicates, and (3) remove subsumed implications, i.e., implications α• → α∗ for which there is another implication β• → β∗ s.t. α∗ = β∗ but β• ⊂ α•.

Let us denote with GD(H) the implicational definite Horn CNF obtained by applying this procedure to input H. Note that this algorithm is designed to work when given a definite Horn CNF both in implicational or standard form.

The procedure can be computed in quadratic time, since finding the closures of antecedent and consequent of each clause can be done in linear time w.r.t. the size of the initial Horn CNF H.
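In code, the procedure looks as follows (our sketch, reusing closure() from the earlier forward chaining sketch); running it on the formula of Example 3 below reproduces GD(H) = {a → abce}.

def bullet(alpha, H):
    # alpha• : closure of alpha w.r.t. H \ H(alpha), where H(alpha) holds the
    # implications whose antecedent has the same closure as alpha.
    a_star = closure(alpha, H)
    return closure(alpha, [(an, co) for an, co in H
                           if closure(an, H) != a_star])

def gd_basis(H):
    # Saturation: replace each alpha -> beta by alpha• -> alpha*;
    # using a set also removes duplicates (pruning step (2)).
    sat = {(frozenset(bullet(a, H)), frozenset(closure(a, H))) for a, _ in H}
    # Pruning step (1): drop implications with alpha• = alpha*.
    sat = {(a, c) for a, c in sat if a != c}
    # Pruning step (3): drop alpha• -> alpha* subsumed by some beta• -> beta*
    # with beta* = alpha* but beta• a proper subset of alpha•.
    return [(set(a), set(c)) for a, c in sat
            if not any(c2 == c and a2 < a for a2, c2 in sat)]

H = [({'a'}, {'b'}), ({'a'}, {'c'}), ({'a', 'd'}, {'e'}), ({'a', 'b'}, {'e'})]
print(gd_basis(H))  # [({'a'}, {'a', 'b', 'c', 'e'})], i.e. GD(H) = {a -> abce}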
Example 3. Let H = {a → b, a → c, ad → e, ab → e}. We compute the closures of the antecedents: a∗ = abce, (ad)∗ = abcde, and (ab)∗ = abce. Therefore, H(a) = {a → b, a → c, ab → e}, H(ad) = {ad → e}, and H(ab) = H(a). Thus, a• = a, (ad)• = abcde, and (ab)• = ab. After saturation of every clause in H, we obtain H′ = {a → abce, a → abce, abcde → abcde, ab → abce}. It becomes clear that the third clause was, in fact, redundant. Also, the fourth implication is subsumed by the first two (after right-saturation), and we can group the first and second implications together into a single one. Hence, GD(H) = {a → abce}.
In the remainder of this Section we show that the given algorithm computes the unique saturated representation of its input. First, we need a simple lemma:

Lemma 4. Let H be any basis for a definite Horn CNF over variables X = {x1, . . . , xn}. For any α, β, γ ⊆ X, the following statements hold:

1. α ⊆ α• ⊆ α∗;
2. If H |= β → γ, β ⊆ α•, and β∗ ⊂ α∗, then γ ⊆ α•.
Lemma 5. The algorithm computing GD(H) outputs the GD basis of H for any definite Horn formula H.

Proof. Let H be the input to the algorithm, and let H′ be its output. We show that H′ must be saturated. Let α → β be an arbitrary implication in the output H′. Because of the initial saturation process, we can refer to this implication as α• → α∗. Clearly, (α•)∗ = α∗, so H′ is right-saturated. It is only left to show that H′ is left-saturated. By Lemma 4, it must be that α• ⊆ α∗, but the removal of implications of type (1) guarantees that α• ⊂ α∗; thus we have that BITS(α•) ⊭ α• → α∗ and Condition 1 of left-saturation is satisfied. Now let β• → β∗ be any other implication in H′. We need to show that BITS(α•) |= β• → β∗. Assume by way of contradiction that this is not so, and BITS(α•) |= β• but BITS(α•) ⊭ β∗. That is, β• ⊆ α• but β∗ ⊈ α•. If β• = α•, then β∗ = α∗, contradicting the fact that both implications have survived type (2) of removal of implications in the algorithm. Thus, β• ⊂ α•, and therefore β∗ ⊆ α∗ must hold as well. It cannot be that β∗ = α∗ because we would have that α• → α∗ is subsumed by β• → β∗ and thus removed from the output H′ during removal of implications of type (3) (and it is not). Thus, it can only be that β• ⊂ α• and β∗ ⊂ α∗. But if β∗ ⊂ α∗, Lemma 4 and the fact that H |= β• → β∗ (notice that saturating clauses does not change the logical value of the resulting formula) guarantee that β∗ ⊆ α•, contradicting our assumption that β∗ ⊈ α•. It follows that H′ is saturated as required.

It is clear that GD(H) has at most as many implications as H. Thus, if H is of minimum size, then so is GD(H). This, together with the fact that the GD basis is unique, implies:

Theorem 5 ([7]). The GD basis of a definite Horn function is of minimum implicational size.

4 The Guigues-Duquenne Basis in Query Learning

The classic query learning algorithm by Angluin, Frazier, and Pitt [2] is able to learn Horn CNF with membership and equivalence queries. It was proved in [2] that the outcome of the algorithm is always equivalent to the target concept. However, the following questions remain open: (1) which of the Horn CNF, among the many equivalent candidates, is output? And (2) does this output depend on the specific counterexamples given to the equivalence queries? Indeed, each query depends on the counterexamples received so far, and intuitively the final outcome should depend on that as well.

Our main result from this section is that, contrary to our first intuition, the output is always the same Horn CNF: namely, the GD basis of the target Horn function. This section assumes that the target is definite Horn; further sections in the paper lift the “definite” constraint.

4.1 The AFP Algorithm for Definite Horn CNF


We recall some aspects of the learning algorithm as described in [4], which bears only slight, inessential differences with the original in [2]. The algorithm maintains a set P of all the positive examples seen so far. The fact that the target is definite Horn allows us to initialize P with the positive example 1^n. The algorithm maintains also a sequence N = (x1, . . . , xt) of representative negative examples (these become the antecedents of the clauses in the hypotheses). The argument of an equivalence query is prepared from the list N = (x1, . . . , xt) of negative examples combined with the set P of positive examples. The query corresponds to the following intuitive bias: everything is assumed positive unless some (negative) xi ∈ N suggests otherwise, and everything that some xi suggests negative is assumed negative unless some positive example y ∈ P suggests otherwise. This is exactly the intuition in the hypothesis constructed by the AFP algorithm.

For the set of positive examples P, denote Px = {y ∈ P | x ≤ y}. The hypothesis to be queried, given the set P and the list N = (x1, . . . , xt), is denoted H(N, P) and is defined as H(N, P) = {ONES(xi) → ONES(⋀Pxi) | xi ∈ N}.

A positive counterexample is treated just by adding it to P. A negative counterexample y is used to either refine some xi into a smaller negative example, or to add xt+1 to the list. Specifically, let

i := min({j | MQ(xj ∧ y) is negative, and xj ∧ y < xj} ∪ {t + 1})

and then refine xi into xi ∧ y, in case i ≤ t, or else make xt+1 = y, subsequently increasing t. The value of i is found through membership queries on all the xj ∧ y for which xj ∧ y < xj holds.

AFP()
 1  N ← ( )                         /* empty list */
 2  P ← {1^n}                       /* top element */
 3  t ← 0
 4  while EQ(H(N, P)) = (“no”, y)   /* y is the counterexample */
 5    do if y ⊭ H(N, P)
 6       then add y to P
 7       else find the first i such that   /* N = (x1, . . . , xt) */
 8              xi ∧ y < xi, and           /* that is, xi ≰ y */
 9              xi ∧ y is negative         /* use membership query */
10            if found
11            then xi ← xi ∧ y            /* replace xi by xi ∧ y in N */
12            else t ← t + 1; xt ← y      /* append y to end of N */
13  return H(N, P)

Fig. 1. The AFP learning algorithm for definite Horn CNF
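The following runnable sketch mirrors Figure 1 on a small definite Horn target; the brute-force membership and equivalence oracles are our own stand-ins for the teacher and are not part of the algorithm itself.

from itertools import chain, combinations

VARS = frozenset('abcde')
TARGET = [(frozenset('a'), frozenset('b')), (frozenset('b'), frozenset('c')),
          (frozenset('ad'), frozenset('e'))]       # the H of Example 1

def models(x, H):                      # x |= H, with x given by its ONES-set
    return all(not a <= x or c <= x for a, c in H)

def member(x):                         # membership oracle MQ
    return models(x, TARGET)

def equivalent(H):                     # equivalence oracle EQ (brute force)
    subsets = chain.from_iterable(combinations(sorted(VARS), r)
                                  for r in range(len(VARS) + 1))
    for s in subsets:
        x = frozenset(s)
        if models(x, H) != models(x, TARGET):
            return x                   # a counterexample
    return None

def hypothesis(N, P):                  # H(N, P) as defined above
    return [(x, frozenset.intersection(*[y for y in P if x <= y])) for x in N]

def afp():
    N, P = [], [VARS]                  # P initialized with the top element 1^n
    while True:
        H = hypothesis(N, P)
        y = equivalent(H)
        if y is None:
            return H
        if not models(y, H):           # positive counterexample
            P.append(y)
        else:                          # negative counterexample
            for i, x in enumerate(N):
                if x & y < x and not member(x & y):
                    N[i] = x & y       # refine x_i to x_i AND y
                    break
            else:
                N.append(y)            # append y to the end of N

print(afp())  # the GD basis of TARGET (cf. Example 1): a -> abc, b -> bc, abcd -> abcde

By Theorem 6 below, the hypothesis returned does not depend on which counterexamples the brute-force equivalence oracle happens to produce.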

The AFP algorithm is described in Figure 1. In order to prove that its output
is indeed the GD basis, we need the following lemmas from [4]:
Lemma 6 (Lemma 2 from [4]). Along the running of the AFP algorithm, at the point of issuing the equivalence query, for every xi and xj in N with i < j there exists a positive example z such that xi ∧ xj ≤ z ≤ xj.

Lemma 7 (variant of Lemma 1 from [4]). Along the running of the AFP algorithm, at the point of issuing the equivalence query, for every xi and xj in N with i < j and xi ≤ xj, it holds that ⋀Pxi ≤ xj.
Proof. At the time xj is created, we know it is a negative counterexample to the current query, for which it must therefore be positive (that is, it satisfies the queried hypothesis). That query includes the implication ONES(xi) → ONES(⋀Pxi), and xj must satisfy it; then xi ≤ xj implies ⋀Pxi ≤ xj. From that point on, further positive examples may enlarge Pxi and thus reduce ⋀Pxi, keeping the inequality. Further negative examples y may reduce xi, again possibly enlarging Pxi and keeping the inequality; or may reduce xj into xj ∧ y. If xi ≰ xj ∧ y anymore, then there is nothing left to prove. Finally, if xi ≤ xj ∧ y, then xi ≤ y, and y is again a negative counterexample that must satisfy the implication ONES(xi) → ONES(⋀Pxi) as before, so that ⋀Pxi ≤ xj ∧ y also for the new value of xj.

Our key lemma for our next main result is:

Lemma 8. All hypotheses H(N, P) output by the AFP learning algorithm in equivalence queries are saturated.
 
Proof. Recall that H(N, P ) = {ONES(xi ) → ONES( Pxi )  xi ∈ N }, where

Pxi = {y ∈ P  xi ≤ y}. Let αi = ONES(xi ) and βi = ONES( Pxi ) for all i
so that H(N, P ) = {αi → βi  1 ≤ i ≤ t}.
First we show that H(N,  P ) is left-saturated. To see  that xi  |= αi → βi it
suffices to note that xi < Pxi since xi is negative but Pxi is positive by The-
orem 1, being an intersection of positive examples; thus, these two assignment
must be different.
Now we show that xi |= αj → βj , for all i  = j. If xi 
|= αj , then clearly xi |=
αj → βj . Otherwise, xi |= αj and therefore xj ≤ xi . If i < j, then by Lemma 6
we have that xi ∧ xj ≤ z ≤ xj for some positive z. Then, xi ∧ xj = xj ≤ z ≤ xj ,
so that xj = z, contradicting the fact that xj is negative whereas
 z is positive.
Otherwise, j < i. We apply Lemma 7: it must hold that Pxj ≤ xi . Thus, in
this case, xi |= αj → βj as well because xi |= βj = ONES( Pxj ).
It is only left to show that H(N, P ) is right-saturated. Clearly, H(N, P ) is con-
sistent with N and P , that is, x |= H(N, P ) for all x ∈ N and y |= H(N, P )  for all
y ∈ P . Take any x ∈ N contributing the implicationONES(x) → ONES( Px )
to H(N, P ). We show that it is right-saturated, i.e., Px = x , where the closure
is taken with respect to H(N, P ). We note first that H(N, P ) |= ONES(x) →
(ONES(x)) since the closure is taken w.r.t. implications in H(N, P ). By the
construction of H(N, P ), all examples y ∈ Px must satisfy it, hence they must
satisfy the implication ONES(x) → (ONES(x)) as well. Therefore, since y |=
ONES(x) we must have that y |= (ONES(x)) , or  equivalently, that x ≤ y.


This is true for  every such y in Px and thus x ≤ Px . On the other 



hand, it
is obvious that Px ≤ x since the implication  ONES(x) → ONES( Px ) of
H(N, P ) guarantees that all the variables in Pxare included in the forward
chaining process in the final x . So we have x ≤ Px ≤ x as required. 

Putting Theorem 4 and Lemma 8 together, we obtain:

Theorem 6. AFP, run on a definite Horn target, always outputs the GD basis of the target concept.
5 A Canonical Basis for General Horn

Naturally, we wish to extend the notion of saturation and GD basis to general Horn functions. We do this via a prediction-with-membership reduction [3] from general Horn to definite Horn, and use the corresponding intuitions to define a GD basis for general Horn. We use this reduction to generalize our AFP algorithm to general Horn CNF, and as a consequence one obtains that the generalized AFP always outputs a saturated version of the target function. Indeed, for the generalized AFP it is also the case that the output is only dependent on the target, and not on the counterexamples received along the run. Finally, we construct strong polynomial certificates for general Horn functions directly in terms of the generalized GD basis, thus generalizing our earlier result of [4].

5.1 Reducing General Horn CNF to Definite Horn CNF

In this section we describe the intuition of the representation mapping, which we use in the next section to obtain a canonical basis for general Horn functions.
For any general Horn CNF H over n propositional variables, say X = {xi | 1 ≤ i ≤ n}, we construct a definite Horn CNF H′ over the set of n + 1 propositional variables X′ = X ∪ {f}, where f is a new “dummy” variable; in essence f represents the false (that is, empty) consequent of the negative clauses in H. The relationship between the assignments for H and H′ is as follows: for assignments xb of n + 1 variables, where x assigns to the variables in X and b is the truth value assigned to f, x0 |= H′ if and only if x |= H, whereas x1 |= H′ if and only if x = 1^n.
Define the implication Cf as f → X′. Let Hd be the set of definite Horn clauses in H, and Hn = H \ Hd the negative ones. Define the mapping g as

g(H) = Hd ∪ {¬C → X′ | C ∈ Hn} ∪ {Cf}.


That is, g(H) includes the definite clauses of H, the special implication Cf, and the clauses C that are negative are made definite by forcing all the positive literals, including f, into them. Clearly, the resulting g(H) is definite Horn. Observe that the new implication Cf is saturated and the ones coming from Hn are right-saturated. Observe also that g is injective: given g(H), we recover H by removing the implication Cf, and by removing all positive literals from any implications containing f. Clearly, g−1(g(H)) = H, since g−1 removes all that g adds.
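For concreteness, here is a small Python sketch of g and g−1; the encoding of a Horn CNF as a list of (antecedent, consequent) pairs of variable sets, with an empty consequent standing for the empty consequent of a negative clause, is our own choice, not the paper's notation:

```python
F = "f"  # the new "dummy" variable

def g(H, X):
    """Map a general Horn CNF H over variables X to a definite Horn CNF
    over X + {f}: keep definite clauses, complete negative clauses with
    all positive literals (including f), and add Cf."""
    Xp = set(X) | {F}
    out = [(a, b) for (a, b) in H if b]        # definite clauses of H
    out += [(a, Xp) for (a, b) in H if not b]  # negative clauses made definite
    out.append(({F}, Xp))                      # the implication Cf
    return out

def g_inv(Hp):
    """Invert g: drop Cf, and strip all positive literals from any
    implication whose consequent contains f."""
    out = []
    for a, b in Hp:
        if a == {F}:
            continue                             # remove Cf
        out.append((a, set() if F in b else b))  # f in consequent: negative clause
    return out

# The CNF of Example 4 below: H = {a -> b, a -> c, abc -> empty}.
H = [({"a"}, {"b"}), ({"a"}, {"c"}), ({"a", "b", "c"}, set())]
assert g_inv(g(H, {"a", "b", "c"})) == H
```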

5.2 Constructing a GD-like Basis for General Horn CNF

The notion of left-saturation translates directly into general Horn CNF:

Definition 2. Let B = {αi → βi}i be a basis for some general Horn function. Notice that now βi can possibly be empty (it is empty for the negative clauses). Then, B is left-saturated if the following two conditions hold:
1. BITS(αi) ⊭ αi → βi, for all i;
2. BITS(αi) |= αj → βj, for all i ≠ j.

For a definite Horn CNF H, right-saturating a clause α → β essentially means that we add to its consequent everything that is implied by its antecedent, namely α⋆. This can no longer be done in the case of general Horn CNF, since we need to take special care of the negative clauses. If β = ∅, we cannot set β to α⋆ without changing the underlying Boolean function being represented. The closure x⋆ of an assignment x is defined as the closure with respect to all definite clauses in the general Horn CNF. It is useful to continue to partition assignments x in the Boolean hypercube according to their closures x⋆; however, in the general Horn case, we distinguish a new class (the negative class) of closed assignments that are actually negative, that is, it is possible now that x⋆ ⊭ H. These assignments are exactly those that satisfy all definite clauses of H but violate negative ones. Based on this, the negative clauses (those with antecedent α such that BITS(α⋆) ⊭ B) should be left unmodified, and the definite clauses (those whose antecedents α are such that BITS(α⋆) |= B) should be right-saturated. Thus, the definition is:

Definition 3. Let B = {αi → βi}i be a basis for some general Horn function. Then, B is right-saturated if, for all i, βi = ∅ if BITS(αi⋆) ⊭ B, and βi = αi⋆ otherwise.

As for the definite case, “saturated” means that the general Horn CNF in ques-
tion is both left- and right-saturated. We must see that this is the “correct”
definition in some sense:

Lemma 9. A basis H is saturated iff H = g−1(GD(g(H))).

Proof. First let us note that the expression g−1(GD(g(H))) is well-defined. We can always invert g on GD(g(H)), since saturating g(H) does not modify Cf (already saturated) and does not touch the positive literals of implications containing f, since these are right-saturated. Therefore, we can invert it, since the parts added by g are left untouched by the construction of GD(g(H)).
We prove first that if H is saturated then H = g−1(GD(g(H))). Assume, then, that H is saturated but H ≠ g−1(GD(g(H))). Applying g, which is injective, this can only happen if GD(g(H)) ≠ g(H); namely, g(H), as a definite Horn CNF, differs from its own GD basis and, hence, is not saturated: it must be because some implication other than Cf is not saturated, since this last one is saturated by construction. Also the ones containing f in their consequents are right-saturated, so no change happens in the right-hand sides of these implications when saturating g(H). This means that when saturating we must add a literal different from f to the right-hand side of an implication not containing f, or to the left-hand side of an implication. In both cases, this means that the original H could not be saturated either, contradicting our assumption.
It is only left to show that an H such that H = g−1(GD(g(H))) is indeed saturated. By way of contradiction, assume that H is not saturated but H = g−1(GD(g(H))). Applying g to both sides, we must have that g(H) = GD(g(H)), so that g(H) is actually saturated. Notice that the only difference between H and g(H) is in the implication Cf and the right-hand sides of the negative clauses in H; g(H) being left-saturated means that so must be H, since the left-hand sides of H and g(H) coincide exactly (ignoring Cf, naturally). Therefore, H is left-saturated as well. It must then be that H is not right-saturated, that is, it is either missing some variable in some non-empty consequent, or some clause that should be negative is not. In the first case, g(H) is missing it, too, and it cannot be saturated. In the second case, there is a redundant clause in H, contradicting the fact that H is left-saturated (see Lemma 1(1)). In both cases we arrive at a contradiction, and thus the lemma follows. □

This gives us a way to compute the saturation (that is, the GD basis) of a given
general Horn CNF:

Theorem 7. General Horn functions have a unique saturated basis. This basis, which we denote GD(H), can be computed as GD(H) = g−1(GD(g(H))).

Proof. If H is saturated then H = g−1(GD(g(H))). The uniqueness of such an H follows from the following facts: first, g(H) and g(H′) are equivalent whenever H and H′ are equivalent; second, GD(g(H)) is unique for the function represented by H (Theorem 4); and third, g−1 is univocally defined since g is injective. □

Example 4. Let H be the general Horn CNF {a → b, a → c, abc → ∅}. Then:

– g(H) = {a → b, a → c, abc → abcf, f → abcf};
– GD(g(H)) = {a → abcf, f → abcf};
– GD(H) = g−1(GD(g(H))) = {a → ∅}.

Similarly to the case of definite Horn functions, GD(H) does not increase the number of implications, and therefore if H is of minimum size, GD(H) must be of minimum size as well. This, together with the uniqueness of the saturated representation, implies:

Theorem 8. The GD basis of a general Horn function is of minimum implicational size. □

5.3 The AFP Algorithm for General Horn CNF

We study now the AFP algorithm operating on general Horn CNF, by following a detour: we obtain it via reduction to the definite case.
We consider, therefore, an algorithm that, for a general Horn target function H, simulates the version of the AFP algorithm from Figure 1 on its definite transformation g(H), where g is the representation transformation from Section 5.1. It has to simulate the membership and equivalence oracles for definite Horn CNF that the underlying algorithm expects, by using the oracles that it has for general Horn.
Initially, we set P = {1^{n+1}} and N = (0^n 1), since we know that g(H) is definite and must contain the implication f → X ∪ {f} by construction. In essence, the positive assignment 1^{n+1} = f⋆ and the negative 0^n 1 = f• guarantee that the implication Cf is included in every hypothesis H(N, P) that the simulation outputs as an equivalence query.
In order to deal with the queries, we use two transformations: we must map examples over the n + 1 variables, asked as membership queries, into examples over the original example space over n variables, although in some cases we are able to answer the query directly, as we shall see. Upon asking x0 as a membership query for g(H), we pass on to H the membership query about x. Membership queries of the form x1 are always answered negatively, except for 1^{n+1}, which is answered positively (in fact the query 1^{n+1} never arises anyway, because that example is in P from the beginning). Conversely, n-bit counterexamples x from the equivalence query with H are transformed into x0. The equivalence queries themselves are transformed according to g−1. It is readily checked that all equivalence queries belong indeed to the image set of g, since Cf ∈ H(N, P).
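A sketch of this oracle simulation in Python; the bit-tuple encoding of assignments (the last bit giving f) and the function names are our own:

```python
def lift_membership(member_H):
    """Build a membership oracle for g(H) over n + 1 variables from a
    membership oracle for the general Horn target H over n variables.
    Assignments are tuples of 0/1 bits; the last bit is the value of f."""
    def member_gH(xb):
        x, b = xb[:-1], xb[-1]
        if b == 0:
            return member_H(x)  # x0 |= g(H)  iff  x |= H
        return all(xb)          # x1 |= g(H)  iff  xb is the all-ones assignment
    return member_gH

# n-bit counterexamples x received from the equivalence oracle for H are
# handed to the simulated learner as x + (0,); hypotheses H(N, P) are
# translated through g^{-1} before being asked as equivalence queries.
```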
Altogether, these functions constitute a prediction-with-membership (pwm) reduction from general Horn to definite Horn, in the sense of [3]. It is interesting to note that if we unfold the simulation, we end up with the original algorithm by Angluin, Frazier and Pitt [2] (obviously, with no explicit reference to our “dummy” f).
Therefore, the outcome of AFP on a general Horn target H is univocally determined by the outcome of AFP on the corresponding definite Horn function g(H); combining this fact with Theorems 6 and 7 leads to:

Theorem 9. The AFP algorithm always outputs the GD basis of the target concept. □

5.4 Certificates for General Horn CNF

The certificate dimension of a given concept class is closely related to its learnability in the model of learning from membership and equivalence queries [1,8,9]. Informally, a certificate for a class C of concepts of size at most m is a set of (labeled) assignments that proves that concepts consistent with it must be outside C. The polynomial q(m, n) used below quantifies the cardinality of the certificate set in terms of m, the size of the class, and n, the number of variables in the class. The polynomial p(m, n) quantifies the expansion in size allowed in the hypotheses. In this paper, p(m, n) = m and thus we construct strong certificates.
In [4] we show how to build strong certificates for definite Horn CNF. Here, we extend this to general Horn CNF, and describe the certificates directly in terms of the generalized GD basis. Due to space limitations, we only sketch the proof.

Theorem 10. The class of general Horn CNF has strong polynomial certificates with p(m, n) = m and q(m, n) = (m+1 choose 2) + m + 1 = (m+2 choose 2).

Proof (Sketch). The argumentation follows, essentially, the same steps as the analogous proof in [4], because, by Lemma 9, the GD basis in the general case is saturated, and therefore all required facts carry over to the general case. Let f be a Boolean function that cannot be represented with m Horn implications. If f is not Horn, then three assignments x, y, x ∧ y such that x |= f and y |= f but x ∧ y ⊭ f suffice. Otherwise, f is a general Horn CNF of implicational size strictly greater than m. Assume that f contains at least m + 1 non-redundant and possibly negative implications {αi → βi}. We define the certificate for f:

Qf = {x•i, x⋆i | 1 ≤ i ≤ m + 1, xi = BITS(αi), βi ≠ ∅}
   ∪ {x•i | 1 ≤ i ≤ m + 1, xi = BITS(αi), βi = ∅}
   ∪ {x•i ∧ x•j | 1 ≤ i < j ≤ m + 1}

It is illustrative to note the relation between this set of certificates for f and its GD basis: the assignments x•i and x⋆i correspond exactly to the left- and right-hand sides of the (saturated) definite implications in GD(f). For negative clauses, only the (saturated) left-hand side of the implication, x•i, matters. □
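For the record, the two expressions for q(m, n) agree by Pascal's rule:

\[
\binom{m+1}{2} + (m+1) \;=\; \binom{m+1}{2} + \binom{m+1}{1} \;=\; \binom{m+2}{2}.
\]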

References
1. Angluin, D.: Queries revisited. Theoretical Computer Science 313, 175–194 (2004)
2. Angluin, D., Frazier, M., Pitt, L.: Learning conjunctions of Horn clauses. Machine
Learning 9, 147–164 (1992)
3. Angluin, D., Kharitonov, M.: When won’t membership queries help? Journal of
Computer and System Sciences 50(2), 336–355 (1995)
4. Arias, M., Balcázar, J.L.: Query learning and certificates in lattices. In: Freund, Y.,
Györfi, L., Turán, G., Zeugmann, T. (eds.) ALT 2008. LNCS (LNAI), vol. 5254,
pp. 303–315. Springer, Heidelberg (2008)
5. Arias, M., Feigelson, A., Khardon, R., Servedio, R.A.: Polynomial certificates for
propositional classes. Inf. Comput. 204(5), 816–834 (2006)
6. Chang, C.-L., Lee, R.C.-T.: Symbolic Logic and Mechanical Theorem Proving.
Academic Press, Inc., Orlando (1973)
7. Guigues, J.L., Duquenne, V.: Familles minimales d’implications informatives re-
sultants d’un tableau de donnees binaires. Math. Sci. Hum. 95, 5–18 (1986)
8. Hegedüs, T.: On generalized teaching dimensions and the query complexity of
learning. In: Proceedings of the Conference on Computational Learning Theory,
pp. 108–117. ACM Press, New York (1995)
9. Hellerstein, L., Pillaipakkamnatt, K., Raghavan, V., Wilkins, D.: How many queries
are needed to learn? Journal of the ACM 43(5), 840–862 (1996)
10. Khardon, R., Roth, D.: Reasoning with models. Artificial Intelligence 87(1-2), 187–
213 (1996)
11. Wang, H.: Toward mechanical mathematics. IBM Journal for Research and Devel-
opment 4, 2–22 (1960)
12. Wild, M.: A theory of finite closure spaces based on implications. Advances in
Mathematics 108, 118–139 (1994)
Learning Finite Automata Using Label Queries

Dana Angluin1, Leonor Becerra-Bonache1,2,⋆, Adrian Horia Dediu2,3, and Lev Reyzin1,⋆⋆

1 Department of Computer Science, Yale University,
51 Prospect Street, New Haven, CT, USA
{dana.angluin,leonor.becerra-bonache,lev.reyzin}@yale.edu
2 Research Group on Mathematical Linguistics, Rovira i Virgili University,
Avinguda Catalunya, 35, 43002, Tarragona, Spain
3 “Politehnica” University of Bucharest,
Splaiul Independentei 313, 060042, Bucharest, Romania
adrianhoriadediu@yahoo.com

Abstract. We consider the problem of learning a finite automaton M of n states with input alphabet X and output alphabet Y when a teacher
has helpfully or randomly labeled the states of M using labels from a
set L. The learner has access to label queries; a label query with input
string w returns both the output and the label of the state reached by
w. Because different automata may have the same output behavior, we
consider the case in which the teacher may “unfold” M to an output-equivalent machine M′ and label the states of M′ for the learner. We
give lower and upper bounds on the number of label queries to learn
the output behavior of M in these different scenarios. We also briefly
consider the case of randomly labeled automata with randomly chosen
transition functions.

1 Introduction

The problem of learning the behavior of a finite automaton has been considered
in several domains, including language learning and environment learning by
robots. Many interesting questions remain about the kinds of information that
permit efficient learning of finite automata.
One basic result is that finite automata are not learnable using a polynomial number of membership queries. Consider a “password machine”, that is, an acceptor with (n + 2) states that accepts exactly one binary string of length n; the learner may have to query (2^n − 1) strings before finding the one that is accepted. In this case, the learner gets no partial information from the unsuccessful queries.
However, Freund et al. [5] show that regardless of the topology of the underly-
ing automaton, if its states are randomly labeled with 0 or 1, then a robot taking

⋆ Supported by a Marie Curie International Fellowship within the 6th European Community Framework Programme.
⋆⋆ This material is based upon work supported under a National Science Foundation Graduate Research Fellowship.

a random walk on the automaton can learn to predict the labels while making
only a polynomial number of errors of prediction. Random labels on the states
provide a rich source of information that can be used to distinguish otherwise
difficult-to-distinguish pairs of states.
In a different setting, Becerra-Bonache, Dediu and Tı̂rnăucă [3] introduced
correction queries to model a kind of correction provided by a teacher to
a learner when the learner’s utterance is not grammatically correct. In their
model, a correction query with a string w gives the learner not only member-
ship information about w, but also, if w is not accepted, either the minimum
continuation of w that is accepted, or the information that no continuation of
w is accepted. In certain cases, corrections may provide a substantial amount
of partial information for the learner. For example, for a password machine, a
prefix of the password will be answered with the rest of the password. We may
think of correction queries as labeling each state q of the automaton with the
string rq that is the response to any correction query w that arrives at q.
In both of these cases, labels on states may facilitate the learning of finite
automata: randomly chosen labels in the work of Freund et al. and meaningfully
chosen labels in the work of Becerra-Bonache, Dediu and Tı̂rnăucă. In this paper
we explore the general idea of adding labels to the states of an automaton to
make it easier to learn. That is, we allow a teacher to prepare an automaton M
for learning by adding labels to its states (either carefully or randomly chosen).
When the learner queries a string, the learner receives not only the original
output of M for that string, but also the label attached to that state by the
teacher. In an extension of this idea, we also allow the teacher to “unfold” the
machine M to produce copies of a state that may then be given different labels.
These ideas are also relevant to automata testing [7] – labeling and unfolding
automata can make their structure easier to verify.
Depending on how labels are assigned, learning may or may not become easier.
If each state is assigned a unique label, the learning task becomes easy because
the learner knows which state the machine reaches on any given query. However, if
the labels are all the same, they give no additional information and learning may
require an exponential number of queries (as in the case of membership queries.)
Hence we focus on questions of the following sort. Given an automaton, how
can a teacher use a limited set of labels to make the learning problem easier?
If random labels are sprinkled on the states of an automaton, how much does
that help the learner? How few labels can we use and still make the learning
problem tractable? Other questions concern the structure of the automaton itself.
For example, we may consider changing the structure of the automaton before
labeling it. We also consider the problem of learning randomly labeled automata
with random structure.

2 Preliminaries

We consider finite automata with output, defined as follows. A finite automaton M has a finite set Q of states, an initial state q0 ∈ Q, a finite alphabet X of input
symbols, a finite alphabet Y of output symbols, an output function γ mapping Q to Y, and a transition function τ mapping Q × X to Q. We extend τ to map
Q × X ∗ to Q in the usual way. A finite acceptor is a finite automaton with
output alphabet Y = {0, 1}; if γ(q) = 1 then q is an accepting state, otherwise,
q is a rejecting state. In this paper we assume that there are at least two input
symbols and at least two output symbols, that is, |X| ≥ 2 and |Y | ≥ 2.
For any string w ∈ X ∗ , we define M (w) to be γ(τ (q0 , w)), that is, the output
of the state reached from q0 on input w. Two finite automata M1 and M2
are output-equivalent if they have the same input alphabet X and the same
output alphabet Y and for every string w ∈ X ∗ , M1 (w) = M2 (w).
If M is a finite automaton with output, then an output query with string
w ∈ X ∗ returns the symbol M (w). This generalizes the concept of a membership
query for an acceptor. That is, if M is an acceptor, an output query with w
returns 1 if w is accepted by M and 0 if w is rejected by M . We note that
Angluin’s polynomial time algorithm to learn finite acceptors using membership
queries and equivalence queries generalizes in a straightforward way to learn
finite automata with output using output queries and equivalence queries [2].
If q1 and q2 are states of a finite automaton with output, then q1 and q2 are distinguishable if there exists a distinguishing string for them, namely, a string w such that γ(τ(q1, w)) ≠ γ(τ(q2, w)), that is, w leads from q1 and q2 to two states with different output symbols. If M is minimized, every pair of its states is distinguishable, and M has at most one sink state.
If d is a nonnegative integer, the d-signature tree of a state q is the finite
function mapping each input string z of length at most d to the output symbol
γ(τ (q, z)). We picture the d-signature tree of a state as a rooted tree of depth
d in which each internal node has |X| children labeled with the elements of X,
and each node is labeled with the symbol from Y that is the output of the state
reached from q on the input string z that leads from the root to this node. The
d-signature tree of a state gives the output behavior in a local neighborhood of
the automaton reachable from that state.
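The d-signature tree can be tabulated directly from this definition. The following Python sketch assumes a hypothetical automaton object exposing tau (one-step transition) and gamma (output); this interface is ours, for illustration:

```python
from itertools import product

def signature_tree(M, q, X, d):
    """Map every input string z with |z| <= d (as a tuple of symbols)
    to the output of the state reached from q on z."""
    tree = {}
    for length in range(d + 1):
        for z in product(X, repeat=length):
            state = q
            for a in z:
                state = M.tau(state, a)
            tree[z] = M.gamma(state)
    return tree

# Two states are distinguished by some string of length at most d exactly
# when their d-signature trees differ.
```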
For any finite automaton M with output, we may consider its transition graph, which is a finite directed graph (possibly with multiple edges and self-loops) defined as follows. The vertices are the states of M and there is an edge from q to q′ for each transition τ(q, a) = q′. Properties of the transition graph are attributed to M; that is, M is strongly connected if its transition graph is strongly connected. Similarly, the out-degree of M is |X| for every node, and the in-degree of M is the maximum number of edges entering any node of its transition graph. For a positive integer k, we define an automaton M to be k-concentrating if there is some set Q′ of at most k states of M such that every state of M can reach at least one state in Q′. Every strongly connected automaton is 1-concentrating.

2.1 Labelings

If M is a finite automaton with output, then a labeling of M is a function ℓ mapping Q to a set L of labels, the label alphabet. We use M to construct a new automaton Mℓ by changing the output function to γℓ(q) = (γ(q), ℓ(q)). That is, the new output for a state is a pair incorporating the output symbol for the state and the label attached to the state. For the scenario of learning with labels, we assume that the learner has access to output queries for Mℓ for some labeling ℓ of the hidden automaton M. For the scenario of learning with unfolding and labels, we assume that the learner has access to output queries for M1ℓ for some labeling ℓ of some automaton M1 that is output-equivalent to M. In these two scenarios, the queries will be referred to as label queries. The goal of the learner in either scenario is to use label queries to find a finite automaton M′ output-equivalent to M. Thus, the learner must discover the output behavior of the hidden automaton, but not necessarily its topology or labeling. We assume the learner is given both X and |Q|.

3 Learning with Labels

First, we show a lower bound on the number of label queries required to learn a hidden automaton M with n states and an arbitrary labeling ℓ.

Proposition 1. Let L be a finite label alphabet. Learning a hidden automaton with n states and a labeling ℓ using symbols from L requires

Ω( |X| n log(n) / (1 + log |L|) )

label queries in the worst case.

Proof. Recall that we have assumed that |X| and |Y| are both at least 2; we consider |Y| = 2. Domaratzki, Kisman and Shallit [4] have shown that there are at least

(|X| − o(1)) n 2^{n−1} n^{(|X|−1)n}

distinct languages accepted by acceptors with n states. Because each label query returns one of at most 2·|L| values, an information-theoretic argument gives the claimed lower bound on the number of label queries. As a corollary, when |X| and |L| are constants, we have a lower bound of Ω(n log(n)) label queries. □
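Spelling out the information-theoretic step: after q label queries, the learner can have seen at most (2|L|)^q distinct answer sequences, so identifying every behavior in the count above forces

\[
(2|L|)^{q} \;\ge\; (|X|-o(1))\, n\, 2^{\,n-1}\, n^{(|X|-1)n}
\quad\Longrightarrow\quad
q \;\ge\; \frac{(|X|-1)\,n \log n + n - O(1)}{1+\log|L|}
\;=\; \Omega\!\left(\frac{|X|\, n \log n}{1+\log|L|}\right).
\]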

3.1 Labels Carefully Chosen


In this section, we examine the case where the teacher is given a limit on the
number of different labels he may use, and he is able to label the states after
examining the automaton. Moreover, the learning algorithm may take advantage
of knowing the labeling strategy of the teacher. In this setting the problem takes
on an aspect of coding, and indicates the maximum extent to which labeling
may facilitate efficient learning. We begin with a simple proposition.

Proposition 2. An automaton with n states, helpfully labeled using n different labels, can be learned using |X|n label queries.
Proof. The teacher assigns a unique integer label between 1 and n to each state.
The learner asks a label query with the empty string to determine the output
and label of the start state, and then explores the transitions from the start state
by querying each a ∈ X. After querying an input string w, the label indicates
whether this state has been visited before. If the state is new, the learner explores
all the transitions from it by querying wa for each a ∈ X. Thus, after querying
at most |X|n strings, the learner knows the structure and outputs of the entire
automaton. The lower bound shows that this is asymptotically optimal if the
label set L has n elements. □
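The exploration just described is a label-driven breadth-first search. A Python sketch, assuming a hypothetical oracle label_query(w) that returns the (output, label) pair of the state reached by the string w, and one-character input symbols:

```python
from collections import deque

def learn_uniquely_labeled(label_query, X):
    """Learn an automaton whose states carry distinct labels.  Uses at
    most |X| queries per discovered state, plus one query for the root."""
    outputs, delta = {}, {}
    out, lab = label_query("")
    outputs[lab] = out
    frontier = deque([("", lab)])
    while frontier:
        w, lab = frontier.popleft()
        for a in X:
            out2, lab2 = label_query(w + a)  # explore the a-transition
            delta[(lab, a)] = lab2
            if lab2 not in outputs:          # the label reveals new states
                outputs[lab2] = out2
                frontier.append((w + a, lab2))
    return outputs, delta
```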

We next consider limiting the teacher to a constant number of different labels: a polynomial number of label queries suffices in this case.

Theorem 1. For each automaton with n states, there is a helpful labeling using 2^{|X|} different labels such that the automaton can be learned using O(|X|n^2) label queries.

Proof. Given an automaton M of n states, the teacher chooses an outward-directed spanning tree T rooted at q0 of the transition graph of the automaton,
and labels the states of M to communicate T to the learner as follows. The label
of state q is the subset of X corresponding to the edges of T from q to other
nodes. The label of q directs the learner to q’s children. Using at most n label
queries and the structure of T , the learner can create a set S of n input strings
such that for each state q of M , there is one string w ∈ S such that τ (q0 , w) = q.
In [1], Angluin gives an algorithm for learning a regular language using mem-
bership queries given a live complete sample for the language. A live complete
sample for a language L is a set of strings P that, for every state q (other than the dead state) of the minimal acceptor for L, contains a string that leads from the start state to q. Given a live complete sample P, a learner can find the regular language using O(k|P|n) membership queries, where k is the size of the input alphabet. A straightforward generalization of this algorithm to automata with output shows that the set S and O(|X|n^2) output queries can be used to find an automaton output-equivalent to M. □

However, the number of queries, O(n^2), does not meet the Ω(n log n) lower
bound, and the number of different labels is large. For a restricted class of au-
tomata, there is a helpful labeling with fewer labels that permits learning with
an asymptotically optimal O(n log n) label queries. To appreciate the generality
of Theorem 2, we note once more that every strongly connected automaton is
1-concentrating, and as we will see in Lemma 1, automata with a small input
alphabet can be unfolded to have small in-degree.

Theorem 2. Let k and c be positive integers. Any automaton in the class of c-concentrating automata with in-degree at most k can be helpfully labeled with at most (3k|X| + c) labels so that it can be learned using O(|X|n log(n)) label queries.
Proof. We give the construction for 1-concentrating automata and indicate how
to generalize it at the end of the proof. Given a 1-concentrating automaton M
the teacher chooses as the root a node reachable from all other nodes in the
transition graph of M . The depth of a node is the length of the shortest path
from that node to the root. The teacher then chooses a spanning tree T directed
inward to the root by choosing a parent for each non-root node. (One way to do
this is to let the parent of a node q be the first node reached along a shortest
path from q to the root.) The teacher assigns, as part of the label for each node
q, an element a ∈ X such that τ (q, a) is the parent of q.
The teacher now adds more information to the labels of the nodes, which we
call color, using the colors yellow, red, green, and blue. The root is the unique
node colored yellow. Let t = ⌈log n⌉; t bits are enough to give a unique identifier
for every node of the graph. Each node at depth a multiple of (t + 1) is colored
red. For each red node v we choose a unique identifier of t bits (c1 , c2 , . . . , ct )
encoded as green and blue labels. Now consider the maximal subtree rooted at
v containing no red nodes. For each level i from 1 to the depth of the subtree,
all the nodes at level i of the subtree are colored with ci (which is either blue
or green.) The teacher has (so far) used 3|X| + 1 labels – a direction and one of
three colors per non-root node, and a unique identifier for the root.
Given this labeling, the learner can start from any state and reach a local-
ization state whose identifier is known, as follows. The learner uses the parent
component of the labels to go up the tree until it passes one red node and arrives
at a second red node, or arrives at the root (whichever comes first), keeping track
of the labels seen. If the learner reaches the root, it knows where it is. Other-
wise, the learner interprets the labels seen between the first and second red node
encountered as an identifier for the node v reached. This involves observing at
most (2t+2) labels. Thus, even if the in-degree is not bounded, a 1-concentrating
automaton can be labeled so that with O(log(n)) label queries the learner can
reach a uniquely identified localizing state.
If each node of the tree T also has in-degree bounded by k, another component
of the label for each non-root node identifies which of the k possible predecessors
of its parent it is (numbered arbitrarily from 1 to at most k.) If the learner col-
lects these values on the path from u to its localization node v, then we have an
identifier for u with respect to v. Thus it takes O(log(n)) label queries to learn any
node’s identifier. If the node has not been encountered before, its |X| transitions
must be explored, as in Proposition 2. This gives us a learning algorithm using
O(|X|n log(n)) label queries. The labeling uses at most 3k|X| + 1 different labels.
If the automaton is c-concentrating for some c > 1, then the teacher selects
a set of at most c nodes such that every node can reach at least one of them
and constructs a forest of at most c inward directed disjoint spanning trees, and
proceeds as above. This increases the number of unique identifiers for the roots
from 1 to c. □
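The localization step of this proof can be summarized in code. In the following Python sketch we assume, purely for illustration, that each label is a pair (parent symbol, color) and that label_query(w) returns the output together with that label:

```python
def localize(label_query, w):
    """Climb parent pointers from the state reached by w until the root
    (yellow) or a second red node is met; the green/blue colors seen
    strictly between the two red nodes spell the identifier of the node
    reached.  Observes O(log n) labels."""
    colors, reds = [], 0
    while True:
        _, (parent, color) = label_query(w)
        if color == "yellow":
            return w, ("root",)          # the uniquely labeled root
        if color == "red":
            reds += 1
            if reds == 2:
                return w, tuple(colors)  # identifier of this red node
        elif reds == 1:
            colors.append(color)         # bits between the two red nodes
        w = w + parent                   # one step up the spanning tree
```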

An open question is whether an arbitrary finite automaton with n states can


be helpfully labeled with O(1) labels in such a way that it can be learned using
O(|X|n log n) label queries.
3.2 Labels Randomly Chosen

In this section we turn from labels carefully chosen by the teacher to an indepen-
dent uniform random choice of labels for states from a label alphabet L. With
nonzero probability the labeling may be completely uninformative, so results in
this scenario incorporate a confidence parameter δ > 0 that is an input to the
learner. The goal of the learner is to learn an automaton that is output equiv-
alent to the hidden automaton M with probability at least (1 − δ), where this
probability is taken over the labelings of M . Results on random labelings can be
used in the careful labeling scenario: the teacher generates a number of random
labelings until one is found that has the desired properties.
We first review the learning scenario considered by Freund et al. [5]. There
is a finite automaton over the input alphabet X = {0, 1} and output alpha-
bet {+, −}, where the transition function and start state of the automaton are
arbitrary, but the output symbol for each state is chosen independently and uni-
formly from {+, −}. The learner moves from state to state in the target automa-
ton according to a random walk (the next input symbol is chosen independently
and uniformly from {0, 1}) and, after learning what the next input symbol will
be, attempts to predict the output (+ or −) of the next state. After the pre-
diction, the learner is told the correct output and the process repeats with the
next input symbol in the random walk. If the learner’s prediction was incorrect,
this counts as a prediction mistake. In the first scenario they consider, the
learner may reset the machine to the initial state by predicting ? instead of + or
−; this counts as a default mistake. In this model, the learner is completely
passive, dependent upon the random walk process to disclose useful information
about the behavior of the underlying automaton. For this setting they prove the
following.

Theorem 3 (Freund et al. [5]). There exists a learning algorithm that takes n and δ as input, runs in time polynomial in n and 1/δ, and with probability at least (1 − δ) makes no prediction mistakes and an expected O((n^5/δ^2) log(n/δ)) default mistakes.

The main idea is to use the d-signature tree of a state as the identifier for the state, where d ≥ 2 log(n^2/δ). For this setting, there are at least n^4/δ^2 strings in a signature tree of depth d. The following theorem of Trakhtenbrot and Barzdin’ [8] establishes that signature trees of this depth are sufficient.

Theorem 4 (Trakhtenbrot and Barzdin’ [8]). For any natural number d and for any finite automaton with n states and randomly chosen outputs from Y, the probability that for some pair of distinguishable states the shortest distinguishing string is of length greater than d is less than

n^2 (1/|Y|)^{d/2}.

We may apply these ideas to prove the following.


Theorem 5. For any positive integer s, any finite automaton with n states, over the input alphabet X and output alphabet Y, with its states randomly labeled with labels from a label alphabet L with |L| = |X|^s, can be learned using

O( |X| n^{1+4/s} / δ^{2/s} )

label queries, with probability at least (1 − δ) (with respect to the choice of labeling).
Proof. Assume that the learning algorithm is given n, a bound on the number of states of the hidden automaton, and the confidence parameter δ > 0. It calculates a bound d = d(n, δ) (described below) and proceeds as follows, starting with the empty input string. To explore the input string w, the learning algorithm calculates the d-signature tree (in the labeled automaton) of the state reached by w, by making label queries on wz for all input strings z of length at most d. This requires O(|X|^d) queries. If this signature tree has not been encountered before, then the algorithm explores the transitions wa for all a ∈ X. Assuming that the labeling is “good”, that is, that all distinguishable pairs of states have a distinguishing string in the labeled automaton of length at most d, this correctly learns the output behavior of the hidden automaton using O(|X|^{d+1} n) label queries.
To apply Theorem 4, we assume that the hidden automaton M is an arbitrary finite automaton with output with at most n states, input alphabet X and output alphabet Y. The labels randomly chosen from L then play the role of the random outputs in Theorem 4. There is a somewhat subtle issue: states distinguishable in M by their outputs may not be distinguishable in the labeled automaton by their labels alone. Fortunately, Freund et al. [5] have shown us how to address this point. In the first case, if two states of M are distinguishable by their outputs in M by a string of length at most d, then their d-signature trees (in the labeled automaton) will differ. Otherwise, if the shortest distinguishing string for the two states (using just outputs) is of length at least d + 1, then, generalizing the argument for Theorem 2 in [5] from |Y| = 2 to arbitrary |Y|, the probability that this pair of states is not distinguished by the random labeling by a string of length at most d is bounded above by (1/|Y|)^{(d+1)/2}. Summing over all pairs of states gives the required bound.
Thus, choosing

d ≥ (2 / log |L|) log(n^2/δ)

suffices to ensure that the labeling is “good” with probability at least (1 − δ). If we use more labels, the signature trees need not be so deep and the algorithm does not need to make as many queries to determine them. In particular, if |L| = |X|^s, then the bound of O(|X|^{d+1} n) on the number of label queries used by the algorithm becomes

O( |X| n^{1+4/s} / δ^{2/s} ),

completing the proof. □
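Making the last substitution explicit: taking d equal to the stated bound (up to rounding) and using log |L| = s log |X|,

\[
|X|^{d} \;=\; \left(\frac{n^{2}}{\delta}\right)^{\!\frac{2\log|X|}{\log|L|}}
\;=\; \left(\frac{n^{2}}{\delta}\right)^{\!2/s},
\qquad\text{hence}\qquad
|X|^{d+1}\, n \;=\; O\!\left(|X|\,\frac{n^{1+4/s}}{\delta^{2/s}}\right).
\]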
Corollary 1. Any finite automaton with n states can be learned using O(|X|n^{1+ε}) label queries with probability at least 1/2, when it is randomly labeled with |L| = f(|X|, ε) labels.

Proof. With δ = 1/2, a choice of |L| ≥ |X|^{4/ε} suffices. □

We remark that this implies that there exists a careful labeling with O(|X|^4) labels that achieves learnability with O(|X|n^2) label queries, substantially improving on the size of the label set used in Theorem 1. An open question is whether a random labeling with O(1) labels enables efficient learning of an arbitrary n-state automaton with O(n log n) queries with high probability.

4 Unfolding Finite Automata


We now consider giving more power to the teacher. Because many automata
have the same output behavior, we ask what happens if a teacher can change
the underlying machine (without changing its output behavior) before placing
labels on it. In Sections 3.1 and 3.2, the teacher had to label the machine given to
him. Now we will examine what happens when a teacher can unfold an automa-
ton before putting labels on it. That is, given M , the teacher chooses another
automaton M  with the same output behavior as M and labels the states of M 
for the learner.

4.1 Unfolding and Then Labeling


We first remark that unfolding an automaton M from n to O(n log n) states
allows a careful labeling with just 2 labels to encode a description of the machine.

Proposition 3. Any finite automaton with n states can be unfolded to have N = O(|X|n log(n) + n log(|Y|)) states and carefully labeled with 2 labels, in such a way that it can be learned using N label queries.

Proof. The total number of automata with output having n states, input alphabet X and output alphabet Y is at most

n^{|X|n+1} |Y|^n.

Thus, N = O(|X|n log(n) + n log(|Y|)) bits suffice to specify any one of these machines.
The teacher chooses a ∈ X and unfolds the target automaton M as follows. The strings a^i for i = 0, 1, . . . , N − 1 each send the learner to a newly created state, which acts (with respect to transitions on other input symbols and output behavior) just as its counterpart in the original machine. The remaining states are unchanged. The unfolded automaton is output-equivalent to M. The teacher then specifies M by labeling these N new states with the bits of the specification of M. The learner simply asks a sequence of N queries on strings of the form a^i to receive the encoding of the hidden machine. □
This method does not work if we restrict the unfolding to O(|X|n) states, but
we show that this much unfolding is sufficient to reduce the in-degree of the
automaton to O(|X|).

Lemma 1. Let M be an arbitrary automaton of n states. There is an automaton M′ with the same output behavior as M, with at most (|X| + 1)n states, whose in-degree is bounded by 2|X| + 1.

Proof. Given M , we repeat the following process until it terminates. While there
is some state q with in-degree greater than 2|X| + 1, split q into two copies,
dividing the incoming edges as evenly as possible between the two copies, and
duplicating all |X| outgoing edges for the second copy of q.
It is clear that each step of this process preserves the output behavior of M .
To see that it terminates, for each node q let f (q) be the maximum of 0 and
din (q) − (|X| + 1), where din (q) is the in-degree of q. Consider the potential
function Φ that is the sum of f (q) for all nodes q in the transition graph. Φ
is initially at most |X|n − (|X| + 1), and each step reduces it by at least 1 =
(|X| + 1) − |X|. Thus, the process terminates after no more than |X|n steps
producing an output-equivalent automaton M  with no more than (|X| + 1)n
states and in-degree at most 2|X| + 1. □
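A Python sketch of the splitting process; the dictionary encoding of τ and γ is our own, for illustration:

```python
def reduce_indegree(states, tau, gamma, X):
    """While some state has in-degree greater than 2|X| + 1, split it in
    two, dividing its incoming edges as evenly as possible and copying
    its outgoing edges and output symbol."""
    states, tau, gamma = list(states), dict(tau), dict(gamma)
    bound = 2 * len(X) + 1
    while True:
        incoming = {}
        for (p, a), q in tau.items():
            incoming.setdefault(q, []).append((p, a))
        big = next((s for s in states if len(incoming.get(s, [])) > bound), None)
        if big is None:
            return states, tau, gamma
        copy = ("copy", big, len(states))       # a fresh state name
        states.append(copy)
        gamma[copy] = gamma[big]
        for a in X:                             # duplicate all outgoing edges
            tau[(copy, a)] = tau[(big, a)]
        edges = incoming[big]
        for p, a in edges[: len(edges) // 2]:   # redirect half of the incoming edges
            tau[(p, a)] = copy
```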

In particular, an automaton with a sink state of high in-degree will be unfolded by this process to have multiple copies of the sink state. Using this idea for degree reduction, the teacher may use linear unfolding and helpful labeling to enable a strongly connected automaton to be learned with O(n log n) label queries.

Corollary 2. For any strongly connected automaton M of n states, there is an unfolding M′ of M with at most (|X| + 1)n states and a careful labeling of M′ using O(|X|^2) labels that allows the behavior of M to be learned using O(|X|^2 n log n) label queries.

Proof. Given a strongly connected automaton M with n states, the teacher uses the method of Lemma 1 to produce an output-equivalent machine M′ with at most (|X| + 1)n states and in-degree bounded by 2|X| + 1. This unfolding may not preserve the property of being strongly connected, but there is at least one state q that has at most (|X| + 1) copies in the unfolded machine M′. Because M is strongly connected, every state of M′ must be able to reach at least one of the copies of q, so M′ is (|X| + 1)-concentrating. Applying the method of Theorem 2, the teacher can use 3(2|X| + 1)|X| + (|X| + 1) labels to label M′ so that it can be learned with O(|X|^2 n log n) label queries. □

We now consider uniform random labelings of the states when the teacher is
allowed to choose the unfolding of the machine.

Theorem 6. Any automaton with n states can be unfolded to have O(n log(n/δ)) states and randomly labeled with 2 labels, such that with probability at least (1 − δ) it can be learned using O(|X|n(log(n/δ))^2) queries.
Proof. Given n and δ, let t = ⌈log(n^2/δ)⌉. The teacher chooses a ∈ X and unfolds the target machine M to construct the machine M′ as follows. M′ has nt states (q, i), where q is a state of M and 0 ≤ i ≤ (t − 1). The start state is (q0, 0), where q0 is the start state of M. The output symbol for (q, i) is γ(τ(q, a^i)), where γ is the output function and τ the transition function of M. For 0 ≤ i < (t − 1), the a-transition from (q, i) is to (q, i + 1). The a-transition from (q, t − 1) is to (q′, 0), where q′ = τ(q, a^t). For all other input symbols b with b ≠ a, the b-transition from (q, i) is to (q′, 0), where q′ = τ(q, a^i b).
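The unfolded machine M′ can be written down directly from this description; a Python sketch, with dictionary-encoded τ and γ (our convention):

```python
def unfold(tau, gamma, a, t):
    """Return the output and transition functions of M': states are pairs
    (q, i) with 0 <= i < t, and the start state is (q0, 0)."""
    def gamma2(qi):
        q, i = qi
        for _ in range(i):          # the output of (q, i) is that of
            q = tau[(q, a)]         # the state tau(q, a^i) of M
        return gamma[q]

    def tau2(qi, b):
        q, i = qi
        if b == a and i < t - 1:
            return (q, i + 1)       # walk one step along the a-chain
        for _ in range(i):          # otherwise jump to (tau(q, a^i b), 0);
            q = tau[(q, a)]         # for b == a, i == t - 1 this is tau(q, a^t)
        return (tau[(q, b)], 0)

    return gamma2, tau2
```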
To see that M′ is an unfolding of M, that is, that M′ is output-equivalent to M, we show that each state (q, i) of M′ is output-equivalent to the state τ(q, a^i) of M. By construction, these two states have the same output. If i < (t − 1), then the a-transition from (q, i) is to (q, i + 1), which has the same output symbol as τ(q, a^{i+1}). The a-transition from (q, t − 1) is to (q′, 0), where q′ = τ(q, a^t), which has the same output symbol as τ(τ(q, a^{t−1}), a). If b ≠ a is an input symbol, then the b-transition from (q, i) is to (q′, 0), where q′ = τ(q, a^i b), which has the same output symbol as τ(τ(q, a^i), b).
Suppose M′ is randomly labeled with two labels. For each state q of M, define its label identifier in M′ to be the sequence of labels of (q, i) for i = 0, 1, . . . , (t − 1). For two distinct states q1 and q2 of M, the probability that their label identifiers in M′ are equal is (1/2)^t, which is at most δ/n^2. Thus, the probability that there exist two distinct states q1 and q2 with the same label identifier in M′ is at most δ.
Given n and δ, the learning algorithm takes advantage of the known unfolding strategy to construct states (j, i) for 0 ≤ j ≤ n − 1 and 0 ≤ i ≤ (t − 1), with a-transitions from (j, i) to (j, i + 1) for i < (t − 1). It starts with the empty input string and uses the following exploration strategy. Given an input string w that is known to arrive at some (q, 0) in M′, the learning algorithm makes label queries on wa^i for i = 0, 1, . . . , (t − 1) to determine the label identifier of q in M′. If this label identifier has not been seen before, the learner uses the next unused (j, 0) to represent q and records the outputs and labels for the states (j, i) for i = 0, 1, . . . , (t − 1). It must also explore all unknown transitions from the states (j, i). If distinct states of M receive distinct label identifiers in M′, the learner learns a finite automaton output-equivalent to M using O(|X|nt^2) label queries. □

5 Automata with Random Structure


We may also ask whether randomly labeled finite automata are hard to learn “on
average”. We consider automata with randomly chosen transition functions and
random labels. The model of random structure that we consider is as follows. Let
the states be qi for i = 0, 1, . . . , (n − 1), where q0 is the start state. For each state
qi and input symbol a ∈ X, choose j uniformly at random from 0, 1, . . . , (n − 1)
and let τ (qi , a) = qj .
Theorem 7. A finite automaton with n states, a random transition function and a random labeling can be learned using O(n log(n)) label queries, with high probability. The probability is over the choice of transition function and labeling.

Proof. This was first proved by Korshunov in [6]; here we give a simpler proof. Korshunov showed that the signature trees only need to be of depth asymptotically equal to log_{|X|}(log_{|L|}(n)) for the nodes to have unique signatures with high probability. We use a method similar to signature trees, but simpler to analyze. Instead of comparing signature trees for two states to tell whether or not they are distinct, we compare the labels along at most four sets of transitions, which we call signature paths, like a signature tree consisting only of four paths.
Lemmas 2 and 3 show that given X and n there are at most four signature paths, each of length 3 log(n), such that for a random finite automaton of n states with input alphabet X and for any pair s1 and s2 of different states, the probability is O(log^6(n)/n^3) that s1 and s2 are distinguishable but not distinguished by any of the strings in the four signature paths. By the union bound, the probability that there exist two distinguishable states that are not distinguished by at least one of the strings in the four signature paths is at most

O( (n choose 2) · log^6(n)/n^3 ) = o(1).

Hence, by running at most four signature paths, each of length 3 log(n), per newly reached state, we get unique labels on the states. Then for each of the n states we can find their |X| transitions, and learn the machine, as in Proposition 2. □

We now turn to the two lemmas used in the proof of Theorem 7. We first consider the case |X| > 2. If a, b, c ∈ X and ℓ is a nonnegative integer, let D_ℓ(a, b, c) denote the set of all strings a^i, b^i, and c^i such that 0 ≤ i ≤ ℓ.

Lemma 2. Let s1 and s2 be two different states in a random automaton with |X| > 2. Let a, b, c ∈ X and ℓ = 3 log(n). The probability that s1 and s2 are distinguishable, but not by any string in D_ℓ(a, b, c), is O(log^6(n)/n^3).

Proof. We analyze the three (attempted) paths from the two states s1 and s2, which we will call π^1_{s1}, π^2_{s1}, π^3_{s1} and π^1_{s2}, π^2_{s2}, π^3_{s2}, respectively. Each path will have length 3 log(n). We define each of the π as the set of nodes reached by its respective set of transitions.
We first look at the probability that the following event does not happen: that both |π^1_{s1}| > 3 log(n) and |π^1_{s2}| > 3 log(n), and that π^1_{s1} ∩ π^1_{s2} = ∅; that is, the probability that both of these strings succeed in reaching 3 log(n) different states, and that they share no states in common. We call the event that two sets of states π1 and π2 have no states in common, and both have size at least l, S(π1, π2, l) (success), and the failure event F(π1, π2, l) = ¬S(π1, π2, l). So,

P(F(π^1_{s1}, π^1_{s2}, 3 log(n))) ≤ Σ_{i=1}^{3 log(n)} (i + |π^1_{s1}|)/n + Σ_{i=1}^{3 log(n)} (i + |π^1_{s2}|)/n
  ≤ 2 Σ_{i=1}^{3 log(n)} (i + 3 log(n))/n = O(log^2(n)/n).

Now we look at the probability of F(π^2_{s1}, π^2_{s2}, 3 log(n)) given that we failed on the first paths, that is, given F(π^1_{s1}, π^1_{s2}, 3 log(n)). With l = 3 log(n),

P(F(π^2_{s1}, π^2_{s2}, l) | F(π^1_{s1}, π^1_{s2}, l)) ≤ Σ_{i=1}^{3 log(n)} (i + |π^2_{s1}| + |π^1_{s1}| + |π^1_{s2}|)/n
  + Σ_{i=1}^{3 log(n)} (i + |π^2_{s2}| + |π^1_{s1}| + |π^1_{s2}|)/n
  ≤ 2 Σ_{i=1}^{3 log(n)} (i + 9 log(n))/n = O(log^2(n)/n).

Next, we compute the probability of F(π^3_{s1}, π^3_{s2}, 3 log(n)) given failures on the previous two pairs of paths. With l = 3 log(n),

P(F(π^3_{s1}, π^3_{s2}, l) | F(π^1_{s1}, π^1_{s2}, l), F(π^2_{s1}, π^2_{s2}, l)) ≤ 2 Σ_{i=1}^{3 log(n)} (i + 25 log(n))/n = O(log^2(n)/n).

Last, we compute the probability that none of these pairs of paths made it to l = 3 log(n), that is, P(failure) = P(F(π^1_{s1}, π^1_{s2}, l), F(π^2_{s1}, π^2_{s2}, l), F(π^3_{s1}, π^3_{s2}, l)):

P(failure) = P(F(π^1_{s1}, π^1_{s2}, l)) · P(F(π^2_{s1}, π^2_{s2}, l) | F(π^1_{s1}, π^1_{s2}, l))
  · P(F(π^3_{s1}, π^3_{s2}, l) | F(π^1_{s1}, π^1_{s2}, l), F(π^2_{s1}, π^2_{s2}, l))
  = O(log^2(n)/n) · O(log^2(n)/n) · O(log^2(n)/n) = O(log^6(n)/n^3).

Thus, given two distinct states with corresponding nonoverlapping signature paths of length 3 log(n), the probability that all of the randomly chosen labels along the paths will be the same is (1/2)^{3 lg(n)} = 1/n^3 = O(log^6(n)/n^3), which bounds the probability that no string in D_ℓ(a, b, c) distinguishes s1 from s2. □
When |X| = 2, we do not have enough alphabet symbols to construct three completely independent paths as in the proof of Lemma 2, but four paths suffice. If a, b ∈ X and ℓ is a nonnegative integer, let D_ℓ(a, b) denote the set of all strings a^i, b^i, ab^i and ba^i such that 0 ≤ i ≤ ℓ.

Lemma 3. Let s1 and s2 be two different states in a random automaton with |X| = 2. Let a, b ∈ X and ℓ = 3 log(n). The probability that s1 and s2 are distinguishable, but not by any string in D_ℓ(a, b), is O(log^6(n)/n^3).

The proof of Lemma 3 is a case analysis using reasoning similar to that of Lemma 2; we include an outline. If s1 and s2 are assigned different labels, then they are distinguished by the empty string, so assume that they are assigned the same label. If we consider τ(s1, a) and τ(s2, a), there are four cases, as follows.
(1) We have τ(s1, a) ≠ τ(s2, a) and neither one is s1 or s2. In this case, an argument analogous to that in Lemma 2 shows that the probability that the paths a^i, ab^i and b^i fail to produce a distinguishing string for s1 and s2 is bounded by O(log^6(n)/n^3). (2) Exactly one of τ(s1, a) and τ(s2, a) is in the set {s1, s2}. This happens with probability O(1/n), and in this case we can show that the probability that the paths a^i and b^i do not produce a distinguishing string for s1 and s2 is bounded by O(log^4(n)/n^2), for a total failure probability of O(log^4(n)/n^3) for this case. (3) Both of τ(s1, a) and τ(s2, a) are in the set {s1, s2}. This happens with probability O(1/n^2), and in this case we can show that the probability that the path b^i does not produce a distinguishing string for s1 and s2 is bounded by O(log^2(n)/n), for a total failure probability of O(log^2(n)/n^3) for this case. (4) Neither of τ(s1, a) and τ(s2, a) is in the set {s1, s2}, but τ(s1, a) = τ(s2, a). This happens with probability O(1/n), and we proceed to analyze four parallel subcases for τ(s1, b) and τ(s2, b).
(4a) We have τ(s1, b) ≠ τ(s2, b) and neither of them is in the set {s1, s2}. We can show that the probability that the paths b^i and ba^i do not produce a distinguishing string for s1 and s2 is bounded by O(log^4(n)/n^2), for a failure probability of O(log^4(n)/n^3) in this subcase, because the probability of case (4) is O(1/n). (4b) Exactly one of τ(s1, b) and τ(s2, b) is in the set {s1, s2}. In this subcase, we can show that the probability that the path b^i fails to produce a distinguishing string for s1 and s2 is bounded by O(log^2(n)/n), for a total failure probability in this subcase of O(log^2(n)/n^3), because the probability of case (4) is O(1/n) and the probability that one of τ(s1, b) and τ(s2, b) is in {s1, s2} is O(1/n). (4c) Both of τ(s1, b) and τ(s2, b) are in {s1, s2}. The probability of this happening is O(1/n^2), for a total probability of this subcase of O(1/n^3), because the probability of case (4) is O(1/n). (4d) We have τ(s1, b) = τ(s2, b). Then, because we are in case (4), τ(s1, a) = τ(s2, a), and the labels assigned to s1 and s2 are equal, so the states s1 and s2 are equivalent and therefore indistinguishable.

Acknowledgments

We would like to thank the anonymous referees for helpful comments.


References
1. Angluin, D.: A note on the number of queries needed to identify regular languages.
Information and Control 51(1), 76–87 (1981)
2. Angluin, D.: Queries and concept learning. Machine Learning 2(4), 319–342 (1987)
3. Becerra-Bonache, L., Dediu, A.H., Tı̂rnăucă, C.: Learning DFA from correction and
equivalence queries. In: Sakakibara, Y., Kobayashi, S., Sato, K., Nishino, T., Tomita,
E. (eds.) ICGI 2006. LNCS (LNAI), vol. 4201, pp. 281–292. Springer, Heidelberg
(2006)
4. Domaratzki, M., Kisman, D., Shallit, J.: On the number of distinct languages ac-
cepted by finite automata with n states. Journal of Automata, Languages and Com-
binatorics 7(4) (2002)
5. Freund, Y., Kearns, M.J., Ron, D., Rubinfeld, R., Schapire, R.E., Sellie, L.: Efficient
learning of typical finite automata from random walks. Information and Computa-
tion 138(1), 23–48 (1997)
6. Korshunov, A.: The degree of distinguishability of automata. Diskret. Analiz. 10(36),
39–59 (1967)
7. Lee, D., Yannakakis, M.: Testing finite-state machines: State identification and ver-
ification. IEEE Trans. Computers 43(3), 306–320 (1994)
8. Trakhtenbrot, B.A., Barzdin’, Y.M.: Finite Automata: Behavior and Synthesis.
North-Holland, Amsterdam (1973)
Characterizing Statistical Query Learning:
Simplified Notions and Proofs

Balázs Szörényi1,2,⋆

1 Fakultät für Mathematik, Ruhr-Universität Bochum, D-44780 Bochum, Germany
2 Hungarian Academy of Sciences and University of Szeged, Research Group on Artificial Intelligence, H-6720 Szeged
szorenyi@inf.u-szeged.hu

Abstract. The Statistical Query model was introduced in [6] to handle noise in the well-known PAC model. In this model the learner gains information about the target concept by asking for various statistics about it. Characterizing the number of queries required to learn a given concept class under a fixed distribution was already considered in [3] for weak learning; then in [8] strong learnability was also characterized. However, the proofs for these results in [3,10,8] (and for strong learnability even the characterization itself) are rather complex; our main goal is to present a simple approach that works for both problems. Additionally, we strengthen the result on strong learnability by showing that a class is learnable with polynomially many queries iff all consistent algorithms use polynomially many queries, and by showing that proper and improper learning are basically equivalent. As an example, we apply our results to conjunctions under the uniform distribution.

1 Introduction

The Statistical Query model (called SQ model for short) was introduced by
Kearns [6] as an approach to handle noise in the well-known PAC model. The
general idea is that—instead of using random examples as in the PAC model—
the learner gains information about the unknown function by asking various
statistics (called queries) over the distribution of labeled examples. As shown by Kearns [6], any learning algorithm in the SQ model can be transformed into a PAC algorithm without much loss in efficiency. It is even more interesting that the resulting algorithm is robust to noise. He also showed that many efficient PAC algorithms can be converted into efficient SQ algorithms.
Despite the power of the model that is apparent from the above results, it is
still weaker than the PAC model. Indeed, already in [6] it was shown that the
parities, which is a PAC-learnable class, cannot be efficiently learned in the SQ
model under the uniform distribution. The proof used an information theoretic

This work was supported in part by the Deutsche Forschungsgemeinschaft Grant SI
498/8-1, and the NKTH grant of the National Technology Programme 2008 (project
codename AALAMSRK NTP OM-00192/2008) of the Hungarian government.

R. Gavaldà et al. (Eds.): ALT 2009, LNAI 5809, pp. 186–200, 2009.

c Springer-Verlag Berlin Heidelberg 2009
Characterizing Statistical Query Learning: Simplified Notions and Proofs 187

argument, which was generalized later by Blum et al. in [3] to characterize weak
learnability of a concept class (where the goal is to do slightly better than ran-
dom guessing) in the SQ model for the distribution dependent case (i.e., when
the underlying distribution is fixed in advance and is known by the learner).
The characterization is based on the so called SQ dimension of the class which
is, roughly, the maximal size of an “almost orthogonal” system in the class.
However, the proof in [3] is rather long and complex. Subsequently Yang gave
an alternative, elegant proof for basically the same result [10]. In this paper we
present yet another, but much shorter proof, thereby significantly simplifying on
both existing proofs.
Strong learnability (i.e., when the goal is to approximate the target concept with arbitrary accuracy) of a concept class in the distribution dependent case was first characterized by Köbler and Lindner [7] in terms of a general framework for protocols, called the general dimension. Independently, Simon [8] gave another characterization of strong learnability that was based on the SQ dimension (more precisely, on the SQ dimension of the class after some translation), and is of a more algebraic flavor. However, in [8] both the characterization and the proof are rather complex. As we shall show in this paper, our simple approach, which is successful in characterizing weak learnability, can also be applied to strong learnability, giving an alternative, simple characterization of this problem as well; this characterization might also be easier to apply and to calculate for concrete concept classes. Recently Feldman has also obtained a simple characterization of strong SQ learnability of a similar flavor [5]; however, the two papers have different focuses: Feldman is interested in applications to agnostic learning and evolvability, while our main interest is a really simple proof and a unified view of weak and strong learnability. Additionally, our approach reveals that in the distribution dependent case query-efficient learnability is possible if and only if all consistent learning algorithms learn the given concept class query-efficiently.^1 As far as we know, this was not known before. We also show that in the distribution dependent case proper learning (i.e., when the queries of the learner are restricted to use functions from the given concept class) is as strong as improper learning, but we would like to point out that this can easily be deduced already from the characterization result of Simon (see Observation 5).

Finally, we show that in the distribution independent case (i.e., when the learner knows nothing about the underlying distribution) proper and improper learning can differ significantly, and we contrast this with the above mentioned result on their equivalence in the distribution dependent case.

^1 Query-efficiency means that the number of queries used by the learner is bounded by some polynomial of the various parameters. When query-efficiency is in focus, usually no restrictions are placed on the running time.

Equivalent models. Ben-David et al. have introduced an equivalent model, called learning by distances [2], and have also given upper and lower bounds on the minimal number of queries required for learning. However, their upper bound is exponential in their lower bound (see also our discussion of the topic in Sect. 7), and the paper does not reveal the relation of the model to noise-tolerant PAC learning (which is what gave the SQ model its importance).

In [11] Yang introduced the honest SQ model, which uses stronger queries and a less adversarial setting than the SQ model. In [9] it is shown how to apply the results and methods of this paper to prove a somewhat surprising result: the equivalence of the honest and the "pure" SQ model.

Organization of the paper. Section 2 contains the formal introduction of the SQ model and also some basic definitions. In Sect. 3 we present our alternative proof for characterizing weak learnability with the SQ dimension, in Sect. 4 we discuss the relation of strong and weak learnability, and in Sect. 5 we characterize strong learnability. In Sect. 6 we analyze the relation of our strong SQ dimension to those of Simon and Feldman. In Sect. 7, as an example, we compute our dimension notion for conjunctions under the uniform distribution. Finally, in Sect. 8 we contrast the result on the equivalence of proper and improper learning in the distribution dependent case with the fact that the two can differ significantly in the distribution independent case.

2 Preliminaries

A concept is a mapping from some domain to $\{-1,1\}$. A concept class is a set of concepts with the same domain. A Boolean concept over $n$ variables is a concept of the form $\{-1,1\}^n \to \{-1,1\}$. A family of concept classes is an infinite set $\{F_n\}_{n=1}^{\infty}$ such that each $F_n$ is a concept class. The class of all concepts over some domain $X$ is denoted $C(X)$.

The correlation of two functions $f, g : X \to \mathbb{R}$ under some distribution $D$ over $X$ is defined as $\langle f, g\rangle_D = \mathbb{E}[f(\xi)g(\xi)]$, where $\xi$ is a random variable with distribution $D$. The norm of $f$ under $D$ is $\|f\|_D := \sqrt{\langle f, f\rangle_D}$. $f$ is said to be a $\gamma$-approximation of $g$ if $\langle f, g\rangle_D \ge \gamma$.
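To make these definitions concrete, the following minimal sketch (ours, not from the paper) computes correlations and norms by direct summation over a small, explicitly given distribution:

```python
from itertools import product

def correlation(f, g, dist):
    """<f,g>_D = E[f(x) g(x)] under dist, a mapping {point: probability}."""
    return sum(p * f(x) * g(x) for x, p in dist.items())

def norm(f, dist):
    """||f||_D = sqrt(<f,f>_D)."""
    return correlation(f, f, dist) ** 0.5

# Example: two concepts on {-1,1}^2 under the uniform distribution.
points = list(product([-1, 1], repeat=2))
uniform = {x: 1.0 / len(points) for x in points}
f = lambda x: x[0]            # first coordinate
g = lambda x: x[0] * x[1]     # parity of the two coordinates
print(correlation(f, g, uniform))  # 0.0: distinct parities are uncorrelated
print(norm(f, uniform))            # 1.0: every concept has unit norm
```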
In the Statistical Query model a learner (or learning algorithm) $L$ can make queries of the form $(h, \tau)$, where $\tau$ is a positive constant called the tolerance, and $h$ is chosen from some concept class $H$ called the query class. Each such query is answered with some $c$ satisfying $|c - \langle f^*, h\rangle_D| \le \tau$, where $f^*$ is some fixed concept, called the target concept, that is unknown to the learner, and where $D$ is some distribution over the input domain of $f^*$. (Here the learner is supposed to be familiar with $D$.) The learner succeeds when he finds some function $f \in H$ having correlation at least $\gamma$ with $f^*$ for some constant $\gamma > 0$ fixed ahead of the learning process. The parameter $\gamma$ is called the accuracy. Let $q^{D,L}_{F,H}(\tau, \gamma)$ denote the smallest integer $q$ such that $L$ always succeeds in the above setting using at most $q$ queries when the target concept belongs to $F$. Finally, $SLC^D_{F,H}(\tau, \gamma)$ (the statistical learning complexity) is defined to be the minimum value of $q^{D,L}_{F,H}(\tau, \gamma)$ over all possible learning algorithms $L$. We would like to emphasize that in this paper we are interested only in the number of queries used during the learning process (i.e., the information complexity of learning), and do not consider the running time.
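As an illustration of the query–answer protocol, here is a minimal sketch (our own naming, not from the paper) of an SQ oracle that may return any value within tolerance $\tau$ of the true correlation:

```python
def sq_oracle(target, dist, h, tau):
    """Answer the statistical query (h, tau) about the hidden target:
    any c with |c - <target, h>_D| <= tau is a legal answer.  Here the
    oracle shifts the true correlation by tau/2, standing in for an
    adversarial (but honest) choice."""
    true_corr = sum(p * target(x) * h(x) for x, p in dist.items())
    return true_corr + tau / 2

# A learner never sees `target` directly: it only submits pairs (h, tau)
# and succeeds once it holds some f with <f, target>_D >= gamma.
```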

Note that originally in [6] the SQ model allowed much more general queries, but in [4] Bshouty and Feldman have shown that the two models are equivalent.^2

We also consider the following variants of the above described learning model. The learning is called proper when $F = H$, and is called improper when $F \subsetneq H$. Also, in general, a query $(h, \tau)$ is proper if $h \in F$, otherwise it is improper. The learner is a consistent learner if $|\langle h_i, h_j\rangle_D - c_i| \le \tau_i$ for $i < j$, where $(h_i, \tau_i)$ is the $i$-th query of the learner and $c_i$ is the answer to it. Finally, note that in the above definition the learner is supposed to be familiar with the underlying distribution, but the model can also be defined for the case when this is not true. We are mainly interested in the former case (except for Sect. 8), but when we want to explicitly refer to one case or the other, we shall call the former the distribution dependent and the latter the distribution independent case.

For simplicity, when it causes no confusion, we omit $D$ from notations like $SLC^D_{F,H}(\tau, \gamma)$ and $\langle f, g\rangle_D$, and simply use $SLC_{F,H}(\tau, \gamma)$ and $\langle f, g\rangle$ instead.

Definition 1. We say that a family $\{F_n\}_{n=1}^{\infty}$ of concept classes is weakly learnable in the SQ model with a family $\{H_n\}_{n=1}^{\infty}$ of query classes if there exist some $\gamma(n) > 0$ and $\tau(n) > 0$ such that $1/\gamma(n)$, $1/\tau(n)$ and $SLC_{F_n,H_n}(\tau(n), \gamma(n))$ are polynomially bounded in $n$. $\{F_n\}_{n=1}^{\infty}$ is strongly learnable in the SQ model with queries from $\{H_n\}_{n=1}^{\infty}$ if there exists some $\tau(n, \epsilon) > 0$ such that $1/\tau(n, \epsilon)$ and $SLC_{F_n,H_n}(\tau(n, \epsilon), 1 - \epsilon)$ are polynomially bounded in $n$ and $1/\epsilon$.

The following Observation, which we shall apply several times later, is basically the reason for the equivalence of proper and improper learning in the distribution dependent model.

Observation 1. Let $f$, $g$ and $h$ be arbitrary concepts. If $\langle f, h\rangle \ge 1 - \epsilon$ and $\langle g, h\rangle \ge 1 - \epsilon$, then $\langle f, g\rangle = (1/2)\langle f + g, f + g\rangle - 1 \ge \langle f + g, h\rangle - 1 \ge 1 - 2\epsilon$.

(The middle inequality holds because $f + g$ takes values in $\{-2, 0, 2\}$, so $\langle f + g, h\rangle \le \mathbb{E}[|f + g|] = (1/2)\langle f + g, f + g\rangle$.)
Although this paper mainly considers concepts and concept classes, we would like to point out that all the results remain valid for classes of functions with norm bounded by 1 (which one might be tempted to use, for example, in query classes), albeit in some cases, when the proof applies Observation 1, the constants get slightly worse.^3 The reason for this is the following lemma, which is the generalization of Observation 1 to such functions.

^2 Actually they have shown how to simulate an arbitrary statistical query using two statistical queries that are independent of the target function and two correlation queries. However, when running time is not considered and the underlying distribution is known, one can omit the two former queries and just compute them directly.

^3 The choice of 1 as an upper bound for the norm of the query function is arbitrary; one can use any other constant instead. (But note that smaller constants would exclude all concepts.) However, unbounded queries should not be allowed, because they make the learning problem trivial. Indeed, for example, when the target concept is Boolean over $n$ variables, and one uses a query with tolerance $1/2$ and with the function that evaluates $x \in \{-1,1\}^n$ to $\sum_{i=1}^{n} (1/\epsilon) \cdot 2 \cdot 2^{i(x_i+1)/2}$, then the value of the target concept on inputs with probability at least $\epsilon/2^n$ can be reconstructed from the answer to this query, while the sum of the probabilities of the remaining inputs is less than $\epsilon$.

Proposition 1. When $f, g, h : \{-1,1\}^n \to \mathbb{R}$ have norm at most 1, and $\langle f, h\rangle \ge 1 - \epsilon$ and $\langle g, h\rangle \ge 1 - \epsilon$, then $\langle f, g\rangle \ge 1 - 6\epsilon$.

Proof. First of all, by Cauchy–Schwarz, $\|f\| \ge \langle f, h\rangle \ge 1 - \epsilon$, and similarly $\|g\| \ge 1 - \epsilon$. Using this,

$$2(1 - 2\epsilon) - 2\langle f, g\rangle \le \|f\|^2 + \|g\|^2 - 2\langle f, g\rangle = \|f - g\|^2 \le \left(\|f - h\| + \|g - h\|\right)^2 \le \left(\sqrt{2 - 2\langle f, h\rangle} + \sqrt{2 - 2\langle g, h\rangle}\right)^2 \le 8\epsilon,$$

implying $1 - 6\epsilon \le \langle f, g\rangle$. □

Finally, for an integer $d$, let $[d]$ denote the set $\{1, \ldots, d\}$.

3 Characterizing Weak Learnability

According to the definition, weak learnability is possible if and only if there exists some polynomial $p(n)$ such that $SLC_{F_n,H_n}(1/p(n), 3/p(n)) \le p(n)$ (simply define $p(n)$ to be a polynomial that upper bounds $1/\tau(n)$, $3/\gamma(n)$ and $SLC_{F_n,H_n}(\tau(n), \gamma(n))$). This way the task of weak learning is basically to find functions $h_{n,1}, \ldots, h_{n,p(n)} \in H_n$ such that every $f \in F_n$ has correlation at least $3/p(n)$ with at least one of $h_{n,1}, \ldots, h_{n,p(n)}$. Thus $p(n)$ (and this way $SLC$ itself) can be considered as a kind of covering number. Bshouty and Feldman in [4] make this property explicit in their characterization of weak learnability.

On the other hand, the notion of SQ dimension introduced by Blum et al. [3] is rather a packing number in nature:

Definition 2. The SQ dimension (or weak SQ dimension) of a class of real valued functions $F$ over some domain $X$ and under distribution $D$ over $X$, denoted $SQDim^D_F$, is the largest integer $d$ such that $F$ contains some distinct functions $f_1, \ldots, f_d$ with pairwise correlations between $-1/d$ and $1/d$.
(Note that SQDim is defined not only for concept classes but also for more general classes; Definition 4 will really make use of this generality.) For simplicity, as mentioned, we use $SQDim_F$ instead of $SQDim^D_F$ when this leads to no confusion.
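For intuition, the definition can be checked by brute force on tiny classes; the following sketch (ours, exponential-time, for illustration only) returns the SQ dimension of an explicitly given finite class:

```python
from itertools import combinations

def sq_dim(functions, dist):
    """Weak SQ dimension (Definition 2) by brute force: the largest d such
    that some d of the given functions have pairwise correlations in
    [-1/d, 1/d] under dist ({point: probability})."""
    def corr(f, g):
        return sum(p * f(x) * g(x) for x, p in dist.items())
    # Witness sets are downward closed in size (a subset of a valid
    # (d+1)-witness is a valid d-witness), so try large d first.
    for d in range(len(functions), 1, -1):
        for subset in combinations(functions, d):
            if all(abs(corr(f, g)) <= 1.0 / d
                   for f, g in combinations(subset, 2)):
                return d
    return 1
```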
The nice feature of the characterization result in [3] is that it binds these two different types of notions. One direction, namely that $SQDim_F$ queries are enough for weakly learning a concept class $F$ (properly!), is easy: if $\{f_1, \ldots, f_d\}$ is a maximal subset of $F$ fulfilling $|\langle f_i, f_j\rangle| \le 1/d$ for $1 \le i < j \le d$, then (due to the maximality) at least one of them has correlation at least $1/d$ with the target concept, thus the learner simply needs to query $f_1, \ldots, f_d$ with tolerance $1/(3d)$ in order to find a $1/(3d)$-approximation of it. However, the proof in [3] for the other direction was rather long and complex. Subsequently Yang in [10] gave another, elegant proof for this direction, based on the eigenvalues of the correlation matrix of the concept class.^4

Here we show that basically the same result can be derived using a very simple argument, thus significantly simplifying both of the above mentioned proofs. The proof in some sense follows the same line of thought they use, but avoids the machinery applied in them.

^4 The correlation matrix of the concept class $F = \{f_1, \ldots, f_s\}$ is the $s \times s$ matrix $C$ such that $C_{i,j} = \langle f_i, f_j\rangle$.
Theorem 2. Let $F$ be a concept class and let $d := SQDim_F$. Then any learning algorithm that uses tolerance parameter lower bounded by $\tau > 0$ requires in the worst case at least $(d\tau^2 - 1)/2$ queries for learning $F$ with accuracy at least $\tau$. In particular, when $\tau = 1/\sqrt[3]{d}$, this means $(\sqrt[3]{d} - 1)/2$ queries.

Proof. Assume that $f_1, \ldots, f_d \in F$ fulfill $|\langle f_i, f_j\rangle| \le 1/d$ for distinct $i, j \in [d]$. We show an (adversary) answering strategy that ensures that only a small number of these functions are eliminated by each query. Let $h$ be an arbitrary query function used by the learner (having thus norm at most 1) and let $A := \{i \in [d] : \langle f_i, h\rangle \ge \tau\}$. Then, by the Cauchy–Schwarz inequality,

$$\left\langle h, \sum_{i \in A} f_i \right\rangle^2 \le \left\| \sum_{i \in A} f_i \right\|^2 = \sum_{i,j \in A} \langle f_i, f_j \rangle \le \sum_{i \in A} \left( 1 + \frac{|A| - 1}{d} \right) \le |A| + \frac{|A|^2}{d},$$

meanwhile, by the choice of $A$, it holds that $\langle h, \sum_{i \in A} f_i \rangle \ge |A|\tau$, and the two together imply that $1/|A| + 1/d \ge \tau^2$, or equivalently, that $|A| \le d/(d\tau^2 - 1)$. A similar argument shows that at most $d/(d\tau^2 - 1)$ of the $f_i$ functions have correlation at most $-\tau$ with $h$. Thus at most $2d/(d\tau^2 - 1)$ of the functions will be inconsistent with the answer if the adversary returns 0 to this query. This, in turn, implies the desired lower bound $(d\tau^2 - 1)/2$ on the learning complexity. □
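The adversary in this proof is easy to simulate; the sketch below (ours) answers a query with 0 and discards exactly the candidates that this answer is inconsistent with:

```python
def adversary_round(candidates, h, tau, dist):
    """One round of the adversary from the proof of Theorem 2: answer the
    query (h, tau) with c = 0, which is inconsistent only with those f
    having |<f, h>_D| > tau.  The proof bounds the number of discarded
    candidates by 2d/(d*tau^2 - 1) per round."""
    def corr(f, g):
        return sum(p * f(x) * g(x) for x, p in dist.items())
    survivors = [f for f in candidates if abs(corr(f, h)) <= tau]
    return survivors
```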
It is also worth mentioning that this result is quite tight in the improper case, when the learner can use arbitrary functions of norm 1 in the queries. Indeed, if the concept class itself is $\{f_1, \ldots, f_d\}$, then defining $g_i := \sum_{j = i \cdot d^{2/3} + 1}^{(i+1) \cdot d^{2/3}} f_j$ for $i = 0, 1, \ldots, d^{1/3} - 1$ (assuming for simplicity that $\sqrt[3]{d}$ is an integer), at least one of the functions $h_i = g_i / \|g_i\|$, $i = 0, 1, \ldots, d^{1/3} - 1$, will have correlation at least

$$\frac{1 - d^{2/3} \cdot \frac{1}{d}}{\sqrt{d^{2/3} + d^{2/3} \cdot d^{2/3} \cdot \frac{1}{d}}} \tag{1}$$

with the target function. Note that (1) is asymptotically equal to $1/\sqrt[3]{d}$.

4 Weak and Strong Learning

Aslam and Decatur [1] apply the boosting techniques from the PAC model to SQ learning and show how to use (efficiently) a weak learning algorithm to achieve strong learnability. Their primary concern is the distribution independent case, but their result (combined with results for weak learning) also has the following consequence in the distribution dependent case:

$$\max_D SLC^D_{F,H}\left(\frac{1}{3d}\log\frac{1}{\epsilon},\ 1-\epsilon\right) = O\left(d^5\log^2\frac{1}{\epsilon}\right),$$
when $H \supseteq F$, and where $d = \max_D SQDim^D_F$. However, this does not imply any result on fixed distributions in general. Indeed, when the support of a distribution consists of only a single input, then one query is enough both in the weak and in the strong setting, for any concept class. Thus the gap between the upper bound in the above equation and the number of queries required for strong learning under some given (known) distribution can be as big as possible: exponential versus constant. What is more, we cannot expect to bound the strong SQ dimension under some distribution $D$ using the weak SQ dimension under the same distribution. Indeed, consider for example the uniform distribution and the concept class $F_n$ consisting of all functions of the form $v_1 \vee f$, where $f$ is any parity function over the variables $v_2, \ldots, v_n$. Then $|F_n| = 2^{n-1}$, and any two distinct elements $(v_1 \vee f), (v_1 \vee f') \in F_n$ have correlation $1/2$:

$$\langle v_1 \vee f,\ v_1 \vee f'\rangle = 2\,\mathbb{P}\left[(v_1 \vee f) = (v_1 \vee f')\right] - 1 = 2\left(\frac{1}{2} + \frac{1}{2}\,\mathbb{P}[f = f']\right) - 1 = \frac{1}{2}$$

(as the parity functions are uncorrelated under the uniform distribution), and so by Theorem 4 strong learning of $F_n$ requires a superpolynomial number of queries, while weak learning requires none.^5

^5 Yang [11] has also shown a similar result for another concept class, but the argument there is more complicated.
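The correlation-$1/2$ computation is easy to confirm numerically; here is a small check (our encoding of the class, with $-1$ read as "true") for $n = 4$:

```python
from itertools import combinations, product

def concept(S):
    """The concept v1 OR chi_S, with -1 read as 'true' and 1 as 'false';
    chi_S(x) = prod_{i in S} x_i is a parity over v_2, ..., v_n."""
    def f(x):
        par = 1
        for i in S:
            par *= x[i]
        return -1 if (x[0] == -1 or par == -1) else 1
    return f

n = 4
points = list(product([-1, 1], repeat=n))
parities = [S for r in range(1, n) for S in combinations(range(1, n), r)]
parities.append(())                       # the empty parity ('false')
fs = [concept(S) for S in parities]       # |F_n| = 2^(n-1) concepts

def corr(f, g):
    return sum(f(x) * g(x) for x in points) / len(points)

# every pair of distinct concepts has correlation exactly 1/2
assert all(abs(corr(f, g) - 0.5) < 1e-9 for f, g in combinations(fs, 2))
```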

5 Characterizing Strong Learnability

In this section we give a complete characterization of strong learnability. More precisely, we define a dimension notion that is a generalization of the weak SQ dimension SQDim from Sect. 3, and show that it is closely related to the learning complexity.

Definition 3. For a concept class $F$ let $d_0(F, \gamma)$ denote the largest $d$ such that some $f_1, \ldots, f_d \in F$ fulfill

• $|\langle f_i, f_j\rangle| \le \gamma$ for $1 \le i < j \le d$, and
• $|\langle f_i, f_j\rangle - \langle f_k, f_\ell\rangle| \le 1/d$ for all $1 \le i < j \le d$ and $1 \le k < \ell \le d$.

Actually, this dimension notion is a kind of combination of the strong SQ dimension of Simon [8] (see also Sect. 6) and that of Yang [10].

Theorem 3. Let $F$ be a concept class and let $d := d_0(F, 1 - \epsilon/2)$. Then any consistent algorithm that uses tolerance $\tau \le \min\{1/(4d+4), \epsilon/4\}$ requires at most $d/\tau$ queries to learn $F$ with accuracy $1 - \epsilon$. Specifically, setting $\tau = \min\{1/(4d+4), \epsilon/4\}$, the algorithm finds a $(1-\epsilon)$-approximation of the target concept after $4d \cdot \max\{d+1, 1/\epsilon\}$ queries, implying $SLC_{F,F}(\tau, 1-\epsilon) \le 4d \cdot \max\{d+1, 1/\epsilon\}$.

Proof. Assume that some consistent algorithm used tolerance as above, queried $h_1, \ldots, h_q$ in this order, and got the answers $c_1, \ldots, c_q$ in this order. Suppose that for some $1 \le i_1 < i_2 < \cdots < i_\ell \le q$ and some $c \in [-1, 1]$ it holds that $c_{i_j} \in [c - \tau, c + \tau]$ for $j = 1, \ldots, \ell$. The algorithm is consistent, thus $\langle h_{i_j}, h_{i_k}\rangle \in [c_{i_j} - \tau, c_{i_j} + \tau]$ for $1 \le j < k \le \ell$, and consequently $\langle h_{i_j}, h_{i_k}\rangle \in [c - 2\tau, c + 2\tau] \subseteq [c - 1/(2d+2), c + 1/(2d+2)]$ for $1 \le j < k \le \ell$. Also note that $|\langle h_i, h_j\rangle| \le |c_i| + \tau \le 1 - \epsilon/2$ for $1 \le i < j \le q$, since $c_1, \ldots, c_q$ have absolute value less than $1 - 3\epsilon/4$ (as otherwise the algorithm would have successfully terminated). Together these imply that $\ell \le d_0(F, 1 - \epsilon/2)$. As this was true for any $c$, it follows that $q \le d_0(F, 1 - \epsilon/2) \cdot (2/(2\tau)) = d/\tau$. □
The proof of the other direction has the same structure as the proof of Theorem 2, with some necessary modifications.

Theorem 4. Let $F \subseteq C(X)$ be any concept class for some domain $X$, and assume $d := d_0(F, 1 - 2\epsilon) \ge 3$. Then if the tolerance $\tau$ is bigger than $\sqrt{3/(2\lfloor d/2\rfloor)}$, then $SLC_{F,C(X)}(\tau, 1 - \epsilon) \ge d\tau^2/3$. In particular, $SLC_{F,C(X)}(1/\sqrt[3]{d}, 1 - \epsilon) \ge \sqrt[3]{d}/3$.

Proof. Assume $3/(2\tau^2) \le d/2$ and let $d' := \lceil 3/(2\tau^2) \rceil$. By the choice of $d$ there exist $f_1, \ldots, f_d \in F$ and $c \in (-1 + 2\epsilon, 1 - 2\epsilon)$ satisfying $|\langle f_i, f_j\rangle - c| \le 1/(2d)$ for all $1 \le i < j \le d$. We show an (adversary) answering strategy that ensures that only a small number of the $f_i$ functions are eliminated by each query. Let $h \in C(X)$ be an arbitrary query function used by the learner, and assume for simplicity that $\langle f_1, h\rangle \ge \langle f_2, h\rangle \ge \cdots \ge \langle f_d, h\rangle$. Define $\alpha := \langle f_{d'}, h\rangle$, $\beta := \langle f_{d-d'+1}, h\rangle$, $A := [d']$ and $B := \{d - d' + 1, d - d' + 2, \ldots, d\}$. Then $1 - \epsilon \ge \alpha \ge \beta \ge -1 + \epsilon$ whenever $d' \ge 3$ (recall Observation 1 and that $d' \le d/2$ by our assumption on $\tau$); furthermore, $A$ and $B$ are disjoint sets of cardinality $d'$. First note that

$$\left\| \frac{1}{d'} \sum_{i \in A} f_i - \frac{1}{d'} \sum_{i \in B} f_i \right\|^2 = \frac{1}{(d')^2} \left( \sum_{i \in A} \|f_i\|^2 + \sum_{i \in B} \|f_i\|^2 + \sum_{\substack{i,j \in A \\ i \ne j}} \langle f_i, f_j \rangle + \sum_{\substack{i,j \in B \\ i \ne j}} \langle f_i, f_j \rangle - 2 \sum_{i \in A} \sum_{j \in B} \langle f_i, f_j \rangle \right) \le \frac{1}{(d')^2} \left( 2d' + d'(d'-1)\left(c + \frac{1}{2d}\right) + d'(d'-1)\left(c + \frac{1}{2d}\right) - 2(d')^2\left(c - \frac{1}{2d}\right) \right) \le \frac{4}{d'} + \frac{2}{d},$$

and so, by the Cauchy–Schwarz inequality,

$$\left\langle h, \frac{1}{d'} \sum_{i \in A} f_i - \frac{1}{d'} \sum_{i \in B} f_i \right\rangle \le \left\| \frac{1}{d'} \sum_{i \in A} f_i - \frac{1}{d'} \sum_{i \in B} f_i \right\| \le \sqrt{\frac{4}{d'} + \frac{2}{d}} \le \sqrt{\frac{6}{d'}}.$$

On the other hand, by the definition of $A$ and $B$ it also holds that

$$\left\langle h, \frac{1}{d'} \sum_{i \in A} f_i - \frac{1}{d'} \sum_{i \in B} f_i \right\rangle = \frac{1}{d'} \sum_{i \in A} \langle h, f_i \rangle - \frac{1}{d'} \sum_{j \in B} \langle h, f_j \rangle \ge \alpha - \beta,$$

and so $\alpha - \beta \le \sqrt{6/d'} \le 2\tau$. Thus, answering the learner's query with $(\alpha + \beta)/2$, all but at most $2d' - 2$ of the functions will be consistent with the answer. This, in turn, implies the desired lower bound $d/(2d' - 2) \ge d\tau^2/3$ on the learning complexity. □

The main result of this section is the following corollary of the two theorems above:

Corollary 1. The following statements are equivalent for any family $\{F_n\}_{n=1}^{\infty}$ of concept classes under an arbitrary (fixed) distribution:

– $d_0(F_n, 1 - \epsilon)$ is polynomially bounded in $n$ and $1/\epsilon$,
– $\{F_n\}_{n=1}^{\infty}$ is strongly learnable by some (possibly improper) algorithm,
– $\{F_n\}_{n=1}^{\infty}$ is strongly learnable by all consistent learning algorithms.

6 Other Dimension Notions for Strong Learnability

In this section we consider the relation of $d_0$ to the strong SQ dimensions of Simon [8] and Feldman [5]. For this let us first introduce $SQDim^*$ from [8].

Definition 4 ([8]). Given some concept class $F$, a subclass $F'$ of it is $(\gamma, H)$-trivial for some query class $H$ and constant $0 < \gamma < 1$, if some function $h \in H$ has correlation at least $\gamma$ with at least half of the functions in $F'$. The remaining subclasses of $F$ are said to be $(\gamma, H)$-nontrivial. The strong SQ dimension associated with concept class $F$ and query class $H$ is the function $SQDim^*_{F,H}(\gamma) := \sup_{F'} SQDim_{F' - B_{F'}}$, where $F'$ ranges over all $(\gamma, H)$-nontrivial subclasses of $F$, and where $B_{F'} = (1/|F'|) \sum_{f \in F'} f$.

As it turns out below, it does not really matter which query class is used, as long as it contains the concept class itself.

Observation 5. When $F \subseteq H$, any $(1 - \epsilon, F)$-trivial subset of $F$ is also $(1 - \epsilon, H)$-trivial; meanwhile, by Observation 1, any $(1 - \epsilon/2, H)$-trivial subset of $F$ is also $(1 - \epsilon, F)$-trivial. Thus

$$SQDim^*_{F,H}(1 - \epsilon) \le SQDim^*_{F,F}(1 - \epsilon) \le SQDim^*_{F,H}\left(1 - \frac{\epsilon}{2}\right).$$

We shall need the following identity later:

$$\langle f, g\rangle = \langle f - B, g - B\rangle + \langle f, B\rangle + \langle g, B\rangle - \|B\|^2. \tag{2}$$

Theorem 6. For any concept classes $F$ and $H$ satisfying $F \subseteq H$ it holds that $\max\{32/\epsilon^2,\ 9\,d_0^2(F, 1 - \epsilon^2/32)\} \ge SQDim^*_{F,H}(1 - \epsilon)$.

Proof. According to Observation 5, it is enough to show that the statement of the theorem holds for $H = F$.

Let $F'$ be a $(1 - \epsilon, F)$-nontrivial subset of $F$, and let $F_0$ be a subset of $F'$ such that $SQDim_{F_0 - B_{F'}} = |F_0|$. Assume furthermore that $d := |F_0| \ge 32/\epsilon^2$.

Consider the correlation of $B_{F'}$ with all the functions in $F_0$. Obviously there exist some $c \in [-1, 1]$ and some $d' \ge \sqrt{d}/3$ such that for some distinct functions $f_1, \ldots, f_{d'} \in F_0$ it holds that $\langle f_i, B_{F'}\rangle \in [c - 1/\sqrt{9d}, c + 1/\sqrt{9d}]$ for $i = 1, \ldots, d'$. Then for arbitrary indices $i, j, k, \ell \in [d']$ fulfilling $i \ne j$ and $k \ne \ell$ it holds (using (2)) that

$$|\langle f_i, f_j\rangle - \langle f_k, f_\ell\rangle| = \left|\left(\langle f_i - B_{F'}, f_j - B_{F'}\rangle - \langle f_k - B_{F'}, f_\ell - B_{F'}\rangle\right) + \left(\langle f_i, B_{F'}\rangle - \langle f_k, B_{F'}\rangle\right) + \left(\langle f_j, B_{F'}\rangle - \langle f_\ell, B_{F'}\rangle\right)\right| \le \frac{2}{d} + \frac{2 \cdot 2}{\sqrt{9d}} \le \frac{3}{\sqrt{d}} \tag{3}$$

using that $d \ge 32$. To prove the theorem it thus suffices to show that the correlation of any two distinct elements of $F'$ has absolute value at most $1 - \epsilon^2/32$.^6

To upper bound $\langle f_i, f_j\rangle$ for some $1 \le i < j \le d'$, first note that using (2) with $f = f_i$, $g = f_j$ and $B = B_{F'}$, and then applying the Cauchy–Schwarz inequality,

$$\langle f_i, f_j\rangle \le \frac{1}{d} + \|B_{F'}\|\,(2 - \|B_{F'}\|). \tag{4}$$

Also note that the $(1 - \epsilon, F)$-nontriviality of $F'$ implies that

$$\|B_{F'}\|^2 = \frac{1}{|F'|^2} \sum_{g, f \in F'} \langle g, f\rangle \le \frac{1}{|F'|} \sum_{g \in F'} \frac{1}{|F'|}\left((1 - \epsilon)\frac{|F'|}{2} + \frac{|F'|}{2}\right) = 1 - \frac{\epsilon}{2},$$

and therefore $\|B_{F'}\| \le \sqrt{1 - \epsilon/2} \le 1 - \epsilon/4$. Combining this with (4), and noting that $x(2 - x)$ is monotone increasing on $(0, 1)$, we get that

$$\langle f_i, f_j\rangle \le \frac{1}{d} + \left(1 - \frac{\epsilon}{4}\right)\left(1 + \frac{\epsilon}{4}\right) = 1 + \frac{1}{d} - \frac{\epsilon^2}{16}.$$

Thus, since $d \ge 32/\epsilon^2$, we have $\langle f_i, f_j\rangle \le 1 - \epsilon^2/32$.

Finally, let us give a lower bound on the pairwise correlations. If one pair had correlation less than $-1 + 1/32$, then, according to (3), all other pairs would have correlation at most $-1 + 1/32 + 3/\sqrt{d}$, implying

$$0 \le \left\| \sum_{i=1}^{d'} f_i \right\|^2 = \sum_{i=1}^{d'} \|f_i\|^2 + 2 \sum_{1 \le i < j \le d'} \langle f_i, f_j\rangle \le d' + d'(d' - 1)\left(-1 + \frac{1}{32} + \frac{3}{\sqrt{d}}\right),$$

which would lead to a contradiction, as $d \ge 32$. Consequently $\langle f_i, f_j\rangle \ge -1 + \epsilon^2/32$ for $1 \le i < j \le d'$. □

^6 Note that we cannot apply Observation 1 (or Proposition 1) directly to bound $\langle f_i, f_j\rangle$, because nontriviality only guarantees that none of the $f_i$ functions have high correlation with at least half of $F'$, which does not prevent them from having really high correlation with some smaller portion of $F'$. It thus has to be shown that no such set contains another $f_i$.
Theorem 7. Let $F$ and $H$ be concept classes satisfying $F \subseteq H$. Then

$$d_0(F, 1 - \epsilon) \le \max\{2,\ 2 \cdot SQDim^*_{F,F}(1 - \epsilon/2)\} \le \max\{2,\ 2 \cdot SQDim^*_{F,H}(1 - \epsilon/4)\}.$$

Proof. The second inequality follows from Observation 5.

To prove the first inequality, let $F' := \{f_1, \ldots, f_d\} \subseteq F$ be such that $|\langle f_i, f_j\rangle| < 1 - \epsilon$ and $|\langle f_i, f_j\rangle - \langle f_k, f_\ell\rangle| < 1/d$ for $1 \le i < j \le d$ and $1 \le k < \ell \le d$. Then

$$|\langle f_i - B_{F'}, f_j - B_{F'}\rangle| = \left| \langle f_i, f_j\rangle + \frac{1}{d^2} \sum_{k,\ell=1}^{d} \langle f_k, f_\ell\rangle - \frac{1}{d} \sum_{k=1}^{d} \left(\langle f_i, f_k\rangle + \langle f_j, f_k\rangle\right) \right| \le \left| \frac{1}{d} \sum_{k=1}^{d} \left(\langle f_i, f_j\rangle - \langle f_i, f_k\rangle\right) \right| + \left| \frac{1}{d^2} \sum_{k,\ell=1}^{d} \left(\langle f_k, f_\ell\rangle - \langle f_j, f_k\rangle\right) \right| \le \frac{2}{d}.$$

Furthermore, by Observation 1, $F'$ is $(1 - \epsilon/2, F)$-nontrivial. □

The dimension notion introduced in [5] is a kind of simplified version of $SQDim^*$:

Definition 5 ([5]). For a concept class $F$ over domain $X$ let $SSQ\text{-}DIM(F, \epsilon) := \max_h SQDim_{\{f' \in (F - h)\,:\,\|f'\|^2 \ge \epsilon\}}$, where $h$ ranges over all mappings from $X$ to $[-1, 1]$.

Furthermore, the proofs of the two theorems above can easily be modified to show:

Theorem 8. For any concept class $F$ it holds that $\max\{32,\ 2/\epsilon,\ 9\,d_0^2(F, 1 - \epsilon/2)\} \ge SSQ\text{-}DIM(F, \epsilon)$ and $\max\{2,\ 2 \cdot SSQ\text{-}DIM(F, \epsilon^2/16)\} \ge d_0(F, 1 - \epsilon)$.

7 $d_0$ for Conjunctions under the Uniform Distribution

In this section, as an example, we compute the value of $d_0$ for the class of conjunctions under the uniform distribution, up to a constant factor. (Note however that this class is efficiently learnable in the Statistical Query model even distribution independently [6], so $d_0$ is obviously polynomial in $n$ and in $1/\epsilon$.)

First of all, let us compute the correlation of two conjunctions $t$ and $t'$ that have lengths $\ell$ and $\ell'$ respectively, and share exactly $s$ literals (as usual, $-1$ is interpreted as "true" and $1$ as "false"):

$$\langle t, t'\rangle = \mathbb{E}[t \cdot t'] = 1 - 2\,\mathbb{P}[t \ne t'] = 1 - 2\left(\mathbb{P}[t = -1] + \mathbb{P}[t' = -1] - 2\,\mathbb{P}[t = t' = -1]\right) = \begin{cases} 1 - 2/2^{\ell} - 2/2^{\ell'} & \text{if } t \text{ and } t' \text{ conflict,} \\ 1 - 2/2^{\ell} - 2/2^{\ell'} + 4/2^{\ell + \ell' - s} & \text{otherwise.} \end{cases} \tag{5}$$
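Equation (5) is easy to sanity-check against a brute-force computation; the sketch below (our encoding: a term is a dict from variable index to required bit) does so for a small example:

```python
from itertools import product

def corr_by_formula(l1, l2, s, conflict):
    """Equation (5): correlation of two conjunctions of lengths l1 and l2
    sharing exactly s literals, under the uniform distribution."""
    base = 1 - 2 / 2**l1 - 2 / 2**l2
    return base if conflict else base + 4 / 2**(l1 + l2 - s)

def corr_brute_force(t1, t2, n):
    """A term evaluates to -1 ('true') iff all its literals are satisfied."""
    def value(t, x):
        return -1 if all(x[v] == b for v, b in t.items()) else 1
    pts = list(product([0, 1], repeat=n))
    return sum(value(t1, x) * value(t2, x) for x in pts) / len(pts)

# t = v1 AND v2, t' = v1 AND v3: lengths 2 and 2, one shared literal
t1, t2 = {0: 1, 1: 1}, {0: 1, 2: 1}
assert abs(corr_brute_force(t1, t2, 4) - corr_by_formula(2, 2, 1, False)) < 1e-9

# t = v1 AND v2, t'' = NOT(v1) AND v3: the terms conflict on v1
t3 = {0: 0, 2: 1}
assert abs(corr_brute_force(t1, t3, 4) - corr_by_formula(2, 2, 1, True)) < 1e-9
```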

Next we prove a technical lemma that we shall need later. Here we apply the convention that for $x \in \{0,1\}^n$ the number of 1s in $x$ is denoted $|x|$, and that for $x, y \in \{0,1\}^n$, $x \vee y$ (resp. $x \wedge y$) is the vector of length $n$ with 1 in those components that are 1 in at least one of $x$ and $y$ (resp. in both $x$ and $y$), and 0 everywhere else. For conjunctions we use similar notation; that is, $|t|$ denotes the number of literals appearing in term $t$, and $t \wedge t'$ denotes the term obtained by joining the literals appearing in terms $t$ and $t'$.

Lemma 1. If for some $H \subseteq \{0,1\}^n$ and for some integer $c$ it holds that $|x \vee y| = c$ for arbitrary distinct $x, y \in H$, then $|H| \le n + 1$.

Proof. For $x \in H$ let $x^c$ denote the vector obtained by flipping the bits of $x$. Then by De Morgan $x^c \wedge y^c = (x \vee y)^c$, and thus $|x^c \wedge y^c| = n - c$ for arbitrary distinct $x, y \in H$. Construct the $n \times |H|$ matrix $X$ whose columns are the vectors $x^c$ for $x \in H$, in an arbitrary order, and let $C$ be the $|H| \times |H|$ matrix having $n - c$ in each entry. First of all, note that $X^{\top}X - C$ is a diagonal matrix. If it contains some zero element on the diagonal, then $|x^c| = n - c$ for some $x \in H$, implying that for every other $y \in H$ the vector $y^c$ has a 1 everywhere $x^c$ does, and that each such $y^c$ must have a 1 at some unique position where the others have 0. This immediately implies $|H| \le n + 1$. Otherwise, when $X^{\top}X - C$ is a nonsingular diagonal matrix,

$$|H| = \operatorname{rank}\left(X^{\top}X - C\right) \le \operatorname{rank}\left(X^{\top}X\right) + 1 = \operatorname{rank}(X) + 1 \le \min\{n, |H|\} + 1,$$

implying the statement of the claim. □

Proposition 2. Let $F_n$ be the set of conjunctions over the variables $v_1, \ldots, v_n$. Then under the uniform distribution $d_0(F_n, 1 - \epsilon) \le 1 + \max\{2n + 2,\ 8/\epsilon^2\}$.

Proof. Let $t_1, \ldots, t_d$ be terms satisfying $|\langle t_i, t_j\rangle| \le 1 - \epsilon$ and $|\langle t_i, t_j\rangle - \langle t_k, t_\ell\rangle| \le 1/d$ for $i, j, k, \ell \in [d]$ fulfilling $i \ne j$ and $k \ne \ell$. Assume for simplicity that $t_d$ is the longest term among them. Then by (5) it holds that $1 - \epsilon \ge \langle t_i, t_d\rangle \ge 1 - 4\,\mathbb{P}[t_i = -1]$, implying

$$\mathbb{P}[t_i = -1] = 2^{-|t_i|} \ge \frac{\epsilon}{4}, \tag{6}$$

and thus

$$\mathbb{P}[t_i = t_j = -1] = \begin{cases} 0 & \text{if } t_i \text{ and } t_j \text{ conflict,} \\ 2^{-|t_i \wedge t_j|} \ge (\epsilon/4)^2 & \text{otherwise,} \end{cases} \tag{7}$$

for distinct $i, j \in [d - 1]$.

Let us assume that $1/d < \epsilon^2/8$ (otherwise $d \le 8/\epsilon^2$ and we are done). If for some $I \subseteq [d - 1]$ it holds that all $t_i$, $i \in I$, have the same length, then for any indices $i, j, k, \ell \in I$ fulfilling $i \ne j$ and $k \ne \ell$,

$$\frac{\epsilon^2}{32} > \frac{1}{4d} \ge \frac{1}{4}\,|\langle t_i, t_j\rangle - \langle t_k, t_\ell\rangle| \stackrel{(5)}{=} \left|\mathbb{P}[t_i = t_j = -1] - \mathbb{P}[t_k = t_\ell = -1]\right|.$$

Note that it cannot happen that $t_i$ and $t_j$ conflict with each other while $t_k$ and $t_\ell$ do not (or vice versa), since by (7) that would mean that the right hand side is at least $\epsilon^2/16$, resulting in a contradiction. So either all $t_i$ with $i \in I$ conflict each other, or there is no conflicting pair among the terms with index in $I$. The former case implies that the events $\{t_i = -1\}_{i \in I}$ are pairwise exclusive, and so $1 \ge \sum_{i \in I} \mathbb{P}[t_i = -1] \ge |I| \cdot (\epsilon/4)$ by (6), giving the bound $|I| \le 4/\epsilon$. In the latter case, since by (7) both $2^{-|t_i \vee t_j|}$ and $2^{-|t_k \vee t_\ell|}$ are at least $\epsilon^2/16$, we have that $2^{-|t_i \vee t_j|} > (1/2)\,2^{-|t_k \vee t_\ell|}$ and $2^{-|t_k \vee t_\ell|} > (1/2)\,2^{-|t_i \vee t_j|}$. This, however, implies that $|t_i \vee t_j| = |t_k \vee t_\ell|$, and so, by Lemma 1 (applied to the $H \subseteq \{0,1\}^n$ consisting of the vectors that represent the $t_i$ with $i \in I$, by having a 1 in position $j$ iff $t_i$ contains variable $v_j$), $I$ has cardinality at most $n + 1$.

We have just seen that the sum of the number of terms of minimal length and the number of terms of length one more is at most $\max\{2n + 2,\ 8/\epsilon\}$. However, there cannot be distinct indices $i, j, k \in [d - 1]$ fulfilling $|t_i| + 2 \le |t_j|, |t_k|$, as otherwise

$$\frac{\epsilon^2}{8} > \frac{1}{d} \ge |\langle t_i, t_j\rangle - \langle t_j, t_k\rangle| = \left|2\,\mathbb{P}[t_i = -1] - 4\,\mathbb{P}[t_i = t_j = -1] - 2\,\mathbb{P}[t_k = -1] + 4\,\mathbb{P}[t_k = t_j = -1]\right| \ge \frac{1}{2}\,\mathbb{P}[t_i = -1] \stackrel{(6)}{\ge} \frac{\epsilon}{8},$$

a contradiction. □

Note that this bound is sharp up to a constant factor according to the example below, and that the terms consisting of one unnegated variable form an orthogonal system of cardinality $n$. It also immediately follows that these results remain tight even if we restrict $F_n$ to be the set of monotone conjunctions over $v_1, \ldots, v_n$.^7

^7 In [2] Ben-David et al. related the learning complexity of a class $F$ to its capacity $c(F, \epsilon) := \min\{|G| : \forall f \in F\ \exists g \in G \text{ s.t. } \langle f, g\rangle \ge 1 - \epsilon\}$. For $F_n = \{v_1, \ldots, v_n\}$ this is polynomial in the learning complexity (specifically, $c(F_n, \epsilon) = n$) under the uniform distribution, but for the monotone conjunctions it is superpolynomial (choose $s = \log\log n$ in Example 1). The two notions are thus not polynomially related.

Example 1. Let $F_n$ be the set of all monotone conjunctions over the variables $v_1, \ldots, v_n$ and let $F_n(\ell)$ consist of all $t \in F_n$ of length $\ell$. Set $\epsilon := 2^{-\ell}$ and note that if $t_1, t_2 \in F_n(\ell)$ share $s < \ell$ variables, then under the uniform distribution, by (5), $|\langle t_1, t_2\rangle| = 1 - 4/2^{\ell} + 4/2^{2\ell - s} \le 1 - 2\epsilon$. If additionally $t_3, t_4 \in F_n(\ell)$ share $s' < \ell$ variables, then by (5)

$$|\langle t_1, t_2\rangle - \langle t_3, t_4\rangle| = \left| \frac{4}{2^{2\ell - s}} - \frac{4}{2^{2\ell - s'}} \right| = \frac{4}{2^{2\ell}} \left| 2^{s} - 2^{s'} \right|.$$

Now we choose $\ell = \ell(n) := c \log n$ for some $c > 1$ (and thus $\epsilon = \epsilon(n) = 1/n^c$) and $s = s(n) := \log\log n$, and prove that $d_0(F_n, 1 - \epsilon) = \Omega(2^{2\ell}) = \Omega(n^{2c})$ by showing that one can find an $I \subseteq F_n(\ell)$ of cardinality $\Omega(n^{2c})$ that contains no two distinct conjunctions sharing more than $s$ variables. Such an $I$ can simply be obtained using the greedy method, since when $n - \ell \ge 2(\ell - s)$, then for any $t \in F_n(\ell)$ there are exactly $\sum_{i=0}^{\ell - s} \binom{\ell}{i}\binom{n - \ell}{i} \le 2\binom{\ell}{\ell - s}\binom{n - \ell}{\ell - s}$ conjunctions in $F_n(\ell)$ that share at least $s$ variables with $t$; thus (noting that $|F_n(\ell)| = \binom{n}{\ell}$) $I$ can always be expanded by some term while it has cardinality less than

$$\frac{1}{2} \binom{n}{\ell} \binom{\ell}{\ell - s}^{-1} \binom{n - \ell}{\ell - s}^{-1} \sim \frac{n^{s}}{2\,\ell^{2s}}$$

(using Stirling's formula).
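The greedy method of the example is straightforward to implement; below is a small sketch (ours) that packs length-$\ell$ monotone conjunctions, viewed as $\ell$-subsets of the variables, with pairwise overlap at most $s$:

```python
from itertools import combinations

def greedy_packing(n, l, s):
    """Greedily pick l-subsets of {0,...,n-1} such that any two picked
    subsets share at most s elements (the set I of Example 1).
    Quadratic in the number of candidates; illustration only."""
    picked = []
    for cand in combinations(range(n), l):
        c = set(cand)
        if all(len(c & p) <= s for p in picked):
            picked.append(c)
    return picked

# e.g. monotone conjunctions of length 4 over 12 variables, overlap <= 1
print(len(greedy_packing(12, 4, 1)))
```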

8 Proper vs. Improper Learning in the Distribution Independent Case

In the distribution dependent case (i.e., when the learner knows the underlying distribution) proper and improper learning are basically the same (recall Corollary 1). In this section we contrast this result by showing that in the distribution independent case proper and improper learning can differ significantly. Consider for example the class of singletons: $F_n := \{f_x : x \in \{-1,1\}^n\}$, where $f_x$ evaluates to $-1$ on $x$, and evaluates to $1$ on every other input. Since $F_n$ is a subset of the class of conjunctions, which was shown by Kearns in [6] to be efficiently learnable in the Statistical Query model, $F_n$ can be learned using polynomially many improper queries.

Let us now define for each $x, y \in \{-1,1\}^n$ a distribution $D_{x,y}$, which assigns probability $1/2$ to both $x$ and $y$, and probability $0$ to every other input. The key observation is that in the case of proper learning each query must be one of the $f_x$ functions. But then, as long as there are at least two of them that have not yet been queried, the adversary can just return $0$ as the answer. Finally, when

only two singletons, say $f_x$ and $f_y$, are unqueried, the adversary chooses one of them as the target concept, and declares that the underlying distribution is $D_{x,y}$. This way the answers of the adversary remain consistent (no matter how small the tolerance parameter of the learner was), and, at the same time, the learner is forced to ask at least $2^n - 1$ queries, even for weakly learning the class.^8

^8 This does not contradict the result of Aslam and Decatur [1] mentioned in Section 4, since their boosting algorithm uses improper queries.

It might also be worth mentioning that for the singletons $SQDim^D_{F_n} \le 5$ under any distribution $D$: denoting by $p_x$ the probability assigned to input $x \in \{-1,1\}^n$, the inequality $1/6 \ge \langle f_x, f_y\rangle_D = 1 - 2p_x - 2p_y$ implies that at least one of $p_x$ and $p_y$ is $5/24$ or greater, and thus if six functions from $F_n$ had pairwise correlations at most $1/6$, then at least five distinct inputs would have probability $5/24$ or greater, a contradiction. This shows that the number of proper queries required for weakly learning some concept class can differ significantly between the distribution dependent and the distribution independent case: in some cases it is constant versus exponential.

Acknowledgements. I would like to thank Hans Ulrich Simon for suggesting that I work on this topic. I am also thankful to him, Thorsten Doliwa and Michael Kallweit for the motivating discussions on the problem.

References
1. Aslam, J.A., Decatur, S.E.: General bounds on statistical query learning and PAC
learning with noise via hypothesis boosting. Inf. Comput. 141(2), 85–118 (1998)
2. Ben-David, S., Itai, A., Kushilevitz, E.: Learning by distances. Inform. Comput. 117(2), 240–250 (1995)
3. Blum, A., Furst, M., Jackson, J., Kearns, M., Mansour, Y., Rudich, S.: Weakly learning DNF and characterizing statistical query learning using Fourier analysis. In: Proc. of the 26th ACM Symposium on Theory of Computing (1994)
4. Bshouty, N.H., Feldman, V.: On using extended statistical queries to avoid mem-
bership queries. Journal of Machine Learning Research 2, 359–395 (2002)
5. Feldman, V.: A complete characterization of statistical query learning with appli-
cations to evolvability. In: FOCS 2009 (to appear, 2009)
6. Kearns, M.: Efficient noise-tolerant learning from statistical queries. J. ACM 45(6),
983–1006 (1998)
7. Köbler, J., Lindner, W.: The complexity of learning concept classes with polynomial
general dimension. Theor. Comput. Sci. 350(1), 49–62 (2006)
8. Simon, H.U.: A characterization of strong learnability in the statistical query
model. In: Thomas, W., Weil, P. (eds.) STACS 2007. LNCS, vol. 4393, pp. 393–404.
Springer, Heidelberg (2007)
9. Szörényi, B.: Honest queries do not help in the statistical query model (manuscript)
10. Yang, K.: New lower bounds for statistical query learning. J. Comput. Syst.
Sci. 70(4), 485–509 (2005); In: Kivinen, J., Sloan, R.H. (eds.) COLT 2002. LNCS
(LNAI), vol. 2375, pp. 229–509. Springer, Heidelberg (2002)
11. Yang, K.: On learning correlated boolean functions using statistical queries. In:
Abe, N., Khardon, R., Zeugmann, T. (eds.) ALT 2001. LNCS (LNAI), vol. 2225,
pp. 59–76. Springer, Heidelberg (2001)

An Algebraic Perspective on Boolean Function Learning

Ricard Gavaldà^1 and Denis Thérien^2

^1 Department of Software (LSI), U. Politècnica de Catalunya, Barcelona, Spain
gavalda@lsi.upc.edu
^2 School of Computer Science, McGill University, Montréal, Québec, Canada
denis@cs.mcgill.ca

Abstract. In order to systematize existing results, we propose to analyze the learnability of boolean functions computed by an algebraically defined model, programs over monoids. The expressiveness of the model, hence its learning complexity, depends on the algebraic structure of the chosen monoid. We identify three classes of monoids such that the functions computed over them can be identified, respectively, from Membership queries alone, Equivalence queries alone, and both types of queries. The algorithms for the first class are new to our knowledge, while those for the other two are combinations or particular cases of known algorithms. Learnability of these three classes captures many previous learning results. Moreover, by using nontrivial taxonomies of monoids, we can argue that using the same techniques to learn larger classes of boolean functions seems to require proving new circuit lower bounds or proving learnability of DNF formulas.

1 Introduction

In his foundational paper [Val84], Valiant introduced the (nowadays called) PAC-learning model, and showed that conjunctions of literals, monotone DNF formulas, and k-DNF formulas are learnable in the PAC model. Shortly after, Angluin proposed the (nowadays called) Exact learning from queries model, proved that Deterministic Finite Automata are learnable in this model [Ang87], and showed how to recast Valiant's three learning results in the exact model [Ang88].

Valiant's and Angluin's initial successes were followed by a flurry of PAC- or Exact-learning results, many of them concerning (as in Valiant's paper) the learnability of Boolean functions, others investigating learnability in larger domains. For the case of Boolean functions, however, progress both in the pure (distribution-free, polynomial-time) PAC model and in the exact learning model has slowed down considerably in the last decade.

Certainly, one reason for this slowdown is the admission that these two models do not realistically capture many Machine Learning scenarios. So a lot of the effort has shifted to investigating variations of the original models that accommodate these features (noise tolerance, agnostic learning, attribute efficiency, distribution specific learning, subexponential time, . . . ), and important advances have been made here.

But another undeniable reason for the slowdown is the fact that it is difficult to find new learnable classes, either by extending current techniques to larger classes or by finding totally different techniques. Many existing techniques seem to be blocked by the frustrating problem of learning DNF, or by our lack of knowledge of basic questions in boolean circuit complexity, such as the power of modular or threshold circuits.

In this paper, we use algebraic tools to organize many existing results on Boolean function learning, and to point out possible limitations of existing techniques. We adopt the program over a monoid as the computing model for Boolean functions [Bar89, BST90]. We use existing, and very subtle, taxonomies of finite monoids to classify many existing results on Boolean function learning, both in the Exact and PAC learning models, into three distinct algorithmic paradigms.

The rationale behind the approach is that the algebraic complexity of a monoid is related to the computational complexity of the Boolean functions it can compute, hence to their learning complexity. Furthermore, the existing taxonomies of monoids may help in detecting corners of learnability that have escaped attention so far because of lack of context, and also in indicating barriers for a particular learning technique. We provide some examples of both types of indications. Similar insights have led in the past to, e.g., the complete classification of the communication complexity of boolean functions and regular languages [TT05, CKK+07].
More precisely, we present three classes of monoids that are learnable in three
different Exact learning settings:
Strategy 1. Groups for which lower bounds are known in the program model,
all of which are solvable. Boolean functions computed over these groups can be
identified from polynomially many Membership queries and, in some cases, in
polynomial or quasipolynomial time. Membership learning in polynomial time
is impossible for any monoid which is not a solvable group.
Strategy 2. Monoids built as wreath products of DA monoids and p-groups. These monoids compute the boolean functions computed by decision lists whose nodes contain $MOD_p$ gates fed by $NC^0$ functions of the inputs. These are learnable from Equivalence queries alone, hence also PAC-learnable, using variants of the algorithms for learning decision lists and intersection-closed classes. The result can be extended to $MOD_m$ gates (for nonprime $m$) with restrictions on their accepting sets. All monoids in this class are nonuniversal (they cannot compute all boolean functions); in fact, this is the largest class known to contain only nonuniversal monoids. We argue that proving learnability of the most reasonable extensions of this class (either in the PAC or the Equivalence-query model) requires either new circuit lower bounds or learning DNF.
Strategy 3. Monoids in the variety named $LG_p$ ⓜ $Com$. Programs over these monoids are simulated by polynomially larger Multiplicity Automata (in the sequel, MA) over the field $F_p$, and thus are learnable from Membership and Equivalence queries. Not all MA can be translated to programs over such monoids; but all classes of Boolean functions that, to our knowledge, were shown to be learnable via the MA algorithm (except the full class of MA itself) are in fact captured by this class of monoids. We conjecture that this is the largest class of monoids that can be polynomially simulated by MA, hence it defines the limit of what can be learned via the MA algorithm in our algebraic setting.
These three classes subsume a good number of the classes of Boolean functions that have been proved learnable in the literature, and we detail these when presenting each of the strategies. Additionally, with the algebraic interpretation we can examine more systematically the possible extensions of these results, at least within our framework. By examining natural extensions of our three classes of monoids, we can argue that any substantial extension of two of our three monoid classes provably requires solving two notoriously hard problems: either proving learnability of DNF formulas or proving new lower bounds for classes of solvable groups. This may be an indication that substantial advances on the learnability of circuit-based classes similar to the ones we capture in our framework may require new techniques.

Admittedly, there is no reason why every class of boolean functions interesting from the learning point of view should be equivalent to programs over a class of monoids, and certainly our classification leaves out many important classes. Among them are classes explicitly defined in terms of threshold gates, or by read-k restrictions on the variables, or by monotonicity conditions. This is somewhat unavoidable in our setting, since threshold gates have no natural analogue in finite monoids, and because multiple reads and variable negation are free in the program model. Similarly, the full classes of MA and DFA cannot be captured in our framework, since for example the notion of automaton size is critically sensitive to the order in which the inputs are read, while in the program model variables can always be renamed with no increase in size.

Our taxonomy is somewhat complementary to those in [HS07, She08] based on threshold functions. Some function classes are captured by both that approach and ours, while each one contains classes not captured by the other.

2 Background

2.1 Boolean Functions

We build circuits typically using AND, OR, and MOD gates. We use the generalized model of $MOD_m$ gates that come equipped with an accepting set $A \subseteq [m]$, shown as a superindex; $[m]$ denotes the set $\{0, \ldots, m-1\}$ throughout the paper. A $MOD^A_m$ gate outputs 1 iff the sum of its inputs mod $m$ is in $A$. We simply write $MOD_m$ gates to mean $MOD^A_m$ gates with arbitrary $A$'s. For each $k$, $NC^0_k$ is the set of boolean functions each depending on at most $k$ variables.


We often compose classes of boolean functions. For two classes C and D, C ◦D
denotes functions in C with inputs replaced with functions in D.
We denote with DL the class of functions computed by decision lists where
each node contains one variable. Therefore, e.g., DL ◦ NC0k are decision lists
whose nodes contain boolean functions depending on at most k variables.

We will typically discuss families of boolean functions, namely sequences $\{f_n\}_{n \ge 0}$ where each $f_n$ is a function of arity $n$. Given a class $C$ of families of boolean functions, we use the term "boolean combinations of functions in $C$" to denote the set of families of functions that can be obtained by combining some fixed number (independent of $n$) of functions in $C$; in other words, the functions in $\bigcup_k (NC^0_k \circ C)$.

We will use the computation model called Multiplicity Automata, MA for short. The following is one of several equivalent definitions; see e.g. [BV96, BBB+00] for more details. A multiplicity automaton over an alphabet $\Sigma$ and a field $F$ is a nondeterministic finite automaton over $\Sigma$ where we associate an element of $F$ with each transition. The value of the automaton on an input $x \in \Sigma^*$ is the sum over all accepting paths of the products of the elements along the path, where sums and products are taken in the field.
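Concretely, the value of an MA can be computed as a product of one weight matrix per input symbol; the following sketch (ours, over the rationals rather than a finite field) illustrates the definition with a 2-state automaton whose value is the number of 1s in the input:

```python
import numpy as np

def ma_value(x, init, final, trans):
    """Value of a multiplicity automaton on word x: the sum over accepting
    paths of the products of transition weights, computed as
    init^T * M[x_1] * ... * M[x_k] * final, with one matrix per symbol.
    Over F_p, one would reduce mod p after each step instead."""
    v = init
    for sym in x:
        v = v @ trans[sym]
    return float(v @ final)

trans = {0: np.array([[1., 0.], [0., 1.]]),   # on 0: keep state
         1: np.array([[1., 1.], [0., 1.]])}   # on 1: also advance 'counter'
init = np.array([1., 0.])
final = np.array([0., 1.])
print(ma_value([1, 0, 1, 1], init, final, trans))  # 3.0
```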

2.2 Learning Theory

We assume familiarity with Valiant’s PAC model and especially Angluin’s model
of Exact learning via queries. In the Exact model, we use Membership and
Equivalence queries. As usual, we measure the resources used by a learning
algorithm as a function of the arity of the target function (denoted with n) and
the size of the target function within some representation language associated
to the class of functions to learn (denoted with s). Longer explanations can be
found in the extended version.
We will use repeatedly the well-known Composition Theorem (see e.g.
[KLPV87]) which states that if a class C (with minor syntactical requirements)
is learnable in polynomial time then C ◦ NC0k is also learnable in polynomial time
for every fixed k. The result is valid for both the Equivalence-query model and the
PAC model, but the proof fails in the presence of Membership queries.

2.3 Monoids and Programs

Recall that a monoid is a set equipped with a binary operation that is associative and has an identity. All the monoids in this paper are finite; some of our statements about monoids might be different or fail for infinite monoids.

A group is a monoid where each element has an inverse. A monoid is aperiodic if there is some number $t$ such that $a^{t+1} = a^t$ for every element $a$. Only the one-element monoid is both a group and aperiodic. A theorem by Krohn and Rhodes states that every monoid can be built from groups and aperiodic monoids by repeatedly applying the so-called wreath product. The wreath product of monoids $A$ and $B$ is denoted $A \wr B$. Solvable groups, in particular, are precisely those that can be built as iterated wreath products of Abelian groups. Definitions of solvable groups and of the wreath product can be found in most textbooks on group theory, and in the extended version of this paper.
A program over a monoid $M$ is a pair $(P, A)$, where $A \subseteq M$ is the accepting set and $P$ is an ordered list of instructions. An instruction is a triple $(i, a_i, b_i)$ whose semantics is as follows: read (boolean) variable $x_i$; if $x_i = 0$, emit element $a_i \in M$, and emit element $b_i \in M$ if $x_i = 1$. A list of instructions $P$ defines a sequence of elements of $M$ on every assignment $w$ to the variables. We denote by $P(w)$ the product in $M$ of this sequence of elements. If $P(w) \in A$ we say that the program accepts $w$, and that it rejects $w$ otherwise; alternatively, we say that the program evaluates to 1 (resp. 0) on $w$. The length or size of the program is the number of instructions in $P$.

Each program on $n$ variables thus computes a boolean function from $\{0,1\}^n$ to $\{0,1\}$. For a monoid $M$, $B(M)$ is the set of boolean functions recognized by programs over $M$. If $\mathcal{M}$ is a set of monoids, $B(\mathcal{M})$ is $\bigcup_{M \in \mathcal{M}} B(M)$.
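To illustrate the definition, here is a minimal interpreter (ours) for programs over a monoid given by its operation, together with a 3-instruction program over the cyclic group $Z_3$ computing a $MOD_3$ function:

```python
def run_program(instructions, accepting, op, identity, w):
    """Evaluate a program over a monoid (op, identity) on assignment w.
    Each instruction (i, a, b) emits a if w[i] == 0 and b if w[i] == 1;
    the program accepts iff the product of the emitted elements lies
    in the accepting set."""
    prod = identity
    for i, a, b in instructions:
        prod = op(prod, a if w[i] == 0 else b)
    return prod in accepting

# Example over Z_3 (addition mod 3): accept iff the number of 1s among
# x0, x1, x2 is divisible by 3, i.e. a MOD_3^{0} gate.
instructions = [(0, 0, 1), (1, 0, 1), (2, 0, 1)]
accepting = {0}
add3 = lambda u, v: (u + v) % 3
print(run_program(instructions, accepting, add3, 0, [1, 1, 1]))  # True
print(run_program(instructions, accepting, add3, 0, [1, 0, 1]))  # False
```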
A monoid $M$ is said to divide a monoid $N$ if $M$ is a homomorphic image of a submonoid of $N$. A set of monoids closed under direct product and division (i.e., taking submonoids and homomorphic images) is called a variety (technically, a pseudovariety, since we are dealing with finite monoids). The following varieties will appear in this paper:

– Com: All commutative monoids, i.e., those satisfying $xy = yx$.
– Ab: All Abelian groups. Recall that every finite Abelian group is a direct product of a number of groups of the form $Z_{p_i^{\alpha_i}}$ for different primes $p_i$.
– $G_p$: All p-groups, that is, groups of cardinality a power of the prime $p$.
– $G_{nil}$: Nilpotent groups. For the purposes of this paper, a group is nilpotent iff it is the direct product of a number of groups, each of which is a $p_i$-group for possibly different $p_i$'s. All Abelian groups are nilpotent. For interpretation, it was shown in [PT88] that programs over nilpotent groups are equivalent in power to polynomials of constant degree over a ring of the form $(Z_m)^k$, i.e., they compute the same set of boolean functions.
– G: The variety of all groups.
– DA: A variety of aperiodic monoids to be defined in Section 4.2. For interpretation, it was shown in [GT03] that programs over monoids in DA are equivalent in power to decision trees of bounded rank.

2.4 Learning Programs over Monoids: Generalities

Every monoid $M$ defines a set of boolean functions $B(M)$ with an associated notion of function size, namely the length of the shortest program over $M$. The general question we ask is thus: "given $M$ and a learning model, is $B(M)$ polynomial-time learnable in that learning model?". Polynomiality (or other bounds) is measured in the number of variables and the size in $M$ of the target function, denoted $s$ as already mentioned.

For a set of monoids $\mathcal{M}$, we say for brevity "programs over $\mathcal{M}$ are learnable" or even "$\mathcal{M}$ is learnable" to mean "for every fixed $M \in \mathcal{M}$, $B(M)$ is learnable"; that is, there may be a different algorithm for each $M \in \mathcal{M}$, with a different running time. In other words, each algorithm works for a fixed $M$ that it "knows". Models where a single algorithm must work for a whole class of monoids are possible, but we do not pursue them in this paper.

The following easy result is useful to compare the learning complexity of different monoids:

Fact 1. If $M$ divides $N$ and $B(N)$ is learnable (in any of the learning models in this paper), then $B(M)$ is also learnable.

In contrast, we do not know whether learnability is preserved under direct product (which is to say, by taking fixed-size boolean combinations of classes of the form $B(M)$): if it were, many of the open problems in this paper would be resolved, but we have no general argument or counterexample.

3 Learning from Small-Weight Assignments

The small-weight strategy applies to function classes with the following property.

Definition 1. For an assignment $a \in \{0,1\}^n$, the weight of $a$ is defined as the number of 1s it contains, and denoted $w(a)$. A representation class $C$ is k-narrowing if every two different functions $f, g \in C$ of the same arity differ on some assignment of weight at most $k$. ($k$ may actually be a function of some other parameters, such as the arity of $f$ and $g$ or their size in $C$.)

The following is essentially proved in [GTT06].

Theorem 2. If $C$ is $k$-narrowing, then $C$ can be identified with $n^{O(k)}$ Membership queries (and possibly unbounded time).

The algorithm witnessing this is simple: ask all assignments in $\{0,1\}^n$ of weight at most $k$, of which there are at most $n^{O(k)}$. Then find any function $f \in C$ consistent with all answers. By the narrowing property, that $f$ must be equivalent to the target.
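The algorithm is short enough to state as code; the sketch below (ours, with `candidates` ranging over an explicitly given class) implements it:

```python
from itertools import combinations

def small_weight_points(n, k):
    """All assignments in {0,1}^n of weight at most k -- n^O(k) of them."""
    for w in range(k + 1):
        for ones in combinations(range(n), w):
            x = [0] * n
            for i in ones:
                x[i] = 1
            yield tuple(x)

def identify(candidates, membership, n, k):
    """Identification of a k-narrowing class: query every small-weight
    point via the Membership oracle `membership`, then return any
    candidate consistent with all answers."""
    answers = {x: membership(x) for x in small_weight_points(n, k)}
    for f in candidates:
        if all(f(x) == v for x, v in answers.items()):
            return f  # by k-narrowing, f agrees with the target everywhere
```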

3.1 Groups with Lower Bounds

It was shown in [Bar89] and [GTT06], respectively, that nonsolvable groups and nongroups can compute any conjunction of variables and their negations by a polynomial-size program. Any class of functions with this property is not $n$-narrowing, and by a standard adversary argument, it requires $2^n$ Membership queries to be identified. Therefore we have:

Fact 3. If $M$ is not a group, or if $M$ is a nonsolvable group, then $B(M)$ cannot be identified with a subexponential number of Membership queries.

Therefore, Membership learnability of classes of the form $B(M)$ is restricted, at most, to solvable groups. There are two maximal subclasses of solvable groups for which lower bounds on their computational power are known, and in both cases the lower bound is essentially a narrowing property.

Fact 4. 1. For every nilpotent group $M$ there is a constant $k$ such that $B(M)$ is $k$-narrowing [PT88]. Therefore $B(M)$ can be identified from $n^{O(k)}$ Membership queries (and possibly unbounded time).

2. For every group $G \in G_p \wr Ab$ there is a constant $c$ such that $B(G)$ is $(c \log s)$-narrowing [BST90]. Therefore, programs over $G$ of length $s$ can be identified from $n^{O(\log s)}$ Membership queries.

The next two theorems give specific, time-efficient versions of this strategy for Abelian groups and $G_p \wr Ab$ groups. These are, to our knowledge, new learning algorithms.

Theorem 5. For every Abelian group $G$, $B(G)$ is learnable from Membership queries in time $n^c$, for a constant $c = c(G)$.

Theorem 6. For every $G \in G_p \wr Ab$ with $p$ prime, $B(G)$ is learnable from Membership queries in $n^{c \log s}$ time, for a constant $c = c(G)$.

(Recall that $s$ stands for the length of the shortest program computing the target function.) Proofs are given in an Appendix in the extended version.

3.2 Interpretation in Circuit Terms

Let us now interpret these results in circuit terms. It is easy to see that programs over a fixed Abelian group are polynomially equivalent to boolean combinations of $MOD_m$ gates, for some $m$ depending on the group. Theorem 5 then implies:

Corollary 1. For every $m$, fixed-size boolean combinations of $MOD_m$ gates are learnable from Membership queries in time $n^c$, for $c = c(m)$.

Also, it is shown in [BST90] that programs over a fixed group in $G_p \wr Ab$ are polynomially equivalent to $MOD_p \circ MOD_m$ circuits. Such circuits were shown in [BBTV97] to be learnable from Membership and Equivalence queries in polynomial time, by showing that they have small Multiplicity Automata; a generalization of their construction is used in Section 5. Theorem 6 shows that Membership queries suffice, if quasipolynomial time is allowed:

Corollary 2. For every prime $p$ and every $m$, functions computed by $MOD_p \circ MOD_m$ circuits of size $s$ are learnable from Membership queries in time $n^{O(\log s)}$.

As an example, the 6-element permutation group on 3 points, $S_3$, can be described as a wreath product $Z_3 \wr Z_2$. Intuitively, each permutation can be described by a rotation and a flip, which interact when permutations are composed, so a direct product does not suffice. Programs over $S_3$ are polynomially equivalent to $MOD_3 \circ MOD_2$ circuits, and our result claims that they are learnable from $n^{c \log s}$ Membership queries for some $c$.

3.3 Open Questions on Groups and Related Work


While programs over nilpotent groups can be identified from polynomially many
Membership queries, we have not resolved whether a time-efficient algorithm
exists, even in the far more powerful PAC+Membership model. In other words,

we know that the values of such a program on all small-weight assignments are
sufficient to identify it uniquely, but can these values be used to efficiently predict
the value of the program on an arbitrary assignment?
In circuit terms, by results of [PT88], such programs can be shown to be polynomially equivalent to fixed-size boolean combinations of MOD_m ∘ NC^0 circuits or, equivalently, to polynomials of constant degree over Z_m. We are not even aware of algorithms learning a single MOD_m^A ∘ NC^0 circuit for arbitrary sets A. When m is prime, one can use Fermat's little theorem to make sure that the MOD_m gate receives only inputs summing to either 0 or 1, at the expense of increasing the arity of the NC^0 part. Then, one can set up a set of linear equations where the unknowns are the coefficients of the target polynomial and each small-weight assignment provides an equation with constant term either 0 or 1. The solution of this system must be equivalent to the target function.
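To make the prime case concrete, the following is a minimal sketch, under the simplifying assumption that the target is a multilinear polynomial of degree at most k over Z_p. In that case one does not even need to solve the linear system by elimination: Möbius inversion over the subset lattice gives each coefficient in closed form from the values on weight-at-most-k assignments. The oracle membership_query and the helper names are illustrative, not taken from the paper.

from itertools import combinations

def learn_multilinear_poly(membership_query, n, k, p):
    # Recover c_S for f(x) = sum_S c_S * prod_{i in S} x_i (degree <= k, mod p),
    # using c_S = sum_{T subset of S} (-1)^{|S|-|T|} f(1_T) (Moebius inversion).
    def point(S):
        return tuple(1 if i in S else 0 for i in range(n))
    # n^{O(k)} Membership queries, one per assignment of weight at most k.
    values = {S: membership_query(point(S)) % p
              for w in range(k + 1) for S in combinations(range(n), w)}
    coeffs = {}
    for S in values:
        c = 0
        for w in range(len(S) + 1):
            for T in combinations(S, w):
                c += (-1) ** (len(S) - w) * values[T]
        coeffs[S] = c % p
    return coeffs

def predict(coeffs, x, p):
    # Evaluate the recovered polynomial on an arbitrary assignment x.
    return sum(c for S, c in coeffs.items() if all(x[i] for i in S)) % p

For instance, with p = 3, n = 3 and target f(x) = x_0 + x_1 x_2 mod 3, the seven assignments of weight at most 2 already determine the two nonzero coefficients, and predict then answers correctly on all of {0, 1}^3.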
For solvable groups that are neither nilpotent nor in Gp ∗ Ab, the situation is even worse in the sense that we do not have lower bounds on their computational power, i.e., we cannot show that they are weaker than NC^1. Observe that any learning result would establish a separation from NC^1, conditional on the cryptographic assumptions under which NC^1 is nonlearnable. In another direction, while lower bounds do exist for MOD_p ∘ MOD_m circuits, we do not have them for MOD_p ∘ MOD_m ∘ NC^0; linear lower bounds for some particular cases were given in [CGPT06].
Let us note that programs over Abelian groups (equivalently, boolean combinations of MOD_m gates) are particular cases of the multi-symmetric concepts studied in [BCJ93]. Multi-symmetric concepts are there shown to be learnable from Membership and Equivalence queries, while we showed that for these particular cases Membership queries suffice. XORs of k-terms and depth-k decision trees are special cases of MOD_m ∘ NC^0 previously noticed to be learnable from Membership queries alone [BK].

4 Learning Intersection-Closed Classes

In this section we observe that classes of the form DL ∘ MOD_m^A ∘ NC^0 are learnable from Equivalence queries (for some particular combinations of m and accepting sets A). The algorithm is actually the combination of two well-known algorithms (plus the composition theorem to deal with NC^0): 1) the algorithm for learning submodules of a module in [HSW90] (though probably known before); 2) the algorithm extending it to nested differences of intersection-closed classes, also in [HSW90]. More interestingly, we show that the classes above have a natural algebraic interpretation, and use this interpretation to argue that they may be very close to a stopping barrier for a certain kind of learning.

4.1 The Learning Algorithm

Theorem 7. For every m and k, the class DL ∘ MOD_m^{[m]−{0}} ∘ NC^0_k is learnable from Equivalence queries in time polynomial in m, 2^{2^k}, and n^k.

Proof. (Sketch; details are given in the extended version.) By the composition theorem, it suffices to show that DL ∘ MOD_m^{[m]−{0}} is learnable from Equivalence queries, and we can even assume that the inputs to these MOD_m^{[m]−{0}} gates are variables (no negations, no constants). Now observe that this class of functions is the set of negations of nested differences of functions computed by MOD_m^{{0}} gates – where we identify a function with the set of assignments where it evaluates to 1. Furthermore, the set represented by every MOD_m^{{0}} gate is a submodule (a set closed under addition) of Z_m^n, and submodules are intersection-closed. Therefore, the algorithm in [HSW90] for learning nested differences of intersection-closed classes applies, and one can show it learns the class above with n^{O(1)} queries.
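As an illustration of the inner ingredient of this proof, here is a minimal sketch of learning a single submodule of Z_m^n from Equivalence queries by the closure strategy: the hypothesis is always the submodule generated by the counterexamples received so far, so every counterexample is a positive point that strictly enlarges the hypothesis. The equivalence_query oracle is a stand-in for the teacher, and the brute-force closure is only meant for tiny m and n; the full [HSW90] algorithm for nested differences is not reproduced here.

def submodule_closure(gens, m, n):
    # All elements of the submodule of Z_m^n generated by gens
    # (brute-force closure under addition; exponential in general).
    sub = {tuple([0] * n)}
    frontier = list(sub)
    while frontier:
        v = frontier.pop()
        for g in gens:
            w = tuple((a + b) % m for a, b in zip(v, g))
            if w not in sub:
                sub.add(w)
                frontier.append(w)
    return sub

def learn_submodule(equivalence_query, m, n):
    gens = []
    while True:
        hypothesis = submodule_closure(gens, m, n)
        ce = equivalence_query(hypothesis)  # None when hypothesis = target set
        if ce is None:
            return hypothesis
        gens.append(ce)  # counterexample lies in the target, not in the hypothesis

Each counterexample at least doubles the size of the hypothesis submodule, so at most n log_2 m Equivalence queries are made, in line with the polynomial query bound claimed above.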

Note that in Theorem 7 the running time does not depend on the length of the decision list that is being learned. In fact, as a byproduct of this proof one can see that the length of these decision lists can be limited to a polynomial in m and n^k without actually restricting the class of functions being computed. Intuitively, this is because there can be only as many linearly independent such MOD gates, and a node whose answer is determined by the previous ones in the decision list can be removed. Thus, for constant m and k, this class can compute at most 2^{n^{O(1)}} n-ary boolean functions and is not universal.
Also, note that we claim this result for MOD_m gates having all but 0 as accepting elements. In the special case that m is a prime p, we can deal with arbitrary accepting sets:

Theorem 8. For every prime p, every k, and arbitrary accepting sets A (possibly distinct at every MOD gate), the class DL ∘ MOD_p^A ∘ NC^0_k is learnable from Equivalence queries in time n^c, where c = c(p, k).

Proof. By Theorem 7, it suffices to show that every function in MOD_p^A is equivalent to a polynomially larger function in MOD_p^{[p]−{0}} ∘ NC^0_k, for some k depending on p. This is a standard use of Fermat's little theorem and details are omitted in this version.
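The identity behind this use of Fermat's little theorem can be checked mechanically: over Z_p with p prime, the 0/1 indicator of "s ≡ a (mod p)" coincides with 1 − (s − a)^{p−1} mod p, which is how a gate with an arbitrary accepting set is rewritten, at the cost of a constant-degree (hence NC^0-arity) increase. A small sanity check, with p = 5 chosen arbitrarily:

p = 5  # any prime works
for a in range(p):
    for s in range(p):
        # [s == a (mod p)] equals 1 - (s - a)^(p-1) mod p, by Fermat's little theorem
        assert (1 - pow(s - a, p - 1, p)) % p == (1 if s == a else 0)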

If we ignore the issue of proper learning and polynomials in the running time,
this subsumes at least the following known results:

– k-decision lists (which are DL ∘ NC^0) [Riv87]. k-decision lists in turn subsume k-CNF, k-DNF, and rank-k decision trees.
– Systems of equations over Z_m, i.e., DL ∘ MOD_m^{[m]−{0}}.
– Polynomials of constant degree over finite fields, restricted to boolean functions. When the field has prime cardinality p, these are equivalent to MOD_p ∘ NC^0.
– Strict width-2 branching programs [BBTV97]. This is because it can be shown that these are polynomially simulated by DL ∘ MOD_2 ∘ NC^0 (proof omitted in this version).

These are virtually all known results on learning Boolean functions in the pure PAC model (no Membership queries) that do not involve threshold gates or read-restrictions, neither of which can be captured in our algebraic setting. It is interesting that each of these classes, and in fact all of DL ∘ MOD_p^A ∘ NC^0_k, contain at most 2^{n^{O(1)}} functions. They are therefore not universal, i.e. they cannot represent all boolean functions. This fact seems more significant after we observe (in the next section) that a computationally equivalent class of monoids is in fact the largest one known to contain only non-universal monoids.

4.2 Interpretation in Algebraic Terms


Classes closely related to those in the previous section have precise algebraic interpretations. They involve the class DA of monoids, of which we give here an operational definition. Formal definitions can be found e.g. in [Sch76, GT03, Tes03, TT04].
Let M be a monoid in DA. Then the product of elements m1 , . . . , mn in M can
be determined by knowing the truth or falsehood of a fixed number of boolean
conditions of the form “m1 . . . mn , as a string over M , admits a factorization
of the form L0 a1 L1 a2 . . . ak Lk ”, where 1) the ai are elements of M , 2) each Li
is a language such that x ∈ Li can be determined solely by the set of letters
appearing in x, and 3) the expression L0 a1 L1 a2 . . . ak Lk is unambiguous, i.e.,
every string has at most one factorization in it.
As mentioned already in the introduction, it was shown in [GT03] that programs over monoids in DA are equivalent in power to decision trees of bounded rank [EH89], where the required rank of the decision trees is related to the parameter k in the definition of the particular DA monoid. In particular, programs over a fixed DA monoid can be simulated both by CNF and DNF formulas of size n^{O(1)} and by decision lists with bounded-length terms at the nodes, and can be learned in the PAC and Equivalence-query models [EH89, Riv87, Sim95].
We then have the following characterization:

Theorem 9. 1. B(DA ∗ Gnil) = ⋃_{m,k} DL ∘ MOD_m ∘ NC^0_k = ⋃_{m,k} DL ∘ MOD_m^{{0}} ∘ NC^0_k.
2. B(DA ∗ Gp) = ⋃_k DL ∘ MOD_p ∘ NC^0_k = ⋃_k DL ∘ MOD_p^{{0}} ∘ NC^0_k = ⋃_k DL ∘ MOD_p^{[p]−{0}} ∘ NC^0_k.
The proof is omitted in this version. Intuitively, nilpotent groups provide the "group" behavior of MOD_m ∘ NC^0 and decision lists are equivalent to DA ∘ NC^0. A key ingredient is the fact that a MOD_{p^α} gate can be simulated by a MOD_p ∘ NC^0 circuit; see e.g. [BT94] for a proof. The difference between parts (1) and (2) is again the possibility of using Fermat's little theorem to reduce gates to singleton accepting sets.
From this theorem and Theorem 8, it follows that we can learn programs over DA ∗ Gp monoids from Equivalence queries, yet we do not know, to our knowledge, how to learn programs over DA ∗ Gnil in any model. This algebraic interpretation lets us explore this gap in learnability and, in particular, the limitation of the learning paradigm in the previous subsection.
Since every p-group is nilpotent and it can be shown that DA ∗ Gnil monoids can only have nilpotent subgroups, we have

DA ∗ Gp ⊆ DA ∗ Gnil ⊆ DA ∗ G ∩ Mnil,
where Mnil is the class of monoids having only nilpotent subgroups. Yet, there is an important difference in what we know about DA ∗ Gp and DA ∗ Gnil. Following [Tes03, TT04], a monoid M is said to have the Polynomial Length Property (or PLP) if every program over M, regardless of its length, is equivalent to another one whose length is polynomial in n. Clearly, every monoid with the PLP is nonuniversal, and the converse is conjectured in [Tes03, TT04]. More specifically, the following was shown in [Tes03, TT04]:
– Every monoid not in DA ∗ G ∩ Mnil is universal.
– Every monoid in DA ∗ Gp has the PLP, hence is not universal.
The question of either PLP or universality is thus open for DA ∗ Gnil, sitting between DA ∗ Gp and DA ∗ G ∩ Mnil, so resolving its learnability may require new insights besides the intersection-closure/submodule-learning algorithm. Note that, contrary to what one could think, DA ∗ Gnil is not equal to DA ∗ G ∩ Mnil: there are monoids that, in this context, can be built by using unsolvable groups and later using homomorphisms to leave only nilpotent groups, and that cannot be obtained starting from nilpotent groups alone. Current techniques seem insufficient (and may remain so forever) to analyze even these traces of unsolvability.
Are there other extensions of DA ∗ Gp that we could investigate from the learning point of view? The obvious candidates are extensions of the DA or Gp parts separately. For the DA part, it is known [Sch76, Tes03] that every aperiodic monoid not in DA is necessarily divided by one of two well-identified monoids, named U and BA2. Monoid U is the syntactic monoid of the language {a, b}∗aa{a, b}∗, and programs over U are equivalent in power, up to polynomials, to DNF formulas. Therefore, by Fact 1, extending DA in this direction implies learning at least DNF. Monoid BA2 is the syntactic monoid of (ab)∗, and interestingly, although it is aperiodic, programs over it can be simulated (essentially) by OR gates fed by parity gates. In fact it is in DA ∗ Gp for every p, so we know it is learnable.
If we try to extend the group part, we have already mentioned that the two classes of groups beyond Gp for which we have lower bounds are Gnil and Gp ∗ Ab. We have already discussed the problems concerning DA ∗ Gnil. For Gp ∗ Ab, programs correspond to MOD_p ∘ MOD_m circuits, and we showed them to be learnable from Membership queries alone in the previous section. With Equivalence queries, however, learning MOD_p ∘ MOD_m would also imply learning MOD_p ∘ MOD_m ∘ NC^0 and, as discussed in the previous section, this seems difficult because we cannot currently even prove that these circuits cannot compute all of NC^1. In particular, even learning programs over S3 (i.e. MOD3 ∘ MOD2 circuits) from Equivalence queries alone seems unresolved at present.

5 Learning as Multiplicity Automata


The learning algorithm for multiplicity automata [BV96, BBB+ 00] elegantly
unified many previous results and also implied learnability of several new classes.

It has remained one of the “maximal” learning algorithms for boolean functions,
in the sense that no later result has superseded it.
Theorem 10. [BV96, BBB+00] Let F be any finite field. Functions Σ∗ → F represented as Multiplicity Automata over F are learnable from Evaluation and Equivalence queries in time polynomial in the size of the MA and |Σ|.
We can use Multiplicity Automata to compute boolean functions as follows: we take Σ = {0, 1} and some accepting subset A ⊆ F, and the function evaluates to 1 on an input if the MA outputs an element in A, and to 0 otherwise. However, as basically argued in [BBTV97], we can use Fermat's little theorem to turn an MA into one that always outputs either 0 or 1 (as field elements) with only polynomial blowup, and therefore we can omit the accepting subset.
In this section we identify a class of monoids that can be simulated by MA’s,
but not the other way round. Yet, it can simulate most classes of boolean func-
tions whose learnability was proved via the MA-learning algorithm.
Note that it will be impossible to find a class of monoids that, in our setting, is precisely equivalent (up to polynomial blowup) to the whole class of MA. This is true for the simple reason that the complexity of a function measured as "shortest program length" cannot grow under renaming of input variables: it suffices to change the variable names in the instructions of the program. MA, on the other hand, read their input in the fixed order x_1, . . . , x_n, so renaming the input variables in a function can force an exponential growth in MA size. Consider as an example the function ⋀_{i=1}^{n} (x_{2i−1} = x_{2i}): clearly, it is computed by an MA of size O(n) that simply checks equality of appropriate pairs of adjacent letters in its input string. However, its permutation ⋀_{i=1}^{n} (x_i = x_{2n−i+1}) is the palindrome function, whose MA size is roughly 2^n over any field.
Our characterization uses the notion of Mal'tsev product of two monoids A and B, denoted A ⓜ B. We do not define the algebraic operation formally. We use instead the following property, specific to our case [Wei87]: Let M be a monoid in LGp ⓜ Com, i.e., the Mal'tsev product of a monoid in Gp by one in Com. Then, the product in M of a string of elements m_1 . . . m_n can be determined from the truth of a fixed number of logical conditions of the following form: there are elements a_1, . . . , a_k in M, a number r ∈ [p], and commutative languages L_0, . . . , L_k over M such that the number of factorizations of m_1 . . . m_n of the form L_0 a_1 L_1 a_2 L_2 . . . L_{k−1} a_k L_k is r modulo p.
Contrived as it seems, LGp ⓜ Com is a natural borderline in representation theory. Recent and deep work by Margolis et al. [AMV05, AMSV09] shows that semigroups in LGp ⓜ Com are exactly those that can be embedded into a semigroup of upper-triangular matrices over a field of characteristic p (of any size).
The main result in this section is:
Theorem 11. Let M be a monoid in LGp ⓜ Com. Suppose that M is defined as above by a boolean combination of at most ℓ conditions of length at most k using commutative languages whose monoid has size C. Then every program of length s over M is equivalent to an MA over F_p of size (s + C)^c, where c = c(p, ℓ, k).

Corollary 3. Programs over monoids in LGp ⓜ Com are polynomially simulated by MAs over F_p that are direct sums of constant-width MA's.

Proof. (of Theorem 11) (Sketch). Fix a program (P, A) over M of length s. Let m_1, . . . , m_s be the sequence of elements in M produced by the instructions of P for a given input x_1 . . . x_n. The value of P for an input, hence whether it belongs to A, can be determined from the truth or falsehood of ℓ conditions as described above, each one given by a tuple of letters a_1, . . . , a_k and commutative languages L_0, . . . , L_k.
For each such condition, we build an MA to check it as follows: the MA is the direct sum of (s choose k) MA's, one for each choice of the positions where the a_1 . . . a_k witnessing a factorization could appear. Each MA concurrently checks that each of the chosen positions contains the right a_i (when the input variable producing the corresponding element m_j is available) and concurrently checks whether the subword w_i between a_i and a_{i+1} is in the language L_i. Crucially, since L_i is in Com, membership of w_i in L_i can be computed by a fixed-width automaton, regardless of the order in which the variables producing w_i are read. The automaton produces 0 if this check fails for some i, and 1 otherwise. It can be checked that the resulting automaton for each choice has size polynomial in s.
For each condition L_0 a_1 L_1 . . . a_k L_k, counting the number of factorizations mod p amounts to taking the MA's built for all possible guesses and adding them over F_p.
To conclude the proof, take all MA's resulting from the previous construction and raise them to the (p − 1)-th power. That increases their size polynomially (the dimension is raised to the power p − 1), and by Fermat's little theorem they become 0/1-valued. The boolean combination of several conditions can then be expressed by (a fixed number of) sums and products in F_p, with polynomial blowup.
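The operations used in this proof – direct sums to add MA's, tensor products to multiply them, and the Fermat powering – are standard manipulations of the linear representation of MA's. The following self-contained sketch over F_p is an illustrative rendering, not code from the paper:

import numpy as np

class MA:
    # Multiplicity automaton over F_p: f(w) = init . A[w_1] ... A[w_n] . final (mod p).
    def __init__(self, init, trans, final, p):
        self.init, self.trans, self.final, self.p = init, trans, final, p

    def value(self, word):
        v = self.init % self.p
        for a in word:
            v = (v @ self.trans[a]) % self.p
        return int((v @ self.final) % self.p)

def ma_sum(M1, M2):
    # Direct sum: the resulting MA computes f1(w) + f2(w) mod p.
    d1, d2 = len(M1.init), len(M2.init)
    trans = {a: np.block([[M1.trans[a], np.zeros((d1, d2), dtype=int)],
                          [np.zeros((d2, d1), dtype=int), M2.trans[a]]])
             for a in M1.trans}
    return MA(np.concatenate([M1.init, M2.init]), trans,
              np.concatenate([M1.final, M2.final]), M1.p)

def ma_product(M1, M2):
    # Tensor (Kronecker) product: the resulting MA computes f1(w) * f2(w) mod p.
    trans = {a: np.kron(M1.trans[a], M2.trans[a]) for a in M1.trans}
    return MA(np.kron(M1.init, M2.init), trans,
              np.kron(M1.final, M2.final), M1.p)

Raising an MA to the (p − 1)-th power, as in the last step of the proof, is then p − 2 applications of ma_product of the automaton with itself, which raises the dimension to the power p − 1 as announced.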

We next note that several classes that were previously shown to be learnable by showing they were polynomially simulated by MA can in fact be simulated by programs over LGp ⓜ Com.

Theorem 12. The following classes of boolean functions are polynomially simulated by programs over LGp ⓜ Com, hence are learnable from Membership and Equivalence queries as MA:

– Polynomials over F_p (when viewed as computing boolean functions).
– Unambiguous DNF functions; these include decision trees and k-term DNF for constant k.
– Constant-degree, depth-three, ΣΠΣ arithmetic circuits [KS06], when restricted to boolean functions.

An interesting case is that of O(log n)-term DNF. It was observed in [Kus97] that c log n-term DNF can be rewritten into DFA of size roughly n^c, hence learned from Membership and Equivalence queries by Angluin's algorithm [Ang87]. It is probably false that c log n-term DNF can be simulated by programs over a fixed monoid in LGp ⓜ Com. However, we note that for every c and n, c log n-term DNF is simulated by a monoid of size n^c that is easily computed from c and n and is commutative, hence in LGp ⓜ Com. (See the extended version for details.)
Finally, we conjecture that LGp ⓜ Com is the largest class of monoids that are polynomially simulated by MA, hence, the largest class we can expect to learn from MA within our algebraic framework:

Conjecture 1. If a monoid M is not in LGp ⓜ Com, then programs over M are not polynomially simulated by MA's over F_p.
The proof of this conjecture should be within reach given the characterization given in [TT06] of the monoids that are not in LGp ⓜ Com: this happens iff the monoid is divided by either the monoids U or BA2 described before, or by a so-called T_q monoid, or by a monoid whose commutator subgroup is not a p-group. It would thus suffice to show that programs over these four kinds of monoids cannot always be polynomially simulated by MA over F_p.

References
[AMSV09] Almeida, J., Margolis, S.W., Steinberg, B., Volkov, M.V.: Representa-
tion theory of finite semigroups, semigroup radicals and formal language
theory. Trans. Amer. Math. Soc. 361(2), 1429–1461 (2009)
[AMV05] Almeida, J., Margolis, S.W., Volkov, M.V.: The pseudovariety of semi-
groups of triangular matrices over a finite field. RAIRO - Theoretical
Informatics and Applications 39(1), 31–48 (2005)
[Ang87] Angluin, D.: Learning regular sets from queries and counterexamples.
Information and Computation 75, 87–106 (1987)
[Ang88] Angluin, D.: Queries and concept learning. Machine Learning 2, 319–342
(1988)
[Bar89] Barrington, D.A.: Bounded-width polynomial-size branching programs
recognize exactly those languages in NC1 . Journal of Computer and Sys-
tem Sciences 38, 150–164 (1989)
[BBB+ 00] Beimel, A., Bergadano, F., Bshouty, N.H., Kushilevitz, E., Varricchio, S.:
Learning functions represented as multiplicity automata. Journal of the
ACM 47, 506–530 (2000)
[BBTV97] Bergadano, F., Bshouty, N.H., Tamon, C., Varricchio, S.: On learning
branching programs and small depth circuits. In: Ben-David, S. (ed.)
EuroCOLT 1997. LNCS, vol. 1208, pp. 150–161. Springer, Heidelberg
(1997)
[BCJ93] Blum, A., Chalasani, P., Jackson, J.C.: On learning embedded symmetric
concepts. In: COLT, pp. 337–346 (1993)
[BK] Bshouty, N.H., Kushilevitz, E.: Learning from membership queries /
online learning. Course notes on N. Bshouty's homepage
[BST90] Mix Barrington, D.A., Straubing, H., Thérien, D.: Non-uniform automata
over groups. Information and Computation 89, 109–132 (1990)
[BT94] Beigel, R., Tarui, J.: On ACC. Computational Complexity 4, 350–366
(1994)
[BV96] Bergadano, F., Varricchio, S.: Learning behaviors of automata from mul-
tiplicity and equivalence queries. SIAM Journal on Computing 25, 1268–
1280 (1996)
An Algebraic Perspective on Boolean Function Learning 215

[CGPT06] Chattopadhyay, A., Goyal, N., Pudlák, P., Thérien, D.: Lower bounds for
circuits with modm gates. In: FOCS, pp. 709–718 (2006)
[CKK+ 07] Chattopadhyay, A., Krebs, A., Koucký, M., Szegedy, M., Tesson, P.,
Thérien, D.: Languages with bounded multiparty communication com-
plexity. In: Thomas, W., Weil, P. (eds.) STACS 2007. LNCS, vol. 4393,
pp. 500–511. Springer, Heidelberg (2007)
[EH89] Ehrenfeucht, A., Haussler, D.: Learning decision trees from random ex-
amples. Information and Computation 82(3), 231–246 (1989)
[GT03] Gavaldà, R., Thérien, D.: Algebraic characterizations of small classes of
boolean functions. In: Alt, H., Habib, M. (eds.) STACS 2003. LNCS,
vol. 2607, pp. 331–342. Springer, Heidelberg (2003)
[GTT06] Gavaldà, R., Tesson, P., Thérien, D.: Learning expressions and programs
over monoids. Inf. Comput. 204(2), 177–209 (2006)
[HS07] Hellerstein, L., Servedio, R.A.: On PAC learning algorithms for rich
boolean function classes. Theor. Comput. Sci. 384(1), 66–76 (2007)
[HSW90] Helmbold, D.P., Sloan, R.H., Warmuth, M.K.: Learning nested differences
of intersection-closed concept classes. Machine Learning 5, 165–196 (1990)
[KLPV87] Kearns, M.J., Li, M., Pitt, L., Valiant, L.G.: On the learnability of boolean
formulae. In: STOC, pp. 285–295 (1987)
[KS06] Klivans, A.R., Shpilka, A.: Learning restricted models of arithmetic cir-
cuits. Theory of Computing 2(1), 185–206 (2006)
[Kus97] Kushilevitz, E.: A simple algorithm for learning O(log n)-term DNF. Inf.
Process. Lett. 61(6), 289–292 (1997)
[PT88] Péladeau, P., Thérien, D.: Sur les langages reconnus par des groupes nilpo-
tents. Compte-rendus de l’Académie des Sciences de Paris, 93–95 (1988);
Translation to English as ECCC-TR01-040, Electronic Colloquium on
Computational Complexity (ECCC)
[Riv87] Rivest, R.L.: Learning decision lists. Machine Learning 2(3), 229–246
(1987)
[Sch76] Schützenberger, M.P.: Sur le produit de concaténation non ambigu. Semi-
group Forum 13, 47–75 (1976)
[She08] Sherstov, A.A.: Communication lower bounds using dual polynomials.
Bulletin of the EATCS 95, 59–93 (2008)
[Sim95] Simon, H.-U.: Learning decision lists and trees with equivalence-queries.
In: Vitányi, P.M.B. (ed.) EuroCOLT 1995. LNCS, vol. 904, pp. 322–336.
Springer, Heidelberg (1995)
[Tes03] Tesson, P.: Computational Complexity Questions Related to Finite
Monoids and Semigroups. PhD thesis, School of Computer Science,
McGill University (2003)
[TT04] Tesson, P., Thérien, D.: Monoids and computations. Intl. Journal of Al-
gebra and Computation 14(5-6), 801–816 (2004)
[TT05] Tesson, P., Thérien, D.: Complete classifications for the communication
complexity of regular languages. Theory Comput. Syst. 38(2), 135–159
(2005)
[TT06] Tesson, P., Thérien, D.: Bridges between algebraic automata theory and
complexity. Bull. of the EATCS 88, 37–64 (2006)
[Val84] Valiant, L.G.: A theory of the learnable. Communications of the ACM 27,
1134–1142 (1984)
[Wei87] Weil, P.: Closure of varieties of languages under products with counter.
J. of Comp. Syst. Sci. 2(3), 229–246 (1987)
Adaptive Estimation of the Optimal ROC Curve
and a Bipartite Ranking Algorithm

Stéphan Clémençon¹ and Nicolas Vayatis²

¹ LTCI, Telecom Paristech (TSI) - UMR Institut Telecom/CNRS 5141
stephan.clemencon@telecom-paristech.fr
² CMLA, ENS Cachan & UniverSud - UMR CNRS 8536
61, avenue du Président Wilson - 94235 Cachan cedex, France
nicolas.vayatis@cmla.ens-cachan.fr

Abstract. In this paper, we propose an adaptive algorithm for bipartite
ranking and prove its statistical performance in a stronger sense than the
AUC criterion. Our procedure builds on and significantly improves the
RankOver algorithm proposed in [1]. The algorithm outputs a piecewise
constant scoring rule which is obtained by overlaying a finite collection
of classifiers. Here, each of these classifiers is the empirical solution of a
specific minimum-volume set (MV-set) estimation problem. The major
novelty arises from the fact that the levels of the MV-sets to recover are
chosen adaptively from the data to adjust to the variability of the target
curve. The ROC curve of the estimated scoring rule may be interpreted as
an adaptive spline approximant of the optimal ROC curve. Error bounds
for the estimate of the optimal ROC curve in terms of the L∞ -distance
are also provided.

1 Introduction
For a few decades, ROC curves have been widely used as the gold standard for assessing performance in areas such as signal detection, medical diagnosis, and credit risk screening. More recently, ROC analysis has become an area of growing interest in Machine Learning. Various aspects are considered in this new
approach such as model evaluation, model selection, machine learning metrics
for evaluating performance, model construction, multiclass ROC, geometry of
the ROC space, confidence bands for ROC curves, improving performance of
classifiers, connection between classifiers and rankers, model manipulation (see
for instance [2] and references therein). We focus here on the problem of bipartite
ranking and the issue of ROC curve optimization. Previous work on bipartite
ranking ([3], [4], [5]) considered the AUC criterion as the optimization target.
However, this criterion is known to weight the errors uniformly while ranking
rules with similar AUC may behave very differently on a subset of the input
space.
In the paper, we focus on two problems: (i) the estimation of the optimal
ROC∗ , (ii) the construction of a consistent scoring rule whose ROC curve con-
verges in supremum norm to the optimal ROC∗ . In contrast to binary classifica-
tion or AUC maximization, the classical empirical risk minimization approach


cannot be invoked here because of the function-like nature of the performance measure and the use of the supremum norm as a metric. The approach taken
here follows the perspective sketched in [1], and further explored in [6]. In these
two papers, ranking rules made of overlaying classifiers were considered and the
RankOver algorithm was introduced. Dealing with a function-like optimization criterion such as the ROC curve requires performing both curve approximation and statistical estimation. In the RankOver algorithm, the approximation step is
conducted with a piecewise linear approximation with fixed breakpoints on the
false positive rate axis. The estimation part involves a collection of classification
problems with mass constraint. In [6], we improved this step by using a modified minimum-volume set approach inspired by [7] to solve this collection of
constrained classification problems. More precisely, our method can be under-
stood as a statistical version of a simple finite element method with an explicit
scheme: it produces an accurate spline estimate of the optimal curve in the ROC
space, together with a scoring rule whose ROC curve mimics the behavior of the
optimal one. In our previous work [1], [6], bounds on the generalization rate of
this ranking algorithm were obtained under strong conditions on the regularity
of the optimal ROC curve. Indeed, it was assumed that the optimal ROC curve
was twice continuously differentiable and that its derivative was bounded in the
neighborhood of the origin. The purpose of this paper is to relax these regularity
conditions. In particular, we provide an adaptive algorithm which selects break-
points for the approximation of the ROC curve by means of a data-driven scheme which takes into account the variability of the target curve. Hence, the
partition of the false positive rate axis is chosen according to the local regularity
of the optimal curve.
The paper is structured as follows. In Section 2, notations are set out and
important concepts of ROC analysis are briefly described. Section 3 is devoted
to the presentation of the adaptive approximation of the optimal ROC curve
with dyadic recursive partitioning. In Section 4, theoretical results related to
empirical minimum-volume set (MV-set) estimation are recalled. The adaptive
statistical method for estimating the optimal ROC curve and the related ranking
algorithm are presented in sections 5 and 6 respectively, together with the main
results of the paper. Proofs are postponed to the Appendix.

2 Setup
2.1 Probabilistic Model
The probabilistic setup is the same as the one in standard binary classification.
Here and throughout, (X, Y ) denotes a pair of random variables where Y ∈
{−1, +1} is a binary label and X models some observation for predicting Y , tak-
ing its values in a high-dimensional feature space X ⊂ Rd . The joint distribution
of (X, Y ) is entirely determined by the pair (μ, η) where μ denotes the marginal
distribution of X and the regression function η(x) = P{Y = +1 | X = x}, x ∈ X .
We also introduce the theoretical proportion p = P{Y = +1}, as well as G and
H, the conditional distributions of X given Y = +1 and Y = −1 respectively.

Throughout the paper, these probability measures are assumed to be absolutely continuous with respect to each other. Equipped with these notations, one may write η(x) = p(dG/dH)(x)/(1 − p + p(dG/dH)(x)) and μ = pG + (1 − p)H.

2.2 Bipartite Ranking and ROC Curves


We briefly recall the issue of the bipartite ranking task and describe the key
notions related to this statistical learning problem.
Based on the observation of i.i.d. examples Dn = {(X_i, Y_i) : 1 ≤ i ≤ n}, the goal is to learn how to order all instances x ∈ X in such a way that instances X with Y = +1 appear at the top of the list with the largest possible probability.
Clearly, the simplest way of defining an order relationship on X is to transport
the natural order on the real line to the feature space through a scoring rule
s : X → R. The notion of ROC curve, which we recall below, provides a func-
tional criterion for evaluating the performance of the ordering induced by such
a function. We denote by F −1 (t) = inf{u ∈ R : F (u) ≥ t} the pseudo-inverse
of any càd-làg increasing function F : R → R and by S the set of all scoring
functions, i.e. the space of real-valued measurable functions on X .
Definition 1. (ROC curve) Let s ∈ S. The ROC curve of the scoring function s(x) is the càd-làg curve given by

α ∈ [0, 1] ↦ ROC(s, α) = 1 − G_s ∘ H_s^{−1}(1 − α),

where G_s and H_s denote the conditional distributions of s(X) given Y = +1 and given Y = −1 respectively. We denote by ROC∗ the ROC curve for s = η.
When G_s(du) and H_s(du) are both continuous distributions, the ROC curve of s(x) is nothing else than the PP-plot:

t ↦ (P{s(X) ≥ t | Y = −1}, P{s(X) ≥ t | Y = +1}).   (1)

It is a well-known result in ROC analysis that increasing transforms of the regression function η(x) form the class S∗ of optimal scoring functions, in the sense that their ROC curve, namely ROC∗ = ROC(η, ·), dominates the ROC curve of any other scoring function s(x) everywhere:

∀α ∈ [0, 1[, ROC(s, α) ≤ ROC∗(α).


The proof of this fact is based on a simple application of Neyman-Pearson’s
lemma in hypothesis testing: the likelihood statistic
Φ(X) = (1 − p)η(X)/(p(1 − η(X)))
yields a uniformly most powerful statistical test for discriminating between the
composite hypotheses H0 : Y = −1 and H1 : Y = +1 (i.e. H0 : X ∼ H
and H1 : X ∼ G). Therefore, the power of any other test s(X) is smaller that
the test based on η(X) at the same level α. Recall also that, when continuous,
the curve ROC∗ is concave. Refer to [8] for a detailed list of properties of ROC
curves.

Remark 1. (Alternative convention.) Note that ROC curves may be alternatively defined through the representation given in formula (1). With this convention, jumps in the graph, due to possible degeneracy points of H∗ and G∗, are continuously connected by line segments, see [9] for instance.

Hence, a good scoring function is such that, for any level α ∈ (0, 1), the power ROC(s, α) of the test it defines is close to the optimal power ROC∗(α). The sup-norm

||ROC∗ − ROC(s, ·)||∞ = sup_{α∈(0,1)} {ROC∗(α) − ROC(s, α)}

provides a natural way of measuring the performance of a scoring rule s(x). The ROC curve of the scoring function s(x) can be straightforwardly estimated from the training dataset Dn by the stepwise function

α ∈ (0, 1) ↦ ROĈ(s, α) = 1 − Ĝ_s ∘ Ĥ_s^{−1}(1 − α),

where

Ĥ_s(t) = (1/n_−) Σ_{i: Y_i=−1} I{s(X_i) ≤ t} and Ĝ_s(t) = (1/n_+) Σ_{i: Y_i=+1} I{s(X_i) ≤ t},

with n_+ = Σ_{i=1}^{n} I{Y_i = +1} = n − n_−.
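For illustration, the stepwise estimate above can be computed directly from a sample. The following sketch assumes scores and labels are numpy arrays holding the s(X_i) and the Y_i ∈ {−1, +1} (names chosen here for illustration), and evaluates ROĈ(s, α) on a grid of levels by plugging in the empirical quantile of the negative scores:

import numpy as np

def empirical_roc(scores, labels, alphas):
    # ROChat(s, alpha) = 1 - Ghat_s(Hhat_s^{-1}(1 - alpha)), labels in {-1, +1}.
    neg = np.sort(scores[labels == -1])
    pos = scores[labels == +1]
    curve = []
    for alpha in alphas:
        # Hhat_s^{-1}(1 - alpha): order statistic of rank ceil((1 - alpha) * n_minus)
        idx = min(max(int(np.ceil((1 - alpha) * len(neg))) - 1, 0), len(neg) - 1)
        t = neg[idx]
        curve.append(float(np.mean(pos > t)))  # 1 - Ghat_s(t)
    return curve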
However, the target curve ROC∗ is unknown in practice and no empirical counterpart is directly available for the deviation ||ROC∗ − ROC(s, ·)||∞. For this reason, empirical risk minimization (ERM) strategies are generally based on the L1-distance, leading to the popular AUC criterion: minimizing ||ROC∗ − ROC(s, ·)||_{L1([0,1])} indeed boils down to maximizing

AUC(s) := ∫_{α=0}^{1} ROC(s, α) dα.

An empirical counterpart of the AUC may be built from the Mann-Whitney statistic, see [5] and the references therein. Beyond this computational advantage, it is noteworthy that two scoring functions may have the same AUC while their ROC curves present very different shapes. Since the L1-distance does not account for local properties of the ROC curve, we point out the importance of deriving strategies for ROC curve estimation and optimization whose convergence is validated in a stronger sense than the AUC. The goal of this paper is precisely to provide an adaptive procedure for estimating ROC∗ in sup norm under mild regularity conditions.
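The Mann-Whitney form of the empirical AUC mentioned above is simple enough to state as code. This is the direct O(n_+ n_−) version, counting the fraction of correctly ranked (positive, negative) score pairs with ties weighted 1/2; faster rank-based versions exist but are not needed for illustration:

def empirical_auc(scores_pos, scores_neg):
    # Empirical AUC as the Mann-Whitney two-sample statistic.
    total = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                total += 1.0
            elif sp == sn:
                total += 0.5
    return total / (len(scores_pos) * len(scores_neg))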
Regularity of the curve ROC∗. In the subsequent analysis, the following assumptions will be required.

A1 The conditional distributions G∗(dt) and H∗(dt) of the random variable η(X) are continuous.

A2 The cumulative distribution function H∗ is strictly increasing on the support of H∗(dt).
We recall that under these assumptions one may write the derivative of ROC∗ explicitly. For any α ∈ (0, 1), we denote by Q∗(α) the quantile of order (1 − α) of the conditional distribution of η(X) given Y = −1.

Lemma 1. [9]. Suppose that assumptions A1 − A2 are fulfilled. Let α ∈ (0, 1) such that Q∗(α) < 1. Then, ROC∗ is differentiable at α and

(dROC∗/dα)(α) = ((1 − p)/p) · Q∗(α)/(1 − Q∗(α)).
In [6], a statistical procedure for estimating the curve ROC∗, mimicking a linear spline approximation scheme, has been proposed in a very restrictive setup, stipulating that ROC∗ is of class C^2 with its two first derivatives bounded. As shown by the result above, boundedness of ROC∗′ means that Q∗(0) := lim_{α→0} Q∗(α) < 1, in other words that η(X) stays bounded away from 1, or equivalently that the likelihood ratio Φ(X) = (1 − p)η(X)/(p(1 − η(X))) remains bounded. It is the purpose of this paper to examine to what extent one may estimate ROC∗ under weaker assumptions (see assumption A5 below), including cases where it has a vertical tangent at the origin.

2.3 Ranking by Overlaying Classifiers

From the angle embraced in this paper, ranking amounts to recovering the decreasing collection of level sets of the regression function η(x):

{{x ∈ X | η(x) > u}, u ∈ [0, 1]},

without necessarily knowing the corresponding levels. Indeed, any scoring function of the form

s∗(x) = ∫_{0}^{1} I{η(x) > Q∗(α)} dν(α),   (2)

where ν(dα) is an arbitrary finite positive measure on [0, 1] with the same support as the distribution H∗, is optimal with respect to the ROC criterion. The next proposition also illustrates this view on the problem. We set the notations:

R∗_α = {x ∈ X | η(x) > Q∗(α)} and R_{s,α} = {x ∈ X | s(x) > Q(s(X), α)},

where Q(s(X), α) is the quantile of order (1 − α) of the conditional distribution of s(X) given Y = −1.
Proposition 1. [6]. Let s be a scoring function and α ∈ (0, 1) such that Q∗(α) < 1. Suppose additionally that the cdf H_s (respectively, H∗) is continuous at Q(s(X), α) (resp. at Q∗(α)). Then, we have:

ROC∗(α) − ROC(s, α) = E(|η(X) − Q∗(α)| I{X ∈ R∗_α Δ R_{s,α}}) / (p(1 − Q∗(α))),

where Δ denotes the symmetric difference between sets.

This result shows that the pointwise difference between the dominating ROC curve and the one related to a candidate scoring function s may be interpreted as the error made in recovering the specific level set R∗_α through R_{s,α}.

3 Adaptive Approximation

Here we focus on very simple approximants of ROC∗, taken as piecewise constant curves. Precisely, to any subdivision σ : α_0 = 0 < α_1 < . . . < α_K < α_{K+1} = 1 of the unit interval, we associate the curve given by: ∀α ∈ (0, 1),

E_σ(ROC∗)(α) = Σ_{k=0}^{K} I{α ∈ [α_k, α_{k+1}[} · ROC∗(α_k).   (3)

We point out that the approximant E_σ(ROC∗)(α) is actually a ROC curve. It coincides indeed with ROC(s∗_σ, ·) where s∗_σ is the piecewise constant scoring function given by:

∀x ∈ X, s∗_σ(x) = Σ_{k=1}^{K+1} I{x ∈ R∗_{α_k}},   (4)

which is obtained by "overlaying" the regression level sets R∗_{α_k} = {x ∈ X : η(x) > Q∗(α_k)}, 1 ≤ k ≤ K.
Adaptive approximation. It is well-known folklore in free-knot spline theory that the approximation rate in supremum norm by piecewise constant functions with at most K pieces is of the order O(K^{−1}) if and only if the target function belongs to the space BV([0, 1])¹, see Chapter 12 in [10]. From a practical perspective however, in the absence of full knowledge of the target curve, it is a very challenging task to determine a grid of points {α_k : 1 ≤ k ≤ K} that yields a nearly optimal approximant. In the case where the points of the mesh grid are fixed in advance independently from the curve f to approximate, say with uniform spacing, the rate of approximation is of optimal order O(K^{−1}) if and only if f belongs to the space Lip1([0, 1]) of absolutely continuous functions f such that f′ ∈ L∞([0, 1]). The latter condition is precisely the type of assumption we would like to avoid in the present work. We propose to use adaptive approximation schemes instead of fixed grids. In such procedures, the mesh grid is progressively refined by adding new breakpoints, as further information about the local variation of the target is gained: this way, one uses a coarse mesh where the target is smooth, and a finer mesh where it exhibits high degrees of variability. Given the properties of the target ROC∗ (a concave and nondecreasing curve connecting (0, 0) to (1, 1)), an ideal mesh grid should be finer and finer as one gets close to the origin.
¹ Recall that the space BV([0, 1]) of functions of bounded variation on (0, 1) is the space of absolutely continuous functions f : (0, 1) → R such that f′ ∈ L1([0, 1]).

Dyadic recursive partitioning. For computational reasons, here we shall restrict ourselves to a dyadic grid of points α_{j,k} = k2^{−j}, with j ∈ N and k ∈ {0, . . . , 2^j − 1}, and to partitions of the unit interval [0, 1] produced by recursive dyadic partitioning: any dyadic interval I_{j,k} = [α_{j,k}, α_{j,k+1}) is possibly split into two halves, producing two siblings I_{j+1,2k} and I_{j+1,2k+1}, depending
on the (estimated) local properties of the target curve. The adaptive estimation
algorithm described in the next section will then appear as a top-down search
strategy through a tree structure T , on which the Ij,k ’s are aligned. Precisely,
we will consider approximants of the form:

Σ_{I_{j,k} ∈ {terminal nodes}} ROC∗(α_{j,k}) · I{α ∈ [α_{j,k}, α_{j,k+1})},

where the sum is taken over all dyadic intervals corresponding to terminal nodes, determined by weights ω(·) fulfilling the two conditions:
(i) (Keep-or-kill) For any dyadic interval I ⊂ [0, 1), the weight ω(I) belongs to {0, 1}.
(ii) (Heredity) If ω(I) = 1, then for any dyadic interval I′ such that I ⊂ I′, we have ω(I′) = 1. If ω(I) = 0, then for any dyadic subinterval I′ ⊂ I, we have ω(I′) = 0.
Each collection ω of weights satisfying these two constraints is said to be admissible and determines the nodes of a subtree T_ω of the tree T representing the set of all dyadic intervals. A dyadic subinterval I will be said to be terminal when ω(I) = 1 and ω(I′) = 0 for any dyadic subinterval I′ ⊂ I: terminal subintervals correspond to the outer leaves of T_ω and form a partition P_ω of [0, 1). The algorithm described in the next section consists of selecting those intervals, i.e. the set ω. We denote by σ_ω the mesh grid made of endpoints of terminal subintervals selected by the collection of weights ω. Given two admissible sequences of weights ω_1 and ω_2, the mesh σ_{ω_1} is said to be finer than σ_{ω_2} when {I : ω_2(I) = 0} ⊂ {I : ω_1(I) = 0}.

4 Empirical MV-Set Estimation


Beyond the functional approximation facet of the problem, another key ingredi-
ent of the estimation procedure consists of estimating specific points
(αj,k , ROC∗ (αj,k )) = (H(Rα

j,k

), G(Rα j,k
))
lying on the optimal ROC curve, in order to gain information both about its location in the ROC space and about the way it locally varies. Following in the footsteps of [6], a constructive approach to this problem lies in viewing X \ R∗_α as the solution of the following minimum-volume set (MV-set) estimation problem:

min_{W∈B(X)} G(W) subject to H(W) > 1 − α,

where the minimum is taken over the set B(X) of all measurable subsets W ⊂ X. Equivalently, this boils down to solving the constrained optimization problem:

sup_{R∈B(X)} G(R) subject to H(R) ≤ α.

From a statistical perspective, the search should be based on the empirical distributions:

Ĥ(dx) = (1/n_−) Σ_{i=1}^{n} I{Y_i = −1} · δ_{X_i}(dx) and Ĝ(dx) = (1/n_+) Σ_{i=1}^{n} I{Y_i = +1} · δ_{X_i}(dx),

where δ_x denotes the Dirac mass at x ∈ X. An empirical version of the optimization problem above is then

OP(α, φ):   sup_{R∈R} Ĝ(R) subject to Ĥ(R) ≤ α + φ,

where φ is a complexity penalty and R a class of measurable subsets of X. We denote by R̂_α a solution of this problem. The success of this program hinges upon the richness of the class R and the calibration of the tolerance parameter φ, as shown by the next result established in [6] (see also [11]) and involving the following technical assumptions.

A3 For all α ∈ (0, 1), we have R∗_α ∈ R.
A4 The set R is such that the Rademacher average

A_n = E[ sup_{R∈R} (1/n) |Σ_{i=1}^{n} ε_i I{X_i ∈ R}| ],

where the ε_i denote independent Rademacher random variables, is of order O(n^{−1/2}).
Note that the assumption A4 is satisfied, for instance, when R is a VC class (see
for instance [12] for the use of Rademacher averages in complexity control).
Theorem 1. [6]. Suppose that assumptions A1 − A4 are fulfilled and for all δ ∈ (0, 1), set

φ = φ(δ, n) = 2A_n + √(2 log(1/δ)/n).

Then, there exists a constant c < ∞ such that for all δ ∈ (0, 1), we have with probability at least 1 − δ: ∀n ∈ N∗, ∀α ∈ (0, 1),

H(R̂_α) ≤ α + 2φ(δ/2, n) and G(R̂_α) ≥ ROC∗(α) − 2φ(δ/2, n).

Remark 2. (Regularity vs. noise condition) Under the additional condition that the distribution of η(X), denoted by F∗ = pG∗ + (1 − p)H∗, has a bounded density f∗, the following extension of Tsybakov's noise condition ([13]) is fulfilled for any α ∈ (0, 1): ∀t ≥ 0,

P{|η(X) − Q∗(α)| ≤ t} ≤ c · t^{a/(1−a)},   (5)

with a = 1/2 and c = sup_t f∗(t). Notice that this condition is incompatible with assumption A2 when a > 1/2. It has been shown in [6] (see Theorem 12 therein) that, under this assumption, the deviation ROC∗(α) − G(R̂_α) is then of order O(n^{−5/8}).

Adaptive estimation - Algorithm 1

(Input.) Target tolerance ε ∈ (0, 1). Volume tolerance φ > 0. Training data D_n = {(X_i, Y_i) : 1 ≤ i ≤ n}. Class R of level set candidates.
1. (Initialization.) Set ω̂(I_{0,0}) = 0 and ω̂(I) = 1 for every dyadic subinterval I ⊊ I_{0,0} = [0, 1). Take β̂_{0,0} = 0 and β̂_{0,1} = 1.
2. (Iterations.) For all j ≥ 0, for all k ∈ {0, . . . , 2^j − 1}: if ω̂(I_{j,k}) = 0, then
(a) Compute Ê(I_{j,k}) = β̂_{j,k+1} − β̂_{j,k}.
(b) If Ê(I_{j,k}) > ε, then
i. set ω̂(I_{j+1,2k}) = ω̂(I_{j+1,2k+1}) = 0,
ii. solve the problem OP(α_{j+1,2k+1}, φ), with solution R̂_{α_{j+1,2k+1}}, and set β̂_{j+1,2k+1} = Ĝ(R̂_{α_{j+1,2k+1}}),
iii. update: β̂_{j+1,2k} = β̂_{j,k} and β̂_{j+1,2k+2} = β̂_{j,k+1}.
(c) Else, leave the weights of the siblings of I_{j,k} unchanged.
3. (Stopping rule.) The algorithm terminates as soon as the weights ω̂(·) of the nodes of the current level j are all equal to 1.
(Output.) Let σ̂ be the collection of dyadic levels α_{j,k} corresponding to the terminal nodes defined by ω̂. Compute the ROC∗ estimate:

ROĈ∗(α) = Σ_{α_{j,k} ∈ σ̂} Ĝ(R̂_{α_{j,k}}) · I{α ∈ I_{j,k}}.
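As a minimal illustration of the control flow of Algorithm 1 (not of the MV-set estimation itself), the following sketch performs the recursive dyadic refinement, delegating the statistical work to a hypothetical solve_mv(alpha) oracle that solves OP(α, φ) and returns Ĝ(R̂_α); the explicit max_depth guard plays the role of the stopping rule discussed in Remark 3 below:

def adaptive_roc_estimate(solve_mv, eps, max_depth):
    # Split a dyadic interval [a, b) while the empirical local error
    # beta[b] - beta[a] exceeds the target tolerance eps.
    beta = {0.0: 0.0, 1.0: 1.0}           # beta[alpha] ~ estimate of ROC*(alpha)
    pieces, stack = [], [(0.0, 1.0, 0)]
    while stack:
        a, b, depth = stack.pop()
        if beta[b] - beta[a] > eps and depth < max_depth:
            mid = (a + b) / 2
            beta[mid] = solve_mv(mid)     # one MV-set estimation per new breakpoint
            stack.append((mid, b, depth + 1))
            stack.append((a, mid, depth + 1))
        else:
            pieces.append((a, beta[a]))   # terminal piece: constant beta[a] on [a, b)
    return sorted(pieces)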

5 Adaptive Estimation of the Optimal ROC Curve


In this section we describe an adaptive algorithm for estimating the optimal curve ROC∗ by piecewise constants. It should be interpreted as a statistical version of the adaptive approximation scheme studied in [14]. We emphasize that the crucial difference with the approach developed in [6] is that, here, the mesh grid used for computing the ROC∗ estimate – both the number of grid points and their locations – is entirely learnt from the data. In this respect, we define the empirical local error estimate on the subinterval I = [α_1, α_2) ⊂ [0, 1) as

Ê(I) = Ĝ(R̂_{α_2}) − Ĝ(R̂_{α_1}).

The quantity Ê(I) is nonnegative (by construction, the mapping α ∈ (0, 1) ↦ Ĝ(R̂_α) is nondecreasing with probability one) and should be viewed as an empirical counterpart of E(I) := ROC∗(α_2) − ROC∗(α_1), which provides a simple way of estimating the variability of the (nondecreasing) function ROC∗ on I. This measure is additive, as is its statistical version Ê(·):

E(I_1 ∪ I_2) = E(I_1) + E(I_2)



for any siblings I_1 and I_2 of the same subinterval. It controls the approximation rate of ROC∗ by a constant on any interval I ⊂ [0, 1) in the sense that:

inf_{c∈[0,1)} ||ROC∗(·) − c||_{L∞(I)} ≤ E(I).

The adaptive algorithm designated as 'Algorithm 1' is based on the following principle: a dyadic subinterval I will be part of the final partition of the false positive rate axis whenever the empirical local error has not met the tolerance on any of its ancestors J ⊃ I but meets the tolerance on I itself. We point out that, by construction, the sequence ω̂ produced by Algorithm 1 is admissible.
Remark 3. (On the stopping rule) One should notice that, as Ĥ(R) ∈ {k/n : k = 0, . . . , n} for any R ∈ R, the estimation algorithm necessarily stops before exceeding the level j = j(n) = log(n)/log(2): the empirical estimate ROĈ∗ has no more than 2^{j(n)} pieces.
We now establish a rate of convergence for Algorithm 1. The following assumption shall be required. It classically permits controlling the rate at which the derivative of ROC∗(α) may go to infinity as α tends to zero, see [15].

A5 The derivative ROC∗′ belongs to the space L log L of Borel functions f : (0, 1) → R such that:

||f||_{L log L} := ∫_{α=0}^{1} (1 + log |f(α)|)|f(α)| dα < ∞.

The next result provides a bound for the rate of the estimator produced by Algorithm 1.

Theorem 2. Let δ ∈ (0, 1). Suppose that assumptions A1 − A5 are fulfilled. Take ε = ε(δ, n) := 7φ(δ/2, n). Then, we have, with probability at least 1 − δ:

∀n ≥ 1, ||ROC∗ − ROĈ∗||∞ ≤ 16φ(δ/2, n).

Moreover, the number of terminal nodes in the output of Algorithm 1 satisfies:

#σ̂ ≤ κ ||ROC∗′||_{L log L} / φ(δ/2, n), for some constant κ < ∞.   (6)
Corollary 1. Let δ ∈ (0, 1). Suppose that assumptions A1 − A5 are fulfilled. Take ε and φ of the order of √(n^{−1} log(1/δ)). Then, there exists a constant c such that we have, with probability at least 1 − δ:

∀n ≥ 1, ||ROC∗ − ROĈ∗||∞ ≤ c/√n + √(2 log(1/δ)/n),

and the adaptive Algorithm 1 builds a partition σ̂ whose cardinality is at most of the order of √n.

Remark 4. (On the rate of convergence.) When assuming ROC∗ of class C^1 on [0, 1] (which implies in particular that ROC∗′ is bounded in the neighborhood of 0), it may be shown that a piecewise constant estimate with rate O(n^{−1/2}) can be built using K = O(n^{1/2}) equispaced grid points, cf. [6]. It is remarkable that, with the adaptive scheme of Algorithm 1, comparable performance is achieved, while relaxing significantly the smoothness assumption on ROC∗.
Remark 5. (On lower bounds.) To our knowledge, no lower bound result related to statistical estimation of ROC∗ in sup norm is currently available in the literature. Intuition indicates that the rate O(n^{−1/2}) is accurate, insofar as, in the absence of further assumptions, it is likely the best rate that can be obtained for the MV-set estimation problem, and consequently for local estimation of ROC∗ at a given point α ∈ (0, 1).

6 Adaptive Ranking Algorithm


We now tackle the problem of building a scoring function s(x) whose ROC curve is asymptotically close to the empirical estimate ROĈ∗. In general, the latter is not a ROC curve: by construction, the sequence of the R̂_{α_{j,k}}, (j, k) ∈ σ̂, sorted by increasing order of magnitude of their level α_{j,k}, is not necessarily increasing, in contrast to the true level sets R∗_{α_{j,k}}. This induces an additional 'Monotonicity' step in Algorithm 2, before overlaying the estimated sets.

Adaptive RankOver - Algorithm 2

(Input.) Target tolerance ε ∈ (0, 1). Volume tolerance φ > 0. Training data D_n = {(X_i, Y_i) : 1 ≤ i ≤ n}. Class R of level set candidates.
1. (Algorithm 1.) Run Algorithm 1, in order to get the regression level estimates R̂_{α(1)}, . . . , R̂_{α(K̂)}, where K̂ = #σ̂ and 0 = α(1) < . . . < α(K̂) < 1.
2. (Monotonicity.) Form recursively the nondecreasing sequence (R_{α(k)}) defined by: for 1 ≤ k < K̂,

R_{α(1)} = R̂_{α(1)} and R_{α(k+1)} = R_{α(k)} ∪ R̂_{α(k+1)}.

(Output.) Build the piecewise constant scoring function:

s̃(x) = Σ_{k=1}^{K̂} I{x ∈ R_{α(k)}}.
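The 'Monotonicity' and overlay steps of Algorithm 2 amount to taking cumulative unions and counting memberships. A small sketch, representing each estimated level set as a Python set purely for illustration:

def overlay_scoring_rule(level_set_estimates):
    # Given [Rhat_{alpha(1)}, ..., Rhat_{alpha(K)}] from Algorithm 1, form the
    # nondecreasing sequence R_{alpha(k+1)} = R_{alpha(k)} | Rhat_{alpha(k+1)}
    # and return the piecewise constant rule s(x) = sum_k 1{x in R_{alpha(k)}}.
    monotone, current = [], set()
    for R in level_set_estimates:
        current = current | R
        monotone.append(frozenset(current))
    def score(x):
        return sum(1 for R in monotone if x in R)
    return score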

Remark 6. (Top-down vs. Bottom-up) Alternatively, a monotone sequence of sets can be built from the collection {R̂_{α(k)}, 1 ≤ k ≤ K̂} in the following way: set R̄_{α(K̂)} = R̂_{α(K̂)} and R̄_{α(k)} = R̄_{α(k+1)} ∩ R̂_{α(k)} for k = K̂ − 1, . . . , 1. A result similar to the one stated below can be established for s̄(x) = Σ_{k=1}^{K̂} I{x ∈ R̄_{α(k)}}.

The next theorem states the consistency of the estimated scoring function under the same complexity and regularity assumptions.

Theorem 3. Let δ ∈ (0, 1). Suppose that assumptions A1 − A5 are fulfilled. Take a target tolerance ε of the order of n^{−1/6}. Then, there exists a constant c = c(δ) > 0 such that we have with probability at least 1 − δ: ∀n ≥ 1,

||ROC(s̃, ·) − ROC∗||∞ ≤ c √(log n / n^{1/3}).

We observe that the rate of convergence of the order of n^{−1/6} obtained in Theorem 3 is much slower than the n^{−1/3} rate obtained in [6]. This is due to the fact that we relaxed the regularity assumptions on the optimal ROC curve and used the approximation space made of piecewise constant curves, while we used piecewise linear scoring curves before. We expect that, using nonlinear approximation techniques, the n^{−1/6} rate can be significantly improved, but we leave this issue open for future work.

7 Conclusion

In this paper, we have seen how strong consistency of a piecewise constant es-
timate of the optimal ROC curve can be guaranteed under weak regularity as-
sumptions. Additionally, our approach leads to a strongly consistent piecewise
constant scoring rule in terms of ROC curve performance. Whereas the subdivi-
sion of the false positive rate axis used for building the ROC curve approximant
had to be fixed in advance in the original RankOver approach proposed in [6],
which was viewed as a severe restriction on its applicability, the essential novelty
of the two algorithms presented here lies in their ability to adapt automatically
to the variability of the (unknown) optimal ROC curve.

References
1. Clémençon, S., Vayatis, N.: Overlaying classifiers: a practical approach for optimal
ranking. In: NIPS 2008: Proceedings of the 2008 conference on Advances in neural
information processing systems, Vancouver, Canada, pp. 313–320 (2009)
2. Flach, P.: Tutorial on "The many faces of ROC analysis in machine learning". In:
ICML 2004 (2004)
3. Freund, Y., Iyer, R.D., Schapire, R.E., Singer, Y.: An efficient boosting algorithm
for combining preferences. Journal of Machine Learning Research 4, 933–969 (2003)
4. Agarwal, S., Graepel, T., Herbrich, R., Har-Peled, S., Roth, D.: Generalization
bounds for the area under the ROC curve. Journal of Machine Learning Research 6,
393–425 (2005)
5. Clémençon, S., Lugosi, G., Vayatis, N.: Ranking and empirical risk minimization
of U-statistics. The Annals of Statistics 36(2), 844–874 (2008)
6. Clémençon, S., Vayatis, N.: Overlaying classifiers: a practical approach to optimal
scoring. To appear in Constructive Approximation (hal-00341246) (2009)
228 S. Clémençon and N. Vayatis

7. Scott, C., Nowak, R.: Learning minimum volume sets. Journal of Machine Learning
Research 7, 665–704 (2006)
8. Clémençon, S., Vayatis, N.: Tree-structured ranking rules and approximation of
the optimal ROC curve. Technical Report hal-00268068, HAL (2008)
9. Clémençon, S., Vayatis, N.: Tree-structured ranking rules and approximation of
the optimal ROC curve. In: ALT 2008: Proceedings of the 2008 conference on
Algorithmic Learning Theory (2008)
10. Devore, R., Lorentz, G.: Constructive Approximation. Springer, Heidelberg (1993)
11. Scott, C., Nowak, R.: A Neyman-Pearson approach to statistical learning. IEEE
Transactions on Information Theory 51(11), 3806–3819 (2005)
12. Boucheron, S., Bousquet, O., Lugosi, G.: Theory of Classification: A Survey of
Some Recent Advances. ESAIM: Probability and Statistics 9, 323–375 (2005)
13. Tsybakov, A.: Optimal aggregation of classifiers in statistical learning. Annals of
Statistics 32(1), 135–166 (2004)
14. Devore, R.: A note on adaptive approximation. Approx. Theory Appl. 3, 74–78
(1987)
15. Bennett, C., Sharpley, R.: Interpolation of Operators. Academic Press, London
(1988)
16. Devore, R.: Nonlinear approximation. Acta Numerica, 51–150 (1998)

Appendix - Proofs

Proof of Theorem 2

We first prove a lemma which quantifies the uniform deviation of the empirical local error from the true local error over all dyadic scales.

Lemma 2. (Uniform deviation) Suppose that assumptions A3 − A4 are satisfied. Let δ ∈ (0, 1). With probability at least 1 − δ, we have: ∀n ≥ 1,

sup_{j≥0, 0≤k<2^j} |Ê(I_{j,k}) − E(I_{j,k})| ≤ 6φ(δ/2, n).

Proof. In the first place, we observe that

(1/2) sup_{j≥0, 0≤k<2^j} |Ê(I_{j,k}) − E(I_{j,k})| ≤ max_{j≥0, 0≤k<2^j} |Ĝ(R̂_{α_{j,k}}) − ROC∗(α_{j,k})| + sup_{R∈R} |Ĝ(R) − G(R)|.

The proof then immediately follows from the complexity assumption A4, combined with Theorem 1. □


It follows from Lemma 2 that, with probability at least 1 − δ: ∀j ≤ log_2 n, ∀k ∈ {0, . . . , 2^j − 1},

Ê(I_{j,k}) ≤ E(I_{j,k}) + 6φ(δ/2, n) and Ê(I_{j,k}) ≥ E(I_{j,k}) − 6φ(δ/2, n).

We now introduce the notation σ_ε for partitions based on the optimal ROC curve at a target tolerance of ε. Let ε > 0 and consider the piecewise constant approximant built from the same recursive strategy as the one implemented by Algorithm 1, except that it is based on the (theoretical) error estimate E(·): E_{σ_ε}(ROC∗), σ_ε denoting the associated mesh grid.
Choosing ε = 7φ(δ/2, n), we obtain that, with probability larger than 1 − δ, the mesh grid σ̂ is finer than σ_{ε_1} where ε_1 = ε_1(δ, n) = ε + 6φ(δ/2, n) = 13φ(δ/2, n), but coarser than σ_{ε_0} with ε_0 = ε_0(δ, n) = ε − 6φ(δ/2, n) = φ(δ/2, n). We thus have

||E_{σ̂}(ROC∗) − ROC∗||∞ ≤ ||E_{σ_{ε_1}}(ROC∗) − ROC∗||∞ ≤ ε_1.
Now we use the following decomposition:

$$\|\widehat{\mathrm{ROC}}^* - \mathrm{ROC}^*\|_\infty \le \|\mathrm{ROC}^* - E_{\widehat{\sigma}}(\mathrm{ROC}^*)\|_\infty + \|E_{\widehat{\sigma}}(\mathrm{ROC}^*) - \widehat{\mathrm{ROC}}^*\|_\infty.$$

We have seen that the first term is bounded, with probability at least 1 − δ, by $\epsilon_1$. On the same event, we have:

$$\|E_{\widehat{\sigma}}(\mathrm{ROC}^*) - \widehat{\mathrm{ROC}}^*\|_\infty \le \max_{1 \le k \le \widehat{K}} \big|G(R_{\widehat{\alpha}(k)}) - \widehat{G}(\widehat{R}_{\widehat{\alpha}(k)})\big| \le \max_{1 \le k \le \widehat{K}} \big|G(R_{\widehat{\alpha}(k)}) - G(\widehat{R}_{\widehat{\alpha}(k)})\big| + \sup_{R \in \mathcal{R}} \big|\widehat{G}(R) - G(R)\big| \le 3\,\phi(\delta/2, n),$$

where we have used Theorem 1 and a concentration inequality to derive the last inequality. We have thus proved the estimation error rate of 16φ(δ/2, n) for the output $\widehat{\mathrm{ROC}}^*$ of Algorithm 1.
We now show the bound on the cardinality of the partition as a function of the target tolerance parameter. Let us denote by $K_\epsilon$ ($= \#\sigma_\epsilon$) the number of pieces forming this approximant. We have the following result.

Lemma 3 (Approximation rate). Suppose that assumptions A1, A2 and A5 are fulfilled. There exists a universal constant κ > 0 such that, for all ε > 0:

$$K_\epsilon \le \frac{\kappa}{\epsilon}\,\|\mathrm{ROC}^*\|_{L\log L}.$$

For a proof of this lemma, we refer to [14], and also to Subsection 3.3 in [16] for more insights on adaptive approximation methods. It reveals that the number of pieces forming $\widehat{\mathrm{ROC}}^*$, i.e. the cardinality of $\sigma_{\epsilon_0(\delta,n)}$, is bounded by

$$\#\sigma_{\epsilon_0(\delta,n)} \le \kappa\,\frac{\|\mathrm{ROC}^*\|_{L\log L}}{\phi(\delta/2, n)}. \qquad (7)$$

In short, as regards nonlinear approximation of ROC∗, the performance of the mesh grid $\widehat{\sigma}$ selected empirically is comparable to that of the ideal subdivision $\sigma_\epsilon$, which would be obtained if an oracle could supply perfect information about the local variability of ROC∗. □
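To fix ideas, here is a minimal Python sketch of the recursive dyadic refinement underlying the construction of $\sigma_\epsilon$ (and, with an empirical error estimate, of $\widehat{\sigma}$); the error functional `err` and the tolerance `eps` are placeholders for the quantities of Algorithm 1, which is not reproduced here.

def dyadic_mesh(err, eps, j=0, k=0, max_depth=20):
    """Recursive dyadic partitioning of [0, 1]: keep the cell
    I_{j,k} = [k 2^-j, (k+1) 2^-j) if its (estimated) approximation
    error err(j, k) is at most eps, otherwise split it in two."""
    if err(j, k) <= eps or j >= max_depth:
        return [(j, k)]
    return (dyadic_mesh(err, eps, j + 1, 2 * k, max_depth)
            + dyadic_mesh(err, eps, j + 1, 2 * k + 1, max_depth))

The list of retained cells plays the role of the mesh grid; replacing the theoretical error estimate by its empirical counterpart yields the data-driven grid.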


Proof of Theorem 3

The next lemma quantifies the loss arising from the transformation performed at Step 2 of Algorithm 2.

Lemma 4 (Error stacking). Suppose that the assumptions of Theorem 3 are satisfied. Let δ ∈ (0, 1). With probability at least 1 − δ, we have, for all $k \in \{1, \ldots, \widehat{K}\}$:

$$\big|H(\widehat{R}_{\widehat{\alpha}(k)}) - \widehat{\alpha}(k)\big| \le k\,\phi(\delta/2, n), \qquad (8)$$

as well as

$$\big|G(\widehat{R}_{\widehat{\alpha}(k)}) - \mathrm{ROC}^*(\widehat{\alpha}(k))\big| \le k\,\phi(\delta/2, n). \qquad (9)$$

Proof. Notice that $\{\widehat{\alpha}(k) : 1 \le k \le \widehat{K}\} \subset \{k 2^{-j} : 0 \le j \le \lfloor \log_2 n \rfloor,\ 0 \le k < 2^j\}$; see Remark 3. Observe also that we have

$$H(\widehat{R}_{\widehat{\alpha}(2)}) = H(\widehat{R}_{\widehat{\alpha}(1)}) + H\big(\widehat{R}_{\widehat{\alpha}(2)} \setminus \widehat{R}_{\widehat{\alpha}(1)}\big)$$

and, since $R^*_{\widehat{\alpha}(1)} \subset R^*_{\widehat{\alpha}(2)}$, one may write $\widehat{R}_{\widehat{\alpha}(2)} \setminus \widehat{R}_{\widehat{\alpha}(1)}$ as

$$\Big[\big(\widehat{R}_{\widehat{\alpha}(2)} \setminus R^*_{\widehat{\alpha}(2)}\big) \cup \big(\widehat{R}_{\widehat{\alpha}(2)} \cap R^*_{\widehat{\alpha}(2)}\big)\Big] \setminus \Big[\big(\widehat{R}_{\widehat{\alpha}(1)} \setminus R^*_{\widehat{\alpha}(1)}\big) \cup \big(\widehat{R}_{\widehat{\alpha}(1)} \cap R^*_{\widehat{\alpha}(1)}\big)\Big].$$

By additivity of the distribution H combined with Theorem 1, we obtain Equation (8) for k = 2. The general result is then established by induction. Equation (9) may be proved in a similar fashion. □


The deviation $\|\widehat{\mathrm{ROC}}(\widehat{s}, \cdot) - \mathrm{ROC}^*\|_\infty$ is bounded by:

$$\|\mathrm{ROC}^* - E_{\widehat{\sigma}}(\mathrm{ROC}^*)\|_\infty + \|\widehat{\mathrm{ROC}}(\widehat{s}, \cdot) - E_{\widehat{\sigma}}(\mathrm{ROC}^*)\|_\infty.$$

The first term may be shown to be of order ε by reproducing the argument involved in the proof of Theorem 2, while the second term is bounded by

$$\max_{k \in \{1, \ldots, \widehat{K}\}} \big|G(\widehat{R}_{\widehat{\alpha}(k)}) - E_{\widehat{\sigma}}(\mathrm{ROC}^*)(H(\widehat{R}_{\widehat{\alpha}(k)}))\big|.$$

The latter quantity may be bounded by:

$$\max_{k \in \{1, \ldots, \widehat{K}\}} \big|G(\widehat{R}_{\widehat{\alpha}(k)}) - E_{\widehat{\sigma}}(\mathrm{ROC}^*)(\widehat{\alpha}(k))\big| + \max_{k \in \{1, \ldots, \widehat{K}\}} \big|E_{\widehat{\sigma}}(\mathrm{ROC}^*)(H(\widehat{R}_{\widehat{\alpha}(k)})) - E_{\widehat{\sigma}}(\mathrm{ROC}^*)(\widehat{\alpha}(k))\big|.$$

The first term can be taken care of with Lemma 4. We now consider bounding the second term. We count the number of jumps of the piecewise constant curve $E_{\widehat{\sigma}}(\mathrm{ROC}^*)$ between the x-values given by $\widehat{\alpha}(k)$ and $H(\widehat{R}_{\widehat{\alpha}(k)})$. With probability at least 1 − δ, this number of jumps is bounded by the product of the total number of jumps with the amplitude of the interval of false positive rate levels:

$$\widehat{K} \cdot \max_{k \in \{1, \ldots, \widehat{K}\}} \big|H(\widehat{R}_{\widehat{\alpha}(k)}) - \widehat{\alpha}(k)\big| \le C \cdot \widehat{K}^2\, \phi(\delta/2, n) \le (C/\epsilon^2)\, \phi(\delta/2, n),$$

where we have used Lemma 4 and a union bound in the first inequality and Lemma 2 in the second. Given the assumption A4, we are led to the calibration for ε of the order of $n^{-1/6}$, since we need to balance ε, up to some constants, with a term of the order of $\epsilon^{-2}\,\phi(\delta/2, n)$, where $\phi(\delta/2, n)$ is of order $n^{-1/2}$.
Complexity versus Agreement for Many Views
Co-regularization for Multi-view Semi-supervised Learning

Odalric-Ambrym Maillard¹ and Nicolas Vayatis²

¹ Sequential Learning Project, INRIA Lille - Nord Europe, France
odalric.maillard@inria.fr
² ENS Cachan & UniverSud - CMLA UMR CNRS 8536
nicolas.vayatis@cmla.ens-cachan.fr

Abstract. The paper considers the problem of semi-supervised multi-view classification, where each view corresponds to a Reproducing Kernel Hilbert Space. An algorithm based on co-regularization methods with extra penalty terms reflecting smoothness and general agreement properties is proposed. We first provide an explicit tight control on the Rademacher (L1) complexity of the corresponding class of learners for arbitrarily many views, then give the asymptotic behavior of the bounds when the co-regularization term increases, making explicit the relation between consistency of the views and reduction of the search space. Since many views involve many parameters, we then provide a parameter selection procedure, based on the stability approach with clustering and localization arguments. To this aim, we give an explicit bound on the variance (L2-diameter) of the class of functions. Finally, we illustrate the algorithm through simulations on toy examples.

1 Introduction
In real-life classification tasks, different representations of the same object may be available. Financial experts may use different sets of indicators to assess the current market regime, while in the context of active computer vision, several views of the same object are provided before rendering the decision. This problem is known as multi-view classification. After the early work of (Blum & Mitchell, 1998) on learning from both labeled and unlabeled data, this topic has been considered more recently by several authors (see for example (Sridharan & Kakade, 2008), (Weston et al., 2005), (Zhou et al., 2004)). In (Balcan & Blum, 2005), the authors propose a theoretical PAC-model for semi-supervised learning where multi-view learning appears as a special case. Due to the restriction over the search space (compatibility between different views), multi-view learning may provide good generalization results, and indeed this is the case in numerical experiments (e.g. (Belkin et al., 2005)). In (Rosenberg & Bartlett, 2007), these results are applied to a two-view learning problem
⋆ The first author is eligible for the E. M. Gold Award.
⋆⋆ The second author was partly supported by the ANR Project TAMIS.


and explicit bounds on the Rademacher complexity of the class of predictors are computed. Various algorithms are introduced, and theoretical studies are provided, in (Sindhwani et al., 2005). In the latter references, the central issue addressed was to explain how consistency between views affects the performance of classification procedures. Indeed, in multi-view learning, we consider individual predictors based on separate views, and one intuitive idea is that (1) having a good final predictor is related to the agreement of individual predictors on a majority of labels. It is generally assumed that (2) each view is independent of the others conditionally on the label. Though this assumption may be weakened (see (Balcan et al., 2005)), providing theoretical justification for the heuristic that conditional independence of the views allows for high-performance results (two compatible classifiers trained on independent views are unlikely to agree on a mislabeled item) has been the motivation for most of the works on this topic. Thus we build on the same heuristics (1) and (2). As we intend to exploit all the information available in the classification task, our setup will also take unlabeled data into consideration.
In the present paper, we consider semi-supervised multi-view binary classification with many views. Allowing for more than two views brings up new questions. For instance, (i) how does the number of views V affect complexity measures? and (ii) how should the parameters be chosen when there are as many as O(V²) of them? For the first issue, we will focus on the Rademacher average and track down the dependency on V and other parameters in this formulation. As far as the second issue is concerned, various strategies can be invoked. In supervised classification, cross-validation (e.g. 10-fold) techniques are widely used due to their ease of implementation, but theory is unavailable in most cases (see (Celisse, 2008) and references therein for some recent developments). Another idea comes from recent work on clustering and makes use of the stability approach (see (Ben-David et al., 2006)). This relies on strongly theoretically founded results known as localization arguments (see (Koltchinskii, 2006)), which take advantage of the so-called “small ball estimates” (see also (Li & Linde, 1999), (Berthet & Shi, 2001)). The stability approach has also been applied successfully to other learning problems (see e.g. (Sindhwani & Rosenberg, 2008)). This is the approach we have chosen in order to perform the selection of parameters. Comparison of different selection procedures, although interesting by itself, is not the purpose of this paper.
In the sequel, we introduce an algorithm which combines semi-supervised
and multi-view ideas and generalizes over previous algorithms: it contains RLS,
co-RLS and co-Laplacian (see (Rosenberg & Bartlett, 2007),(Sindhwani et al.,
2005)) algorithms as special cases. We use the setup of Reproducing Kernel
Hilbert Spaces (RKHS) and provide explicit upper and lower data-dependent
bounds on the Rademacher complexity of the class of functions involved by this
general algorithm. Our second contribution is to give a new parameter selection
procedure, based on the work of (Koltchinskii, 2006), together with explicit
stability bounds (L2 -diameter) on the localized class for the general algorithm,
which has not been investigated so far.

The paper is organized as follows: Section 2 defines our framework and the
objective function. Section 3 is devoted to the Rademacher complexity control
with the first main theorem (section 3.2), and the asymptotic behavior of the
bound. Section 4 presents our stability-based selection procedure and the second
main theorem, on the L2 local diameter of our class of functions. In Section 5,
we successfully apply the algorithm to some toy examples.

2 Setup for Multi-view Semi-supervised Learning


Our approach is based on penalized empirical risk minimization in RKHS. A
compound penalty term will reflect both the complexity of the class of decision
functions and the particular context of multi-view semi-supervised learning. This
is an important improvement on the work of (Sindhwani et al., 2005) where
penalties (and corresponding algorithms) are considered separately. The goal we
pursue here is of unifying algorithms instead of comparing them. In this section,
we provide the notations and definitions of the penalty terms involved.
In the multi-view setup, an observation results from elements taken in a col-
lection of representation spaces X (v) ,v = 1, . . . , V , where V is the number of
views. We write x = (x(1) , ..., x(V ) ), where x(v) ∈ X (v) , for the resulting point
living in the product space which accounts for the multiple views of the object.
Learning with RKHS. Let {−1, 1} be the label set and let loss be a loss function, for instance loss_square(g(x), y) = (y − g(x))², with x a data point and y a label. As usual, we consider the label set Y = [−1, 1] instead. We consider n = u + l i.i.d. data points, l of which are labeled and u unlabeled. We now define the loss of a multi-view classifier f through the corresponding f^(v) in each view:
Definition 1 (Loss). For f = (f^(1), ..., f^(V)) and a sample (x_i, y_i)_{i=1,...,l}:

$$\mathrm{Loss}(f) = \frac{1}{l} \sum_{i=1}^{l} \mathrm{loss}(f, x_i, y_i),$$

where loss(f, x, y) may be for instance $\frac{1}{V} \sum_{v=1}^{V} \mathrm{loss}_{\mathrm{square}}(f^{(v)}(x^{(v)}), y)$ or $\mathrm{loss}_{\mathrm{square}}\big(\frac{1}{V} \sum_{v=1}^{V} f^{(v)}(x^{(v)}), y\big)$.
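For concreteness, the two loss variants of Definition 1 in a short Python sketch (the function names are ours):

import numpy as np

def loss_avg_of_squares(preds, y):
    """(1/V) * sum_v (y - f^(v)(x^(v)))^2: average over views of squared losses."""
    return float(np.mean((y - np.asarray(preds)) ** 2))

def loss_square_of_avg(preds, y):
    """(y - (1/V) * sum_v f^(v)(x^(v)))^2: squared loss of the averaged prediction."""
    return float((y - np.mean(preds)) ** 2)

# Three views predicting label y = 1:
print(loss_avg_of_squares([0.8, 1.2, 0.5], 1.0))  # 0.11
print(loss_square_of_avg([0.8, 1.2, 0.5], 1.0))   # ~0.0278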

We consider real-valued decision functions $\phi : x \mapsto \frac{1}{V} \sum_{v=1}^{V} f^{(v)}(x^{(v)})$, where $f^{(v)} : X^{(v)} \to Y$ is a classifier. We assume that each predictor $f^{(v)}$ lives in an RKHS $F^{(v)}$ with kernel $K^{(v)}$, associated representation function $k_v(\cdot, \cdot)$, and norm $\|\cdot\|_v$. Thanks to the representer theorem, we restrict ourselves to functions $f^{(v)} \in L_v = \mathrm{span}\{k_v(x_i^{(v)}, \cdot)\}_{i=1}^{l+u} \subset F^{(v)}$. Let F be the product space of the $F^{(v)}$ and L ⊂ F the product space of the spans. The complexity penalization leads to the following definition:
Definition 2 (Complexity). For f = (f^(1), ..., f^(V)) ∈ F, we define:

$$\mathrm{Complexity}(f) = \sum_{v=1}^{V} \lambda_v \|f^{(v)}\|_v^2, \quad \text{where } \lambda \in \mathbb{R}_+^V.$$

Semi-supervised regularization. In the sequel, we consider a batch of n = u + l i.i.d. data points $(x_i)_{i=1..l,\,l+1..l+u}$; $x_i$ is the representation of one object in all views: $x_i = (x_i^{(1)}, ..., x_i^{(V)})$. This setup is in-between classification and clustering theory: the labeled part allows for an objective function (whereas in clustering there is no labeling, thus no objective truth), and the unlabeled part involves structure detection in the data. Using a graph-Laplacian is a natural choice to express the search for structure, as explained for instance in (Smola & Kondor, 2003; Ando & Zhang, 2007). The idea is to consider that the data points depict a manifold (see (Belkin et al., 2005)), for which the graph-Laplacian is a discretized Laplace-Beltrami differential operator. Assuming we have for each view v a similarity graph given by its adjacency matrix $W^{(v)}$, the (unnormalized) graph-Laplacian is $L^{(v)} = D^{(v)} - W^{(v)}$, where $D^{(v)}$ is the diagonal matrix $D^{(v)}_{i,i} = \sum_j W^{(v)}_{i,j}$. Other interesting choices are the symmetric or random-walk normalized graph-Laplacians. Since intuitively one wants each $f^{(v)} \in F^{(v)}$ to be smooth w.r.t. the similarity structures in all views, we use the weighted average graph-Laplacian $L = \sum_{v=1}^{V} \alpha_v L^{(v)}$ with weights α summing to 1.

Definition 3 (Smoothness). For f = (f^(1), ..., f^(V)), we define:

$$\mathrm{Smoothness}(f) = \sum_{v=1}^{V} \gamma_v\, \mathbf{f}^{(v)T} L\, \mathbf{f}^{(v)}, \quad \text{where}$$

– γ = (γ₁, .., γ_V) ≥ 0, meaning that each component is nonnegative;
– L is defined based on $L^{(v)}$, the graph-Laplacian corresponding to the v-th view: $L = \sum_{v=1}^{V} \alpha_v L^{(v)}$ with $\sum_{v=1}^{V} \alpha_v = 1$;
– $\mathbf{f}^{(v)}$ is the vector $(f^{(v)}(x_1^{(v)}), ..., f^{(v)}(x_{l+u}^{(v)}))^T$.
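As an illustration, a small Python sketch computing the per-view unnormalized graph-Laplacians, their weighted average, and the resulting smoothness penalty (variable names are ours):

import numpy as np

def smoothness_penalty(W_views, alpha, gamma, f_views):
    """Smoothness(f) = sum_v gamma_v * f^(v)^T L f^(v), where
    L = sum_v alpha_v * (D^(v) - W^(v)) is the weighted average graph-Laplacian."""
    laplacians = [np.diag(W.sum(axis=1)) - W for W in W_views]  # L^(v) = D^(v) - W^(v)
    L = sum(a * Lv for a, Lv in zip(alpha, laplacians))
    return float(sum(g * f @ L @ f for g, f in zip(gamma, f_views)))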

Multiple view co-regularization. In a multi-view approach, the need for


compatibility between the f (v) is conveyed by a so-called Agreement term. We
propose the following one which penalizes disagreement with a square loss and
generalizes (Sindhwani et al., 2005) to our setting:

Definition 4 (Agreement). For f = (f^(1), ..., f^(V)) and symmetric positive definite matrices $c^L, c^U \in \mathbb{R}^{V \times V}$, we define Agreement(f) as the sum of:

$$C^L(f) = \sum_{v_1 \ne v_2} \sum_{i=1}^{l} c^L_{v_1, v_2} \big[f^{(v_1)}(x_i^{(v_1)}) - f^{(v_2)}(x_i^{(v_2)})\big]^2$$

and

$$C^U(f) = \sum_{v_1 \ne v_2} \sum_{i=l+1}^{l+u} c^U_{v_1, v_2} \big[f^{(v_1)}(x_i^{(v_1)}) - f^{(v_2)}(x_i^{(v_2)})\big]^2.$$
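Similarly, a direct Python transcription of the agreement penalty (a sketch; `preds[v, i]` stands for $f^{(v)}(x_i^{(v)})$):

import numpy as np

def agreement_penalty(preds, cL, cU, l):
    """Agreement(f) = C^L(f) + C^U(f): pairwise squared disagreement between views,
    weighted by cL on the first l (labeled) points and by cU on the rest."""
    V = preds.shape[0]
    total = 0.0
    for v1 in range(V):
        for v2 in range(V):
            if v1 != v2:
                diff2 = (preds[v1] - preds[v2]) ** 2
                total += cL[v1, v2] * diff2[:l].sum() + cU[v1, v2] * diff2[l:].sum()
    return float(total)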

Compound complexity penalties. We finally formulate the objective func-


tion in this setup as the result of loss minimization with a compound penalty:

– Compute:

$$[1] \qquad f^* = \mathrm{argmin}_{f \in F}\ \{\mathrm{Loss}(f) + \mathrm{Complexity}(f) + \mathrm{Smoothness}(f) + \mathrm{Agreement}(f)\}$$

– Output:

$$\phi = \frac{1}{V} \sum_{v=1}^{V} f^{*(v)}$$

We point out that there is a representer theorem for this setting. Indeed, for any fixed $f^{(2)}, \ldots, f^{(V)} \in \Pi_{v=2}^{V} F^{(v)}$, $f^{*(1)}$ minimizes a function

$$c_{f^{(2)},\ldots,f^{(V)}}\big(f(x_1^{(1)}), y_1, \ldots, f(x_n^{(1)}), y_n\big) + g_{f^{(2)},\ldots,f^{(V)}}(\|f\|_1)$$

w.r.t. f. Thus the representer theorem tells us that $f^{*(1)} \in L_1$. Iterating the argument leads to $f^* \in L$. We also refer to (Sindhwani & Rosenberg, 2008) for an alternative construction where one single RKHS combines all the views.
For specific choices of the parameters, we recover the former problems studied
in previous papers:
– when γ and C are 0 we have a Regularized Least Squares (RLS) in RKHS,
– when only γ = 0 we have a Co-Regularized Least Squares (co-RLS) problem
(see (Sindhwani et al., 2005)),
– when Agreement is diagonal nonzero (i.e. cL and cU are diagonal), we have a
co-Laplacian method (e.g. co-Laplacian RLS, co-Laplacian SVM, see (Sind-
hwani et al., 2005)) ; indeed, the f (v) are decoupled, and thus problem [1]
amounts to solving for each v:

f (v)∗ = argminf (v) ∈F (v) Loss(f (v) ) + λv ||f (v) ||2v + γv f(v)T Lf(v) .

3 Excess Risk Bound


This section is devoted to the control of the Rademacher complexity in our
problem.
We need the following assumption from (Rosenberg & Bartlett, 2007), which is satisfied for instance by the square loss ($\mathrm{Loss}(0, .., 0) = \frac{1}{l}\sum_{i=1}^{l} \frac{1}{V}\sum_{v=1}^{V} y_i^2 \le 1$):
Assumption A1: The loss functional satisfies Loss(0, ..., 0) ≤ 1 where (0, ..., 0)
is the multi-predictor with constant output 0.

3.1 Preliminaries
One nice property is that, under assumption (A1), the final predictor φ belongs to:

$$\mathcal{J} = \Big\{ x \mapsto \frac{1}{V} \sum_{v=1}^{V} f^{(v)}(x^{(v)})\ :\ (f^{(1)}, .., f^{(V)}) \in \mathcal{H} \Big\}$$

with $\mathcal{H}$ being the class of multi-predictors f with total penalty bounded by 1:

$$\mathcal{H} = \{ f \in L : \mathrm{Complexity}(f) + \mathrm{Smoothness}(f) + \mathrm{Agreement}(f) \le 1 \}.$$


Excess risk bounds involve the Rademacher complexity of the class $\mathcal{G}$ of learners. For a sample $(x_1, \ldots, x_n)$, it is defined as

$$R_n(\mathcal{G}) = \mathbb{E}_\sigma \sup_{g \in \mathcal{G}} \Big| \frac{2}{n} \sum_{i=1}^{n} \sigma_i g(x_i) \Big|,$$

where $(\sigma_i)_{i \le n}$ are i.i.d. Rademacher random variables ($\mathbb{P}(\sigma_i = 1) = \mathbb{P}(\sigma_i = -1) = \frac{1}{2}$).
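The Rademacher average can be approximated by Monte Carlo over sign vectors; a minimal sketch, where (as an assumption of the sketch) the supremum over $\mathcal{G}$ is replaced by a maximum over a finite pool of candidate predictors:

import numpy as np

def empirical_rademacher(preds, n_draws=1000, seed=0):
    """Monte Carlo estimate of R_n(G) = E_sigma sup_g |(2/n) sum_i sigma_i g(x_i)|.
    `preds` is a (num_candidates, n) array with preds[c, i] = g_c(x_i)."""
    rng = np.random.default_rng(seed)
    n = preds.shape[1]
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)        # i.i.d. Rademacher signs
        total += np.abs(preds @ sigma).max() * 2.0 / n
    return total / n_draws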
The following proposition, adapted from (Rosenberg & Bartlett, 2007), makes use of this data-dependent complexity to derive an upper bound on the excess risk:

Proposition 1 (Excess risk). Let L be any positive loss function, uniformly β-Lipschitz w.r.t. its first variable and upper-bounded by 1. Then, conditionally on the unlabeled data, for all δ ∈ (0, 1), with probability at least 1 − δ over the labeled points drawn i.i.d., for $\phi_l^*$ the empirical minimizer of the objective function:

$$\mathbb{E}\big(L(\phi_l^*(X), Y)\big) - \inf_{\phi \in \mathcal{J}} \mathbb{E}\big(L(\phi(X), Y)\big) \le 4\beta R_l(\mathcal{J}) + \frac{2}{\sqrt{l}}\Big(2 + 3\sqrt{\ln(2/\delta)/2}\Big).$$

The proof is an easy combination of classical generalization bounds with some arguments from (Rosenberg & Bartlett, 2007) and the following contraction principle: if h is β-Lipschitz and h(0) = 0, then $R_n(h \circ \mathcal{J}) \le 2\beta R_n(\mathcal{J})$ (see (Ledoux & Talagrand, 1991)), together with the symmetry of $\mathcal{J}$.

3.2 Explicit Rademacher Complexity Bound


Block-wise notations. We use the following notations: for any n, $I_n$ or I is the identity of $\mathbb{R}^n$, and $0_{u,l}$ the zero matrix of $\mathbb{R}^{u \times l}$. For any given $n_1, n_2$ and $A^{(v)} \in \mathbb{R}^{n_1 \times n_2}$, A is the block-diagonal matrix with blocks $A^{(v)}$, v = 1..V (of size $n_1 V \times n_2 V$), and similarly $\overline{A}$ the block-row matrix of size $n_1 V \times n_2$. To multiply block-wise each block $A^{(v)}$ by the v-th component of a vector $\lambda \in \mathbb{R}^V$, let $\tilde{\lambda}$ be the block-diagonal matrix of size $n_1 V \times n_1 V$ with blocks $\lambda_v I_{n_1}$. Since we always multiply the v-th block with the v-th component, we drop the index.

Data. With the following matrices, we decompose between labeled and unlabeled data: $K^{(v)} = (k_v(x_i^{(v)}, x_j^{(v)}))_{1 \le i,j \le n} = \binom{K_L^{(v)}}{K_U^{(v)}} \in \mathbb{R}^{n \times n}$ and $\Pi = \binom{I_l}{0_{u,l}} \in \mathbb{R}^{nV \times l}$.

Agreement. To compare views pairwise, we introduce a block-line matrix $\delta \in \mathbb{R}^{nV(V-1) \times nV}$, with blocks $(0 \ldots 0\ I_n\ 0 \ldots 0\ {-I_n}\ 0 \ldots 0)$ with identity matrices at positions $v_1$ and $v_2 \ne v_1$. Let also $C_{v,w}$ be the block-diagonal matrix with diagonal blocks $(c^L_{v,w})_{i=1..l}$ and $(c^U_{v,w})_{i=l+1..l+u}$, and then $C \in \mathbb{R}^{nV(V-1) \times nV(V-1)}$ the block-diagonal matrix with blocks $C_{v,w}$, v, w ∈ {1, .., V}.

Smoothness. Let $L_I$ be the block-diagonal matrix with all V blocks equal to L. Note that we would have introduced $\tilde{\alpha}L$ instead, had we used each graph-Laplacian rather than the average Laplacian L in the smoothness term.

Thanks to the previous notations, we can now state our first main theorem, which gives an explicit upper and lower data-dependent bound for the Rademacher complexity of the class of functions.

Theorem 1 (Rademacher complexity bound). Under assumption (A1),

$$\frac{2}{2^{1/4}}\,\frac{b}{V l} \le R_l(\mathcal{J}) \le \frac{2b}{V l},$$

where $b^2 = \mathrm{tr}(B^T \tilde{\lambda}^{-1} \Pi K_L^T) - \mathrm{tr}(J'^T (I + M')^{-1} J')$ with
– $B = (I + \tilde{\lambda}^{-1} \tilde{\gamma} L_I K)^{-1} \in \mathbb{R}^{nV \times nV}$
– $J' = \sqrt{C}\, \delta\, \tilde{\lambda}^{-1} B^T K_L \in \mathbb{R}^{nV(V-1) \times l}$
– $M' = \sqrt{C}\, \delta K B \tilde{\lambda}^{-1} \delta^T \sqrt{C} \in \mathbb{R}^{nV(V-1) \times nV(V-1)}$

Note that b is explicit as a difference of two terms. The first term only depends on unlabeled data when Smoothness is null, and contains no co-regularization term. The second term corresponds to the idea that there is a reduction in the complexity of the space; indeed, in Section 3.3, we give some results about the behavior of b supporting this idea. As pointed out by (Sindhwani & Rosenberg, 2008), this term is connected to a specific norm induced by the parameters and data over the space.
This theorem generalizes previous results: for instance, if V = 2, γ = 0, and $c^L_{v,w} = 0$, we recover exactly the previously known bound of (Rosenberg & Bartlett, 2007), where our $2c^U_{v,w}$ corresponds to their λ and our λ is their $(\gamma_F, \gamma_G)$.
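Given the block matrices of Theorem 1 already assembled, the bound itself is a few lines of numpy; a sketch (the construction of B, λ̃, Π, K_L, J′ and M′ follows the block-wise notations above and is omitted here):

import numpy as np

def rademacher_bounds(B, lam_tilde, Pi, K_L, Jp, Mp, V, l):
    """Lower/upper bounds of Theorem 1:
    b^2 = tr(B^T lam~^{-1} Pi K_L^T) - tr(Jp^T (I + Mp)^{-1} Jp)."""
    lam_inv = np.linalg.inv(lam_tilde)
    term1 = np.trace(B.T @ lam_inv @ Pi @ K_L.T)
    term2 = np.trace(Jp.T @ np.linalg.solve(np.eye(Mp.shape[0]) + Mp, Jp))
    b = np.sqrt(term1 - term2)
    return 2.0 * b / (2.0 ** 0.25 * V * l), 2.0 * b / (V * l)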

3.3 Asymptotics
Let θ = (α, λ, γ, C) be the parameters of the learning problem, where α appears in the graph-Laplacian, λ in the Complexity term, γ in the Smoothness term and C in the Agreement term. The number of parameters grows as O(V²). We study how the previous Rademacher bound changes with these parameters.
More agreement reduces space complexity. The second term appearing in the expression of b² depends on the co-regularization (matrix) parameter C. To see how constrained the space becomes under stronger penalization, we introduce $\Delta(C) = \mathrm{tr}(J'^T(I + M')^{-1}J')$, which can be written, provided that $C^{-1}$ exists, as:

$$\Delta(C) = \mathrm{tr}\big(J_1^T (C^{-1} + M_1)^{-1} J_1\big)$$

where $J_1 = \delta \tilde{\lambda}^{-1} B^T K_L^T$ and $M_1 = \delta K B \tilde{\lambda}^{-1} \delta^T$.
Thus, when the eigenvalues of C increase to +∞, Δ(C) tends to

$$\Delta_\infty = \mathrm{tr}\big(K_L^T B \tilde{\lambda}^{-1} \delta^T (\delta K B \tilde{\lambda}^{-1} \delta^T)^{-1} \delta \tilde{\lambda}^{-1} B^T K_L^T\big),$$

which can be rewritten $\Delta_\infty = \mathrm{tr}(B^T \tilde{\lambda}^{-1} \Pi_l K_L^T)$, and shows that b² → 0 in this case. That b decreases as the model gets more constrained is coherent with the intuition of multi-view learning. Similarly, b² → 0 whenever ‖γ‖ → ∞ or ‖λ‖ → ∞.

Unconstrained space. When the constraint on the space vanishes, we have a completely different behavior. Indeed, if C = 0 then Δ(C) = 0. When γ = 0, we refer to (Rosenberg & Bartlett, 2007). Finally, when λ = 0, b² has the following expression (provided every term appearing in this expression is finite and well defined):

$$b^2 = \mathrm{tr}\big(\Pi_l^T L_I^{-1} \tilde{\gamma}^{-1} \Pi_l\big) - \mathrm{tr}\big(\Pi_l^T L_I^{-1} \tilde{\gamma}^{-1} \delta^T (C^{-1} + \delta L_I^{-1} \tilde{\gamma}^{-1} \delta^T)^{-1} \delta L_I^{-1} \tilde{\gamma}^{-1} \Pi_l\big).$$

Note that when both γ and λ tend to 0, the previous bound may tend to ∞ even in some simple cases (which is coherent with the intuition). Note also that the dependency on V is hidden here in the trace.

4 Stability-Based Parameter Selection


The multi-view setting raises new questions, such as the choice of the parameters, since there are O(V²) of them. We now describe an automatic parameter selection procedure which is theoretically sound.

4.1 Theoretical Selection Procedure


Let $P_n = \frac{1}{n}\sum_{i=1}^{n} \delta_{X_i}$ be the empirical measure, and P the true measure. Thus $P_n f = \frac{1}{n}\sum_{i=1}^{n} f(X_i)$ and $Pf = \mathbb{E}(f(X))$. For a general class $\mathcal{F}$ of functions and a probability measure Q, we define $\mathcal{F}_Q(\epsilon) = \{f \in \mathcal{F};\ Qf - \inf_{\mathcal{F}} Qf \le \epsilon\}$ and then introduce the true ε-optimal ball $\mathcal{F}(\epsilon) = \mathcal{F}_P(\epsilon)$ and the empirical ε-optimal ball $\mathcal{F}_n(\epsilon) = \mathcal{F}_{P_n}(\epsilon)$, i.e. balls around the Empirical Risk Minimizer (ERM) and the True Risk Minimizer (TRM). For a general class $\mathcal{F}$ of functions, we now assume that we have $T : \mathcal{F}^2 \to \mathbb{R}_+$ such that $\mathrm{Var}(f - g) \le T^2(f, g)$ for all $f, g \in \mathcal{F}$, and then introduce the two objects $\Delta_n(\epsilon) = \sup_{f_1, f_2 \in \mathcal{F}(\epsilon)} |P_n - P|(f_1 - f_2)$ and $D_{\mathcal{F}}(\epsilon) = \sup_{f, g \in \mathcal{F}(\epsilon)} T(f, g)$. We refer to the first one as an $(L_1, P)$-diameter and to the second one as an $(L_2, P)$-diameter. Lemma 1 in (Koltchinskii, 2006) tells us that, for large enough radii, the empirical and true quasi-optimal sets around the ERM and TRM are included in each other; put differently, true quasi-optimal sets can be estimated by empirical quasi-optimal sets:

Lemma 1 (Koltchinskii). For any ε > 0 and any λ < 1, we set

$$B_n(\epsilon, \lambda) = 2\epsilon + \frac{\Delta_n(\epsilon)}{\lambda} + \frac{\log(\epsilon^{-1})}{\lambda n} + \sqrt{\frac{2\log(\epsilon^{-1})}{\lambda^2 n}\big[D_{\mathcal{F}}^2(\epsilon) + 2\Delta_n(\epsilon)\big]}$$

and

$$r_n(\epsilon, \lambda) = \inf\Big\{\alpha \in [0, 1];\ \sup_{j \in \mathbb{N}:\ 1 \ge \lambda^j \ge \alpha} B_n(\epsilon, \lambda^j) \le \lambda\Big\}.$$

Set also $\bar{\epsilon} = \epsilon\big(2 + \frac{\ln(r_n(\epsilon, \lambda))}{\ln(\lambda)}\big)$. Then, with probability larger than $1 - \bar{\epsilon}$:

$$\forall r \ge r_n(\epsilon, \lambda):\quad \mathcal{F}(r) \subset \mathcal{F}_n(3r/2) \quad \text{and} \quad \mathcal{F}_n(r) \subset \mathcal{F}(2r).$$


240 O.-A. Maillard and N. Vayatis

In the general case, if the radii are too small, such inclusions no longer hold, and the intersection may even be empty. For our problem, we will simply select the parameter θ inducing the largest range of quasi-optimal sets controlled around the ERM, which is a notion of stability. Thus, for a given radius of the true penalized ball, we want to minimize the critical radius $r_n$ w.r.t. θ. A side motivating intuition is that good stability allows for easy discovery of the minimizer $f^*$.

4.2 Empirical Selection Procedure


We now propose an empirical version of this lemma. Fortunately, an empirical estimation of $r_n(\epsilon, \lambda)$ is possible thanks to Theorem 3, page 18, in (Koltchinskii, 2006), leading to a fully data-dependent quantity. Indeed, let $\widehat{\Delta}_n(\epsilon) = R_n(\mathcal{F}_n(\epsilon))$ and $\widehat{D}_{\mathcal{F}_n}(\epsilon) = \sup_{f, g \in \mathcal{F}_n(\epsilon)} T_n(f, g)$, with $T_n^2$ bounding the empirical variance $\mathrm{Var}_n$. The empirical versions of $B_n(\epsilon, \lambda)$ and $r_n(\epsilon, \lambda)$ given by (Koltchinskii, 2006) are:

$$\widehat{r}_n(\epsilon, \lambda) = \inf\Big\{\alpha \in [0, 1];\ \sup_{j \in \mathbb{N}:\ 1 \ge \lambda^j \ge \alpha} \widehat{B}_n(\epsilon, \lambda^j) \le \lambda^3\Big\}, \quad \text{where}$$

$$\widehat{B}_n(\epsilon, \lambda) = \frac{2c\,\widehat{\Delta}_n(c'\epsilon)}{\lambda} + 2\widehat{D}_{\mathcal{F}_n}(c'\epsilon)\sqrt{\frac{\log(\epsilon^{-1})}{\lambda^2 n}} + \frac{\log(\epsilon^{-1})}{\lambda n}$$

and $c, c' \ge 1$ are universal constants.
We now propose to apply this result to semi-supervised multi-view classification. We identify the classes $\widehat{\mathcal{F}}_{\theta,n}$ to be

$$\mathcal{J}(r) = \Big\{ x \mapsto \frac{1}{V} \sum_{v=1}^{V} f^{(v)}(x^{(v)})\ ;\ f \in \mathcal{H}(r) \Big\},$$

where $\mathcal{H}(r) = \{f\ ;\ \widehat{\pi}_{\theta,l}(f) \le r\}$, and estimate $R_n(\widehat{\mathcal{F}}_{\theta,n}(\epsilon))$ and $\widehat{D}_{\widehat{\mathcal{F}}_{n,\theta}}(\epsilon)$ for each parameter θ. Note that the dependency w.r.t. θ = (α, λ, γ, C) is hidden in the definition. Thus we need to bound the Rademacher complexity of $\mathcal{J}(r)$ and its $(L_2, P_n)$-diameter. An analysis of the proof of Theorem 1 shows that changing $\mathcal{J} = \mathcal{J}(1)$ for $\mathcal{J}(r)$ affects the Rademacher bound by a factor $\sqrt{r}$, leading to a bound $\frac{2b(\theta)\sqrt{r}}{lV}$ for the first term. Following the same analysis as for the $L_1$-diameter (or Rademacher complexity), the next theorem gives us the second bound we need:
Theorem 2 (Empirical local L2 diameter). Under assumption A1,

$$\widehat{D}_{\mathcal{J}(r)} \le \frac{2d\sqrt{r}}{\sqrt{lV}},$$

where d² is the largest eigenvalue of $(B - J_2^T(I + M)^{-1}J_2)\,\tilde{\lambda}^{-1}\Pi(K_L^T)^T$, with $J_2 = \sqrt{C}\,\delta\,\tilde{\lambda}^{-1}B^T$.

Note the dependency on $\sqrt{lV}$, instead of the $lV$ appearing in the Rademacher bound.
Eventually, each θ leads to a radius $r_n^\theta(\epsilon, \lambda) \ge \widehat{r}_n^\theta(\epsilon, \lambda)$ defined likewise, using the upper bounds of Theorems 1 and 2. For maximal stability, we propose to have the largest range of values for which Lemma 1 still holds, which boils down to minimizing this quantity with respect to θ. This leads to the following selection procedure, where each term is computable:

– Fix a probability threshold with ε > 0 and λ < 1.

– Compute $r(\theta, n, l, \epsilon, \lambda)$, defined by:

$$\inf\Big\{\alpha \in [0, 1];\ \sup_{j \in \mathbb{N}:\ 1 \ge \lambda^j \ge \alpha} \widetilde{B}_{n,l}(\theta, \epsilon, \lambda^j) \le \lambda^3\Big\},$$

where the term $\widetilde{B}_{n,l}(\epsilon, \lambda)$ is:

$$\frac{2c\,b(\theta)\sqrt{c'\epsilon}}{lV\lambda} + \frac{4d(\theta)\sqrt{c'\epsilon}}{\sqrt{lV}}\sqrt{\frac{\log(\epsilon^{-1})}{\lambda^2 n}} + \frac{\log(\epsilon^{-1})}{\lambda n}$$

– Output:

$$\theta^* = \mathrm{argmin}_{\theta \in \Theta}\ r(\theta, n, l, \epsilon, \lambda)$$
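Operationally, the selection boils down to a small grid search; a Python sketch, where b(θ) and d(θ) come from Theorems 1 and 2 and the universal constants are placeholders:

import numpy as np

def critical_radius(b, d, V, l, n, eps, lam, c=1.0, cp=1.0, j_max=60):
    """Smallest alpha = lam^j such that B~_{n,l}(theta, eps, lam^i) <= lam^3
    holds for every level lam^i >= alpha (cf. the definition of r above)."""
    log_term = np.log(1.0 / eps)
    sup_B, alpha = -np.inf, 1.0
    for j in range(j_max + 1):
        lj = lam ** j
        B = (2 * c * b * np.sqrt(cp * eps) / (l * V * lj)
             + 4 * d * np.sqrt(cp * eps) / np.sqrt(l * V)
               * np.sqrt(log_term / (lj ** 2 * n))
             + log_term / (lj * n))
        sup_B = max(sup_B, B)
        if sup_B > lam ** 3:
            return alpha            # the previous level was the last admissible one
        alpha = lj
    return alpha

# theta* = argmin over a grid of candidate parameters, e.g.:
# best_theta = min(thetas, key=lambda th: critical_radius(b(th), d(th), V, l, n, eps, lam))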

5 Experiments

We have performed some toy simulations to assess the flexibility of this general algorithm, and the results are promising. Based on only one or two labeled points, we can always recover a perfect labeling of the data, even on the challenging cross-moons data set, on which all classical algorithms (Co-Laplacian and Co-RLS) perform badly. For completeness, we first give hints on how to solve the minimization problem. Recall that the solutions of [1] can be written $f^{(v)}(x^{(v)}) = \sum_{i=1}^{l+u} \alpha_i^{(v)} K^{(v)}(x^{(v)}, x_i^{(v)}) = K_{x^{(v)}}^{(v)} \alpha^{(v)}$. We first consider the case where the loss function is differentiable.

Theorem 3 (Solution in the differentiable case). Assume that the loss function satisfies $\nabla_{\alpha^{(v)}} \mathrm{Loss}(f^{(v)}) = 2K^{(v)} A^{(v)} \alpha^{(v)}$. Then the solution of problem [1] is given by the resolution of the following linear system, where the $\alpha^{(v)}$ are the unknown vectors: for all v ∈ {1, . . . , V},

$$Y = \big[A^{(v)} + \lambda_v I + \gamma_v L K^{(v)}\big]\alpha^{(v)} + 2\sum_{w=1}^{V} C_{v,w}\big(K^{(v)}\alpha^{(v)} - K^{(w)}\alpha^{(w)}\big),$$

where $Y_i = y_i$ for $1 \le i \le l$ and $Y_i = 0$ for $l + 1 \le i \le l + u$.

The proof is a straightforward application of usual algebra and is omitted here. This system contains as a special case the linear system of (Sindhwani et al., 2005). We can rewrite it as $S\alpha = \tilde{Y}$, where S is an appropriate matrix, $\alpha = (\alpha^{(1)T}, .., \alpha^{(V)T})^T$ and $\tilde{Y} = (Y^T, .., Y^T)^T$, but S is a priori not positive and may have a very large condition number.
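A numpy sketch of assembling and solving the stacked system $S\alpha = \tilde{Y}$ of Theorem 3; for simplicity the agreement blocks $C_{v,w}$ are taken as scalars here (the paper allows diagonal matrices), and A[v] is the per-view matrix coming from the loss gradient assumption:

import numpy as np

def solve_multiview(A, K, L, lam, gamma, c, y, l):
    """Solve for all alpha^(v) at once. A[v], K[v]: (n, n) per-view matrices;
    L: average graph-Laplacian; c[v][w]: scalar agreement weights (our
    simplification); y: labels of the first l points."""
    V = len(K)
    n = K[0].shape[0]
    Ytil = np.zeros(n * V)
    for v in range(V):
        Ytil[v * n : v * n + l] = y                     # Y_i = y_i on labeled points
    S = np.zeros((n * V, n * V))
    for v in range(V):
        diag = A[v] + lam[v] * np.eye(n) + gamma[v] * (L @ K[v])
        diag += 2.0 * sum(c[v][w] for w in range(V) if w != v) * K[v]
        S[v*n:(v+1)*n, v*n:(v+1)*n] = diag
        for w in range(V):
            if w != v:
                S[v*n:(v+1)*n, w*n:(w+1)*n] = -2.0 * c[v][w] * K[w]
    alpha = np.linalg.solve(S, Ytil)                    # may be ill-conditioned
    return alpha.reshape(V, n)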
An important case of a non-differentiable loss function (which is not covered by the previous theorem) is the hinge loss used in SVMs. How to use a classical SVM solver for our problem is left aside in this paper. A complete derivation is given in (Belkin et al., 2005) when γ = 0.

5.1 Toy Examples

We have done some experiments on three toy examples (Figure 1), with only
two views and two classes for simplicity.

– The easy two moons-two lines data set, for which the data is linearly sepa-
rable in the second view, and almost separated in the first.
– The more complex two spirals-two clouds data set, with intricate spirals (to
“force” the use of graph-Laplacian). Note that a human operator cannot
separate the two classes without the information of the second view.
– The challenging cross-two moons data set, which appears to fool the tested algorithms based on only one of the Smoothness or Agreement terms.

Since the fewer the labeled objects, the more heuristic the definition of the “true” classes, we rely here on human judgment to define the true classes. Such a definition of truth is a real problem, still unsolved in the clustering community, and we do not pretend to solve it here. In the first two data sets, a human only needs one labeled object of each class to recover the classes. For the last one, because the cross yields ambiguity, a human operator needs two objects in each class. Thus, we use this number of labels.
For each algorithm we use the quadratic loss, which is differentiable. The
first one is the classical RLS, for which Smoothness and Agreement are set to
0. The second one is a co-RLS, with only Smoothness set to 0. Then we used


Fig. 1. Three toy data sets. Normal points for unlabeled points, circle for class number
one and cross for class number two. From left to right: Two moons (above)- two lines
(below), with one labeled object in each class. Two spirals-two clouds, with one labeled
object in each class. One cross-two moons, with two labeled objects in each class.

a Laplacian-based algorithm (co-Laplacian), which outperforms co-RLS on the tricky two spirals-two clouds data set, and finally an algorithm with none of the terms set to 0. Since all these algorithms are specializations of the general algorithm, with some parameters set to 0 to highlight certain behaviors, we simply tuned the parameters by hand, trying to find the best results for each algorithm. Finally, note that the choice of the kernels for each view is important, and we used well-suited kernels for each problem (Gaussian for clouds, linear for lines, . . . ).

algo          dataset 1        dataset 2        dataset 3
RLS           0.455 ± 0.035    0.103 ± 0.024    0.379 ± 0.026
co-RLS        0.146 ± 0.071    0.103 ± 0.024    0.467 ± 0.025
co-Laplacian  0.242 ± 0.040    0.001 ± 0.004    0.510 ± 0.028
general       0.011 ± 0.015    0.322 ± 0.067    0.042 ± 0.071

Empirical misclassification errors for the above algorithms (one set of parameters per dataset, some possibly set to zero as specified for each algorithm), averaged over 1000 runs.

6 Discussion and Conclusion

In this paper, we have combined different aspects of semi-supervised and multi-


view learning into one algorithm. Based on previous work, we have derived an
explicit control for the L1 -diameter (Rademacher complexity) of the class of de-
cision functions for this new algorithm. Besides, we have shown how considering
the full multi-view learning problem may generate new questions. Combining
stability ideas from the statistical and clustering community, we have proposed
a new stability-based parameter selection procedure, which benefits from strong
recent theoretical developments. For this procedure to be implementable, we have
controlled the L2 -diameter of the class as well, which has not been investigated
so far for similar settings.

References

Ando, R.K., Zhang, T.: Learning on graph with Laplacian regularization. In: Schölkopf,
B., Platt, J., Hoffman, T. (eds.) Advances in neural information processing systems,
vol. 19, pp. 25–32. MIT Press, Cambridge (2007)
Balcan, M., Blum, A.: A PAC-style model for learning from labeled and unlabeled
data. In: Auer, P., Meir, R. (eds.) COLT 2005. LNCS (LNAI), vol. 3559, pp. 111–
126. Springer, Heidelberg (2005)
Balcan, M.F., Blum, A., Yang, K.: Co-training and expansion: Towards bridging the-
ory and practice. In: Saul, L.K., Weiss, Y., Bottou, L. (eds.) Advances in neural
information processing systems, vol. 17, pp. 89–96. MIT Press, Cambridge (2005)
Belkin, M., Niyogi, P., Sindhwani, V.: On Manifold Regularization. In: AISTAT (2005)
244 O.-A. Maillard and N. Vayatis

Ben-David, S., von Luxburg, U., Pal, D.: A sober look at clustering stability. In: Lugosi,
G., Simon, H.U. (eds.) COLT 2006. LNCS (LNAI), vol. 4005, pp. 5–19. Springer,
Heidelberg (2006)
Berthet, P., Shi, Z.: Small ball estimates for Brownian motion under a weighted sup-norm. Studia Sci. Math. Hung. 1–2, 275–289 (2001)
Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In:
COLT 1998: Proceedings of the eleventh annual conference on Computational learn-
ing theory, pp. 92–100. ACM, New York (1998)
Celisse, A.: Model selection via cross-validation in density estimation, regression, and
change-points detection. Doctoral dissertation, Universite Paris Sud, Faculte des
Sciences d’Orsay (2008)
Golub, G.H., Van Loan, C.F.: Matrix computations. The Johns Hopkins University
Press (1996)
Koltchinskii, V.: Local rademacher complexities and oracle inequalities in risk mini-
mization. The Annals of Statistics 34(6), 2593–2656 (2006)
Ledoux, M., Talagrand, M.: Probability in Banach Spaces: Isoperimetry and Processes. Springer, Berlin (1991)
Li, W.V., Linde, W.: Approximation, metric entropy and small ball estimates for Gaussian measures. Ann. Probab. 27, 1556–1578 (1999)
Rosenberg, D., Bartlett, P.L.: The Rademacher complexity of co-regularized kernel classes. In: Proceedings of the Eleventh ICAIS (2007)
Sindhwani, V., Niyogi, P., Belkin, M.: A co-regularization approach to semi-supervised
learning with multiple views. In: Workshop on Learning with Multiple Views, Pro-
ceedings of International Conference on Machine Learning (2005)
Sindhwani, V., Rosenberg, D.S.: An RKHS for multi-view learning and manifold co-regularization. In: ICML 2008: Proceedings of the 25th International Conference on Machine Learning, pp. 976–983. ACM, New York (2008)
Smola, A.J., Kondor, R.: Kernels and regularization on graphs. In: Conference on
Learning Theory and 7th Kernel Workshop, pp. 144–158 (2003)
Sridharan, K., Kakade, S.M.: An information theoretic framework for multi-view learn-
ing. In: COLT, pp. 403–414. Omnipress (2008)
Weston, J., Leslie, C., Ie, E., Zhou, D., Elisseeff, A., Noble, W.S.: Semi-supervised
protein classification using cluster kernels. Bioinformatics 21, 3241–3247 (2005)
Zhou, D., Bousquet, O., Lal, T.N., Weston, J., Schölkopf, B.: Learning with local and
global consistency. In: Advances in Neural Information Processing Systems, vol. 16,
pp. 321–328. MIT Press, Cambridge (2004)

Appendix - Proofs
Sketch of Proof of Theorem 1
The proof of Theorem 1 follows the same lines as (Rosenberg & Bartlett, 2007) and extends their result to the compound regularization penalty in the case of an arbitrary number of views. Since there is no novelty in the proof technique, we do not reproduce it entirely here. For completeness, we recall the main steps: (i) use classical invariance properties of the kernel function to reformulate the optimization problem with an invertible matrix, (ii) apply Lemma 3 below to get the solution, (iii) eventually, rewrite it in a formulation involving the initial data by use of the Sherman-Morrison-Woodbury formula (Golub & Van Loan, 1996). We provide the key intermediate steps adapted to our setting.

Lemma 2. Under assumption (A1), the solution of the minimization problem [1] belongs to the set L ∩ H.

Proof: Let Q be the functional to be minimized, decomposed as Q(f) = Loss(f) + Π(f). For the null multi-view predictor 0 ∈ F, we have Q(0) = Loss(0); thus, under assumption (A1), inf Q ≤ 1. But since all terms of Q are non-negative, the solution is in H. Finally, $f^* \in L$ by the representer theorem. □
First, we apply Lemma 2 to reduce the search space. Then, if $f \in L \cap H$, thanks to the representer theorem, we can write its component in each view as $f^{(v)} = f_{\alpha^{(v)}} = \sum_{i=1}^{n} \alpha_i^{(v)} k_v(\cdot, x_i^{(v)})$, where $\alpha^{(v)} \in \mathbb{R}^n$. Thus, a matrix reformulation of $f \in L \cap H$ is:

$$f \in \big\{(f_{\alpha^{(1)}}, \ldots, f_{\alpha^{(V)}}) : \alpha^T N \alpha \le 1\big\},$$


where $\alpha \in \mathbb{R}^{nV \times 1}$, and the data-dependent square matrix N is¹:

$$N = \tilde{\lambda} K + \tilde{\gamma}\, \mathrm{Diag}\big(K^{(1)} L K^{(1)} \ldots K^{(V)} L K^{(V)}\big) + \sum_{v_1 \ne v_2} K_C^{v_1, v_2}, \quad \text{and}$$

$$K_C^{v_1, v_2} = \begin{pmatrix} 0 \\ \vdots \\ K^{(v_1)} \\ \vdots \\ -K^{(v_2)} \\ \vdots \\ 0 \end{pmatrix} C_{v_1, v_2} \begin{pmatrix} 0 \\ \vdots \\ K^{(v_1)} \\ \vdots \\ -K^{(v_2)} \\ \vdots \\ 0 \end{pmatrix}^T.$$

Thus, the definition of the Rademacher complexity can be seen as the solution to an optimization problem under a quadratic constraint. Indeed, since H is symmetric:

$$R_l(\mathcal{J}) = \frac{2}{lV}\,\mathbb{E}_\sigma \sup_{\alpha:\ \alpha^T N \alpha \le 1} \alpha^T K_L^T \sigma$$

with $\sigma = (\sigma_1, \ldots, \sigma_l)^T \in \mathbb{R}^{l \times 1}$. To apply Lemma 3, we need an invertible matrix.


Let P, Σ such that P (v)T K (v) P (v) = Σ (v) is the diagonal matrix of non zero
(v)
eigenvalues of K (v) .We introduce α// , the projection of α(v) on the subspace
associated to the rows of K (v) . Since αT N α is left unchanged under this pro-
(v)
jection, we rewrite it with a(v) such that P (v) a(v) = α// , ending up with the
constraint aT T a ≤ 1 where T is now an invertible matrix. As mentioned, we use
the following lemmas to conclude:
1
Diag(v1 ....vk ) is a shortcut notation for the square matrix with diagonal blocks
v1 , ..., vk on the diagonal.

Lemma 3. If M is a symmetric positive definite matrix, then

$$\sup_{\alpha:\ \alpha^T M \alpha \le 1} v^T \alpha = \|M^{-1/2} v\|.$$

Lemma 4 (Sherman-Morrison-Woodbury formula). Provided that the inverses exist:

$$(A + U U^T)^{-1} = A^{-1} - A^{-1} U (I + U^T A^{-1} U)^{-1} U^T A^{-1}.$$
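Both lemmas are easy to check numerically; a short sketch:

import numpy as np

rng = np.random.default_rng(0)

# Lemma 3: sup_{a : a^T M a <= 1} v^T a = ||M^{-1/2} v||,
# attained at a* = M^{-1} v / ||M^{-1/2} v||.
G = rng.normal(size=(5, 5))
M = G @ G.T + np.eye(5)                     # symmetric positive definite
v = rng.normal(size=5)
w, Q = np.linalg.eigh(M)
norm = np.linalg.norm(Q @ np.diag(w ** -0.5) @ Q.T @ v)   # ||M^{-1/2} v||
a_star = np.linalg.solve(M, v) / norm
assert np.isclose(v @ a_star, norm) and np.isclose(a_star @ M @ a_star, 1.0)

# Lemma 4: Sherman-Morrison-Woodbury identity.
U = rng.normal(size=(5, 2))
Ainv = np.linalg.inv(M)
lhs = np.linalg.inv(M + U @ U.T)
rhs = Ainv - Ainv @ U @ np.linalg.inv(np.eye(2) + U.T @ Ainv @ U) @ U.T @ Ainv
assert np.allclose(lhs, rhs)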

Sketch of Proof of Theorem 2


The proof essentially follows the same steps as Theorem 1, but uses Lemma 5 below to solve the minimization problem.
By definition, we have $\widehat{D}_{\mathcal{J}}(r) = \sup_{\phi_1, \phi_2 \in \mathcal{J}(r)} \mathrm{Var}_l(\phi_1 - \phi_2)^{1/2}$, which is also

$$\sup_{\phi_1, \phi_2 \in \mathcal{J}(r)} \big(P_l((\phi_1 - \phi_2)^2)\big)^{1/2} \le \sup_{\phi \in \mathcal{J}(r)} \Big(\frac{4}{l} \sum_{i=1}^{l} \phi(x_i)^2\Big)^{1/2}.$$

Since $f^{(v)}(x_i) = K_i^{(v)} \alpha^{(v)}$, where $K_i^{(v)}$ is the i-th row of $K^{(v)}$, we get $\widehat{D}_{\mathcal{J}}(r)^2 \le \frac{4}{V^2 l}\,\alpha^T D \alpha$, where $D \in \mathbb{R}^{nV \times nV}$ is the symmetric matrix with block (v, w) equal to $\sum_{i=1}^{l} K_i^{(v)T} K_i^{(w)}$. Applying Lemma 2 with (A1), $\phi \in \mathcal{J}(r)$ translates into $\alpha^T N \alpha \le r$. If we introduce the same transformation as in Theorem 1, this becomes $a^T T a \le r$ with T now invertible. Moreover, T has the appropriate form $A + U U^T$ for Lemma 4. Thus Lemma 5 first tells us that we want the highest eigenvalue of $T^{-1} P^T D P$.

Lemma 5. When M is symmetric positive definite and Q is symmetric positive semidefinite, the quadratic problem

$$\sup_{a:\ a^T M a \le r} a^T Q a$$

admits as solution λr, where λ is the highest eigenvalue of $M^{-1} Q$.

Now, $D = K e K^T$ with $e \in \mathbb{R}^{n \times n}$ being the projection matrix with diagonal blocks $I_l$ and $0_u$. Lemma 4 applies to $T^{-1}$, and since the eigenvalues of $T^{-1} P^T D P$ and $P T^{-1} P^T D$ are the same, we compute:

$$P A^{-1} P^T D = P P^T B P \tilde{\lambda}^{-1} \Sigma\, P^T K e K^T = B \tilde{\lambda}^{-1} \Pi (K_L^T)^T,$$

where A comes from Lemma 4. Similar computations yield the second term and allow us to conclude the proof.
Error-Correcting Tournaments

Alina Beygelzimer¹, John Langford², and Pradeep Ravikumar³

¹ IBM Thomas J. Watson Research Center, Hawthorne, NY 10532, USA
beygel@us.ibm.com
² Yahoo! Research, New York, NY 10018, USA
jl@yahoo-inc.com
³ University of California, Berkeley, CA 94720, USA
pradeepr@stat.berkeley.edu

Abstract. We present a family of pairwise tournaments reducing k-class classi-


fication to binary classification. These reductions are provably robust against a
constant fraction of binary errors, and match the best possible computation and
regret up to a constant.

1 Introduction
We consider the classical problem of multiclass classification, where given an instance
x ∈ X, the goal is to predict the most likely label y ∈ {1, . . . , k}, according to some
unknown probability distribution.
A common general approach to multiclass learning is to reduce a multiclass prob-
lem to a set of binary classification problems [2,6,10,11,14]. This black-box approach
is composable with any binary learning algorithm (and thus bias), including online al-
gorithms, Bayesian algorithms, and even humans.
A key technique for analyzing reductions is regret analysis, which bounds the “re-
gret” of the resulting multiclass classifier in terms of the average classification “regret”
on the binary problems. Here regret is the difference between the incurred loss and the
smallest achievable loss on the problem, i.e., excess loss due to suboptimal prediction.
The most commonly applied reduction is one-against-all, which creates a binary clas-
sification problem for each of the k classes: The classifier for class i is trained to predict
whether the label is i or not; predictions are done by evaluating each binary classifier
and randomizing over those which predict “yes,” or randomly if all answers are “no”.
This simple reduction is inconsistent, in the sense that given optimal (zero-regret) binary classifiers, the reduction may not yield an optimal multiclass classifier in the presence of noise. Optimizing the squared loss of the binary predictions instead of the 0/1 loss makes the approach consistent, but the resulting multiclass regret may be as high as $\sqrt{2kr}$, where r is the average squared loss regret on the induced problems, which is upper bounded by the average binary classification regret via the Probing reduction [15].
The probabilistic error correcting output code approach (PECOC) [14] reduces k-class classification to learning O(k) regressors on the interval [0, 1], creating O(k) binary examples per multiclass example at both training and test time, with a test time computation of $O(k^2)$. The resulting multiclass regret is bounded by $4\sqrt{r}$, where r is the average squared loss regret of the regressors (which is upper bounded by the average


binary classification regret via the Probing reduction [15]). Thus PECOC removes the
dependence on the number of classes k. When only a constant number of labels have
non-zero probability given x, the complexity can be reduced to O(log k) examples per
multiclass example and O(k log k) computation per example [13].
This leads to several questions:
1. Is there a consistent reduction from multiclass to binary classification that does not
have a square root dependence [17]? For example, an average binary regret of just
0.01 may imply a PECOC multiclass regret of 0.4.
2. Is there a consistent reduction that requires just O(log k) computation, matching the information theoretic lower bound? The well-known tree reduction (see [9]) distinguishes between the labels using a balanced binary tree, where each non-leaf node predicts “Is the correct multiclass label to the left or right?”. As shown in Section 2, this method is inconsistent.
3. Can the above be achieved with a reduction that only performs pairwise compar-
isons between classes? One fear associated with the PECOC approach is that it
creates binary problems of the form “What is the probability that the label is in a
given random subset of labels?,” which may be hard to solve. Although this fear
is addressed by regret analysis (as the latter operates only on excess loss), and is
overstated in some cases [8,13], it is still of some concern, especially with larger
values of k.
The error-correcting tournament family presented here answers all of these questions in the affirmative. It provides a multiclass prediction method that is exponentially faster in k, with the resulting multiclass regret bounded by 5.5r, where r is the average binary regret, and every binary classifier logically compares two distinct class labels.
The result is based on a basic observation that if a non-leaf node fails to predict its
binary label, which may be unavoidable due to noise in the distribution, nodes between
this node and the root should have no preference for class label prediction. Utilizing
this observation, we construct a reduction, called the Filter Tree, with the property that
it uses O(log k) binary examples and O(log k) computation at training and test time
with a multiclass regret bounded by log k times the average binary regret.
The decision process of a Filter Tree, viewed bottom up, is a single-elimination tournament on a set of k players. Using c independent single-elimination tournaments is of no use, as it does not affect the average regret of an adversary controlling the binary classifiers. Somewhat surprisingly, it is possible to have c = log k complete single-elimination tournaments between k players in O(log k) rounds with no player playing twice in the same round [5]. All error-correcting tournaments first pair labels in consecutive interfering single-elimination tournaments, followed by a final carefully weighted single-elimination tournament that decides among the log₂ k winners of the first phase. As for the Filter Tree, test time evaluation can start at the root and proceed to a multiclass label with O(log k) computation.
This construction is also useful for the problem of robust search, yielding the first
algorithm which allows the adversary to err a constant fraction of the time in the “full
lie” setting [16] where a comparator can missort any comparison. Previous work either
applied to the “half lie” case where a comparator can fail to sort but can not actively
missort [5,18] or to a “full lie” setting where an adversary has a fixed known bound on

the number of lies [16] or a fixed budget on the fraction of errors so far [4,3]. Indeed, it
might even appear impossible to have an algorithm robust to a constant fraction of full
lie errors since an error can always be reserved for the last comparison. By repeating
the last comparison O(log k) times we can defeat this strategy.
The result here is also useful for the actual problem of tournament construction in
games with real players. Our analysis does not assume that errors are i.i.d. [7], or have
known noise distributions [1] or known outcome distributions given player skills [12].
Consequently, the tournaments we construct are robust against severe bias such as a bi-
ased referee or some forms of bribery and collusion. Furthermore, the tournaments we
construct are shallow, requiring fewer rounds than m-elimination bracket tournaments,
which do not satisfy the guarantee provided here. In an m-elimination bracket tourna-
ment, bracket i is a single-elimination tournament on all players except the winners of
brackets 1, . . . , i − 1. After the bracket winners are determined, the player winning the
last bracket m plays the winner of bracket m − 1 repeatedly until one player has suf-
fered m losses (they start with m − 1 and m − 2 losses respectively). The winner moves
on to pair against the winner of bracket m − 2, and the process continues until only
one player remains. This method does not scale well to large m, as the final elimination phase takes $\sum_{i=1}^{m} (i - 1) = O(m^2)$ rounds. Even for k = 8 and m = 3, our constructions have smaller maximum depth than bracketed 3-elimination.

Paper overview. Section 2 shows that the simple divide-and-conquer tree approach is
inconsistent, motivating the Filter Tree algorithm described in section 3 (which applies
to more general cost sensitive multiclass problems). Section 3.1 proves that the algo-
rithm has the best possible computational dependence, and gives two upper bounds on
the regret of the returned (cost-sensitive) multiclass classifier.
Section 4 presents the error-correcting tournament family parametrized by an integer
m ≥ 1, which controls the tradeoff between maximizing robustness (m large) and min-
imizing depth (m small). Setting m = 1 gives the Filter Tree, while m = 4 ln k gives
a (multiclass to binary) regret ratio of 5.5 with O(log k) depth. Setting m = ck gives
regret ratio of 3 + O(1/c) with depth O(k). The results here provide a nearly free gen-
eralization of earlier work [5] in the robust search setting, to a more powerful adversary
that can missort as well as fail to sort. Section 5 gives an algorithm-independent lower bound of 2 on the regret ratio for large k. When the number of calls to a binary classifier is independent (or nearly independent) of the label predicted, we strengthen this lower bound to 3 for large k.

2 Inconsistency of Divide and Conquer Trees

One standard approach for reducing multiclass learning to binary learning is to split the
set of labels in half, then learn a binary classifier to distinguish between the subsets, and
repeat recursively until each subset contains one label. Multiclass predictions are made
by following a chain of classifications from the root down to the leaves.
The following theorem shows that there exist multiclass problems such that even
if we have an optimal classifier for the induced binary problem at each node, the tree
reduction does not yield an optimal multiclass predictor.

[Figure 1 depicts a Filter Tree over labels 1–7: first-round pairings 1 vs 2, 3 vs 4 and 5 vs 6 (label 7 receives a bye); second round: {winner of 1 vs 2} vs {winner of 3 vs 4}, and {winner of 5 vs 6} vs 7; the winners meet at the root.]

Fig. 1. Filter Tree. Each node predicts whether the left or the right input label is more likely, conditioned on a given x ∈ X. The root node predicts the best label for x.

Notation. Let D be the underlying distribution over X × Y, where X is some observable feature space and Y = {1, . . . , k} is the label space. The error rate of a classifier f : X → Y on D is given by $\mathrm{err}(f, D) = \Pr_{(x,y)\sim D}[f(x) \ne y]$. The regret of f on D is defined as $\mathrm{reg}(f, D) = \mathrm{err}(f, D) - \min_{f^*} \mathrm{err}(f^*, D)$.

The tree reduction transforms D into a distribution DT over binary labeled examples
by drawing a multiclass example (x, y) from D, drawing a random non-leaf node i,
and outputting instance x, i with label 1 if y is in the left subtree of node i, and 0
otherwise. A binary classifier f for this problem induces a multiclass classifier T (f ),
via a chain of binary predictions starting from the root.

Theorem 1. For all k ≥ 3 and all binary trees over the labels, there exists a multiclass distribution D such that $\mathrm{reg}(T(f^*), D) > 0$ for any $f^* = \arg\min_f \mathrm{err}(f, D_T)$.

Proof. Find a node with one subset corresponding to two labels and the other subset
corresponding to a single label. (If the tree is perfectly balanced, simply let D assign
probability 0 to one of the labels.) Since we can freely rename labels without changing
the underlying problem, let the first two labels be 1 and 2, and the third label be 3.
Choose D with the property that D(y = 1 | x) = D(y = 2 | x) = 1/4 + 1/100,
while D(y = 3 | x) = 1/2 − 2/100. Under this distribution, the fraction of examples
for which label 1 or 2 is correct is 1/2 + 2/100, so any minimum error rate binary
predictor must choose either label 1 or label 2. Each of these choices has an error rate
of 3/4 − 1/100. The optimal multiclass predictor chooses label 3 and suffers an error
rate of 1/2 + 2/100, implying that the regret of the tree classifier based on an optimal
binary classifier is 1/4 − 3/100 > 0.
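The arithmetic in this proof is easy to verify; a quick check in Python:

# Conditional probabilities from the proof: 1/4 + 1/100, 1/4 + 1/100, 1/2 - 2/100
p = {1: 0.26, 2: 0.26, 3: 0.48}

# At the node separating {1, 2} from {3}, the optimal binary classifier points to
# {1, 2} since p[1] + p[2] = 0.52 > 0.48; the tree then picks label 1 or 2.
tree_error = 1 - max(p[1], p[2])        # 3/4 - 1/100 = 0.74
bayes_error = 1 - p[3]                  # 1/2 + 2/100 = 0.52
print(tree_error - bayes_error)         # regret: 1/4 - 3/100 = ~0.22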

3 The Filter Tree Algorithm

The Filter Tree algorithm is illustrated by Figure 1. It is equivalent to a single-elimination


tournament on the set of labels structured as a binary tree T over the labels. In the first
round, the labels are paired according to the lowest level of the tree, and a classifier is
trained for each pair to predict which of the two labels is more likely. (The labels that
don’t have a pair in a given round win that round for free.) The winning labels from the

Algorithm 1. Filter-Train (multiclass training set S, binary learner Learn)
  for each non-leaf node n in order from leaves to root do
    Set Sn = ∅
    for each (x, y) ∈ S such that y ∈ Γ(Tn) and all nodes u on the path n ⇝ y predict yu given x do
      add (x, yn) to Sn
    end
    Let cn = Learn(Sn)
  end
  return c = {cn}

first round are in turn paired in the second round, and a classifier is trained to predict
whether the winner of one pair is more likely than the winner of the other. The process
of training classifiers to predict the best of a pair of winners from the previous round is
repeated until the root classifier is trained.
The setting above is akin to boosting: at each round t, a booster creates an input distribution $D_t$ and calls an oracle learning algorithm to obtain a classifier with some error $\epsilon_t$ on $D_t$. The distribution $D_t$ depends on the classifiers returned by the oracle in previous rounds. The accuracy of the final classifier is analyzed in terms of the $\epsilon_t$'s.
Let Tn be the subtree of T rooted at node n. The set of leaves of a tree T is denoted
by Γ (T ). Let yn be the bit specifying whether the multiclass label y is in the left subtree
of n or not.
The key trick in the training stage (Algorithm 1) is to form the right training set
at each interior node. A training example for node n is formed conditioned on the
predictions of classifiers in the round before it. Thus the learned classifiers from the first
level of the tree are used to “filter” the distribution over examples reaching the second
level of the tree.
Given x and classifiers at each node, every edge in T is identified with a unique
label. The optimal decision at any non-leaf node is to choose the input edge (label) that
is more likely according to the true conditional probability. This can be done by using
the outputs of classifiers in the round before it as a filter during the training process: For
each observation, we set the label to 0 if the left parent’s output matches the multiclass
label, 1 if the right parent’s output matches, and reject the example otherwise.
The testing algorithm, Filter-Test, is very simple. Given a test example x ∈ X,
we output the label y such that every classifier on the path from y to the root prefers y.
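A compact Python sketch of Filter-Train and Filter-Test over a complete pairing (k a power of two); the round-by-round layout and the `learn` interface are our implementation choices, not the paper's:

def filter_train(S, k, learn):
    """Filter-Train sketch. S: list of (x, y) with y in {0, ..., k-1}; `learn`
    maps a list of (x, bit) examples to a predictor x -> bit
    (0 = left entrant wins, 1 = right entrant wins)."""
    winners = [list(range(k)) for _ in S]   # per example: predicted subtree winners
    classifiers = []
    m = k
    while m > 1:
        round_clfs = []
        for j in range(m // 2):
            # keep only examples whose true label survived to this node
            Sn = [(x, 0 if y == winners[i][2 * j] else 1)
                  for i, (x, y) in enumerate(S)
                  if y in (winners[i][2 * j], winners[i][2 * j + 1])]
            round_clfs.append(learn(Sn))
        for i, (x, _) in enumerate(S):      # filter winners up one level
            winners[i] = [winners[i][2 * j + round_clfs[j](x)]
                          for j in range(m // 2)]
        classifiers.append(round_clfs)
        m //= 2
    return classifiers

def filter_test(classifiers, x, k):
    """Run the tournament bottom-up and return the root's winner. For clarity
    this evaluates every node (O(k) calls); the paper's top-down evaluation
    needs only O(log k)."""
    labels = list(range(k))
    for round_clfs in classifiers:
        labels = [labels[2 * j + clf(x)] for j, clf in enumerate(round_clfs)]
    return labels[0]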
Algorithm 2 extends this idea to the cost-sensitive multiclass case where each choice has a different associated cost. Formally, a cost-sensitive k-class classification problem is defined by a distribution D over $X \times [0, 1]^k$. The expected cost of a classifier f : X → {1, ..., k} on D is $\ell(f, D) = \mathbb{E}_{(x,c)\sim D}\, c_{f(x)}$. Here $c \in [0, 1]^k$ gives the cost of each of the k choices for x. As in the multiclass case (which is a special case), the regret of f on D is defined as $\mathrm{reg}_c(f, D) = \ell(f, D) - \min_{f^*} \ell(f^*, D)$.
The algorithm relies upon an importance weighted binary learning algorithm, which
takes examples of the form (x, y, w), where x is a feature vector used for prediction,
y is a binary label, and w ∈ [0, ∞) is the importance any classifier pays if it doesn’t
predict y on x.

Algorithm 2. C-Filter-Train (cost-sensitive training set S, importance-weighted binary learner Learn)
  for each non-leaf node n in order from leaves to root do
    Set Sn = ∅
    for each example (x, c1, ..., ck) ∈ S do
      Let a and b be the two classes input to n
      Sn ← Sn ∪ {(x, arg min{ca, cb}, |ca − cb|)}
    end
    Let cn = Learn(Sn)
  end
  return c = {cn}
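The per-node example construction of Algorithm 2 is a single cost comparison; as a sketch (names ours):

def cost_sensitive_example(x, costs, a, b):
    """Importance-weighted binary example for a node comparing classes a and b:
    label = the cheaper class, importance = |c_a - c_b| (0 means no preference)."""
    label = a if costs[a] <= costs[b] else b
    return (x, label, abs(costs[a] - costs[b]))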

3.1 Filter Tree Analysis


Before doing the regret analysis, we note the computational characteristics of the Filter
Tree. Since the algorithm is a reduction, we count the computational complexity in the
reduction itself, assuming that the oracle calls take unit time.
1. Algorithm 1 requires O(log k) computation per multiclass example, by search-
ing for the correct leaf in O(log k) time, then filtering back toward the root. This
matches the information theoretic lower bound since simply reading one of k labels
requires log2 k bits.
2. Algorithm 2 requires O(k) computation per cost sensitive example, because there
are k − 1 nodes, each requiring constant computation per example. Since any
method must read the k costs, this bound is tight.
3. The testing algorithm is the same for both multiclass and cost-sensitive variants,
requiring O(log k) computation per example to descend a binary tree. Any method
must write out labels of length log2 k bits.
First, we define several concepts necessary to understand the analysis. Algorithm 2
transforms cost-sensitive multiclass examples into importance-weighted binary exam-
ples. This process implicitly transforms a distribution D over cost sensitive multiclass
examples into a distribution DFT over importance-weighted binary examples.
There are many induced problems, one for each call to the oracle Learn. To simplify
the analysis, we use a standard transformation allowing us to consider only a single
induced problem: We add the node index n as an additional feature into each importance
weighted binary example, and then train based upon the union of all the training sets.
The learning algorithm produces a single binary classifier c(x, n) for which we can
redefine cn (x) as c(x, n). The induced distribution DFT can be defined by the following
process: (1) draw a cost-sensitive example (x, c) from D, (2) pick a random node n, (3)
create an importance-weighted sample according to the algorithm, except using x, n
instead of x.
The theorem is quantified over all classifiers, and thus it holds for the classifier re-
turned by the algorithm. In practice, one can either call the oracle multiple times to
learn a separate classifier for each node (as we do in our experiments), or use iterative
techniques for dealing with the fact that the classifiers are dependent on other classifiers
closer to the leaves.
When reducing to importance-weighted classification, the theorem statement depends on importance weights. To remove the importances, we compose the reduction
with the Costing reduction [19], which alters the underlying distribution using rejection sampling on the importance weights. This composition transforms D_FT into a distribution D′ over binary examples.
We use the folk theorem from [19] saying that for all binary classifiers f and all importance-weighted binary distributions P, the importance-weighted binary regret of f on P is upper bounded by E_{(x,y,w)∼P}[w] times the binary regret of f on the induced
binary distribution.
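For intuition, the Costing transformation is plain rejection sampling on the weights. A minimal sketch, assuming all importances lie in [0, w_max]:

import random

def costing_resample(weighted_examples, w_max):
    # Keep (x, y, w) with probability w / w_max; on the accepted examples the
    # weights disappear, yielding an unweighted binary sample as from D'.
    return [(x, y) for (x, y, w) in weighted_examples
            if random.random() < w / w_max]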
The core theorem relates the regret of a binary classifier f to the regret of the induced
cost sensitive classifier Filter-Test(f ).
Theorem 2. For all binary classifiers f and all cost-sensitive multiclass distributions D,

reg_c(Filter-Test(f), D) ≤ reg(f, D′) · E_{(x,c)∼D} [ Σ_{n∈T} w(n, x, c) ],

where w(n, x, c) is the importance weight in Algorithm 2 (the difference in cost between the two labels that node n chooses between on x), and D′ is the induced distribution as defined above.

Before proving the theorem, we state the corollary for multiclass classification.
Corollary 1. For all binary classifiers f and all multiclass distributions D on k labels, for all Filter Trees of depth d, reg(Filter-Test(f), D) ≤ d · reg(f, D_FT).

(Since all importance weights are either 0 or 1, we don't need to apply Costing.) The proof of the corollary given the theorem is simple since for any (x, y), the induced (x, c) has at most one node per level with induced importance weight 1; all other importance weights are 0. Therefore, Σ_n w(n, x, c) ≤ d.
Theorem 3 provides an alternative bound for cost-sensitive classification. It is the first known bound giving a worst-case dependence of less than k.

Theorem 3. For all binary classifiers f and all cost-sensitive k-class distributions D, reg_c(Filter-Test(f), D) ≤ (k/2) · reg(f, D′), where D′ is as defined above.

The remainder of this section proves Theorems 2 and 3.
Proof. (Theorem 2) It is sufficient to prove the claim for any x ∈ X because that implies that the result holds for all expectations over x.
Conditioned on the value of x, each label y has a distribution over costs c_y with an expected value E_{c∼D|x}[c_y]. The zero regret cost sensitive classifier predicts according to arg min_y E_{c∼D|x}[c_y]. Suppose that Filter-Test(f) predicts y′ on x, inducing cost sensitive regret reg_c(y′, D|x) = E_{c∼D|x}[c_{y′}] − min_y E_{c∼D|x}[c_y].
First, we show that the sum over the binary problems of the importance weighted regret is at least reg_c(y′, D|x), using induction starting at the leaves. The induction hypothesis is that the sum of the regrets of importance-weighted binary classifiers in any subtree bounds the regret of the subtree output.
For node n, each importance weighted binary decision between class a and class b has an importance weighted regret which is either 0 or r_n = |E_{c∼D|x}[c_a − c_b]| = |E_{c∼D|x}[c_a] − E_{c∼D|x}[c_b]|, depending on whether the prediction is correct or not. Assume without loss of generality that the predictor outputs class b. The regret of the subtree T_n rooted at n is given by r_{T_n} = E_{c∼D|x}[c_b] − min_{y∈Γ(T_n)} E_{c∼D|x}[c_y].
As a base case, the inductive hypothesis is trivially satisfied for trees with one label. Inductively, assume that Σ_{n′∈L} r_{n′} ≥ r_L and Σ_{n′∈R} r_{n′} ≥ r_R for the left subtree L of n (providing a) and the right subtree R (providing b).
There are two possibilities. Either the minimizer comes from the leaves of L or the leaves of R. The second possibility is easy since we have

r_{T_n} = E_{c∼D|x}[c_b] − min_{y∈Γ(R)} E_{c∼D|x}[c_y] = r_R ≤ Σ_{n′∈R} r_{n′} ≤ Σ_{n′∈T_n} r_{n′},

which proves the induction.
For the first possibility, we have

r_{T_n} = E_{c∼D|x}[c_b] − min_{y∈Γ(L)} E_{c∼D|x}[c_y]
        = E_{c∼D|x}[c_b] − E_{c∼D|x}[c_a] + E_{c∼D|x}[c_a] − min_{y∈Γ(L)} E_{c∼D|x}[c_y]
        = E_{c∼D|x}[c_b] − E_{c∼D|x}[c_a] + r_L ≤ r_n + Σ_{n′∈L} r_{n′} ≤ Σ_{n′∈T_n} r_{n′},

which completes the induction. The inductive hypothesis for the root is that reg_c(y′, D|x) ≤ Σ_{n∈T} r_n, implying reg_c(y′, D|x) ≤ Σ_{n∈T} r_n = (k − 1) · r_i(f, D_FT), where r_i is the importance weighted binary regret on the induced problem.
Using the folk theorem from [19], we have r_i(f, D_FT) = reg(f, D′) E_{(x,y,w)∼D_FT}[w]. The expected importance is (1/(k − 1)) E_{(x,c)∼D} Σ_{n∈T} w(n, x, c). Plugging this in, we get the theorem.
The proof of Theorem 3 makes use of the following inequality. Consider a Filter Tree T evaluated on a cost-sensitive multiclass instance with cost vector c ∈ [0, 1]^k. Let S_T be the sum of importances over all nodes in T, and I_T be the sum of importances over the nodes where the class with the larger cost was selected for the next round. Let c_T denote the cost of the winner chosen by T.

Lemma 1. For any Filter Tree T on k labels, S_T + c_T ≤ I_T + k/2.
Proof. The inequality follows by induction, the result being clear when k = 2. Assume that the claim holds for the two subtrees, L and R, providing their respective inputs l and r to the root of T, and T outputs r without loss of generality. Using the inductive hypotheses for L and R, we get S_T + c_T = S_L + S_R + |c_r − c_l| + c_r ≤ I_L + I_R + k/2 − c_l + |c_r − c_l|. If c_r ≥ c_l, we have I_T = I_L + I_R + (c_r − c_l), and S_T + c_T ≤ I_T + k/2 − c_l ≤ I_T + k/2, as desired. If c_r < c_l, we have I_T = I_L + I_R and S_T + c_T ≤ I_T + k/2 − c_r ≤ I_T + k/2, completing the proof.
Proof. (Theorem 3) We will fix (x, c) ∈ X × [0, 1]^k and take the expectation over the draw of (x, c) from D as the last step.
Consider a Filter Tree T evaluated on (x, c) using a given binary classifier b. As before, let S_T be the sum of importances over all nodes in T, and I_T be the sum of importances over the nodes where b made a mistake. Recall that the regret of T on (x, c), denoted in the proof by reg_T, is the difference between the cost of the tree's output and the smallest cost c*. The importance-weighted binary regret of b on (x, c) is simply I_T/S_T. Since the expected importance is upper bounded by 1, I_T/S_T also bounds the binary regret of b.
The inequality we need to prove is reg_T S_T ≤ (k/2) I_T. The proof is by induction on k, the result being trivial if k = 2. Assume that the assertion holds for the two subtrees, L and R, providing their respective inputs l and r to the root of T. (The number of classes in L and R can be taken to be even, by splitting the odd class into two classes with the same cost as the split class, which has no effect on the quantities in the theorem statement.)
Let the best cost c* be in the left subtree L. Suppose first that T chooses r and c_r > c_l. Let w = c_r − c_l. We have reg_L = c_l − c* and reg_T = c_r − c* = reg_L + w. The left hand side of the inequality is thus reg_T S_T = (reg_L + w)(S_R + S_L + w) = w(reg_L + S_R + S_L + w) + reg_L(S_L + S_R) ≤ w(reg_L + I_R + I_L − c_r − c_l + w + k/2) + reg_L(I_R + I_L − c_l − c_r + k/2) ≤ (k/2)w + I_R(w + reg_L) + I_L(w + reg_L) + reg_L(k/2 − c_r − c_l) ≤ (k/2)w + I_R(w + reg_L) + I_L(w + reg_L + k/2 − c_r − c_l) ≤ (k/2)w + I_R(w + reg_L) + (k/2)I_L ≤ (k/2)(w + I_R + I_L) = (k/2)I_T. The first inequality follows from Lemma 1. The second and fourth follow from w(reg_L − c_l − c_r + w) ≤ 0. The third follows from reg_L ≤ I_L. The fifth follows from reg_T ≤ k/2 for k ≥ 2.
The proofs for the remaining three cases (c_T = c_l < c_r, c_T = c_l > c_r, and c_l > c_r = c_T) use the same machinery as the proof above.
Case 2. T outputs l, and c_l < c_r. In this case reg_T = reg_L = c_l − c*. The left hand side can be rewritten as reg_T S_T = reg_L(S_R + S_L + c_r − c_l) = reg_L S_L + reg_L(S_R + c_r − c_l) ≤ reg_L(I_L + I_R − 2c_l + k/2) ≤ I_R + reg_L(I_L − 2c_l + k/2) ≤ I_R + I_L(reg_L − 2c_l + k/2) ≤ I_R + (k/2) I_L ≤ (k/2) I_T. The first inequality follows from the lemma, the second from reg_L ≤ 1, the third from reg_L ≤ I_L, the fourth from −c_l − c* < 0, and the fifth because I_T = I_L + I_R.
Case 3. T outputs l, and c_l > c_r. We have reg_T = reg_L = c_l − c*. The left hand side can be written as

reg_T S_T = reg_L(S_R + S_L + c_l − c_r) ≤ (|L|/2) I_L + reg_L(I_R + (k − |L|)/2 − c_r + c_l − c_r)
          ≤ (k/2) I_L + I_R + (k/2)(c_l − 2c_r) ≤ (k/2)(I_L + I_R + (c_l − c_r)) = (k/2) I_T.

The first inequality follows from the inductive hypothesis and the lemma, the second from reg_L < 1 and reg_L < I_L, and the third from c_r > 0 and k/2 > 1.
Case 4. T outputs r, and c_l > c_r. Let w = c_l − c_r. We have reg_T = c_r − c* = reg_L − w. The left hand side can be written as
reg_T S_T = (reg_L − w)(S_R + S_L + w)
          = reg_L S_L − w S_L + (reg_L − w)(S_R + w)
          ≤ (|L|/2) I_L − w(I_L + |L|/2 − c_l) + (reg_L − w)(I_R + c_l − 2c_r + (k − |L|)/2)
          ≤ (|L|/2) I_L − w(I_L + |L|/2 − c_l) + (I_L − w)(k − |L|)/2 + (reg_L − w)(I_R + c_l − 2c_r)
          ≤ (k/2)(I_L + I_R) − (k/2)w − w(I_L − c_l) + (reg_L − w)(c_l − 2c_r).
The first inequality follows from the inductive hypothesis and the lemma, the second from reg_L ≤ I_L, and the third from reg_L ≤ k/2.
The last three terms are upper bounded by −w − w reg_L + w c_l + reg_L c_l − 2c_r reg_L − w c_l + 2w c_r ≤ −w − reg_L(c_r + c_l) + reg_L c_l + 2w c_r ≤ −w − (c_l − c*)c_r + w c_r + (c_l − c_r)c_r ≤ 0, and thus can be ignored, yielding reg_T S_T ≤ (k/2)(I_L + I_R) = (k/2)I_T. Taking the expectation over (x, c) completes the proof.
3.2 Lower Bound
The following simple example shows that the theorem is essentially tight in the worst case.
Let k be a power of two, and let every label have cost 0 if it is even, and 1 otherwise. The tree structure is a complete binary tree of depth log k with the nodes being paired in the order of their labels. Suppose that all pairwise classifications are correct, except that class k wins all its log k games, leading to cost-sensitive multiclass regret 1. If T is the resulting filter tree, we have reg_T = 1, S_T = k/2 + log k − 1, and I_T = log k, leading to the regret ratio reg_T S_T / I_T = (k/2 + log k − 1)/log k = Ω(k/(2 log k)), almost matching the theorem's bound of k/2 on the regret ratio.
4 Error-Correcting Tournaments
In this section we first state and then analyze error-correcting tournaments. As this section builds on the previous one, understanding the previous section should be considered a prerequisite for reading this one. For simplicity, we work with only the multiclass case. An extension to cost-sensitive multiclass problems is possible using the importance weighting techniques of the previous section.
4.1 Algorithm Description
An error-correcting tournament is one of a family of m-elimination tournaments where
m is a natural number. An m-elimination tournament operates in two phases. The first
phase consists of m single-elimination tournaments over the k labels where a label is
paired against another label at most once per round. Consequently, only one of these
single elimination tournaments has a simple binary tree structure—see for example
Figure 2 for an m = 3 elimination tournament on k = 8 labels. There is substan-
tial freedom in exactly how the pairings of the first phase are done—our bounds are
dependent on the depth of any mechanism which pairs labels in m distinct single elim-
ination tournaments. One such explicit mechanism is stated in [5]. Note that once an
(x, y) example has lost m times, it is eliminated and no longer influences training at the
nodes closer to the root.
The second phase is a final elimination phase, where we select the winner from the m winners of the first phase. It consists of a redundant single-elimination tournament, where the degree of redundancy increases as the root is approached. To quantify the redundancy, let every subtree Q have a charge c_Q equal to the number of leaves under the subtree. First phase winners at the leaves of the final elimination tournament have charge 1. For any non-leaf node comparing subtree R to subtree L, the importance weight of a binary example is set to max{c_R, c_L}. For reference, in tournament applications, an importance weight can be expressed by playing games repeatedly, where the winner of R must beat the winner of L c_L times to advance, and vice versa.
One complication arises: what happens when the two labels compared are the same?
In this case, the importance weight is set to 0, indicating there is no preference in the
pairing amongst the two choices.
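A small sketch of this weighting rule (our encoding: a subtree is a nested pair whose leaves are first-phase winners):

def charge(subtree):
    # The charge c_Q of a subtree Q is its number of leaves.
    if not isinstance(subtree, tuple):
        return 1                          # a first-phase winner has charge 1
    return charge(subtree[0]) + charge(subtree[1])

def comparison_weight(left, right, label_left, label_right):
    # Importance weight of the comparison at a non-leaf node: 0 when both
    # subtrees forward the same label, otherwise the larger child charge.
    return 0 if label_left == label_right else max(charge(left), charge(right))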
Fig. 2. An example of a 3-elimination tournament on k = 8 players. There are m = 3 distinct single elimination tournaments in the first phase: one as solid lines, one as dashed lines, and one as dotted lines. After that, a final elimination phase occurs over the three winners of the first phase. The final elimination tournament has an extra weighting on the nodes, detailed in the text.
4.2 Error Correcting Tournament Analysis
A key concept throughout this section is the importance depth, defined as the worst-
case length (number of games) of the overall tournament, where importance-weighted
matches in the final elimination phase are played as repeated games. In Theorem 6 we
prove a bound on the importance depth.
The computational bound per example is essentially just the importance depth.
Theorem 4. (Structural Depth Bound) For any m-elimination tournament, the training
and test computation is O(m + ln k) per example.
Proof. The proof is by simplification of the importance depth bound (Theorem 6), which bounds the sum of importance weights at all nodes in the circuit.
To see that the importance depth controls the computation, first note that the impor-
tance depth bounds the circuit depth since all importance weights are at least 1. At train-
ing time, any one example is used at most once per circuit level starting at the leaves.
At testing time, an unlabeled example can have its label determined by traversing the
structure from root to leaf.
4.3 Regret Analysis
Our regret theorem is the analogue of Corollary 1 for error-correcting tournaments. Using the one classifier trick detailed there, the reduction transforms a multiclass distribution D into an induced distribution ECT(D) over binary labeled examples. Let f_ECT denote the multiclass predictor induced by a binary classifier f.
It is useful to have the notation ⌈m⌉₂ for the smallest power of 2 larger than or equal to m.
Theorem 5. (Main Theorem) For all distributions D over k-class examples, all binary classifiers f, all m-elimination tournaments ECT, the ratio of reg(f_ECT, D) to reg(f, ECT(D)) is upper bounded by

2 + ⌈m⌉₂/m + k/(2m)   for all m ≥ 2 and k > 2,
4 + (2 ln k)/m + 2√((ln k)/m)   for all k ≤ 2^62 and m ≤ 4 log₂ k.
The first case shows that a regret ratio of 3 is achievable for very large m. The second
case is the best bound for cases of common interest. For m = 4 ln k it gives a ratio of
5.5.
Proof. The proof holds for each input x, and hence in expectation over x. For a fixed x, we can define the regret of any label y as r_y = max_{y′∈{1,...,k}} D(y′ | x) − D(y | x).
A node n comparing two labels y and y′ has regret r_n, which is |D(y′ | x) − D(y | x)| if the most probable label is not predicted, and 0 otherwise. The regret of a tree T is defined as r_T = Σ_{n∈T} r_n.
The first part of the proof is by induction on the tree structure F of the final phase. The invariant for a subtree Q of F won by label q is c_Q r_q ≤ r_Q + Σ_{w∈Γ(Q)} r_{T_w}, where w is the winner of the first phase single-elimination tournament T_w.
When Q is a leaf w of F, we have c_Q r_q = r_q ≤ r_{T_w}, where the inequality is from Corollary 1, noting that d times the average binary regret is the sum of binary regrets.
Assume inductively that the hypothesis holds at node n for the right subtree R and the left subtree L of Q with respective winners q and l: c_R r_q ≤ r_R + Σ_{w∈Γ(R)} r_{T_w} and c_L r_l ≤ r_L + Σ_{w∈Γ(L)} r_{T_w}. Now, a chain of inequalities holds, completing the induction: r_Q + Σ_{w∈Γ(Q)} r_{T_w} ≥ c_L r_n + r_R + r_L + Σ_{w∈Γ(R)} r_{T_w} + Σ_{w∈Γ(L)} r_{T_w} ≥ c_L r_n + c_R r_q + c_L r_l ≥ c_Q r_q. Here the first inequality uses the fact that the adversary must pay at least c_L r_n to make q win. The second inequality follows by the inductive hypothesis. The third inequality comes from r_l + r_n ≥ r_q. To finish the proof, m · reg(f_ECT, D | x) = c_F r_f ≤ r_F + Σ_{w∈Γ(F)} r_{T_w} ≤ d · reg(f, ECT(D | x)), where
d is the maximum importance depth and the last quantity follows from the folk theorem in [19]. Applying the importance depth bound (Theorem 6) and algebra completes the proof.

The depth bound follows from the following two lemmas.
Lemma 2. (First Phase Depth Bound) The importance depth of the first phase tournament is bounded by the minimum of

⌈log₂ k⌉ + m ⌈log₂(⌈log₂ k⌉ + 1)⌉,
1.5 log₂ k + 3m + 1,
k/2 + 2m,
and, for k ≤ 2^62 and m ≤ 4 log₂ k, 2(m − 1) + ln k + √(ln k) · √(ln k + 4(m − 1)).
Proof. The depth of the first phase is bounded by the classical problem of robust min-
imum finding with low depth. The first three cases hold because any such construction
upper bounds the depth of an error correcting tournament, and one such construction
has these bounds [5].
For the fourth case, we construct the depth bound by analyzing a continuous relax-
ation of the problem. The relaxation allows the number of labels remaining in each
single elimination tournament of the first phase to be broken into fractions. Relative to
this version, the actual problem has two important discretizations:
1. When a single-elimination tournament has only a single label remaining, it enters the next single elimination tournament. This can have the effect of decreasing the depth compared to the continuous relaxation.
2. When a single-elimination tournament has an odd number of labels remaining, the
odd label does not play that round. Thus the number of players does not quite halve,
potentially increasing the depth compared to the continuous relaxation.
In the continuous version, tournament i on round d has (d choose i−1) · k/2^d labels, where the first tournament corresponds to i = 1. Consequently, the number of labels remaining in any of the tournaments is (k/2^d) · Σ_{i=1}^{m} (d choose i−1). We can get an estimate of the depth by finding the value of d such that this number is 1.
This value of d can be found using the Chernoff bound. The probability that a coin with bias 1/2 has m − 1 or fewer heads in d coin flips is bounded by e^{−2d(1/2 − (m−1)/d)²}, and the probability that this occurs in k attempts is bounded by k times that. Setting this value to 1, we get ln k = 2d(1/2 − (m−1)/d)². Solving the equation for d gives d = 2(m − 1) + ln k + √(4(m − 1) ln k + (ln k)²). This last formula was verified computationally for k < 2^62 and m < 4 log₂ k by discretizing k into factors of 2 and running a simple program to keep track of the number of labels in each tournament at each level. For k ∈ {2^{l−1} + 1, . . . , 2^l}, we used a pessimistic value of k = 2^{l−1} + 1 in the above formula to compute the bound, and compared it to the output of the program for k = 2^l.
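A rough Python reconstruction of such a program (our sketch, under simplifying assumptions: losers of tournament i drop into tournament i + 1, an odd label sits out its round, and the depth-decreasing effect of discretization 1 above is ignored):

def first_phase_depth(k, m):
    # counts[i] = labels still alive in single-elimination tournament i;
    # a label is eliminated once it loses in the last tournament (m losses).
    counts = [k] + [0] * (m - 1)
    rounds = 0
    while any(c > 1 for c in counts):
        nxt = counts[:]
        for i in range(m):
            pairs = counts[i] // 2        # matches played in tournament i this round
            nxt[i] -= pairs               # losers leave tournament i...
            if i + 1 < m:
                nxt[i + 1] += pairs       # ...and enter tournament i + 1
        counts = nxt
        rounds += 1
    return rounds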
Lemma 3. (Second Phase Depth Bound) In any m-elimination tournament, the second phase has importance depth at most ⌈m⌉₂ − 1 rounds for m > 1.

Proof. When two labels are compared in round i ≥ 1, the importance weight of their comparison is at most 2^{i−1}. Thus we have Σ_{i=1}^{⌈log₂ m⌉ − 1} 2^{i−1} + ⌈m⌉₂/2 = ⌈m⌉₂ − 1.

Putting everything together gives the importance depth theorem.
Theorem 6. (Importance Depth Bound) For all m-elimination tournaments, the importance depth is upper bounded by

⌈log₂ k⌉ + m ⌈log₂(⌈log₂ k⌉ + 1)⌉ + ⌈m⌉₂,
1.5 log₂ k + 3m + ⌈m⌉₂,
k/2 + 2m + ⌈m⌉₂,
and, for k ≤ 2^62 and m ≤ 4 log₂ k, 2m + ⌈m⌉₂ + 2 ln k + 2√(m ln k).
Proof. We simply add the depths of the first and second phases from Lemmas 2 and 3. For the last case, we bound √(ln k + 4(m − 1)) ≤ √(ln k) + 2√m and eliminate the subtractions in Lemma 3.
5 Lower Bound
All of our lower bounds hold for a somewhat more powerful adversary which is more
natural in a game playing tournament setting. In particular, we disallow reductions
which use importance weighting on examples, or equivalently, all importance weights
are set to 1. Note that we can modify our upper bound to obey this constraint by trans-
forming final elimination comparisons with importance weight i into 2i − 1 repeated
comparisons and using the majority vote. This modified construction has an importance depth which is at most m larger, implying that the ratio of the adversary's and the reduction's regret increases by at most 1.
The first lower bound says that for any reduction algorithm B, there exists an ad-
versary A with the average per-round regret r such that A can make B incur regret 2r
even if B knows r in advance. Thus an adversary who corrupts half of all outcomes
can force a maximally bad outcome. In the bounds below, fB denotes the multiclass
classifier induced by a reduction B using a binary classifier f .
Theorem 7. For any deterministic reduction B from k-class classification with k > 2 to binary classification, there exists a choice of D and f such that reg(f_B, D) ≥ 2 reg(f, B(D)).
Proof. The adversary A picks any two labels i and j. All comparisons involving i
but not j, are decided in favor of i. Similarly for j. The outcome of comparing i and
j is determined by the parity of the number of comparisons between i and j in some
fixed serialization of the algorithm. If the parity is odd, i wins; otherwise, j wins. The
outcomes of all other comparisons are picked arbitrarily.
Suppose that the algorithm halts after some number of queries c between i and j. If
neither i nor j wins, the adversary can simply assign probability 1/2 to i and j. The
adversary pays nothing while the algorithm suffers loss 1, yielding a regret ratio of ∞.
Assume without loss of generality that i wins. The depth of the circuit is either c or
at least c + 1, because each label can appear at most once in any round. If the depth is
c, then since k > 2, some label is not involved in any query, and the adversary can set
the probability of that label to 1 resulting in ρ(B) = ∞.
Otherwise, A can set the probability of label j to be 1 while all others have probability 0. The total regret of A is at most (c + 1)/2, while the regret of the winning label is 1. Multiplying by the depth bound c + 1 gives a regret ratio of at least 2.
Note that the number of rounds in the above bound can depend on A. Next, we show
that for any algorithm B taking the same number of rounds for any adversary, there
exists an adversary A with a regret of roughly one third, such that A can make B incur
the maximal loss, even if B knows the power of the adversary.
Lemma 4. For any deterministic reduction B to binary classification with number of rounds independent of the query outcomes, there exists a choice of D and f such that reg(f_B, D) ≥ (3 − 2/k) reg(f, B(D)).
Proof. Let B take q rounds to determine the winner, for any set of query outcomes. We will design an adversary A that incurs regret r = qk/(3k − 2), such that A can make B incur the maximal loss of 1, even if B knows r.
The adversary's query answering strategy is to answer consistently with label 1 winning for the first (2(k − 1)/k) r rounds, breaking ties arbitrarily. The total number of queries that B can ask during this stage is at most (k − 1)r since each label can play at most once in every round, and each query occupies two labels. Thus the total amount of regret at this point is at most (k − 1)r, and there must exist a label i other than label 1 with at most r losses. In the remaining q − (2(k − 1)/k) r = r rounds, A answers consistently with label i winning and all other skills being 0.
Now if B selects label 1, A can set D(i | x) = 1 with r/q average regret from the first stage. If B selects label i instead, A can choose that D(1 | x) = 1. Since the number of queries between labels i and 1 in the second stage is at most r, the adversary incurs average regret at most r/q. If B chooses any other label to be the winner, the regret ratio is unbounded.
References
1. Adler, M., Gemmell, P., Harchol-Balter, M., Karp, R., Kenyon, C.: Selection in the presence
of noise: The design of playoff systems. In: SODA 1994 (1994)
2. Allwein, E., Schapire, R., Singer, Y.: Reducing multiclass to binary: A unifying approach for
margin classifiers. Journal of Machine Learning Research 1, 113–141 (2000)
3. Aslam, J., Dhagat, A.: Searching in the presence of linearly bounded errors. In: STOC 1991
(1991)
4. Borgstrom, R., Rao Kosaraju, S.: Comparison-based search in the presence of errors. In: STOC 1993 (1993)
5. Denejko, P., Diks, K., Pelc, A., Piotrów, M.: Reliable minimum finding comparator networks. Fundamenta Informaticae 42, 235–249 (2000)
6. Dietterich, T., Bakiri, G.: Solving multiclass learning problems via error-correcting output
codes. Journal of Artificial Intelligence Research 2, 263–286 (1995)
7. Feige, U., Peleg, D., Raghavan, P., Upfal, E.: Computing with unreliable information. In:
Symposium on Theory of Computing, pp. 128–137 (1990)
8. Foster, D., Hsu, D.: http://hunch.net/?p=468
9. Fox, J.: Applied regression analysis, linear models, and related methods. Sage Publications,
Thousand Oaks (1997)
10. Guruswami, V., Sahai, A.: Multiclass learning, Boosting, and Error Correcting Codes. In:
COLT 1999 (1999)
11. Hastie, T., Tibshirani, R.: Classification by pairwise coupling. In: NIPS 1997 (1997)
12. Herbrich, R., Minka, T., Graepel, T.: TrueSkill(TM): A Bayesian skill rating system. In: NIPS
2007 (2007)
13. Hsu, D., Langford, J., Kakade, S., Zhang, T.: Multi-label prediction via compressed sensing
(2009); arXiv:0902.1284v1
14. Langford, J., Beygelzimer, A.: Sensitive Error Correcting Output Codes. In: Auer, P., Meir,
R. (eds.) COLT 2005. LNCS (LNAI), vol. 3559, pp. 158–172. Springer, Heidelberg (2005)
15. Langford, J., Zadrozny, B.: Estimating class membership probabilities using classifier learn-
ers. In: AISTAT 2005 (2005)
16. Ravikumar, B., Ganesan, K., Lakshmanan, K.B.: On selecting the largest element in spite of
erroneous information. In: Brandenburg, F.J., Wirsing, M., Vidal-Naquet, G. (eds.) STACS
1987. LNCS, vol. 247, pp. 88–99. Springer, Heidelberg (1987)
17. Williamson, B.: Personal communication
18. Yao, A.C., Yao, F.F.: On fault-tolerant networks for sorting. SIAM Journal on Computing 14(1), 120–128 (1985)
19. Zadrozny, B., Langford, J., Abe, N.: Cost-sensitive learning by cost-proportionate example
weighting. In: ICDM 2003 (2003)
Difficulties in Forcing Fairness of
Polynomial Time Inductive Inference
John Case and Timo Kötzing
Department of Computer and Information Sciences,
University of Delaware, Newark, DE 19716-2586, USA
{case,koetzing}@cis.udel.edu

Abstract. There are difficulties obtaining fair feasibility from polynomial time updated language learning in the limit from positive data.
Pitt 1989 noted that unfair delaying tricks can achieve polynomial time
updates but with no feasibility constraint on the whole learning pro-
cess. In this context Yoshinaka 2009 makes a useful list of properties or
restrictions towards true feasibility. He also provides interesting exam-
ples of fair polynomial time algorithms featuring particular uniformly
polynomial time decidable hypothesis spaces, and each of his algorithms
satisfies several of his properties.
Yoshinaka claims that the combination of the three restrictions on
polynomial time learners of consistency (which we call herein postdictive
completeness), conservativeness and prudence is restrictive enough to
stop Pitt’s delaying tricks from working.
The present paper refutes the claim of the previous paragraph in three
settings. In the setting of uniformly polynomial time decidable hypoth-
esis spaces with a few effective closure properties, the three restrictions
allow maximal unfairness. The other two settings involve certain other
uniformly decidable hypothesis spaces and general language learning hy-
pothesis spaces. In each of these settings, the three restrictions forbid
some, but not all Pitt-style delaying tricks.
Inside the proofs of each of our theorems asserting that the three re-
strictions do not forbid some or all delaying tricks, the witnessing learners
can be seen to explicitly employ delaying tricks.
1 Introduction
For a class of (at least computably enumerable) languages L and an algorithmic learning function h, we say that h TxtEx-learns L [Gol67, JORS99] iff, for each
L ∈ L, for every function T enumerating (or presenting) all and only the elements
of L (with or without pauses), as h is fed the succession of values T (0), T (1), . . ., it
outputs a corresponding succession of programs p(0), p(1), . . . from some hypoth-
esis space, and, for some i0 , for all i ≥ i0 , p(i) is a correct program for L, and
p(i + 1) = p(i). The function T as just above is called a text or presentation for
L. TxtEx-learning is also called learning in the limit from positive data.
We say that h TxtEx-learns L in polynomial time iff there is a polynomial Q such that, for each i, h computes p(i) within time Q(|T(0), T(1), . . . , T(i − 1)|).
Pitt [Pit89] notes (in a slightly different context) that such a definition of polyno-
mial time learning may not give one any feasibility restriction on the total time
for successful learning. Here is informally why. Suppose h is any TxtEx-learner.
Then, for suitable polynomial Q, a variant of learner h can delay outputting
significant conjectures based on data σ until it has seen a much larger sequence
of data τ so that Q(|τ |) is enough time for h to think about σ as long as it needs.
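An illustrative sketch of such a delaying variant (ours, not Pitt's literal construction; time_bound(i) is an assumed known bound on h's runtime over prefixes of length i, and budget plays the role of the polynomial Q):

def delayed_learner(data, h, time_bound, budget, default_program=0):
    # Polynomial time per update: run h only on the longest prefix sigma whose
    # (known) cost fits within Q(|tau|) for the full data tau seen so far.
    n = len(data)
    i = 0
    for j in range(n + 1):
        if time_bound(j) <= budget(n):
            i = j                         # longest affordable prefix length
    return h(data[:i]) if i > 0 else default_program

Each update is feasible, yet the conjecture may lag arbitrarily far behind the data, so no bound on the total time until successful learning follows.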
Pitt [Pit89] discusses some possible ways to forbid such unfair delaying tricks.
More recently, Yoshinaka [Yos09] compiled a very useful list of properties to help
toward achieving fairness and efficiency in polynomial time learners, including to
avoid Pitt-style delaying tricks. In the second part of [Yos09], Yoshinaka provides
a number of interesting example fair polynomial time learners each satisfying
several of these properties. In each of his example algorithms, the associated
hypothesis space is uniformly polynomial time decidable.1 In the present paper,
we focus, for polynomial time learners, on three of Yoshinaka’s properties: Post-
dictive completeness2 , conservativeness, and prudence. Postdictive completeness
[Bār74, BB75, Wie76, Wie78] requires that each hypothesis output by a learner
correctly postdicts the input data on which that hypothesis is based. Conserva-
tiveness [Ang80] requires that each hypothesis may be changed only if it fails to
predict a new datum. Prudence [Wei82, OSW86] requires each output hypothesis
has to be for a target that the learner actually learns.
Yoshinaka [Yos09] claims that, for efficient learning in the limit from positive
data, the combination of postdictive completeness, conservativeness and prudence
is restrictive enough to prevent all Pitt-style delaying tricks.
In the present paper, in several different settings (settings mostly as to kind
of hypothesis spaces), we refute the claim of the immediately above paragraph.
In one of our settings, uniformly polynomial time decidable hypothesis spaces
with a few effective closure properties,3,4 the three restrictions allow maximal

1
These spaces are such that there is a polynomial Q and an algorithm so that, from
both an hypothesis i and an object x, the algorithm returns, within time Q(|i|,|x|)
a correct decision as to whether x is in the language defined by hypothesis i.
2
In the prior literature, except for [Ful88] and [CK08a, CK08b], what we call post-
dictive completeness is called consistency.
3
These effective closure properties pertain to obtaining finite languages and modifi-
cations of languages by finite languages.
4
The particular uniformly polynomial time hypothesis spaces Yoshinaka employs in
the second half of [Yos09] do not have our few effective closure properties, but his
algorithms would work essentially unchanged were one to extend his hypothesis
spaces to ones with our few effective closure properties. Then his algorithms would
not search or use the new hypotheses and would not learn any more languages. The
space of CFGs with Prohibition discussed below in this section and in Section 2.1
further below would work as such an extension of both Yoshinaka’s hypothesis spaces.
Yoshinaka does mention the possibility of extending his hypothesis spaces to provide
an hypothesis for Σ ∗ . We did not examine whether we could, in some cases, work
with such an extension instead of our few effective closure properties. We also did
not examine whether we can modify our (to be mentioned shortly) Theorem 13 to
cover just his particular hypothesis spaces.
unfairness (Theorem 13 in Section 3 below).5 An example of our uniformly polynomial time decidable hypothesis spaces (with a few effective closure properties)
employs efficiently coded DFAs. Another example employs an (also efficiently
coded), interesting extension of context free grammars (CFGs), called CFGs
with Prohibition [Bur05]. This latter example is treated in more detail after
Definition 2 in Section 2.1 below.
In all of our settings, any combination of just the two restrictions of con-
servativeness and prudence allows for arbitrary delaying tricks (Theorem 18 in
Section 5).
In each of our two settings besides the first setting of uniformly polynomial
time decidable hypothesis spaces (with a few effective closure properties), post-
dictive completeness does strictly forbid some, but not all Pitt-style delaying
tricks.6
The two residual settings are: 1. TxtEx-learning with certain other uniformly
decidable hypothesis spaces (Section 4), e.g., the (efficiently coded), explicitly
clocked, multi-tape Turing Machines which halt in linear time [RC94, Chapter 6]7
and 2. TxtEx-learning with a general purpose hypothesis space (Section 5).
The theorems that postdictive completeness forbids some delaying tricks in
these last two settings are: Theorem 14 in Section 4 and Theorem 19 in Section 5.
The theorems that postdictive completeness does not forbid all delaying tricks
in these last two settings are: Theorem 15 in Section 4 and Theorem 22 in
Section 5.
Inside the proofs of each of our theorems asserting that the three restrictions
do not forbid some or all delaying tricks, the witnessing learners can be seen to
explicitly employ delaying tricks. Note that many of our delaying tricks involve
“overlearning,” i.e., learning a larger class of languages than required.
To avoid having to define successively each of a large number of criteria of
successful learning (e.g., restricted variants of TxtEx-learning), we provide a
modular approach to presenting such definitions. In our modular approach, we
define names for “pieces” of our criteria (Section 2.1). Then, after that, each
criterion needed is named by stringing together the relevant names of its pieces.
For example, unrestricted TxtEx-learning in the present section will be later

5
It is an interesting open question, though, for our uniformly polynomial time decid-
able hypothesis spaces, whether the combination of postdictive completeness, conser-
vativeness and prudence is so restrictive, that, any class of languages TxtEx-learnable
employing such an hypothesis space and with those three restrictions, is also TxtEx-
learnable with an intuitively fair, different polynomial time learner respecting all
three restrictions.
6
That is, in our residual two settings, of the three restrictions, postdictive complete-
ness does improve fairness, but there can still be some residual unfair delaying tricks.
For these residual settings, we did not examine the question of whether adding
onto postdictive completeness, conservativeness and/or prudence, provides better
degree of avoidance of delaying tricks than postdictive completeness alone. Again:
we already know, though, that all three restrictions do not avoid all delaying tricks.
7
The associated class is not uniformly polynomial time decidable, by
[RC94, Theorem 6.5].
named TxtGEx.8 A similar modular approach appears already in [CK08a, CK08b].

2 Mathematical Preliminaries
Any unexplained complexity-theoretic notions are from [RC94]. All unexplained general computability-theoretic notions are from [Rog67].
Strings herein are finite and over the alphabet {0, 1}. {0, 1}∗ denotes the set
of all such strings; ε denotes the empty string.
N denotes the set of natural numbers, {0, 1, 2, . . .}. We do not distinguish
between natural numbers and their dyadic representations as strings.9
For each w ∈ {0, 1}∗ and n ∈ N, wn denotes n copies of w concatenated end
to end. For each string w, we define size(w) to be the length of w. Since we
identify each natural number x with its dyadic representation, for all n ∈ N,
size(n) denotes the length of the dyadic representation of n. For all strings w,
we define |w| to be max{1, size(w)}.10
The symbols ⊆, ⊂, ⊇, ⊃ respectively denote the subset, proper subset, superset
and proper superset relation between sets.
For sets A, B, we let A \ B := {a ∈ A | a ∉ B}, A̅ := N \ A, and Pow(A) be the power set of A.
The quantifier ∀∞ x means “for all but finitely many x”, the quantifier ∃∞ x
means “for infinitely many x”. For any set A, card(A) denotes the cardinality of
A.
P and R denote, respectively, the set of all partial and of all total functions
N → N ∪ {#}. dom and range denote, respectively, domain and range of a given
function.
We sometimes denote a function f of n > 0 arguments x1 , . . . , xn in lambda
notation (as in Lisp) as λx1 , . . . , xn f (x1 , . . . , xn ). For example, with c ∈ N,
λx c is the constantly c function of one argument.
A function ψ is partial computable iff there is a deterministic, multi-tape
Turing machine computing ψ. P and R denote, respectively, the set of all partial
computable and the set of all total (partial) computable functions N → N. If
f ∈ P is defined for some argument x, then we denote this fact by f (x)↓, and
we say that f on x converges.
We say that f ∈ P converges to p iff ∀∞ x : f (x)↓ = p; we write f → p to
denote this.11
ϕTM is the fixed programming system from [RC94, Chapter 3] for the partial
computable functions N → N. This system is based on deterministic, multi-tape
8
In general, standard inductive inference criteria names will be changed to slightly
different names in our modular approach.
9
The dyadic representation of a natural number x := the x-th finite string over {0, 1}
in length-lexicographical order, where the counting of strings starts with zero [RC94].
Hence, unlike with binary representation, lead zeros matter.
10
This convention about |ε| = 1 helps with runtime considerations.
11
f (x) converges should not be confused with f converges to.
Turing machines (TMs). In this system the TM-programs are efficiently given
numerical names or codes.12 ΦTM denotes the TM step counting complexity mea-
sure also from [RC94, Chapter 3] and associated with ϕTM . In the present paper,
we employ a number of complexity bound results from [RC94, Chapters 3 & 4]
regarding (ϕTM , ΦTM ). These results will be clearly referenced as we use them.
For simplicity of notation, hereafter, we write (ϕ, Φ) for (ϕTM , ΦTM ). ϕp denotes
the partial computable function computed by the TM-program with code num-
ber p in the ϕ-system, and Φp denotes the partial computable runtime function
of the TM-program with code number p in the ϕ-system.
The symbol # is pronounced pause and is used to symbolize “no new input
data” in a text.
Note that all (partial) computable functions are N → N. Whenever we want
to consider (partial) computable functions on objects like finite sequences or
finite sets, we assume those objects to be efficiently coded as natural numbers.
We give such codings for finite sequences and finite sets below.
For all p, Wp denotes the computably enumerable (ce) set dom(ϕp ). E denotes
the set of all ce sets. We say that e is an index (in W ) for We .
We fix the 1-1 and onto pairing function ⟨·, ·⟩ : N × N → N from [RC94], which is based on dyadic bit-interleaving. Pairing and unpairing are computable
in linear time.
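For concreteness, here is a small Python sketch of the dyadic representation from footnote 9 and its inverse (our illustration; the [RC94] pairing then interleaves the bits of the two dyadic strings, which we do not reproduce here):

def dyadic(n):
    # the n-th string over {0, 1} in length-lexicographical order (0 -> empty string)
    digits = []
    while n > 0:
        n -= 1
        digits.append(str(n % 2))
        n //= 2
    return ''.join(reversed(digits))

def undyadic(s):
    # inverse: the position of s in length-lexicographical order
    x = 0
    for ch in s:
        x = 2 * x + 1 + int(ch)
    return x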
Whenever we consider tuples of natural numbers as input to TMs, it is under-
stood that the general coding function ·, · is used to (left-associatively) code
the tuples into appropriate TM-input.
We identify any function f ∈ P with its graph {⟨x, f(x)⟩ | x ∈ N}.
A finite sequence is a mapping with a finite initial segment of N as domain
(and range, (N ∪ {#})). ∅ denotes the empty sequence (and, also, the empty
set). The set of all finite sequences is denoted by Seq. For each finite sequence
σ, we will denote the first element, if any, of that sequence by σ(0), the second,
if any, with σ(1) and so on. #elets(σ) denotes the number of elements in a finite
sequence σ, that is, the cardinality of its domain.
From now on, by convention, f , g and h with or without decoration range
over (partial) functions N → N; x, y with or without decorations range over N.
D with or without decorations ranges over finite subsets of N.
Following [LV97], we fix a coding ⟨·⟩Seq of all sequences into N ∪ {#} (= {0, 1}∗ ∪ {#}) with the following properties.
The set of all codes of sequences is decidable in linear time. The time to encode a sequence, that is, to compute λk, v₁, . . . , v_k ⟨v₁, . . . , v_k⟩Seq, is O(λk, v₁, . . . , v_k Σ_{i=1}^{k} |v_i|).
12
This numerical coding guarantees that many simple operations involving the coding
run in linear time. This is by contrast with historically more typical codings featuring
prime powers and corresponding at least exponential costs to do simple things.
Therefore, the size of the codeword is also linear in the size of the elements: λk, v₁, . . . , v_k |⟨v₁, . . . , v_k⟩Seq| is O(λk, v₁, . . . , v_k Σ_{i=1}^{k} |v_i|).13 Furthermore,

∀σ : #elets(σ) ≤ |⟨σ⟩Seq|.   (1)
Henceforth, we will many times identify a finite sequence σ with its code number ⟨σ⟩Seq. However, when we employ expressions such as σ(x), σ = f and σ ⊂ f, we consider σ as a sequence, not as a number.
For a (partial) function g and i ∈ N, if ∀j < i : g(j)↓, then g[i] is defined to
be the finite sequence g(0), . . . , g(i − 1).
D, with and without decorations, ranges over finite sets. We fix the following
1-1 coding for all finite subsets of N. For each non-empty finite set D = {x₀ < . . . < xₙ}, ⟨x₀, . . . , xₙ⟩Seq is the code for D and ⟨⟩Seq is the code for ∅.
Henceforth, we will many times identify a finite set D with its code number.
However, when we employ expressions such as x ∈ D, card(D), max(D) and D ⊂ D′, we consider D and D′ as sets, not as numbers.
For each (possibly infinite) sequence q, let content(q) = (range(q) \ {#}).
We define LinPrograms = {e | ∃a, b, ∀x : Φe(x) ≤ a|x| + b} and PolyPrograms = {e | ∃ polynomial p, ∀x ∈ N : Φe(x) ≤ p(|x|)}. Furthermore, let LinF = {ϕe | e ∈ LinPrograms} and PF = {ϕe | e ∈ PolyPrograms}.
For g ∈ PF we say that g is computable in polytime, or also, feasibly com-
putable. Recall that we have, by (1), ∀σ : #elets(σ) ≤ |σ|.
With log we denote the floor of the base-2 logarithm, with the exception of
log(0) = 0.
For all e, x, t, we write ϕe(x)↓t iff Φe(x) ≤ t. Furthermore, we write

∀e, x, t : ϕe(x)↓t = ϕe(x), if Φe(x) ≤ t; 0, otherwise.   (2)

The following lemma is used in many of our detailed proofs. The present
paper, because of space limitations, omits many details of proofs. Nonetheless,
we still include this lemma herein to give the reader some intuitions as to how
to manage some missing details.
Lemma 1. Regarding time-bounded computability, we have the following.
– Equality checks and log are computable in linear time [RC94, Lemma 3.2].
– Conditional definition is computable in a time polynomial in the runtimes
of its defining programs [RC94, Lemma 3.14].
– Bounded minimizations, and, hence, bounded maximizations are computable
in a time polynomial in the runtimes of its defining programs [RC94,
Lemma 3.15].
– Boolean combinations of predicates computable in polytime are computable
in polytime [RC94, Lemma 3.18].
– From [RC94, Corollary 3.7], we have that λe, x, t ϕe (x)↓|t| and
λe, x, t, z ϕe (x)↓|t| = z are computable in polynomial time.
13
For these O-formulas, |ε| = 1 helps.
– Our coding of finite sequences easily gives that the following functions are linear time computable: ∀x : 1 ≤ length(x̄), λ⟨σ⟩Seq #elets(σ), and λ⟨σ⟩Seq, i (σ(i), if i < #elets(σ); 0, otherwise).
– Our coding above of finite sets enables content to be computable in polyno-
mial time.14
2.1 Learning Criteria Modules
In this section we give our modular definition of what a learning criterion is.
After that we will put the modules together to obtain the actual criteria we
need. As noted above, all standard inductive inference criteria names will be
changed to slightly different names in our modular approach.

Definition 2. An effective numbering of ce languages is a function V : N → E such that there is a function f ∈ P with ∀e, x : f(e, x)↓ ⇔ x ∈ V(e).15 For
such numberings V , for each e, we will write Ve instead of V (e), and call e an
index or an hypothesis (in V ) for Ve . Recall that we identify functions with their
graphs. Therefore, ϕ and any other indexing for partial computable functions
is considered an effective numbering. We use effective numberings as hypothesis
spaces for our learners.
We will sometimes require that {Ve | e ∈ N} is effectively closed under some
finite modifications; precisely we will sometimes require

∃s∩ ∈ R : ∀e, D : Vs∩(e,D) = Ve ∩ D ∧
∃s∪ ∈ R : ∀e, D : Vs∪(e,D) = Ve ∪ D ∧   (3)
∃s\ ∈ R : ∀e, D : Vs\(e,D) = Ve \ D.

Note that, in practice, many effective numberings of ce languages allow s∩, s∪ and s\ as in (3) to be computable in polynomial or even linear time.
Effective numberings include the following important examples.

– W ((3) holds).
– A canonical numbering of all regular languages, represented by efficiently
coded DFAs (where membership is trivially uniformly polynomial time de-
cidable and (3) holds).
– For each pair of context free grammars (CFGs) G0 , G1 , we efficiently code
(G0 , G1 ) to be an index for (L(G0 ) \ L(G1 )). Then the resulting numbering,
in particular, has an index for each context free language. Furthermore,
14
This computation involves sorting. Selection sort can be done in quadratic time in
the RAM model [Knu73], and adding an extra linear factor to translate from RAM
complexity to deterministic multi-tape TM complexity [vEB90], we get selection sort
in cubic (and, hence, polynomial) time measured by ΦTM .
15
Note that such a numbering does not necessarily need to be onto, i.e., a numbering
might only number some of the ce languages, leaving out others.
membership is uniformly polynomial time decidable [HU79, Sch91], and (3) holds. As noted above, these grammars are called CFGs with Prohibition in
[Bur05].16
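Membership for such an index (G0, G1) reduces to two ordinary CFG membership tests; a one-line sketch, with in_L0 and in_L1 standing for assumed polynomial time parsers for L(G0) and L(G1):

def in_prohibition_language(x, in_L0, in_L1):
    # (G0, G1) denotes L(G0) \ L(G1): generated by G0 and not prohibited by G1
    return in_L0(x) and not in_L1(x)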
Definition 3. Any set C ⊆ P is a learner admissibility restriction. Intuitively,
a learner admissibility restriction defines which functions are admissible as po-
tential learners.
Two typical learner admissibility restrictions are P and R. When denoting
criteria with P as the learner admissibility restriction, we will omit P.
Definition 4. Any function from E to Pow(R) is called a target presenter for
the ce languages. The only target presenter used in this paper is Txt : E →
Pow(R), L → {ρ ∈ R | content(ρ) = L}.
Definition 5. Every computable operator P × R → P² is called a sequence
generating operator.17 Intuitively, a sequence generating operator defines how
learner and presentation interact to generate two infinite sequences, one for
learner-outputs (we call this sequence the learner-sequence) and one for learnee-
outputs.
For any sequence generating operator β, we define β1 and β2 such that β =
λh, g (β1 (h, g), β2 (h, g)).
We define the following sequence generating operators.
– Goldstyle: G : P × R → P × R, (h, g) ↦ (λi h(g[i]), g).
– [JORS99] Set-driven: Sd : P × R → P × R, (h, g) ↦ (λi h(content(g[i])), g).
– [JORS99] Partly set-driven: Psd : P × R → P × R, (h, g) ↦ (λi h(content(g[i]), i), g).
Definition 6. Every subset of P² is called a sequence acceptance criterion. In-
tuitively, a sequence acceptance criterion defines what identification-sequences
are considered a successful identification of a target. Any two such sequence ac-
ceptance criteria δ and δ  can be combined by intersecting them. For ease of
notation we write δδ  instead of δ ∩ δ  .
For each effective numbering of some ce languages V , we define the following
sequence acceptance criteria.
– Explanatory: ExV = {(p, q) ∈ P² | ∃p′ : p → p′ ∧ Vp′ = content(q)}.
– Postdictive Completeness: PcpV = {(p, q) ∈ R² | ∀i : content(q[i]) ⊆ Vp(i)}.
– Conservativeness: ConvV = {(p, q) | ∀i : p(i) ≠ p(i + 1) ⇒ content(q[i + 1]) ⊈ Vp(i)}.
For any given target presenter α and a sequence generating operator β, we
can turn a given sequence acceptance criterion δ into a learner admissibil-
ity restriction T δ by admitting only those learners that obey δ on all input :
16
Intuitively, G0 may “generate” an element, and G1 can correct it or exclude it.
The concept of Prohibition Grammars is generalized in [CCJ09, CR09] and, there,
they are called Correction Grammars.
17
Essentially, these computable operators are the recursive operators of [Rog67] but
with two arguments and two outputs and restricted to the indicated domain.
T δ := {h ∈ P | ∀T ∈ range(α) : β(h, T) ∈ δ}. We then speak of “total . . . .” For example, total postdictive completeness, i.e., T PcpV, requires postdictive
completeness on any input data, including input data not necessarily taken from
a target to be learned.18
Definition 7. A learning criterion (for short, criterion) is a 4-tuple consisting
of a learner admissibility restriction, a target presenter, a sequence generating
operator and a sequence acceptance criterion. Let C, α, β, δ be, respectively, a
learner admissibility restriction, a target presenter, a sequence generating op-
erator and a sequence acceptance criterion. For h ∈ P, L ∈ dom(α), we say
that h (C, α, β, δ)-learns L iff: h ∈ C and, for all T ∈ α(L), β(h, T ) ∈ δ. For
h ∈ P and L ⊆ dom(α) we say that h (C, α, β, δ)-learns L iff, for all L ∈ L, h
(C, α, β, δ)-learns L. The set of (C, α, β, δ)-learnable sets of computable functions
is
Cαβδ := {L ⊆ E | ∃h ∈ P : h (C, α, β, δ)-learns L}. (4)
We refer to the sets Cαβδ as in (4) as learnability classes. Instead of writing
the tuple (C, α, β, δ), we will ambiguously write Cαβδ. For h ∈ P, the set of all
computable learnees (C, α, β, δ)-learned by h is denoted by Cαβδ(h) := {L ∈
E | h (C, α, β, δ)-learns L}.
Definition 8. We let Id be the function mapping a learning criterion (C, α, β, δ)
to the set Cαβδ, as defined in (4). We define two versions of prudent learning as
follows. For all C, α, β, δ, V , respectively, a learner admissibility restriction, a
target presenter, a sequence generating operator, a sequence acceptance criterion
and an effective numbering of ce languages, we let
PrudV(C, α, β, δ) = {L ⊆ dom(α) | ∃h ∈ C : L ⊆ αβδ(h) ∧ ∀t ∈ L, ∀T ∈ α(t), ∀i : Vβ₁(h,T)(i) ∈ L},
and
T PrudV(C, α, β, δ) = {L ⊆ dom(α) | ∃h ∈ C : L ⊆ αβδ(h) ∧ ∀e ∈ range(h) : Ve ∈ L}.
For D ∈ {Id, Prud, T Prud}, a learning criterion C and a learner h, we write
DC instead of D(C); further, we let DC(h) denote the set of all targets learnable
by h for criterion DC.
We subscript an entire learning criterion with an effective numbering V to change
all restrictions to expect hypotheses from V .
For example, we write the criterion of TxtEx-learning, with V -indices for the
hypothesis space, as TxtGExV . However, TxtGExV with the three restrictions
of total postdictive completeness, total conservativeness and prudence (not total
prudence) in our modular notation is written PrudT PcpT ConvTxtGExV ,
which we abbreviate as PrudT (PcpConv)TxtGExV . If, instead, we wanted
this criterion but with total prudence in the place of prudence, it could be written
T PrudT (PcpConv)TxtGExV .
18
Note that, while Yoshinaka [Yos09] essentially defines for his postdictive complete-
ness, conservativeness and prudence the total kinds, his interesting algorithms for
which he claims these three restrictions satisfy only the non-total versions.
3 Uniformly Polytime Decidable Hypothesis Spaces
For this section, we let U be an arbitrary, fixed effective numbering of some ce languages such that λe, x x ∈ Ue is computable in polynomial time. We call
such a numbering uniformly polynomial time computable. Further suppose there
is an r ∈ PF such that ∀D : Ur(D) = D and suppose (3) in Section 2.1 holds
for U . Codings for DFAs or CFGs with Prohibition are example such U s (see
Definition 2 in Section 2.1).
Interestingly, Theorem 10 below says that, every conservative learner em-
ploying hypothesis space U can, without loss of generality, be assumed to be
polynomial time, postdictively complete, prudent and set-driven. This leads to
the main theorem in this section (Theorem 13), that, for hypothesis space U , no
combination of the three restrictions of postdictive completeness, conservative-
ness and prudence will forbid arbitrary delaying tricks.
First, we show with a lemma how we can delay set-driven learning and pre-
serve postdictive completeness and conservativeness. We use this lemma for the
succeeding theorem.
Lemma 9. We have

PFT (PcpConv)TxtSdExU = T (PcpConv)TxtSdExU.
Proof: “⊆” is immediate. Let h ∈ R and L = T (PcpConv)TxtSdExU(h). Fix a ϕ-program for h.
Let P be a computable predicate such that

∀D′, D : P(D′, D) ⇔ [D′ ⊆ D ∧ h(D′)↓|D| ∧ D ⊆ Uh(D′)].   (5)

By Lemma 1, there is h′ ∈ PF such that

∀D : h′(D) = h(D′), if there is a ≤-max D′ ≤ |D| such that P(D′, D); r(D), otherwise.   (6)
We omit the proof that this delaying construction works.
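For intuition, here is a rough Python sketch of h′ from (5) and (6) (our notation; decode, the step-bounded simulation h_steps, the polytime membership test member_U, and r_index are assumed helpers, and finite sets are frozensets):

def h_prime(D, code, decode, size, h_steps, member_U, r_index):
    budget = size(code(D))                    # |D|
    for dp in range(budget, -1, -1):          # scan codes D' <= |D|, largest first
        Dp = decode(dp)
        if Dp is None or not Dp <= D:         # (5): D' must be a subset of D
            continue
        e = h_steps(Dp, budget)               # (5): h(D') must halt within |D| steps
        if e is not None and all(member_U(e, x) for x in D):  # (5): D within U_{h(D')}
            return e                          # the <=-max D' wins, as in (6)
    return r_index(D)                         # otherwise output r(D), an index for D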

Theorem 10. We have

T PrudPFT (PcpConv)TxtSdExU = TxtGConvExU.
Proof: “⊆” is trivial. Regarding “⊇”: First we apply Proposition 16 to get total
conservativeness. Then we use Theorem 17 to obtain total postdictive complete-
ness. We use Theorem 20 to make the learner set-driven. By Lemma 9, such a
learner can be delayed to be computable in polynomial time. By Proposition 21,
this learner is automatically totally prudent.

Proposition 11 just below shows that any learner can be assumed partially set-
driven, and, importantly, the transformation of a learner to a partially set-driven
learner preserves prudence. The proposition and its proof are somewhat anal-
ogous to [JORS99, Proposition 5.29] and its proof. However, our proof, unlike
that of [JORS99, Proposition 5.29], does not require the hypothesis space to be
paddable.
Proposition 11. Let D ∈ {Id, Prud, T Prud}. We have

DTxtPsdExU = DTxtGExU.

We can delay partially set-driven learning just as we delayed set-driven learning in Lemma 9, resulting in Lemma 12 just below.

Lemma 12. We have

PrudPFT PcpTxtPsdExU = PrudTxtGExU and
T PrudPFT PcpTxtPsdExU = T PrudTxtGExU.

The next theorem is the main result of the present section. As noted in Section 1
above, it says that the three restrictions of postdictive completeness, conserva-
tiveness and prudence allow maximal unfairness — within the current setting of
polynomial time decidable hypothesis spaces.
Theorem 13. Let δ ∈ {R2, Pcp, Conv, PcpConv}, D ∈ {Id, Prud} and D′ ∈ {Id, T Prud}. Then

DPFTxtGδExU = DTxtGδExU and
D′PFT δTxtGExU = D′T δTxtGExU .

Proof: Use Theorem 10, as well as Theorem 18 and Lemma 12.

4 Other Uniformly Decidable Hypothesis Spaces

For this section, we let V : N → E range over effective numberings of some ce languages such that λe, x . [x ∈ Ve] is computable (we call such a numbering uniformly decidable). Further suppose, for each such V , there is r ∈ R such that ∀D : Vr(D) = D.19
Examples of such numberings V include the classes of all linear time, polynomial time, . . . decidable languages (not uniformly linear time, polynomial time, . . . decidable), each represented by efficiently numerically coded programs in a suitable subrecursive programming system for deciding languages [RC94].
For uniformly decidable hypothesis spaces, we get mixed results. We have al-
ready seen from Theorem 13 in Section 3 above that there are uniformly decid-
able hypothesis spaces where we have arbitrary delaying for all combinations of
postdictive completeness, conservativeness and prudence. Next is the first main
19. Note that, in practice, many effective numberings of some ce languages allow r to be computable in polynomial or even linear time.

theorem of the present section. It states that there are other uniformly decid-
able hypothesis spaces such that postdictive completeness, with or without any
of conservativeness and prudence, forbids some delaying tricks. By contrast, ac-
cording to Theorem 18 in Section 5, any combination of just the two restrictions
of conservativeness and prudence allows for arbitrary delaying tricks.

Theorem 14. There exists a uniformly decidable numbering V such that, for
each δ ∈ {R2, Pcp, Conv, PcpConv}, D ∈ {Id, Prud} and D′ ∈ {Id, T Prud},

DPFTxtGδExV ⊂ DRTxtGδExV ⇔ δ ⊆ Pcp and
D′PFT δTxtGExV ⊂ D′T δTxtGExV ⇔ δ ⊆ Pcp.

We can, and sometimes do, think of total function learning as a special case of TxtEx-learning thus. Suppose f is any (possibly, but not necessarily total) function mapping non-negative integers into the same. Recall that we identify f with its graph, {⟨x, y⟩ | f(x) = y}, where ⟨x, y⟩ is the numeric coding of (x, y) (Section 2). Then {⟨x, y⟩ | f(x) = y} is a sublanguage of the non-negative integers. Furthermore, programs for f are generally trivially intercompilable with programs or grammars for {⟨x, y⟩ | f(x) = y}. We sometimes refer to languages of the form {⟨x, y⟩ | f(x) = y} as single-valued languages.
Next is our second main result of the present section. It asserts the polynomial
time learnability with restrictions of postdictive completeness, conservativeness
and prudence of a uniformly decidable class of total single-valued languages
which are (the graphs of) the linear time computable functions. Importantly,
our proof of this theorem employs a Pitt-style delaying trick on an enumeration
technique [Gol67, BB75], and our result, then, entails, as advertised in Section 1
above, that some delaying tricks are not forbidden in the setting of the present
section.
Let θLtime be an efficiently coded programming system from [RC94, Chap-
ter 6] for LinF. θLtime is based on multi-tape TM-programs each explicitly
clocked to halt in linear time (in the length of its input). Let V Ltime be the cor-
responding effective numbering of all and only those ce languages (whose graphs
are) ∈ LinF. Note that V Ltime does not satisfy the condition at the beginning
of the present section on V s for obtaining codes of finite languages — since we
have only infinite languages in V Ltime . Instead, for V Ltime , we have (and use)
a linear time algorithm, which on any finite function F , outputs a V Ltime -index
for the zero-extension of F .

Theorem 15

LinF ∈ T PrudPFT (PcpConv)TxtGExV Ltime .

The remainder of this section presents two results that are used elsewhere. They
are put here to present them in more generality. They each hold for any V .
The following proposition says that, for any uniformly decidable V , conser-
vative learnability implies total conservative learnability. It is used for proving
Theorem 10 in Section 3.

Proposition 16. We have

T ConvTxtGExV = TxtGConvExV .

The following theorem holds for all V and states that we can assume total
postdictive completeness when learning with total conservativeness.
Theorem 17. We have

T (PcpConv)TxtGExV = T ConvTxtGExV .

5 Learning ce Languages
For the remainder of this section, let V be any effective numbering of some ce
languages.
Next is our first main result of the present section (mentioned in Section 1 above); it says that any combination of just the two restrictions of conservativeness and prudence allows for arbitrary delaying tricks.
Theorem 18. Let δ ∈ {R2, Conv}, D ∈ {Id, Prud} and D′ ∈ {Id, T Prud}. Then

DPFTxtGδExV = DTxtGδExV and
D′PFT δTxtGExV = D′T δTxtGExV .
Our proof of the just above theorem uses delaying tricks similar to those in the
proof of Lemma 9 in Section 3.
Our next main result of the present section says, for the general effective
numbering of all ce languages, W , combinations of postdictive completeness,
conservativeness and prudence forbid some delaying tricks iff postdictive com-
pleteness is part of the combination.
Theorem 19. Let δ ∈ {R2, Pcp, Conv, PcpConv}, D ∈ {Id, Prud} and D′ ∈ {Id, T Prud}. Then

DPFTxtGδExW ⊂ DRTxtGδExW ⇔ δ ⊆ Pcp and
D′PFT δTxtGExW ⊂ D′T δTxtGExW ⇔ δ ⊆ Pcp.

Our proof of the just above theorem makes crucial use of [CK08b, Theorem 5(a)]
as well as Theorem 18 above.
Theorem 20 just below says that certain kinds of learners can be assumed without
loss of generality to be set-driven. This is interesting on its own, and is also of
important technical use for proving Theorem 10 in Section 3.
Theorem 20. Let V be such that (3) holds. We have

T (PcpConv)TxtSdExV = T (PcpConv)TxtGExV . (7)

The following proposition shows that total postdictive complete and total con-
servative, set-driven learners are automatically totally prudent. This, again, is
of important technical use for proving Theorem 10 in Section 3.

Proposition 21. Let δ be a sequence acceptance criterion and let C ⊆ P. Let h ∈ P. We have

T PrudCT (PcpConv)TxtSdδExV (h) = CT (PcpConv)TxtSdδExV (h).

Next is our last main result. As noted above in Section 1, this theorem says that,
in the general setting of the present section, postdictive completeness does not
forbid all delaying tricks.

Theorem 22. We have

LinF ∈ T PrudPFT (PcpConv)TxtGExW .

Proof: The effective numbering V Ltime from Theorem 15 can be translated into
the W -system in linear (and, hence, in polynomial) time.

References

[Ang80] Angluin, D.: Inductive inference of formal languages from positive data. In-
formation and Control 45, 117–135 (1980)
[Bār74] Bārzdiņš, J.: Inductive inference of automata, functions and programs. In:
Int. Math. Congress, Vancouver, pp. 771–776 (1974)
[BB75] Blum, L., Blum, M.: Toward a mathematical theory of inductive inference.
Information and Control 28, 125–155 (1975)
[Bur05] Burgin, M.: Grammars with prohibition and human-computer interaction.
In: Proceedings of the 2005 Business and Industry Symposium and the 2005
Military, Government, and Aerospace Simulation Symposium, pp. 143–147.
Society for Modeling and Simulation (2005)
[CCJ09] Carlucci, L., Case, J., Jain, S.: Learning correction grammars. Journal of
Symbolic Logic 74(2), 489–516 (2009)
[CK08a] Case, J., Kötzing, T.: Dynamic modeling in inductive inference. In: Freund,
Y., Györfi, L., Turán, G., Zeugmann, T. (eds.) ALT 2008. LNCS (LNAI),
vol. 5254, pp. 404–418. Springer, Heidelberg (2008)
[CK08b] Case, J., Kötzing, T.: Dynamically delayed postdictive completeness and
consistency in learning. In: Freund, Y., Györfi, L., Turán, G., Zeugmann, T.
(eds.) ALT 2008. LNCS (LNAI), vol. 5254, pp. 389–403. Springer, Heidelberg
(2008)
[CR09] Case, J., Royer, J.: Program size complexity of correction grammars. Working draft (2009)
[Ful88] Fulk, M.: Saving the phenomenon: Requirements that inductive machines not
contradict known data. Information and Computation 79, 193–209 (1988)
[Gol67] Gold, E.: Language identification in the limit. Information and Control 10,
447–474 (1967)
[HU79] Hopcroft, J., Ullman, J.: Introduction to Automata Theory, Languages and Computation. Addison-Wesley Publishing Company, Reading (1979)
[JORS99] Jain, S., Osherson, D., Royer, J., Sharma, A.: Systems that Learn: An In-
troduction to Learning Theory, 2nd edn. MIT Press, Cambridge (1999)

[Knu73] Knuth, D.: The Art of Computer Programming, Volume III: Sorting and
Searching. Addison-Wesley, Reading (1973)
[LV97] Li, M., Vitanyi, P.: An Introduction to Kolmogorov Complexity and Its
Applications, 2nd edn. Springer, Heidelberg (1997)
[OSW86] Osherson, D., Stob, M., Weinstein, S.: Systems that Learn: An Introduc-
tion to Learning Theory for Cognitive and Computer Scientists. MIT Press,
Cambridge (1986)
[Pit89] Pitt, L.: Inductive inference, DFAs, and computational complexity. In: Jan-
tke, K.P. (ed.) AII 1989. LNCS, vol. 397, pp. 18–44. Springer, Heidelberg
(1989)
[RC94] Royer, J., Case, J.: Subrecursive Programming Systems: Complexity and
Succinctness. Research monograph in Progress in Theoretical Computer Sci-
ence. Birkhäuser, Boston (1994)
[Rog67] Rogers, H.: Theory of Recursive Functions and Effective Computability. Mc-
Graw Hill, New York (1967); Reprinted by MIT Press, Cambridge, Mas-
sachusetts (1987)
[Sch91] Schabes, Y.: Polynomial time and space shift-reduce parsing of arbitrary
context-free grammars. In: Proceedings of the 29th annual meeting on As-
sociation for Computational Linguistics, pp. 106–113. Association for Com-
putational Linguistics (1991)
[vEB90] van Emde Boas, P.: Machine models and simulations. In: van Leeuwen, J. (ed.) Handbook of Theoretical Computer Science. Algorithms and Complexity, vol. A, pp. 3–66. MIT Press/Elsevier (1990)
[Wei82] Weinstein, S.: Private communication at the Workshop on Learnability The-
ory and Linguistics, University of Western Ontario (1982)
[Wie76] Wiehagen, R.: Limes-Erkennung rekursiver Funktionen durch spezielle Strategien. Elektronische Informationsverarbeitung und Kybernetik 12, 93–99 (1976)
[Wie78] Wiehagen, R.: Zur Theorie der Algorithmischen Erkennung. PhD thesis,
Humboldt University of Berlin (1978)
[Yos09] Yoshinaka, R.: Learning efficiency of very simple grammars from positive
data. Theoretical Computer Science 410, 1807–1825 (2009); In: Hutter, M.,
Servedio, R.A., Takimoto, E. (eds.) ALT 2007. LNCS (LNAI), vol. 4754, pp.
227–241. Springer, Heidelberg (2007)
Learning Mildly Context-Sensitive Languages
with Multidimensional Substitutability from
Positive Data

Ryo Yoshinaka

Graduate School of Information Science and Technology, Hokkaido University,
North-14 West-9, Sapporo, Japan
ry@ist.hokudai.ac.jp

Abstract. Recently Clark and Eyraud (2007) have shown that sub-
stitutable context-free languages, which capture an aspect of natural
language phenomena, are efficiently identifiable in the limit from pos-
itive data. Generalizing their work, this paper presents a polynomial-
time learning algorithm for new subclasses of mildly context-sensitive
languages with variants of substitutability.

1 Introduction

It has been a long-term goal of grammatical inference to find a reasonable class of formal languages that are powerful enough for expressing natural languages and are efficiently learnable under a reasonable model of language acquisition. As Gold [10] showed, even the family of regular languages, which is located at the lowest level of the Chomsky hierarchy, is not identifiable in the limit from positive data, so this learning model seems very restrictive. In spite of this diffi-
culty, researchers have been striving to find rich classes of languages efficiently
learnable in this model. Angluin’s reversible languages [1] are the first nontrivial
example of subclasses of regular languages that are efficiently identifiable in the
limit from positive data. The literature has found other subclasses of regular
languages, linear languages and context-free languages to be efficiently learnable
under this model. In particular Clark and Eyraud’s work [7, 8] on substitutable
context-free languages is noteworthy in regard to the close connection to natural
languages. Their work has led to several fruitful results in grammatical infer-
ence [5, 9, 19], which target even larger classes of context-free languages with
some special properties related to the substitutability. And now mildly context-
sensitive languages have arisen as a topical target of grammatical inference in
order to get even closer to natural languages [3, 14, 2, 12].
The goal of this paper is to present how to learn some specific kinds of
mildly context-sensitive languages by developing Clark and Eyraud’s technique
for learning substitutable context-free languages. We introduce the notion of mul-
tidimensional substitutability as a generalization of substitutability and demon-
strate that it closely relates to mildly context-sensitive languages. In fact the


role played by multidimensional substitutability in mildly context-sensitive languages is the exact analogue to that of substitutability in context-free languages, and that of reversibility in regular languages, as well. We would like the reader to recall that a regular language L is said to be zero-reversible if and only if, for any strings x, y, x′, y′,

xy, xy′, x′y ∈ L implies x′y′ ∈ L

and a substitutable language L satisfies that

xyz, xy′z, x′yz′ ∈ L implies x′y′z′ ∈ L.

Our m-dimensional substitutability is roughly expressed as that

x0 y1 x1 . . . ym xm, x0 y1′ x1 . . . ym′ xm, x0′ y1 x1′ . . . ym xm′ ∈ L
implies x0′ y1′ x1′ . . . ym′ xm′ ∈ L.

m-dimensional substitutability is a stronger restriction than substitutability. This definition itself does not give richer language classes, but in fact this allows us to infer mildly context-sensitive languages from finite sets of examples.
us to infer mildly context-sensitive languages from finite sets of examples.
Among several formalisms of mildly context-sensitive grammars, we pick mul-
tiple context-free grammars for representing target languages. Section 2 reviews
the definition and some properties of multiple context-free grammars. Section 3
introduces the hierarchy of multidimensional substitutable languages and gives
some examples and counterexamples of those languages. The main issue of this
paper, learning multidimensional substitutable multiple context-free languages,
is discussed in Section 4. We conclude this paper in Section 5 with discussing
possible future directions of study.

2 Preliminaries
2.1 Definitions and Notations
The set of nonnegative integers is denoted by N and this paper will consider only numbers in N. The cardinality of a set S is denoted by |S|. If w is a string over an alphabet Σ, |w| denotes its length. ∅ is the empty set and λ is the empty string. Σ∗ denotes the set of all strings over Σ, Σ+ = Σ∗ − {λ} and Σ^k = { w ∈ Σ∗ | |w| = k }. Any subset of Σ∗ is called a language (over Σ). If L is a finite language over Σ, its size is defined as ‖L‖ = |L| + ∑_{w∈L} |w|. For any x, x⟨m⟩ means the m-tuple of x, while x^m denotes the usual concatenation of x, e.g., x⟨3⟩ = ⟨x, x, x⟩ and x^3 = xxx. Hence (Σ∗)⟨m⟩ is the set of m-tuples of strings over Σ, which are called m-words. Similarly we define (·)⟨∗⟩ and (·)⟨+⟩, where, for instance, (Σ∗)⟨+⟩ denotes the set of all m-words for all m ≥ 1. For an m-word x⃗ = ⟨x1, . . . , xm⟩, |x⃗| denotes its length m and ‖x⃗‖ denotes its size m + ∑_{1≤i≤m} |xi|. If f is a function defined on k-tuples, we will write f(z1, . . . , zk) for f(⟨z1, . . . , zk⟩) for readability.

2.2 Identification in the Limit from Positive Data


Our learning criterion is identification in the limit from positive data (or equiv-
alently from text ) introduced by Gold [10]. Let G be any recursive set of finite
descriptions, called grammars, and L be a function from G to non-empty lan-
guages over Σ. A learning algorithm A on G, is an algorithm that computes
a function from finite sequences of strings w1 , . . . , wn ∈ Σ ∗ to G. We define a
presentation of a language L to be an infinite sequence of elements (called pos-
itive examples) of L such that every element of L occurs at least once. Given
a presentation, we can consider the sequence of hypotheses that the algorithm
produces, writing Gn = A(w1 , . . . , wn ) for the nth such hypothesis. The algo-
rithm A is said to identify the class L of languages in the limit from positive
data if for every L ∈ L, for every presentation of L, there is an integer n0 such
that for all n > n0, Gn = Gn0 and L = L(Gn0). For G′ ⊆ G satisfying L = { L(G) | G ∈ G′ }, one also says A identifies G′ in the limit from positive data. For convenience, we often allow the learner to refer to the previous hypothesis Gn, in addition to w1, . . . , wn+1, when computing Gn+1. Obviously this relaxation does not affect the learnability of language classes. Moreover, learning algorithms in this paper compute hypotheses from a set of positive examples, identifying a sequence with the set consisting of its elements.
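As a small illustration (ours, not part of the original formalism), the following Python fragment makes the convergence criterion concrete: a learner is a function from finite sequences of strings to grammars, and identification on a presentation means that the hypothesis sequence eventually stabilizes on a correct grammar.

def hypothesis_sequence(A, presentation):
    # Yield G_n = A(w_1, ..., w_n) for each finite prefix of the presentation.
    prefix = []
    for w in presentation:
        prefix.append(w)
        yield A(list(prefix))

# A toy learner that conjectures exactly the finite set of strings seen so far;
# it identifies the class of all finite languages in the limit.
finite_set_learner = lambda prefix: frozenset(prefix)

for G in hypothesis_sequence(finite_set_learner, ["ab", "aabb", "ab"]):
    print(G)   # stabilizes once every element of the language has appeared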

2.3 Multiple Context-Free Grammars


A function from (Σ∗)⟨m1⟩ × · · · × (Σ∗)⟨mn⟩ to (Σ∗)⟨m⟩ is said to be linear regular, if there is ⟨α1, . . . , αm⟩ ∈ ((Σ ∪ { zij | 1 ≤ i ≤ n, 1 ≤ j ≤ mi })∗)⟨m⟩ such that each variable zij occurs exactly once in ⟨α1, . . . , αm⟩ and

f(y⃗1, . . . , y⃗n) = ⟨α1[z⃗ := y⃗], . . . , αm[z⃗ := y⃗]⟩

for any y⃗i = ⟨yi1, . . . , yimi⟩ ∈ (Σ∗)⟨mi⟩ with 1 ≤ i ≤ n, where αk[z⃗ := y⃗] denotes the string obtained by replacing each variable zij with the string yij. For example, f defined as f(⟨z11, z12⟩, ⟨z21, z22, z23⟩) = ⟨z12 a z21 b z11, c, z23 z22⟩ is linear regular, but g(⟨z11, z12⟩, ⟨z21, z22, z23⟩) = ⟨z11 a z21 b z11, c, z23 z22⟩ is not, because z11 occurs twice and z12 disappears in the right-hand side of the equality. The rank rank(f) of f is defined to be n and the size size(f) of f is m + |α1 . . . αm|.
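The linear regular function f from the example above is just simultaneous template substitution; a direct Python transcription (ours, for illustration only) reads:

def f(y1, y2):
    # f(<z11, z12>, <z21, z22, z23>) = <z12 a z21 b z11, c, z23 z22>:
    # every variable zij is used exactly once, so f is linear regular.
    z11, z12 = y1
    z21, z22, z23 = y2
    return (z12 + "a" + z21 + "b" + z11, "c", z23 + z22)

print(f(("x", "y"), ("u", "v", "w")))   # ('yaubx', 'c', 'wv')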
A multiple context-free grammar (mcfg) is a tuple G = ⟨Σ, Vdim, F, P, S⟩, where
– Σ is a finite set of terminal symbols,
– Vdim = ⟨V, dim⟩ is the pair of a finite set V of nonterminal symbols and a function dim assigning a positive integer, called a dimension, to each element of V ,
– F is a finite set of linear regular functions,1
– P is a finite set of rules of the form A → f(B1, . . . , Bn) where A, B1, . . . , Bn ∈ V and f ∈ F maps (Σ∗)⟨dim(B1)⟩ × · · · × (Σ∗)⟨dim(Bn)⟩ to (Σ∗)⟨dim(A)⟩,
– S ∈ V is called the start symbol, whose dimension is 1.

1. We identify a function symbol with the function itself by convention.
We will simply write V for Vdim if no confusion occurs. If rank(f) = 0 and f() = y⃗, we may write A → y⃗ instead of A → f(). The dimension dim(G) of G is defined to be the maximum of dim(A) for A ∈ V and the rank rank(G) of G is the maximum of rank(f) for f ∈ F . The size ‖G‖ of G is defined as ‖G‖ = |P| + ∑_{ρ∈P} size(ρ) where size(A → f(B1, . . . , Bn)) = size(f) + n + 1.
For each A ∈ V , L(G, A) is the smallest set of dim(A)-words such that if A → f(B1, . . . , Bn) is a rule and y⃗i ∈ L(G, Bi), then f(y⃗1, . . . , y⃗n) ∈ L(G, A). The language L(G) generated by G means the set { w ∈ Σ∗ | ⟨w⟩ ∈ L(G, S) }. L(G) is called a multiple context-free language (mcfl). Two grammars G and G′ are equivalent if L(G) = L(G′).
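To illustrate the semantics, here is a small Python sketch (ours; for simplicity it uses a λ-rule and thus ignores the λ-freeness restriction introduced at the end of this subsection) of a 2-dimensional, rank-1 mcfg for { a^n b^n c^n | n ≥ 0 }: a nonterminal A of dimension 2 collects the pairs ⟨a^n b^n, c^n⟩, and the start rule concatenates the two components.

def derivable_A(depth):
    # L(G, A) restricted to derivations of bounded depth:
    # rule A -> <λ, λ> and rule A -> g(A) with g(<z1, z2>) = <a z1 b, z2 c>.
    tuples = {("", "")}
    for _ in range(depth):
        tuples |= {("a" + z1 + "b", z2 + "c") for (z1, z2) in tuples}
    return tuples

def language(depth):
    # start rule S -> f(A) with f(<z1, z2>) = <z1 z2>
    return sorted(z1 + z2 for (z1, z2) in derivable_A(depth))

print(language(3))   # ['', 'aaabbbccc', 'aabbcc', 'abc']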
We denote by G(p, r) the collection of mcfgs G such that dim(G) ≤ p and rank(G) ≤ r and define L(p, r) = { L(G) | G ∈ G(p, r) }. We also write G(p, ∗) = ∪_{r∈N} G(p, r) and L(p, ∗) = ∪_{r∈N} L(p, r). The class of context-free grammars is identified with G(1, ∗) and that of linear grammars corresponds to G(1, 1). It is well-known that L(1, 1) ⊊ L(1, 2) = L(1, ∗). The following three, which are
thought to be typical mildly context-sensitive languages [13], are all in L(2, 1):

{ a^n b^n c^n | n ≥ 0 }, { a^m b^n c^m d^n | m, n ≥ 0 }, { ww | w ∈ Σ∗ }.

Seki et al. [17] and Rambow and Satta [15] have investigated the hierarchy of
mcfls.

Proposition 1 (Seki et al. [17]). For p ≥ 1, L(p, ∗) ⊊ L(p + 1, ∗).

In fact { (a^i b^i)^{p+1} | i ≥ 0 } ∈ L(p + 1, 1) − L(p, ∗).

Proposition 2 (Rambow and Satta [15]). For p ≥ 2, r ≥ 1, L(p, r) ⊊ L(p, r + 1) except for L(2, 2) = L(2, 3).

Furthermore Rambow and Satta show a trade-off between dimension and rank.
This contrasts with Proposition 1.

Proposition 3 (Rambow and Satta [15]). For p ≥ 1, r ≥ 3 and 1 ≤ k ≤


r − 2, L(p, r) ⊆ L((k + 1)p, r − k).

Proposition 4 (Seki et al. [17]). Let p and r be fixed. It is decidable in O(‖G‖ · |w|^{p(r+1)}) time whether w ∈ L(G) for any mcfg G ∈ G(p, r) and w ∈ Σ∗.
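For instance (a worked instance of the bound, not from the paper), for grammars in G(2, 1), the setting p = 2 and r = 1 used in Example 4 below, Proposition 4 yields a membership test running in O(‖G‖ · |w|^4) time.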

We close this subsection with introducing inessential restrictions on mcfgs. Let f be a linear regular function such that f(z⃗1, . . . , z⃗n) = ⟨α1, . . . , αm⟩ for z⃗i = ⟨zi1, . . . , zimi⟩. If no αk from ⟨α1, . . . , αm⟩ is λ, f is said to be λ-free. f is non-permuting if zij always occurs left of zi(j+1) in α1 . . . αm for any i, j with 1 ≤ i ≤ n and 1 ≤ j < mi. f is moreover said to be non-merging if no αk has zij zi(j+1) as a substring for any i, j. An mcfg G is λ-free, non-permuting, non-merging, if all of its functions are λ-free, non-permuting, non-merging, respectively. Note that all G ∈ G(1, ∗) are non-permuting and non-merging. It is known that every mcfg G ∈ G(p, r) has an equivalent λ-free mcfg G′ ∈ G(p, r) modulo λ [17].

Lemma 1. Every mcfg G ∈ G(p, r) has an equivalent non-permuting one G′ ∈ G(p, r).

Proof. A permutation π on m-words is a bijective linear regular function of rank 1 on (Σ∗)⟨m⟩ such that π(⟨z1, z2, . . . , zm⟩) is defined to be ⟨z_{p1}, . . . , z_{pm}⟩ for some p1, . . . , pm with {p1, . . . , pm} = {1, . . . , m}. We define G′ to have nonterminals A^π with dim(A^π) = dim(A) for all nonterminals A of G and all permutations π on dim(A)-words. For each rule A → f(B1, . . . , Bn) of G and each permutation π on dim(A)-words, G′ has the rule of the form

A^π → f^π(B1^{π1}, . . . , Bn^{πn})

where f^π is defined to satisfy that

f^π(y⃗1, . . . , y⃗n) = π(f(π1^{−1}(y⃗1), . . . , πn^{−1}(y⃗n)))

and each πi, a permutation on dim(Bi)-words, is chosen so that f^π is non-permuting. Indeed π1, . . . , πn are uniquely determined by π and f . It is easy to see that for any permutation π on dim(A)-words, y⃗ ∈ L(G, A) iff π(y⃗) ∈ L(G′, A^π). The start symbol of G′ is of course S^I where I is the unique permutation on 1-words, i.e., the identity.

We note that ‖G′‖ ≤ p! · ‖G‖.

Lemma 2. Every non-permuting mcfg G ∈ G(p, r) has an equivalent non-merging one G′ ∈ G(p, r).

Proof. A merge μ on m-words is a linear regular function of rank 1 from (Σ∗)⟨m⟩ to (Σ∗)⟨k⟩ for some k with 1 ≤ k ≤ m such that μ(⟨z1, z2, . . . , zm⟩) is defined to be ⟨z1 . . . z_{m1}, z_{m1+1} . . . z_{m2}, . . . , z_{mk−1+1} . . . zm⟩ for some m1, . . . , mk−1 with 1 ≤ m1 < · · · < mk−1 < m. We define G′ to have nonterminals A^μ for nonterminals A of G and merges μ on dim(A)-words. For each rule A → f(B1, . . . , Bn) of G and each merge μ on dim(A)-words, G′ has the rule of the form

A^μ → f^μ(B1^{μ1}, . . . , Bn^{μn})

where f^μ is defined to satisfy that

f^μ(μ1(y⃗1), . . . , μn(y⃗n)) = μ(f(y⃗1, . . . , y⃗n))

and each μi, a merge on dim(Bi)-words, is chosen so that f^μ is non-merging. Indeed μ1, . . . , μn are uniquely determined by μ and f . It is easy to see that for any merge μ on dim(A)-words, y⃗ ∈ L(G, A) iff μ(y⃗) ∈ L(G′, A^μ). The start symbol of G′ is of course S^I where I is the unique merge on 1-words, i.e., the identity.

We note that ‖G′‖ ≤ 2^{p−1} ‖G‖. We say that a linear regular function f is good if it is λ-free, non-permuting and non-merging, and that an mcfg G is good if all of its functions are good. We assume that all mcfgs in this paper are good.

3 Multidimensional Substitutable Languages

3.1 Multidimensional Substitutability and Multiple Context-Free Hierarchy

This section introduces the notion of p-dimensional substitutability as a generalization of substitutability by Clark and Eyraud [8]. Let □ ∉ Σ be a new symbol, which represents a hole. If x ∈ (Σ ∪ {□})∗ contains m occurrences of □, then x is called an m-context. For an m-context x = x0□x1□ . . . □xm with x0, . . . , xm ∈ Σ∗ and an m-word y⃗ = ⟨y1, . . . , ym⟩ ∈ (Σ∗)⟨∗⟩, we define an operation ⊙ by x ⊙ y⃗ = x0 y1 x1 . . . ym xm. x ⊙ y⃗ is defined only when x contains exactly |y⃗| occurrences of □. For a positive integer p, a language L is said to be pd-substitutable if and only if

x1 ⊙ y⃗1, x1 ⊙ y⃗2, x2 ⊙ y⃗1 ∈ L implies x2 ⊙ y⃗2 ∈ L

for any x1, x2 ∈ Σ∗□(Σ+□)^{m−1}Σ∗, y⃗1, y⃗2 ∈ (Σ+)⟨m⟩ and m ≤ p. For notational convenience, we write Σ^[m] for Σ∗□(Σ+□)^{m−1}Σ∗. By S(p) we denote the class of pd-substitutable languages. It is an immediate consequence of the definition that S(p + 1) ⊆ S(p) for any p ≥ 1 and in fact the inclusion is proper. Clark and Eyraud's original notion of substitutability [8] is our 1d-substitutability. Thus apparently our generalization of the notion of substitutability does not introduce richer classes of languages. The following example however demonstrates how nicely pd-substitutability works in p-dimensional mcfls.
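To make the operation ⊙ concrete, here is a minimal Python sketch (ours; the character '□' simply stands for the hole symbol):

def plug(x, y):
    # Compute x ⊙ y: insert the components of the m-word y into the
    # m holes of the m-context x, from left to right.
    parts = x.split("□")              # x = x0 □ x1 □ ... □ xm
    assert len(parts) == len(y) + 1   # defined only if x has exactly |y| holes
    out = [parts[0]]
    for yi, xi in zip(y, parts[1:]):
        out += [yi, xi]
    return "".join(out)

print(plug("a□c□e", ("b", "d")))      # 'abcde'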
Example 1. Let Σm = { a_i | 1 ≤ i ≤ 2m } ∪ { #_i | 1 ≤ i < 2m } and

Lm = { a_1^n #_1 a_2^n #_2 . . . #_{2m−1} a_{2m}^n | n ≥ 0 }.

For any m, p ≥ 1, Lm ∈ S(p). Moreover any finite subset K ⊆ Lm is a member of S(m − 1), but K is not in S(m) if |K| ≥ 2. Let Km = { #_1 . . . #_{2m−1}, a_1 #_1 . . . #_{2m−1} a_{2m} }, for example. The least md-substitutable language including Km is in fact Lm.
Lm is a typical m-dimensional mcfl. In fact Lm ∈ L(m, 1)−L(m−1, ∗). To learn
Lm from finite examples, pd-substitutability with p < m is not a sufficiently
strong assumption, while md-substitutability is better suited. One may think
of md-substitutability in m-dimensional mcfls for m ≥ 1 as a generalization
of 1d-substitutability in context-free languages, as well as an analogue of zero-
reversibility in regular languages.

Example 2. Let Σ′m = { a_i, b_i | 1 ≤ i ≤ m } ∪ { #_i | 1 ≤ i < 2m } and

L′m = { a_1^{n_a} #_1 b_1^{n_b} #_2 a_2^{n_a} #_3 b_2^{n_b} #_4 . . . #_{2m−2} a_m^{n_a} #_{2m−1} b_m^{n_b} | n_a, n_b ≥ 0 }.

For any m, p ≥ 1, L′m ∈ S(p). Let K′m = { #_1 . . . #_{2m−1}, a_1 #_1 b_1 #_2 . . . #_{2m−2} a_m #_{2m−1} b_m }. Then K′m ∈ S(m − 1) − S(m). The least md-substitutable language including K′m is in fact L′m.

Thus those typical p-dimensional mcfls can be inferred from some finite subsets
with pd-substitutability, if one can compute the least language in S(p) including
an arbitrarily given finite language. Therefore we are concerned with the classes
of languages that are in S(p) and at the same time in L(p, ∗). Let us denote
SL(p, r) = S(p) ∩ L(p, r) and SL(p, ∗) = S(p) ∩ L(p, ∗).
On the other hand, some other typical p-dimensional mcfls are not pd-substitutable. We say that two m-words y⃗1 and y⃗2 are substitutable for each other in L, when for any x ∈ Σ^[m] it holds that x ⊙ y⃗1 ∈ L iff x ⊙ y⃗2 ∈ L.
Example 3. The language L2^− = { a_1^n # a_2^n # a_3^n # a_4^n | n ≥ 0 } is not 2d-substitutable. If a 2d-substitutable language L contains ### and a_1#a_2#a_3#a_4 as L2^− does, then ⟨#, #⟩ and ⟨a_1#a_2, a_3#a_4⟩ should be substitutable for each other in L. This entails that a_1a_1#a_2a_2a_3#a_4a_3#a_4 ∈ L − L2^−.
The language Lreverse = { w#w^R | w^R is the reverse of w ∈ {a, b}∗ } is 1d-substitutable but not 2d-substitutable. Actually even { a^n #a^n | n ≥ 0 } is not 2d-substitutable. Suppose that a 2d-substitutable language L contains aaa#aaa. Then ⟨aa#, a⟩ and ⟨a, #aa⟩ are substitutable for each other, because of the shared 2-context a□a□a. At the same time aaa#aaa = a□aa□ ⊙ ⟨aa#, a⟩, so L must contain a□aa□ ⊙ ⟨a, #aa⟩ = aaaa#aa, too. This shows that even a singleton language is not 2d-substitutable, which contrasts with the fact that every singleton is 1d-substitutable.
The language Lcopy = { w#w | w ∈ {a, b}∗ } is not 1d-substitutable. If a 1d-substitutable language L contains a#a and b#b as Lcopy does, they should be substitutable for each other. aa#aa ∈ L then entails ab#ba ∈ L − Lcopy.
When the language L1 = { a_1^n #_1 a_2^n | n ≥ 0 } is generated by a context-free grammar, only the nesting structural interpretation is possible, while with a 2-dimensional mcfg, cross-serial dependency is also a possible interpretation at the same time. One cannot decide which is the underlying structure from strings only. Actually if a 2d-substitutable language contains a_1#_1a_2 and a_1a_1#_1a_2a_2, both interpretations are inevitably induced.

3.2 Comparison with Simple External Contextual Languages


We will extend the operator ⊙ so that x⃗ ⊙ y⃗ is defined for x⃗ ∈ ((Σ ∪ {□})∗)⟨+⟩ if x⃗ contains exactly |y⃗| occurrences of □. For instance, ⟨a, b□c□d, □e□⟩ ⊙ ⟨y1, y2, y3, y4⟩ = ⟨a, by1cy2d, y3ey4⟩. Simple external contextual (sec) languages are important mildly context-sensitive languages in the context of grammatical inference [3, 14, 2]. For p ≥ 1 and q ≥ 0, a ⟨p, q⟩-sec grammar G over Σ is a pair ⟨B⃗, C⟩ where B⃗ ∈ (Σ∗)⟨p⟩ and C ⊆ (Σ∗□Σ∗)⟨p⟩ with |C| ≤ q. The ⟨p, q⟩-sec language L(G) generated by a ⟨p, q⟩-sec grammar G is defined as

L(G) = { w1 . . . wp ∈ Σ∗ | ⟨w1, . . . , wp⟩ = x⃗1 ⊙ · · · ⊙ x⃗n ⊙ B⃗, x⃗i ∈ C, n ≥ 0 }

(⊙ is associative). Let SEC(p, q) denote the class of ⟨p, q⟩-sec languages. We note that SEC(p, q) ⊆ L(p, 1).
The languages Lm in Example 1 and L′m in Example 2 are in SEC(m, 1) ∩ SL(m, 1). However the classes SL(p, ∗) and SEC(p, ∗) are incomparable. The regular language (ab∗cd∗)∗e ∈ ∩_{m∈N} SL(m, 1) is not a ⟨p, q⟩-sec language for any p, q. On the other hand, Lreverse from Example 3 is in SEC(1, 2) and is not 2d-substitutable. { a^n b^n | n ≥ 1 } ∈ SEC(1, 1) is not 1d-substitutable either.

4 Learning pd-Substitutable Multiple Context-Free Languages
4.1 Learning Algorithm
Let us arbitrarily fix positive integers p and r. This section presents an algorithm that learns the class SL(p, r). However we do not yet have any grammatical characterization of this class. For mathematical completeness, we thus have to define our learning target by saying that our target representations are mcfgs in G(p, r) generating pd-substitutable languages, though this property is not decidable. While we have S(p + 1) ⊊ S(p) and L(p, r) ⊊ L(p + 1, r), the classes SL(p, r) and SL(p + 1, r) are incomparable, unless r = 0.
We remark that our algorithm is easily modified to learn the class SL(p, ∗) if we give up the polynomial-time computability, as we will discuss later. On the other hand SL(∗, r) = ∪_{p∈N} SL(p, r) is not identifiable in the limit from positive data unless r = 0. Let L∗ = { a^n b c^n d e^n | n ≥ 0 }. It is easy to see that all finite subsets of L∗ are in SL(1, 0), while L∗ ∈ SL(2, 1).
Our learning algorithm A(p, r) for SL(p, r), which is shown as Algorithm 1, is a natural generalization of Clark and Eyraud's original algorithm for SL(1, 2) = SL(1, ∗) [8]. If the new positive example is generated by the previous hypothesis, it keeps the hypothesis. Otherwise, A(p, r) computes an mcfg G(K) from the set K of positive examples given so far. The set of nonterminals is defined as

VK = { y⃗ ∈ (Σ+)⟨m⟩ | x ⊙ y⃗ ∈ K for some x ∈ Σ^[m] and 1 ≤ m ≤ p } ∪ {S},

where dim(y⃗) = |y⃗|. We will write [[y⃗]] instead of y⃗ to clarify that it means a nonterminal symbol (indexed with y⃗). PK consists of the following rules:
– (Type I) [[y⃗]] → f([[y⃗1]], . . . , [[y⃗n]]) if there is a good function f of rank n ≤ r such that y⃗ = f(y⃗1, . . . , y⃗n), where [[y⃗]], [[y⃗1]], . . . , [[y⃗n]] ∈ VK − {S};
– (Type II) [[y⃗]] → Im([[y⃗′]]) where Im is the identity on m-words for m = |y⃗| ≤ p, if there is x ∈ Σ^[m] such that x ⊙ y⃗, x ⊙ y⃗′ ∈ K;
– (Type III) S → I1([[⟨w⟩]]) if w ∈ K;
and FK is the set of functions requested in the definition of PK. As VK is finite, FK and PK are also finite. Then G(K) = ⟨Σ, VK, FK, PK, S⟩ ∈ G(p, r) is the conjecture by A(p, r).
Instead of having rules [[y⃗]] → Im([[y⃗′]]) of Type II, one may merge y⃗ and y⃗′ for downsizing the output, as Clark and Eyraud do in [8].
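The following rough Python sketch (ours, with simplifying assumptions; it covers only the context extraction behind Type II rules, not the construction of the Type I functions) shows how one can enumerate, for each w ∈ K, the splittings w = x ⊙ y⃗ with x ∈ Σ^[m], y⃗ ∈ (Σ+)⟨m⟩ and m ≤ p, and group the resulting m-words by shared context.

def decompositions(w, p):
    # Yield the pairs (x, y) with x ⊙ y = w, where y is an m-word of
    # nonempty components (m <= p) and consecutive holes of x are
    # separated by nonempty pieces, as required by x ∈ Σ^[m].
    n = len(w)
    for m in range(1, p + 1):
        def cuts_from(start, k, acc):
            if k == m:
                yield acc
                return
            lo = start if k == 0 else start + 1   # internal x_k must be nonempty
            for i in range(lo, n):
                for j in range(i + 1, n + 1):
                    yield from cuts_from(j, k + 1, acc + [(i, j)])
        for cuts in cuts_from(0, 0, []):
            y = tuple(w[i:j] for i, j in cuts)
            x, prev = "", 0
            for i, j in cuts:
                x, prev = x + w[prev:i] + "□", j
            yield x + w[prev:], y

def type_II_pairs(K, p):
    # m-words sharing a context x give rise to rules [[y]] -> I_m([[y']]).
    by_context = {}
    for w in K:
        for x, y in decompositions(w, p):
            by_context.setdefault(x, set()).add(y)
    return {x: ys for x, ys in by_context.items() if len(ys) > 1}

On the character-level rendering of the sample K from Example 4 below, this recovers, among many others, the shared contexts a□b#2c□d and a□#2c□ used there.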
Example 4. Let p = 2 and r = 1. Let us consider the grammar G(K) = ⟨Σ, VK, FK, PK, S⟩ for

K = { a#1b#2c#3d, a#1#2c#3, aa#1b#2cc#3d }.

Algorithm 1. A(p, r)
Data: A sequence of strings w1, w2, . . .
Result: A sequence of mcfgs G1, G2, · · · ∈ G(p, r)
let Ĝ be an mcfg such that L(Ĝ) = ∅;
for n = 1, 2, . . . do
    read the next string wn;
    if wn ∉ L(Ĝ) then
        let Ĝ = G(K) where K = {w1, . . . , wn};
    end if
    output Ĝ as Gn;
end for

We see that VK contains the following four nonterminals, among others:

[[a#1#2c#3]], [[⟨a#1, c#3⟩]], [[⟨#1b, #3d⟩]], [[⟨#1, #3⟩]] ∈ VK.

PK contains at least the following four rules of Type I:

[[a#1#2c#3]] → f([[⟨a#1, c#3⟩]]) where f(⟨z1, z2⟩) = ⟨z1#2z2⟩,
[[⟨a#1, c#3⟩]] → g([[⟨#1, #3⟩]]) where g(⟨z1, z2⟩) = ⟨az1, cz2⟩,
[[⟨#1b, #3d⟩]] → h([[⟨#1, #3⟩]]) where h(⟨z1, z2⟩) = ⟨z1b, z2d⟩,
[[⟨#1, #3⟩]] → ⟨#1, #3⟩,

as well as the following rules of Type II:

[[⟨#1, #3⟩]] → I2([[⟨a#1, c#3⟩]]) due to
(a□b#2c□d) ⊙ ⟨#1, #3⟩, (a□b#2c□d) ⊙ ⟨a#1, c#3⟩ ∈ K,
[[⟨#1, #3⟩]] → I2([[⟨#1b, #3d⟩]]) due to
(a□#2c□) ⊙ ⟨#1, #3⟩, (a□#2c□) ⊙ ⟨#1b, #3d⟩ ∈ K,

and their symmetries [[⟨a#1, c#3⟩]] → I2([[⟨#1, #3⟩]]) and [[⟨#1b, #3d⟩]] → I2([[⟨#1, #3⟩]]), too, and the rule S → I1([[a#1#2c#3]]) of Type III. Thus G(K) generates every string derived by the mcfg G∗ with the rules

S → f(A), A → g(A), A → h(A), A → ⟨#1, #3⟩,

where f, g, h denote the same functions as in G(K). We have L(G∗) = { a^m #1 b^n #2 c^m #3 d^n | m, n ≥ 0 } ⊆ L(G(K)). Here many other nonterminals and rules of G(K) are suppressed, but indeed it holds that L(G(K)) = L(G∗) as we will prove later. Note that L(G∗) ∈ SL(2, 1).

4.2 Correctness of the Algorithm


We first confirm that A(p, r) is consistent.
Lemma 3. K ⊆ L(G(K)) for any finite language K.
Proof. If w ∈ K, by definition G(K) has the rules S → I1([[⟨w⟩]]) and [[⟨w⟩]] → ⟨w⟩.

We then show that the language of the conjectured grammar Ĝ of our algorithm A(p, r) is always a subset of the target language.

Lemma 4. For any L ∈ S(p) and any finite subset K of L, if w⃗ ∈ L(G(K), [[y⃗]]) with [[y⃗]] ∈ VK − {S}, then y⃗ and w⃗ are substitutable for each other in L.

Proof. Let Ĝ = G(K). We prove the lemma by induction on the derivation of w⃗ ∈ L(Ĝ, [[y⃗]]). Suppose that w⃗ = f(w⃗1, . . . , w⃗n) ∈ L(Ĝ, [[y⃗]]) due to the rule [[y⃗]] → f([[y⃗1]], . . . , [[y⃗n]]) of Type I and w⃗i ∈ L(Ĝ, [[y⃗i]]) for i = 1, . . . , n. Note that the base case is when n = 0. The presence of the rule implies the existence of x ∈ Σ^[|w⃗|] such that

x ⊙ y⃗ = x ⊙ f(y⃗1, . . . , y⃗n) ∈ K ⊆ L.

The induction hypothesis says that y⃗i and w⃗i are substitutable for each other in L. Recall that the function f is designed to be good. This allows us the following inference:

x ⊙ y⃗ = x ⊙ f(y⃗1, y⃗2, . . . , y⃗n) ∈ L ⟹ x ⊙ f(w⃗1, y⃗2, . . . , y⃗n) ∈ L ⟹ . . .
⟹ x ⊙ f(w⃗1, . . . , w⃗n−1, y⃗n) ∈ L ⟹ x ⊙ f(w⃗1, . . . , w⃗n−1, w⃗n) = x ⊙ w⃗ ∈ L.

Thus y⃗ and w⃗ are substitutable for each other.
Suppose that w⃗ ∈ L(Ĝ, [[y⃗]]) due to the rule [[y⃗]] → Im([[y⃗′]]) of Type II and w⃗ ∈ L(Ĝ, [[y⃗′]]). By the presence of the rule, y⃗ and y⃗′ are substitutable for each other. By the induction hypothesis, y⃗′ and w⃗ are substitutable for each other. Hence y⃗ and w⃗ are also substitutable for each other in L.

Lemma 5. For any L ∈ S(p) and any finite subset K of L, it holds that L(G(K)) ⊆ L.

Proof. Let Ĝ = G(K). If w ∈ L(Ĝ), i.e., ⟨w⟩ ∈ L(Ĝ, S), then there is [[⟨y⟩]] ∈ VK such that S → I1([[⟨y⟩]]) is a rule of Type III of Ĝ and y ∈ K. By Lemma 4, ⟨y⟩ and ⟨w⟩ are substitutable for each other in L. □ ⊙ ⟨y⟩ = y ∈ K ⊆ L implies that □ ⊙ ⟨w⟩ = w ∈ L.

The conjectured language may be properly smaller than the target when the given data are not rich enough. We now define a finite subset of the target language which ensures correct convergence of the conjecture of our learning algorithm. For a good mcfg G ∈ G(p, r) generating the target language, we define KG so that, for each rule of G, it contains a shortest string from L(G) which is derived using that rule at least once. For the sake of rigor, we give a formal definition of KG here. Let X(G, A/B) be defined by:

– □⟨dim(A)⟩ ∈ X(G, A/A),
– if A → f(B1, . . . , Bn) is a rule and x⃗j ∈ X(G, Bj/C) for some j ∈ {1, . . . , n} and y⃗i ∈ L(G, Bi) for the other i = 1, . . . , j − 1, j + 1, . . . , n, then f(y⃗1, . . . , y⃗j−1, x⃗j, y⃗j+1, . . . , y⃗n) ∈ X(G, A/C),
– nothing else is in X(G, A/B).

We then define the set KG as follows:

y⃗A = min L(G, A),
xA = min{ x ∈ Σ^[dim(A)] | x ∈ X(G, S/A) },
KG = { xA ⊙ f(y⃗B1, . . . , y⃗Bn) | A → f(B1, . . . , Bn) ∈ P },

where min S for a set S of m-words means an element y⃗ from S whose size ‖y⃗‖ is the smallest, and min S for a set S of m-contexts means an element x from S whose length |x| is the smallest.
Lemma 6. For any G ∈ G(p, r), if KG ⊆ K, then L(G) ⊆ L(G(K)).

Proof. Let Ĝ = G(K). We show by induction that w⃗ ∈ L(G, A) implies w⃗ ∈ L(Ĝ, [[y⃗A]]). Because Ĝ has the rule S → I1([[y⃗S]]), this proves the lemma.
Suppose that w⃗ = f(w⃗1, . . . , w⃗n) ∈ L(G, A) due to the rule A → f(B1, . . . , Bn) and w⃗i ∈ L(G, Bi) for i = 1, . . . , n. The base case is when n = 0. Let y⃗ = f(y⃗B1, . . . , y⃗Bn). By definition we have xA ⊙ y⃗ ∈ K and thus [[y⃗]] → f([[y⃗B1]], . . . , [[y⃗Bn]]) is a rule of Ĝ. By xA ⊙ y⃗A ∈ K, Ĝ has the rule [[y⃗A]] → Idim(A)([[y⃗]]), too. Applying those two rules to w⃗i ∈ L(Ĝ, [[y⃗Bi]]) for i = 1, . . . , n, which are obtained by the induction hypothesis, we have that w⃗ = f(w⃗1, . . . , w⃗n) ∈ L(Ĝ, [[y⃗A]]).

Corollary 1. For any mcfg G ∈ G(p, r) generating a language in S(p), A(p, r)


identifies L(G) in the limit from positive data.

Proof. If the conjectured language L(G(K)) is not correct, Lemma 5 ensures


the existence of w ∈ L(G) − L(G(K)). By K ⊆ L(G(K)) (Lemma 3), w has
not yet appeared in K and A(p, r) will see w later. Hence A(p, r) will discard
the current conjecture at some point. Finally A(p, r) converges to the target
language by Lemma 6.

One can modify the learning algorithm so that it learns SL(p, ∗) by removing the restriction on the rank of the hypothesized grammar. The rank is then bounded by the length ℓK of a longest example given so far, because we still restrict the functions of grammars to be λ-free. Let us call the learning algorithm obtained in this way A(p, ∗).
Corollary 2. A(p, ∗) identifies SL(p, ∗) in the limit from positive data.

4.3 Efficiency of the Algorithm


We discuss in this subsection the efficiency of our learning algorithm in terms
of time for updating the conjecture and the amount of data for convergence.
This measurement is proposed by de la Higuera [11]. His definition was initially
designed for learning of regular languages and it is controversial whether it is
suitable for learning non-regular languages. Wakatsuki and Tomita [18] have
proposed to measure the complexity of an algorithm dealing with context-free
grammars by the parameter called the maximal thickness tG of G together with
the size of the grammar G. The thickness of a nonterminal symbol A is defined to
be the length of a shortest string derived from A, and tG is the maximum of the thicknesses of the nonterminals. Instead of the original definition, we would like the thickness τG of a grammar G to be defined as the maximum of the thicknesses of the rules, where the thickness of a rule ρ is defined to be the length of a shortest string in L(G) that is derived using ρ at least once. It is easy to see that τG ≤ ‖G‖ · tG. This works well for multiple context-free grammars as well as for context-free grammars. Hence a value is bounded by a polynomial in τG if and only if it is bounded by a polynomial in ‖G‖ tG. The following is our
criterion for efficient learning, which is a slight modification of de la Higuera’s
definition [11].
Definition 1. A representation class G of mcfgs is identifiable in the limit
from positive data with polynomial time and data if and only if there exists an
algorithm A such that

1. given a set K of positive examples, A returns a hypothesis Ĝ in polynomial time in ‖K‖,
2. for each grammar G∗ ∈ G, there exists a finite set K∗ of examples such that
– |K∗| is bounded by a polynomial in ‖G∗‖ [4],
– ‖K∗‖ is bounded by a polynomial in ‖G∗‖ τG∗,
– if K∗ ⊆ K ⊆ L(G∗), A converges to a grammar Ĝ such that L(Ĝ) = L(G∗).

Clark and Eyraud’s [8] and Yoshinaka’s [19] learning algorithms for (k, l-)sub-
stitutable context-free languages satisfy this definition.
Lemma 7. Our algorithm A(p, r) computes its hypothesis Ĝ in polynomial time
in the total size of the given examples.

Proof. By Proposition 4, the membership of the new example w in the current hypothesis Ĝ is decidable in O(‖Ĝ‖ · |w|^{p(r+1)}) time. As we will see below, it holds that ‖Ĝ‖ ∈ O(|K|^2 ℓK^{2pr+2p+1}) where ℓK = max{ |w| | w ∈ K }. Thus the membership is decidable in O(|K|^2 ℓK^{3pr+3p+1}) time. Suppose that the new example w is not generated by the current hypothesis. Then A(p, r) computes G(K) = ⟨Σ, VK, FK, PK, S⟩.
Each rule of Type I is constructed from a single word w ∈ K. If G(K) has [[y⃗]] → f([[y⃗1]], . . . , [[y⃗n]]), there is x = x0□x1□ . . . □xm ∈ Σ^[m] such that w = x ⊙ f(y⃗1, . . . , y⃗n) ∈ K, where m = |y⃗|. Here the occurrences of x0, . . . , xm and of the yij from y⃗i = ⟨yi1, . . . , yimi⟩ with 1 ≤ i ≤ n and 1 ≤ j ≤ mi are pairwise non-overlapping in w. Let k = m + 1 + m1 + · · · + mn denote the number of those substrings. The fragments of w that are not covered by those k substrings come from f itself. This factorization of w is determined by specifying where each of the k substrings starts and ends, except that the starting position of x0 and the ending position of xm are predetermined. Thus there exist at most (|w| + 1)^{2k−2} ≤ (|w| + 1)^{2p(r+1)} such factorizations of w, because m, mi ≤ p, n ≤ r and thus k ≤ p + 1 + pr. The size of the rule is bounded by O(|w|). Hence we need O(|K| ℓK^{2pr+2p+1}) time to compute the rules of Type I.
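For instance (a worked instance of the bound, not from the paper), with p = 2 and r = 1 as in Example 4, k ≤ p + 1 + pr = 5, so at most (|w| + 1)^8 factorizations of each w ∈ K have to be examined.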

Each rule of Type II is constructed by comparing two words w1, w2 ∈ K. There are at most (|wi| + 1)^{2p} pairs of xi and y⃗i to be considered such that xi ⊙ y⃗i = wi for each i = 1, 2. Determining whether x1 = x2 is done in linear time in |w1| + |w2| and the size of the rule has the same bound O(|w1| + |w2|). Thus we need O(|w1|^{2p} |w2|^{2p} |w1w2|) time to construct all the possible rules of Type II from w1, w2 ∈ K. Hence we need O(|K|^2 ℓK^{4p+1}) time to compute the rules of Type II.
G(K) has exactly |K| rules of Type III, each of size O(ℓK).
All in all, it takes O(|K|^2 ℓK^{2pr+2p+1}) time to construct G(K), and its size ‖G(K)‖ has the same bound.

Hence A(p, r) updates its hypothesis quickly if p and r are small.

Lemma 8. |KG| ≤ |P| and ‖KG‖ ≤ |P| τG where G = ⟨Σ, V, F, P, S⟩.

Proof. Each rule of G determines one element in KG, whose length is exactly the thickness of the rule. We have |KG| ≤ |P| and ‖KG‖ ≤ |KG| τG ≤ |P| τG.

The size of KG does not depend on p and r, while the updating time is polynomial only when p and r are fixed. This contrasts with Yoshinaka's discussion of the learning efficiency of k, l-substitutable context-free languages [19], which are another extension of Clark and Eyraud's work [8]. His algorithm updates the conjecture in polynomial time independently of k and l, while the size of the data for convergence is bounded by a polynomial whose degree is linear in k + l.

Theorem 1. The learning algorithm A(p, r) identifies SL(p, r) in the limit from
positive data with polynomial time and data.

Concerning the algorithm A(p, ∗) for SL(p, ∗), its updating time is not bounded
by a polynomial any longer, while KG still works well for A(p, ∗).

5 Discussions
This paper has demonstrated how Clark and Eyraud's approach with substitutability [8] works in learning mildly context-sensitive languages. pd-substitutability seems to fit nicely into p-dimensional mcfls as a generalization of 1d-substitutability in context-free languages, which is the exact analogue of reversibility in regular languages. The obtained learnable classes are, however, not rich: as we have seen in Section 3, several rather simple languages are not 2d-substitutable. pd-substitutability easily causes too much generalization from finite languages even when p = 2. The author hopes that this work provides a clue for further investigation on learning mildly context-sensitive languages, possibly in other learning schemes.
One naive trial for enriching the expressive power from 2d-substitutable languages might be considering the following property in addition to 1d-substitutability:

x1 y1 x2 y2, x1 y1′ x2 y2′, x1′ y1 x2′ y2 ∈ L implies x1′ y1′ x2′ y2′ ∈ L
for any x1, x1′, y2, y2′ ∈ Σ∗ and x2, x2′, y1, y1′ ∈ Σ+. This property is stronger than 1d-substitutability and slightly weaker than 2d-substitutability (and might be thought of as 2d-reversibility). However, this property is still too strong; neither { a^n #a^n | n ≥ 1 }, Lreverse nor Lcopy satisfies this property.
In order to control some kinds of dependent structures in pd-substitutable languages, Examples 1 and 2 insert delimiters #_i. This trick is necessary even in 1d-substitutable languages. While { a^n #b^n | n ≥ 0 } is 1d-substitutable, { a^n b^n | n ≥ 0 } is not. Yoshinaka's approach of k, l-substitutability [19] enables us to remove such delimiters. Thus again one may consider ⟨k1, . . . , k2m⟩-substitutability:

x1 ⊙ v⃗ ⊙ y⃗1, x1 ⊙ v⃗ ⊙ y⃗2, x2 ⊙ v⃗ ⊙ y⃗1 ∈ L implies x2 ⊙ v⃗ ⊙ y⃗2 ∈ L

for any v⃗ ∈ (Σ^{k1}□Σ^{k2}) × · · · × (Σ^{k2m−1}□Σ^{k2m}). Indeed { a^n b^n c^n d^n | n ≥ 1 } is 1⟨4⟩-substitutable, but neither { a^n b^n a^n b^n | n ≥ 1 } nor { a^n #a^n | n ≥ 1 } is k⃗-substitutable for any k⃗ ∈ N⟨4⟩.
Clark et al. [9] have developed their work on substitutable context-free lan-
guages to learning a much richer class of context-free languages with positive
examples and membership queries. Their approach would be generalized also for
mildly context-sensitive languages, where multidimensional substitutable lan-
guages should be regarded as a special case. This seems to be the most convincing
approach for future work.

Acknowledgement
The author deeply appreciates Thomas Zeugmann and the anonymous reviewers
for their valuable comments and advice.
This work was supported by Grant-in-Aid for Young Scientists (B-20700124)
and a grant from the Global COE Program, “Center for Next-Generation Infor-
mation Technology based on Knowledge Discovery and Knowledge Federation”,
from the Ministry of Education, Culture, Sports, Science and Technology of
Japan.

References
1. Angluin, D.: Inference of reversible languages. Journal of the Association for Com-
puting Machinery 29(3), 741–765 (1982)
2. Becerra-Bonache, L., Case, J., Jain, S., Stephan, F.: Iterative learning of simple
external contextual languages. In: Freund, Y., Györfi, L., Turán, G., Zeugmann,
T. (eds.) ALT 2008. LNCS (LNAI), vol. 5254, pp. 359–373. Springer, Heidelberg
(2008)
3. Becerra-Bonache, L., Yokomori, T.: Learning mild context-sensitiveness: Toward
understanding children’s language learning. In: Paliouras, G., Sakakibara, Y. (eds.)
ICGI 2004. LNCS (LNAI), vol. 3264, pp. 53–64. Springer, Heidelberg (2004)
4. Carme, J., Gilleron, R., Lemay, A., Niehren, J.: Interactive learning of node select-
ing tree transducer. Machine Learning 66(1), 33–67 (2007)

5. Clark, A.: PAC-learning unambiguous NTS languages. In: Sakakibara, et al [16],


pp. 59–71
6. Clark, A., Coste, F., Miclet, L. (eds.): ICGI 2008. LNCS (LNAI), vol. 5278.
Springer, Heidelberg (2008)
7. Clark, A., Eyraud, R.: Identification in the limit of substitutable context-free lan-
guages. In: Jain, S., Simon, H.U., Tomita, E. (eds.) ALT 2005. LNCS (LNAI),
vol. 3734, pp. 283–296. Springer, Heidelberg (2005)
8. Clark, A., Eyraud, R.: Polynomial identification in the limit of context-free substi-
tutable languages. Journal of Machine Learning Research 8, 1725–1745 (2007)
9. Clark, A., Eyraud, R., Habrard, A.: A polynomial algorithm for the inference of
context free languages. In: Clark, et al [6], pp. 29–42
10. Gold, E.M.: Language identification in the limit. Information and Control 10(5),
447–474 (1967)
11. de la Higuera, C.: Characteristic sets for polynomial grammatical inference. Ma-
chine Learning 27, 125–138 (1997)
12. Kasprzik, A.: A learning algorithm for multi-dimensional trees, or: Learning beyond
context-freeness. In: Clark, et al [6], pp. 111–124
13. Kudlek, M., Martín-Vide, C., Mateescu, A., Mitrana, V.: Contexts and the concept of mild context-sensitivity. Linguistics and Philosophy 26(6), 703–725 (2003)
14. Oates, T., Armstrong, T., Becerra-Bonache, L., Atamas, M.: Inferring grammars
for mildly context sensitive languages in polynomial-time. In: Sakakibara, et al [16],
pp. 137–147
15. Rambow, O., Satta, G.: Independent parallelism in finite copying parallel rewriting
systems. Theor. Comput. Sci. 223(1-2), 87–120 (1999)
16. Sakakibara, Y., Kobayashi, S., Sato, K., Nishino, T., Tomita, E. (eds.): ICGI 2006.
LNCS (LNAI), vol. 4201. Springer, Heidelberg (2006)
17. Seki, H., Matsumura, T., Fujii, M., Kasami, T.: On multiple context-free grammars.
Theoretical Computer Science 88(2), 191–229 (1991)
18. Wakatsuki, M., Tomita, E.: A fast algorithm for checking the inclusion for very
simple deterministic pushdown automata. IEICE transactions on information and
systems E76-D(10), 1224–1233 (1993)
19. Yoshinaka, R.: Identification in the limit of k, l-substitutable context-free lan-
guages. In: Clark, et al [6], pp. 266–279
Uncountable Automatic Classes and Learning

Sanjay Jain1 , Qinglong Luo1 , Pavel Semukhin2 , and Frank Stephan1,2


1 Department of Computer Science,
National University of Singapore, Singapore 117417, Republic of Singapore
sanjay@comp.nus.edu.sg, luoqingl@comp.nus.edu.sg
2 Department of Mathematics,
National University of Singapore, Singapore 117543, Republic of Singapore
pavels@nus.edu.sg, fstephan@comp.nus.edu.sg

Abstract. In this paper we consider uncountable classes recognizable


by ω-automata and investigate suitable learning paradigms for them. In
particular, the counterparts of explanatory, vacillatory and behaviourally
correct learning are introduced for this setting. Here the learner reads in
parallel the data of a text for a language L from the class plus an ω-index
α and outputs a sequence of ω-automata such that all but finitely many
of these ω-automata accept the index α iff α is an index for L.
It is shown that any class is behaviourally correct learnable if and
only if it satisfies Angluin’s tell-tale condition. For explanatory learning,
such a result needs that a suitable indexing of the class is chosen. On the
one hand, every class satisfying Angluin’s tell-tale condition is vacillatory
learnable in every indexing; on the other hand, there is a fixed class such
that the level of the class in the hierarchy of vacillatory learning depends
on the indexing of the class chosen.
We also consider a notion of blind learning. On the one hand, a class
is blind explanatory (vacillatory) learnable if and only if it satisfies An-
gluin’s tell-tale condition and is countable; on the other hand, for be-
haviourally correct learning there is no difference between the blind and
non-blind version.
This work establishes a bridge between automata theory and inductive
inference (learning theory).

1 Introduction
Usually, in learning theory one considers classes consisting of countably many
languages from some countable domain. A typical example here is a class of all
recursive subsets of {0, 1, 2}∗, the set of all finite strings in the alphabet {0, 1, 2}.
However, each countably infinite domain has uncountably many subsets, and
thus we miss out many potential targets when we consider only countable classes.
The main goal of this paper is to find a generalization of the classical model
of learning which would be suitable for working with uncountable classes of
languages. The classes, which we consider, can be uncountable but they still

The first and fourth author are supported in part by NUS grant R252-000-308-112;
the third and fourth author are supported by NUS grant R146-000-114-112.


have some structure, namely, they are recognizable by Büchi automata. We will
investigate, how the classical notions of learnability have to be adjusted in this
setting in order to obtain meaningful results. To explain our approach in more
detail, we first give an overview of the classical model of inductive inference
which is the underlying model of learning in our paper.
Consider a class L = {Li }i∈I , where each language Li is a subset of Σ ∗ , the
set of finite strings in an alphabet Σ. In a classical model of learning, which was
introduced and studied by Gold [9], a learner M receives a sequence of all the
strings from a given language L ∈ L, possibly with repetitions. Such a sequence is
called a text for the language. After reading the first n strings from the texts, the
learner outputs a hypothesis in about what the target language might be. The
learner succeeds if it eventually converges to an index that correctly describes
the language to be learned, that is, if limn in = i and L = Li . If the learner
succeeds on all texts for all languages from a class, then we say that it learns this
class. This is the notion of explanatory learning (Ex). Such a model became the
standard one for the learnability of countable classes. Besides Ex, several other
paradigms for learning have been considered like, e.g., behaviourally correct
(BC) learning [3], vacillatory or finite explanatory (FEx) learning [8], partial
identification (Part) [13] and so on.
The indices that the learner outputs are usually finite objects like natural
numbers or finite strings. For example, Angluin [1] initiated the research on
learnability of uniformly recursive families indexed by natural numbers, and in
their recent work Jain, Luo and Stephan [10] considered automatic indexings by
finite strings in place of uniformly recursive ones. The collection of such finite
indices is countable, and hence we can talk only about countable classes of lan-
guages. On the other hand, the collection of all the subsets of Σ ∗ is uncountable,
and it looks too restrictive to consider only countable classes. Because of this, it
is interesting to find a generalization of the classical model which will allow us
to study the learnability of uncountable classes.
Below is the informal description of the learning model that we investigate in
this paper. First, since we are going to work with uncountable classes, we need
uncountably many indices to index a class to be learned. For this purpose we
will use infinite strings (or ω-strings) in a finite alphabet. There are computing
machines, called Büchi automata or ω-automata, which can be used naturally
for processing ω-strings. They were first introduced by Büchi [6,7] to prove the
decidability of S1S, the monadic second-order theory of the natural numbers with
successor function S(x) = x + 1. Because of this and other decidability results
the theory of ω-automata has become a popular area of research in theoretical
computer science, see, e.g., [14]. So, we will assume that a class to be learned
has an indexing by ω-strings which is Büchi recognizable.
The main difference between our model and the classical one is that the learner
does not output hypotheses as it processes a text. The reason for this is that it
is not possible to output an arbitrary infinite string in a finite amount of time.
Instead, in our model the learner is presented with an index α and a text T ,
and it must decide whether T is a text for the set with the index α. During its
work, the learner outputs an infinite sequence of Büchi automata {An }n∈ω such
that An accepts the index α if and only if the learner at stage n thinks that T is
indeed a text for the set with the index α. The goal of the learner is to converge
in the limit to the right answer.
As one can see from the description above, the outputs of a learner take the
form of ω-automata instead of just binary 'yes' or 'no' answers. We chose this
definition because a learner can read only a finite part of an infinite
index in a finite amount of time. If we required a learner to output its 'yes' or
'no' answer based on such finite information, our model would become too
restrictive. On the other hand, a Büchi automaton allows a learner to encode
additional infinitary conditions that have to be verified before the index is
accepted or rejected, for example, whether the index contains infinitely many
1's. This approach makes a learner more powerful, and more nontrivial classes
become learnable.
Probably the most interesting property of our model is that for many learning
criteria, the learnability coincides with Angluin’s classical tell-tale condition for
the countable case (see the table at the end of this section). Angluin’s condition
states that for every set L from a class L, there is a finite subset DL ⊆ L
such that for any other L′ ∈ L with DL ⊆ L′ ⊆ L we have that L′ = L. It is
also well-known that in the classical case, every r.e. class is learnable according
to the criterion of partial identification. We will show that in our model every
ω-automatic class can be learned according to this criterion.
The results above show that the notions defined in this paper match the
intuition of learnability, and that our model is a natural one and is suitable for
investigating the learnability of uncountable classes of languages.
We also consider the notion of blind learning. A learner is called blind if it does
not see the index presented to it. Such a learner can see only the input text, but
nevertheless it must decide whether the index and the text match each other.
It turns out that for the criterion of behaviourally correct learning, the blind
learners are as powerful as the non-blind ones without even the need to change
the indexing of a class, but for the other learning criteria this notion becomes
more restrictive.
The reader can find all formal definitions of the notions discussed here and
some necessary preliminaries in the next section. We summarize our results:
Criterion  | Condition        | Indexing | Theorem
Ex         | ATTC             | New      | 17, 20
FEx        | ATTC             | Original | 13, 20
BC         | ATTC             | Original | 20
Part       | Any class        | Original | 21
BlindBC    | ATTC             | Original | 18, 20
BlindEx    | ATTC & Countable | Original | 19
BlindFEx   | ATTC & Countable | Original | 19
BlindPart  | Countable        | Original | 22
In this table, the first column lists the learning criteria that we studied. Here,
Ex stands for explanatory learning, BC for behaviourally correct learning, FEx
for finite explanatory or vacillatory learning, and Part for partial identification.
A prefix Blind denotes the blind version of the corresponding criterion. The
second column describes equivalent conditions for a given learning criterion.
Here, ATTC means that the class must satisfy Angluin’s tell-tale condition, and
Countable means that the class must be countable. The next column indicates
whether the learner uses the original indexing of the class or a new one. The last
column gives a reference to a theorem/corollary where the result is proved.

2 Preliminaries
An ω-automaton is essentially a finite automaton operating on ω-strings with an
infinitary acceptance condition which decides — depending upon the infinitely
often visited states — which ω-strings are accepted and which are rejected. For
a general background on the theory of finite automata the reader is referred
to [11].
Definition 1 ([6,7]). A nondeterministic ω-automaton is a tuple A = (S, Σ, I, T), where

(a) S is a finite set of states,
(b) Σ is a finite alphabet,
(c) I ⊆ S is the set of initial states, and
(d) T is the transition function T : S × Σ → P(S), where P(S) is the power set of S.
An automaton A is deterministic iff |I| = 1, and for all s ∈ S and a ∈ Σ,
|T (s, a)| = 1.
An ω-string in an alphabet Σ is a function α : ω → Σ, where ω is the set
of natural numbers. We often identify an ω-string with the infinite sequence
α = α0 α1 α2 . . . , where αi = α(i). Let Σ ∗ and Σ ω denote the set of all finite
strings and the set of all ω-strings over the alphabet Σ, respectively.
We always assume that the elements of an alphabet Σ are linearly ordered.
This order can be extended to the length-lexicographical order ≤llex on Σ ∗ ;
here x ≤llex y iff |x| < |y| or |x| = |y| ∧ x ≤lex y, where ≤lex is the standard
lexicographical order.
Given an ω-automaton A = (S, Σ, I, T ) and an ω-string α, a run of A on α
is an ω-string
r = s0 . . . sn sn+1 . . . ∈ S ω
such that s0 ∈ I and for all n, sn+1 ∈ T (sn , αn ). Note that if an ω-automaton A
is deterministic, then for every α, there is a unique run of A on α. In this case
we will use the notation St A (α, k) to denote the state of A after it has read the
first k symbols of α.
Definition 2. Let Inf (r) denote the infinity set of a run r, that is,
Inf (r) = {s ∈ S : s appears infinitely often in r}.
We define the following accepting conditions for the run r:
1) The Büchi condition is determined by a subset F ⊆ S. The run r is accepting
iff Inf(r) ∩ F ≠ ∅.
2) The Muller condition is determined by a subset F ⊆ P(S). The run r is accepting
iff Inf(r) ∈ F.
3) The Rabin condition is determined by Ω = {(L1, R1), ..., (Lh, Rh)}, where all Li
and Ri are subsets of S. The run r is accepting iff there is an i such that
1 ≤ i ≤ h, Inf(r) ∩ Li ≠ ∅ and Inf(r) ∩ Ri = ∅.
It can be shown that all these acceptance conditions are equivalent (see [11]).
Therefore, we will say that an ω-automaton A accepts a string α iff there is a
run of A on α that satisfies the chosen accepting condition defined above. Let
L(A) denote the set of strings accepted by an automaton A.
Furthermore, every ω-automaton is equivalent to a deterministic one with
Muller acceptance condition (again see [11]). Thus, if not explicitly stated oth-
erwise, by an automaton we will always mean a deterministic ω-automaton with
Muller acceptance condition.
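Since every ultimately periodic ω-string u·v^ω is finitely presentable, acceptance by a deterministic Muller automaton is effectively decidable on such inputs. The following sketch illustrates this; the representation (a transition function delta, an initial state q0, and a Muller table F) and all names are our own illustrative choices, not notation from the paper.

```python
# Illustrative sketch (ours): deciding whether a deterministic Muller
# automaton accepts an ultimately periodic ω-string u·v^ω.

def muller_accepts(delta, q0, F, u, v):
    q = q0
    for a in u:                       # read the finite prefix u
        q = delta(q, a)
    # Iterate v until a state at the start of some period repeats; the
    # periods between the two occurrences form the cycle of the run.
    seen, trace = {}, []
    while q not in seen:
        seen[q] = len(trace)
        trace.append(q)
        for a in v:
            q = delta(q, a)
    # Inf(r) = all states visited while reading v along the cycle.
    inf = set()
    for p in trace[seen[q]:]:
        s = p
        for a in v:
            s = delta(s, a)
            inf.add(s)
    return frozenset(inf) in F        # Muller condition: Inf(r) ∈ F

# Example: automaton whose state is the last symbol read; it accepts
# exactly the ω-strings over {0, 1} containing infinitely many 1's.
delta = lambda q, a: a
F = {frozenset({0, 1}), frozenset({1})}
print(muller_accepts(delta, 0, F, [0], [0, 1]))   # True
print(muller_accepts(delta, 0, F, [1], [0]))      # False
```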
Definition 3 ([12])
1) A finite automaton is a tuple A = (S, Σ, I, T, F ), where S, Σ, I and T are
the same as in the definition of an ω-automaton, and F ⊆ S is the set of final
states.
2) For a finite string w = a0 . . . an−1 ∈ Σ ∗ , a run of A on w is a sequence
s0 . . . sn ∈ S ∗ such that s0 ∈ I and si+1 ∈ T (si , ai ) for all i ≤ n − 1. The run
is accepting iff sn ∈ F . The string w = a0 . . . an−1 is accepted by A iff there
is an accepting run of A on w.
Definition 4. 1) A convolution of k ω-strings α1, ..., αk ∈ Σ^ω is an ω-string
⊗(α1, ..., αk) in the alphabet Σ^k defined as

⊗(α1, ..., αk)(n) = (α1(n), ..., αk(n)) for every n ∈ ω.

2) A convolution of k finite strings w1, ..., wk ∈ Σ* is a string ⊗(w1, ..., wk)
of length l = max{|w1|, ..., |wk|} in the alphabet (Σ ∪ {#})^k, where # is a
new padding symbol, defined as

⊗(w1, ..., wk)(n) = (v1(n), ..., vk(n)) for every n < l,

where for each i = 1, ..., k and n < l, vi(n) = wi(n) if n < |wi|, and vi(n) = # otherwise.

3) Correspondingly one defines the convolution of finite strings and ω-strings:
one identifies each finite string σ with the ω-string σ#^ω and then forms the
corresponding convolution of ω-strings.
4) A convolution of a k-ary relation R on finite or ω-strings is defined as

⊗R = {⊗(x1, ..., xk) : (x1, ..., xk) ∈ R}.
5) A relation R on finite or ω-strings is automatic iff its convolution ⊗R is
recognizable by a finite or an ω-automaton, respectively.
For ease of notation, we often just write (x, y) instead of ⊗(x, y) and so on. It
is well-known that the automatic relations are closed under union, intersection,
projection and complementation. In general, the following theorem holds, which
we will often use in this paper.
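As a concrete illustration of Definition 4, here is a small sketch (our own code, with hypothetical names) of the convolution of finite strings with '#'-padding:

```python
# Illustrative sketch (ours): the convolution ⊗(w1, ..., wk) of finite
# strings, padding shorter strings with '#' as in Definition 4.

def convolution(*words, pad='#'):
    l = max(len(w) for w in words)
    return [tuple(w[n] if n < len(w) else pad for w in words)
            for n in range(l)]

# ⊗(ab, a) is the length-2 string (a,a)(b,#) over the alphabet (Σ ∪ {#})²
print(convolution("ab", "a"))   # [('a', 'a'), ('b', '#')]
```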

Theorem 5 ([4,5]). If a relation R on ω-strings is definable from other automatic relations R1, ..., Rk by a first-order formula, then R itself is automatic.

Remark 6. 1) If we use additional parameters in a first-order definition of R,
then the parameters must be ultimately periodic strings.
2) Furthermore, in a definition of R we can use first-order variables of two sorts,
namely, one ranging over ω-strings and one ranging over finite strings. We can
do this because every finite string v can be identified with its ω-expansion
v#^ω, and the set of all ω-expansions of the finite strings in an alphabet Σ is
automatic.

A class L is a collection of sets of finite strings over some alphabet Γ, i.e.,
L ⊆ P(Γ*). An indexing for a class L is an onto mapping f : I → L, where
I is the set of indices. We will often denote the indexing as {Lα}α∈I, where
Lα = f(α).
An indexing {Lα}α∈I is automatic iff I is an automatic subset of Σ^ω for some
alphabet Σ and the relation {(x, α) : x ∈ Lα} is automatic. A class is automatic
iff it has an automatic indexing. If not stated otherwise, all indexings and
all classes considered herein are assumed to be automatic.

Example 7. Here are some examples of automatic classes:

1) the class of all open intervals I = {q ∈ D : p < q < r} of dyadic rationals,
where the border points p and r can be any real numbers;
2) the class of such intervals where r − p is either 1 or 2 or 3;
3) the class of all sets of finite strings which are given as the prefixes of an
infinite sequence;
4) the class of all sets of natural numbers in unary coding.

A text is an ω-string T of the form

T = u0, u1, u2, ...,

such that each ui is either equal to the pause symbol # or belongs to Γ*, where
Γ is some alphabet. We call ui the ith input of the text. The content of a text
T is the set content(T) = {ui : ui ≠ #}. If content(T) is equal to a set L ⊆ Γ*,
then we say that T is a text for L.

Definition 8. Let Γ and Σ be alphabets for sets and indices, respectively. A
learner is a Turing machine M that has the following:

1) two read-only tapes: one for an ω-string from Σ^ω representing an index
and one for a text for a set L ⊆ Γ*;
2) one write-only output tape on which M writes a sequence of automata
(in a suitable coding);
3) one read-write working tape.

Let Ind(M, α, T, s) and Txt(M, α, T, s) denote the number of symbols read on
the index and text tapes by learner M up to step s when it processes an index
α and a text T. Without loss of generality, we will assume that

lim_{s→∞} Ind(M, α, T, s) = lim_{s→∞} Txt(M, α, T, s) = ∞

for any α and T. By M(α, T, k) we denote the kth automaton output by learner
M when processing an index α and a text T. Without loss of generality, for the
learning criteria considered in this paper, we assume that M(α, T, k) is defined
for all k.

Definition 9 (see [3,8,9,13]). Let a class L = {Lα}α∈I (together with its
indexing) and a learner M be given. We say that

1) M BC-learns L iff for any index α ∈ I and any text T with content(T) ∈ L,
there exists n such that for every m ≥ n,

M(α, T, m) accepts α iff Lα = content(T).

2) M Ex-learns L iff for any index α ∈ I and any text T with content(T) ∈ L,
there exists n such that for every m ≥ n, M(α, T, m) = M(α, T, n) and

M(α, T, m) accepts α iff Lα = content(T).

3) M FEx-learns L iff M BC-learns L and for any α ∈ I and any text T with
content(T) ∈ L, the set {M(α, T, n) : n ∈ ω} is finite.

4) M FExk-learns L iff M BC-learns L and for any α ∈ I and any text T with
content(T) ∈ L, there exists n such that

|{M(α, T, m) : m ≥ n}| ≤ k.

5) M Part-learns L iff for any α ∈ I and any T with content(T) ∈ L, there
exists a unique automaton A such that M outputs A infinitely often, and

A accepts α iff Lα = content(T).

Here the abbreviations BC, Ex, FEx and Part stand for ‘behaviourally correct’,
‘explanatory’, ‘finite explanatory’ and ‘partial identification’, respectively; ‘finite
explanatory learning’ is also called ‘vacillatory learning’. We will also use the
notations BC, Ex, FEx, FExk and Part to denote the collection of classes
(with corresponding indexings) that are BC-, Ex-, FEx-, FExk - and Part-
learnable, respectively.
Definition 10. A learner is called blind if it does not see the tape which contains
an index. The classes that are blind BC-, Ex-, etc. learnable are denoted as
BlindBC, BlindEx, etc., respectively.

Definition 11 ([1]). We say that a class L satisfies Angluin's tell-tale condition
iff for every L ∈ L there is a finite DL ⊆ L such that for every L′ ∈ L, if
DL ⊆ L′ ⊆ L then L′ = L. Such a DL is called a tell-tale set for L.

Fact 12 ([1]). If a class L is BC-learnable, then L satisfies Angluin's tell-tale
condition.

The converse will also be shown to be true, hence for automatic classes one
can equate “L is learnable” with “L satisfies Angluin’s tell-tale condition”. Note
that the second and the third class given in Example 7 satisfy Angluin’s tell-tale
condition.
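For intuition, Angluin's condition can be checked by brute force when the class is a finite collection of finite languages; the following sketch (our own toy code, not an algorithm from the paper) does exactly that:

```python
# Illustrative sketch (ours): brute-force test of Angluin's tell-tale
# condition for a finite class of finite languages.

from itertools import combinations

def has_telltale(L, cls):
    """Is there a finite D ⊆ L such that every L' in cls with
    D ⊆ L' ⊆ L already equals L?"""
    elems = sorted(L)
    for r in range(len(elems) + 1):
        for D in map(set, combinations(elems, r)):
            if all(Lp == L or not (D <= Lp <= L) for Lp in cls):
                return True
    return False

def satisfies_attc(cls):
    return all(has_telltale(L, cls) for L in cls)

# For the chain {a} ⊂ {a,b}: D = ∅ is a tell-tale for {a}, and
# D = {b} is a tell-tale for {a,b}.
print(satisfies_attc([frozenset({'a'}), frozenset({'a', 'b'})]))   # True
```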

3 Vacillatory Learning

In the following it is shown that every learnable class can even be vacillatorily
learned, and that the corresponding FEx-learner uses, over all possible inputs,
only a fixed number of automata.

Theorem 13. Let {Lα }α∈I be a class that satisfies Angluin’s tell-tale condition.
Then there are finitely many automata A1 , . . . , Ac and an FEx-learner M for
the class {Lα }α∈I with the property that for any α ∈ I and any text T for a
set from {Lα }α∈I , the learner M oscillates only between some of the automata
A1 , . . . , Ac on α and T .

Proof. Let M be a deterministic automaton recognizing the relation {(x, α) :
x ∈ Lα}, and let N be a deterministic automaton recognizing

{(x, α) : {y ∈ Lα : y ≤llex x} is a tell-tale for Lα}.

Such an N exists since the relation is first-order definable from 'x ∈ Lα' and ≤llex
by the formula:

N accepts (x, α) ⇐⇒ ∀α′ ∈ I [ if ∀y ((y ∈ Lα & y ≤llex x) → y ∈ Lα′) &
∀y (y ∈ Lα′ → y ∈ Lα), then ∀y (y ∈ Lα ↔ y ∈ Lα′) ].

For each α ∈ I, consider the equivalence relation ≡M,α defined as

x ≡M,α y ⇐⇒ there is a t > max{|x|, |y|} such that St_M(⊗(x, α), t) = St_M(⊗(y, α), t).

An equivalence relation ≡N,α is defined in a similar way.
Note that the number of equivalence classes of ≡M,α is bounded by the number
of states of M , and for every x, y, if x ≡M,α y then x ∈ Lα ↔ y ∈ Lα . Therefore,
Lα is the union of finitely many equivalence classes of ≡M,α .
Let m and n be the number of states of M and N , respectively. Consider the
set of all finite tables U = {Ui,j : 1 ≤ i ≤ m, 1 ≤ j ≤ n} of size m × n such that
each Ui,j is either equal to a subset of {1, . . . , i} or to a special symbol Reject.
With each such table U we will associate an automaton A as described below.
The algorithm for learning {Lα }α∈I is now roughly as follows. On every step,
the learner M reads a finite part of the input text and based on this information
constructs a table U . After that M outputs the automaton associated with U .
First, we describe the construction of an automaton A for each table U . For
every α ∈ I, let m(α) and n(α) be the numbers of equivalence classes of ≡M,α
and ≡N,α , respectively. Also, let x1 , . . . , xm(α) be the length-lexicographically
least representatives of equivalence classes of ≡M,α such that
x1 <llex · · · <llex xm(α) .
Our goal is to construct A such that
A accepts α ⇐⇒ Um(α),n(α) is a subset of {1, . . . , m(α)}
such that Lα = {y : y ≡M,α xk for some k ∈ Um(α),n(α) }.
Let EqSt M (α, x, y, z) be the relation defined as
EqSt M (α, x, y, z) ⇐⇒ St M (⊗(x, α), |z|) = St M (⊗(y, α), |z|).
The relation EqSt N (α, x, y, z) is defined similarly. Note that these relations are
automatic.
Instead of constructing A explicitly, we will show that the language which
A needs to recognize is first-order definable from EqSt M (α, x, y, z), EqSt N (α, x,
y, z) and the relations recognized by M and N .
First, note that the equivalence relation x ≡M,α y can be defined by a formula:
∃z (|z| > max{|x|, |y|} and EqSt M (α, x, y, z)).
Similarly one can define x ≡N,α y. The fact that ≡M,α has exactly k
equivalence classes can be expressed by the formula:

ClNum_{M,k}(α) = ∃x1 ... ∃xk [ ⋀_{1≤i<j≤k} xi ≢M,α xj  &  ∀y ⋁_{1≤i≤k} y ≡M,α xi ].

Again, ClNum_{N,k}(α) expresses the same fact for ≡N,α. Finally, the fact that A
accepts α can be expressed by the following first-order formula:

⋁_{(i,j) : Ui,j ≠ Reject} [ ClNum_{M,i}(α) & ClNum_{N,j}(α) & ∃x1 ... ∃xi
( x1 <llex ··· <llex xi & ∀z (z ∈ Lα ↔ ⋁_{k∈Ui,j} z ≡M,α xk)
& ⋀_{1≤k≤i} ∀y (y <llex xk → y ≢M,α xk) ) ].
We now describe the algorithm for learning the class {Lα}α∈I. We will use the
notation x ≡M,α,s y as an abbreviation of

"there is a t such that s ≥ t > max{|x|, |y|} and St_M(⊗(x, α), t) = St_M(⊗(y, α), t)."

As before, let m and n be the numbers of states of the automata M and N, respectively.
At step s, M computes the ≤llex least representatives of the equivalence classes
of ≡M,α,s and ≡N,α,s on the strings of length shorter than s. In other words,
it computes x1 , . . . , xp and y1 , . . . , yq such that

a) x1 is the empty string,
b) xk+1 is the ≤llex least x >llex xk such that |x| ≤ s and x ≢M,α,s xi for
all i ≤ k. If no such x exists, then the process stops.

The sequence y1, ..., yq is computed in a similar way.
Next, M constructs a table U of size m × n. For every i and j, the value of
Ui,j is defined as follows. If i > p or j > q, then let Ui,j = Reject. Otherwise, let
τs be the initial segment of the input text T consisting of the first s strings in
the text T . Check if the following two conditions are satisfied:

1) for every x, x′ ≤llex yj, if x ≡M,α,s x′, then x ∈ content(τs) iff x′ ∈ content(τs),
2) for every k ≤ i and every y, if y ∈ content(τs) and y ≡M,α,s xk, then
xk ∈ content(τs).
If yes, then let Ui,j = {k : k ≤ i and xk ∈ content(τs )}. Otherwise, let Ui,j =
Reject. After U is constructed, M outputs an automaton A associated with U
as described above. As the number of different possible U is finite, the number
of distinct corresponding automata output by M is finite.
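To make the bookkeeping above concrete, the next sketch (ours; the automaton is assumed to be given as a deterministic transition function over pairs of symbols, which is not notation from the paper) computes the ≤llex least representatives x1, ..., xp of the equivalence classes of ≡M,α,s from a finite prefix of α; the y1, ..., yq for ≡N,α,s are computed in the same way from N.

```python
# Illustrative sketch (ours): the ≤llex least representatives of ≡_{M,α,s}.
# delta is a deterministic transition function on pairs (text symbol, index
# symbol); finite strings are padded with '#'; alpha is a prefix of the
# index of length at least s.

from itertools import product

def state_after(delta, q0, x, alpha, t):
    """St_M(⊗(x, α), t) for the padded string x#^ω."""
    q = q0
    for n in range(t):
        q = delta(q, (x[n] if n < len(x) else '#', alpha[n]))
    return q

def equivalent(delta, q0, x, y, alpha, s):
    """x ≡_{M,α,s} y: some t with s ≥ t > max(|x|, |y|) merges the runs."""
    return any(state_after(delta, q0, x, alpha, t)
               == state_after(delta, q0, y, alpha, t)
               for t in range(max(len(x), len(y)) + 1, s + 1))

def representatives(delta, q0, sigma, alpha, s):
    reps = ['']                       # x1 is the empty string
    # only strings strictly shorter than s can admit a merging time t ≤ s
    for length in range(1, s):
        for tup in product(sorted(sigma), repeat=length):
            x = ''.join(tup)          # strings enumerated in ≤llex order
            if not any(equivalent(delta, q0, x, r, alpha, s) for r in reps):
                reps.append(x)
    return reps
```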
Let M(α, T, s) be an automaton output by learner M at step s when process-
ing the index α and the text T . To prove that the algorithm is correct we need
to show that for every α ∈ I and every text T such that content(T ) ∈ {Lα }α∈I ,

a) if content(T) = Lα, then for almost all s, M(α, T, s) accepts α,
b) if content(T) ≠ Lα, then for almost all s, M(α, T, s) rejects α.
Recall that m(α) and n(α) are the numbers of equivalence classes of ≡M,α and
≡N,α , respectively. Note that there is a step s0 after which the values x1 <llex
··· <llex xm(α) and y1 <llex ··· <llex yn(α) computed by M will always be
equal to the ≤llex least representatives of the equivalence classes of ≡M,α and ≡N,α,
respectively.
Suppose that content(T ) = Lα . Hence, there is s1 ≥ s0 such that for every
s ≥ s1 the following conditions are satisfied:

1) for every k ≤ m(α), xk ∈ content(τs) iff xk ∈ content(T),
2) for every x, x′ ≤llex yn(α), if x ≡M,α,s x′, then x ∈ content(τs) iff x′ ∈ content(τs),
3) for every k ≤ m(α) and every y, if y ∈ content(τs) and y ≡M,α,s xk, then
xk ∈ content(τs).
The last two conditions are satisfied since content(T ) = Lα is the union of
finitely many ≡M,α equivalence classes. Therefore, on every step s ≥ s1 , the
learner M constructs a table U such that Um(α),n(α) = {k : k ≤ m(α) and xk ∈
content(T )}. By our construction of the automaton A associated with U , A
accepts α if Lα = {y : y ≡M,α xk for some xk ∈ content(T )}. But since
content(T ) = Lα , this condition is satisfied.
Now suppose that content(T) ≠ Lα. Note that for every s ≥ s0, the yn(α) computed
by M at step s has the property that Dα = {x ∈ Lα : x ≤llex yn(α)} is a
tell-tale set for Lα. This follows from the definition of the automaton N and the
fact that yn(α) is the ≤llex largest among the representatives of the ≡N,α equivalence
classes.
First, consider the case Dα ⊄ content(T), that is, there is x ∈ Lα with
x ≤llex yn(α) but x ∉ content(T). Let s1 ≥ s0 be such that x ≡M,α,s1 xk for
some k ≤ m(α). Note that xk ≤llex x since xk is the minimal representative in
its equivalence class. If for some s2 ≥ s1, xk ∈ content(τs2), then from this step
on Um(α),n(α) will be equal to Reject, and M(α, T, s) will reject α for all s ≥ s2.
If xk ∉ content(T), then for all s ≥ s1, M(α, T, s) will reject α, either due to
the fact that Um(α),n(α) = Reject at step s, or because k ∉ Um(α),n(α) while it
should be in Um(α),n(α), since both x and xk are in Lα.
Now suppose that Dα ⊆ content(T). Since Dα is a tell-tale set for Lα and
content(T) ≠ Lα, there is x ∈ content(T) \ Lα. Let s1 ≥ s0 be such that
x ∈ content(τs1) and x ≡M,α,s1 xk for some k ≤ m(α). If xk ∉ content(T),
then for every s ≥ s1, Um(α),n(α) = Reject and M(α, T, s) will reject α. If
there is s2 ≥ s1 such that xk ∈ content(τs2), then for every s ≥ s2 either
Um(α),n(α) = Reject or k ∈ Um(α),n(α). In both cases M(α, T, s) will reject α
since xk ∉ Lα. □


Definition 14. 1) Let α ∈ {0, 1, ..., k}^ω and β ∈ {1, ..., k}^ω. The function
fα,β is defined as follows:

fα,β(n) = α(m), if m = min{x ≥ n : α(x) ≠ 0} exists;
fα,β(n) = lim sup_{x→∞} β(x), if such an m does not exist.

Let Lα,β be the set of all nonempty finite prefixes of fα,β, that is,

Lα,β = {fα,β(0) ... fα,β(n) : n ∈ ω}.

2) Define the class Lk as follows:

Lk = {Lα,β : α ∈ {0, 1, ..., k}^ω, β ∈ {1, ..., k}^ω}.

Note that the class Lk is uncountable and automatic.
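When α and β are ultimately periodic, fα,β is easily computable; the following sketch (ours, with the representations α = ua·(va)^ω and β = ub·(vb)^ω as assumptions of the illustration) evaluates it:

```python
# Illustrative sketch (ours): evaluating f_{α,β}(n) of Definition 14 for
# ultimately periodic α = ua·(va)^ω and β = ub·(vb)^ω (lists of integers).

def f_alpha_beta(ua, va, ub, vb, n):
    alpha = lambda x: ua[x] if x < len(ua) else va[(x - len(ua)) % len(va)]
    # Search for m = min{x ≥ n : α(x) ≠ 0}; past the prefix it suffices
    # to inspect one full period of va.
    for x in range(n, max(n, len(ua)) + len(va)):
        if alpha(x) != 0:
            return alpha(x)
    return max(vb)    # lim sup of an ultimately periodic β is max over vb

# α = 0 2 0 0 0 ..., β = 1 (1 2)^ω: the prefixes of f_{α,β} consist of 2's
print([f_alpha_beta([0, 2], [0], [1], [1, 2], n) for n in range(4)])  # [2, 2, 2, 2]
```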

Theorem 15. For every k ≥ 2, the class Lk is in FExk \ FExk−1 .


Remark 16. The last result can be strengthened in the following sense: for
every k ≥ 1 there is an indexing {Lβ}β∈I of the class L = {{α0α1α2 ... αn−1 :
n ∈ ω} : α ∈ {1, 2}^ω} such that {Lβ}β∈I is FExk+1-learnable but not FExk-learnable.
That is, the class can be kept fixed and only the indexing has to be adjusted.

4 Explanatory Learning
The main result of this section is that for every learnable class, there is an index-
ing such that the class with this indexing is explanatorily learnable. Furthermore,
one can observe that the learner, as above, on any text T for a language in the
class and an index α, first might output automata which reject α, then automata
which accept α and at the end again automata which reject α; so, in short, the
sequence is of the form “reject–accept–reject” (or a subsequence of this).
Theorem 17. If a class L = {Lα }α∈I satisfies Angluin’s tell-tale condition,
then there is an indexing for L such that L with this indexing is Ex-learnable.

Proof. Let M be a deterministic automaton recognizing {(x, α) : x ∈ Lα},
and let QM be its set of states. The set J of new indices for L will consist of
convolutions ⊗(α, β, γ), where α ∈ I, β ∈ {0, 1}^ω defines a tell-tale set for Lα,
and γ ∈ (P(QM))^ω keeps track of the states of M when it reads ⊗(x, α) for
finite strings x ∈ Lα. To simplify the notation we will write (α, β, γ) instead of
⊗(α, β, γ). Formally, J is defined as follows:

(α, β, γ) ∈ J ⇐⇒ α ∈ I, β = 0^n 1^ω for the minimal n such that
{x ∈ Lα : |x| < n} is a tell-tale set for Lα, and
for every k, γ(k) = {q ∈ QM : ∃x ∈ Lα (|x| ≤ k and St_M(⊗(x, α), k) = q)}.

We want to show that J is automatic. Again, it is enough to show that it is first-order
definable from other automatic relations. We can rewrite the definition for β as

β ∈ 0*1^ω and ∀σ ∈ 0* [(σ ⊆ β & σ0 ⊄ β) → {x ∈ Lα : |x| < |σ|} is a tell-tale set for Lα].

The first-order definition for a tell-tale set is given at the beginning of the proof
of Theorem 13. All other relations in this definition are clearly automatic.
The definition for γ can be written as

∀σ ∈ 0* ⋀_{q∈QM} [q ∈ γ(|σ|) ↔ ∃x ∈ Lα (|x| ≤ |σ| & St_M(⊗(x, α), |σ|) = q)].

For every q ∈ QM, there are automata Aq and Bq that recognize the relations

{(σ, γ) : σ ∈ 0* & q ∈ γ(|σ|)} and {(σ, x, α) : σ ∈ 0* & St_M(⊗(x, α), |σ|) = q}.


Therefore, J is first-order definable from automatic relations, and hence is itself
automatic.
We define a new indexing {Hα,β,γ }(α,β,γ)∈J for the class L as follows
Hα,β,γ = Lα .
Clearly, this indexing is automatic since
x ∈ Hα,β,γ ⇐⇒ x ∈ Lα and (α, β, γ) ∈ J.
We now describe a learner M that can Ex-learn the class L in the new indexing.
Let A be an automaton that recognizes the set J, and let Z be an automaton
that rejects all ω-strings. The learner M will output only automata A and Z
in a sequence Z–A–Z (or a subsequence of this). In other words, M can start
outputting automaton Z, then change its mind to A and then again change its
mind to Z, after which it will be outputting Z forever.
When an index (α, β, γ) is given to the learner M, it always assumes that
β and γ are correctly defined from α. Otherwise, it does not matter which au-
tomaton M will output in the limit, since both A and Z will reject the index
(α, β, γ).
We now show that for every finite string x,
x ∈ Lα ⇐⇒ St M (⊗(x, α), |x|) ∈ γ(|x|),
provided that γ is correct. Indeed, if x ∈ Lα , then St M (⊗(x, α), |x|) ∈ γ(|x|) by
the definition of γ. On the other hand, if St M (⊗(x, α), |x|) ∈ γ(|x|), then, again
by the definition of γ, there is y ∈ Lα with |y| ≤ |x| such that
St M (⊗(y, α), |x|) = St M (⊗(x, α), |x|).
Therefore, after |x| many steps the run of M on ⊗(x, α) coincides with the run
on ⊗(y, α). Hence M accepts ⊗(x, α), and x is in Lα .
At every step s, M reads the first s inputs x1, ..., xs from the input text. Then
M outputs A if the following conditions hold:

– There exists n ≤ s such that 0^n 1 ⊆ β.
– For every i with xi ≠ #, xi belongs to Lα according to γ, i.e.,
St_M(⊗(xi, α), |xi|) ∈ γ(|xi|).
– For every x with |x| < n, if x belongs to Lα according to γ, then x ∈ {x1, ..., xs}.

Otherwise, M outputs Z. This concludes step s.
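The test "x belongs to Lα according to γ" is purely local: it only needs the first |x| symbols of α and the set γ(|x|). A minimal sketch (ours; delta, q0 and the list representation of γ are assumptions of this illustration):

```python
# Illustrative sketch (ours): the membership test used by the Ex-learner,
# i.e. checking St_M(⊗(x, α), |x|) ∈ γ(|x|).

def member_according_to_gamma(delta, q0, x, alpha, gamma):
    q = q0
    for n in range(len(x)):      # only |x| steps, so no '#'-padding arises
        q = delta(q, (x[n], alpha[n]))
    return q in gamma[len(x)]    # gamma is a list of state sets
```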
Note that M makes a change from Z to A or from A to Z at most once. Thus
it always converges to one of these automata. If the index (α, β, γ) is not in J,
then M always rejects it. If (α, β, γ) ∈ J, then for every x, we have that x ∈ Lα
according to γ iff x is indeed in Lα . Moreover, the set
Dn = {x : |x| < n and x ∈ Lα according to γ}
is a tell-tale set for Lα , where n is such that β = 0n 1ω .
Let T be the input text. If content(T) = Hα,β,γ, then there is a step s ≥ n such
that Dn is contained in {x1, ..., xs}. Therefore, M will output only A from step s
onward. If content(T) ≠ Hα,β,γ, then Dn ⊄ content(T) or content(T) ⊄ Hα,β,γ.
In the first case, M will output Z on every step. In the second case, there is a
step s and an xi ∈ {x1, ..., xs} such that xi ≠ # and xi is not in Lα according to
γ. Therefore, M will output Z from step s onward. This proves the correctness
of the algorithm. □
5 Blind Learning
Blind learning is distinguished from ordinary learning in that the learner itself does not
see the index; so the learner has to code all the necessary information into
the automata, which then decide whether the index is correct or incorrect.
In the case of behaviourally correct learning, this is done by coding more and
more finite information, in such a way that almost all automata recognize an incorrect
index and reject it (where the point from which on this is recognized depends
on the index). In the case of explanatory learning, this is impossible, and hence
one has to simulate a traditional learner (for countable classes) and to code
its conjecture into the automaton, which then checks whether the index provided
is equivalent to the one to which the traditional learner has converged; hence
explanatorily learnable classes have to be countable.

Theorem 18. If a class L = {Lα}α∈I satisfies Angluin's tell-tale condition,
then L is BlindBC-learnable.

Theorem 19. For every class L = {Lα}α∈I, the following are equivalent:

1) L is BlindEx-learnable.
2) L is BlindFEx-learnable.
3) L is at most countable and satisfies Angluin’s tell-tale condition.

The following corollary summarizes the main results from the previous sections.

Corollary 20. For every automatic class L, the following are equivalent:

1) L satisfies Angluin's tell-tale condition.
2) L is BC-learnable.
3) L is BlindBC-learnable.
4) L is FEx-learnable.
5) L is Ex-learnable in a suitable indexing.

Proof. The implications 3) ⇒ 2) and 4) ⇒ 2) are trivial; 2) ⇒ 1) and 5) ⇒ 1)
follow from Fact 12; 1) ⇒ 3) follows from Theorem 18; 1) ⇒ 4) follows from
Theorem 13; and 1) ⇒ 5) follows from Theorem 17. □
6 Partial Identification
Partial identification is, in the traditional setting of inductive inference, a learn-
ing criterion where the learner outputs on every text of an r.e. language infinitely
many (not necessarily distinct) hypotheses such that exactly one hypothesis oc-
curs infinitely often and that hypothesis is correct. There is a recursive learner
succeeding on all r.e. sets, hence this concept is omniscient in the traditional
setting. Also in our model, every automatic class is partially identifiable.

Theorem 21. Every class with every given automatic indexing is Part-learnable.

Theorem 22. A class L = {Lα}α∈I is in BlindPart if and only if it is at
most countable.

References
1. Angluin, D.: Inductive inference of formal languages from positive data. Informa-
tion and Control 45(2), 117–135 (1980)
2. Bárány, V., Kaiser, Ł., Rubin, S.: Cardinality and counting quantifiers on omega-
automatic structures. In: Proceedings of the 25th International Symposium on
Theoretical Aspects of Computer Science, STACS 2008, pp. 385–396 (2008)
3. Bārzdiņš, J.: Two theorems on the limiting synthesis of functions. Theory of Algo-
rithms and Programs 1, 82–88 (1974)
4. Blumensath, A., Grädel, E.: Automatic structures. In: 15th Annual IEEE Sympo-
sium on Logic in Computer Science, Santa Barbara, CA, pp. 51–62. IEEE Com-
puter Society Press, Los Alamitos (2000)
5. Blumensath, A., Grädel, E.: Finite presentations of infinite structures: automata
and interpretations. Theory of Computing Systems 37(6), 641–674 (2004)
6. Büchi, J.R.: Weak second-order arithmetic and finite automata. Zeitschrift
für Mathematische Logik und Grundlagen der Mathematik 6, 66–92 (1960)
7. Büchi, J.R.: On a decision method in restricted second order arithmetic.
In: Logic, Methodology and Philosophy of Science (Proceedings 1960 International
Congress), pp. 1–11. Stanford University Press, Stanford (1962)
8. Case, J.: The power of vacillation in language learning. SIAM Journal on Comput-
ing 28(6), 1941–1969 (1999) (electronic)
9. Gold, E.M.: Language identification in the limit. Information and Control 10,
447–474 (1967)
10. Jain, S., Luo, Q., Stephan, F.: Learnability of automatic classes. Technical Report
TRA1/09, School of Computing, National University of Singapore (2009)
11. Khoussainov, B., Nerode, A.: Automata theory and its applications. Birkhäuser
Boston, Inc., Boston (2001)
12. Khoussainov, B., Nerode, A.: Automatic presentations of structures. In: Leivant,
D. (ed.) LCC 1994. LNCS, vol. 960, pp. 367–392. Springer, Heidelberg (1995)
13. Osherson, D.N., Stob, M., Weinstein, S.: Systems that learn. An introduction to
learning theory for cognitive and computer scientists. Bradford Book—MIT Press,
Cambridge (1986)
14. Vardi, M.Y.: The Büchi complementation saga. In: Thomas, W., Weil, P. (eds.)
STACS 2007. LNCS, vol. 4393, pp. 12–22. Springer, Heidelberg (2007)
Iterative Learning from Texts and
Counterexamples Using Additional Information

Sanjay Jain1,⋆ and Efim Kinber2


1 School of Computing, National University of Singapore, Singapore 117417,
Republic of Singapore
sanjay@comp.nus.edu.sg
2 Department of Computer Science, Sacred Heart University, Fairfield, CT
06825-1000, U.S.A.
kinbere@sacredheart.edu

Abstract. A variant of iterative learning in the limit (cf. [LZ96]) is studied
when a learner gets negative examples refuting conjectures containing
data in excess of the target language and uses additional information of
the following four types: a) memorizing up to n input elements seen so
far; b) up to n feedback membership queries (testing if an item is a
member of the input seen so far); c) the number of input elements seen
member of the input seen so far); c) the number of input elements seen
so far; d) the maximal element of the input seen so far. We explore how
additional information available to such learners (defined and studied
in [JK07]) may help. In particular, we show that adding the maximal
element or the number of elements seen so far helps such learners to in-
fer any indexed class of languages class-preservingly (using a descriptive
numbering defining the class) — as it is proved in [JK07], this is not
possible without using additional information. We also study how, in the
given context, different types of additional information fare against each
other, and establish hierarchies of learners memorizing n + 1 versus n
input elements seen and n + 1 versus n feedback membership queries.

1 Introduction
In this paper, we study some variants of learning in the limit from positive data
and negative counterexamples to conjectures, with restricted access to input
data. The general framework for study of learning in the limit was introduced in
[Gol67]. In Gold’s original model, TxtEx, a learner is able to hold full input data
seen so far in its long-term memory. However, this assumption is apparently too
strong for modeling many learning and cognitive processes. Wiehagen in [Wie76]
(see also [LZ96]) suggested a model for learning in the limit where the long-term
memory of the learners is limited to what they can store in their conjectures.
These learners are called iterative learners. This learning model, while strongly
limiting long-term memory, still makes salient an important aspect of learnability
in the limit: its incremental character. Some variants of iterative learning proved
to be quite useful in the context of applied machine learning (for example, [LZ06]

⋆ Supported in part by NUS grant number R252-000-308-112.

applies the idea of iterative learning in the context of training Support Vector
Machines).
The iterative learning model has been used for study of learnability from all
positive examples (the corresponding formal model being denoted as TxtIt) as
well as all positive and negative examples (denoted as InfIt, see [LZ92]). One
can argue that TxtIt may be too weak (a learner gets only positive data and can
memorize only a very limited amount of input), whereas InfIt may be too strong: it
is hard to conceive of a realistic learning process where the learner would be able to
get access to full negative data. For example, children learning languages, while
getting some negative data (in the form of corrections by parents or teachers),
never get the full set of negative data.
In [JK08], the model TxtEx was extended to allow negative counterexamples
to conjectures by a learner. This model is an example of active learning, where
a learner communicates with a teacher (formally, an oracle) making queries and
getting responses from the teacher. Active learning as a general framework for
study of learning processes was introduced by D. Angluin in [Ang88] and has
been widely utilized in various studies of theoretical and applied models of learn-
ability from examples since then. The model of iterative learning from full pos-
itive data and negative counterexamples, NCIt (NC here stands for “negative
counterexample”), defined in [JK07] actually combines two approaches: Gold’s
framework (as the learner incrementally gets access to full positive data) and
active learning (the learner, using subset queries, checks with the teacher if each
conjecture does not contain data in excess of the target language and, if the
answer is negative, the learner gets a negative counterexample showing an error).
In linguistic terms, non-grammatical sentences in conjectures are, thus, being
corrected. It should be noted that K. Popper [Pop68] regarded refutation of
overgeneralizing conjectures as a vital part of learning and discovery processes.
In this paper, we extend the NCIt model to incorporate some additional
features. Specifically, we consider the following two extensions of this model: in
addition to subset queries (for conjectures), the learner
a) can ask up to n feedback queries: whether the queried element belongs to
the input seen so far;
b) can store up to n input elements seen so far in its long-term memory (note
that when the long-term memory used by a learner is n-bounded, if the memory
is full then, in order to save a new input datum, the learner must sacrifice at
least one element currently stored in the memory).
In the context of iterative learning of languages from positive data, these two
types of “looking back” (in the context of feedback — using just one query per
conjecture) were defined in [LZ96] (an earlier variant of memory-bounded learn-
ing can be found in [OSW86], and the idea of feedback learning goes back to
[Wie76], where it was applied in the context of learning recursive functions in
the limit). Both these concepts were reformalized (the former named n-feedback
learning, and the latter named n-bounded memory learning) and thoroughly stud-
ied and discussed in [CJLZ99]. Motivation for these sorts of learnability models,
as discussed in [CJLZ99], comes from the rapidly developing field of knowledge


discovery in databases, which includes, in particular, data mining, knowledge
extraction, information discovery, data pattern processing, information harvest-
ing, etc. Many of these tasks represent interactive incremental iterative processes
(cf., e.g., [BA96] and [FPSS96]), working on huge data sets, finding regularities,
and verifying them on small samples of the overall data. While the authors in
[CJLZ99] explore the aforementioned formalizations of “looking back” at small
(uniformly limited by some upper bound n) portions of input data in the con-
text of regular iterative learning, we, in this research, allow the learner to test
with the teacher if conjectures do not contain data in excess of the target lan-
guage. Our learners may also be allowed to memorize some “bounds” derived
from the input data seen so far — in the form of the maximal element or the
length of input seen so far (the latter type of additional information for itera-
tive learners was first considered in [CM08]). In this research, we study how the
aforementioned types of additional information can enhance capabilities of the
NCIt-learnability model in general, and how they, while helping a learner, fare
against each other.
Specifically, in Section 3, we discover some general effects of additional in-
formation on NCIt-learners. In particular, it was established in [JK07] that
iterative learners getting access to full positive and full negative data are, sur-
prisingly, weaker than NCIt-learners (note that the latter ones get negative
data just in the form of a finite set of negative counterexamples — however,
only when these negative data are really necessary). We now show that when the
learners getting full positive and full negative data are allowed to memorize just
one input datum or ask just one feedback membership query, they can sometimes
learn more than any learner that gets access to full positive data, that can use
negative counterexamples (to conjectures), and that can store all data seen so far
in its long-term memory (see Theorem 7). A known capability of NCIt-learners
(established in [JK07]) is of special importance for many practical classes of
languages: they can learn every indexed class of languages (that is, any class of
recursive languages, where it is decidable, for any index k and any element m,
if m is a member of the language with index k; examples of such classes are the
classes of all regular languages and all pattern languages [Ang80]). However, as
it was established in [JK07], NCIt-learners sometimes cannot learn an indexed
class class-preservingly (cf. [LZZ08]) — that is, they cannot learn by using any
descriptive numbering defining just the target class as hypothesis space. It turns
out that this feature of NCIt-learners remains even if they can make n-feedback
membership queries (see Theorem 10). However, class-preserving learning be-
comes possible if an NCIt-learner gets access to either the maximal element or
to the number of elements seen so far (see Theorem 9).
In Section 4, we strengthen some results in [CJLZ99], establishing non-trivial
hierarchies of NCIt-learners using n-feedback queries or n-bounded memory
based on the number n (see Theorems 11 and 12). Our examples of classes
witnessing the hierarchies in question also show that additional information in
the form of the maximal element seen so far and the number of elements seen
so far might not match the help that an NCIt-learner gets in the form of one
extra feedback membership query, or one extra long-term memory cell.
In Section 5, we study tradeoffs between different types of additional informa-
tion used by NCIt-learners (the main purpose of this study is to make salient
advantages of each type of additional information for the learners in question).
In particular, similarly to corresponding results in [CJLZ99], we show that one
memory cell used by an NCIt-learner can give more help than any n feedback
membership queries (even in presence of the maximal element and the number
of elements seen so far), see Theorem 13, and, conversely, one feedback mem-
bership query can give more help than n-bounded memory (plus the maximal
element and the number of elements seen so far), see Theorem 14. Interestingly,
the maximal element seen so far alone can give more help than any number of
feedback membership queries, see Theorem 17. Also, the number of elements
and the maximal element seen so far combined together can provide more help
than any bounded number of memory cells or feedback membership queries, see
Theorem 19. We also show how an extra memory cell can simulate maximal
element for NCIt-learners using n memory cells, see Proposition 15. We also
obtain some partial results for other possible tradeoffs.

2 Preliminaries
2.1 Notation
For any unexplained recursion theoretic notation we refer the reader to [Rog67].
The symbol N denotes the set of natural numbers, {0, 1, 2, 3, . . .}. Languages are
subsets of N. Symbols ∅, ⊆, ⊂, ⊇, and ⊃ respectively denote the empty set,
subset, proper subset, superset, and proper superset. The cardinality of a set
S is denoted by card(S). The maximum and minimum of a set are denoted by
max(·), min(·), respectively, where max(∅) = 0 and min(∅) = ∞. The quantifier
∀∞ denotes 'for all but finitely many'.
We let Dx denote the finite set with canonical index x [Rog67]. We let ⟨·, ·⟩
stand for an arbitrary, computable, 1–1 mapping from N × N onto N which is
increasing in both its arguments [Rog67]. The pairing function can be extended
to n-tuples in a natural way (for example, by using ⟨x, y, z⟩ = ⟨x, ⟨y, z⟩⟩).
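One standard choice of such a pairing function is Cantor's; a short sketch (ours) is given below.

```python
# Illustrative sketch (ours): the Cantor pairing function — computable,
# 1–1, onto N, and increasing in both arguments — extended to triples
# via ⟨x, y, z⟩ = ⟨x, ⟨y, z⟩⟩.

def pair(x, y):
    return (x + y) * (x + y + 1) // 2 + y

def triple(x, y, z):
    return pair(x, pair(y, z))

print(pair(2, 3), triple(1, 2, 3))   # 18 208
```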
By Wi we denote the i-th r.e. language in some fixed acceptable program-
ming system. We also say that i is a grammar for Wi . E denotes the set of all
r.e. languages. L, with or without decorations, ranges over E, while a class L,
with or without decorations, ranges over subsets of E. χL denotes the characteristic
function of L, and L̄ = N − L, that is, the complement of L.
L is said to be an indexed family iff there exists an indexing L0 , L1 , . . . of all
and only the languages in L such that for some recursive function f , f (i, x) =
χLi (x).

2.2 Basic Definitions for Learning


A text T is a mapping from N into (N ∪ {#}). T (i) represents the (i + 1)-th
element in the text. We let T , with or without decorations, range over texts.
content(T ) denotes the set of natural numbers in the range of T . A text T is for
a language L iff content(T ) = L. Intuitively, T (i) denotes the element presented
to the learner at time i, and #’s represent pauses in the presentation of data. T [n]
denotes the initial sequence of T of length n, that is T [n] = T (0)T (1) . . . T (n−1).
SEQ = {T [n] : n ∈ N, T is a text}. The empty sequence is denoted by λ. σ, τ, α
range over SEQ. στ denotes the concatenation of σ and τ.
An informant [Gol67] I is a mapping from N to (N×{0, 1})∪{#} such that for
no x ∈ N, both (x, 0) and (x, 1) are in the range of I. content(I) = set of pairs in
the range of I. We say that I is an informant for L iff content(I) = {(x, χL (x)) :
x ∈ N}. Intuitively, informants give both all positive and all negative data for
the language being learned. I[n] denotes the first n elements of the informant I.
An inductive inference machine (IIM) [Gol67] learning from texts is an algo-
rithmic device which computes a (possibly partial) mapping from SEQ into N.
One can similarly define learners from informants and other modes of input as
considered below. We use the term learner or learning machine as synonyms for
inductive inference machines. We let M range over IIMs. M (T [n]) (or M (I[n]))
is interpreted as the grammar (index for an accepting program) conjectured by
the IIM M on the initial sequence T [n] (or I[n]). We say that M converges on

T to i, (written: M (T ) ↓ = i) iff ( ∀ n)[M (T [n]) = i]. Convergence on informants
is similarly defined.
There are several criteria for an IIM to be successful on a language. In this
paper we will be mainly concerned with explanatory (abbreviated Ex) criteria
of learning.

Definition 1. [Gol67, CL82]
(a) M TxtEx-identifies an r.e. language L (written: L ∈ TxtEx(M)) just
in case for all texts T for L, M(T[n]) is defined for all n and (∃i : Wi = L)(∀∞ n)[M(T[n]) = i].
(b) M TxtEx-identifies a class L of r.e. languages (written: L ⊆ TxtEx(M))
just in case M TxtEx-identifies each language from L.
(c) TxtEx = {L ⊆ E : (∃M)[L ⊆ TxtEx(M)]}.

One can similarly define learning criterion InfEx for learning from informants
instead of texts.
Next we consider iterative learning.

Definition 2. [Wie76, LZ96]
(a) M is iterative iff there exists a partial recursive function F such that, for
all T and n, M(T[n + 1]) = F(M(T[n]), T(n)). Here M(λ) is viewed as some
predefined hypothesis.
(b) M TxtIt-identifies L iff M is iterative and M TxtEx-identifies L.
(c) TxtIt = {L : (∃M)[M TxtIt-identifies L]}.

InfIt can be defined similarly.
Intuitively, an iterative learner [Wie76, LZ96] is a learner whose hypothesis
depends only on its last conjecture and current input. That is, for some recursive
function F, for n ≥ 0, M(T[n + 1]) = F(M(T[n]), T(n)). Here, note that M(T[0])
is predefined to be some constant value. We will often identify F above with M
(that is, use M(p, x) = F(p, x) to describe M(T[n + 1]), where p = M(T[n]) and
x = T(n)). This is for ease of notation. Context determines which interpretation
of the learner M is meant.
For Ex models of learning (for learning from texts or informants or their vari-
ants when learning from positive data and negative counterexamples, as defined
below), one may assume without loss of generality that the learners are total,
that is, defined on all initial segments of all texts (see, for example [OSW86]).
However for iterative learning one cannot assume so. Thus, we explicitly require
in the definition that iterative learners are defined on all inputs which are initial
segments of texts (informants) for a language in the class.
Note that, although it is not stated explicitly, an It-type learner might store
some input data in its conjecture (thus serving as a limited long-term memory).
However, the amount of stored data cannot grow indefinitely, as the learner must
stabilize to one (right) conjecture.
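As a toy illustration of Definition 2, here is a minimal iterative learner (our own example class and encoding, not from the paper) for the indexed family Ln = {0, 1, ..., n}, whose conjecture n is read as a grammar for Ln:

```python
# Illustrative sketch (ours): a TxtIt-style learner for L_n = {0, ..., n}.
# The next conjecture depends only on the previous one and the current
# datum, as in Definition 2: M(T[n+1]) = F(M(T[n]), T(n)).

def F(prev, datum):
    return prev if datum == '#' else max(prev, datum)

def run(text, initial=0):
    p = initial                  # M(λ): some predefined hypothesis
    for d in text:
        p = F(p, d)
    return p

print(run([2, '#', 0, 5, 1]))    # stabilizes at 5, a grammar for L_5
```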
Learning with feedback and learning with bounded memory is a generalization
of iterative learning where the learner has access to some past data using queries
or via some finite amount of memory. Thus, in feedback learning an iterative
learner is additionally allowed to query whether some elements were present in
the past input data. In bounded memory, an iterative learner is able to memorize
in its memory some (bounded) finite number of data (in addition to its latest
conjecture). Below are the formal definitions.

Definition 3. [CJLZ99] (a) Suppose M is a learning machine (for a class L
of languages). We say that M is an m-feedback learner iff there exist partial
recursive functions F and Q such that for all L ∈ L and all texts T for L,
(i) for all n: Q(M(T[n]), T(n))↓ ∈ N^m, and
(ii) if Q(M(T[n]), T(n)) = (x1, x2, ..., xm), then M(T[n + 1]) = F(M(T[n]), T(n),
y1, y2, ..., ym), where yi = 1 iff xi ∈ content(T[n]).
(b) We say that M TxtIt-identifies L with m-feedback iff M TxtEx-identifies
L and M is an m-feedback learner. Such learners M are also called TxtIt-learners
using m-feedback.

Definition 4. [LZ96] (a) Suppose M is a learning machine (for a class L of
languages). We say that M is an m-memory-bounded learner iff there exist a
(partial) recursive memory function mem (mapping finite sequences to finite
sets) and partial recursive functions F, F′ such that for all L ∈ L and all texts
T for L,
(i) for all n: mem(T[n])↓ ⊆ content(T[n]) and card(mem(T[n])) ≤ m,
(ii) for all n: mem(T[n + 1]) = F′(M(T[n]), T(n), mem(T[n]))↓, and
mem(T[n + 1]) − mem(T[n]) ⊆ {T(n)},
(iii) M(T[n + 1]) = F(M(T[n]), T(n), mem(T[n]))↓.
(b) We say that M TxtIt-identifies L with m-memory iff M TxtEx-identifies
L and M is an m-memory-bounded learner. Such learners M are also called
TxtIt-learners using m-memory or m-memory bounded TxtIt-learners.
In both the above definitions, M(T[0]) is some fixed initial hypothesis.
Again, we often identify the learner M with the function F (along with identifying
mem with F′) as defined above, and the context determines which interpretation
of the learner M is meant.
One can similarly define feedback and memory bounded learning for learning
from informants. Besides the above models of learning, we sometimes allow the
learner access to the maximal element in the input seen so far, or the number
of elements in the input seen so far as an additional input. In the sequel, we
will typically refer to the “maximal element seen so far” and the “number of
elements seen so far” as simply the “maximal element” and, respectively, the
“number of elements”.

2.3 Learning with Negative Counterexamples

In this section we consider our models of learning from full positive data and
negative counterexamples as given by [JK08]. Intuitively, for learning with neg-
ative counterexamples, we may consider the learner being provided a text, one
element at a time, along with a negative counterexample to the latest conjec-
ture, if any. (One may view this negative counterexample as a response of the
teacher to the subset query when it is tested if the language generated by the
conjecture is a subset of the target language). One may model the list of negative
counterexamples as a second text for negative counterexamples being provided
to the learner. Thus the IIMs get as input two texts: one for positive data, and
the other for negative counterexamples.
We say that M(T, T′) converges to a grammar i iff (∀∞ n)[M(T[n], T′[n]) = i].
First, we define the model of learning from positive data and negative coun-
terexamples. NC in the definition below stands for negative counterexample.

Definition 5. [JK08]
(a) M NCEx-identifies a language L (written: L ∈ NCEx(M)) iff for all
texts T for L, and for all T′ satisfying the condition

(T′(n) ∈ Sn, if Sn ≠ ∅) and (T′(n) = #, if Sn = ∅), where Sn = L̄ ∩ W_{M(T[n], T′[n])},

M(T, T′) converges to a grammar i such that Wi = L.
(b) M NCEx-identifies a class L of languages (written: L ⊆ NCEx(M)) iff
M NCEx-identifies each language in the class.
(c) NCEx = {L : (∃M)[L ⊆ NCEx(M)]}.

For ease of notation, we sometimes write M(T[n], T′[n]) simply as M(T[n]), where
we separately describe how the counterexamples T′(n) are presented to the
conjecture of M on input T[n].
One can similarly define NCIt-learning, where the learner’s output depends
only on the previous conjecture, the latest positive data, and the counterexample
provided.
Definition 6. [JK07] (a) M is iterative (for learning from positive data and
negative counterexamples) iff there exists a partial recursive function F such
that, for all T, T′ and n, M(T[n + 1], T′[n + 1]) = F(M(T[n], T′[n]), T(n), T′(n)).
Here M(λ, λ) is some predefined constant.
(b) M NCIt-identifies L iff M is iterative and M NCEx-identifies L.
(c) NCIt = {L : (∃M)[M NCIt-identifies L]}.

We will often identify F above with M (that is, use M(p, x, y) = F(p, x, y) to
describe M(T[n + 1], T′[n + 1]), where p = M(T[n], T′[n]), x = T(n) and y =
T′(n)). This is for ease of notation.
One should also note that the NCIt model is equivalent to allowing finitely
many subset queries (with counterexamples for the answer “no”) in iterative
learning.
One can extend the above definition to NCIt-learning with m-feedback or m-
memory, by allowing the learner M up to m queries about whether some element
x has appeared in the previous text or allowing the learner M to remember
up to m elements of the past data. The resulting criteria are called NCIt-
learning with m-feedback and NCIt-learning with m-memory, respectively. The
resulting learners are called m-feedback NCIt-learner (or NCIt-learners using
m-feedback) and m-memory bounded NCIt-learner (or NCIt-learner using m-
memory) respectively.
It follows from the definition that NCIt-learning is contained in NCIt-
learning using m-feedback and NCIt-learning using m-memory, which, in turn,
are contained in NCEx.

3 Some General Effects of Additional Information on NCIt-learning
In this section, we look at some known capabilities of NCIt-learners and explore
whether they hold when a learner has access to additional information.
It was shown in [JK07] that capabilities of NCIt-learners exceed capabilities
of InfIt-learners. In this section, we show that if an InfIt-learner can store just
one element seen so far, or can use just one feedback query, then it can sometimes
learn more than any NCEx-learner (which can memorize the whole input seen
so far!). However, total InfIt-learners having access to the maximal element still
can be simulated by NCIt-learners having access to the maximal element.
An important result established in [JK07] is that NCIt-learners can infer any
indexed class of recursive languages. However, it is also shown in [JK07] that,
surprisingly, such NCIt-learners cannot learn indexed classes class-preservingly
(cf. [LZZ08]), that is, using a numbering of languages containing exactly the tar-
get class (and no other languages). Still class-preserving learnability is impor-
tant, as any natural hypothesis space for an indexed class is class-preserving.
We now show that NCIt-learners can learn indexed classes class-preservingly if
they have access to the maximal element or the number of elements seen so far.
However, adding the capability of using n feedback queries might not be enough
to help an NCIt-learner to infer an indexed class class-preservingly.
3.1 Informants versus Negative Counterexamples


First we show how storing just one element seen so far or using one feedback
query can make an InfIt-learner stronger than any learner storing the whole
input seen so far, but getting only positive data and negative counterexamples.
Theorem 7. There exists a class which can be learnt by a 1-memory bounded
(or 1-feedback) InfIt-learner, but which cannot be learnt by any NCEx-learner.

Let A be a semi-recursive, nonrecursive r.e. set such that for every x ∈ N,
either both 2x and 2x + 1 are in A or both 2x and 2x + 1 are not in A. Let
L = {A ∪ {y} : y ∈ N}. We leave it to the reader to verify that the class L can
be learnt iteratively from an informant using 1-feedback or 1-bounded memory,
and that it cannot be NCEx-learnt.
Still, as the next theorem demonstrates, total InfIt-learners (that is, learners
that are defined on all inputs, even possibly non-valid ones — data which do
not represent a possible previous conjecture, a new input element, and the maximal
element occurring in a valid learning process for a language in the class being
learnt) can be simulated by NCIt-learners if both have access to the maximal
element. For learning from informants, the maximal element present in the input is
the maximal y such that (y, 0) or (y, 1) is present in the input given so far.
Theorem 8. Any class which is InfIt learnable using the maximal element by
a total learner is also NCIt-learnable using the maximal element.

3.2 Indexed Families


Unlike the case of NCIt-learnability (without access to additional information),
class-preserving learnability of indexed classes can be achieved if an NCIt-
learner has access to the maximal element or the number of elements seen.
Theorem 9. (a) Every indexed family can be NCIt-identified (using a class-preserving hypothesis space) given the maximal element seen so far.
(b) Every indexed family can be NCIt-identified (using a class-preserving hypothesis space) given the number of elements seen so far.
Proof. (a) Suppose L is an indexed family, and L0 , L1 , . . . is its listing where
x ∈ Li can be effectively determined in x and i. Let Li [m] denote {x ∈ Li :
x ≤ m}. The conjectures of the learner would be of the form: p(j, S, X), where
p(j, S, X) is a grammar for Lj , and S, X are finite sets with some properties.
Suppose T is an input text for a language L, where T (n) = xn . Inductively,
if p(jn , Sn , Xn ) is output after T [n] has been seen, then the following invariants
will hold.
(A) For each j ∈ Sn, Lj ⊆ L, and Xn ⊆ L.
(B) content(T[n]) ⊆ Xn ∪ ⋃_{j∈Sn} Lj.
(C) For all j < jn, Lj ≠ L.
(D) If jn ∉ Sn, then either n = 0 or jn = jn−1 + 1.
(E) Xn ⊆ Xn+1, Sn ⊆ Sn+1, jn ≤ jn+1.

Initially, M(λ) = (0, ∅, ∅). On input p(jn, Sn, Xn), the new element xn, the counterexample yn, and the maximal element m seen so far, the learner does the following:

(i) If yn = #, then Sn+1 = Sn ∪ {jn}; otherwise Sn+1 = Sn.
(ii) If (Xn ∪ {xn} ∪ ⋃_{j∈Sn} Lj[m]) − {#} ⊆ Ljn, and yn = #, then jn+1 = jn, Xn+1 = Xn. Otherwise, jn+1 = jn + 1 and Xn+1 = Xn ∪ {xn} − {#}.
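The following Python sketch makes this update step concrete; the predicate in_L deciding "x ∈ Lj" for the indexed family and the use of '#' for the pause/no-counterexample marker are assumed names chosen for this illustration, not notation from the paper.

    def step(j, S, X, x, y, m, in_L):
        """One update of the learner from part (a): conjecture index j,
        sets S and X, new element x, counterexample y ('#' = none),
        maximal element m seen so far."""
        # data the conjecture must cover: (X ∪ {x} ∪ ⋃_{k∈S} L_k[m]) − {#}
        data = set(X)
        if x != '#':
            data.add(x)
        for k in S:
            data |= {z for z in range(m + 1) if in_L(k, z)}   # L_k[m]
        S2 = S | {j} if y == '#' else S                       # clause (i)
        if y == '#' and all(in_L(j, z) for z in data):        # clause (ii)
            return j, S2, X                                   # keep conjecture
        X2 = X | ({x} if x != '#' else set())                 # store x, advance j
        return j + 1, S2, X2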

It is easy to verify that the invariants are satisfied. Furthermore, jn never goes beyond the minimal grammar for L (see invariant (C)). Thus, the sequence of the jn converges, and Sn and Xn converge as well (as Xn+1 ≠ Xn implies jn+1 ≠ jn, and Sn ⊆ {j : j ≤ jn}, using invariants (D) and (E)). Moreover, the last conjecture is correct by (A) and (B), and using (Xn ∪ {xn} ∪ ⋃_{j∈Sn} Lj[m]) − {#} ⊆ Ljn from clause (ii) (as there is no further mind change).
(b) The only change is in (ii) above, which is replaced by the following (m below denotes the number of elements seen so far by the learner):
(ii) If the first m elements in (Xn ∪ {xn} ∪ ⋃_{j∈Sn} Lj) − {#} are included in Ljn, and yn = #, then jn+1 = jn, Xn+1 = Xn. Otherwise, jn+1 = jn + 1 and Xn+1 = Xn ∪ {xn} − {#}.
The rest of the proof is similar to part (a), and we omit the details.

Still, any n feedback queries might not help to achieve class-preserving learnability of indexed classes by NCIt-learners.

Theorem 10. There exists an indexed family which cannot be learnt by an NCIt-learner with n-feedback using a class-preserving hypothesis space.

4 Hierarchy of n-Feedback and n-Memory Learners

In this section, we show that, in the context of NCIt-learnability, n + 1 stored input elements seen and n + 1 feedback queries provide more capability than n stored input elements seen and, respectively, n feedback queries. Note that, on the negative sides of both results, neither NCIt-learners storing just up to n elements seen, nor NCIt-learners using just up to n feedback queries can be helped even if they have access to the maximal element and the number of elements seen so far. On the other hand, learners witnessing the positive sides of both results do not need access to negative counterexamples (refuting conjectures containing data in excess of the target language).

Theorem 11. Fix n ∈ N. There exists a class L such that
(a) L can be iteratively learnt by an n + 1-feedback learner.
(b) L cannot be NCIt-learnt using n-feedback queries even if the maximal element and the number of elements in the input seen so far are given to the learner as additional information.

Theorem 12. Let n ∈ N. There exists a class L such that
(a) L can be iteratively learnt using (n + 1)-bounded-memory.
(b) L cannot be learnt by an NCIt-learner using n-memory, even if the learner is given the number of elements and the maximal element seen so far as additional information.

Proof. (sketch) Let L1 = {L : (∃e)[∅ ⊂ L ⊆ {⟨e, j, x⟩ : j, x ∈ N} and We = L and, for all x, [card({j : ⟨e, j, x⟩ ∈ L, j ≥ 1}) ≤ n + 1 or [card({j : ⟨e, j, x⟩ ∈ L, j ≥ 1}) = n + 2 and Σ_{⟨e,j,x⟩ ∈ L} j is a prime number]]]}.

Let L2 = {L : (∃e, x)[We ∈ L1, x > max({x′ : ⟨e, j, x′⟩ ∈ We, j ≥ 1}) and [card({j : ⟨e, j, x⟩ ∈ L, j ≥ 1}) = n + 2 and Σ_{⟨e,j,x⟩ ∈ L} j is not a prime number] and L = We ∪ {⟨e, j, x⟩ : j ≥ 1, ⟨e, j, x⟩ ∈ L}]}.
Let L = L1 ∪ L2.
Let L = L1 ∪ L2 .
It can be shown that L can be iteratively learnt using (n+1)-bounded memory. For the diagonalization against a learner M, we use Kleene's recursion theorem [Rog67] to construct a set We in stages s = 0, 1, 2, . . ., along with initial segments σs and a counterexample function fs. Let We^s denote the part of We enumerated before stage s, Es = range(fs), and xs denote the least number such that We^s ∪ Es ⊆ {⟨e, j, x⟩ : x < xs}. Initially, We^0 contains ⟨e, 0, 0⟩, σ0 is a sequence with content {⟨e, 0, 0⟩}, and f0(i) = #, for all i. In stage s, the algorithm updates the above parameters as follows.

In stage s, define ws, w′s to be large enough numbers such that (ws + |σs| + 2)^n is smaller than the binomial coefficient (w′s choose n + 1) and, for all distinct c, c′ ≤ (n + 1) · 2^ws, there exists a p with 2^ws < p ≤ w′s such that c + p is a prime, but c′ + p is not a prime. Let ms > xs be such that ⟨e, 0, ms⟩ > max(content(σs) ∪ {⟨e, j, xs⟩ : 1 ≤ j ≤ ws}). Let τs = σs · ⟨e, 0, ms⟩, and enumerate ⟨e, 0, ms⟩ into We. Dovetail between the following two searches:
(a) search for an initial segment σ of τs such that fs(M(σ)) = #, but WM(σ) − content(τs) ≠ ∅;
(b) search for a τ′ such that content(τ′) − content(τs) ⊆ {⟨e, j, xs⟩ : 1 ≤ j ≤ ws}, M(τ′) ≠ M(τs), and either (i) card(content(τ′) − content(τs)) ≤ n + 1 or (ii) card(content(τ′) − content(τs)) = n + 2 and Σ_{⟨e,j,x⟩ ∈ content(τ′)−content(τs)} j is a prime number.
Here we assume that the search in (a) has priority, in the sense that if one can find such a σ within s steps, then (a) succeeds first with the shortest such σ. In case (a) succeeds first, we let σs+1 = τs, We^{s+1} = content(τs), and fs+1(M(σ)) = the element found in WM(σ) − content(σs) (the rest of fs+1 is the same as fs). In case (b) succeeds first, we let σs+1 = τ′, We^{s+1} = content(σs+1) and let fs+1 = fs.
Now one can show that if there are infinitely many stages, then M does not converge on ⋃s σs. On the other hand, if there are only finitely many stages, then one can show that, for some appropriate distinct S, S′ ⊆ {2i : 1 ≤ i ≤ ws}, and corresponding p, αS, αS′ with content(αX) = {⟨e, j, xs⟩ : j ∈ X} (for X = S or S′), one has that M(τs · αS · ⟨e, p, xs⟩^∞) = M(τs · αS′ · ⟨e, p, xs⟩^∞), though τs · αS · ⟨e, p, xs⟩^∞ and τs · αS′ · ⟨e, p, xs⟩^∞ are texts for different languages in L. We omit the details.

5 Advantages of Different Types of Additional Information over Other Types
In this section we study tradeoffs between different types of additional informa-
tion in the context of NCIt-learnability.

5.1 Comparison of Feedback and Memory Bounded Learning


Results of this subsection significantly strengthen the corresponding results given in [CJLZ99]. Namely, they demonstrate that, in the context of NCIt-learnability, just one stored input element can provide more than any n feedback queries (even if, in addition, the learner has access to the maximal element and the number of elements seen so far), and, conversely, one feedback query can do more than any n stored input elements seen so far (and, additionally, the maximal element and the number of elements seen so far). Moreover, the iterative learners witnessing the positive sides of these results do not use negative counterexamples to conjectures containing extra elements.

Theorem 13. There exists an L which can be iteratively learnt by a 1-memory bounded learner, but which cannot be NCIt-learnt using n-feedback (even if the learner is given the maximal element and the number of elements in the input so far as additional information).

Theorem 14. There exists an L which can be iteratively learnt by a 1-feedback learner, but which cannot be NCIt-learnt by an n-memory bounded learner (even if the learner is given the maximal element and the number of elements in the input so far as additional information).

5.2 Advantages of Using Maximal Element/Number of Elements

Results of this subsection demonstrate various advantages that NCIt-learners can get while using the maximal element or/and the number of elements as additional information.
The following proposition works if memory, instead of being a set, is allowed
to be a multiset (when updating the memory, if a new input element is greater
than the current maximal one, the learner must replace the old maximal by the
new one, however, the learner may also decide to store a separate copy of the
new element — for reasons different from it being maximal, so that it would
not be sacrificed when a new greater element appeared). It is open at present
whether this proposition holds if memory is just a set, as in the current paper.

Proposition 15. Any n-bounded memory learner with the maximal element in
the input as additional information can be simulated by an n+1-bounded memory
learner by using the extra memory for the maximal element seen, as long as the
memory of the learner is considered as a multi-set, rather than just a set.
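As an illustration, a minimal Python sketch of this simulation follows; the interface M(j, mem, x, mx), for an n-bounded memory learner expecting the maximal element mx as additional information, is a hypothetical one chosen for this sketch, and pause symbols are ignored for brevity.

    def with_max_slot(M):
        """Wrap an n-bounded memory learner M that needs the maximal element
        into an (n+1)-bounded multiset-memory learner keeping the max in
        slot 0; the max may duplicate an element stored elsewhere."""
        def M2(j, mem, x):
            mx = x if not mem else max(mem[0], x)
            j2, inner = M(j, mem[1:], x, mx)    # feed the max as extra input
            return j2, [mx] + list(inner)       # multiset memory of size n+1
        return M2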

Our next result shows that adding access to the maximal element increases
learning capability of NCIt-learners storing up to n input elements seen so far.
Moreover, a learner witnessing the positive side of the result does not need access
to negative counterexamples refuting conjectures containing data in excess of the
language to be learned.

Theorem 16. There exists a class L which can be iteratively learnt by an n-bounded memory learner with the maximal element as additional input, but which cannot be NCIt-learnt by an n-bounded memory learner.

Our next two results demonstrate that an NCIt-learner having access to just the maximal element or the number of elements seen so far can sometimes do more than any NCIt-learner using up to n feedback queries. First, as the next theorem demonstrates, NCIt-learners (or even iterative learners, not using negative counterexamples to conjectures) using the maximal element as additional information can sometimes learn more than NCIt-learners using n feedback queries and getting the number of elements as additional information. However, we were not able to achieve a result of similar strength when weighing the number of elements seen so far against n feedback queries together with the maximal element as additional information. Whether this is possible remains open.

Theorem 17. There exists a class L which can be iteratively learnt when the
learner is provided the maximal element in the input so far, but the class L
cannot be NCIt-learnt using n-feedback, for any n, even if the learner is given
the number of elements in the input as additional information.

Theorem 18. There exists a class L which can be NCIt-learnt using the number of elements in the input as additional information but, for all n, cannot be NCIt-learnt using n-feedback.

Note that, obviously, the maximal element can always be memorized by a learner
and, thus, cannot add more to the learning power of iterative learners than
even one memory cell for storing input elements. Therefore, we explore if the
number of elements seen so far can give an NCIt-learner more advantages than
n memorized input elements seen so far. We were able to achieve only a partial
solution — showing that the number of elements and the maximal element (or
one memory cell) together can provide more power to NCIt-learners than n
memorized input elements.

Theorem 19. There exists a class L such that L can be NCIt-learnt using 1-memory (or the maximal element) and the number of elements, but cannot be NCIt-learnt by an n-feedback or n-memory bounded learner, even if the learner is given the maximal element.

Can the maximal element give more power to NCIt-learners than the number of
elements seen so far? The answer to this question is positive — even if the learners
using the maximal element are just iterative (not using negative counterexamples
to conjectures): it immediately follows from Theorem 17. However, we do not

know if the number of elements can give more in the context of NCIt-learnability
than the maximal element. We have some partial solution to the above problem,
when one considers iterative learners rather than NCIt-learners.

Theorem 20. (a) Suppose L can be NCIt-identified using the number of ele-
ments, where the learner converges on all inputs (here the text input would be
from the target class, but the number of elements may sometimes not be valid —
we still expect the learner to converge). Then, L can be NCIt-identified using
access to the maximal element.
(b) There exists an L such that
(i) L can be iteratively learnt when given the number of elements in the input
seen so far as additional information (such a learner, however, may not be total).
(ii) For all n, L cannot be iteratively learnt by an n-feedback learner even if
it gets the maximal element as additional information.
(iii) For all n, L cannot be iteratively learnt by an n-memory bounded learner.

6 Conclusions

As we have shown, additional information of the types studied in this paper can
add interesting new capabilities to iterative learners getting negative examples
to conjectures containing data in excess of the target language. Some problems
related to comparisons of help provided by additional information remain open
(they are mentioned in Section 5), and solving these problems can offer new
(and, possibly, unexpected) insight into advantages of using additional informa-
tion of certain types for the learners in question. Similarly to [JK07], one might
also consider different types of negative examples (refuting conjectures contain-
ing extra elements) by iterative learners and explore how these different types
of negative examples may interplay with different types of additional informa-
tion. Yet another interesting area of research is studying iterative learnability
with counterexamples and additional information for specific indexed classes of
languages (for example, regular languages or patterns): as we have shown, all
such classes are learnable class-preservingly using the maximal element or the number
of elements as additional information, and, therefore, one can now study if and
when learnability of such classes may be efficient.
A general open problem for iterative learners of any type using additional
(bounded) memory is whether a multiset type memory (when a learner can
store the same inputted item several times; for example, the learner may decide
to store, say, 10 copies of the next input element) can have an advantage over a
set type memory (where every item is stored just once). We have not been able
to find an answer to this very interesting problem.

Acknowledgments. The authors are grateful to anonymous referees of ALT’2009


for many useful remarks and suggestions. We specially thank a referee for a simpler
proof of Theorem 7.

References
[Ang80] Angluin, D.: Finding patterns common to a set of strings. Journal of Com-
puter and System Sciences 21, 46–62 (1980)
[Ang88] Angluin, D.: Queries and concept learning. Machine Learning 2, 319–342
(1988)
[BA96] Brachman, R., Anand, T.: The process of knowledge discovery in
databases: A human centered approach. In: Fayyad, U.M., Piatetsky-
Shapiro, G., Smyth, P., Uthurusam, R. (eds.) Advances in Knowledge
Discovery and Data Mining, pp. 37–58. AAAI Press, Menlo Park (1996)
[CJLZ99] Case, J., Jain, S., Lange, S., Zeugmann, T.: Incremental concept learning
for bounded data mining. Information and Computation 152(1), 74–110
(1999)
[CL82] Case, J., Lynes, C.: Machine inductive inference and language identifica-
tion. In: Nielsen, M., Schmidt, E.M. (eds.) ICALP 1982. LNCS, vol. 140,
pp. 107–115. Springer, Heidelberg (1982)
[CM08] Case, J., Moelius, S.: U-shaped, iterative, and iterative-with-counter learn-
ing. Machine Learning 72, 63–88 (2008)
[FPSS96] Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P.: From data mining to
knowledge discovery. In: Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P.,
Uthurusam, R. (eds.) Advances in Knowledge Discovery and Data Mining,
pp. 1–34. AAAI Press, Menlo Park (1996)
[Gol67] Gold, E.M.: Language identification in the limit. Information and Con-
trol 10, 447–474 (1967)
[JK07] Jain, S., Kinber, E.: Iterative learning from positive data and nega-
tive counterexamples. Information and Computation 205(12), 1777–1805
(2007)
[JK08] Jain, S., Kinber, E.: Learning languages from positive data and negative
counterexamples. Journal of Computer and System Sciences 74(4), 431–
456 (2008); Special Issue: Carl Smith memorial issue
[LZ92] Lange, S., Zeugmann, T.: Types of monotonic language learning and their
characterization. In: Proceedings of the Fifth Annual Workshop on Com-
putational Learning Theory, pp. 377–390. ACM Press, New York (1992)
[LZ96] Lange, S., Zeugmann, T.: Incremental learning from positive data. Journal
of Computer and System Sciences 53, 88–103 (1996)
[LZ06] Li, Y., Zhang, W.: Simplify support vector machines by iterative learning.
Neural Information Processing - Letters and Reviews 10, 11–17 (2006)
[LZZ08] Lange, S., Zeugmann, T., Zilles, S.: Learning indexed families of recursive
languages from positive data: A survey. Theoretical Computer Science 397,
194–232 (2008)
[OSW86] Osherson, D., Stob, M., Weinstein, S.: Systems that Learn: An Intro-
duction to Learning Theory for Cognitive and Computer Scientists. MIT
Press, Cambridge (1986)
[Pop68] Popper, K.: The Logic of Scientific Discovery, 2nd edn. Harper Torch
Books, New York (1968)
[Rog67] Rogers, H.: Theory of Recursive Functions and Effective Computability.
McGraw-Hill, New York (1967); Reprinted by MIT Press (1987)
[Wie76] Wiehagen, R.: Limes-Erkennung rekursiver Funktionen durch spezielle
Strategien. Journal of Information Processing and Cybernetics (EIK) 12,
93–99 (1976)
Incremental Learning with Ordinal Bounded
Example Memory

Lorenzo Carlucci

Department of Computer Science, University of Rome “La Sapienza”,


Via Salaria 113, 00198, Roma, Italy
carlucci@di.uniroma1.it

Abstract. A Bounded Example Memory learner is a learner that op-


erates incrementally and maintains a memory of finitely many data
items. The paradigm is well-studied and known to coincide with set-
driven learning. A hierarchy of stronger and stronger learning criteria
is obtained when one considers, for each k ∈ N, iterative learners that
can maintain a memory of at most k previously processed data items.
We report on recent investigations of extensions of the Bounded Ex-
ample Memory model where a constructive ordinal notation is used to
bound the number of times the learner can ask for proper global memory
extensions.

1 Introduction
In many learning contexts a learner is confronted with the task of inductively
forming hypotheses while being presented with an incoming stream of data. In
such contexts, the learning process can be said to be successful if, eventually, the
hypotheses that the learner forms provide a correct description of the observed
stream of data. Each single step of the learning process in this scenario involves
an observed data item and the formation of a new hypothesis.
It is very reasonable to assume that a real-world learner - be it artificial or
human - has memory limitations. A learner with memory limitations is a learner
that is unable to store complete information about the previous stages of
the learning process. Each stage of the learning process is completely described
by the flow of data seen so far, and the sequence of the learner’s hypotheses so
far. The action of a learner with memory limitations, at each step of the learning
process, is completely determined by a limited portion of the previous stages of
the learning process. Let us call intensional memory the learner’s memory of its
own previously issued hypotheses. Let us call extensional memory the learner’s
memory of previously observed data items.
In the context of Gold’s formal theory of language learning [6], models with
restrictions on intensional and on extensional memories have been studied. In [9]
the paradigm of Bounded Example Memory is introduced. A bounded example

Partially supported by grant number 1339 of the John Templeton Foundation, and
by a Telecom Italia “Progetto Italia” Fellowship.


memory learner is a learner whose intensional memory is, at each step of the
learning process, limited to remembering its own previous hypothesis, and whose
extensional memory is limited to storage of a finite number of previously observed
data items. At each step of the learning process, such a learner must decide, based
on (i) knowledge of its own previous hypothesis, on (ii) the content of its current
memory, and on (iii) the currently observed data item, whether to change its
hypothesis and whether to store in memory the currently observed data item. For
each number k one can similarly define a k-bounded example memory learner as
a bounded example memory learner whose memory can never exceed size k. For
k = 0 one obtains the paradigm of iterative learning [12], in which the learner
has no extensional memory and can only remember its own previous conjecture.
One of the main results of [9] is the following. For every k, there is a class
of languages that can be learned by a bounded example memory learner with
memory k + 1 but not by any bounded example memory learner with memory
k. [3] and the recent [7] present further results on this and related models.
In this paper we present some results on a new extension of the Bounded
Example Memory paradigm. Following a suggestion in [3], we investigate a
paradigm in which the learner is allowed to change its mind on how many data
items to store in memory as a function of some constructive ordinal α. Ordinals
are canonical representatives of well-orderings. A constructive ordinal can be
defined as the order-type of a computable well-ordering of the natural numbers.
Equivalently, constructive ordinals are those ordinals that have a program (a no-
tation) that specifies how to build them from below using standard operations
such as successor and constructive limit. Every constructive ordinal is countable
and notations for constructive ordinals are algorithmic finite objects. For each
initial segment of the constructive ordinals a univalent system of notations can
be defined. On the other hand, a universal (not univalent) system of notation
containing at least one notation for every constructive ordinal has been defined
by Kleene. For more details, see, e.g., [10]. For the sake of this paper, ordinals
can be treated in an informal way: we blur the distinction between a constructive
ordinal and a notation for it. The treatment can be made rigorous without effort
and without altering our results. Count-down from ordinal notations has been
applied in a number of ways in algorithmic learning theory, starting with [4],
where ordinal notations are used to bound the number of mind-changes that a
learning machine is allowed to make on its way to convergence. A different use
of ordinal notations is in the recent [2].
For every (notation for a) constructive ordinal α, the paradigm of α-bounded
example memory is defined. Intuitively, a learner with example memory bounded
by α must (algorithmically) count-down from (a notation for) α each time a
proper global memory extension occurs during the learning process (i.e., each
time the size of the memory set becomes strictly larger than the size of all the
previous memory sets). We show that this paradigm is strictly stronger than
k-bounded example memory but strictly weaker than finitely-bounded example
memory (with no form of a priori bound on memory size). We also show that the
concept of ordinal bounded example memory gives rise to a hierarchy and we

exhibit a hierarchy up through ordinal ω². We do not prove a general hierarchy result for all constructive ordinals. Yet we believe that such a result is within the reach of the methods of the present paper.

2 Preliminaries
Unexplained notation follows Rogers [10]. N denotes the set of natural numbers
{0, 1, 2, . . . }. N+ denotes the set of positive natural numbers. The set of finite
subsets of N is denoted by F in(N). We use the following set-theoretic notations:
∅ (empty set), ⊆ (subset), ⊂ (proper subset), ⊇ (superset), ⊃ (proper superset).
If X and Y are sets, then X ∪ Y , X ∩ Y , and X − Y denote the union, the
intersection, and the difference of X and Y, respectively. We use Z = X ∪̇ Y to abbreviate (Z = X ∪ Y ∧ X ∩ Y = ∅). The cardinality of a set X is denoted by card(X). By card(X) ≤ ∗ we indicate that the cardinality of X is finite. We let λx, y.⟨x, y⟩ stand for a standard pairing function. We extend the notation to pairing of n-tuples of numbers in the straightforward way. We denote with πin (i ≤ n) the projection function of an n-tuple to its i-th component. We omit the superscript when clear from context.
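For concreteness, one standard choice of pairing function is Cantor's; the Python sketch below is illustrative only, as the paper merely fixes some such bijection.

    def pair(x, y):
        """Cantor pairing: a computable bijection N x N -> N."""
        return (x + y) * (x + y + 1) // 2 + y

    def unpair(z):
        """Projections: returns (x, y) with pair(x, y) == z."""
        w = 0
        while (w + 1) * (w + 2) // 2 <= z:   # largest w with w(w+1)/2 <= z
            w += 1
        y = z - w * (w + 1) // 2
        return w - y, y

    assert all(unpair(pair(x, y)) == (x, y)
               for x in range(30) for y in range(30))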
We use α, β to range over constructive ordinals. We blur the distinction be-
tween ordinals and their notations. We use O to denote the set of constructive or-
dinals. This symbol traditionally denotes Kleene’s universal system of notations
for constructive ordinals. This system would be used in a completely rigorous
presentation of our results.
We fix an acceptable programming system ϕ0 , ϕ1 , . . . for the partial com-
putable functions of type N → N. We denote by Wi the domain of the i-th
partial computable function ϕi . We could equivalently (modulo isomorphism of
numberings) define Wi as the set generated by grammar i. A language is a subset
of N. We are only interested in recursively enumerable languages, whose collec-
tion we denote by E. The symbol L ranges over elements in E. L ranges over
subsets of E, called language classes. Let λx, y.pad(x, y) be an injective padding
function (i.e., Wpad(x,y) = Wx ).
A sequence is a mapping from an initial segment of N+ into N# , where #
is a reserved symbol which we call pause symbol. We use N# to abbreviate
N ∪ {#}. The symbols σ, τ range over sequences. content(σ) denotes the range
of σ minus the # symbol. |σ| denotes the length of σ. We use ⊆, ⊂ for sequence
containment and proper containment respectively. A text is a mapping from N+
into N# . The symbol t ranges over texts. If t = (xi )i∈N+ is a text, t[n] denotes
the initial segment of t of length n, i.e., the sequence (x1 x2 . . . xn ). We use · for
concatenation. If the range of t minus the # symbol is equal to L, then we say
that t is a text for L.
A language learning machine is a partial computable function mapping finite
sequences to natural numbers.
We now define the basic paradigm of explanatory identification from text [5].
Definition 1 (Gold, [5]). Let M be a language learning machine, let L be a
language, let L be a language class.

(1) M TxtEx-identifies L if and only if, for every text t = (xi)i∈N+ for L, there exists n ∈ N+ such that WM(t[n]) = L and, for all n′ ≥ n, M(t[n′]) = M(t[n]).
(2) M TxtEx-identifies L if and only if M TxtEx-identifies L for all L ∈ L.
(3) TxtEx(M) = {L : M TxtEx-identifies L}.
(4) TxtEx = {L : (∃M)[L ⊆ TxtEx(M)]}.

We define the paradigm of iterative learner. This is the basic paradigm of incre-
mental learning upon which the paradigms of bounded example memory learning
are built.

Definition 2 (Wiehagen, [12]). Let M : N×N# → N be a partial computable


function, let j0 ∈ N, let L be a language.

(1) (M, j0 ) TxtIt-identifies L if and only if, for each text t = (xi )i∈N+ for L,
the following hold.
(i) For each n ∈ N, Mn (t) is defined, where M0 (t) = j0 and Mn+1 (t) =
M (Mn (t), xn+1 ) = jn+1 .
(ii) (∃n ∈ N)[Wjn = L ∧ (∀n′ ≥ n)[jn′ = jn]].
(2) For M , j0 as above, TxtIt(M, j0 ) = {L : (M, j0 ) TxtIt-identifies L}.
(3) TxtIt(M, j0 ) = {L : L ⊆ TxtIt(M, j0 )}.
(4) TxtIt = {L : (∃M, j0 )[L ⊆ TxtIt(M, j0 )]}.

TxtIt is known to be strictly contained in TxtEx. We observe that a function


M as in the previous definition can be used to define a language learning ma-
chine M as follows. M(t[0]) = M0 (t) = j0 , and, for all n ∈ N, M(t[n + 1]) =
M (M(t[n]), xn+1 ). Note that M is uniquely determined by M and j0 . For every
L such that (M, j0 ) TxtIt-identifies L, we also say that M TxtIt-identifies L.
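The machine determined by M and j0 is thus a left fold of M over the text prefix; the following small Python sketch (with illustrative names) makes the remark concrete.

    def conjectures(M, j0, prefix):
        """Sequence of conjectures M_0(t), M_1(t), ... on a finite text prefix."""
        js = [j0]
        for x in prefix:
            js.append(M(js[-1], x))   # next conjecture from previous one and x
        return js

    # toy iterative learner: conjecture the maximum element seen so far
    print(conjectures(lambda j, x: j if x == '#' else max(j, x),
                      0, [3, '#', 5, 2]))   # [0, 3, 3, 5, 5]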

3 Ordinal Bounded Example Memory Learning

The paradigm of Bounded Example Memory was introduced in [9] and further
investigated in [3] and in the recent [7]. A bounded example memory learner is
an iterative learner that is allowed to store at most k data items chosen from
the input text.

Definition 3 (Lange and Zeugmann, [9]). Let k ∈ N+ ∪ {∗}, where ∗ is a


new symbol.

(1) Let M : (N×F in(N))×N# → N×F in(N) be a partial computable function,


let j0 ∈ N, let L be a language. (M, j0 ) Bemk -identifies L if and only if,
for each text t = (xi )i∈N+ for L, the following hold.
(i) For each n ∈ N, Mn (t) is defined, where M0 (t) = (j0 , ∅) and Mn+1 (t) =
M (Mn (t), xn+1 ) = (jn+1 , Sn+1 ).
(ii) S0 = ∅ and ∀n ∈ N, Sn+1 ⊆ Sn ∪ {xn+1 }.
(iii) ∀n ∈ N, card(Sn+1 ) ≤ k.
(iv) (∃n ∈ N)[Wjn = L ∧ (∀n′ ≥ n)[jn′ = jn]].

(2) We say that (M, j0) Bemk-identifies L if and only if (M, j0) Bemk-identifies L for every L ∈ L.
A machine of the appropriate type that satisfies points (i)-(iii) above is referred
to as a Bemk -learner. By [8], Bem∗ is known to coincide with set-driven learning
[11]. With a slight abuse of notation we sometimes use Bem0 to denote TxtIt.
We now introduce an extension of the Bounded Example Memory model.
Definition 4. Let α be a fixed constructive ordinal (notation). Let M : (N ×
F in(N) × O) × N# → N × F in(N) × O be a partial computable function. Let
j0 ∈ N, let L be a language.
(1) We say that (M, j0 ) OBemα -identifies L if and only if for every t = (xj )j∈N+
text for L, points (i) to (v) below hold.
(i) for all n ∈ N, Mn (t) is defined, where M0 (t) = (j0 , S0 , α0 ), S0 = ∅,
α0 ≤ α, Mn+1 (t) = M (Mn (t), xn+1 ) = (jn+1 , Sn+1 , αn+1 ).
(ii) Sn+1 ⊆ Sn ∪ {xn+1 }.
(iii) αn ≥ αn+1 .
(iv) αn > αn+1 if and only if card(Sn+1 ) > max({card(Si ) : i ≤ n}).
(v) (∃n)(∀n′ ≥ n)[jn′ = jn ∧ Wjn = L].
(2) We say that (M, j0) OBemα-identifies L if and only if (M, j0) OBemα-identifies L for every L ∈ L.
A machine of the appropriate type that satisfies points (i)-(iv) above is referred
to as a OBemα -learner. OBemα -learning is a species of incremental learning:
each new hypothesis depends only on the previous hypothesis, the current mem-
ory, and the current data item. The above Definition can be simplified in case
the following is true. Call cumulative a bounded example memory learner that
never erases an element from memory without replacing it with a new one. If
cumulative learning does not restrict learning power, then in point (iv) it is
sufficient to ask that card(Sn+1 ) > card(Sn ).
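To make the count-down mechanics concrete for the notations used later in this paper, the Python sketch below represents ordinals below ω² as pairs (a, b) standing for ω·a + b, ordered lexicographically; this encoding is an assumption chosen for illustration.

    def count_down(o, fresh=0):
        """Return a notation strictly below o = (a, b), i.e. below ω·a + b.
        At a successor, decrement b; below a limit ω·a, the learner may
        pick any finite part `fresh` of its choice."""
        a, b = o
        if b > 0:
            return (a, b - 1)            # ω·a + b  ->  ω·a + (b - 1)
        assert a > 0, "cannot count down from 0"
        return (a - 1, fresh)            # ω·a  ->  ω·(a - 1) + fresh

    # ω + 2 -> ω + 1 -> ω -> 17: well-orderedness guarantees that only
    # finitely many proper global memory extensions can occur in total.
    o = (1, 2)
    for f in (0, 0, 17):
        o = count_down(o, fresh=f)
    print(o)   # (0, 17)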
For I ∈ {Bemk , Bem∗ , OBemα }, M of the appropriate type and j0 ∈ N
– We write I(M, j0 ) for {L : (M, j0 ) I-identifies L}, and
– We write I for {L : (∃M, j0 )[L ⊆ I(M, j0 )]}.
We write M (t) to indicate the conjecture to which M converges while processing
text t. We always assume that such a conjecture exists when we use this notation.
We state some basic facts in the following Lemma.
Lemma 1. For all k ∈ N+ , for all constructive ordinals α, β, the following hold.
(1) OBemk = Bemk .
(2) If α < β, then OBemα ⊆ OBemβ .
(3) OBemα ⊆ Bem∗ .
Proof. The proof is omitted for brevity. Note that to go from a Bemk -learner
to an OBemk -learner, one just needs to keep track of the maximum cardinality
of a memory set, a quantity which eventually stabilizes and can thus be padded
in the conjecture as long as needed. To go from an OBemk -learner to a Bemk -
learner, one dually pads the ordinal counter in the next conjecture. This also is
a quantity that eventually stabilizes on all relevant texts. 


As a word of caution note that a rigorous version of point (2) would read: for
every notation a, b, respectively for α and β, such that a <O b, then OBema ⊆
OBemb . Similar relations as those expressed in the above Lemma also hold for
the model of temporary bounded example memory as defined in [7] when the
definition is extended to ordinals in the straightforward way.
We state a basic locking sequence lemma for OBemα. Let (M, j0) be an OBemα-learner. σ is a locking-sequence of the first type for (M, j0) on L if and only if (1) content(σ) ⊆ L, (2) for every extension σ′ ⊃ σ in L, π1(M|σ|(σ)) = π1(M|σ′|(σ′)), and (3) WM|σ|(σ) = L. σ is a locking-sequence of the second type for (M, j0) on L if and only if (a) σ is a locking sequence of the first type for (M, j0) on L, and (b) for every extension σ′ ⊃ σ in L, π3(M|σ|(σ)) = π3(M|σ′|(σ′)).
Lemma 2 (Locking Sequence Lemma). If L ∈ OBemα (M, j0 ), then there
exists a locking sequence of the second type for M on L.
Proof. Straightforward using the standard argument [1, 6], point (iii) in Defini-
tion 4, and the well-orderedness of the ordinals. 


4 Learning with ω-Bounded Example Memory


We prove that learners with ω-bounded example memory can learn strictly more
than learners with bounded finite memory. Still, OBemω is strictly weaker than
Bem∗ . The same is actually true for all OBemα .
We start by recalling the definitions of the classes used in [9] to show that
TxtIt ⊂ Bem1 ⊂ Bem2 ⊂ . . .
Where Lange and Zeugmann [9] use symbols a, b, we use 3, 2, and where they use
string concatenation we use exponentiation. For p ∈ N we use {p}+ to denote
the set {p, p^2, p^3, . . . }. Let p0, p1, . . . be an enumeration of the prime numbers in increasing order. We denote the set {pi, pi^2, pi^3, . . . } by {pi}+.
Definition 5 (Class Lk). Let k ∈ N+. Lk is the class consisting of the following languages.
– L1 = {p1}+,
– L(j,ℓ1,...,ℓk) = {p1^1, . . . , p1^j} ∪ {p0^j} ∪ {p1^ℓ1, . . . , p1^ℓk} for all j, ℓ1, . . . , ℓk ∈ N+.
Note that Lk ⊂ Lk+1 and that, in fact, Lk+1 = {L ∪ {p1^ℓ} : ℓ ∈ N+, L ∈ Lk}.
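The finite members of Lk are easy to generate; the following Python helper is an illustrative rendering, with p0 = 2 and p1 = 3 the first two primes.

    def L_finite(j, exponents):
        """L_(j, l1, ..., lk) = {p1^1,...,p1^j} ∪ {p0^j} ∪ {p1^l1,...,p1^lk}."""
        p0, p1 = 2, 3
        return ({p1 ** i for i in range(1, j + 1)}
                | {p0 ** j}
                | {p1 ** l for l in exponents})

    print(sorted(L_finite(2, [5])))   # [3, 4, 9, 243]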
One of the main results in [9] is the following Theorem.
Theorem 1 (Lange and Zeugmann, [9]). For all k ∈ N,
Lk+1 ∈ (Bemk+1 − Bemk ).
For a language L and a constructive ordinal α, we denote by L[α] (the α-tagged variant of L) the language obtained from L by replacing each element x by ⟨α, x⟩ (i.e., L[α] = {α} × L). For L a language class, we denote by L[α] the class {L[α] : L ∈ L}.
We now define a generalization of Lk .

Definition 6 (Class Ck). Let k ∈ N+. Ck is the class (Lk)[k].

Ck is the “k-tagged” variant of Lk. We denote by Ck,≥d, for d ∈ N, the subclass of Ck containing (L1)[k] and (L(j,ℓ1,...,ℓk))[k] with j ≥ d. For succinctness, we sometimes denote (L1)[k] by Ck and (L(j,ℓ1,...,ℓk))[k] by C(j,ℓ1,...,ℓk).
Since the proof of Theorem 1 in [9] is essentially asymptotic, the following
Proposition holds.
Proposition 1. For all d, k ∈ N+ ,

Ck+1,≥d ∈ (Bemk+1 − Bemk ).

For the sake of the present Section, we identify Ck with the class obtained from it by replacing each element ⟨k, p1^s⟩ by pk^s, and ⟨k, p0^t⟩ by p0^t, for ease of notation. Let Cω = ⋃_{k∈N+} Ck.

Theorem 2. Cω ∈ (OBemω − ⋃_{k∈N+} OBemk).
Proof. Let j, k ∈ N+. For a set X ⊆ {pk}+ such that card(X) ≤ k, we write C(j,X) for the set {pk^1, . . . , pk^j} ∪ {p0^j} ∪ X.
We first show Cω ∈ OBemω. Let X be a finite subset of Ck of cardinality ≤ k. Let s ∈ N+. We define the set update(X, k, pk^s) as the set containing the (at most) k elements of X ∪ {pk^s} with largest exponents. Formally, we define update(X, k, pk^s) = X ∪ {pk^s} if card(X) < k, and otherwise update(X, k, pk^s) = {pk^z2, . . . , pk^z(k+1)} if X = {pk^x1, . . . , pk^xk}, where x1 < · · · < xk, and {x1, . . . , xk} ∪ {s} = {z1, . . . , zk+1}, where z1 < · · · < zk+1. For technical convenience we define update(X, k, a) = X for all a ∉ {pk}+.
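An executable Python rendering of this update operation follows (a sketch; the extra argument p_k makes the prime explicit); since all stored elements are powers of the same prime, numeric order coincides with exponent order.

    def update(X, k, x, p_k):
        """update(X, k, p_k^s): keep the (at most) k largest powers of p_k
        among X ∪ {x}; any x outside {p_k}^+ leaves X unchanged."""
        e, y = 0, x
        while y > 1 and y % p_k == 0:     # test x ∈ {p_k}^+
            y //= p_k
            e += 1
        if y != 1 or e == 0:
            return set(X)                 # x ∉ {p_k}^+: memory unchanged
        return set(sorted(X | {x})[-k:])  # k largest = largest exponents

    print(sorted(update({3, 9}, 2, 27, 3)))   # k = 2: keeps [9, 27]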
We now define a learner M and a j0 ∈ N such that (M, j0) OBemω-identifies Cω. M's conjectures have the form jn = pad(cn, ⟨An, Bn⟩), where An, Bn ∈ N, and
– An records the exponent of the first p0^j seen (j ≠ 0),
– Bn records the subscript of the first pk^a seen (k, a ≠ 0).
For every text t = (xi)i∈N+, we define

M0(t) = (j0, S0, α0),

where A0 = B0 = 0, α0 = ω, c0 = an index for ∅, and

Mn+1(t) = (pad(cn+1, ⟨An+1, Bn+1⟩), Sn+1, αn+1),

where An+1, Bn+1, Sn+1, αn+1, cn+1 are defined as follows. Let k, j, a below indicate elements in N+. For every n ∈ N,

An+1 = j if xn+1 = p0^j, and An+1 = An otherwise;
Bn+1 = k if xn+1 = pk^a, and Bn+1 = Bn otherwise.
We also define αn⁻ to be αn if card(Sn) ≥ card(Sn+1), and αn −̇ 1 otherwise. We complete the description of M's behaviour by the following case distinction.

(Case 1) If (An = 0 ∧ xn+1 = pk^a), then Sn+1 = update(Sn, k, xn+1), αn+1 = k, cn+1 = an index for Ck.
(Case 2) If ((⟨An, Bn⟩ = ⟨j, 0⟩ ∨ ⟨An, Bn⟩ = ⟨j, k⟩) ∧ xn+1 = pk^a), then Sn+1 = update(Sn, k, xn+1), αn+1 = k, cn+1 = an index for C(j,Sn+1).
(Case 3) If (⟨An, Bn⟩ = ⟨0, k⟩ ∧ xn+1 = p0^j), then Sn+1 = update(Sn, k, xn+1), αn+1 = αn⁻, cn+1 = an index for C(j,Sn+1).
(Case 4) If (⟨An, Bn⟩ = ⟨j, k⟩ ∧ xn+1 = pk^a), then Sn+1 = update(Sn, k, xn+1), αn+1 = αn⁻, cn+1 = an index for C(j,Sn+1).
(Case 5) Else Sn+1 = Sn, αn+1 = αn, cn+1 = cn.
The above cases are exhaustive and M is an OBemω-learner for Cω. Let t be a text for L ∈ Cω. Suppose first that t is for Ck for some k > 0. Then the first case M enters is (Case 1). Afterwards, M always enters either (Case 1) or (Case 4) and thus stabilizes on a conjecture for Ck. Suppose now that t is for C(j,ℓ1,...,ℓk) for some k, j, ℓ1, . . . , ℓk ∈ N+. If the first non-trivial element of text t is p0^j, then M enters (Case 4) and pads An = j in its next conjecture. As soon as the first element of the form pk^s appears, M enters (Case 2), stores pk^s in memory and outputs a canonical index for C(j,pk^s). Afterwards, M will always enter either (Case 5) or (Case 4). If the first non-trivial element of text t is pk^s, then M first enters (Case 1). From then on, until eventually p0^j appears for the first time in t, M pads k in Bn, stores the k maximal elements seen and conjectures Ck. As soon as p0^j appears in t, M enters (Case 3) and afterwards always enters either (Case 5) or (Case 4). Thus M eventually stabilizes on an index for C(j,ℓ1,...,ℓk).
We now prove that Cω ∉ ⋃_{k∈N+} OBemk. Suppose that Cω ∈ OBemk for some k, as witnessed by (M, j0). Then Cω ∈ Bemk. But Ck+1 ⊆ Cω, and Ck+1 is not Bemk-identifiable. Contradiction.


With a very minor change the above proof shows that Cω is learnable by an
OBemω -learner with temporary memory as defined in [7].
We now observe that ordinal bounded example memory learning does not
exhaust Bem∗ , i.e., set-driven learning.

Theorem 3. For all α ∈ O, (Bem∗ − OBemα) ≠ ∅.

Proof. Consider the following class from [9]. For each j ∈ N, Lj = {2}+ − {2^(j+1)}, L− = {Lj : j ∈ N}. This class is obviously in Bem∗ but it is shown in [9] not to be in ⋃k Bemk.
To show that L− ∉ OBemα, we can then argue exactly as in the proof of Theorem 5, Claim 2 in [9]. Suppose otherwise, as witnessed by (M, j0). Let σ be a locking sequence of the second type for M on L0. Let M|σ|(σ) = (j|σ|+1, S|σ|+1, α|σ|+1). Let β = α|σ|+1 ≤ α be such that M's ordinal counter is equal to β for all extensions of σ in L0. Then, on all extensions of σ in L0, M does not make any proper global memory extension. Thus, M's memory on all such extensions is bounded by b = max({card(Si) : i ≤ |σ|}). We omit further details for brevity.


We prove a technical Lemma that will be used in Section 5 below. Let M :


(N × F in(N) × O) × N# → N × F in(N) × O be a partial computable function.

Let j0 ∈ N. We say that M is well-behaved on a text t = (xi )i∈N+ if and only if


(M, j0 ) satisfies conditions (i)-(iv) of Definition 4. In other words, (M, j0 ) is an
OBemα -learner of t, for some α.
Lemma 3 (Extraction Lemma). Let C be a class of languages, σ a finite
sequence, M a function of the appropriate type, β a constructive ordinal, such
that the following properties hold ∀L ∈ C, ∀t such that σ · t is for L:
– WM(σ·t) = L,
– M is well-behaved on σ · t,
– π3 (M|σ| (σ)) ≤ β.
Then there exists M̃ , and j0 ∈ N such that C ⊆ OBemβ+b (M̃ , j0 ), where b =
max({card(π2 (Mi (σ))) : i ≤ |σ|}).
Proof. We start with an observation. If we define a map M̃ as follows: M̃0 (t) =
M|σ| (σ), M̃n+1 (t) = M|σ|+n+1 (σ · t), then we don’t necessarily obtain a function
satisfying point (iv) in the definition of an OBemα -learner on t. This is because
M ’s memory after processing σ, i.e., S|σ|+1 , may be a non-empty subset of
content(σ).
The idea of the simulation for proving the Lemma is the following. If M makes
no memory extension while processing σ (i.e., b = 0), then (M̃ , j|σ|+1 ) is the
desired OBemβ -learner. Otherwise, while processing t with M̃ and simulating
M on σ · t, we can dynamically transfer the part of M ’s memory on the current
initial segment of σ · t that can qualify as a part of M̃ on the corresponding
initial segment of t to M̃ ’s memory, while padding the residual part in the next
conjecture. The residual part is eventually stable. How this is done step-by-step
is described below. The number of proper global memory extensions made by
M̃ is less than the number of proper global memory extensions made by M
beyond σ plus the number of the proper global memory extensions made by M
while processing σ, i.e., b. Therefore we can define an appropriate ordinal counter
for M̃ starting at α|σ|+1 + b. We now prove part (1) in detail. Let M|σ| (σ) be
(j|σ|+1 , S|σ|+1 , α|σ|+1 ). Let s = |σ| + 1. We distinguish two cases.
(Case 1) M makes no memory extension while processing σ. See the above
discussion.
(Case 2) Not (Case 1). By hypothesis, card(Ss ) ≤ b. Set
M̃0 (t) = (pad(js , Ss , αs , b0 , m0 ), ∅, α̃0 ),
where b0 = 0, m0 = b, α̃0 = αs + b, and
M̃n+1 (t) = M̃ ((j̃n , S̃n , α̃n ), xn+1 )
= (j̃n+1 , S̃n+1 , α̃n+1 )
= (pad(js+n , Rn+1 , αs+(n+1) , bn+1 , mn+1 ), S̃n+1 , α̃n+1 )

where bn+1 records the maximum cardinality of a memory set of M̃ , mn+1


records the maximum cardinality of a memory set of M beyond σ, and the other
quantities are defined as follows.

R0 = Ss, S̃0 = ∅, and Rn+1, S̃n+1 are defined according to the following case distinction.

(Case i) (xn+1 ∉ Ss+n) ∧ (xn+1 ∈ Ss+(n+1)).
(Case i.a) xn+1 enters Ss+(n+1) by substituting an element of Ss+n. Thus Ss+n = S where xn+1 ∉ S, and Ss+(n+1) = S′ ∪̇ {xn+1} for some S′ ⊂ S.
(Case i.b) xn+1 enters Ss+(n+1) as a new element without substituting any element, i.e., Ss+n = S, xn+1 ∉ S, Ss+(n+1) = S′ ∪̇ {xn+1}, S′ = S.
(Case ii) (xn+1 ∈ Ss+n) ∧ (xn+1 ∈ Ss+(n+1)). Thus Ss+n = S ∪̇ {xn+1} and Ss+(n+1) = S′ ∪̇ {xn+1} for some S′ ⊆ S.
(Case iii) (xn+1 ∈ Ss+n) ∧ (xn+1 ∉ Ss+(n+1)). Thus Ss+n = S, xn+1 ∈ S, and Ss+(n+1) = S′, for some S′ ⊂ S, xn+1 ∉ S′.
(Case iv) (xn+1 ∉ Ss+n) ∧ (xn+1 ∉ Ss+(n+1)). Thus Ss+n = S and Ss+(n+1) = S′, for some S′ ⊂ S, and xn+1 ∉ S′ ∪ S.

We set S̃n+1 = (S̃n ∩ S′) ∪ {xn+1} in (Case i) and (Case ii), and we set S̃n+1 = (S̃n ∩ S′) in (Case iii) and (Case iv). We set Rn+1 = (S′ − S̃n). One can always recover the memory content Ss+(n+1) as Rn+1 ∪ S̃n+1. The S̃n's satisfy the conditions on memory, while the Rn+1's eventually stabilize. The ordinal counter of M̃ is initialized at α̃0 = αs + b. Each time a proper global extension of the memory S̃n of M̃ occurs, the ordinal counter is updated as follows. If the extension corresponds to an extension of M's memory before σ (this can happen at most b times), then the second component is decreased by 1. If the extension corresponds to an extension of M's memory beyond σ, then the first component is decreased, emulating the corresponding ordinal counter of M (which is padded in the previous conjecture).
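The step-by-step bookkeeping just described can be summarized by the following Python sketch, where S_prime plays the role of S′ from the case distinction and the flag says whether x_{n+1} belongs to M's new memory (Cases i and ii versus iii and iv).

    def transfer(S_tilde, S_prime, x, x_in_new_memory):
        """Split M's new memory into M̃'s own memory and the residue R,
        which is padded into the conjecture and eventually stabilizes."""
        if x_in_new_memory:                       # (Case i) and (Case ii)
            S_tilde_next = (S_tilde & S_prime) | {x}
        else:                                     # (Case iii) and (Case iv)
            S_tilde_next = S_tilde & S_prime
        R_next = S_prime - S_tilde                # residual part
        # M's memory is recoverable as R_next ∪ S_tilde_next
        return S_tilde_next, R_next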

Lemma 3 above lends itself to a number of variations. E.g., one can conclude from the same hypotheses that there exists M̂, and j0 ∈ N witnessing that Cσ = {L − content(σ) : L ∈ C} is in OBemβ. This can be seen as follows: for every t for (L − content(σ)), σ · t is a text for L, and for all L ∈ Cσ, L ∩ content(σ) = ∅. Thus, no element of σ is ever transferred to bounded example memory in the process defining M̃ in the above proof. Therefore no such element contributes a proper memory extension. M̂ can be defined similarly to M̃ with the following extras: M's conjecture is always padded in the hypothesis, and f(i) is output instead of i, where f is an injective computable function such that, for all x, Wf(x) = (Wx − content(σ)) (f exists by the S-m-n Theorem [10]).
Also, if β in Lemma 3 is < ω, then there exists an s ∈ N such that max({card(π2(Mi(σ · t))) : i ∈ N}) ≤ s, and the conclusion of the Lemma is that there exists an OBems-learner for C.

5 Hierarchy Results above ω

We first exhibit a family of language classes witnessing that
OBemω ⊂ OBemω+1 ⊂ · · · ⊂ OBemω+k ⊂ · · · ⊂ OBemω+ω.
Then we indicate how to extend it up through ω².

Definition 7 (Class Cω+k). For k ∈ N+, Cω+k is the class of all languages L such that L = La ∪ Lb where
– La is empty or in (Cω)[ω], and
– Lb is empty or in Ck.

Thus, Cω+k consists of the following languages, for every choice of i, j, h, ℓ1, . . . , ℓi, m1, . . . , mk in N+.
– Ci[ω] = {⟨ω, i, p1⟩, ⟨ω, i, p1^2⟩, ⟨ω, i, p1^3⟩, . . . },
– C(j,m1,...,mi)[ω] = {⟨ω, i, p1^1⟩, . . . , ⟨ω, i, p1^j⟩} ∪ {⟨ω, i, p0^j⟩} ∪ {⟨ω, i, p1^m1⟩, . . . , ⟨ω, i, p1^mi⟩},
– Ck = {⟨k, p1⟩, ⟨k, p1^2⟩, ⟨k, p1^3⟩, . . . },
– C(j,m1,...,mk) = {⟨k, p1⟩, . . . , ⟨k, p1^j⟩} ∪ {⟨k, p0^j⟩} ∪ {⟨k, p1^m1⟩, . . . , ⟨k, p1^mk⟩},
– Ci[ω] ∪ Ck, C(h,ℓ1,...,ℓi)[ω] ∪ C(j,m1,...,mk),
– Ci[ω] ∪ C(j,m1,...,mk), C(h,ℓ1,...,ℓi)[ω] ∪ Ck.

Cω+(k+1) as just defined contains more languages than strictly needed to show (OBemω+(k+1) − OBemω+k) ≠ ∅, yet we have chosen to present this definition for uniformity with extensions to higher ordinals. Let us consider the following subclass of Cω+1. Let d ∈ N+.

{L ∈ Cω+1 : {⟨1, p1^1⟩, . . . , ⟨1, p1^d⟩} ⊆ L}.

This class contains the following language class, for each s ∈ N+.

{C1[ω], C(d,d)[ω] ∪ Cs+1, C(d,d)[ω] ∪ C(h,ℓ1,...,ℓs+1) : ∀h, ℓ1, . . . , ℓs+1 ∈ N+}.

For each s ∈ N+, let us denote the latter class by Cs[ω] ⊕ {⟨1, p1^1⟩, . . . , ⟨1, p1^d⟩}. In fact, this class is the same as the following class.

{C1[ω]} ∪ {L ∪ C(d,d)[ω] : L ∈ Cs+1}.

The proof of Theorem 1 from [9] can be easily adapted to show the following.

Proposition 2. For each d ∈ N+, for each s ∈ N+,

Cs+1[ω] ⊕ {⟨1, p1^1⟩, . . . , ⟨1, p1^d⟩} ∈ (Bems+1 − Bems).

We use this fact essentially in the proof of the following Theorem.

Theorem 4. Cω+1 ∈ (OBemω+1 − OBemω).

Proof. To see why Cω+1 ∈ OBemω+1 it is sufficient to notice that a learner M can pad into its previous hypothesis the following information and act accordingly using a straightforward case distinction.
– Whether an element of the form ⟨1, p1^a⟩ has appeared,
– Whether an element of the form ⟨ω, s, p1^a⟩ has appeared,
– Whether an element of the form ⟨1, p0^e⟩ has appeared,
– Whether an element of the form ⟨ω, 1, p0^e⟩ has appeared.
We now prove Cω+1 ∉ OBemω. Suppose otherwise, as witnessed by (M, j0). Without loss of generality suppose α0 = ω. Let σ be a locking sequence (of the first type) for M on L1[1]. Let σ′ be the repletion of σ. Then σ′ is also a locking sequence for M on L1[1] and content(σ′) = {⟨1, p1^1⟩, . . . , ⟨1, p1^d⟩} for d = max({i : ⟨1, p1^i⟩ ∈ content(σ)}). We distinguish two cases.


(Case 1) M's memory undergoes at least one extension while processing σ′. Let b be the value of M's counter after processing σ′. Then the following holds, because (M, j0) OBemω-identifies Cω+1 by hypothesis.

(∀L ∈ Cω+1 : content(σ′) ⊆ L)(∀τ ⊂ L, τ ⊃ σ′)[card(π2(M|τ|(τ))) ≤ b].

By choice of σ′,

S := {L ∈ Cω+1 : content(σ′) ⊆ L} ⊇ Cs[ω] ⊕ {⟨1, p1^1⟩, . . . , ⟨1, p1^d⟩},

for every s ∈ N+, in particular for s = b + 1. Now the conditions of Lemma 3 apply by taking σ, C, β in the statement of that Lemma to be, respectively, σ′, Cb+1[ω] ⊕ {⟨1, p1^1⟩, . . . , ⟨1, p1^d⟩} and b. In fact, ∀L ∈ Cb+1[ω] ⊕ {⟨1, p1^1⟩, . . . , ⟨1, p1^d⟩}, ∀t such that σ′ · t is for L,
– WM(σ′·t) = L,
– M is well-behaved on σ′ · t,
– π3(M|σ′|(σ′)) ≤ b.
The first and second items are true because M by hypothesis OBemω-identifies Cω+1; the third item is true by the hypotheses of the present case. Let b′ be the maximum cardinality attained by a memory set of M while processing σ′. By Lemma 3, one can define M̃ and j0′ ∈ N such that (M̃, j0′) OBem(b′+b)-identifies S. But C(b′+b+1)[ω] ⊕ {⟨1, p1^1⟩, . . . , ⟨1, p1^d⟩} ⊆ S, and C(b′+b+1)[ω] ⊕ {⟨1, p1^1⟩, . . . , ⟨1, p1^d⟩} ∉ OBem(b′+b), by Proposition 2. Contradiction.
(Case 2) M's memory undergoes no extension while processing σ′. We distinguish two subcases.
(Case 2.1) For every extension τ of σ′ in C1, M's memory is not extended while processing τ. Then σ′ · ⟨1, p1^(d+1)⟩ and σ′ · ⟨1, p1^(d+2)⟩ are equal for M. Then M cannot distinguish between the texts σ′ · ⟨1, p1^(d+1)⟩ · ⟨1, p0^d⟩ · #^∞ and σ′ · ⟨1, p1^(d+2)⟩ · ⟨1, p0^d⟩ · #^∞, respectively for L(d,d+1)[1] and L(d,d+2)[1], both in Cω+1.
(Case 2.2) There exists an extension τ of σ′ in C1 such that M's memory undergoes an extension while processing τ. Then τ is a locking sequence for M on C1 to which (Case 1) applies.


Theorem 5. For all k ∈ N, Cω+(k+1) ∈ (OBemω+(k+1) − OBemω+k).

Proof. The base case k = 0 is Theorem 4. For the case k ≥ 1 one can argue as follows. Suppose by way of contradiction that (M, j0) witnesses Cω+(k+1) ∈ OBemω+k. Let σ be a locking sequence (of the first type) for M on Ck+1. Consider the following cases.
(Case 1) For every extension σ′ ⊇ σ in Ck+1, M makes no memory extension while processing σ′. Then M can be fooled as an iterative learner as in Case 2.1 of Theorem 4 above. The relevant languages here are L(k+1,1,...,k,k+1)[k+1] and L(k+1,1,...,k,k+2)[k+1].
(Case 2) Not (Case 1), and for some extension σ′ of σ in Ck+1, M makes more than k memory extensions while processing σ′. Thus, M commits to finite memory b for some b ∈ N. Then one can argue as in Case 1 of Theorem 4 above, considering the class of those languages in Cω+(k+1) that contain content(σ′).
(Case 3) Not (Case 1) and not (Case 2). Then there exists an extension σ′ of σ in Ck+1, such that M makes at least one memory extension while processing σ′ and, for all extensions σ′′ of σ′ in Ck+1, M makes at most k memory extensions while processing σ′′. Then one can argue as in the proof of Theorem 1 (Claim 3 of Theorem 5 in [9]). The point is that the number of possible sets extending content(σ′) by adding k + 1 elements of the form ⟨k + 1, p1^t⟩ with d < t ≤ d + 3n (where d = max({i : ⟨k + 1, p1^i⟩ ∈ content(σ′)})) grows as the binomial coefficient (3n choose k + 1) in n, while the number of possible memory contents of M beyond σ′ on such sets is Σ_{i=0}^{k} (3n choose i), which is asymptotically smaller. This allows one to select appropriate sets in Cω+(k+1) which M fails to distinguish on two texts extending σ′.
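A quick numeric check of this counting argument (illustrative only) confirms that (3n choose k + 1) eventually dwarfs the number of possible memory contents:

    from math import comb

    k = 2
    for n in (1, 5, 20):
        ext = comb(3 * n, k + 1)                          # candidate extensions
        mem = sum(comb(3 * n, i) for i in range(k + 1))   # memory contents
        print(n, ext, mem, ext > mem)   # True from n = 5 on (for k = 2)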


Let Cω+ω be ⋃_{k∈N+} Cω+k.

Theorem 6. Cω+ω ∈ (OBemω+ω − ⋃_{k∈N+} OBemω+k).

Proof. Cω+ω ∈ OBemω+ω is easy. At step n, M can pad into its conjecture a quadruple ⟨An, Bn, Jn, Hn⟩ that keeps track of the following information, and act accordingly.
– An records the minimal x > 0 such that an element ⟨x, p1^a⟩ has occurred.
– Bn records the minimal x > 0 such that an element ⟨ω, x, p1^a⟩ has occurred.
– Jn records the minimal z > 0 such that an element ⟨i, p0^z⟩ has occurred.
– Hn records the minimal z > 0 such that an element ⟨ω, i, p0^z⟩ has occurred.
It is easy to see that Cω+ω ∉ ⋃_{k∈N+} OBemω+k. Suppose otherwise. Then for some k ∈ N, there exists (M, j0) witnessing Cω+ω ∈ OBemω+k. But Cω+(k+1) ⊆ Cω+ω. A contradiction to Theorem 5.


We now indicate how to extend the hierarchy up through ordinal ω². A similar pattern can possibly be adapted to yield a general non-collapsing result.

Definition 8 (Class Cω·m+k). For m, k ∈ N+, the class Cω·m+k is the class containing all languages L such that L = La1 ∪ · · · ∪ Lam+1, where, for 1 ≤ i ≤ m, Lai ∈ ⋃_{n∈N+} (Ln)[ω·i] or is empty, and Lam+1 ∈ Ck. The class Cω·(m+1) is defined as ⋃_{k∈N+} Cω·m+k.

Theorem 7. For all m, k ∈ N+,
(1) Cω·m+(k+1) ∈ (OBemω·m+(k+1) − OBemω·m+k), and
(2) Cω·m ∈ (OBemω·m − ⋃_{β<ω·m} OBemβ).
Proof. The base cases are given by Theorem 5 and Theorem 6, respectively. Learnability is straightforward: an element ⟨ω·i, s, p1^t⟩ informs the learner that s elements of type ω·i are to be memorized, using one of the m available additive terms of the form ω in the ordinal counter. We sketch the general unlearnability proof for point (1). Given, by way of contradiction, a candidate OBemω·m+k-learner (M, j0) for Cω·m+(k+1), we consider a locking sequence σ for M on Ck+1 and distinguish three cases.
(Case 1) For every extension σ′ ⊇ σ in Ck+1, M makes no memory extension while processing σ′. Then M can be fooled as an iterative learner as in Case 2.1 of Theorem 4 above.
(Case 2) Not (Case 1), and for some extension σ′ of σ in Ck+1, M makes more than k memory extensions while processing σ′. For some m′ ≤ m and s ∈ N, M commits to using memory bounded by ω · m′ + s beyond σ′. Then one can argue analogously to Case 1 of Theorem 4 above, considering the class of languages in Cω·m+(k+1) that contain content(σ′). This class can be shown to include a class that is hard for OBemω·m′-learners.
(Case 3) Not (Case 1) and not (Case 2). Then there exists an extension σ′ of σ in Ck+1, such that M makes at least one memory extension while processing σ′ and, for all extensions σ′′ of σ′ in Ck+1, M makes at most k memory extensions while processing σ′′. Then one can use the counting argument from the proof of Theorem 1 (Claim 3 of Theorem 5 in [9]), as done in Theorem 5 above.


6 Conclusion

We have introduced a proper extension of the Bounded Example Memory model featuring algorithmic count-down from constructive ordinals to bound the number of proper, global memory extensions an incremental learner is allowed on its way to convergence. We have shown that the concept gives rise to criteria that lie strictly between the finite Bounded Example Memory hierarchy ⋃k Bemk and set-driven learning Bem∗. We have exhibited a hierarchy of learning criteria up through ordinal ω². We are confident that the general problem - given constructive ordinals α > β, is it the case that (OBemα − OBemβ) ≠ ∅? - can be attacked using similar methods. We also plan to investigate ordinal versions of feedback learning from [3]. An interesting side-question is: are learners with cumulative memory as powerful as learners that have the freedom to erase memory content?

Acknowledgments. The author thanks the ALT 2009 anonymous referees for useful comments. Special thanks go to one of the referees, who also suggested how to extend the results of the present paper. Doing justice to his suggestion would have required a substantial reworking of the presentation and will be taken up in future work.

References
[1] Blum, L., Blum, M.: Towards a mathematical theory of inductive inference. In-
formation and Control 28, 125–155 (1975)
[2] Carlucci, L., Case, J., Jain, S.: Learning correction grammars. In: Bshouty, N.,
Gentile, C. (eds.) Proceedings of the 20th Annual Conference on Learning Theory,
San Diego, USA, pp. 203–217 (2007)
[3] Case, J., Jain, S., Lange, S., Zeugmann, T.: Incremental concept learning for
bounded data mining. Information and Computation 152(1), 74–110 (1999)
[4] Freivalds, R., Smith, C.H.: On the role of procrastination for Machine Learning.
Information and Computation 107, 237–271 (1993)
[5] Gold, E.M.: Language identification in the limit. Information and Control 10,
447–474 (1967)
[6] Jain, S., Osherson, D., Royer, J., Sharma, A.: Systems that learn: an introduction
to learning theory, 2nd edn. MIT Press, Cambridge (1999)
[7] Lange, S., Moelius, S.E., Zilles, S.: Learning with Temporary Memory. In: Freund,
Y., Györfi, L., Turán, G., Zeugmann, T. (eds.) ALT 2008. LNCS (LNAI), vol. 5254,
pp. 449–463. Springer, Heidelberg (2008)
[8] Kinber, E., Stephan, F.: Language learning from texts: mind-changes, limited
memory, and monotonicity. Information and Computation 123(2), 224–241 (1995)
[9] Lange, S., Zeugmann, T.: Incremental learning from positive data. Journal of
Computer and System Sciences 53(1), 88–103 (1996)
[10] Rogers, H.: Theory of recursive functions and effective computability. McGraw-
Hill, New York (1967); Reprinted by MIT Press (1987)
[11] Wexler, K., Culicover, P.W.: Formal principles of language acquisition. MIT Press,
Cambridge (1980)
[12] Wiehagen, R.: Limes-Erkennung rekursiver Funktionen durch spezielle Strategien. Elektronische Informationsverarbeitung und Kybernetik 12(1/2), 93–99 (1976)
Learning from Streams

Sanjay Jain1,⋆, Frank Stephan2,⋆⋆, and Nan Ye3

1 Department of Computer Science, National University of Singapore, Singapore 117417, Republic of Singapore
sanjay@comp.nus.edu.sg
2 Department of Computer Science and Department of Mathematics, National University of Singapore, Singapore 117417, Republic of Singapore
fstephan@comp.nus.edu.sg
3 Department of Computer Science, National University of Singapore, Singapore 117417, Republic of Singapore
yenan@comp.nus.edu.sg

Abstract. Learning from streams is a process in which a group of learners separately obtain information about the target to be learned, but they can communicate with each other in order to learn the target. We are interested in machine models for learning from streams and study their learning power (as measured by the collection of learnable classes). We study how the power of learning from streams depends on the two parameters m and n, where n is the number of learners which track a single stream of input each and m is the number of learners (among the n learners) which have to find, in the limit, the right description of the target. We study for which combinations m, n and m′, n′ the following inclusion holds: Every class learnable from streams with parameters m, n is also learnable from streams with parameters m′, n′. For the learning of uniformly recursive classes, we get a full characterization which depends only on the ratio m/n; but for general classes the picture is more complicated. Most of the noninclusions in team learning carry over to noninclusions with the same parameters in the case of learning from streams; but only a few inclusions are preserved and some additional noninclusions hold. Besides this, we also relate learning from streams to various other closely related and well-studied forms of learning: iterative learning from text, learning from incomplete text and learning from noisy text.

1 Introduction
The present paper investigates the scenario where a team of learners observes
data from various sources, called streams, so that only the combination of all
these data give the complete picture of the target to be learnt; in addition, the communication abilities between the team members are limited. Examples of such
a scenario are the following: some scientists perform experiments to study a phe-
nomenon, but no one has the budget to do all the necessary experiments and

⋆ Supported in part by NUS grant number R252-000-308-112.

⋆⋆ Supported in part by NUS grant number R146-000-114-112.


therefore they share the results; various earth-bound telescopes observe an ob-
ject in the sky, where each telescope can see the object only during some hours
a day; several space ships jointly investigate a distant planet.
This concrete setting is put into the abstract framework of inductive inference
as introduced by Gold [2,5,9]: the target to be learnt is modeled as a recursively
enumerable set of natural numbers (which is called a “language”); the team of
learners has to find in the limit an index for this set in a given hypothesis space.
This hypothesis space might be either an indexed family or, in the most general
form, just a fixed acceptable numbering of all r.e. sets. Each team member gets
as input a stream whose range is a subset of the set to be learnt; but all team
members together see all the elements of the set to be learnt. Communication
between the team members is modeled by allowing each team member to finitely
often make its data available to all the other learners.
The notion described above is denoted as [m, n]StreamEx-learning where n
is the number of team members and m is the minimum number of learners out
of these n which must converge to the correct hypothesis in the limit. Note that
this notion of learning from streams is a variant of team learning, denoted as
[m, n]TeamEx, which has been extensively studied [1,11,15,16,18,19]; the main
difference between the two notions is that in team learning, all members see
the same data, while in learning from streams, each team member sees only a
part of the data and can exchange with the other team members only finitely
much information. In the following, Ex denotes the standard notion of learning
in the limit from text; this notion coincides with [1, 1]StreamEx. In related
work, Baliga, Jain and Sharma [4] investigated a model of learning from various
sources of inaccurate data where most of the data sources are nearly accurate.
We start with giving the formal definitions in Section 2. In Section 3 we
first establish a characterization result for learning indexed families. Our main
theorem in this section, Theorem 7, shows a tell-tale like characterization for
learning from streams for indexed families. An indexed family L = {L0, L1, . . .} is [m, n]StreamEx-learnable iff it is [1, ⌊n/m⌋]StreamEx-learnable iff there exists a uniformly r.e. sequence E0, E1, . . . of finite sets such that Ei ⊆ Li and there are at most ⌊n/m⌋ many languages L′ in L with Ei ⊆ L′ ⊆ Li. Thus, for indexed families, the power of learning from streams depends only on the success ratio. Additionally, we show that for indexed families, the hierarchy for stream learning is similar to the hierarchy for team function learning (see Corollary 9); note that there is an indexed family in [m, n]TeamEx − [m, n]StreamEx iff m/n ≤ 1/2.
We further show (Theorem 11) that a class L can be noneffectively learned from
streams iff each language in L has a finite tell-tale set [2] with respect to the
class L, though these tell-tale sets may not be uniformly recursively enumerable
from their indices. Hence the separation among different stream learning criteria
is due to computational reasons rather than information theoretic reasons.
In Section 4 we consider the relationship between stream learning criteria
with different parameters, for general classes of r.e. languages. Unlike the in-
dexed family case, we show that more streaming is harmful (Theorem 13): There
are classes of languages which can be learned by all n learners when the data is
divided into n streams, but which cannot be learned even by one of the learners when the data is divided into n′ > n streams. Hence, for learning r.e. classes, [1, n]StreamEx and [1, n′]StreamEx are incomparable for different n, n′ ≥ 1. This stands in contrast to the learning of indexed families where we have that [1, n]StreamEx is properly contained in [1, n + 1]StreamEx for each n ≥ 1. Theorem 14 shows that requiring fewer machines to be successful gives more power to stream learning even if the success ratio is sometimes high: For each m there exists a class which is [m, n]StreamEx-learnable for all n ≥ m but not [m + 1, n′]StreamEx-learnable for any n′ ≥ 2m.
In Section 5 we first show that stream learning is a proper restriction of team learning in the sense that [m, n]StreamEx ⊂ [m, n]TeamEx, as long as 1 ≤ m ≤ n and n > 1. We also show how to carry over several separation results from team learning to learning from streams, as well as give one simulation result which carries over. In particular we show in Theorem 17 that if m/n > 2/3 then [m, n]StreamEx = [n, n]StreamEx. Also, in Theorem 19 we show that if m/n ≤ 2/3 then [m, n]StreamEx ⊈ Ex. One can similarly carry over several more separation results from team learning.
One could consider streaming of data as some form of “missing data” as each
individual learner does not get to see all the data which is available, even though
potentially any particular data can be made available to all the learners via syn-
chronization. Iterative learning studies a similar phenomenon from a different
perspective: though the (single) learner gets all the data, it cannot remember
all of its past data; its new conjecture depends only on its previous conjecture and the new data. We show in Theorem 20 that in the context of iterative
learning, learning from streams is not restrictive (and is advantageous in some
cases, as Corollary 8 can be adapted for iterative stream learners). We addition-
ally compare stream learning with learning from incomplete or noisy data as
considered in [8,13].

2 Preliminaries and Model for Stream Learning

For any unexplained recursion theoretic notation, the reader is referred to the
textbooks of Rogers [17] and Odifreddi [12]. The symbol N denotes the set of
natural numbers, {0, 1, 2, 3, . . .}. Subsets of N are referred to as languages. The
symbols ∅, ⊆, ⊂, ⊇ and ⊃ denote the empty set, subset, proper subset, superset and proper superset, respectively. The cardinality of a set S is denoted by card(S). max(S) and min(S), respectively, denote the maximum and minimum of a set S, where max(∅) = 0 and min(∅) = ∞. dom(ψ) and ran(ψ) denote the domain and range of ψ. Furthermore, ⟨·, ·⟩ denotes a recursive 1–1 and onto pairing function [17] from N × N to N which is increasing in both its arguments: ⟨x, y⟩ = (x+y)(x+y+1)/2 + y. The pairing function can be extended to n-tuples by taking ⟨x1, x2, . . . , xn⟩ = ⟨x1, ⟨x2, . . . , xn⟩⟩.
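To make this concrete, the following small sketch (ours, for illustration only; Python is used just as notation) implements the pairing function and its n-tuple extension:

def pair(x: int, y: int) -> int:
    # Pairing <x, y> = (x+y)(x+y+1)/2 + y; 1-1, onto and increasing in both arguments.
    return (x + y) * (x + y + 1) // 2 + y

def pair_tuple(xs):
    # <x1, x2, ..., xn> = <x1, <x2, ..., xn>> by right-nesting.
    return xs[0] if len(xs) == 1 else pair(xs[0], pair_tuple(xs[1:]))

assert pair(0, 0) == 0 and pair(1, 0) == 1 and pair(0, 1) == 2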
The information available to the learner is a sequence consisting of exactly the
elements in the language being learned. In general, any sequence T on N ∪ {#}
is called a text, where # indicates a pause in information presentation. T (t)
Learning from Streams 341

denotes the (t + 1)-st element in T and T [t] denotes the initial segment of T
of length t. Thus T[0] = λ, where λ is the empty sequence. ctnt(T) denotes the
set of numbers in the text T . If σ is an initial segment of a text, then ctnt(σ)
denotes the set of numbers in σ. Let SEQ denote the set of all initial segments.
For σ, τ ∈ SEQ, σ ⊆ τ denotes that σ is an initial segment of τ . |σ| denotes the
length of σ.
A learner from texts is an algorithmic mapping from SEQ to N ∪ {?}. Here
the output ? of the learner is interpreted as “no conjecture at this time.” For a
learner M , one can view the sequence M (T [0]), M (T [1]), . . ., as a sequence of
conjectures (grammars) made by M on T .
Intuitively, successful learning is characterized by the sequence of conjectured
hypotheses eventually stabilizing on correct ones. The concepts of stabilization
and correctness can be formulated in various ways and we will be mainly con-
cerned with the notion of explanatory (Ex) learning. The conjectures of learners
are interpreted as grammars in a given hypothesis space H, which is always a recursively enumerable family of r.e. languages (in some cases, we even take
the hypothesis space to be a uniformly recursive family, also called an indexed
family). Unless specified otherwise, the hypothesis space is taken to be a fixed
acceptable numbering W0 , W1 , . . . of all r.e. sets.
Definition 1 (Gold [9]). Given a hypothesis space H = {H0 , H1 , . . .} and a
language L, a sequence of indices i0 , i1 , . . . is said to be an Ex-correct grammar
sequence for L, if there exists s such that for all t ≥ s, Hit = L and it = is . A
learner M Ex-learns a class L of languages iff for every L ∈ L and every text T
for L, M on T outputs an Ex-correct grammar sequence for L.
We use Ex to also denote the collection of language classes which are Ex-
learnt by some learner.
Now we consider learning from streams. For this the learners would get streams
of texts as input, rather than just one text.
Definition 2. Let n ≥ 1. T = (T1 , . . . , Tn ) is said to be a streamed text for L
if ctnt(T1 ) ∪ . . . ∪ ctnt(Tn ) = L. Here n is called the degree of dispersion of the
streamed text. We sometimes call a streamed text just a text, when it is clear
from the context what is meant.
Suppose T = (T1, . . . , Tn) is a streamed text. Then, for all t, σ = (T1[t], . . . , Tn[t]) is called an initial segment of T. Furthermore, we define T[t] =
(T1 [t], . . . , Tn [t]). We define ctnt(T [t]) = ctnt(T1 [t]) ∪ . . . ∪ ctnt(Tn [t]) and sim-
ilarly for the content of streamed texts. We let SEQn = {(σ1 , σ2 , . . . , σn ) : σ1 ,
σ2 , . . . , σn ∈ SEQ and |σ1 | = |σ2 | = . . . = |σn |}. For σ = (σ1 , σ2 , . . . , σn ) and
τ = (τ1 , τ2 , . . . , τn ), we say that σ ⊆ τ if σi ⊆ τi for i ∈ {1, . . . , n}.
Let L be a language collection and H be a hypothesis space.
When learning from streams, a team M1 , ..., Mn of learners accesses a stream-
ed text T = (T1 , . . . , Tn ) and works as follows. At time t, each learner Mi sees as
input Ti [t] plus the initial segment T [synct ], outputs a hypothesis hi,t and might
update synct+1 to t. Here, initially sync0 = 0 and synct+1 = synct whenever no
team member updates synct+1 at time t.
Assume that 1 ≤ m ≤ n. A team (M1, . . . , Mn) [m, n]StreamEx-learns L iff for every L ∈ L and every streamed text T for L, (a) there is a maximal t such that synct+1 = t and (b) for at least m indices i ∈ {1, 2, . . . , n}, the sequence of hypotheses hi,0, hi,1, . . . is an Ex-correct sequence for L.
We let [m, n]StreamEx denote the collection of language classes which are [m, n]StreamEx-learnt by some team. The ratio m/n is called the success-ratio of the team.
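For intuition, the interaction just described can be simulated for finitely many steps; the following harness is a hedged sketch (the names run_stream_protocol, learners and streams are ours, and learners are assumed to be callables returning a hypothesis and a synchronization request):

def run_stream_protocol(learners, streams, horizon):
    # Simulates the stream-learning protocol for `horizon` steps; streams are
    # finite prefixes of the texts T_1, ..., T_n.
    sync = 0
    hypotheses = [[] for _ in learners]
    for t in range(horizon):
        shared = tuple(s[:sync] for s in streams)      # T[sync_t], visible to all
        request = False
        for i, M in enumerate(learners):
            h, wants_sync = M(streams[i][:t], shared)  # M_i sees T_i[t] and T[sync_t]
            hypotheses[i].append(h)                    # hypothesis h_{i,t}
            request = request or wants_sync
        if request:
            sync = t                                   # sync_{t+1} is updated to t
    return hypotheses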
Note that a class L is [1, 1]StreamEx-learnable iff it is Ex-learnable. A further
important notion is that of team learning [18]. This can be reformulated in our
setting as follows: L is [m, n]TeamEx-learnable iff there is a team of learners
(M1 , . . . , Mn ) which [m, n]StreamEx-learn every language L ∈ L from every
streamed text (T1 , . . . , Tn ) for L when T1 = T2 = · · · = Tn (and thus each Ti is
a text for L).
For notational convenience we sometimes use Mi (T [t]) = Mi (T1 [t], . . . , Tn [t])
(along with Mi (Ti [t], T [synct ])) to denote Mi ’s output at time t when the team
M1 , . . . , Mn gets the streamed text T = (T1 , . . . , Tn ) as input. Note that here the
learner sees several inputs rather than just one input as in the case of learning
from texts (Ex-learning). It will be clear from context which kind of learner is
meant.
One can consider updating of synct+1 to t as synchronization, as the data
available to any of the learners is passed to every learner. Thus, for ease of
exposition, we often just refer to updating of synct+1 to t by Mi as request for
synchronization by Mi .
Note that in our models, there is no more synchronization after some finite
time. If one allows synchronization without such a constraint, then the learners
can synchronize at every step and thus there would be no difference from the
team learning model. Furthermore, in our model there is no restriction on how
the data is distributed among the learners. This is assumed to be done in an adversarial manner, with the only constraint being that every datum appears in
some stream. A stronger form would be that the data is distributed via some
mechanism (for example, x, if present, is assigned to the stream x mod n + 1).
We will not be concerned with such distributions but only point out that learning
in such a scenario is easier.
The following proposition is immediate from Definition 2.
Proposition 3. Suppose 1 ≤ m ≤ n. Then the following statements hold.
(a) [m, n]StreamEx ⊆ [m, n]TeamEx.
(b) [m + 1, n + 1]StreamEx ⊆ [m, n + 1]StreamEx.
(c) [m + 1, n + 1]StreamEx ⊆ [m, n]StreamEx.
The following definitions of stabilizing and locking sequences are generalizations of similar definitions for learning from texts.
Definition 4 (Based on Blum and Blum [5], Fulk [7]). Suppose that L is a
language and M1 , . . . , Mn are learners. Then, σ = (σ1 , . . . , σn ) is called a stabiliz-
ing sequence for M1 , . . . , Mn on L for [m, n]StreamEx-learning iff ctnt(σ) ⊆ L
and there are at least m numbers i ∈ {1, . . . , n} such that for all streamed texts T
for L with σ = T [|σ|] and for all t ≥ |σ|, when M1 , . . . , Mn are fed the streamed
text T , for synct and hi,t as defined in Definition 2, (a) synct ≤ |σ| and (b)
hi,t = hi,|σ| .
A stabilizing sequence σ is called a locking sequence for M1 , . . . , Mn on L for
[m, n]StreamEx-learning iff in (b) above hi,|σ| is additionally an index for L (in
the hypothesis space used).

The following fact is based on a result of Blum and Blum [5].
Fact 5. Assume that L is [m, n]StreamEx-learnable by M1, . . . , Mn. Then there exists a locking sequence σ for M1, M2, . . . , Mn on L.

3 Some Characterization Results


In this section we first consider a characterization for learning from streams for
indexed families. Our characterization is similar in spirit to Angluin’s character-
ization for learning indexed families.

Definition 6 (Angluin [2]). L is said to satisfy the tell-tale set criterion if for every L ∈ L, there exists a finite set DL such that for any L′ ∈ L with L′ ⊇ DL, we have L′ ⊄ L. DL is called a tell-tale set of L. {DL : L ∈ L} is called a family of tell-tale sets of L.

Angluin [2] used the term exact learning to refer to learning using the language
class to be learned as the hypothesis space and she showed that a uniformly re-
cursive language class L is exactly Ex-learnable iff it has a uniformly recursively
enumerable family of tell-tale sets [2]. A similar characterization holds for non-
effective learning [10, pp. 42–43]: Any class L of r.e. languages is noneffectively
Ex-learnable iff L satisfies the tell-tale criterion. For learning from streamed
text, we have the following corresponding characterization.

Theorem 7. Suppose k ≥ 1, 1 ≤ m ≤ n and 1/(k+1) < m/n ≤ 1/k. Suppose L = {L0, L1, . . .} is an indexed family where one can effectively (in i, x) test whether x ∈ Li. Then L ∈ [m, n]StreamEx iff there exists a uniformly r.e. sequence E0, E1, . . . of finite sets such that for each i, Ei ⊆ Li and there are at most k sets L′ ∈ L with Ei ⊆ L′ ⊆ Li.

Proof. (⇒): Suppose M1, M2, . . . , Mn witness that L is in [m, n]StreamEx. Consider any Li ∈ L. Let σ = (σ1, σ2, . . . , σn) be a stabilizing sequence for M1, M2, . . . , Mn on Li. Fix any j such that 1 ≤ j ≤ n and for all streamed texts T for Li which extend σ, for all t ≥ |σ|, Mj(T[t]) = Mj(σ). Let Tr = σr#∞ for r ∈ {1, . . . , n} − {j}. Thus, for any L ∈ L and text Tj for L such that Tj extends σj and ctnt(σ) ⊆ L ⊆ Li, we have that m of M1, . . . , Mn on (T1, . . . , Tn) converge to grammars for L. Since the sequence of grammars output by Mr on (T1, T2, . . . , Tn) is independent of the L chosen above (with the only constraint being that L satisfies ctnt(σ) ⊆ L ⊆ Li), there can be at most ⌊n/m⌋ such L ∈ L. Now note that a stabilizing sequence σ for M1, M2, . . . , Mn on Li can be found in the limit. Let σ^s denote the s-th approximation to σ. Then one can let Ei = ⋃_{s∈N} ctnt(σ^s) ∩ Li.
(⇐): Assume without loss of generality that the Li are pairwise distinct. Let Ei,s denote Ei enumerated within s steps by the uniform process for enumerating all the Ei’s. Now, the learners M1, . . . , Mn work as follows on a streamed text T. The learners keep variables it, st along with synct. Initially i0 = s0 = 0. At time t ≥ 0 the learner Mj does the following: If Eit,st ⊈ ctnt(T[synct]) or Eit,st ≠ Eit,t or ctnt(Tj[t]) ⊈ Lit, then synchronize and let it+1, st+1 be such that ⟨it+1, st+1⟩ = ⟨it, st⟩ + 1. Note that ⟨it, st⟩ can be recovered from T[synct].
Note that, for an input streamed text T for Li, the values of it, st converge as t goes to ∞. Otherwise synct also diverges, and once synct is large enough so that Ei ⊆ ctnt(T[synct]) and one considers ⟨it, st⟩ for which it = i and Ei,s = Ei,st for all s ≥ st, the conditions above ensure that it, st and synct do not change any further. Furthermore, i′ = limt→∞ it satisfies Ei′ ⊆ Li ⊆ Li′.
The output conjectures of the learners at time t are determined as follows: Let S be the set of (up to) k least elements j below t such that each j ∈ S satisfies Eit,st ⊆ Lj ∩ {x : x ≤ t} ⊆ Lit ∩ {x : x ≤ t}. Then we allocate, for each j ∈ S, m learners to output grammars for Lj; this is possible since km ≤ n. It is easy to verify that, for large enough t, it and st will have stabilized to, say, i′ and s′, respectively, and S will contain every j such that Ei′ ⊆ Lj ⊆ Li′. Thus, the team M1, M2, . . . , Mn will [m, n]StreamEx-learn each Lj such that Ei′ ⊆ Lj ⊆ Li′ (the input language Li is one such Lj). The theorem follows from the above analysis. □
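The synchronization test in the (⇐) direction can be made explicit. The following is a hedged sketch of one step of a team member Mj; the helpers E, in_L and next_pair are our stand-ins for the enumeration Ei,s, the decision procedure of the indexed family and the successor operation on coded pairs:

def step_of_learner(i_t, s_t, t, own_content, synced_content, E, in_L, next_pair):
    # Decide whether M_j synchronizes and advances the pair <i_t, s_t>.
    # E(i, s) returns the finite set E_{i,s}; in_L(i, x) decides "x in L_i".
    outdated = (not E(i_t, s_t) <= synced_content          # E_{i_t,s_t} not in synced data
                or E(i_t, s_t) != E(i_t, t)                # enumeration of E_{i_t} still grows
                or any(not in_L(i_t, x) for x in own_content))  # own stream leaves L_{i_t}
    if outdated:
        return True, next_pair(i_t, s_t)                   # request sync; <i,s> := <i,s> + 1
    return False, (i_t, s_t)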
Here note that the direction (⇒) of the theorem holds even for arbitrary classes
L of r.e. languages, rather than just indexed families. The direction (⇐) does
not hold for arbitrary classes of r.e. languages. Furthermore, the learning algo-
rithm given above for the direction (⇐) uses the indexed family L itself as the
hypothesis space: so this is exact learning.
Corollary 8. Suppose 1 ≤ m ≤ n, 1 ≤ m′ ≤ n′ and n/m ≥ k + 1 > n′/m′. Let L contain the following sets:
– the sets {2e + 2x : x ∈ N} for all e;
– the sets {2e + 2x : x ≤ |We| + r} for all e ∈ N and r ∈ {1, 2, . . . , k};
– all finite sets containing at least one odd element.
Then L ∈ [m, n]StreamEx − [m′, n′]StreamEx and L can be chosen as an indexed family.

Proof sketch. First we show that L ∈ [1, k + 1]StreamEx. For each e and for each L ⊆ {2e, 2e + 2, 2e + 4, . . .} with {2e} ⊆ L, let EL = {2e}; also, for any language L ∈ L containing an odd number, let EL = L. Now, for an appropriate indexing L0, L1, . . . of L, {ELi : i ∈ N} is a collection of uniformly r.e. finite sets and for each L ∈ L, there are at most k + 1 sets L′ ∈ L such that EL ⊆ L′ ⊆ L. Thus, L ∈ [1, k + 1]StreamEx by Theorem 7. On the other hand, for each L ∈ L, one cannot effectively (in indices for L) enumerate a finite subset EL of L such that EL ⊆ L′ ⊆ L for at most k languages L′ ∈ L. We omit the details and the proof that L can be chosen as an indexed family. □
Corollary 9. Let IND denote the collection of all indexed families. Suppose 1 ≤ m ≤ n and 1 ≤ m′ ≤ n′. Then [m, n]StreamEx ∩ IND ⊆ [m′, n′]StreamEx ∩ IND iff ⌊n/m⌋ ≤ ⌊n′/m′⌋.

Remark 10. One might also study the inclusion problem for IND with respect
to related criteria. One of them being conservative learning [2], where the addi-
tional requirement is that a team member Mi of a team M1 , . . . , Mn can change
its hypothesis from Ld to Le only if it has seen, either in its own stream or in
the synchronized part of all streams, some datum x ∉ Ld. If one furthermore
requires that the learner is exact, that is, uses the hypothesis space given by the
indexed family, then one can show that there are more breakpoints than in the
case of usual team learning.
For example, there is a class which under these assumptions is conservatively
[2, 3]StreamEx-learnable but not conservatively learnable. The indexed family
L = {L0 , L1 , . . .} witnessing this separation is defined as follows. Let Φ be a
Blum complexity measure. For e ∈ N and a ∈ {1, 2}, L3e+a is {e, e + 1, e + 2, . . .}
if Φe (e) = ∞ and L3e+a is {e, e + 1, e + 2, . . .} − {Φe (e) + e + a} if Φe (e) < ∞.
Furthermore, the sets L0 , L3 , L6 , . . . form a recursive enumeration of all finite
sets D for which there is an e with Φe (e) < ∞, min(D) = e and max(D) ∈
{Φe (e) + e + 1, Φe (e) + e + 2}.
Note that the usage of the exact hypothesis space is essential for this remark.
However, the earlier results of this section do not depend on the choice of the
hypothesis space. Assume that there is a k ∈ {1, 2, 3, . . .} with m/n ≤ 1/k < m′/n′. Then, similarly to Corollary 8, one can show that some class is conservatively [m, n]StreamEx-learnable but not conservatively [m′, n′]StreamEx-learnable.
The following result follows using the proof of Theorem 7 for noneffective learn-
ers. For noneffective learners one can consider every class as an indexed family.
Furthermore, finitely many elements can be added to Ei to separate Li from the
finitely many subsets of it which contain Ei and are proper subsets of Li — thus
giving us a tell-tale set for Li .
Theorem 11. Suppose 1 ≤ m ≤ n. L is noneffectively [m, n]StreamEx-learn-
able iff L satisfies Angluin’s tell-tale set criterion.
The above theorem shows that any separation between learning from streams
with different parameters must be due to computational difficulties.
Remark 12. Behaviourally correct learning (Bc-learning) requires a learner
to eventually output only correct hypotheses. Thus, the learner semantically
converges to a correct hypothesis, but may not converge syntactically (see [6,14]
for a formal definition). Suppose n ≥ 1. If an indexed family is [1, n]StreamEx-
learnable, then it is Bc-learnable using an acceptable numbering as hypothesis
space. This follows from the fact that an indexed family is Bc-learnable using an
acceptable numbering as hypothesis space iff it satisfies the noneffective tell-tale
criterion [3].
4 Relationship between Various StreamEx-criteria

In this and the next section, for m, n, m′, n′ with 1 ≤ m ≤ n and 1 ≤ m′ ≤ n′, we consider the relationship between [m, n]StreamEx and [m′, n′]StreamEx.
We shall develop some basic theorems to show how the degree of dispersion, the success ratio and the number of successful learners required affect the ability to learn from streams.
First, we show that the degree of dispersion plays an important role in the
power of learning from streams. The next theorem shows that for any n, there
are classes which are learnable from streams when the degree of dispersion is not
more than n, but are not learnable from streams when the degree of dispersion
is larger than n, irrespective of the success ratio.

Theorem 13. For any n ≥ 1, there exists a language class L such that L ∈ [n, n]StreamEx − ⋃_{n′>n} [1, n′]StreamEx.

Proof. Consider the class L = L1 ∪ L2, where L1 = {L : L = Wmin(L) ∧ ∀x [card({(n + 1)x, . . . , (n + 1)x + n} ∩ L) ≤ 1]} and L2 = {L : ∃x [{(n + 1)x, . . . , (n + 1)x + n} ⊆ L] and L = Wx for the least such x}.
It is easy to verify that L can be [n, n]StreamEx-learnt. The learners can use synchronization to first find out the minimal element e of the input language; thereafter, they can conjecture e, until one of the learners (in its stream) observes (n + 1)x + j and (n + 1)x + j′ for some x, j, j′, where j ≠ j′ and j, j′ ≤ n; in this case the learners use synchronization to find and conjecture (in the limit) the minimal x such that {(n + 1)x, . . . , (n + 1)x + n} is contained in the input language.
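The only non-trivial trigger in this strategy is the detection of two distinct residues in one block; a minimal sketch (ours; `seen` is assumed to be the set of numbers a learner has observed in its own stream):

def two_residues_in_some_block(seen, n):
    # True iff some block {(n+1)x, ..., (n+1)x + n} contributes two distinct
    # elements (n+1)x + j and (n+1)x + j' with j != j' to the data seen so far.
    blocks = {}
    for v in seen:
        blocks.setdefault(v // (n + 1), set()).add(v % (n + 1))
    return any(len(residues) >= 2 for residues in blocks.values())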
Now suppose by way of contradiction that L is [1, n′]StreamEx-learnable by M1, . . . , Mn′ for some n′ > n. We will use Kleene’s recursion theorem to construct a language in L which is not [1, n′]StreamEx-learned by M1, . . . , Mn′. First, we give an algorithm to construct in stages a set Se depending on a parameter e. At stage s, we construct (σ1,s, . . . , σn′,s) ∈ SEQn′ where we will always have that σi,s ⊆ σi,s+1.
– Stage 0: (σ1,0, σ2,0, . . . , σn′,0) = (e, #, . . . , #). Enumerate e into Se.
– Stage s > 0. Let σ = (σ1,s−1, . . . , σn′,s−1). Search for a τ = (τ1, . . . , τn′) ∈ SEQn′ such that (i) for i ∈ {1, . . . , n′}, σi,s−1 ⊂ τi, (ii) min(ctnt(τ)) = e and (iii) for all x, card({y : y ≤ n, (n + 1)x + y ∈ ctnt(τ)}) ≤ 1, and one of the following holds:
(a) One of the learners requests synchronization after τ is given as input to the learners M1, . . . , Mn′.
(b) All the learners make a mind change between seeing σ and τ, that is, for all i with 1 ≤ i ≤ n′, for some τ′ with σ ⊆ τ′ ⊆ τ, Mi(σ) ≠ Mi(τ′).
If one of the searches succeeds, then let σi,s = τi, enumerate ctnt(τ) into Se and go to stage s + 1.
If each stage finishes, then by Kleene’s recursion theorem, there exists an e such that We = Se and thus We ∈ L1. For i ∈ {1, . . . , n′}, let Ti = ⋃_s σi,s. Now, either the learners M1, . . . , Mn′ synchronize infinitely often or each of them makes infinitely many mind changes when the streamed text T = (T1, T2, . . . , Tn′) is given to them as input. Hence M1, . . . , Mn′ do not [1, n′]StreamEx-learn We ∈ L1.
Now suppose stage s starts but does not finish. Let σ = (σ1,s−1, σ2,s−1, . . . , σn′,s−1). Thus, as the learners only see their own texts and the data given to every learner up to the point of last synchronization, we have that for some j with 1 ≤ j ≤ n′, for all τ = (τ1, τ2, . . . , τn′) extending σ such that min(ctnt(τ)) = e and, for all x, i, card({y : y ≤ n, (n + 1)x + y ∈ ctnt(σ) ∪ ctnt(τi)}) ≤ 1, (a) none of the learners synchronize after seeing τ and (b) Mj does not make a mind change between σ and τ.
Let rem(i) = i mod (n + 1). Let xs = 1 + max(ctnt(σ)). For 1 ≤ i ≤ n′ such that rem(i) ≠ rem(j), let Ti be an extension of σi,s such that ctnt(Ti) − ctnt(σi,s) = {(n + 1)(xs + x) + rem(i) : x ∈ N}. For i ∈ {1, . . . , n′} with rem(i) = rem(j) and i ≠ j, we let Ti = σi,s#∞. We will choose Tj below such that σj,s−1 ⊆ Tj and ctnt(Tj) − ctnt(σj,s−1) = {(n + 1)(xs + x) + rem(j) : xs + x ≥ k}, for some k > xs.
Let pi be the grammar which Mi outputs in the limit, if any, when the team M1, . . . , Mn′ is provided with the input (T1, . . . , Tn′). As the learner Mi only sees Ti and the synchronized part of the streamed texts, by (a) and (b) above, we have that none of the members of the team synchronize beyond σ and the learner Mj converges to the same grammar as it did after the team was provided with input σ, irrespective of which k > xs is chosen. Now, by Kleene’s recursion theorem there exists a k > xs such that Wk = ctnt(σj,s) ∪ {(n + 1)(xs + x) + rem(j) : xs + x ≥ k} ∪ ⋃_{i∈{1,2,...,n′}−{j}} ctnt(Ti) and Wk ∉ {Wpi : 1 ≤ i ≤ n′}. Hence Wk ∈ L2 and Wk is not [1, n′]StreamEx-learnt by M1, . . . , Mn′.
The theorem follows from the above analysis. □

The following result shows that the number of successful learners affects learn-
ability from streams crucially.

Theorem 14. Suppose k ≥ 1. Then there exists an L such that for all n ≥ k and n′ ≥ 2k, L ∈ [k, n]StreamEx but L ∉ [k + 1, n′]StreamEx.

Proof. Let k be as in the statement of the theorem. Let ψ be a partial recursive function such that ran(ψ) ⊆ {1, . . . , 2k}, the complement of dom(ψ) is infinite and, for any r.e. set S such that S ∩ C is infinite, S ∩ B is nonempty, where B = {⟨x, y⟩ : ψ(x) = y} and C = {⟨x, j⟩ : x ∉ dom(ψ), 1 ≤ j ≤ 2k}. Note that one can construct such a ψ in a way similar to the construction of simple sets. Let Ax = B ∪ {⟨x, j⟩ : 1 ≤ j ≤ 2k}. Let L = {B} ∪ {Ax : x ∉ dom(ψ)}. We claim that L ∈ [k, n]StreamEx for all n ≥ k and that L ∉ [k + 1, n′]StreamEx for all n′ ≥ 2k.
We construct a team M1, . . . , Mn which [k, n]StreamEx-learns L as follows.
On input T[t] = (T1[t], . . . , Tn[t]), the learners synchronize if for some i, ctnt(Ti[t−1]) does not contain ⟨x, j⟩ and ⟨x, j′⟩ with j ≠ j′, but ctnt(Ti[t]) does contain such ⟨x, j⟩ and ⟨x, j′⟩.
If synchronization has happened (in some previous step), then the learners output a grammar for B ∪ {⟨x, j⟩ : 1 ≤ j ≤ 2k}, where x is the unique number such that ⟨x, j⟩ and ⟨x, j′⟩ are in the synchronized text for some j ≠ j′. Otherwise, M1, . . . , Mk output a grammar for B and each Mi with k + 1 ≤ i ≤ n does the following: it first looks for the least x such that ⟨x, j⟩ ∈ ctnt(Ti[t]) for some j, and x is not verified to be in dom(ψ) within t steps; then Mi outputs a grammar for Ax if such an x is found, and outputs ? if no such x is found.
If the learners ever synchronize, then clearly all learners correctly learn the target language. Suppose no synchronization happens. If the language is B, then M1, . . . , Mk correctly learn the input language. If the language is Ax for some x ∉ dom(ψ), then n ≥ 2k (otherwise synchronization would have happened) and at least k learners among Mk+1, . . . , Mn eventually see exactly one pair of the form ⟨x, j⟩, where 1 ≤ j ≤ 2k, and these learners will correctly learn the input language.
Now suppose by way of contradiction that a team (M1, . . . , Mn′) of learners [k + 1, n′]StreamEx-learns L. By Fact 5, there exists a locking sequence σ = (σ1, . . . , σn′) for the learners M1, . . . , Mn′ on B. Let S ⊆ {1, . . . , n′} be of size k + 1 such that the learners Mi, i ∈ S, do not make a mind change beyond σ on any streamed text T for B which extends σ.
By the definition of ψ, there can be only finitely many ⟨x, j⟩ ∈ C such that the learners M1, M2, . . . , Mn′ synchronize or one of the learners Mi, i ∈ S, makes a mind change beyond σ on any streamed text extending σ for B ∪ {⟨x, j⟩}; otherwise the set of such pairs would be an infinite r.e. subset of C which is disjoint from B, a contradiction to the definitions of ψ, B, C. Let X be the set of these finitely many ⟨x, j⟩. Let Z be the set of x such that, for some i with 1 ≤ i ≤ n′, the grammar output by Mi on input σ is for Ax, or the grammar output by Mi (in the limit) on input σi#∞ (with the last point of synchronization being before all of input σ is seen) is for Ax.
Select some z ∉ dom(ψ) such that z ∉ Z and ⟨z, j⟩ ∉ X for every j. Now we construct a streamed text extending σ for Az on which the learners fail. Let S′ ⊇ S be a subset of {1, 2, . . . , n′} of size 2k. If i is the j-th element of S′, then choose Ti such that Ti extends σi and ctnt(Ti) = B ∪ {⟨z, j⟩}; else (when i ∉ S′) let Ti = σi#∞. Thus, T = (T1, . . . , Tn′) is a streamed text for Az. However, only the learners Mi with i ∈ S′ − S can converge to correct grammars for Az (as the learners Mi with i ∈ S or i ∉ S′ would not have converged to a grammar for Az, by the definition of z, X and Z above).
It follows that L ∉ [k + 1, n′]StreamEx. □

5 Learning from Streams versus Team Learning

Team learning is a special form of learning from streams, in which all learners receive the same complete information about the underlying reality; thus team learnability provides upper bounds for learnability from streams with the same parameters. These upper bounds are strict.

Theorem 15. Suppose 1 ≤ m ≤ n and n > 1. Then [m, n]StreamEx ⊂ [m, n]TeamEx.

Remark 16. Another question is how this transfers to the learnability of indexed families. If m/n > 1/2 and L is an indexed family, then L ∈ [m, n]StreamEx iff L ∈ [m, n]TeamEx iff L ∈ Ex. But if 1 ≤ m ≤ n/2, then the class L consisting of N and all its finite subsets is [1, 2]TeamEx-learnable and [m, n]TeamEx-learnable but not [m, n]StreamEx-learnable.

Below we will show how several results from team learning can be carried over
to the stream learning situation.
It was previously shown that in team learning, once the success ratio exceeds a certain threshold, the exact success ratio does not affect learnability any longer. Using a similar majority argument, we can show similar collapsing results for learning from streams (Theorem 17 and Theorem 18).

Theorem 17. Suppose 1 ≤ m ≤ n. If m/n > 2/3, then [m, n]StreamEx = [n, n]StreamEx.

Theorem 18. Suppose 1 ≤ m ≤ n and k ≥ 1. Then [⌈2k(n − m)/3⌉ + km, kn]StreamEx ⊆ [m, n]StreamEx.

One can also carry over several diagonalization results from team learning to
learning from streams. An example is the following.

Theorem 19. For all j ∈ N, [j + 2, 2j + 3]StreamEx ⊈ [j + 1, 2j + 1]TeamEx.

The class witnessing the separation is Lj = {L : card(L) ≥ j + 3 and, if e0 < . . . < ej+2 are the j + 3 smallest elements of L, then either [We0 = . . . = Wej+1 = L] or [at least one of e0, . . . , ej+1 is a grammar for L and Wej+2 is finite and max(Wej+2) is a grammar for L]}. We omit the details of the proof.

6 Iterative Learning and Learning from Inaccurate Texts


In this section, the notion of learning from streams is compared with other
notions of learning where the data is used by the learner in more restricted ways or the data is presented in a more adversarial manner than in the standard case of
learning. The first notion to be dealt with is iterative learning where the learner
only remembers the most recent hypothesis, but does not remember any past
data [20]. Later, we will consider other adversary input forms: for example the
case of incomplete texts where finitely many data-items might be omitted [8,13]
or noisy texts where finitely many data-items (not in the input language) might
be added to the input text.
The motivation for iterative learning is the following: When humans learn,
they do not memorize all past observed data, but mainly use the hypothesis they
currently hold, together with new observations to formulate new hypotheses.
Many scientific results can be considered to be obtained in iterative fashion. It-
erative learning for learning from a single stream/text was previously modeled by
requiring the learners to be a function of the previous hypothesis and the current
observed data. Formally, a single-stream learner M : (N∪{#})∗ → (N∪{?}) is it-
erative if there exists a recursive function F : (N∪{?})×(N∪{#}) → N∪{?} such
that on a text T , M (T [0]) =? and for t > 0, M (T [t]) = F (M (T [t− 1]), T (t)). For
notational simplicity, we shall write F (M (T [t−1]), T (t)) as M (M (T [t−1]), T (t)).
We can similarly define iterative learning from several streams by requiring each
learner’s hypothesis to be a recursive function of its previous hypothesis and the
set of the newest datum received by each learner — here, when synchronization
happens, the learners only share the latest data seen by the learners rather than
the whole history of data seen.
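As a toy illustration (ours, not from the paper) of such an update function F, consider learning the class of all initial segments {0, . . . , k}: the new conjecture depends only on the previous conjecture and the newest datum:

def F(prev, datum):
    # Iterative learner for { {0, ..., k} : k in N }: the conjecture stands for
    # {0, ..., g} where g is the largest datum seen so far; '?' is the initial
    # conjecture and '#' a pause that never changes the conjecture.
    if datum == '#':
        return prev
    return datum if prev == '?' else max(prev, datum)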
Iterative learning can be considered as a form of information incompleteness
as the learner(s) do not memorize all the past observed data. Interestingly, every
iteratively learnable class is learnable from streams irrespective of the parameters.
Theorem 20. For any n ≥ 1, every language class Ex-learnable by an iterative
learner is iteratively [n, n]StreamEx-learnable.

Proof. Suppose L is Ex-learnable by an iterative learner M. We construct M1, . . . , Mn which [n, n]StreamEx-learn L. We maintain the invariant that each Mi outputs the same grammar g at each time step. Initially g = ?. At any time t, suppose Mi receives a datum x_i^t, the previous hypothesis is g and the synchronized data, if any, were d_1^t, d_2^t, . . . , d_n^t. The output conjecture of the learners is g′ = g if there is no synchronized data; otherwise the output conjecture of the learners is g′ = M(. . . M(M(g, d_1^t), d_2^t) . . . , d_n^t). The learner Mi requests synchronization if M(g′, x_i^t) ≠ g′. Clearly M1, . . . , Mn form a team of iterative learners from streams and always output the same hypothesis. Furthermore, it can be seen that if M on the text T1(0)T2(0) . . . Tn(0)T1(1)T2(1) . . . Tn(1) . . . converges to a hypothesis, then the sequence of hypotheses output by the learners M1, M2, . . . , Mn also converges to the same hypothesis. Thus, if M iteratively learns the input language, then M1, M2, . . . , Mn also iteratively [n, n]StreamEx-learn the input language. □
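A hedged sketch of this wrapper (M is assumed to be a callable M(hypothesis, datum) implementing the iterative learner; synced_data is None except right after a synchronization, when it holds the newest datum of each stream):

def team_member_step(M, g, my_datum, synced_data):
    # One step of team member M_i in the proof of Theorem 20. All members keep
    # the same hypothesis g; only the newest data are shared on synchronization.
    if synced_data is not None:
        for d in synced_data:            # g' = M(...M(M(g, d_1), d_2)..., d_n)
            g = M(g, d)
    wants_sync = (M(g, my_datum) != g)   # request sync iff M would change its mind
    return g, wants_sync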
Now we compare learning from streams with learning from an incomplete or
noisy text. Formally, a text T ∈ (N ∪ {#})∞ is an incomplete text for L iff ctnt(T) ⊆ L and L − ctnt(T) is finite [8,13]. A text T for L is noisy iff L ⊆ ctnt(T) and ctnt(T) − L is finite [13]. Ex-learning from incomplete or noisy texts is the
same as Ex-learning except that the texts are now incomplete texts or noisy
texts, respectively. In the following we investigate the relationships of these cri-
teria with learning from streams. We show that learning from streams is incom-
parable to learning from incomplete or noisy texts.
The nature of information incompleteness in learning from an incomplete text
is very different from the incompleteness caused by streaming of data, because
streaming only spreads information, but does not destroy information (Theo-
rem 11), while the incompleteness in an incomplete text involves the destruction
of information. This difference is made precise by the following incomparability
results.
Proposition 21. Suppose that L consists of L0 = N and all sets Lk+1 = {1 + ⟨x, y⟩ : x ≤ k ∧ y ∈ N}. Then L ∈ [n, n]StreamEx for any n ≥ 1 but L can neither be Ex-learnt from noisy text nor from incomplete text. Furthermore, L is iteratively learnable.
For the separations in the converse direction, one cannot use indexed families as
every indexed family Ex-learnable from normal text is already learnable from
streams; obviously this implication survives when learnability from normal text
is replaced by learnability from incomplete or noisy text.
Remark 22. Suppose n ≥ 2. Then the cylindrification of the class L from Theo-
rem 13 is Ex-learnable from incomplete text but not [1, n]StreamEx-learnable.
Here the cylindrification of the class L is just the class of all sets {⟨x, y⟩ : x ∈ L ∧ y ∈ N} with L ∈ L. Incomplete texts for a cylindrification of such a set L can
be translated into standard texts for L and so the learnability from incomplete
texts can be established; the diagonalization against the stream learners carries
over.
It is known that learnability from noisy text is possible only if, for every two different sets L, L′ in the class, the differences L − L′ and L′ − L are both infinite. This is a characterization for the case of indexed families, but it is only a necessary and not a sufficient criterion for classes in general. For example, if a class L consists of sets Lx = {⟨x, y⟩ : y ∈ N − {ax}} without any method to obtain ax from x in the limit, then learnability from noisy text is lost.
Theorem 23. There is a class L which is learnable from noisy text but not
[1, n]StreamEx-learnable for any n ≥ 2.

7 Conclusion
In this paper we investigated learning from several streams of data. For learn-
ing indexed families, we characterized the classes which are [m, n]StreamEx-
learnable using a tell-tale like characterization: An indexed family L = {L0, L1, . . .} is [m, n]StreamEx-learnable iff it is [1, ⌊n/m⌋]StreamEx-learnable iff there exists a uniformly r.e. sequence E0, E1, . . . of finite sets such that Ei ⊆ Li and there are at most ⌊n/m⌋ many languages L′ in L such that Ei ⊆ L′ ⊆ Li.
For general classes of r.e. languages, our investigation shows that the power of
learning from streams depends crucially on the degree of dispersion, the success
ratio and the number of successful learners required. Though a higher degree of
dispersion is more restrictive in general, we show that any class of languages
which is iteratively learnable is also iteratively learnable from streams even if
one requires all the learners to be successful. There are several open problems
and our results suggest that there may not be a simple way to complete the
picture of relationship between various [m, n]StreamEx learning criteria.
References
1. Ambainis, A.: Probabilistic inductive inference: a survey. Theoretical Computer
Science 264, 155–167 (2001)
2. Angluin, D.: Inductive inference of formal languages from positive data. Informa-
tion and Control 45, 117–135 (1980)
3. Baliga, G., Case, J., Jain, S.: The synthesis of language learners. Information and
Computation 152, 16–43 (1999)
4. Baliga, G., Jain, S., Sharma, A.: Learning from multiple sources of inaccurate data.
SIAM Journal on Computing 26, 961–990 (1997)
5. Blum, L., Blum, M.: Toward a mathematical theory of inductive inference. Infor-
mation and Control 28, 125–155 (1975)
6. Case, J., Lynes, C.: Machine inductive inference and language identification. In:
Nielsen, M., Schmidt, E.M. (eds.) ICALP 1982. LNCS, vol. 140, pp. 107–115.
Springer, Heidelberg (1982)
7. Fulk, M.: Prudence and other conditions on formal language learning. Information
and Computation 85, 1–11 (1990)
8. Fulk, M., Jain, S.: Learning in the presence of inaccurate information. Theoretical
Computer Science 161, 235–261 (1996)
9. Gold, E.M.: Language identification in the limit. Information and Control 10, 447–474 (1967)
10. Jain, S., Osherson, D., Royer, J.S., Sharma, A.: Systems That Learn: An Introduc-
tion to Learning Theory, 2nd edn. MIT Press, Cambridge (1999)
11. Jain, S., Sharma, A.: Team learning of computable languages. Theory of Computing
Systems 33, 35–58 (2000)
12. Odifreddi, P.: Classical Recursion Theory. North-Holland, Amsterdam (1989)
13. Osherson, D., Stob, M., Weinstein, S.: Systems That Learn: An Introduction to
Learning Theory for Cognitive and Computer Scientists. MIT Press, Cambridge
(1986)
14. Osherson, D., Weinstein, S.: Criteria of language learning. Information and Con-
trol 52, 123–138 (1982)
15. Pitt, L.: Probabilistic inductive inference. Journal of the ACM 36, 383–433 (1989)
16. Pitt, L., Smith, C.H.: Probability and plurality for aggregations of learning ma-
chines. Information and Computation 77, 77–92 (1988)
17. Rogers, H.: Theory of Recursive Functions and Effective Computability. McGraw-
Hill, New York (1967); Reprinted in MIT Press (1987)
18. Smith, C.H.: The power of pluralism for automatic program synthesis. Journal of
the ACM 29, 1144–1165 (1982)
19. Smith, C.H.: Three decades of team learning. In: Arikawa, S., Jantke, K.P. (eds.)
AII 1994 and ALT 1994. LNCS, vol. 872, pp. 211–228. Springer, Heidelberg (1994)
20. Wiehagen, R.: Limes-Erkennung rekursiver Funktionen durch spezielle Strategien. Elektronische Informationsverarbeitung und Kybernetik (EIK) 12, 93–99 (1976)
Smart PAC-Learners

Hans Ulrich Simon
Fakultät für Mathematik, Ruhr-Universität Bochum, 44780 Bochum, Germany
hans.simon@rub.de
Abstract. The PAC-learning model is distribution-independent in the sense that the learner must reach a learning goal with a limited number of labeled random examples without any prior knowledge of the underlying domain distribution. In order to achieve this, one needs generalization error bounds that are valid uniformly for every domain distribution. These bounds are (almost) tight in the sense that there is a domain distribution for which the generalization error cannot be made significantly smaller than the general bound. Note however that this leaves open the possibility to achieve the learning goal faster if the underlying distribution is “simple”. Informally speaking, we say a PAC-learner L is “smart” if, for a “vast majority” of domain distributions D, L does not require significantly more examples to reach the “learning goal” than the best learner whose strategy is specialized to D. In this paper, focusing on sample complexity and ignoring computational issues, we show that smart learners do exist. This implies (at least from an information-theoretical perspective) that full prior knowledge of the domain distribution (or access to a huge collection of unlabeled examples) does not (for a vast majority of domain distributions) significantly reduce the number of labeled examples required to achieve the learning goal.

1 Introduction

We are concerned with sample-efficient strategies for properly PAC-learning a finite class H of concepts over a finite domain X. In the general PAC-learning frame-
work, a learner is exposed to a worst-case analysis by asking the following question:
what is the smallest sample size m = mH (ε, δ) such that, for every target concept
h ∈ H and every distribution D on X, the probability (taken over m randomly
chosen and correctly labeled examples) for returning an ε-accurate hypothesis is
at least 1 − δ? It is well-known [3], [4] that (up to some logarithmic factors) there
are matching upper and lower bounds on mH (ε, δ). The proof for the lower bound
makes use of a fiendish distribution Dε∗ which makes the learning task quite hard.
The lower bound remains valid even when Dε∗ is known to the learner. While this al-
most completely determines the sample size that is required in the worst-case, it
leaves open the question whether the learning goal can be achieved faster when
the underlying domain distribution D is significantly simpler than Dε∗ . Further-
more, if it can be achieved faster, it leaves open the question whether this can be

⋆ This work was supported by the Deutsche Forschungsgemeinschaft Grant SI 498/8-1.


exploited only by a learner that is specialized to D or if it can be as well exploited


by a “smart” PAC-learner who has no prior knowledge about D. This is precisely
the question that we try to answer in this paper.
Our main result: In our paper, it will be convenient to think of the target class
H, the sample size m and the accuracy parameter ε as fixed, respectively, and
to figure out the smallest value for δ such that m examples suffice to meet the
∗ ∗
(ε, δ)-criterion of PAC-learning.1 Let δD = δD (ε, m) denote the smallest value
for δ that can be achieved by a learner who is specialized to D. A general PAC-
learner who must cope with arbitrary distributions can clearly not be specialized
to every distribution at the same time. Nevertheless we can associate a quantity
L L
δD = δD (ε, m) with L that is defined as the smallest value of δ that such that m
examples are sufficient to meet the (ε, δ)-criterion provided that L was exposed
to the domain distribution D. Ideally, we would like to prove that there is a
PAC-learner L and a constant c (not even depending on H) such that, for every

domain distribution D, δD L
≤ c · δD . This would be a strong result as it is
basically saying that, for every D, the learning progress of L (without prior
knowledge of D) is made roughly at the same speed as the learning progress of
the learner whose strategy is perfectly specialized to D. Our result is however
slightly weaker than that. We can show that there is a PAC-learner L such that,

for a “vast majority” of domain distributions D, δD L
≤ 2c · δD (where c is a
constant that grows with the desired “vastness” of the majority). The formal
statement is found in Corollary 3.
Related papers: Our main result makes a contribution to a discussion about the
power of semi-supervised learning that was raised in [1]. The authors pursue the
question whether unlabeled data (which are usually cheap) provably help to save
labeled data (which are usually expensive). Moreover, they pursue this question
in a passive learning model (like PAC-learning) where the labeled data are gen-
erated at random (and the learner has no control about the data-generation
process). They present some particular concept classes for which they can prove
that unlabeled data do not significantly help. Their analysis uses a nice “rescal-
ing trick” that works however only for one-dimensional Euclidean domains. They
conjecture that unlabeled data do not significantly help for a much wider family
of concept classes. Our main result supports this conjecture for proper learning
(as opposed to agnostic learning) and for (arbitrary!) finite classes.
Our results are also weakly related to [2] where upper and lower bounds (in
terms of cover- and packing-numbers associated with H and D) on the sample
size are presented when H is PAC-learned under a fixed distribution D. If these
bounds were tight, one line of attack for proving our results could have been
to design a general PAC-learner that, when exposed to D, achieves the learning
goal with a sample size that does not significantly exceed the lower bound on the
sample size in the fixed-distribution setting. However, since the upper and lower
bounds in [2] leave a significant gap (although being related by a polynomial of
small degree), this approach does not look very promising.
1 This is obviously equivalent to discussing the sample size as a function of ε and δ.
Structure of the paper: Section 2 clarifies the notation that is used throughout the paper. Section 3 is devoted to learning under a fixed distribution D. This setting is cast as a zero-sum game between two players, the learner and her opponent, such that the Minimax Theorem from game theory applies. This leads to a nice characterization of δ∗_D = δ∗_D(ε, m). It is furthermore shown that, when the opponent makes his draw first, there is a strategy for the learner that, despite not being defined in terms of D, does not perform much worse than the best strategy that is specialized to D. Section 4 is devoted to the proof of the main result. To this end, we first treat the case of finitely many distributions and cast the resulting task for a general PAC-learner again as a zero-sum game. Another application of the Minimax Theorem then brings us into a position to prove an important result for a learner who simultaneously copes with finitely many distributions. The case of (infinitely many) arbitrary distributions is finally treated in a similar fashion by invocation of a continuity argument. At the end of Section 4, the reader will find our main results. Section 5 is devoted to some final discussions and open problems.

2 Notations
We assume that the reader is familiar with the PAC-learning framework and knows the Minimax Theorem from game theory.
Throughout the paper, we use the following notation:
– X denotes a finite domain.
– H denotes a finite concept class over domain X. Thus, every h ∈ H is a
function of the form h : X → {0, 1}.
– D denotes a domain distribution.
– m denotes the sample size.
– ε denotes the accuracy parameter.
– δ denotes the confidence parameter which bounds from above the “expected
failure rate” where “failure” means the delivery of an ε-inaccurate hypothe-
sis.
– (x, b) ∈ X m × {0, 1}m denotes a labeled sample.
– For x = (ξ1 , . . . , ξm ) ∈ X m and h ∈ H, we set
h(x) = (h(ξ1 ), . . . , h(ξm )) ∈ {0, 1}m.
– As usual, a learning function is a mapping of the form
L : X m × {0, 1}m → H,
i.e., it maps labeled samples to hypotheses.
– For two hypotheses h, h′ ∈ H, h ⊕ h′ := {ξ ∈ X : h(ξ) ≠ h′(ξ)} denotes their symmetric difference. Recall that h′ is called ε-accurate for h w.r.t. D if D(h ⊕ h′) ≤ ε.
A deterministic learner can be identified with a learning function (if computational issues are ignored). We consider, however, randomized learners. Each one of these can be identified with a probability distribution over the set of all learning functions.

3 Learning under a Fixed Distribution Revisited


Consider a fixed distribution D on X that is known to the learner. Let L1, . . . , LM be a list of all learning functions mapping a labeled sample of size m to a hypothesis from H = {h1, . . . , hN}. For every ε > 0, i = 1, . . . , M, j = 1, . . . , N, x ∈ X^m, and b ∈ {0, 1}^m, let I^{ε,x,b}_D[i, j] be the Bernoulli variable indicating whether the hypothesis Li(x, b) is ε-inaccurate w.r.t. target concept hj and domain distribution D, i.e.,

I^{ε,x,b}_D[i, j] = 1 if Li(x, b) is ε-inaccurate for hj w.r.t. D, and I^{ε,x,b}_D[i, j] = 0 otherwise.
Now, let

A^{ε,m}_D[i, j] := E_{x∈D^m}[I^{ε,x,h_j(x)}_D[i, j]] = Σ_x D^m(x) I^{ε,x,h_j(x)}_D[i, j]   (1)
               = Pr_{x∈D^m}[Li(x, hj(x)) is ε-inaccurate for hj w.r.t. D].   (2)
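For small finite X and H, these entries can be computed by brute-force enumeration of X^m; the following sketch (ours; D is assumed to be a mapping from domain points to probabilities and hypotheses to be callables) mirrors (1) and (2):

from itertools import product

def err(D, h, g):
    # D-weight of the symmetric difference of hypotheses h and g.
    return sum(p for xi, p in D.items() if h(xi) != g(xi))

def A_entry(L_i, h_j, D, eps, m):
    # A^{eps,m}_D[i, j]: probability over x ~ D^m that L_i(x, h_j(x)) is
    # eps-inaccurate for h_j w.r.t. D.
    total = 0.0
    for x in product(D.keys(), repeat=m):
        px = 1.0
        for xi in x:
            px *= D[xi]                      # D^m(x)
        b = tuple(h_j(xi) for xi in x)       # the correctly labeled sample
        if err(D, L_i(x, b), h_j) > eps:
            total += px
    return total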

If D, ε, m are obvious from context, we omit these letters as subscripts or superscripts in what follows. A randomized learner is given by a vector p ∈ [0, 1]^M that assigns a probability pi to every learning function Li (so that Σ^M_{i=1} pi = 1). Thus, we may identify learners with mixed strategies for the “row-player” in the zero-sum game associated with A. We may view the “column-player” in this game as an opponent of the learner. A mixed strategy for the opponent is given by a vector q ∈ [0, 1]^N that assigns an à-priori probability qj to every possible target concept hj (so that Σ^N_{j=1} qj = 1). The well-known Minimax Theorem states that

min_p max_q p^⊤Aq = max_q min_p p^⊤Aq.   (3)
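The optimal value in (3) can be computed by linear programming; a minimal sketch, assuming A is available as a NumPy array and SciPy is installed (both assumptions are ours):

import numpy as np
from scipy.optimize import linprog

def game_value(A):
    # Value of the zero-sum game with payoff matrix A (M rows for the learner,
    # N columns for the opponent): minimize v subject to p >= 0, sum(p) = 1 and
    # (p^T A)_j <= v for every column j. Variables are (p_1, ..., p_M, v).
    M, N = A.shape
    c = np.zeros(M + 1); c[-1] = 1.0
    A_ub = np.hstack([A.T, -np.ones((N, 1))])   # A[:, j]^T p - v <= 0 for all j
    b_ub = np.zeros(N)
    A_eq = np.hstack([np.ones((1, M)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, 1)] * M + [(None, None)])
    return res.fun, res.x[:M]                   # the game value and an optimal p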
In the sequel, we denote the optimal value in (3) by δ∗_D(ε, m). Note that δ∗_D(ε, m) coincides with the smallest value for the confidence parameter δ such that every target concept from H can be inferred up to accuracy ε from a random sample of size m (w.r.t. the fixed distribution D) with probability at least 1 − δ of success. Thus the definition of δ∗_D(ε, m) given here is consistent with the definition given in the introduction.
Recall that by “expected failure rate (w.r.t. D, m, ε)” we mean the probability
p
for delivering an ε-inaccurate hypothesis. We denote by δD (ε, m) the expected
failure rate of the learner with mixed strategy p (with the opponent making the
second draw). Clearly,

δD p
(ε, m) = min δD (ε, m) = min max p Aj
p p j=1,...,N

where Aj denotes the j’th column of matrix A = Aε,m


D .
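For very small instances, the value δ*_D(ε, m) can be computed exactly by solving
the linear program behind (3). The following sketch (our own illustration; the
concrete class, distribution, and parameters are toy stand-ins, and enumerating
all learning functions is feasible only at this scale) builds the payoff matrix A of
(1)-(2) for |X| = 2, m = 1 and all four hypotheses over X, then solves
min_p max_j (p^⊤A)_j:

```python
import itertools
import numpy as np
from scipy.optimize import linprog

# Toy instance (ours): X = {0,1}, m = 1, all four hypotheses over X.
X = [0, 1]
H = [(0, 0), (0, 1), (1, 0), (1, 1)]      # h given by its label vector on X
D = np.array([0.7, 0.3])                  # fixed domain distribution
eps = 0.2

def inaccurate(g, h):
    """1 iff g is eps-inaccurate for h w.r.t. D, i.e., D(g xor h) > eps."""
    return float(sum(D[x] for x in X if g[x] != h[x]) > eps)

# All learning functions L: (x, b) -> hypothesis; for m = 1 there are
# only 4 labeled samples, hence 4^4 = 256 learning functions.
inputs = [(x, b) for x in X for b in (0, 1)]
learners = list(itertools.product(range(len(H)), repeat=len(inputs)))

# Payoff matrix (1)-(2): A[i,j] = Pr_{x~D}[L_i(x, h_j(x)) eps-inaccurate for h_j].
A = np.array([[sum(D[x] * inaccurate(H[L[inputs.index((x, h[x]))]], h)
                   for x in X)
               for h in H]
              for L in learners])

# LP for the game value: minimize v s.t. A^T p <= v*1, sum(p) = 1, p >= 0.
M, N = A.shape
res = linprog(c=np.r_[np.zeros(M), 1.0],
              A_ub=np.hstack([A.T, -np.ones((N, 1))]), b_ub=np.zeros(N),
              A_eq=np.r_[np.ones(M), 0.0][None, :], b_eq=[1.0],
              bounds=[(0, None)] * M + [(None, None)])
print("delta*_D(eps, m=1) =", res.x[-1])
```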

Since a mixed strategy for the learner is a distribution over learning functions
(mapping a labeled sample to a hypothesis), we may equivalently think of the
learner as waiting for a random labeled sample (x, b) and then playing a mixed
strategy that depends on (x, b). In order to formalize this intuition, we consider
the new payoff-matrix Ã = Ã^ε_D given by

    Ã[i,j] = 1 if h_i is ε-inaccurate for h_j w.r.t. D, and 0 otherwise.

We associate the following game with Ã:

1. The opponent selects a vector q ∈ [0,1]^N specifying à-priori probabilities for
   the target concept. Note that this implicitly determines
   – the probability
         Q(b|x) = Σ_{j: h_j(x)=b} q_j
     of labeling a given sample x by b,
   – and the à-posteriori probabilities
         Q(j|x, b) = q_j / Q(b|x) if h_j(x) = b, and Q(j|x, b) = 0 otherwise,     (4)
     for target concept h_j given the labeled sample (x, b).
   For sake of a compact notation, let q̃(x, b) denote the vector whose j'th
   component is Q(j|x, b).
2. A labeled sample (x, b) is produced at random with probability
         Pr(x, b) = D^m(x) Q(b|x).     (5)
3. The learner chooses a vector p̃(x, b) ∈ [0,1]^N (that may depend on D, q and
   (x, b)) specifying her mixed strategy w.r.t. payoff-matrix Ã.
4. The learner suffers loss p̃(x, b)^⊤ Ã q̃(x, b), so that her expected loss, averaged
   over all labeled samples, evaluates to Σ_{x,b} Pr(x, b) p̃(x, b)^⊤ Ã q̃(x, b).

In the sequel, the games associated with A and Ã are simply called the A-game
and the Ã-game, respectively.

Lemma 1. Let q ∈ [0,1]^N be an arbitrary but fixed mixed strategy for the
learner's opponent. Then every mixed strategy p ∈ [0,1]^M for the learner in
the A-game can be mapped to a mixed strategy p̃ for the learner in the Ã-game
so that

    p^⊤Aq = Σ_{x,b} Pr(x, b) p̃(x, b)^⊤ Ã q̃(x, b).     (6)

Moreover, this mapping p ↦ p̃ is surjective, i.e., every mixed strategy for the
learner in the Ã-game has a pre-image (so that the optimal values in both games
are the same).

Proof. For every probability vector p ∈ [0,1]^M and every labeled sample (x, b),
we define the corresponding probability vector p̃(x, b) ∈ [0,1]^N as follows:

    p̃_{i'}(x, b) = Σ_{i: L_i(x,b)=h_{i'}} p_i .     (7)

Note that

    Σ_{i'=1}^N p̃_{i'}(x, b) = Σ_{i=1}^M p_i = 1 .

The following computation verifies (6):

    p^⊤Aq (1)= Σ_x D^m(x) Σ_{j=1}^N Σ_{i=1}^M I^{x,h_j(x)}[i,j] p_i q_j
          = Σ_{x,b} D^m(x) Σ_{j: h_j(x)=b} Σ_{i=1}^M I^{x,b}[i,j] p_i q_j
          = Σ_{x,b} D^m(x) Σ_{j: h_j(x)=b} Σ_{i'=1}^N ( Σ_{i: L_i(x,b)=h_{i'}} I^{x,b}[i,j] p_i ) q_j
          = Σ_{x,b} D^m(x) Σ_{j: h_j(x)=b} Σ_{i'=1}^N Ã[i',j] q_j Σ_{i: L_i(x,b)=h_{i'}} p_i
          (7)= Σ_{x,b} D^m(x) Σ_{j: h_j(x)=b} Σ_{i'=1}^N Ã[i',j] p̃_{i'}(x, b) q_j
          (4)= Σ_{x,b} D^m(x) Q(b|x) Σ_{i'=1}^N Σ_{j=1}^N Ã[i',j] p̃_{i'}(x, b) Q(j|x, b)
          (5)= Σ_{x,b} Pr(x, b) p̃(x, b)^⊤ Ã q̃(x, b).

Here, the fourth line uses that I^{x,b}[i,j] = Ã[i',j] whenever L_i(x, b) = h_{i'}.

As for the second part of the lemma, consider a mixed strategy p̃'(x, b) of the
learner in the Ã-game. We shall specify a mixed strategy p of the learner in
the A-game such that the function p̃(x, b) computed according to (7) coincides
with p̃'(x, b). To this end, let us make the notational convention p̃'_{h_{i'}} := p̃'_{i'} and
let us choose p as follows:

    p_i = Π_{(x,b)} p̃'_{L_i(x,b)}(x, b).     (8)

Note that, with this choice of p,

    Σ_{i=1}^M p_i (8)= Σ_{i=1}^M Π_{(x,b)} p̃'_{L_i(x,b)}(x, b) = Π_{(x,b)} Σ_{i'=1}^N p̃'_{i'}(x, b) = 1,     (9)

where each inner sum in the last product equals 1.
In the second-last equation, we used the distributive law; here the reader should
note the one-to-one correspondence between the set of all learning functions and
the free combinations of hypotheses taken from H, one per labeled sample.
A computation similar to (9) now verifies the desired coincidence between p̃
and p̃':

    p̃_{i'}(x', b') (7)= Σ_{i: L_i(x',b')=h_{i'}} p_i
                  (8)= Σ_{i: L_i(x',b')=h_{i'}} Π_{(x,b)} p̃'_{L_i(x,b)}(x, b)
                  = p̃'_{i'}(x', b') Σ_{i: L_i(x',b')=h_{i'}} Π_{(x,b)≠(x',b')} p̃'_{L_i(x,b)}(x, b)
                  = p̃'_{i'}(x', b') Π_{(x,b)≠(x',b')} Σ_{i=1}^N p̃'_i(x, b)
                  = p̃'_{i'}(x', b'),

since each factor of the last product equals 1.

In the second-last equation, we used the distributive law again; here the relevant
one-to-one correspondence is between the set of all learning functions with a
fixed value on one sample (x', b') and the free combinations of hypotheses taken
from H, one per remaining labeled sample.                                       □

As a corollary to Lemma 1, we obtain

    δ*_D(ε, m) = max_q min_p p^⊤A_D^{ε,m} q = max_q [ Σ_{x,b} Pr(x, b) · min_{p̃} p̃^⊤Ã^ε_D q̃(x, b) ] .

We close this section with a result that prepares the ground for our analysis of
general PAC-learners in the next section:

Lemma 2. Let ε > 0 be a given accuracy and m ≥ 1 a given sample size.
For every probability vector q ∈ [0,1]^N and every domain distribution D, the
following holds:

    Σ_{x,b} Pr(x, b) q̃(x, b)^⊤ Ã^{2ε}_D q̃(x, b) ≤ 2δ*_D(ε, m).     (10)

Proof. Recall that q̃(x, b) is the vector that assigns the à-posteriori probability
Q(j|x, b) of being the target concept to every hypothesis h_j. Since the
à-posteriori probabilities outside the version space

    V := {h ∈ H : h(x) = b}

are zero, only target concepts in V can contribute to the left hand-side in (10).
In the remainder of the proof, we simply write Ã^ε instead of Ã^ε_D, and Ã^ε_i denotes
the i'th row of this matrix. In the Ã^ε-game, the opponent makes the first draw
by choosing a (prior) probability vector q ∈ [0,1]^N. The following "Bayesian
strategy" for the learner minimizes (6): for a given labeled sample (x, b), pick a
hypothesis h* = h_{i*(x,b)} ∈ H which maximizes the total à-posteriori probability
of hypotheses that are ε-close to h* w.r.t. D, i.e.,

    i*(x, b) = argmax_{i=1,...,N} Σ_{j: D(h_i⊕h_j)≤ε} Q(j|x, b).

It follows that

    Σ_{x,b} Pr(x, b) Ã^ε_{i*(x,b)} q̃(x, b) ≤ δ*_D(ε, m).     (11)

We are now prepared to verify (10). We call a hypothesis from the version
space V a "(D, x, b, ε)-exception" if it is not ε-close to h* w.r.t. D. Note
that Ã^ε_{i*(x,b)} q̃(x, b) coincides with the total à-posteriori probability of (D, x, b, ε)-
exceptions. Consider now the strategy p̃ = q̃ for the learner. Given (x, b), she
picks a hypothesis ĥ = h_i at random with probability Q(i|x, b). The following
observation, which is a simple application of the triangle inequality, is crucial: if
ĥ ∈ V is not a (D, x, b, ε)-exception, then ĥ is 2ε-close w.r.t. D to every hypoth-
esis in V that is not a (D, x, b, ε)-exception either. Thus, if ĥ and the target
concept are both picked at random according to q̃(x, b), then the probability of
ĥ being 2ε-inaccurate is bounded from above by twice the total probability of
(D, x, b, ε)-exceptions, i.e.,

    q̃(x, b)^⊤ Ã^{2ε} q̃(x, b) ≤ 2 Ã^ε_{i*(x,b)} q̃(x, b).

This, combined with (11), concludes the proof of the lemma.                      □

It is important to note that no knowledge of D is required to play the strategy
p̃ = q̃ in the Ã-game (with the opponent making the first draw). Nevertheless, as
made precise in Lemma 2, this is a reasonably good strategy for any underlying
domain distribution.
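For concreteness, here is a minimal sketch (our own toy code; the class and prior
are stand-ins) of this strategy: given the prior q and a labeled sample (x, b), form
the à-posteriori vector (4), which is supported on the version space, and output
a hypothesis drawn from it; note that D is never consulted:

```python
import numpy as np

def posterior(q, H, x, b):
    """q~(x,b) from Eq. (4): restrict q to the version space and renormalize."""
    consistent = np.array([float(all(h[xi] == bi for xi, bi in zip(x, b)))
                           for h in H])
    w = q * consistent
    return w / w.sum()                      # assumes Q(b|x) > 0

def bayes_draw(q, H, x, b, rng):
    """Play p~ = q~: output a hypothesis index sampled from the posterior."""
    return rng.choice(len(H), p=posterior(q, H, x, b))

H = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)]   # toy class over X = {0,1,2}
q = np.full(len(H), 0.25)                          # uniform prior
rng = np.random.default_rng(0)
# Sample (x, b) = ((0, 2), (1, 0)): consistent hypotheses are (1,0,0), (1,1,0).
print(H[bayes_draw(q, H, x=(0, 2), b=(1, 0), rng=rng)])
```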

4 Smart PAC-Learners

Let us first consider learners that cope with an arbitrary but fixed finite list
D_1, . . . , D_C of (not necessarily distinct) distributions on X.² We shall define
a suitable payoff-matrix R blockwise, so that R = [R^{(1)}, . . . , R^{(C)}] and R^{(k)} is
the block reserved for distribution D_k. Every block has M rows (one row for
every learning function) and N columns (one column for every possible target
concept). Choosing R^{(k)} = A^{ε,m}_{D_k} (compare with the previous section) would lead

² We shall later extend these considerations to arbitrary distributions.

to mixed strategies for the PAC-learner that put too much emphasis on fiendish
domain distributions. Such strategies are not likely to succeed with considerably
fewer sample points when the underlying distribution D happens to be simple.
Assuming

    (A1) ∀k = 1, . . . , C : δ*_{D_k}(ε, m) ≠ 0,

the following payoff-matrix is a better choice:

    R^{(k)}[i,j] := (1/δ*_{D_k}(ε, m)) · A^{2ε,m}_{D_k}[i,j]
                 = (1/δ*_{D_k}(ε, m)) · Pr_{x∼D_k^m}[ L_i(x, h_j(x)) is 2ε-inaccurate for h_j w.r.t. D_k ].

Intuitively, the scaling factor 1/δ*_{D_k}(ε, m) challenges the learner to put more
emphasis on benign distributions. Note furthermore that we penalize the learner
only if her hypothesis is 2ε-inaccurate (as opposed to ε-inaccurate). This leaves
some slack which helps the learner to compensate for not knowing D.
  In the sequel, we simply write δ*_k(ε, m) instead of δ*_{D_k}(ε, m), and A^{(k)} instead
of A^{2ε,m}_{D_k}. A^{(k)}_j denotes the j'th column in A^{(k)} and, similarly, R^{(k)}_j denotes the
j'th column in R^{(k)}. A mixed strategy for the learner is a probability vector
p ∈ [0,1]^M (as in Section 3). A mixed strategy for the opponent is a proba-
bility vector q = [q^{(1)}, . . . , q^{(C)}] ∈ [0,1]^{CN}, where q^{(k)}_j denotes the probability
of choosing domain distribution D_k and target concept h_j. According to the
Minimax Theorem, the following holds:

    min_p max_q p^⊤Rq = max_q min_p p^⊤Rq.     (12)

Let ρ*(ε, m) denote the optimal value in (12). The following quantities refer to
a learner who makes the first draw and applies the mixed strategy p:

    ρ^p(ε, m) := max_{j=1,...,N} max_{k=1,...,C} p^⊤R^{(k)}_j ,
    ρ̄^p(ε, m) := max_{j=1,...,N} (1/C) Σ_{k=1}^C p^⊤R^{(k)}_j .

Clearly, ρ*(ε, m) = min_p ρ^p(ε, m).
Let us make perfectly clear the connection between these quantities and our
concept of a smart PAC-learner. We denote by δ^p_{j,k}(ε, m) the expected failure rate
(w.r.t. m, ε) of a PAC-learner with mixed strategy p when the target concept is h_j
and the domain distribution is D_k.³ It follows from the definition of A^{(k)} = A^{2ε,m}_{D_k}
that

    δ^p_{j,k}(2ε, m) = p^⊤A^{(k)}_j .

Thus, according to the definition of R^{(k)},

³ In contrast to the previous section, here the learner has no prior knowledge of D_k.

    δ^p_{j,k}(2ε, m) / δ*_k(ε, m) = p^⊤R^{(k)}_j ,

    max_{j=1,...,N} max_{k=1,...,C} δ^p_{j,k}(2ε, m) / δ*_k(ε, m) = ρ^p(ε, m),

    max_{j=1,...,N} (1/C) Σ_{k=1}^C δ^p_{j,k}(2ε, m) / δ*_k(ε, m) = ρ̄^p(ε, m).

It becomes obvious now that the quantities ρ^p(ε, m) and ρ̄^p(ε, m) measure how
well the general PAC-learner with mixed strategy p (and accuracy parameter 2ε)
competes against the best learner with full prior knowledge of the domain distri-
bution (and accuracy parameter ε). We call ρ^p(ε, m) the worst performance ratio
and ρ̄^p(ε, m) the average performance ratio of the mixed strategy p (although
both quantities refer to the worst case as far as the choice of the target concept
is concerned). In the sequel, a learner with mixed strategy p is identified with p,
so that we can speak of a performance ratio of the learner. A very strong result
would be ρ*(ε, m) ≤ c for some small constant c, which would mean that there
exists a learner (mixed strategy) p whose worst performance ratio is bounded
by c. But, since this is somewhat overambitious, we pursue a weaker goal in the
following and analyze the average performance ratio ρ̄^p(ε, m) instead. We make
use of the (obvious) fact that

    ρ̄*(ε, m) := min_p ρ̄^p(ε, m)

is the optimal value in the R̄-game for

    R̄ := (1/C) Σ_{k=1}^C R^{(k)}.     (13)

With this notation, the following holds:


Lemma 3. For every mixed strategy q of the opponent in the R̄-game, there
exists a mixed strategy p for the learner such that p^⊤R̄q ≤ 2.

Proof. From the decomposition (6), we get the following decomposition of p^⊤R̄q:

    p^⊤R̄q (13)= (1/C) Σ_{k=1}^C p^⊤R^{(k)}q
          = (1/C) Σ_{k=1}^C (1/δ*_{D_k}(ε, m)) p^⊤A^{(k)}q
          (6)= (1/C) Σ_{k=1}^C (1/δ*_{D_k}(ε, m)) Σ_{x,b} Pr_k(x, b) p̃(x, b)^⊤Ã^{(k)}q̃(x, b).

Here, Pr_k(x, b) = D_k^m(x) Q(b|x), Ã^{(k)} = Ã^{2ε}_{D_k}, and the quantities Q(b|x), q̃, p̃ are
derived from q and p, respectively, as explained in the previous section. According
to Lemma 2,


    Σ_{x,b} Pr_k(x, b) q̃(x, b)^⊤Ã^{(k)}q̃(x, b) ≤ 2δ*_{D_k}(ε, m).     (14)

According to Lemma 1, there exists a mixed strategy p for the learner such that
p̃ = q̃. With this choice of p, we get

    p^⊤R̄q = (1/C) Σ_{k=1}^C (1/δ*_{D_k}(ε, m)) Σ_{x,b} Pr_k(x, b) q̃(x, b)^⊤Ã^{(k)}q̃(x, b) (14)≤ 2,

as desired.                                                                      □
Let R̄j denote the j’th column of matrix R̄. The Minimax Theorem applied to
the R̄-game allows us to infer from Lemma 3 the following
Corollary 1. ρ̄∗ (ε, m) ≤ 2, i.e., there exists a learner (mixed strategy) p whose
average performance ratio is bounded by 2.
So far, we have assumed that there is a finite list of distributions and the do-
main distribution is taken from this list. We now extend these considerations to
arbitrary distributions. Recall that our domain X is finite, say X = {ξ_1, . . . , ξ_d}.
The domain distributions are in one-to-one correspondence with the vectors
taken from the probability simplex

    Δ := {z ∈ [0,1]^d : z_1 + · · · + z_d = 1}.

Specifically, D_z(ξ_ν) = z_ν for ν = 1, . . . , d. Note that instead of finitely many
block matrices R^{(1)}, . . . , R^{(C)}, as before, we now have a system of infinitely
many matrices R^{(z)}. Let f : Δ → R_+ be a continuous function that satisfies

    ∫_{z∈Δ} f(z) dz = 1     (15)

so that it can serve as a density function. For sake of simple notation, we set

    δ*_z := δ*_{D_z}(ε, m).     (16)

For every ζ > 0 and every E ⊆ Δ, let

    Pr(E) := ∫_{z∈E} f(z) dz,     (17)
    Δ_ζ := {z ∈ Δ : δ*_z ≥ ζ}.     (18)

The former Assumption (A1) is replaced now by the following assumption:

    (A2) lim_{ζ→0} Pr(Δ_ζ) = 1.

(A2) implies that Pr(Δ_ζ) > 0 for every sufficiently small ζ > 0, which is assumed
in the sequel. Since f(z)/Pr(Δ_ζ) is a continuous density function on Δ_ζ, we can
now use

    R̄^ζ := (1/Pr(Δ_ζ)) · ∫_{z∈Δ_ζ} f(z) R^{(z)} dz     (19)

as a payoff-matrix (where this matrix-equation is understood entry-wise).

Lemma 4. The integral on the right hand-side in (19) exists.

Proof. Since f is continuous and Δ is compact, f is bounded on Δ. Furthermore,
for every z ∈ Δ_ζ, each entry of R^{(z)} is bounded by 1/ζ. The lemma would be
rather obvious if the functions R^{(z)}[i,j] were continuous in z. Although this is
not the case in general, we may exploit the fact that discontinuities occur only
in the set

    E = ∪_{1≤i<j≤N} E_ij ,   where   E_ij := {z ∈ Δ : D_z(h_i ⊕ h_j) = ε}.

Note that the sets E_ij are of Lebesgue-measure zero, and so is E. For this reason,
integrating over Δ_ζ leads to the same result as integrating over Δ_ζ \ E, which,
by construction, is a set without discontinuities.                               □

The average performance ratio of a mixed strategy p for the learner now refers
to the density function f(z)/Pr(Δ_ζ) and must therefore be redefined as follows:

    ρ̄^p_ζ(ε, m) := max_{j=1,...,N} (1/Pr(Δ_ζ)) · ∫_{z∈Δ_ζ} f(z) p^⊤R^{(z)}_j dz = max_{j=1,...,N} p^⊤R̄^ζ_j ,

where R̄^ζ_j denotes the j'th column of R̄^ζ. With this notation, we get

Corollary 2. For every sufficiently small ζ > 0, there exists a learner (mixed
strategy) p whose average performance ratio is bounded by 2.

Proof. The crucial observation is that Lemma 3 is still correct when we define
R̄ := R̄^ζ according to (19). The only modification in the proof of this lemma
is the substitution of integrals for sums. Thus the Minimax Theorem applies to
the R̄-game and Corollary 2 is obtained.                                        □

Assumption (A2) and Corollary 2 combined with Markov's Inequality immedi-
ately lead to

Corollary 3. For every continuous function f : Δ → R_+ satisfying (15) and
for every constant c > 0, there exists a mixed strategy p for the learner such
that, for j = 1, . . . , N,

    Pr(z ∈ Δ : p^⊤R^{(z)}_j ≤ 2c) > 1 − 1/c.     (20)

Proof. Let ζ > 0 and R̄ = R̄^ζ. Corollary 2 combined with Markov's Inequality
shows that there exists a mixed strategy p for the learner such that, for j =
1, . . . , N,

    Pr(z ∈ Δ_ζ : p^⊤R^{(z)}_j > 2c) < 1/c.

By assumption (A2), we may conclude that

    Pr(z ∈ Δ : p^⊤R^{(z)}_j > 2c) ≤ Pr(z ∈ Δ_ζ : p^⊤R^{(z)}_j > 2c) + Pr(Δ \ Δ_ζ) < 1/c,

provided that ζ > 0 is sufficiently small. From this, Corollary 3 is immediate.  □
According to Corollary 3, there exists a mixed strategy p for a learner without
any prior knowledge of the domain distribution such that, in comparison to the
best learner with full prior knowledge of the domain distribution, a performance
ratio of 2c is achieved for the “vast majority” of distributions. The total prob-
ability mass of distributions (measured according to density function f (z)) not
belonging to the “vast majority” is bounded by 1/c. So Corollary 3 is the result
that we had announced in the introduction.

5 Discussion of Assumption (A2) and Open Problems

We claim that Assumption (A2) is not very restrictive and provide some intuition
why this might be true. For every γ > 0, let

    Δ_γ := {z ∈ Δ | ∀ν = 1, . . . , d : γ ≤ z_ν ≤ 1 − γ}.

Assume that z ∈ Δ_γ. Pick ν(z) ∈ {1, . . . , d} such that z_{ν(z)} = min{z_1, . . . , z_d}.
Clearly,

    γ ≤ z_{ν(z)} ≤ 1/d.
For sake of brevity, let ξ := ξ_{ν(z)}. For b ∈ {0,1}, consider the set

    H(ξ, b) := {h ∈ H : h(ξ) = b}.

If the opponent finds two hypotheses h, h' ∈ H(ξ, b) such that

    D_z(h ⊕ h') > 2ε,     (21)

he can assign à-priori probability 1/2 to h and h', respectively, and achieve the
following: with a probability of at least γ^m, the sample x is of the form x =
(ξ, . . . , ξ) ∈ X^m, so that the learner cannot distinguish between h and h'. Con-
ditioned on x = (ξ, . . . , ξ) ∈ X^m, the learner will therefore fail with probability
at least 1/2 (regardless of her strategy). Thus, the overall expected failure rate
is at least γ^m/2.
  The punchline of this discussion is the following implication:

    z ∈ Δ_γ ∧ 2ε < max_{b∈{0,1}} max_{g,g'∈H(ξ_{ν(z)},b)} D_z(g ⊕ g')  ⇒  δ*_z(ε, m) ≥ γ^m/2,     (22)

where the second conjunct on the left-hand side is referred to as (A3).
  Condition (A3) looks wild but it is essentially saying that the knowledge of a
single labeled example should not trivialize the resulting version space (in terms
of its diameter) too much.

Define K(H) as the smallest number K such that, for every ξ ∈ X, there
exist g_+ ∈ H(ξ, 1), g_− ∈ H(ξ, 0) and g_1, . . . , g_K ∈ H which satisfy the following
condition:

    ∀ξ' ∈ X \ {ξ}, ∃κ ∈ {1, . . . , K} :  (g_+(ξ) = g_κ(ξ) ∧ g_+(ξ') ≠ g_κ(ξ'))
                                        ∨ (g_−(ξ) = g_κ(ξ) ∧ g_−(ξ') ≠ g_κ(ξ')).

If suitable functions g_+, g_−, g_1, g_2, . . . cannot be found for some ξ ∈ X, we set
K(H) = ∞ by default.
Lemma 5. Assume that K(H) < ∞. Then, for every z ∈ Δ_γ and every d ≥ 2,
the following holds:

    max_{b∈{0,1}} max_{g,g'∈H(ξ_{ν(z)},b)} D_z(g ⊕ g') ≥ (1 − 1/d)/K(H) ≥ 1/(2K(H)).

Proof. According to the definition of ν(z), D_z(X \ {ξ_{ν(z)}}) ≥ 1 − 1/d. Let us
set ξ := ξ_{ν(z)}, let K = K(H), and let g_+, g_−, g_1, . . . , g_K be the functions chosen
in accordance with the definition of K(H). Without loss of generality, we may
assume that g_1, . . . , g_{K'} agree with g_+ on ξ whereas g_{K'+1}, . . . , g_K agree with
g_− on ξ. Note that the "disagreement sets"

    g_+ ⊕ g_1, . . . , g_+ ⊕ g_{K'} ;  g_− ⊕ g_{K'+1}, . . . , g_− ⊕ g_K

cover X \ {ξ}. Thus, by the pigeon-hole principle, there must exist a hypothesis
g ∈ {g_+, g_−} and g' ∈ {g_1, . . . , g_K} such that g(ξ) = g'(ξ) but the disagreement
set g ⊕ g' has probability mass at least (1 − 1/d)/K.                           □

Lemma 5 combined with the implication (22) yields

Corollary 4. If K(H) < ∞, then the following holds:

    ∀γ > 0, ∀z ∈ Δ_γ, ∀ε < 1/(2K(H)) :  δ*_z(ε, m) ≥ γ^m/2.

We remind the reader of (16) and of the definition of Δ_ζ in (18). Since γ^m/2 ≥ ζ
is equivalent to γ ≥ (2ζ)^{1/m}, we obtain

Corollary 5. Let K(H) < ∞ and ε < 1/(2K(H)). Define γ(ζ) := (2ζ)^{1/m}.
Then,

    Δ_ζ ⊇ Δ_{γ(ζ)}.

Since γ(ζ) → 0 for ζ → 0, and obviously lim_{γ→0} Pr(Δ_γ) = 1 for every underlying
continuous density function, we finally get

Corollary 6. Let K(H) < ∞. Then, for every ε < 1/(2K(H)), H satisfies
Assumption (A2).
A simple hypothesis class should intuitively have a good chance to trivialize the
version space resulting from a single labeled example (and, thus, a good chance
to violate Assumption (A2)). But even for the almost trivial class of half-intervals
{1, . . . , r}, 0 ≤ r ≤ d, our sufficient condition K(H) < ∞ applies. This can be
seen as follows: pick an arbitrary ξ ∈ {1, . . . , d}, and designate the following
half-intervals:

    g_+ := {1, . . . , ξ} ,   g_− := ∅,
    g_1 := {1, . . . , d} ,   g_2 := {1, . . . , ξ − 1}.

Pick an arbitrary ξ' ∈ {1, . . . , d} \ {ξ}. For ξ' > ξ, we get g_+(ξ) = g_1(ξ) = 1
and g_+(ξ') = 0 ≠ 1 = g_1(ξ'). For ξ' < ξ, we get g_−(ξ) = g_2(ξ) = 0 and
g_−(ξ') = 0 ≠ 1 = g_2(ξ'). We conclude that K(H) = 2 for the class of half-
intervals.
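This designation can also be verified mechanically. A small brute-force check
(our own verification code, with X = {1, . . . , d} represented 0-indexed) confirms
that the four half-intervals above witness the condition defining K(H) with
K = 2 for every ξ:

```python
# Brute-force check (ours) that g+, g-, g1, g2 witness K(H) <= 2 for
# the class of half-intervals over a 0-indexed domain {0, ..., d-1}.
d = 8

def witnesses_ok(xi):
    g_plus  = tuple(1 if x <= xi else 0 for x in range(d))   # {1, ..., xi}
    g_minus = tuple(0 for _ in range(d))                     # empty set
    g1      = tuple(1 for _ in range(d))                     # {1, ..., d}
    g2      = tuple(1 if x < xi else 0 for x in range(d))    # {1, ..., xi-1}
    # For every xi' != xi, some g_kappa must agree with g+ (or g-) on xi
    # while disagreeing on xi'.
    return all(any(g[xi] == gk[xi] and g[xp] != gk[xp]
                   for g in (g_plus, g_minus) for gk in (g1, g2))
               for xp in range(d) if xp != xi)

assert all(witnesses_ok(xi) for xi in range(d))
print("K(H) <= 2 verified for half-intervals with d =", d)
```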

Open problems:
– For every finite hypothesis class, we have shown the mere existence of a
  learner (mixed strategy) whose average performance ratio is bounded by 2.
  Gain more insight into how this strategy actually works, and check under
  which conditions it can be implemented efficiently.
– Prove or disprove that there exists a learner (mixed strategy) whose worst
  performance ratio is bounded by a small constant.
– Prove or disprove our claim that Assumption (A2) is not very restrictive.

Approximation Algorithms for Tensor Clustering

Stefanie Jegelka¹, Suvrit Sra¹, and Arindam Banerjee²

¹ Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany
  {jegelka,suvrit}@tuebingen.mpg.de
² Univ. of Minnesota, Twin Cities, Minneapolis, MN, USA
  banerjee@cs.umn.edu

Abstract. We present the first (to our knowledge) approximation algo-
rithm for tensor clustering—a powerful generalization of basic 1D clus-
tering. Tensors are increasingly common in modern applications dealing
with complex heterogeneous data and clustering them is a fundamental
tool for data analysis and pattern discovery. Akin to their 1D cousins,
common tensor clustering formulations are NP-hard to optimize. But,
unlike the 1D case, no approximation algorithms seem to be known. We
address this imbalance and build on recent co-clustering work to derive
a tensor clustering algorithm with approximation guarantees, allowing
metrics and divergences (e.g., Bregman) as objective functions. There-
with, we answer two open questions by Anagnostopoulos et al. (2008).
Our analysis yields a constant approximation factor independent of data
size; a worst-case example shows this factor to be tight for Euclidean
co-clustering. However, empirically the approximation factor is observed
to be conservative, so our method can also be used in practice.

1 Introduction

Tensor clustering is a recent generalization of the basic one-dimensional clus-
tering problem, and it seeks to partition an order-m input tensor into coherent
sub-tensors while minimizing some cluster quality measure [1,2]. For example,
in co-clustering, which is a special case of tensor clustering with m = 2, one si-
multaneously partitions rows and columns of an input matrix to obtain coherent
submatrices, often while minimizing a Bregman divergence [3,4].
Being generalizations of the 1D case, common tensor clustering formulations are
also NP-hard to optimize. But despite the existence of a vast body of research on
approximation algorithms for 1D clustering problems (e.g., [5,6,7,8,9,10]), there
seem to be no published approximation algorithms for tensor clustering. Even for
(2D) co-clustering, there are only two recent attempts, [11] and [12] (from 2008).
Both prove an approximation factor of 2α_1 for Euclidean co-clustering given an
α_1-approximation for k-means, and show constant approximation factors for ℓ_1-
([12], only for binary matrices) and ℓ_p-norm [11] based variants.
Tensor clustering is a basic data analysis task with growing importance; sev-
eral domains now deal frequently with tensor data, e.g., data mining [13], com-
puter graphics [14], and computer vision [2]. We refer the reader to [15] for a


recent survey about tensors and their applications. The simplest tensor clus-
tering scenario, namely, co-clustering (also known as bi-clustering) is more es-
tablished [12,4,16,17,18]. Tensor clustering is less well known, though several
researchers have considered it before [1,2,19,20,21].

1.1 Contributions
The main contribution of this paper is the analysis of an approximation algo-
rithm for tensor clustering that achieves an approximation ratio of O(p(m)α),
where m is the order of the tensor, p(m) = m or p(m) = m^{log_2 3}, and α is the
approximation factor of a corresponding 1D clustering algorithm. Our results
apply to a fairly broad class of objective functions, including metrics such as ℓ_p
norms, Hilbertian metrics [22,23], and divergence functions such as Bregman di-
problems posed by [12], viz., whether their methods for Euclidean co-clustering
could be extended to Bregman co-clustering, and if one could extend the ap-
proximation guarantees to tensor clustering. The bound also gives insight into
properties of the tensor clustering problem. We give an example for the tight-
ness of our bound for squared Euclidean distance, and provide an experimental
validation of the theoretical claims, which forms an additional contribution.

2 Background and Problem

Traditionally, "center"-based clustering algorithms seek partitions of the columns
of an input matrix X = [x_1, . . . , x_n] into clusters C = {C_1, . . . , C_K}, and find
"centers" µ_k that minimize the objective

    J(C) = Σ_{k=1}^K Σ_{x∈C_k} d(x, µ_k),     (2.1)

where the function d(x, y) measures cluster quality. The "center" µ_k of cluster
C_k is given by the mean of the points in C_k when d(x, y) is a Bregman di-
vergence [25]. Co-clustering extends (2.1) to seek simultaneous partitions (and
centers µ_IJ) of rows and columns of X, so that the objective function

    J(C) = Σ_{I,J} Σ_{i∈I, j∈J} d(x_ij, µ_IJ)     (2.2)

is minimized; µ_IJ denotes the (scalar) "center" of the cluster described by the
row and column index sets, viz., I and J. We generalize formulation (2.2) to
tensors in Section 2.2 after introducing some background on tensors.

2.1 Tensors

An order-m tensor A may be viewed as an element of the vector space R^{n_1×···×n_m}.
An individual entry of A is given by the multiply-indexed value a_{i_1 i_2 ... i_m}, where
i_j ∈ {1, . . . , n_j} for 1 ≤ j ≤ m. For us, the most important tensor operation
is multilinear matrix multiplication, which generalizes matrix multiplica-
tion [26]. Matrices act on other matrices by either left or right multiplication.
Similarly, for an order-m tensor, there are m dimensions on which a matrix may
act. For A ∈ R^{n_1×n_2×···×n_m} and matrices P_1 ∈ R^{p_1×n_1}, . . . , P_m ∈ R^{p_m×n_m},
multilinear multiplication is defined by the action of the P_i on the different di-
mensions of A, and is denoted by A' = (P_1, . . . , P_m) · A ∈ R^{p_1×···×p_m}. The individ-
ual components of A' are given by

    a'_{i_1 i_2 ... i_m} = Σ_{j_1,...,j_m=1}^{n_1,...,n_m} p^{(1)}_{i_1 j_1} · · · p^{(m)}_{i_m j_m} a_{j_1 ... j_m} ,

where p^{(k)}_{ij} denotes the ij-th entry of matrix P_k. The inner product between two
tensors A and B is defined as

    ⟨A, B⟩ = Σ_{i_1,...,i_m} a_{i_1 ... i_m} b_{i_1 ... i_m} ,     (2.3)

and this inner product satisfies the following natural property (which generalizes
the familiar ⟨Ax, By⟩ = ⟨x, A^⊤By⟩):

    ⟨(P_1, . . . , P_m) · A, (Q_1, . . . , Q_m) · B⟩ = ⟨A, (P_1^⊤Q_1, . . . , P_m^⊤Q_m) · B⟩.     (2.4)

Moreover, the Frobenius norm is ‖A‖² = ⟨A, A⟩. Finally, we define an arbitrary
divergence function d(X, Y) as an elementwise sum of individual divergences, i.e.,

    d(X, Y) = Σ_{i_1,...,i_m} d(x_{i_1,...,i_m}, y_{i_1,...,i_m}),     (2.5)

and we will define the scalar divergence d(x, y) as the need arises.
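As a quick numerical sanity check (our own illustration for order m = 3, not
from the paper), multilinear multiplication and Property (2.4) can be exercised
with einsum, which directly implements the component formula above:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 5, 6))
B = rng.normal(size=(3, 2, 3))
P = [rng.normal(size=(2, 4)), rng.normal(size=(3, 5)), rng.normal(size=(2, 6))]
Q = [rng.normal(size=(2, 3)), rng.normal(size=(3, 2)), rng.normal(size=(2, 3))]

def multilinear(mats, T):
    """(P1, P2, P3) . T: apply mats[j] along dimension j of an order-3 tensor."""
    return np.einsum('ia,jb,kc,abc->ijk', mats[0], mats[1], mats[2], T)

# Property (2.4): <(P1,P2,P3).A, (Q1,Q2,Q3).B> = <A, (P1'Q1, P2'Q2, P3'Q3).B>.
lhs = np.sum(multilinear(P, A) * multilinear(Q, B))
rhs = np.sum(A * multilinear([Pj.T @ Qj for Pj, Qj in zip(P, Q)], B))
assert np.allclose(lhs, rhs)
```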

2.2 Problem Formulation

Let A ∈ R^{n_1×···×n_m} be an order-m tensor that we wish to partition into co-
herent sub-tensors (or clusters). In 3D, we divide a cube into smaller cubes by
cutting orthogonal to (i.e., along) each dimension (Fig. 1). A basic approach is
to minimize the sum of the divergences between individual (scalar) elements in
each cluster and their corresponding (scalar) cluster "centers". Readers familiar
with [4] will recognize this to be a "block-average" variant of tensor clustering.
  Assume that each dimension j (1 ≤ j ≤ m) is partitioned into k_j clusters. Let
C_j ∈ {0,1}^{n_j×k_j} be the cluster indicator matrix for dimension j, where the ik-th
entry of such a matrix is one if and only if index i belongs to the k-th cluster
(1 ≤ k ≤ k_j) for dimension j. Then, the tensor clustering problem is (cf. (2.2)):

    minimize_{C_1,...,C_m,M} d(A, (C_1, . . . , C_m) · M),   s.t. C_j ∈ {0,1}^{n_j×k_j},     (2.6)

where the tensor M collects all the cluster "centers."

3 Algorithm and Analysis

Given formulation (2.6), our algorithm, which we name Combination Tensor
Clustering (CoTeC), follows the simple outline:

1. Cluster along each dimension j, using an approximation algorithm
   to obtain clustering C_j; let C = (C_1, . . . , C_m).
2. Compute M = argmin_{X∈R^{k_1×···×k_m}} d(A, C · X).
3. Return the tensor clustering (C_1, . . . , C_m) (with representatives M).
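To make the outline concrete, here is a minimal sketch of CoTeC (our own code,
not the authors' implementation) for an order-3 tensor, t = 1, and squared
Euclidean distance; we use scikit-learn's KMeans, which includes k-means++
seeding, as the 1D subroutine, and Step 2 reduces to block means for this
divergence:

```python
import numpy as np
from sklearn.cluster import KMeans

def cotec(A, ks, seed=0):
    """Combination Tensor Clustering, sketched for squared Euclidean loss."""
    labels = []
    for j, k in enumerate(ks):
        # Step 1: cluster dimension j on its unfolding (one row per index).
        unfolding = np.moveaxis(A, j, 0).reshape(A.shape[j], -1)
        km = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels.append(km.fit_predict(unfolding))
    # Step 2: for d(x, y) = (x - y)^2 the optimal M holds the block means.
    M, counts = np.zeros(ks), np.zeros(ks)
    for idx in np.ndindex(*A.shape):
        block = tuple(labels[j][idx[j]] for j in range(A.ndim))
        M[block] += A[idx]
        counts[block] += 1
    M /= np.maximum(counts, 1)
    # Objective (2.6): squared residuals to the block "centers".
    J = sum((A[idx] - M[tuple(labels[j][idx[j]] for j in range(A.ndim))]) ** 2
            for idx in np.ndindex(*A.shape))
    return labels, M, J

A = np.random.default_rng(1).normal(size=(12, 10, 8))
labels, M, J = cotec(A, ks=(3, 2, 2))
print("CoTeC objective:", J)
```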

Remark 1. Instead of clustering one dimension at a time in Step 1, we can also


cluster along t dimensions simultaneously. In such a t-dimensional clustering of
an order-m tensor, we form groups of order-(m − t) tensors.

Fig. 1. CoTeC: Cluster along dimensions one (C1), two (C2), three (C3) separately
and combine the results; μ_{3,1,3} is the mean of sub-tensor (cluster) (3,1,3). The vari-
ous clusters in the final tensor clustering are color coded to indicate combination of
contributions from clusters along each dimension.

Our algorithm might be counterintuitive to some readers, as merely clustering
along individual dimensions and then combining the results is against the idea
of "co"-clustering, where one simultaneously clusters along different dimensions.
However, our analysis shows that dimension-wise clustering suffices to obtain
strong approximation guarantees for tensor clustering—a fact often observed
empirically too. It is also easy to see that CoTeC runs in time O((m/t)T(t)) if
the subroutine for dimension-wise clustering takes T(t) time.
  The main contribution of this paper is the following approximation guarantee
for CoTeC, which we prove in the remainder of this section.

Theorem 1 (Approximation). Let A be an order-m tensor and let C_j denote
its clustering along the jth subset of t dimensions (1 ≤ j ≤ m/t), as obtained
from a multiway clustering algorithm with guarantee α_t.¹ Let C = (C_1, . . . , C_{m/t})

¹ We say an approximation algorithm has guarantee α if it yields a solution that
  achieves an objective value within a factor O(α) of the optimum.

denote the induced tensor clustering, and J_OPT(m) the objective of the best
m-dimensional clustering. Then,

    J(C) ≤ p(m/t) ρ_d α_t J_OPT(m),     (3.1)

with
1. ρ_d = 1 and p(m/t) = 2^{log_2(m/t)} if d(x, y) = (x − y)²,
2. ρ_d = 1 and p(m/t) = 3^{log_2(m/t)} if d(x, y) is a metric².

Thm. 1 is quite general, and it can be combined with some natural assumptions
(see §3.3) to yield results for tensor clustering with general divergence functions
(though ρ_d might be greater than 1). For particular choices of d one can perhaps
derive tighter bounds, though for squared Euclidean distances we provide an
explicit example (Fig. 2) that shows the bound to be tight in 2D.

3.1 Analysis: Theorem 1, Euclidean Case

We begin our proof with the Euclidean case, i.e., d(x, y) = (x − y)². Our proof is
inspired by the techniques of [12]. We establish that given a clustering algorithm
which clusters along t of the m dimensions at a time³ with an approximation
factor of α_t, CoTeC achieves an objective within a factor O((m/t) α_t) of the
optimal. For example, for t = 1 we can use the seeding methods of [8,9] or the
stronger approximation algorithms of [5]. We assume without loss of generality
(wlog) that m = 2^h t for an integer h (otherwise, pad in empty dimensions).
  Since for the squared Frobenius norm each cluster "center" is given by the
mean, we can recast Problem (2.6) into a more convenient form. To that end,
note that the individual entries of the means tensor M are given by (cf. (2.2))

    M_{I_1 ... I_m} = (1/(|I_1| · · · |I_m|)) Σ_{i_1∈I_1,...,i_m∈I_m} a_{i_1 ... i_m} ,     (3.2)

with index sets I_j for 1 ≤ j ≤ m. Let C̄_j be the normalized cluster indicator
matrix obtained by normalizing the columns of C_j, so that C̄_j^⊤C̄_j = I_{k_j}. Then,
we can rewrite (2.6) in terms of projection matrices P_j as:

    minimize_{C=(C_1,...,C_m)} J(C) = ‖A − (P_1, . . . , P_m) · A‖²,   s.t. P_j = C̄_jC̄_j^⊤.     (3.3)
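The reformulation is easy to check numerically in the 1D case (a small sanity
check of ours): with normalized indicators, P = C̄C̄^⊤ replaces every point by its
cluster mean, so ‖X − PX‖² is exactly the k-means objective:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                    # 8 points in R^3
labels = np.array([0, 0, 1, 1, 1, 2, 2, 2])
C = np.eye(3)[labels]                          # indicator matrix, shape (8, 3)
Cbar = C / np.sqrt(C.sum(axis=0))              # normalized: Cbar.T @ Cbar = I
P = Cbar @ Cbar.T                              # projection onto blockwise-constant vectors

means = np.array([X[labels == k].mean(axis=0) for k in range(3)])
assert np.allclose(P @ X, means[labels])       # P averages within each cluster
assert np.allclose(((X - P @ X) ** 2).sum(),   # (3.3) objective ...
                   ((X - means[labels]) ** 2).sum())  # ... equals the k-means loss
```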

Lemma 1 (Pythagorean). Let P = (P_1, . . . , P_t), P^⊥ = (I − P_1, . . . , I − P_t)
be collections of projection matrices P_j, and let S and R be arbitrary collections
of m − t projection matrices. Then,

    ‖(P, S) · A + (P^⊥, R) · B‖² = ‖(P, S) · A‖² + ‖(P^⊥, R) · B‖².

² The results can be trivially extended to λ-relaxed metrics that satisfy d(x, y) ≤
  λ(d(x, z) + d(z, y)); the corresponding approximation factor just gets scaled by λ.
³ One could also consider clustering differently sized subsets of the dimensions, say
  {t_1, . . . , t_r}, where t_1 + · · · + t_r = m. However, this requires unilluminating notational
  jugglery, which we skip for simplicity of exposition.

Proof. Using ‖A‖² = ⟨A, A⟩ we can rewrite the l.h.s. as

    ‖(P, S) · A + (P^⊥, R) · B‖² = ‖(P, S) · A‖² + ‖(P^⊥, R) · B‖² + 2⟨(P, S) · A, (P^⊥, R) · B⟩,

from which the last term is immediately seen to be zero using Property (2.4)
and the fact that P_j^⊤(I − P_j) = P_j(I − P_j) = 0.

Some more notation. Since we cluster along t dimensions at a time, we re-
cursively partition the initial set of all m dimensions until (after log_2(m/t) + 1
steps) the sets of dimensions have length t. Let l denote the level of recursion,
starting at l = log_2(m/t) = h and going down to l = 0. At level l, the sets of
dimensions have length 2^l t (so that for l = 0 we have t dimensions). We
represent each clustering along a subset of 2^l t dimensions by its corresponding
2^l t projection matrices. We gather these projection matrices into the collection
P_i^l (note boldface), where the index i ranges from 1 to 2^{h−l}.
  We also need some notation to represent a complete tensor clustering along all
m dimensions where only a subset of 2^l t dimensions is clustered. We pad the
collection P_i^l with m − 2^l t identity matrices for the non-clustered dimensions,
and call this padded collection Q_i^l. With recursive partitioning of the dimensions,
Q_i^l subsumes Q_j^0 for 2^l(i − 1) < j ≤ 2^l i, i.e.,

    Q_i^l = Π_{j=2^l(i−1)+1}^{2^l i} Q_j^0 .

At level 0, the algorithm yields the collections Q_i^0 and P_i^0. The remaining clus-
terings are simply combinations, i.e., products of these level-0 clusterings. We
denote the collection of m − 2^l t identity matrices (of appropriate size) by I^l,
so that Q_1^l = (P_1^l, I^l). Accoutered with our notation, we now prove the main
lemma that relates the combined clustering to its sub-clusterings.
Lemma 2. Let A be an order-m tensor and m ≥ 2^l t. The objective function for
any 2^l t-dimensional clustering P_i^l = (P^0_{2^l(i−1)+1}, . . . , P^0_{2^l i}) can be bounded via
the sub-clusterings along only one set of dimensions of size t as

    ‖A − Q_i^l · A‖² ≤ 2^l max_{2^l(i−1)<j≤2^l i} ‖A − Q_j^0 · A‖².     (3.4)

We can always (wlog) permute dimensions so that any set of 2^l clustered dimen-
sions maps to the first 2^l ones. Hence, it suffices to prove the lemma for i = 1,
i.e., the first 2^l dimensions.

Proof. We prove the lemma for i = 1 by induction on l.
Base: Let l = 0. Then Q_1^l = Q_1^0, and (3.4) holds trivially.
Induction: Assume the claim holds for l ≥ 0. Consider a clustering P_1^{l+1} =
(P_1^l, P_2^l), or equivalently Q_1^{l+1} = Q_1^l Q_2^l. Using P + P^⊥ = I, we decompose A as

    A = (P_1^{l+1} + P_1^{l+1,⊥}, I^{l+1}) · A = (P_1^l + P_1^{l,⊥}, P_2^l + P_2^{l,⊥}, I^{l+1}) · A
      = (P_1^l, P_2^l, I^{l+1}) · A + (P_1^{l,⊥}, P_2^l, I^{l+1}) · A + (P_1^l, P_2^{l,⊥}, I^{l+1}) · A + (P_1^{l,⊥}, P_2^{l,⊥}, I^{l+1}) · A
      = Q_1^l Q_2^l · A + Q_1^{l,⊥} Q_2^l · A + Q_1^l Q_2^{l,⊥} · A + Q_1^{l,⊥} Q_2^{l,⊥} · A,

where Q_1^{l,⊥} = (P_1^{l,⊥}, I^l). Since Q_1^{l+1} = Q_1^l Q_2^l, the Pythagorean Property (Lemma 1) yields

    ‖A − Q_1^{l+1} · A‖² = ‖Q_1^{l,⊥} Q_2^l · A‖² + ‖Q_1^l Q_2^{l,⊥} · A‖² + ‖Q_1^{l,⊥} Q_2^{l,⊥} · A‖².

Combining the above equalities with the assumption (wlog) ‖Q_1^{l,⊥} Q_2^l · A‖² ≥
‖Q_1^l Q_2^{l,⊥} · A‖², we obtain the inequalities

    ‖A − Q_1^l Q_2^l · A‖² ≤ 2( ‖Q_1^{l,⊥} Q_2^l · A‖² + ‖Q_1^{l,⊥} Q_2^{l,⊥} · A‖² )
        = 2‖Q_1^{l,⊥} Q_2^l · A + Q_1^{l,⊥} Q_2^{l,⊥} · A‖² = 2‖Q_1^{l,⊥}(Q_2^l + Q_2^{l,⊥}) · A‖²
        = 2‖Q_1^{l,⊥} · A‖² = 2‖A − Q_1^l · A‖²
        ≤ 2 max_{j=1,2} ‖A − Q_j^l · A‖² ≤ 2 · 2^l max_{1≤j≤2^{l+1}} ‖A − Q_j^0 · A‖²,

where the last step follows from the induction hypothesis (3.4), and the two
norm terms in the first line are combined using the Pythagorean Property.
Proof (Thm. 1, Case 1). Let m = 2^h t. Using an algorithm with guarantee α_t,
we cluster each subset (indexed by i) of t dimensions to obtain Q_i^0. Let S_i be the
optimal sub-clustering of subset i, i.e., the result that Q_i^0 would be if α_t were 1.
We bound the objective for the collection of all m sub-clusterings P_1^h = Q_1^h as

    ‖A − Q_1^h · A‖² ≤ 2^h max_j ‖A − Q_j^0 · A‖² ≤ 2^h α_t max_j ‖A − S_j · A‖².     (3.5)

The first inequality follows from Lemma 2, while the last inequality follows from
the α_t approximation factor that we used to get sub-clustering Q_j^0.
So far we have related our approximation to an optimal sub-clustering along
a set of dimensions. Let us hence look at the relation between such an optimal
sub-clustering S of the first t dimensions (via permutation, these dimensions
correspond to an arbitrary subset of size t), and the optimal tensor clustering F
along all the m = 2^h t dimensions. Recall that a clustering can be expressed by
either the projection matrices collected in Q_1^l, or by cluster indicator matrices
C_i together with the mean tensor M, so that

    (C_1, . . . , C_{2^l t}, I^l) · M = Q_1^l · A.

Let C_j^S and C_j^F be the dimension-wise cluster indicator matrices for S and F,
respectively. By definition, S solves

    min_{C_1,...,C_t,M} ‖A − (C_1, . . . , C_t, I^0) · M‖²,   s.t. C_j ∈ {0,1}^{n_j×k_j},

which makes S even better than the sub-clustering (C_1^F, . . . , C_t^F) induced by the
optimal m-dimensional clustering F. Thus,

    ‖A − S · A‖² ≤ min_M ‖A − (C_1^F, . . . , C_t^F, I^0) · M‖²
                ≤ ‖A − (C_1^F, . . . , C_t^F, I^0)(I, . . . , I, C_{t+1}^F, . . . , C_m^F) · M^F‖²
                = ‖A − F · A‖²,     (3.6)

where M^F is the tensor of means for the optimal m-dimensional clustering. Com-
bining (3.5) with (3.6) yields the final bound for the combined clustering C = Q_1^h:

    J_m(C) = ‖A − Q_1^h · A‖² ≤ 2^h α_t ‖A − F · A‖² = 2^h α_t J_OPT(m),

which completes the proof of the theorem.

Tightness of Bound. How tight is the bound for CoTeC implied by Thm. 1?
The following example shows that for Euclidean co-clustering, i.e., m = 2, the
bound is tight. Specifically, for every 0.25 > γ > 0, there exists a matrix for
which the approximation is as bad as J(C) = (m − γ)JOPT (m).
Let ε be such that γ = 4ε(1 + ε)^{−2}. The optimal 1D row clustering C_1 for the
matrix in Figure 2 groups rows {1, 2} and {3, 4} together, and the optimal column
clustering is C_2 = ({a, b}, {c, d}). The co-clustering loss for the combination is
J_2(C_1, C_2) = 8 + 8ε². The optimal co-clustering, grouping columns {a, d} and
{b, c} (and rows as in C_1), achieves an objective of J_OPT(2) = 4(1 + ε)². Relating
these results, we get J_2(C_1, C_2) = (2 − γ)J_OPT(m). However, this example is a
worst-case scenario; the average factor is much better in practice, as revealed by
our experiments (§4). The latter, combined with the structure of this negative
example, suggests that with some assumptions on the data one can probably
obtain tighter bounds. Also note that the bound holds for a CoTeC-like scheme
treating dimensions separately, but not necessarily for all approximation
algorithms.

           a         b         c         d
    1      ε        −ε        −1         1
    2      1         ε        −1        −ε
    3    10 − ε      9       10 + ε     11
    4     11      10 + ε      9       10 − ε

Fig. 2. Matrix with co-clustering approximation factor 2 − 4ε(1 + ε)^{−2}
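Worst-case instances such as Fig. 2 can be checked with a brute-force evaluator
(our own utility code, not from the paper) that computes the block-means
objective (2.2) for every 2 × 2 partition of rows and columns; for a 4 × 4 matrix
the exhaustive search is instantaneous:

```python
import numpy as np
from itertools import product

def coclustering_objective(X, rows, cols):
    """Block-means co-clustering loss (2.2) for given row/column labels."""
    J = 0.0
    for I in np.unique(rows):
        for K in np.unique(cols):
            block = X[np.ix_(rows == I, cols == K)]
            J += ((block - block.mean()) ** 2).sum()
    return J

def best_2x2_coclustering(X):
    """Exhaustive search over all 2-cluster row and column partitions."""
    n, d = X.shape
    return min((coclustering_objective(X, np.array(r), np.array(c)), r, c)
               for r in product([0, 1], repeat=n)
               for c in product([0, 1], repeat=d))
```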

3.2 Analysis: Theorem 1, Metric Case

Now we present our proof of Thm. 1 for the case where d(x, y) is a metric. For
this case, recall that the tensor clustering problem is

    minimize_{(C_1,...,C_m),M} J(C) = d(A, (C_1, . . . , C_m) · M),   s.t. C_j ∈ {0,1}^{n_j×k_j}.     (3.7)

Since in general the best representative M is not the mean tensor, we cannot use
the shorthand P · A for M, so the proof differs from the Euclidean case.
The following lemma is the basis of the induction for this case of Thm. 1.

Lemma 3. Let A be of order m = 2^h t, and let R_i^l be the clustering of the i-th
subset of 2^l t dimensions (for l < h) with an approximation guarantee of α_{2^l t};
R_i^l combines the C_j in a manner analogous to how Q_i^l combines projection
matrices. Then the combination R^{l+1} = R_i^l R_j^l, i ≠ j, satisfies

    min_M d(A, R^{l+1} · M) ≤ 3α_{2^l t} min_M d(A, F^{l+1} · M),

where F^{l+1} is the optimal joint clustering of the dimensions covered by R^{l+1} (as
before, we always assume that R_i^l and R_j^l cover disjoint subsets of dimensions).

Proof. Without loss of generality, we prove the lemma for R_1^{l+1} = R_1^l R_2^l. Let
M_i^l = argmin_X d(A, R_i^l · X) be the associated representatives for i = 1, 2, and S_i^l
the optimal 2^l t-dimensional clusterings. Further let F_1^{l+1} = F_1^l F_2^l be the optimal
2^{l+1} t-dimensional clustering. The following step is vital in relating objective val-
ues of R_1^{l+1} and S_i^l; the optimal sub-clusterings will eventually be bounded by
the objective of the optimal F_1^{l+1}. Let L = 2^{l+1} t, and

    M = argmin_X d(R_1^l · M_1^l, R_1^l R_2^l · X),   X ∈ R^{k_1×···×k_L×n_{L+1}×···×n_m}.

Let i, j be multi-indices running over dimensions 1 to 2^l t, and 2^l t + 1 to 2^{l+1} t,
respectively; let r be the multi-index covering the remaining m − L dimensions.
The multi-indices of the clusters defined by R_1^l and R_2^l, respectively, are I and
J. Since M is the element-wise minimum, we have

    d(R_1^l · M_1^l, R_1^l R_2^l · M) = Σ_{I,J} Σ_{i∈I,r} min_{μ_{IJr}∈R} Σ_{j∈J} d((μ_1^l)_{Ijr}, μ_{IJr})
                               ≤ Σ_{I,J} Σ_{i∈I,r} Σ_{j∈J} d((μ_1^l)_{Ijr}, (μ_2^l)_{iJr}) = d(R_1^l · M_1^l, R_2^l · M_2^l).

Using this relation and the triangle inequality, we can now relate the objectives
for the combined clustering and for the optimal sub-clusterings:

    min_{M^{l+1}} d(A, R_1^l R_2^l · M^{l+1}) ≤ d(A, R_1^l R_2^l · M)
        ≤ d(A, R_1^l · M_1^l) + d(R_1^l · M_1^l, R_1^l R_2^l · M)
        ≤ d(A, R_1^l · M_1^l) + d(R_1^l · M_1^l, R_2^l · M_2^l)
        ≤ 2 d(A, R_1^l · M_1^l) + d(A, R_2^l · M_2^l)
        ≤ 2α_{2^l t} min_{X_1} d(A, S_1^l · X_1) + α_{2^l t} min_{X_2} d(A, S_2^l · X_2).     (3.8)

However, owing to the optimality of S_1^l, we have

    min_{X_1} d(A, S_1^l · X_1) ≤ min_{Y^l} d(A, F_1^l · Y^l) ≤ min_{Y^{l+1}} d(A, F_1^l F_2^l · Y^{l+1}),

and analogously for S_2^l. Plugging this inequality into (3.8), we get

    min_{M^{l+1}} d(A, R_1^l R_2^l · M^{l+1}) ≤ 3α_{2^l t} min_{Y^{l+1}} d(A, F_1^l F_2^l · Y^{l+1}) = 3α_{2^l t} min_{Y^{l+1}} d(A, F_1^{l+1} · Y^{l+1}).

Proof (Thm. 1, Case 2). Given Lemma 3, the proof of Thm. 1 for the metric
case follows easily by induction if we hierarchically combine the sub-clusterings
and use α_{2^{l+1} t} = 3α_{2^l t}, for l ≥ 0, as stated by the lemma.

3.3 Implications
We now mention several important implications of Theorem 1.

Clustering with Bregman divergences. Bregman divergence based cluster-
ing and co-clustering are well-studied problems [25,4]. Here, the function d(x, y)
is parametrized by a strictly convex function f [24], so that d(x, y) = B_f(x, y) =
f(x) − f(y) − f'(y)(x − y). Under the assumption (also see [5,6])

    σ_L ‖x − y‖² ≤ B_f(x, y) ≤ σ_U ‖x − y‖²     (3.9)

on the curvature of the divergence B_f(x, y), we can invoke Thm. 1 with ρ_d =
σ_U/σ_L. The proofs are omitted for brevity, and may be found in [27]. We would
like to stress that such curvature bounds seem to be necessary to guarantee
constant approximation factors for the underlying 1D clustering—this intuition
is reinforced by the results of [28], who avoided such curvature assumptions
and had to be content with a non-constant O(log n) approximation factor for
information-theoretic clustering.
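As a numeric illustration of (3.9) (our own check, with an assumed domain
[a, b] bounded away from zero): for the KL-generating function f(x) = x log x,
the Taylor form B_f(x, y) = f''(ξ)(x − y)²/2 with ξ between x and y gives
σ_L = 1/(2b) and σ_U = 1/(2a):

```python
import numpy as np

a, b = 0.1, 1.0                                   # assumed domain [a, b]
rng = np.random.default_rng(0)
x, y = rng.uniform(a, b, size=(2, 100_000))

bf = x * np.log(x / y) - x + y                    # B_f for f(x) = x log x
lower = (x - y) ** 2 / (2 * b)                    # sigma_L * (x - y)^2
upper = (x - y) ** 2 / (2 * a)                    # sigma_U * (x - y)^2
assert np.all(bf >= lower - 1e-12) and np.all(bf <= upper + 1e-12)
```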

Clustering with ℓ_p-norms. Thm. 1 (metric case) immediately yields approx-
imation factors for clustering with ℓ_p-norms. We note that for binary matrices,
using t = 2 and the results of [11], we can obtain the slightly stronger guarantee

    J(C) ≤ 3^{log_2(m)−1} (1 + √2) α_1 J_OPT(m).

Exploiting 1D clustering results. Substituting the approximation factors α1


of existing 1D clustering algorithms in Thm. 1 (with t = 1) instantly yields spe-
cific bounds for corresponding tensor clustering algorithms. Table 1 summarizes
these results, however we omit proofs for lack of space—see [27] for details.

Table 1. Approximation guarantees for tensor clustering algorithms. K* denotes the
maximum number of clusters, i.e., K* = max_j k_j; c is some constant.

  Problem Name                 Approx. Bound                              Proof
  Metric tensor clustering     J(C) ≤ m(1 + ε) J_OPT(m)                   Thm. 1 + [6]
  Bregman tensor clustering    E[J(C)] ≤ 8mc(log K* + 2) J_OPT(m)         (3.9), Thm. 1 + [7]
  Bregman tensor clustering    J(C) ≤ m σ_U σ_L^{−1} (1 + ε) J_OPT(m)     (3.9), Thm. 1 + [5]
  Bregman co-clustering        Above two results with m = 2               as above
  Hilbertian metrics           E[J(C)] ≤ 8m(log K* + 2) J_OPT(m)          See [27]

4 Experimental Results
Our bounds depend strongly on the approximation factor αt of an underlying
t-dimensional clustering method. In our experiments, we study this close depen-
dence for t = 1, wherein we compare the tensor clusterings arising from different
1D methods of varying sophistication. Keep in mind that the comparison of the
1D methods is to see their impact on the tensor clustering built on top of them.
Our experiments reveal that the empirical approximation factors are usu-
ally smaller than the theoretical bounds, and these factors depend on statistical

properties of the data. We also observe the linear dependence of the CoTeC ob-
jectives on the associated 1D objectives, as suggested by Thm. 1 (for Euclidean)
and Table 1 (2nd row, for KL Divergence).
Further comparisons show that in practice, CoTeC is competitive with a
greedy heuristic SiTeC (Simultaneous Tensor Clustering), which simultaneously
takes all dimensions into account, but lacks theoretical guarantees. As expected,
initializing SiTeC with CoTeC yields lower final objective values using fewer “si-
multaneous” iterations.
We focus on Euclidean distance and KL Divergence to test CoTeC. To study
the effect of the 1D method, we use two seeding methods: uniform drawing, and
distance-based (weighted farthest-first) drawing. The latter ensures 1D approxi-
mation factors for E[J(C)] by [7] for Euclidean clustering and by [8,9] for KL
Divergence.

Fig. 3. Tensor clustering variants: data is seeded uniformly or distance-specifically,
optionally refined by 1D k-means, combined by CoTeC (r, s, rk, sk), and each CoTeC
variant is optionally refined by SiTeC (rc, sc, rkc, skc).
We use each seeding by itself and as an initialization for k-means to get four
1D methods for each divergence (see Fig. 3). We refer to the CoTeC combination
of the corresponding independent 1D clusterings by abbreviations: (1) ‘r:’ uni-
formly sample centers from the data points and assign each point to its closest
center; (2) ‘s:’ sample centers with distance-specific seeding [7,8,9] and assign
each point to its closest center; (3) ‘rk:’ initialize Euclidean or Bregman k-means
with ‘r’; (4) ‘sk:’ initialize Euclidean or Bregman k-means with ‘s’.
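For concreteness, here is a sketch (our own code) of the distance-specific seeding
‘s’ for squared Euclidean distance, in the spirit of k-means++ [7]: each new
center is drawn with probability proportional to the squared distance to the
nearest center chosen so far:

```python
import numpy as np

def distance_seeding(X, k, rng):
    """Weighted farthest-first (D^2) seeding for squared Euclidean distance."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

X = np.random.default_rng(0).normal(size=(200, 5))
centers = distance_seeding(X, k=4, rng=np.random.default_rng(1))
labels = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)  # 'r'/'s' assignment
```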
The SiTeC method we compare to is the minimum sum-squared residue co-
clustering of [29] for Euclidean distances in 2D, and a generalization of Algo-
rithm 1 of [4] for 3D and Bregman 2D clustering. Additionally, we initialize
SiTeC with the outcome of each of the four CoTeC variants, which yields four
versions (of SiTeC), namely, rc, sc, rkc, and skc, initialized with the results
of ‘r’, ‘s’, ‘rk’, and ‘sk’, respectively. These variants inherit the guarantees of
CoTeC, as they monotonically decrease the objective value.

4.1 Experiments on Synthetic Data

For a controlled setting with synthetic data, we generate tensors A of size 75 ×
75 × 50 and 75 × 75, for which we randomly choose a 5 × 5 × 5 tensor of means
M and cluster indicator matrices C_i ∈ {0,1}^{n_i×5}. For clustering with Euclidean
distances we add Gaussian noise (from N(0, σ²) with varying σ) to A, while for
KL Divergences we use the sampling method of [4] with varying noise.
  For each noise level tested, we repeat the 1D seeding 20 times on each of five
generated tensors and average the resulting 100 objective values. To estimate the
approximation factor α_m on a tensor, we divide the achieved objective J(C) by
the objective value of the "true" underlying tensor clustering. Figure 4 shows the
empirical approximation factor α̂_m for Euclidean distance and KL Divergence;
qualitatively, the plots for tensors of order 2 and 3 do not differ.
Fig. 4. Approximation factors for 3D clustering (left) and co-clustering (right) with
increasing noise. Top row: Euclidean distances, bottom row: KL Divergence. The x
axis shows σ, the y axis the empirical approximation factor.

Table 2. (i) Improvement of CoTeC and SiTeC variants upon ‘r’ in %; the respective
reference value (J_2 for ‘r’) appears in the x column of the x=r rows. (ii) Average
number of SiTeC iterations.

(i) Bcell, Euclidean                          (i) Bcell, KL
    k_1 k_2       x         xk     xc     xkc     k_1 k_2       x          xk     xc     xkc
    20   3  x=r  5.75·10^5  31.66  20.05  33.05   20   3  x=r  3.37·10^−1  17.59  22.23  23.26
            x=s  18.83      32.24  24.61  33.36           x=s  10.54       18.44  22.99  22.98
    20   6  x=r  5.56·10^5  49.13  35.26  50.37   20   6  x=r  3.15·10^−1  18.62  24.51  25.43
            x=s  34.97      50.55  43.93  51.66           x=s  11.76       20.52  25.69  26.23
    50   3  x=r  5.63·10^5  31.10  14.77  31.76   50   3  x=r  3.20·10^−1  15.70  20.12  21.07
            x=s  15.25      32.58  19.14  33.17           x=s  9.61        17.24  20.85  21.33
    50   6  x=r  5.18·10^5  47.55  34.63  48.41   50   6  x=r  2.85·10^−1  16.38  21.61  22.57
            x=s  36.22      49.83  43.77  50.55           x=s  11.86       18.63  23.24  23.13

(ii) Bcell, Euclidean                                  (ii) Bcell, KL
    k_1 k_2     rc         rkc        sc         skc       k_1 k_2     rc          rkc        sc         skc
    20   3   7.0 ± 1.4  2.0 ± 0.2  3.9 ± 1.0  2.2 ± 0.5   20   3  10.6 ± 2.8  7.5 ± 2.0  7.4 ± 1.8  7.0 ± 2.2
    20   6  11.3 ± 2.3  2.6 ± 0.8  5.1 ± 2.0  2.7 ± 0.7   20   6  12.6 ± 3.4  8.8 ± 2.9  8.4 ± 2.1  8.1 ± 2.0
    50   3   6.2 ± 1.9  2.0 ± 0.0  3.5 ± 2.0  2.0 ± 0.0   50   3   9.1 ± 2.3  6.2 ± 1.3  6.9 ± 1.8  6.0 ± 1.3
    50   6   8.1 ± 2.1  2.1 ± 0.3  4.1 ± 1.6  2.0 ± 0.0   50   6  10.5 ± 1.8  7.7 ± 2.1  8.1 ± 2.3  6.9 ± 1.0

In all settings, the empirical factor remains below the theoretical factor. The
reason for decreasing approximation factors with higher noise could be lower
accuracy of the estimates of JOPT on the one hand, and more similar objective
values for all clusterings on the other hand. With low noise, distance-specific
380 S. Jegelka, S. Sra, and A. Banerjee

seeding s yields better results than uniform seeding r, and adding k-means on top
(rk,sk) improves the results of both. With Euclidean distances, CoTeC with well-
initialized 1D k-means (sk) competes with SiTeC. For KL Divergence, though,
SiTeC still improves on sk, and with high noise levels, 1D k-means does not
help: both rk and sk are as good as their seeding only counterparts r, s.

4.2 Experiments on Biological Data

We further assess the behavior of our method with gene expression data⁴ from
multiple sources [30,31,32]. For brevity, we only introduce two of the data sets
here, for which we present more detailed results; more datasets and experiments
are described in [27].
  The matrix Bcell [30] is a (1332 × 62) lymphoma microarray dataset of chronic
lymphocytic leukemia, diffuse large B-cell lymphoma and follicular lymphoma.
The order-3 tensor Interferon consists of gene expression levels from MS patients
treated with recombinant human interferon beta [32]. After removal of missing
values, a complete 6 × 21 × 66 tensor remained. For experiments with KL Di-
vergence, we normalized all tensors so that their entries sum to one. Since our
analysis concerns the objective function J(C) alone, we disregard the "true"
labels, which are available for only one of the dimensions.
For each data set, we repeat the sampling of centers 30 times and average the
resulting objective values. Panel (i) in Table 2 (order 2) and in Table 3 (order 3)
shows the objective value for the simplest CoTeC variant ‘r’ as a baseline, and
the relative improvements achieved by the other methods. The methods are
encoded as x, xk, xc, xkc, where x stands for r or s, depending on the row in
the table.

Table 3. (i) Improvement of CoTeC and SiTeC variants upon ‘r’ in %; the respective
reference value (J_3 for ‘r’) appears in the x column of the x=r rows.

(i) Interferon, KL
    k_1 k_2 k_3       x          xk     xc     xkc
     2   2   2  x=r  9.71·10^−1  38.58  42.46  43.53
                x=s  25.07       36.67  43.53  43.74
     2   2   3  x=r  8.17·10^−1  41.31  46.06  46.31
                x=s  33.63       43.90  46.82  47.16
     2   2   4  x=r  7.11·10^−1  39.79  44.05  45.62
                x=s  38.01       46.09  51.30  51.35

Figure 5 summarizes the average improvements for all five order-2 data sets
studied in [27]. Groups indicate methods, and colors indicate seeding techniques.
On average, a better seeding improves the results for all methods: the gray bars
are higher than their black counterparts in all groups. Just as for synthetic data,
1D k-means improves the CoTeC results here too. SiTeC (groups 3 and 4) is
better than CoTeC with mere seeding (r, s; group 1).

⁴ We thank Hyuk Cho for kindly providing us his preprocessed 2D data sets.
Fig. 5. (i) % improvement of the objective J_2(C) with respect to uniform 1D seeding
(r), averaged over all order-2 data sets and parameter settings (details in [27]). (ii)
average number of SiTeC iterations, in % with respect to initialization by r.

Notably, for Euclidean

distances, combining good 1D clusters obtained by k-means (rk,sk, group 2) is


on average better than SiTeC initialized with simple seeding (rc,sc, group 3). For
KL Divergences, on the other hand, SiTeC still outperforms all CoTeC variations.
Given the limitation to single dimensions, CoTeC performs surprisingly well
in comparison to SiTeC. Additionally, SiTeC initialized with CoTeC converges
faster to better solutions, further underscoring the utility of CoTeC.

Relation to 1D Clusterings. Our experiments support the theoretical results
and the intuitive expectation that better 1D clusterings yield better CoTeC
solutions. Can we quantify this relation?
  Theorem 1 suggests a linear dependence of the order-m factor α_m on α_1.
These factors are difficult to check empirically when optimal clusterings are
unknown. On one matrix, however, J_OPT(2)/J_OPT(1) is constant, so if the
approximation factors are tight (up to a constant factor), the ratio

    J_2(C_1, C_2)/J_1(C_i) ≈ (α_2/α_1) · J_OPT(2)/J_OPT(1),   i = 1, 2,

only depends on α_2/α_1. Stating α_2 = 2α_1 ρ_d, Thm. 1 predicts J_2/J_1 to be
independent of the 1D method, i.e., of α_1, and constant on one matrix.
  The empirical ratios J_2/J_1 in Figure 6 support this prediction, which suggests
that for CoTeC the quality of the multi-dimensional clustering directly depends
on the quality of its 1D components, both in theory and in practice.

Fig. 6. Left: average improvement of 1D clusterings (components) with respect to ‘r’.
Right: average ratio J_2/J_1, both for the same clusterings as in Figure 5.
382 S. Jegelka, S. Sra, and A. Banerjee

5 Conclusion

In this paper we presented CoTeC, a simple and, to our knowledge, the first ap-
proximation algorithm for tensor clustering, which yields approximation results
for Bregman co-clustering and tensor clustering as special cases. We proved an
approximation factor that grows linearly with the order of the tensor, and showed
tightness of the factor for the 2D Euclidean case (Fig. 2), though empirically the
observed factors are usually smaller than suggested by the theory.
  Our worst-case example also illustrates the limitation of CoTeC, namely that
it ignores the interaction between clusterings along multiple dimensions. Thm. 1
thus hints at how much information maximally lies in this interaction. Analyzing
this interplay could potentially lead to better approximation factors, e.g., by devel-
oping a co-clustering-specific seeding technique. Using such an algorithm as a
subroutine in CoTeC will yield a hybrid that combines CoTeC's simplicity with
better approximation guarantees.
Acknowledgment. AB was supported in part by NSF grant IIS-0812183.

References
1. Banerjee, A., Basu, S., Merugu, S.: Multi-way Clustering on Relation Graphs. In:
SIAM Conf. Data Mining, SDM (2007)
2. Shashua, A., Zass, R., Hazan, T.: Multi-way Clustering Using Super-Symmetric
Non-negative Tensor Factorization. In: Leonardis, A., Bischof, H., Pinz, A. (eds.)
ECCV 2006. LNCS, vol. 3954, pp. 595–608. Springer, Heidelberg (2006)
3. Dhillon, I.S., Mallela, S., Modha, D.S.: Information-theoretic co-clustering. In:
KDD, pp. 89–98 (2003)
4. Banerjee, A., Dhillon, I.S., Ghosh, J., Merugu, S., Modha, D.S.: A Generalized
Maximum Entropy Approach to Bregman Co-clustering and Matrix Approxima-
tion. JMLR 8, 1919–1986 (2007)
5. Ackermann, M.R., Blömer, J.: Coresets and Approximate Clustering for Bregman
Divergences. In: ACM-SIAM Symp. on Disc. Alg., SODA (2009)
6. Ackermann, M.R., Blömer, J., Sohler, C.: Clustering for metric and non-metric
distance measures. In: ACM-SIAM Symp. on Disc. Alg. (SODA) (April 2008)
7. Arthur, D., Vassilvitskii, S.: k-means++: The Advantages of Careful Seeding. In:
ACM-SIAM Symp. on Discrete Algorithms (SODA), pp. 1027–1035 (2007)
8. Nock, R., Luosto, P., Kivinen, J.: Mixed Bregman clustering with approximation
guarantees. In: Daelemans, W., Goethals, B., Morik, K. (eds.) ECML / PKDD
2008, Part II. LNCS (LNAI), vol. 5212, pp. 154–169. Springer, Heidelberg (2008)
9. Sra, S., Jegelka, S., Banerjee, A.: Approximation algorithms for Bregman cluster-
ing, co-clustering and tensor clustering. Technical Report 177, MPI for Biological
Cybernetics (2008)
10. Ben-David, S.: A framework for statistical clustering with constant time approx-
imation algorithms for K-median and K-means clustering. Mach. Learn. 66(2-3),
243–257 (2007)
11. Puolamäki, K., Hanhijärvi, S., Garriga, G.C.: An approximation ratio for biclus-
tering. Inf. Process. Letters 108(2), 45–49 (2008)
12. Anagnostopoulos, A., Dasgupta, A., Kumar, R.: Approximation algorithms for co-
clustering. In: Symp. on Principles of Database Systems, PODS (2008)
13. Zha, H., Ding, C., Li, T., Zhu, S.: Workshop on Data Mining using Matrices and
Tensors. In: KDD (2008)
14. Hasan, M., Velazquez-Armendariz, E., Pellacini, F., Bala, K.: Tensor Clustering for
Rendering Many-Light Animations. In: Eurographics Symp. on Rendering, vol. 27
(2008)
15. Kolda, T.G., Bader, B.W.: Tensor Decompositions and Applications. SIAM Re-
view 51(3) (to appear, 2009)
16. Hartigan, J.A.: Direct clustering of a data matrix. J. of the Am. Stat. As-
soc. 67(337), 123–129 (1972)
17. Cheng, Y., Church, G.: Biclustering of expression data. In: Proc. ISMB, pp. 93–103.
AAAI Press, Menlo Park (2000)
18. Dhillon, I.S.: Co-clustering documents and words using bipartite spectral graph
partitioning. In: KDD, pp. 269–274 (2001)
19. Bekkerman, R., El-Yaniv, R., McCallum, A.: Multi-way distributional clustering
via pairwise interactions. In: ICML (2005)
20. Agarwal, S., Lim, J., Zelnik-Manor, L., Perona, P., Kriegman, D., Belongie, S.:
Beyond pairwise clustering. In: IEEE CVPR (2005)
21. Govindu, V.M.: A tensor decomposition for geometric grouping and segmentation.
In: IEEE CVPR (2005)
22. Schölkopf, B., Smola, A.: Learning with Kernels. MIT Press, Cambridge (2001)
23. Hein, M., Bousquet, O.: Hilbertian metrics and positive definite kernels on proba-
bility measures. In: AISTATS (2005)
24. Censor, Y., Zenios, S.A.: Parallel Optimization: Theory, Algorithms, and Applica-
tions. Oxford University Press, Oxford (1997)
25. Banerjee, A., Merugu, S., Dhillon, I.S., Ghosh, J.: Clustering with Bregman Diver-
gences. JMLR 6(6), 1705–1749 (2005)
26. de Silva, V., Lim, L.H.: Tensor Rank and the Ill-Posedness of the Best Low-Rank
Approximation Problem. SIAM J. Matrix Anal. & Appl. 30(3), 1084–1127 (2008)
27. Jegelka, S., Sra, S., Banerjee, A.: Approximation algorithms for Bregman co-
clustering and tensor clustering (2009); arXiv:cs.DS/0812.0389v3
28. Chaudhuri, K., McGregor, A.: Finding metric structure in information theoretic
clustering. In: Conf. on Learning Theory, COLT (July 2008)
29. Cho, H., Dhillon, I.S., Guan, Y., Sra, S.: Minimum Sum Squared Residue based
Co-clustering of Gene Expression Data. In: SDM, pp. 114–125 (2004)
30. Kluger, Y., Basri, R., Chang, J.T.: Spectral biclustering of microarray data: Co-
clustering genes and conditions. Genome Research 13, 703–716 (2003)
31. Cho, H., Dhillon, I.: Coclustering of human cancer microarrays using minimum
sum-squared residue coclustering. IEEE/ACM Tran. Comput. Biol. Bioinf. 5(3),
385–400 (2008)
32. Baranzini, S.E., et al.: Transcription-based prediction of response to IFNβ using
supervised computational methods. PLoS Biology 3(1) (2004)
Agnostic Clustering

Maria Florina Balcan1, Heiko Röglin2,⋆, and Shang-Hua Teng3


1
College of Computing, Georgia Institute of Technology
ninamf@cc.gatech.edu
2
Department of Quantitative Economics, Maastricht University
heiko@roeglin.org
3
Computer Science Department, University of Southern California
shanghua.teng@gmail.com

Abstract. Motivated by the principle of agnostic learning, we present


an extension of the model introduced by Balcan, Blum, and Gupta [3]
on computing low-error clusterings. The extended model uses a weaker
assumption on the target clustering, which captures data clustering in the
presence of outliers or ill-behaved data points. Unlike the original tar-
get clustering property, with our new property it may no longer be the
case that all plausible target clusterings are close to each other. Instead,
we present algorithms that produce a small list of clusterings with the
guarantee that all clusterings satisfying the assumption are close to some
clustering in the list, proving both upper and lower bounds on the length
of the list needed.

1 Introduction
Problems of clustering data from pairwise distance or similarity information
are ubiquitous in science. Typical examples of such problems include clustering
proteins by function, images by subject, or documents by topic. In many of
these clustering applications there is an unknown target or desired clustering,
and while the distance information among data is merely heuristically defined,
the real goal in these applications is to minimize the clustering error with respect
to the target clustering.
A commonly used approach for data clustering is to first choose a particular
distance-based objective function Φ (e.g., k-median or k-means) and then design
a clustering algorithm that (approximately) optimizes this objective function [1,
2, 7]. The implicit hope is that approximately optimizing the objective function
will in fact produce a clustering of low clustering error, i.e., a clustering that is
pointwise close to the target clustering. Mathematically, the implicit assumption
is that the clustering error of any c-approximation to Φ on the data set is bounded
by some ε. We will refer to this assumed property as the (c, ε) property for Φ.

This work was done in part while the authors were at Microsoft Research, New
England.

⋆ Supported by a fellowship within the Postdoc-Program of the German Academic
Exchange Service (DAAD).


Balcan, Blum, and Gupta [3] have shown that by making this implicit as-
sumption explicit, one can efficiently compute a low-error clustering even in
cases when the approximation problem of the objective function is NP-hard. In
particular, they show that for any c = 1 + α > 1, if the data satisfies the (c, ε)
property for the k-median or the k-means objective, then one can produce a
clustering that is O(ε)-close to the target, even for values c for which obtaining
a c-approximation is NP-hard.
However, the (c, ε) property is a strong assumption. In real data there may
well be some data points for which the (heuristic) distance measure does not
reflect cluster membership well, causing the (c, ε) property to be violated. A
more realistic assumption is that the data satisfies the (c, ε) property only after
some number of outliers or ill-behaved data points, i.e., a ν fraction of the data
points, have been removed. We will refer to this property as the (ν, c, ε) property.
While the (c, ε) property leads to the situation that all plausible clusterings
(i.e., all the clusterings satisfying the (c, ε) property) are O(ε)-close to each
other, two different sets of outliers could result in two different clusterings satis-
fying the (ν, c, ε) property. We therefore analyze the clustering complexity of this
property [4], i.e., the size of the smallest ensemble of clusterings such that any
clustering satisfying the (ν, c, ε) property is close to a clustering in the ensemble;
we provide tight upper and lower bounds on this quantity for several interesting
cases, as well as efficient algorithms for outputting a list such that any clustering
satisfying the property is close to one of those in the list.
Perspective. The clustering framework we analyze in this paper is related in
spirit to the agnostic learning model in the supervised learning setting [6]. In the
Probably Approximately Correct (or PAC) learning model of Valiant [8], also
known as the realizable setting, the assumption is that the data distribution over
labeled examples is correctly classified by some fixed but unknown concept in
some concept class, e.g., by a linear separator. In the agnostic setting [6] how-
ever, the assumption is weakened to the hope that most of the data is correctly
classified by some fixed but unknown concept in some concept space, and the
goal is to compete with the best concept in the class by an efficient algorithm.
Similarly, one can view the (ν, c, ε) property as an agnostic version of the (c, ε)
property since we assume that the (ν, c, ε) property is satisfied if the (c, ε) prop-
erty is satisfied on most but not all of the points and moreover the points where
the property is not satisfied are adversarially chosen.
Our results. We present several algorithmic and information-theoretic results
in this new clustering model.
For most of this paper we focus on the k-median objective function. In the
case where the target clusters are large (have size Ω((ε/α + ν)n)) we show that
the algorithm in [3] can be used in order to output a single clustering that is
(ν + ε)-close to the target clustering. We then show that in the more general
case there can be multiple significantly different clusterings that can satisfy the
(ν, c, ε) property. This is true even in the case where most of the points come
from large clusters; in this case, however, we show that we can in polynomial
time output a small list of k-clusterings such that any clustering that satisfies
the property is close to one of the clusterings in the list. In the case where most
of the points come from small clusters, we provide information-theoretic bounds
on the clustering complexity of this property.
We also show how both the analysis in [3] for the (c, ε) property and our
analysis for the (ν, 1 + α, ε) property can be adapted to the inductive case, where
we imagine our given data is only a small random sample of the entire data set.
Based on the sample, our algorithm outputs a clustering or a list of clusterings of
the full domain set that are evaluated with respect to the underlying distribution.
We conclude by discussing how our analysis extends to the k-means objective
function as well.

2 The Model
The clustering problems we consider fall into the following general framework:
we are given a metric space M = (X, d) with point set X and a distance function
d : X × X → R≥0 satisfying the triangle inequality — this is the ambient space.
We are also given the actual point set S ⊆ X we want to cluster; we use n to
denote the cardinality of S. A k-clustering C is a partition of S into k (possibly
empty) sets C1, C2, . . . , Ck. In this work, we always assume that there is a true
or target k-clustering CT for the point set S.
Commonly used clustering algorithms seek to minimize some objective func-
tion or “score”. For example, the k-median clustering objective assigns to each
cluster Ci a “median” ci ∈ Ci and seeks to minimize

    Φ1(C) = Σ_{i=1}^{k} Σ_{x∈Ci} d(x, ci).

Another example is the k-means clustering objective, which assigns to each clus-
ter Ci a “center” ci ∈ X and seeks to minimize

    Φ2(C) = Σ_{i=1}^{k} Σ_{x∈Ci} d(x, ci)².

Given a function Φ and an instance (M, S), let OPTΦ = minC Φ(C), where the
minimum is over all k-clusterings of S.
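To make the two objectives concrete, the following small sketch (ours, not from the paper) evaluates Φ1 and Φ2 for a fixed clustering; the function names, the tuple representation of points, and the use of Euclidean distance are our own assumptions.

# Illustrative sketch (not from the paper): evaluating the k-median and
# k-means objectives of a fixed clustering, assuming Euclidean distance.
import math

def d(x, y):
    # Euclidean distance; any distance obeying the triangle inequality works.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def phi1(clusters, medians):
    # k-median objective: total distance of points to their cluster median.
    return sum(d(x, c) for C, c in zip(clusters, medians) for x in C)

def phi2(clusters, centers):
    # k-means objective: total squared distance of points to their center.
    return sum(d(x, c) ** 2 for C, c in zip(clusters, centers) for x in C)

# Example: two 1D clusters, reusing the medians as k-means centers.
clusters = [[(0.0,), (1.0,), (1.5,)], [(5.0,), (6.0,)]]
medians = [(1.0,), (5.0,)]
print(phi1(clusters, medians), phi2(clusters, medians))  # 2.5 2.25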
The notion of distance between two k-clusterings C = {C1, C2, . . . , Ck} and
C′ = {C1′, C2′, . . . , Ck′} that we use throughout the paper is the fraction of points
on which they disagree under the optimal matching of clusters in C to clusters
in C′; we denote that as dist(C, C′). Formally,

    dist(C, C′) = min_{σ∈Sk} (1/n) Σ_{i=1}^{k} |Ci − C′σ(i)|,

where Sk is the set of bijections σ : {1, . . . , k} → {1, . . . , k}. We say that two
clusterings C and C′ are ε-close if dist(C, C′) ≤ ε, and we say that a clustering
has error ε if it is ε-close to the target.
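Although the minimization ranges over all k! bijections, dist(C, C′) can be computed as a minimum-cost bipartite matching between the clusters of C and C′. The sketch below is our own illustration; it assumes SciPy's linear_sum_assignment and represents clusters as sets of point ids.

# Sketch (ours): dist(C, C') via an optimal matching of clusters,
# computed with the Hungarian method from SciPy.
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_distance(C, Cprime, n):
    # C, Cprime: lists of k sets of point ids; n: total number of points.
    k = len(C)
    # cost[i][j] = |C_i \ C'_j|, the points of C_i that C'_j does not cover.
    cost = np.array([[len(C[i] - Cprime[j]) for j in range(k)]
                     for i in range(k)])
    rows, cols = linear_sum_assignment(cost)  # optimal bijection sigma
    return cost[rows, cols].sum() / n

C = [{0, 1, 2}, {3, 4}]
Cp = [{3, 4}, {0, 1}]
print(clustering_distance(C, Cp, 5))  # point 2 is misplaced: 0.2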
The (1 + α, ε)-property. The following notion originally introduced in [3] and
later studied in [5] is central to our discussion:
Definition 1. Given an objective function Φ (such as k-median or k-means),
we say that instance (S, d) satisfies the (1 + α, ε)-property for Φ with respect to
the target clustering CT if all clusterings C with Φ(C) ≤ (1 + α) · OPTΦ are ε-close
to the target clustering CT for (S, d).

The (ν, 1 + α, ε)-property. In this paper, we study the following more robust
variation of Definition 1:
Definition 2. Given an objective function Φ (such as k-median or k-means),
we say that instance (S, d) satisfies the (ν, 1 + α, ε)-property for Φ with respect
to the target clustering CT if there exists a set of points S′ ⊆ S of size at least
(1 − ν)n such that (S′, d) satisfies the (1 + α, ε)-property for Φ with respect to
the clustering CT ∩ S′ induced by the target clustering on S′.
In other words our hope is that the (1 + α, ε)-property for objective Φ is satisfied
only after outliers or ill-behaved data points have been removed. Note that unlike
the case ν = 0, in general the (ν, 1 + α, ε)-property could be satisfied with respect
to multiple significantly different clusterings, since we allow the set of outliers or
ill-behaved data points to be arbitrary. As a consequence we will be interested in
the size of the smallest list any algorithm could hope to output that guarantees
that at least one clustering in the list has small error. Given the instance (S, d),
we say that a given clustering C is consistent with the (ν, 1 + α, ε)-property for Φ
if (S, d) satisfies the (ν, 1 + α, ε)-property for Φ with respect to C. The following
notion originally introduced in [4] provides a formal measure of the inherent
usefulness of a given property.
Definition 3. Given an instance (S, d) and the (ν, 1 + α, ε)-property for Φ, we
define the (γ, k)-clustering complexity of the instance (S, d) with respect to the
(ν, 1 + α, ε)-property for Φ to be the length of the shortest list of clusterings
h1, . . . , ht such that any consistent k-clustering is γ-close to some clustering in
the list. The (γ, k)-clustering complexity of the (ν, 1 + α, ε)-property for Φ is the
maximum of this quantity over all instances (S, d).
Ideally, the (ν, 1 + α, ε) property should have (γ, k)-clustering complexity polyno-
mial in k, 1/ε, 1/ν, 1/α, and 1/γ. Sometimes we analyze the clustering complexity
of our property restricted to some family of interesting clusterings. We define
this analogously:

Definition 4. Given an instance (S, d) and the (ν, 1 + α, ε)-property for Φ, we
define the (γ, k)-restricted clustering complexity of the instance (S, d) with re-
spect to the (ν, 1 + α, ε)-property for Φ and with respect to some family of clus-
terings F to be the length of the shortest list of clusterings h1, . . . , ht such that
any consistent k-clustering in the family F is γ-close to some clustering in the
list. The (γ, k)-restricted clustering complexity of the (ν, 1 + α, ε)-property for Φ
and F is the maximum of this quantity over all instances (S, d).

For example, we will analyze the (ν, 1 + α, ε)-property restricted to clusterings
in which every cluster has size Ω((ε/α + ν)n) or to the case where the average
cluster size is at least Ω((ε/α + ν)n).
Throughout the paper we use the following notations: For n ∈ N, we denote
by [n] the set {1, . . . , n}. Furthermore, log denotes the logarithm to base 2. We
say that a list C1 , C2 , C3 , . . . of clusterings is laminar if Ci+1 can be obtained
from Ci by merging some of the clusters of Ci .

3 k-Median Based Clustering: The (1 + α, ε)-Property


We start by summarizing in Section 3.1 consequences of the (1 + α, ε)-property
that are critical for the new results we present in this paper. We also describe the
algorithm presented in [3] for the case that all clusters in the target clustering
are large. Then in Section 3.2 we show how this algorithm can be extended to
and analyzed in the inductive case.

3.1 Key Properties of the (1 + α, ε)-Property


Given an instance of k-median specified by a metric space M = (X, d) and a
set of points S ⊆ X, fix an optimal k-median clustering C∗ = {C1∗, . . . , Ck∗}, and
let c∗i be the center point for Ci∗. For x ∈ S, let w(x) = mini d(x, c∗i) be the
contribution of x to the k-median objective in C∗ (i.e., x’s “weight”), and let
w2(x) be x’s distance to the second-closest center point among {c∗1, c∗2, . . . , c∗k}.
Also, let w = (1/n) Σ_{x∈S} w(x) = OPT/n be the average weight of the points.
Finally, let ε∗ = dist(CT, C∗); so, from the (1 + α, ε)-property we have ε∗ < ε.
Lemma 5 ([3]). If the k-median instance (M, S) satisfies the (1 + α, ε)-property
with respect to CT, then
(a) less than 6εn points x ∈ S have w2(x) − w(x) < αw/(2ε),
(b) if each cluster in CT has size at least 2εn, less than (ε − ε∗)n points x ∈ S
on which CT and C∗ agree have w2(x) − w(x) < αw/ε, and
(c) for every z ≥ 1, at most zεn/α points x ∈ S have w(x) ≥ αw/(zε).

Algorithm 1. k-median, the case of large target clusters


Input: τ, b.
Step 1. Construct the graph Gτ = (S, Eτ) by connecting all pairs of points {x, y} ⊆ S
with d(x, y) ≤ τ.
Step 2. Create a new graph Hτ,b where we connect two points by an edge if they
share more than bn neighbors in common in Gτ.
Step 3. Let C′ be any clustering obtained by taking the largest k components in Hτ,b,
adding the vertices of all other smaller components to any of these.
Step 4. For each point x ∈ S and each cluster Cj, compute the median distance
dmed(x, j) between x and all points in Cj.
Insert x into the cluster Ci for i = argminj dmed(x, j).
Output: Clustering C′
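For concreteness, Steps 1–3 can be phrased as plain graph operations. The sketch below is our own rendering (assuming NetworkX; the helper names h_components and threshold_clustering are ours), and it omits Step 4's median reassignment.

# Sketch (ours) of Algorithm 1, Steps 1-3, assuming NetworkX.
import networkx as nx

def h_components(S, d, tau, b):
    # Components of H_{tau,b}, largest first (Steps 1 and 2).
    n = len(S)
    G = nx.Graph()
    G.add_nodes_from(range(n))
    G.add_edges_from((i, j) for i in range(n) for j in range(i + 1, n)
                     if d(S[i], S[j]) <= tau)  # Step 1: threshold graph G_tau
    nbrs = {i: set(G[i]) for i in range(n)}
    H = nx.Graph()
    H.add_nodes_from(range(n))
    H.add_edges_from((i, j) for i in range(n) for j in range(i + 1, n)
                     if len(nbrs[i] & nbrs[j]) > b * n)  # Step 2: H_{tau,b}
    return sorted(nx.connected_components(H), key=len, reverse=True)

def threshold_clustering(S, d, tau, b, k):
    # Step 3: k largest components; all remaining points join the first.
    comps = h_components(S, d, tau, b)
    clusters = [set(c) for c in comps[:k]]
    for c in comps[k:]:
        clusters[0] |= c
    return clusters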

Theorem 6 ([3]). Assume that the k-median instance satisfies the (1 + α, ε)-
property. If each cluster in CT has size at least (3 + 10/α)εn + 2, then given w
we can efficiently find a clustering that is ε-close to CT. If each cluster in CT has
size at least (4 + 15/α)εn + 2, then we can efficiently find a clustering that is
ε-close to CT even without being given w.
Since some of the elements of this construction are essential in our subsequent
proofs, we summarize in the following the main ideas of this proof.

Main Ideas of the Construction. Assume first that we are given w. We use
Algorithm 1 with τ = 2αw/(5ε) and b = (1 + 5/α)ε. For the analysis, let us define
dcrit = αw/(5ε). We call a point x good if both w(x) < dcrit and w2(x) − w(x) ≥ 5dcrit;
else x is called bad. By Lemma 5 and the fact that ε∗ ≤ ε, if all clusters in
the target have size greater than 2εn, then at most a (1 + 5/α)ε-fraction of the
points is bad. Let Xi be the good points in the optimal cluster Ci∗, and let
B = S \ ∪Xi be the bad points. For instances satisfying the (1 + α, ε)-property,
the threshold graph Gτ defined in Algorithm 1 has the following properties:
(i) For all x, y in the same Xi, the edge {x, y} ∈ E(Gτ). (ii) For x ∈ Xi and
y ∈ Xj with j ≠ i, {x, y} ∉ E(Gτ); moreover, such points x, y do not share any
neighbors in Gτ (by the triangle inequality). This implies that each Xi is contained
in a distinct component of the graph Hτ,b; the remaining components of Hτ,b
contain vertices from the “bad bucket” B. Since the Xi’s are larger than B, we
get that the clustering C′ obtained in Step 3 by taking the largest k components
in Hτ,b and adding the vertices of all other smaller components to one of them
differs from the optimal clustering C∗ only in the bad points, which constitute an
O(ε/α) fraction of the total.
To argue that the clustering C′ is ε-close to CT, we call a point x “red” if it
satisfies w2(x) − w(x) < 5dcrit, “yellow” if it is not red but w(x) ≥ dcrit, and
“green” otherwise. So, the green points are those in the sets Xi, and we have
partitioned the bad set B into red points and yellow points. The clustering C′
agrees with C∗ on the green points, so without loss of generality we may assume
Xi ⊆ Ci. Since each cluster Ci has a strict majority of green points, all of
which are clustered as in C∗, this means that for a non-red point x, the median
distance to points in its correct cluster with respect to C∗ is less than the median
distance to points in any incorrect cluster. Thus, C′ agrees with C∗ on all non-red
points. Since there are at most (ε − ε∗)n red points on which CT and C∗ agree
by Lemma 5 — and C′ and CT might disagree on all these points — this implies
dist(C′, CT) ≤ (ε − ε∗) + ε∗ = ε, as desired.

The “unknown w” Case. If we are not given the value w, and every target
cluster has size at least (4 + 15/α)εn + 2, we instead run Algorithm 1 (with
τ = 2αw/(5ε) and b = (1 + 5/α)ε) repeatedly for different values of w, starting with
w = 0 (so the graph Gτ is empty) and at each step increasing w to the next
value such that Gτ contains at least one new edge. We say that a point is
missed if it does not belong to the k largest components of Hτ,b. The number
of missed points decreases with increasing w, and we stop with the smallest w
for which we miss at most bn = (1 + 5/α)εn points and each of the k largest
components contains more than 2bn points. Clearly, for the correct value of
w, we miss at most bn points because we miss only bad points. Additionally,
every Xi contains more than 2bn points. This implies that our guess for w
can only be smaller than the correct w, and the resulting graphs Gτ and Hτ,b
can only have fewer edges than the corresponding graphs for the correct w.
However, since we miss at most bn points and every set Xi contains more than
bn points, there must be good points from every good set Xi that are not missed.
Hence, each of the k largest components corresponds to a distinct cluster Ci∗. We
might misclassify all bad points and at most bn good points (those not in the
k largest components), but this nonetheless guarantees that each Ci contains at
least |Xi| − bn ≥ bn + 2 correctly clustered green points (with respect to C∗) and
at most bn misclassified points. Therefore, as shown above for the case of known
w, the resulting clustering C′ will correctly cluster all non-red points as in C∗
and so is at distance at most ε from CT.
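This search can be written as a simple outer loop over the thresholds at which Gτ gains an edge. The sketch below is our own reading of it; it reuses the hypothetical h_components helper from the Algorithm 1 sketch and recovers w from τ = 2αw/(5ε).

# Sketch (ours) of the "unknown w" search: stop at the first threshold for
# which at most b*n points are missed and each of the k largest components
# of H_{tau,b} contains more than 2*b*n points.
def find_w(S, d, alpha, eps, b, k):
    n = len(S)
    # The pairwise distances are exactly the values of tau at which G_tau
    # gains at least one new edge.
    taus = sorted({d(S[i], S[j]) for i in range(n) for j in range(i + 1, n)})
    for tau in taus:
        comps = h_components(S, d, tau, b)
        largest = comps[:k]
        missed = n - sum(len(c) for c in largest)
        if (len(largest) == k and missed <= b * n
                and all(len(c) > 2 * b * n for c in largest)):
            return tau * 5 * eps / (2 * alpha)  # w, since tau = 2*alpha*w/(5*eps)
    return None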

3.2 The Inductive Case


In this section we consider an inductive model in which the set S is merely a small
random subset of points of size n from a much larger abstract instance space X,
|X| = N, N ≫ n, and the clustering we output is represented implicitly through
a hypothesis h : X → Y.

Algorithm 2. Inductive k-median


Input: (S, d), ε ≤ 1, α > 0, k, n.
Training Phase:
Step 1. Set w = min{d(x, y) | x, y ∈ S} and τ = 2αw/(5ε).
Step 2. Apply Steps 1, 2 and 3 in Algorithm 1 with parameters τ and b = 2(1 + 5/α)ε
to generate a clustering C1 . . . Ck of the sample S.
Step 3. If the total number of points in C1 . . . Ck is at least (1 − b)n and each |Ci| ≥
2bn, then terminate the training phase. Else increase τ to the smallest τ′ > τ for
which Gτ′ ≠ Gτ and go to Step 2.
Testing Phase:
When a new point z arrives, compute for every cluster Ci the median distance of z to
all sample points in Ci. Assign z to the cluster that minimizes this median distance.
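The testing phase is a plain median-distance rule; a minimal sketch (ours) follows, with statistics.median standing in for the median-distance computation.

# Sketch (ours) of Algorithm 2's testing phase: assign a new point to the
# cluster minimizing the median distance to that cluster's sample points.
from statistics import median

def assign(z, clusters, d):
    # clusters: list of lists of sample points; d: the distance function.
    meds = [median(d(z, x) for x in C) for C in clusters]
    return min(range(len(clusters)), key=lambda i: meds[i])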

Our main result in this section is the following:


Theorem 7. Assume that the k-median instance (X, d) satisfies the (1 + α, ε)-
property and that each cluster in CT has size at least (6 + 30/α)εN + 2. If we draw
a sample S of size n = Θ((1/ε) ln(k/δ)), then we can use Algorithm 2 to produce a
clustering that is ε-close to the target with probability at least 1 − δ.
Proof. Let the good and bad points be defined as in Theorem 6, but over the
whole instance space X: if w is the average weight of the points in the optimal
k-median solution over the whole instance space, we call a point x good if both
w(x) < dcrit and w2(x) − w(x) ≥ 5dcrit, else x is called bad. Let Xi be the good
points in the optimal cluster Ci∗, and let B = X \ ∪Xi be the bad points. Since
each cluster in CT has size at least (6 + 30/α)εN + 2, we can show using a similar
reasoning as in Theorem 6 that |Xi| > 5|B|. Also, since our sample is large
enough, n = Θ((1/ε) ln(k/δ)), by Chernoff bounds, with probability at least 1 − δ
over the sample we have |B ∩ S| < 2(1 + 5/α)εn and |Xi ∩ S| ≥ 4(1 + 5/α)εn,
and so |Xi ∩ S| > 2|B ∩ S| for all i. This then ensures that if we apply Steps 1, 2
and 3 in Algorithm 1 with parameters τ = 2αw/(5ε) and b = 2(1 + 5/α)ε, we generate
a clustering C1 . . . Ck of the sample S that is O(b)-close to the target on the
sample. In particular, all good points in the sample that are in the same cluster
form cliques in the graph Hτ,b and good points from different clusters are in
different connected components of this graph. So, taking the largest connected
components of this graph gives us a clustering that is O(b)-close to the target
clustering restricted to the sample S.
If we do not know w, then we use the same approach as in Theorem 6. That
is, we start by setting w = 0 and increase it until the k largest components in the
corresponding graph Hτ,b cover a large fraction of the points. The key point is
that the correctness of this approach followed from the fact that the number of
good points in every cluster is more than twice the total number of bad points.
As we have argued above, this is satisfied with probability at least 1 − δ for the
sample S as well, and hence, using arguments similar to the ones in Theorem 6
implies that we cluster the whole space with error at most ε.
Note that one can speed up Algorithm 2 as follows. Instead of repeatedly calling
Algorithm 1 from scratch, we can store the graphs G and H and only add
new edges to them in every iteration of Algorithm 2. Note also that in the test
phase, when a new point z arrives, we compute for every cluster Ci the median
distance of z to all sample points in Ci (and not to all the points added to Ci
so far), and assign z to the cluster that minimizes this median distance. Note also
that a natural approach which will not work (due to the bad points) is to compute a
centroid/median for each Ci and then insert new points based on this Voronoi
diagram.

4 k-Median Based Clustering: The (ν, 1 + α, ε)-Property

We now study k-median clustering under the (ν, 1 + α, ε)-property. If C is an
arbitrary clustering consistent with the property, and its set of outliers or ill-
behaved data points is S \ S′, we will refer to w = OPT/n as the value of C or the
value of S′, where OPT is the value of the optimal k-clustering of the set S′. We
start with the simple observation that if we are given a value w corresponding to
a consistent clustering CT on a subset S′, then we can efficiently find a clustering
that is (ν + ε)-close to CT if all clusters in CT are large.
Proposition 8. Assume that the target CT is consistent with the (ν, 1 + α, ε)-
property for k-median. Assume that each target cluster has size at least (3 +
10/α)εn + 2 + 2νn. Let S′ ⊆ S with |S′| ≥ (1 − ν)n be its corresponding set
of non-outliers. If we are given the value of S′, then we can efficiently find a
clustering that is (ν + ε)-close to CT.
Proof. We can use the same argument as in Theorem 6 with the modification
that we treat the outliers or ill-behaved data points S \ S′ as additional red bad
points. To prove correctness, observe that the only property we used about red
bad points is that in the graph Gτ none of them connects to points from two
different sets Xi and Xj. Due to the triangle inequality, this is also satisfied for
the “outliers”. The proof then proceeds as in Theorem 6 above.

4.1 Large Target Clusters


We now show that the (ν + ε, k)-clustering complexity of the (ν, 1 + α, ε)-property
is 1 in the “large clusters” case. Specifically:
Theorem 9. Let F be the family of clusterings with the property that every
cluster has size at least (4 + 15/α)εn + 2 + 3νn. Then the (ν + ε, k)-restricted
clustering complexity of the (ν, 1 + α, ε)-property with respect to F is 1, and
we can efficiently find a clustering that is (ν + ε)-close to any clustering in F
that is consistent with the (ν, 1 + α, ε)-property; in particular this clustering is
(ν + ε)-close to the target CT.
Proof. Let C1 be an arbitrary clustering consistent with the (ν, 1 + α, ε)-property
of minimal value of w. Let C2 be any other consistent clustering. By definition
we know that there exist sets of points S1 and S2 of size at least (1 − ν)n such
that (Si, d) satisfies the (1 + α, ε)-property with respect to the induced clustering
Ci ∩ Si on Si, for i = 1, 2. Let w and w′ denote the values of the clusterings C1
and C2 on the sets S1 and S2, respectively; by assumption we have w ≤ w′.
Furthermore, let C1∗ and C2∗ denote the optimal k-clusterings on the sets S1 and
S2, respectively. We set τ = 2αw/(5ε) and τ′ = 2αw′/(5ε), and b = (1 + 5/α)ε + ν, and
consider the graphs Hτ,b and Hτ′,b. Let K1, . . . , Kk be the k largest connected
components in the graph Hτ,b, and let K1′, . . . , Kk′ be the k largest connected
components in the graph Hτ′,b. For j ∈ [2], let Bj = (Sj \ ∪i Xij) ∪ (S \ Sj)
denote the bad set of clustering Cj∗. As in Theorem 6, we can show that |Bj| ≤
((1 + 5/α)ε + ν)n. For i ∈ [k], we denote by Xi1 the intersection of Ki with the
good set of clustering C1∗ and we denote by Xi2 the intersection of Ki′ with the
good set of clustering C2∗. By the assumption that the size of the target clusters
is more than three times the size of the bad set, we have |Xij| ≥ 2|Bj| for all
i ∈ [k] and j ∈ [2].
As Hτ,b ⊆ Hτ′,b, this implies that (up to reordering) Ki ⊆ Ki′ for every i. This
is because otherwise, if we end up merging two components Ki and Kj before
reaching w′, then one of the clusters Kl must be a subset of B1 and so it must
be strictly smaller than (4 + 15/α)εn + 2 + 3νn. This implies that the clusterings
C1∗ and C2∗ are O(ε/α + ν)-close to each other since they can only differ on the
bad set B1 ∪ B2. By Proposition 8, this implies that also the clusterings C1 and
C2 are O(ε/α + ν)-close to each other.
Moreover, since |Xij| ≥ 2|Bj| for all i ∈ [k] and j ∈ [2], using an argument
similar to the one in Theorem 6 yields that the clusterings Cw and Cw′ obtained
by running Algorithm 1 with w and w′, respectively, are identical; moreover this
clustering is (ν + ε)-close to both C1 and C2. This follows as the outliers in the
sets S \ S1 and S \ S2 can be treated as additional red bad points, as described
in Proposition 8 above. Since C1 is an arbitrary clustering consistent with the
(ν, 1 + α, ε)-property with a minimal value of w and C2 is any other consistent
clustering, we obtain that the (ν + ε, k)-clustering complexity is 1.
By the same arguments, we can also use the algorithm for unknown w, de-
scribed after Theorem 6, to get (ν + ε)-close to any consistent clustering when
we do not know the value of w beforehand.

4.2 Target Clusters That Are Large on Average


We show here that if we allow some of the target clusters to be small, then the
(γ, k) clustering complexity of the (ν, 1 + α, )-property is larger than one — it
can be as large as k even for γ = 1/k. Specifically:

Theorem 10. For k ≤ νn and γ ≤ (1 − ν)/k the (γ, k)-clustering complexity of


the (ν, 1 + α, )-property is Ω(k).

Proof Sketch. Let A1, . . . , Ak be sets of size n(1 − ν)/k and let x1, . . . , xk be ad-
ditional points not belonging to any of the sets A1, . . . , Ak such that the optimal
k-median solution on the set A1 ∪ . . . ∪ Ak is the clustering C = {A1, . . . , Ak}
and the instance (A1 ∪ . . . ∪ Ak, d) satisfies the (1 + α, ε)-property. We assume
that S ⊆ N and that every set Ai consists of n(1 − ν)/k points at exactly the
same position ai ∈ N. In our construction, we will have a1 < . . . < ak.
By placing the point x1 very far away from all the sets Ai and by placing A1
and A2 much closer together than any other pair of sets, we can achieve that
the optimal k-median solution on the set A1 ∪ . . . ∪ Ak ∪ {x1} is the clustering
{A1 ∪ A2, A3, . . . , Ak, {x1}} and that the instance (A1 ∪ . . . ∪ Ak ∪ {x1}, d) satisfies
the (1 + α, ε)-property. We can continue analogously and place x2 very far away
from all the sets Ai and from x1. Then the optimal k-median clustering on the
set A1 ∪ . . . ∪ Ak ∪ {x1, x2} will be {A1 ∪ A2 ∪ A3, A4, . . . , Ak, {x1}, {x2}} if
A2 and A3 are much closer together than Ai and Ai+1 for i ≥ 3. The instance
also satisfies the (1 + α, ε)-property. This way, each of the clusterings {A1 ∪ . . . ∪
Ai, Ai+1, . . . , Ak, {x1}, {x2}, . . . , {xi−1}} is a consistent target clustering, and the
distance between any two of them is at least γ.
Note that in the example in Theorem 10 all the clusterings that satisfy the
(ν, 1 + α, ε)-property have the feature that the total number of points that come
from large clusters (of size at least n(1 − ν)/k) is at least (1 − ν)n. We show that
in such cases we also have an upper bound of k on the clustering complexity.

Theorem 11. Let b = (6 + 10/α)ε + ν. Let F be the family of clusterings with
the property that the total number of points that come from clusters of size at
least 2bn is at least (1 − β)n. Then the (2b + β, k)-restricted clustering complexity
of the (ν, 1 + α, ε)-property with respect to F is at most k, and we can efficiently
construct a list of length at most k such that any clustering in F that is consistent
with the (ν, 1 + α, ε)-property is (2b + β)-close to one of the clusterings in the
list.

Proof. The main idea of the proof is to use the structure of the graphs H to show
that the clusterings that are consistent with the (ν, 1 + α, ε)-property are almost
laminar with respect to each other. Note that for all w < w′ we have Gw ⊆ Gw′
and Hw ⊆ Hw′. Here we use Gw and Hw as abbreviations for Gτ and Hτ with
τ = 2αw/(5ε). In the following, we say that a cluster is large if it contains at least 2bn
elements. To find a list of clusterings that “covers” all the relevant clusterings, we
use the following algorithm. We keep increasing the value of w until we reach a
value w1 such that the following is satisfied: Let K1, . . . , Kk denote the k largest
connected components of the graph Hw1 and assume |K1| ≥ |K2| ≥ . . . ≥ |Kk|.
We set k1 = max{i ∈ [k] | |Ki| ≥ bn} and stop for the smallest w1 for which the
clusters K1, . . . , Kk1 cover together a significant fraction of the space, namely a
1 − (b + β) fraction. Let S̃ = K1 ∪ . . . ∪ Kk1. The first clustering we add to the
list contains a cluster for each of the components K1, . . . , Kk1 and it assigns the
points in S \ S̃ arbitrarily to those. Now we increase the value of w, and each
time we add an edge in Hw between two points in different components Ki and
Kj, we merge the corresponding clusters to obtain a new clustering with at least
one cluster less. We add this clustering to our list and we continue until only
one cluster is left. As in every step the number of clusters decreases by at least
one, the list of clusterings produced this way has length at most k1 ≤ k. Let
w1, w2, . . . denote the values of w for which the clusterings are added to the list.
To complete the proof, we show that any clustering C satisfying the property is
(2b + β)-close to one of the clusterings in the list we constructed. Let wC denote
the value corresponding to C. First we notice that wC ≥ w1. This follows easily
from the structure of the graph HwC: it has one connected component for every
large cluster in C, and each of these components must contain at least bn points, as
every large cluster contains at least 2bn points and the bad set contains at most
bn points. Also by definition and the fact that the size of the bad set is bounded
by bn, it follows that these components together cover at least a 1 − (b + β)
fraction of the points. This proves that wC ≥ w1 by the definition of w1. Now let
i be maximal such that wi ≤ wC. We show that the clustering we output at wi is
(2b + β)-close to the clustering C. Let K′1, . . . , K′k′ denote the components in Hwi
that evolved from the Ki and let K′′1, . . . , K′′k′′ denote the evolved components
in HwC. As wC < wi+1, k′ = k′′ and we can assume (up to reordering) that K′i = K′′i
on the set S̃. As all points in S̃ that are not in the bad set for wi are clustered
in C according to the components K′′1, . . . , K′′k′′, the clusterings corresponding
to wi and wC can only differ on S \ S̃ and the bad set for wi. Using the fact that
|S \ S̃| ≤ (b + β)n and that the size of the bad set is bounded by bn, we get that
the clustering we output at wi is (2b + β)-close to the clustering C, as desired.
Moreover, if every large cluster is at least as large as (12 + 20/α)εn + 2νn + 2βn,
then, as for w1 the size of the missed set is at most (6 + 10/α)εn + νn + βn, the
intersection of the good set with every large cluster is larger than the missed set
for wi for any i. This then implies that if we apply the median argument from
Step 4 of Algorithm 1, the clustering we get for wi is (ν + ε + β)-close to the
clustering C if i is chosen as in the previous proof. Together with Theorem 11,
this implies the following corollary.
Corollary 12. Let b = (6 + 10/α)ε + ν. Let F be the family of clusterings with
the property that the average cluster size n/k is at least 2bn/(1 − β). Then the
(ν + ε + β, k)-restricted clustering complexity of the (ν, 1 + α, ε)-property with
respect to F is at most k, and we can efficiently construct a list of length at most
k such that any clustering in F that is consistent with the (ν, 1 + α, ε)-property
is (ν + ε + β)-close to one of the clusterings in the list.
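The list construction behind Theorem 11 and Corollary 12 amounts to watching the components of Hw merge as w grows. The sketch below is a deliberately simplified proxy (ours): it reuses the hypothetical h_components helper from the Algorithm 1 sketch, treats any change of the component-size profile as a merge event, and ignores the coverage condition defining w1.

# Simplified sketch (ours) of the list construction: as w (hence tau) grows,
# components of H merge; we record one clustering per merge event, so at
# most about k clusterings are emitted.
def clustering_list(S, d, taus, b, k):
    out, prev = [], None
    for tau in sorted(taus):
        comps = h_components(S, d, tau, b)
        sizes = sorted(len(c) for c in comps)
        if sizes != prev:                 # the component structure changed
            out.append([set(c) for c in comps[:k]])
            prev = sizes
        if len(comps) == 1:               # a single cluster remains: stop
            break
    return out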

The Inductive Case. We show here how the algorithm in Theorem 11 can be
extended to the inductive setting.
Theorem 13. Let b = (6 + 10/α)ε + ν. Let F be the family of clusterings with
the property that the total number of points that come from clusters of size at
least 2bn is at least (1 − β)n. If we draw a sample S of size n = O((1/ε) ln(k/δ)),
then we can efficiently produce a list of length at most k such that any clustering
in the family F that is consistent with the (ν, 1 + α, ε)-property is 3(2b + β)-close
to one of the clusterings in the list with probability at least 1 − δ.
Proof Sketch. In the training phase, we run the algorithm in Theorem 11
over the sample to get a list of clusterings L. Then we run an independent
“test phase” for each clustering in this list. Let C be one such clustering in the
list L with clusters C1, . . . , Ck, and let S̃ be the set of relevant points defined in
Theorem 11. In the test phase, when a new point x comes in, we compute
for each cluster Ci the median distance of x to Ci ∩ S̃, and insert x into the
cluster Ci for which this median distance is smallest.
To prove correctness we use the fact that, as shown in Theorem 11, the
(2b + β, k)-clustering complexity of the (ν, 1 + α, ε)-property is at most k, when
restricted to clusterings in which the total number of points coming from clusters
of size at least 2bn is at least (1 − β)n. Let L be a list of k1 ≤ k clusterings such
that any consistent clustering is (2b + β)-close to one of them.
Now the argument is similar to the one in Theorem 7. In the proof of that
theorem, we used a Chernoff bound to argue that with probability at least 1 − δ
the good set of any cluster that is contained in the sample is more than twice
as large as the total bad set in the sample. Now we additionally apply a union
bound over the at most k clusterings in the list L to ensure this property for
each of the clusterings. From that point on the arguments are analogous to the
arguments in Theorem 7.

4.3 Small Target Clusters


We now consider the general case, where the target clusters can be arbitrarily
small. We start with a proposition showing that if we are willing to relax the notion
of closeness significantly, then the clustering complexity is still upper bounded
by k even in this general case. With a more careful analysis, we then show a
better upper bound on the clustering complexity in this general case.
Proposition 14. Let b = (6 + 10/α)ε + ν. Then the ((k + 4)b, k)-clustering
complexity of the (ν, 1 + α, ε)-property is at most k.
Proof. Let us consider a clustering C = (C1, . . . , Ck) and a set S′ ⊆ S with
|S′| ≥ (1 − ν)n such that (S′, d) satisfies the (1 + α, ε)-property with respect to
the induced target clustering C ∩ S′. Let us first have a look at the graph Gw.
There exists a bad set B of size at most bn, and for every cluster i, the points
in Xi = Ci \ B form cliques in Gw. There are no edges between Xi and Xj for
i ≠ j, and there is no point x ∈ B that is simultaneously connected to Xi and
Xj for i ≠ j.

If there are two different consistent clusterings C 1 and C 2 that have the same
value w, then, by the properties of Gw, all points in S \ (B1 ∪ B2) are identically
clustered. Hence, dist(C 1, C 2) ≤ (|B1| + |B2|)/n ≤ 2b. This implies that we do not
lose too much by choosing, for every value w with multiple consistent clusterings,
one of them as representative. To be precise, let w1 < w2 < · · · < ws′ be a list of
all values for which a correct clustering exists, and for every wi, let C i denote a
correct clustering with value wi. We construct a sparsified list L of clusterings as
follows: insert C 1 into L; if the last clustering added to L is C i, add C j for the
smallest j > i for which dist(C i, C j) ≥ (k + 2)b. This way, the list L will contain
clusterings C 1, . . . , C s with values w1, . . . , ws such that every correct clustering
is (k + 4)b-close to at least one of the clusterings in L.
It remains to bound the length s of the list L. Let us assume for contradiction
that s ≥ k + 1. According to the properties of the graphs Gwi, the clusterings that
are induced by the clusterings C 1, . . . , C k+1 on the set S \ (B1 ∪ . . . ∪ Bk+1) are
laminar. Furthermore, as the bad set B1 ∪ . . . ∪ Bk+1 has size at most (k + 1)bn,
two consecutive clusterings in the list must differ on the set S \ (B1 ∪ . . . ∪ Bk+1),
which together with the laminarity implies that two clusters must have
merged. This can happen at most k − 1 times, contradicting the assumption that
s ≥ k + 1.

We will improve the result in the above proposition by imposing that consec-
utive clusterings in the list L in the above proof are significantly different in
the laminar part. In particular we will make use of the following lemma which
shows that if we have a laminar list of clusterings then the sum of the pairwise
distances between consecutive clusterings cannot be too big; this implies that if
the pairwise distances between consecutive clusterings are all large, then the list
must be short.

Lemma 15. Let C 1, . . . , C s be a laminar list of clusterings, let k ≥ 2 denote
the number of clusters in C 1, and let β ∈ (0, 1). If dist(C i, C i+1) ≥ β for every
i ∈ [s − 1], then s ≤ min{9 log(k/β)/β, k}.

Proof. When going from C i to C i+1, clusters contained in the clustering C i merge
into bigger clusters contained in C i+1. Merging the clusters K1, . . . , Kℓ ∈ C i with
|K1| ≥ |K2| ≥ · · · ≥ |Kℓ| into a cluster K ∈ C i+1 contributes (|K2| + · · · + |Kℓ|)/n
to the distance between C i and C i+1. When going from C i to C i+1, multiple such
merges can occur and we know that their total contribution to the distance must
be at least β. We consider a single merge in which the pieces K1, . . . , Kℓ ∈ C i
merge into K ∈ C i+1 virtually as ℓ − 1 merges and associate with each of them
a type. We say that the merge corresponding to Ki, i = 2, . . . , ℓ, has type j ∈ N
if |Ki| ∈ [n/2^(j+1), n/2^j). If Ki has type j, we say that the data points contained
in Ki participate in a merge of type j.
For the step from C i to C i+1, let xij denote the total number of virtual merges
of type j that occur. The number of merges of type j that can occur during the
whole sequence from C 1 to C s is bounded from above by 2^(j+1), as each of the
n data points can participate at most once in a merge of type j. This follows
because once a data point participated in a merge of type j, it is contained in a
piece of size at least n/2^j.
We are only interested in types j ≤ L = ⌈log(k/β)⌉ + 1. As there can be at
most k − 1 merges from C i to C i+1, the total contribution to the distance between
C i and C i+1 coming from larger types can be at most k/2^(L+1) ≤ β/2. Hence for
every i ∈ [s − 1], the total contribution of types j ≤ L must be at least β/2.
In terms of the xij, these conditions can be expressed as

    ∀j ∈ [L] : Σ_{i=1}^{s−1} xij / 2^(j+1) ≤ 1   and   ∀i ∈ [s − 1] : Σ_{j=1}^{L} xij / 2^j ≥ β/2.

This yields

    (s − 1)β/4 ≤ Σ_{i=1}^{s−1} Σ_{j=1}^{L} xij / 2^(j+1) ≤ L,

and hence s ≤ 4L/β + 1 ≤ (4⌈log(k/β)⌉ + 4)/β + 1 ≤ 9 log(k/β)/β. As in every step at
least two clusters must merge, s ≤ k and the lemma follows.
We can now show the following upper bound on the clustering complexity.

Theorem 16. Let b = (6 + 10/α)ε + ν. Then the (9√(b log(k/b)), k)-clustering
complexity of the (ν, 1 + α, ε)-property is at most 4√(log(k/b)/b).
Proof. We use the same arguments as in Proposition 14. We construct L in the
same way, but with 7√(b log(k/b)) instead of (k + 2)b as the bound on the distance of
consecutive clusterings. We assume for contradiction that s ≥ t := 4√(log(k/b)/b)
and apply Lemma 15 with β = 7√(b log(k/b)) − sb ≥ 3√(b log(k/b)) to the induced
clusterings on S \ (B1 ∪ . . . ∪ Bt). This yields s < t, contradicting the assumption
that s ≥ t.

5 Discussion and Open Questions


In this work we extend the results of Balcan, Blum, and Gupta [3] on finding low-
error clusterings to the agnostic setting, where we make the weaker assumption
that the data satisfies the (c, ε) property only after some outliers have been
removed.
While we have focused in this paper on the (ν, c, ε) property for k-median,
most of our results extend directly to the k-means objective as well. In particular,
for the k-means objective one can prove an analog of Lemma 5 with different
constants, which can then be propagated through the main results of this paper.
It is worth noting that we have assumed implicitly throughout the paper
that the fraction of outliers, or a good upper bound ν on it, is known to the
algorithm. In the most general case, where no good upper bound on ν is known,
i.e., in the purely agnostic setting, we can run our algorithms 1/ε times, once for
each integral multiple of ε, thus incurring only a 1/ε multiplicative factor
increase in the clustering complexity and in the running time.

Open Questions. The main concrete technical questions left open are whether
one can show a better upper bound on the clustering complexity in the case of
small target clusters and whether in this case there is an efficient algorithm for
constructing a short list of clusterings such that every consistent clustering is
close to one of the clusterings in the list.
More generally, it would also be interesting to analyze other natural variations
of the (c, ε) property. For example, a natural direction would be to consider
variations that express the belief that only the c-approximate clusterings that might
be returned by natural approximation algorithms are close to the target. In
particular, many approximation algorithms for clustering return Voronoi-based
clusterings [7]. In this context, a natural relaxation of the (c, ε)-property is to
assume that only the Voronoi-based clusterings that are c-approximations to
the optimal solution are ε-close to the target. It would be interesting to analyze
whether this is sufficient for efficiently finding low-error clusterings, both in the
realizable and in the agnostic setting.

Acknowledgements. We thank Avrim Blum and Mark Braverman for a num-


ber of helpful discussions.

References
1. Jain, K., Mahdian, M., Saberi, A.: A new greedy approach for facility location
problems. In: STOC (2002)
2. Charikar, M., Guha, S., Tardos, E., Shmoys, D.B.: A constant-factor approximation
algorithm for the k-median problem. In: Proceedings of the Thirty-First Annual
ACM Symposium on Theory of Computing (1999)
3. Balcan, M.F., Blum, A., Gupta, A.: Approximate clustering without the approx-
imation. In: Proceedings of the ACM-SIAM Symposium on Discrete Algorithms
(2009)
4. Balcan, M.F., Blum, A., Vempala, S.: A discriminative framework for clustering via
similarity functions. In: Proceedings of the 40th ACM Symposium on Theory of
Computing (2008)
5. Balcan, M.F., Braverman, M.: Finding low error clusterings. In: Proceedings of the
22nd Annual Conference on Learning Theory (2009)
6. Kearns, M.J., Schapire, R.E., Sellie, L.M.: Toward efficient agnostic learning. Ma-
chine Learning Journal (1994)
7. Kanungo, T., Mount, D.M., Netanyahu, N.S., Piatko, C.D., Silverman, R., Wu, A.Y.:
A local search approximation algorithm for k-means clustering. In: Proceedings of
the Eighteenth Annual Symposium on Computational Geometry (2002)
8. Valiant, L.: A theory of the learnable. Commun. ACM 27(11), 1134–1142 (1984)
Author Index

Akutsu, Tatsuya 126
Angluin, Dana 171
Arias, Marta 156
Balcan, Maria Florina 384
Balcázar, José L. 156
Banerjee, Arindam 368
Becerra-Bonache, Leonor 171
Beygelzimer, Alina 247
Bilmes, Jeff 141
Bshouty, Nader H. 97
Bubeck, Sébastien 23
Carlucci, Lorenzo 323
Case, John 263
Cesa-Bianchi, Nicolò 110
Chernov, Alexey 8
Clémençon, Stéphan 216
Dasgupta, Sanjoy 1
Dębowski, Łukasz 53
Dediu, Adrian Horia 171
Gavaldà, Ricard 201
Geffner, Hector 2
Gentile, Claudio 110
Guillory, Andrew 141
Györfi, László 83
Han, Jiawei 3
Horimoto, Katsuhisa 126
Jain, Sanjay 293, 308, 338
Jegelka, Stefanie 368
Kevei, Péter 83
Kinber, Efim 308
Kötzing, Timo 263
Langford, John 247
Luo, Qinglong 293
Maillard, Odalric-Ambrym 232
Mansour, Yishay 4
Mazzawi, Hanna 97
Munos, Rémi 23
Perchet, Vianney 68
Pereira, Fernando C.N. 7
Ravikumar, Pradeep 247
Reyzin, Lev 171
Röglin, Heiko 384
Semukhin, Pavel 293
Simon, Hans Ulrich 353
Sra, Suvrit 368
Stephan, Frank 293, 338
Stoltz, Gilles 23
Szörényi, Balázs 186
Tamura, Takeyuki 126
Teng, Shang-Hua 384
Thérien, Denis 201
Vayatis, Nicolas 216, 232
Vitale, Fabio 110
Vovk, Vladimir 8
V'yugin, Vladimir V. 38
Ye, Nan 338
Yoshinaka, Ryo 278