Algorithmic Learning Theory
Series Editors
Randy Goebel, University of Alberta, Edmonton, Canada
Jörg Siekmann, University of Saarland, Saarbrücken, Germany
Wolfgang Wahlster, DFKI and University of Saarland, Saarbrücken, Germany
Volume Editors
Ricard Gavaldà
Universitat Politècnica de Catalunya
LARCA Research Group, Departament de Llenguatges i Sistemes Informàtics
Jordi Girona Salgado 1-3, 08034 Barcelona, Spain
E-mail: gavalda@lsi.upc.edu
Gábor Lugosi
ICREA and Universitat Pompeu Fabra, Department of Economics
Ramon Trias Fargas 25-27, 08005 Barcelona, Spain
E-mail: gabor.lugosi@gmail.com
Thomas Zeugmann
Hokkaido University, Division of Computer Science
N-14, W-9, Sapporo 060-0814, Japan
E-mail: thomas@ist.hokudai.ac.jp
Sandra Zilles
University of Regina, Department of Computer Science
Regina, Saskatchewan, Canada S4S 0A2
E-mail: zilles@cs.uregina.ca
CR Subject Classification (1998): I.2, I.2.6, K.3.1, F.2, G.2, I.2.2, I.5.3
ISSN 0302-9743
ISBN-10 3-642-04413-1 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-04413-7 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer. Violations are liable
to prosecution under the German Copyright Law.
springer.com
© Springer-Verlag Berlin Heidelberg 2009
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper SPIN: 12760312 06/3180 543210
Preface
This volume contains the papers presented at the 20th International Conference
on Algorithmic Learning Theory (ALT 2009), which was held in Porto, Portugal,
October 3–5, 2009. The conference was co-located with the 12th International
Conference on Discovery Science (DS 2009). The technical program of ALT 2009
contained 26 papers selected from 60 submissions, and 5 invited talks. The in-
vited talks were presented during the joint sessions of both conferences.
ALT 2009 was the 20th in the ALT conference series, established in Japan
in 1990. The conference series Analogical and Inductive Inference is a predecessor of
this series: its meetings were held in 1986, 1989, and 1992; it was co-located with ALT
in 1994 and subsequently merged with ALT. ALT maintains its strong connections to Japan,
but has also been held in other countries, such as Australia, Germany, Hungary,
Italy, Singapore, Spain, and the USA. The ALT series is supervised by its Steer-
ing Committee: Naoki Abe (IBM Thomas J. Watson Research Center, Yorktown,
USA), Shai Ben-David (University of Waterloo, Canada), Phil Long (Google,
Mountain View, USA), Gábor Lugosi (Pompeu Fabra University, Barcelona,
Spain), Akira Maruoka (Ishinomaki Senshu University, Japan), Takeshi Shino-
hara (Kyushu Institute of Technology, Iizuka, Japan), Frank Stephan (National
University of Singapore, Republic of Singapore), Einoshin Suzuki (Kyushu Uni-
versity, Fukuoka, Japan), Eiji Takimoto (Kyushu University, Fukuoka, Japan),
György Turán (University of Illinois at Chicago, USA, and University of Szeged,
Hungary), Osamu Watanabe (Tokyo Institute of Technology, Japan), Thomas
Zeugmann (Chair, Hokkaido University, Japan), and Sandra Zilles (Publicity
Chair, University of Regina, Canada). The ALT web pages were set up by Thomas
Zeugmann (together with Frank Balbach and Jan Poland), who also maintains
them.
The present volume contains the texts of the 26 papers presented at ALT
2009, divided into groups of papers on online learning, learning graphs, active
learning and query learning, statistical learning, inductive inference, and semi-
supervised and unsupervised learning. The volume also contains abstracts of the
invited talks:
– Sanjoy Dasgupta (University of California, San Diego, USA): The Two Faces
of Active Learning
– Hector Geffner (Universitat Pompeu Fabra, Barcelona, Spain): Inference and
Learning in Planning
– Jiawei Han (University of Illinois at Urbana-Champaign, USA): Mining Het-
erogeneous Information Networks by Exploring the Power of Links
– Yishay Mansour (Tel Aviv University, Israel): Learning and Domain Adapta-
tion
– Fernando C.N. Pereira (Google, Mountain View, USA): Learning on the Web
Papers presented at DS 2009 are contained in the DS 2009 proceedings.
The E. Mark Gold Award has been presented annually at the ALT conferences
since 1999, for the most outstanding student contribution. This year, the award
was given to Hanna Mazzawi for the paper Reconstructing Weighted Graphs with
Minimal Query Complexity, co-authored by Nader Bshouty.
We would like to thank the many people and institutions who contributed
to the success of the conference. Thanks to the authors of the papers for their
submissions, and to the invited speakers for presenting exciting overviews of
important recent research developments. We are very grateful to the sponsors
of the conference for their generous financial support: University of Porto, Ar-
tificial Intelligence and Decision Support Laboratory, Center for Research in
Advanced Computing Systems, Portuguese Science and Technology Foundation,
Portuguese Artificial Intelligence Association, SAS, Alberta Ingenuity Centre for
Machine Learning, and Division of Computer Science, Hokkaido University.
We are grateful to the members of the Program Committee for ALT 2009.
Their hard work in reviewing and discussing the papers made sure that we
had an interesting and strong program. We also thank the subreferees assist-
ing the Program Committee. Special thanks go to the local arrangements chair
João Gama (University of Porto). We would like to thank the Discovery Sci-
ence conference for its ongoing collaboration with ALT, which makes it possible
to provide a well-rounded picture of the current theoretical and practical ad-
vances in machine learning and the related areas. In particular, we are grateful
to the conference chair João Gama (University of Porto) and Program Commit-
tee chairs Vítor Santos Costa (University of Porto) and Alípio Jorge (University
of Porto) for their cooperation. Last but not least, we thank Springer for their
support in preparing and publishing this volume of the Lecture Notes in Artificial
Intelligence series.
Conference Chair
Ricard Gavaldà Universitat Politècnica de Catalunya,
Barcelona, Spain
Program Committee
Peter Auer University of Leoben, Austria
José L. Balcázar Universitat Politècnica de Catalunya,
Barcelona, Spain
Shai Ben-David University of Waterloo, Canada
Avrim Blum Carnegie Mellon University, Pittsburgh, USA
Nader Bshouty Technion, Haifa, Israel
Claudio Gentile Università degli Studi dell’Insubria, Varese,
Italy
Peter Grünwald Centrum voor Wiskunde en Informatica (CWI),
Amsterdam, The Netherlands
Roni Khardon Tufts University, Medford, USA
Phil Long Google, Mountain View, USA
Gábor Lugosi ICREA and Pompeu Fabra University,
Barcelona, Spain (Chair)
Massimiliano Pontil University College London, UK
Alexander Rakhlin UC Berkeley, USA
Shai Shalev-Shwartz Toyota Technological Institute at Chicago, USA
Hans Ulrich Simon Ruhr-Universität Bochum, Germany
Frank Stephan National University of Singapore, Singapore
Csaba Szepesvári University of Alberta, Edmonton, Canada
Eiji Takimoto Kyushu University, Fukuoka, Japan
Sandra Zilles University of Regina, Canada (Chair)
Local Arrangements
João Gama University of Porto, Portugal
Subreferees
Jacob Abernethy
Andreas Argyriou
Marta Arias
John Case
Nicolò Cesa-Bianchi
Jiang Chen
Alexander Clark
Sanjoy Dasgupta
Sponsoring Institutions
University of Porto
Artificial Intelligence and Decision Support Laboratory
Center for Research in Advanced Computing Systems
Portuguese Science and Technology Foundation
Portuguese Artificial Intelligence Association
SAS
Alberta Ingenuity Centre for Machine Learning
Division of Computer Science, Hokkaido University
Table of Contents
Invited Papers
The Two Faces of Active Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Sanjoy Dasgupta
Inference and Learning in Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Hector Geffner
Mining Heterogeneous Information Networks by Exploring the Power
of Links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Jiawei Han
Learning and Domain Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Yishay Mansour
Learning on the Web . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Fernando C.N. Pereira
Regular Contributions
Online Learning
Prediction with Expert Evaluators’ Advice . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Alexey Chernov and Vladimir Vovk
Pure Exploration in Multi-armed Bandits Problems . . . . . . . . . . . . . . . . . . 23
Sébastien Bubeck, Rémi Munos, and Gilles Stoltz
The Follow Perturbed Leader Algorithm Protected from Unbounded
One-Step Losses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Vladimir V. V’yugin
Computable Bayesian Compression for Uniformly Discretizable
Statistical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Łukasz Dębowski
Calibration and Internal No-Regret with Random Signals . . . . . . . . . . . . . 68
Vianney Perchet
St. Petersburg Portfolio Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
László Györfi and Péter Kevei
Learning Graphs
Reconstructing Weighted Graphs with Minimal Query Complexity . . . . . 97
Nader H. Bshouty and Hanna Mazzawi
Statistical Learning
Adaptive Estimation of the Optimal ROC Curve and a Bipartite
Ranking Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
Stéphan Clémençon and Nicolas Vayatis
Inductive Inference
Difficulties in Forcing Fairness of Polynomial Time Inductive
Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
John Case and Timo Kötzing
Learning and Domain Adaptation
Yishay Mansour
1 Introduction
It is almost standard in machine learning to assume that the training and test
instances are drawn from the same distribution. This assumption is explicit in
the standard PAC model [19] and other theoretical models of learning, and it is a
natural assumption since when the training and test distributions substantially
differ there can be no hope for generalization. However, in practice, there are
several crucial scenarios where the two distributions are similar but not identical,
and therefore effective learning is potentially possible. This is the motivation for
domain adaptation.
The problem of domain adaptation arises in a variety of applications in natu-
ral language processing [6,3,9,4,5], speech processing [11,7,16,18,8,17], computer
vision [15], and many other areas. Quite often, little or no labeled data is avail-
able from the target domain, but labeled data from a source domain somewhat
similar to the target as well as large amounts of unlabeled data from the target
domain are at one’s disposal. The domain adaptation problem then consists of
leveraging the source labeled and target unlabeled data to derive a hypothesis
performing well on the target domain.
The first theoretical analysis of the domain adaptation problem was presented
by [1], who gave VC-dimension-based generalization bounds for adaptation in
classification tasks. Perhaps the most significant contribution of that work was
the definition and application of a distance between distributions, the dA dis-
tance, that is particularly relevant to the problem of domain adaptation and
which can be estimated from finite samples for a finite VC dimension, as previ-
ously shown by [10]. This work was later extended by [2] who also gave a bound
on the error rate of a hypothesis derived from a weighted combination of the
source data sets for the specific case of empirical risk minimization. More re-
fined generalization bounds, which apply to more general tasks including regres-
sion and general loss functions, appear in [12]. From an algorithmic perspective,
it is natural to re-weight the empirical distribution to better reflect the target
distribution; efficient algorithms for this re-weighting task were given in [12].
A more complex variant of this problem arises in sentiment analysis and other
text classification tasks where the learner receives information from several do-
main sources that he can combine to make predictions about a target domain.
As an example, often appraisal information about a relatively small number of
domains such as movies, books, restaurants, or music may be available, but little
or none is accessible for more difficult domains such as travel. This is known as
the multiple source adaptation problem. Instances of this problem can be found
in a variety of other natural language and image processing tasks.
The problem of adaptation with multiple sources was introduced and analyzed
[13,14]. The problem is formalized as follows. For each source domain i ∈ [1, k],
the learner receives the distribution of the input points Qi , as well as a hypoth-
esis hi with loss at most ε on that source. The task consists of combining the k
hypotheses hi , i ∈ [1, k], to derive a hypothesis h with a loss as small as possible
with respect to the target distribution P . Unfortunately, a simple convex com-
bination of the k source hypotheses hi can perform very poorly; for example,
there are cases where any such convex combination would incur a classification
error of a half, even when each source hypothesis hi makes no error on its do-
main Qi (see [13]). In contrast, distribution weighted combinations of the source
hypotheses, which are combinations of source hypotheses weighted by the source
distributions, perform very well. In [13] it was shown that, remarkably, for any
fixed target function, there exists a distribution weighted combination of the
source hypotheses whose loss is at most ε with respect to any mixture P of the k
source distributions Qi . For the case that the target distribution P is arbitrary,
generalization bounds, based on Rényi divergence between the sources and the
target distributions, were derived in [14].
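To make the combination rule concrete, here is a minimal sketch in Python (our illustration, not code from [13] or [14]; the toy densities, hypotheses, and mixture weights are assumptions for the example): each source hypothesis h_i is weighted at a point x by z_i Q_i(x), so a source influences the prediction only where its own distribution puts mass.

```python
# A sketch of a distribution-weighted combination of source hypotheses,
# in the spirit of [13]; everything below is an illustrative assumption.
from math import exp, pi, sqrt

def distribution_weighted_combination(x, hypotheses, densities, z):
    """Combine source hypotheses h_i, weighting h_i(x) by z_i * Q_i(x).

    hypotheses: list of callables h_i(x) -> prediction
    densities:  list of callables q_i(x) -> density of source i at x
    z:          list of nonnegative mixture weights summing to 1
    """
    weights = [z_i * q(x) for z_i, q in zip(z, densities)]
    total = sum(weights)
    if total == 0.0:   # x outside the support of every source
        return sum(h(x) for h in hypotheses) / len(hypotheses)
    return sum(w * h(x) for w, h in zip(weights, hypotheses)) / total

# Toy one-dimensional example with two Gaussian source distributions:
gauss = lambda m: (lambda x: exp(-(x - m) ** 2 / 2) / sqrt(2 * pi))
h1, h2 = (lambda x: 0.2), (lambda x: 0.9)
print(distribution_weighted_combination(
    0.5, [h1, h2], [gauss(0), gauss(1)], [0.5, 0.5]))
```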
References
1. Ben-David, S., Blitzer, J., Crammer, K., Pereira, F.: Analysis of representations
for domain adaptation. In: Proceedings of NIPS 2006 (2006)
2. Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., Wortman, J.: Learning bounds
for domain adaptation. In: Proceedings of NIPS 2007 (2007)
3. Blitzer, J., Dredze, M., Pereira, F.: Biographies, Bollywood, Boom-boxes and
Blenders: Domain Adaptation for Sentiment Classification. In: ACL 2007 (2007)
6 Y. Mansour
4. Chelba, C., Acero, A.: Adaptation of maximum entropy capitalizer: Little data can
help a lot. Computer Speech & Language 20(4), 382–399 (2006)
5. Daumé III, H., Marcu, D.: Domain adaptation for statistical classifiers. Journal of
Artificial Intelligence Research 26, 101–126 (2006)
6. Dredze, M., Blitzer, J., Talukdar, P.P., Ganchev, K., Graca, J., Pereira,
F.: Frustratingly Hard Domain Adaptation for Parsing. In: CoNLL 2007 (2007)
7. Gauvain, J.-L., Lee, C.-H.: Maximum a posteriori estimation for multivariate Gaus-
sian mixture observations of Markov chains. IEEE Transactions on Speech and
Audio Processing 2(2), 291–298 (1994)
8. Jelinek, F.: Statistical Methods for Speech Recognition. MIT Press, Cambridge
(1998)
9. Jiang, J., Zhai, C.X.: Instance Weighting for Domain Adaptation in NLP. In: Pro-
ceedings of ACL 2007 (2007)
10. Kifer, D., Ben-David, S., Gehrke, J.: Detecting change in data streams. In: Pro-
ceedings of the 30th International Conference on Very Large Data Bases (2004)
11. Legetter, C.J., Woodland, P.C.: Maximum likelihood linear regression for speaker
adaptation of continuous density hidden Markov models. Computer Speech and
Language, 171–185 (1995)
12. Mansour, Y., Mohri, M., Rostamizadeh, A.: Domain adaptation: Learning bounds
and algorithms. In: COLT (2009)
13. Mansour, Y., Mohri, M., Rostamizadeh, A.: Domain adaptation with multiple
sources. In: Proceedings of NIPS 2008 (2008)
14. Mansour, Y., Mohri, M., Rostamizadeh, A.: Multiple source adaptation and the
Rényi divergence. In: Uncertainty in Artificial Intelligence, UAI (2009)
15. Martínez, A.M.: Recognizing imprecisely localized, partially occluded, and expres-
sion variant faces from a single sample per class. IEEE Trans. Pattern Anal. Mach.
Intell. 24(6), 748–763 (2002)
16. Della Pietra, S., Della Pietra, V., Mercer, R.L., Roukos, S.: Adaptive language
modeling using minimum discriminant estimation. In: HLT 1991: Proceedings of
the workshop on Speech and Natural Language, pp. 103–106 (1992)
17. Roark, B., Bacchiani, M.: Supervised and unsupervised PCFG adaptation to novel
domains. In: Proceedings of HLT-NAACL (2003)
18. Rosenfeld, R.: A Maximum Entropy Approach to Adaptive Statistical Language
Modeling. Computer Speech and Language 10, 187–228 (1996)
19. Valiant, L.G.: A theory of the learnable. Communications of the ACM 27(11),
1134–1142 (1984)
Learning on the Web
It is commonplace to say that the Web has changed everything. Machine learning
researchers often say that their projects and results respond to that change with
better methods for finding and organizing Web information. However, not much
of the theory, or even the current practice, of machine learning takes the Web
seriously. We continue to devote much effort to refining supervised learning, but
the Web reality is that labeled data is hard to obtain, while unlabeled data is
inexhaustible. We cling to the iid assumption, while all the Web data generation
processes drift rapidly and involve many hidden correlations. Much of our theory
and many of our algorithms assume data representations of fixed dimension, while in fact the
dimensionality of data, for example the number of distinct words in text, grows
with data size. While there has been much work recently on learning with sparse
representations, little attention has been paid to the actual patterns of sparsity on
the Web. Those patterns might be very relevant to the communication costs
of distributed learning algorithms, which are necessary at Web scale, but little
work has been done on this.
Nevertheless, practical machine learning is thriving on the Web. Statistical
machine translation has developed non-parametric algorithms that learn how
to translate by mining the ever-growing volume of source documents and their
translations that are created on the Web. Unsupervised learning methods infer
useful latent semantic structure from the statistics of term co-occurrences in
Web documents. Image search achieves improved ranking by learning from user
responses to search results. In all those cases, Web scale demanded distributed
algorithms.
I will review some of those practical successes to try to convince you that
they are not just engineering feats, but also rich sources of new fundamental
questions that we should be investigating.
1 Introduction
We consider the problem of online sequence prediction. A process generates
outcomes ω1 , ω2 , . . . step by step. At each step t, a learner tries to guess this
step’s outcome announcing his prediction γt . Then the actual outcome ωt is
revealed. The quality of the learner’s prediction is measured by a loss function:
the learner’s loss at step t is λ(γt , ωt ).
Prediction with expert advice is a framework that does not make any assump-
tions about the generating process. The performance of the learner is compared
to the performance of several other predictors called experts. At each step, each
expert gives his prediction γtn , then the learner produces his own prediction γt
(possibly based on the experts’ predictions at the last step and the experts’ pre-
dictions and outcomes at all the previous steps), and the accumulated losses are
updated for the learner and for the experts. There are many algorithms for the
learner in this framework; for a review, see [1].
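As a concrete illustration, here is a minimal sketch of this protocol (our code, not the paper's; the loss function, the experts' predictions, and the merging rule below are placeholders):

```python
# A sketch of the protocol of prediction with expert advice: at each step
# the experts announce predictions, the learner announces its own, the
# outcome is revealed, and all cumulative losses are updated.

def play(expert_predictions, learner, loss, outcomes):
    """Return the learner's cumulative loss L_T and the experts' losses L_T^n.

    expert_predictions[t][n] is gamma_t^n; learner maps the experts'
    current predictions and the revealed history to gamma_t.
    """
    N = len(expert_predictions[0])
    learner_loss, expert_losses, history = 0.0, [0.0] * N, []
    for t, omega in enumerate(outcomes):
        gammas = expert_predictions[t]
        gamma = learner(gammas, history)           # learner's prediction
        learner_loss += loss(gamma, omega)
        for n in range(N):
            expert_losses[n] += loss(gammas[n], omega)
        history.append((gammas, gamma, omega))     # everything is revealed
    return learner_loss, expert_losses

square_loss = lambda g, w: (w - g) ** 2            # one common choice
mean_learner = lambda gammas, history: sum(gammas) / len(gammas)
print(play([[0.1, 0.9], [0.2, 0.8]], mean_learner, square_loss, [0, 1]))
```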
In practical applications of the algorithms for prediction with expert advice,
choosing the loss function is often difficult. There may be no natural quantitative
measure of loss, just the vague concept that the closer the prediction to the
outcome the better. In such cases one usually selects from among several common
loss functions, such as the square loss function (reflecting the idea of least squares
methods) or the log loss function (which has an information theory background).
A similar issue arises when experts themselves are prediction algorithms that
optimize some losses internally. Then it is unfair to these experts when the
learner competes with them according to a “foreign” loss function.
The goal of Learner is to keep his loss $L_t$ smaller, or at least not much greater,
than the loss $L_t^n$ of Expert n, at each step t and for all n = 1, . . . , N .
Assumption 1: λ(γ, 0) and λ(γ, 1) are continuous in γ ∈ [0, 1] (with respect to
the standard (Aleksandrov) topology on [0, ∞]).
Assumption 2: There exists γ ∈ [0, 1] such that λ(γ, 0) and λ(γ, 1) are both
finite.
Assumption 3: There exists no γ ∈ [0, 1] such that λ(γ, 0) and λ(γ, 1) are both
infinite.
By Assumption 2, this set is non-empty. For each learning rate η > 0, let
$E_\eta : [0, \infty]^2 \to [0, 1]^2$ be the homeomorphism defined by $E_\eta(x, y) := (e^{-\eta x}, e^{-\eta y})$.
The loss function λ is called η-mixable if the set $E_\eta(\Sigma_\lambda)$ is convex. It is called
mixable if it is η-mixable for some η > 0.
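To make the definition concrete, here is a small numerical check (our sketch, not from the paper) that the square loss $\lambda(\gamma, \omega) = (\omega - \gamma)^2$ is η-mixable for small η but not for large η; the critical value for the square loss is known to be η = 2. The set $E_\eta(\Sigma_\lambda)$ is the "Southwest closure" in $[0,1]^2$ of the curve $\gamma \mapsto (e^{-\eta\gamma^2}, e^{-\eta(1-\gamma)^2})$, so its convexity can be tested by checking that every chord midpoint of the curve is dominated coordinatewise by some point of the curve.

```python
# Numerical sanity check of eta-mixability for the square loss (a sketch).
import numpy as np

def is_eta_mixable_square_loss(eta, grid=200):
    g = np.linspace(0.0, 1.0, grid)
    u = np.exp(-eta * g ** 2)          # E_eta image of lambda(gamma, 0)
    v = np.exp(-eta * (1 - g) ** 2)    # E_eta image of lambda(gamma, 1)
    for i in range(grid):
        for j in range(i + 1, grid):
            mu, mv = (u[i] + u[j]) / 2, (v[i] + v[j]) / 2  # chord midpoint
            # the midpoint must lie Southwest of some point of the curve
            if not np.any((u >= mu - 1e-6) & (v >= mv - 1e-6)):
                return False
    return True

print(is_eta_mixable_square_loss(1.0))  # True: below the critical value 2
print(is_eta_mixable_square_loss(5.0))  # False: square loss is not 5-mixable
```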
Theorem 1 (Vovk and Watkins). If a loss function λ is η-mixable, then
there exists a strategy for Learner that guarantees that in the game of prediction
with expert advice with N experts and the loss function λ it holds, for all T and
for all n = 1, . . . , N , that
$$L_T \le L_T^n + \frac{\ln N}{\eta} \,. \qquad (2)$$
The bound is optimal: if λ is not η-mixable, then no strategy for Learner can
guarantee (2).
For the proof and other details, see [1], [5], [6], or [7, Theorem 8]; one of the algo-
rithms guaranteeing (2) is the Aggregating Algorithm (AA). As shown in [2], one
can take the defensive forecasting algorithm instead of the AA in the theorem.
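For concreteness, here is a sketch of the AA in the special case of the log loss with η = 1, where the generalized prediction can be realized exactly by a weighted average of the experts' predictions (the toy experts and outcomes below are illustrative assumptions, not from the paper):

```python
# The Aggregating Algorithm for the log loss with eta = 1: weights are
# proportional to exp(-cumulative loss), and the AA prediction is the
# weighted average of the experts' predictions.
import math

def aggregating_algorithm_logloss(expert_preds, outcomes):
    """Guarantees L_T <= L_T^n + ln N for the log loss and every expert n."""
    N = len(expert_preds[0])
    weights = [1.0 / N] * N            # w_n proportional to exp(-L_{t-1}^n)
    log_loss = lambda g, w: -math.log(g if w == 1 else 1 - g)
    learner_loss = 0.0
    for gammas, omega in zip(expert_preds, outcomes):
        total = sum(weights)
        pi = sum(w * g for w, g in zip(weights, gammas)) / total  # prediction
        learner_loss += log_loss(pi, omega)
        # weight update: multiply each weight by exp(-loss of that expert)
        weights = [w * math.exp(-log_loss(g, omega))
                   for w, g in zip(weights, gammas)]
    return learner_loss

print(aggregating_algorithm_logloss([[0.4, 0.9], [0.3, 0.8]], [1, 1]))
```

For the log loss this realizes the bound (2) with η = 1: the learner's cumulative loss equals $-\ln$ of the uniform mixture of $e^{-L_T^n}$, which is at most $L_T^n + \ln N$ for every n.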
(i.e., $\lambda(\gamma, 0) = -\ln(1 - \gamma)$ and $\lambda(\gamma, 1) = -\ln\gamma$) and the square loss function
$\lambda(\gamma, \omega) := (\omega - \gamma)^2$.
The description of the defensive forecasting algorithm and the proof of the the-
orem will be given in Sect. 7.
for all T and all n = 1, . . . , N , in the game of prediction with N expert evaluators'
advice in which the experts are required to always choose η-mixable loss functions
$\lambda_t^n$.
This corollary is more intuitive than Theorem 2 as (4) compares the cumulative
losses suffered by Learner and each expert.
In the following sections we will discuss two interesting special cases of The-
orem 2 and Corollary 1.
$L_t^n := L_{t-1}^n + \lambda^n(\gamma_t^n, \omega_t)$, n = 1, . . . , N .
END FOR
There are two changes in the protocol as compared to the basic protocol of pre-
diction with expert advice in Sect. 2. The accumulated loss $L_t^n$ of each expert is
now calculated according to his own loss function $\lambda^n$. For Learner, there is no
single accumulated loss anymore. Instead, the loss $L_t^{(n)}$ of Learner is calculated
separately against each expert, according to that expert's loss function $\lambda^n$. Infor-
mally speaking, each expert evaluates his own performance and the performance
of Learner according to the expert’s own (but publicly known) criteria.
In the standard setting of prediction with expert advice it is often said that
Learner’s goal is to compete with the best expert in the pool. In the new setting,
we cannot speak about the best expert: the experts’ performance is evaluated by
different loss functions and thus the losses may be measured on different scales.
But it still makes sense to consider bounds on the regret $L_t^{(n)} - L_t^n$ for each n.
$L_t^{n,m} := L_{t-1}^{n,m} + \lambda^m(\gamma_t^n, \omega_t)$, n = 1, . . . , N and m = 1, . . . , M .
END FOR
and Expert n over the steps when Expert n is awake. Now Learner’s goal is to
do as well as each expert on the steps chosen by that expert.
Corollary 4. Let λ be a loss function that is η-mixable for some η > 0. Then
Learner has a strategy that guarantees that in the game of prediction with N
specialist experts’ advice and loss function λ it holds, for all T and for all n =
1, . . . , N , that
$$L_T^{(n)} \le L_T^n + \frac{\ln N}{\eta} \,. \qquad (6)$$
Proof. Without loss of generality the loss function λ may be assumed to be
proper (as we said earlier, this can be achieved by reparameterization of predic-
tions). The protocol of this section then becomes a special case of the protocol
of Sect. 4 in which at each step each expert outputs $\eta_t^n = \eta$ and either $\lambda_t^n = \lambda$
(when he is awake) or $\lambda_t^n = 0$ (when he is asleep). (Alternatively, we could allow
zero learning rates and make each expert output $\lambda_t^n = \lambda$ and either $\eta_t^n = \eta$,
when he is awake, or $\eta_t^n = 0$, when he is asleep.)
We will use the more intuitive notation πt , rather than γt , for the algorithm’s
predictions (to emphasize the interpretation of predictions as probabilities: cf.
the discussion of proper scoring rules in Sect. 3).
The Algorithm
For each n = 1, . . . , N , let us define the function
$$Q^n : \bigl([0, 1]^N \times (0, \infty)^N \times \mathcal{L}^N \times [0, 1] \times \{0, 1\}\bigr)^* \to [0, \infty]$$
by
$$Q^n(\gamma_1^\bullet, \eta_1^\bullet, \lambda_1^\bullet, \pi_1, \omega_1, \ldots, \gamma_T^\bullet, \eta_T^\bullet, \lambda_T^\bullet, \pi_T, \omega_T) := \prod_{t=1}^{T} e^{\eta_t^n \left(\lambda_t^n(\pi_t, \omega_t) - \lambda_t^n(\gamma_t^n, \omega_t)\right)} \,, \qquad (7)$$
where $\gamma_t^n$ are the components of $\gamma_t^\bullet$, $\eta_t^n$ are the components of $\eta_t^\bullet$, and $\lambda_t^n$
are the components of $\lambda_t^\bullet$: $\gamma_t^\bullet := (\gamma_t^1, \ldots, \gamma_t^N)$, $\eta_t^\bullet := (\eta_t^1, \ldots, \eta_t^N)$, and
$\lambda_t^\bullet := (\lambda_t^1, \ldots, \lambda_t^N)$. As usual, the product $\prod_{t=1}^{0}$ is interpreted as 1, so that $Q^n() = 1$.
The functions $Q^n$ will usually be applied to $\gamma_t^\bullet := (\gamma_t^1, \ldots, \gamma_t^N)$, the predictions
made by all the N experts at step t, $\eta_t^\bullet := (\eta_t^1, \ldots, \eta_t^N)$, the learning rates chosen
by the experts at step t, and $\lambda_t^\bullet := (\lambda_t^1, \ldots, \lambda_t^N)$, the loss functions used by the
experts at step t. Notice that $Q^n$ does not depend on the predictions, learning
rates, and loss functions of the experts other than Expert n.
Set
$$Q := \frac{1}{N} \sum_{n=1}^{N} Q^n$$
and
$$f_t(\pi, \omega) := Q\bigl(\gamma_1^\bullet, \eta_1^\bullet, \lambda_1^\bullet, \pi_1, \omega_1, \ldots, \gamma_{t-1}^\bullet, \eta_{t-1}^\bullet, \lambda_{t-1}^\bullet, \pi_{t-1}, \omega_{t-1}, \gamma_t^\bullet, \eta_t^\bullet, \lambda_t^\bullet, \pi, \omega\bigr)$$
$$\qquad\qquad - Q\bigl(\gamma_1^\bullet, \eta_1^\bullet, \lambda_1^\bullet, \pi_1, \omega_1, \ldots, \gamma_{t-1}^\bullet, \eta_{t-1}^\bullet, \lambda_{t-1}^\bullet, \pi_{t-1}, \omega_{t-1}\bigr) \,, \qquad (8)$$
where (π, ω) ranges over [0, 1] × {0, 1}; the expression ∞ − ∞ is understood as,
say, 0. The defensive forecasting algorithm is defined in terms of the functions ft .
The existence of a π satisfying ft (π, 0) = ft (π, 1), when required by the algo-
rithm, will be proved in Lemma 1 below. We will see that in this case the function
ft (π) := ft (π, 1) − ft (π, 0) takes values of opposite signs at π = 0 and π = 1.
Therefore, a root of ft (π) = 0 can be found by, e.g., bisection (see [10], Chap. 9,
for a review of bisection and more efficient methods, such as Brent’s).
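The root-finding step can be sketched as follows (a minimal illustration, not the paper's implementation; the toy function f stands for the difference $f_t(\pi) := f_t(\pi, 1) - f_t(\pi, 0)$):

```python
# Bisection step of the defensive forecasting algorithm (a sketch).
def defensive_forecast(f, tol=1e-12):
    """Find pi in [0,1] with f(pi) = f_t(pi,1) - f_t(pi,0) = 0, or a safe endpoint."""
    if f(0.0) <= 0:   # then f_t(0,1) <= f_t(0,0) <= 0 by (11): pi = 0 is safe
        return 0.0
    if f(1.0) >= 0:   # then f_t(1,0) <= f_t(1,1) <= 0 by (11): pi = 1 is safe
        return 1.0
    lo, hi = 0.0, 1.0                  # now f(lo) > 0 > f(hi): bisect
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(defensive_forecast(lambda p: 1 - 2 * p))   # toy f with root at 0.5
```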
Reductions
The most important property of the defensive forecasting algorithm is that it
produces predictions πt such that the sequence
$$Q_t := Q(\gamma_1^\bullet, \eta_1^\bullet, \lambda_1^\bullet, \pi_1, \omega_1, \ldots, \gamma_t^\bullet, \eta_t^\bullet, \lambda_t^\bullet, \pi_t, \omega_t) \qquad (9)$$
is non-increasing. This property will be proved later; for now, we will only check
that it implies the bound on the regret term given in Theorem 2. Since the initial
value $Q_0$ of Q is 1, we have $Q_t \le 1$ for all t. And since $Q^n \ge 0$ for all n, we have
$Q^n \le NQ$ for all n. Therefore, $Q_t^n$, defined by (9) with $Q^n$ in place of Q, is at
most N at each step t. By the definition of $Q^n$ this means that
$$\sum_{t=1}^{T} \eta_t^n \bigl(\lambda_t^n(\pi_t, \omega_t) - \lambda_t^n(\gamma_t^n, \omega_t)\bigr) \le \ln N \,,$$
$$\pi_T\, S(e_1, \pi_1, \omega_1, \ldots, e_{T-1}, \pi_{T-1}, \omega_{T-1}, e_T, \pi_T, 1)$$
$$+\ (1 - \pi_T)\, S(e_1, \pi_1, \omega_1, \ldots, e_{T-1}, \pi_{T-1}, \omega_{T-1}, e_T, \pi_T, 0)$$
$$\le S(e_1, \pi_1, \omega_1, \ldots, e_{T-1}, \pi_{T-1}, \omega_{T-1}) \,. \qquad (10)$$
Remark 1. The standard measure-theoretic notion of a supermartingale is ob-
tained when the arguments π1 , π2 , . . . in (10) are replaced by the forecasts pro-
duced by a fixed forecasting system. See, e.g., [12] for details. Game-theoretic
supermartingales are referred to as “superfarthingales” in [13].
A supermartingale S is called forecast-continuous if, for all T ∈ {1, 2, . . .}, all
$e_1, \ldots, e_T \in E$, all $\pi_1, \ldots, \pi_{T-1} \in [0, 1]$, and all $\omega_1, \ldots, \omega_T \in \{0, 1\}$,
$$S(e_1, \pi_1, \omega_1, \ldots, e_{T-1}, \pi_{T-1}, \omega_{T-1}, e_T, \pi, \omega_T)$$
is a continuous function of π ∈ [0, 1]. The following lemma (proved and used in
similar contexts by, e.g., Levin [14] and Takemura [15]) states the property of
forecast-continuous supermartingales that is most important for us.
Lemma 1. Let S be a forecast-continuous supermartingale. Then for all T, all
$e_1, \ldots, e_T \in E$, all $\pi_1, \ldots, \pi_{T-1} \in [0, 1]$, and all $\omega_1, \ldots, \omega_{T-1} \in \{0, 1\}$, there
exists π ∈ [0, 1] such that, for both ω ∈ {0, 1},
$$S(e_1, \pi_1, \omega_1, \ldots, e_{T-1}, \pi_{T-1}, \omega_{T-1}, e_T, \pi, \omega)
\le S(e_1, \pi_1, \omega_1, \ldots, e_{T-1}, \pi_{T-1}, \omega_{T-1}) \,.$$
Proof. Define a function $f : [0, 1] \times \{0, 1\} \to (-\infty, \infty]$ by
$$f(\pi, \omega) := S(e_1, \pi_1, \omega_1, \ldots, e_{T-1}, \pi_{T-1}, \omega_{T-1}, e_T, \pi, \omega)
- S(e_1, \pi_1, \omega_1, \ldots, e_{T-1}, \pi_{T-1}, \omega_{T-1})$$
(the subtrahend is assumed finite: there is nothing to prove when it is infinite).
Since S is a forecast-continuous supermartingale, f (π, ω) is continuous in π and
$$\pi f(\pi, 1) + (1 - \pi) f(\pi, 0) \le 0 \qquad (11)$$
for all π ∈ [0, 1]. In particular, f (0, 0) ≤ 0 and f (1, 1) ≤ 0.
Our goal is to show that for some π ∈ [0, 1] we have f (π, 1) ≤ 0 and f (π, 0) ≤
0. If f (0, 1) ≤ 0, we can take π = 0. If f (1, 0) ≤ 0, we can take π = 1. Assume
that f (0, 1) > 0 and f (1, 0) > 0. Then the difference f (π) := f (π, 1) − f (π, 0) is
positive for π = 0 and negative for π = 1. By the intermediate value theorem,
f (π) = 0 for some π ∈ (0, 1). By (11) we have f (π, 1) = f (π, 0) ≤ 0.
The fact that the sequence (9) is non-increasing follows from the fact (see below)
that Q is a forecast-continuous supermartingale (when restricted to the allowed
moves for the players). The pseudocode for the defensive forecasting algorithm
and the paragraph following it are extracted from the proof of Lemma 1, as
applied to the supermartingale Q.
The weighted sum of finitely many forecast-continuous supermartingales taken
with positive weights is again a forecast-continuous supermartingale. Therefore,
the proof will be complete if we check that Qn is a supermartingale under the
restriction that λnt is ηtn -mixable for all n and t (it is forecast-continuous by
Assumption 1). But before we can do this, we will need to do some preparatory
work in the next subsection.
Let us fix a constant η > 0. The prediction set of the generalized log loss
function (3) is the curve $\{(x, y) \mid e^{-\eta x} + e^{-\eta y} = 1\}$ in $\mathbb{R}^2$. For each π ∈ (0, 1),
the π-point of this curve is $\Lambda_\pi$, i.e., the point
$$\left(-\frac{1}{\eta}\ln(1 - \pi),\ -\frac{1}{\eta}\ln\pi\right).$$
Since the generalized log loss function is proper, the minimum of (1 − π)x + πy
(geometrically, of the dot product of (1 − π, π) and (x, y)) on the curve e−ηx +
e−ηy = 1 is attained at the π-point; in other words, the tangent of e−ηx +e−ηy = 1
at the π-point is orthogonal to the vector (1 − π, π).
A shift of the curve e−ηx + e−ηy = 1 is the curve e−η(x−α) + e−η(y−β) = 1
for some $\alpha, \beta \in \mathbb{R}$ (i.e., it is a parallel translation of $e^{-\eta x} + e^{-\eta y} = 1$ by some
vector (α, β)). The π-point of this shift is the point (α, β) + Λπ , where Λπ is the
π-point of the original curve e−ηx + e−ηy = 1. This provides us with a coordinate
system on each shift of e−ηx + e−ηy = 1 (π ∈ (0, 1) serves as the coordinate of
the corresponding π-point).
It will be convenient to use the geographical expressions “Northeast” and
“Southwest”. A point (x1 , y1 ) is Northeast of a point (x2 , y2 ) if x1 ≥ x2 and
$y_1 \ge y_2$. A set $A \subseteq \mathbb{R}^2$ is Northeast of a shift of $e^{-\eta x} + e^{-\eta y} = 1$ if each point of
A is Northeast of some point of the shift. Similarly, a point is Northeast of a shift
of e−ηx + e−ηy = 1 (or of a straight line with a negative slope) if it is Northeast
of some point on that shift (or line). “Northeast” is replaced by “Southwest”
when the inequalities are ≤ rather than ≥, and we add the attribute “strictly”
when the inequalities are strict.
It is easy to see that the loss function is η-mixable if and only if for each
point (a, b) on the boundary of the superprediction set there exists a shift of
e−ηx +e−ηy = 1 passing through (a, b) such that the superprediction set lies to the
Northeast of the shift. This follows from the fact that the shifts of e−ηx +e−ηy = 1
correspond to the straight lines with negative slope under the homeomorphism
$E_\eta$: indeed, the preimage of $ax + by = c$, where a > 0, b > 0, and c > 0, is
$ae^{-\eta x} + be^{-\eta y} = c$, which is the shift of $e^{-\eta x} + e^{-\eta y} = 1$ by the vector
$$\left(-\frac{1}{\eta}\ln\frac{a}{c},\ -\frac{1}{\eta}\ln\frac{b}{c}\right).$$
A similar statement for the property of being proper is:
Lemma 2. Suppose the loss function λ is η-mixable. It is a proper loss function
if and only if for each π the superprediction set is to the Northeast of the shift
of e−ηx + e−ηy = 1 passing through Λπ (as defined by (12)) and having Λπ as its
π-point.
Proof. The part “if” is obvious, so we will only prove the part “only if”. Let
λ be η-mixable and proper. Suppose there exists π such that the shift A1 of
e−ηx + e−ηy = 1 passing through Λπ and having Λπ as its π-point has some
superpredictions strictly to its Southwest. Let s be such a superprediction, and
let A2 be the tangent to A1 at the point Λπ . The image Eη (A1 ) is a straight
line in [0, 1]2 , and the curve Eη (A2 ) touches Eη (A1 ) at Eη (Λπ ) and lies at the
same side of Eη (A1 ) as Eη (s). Any point p in the open interval (Eη (s), Eη (Λπ ))
that is close enough to Eη (Λπ ) will be strictly Northeast of Eη (A2 ). The point
Eη−1 (p) will then be a superprediction (by the η-mixability of λ) that is strictly
Southwest of A2 . This contradicts λ being a proper loss function, since A2 is the
straight line passing through Λπ and orthogonal to (1 − π, π).
To simplify the notation, we omit the indices n and T; this does not lead to
any ambiguity. Using the notation $(a, b) := \Lambda_\pi = (\lambda(\pi, 0), \lambda(\pi, 1))$ and $(x, y) :=
\Lambda_\gamma = (\lambda(\gamma, 0), \lambda(\gamma, 1))$, we can further simplify the last inequality to
$$(1 - \pi)\, e^{-\eta(x - a)} + \pi\, e^{-\eta(y - b)} \le 1 \,.$$
In other words, it suffices to check that the (super)prediction set lies to the
Northeast of the shift
$$\exp\left(-\eta\left(x - a - \frac{1}{\eta}\ln(1 - \pi)\right)\right) + \exp\left(-\eta\left(y - b - \frac{1}{\eta}\ln\pi\right)\right) = 1 \,. \qquad (13)$$
This shift is the parallel translation of $e^{-\eta x} + e^{-\eta y} = 1$ by the vector
$$\left(a + \frac{1}{\eta}\ln(1 - \pi),\ b + \frac{1}{\eta}\ln\pi\right),$$
and so (a, b) is the π-point of that shift. This completes the proof of the lemma: by
Lemma 2, the superprediction set indeed lies to the Northeast of that shift.
Fix such a function Σ. Notice that its value Σ() on the empty sequence can be
chosen arbitrarily, that the case k = 1 is trivial, and that the case k = 2 in fact
covers the cases k = 3, k = 4, etc.
of a sleeping expert does not change. In the case of the log loss function, this
algorithm was found by Freund et al. [3]; in this special case, Freund et al. derive
the same performance guarantee as we do.
In this derivation we will need the following notation. For each history of the
game, let $A^n$, n ∈ {1, . . . , N }, be the set of steps at which Expert n is awake:
$$A^n := \{t \in \{1, 2, \ldots\} \mid n \in A_t\} \,.$$
For each positive integer k, [k] stands for the set {1, . . . , k}.
The method of defensive forecasting (as used in the proof of Corollary 4)
requires that at step T we should choose π = πT such that, for each ω ∈ {0, 1},
$$\sum_{n \in A_T} p_n\, e^{\eta(\lambda(\pi,\omega) - \lambda(\gamma_T^n,\omega))} \prod_{t \in [T-1] \cap A^n} e^{\eta(\lambda(\pi_t,\omega_t) - \lambda(\gamma_t^n,\omega_t))}
\;+\; \sum_{n \in A_T^c} p_n \prod_{t \in [T-1] \cap A^n} e^{\eta(\lambda(\pi_t,\omega_t) - \lambda(\gamma_t^n,\omega_t))}$$
$$\le\; \sum_{n \in [N]} p_n \prod_{t \in [T-1] \cap A^n} e^{\eta(\lambda(\pi_t,\omega_t) - \lambda(\gamma_t^n,\omega_t))} \,.$$
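For the log loss with η = 1 this requirement can be satisfied exactly by averaging the awake experts' predictions under the current weights, in the spirit of the Freund et al. [3] algorithm mentioned above; the following is our minimal sketch with toy data, not the paper's code:

```python
# Sleeping (specialist) experts for the log loss with eta = 1: only awake
# experts are averaged and updated; an asleep expert's weight is unchanged,
# which amounts to charging it the zero loss function at that step.
import math

def sleeping_experts_logloss(rounds, N):
    """rounds: list of (awake, gammas, omega); awake is a set of expert
    indices and gammas[n] in (0,1) is defined for every n in awake."""
    weights = [1.0 / N] * N
    loss = 0.0
    prob = lambda g, w: g if w == 1 else 1 - g   # likelihood of omega under g
    for awake, gammas, omega in rounds:
        total = sum(weights[n] for n in awake)
        pi = sum(weights[n] * gammas[n] for n in awake) / total  # prediction
        loss += -math.log(prob(pi, omega))
        for n in awake:   # asleep experts' weights stay unchanged
            weights[n] *= prob(gammas[n], omega) / prob(pi, omega)
    return loss, weights

print(sleeping_experts_logloss(
    [({0, 1}, {0: 0.6, 1: 0.3}, 1), ({0}, {0: 0.7}, 0)], N=2))
```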
Acknowledgements
The anonymous reviewers’ comments were very helpful in weeding out mistakes
and improving presentation (although some of their suggestions could only be
used for the full version of the paper [4], not restricted by the page limit). This
work was supported in part by EPSRC grant EP/F002998/1. We are grateful to
the anonymous Eurocrat who coined the term “expert evaluator”.
References
1. Cesa-Bianchi, N., Lugosi, G.: Prediction, Learning, and Games. Cambridge
University Press, Cambridge (2006)
2. Chernov, A., Kalnishkan, Y., Zhdanov, F., Vovk, V.: Supermartingales in predic-
tion with expert advice. In: Freund, Y., Györfi, L., Turán, G., Zeugmann, T. (eds.)
ALT 2008. LNCS (LNAI), vol. 5254, pp. 199–213. Springer, Heidelberg (2008)
3. Freund, Y., Schapire, R.E., Singer, Y., Warmuth, M.K.: Using and combining pre-
dictors that specialize. In: Proceedings of the Twenty-Ninth Annual ACM Sympo-
sium on Theory of Computing, pp. 334–343. ACM, New York (1997)
4. Chernov, A., Vovk, V.: Prediction with expert evaluators’ advice. Technical Report
arXiv:0902.4127 [cs.LG], arXiv.org e-Print archive (2009)
5. Haussler, D., Kivinen, J., Warmuth, M.K.: Sequential prediction of individual
sequences under general loss functions. IEEE Transactions on Information The-
ory 44, 1906–1925 (1998)
6. Vovk, V.: A game of prediction with expert advice. Journal of Computer and
System Sciences 56, 153–173 (1998)
7. Vovk, V.: Derandomizing stochastic prediction strategies. Machine Learning 35,
247–282 (1999)
8. Dawid, A.P.: Probability forecasting. In: Kotz, S., Johnson, N.L., Read, C.B. (eds.)
Encyclopedia of Statistical Sciences, vol. 7, pp. 210–218. Wiley, New York (1986)
9. Gneiting, T., Raftery, A.E.: Strictly proper scoring rules, prediction, and estima-
tion. Journal of the American Statistical Association 102, 359–378 (2007)
10. Press, W.H., Flannery, B.P., Teukolsky, S.A., Vetterling, W.T.: Numerical Recipes
in C, 2nd edn. Cambridge University Press, Cambridge (1992)
11. Vovk, V.: Defensive forecasting for optimal prediction with expert advice. Technical
Report arXiv:0708.1503 [cs.LG], arXiv.org e-Print archive (August 2007)
12. Shafer, G., Vovk, V.: Probability and Finance: It’s Only a Game! Wiley, New York
(2001)
13. Dawid, A.P., Vovk, V.: Prequential probability: principles and properties.
Bernoulli 5, 125–162 (1999)
14. Levin, L.A.: Uniform tests of randomness. Soviet Mathematics Doklady 17,
337–340 (1976)
15. Vovk, V., Takemura, A., Shafer, G.: Defensive forecasting. In: Cowell, R.G.,
Ghahramani, Z. (eds.) Proceedings of the Tenth International Workshop
on Artificial Intelligence and Statistics, Savannah Hotel, Barbados, Society
for Artificial Intelligence and Statistics, January 6-8, pp. 365–372 (2005),
http://www.gatsby.ucl.ac.uk/aistats/
Pure Exploration in Multi-armed Bandits Problems
1 Introduction
Learning processes usually face an exploration versus exploitation dilemma, since they
have to get information on the environment (exploration) to be able to take good actions
(exploitation). A key example is the multi-armed bandit problem [Rob52], a sequential
decision problem where, at each stage, the forecaster has to pull one out of K given
stochastic arms and gets a reward drawn at random according to the distribution of
the chosen arm. The usual assessment criterion of a strategy is given by its cumulative
regret, the sum of differences between the expected reward of the best arm and the
obtained rewards. Typical good strategies, like the UCB strategies of [ACBF02], trade
off between exploration and exploitation.
Our setting is as follows. The forecaster may sample the arms a given number of
times n (not necessarily known in advance) and is then asked to output a recommenda-
tion, formed by a probability distribution over the arms. He is evaluated by his simple
regret, that is, the difference between the average payoff of the best arm and the average
payoff obtained by his recommendation. The distinguishing feature from the classical
multi-armed bandit problem is that the exploration phase and the evaluation phase are
separated. We now illustrate why this is a natural framework for numerous applications.
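To fix ideas, here is a minimal sketch of this protocol in code (our illustration; the round-robin allocation, the empirical-best-arm recommendation, and the Bernoulli parameters are simple placeholders):

```python
# Pure-exploration protocol: sample for n rounds, then recommend a
# probability distribution psi over the arms; the simple regret is the gap
# between the best mean and the mean of the recommendation.
import random

def simple_regret(mus, allocate, recommend, n):
    """mus: Bernoulli parameters; returns r_n = mu* - sum_j psi_j mu_j."""
    K = len(mus)
    counts, sums = [0] * K, [0.0] * K
    for t in range(n):                                 # exploration phase
        j = allocate(t, counts, sums)
        counts[j] += 1
        sums[j] += float(random.random() < mus[j])     # draw a reward
    psi = recommend(counts, sums)                      # evaluation phase
    return max(mus) - sum(p * mu for p, mu in zip(psi, mus))

uniform = lambda t, counts, sums: t % len(counts)      # round-robin allocation

def empirical_best_arm(counts, sums):                  # point mass on best mean
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    psi = [0.0] * len(counts)
    psi[means.index(max(means))] = 1.0
    return psi

print(simple_regret([0.5, 0.4, 0.3], uniform, empirical_best_arm, n=300))
```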
Historically, the first occurrence of multi-armed bandit problems was given by med-
ical trials. In the case of a severe disease, ill patients only are included in the trial and
the cost of picking the wrong treatment is high (the associated reward would equal a
large negative value). It is important to minimize the cumulative regret, since the test
and cure phases coincide. However, for cosmetic products, there exists a test phase
separated from the commercialization phase, and one aims at minimizing the regret of
the commercialized product rather than the cumulative regret in the test phase, which
is irrelevant. (Here, several formulæ for a cream are considered and some quantitative
measurement, like skin moisturization, is performed.)
The pure exploration problem addresses the design of strategies making the best pos-
sible use of available numerical resources (e.g., as CPU time) in order to optimize the
performance of some decision-making task. That is, it occurs in situations with a prelim-
inary exploration phase in which costs are not measured in terms of rewards but rather in
terms of resources, that come in limited budget. A motivating example concerns recent
works on computer-go (e.g., the MoGo program of [GWMT06]). A given time, i.e., a
given amount of CPU time, is given to the player to explore the possible outcomes of
a sequence of plays and to output a final decision. An efficient exploration of the search
space is obtained by considering a hierarchy of forecasters minimizing some cumulative
regret – see, for instance, the UCT strategy of [KS06] and the BAST strategy of [CM07].
However, the cumulative regret does not seem to be the right way to base the strategies
on, since the simulation costs are the same for exploring all options, bad and good ones.
This observation was actually the starting point of the notion of simple regret and of this
work. A final related example is the maximization of some function f , observed with
noise, see, e.g., [Kle04, BMSS09]. Whenever evaluating f at a point is costly (e.g., in
terms of numerical or financial costs), the issue is to choose as adequately as possible
where to query the value of this function in order to have a good approximation to the
maximum. The pure exploration problem considered here addresses exactly the design
of adaptive exploration strategies making the best use of available resources in order to
make the most precise prediction once all resources are consumed.
As a remark, it also turns out that in all examples considered above, we may impose
the further restriction that the forecaster does not know ahead of time the amount of
available resources (time, budget, or the number of patients to be included) – that is,
we seek anytime performance. The problem of pure exploration presented above was referred
to as “budgeted multi-armed bandit problem” in [MLG04], where another notion of re-
gret than simple regret is considered. [Sch06] solves the pure exploration problem in a
minmax sense for the case of two arms only and rewards given by probability distribu-
tions over [0, 1]. [EDMM02] and [MT04] consider a related setting where forecasters
denote respectively the expectations of the rewards of the best arm j ∗ (a best arm, if
there are several of them with same maximal expectation) and of the recommendation
Goal and structure of the paper: We study the links between simple and cumulative
regrets. Intuitively, an efficient allocation strategy for the simple regret should rely on
some exploration–exploitation trade-off. Our main contribution (Theorem 1, Section 3)
is a lower bound on the simple regret in terms of the cumulative regret suffered in the
exploration phase, showing that the trade-off involved in the minimization of the simple
regret is somewhat different from the one for the cumulative regret. It in particular
implies that the uniform allocation is a good benchmark when n is large. In Sections 4
and 5, we show how, despite all, one can fight against this negative result. For instance,
some strategies designed for the cumulative regret can outperform (for moderate values
of n) strategies with exponential rates of convergence for their simple regret.
Theorem 1 (Main result). For all allocation strategies $(\varphi_t)$ and all functions ε :
{1, 2, . . .} → R such that
for all (Bernoulli) distributions $\nu_1, \ldots, \nu_K$ on the rewards, there exists a constant
$C \ge 0$ with $\mathbb{E}\, R_n \le C\, \varepsilon(n)$,
the simple regret of all recommendation strategies $(\psi_t)$ based on the allocation strate-
gies $(\varphi_t)$ is such that
for all sets of $K \ge 3$ (distinct, Bernoulli) distributions on the rewards, all different from
a Dirac distribution at 1, there exists a constant $D \ge 0$ and an ordering $\nu_1, \ldots, \nu_K$ of
the considered distributions with
$$\mathbb{E}\, r_n \ge \frac{\Delta}{2}\, e^{-D\, \varepsilon(n)} \,.$$
Corollary 1. For all allocation strategies $(\varphi_t)$, all recommendation strategies $(\psi_t)$, and
all sets of $K \ge 3$ (distinct, Bernoulli) distributions on the rewards, there exist two
constants $\beta > 0$ and $\gamma \ge 0$ such that, up to the choice of a good ordering of the
considered distributions,
$$\mathbb{E}\, r_n \ge \beta\, e^{-\gamma n} \,.$$
Theorem 1 is proved below and Corollary 1 follows from the fact that the cumulative
regrets are always bounded by n. To get further the point of the theorem, one should
keep in mind that the typical (distribution-dependent) rate of growth of the cumulative
regrets of good algorithms, e.g., UCB1 of [ACBF02], is ε(n) = ln n. This, as asserted in
[LR85], is the optimal rate. But the recommendation strategies based on such allocation
strategies are bound to suffer a simple regret that decreases at best polynomially fast.
We state this result for the slight modification UCB(p) of UCB1 stated in Figure 2; its
proof relies on noting that it achieves a cumulative regret bounded by a large enough
distribution-dependent constant times ε(n) = p ln n.
Corollary 2. The allocation strategy $(\varphi_t)$ given by the forecaster UCB(p) of Figure 2
ensures that for all recommendation strategies $(\psi_t)$ and all sets of $K \ge 3$ (distinct,
Bernoulli) distributions on the rewards, there exist two constants $\beta > 0$ and $\gamma \ge 0$
(independent of p) such that, up to the choice of a good ordering of the considered
distributions,
$$\mathbb{E}\, r_n \ge \beta\, n^{-\gamma p} \,.$$
Proof. The intuitive version of the proof of Theorem 1 is as follows. The basic idea
is to consider a tie case when the best and worst arms have zero empirical means; it
happens often enough (with a probability at least exponential in the number of times we
pulled these arms) and results in the forecaster basically having to pick another arm and
suffering some regret. Permutations are used to control the case of untypical or naive
forecasters that would despite all pull an arm with zero empirical mean, since they force
a situation when those forecasters choose the worst arm instead of the best one.
Formally, we fix the allocation strategies (ϕt ) and a corresponding function ε such
that the assumption of the theorem is satisfied. We consider below a set of $K \ge 3$
(distinct) Bernoulli distributions; actually, we only use below that their parameters are
(up to a first ordering) such that $1 > \mu_1 > \mu_2 \ge \mu_3 \ge \ldots \ge \mu_K \ge 0$ and $\mu_2 > \mu_K$
(thus, $\mu_2 > 0$).
$$\max_\sigma \mathbb{E}_\sigma\, r_n \ \ge\ \frac{1}{K!} \sum_\sigma \mathbb{E}_\sigma\, r_n \ \ge\ \frac{\mu_1 - \mu_2}{K!} \sum_\sigma \mathbb{E}_\sigma\left[1 - \psi_{\sigma(1),n}\right],$$
where we used that under Pσ , the index of the best arm is σ(1) and the minimal regret
for playing any other arm is at least μ1 − μ2 .
Step 2. Rewrites each term of the sum over σ as the product of three simple terms. We
use first that P1,σ is the same as Pσ , except that it ensures that arm σ(1) has zero reward
throughout. Denoting by
$$C_{j,n} = \sum_{t=1}^{T_j(n)} X_{j,t}$$
the cumulative reward of the j-th arm till round n, one then gets
$$\mathbb{E}_\sigma\left[1 - \psi_{\sigma(1),n}\right] \ge \mathbb{E}_\sigma\left[\left(1 - \psi_{\sigma(1),n}\right) \mathbb{I}_{\{C_{\sigma(1),n} = 0\}}\right]
= \mathbb{E}_\sigma\left[1 - \psi_{\sigma(1),n} \mid C_{\sigma(1),n} = 0\right] \times \mathbb{P}_\sigma\left(C_{\sigma(1),n} = 0\right)
= \mathbb{E}_{1,\sigma}\left[1 - \psi_{\sigma(1),n}\right] \mathbb{P}_\sigma\left(C_{\sigma(1),n} = 0\right) \,,$$
and therefore,
$$\mathbb{E}_\sigma\left[1 - \psi_{\sigma(1),n}\right] \ge \mathbb{E}_{K,\sigma}\left[1 - \psi_{\sigma(1),n}\right] \mathbb{P}_{1,\sigma}\left(C_{\sigma(K),n} = 0\right) \mathbb{P}_\sigma\left(C_{\sigma(1),n} = 0\right). \qquad (1)$$
Step 3. Deals with the second term in the right-hand side of (1),
$$\mathbb{P}_{1,\sigma}\left(C_{\sigma(K),n} = 0\right) = \mathbb{E}_{1,\sigma}\left[(1 - \mu_K)^{T_{\sigma(K)}(n)}\right] \ge (1 - \mu_K)^{\mathbb{E}_{1,\sigma} T_{\sigma(K)}(n)} \,,$$
where the equality can be seen by conditioning on $I_1, \ldots, I_n$ and then taking the ex-
pectation, whereas the inequality is a consequence of Jensen's inequality. Now, the ex-
pected number of times the sub-optimal arm σ(K) is pulled under $\mathbb{P}_{1,\sigma}$ is bounded by
the regret, by the very definition of the latter: $(\mu_2 - \mu_K)\, \mathbb{E}_{1,\sigma} T_{\sigma(K)}(n) \le \mathbb{E}_{1,\sigma} R_n$.
Since by hypothesis (and by taking the maximum of K! values), there exists a constant
C such that for all σ, $\mathbb{E}_{1,\sigma} R_n \le C\, \varepsilon(n)$, we finally get
$$\mathbb{P}_{1,\sigma}\left(C_{\sigma(K),n} = 0\right) \ge (1 - \mu_K)^{C \varepsilon(n)/(\mu_2 - \mu_K)} \,.$$
Step 4. Lower bounds the third term in the right-hand side of (1) as
$$\mathbb{P}_\sigma\left(C_{\sigma(1),n} = 0\right) \ge (1 - \mu_1)^{C \varepsilon(n)/\mu_2} \,,$$
where the sums are over those histories $w_n$ such that the realizations of the payoffs
obtained by the arm σ(1) equal $x_{\sigma(1),s} = 0$ for all $s = 1, \ldots, t_{\sigma(1)}(n)$. The ar-
gument is concluded as before, first by Jensen's inequality and then by using that
$\mu_2\, \mathbb{E}_{1,\sigma} T_{\sigma(1)}(n) \le \mathbb{E}_{1,\sigma} R_n \le C\, \varepsilon(n)$, by definition of the regret and the hypothesis
put on its control.
Step 5. Resorts to a symmetry argument to show that as far as the first term of the
right-hand side of (1) is concerned,
$$\sum_\sigma \mathbb{E}_{K,\sigma}\left[1 - \psi_{\sigma(1),n}\right] \ge \frac{K!}{2} \,.$$
Since PK,σ only depends on σ(2), . . . , σ(K − 1), we denote by Pσ(2),...,σ(K−1) the
common value of these probability distributions when σ(1) and σ(K) vary (and a sim-
ilar notation for the associated expectation). We can thus group the permutations σ two
by two according to these (K −2)–tuples, one of the two permutations being defined by
σ(1) equal to one of the two elements of {1, . . . , K} not present in the (K − 2)–tuple,
and the other one being such that σ(1) equals the other such element. Formally,
$$\sum_\sigma \mathbb{E}_{K,\sigma}\left[\psi_{\sigma(1),n}\right] = \sum_{j_2,\ldots,j_{K-1}} \mathbb{E}_{j_2,\ldots,j_{K-1}}\Biggl[\sum_{j \in \{1,\ldots,K\} \setminus \{j_2,\ldots,j_{K-1}\}} \psi_{j,n}\Biggr]
\le \sum_{j_2,\ldots,j_{K-1}} \mathbb{E}_{j_2,\ldots,j_{K-1}}[1] = \frac{K!}{2} \,,$$
where the summations over j2 , . . . , jK−1 are over all possible (K −2)–tuples of distinct
elements in {1, . . . , K}.
Step 6. Simply puts all pieces together and lower bounds $\max_\sigma \mathbb{E}_\sigma\, r_n$ by
$$\frac{\mu_1 - \mu_2}{K!} \sum_\sigma \mathbb{E}_{K,\sigma}\left[1 - \psi_{\sigma(1),n}\right] \mathbb{P}_\sigma\left(C_{\sigma(1),n} = 0\right) \mathbb{P}_{1,\sigma}\left(C_{\sigma(K),n} = 0\right)
\ge \frac{\mu_1 - \mu_2}{2} \left((1 - \mu_K)^{C/(\mu_2 - \mu_K)}\, (1 - \mu_1)^{C/\mu_2}\right)^{\varepsilon(n)} \,.$$
Parameters: K arms

Uniform allocation — plays all arms one after the other
For each round t = 1, 2, . . . , play arm t modulo K (i.e., cycle through the arms).

UCB(p) — plays each arm once and then the one with the best upper confidence bound
Parameter: quantile factor p
For rounds t = 1, . . . , K, play $\varphi_t = \delta_t$.
For each round t = K + 1, K + 2, . . . ,
(1) compute, for all j = 1, . . . , K, the quantities
$$\hat\mu_{j,t-1} = \frac{1}{T_j(t-1)} \sum_{s=1}^{T_j(t-1)} X_{j,s} \,;$$
(2) use $\varphi_t = \delta_{j^*_{t-1}}$, where
$$j^*_{t-1} \in \mathop{\mathrm{argmax}}_{j=1,\ldots,K}\ \left[\hat\mu_{j,t-1} + \sqrt{\frac{p \ln(t-1)}{T_j(t-1)}}\right]$$
(ties broken by choosing, for instance, the arm with smallest index).
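The following is a sketch (our illustration, assuming Bernoulli rewards) of the UCB(p) allocation of Figure 2 together with the "most played arm" (MPA) recommendation studied below:

```python
# UCB(p) allocation with the most played arm recommendation (a sketch).
import math, random

def ucb_p_allocation(mus, n, p):
    """Sample K Bernoulli arms for n rounds with UCB(p); return statistics."""
    K = len(mus)
    counts, sums = [0] * K, [0.0] * K
    for t in range(1, n + 1):
        if t <= K:
            j = t - 1                      # play each arm once
        else:
            # upper confidence bounds with quantile factor p
            ucb = [sums[i] / counts[i]
                   + math.sqrt(p * math.log(t - 1) / counts[i])
                   for i in range(K)]
            j = ucb.index(max(ucb))        # argmax; smallest index on ties
        counts[j] += 1
        sums[j] += float(random.random() < mus[j])
    return counts, sums

def most_played_arm(counts):               # MPA recommendation
    return counts.index(max(counts))

counts, _ = ucb_p_allocation([0.5, 0.45, 0.3], n=500, p=2.0)
print(most_played_arm(counts))
```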
Table 1. Distribution-dependent (top) and distribution-free (bottom) bounds on the expected sim-
ple regret of the considered pairs of allocation (lines) and recommendation (columns) strategies.
Lower bounds are also indicated. The symbols □ denote universal constants, whereas the ■ are
distribution-dependent constants.

Distribution-dependent:
                   EDP               EBA               MPA
Uniform                           ■ e^{−■ n}
UCB(p)        ■ (p ln n)/n         ■ n^{−■}        ■ n^{2(1−p)}
Lower bound                       ■ e^{−■ n}

Distribution-free:
                   EDP               EBA               MPA
Uniform                       □ √(K ln K / n)
UCB(p)     □ √(pK ln n / n)     □ / √(p ln n)    □ √(pK ln n / n)
Lower bound                       □ √(K / n)
Table 1 indicates that while for distribution-dependent bounds, the asymptotic op-
timal rate of decrease in the number n of rounds for simple regrets is exponential, for
distribution-free bounds, the rate worsens to $1/\sqrt{n}$. A similar situation arises for the cu-
mulative regret, see [LR85] (optimal ln n rate for distribution-dependent bounds) versus
[ACBFS02] (optimal $\sqrt{n}$ rate for distribution-free bounds).
Remark 1. We can rephrase the results of [KS06] as using UCB1 as an allocation strat-
egy and forming a recommendation according to the empirical best arm. In particular,
[KS06, Theorem 5] provides a distribution-dependent bound on the probability of not
picking the best arm with this procedure and can be used to derive the following bound
on the simple regret:
$$\mathbb{E}\, r_n \le \sum_{j:\Delta_j > 0} \frac{4}{\Delta_j} \left(\frac{1}{n}\right)^{\rho \Delta_j^2 / 2}$$
for all $n \ge 1$. The leading constants $1/\Delta_j$ and the distribution-dependent exponent
make it not as useful as the one presented in Theorem 2. The best distribution-free
bound we could get from this bound was of the order of $1/\sqrt{\ln n}$, to be compared to the
asymptotic optimal $1/\sqrt{n}$ rate stated in Theorem 3.
for all n sufficiently large, e.g., such that, for all suboptimal arms j,
$$a_j n \ge 1 + \frac{4p \ln n}{\Delta_j^2} \qquad \text{and} \qquad a_j n \ge K + 2 \,.$$
Proof. We first prove that whenever the most played arm $J_n^*$ is different from an optimal
arm $j^*$, then at least one of the suboptimal arms j is such that $T_j(n) \ge a_j n$. To do so,
we prove the converse and assume that $T_j(n) < a_j n$ for all suboptimal arms. Then,
$$\sum_{i=1}^{K} a_i n = n = \sum_{i=1}^{K} T_i(n) < \sum_{j^*} T_{j^*}(n) + \sum_{j} a_j n \,,$$
where, in the inequality, the first summation is over the optimal arms, the second one,
over the suboptimal ones. Therefore, we get
$$\sum_{j^*} a_{j^*} n < \sum_{j^*} T_{j^*}(n)$$
and there exists at least one optimal arm $j^*$ such that $T_{j^*}(n) > a_{j^*} n$. Since by definition
of the vector $(a_1, \ldots, a_K)$, one has $a_j \le a_{j^*}$ for all suboptimal arms, it comes that
$T_j(n) < a_j n \le a_{j^*} n < T_{j^*}(n)$ for all suboptimal arms, and the most played arm $J_n^*$ is
thus an optimal arm. Thus, using that $\Delta_j \le 1$ for all j,
$$\mathbb{E}\, r_n = \mathbb{E}\, \Delta_{J_n^*} \le \sum_{j:\Delta_j > 0} \mathbb{P}\bigl(T_j(n) \ge a_j n\bigr) \,.$$
A side-result extracted from the proof of [ACBF02, Theorem 1] states that for all sub-
optimal arms j and all rounds $t \ge K + 1$,
$$\mathbb{P}\bigl(I_t = j \text{ and } T_j(t-1) \ge \ell\bigr) \le 2\, t^{1-2p} \qquad \text{whenever} \quad \ell \ge \frac{4p \ln n}{\Delta_j^2} \,. \qquad (2)$$
This yields that for a suboptimal arm j and since by the assumptions on n and the $a_j$,
the choice $\ell = \lceil a_j n \rceil - 1$ satisfies $\ell \ge K + 1$ and $\ell \ge (4p \ln n)/\Delta_j^2$,
$$\mathbb{P}\bigl(T_j(n) \ge a_j n\bigr) \le \sum_{t=\lceil a_j n \rceil}^{n} \mathbb{P}\bigl(T_j(t-1) = \lceil a_j n \rceil - 1 \text{ and } I_t = j\bigr)
\le \sum_{t=\lceil a_j n \rceil}^{n} 2\, t^{1-2p} \le \frac{1}{p-1}\, (a_j n)^{2(1-p)} \,, \qquad (3)$$
where we used a union bound for the second inequality and (2) for the third inequality.
A summation over all suboptimal arms j concludes the proof.
Proof (of Theorem 2). We apply Lemma 1 with the uniform choice aj = 1/K and
recall that Δ is the minimum of the Δj > 0.
Proof (of Theorem 3). We start the proof by using that $\sum_j \psi_{j,n} = 1$ and $\Delta_j \le 1$ for all
j, and can thus write
$$\mathbb{E}\, r_n = \mathbb{E}\, \Delta_{J_n^*} = \sum_{j=1}^{K} \Delta_j\, \mathbb{E}\, \psi_{j,n} \le \varepsilon + \sum_{j:\Delta_j > \varepsilon} \Delta_j\, \mathbb{E}\, \psi_{j,n} \,.$$
Since $J_n^* = j$ only if $T_j(n) \ge n/K$, that is, $\psi_{j,n} = \mathbb{I}_{\{J_n^* = j\}} \le \mathbb{I}_{\{T_j(n) \ge n/K\}}$, we get
$$\mathbb{E}\, r_n \le \varepsilon + \sum_{j:\Delta_j > \varepsilon} \Delta_j\, \mathbb{P}\left(T_j(n) \ge \frac{n}{K}\right).$$
Applying (3) with $a_j = 1/K$ leads to
$$\mathbb{E}\, r_n \le \varepsilon + \Biggl(\sum_{j:\Delta_j > \varepsilon} \frac{\Delta_j}{p-1}\Biggr) K^{2(p-1)}\, n^{2(1-p)} \,,$$
where ε is chosen such that for all $\Delta_j > \varepsilon$, the condition $\ell = n/K - 1 \ge (4p \ln n)/\Delta_j^2$
is satisfied ($n/K - 1 \ge K + 1$ being satisfied by the assumption on n and K). The
conclusion thus follows from taking, for instance, $\varepsilon = \sqrt{(4pK \ln n)/(n - K)}$ and
upper bounding all remaining $\Delta_j$ by 1.
We now explain why, in some cases, the bound provided by our theoretical analysis
in Lemma 1 is better than the bound stated in Proposition 1. The central point in the
argument is that the bound of Lemma 1 is of the form $■\, n^{2(1-p)}$, for some distribution-
dependent constant ■, that is, it has a distribution-free convergence rate. In comparison,
the bound of Proposition 1 involves the gaps Δj in the rate of convergence. Some care is
needed in the comparison, since the bound for UCB(p) holds only for n large enough,
but it is easy to find situations where for moderate values of n, the bound exhibited
for the sampling with UCB(p) is better than the one for the uniform allocation. These
situations typically involve a rather large number K of arms; in the latter case, the
uniform allocation strategy only samples each arm n/K times, whereas the UCB strategy
rapidly focuses its exploration on the best arms. A general argument is proposed in the
extended version [BMS09, Appendix B]. We only consider here one numerical example
Fig. 4. Simple regret of different pairs of allocation and recommendation strategies, for K = 20
arms with Bernoulli distributions of parameters indicated on top of each graph; X–axis: number
of samples, Y –axis: expectation of the simple regret (the smaller, the better)
extracted from there, see the right part of Figure 4. For moderate values of n (at least
when n is about 6 000), the bounds associated to the sampling with UCB(p) are better
than the ones associated to the uniform sampling.
To make the story described in this paper short, we can distinguish three regimes:
– for large values of n, uniform exploration is better (as shown by a combination of
the lower bound of Corollary 2 and of the upper bound of Proposition 1);
– for moderate values of n, sampling with UCB(p) is preferable, as discussed just
above;
– for small values of n, the best bounds to use seem to be the distribution-free bounds,
which are of the same order of magnitude for the two strategies.
Of course, these statements involve distribution-dependent quantifications (to determine
which n are small, moderate, or large).
We propose two simple experiments to illustrate our theoretical analysis; each of them was run on $10^4$ instances of the problem and we plotted the average simple regrets. (More experiments can be found in [BMS09].) The first one corresponds in some sense to the worst case alluded to at the beginning of Section 4. It shows that for small values of $n$ (e.g., $n \le 80$ in the left plot of Figure 4), the uniform allocation strategy is very competitive. Of course, the range of these values of $n$ can be made arbitrarily large by decreasing the gaps. The second one corresponds to the numerical example described earlier in this section.
We mostly illustrate here the small- and moderate-$n$ regimes. (This is because for large $n$ the simple regrets are usually very small, even below computer precision.) Because of these chosen ranges, we do not yet see the uniform allocation strategy getting better than the UCB-based strategies. This has an important impact on the interpretation of the lower bound of Theorem 1. While its statement is in finite time, it should be interpreted as providing an asymptotic result only.
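To make the experimental protocol concrete, here is a minimal Python sketch of the comparison just described (uniform allocation versus UCB(p), with the recommendation being the empirical best arm). This is our own illustration, not the authors' code; the Bernoulli parameters, budgets, and number of runs are assumptions chosen for readability.

```python
import numpy as np

def simple_regret(alloc, n, means, runs=2000, p=2.0, seed=0):
    """Average simple regret after an allocation budget of n samples."""
    rng = np.random.default_rng(seed)
    K, best = len(means), max(means)
    total = 0.0
    for _ in range(runs):
        counts = np.ones(K)                               # one initial pull per arm
        sums = (rng.random(K) < means).astype(float)
        for t in range(K, n):
            if alloc == "uniform":
                j = t % K                                 # round-robin allocation
            else:                                         # UCB(p) index policy
                j = int(np.argmax(sums / counts + np.sqrt(p * np.log(t) / counts)))
            sums[j] += rng.random() < means[j]
            counts[j] += 1
        total += best - means[int(np.argmax(sums / counts))]  # recommend empirical best arm
    return total / runs

means = np.array([0.5] + [0.4] * 19)                      # K = 20 arms, gap 0.1 (assumed)
for n in (50, 100, 200):
    print(n, simple_regret("uniform", n, means), simple_regret("ucb", n, means))
```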
References
[ACBF02] Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed
bandit problem. Machine Learning Journal 47, 235–256 (2002)
[ACBFS02] Auer, P., Cesa-Bianchi, N., Freund, Y., Schapire, R.: The non-stochastic multi-
armed bandit problem. SIAM Journal on Computing 32(1), 48–77 (2002)
[BMS09] Bubeck, S., Munos, R., Stoltz, G.: Pure exploration for multi-armed
bandit problems. Technical report, HAL report hal-00257454 (2009),
http://hal.archives-ouvertes.fr/hal-00257454/en
[BMSS09] Bubeck, S., Munos, R., Stoltz, G., Szepesvári, C.: Online optimization in X-armed bandits. In: Advances in Neural Information Processing Systems, vol. 21
(2009)
[CM07] Coquelin, P.-A., Munos, R.: Bandit algorithms for tree search. In: Proceedings
of the 23rd Conference on Uncertainty in Artificial Intelligence (2007)
[EDMM02] Even-Dar, E., Mannor, S., Mansour, Y.: PAC bounds for multi-armed bandit
and Markov decision processes. In: Kivinen, J., Sloan, R.H. (eds.) COLT 2002.
LNCS (LNAI), vol. 2375, pp. 255–270. Springer, Heidelberg (2002)
[GWMT06] Gelly, S., Wang, Y., Munos, R., Teytaud, O.: Modification of UCT with patterns
in Monte-Carlo go. Technical Report RR-6062, INRIA (2006)
[Kle04] Kleinberg, R.: Nearly tight bounds for the continuum-armed bandit problem. In:
18th Advances in Neural Information Processing Systems (2004)
[KS06] Kocsis, L., Szepesvári, C.: Bandit based Monte-Carlo planning. In: Fürnkranz,
J., Scheffer, T., Spiliopoulou, M. (eds.) ECML 2006. LNCS (LNAI), vol. 4212,
pp. 282–293. Springer, Heidelberg (2006)
[LR85] Lai, T.L., Robbins, H.: Asymptotically efficient adaptive allocation rules. Ad-
vances in Applied Mathematics 6, 4–22 (1985)
[MLG04] Madani, O., Lizotte, D., Greiner, R.: The budgeted multi-armed bandit prob-
lem. In: Proceedings of the 17th Annual Conference on Computational Learning
Theory, pp. 643–645 (2004); Open problems session
[MT04] Mannor, S., Tsitsiklis, J.N.: The sample complexity of exploration in the multi-
armed bandit problem. Journal of Machine Learning Research 5, 623–648
(2004)
[Rob52] Robbins, H.: Some aspects of the sequential design of experiments. Bulletin of
the American Mathematical Society 58, 527–535 (1952)
[Sch06] Schlag, K.: Eleven tests needed for a recommendation. Technical Report
ECO2006/2, European University Institute (2006)
The Follow Perturbed Leader Algorithm
Protected from Unbounded One-Step Losses
Vladimir V. V’yugin
1 Introduction
Expert algorithms are used for online prediction, repeated decision making, and repeated game playing. Starting with the Weighted Majority Algorithm (WM) of Littlestone and Warmuth [6] and Vovk's [11] Aggregating Algorithm, the theory of Prediction with Expert Advice has developed rapidly in recent years. Most authors have concentrated on predicting binary sequences and have used specific loss functions, such as the absolute, square, and logarithmic losses. Arbitrary losses are less common. A survey can be found in the book of Cesa-Bianchi and Lugosi [7].
In this paper, we consider a different general approach, the "Follow the Perturbed Leader" (FPL) algorithm, now called Hannan's algorithm [3], [5], [7]. Under this approach we simply choose the decision that has fared best in the past (the leader). In order to cope with an adversary, some randomization is implemented by adding a perturbation to the total loss prior to selecting the leader. The goal of the learner's algorithm is to perform almost as well as the best expert in hindsight in the long run. The resulting FPL algorithm has the same performance guarantees as WM-type algorithms for fixed learning rate and bounded one-step losses, save for a factor of $\sqrt{2}$.
Prediction with Expert Advice, as considered in this paper, proceeds as follows. We are asked to perform sequential actions at times $t = 1, 2, \ldots, T$. At each time step $t$, experts $i = 1, \ldots, N$ receive the results of their actions in the form of their losses $s^i_t$, which are non-negative real numbers.
At the beginning of step $t$, Learner, observing the cumulative losses $s^i_{1:t-1} = s^i_1 + \cdots + s^i_{t-1}$ of all experts $i = 1, \ldots, N$, makes a decision to follow one of these
experts, say Expert $i$. At the end of step $t$, Learner receives the same loss $s^i_t$ as Expert $i$ at step $t$ and suffers Learner's cumulative loss $s_{1:t} = s_{1:t-1} + s^i_t$.
In the traditional framework, one supposes that the one-step losses of all experts are bounded, for example, $0 \le s^i_t \le 1$ for all $i$ and $t$.
A well-known simple example of a game with two experts shows that Learner can perform much worse than each expert: let the losses of the two experts on steps $t = 0, 1, \ldots, 6$ be $s^1_{0,1,2,3,4,5,6} = (\tfrac12, 0, 1, 0, 1, 0, 1)$ and $s^2_{0,1,2,3,4,5,6} = (0, 1, 0, 1, 0, 1, 0)$. Evidently, the "Follow the Leader" algorithm always chooses the wrong prediction.
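A few lines of code make this failure concrete; this is our own illustrative sketch of the example above, not part of the original paper.

```python
# Follow-the-Leader on the adversarial two-expert example above.
losses = [(0.5, 0.0), (0.0, 1.0), (1.0, 0.0), (0.0, 1.0), (1.0, 0.0), (0.0, 1.0), (1.0, 0.0)]

cum = [0.0, 0.0]          # cumulative losses of the two experts
learner = 0.0             # cumulative loss of Follow-the-Leader
for s1, s2 in losses:
    leader = 0 if cum[0] <= cum[1] else 1   # ties broken toward expert 1
    learner += (s1, s2)[leader]
    cum[0] += s1
    cum[1] += s2

print("experts:", cum, "follow-the-leader:", learner)
# experts end at 3.5 and 3, while the leader-follower suffers 6.5
```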
When the experts' one-step losses are bounded, this problem has been solved using randomization of the experts' cumulative losses. The method of following the perturbed leader was discovered by Hannan [3]. Kalai and Vempala [5] rediscovered this method and published a simple proof of the main result of Hannan. They called an algorithm of this type FPL (Follow the Perturbed Leader).
The FPL algorithm outputs the prediction of an expert $i$ which minimizes
\[ s^i_{1:t-1} - \frac{1}{\epsilon}\, \xi^i, \]
where $\xi^i$, $i = 1, \ldots, N$, $t = 1, 2, \ldots$, is a sequence of i.i.d. random variables distributed according to the exponential distribution with density $p(x) = \exp\{-x\}$, and $\epsilon$ is a learning rate.
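For concreteness, here is a minimal sketch of this rule in Python; the loss matrix and the fixed learning rate are our own illustrative assumptions, and fresh exponential perturbations are drawn at every round.

```python
import numpy as np

def fpl(loss_matrix, eps, rng=np.random.default_rng(0)):
    """Follow the Perturbed Leader with a fixed learning rate eps.

    loss_matrix[t, i] is the one-step loss of expert i at round t.
    Returns the cumulative loss suffered by the learner.
    """
    T, N = loss_matrix.shape
    cum = np.zeros(N)                            # cumulative expert losses s^i_{1:t-1}
    total = 0.0
    for t in range(T):
        xi = rng.exponential(scale=1.0, size=N)  # i.i.d. Exp(1) perturbations
        i = int(np.argmin(cum - xi / eps))       # perturbed leader
        total += loss_matrix[t, i]
        cum += loss_matrix[t]
    return total

# illustrative run on the adversarial two-expert losses from the example above
L = np.array([[0.5, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
print(fpl(L, eps=0.5), L.sum(axis=0))
```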
Kalai and Vempala [5] show that the expected cumulative loss of the FPL algorithm has the upper bound
\[ \mathbb{E}(s_{1:t}) \le (1 + \epsilon) \min_{i=1,\ldots,N} s^i_{1:t} + \frac{\log N}{\epsilon}, \]
where $\epsilon$, a positive real number such that $0 < \epsilon < 1$, is the learning rate, and $N$ is the number of experts.
Hutter and Poland [4] presented further developments of the FPL algorithm for countable classes of experts, arbitrary weights, and adaptive learning rates. The FPL algorithm is usually considered for bounded one-step losses: $0 \le s^i_t \le 1$ for all $i$ and $t$.
Most papers on prediction with expert advice either consider bounded losses or assume the existence of a specific loss function (see [7]). We allow losses at any step to be unbounded; the notion of a specific loss function is not used. The setting allowing unbounded one-step losses does not have wide coverage in the literature; we can only refer the reader to [1], [2], [9].
Poland and Hutter [9] have studied games where the one-step losses of all experts at each step $t$ are bounded from above by an increasing sequence $B_t$ given in advance. They presented a learning algorithm which is asymptotically consistent for $B_t = t^{1/16}$.
Allenberg et al. [2] have considered polynomially bounded one-step losses for a modified version of the Littlestone and Warmuth algorithm [6] under partial monitoring.
In the full information case, their algorithm has the expected regret $2\sqrt{N \ln N}\, (T+1)^{\frac{1}{2}(1+a+\beta^2)}$ in the case where the one-step losses of all experts $i = 1, 2, \ldots, N$ at each step $t$ have the bound $(s^i_t)^2 \le t^a$, where $a > 0$, and $\beta > 0$ is
a parameter of the algorithm.¹ They have proved that this algorithm is Hannan consistent if
\[ \max_{1 \le i \le N} \frac{1}{T} \sum_{t=1}^{T} (s^i_t)^2 < c\, T^a \]
for all $T$, where $c > 0$ and $0 < a < 1$.
In this paper, we also consider the case where the loss grows "faster than polynomially, but slower than exponentially".
We present a modification of the Kalai and Vempala [5] algorithm of following the perturbed leader (FPL) for the case of unrestrictedly large one-step expert losses $s^i_t$ not bounded in advance. This algorithm uses adaptive weights depending on past cumulative losses of the experts.
We analyze the asymptotic consistency of our algorithms using a nonstandard scaling. We introduce the new notions of the volume of a game, $v_t = \sum_{j=1}^{t} \max_i s^i_j$, and the scaled fluctuation of the game, $\mathrm{fluc}(t) = \Delta v_t / v_t$, where $\Delta v_t = v_t - v_{t-1}$.
We show in Theorem 1 that the algorithm of following the perturbed leader with adaptive weights constructed in Section 2 is asymptotically consistent in the mean in the case where $v_t \to \infty$ and $\Delta v_t = o(v_t)$ as $t \to \infty$ with a computable bound. Specifically, if $\mathrm{fluc}(t) \le \gamma(t)$ for all $t$, where $\gamma(t)$ is a computable function such that $\gamma(t) = o(1)$ as $t \to \infty$, our algorithm has the expected regret
\[ 2\sqrt{(e^2 - 1)(1 + \ln N)} \sum_{t=1}^{T} (\gamma(t))^{1/2}\, \Delta v_t, \]
where $s_{1:T}$ is the total loss of our algorithm on steps $1, 2, \ldots, T$, and $\mathbb{E}(s_{1:T})$ is its expectation.
Proposition 1 of Section 2 shows that if the condition $\Delta v_t = o(v_t)$ is violated, the cumulative loss of any probabilistic prediction algorithm can be much larger than the loss of the best expert of the pool.
In Section 2 we also present some sufficient conditions under which our learning algorithm is Hannan consistent.²
In a particular case, Corollary 1 of Theorem 1 says that our algorithm is asymptotically consistent (in the modified sense) in the case where the one-step losses of all experts at each step $t$ are bounded by $t^a$, where $a$ is a positive real number. We prove this result under the extra assumption that the volume of the game grows slowly, $\liminf_{t\to\infty} v_t / t^{a+\delta} > 0$, where $\delta > 0$ is arbitrary. Corollary 1 shows that our algorithm is also Hannan consistent when $\delta > \frac12$.
¹ Allenberg et al. [2] considered losses $-\infty < s^i_t < \infty$.
² This means that (1) holds with probability 1, where $\mathbb{E}$ is omitted.
where the random variable $s_{1:T}$ is the cumulative loss of the master algorithm, $s^i_{1:T}$, $i = 1, \ldots, N$, are the cumulative losses of the expert algorithms, and $\mathbb{E}$ is the mathematical expectation (with respect to the probability distribution generated by the probabilities $P\{I_t = i\}$, $i = 1, \ldots, N$, on the first $T$ steps of the game).³
In the case of bounded one-step expert losses, $s^i_t \in [0,1]$, and a convex loss function, the well-known learning algorithms have expected regret $O(\sqrt{T \log N})$ (see Cesa-Bianchi and Lugosi [7]).
A probabilistic algorithm is called asymptotically consistent in the mean if
\[ \limsup_{T\to\infty} \frac{1}{T}\, \mathbb{E}\Bigl( s_{1:T} - \min_{i=1,\ldots,N} s^i_{1:T} \Bigr) \le 0. \tag{2} \]
Evidently, $v_{t-1} \le v_t$ for all $t$, and $\max_i s^i_{1:t} \le v_t \le N \max_i s^i_{1:t}$ for all $t$.
A probabilistic learning algorithm is called asymptotically consistent in the mean (in the modified sense) in a game with $N$ experts if
\[ \limsup_{T\to\infty} \frac{1}{v_T}\, \mathbb{E}\Bigl( s_{1:T} - \min_{i=1,\ldots,N} s^i_{1:T} \Bigr) \le 0, \tag{4} \]
and it is called Hannan consistent if
\[ \limsup_{T\to\infty} \frac{1}{v_T} \Bigl( s_{1:T} - \min_{i=1,\ldots,N} s^i_{1:T} \Bigr) \le 0 \tag{5} \]
almost surely.
Notice that the notions of asymptotic consistency in the mean and Hannan consistency may be non-equivalent for unbounded one-step losses.
A game is called non-degenerate if $v_t \to \infty$ (or, equivalently, $\max_i s^i_{1:t} \to \infty$) as $t \to \infty$.
Denote $\Delta v_t = v_t - v_{t-1}$. The number
\[ \mathrm{fluc}(t) = \frac{\Delta v_t}{v_t} = \frac{\max_i s^i_t}{v_t} \tag{6} \]
is called the scaled fluctuation of the game at step $t$.
By definition, $0 \le \mathrm{fluc}(t) \le 1$ for all $t$ (with the convention $0/0 = 0$).
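In code, these quantities are straightforward to track online; the sketch below is our own illustration with made-up losses.

```python
import numpy as np

def volume_and_fluc(loss_matrix):
    """Track the volume v_t = sum_{j<=t} max_i s_j^i and fluc(t) = dv_t / v_t."""
    dv = loss_matrix.max(axis=1)                  # Delta v_t = max_i s_t^i
    v = np.cumsum(dv)                             # volume v_t
    with np.errstate(divide="ignore", invalid="ignore"):
        fluc = np.where(v > 0, dv / v, 0.0)       # convention 0/0 = 0
    return v, fluc

# illustrative losses growing roughly like t^0.5
rng = np.random.default_rng(1)
S = rng.random((10, 3)) * np.arange(1, 11)[:, None] ** 0.5
v, fluc = volume_and_fluc(S)
print(v.round(2), fluc.round(2))
```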
The following simple proposition shows that each probabilistic learning algorithm fails to be asymptotically optimal in some game in which $\mathrm{fluc}(t)$ does not converge to 0 as $t \to \infty$. For simplicity, we consider the case of two experts.
Proposition 1. For any probabilistic algorithm of choosing an expert and for any $\varepsilon$ such that $0 < \varepsilon < 1$, two experts exist such that $v_t \to \infty$ as $t \to \infty$, $\mathrm{fluc}(t) \ge 1 - \varepsilon$, and
\[ \frac{1}{v_t}\, \mathbb{E}\Bigl( s_{1:t} - \min_{i=1,2} s^i_{1:t} \Bigr) \ge \frac12 (1 - \varepsilon) \]
for all $t$.
Let $s_t$ be the one-step loss of the master algorithm and $s_{1:t}$ its cumulative loss at step $t \ge 1$. We have
\[ \mathbb{E}(s_{1:t}) \ge \mathbb{E}(s_t) = s^1_t P\{I_t = 1\} + s^2_t P\{I_t = 2\} \ge \frac12 M_t \]
for all $t \ge 1$. Also, since $v_t = v_{t-1} + M_t = (1 + 4/\varepsilon)\, v_{t-1}$ and $\min_i s^i_{1:t} \le v_{t-1}$, the normalized expected regret of the master algorithm is bounded from below by
\[ \frac{1}{v_t}\, \mathbb{E}\Bigl( s_{1:t} - \min_i s^i_{1:t} \Bigr) \ge \frac{2/\varepsilon - 1}{1 + 4/\varepsilon} \ge \frac12 (1 - \varepsilon). \]
Also,
\[ \mathrm{fluc}(t) = \frac{M_t}{v_{t-1} + M_t} = \frac{1}{1 + \varepsilon/4} \ge 1 - \varepsilon \]
for all $t$.
Let $\gamma(t)$ be a computable non-increasing real function such that $0 < \gamma(t) < 1$ for all $t$ and $\gamma(t) \to 0$ as $t \to \infty$; for example, $\gamma(t) = 1/t^{\delta}$, where $\delta > 0$. Define
\[ \alpha_t = \frac12 \left( 1 - \frac{\ln \frac{1 + \ln N}{e^2 - 1}}{\ln \gamma(t)} \right), \tag{7} \]
\[ \mu_t = (\gamma(t))^{\alpha_t} = \sqrt{\frac{e^2 - 1}{1 + \ln N}}\, (\gamma(t))^{1/2}, \tag{8} \]
\[ \epsilon_t = \frac{1}{\mu_t v_{t-1}}, \tag{9} \]
where $\mu_t$ is defined by (8) and the volume $v_{t-1}$ depends on the experts' actions on steps $< t$. By definition $v_t \ge v_{t-1}$ and $\mu_t \le \mu_{t-1}$ for $t = 1, 2, \ldots$. Also, by definition $\mu_t \to 0$ as $t \to \infty$.⁴
⁴ The choice of the optimal value of $\alpha_t$ will be explained later. It will be obtained by minimization of the corresponding term of the sum (44).
Choose an expert with the minimal perturbed cumulative loss on steps $< t$:
\[ I_t = \arg\min_{i=1,2,\ldots,N} \Bigl\{ s^i_{1:t-1} - \frac{1}{\epsilon_t}\, \xi^i \Bigr\}. \tag{10} \]
Receive the one-step losses $s^i_t$ of the experts $i = 1, \ldots, N$, and receive the one-step loss $s^{I_t}_t$ of the master algorithm.
ENDFOR
Let $s_{1:T} = \sum_{t=1}^{T} s^{I_t}_t$ be the cumulative loss of the FPL algorithm on steps $\le T$.
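Putting (7)-(10) together, a minimal Python sketch of the algorithm PROT might look as follows; the loss stream and the choice of $\gamma$ are our own illustrative assumptions, and the first step (where $v_0 = 0$, i.e., $\epsilon_1 = \infty$) is handled by ignoring the perturbation.

```python
import math
import numpy as np

def prot(loss_stream, gamma, N, rng=np.random.default_rng(0)):
    """Sketch of the FPL algorithm PROT with adaptive learning rate (7)-(10)."""
    c = math.log((1.0 + math.log(N)) / (math.e ** 2 - 1.0))
    cum = np.zeros(N)                 # cumulative expert losses s^i_{1:t-1}
    v = 0.0                           # volume v_{t-1}
    total = 0.0
    for t, s in enumerate(loss_stream, start=1):
        s = np.asarray(s, dtype=float)
        alpha = 0.5 * (1.0 - c / math.log(gamma(t)))   # (7); needs gamma(t) < 1
        mu = gamma(t) ** alpha                         # (8)
        xi = rng.exponential(size=N)                   # i.i.d. Exp(1) perturbations
        if v > 0:
            i = int(np.argmin(cum - mu * v * xi))      # (10), since 1/eps_t = mu_t v_{t-1} by (9)
        else:
            i = int(np.argmin(cum))                    # step 1: eps_1 = "infinity" (v_0 = 0)
        total += s[i]
        cum += s
        v += s.max()                                   # Delta v_t = max_i s_t^i
    return total, cum.min()

losses = [np.array([1.0, 0.5, 2.0]) * (t ** 0.3) for t in range(1, 500)]
print(prot(losses, gamma=lambda t: (t + 1) ** -0.5, N=3))
```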
The following theorem shows that if the game is non-degenerate and $\Delta v_t = o(v_t)$ as $t \to \infty$ with a computable bound, then the FPL algorithm with variable learning rate (9) is asymptotically consistent.
... for all $t$. Then the expected cumulative loss of the FPL algorithm PROT with variable learning rate (9) is, for all $T$, bounded by
\[ \mathbb{E}(s_{1:T}) \le \min_i s^i_{1:T} + 2\sqrt{(e^2 - 1)(1 + \ln N)} \sum_{t=1}^{T} (\gamma(t))^{1/2}\, \Delta v_t. \tag{12} \]
\[ \limsup_{T\to\infty} \frac{1}{v_T}\, \mathbb{E}\Bigl( s_{1:T} - \min_{i=1,\ldots,N} s^i_{1:T} \Bigr) \le 0. \tag{13} \]
Moreover, the corresponding limit without expectation,
\[ \limsup_{T\to\infty} \frac{1}{v_T} \Bigl( s_{1:T} - \min_{i=1,\ldots,N} s^i_{1:T} \Bigr) \le 0, \tag{14} \]
holds almost surely, if
\[ \sum_{t=1}^{\infty} (\gamma(t))^2 < \infty. \tag{15} \]
Proof. In the proof of this theorem we follow the proof scheme of [4] and [5].
Let $\alpha_t$ be the sequence of real numbers defined by (7); recall that $0 < \alpha_t < 1$ for all $t$.
The analysis of the optimality of the FPL algorithm is based on an intermediate predictor IFPL (Infeasible FPL) with the learning rate $\epsilon'_t$ defined by (16).
IFPL algorithm.
FOR $t = 1, \ldots, T$
Define the learning rate
\[ \epsilon'_t = \frac{1}{\mu_t v_t}, \quad \text{where } \mu_t = (\gamma(t))^{\alpha_t}, \tag{16} \]
$v_t$ is the volume of the game at step $t$, and $\alpha_t$ is defined by (7).
Choose an expert with the minimal perturbed cumulative loss on steps $\le t$:
\[ J_t = \arg\min_{i=1,2,\ldots,N} \Bigl\{ s^i_{1:t} - \frac{1}{\epsilon'_t}\, \xi^i \Bigr\}. \]
Receive the one-step loss $s^{J_t}_t$ of the IFPL algorithm.
ENDFOR
The IFPL algorithm predicts under the knowledge of $s^i_{1:t}$, $i = 1, \ldots, N$ (and $v_t$), which may not be available at the beginning of step $t$. Using the unknown value of $\epsilon'_t$ is the main distinctive feature of our version of IFPL.
For any $t$, we have $I_t = \arg\min_i \{ s^i_{1:t-1} - \frac{1}{\epsilon_t}\xi^i \}$ and $J_t = \arg\min_i \{ s^i_{1:t} - \frac{1}{\epsilon'_t}\xi^i \} = \arg\min_i \{ s^i_{1:t-1} + s^i_t - \frac{1}{\epsilon'_t}\xi^i \}$.
The expected one-step and cumulative losses of the FPL and IFPL algorithms at steps $t$ and $T$ are denoted
\[ l_t = \mathbb{E}(s^{I_t}_t), \quad r_t = \mathbb{E}(s^{J_t}_t), \quad l_{1:T} = \sum_{t=1}^{T} l_t, \quad r_{1:T} = \sum_{t=1}^{T} r_t, \]
respectively, where $s^{I_t}_t$ is the one-step loss of the FPL algorithm at step $t$, $s^{J_t}_t$ is the one-step loss of the IFPL algorithm, and $\mathbb{E}$ denotes the mathematical expectation.
Lemma 1. The cumulative expected losses of the FPL and IFPL algorithms with learning rates defined by (9) and (16) satisfy the inequality
\[ l_{1:T} \le r_{1:T} + (e^2 - 1) \sum_{t=1}^{T} (\gamma(t))^{1-\alpha_t}\, \Delta v_t. \tag{17} \]
Let $m^{j_1} = s^{j_1}_{1:t-1} - \frac{1}{\epsilon_t} c^{j_1}$ and $m^{j_2} = s^{j_2}_{1:t} - \frac{1}{\epsilon'_t} c^{j_2} = s^{j_2}_{1:t-1} + s^{j_2}_t - \frac{1}{\epsilon'_t} c^{j_2}$. By definition, and since $j_2 \ne j$, we have
\[ m^{j_1} = s^{j_1}_{1:t-1} - \frac{1}{\epsilon_t} c^{j_1} \le s^{j_2}_{1:t-1} - \frac{1}{\epsilon_t} c^{j_2} \le s^{j_2}_{1:t-1} + s^{j_2}_t - \frac{1}{\epsilon_t} c^{j_2} = \tag{18} \]
\[ s^{j_2}_{1:t} - \frac{1}{\epsilon'_t} c^{j_2} + \Bigl( \frac{1}{\epsilon'_t} - \frac{1}{\epsilon_t} \Bigr) c^{j_2} = m^{j_2} + \Bigl( \frac{1}{\epsilon'_t} - \frac{1}{\epsilon_t} \Bigr) c^{j_2}. \tag{19} \]
In the following chain, all probabilities are conditional on $\{\xi^i = c^i,\ i \ne j\}$, and we abbreviate this condition by a dot:
\begin{align*}
P\{I_t = j \mid \cdot\}
&= P\Bigl\{ s^j_{1:t-1} - \tfrac{1}{\epsilon_t}\xi^j \le m^{j_1} \,\Big|\, \cdot \Bigr\} \\
&= P\bigl\{ \xi^j \ge \epsilon_t (s^j_{1:t-1} - m^{j_1}) \mid \cdot \bigr\} \\
&= P\bigl\{ \xi^j \ge \epsilon'_t (s^j_{1:t-1} - m^{j_1}) + (\epsilon_t - \epsilon'_t)(s^j_{1:t-1} - m^{j_1}) \mid \cdot \bigr\} \le \tag{20} \\
&\phantom{=}\; P\Bigl\{ \xi^j \ge \epsilon'_t (s^j_{1:t-1} - m^{j_1}) + (\epsilon_t - \epsilon'_t)\Bigl( s^j_{1:t-1} - s^{j_2}_{1:t-1} + \tfrac{1}{\epsilon_t} c^{j_2} \Bigr) \,\Big|\, \cdot \Bigr\} = \tag{21} \\
&\phantom{=}\; \exp\bigl\{ -(\epsilon_t - \epsilon'_t)(s^j_{1:t-1} - s^{j_2}_{1:t-1}) \bigr\} \times \tag{22} \\
&\qquad P\Bigl\{ \xi^j \ge \epsilon'_t (s^j_{1:t-1} - m^{j_1}) + (\epsilon_t - \epsilon'_t)\tfrac{1}{\epsilon_t} c^{j_2} \,\Big|\, \cdot \Bigr\} \le \tag{23} \\
&\phantom{=}\; \exp\bigl\{ -(\epsilon_t - \epsilon'_t)(s^j_{1:t-1} - s^{j_2}_{1:t-1}) \bigr\} \times \\
&\qquad P\Bigl\{ \xi^j \ge \epsilon'_t \Bigl( s^j_{1:t} - s^j_t - m^{j_2} - \Bigl( \tfrac{1}{\epsilon'_t} - \tfrac{1}{\epsilon_t} \Bigr) c^{j_2} \Bigr) + \tag{24} \\
&\qquad\quad (\epsilon_t - \epsilon'_t)\tfrac{1}{\epsilon_t} c^{j_2} \,\Big|\, \cdot \Bigr\} = \tag{25} \\
&\phantom{=}\; \exp\bigl\{ -(\epsilon_t - \epsilon'_t)(s^j_{1:t-1} - s^{j_2}_{1:t-1}) + \epsilon'_t s^j_t \bigr\} \times \tag{26} \\
&\qquad P\bigl\{ \xi^j \ge \epsilon'_t (s^j_{1:t} - m^{j_2}) \mid \cdot \bigr\} \\
&= \exp\Bigl\{ -\Bigl( \tfrac{1}{\mu_t v_{t-1}} - \tfrac{1}{\mu_t v_t} \Bigr)(s^j_{1:t-1} - s^{j_2}_{1:t-1}) + \tfrac{s^j_t}{\mu_t v_t} \Bigr\} \times \tag{27} \\
&\qquad P\Bigl\{ \xi^j > \tfrac{1}{\mu_t v_t}(s^j_{1:t} - m^{j_2}) \,\Big|\, \cdot \Bigr\} \le \\
&\phantom{=}\; \exp\Bigl\{ -\tfrac{\Delta v_t}{\mu_t v_t} \cdot \tfrac{s^j_{1:t-1} - s^{j_2}_{1:t-1}}{v_{t-1}} + \tfrac{\Delta v_t}{\mu_t v_t} \Bigr\} \times \tag{28} \\
&\qquad P\Bigl\{ \xi^j > \tfrac{1}{\mu_t v_t}(s^j_{1:t} - m^{j_2}) \,\Big|\, \cdot \Bigr\} \\
&= \exp\Bigl\{ \tfrac{\Delta v_t}{\mu_t v_t} \Bigl( 1 - \tfrac{s^j_{1:t-1} - s^{j_2}_{1:t-1}}{v_{t-1}} \Bigr) \Bigr\}\, P\{J_t = j \mid \cdot\}. \tag{29}
\end{align*}
Here the inequality (20)-(21) follows from (18) and $\epsilon_t \ge \epsilon'_t$. We have used twice, in the change from (21) to (22)-(23) and in the change from (25) to (26), the equality $P\{\xi > a + b\} = e^{-b} P\{\xi > a\}$ for any random variable $\xi$ distributed according to the exponential law. The inequality (23)-(24) follows from (19). We have used in the change from (27) to (28) the equality $v_t - v_{t-1} = \Delta v_t$ and the inequality $s^j_t \le \Delta v_t$ for all $j$ and $t$.
The expression in the exponent of (29) is bounded:
\[ \left| \frac{s^j_{1:t-1} - s^{j_2}_{1:t-1}}{v_{t-1}} \right| \le 1, \tag{30} \]
since $\frac{s^i_{1:t-1}}{v_{t-1}} \le 1$ and $s^i_{1:t-1} \ge 0$ for all $t$ and $i$.
Therefore, we obtain
\[ P\{I_t = j \mid \xi^i = c^i, i \ne j\} \le \exp\Bigl\{ \frac{2\, \Delta v_t}{\mu_t v_t} \Bigr\}\, P\{J_t = j \mid \xi^i = c^i, i \ne j\} \tag{31} \]
\[ \le \exp\{2(\gamma(t))^{1-\alpha_t}\}\, P\{J_t = j \mid \xi^i = c^i, i \ne j\}. \tag{32} \]
Since the inequality (32) holds for all $c^i$, it also holds unconditionally:
\[ P\{I_t = j\} \le \exp\{2(\gamma(t))^{1-\alpha_t}\}\, P\{J_t = j\} \tag{33} \]
for all $t = 1, 2, \ldots$ and $j = 1, \ldots, N$.
Using the inequality $\exp\{2x\} \le 1 + (e^2 - 1)x$ for all $x$ such that $0 \le x \le 1$, we obtain from (33) the upper bound
\[ l_t = \mathbb{E}(s^{I_t}_t) = \sum_{j=1}^{N} s^j_t\, P(I_t = j) \le \exp\{2(\gamma(t))^{1-\alpha_t}\} \sum_{j=1}^{N} s^j_t\, P(J_t = j) = \exp\{2(\gamma(t))^{1-\alpha_t}\}\, \mathbb{E}(s^{J_t}_t) = \exp\{2(\gamma(t))^{1-\alpha_t}\}\, r_t. \]
Proof. The proof is along the lines of the proof from Hutter and Poland [4], with the exception that now the sequence $\epsilon'_t$ is not monotonic.
Let, in this proof, $s_t = (s^1_t, \ldots, s^N_t)$ be the vector of one-step losses and $s_{1:t} = (s^1_{1:t}, \ldots, s^N_{1:t})$ be the vector of cumulative losses of the expert algorithms. Also, let $\xi = (\xi^1, \ldots, \xi^N)$ be a vector whose coordinates are random variables.
Recall that $\epsilon'_t = 1/(\mu_t v_t)$, $\mu_t \le \mu_{t-1}$ for all $t$, and $v_0 = 0$, $\epsilon'_0 = \infty$.
Define $\tilde{s}_{1:t} = s_{1:t} - \frac{1}{\epsilon'_t}\, \xi$ for $t = 1, 2, \ldots$, and consider the vector of one-step losses $\tilde{s}_t = s_t - \xi \bigl( \frac{1}{\epsilon'_t} - \frac{1}{\epsilon'_{t-1}} \bigr)$ for the moment.
For any vector $s$ denote $M(s) = \arg\min_{d \in D}\{d \cdot s\}$, where $D = \{(0, \ldots, 1), \ldots, (1, \ldots, 0)\}$ is the set of $N$ unit vectors of dimension $N$ and "$\cdot$" is the inner product of two vectors.
We first show that
\[ \sum_{t=1}^{T} M(\tilde{s}_{1:t}) \cdot \tilde{s}_t \le M(\tilde{s}_{1:T}) \cdot \tilde{s}_{1:T}, \tag{36} \]
and hence
\[ \sum_{t=1}^{T} M(\tilde{s}_{1:t}) \cdot s_t \le M(\tilde{s}_{1:T}) \cdot \tilde{s}_{1:T} + \sum_{t=1}^{T} M(\tilde{s}_{1:t}) \cdot \xi \Bigl( \frac{1}{\epsilon'_t} - \frac{1}{\epsilon'_{t-1}} \Bigr). \tag{37} \]
By the definition of $M$ we have
\[ M(\tilde{s}_{1:T}) \cdot \tilde{s}_{1:T} \le M(s_{1:T}) \cdot \Bigl( s_{1:T} - \frac{\xi}{\epsilon'_T} \Bigr) = \min_{d \in D}\{d \cdot s_{1:T}\} - M(s_{1:T}) \cdot \frac{\xi}{\epsilon'_T}. \tag{38} \]
The expectation of the last term in (38) is equal to $\frac{1}{\epsilon'_T} = \mu_T v_T$.
The second term of (37) can be rewritten as
\[ \sum_{t=1}^{T} M(\tilde{s}_{1:t}) \cdot \xi \Bigl( \frac{1}{\epsilon'_t} - \frac{1}{\epsilon'_{t-1}} \Bigr) = \sum_{t=1}^{T} (\mu_t v_t - \mu_{t-1} v_{t-1})\, M(\tilde{s}_{1:t}) \cdot \xi. \tag{39} \]
\[ P\bigl\{\max_i \xi^i \ge a\bigr\} = P\bigl\{\exists i\ (\xi^i \ge a)\bigr\} \le \sum_{i=1}^{N} P\{\xi^i \ge a\} = N \exp\{-a\}. \tag{41} \]
Since for any non-negative random variable $\eta$ we have $\mathbb{E}(\eta) = \int_0^{\infty} P\{\eta \ge y\}\, dy$, by (41) we obtain
\[ \mathbb{E}\Bigl(\max_i \xi^i - \ln N\Bigr) \le \int_0^{\infty} P\Bigl\{\max_i \xi^i - \ln N \ge y\Bigr\}\, dy \le \int_0^{\infty} N \exp\{-y - \ln N\}\, dy = 1. \]
Therefore, $\mathbb{E}(\max_i \xi^i) \le 1 + \ln N$.
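A quick numerical check of this bound (our illustration, not part of the paper): the exact value is the harmonic number $H_N = 1 + \frac12 + \cdots + \frac1N \le 1 + \ln N$.

```python
import numpy as np

rng = np.random.default_rng(0)
for N in (2, 10, 100):
    xi = rng.exponential(size=(200_000, N))
    est = xi.max(axis=1).mean()          # Monte Carlo estimate of E max_i xi^i
    print(N, round(est, 3), "<=", round(1 + np.log(N), 3))
```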
By (40) the expectation of (39) has the upper bound
\[ \sum_{t=1}^{T} \mathbb{E}\bigl( M(\tilde{s}_{1:t}) \cdot \xi \bigr)\, (\mu_t v_t - \mu_{t-1} v_{t-1}) \le (1 + \ln N) \sum_{t=1}^{T} \mu_t\, \Delta v_t. \]
The lemma is proved.
We now finish the proof of the theorem.
The inequality (17) of Lemma 1 and the inequality (35) of Lemma 2 imply the inequality
\[ \mathbb{E}(s_{1:T}) \le \min_i s^i_{1:T} + \sum_{t=1}^{T} \bigl( (e^2 - 1)(\gamma(t))^{1-\alpha_t} + (1 + \ln N)(\gamma(t))^{\alpha_t} \bigr)\, \Delta v_t \tag{44} \]
for all $T$.
The optimal value (7) of $\alpha_t$ can easily be obtained by minimizing each term of the sum (44) in $\alpha_t$. In this case $\mu_t$ is equal to (8), and (44) is equivalent to (12).
We have $\sum_{t=1}^{T} \Delta v_t = v_T$ for all $T$, $v_t \to \infty$, and $\gamma(t) \to 0$ as $t \to \infty$. Then, by the Toeplitz lemma [10],
\[ \frac{1}{v_T}\, 2\sqrt{(e^2 - 1)(1 + \ln N)} \sum_{t=1}^{T} (\gamma(t))^{1/2}\, \Delta v_t \to 0 \]
as $T \to \infty$, and (13) follows from (12).
Then the FPL algorithm PROT is Hannan consistent, i.e., (5) holds as $T \to \infty$ almost surely.
Proof. We use Theorem 11 from Petrov [8] (Chapter IX, Section 2), which gives sufficient conditions for the strong law of large numbers to hold for a sequence of independent unbounded random variables:
Let $a_t$ be a nondecreasing sequence of real numbers such that $a_t \to \infty$ as $t \to \infty$, and let $X_t$ be a sequence of independent random variables such that $\mathbb{E}(X_t) = 0$ for $t = 1, 2, \ldots$. Let also $g(x)$ satisfy the assumptions of Lemma 3. By Theorem 11 from Petrov [8], the inequality
\[ \sum_{t=1}^{\infty} \frac{\mathbb{E}(g(X_t))}{g(a_t)} < \infty \tag{46} \]
implies
\[ \frac{1}{a_T} \sum_{t=1}^{T} X_t \to 0 \tag{47} \]
as $T \to \infty$ almost surely.
Put $X_t = s_t - \mathbb{E}(s_t)$, where $s_t$ is the loss of the FPL algorithm PROT at step $t$, and $a_t = v_t$ for all $t$. By definition $|X_t| \le \Delta v_t$ for all $t$. Then (46) is valid, and by (47)
\[ \frac{1}{v_T}\bigl( s_{1:T} - \mathbb{E}(s_{1:T}) \bigr) = \frac{1}{v_T} \sum_{t=1}^{T} \bigl( s_t - \mathbb{E}(s_t) \bigr) \to 0 \]
as $T \to \infty$ almost surely.
This limit and the limit (13) imply (14).
By Lemma 3 the algorithm PROT is Hannan consistent, since (15) implies (45) for $g(x) = x^2$. Theorem 1 is proved.
The authors of [2] and [9] considered polynomially bounded one-step losses. We consider a specific example of the bound (44) for the polynomial case.
Corollary 1. Assume that $s^i_t \le t^a$ for all $t$ and $i = 1, \ldots, N$, and
\[ \liminf_{t\to\infty} \frac{v_t}{t^{a+\delta}} > 0, \]
where $a$ and $\delta$ are positive real numbers. Let also, in the algorithm PROT, $\gamma(t) = t^{-\delta}$ and $\mu_t = (\gamma(t))^{\alpha_t}$, where $\alpha_t$ is defined by (7). Then
– (i) the algorithm PROT is asymptotically consistent in the mean for any $a > 0$ and $\delta > 0$;
– (ii) this algorithm is Hannan consistent for any $a > 0$ and $\delta > \frac12$;
– (iii) the expected loss of this algorithm is bounded by
\[ \mathbb{E}(s_{1:T}) \le \min_i s^i_{1:T} + 2\sqrt{(e^2 - 1)(1 + \ln N)}\; T^{1 - \frac{\delta}{2} + a} \tag{48} \]
as $T \to \infty$.
This corollary follows directly from Theorem 1, where condition (15) of Theorem 1 holds for $\delta > \frac12$.
If $\delta = 1$, the regret from (48) is asymptotically equivalent to the regret of Allenberg et al. [2] (see Section 1).
For $a = 0$ we have the case of a bounded loss function ($0 \le s^i_t \le 1$ for all $i$ and $t$). The FPL algorithm PROT is asymptotically consistent in the mean if $v_t \ge \beta(t)$ for all $t$, where $\beta(t)$ is an arbitrary positive unbounded non-decreasing computable function (we can take $\gamma(t) = 1/\beta(t)$ in this case). This algorithm is Hannan consistent if (15) holds, i.e.,
\[ \sum_{t=1}^{\infty} (\beta(t))^{-2} < \infty. \]
Acknowledgments
References
1. Cesa-Bianchi, N., Mansour, Y., Stoltz, G.: Improved second-order bounds for pre-
diction with expert advice. Machine Learning 66(2-3), 321–352 (2007)
2. Allenberg, C., Auer, P., Györfi, L., Ottucsák, G.: Hannan consistency in on-line
learning in case of unbounded losses under partial monitoring. In: Balcázar, J.L.,
Long, P.M., Stephan, F. (eds.) ALT 2006. LNCS (LNAI), vol. 4264, pp. 229–243.
Springer, Heidelberg (2006)
3. Hannan, J.: Approximation to Bayes risk in repeated plays. In: Dresher, M., Tucker,
A.W., Wolfe, P. (eds.) Contributions to the Theory of Games, vol. 3, pp. 97–139.
Princeton University Press, Princeton (1957)
4. Hutter, M., Poland, J.: Prediction with expert advice by following the perturbed
leader for general weights. In: Ben-David, S., Case, J., Maruoka, A. (eds.) ALT
2004. LNCS (LNAI), vol. 3244, pp. 279–293. Springer, Heidelberg (2004)
5. Kalai, A., Vempala, S.: Efficient algorithms for online decisions. In: Schölkopf, B.,
Warmuth, M.K. (eds.) COLT/Kernel 2003. LNCS (LNAI), vol. 2777, pp. 26–40.
Springer, Heidelberg (2003); Extended version in Journal of Computer and System
Sciences 71, 291–307 (2005)
6. Littlestone, N., Warmuth, M.K.: The weighted majority algorithm. Information
and Computation 108, 212–261 (1994)
7. Cesa-Bianchi, N., Lugosi, G.: Prediction, Learning, and Games. Cambridge
University Press, New York (2006)
8. Petrov, V.V.: Sums of independent random variables. Ergebnisse der Mathematik
und ihrer Grenzgebiete, Band 82. Springer, Heidelberg (1975)
9. Poland, J., Hutter, M.: Defensive universal learning with experts for general
weights. In: Jain, S., Simon, H.U., Tomita, E. (eds.) ALT 2005. LNCS (LNAI),
vol. 3734, pp. 356–370. Springer, Heidelberg (2005)
10. Shiryaev, A.N.: Probability. Springer, Berlin (1980)
11. Vovk, V.G.: Aggregating strategies. In: Fulk, M., Case, J. (eds.) Proceedings of
the 3rd Annual Workshop on Computational Learning Theory, San Mateo, CA,
pp. 371–383. Morgan Kaufmann, San Francisco (1990)
Computable Bayesian Compression for
Uniformly Discretizable Statistical Models
Łukasz Dębowski
1 Introduction
theorem for conditionally random sequences [7], [4, Theorems 4.2 and 5.3]. The condition of uniform discretization can be completely removed from the 'if' part and relaxed to effective identifiability of the parameter in the 'only if' part. Namely, given a prefix of the parameter, we must be able to compute how much data is needed to learn its value with a fixed uncertainty.
The organization of this paper is as follows. In Section 2, we discuss the quality of Bayesian compression for individual parameters and derive the randomness deficiency bounds for prefixes of the parameter and the parameter-typical data.
These bounds hold for the newly introduced class of uniformly discretizable sta-
tistical models. In Section 3, we show that exponential families are uniformly
discretizable. The assumptions on the prior and the proof look familiar to statis-
ticians working in minimum description length (MDL) inference [8,9]. An exam-
ple of a ‘nonparametric’ uniformly discretizable model appears in Section 4. In
the final Section 5, we prove that countable mixtures of uniformly discretizable
models are uniformly discretizable if the Bayesian estimator consistently chooses
the right submodel for the data.
The definition of uniformly discretizable models is given below. Condition (3)
says that the parameter may be discretized to m ≥ μ(n) digits for the sake of
approximating the ‘true’ probability of data xn . Condition (4) asserts that the
parameter, discretized to m digits, can be predicted for all but finitely many m
given data xn of length n ≥ ν(m). Functions μ and ν depend on a model.
To fix our notation in advance, we use a countable alphabet $X$ and a finite $Y = \{0, 1, \ldots, D-1\}$, $D > 1$. The logarithm to base $D$ is written as $\log$. An italic $x \in X^+$ is a string; a boldface $\mathbf{x} \in X^{\mathbb{N}}$ is an infinite sequence. The $n$-th symbol of $\mathbf{x}$ is written $x_n \in X$, and $x^n$ is the prefix of $\mathbf{x}$ of length $n$: $\mathbf{x} = x_1 x_2 x_3 \ldots$ and $x^n = x_1 x_2 \ldots x_n$. A capital boldface $Y : X^* \to \mathbb{R}$ denotes a distribution of strings normalized lengthwise, i.e., $0 \le Y(x)$, $\sum_a Y(xa)\, \mathbb{1}_{\{|a|=n\}} = Y(x)$, and $Y(\lambda) = 1$ for the empty string $\lambda$. There is a unique measure on measurable sets of infinite sequences $\mathbf{x} \in X^{\mathbb{N}}$, also denoted $Y$, such that $Y(\{\mathbf{x} : x^n = x \text{ for } n = |x|\}) = Y(x)$. The quantifier 'n-eventually' means 'for all but finitely many $n \in \mathbb{N}$'.
where $A(\theta) := \{\theta' \in \Theta : \theta \text{ is a prefix of } \theta'\}$, and denote its other marginal
\[ Y(x) := T(x, \lambda) = \int P_{\theta}(x)\, dQ(\theta). \tag{2} \]
We will use a universal computer with an oracle, which can compute certain
functions R → R. To make it clear which these are, we adopt the following
definitions, cf. [11], [6, Sections 1.7 and 3.1], [1, Section 2], [4, Section 5]:
Impossibility level:
\[ I(\mathbf{x}; Y) := \inf_{n \in \mathbb{N}} \frac{D^{-K(x^n)}}{Y(x^n)}. \tag{5} \]
Notably, the Bayesian compressor can be shown to be optimal exactly when the parameter is incompressible. Strictly speaking, we will obtain $P_{\theta}(L_Y) = 1$ if and only if $\theta$ is Martin-Löf random with respect to $Q$. This holds, of course, under some tacit assumptions. For instance, if we take $P_{\theta} \equiv Y$ then $P_{\theta}(L_Y) = 1$ for all $\theta \in \Theta$. We may thus suppose that the 'if and only if' statement holds provided the parameter can be effectively identified. The following two propositions form the first step towards seeing exactly what assumptions are needed.
Lemma 3. For a computer-dependent constant $A$, we have ...
because $K(\theta^m) \le m + \log m + o(\log m)$ and $K(m) \le \log m + o(\log m)$. Since ... by the chain rule for prefix complexity [6, Theorem 3.9.1], we obtain ...
In the following, we apply (13) with $x^n$ and $\theta^m$ switched, and observe that
\[ K(\theta^m \mid x^n, K(x^n)) \le A + K(m) - \log \frac{T(x^n, \theta^m)}{Y(x^n)}, \]
\[ K(x^n, \theta^m) \le A + K(\theta^m) - \log \frac{T(x^n, \theta^m)}{Q(\theta^m)}. \tag{14} \]
(This time, we need not specify the length of $x^n$ separately, since it can be computed from $\theta^m$.) Substituting (4) into (14) and chaining the result with $K(x^n) \le A + K(x^n, \theta^m)$ yields (11).
\[ P_{\theta}\bigl( L_{Y, \log \mu(n)} \bigr) = \begin{cases} 1 & \text{if } \theta \in L_{Q, \log n}, \\ 0 & \text{if } \theta \notin L_{Q, \log n}. \end{cases} \tag{15} \]
In particular, $L_{Y,1} = L_Y$.
Theorem 1(ii) suffices to prove $P_{\theta}(L_Y) = 0$ for $\theta \notin L_Q$, but to show $P_{\theta}(L_Y) = 1$ in the other case we need a stronger statement than Theorem 1(i). Here we can rely on the chain rule for conditional impossibility levels by Vovk and V'yugin [1, Theorem 1] and the extensions of van Lambalgen's theorem for conditionally random sequences by Takahashi [4]. For a recursive kernel $P$, let us define by analogy the conditional impossibility level
\[ I(\mathbf{x}; P \mid \theta) := \inf_{n \in \mathbb{N}} \frac{D^{-K(x^n \mid \theta)}}{P_{\theta}(x^n)}, \tag{17} \]
which holds for $Y = \int P_{\theta}\, dQ(\theta)$ and $\epsilon > 0$ by [1, Corollary 4].
Inequality (19) and Theorem 1(ii) imply the main claim of this article:
\[ P_{\theta}(L_Y) = \begin{cases} 1 & \text{if } \theta \in L_Q, \\ 0 & \text{if } \theta \notin L_Q. \end{cases} \tag{20} \]
The upper part of (20) can be strengthened to the decomposition $L_Y = \bigcup_{\theta \in L_Q} L_{P|\theta}$, which holds for all recursive $P$ and $Q$ [4, Cor. 4.3 & Thm. 5.3]. (Our definition of a recursive $P$ corresponds to 'uniformly computable' in [4].) We suppose that, under the assumption of Theorem 2, the sets $L_{P|\theta}$ are disjoint for $\theta \in \Theta$. This would strengthen the lower part of (20).
(ii) Let $Z(\beta) := \sum_{x \in X} p(x) \exp\bigl( \sum_{l=1}^{k} \beta_l T_l(x) \bigr)$ and define measures
\[ \tilde{P}_{\beta}(x^n) := \prod_{i=1}^{n} p(x_i) \exp\Bigl( \sum_{l=1}^{k} \beta_l T_l(x_i) - \ln Z(\beta) \Bigr) \]
for $\beta \in B := \{\beta' \in \mathbb{R}^k : Z(\beta') < \infty\}$.
(iii) We require that $B$ is open. (It is not empty since $0 \in B$.) Under this condition, $\vartheta(\cdot) : B \ni \beta \mapsto \vartheta(\beta) := \mathbb{E}_{x \sim \tilde{P}_{\beta}} T(x_i) \in \mathbb{R}^k$ is a twice differentiable injection [17], [9]. Thus assume $\tilde{\Theta} := \vartheta(B)$ and put $\tilde{P}_{\vartheta} := \tilde{P}_{\beta(\vartheta)}$ for $\beta(\cdot) := \vartheta^{-1}(\cdot)$.
Additionally, let the prior $\tilde{Q}$ be universally lower-bounded by the Lebesgue measure on $\mathbb{R}^k$ and let it satisfy $\tilde{Q}(\tilde{\Theta}) = 1$.
Proposition 2. Use Cantor's code $\rho := \rho_s \circ \rho_n$, where $\rho_n : \tilde{\Theta} \to (0,1)^k$ is a differentiable injection and $\rho_s : (0,1)^k \to Y^{\mathbb{N}}$ satisfies $\rho_s(y) = \theta_1\theta_2\theta_3\ldots$ for any vector $y \in (0,1)^k$ with components $y_l = \sum_{i=1}^{\infty} \theta_{(i-1)k+l}\, D^{-i}$. Then the model $(\tilde{P}, \tilde{Q})$ is $\bigl( \rho,\ \frac{k}{2}(2 + \log n),\ D^{(2/k+\epsilon)m} \bigr)$-uniformly discretizable for $\epsilon > 0$.
Proof. Let $\Theta := \rho(\tilde{\Theta})$, $P_{\theta}(x) := \tilde{P}_{\rho^{-1}(\theta)}(x)$, $Q := \tilde{Q} \circ \rho^{-1}$, and $A(\theta) := \{\theta' \in \Theta : \theta \text{ is a prefix of } \theta'\}$. Consider a $\theta \in \Theta$. Firstly, let $m \ge \frac{k}{2}(2 + \log n)$. We have (21) for $\vartheta = \rho^{-1}(\theta)$ and $A_n = \rho^{-1}(A(\theta^m))$. Hence (3) holds by Theorem 3(i) below. Secondly, let $n \ge D^{(2/k+\epsilon)m}$. We have (23) for $\vartheta = \rho^{-1}(\theta)$ and $B_n = \rho^{-1}(A(\theta^m))$. Hence (4) follows by Theorem 3(ii).
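The digit-interleaving map $\rho_s$ is easy to illustrate in code; the following sketch is our own illustration (with $D = 10$ for readability), interleaving the base-$D$ expansions of the $k$ coordinates into one digit stream.

```python
def rho_s(y, k, D=10, digits=12):
    """Interleave the base-D expansions of y = (y_1, ..., y_k) in (0,1)^k.

    Returns theta_1 theta_2 ... with y_l = sum_i theta_{(i-1)k+l} D^{-i}.
    """
    out = []
    fracs = list(y)
    for _ in range(digits):          # i-th digit of every coordinate, in turn
        for l in range(k):
            fracs[l] *= D
            d = int(fracs[l])        # next base-D digit of y_l
            fracs[l] -= d
            out.append(d)
    return out

print(rho_s((0.125, 0.5), k=2))      # digits of 0.125 and 0.5, interleaved
```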
The statement below may look more familiar to statisticians.
Theorem 3. Fix a $\vartheta \in \tilde{\Theta}$ for the model specified in Example 1.
(i) If we take sufficiently small measurable sets $A_n \subset \tilde{\Theta}$ which satisfy
\[ \limsup_{n\to\infty} \frac{\sup_{\vartheta' \in A_n} |\vartheta' - \vartheta|}{\sqrt{n^{-1} \ln \ln n}} = 0 \tag{21} \]
and put $\tilde{P}_n(x) := \int_{A_n} \tilde{P}_{\vartheta'}(x)\, d\tilde{Q}(\vartheta') \big/ \int_{A_n} d\tilde{Q}(\vartheta')$, then
Proof. (i) The function $\hat{\vartheta}(x^n) := n^{-1} \sum_{i=1}^{n} T(x_i)$ is the maximum likelihood estimator of $\vartheta$, in the usual sense. Thus the Taylor expansion for any $\vartheta \in \tilde{\Theta}$ yields
\[ \log \tilde{P}_{\hat{\vartheta}(x^n)}(x^n) - \log \tilde{P}_{\vartheta}(x^n) = n \sum_{l,m=1}^{k} R_{lm}(\vartheta)\, S_{lm}(\vartheta), \tag{25} \]
where $S_{lm}(\vartheta) := (\vartheta_l - \hat{\vartheta}_l(x^n))(\vartheta_m - \hat{\vartheta}_m(x^n))$ and $R_{lm}(\vartheta) := \int_0^1 (1-t)\, I_{lm}\bigl( t\vartheta + (1-t)\hat{\vartheta}(x^n) \bigr)\, dt$, whereas the observed Fisher information matrix $I_{lm}(\vartheta) := -n^{-1} \partial_{\vartheta_l} \partial_{\vartheta_m} \log \tilde{P}_{\vartheta}(x^n)$ does not depend on $n$ and $x^n$. Consequently,
where $R^+_{lm} := \sup_{\vartheta' \in C_n} R_{lm}(\vartheta')$ and $R^-_{lm} := \inf_{\vartheta' \in C_n} R_{lm}(\vartheta')$. By continuity of the Fisher information $I_{lm}(\vartheta)$ as a function of $\vartheta$, $R^+_{lm}$ and $R^-_{lm}$ tend to $I_{lm}(\vartheta)$ as $n \to \infty$. On the other hand, the law of the iterated logarithm
\[ \limsup_{n\to\infty} \frac{\hat{\vartheta}_l(x^n) - \vartheta_l}{\sigma_l \sqrt{2 n^{-1} \ln \ln n}} = 1 \tag{26} \]
is satisfied for $\tilde{P}_{\vartheta}$-almost all $\mathbf{x}$ with variance $\sigma_l^2 := \mathrm{Var}_{x \sim \tilde{P}_{\vartheta}} T_l(x_i)$, since the maximum likelihood estimator is unbiased, i.e., $\mathbb{E}_{x \sim \tilde{P}_{\vartheta}} \hat{\vartheta}(x^n) = \vartheta$. Consequently, we obtain (22) for (21).
(ii) The proof applies the Laplace approximation as in [18] or in the proof of Theorem 8.1 of [9, pages 248–251]. First of all, we have
\[ \log \int_{\tilde{\Theta}} \tilde{P}_{\vartheta'}(x^n)\, d\tilde{Q}(\vartheta') - \log \int_{B_n} \tilde{P}_{\vartheta'}(x^n)\, d\tilde{Q}(\vartheta') \le \frac{\int_{\tilde{\Theta} \setminus B_n} \tilde{P}_{\vartheta'}(x^n)\, d\tilde{Q}(\vartheta')}{\int_{B_n} \tilde{P}_{\vartheta'}(x^n)\, d\tilde{Q}(\vartheta')}. \]
Example 2 (the data are the parameter). Put $P_{\theta}(x^n) := \mathbb{1}_{\{x^n = \theta^n\}}$ for $X = Y$ and let $Q(\theta) > 0$ for $\theta \in Y^*$. This model is $(n, m)$-uniformly discretizable.
\[ P(K_i = k) = p(k) := \frac{k^{-1/\beta}}{\zeta(1/\beta)}, \quad k \in \mathbb{N}, \tag{27} \]
with a fixed $\beta \in (0,1)$. This family of processes was introduced to model the logical consistency of texts in natural language [19]. The distribution of the variables $X_i$ equals the measure $P(X_i \in \cdot) = P_{\theta}$ for the following Bayesian model.
Example 4 (an accessible description model). Put
\[ P_{\theta}(x^n) := \prod_{i=1}^{n} p(k_i)\, \mathbb{1}_{\{z_i = \theta_{k_i}\}}. \tag{28} \]
For this model, the Shannon information between the data and the parameter equals $\mathbb{E}_{(x,\theta)\sim T}\bigl[ -\log Y(x^n) + \log P_{\theta}(x^n) \bigr] = \Theta(n^{\beta})$ asymptotically if $Q(\theta) = D^{-|\theta|}$, cf. [19, Theorem 10]. As a consequence of the next statement, the accessible description model (28) is $(n^{\upsilon}, m^{1/\lambda})$-uniformly discretizable for
Proposition 3. For independent variables $(K_i)_{i\in\mathbb{Z}}$ with the distribution (27),
\[ = P_{\theta}(x^n) \sum_{y^M \in Y^M}\ \prod_{k \in \{k_1, k_2, \ldots, k_n\} \cup \{1, 2, \ldots, m\}} \mathbb{1}_{\{\theta_k = y_k\}}\, Q(y^M). \]
Measure $Y$ is not only optimal for all $Q$-random $\theta$, in the sense of $P_{\theta}(L_Y) = 1$, but it is also optimal for a certain $\theta \notin L_Q$ that satisfies $P_{\theta} = Y$. On the other hand, by the asymptotic equipartition property, $P_{\theta}(L_Y) = 0$ for stationary measures $P_{\theta}$ that have a different entropy rate than $Y$ [15, Section 15.7].
A complementary result says that the set of random parameters with respect to the mixture is the union of the respective sets for the combined models.
Theorem 5. Consider the models from Theorem 4 and suppose that the $Q_i$ satisfy ... for all $k \ge m \ge 0$ and certain constants $c < 1$ and $a > 0$. Then for $g(n) = \Omega(1)$ we have $\theta \in L_{Q,g(n)}$ if and only if $\mathrm{trn}(\theta) \in L_{Q_{\mathrm{idx}(\theta)},\, g(n)}$.
Acknowledgement
I would like to thank P. Grünwald, P. Harremoës, and J. Mielniczuk for discus-
sions. Cordial acknowledgements are due to an anonymous referee for suggest-
ing relevant references. They helped to improve this paper considerably. The
research, supported under the PASCAL II Network of Excellence, IST-2002-
506778, was done during the author’s leave from the Institute of Computer
Science, Polish Academy of Sciences.
References
1. Vovk, V.G., V’yugin, V.V.: On the empirical validity of the Bayesian method. J.
Roy. Statist. Soc. B 55, 253–266 (1993)
2. Vovk, V.G., V’yugin, V.V.: Prequential level of impossibility with some applica-
tions. J. Roy. Statist. Soc. B 56, 115–123 (1994)
3. Vitányi, P., Li, M.: Minimum description length induction, Bayesianism and
Kolmogorov complexity. IEEE Trans. Inform. Theor. 46, 446–464 (2000)
4. Takahashi, H.: On a definition of random sequences with respect to conditional
probability. Inform. Comput. 206, 1375–1382 (2008)
5. Gács, P.: On the symmetry of algorithmic information. Dokl. Akad. Nauk SSSR 15,
1477–1480 (1974)
6. Li, M., Vitányi, P.M.B.: An Introduction to Kolmogorov Complexity and Its
Applications, 2nd edn. Springer, Heidelberg (1997)
7. van Lambalgen, M.: Random Sequences. PhD thesis, Universiteit van Amsterdam
(1987)
8. Barron, A., Rissanen, J., Yu, B.: The minimum description length principle in
coding and modeling. IEEE Trans. Inform. Theor. 44, 2743–2760 (1998)
9. Grünwald, P.D.: The Minimum Description Length Principle. MIT Press,
Cambridge (2007)
10. Yu, B., Speed, T.P.: Data compression and histograms. Probab. Theor. Rel.
Fields 92, 195–229 (1992)
11. Hopcroft, J.E., Ullman, J.D.: Introduction to Automata Theory, Languages and
Computation. Addison-Wesley, Reading (1979)
12. Elias, P.: Universal codeword sets and representations for the integers. IEEE Trans.
Inform. Theor. 21, 194–203 (1975)
13. Barron, A.R.: Logically Smooth Density Estimation. PhD thesis, Stanford Univer-
sity (1985)
14. Dawid, A.: Statistical theory: The prequential approach. J. Roy. Statist. Soc. A 147,
278–292 (1984)
15. Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley, Chichester
(1991)
16. Li, L., Yu, B.: Iterated logarithmic expansions of the pathwise code lengths for
exponential families. IEEE Trans. Inform. Theor. 46, 2683–2689 (2000)
17. Barndorff-Nielsen, O.E.: Information and Exponential Families. Wiley, Chichester
(1978)
18. Jeffreys, H.: Theory of Probability, 3rd edn. Oxford University Press, Oxford (1961)
19. Dębowski, Ł.: On the vocabulary of grammar-based codes and the logical consis-
tency of texts (2008) E-print, http://arxiv.org/abs/0810.3125
20. Csiszar, I., Shields, P.C.: The consistency of the BIC Markov order estimator. Ann.
Statist. 28, 1601–1619 (2000)
Calibration and Internal No-Regret
with Random Signals
Vianney Perchet
1 Introduction
This section is devoted to the full monitoring case. We recall the main results about calibration of Foster and Vohra [6], approachability of Blackwell [3], and regret of Hart and Mas-Colell [10]. We will prove some of these results in detail, since they give the main ideas behind the construction of strategies in the partial monitoring framework, given in Section 4.
2.1 Calibration
We will use the following notations. For any families {am ∈ IRd , lm ∈ L}m∈IN
and n ∈ IN, Nn (l)
= {1 ≤ m ≤ n, lm = l} is the set of stages of type l (before
the n-th), an (l) = m∈Nn (l) am /|Nn (l)| is the average of {am } on this set and
n
an = m=1 am /n is the average over all the stages (before the n-th).
Definition 1 (Foster–Vohra [6]). A strategy $\sigma$ of Player 1 is calibrated (with respect to the $\varepsilon$-grid $M$) if for every $l \in L$ and every strategy $\tau$ of Player 2:
\[ \limsup_{n\to+\infty} \frac{|N_n(l)|}{n} \Bigl( \|\bar{s}_n(l) - \mu(l)\|_2^2 - \varepsilon^2 \Bigr) \le 0, \quad \mathbb{P}_{\sigma,\tau}\text{-a.s.} \]
In words, a strategy of Player 1 is calibrated if, on the set of stages where $\mu(l)$ is forecast, the empirical distribution of states is asymptotically close to $\mu(l)$ (as long as the frequency of $l$ is not too small). Foster and Vohra [6] proved the existence of such strategies with an algorithm based on the expected Brier score.
2.2 Approachability
We will prove that calibration follows from no-regret and that no-regret follows from approachability (following respectively Sorin [17] and Hart and Mas-Colell [10]). We present here the notion of approachability introduced by Blackwell [3].
Consider a two-person repeated game in discrete time with vector payoffs, where at stage $n \in \mathbb{N}$, Player 1 (resp. Player 2) chooses the action $i_n \in I$ (resp. $j_n \in J$), with both $I$ and $J$ finite. The corresponding vector payoff is $\rho_n = \rho(i_n, j_n)$, where $\rho : I \times J \to \mathbb{R}^d$. As usual, a strategy $\sigma$ (resp. $\tau$) of Player 1 (resp. Player 2) is a function from the set of finite histories $H = \bigcup_{n\in\mathbb{N}} (I \times J)^n$ to $\Delta(I)$ (resp. $\Delta(J)$).
For a closed set $E \subset \mathbb{R}^d$ and $\delta \ge 0$, we denote by $E^{\delta} = \{z \in \mathbb{R}^d : d_E(z) \le \delta\}$ the $\delta$-neighborhood of $E$, and by $\Pi_E(z) = \{e \in E : d_E(z) = \|z - e\|\}$ the set of closest points to $z$ in $E$, where $d_E(z) = \inf_{e\in E} \|z - e\|$.
Definition 2. (i) A closed set $E \subset \mathbb{R}^d$ is approachable by Player 1 if for every $\varepsilon > 0$ there exist a strategy $\sigma$ of Player 1 and $N \in \mathbb{N}$ such that, for every strategy $\tau$ of Player 2 and every $n \ge N$:
\[ \mathbb{E}_{\sigma,\tau}\bigl[ d_E(\bar{\rho}_n) \bigr] \le \varepsilon \quad \text{and} \quad \mathbb{P}\Bigl( \sup_{n \ge N} d_E(\bar{\rho}_n) \ge \varepsilon \Bigr) \le \varepsilon. \]
Informally, from any point $z$ outside $E$ there are a closest point $p$ and a probability $x \in \Delta(I)$ such that, whatever the choice of Player 2, the expected payoff and $z$ are on different sides of the hyperplane through $p$ perpendicular to $z - p$. In fact, this definition (and the following theorem) does not require that $J$ is finite: one can assume that Player 2 chooses an outcome vector $U \in [-1, 1]^{|I|}$ so that the expected payoff is $\langle x, U \rangle$.
Remark 1. Corollary 1 implies that there are (at least) two different ways to prove that a convex set is approachable. The first one, called the direct proof, consists in proving that $C$ is a B-set, while the second one, called the indirect proof, consists in proving that $C$ is not excludable by Player 2, which reduces to finding, for every $y \in \Delta(J)$, some $x \in \Delta(I)$ such that $\rho(x, y) \in C$.
The existence of such strategies has been proved by Foster and Vohra [7] and Fudenberg and Levine [8].
Theorem 2. There exist internally consistent strategies.
Note that an internally consistent strategy can be obtained by constructing a strategy that approaches the negative orthant $\Omega = \mathbb{R}^{c^2}_-$ in the auxiliary game where the vector payoff at stage $n$ is $R_n$.
The proof of Hart and Mas-Colell [10] that $\Omega$ is a B-set relies on the two following lemmas: Lemma 1 gives a geometric property of $\Omega$, and Lemma 2 gives a property of the function $R$.
Lemma 1. Let $\Pi_{\Omega}(\cdot)$ be the projection onto $\Omega$. Then, for every $A \in \mathbb{R}^{c^2}$:
The existence of an invariant probability follows from the similar result for
Markov chains.
Lemma 2. Let $A = (a_{ij})_{i,j\in I}$ be a non-negative matrix. Then for every invariant probability $\lambda$ of $A$ and every $U \in \mathbb{R}^c$:
\[ \bigl\langle A,\ \mathbb{E}_{\lambda}[R(\cdot, U)] \bigr\rangle = 0. \tag{5} \]
Proof. The $(i, j)$-th coordinate of $\mathbb{E}_{\lambda}[R(\cdot, U)]$ is $\lambda(i)\,(U^j - U^i)$, therefore:
\[ \bigl\langle A,\ \mathbb{E}_{\lambda}[R(\cdot, U)] \bigr\rangle = \sum_{i,j\in I} a_{ij}\, \lambda(i)\, (U^j - U^i), \]
and the coefficient of each $U^i$ is $\sum_{j\in I} a_{ji} \lambda(j) - \sum_{j\in I} a_{ij} \lambda(i) = 0$, because $\lambda$ is an invariant measure of $A$. Therefore $\langle A, \mathbb{E}_{\lambda}[R(\cdot, U)] \rangle = 0$.
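A quick numerical sanity check of (5), as our own illustration: take a row-stochastic non-negative matrix $A$, compute its invariant probability $\lambda$, and verify that the inner product vanishes.

```python
import numpy as np

rng = np.random.default_rng(0)
c = 4
A = rng.random((c, c))
A /= A.sum(axis=1, keepdims=True)        # row-stochastic, so invariance reads lam = lam @ A

# invariant probability: left eigenvector of A for eigenvalue 1
w, V = np.linalg.eig(A.T)
lam = np.real(V[:, np.argmin(np.abs(w - 1))])
lam /= lam.sum()

U = rng.standard_normal(c)
E_R = lam[:, None] * (U[None, :] - U[:, None])   # (i,j) entry: lam(i) (U^j - U^i)
print(np.sum(A * E_R))                           # ~ 0, as claimed by Lemma 2
```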
Lemma 3. Let $(a_m)_{m\in\mathbb{N}}$ be a sequence in $\mathbb{R}^d$ and let $\alpha, \beta$ be two points in $\mathbb{R}^d$. Then for every $n \in \mathbb{N}^*$:
\[ \frac{\sum_{m=1}^{n} \bigl( \|a_m - \beta\|_2^2 - \|a_m - \alpha\|_2^2 \bigr)}{n} = \|\bar{a}_n - \beta\|_2^2 - \|\bar{a}_n - \alpha\|_2^2. \tag{6} \]
Proof. We start with the framework described in Section 2.1. Consider the auxiliary two-person game with vector payoffs defined as follows. At stage $n \in \mathbb{N}$, Player 1 (resp. Player 2) chooses the action $l_n \in L$ (resp. $s_n \in S$), which generates the payoff $R_n = R(l_n, U_n) \in \mathbb{R}^{c^2}$, where $R$ is as in Section 2.3, with:
\[ U_n = \Bigl( -\|s_n - \mu(l)\|_2^2 \Bigr)_{l \in L} \in \mathbb{R}^c. \]
\[ \limsup_{n\to\infty} \frac{|N_n(l)|}{n} \Bigl( \|\bar{s}_n(l) - \mu(l)\|_2^2 - \varepsilon^2 \Bigr) \le 0, \quad \mathbb{P}_{\sigma,\tau}\text{-a.s.} \]
Remark 3. We have proved that $\sigma$ is such that, for every $l \in L$, $\bar{s}_n(l)$ is closer to $\mu(l)$ than to any other $\mu(k)$, as soon as $|N_n(l)|/n$ is not too small.
The fact that $\bar{s}_n$ belongs to a finite set $S$ and the $\{\mu(l)\}$ are probabilities over $S$ is irrelevant: one can show that for any finite set $\{a(l) \in \mathbb{R}^d : l \in L\}$, Player 1 has a strategy $\sigma$ such that for any bounded sequence $(a_m)_{m\in\mathbb{N}}$ in $\mathbb{R}^d$ and for every $l$ and $k$:
\[ \limsup_{n\to\infty} \frac{|N_n(l)|}{n} \Bigl( \|\bar{a}_n(l) - a(l)\|^2 - \|\bar{a}_n(l) - a(k)\|^2 \Bigr) \le 0. \]
Since ρ is multilinear and therefore continuous on Δ(I) × Δ(J), for every ε > 0,
there exists δ > 0 such that:
We introduce the auxiliary game $\Gamma$ where Player 2 chooses an action (or state) $j \in J$ and Player 1 forecasts it, using $\{y(l) : l \in L\}$, a finite grid of $\Delta(J)$ whose diameter is smaller than $\delta$. Let $\sigma$ be a calibrated strategy for Player 1, so that $\bar{j}_n(l)$, the empirical distribution of actions of Player 2 on $N_n(l)$, is asymptotically $\delta$-close to $y(l)$.
Define the strategy of Player 1 in the initial game by performing $\sigma$ and, if $l_n = l$, by playing according to $x_{y(l)} = x(l) \in \Delta(I)$, as depicted in (7). Since the choices of actions of the two players are independent, $\bar{\rho}_n(l)$ will be close to $\rho(x(l), \bar{j}_n(l))$, hence close to $\rho(x(l), y(l))$, and finally close to $C^{\varepsilon}$, as soon as $|N_n(l)|$ is not too small.
Indeed, by construction of $\sigma$, for every $\eta > 0$ there exists $N^1 \in \mathbb{N}$ such that, for every strategy $\tau$ of Player 2:
\[ \mathbb{P}_{\sigma,\tau}\Bigl( \forall l \in L,\ \forall n \ge N^1,\ \frac{|N_n(l)|}{n}\bigl( \|\bar{j}_n(l) - y(l)\|_2^2 - \delta^2 \bigr) \le \eta \Bigr) \ge 1 - \eta. \tag{8} \]
The Hoeffding–Azuma inequality for sums of bounded martingale differences (see [2,11]) implies that for any $\eta \in (0,1)$, with probability at least $1 - \eta$,
\[ \bigl| \bar{\rho}_n(l) - \rho(x(l), \bar{j}_n(l)) \bigr| \le \sqrt{\frac{2}{|N_n(l)|} \ln \frac{2}{\eta}}, \tag{9} \]
Equations (8) and (9), taken with $\eta \le \varepsilon/L$, imply that, with probability at least $1 - 2\varepsilon$, for every $n \ge \max\{N^1, L N^2/\varepsilon\}$: $|\bar{\rho}_n(l) - \rho(x(l), \bar{j}_n(l))| \le \eta \le \varepsilon$, and if $N_n(l)/n \ge \varepsilon/L$ then $|N_n(l)| > N^2$, so $\|\bar{j}_n(l) - y(l)\|^2 \le 2\delta^2$, and therefore
Remark 4. Blackwell's proof of this result is not explicit. He showed that condition (7) implies that $C$ is a B-set, and his proof relies on the use of von Neumann's minmax theorem. In words, let $z$ be a fixed point outside $C$. Assume that if Player 1 knew $y \in \Delta(J)$, the law of the action of Player 2, then there would be a law $x_y \in \Delta(I)$ such that the expected payoff $\rho(x_y, y)$ and $z$ are on different sides of the hyperplane described in the definition of a B-set. The minmax theorem implies that there exists $x \in \Delta(I)$ such that for every $y \in \Delta(J)$, $z$ and $\rho(x, y)$ are on different sides, and therefore $C$ is a B-set. This gives the existence of an approachability strategy for $C$.
One of the major interests of calibration is that it transforms this implicit proof into an explicit constructive proof: while performing a calibrated strategy (in an auxiliary game where $J$ plays the role of the set of states), Player 1 can enforce the property that, for every $l \in L$, the average move of Player 2 is almost $y(l)$ on $N_n(l)$. So he just has (and could not do better than) to play $x_{y(l)}$ on these stages.
set of finite histories for Player 1, $H^1 = \bigcup_{n\in\mathbb{N}} (I \times S)^n$, to $\Delta(I)$ (resp. from $H^2 = \bigcup_{n\in\mathbb{N}} (I \times S \times J)^n$ to $\Delta(J)$). A couple $(\sigma, \tau)$ generates a probability $\mathbb{P}_{\sigma,\tau}$ over $H = (I \times S \times J)^{\mathbb{N}}$.
where $\bar{s}(\bar{j}_n) \in \Delta(S)^I$ is the average flag. In words, the average payoff of Player 1 could not have been uniformly better had he known the average distribution of flags before the beginning of the game.
In this framework, given a flag $\mu \in \Delta(S)^I$, the function $\min_{y \in \bar{s}^{-1}(\mu)} \rho(\cdot, y)$ may not be linear. So the best response of Player 1 might not be a pure action in $I$ but a mixed action $x \in \Delta(I)$, and any pure action in the support of $x$ may be a bad response. Note that this also appears in Rustichini's definition, since the maximum is taken over $\Delta(I)$ and not just over $I$, as in the usual definition of external regret with full monitoring.
Remark 6. Note that this definition is not intrinsic (unlike in the full monitoring case), since it depends on the choice of $\{x(l) : l \in L\}$, and is based solely on the potential observations (i.e., the sequences of flags $(\mu_n)_{n\in\mathbb{N}}$) of Player 1.
Some parts of the proof are quite technical, but the insight is very simple, so we first give the main ideas. First assume that, in the one-stage game, $\mu \in \Delta(S)^I$ is observed by Player 1; then there exists $x \in \Delta(I)$ such that $x \in BR(\mu)$. Using a minmax argument, as Blackwell did for the proof of Corollary 1, one could prove that Player 1 has an $(L, \varepsilon)$-internally consistent strategy (as did Lehrer and Solan [13]).
The idea is to use calibration, as in the alternative proof of Corollary 1, to transform this implicit proof into a constructive one. Fix $\varepsilon > 0$ and assume for the moment that Player 1 observes each $\mu_n$. Consider the game where Player 1 predicts the sequence $(\mu_n)_{n\in\mathbb{N}}$ using the $\delta$-grid $\{\mu(l) : l \in L\}$ given by Assumption 1. A calibrated strategy of Player 1 chooses a sequence $(l_n)_{n\in\mathbb{N}}$ in such a way that $\bar{\mu}_n(l)$ is asymptotically $\delta$-close to $\mu(l)$. Hence Player 1 just has to play according to $x(l) \in BR_{\varepsilon}(\mu(l))$ on these stages.
Indeed, since the choices of action are independent, $\bar{\imath}_n(l)$ will be asymptotically $\eta$-close to $x(l)$, and the regularity of $G$ will then imply that $\bar{\imath}_n(l) \in BR_{\varepsilon}(\bar{\mu}_n(l))$, and so the strategy will be $(L, \varepsilon)$-internally consistent.
The only issue is that in the current framework the signal depends on the action of Player 1, who does not observe $\mu_n$. The existence of calibrated strategies is therefore not straightforward. However, it is well known that, up to a slight perturbation of $x(l)$, the information available to Player 1 after a long time is close to $\bar{\mu}_n(l)$ (as in the multi-armed bandit problem and some calibration and no-regret frameworks; see Chapter 6 in [5] for a survey of these techniques).
with $x^{\eta}(l)[i_n]$ the weight put by $x^{\eta}(l)$ on $i_n$, and denote by $\hat{s}_n(l)$ the average of $\{\hat{s}_m\}$ on $N_n(l)$:
Lemma 4. For every $\theta > 0$, there exists $N \in \mathbb{N}$ such that, for every $l \in L$:
\[ \mathbb{P}_{\sigma,\tau}\bigl( \forall m \ge n,\ \|\hat{s}_m(l) - \bar{\mu}_m(l)\| \le \theta \,\big|\, N_n(l) \ge N \bigr) \ge 1 - \theta. \]
Proof. Since for every $n \in \mathbb{N}$ the choices of $i_n$ and $\mu_n$ are independent:
\[ \mathbb{E}_{\sigma,\tau}[\hat{s}_n \mid h_{n-1}, l_n, \mu_n] = \sum_{i\in I} \sum_{s\in S} \mu^i_n[s]\, x^{\eta}(l_n)[i] \Bigl( 0, \ldots, \frac{s}{x^{\eta}(l_n)[i]}, \ldots, 0 \Bigr) = \sum_{i\in I} \Bigl( 0, \ldots, \sum_{s\in S} \mu^i_n[s]\, s, \ldots, 0 \Bigr) = \bigl( \mu^1_n, \ldots, \mu^I_n \bigr) = \mu_n. \]
\[ \mathbb{P}_{\sigma,\tau}\bigl( \forall m \ge n,\ \|\hat{s}_m(l) - \bar{\mu}_m(l)\| \le \theta \,\big|\, N_n(l) \ge N \bigr) \ge 1 - \theta. \]
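The unbiasedness computation above is the classical importance-weighting trick. A small simulation, our own illustration with made-up action and signal distributions, confirms that the weighted estimate averages to the true flag.

```python
import numpy as np

rng = np.random.default_rng(0)
I, S = 3, 4
x_eta = np.array([0.5, 0.3, 0.2])        # perturbed mixed action x^eta(l) (assumed)
mu = rng.dirichlet(np.ones(S), size=I)   # true flag: mu[i] = signal law of action i

est = np.zeros((I, S))
n = 200_000
for _ in range(n):
    i = rng.choice(I, p=x_eta)           # Player 1 draws the action
    s = rng.choice(S, p=mu[i])           # observed signal for that action only
    est[i, s] += 1.0 / x_eta[i]          # importance weight 1 / x^eta(l)[i]
est /= n

print(np.abs(est - mu).max())            # small: the estimator is unbiased
```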
Assume now that Player 1 uses a calibrated strategy to predict the sequence of $\hat{s}_n$ (this game is in full monitoring); then he knows that asymptotically $\hat{s}_n(l)$ is closer to $\mu(l)$ than to any $\mu(k)$ (as soon as the frequency of $l$ is big enough), and therefore it is $\delta$-close to $\mu(l)$. Lemma 4 implies that $\bar{\mu}_n(l)$ is asymptotically close to $\hat{s}_n(l)$, and therefore $2\delta$-close to $\mu(l)$.
Note that instead of trying to compute the sequence of payoffs from the signals, we consider an auxiliary game defined on the signal space (i.e., the observations), so that this new game is in fact (almost) in full monitoring.
\[ \mathbb{P}_{\sigma,\tau}\bigl( \forall m \ge n,\ \|\hat{s}_m(l) - \bar{\mu}_m(l)\| \le \theta \,\big|\, N_n(l) \ge N^1 \bigr) \ge 1 - \theta. \tag{10} \]
Define $\eta = \min_{l\in L} \eta(l)$ and let $x \in \Delta(I)$, $\mu \in \Delta(S)^I$ and $l \in L$ be such that $\|x - x(l)\| \le \eta$ and $\|\mu - \mu(l)\| \le \delta$; then:
\[ G(x, \mu) \ge G(x, \mu(l)) - \frac{\varepsilon}{2} \ge G(x(l), \mu(l)) - \varepsilon = \max_{z\in\Delta(I)} G(z, \mu(l)) - \varepsilon, \]
This proposition implies that the evaluation function used by Rustichini fulfills Assumption 1 (Lugosi, Mannor and Stoltz [14]). Before proving that, we introduce $\mathcal{S}$, the range of $\bar{s}$, which is a closed convex subset of $\Delta(S)^I$, and $\Pi_{\mathcal{S}}(\cdot)$, the projection onto it.
Corollary 2. Define $G : \Delta(I) \times \Delta(S)^I \to \mathbb{R}$ by:
\[ G(x, \mu) = \begin{cases} \inf_{y \in \bar{s}^{-1}(\mu)} \rho(x, y) & \text{if } \mu \in \mathcal{S}, \\ G(x, \Pi_{\mathcal{S}}(\mu)) & \text{otherwise.} \end{cases} \]
Concluding Remarks
The definitions and proofs rely solely on Assumption 1: it is not relevant to assume that Player 1 faces only one opponent, nor that the action set of his opponent is finite. The only requirement is that, given his information (a probability in $\Delta(I)$ and a flag in $\Delta(S)^I$), Player 1 can evaluate his payoff, no matter how this payoff is obtained: for example, we could have assumed that Player 2 chooses at each stage an (unobserved) outcome vector $U \in [-1, 1]^{|I|}$ and Player 1 chooses a coordinate, which is his observed payoff.
In the full monitoring framework, many improvements have been made in the past years concerning calibration and regret (see, for instance, [12,16,18]). Here, we aimed to clarify the links between the original notions of approachability, internal regret and calibration, in order to extend applications (in particular, to get rid of the finiteness of $J$), to define internal regret (with signals) as calibration over an appropriate space, and to give a proof derived from no-internal-regret (in full monitoring), itself derived from the approachability of an orthant in this space.
Acknowledgments. I deeply thank my advisor Sylvain Sorin for his great help and numerous comments. I also acknowledge helpful remarks from Eilon Solan and Gilles Stoltz.
References
1. Aubin, J.-P., Frankowska, H.: Set-valued Analysis. Birkhäuser Boston Inc., Basel
(1990)
2. Azuma, K.: Weighted sums of certain dependent random variables. Tôhoku Math.
J. 19(2), 357–367 (1967)
3. Blackwell, D.: An analog of the minimax theorem for vector payoffs. Pacific J.
Math. 6, 1–8 (1956)
4. Blackwell, D.: Controlled random walks. In: Proceedings of the International
Congress of Mathematicians, 1954, Amsterdam, vol. III, pp. 336–338 (1956)
5. Cesa-Bianchi, N., Lugosi, G.: Prediction, Learning, and Games. Cambridge Uni-
versity Press, Cambridge (2006)
6. Foster, D.P., Vohra, R.V.: Asymptotic calibration. Biometrika 85, 379–390 (1998)
7. Foster, D.P., Vohra, R.V.: Regret in the on-line decision problem. Games Econom.
Behav. 29, 7–35 (1999)
8. Fudenberg, D., Levine, D.K.: Conditional universal consistency. Games Econom.
Behav. 29, 104–130 (1999)
9. Hannan, J.: Approximation to Bayes risk in repeated play. In: Contributions to the
theory of Games. Annals of Mathematics Studies, vol. 3(39), pp. 97–139. Princeton
University Press, Princeton (1957)
10. Hart, S., Mas-Colell, A.: A simple adaptive procedure leading to correlated equi-
librium. Econometrica 68, 1127–1150 (2000)
11. Hoeffding, W.: Probability inequalities for sums of bounded random variables.
J. Amer. Statist. Assoc. 58, 13–30 (1963)
12. Lehrer, E.: A wide range no-regret theorem. Games Econom. Behav. 42, 101–115
(2003)
13. Lehrer, E., Solan, E.: Learning to play partially-specified equilibrium (manuscript,
2007)
14. Lugosi, G., Mannor, S., Stoltz, G.: Strategies for prediction under imperfect
monitoring. Math. Oper. Res. 33, 513–528 (2008)
15. Rustichini, A.: Minimizing regret: the general case. Games Econom. Behav. 29,
224–243 (1999)
16. Sandroni, A., Smorodinsky, R., Vohra, R.V.: Calibration with many checking rules.
Math. Oper. Res. 28, 141–153 (2003)
17. Sorin, S.: Lectures on Dynamics in Games. Unpublished Lecture Notes (2008)
18. Vovk, V.: Non-asymptotic calibration and resolution. Theoret. Comput. Sci. 387,
77–89 (2007)
St. Petersburg Portfolio Games
The first author acknowledges the support of the Computer and Automation
Research Institute of the Hungarian Academy of Sciences.
The work was supported in part by the Hungarian Scientific Research Fund, Grant
T-048360.
The behavior of the market is given by the sequence of return vectors $\{x_n\}$, $x_n = (x_n^{(1)}, \ldots, x_n^{(d)})$, such that the $j$-th component $x_n^{(j)}$ of the return vector $x_n$ denotes the amount obtained after investing a unit capital in the $j$-th financial instrument on the $n$-th round.
Let $S_0$ denote the investor's initial capital. Then at the beginning of the first round, $S_0 b_1^{(j)}$ is invested into financial instrument $j$, and it results in return $S_0 b_1^{(j)} x_1^{(j)}$; therefore at the end of the first round the investor's wealth becomes
\[ S_1 = S_0 \sum_{j=1}^{d} b_1^{(j)} x_1^{(j)} = S_0\, \langle b_1, x_1 \rangle, \]
where $\langle \cdot, \cdot \rangle$ denotes inner product. For the second round, $b_2$ is the new portfolio and $S_1$ is the new initial capital, so
Of course, the problem is to find the optimal investment strategy for the long run, that is, to maximize $S_n$ in some sense. The best strategy depends on the optimality criterion. A naive approach is to maximize the expected return in each round. This leads to the risky strategy of investing all the money into the financial instrument $j$ with $\mathbb{E} X_n^{(j)} = \max\{\mathbb{E} X_n^{(i)} : i = 1, 2, \ldots, d\}$, where $X_n = (X_n^{(1)}, X_n^{(2)}, \ldots, X_n^{(d)})$ is the market vector in the $n$-th round. Since the random variable $X_n^{(j)}$ can be 0 with positive probability, repeated application of this strategy leads to quick bankruptcy. The underlying phenomenon is the simple fact that $\mathbb{E}(S_n)$ may increase exponentially while $S_n \to 0$ almost surely. A more delicate optimality criterion was introduced by Breiman [3]: in each round we maximize the expectation $\mathbb{E} \ln \langle b, X_n \rangle$ for $b \in \Delta_d$. This is the so-called log-optimal portfolio, which is optimal under general conditions [3].
If the market process $\{X_i\}$ is memoryless, i.e., it is a sequence of independent and identically distributed (i.i.d.) random return vectors, then the log-optimal portfolio vector is the same in each round:
\[ \frac{1}{n} \ln S_n = \frac{1}{n} \ln S_0 + \frac{1}{n} \sum_{i=1}^{n} \ln \langle b, x_i \rangle, \]
therefore, without loss of generality, we can assume in the sequel that the initial capital is $S_0 = 1$.
The optimality of $b^*$ means that if $S_n^* = S_n(b^*)$ denotes the capital after round $n$ achieved by a log-optimal portfolio strategy $b^*$, then for any portfolio strategy $b$ with finite $\mathbb{E}\{\ln \langle b, X_1 \rangle\}$ and with capital $S_n = S_n(b)$, and for any memoryless market process $\{X_n\}_1^{\infty}$,
\[ \lim_{n\to\infty} \frac{1}{n} \ln S_n \le \lim_{n\to\infty} \frac{1}{n} \ln S_n^* \quad \text{almost surely,} \]
and the maximal asymptotic average growth rate is
\[ \lim_{n\to\infty} \frac{1}{n} \ln S_n^* = W^* := \mathbb{E}\{\ln \langle b^*, X_1 \rangle\} \quad \text{almost surely.} \]
The proof of the optimality is a simple consequence of the strong law of large numbers. Introduce the notation $W(b) = \mathbb{E}\{\ln \langle b, X_1 \rangle\}$. Then
\[ \frac{1}{n} \ln S_n = \frac{1}{n} \sum_{i=1}^{n} \ln \langle b, X_i \rangle = W(b) + \frac{1}{n} \sum_{i=1}^{n} \bigl( \ln \langle b, X_i \rangle - \mathbb{E}\{\ln \langle b, X_i \rangle\} \bigr) \to W(b) \quad \text{almost surely.} \]
Similarly,
\[ \lim_{n\to\infty} \frac{1}{n} \ln S_n^* = W(b^*) = \max_{b} W(b) \quad \text{almost surely.} \]
In connection with CRP in a more general setup, we refer to Kelly [8] and Breiman [3].
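To make the log-optimal criterion concrete, here is a small numerical sketch, our own illustration with a made-up two-asset market, that maximizes $W(b) = \mathbb{E} \ln \langle b, X \rangle$ over the simplex by grid search.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.choice([0.0, 0.5, 1.0, 3.0], size=(100_000, 1), p=[0.1, 0.3, 0.4, 0.2])
X = np.hstack([X, np.ones_like(X)])      # second asset is cash (return 1)

def W(b):                                # empirical growth rate E ln <b, X>
    return np.mean(np.log(X @ b))

grid = np.linspace(0.0, 1.0, 201)
best = max(grid, key=lambda b: W(np.array([b, 1.0 - b])))
print("log-optimal fraction in the risky asset:", best)
```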
In the following we assume that the i.i.d. random vectors $\{X_i\}$ have the general form $X = (X^{(1)}, X^{(2)}, \ldots, X^{(d)}, X^{(d+1)})$, where $X^{(1)}, X^{(2)}, \ldots, X^{(d)}$ are nonnegative i.i.d. random variables and $X^{(d+1)}$ is the cash, that is, $X^{(d+1)} \equiv 1$, and $d \ge 1$. Then the concavity of the logarithm and the symmetry of the first $d$ components immediately imply that the log-optimal portfolio has the form $b = (b, b, \ldots, b, 1 - db)$, where of course $0 \le b \le 1/d$. When does $b = 1/d$ correspond to the optimal strategy; that is, when should we play with all our money? In our special case $W$ has the form
\[ W(b) = \mathbb{E} \ln \Bigl( b \sum_{i=1}^{d} X^{(i)} + 1 - bd \Bigr). \]
Let $Z_d = \sum_{i=1}^{d} X^{(i)}$. Interchanging the order of integration and differentiation, we obtain
$$\frac{d}{db}\,W(b) = \frac{d}{db}\,\mathbb{E}\ln\Big(b\sum_{i=1}^{d} X^{(i)} + 1 - bd\Big) = \mathbb{E}\,\frac{Z_d - d}{b Z_d + 1 - bd}.$$
According to the strong law of large numbers $d/Z_d \to 1/\mathbb{E}(X^{(1)})$ a.s. as $d \to \infty$; thus, under some additional assumptions on the underlying variables, $\mathbb{E}(d/Z_d) \to 1/\mathbb{E}(X^{(1)})$ as $d \to \infty$. Therefore if $\mathbb{E}(X^{(1)}) > 1$, then for $d$ large enough the optimal strategy is $(1/d, \dots, 1/d, 0)$.
In the latter computations we tacitly assumed some regularity conditions, namely that we can interchange the order of differentiation and integration, and that we can take the $L_1$-limit instead of the almost sure limit. One can show that these conditions are satisfied if the underlying random variables have a strictly positive infimum. We skip the technical details.
In the simple St. Petersburg game the player's payoff is $X = 2^k$ with probability $2^{-k}$, $k = 1, 2, \dots$, so its distribution function is
$$F(x) = \mathbb{P}\{X \le x\} = \begin{cases} 0, & \text{if } x < 2,\\ 1 - \dfrac{1}{2^{\lfloor \log_2 x \rfloor}} = 1 - \dfrac{2^{\{\log_2 x\}}}{x}, & \text{if } x \ge 2, \end{cases} \tag{4}$$
where $\lfloor x \rfloor$ is the usual integer part of $x$ and $\{x\}$ stands for the fractional part.
Since $\mathbb{E}\{X\} = \infty$, this game has delicate properties (cf. Aumann [1], Bernoulli [2], Haigh [7], and Samuelson [10]). In the literature, the repeated St. Petersburg game (also called the iterated St. Petersburg game) usually means a multi-period game consisting of a sequence of simple St. Petersburg games, where in each round the player invests 1$. Let $X_n$ denote the payoff of the $n$-th simple game.
Assume that the sequence $\{X_n\}_{n=1}^{\infty}$ is i.i.d. After $n$ rounds the player's gain in the repeated game is $\bar S_n = \sum_{i=1}^{n} X_i$; then
$$\lim_{n\to\infty} \frac{\bar S_n}{n \log_2 n} = 1$$
in probability, where $\log_2$ denotes the logarithm with base 2 (cf. Feller [6]).
Moreover,
$$\liminf_{n\to\infty} \frac{\bar S_n}{n \log_2 n} = 1 \quad \text{a.s.} \qquad \text{and} \qquad \limsup_{n\to\infty} \frac{\bar S_n}{n \log_2 n} = \infty \quad \text{a.s.}$$
(cf. Chow and Robbins [4]). Introducing the notation for the largest payoff
$$X_n^* = \max_{1 \le i \le n} X_i,$$
one has for the trimmed sum that $\lim_{n\to\infty} (\bar S_n - X_n^*)/(n \log_2 n) = 1$ a.s. (cf. Csörgő and Simons [5]).
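As a quick numerical illustration of Feller's weak law above, one may simulate the repeated game; the sketch below (a Monte Carlo script of ours, not from the paper) draws payoffs $X = 2^k$ with probability $2^{-k}$ and prints the normalized gain, which is typically close to 1 although, by the Chow and Robbins result, individual runs can show large upward excursions.

import math, random

def st_petersburg():
    # one simple St. Petersburg payoff: X = 2^k with probability 2^(-k)
    k = 1
    while random.random() < 0.5:
        k += 1
    return 2 ** k

random.seed(0)
n = 10 ** 6
gain = sum(st_petersburg() for _ in range(n))
print(gain / (n * math.log2(n)))  # typically close to 1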
Consider now the sequential St. Petersburg game: the player starts with capital $S_0^{(c)} = 1$, reinvests his whole capital in each round, and pays a proportional commission factor $c$, so that $S_n^{(c)} = \prod_{i=1}^{n} (1-c) X_i$. The asymptotic average growth rate of the game is
$$W(c) := \lim_{n\to\infty} \frac{1}{n} \log_2 S_n^{(c)}.$$
Let us calculate the asymptotic average growth rate. Because of
$$W_n^{(c)} = \frac{1}{n}\log_2 S_n^{(c)} = \log_2(1-c) + \frac{1}{n}\sum_{i=1}^{n} \log_2 X_i,$$
we have
$$W(c) = \log_2(1-c) + \lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n} \log_2 X_i = \log_2(1-c) + \mathbb{E}\{\log_2 X_1\}$$
a.s., so $W(c)$ can be calculated via expected log-utility (cf. Kenneth [9]). A commission factor $c$ is called fair if
$$W(c) = 0,$$
so the growth rate of the sequential game is 0. Let us calculate the fair $c$:
$$\log_2(1-c) = -\mathbb{E}\{\log_2 X_1\} = -\sum_{k=1}^{\infty} k \cdot 2^{-k} = -2,$$
i.e., $c = 3/4$.
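The arithmetic behind the fair factor can be checked directly; the following small script (ours, for illustration) evaluates the truncated series for $\mathbb{E}\{\log_2 X_1\}$ and the resulting $c$:

import math

# E{log2 X} = sum_{k>=1} k * 2^(-k) = 2 for the simple St. Petersburg payoff
e_log2_x = sum(k * 2.0 ** (-k) for k in range(1, 200))
c_fair = 1 - 2.0 ** (-e_log2_x)   # solve log2(1 - c) = -E{log2 X}
print(e_log2_x, c_fair)           # prints (approximately) 2.0 and 0.75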
Consider the portfolio game, where a fraction of the capital is invested in simple fair St. Petersburg games and the rest is kept in cash, i.e., it is a CRP problem with the return vector $X = (X^{(1)}, \dots, X^{(d)}, X^{(d+1)})$ ($d \ge 1$) such that the first $d$ i.i.d. components of the return vector $X$ are of the form
$$\mathbb{P}\{X^{(i)} = 2^{k-2}\} = 2^{-k} \tag{5}$$
($k \ge 1$), while the last component is the cash. The main aim is to calculate the largest growth rate $W_d^*$.
The function $\log_2$ is concave, therefore $W(b)$ is concave too, so $W(0) = 0$ (keep everything in cash) and $W(1) = 0$ (the simple game is fair) imply that $W(b) > 0$ for all $0 < b < 1$. Let us calculate $\max_b W(b)$. We have that
$$W(b) = \sum_{k=1}^{\infty} \log_2\big(b(2^k/4 - 1) + 1\big) \cdot 2^{-k} = \log_2(1 - b/2) \cdot 2^{-1} + \sum_{k=3}^{\infty} \log_2\big(b(2^{k-2} - 1) + 1\big) \cdot 2^{-k}.$$
Figure 1 shows the curve of the average growth rate of the portfolio game. The function $W(\cdot)$ attains its maximum at $b = 0.385$, that is,
$$b^* = (0.385, 0.615),$$
where the growth rate is $W_1^* = W(0.385) = 0.149$. This means that if in each round one reinvests 38.5% of his capital (of which, because of the commission $c = 3/4$, the real investment is 9.6% while the cost is 28.9%), then the wealth grows by a factor of $2^{0.149} \approx 1.109$ per round, i.e., the portfolio game with two components of zero growth rate (fair St. Petersburg game and cash) can result in a growth rate of 10.9%.
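These numerical values can be reproduced by truncating the series for $W(b)$ and searching over $b$; the following sketch (ours; the truncation point K and the grid are arbitrary choices) recovers $b \approx 0.385$, $W \approx 0.149$ and the per-round factor $2^{0.149} \approx 1.109$:

import math

def W(b, K=200):
    # truncated series W(b) = sum_{k>=1} log2(b(2^k/4 - 1) + 1) * 2^(-k)
    return sum(math.log2(b * (2.0 ** k / 4 - 1) + 1) * 2.0 ** (-k)
               for k in range(1, K + 1))

best_b = max((i / 1000.0 for i in range(1, 1000)), key=W)
print(best_b, W(best_b), 2 ** W(best_b))   # ~0.385, ~0.149, ~1.109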
[Figures 1 and 2: curves of the average growth rate W(b) of the portfolio game with one and with two fair St. Petersburg components, respectively.]
Figure 2 shows the curve of the average growth rate of this portfolio game. Numerically we can determine that the maximum is attained at $b = 0.364$, so
$$b^* = (0.364, 0.364, 0.272),$$
where the growth rate is $W_2^* = W(0.364) = 0.289$.
Proof. Using the notations at the end of Section 1, we have to prove the inequality
$$\frac{d}{db}\,W(1/d) \ge 0.$$
According to (2) this is equivalent with
$$1 \ge \mathbb{E}\,\frac{d}{X_1 + \dots + X_d}.$$
For $d = 3, 4, 5$ one can check this inequality numerically. One has to prove the proposition for $d \ge 6$, which means that
$$1 \ge \mathbb{E}\,\frac{1}{\frac{1}{d}\sum_{i=1}^{d} X_i}. \tag{6}$$
We use induction. Assume that (6) holds up to $d - 1$. Choose the integers $d_1 \ge 3$ and $d_2 \ge 3$ such that $d = d_1 + d_2$. Then
$$\frac{1}{\frac{1}{d}\sum_{i=1}^{d} X_i} = \frac{1}{\frac{1}{d}\sum_{i=1}^{d_1} X_i + \frac{1}{d}\sum_{i=d_1+1}^{d} X_i} = \frac{1}{\frac{d_1}{d}\,\frac{1}{d_1}\sum_{i=1}^{d_1} X_i + \frac{d_2}{d}\,\frac{1}{d_2}\sum_{i=d_1+1}^{d} X_i},$$
therefore the Jensen inequality implies that
$$\frac{1}{\frac{1}{d}\sum_{i=1}^{d} X_i} \le \frac{d_1}{d}\,\frac{1}{\frac{1}{d_1}\sum_{i=1}^{d_1} X_i} + \frac{d_2}{d}\,\frac{1}{\frac{1}{d_2}\sum_{i=d_1+1}^{d} X_i},$$
and so
$$\mathbb{E}\,\frac{1}{\frac{1}{d}\sum_{i=1}^{d} X_i} \le \frac{d_1}{d}\,\mathbb{E}\,\frac{1}{\frac{1}{d_1}\sum_{i=1}^{d_1} X_i} + \frac{d_2}{d}\,\mathbb{E}\,\frac{1}{\frac{1}{d_2}\sum_{i=1}^{d_2} X_i} \le \frac{d_1}{d} + \frac{d_2}{d} = 1,$$
where the last inequality follows from the induction hypothesis.
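The numerical check for $d = 3, 4, 5$ mentioned in the proof can be organized as a truncated exact summation with a rigorous tail bound; the sketch below (ours) uses the component distribution (5) and the fact that each $X_i \ge 1/2$, hence $d/(X_1 + \dots + X_d) \le 2$:

import itertools

def upper_bound(d, K=14):
    # exact truncated sum of E[ d / (X_1+...+X_d) ], P{X_i = 2^(k-2)} = 2^(-k),
    # plus a tail bound: the event that some index exceeds K has probability
    # at most d * 2^(-K), and the integrand never exceeds 2
    total = 0.0
    for ks in itertools.product(range(1, K + 1), repeat=d):
        p = 2.0 ** (-sum(ks))
        total += p * d / sum(2.0 ** (k - 2) for k in ks)
    return total + 2.0 * d * 2.0 ** (-K)

for d in (3, 4, 5):
    print(d, upper_bound(d))   # each upper bound should come out below 1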
First we compute this growth rate numerically for small values of d, then we
determine the exact asymptotic growth rate for d → ∞.
For arbitrary $d \ge 2$, by (3) we may write
$$\mathbb{E}\log_2\sum_{i=1}^{d} X_i = \sum_{i_1, i_2, \dots, i_d = 1}^{\infty} \frac{\log_2\big(2^{i_1} + 2^{i_2} + \dots + 2^{i_d}\big)}{2^{i_1 + i_2 + \dots + i_d}}.$$
Theorem 1. For the asymptotic behavior of the average growth rate we have
$$-\frac{0.8}{\ln 2\,\log_2 d} \le W_d^* - \log_2\log_2 d + 2 \le \frac{\log_2\log_2 d + 4}{\ln 2\,\log_2 d}.$$
Proof. Because of
$$W_d^* = \mathbb{E}\log_2\frac{\sum_{i=1}^{d} X_i}{4d} = \mathbb{E}\log_2\frac{\sum_{i=1}^{d} X_i}{d\log_2 d} + \log_2\log_2 d - 2,$$
we have to show that
$$-\frac{0.8}{\ln 2\,\log_2 d} \le \mathbb{E}\log_2\frac{\sum_{i=1}^{d} X_i}{d\log_2 d} \le \frac{\log_2\log_2 d + 4}{\ln 2\,\log_2 d}.$$
Concerning the upper bound in the theorem, use the decomposition
$$\log_2\frac{\sum_{i=1}^{d} X_i}{d\log_2 d} = \log_2\frac{\sum_{i=1}^{d} \tilde X_i}{d\log_2 d} + \log_2\frac{\sum_{i=1}^{d} X_i}{\sum_{i=1}^{d} \tilde X_i},$$
where
$$\tilde X_i = \begin{cases} X_i, & \text{if } X_i \le d\log_2 d,\\ d\log_2 d, & \text{otherwise.} \end{cases}$$
We prove that
$$\mathbb{E}\log_2\frac{\sum_{i=1}^{d} \tilde X_i}{d\log_2 d} \le \frac{\log_2\log_2 d + 2}{\ln 2\,\log_2 d}, \tag{7}$$
and
$$0 \le \mathbb{E}\log_2\frac{\sum_{i=1}^{d} X_i}{\sum_{i=1}^{d} \tilde X_i} \le \frac{2}{\ln 2\,\log_2 d}. \tag{8}$$
For (8), we have that
$$\mathbb{P}\Big\{\log_2\frac{\sum_{i=1}^{d} X_i}{\sum_{i=1}^{d} \tilde X_i} \ge x\Big\} = \mathbb{P}\Big\{\frac{\sum_{i=1}^{d} X_i}{\sum_{i=1}^{d} \tilde X_i} \ge 2^{x}\Big\},$$
and the proof of (8) is finished. For (7), put $l = \lfloor \log_2(d\log_2 d) \rfloor$. Then for the expectation of the truncated variable we have
$$\mathbb{E}(\tilde X_1) = \sum_{k=1}^{l} 2^k\,\frac{1}{2^k} + d\log_2 d \sum_{k=l+1}^{\infty}\frac{1}{2^k} = l + \frac{d\log_2 d}{2^{l}} = l + \frac{d\log_2 d}{2^{\lfloor \log_2(d\log_2 d) \rfloor}} \le l + 2.$$
Thus,
$$\mathbb{E}\log_2\frac{\sum_{i=1}^{d} \tilde X_i}{d\log_2 d} = \frac{1}{\ln 2}\,\mathbb{E}\ln\frac{\sum_{i=1}^{d} \tilde X_i}{d\log_2 d} \le \frac{1}{\ln 2}\,\mathbb{E}\Big(\frac{\sum_{i=1}^{d} \tilde X_i}{d\log_2 d} - 1\Big) = \frac{1}{\ln 2}\Big(\frac{\mathbb{E}\{\tilde X_1\}}{\log_2 d} - 1\Big) \le \frac{1}{\ln 2}\Big(\frac{l + 2}{\log_2 d} - 1\Big) = \frac{1}{\ln 2}\Big(\frac{\lfloor \log_2(d\log_2 d) \rfloor + 2}{\log_2 d} - 1\Big) \le \frac{1}{\ln 2}\Big(\frac{\log_2 d + \log_2\log_2 d + 2}{\log_2 d} - 1\Big) = \frac{\log_2\log_2 d + 2}{\ln 2\,\log_2 d}.$$
$$\le 1 - \frac{1-\varepsilon}{2^{x}\log_2 d}$$
for $d$ large enough, where we used the inequality $e^{-z} \le 1 - (1-\varepsilon)z$, which holds for $z \le -\ln(1-\varepsilon)$. Thus
$$\mathbb{P}\Big\{\frac{\sum_{i=1}^{d} X_i}{d\log_2 d} > 2^{x}\Big\} \ge \frac{1-\varepsilon}{2^{x}\log_2 d}.$$
For the estimation of the negative part we use another truncation method. Now we cut the variable at $d$, so put
$$\hat X_i = \begin{cases} X_i, & \text{if } X_i \le d,\\ d, & \text{otherwise.} \end{cases}$$
Introduce also the notations $\hat S_d = \sum_{i=1}^{d} \hat X_i$ and $c_d = \mathbb{E}(\hat X_1)/\log_2 d$. Similar computations as before show that
$$\mathbb{E}(\hat X_1) = \lfloor \log_2 d \rfloor + \frac{d}{2^{\lfloor \log_2 d \rfloor}} = \log_2 d + 2^{\{\log_2 d\}} - \{\log_2 d\}$$
and
$$\mathbb{E}\big(\hat X_1^2\big) \le 2\big(2^{\lfloor \log_2 d \rfloor} - 1\big) + \frac{d^2}{2^{\lfloor \log_2 d \rfloor}} < d\big(2^{1-\{\log_2 d\}} + 2^{\{\log_2 d\}}\big) \le 3d,$$
where we used that $2\sqrt{2} \le 2^{1-y} + 2^y \le 3$ for $y \in [0,1]$; this can be proved easily. Simple analysis shows again that $0.9 \le 2^y - y \le 1$ for $y \in [0,1]$, and so for $c_d - 1$ we obtain
$$\frac{0.9}{\log_2 d} < c_d - 1 < \frac{1}{\log_2 d}.$$
Since $\sum_{i=1}^{d} X_i \ge \sum_{i=1}^{d} \hat X_i$ we have that
$$\mathbb{E}\bigg\{\Big(\log_2\frac{\sum_{i=1}^{d} X_i}{d\log_2 d}\Big)^{\!-}\bigg\} \le \mathbb{E}\bigg\{\Big(\log_2\frac{\sum_{i=1}^{d} \hat X_i}{d\log_2 d}\Big)^{\!-}\bigg\}.$$
Noticing that
$$\log_2\frac{\hat S_d}{d\log_2 d} > \log_2\frac{2d}{d\log_2 d} = 1 - \log_2\log_2 d,$$
we obtain
$$\mathbb{E}\bigg\{\Big(\log_2\frac{\hat S_d}{d\log_2 d}\Big)^{\!-}\bigg\} = \int_{-\log_2\log_2 d}^{0} \mathbb{P}\Big\{\log_2\frac{\hat S_d}{d\log_2 d} \le x\Big\}\,dx,$$
and the Bernstein inequality gives
$$\mathbb{P}\Big\{\log_2\frac{\hat S_d}{d\log_2 d} \le x\Big\} \le \exp\bigg(-\frac{\log_2^2 d\,(c_d - 2^{x})^2}{6 + \frac{2}{3}\log_2 d\,(c_d - 2^{x})}\bigg).$$
Let $\gamma > 0$ be fixed; we define it later. For $x < -\gamma$ and $d$ large enough the last upper bound is at most $d^{-(1-2^{-\gamma})^2}$, therefore
$$\int_{-\log_2\log_2 d}^{-\gamma} \mathbb{P}\Big\{\log_2\frac{\hat S_d}{d\log_2 d} \le x\Big\}\,dx \le \frac{\log_2\log_2 d}{d^{(1-2^{-\gamma})^2}}.$$
For arbitrarily fixed $\varepsilon > 0$ we choose $\gamma > 0$ such that $1 - x \le e^{-x} \le 1 - (1-\varepsilon)x$ for $0 \le x \le \gamma \ln 2$. Using also our estimates for $c_d - 1$ we may write
$$\exp\bigg(-\frac{\log_2^2 d\,(c_d - e^{-x})^2}{6 + \frac{2}{3}\log_2 d\,(c_d - e^{-x})}\bigg) \le \exp\bigg(-\frac{\log_2^2 d\,\big(0.9/\log_2 d + (1-\varepsilon)x\big)^2}{6 + \frac{2}{3}\log_2 d\,\big(1/\log_2 d + x\big)}\bigg)$$
References
1. Aumann, R.J.: The St. Petersburg paradox: A discussion of some recent comments.
Journal of Economic Theory 14, 443–445 (1977)
2. Bernoulli, D.: Exposition of a new theory on the measurement of risk. Economet-
rica 22, 22–36 (1954); Originally published in 1738; translated by L. Sommer
3. Breiman, L.: Optimal gambling systems for favorable games. In: Proc. Fourth
Berkeley Symp. Math. Statist. Prob., vol. 1, pp. 65–78. Univ. California Press,
Berkeley (1961)
4. Chow, Y.S., Robbins, H.: On sums of independent random variables with infinite
moments and “fair” games. Proc. Nat. Acad. Sci. USA 47, 330–335 (1961)
5. Csörgő, S., Simons, G.: A strong law of large numbers for trimmed sums, with
applications to generalized St. Petersburg games. Statistics and Probability Let-
ters 26, 65–73 (1996)
6. Feller, W.: Note on the law of large numbers and “fair” games. Ann. Math.
Statist. 16, 301–304 (1945)
7. Haigh, J.: Taking Chances. Oxford University Press, Oxford (1999)
8. Kelly, J.L.: A new interpretation of information rate. Bell System Technical Jour-
nal 35, 917–926 (1956)
9. Kenneth, A.J.: The use of unbounded utility functions in expected-utility maxi-
mization: Response. Quarterly Journal of Economics 88, 136–138 (1974)
10. Samuelson, P.: The St. Petersburg paradox as a divergent double limit. Interna-
tional Economic Review 1, 31–37 (1960)
Reconstructing Weighted Graphs
with Minimal Query Complexity
N.H. Bshouty and H. Mazzawi
additive queries. This solves the open problem in [S. Choi, J.H. Kim. Optimal Query Complexity Bounds for Finding Graphs. Proc. of the 40th Annual ACM Symposium on Theory of Computing, 749–758, 2008]. Choi and Kim's proof holds for $m \ge (\log n)^{\alpha}$ for a sufficiently large constant $\alpha$ and uses graph theory. We use an algebraic approach to the problem. Our proof is simple and holds for any $m$.
1 Introduction
Our goal is to exactly reconstruct the set of edges using additive queries.
One can distinguish between two types of algorithms to solve the problem.
Adaptive algorithms are algorithms that take into account outcomes of previous queries, whereas non-adaptive algorithms make all queries in advance, before any answer is known. In this paper, we consider non-adaptive algorithms for the
problem. Our concern is the query complexity, that is, the number of queries
needed to be asked in order to reconstruct the graph.
The problem of reconstructing graphs using additive queries has been moti-
vated by applications in bioinformatics. Assume we have a set of labeled chemi-
cals, and we are able to tell how many pairs react when mixing several of those
chemicals together. We can represent the problem as a graph, where the chemi-
cals are the vertices and two chemicals that react with each other are connected
with an edge. The goal is to reconstruct this reactions graph using as few exper-
iments as possible.
One concrete example for reconstructing a hidden graph is in genome sequenc-
ing. Obtaining the genome sequence is important for the study of organisms. To
obtain the sequence, one common approach is to obtain short reads from the
genome sequence. These reads are assembled to contigs, which are contiguous
fragments that cover the genome sequence with possible gaps. Given these con-
tigs, the goal is to determine their relative place in the genome sequence. The
process of ordering the contigs is done using the multiplex PCR method. This
method, given a group of contigs, determines the number of adjacent contigs in
the original genome sequence. Assuming that the genome sequence is circular,
the problem of ordering the contigs using the multiplex PCR method is equiva-
lent to reconstructing a hidden Hamiltonian cycle using queries [6,12].
The graph reconstruction problem has seen significant progress in the past decade. For unweighted graphs the information theoretic lower bound gives
$$\Omega\Big(\frac{m \log\frac{n^2}{m}}{\log m}\Big)$$
for the query complexity of any adaptive algorithm for this problem. A tight upper bound was proved for some subclasses of unweighted graphs (Hamiltonian graphs, matchings, stars, cliques, etc.) [13,12,11,6], for unweighted graphs with $\Omega(dn)$ edges where the degree of each vertex is bounded by $d$ [11], and for graphs with $\Omega(n^2)$ edges [11]; the former was then extended to $d$-degenerate unweighted graphs with $\Omega(dn)$ edges [13], i.e., graphs whose edges can be oriented so that the out-degree of each vertex is bounded by $d$. A recent paper by Choi and Kim [8] gave a tight upper bound for all unweighted graphs.
For reconstructing weighted graphs, Choi and Kim proved the following in [8]: If $m > (\log n)^{\alpha}$ for sufficiently large $\alpha$, then there exists a non-adaptive algorithm for reconstructing a weighted graph whose weights are bounded between $n^{-a}$ and $n^{b}$, for any positive constants $a$ and $b$, using
$$O\Big(\frac{m \log n}{\log m}\Big)$$
queries.
In this paper, we close the gap in $m$ and prove that for any $m$ there exists a non-adaptive algorithm that reconstructs the hidden graph using
$$O\Big(\frac{m \log n}{\log m}\Big)$$
queries. This matches the information theoretic lower bound.
In our analysis, we apply algebraic techniques to solve this problem. This simplifies the correctness proofs.
The paper is organized as follows: In Section 2, we present notation, basic tools and some background. In Section 3, we prove the existence of an algorithm for the discretization of the problem. In Section 4, we present the algorithm for the problem and prove its correctness. Finally, Section 5 contains open problems.
2 Preliminaries
In this section we present some background, basic tools and notation.
Proof. Notice that when $1 < m_1 \le m_2 < m$ then $\iota(m_1 - 1)\,\iota(m_2 + 1) = (m_1 - 1)(m_2 + 1) < m_1 m_2 = \iota(m_1)\,\iota(m_2)$. Also, when $m_1 = 0$ and $1 < m_2 < m$ then $\iota(m_1 + 1)\,\iota(m_2 - 1) = m_2 - 1 < m_2 = \iota(m_1)\,\iota(m_2)$. Therefore the optimal value of $\iota(m_1)\iota(m_2)\cdots\iota(m_t)$ is obtained when for every $0 < i < j \le t$ we either have $m_i \in \{1, m\}$ or $m_j \in \{1, m\}$. This is equivalent to: all $m_i \in \{1, m\}$ except at most one. This implies that at least $(\ell - t)/(m-1)$ of the $m_i$'s are equal to $m$.
$$x^T A_G\, y = \frac{x^T A_G\, x}{2} + \frac{y^T A_G\, y}{2} + \frac{(x_1 + y_1)^T A_G\,(x_1 + y_1)}{2} - x_1^T A_G\, x_1 - y_1^T A_G\, y_1.$$
Therefore, the problem of reconstructing the set of edges of the graph $G$ using additive queries is equivalent to finding the non-zero entries of its adjacency matrix $A_G$ using queries of the form
$$f(x, y) = x^T A_G\, y,$$
where $x, y \in \{0,1\}^n$.
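To make the query model concrete, here is a toy oracle (identifiers ours, for illustration only; the paper's algorithm of course chooses its query vectors far more carefully) answering bilinear additive queries on a hidden weighted graph:

import random

random.seed(1)
n = 6
A = [[0.0] * n for _ in range(n)]          # hidden symmetric adjacency matrix
for i in range(n):
    for j in range(i + 1, n):
        if random.random() < 0.5:          # hidden edge (i, j)
            A[i][j] = A[j][i] = random.uniform(1.0, 2.0)

def query(x, y):
    # the additive query f(x, y) = x^T A_G y for x, y in {0,1}^n
    return sum(A[i][j] for i in range(n) if x[i] for j in range(n) if y[j])

e0 = [1, 0, 0, 0, 0, 0]
e1 = [0, 1, 0, 0, 0, 0]
print(query(e0, e1) == A[0][1])            # single-vertex indicators reveal one entry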
Corollary 1. Let $B \in \mathbb{R}^{n\times n}$ be a matrix where $\psi_s(B) > 0$. Then, for randomly (uniformly) chosen vectors $x, y \in \{0,1\}^n$ we have that
3 Reconstructing Graphs
In this section we give an upper bound for the discretization of the problem and
then show how to solve the general problem.
Let $\mathcal{G}$ be the set of all graphs with $n$ vertices and $m$ edges such that the weights of the edges are from the set $[s_1, s_1/(8m^2), s_2]$, that is, the weights lie between $s_1$ and $s_2$ and are multiples of $s_1/(8m^2)$. For the class $\mathcal{G}$ we prove the following
where $x_i, y_i \in \{0,1\}^n$ are such that for all $G, G' \in \mathcal{G}$ with $E(G) \ne E(G')$ there exists $i \in [t]$ such that
To prove Theorem 1 we will prove that there exists a set of queries $S$ of size $t$ such that for every $B = A_G - A_{G'}$, where $G, G' \in \mathcal{G}$ and $E(G) \ne E(G')$, there exists $(x, y) \in S$ such that $|x^T B y| > s_1/(8m)$.
We divide into two cases. The first case is when the matrix $B$ is a difference of adjacency matrices of graphs that are "close" to each other, i.e., $B$ has only few heavy entries. The second case is when the matrix $B$ is a difference of two adjacency matrices of graphs that are "far" from each other, i.e., $B$ has many heavy entries.
First notice the following properties of $B$:
P1. Since $G$ and $G'$ contain at most $m$ edges, we have $wt(B) \le 2m$.
P2. Since the weights of the edges are in $[s_1, s_1/(8m^2), s_2]$, the weights in $B$ are in $[-s_2, s_1/(8m^2), s_2]$.
P3. Since $E(G) \ne E(G')$, at least one of the entries of $B$ is $s_1$-heavy.
We denote by $\mathcal{B}$ the class of symmetric matrices that satisfy (P1)–(P3). Then the first case will be $\mathcal{B}_1 = \{B \in \mathcal{B} \mid \psi_{s_1/(8m)}(B) \le m^{3/4}\}$ and the second case will be $\mathcal{B}_2 = \{B \in \mathcal{B} \mid \psi_{s_1/(8m)}(B) > m^{3/4}\}$.
We first prove that there exists a set of queries $S = \{(x_1, y_1), (x_2, y_2), \dots, (x_t, y_t)\}$ such that for every $B \in \mathcal{B}_1$ there is $i \in [t]$ such that
$$|x_i^T B\, y_i| \ge s_1/4.$$
there exists a set of vectors $Y = \{y_1, y_2, \dots, y_t\} \subset \{0,1\}^n$ such that for every $u \in U$ the size of $Y_u = \{i \mid |y_i^T u| > s_1/(16m)\}$ is greater than $t/4$.
Proof. By Lemma 6, for any $u \in U$ and a randomly chosen $y \in \{0,1\}^n$ we have
$$\Pr[|y^T u| \ge s_1/(16m)] \ge 1/2.$$
Therefore, if we randomly and independently choose $y_1, y_2, \dots, y_t \in \{0,1\}^n$ we have $\mathbb{E}[|Y_u|] \ge t/2$. By the Chernoff bound the probability that $|Y_u| \le t/4$ is
$$\Pr[|Y_u| \le t/4] < e^{-t/16}.$$
The probability that for all $u \in U$ we have $|Y_u| > t/4$ is
$$\Pr[\forall u \in U : |Y_u| \ge t/4] = 1 - \Pr[\exists u \in U : |Y_u| < t/4] \ge 1 - |U|\,e^{-t/16}.$$
Finally, note that
$$|U| < \binom{n}{m^{3/4}}\Big(\frac{16 s_2 m^2}{s_1}\Big)^{m^{3/4}} < e^{t/16},$$
and therefore $1 - |U|\,e^{-t/16} > 0$.
Lemma 9. Let $\mathcal{B}_2 = \{B \in \mathcal{B} \mid \psi_{s_1/(8m)}(B) > m^{3/4}\}$. There exists a set of queries $S = \{(x_1, y_1), (x_2, y_2), \dots, (x_t, y_t)\}$ of size
$$t = O\Big(\frac{m\log n + \log\frac{s_2}{s_1}}{\log m}\Big)$$
By Lemma 2,
$$\prod_{i=1}^{t} \iota\big(\psi_{s_1/(16m)}(B_U y_i)\big) \ge k^{(k-4)t/(4k-4)} \ge m^{c_1 t},$$
so that
$$\frac{c^t}{\prod_{i=1}^{t} \iota\big(\psi_{s_1/(16m)}(B y_i)\big)} \le \frac{c^t}{\prod_{i=1}^{t} \iota\big(\psi_{s_1/(16m)}(B_U y_i)\big)} \le \frac{c^t}{m^{c_1 t/2}} = m^{-\alpha t},$$
4 The Algorithm
In the previous section we showed that there exists a set of queries
$$S = \{(x_1, y_1), (x_2, y_2), \dots, (x_t, y_t)\}$$
such that for every $G^*, G' \in \mathcal{G}$ with $E(G^*) \ne E(G')$ we have $i \in [t]$ such that
$$|x_i^T A_{G^*} y_i - x_i^T A_{G'} y_i| > s_1/(8m).$$
Recall that $\mathcal{G}$ is the set of all graphs with $n$ vertices and $m$ edges, where the weights of the edges are from the set $[s_1, s_1/(8m^2), s_2]$.
Now, for reconstructing the edges of the graph we use the same algorithm as in [8]. The algorithm is presented in Figure 1.
The query complexity of the algorithm is obvious. As for the correctness, given a graph $G$, define $G' \in \mathcal{G}$ such that $G'$ is obtained from $G$ by rounding each edge weight to the closest multiple of $s_1/(8m^2)$. Obviously, since $A_G - A_{G'}$ has at most $m$ non-zero entries and each entry is bounded by $s_1/(16m^2)$, we have that for all $i \in [t]$
$$|x_i^T A_G\, y_i - x_i^T A_{G'}\, y_i| \le s_1 m/(16m^2) = s_1/(16m).$$
On the other hand, for any graph $G^* \in \mathcal{G}$ that differs from $G'$ in at least one edge, we have
$$|x_i^T A_{G'}\, y_i - x_i^T A_{G^*}\, y_i| = \big| x_i^T A_G\, y_i - x_i^T A_{G^*}\, y_i - (x_i^T A_G\, y_i - x_i^T A_{G'}\, y_i) \big|. \tag{4}$$
By (4) and (5), together with the fact that $|x_i^T A_G\, y_i - x_i^T A_{G'}\, y_i| \le s_1/(16m)$, we get
References
1. Aigner, M.: Combinatorial Search. John Wiley and Sons, Chichester (1988)
2. Alon, N., Asodi, V.: Learning a Hidden Subgraph. SIAM J. Discrete Math. 18(4),
697–712 (2005)
3. Alon, N., Beigel, R., Kasif, S., Rudich, S., Sudakov, B.: Learning a Hidden Match-
ing. SIAM J. Comput. 33(2), 487–501 (2004)
4. Angluin, D., Chen, J.: Learning a Hidden Graph Using O(log n) Queries per Edge.
In: Shawe-Taylor, J., Singer, Y. (eds.) COLT 2004. LNCS (LNAI), vol. 3120, pp.
210–223. Springer, Heidelberg (2004)
5. Angluin, D., Chen, J.: Learning a Hidden Hypergraph. Journal of Machine Learning
Research 7, 2215–2236 (2006)
6. Bouvel, M., Grebinski, V., Kucherov, G.: Combinatorial Search on Graphs Moti-
vated by Bioinformatics Applications: A Brief Survey. In: Kratsch, D. (ed.) WG
2005. LNCS, vol. 3787, pp. 16–27. Springer, Heidelberg (2005)
7. Bshouty, N.H.: Optimal Algorithms for the Coin Weighing Problem with a Spring
Scale. In: COLT (2009)
8. Choi, S., Kim, J.H.: Optimal Query Complexity Bounds for Finding Graphs. In:
STOC, pp. 749–758 (2008)
9. Du, D., Hwang, F.K.: Combinatorial group testing and its applications. Series on applied mathematics, vol. 3. World Scientific (1993)
10. Erdös, P.: On a lemma of Littlewood and Offord. Bulletin of the American Math-
ematical Society 51, 898–902 (1945)
11. Grebinski, V., Kucherov, G.: Optimal Reconstruction of Graphs Under the Addi-
tive Model. Algorithmica 28(1), 104–124 (2000)
12. Grebinski, V., Kucherov, G.: Reconstructing a Hamiltonian cycle by querying the graph: Application to DNA physical mapping. Discrete Applied Mathematics 88, 147–165 (1998)
13. Grebinski, V.: On the Power of Additive Combinatorial Search Model. In: Hsu,
W.-L., Kao, M.-Y. (eds.) COCOON 1998. LNCS, vol. 1449, pp. 194–203. Springer,
Heidelberg (1998)
14. Littlewood, J.E., Offord, A.C.: On the number of real roots of a random algebraic
equation. III. Mat. Sbornik 12, 277–285 (1943)
15. Mazzawi, H.: Optimally Reconstructing Weighted Graphs Using Queries
(manuscript)
16. Reyzin, L., Srivastava, N.: Learning and Verifying Graphs using Queries with a
Focus on Edge Counting. In: Hutter, M., Servedio, R.A., Takimoto, E. (eds.) ALT
2007. LNCS (LNAI), vol. 4754, pp. 285–297. Springer, Heidelberg (2007)
17. Sperner, E.: Ein Satz über Untermengen einer endlichen Menge. Math. Z. 27, 544–548 (1928)
6 Appendix
In this Appendix we prove Lemma 4. We first prove a few preliminary results.
Lemma 10. Let $X_1, X_2, \dots, X_t$ be random variables such that there is $s_i > s$ where $\Pr[X_i = s_i] = 1/2$ and $\Pr[X_i = 0] = 1/2$. Let $\lambda_1, \dots, \lambda_t \in \{-1, 1\}$. Then there is a constant $c$ such that
$$\max_r \Pr[r \le \lambda_1 X_1 + \lambda_2 X_2 + \dots + \lambda_t X_t < r + s] \le \frac{c}{\sqrt{t}}.$$
Proof. Consider the lattice $L = \prod_i \{0, s_i\}$ with the partial order $\prec = \prod_i \prec_i$, where $0 \prec_i s_i$ if $\lambda_i = 1$ and $s_i \prec_i 0$ if $\lambda_i = -1$. It is easy to see that the set of all solutions of $r \le X_1 + X_2 + \dots + X_t < r + s$ is an anti-chain in $L$. The result follows.
Lemma 11. Let $X_1, X_2, \dots, X_t$ be random variables such that there is $s_i > s$ where $\Pr[X_i = s_i] = p_i$ and $\Pr[X_i = 0] = 1 - p_i$. Let $\lambda_1, \dots, \lambda_t \in \{-1, 1\}$. Then there is a constant $c$ such that
$$\max_r \Pr[r \le \lambda_1 X_1 + \lambda_2 X_2 + \dots + \lambda_t X_t < r + s] \le \frac{c}{\sqrt{\sum_{i=1}^{t} \min(p_i, 1 - p_i)}}.$$
Proof. We will assume w.l.o.g. that $p_i < 1/2$ and therefore $\min(p_i, 1 - p_i) = p_i$; otherwise, we can replace $X_i$ with $Y_i = s_i - X_i$.
Let $Y_i$ be a random variable that is equal to 1 with probability $2p_i$ and 0 with probability $1 - 2p_i$. Let $Z_i$ be a random variable that is equal to $s_i$ with probability 1/2 and 0 with probability 1/2. It is easy to see that $X_i = Y_i Z_i$. Let $Y = Y_1 + \dots + Y_t$ and $P = \sum_{i=1}^{t} p_i$. Notice that $\mathbb{E}[Y] = 2P$. Then by Lemma 10 and the Chernoff bound we have
$$\Pr[r \le \lambda_1 X_1 + \dots + \lambda_t X_t < r + s] = \Pr[r \le \lambda_1 Y_1 Z_1 + \dots + \lambda_t Y_t Z_t < r + s] \le \Pr[r \le \lambda_1 Y_1 Z_1 + \dots + \lambda_t Y_t Z_t < r + s \mid Y \ge P] + \Pr[Y < P] \le \frac{c}{\sqrt{P}} + e^{-P/4}.$$
This completes the proof.
Lemma 12. Let $X_1$ and $X_2$ be random variables and $s$ be any fixed real number. Suppose $X_1$ takes values $x_1, x_2, \dots, x_\ell$ with probabilities $p_1, \dots, p_\ell$, respectively. Let $Y_1$ be a random variable that takes values $y, x_3, x_4, \dots, x_\ell$ with probabilities $p_1 + p_2, p_3, \dots, p_\ell$. Then for $y = x_1$ or $y = x_2$ we have
$$\max_r \Pr[r \le X_1 + X_2 < r + s] \le \max_r \Pr[r \le Y_1 + X_2 < r + s].$$
$$\sum_{x\in\{x_1, x_2\}} \Pr[X_1 = x]\,\Pr[r_0 - x \le X_2 < r_0 - x + s] = \Pr[r_0 \le Y_1 + X_2 < r_0 + s] \le \max_r \Pr[r \le Y_1 + X_2 < r + s].$$
We will call this transformation a merging of the two values x1 and x2 of the
random variable X1 into one value in Y1 . We prove the following property of the
merging transformation.
Lemma 13. Let $X$ be a random variable such that $\Pr[X > s] \ge p$ and also $\Pr[X < 0] \ge p$. Then we can merge the values of $X$ into two values $s_1$ and $s_2$ in a random variable $Y$ such that
1. $s_2 - s_1 \ge s/2$;
2. $\Pr[Y = s_1] \ge p$ and $\Pr[Y = s_2] \ge p$.
Proof. We first merge all the values that are greater than or equal to $s$, then all the values that are less than or equal to 0, and then those that are between 0 and $s$. We get a random variable $Z$ that takes 3 values $s'$, $s''$ and $s'''$, where $s' \le 0 < s'' < s \le s'''$, $\Pr[Z = s'] \ge p$ and $\Pr[Z = s'''] \ge p$. Now either $s'' - s' \ge s/2$ or $s''' - s'' \ge s/2$. If $s'' - s' \ge s/2$ then we merge $s''$ and $s'''$, and if $s''' - s'' \ge s/2$ we merge $s'$ and $s''$.
Now we prove our main result.
Lemma 14. Let $a_1, a_2, \dots, a_t$, $b_1, b_2, \dots, b_t$ and $s \ge 0$ be real numbers such that $b_i - a_i \ge s$ for all $i$, and let $X_1, \dots, X_t$ be independent random variables. Suppose that there is $p_i$ with $0 < p_i < 1$ such that $\Pr[X_i \le a_i] \ge p_i$ and $\Pr[X_i \ge b_i] \ge p_i$ for all $i$. Then there is a constant $c$ such that for any real number $r$ and integer $\rho \ge 1$ we have
$$\Pr\Big[r \le \sum_{i=1}^{t} X_i < r + \rho s\Big] \le \frac{c\rho}{\sqrt{\sum_{i=1}^{t} p_i}}.$$
Proof. By Lemma 13 we can merge the values of each $X_i$ into a new random variable $Y_i$ that takes two values $a_i$ and $b_i$ where $b_i - a_i \ge s/2$, $\Pr[Y_i = a_i] \ge p_i$, $\Pr[Y_i = b_i] \ge p_i$, and
$$\max_r \Pr\Big[r \le \sum_{i=1}^{t} X_i < r + \frac{s}{2}\Big] \le \max_r \Pr\Big[r \le \sum_{i=1}^{t} Y_i < r + \frac{s}{2}\Big].$$
We will assume without loss of generality that $a_i = 0$ for all $i$; otherwise consider the random variables $Y_i - a_i$. Now by Lemma 11 we get the result.
Learning Unknown Graphs
N. Cesa-Bianchi, C. Gentile, and F. Vitale
1 Introduction
Consider the advertising problem of targeting each member of a social network
(where ties between individuals indicate a certain degree of similarity in tastes
and interests) with the product he/she is most likely to buy. Unlike previous ap-
proaches to this problem —see, e.g., [20]— we consider the more interesting sce-
nario where the network and the preferences of network members for the products
in a given set are initially unknown, apart from those of a single “seed member”.
We assume there exists a mechanism to explore the social network by discovering
new members connected (i.e., with similar interests) to members that are already
known. This mechanism can be implemented in different ways, e.g., by providing
incentives or rewards to members with undiscovered connections. Alternatively,
if the network is hosted by a social network service (like FacebookTM ), the service
provider itself may release the needed pieces of information. Since each discovery
of a new member is presumably costly, the goal of the marketing strategy is to
minimize the number of new members not being offered their preferred product.
In this respect the task may then be formulated as the following sequential prob-
lem: At each step t find the member qt , among those whose preferred product we
already know, who is most likely to have undiscovered connections that have the
same preferred product as qt . Once this member qt is identified, we obtain (through
the above-mentioned mechanism) a connection it to whom we may advertise qt ’s
preferred product. In order to make the problem easier for the advertising agent,
we make the simplifying assumption that once a product is advertised to a mem-
ber the agent may observe the member’s true preference, and thus know whether
the decision made was optimal.
This social network advertising task can be naturally cast as a graph predic-
tion problem where an agent sequentially explores the vertices and edges of an
unknown graph with unknown labels (i.e., product preferences) assigned to its
vertices. The online exploration proceeds as follows: At each time step t, the
agent selects a known node $q_t$ having unexplored edges, receives a new vertex $i_t$ adjacent to $q_t$, and is required to output a prediction $\hat y_t$ for the (unknown) label $y_t$ associated with $i_t$. Then $y_t$ is revealed, and the algorithm incurs a loss $\ell(\hat y_t, y_t)$ measuring the discrepancy between prediction and true label. Thus, in some sense, the agent is learning to explore the graph along directions that, given past observations, look easier to predict. Our basic measure of performance is the agent's cumulative loss $\ell(\hat y_1, y_1) + \dots + \ell(\hat y_n, y_n)$ over a sequence of $n$ predictions.
In order to leverage the assumption that connected members tend to prefer the same products [20], we design agent strategies that perform well to the extent that the underlying graph labeling $y = (y_1, \dots, y_n)$ is regular. That is, the graph can be partitioned into a small number of weakly interconnected clusters (subgroups of network members) such that labels in each cluster are all roughly similar. In the case of binary labels and zero-one loss, a common measure of label regularity for an $n$-vertex graph $G$ with labels $y = (y_1, \dots, y_n) \in \{-1,+1\}^n$ is the cutsize $\Phi_G(y)$. This is the number of edges $(i, j)$ in $G$ whose endpoint vertices have disagreeing labels, $y_i \ne y_j$. The cumulative loss bound we prove in this paper holds for general (real-valued) labels and is expressed in terms of a measure of regularity that, in the special case of binary labels, is often significantly smaller than the cutsize $\Phi_G(y)$, and never larger than $2\Phi_G(y)$. Furthermore, unlike $\Phi_G(y)$, which may even be quadratic in the number of nodes, our measure of label regularity is never vacuous (i.e., it is never larger than $n$). In the paper
we also show that the algorithm achieving this bound is suitable for large scale
applications because of its small time and memory requirements.
—see [16, 19], and then run the standard (kernel) Perceptron algorithm for pre-
dicting the vertex labels. This approach guarantees that the number of mistakes
is bounded by a quantity that depends linearly on the cutsize ΦG (y). Further
results involving the prediction of node labels in graphs with known structure
include [2, 3, 6, 9, 11, 12, 13, 14, 15, 17].
Our exploration/prediction model also bears some similarities to the graph
exploration problem introduced in [8], where the measure of performance is the
overall number of edge traversals sufficient to ensure that each edge has been tra-
versed at least once. Unlike that approach, we do not charge any cost for visits
of the same node beyond the first visit. Moreover, in our setting depth-first
exploration performs badly on simple graphs with binary labels (see discus-
sion in Sect. 2), whereas depth-first traversal is optimal in the setting of [8]
for any undirected graph —see [1]. Finally, as we explain in Sect. 3, our ex-
ploration/prediction algorithm incrementally builds a spanning tree whose total
cost is equal to the algorithm’s cumulative loss. The problem of constructing a
minimum spanning tree online is also considered in [18], although only for graphs
with random edge costs.
Fig. 1. A binary labeled graph with three clusters where depthFirst can make $\Omega(|V|)$ mistakes. Edges are either arrow edges or grey edges. Arrow edges indicate predictions, and numbers on such edges denote the adversarial order of presentation. For instance, edge 3 (connecting a $-1$ node to a $+1$ node) says that depthFirst uses the $-1$ label associated with the start node (the current $q_t$ node) to predict the $+1$ label associated with the end node (the current $i_t$ node). As a matter of fact, in this example depthFirst could also predict $y_t$ through a majority vote of the labels of previously observed nodes that are adjacent to $i_t$. Dark grey nodes are the mistaken nodes (for simplicity, ties are mistakes in this figure). Notice that in the dotted area we could add as many (mistaken) nodes as we like, thus making the graph cutsize $\Phi_G(y)$ arbitrarily close to $|V|$. These nodes would still be mistaken even if the majority vote were restricted to previously mistaken (and adjacent) nodes. This is because depthFirst is forced to err on the left-most node of the right-most cluster.
merging degree. This notion arises naturally as a by-product of our analysis, and
can be considered a natural measure of cluster similarity of independent interest.
Fig. 2. Two copies of a graph with real labels $y_i$ associated with each vertex $i$. On the left, a shortest path connecting the two nodes enclosed in double circles is shown. The path length is $\max_k \ell(s_{k-1}, s_k)$, where $\ell(i, j) = |y_i - y_j|$. The thick black edge is incident to the nodes achieving the max in the path length expression. On the right, the vertices of the same graph have been clustered to form a regular partition. The diameter of a cluster $C$ (the maximum of the pairwise distances between nodes of $C$) is denoted by $d$. Similarly, $d'$ denotes the minimum of the pairwise distances $\ell(i, j)$, where $i \in C$ and $j \in V \setminus C$. Note that $d'$ is determined by one of the thick black edges connecting $C$ with the rest of the graph, while $d$ is determined by the two nodes incident to the thick gray edge. The partition is regular, hence $d < d'$ holds for each cluster.
$\ell(i, j) = \ell(y_i, y_j)$. The key to controlling this cost, however, is the specific rule the algorithm uses to select the next $q_t$ based on $G_{t-1}$. The approach we propose is simple. If there exists a regular partition of $G$ with few elements, then it does not really matter how the spanning tree is built within each element, since the cost of all these different trees will be small anyway. What matters the most is the cost of the edges of the spanning tree that join two distinct elements of the partition. In order to keep this cost small, our algorithm learns to select $q_t$ so as to avoid going back to the same region many times. This is based on the following notions.
Fix an arbitrary subset $C \subseteq V$. The inner border $\partial C$ of $C$ is the set of all nodes $i \in C$ that are adjacent to a node $j \notin C$ (the dark grey nodes in the picture at the side). The outer border $\bar\partial C$ of $C$ is the set of all the nodes $j \notin C$ that are adjacent to at least one node in the inner border of $C$ (the light grey nodes).
We are now ready to define the exploration/prediction rule of our algorithm. At each time $t$, cga selects and predicts the label of a node adjacent to the node in the inner border of $V_{t-1}$ which is closest to the previously predicted node $i_{t-1}$; formally, this is stated as the selection rule (2).
Fig. 3. The behavior of cga displayed on the binary labeled graph of Fig. 1. The length of a path $s_1, \dots, s_d$ is measured by $\max_k \ell(s_{k-1}, s_k)$ and the loss is the zero-one loss. The pictorial conventions are as in Fig. 1. As in that figure, the cutsize $\Phi_G(y)$ of this graph can be made as close to $|V|$ as we like, still cga makes 4 mistakes. For the sake of comparison, recall that the various versions of depthFirst can be forced to err $\Phi_G(y)$ times on this graph.
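The displayed form of the selection rule (2) did not survive extraction, but under the max-type path length of Fig. 3 the verbal rule above amounts to a bottleneck (minimax) Dijkstra search; the following sketch (all identifiers ours, an assumption-laden illustration rather than the paper's pseudocode) returns the explored node with unexplored edges that is closest to $i_{t-1}$:

import heapq

def select_next(adj, labels, explored, unexplored_deg, i_prev, loss):
    # length of a path s_0,...,s_d is max_k loss(labels[s_{k-1}], labels[s_k]);
    # Dijkstra remains correct for this monotone path length
    dist = {i_prev: 0.0}
    heap = [(0.0, i_prev)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                       # stale heap entry
        if unexplored_deg[u] > 0:
            return u                       # first such pop is the closest
        for v in adj[u]:
            if v in explored:
                nd = max(d, loss(labels[u], labels[v]))
                if nd < dist.get(v, float("inf")):
                    dist[v] = nd
                    heapq.heappush(heap, (nd, v))
    return None                            # every explored node is exhausted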
4 Analysis
This section contains the analysis of cga’s predictive performance. The compu-
tational complexity analysis is contained in Sect. 5. For the sake of presentation,
we single out the binary classification case since it is an important special case
of our setting.
Fix an undirected and connected graph G = (V, E). The following lemma is
a key property of our algorithm.
Lemma 1. Assume cga is run on a graph $G$ with labeling $y \in Y^n$, and pick any time step $t > 0$. Let $\mathcal P$ be a regular partition and assume $i_{t-1} \in C$, where $C$ is any cluster in $\mathcal P$. Then $C$ is exhausted at time $t-1$ if and only if $q_t \notin C$.
Proof. First, assume $C$ is exhausted at time $t-1$, i.e., $C \cup \bar\partial C \subseteq V_{t-1}$. Then all nodes in $C$ have been visited, and no node in $C$ has unexplored edges. This implies $C \cap \partial V_{t-1} = \emptyset$ and that the selection rule (2) makes the algorithm pick $q_t$ outside of $C$. Assume now $q_t \notin C$. Since each cluster is a connected subgraph, if the labels are binary the prediction rule ensures that cluster $C$ is exhausted. In the general case (when labels are not binary) we can prove by contradiction that $C$ is exhausted by analyzing the following two cases:
1. There exists $j \in C \setminus V_{t-1}$. Since the subgraph in cluster $C$ is connected, there is a path in $C$ connecting $i_{t-1}$ to $j$ such that at least one node $q \in C$ on this path: (a) has unexplored edges, (b) belongs to $V_{t-1}$ (i.e., $q \in \partial V_{t-1}$), and (c) is connected to $i_{t-1}$ by a path all contained in $C \cap V_{t-1}$. Since the partition is regular, $q$ is closer to $i_{t-1}$ than to any node outside of $C$. Hence, by construction (see (2)), the algorithm would choose this $q$ instead of $q_t$ (due to (c) above), thereby leading to a contradiction.
2. There exists $j \in \bar\partial C \setminus V_{t-1}$. Again, since the subgraph in cluster $C$ is connected, there is a path in $C$ connecting $i_{t-1}$ to a node in $\partial C$ adjacent to $j$. Then we fall back into the previous case, since at least one node $q$ on this path: (a) has unexplored edges, (b) belongs to $V_{t-1}$, and (c) is connected to $i_{t-1}$ by a path all contained in $C \cap V_{t-1}$.
We begin by analyzing the special case of binary labels and zero-one loss.
the previous lossy step $t'$ is incoming, the fact that $t$ is irregular implies that $C_i$ must be exhausted between time $t'$ and time $t$, which in turn implies that $I_i = 1$, since $t$ must be the very last lossy step involving cluster $C_i$. Hence
$$m = \sum_{i\in\mathcal P} \big|M^{reg}_{i\to} \cup M^{irr}_{i\to}\big| \le \sum_{i\in\mathcal P} \big(|M_{\to i}| - I_i\big) + \sum_{i\in\mathcal P} \big|M^{irr}_{i\to}\big| \le \sum_{i\in\mathcal P} |M_{\to i}|. \tag{3}$$
i∈P i∈P i∈P
Next, for each i ∈ P we define two further injective mappings that associate with
each incoming lossy step ∗ → i a vertex in the inner border
of Ci and
a vertex in
the outer border of Ci . This shows that |M→i | ≤ min |∂Ci |, |∂Ci | = δ(Ci ) for
each i ∈ P. Together with (3) this would complete the proof (see again Fig. 4
for a pictorial explanation).
Fig. 4. Sequence (starting from the left) of incoming and regular outgoing lossy steps
involving a given cluster Ci . We only show the border nodes contributing to lossy steps.
We map injectively each regular outgoing lossy step t to the previous (incoming) lossy
step μi (t). We also map injectively each incoming lossy step s to the node ν1 (s) in the
inner border, whose label was predicted at time s. Finally, we map injectively s also to
the node ν2 (s) in the outer border that caused the previous (outgoing) lossy step for
the same cluster.
any algorithm working within our learning protocol to make Ω(|P|) mistakes.
This simple observation can be strengthened so as to match the upper bound in
Theorem 1.
Theorem 2. For all undirected and connected graphs G with n nodes and de-
gree bounded by a constant, for all K < n, and for any (randomized) explo-
ration/prediction strategy, there exists a labeling y of G’s vertices such that the
strategy makes at least K/2 mistakes (in expectation) with respect to the algo-
rithm’s internal randomization, while δ(P) = O(K).
The above lower bound, whose proof is omitted due to space limitations, can
actually be shown to hold even in cases when G does not have bounded degree
nodes, like cliques or general trees.
We now turn to the general case of nonbinary labels. The following definitions are useful for expressing the cumulative loss bound of our algorithm. Let $\mathcal P$ be a regular partition of the vertex set $V$ and fix a cluster $C \in \mathcal P$. We say that edge $(q_t, i_t)$ causes an inter-cluster loss at time $t$ if one of the two nodes of this edge lies in $\partial C$ and the other lies in $\bar\partial C$. Edge $(q_t, i_t)$ causes an intra-cluster loss when both $q_t$ and $i_t$ are in $C$. We denote by $\ell(C)$ the largest inter-cluster loss in $C$, i.e.,
$$\ell(C) = \max_{i \in \partial C,\ j \notin C,\ (i,j)\in E} \ell(y_i, y_j).$$
Also, $\ell^{\max}_{\mathcal P}$ is the maximum inter-cluster loss in the whole graph $G$, i.e., $\ell^{\max}_{\mathcal P} = \max_{C\in\mathcal P} \ell(C)$. We also set for brevity $\bar\ell_{\mathcal P} = |\mathcal P|^{-1} \sum_{C\in\mathcal P} \ell(C)$. Finally, we define $\varepsilon(C) = \max_{T_C} \sum_{(i,j)\in E(T_C)} \ell(y_i, y_j)$, where the max is over all spanning trees $T_C$ of $C$ and $E(T_C)$ is the edge set of $T_C$. Note that $\varepsilon(C)$ bounds from above2 the total loss incurred in all steps $t$ where $q_t$ and $i_t$ both belong to $C$.
In the above definition, $\ell(C)$ is a measure of connectivity of $C$ to the remaining clusters, $\varepsilon(C)$ is a measure of "internal cohesion" of $C$, while $\ell^{\max}_{\mathcal P}$ and $\bar\ell_{\mathcal P}$ give global distance measures among the clusters within $\mathcal P$.
The following theorem shows that cga's cumulative loss can be bounded in terms of the regular partition $\mathcal P$ that best trades off total intra-cluster loss (expressed by $\varepsilon(C)$) against total inter-cluster loss (expressed by $\delta(C)$ times the largest inter-cluster loss $\ell(C)$). It is important to stress that cga never explicitly computes this optimal partition: it is the selection rule for $q_t$ in (2) that guarantees this optimal behavior. The bound reads
$$\sum_{t=1}^{n} \ell(\hat y_t, y_t) \le \min_{\mathcal P}\Big( |\mathcal P|\big(\ell^{\max}_{\mathcal P} - \bar\ell_{\mathcal P}\big) + \sum_{C\in\mathcal P}\big(\varepsilon(C) + \ell(C)\,\delta(C)\big)\Big), \tag{4}$$
and, in the binary case,
$$\sum_{t=1}^{n} \ell(\hat y_t, y_t) \le \min_{\mathcal P} \sum_{C\in\mathcal P}\big(\varepsilon(C) + \delta(C)\big). \tag{5}$$
This shows that in the binary case the total number of mistakes can also be
bounded by the maximum number of edges connecting different clusters that
can be part of a spanning tree for G. In the binary case (5) achieves its min-
imum either on the trivial partition P = {V } or on the partition made up of
the smallest number of clusters C, each one including only nodes with the same
label (as in Theorem 1). In most cases, the nontrivial regular partition is the
minimizer of (5), so that the intra-cluster term ε(C) disappears. Then the bound
only includes the sum of merging degrees (w.r.t. that partition), thereby recov-
ering the bound in Theorem 1. However, in certain degenerate cases, the trivial
partition P = {V } turns out to be the best one. In such a case, the right-hand
side of (5) becomes ε(V ) which, in turn, is bounded by ΦG (y).
The proof of Theorem 3 is similar to the one for the binary case, hence we only emphasize the main differences. Let $\mathcal P$ be a regular partition of $V$. Clearly, no matter how each $C \in \mathcal P$ is explored, the contribution to the total loss of $\ell(q_t, i_t)$ for $q_t, i_t \in C$ is bounded by $\varepsilon(C)$. The remaining losses contributed by any cluster $C$ are of two kinds only: losses on incoming steps, where the node $i_t$ belongs to the inner border of $C$, and losses on outgoing steps, where $i_t$ belongs to the outer border of $C$. As in the binary case, with each such step we can thus associate a node in the inner and the outer border of $C$, since incoming and outgoing steps alternate for each cluster. The exception is when a cluster is exhausted, which, at first glance, seems to require adding an extra term as big as $\ell^{\max}_{\mathcal P}$ times the size $|\mathcal P|$ of the partition (this term could have a significant impact for certain graphs). However, as explained in the proof below, $\ell^{\max}_{\mathcal P}$ can be replaced by the potentially much smaller term $\ell^{\max}_{\mathcal P} - \bar\ell_{\mathcal P}$. In fact, in certain cases this extra term disappears, and the final bound we obtain is just (5).
Proof of Theorem 3. Fix an arbitrary regular partition $\mathcal P$ of $V$ and index by $1, \dots, |\mathcal P|$ the clusters in it. We abuse the notation and use $\mathcal P$ also to denote the set of cluster indices. We crudely upper bound the total loss incurred during intra-cluster lossy steps by $\sum_{C\in\mathcal P} \varepsilon(C)$. Hence, in the rest of the proof we focus on bounding the total loss incurred during inter-cluster lossy steps only. We say that step $t$ is a lossy step if $\ell(q_t, i_t) > 0$, and we distinguish between intra-cluster lossy steps (when $q_t$ and $i_t$ belong to the same cluster) and inter-cluster lossy steps (when $q_t$ and $i_t$ belong to different clusters). We define incoming and outgoing (regular and irregular) inter-cluster lossy steps for a given cluster $C_i$ (and the relative sets $M_{\to i}$, $M^{reg}_{i\to}$ and $M^{irr}_{i\to}$) as in the binary case proof, as well as the injective mapping $\mu_i$. In the binary case we bounded $|M^{reg}_{i\to}|$ by $|M_{\to i}| - I_i$. In a similar fashion, we now bound $\sum_{t\in M^{reg}_{i\to}} \ell_t$ by $\ell(C_i)\big(|M_{\to i}| - I_i\big)$, where we set for brevity $\ell_t = \ell(q_t, i_t)$. We can write
$$\sum_{i\in\mathcal P}\ \sum_{t\in M^{reg}_{i\to} \cup M^{irr}_{i\to}} \ell_t \le \sum_{i\in\mathcal P} \ell(C_i)\big(|M_{\to i}| - I_i\big) + \ell^{\max}_{\mathcal P}\sum_{i\in\mathcal P} \big|M^{irr}_{i\to}\big| \le \sum_{i\in\mathcal P} \ell(C_i)|M_{\to i}| + \sum_{j\in\mathcal P:\,I_j=1}\big(\ell^{\max}_{\mathcal P} - \ell(C_j)\big) \le \sum_{i\in\mathcal P} \ell(C_i)|M_{\to i}| + |\mathcal P|\big(\ell^{\max}_{\mathcal P} - \bar\ell_{\mathcal P}\big),$$
where the second inequality follows from $\sum_{i\in\mathcal P} I_i \ge \sum_{i\in\mathcal P} |M^{irr}_{i\to}|$ (as for the regular partition considered in the binary case). The proof is concluded after defining the two injective mappings $\nu_1$ and $\nu_2$ as in the binary case, and bounding again $|M_{\to i}|$ through $\delta(C_i)$.
5 Computational Complexity
In this section we briefly describe an efficient implementation of cga, and discuss
some improvements for the special case of binary labels. This implementation
shows that cga is especially useful when dealing with large scale applications.
Recall that the path length assignment $\lambda$ is a parameter of the algorithm and satisfies (1). In order to develop a consistent argument about cga's time and space requirements, we need to make assumptions on the time it takes to compute this function. If we are given the distance between any pair of nodes $i$ and $j$, and the loss $\ell(j, j')$ for any $j'$ adjacent to $j$, we assume to be able to compute in constant time the length of the shortest path $i, \dots, j, j'$. This assumption is easily seen to hold for many natural path length assignments $\lambda$ over graphs, for instance $\lambda(s_1, \dots, s_d) = \max_k \ell(s_{k-1}, s_k)$ and $\lambda(s_1, \dots, s_d) = \sum_k \ell(s_{k-1}, s_k)$; note that both fulfill (1).
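For concreteness, the constant-time extension reads as follows for these two assignments (function names ours, for illustration):

def extend_max(dist_i_j, loss_j_j2):
    # lambda(s_1,...,s_d) = max_k loss(s_{k-1}, s_k): one comparison per hop
    return max(dist_i_j, loss_j_j2)

def extend_sum(dist_i_j, loss_j_j2):
    # lambda(s_1,...,s_d) = sum_k loss(s_{k-1}, s_k): one addition per hop
    return dist_i_j + loss_j_j2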
Because of the above assumption on the path length $\lambda$, in the general case of real labels cga can be implemented using the well-known Dijkstra's algorithm for single-source shortest paths (see, e.g., [7, Ch. 21]). After all nodes in $V_{t-1}$ and all edges incident to $i_t$ have been revealed, cga computes the distance between $i_t$ and any other node in $V_{t-1}$ by invoking Dijkstra's algorithm on the sub-graph $G_t$, so that cga can easily find node $q_{t+1}$. If Dijkstra's algorithm is implemented with Fibonacci heaps [7, Ch. 25], the total time required for predicting all $|V|$ labels is3 $O(|V||E| + |V|^2 \log|V|)$. On the other hand, the space complexity is always linear in the size of $G$.
We now sketch the binary case. The additional assumption $\lambda(s_1, \dots, s_d) = \max_k \ell(s_{k-1}, s_k)$ allows us to exploit the simple structure of regular partitions.
3. In practice, the actual running time is often far less than $O(|V||E| + |V|^2 \log|V|)$, since at each time step $t$ Dijkstra's algorithm can be stopped as soon as the node of $\partial V_{t-1}$ nearest to $i_t$ in $G_t$ has been found.
Coarsely speaking, we maintain information about the current inner border and
clusters, and organize this information in a balanced tree, connecting the nodes
lying in the same cluster through specially designed lists.
In order to describe this implementation, it is important to observe that, since
the graph is revealed incrementally, it might be the case that a single cluster C
in G at time t happens to be split into several disconnected parts in Gt . We
call sub-cluster each maximal set of nodes that are part of the same uniformly
labeled and connected subgraph of Gt . The main data structures we use (further
details are omitted due to space limitations) for organizing the nodes observed
so far by the algorithm combine the following:
– A self-balancing binary search tree T containing the labeled nodes in Vt . We
will refer to nodes in Vt and to nodes in T interchangeably.
– Given a sub-cluster C, all nodes in C ∩ ∂V t are connected via a special
list called border sub-cluster list. The remaining nodes in C are connected
through a list called internal sub-cluster list.
– All nodes in each sub-cluster C ⊆ Vt are linked to a special time-varying
set called sub-cluster record. This record enables access to the first and last
element of both the border and the internal sub-cluster list of C. The sub-
cluster record also contains the size of C.
The above data structures are intended to support the following main operations,
which are executed in the following order at each time step t, just after the
algorithm has selected qt : (1) insertion of it ; when it is chosen by the adversary
cga also receives the list N (it ) of all nodes in Vt−1 adjacent to it ; (2) merging of
subclusters required after the disclosure of yt ; (3) update of border and internal
sub-cluster lists (since some nodes in ∂V t−1 are not in ∂V t ); (4) choice of qt+1 .
The merging operation can be implemented as union-by-rank in standard union-find data structures (e.g., [7, Ch. 22]). The overall running time for $|V|$ nodes is smaller than $O(|V| \log|V|)$. In fact, the dominating cost in the time complexity is the cost of reaching, at each time $t$, the nodes of $V_{t-1}$ adjacent to $i_t$. Each of these $i_t$'s neighbors can be bijectively associated with an edge of $E$, the height of the tree $T$ being at most logarithmic in $|V|$. Hence the overall running time for predicting $|V|$ labels is $O(|E| \log|V| + |V| \log|V|) = O(|E| \log|V|)$, which is the best one can hope for (an obvious lower bound is $|E|$) up to a logarithmic factor.
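As a hedged illustration of this merging step, a generic union-find with union by rank (cf. [7, Ch. 22]) is sketched below; this is not the paper's exact data layout, only the standard structure such a merge can be built on:

parent, rank_ = {}, {}

def make_set(v):
    parent[v] = v
    rank_[v] = 0

def find(v):
    while parent[v] != v:
        parent[v] = parent[parent[v]]      # path halving
        v = parent[v]
    return v

def union(u, v):
    ru, rv = find(u), find(v)
    if ru == rv:
        return
    if rank_[ru] < rank_[rv]:
        ru, rv = rv, ru
    parent[rv] = ru                        # attach lower-rank root below
    if rank_[ru] == rank_[rv]:
        rank_[ru] += 1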
As for space complexity, it is important to stress that at every step $t$ the algorithm first stores and then "throws away" the received node list $N(i_t)$ (in the worst case, the length of $N(i_t)$ is linear in $|V|$). The space complexity is therefore $O(|V|)$. This optimal use of space is one of the most important practical strengths of cga, since the algorithm never needs to store the whole graph seen so far.
Acknowledgments. We would like to thank the ALT 2009 reviewers for their
comments which greatly improved the presentation of this paper. This work
was supported in part by the PASCAL2 Network of Excellence under EC grant
216886. This publication only reflects the authors’ views.
References
[1] Albers, S., Henzinger, M.: Exploring unknown environments. SIAM Journal on
Computing 29(4), 1164–1188 (2000)
[2] Blum, A., Chawla, S.: Learning from labeled and unlabeled data using graph
mincuts. In: Proc. 18th ICML. Morgan Kaufmann, San Francisco (2001)
[3] Blum, A., Lafferty, J., Rwebangira, M., Reddy, R.: Semi-supervised learning using
randomized mincuts. In: Proc. 21st ICML. ACM Press, New York (2004)
[4] Bryant, D., Berry, V.: A Structured family of clustering and tree construction
methods. Advances in Applied Mathematics 27, 705–732 (2001)
[5] Balcan, N., Blum, A., Vempala, S.: A discriminative framework for clustering via
similarity functions. In: Proc. 40th STOC. ACM Press, New York (2008)
[6] Cesa-Bianchi, N., Gentile, C., Vitale, F.: Fast and optimal prediction of a labeled
tree. In: Proc. 22nd COLT. Omnipress (2009)
[7] Cormen, T.H., Leiserson, C.E., Rivest, R.L.: Introduction to Algorithms. MIT
Press, Cambridge (1990)
[8] Deng, X., Papadimitriou, C.H.: Exploring an unknown graph. In: Proc. 31st
FOCS, pp. 355–361. IEEE Press, Los Alamitos (1990)
[9] Hanneke, S.: An analysis of graph cut size for transductive learning. In: Proc. 23rd
ICML, pp. 393–399. ACM Press, New York (2006)
[10] Herbster, M., Pontil, M.: Prediction on a graph with the Perceptron. In: NIPS,
vol. 19, pp. 577–584. MIT Press, Cambridge (2007)
[11] Herbster, M.: Exploiting cluster-structure to predict the labeling of a graph. In:
Freund, Y., Györfi, L., Turán, G., Zeugmann, T. (eds.) ALT 2008. LNCS (LNAI),
vol. 5254, pp. 54–69. Springer, Heidelberg (2008)
[12] Herbster, M., Lever, G., Pontil, M.: Online prediction on large diameter graphs.
In: NIPS, vol. 22. MIT Press, Cambridge (2009)
[13] Herbster, M., Pontil, M., Rojas-Galeano, S.: Fast prediction on a tree. In: NIPS,
vol. 22. MIT Press, Cambridge (2009)
[14] Herbster, M., Lever, G.: Predicting the labelling of a graph via minimum p-
seminorm interpolation. In: Proc. 22nd COLT. Omnipress (2009)
[15] Joachims, T.: Transductive Learning via Spectral Graph Partitioning. In: Proc.
20th ICML, pp. 305–312. AAAI Press, Menlo Park (2003)
[16] Kondor, I., Lafferty, J.: Diffusion kernels on graphs and other discrete input spaces.
In: Proc. 19th ICML, pp. 315–322. Morgan Kaufmann, San Francisco (2002)
[17] Pelckmans, J., Shawe-Taylor, J., Suykens, J., De Moor, B.: Margin based trans-
ductive graph cuts using linear programming. In: Proc. 11th AISTAT. JMLR
Proceedings Series, pp. 360–367 (2007)
[18] Remy, J., Souza, A., Steger, A.: On an online spanning tree problem in randomly
weighted graphs. Combinatorics, Probability and Computing 16, 127–144 (2007)
[19] Smola, A., Kondor, I.: Kernels and regularization on graphs. In: Schölkopf, B.,
Warmuth, M.K. (eds.) COLT/Kernel 2003. LNCS (LNAI), vol. 2777, pp. 144–
158. Springer, Heidelberg (2003)
[20] Yang, W.S., Dia, J.B.: Discovering cohesive subgroups from social networks for
targeted advertising. Expert Systems with Applications 34, 2029–2038 (2008)
Completing Networks Using Observed Data
T. Akutsu, T. Tamura, and K. Horimoto
1 Introduction
Inference of biological networks, which include genetic networks, protein-protein
interaction networks and signaling networks, is an important topic in bioinfor-
matics and computational systems biology. For inference of genetic networks,
extensive studies have been done in this decade. The objective of this problem is, given a series of gene expression profiles (a series of states of all genes under various environments and/or time steps), to infer a function along with input genes that regulates each gene, where a set of functions constitutes a genetic network. In inference of genetic networks, it is assumed that the states of all genes are observable under each environment and/or each time step, though some noise exists. This assumption is reasonable because we can observe
expression levels of all genes (or almost all genes) by using such technologies as
DNA microarray and DNA chip.
However, this assumption is not reasonable when we want to infer signaling
networks (i.e., signaling pathways). In this case, we need to observe activity levels
or quantities of proteins. Unfortunately, it is quite difficult to observe such kinds of data, especially in living organisms. Reporter proteins (or reporter genes) are
usually employed, each of which is associated with one or some kinds of proteins
[16]. However, both designing reporter proteins and introducing reporter proteins
to cells are hard tasks. In particular, introducing multiple types of reporter
This work is partially supported by the Cell Array Project from NEDO, Japan and
by a Grant-in-Aid ‘Systems Genomics’ from MEXT, Japan.
proteins is quite hard. Therefore, we can only assume that the activity levels
of one or a few kinds of proteins under various environments are observed in
analysis of signaling networks. While it is almost impossible to infer the network
only from such little information, we can utilize knowledge in the literature and databases. Thus, it is reasonable to assume that we have a preliminary network model of the target signaling pathway where some parts are unclear or invalid. Using observed data on the activity levels of a single or a few types of proteins under various environments, it may be possible to modify a preliminary network
model so that it is consistent with the observed data. Among many ways to
modify the network model, it is reasonable to make the minimum modification,
by following the principle of Occam’s razor. This motivates us to study network
completion problems.
In this paper, we assume a Boolean network model [12] as a model of biological
networks because it is a fundamental model, a lot of theoretical and practical
studies have been done [12], and it has also been applied to analysis of signaling
networks [10]. We assume that the network topology is given (i.e., a set of input
nodes to each node is known) and Boolean functions are already assigned to
a subset of nodes. We also assume that a set of nodes is divided into external
nodes, internal nodes and output nodes, where only the activity levels of exter-
nal and output nodes can be observed. Output nodes correspond to proteins
whose activity levels are observed by reporter proteins, where we mainly con-
sider the case that there exists only one output node because it is very difficult
to introduce multiple reporter proteins. External nodes correspond to proteins
whose activity levels are controlled by stimuli given from outside of the cell
(e.g., environment), where these nodes can also be regarded as input nodes to
the network. Furthermore, we assume that the network is acyclic because the
state of the output node may not be determined uniquely if there exist cycles.
Therefore, we can assume that the state of the output node is determined (through internal nodes) from the states of the external nodes. Then, a basic version of the network completion problem is to determine Boolean functions for unassigned nodes so that the resulting network is consistent with a given set of examples (i.e., a series of external and output states). We also consider variants of the problem in which Boolean functions are assigned to all nodes but only the minimum number of modifications (e.g., modifications of Boolean functions, deletions of edges) is allowed. We show that these problems are NP-complete and the basic version
remains NP-complete even for tree structured networks. On the other hand, we
show that these problems can be solved in polynomial time for partial k-trees of
bounded (constant) indegree if a logarithmic number of examples are given.
There exist several related studies. As mentioned before, a lot of studies have
been done on inference of genetic networks from gene expression profiles. How-
ever, most of such studies are based on statistical or heuristic approaches, and only a few studies have been done from the viewpoint of computational and/or sample complexity. Akutsu et al. proposed a strategy to identify a genetic network un-
der a Boolean model using disruption and overexpression of multiple genes [2].
They analyzed combinatorial and computational complexities and showed that it
2 Problem Definitions
A Boolean network G(V, F) consists of a set of nodes V = {v_1, . . . , v_n} and a set of Boolean functions F = {f_1, . . . , f_n}, where f_i is assigned to node v_i and IN(v_i) denotes the set of input nodes of v_i; the corresponding directed graph is denoted G(V, E). We assume that G(V, E) is acyclic. We also assume that there exists only one output node,
where some of the results can be extended for multiple output nodes. We assume
w.l.o.g. that v_1, . . . , v_h are external nodes (whose indegrees are 0) and v_n is the output node. Each node takes either 0 or 1 and the state of node v_i is denoted by v̂_i. For an internal or output node v_i, v̂_i is determined by v̂_i = f_i(v̂_{i_1}, . . . , v̂_{i_{l_i}}).
We have assumed so far that all f_i are known. However, f_i may not be known for some nodes v_i even though IN(v_i) is known. Such a node is called an incomplete node. A Boolean network is called incomplete if it contains an incomplete node; otherwise it is called complete.
An (h + 1)-dimensional 0-1 vector e is called an example, where the first h entries correspond to the external nodes and the last entry corresponds to the output node. An example e is called positive if e_{h+1} = 1, otherwise it is called negative. A complete Boolean network G(V, F) is consistent with e if v̂_n = e_{h+1} holds under the condition that v̂_i = e_i holds for i = 1, . . . , h. We define a basic
version of the network completion problem as follows (see also Fig. 1).
Definition 1. BNCMPL-1
Instance: An incomplete Boolean network G(V, F) and a set of examples {e^1, . . . , e^m}.
Question: Is there an assignment of Boolean functions f_i to the incomplete nodes so that the resulting network G(V, F) is consistent with all examples?
An assignment satisfying the above condition is called a completion. In the above,
a set of nodes to which Boolean functions are assigned is specified. However,
existing knowledge about the target network may contain mistakes. In such a
case, it might be useful to modify Boolean functions for the minimum number
of nodes while keeping the network structure. Therefore, we define a variant of
the network completion problem as follows.
Definition 2. BNCMPL-2
Instance: A complete Boolean network G(V, F), a set of examples {e^1, . . . , e^m}, and a positive integer L.
Question: Is there an assignment of Boolean functions f_i to at most L nodes so that the resulting network G(V, F) is consistent with all examples?
In this definition, we allow the algorithm to override complete nodes (i.e., other Boolean functions can be assigned to nodes for which Boolean functions are already assigned). As a variant of BNCMPL-2, we can consider the problem of minimizing the number L of nodes to which another Boolean function should be assigned. This variant can be solved by solving BNCMPL-2 for L = 0 to n. Note that deletion of an edge can be regarded as a modification of a Boolean function and thus can be handled in BNCMPL-2, because we allow some nodes in IN(v_i) to be non-relevant.¹
In this paper, we assume in most cases that the maximum indegree is bounded
by a constant D. This assumption is reasonable because it is quite hard in general
to learn Boolean functions with many inputs, and O(2^n) bits are required to
represent a Boolean function if an arbitrary Boolean function is allowed.
¹ All the results in this paper are valid even if all nodes in IN(v_i) must be relevant.
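To make Definitions 1 and 2 concrete, here is a minimal brute-force sketch of BNCMPL-1 (ours — the paper gives no implementation, and names such as complete_bn are hypothetical). It is feasible only because the bounded indegree D keeps the 2^{2^D} functions per incomplete node small:

from itertools import product

def evaluate(order, inputs, funcs, external_vals):
    # Evaluate an acyclic Boolean network given in topological order.
    # funcs[v] maps a tuple of input bits to 0/1; external nodes have no entry.
    state = dict(external_vals)
    for v in order:
        if v not in state:                    # internal or output node
            state[v] = funcs[v][tuple(state[u] for u in inputs[v])]
    return state

def complete_bn(order, inputs, funcs, incomplete, examples, output):
    # Brute-force BNCMPL-1: try every assignment of Boolean functions
    # to the incomplete nodes and test consistency with all examples.
    def all_funcs(d):                         # all 2^(2^d) functions on d inputs
        rows = list(product([0, 1], repeat=d))
        for bits in product([0, 1], repeat=len(rows)):
            yield dict(zip(rows, bits))
    choices = [list(all_funcs(len(inputs[v]))) for v in incomplete]
    for combo in product(*choices):
        trial = dict(funcs)
        trial.update(zip(incomplete, combo))
        if all(evaluate(order, inputs, trial, ext)[output] == out
               for ext, out in examples):
            return trial                      # a completion
    return None                               # no consistent completion

The enumeration over incomplete nodes is exponential in their number; by the hardness results of the next section, this is unavoidable in general.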
3 Hardness Results
First, we show that BNCMPL-1 is NP-complete even if only one positive ex-
ample is given.
Proposition 1. BNCMPL-1 is NP-complete even if one positive example is
given and D = 2.
Proof. Since it is obvious that BNCMPL-1 is in NP, we show that it is NP-hard
by means of a polynomial time reduction from 3-SAT [9] (see also Fig. 1).
Let c_1, . . . , c_M be clauses over Boolean variables x_1, . . . , x_N. Then, 3-SAT is the problem of deciding whether there is an assignment of 0-1 values to x_1, . . . , x_N that satisfies all the clauses (i.e., the values of all clauses are 1).
We construct an incomplete network G(V, F) as follows², where we first assume that nodes with large indegree are allowed. Let V = {v_1, . . . , v_{2N+M+1}} and let {v_1, . . . , v_N} be the set of external nodes. Let clause c_i be written as c_i = g_i(x_{i_1}, x_{i_2}, x_{i_3}). To each node v_{2N+i} (i = 1, . . . , M), we assign the Boolean function v̂_{2N+i} = g_i(v̂_{N+i_1}, v̂_{N+i_2}, v̂_{N+i_3}). To the output node v_{2N+M+1}, we assign the Boolean function defined by v̂_{2N+M+1} = v̂_{2N+1} ∧ v̂_{2N+2} ∧ · · · ∧ v̂_{2N+M}. For each i = 1, . . . , N, let v_{N+i} be an incomplete node such that IN(v_{N+i}) = {v_i}; therefore, either v_{N+i} = v_i or v_{N+i} = ¬v_i is assigned to v_{N+i}.³ Finally, we let e = (1, 1, 1, . . . , 1). Then, it is straightforward to see that there exists a
completed network G(V, F ) if and only if there exists a satisfying assignment
for c1 , . . . , cM . It is also seen that the reduction can be done in polynomial
time.
[Fig. 1. Example of the constructed network: external nodes v_1, . . . , v_4 at the bottom, incomplete nodes v_5, . . . , v_8 above them, and clause/output nodes v_9, v_10, v_11 at the top.]
² The construction can be simplified if we use internal nodes with degree 0.
³ Since we allow non-relevant input nodes, it is also possible that v_{N+i} = 0 or v_{N+i} = 1 is assigned. All the results in this paper are valid even if such an assignment is taken into account.
[Fig. 2. Reduction from a 3-SAT instance to BNCMPL-1 for trees. For each variable x_i, a subnetwork shown in this figure is constructed, with external nodes y_i, z_i, w_i, an incomplete XOR node p_i, an AND node q_i, and an OR node r_i above them.]
In the above, we assumed that negation nodes can be used. The following theo-
rem states that BNCMPL-1 remains NP-complete even if only AND/OR nodes
are allowed, where the use of 3-Coloring was inspired by [15].
[Fig. 3. Reduction from 3-Coloring to BNCMPL-1, using AND/OR nodes over external nodes z_i, y_i, w_i^{pq} and color nodes c_i^1, c_i^2, c_i^3. Parts (i), (ii) and (iii) put constraints that at least one color is assigned to each node, that two colors cannot be assigned to the same node, and that different colors must be assigned to neighboring nodes, respectively.]
(i) For each i ∈ {1, . . . , N}, we create e such that e(z_i) = e(y_i) = e(o) = 1, and e(v) = 0 for the other v.
(ii) For each i ∈ {1, . . . , N}, we create e such that e(w_i^{pq}) = e(y_i) = 1, and e(v) = 0 for the other v.
(iii) For each {i, j} ∈ E_0 with i < j, we create e such that e(y_i) = e(y_j) = e(o) = 1, and e(v) = 0 for the other v.
Then, we can show that there exists a valid 3-coloring for G0 (V0 , E0 ) if and only
if there is a required completion for G(V, F ). Furthermore, this reduction can be
done in polynomial time. As in the proof of Prop. 1, we can encode each node
with indegree more than 2 using nodes of indegree 2.
We can also prove that BNCMPL-2 is NP-complete, where the proof is a bit
involved because Boolean functions assigned to any subset of nodes of cardinality
L can be modified.
[Figure: gadget used in the hardness proof for BNCMPL-2. For each variable x_i, external nodes y_i, z_i, w_i feed a node p_i, which feeds copies C_i^1, . . . , C_i^{N'} with internal nodes q_i^1, . . . , q_i^{N'} and output nodes r_i^1, . . . , r_i^{N'}; the resulting values r'_1, . . . , r'_N are combined by an OR node.]
Since c_i is satisfied by (b_1, . . . , b_N), one of l_{i_1}, l_{i_2} and l_{i_3} must be 1 for each i. We assume w.l.o.g. that l_{i_1} takes value 1. Then, we can see that r̂_{i_1}^j(e^i) = 1 holds for all j = 1, . . . , N, from which r̂_{i_1}(e^i) = 1 and ô(e^i) = 1 follow. Furthermore, we can see that ô(e^{M+1}) = 0. Therefore, there exists a required completion.
Conversely, assume that there exists a required completion G(V, F). Then, we create an assignment (b_1, . . . , b_N) by letting b_i = 0 iff p_i = ¬y_i is assigned to p_i. If no nodes other than the p_i s are changed⁴, we can see that (b_1, . . . , b_N) is a satisfying assignment, as in the proof of Thm. 1. Therefore, we consider the case that some nodes other than the p_i s are changed. Since at most N nodes are changed, we can see that at least N + 1 outputs of the copies C_i^j take the value d_i(e^j) defined by
$$d_i(e^j) = \begin{cases} 1, & \text{if } (\hat{w}_i(e^j) = 1) \wedge ((\hat{p}_i(e^j) = 1 \wedge \hat{z}_i(e^j) = 0) \vee (\hat{p}_i(e^j) = 0 \wedge \hat{z}_i(e^j) = 1)),\\ 0, & \text{otherwise.} \end{cases}$$
We can also see that at least one C_i^j remains unchanged for each i. If C_i^{N+1} is unchanged, we can see that r̂_i(e^j) = d_i(e^j) holds for all j. If C_i^{N+1} is changed, there must exist an unchanged C_i^j for some j < N + 1. Then, we can see that for each i ∈ {1, . . . , N}, one of the following holds: r̂_i(e^j) = d_i(e^j) for all j; r̂_i(e^j) = ¬d_i(e^j) for all j; r̂_i(e^j) = 0 for all j; r̂_i(e^j) = 1 for all j.
⁴ We say that a node is changed if the assigned Boolean function is replaced.
are included in {r_{j_1}, r_{j_2}, r_{j_3}}, where c_j = l_{j_1} ∨ l_{j_2} ∨ l_{j_3}. Since e^j_{3N+1} ≠ e^{M+1}_{3N+1} holds, r̂_i(e^j) ≠ r̂_i(e^{M+1}) holds for at least one i ∈ {j_1, j_2, j_3}. For such i, d_i(e^j) ≠ d_i(e^{M+1}) holds. Since d_i(e^{M+1}) = 0, we have d_i(e^j) = 1. Although the output node may be changed, ô(e^j) = d_1(e^j) ∨ · · · ∨ d_N(e^j) always holds for any j in the resulting network G(V, F). Hence, p̂_i(e^j) = 1 holds if ẑ_i(e^j) = 0; otherwise (i.e., ẑ_i(e^j) = 1), p̂_i(e^j) = 0 holds. If ẑ_i(e^j) = 0 holds, x_i appears in c_j positively. Since p̂_i(e^j) = 1 holds in this case, b_i = 1 holds and thus c_j is satisfied. Similarly, if ẑ_i(e^j) = 1 holds, we can see that b_i = 0 holds and c_j is satisfied. Therefore, there exists a satisfying assignment for 3-SAT.
The above reduction can clearly be done in polynomial time. We can encode the output node using nodes of indegree two.⁵
Theorem 4. BNCMPL-1 is solved in polynomial time if the network structure is a rooted tree of bounded indegree and the number of examples is O(log n).
Proof. We assume for simplicity that all non-external nodes are of indegree 2; the proof and algorithm can be extended to any tree of bounded constant indegree D while keeping the polynomial time complexity.
Let F_v be an assignment of functions to unassigned nodes in T(v). Then, v̂(e^j, F_v) denotes the state of v under assignment F_v when an example e^j is given. Let a be a 0-1 vector of size m (recall that m is the number of examples), and define the DP table by S[v, a] = 1 iff there exists F_v such that v̂(e^j, F_v) = a_j for all j = 1, . . . , m.
⁵ It should be noted that the theorem still holds if nodes encoding the output node are changed by completion.
The total time for constructing the dynamic programming table is O(mn · 2^{3m}). Once this table is constructed, we finally check whether or not S[v_n, a] = 1 holds for the vector a such that a_j = e^j_{h+1} for all j = 1, . . . , m. It is straightforward to see that this algorithm works correctly in O(mn · 2^{3m}) time.
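A minimal sketch of this dynamic programming (ours — under the same assumptions: rooted tree, indegree at most 2, m examples; realizable_patterns computes the set {a : S[v, a] = 1} bottom-up):

from itertools import product

def all_funcs(d):
    # All Boolean functions on d inputs, represented as truth tables.
    rows = list(product([0, 1], repeat=d))
    return [dict(zip(rows, bits)) for bits in product([0, 1], repeat=len(rows))]

def realizable_patterns(v, children, func, ext_pattern, m):
    # Set of m-bit patterns the subtree rooted at v can realize,
    # i.e. {a : S[v, a] = 1} in the text's notation.
    if not children[v]:                       # external node: pattern is fixed
        return {ext_pattern[v]}
    kid_sets = [realizable_patterns(u, children, func, ext_pattern, m)
                for u in children[v]]
    candidates = [func[v]] if func.get(v) is not None else all_funcs(len(children[v]))
    result = set()
    for kid_pats in product(*kid_sets):       # one realizable pattern per child
        for f in candidates:
            result.add(tuple(f[tuple(p[j] for p in kid_pats)]
                             for j in range(m)))
    return result

# BNCMPL-1 has a completion iff the observed output pattern
# (e^1_{h+1}, ..., e^m_{h+1}) lies in realizable_patterns(output, ...).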
We can modify the above proof for BNCMPL-2.
Theorem 5. BNCMPL-2 is solved in polynomial time if the network structure
is a rooted tree of bounded indegree and the number of examples is O(log n).
Proof. For simplicity, we assume that all non-external nodes are of indegree 2. Let F_v be an assignment of Boolean functions to nodes in T(v). Let size(F_v) be the number of nodes such that the Boolean function assigned by F_v differs from the original one. We define a DP table S[v, a, l] for l = 0, . . . , L by
$$S[v, a, l] = \begin{cases} 1, & \text{if there exists } F_v \text{ such that } \hat{v}(e^j, F_v) = a_j \text{ for all } j = 1, \ldots, m \text{ and } size(F_v) = l,\\ 0, & \text{otherwise.} \end{cases}$$
Then, for an external node v_i, S[v_i, a, l] is computed by
$$S[v_i, a, l] = \begin{cases} 1, & \text{if } a_j = e^j_i \text{ holds for all } j = 1, \ldots, m \text{ and } l = 0,\\ 0, & \text{otherwise.} \end{cases}$$
For a non-external node v with children v^L and v^R whose originally assigned function is f_v, S[v, a, l] is computed by
$$S[v, a, l] = \begin{cases} 1, & \text{if there is } (a^L, a^R, l_L, l_R) \text{ such that } S[v^L, a^L, l_L] = 1,\ S[v^R, a^R, l_R] = 1,\\ & \quad f_v(a^L_j, a^R_j) = a_j \text{ for all } j = 1, \ldots, m, \text{ and } l = l_L + l_R,\\ 1, & \text{if there is } (f, a^L, a^R, l_L, l_R) \text{ such that } S[v^L, a^L, l_L] = 1,\ S[v^R, a^R, l_R] = 1,\\ & \quad f(a^L_j, a^R_j) = a_j \text{ for all } j = 1, \ldots, m, \text{ and } l = l_L + l_R + 1,\\ 0, & \text{otherwise.} \end{cases}$$
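The excerpt does not state the running time this recursion yields; a rough count (ours, not the paper's) for indegree 2, m examples and budget L — noting that l_R is determined by l and l_L — is
$$\underbrace{n}_{\text{nodes}} \cdot \underbrace{2^{m}(L+1)}_{\text{entries } S[v,a,l]} \cdot \underbrace{2^{2m}(L+1) \cdot 16}_{\text{candidates } (a^L, a^R, l_L, f)} \cdot \underbrace{O(m)}_{\text{check}} = O\big(m\, n\, L^{2}\, 2^{3m}\big),$$
which is polynomial whenever m = O(log n) and L ≤ n, consistent with the statement of Theorem 5.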
We can extend the above-mentioned algorithms to partial k-trees. A partial k-tree is a graph with treewidth at most k, where the treewidth is defined via tree decomposition [8]. A tree decomposition of a graph G(V, E) is a pair (T(V_T, E_T), (B_t)_{t∈V_T}), where T(V_T, E_T) is a rooted tree and (B_t)_{t∈V_T} is a family of subsets of V such that (see also Fig. 5):
(i) ∪_{t∈V_T} B_t = V;
(ii) for every edge {u, v} ∈ E, there exists t ∈ V_T with {u, v} ⊆ B_t;
(iii) for every v ∈ V, the set {t ∈ V_T | v ∈ B_t} induces a connected subtree of T.
The width of the decomposition is defined as max_{t∈V_T}(|B_t| − 1) and the treewidth is the minimum of the widths among all the tree decompositions of G.
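As a concrete reading of this definition, a small sketch (ours; the names are hypothetical) checking the three conditions for a candidate decomposition:

def is_tree_decomposition(V, E, tree_edges, bags):
    # bags: dict bag-index -> set of graph vertices;
    # tree_edges: edges of the decomposition tree over bag indices.
    # (i) every vertex appears in some bag
    if set().union(*bags.values()) != set(V):
        return False
    # (ii) every graph edge is contained in some bag
    if any(not any({u, w} <= B for B in bags.values()) for u, w in E):
        return False
    # (iii) for each vertex, the bags containing it form a connected subtree
    for v in V:
        nodes = {t for t, B in bags.items() if v in B}
        start = next(iter(nodes))
        reached, frontier = {start}, [start]
        while frontier:
            t = frontier.pop()
            for a, b in tree_edges:
                for x, y in ((a, b), (b, a)):
                    if x == t and y in nodes and y not in reached:
                        reached.add(y)
                        frontier.append(y)
        if reached != nodes:
            return False
    return True

# Width of the decomposition: max(len(B) for B in bags.values()) - 1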
We present an algorithm for BNCMPL-1 on partial k-trees as the main part of the proof of the following theorem, where we assume that k is a constant.
Theorem 6. BNCMPL-1 is solved in polynomial time if the network structure is a partial k-tree of bounded indegree and the number of examples is O(log n).
[Fig. 5. Example of a tree decomposition: the root bag A has child B, B has child C, and C has children D and E.]
Ass(U) = {A(U) | A_{i_j} is consistent with A_{i_{j'}} for all j ≠ j'}.
That is, ExtAss(t) is the set of consistent tuples for B_t, each of which can be extended to a consistent tuple for B(t).
[Figure: example sets of secondary assignments⁶ A_1, A_2, A_3 (sets of 0-1 vectors over the m examples) for a node v and its children v^L, v^R.]
⁶ We use the term secondary assignment in order to discriminate it from the notion of assignment defined in Section 2.
For each node t ∈ V_T, Ass(t) can be computed in O(km · 2^{3(k+1)m}) time, because the number of possible secondary assignments for each node in G is O(2^{3m}) and thus the number of possible tuples is O(2^{3(k+1)m}), and O(km) time is enough to check the consistency of a tuple, where we assume that the maximum indegree is bounded by 2. If t is a leaf in T, we let ExtAss(t) := Ass(t). Otherwise, let t_1, . . . , t_{g_t} be the children of t in T and assume that the ExtAss(t_i)s have already been computed. Two tuples A(B_t) for B_t and A(B_{t_i}) for B_{t_i} are said to be compatible if they assign the same secondary assignment to each v ∈ B_t ∩ B_{t_i}. Then, we can compute ExtAss(t) by
ExtAss(t) := {A(B_t) ∈ Ass(t) | A(B_t) is compatible with some A(B_{t_i}) ∈ ExtAss(t_i) for all i = 1, . . . , g_t}.
Then, it is straightforward to see that BNCMPL-1 has a required completion iff ExtAss(r) ≠ ∅, where r is the root of T. For the example of Fig. 5, we compute ExtAss(C) from Ass(D) and Ass(E), ExtAss(B) from ExtAss(C), and ExtAss(A) from ExtAss(B). Note that the output node can be located outside B_r (e.g., it is not located in A but in D in Fig. 5).
Clearly, ExtAss(t) can be computed in O(2^{6(k+1)m} · km · g_t) time per t. Since Σ_t g_t = O(n), the total computation time is O((2^{6(k+1)m} · km + q(k)) · n), where O(q(k) · n) is the time complexity of Bodlaender's algorithm [7], which works in linear time for a constant k. If m = O(log n), this time complexity is polynomial. Furthermore, we can extend the algorithm and the analysis to the case of maximum indegree bounded by a constant D.
We can extend this result to BNCMPL-2; the details are omitted.
Corollary 1. BNCMPL-2 is solved in polynomial time if the network structure
is a partial k-tree of bounded indegree and the number of examples is O(log n).
5 Concluding Remarks
In this paper, we have studied problems of completing networks from example
data. We have shown that the problems are NP-complete in general but can be
solved in polynomial time for partial k-trees of bounded indegree if a logarithmic
number of examples are given.
Extension of the model and algorithms for networks with cycles is an im-
portant future work because real biological networks contain cycles. For that
purpose, it might be helpful to use feedback vertex sets because a network be-
comes acyclic if vertices in a feedback vertex set are removed. Other future work includes the extension of BNCMPL-2 to handle insertions of edges, the analysis of PAC-type learning models [13] as well as probabilistic extensions, and the development of practical algorithms.
Acknowledgment
We would like to thank Atsushi Mochizuki, Ryoko Morioka and Shigeru Saito
for helpful discussions.
References
1. Akutsu, T., Miyano, S., Kuhara, S.: Identification of genetic networks from a small
number of gene expression patterns under the Boolean network model. In: Proc.
Pacific Symposium on Biocomputing 1999, pp. 17–28 (1999)
2. Akutsu, T., Kuhara, S., Maruyama, O., Miyano, S.: Identification of genetic
networks by strategic gene disruptions and gene overexpressions under a Boolean
model. Theoretical Computer Science 298, 235–251 (2003)
3. Akutsu, T., Hayashida, M., Ching, W.-K., Ng, M.K.: Control of Boolean networks:
Hardness results and algorithms for tree structured networks. Journal of Theoret-
ical Biology 244, 670–679 (2007)
4. Angluin, D., Aspnes, J., Chen, J., Wu, Y.: Learning a circuit by injecting values.
In: Proc. 38th Annual ACM Symposium on Theory of Computing, pp. 584–593
(2006)
5. Angluin, D., Aspnes, J., Chen, J., Reyzin, L.: Learning large-alphabet and analog
circuits with value injection queries. Machine Learning 72, 113–138 (2008)
6. Angluin, D., Aspnes, J., Chen, J., Eisenstat, D., Reyzin, L.: Learning acyclic prob-
abilistic circuits using test paths. In: Proc. 21st Annual Conference on Learning
Theory, pp. 169–180 (2008)
7. Bodlaender, H.L.: A linear-time algorithm for finding tree-decompositions of small
treewidth. SIAM Journal on Computing 25, 1305–1317 (1996)
8. Flum, J., Grohe, M.: Parameterized Complexity Theory. Springer, Berlin (2006)
9. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory
of NP-Completeness. W.H. Freeman and Co., New York (1979)
10. Gupta, S., Bisht, S.S., Kukreti, R., Jain, S., Brahmachari, S.K.: Boolean net-
work analysis of a neurotransmitter signaling pathway. Journal of Theoretical
Biology 244, 463–469 (2007)
11. Ideker, T.E., Thorsson, V., Karp, R.M.: Discovery of regulatory interactions
through perturbation: inference and experimental design. In: Proc. Pacific Sympo-
sium on Biocomputing 2000, pp. 302–313 (2000)
12. Kauffman, S.A.: The Origins of Order: Self-organization and Selection in Evolution.
Oxford Univ. Press, NY (1993)
13. Kearns, M.J., Vazirani, U.V.: An Introduction to Computational Learning Theory.
MIT Press, Cambridge (1994)
14. Mochizuki, A.: Structure of regulatory networks and diversity of gene expression
patterns. Journal of Theoretical Biology 250, 307–321 (2008)
15. Pitt, L., Valiant, L.G.: Computational limitations on learning from examples.
Journal of the ACM 35, 965–984 (1988)
16. Tokumoto, Y., Horimoto, K., Miyake, J.: TRAIL inhibited the cyclic AMP re-
sponsible element mediated gene expression. Biochemical and Biophysical Research
Communications 381, 533–536 (2009)
Average-Case Active Learning with Costs
A. Guillory and J. Bilmes
1 Motivation
the training set (i.e. the effective hypothesis class for some other possibly infinite
hypothesis class), we can also ensure there is a unique zero-error hypothesis. If
we set all question costs to 1, we recover the traditional active learning problem
of identifying the target hypothesis using a minimal number of labels.
However, this framework is also general enough to cover a variety of active
learning scenarios outside of traditional binary classification.
– Active learning with label costs. If different data points are more or
less costly to label, we can model these differences using non uniform label
costs. For example, if a longer document takes longer to label than a shorter
document, we can make costs proportional to document length. The goal is
then to identify the optimal hypothesis as quickly as possible as opposed to
using as few labels as possible. This notion of label cost is different than the
often studied notion of misclassification cost. Label cost refers to the cost of
acquiring a label at training time where misclassification cost refers to the
cost of incorrectly predicting a label at test time.
– Active learning for multiclass and partial label queries. We can di-
rectly ask for the label of a point (Is the label of this point “a”, “b”, or
“c”?), or we can ask less specific questions about the label (Is the label of
this point “a” or some other label?). We can also mix these question types,
presumably making less specific questions less costly. These kinds of partial
label queries are particularly important when examples have structured la-
bels. In a parsing problem, a partial label query could ask for the portion of
a parse tree corresponding to a small phrase in a long sentence.
– Batch mode active learning. Questions can also be queries for multiple
labels. In the extreme case, there can be a question corresponding to every
subset of possible single data point questions. Batch label queries only help
the algorithm reduce total label cost if the cost of querying for a batch of
labels is in some cases less than the sum of the corresponding individual
label costs. This is the case if there is a constant additive cost overhead
associated with asking a question or if we want to minimize time spent
labeling and there are multiple labelers who can label examples in parallel.
Beyond these specific examples, this setting applies to any active learning prob-
lem for which different user interactions have different costs and are unambiguous
as we have defined. For example, we can ask questions concerning the percentage
of positive and negative examples according to the optimal classifier (Does the
optimal classifier label more than half of the data set positive?). This abstract
setting also has applications outside of machine learning.
– Information Retrieval. We can think of a question asking strategy as an
index into the set of objects which can then be used for search. If we make the
cost of a question the expected computational cost of computing the answer
for a given object, then a question asking strategy with low cost corresponds
to an index with fast search time. For example, if objects correspond to points in ℝ^n and questions correspond to axis-aligned hyperplanes, a question asking strategy is a kd-tree.
2 Preliminaries
We first review the main result of Dasgupta [6] which our first bound extends. We
assume we have a finite set of objects (for example, hypotheses) H with |H| = n. A randomly chosen h* ∈ H is our target object, with a known positive π(h) defining the distribution over H by which h* is drawn. We assume min_h π(h) > 0 and |H| > 1. We also assume there is a finite set of questions q_1, q_2, . . . , q_m, each of which has a positive cost c_1, c_2, . . . , c_m. Each question q_i maps each object to a response from a finite set of answers A ⊇ ∪_{h,i} {q_i(h)}, and asking q_i reveals q_i(h*), eliminating from consideration all objects h for which q_i(h) ≠ q_i(h*). An active learning algorithm continues asking questions until h* has been identified (i.e., we have eliminated all but one of the elements from H). We assume this is possible for any element of H. The goal of the learning algorithm is to identify h* with questions incurring as little cost as possible. Our result bounds the expected cost of identifying h*.
We assume that the distribution π, the hypothesis class H, the questions q_i, and the costs c_i are known. Any deterministic question asking strategy (e.g., a deterministic active learning algorithm taking in this known information) produces a decision tree in which internal nodes are questions and the leaves are elements of H. The cost of a query tree T with respect to a distribution π, C(T, π), is defined to be the expected cost of identifying h* when h* is chosen according to π. We can write C(T, π) as C(T, π) = Σ_{h∈H} π(h) c_T(h), where c_T(h) is the cost to identify h as the target object; c_T(h) is simply the sum of the costs of the questions along the path from the root of T to h. We define π_S to be π restricted and normalized w.r.t. S: for s ∈ S, π_S(s) = π(s)/π(S), and for s ∉ S, π_S(s) = 0. Tree cost decomposes nicely.
Lemma 1. For any tree T and any S = ∪_i S^i with S^i ∩ S^j = ∅ for all i ≠ j and S ≠ ∅,
$$C(T, \pi_S) = \sum_i \pi_S(S^i)\, C(T, \pi_{S^i}).$$
We define the version space to be the subset of H consistent with the answers we have received so far. Questions eliminate elements from the version space. For a question q_i and a particular version space S ⊆ H, we define S^j ≜ {s ∈ S : q_i(s) = j}. With this notation the dependence on q_i is suppressed but understood by context. As shorthand, for a distribution π we define π(S) = Σ_{s∈S} π(s). On average, asking question q_i shrinks the absolute mass of S with respect to a distribution π by
$$\Delta_i(S, \pi) \triangleq \sum_{j \in A} \frac{\pi(S^j)}{\pi(S)} \Big( \sum_{k \neq j} \pi(S^k) \Big) = \pi(S) - \sum_{j \in A} \frac{\pi(S^j)^2}{\pi(S)}.$$
We call this quantity the shrinkage of q_i with respect to (S, π). We note Δ_i(S, π) is only defined if π(S) > 0. If q_i has cost c_i, we call Δ_i(S, π)/c_i the shrinkage-cost ratio of q_i with respect to (S, π).
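As a concrete reading of these definitions, a small sketch (ours; the names shrinkage and best_question are hypothetical) computing the shrinkage and the shrinkage-cost ratio:

def shrinkage(S, pi, q):
    # Delta_i(S, pi): expected mass removed from S by asking question q.
    # S: iterable of objects; pi: dict object -> mass; q: object -> answer.
    mass_S = sum(pi[h] for h in S)
    by_answer = {}
    for h in S:
        by_answer[q(h)] = by_answer.get(q(h), 0.0) + pi[h]
    # Delta = pi(S) - sum_j pi(S^j)^2 / pi(S)
    return mass_S - sum(m * m for m in by_answer.values()) / mass_S

def best_question(S, pi, questions, costs):
    # Index of the question maximizing the shrinkage-cost ratio.
    return max(range(len(questions)),
               key=lambda i: shrinkage(S, pi, questions[i]) / costs[i])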
In previous work [6, 1, 3], the greedy algorithm analyzed is the algorithm that
at each step chooses the question qi that maximizes the shrinkage with respect to
the current version space Δi (S, πS ). In our generalized setting, we define the cost
sensitive greedy algorithm to be the active learning algorithm which at each step
asks the question with the largest shrinkage-cost ratio Δi (S, πS )/ci where S is
the current version space. We call the tree generated by this method the greedy
query tree. See Algorithm 1. Adler and Heeringa [1] also analyzed a cost-sensitive
method for the restricted case of questions with two responses and uniform π,
and our method is equivalent to theirs in this case. The main result of Dasgupta
[6] is that, on average, with unit costs and yes/no questions, the greedy strategy
is not much worse than any other strategy. We repeat this result here.
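Before turning to the analysis, here is a minimal sketch (ours) of the cost-sensitive greedy strategy just described; the paper's Algorithm 1 is not reproduced in this excerpt, and this reuses best_question from the previous sketch:

def greedy_identify(H, pi, questions, costs, oracle):
    # Repeatedly ask the question with the largest shrinkage-cost ratio
    # until the version space is a singleton. oracle(i) returns q_i(h*).
    # (Normalizing pi to pi_S only rescales every shrinkage by pi(S),
    # so it does not change the argmax; we can pass pi directly.)
    S = set(H)
    total_cost = 0.0
    while len(S) > 1:
        i = best_question(S, pi, questions, costs)
        total_cost += costs[i]
        answer = oracle(i)
        S = {h for h in S if questions[i](h) == answer}
    return next(iter(S)), total_cost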
Theorem 1 (Theorem 3 of [6]). If |A| = 2 and c_i = 1 for all i, then for any π the greedy query tree T^g has cost at most
$$C(T^g, \pi) \le 4\, C^* \ln \frac{1}{\min_h \pi(h)},$$
where C* ≜ min_T C(T, π).
What is perhaps surprising about this bound is that the quality of approximation
does not depend on the costs themselves. The proof follows part of the strategy
used by Dasgupta [6]. The general approach is to show that if the average cost
of some question tree is low, then there must be at least one question with
high shrinkage-cost ratio. We then use this to form the basis of an inductive
argument. However, this simple argument fails when only a few objects have
high probability mass.
We start by showing that the shrinkage of q_i monotonically decreases as we eliminate elements from S.
Lemma 2 (extension of Lemma 6 of [6] to non-binary queries). If T ⊆ S ⊆ H and T ≠ ∅, then for all i and π, Δ_i(T, π) ≤ Δ_i(S, π).
Proof. For |S| = 1 the result is immediate since |T| ≥ 1 and therefore S = T. We show that if |S| ≥ 2, removing any single element a ∈ S \ T from S does not increase Δ_i(S, π). The lemma then follows since we can remove all of S \ T from S an element at a time. Assume w.l.o.g. a ∈ S^k for some k, and let A' ≜ A \ {k}. Then
$$\Delta_i(S \setminus \{a\}, \pi) = \frac{(\pi(S^k) - \pi(a))(\pi(S) - \pi(S^k))}{\pi(S) - \pi(a)} + \sum_{j \in A'} \frac{\pi(S^j)(\pi(S) - \pi(S^j) - \pi(a))}{\pi(S) - \pi(a)}.$$
Corollary 1. If T ⊆ S ⊆ H and T ≠ ∅, then for any i and π, Δ_i(T, π)/c_i ≤ Δ_i(S, π)/c_i.
Lemma 3. Let v be a distribution over H, let R ⊆ S ⊆ H with R ≠ ∅, and suppose Δ_i(S, v)/c_i ≤ Δ/c for every question q_i. Then any query tree T whose leaves contain R satisfies C(T, v_R) ≥ (c/Δ) v(R)(1 − CP(v_R)), where CP(π) ≜ Σ_h π(h)^2 denotes the collision probability of π.
Proof. We prove the lemma with induction on |R|. For |R| = 1, CP(v_R) = 1 and the right hand side of the inequality is zero. For |R| > 1, we lower bound the cost of any query tree on R. At its root, any query tree chooses some q_i with cost c_i that divides the version space into R^j for j ∈ A. Using the inductive hypothesis we can then write the cost of a tree as
$$C(T, v_R) \ge c_i + \sum_{j \in A} v_R(R^j)\, \frac{c}{\Delta}\, v(R^j)\,(1 - CP(v_{R^j})) = c_i + \frac{c}{\Delta}\, v(R) \sum_{j \in A} \big( v_R(R^j)^2 - v_R(R^j)^2\, CP(v_{R^j}) \big) = c_i + \frac{c}{\Delta}\, v(R) \Big( 1 - 1 + \sum_{j \in A} v_R(R^j)^2 - CP(v_R) \Big).$$
Here we used
$$\sum_{j \in A} v_R(R^j)^2\, CP(v_{R^j}) = \sum_{j \in A} v_R(R^j)^2 \sum_{r \in R^j} v_{R^j}(r)^2 = \sum_{r \in R} v_R(r)^2 = CP(v_R).$$
We now note $v(R)\big(1 - \sum_{j \in A} v_R(R^j)^2\big) = v(R) - \sum_{j \in A} v(R^j)^2 / v(R) = \Delta_i(R, v)$, so
$$C(T, v_R) \ge c_i + \frac{c}{\Delta}\, v(R)\,(1 - CP(v_R)) - \frac{c}{\Delta}\, \Delta_i(R, v) = \frac{c}{\Delta}\, v(R)\,(1 - CP(v_R)) + \frac{\Delta c_i - \Delta_i(R, v)\, c}{\Delta}.$$
Using Corollary 1, Δ_i(R, v)/c_i ≤ Δ_i(S, v)/c_i ≤ Δ/c, so Δc_i − Δ_i(R, v)c ≥ 0 and therefore
$$C(T, v_R) \ge \frac{c}{\Delta}\, v(R)\,(1 - CP(v_R)).$$
This lower bound on the cost of a tree translates into a lower bound on the shrinkage-cost ratio of the question chosen by the greedy tree.
Corollary 2 (extension of Corollary 8 of [6] to non-binary queries and non-uniform costs). For any S ⊆ H with S ≠ ∅ and any query tree T whose leaves contain S, there must be a question q_i with Δ_i(S, π_S)/c_i ≥ (1 − CP(π_S))/C(T, π_S).
Proof. Suppose this is not the case. Then there is some Δ/c < (1 − CP(π_S))/C(T, π_S) such that Δ_i(S, π_S)/c_i ≤ Δ/c for all i. By Lemma 3 (with v ≜ π_S and R ≜ S),
$$C(T, \pi_S) \ge \pi_S(S)\, \frac{c}{\Delta}\, (1 - CP(\pi_S)) > \pi_S(S)\, C(T, \pi_S) = C(T, \pi_S),$$
which is a contradiction.
A special case which poses some difficulty for the main proof is when CP(π_S) > 1/2 for some S ⊆ H. First note that if CP(π_S) > 1/2, one object h_0 has more than half the mass of S. In the lemma below, we use R ≜ S \ {h_0}. Also let δ_i be the relative mass of the hypotheses in R that are distinct from h_0 w.r.t. question q_i: δ_i ≜ π_R({r ∈ R : q_i(h_0) ≠ q_i(r)}). In other words, when question q_i is asked, R is divided into a set of hypotheses that agree with h_0 (these have relative mass 1 − δ_i) and a set of hypotheses that disagree with h_0 (these have relative mass δ_i). Dasgupta [6] also treats this as a special case. However, in the more general setting treated here the situation is more subtle. For yes or no questions, the question chosen by the greedy query tree is also the question that removes the most mass from R. In our setting this is not necessarily the case. The left of Figure 1 shows a counterexample. However, we can show the fraction of mass removed from R by the greedy query tree is at least half the fraction removed by any other question. Furthermore, to handle costs, we must instead consider the fraction of mass removed from R per unit cost.
Fig. 1. Left: counterexample showing that when a single hypothesis h_0 contains more than half the mass, the query with maximum shrinkage is not necessarily the query that separates the most mass from h_0. Right: notation for this case.
In this lemma we use π_{{h_0}} to denote the distribution which puts all mass on h_0. The cost of identifying h_0 in a tree T* is then C*(h_0) ≜ C(T*, π_{{h_0}}).
Lemma 4. Consider any S ⊆ H and π with CP(π_S) > 1/2 and π(h_0) > 1/2. Let C*(h_0) = C(T*, π_{{h_0}}) for any T* whose leaves contain S. Some question q_i has δ_i/c_i ≥ 1/C*(h_0).
Proof. There is always a set of questions indexed by a set I with total cost Σ_{i∈I} c_i ≤ C*(h_0) that distinguish h_0 from R within S; in particular, the set of questions used to identify h_0 in T* satisfies this. Since the set identifies h_0, Σ_{i∈I} δ_i ≥ 1, which implies
$$\sum_{i \in I} \frac{c_i}{C^*(h_0)} \cdot \frac{\delta_i}{c_i} \ge 1/C^*(h_0).$$
Because c_i/C*(h_0) ∈ (0, 1] and Σ_{i∈I} c_i/C*(h_0) ≤ 1, there must be a q_i such that δ_i/c_i ≥ 1/C*(h_0).
Having shown that some query always reduces the relative mass of R by 1/C*(h_0) per unit cost, we now show that the greedy query tree reduces the mass of R by at least half as much per unit cost.
Lemma 5. Consider any π and S ⊆ H with CP(π_S) > 1/2 and π(h_0) > 1/2, and a corresponding subtree T_S^g in the greedy tree. Let C*(h_0) = C(T*, π_{{h_0}}) for any T* whose leaves contain S. The question q_i chosen by T_S^g has δ_i/c_i ≥ 1/(2C*(h_0)).
Proof. We prove this by showing that the fraction removed from R per unit cost by the greedy query tree's question is at least half that of any other question. Combining this with Lemma 4, we get the desired result.
We can write the shrinkage of q_i in terms of δ_i. Let A' ≜ A \ {q_i(h_0)}. Since π(S^{q_i(h_0)}) = π(h_0) + (π(R) − δ_i π(R)) and π(S) − π(S^{q_i(h_0)}) = δ_i π(R), we have that
$$\Delta_i(S, \pi_S) = \big( \pi_S(h_0) + (1 - \delta_i)\, \pi_S(R) \big)\, \delta_i\, \pi_S(R) + \sum_{j \in A'} \pi_S(S^j)\, \big( \pi_S(S) - \pi_S(S^j) \big),$$
where Σ_{j∈A'} π_S(S^j) = δ_i π_S(R). We can then upper bound the shrinkage by 2δ_i π_S(R) using π_S(S) − π_S(S^j) ≤ 1, and lower bound it by δ_i π_S(R) using π_S(h_0) > 1/2 and π_S(S) − π_S(S^j) ≥ π_S(h_0) + (1 − δ_i)π_S(R) for any j ∈ A'.
Let q_i be any question and q_j the question chosen by the greedy tree, so that Δ_j(S, π_S)/c_j ≥ Δ_i(S, π_S)/c_i. Using the upper and lower bounds we derived, we then know 2δ_j π_S(R)/c_j ≥ δ_i π_S(R)/c_i and can conclude 2δ_j/c_j ≥ δ_i/c_i. Combining this with Lemma 4, δ_j/c_j ≥ 1/(2C*(h_0)).
Theorem 2. For any S ⊆ H, the corresponding subtree T_S^g of the greedy query tree and any query tree T* whose leaves contain S satisfy
$$C(T_S^g, \pi_S) \le 12\, C(T^*, \pi_S) \ln \frac{\pi(S)}{\min_{h \in S} \pi(h)}.$$
Proof. In this proof we use C*(S) as shorthand for C(T*, π_S). Also, we use min(S) for min_{s∈S} π(s). We proceed by induction on |S|. For |S| = 1, C(T_S^g, π_S) is zero and the claim holds. For |S| > 1, we consider two cases.
Case one: CP(π_S) ≤ 1/2.
At the root of T_S^g, the greedy query tree chooses some q_i with cost c_i that reduces the version space to S^j when q_i(h*) = j. Let π(S^+) ≜ max{π(S^j) : j ∈ A}. Using the inductive hypothesis,
$$C(T_S^g, \pi_S) = c_i + \sum_{j \in A} \pi_S(S^j)\, C(T_{S^j}^g, \pi_{S^j}) \le c_i + \sum_{j \in A} 12\, \pi_S(S^j)\, C^*(S^j) \ln \frac{\pi(S^j)}{\min(S^j)} \le c_i + 12 \Big( \sum_{j \in A} \pi_S(S^j)\, C^*(S^j) \Big) \ln \frac{\pi(S^+)}{\min(S)}.$$
Since Σ_{j∈A} π_S(S^j) C*(S^j) = C*(S) by Lemma 1 and ln π_S(S^+) ≤ −(1 − π_S(S^+)),
$$C(T_S^g, \pi_S) \le c_i + 12\, C^*(S) \ln \frac{\pi(S)}{\min(S)} + 12\, C^*(S) \ln \pi_S(S^+) \le c_i + 12\, C^*(S) \ln \frac{\pi(S)}{\min(S)} - 12\, C^*(S)\, (1 - \pi_S(S^+)).$$
Now π_S(S^+) ≥ Σ_{j∈A} π_S(S^j)^2, because this sum is an expectation and π_S(S^+) ≥ π_S(S^j) for all j. From this follows
$$C(T_S^g, \pi_S) \le c_i + 12\, C^*(S) \ln \frac{\pi(S)}{\min(S)} - 12\, C^*(S) \Big( 1 - \sum_{j \in A} \pi_S(S^j)^2 \Big) = c_i + 12\, C^*(S) \ln \frac{\pi(S)}{\min(S)} - 12\, C^*(S)\, c_i\, \frac{1 - \sum_{j \in A} \pi_S(S^j)^2}{c_i}.$$
Since π_S(S) = 1, the quantity 1 − Σ_{j∈A} π_S(S^j)^2 equals Δ_i(S, π_S), so by Corollary 2 and using CP(π_S) ≤ 1/2,
$$C(T_S^g, \pi_S) \le c_i + 12\, C^*(S) \ln \frac{\pi(S)}{\min(S)} - 12\, C^*(S)\, c_i\, \frac{1 - CP(\pi_S)}{C^*(S)} = c_i + 12\, C^*(S) \ln \frac{\pi(S)}{\min(S)} - 12\,(1 - CP(\pi_S))\, c_i \le 12\, C^*(S) \ln \frac{\pi(S)}{\min(S)},$$
which completes this case.
Case two: CP(π_S) > 1/2.
The hypothesis with more than half the mass, h_0, lies at some depth D in the greedy tree T_S^g. Counting the root of T_S^g as depth 0, D ≥ 1. At depth d > 0, let q_0, q_1, . . . , q_{d−1} be the questions asked so far, c_0, c_1, . . . , c_{d−1} be the costs of these questions, and C_d = Σ_{i=0}^{d−1} c_i be the total cost incurred. At the root, C_0 = 0.
At depth d < D, we define R_d to be the set of objects other than h_0 that are still in the version space along the path to h_0: R_0 ≜ S \ {h_0} and, for d > 0, R_d ≜ R_{d−1} \ {h : q_{d−1}(h) ≠ q_{d−1}(h_0)}. In other words, R_d is R_{d−1} with the objects that disagree with h_0 on q_{d−1} removed. All of the objects in R_d have the same response as h_0 for q_0, q_1, . . . , q_{d−1}. The right of Figure 1 shows this case.
We first bound the mass remaining in R_d as a function of the label cost incurred so far. For d > 0, using Lemma 5,
$$\pi(R_d) \le \pi(R_0) \prod_{i=0}^{d-1} \Big( 1 - \frac{c_i}{2\, C^*(h_0)} \Big) \le \pi(R_0)\, e^{-C_d / (2 C^*(h_0))}.$$
Using this bound, we can bound C_D, the cost of identifying h_0 (i.e., C(T_S^g, h_0)). First note that π(R_{D−1}) ≥ min(R_0), since at least one object is left in R_{D−1}. Combining this with the upper bound on the mass of R_d, we have
$$C_{D-1} \le 2\, C^*(h_0) \ln \frac{\pi(R_0)}{\min(R_0)}$$
if D − 1 > 0; this clearly also holds if D − 1 = 0, since C_0 = 0. We now only need to bound the cost of the final question (the question asked at level D − 1). If the final question had cost greater than 2C*(h_0), then by Lemma 5 this question would reduce the mass of the set containing h_0 to less than π(h_0). This is a contradiction, so the final question must have cost no greater than 2C*(h_0). Therefore
$$C_D \le 2\, C^*(h_0) \ln \frac{\pi(R_0)}{\min(R_0)} + 2\, C^*(h_0).$$
We use A_{d−1} ≜ A \ {q_{d−1}(h_0)}. Let S_d^j be the set of objects s removed from R_{d−1} with the question at depth d − 1 such that q_{d−1}(s) = j, that is, R_{d−1} = R_d ∪ (∪_{j∈A_{d−1}} S_d^j). Let S_d = ∪_{j∈A_{d−1}} S_d^j. The right of Figure 1 illustrates this notation. A useful variation of Lemma 1 we use in the following is that for S = S^1 ∪ S^2 with S^1 ∩ S^2 = ∅, π(S) C*(S) = π(S^1) C*(S^1) + π(S^2) C*(S^2).
We can write
$$\pi(S)\, C(T_S^g, \pi_S) \stackrel{(a)}{=} \pi(h_0)\, C_D + \sum_{d=1}^{D} \sum_{j \in A_{d-1}} \pi(S_d^j) \big( C_d + C(T_{S_d^j}^g, \pi_{S_d^j}) \big) \stackrel{(b)}{\le} \pi(h_0)\, C_D + \sum_{d=1}^{D} \pi(S_d)\, C_d + \sum_{d=1}^{D} \sum_{j \in A_{d-1}} 12\, \pi(S_d^j)\, C^*(S_d^j) \ln \frac{\pi(S_d^j)}{\min(S_d^j)} \stackrel{(c)}{\le} \pi(h_0)\, C_D + \pi(R_0)\, C_D + 12\, \pi(R_0)\, C^*(R_0) \ln \frac{\pi(R_0)}{\min(R_0)} \stackrel{(d)}{\le} 2\, \pi(h_0)\, C_D + 12\, \pi(R_0)\, C^*(R_0) \ln \frac{\pi(R_0)}{\min(R_0)}.$$
Here (a) decomposes the total cost into the cost of identifying h_0 and the cost of each branch leaving the path to h_0; for each of these branches the total cost is the cost incurred so far plus the cost of the tree rooted at that branch. (b) uses the inductive hypothesis, (c) uses S_d ∩ S_{d'} = ∅ for d ≠ d' and ∪_d S_d = R_0, and (d) uses π(R_0) < π(h_0). Continuing,
$$\pi(S)\, C(T_S^g, \pi_S) \stackrel{(a)}{\le} 4\, \pi(h_0)\, C^*(h_0) \Big( \ln \frac{\pi(R_0)}{\min(R_0)} + 1 \Big) + 12\, \pi(R_0)\, C^*(R_0) \ln \frac{\pi(R_0)}{\min(R_0)} \stackrel{(b)}{\le} 4\, \pi(h_0)\, C^*(h_0) \Big( \ln \frac{\pi(S)}{\min(S)} + 1 \Big) + 12\, \pi(R_0)\, C^*(R_0) \ln \frac{\pi(S)}{\min(S)},$$
where (a) uses the bound on C_D derived above and (b) uses R_0 ⊂ S. Since π(S) > 2 min(S) and therefore ln(π(S)/min(S)) > ln 2 > 0.5, we have ln(π(S)/min(S)) + 1 < 3 ln(π(S)/min(S)), and hence
$$\pi(S)\, C(T_S^g, \pi_S) \le 12\, \pi(h_0)\, C^*(h_0) \ln \frac{\pi(S)}{\min(S)} + 12\, \pi(R_0)\, C^*(R_0) \ln \frac{\pi(S)}{\min(S)} = \pi(S) \cdot 12\, C^*(S) \ln \frac{\pi(S)}{\min(S)},$$
where the last equality uses the variation of Lemma 1 above. Dividing both sides by π(S) gives the desired result.
distribution. In our cost sensitive setting, the intuition remains the same, but the introduction of costs changes the result.
Let c_max ≜ max_i c_i and c_min ≜ min_i c_i. In this discussion, we consider irreducible query trees, which we define to be query trees that contain only questions with non-zero shrinkage. Greedy query trees always have this property, as do optimal query trees. This property lets us assume any path from the root to a leaf has at most n nodes and therefore cost at most c_max n, because at least one hypothesis is eliminated by each question. Define π' to be the distribution obtained from π by adding c_min/(c_max n^3) mass to every hypothesis h for which π(h) < c_min/(c_max n^3), and subtracting the corresponding total mass from a single hypothesis h' with π(h') ≥ 1/n (there must be at least one such hypothesis). By construction, we have min_i π'(h_i) ≥ c_min/(c_max n^3). We can also bound the amount by which the cost of a tree changes as a result of rounding.
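Before bounding that change, here is a small sketch (ours; round_distribution is a hypothetical name) of the rounding step just described:

def round_distribution(pi, c_min, c_max):
    # Round pi to pi' as in the text: add c_min/(c_max * n^3) mass to each
    # hypothesis below that threshold, and subtract the total added mass
    # from one hypothesis with pi(h) >= 1/n (one always exists).
    n = len(pi)
    floor = c_min / (c_max * n ** 3)
    rounded = dict(pi)
    added = 0.0
    for h, p in pi.items():
        if p < floor:
            rounded[h] = p + floor
            added += floor
    donor = next(h for h, p in pi.items() if p >= 1.0 / n)
    rounded[donor] -= added        # total removed <= c_min/(c_max * n^2)
    return rounded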
Lemma 6. For any irreducible query tree T and any π,
$$\tfrac{1}{2}\, C(T, \pi) \le C(T, \pi') \le \tfrac{3}{2}\, C(T, \pi).$$
Proof. For the first inequality, let h' be the hypothesis we subtract mass from when rounding. The cost to identify h', c_T(h'), is at most c_max n. Since we subtract at most c_min/(c_max n^2) mass in total and c_T(h') ≤ c_max n, we then have
$$C(T, \pi') \ge C(T, \pi) - \frac{c_{min}}{c_{max} n^2}\, c_T(h') \ge C(T, \pi) - \frac{c_{min}}{n} \ge \frac{1}{2}\, C(T, \pi).$$
The last step uses C(T, π) > c_min and n ≥ 2. For the second inequality, we add at most c_min/(c_max n^3) mass to each hypothesis and Σ_h c_T(h) < c_max n^2, so
$$C(T, \pi') \le C(T, \pi) + \frac{c_{min}}{c_{max} n^3} \sum_{h \in H} c_T(h) \le C(T, \pi) + \frac{c_{min}}{n} \le \frac{3}{2}\, C(T, \pi).$$
The last step again uses C(T, π) > c_min and n ≥ 2.
We can finally give a bound for the greedy algorithm applied to π', in terms of n and c_max/c_min.
Theorem 4. For any π, the greedy query tree T^g for π' has cost at most
$$C(T^g, \pi) \le O\Big( C^* \ln\Big( n\, \frac{c_{max}}{c_{min}} \Big) \Big),$$
where C* ≜ min_T C(T, π).
Proof. Let T' be an optimal tree for π' and T* be an optimal tree for π. Using Theorem 2, min_i π'(h_i) ≥ c_min/(c_max n^3), and Lemma 6,
$$C(T^g, \pi) \le 2\, C(T^g, \pi') \le 72\, C(T', \pi') \ln\Big( n\, \frac{c_{max}}{c_{min}} \Big) \le 72\, C(T^*, \pi') \ln\Big( n\, \frac{c_{max}}{c_{min}} \Big) \le 108\, C(T^*, \pi) \ln\Big( n\, \frac{c_{max}}{c_{min}} \Big).$$
5 ε-Approximate Algorithm
Some of the non-traditional active learning scenarios involve a large number of possible questions. For example, in the batch active learning scenario we describe, there may be a question corresponding to every subset of single data point questions. In these scenarios, it may not be possible to exactly find the question with largest shrinkage-cost ratio. It is not hard to extend our analysis to a strategy that at each step finds a question q_i with
$$\frac{\Delta_i(S, \pi_S)}{c_i} \ge (1 - \epsilon) \max_j \frac{\Delta_j(S, \pi_S)}{c_j}$$
for ε ∈ [0, 1). One can show ε > 0 only introduces a 1/(1 − ε) factor into the bound. Kosaraju et al. [11] report a similar extension to their result.
6 Related Work
Table 1 summarizes previous results analyzing greedy approaches to this prob-
lem. A number of these results were derived independently in different contexts.
Our work gives the first approximation result for the general setting in which
there are more than two possible responses to questions, non uniform question
costs, and a non uniform distribution over objects. We give bounds for two al-
gorithms, one with performance independent of the query costs and one with
performance independent of the distribution over objects. Together these two
bounds match all previous bounds for less general settings. We also note that Kosaraju et al. [11] only mention an extension to non-binary queries (Remark 1), and our work is the first to give a full proof of an O(log n) bound for the case of non-binary queries and non-uniform distributions over objects.
Our work and the work we extend are examples of exact active learning. We
seek to exactly identify a target hypothesis from a finite set using a sequence of
queries. Other work considers active learning where it suffices to identify with
high probability a hypothesis close to the target hypothesis [7, 2]. The exact and
approximate problems can sometimes be related [10].
Most theoretical work in active learning assumes unit costs and simple label
queries. An exception, Hanneke [9] also considers a general learning framework
in which queries are arbitrary and have known costs associated with them. In
fact, the setting used by Hanneke [9] is more general in that questions are al-
lowed to have more than one valid answer for each hypothesis. Hanneke [9]
gives worst-case upper and lower bounds in terms of a quantity called the Gen-
eral Identification Cost and related quantities. There are interesting parallels
between our average-case analysis and this worst-case result.
Practical work incorporating costs in active learning [12, 8] has also considered
methods that maximize a benefit-cost ratio similar in spirit to the method used
here. However, Settles et al. [12] suggest this strategy may not be sufficient for
practical cost savings.
7 Open Problems
Chakaravarthy et al. [3] show it is NP-hard to approximate the optimal query
tree within a factor of Ω(log n) for binary queries and non uniform π. This hard-
ness result is with respect to the number of objects. Some open questions remain.
For the more general setting with non uniform query costs, is there an algorithm
with an approximation ratio independent of both π and ci ? The simple round-
ing technique we use seems to require dependence on ci , but a more advanced
method could avoid this dependence. Also, can the Ω(log n) hardness result be
extended to the more restrictive case of uniform π? It would also be interesting
to extend our analysis to allow for questions to have more than one valid answer
for each hypothesis. This would allow queries which ask for a positively labeled
example from a set of examples. Such an extension appears non trivial, as a
straightforward extension assuming the given answer is randomly chosen from
the set of valid answers produces a tree in which the mass of hypotheses is split
across multiple branches, affecting the approximation.
Much work also remains in the analysis of other active learning settings with
general queries and costs. Of particular practical interest are extensions to ag-
nostic algorithms that converge to the correct hypothesis under no assumptions
[7, 2]. Extensions to treat label costs, partial label queries, and batch mode ac-
tive learning are all of interest, and these learning algorithms could potentially
be extended to treat these three sub problems at once using a similar setting.
For some of these algorithms, even without modification we can guarantee
the method does no worse than passive learning with respect to label cost. In
particular, Dasgupta et al. [7] and Beygelzimer et al. [2] both give algorithms
that iterate through T examples, at each step requesting a label with probability
p_t. These algorithms are shown to not do much worse (in terms of generalization
error) than the passive algorithm which requests every label. Because the al-
gorithm queries for labels for a subset of T i.i.d. examples, the label cost of
the algorithm is also no worse than the passive algorithm requesting T random
labels. It remains an open problem however to show these algorithms can do
better than passive learning in terms of label cost (most likely this will require
modifications to the algorithm or additional assumptions).
References
[1] Adler, M., Heeringa, B.: Approximating optimal binary decision trees. In: Goel,
A., Jansen, K., Rolim, J.D.P., Rubinfeld, R. (eds.) APPROX and RANDOM 2008.
LNCS, vol. 5171, pp. 1–9. Springer, Heidelberg (2008)
[2] Beygelzimer, A., Dasgupta, S., Langford, J.: Importance weighted active learning.
In: ICML (2009)
[3] Chakaravarthy, V.T., Pandit, V., Roy, S., Awasthi, P., Mohania, M.: Decision
trees for entity identification: approximation algorithms and hardness results. In:
PODS (2007)
[4] Chakaravarthy, V.T., Pandit, V., Roy, S., Sabharwal, Y.: Approximating decision
trees with multiway branches. In: ICALP (2009)
[5] Cover, T.M., Thomas, J.A.: Elements of Information Theory, 2nd edn. Wiley-
Interscience, Hoboken (2006)
[6] Dasgupta, S.: Analysis of a greedy active learning strategy. In: NIPS (2004)
[7] Dasgupta, S., Hsu, D., Monteleoni, C.: A general agnostic active learning algo-
rithm. In: NIPS (2007)
[8] Haertel, R., Seppi, K.D., Ringger, E.K., Carroll, J.L.: Return on investment for
active learning. In: NIPS Workshop on Cost-Sensitive Learning (2008)
[9] Hanneke, S.: The cost complexity of interactive learning (unpublished, 2006), http://www.cs.cmu.edu/~shanneke/docs/2006/cost-complexity-working-notes.pdf
[10] Hanneke, S.: Teaching dimension and the complexity of active learning. In:
Bshouty, N.H., Gentile, C. (eds.) COLT 2007. LNCS (LNAI), vol. 4539, pp. 66–81.
Springer, Heidelberg (2007)
[11] Kosaraju, S.R., Przytycka, T.M., Borgstrom, R.: On an optimal split tree problem.
In: Dehne, F., Gupta, A., Sack, J.-R., Tamassia, R. (eds.) WADS 1999. LNCS,
vol. 1663, pp. 157–168. Springer, Heidelberg (1999)
[12] Settles, B., Craven, M., Friedland, L.: Active learning with real annotation costs.
In: NIPS Workshop on Cost-Sensitive Learning (2008)
Canonical Horn Representations and Query Learning
M. Arias and J.L. Balcázar
1 Introduction
The present paper is the result of an attempt to better understand the classic
algorithm by Angluin, Frazier, and Pitt [2] that learns propositional Horn for-
mulas. A number of intriguing questions remain open regarding this algorithm;
in particular, we were puzzled by the following one: along a run of the algo-
rithm, queries made by the algorithm depend heavily upon the counterexamples
selected as answers to the previous queries. It is therefore natural to expect
the outcome of the algorithm to depend on the answers received along the run.
However, attempts at providing an example of such behavior consistently fail.
In this paper we prove that such attempts must in fact fail: we describe a
canonical representation of Horn functions in terms of implications, and show
that the algorithm of [2] always outputs this particular representation. It turns
out that this canonical representation is well-known in the field of Formal Con-
cepts, and bears the name of the authors that, to the best of our knowledge,
first described it: the Guigues-Duquenne basis or GD basis [7, 12]. In addition,
the GD basis has the important quality of being of minimum size.
The GD basis is defined for definite Horn formulas only. We extend the notion
of GD basis to general Horn formulas by means of a reduction from general to definite Horn functions. (Work partially supported by MICINN projects SESAAME-BAR (TIN2008-06582-C03-01) and FORMALISM (TIN2007-66523).)
2 Preliminaries
We work within the standard framework in logic, where one is given an indexable
set X of propositional variables of cardinality n, Boolean functions are subsets of the Boolean hypercube {0,1}^n, and these functions are represented by logical formulas over the variable set in the standard way. Assignments are partially
ordered bitwise according to 0 ≤ 1 (the usual partial order of the hypercube);
the notation is x ≤ y. Readers not familiar with standard definitions of assign-
ment, assignment satisfaction or formula entailment (|=), literal, term, clause,
etc. should consult a standard textbook, e.g., [6]. A particularity of our work
is that we identify assignments x ∈ {0,1}^n with variable subsets α ⊆ X in the standard way, by connecting the variable subsets with the bits that are set to 1 in the assignments. We denote this explicitly when necessary with the functions x = BITS(α) and α = ONES(x). Therefore, x |= α iff α ⊆ ONES(x) iff BITS(α) ≤ x.
We are only concerned with Horn functions, and their representations using
conjunctive normal form (CNF). A Horn CNF formula is a conjunction of Horn
clauses. A clause is a disjunction of literals. A clause is definite Horn if it contains
exactly one positive literal, and it is negative if all its literals are negative. A
clause is Horn if it is either definite Horn or negative.
Horn clauses are generally viewed as implications where the negative liter-
als form the antecedent of the implication (a positive term), and the singleton
consisting of the positive literal, if it exists, forms the consequent of the clause.
Note that both can be empty; if the consequent is empty, then we are dealing
with a negative Horn clause. Furthermore, we allow our representations of Horn
CNF to deviate slightly from the standard in that we represent clauses sharing the same antecedent together in one implication. Namely, an implication α → β, where both α and β are possibly empty sets of propositional variables, is to be interpreted as the conjunction of definite Horn clauses ∧_{b∈β} (α → b) if β ≠ ∅, and as the negative clause α → ∅ if β = ∅.¹ A semantically equivalent interpretation is to see both sets of variables α and β as positive terms; the Horn formula in its standard form is obtained by distributivity on the variables of β. Note that x |= ∅ for any assignment x when ∅ appears as an antecedent; however, this is not the case with respect to the right hand sides of non-definite Horn clauses since, there, by convention, β = ∅ stands for the unsatisfiable.
We refer to our generalized notion of conjunction of clauses sharing the an-
tecedent as implication; the term clause retains its classical meaning (namely,
a disjunction of literals). Notice that an implication may not be a clause, e.g.
(a → bc) corresponds in classical notation to the formula ¬a ∨ (b ∧ c). Thus,
(a → bc), (ab → c) and (ab → ∅) are Horn implications but only the latter
two are Horn clauses. Furthermore, we often use sets to denote conjunctions, as
we
do with positive terms, also at other levels: a generic (implicational) CNF
i (αi → βi ) is often denoted in this text by {(αi → βi )}i . Parentheses are
mostly optional and generally used for ease of reading.
Clearly, an assignment x ∈ {0, 1}n satisfies the implication α → β, denoted
x |= α → β, if it either fails the antecedent or satisfies the consequent, that is,
x |= α or x |= β respectively, where now we are interpreting both α and β as
positive terms.
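Concretely, with assignments identified with variable subsets as above, this satisfaction test is a few lines (our own sketch; the name satisfies is hypothetical):

def satisfies(x, antecedent, consequent):
    # x |= (alpha -> beta), with x, alpha, beta given as sets of variables.
    # By the convention above, an empty consequent stands for the
    # unsatisfiable, so the implication is a negative clause.
    if not antecedent <= x:        # x fails alpha: implication holds trivially
        return True
    return bool(consequent) and consequent <= x

# Examples: satisfies({'a'}, {'a'}, {'b', 'c'}) is False, since the assignment
# sets a but neither b nor c; satisfies({'a'}, {'a', 'b'}, set()) is True,
# since the negative clause's antecedent ab is not satisfied by x = {a}.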
A Horn function admits several syntactically different Horn CNF representa-
tions; in this case, we say that these representations are equivalent. Such rep-
resentations are also known as theories or bases for the Boolean function they
represent. The size of a Horn function is the minimum number of clauses that a
Horn CNF representing it must have. The implication size of a Horn size is de-
fined analogously, but allowing formulas to have implications instead of clauses.
Clearly, every clause is an implication, and thus the implication size of a given
Horn function is always at most that of its standard size as measured in the
number of clauses. Not all Boolean functions are Horn. The following semantic
characterization is a well-known classic result proved in the context of proposi-
tional Horn logic e.g. in [10]:
Theorem 1. A Boolean function admits a Horn CNF basis if and only if the
set of assignments that satisfy it is closed under bit-wise intersection.
An implication in a Horn CNF H is redundant if it can be removed from H
without changing the Horn function represented. A Horn CNF is irredundant if
it does not contain any redundant implication. Notice that an irredundant H
¹ Notice that this differs from an alternative, older interpretation [11], nowadays obsolete, in which α → β represents the clause (¬x_1 ∨ . . . ∨ ¬x_k ∨ y_1 ∨ . . . ∨ y_{k'}), where α = {x_1, . . . , x_k} and β = {y_1, . . . , y_{k'}}. Though identical in syntax, the semantics are different; in particular, ours can only represent a conjunction of definite Horn clauses whereas the other represents a general, possibly non-Horn clause.
may still contain other sorts of redundancies, such as consequents larger than
strictly necessary. Such redundancies are not contemplated in this paper.
In this section we characterize and show how to build a canonical basis for
definite Horn functions that is of minimum implication size. Our construction
is based on the notion of saturation, a notion that has been used already in the
context of Horn functions and seems very natural [4, 5]. It turns out that this
canonical form is, in essence, the Guigues-Duquenne basis (the GD basis) which
was introduced in [7]. Here, we introduce it in a form that is, to our knowledge,
novel, although it is relatively close to the approach of [12].
We begin by defining saturation and then prove several interesting properties
that serve as the basis for our work.
The following lemma is a variant of a result of [12], translated into our notation. We include the proof, which is in fact missing from [12].
Lemma 3. Let B = {α_i → α_i*}_i be a saturated basis for a definite Horn function. Then for all i and β it holds that (β ⊆ α_i and β* ⊂ α_i*) ⇒ β* ⊆ α_i.
Proof. Let us assume that the conditions of the implication are true, namely, that β ⊆ α_i and β* ⊂ α_i*. We proceed by cases: if β is closed, then β = β* and the implication is trivially true, since β ⊆ α_i clearly implies β* ⊆ α_i when β = β*. Otherwise, β is not closed. Let β = β^(0) ⊂ β^(1) ⊂ · · · ⊂ β^(k) = β* be the series of elements constructed by the forward chaining procedure described in Section 2. We argue that if β^(l) ⊆ α_i and β^(l) ⊂ β*, then β^(l+1) ⊆ α_i as well. By repeatedly applying this fact to all the elements along the chain, we arrive at the desired conclusion, namely, β* ⊆ α_i. Let β^(l) be such that β^(l) ⊆ α_i and β^(l) ⊂ β*. Thus β^(l) violates some implication (α_k → α_k*) ∈ B. Our forward chaining procedure assigns β^(l+1) = β^(l) ∪ α_k*. The following inclusions hold: α_k ⊆ β^(l) because β^(l) ⊭ α_k → α_k*, and β^(l) ⊆ α_i by assumption; hence α_k ⊆ α_i. Using Lemma 2, and noticing the fact that, actually, α_k ⊂ α_i since β^(l) ⊂ α_i (otherwise we could not have β* ⊂ α_i*), we conclude that α_k* ⊆ α_i. We have that α_k* ⊆ α_i and β^(l) ⊆ α_i, so that β^(l+1) = β^(l) ∪ α_k* ⊆ α_i as required.
The next result characterizes our version of the canonical basis based on the
notion of saturation. The proof does rely heavily on Lemma 3, which is adapted
from a result from [12]. The connection to saturation and our proof technique
are indeed novel.
Theorem 4. Definite Horn functions have a unique saturated basis.
Proof. Let B_1 and B_2 be two equivalent and saturated bases. Let a → a* be an arbitrary implication in B_1. We show that a → a* ∈ B_2 as well. By symmetry, this implies that B_1 = B_2.
By Lemma 1(2), we have that BITS(a) ⊭ B_1, and thus BITS(a) must violate some implication b → b* ∈ B_2; hence it must hold that b ⊆ a but b* ⊄ a. The rest of the proof is concerned with showing that assuming b ⊂ a leads to a contradiction. If so, then b = a and thus a → a* ∈ B_2 as well, as desired.
Let us assume then that b ⊂ a, so that, by monotonicity, b* ⊆ a*. If b* ⊂ a*, then we can use Lemma 3 with α_i = a and β = b and conclude that b* ⊆ a, contradicting the fact that BITS(a) ⊭ (b → b*). Thus, it must be that b* = a*. Now, consider b → a* ∈ B_2. Clearly b is not closed (otherwise, b = b*, and then
Let H(α) be those clauses of H whose antecedents fall in the same equivalence class as α, namely, H(α) = {α_i → β_i | α_i → β_i ∈ H and α* = α_i*}.
Given a Horn function H and a variable subset α, we introduce a new operator •: α• is the closure of α with respect to the subset of clauses H \ H(α). That is, in order to compute α• one does forward chaining starting with α, but one is not allowed to use the clauses in H(α). This operator has been used in the literature before in related contexts, for example in [12].
Example 2. Let H = {a → b, a → c, c → d}. Then (ac)* = abcd but (ac)• = acd, since H(ac) = {a → b, a → c} and we are only allowed to use the clause c → d when computing (ac)•.
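A sketch (ours) of forward chaining for both operators, with H given as a list of (antecedent, consequent) pairs of variable sets; the names closure and bullet are hypothetical:

def closure(alpha, H):
    # alpha*: forward-chaining closure of alpha under the implications H.
    closed = set(alpha)
    changed = True
    while changed:
        changed = False
        for ant, cons in H:
            if ant <= closed and not cons <= closed:
                closed |= cons
                changed = True
    return closed

def bullet(alpha, H):
    # alpha•: closure of alpha under H minus the implications whose
    # antecedent has the same closure as alpha (the class H(alpha)).
    star = closure(alpha, H)
    rest = [(a, c) for a, c in H if closure(a, H) != star]
    return closure(alpha, rest)

# Example 2: with H = [({'a'},{'b'}), ({'a'},{'c'}), ({'c'},{'d'})],
# closure({'a','c'}, H) == {'a','b','c','d'} and bullet({'a','c'}, H) == {'a','c','d'}.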
Computing the GD basis of a definite Horn H. First, saturate every clause C = α → β in H by replacing it with the implication α• → α*. Then, remove possibly redundant implications, namely: (1) remove implications s.t. α• = α*; (2) remove duplicates; and (3) remove subsumed implications, i.e., implications α• → α* for which there is another implication β• → β* s.t. α* = β* but β• ⊂ α•.
Let us denote by GD(H) the implicational definite Horn CNF obtained by applying this procedure to input H. Note that this algorithm is designed to work when given a definite Horn CNF in either implicational or standard form.
The procedure can be computed in quadratic time, since finding the closures of the antecedent and consequent of each clause can be done in linear time w.r.t. the size of the initial Horn CNF H.
Example 3. Let H = {a → b, a → c, ad → e, ab → e}. We compute the closures of the antecedents: a⋆ = abce, (ad)⋆ = abcde, and (ab)⋆ = abce. Therefore, H(a) = {a → b, a → c, ab → e}, H(ad) = {ad → e}, and H(ab) = H(a). Thus, a• = a, (ad)• = abcde, and (ab)• = ab. After saturation of every clause in H, we obtain H′ = {a → abce, a → abce, abcde → abcde, ab → abce}. It becomes clear that the third clause was, in fact, redundant. Also, the fourth implication is subsumed by the first two (after right-saturation), and we can group the first and second implications together into a single one. Hence, GD(H) = {a → abce}.
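Under the same assumed representation, the whole procedure fits in a few lines; the following sketch reuses star and bullet from the previous sketches and reproduces Example 3.

def gd(H):
    """Left- and right-saturate every clause, then drop trivial, duplicate
    and subsumed implications, as in the procedure above."""
    sat = {(bullet(a, H), star(a, H)) for (a, b) in H}  # the set removes duplicates
    sat = {(l, r) for (l, r) in sat if l != r}          # drop trivial implications
    return {(l, r) for (l, r) in sat
            if not any(r2 == r and l2 < l for (l2, r2) in sat)}  # drop subsumed

H = [(frozenset('a'),  frozenset('b')),
     (frozenset('a'),  frozenset('c')),
     (frozenset('ad'), frozenset('e')),
     (frozenset('ab'), frozenset('e'))]
assert gd(H) == {(frozenset('a'), frozenset('abce'))}   # Example 3: GD(H) = {a -> abce}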
In the remainder of this section we show that the given algorithm computes the unique saturated representation of its input. First, we need a simple lemma:
Lemma 4. Let H be any basis for a definite Horn CNF over variables X = {x1, . . . , xn}. For any α, β, γ ⊆ X, the following statements hold:
1. α ⊆ α• ⊆ α⋆;
2. If H ⊨ β → γ and β ⊆ α•, but β⋆ ⊂ α⋆, then γ ⊆ α•.
Lemma 5. The algorithm computing GD(H) outputs the GD basis of H for
any definite Horn formula H.
Proof. Let H be the input to the algorithm, and let H′ be its output. We show that H′ must be saturated. Let α → β be an arbitrary implication in the output H′. Because of the initial saturation process, we can refer to this implication
examples (these become the antecedents of the clauses in the hypotheses). The
argument of an equivalence query is prepared from the list N = (x1 , . . . , xt )
of negative examples combined with the set P of positive examples. The query
corresponds to the following intuitive bias: everything is assumed positive unless
some (negative) xi ∈ N suggests otherwise, and everything that some xi sug-
gests negative is assumed negative unless some positive example y ∈ P suggests
otherwise. This is exactly the intuition in the hypothesis constructed by the AFP
algorithm.
For the set of positive examples P, denote Px = {y ∈ P : x ≤ y}. The hypothesis to be queried, given the set P and the list N = (x1, . . . , xt), is denoted H(N, P) and is defined as H(N, P) = {ONES(xi) → ONES(⋀Pxi) | xi ∈ N}.
A positive counterexample is treated just by adding it to P. A negative counterexample y is used either to refine some xi into a smaller negative example, or to add xt+1 to the list. Specifically, let
i := min({j : MQ(xj ∧ y) is negative and xj ∧ y < xj} ∪ {t + 1})
and then refine xi to xi ∧ y, in case i ≤ t, or else make xt+1 = y, subsequently increasing t. The value of i is found through membership queries on all the xj ∧ y for which xj ∧ y < xj holds.
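As an illustration before the full pseudocode in Figure 1 below, here is a hedged Python sketch of this update and of the hypothesis construction. The names member, hypothesis and refine are ours, member is a hypothetical membership oracle, and examples are represented as frozensets of their true variables—an assumption, not the paper's encoding.

def hypothesis(N, P):
    """H(N, P): an implication ONES(x_i) -> ONES(meet of P_{x_i}) per x_i in N.
    P always contains the top element 1^n, so the intersection is nonempty."""
    return [(x, frozenset.intersection(*(y for y in P if x <= y))) for x in N]

def refine(N, y, member):
    """Process a negative counterexample y."""
    for i, x in enumerate(N):
        z = x & y                    # x_i ∧ y
        if z < x and not member(z):  # x_i ∧ y < x_i and MQ(x_i ∧ y) is negative
            N[i] = z                 # refine x_i
            return N
    N.append(y)                      # no refinable x_i: append y as x_{t+1}
    return N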
AFP()
1   N ← ( ) /* empty list */
2   P ← {1^n} /* top element */
3   t ← 0
4   while EQ(H(N, P)) = (“no”, y) /* y is the counterexample */
5      do if y ⊭ H(N, P)
6         then add y to P
7         else find the first i such that /* N = (x1, . . . , xt) */
8              xi ∧ y < xi, and /* that is, xi ≰ y */
9              xi ∧ y is negative /* use membership query */
10             if found
11             then xi ← xi ∧ y /* replace xi by xi ∧ y in N */
12             else t ← t + 1; xt ← y /* append y to end of N */
13  return H(N, P)
The AFP algorithm is described in Figure 1. In order to prove that its output
is indeed the GD basis, we need the following lemmas from [4]:
Lemma 6 (Lemma 2 from [4]). Along the running of the AFP algorithm, at
the point of issuing the equivalence query, for every xi and xj in N with i < j
there exists a positive example z such that xi ∧ xj ≤ z ≤ xj .
Lemma 7 (Variant of Lemma 1 from [4]). Along the running of the AFP algorithm, at the point of issuing the equivalence query, for every xi and xj in N with i < j and xi ≤ xj, it holds that ⋀Pxi ≤ xj.
Theorem 6. AFP, run on a definite Horn target, always outputs the GD basis
of the target concept.
As for the definite case, “saturated” means that the general Horn CNF in ques-
tion is both left- and right-saturated. We must see that this is the “correct”
definition in some sense:
This gives us a way to compute the saturation (that is, the GD basis) of a given
general Horn CNF:
Theorem 7. General Horn functions have a unique saturated basis. This basis, which we denote GD(H), can be computed by GD(H) = g^{-1}(GD(g(H))).
Similarly to the case of definite Horn functions, taking GD(H) does not increase the number of implications, and therefore if H is of minimum size, GD(H) must be of minimum size as well. This, together with the uniqueness of the saturated representation, implies that:
Theorem 9. The AFP algorithm always outputs the GD basis of the target
concept.
Proof (Sketch). The argumentation follows, essentially, the same steps as the
analogous proof in [4], because, by Lemma 9, the GD basis in the general case
is saturated, and therefore all required facts carry over to the general case. Let
f be a Boolean function that cannot be represented with m Horn implications.
It is illustrative to note the relation between this set of certificates for f and its GD basis: the assignments xi• and xi⋆ correspond exactly to the left- and right-hand sides of the (saturated) definite implications in GD(f). For negative clauses, only the (saturated) left-hand side of the implication, xi•, matters.
References
1. Angluin, D.: Queries revisited. Theoretical Computer Science 313, 175–194 (2004)
2. Angluin, D., Frazier, M., Pitt, L.: Learning conjunctions of Horn clauses. Machine
Learning 9, 147–164 (1992)
3. Angluin, D., Kharitonov, M.: When won’t membership queries help? Journal of
Computer and System Sciences 50(2), 336–355 (1995)
4. Arias, M., Balcázar, J.L.: Query learning and certificates in lattices. In: Freund, Y.,
Györfi, L., Turán, G., Zeugmann, T. (eds.) ALT 2008. LNCS (LNAI), vol. 5254,
pp. 303–315. Springer, Heidelberg (2008)
5. Arias, M., Feigelson, A., Khardon, R., Servedio, R.A.: Polynomial certificates for
propositional classes. Inf. Comput. 204(5), 816–834 (2006)
6. Chang, C.-L., Lee, R.C.-T.: Symbolic Logic and Mechanical Theorem Proving.
Academic Press, Inc., Orlando (1973)
7. Guigues, J.L., Duquenne, V.: Familles minimales d’implications informatives resultant d’un tableau de donnees binaires. Math. Sci. Hum. 95, 5–18 (1986)
8. Hegedüs, T.: On generalized teaching dimensions and the query complexity of
learning. In: Proceedings of the Conference on Computational Learning Theory,
pp. 108–117. ACM Press, New York (1995)
9. Hellerstein, L., Pillaipakkamnatt, K., Raghavan, V., Wilkins, D.: How many queries
are needed to learn? Journal of the ACM 43(5), 840–862 (1996)
10. Khardon, R., Roth, D.: Reasoning with models. Artificial Intelligence 87(1-2), 187–
213 (1996)
11. Wang, H.: Toward mechanical mathematics. IBM Journal of Research and Development 4, 2–22 (1960)
12. Wild, M.: A theory of finite closure spaces based on implications. Advances in
Mathematics 108, 118–139 (1994)
Learning Finite Automata Using Label Queries
1 Introduction
The problem of learning the behavior of a finite automaton has been considered
in several domains, including language learning and environment learning by
robots. Many interesting questions remain about the kinds of information that
permit efficient learning of finite automata.
One basic result is that finite automata are not learnable using a polynomial
number of membership queries. Consider a “password machine”, that is, an
acceptor with (n + 2) states that accepts exactly one binary string of length n; the learner may query (2^n − 1) strings before finding the one that is accepted. In
this case, the learner gets no partial information from the unsuccessful queries.
However, Freund et al. [5] show that regardless of the topology of the underly-
ing automaton, if its states are randomly labeled with 0 or 1, then a robot taking
a random walk on the automaton can learn to predict the labels while making
only a polynomial number of errors of prediction. Random labels on the states
provide a rich source of information that can be used to distinguish otherwise
difficult-to-distinguish pairs of states.
In a different setting, Becerra-Bonache, Dediu and Tı̂rnăucă [3] introduced
correction queries to model a kind of correction provided by a teacher to
a learner when the learner’s utterance is not grammatically correct. In their
model, a correction query with a string w gives the learner not only member-
ship information about w, but also, if w is not accepted, either the minimum
continuation of w that is accepted, or the information that no continuation of
w is accepted. In certain cases, corrections may provide a substantial amount
of partial information for the learner. For example, for a password machine, a
prefix of the password will be answered with the rest of the password. We may
think of correction queries as labeling each state q of the automaton with the
string rq that is the response to any correction query w that arrives at q.
In both of these cases, labels on states may facilitate the learning of finite
automata: randomly chosen labels in the work of Freund et al. and meaningfully
chosen labels in the work of Becerra-Bonache, Dediu and Tı̂rnăucă. In this paper
we explore the general idea of adding labels to the states of an automaton to
make it easier to learn. That is, we allow a teacher to prepare an automaton M
for learning by adding labels to its states (either carefully or randomly chosen).
When the learner queries a string, the learner receives not only the original
output of M for that string, but also the label attached to that state by the
teacher. In an extension of this idea, we also allow the teacher to “unfold” the
machine M to produce copies of a state that may then be given different labels.
These ideas are also relevant to automata testing [7] – labeling and unfolding
automata can make their structure easier to verify.
Depending on how labels are assigned, learning may or may not become easier.
If each state is assigned a unique label, the learning task becomes easy because
the learner knows which state the machine reaches on any given query. However, if
the labels are all the same, they give no additional information and learning may
require an exponential number of queries (as in the case of membership queries.)
Hence we focus on questions of the following sort. Given an automaton, how
can a teacher use a limited set of labels to make the learning problem easier?
If random labels are sprinkled on the states of an automaton, how much does
that help the learner? How few labels can we use and still make the learning
problem tractable? Other questions concern the structure of the automaton itself.
For example, we may consider changing the structure of the automaton before
labeling it. We also consider the problem of learning randomly labeled automata
with random structure.
2 Preliminaries
2.1 Labelings
If M is a finite automaton with output, then a labeling of M is a function
mapping Q to a set L of labels, the label alphabet. We use M to construct
Proof. Recall that we have assumed that |X| and |Y| are both at least 2; we consider |Y| = 2. Domaratzki, Kisman and Shallit [4] have shown that there are at least
(|X| − o(1)) n 2^{n−1} n^{(|X|−1)n}
distinct languages accepted by acceptors with n states. Because each label query returns one of at most 2·|L| values, an information-theoretic argument gives the claimed lower bound on the number of label queries. As a corollary, when |X| and |L| are constants, we have a lower bound of Ω(n log(n)) label queries.
Proof. The teacher assigns a unique integer label between 1 and n to each state.
The learner asks a label query with the empty string to determine the output
and label of the start state, and then explores the transitions from the start state
by querying each a ∈ X. After querying an input string w, the label indicates
whether this state has been visited before. If the state is new, the learner explores
all the transitions from it by querying wa for each a ∈ X. Thus, after querying
at most |X|n strings, the learner knows the structure and outputs of the entire
automaton. The lower bound shows that this is asymptotically optimal if the
label set L has n elements.
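A small Python sketch of this exploration, under assumed interfaces (not the paper's): label_query is a hypothetical oracle returning the output and label of the state reached by an input string, and X is a set of single-character symbols.

def learn_with_unique_labels(X, label_query):
    out0, lab0 = label_query("")       # output and label of the start state
    outputs = {lab0: out0}
    transitions = {}
    frontier = [("", lab0)]            # access string and label of each new state
    for w, src in frontier:
        for a in X:
            out, lab = label_query(w + a)
            transitions[(src, a)] = lab
            if lab not in outputs:     # unseen label: a genuinely new state
                outputs[lab] = out
                frontier.append((w + a, lab))
    return transitions, outputs

With unique labels, each of the n states triggers |X| exploration queries, matching the |X|n bound of the proof.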
Theorem 1. For each automaton with n states, there is a helpful labeling using 2|X| different labels such that the automaton can be learned using O(|X|n^2) label queries.
However, the number of queries, O(n^2), does not meet the Ω(n log n) lower bound, and the number of different labels is large. For a restricted class of au-
tomata, there is a helpful labeling with fewer labels that permits learning with
an asymptotically optimal O(n log n) label queries. To appreciate the generality
of Theorem 2, we note once more that every strongly connected automaton is
1-concentrating, and as we will see in Lemma 1, automata with a small input
alphabet can be unfolded to have small in-degree.
Proof. We give the construction for 1-concentrating automata and indicate how
to generalize it at the end of the proof. Given a 1-concentrating automaton M
the teacher chooses as the root a node reachable from all other nodes in the
transition graph of M . The depth of a node is the length of the shortest path
from that node to the root. The teacher then chooses a spanning tree T directed
inward to the root by choosing a parent for each non-root node. (One way to do
this is to let the parent of a node q be the first node reached along a shortest
path from q to the root.) The teacher assigns, as part of the label for each node
q, an element a ∈ X such that τ (q, a) is the parent of q.
The teacher now adds more information to the labels of the nodes, which we
call color, using the colors yellow, red, green, and blue. The root is the unique
node colored yellow. Let t = log n; t bits are enough to give a unique identifier
for every node of the graph. Each node at depth a multiple of (t + 1) is colored
red. For each red node v we choose a unique identifier of t bits (c1 , c2 , . . . , ct )
encoded as green and blue labels. Now consider the maximal subtree rooted at
v containing no red nodes. For each level i from 1 to the depth of the subtree,
all the nodes at level i of the subtree are colored with ci (which is either blue
or green.) The teacher has (so far) used 3|X| + 1 labels – a direction and one of
three colors per non-root node, and a unique identifier for the root.
Given this labeling, the learner can start from any state and reach a local-
ization state whose identifier is known, as follows. The learner uses the parent
component of the labels to go up the tree until it passes one red node and arrives
at a second red node, or arrives at the root (whichever comes first), keeping track
of the labels seen. If the learner reaches the root, it knows where it is. Other-
wise, the learner interprets the labels seen between the first and second red node
encountered as an identifier for the node v reached. This involves observing at
most (2t+2) labels. Thus, even if the in-degree is not bounded, a 1-concentrating
automaton can be labeled so that with O(log(n)) label queries the learner can
reach a uniquely identified localizing state.
If each node of the tree T also has in-degree bounded by k, another component
of the label for each non-root node identifies which of the k possible predecessors
of its parent it is (numbered arbitrarily from 1 to at most k.) If the learner col-
lects these values on the path from u to its localization node v, then we have an
identifier for u with respect to v. Thus it takes O(log(n)) label queries to learn any
node’s identifier. If the node has not been encountered before, its |X| transitions
must be explored, as in Proposition 2. This gives us a learning algorithm using
O(|X|n log(n)) label queries. The labeling uses at most 3k|X| + 1 different labels.
If the automaton is c-concentrating for some c > 1, then the teacher selects
a set of at most c nodes such that every node can reach at least one of them
and constructs a forest of at most c inward directed disjoint spanning trees, and
proceeds as above. This increases the number of unique identifiers for the roots
from 1 to c.
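The teacher's construction can be sketched as follows; the state numbering, the label representation as (parent letter, color) pairs, and the all-zero default identifier for nodes directly below the root are our own simplifying assumptions.

import math
from collections import deque

def teacher_labeling(n, X, tau, root):
    # BFS over reversed edges: depth[q] is the length of a shortest path from
    # q to the root; parent_letter[q] is a letter moving one step closer.
    depth, parent_letter = {root: 0}, {}
    queue = deque([root])
    while queue:
        q = queue.popleft()
        for p in range(n):
            for a in X:
                if tau[p][a] == q and p not in depth:
                    depth[p], parent_letter[p] = depth[q] + 1, a
                    queue.append(p)

    t = max(1, math.ceil(math.log2(n)))           # t bits identify a node
    is_red = lambda q: q != root and depth[q] % (t + 1) == 0
    reds = [q for q in range(n) if is_red(q)]
    ident = {v: format(i, f"0{t}b") for i, v in enumerate(reds)}  # unique t-bit ids

    labels = {root: (None, "yellow")}
    for q in range(n):
        if q == root:
            continue
        if is_red(q):
            labels[q] = (parent_letter[q], "red")
            continue
        # climb toward the root to find q's red (or root) ancestor v and q's
        # level i inside the red-free subtree below v
        v, i = q, 0
        while v != root and not is_red(v):
            v = tau[v][parent_letter[v]]
            i += 1
        bit = ident.get(v, "0" * t)[i - 1]        # c_i encodes green/blue
        labels[q] = (parent_letter[q], "green" if bit == "0" else "blue")
    return labels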
In this section we turn from labels carefully chosen by the teacher to an indepen-
dent uniform random choice of labels for states from a label alphabet L. With
nonzero probability the labeling may be completely uninformative, so results in
this scenario incorporate a confidence parameter δ > 0 that is an input to the
learner. The goal of the learner is to learn an automaton that is output equiv-
alent to the hidden automaton M with probability at least (1 − δ), where this
probability is taken over the labelings of M . Results on random labelings can be
used in the careful labeling scenario: the teacher generates a number of random
labelings until one is found that has the desired properties.
We first review the learning scenario considered by Freund et al. [5]. There
is a finite automaton over the input alphabet X = {0, 1} and output alpha-
bet {+, −}, where the transition function and start state of the automaton are
arbitrary, but the output symbol for each state is chosen independently and uni-
formly from {+, −}. The learner moves from state to state in the target automa-
ton according to a random walk (the next input symbol is chosen independently
and uniformly from {0, 1}) and, after learning what the next input symbol will
be, attempts to predict the output (+ or −) of the next state. After the pre-
diction, the learner is told the correct output and the process repeats with the
next input symbol in the random walk. If the learner’s prediction was incorrect,
this counts as a prediction mistake. In the first scenario they consider, the
learner may reset the machine to the initial state by predicting ? instead of + or
−; this counts as a default mistake. In this model, the learner is completely
passive, dependent upon the random walk process to disclose useful information
about the behavior of the underlying automaton. For this setting they prove the
following.
Theorem 3 (Freund et al. [5]). There exists a learning algorithm that takes n and δ as input, runs in time polynomial in n and 1/δ, and with probability at least (1 − δ) makes no prediction mistakes and an expected O((n^5/δ^2) log(n/δ)) default mistakes.
The main idea is to use the d-signature tree of a state as the identifier for the state, where d ≥ 2 log(n^2/δ). For this setting, there are at least n^4/δ^2 strings in a signature tree of depth d. The following theorem of Trakhtenbrot and Barzdin’ [8] establishes that signature trees of this depth are sufficient; the bound involved is n^2 (1/|Y|)^{d/2}.
Theorem 5. For any positive integer s, any finite automaton with n states, over the input alphabet X and output alphabet Y, with its states randomly labeled with labels from a label alphabet L with |L| = |X|^s, can be learned using
O(|X| n^{1+4/s} / δ^{2/s})
label queries, with probability at least (1 − δ) (with respect to the choice of labeling).
Proof. Assume that the learning algorithm is given n, a bound on the number of states of the hidden automaton, and the confidence parameter δ > 0. It calculates a bound d = d(n, δ) (described below) and proceeds as follows, starting with the empty input string. To explore the input string w, the learning algorithm calculates the d-signature tree (in the labeled automaton) of the state reached by w, by making label queries on wz for all input strings z of length at most d. This requires O(|X|^d) queries. If this signature tree has not been encountered before, then the algorithm explores the transitions wa for all a ∈ X. Assuming that the labeling is “good”, that is, that all distinguishable pairs of states have a distinguishing string in the labeled automaton of length at most d, this correctly learns the output behavior of the hidden automaton using O(|X|^{d+1} n) label queries.
To apply Theorem 4, we assume that the hidden automaton M is an arbitrary finite automaton with output, with at most n states, input alphabet X and output alphabet Y. The labels randomly chosen from L then play the role of the random outputs in Theorem 4. There is a somewhat subtle issue: states distinguishable in M by their outputs may not be distinguishable in the labeled automaton by their labels alone. Fortunately, Freund et al. [5] have shown us how to address this point. In the first case, if two states of M are distinguishable by their outputs in M by a string of length at most d, then their d-signature trees (in the labeled automaton) will differ. Otherwise, if the shortest distinguishing string for the two states (using just outputs) is of length at least d + 1, then, generalizing the argument for Theorem 2 in [5] from |Y| = 2 to arbitrary |Y|, the probability that this pair of states is not distinguished by the random labeling by a string of length at most d is bounded above by (1/|Y|)^{(d+1)/2}. Summing over all pairs of states gives the required bound.
Thus, choosing
d ≥ (2/log|L|) log(n^2/δ)
suffices to ensure that the labeling is “good” with probability at least (1 − δ). If we use more labels, the signature trees need not be so deep and the algorithm does not need to make as many queries to determine them. In particular, if |L| = |X|^s, then the bound of O(|X|^{d+1} n) on the number of label queries used by the algorithm becomes
O(|X| n^{1+4/s} / δ^{2/s}),
completing the proof.
Corollary 1. Any finite automaton with n states can be learned using O(|X| n^{1+ε}) label queries with probability at least 1/2, when it is randomly labeled with |L| = f(|X|, ε) labels.
We remark that this implies that there exists a careful labeling with O(|X|^4) labels that achieves learnability with O(|X| n^2) label queries, substantially improving on the size of the label set used in Theorem 1. An open question is whether a random labeling with O(1) labels enables efficient learning of an arbitrary n-state automaton with O(n log n) queries with high probability.
Proof. The total number of automata with output having n states, input alphabet X and output alphabet Y is at most
n^{|X|n+1} |Y|^n.
Thus, N = O(|X| n log(n) + n log(|Y|)) bits suffice to specify any one of these machines.
The teacher chooses a ∈ X and unfolds the target automaton M as follows. The strings a^i for i = 0, 1, . . . , N − 1 each send the learner to a newly created state, which acts (with respect to transitions on other input symbols and output behavior) just as its counterpart in the original machine. The remaining states are unchanged. The unfolded automaton is output equivalent to M. The teacher then specifies M by labeling these N new states with the bits of the specification of M. The learner simply asks a sequence of N queries on strings of the form a^i to receive the encoding of the hidden machine.
This method does not work if we restrict the unfolding to O(|X|n) states, but
we show that this much unfolding is sufficient to reduce the in-degree of the
automaton to O(|X|).
Proof. Given M, we repeat the following process until it terminates. While there is some state q with in-degree greater than 2|X| + 1, split q into two copies, dividing the incoming edges as evenly as possible between the two copies, and duplicating all |X| outgoing edges for the second copy of q.
It is clear that each step of this process preserves the output behavior of M. To see that it terminates, for each node q let f(q) be the maximum of 0 and d_in(q) − (|X| + 1), where d_in(q) is the in-degree of q. Consider the potential function Φ that is the sum of f(q) over all nodes q in the transition graph. Φ is initially at most |X|n − (|X| + 1), and each step reduces it by at least (|X| + 1) − |X| = 1. Thus, the process terminates after no more than |X|n steps, producing an output-equivalent automaton M′ with no more than (|X| + 1)n states and in-degree at most 2|X| + 1.
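A sketch of this splitting process, under an assumed representation of the automaton (a list of per-state transition dictionaries and a parallel list of outputs); the function name and encoding are ours, not the paper's.

def split_high_indegree(trans, out, X):
    """Split states of in-degree > 2|X|+1 until none remain."""
    bound = 2 * len(X) + 1
    while True:
        incoming = {q: [] for q in range(len(trans))}
        for p in range(len(trans)):
            for a in X:
                incoming[trans[p][a]].append((p, a))
        q = next((s for s in incoming if len(incoming[s]) > bound), None)
        if q is None:
            return trans, out
        q2 = len(trans)
        trans.append(dict(trans[q]))       # duplicate the |X| outgoing edges
        out.append(out[q])
        for p, a in incoming[q][: len(incoming[q]) // 2]:
            trans[p][a] = q2               # divide incoming edges evenly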
Proof. Given a strongly connected automaton M with n states, the teacher uses the method of Lemma 1 to produce an output-equivalent machine M′ with at most (|X| + 1)n states and in-degree bounded by 2|X| + 1. This unfolding may not preserve the property of being strongly connected, but there is at least one state q that has at most (|X| + 1) copies in the unfolded machine M′. Because M is strongly connected, every state of M′ must be able to reach at least one of the copies of q, so M′ is (|X| + 1)-concentrating. Applying the method of Theorem 2, the teacher can use 3(2|X| + 1)|X| + (|X| + 1) labels to label M′ so that it can be learned with O(|X|^2 n log n) label queries.
We now consider uniform random labelings of the states when the teacher is
allowed to choose the unfolding of the machine.
Theorem 6. Any automaton with n states can be unfolded to have O(n log(n/δ)) states and randomly labeled with 2 labels, such that with probability at least (1 − δ) it can be learned using O(|X| n (log(n/δ))^2) queries.
Proof. Given n and δ, let t = log(n^2/δ). The teacher chooses a ∈ X and unfolds the target machine M to construct the machine M′ as follows. M′ has nt states (q, i), where q is a state of M and 0 ≤ i ≤ (t − 1). The start state is (q0, 0), where q0 is the start state of M. The output symbol for (q, i) is γ(τ(q, a^i)), where γ is the output function and τ the transition function of M. For 0 ≤ i < (t − 1), the a-transition from (q, i) is to (q, i + 1). The a-transition from (q, t − 1) is to (q′, 0), where q′ = τ(q, a^t). For all other input symbols b with b ≠ a, the b-transition from (q, i) is to (q′, 0), where q′ = τ(q, a^i b).
To see that M′ is an unfolding of M, that is, that M′ is output equivalent to M, we show that each state (q, i) of M′ is output equivalent to the state τ(q, a^i) of M. By construction, these two states have the same output. If i < (t − 1) then the a-transition from (q, i) is to (q, i + 1), which has the same output symbol as τ(q, a^{i+1}). The a-transition from (q, t − 1) is to (q′, 0), where q′ = τ(q, a^t), which has the same output symbol as τ(τ(q, a^{t−1}), a). If b ≠ a is an input symbol, then the b-transition from (q, i) is to (q′, 0), where q′ = τ(q, a^i b), which has the same output symbol as τ(τ(q, a^i), b).
Suppose M′ is randomly labeled with two labels. For each state q of M, define its label identifier in M′ to be the sequence of labels of (q, i) for i = 0, 1, . . . , (t − 1). For two distinct states q1 and q2 of M, the probability that their label identifiers in M′ are equal is (1/2)^t, which is at most δ/n^2. Thus, the probability that there exist two distinct states q1 and q2 with the same label identifier in M′ is at most δ.
Given n and δ, the learning algorithm takes advantage of the known unfolding strategy to construct states (j, i) for 0 ≤ j ≤ n − 1 and 0 ≤ i ≤ (t − 1), with a-transitions from (j, i) to (j, i + 1) for i < (t − 1). It starts with the empty input string and uses the following exploration strategy. Given an input string w that is known to arrive at some (q, 0) in M′, the learning algorithm makes label queries on wa^i for i = 0, 1, . . . , (t − 1) to determine the label identifier of q in M′. If this label identifier has not been seen before, the learner uses the next unused (j, 0) to represent q and records the outputs and labels for the states (j, i) for i = 0, 1, . . . , (t − 1). It must also explore all unknown transitions from the states (j, i). If distinct states of M receive distinct label identifiers in M′, the learner learns a finite automaton output equivalent to M using O(|X| n t^2) label queries.
Proof. This was first proved by Korshunov in [6]; here we give a simpler proof. Korshunov showed that the signature trees only need to be of depth asymptotically equal to log_{|X|}(log_{|L|}(n)) for the nodes to have unique signatures with high probability. We use a method similar to signature trees, but simpler to analyze. Instead of comparing signature trees for two states to tell whether or not they are distinct, we compare the labels along at most four sets of transitions, which we call signature paths – like a signature tree consisting only of four paths.
Lemmas 2 and 3 show that given X and n there are at most four signature paths, each of length 3 log(n), such that for a random finite automaton of n states with input alphabet X and for any pair s1 and s2 of different states, the probability that s1 and s2 are distinguishable but not distinguished by any of the strings in the four signature paths is O(log^6(n)/n^3). By the union bound, the probability that there exist two distinguishable states that are not distinguished by at least one of the strings in the four signature paths is at most
C(n, 2) · O(log^6(n)/n^3) = o(1).
Hence, by running at most four signature paths, each of length 3 log(n), per newly reached state, we get unique labels on the states. Then for each of the n states, we can find their |X| transitions, and learn the machine, as in Proposition 2.
We now turn to the two lemmas used in the proof of Theorem 7. We first consider the case |X| > 2. If a, b, c ∈ X and ℓ is a nonnegative integer, let D_ℓ(a, b, c) denote the set of all strings a^i, b^i, and c^i such that 0 ≤ i ≤ ℓ.
Proof. We analyze the three (attempted) paths from two states s1 and s2, which we will call π^1_{s1}, π^2_{s1}, π^3_{s1} and π^1_{s2}, π^2_{s2}, π^3_{s2}, respectively. Each path will have length 3 log(n). We define each of the π's as the set of nodes reached by its respective set of transitions.
We first look at the probability that the following event does not happen: that both |π^1_{s1}| > 3 log(n) and |π^1_{s2}| > 3 log(n), and that π^1_{s1} ∩ π^1_{s2} = ∅; that is, the probability that both of these strings succeed in reaching 3 log(n) different states, and that they share no states in common. We call the event that two sets of states π1 and π2 have no states in common, and both have size at least l, S(π1, π2, l) (success), and we write F(π1, π2, l) for the complementary failure event. So,
P(F(π^1_{s1}, π^1_{s2}, 3 log(n))) ≤ Σ_{i=1}^{3 log(n)} (i + |π^1_{s1}|)/n + Σ_{i=1}^{3 log(n)} (i + |π^1_{s2}|)/n
≤ 2 Σ_{i=1}^{3 log(n)} (i + 3 log(n))/n
= O(log^2(n)/n).
Now we look at the probability of F(π^2_{s1}, π^2_{s2}, 3 log(n)) given that we failed on the first paths, that is, given F(π^1_{s1}, π^1_{s2}, 3 log(n)); with l = 3 log(n),
P(F(π^2_{s1}, π^2_{s2}, l) | F(π^1_{s1}, π^1_{s2}, l)) ≤ Σ_{i=1}^{3 log(n)} (i + |π^2_{s1}| + |π^1_{s1}| + |π^1_{s2}|)/n + Σ_{i=1}^{3 log(n)} (i + |π^2_{s2}| + |π^1_{s1}| + |π^1_{s2}|)/n
≤ 2 Σ_{i=1}^{3 log(n)} (i + 9 log(n))/n
= O(log^2(n)/n).
Now, we compute the probability of F(π^3_{s1}, π^3_{s2}, 3 log(n)) given failures on the previous two pairs of paths. Let l = 3 log(n); then
P(F(π^3_{s1}, π^3_{s2}, l) | F(π^1_{s1}, π^1_{s2}, l), F(π^2_{s1}, π^2_{s2}, l)) ≤ 2 Σ_{i=1}^{3 log(n)} (i + 25 log(n))/n
= O(log^2(n)/n).
Last, we compute the probability that none of these pairs of paths made it to l = 3 log(n), that is, P(failure) = P(F(π^1_{s1}, π^1_{s2}, l), F(π^2_{s1}, π^2_{s2}, l), F(π^3_{s1}, π^3_{s2}, l)):
P(failure) = P(F(π^1_{s1}, π^1_{s2}, l)) · P(F(π^2_{s1}, π^2_{s2}, l) | F(π^1_{s1}, π^1_{s2}, l)) · P(F(π^3_{s1}, π^3_{s2}, l) | F(π^1_{s1}, π^1_{s2}, l), F(π^2_{s1}, π^2_{s2}, l))
= O(log^2(n)/n) · O(log^2(n)/n) · O(log^2(n)/n)
= O(log^6(n)/n^3).
Thus, given two distinct states with corresponding nonoverlapping signature paths of length 3 log(n), the probability that all of the randomly chosen labels along the paths will be the same is 1/2^{3 lg(n)} = 1/n^3 = O(log^6(n)/n^3), which is the probability that no string in D_ℓ(a, b, c) distinguishes s1 from s2.
References
1. Angluin, D.: A note on the number of queries needed to identify regular languages.
Information and Control 51(1), 76–87 (1981)
2. Angluin, D.: Queries and concept learning. Machine Learning 2(4), 319–342 (1987)
3. Becerra-Bonache, L., Dediu, A.H., Tı̂rnăucă, C.: Learning DFA from correction and
equivalence queries. In: Sakakibara, Y., Kobayashi, S., Sato, K., Nishino, T., Tomita,
E. (eds.) ICGI 2006. LNCS (LNAI), vol. 4201, pp. 281–292. Springer, Heidelberg
(2006)
4. Domaratzki, M., Kisman, D., Shallit, J.: On the number of distinct languages ac-
cepted by finite automata with n states. Journal of Automata, Languages and Com-
binatorics 7(4) (2002)
5. Freund, Y., Kearns, M.J., Ron, D., Rubinfeld, R., Schapire, R.E., Sellie, L.: Efficient
learning of typical finite automata from random walks. Information and Computa-
tion 138(1), 23–48 (1997)
6. Korshunov, A.: The degree of distinguishability of automata. Diskret. Analiz. 10(36),
39–59 (1967)
7. Lee, D., Yannakakis, M.: Testing finite-state machines: State identification and ver-
ification. IEEE Trans. Computers 43(3), 306–320 (1994)
8. Trakhtenbrot, B.A., Barzdin’, Y.M.: Finite Automata: Behavior and Synthesis.
North-Holland, Amsterdam (1973)
Characterizing Statistical Query Learning:
Simplified Notions and Proofs
Balázs Szörényi1,2
1
Fakultät für Mathematik, Ruhr-Universität Bochum, D-44780 Bochum, Germany
2
Hungarian Academy of Sciences and University of Szeged, Research Group on
Artificial Intelligence, H-6720 Szeged
szorenyi@inf.u-szeged.hu
1 Introduction
The Statistical Query model (called SQ model for short) was introduced by
Kearns [6] as an approach to handle noise in the well-known PAC model. The
general idea is that—instead of using random examples as in the PAC model—
the learner gains information about the unknown function by asking various
statistics (called queries) over the distribution of labeled examples. As it was
shown by Kearns [6], any learning algorithm in the SQ model can be transformed
to a PAC algorithm without much loss in efficiency. It is even more interesting
that the resulting algorithm is robust to noise. He also showed that many efficient PAC algorithms can be converted to efficient SQ algorithms.
Despite the power of the model that is apparent from the above results, it is
still weaker than the PAC model. Indeed, already in [6] it was shown that the class of parities, which is PAC-learnable, cannot be efficiently learned in the SQ model under the uniform distribution. The proof used an information-theoretic
argument, which was generalized later by Blum et al. in [3] to characterize weak
learnability of a concept class (where the goal is to do slightly better than ran-
dom guessing) in the SQ model for the distribution dependent case (i.e., when
the underlying distribution is fixed in advance and is known by the learner).
The characterization is based on the so called SQ dimension of the class which
is, roughly, the maximal size of an “almost orthogonal” system in the class.
However, the proof in [3] is rather long and complex. Subsequently Yang gave
an alternative, elegant proof for basically the same result [10]. In this paper we
present yet another, much shorter proof, thereby significantly simplifying both existing proofs.
Strong learnability (i.e., when the goal is to approximate the target concept
with arbitrary accuracy) of a concept class in the distribution dependent case
was first characterized by Köbler and Lindner [7] in terms of a general frame-
work for protocols, called the general dimension. Independently Simon in [8]
gave another characterization for strong learnability that was based on the SQ
dimension (more precisely it was based on the SQ dimension of the class after
some translation), and is more of an algebraic flavor. However, both the char-
acterization and the proof are rather complex in [8]; as we shall show in this
paper, our simple approach that is successful in characterizing weak learnability,
can be also applied for strong learnability, thereby giving an alternative, simple
characterization for this problem as well, which might also have the potential to
be easier to apply and calculate for concrete concept classes. Recently Feldman
has also obtained a simple characterization of strong SQ learnability of a similar
flavor [5]; however, the two papers focus on different perspectives: Feldman is interested in applications to agnostic learning and evolvability, while our main interest is to find a really simple proof and a unified view of weak and
strong learnability. Additionally our approach also reveals that in the distri-
bution dependent case query-efficient learnability is possible if and only if all
consistent learning algorithms learn the given concept class query-efficiently.1
As far as we know, this was not known before. We also show that in the distri-
bution dependent case proper learning (i.e., when the queries of the learner are
restricted to use functions from the given concept class) is as strong as improper
learning, but we would like to point out that this can be easily deduced already
from the characterization result of Simon (see Observation 5).
Finally we show that in the distribution independent case (i.e., when the
learner doesn’t know anything about the underlying distribution) proper and
improper learning can differ significantly, and we contrast this with the above
mentioned result on their equivalence in the distribution dependent case.
is exponential in their lower bound (see also our discussion on the topic in Sect.
7) and the paper does not reveal the relation of the model to noise-tolerant PAC
learning (which gave the importance of the SQ model).
In [11] Yang introduced the honest SQ model, which uses stronger queries and less adversarial settings than the ones used in the SQ model. In [9] it is shown how to apply the results and methods of this paper to prove a somewhat surprising result: the equivalence of the honest and the “pure” SQ model.
2 Preliminaries
A concept is a mapping from some domain to {−1, 1}. A concept class is a set of concepts with the same domain. A Boolean concept over n variables is a concept of the form {−1, 1}^n → {−1, 1}. A family of concept classes is an infinite set {Fn}_{n=1}^∞ such that each Fn is a concept class. The class of all concepts over some domain X is denoted C(X).
The correlation of two functions f, g : X → R under some distribution D over X is defined as ⟨f, g⟩_D = E[f(ξ)g(ξ)], where ξ is a random variable with distribution D. The norm of f under D is ‖f‖_D := √⟨f, f⟩_D. f is said to be a γ-approximation of g if ⟨f, g⟩_D ≥ γ.
In the Statistical Query model a learner (or learning algorithm) L can make queries of the form (h, τ), where τ is a positive constant called the tolerance, and h is chosen from some concept class H called the query class. Each such query is answered with some c satisfying |c − ⟨f*, h⟩_D| ≤ τ, where f* is some fixed concept, called the target concept, that is unknown to the learner, and where D is some distribution over the input domain of f*. (Here the learner is supposed to be familiar with D.) The learner succeeds when he finds some function f ∈ H having correlation at least γ with f* for some constant γ > 0 fixed ahead of the learning process. Parameter γ is called the accuracy. Let q^{D,L}_{F,H}(τ, γ) denote the smallest integer q such that L always succeeds in the above setting using at most q queries when the target concept belongs to F. Finally, SLC^D_{F,H}(τ, γ) (the statistical learning complexity) is defined to be the minimum value of q^{D,L}_{F,H}(τ, γ) over all possible learning algorithms L. We would like to emphasize that in this paper we are interested only in the number of queries made during the learning process (i.e., the information complexity of learning), and do not consider the running time.
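For concreteness, here is a minimal Python sketch of this protocol for a finite, explicitly given distribution; all names are our own, and the slack parameter stands in for the adversarial freedom within the tolerance.

def correlation(f, g, D):
    # <f, g>_D for a finite distribution D given as (point, probability) pairs
    return sum(p * f(x) * g(x) for x, p in D)

def sq_answer(h, tau, target, D, slack=0.0):
    """Answer the query (h, tau) with any c such that |c - <target, h>_D| <= tau."""
    assert abs(slack) <= tau
    return correlation(target, h, D) + slack

# Example: target = parity of two bits, query h = first coordinate, uniform D.
D = [((x1, x2), 0.25) for x1 in (-1, 1) for x2 in (-1, 1)]
target = lambda x: x[0] * x[1]
h = lambda x: x[0]
print(sq_answer(h, 0.1, target, D, slack=0.05))  # true correlation is 0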
Note that originally in [6] the SQ model allowed much more general queries,
but in [4] Bshouty and Feldman have shown that the two models are equivalent.2
We also consider the following variants of the above described learning model. The learning is called proper when F = H, and improper when F ⊊ H. Also, in general, a query (h, τ) is proper if h ∈ F, otherwise it is improper. The learner is a consistent learner if |⟨hi, hj⟩_D − ci| ≤ τi for all i < j, where (hi, τi) is the i-th query of the learner and ci is the answer to it. Finally, note that in the above definition the learner is supposed to be familiar with the underlying distribution, but the model can also be defined for the case when this is not true. We are mainly interested in the former case (except for Sect. 8), but when we want to explicitly refer to one case or the other, we shall call the former the distribution dependent and the latter the distribution independent case.
For simplicity, when it causes no confusion, we omit D from notations like SLC^D_{F,H}(τ, γ) and ⟨f, g⟩_D, and simply use SLC_{F,H}(τ, γ) and ⟨f, g⟩ instead.
Definition 1. We say that a family {Fn}_{n=1}^∞ of concept classes is weakly learnable in the SQ model with a family {Hn}_{n=1}^∞ of query classes if there exist some γ(n) > 0 and τ(n) > 0 such that 1/γ(n), 1/τ(n) and SLC_{Fn,Hn}(τ(n), γ(n)) are polynomially bounded in n. {Fn}_{n=1}^∞ is strongly learnable in the SQ model with queries from {Hn}_{n=1}^∞ if there exists some τ(n, ε) > 0 such that 1/τ(n, ε) and SLC_{Fn,Hn}(τ(n, ε), 1 − ε) are polynomially bounded in n and 1/ε.
The following observation, which we shall apply several times later, is basically the reason for the equivalence of proper and improper learning in the distribution dependent model.
Observation 1. Let f, g and h be arbitrary concepts. If ⟨f, h⟩ ≥ 1 − ε and ⟨g, h⟩ ≥ 1 − ε, then ⟨f, g⟩ = (1/2)⟨f + g, f + g⟩ − 1 ≥ ⟨f + g, h⟩ − 1 ≥ 1 − 2ε.
Although this paper mainly considers concepts and concept classes, we would
like to point out that all the results remain valid for classes of functions with
norm bounded by 1 (which might be tempting to use for example in query
classes)—albeit in some cases, when the proof applies Observation 1, the con-
stants get slightly worse.3 The reason for this is the following lemma which is
the generalization of Observation 1 for these functions.
² Actually they have shown how to simulate an arbitrary statistical query using two statistical queries that are independent of the target function and two correlation queries. However, when running time is not considered and the underlying distribution is known, one can omit the two former queries and just compute them directly.
³ The choice of 1 as an upper bound for the query function is arbitrary; one can use any other constant instead. (But note that smaller constants would exclude all concepts.) However, unbounded queries should not be allowed, because they make the learning problem trivial. Indeed, for example when the target concept is Boolean over n variables, and one uses a query with tolerance 1/2 and with the function that evaluates x ∈ {−1, 1}^n to Σ_{i=1}^{n} (1/ε) · 2^n · 2^{i(x_i+1)/2}, then the value of the target concept on inputs with probability at least ε/2^n can be reconstructed from the answer to this query, while the sum of the probabilities of the rest of the inputs is less than ε.
Finally for integer d let [d] denote the set {1, . . . , d}.
the proof in [3] for the other direction was rather long and complex. Subsequently Yang in [10] gave another, elegant proof for this direction, based on the eigenvalues of the correlation matrix of the concept class.⁴
Here we show that basically the same result can be derived using a very simple argument, thus significantly simplifying both of the above-mentioned proofs. The proof in some sense follows the same line of thought they use, but avoids the machinery applied in them.
Theorem 2. Let F be a concept class and let d := SQDimF. Then any learning algorithm that uses tolerance parameter lower bounded by τ > 0 requires in the worst case at least (dτ^2 − 1)/2 queries for learning F with accuracy at least τ. In particular, when τ = 1/d^{1/3}, this means (d^{1/3} − 1)/2 queries.
Proof. Assume that f1, . . . , fd ∈ F fulfill |⟨fi, fj⟩| ≤ 1/d for distinct i, j ∈ [d]. We show an (adversary) answering strategy that ensures that only a small number of these functions is eliminated by each query. Let h be an arbitrary query function used by the learner (having thus norm at most 1) and let A := {i ∈ [d] : ⟨fi, h⟩ ≥ τ}. Then, by the Cauchy–Schwarz inequality,
⟨h, Σ_{i∈A} fi⟩^2 ≤ ‖Σ_{i∈A} fi‖^2 = Σ_{i,j∈A} ⟨fi, fj⟩ ≤ Σ_{i∈A} (1 + (|A| − 1)/d) ≤ |A| + |A|^2/d,
meanwhile, by the choice of A, it holds that ⟨h, Σ_{i∈A} fi⟩ ≥ |A|τ; the two together imply that 1/|A| + 1/d ≥ τ^2 or, equivalently, that |A| ≤ d/(dτ^2 − 1). A similar argument shows that at most d/(dτ^2 − 1) of the fi functions have correlation at most −τ with h. Thus at most 2d/(dτ^2 − 1) of the functions will be inconsistent with the answer if the adversary returns 0 to this query. This, in turn, implies the desired lower bound (dτ^2 − 1)/2 on the learning complexity.
It is also worth mentioning that this result is quite tight in the improper case, when the learner can use arbitrary functions of norm 1 in the queries. Indeed, if the concept class itself is {f1, . . . , fd}, then defining gi := Σ_{j=i·d^{2/3}+1}^{(i+1)·d^{2/3}} fj for i = 0, 1, . . . , d^{1/3} − 1 (assuming for simplicity that d^{1/3} is an integer), at least one hi = gi/‖gi‖, i = 0, 1, . . . , d^{1/3} − 1, will have correlation at least
(1 − d^{2/3}·(1/d)) / √(d^{2/3} + d^{2/3}·d^{2/3}·(1/d))    (1)
with the target function. Note that (1) asymptotically equals 1/d^{1/3}.
max_D SLC^D_{F,H}( 1/(3d log(1/ε)), 1 − ε ) = O(d^5 log^2(1/ε)),
when H ⊇ F, and where d = max_D SQDim^D_F. However, this does not imply any
result on fixed distributions in general. Indeed, when the support of a distribu-
tion consists of only a single input, then one query is enough both in the weak
and in the strong setting—for any concept class. Thus the gap between the up-
per bound in the above equation and the number of queries required for strong
learning under some given (known) distribution can be as big as possible: expo-
nential versus constant. What is more, we cannot expect to bound the strong
SQ dimension under some distribution D using the weak SQ dimension under
the same distribution. Indeed, consider for example the uniform distribution and
the concept class Fn consisting of all the functions of the form v1 ∨ f , where f
is any parity function over the variables v2, . . . , vn. Then |Fn| = 2^{n−1}, and any two distinct elements (v1 ∨ f), (v1 ∨ f′) ∈ Fn have correlation 1/2:
⟨v1 ∨ f, v1 ∨ f′⟩ = 2 P[(v1 ∨ f) = (v1 ∨ f′)] − 1 = 2(1/2 + (1/2) P[f = f′]) − 1 = 1/2
(as the parity functions are uncorrelated under the uniform distribution), and so by Theorem 4 strong learning of Fn requires a superpolynomial number of queries, while weak learning requires none.
Definition 3. For a concept class F let d0(F, γ) denote the largest d such that some f1, . . . , fd ∈ F fulfill |⟨fi, fj⟩| < γ and |⟨fi, fj⟩ − ⟨fk, fℓ⟩| < 1/d for all distinct index pairs {i, j} and {k, ℓ}.
≤ (1/(d′)^2) ( 2d′ + d′(d′ − 1)(c + 1/(2d)) + d′(d′ − 1)(c + 1/(2d)) − 2(d′)^2 (c − 1/(2d)) ) ≤ 4/d′ + 2/d,
and so, by the Cauchy–Schwarz inequality
and so, by the Cauchy-Schwarz inequality
⟨h, (1/d′) Σ_{i∈A} fi − (1/d′) Σ_{i∈B} fi⟩ ≤ ‖(1/d′) Σ_{i∈A} fi − (1/d′) Σ_{i∈B} fi‖ ≤ √(4/d′ + 2/d) ≤ √(6/d′),
and so α − β ≤ √(6/d′) ≤ 2τ. Thus, answering the learner’s query with (α + β)/2, all but at most 2d′ − 2 functions will be consistent with the answer. This, in turn, implies the desired lower bound d/(2d′ − 2) ≥ dτ^2/3 on the learning complexity.
The main result of this section is the following corollary of the two theorems
above:
Corollary 1. The following statements are equivalent for any family {Fn }∞
n=1
of concept classes under arbitrary (fixed) distribution:
As it turns out below, it doesn't really matter which query class is used, as long as it contains the concept class itself.
‖B_F‖^2 = (1/|F|^2) Σ_{g,f∈F} ⟨g, f⟩ ≤ (1/|F|^2) Σ_{g∈F} ( (1 − ε)·(|F|/2) + |F|/2 ) = 1 − ε/2,
and therefore ‖B_F‖ ≤ √(1 − ε/2) ≤ 1 − ε/4. Combining this with (4), and noting that x(2 − x) is monotone increasing on (0, 1), we get that
⟨fi, fj⟩ ≤ 1/d + (1 − ε/4)(1 + ε/4) = 1 + 1/d − ε^2/16.
⁶ Note that we cannot apply Observation 1 (or Proposition 1) directly to bound ⟨fi, fj⟩, because nontriviality only guarantees that none of the fi functions has high correlation with at least half of F, which does not prevent them from having really high correlation with some smaller portion of F. It thus has to be shown that no such set contains another fi.
≤ d′ + d′(d′ − 1)(−1 + ε^2/32 + 3/√d),
which would lead to a contradiction, as d′ ≥ 32. Consequently ⟨fi, fj⟩ ≥ −1 + ε^2/32 for 1 ≤ i < j ≤ d′.
Theorem 7. Let F and H be concept classes satisfying F ⊆ H. Then
d0(F, 1 − ε) ≤ max{2, 2·SQDim*_{F,F}(1 − ε/2)} ≤ max{2, 2·SQDim*_{F,H}(1 − ε/4)}.
Proof. The second inequality follows from Observation 5.
To prove the first inequality, let F′ := {f1, . . . , fd} ⊆ F be such that |⟨fi, fj⟩| < 1 − ε and |⟨fi, fj⟩ − ⟨fk, fℓ⟩| < 1/d for 1 ≤ i < j ≤ d and 1 ≤ k < ℓ ≤ d. Then
|⟨fi − B_{F′}, fj − B_{F′}⟩|
= |⟨fi, fj⟩ + (1/d^2) Σ_{k,ℓ=1}^{d} ⟨fk, fℓ⟩ − (1/d) Σ_{k=1}^{d} (⟨fi, fk⟩ + ⟨fj, fk⟩)|
≤ (1/d) |Σ_{k=1}^{d} (⟨fi, fj⟩ − ⟨fi, fk⟩)| + (1/d^2) |Σ_{k,ℓ=1}^{d} (⟨fk, fℓ⟩ − ⟨fj, fk⟩)|
≤ 2/d.
Furthermore, by Observation 1, F′ is (1 − ε/2, F′)-nontrivial.
The dimension notion introduced in [5] is a kind of simplified version of SQDim∗ :
In this section, as an example, we compute the exact value of d0 for the class of conjunctions under the uniform distribution, up to a constant factor. (Note however that this class is efficiently learnable in the Statistical Query model even distribution independently [6], so d0 is obviously polynomial in n and in 1/ε.)
First of all let us compute the correlation of two conjunctions t and t′ that have lengths ℓ and ℓ′ respectively, and share exactly s literals (as usual, −1 is interpreted as “true” and 1 as “false”):
⟨t, t′⟩ = E[t · t′]
= 1 − 2 P[t ≠ t′]
= 1 − 2(P[t = −1] + P[t′ = −1] − 2 P[t = t′ = −1])
= 1 − 2/2^ℓ − 2/2^ℓ′ if t and t′ conflict,
= 1 − 2/2^ℓ − 2/2^ℓ′ + 4/2^{ℓ+ℓ′−s} otherwise.    (5)
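Formula (5) is easy to check by brute-force enumeration; in the sketch below (our own representation, not the paper's) a term maps each of its variables to the required sign, with −1 read as “true”.

import itertools

def ev(term, x):
    # a conjunction evaluates to "true" (-1) iff all its literals are satisfied
    return -1 if all(x[v] == s for v, s in term.items()) else 1

def corr(t1, t2, n):
    pts = list(itertools.product([-1, 1], repeat=n))
    return sum(ev(t1, x) * ev(t2, x) for x in pts) / len(pts)

# t = v1 ∧ v2 and t' = v2 ∧ v3: lengths l = l2 = 2, sharing s = 1 literal and
# not conflicting, so the second case of (5) applies.
t1 = {0: -1, 1: -1}
t2 = {1: -1, 2: -1}
l = l2 = 2
s = 1
assert abs(corr(t1, t2, 3) - (1 - 2/2**l - 2/2**l2 + 4/2**(l + l2 - s))) < 1e-9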
Next we prove a technical lemma we shall need later. Here we apply the convention that for some x ∈ {0, 1}^n the number of 1s in x is denoted |x|, and that for x, y ∈ {0, 1}^n, x ∨ y (resp. x ∧ y) is the vector of length n with 1 in those components that are 1 in at least one of x and y (resp. in both x and y), and 0 everywhere else. For conjunctions we use similar notations; that is, |t| denotes the number of literals appearing in term t, and t ∧ t′ denotes the term obtained by joining the literals appearing in terms t and t′.
Lemma 1. If for some H ⊆ {0, 1}^n and for some integer c it holds that |x ∨ y| = c for arbitrary distinct x, y ∈ H, then |H| ≤ n + 1.
Proof. For x ∈ H let x^c denote the vector obtained by flipping the bits in x. Then by De Morgan x^c ∧ y^c = (x ∨ y)^c, and thus |x^c ∧ y^c| = n − c for arbitrary distinct x, y ∈ H. Construct the n × |H| matrix X such that its columns are the vectors x^c, x ∈ H, in an arbitrary order, and let C be the |H| × |H| matrix having n − c in each entry. First of all note that X⊤X − C is a diagonal matrix. If it contains some zero element in the diagonal, then |x^c| = n − c for some x ∈ H, implying that for all other y ∈ H, y^c has a 1 everywhere x^c does, and that each such y^c must have a 1 at some unique position where the others have 0. This immediately implies |H| ≤ n + 1. Otherwise, when X⊤X − C is a nonsingular diagonal matrix, |H| = rank(X⊤X − C) ≤ rank(X⊤X) + 1 = rank(X) + 1 ≤ min{n, |H|} + 1
is the longest term among them. Then by (5) it holds that 1 − ε ≥ ⟨ti, td⟩ ≥ 1 − 4 P[ti = −1], implying
P[ti = −1] = 2^{−|ti|} ≥ ε/4,    (6)
and thus
P[ti = tj = −1] = 0 if ti and tj conflict, and P[ti = tj = −1] = 2^{−|ti ∧ tj|} ≥ (ε/4)^2 otherwise.    (7)
ε^2/32 > 1/(4d) ≥ (1/4) |⟨ti, tj⟩ − ⟨tk, tℓ⟩| = |P[ti = tj = −1] − P[tk = tℓ = −1]|,
where the equality uses (5).
Note that it cannot happen that ti and tj conflict with each other but tk and tℓ do not—or vice versa—since by (7) that would mean that the right-hand side is at least ε^2/16, resulting in a contradiction. So either all ti with i ∈ I conflict each other, or there is no conflicting pair among the terms with index in I. The former case implies that {ti = −1}_{i∈I} are pairwise contradicting events, and so, by (6), 1 ≥ Σ_{i∈I} P[ti = −1] ≥ |I| · (ε/4), giving the bound |I| ≤ 4/ε. In the latter case, since by (7) both 2^{−|ti ∨ tj|} and 2^{−|tk ∨ tℓ|} are at least ε^2/16, we have that 2^{−|ti ∨ tj|} > (1/2)·2^{−|tk ∨ tℓ|} and 2^{−|tk ∨ tℓ|} > (1/2)·2^{−|ti ∨ tj|}. This, however, implies that |ti ∨ tj| = |tk ∨ tℓ|, and so, by Lemma 1 (applied to the H ⊆ {0, 1}^n consisting of the vectors that represent the ti with i ∈ I, having a 1 in position j iff ti contains variable vj), I has cardinality at most n + 1.
We have just seen that the sum of the number of terms of minimal length and the number of terms of length one more is at most max{2n + 2, 8/ε}. However, there cannot be distinct indices i, j, k ∈ [d − 1] fulfilling |ti| + 2 ≤ |tj|, |tk|, as otherwise
ε^2/8 > 1/d ≥ |⟨ti, tj⟩ − ⟨tj, tk⟩|
= |2 P[ti = −1] − 4 P[ti = tj = −1] − 2 P[tk = −1] + 4 P[tk = tj = −1]|
≥ (1/2)·P[ti = −1]
≥ ε/8    (by (6)),
a contradiction.
Note that this bound is sharp up to a constant factor according to the example
below and that the terms consisting of one unnegated variable form an orthogonal
system of cardinality n. It also immediately follows that these results remain tight even if we restrict Fn to be the set of monotone conjunctions over v1, . . . , vn.
Example 1. Let Fn be the set of all monotone conjunctions over the variables v1, . . . , vn and let Fn(ℓ) consist of all t ∈ Fn of length ℓ. Set ε := 2^{−ℓ} and note that if t1, t2 ∈ Fn(ℓ) share s < ℓ variables, then under the uniform distribution, by (5), |⟨t1, t2⟩| = 1 − 4/2^ℓ + 4/2^{2ℓ−s} ≤ 1 − 2ε. If additionally t3, t4 ∈ Fn(ℓ) share s′ < ℓ variables, then, again by (5), |⟨t1, t2⟩ − ⟨t3, t4⟩| = |4/2^{2ℓ−s} − 4/2^{2ℓ−s′}| = (4/2^{2ℓ}) |2^s − 2^{s′}|. Now we choose ℓ = ℓ(n) := c log n for some c > 1 (and thus ε = ε(n) = 1/n^c) and s = s(n) := log log n, and prove that d0(Fn, 1 − ε) = Ω(2^{2ℓ−s}) = Ω(n^{2c}/log n) by showing that one can find an I ⊆ Fn(ℓ) of cardinality Ω(n^{2c}/log n) that contains no two distinct conjunctions sharing more than s variables. Such an I can simply be obtained using the greedy method, since when n − ℓ ≥ 2(ℓ − s), then for any t ∈ Fn(ℓ) there are exactly Σ_{i=0}^{ℓ−s} C(ℓ, ℓ−i) C(n−ℓ, i) ≤ 2 C(ℓ, s) C(n−ℓ, ℓ−s) conjunctions in Fn(ℓ) that share at least s variables with t; thus (noting that |Fn(ℓ)| = C(n, ℓ)) I can always be expanded by some term when it has cardinality less than
C(n, ℓ) / (2 C(ℓ, s) C(n−ℓ, ℓ−s)) ∼ n^s/2^s
only two singletons—say fx and fy—are unqueried, the adversary chooses one of them as the target concept, and says that the underlying distribution is D_{x,y}. This way the answers of the adversary remain consistent (no matter how small the tolerance parameter of the learner was) and, at the same time, force the learner to ask at least 2^n − 1 queries—even for weakly learning the class.⁸
It might also be worth mentioning that for the singletons SQDim^D_{Fn} ≤ 5 under any distribution D, because, denoting by p_x the probability assigned to input x ∈ {−1, 1}^n, 1/6 ≥ ⟨fx, fy⟩_D = 1 − 2p_x − 2p_y implies that at least one of p_x and p_y is 5/24 or greater; thus, if six functions from Fn had pairwise correlation at most 1/6, then at least five distinct inputs would have probability 5/24 or greater—a contradiction. This result shows that the number of proper queries required for weakly learning some concept class can differ significantly in the distribution dependent and in the distribution independent case: in some cases it is constant versus exponential.
References
1. Aslam, J.A., Decatur, S.E.: General bounds on statistical query learning and PAC
learning with noise via hypothesis boosting. Inf. Comput. 141(2), 85–118 (1998)
2. Ben-David, S., Itai, A., Kushilevitz, E.: Learning by distances. Inform. Com-
put. 117(2), 240–250 (1995)
3. Blum, A., Furst, M., Jackson, J., Kearns, M., Mansour, Y., Rudich, S.: Weakly
learning DNF and characterizing statistical query learning using fourier analysis.
In: Proc. of 26th ACM Symposium on Theory of Computing (1994)
4. Bshouty, N.H., Feldman, V.: On using extended statistical queries to avoid mem-
bership queries. Journal of Machine Learning Research 2, 359–395 (2002)
5. Feldman, V.: A complete characterization of statistical query learning with appli-
cations to evolvability. In: FOCS 2009 (to appear, 2009)
6. Kearns, M.: Efficient noise-tolerant learning from statistical queries. J. ACM 45(6),
983–1006 (1998)
7. Köbler, J., Lindner, W.: The complexity of learning concept classes with polynomial
general dimension. Theor. Comput. Sci. 350(1), 49–62 (2006)
8. Simon, H.U.: A characterization of strong learnability in the statistical query
model. In: Thomas, W., Weil, P. (eds.) STACS 2007. LNCS, vol. 4393, pp. 393–404.
Springer, Heidelberg (2007)
9. Szörényi, B.: Honest queries do not help in the statistical query model (manuscript)
10. Yang, K.: New lower bounds for statistical query learning. J. Comput. Syst. Sci. 70(4), 485–509 (2005); preliminary version in: Kivinen, J., Sloan, R.H. (eds.) COLT 2002. LNCS (LNAI), vol. 2375, pp. 229–243. Springer, Heidelberg (2002)
11. Yang, K.: On learning correlated boolean functions using statistical queries. In:
Abe, N., Khardon, R., Zeugmann, T. (eds.) ALT 2001. LNCS (LNAI), vol. 2225,
pp. 59–76. Springer, Heidelberg (2001)
8
This doesn’t contradict the result of Aslam and Decatur [1] mentioned in Section 4,
since their boosting algorithm uses improper queries.
An Algebraic Perspective on Boolean Function Learning
R. Gavaldà and D. Thérien
1 Introduction
In his foundational paper [Val84], Valiant introduced the (nowadays called) PAC-
learning model, and showed that conjunctions of literals, monotone DNF formu-
las, and k-DNF formulas were learnable in the PAC model. Shortly after, Angluin
proposed the (nowadays called) Exact learning from queries model, proved that
Deterministic Finite Automata are learnable in this model [Ang87], and showed
how to recast Valiant’s three learning results in the exact model [Ang88].
Valiant’s and Angluin’s initial successes were followed by a flurry of PAC-
or Exact learning results, many of them concerning (as in Valiant’s paper) the
learnability of Boolean functions, others investigating learnability in larger do-
mains. For the case of Boolean functions, however, progress both in the pure
(distribution-free, polynomial-time) PAC model or in the exact learning model
has slowed down considerably in the last decade.
Certainly, one reason for this slowdown is the recognition that these two models do not realistically capture many Machine Learning scenarios. So a lot of
the effort has shifted to investigating variations of the original models that ac-
commodate these features (noise tolerance, agnostic learning, attribute efficiency,
distribution specific learning, subexponential time, . . . ), and important advances
have been made here.
But another undeniable reason for the slowdown is the fact that it is difficult
to find new learnable classes, either by extending current techniques to larger
classes or by finding totally different techniques. Many existing techniques seem
to be blocked by the frustrating problem of learning DNF, or by our lack of
knowledge of basic questions on boolean circuit complexity, such as the power
of modular or threshold circuits.
In this paper, we use algebraic tools for organizing many existing results
on Boolean function learning, and pointing out possible limitations of existing
techniques. We adopt the program over a monoid as computing model of Boolean
functions [Bar89, BST90]. We use existing, and very subtle, taxonomies of finite
monoids to classify many existing results on Boolean function learning, both in
the Exact and PAC learning models, into three distinct algorithmic paradigms.
The rationale behind the approach is that the algebraic complexity of a
monoid is related to the computational complexity of the Boolean functions
it can compute, hence to their learning complexity. Furthermore, the existing
taxonomies of monoids may help in detecting corners of learnability that have
escaped attention so far because of lack of context, and also in indicating bar-
riers for a particular learning technique. We provide some examples of both
types of indications. Similar insights have led in the past to, e.g., the complete
classification of the communication complexity of boolean functions and regular
languages [TT05, CKK+ 07].
More precisely, we present three classes of monoids that are learnable in three
different Exact learning settings:
Strategy 1. Groups for which lower bounds are known in the program model,
all of which are solvable. Boolean functions computed over these groups can be
identified from polynomially many Membership queries and, in some cases, in
polynomial or quasipolynomial time. Membership learning in polynomial time
is impossible for any monoid which is not a solvable group.
Strategy 2. Monoids built as wreath products of DA monoids and p-groups.
These monoids compute boolean functions computed by decision lists whose
nodes contain MODp gates fed by NC0 functions of the inputs. These are learn-
able from Equivalence queries alone, hence also PAC-learnable, using variants of
the algorithms for learning decision lists and intersection-closed classes. The re-
sult can be extended to MODm gates (for nonprime m) with restrictions on their
accepting sets. All monoids in this class are nonuniversal (cannot compute all
boolean functions), in fact the largest class known to contain only nonuniversal
monoids. We argue that proving learnability of the most reasonable extensions
of this class (either in the PAC or the Equivalence-query model) requires either
new circuit lower bounds or learning DNF.
Strategy 3. Monoids in the variety named LGp m Com. Programs over these
monoids are simulated by polynomially larger Multiplicity Automata (in the se-
quel, MA) over the field Fp , and thus are learnable from Membership and Equiv-
alence queries. Not all MA can be translated to programs over such monoids;
but all classes of Boolean functions that, to our knowledge, were shown to be
learnable via the MA algorithm (except the full class of MA itself) are in fact
captured by this class of monoids. We conjecture that this is the largest class of
monoids that can be polynomially simulated by MA, hence it defines the limit
of what can be learned via the MA algorithm in our algebraic setting.
These three classes subsume a good number of the classes of Boolean functions
that have been proved learnable in the literature, and we will detail them when
presenting each of the strategies. Additionally, with the algebraic interpretation
we can examine more systematically the possible extensions of these results, at least
within our framework. By examining natural extensions of our three classes of
monoids, we can argue that any substantial extension of two of our three monoid
classes provably requires solving two notoriously hard problems: either proving
learnability of DNF formulas or proving new lower bounds for classes of solvable
groups. This may be an indication that substantial advance on the learnability
of circuit-based classes similar to the ones we capture in our framework may
require new techniques.
Admittedly, there is no reason why every class of boolean functions interesting
from the learning point of view should be equivalent to programs computed over
a class of monoids, and certainly our classification leaves out many important
classes. Among them are classes explicitly defined in terms of threshold gates,
or by read-k restrictions on the variables, or by monotonicity conditions. This
is somehow unavoidable in our setting, since threshold gates have no natural
analogue on finite monoids, and because multiple reads and variable negation
are free in the program model. Similarly, the full classes of MA and DFA cannot
be captured in our framework, since for example the notion of automata size is
critically sensitive to the order in which the inputs are read, while in the program
model variables can always be renamed with no increase in size.
Our taxonomy is somehow complementary to those in [HS07, She08] based on
threshold functions. Some function classes are captured by both that approach
and ours, while each one contains classes not captured by the other.
2 Background
We build circuits typically using AND, OR, and MOD gates. We use the generalized model of MOD_m gates that come equipped with an accepting set A ⊆ [m] shown as a superindex; [m] denotes the set {0, . . . , m − 1} throughout the paper. A MOD_m^A gate outputs 1 iff the sum of its inputs mod m is in A. We simply write MOD_m gates to mean MOD_m^A gates with arbitrary A's. For each k, NC0_k denotes the class of functions in which each output bit depends on at most k input bits.
We assume familiarity with Valiant’s PAC model and especially Angluin’s model
of Exact learning via queries. In the Exact model, we use Membership and
Equivalence queries. As usual, we measure the resources used by a learning
algorithm as a function of the arity of the target function (denoted with n) and
the size of the target function within some representation language associated
to the class of functions to learn (denoted with s). Longer explanations can be
found in the extended version.
We will use repeatedly the well-known Composition Theorem (see e.g.
[KLPV87]) which states that if a class C (with minor syntactical requirements)
is learnable in polynomial time then C ◦ NC0k is also learnable in polynomial time
for every fixed k. The result is valid for both the Equivalence-query model and the
PAC model, but the proof fails in the presence of Membership queries.
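As an illustration of the variable-substitution idea behind this theorem, the sketch below (ours, not the formal proof; names and the expansion strategy are our assumptions) maps an example x to the outputs of every boolean gate applied to every k-subset of its variables; a polynomial-time learner for C can then be run on the expanded examples, since the expansion has polynomial size for fixed k:

```python
# Expand an example over all "k-local" gates: every boolean function of
# every k-subset of the input variables (illustrative sketch).
from itertools import combinations, product

def nc0k_expand(x, k):
    """All values g(x_S) for k-subsets S of inputs and all k-ary gates g."""
    feats = []
    for S in combinations(range(len(x)), k):
        key = tuple(x[i] for i in S)
        idx = int("".join(map(str, key)), 2)       # position in truth table
        for table in product([0, 1], repeat=2 ** k):
            feats.append(table[idx])               # output of gate g on x_S
    return feats

# The expansion has size C(n,k) * 2^(2^k): polynomial in n for fixed k.
print(len(nc0k_expand([1, 0, 1, 1], k=2)))         # 6 subsets * 16 gates = 96
```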
Recall that a monoid is a set equipped with a binary operation that is associa-
tive and has an identity. All the monoids in this paper are finite; some of our
statements about monoids might be different or fail for infinite monoids.
A group is a monoid where each element has an inverse. A monoid is aperiodic
if there is some number t such that a^{t+1} = a^t for every element a. Only the one-
element monoid is both a group and aperiodic. A theorem by Krohn and Rhodes
states that every monoid can be built from groups and aperiodic monoids by
repeatedly applying the so-called wreath product. The wreath product of monoids
A and B is denoted with A ≀ B. Solvable groups, in particular, are precisely those
that can be built as iterated wreath product of Abelian groups. Definitions of
solvable groups and wreath product can be found in most textbooks on group
theory, and in the extended version of this paper.
A program over a monoid M is a pair (P, A), where A ⊆ M is the accepting set and P is an ordered list of instructions. An instruction is a triple (i, a_i, b_i) whose semantics is as follows: read (boolean) variable x_i; if x_i = 0, emit element a_i, and otherwise emit b_i. The program accepts an input iff the product of all emitted elements belongs to the accepting set A.
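For concreteness, here is a minimal Python rendering of this model (ours; the emit-on-0/emit-on-1 convention follows the definition above, and the parity example over the group Z_2 is an illustrative assumption):

```python
# Evaluate a program over a finite monoid; here the monoid is Z_2 under
# addition mod 2, so the program below computes the parity of its inputs
# with accepting set A = {1}.

def run_program(instructions, accepting, identity, op, x):
    """instructions: triples (i, a_i, b_i); emit a_i if x[i] == 0, else b_i,
    then multiply all emitted elements in the monoid."""
    m = identity
    for i, a, b in instructions:
        m = op(m, a if x[i] == 0 else b)
    return m in accepting

add_mod2 = lambda u, v: (u + v) % 2
prog = [(0, 0, 1), (1, 0, 1), (2, 0, 1)]          # one instruction per variable
print(run_program(prog, accepting={1}, identity=0, op=add_mod2,
                  x=[1, 0, 1]))                    # False: an even number of 1s
```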
Fact 1. If M divides N and B(N ) is learnable (in any of the learning models
in this paper), then B(M ) is also learnable.
The small-weight strategy applies to function classes with the following property.
Definition 1. For an assignment a ∈ {0, 1}^n, the weight of a is defined as
the number of 1s it contains, and denoted w(a). A representation class C is k-
narrowing if every two different functions f, g ∈ C of the same arity differ on
some assignment of weight at most k. (k may actually be a function of some
other parameters, such as the arity of f and g or their size in C).
It was shown in [Bar89] and [GTT06], respectively, that nonsolvable groups and
nongroups can compute any conjunction of variables and their negations by a
polynomial-size program. Any class of functions with this property is not n-
narrowing, and by a standard adversary argument, it requires 2^n Membership
queries to be identified. Therefore we have:
Fact 4. 1. For every nilpotent group M there is a constant k such that B(M) is k-narrowing [PT88]. Therefore B(M) can be identified from n^{O(k)} Membership queries (and possibly unbounded time).
The next two theorems give specific, time-efficient versions of this strategy for
Abelian groups and Gp ≀ Ab groups. These are, to our knowledge, new learning
algorithms.
(Recall that s stands for the length of the shortest program computing the target
function). Proofs are given as an Appendix in the extended version.
Let us now interpret these results in circuit terms. It is easy to see that programs over a fixed Abelian group are polynomially equivalent to boolean combinations of MOD_m gates, for some m depending on the group. Theorem 5 then implies:
we know that the values of such a program on all small-weight assignments are
sufficient to identify it uniquely, but can these values be used to efficiently predict
the value of the program on an arbitrary assignment?
In circuit terms, by results of [PT88], such programs can be shown to be polynomially equivalent to fixed-size boolean combinations of MOD_m ∘ NC^0 circuits or, equivalently, to polynomials of constant degree over Z_m. We are not even aware of algorithms learning a single MOD_m^A ∘ NC^0 circuit for arbitrary accepting sets A. When
m is prime, one can use Fermat’s little theorem to make sure that the MODm
gate receives only inputs summing to either 0 or 1, at the expense of increasing
the arity of the NC0 part. Then, one can set up a set of linear equations where
the unknowns are the coefficients of the target polynomial and each small-weight
assignment provides an equation with constant term either 0 or 1. The solution
of this system must be equivalent to the target function.
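A hedged sketch of this linear-algebraic step follows (our code; the prime p, the degree-1 monomial basis, and the weight-at-most-1 query set are illustrative assumptions, and the Fermat preprocessing of the NC^0 part is not shown):

```python
# Recover the coefficients of a low-degree polynomial over Z_p from its
# values on small-weight assignments, via Gaussian elimination over F_p.

def gauss_mod_p(A, b, p):
    """Solve the square, full-rank system A c = b over F_p."""
    M = [row[:] + [bi % p] for row, bi in zip(A, b)]
    m = len(M)
    for col in range(m):
        r = next(i for i in range(col, m) if M[i][col] % p)   # pivot row
        M[col], M[r] = M[r], M[col]
        inv = pow(M[col][col], p - 2, p)                      # Fermat inverse
        M[col] = [(x * inv) % p for x in M[col]]
        for i in range(m):
            if i != col and M[i][col] % p:
                f = M[i][col]
                M[i] = [(x - f * y) % p for x, y in zip(M[i], M[col])]
    return [M[i][-1] for i in range(m)]

p = 3
f = lambda x: (2 + x[0] + 2 * x[1]) % p            # unknown target: 2 + x1 + 2*x2
assignments = [(0, 0), (1, 0), (0, 1)]             # weight <= 1 queries
A = [[1, x[0], x[1]] for x in assignments]         # monomials 1, x1, x2
b = [f(x) for x in assignments]
print(gauss_mod_p(A, b, p))                        # -> [2, 1, 2]
```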
For solvable groups that are neither nilpotent nor in Gp ≀ Ab, the situation is even worse in the sense that we do not have lower bounds on their computational power, i.e., we cannot show that they are weaker than NC1. Observe that any learning result would establish a separation from NC1, conditional on the cryptographic assumptions under which NC1 is nonlearnable. In another direction,
while lower bounds do exist for MODp ◦ MODm circuits, we do not have them
for MODp ◦ MODm ◦ NC0 ; linear lower bounds for some particular cases were
given in [CGPT06].
Let us note that programs over Abelian groups (equivalently, boolean combi-
nations of MODm gates) are particular cases of the multi-symmetric concepts
studied in [BCJ93]. Multi-symmetric concepts are there shown to be learnable
from Membership and Equivalence queries, while we showed that for these par-
ticular cases Membership queries suffice. XOR’s of k-terms and depth-k decision
trees are special cases of MODm ◦ NC0 previously noticed to be learnable from
Membership queries alone [BK].
In this section we observe that classes of the form DL ∘ MOD_m^A ∘ NC^0 are learnable from Equivalence queries.
Proof. (Sketch; details are given in the extended version.) By the composition theorem, it suffices to show that DL ∘ MOD_m^{[m]−{0}} is learnable from Equivalence queries, and we can even assume that the inputs to these MOD_m^{[m]−{0}} gates are variables (no negations, no constants). Now observe that this class of functions is the set of negations of nested differences of functions computed by MOD_m^{{0}} gates – where we identify a function with the set of assignments where it evaluates to 1. Furthermore, the set represented by every MOD_m^{{0}} gate is a submodule (a set closed under addition) of Z_m^n, and submodules are intersection-closed. Therefore, the algorithm in [HSW90] for learning nested differences of intersection-closed classes applies, and one can show it learns the class above with n^{O(1)} queries.
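The submodule structure can be made concrete with a small sketch (ours, not the algorithm of [HSW90] itself): for an intersection-closed class, the smallest hypothesis consistent with a set of positive examples is their closure, which for submodules of Z_m^n is the Z_m-span of the examples.

```python
# Closure hypothesis for submodules of Z_m^n: the set of all Z_m-linear
# combinations of the positive examples seen so far (illustrative sketch).
from itertools import product

def span_Zm(vectors, m, n):
    """All Z_m-linear combinations of the given vectors in Z_m^n."""
    closure = set()
    for coeffs in product(range(m), repeat=len(vectors)):
        v = tuple(sum(c * vec[i] for c, vec in zip(coeffs, vectors)) % m
                  for i in range(n))
        closure.add(v)
    return closure

# Two positive examples in Z_4^2 generate an 8-element submodule.
S = span_Zm([(1, 2), (0, 2)], m=4, n=2)
print(len(S), (3, 2) in S)                          # -> 8 True
```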
Note that in Theorem 7 the running time does not depend on the length of the decision list that is being learned. In fact, as a byproduct of this proof one can see that the length of these decision lists can be limited to a polynomial in m and n^k without actually restricting the class of functions being computed. Intuitively, this is because there can be only as many linearly independent such MOD gates, and a node whose answer is determined by the previous ones in the decision list can be removed. Thus, for constant m and k, this class can compute at most 2^{n^{O(1)}} n-ary boolean functions and is not universal.
Also, note that we claim this result for MODm gates having all but 0 as
accepting elements. In the special case that m is a prime p, we can deal with
arbitrary accepting sets:
Theorem 8. For every prime p, every k, and arbitrary accepting sets A (possibly distinct at every MOD gate), the class DL ∘ MOD_p^A ∘ NC0_k is learnable from Equivalence queries in time n^c, where c = c(p, k).
If we ignore the issue of proper learning and polynomials in the running time,
this subsumes at least the following known results:
– k-decision lists (which are DL ∘ NC^0) [Riv87]. k-decision lists in turn subsume k-CNF and k-DNF, and rank-k decision trees.
– Systems of equations over Z_m, i.e., DL ∘ MOD_m^{[m]−{0}}.
– Polynomials of constant degree over finite fields, restricted to boolean func-
tions. When the field has prime cardinality p, these are equivalent to
MODp ◦ NC0 .
– Strict width-2 branching programs [BBTV97]. This is because it can be
shown that these are polynomially simulated by DL ◦ MOD2 ◦ NC0 (proof
omitted in this version).
These are virtually all known results on learning Boolean functions in the pure
PAC model (no Membership queries) that do not involve threshold gates or
The proof is omitted in this version. Intuitively, nilpotent groups provide the "group" behavior of MOD_m ∘ NC^0 and decision lists are equivalent to DA ∘ NC^0. A key ingredient is the fact that a MOD_{p^α} gate can be simulated by a MOD_p ∘ NC^0 circuit; see e.g. [BT94] for a proof. The difference between parts (1) and (2) is again the possibility of using Fermat's little theorem to reduce gates to singleton accepting sets.
From this theorem and Theorem 8, it follows that we can learn programs over DA ≀ Gp monoids from Equivalence queries, yet we do not know how to learn (to our knowledge) programs over DA ≀ Gnil in any model. This algebraic interpretation lets us explore this gap in learnability and, in particular, the limitations of the learning paradigm in the previous subsection. Since every p-group is nilpotent and it can be shown that DA ≀ Gnil monoids can only have nilpotent subgroups, we have
DA ≀ Gp ⊆ DA ≀ Gnil ⊆ (DA ≀ G) ∩ Mnil,
where Mnil is the class of monoids having only nilpotent subgroups. Yet, there is an important difference in what we know about DA ≀ Gp and DA ≀ Gnil.
Following [Tes03, TT04], a monoid M is said to have the Polynomial Length
Property (or PLP) if every program over M , regardless of its length, is equivalent
to another one whose length is polynomial in n. Clearly, every monoid in PLP is
nonuniversal, and the converse is conjectured in [Tes03, TT04]. More specifically,
the following was shown in [Tes03, TT04].
– Every monoid not in (DA ≀ G) ∩ Mnil is universal.
– Every monoid in DA ≀ Gp has the PLP, hence is not universal.
The question of either PLP or universality is thus open for DA ≀ Gnil, sitting between DA ≀ Gp and (DA ≀ G) ∩ Mnil, so resolving its learnability may
require new insights besides the intersection-closure/submodule-learning algo-
rithm. Note that, contrary to what one could think, DA ≀ Gnil is not equal to (DA ≀ G) ∩ Mnil: there are monoids in this class that can only be built by using unsolvable groups and then applying homomorphisms that leave only nilpotent subgroups; they cannot be obtained starting from nilpotent groups alone. Current techniques seem insufficient (and may remain so forever) to analyze even these traces of unsolvability.
Are there other extensions of DA ≀ Gp that we could investigate from the learning point of view? The obvious route is trying to extend the DA or Gp parts separately. For the DA part, it is known [Sch76, Tes03] that every aperiodic monoid not in DA is necessarily divided by one of two well-identified monoids, named U and BA2. Monoid U is the syntactic monoid of the language {a, b}*aa{a, b}*, and programs over U are equivalent in power, up to polynomials, to DNF formulas. Therefore, by Fact 1, extending DA in this direction implies learning at least DNF. Monoid BA2 is the syntactic monoid of (ab)*, and interestingly, although it is aperiodic, programs over it can be simulated (essentially) by OR gates fed by parity gates. In fact it is in DA ≀ Gp for every p, so we know it is learnable.
If we try to extend on the group part, we have already mentioned that the two classes of groups beyond Gp for which we have lower bounds are Gnil and Gp ≀ Ab. We have already discussed the problems concerning DA ≀ Gnil. For Gp ≀ Ab, the corresponding circuits are MOD_p ∘ MOD_m circuits, and we showed them to be learnable from Membership queries alone in the previous section. With Equivalence queries, however, learning MOD_p ∘ MOD_m would also imply learning MOD_p ∘ MOD_m ∘ NC^0 and, as discussed in the previous section, this seems difficult because we cannot even prove at present that these circuits cannot compute all of NC1. In particular, even learning programs over S3 (i.e. MOD3 ∘ MOD2 circuits) from Equivalence queries alone seems unresolved at present.
It has remained one of the “maximal” learning algorithms for boolean functions,
in the sense that no later result has superseded it.
Theorem 10. [BV96, BBB+00] Let F be any finite field. Functions Σ* → F
represented as Multiplicity Automata over F are learnable from Evaluation and
Equivalence queries in time polynomial in the size of the MA and |Σ|.
We can use Multiplicity Automata to compute boolean functions as follows: We
take Σ = {0, 1}, and some accepting subset A ⊆ F , and the function evaluates
to 1 on an input if the MA outputs an element in A, and 0 otherwise. However,
as basically argued in [BBTV97] we can use Fermat’s little theorem to turn an
MA into one that always outputs either 0 or 1 (as field elements) with only
polynomial blowup, and therefore we can omit the accepting subset.
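As a concrete illustration (our notation and toy automaton, not the construction of [BBB+00]), a multiplicity automaton over F_p assigns a matrix to each letter, and the computed value is a bilinear form around the product of the matrices along the input:

```python
# Evaluate a multiplicity automaton over F_p: output is
# (initial row vector) * M_{x1} * ... * M_{xn} * (final column vector) mod p.

def ma_output(init, matrices, final, word, p):
    vec = init[:]
    for letter in word:
        M = matrices[letter]
        vec = [sum(vec[i] * M[i][j] for i in range(len(vec))) % p
               for j in range(len(M[0]))]
    return sum(v * f for v, f in zip(vec, final)) % p

# A 2-state MA over F_3 counting the number of 1s in the input, mod 3:
# reading a 1 multiplies by [[1,1],[0,1]], whose top-right entry accumulates.
p = 3
matrices = {0: [[1, 0], [0, 1]],                   # letter 0: identity
            1: [[1, 1], [0, 1]]}                   # letter 1: increment counter
print(ma_output([1, 0], matrices, [0, 1], [1, 0, 1, 1], p))   # -> 0 (3 ones)
```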
In this section we identify a class of monoids that can be simulated by MA’s,
but not the other way round. Yet, it can simulate most classes of boolean func-
tions whose learnability was proved via the MA-learning algorithm.
Note that it will be impossible to find a class of monoids that, in our setting, is precisely equivalent (up to polynomial blowup) to the whole class of
MA. This is true for the simple reason that the complexity of a function measured
as “shortest program length” cannot grow under renaming of input variables: it
suffices to change the variable names in the instructions of the program. MA,
on the other hand, read their input in the fixed order x1 , . . . , xn , so renaming
the input variables in a function can force an exponential growth in MA size. Consider as an example the function $\bigwedge_{i=1}^{n} (x_{2i-1} = x_{2i})$: clearly, it is computed by an MA of size O(n) that simply checks equality of appropriate pairs of adjacent letters in its input string. However, its permutation $\bigwedge_{i=1}^{n} (x_i = x_{2n-i+1})$ is the palindrome function, whose MA size is roughly 2^n over any field.
Our characterization uses the notion of Mal’tsev product of two monoids A and
B, denoted A m B. We do not define the algebraic operation formally. We use
instead the following property, specific to our case [Wei87]: Let M be a monoid in LGp m Com, i.e., the Mal'tsev product of a monoid in LGp by one in Com.
Then, the product in M of a string of elements m1 . . . mn can be determined from
the truth of a fixed number of logical conditions of the following form: There are
elements a1 , . . . , ak in M , a number r ∈ [p], and commutative languages L0 ,
. . . , Lk over M such that the number of factorizations of m1 . . . mn of the form
L0 a1 L1 a2 L2 . . . Lk−1 ak Lk is r modulo p.
Contrived as it seems, LGp m Com is a natural borderline in representation
theory. Recent and deep work by Margolis et al [AMV05, AMSV09] shows that
semigroups in LGp m Com are exactly those that can be embedded into a semi-
group of upper-triangular matrices over a field of characteristic p (and any size).
The main result in this section is:
Theorem 11. Let M be a monoid in LGp m Com. Suppose that M is defined as above by a boolean combination of at most ℓ conditions of length at most k, using commutative languages whose monoid has size C. Then every program of length s over M is equivalent to an MA over Fp of size (s + C)^c, where c = c(p, ℓ, k).
Proof. (of Theorem 11) (Sketch). Fix a program (P, A) over M of length s. Let m1, . . . , ms be the sequence of elements in M produced by the instructions of P for a given input x1 . . . xn. The value of P for an input, hence whether it belongs to A, can be determined from the truth or falsehood of conditions as described above, each one given by a tuple of letters a1, . . . , ak and commutative languages L0, . . . , Lk.

For each such condition, we build an MA to check it as follows: The MA is the direct sum of $\binom{s}{k}$ MA's, one for each choice of the positions where the a1, . . . , ak witnessing a factorization could appear. Each MA concurrently checks that each of the chosen positions contains the right a_i (when the input variable producing the corresponding element m_j is available) and concurrently checks whether the subword w_i between a_i and a_{i+1} is in the language L_i. Crucially, since L_i is in Com, membership of w_i in L_i can be computed by a fixed-width automaton, regardless of the order in which the variables producing w_i are read. The automaton produces 0 if this check fails for some i, and 1 otherwise. It can be checked that the resulting automaton for each choice has size polynomial in s.

For each condition L0 a1 L1 . . . ak Lk, counting the number of factorizations mod p amounts to taking the MA built for all possible guesses and adding them over Fp.

To conclude the proof, take all MA's resulting from the previous construction and raise them to the (p − 1)-th power. That increases their size by a power of p − 1, and by Fermat's little theorem they become 0/1-valued. The boolean combination of several conditions can then be expressed by a fixed number of sums and products in Fp, with polynomial blowup.
We next note that several classes that were originally shown to be learnable by proving that they are polynomially simulated by MA are in fact captured by programs over LGp m Com.
Theorem 12. The following classes of boolean functions are polynomially simulated by programs over LGp m Com, hence are learnable from Membership and Equivalence queries as MA:

easily computed from c and n and commutative, hence in LGp m Com. (See the extended version for details.)
Finally, we conjecture that LGp m Com is the largest class of monoids that are polynomially simulated by MA, hence, the largest class we can expect to learn from MA within our algebraic framework:

Conjecture 1. If a monoid M is not in LGp m Com, then programs over M are not polynomially simulated by MA's over Fp.

The proof of this conjecture should be within reach given the characterization given in [TT06] of the monoids that are not in LGp m Com: this happens iff the monoid is divided by either the monoids U or BA2 described before, or by a so-called Tq monoid, or by a monoid whose commutator subgroup is not a p-group. It would thus suffice to show that programs over these four kinds of monoids cannot always be polynomially simulated by MA over Fp.
References
[AMSV09] Almeida, J., Margolis, S.W., Steinberg, B., Volkov, M.V.: Representa-
tion theory of finite semigroups, semigroup radicals and formal language
theory. Trans. Amer. Math. Soc. 361, 1429–1461 (2009)
[AMV05] Almeida, J., Margolis, S.W., Volkov, M.V.: The pseudovariety of semi-
groups of triangular matrices over a finite field. RAIRO - Theoretical
Informatics and Applications 39(1), 31–48 (2005)
[Ang87] Angluin, D.: Learning regular sets from queries and counterexamples.
Information and Computation 75, 87–106 (1987)
[Ang88] Angluin, D.: Queries and concept learning. Machine Learning 2, 319–342
(1988)
[Bar89] Barrington, D.A.: Bounded-width polynomial-size branching programs
recognize exactly those languages in NC1 . Journal of Computer and Sys-
tem Sciences 38, 150–164 (1989)
[BBB+ 00] Beimel, A., Bergadano, F., Bshouty, N.H., Kushilevitz, E., Varricchio, S.:
Learning functions represented as multiplicity automata. Journal of the
ACM 47, 506–530 (2000)
[BBTV97] Bergadano, F., Bshouty, N.H., Tamon, C., Varricchio, S.: On learning
branching programs and small depth circuits. In: Ben-David, S. (ed.)
EuroCOLT 1997. LNCS, vol. 1208, pp. 150–161. Springer, Heidelberg
(1997)
[BCJ93] Blum, A., Chalasani, P., Jackson, J.C.: On learning embedded symmetric
concepts. In: COLT, pp. 337–346 (1993)
[BK] Bshouty, N.H., Kushilevitz, E.: Learning from membership queries /
online learning. Course notes in N. Bshouty’s homepage
[BST90] Mix Barrington, D.A., Straubing, H., Thérien, D.: Non-uniform automata
over groups. Information and Computation 89, 109–132 (1990)
[BT94] Beigel, R., Tarui, J.: On ACC. Computational Complexity 4, 350–366
(1994)
[BV96] Bergadano, F., Varricchio, S.: Learning behaviors of automata from mul-
tiplicity and equivalence queries. SIAM Journal on Computing 25, 1268–
1280 (1996)
[CGPT06] Chattopadhyay, A., Goyal, N., Pudlák, P., Thérien, D.: Lower bounds for
circuits with modm gates. In: FOCS, pp. 709–718 (2006)
[CKK+ 07] Chattopadhyay, A., Krebs, A., Koucký, M., Szegedy, M., Tesson, P.,
Thérien, D.: Languages with bounded multiparty communication com-
plexity. In: Thomas, W., Weil, P. (eds.) STACS 2007. LNCS, vol. 4393,
pp. 500–511. Springer, Heidelberg (2007)
[EH89] Ehrenfeucht, A., Haussler, D.: Learning decision trees from random ex-
amples. Information and Computation 82(3), 231–246 (1989)
[GT03] Gavaldà, R., Thérien, D.: Algebraic characterizations of small classes of
boolean functions. In: Alt, H., Habib, M. (eds.) STACS 2003. LNCS,
vol. 2607, pp. 331–342. Springer, Heidelberg (2003)
[GTT06] Gavaldà, R., Tesson, P., Thérien, D.: Learning expressions and programs
over monoids. Inf. Comput. 204(2), 177–209 (2006)
[HS07] Hellerstein, L., Servedio, R.A.: On pac learning algorithms for rich
boolean function classes. Theor. Comput. Sci. 384(1), 66–76 (2007)
[HSW90] Helmbold, D.P., Sloan, R.H., Warmuth, M.K.: Learning nested differences
of intersection-closed concept classes. Machine Learning 5, 165–196 (1990)
[KLPV87] Kearns, M.J., Li, M., Pitt, L., Valiant, L.G.: On the learnability of boolean
formulae. In: STOC, pp. 285–295 (1987)
[KS06] Klivans, A.R., Shpilka, A.: Learning restricted models of arithmetic cir-
cuits. Theory of Computing 2(1), 185–206 (2006)
[Kus97] Kushilevitz, E.: A simple algorithm for learning O(log n)-term DNF. Inf.
Process. Lett. 61(6), 289–292 (1997)
[PT88] Péladeau, P., Thérien, D.: Sur les langages reconnus par des groupes nilpo-
tents. Compte-rendus de l’Académie des Sciences de Paris, 93–95 (1988);
Translation to English as ECCC-TR01-040, Electronic Colloquium on
Computational Complexity (ECCC)
[Riv87] Rivest, R.L.: Learning decision lists. Machine Learning 2(3), 229–246
(1987)
[Sch76] Schützenberger, M.P.: Sur le produit de concaténation non ambigu. Semi-
group Forum 13, 47–75 (1976)
[She08] Sherstov, A.A.: Communication lower bounds using dual polynomials.
Bulletin of the EATCS 95, 59–93 (2008)
[Sim95] Simon, H.-U.: Learning decision lists and trees with equivalence-queries.
In: Vitányi, P.M.B. (ed.) EuroCOLT 1995. LNCS, vol. 904, pp. 322–336.
Springer, Heidelberg (1995)
[Tes03] Tesson, P.: Computational Complexity Questions Related to Finite
Monoids and Semigroups. PhD thesis, School of Computer Science,
McGill University (2003)
[TT04] Tesson, P., Thérien, D.: Monoids and computations. Intl. Journal of Al-
gebra and Computation 14(5-6), 801–816 (2004)
[TT05] Tesson, P., Thérien, D.: Complete classifications for the communication
complexity of regular languages. Theory Comput. Syst. 38(2), 135–159
(2005)
[TT06] Tesson, P., Thérien, D.: Bridges between algebraic automata theory and
complexity. Bull. of the EATCS 88, 37–64 (2006)
[Val84] Valiant, L.G.: A theory of the learnable. Communications of the ACM 27,
1134–1142 (1984)
[Wei87] Weil, P.: Closure of varieties of languages under products with counter.
J. of Comp. Syst. Sci. 2(3), 229–246 (1987)
Adaptive Estimation of the Optimal ROC Curve and a Bipartite Ranking Algorithm
S. Clémençon and N. Vayatis
1 Introduction
For a few decades, ROC curves have been widely used as the gold standard for assessing performance in areas such as signal detection, medical diagnosis, and credit risk screening. More recently, ROC analysis has become an area of growing interest in Machine Learning. Various aspects are considered in this new
approach such as model evaluation, model selection, machine learning metrics
for evaluating performance, model construction, multiclass ROC, geometry of
the ROC space, confidence bands for ROC curves, improving performance of
classifiers, connection between classifiers and rankers, model manipulation (see
for instance [2] and references therein). We focus here on the problem of bipartite
ranking and the issue of ROC curve optimization. Previous work on bipartite
ranking ([3], [4], [5]) considered the AUC criterion as the optimization target.
However, this criterion is known to weight the errors uniformly while ranking
rules with similar AUC may behave very differently on a subset of the input
space.
In the paper, we focus on two problems: (i) the estimation of the optimal curve ROC∗, and (ii) the construction of a consistent scoring rule whose ROC curve converges in supremum norm to the optimal ROC∗. In contrast to binary classifica-
tion or AUC maximization, the classical empirical risk minimization approach
2 Setup
2.1 Probabilistic Model
The probabilistic setup is the same as the one in standard binary classification.
Here and throughout, (X, Y ) denotes a pair of random variables where Y ∈
{−1, +1} is a binary label and X models some observation for predicting Y , tak-
ing its values in a high-dimensional feature space X ⊂ Rd . The joint distribution
of (X, Y ) is entirely determined by the pair (μ, η) where μ denotes the marginal
distribution of X and the regression function η(x) = P{Y = +1 | X = x}, x ∈ X .
We also introduce the theoretical proportion p = P{Y = +1}, as well as G and
H, the conditional distributions of X given Y = +1 and Y = −1 respectively.
Hence, a good scoring function is such that, for any level α ∈ (0, 1), the power
ROC(s, α) of the test it defines is close to the optimal power ROC∗ (α). The
sup-norm $\|\mathrm{ROC}^* - \mathrm{ROC}(s, \cdot)\|_\infty$ provides a natural way of measuring the performance of a scoring rule s(x). The
ROC curve of the scoring function s(x) can be straightforwardly estimated from the training dataset Dn by the stepwise function

$$\alpha \in (0,1) \mapsto \widehat{\mathrm{ROC}}(s, \alpha) = 1 - \widehat{G}_s \circ \widehat{H}_s^{-1}(1 - \alpha),$$

where

$$\widehat{H}_s(t) = \frac{1}{n_-} \sum_{i:\, Y_i = -1} \mathbb{I}\{s(X_i) \le t\} \quad \text{and} \quad \widehat{G}_s(t) = \frac{1}{n_+} \sum_{i:\, Y_i = +1} \mathbb{I}\{s(X_i) \le t\},$$

with $n_+ = \sum_{i=1}^{n} \mathbb{I}\{Y_i = +1\} = n - n_-$.
However, the target curve ROC∗ is unknown in practice and no empirical counterpart is directly available for the deviation $\|\mathrm{ROC}^* - \mathrm{ROC}(s, \cdot)\|_\infty$. For this reason, empirical risk minimization (ERM) strategies are generally based on the $L_1$-distance, leading to the popular AUC criterion: minimizing $\|\mathrm{ROC}^* - \mathrm{ROC}(s, \cdot)\|_{L_1([0,1])}$ indeed boils down to maximizing

$$\mathrm{AUC}(s) \stackrel{\mathrm{def}}{=} \int_{\alpha=0}^{1} \mathrm{ROC}(s, \alpha)\, d\alpha.$$
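These estimators translate directly into code; the sketch below (ours, with simulated data and an evenly spaced grid of levels α) computes the stepwise empirical ROC curve and a Riemann-sum approximation of the empirical AUC:

```python
# Empirical ROC curve of a scoring function from labeled data, plus AUC.
import numpy as np

def empirical_roc(scores, labels, alphas):
    """ROC-hat(s, alpha) = 1 - G_s(H_s^{-1}(1 - alpha)) on a grid of alphas."""
    neg = scores[labels == -1]                     # sample from H_s
    pos = scores[labels == +1]                     # sample from G_s
    thresholds = np.quantile(neg, 1.0 - alphas)    # H_s^{-1}(1 - alpha)
    return np.array([(pos > t).mean() for t in thresholds])

rng = np.random.default_rng(0)
labels = rng.choice([-1, 1], size=1000)
scores = labels * 0.8 + rng.normal(size=1000)      # an informative scoring rule
alphas = np.linspace(0.01, 0.99, 99)
roc = empirical_roc(scores, labels, alphas)
print("AUC ~", roc.mean())                         # Riemann sum of the curve
```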
where ν(dα) is an arbitrary finite positive measure on [0, 1] with same support
as the distribution H ∗ , is optimal with respect to the ROC criterion. The next
proposition also illustrates this view on the problem. We set the notations:
$$R^*_\alpha = \{x \in \mathcal{X} \,|\, \eta(x) > Q^*(\alpha)\} \quad \text{and} \quad R_{s,\alpha} = \{x \in \mathcal{X} \,|\, s(x) > Q(s(X), \alpha)\},$$

where $Q(s(X), \alpha)$ is the quantile of order $(1 - \alpha)$ of the conditional distribution of $s(X)$ given $Y = -1$.
Proposition 1. [6]. Let s be a scoring function and α ∈ (0, 1) such that Q∗(α) < 1. Suppose additionally that the cdf $H_s$ (respectively, $H^*$) is continuous at $Q(s(X), \alpha)$ (resp. at $Q^*(\alpha)$). Then, we have:

$$\mathrm{ROC}^*(\alpha) - \mathrm{ROC}(s, \alpha) = \frac{\mathbb{E}\big(|\eta(X) - Q^*(\alpha)| \cdot \mathbb{I}\{X \in R^*_\alpha \,\Delta\, R_{s,\alpha}\}\big)}{p\,(1 - Q^*(\alpha))},$$

where Δ denotes the symmetric difference between sets.
This result shows that the pointwise difference between the dominating ROC curve and the one related to a candidate scoring function s may be interpreted as the error made in recovering the specific level set $R^*_\alpha$ through $R_{s,\alpha}$.
3 Adaptive Approximation
Here we focus on very simple approximants of ROC∗ , taken as piecewise constant
curves. Precisely, to any subdivision σ : α0 = 0 < α1 < . . . < αK < αK+1 = 1
of the unit interval, we associate the curve given by: ∀α ∈ (0, 1),

$$E_\sigma(\mathrm{ROC}^*)(\alpha) = \sum_{k=0}^{K} \mathbb{I}\{\alpha \in [\alpha_k, \alpha_{k+1})\} \cdot \mathrm{ROC}^*(\alpha_k). \tag{3}$$

We point out that the approximant $E_\sigma(\mathrm{ROC}^*)(\alpha)$ is actually a ROC curve. It coincides indeed with $\mathrm{ROC}(s^*_\sigma, \cdot)$ where $s^*_\sigma$ is the piecewise constant scoring function given by:

$$\forall x \in \mathcal{X}, \quad s^*_\sigma(x) = \sum_{k=1}^{K+1} \mathbb{I}\{x \in R^*_{\alpha_k}\}, \tag{4}$$

which is obtained by "overlaying" the regression level sets $R^*_{\alpha_k} = \{x \in \mathcal{X} : \eta(x) > Q^*(\alpha_k)\}$, 1 ≤ k ≤ K.
Adaptive approximation. It is well-known folklore in free-knot spline approximation that the approximation rate in supremum norm by piecewise constant functions with at most K pieces is of the order O(K^{−1}) if and only if the target function belongs to the space BV([0, 1])¹, see Chapter 12 in [10]. From a practical perspective however, in the absence of full knowledge of the target curve, it is a very challenging task to determine a grid of points {α_k : 1 ≤ k ≤ K} that yields a nearly optimal approximant. In the case where the points of the mesh grid are fixed in advance independently of the curve f to approximate, say with uniform spacing, the rate of approximation is of optimal order O(K^{−1}) if and only if f belongs to the space Lip1([0, 1]) of absolutely continuous functions f such that f′ ∈ L∞([0, 1]).
The latter condition is precisely the type of assumption we would like to avoid in
the present work. We propose to use adaptive approximation schemes instead of
fixed grids. In such procedures, the mesh grid is progressively refined by adding
new breakpoints, as further information about the local variation of the target
is gained: this way, one uses a coarse mesh where the target is smooth, and a
finer mesh where it exhibits high degrees of variability. Given the properties of
the target ROC∗ (concave and non decreasing curve connecting (0, 0) to (1, 1)),
an ideal mesh grid should be finer and finer as one gets close to the origin.
Dyadic recursive partitioning. For computational reasons, here we shall re-
strict ourselves only to a dyadic grid of points $\alpha_{j,k} = k\,2^{-j}$, with $j \in \mathbb{N}$ and $k \in \{0, \ldots, 2^j - 1\}$, and to partitions of the unit interval [0, 1] produced by recursive dyadic partitioning: any dyadic interval $I_{j,k} = [\alpha_{j,k}, \alpha_{j,k+1})$ is possibly split into two halves, producing two siblings $I_{j+1,2k}$ and $I_{j+1,2k+1}$, depending on the (estimated) local properties of the target curve. The adaptive estimation algorithm described in the next section will then appear as a top-down search strategy through a tree structure T, on which the $I_{j,k}$'s are aligned. Precisely, we will consider approximants of the form:

$$\sum_{I_{j,k} \in \{\text{terminal nodes}\}} \mathrm{ROC}^*(\alpha_{j,k}) \cdot \mathbb{I}\{\alpha \in [\alpha_{j,k}, \alpha_{j,k+1})\},$$
where the sum is taken over all dyadic intervals corresponding to terminal nodes,
determined by weights ω(·) fulfilling the two conditions:
(i) (Keep-or-kill) For any dyadic interval I ⊂ [0, 1), the weight ω(I) belongs to {0, 1}.
(ii) (Heredity) If ω(I) = 1, then for any dyadic interval I′ such that I ⊂ I′, we have ω(I′) = 1. If ω(I) = 0, then for any dyadic subinterval I′ ⊂ I, we have ω(I′) = 0.
Each collection ω of weights satisfying these two constraints is said to be admissible and determines the nodes of a subtree $T_\omega$ of the tree T representing the set of all dyadic intervals. A dyadic subinterval I is said to be terminal when ω(I) = 1 and ω(I′) = 0 for every dyadic subinterval I′ ⊂ I: terminal subintervals correspond to the outer leaves of $T_\omega$ and form a partition $P_\omega$ of [0, 1). The algorithm described in the next section consists of selecting those intervals, i.e. the set ω. We denote by $\sigma_\omega$ the mesh grid made of the endpoints of the terminal subintervals selected by the collection of weights ω. Given two admissible sequences of weights ω1 and ω2, the mesh $\sigma_{\omega_1}$ is said to be finer than $\sigma_{\omega_2}$ when {I : ω2(I) = 0} ⊂ {I : ω1(I) = 0}.

¹ Recall that the space BV([0, 1]) of functions of bounded variation on (0, 1) is the space of absolutely continuous functions f : (0, 1) → R such that f′ ∈ L¹([0, 1]).
where the maximum is taken over the set B(X) of all measurable subsets W ⊂ X. Equivalently, this boils down to solving the constrained optimization problem:

$$\sup_{R \in \mathcal{B}(\mathcal{X})} G(R) \quad \text{subject to} \quad H(R) \le \alpha.$$
From a statistical perspective, the search should be based on the empirical distributions:

$$\widehat{H}(dx) = \frac{1}{n_-} \sum_{i=1}^{n} \mathbb{I}\{Y_i = -1\} \cdot \delta_{X_i}(dx) \quad \text{and} \quad \widehat{G}(dx) = \frac{1}{n_+} \sum_{i=1}^{n} \mathbb{I}\{Y_i = +1\} \cdot \delta_{X_i}(dx),$$

OP(α, φ):

$$\sup_{R \in \mathcal{R}} \widehat{G}(R) \quad \text{subject to} \quad \widehat{H}(R) \le \alpha + \phi,$$

is of order O(n^{−1/2}).
Note that the assumption A4 is satisfied, for instance, when R is a VC class (see
for instance [12] for the use of Rademacher averages in complexity control).
Theorem 1. [6]. Suppose that assumptions A1–A4 are fulfilled and for all δ ∈ (0, 1), set

$$\phi = \phi(\delta, n) = 2 A_n + \sqrt{\frac{2 \log(1/\delta)}{n}}.$$

Then, there exists a constant c < ∞ such that for all δ ∈ (0, 1), we have with probability at least 1 − δ: ∀n ∈ N∗, ∀α ∈ (0, 1),

$$H(\widehat{R}_\alpha) \le \alpha + 2\phi(\delta/2, n) \quad \text{and} \quad G(\widehat{R}_\alpha) \ge \mathrm{ROC}^*(\alpha) - 2\phi(\delta/2, n).$$
with a = 1/2 and $c = \sup_t f^*(t)$. Notice that this condition is incompatible with assumption A2 when a > 1/2. It has been shown in [6] (see Theorem 12 therein) that, under this assumption, the deviation $\mathrm{ROC}^*(\alpha) - G(\widehat{R}_\alpha)$ is then of order O(n^{−5/8}).
(Input.) Target tolerance ε ∈ (0, 1). Volume tolerance φ > 0. Training data Dn = {(Xi, Yi) : 1 ≤ i ≤ n}. Class R of level set candidates.
1. (Initialization.) Set ω∗(I_{0,0}) = 0 and ω∗(I_{j,k}) = 1 for all dyadic subintervals I ⊊ I_{0,0} = [0, 1). Take β_{0,0} = 0 and β_{0,1} = 1.
2. (Iterations.) For all j ≥ 0, for all k ∈ {0, . . . , 2^j − 1}: if ω∗(I_{j,k}) = 0, then
(a) Compute Ê(I_{j,k}) = β_{j,k+1} − β_{j,k},
(b) If Ê(I_{j,k}) > ε, then
i. set ω∗(I_{j+1,2k}) = ω∗(I_{j+1,2k+1}) = 0,
ii. solve the problem OP(α_{j+1,2k+1}, φ) → solution $\widehat{R}_{\alpha_{j+1,2k+1}}$,
iii. update: β_{j+1,2k} = β_{j,k}, β_{j+1,2k+1} = $\widehat{G}(\widehat{R}_{\alpha_{j+1,2k+1}})$, and β_{j+1,2k+2} = β_{j,k+1}.
(c) Else, leave the weights of the siblings of I_{j,k} unchanged.
3. (Stopping rule.) The algorithm terminates as soon as the weights ω(·) of the nodes of the current level j are all equal to 1.
(Output.) Let σ∗ be the collection of dyadic levels α_{j,k} corresponding to the terminal nodes defined by ω∗. Compute the ROC∗ estimate:

$$\widehat{\mathrm{ROC}}{}^*(\alpha) = \sum_{\alpha_{j,k} \in \sigma^*} \widehat{G}(\widehat{R}_{\alpha_{j,k}}) \cdot \mathbb{I}\{\alpha \in I_{j,k}\}.$$
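In code, the refinement loop of Algorithm 1 looks roughly as follows (a schematic sketch: `ghat` is our stand-in for solving OP(α, φ) and returning Ĝ(R̂_α), and the toy curve α ↦ α^{1/4} merely mimics a concave optimal ROC curve):

```python
# Recursive dyadic refinement of the false-positive-rate axis: split an
# interval while the estimated local increase of the target curve exceeds eps.

def adaptive_roc_grid(ghat, eps, max_depth=12):
    terminal = []                                  # intervals kept as-is
    stack = [(0.0, 1.0, ghat(0.0), ghat(1.0))]
    while stack:
        a1, a2, b1, b2 = stack.pop()
        if b2 - b1 > eps and (a2 - a1) > 2.0 ** -max_depth:
            mid = (a1 + a2) / 2.0
            bmid = ghat(mid)                       # one call to OP per split
            stack += [(a1, mid, b1, bmid), (mid, a2, bmid, b2)]
        else:
            terminal.append((a1, b1))              # estimate constant b1 here
    return sorted(terminal)

grid = adaptive_roc_grid(lambda a: a ** 0.25, eps=0.05)
print(len(grid), grid[:3])                         # finer mesh near the origin
```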
For a dyadic interval I = [α1, α2), set

$$\widehat{E}(I) = \widehat{G}(\widehat{R}_{\alpha_2}) - \widehat{G}(\widehat{R}_{\alpha_1}).$$

The quantity Ê(I) is nonnegative (by construction, the mapping $\alpha \in (0,1) \mapsto \widehat{G}(\widehat{R}_\alpha)$ is nondecreasing with probability one) and should be viewed as an empirical counterpart of $E(I) := \mathrm{ROC}^*(\alpha_2) - \mathrm{ROC}^*(\alpha_1)$, which provides a simple way of estimating the variability of the (nondecreasing) function ROC∗ on I. This measure is additive, as is its statistical version Ê(·):

$$E(I) = E(I_1) + E(I_2)$$

for any siblings I1 and I2 of the same subinterval. It controls the approximation rate of ROC∗ by a constant on any interval I ⊂ [0, 1) in the sense that:

$$\sup_{\alpha \in I} |\mathrm{ROC}^*(\alpha) - \mathrm{ROC}^*(\alpha_1)| \le E(I).$$
The next result provides a bound for the rate of the estimator produced by
Algorithm 1.
Theorem 2. Let δ ∈ (0, 1). Suppose that assumptions A1–A5 are fulfilled. Take ε = ε(δ, n) := 7φ(δ/2, n). Then we have, with probability at least 1 − δ:

$$\forall n \ge 1, \quad \|\mathrm{ROC}^* - \widehat{\mathrm{ROC}}{}^*\|_\infty \le 16\,\phi(\delta/2, n),$$

so that

$$\forall n \ge 1, \quad \|\mathrm{ROC}^* - \widehat{\mathrm{ROC}}{}^*\|_\infty \le \frac{c}{\sqrt{n}} + \sqrt{\frac{2 \log(1/\delta)}{n}},$$

and the adaptive Algorithm 1 builds a partition σ∗ whose cardinality is at most of the order of √n.
(Input.) Target tolerance ε ∈ (0, 1). Volume tolerance φ > 0. Training data
Dn = {(Xi , Yi ) : 1 ≤ i ≤ n}. Class R of level set candidates.
1. (Algorithm 1.) Run Algorithm 1, in order to get the regression level set estimates $\widehat{R}_{\alpha(1)}, \ldots, \widehat{R}_{\alpha(\widehat{K})}$, where $\widehat{K} = \#\sigma^*$ and $0 = \alpha(1) < \ldots < \alpha(\widehat{K}) < 1$.
The next theorem states the consistency of the estimated scoring function under
the same complexity and regularity assumptions.
$$\|\mathrm{ROC}(\widehat{s}, \cdot) - \mathrm{ROC}^*\|_\infty \le c\,\sqrt{\frac{\log n}{n^{1/3}}}.$$
We observe that the rate of convergence of the order of n−1/6 obtained in The-
orem 3 is much slower than the n−1/3 rate obtained in [6]. This is due to the
fact that we relaxed the regularity assumptions on the optimal ROC curve and
used the approximation space made of piecewise constant curves, while we used
piecewise linear scoring curves before. We expect that, using nonlinear approx-
imation techniques, the n−1/6 -rate can be significantly improved but we leave
this issue open for a future work.
7 Conclusion
In this paper, we have seen how strong consistency of a piecewise constant es-
timate of the optimal ROC curve can be guaranteed under weak regularity as-
sumptions. Additionally, our approach leads to a strongly consistent piecewise
constant scoring rule in terms of ROC curve performance. Whereas the subdivi-
sion of the false positive rate axis used for building the ROC curve approximant
had to be fixed in advance in the original RankOver approach proposed in [6],
which was viewed as a severe restriction on its applicability, the essential novelty
of the two algorithms presented here lies in their ability to adapt automatically
to the variability of the (unknown) optimal ROC curve.
References
1. Clémençon, S., Vayatis, N.: Overlaying classifiers: a practical approach for optimal
ranking. In: NIPS 2008: Proceedings of the 2008 conference on Advances in neural
information processing systems, Vancouver, Canada, pp. 313–320 (2009)
2. Flach, P.: Tutorial on "The Many Faces of ROC Analysis in Machine Learning". In: ICML 2004 (2004)
3. Freund, Y., Iyer, R.D., Schapire, R.E., Singer, Y.: An efficient boosting algorithm
for combining preferences. Journal of Machine Learning Research 4, 933–969 (2003)
4. Agarwal, S., Graepel, T., Herbrich, R., Har-Peled, S., Roth, D.: Generalization
bounds for the area under the ROC curve. Journal of Machine Learning Research 6,
393–425 (2005)
5. Clémençon, S., Lugosi, G., Vayatis, N.: Ranking and empirical risk minimization
of U-statistics. The Annals of Statistics 36(2), 844–874 (2008)
6. Clémençon, S., Vayatis, N.: Overlaying classifiers: a practical approach to optimal
scoring. To appear in Constructive Approximation (hal-00341246) (2009)
7. Scott, C., Nowak, R.: Learning minimum volume sets. Journal of Machine Learning
Research 7, 665–704 (2006)
8. Clémençon, S., Vayatis, N.: Tree-structured ranking rules and approximation of
the optimal ROC curve. Technical Report hal-00268068, HAL (2008)
9. Clémençon, S., Vayatis, N.: Tree-structured ranking rules and approximation of
the optimal ROC curve. In: ALT 2008: Proceedings of the 2008 conference on
Algorithmic Learning Theory (2008)
10. Devore, R., Lorentz, G.: Constructive Approximation. Springer, Heidelberg (1993)
11. Scott, C., Nowak, R.: A Neyman-Pearson approach to statistical learning. IEEE
Transactions on Information Theory 51(11), 3806–3819 (2005)
12. Boucheron, S., Bousquet, O., Lugosi, G.: Theory of Classification: A Survey of
Some Recent Advances. ESAIM: Probability and Statistics 9, 323–375 (2005)
13. Tsybakov, A.: Optimal aggregation of classifiers in statistical learning. Annals of
Statistics 32(1), 135–166 (2004)
14. Devore, R.: A note on adaptive approximation. Approx. Theory Appl. 3, 74–78
(1987)
15. Bennett, C., Sharpley, R.: Interpolation of Operators. Academic Press, London
(1988)
16. Devore, R.: Nonlinear approximation. Acta Numerica, 51–150 (1998)
Appendix - Proofs
Proof of Theorem 2
We first prove a lemma which quantifies the uniform deviation of the empirical entropy from the true entropy over all dyadic scales; its bound involves the term

$$\cdots + \sup_{R \in \mathcal{R}} |\widehat{G}(R) - G(R)|.$$

The proof then immediately follows from the complexity assumption A4, combined with Theorem 1.
We now introduce the notation for partitions $\sigma_\varepsilon$ based on the optimal ROC curve at a target tolerance of ε. Let ε > 0 and consider the piecewise constant approximant built from the same recursive strategy as the one implemented by Algorithm 1, except that it is based on the (theoretical) error estimate E(·): $E_{\sigma_\varepsilon}(\mathrm{ROC}^*)$, $\sigma_\varepsilon$ denoting the associated mesh grid.
Choosing ε = 7φ(δ/2, n), we obtain that, with probability larger than 1 − δ, the mesh grid σ∗ is finer than $\sigma_{\varepsilon_1}$ where $\varepsilon_1 = \varepsilon_1(\delta, n) = \varepsilon + 6\phi(\delta/2, n) = 13\phi(\delta/2, n)$, but coarser than $\sigma_{\varepsilon_0}$ with $\varepsilon_0 = \varepsilon_0(\delta, n) = \varepsilon - 6\phi(\delta/2, n) = \phi(\delta/2, n)$. We thus have

$$\|E_{\sigma^*}(\mathrm{ROC}^*) - \mathrm{ROC}^*\|_\infty \le \|E_{\sigma_{\varepsilon_1}}(\mathrm{ROC}^*) - \mathrm{ROC}^*\|_\infty \le \varepsilon_1,$$
Now we use the following decomposition:

$$\|\mathrm{ROC}^* - \widehat{\mathrm{ROC}}{}^*\|_\infty \le \|\mathrm{ROC}^* - E_{\sigma^*}(\mathrm{ROC}^*)\|_\infty + \|E_{\sigma^*}(\mathrm{ROC}^*) - \widehat{\mathrm{ROC}}{}^*\|_\infty.$$

We have seen that the first term is bounded, with probability at least 1 − δ, by $\varepsilon_1 = 13\phi(\delta/2, n)$. On the same event, we have:

$$\|E_{\sigma^*}(\mathrm{ROC}^*) - \widehat{\mathrm{ROC}}{}^*\|_\infty \le \max_{1 \le k \le \widehat{K}} |G(\widehat{R}_{\alpha(k)}) - \widehat{G}(\widehat{R}_{\alpha(k)})| \le 3\,\phi(\delta/2, n),$$

where we have used Theorem 1 and a concentration inequality to derive the last inequality. We have thus proved the estimation error rate of 16φ(δ/2, n) for the output $\widehat{\mathrm{ROC}}{}^*$ of Algorithm 1.
We now show the bound on the cardinality of the partition as a function of the target tolerance parameter. Let us denote by $K_\varepsilon$ (= $\#\sigma_\varepsilon$) the number of pieces forming this approximant. We have the following result.

For a proof of this lemma, we refer to [14], and also to subsection 3.3 in [16] for more insights on adaptive approximation methods. It reveals that the number of pieces forming $\widehat{\mathrm{ROC}}{}^*$, i.e. the cardinality of $\widehat{\sigma}_{\varepsilon(\delta,n)}$, is bounded by

$$\#\sigma_{\varepsilon_0(\delta,n)} \le \kappa\, \frac{\|\mathrm{ROC}^*\|_{L \log L}}{\phi(\delta/2, n)}. \tag{7}$$
Proof of Theorem 3
The next lemma permits us to quantify the loss arising from the transformation
performed at step 2 of Algorithm 2.
as well as

$$|G(\widehat{R}_{\alpha(k)}) - \mathrm{ROC}^*(\alpha(k))| \le k\,\phi(\delta/2, n). \tag{9}$$

$$H(\widehat{R}_{\alpha(1)} \cup \widehat{R}_{\alpha(2)}) = H(\widehat{R}_{\alpha(2)}) + H(\widehat{R}_{\alpha(1)} \setminus \widehat{R}_{\alpha(2)})$$

and, since $R^*_{\alpha(1)} \subset R^*_{\alpha(2)}$, one may write $\widehat{R}_{\alpha(1)} \setminus \widehat{R}_{\alpha(2)}$ as

$$\left[\big(\widehat{R}_{\alpha(1)} \setminus R^*_{\alpha(1)}\big) \cup \big(\widehat{R}_{\alpha(1)} \cap R^*_{\alpha(1)}\big)\right] \setminus \left[\big(\widehat{R}_{\alpha(2)} \setminus R^*_{\alpha(2)}\big) \cup \big(\widehat{R}_{\alpha(2)} \cap R^*_{\alpha(2)}\big)\right].$$
The first term can be taken care of with Lemma 4. We now consider bounding the second term. We count the number of jumps of the piecewise constant curve $E_{\sigma^*}(\mathrm{ROC}^*)$ between the x-values given by $\alpha(k)$ and $H(\widehat{R}_{\alpha(k)})$. With probability
at least 1 − δ, the number of jumps is given by the product of the total number of jumps with the amplitude of the interval of false positive rate levels:

$$\max_{k \in \{1,\ldots,\widehat{K}\}} |H(\widehat{R}_{\alpha(k)}) - \alpha(k)| \le C \cdot \widehat{K} \cdot 2\,\phi(\delta/2, n) \le (C/\varepsilon^2)\, \phi(\delta/2, n),$$

where we have used Lemma 4 and a union bound in the first inequality and Lemma 2 in the second. Given the assumption A4, we are led to the calibration for ε of the order of n^{−1/6}, since we need to balance, up to some constants, ε with a term of the order of ε^{−2}φ(δ/2, n).
Complexity versus Agreement for Many Views
Co-regularization for Multi-view Semi-supervised Learning
O.-A. Maillard and N. Vayatis
1 Introduction
In real-life applications for classification tasks, different representations of a same
object may be available. Financial experts may use different sets of indicators
to assess the current market regime, while in the context of active computer
vision, several views of the same object are provided before rendering the deci-
sion. This problem is known as that of multi-view classification. After the early
work of (Blum & Mitchell, 1998) on learning from both labeled and unlabeled
data, this topic has been considered more recently by several authors (see for
example (Sridharan & Kakade, 2008),(Weston et al., 2005),(Zhou et al., 2004)).
In (Balcan & Blum, 2005), the authors propose a theoretical PAC-model for
semi-supervised learning where multi-view learning appears as a special case.
Due to the restriction over the search space (compatibility between different
views), multi-view learning may provide good generalization results, and indeed
this is the case in numerical experiments (e.g. (Belkin et al., 2005)). In (Rosen-
berg & Bartlett, 2007), these results are applied to a two-view learning problem
The first author is eligible for the E.M.Gold Award.
The second author was partly supported by the ANR Project TAMIS.
and explicit bounds on the Rademacher complexity of the class of predictors are
computed. Various algorithms are introduced, together with theoretical studies, in (Sindhwani et al., 2005). In the latter references, the central issue
addressed was to explain how consistency between views affects the performance
of classification procedures. Indeed, in multi-view learning, we consider individ-
uals predictors based on separate views, and one intuitive idea is that (1) having
a good final predictor is related to the agreement of individual predictors on
a majority of labels. It is generally assumed that (2) each view is independent from the others conditionally on the labels. Though this may be weakened
(see (Balcan et al., 2005)), providing theoretical justification to the heuristics
that conditional independence of the views allows for high-performance results
(two compatible classifiers trained on independent views are unlikely to agree
on a mislabeled item) has been the motivation for most of the works on this
topic. Thus we build on the same heuristics (1) and (2). As we intend to exploit
all the information available in the classification task, our setup will also take
unlabeled data into consideration.
In the present paper, we consider semi-supervised multi-view binary classifica-
tion with many views. Allowing for more than two views brings up new questions.
For instance, (i) how does the number of views V affect complexity measures? and (ii) how should one choose the parameters when there are as many as O(V²) of them? For the first issue, we will focus on the Rademacher average and track down
the dependency on V and other parameters in this formulation. As far as the
second issue is concerned, various strategies can be invoked. In supervised clas-
sification, cross-validation (e.g. 10-fold) techniques are widely used due to their
ease of implementation, but theory is unavailable in most cases (see (Celisse,
2008) and references therein for some recent developments). Another idea comes
from recent work on clustering and makes use of the stability approach (see
(Ben-David et al., 2006)). This relies on strongly theoretically founded results
known as localization arguments (see (Koltchinskii, 2006)) which takes advan-
tage of the so-called "small ball estimates" (see also (Li & Linde, 1999), (Berthet
& Shi, 2001)). The stability approach has also been applied successfully to other
learning problems (see e.g. (Sindhwani & Rosenberg, 2008)). This is the one we
have chosen in order to perform the selection of parameters. Comparison of dif-
ferent selection procedures, although interesting by itself, is not the purpose of
this paper.
In the sequel, we introduce an algorithm which combines semi-supervised
and multi-view ideas and generalizes over previous algorithms: it contains RLS,
co-RLS and co-Laplacian (see (Rosenberg & Bartlett, 2007),(Sindhwani et al.,
2005)) algorithms as special cases. We use the setup of Reproducing Kernel
Hilbert Spaces (RKHS) and provide explicit upper and lower data-dependent
bounds on the Rademacher complexity of the class of functions involved by this
general algorithm. Our second contribution is to give a new parameter selection
procedure, based on the work of (Koltchinskii, 2006), together with explicit
stability bounds (L2 -diameter) on the localized class for the general algorithm,
which has not been investigated so far.
The paper is organized as follows: Section 2 defines our framework and the
objective function. Section 3 is devoted to the Rademacher complexity control
with the first main theorem (section 3.2), and the asymptotic behavior of the
bound. Section 4 presents our stability-based selection procedure and the second
main theorem, on the L2 local diameter of our class of functions. In Section 5,
we successfully apply the algorithm to some toy examples.
$F^{(v)}$, and L ⊂ F the product space of the spans. This complexity penalization leads to the following definition:

Definition 2 (Complexity). For $f = (f^{(1)}, \ldots, f^{(V)}) \in F$, we define:

$$\mathrm{Complexity}(f) = \sum_{v=1}^{V} \lambda_v \|f^{(v)}\|_v^2, \quad \text{where } \lambda \in \mathbb{R}_+^V.$$
$$\mathrm{Smoothness}(f) = \sum_{v=1}^{V} \gamma_v\, \tilde{f}^{(v)\top} \bar{L}\, \tilde{f}^{(v)}, \quad \text{where}$$

$$C^L(f) = \sum_{v_1 \ne v_2} c^L_{v_1, v_2} \sum_{i=1}^{l} \big[f^{(v_1)}(x_i^{(v_1)}) - f^{(v_2)}(x_i^{(v_2)})\big]^2$$

and

$$C^U(f) = \sum_{v_1 \ne v_2} c^U_{v_1, v_2} \sum_{i=l+1}^{l+u} \big[f^{(v_1)}(x_i^{(v_1)}) - f^{(v_2)}(x_i^{(v_2)})\big]^2.$$
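To fix ideas, here is a small numerical sketch (our notation: F[i, v] holds f^{(v)}(x_i^{(v)}), and all weights, norms, and the Laplacian are toy placeholders) that evaluates the penalties entering the objective:

```python
# Evaluate Complexity + Smoothness + labeled/unlabeled Agreement penalties.
import numpy as np

def penalties(F, l, lam, gam, norms_sq, L_bar, cL, cU):
    n, V = F.shape
    complexity = float(np.dot(lam, norms_sq))       # sum_v lam_v ||f^(v)||_v^2
    smoothness = sum(gam[v] * F[:, v] @ L_bar @ F[:, v] for v in range(V))
    agree_L = agree_U = 0.0
    for v1 in range(V):
        for v2 in range(V):
            if v1 != v2:
                d2 = (F[:, v1] - F[:, v2]) ** 2
                agree_L += cL[v1, v2] * d2[:l].sum()    # labeled points
                agree_U += cU[v1, v2] * d2[l:].sum()    # unlabeled points
    return complexity + smoothness + agree_L + agree_U

rng = np.random.default_rng(1)
n, V, l = 6, 3, 2
F = rng.normal(size=(n, V))
print(penalties(F, l, lam=np.ones(V), gam=0.1 * np.ones(V),
                norms_sq=np.ones(V), L_bar=np.eye(n),
                cL=np.ones((V, V)), cU=np.ones((V, V))))
```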
– Compute: the minimizer $f^* = (f^{*(1)}, \ldots, f^{*(V)})$ of the objective function [1].
– Output: $\phi = \frac{1}{V} \sum_{v=1}^{V} f^{*(v)}$.
We point out that there is a representer theorem for this setting. Indeed, for any fixed $f^{(2)}, \ldots, f^{(V)} \in \Pi_{v=2}^{V} F^{(v)}$, $f^{*(1)}$ minimizes a function

$$c_{f^{(2)},\ldots,f^{(V)}}\big(f(x_1^{(1)}), y_1, \ldots, f(x_n^{(1)}), y_n\big) + g_{f^{(2)},\ldots,f^{(V)}}\big(\|f\|_1\big)$$

w.r.t. f. Thus the representer theorem tells us that $f^{*(1)} \in L_1$. Iterating the
argument leads to f ∗ ∈ L. We also refer to (Sindhwani & Rosenberg, 2008) for
an alternative construction where one single RKHS combines all the views.
For specific choices of the parameters, we recover the former problems studied
in previous papers:
– when γ and C are 0 we have a Regularized Least Squares (RLS) in RKHS,
– when only γ = 0 we have a Co-Regularized Least Squares (co-RLS) problem
(see (Sindhwani et al., 2005)),
– when Agreement is diagonal nonzero (i.e. $c^L$ and $c^U$ are diagonal), we have a
co-Laplacian method (e.g. co-Laplacian RLS, co-Laplacian SVM, see (Sindhwani et al., 2005)); indeed, the $f^{(v)}$ are decoupled, and thus problem [1]
amounts to solving, for each $v$ (see the sketch below):

$$f^{(v)*} = \operatorname{argmin}_{f^{(v)} \in \mathcal{F}^{(v)}} \mathrm{Loss}(f^{(v)}) + \lambda_v \|f^{(v)}\|_v^2 + \gamma_v f^{(v)T} L f^{(v)}.$$
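To make the decoupled case concrete, here is a minimal sketch of the per-view solve under a quadratic loss, using the representer expansion $f^{(v)} = K^{(v)}\alpha^{(v)}$; the function name and the numpy-based interface are ours, not the paper's.

```python
import numpy as np

def solve_view(K, L, y_labeled, lam, gamma):
    """Per-view solve in the decoupled (co-Laplacian/RLS) case with
    quadratic loss: minimize over alpha
        ||K_l @ alpha - y||^2 + lam * alpha' K alpha + gamma * (K alpha)' L (K alpha)
    where K_l collects the rows of K at the l labeled points."""
    n = K.shape[0]
    l = len(y_labeled)
    K_l = K[:l, :]
    A = K_l.T @ K_l + lam * K + gamma * K @ L @ K   # normal equations matrix
    b = K_l.T @ y_labeled
    return np.linalg.solve(A + 1e-10 * np.eye(n), b)  # tiny ridge for stability
```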
3.1 Preliminaries

One nice property is that, under assumption (A1), the final predictor $\varphi$ belongs
to:

$$\mathcal{J} = \left\{ x \mapsto \frac{1}{V} \sum_{v=1}^{V} f^{(v)}(x^{(v)}) \;:\; (f^{(1)}, \ldots, f^{(V)}) \in \mathcal{H} \right\}$$
where $(\sigma_i)_{i \le n}$ are i.i.d. Rademacher random variables ($\mathbb{P}(\sigma_i = 1) = \mathbb{P}(\sigma_i = -1) = \frac{1}{2}$).
The following proposition, adapted from (Rosenberg & Bartlett, 2007), makes
use of this data-dependent complexity to derive an upper bound on the excess risk:

Proposition 1 (Excess risk). Let $L$ be any positive loss function, uniformly $\beta$-Lipschitz w.r.t. its first variable and upper-bounded by 1. Then, conditionally on
the unlabeled data, $\forall \delta \in (0,1)$, with probability at least $1-\delta$ over the labeled
points drawn i.i.d., for $\varphi_l^*$ the empirical minimizer of the objective function:

$$\mathbb{E}\big(L(\varphi_l^*(X), Y)\big) - \inf_{\varphi \in \mathcal{J}} \mathbb{E}\big(L(\varphi(X), Y)\big) \le 4\beta R_l(\mathcal{J}) + \frac{2}{\sqrt{l}}\Big(2 + 3\sqrt{\ln(2/\delta)/2}\Big)$$
The proof is an easy combination of classical generalization bounds with some
arguments from (Rosenberg & Bartlett, 2007) and the following contraction
principle: if h is β-Lipschitz and h(0) = 0, then Rn (h ◦ J ) ≤ 2βRn (J ) (see
(Ledoux & Talagrand, 1991)), together with the symmetry of J .
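Since Proposition 1 is driven by the empirical Rademacher complexity, it can help intuition to estimate that quantity by Monte Carlo over random sign vectors when the function class is finite; this generic sketch is our own illustration, not the paper's closed-form computation (which Theorem 1 below provides).

```python
import numpy as np

def empirical_rademacher(preds, n_draws=2000, seed=0):
    """Monte Carlo estimate of (2/l) E_sigma sup_f sum_i sigma_i f(x_i)
    for a finite candidate set, with preds[f, i] = f(x_i) on l points."""
    rng = np.random.default_rng(seed)
    _, l = preds.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=l)  # i.i.d. Rademacher signs
        total += np.max(preds @ sigma)           # sup over candidate functions
    return 2.0 * total / (n_draws * l)
```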
Data. With the following matrices, we decompose between labeled and unlabeled
data:

$$K^{(v)} = \big(k_v(x_i^{(v)}, x_j^{(v)})\big)_{1 \le i,j \le n} = \begin{pmatrix} K_L^{(v)} \\ K_U^{(v)} \end{pmatrix} \in \mathbb{R}^{n \times n} \quad \text{and} \quad \Pi = \begin{pmatrix} I_l \\ 0_{u,l} \end{pmatrix} \in \mathbb{R}^{nV \times l}$$
Smoothness. Let $L_I$ be the block-diagonal matrix with all $V$ blocks equal to
$L$. Note that we would have introduced $\tilde{\alpha}L$ instead if we had used each graph
Laplacian rather than the average Laplacian $L$ in the smoothness term.
Thanks to the previous notations, we can now state our first main theorem,
which gives explicit upper and lower data-dependent bounds on the Rademacher
complexity of the class of functions.
Theorem 1 (Rademacher complexity bound). Under assumption (A1),

$$\frac{\sqrt{2}\, b}{2^{1/4}\, V l} \;\le\; R_l(\mathcal{J}) \;\le\; \frac{2b}{Vl}$$

where $b^2 = \mathrm{tr}(B \tilde{\lambda}^{-1} \Pi K_L^T) - \mathrm{tr}(J^T (I + M)^{-1} J)$ with

– $B = (I + \tilde{\lambda}^{-1} \tilde{\gamma} L_I K)^{-1} \in \mathbb{R}^{nV \times nV}$
– $J = \sqrt{C}\, \delta \tilde{\lambda}^{-1} B^T K_L^T \in \mathbb{R}^{nV(V-1) \times l}$
– $M = \sqrt{C}\, \delta K B \tilde{\lambda}^{-1} \delta^T \sqrt{C} \in \mathbb{R}^{nV(V-1) \times nV(V-1)}$
Note that b is explicit as a difference of two terms. The first term depends only on
unlabeled data when Smoothness is null, and contains no co-regularization term.
The second term corresponds to the idea that there is a reduction in complexity
of the space. Indeed, in Section 3.3, we give some results about the behavior of
b enforcing this idea. As pointed out by (Sindhwani & Rosenberg, 2008), this term
is connected to a specific norm induced by the parameters and data over the
space.
This theorem generalizes previous results: for instance, if $V = 2$, $\gamma = 0$, and
$c^L_{v,w} = 0$, we recover exactly the previously known bound of (Rosenberg & Bartlett,
2007), where our $2c^U_{v,w}$ corresponds to their $\lambda$ and our $\lambda$ is their $(\gamma_F, \gamma_G)$. The
quantity $b^2$ is directly computable from data, as in the sketch below.
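Assuming the matrices $B$, $\tilde{\lambda}^{-1}$, $\Pi$, $K_L$, $J$ and $M$ have been assembled with the shapes stated in Theorem 1, $b^2$ follows from two traces; a minimal numpy sketch (our own helper, not code from the paper):

```python
import numpy as np

def theorem1_b2(B, lam_inv, Pi, K_L, J, M):
    """b^2 = tr(B lam^{-1} Pi K_L^T) - tr(J^T (I + M)^{-1} J),
    with shapes as in Theorem 1 (assumed compatible here)."""
    first = np.trace(B @ lam_inv @ Pi @ K_L.T)
    I = np.eye(M.shape[0])
    second = np.trace(J.T @ np.linalg.solve(I + M, J))
    return first - second
```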
3.3 Asymptotics
Let θ = (α, λ, γ, C) be the parameters of the learning problem, where α appears
in the graph-Laplacian, λ in the Complexity term, γ in the Smoothness term
and C in the Agreement term. The number of parameters thus grows as $O(V^2)$.
We study how the previous Rademacher bound changes with these parameters.
More agreement reduces space complexity. The second term appearing in the
expression of $b^2$ depends on the co-regularization (matrix) parameter $C$. To
see how constrained the space becomes under bigger penalization, we introduce
$\Delta(C) = \mathrm{tr}(J^T(I+M)^{-1}J)$, which can be written, provided that $C^{-1}$ exists,
as:

$$\Delta(C) = \mathrm{tr}\big(J_1^T (C^{-1} + M_1)^{-1} J_1\big)$$

where $J_1 = \delta \tilde{\lambda}^{-1} B^T K_L^T$ and $M_1 = \delta K B \tilde{\lambda}^{-1} \delta^T$.
Thus when the eigenvalues of $C$ increase to $+\infty$, $\Delta(C)$ tends to:

$$\Delta_\infty = \mathrm{tr}\big(K_L B \tilde{\lambda}^{-1} \delta^T (\delta K B \tilde{\lambda}^{-1} \delta^T)^{-1} \delta \tilde{\lambda}^{-1} B^T K_L^T\big),$$

which can be rewritten $\Delta_\infty = \mathrm{tr}(B \tilde{\lambda}^{-1} \Pi_l K_L^T)$, and shows that $b^2 \to 0$ in this
case. That $b$ decreases as the model gets more constrained is coherent with the
intuition of multi-view learning. Similarly, $b^2 \to 0$ whenever $\|\gamma\|$ or $\|\lambda\| \to \infty$.
$$b^2 = \mathrm{tr}\big(\Pi_l^T L_I^{-1} \tilde{\gamma}^{-1} \Pi_l\big) - \mathrm{tr}\big(\Pi_l^T L_I^{-1} \tilde{\gamma}^{-1} \delta^T (C^{-1} + \delta L_I^{-1} \tilde{\gamma}^{-1} \delta^T)^{-1} \delta L_I^{-1} \tilde{\gamma}^{-1} \Pi_l\big)$$
Note that when both $\gamma$ and $\lambda$ tend to 0, the previous bound may tend to $\infty$
even in some simple cases (which is coherent with the intuition). Note also that
the dependency on $V$ is hidden here in the trace.
$$B_n(\epsilon, \lambda) = 2\epsilon + \frac{\Delta_n(\epsilon)}{\lambda} + \frac{\log(\epsilon^{-1})}{\lambda n} + \sqrt{\frac{2 \log(\epsilon^{-1})}{\lambda^2 n} \big[ D_{F_n}^2(\epsilon) + 2 \Delta_n^2(\epsilon) \big]},$$

and $r_n(\epsilon, \lambda) = \inf\big\{ \alpha \in [0, 1];\ \sup_{j;\, 1 \ge \lambda_j \ge \alpha} B_n(\epsilon, \lambda_j) \le \lambda \big\}$.

Set also $q = 2 + \frac{\ln(r_n(\epsilon, \lambda))}{\ln(\lambda)}$. Then, with probability larger than $1 - \epsilon$:
In the general case, if the radii are too small, then such inclusions no longer
hold, and the intersection may even be empty. For our problem, we will simply
select the parameter $\theta$ inducing the largest range of quasi-optimal sets controlled
around the ERM, which is a notion of stability. Thus, for a given radius of
the true penalized ball, we want to minimize the critical radius $r_n$ w.r.t. $\theta$. A
side motivating intuition is that good stability allows for easy discovery of the
minimizer $f^*$.
where $H(r) = \{f;\ \hat{\pi}_{\theta,l}(f) \le r\}$, and estimate $R_n(\hat{F}_{\theta,n}(\epsilon))$ and $\hat{D}_{\hat{F}_{n,\theta}}(\epsilon)$ for
each parameter $\theta$. Note that the dependency w.r.t. $\theta = (\alpha, \lambda, \gamma, C)$ is hidden in
the definition. Thus we need to bound the Rademacher complexity of $\mathcal{J}(r)$ and
its $L_2, P_n$-diameter. An analysis of the proof of Theorem 1 shows that changing
$\mathcal{J} = \mathcal{J}(1)$ for $\mathcal{J}(r)$ affects the Rademacher bound by a factor $\sqrt{r}$, leading
to a bound $\frac{2 b(\theta) \sqrt{r}}{lV}$ for the first term. Following the same analysis as for the
$L_1$-diameter (or Rademacher complexity), the next theorem gives us the second
bound we need:
Theorem 2 (Empirical local $L_2$ diameter). Under assumption (A1),

$$\hat{D}_{\mathcal{J}(r)} \le \frac{2 d \sqrt{r}}{\sqrt{l}\, V}$$

where $d^2$ is the largest eigenvalue of $(B - J_2^T (I + M)^{-1} J_2)\,\tilde{\lambda}^{-1} \Pi (K_L^T)^T$, with
$J_2 = \sqrt{C}\, \delta \tilde{\lambda}^{-1} B^T$.
Note the dependency on $\sqrt{l}$ instead of $l$ as in the Rademacher bound.
Eventually, each $\theta$ leads to a radius $r_n^\theta(\epsilon, \lambda) \ge \hat{r}_n^\theta(\epsilon, \lambda)$ defined likewise, using
the upper bounds of Theorems 1 and 2. For maximal stability, we propose to have
the largest range of values for which Lemma 1 still holds, which boils down to
minimizing this quantity over $\theta$. This leads to the following selection procedure,
where each term is computable:
– Output: $\theta^* = \operatorname{argmin}_{\theta \in \Theta} r(\theta, n, l, \epsilon, \lambda)$
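A minimal sketch of this procedure as a grid search; here `critical_radius` is a stand-in for an implementation of the radius bound derived from Theorems 1 and 2, which we do not reproduce:

```python
import numpy as np

def select_parameters(theta_grid, critical_radius):
    """Stability-based selection: return the theta minimizing the
    estimated critical radius r(theta, n, l, eps, lambda)."""
    best_theta, best_r = None, np.inf
    for theta in theta_grid:
        r = critical_radius(theta)
        if r < best_r:
            best_theta, best_r = theta, r
    return best_theta
```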
5 Experiments
We have performed some toy simulations to assess the flexibility of this general
algorithm, and the results are promising. Based on only one or two labeled points,
we can always recover a perfect labeling of the data, even on the challenging
cross-moons data set, on which all classical algorithms (Co-Laplacian and Co-RLS)
perform badly. For completeness, we first give hints on how to solve the
minimization problem; a sketch follows. Recall that the solutions of [1] can be written
$f^{(v)}(x^{(v)}) = \sum_{i=1}^{l+u} \alpha_i^{(v)} K^{(v)}(x^{(v)}, x_i^{(v)}) = K^{(v)}_{x^{(v)}} \alpha^{(v)}$. We first consider the case where the loss
function is differentiable.
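For a differentiable (here quadratic) loss, one hedged way to carry out the minimization is a generic solver over the stacked coefficients $\alpha^{(v)}$. The sketch below simplifies the objective: it folds $C^L$ and $C^U$ into a single agreement matrix c and makes each view fit the labels directly; these choices are ours, not the paper's exact Loss term.

```python
import numpy as np
from scipy.optimize import minimize

def fit_multiview(K_list, L, y, l, lam, gamma, c):
    """Minimize a simplified version of objective [1] over the kernel
    expansion coefficients alpha^(v), v = 1..V (quadratic loss)."""
    V = len(K_list)
    n = K_list[0].shape[0]

    def objective(alpha_flat):
        A = alpha_flat.reshape(V, n)
        F = np.stack([K_list[v] @ A[v] for v in range(V)])  # F[v, i] = f^(v)(x_i)
        loss = sum(np.sum((F[v, :l] - y) ** 2) for v in range(V))
        comp = sum(lam[v] * A[v] @ K_list[v] @ A[v] for v in range(V))
        smooth = sum(gamma[v] * F[v] @ L @ F[v] for v in range(V))
        agree = sum(c[v1, v2] * np.sum((F[v1] - F[v2]) ** 2)
                    for v1 in range(V) for v2 in range(V) if v1 != v2)
        return loss + comp + smooth + agree

    res = minimize(objective, np.zeros(V * n), method="L-BFGS-B")
    return res.x.reshape(V, n)
```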
We have done some experiments on three toy examples (Figure 1), with only
two views and two classes for simplicity.
– The easy two moons-two lines data set, for which the data is linearly sepa-
rable in the second view, and almost separated in the first.
– The more complex two spirals-two clouds data set, with intricate spirals (to
“force” the use of graph-Laplacian). Note that a human operator cannot
separate the two classes without the information of the second view.
– The challenging cross-two moons data set, which appears to fool the tested
algorithms based on only one of the Smoothness or Agreement term.
Since the fewer the labeled objects, the more heuristic the definition of the "true"
classes, we refer here to human beings to say what the true classes are. Such a
definition of truth is a real problem, still unsolved in the clustering community,
and we do not pretend to solve it here. In the first two data sets, a human
only needs one labeled object of each class to recover the classes. For the last one,
because the cross yields ambiguity, a human operator needs two objects in each
class. Thus, we use this number of labels.
For each algorithm we use the quadratic loss, which is differentiable. The
first one is the classical RLS, for which Smoothness and Agreement are set to
0. The second one is a co-RLS, with only Smoothness set to 0. Then we used
Fig. 1. Three toy data sets. Dots for unlabeled points, circles for class number
one and crosses for class number two. From left to right: two moons (above)–two lines
(below), with one labeled object in each class; two spirals–two clouds, with one labeled
object in each class; one cross–two moons, with two labeled objects in each class.
Empirical misclassification errors for the above algorithms (one set of parameters
per dataset, some possibly set to zero as specified for each algorithm),
averaged over 1000 runs.
References
Ando, R.K., Zhang, T.: Learning on graph with Laplacian regularization. In: Schölkopf,
B., Platt, J., Hoffman, T. (eds.) Advances in Neural Information Processing Systems,
vol. 19, pp. 25–32. MIT Press, Cambridge (2007)
Balcan, M., Blum, A.: A PAC-style model for learning from labeled and unlabeled
data. In: Auer, P., Meir, R. (eds.) COLT 2005. LNCS (LNAI), vol. 3559, pp. 111–
126. Springer, Heidelberg (2005)
Balcan, M.F., Blum, A., Yang, K.: Co-training and expansion: Towards bridging theory
and practice. In: Saul, L.K., Weiss, Y., Bottou, L. (eds.) Advances in Neural
Information Processing Systems, vol. 17, pp. 89–96. MIT Press, Cambridge (2005)
Belkin, M., Niyogi, P., Sindhwani, V.: On Manifold Regularization. In: AISTAT (2005)
Ben-David, S., von Luxburg, U., Pal, D.: A sober look at clustering stability. In: Lugosi,
G., Simon, H.U. (eds.) COLT 2006. LNCS (LNAI), vol. 4005, pp. 5–19. Springer,
Heidelberg (2006)
Berthet, P., Shi, Z.: Small ball estimates for Brownian motion under a weighted sup-norm.
Studia Sci. Math. Hungar. 1–2, 275–289 (2001)
Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In:
COLT 1998: Proceedings of the Eleventh Annual Conference on Computational
Learning Theory, pp. 92–100. ACM, New York (1998)
Celisse, A.: Model selection via cross-validation in density estimation, regression, and
change-points detection. Doctoral dissertation, Université Paris-Sud, Faculté des
Sciences d'Orsay (2008)
Golub, G.H., Van Loan, C.F.: Matrix computations. The Johns Hopkins University
Press (1996)
Koltchinskii, V.: Local Rademacher complexities and oracle inequalities in risk minimization.
The Annals of Statistics 34(6), 2593–2656 (2006)
Ledoux, M., Talagrand, M.: Probability in Banach Spaces: Isoperimetry and Processes.
Springer, Berlin (1991)
Li, W.V., Linde, W.: Approximation, metric entropy and small ball estimates for Gaussian
measures. Ann. Probab. 27, 1556–1578 (1999)
Rosenberg, D., Bartlett, P.L.: The Rademacher complexity of co-regularized kernel
classes. In: Proceedings of the Eleventh ICAIS (2007)
Sindhwani, V., Niyogi, P., Belkin, M.: A co-regularization approach to semi-supervised
learning with multiple views. In: Workshop on Learning with Multiple Views, Pro-
ceedings of International Conference on Machine Learning (2005)
Sindhwani, V., Rosenberg, D.S.: An RKHS for multi-view learning and manifold co-regularization.
In: ICML 2008: Proceedings of the 25th International Conference on
Machine Learning, pp. 976–983. ACM, New York (2008)
Smola, A.J., Kondor, R.: Kernels and regularization on graphs. In: Conference on
Learning Theory and 7th Kernel Workshop, pp. 144–158 (2003)
Sridharan, K., Kakade, S.M.: An information theoretic framework for multi-view learn-
ing. In: COLT, pp. 403–414. Omnipress (2008)
Weston, J., Leslie, C., Ie, E., Zhou, D., Elisseeff, A., Noble, W.S.: Semi-supervised
protein classification using cluster kernels. Bioinformatics 21, 3241–3247 (2005)
Zhou, D., Bousquet, O., Lal, T.N., Weston, J., Schölkopf, B.: Learning with local and
global consistency. In: Advances in Neural Information Processing Systems, vol. 16,
pp. 321–328. MIT Press, Cambridge (2004)
Appendix - Proofs
Sketch of Proof of Theorem 1
The proof of Theorem 1 follows the same lines as (Rosenberg & Bartlett, 2007)
and extends their result to the compound regularization penalty in the case of
an arbitrary number of views. Since there is no novelty in the proof technique,
we do not reproduce it entirely here. For completeness, we recall the main
steps: (i) use classical invariance properties of the kernel function to reformulate
the optimization problem with an invertible matrix, (ii) apply Lemma 3 below
to get the solution, (iii) eventually, rewrite it in the formulation involving
the initial data by use of the Sherman–Morrison–Woodbury formula (Golub &
Van Loan, 1996). We provide the key intermediate steps adapted to our setting.
f ∈ L ∩ H is:

$$K_C^{v_1,v_2} = \begin{pmatrix} 0 \\ \vdots \\ K^{(v_1)} \\ \vdots \\ -K^{(v_2)} \\ \vdots \\ 0 \end{pmatrix} C_{v_1,v_2} \begin{pmatrix} 0 \\ \vdots \\ K^{(v_1)} \\ \vdots \\ -K^{(v_2)} \\ \vdots \\ 0 \end{pmatrix}^T.$$
Thus, the definition of the Rademacher complexity can be seen as the solution
to an optimization problem under a quadratic constraint. Indeed, since $\mathcal{H}$ is
symmetrical:

$$R_l(\mathcal{J}) = \frac{2}{lV}\, \mathbb{E}_\sigma \sup_{\alpha;\ \alpha^T N \alpha \le 1} \alpha^T K_L^T \sigma$$

Since $f^{(v)}(x_i^{(v)}) = K_i^{(v)} \alpha^{(v)}$, where $K_i^{(v)}$ is the $i$th row of $K^{(v)}$,

$$\hat{D}_{\mathcal{J}(r)}^2 \le \frac{4}{V^2 l}\, \sup_{a;\ a^T M a \le r} a^T Q a$$
Now, $D = K e K^T$ with $e \in \mathbb{R}^{n \times n}$ being the projection matrix with diagonal
blocks $I_l$ and $0_u$. Lemma 4 applies to $T^{-1}$, and since the eigenvalues of
$T^{-1} P^T D P$ and $P T^{-1} P^T D$ are the same, we compute:

$$P A^{-1} P^T D = P P^T B P \tilde{\lambda}^{-1} \Sigma P^T K e K^T = B \tilde{\lambda}^{-1} \Pi (K_L^T)^T$$

where $A$ comes from Lemma 4. Similar computations yield the second term and
allow us to conclude the proof.
Error-Correcting Tournaments
1 Introduction
We consider the classical problem of multiclass classification, where given an instance
x ∈ X, the goal is to predict the most likely label y ∈ {1, . . . , k}, according to some
unknown probability distribution.
A common general approach to multiclass learning is to reduce a multiclass prob-
lem to a set of binary classification problems [2,6,10,11,14]. This black-box approach
is composable with any binary learning algorithm (and thus bias), including online al-
gorithms, Bayesian algorithms, and even humans.
A key technique for analyzing reductions is regret analysis, which bounds the “re-
gret” of the resulting multiclass classifier in terms of the average classification “regret”
on the binary problems. Here regret is the difference between the incurred loss and the
smallest achievable loss on the problem, i.e., excess loss due to suboptimal prediction.
The most commonly applied reduction is one-against-all, which creates a binary clas-
sification problem for each of the k classes: The classifier for class i is trained to predict
whether the label is i or not; predictions are done by evaluating each binary classifier
and randomizing over those which predict “yes,” or randomly if all answers are “no”.
This simple reduction is inconsistent, in the sense that given optimal (zero-regret) binary
classifiers, the reduction may not yield an optimal multiclass classifier in the presence of
noise. Optimizing squared loss of the binary predictions instead of the 0/1 loss makes
the approach consistent, but the resulting multiclass regret may be as high as $\sqrt{2kr}$,
where r is the average squared loss regret on the induced problems, which is upper
bounded by the average binary classification regret via the Probing reduction [15].
The probabilistic error correcting output code approach (PECOC) [14] reduces k-
class classification to learning O(k) regressors on the interval [0, 1], creating O(k) binary
examples per multiclass example at both training and test time, with a test time
computation of $O(k^2)$. The resulting multiclass regret is bounded by $4\sqrt{r}$, where r is
the average squared loss regret of the regressors (which is upper bounded by the average
binary classification regret via the Probing reduction [15]). Thus PECOC removes the
dependence on the number of classes k. When only a constant number of labels have
non-zero probability given x, the complexity can be reduced to O(log k) examples per
multiclass example and O(k log k) computation per example [13].
This leads to several questions:
1. Is there a consistent reduction from multiclass to binary classification that does not
have a square root dependence [17]? For example, an average binary regret of just
0.01 may imply a PECOC multiclass regret of 0.4.
2. Is there a consistent reduction that requires just O(log k) computation, matching
the information-theoretic lower bound? The well-known tree reduction (see [9])
distinguishes between the labels using a balanced binary tree, where each non-leaf
node predicts "Is the correct multiclass label to the left or right?". As shown in
Section 2, this method is inconsistent.
3. Can the above be achieved with a reduction that only performs pairwise compar-
isons between classes? One fear associated with the PECOC approach is that it
creates binary problems of the form “What is the probability that the label is in a
given random subset of labels?,” which may be hard to solve. Although this fear
is addressed by regret analysis (as the latter operates only on excess loss), and is
overstated in some cases [8,13], it is still of some concern, especially with larger
values of k.
The error-correcting tournament family presented here answers all of these questions in
the affirmative. It provides a method for multiclass prediction that is exponentially
faster in k, with the resulting multiclass regret bounded by 5.5r, where r is the average
binary regret, and every binary classifier logically compares two distinct class labels.
The result is based on a basic observation that if a non-leaf node fails to predict its
binary label, which may be unavoidable due to noise in the distribution, nodes between
this node and the root should have no preference for class label prediction. Utilizing
this observation, we construct a reduction, called the Filter Tree, with the property that
it uses O(log k) binary examples and O(log k) computation at training and test time
with a multiclass regret bounded by log k times the average binary regret.
The decision process of a Filter Tree can be viewed, bottom up, as a single-elimination
tournament on a set of k players. Using c independent single-elimination
tournaments is of no use as it does not affect the average regret of an adversary con-
trolling the binary classifiers. Somewhat surprisingly, it is possible to have c = log k
complete single-elimination tournaments between k players in O(log k) rounds with no
player playing twice in the same round [5]. All error-correcting tournaments first pair
labels in consecutive interfering single-elimination tournaments, followed by a final
carefully weighted single-elimination tournament that decides among the log2 k win-
ners of the first phase. As for the Filter Tree, test time evaluation can start at the root
and proceed to a multiclass label with O(log k) computation.
This construction is also useful for the problem of robust search, yielding the first
algorithm which allows the adversary to err a constant fraction of the time in the “full
lie” setting [16] where a comparator can missort any comparison. Previous work either
applied to the “half lie” case where a comparator can fail to sort but cannot actively
missort [5,18] or to a “full lie” setting where an adversary has a fixed known bound on
the number of lies [16] or a fixed budget on the fraction of errors so far [4,3]. Indeed, it
might even appear impossible to have an algorithm robust to a constant fraction of full
lie errors since an error can always be reserved for the last comparison. By repeating
the last comparison O(log k) times we can defeat this strategy.
The result here is also useful for the actual problem of tournament construction in
games with real players. Our analysis does not assume that errors are i.i.d. [7], or have
known noise distributions [1] or known outcome distributions given player skills [12].
Consequently, the tournaments we construct are robust against severe bias such as a bi-
ased referee or some forms of bribery and collusion. Furthermore, the tournaments we
construct are shallow, requiring fewer rounds than m-elimination bracket tournaments,
which do not satisfy the guarantee provided here. In an m-elimination bracket tourna-
ment, bracket i is a single-elimination tournament on all players except the winners of
brackets 1, . . . , i − 1. After the bracket winners are determined, the player winning the
last bracket m plays the winner of bracket m − 1 repeatedly until one player has suf-
fered m losses (they start with m − 1 and m − 2 losses respectively). The winner moves
on to pair against the winner of bracket m − 2, and the process continues until only
one player remains. This method does not scale well to large m, as the final elimination
phase takes $\sum_{i=1}^{m} (i-1) = O(m^2)$ rounds. Even for k = 8 and m = 3, our constructions
have smaller maximum depth than bracketed 3-elimination.
Paper overview. Section 2 shows that the simple divide-and-conquer tree approach is
inconsistent, motivating the Filter Tree algorithm described in section 3 (which applies
to more general cost sensitive multiclass problems). Section 3.1 proves that the algo-
rithm has the best possible computational dependence, and gives two upper bounds on
the regret of the returned (cost-sensitive) multiclass classifier.
Section 4 presents the error-correcting tournament family parametrized by an integer
m ≥ 1, which controls the tradeoff between maximizing robustness (m large) and min-
imizing depth (m small). Setting m = 1 gives the Filter Tree, while m = 4 ln k gives
a (multiclass to binary) regret ratio of 5.5 with O(log k) depth. Setting m = ck gives
regret ratio of 3 + O(1/c) with depth O(k). The results here provide a nearly free gen-
eralization of earlier work [5] in the robust search setting, to a more powerful adversary
that can missort as well as fail to sort. Section 5 gives an algorithm independent lower
bound of 2 on the regret ratio for large k. When the number of calls to a binary classifier
is independent (or nearly independent) of the label predicted, we strengthen this lower
bound to 3 for large k.
One standard approach for reducing multiclass learning to binary learning is to split the
set of labels in half, then learn a binary classifier to distinguish between the subsets, and
repeat recursively until each subset contains one label. Multiclass predictions are made
by following a chain of classifications from the root down to the leaves.
The following theorem shows that there exist multiclass problems such that even
if we have an optimal classifier for the induced binary problem at each node, the tree
reduction does not yield an optimal multiclass predictor.
Fig. 1. Filter Tree (leaves 1–7; the first round pairs 1 vs 2, 3 vs 4, and 5 vs 6, with 7
receiving a bye). Each node predicts whether the left or the right input label is more likely,
conditioned on a given x ∈ X. The root node predicts the best label for x.
Notation. Let D be the underlying distribution over X × Y, where X is some observable
feature space and Y = {1, . . . , k} is the label space. The error rate of a classifier
f : X → Y on D is given by $\mathrm{err}(f, D) = \Pr_{(x,y)\sim D}[f(x) \neq y]$. The regret of f on D
is defined as $\mathrm{reg}(f, D) = \mathrm{err}(f, D) - \min_{f^*} \mathrm{err}(f^*, D)$.
The tree reduction transforms D into a distribution $D_T$ over binary labeled examples
by drawing a multiclass example (x, y) from D, drawing a random non-leaf node i,
and outputting instance ⟨x, i⟩ with label 1 if y is in the left subtree of node i, and 0
otherwise. A binary classifier f for this problem induces a multiclass classifier T(f),
via a chain of binary predictions starting from the root.
Theorem 1. For all k ≥ 3, for all binary trees over the labels, there exists a multiclass
distribution D such that $\mathrm{reg}(T(f^*), D) > 0$ for any $f^* = \operatorname{argmin}_f \mathrm{err}(f, D_T)$.
Proof. Find a node with one subset corresponding to two labels and the other subset
corresponding to a single label. (If the tree is perfectly balanced, simply let D assign
probability 0 to one of the labels.) Since we can freely rename labels without changing
the underlying problem, let the first two labels be 1 and 2, and the third label be 3.
Choose D with the property that D(y = 1 | x) = D(y = 2 | x) = 1/4 + 1/100,
while D(y = 3 | x) = 1/2 − 2/100. Under this distribution, the fraction of examples
for which label 1 or 2 is correct is 1/2 + 2/100, so any minimum error rate binary
predictor must choose either label 1 or label 2. Each of these choices has an error rate
of 3/4 − 1/100. The optimal multiclass predictor chooses label 3 and suffers an error
rate of 1/2 + 2/100, implying that the regret of the tree classifier based on an optimal
binary classifier is 1/4 − 3/100 > 0.
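The arithmetic in this proof is easy to check numerically; a small sketch using the distribution from the proof:

```python
# Conditional label probabilities from the proof.
p = [0.25 + 0.01, 0.25 + 0.01, 0.5 - 0.02]

# The root groups {1, 2} vs {3}; since 0.52 > 0.48 the optimal binary root
# prefers {1, 2}, and the next node picks label 1 or 2, each correct with
# probability 0.26.
tree_error = 1 - max(p[0], p[1])          # 0.74 = 3/4 - 1/100

# The optimal multiclass predictor picks label 3.
bayes_error = 1 - max(p)                  # 0.52 = 1/2 + 2/100

print(tree_error - bayes_error)           # 0.22 = 1/4 - 3/100 > 0
```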
The winners of the first round are in turn paired in the second round, and a classifier is trained to predict
whether the winner of one pair is more likely than the winner of the other. The process
of training classifiers to predict the best of a pair of winners from the previous round is
repeated until the root classifier is trained.
The setting above is akin to boosting: at each round t, a booster creates an input
distribution $D_t$ and calls an oracle learning algorithm to obtain a classifier with some
error $\epsilon_t$ on $D_t$. The distribution $D_t$ depends on the classifiers returned by the oracle in
previous rounds. The accuracy of the final classifier is analyzed in terms of the $\epsilon_t$'s.
Let Tn be the subtree of T rooted at node n. The set of leaves of a tree T is denoted
by Γ (T ). Let yn be the bit specifying whether the multiclass label y is in the left subtree
of n or not.
The key trick in the training stage (Algorithm 1) is to form the right training set
at each interior node. A training example for node n is formed conditioned on the
predictions of classifiers in the round before it. Thus the learned classifiers from the first
level of the tree are used to “filter” the distribution over examples reaching the second
level of the tree.
Given x and classifiers at each node, every edge in T is identified with a unique
label. The optimal decision at any non-leaf node is to choose the input edge (label) that
is more likely according to the true conditional probability. This can be done by using
the outputs of classifiers in the round before it as a filter during the training process: For
each observation, we set the label to 0 if the left parent’s output matches the multiclass
label, 1 if the right parent’s output matches, and reject the example otherwise.
The testing algorithm, Filter-Test, is very simple. Given a test example x ∈ X,
we output the label y such that every classifier on the path from y to the root prefers y.
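A minimal sketch of Filter Tree training and testing under these rules, for k a power of two and labels indexed 0..k−1; `learn_binary` is a stand-in interface for an arbitrary binary learning oracle, and the array bookkeeping (plus the assumption that every node receives some surviving examples) is ours:

```python
import numpy as np

def filter_tree_train(X, y, k, learn_binary):
    """Train one classifier per non-leaf node, level by level.  A node's
    example is kept only if the true label survived to that node; its
    binary label is 1 iff the right input matches the multiclass label."""
    winners = np.tile(np.arange(k), (len(X), 1))   # per-example leaf inputs
    tree = []
    while winners.shape[1] > 1:
        level, next_winners = [], []
        for j in range(0, winners.shape[1], 2):
            left, right = winners[:, j], winners[:, j + 1]
            keep = (left == y) | (right == y)      # filter: reject other examples
            bits = (right[keep] == y[keep]).astype(int)
            clf = learn_binary(X[keep], bits)      # oracle: X rows -> {0, 1}
            pred = clf(X)
            level.append(clf)
            next_winners.append(np.where(pred == 1, right, left))
        tree.append(level)
        winners = np.column_stack(next_winners)
    return tree

def filter_tree_predict(x_row, tree, k):
    """Evaluate winners from the leaves to the root for a single example."""
    winners = np.arange(k)
    for level in tree:
        winners = np.array([
            winners[2 * j + 1] if clf(x_row[None, :])[0] == 1 else winners[2 * j]
            for j, clf in enumerate(level)])
    return winners[0]
```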
Algorithm 2 extends this idea to the cost-sensitive multiclass case where each choice
has a different associated cost. Formally, a cost-sensitive k-class classification problem
is defined by a distribution D over $X \times [0,1]^k$. The expected cost of a classifier f :
X → {1, ..., k} on D is $\ell(f, D) = E_{(x,c)\sim D}\, c_{f(x)}$. Here $c \in [0,1]^k$ gives the cost
of each of the k choices for x. As in the multiclass case (which is a special case), the
regret of f on D is defined as $\mathrm{reg}_c(f, D) = \ell(f, D) - \min_{f^*} \ell(f^*, D)$.
The algorithm relies upon an importance weighted binary learning algorithm, which
takes examples of the form (x, y, w), where x is a feature vector used for prediction,
y is a binary label, and w ∈ [0, ∞) is the importance any classifier pays if it doesn’t
predict y on x.
Theorem 2. For all binary classifiers f and all cost-sensitive multiclass distributions D,

$$\mathrm{reg}_c(\text{Filter-Test}(f), D) \le \mathrm{reg}(f, D')\; E_{(x,c)\sim D} \sum_{n \in T} w(n, x, c),$$

where w(n, x, c) is the importance weight in Algorithm 2 (the difference in cost between
the two labels that node n chooses between on x), and D′ is the induced distribution as
defined above.
Before proving the theorem, we state the corollary for multiclass classification.
Corollary 1. For all binary classifiers f and all multiclass distributions D on k labels,
for all Filter Trees of depth d, reg(Filter-Test(f ), D) ≤ d · reg(f, DFT ).
(Since all importance weights are either 0 or 1, we don't need to apply Costing.) The
proof of the corollary given the theorem is simple, since for any (x, y) the induced (x, c)
has at most one node per level with induced importance weight 1; all other importance
weights are 0. Therefore, $\sum_n w(n, x, c) \le d$.
Theorem 3 provides an alternative bound for cost-sensitive classification. It is the
first known bound giving a worst-case dependence of less than k.
Theorem 3. For all binary classifiers f and all cost-sensitive k-class distributions D,
$\mathrm{reg}_c(\text{Filter-Test}(f), D) \le \frac{k}{2}\, \mathrm{reg}(f, D')$, where D′ is as defined above.
For node n, each importance weighted binary decision between class a and class
b has an importance weighted regret which is either 0 or $r_n = |E_{c\sim D|x}[c_a - c_b]| = |E_{c\sim D|x}[c_a] - E_{c\sim D|x}[c_b]|$, depending on whether the prediction is correct or not.
Assume without loss of generality that the predictor outputs class b. The regret of the
subtree $T_n$ rooted at n is given by $r_{T_n} = E_{c\sim D|x}[c_b] - \min_{y \in \Gamma(T_n)} E_{c\sim D|x}[c_y]$.
As a base case, the inductive hypothesis is trivially satisfied for trees with one label.
Inductively, assume that $\sum_{n' \in L} r_{n'} \ge r_L$ and $\sum_{n' \in R} r_{n'} \ge r_R$ for the left subtree L
of n (providing a) and the right subtree R (providing b).
There are two possibilities: either the minimizer comes from the leaves of L or from the
leaves of R. The second possibility is easy since we have

$$r_{T_n} = E_{c\sim D|x}[c_b] - \min_{y \in \Gamma(R)} E_{c\sim D|x}[c_y] = r_R \le \sum_{n' \in R} r_{n'} \le \sum_{n' \in T_n} r_{n'},$$

which completes the induction. The inductive hypothesis for the root is that $\mathrm{reg}_c(y', D|x) \le \sum_{n \in T} r_n$, implying $\mathrm{reg}_c(y', D|x) \le \sum_{n \in T} r_n = (k-1)\, r_i(f, D_{FT})$, where
$r_i$ is the importance weighted binary regret on the induced problem.
Using the folk theorem from [19], we have $r_i(f, D_{FT}) = \mathrm{reg}(f, D')\, E_{(x,y,w)\sim D_{FT}}[w]$.
The expected importance is $\frac{1}{k-1} E_{(x,c)\sim D} \sum_{n \in T} w(n, x, c)$. Plugging this in, we get
the theorem.
The proof of Theorem 3 makes use of the following inequality. Consider a Filter Tree
T evaluated on a cost-sensitive multiclass instance with cost vector $c \in [0,1]^k$. Let $S_T$
be the sum of importances over all nodes in T, and $I_T$ be the sum of importances over
the nodes where the class with the larger cost was selected for the next round. Let $c_T$
denote the cost of the winner chosen by T.

Lemma 1. For any Filter Tree T on k labels, $S_T + c_T \le I_T + \frac{k}{2}$.
Proof. The inequality follows by induction, the result being clear when k = 2. Assume
that the claim holds for the two subtrees, L and R, providing their respective inputs l
and r to the root of T, and that T outputs r without loss of generality. Using the inductive
hypotheses for L and R, we get $S_T + c_T = S_L + S_R + |c_r - c_l| + c_r \le I_L + I_R + \frac{k}{2} - c_l + |c_r - c_l|$. If $c_r \ge c_l$, we have $I_T = I_L + I_R + (c_r - c_l)$, and $S_T + c_T \le I_T + \frac{k}{2} - c_l \le I_T + \frac{k}{2}$, as desired. If $c_r < c_l$, we have $I_T = I_L + I_R$ and $S_T + c_T \le I_T + \frac{k}{2} - c_r \le I_T + \frac{k}{2}$.
Proof (Theorem 3). We will fix $(x, c) \in X \times [0,1]^k$ and take the expectation over the
draw of (x, c) from D as the last step.
Case 2. T outputs l, and $c_l < c_r$. In this case $\mathrm{reg}_T = \mathrm{reg}_L = c_l - c^*$. The left
hand side can be rewritten as $\mathrm{reg}_T S_T = \mathrm{reg}_L(S_R + S_L + c_r - c_l) = \mathrm{reg}_L S_L + \mathrm{reg}_L(S_R + c_r - c_l) \le \mathrm{reg}_L I_L + I_R - 2c_l + \frac{k}{2} \le I_R + \mathrm{reg}_L I_L - 2c_l + \frac{k}{2} \le I_R + I_L\,\mathrm{reg}_L - 2c_l + \frac{k}{2} \le I_R + \frac{k}{2} I_L \le \frac{k}{2} I_T$. The first inequality follows from the lemma,
the second from $\mathrm{reg}_L \le 1$, the third from $\mathrm{reg}_L \le I_L$, the fourth from $-c_L - c^* < 0$,
and the fifth because $I_T = I_L + I_R$.
Case 3. T outputs l, and $c_l > c_r$. We have $\mathrm{reg}_T = \mathrm{reg}_L = c_l - c^*$. The left hand side
can be written as

$$\mathrm{reg}_T S_T = \mathrm{reg}_L(S_R + S_L + c_l - c_r) \le I_L + \mathrm{reg}_L\Big(I_R + \frac{|L|}{2} + \frac{k - |L|}{2} - c_r\Big) + c_l - c_r$$
$$\le I_L + I_R + \frac{k}{2} + (c_l - 2c_r) \le \frac{k}{2}\big(I_L + I_R + (c_l - c_r)\big) = \frac{k}{2} I_T,$$
The first inequality follows from the inductive hypothesis and the lemma, the second
from $\mathrm{reg}_L < 1$ and $\mathrm{reg}_L < I_L$, and the third from $c_r > 0$ and $k/2 > 1$.
Case 4. T outputs r, and $c_l > c_r$. Let $w = c_l - c_r$. We have $\mathrm{reg}_T = c_r - c^* = \mathrm{reg}_L - w$.
The left hand side can be written as
4 Error-Correcting Tournaments
In this section we first state and then analyze error-correcting tournaments. As this
section builds on the previous one, understanding the previous section should be
considered a prerequisite for reading this one. For simplicity, we work with only the multiclass
case. An extension to cost-sensitive multiclass problems is possible using the importance
weighting techniques of the previous section.
dependent on the depth of any mechanism which pairs labels in m distinct single elim-
ination tournaments. One such explicit mechanism is stated in [5]. Note that once an
(x, y) example has lost m times, it is eliminated and no longer influences training at the
nodes closer to the root.
The second phase is a final elimination phase, where we select the winner from the
m winners of the first phase. It consists of a redundant single-elimination tournament,
where the degree of redundancy increases as the root is approached. To quantify the re-
dundancy, let every subtree Q have a charge cQ equal to the number of leaves under the
subtree. First phase winners at the leaves of final elimination tournament have charge
1. For any non-leaf node comparing subtree R to subtree L, the importance weight of
a binary example is set to max{cR , cL }. For reference, in tournament applications, an
importance weight can be expressed by playing games repeatedly where the winner of
R must beat the winner of L cL times to advance, and vice versa.
One complication arises: what happens when the two labels compared are the same?
In this case, the importance weight is set to 0, indicating there is no preference in the
pairing between the two choices. A small sketch of the charges and weights follows.
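As a small illustration of the charges and the resulting importance weights, here is a sketch with the final elimination tree given as nested pairs; the representation is our own.

```python
def charges_and_weights(node):
    """Return (charge, weights) for a final elimination subtree given as
    nested pairs, e.g. ((1, 2), (3, 4)).  A leaf (first-phase winner) has
    charge 1; a non-leaf comparing subtrees L and R has charge
    charge(L) + charge(R) and importance weight max(charge(L), charge(R))."""
    if not isinstance(node, tuple):
        return 1, []
    cl, wl = charges_and_weights(node[0])
    cr, wr = charges_and_weights(node[1])
    return cl + cr, wl + wr + [max(cl, cr)]

charge, weights = charges_and_weights(((1, 2), (3, 4)))
print(charge, weights)   # 4 [1, 1, 2]
```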
Proof. The proof is by simplification of the importance depth bound (Theorem 6), which
bounds the sum of importance weights over all nodes in the circuit.
To see that the importance depth controls the computation, first note that the importance
depth bounds the circuit depth, since all importance weights are at least 1. At training
time, any one example is used at most once per circuit level, starting at the leaves.
At testing time, an unlabeled example can have its label determined by traversing the
structure from root to leaf.
Theorem 5 (Main Theorem). For all distributions D over k-class examples, all binary
classifiers f, and all m-elimination tournaments ECT, the ratio of $\mathrm{reg}(f_{ECT}, D)$ to
$\mathrm{reg}(f, ECT(D))$ is upper bounded by

$$2 + \frac{m+k}{m} + \frac{m^2}{2^m} \quad \text{for all } m \ge 2 \text{ and } k > 2,$$

$$4 + \frac{2 \ln k}{m} + 2\sqrt{\frac{\ln k}{m}} \quad \text{for all } k \le 2^{62} \text{ and } m \le 4 \log_2 k.$$

The first case shows that a regret ratio of 3 is achievable for very large m. The second
case is the best bound for cases of common interest. For m = 4 ln k it gives a ratio of
5.5.
Proof. The proof holds for each input x, and hence in expectation over x. For a fixed
x, we can define the regret of any label y as $r_y = \max_{y' \in \{1,\cdots,k\}} D(y' \mid x) - D(y \mid x)$.
A node n comparing two labels y and y′ has regret $r_n$, which is $|D(y \mid x) - D(y' \mid x)|$
if the most probable label is not predicted, and 0 otherwise. The regret of a tree T is
defined as $r_T = \sum_{n \in T} r_n$.
The first part of the proof is by induction on the tree structure F of the final phase.
The invariant for a subtree Q of F won by label q is $c_Q r_q \le r_Q + \sum_{w \in \Gamma(Q)} r_{T_w}$, where
w is the winner of the first-phase single-elimination tournament $T_w$.
When Q is a leaf w of F, we have $c_Q r_q = r_q \le r_{T_w}$, where the inequality is from
Corollary 1, noting that d times the average binary regret is the sum of binary regrets.
Assume inductively that the hypothesis holds at node n for the right subtree R and
the left subtree L of Q, with respective winners q and l: $c_R r_q \le r_R + \sum_{w \in \Gamma(R)} r_{T_w}$
and $c_L r_l \le r_L + \sum_{w \in \Gamma(L)} r_{T_w}$. Now, a chain of inequalities holds, completing the
induction: $r_Q + \sum_{w \in \Gamma(Q)} r_{T_w} \ge c_L r_n + r_R + r_L + \sum_{w \in \Gamma(R)} r_{T_w} + \sum_{w \in \Gamma(L)} r_{T_w} \ge c_L r_n + c_R r_q + c_L r_l \ge c_Q r_q$. Here the first inequality uses the fact that the adversary
must pay at least $c_L r_n$ to make q win. The second inequality follows by the inductive
hypothesis. The third inequality comes from $r_l + r_n \ge r_q$. To finish the proof,
$m\, \mathrm{reg}(f_{ECT}, D \mid x) = c_F r_f \le r_F + \sum_{w \in \Gamma(F)} r_{T_w} \le d\, \mathrm{reg}(f, ECT(D \mid x))$, where
d is the maximum importance depth and the last quantity follows from the folk theorem
in [19]. Applying the importance depth bound (Theorem 6) and algebra completes the proof.
Lemma 2 (First Phase Depth Bound). The importance depth of the first-phase tournament
is bounded by the minimum of

– $\log_2 k + m \log_2(\log_2 k + 1)$,
– $1.5 \log_2 k + 3m + 1$,
– $\frac{k}{2} + 2m$,
– for $k \le 2^{62}$ and $m \le 4 \log_2 k$: $2(m-1) + \ln k + \sqrt{\ln k\, (\ln k + 4(m-1))}$.
Proof. The depth of the first phase is bounded by the classical problem of robust min-
imum finding with low depth. The first three cases hold because any such construction
upper bounds the depth of an error correcting tournament, and one such construction
has these bounds [5].
For the fourth case, we construct the depth bound by analyzing a continuous relax-
ation of the problem. The relaxation allows the number of labels remaining in each
single elimination tournament of the first phase to be broken into fractions. Relative to
this version, the actual problem has two important discretizations:
and the probability that this occurs in k attempts is bounded by k times that. Setting this
value to 1, we get $\ln k = 2d\left(\frac{1}{2} - \frac{m-1}{d}\right)^2$. Solving the equation for d gives $d = 2(m-1) + \ln k + \sqrt{4(m-1)\ln k + (\ln k)^2}$. This last formula was verified computationally
for $k < 2^{62}$ and $m < 4 \log_2 k$ by discretizing k into powers of 2 and running a simple
program to keep track of the number of labels in each tournament at each level (a sketch
of such a program follows). For $k \in \{2^{l-1}+1, \ldots, 2^{l}\}$, we used the pessimistic value
$k = 2^{l-1}+1$ in the above formula to compute the bound, and compared it to the output
of the program for $k = 2^{l}$.
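A possible reconstruction of such a level-tracking program, under the continuous relaxation described above in which half of the labels in each tournament win each round; the termination rule is our assumption, not spelled out in the paper:

```python
def first_phase_rounds(k, m):
    """Track the (fractional) number of labels having j losses, j = 0..m;
    each round, half the labels with j < m losses stay and half move to
    j + 1 losses.  Stop when at most one label has fewer than m losses."""
    counts = [float(k)] + [0.0] * m
    rounds = 0
    while sum(counts[:m]) > 1.0:
        new = [0.0] * (m + 1)
        for j in range(m):
            new[j] += counts[j] / 2.0      # winners stay in tournament j
            new[j + 1] += counts[j] / 2.0  # losers drop to the next one
        new[m] += counts[m]                # eliminated labels stay eliminated
        counts = new
        rounds += 1
    return rounds

print(first_phase_rounds(1024, 3))
```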
Lemma 3 (Second Phase Depth Bound). In any m-elimination tournament, the second
phase has importance depth at most $\frac{3m}{2} - 1$ rounds for m > 1.
Proof. When two labels are compared in round $i \ge 1$, the importance weight of their
comparison is at most $2^{i-1}$. Thus we have $\sum_{i=1}^{\log_2 m} 2^{i-1} + \frac{m}{2} = \frac{3m}{2} - 1$.
5 Lower Bound
All of our lower bounds hold for a somewhat more powerful adversary which is more
natural in a game playing tournament setting. In particular, we disallow reductions
which use importance weighting on examples, or equivalently, all importance weights
are set to 1. Note that we can modify our upper bound to obey this constraint by
transforming final elimination comparisons with importance weight i into 2i − 1 repeated
comparisons and using the majority vote. This modified construction has an importance
depth which is at most m larger, implying that the ratio of the adversary's and the
reduction's regret increases by at most 1.
The first lower bound says that for any reduction algorithm B, there exists an ad-
versary A with the average per-round regret r such that A can make B incur regret 2r
even if B knows r in advance. Thus an adversary who corrupts half of all outcomes
can force a maximally bad outcome. In the bounds below, fB denotes the multiclass
classifier induced by a reduction B using a binary classifier f .
Proof. The adversary A picks any two labels i and j. All comparisons involving i
but not j, are decided in favor of i. Similarly for j. The outcome of comparing i and
j is determined by the parity of the number of comparisons between i and j in some
fixed serialization of the algorithm. If the parity is odd, i wins; otherwise, j wins. The
outcomes of all other comparisons are picked arbitrarily.
Suppose that the algorithm halts after some number of queries c between i and j. If
neither i nor j wins, the adversary can simply assign probability 1/2 to i and j. The
adversary pays nothing while the algorithm suffers loss 1, yielding a regret ratio of ∞.
Assume without loss of generality that i wins. The depth of the circuit is either c or
at least c + 1, because each label can appear at most once in any round. If the depth is
c, then since k > 2, some label is not involved in any query, and the adversary can set
the probability of that label to 1 resulting in ρ(B) = ∞.
Otherwise, A can set the probability of label j to be 1 while all others have probability
0. The total regret of A is at most $\frac{c+1}{2}$, while the regret of the winning label is 1.
Multiplying by the depth bound c + 1 gives a regret ratio of at least 2.
Note that the number of rounds in the above bound can depend on A. Next, we show
that for any algorithm B taking the same number of rounds for any adversary, there
exists an adversary A with a regret of roughly one third, such that A can make B incur
the maximal loss, even if B knows the power of the adversary.
Proof. Let B take q rounds to determine the winner, for any set of query outcomes. We
will design an adversary A which incurs regret $r = \frac{qk}{3k-2}$, such that A can make B incur
the maximal loss of 1, even if B knows r.
The adversary's query answering strategy is to answer consistently with label 1 winning
for the first $\frac{2(k-1)}{k} r$ rounds, breaking ties arbitrarily. The total number of queries
that B can ask during this stage is at most (k − 1)r, since each label can play at most
once in every round and each query occupies two labels. Thus the total amount of regret
at this point is at most (k − 1)r, and there must exist a label i other than label 1
with at most r losses. In the remaining $q - \frac{2(k-1)}{k} r = r$ rounds, A answers consistently
with label i winning and all other skills being 0.
Now if B selects label 1, A can set D(i | x) = 1, with r/q average regret from
the first stage. If B selects label i instead, A can set D(1 | x) = 1. Since the
number of queries between labels i and 1 in the second stage is at most r, the adversary
incurs average regret at most r/q. If B chooses any other label to be the winner, the
regret ratio is unbounded.
References
1. Adler, M., Gemmell, P., Harchol-Balter, M., Karp, R., Kenyon, C.: Selection in the presence
of noise: The design of playoff systems. In: SODA 1994 (1994)
2. Allwein, E., Schapire, R., Singer, Y.: Reducing multiclass to binary: A unifying approach for
margin classifiers. Journal of Machine Learning Research 1, 113–141 (2000)
3. Aslam, J., Dhagat, A.: Searching in the presence of linearly bounded errors. In: STOC 1991
(1991)
4. Borgstrom, R., Rao Kosaraju, S.: Comparison-based search in the presence of errors. In: STOC
1993 (1993)
5. Denejko, P., Diks, K., Pelc, A., Piotrów, M.: Reliable minimum finding comparator networks.
Fundamenta Informaticae 42, 235–249 (2000)
6. Dietterich, T., Bakiri, G.: Solving multiclass learning problems via error-correcting output
codes. Journal of Artificial Intelligence Research 2, 263–286 (1995)
7. Feige, U., Peleg, D., Raghavan, P., Upfal, E.: Computing with unreliable information. In:
Symposium on Theory of Computing, pp. 128–137 (1990)
Difficulties in Forcing Fairness of Polynomial Time Inductive Inference

1 Introduction
Pitt [Pit89] notes (in a slightly different context) that such a definition of polyno-
mial time learning may not give one any feasibility restriction on the total time
for successful learning. Here is informally why. Suppose h is any TxtEx-learner.
Then, for suitable polynomial Q, a variant of learner h can delay outputting
significant conjectures based on data σ until it has seen a much larger sequence
of data τ so that Q(|τ |) is enough time for h to think about σ as long as it needs.
Pitt [Pit89] discusses some possible ways to forbid such unfair delaying tricks.
More recently, Yoshinaka [Yos09] compiled a very useful list of properties to help
toward achieving fairness and efficiency in polynomial time learners, including to
avoid Pitt-style delaying tricks. In the second part of [Yos09], Yoshinaka provides
a number of interesting example fair polynomial time learners each satisfying
several of these properties. In each of his example algorithms, the associated
hypothesis space is uniformly polynomial time decidable.1 In the present paper,
we focus, for polynomial time learners, on three of Yoshinaka’s properties: Post-
dictive completeness2 , conservativeness, and prudence. Postdictive completeness
[Bār74, BB75, Wie76, Wie78] requires that each hypothesis output by a learner
correctly postdicts the input data on which that hypothesis is based. Conserva-
tiveness [Ang80] requires that each hypothesis may be changed only if it fails to
predict a new datum. Prudence [Wei82, OSW86] requires each output hypothesis
has to be for a target that the learner actually learns.
Yoshinaka [Yos09] claims that, for efficient learning in the limit from positive
data, the combination of postdictive completeness, conservativeness and prudence
is restrictive enough to prevent all Pitt-style delaying tricks.
In the present paper, in several different settings (settings mostly as to kind
of hypothesis spaces), we refute the claim of the immediately above paragraph.
In one of our settings, uniformly polynomial time decidable hypothesis spaces
with a few effective closure properties,3,4 the three restrictions allow maximal
1. These spaces are such that there is a polynomial Q and an algorithm so that, from
both an hypothesis i and an object x, the algorithm returns, within time Q(|i|, |x|),
a correct decision as to whether x is in the language defined by hypothesis i.
2. In the prior literature, except for [Ful88] and [CK08a, CK08b], what we call
postdictive completeness is called consistency.
3. These effective closure properties pertain to obtaining finite languages and
modifications of languages by finite languages.
4. The particular uniformly polynomial time hypothesis spaces Yoshinaka employs in
the second half of [Yos09] do not have our few effective closure properties, but his
algorithms would work essentially unchanged were one to extend his hypothesis
spaces to ones with our few effective closure properties. Then his algorithms would
not search or use the new hypotheses and would not learn any more languages. The
space of CFGs with Prohibition discussed below in this section and in Section 2.1
further below would work as such an extension of both of Yoshinaka's hypothesis spaces.
Yoshinaka does mention the possibility of extending his hypothesis spaces to provide
an hypothesis for Σ*. We did not examine whether we could, in some cases, work
with such an extension instead of our few effective closure properties. We also did
not examine whether we can modify our (to be mentioned shortly) Theorem 13 to
cover just his particular hypothesis spaces.
5. It is an interesting open question, though, for our uniformly polynomial time
decidable hypothesis spaces, whether the combination of postdictive completeness,
conservativeness and prudence is so restrictive that any class of languages TxtEx-learnable
employing such an hypothesis space and with those three restrictions is also TxtEx-learnable
with an intuitively fair, different polynomial time learner respecting all
three restrictions.
6. That is, in our residual two settings, of the three restrictions, postdictive completeness
does improve fairness, but there can still be some residual unfair delaying tricks.
For these residual settings, we did not examine the question of whether adding
conservativeness and/or prudence on top of postdictive completeness provides a better
degree of avoidance of delaying tricks than postdictive completeness alone. Again:
we already know, though, that all three restrictions do not avoid all delaying tricks.
7. The associated class is not uniformly polynomial time decidable, by
[RC94, Theorem 6.5].
2 Mathematical Preliminaries
Turing machines (TMs). In this system the TM-programs are efficiently given
numerical names or codes.12 ΦTM denotes the TM step counting complexity mea-
sure also from [RC94, Chapter 3] and associated with ϕTM . In the present paper,
we employ a number of complexity bound results from [RC94, Chapters 3 & 4]
regarding (ϕTM , ΦTM ). These results will be clearly referenced as we use them.
For simplicity of notation, hereafter, we write (ϕ, Φ) for (ϕTM , ΦTM ). ϕp denotes
the partial computable function computed by the TM-program with code num-
ber p in the ϕ-system, and Φp denotes the partial computable runtime function
of the TM-program with code number p in the ϕ-system.
The symbol # is pronounced pause and is used to symbolize “no new input
data” in a text.
Note that all (partial) computable functions are N → N. Whenever we want
to consider (partial) computable functions on objects like finite sequences or
finite sets, we assume those objects to be efficiently coded as natural numbers.
We give such codings for finite sequences and finite sets below.
For all p, Wp denotes the computably enumerable (ce) set dom(ϕp ). E denotes
the set of all ce sets. We say that e is an index (in W ) for We .
We fix the 1-1 and onto pairing function ⟨·, ·⟩ : N × N → N from [RC94],
which is based on dyadic bit-interleaving. Pairing and unpairing are computable
in linear time; an illustrative sketch follows.
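For illustration, here is a sketch of pairing by bit-interleaving; the exact [RC94] convention is based on dyadic representations and may differ in details, so this is only in the spirit of the cited coding.

```python
def pair(x, y):
    """Interleave the bits of x (even positions) and y (odd positions)."""
    z, shift = 0, 0
    while x or y:
        z |= (x & 1) << shift
        z |= (y & 1) << (shift + 1)
        x, y, shift = x >> 1, y >> 1, shift + 2
    return z

def unpair(z):
    """Recover both components by de-interleaving the bits of z."""
    x = y = shift = 0
    while z:
        x |= (z & 1) << shift
        y |= ((z >> 1) & 1) << shift
        z, shift = z >> 2, shift + 1
    return x, y

assert unpair(pair(37, 2009)) == (37, 2009)
```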
Whenever we consider tuples of natural numbers as input to TMs, it is under-
stood that the general coding function ·, · is used to (left-associatively) code
the tuples into appropriate TM-input.
We identify any function f ∈ P with its graph {⟨x, f(x)⟩ | x ∈ N}.
A finite sequence is a mapping with a finite initial segment of N as domain
(and with range contained in N ∪ {#}). ∅ denotes the empty sequence (and, also, the empty
set). The set of all finite sequences is denoted by Seq. For each finite sequence
σ, we will denote the first element, if any, of that sequence by σ(0), the second,
if any, with σ(1) and so on. #elets(σ) denotes the number of elements in a finite
sequence σ, that is, the cardinality of its domain.
From now on, by convention, f , g and h with or without decoration range
over (partial) functions N → N; x, y with or without decorations range over N.
D with or without decorations ranges over finite subsets of N.
Following [LV97], we fix a coding $\langle\cdot\rangle_{Seq}$ of all sequences into N ∪ {#} (=
{0, 1}* ∪ {#}) – with the following properties.
The set of all codes of sequences is decidable in linear time. The time to encode
a sequence, that is, to compute

$$\lambda k, v_1, \ldots, v_k\ \ \langle v_1, \ldots, v_k \rangle_{Seq}$$

is

$$O\Big(\lambda k, v_1, \ldots, v_k\ \sum_{i=1}^{k} |v_i|\Big).$$
12. This numerical coding guarantees that many simple operations involving the coding
run in linear time. This is by contrast with historically more typical codings featuring
prime powers and corresponding at least exponential costs to do simple things.
268 J. Case and T. Kötzing
Therefore, the size of the codeword is also linear in the size of the elements:
$\lambda k, v_1, \ldots, v_k\ |\langle v_1, \ldots, v_k \rangle_{Seq}|$ is $O\big(\lambda k, v_1, \ldots, v_k\ \sum_{i=1}^{k} |v_i|\big)$.¹³
Furthermore,

$$\forall \sigma : \#\mathrm{elets}(\sigma) \le |\langle\sigma\rangle_{Seq}|. \tag{1}$$
Henceforth, we will many times identify a finite sequence σ with its code number
$\langle\sigma\rangle_{Seq}$. However, when we employ expressions such as σ(x), σ = f and σ ⊂ f,
we consider σ as a sequence, not as a number.
For a (partial) function g and i ∈ N, if ∀j < i : g(j)↓, then g[i] is defined to
be the finite sequence (g(0), . . . , g(i − 1)).
D, with and without decorations, ranges over finite sets. We fix the following
1-1 coding for all finite subsets of N. For each non-empty finite set $D = \{x_0 < \ldots < x_n\}$, $\langle x_0, \ldots, x_n \rangle_{Seq}$ is the code for D, and $\langle\rangle_{Seq}$ is the code for ∅.
Henceforth, we will many times identify a finite set D with its code number.
However, when we employ expressions such as x ∈ D, card(D), max(D) and
D ⊂ D′, we consider D and D′ as sets, not as numbers.
For each (possibly infinite) sequence q, let content(q) = (range(q) \ {#}).
We define LinPrograms = {e | ∃a, b ∀x : $\Phi_e(x) \le a|x| + b$} and
PolyPrograms = {e | ∃p polynomial ∀x ∈ N : $\Phi_e(x) \le p(|x|)$}. Furthermore,
let LinF = {$\varphi_e$ | e ∈ LinPrograms} and PF = {$\varphi_e$ | e ∈ PolyPrograms}.
For g ∈ PF we say that g is computable in polytime, or also, feasibly
computable. Recall that we have, by (1), ∀σ : #elets(σ) ≤ |σ|.
With log we denote the floor of the base-2 logarithm, with the exception of
log(0) = 0.
For all e, x, t, we write $\varphi_e(x){\downarrow}_t$ iff $\Phi_e(x) \le t$. Furthermore, we write

$$\forall e, x, t : \varphi_e(x){\downarrow}_t = \begin{cases} \varphi_e(x), & \text{if } \Phi_e(x) \le t; \\ 0, & \text{otherwise.} \end{cases} \tag{2}$$
The following lemma is used in many of our detailed proofs. The present
paper, because of space limitations, omits many details of proofs. Nonetheless,
we still include this lemma herein to give the reader some intuitions as to how
to manage some missing details.
Lemma 1. Regarding time-bounded computability, we have the following.
– Equality checks and log are computable in linear time [RC94, Lemma 3.2].
– Conditional definition is computable in a time polynomial in the runtimes
of its defining programs [RC94, Lemma 3.14].
– Bounded minimizations, and, hence, bounded maximizations are computable
in a time polynomial in the runtimes of its defining programs [RC94,
Lemma 3.15].
– Boolean combinations of predicates computable in polytime are computable
in polytime [RC94, Lemma 3.18].
– From [RC94, Corollary 3.7], we have that λe, x, t ϕe (x)↓|t| and
λe, x, t, z ϕe (x)↓|t| = z are computable in polynomial time.
13. For these O-formulas, |ε| = 1 helps.
– Our coding of finite sequences easily gives that the following functions
are linear time computable: $\forall x : 1 \le \mathrm{length}(\bar{x})$, $\lambda \langle\sigma\rangle_{Seq}\ \#\mathrm{elets}(\sigma)$, and

$$\lambda \langle\sigma\rangle_{Seq}, i\ \begin{cases} \sigma(i), & \text{if } i < \#\mathrm{elets}(\sigma); \\ 0, & \text{otherwise.} \end{cases}$$
– Our coding above of finite sets enables content to be computable in polyno-
mial time.14
In this section we give our modular definition of what a learning criterion is.
After that we will put the modules together to obtain the actual criteria we
need. As noted above, all standard inductive inference criteria names will be
changed to slightly different names in our modular approach.
– W ((3) holds).
– A canonical numbering of all regular languages, represented by efficiently
coded DFAs (where membership is trivially uniformly polynomial time de-
cidable and (3) holds).
– For each pair of context free grammars (CFGs) G0 , G1 , we efficiently code
(G0 , G1 ) to be an index for (L(G0 ) \ L(G1 )). Then the resulting numbering,
in particular, has an index for each context free language. Furthermore,
14. This computation involves sorting. Selection sort can be done in quadratic time in
the RAM model [Knu73], and adding an extra linear factor to translate from RAM
complexity to deterministic multi-tape TM complexity [vEB90], we get selection sort
in cubic (and, hence, polynomial) time measured by ΦTM.
15. Note that such a numbering does not necessarily need to be onto, i.e., a numbering
might only number some of the ce languages, leaving out others.
Lemma 9. We have
Proof: “⊆” is trivial. Regarding “⊇”: First we apply Proposition 16 to get total
conservativeness. Then we use Theorem 17 to obtain total postdictive complete-
ness. We use Theorem 20 to make the learner set-driven. By Lemma 9, such a
learner can be delayed to be computable in polynomial time. By Proposition 21,
this learner is automatically totally prudent.
Proposition 11 just below shows that any learner can be assumed partially set-
driven, and, importantly, the transformation of a learner to a partially set-driven
learner preserves prudence. The proposition and its proof are somewhat anal-
ogous to [JORS99, Proposition 5.29] and its proof. However, our proof, unlike
that of [JORS99, Proposition 5.29], does not require the hypothesis space to be
paddable.
Proposition 11. DTxtPsdExU = DTxtGExU.
The next theorem is the main result of the present section. As noted in Section 1
above, it says that the three restrictions of postdictive completeness, conserva-
tiveness and prudence allow maximal unfairness — within the current setting of
polynomial time decidable hypothesis spaces.
Theorem 13. Let δ ∈ {R2, Pcp, Conv, PcpConv}, D ∈ {Id, Prud} and D′ ∈ {Id, TPrud}. Then
Next is another main theorem of the present section. It states that there are other uniformly decid-
able hypothesis spaces such that postdictive completeness, with or without any
of conservativeness and prudence, forbids some delaying tricks. By contrast, ac-
cording to Theorem 18 in Section 5, any combination of just the two restrictions
of conservativeness and prudence allows for arbitrary delaying tricks.
Theorem 14. There exists a uniformly decidable numbering V such that, for each δ ∈ {R2, Pcp, Conv, PcpConv}, D ∈ {Id, Prud} and D′ ∈ {Id, TPrud},
We can, and sometimes do, think of total function learning as a special case
of TxtEx-learning, as follows. Suppose f is any (possibly, but not necessarily, total) function mapping non-negative integers into the same. Recall that we identify f with its graph, {⟨x, y⟩ | f(x) = y}, where ⟨x, y⟩ is the numeric coding of (x, y) (Section 2). Then {⟨x, y⟩ | f(x) = y} is a sublanguage of the non-negative integers. Furthermore, programs for f are generally trivially intercompilable with programs or grammars for {⟨x, y⟩ | f(x) = y}. We sometimes refer to languages of the form {⟨x, y⟩ | f(x) = y} as single-valued languages.
Next is our second main result of the present section. It asserts the polynomial time learnability, under the restrictions of postdictive completeness, conservativeness and prudence, of a uniformly decidable class of total single-valued languages, namely (the graphs of) the linear time computable functions. Importantly,
our proof of this theorem employs a Pitt-style delaying trick on an enumeration
technique [Gol67, BB75], and our result, then, entails, as advertised in Section 1
above, that some delaying tricks are not forbidden in the setting of the present
section.
Let θLtime be an efficiently coded programming system from [RC94, Chap-
ter 6] for LinF. θLtime is based on multi-tape TM-programs each explicitly
clocked to halt in linear time (in the length of its input). Let V Ltime be the cor-
responding effective numbering of all and only those ce languages (whose graphs
are) ∈ LinF. Note that V Ltime does not satisfy the condition at the beginning
of the present section, on numberings V, for obtaining codes of finite languages — since we
have only infinite languages in V Ltime . Instead, for V Ltime , we have (and use)
a linear time algorithm, which on any finite function F , outputs a V Ltime -index
for the zero-extension of F .
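As a minimal illustration of the zero-extension (a sketch only, not the linear time algorithm from the text; the finite function F is represented here as a Python dictionary):

def zero_extension(F):
    # The total function that agrees with the finite function F on its
    # domain and returns 0 everywhere else.
    return lambda x: F.get(x, 0)

g = zero_extension({2: 7, 5: 1})
assert g(2) == 7 and g(3) == 0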
Theorem 15
The remainder of this section presents two results that are used elsewhere. They
are put here to present them in more generality. They each hold for any V .
The following proposition says that, for any uniformly decidable V , conser-
vative learnability implies total conservative learnability. It is used for proving
Theorem 10 in Section 3.
Proposition 16. TConvTxtGExV = TxtGConvExV.
The following theorem holds for all V and states that we can assume total
postdictive completeness when learning with total conservativeness.
Theorem 17. We have
T(PcpConv)TxtGExV = TConvTxtGExV.
5 Learning ce Languages
For the remainder of this section, let V be any effective numbering of some ce
languages.
For the present section, next (and mentioned in Section 1 above) is our first
main result which says that any combination of just the two restrictions of
conservativeness and prudence allows for arbitrary delaying tricks.
Theorem 18. Let δ ∈ {R2, Conv}, D ∈ {Id, Prud} and D′ ∈ {Id, TPrud}. Then
DPFTxtGδExV = DTxtGδExV and
D′PFTδTxtGExV = D′TδTxtGExV.
Our proof of the just above theorem uses delaying tricks similar to those in the
proof of Lemma 9 in Section 3.
Our next main result of the present section says, for the general effective
numbering of all ce languages, W , combinations of postdictive completeness,
conservativeness and prudence forbid some delaying tricks iff postdictive com-
pleteness is part of the combination.
Theorem 19. Let δ ∈ {R2, Pcp, Conv, PcpConv}, D ∈ {Id, Prud} and D′ ∈ {Id, TPrud}. Then
DPFTxtGδExW ⊂ DRTxtGδExW ⇔ δ ⊆ Pcp and
D′PFTδTxtGExW ⊂ D′TδTxtGExW ⇔ δ ⊆ Pcp.
Our proof of the just above theorem makes crucial use of [CK08b, Theorem 5(a)]
as well as Theorem 18 above.
Theorem 20 just below says that certain kinds of learners can be assumed without
loss of generality to be set-driven. This is interesting on its own, and is also of
important technical use for proving Theorem 10 in Section 3.
Theorem 20. Let V be such that (3) holds. We have
The following proposition shows that totally postdictively complete and totally conservative set-driven learners are automatically totally prudent. This, again, is of important technical use for proving Theorem 10 in Section 3.
Next is our last main result. As noted above in Section 1, this theorem says that,
in the general setting of the present section, postdictive completeness does not
forbid all delaying tricks.
Proof: The effective numbering V Ltime from Theorem 15 can be translated into
the W -system in linear (and, hence, in polynomial) time.
References
[Ang80] Angluin, D.: Inductive inference of formal languages from positive data. In-
formation and Control 45, 117–135 (1980)
[Bār74] Bārzdiņš, J.: Inductive inference of automata, functions and programs. In:
Int. Math. Congress, Vancouver, pp. 771–776 (1974)
[BB75] Blum, L., Blum, M.: Toward a mathematical theory of inductive inference.
Information and Control 28, 125–155 (1975)
[Bur05] Burgin, M.: Grammars with prohibition and human-computer interaction.
In: Proceedings of the 2005 Business and Industry Symposium and the 2005
Military, Government, and Aerospace Simulation Symposium, pp. 143–147.
Society for Modeling and Simulation (2005)
[CCJ09] Carlucci, L., Case, J., Jain, S.: Learning correction grammars. Journal of
Symbolic Logic 74(2), 489–516 (2009)
[CK08a] Case, J., Kötzing, T.: Dynamic modeling in inductive inference. In: Freund,
Y., Györfi, L., Turán, G., Zeugmann, T. (eds.) ALT 2008. LNCS (LNAI),
vol. 5254, pp. 404–418. Springer, Heidelberg (2008)
[CK08b] Case, J., Kötzing, T.: Dynamically delayed postdictive completeness and
consistency in learning. In: Freund, Y., Györfi, L., Turán, G., Zeugmann, T.
(eds.) ALT 2008. LNCS (LNAI), vol. 5254, pp. 389–403. Springer, Heidelberg
(2008)
[CR09] Case, J., Royer, J.: Program size complexity of correction grammars, Work-
ing draft (2009)
[Ful88] Fulk, M.: Saving the phenomenon: Requirements that inductive machines not
contradict known data. Information and Computation 79, 193–209 (1988)
[Gol67] Gold, E.: Language identification in the limit. Information and Control 10,
447–474 (1967)
[HU79] Hopcroft, J., Ullman, J.: Introduction to Automata Theory Languages and
Computation. Addison-Wesley Publishing Company, Reading (1979)
[JORS99] Jain, S., Osherson, D., Royer, J., Sharma, A.: Systems that Learn: An In-
troduction to Learning Theory, 2nd edn. MIT Press, Cambridge (1999)
[Knu73] Knuth, D.: The Art of Computer Programming, Volume III: Sorting and
Searching. Addison-Wesley, Reading (1973)
[LV97] Li, M., Vitanyi, P.: An Introduction to Kolmogorov Complexity and Its
Applications, 2nd edn. Springer, Heidelberg (1997)
[OSW86] Osherson, D., Stob, M., Weinstein, S.: Systems that Learn: An Introduc-
tion to Learning Theory for Cognitive and Computer Scientists. MIT Press,
Cambridge (1986)
[Pit89] Pitt, L.: Inductive inference, DFAs, and computational complexity. In: Jan-
tke, K.P. (ed.) AII 1989. LNCS, vol. 397, pp. 18–44. Springer, Heidelberg
(1989)
[RC94] Royer, J., Case, J.: Subrecursive Programming Systems: Complexity and
Succinctness. Research monograph in Progress in Theoretical Computer Sci-
ence. Birkhäuser, Boston (1994)
[Rog67] Rogers, H.: Theory of Recursive Functions and Effective Computability. Mc-
Graw Hill, New York (1967); Reprinted by MIT Press, Cambridge, Mas-
sachusetts (1987)
[Sch91] Schabes, Y.: Polynomial time and space shift-reduce parsing of arbitrary
context-free grammars. In: Proceedings of the 29th annual meeting on As-
sociation for Computational Linguistics, pp. 106–113. Association for Com-
putational Linguistics (1991)
[vEB90] van Emde Boas, P.: Machine models and simulations. In: van Leeuwen, J. (ed.) Handbook of Theoretical Computer Science, vol. A: Algorithms and Complexity, pp. 3–66. MIT Press/Elsevier (1990)
[Wei82] Weinstein, S.: Private communication at the Workshop on Learnability The-
ory and Linguistics, University of Western Ontario (1982)
[Wie76] Wiehagen, R.: Limes-erkennung rekursiver funktionen durch spezielle strate-
gien. Elektronische Informationverarbeitung und Kybernetik 12, 93–99
(1976)
[Wie78] Wiehagen, R.: Zur Theorie der Algorithmischen Erkennung. PhD thesis,
Humboldt University of Berlin (1978)
[Yos09] Yoshinaka, R.: Learning efficiency of very simple grammars from positive
data. Theoretical Computer Science 410, 1807–1825 (2009); In: Hutter, M.,
Servedio, R.A., Takimoto, E. (eds.) ALT 2007. LNCS (LNAI), vol. 4754, pp.
227–241. Springer, Heidelberg (2007)
Learning Mildly Context-Sensitive Languages
with Multidimensional Substitutability from
Positive Data
Ryo Yoshinaka
Abstract. Recently Clark and Eyraud (2007) have shown that sub-
stitutable context-free languages, which capture an aspect of natural
language phenomena, are efficiently identifiable in the limit from pos-
itive data. Generalizing their work, this paper presents a polynomial-
time learning algorithm for new subclasses of mildly context-sensitive
languages with variants of substitutability.
1 Introduction
xy, xy′, x′y ∈ L implies x′y′ ∈ L
xyz, xy′z, x′yz ∈ L implies x′y′z ∈ L.
x₀y₁x₁ … yₘxₘ, x₀y₁′x₁ … yₘ′xₘ, x₀′y₁x₁′ … yₘxₘ′ ∈ L implies x₀′y₁′x₁′ … yₘ′xₘ′ ∈ L.
2 Preliminaries
2.1 Definitions and Notations
The set of nonnegative integers is denoted by N and this paper will consider
only numbers in N. The cardinality of a set S is denoted by |S|. If w is a string
over an alphabet Σ, |w| denotes its length. ∅ is the empty set and λ is the
empty string. Σ ∗ denotes the set of all strings over Σ, Σ + = Σ ∗ − {λ} and
Σ k = { w ∈ Σ ∗ | |w| = k }. Any subset of Σ ∗ is called a language (over Σ). If
L is a finite language over Σ, its size is defined as ∥L∥ = |L| + Σ_{w∈L} |w|. For any x, x^⟨m⟩ means the m-tuple of x, while x^m denotes the usual concatenation of x, e.g., x^⟨3⟩ = ⟨x, x, x⟩ and x³ = xxx. Hence (Σ*)^⟨m⟩ is the set of m-tuples of strings over Σ, which are called m-words. Similarly we define (·)^⟨∗⟩ and (·)^⟨+⟩, where, for instance, (Σ*)^⟨+⟩ denotes the set of all m-words for all m ≥ 1. For an m-word x = ⟨x₁, …, xₘ⟩, |x| denotes its length m and ∥x∥ denotes its size m + Σ_{1≤i≤m} |xᵢ|. If f is a function defined on k-tuples, we will write f(z₁, …, zₖ) for f(⟨z₁, …, zₖ⟩) for readability.
¹ We identify a function symbol with the function itself by convention.
{ aⁿbⁿcⁿ | n ≥ 0 }, { aᵐbⁿcᵐdⁿ | m, n ≥ 0 }, { ww | w ∈ Σ* }.
Seki et al. [17] and Rambow and Satta [15] have investigated the hierarchy of
mcfls.
Furthermore Rambow and Satta show a trade-off between dimension and rank.
This contrasts with Proposition 1.
A^π → f^π(B₁^{π₁}, …, Bₙ^{πₙ})
A^μ → f^μ(B₁^{μ₁}, …, Bₙ^{μₙ})
We note that ∥G′∥ ≤ 2^{p−1}∥G∥. We say that a linear regular function f is good
if it is λ-free, non-permuting and non-merging, and that an mcfg G is good if
all of its functions are good. We assume that all mcfgs in this paper are good.
x₁ ⊙ y₁, x₁ ⊙ y₂, x₂ ⊙ y₁ ∈ L implies x₂ ⊙ y₂ ∈ L
for any x₁, x₂ ∈ Σ*(Σ+)^{m−1}Σ*, y₁, y₂ ∈ (Σ+)^⟨m⟩ and m ≤ p. For notational convenience, we write Σ^[m] for Σ*(Σ+)^{m−1}Σ*. By S(p) we denote the class
Thus those typical p-dimensional mcfls can be inferred from some finite subsets
with pd-substitutability, if one can compute the least language in S(p) including
an arbitrarily given finite language. Therefore we are concerned with the classes
of languages that are in S(p) and at the same time in L(p, ∗). Let us denote
SL(p, r) = S(p) ∩ L(p, r) and SL(p, ∗) = S(p) ∩ L(p, ∗).
On the other hand, some other typical p-dimensional mcfls are not pd-substitutable. We say that two m-words y₁ and y₂ are substitutable for each other in L when for any x ∈ Σ^[m] it holds that x ⊙ y₁ ∈ L iff x ⊙ y₂ ∈ L.
Example 3. The language L₂⁻ = { a₁ⁿ#a₂ⁿ#a₃ⁿ#a₄ⁿ | n ≥ 0 } is not 2d-substitutable. If a 2d-substitutable language L contains ### and a₁#a₂#a₃#a₄ as L₂⁻ does, then ⟨#, #⟩ and ⟨a₁#a₂, a₃#a₄⟩ should be substitutable for each other in L. This entails that a₁a₁#a₂a₂a₃#a₄a₃#a₄ ∈ L − L₂⁻.
The language Lreverse = { w#wᴿ | wᴿ is the reverse of w ∈ {a, b}* } is 1d-substitutable but not 2d-substitutable. Actually even { aⁿ#aⁿ | n ≥ 0 } is not 2d-substitutable. Suppose that a 2d-substitutable language L contains aaa#aaa. Then ⟨aa#, a⟩ and ⟨a, #aa⟩ are substitutable for each other, because of the shared 2-context ⟨a, a, a⟩. At the same time aaa#aaa = ⟨a, aa, λ⟩ ⊙ ⟨aa#, a⟩, so L must contain ⟨a, aa, λ⟩ ⊙ ⟨a, #aa⟩ = aaaa#aa, too. This shows that even a singleton language is not 2d-substitutable, which contrasts with the fact that every singleton is 1d-substitutable.
The language Lcopy = { w#w | w ∈ {a, b}∗ } is not 1d-substitutable. If a
1d-substitutable language L contains a#a and b#b as Lcopy does, they should
be substitutable for each other. aa#aa ∈ L entails ab#ba ∈ L − Lcopy .
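The substitution step used in these examples is easy to make concrete. The following Python sketch (the helper name is illustrative, not from the paper) computes the strings forced into a 1d-substitutable language L once y1 and y2 are known to be substitutable and the given sample is known to lie in L:

def one_d_consequences(sample, y1, y2):
    # Replace each single occurrence of y1 by y2 inside every sample string;
    # 1d-substitutability forces all of the resulting strings into L.
    forced = set()
    for w in sample:
        i = w.find(y1)
        while i != -1:
            forced.add(w[:i] + y2 + w[i + len(y1):])
            i = w.find(y1, i + 1)
    return forced

For instance, one_d_consequences({"aa#aa"}, "a#a", "b#b") contains "ab#ba", matching the Lcopy example above.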
When the language L₁ = { a₁ⁿ#₁a₂ⁿ | n ≥ 0 } is generated by a context-free grammar, only the nesting structural interpretation is possible, while with a 2-dimensional mcfg, cross-serial dependency is also a possible interpretation at the same time. One cannot decide which is the underlying structure from strings alone. Actually, if a 2d-substitutable language contains a₁#₁a₂ and a₁a₁#₁a₂a₂, both interpretations are inevitably induced.
The regular language (ab*cd*)*e ∈ ⋃_{m∈N} SL(m, 1) is not a p, q-sec language for any p, q. On the other hand, Lreverse from Example 3 is in SEC(1, 2) and is not 2d-substitutable. { aⁿbⁿ | n ≥ 1 } ∈ SEC(1, 1) is not 1d-substitutable either.
where dim(y) = |y|. We will write [[y]] instead of y for clarifying that it means a nonterminal symbol (indexed with y). PK consists of the following rules:
– (Type I) [[y]] → f([[y₁]], …, [[yₙ]]) if there is a good function f of rank n ≤ r such that y = f(y₁, …, yₙ), where [[y]], [[y₁]], …, [[yₙ]] ∈ VK − {S};
– (Type II) [[y]] → Iₘ([[y′]]), where Iₘ is the identity on m-words for m = |y| ≤ p, if there is x ∈ Σ^[m] such that x ⊙ y, x ⊙ y′ ∈ K;
– (Type III) S → I₁([[w]]) if w ∈ K;
and FK is the set of functions requested in the definition of PK. As VK is finite, FK and PK are also finite. Then G(K) = ⟨Σ, VK, FK, PK, S⟩ ∈ G(p, r) is the conjecture by A(p, r).
Instead of having rules [[y]] → Iₘ([[y′]]) of Type II, one may merge y and y′ to downsize the output, as Clark and Eyraud do in [8].
Example 4. Let p = 2 and r = 1. Let us consider the grammar G(K) = ⟨Σ, VK, FK, PK, S⟩ for
Algorithm 1. A(p, r)
Data: A sequence of strings w1, w2, …
Result: A sequence of mcfgs G1, G2, … ∈ G(p, r)
let Ĝ be an mcfg such that L(Ĝ) = ∅;
for n = 1, 2, … do
  read the next string wn;
  if wn ∉ L(Ĝ) then
    let Ĝ = G(K) where K = {w1, …, wn};
  end if
  output Ĝ as Gn;
end for
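The control flow of Algorithm 1 is the usual consistency-driven update loop. A minimal Python sketch, with hypothetical helpers grammar_from_sample(K) standing in for G(K) and generates(G, w) for the membership test w ∈ L(Ĝ) (neither name is from the paper):

def learner(stream, grammar_from_sample, generates):
    # Emit a conjecture after each example; revise only when the current
    # conjecture fails to generate the new string, as in Algorithm 1.
    K, G = [], None                    # G = None encodes the empty language
    for w in stream:
        K.append(w)
        if G is None or not generates(G, w):
            G = grammar_from_sample(K)     # conjecture G(K)
        yield G                            # output the conjecture as G_n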
We then show that the language of the conjectured grammar Ĝ of our algorithm A(p, r) is always a subset of the target language.
Lemma 4. For any L ∈ S(p) and any finite subset K of L, if w ∈ L(G(K), [[y]]) with [[y]] ∈ VK − {S}, then y and w are substitutable for each other in L.

x ⊙ y = x ⊙ f(y₁, …, yₙ) ∈ K ⊆ L.

The induction hypothesis says that yᵢ and wᵢ are substitutable for each other in L. Recall that the rule f is designed to be good. This allows us the following inference:

other. By the induction hypothesis, y′ and w are substitutable for each other. Hence y and w are also substitutable for each other in L.
Lemma 5. For any L ∈ S(p) and any finite subset K of L, it holds that
L(G(K)) ⊆ L.
Proof. Let Ĝ = G(K). If w ∈ L(Ĝ), i.e., w ∈ L(Ĝ, S), then there is [[y]] ∈ VK such that S → I₁([[y]]) is a rule of Type III of Ĝ and y ∈ K. By Lemma 4, y and w are substitutable for each other in L, and y ∈ K ⊆ L then implies that w ∈ L.
The conjectured language may be properly smaller than the target, when the
given data are not rich enough. We now define a finite subset of the target
language which ensures correct convergence of the conjecture of our learning
algorithm. For a good mcfg G ∈ G(p, r) generating the target language, we
define KG so that for each rule from G, it contains a shortest string from L(G)
which is derived using that rule at least once. For the sake of rigorousness, we
give a formal definition of KG here. Let X (G, A/B) be defined by:
where min S for a set S of m-words means an element y from S whose size ∥y∥ is the smallest, and min S for a set S of m-contexts means an element x from S whose length |x| is the smallest.
Lemma 6. For any G ∈ G(p, r), if KG ⊆ K, then L(G) ⊆ L(G(K)).
One can modify the learning algorithm so that it learns SL(p, ∗) by removing the restriction on the rank of the hypothesized grammar. The rank is now bounded by the length of a longest example given so far, because we still restrict the functions of grammars to be λ-free. Let us call the learning algorithm obtained in this way A(p, ∗).
Corollary 2. A(p, ∗) identifies SL(p, ∗) in the limit from positive data.
be the length of a shortest string derived from A, and tG is the maximum of the thicknesses of the nonterminals. Instead of the original definition, we would like the thickness τG of a grammar G to be defined as the maximum of the thicknesses of the rules, where the thickness of a rule ρ is defined to be the length of a shortest string in L(G) that is derived using ρ at least once. It is easy to see that τG ≤ ∥G∥tG. This works well for multiple context-free grammars as well as for context-free grammars. Hence a value is bounded by a polynomial in τG if and only if it is bounded by a polynomial in ∥G∥tG. The following is our
criterion for efficient learning, which is a slight modification of de la Higuera’s
definition [11].
Definition 1. A representation class G of mcfgs is identifiable in the limit
from positive data with polynomial time and data if and only if there exists an
algorithm A such that
Clark and Eyraud’s [8] and Yoshinaka’s [19] learning algorithms for (k, l-)sub-
stitutable context-free languages satisfy this definition.
Lemma 7. Our algorithm A(p, r) computes its hypothesis Ĝ in polynomial time
in the total size of the given examples.
Proof. Each rule of G determines one element in KG, whose length is exactly the thickness of the rule. We have |KG| ≤ ∥G∥ and ∥KG∥ ≤ |KG|τG ≤ |P|τG.
The size of KG does not depend on p and r, while the updating time is polynomial only when p and r are fixed. This contrasts with Yoshinaka's discussion on
the learning efficiency of k, l-substitutable context-free languages [19], which are
another extension of Clark and Eyraud’s work [8]. His algorithm updates the
conjecture in polynomial time independently of k and l, while the size of data
for convergence is bounded by a polynomial whose degree is linear in k + l.
Theorem 1. The learning algorithm A(p, r) identifies SL(p, r) in the limit from
positive data with polynomial time and data.
Concerning the algorithm A(p, ∗) for SL(p, ∗), its updating time is not bounded
by a polynomial any longer, while KG still works well for A(p, ∗).
5 Discussions
This paper has demonstrated how Clark and Eyraud's approach via substitutability [8] works in learning mildly context-sensitive languages. pd-substitutability seems to fit nicely into p-dimensional mcfls as a generalization of 1d-substitutability in context-free languages, which is the exact analogue of reversibility in regular languages. The obtained learnable classes are however not rich: as we have seen in Section 3, several rather simple languages are not 2d-substitutable. pd-substitutability easily causes too much generalization from finite languages even when
p = 2. The author hopes that this work provides a clue for further investiga-
tion on learning mildly context-sensitive languages possibly in other learning
schemes.
One naive trial for enriching the expressive power from 2d-substitutable lan-
guages might be considering the following property in addition to 1d-substitut-
ability:
x₁y₁x₂y₂, x₁y₁′x₂y₂′, x₁′y₁x₂′y₂ ∈ L implies x₁′y₁′x₂′y₂′ ∈ L
for any x₁, x₁′, y₂, y₂′ ∈ Σ* and x₂, x₂′, y₁, y₁′ ∈ Σ⁺. This property is stronger than 1d-substitutability and slightly weaker than 2d-substitutability (and might be thought of as 2d-reversibility). However, this property is still too strong; none of { aⁿ#aⁿ | n ≥ 1 }, Lreverse and Lcopy satisfies this property.
In order to control some kinds of dependent structures in pd-substitutable languages, Examples 1 and 2 insert delimiters #ᵢ. This trick is necessary even in 1d-substitutable languages. While { aⁿ#bⁿ | n ≥ 0 } is 1d-substitutable, { aⁿbⁿ | n ≥ 0 } is not. Yoshinaka's approach of k, l-substitutability [19] enables us to remove such delimiters. Thus again one may consider k₁, …, k₂ₘ-substitutability:

x ⊙ v ⊙ y, x ⊙ v ⊙ y′, x′ ⊙ v ⊙ y ∈ L implies x′ ⊙ v ⊙ y′ ∈ L
Acknowledgement
The author deeply thanks Thomas Zeugmann and the anonymous reviewers for their valuable comments and advice.
This work was supported by Grant-in-Aid for Young Scientists (B-20700124)
and a grant from the Global COE Program, “Center for Next-Generation Infor-
mation Technology based on Knowledge Discovery and Knowledge Federation”,
from the Ministry of Education, Culture, Sports, Science and Technology of
Japan.
References
1. Angluin, D.: Inference of reversible languages. Journal of the Association for Com-
puting Machinery 29(3), 741–765 (1982)
2. Becerra-Bonache, L., Case, J., Jain, S., Stephan, F.: Iterative learning of simple
external contextual languages. In: Freund, Y., Györfi, L., Turán, G., Zeugmann,
T. (eds.) ALT 2008. LNCS (LNAI), vol. 5254, pp. 359–373. Springer, Heidelberg
(2008)
3. Becerra-Bonache, L., Yokomori, T.: Learning mild context-sensitiveness: Toward
understanding children’s language learning. In: Paliouras, G., Sakakibara, Y. (eds.)
ICGI 2004. LNCS (LNAI), vol. 3264, pp. 53–64. Springer, Heidelberg (2004)
4. Carme, J., Gilleron, R., Lemay, A., Niehren, J.: Interactive learning of node select-
ing tree transducer. Machine Learning 66(1), 33–67 (2007)
Uncountable Automatic Classes and Learning

1 Introduction
Usually, in learning theory one considers classes consisting of countably many
languages from some countable domain. A typical example here is a class of all
recursive subsets of {0, 1, 2}∗, the set of all finite strings in the alphabet {0, 1, 2}.
However, each countably infinite domain has uncountably many subsets, and
thus we miss out many potential targets when we consider only countable classes.
The main goal of this paper is to find a generalization of the classical model
of learning which would be suitable for working with uncountable classes of
languages. The classes which we consider can be uncountable, but they still have some structure, namely, they are recognizable by Büchi automata.⋆

⋆ The first and fourth authors are supported in part by NUS grant R252-000-308-112; the third and fourth authors are supported by NUS grant R146-000-114-112.

We will
investigate how the classical notions of learnability have to be adjusted in this
setting in order to obtain meaningful results. To explain our approach in more
detail, we first give an overview of the classical model of inductive inference
which is the underlying model of learning in our paper.
Consider a class L = {Li }i∈I , where each language Li is a subset of Σ ∗ , the
set of finite strings in an alphabet Σ. In a classical model of learning, which was
introduced and studied by Gold [9], a learner M receives a sequence of all the
strings from a given language L ∈ L, possibly with repetitions. Such a sequence is
called a text for the language. After reading the first n strings from the text, the learner outputs a hypothesis iₙ about what the target language might be. The learner succeeds if it eventually converges to an index that correctly describes the language to be learned, that is, if limₙ iₙ = i and L = Lᵢ. If the learner
succeeds on all texts for all languages from a class, then we say that it learns this
class. This is the notion of explanatory learning (Ex). Such a model became the
standard one for the learnability of countable classes. Besides Ex, several other
paradigms for learning have been considered like, e.g., behaviourally correct
(BC) learning [3], vacillatory or finite explanatory (FEx) learning [8], partial
identification (Part) [13] and so on.
The indices that the learner outputs are usually finite objects like natural
numbers or finite strings. For example, Angluin [1] initiated the research on
learnability of uniformly recursive families indexed by natural numbers, and in
their recent work Jain, Luo and Stephan [10] considered automatic indexings by
finite strings in place of uniformly recursive ones. The collection of such finite
indices is countable, and hence we can talk only about countable classes of lan-
guages. On the other hand, the collection of all the subsets of Σ ∗ is uncountable,
and it looks too restrictive to consider only countable classes. Because of this, it
is interesting to find a generalization of the classical model which will allow us
to study the learnability of uncountable classes.
Below is the informal description of the learning model that we investigate in
this paper. First, since we are going to work with uncountable classes, we need
uncountably many indices to index a class to be learned. For this purpose we
will use infinite strings (or ω-strings) in a finite alphabet. There are computing
machines, called Büchi automata or ω-automata, which can be used naturally
for processing ω-strings. They were first introduced by Büchi [6,7] to prove the
decidability of S1S, the monadic second-order theory of the natural numbers with
successor function S(x) = x + 1. Because of this and other decidability results
the theory of ω-automata has become a popular area of research in theoretical
computer science, see, e.g., [14]. So, we will assume that a class to be learned
has an indexing by ω-strings which is Büchi recognizable.
The main difference between our model and the classical one is that the learner
does not output hypotheses as it processes a text. The reason for this is that it
is not possible to output an arbitrary infinite string in a finite amount of time.
Instead, in our model the learner is presented with an index α and a text T ,
and it must decide whether T is a text for the set with the index α. During its
work, the learner outputs an infinite sequence of Büchi automata {An }n∈ω such
that An accepts the index α if and only if the learner at stage n thinks that T is
indeed a text for the set with the index α. The goal of the learner is to converge
in the limit to the right answer.
As one can see from the description above, the outputs of a learner take
form of ω-automata instead of just binary answers ‘yes’ or ‘no’. We chose such
definition due to the fact that a learner can read only a finite part of an infinite
index in a finite amount of time. If we required that a learner outputs its ‘yes’ or
‘no’ answer based on such finite information, then our model would become too
restrictive. On the other hand, a Büchi automaton allows a learner to encode
additional infinitary conditions that have to be verified before the index will be
accepted or rejected, for example, if the index contains infinitely many 1’s or
not. This approach makes a learner more powerful, and more nontrivial classes
become learnable.
Probably the most interesting property of our model is that for many learning
criteria, the learnability coincides with Angluin’s classical tell-tale condition for
the countable case (see the table at the end of this section). Angluin’s condition
states that for every set L from a class L, there is a finite subset DL ⊆ L
such that for any other L ∈ L with DL ⊆ L ⊆ L we have that L = L. It is
also well-known that in the classical case, every r.e. class is learnable according
to the criterion of partial identification. We will show that in our model every
ω-automatic class can be learned according to this criterion.
The results above show that the notions defined in this paper match the
intuition of learnability, and that our model is a natural one and is suitable for
investigating the learnability of uncountable classes of languages.
We also consider a notion of a blind learning. A learner is called blind if it does
not see an index presented to it. Such a learner can see only an input text, but
nevertheless it must decide whether the index and the text match each other.
It turns out that for the criterion of behaviourally correct learning, the blind
learners are as powerful as the non-blind ones without even the need to change
the indexing of a class, but for the other learning criteria this notion becomes
more restrictive.
The reader can find all formal definitions of the notions discussed here and
some necessary preliminaries in the next section. We summarize our results:
Criterion    Condition           Indexing    Theorem
Ex           ATTC                New         17, 20
FEx          ATTC                Original    13, 20
BC           ATTC                Original    20
Part         Any class           Original    21
BlindBC      ATTC                Original    18, 20
BlindEx      ATTC & Countable    Original    19
BlindFEx     ATTC & Countable    Original    19
BlindPart    Countable           Original    22
In this table, the first column lists the learning criteria that we studied. Here,
Ex stands for explanatory learning, BC for behaviourally correct learning, FEx
for finite explanatory or vacillatory learning, and Part for partial identification.
A prefix Blind denotes the blind version of the corresponding criterion. The
second column describes equivalent conditions for a given learning criterion.
Here, ATTC means that the class must satisfy Angluin’s tell-tale condition, and
Countable means that the class must be countable. The next column indicates
whether the learner uses the original indexing of the class or a new one. The last
column gives a reference to a theorem/corollary where the result is proved.
2 Preliminaries
An ω-automaton is mainly a finite automaton operating on ω-strings with an
infinitary acceptance condition which decides — depending upon the infinitely
often visited nodes — which ω-strings are accepted and which are rejected. For
a general background on the theory of finite automata the reader is referred
to [11].
Definition 1 ([6,7]). A nondeterministic ω-automaton is a tuple A = (S, Σ,
I, T ), where
For the ease of notation, we often just write (x, y) instead of ⊗(x, y) and so on. It
is well-known that the automatic relations are closed under union, intersection,
projection and complementation. In general, the following theorem holds, which
we will often use in this paper.
T = u₀, u₁, u₂, …,
such that each uᵢ is either equal to the pause symbol # or belongs to Γ*, where Γ is some alphabet. We call uᵢ the ith input of the text. The content of a text T is the set content(T) = {uᵢ : uᵢ ≠ #}. If content(T) is equal to a set L ⊆ Γ*, then we say that T is a text for L.
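In code, the content of a finite portion of a text is simply the set of its non-pause inputs, e.g. (a trivial sketch):

def content(prefix):
    # content of T[n]: every input seen so far except the pause symbol #.
    return {u for u in prefix if u != "#"}

assert content(["ab", "#", "ba", "ab"]) == {"ab", "ba"}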
for any α and T . By M(α, T, k) we denote the kth automaton output by learner
M when processing an index α and a text T . Without loss of generality, for the
learning criteria considered in this paper, we assume that M(α, T, k) is defined
for all k.
Definition 9 (see [3,8,9,13]). Let a class L = {Lα }α∈I (together with its
indexing) and a learner M be given. We say that
1) M BC-learns L iff for any index α ∈ I and any text T with content(T ) ∈ L,
there exists n such that for every m ≥ n,
|{M(α, T, m) : m ≥ n}| ≤ k.
Here the abbreviations BC, Ex, FEx and Part stand for ‘behaviourally correct’,
‘explanatory’, ‘finite explanatory’ and ‘partial identification’, respectively; ‘finite
explanatory learning’ is also called ‘vacillatory learning’. We will also use the
notations BC, Ex, FEx, FExk and Part to denote the collection of classes
(with corresponding indexings) that are BC-, Ex-, FEx-, FExk - and Part-
learnable, respectively.
Definition 10. A learner is called blind if it does not see the tape which contains
an index. The classes that are blind BC-, Ex-, etc. learnable are denoted as
BlindBC, BlindEx, etc., respectively.
The converse will also be shown to be true, hence for automatic classes one
can equate “L is learnable” with “L satisfies Angluin’s tell-tale condition”. Note
that the second and the third class given in Example 7 satisfy Angluin’s tell-tale
condition.
3 Vacillatory Learning
In the following it is shown that every learnable class can even be vacillatorily
learned and that the corresponding FEx-learner uses overall on all possible
inputs only a fixed number of automata.
Theorem 13. Let {Lα }α∈I be a class that satisfies Angluin’s tell-tale condition.
Then there are finitely many automata A1 , . . . , Ac and an FEx-learner M for
the class {Lα }α∈I with the property that for any α ∈ I and any text T for a
set from {Lα }α∈I , the learner M oscillates only between some of the automata
A1 , . . . , Ac on α and T .
Such N exists since the relation is first-order definable from ‘x ∈ Lα ’ and ≤llex
by the formula:
N accepts (x, α) ⇐⇒ ∀α′ ∈ I [ if ∀y ((y ∈ Lα & y ≤llex x) → y ∈ Lα′) & ∀y (y ∈ Lα′ → y ∈ Lα), then ∀y (y ∈ Lα ↔ y ∈ Lα′) ].
Note that the number of equivalence classes of ≡M,α is bounded by the number
of states of M , and for every x, y, if x ≡M,α y then x ∈ Lα ↔ y ∈ Lα . Therefore,
Lα is the union of finitely many equivalence classes of ≡M,α .
Let m and n be the number of states of M and N , respectively. Consider the
set of all finite tables U = {Ui,j : 1 ≤ i ≤ m, 1 ≤ j ≤ n} of size m × n such that
each Ui,j is either equal to a subset of {1, . . . , i} or to a special symbol Reject.
With each such table U we will associate an automaton A as described below.
The algorithm for learning {Lα }α∈I is now roughly as follows. On every step,
the learner M reads a finite part of the input text and based on this information
constructs a table U . After that M outputs the automaton associated with U .
First, we describe the construction of an automaton A for each table U . For
every α ∈ I, let m(α) and n(α) be the numbers of equivalence classes of ≡M,α
and ≡N,α , respectively. Also, let x1 , . . . , xm(α) be the length-lexicographically
least representatives of equivalence classes of ≡M,α such that
x1 <llex · · · <llex xm(α) .
Our goal is to construct A such that
A accepts α ⇐⇒ Um(α),n(α) is a subset of {1, . . . , m(α)}
such that Lα = {y : y ≡M,α xk for some k ∈ Um(α),n(α) }.
Let EqSt M (α, x, y, z) be the relation defined as
EqSt M (α, x, y, z) ⇐⇒ St M (⊗(x, α), |z|) = St M (⊗(y, α), |z|).
The relation EqSt N (α, x, y, z) is defined similarly. Note that these relations are
automatic.
Instead of constructing A explicitly, we will show that the language which
A needs to recognize is first-order definable from EqSt M (α, x, y, z), EqSt N (α, x,
y, z) and the relations recognized by M and N .
First, note that the equivalence relation x ≡M,α y can be defined by a formula:
∃z (|z| > max{|x|, |y|} and EqSt M (α, x, y, z)).
Similarly one can define x ≡_{N,α} y. The fact that ≡_{M,α} has exactly k many equivalence classes can be expressed by the formula:

ClNum_{M,k}(α) = ∃x₁ … ∃xₖ [ ⋀_{1≤i<j≤k} xᵢ ≢_{M,α} xⱼ  &  ∀y ⋁_{1≤i≤k} y ≡_{M,α} xᵢ ].
Again, ClNum_{N,k}(α) expresses the same fact for ≡_{N,α}. Finally, the fact that A accepts α can be expressed by the following first-order formula:

⋁_{(i,j) : U_{i,j} ≠ Reject} [ ClNum_{M,i}(α) & ClNum_{N,j}(α) & ∃x₁ … ∃xᵢ ( x₁ <llex ⋯ <llex xᵢ
    & ∀z (z ∈ Lα ↔ ⋁_{k ∈ U_{i,j}} z ≡_{M,α} xₖ)
    & ⋀_{1≤k≤i} ∀y (y <llex xₖ → y ≢_{M,α} xₖ) ) ].
We now describe the algorithm for learning the class {Lα }α∈I . We will use the
notation x ≡M,α,s y as an abbreviation of
Definition 14. 1) Let α ∈ {0, 1, …, k}^ω and β ∈ {1, …, k}^ω. The function f_{α,β} is defined as follows:

f_{α,β}(n) = { α(m), if m = min{x ≥ n : α(x) ≠ 0};  lim sup_{x→∞} β(x), if such m does not exist. }

Let L_{α,β} be the set of all nonempty finite prefixes of f_{α,β}, that is,
Remark 16. The last result can be strengthened in the following sense: for every k ≥ 1 there is an indexing {Lβ}β∈I of the class L = { {α₀α₁ ⋯ αₙ₋₁ : n ∈ ω} : α ∈ {1, 2}^ω } such that {Lβ}β∈I is FEx_{k+1}-learnable but not FEx_k-learnable. That is, the class can be kept fixed and only the indexing has to be adjusted.
4 Explanatory Learning
The main result of this section is that for every learnable class, there is an index-
ing such that the class with this indexing is explanatorily learnable. Furthermore,
one can observe that the learner, as above, on any text T for a language in the
class and an index α, first might output automata which reject α, then automata
which accept α and at the end again automata which reject α; so, in short, the
sequence is of the form “reject–accept–reject” (or a subsequence of this).
Theorem 17. If a class L = {Lα }α∈I satisfies Angluin’s tell-tale condition,
then there is an indexing for L such that L with this indexing is Ex-learnable.
The first-order definition for a tell-tale set is given in the beginning of the proof
of Theorem 13. All other relations in this definition are clearly automatic.
The definition for γ can be written as
∀σ ∈ 0* ⋀_{q ∈ Q_M} [ q ∈ γ(|σ|) ↔ ∃x ∈ Lα (|x| ≤ |σ| & St_M(⊗(x, α), |σ|) = q) ].
For every q ∈ QM , there are automata Aq and Bq that recognize the relations
Let T be the input text. If content(T) = H_{α,β,γ}, then there is a step s ≥ n such that Dₙ is contained in {x₁, …, xₛ}. Therefore, M will output only A from step s onward. If content(T) ≠ H_{α,β,γ}, then Dₙ ⊈ content(T) or content(T) ⊈ H_{α,β,γ}. In the first case, M will output Z on every step. In the second case, there is a step s and an xᵢ ∈ {x₁, …, xₛ} such that xᵢ ≠ # and xᵢ is not in Lα according to γ. Therefore, M will output Z from step s onward. This proves the correctness of the algorithm.
5 Blind Learning
Blind learning is distinguished from learning in that the learner itself does not
see the index; so the learner has to code up all the necessary information into
the automata which permit to decide whether the index is correct or incorrect.
In the case of behaviourally correct learning, this is done by coding more and
more finite information in a way that almost all automata recognize an incorrect
index and reject it (where the point from which on this is recognized depends
on the index). In the case of explanatory learning, this is impossible and hence
one has to simulate a traditional learner (for countable classes) and to code up
its conjecture into the automaton which then checks whether the index provided
is equivalent to the one to which the traditional learner has converged; hence
explanatorily learnable classes have to be countable.
Theorem 19. For every class L = {Lα }α∈I , the following are equivalent
1) L is BlindEx-learnable.
2) L is BlindFEx-learnable.
3) L is at most countable and satisfies Angluin’s tell-tale condition.
The following corollary summarizes the main results from the previous sections.
Corollary 20. For every automatic class L, the following are equivalent:
6 Partial Identification
Partial identification is, in the traditional setting of inductive inference, a learn-
ing criterion where the learner outputs on every text of an r.e. language infinitely
many (not necessarily distinct) hypotheses such that exactly one hypothesis oc-
curs infinitely often and that hypothesis is correct. There is a recursive learner
succeeding on all r.e. sets, hence this concept is omniscient in the traditional
setting. Also in our model, every automatic class is partially identifiable.
Theorem 21. Every class with every given automatic indexing is Part-learnable.
References
1. Angluin, D.: Inductive inference of formal languages from positive data. Informa-
tion and Control 45(2), 117–135 (1980)
2. Bárány, V., Kaiser, Ł., Rubin, S.: Cardinality and counting quantifiers on omega-
automatic structures. In: Proceedings of the 25th International Symposium on
Theoretical Aspects of Computer Science, STACS 2008, pp. 385–396 (2008)
3. Bārzdiņš, J.: Two theorems on the limiting synthesis of functions. Theory of Algo-
rithms and Programs 1, 82–88 (1974)
4. Blumensath, A., Grädel, E.: Automatic structures. In: 15th Annual IEEE Sympo-
sium on Logic in Computer Science, Santa Barbara, CA, pp. 51–62. IEEE Com-
puter Society Press, Los Alamitos (2000)
5. Blumensath, A., Grädel, E.: Finite presentations of infinite structures: automata
and interpretations. Theory of Computing Systems 37(6), 641–674 (2004)
6. Büchi, J.R.: Weak second-order arithmetic and finite automata. Zeitschrift
für Mathematische Logik und Grundlagen der Mathematik 6, 66–92 (1960)
7. Büchi, J.R.: On a decision method in restricted second order arithmetic.
In: Logic, Methodology and Philosophy of Science (Proceedings 1960 International
Congress), pp. 1–11. Stanford University Press, Stanford (1962)
8. Case, J.: The power of vacillation in language learning. SIAM Journal on Comput-
ing 28(6), 1941–1969 (1999) (electronic)
9. Gold, E.M.: Language identification in the limit. Information and Control 10,
447–474 (1967)
10. Jain, S., Luo, Q., Stephan, F.: Learnability of automatic classes. Technical Report
TRA1/09, School of Computing, National University of Singapore (2009)
11. Khoussainov, B., Nerode, A.: Automata theory and its applications. Birkhäuser
Boston, Inc., Boston (2001)
12. Khoussainov, B., Nerode, A.: Automatic presentations of structures. In: Leivant,
D. (ed.) LCC 1994. LNCS, vol. 960, pp. 367–392. Springer, Heidelberg (1995)
13. Osherson, D.N., Stob, M., Weinstein, S.: Systems that learn. An introduction to
learning theory for cognitive and computer scientists. Bradford Book—MIT Press,
Cambridge (1986)
14. Vardi, M.Y.: The Büchi complementation saga. In: Thomas, W., Weil, P. (eds.)
STACS 2007. LNCS, vol. 4393, pp. 12–22. Springer, Heidelberg (2007)
Iterative Learning from Texts and
Counterexamples Using Additional Information

S. Jain and E. Kinber
1 Introduction
In this paper, we study some variants of learning in the limit from positive data
and negative counterexamples to conjectures, with restricted access to input
data. The general framework for study of learning in the limit was introduced in
[Gol67]. In Gold’s original model, TxtEx, a learner is able to hold full input data
seen so far in its long-term memory. However, this assumption is apparently too
strong for modeling many learning and cognitive processes. Wiehagen in [Wie76]
(see also [LZ96]) suggested a model for learning in the limit where the long-term
memory of the learners is limited to what they can store in their conjectures.
These learners are called iterative learners. This learning model, while strongly
limiting long-term memory, still makes salient an important aspect of learnability
in the limit: its incremental character. Some variants of iterative learning proved
to be quite useful in the context of applied machine learning (for example, [LZ06]
applies the idea of iterative learning in the context of training Support Vector Machines).
⋆ Supported in part by NUS grant number R252-000-308-112.
The iterative learning model has been used for study of learnability from all
positive examples (the corresponding formal model being denoted as TxtIt) as
well as all positive and negative examples (denoted as InfIt, see [LZ92]). One
can argue that TxtIt may be too weak (a learner gets only positive data and can
memorize only a very limited amount of input), whereas InfIt may be too strong: it
is hard to conceive a realistic learning process, where the learner would be able to
get access to full negative data. For example, children learning languages, while
getting some negative data (in the form of corrections by parents or teachers),
never get the full set of negative data.
In [JK08], the model TxtEx was extended to allow negative counterexamples
to conjectures by a learner. This model is an example of active learning, where
a learner communicates with a teacher (formally, an oracle) making queries and
getting responses from the teacher. Active learning as a general framework for
study of learning processes was introduced by D. Angluin in [Ang88] and has
been widely utilized in various studies of theoretical and applied models of learn-
ability from examples since then. The model of iterative learning from full pos-
itive data and negative counterexamples, NCIt (NC here stands for “negative
counterexample”), defined in [JK07] actually combines two approaches: Gold’s
framework (as the learner incrementally gets access to full positive data) and
active learning (the learner, using subset queries, checks with the teacher if each
conjecture does not contain data in excess of the target languages and if the an-
swer is negative, the learner gets a negative counterexample showing an error).
In linguistic terms, non-grammatical sentences in conjectures are, thus, being
corrected. It should be noted that K. Popper [Pop68] regarded refutation of
overgeneralizing conjectures as a vital part of learning and discovery processes.
In this paper, we extend the NCIt model to incorporate some additional
features. Specifically, we consider the following two extensions of this model: in
addition to subset queries (for conjectures), the learner
a) can ask up to n feedback queries: whether the queried element belongs to
the input seen so far;
b) can store up to n input elements seen so far in its long-term memory (note
that when the long-term memory used by a learner is n-bounded, if the memory
is full then, in order to save a new input datum, the learner must sacrifice at
least one element currently stored in the memory);
In the context of iterative learning of languages from positive data, these two
types of “looking back” (in the context of feedback — using just one query per
conjecture) were defined in [LZ96] (an earlier variant of memory-bounded learn-
ing can be found in [OSW86], and the idea of feedback learning goes back to
[Wie76], where it was applied in the context of learning recursive functions in
the limit). Both these concepts were reformalized (the former named n-feedback
learning, and the latter named n-bounded memory learning) and thoroughly stud-
ied and discussed in [CJLZ99]. Motivation for these sorts of learnability models,
so far might not match the help that an NCIt-learner gets in the form of one
extra feedback membership query, or one extra long-term memory cell.
In Section 5, we study tradeoffs between different types of additional informa-
tion used by NCIt-learners (the main purpose of this study is to make salient
advantages of each type of additional information for the learners in question).
In particular, similarly to corresponding results in [CJLZ99], we show that one
memory cell used by an NCIt-learner can give more help than any n feedback
membership queries (even in presence of the maximal element and the number
of elements seen so far), see Theorem 13, and, conversely, one feedback mem-
bership query can give more help than n-bounded memory (plus the maximal
element and the number of elements seen so far), see Theorem 14. Interestingly,
the maximal element seen so far alone can give more help than any number of
feedback membership queries, see Theorem 17. Also, the number of elements
and the maximal element seen so far combined together can provide more help
than any bounded number of memory cells or feedback membership queries, see
Theorem 19. We also show how an extra memory cell can simulate maximal
element for NCIt-learners using n memory cells, see Proposition 15. We also
obtain some partial results for other possible tradeoffs.
2 Preliminaries
2.1 Notation
For any unexplained recursion theoretic notation we refer the reader to [Rog67].
The symbol N denotes the set of natural numbers, {0, 1, 2, 3, . . .}. Languages are
subsets of N. Symbols ∅, ⊆, ⊂, ⊇, and ⊃ respectively denote the empty set,
subset, proper subset, superset, and proper superset. The cardinality of a set
S is denoted by card(S). The maximum and minimum of a set are denoted by max(·), min(·), respectively, where max(∅) = 0 and min(∅) = ∞. ∀^∞ denotes 'for all but finitely many'.
We let Dₓ denote the finite set with canonical index x [Rog67]. We let ⟨·, ·⟩ stand for an arbitrary, computable, 1–1 mapping from N × N onto N which is increasing in both its arguments [Rog67]. The pairing function can be extended to n-tuples in a natural way (for example, by using ⟨x, y, z⟩ = ⟨x, ⟨y, z⟩⟩).
By Wi we denote the i-th r.e. language in some fixed acceptable program-
ming system. We also say that i is a grammar for Wi . E denotes the set of all
r.e. languages. L, with or without decorations, ranges over E. ℒ, with or without decorations, ranges over subsets of E. χL denotes the characteristic function of L, and L̄ = N − L, that is, the complement of L.
L is said to be an indexed family iff there exists an indexing L0 , L1 , . . . of all
and only the languages in L such that for some recursive function f , f (i, x) =
χLi (x).
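For example, taking Lᵢ to be the set of multiples of i + 1 yields an indexed family: the function below is recursive and computes χLᵢ(x) uniformly in i (an illustrative family, not one used in the paper):

def f(i, x):
    # Characteristic function of L_i = multiples of (i + 1), uniform in i.
    return 1 if x % (i + 1) == 0 else 0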
content(T ) denotes the set of natural numbers in the range of T . A text T is for
a language L iff content(T ) = L. Intuitively, T (i) denotes the element presented
to the learner at time i, and #’s represent pauses in the presentation of data. T [n]
denotes the initial sequence of T of length n, that is, T[n] = T(0)T(1)…T(n−1). SEQ = {T[n] : n ∈ N, T is a text}. The empty sequence is denoted by λ; σ, τ, α range over SEQ. στ denotes the concatenation of σ and τ.
An informant [Gol67] I is a mapping from N to (N×{0, 1})∪{#} such that for
no x ∈ N, both (x, 0) and (x, 1) are in the range of I. content(I) = set of pairs in
the range of I. We say that I is an informant for L iff content(I) = {(x, χL (x)) :
x ∈ N}. Intuitively, informants give both all positive and all negative data for
the language being learned. I[n] denotes the first n elements of the informant I.
An inductive inference machine (IIM) [Gol67] learning from texts is an algo-
rithmic device which computes a (possibly partial) mapping from SEQ into N.
One can similarly define learners from informants and other modes of input as
considered below. We use the term learner or learning machine as synonyms for
inductive inference machines. We let M range over IIMs. M (T [n]) (or M (I[n]))
is interpreted as the grammar (index for an accepting program) conjectured by
the IIM M on the initial sequence T[n] (or I[n]). We say that M converges on T to i (written: M(T)↓ = i) iff (∀^∞ n)[M(T[n]) = i]. Convergence on informants is similarly defined.
There are several criteria for an IIM to be successful on a language. In this
paper we will be mainly concerned with explanatory (abbreviated Ex) criteria
of learning.
One can similarly define learning criterion InfEx for learning from informants
instead of texts.
Next we consider iterative learning.
In this section we consider our models of learning from full positive data and
negative counterexamples as given by [JK08]. Intuitively, for learning with neg-
ative counterexamples, we may consider the learner being provided a text, one
element at a time, along with a negative counterexample to the latest conjec-
ture, if any. (One may view this negative counterexample as a response of the
teacher to the subset query when it is tested if the language generated by the
conjecture is a subset of the target language). One may model the list of negative
counterexamples as a second text for negative counterexamples being provided
to the learner. Thus the IIMs get as input two texts, one for positive data, and
other for negative counterexamples.
We say that M(T, T′) converges to a grammar i iff (∀^∞ n)[M(T[n], T′[n]) = i].
First, we define the model of learning from positive data and negative coun-
terexamples. NC in the definition below stands for negative counterexample.
Definition 5. [JK08]
(a) M NCEx-identifies a language L (written: L ∈ NCEx(M)) iff for all texts T for L, and for all T′ satisfying the condition:

(T′(n) ∈ Sₙ, if Sₙ ≠ ∅) and (T′(n) = #, if Sₙ = ∅),

where Sₙ = L̄ ∩ W_{M(T[n], T′[n])}
For ease of notation, we sometimes define M(T[n], T′[n]) also as M(T[n]), where we separately describe how the counterexamples T′(n) are presented to the conjecture of M on input T[n].
One can similarly define NCIt-learning, where the learner’s output depends
only on the previous conjecture, the latest positive data, and the counterexample
provided.
Definition 6. [JK07] (a) M is iterative (for learning from positive data and negative counterexamples) iff there exists a partial recursive function F such that, for all T, T′ and n, M(T[n+1], T′[n+1]) = F(M(T[n], T′[n]), T(n), T′(n)). Here M(λ, λ) is some predefined constant.
(b) M NCIt-identifies ℒ iff M is iterative and M NCEx-identifies ℒ.
(c) NCIt = {ℒ : (∃M)[M NCIt-identifies ℒ]}.
We will often identify F above with M (that is, use M(p, x, y) = F(p, x, y) to describe M(T[n+1], T′[n+1]), where p = M(T[n], T′[n]), x = T(n) and y = T′(n)). This is for ease of notation.
One should also note that the NCIt model is equivalent to allowing finitely
many subset queries (with counterexamples for the answer “no”) in iterative
learning.
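Operationally, an NCIt-learner is a fold over the text: each step sees only the previous conjecture, the next text element, and the teacher's counterexample. A minimal Python sketch, where F, the initial conjecture, and the teacher's counterexample_for oracle are all supplied by the caller (the names are illustrative):

def run_ncit(F, initial, text, counterexample_for):
    # p plays the role of M(T[n], T'[n]); y = T'(n) is # exactly when the
    # subset query for the current conjecture is answered "yes".
    p = initial                      # M(lambda, lambda), a predefined constant
    for x in text:
        y = counterexample_for(p)    # negative counterexample, or #
        p = F(p, x, y)               # iterative update as in Definition 6
        yield p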
One can extend the above definition to NCIt-learning with m-feedback or m-
memory, by allowing the learner M up to m queries about whether some element
x has appeared in the previous text or allowing the learner M to remember
up to m elements of the past data. The resulting criteria are called NCIt-
learning with m-feedback and NCIt-learning with m-memory, respectively. The
resulting learners are called m-feedback NCIt-learner (or NCIt-learners using
m-feedback) and m-memory bounded NCIt-learner (or NCIt-learner using m-
memory) respectively.
It follows from the definition that NCIt-learning is contained in NCIt-
learning using m-feedback and NCIt-learning using m-memory, which, in turn,
are contained in NCEx.
Initially, M(λ) = (0, ∅, ∅). The learner, on the previous conjecture p = (jₙ, Sₙ, Xₙ), the new element xₙ, the counterexample yₙ, and the maximal element m seen so far, does the following:
It is easy to verify that the invariants are satisfied. Furthermore, jₙ never goes beyond the minimal grammar for L (see invariant (C)). Thus the sequence of jₙ converges, and Sₙ and Xₙ converge as well (as Xₙ₊₁ ≠ Xₙ implies jₙ₊₁ ≠ jₙ, and Sₙ ⊆ {j : j ≤ jₙ}, and using invariants (D) and (E)). Moreover, the last conjecture is correct by (A) and (B), and using (Xₙ ∪ {xₙ} ∪ ⋃_{j∈Sₙ} Lⱼ[m]) − {#} ⊆ L_{jₙ} from clause (ii) (as there is no further mind change).
(b) The only change is in (ii) above, which is replaced by the following (m below denotes the number of elements seen so far by the learner):
(ii) If the first m elements in (Xₙ ∪ {xₙ} ∪ ⋃_{j∈Sₙ} Lⱼ) − {#} are included in L_{jₙ}, and y = #, then jₙ₊₁ = jₙ and Xₙ₊₁ = Xₙ. Otherwise, jₙ₊₁ = jₙ + 1 and Xₙ₊₁ = Xₙ ∪ {xₙ} − {#}.
The rest of the proof is similar to that of part (a), and we omit the details.
Still, any n feedback queries might not help to achieve class-preserving learn-
ability of indexed classes by NCIt-learners.
wₛ < p ≤ 2wₛ such that c + p is a prime, but c′ + p is not a prime. Let mₛ > xₛ be such that ⟨e, 0, mₛ⟩ > max(content(σₛ) ∪ {⟨e, j, xₛ⟩ : 1 ≤ j ≤ wₛ}). Let τₛ = σₛ ⟨e, 0, mₛ⟩, and enumerate ⟨e, 0, mₛ⟩ into Wₑ. Dovetail between the following two searches:
(a) search for an initial segment σ of τs such that fs(M(σ)) = #, but W_{M(σ)} − content(τs) ≠ ∅;
(b) search for a τ such that content(τ) − content(τs) ⊆ {⟨e, j, xs⟩ : 1 ≤ j ≤ ws}, M(τ) ≠ M(τs), and either (i) card(content(τ) − content(τs)) ≤ n + 1 or (ii) card(content(τ) − content(τs)) = n + 2 and Σ_{⟨e,j,x⟩ ∈ content(τ)−content(τs)} j is a prime number.
Here we assume that the search in (a) has priority, in the sense that if one can find such a σ within s steps, then (a) succeeds first with the shortest such σ. In case (a) succeeds first, we let σs+1 = τs, W_{e,s+1} = content(τs), and fs+1(M(σ)) = the element found in W_{M(σ)} − content(σs) (the rest of fs+1 is the same as fs). In case (b) succeeds first, we let σs+1 = τ, W_{e,s+1} = content(σs+1) and let fs+1 = fs.
Now one can show that if there are infinitely many stages, then M does not converge on ⋃_s σs. On the other hand, if there are only finitely many stages, then one can show that, for some appropriate distinct S, S' ⊆ {2i : 1 ≤ i ≤ ws}, and corresponding p, αS, αS' with content(αX) = {⟨e, j, xs⟩ : j ∈ X} (for X = S or S'), one has that M(τs · αS · ⟨e, p, xs⟩∞) = M(τs · αS' · ⟨e, p, xs⟩∞), though τs · αS · ⟨e, p, xs⟩∞ and τs · αS' · ⟨e, p, xs⟩∞ are texts for different languages in L. We omit the details.
Proposition 15. Any n-bounded memory learner with the maximal element in the input as additional information can be simulated by an (n + 1)-bounded memory learner, by using the extra memory for the maximal element seen, as long as the memory of the learner is treated as a multiset rather than just a set.
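A small Python illustration of this simulation; the interface base_update is a hypothetical stand-in for an n-bounded memory learner that expects the maximal element as extra input, and memory is modelled as a list so that it behaves as a multiset.

```python
def lift_to_extra_memory(base_update):
    """Wrap base_update(conj, memory, x, max_seen) -> (conj', memory')
    into an (n+1)-bounded memory learner: the last memory slot always
    holds the running maximum (initialize the lifted memory as [0])."""
    def update(conj, memory_plus, x):
        memory, max_seen = memory_plus[:-1], memory_plus[-1]
        if x != '#':
            max_seen = max(max_seen, x)   # maintain the maximal element
        conj, memory = base_update(conj, memory, x, max_seen)
        return conj, memory + [max_seen]  # multiset: duplicates allowed
    return update
```

The multiset proviso matters because the stored maximum may coincide with an element the base learner also wants to remember.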
Our next result shows that adding access to the maximal element increases
learning capability of NCIt-learners storing up to n input elements seen so far.
Moreover, a learner witnessing the positive side of the result does not need access
to negative counterexamples refuting conjectures containing data in excess of the
language to be learned.
Our next two results demonstrate that an NCIt-learner having access to just the maximal element or the number of elements seen so far can sometimes do more than any NCIt-learner using up to n feedback queries. First, as the next theorem demonstrates, NCIt-learners (or even iterative learners — not using negative counterexamples to conjectures) using the maximal element as additional information can sometimes learn more than NCIt-learners using n feedback queries and getting the number of elements as additional information. However, we were not able to achieve a result of similar strength when pitting the number of elements seen so far against n feedback queries and the maximal element as additional information. Whether this is possible remains open.
Theorem 17. There exists a class L which can be iteratively learnt when the
learner is provided the maximal element in the input so far, but the class L
cannot be NCIt-learnt using n-feedback, for any n, even if the learner is given
the number of elements in the input as additional information.
Theorem 18. There exists a class L which can be NCIt-learnt using the number
of elements in the input as additional information, but, for all n, L cannot be
NCIt-learnt using n-feedback.
Note that, obviously, the maximal element can always be memorized by a learner
and, thus, cannot add more to the learning power of iterative learners than
even one memory cell for storing input elements. Therefore, we explore if the
number of elements seen so far can give an NCIt-learner more advantages than
n memorized input elements seen so far. We were able to achieve only a partial
solution — showing that the number of elements and the maximal element (or
one memory cell) together can provide more power to NCIt-learners than n
memorized input elements.
Theorem 19. There exists a class L such that L can be NCIt-learnt using 1-memory (or the maximal element) and the number of elements, but cannot be NCIt-learnt by any learner using n feedback queries or n-bounded memory, even if it is given the maximal element.
Can the maximal element give more power to NCIt-learners than the number of
elements seen so far? The answer to this question is positive — even if the learners
using the maximal element are just iterative (not using negative counterexamples
to conjectures): it immediately follows from Theorem 17. However, we do not
Iterative Learning from Texts and Counterexamples 321
know if the number of elements can give more in the context of NCIt-learnability
than the maximal element. We have some partial solution to the above problem,
when one considers iterative learners rather than NCIt-learners.
Theorem 20. (a) Suppose L can be NCIt-identified using the number of ele-
ments, where the learner converges on all inputs (here the text input would be
from the target class, but the number of elements may sometimes not be valid —
we still expect the learner to converge). Then, L can be NCIt-identified using
access to the maximal element.
(b) There exists a class L such that
(i) L can be iteratively learnt when given the number of elements in the input
seen so far as additional information (such a learner, however, may not be total).
(ii) For all n, L cannot be iteratively learnt by an n-feedback learner even if
it gets the maximal element as additional information.
(iii) For all n, L cannot be iteratively learnt by an n-memory bounded learner.
6 Conclusions
As we have shown, additional information of the types studied in this paper can
add interesting new capabilities to iterative learners getting negative examples
to conjectures containing data in excess of the target language. Some problems
related to comparisons of help provided by additional information remain open
(they are mentioned in Section 5), and solving these problems can offer new
(and, possibly, unexpected) insight into advantages of using additional informa-
tion of certain types for the learners in question. Similarly to [JK07], one might
also consider different types of negative examples (refuting conjectures contain-
ing extra elements) by iterative learners and explore how these different types
of negative examples may interplay with different types of additional informa-
tion. Yet another interesting area of research is studying iterative learnability
with counterexamples and additional information of specific indexed classes of
languages (for example, regular languages or patterns) — as we have shown, all such classes are learnable class-preservingly using the maximal element or the number of elements as additional information, and, therefore, one can now study if and when learnability of such classes may be efficient.
A general open problem for iterative learners of any type using additional
(bounded) memory is whether a multiset type memory (when a learner can
store the same inputted item several times; for example, the learner may decide
to store, say, 10 copies of the next input element) can have an advantage over a
set type memory (where every item is stored just once). We have not been able
to find an answer to this very interesting problem.
References
[Ang80] Angluin, D.: Finding patterns common to a set of strings. Journal of Com-
puter and System Sciences 21, 46–62 (1980)
[Ang88] Angluin, D.: Queries and concept learning. Machine Learning 2, 319–342
(1988)
[BA96] Brachman, R., Anand, T.: The process of knowledge discovery in
databases: A human centered approach. In: Fayyad, U.M., Piatetsky-
Shapiro, G., Smyth, P., Uthurusam, R. (eds.) Advances in Knowledge
Discovery and Data Mining, pp. 37–58. AAAI Press, Menlo Park (1996)
[CJLZ99] Case, J., Jain, S., Lange, S., Zeugmann, T.: Incremental concept learning
for bounded data mining. Information and Computation 152(1), 74–110
(1999)
[CL82] Case, J., Lynes, C.: Machine inductive inference and language identifica-
tion. In: Nielsen, M., Schmidt, E.M. (eds.) ICALP 1982. LNCS, vol. 140,
pp. 107–115. Springer, Heidelberg (1982)
[CM08] Case, J., Moelius, S.: U-shaped, iterative, and iterative-with-counter learn-
ing. Machine Learning 72, 63–88 (2008)
[FPSS96] Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P.: From data mining to
knowledge discovery. In: Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P.,
Uthurusam, R. (eds.) Advances in Knowledge Discovery and Data Mining,
pp. 1–34. AAAI Press, Menlo Park (1996)
[Gol67] Gold, E.M.: Language identification in the limit. Information and Con-
trol 10, 447–474 (1967)
[JK07] Jain, S., Kinber, E.: Iterative learning from positive data and nega-
tive counterexamples. Information and Computation 205(12), 1777–1805
(2007)
[JK08] Jain, S., Kinber, E.: Learning languages from positive data and negative
counterexamples. Journal of Computer and System Sciences 74(4), 431–
456 (2008); Special Issue: Carl Smith memorial issue
[LZ92] Lange, S., Zeugmann, T.: Types of monotonic language learning and their
characterization. In: Proceedings of the Fifth Annual Workshop on Com-
putational Learning Theory, pp. 377–390. ACM Press, New York (1992)
[LZ96] Lange, S., Zeugmann, T.: Incremental learning from positive data. Journal
of Computer and System Sciences 53, 88–103 (1996)
[LZ06] Li, Y., Zhang, W.: Simplify support vector machines by iterative learning. Neural Information Processing – Letters and Reviews 10, 11–17 (2006)
[LZZ08] Lange, S., Zeugmann, T., Zilles, S.: Learning indexed families of recursive
languages from positive data: A survey. Theoretical Computer Science 397,
194–232 (2008)
[OSW86] Osherson, D., Stob, M., Weinstein, S.: Systems that Learn: An Intro-
duction to Learning Theory for Cognitive and Computer Scientists. MIT
Press, Cambridge (1986)
[Pop68] Popper, K.: The Logic of Scientific Discovery, 2nd edn. Harper Torch
Books, New York (1968)
[Rog67] Rogers, H.: Theory of Recursive Functions and Effective Computability.
McGraw-Hill, New York (1967); Reprinted by MIT Press (1987)
[Wie76] Wiehagen, R.: Limes-Erkennung rekursiver Funktionen durch spezielle
Strategien. Journal of Information Processing and Cybernetics (EIK) 12,
93–99 (1976)
Incremental Learning with Ordinal Bounded
Example Memory
Lorenzo Carlucci
1 Introduction
In many learning contexts a learner is confronted with the task of inductively
forming hypotheses while being presented with an incoming stream of data. In
such contexts, the learning process can be said to be successful if, eventually, the
hypotheses that the learner forms provide a correct description of the observed
stream of data. Each single step of the learning process in this scenario involves
an observed data item and the formation of a new hypothesis.
It is very reasonable to assume that a real-world learner - be it artificial or
human - has memory limitations. A learner with memory limitations is a learner
that is unable to store complete information about the previous stages of
the learning process. Each stage of the learning process is completely described
by the flow of data seen so far, and the sequence of the learner’s hypotheses so
far. The action of a learner with memory limitations, at each step of the learning
process, is completely determined by a limited portion of the previous stages of
the learning process. Let us call intensional memory the learner’s memory of its
own previously issued hypotheses. Let us call extensional memory the learner’s
memory of previously observed data items.
In the context of Gold’s formal theory of language learning [6], models with
restrictions on intensional and on extensional memories have been studied. In [9]
the paradigm of Bounded Example Memory is introduced. A bounded example
memory learner is a learner whose intensional memory is, at each step of the
learning process, limited to remembering its own previous hypothesis, and whose
extensional memory is limited to storage of a finite number of previously observed
data items. At each step of the learning process, such a learner must decide, based
on (i) knowledge of its own previous hypothesis, on (ii) the content of its current
memory, and on (iii) the currently observed data item, whether to change its
hypothesis and whether to store in memory the currently observed data item. For
each number k one can similarly define a k-bounded example memory learner as
a bounded example memory learner whose memory can never exceed size k. For
k = 0 one obtains the paradigm of iterative learning [12], in which the learner
has no extensional memory and can only remember its own previous conjecture.
One of the main results of [9] is the following. For every k, there is a class
of languages that can be learned by a bounded example memory learner with
memory k + 1 but not by any bounded example memory learner with memory
k. [3] and the recent [7] present further results on this and related models.
In this paper we present some results on a new extension of the Bounded
Example Memory paradigm. Following a suggestion in [3], we investigate a
paradigm in which the learner is allowed to change its mind on how many data
items to store in memory as a function of some constructive ordinal α. Ordinals
are canonical representatives of well-orderings. A constructive ordinal can be
defined as the order-type of a computable well-ordering of the natural numbers.
Equivalently, constructive ordinals are those ordinals that have a program (a no-
tation) that specifies how to build them from below using standard operations
such as successor and constructive limit. Every constructive ordinal is countable
and notations for constructive ordinals are algorithmic finite objects. For each
initial segment of the constructive ordinals a univalent system of notations can
be defined. On the other hand, a universal (not univalent) system of notation
containing at least one notation for every constructive ordinal has been defined
by Kleene. For more details, see, e.g., [10]. For the sake of this paper, ordinals
can be treated in an informal way: we blur the distinction between a constructive
ordinal and a notation for it. The treatment can be made rigorous without effort
and without altering our results. Count-down from ordinal notations has been
applied in a number of ways in algorithmic learning theory, starting with [4],
where ordinal notations are used to bound the number of mind-changes that a
learning machine is allowed to make on its way to convergence. A different use
of ordinal notations is in the recent [2].
For every (notation for a) constructive ordinal α, the paradigm of α-bounded
example memory is defined. Intuitively, a learner with example memory bounded
by α must (algorithmically) count-down from (a notation for) α each time a
proper global memory extension occurs during the learning process (i.e., each
time the size of the memory set becomes strictly larger than the size of all the
previous memory sets). We show that this paradigm is strictly stronger than
k-bounded example memory but strictly weaker than finitely-bounded example
memory (with no form of a priori bound on memory size). We also show that the
concept of ordinal bounded example memory gives rise to a hierarchy; we exhibit such a hierarchy up through ordinal ω².
2 Preliminaries
Unexplained notation follows Rogers [10]. N denotes the set of natural numbers
{0, 1, 2, . . . }. N+ denotes the set of positive natural numbers. The set of finite
subsets of N is denoted by F in(N). We use the following set-theoretic notations:
∅ (empty set), ⊆ (subset), ⊂ (proper subset), ⊇ (superset), ⊃ (proper superset).
If X and Y are sets, then X ∪ Y , X ∩ Y , and X − Y denote the union, the
intersection, and the difference of X and Y, respectively. We use Z = X ∪̇ Y to abbreviate (Z = X ∪ Y ∧ X ∩ Y = ∅). The cardinality of a set X is denoted by card(X). By card(X) ≤ ∗ we indicate that the cardinality of X is finite. We let λx, y.⟨x, y⟩ stand for a standard pairing function. We extend the notation to
pairing of n-tuples of numbers in the straightforward way. We denote by π_i^n (i ≤ n) the projection function of an n-tuple to its i-th component. We omit the
superscript when clear from context.
We use α, β to range over constructive ordinals. We blur the distinction be-
tween ordinals and their notations. We use O to denote the set of constructive or-
dinals. This symbol traditionally denotes Kleene’s universal system of notations
for constructive ordinals. This system would be used in a completely rigorous
presentation of our results.
We fix an acceptable programming system ϕ0 , ϕ1 , . . . for the partial com-
putable functions of type N → N. We denote by Wi the domain of the i-th
partial computable function ϕi . We could equivalently (modulo isomorphism of
numberings) define Wi as the set generated by grammar i. A language is a subset
of N. We are only interested in recursively enumerable languages, whose collec-
tion we denote by E. The symbol L ranges over elements in E. L ranges over
subsets of E, called language classes. Let λx, y.pad(x, y) be an injective padding
function (i.e., Wpad(x,y) = Wx ).
A sequence is a mapping from an initial segment of N+ into N# , where #
is a reserved symbol which we call pause symbol. We use N# to abbreviate
N ∪ {#}. The symbols σ, τ range over sequences. content(σ) denotes the range
of σ minus the # symbol. |σ| denotes the length of σ. We use ⊆, ⊂ for sequence
containment and proper containment respectively. A text is a mapping from N+
into N# . The symbol t ranges over texts. If t = (xi )i∈N+ is a text, t[n] denotes
the initial segment of t of length n, i.e., the sequence (x1 x2 . . . xn ). We use · for
concatenation. If the range of t minus the # symbol is equal to L, then we say
that t is a text for L.
A language learning machine is a partial computable function mapping finite
sequences to natural numbers.
We now define the basic paradigm of explanatory identification from text [5].
Definition 1 (Gold, [5]). Let M be a language learning machine, let L be a
language, let L be a language class.
(1) M TxtEx-identifies L if and only if, for every text t = (xi )i∈N+ for L, there
exists n ∈ N+ such that W_{M(t[n])} = L and, for all n' ≥ n, M(t[n']) = M(t[n]).
(2) M TxtEx-identifies L if and only if M TxtEx-identifies L for all L ∈ L.
(3) TxtEx(M) = {L : M TxtEx-identifies L}.
(4) TxtEx = {L : (∃M)[L ⊆ TxtEx(M)]}.
We define the paradigm of iterative learner. This is the basic paradigm of incre-
mental learning upon which the paradigms of bounded example memory learning
are built.
(1) (M, j0 ) TxtIt-identifies L if and only if, for each text t = (xi )i∈N+ for L,
the following hold.
(i) For each n ∈ N, Mn (t) is defined, where M0 (t) = j0 and Mn+1 (t) =
M (Mn (t), xn+1 ) = jn+1 .
(ii) (∃n ∈ N)[W_{j_n} = L ∧ (∀n' ≥ n)[j_{n'} = j_n]].
(2) For M , j0 as above, TxtIt(M, j0 ) = {L : (M, j0 ) TxtIt-identifies L}.
(3) (M, j0) TxtIt-identifies a class L if and only if L ⊆ TxtIt(M, j0).
(4) TxtIt = {L : (∃M, j0 )[L ⊆ TxtIt(M, j0 )]}.
The paradigm of Bounded Example Memory was introduced in [9] and further
investigated in [3] and in the recent [7]. A bounded example memory learner is
an iterative learner that is allowed to store at most k data items chosen from
the input text.
(2) We say that (M, j0) Bemk-identifies a class L if and only if (M, j0) Bemk-identifies L for every L ∈ L.
A machine of the appropriate type that satisfies points (i)-(iii) above is referred
to as a Bemk -learner. By [8], Bem∗ is known to coincide with set-driven learning
[11]. With a slight abuse of notation we sometimes use Bem0 to denote TxtIt.
We now introduce an extension of the Bounded Example Memory model.
Definition 4. Let α be a fixed constructive ordinal (notation). Let M : (N ×
F in(N) × O) × N# → N × F in(N) × O be a partial computable function. Let
j0 ∈ N, let L be a language.
(1) We say that (M, j0) OBemα-identifies L if and only if, for every text t = (xj)j∈N+ for L, points (i) to (v) below hold.
(i) for all n ∈ N, Mn (t) is defined, where M0 (t) = (j0 , S0 , α0 ), S0 = ∅,
α0 ≤ α, Mn+1 (t) = M (Mn (t), xn+1 ) = (jn+1 , Sn+1 , αn+1 ).
(ii) Sn+1 ⊆ Sn ∪ {xn+1 }.
(iii) αn ≥ αn+1 .
(iv) αn > αn+1 if and only if card(Sn+1 ) > max({card(Si ) : i ≤ n}).
(v) (∃n)(∀n' ≥ n)[j_{n'} = j_n ∧ W_{j_n} = L].
(2) We say that (M, j0) OBemα-identifies L if and only if (M, j0) OBemα-identifies L for every L ∈ L.
A machine of the appropriate type that satisfies points (i)-(iv) above is referred
to as a OBemα -learner. OBemα -learning is a species of incremental learning:
each new hypothesis depends only on the previous hypothesis, the current mem-
ory, and the current data item. The above Definition can be simplified in case
the following is true. Call cumulative a bounded example memory learner that
never erases an element from memory without replacing it with a new one. If
cumulative learning does not restrict learning power, then in point (iv) it is
sufficient to ask that card(Sn+1 ) > card(Sn ).
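The bookkeeping of Definition 4 can be illustrated by the following Python sketch. The encoding of ordinals below ω² as pairs (a, b) standing for ω·a + b, the memory policy, and the finite budget chosen at limit stages are all illustrative assumptions, not part of the definition; the conjecture component is omitted.

```python
def decrease(counter, finite_budget=10):
    """Return some ordinal strictly below counter = (a, b) ~ omega*a + b;
    below a limit point omega*a the learner must commit to a finite
    budget of its own choosing (here: an arbitrary constant)."""
    a, b = counter
    if b > 0:
        return (a, b - 1)
    assert a > 0, "a learner must stop extending before the counter hits 0"
    return (a - 1, finite_budget)

def obem_step(state, x):
    """state = (memory, counter, max_card). Condition (iv): the counter
    drops exactly when the memory outgrows all previous memory sets."""
    memory, counter, max_card = state
    if x != '#':
        memory = memory | {x}          # placeholder policy: keep everything
    if len(memory) > max_card:         # a proper global memory extension
        counter, max_card = decrease(counter), len(memory)
    return (memory, counter, max_card)
```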
For I ∈ {Bemk , Bem∗ , OBemα }, M of the appropriate type and j0 ∈ N
– We write I(M, j0 ) for {L : (M, j0 ) I-identifies L}, and
– We write I for {L : (∃M, j0 )[L ⊆ I(M, j0 )]}.
We write M (t) to indicate the conjecture to which M converges while processing
text t. We always assume that such a conjecture exists when we use this notation.
We state some basic facts in the following Lemma.
Lemma 1. For all k ∈ N+ , for all constructive ordinals α, β, the following hold.
(1) OBemk = Bemk .
(2) If α < β, then OBemα ⊆ OBemβ .
(3) OBemα ⊆ Bem∗ .
Proof. The proof is omitted for brevity. Note that to go from a Bemk -learner
to an OBemk -learner, one just needs to keep track of the maximum cardinality
of a memory set, a quantity which eventually stabilizes and can thus be padded
in the conjecture as long as needed. To go from an OBemk -learner to a Bemk -
learner, one dually pads the ordinal counter in the next conjecture. This also is
a quantity that eventually stabilizes on all relevant texts.
As a word of caution, note that a rigorous version of point (2) would read: for all notations a and b, for α and β respectively, if a <O b then OBema ⊆ OBemb. Similar relations to those expressed in the above Lemma also hold for
the model of temporary bounded example memory as defined in [7] when the
definition is extended to ordinals in the straightforward way.
We state a basic locking sequence lemma for OBemα. Let (M, j0) be an OBemα-learner. σ is a locking sequence of the first type for (M, j0) on L if and only if (1) content(σ) ⊆ L, (2) for every extension σ' ⊃ σ in L, π_1(M_{|σ|}(σ)) = π_1(M_{|σ'|}(σ')), and (3) W_{M_{|σ|}(σ)} = L. σ is a locking sequence of the second type for (M, j0) on L if and only if (a) σ is a locking sequence of the first type for (M, j0) on L, and (b) for every extension σ' ⊃ σ in L, π_3(M_{|σ|}(σ)) = π_3(M_{|σ'|}(σ')).
Lemma 2 (Locking Sequence Lemma). If L ∈ OBemα (M, j0 ), then there
exists a locking sequence of the second type for M on L.
Proof. Straightforward using the standard argument [1, 6], point (iii) in Defini-
tion 4, and the well-orderedness of the ordinals.
For the sake of the present Section, we identify Ck with the class obtained from it by replacing each element ⟨k, p_1^s⟩ by p_k^s, and each ⟨k, p_0^t⟩ by p_0^t, for ease of notation. Let Cω = ⋃_{k∈N+} Ck.
Theorem 2. Cω ∈ (OBemω − ⋃_{k∈N+} OBemk).
Proof. Let j, k ∈ N+. For a set X ⊆ {p_k}^+ such that card(X) ≤ k, we write C_{(j,X)} for the set {p_k^1, . . . , p_k^j} ∪ {p_0^j} ∪ X.
We first show Cω ∈ OBemω. Let X be a finite subset of Ck of cardinality ≤ k. Let s ∈ N+. We define the set update(X, k, p_k^s) as the set containing the (at most) k elements of X ∪ {p_k^s} with largest exponents. Formally, we define update(X, k, p_k^s) = X ∪ {p_k^s} if card(X) < k, and otherwise update(X, k, p_k^s) = {p_k^{z_2}, . . . , p_k^{z_{k+1}}} if X = {p_k^{x_1}, . . . , p_k^{x_k}}, where x_1 < · · · < x_k and {x_1, . . . , x_k} ∪ {s} = {z_1, . . . , z_{k+1}}, where z_1 < · · · < z_{k+1}. For technical convenience we define update(X, k, a) = X for all a ∉ {p_k}^+.
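In code, update amounts to keeping the k largest exponents; the sketch below represents p_k^x simply by the exponent x and silently ignores the convention for arguments outside {p_k}^+.

```python
def update(exponents, k, s):
    """exponents: the (at most k) stored exponents; returns the at most k
    largest elements of exponents ∪ {s}, exactly as in the definition."""
    return set(sorted(exponents | {s})[-k:])

# e.g. update({1, 4, 7}, 3, 9) == {4, 7, 9}: the smallest exponent is dropped
```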
We now define a learner M and a j0 ∈ N such that (M, j0 ) OBemω -identifies
Cω . M ’s conjectures have the form jn = pad(cn , An , Bn ), where An , Bn ∈ N,
and
– An records the exponent of the first p_0^j seen (j ≠ 0),
– Bn records the subscript of the first p_k^a seen (k, a ≠ 0),
For every text t = (xi )i∈N+ , we define
M0 (t) = (j0 , S0 , α0 ),
With a very minor change the above proof shows that Cω is learnable by an
OBemω -learner with temporary memory as defined in [7].
We now observe that ordinal bounded example memory learning does not exhaust Bem∗, i.e., set-driven learning.
Proof. Consider the following class from [9]. For each j ∈ N, let Lj = {2}^+ − {2^{j+1}}, and let L− = {Lj : j ∈ N}. This class is obviously in Bem∗ but it is shown in [9] not to be in ⋃_k Bemk.
To show that L− ∉ OBemα, we can then argue exactly as in the proof of Theorem 5, Claim 2 in [9]. Suppose otherwise, as witnessed by (M, j0). Let σ be a locking sequence of the second type for M on L0. Let M_{|σ|}(σ) = (j_{|σ|+1}, S_{|σ|+1}, α_{|σ|+1}). Let β = α_{|σ|+1} ≤ α; by the choice of σ, M's ordinal counter is equal to β for all extensions of σ in L0. Then, on all extensions of σ in L0, M does not make any proper global memory extension. Thus, M's memory on all such extensions is bounded by b = max({card(Si) : i ≤ |σ|}). We omit further details for brevity.
R0 = Ss , S̃0 = ∅, and Rn+1 , S̃n+1 are defined according to the following case
distinction.
We set S̃n+1 = (S̃n ∩ S ) ∪ {xn+1 } in (Case i) and (Case ii), and we set S̃n+1 =
(S̃n ∩ S ) in (Case iii) and (Case iv). We set Rn+1 = (S − S̃n ). One can always
recover the memory content Ss+(n+1) as Rn+1 ∪ S̃n+1 . The S̃n ’s satisfy the
conditions on memory, while the Rn+1 ’s eventually stabilize. The ordinal-counter
of M̃ is initialized at α̃0 = αs + b. Each time a proper global extension of
the memory S̃n of M̃ occurs, the ordinal-counter is updated as follows. If the
extension corresponds to an extension of M ’s memory before σ (this can happen
at most b times), then the second component is decreased by 1. If the extension
corresponds to an extension of M ’s memory beyond σ, then the first component
is decreased, emulating the corresponding ordinal-counter of M (which is padded
in the previous conjecture).
Lemma 3 above lends itself to a number of variations. E.g., one can conclude from the same hypotheses that there exist M̂ and j0 ∈ N witnessing that Cσ = {L − content(σ) : L ∈ C} is in OBemβ. This can be seen as follows: for every text t for L − content(σ), σ · t is a text for L, and for all L ∈ Cσ, L ∩ content(σ) = ∅. Thus, no element of σ is ever transferred to bounded example memory in the process defining M̃ in the above proof. Therefore no such element contributes a proper memory extension. M̂ can be defined similarly to M̃ with the following extras: M's conjecture is always padded in the hypothesis, and f(i) is output instead of i, where f is an injective computable function such that for all x, W_{f(x)} = (W_x − content(σ)) (f exists by the S-m-n Theorem [10]).
Also, if β in Lemma 3 is < ω, then there exists an s ∈ N such that
max({card(π2 (Mi (σ · t))) : i ∈ N}) ≤ s, and the conclusion of the Lemma
is that there exists an OBems -learner for C.
Thus, Cω+k consists of the following languages, for every choice of i, j, h, ℓ_1, . . . , ℓ_i, m_1, . . . , m_k in N+:
– C_i^[ω] = {⟨ω, i, p_1⟩, ⟨ω, i, p_1^2⟩, ⟨ω, i, p_1^3⟩, . . . },
– C_{(j,m_1,...,m_i)}^[ω] = {⟨ω, i, p_1⟩, . . . , ⟨ω, i, p_1^j⟩} ∪ {⟨ω, i, p_0^j⟩} ∪ {⟨ω, i, p_1^{m_1}⟩, . . . , ⟨ω, i, p_1^{m_i}⟩},
– C_k = {⟨k, p_1⟩, ⟨k, p_1^2⟩, ⟨k, p_1^3⟩, . . . },
– C_{(j,m_1,...,m_k)} = {⟨k, p_1⟩, . . . , ⟨k, p_1^j⟩} ∪ {⟨k, p_0^j⟩} ∪ {⟨k, p_1^{m_1}⟩, . . . , ⟨k, p_1^{m_k}⟩},
– C_i^[ω] ∪ C_k and C_{(h,ℓ_1,...,ℓ_i)}^[ω] ∪ C_{(j,m_1,...,m_k)},
– C_i^[ω] ∪ C_{(j,m_1,...,m_k)} and C_{(h,ℓ_1,...,ℓ_i)}^[ω] ∪ C_k.
Cω+(k+1) as just defined contains more languages than strictly needed to show (OBemω+(k+1) − OBemω+k) ≠ ∅, yet we have chosen to present this definition for uniformity with extensions to higher ordinals. Let us consider the following subclass of Cω+1. Let d ∈ N+.
For each s ∈ N+, let us denote the latter class by C_s^[ω] ⊕ {⟨1, p_1^1⟩, . . . , ⟨1, p_1^d⟩}. In fact, this class is the same as the following class:
{C_1^[ω]} ∪ {L ∪ C_{(d,d)} : L ∈ C_{s+1}}.
The proof of Theorem 1 from [9] can be easily adapted to show the following.
Proof. The base case k = 0 is Theorem 4. For the case k ≥ 1 one can argue as follows. Suppose by way of contradiction that (M, j0) witnesses Cω+(k+1) ∈
OBemω+k . Let σ be a locking sequence (of the first type) for M on Ck+1 .
Consider the following cases.
(Case 1) For every extension σ' ⊇ σ in Ck+1, M makes no memory extension while processing σ'. Then M can be fooled as an iterative learner, as in Case 2.1 of Theorem 4 above. The relevant languages here are L_{(k+1,1,...,k,k+1)}^{[k+1]} and L_{(k+1,1,...,k,k+2)}^{[k+1]}.
(Case 2) Not (Case 1), and for some extension σ' of σ in Ck+1, M makes more than k memory extensions while processing σ'. Thus, M commits to finite memory b for some b ∈ N. Then one can argue as in Case 1 of Theorem 4 above, considering the class of those languages in Cω+(k+1) that contain content(σ').
(Case 3) Not (Case 1) and not (Case 2). Then there exists an extension σ' of σ in Ck+1 such that M makes at least one memory extension while processing σ', and for all extensions σ'' of σ' in Ck+1, M makes at most k memory extensions while processing σ''. Then one can argue as in the proof of Theorem 1 (Claim 3 of Theorem 5 in [9]). The point is that the number of possible sets extending content(σ') by adding k + 1 elements of the form ⟨k + 1, p_1^t⟩ with d < t ≤ d + 3n (where d = max({i : ⟨k + 1, p_1^i⟩ ∈ content(σ')})) grows as the binomial coefficient (3n over k + 1), i.e., of order n^{k+1}, while the number of possible memory contents of M beyond σ' on such sets is Σ_{i=0}^{k} (3n over i), which is of order n^k and thus asymptotically smaller. This allows one to select appropriate sets in Cω+(k+1) which M fails to distinguish on two texts extending σ'.
Let Cω+ω be ⋃_{k∈N+} Cω+k.
Theorem 6. Cω+ω ∈ (OBemω+ω − ⋃_{k∈N+} OBemω+k).
Proof. Cω+ω ∈ OBemω+ω is easy. At step n, M can pad into its conjecture a quadruple ⟨An, Bn, Jn, Hn⟩ that keeps track of the following information, and act accordingly.
– An records the minimal x > 0 such that some ⟨x, p_1^a⟩ has occurred.
– Bn records the minimal x > 0 such that some ⟨ω, x, p_1^a⟩ has occurred.
– Jn records the minimal z > 0 such that some ⟨i, p_0^z⟩ has occurred.
– Hn records the minimal z > 0 such that some ⟨ω, i, p_0^z⟩ has occurred.
It is easy to see that Cω+ω ∉ ⋃_{k∈N+} OBemω+k. Suppose otherwise. Then for some k ∈ N, there exists (M, j0) witnessing Cω+ω ∈ OBemω+k. But Cω+(k+1) ⊆ Cω+ω. A contradiction to Theorem 5.
6 Conclusion
We have introduced a proper extension of the Bounded Example Memory model
featuring algorithmic count-down from constructive ordinals to bound the num-
ber of proper, global memory extensions an incremental learner is allowed on its
way to convergence. We have shown that the concept gives rise to criteria that lie strictly between the finite Bounded Example Memory hierarchy ⋃_k Bemk and set-driven learning Bem∗. We have exhibited a hierarchy of learning criteria up through ordinal ω². We are confident that the general problem - given constructive ordinals α > β, is it the case that (OBemα − OBemβ) ≠ ∅? -
can be attacked using similar methods. We also plan to investigate ordinal ver-
sions of feedback learning from [3]. An interesting side-question is: Are learners
with cumulative memory as powerful as learners that have the freedom to erase
memory content?
Acknowledgments. The author thanks the ALT 2009 anonymous referees for
useful comments. Special thanks go to one of the referees, who also suggested
how to extend the results of the present paper. Doing justice to his suggestion
would have required a substantial reworking of the presentation and will be taken
up in future work.
References
[1] Blum, L., Blum, M.: Toward a mathematical theory of inductive inference. Information and Control 28, 125–155 (1975)
[2] Carlucci, L., Case, J., Jain, S.: Learning correction grammars. In: Bshouty, N.,
Gentile, C. (eds.) Proceedings of the 20th Annual Conference on Learning Theory,
San Diego, USA, pp. 203–217 (2007)
[3] Case, J., Jain, S., Lange, S., Zeugmann, T.: Incremental concept learning for
bounded data mining. Information and Computation 152(1), 74–110 (1999)
[4] Freivalds, R., Smith, C.H.: On the role of procrastination for Machine Learning.
Information and Computation 107, 237–271 (1993)
[5] Gold, E.M.: Language identification in the limit. Information and Control 10,
447–474 (1967)
[6] Jain, S., Osherson, D., Royer, J., Sharma, A.: Systems that learn: an introduction
to learning theory, 2nd edn. MIT Press, Cambridge (1999)
[7] Lange, S., Moelius, S.E., Zilles, S.: Learning with Temporary Memory. In: Freund,
Y., Györfi, L., Turán, G., Zeugmann, T. (eds.) ALT 2008. LNCS (LNAI), vol. 5254,
pp. 449–463. Springer, Heidelberg (2008)
[8] Kinber, E., Stephan, F.: Language learning from texts: mind-changes, limited
memory, and monotonicity. Information and Computation 123(2), 224–241 (1995)
[9] Lange, S., Zeugmann, T.: Incremental learning from positive data. Journal of
Computer and System Sciences 53(1), 88–103 (1996)
[10] Rogers, H.: Theory of recursive functions and effective computability. McGraw-
Hill, New York (1967); Reprinted by MIT Press (1987)
[11] Wexler, K., Culicover, P.W.: Formal principles of language acquisition. MIT Press,
Cambridge (1980)
[12] Wiehagen, R.: Limes-Erkennung rekursiver Funktionen durch spezielle Strategien. Elektronische Informationsverarbeitung und Kybernetik 12(1/2), 93–99 (1976)
Learning from Streams
1 Introduction
The present paper investigates the scenario where a team of learners observes
data from various sources, called streams, so that only the combination of all
these data give the complete picture of the target to be learnt; in addition, the communication abilities between the team members are limited. Examples of such
a scenario are the following: some scientists perform experiments to study a phe-
nomenon, but no one has the budget to do all the necessary experiments and
therefore they share the results; various earth-bound telescopes observe an ob-
ject in the sky, where each telescope can see the object only during some hours
a day; several space ships jointly investigate a distant planet.
This concrete setting is put into the abstract framework of inductive inference
as introduced by Gold [2,5,9]: the target to be learnt is modeled as a recursively
enumerable set of natural numbers (which is called a “language”); the team of
learners has to find in the limit an index for this set in a given hypothesis space.
This hypothesis space might be either an indexed family or, in the most general
form, just a fixed acceptable numbering of all r.e. sets. Each team member gets
as input a stream whose range is a subset of the set to be learnt; but all team
members together see all the elements of the set to be learnt. Communication
between the team members is modeled by allowing each team member to finitely
often make its data available to all the other learners.
The notion described above is denoted as [m, n]StreamEx-learning where n
is the number of team members and m is the minimum number of learners out
of these n which must converge to the correct hypothesis in the limit. Note that
this notion of learning from streams is a variant of team learning, denoted as
[m, n]TeamEx, which has been extensively studied [1,11,15,16,18,19]; the main
difference between the two notions is that in team learning, all members see
the same data, while in learning from streams, each team member sees only a
part of the data and can exchange with the other team members only finitely
much information. In the following, Ex denotes the standard notion of learning
in the limit from text; this notion coincides with [1, 1]StreamEx. In related
work, Baliga, Jain and Sharma [4] investigated a model of learning from various
sources of inaccurate data where most of the data sources are nearly accurate.
We start with giving the formal definitions in Section 2. In Section 3 we
first establish a characterization result for learning indexed families. Our main
theorem in this section, Theorem 7, shows a tell-tale like characterization for
learning from streams for indexed families. An indexed family L = {L0, L1, . . .} is [m, n]StreamEx-learnable iff it is [1, ⌊n/m⌋]StreamEx-learnable iff there exists a uniformly r.e. sequence E0, E1, . . . of finite sets such that Ei ⊆ Li and there are at most ⌊n/m⌋ many languages L' in L with Ei ⊆ L' ⊆ Li. Thus, for indexed
families, the power of learning from streams depends only on the success ratio.
Additionally, we show that for indexed families, the hierarchy for stream learning is similar to the hierarchy for team function learning (see Corollary 9); note that there is an indexed family in [m, n]TeamEx − [m, n]StreamEx iff m/n ≤ 1/2.
We further show (Theorem 11) that a class L can be noneffectively learned from
streams iff each language in L has a finite tell-tale set [2] with respect to the
class L, though these tell-tale sets may not be uniformly recursively enumerable
from their indices. Hence the separation among different stream learning criteria
is due to computational reasons rather than information theoretic reasons.
In Section 4 we consider the relationship between stream learning criteria
with different parameters, for general classes of r.e. languages. Unlike the in-
dexed family case, we show that more streaming is harmful (Theorem 13): There
are classes of languages which can be learned by all n learners when the data is
divided into n streams, but which cannot be learned even by one of the learners
when the data is divided into n' > n streams. Hence, for learning r.e. classes, [1, n]StreamEx and [1, n']StreamEx are incomparable for different n, n' ≥ 1. This stands in contrast to the learning of indexed families, where we have that [1, n]StreamEx is properly contained in [1, n + 1]StreamEx for each n ≥ 1.
Theorem 14 shows that requiring fewer machines to be successful gives more power to stream learning, even when the success ratio is high. For each m there exists a class which is [m, n]StreamEx-learnable for all n ≥ m but not [m + 1, n']StreamEx-learnable for any n' ≥ 2m.
In Section 5 we first show that stream learning is a proper restriction of
team learning in the sense that [m, n]StreamEx ⊂ [m, n]TeamEx, as long as
1 ≤ m ≤ n and n > 1. We also show how to carry over several separation re-
sults from team learning to learning from streams, as well as give one simulation
result which carries over. In particular, we show in Theorem 17 that if m/n > 2/3, then the exact success ratio no longer affects learnability.
2 Preliminaries
For any unexplained recursion theoretic notation, the reader is referred to the
textbooks of Rogers [17] and Odifreddi [12]. The symbol N denotes the set of
natural numbers, {0, 1, 2, 3, . . .}. Subsets of N are referred to as languages. The
symbols ∅, ⊆, ⊂, ⊇ and ⊃ denote empty set, subset, proper subset, super-
set and proper superset, respectively. The cardinality of a set S is denoted by
card(S). max(S) and min(S), respectively, denote the maximum and minimum
of a set S, where max(∅) = 0 and min(∅) = ∞. dom(ψ) and ran(ψ) denote
the domain and range of ψ. Furthermore, ⟨·, ·⟩ denotes a recursive 1–1 and onto pairing function [17] from N × N to N which is increasing in both its arguments: ⟨x, y⟩ = (x + y)(x + y + 1)/2 + y. The pairing function can be extended to n-tuples by taking ⟨x1, x2, . . . , xn⟩ = ⟨x1, ⟨x2, . . . , xn⟩⟩.
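The pairing function and its n-tuple extension are directly executable; the following Python sketch just transcribes the two formulas above.

```python
def pair(x, y):
    # <x, y> = (x + y)(x + y + 1)/2 + y
    return (x + y) * (x + y + 1) // 2 + y

def pair_tuple(xs):
    # <x1, x2, ..., xn> = <x1, <x2, ..., xn>>
    return xs[0] if len(xs) == 1 else pair(xs[0], pair_tuple(xs[1:]))

# increasing in both arguments:
assert pair(2, 3) < pair(3, 3) and pair(2, 3) < pair(2, 4)
```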
The information available to the learner is a sequence consisting of exactly the
elements in the language being learned. In general, any sequence T on N ∪ {#}
is called a text, where # indicates a pause in information presentation. T (t)
denotes the (t + 1)-st element in T and T [t] denotes the initial segment of T
of length t. Thus T[0] = λ, where λ is the empty sequence. ctnt(T) denotes the
set of numbers in the text T . If σ is an initial segment of a text, then ctnt(σ)
denotes the set of numbers in σ. Let SEQ denote the set of all initial segments.
For σ, τ ∈ SEQ, σ ⊆ τ denotes that σ is an initial segment of τ . |σ| denotes the
length of σ.
A learner from texts is an algorithmic mapping from SEQ to N ∪ {?}. Here
the output ? of the learner is interpreted as “no conjecture at this time.” For a
learner M , one can view the sequence M (T [0]), M (T [1]), . . ., as a sequence of
conjectures (grammars) made by M on T .
Intuitively, successful learning is characterized by the sequence of conjectured
hypotheses eventually stabilizing on correct ones. The concepts of stabilization
and correctness can be formulated in various ways and we will be mainly con-
cerned with the notion of explanatory (Ex) learning. The conjectures of learners
are interpreted as grammars in a given hypothesis space H, which is always
a recursively enumerable family of r.e. languages (in some cases, we even take
the hypothesis space to be a uniformly recursive family, also called an indexed
family). Unless specified otherwise, the hypothesis space is taken to be a fixed
acceptable numbering W0 , W1 , . . . of all r.e. sets.
Definition 1 (Gold [9]). Given a hypothesis space H = {H0 , H1 , . . .} and a
language L, a sequence of indices i0 , i1 , . . . is said to be an Ex-correct grammar
sequence for L, if there exists s such that for all t ≥ s, H_{i_t} = L and i_t = i_s. A
learner M Ex-learns a class L of languages iff for every L ∈ L and every text T
for L, M on T outputs an Ex-correct grammar sequence for L.
We use Ex to also denote the collection of language classes which are Ex-
learnt by some learner.
Now we consider learning from streams. For this the learners would get streams
of texts as input, rather than just one text.
Definition 2. Let n ≥ 1. T = (T1 , . . . , Tn ) is said to be a streamed text for L
if ctnt(T1 ) ∪ . . . ∪ ctnt(Tn ) = L. Here n is called the degree of dispersion of the
streamed text. We sometimes call a streamed text just a text, when it is clear
from the context what is meant.
Suppose T = (T1, . . . , Tn) is a streamed text. Then, for all t, σ = (T1[t], . . . , Tn[t]) is called an initial segment of T. Furthermore, we define T[t] =
(T1 [t], . . . , Tn [t]). We define ctnt(T [t]) = ctnt(T1 [t]) ∪ . . . ∪ ctnt(Tn [t]) and sim-
ilarly for the content of streamed texts. We let SEQn = {(σ1 , σ2 , . . . , σn ) : σ1 ,
σ2 , . . . , σn ∈ SEQ and |σ1 | = |σ2 | = . . . = |σn |}. For σ = (σ1 , σ2 , . . . , σn ) and
τ = (τ1 , τ2 , . . . , τn ), we say that σ ⊆ τ if σi ⊆ τi for i ∈ {1, . . . , n}.
Let L be a language collection and H be a hypothesis space.
When learning from streams, a team M1 , ..., Mn of learners accesses a stream-
ed text T = (T1 , . . . , Tn ) and works as follows. At time t, each learner Mi sees as
input Ti [t] plus the initial segment T [synct ], outputs a hypothesis hi,t and might
update synct+1 to t. Here, initially sync0 = 0 and synct+1 = synct whenever no
team member updates synct+1 at time t.
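The protocol can be rendered as the following schematic simulation (Python; the learner interface and the finite horizon are assumptions for illustration): at time t each learner sees its own stream up to t plus the common prefix up to sync_t, and any learner may reset the synchronization point to t.

```python
def run_stream_team(learners, streams, horizon):
    """learners[i](own_prefix, shared_prefixes) -> (hypothesis, want_sync);
    streams[i](u) returns the u-th element of stream i."""
    n, sync = len(streams), 0
    hyps = [None] * n
    for t in range(1, horizon + 1):
        # T[sync_t]: the prefix of every stream up to the sync point
        shared = [[streams[i](u) for u in range(sync)] for i in range(n)]
        want_sync = False
        for i in range(n):
            own = [streams[i](u) for u in range(t)]   # T_i[t]
            hyps[i], w = learners[i](own, shared)
            want_sync = want_sync or w
        if want_sync:
            sync = t   # sync_{t+1} is updated to t
    return hyps
```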
A sequence σ ∈ SEQn with ctnt(σ) ⊆ L is called a stabilizing sequence for M1, . . . , Mn on L iff there are at least m numbers i ∈ {1, . . . , n} such that for all streamed texts T for L with σ = T[|σ|] and for all t ≥ |σ|, when M1, . . . , Mn are fed the streamed text T, for synct and hi,t as defined in Definition 2, (a) synct ≤ |σ| and (b) hi,t = hi,|σ|.
A stabilizing sequence σ is called a locking sequence for M1 , . . . , Mn on L for
[m, n]StreamEx-learning iff in (b) above hi,|σ| is additionally an index for L (in
the hypothesis space used).
Definition 6 (Angluin [2]). L is said to satisfy the tell-tale set criterion if for every L ∈ L, there exists a finite set DL ⊆ L such that for any L' ∈ L with L' ⊇ DL, we have L' ⊄ L. DL is called a tell-tale set of L. {DL : L ∈ L} is called a family of tell-tale sets of L.
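For finite toy classes the condition is directly checkable; the Python sketch below (finite sets only, a hypothetical three-set family) just spells out the quantifiers of Definition 6.

```python
def is_telltale(D, L, family):
    # D ⊆ L, and no L' in the family with D ⊆ L' is a proper subset of L
    return D <= L and not any(D <= Lp and Lp < L for Lp in family)

family = [frozenset({0}), frozenset({0, 1}), frozenset({0, 1, 2})]
assert not is_telltale(frozenset({0}), frozenset({0, 1}), family)  # {0} ⊂ {0,1}
assert is_telltale(frozenset({0, 1}), frozenset({0, 1}), family)
```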
Angluin [2] used the term exact learning to refer to learning using the language
class to be learned as the hypothesis space and she showed that a uniformly re-
cursive language class L is exactly Ex-learnable iff it has a uniformly recursively
enumerable family of tell-tale sets [2]. A similar characterization holds for non-
effective learning [10, pp. 42–43]: Any class L of r.e. languages is noneffectively
Ex-learnable iff L satisfies the tell-tale criterion. For learning from streamed
text, we have the following corresponding characterization.
Proof sketch. First we show that L ∈ [1, k + 1]StreamEx. For each e and for each L ⊆ {2e, 2e + 2, 2e + 4, . . .} with {2e} ⊆ L, let EL = {2e}; also, for any language L ∈ L containing an odd number, let EL = L. Now, for an appropriate indexing L0, L1, . . . of L, {E_{Li} : i ∈ N} is a collection of uniformly r.e. finite sets and for each L ∈ L, there are at most k + 1 sets L' ∈ L such that EL ⊆ L' ⊆ L. Thus, L ∈ [1, k + 1]StreamEx by Theorem 7. On the other hand, for each L ∈ L, one cannot effectively (in indices for L) enumerate a finite subset EL of L such that there are at most k sets L' ∈ L with EL ⊆ L' ⊆ L; we omit the details.
Remark 10. One might also study the inclusion problem for IND with respect
to related criteria. One of them being conservative learning [2], where the addi-
tional requirement is that a team member Mi of a team M1 , . . . , Mn can change
its hypothesis from Ld to Le only if it has seen, either in its own stream or in
the synchronized part of all streams, some datum x ∉ Ld. If one furthermore
requires that the learner is exact, that is, uses the hypothesis space given by the
indexed family, then one can show that there are more breakpoints than in the
case of usual team learning.
For example, there is a class which under these assumptions is conservatively
[2, 3]StreamEx-learnable but not conservatively learnable. The indexed family
L = {L0 , L1 , . . .} witnessing this separation is defined as follows. Let Φ be a
Blum complexity measure. For e ∈ N and a ∈ {1, 2}, L3e+a is {e, e + 1, e + 2, . . .}
if Φe (e) = ∞ and L3e+a is {e, e + 1, e + 2, . . .} − {Φe (e) + e + a} if Φe (e) < ∞.
Furthermore, the sets L0 , L3 , L6 , . . . form a recursive enumeration of all finite
sets D for which there is an e with Φe (e) < ∞, min(D) = e and max(D) ∈
{Φe (e) + e + 1, Φe (e) + e + 2}.
Note that the usage of the exact hypothesis space is essential for this remark.
However, the earlier results of this section do not depend on the choice of the hypothesis space. Assume that there is a k ∈ {1, 2, 3, . . .} with m/n ≤ 1/k < m'/n'. Then, similarly to Corollary 8, one can show that some class is conservatively [m, n]StreamEx-learnable but not conservatively [m', n']StreamEx-learnable.
The following result follows using the proof of Theorem 7 for noneffective learn-
ers. For noneffective learners one can consider every class as an indexed family.
Furthermore, finitely many elements can be added to Ei to separate Li from the
finitely many subsets of it which contain Ei and are proper subsets of Li — thus
giving us a tell-tale set for Li .
Theorem 11. Suppose 1 ≤ m ≤ n. L is noneffectively [m, n]StreamEx-learn-
able iff L satisfies Angluin’s tell-tale set criterion.
The above theorem shows that any separation between learning from streams
with different parameters must be due to computational difficulties.
Remark 12. Behaviourally correct learning (Bc-learning) requires a learner
to eventually output only correct hypotheses. Thus, the learner semantically
converges to a correct hypothesis, but may not converge syntactically (see [6,14]
for a formal definition). Suppose n ≥ 1. If an indexed family is [1, n]StreamEx-
learnable, then it is Bc-learnable using an acceptable numbering as hypothesis
space. This follows from the fact that an indexed family is Bc-learnable using an
acceptable numbering as hypothesis space iff it satisfies the noneffective tell-tale
criterion [3].
The following result shows that the number of successful learners affects learn-
ability from streams crucially.
Theorem 14. Suppose k ≥ 1. Then, there exists a class L such that for all n ≥ k and n' ≥ 2k, L ∈ [k, n]StreamEx but L ∉ [k + 1, n']StreamEx.
On input T[t] = (T1[t], . . . , Tn[t]), the learners synchronize if for some i, ctnt(Ti[t − 1]) does not contain ⟨x, j⟩ and ⟨x, j'⟩ with j ≠ j', but ctnt(Ti[t]) does contain such ⟨x, j⟩ and ⟨x, j'⟩.
If synchronization has happened (in some previous step), then the learners output a grammar for B ∪ {⟨x, j⟩ : 1 ≤ j ≤ 2k}, where x is the unique number such that ⟨x, j⟩ and ⟨x, j'⟩ are in the synchronized text for some j ≠ j'. Otherwise, M1, . . . , Mk output a grammar for B and each Mi with k + 1 ≤ i ≤ n does the following: it first looks for the least x such that ⟨x, j⟩ ∈ ctnt(Ti[t]) for some j, and x is not verified to be in dom(ψ) in t steps; then Mi outputs a grammar for Ax if such an x is found, and outputs ? if no such x is found.
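The synchronization trigger used by these learners is a simple scan for a repeated first coordinate; a sketch (pairs modelled as Python tuples, with a hypothetical helper name):

```python
def has_conflicting_pair(pairs):
    """True iff the pairs seen so far contain <x, j> and <x, j'> with the
    same x but j != j' -- the event on which the team synchronizes."""
    first_seen = {}
    for x, j in pairs:
        if x in first_seen and first_seen[x] != j:
            return True
        first_seen.setdefault(x, j)
    return False
```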
If the learners ever synchronize, then clearly all learners correctly learn the target language. Suppose no synchronization happens. If the language is B, then M1, . . . , Mk correctly learn the input language. If the language is Ax for some x ∉ dom(ψ), then n ≥ 2k (otherwise synchronization would have happened) and at least k learners among Mk+1, . . . , Mn eventually see exactly one pair of the form ⟨x, j⟩, where 1 ≤ j ≤ 2k, and these learners will correctly learn the input language.
Now suppose by way of contradiction that a team (M1, . . . , Mn') of learners [k + 1, n']StreamEx-learns L. By Fact 5, there exists a locking sequence σ = (σ1, . . . , σn') for the learners M1, . . . , Mn' on B. Let S ⊆ {1, . . . , n'} be of size k + 1 such that the learners Mi, i ∈ S, do not make a mind change beyond σ on any streamed text T for B which extends σ.
By definition of ψ, there must be only finitely many ⟨x, j⟩ ∈ C such that the learners M1, M2, . . . , Mn' synchronize or one of the learners Mi, i ∈ S, makes a mind change beyond σ on some streamed text extending σ for B ∪ {⟨x, j⟩} — otherwise we would have an infinite r.e. set consisting of such pairs, contained in C but disjoint from B, a contradiction to the definitions of ψ, B, C. Let X be
the set of these finitely many ⟨x, j⟩. Let Z be the set of x such that, for some i with 1 ≤ i ≤ n', the grammar output by Mi on input σ is for Ax, or the grammar output by Mi (in the limit) on input σi#∞ (with the last point of synchronization being before all of input σ is seen) is for Ax.
Select some z ∉ dom(ψ) such that z ∉ Z and ⟨z, j⟩ ∉ X for any j. Now we construct a streamed text extending σ for Az on which the learners fail. Let S' ⊇ S be a subset of {1, 2, . . . , n'} of size 2k. If i is the j-th element of S', then choose Ti such that Ti extends σi and ctnt(Ti) = B ∪ {⟨z, j⟩}; else (when i ∉ S') let Ti = σi#∞. Thus, T = (T1, . . . , Tn') is a streamed text for Az. However, only the learners Mi with i ∈ S' − S can converge to correct grammars for Az (as the learners Mi with i ∈ S or i ∉ S' would not have converged to a grammar for Az, by definition of z, X and Z above); since card(S' − S) ≤ k − 1 < k + 1, it follows that L ∉ [k + 1, n']StreamEx.
Team learning is a special form of learning from streams, in which all learners
receive the same complete information about the underlying reality; thus team
learnability provides upper bounds for learnability from streams with the same
parameters. These upper bounds are strict.
Remark 16. Another question is how this transfers to the learnability of indexed families. If m/n > 1/2 and L is an indexed family, then L ∈ [m, n]StreamEx iff L ∈ [m, n]TeamEx.
Below we will show how several results from team learning can be carried over
to the stream learning situation.
It was previously shown that in team learning, when the success ratios exceed
a certain threshold, then the exact success ratio does not affect learnability any
longer. Using a similar majority argument, we can show similar collapsing results
for learning from streams (Theorem 17 and Theorem 18).
One can also carry over several diagonalization results from team learning to
learning from streams. An example is the following.
The motivation for iterative learning is the following: When humans learn,
they do not memorize all past observed data, but mainly use the hypothesis they
currently hold, together with new observations to formulate new hypotheses.
Many scientific results can be considered to be obtained in iterative fashion. It-
erative learning for learning from a single stream/text was previously modeled by
requiring the learners to be a function of the previous hypothesis and the current
observed data. Formally, a single-stream learner M : (N ∪ {#})∗ → (N ∪ {?}) is iterative if there exists a recursive function F : (N ∪ {?}) × (N ∪ {#}) → N ∪ {?} such that, on a text T, M(T[0]) = ? and, for t > 0, M(T[t]) = F(M(T[t − 1]), T(t)). For notational simplicity, we shall write F(M(T[t − 1]), T(t)) as M(M(T[t − 1]), T(t)).
We can similarly define iterative learning from several streams by requiring each
learner’s hypothesis to be a recursive function of its previous hypothesis and the
set of the newest datum received by each learner — here, when synchronization
happens, the learners only share the latest data seen by the learners rather than
the whole history of data seen.
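The single-stream constraint says the learner is a left fold of F over the text; a one-line Python rendering (F and the initial conjecture '?' supplied by the caller):

```python
from functools import reduce

def iterative(F, text_prefix, initial='?'):
    # M(T[t]) = F(M(T[t-1]), T(t)) unrolled over the whole prefix
    return reduce(F, text_prefix, initial)
```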
Iterative learning can be considered as a form of information incompleteness
as the learner(s) do not memorize all the past observed data. Interestingly, every
iteratively learnable class is learnable from streams irrespective of the parameters.
Theorem 20. For any n ≥ 1, every language class Ex-learnable by an iterative
learner is iteratively [n, n]StreamEx-learnable.
Intuitively, streaming only spreads information, but does not destroy information (Theorem 11), while the incompleteness in an incomplete text involves the destruction
of information. This difference is made precise by the following incomparability
results.
Proposition 21. Suppose that L consists of L0 = N and all sets Lk+1 = {1 + ⟨x, y⟩ : x ≤ k ∧ y ∈ N}. Then L ∈ [n, n]StreamEx for any n ≥ 1 but L can
neither be Ex-learnt from noisy text nor from incomplete text. Furthermore, L
is iteratively learnable.
For the separations in the converse direction, one cannot use indexed families as
every indexed family Ex-learnable from normal text is already learnable from
streams; obviously this implication survives when learnability from normal text
is replaced by learnability from incomplete or noisy text.
Remark 22. Suppose n ≥ 2. Then the cylindrification of the class L from Theo-
rem 13 is Ex-learnable from incomplete text but not [1, n]StreamEx-learnable.
Here the cylindrification of the class L is just the class of all sets {⟨x, y⟩ : x ∈ L ∧ y ∈ N} with L ∈ L. Incomplete texts for a cylindrification of such a set L can
be translated into standard texts for L and so the learnability from incomplete
texts can be established; the diagonalization against the stream learners carries
over.
It is known that learnability from noisy text is possible only if for every two different sets L, L' in the class the differences L − L' and L' − L are both infinite. This is a characterization for the case of indexed families, but it is only a necessary and not a sufficient criterion for classes in general. For example, if a class L consists of sets Lx = {⟨x, y⟩ : y ∈ N − {ax}} without any method to obtain ax from x in the limit, then learnability from noisy text is lost.
Theorem 23. There is a class L which is learnable from noisy text but not
[1, n]StreamEx-learnable for any n ≥ 2.
7 Conclusion
In this paper we investigated learning from several streams of data. For learn-
ing indexed families, we characterized the classes which are [m, n]StreamEx-
learnable using a tell-tale like characterization: An indexed family L = {L0, L1, . . .} is [m, n]StreamEx-learnable iff it is [1, ⌊n/m⌋]StreamEx-learnable iff there exists a uniformly r.e. sequence E0, E1, . . . of finite sets such that Ei ⊆ Li and there are at most ⌊n/m⌋ many languages L' in L such that Ei ⊆ L' ⊆ Li.
For general classes of r.e. languages, our investigation shows that the power of
learning from streams depends crucially on the degree of dispersion, the success
ratio and the number of successful learners required. Though a higher degree of
dispersion is more restrictive in general, we show that any class of languages
which is iteratively learnable is also iteratively learnable from streams even if
one requires all the learners to be successful. There are several open problems
and our results suggest that there may not be a simple way to complete the
picture of relationship between various [m, n]StreamEx learning criteria.
Smart PAC-Learners
1 Introduction
Structure of the paper: Section 2 clarifies the notation that is used throughout
the paper. Section 3 is devoted to learning under a fixed distribution D. This
setting is cast as a zero-sum game between two players, the learner and her
opponent, such that the Minimax Theorem from game theory applies. This leads
to a nice characterization of δ*_D = δ*_D(ε, m). It is furthermore shown that, when
the opponent makes his draw first, there is a strategy for the learner that, despite
not being defined in terms of D, does not perform much worse than the best
strategy that is specialized to D. Section 4 is devoted to the proof of the main
result. To this end, we first treat the case of finitely many distributions and
cast the resulting task for a general PAC-learner again as a zero-sum game.
Another application of the Minimax Theorem brings us then in the position to
prove an important result for a learner who simultaneously copes with finitely
many distributions. The case of (infinitely many) arbitrary distributions is finally
treated in a similar fashion by invocation of a continuity argument. At the end
of Section 4, the reader will find our main results. Section 5 is devoted to some
final discussions and open problems.
2 Notations
We assume that the reader is familiar with the PAC-learning framework and
knows the Minimax Theorem from game theory.
Since a mixed strategy for the learner is a distribution over learning functions
(mapping a labeled sample to a hypothesis), we may equivalently think of the
learner as waiting for a random labeled sample (x, b) and then playing a mixed
strategy that depends on (x, b). In order to formalize this intuition, we consider
the new payoff-matrix Ã = Ã^ε_D given by

  Ã[i, j] = 1 if h_i is ε-inaccurate for h_j w.r.t. D, and Ã[i, j] = 0 otherwise.
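To make the definition concrete, here is a minimal sketch (our own illustration, not code from the paper) that materializes Ã for a finite domain; the names H, D and the 0/1 encoding of hypotheses are our assumptions, and we read "ε-inaccurate" as a disagreement mass exceeding ε.

```python
import numpy as np

def payoff_matrix(H, D, eps):
    """Build A~[i, j] = 1 iff h_i is eps-inaccurate for h_j w.r.t. D.
    H: (N, |X|) 0/1 array, row i encodes hypothesis h_i on the finite domain X.
    D: length-|X| probability vector for the domain distribution."""
    N = H.shape[0]
    A = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            # D-mass of the disagreement region h_i XOR h_j
            err = D[H[i] != H[j]].sum()
            A[i, j] = 1.0 if err > eps else 0.0
    return A
```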
1. The opponent selects a vector q ∈ [0, 1]^N specifying a priori probabilities for
the target concept. Note that this implicitly determines
– the probability

  Q(b|x) = Σ_{j: h_j(x) = b} q_j
3. The learner chooses a vector p̃(x, b) ∈ [0, 1]N (that may depend on D, q and
(x, b)) specifying her mixed strategy w.r.t. payoff-matrix Ã.
4. The learner suffers loss p̃(x, b)^⊤ Ã q̃(x, b), so that her expected loss, averaged
over all labeled samples, evaluates to Σ_{x,b} Pr(x, b) p̃(x, b)^⊤ Ã q̃(x, b).
In the sequel, the games associated with A and Ã, respectively, are simply called
A-game and Ã-game, respectively.
Lemma 1. Let q ∈ [0, 1]N be an arbitrary but fixed mixed strategy for the
learner’s opponent. Then every mixed strategy p ∈ [0, 1]M for the learner in
the A-game can be mapped to a mixed strategy p̃ for the learner in the Ã-game
so that
  p^⊤ A q = Σ_{x,b} Pr(x, b) p̃(x, b)^⊤ Ã q̃(x, b) . (6)
Moreover, this mapping p → p̃ is surjective, i.e., every mixed strategy for the
learner in the Ã-game has a pre-image (so that the optimal values in both games
are the same).
Proof. For every probability vector p ∈ [0, 1]^M and every labeled sample (x, b),
we define the corresponding probability vector p̃(x, b) ∈ [0, 1]^N as follows:

  p̃_{i′}(x, b) = Σ_{i: L_i(x,b) = h_{i′}} p_i . (7)

Note that

  Σ_{i′=1}^{N} p̃_{i′}(x, b) = Σ_{i=1}^{M} p_i = 1 .
  p^⊤ A q (1)= Σ_x D^m(x) Σ_{j=1}^{N} Σ_{i=1}^{M} I_{x,h_j(x)}[i, j] p_i q_j
          = Σ_{x,b} D^m(x) Σ_{j: h_j(x)=b} Σ_{i=1}^{M} I_{x,b}[i, j] p_i q_j
          = Σ_{x,b} D^m(x) Σ_{j: h_j(x)=b} Σ_{i′=1}^{N} ( Σ_{i: L_i(x,b)=h_{i′}} I_{x,b}[i, j] p_i ) q_j
          = Σ_{x,b} D^m(x) Σ_{j: h_j(x)=b} Σ_{i′=1}^{N} Ã[i′, j] q_j Σ_{i: L_i(x,b)=h_{i′}} p_i
          (7)= Σ_{x,b} D^m(x) Σ_{i′=1}^{N} Σ_{j: h_j(x)=b} Ã[i′, j] p̃_{i′}(x, b) q_j
          (4)= Σ_{x,b} D^m(x) Q(b|x) Σ_{i′=1}^{N} Σ_{j=1}^{N} Ã[i′, j] p̃_{i′}(x, b) Q(j|x, b)
          (5)= Σ_{x,b} Pr(x, b) p̃(x, b)^⊤ Ã q̃(x, b) , (9)

where the third equality uses that I_{x,b}[i, j] = Ã[i′, j] whenever L_i(x, b) = h_{i′}.
As for the second part of the lemma, consider a mixed strategy p̃′(x, b) of the
learner in the Ã-game. We shall specify a mixed strategy p of the learner in
the A-game such that the function p̃(x, b) computed according to (7) coincides with
p̃′(x, b). To this end, let us make the notational convention p̃′_{h_i} := p̃′_i and let us
choose p as follows:

  p_i = ∏_{(x,b)} p̃′_{L_i(x,b)}(x, b) . (8)
By the distributive law, Σ_{i=1}^{M} p_i = ∏_{(x,b)} Σ_{i′=1}^{N} p̃′_{i′}(x, b) = 1, so p is indeed a
probability vector; here the reader should note the one-to-one correspondence between
the set of all learning functions and the free combination of (number of labeled
samples many) hypotheses taken from H.
A computation similar to (9) now verifies the desired coincidence between p̃
and p̃′:

  p̃_{i′}(x′, b′) (7)= Σ_{i: L_i(x′,b′)=h_{i′}} p_i
                 (8)= Σ_{i: L_i(x′,b′)=h_{i′}} ∏_{(x,b)} p̃′_{L_i(x,b)}(x, b)
                    = p̃′_{i′}(x′, b′) Σ_{i: L_i(x′,b′)=h_{i′}} ∏_{(x,b)≠(x′,b′)} p̃′_{L_i(x,b)}(x, b)
                    = p̃′_{i′}(x′, b′) ∏_{(x,b)≠(x′,b′)} Σ_{i=1}^{N} p̃′_i(x, b)
                    = p̃′_{i′}(x′, b′) ,

since the last product equals 1. In the second-last equation, we used the distributive
law, where the reader should note the one-to-one correspondence between the set
of all learning functions with a fixed value on one sample (x′, b′) and the free
combination of (number of labeled samples minus 1 many) hypotheses taken from H.
We close this section with a result that prepares the ground for our analysis of
general PAC-learners in the next section:
Lemma 2. Let ε > 0 be a given accuracy and m ≥ 1 a given sample size.
For every probability vector q ∈ [0, 1]^N and every domain distribution D, the
following holds:

  Σ_{x,b} Pr(x, b) q̃(x, b)^⊤ Ã^{2ε}_D q̃(x, b) ≤ 2 δ*_D(ε, m) . (10)
Proof. Recall that q̃(x, b) is the vector that assigns to every hypothesis h_j the
a posteriori probability Q(j|x, b) of being the target concept. Since the
a posteriori probabilities outside the version space

  V := {h ∈ H : h(x) = b}
are zero, only target concepts in V can contribute to the left-hand side in (10).
In the remainder of the proof, we simply write Ã^ε instead of Ã^ε_D, and Ã^ε_i denotes
the i'th row of this matrix. In the Ã^ε-game, the opponent makes the first draw
by choosing a (prior) probability vector q ∈ [0, 1]^N. The following “Bayesian
strategy” for the learner minimizes (6): for a given labeled sample (x, b), pick a
hypothesis h* = h_{i*(x,b)} ∈ H which maximizes the total a posteriori probability
of hypotheses that are ε-close to h* w.r.t. D, i.e.,

  i*(x, b) = argmax_{i=1,...,N} Σ_{j: D(h_i ⊕ h_j) ≤ ε} Q(j|x, b) .

It follows that

  Σ_{x,b} Pr(x, b) Ã^ε_{i*(x,b)} q̃(x, b) ≤ δ*_D(ε, m) . (11)
We are now prepared to verify (10). We call a hypothesis from the version
space V a “(D, x, b, ε)-exception” if it is not ε-close to h* w.r.t. D. Note
that Ã^ε_{i*(x,b)} q̃(x, b) coincides with the total a posteriori probability of (D, x, b, ε)-exceptions. Consider now the strategy p̃ = q̃ for the learner. Given (x, b), she
picks a hypothesis ĥ = h_i at random with probability Q(i|x, b). The following
observation, which is a simple application of the triangle inequality, is crucial: if
ĥ ∈ V is not a (D, x, b, ε)-exception, then ĥ is 2ε-close w.r.t. D to every hypothesis in V that is not a (D, x, b, ε)-exception either. Thus, if ĥ and the target
concept are both picked at random according to q̃(x, b), then the probability of
ĥ being 2ε-inaccurate is bounded from above by twice the total probability of
(D, x, b, ε)-exceptions, i.e., q̃(x, b)^⊤ Ã^{2ε} q̃(x, b) ≤ 2 Ã^ε_{i*(x,b)} q̃(x, b); averaging
over all (x, b) and applying (11) yields (10).
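A small sketch (ours; all names are illustrative, not the paper's notation) of the Bayesian strategy used in this proof: given the posterior Q(·|x, b), return the index i* maximizing the posterior mass of the ε-ball around h_i.

```python
import numpy as np

def bayes_index(H, D, Q_post, eps):
    """Pick i* = argmax_i of the sum of Q(j|x,b) over all j with
    D(h_i XOR h_j) <= eps.
    H: (N, |X|) 0/1 hypothesis matrix, D: domain distribution vector,
    Q_post: posterior probabilities Q(j|x, b) for the current sample."""
    N = H.shape[0]
    # close[i, j] = 1 iff h_j is eps-close to h_i w.r.t. D
    close = np.array([[D[H[i] != H[j]].sum() <= eps for j in range(N)]
                      for i in range(N)])
    scores = close @ Q_post  # posterior mass of the eps-ball around each h_i
    return int(np.argmax(scores))
```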
4 Smart PAC-Learners

Let us first consider learners that cope with an arbitrary but fixed finite list
D_1, ..., D_C of (not necessarily distinct) distributions on X (we shall later extend
these considerations to arbitrary distributions). We shall define
a suitable payoff-matrix R blockwise so that R = [R^{(1)}, ..., R^{(C)}] and R^{(k)} is
the block reserved for distribution D_k. Every block has M rows (one row for
every learning function) and N columns (one column for every possible target
concept). Choosing R^{(k)} = A^{ε,m}_{D_k} (compare with the previous section) would lead
to mixed strategies for the PAC-learner that put too much emphasis on fiendish
domain distributions. Such strategies are not likely to succeed with considerably
fewer sample points when the underlying distribution D happens to be simple.
Assuming

  (A1) ∀k = 1, ..., C : δ*_{D_k}(ε, m) ≠ 0,

the following payoff-matrix is a better choice:

  R^{(k)}[i, j] := (1 / δ*_{D_k}(ε, m)) · A^{2ε,m}_{D_k}[i, j]
               = (1 / δ*_{D_k}(ε, m)) · Pr_{x∼D_k^m}[L_i(x, h_j(x)) is 2ε-inaccurate for h_j w.r.t. D_k] .
Intuitively, the scaling factor 1/δ*_{D_k}(ε, m) challenges the learner to put more
emphasis on benign distributions. Note furthermore that we penalize the learner
only if her hypothesis is 2ε-inaccurate (as opposed to ε-inaccurate). This leaves
some slack which helps the learner to compensate for not knowing D.
In the sequel, we simply write δ*_k(ε, m) instead of δ*_{D_k}(ε, m) and A^{(k)} instead
of A^{2ε,m}_{D_k}. A^{(k)}_j denotes the j'th column in A^{(k)} and, similarly, R^{(k)}_j denotes the
j'th column in R^{(k)}. A mixed strategy for the learner is a probability vector
p ∈ [0, 1]^M (as in Section 3). A mixed strategy for the opponent is a probability vector q = [q^{(1)}, ..., q^{(C)}] ∈ [0, 1]^{CN} where q^{(k)}_j denotes the probability
for choosing domain distribution D_k and target concept h_j. According to the
Minimax Theorem, the following holds:
Let ρ*(ε, m) denote the optimal value in (12). The following quantities refer to
a learner who makes the first draw and applies the mixed strategy p:

  ρ^p(ε, m) := max_{j=1,...,N} max_{k=1,...,C} p^⊤ R^{(k)}_j ,

  ρ̄^p(ε, m) := max_{j=1,...,N} (1/C) Σ_{k=1}^{C} p^⊤ R^{(k)}_j .

Clearly, ρ*(ε, m) = min_p ρ^p(ε, m).
Let us make perfectly clear the connection between these quantities and our
concept of a smart PAC-learner. We denote by δ^p_{j,k}(ε, m) the expected failure rate
(w.r.t. m, ε) of a PAC-learner with mixed strategy p when the target concept is h_j
and the domain distribution is D_k; in contrast to the previous section, here the
learner has no prior knowledge of D_k. It follows from the definition of A^{(k)} = A^{2ε,m}_{D_k}
that

  δ^p_{j,k}(2ε, m) = p^⊤ A^{(k)}_j .

Thus, according to the definition of R^{(k)},
  δ^p_{j,k}(2ε, m) / δ*_k(ε, m) = p^⊤ R^{(k)}_j ,

  max_{j=1,...,N} max_{k=1,...,C} δ^p_{j,k}(2ε, m) / δ*_k(ε, m) = ρ^p(ε, m),

  max_{j=1,...,N} (1/C) Σ_{k=1}^{C} δ^p_{j,k}(2ε, m) / δ*_k(ε, m) = ρ̄^p(ε, m).
It becomes obvious now that the quantities ρp (ε, m) and ρ̄p (ε, m) measure how
well the general PAC-learner with mixed strategy p (and accuracy parameter 2ε)
competes against the best learner with full prior knowledge of the domain distri-
bution (and accuracy parameter ε). We call ρp (ε, m) the worst performance ratio
and ρ̄p (ε, m) the average performance ratio of the mixed strategy p (although
both quantities refer to the worst-case as far as the choice of the target concept
is concerned). In the sequel, a learner with mixed strategy p is identified with p
so that we can speak of a performance ratio of the learner. A very strong result
would be ρ∗ (ε, m) ≤ c for some small constant c, which would mean that there
exists a learner (mixed strategy) p whose worst performance ratio is bounded
by c. But, since this is somewhat overambitious, we pursue a weaker goal in the
following and analyze the average performance ratio ρ̄^p(ε, m) instead. We make
use of the (obvious) fact that

  ρ̄*(ε, m) := min_p ρ̄^p(ε, m)

coincides with the optimal value in the zero-sum game with payoff-matrix

  R̄ := (1/C) Σ_{k=1}^{C} R^{(k)} . (13)
  p^⊤ R̄ q (13)= (1/C) Σ_{k=1}^{C} p^⊤ R^{(k)} q
           = (1/C) Σ_{k=1}^{C} (1 / δ*_{D_k}(ε, m)) p^⊤ A^{(k)} q
           (6)= (1/C) Σ_{k=1}^{C} (1 / δ*_{D_k}(ε, m)) Σ_{x,b} Pr_k(x, b) p̃(x, b)^⊤ Ã^{(k)} q̃_k(x, b) .
According to Lemma 1, there exists a mixed strategy p for the learner such that
p̃ = q̃. With this choice of p, we get

  p^⊤ R̄ q = (1/C) Σ_{k=1}^{C} (1 / δ*_{D_k}(ε, m)) Σ_{x,b} Pr_k(x, b) q̃_k(x, b)^⊤ Ã^{(k)} q̃_k(x, b) ≤ 2 , (14)

as desired.
Let R̄j denote the j’th column of matrix R̄. The Minimax Theorem applied to
the R̄-game allows us to infer from Lemma 3 the following
Corollary 1. ρ̄∗ (ε, m) ≤ 2, i.e., there exists a learner (mixed strategy) p whose
average performance ratio is bounded by 2.
So far, we have assumed that there is a finite list of distributions and the do-
main distribution is taken from this list. We now extend these considerations to
arbitrary distributions. Recall that our domain X is finite, say X = {ξ1 , . . . , ξd }.
The domain distributions are in one-to-one correspondence with the vectors
taken from the probability simplex
Δ := {z ∈ [0, 1]d : z1 + · · · + zd = 1}.
Specifically, Dz (ξν ) = zν for ν = 1, . . . , d. Note that instead of finitely many
block matrices R(1) , . . . , R(C) , as before, we now have a system of infinitely
many matrices R^{(z)}. Let f : Δ → R₊ be a continuous function that satisfies

  ∫_{z∈Δ} f(z) dz = 1 (15)
so that it can serve as a density function. For the sake of simple notation, we set

  δ*_z := δ*_{D_z}(ε, m). (16)
For every ζ > 0 and every E ⊆ Δ, let

  Pr(E) := ∫_{z∈E} f(z) dz, (17)

  Δ_ζ := {z ∈ Δ : δ*_z ≥ ζ}. (18)
The former Assumption (A1) is replaced now by the following assumption:
(A2) limζ→0 Pr(Δζ ) = 1.
(A2) implies that Pr(Δζ ) > 0 for every sufficiently small ζ > 0, which is assumed
in the sequel. Since f (z)/ Pr(Δζ ) is a continuous density function on Δζ , we can
now use

  R̄^ζ := (1 / Pr(Δ_ζ)) · ∫_{z∈Δ_ζ} f(z) R^{(z)} dz (19)

as a payoff-matrix (where this matrix-equation is understood entry-wise).
where

  E_ij := {z ∈ Δ : D_z(h_i ⊕ h_j) = ε}.

Note that the sets E_ij are of Lebesgue-measure zero, and so is their union E. For this reason,
integrating over Δ_ζ leads to the same result as integrating over Δ_ζ \ E, which,
by construction, is a set without discontinuities.
The average performance ratio of a mixed strategy p for the learner now refers
to the density function f(z)/Pr(Δ_ζ) and must therefore be redefined as follows:

  ρ̄^p_ζ(ε, m) := max_{j=1,...,N} (1 / Pr(Δ_ζ)) · ∫_{z∈Δ_ζ} f(z) p^⊤ R^{(z)}_j dz = max_{j=1,...,N} p^⊤ R̄^ζ_j ,

where R̄^ζ_j denotes the j'th column of R̄^ζ. With this notation, we get
Corollary 2. For every sufficiently small ζ > 0, there exists a learner (mixed
strategy) p whose average performance ratio is bounded by 2.
Proof. The crucial observation is that Lemma 3 is still correct when we define
R̄ := R̄ζ according to (19). The only modification in the proof of this lemma
is the substitution of integrals for sums. Thus the Minimax Theorem applies to
the R̄-game and Corollary 2 is obtained.
  Pr(z ∈ Δ : p^⊤ R^{(z)}_j > 2c) ≤ Pr(z ∈ Δ_ζ : p^⊤ R^{(z)}_j > 2c) + Pr(Δ \ Δ_ζ) < 1/c

provided that ζ > 0 is sufficiently small. From this, Corollary 3 is immediate.
According to Corollary 3, there exists a mixed strategy p for a learner without
any prior knowledge of the domain distribution such that, in comparison to the
best learner with full prior knowledge of the domain distribution, a performance
ratio of 2c is achieved for the “vast majority” of distributions. The total prob-
ability mass of distributions (measured according to density function f (z)) not
belonging to the “vast majority” is bounded by 1/c. So Corollary 3 is the result
that we had announced in the introduction.
  Δ_γ := {z ∈ Δ : ∀ν = 1, ..., d : γ ≤ z_ν ≤ 1 − γ}.

Assume that z ∈ Δ_γ. Pick ν(z) ∈ {1, ..., d} such that z_{ν(z)} = min{z_1, ..., z_d}.
Clearly,

  γ ≤ z_{ν(z)} ≤ 1/d.
For sake of brevity, let ξ := ξν(z) . For b ∈ {0, 1}, consider the set
the opponent can assign a priori probability 1/2 to h and h′, respectively, and achieve the
following: with a probability of at least γ^m, the sample x is of the form x = (ξ, ..., ξ) ∈ X^m,
so that the learner cannot distinguish between h and h′. Conditioned on
x = (ξ, ..., ξ) ∈ X^m, the learner will therefore fail with probability at least 1/2
(regardless of her strategy). Thus, the overall expected failure rate is at least
γ^m/2.
The punchline of this discussion is the following implication:

  z ∈ Δ_γ ∧ 2ε < max_{b∈{0,1}} max_{g,g′ ∈ H(ξ_{ν(z)}, b)} D_z(g ⊕ g′)  ⇒  δ*_z(ε, m) ≥ γ^m/2 . (22)

In the sequel, the condition on 2ε in the premise of (22) is referred to as (A3).
Condition (A3) looks wild, but it essentially says that the knowledge of a
single labeled example should not trivialize the resulting version space (in terms
of its diameter) too much.
Define K(H) as the smallest number K such that, for every ξ ∈ X, there
exist g₊ ∈ H(ξ, 1), g₋ ∈ H(ξ, 0) and g_1, ..., g_K ∈ H which satisfy the following
condition: for some 0 ≤ K′ ≤ K, the disagreement sets

  g₊ ⊕ g_1, ..., g₊ ⊕ g_{K′} ; g₋ ⊕ g_{K′+1}, ..., g₋ ⊕ g_K

cover X \ {ξ}. Thus, by the pigeon-hole principle, there must exist a hypothesis
g ∈ {g₊, g₋} and g′ ∈ {g_1, ..., g_K} such that g(ξ) = g′(ξ) but the disagreement
set g ⊕ g′ has probability mass at least (1 − 1/d)/K. It follows that, for every z ∈ Δ_γ,

  max_{b∈{0,1}} max_{g,g′ ∈ H(ξ_{ν(z)}, b)} D_z(g ⊕ g′) ≥ (1 − 1/d)/K(H) ≥ 1/(2K(H)).
to violate Assumption (A2)). But even for the almost trivial class of half-intervals
{1, ..., r}, 0 ≤ r ≤ d, our sufficient condition K(H) < ∞ applies. This can be
seen as follows: pick an arbitrary ξ ∈ {1, ..., d} and designate the following half-intervals:

  g₊ := {1, ..., ξ} ,  g₋ := ∅,
  g_1 := {1, ..., d} ,  g_2 := {1, ..., ξ − 1}.

Then g₊ ⊕ g_1 = {ξ + 1, ..., d} and g₋ ⊕ g_2 = {1, ..., ξ − 1} cover X \ {ξ}, so that K(H) ≤ 2.
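The covering claim is easy to verify mechanically; the following quick check (our own, for a small domain) confirms K(H) ≤ 2 for the half-interval class.

```python
# Verify (for d = 10) that g+ XOR g1 and g- XOR g2 cover X \ {xi},
# certifying K(H) <= 2 for half-intervals.
d = 10
X = set(range(1, d + 1))
for xi in X:
    g_plus = set(range(1, xi + 1))   # {1, ..., xi}
    g_minus = set()                  # the empty half-interval
    g1 = set(X)                      # {1, ..., d}
    g2 = set(range(1, xi))           # {1, ..., xi - 1}
    cover = (g_plus ^ g1) | (g_minus ^ g2)  # symmetric differences
    assert cover == X - {xi}
```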
Open problems:
– For every finite hypothesis class, we have shown the mere existence of a
learner (mixed strategy) whose average performance ratio is bounded by 2.
Gain more insight into how this strategy actually works, and check under which
conditions it can be implemented efficiently.
– Prove or disprove that there exists a learner (mixed strategy) whose worst
performance ratio is bounded by a small constant.
– Prove or disprove our claim that assumption (A2) is not very restrictive.
Approximation Algorithms for Tensor Clustering
1 Introduction
recent survey about tensors and their applications. The simplest tensor clustering scenario, namely co-clustering (also known as bi-clustering), is more established [12,4,16,17,18]. Tensor clustering is less well known, though several
researchers have considered it before [1,2,19,20,21].
1.1 Contributions
The main contribution of this paper is the analysis of an approximation algorithm for tensor clustering that achieves an approximation ratio of O(p(m)α),
where m is the order of the tensor, p(m) = m or p(m) = m^{log₂ 3}, and α is the
approximation factor of a corresponding 1D clustering algorithm. Our results
apply to a fairly broad class of objective functions, including metrics such as ℓ_p
norms, Hilbertian metrics [22,23], and divergence functions such as Bregman
divergences [24] (with some assumptions). As corollaries, our results solve two open
problems posed by [12], viz., whether their methods for Euclidean co-clustering
could be extended to Bregman co-clustering, and if one could extend the ap-
proximation guarantees to tensor clustering. The bound also gives insight into
properties of the tensor clustering problem. We give an example for the tight-
ness of our bound for squared Euclidean distance, and provide an experimental
validation of the theoretical claims, which forms an additional contribution.
where the function d(x, y) measures cluster quality. The “center” µ_k of cluster
C_k is given by the mean of the points in C_k when d(x, y) is a Bregman divergence [25]. Co-clustering extends (2.1) to seek simultaneous partitions (and
centers µ_IJ) of rows and columns of X, so that the objective function

  J(C) = Σ_{I,J} Σ_{i∈I, j∈J} d(x_ij, µ_IJ) (2.2)

is minimized; µ_IJ denotes the (scalar) “center” of the cluster described by the
row and column index sets I and J. We generalize formulation (2.2) to
tensors in Section 2.2 after introducing some background on tensors.
2.1 Tensors
An order-m tensor A may be viewed as an element of the vector space R^{n_1×···×n_m}.
An individual entry of A is given by the multiply-indexed value a_{i_1 i_2 ... i_m}, where
i_j ∈ {1, ..., n_j} for 1 ≤ j ≤ m. For us, the most important tensor operation
where p^{(k)}_{ij} denotes the ij-th entry of matrix P_k. The inner product between two
tensors A and B is defined as

  ⟨A, B⟩ = Σ_{i_1,...,i_m} a_{i_1...i_m} b_{i_1...i_m} . (2.3)

Moreover, the Frobenius norm is ‖A‖² = ⟨A, A⟩. Finally, we define an arbitrary
divergence function d(X, Y) as an elementwise sum of individual divergences, i.e.,

  d(X, Y) = Σ_{i_1,...,i_m} d(x_{i_1,...,i_m}, y_{i_1,...,i_m}), (2.5)

and we will define the scalar divergence d(x, y) as the need arises.
Let A ∈ Rn1 ×···×nm be an order-m tensor that we wish to partition into co-
herent sub-tensors (or clusters). In 3D, we divide a cube into smaller cubes by
cutting orthogonal to (i.e., along) each dimension (Fig. 1). A basic approach is
to minimize the sum of the divergences between individual (scalar) elements in
each cluster to their corresponding (scalar) cluster “centers”. Readers familiar
with [4] will recognize this to be a “block-average” variant of tensor clustering.
Assume that each dimension j (1 ≤ j ≤ m) is partitioned into kj clusters. Let
Cj ∈ {0, 1}nj ×kj be the cluster indicator matrix for dimension j, where the ik-th
entry of such a matrix is one if and only if index i belongs to the k-th cluster
(1 ≤ k ≤ k_j) for dimension j. Then, the tensor clustering problem is (cf. (2.2)):

  minimize_{C_1,...,C_m,M} d(A, (C_1, ..., C_m) · M),  s.t. C_j ∈ {0, 1}^{n_j×k_j} , (2.6)
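For squared Euclidean distance, the combination scheme analyzed below (CoTeC) is simple enough to sketch in a few lines. The following is our own illustrative code with invented names (uniform seeding, plain Lloyd's iterations), not the authors' implementation: cluster each dimension separately with a 1D method, then score the induced blocks by their means.

```python
import numpy as np
from itertools import product

def kmeans_labels(X, k, rng, iters=50):
    """1D method: Lloyd's algorithm with uniform seeding (the 'r' variant)."""
    C = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    lab = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        lab = d2.argmin(1)
        for j in range(k):
            if (lab == j).any():
                C[j] = X[lab == j].mean(0)
    return lab

def cotec(A, ks, seed=0):
    """Cluster each dimension j of tensor A into ks[j] clusters by
    flattening its mode-j slices, then combine the 1D labels into blocks
    and evaluate the block-mean objective of (2.6)."""
    rng = np.random.default_rng(seed)
    labels = [kmeans_labels(np.moveaxis(A, j, 0).reshape(A.shape[j], -1),
                            k, rng) for j, k in enumerate(ks)]
    J = 0.0
    for block in product(*(range(k) for k in ks)):
        idx = [np.where(lab == b)[0] for lab, b in zip(labels, block)]
        if any(len(ix) == 0 for ix in idx):
            continue  # empty block: no contribution
        sub = A[np.ix_(*idx)]
        J += ((sub - sub.mean()) ** 2).sum()
    return labels, J
```

With an α-approximate 1D method plugged in, Theorem 1 below bounds the objective J returned by such a combination.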
Fig. 1. CoTeC: Cluster along dimensions one (C1), two (C2), three (C3) separately
and combine the results; µ_{3,1,3} is the mean of sub-tensor (cluster) (3,1,3). The various clusters in the final tensor clustering are color coded to indicate the combination of
contributions from clusters along each dimension.
denote the induced tensor clustering, and J_OPT(m) the best m-dimensional clustering. Then,

  J(C) ≤ p(m/t) ρ_d α_t J_OPT(m), with (3.1)
Thm. 1 is quite general, and it can be combined with some natural assumptions
(see §3.3) to yield results for tensor clustering with general divergence functions
(though ρd might be greater than 1). For particular choices of d one can perhaps
derive tighter bounds, though for squared Euclidean distances, we provide an
explicit example (Fig. 2) that shows the bound to be tight in 2D.
We begin our proof with the Euclidean case, i.e., d(x, y) = (x − y)². Our proof is
inspired by the techniques of [12]. We establish that, given a clustering algorithm
which clusters along t of the m dimensions at a time with an approximation
factor of α_t, CoTeC achieves an objective within a factor O((m/t) α_t) of the
optimal. For example, for t = 1 we can use the seeding methods of [8,9] or the
stronger approximation algorithms of [5]. We assume without loss of generality
(wlog) that m = 2^h t for an integer h (otherwise, pad in empty dimensions).
Since for the squared Frobenius norm each cluster “center” is given by the
mean, we can recast Problem (2.6) into a more convenient form. To that end,
note that the individual entries of the means tensor M are given by (cf. (2.2))

  M_{I_1...I_m} = (1 / (|I_1| ··· |I_m|)) Σ_{i_1∈I_1,...,i_m∈I_m} a_{i_1...i_m} , (3.2)

from which the last term is immediately seen to be zero using Property (2.4)
and the fact that P_j^⊥ P_j = P_j (I − P_j) = 0.
At level 0, the algorithm yields the collections Q_i^0 and P_i^0. The remaining clusterings are simply combinations, i.e., products, of these level-0 clusterings. We
denote the collection of m − 2^l t identity matrices (of appropriate size) by I^l,
so that Q_1^l = (P_1^l, I^l). Accoutered with our notation, we now prove the main
lemma that relates the combined clustering to its sub-clusterings.

Lemma 2. Let A be an order-m tensor and m ≥ 2^l t. The objective function for
any 2^l t-dimensional clustering P_i^l = (P_{2^l(i−1)+1}^0, ..., P_{2^l i}^0) can be bounded via the
sub-clusterings along only one set of dimensions of size t as
We can always (wlog) permute dimensions so that any set of 2^l t clustered dimensions maps to the first 2^l t ones. Hence, it suffices to prove the lemma for i = 1,
i.e., the first 2^l t dimensions.
Proof. We prove the lemma for i = 1 by induction on l.
Base: Let l = 0. Then Ql1 = Q01 , and (3.4) holds trivially.
Induction: Assume the claim holds for l ≥ 0. Consider a clustering P_1^{l+1} =
(P_1^l, P_2^l), or equivalently Q_1^{l+1} = Q_1^l Q_2^l. Using P + P^⊥ = I, we decompose A as

  A = (P_1^{l+1} + P_1^{(l+1)⊥}, I^{l+1}) · A = (P_1^l + P_1^{l⊥}, P_2^l + P_2^{l⊥}, I^{l+1}) · A
    = (P_1^l, P_2^l, I^{l+1}) · A + (P_1^{l⊥}, P_2^l, I^{l+1}) · A + (P_1^l, P_2^{l⊥}, I^{l+1}) · A + (P_1^{l⊥}, P_2^{l⊥}, I^{l+1}) · A
    = Q_1^l Q_2^l · A + Q_1^{l⊥} Q_2^l · A + Q_1^l Q_2^{l⊥} · A + Q_1^{l⊥} Q_2^{l⊥} · A,
where Q_1^{l⊥} = (P_1^{l⊥}, I^l). Since Q_1^{l+1} = Q_1^l Q_2^l, the Pythagorean Property 1 yields

  ‖A − Q_1^{l+1} · A‖² = ‖Q_1^{l⊥} Q_2^l · A‖² + ‖Q_1^l Q_2^{l⊥} · A‖² + ‖Q_1^{l⊥} Q_2^{l⊥} · A‖².

Combining the above equalities with the assumption (wlog) ‖Q_1^{l⊥} Q_2^l · A‖² ≥
‖Q_1^l Q_2^{l⊥} · A‖², we obtain the inequalities

  ‖A − Q_1^l Q_2^l · A‖² ≤ 2 (‖Q_1^{l⊥} Q_2^l · A‖² + ‖Q_1^{l⊥} Q_2^{l⊥} · A‖²)
    = 2 ‖Q_1^{l⊥} Q_2^l · A + Q_1^{l⊥} Q_2^{l⊥} · A‖² = 2 ‖Q_1^{l⊥} (Q_2^l + Q_2^{l⊥}) · A‖²
    = 2 ‖Q_1^{l⊥} · A‖² = 2 ‖A − Q_1^l · A‖²
    ≤ 2 max_{1≤j≤2} ‖A − Q_j^l · A‖² ≤ 2 · 2^l max_{1≤j≤2^{l+1}} ‖A − Q_j^0 · A‖²,

where the last step follows from the induction hypothesis (3.4), and the two
norm terms in the first line are combined using the Pythagorean Property.
Proof (Thm. 1, Case 1). Let m = 2^h t. Using an algorithm with guarantee α_t,
we cluster each subset (indexed by i) of t dimensions to obtain Q_i^0. Let S_i be the
optimal sub-clustering of subset i, i.e., the result that Q_i^0 would be if α_t were 1.
We bound the objective for the collection of all m sub-clusterings P_1^h = Q_1^h as
The first inequality follows from Lemma 2, while the last inequality follows from
the α_t approximation factor that we used to obtain the sub-clustering Q_j^0.
So far we have related our approximation to an optimal sub-clustering along
a set of dimensions. Let us hence look at the relation between such an optimal
sub-clustering S of the first t dimensions (via permutation, these dimensions
correspond to an arbitrary subset of size t) and the optimal tensor clustering F
along all the m = 2^h t dimensions. Recall that a clustering can be expressed either
by the projection matrices collected in Q_1^l, or by cluster indicator matrices
C_i together with the mean tensor M, so that
Let C_j^S and C_j^F be the dimension-wise cluster indicator matrices for S and F,
respectively. By definition, S solves

  min_{C_1,...,C_t,M} ‖A − (C_1, ..., C_t, I^0) · M‖²,  s.t. C_j ∈ {0, 1}^{n_j×k_j},

and the optimal value of this problem is at most

  min_M ‖A − (C_1^F, ..., C_t^F, I^0) · M‖² ≤ ‖A − (C_1^F, ..., C_m^F) · M_F‖² = ‖A − F · A‖², (3.6)
where M_F is the tensor of means for the optimal m-dimensional clustering. Combining (3.5) with (3.6) yields the final bound for the combined clustering C = Q_1^h:

  J_m(C) = ‖A − Q_1^h · A‖² ≤ 2^h α_t ‖A − F · A‖² = 2^h α_t J_OPT(m),

which completes the proof of the theorem.
Tightness of Bound. How tight is the bound for CoTeC implied by Thm. 1?
The following example shows that for Euclidean co-clustering, i.e., m = 2, the
bound is tight. Specifically, for every 0.25 > γ > 0, there exists a matrix for
which the approximation is as bad as J(C) = (m − γ)JOPT (m).
Fig. 2. Matrix with co-clustering approximation factor 2 − 2(1 + ε)^{−2}

Let ε be such that γ = 2(1 + ε)^{−2}. The optimal 1D row clustering C_1 for the
matrix in Figure 2 groups rows {1, 2} and {3, 4} together, and the optimal column
clustering is C_2 = ({a, b}, {c, d}). The co-clustering loss for the combination is
J_2(C_1, C_2) = 8 + 8ε². The optimal co-clustering, grouping columns {a, d} and
{b, c} (and rows as C_1 does), achieves an objective of J_OPT(2) = 4(1 + ε)². Relating these results,
we get J_2(C_1, C_2) = (2 − γ) J_OPT(m). However, this example is a worst-case
scenario; the average factor is much better in practice, as revealed by our experiments (§4). The latter, combined with the structure of this negative example,
suggests that with some assumptions on the data one can probably obtain tighter
bounds. Also note that the bound holds for a CoTeC-like scheme treating dimensions separately, but not necessarily for all approximation algorithms.
Since in general the best representative M is not the mean tensor, we cannot use
the shorthand P · A for M, so the proof is different from the Euclidean case.
The following lemma is the basis of the induction for this case of Thm. 1.
Lemma 3. Let A be of order m = 2^h t, and let R_i^l be the clustering of the i-th subset
of 2^l t dimensions (for l < h) with an approximation guarantee of α_{2^l t} — R_i^l
combines the C_j in a manner analogous to how Q_i^l combines projection matrices.
Then the combination R^{l+1} = R_i^l R_j^l, i ≠ j, satisfies

  min_M d(A, R^{l+1} · M) ≤ 3 α_{2^l t} min_M d(A, F^{l+1} · M),

where F^{l+1} is the optimal joint clustering of the dimensions covered by R^{l+1} (as
before, we always assume that R_i^l and R_j^l cover disjoint subsets of dimensions).
Proof. Without loss of generality, we prove the lemma for R_1^{l+1} = R_1^l R_2^l. Let
M_i^l = argmin_X d(A, R_i^l · X) be the associated representatives for i = 1, 2, and S_i^l
the optimal 2^l t-dimensional clusterings. Further, let F_1^{l+1} = F_1^l F_2^l be the optimal
2^{l+1} t-dimensional clustering. The following step is vital in relating the objective
values of R_1^{l+1} and S_i^l; the optimal sub-clusterings will eventually be bounded by
the objective of the optimal F_1^{l+1}. Let L = 2^{l+1} t, and

  M = argmin_X d(R_1^l M_1^l, R_1^l R_2^l · X),  X ∈ R^{k_1×...×k_L×n_{L+1}×...×n_m}.
Using this relation and the triangle inequality, we can now relate the objectives
for the combined clustering and for the optimal sub-clusterings:

  min_{X_1^l} d(A, S_1^l · X_1^l) ≤ min_{Y^l} d(A, F_1^l · Y^l) ≤ min_{Y^{l+1}} d(A, F_1^l F_2^l · Y^{l+1}),

and analogously for S_2^l. Plugging this inequality into (3.8), we get

  min_{M^{l+1}} d(A, R_1^l R_2^l · M^{l+1}) ≤ 3 α_{2^l t} min_{Y^{l+1}} d(A, F_1^l F_2^l · Y^{l+1}) = 3 α_{2^l t} min_{Y^{l+1}} d(A, F_1^{l+1} · Y^{l+1}).
Proof (Thm. 1, Case 2). Given Lemma 3, the proof of Thm. 1 for the metric
case follows easily by induction if we hierarchically combine the sub-clusterings
and use α_{2^{l+1} t} = 3 α_{2^l t}, for l ≥ 0, as stated by the lemma.
3.3 Implications
We now mention several important implications of Theorem 1.
on the curvature of the divergence B_f(x, y), we can invoke Thm. 1 with ρ_d =
σ_U/σ_L. The proofs are omitted for brevity and may be found in [27]. We would
like to stress that such curvature bounds seem to be necessary to guarantee
constant approximation factors for the underlying 1D clustering; this intuition
is reinforced by the results of [28], who avoided such curvature assumptions
and had to be content with a non-constant O(log n) approximation factor for
information-theoretic clustering.
4 Experimental Results
Our bounds depend strongly on the approximation factor α_t of an underlying
t-dimensional clustering method. In our experiments, we study this dependence
closely for t = 1, wherein we compare the tensor clusterings arising from different
1D methods of varying sophistication. Keep in mind that the comparison of the
1D methods serves to assess their impact on the tensor clustering built on top of them.
Our experiments reveal that the empirical approximation factors are usually
smaller than the theoretical bounds, and that these factors depend on statistical
properties of the data. We also observe the linear dependence of the CoTeC
objectives on the associated 1D objectives, as suggested by Thm. 1 (for Euclidean
distances) and Table 1 (2nd row, for KL Divergence).
Further comparisons show that, in practice, CoTeC is competitive with a
greedy heuristic SiTeC (Simultaneous Tensor Clustering), which takes all dimensions
into account simultaneously but lacks theoretical guarantees. As expected,
initializing SiTeC with CoTeC yields lower final objective values using fewer
“simultaneous” iterations.
We focus on Euclidean distance and KL Divergence to test CoTeC. To study the
effect of the 1D method, we use two seeding methods: uniform, and distance-based
(weighted farthest-first) drawing. The latter ensures 1D approximation factors
for E[J(C)] by [7] for Euclidean clustering and by [8,9] for KL Divergence.

Fig. 3. Tensor clustering variants: each seeding (uniform or distance-specific), optionally followed by 1D k-means, yields a CoTeC variant (r, rk, s, sk); adding SiTeC on top yields the corresponding SiTeC variants (rc, rkc, sc, skc)
We use each seeding by itself and as an initialization for k-means, obtaining four
1D methods for each divergence (see Fig. 3). We refer to the CoTeC combination
of the corresponding independent 1D clusterings by the following abbreviations: (1) ‘r’: uniformly sample centers from the data points and assign each point to its closest
center; (2) ‘s’: sample centers with distance-specific seeding [7,8,9] and assign
each point to its closest center; (3) ‘rk’: initialize Euclidean or Bregman k-means
with ‘r’; (4) ‘sk’: initialize Euclidean or Bregman k-means with ‘s’. A sketch of
the distance-based seeding appears below.
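For the Euclidean case, the distance-specific ('s') seeding can be sketched as follows; this is our own minimal version in the spirit of k-means++ [7], and the KL-divergence analogues of [8,9] replace the squared distance accordingly.

```python
import numpy as np

def distance_seeding(X, k, rng):
    """Pick k centers: the first uniformly, then each next one with
    probability proportional to the squared distance to the nearest
    center chosen so far (weighted farthest-first drawing)."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min(((X[:, None, :] - np.asarray(centers)[None, :, :]) ** 2)
                    .sum(-1), axis=1)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.asarray(centers)
```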
The SiTeC method we compare to is the minimum sum-squared residue co-clustering of [29] for Euclidean distances in 2D, and a generalization of Algorithm 1 of [4] for 3D and Bregman 2D clustering. Additionally, we initialize
SiTeC with the outcome of each of the four CoTeC variants, which yields four
versions (of SiTeC), namely rc, sc, rkc, and skc, initialized with the results
of ‘r’, ‘s’, ‘rk’, and ‘sk’, respectively. These variants inherit the guarantees of
CoTeC, as they monotonically decrease the objective value.
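SiTeC-style simultaneous refinement can be sketched as a coordinate-descent loop. The following is our own simplified block-average variant for squared Euclidean distance (the Bregman version of [4] follows the same pattern, and a real implementation would update the objective incrementally): with all other dimensions fixed, each index is reassigned to the cluster that lowers the objective.

```python
import numpy as np
from itertools import product

def _objective(A, labels, ks):
    """Block-mean objective of (2.6) for squared Euclidean distance."""
    J = 0.0
    for block in product(*(range(k) for k in ks)):
        idx = [np.where(lab == b)[0] for lab, b in zip(labels, block)]
        if any(len(ix) == 0 for ix in idx):
            continue
        sub = A[np.ix_(*idx)]
        J += ((sub - sub.mean()) ** 2).sum()
    return J

def sitec_step(A, labels, ks):
    """One sweep over all dimensions; labels[j][i] is the cluster of index i
    along dimension j. Returns True if some label changed."""
    changed = False
    for j in range(A.ndim):
        for i in range(A.shape[j]):
            orig = labels[j][i]
            best, best_cost = orig, np.inf
            for c in range(ks[j]):
                labels[j][i] = c
                cost = _objective(A, labels, ks)
                if cost < best_cost:
                    best, best_cost = c, cost
            labels[j][i] = best
            changed |= (best != orig)
    return changed
```

Initializing labels with a CoTeC result and iterating sitec_step until it returns False reproduces the 'c' variants (rc, sc, rkc, skc) in spirit; each sweep can only decrease the objective.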
Fig. 4. Approximation factors for 3D clustering (left) and co-clustering (right) with
increasing noise. Top row: Euclidean distances, bottom row: KL Divergence. The x
axis shows σ, the y axis the empirical approximation factor.
Table 2. (i) Improvement of CoTeC and SiTeC variants upon ‘r’ in %; the respective
reference value (J2 for ‘r’) is shaded in gray. (ii) Average number of SiTeC iterations.
Distance-specific seeding s yields better results than uniform seeding r, and adding k-means on top
(rk, sk) improves the results of both. With Euclidean distances, CoTeC with well-initialized 1D k-means (sk) competes with SiTeC. For KL Divergence, though,
SiTeC still improves on sk, and with high noise levels 1D k-means does not
help: both rk and sk are as good as their seeding-only counterparts r, s.
We further assess the behavior of our method with gene expression data⁴ from
multiple sources [30,31,32]. For brevity, we only introduce here two of the data sets
for which we present more detailed results; more datasets and experiments
are described in [27].
The matrix Bcell [30] is a (1332 × 62) lymphoma microarray dataset of chronic
lymphocytic leukemia, diffuse large B-cell lymphoma and follicular lymphoma. The
order-3 tensor Interferon consists of gene expression levels from MS patients
treated with recombinant human interferon beta [32]. After removal of missing
values, a complete 6 × 21 × 66 tensor remained. For experiments with KL Divergence, we normalized all tensors to have their entries sum up to one. Since
our analysis concerns the objective function J(C) alone, we disregard the “true”
labels, which are available for only one of the dimensions.
For each data set, we repeat the sampling of centers 30 times and average the
resulting objective values. Panel (i) in Table 2 (order-2) and in Table 3 (order-3)
shows the objective value for the simplest CoTeC variant ‘r’ as a baseline, and
the relative improvements achieved by the other methods. The methods are encoded
as x, xk, xc, xkc, where x stands for r or s, depending on the row in the table.
Table 3. (i) Improvement of CoTeC and SiTeC variants upon ‘r’ in %; the respective
reference value (J₃ for ‘r’) is shaded in gray

Interferon, KL
(i)  k1 k2 k3        x             xk     xc     xkc
     2  2  2   x=r   9.71 · 10⁻¹   38.58  42.46  43.53
               x=s   25.07         36.67  43.53  43.74
     2  2  3   x=r   8.17 · 10⁻¹   41.31  46.06  46.31
               x=s   33.63         43.90  46.82  47.16
     2  2  4   x=r   7.11 · 10⁻¹   39.79  44.05  45.62
               x=s   38.01         46.09  51.30  51.35
Figure 5 summarizes the average improvements for all five order-2 data sets
studied in [27]. Groups indicate methods, and colors indicate seeding techniques.
On average, a better seeding improves the results for all methods: the gray bars
are higher than their black counterparts in all groups. Just as for synthetic data,
1D k-means improves the CoTeC results here too. SiTeC (groups 3 and 4) is
better than CoTeC with mere seeding (r, s, group 1). Notably, for Euclidean

⁴ We thank Hyuk Cho for kindly providing us his preprocessed 2D data sets.
Fig. 5. (i) % improvement of the objective J2 (C) with respect to uniform 1D seeding
(r), averaged over all order-2 data sets and parameter settings (details in [27]). (ii)
average number of SiTeC iterations, in % with respect to initialization by r.
5 Conclusion
In this paper we presented CoTeC, a simple and, to our knowledge, the first approximation algorithm for tensor clustering, which yields approximation results
for Bregman co-clustering and tensor clustering as special cases. We proved an
approximation factor that grows linearly with the order of the tensor, and showed
tightness of the factor for the 2D Euclidean case (Fig. 2), though empirically the
observed factors are usually smaller than suggested by the theory.
Our worst-case example also illustrates the limitation of CoTeC, namely that
it ignores the interaction between clusterings along multiple dimensions. Thm. 1 thus gives
hints as to how much information maximally lies in this interaction. Analyzing this
interplay could potentially lead to better approximation factors, e.g., by developing a co-clustering-specific seeding technique. Using such an algorithm as a
subroutine in CoTeC would yield a hybrid that combines CoTeC's simplicity with
better approximation guarantees.
Acknowledgment. AB was supported in part by NSF grant IIS-0812183.
References
1. Banerjee, A., Basu, S., Merugu, S.: Multi-way Clustering on Relation Graphs. In:
SIAM Conf. Data Mining, SDM (2007)
2. Shashua, A., Zass, R., Hazan, T.: Multi-way Clustering Using Super-Symmetric
Non-negative Tensor Factorization. In: Leonardis, A., Bischof, H., Pinz, A. (eds.)
ECCV 2006. LNCS, vol. 3954, pp. 595–608. Springer, Heidelberg (2006)
3. Dhillon, I.S., Mallela, S., Modha, D.S.: Information-theoretic co-clustering. In:
KDD, pp. 89–98 (2003)
4. Banerjee, A., Dhillon, I.S., Ghosh, J., Merugu, S., Modha, D.S.: A Generalized
Maximum Entropy Approach to Bregman Co-clustering and Matrix Approxima-
tion. JMLR 8, 1919–1986 (2007)
5. Ackermann, M.R., Blömer, J.: Coresets and Approximate Clustering for Bregman
Divergences. In: ACM-SIAM Symp. on Disc. Alg., SODA (2009)
6. Ackermann, M.R., Blömer, J., Sohler, C.: Clustering for metric and non-metric
distance measures. In: ACM-SIAM Symp. on Disc. Alg. (SODA) (April 2008)
7. Arthur, D., Vassilvitskii, S.: k-means++: The Advantages of Careful Seeding. In:
ACM-SIAM Symp. on Discete Algorithms (SODA), pp. 1027–1035 (2007)
8. Nock, R., Luosto, P., Kivinen, J.: Mixed Bregman clustering with approximation
guarantees. In: Daelemans, W., Goethals, B., Morik, K. (eds.) ECML / PKDD
2008, Part II. LNCS (LNAI), vol. 5212, pp. 154–169. Springer, Heidelberg (2008)
9. Sra, S., Jegelka, S., Banerjee, A.: Approximation algorithms for Bregman cluster-
ing, co-clustering and tensor clustering. Technical Report 177, MPI for Biological
Cybernetics (2008)
10. Ben-David, S.: A framework for statistical clustering with constant time approx-
imation algorithms for K-median and K-means clustering. Mach. Learn. 66(2-3),
243–257 (2007)
11. Puolamäki, K., Hanhijärvi, S., Garriga, G.C.: An approximation ratio for biclus-
tering. Inf. Process. Letters 108(2), 45–49 (2008)
12. Anagnostopoulos, A., Dasgupta, A., Kumar, R.: Approximation algorithms for co-
clustering. In: Symp. on Principles of Database Systems, PODS (2008)
Approximation Algorithms for Tensor Clustering 383
13. Zha, H., Ding, C., Li, T., Zhu, S.: Workshop on Data Mining using Matrices and
Tensors. In: KDD (2008)
14. Hasan, M., Velazquez-Armendariz, E., Pellacini, F., Bala, K.: Tensor Clustering for
Rendering Many-Light Animations. In: Eurographics Symp. on Rendering, vol. 27
(2008)
15. Kolda, T.G., Bader, B.W.: Tensor Decompositions and Applications. SIAM Re-
view 51(3) (to appear, 2009)
16. Hartigan, J.A.: Direct clustering of a data matrix. J. of the Am. Stat. Assoc. 67(337), 123–129 (1972)
17. Cheng, Y., Church, G.: Biclustering of expression data. In: Proc. ISMB, pp. 93–103.
AAAI Press, Menlo Park (2000)
18. Dhillon, I.S.: Co-clustering documents and words using bipartite spectral graph
partitioning. In: KDD, pp. 269–274 (2001)
19. Bekkerman, R., El-Yaniv, R., McCallum, A.: Multi-way distributional clustering
via pairwise interactions. In: ICML (2005)
20. Agarwal, S., Lim, J., Zelnik-Manor, L., Perona, P., Kriegman, D., Belongie, S.:
Beyond pairwise clustering. In: IEEE CVPR (2005)
21. Govindu, V.M.: A tensor decomposition for geometric grouping and segmentation.
In: IEEE CVPR (2005)
22. Schölkopf, B., Smola, A.: Learning with Kernels. MIT Press, Cambridge (2001)
23. Hein, M., Bousquet, O.: Hilbertian metrics and positive definite kernels on proba-
bility measures. In: AISTATS (2005)
24. Censor, Y., Zenios, S.A.: Parallel Optimization: Theory, Algorithms, and Applica-
tions. Oxford University Press, Oxford (1997)
25. Banerjee, A., Merugu, S., Dhillon, I.S., Ghosh, J.: Clustering with Bregman Diver-
gences. JMLR 6(6), 1705–1749 (2005)
26. de Silva, V., Lim, L.H.: Tensor Rank and the Ill-Posedness of the Best Low-Rank
Approximation Problem. SIAM J. Matrix Anal. & Appl. 30(3), 1084–1127 (2008)
27. Jegelka, S., Sra, S., Banerjee, A.: Approximation algorithms for Bregman co-
clustering and tensor clustering (2009); arXiv:cs.DS/0812.0389v3
28. Chaudhuri, K., McGregor, A.: Finding metric structure in information theoretic
clustering. In: Conf. on Learning Theory, COLT (July 2008)
29. Cho, H., Dhillon, I.S., Guan, Y., Sra, S.: Minimum Sum Squared Residue based
Co-clustering of Gene Expression data. In: SDM, pp. 114–125 (2004)
30. Kluger, Y., Basri, R., Chang, J.T.: Spectral biclustering of microarray data: Co-
clustering genes and conditions. Genome Research 13, 703–716 (2003)
31. Cho, H., Dhillon, I.: Coclustering of human cancer microarrays using minimum
sum-squared residue coclustering. IEEE/ACM Tran. Comput. Biol. Bioinf. 5(3),
385–400 (2008)
32. Baranzini, S.E., et al.: Transcription-based prediction of response to IFNβ using
supervised computational methods. PLoS Biology 3(1) (2004)
Agnostic Clustering
1 Introduction
Problems of clustering data from pairwise distance or similarity information
are ubiquitous in science. Typical examples of such problems include clustering
proteins by function, images by subject, or documents by topic. In many of
these clustering applications there is an unknown target or desired clustering,
and while the distance information among data is merely heuristically defined,
the real goal in these applications is to minimize the clustering error with respect
to the target clustering.
A commonly used approach for data clustering is to first choose a particular
distance-based objective function Φ (e.g., k-median or k-means) and then design
a clustering algorithm that (approximately) optimizes this objective function [1,
2, 7]. The implicit hope is that approximately optimizing the objective function
will in fact produce a clustering of low clustering error, i.e. a clustering that is
pointwise close to the target clustering. Mathematically, the implicit assumption
is that the clustering error of any c-approximation to Φ on the data set is bounded
by some ε. We will refer to this assumed property as the (c, ε) property for Φ.
This work was done in part while the authors were at Microsoft Research, New
England.
Supported by a fellowship within the Postdoc-Program of the German Academic
Exchange Service (DAAD).
Balcan, Blum, and Gupta [3] have shown that by making this implicit assumption explicit, one can efficiently compute a low-error clustering even in
cases when the approximation problem of the objective function is NP-hard. In
particular, they show that for any c = 1 + α > 1, if the data satisfies the (c, ε)
property for the k-median or the k-means objective, then one can produce a
clustering that is O(ε)-close to the target, even for values c for which obtaining
a c-approximation is NP-hard.
However, the (c, ε) property is a strong assumption. In real data there may
well be some data points for which the (heuristic) distance measure does not
reflect cluster membership well, causing the (c, ε) property to be violated. A
more realistic assumption is that the data satisfies the (c, ε) property only after
some number of outliers or ill-behaved data points, i.e., a ν fraction of the data
points, have been removed. We will refer to this property as the (ν, c, ε) property.
While the (c, ε) property leads to the situation that all plausible clusterings
(i.e., all the clusterings satisfying the (c, ε) property) are O(ε)-close to each
other, two different sets of outliers could result in two different clusterings satisfying the (ν, c, ε) property. We therefore analyze the clustering complexity of this
property [4], i.e., the size of the smallest ensemble of clusterings such that any
clustering satisfying the (ν, c, ε) property is close to a clustering in the ensemble;
we provide tight upper and lower bounds on this quantity for several interesting
cases, as well as efficient algorithms for outputting a list such that any clustering
satisfying the property is close to one of those in the list.
Perspective. The clustering framework we analyze in this paper is related in
spirit to the agnostic learning model in the supervised learning setting [6]. In the
Probably Approximately Correct (or PAC) learning model of Valiant [8], also
known as the realizable setting, the assumption is that the data distribution over
labeled examples is correctly classified by some fixed but unknown concept in
some concept class, e.g., by a linear separator. In the agnostic setting [6], however, the assumption is weakened to the hope that most of the data is correctly
classified by some fixed but unknown concept in some concept space, and the
goal is to compete with the best concept in the class by an efficient algorithm.
Similarly, one can view the (ν, c, ε) property as an agnostic version of the (c, ε)
property, since we assume that the (ν, c, ε) property is satisfied if the (c, ε) property is satisfied on most but not all of the points, and moreover the points where
the property is not satisfied are adversarially chosen.
Our results. We present several algorithmic and information-theoretic results
in this new clustering model.
For most of this paper we focus on the k-median objective function. In the
case where the target clusters are large (have size Ω((ε/α + ν)n)), we show that
the algorithm in [3] can be used to output a single clustering that is
(ν + ε)-close to the target clustering. We then show that in the more general
case there can be multiple significantly different clusterings that satisfy the
(ν, c, ε) property. This is true even in the case where most of the points come
from large clusters; in this case, however, we show that we can in polynomial
time output a small list of k-clusterings such that any clustering that satisfies
the property is close to one of the clusterings in the list. In the case where most
of the points come from small clusters, we provide information-theoretic bounds
on the clustering complexity of this property.
We also show how both the analysis in [3] for the (c, ε) property and our
analysis for the (ν, 1 + α, ε) property can be adapted to the inductive case, where
we imagine our given data is only a small random sample of the entire data set.
Based on the sample, our algorithm outputs a clustering or a list of clusterings of
the full domain set that are evaluated with respect to the underlying distribution.
We conclude by discussing how our analysis extends to the k-means objective
function as well.
2 The Model
The clustering problems we consider fall into the following general framework:
we are given a metric space M = (X, d) with point set X and a distance function
d : X² → R≥0 satisfying the triangle inequality; this is the ambient space.
We are also given the actual point set S ⊆ X we want to cluster; we use n to
denote the cardinality of S. A k-clustering C is a partition of S into k (possibly
empty) sets C_1, C_2, ..., C_k. In this work, we always assume that there is a true
or target k-clustering C_T for the point set S.
Commonly used clustering algorithms seek to minimize some objective function or “score”. For example, the k-median clustering objective assigns to each
cluster C_i a “median” c_i ∈ C_i and seeks to minimize

  Φ_1(C) = Σ_{i=1}^{k} Σ_{x∈C_i} d(x, c_i).

Another example is the k-means clustering objective, which assigns to each cluster C_i a “center” c_i ∈ X and seeks to minimize

  Φ_2(C) = Σ_{i=1}^{k} Σ_{x∈C_i} d(x, c_i)².
Given a function Φ and an instance (M, S), let OPTΦ = minC Φ(C), where the
minimum is over all k-clusterings of S.
The notion of distance between two k-clusterings C = {C_1, C_2, ..., C_k} and
C′ = {C′_1, C′_2, ..., C′_k} that we use throughout the paper is the fraction of points
on which they disagree under the optimal matching of clusters in C to clusters
in C′; we denote this as dist(C, C′). Formally,

  dist(C, C′) = min_{σ∈S_k} (1/n) Σ_{i=1}^{k} |C_i − C′_{σ(i)}| ,
where S_k is the set of bijections σ : {1, ..., k} → {1, ..., k}. We say that two
clusterings C and C′ are ε-close if dist(C, C′) ≤ ε, and we say that a clustering
has error ε if it is ε-close to the target.
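Computed naively, the minimum over all k! bijections is expensive; it is, however, an assignment problem, so a Hungarian-method solver finds the optimal matching in polynomial time. A sketch (ours, with label vectors as input; all names are illustrative):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_dist(lab1, lab2, k):
    """dist(C, C') = (1/n) * min over bijections sigma of
    sum_i |C_i - C'_sigma(i)|; maximizing the matched overlap
    |C_i ∩ C'_sigma(i)| is an equivalent assignment problem."""
    n = len(lab1)
    overlap = np.zeros((k, k), dtype=int)
    for a, b in zip(lab1, lab2):
        overlap[a, b] += 1
    rows, cols = linear_sum_assignment(-overlap)  # maximize total overlap
    return 1.0 - overlap[rows, cols].sum() / n
```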
The (1 + α, ε)-property. The following notion, originally introduced in [3] and
later studied in [5], is central to our discussion:

Definition 1. Given an objective function Φ (such as k-median or k-means),
we say that instance (S, d) satisfies the (1 + α, ε)-property for Φ with respect to
the target clustering C_T if all clusterings C with Φ(C) ≤ (1 + α) · OPT_Φ are ε-close
to the target clustering C_T for (S, d).
The (ν, 1 + α, ε)-property. In this paper, we study the following more robust
variation of Definition 1:

Definition 2. Given an objective function Φ (such as k-median or k-means),
we say that instance (S, d) satisfies the (ν, 1 + α, ε)-property for Φ with respect
to the target clustering C_T if there exists a set of points S′ ⊆ S of size at least
(1 − ν)n such that (S′, d) satisfies the (1 + α, ε)-property for Φ with respect to
the clustering C_T ∩ S′ induced by the target clustering on S′.
In other words, our hope is that the (1 + α, ε)-property for objective Φ is satisfied
only after outliers or ill-behaved data points have been removed. Note that, unlike
the case ν = 0, in general the (ν, 1 + α, ε)-property could be satisfied with respect
to multiple significantly different clusterings, since we allow the set of outliers or
ill-behaved data points to be arbitrary. As a consequence, we will be interested in
the size of the smallest list any algorithm could hope to output that guarantees
that at least one clustering in the list has small error. Given the instance (S, d),
we say that a given clustering C is consistent with the (ν, 1 + α, ε)-property for Φ
if (S, d) satisfies the (ν, 1 + α, ε)-property for Φ with respect to C. The following
notion, originally introduced in [4], provides a formal measure of the inherent
usefulness of a given property.
Definition 3. Given an instance (S, d) and the (ν, 1 + α, ε)-property for Φ, we
define the (γ, k)-clustering complexity of the instance (S, d) with respect to the
(ν, 1 + α, ε)-property for Φ to be the length of the shortest list of clusterings
h_1, ..., h_t such that any consistent k-clustering is γ-close to some clustering in
the list. The (γ, k)-clustering complexity of the (ν, 1 + α, ε)-property for Φ is the
maximum of this quantity over all instances (S, d).

Ideally, the (ν, 1 + α, ε)-property should have (γ, k)-clustering complexity polynomial in k, 1/ε, 1/ν, 1/α, and 1/γ. Sometimes we analyze the clustering complexity
of our property restricted to some family of interesting clusterings. We define
this analogously:
Theorem 6 ([3]). Assume that the k-median instance satisfies the (1 + α, ε)-property. If each cluster in C_T has size at least (3 + 10/α)εn + 2, then given w
we can efficiently find a clustering that is ε-close to C_T. If each cluster in C_T has
size at least (4 + 15/α)εn + 2, then we can efficiently find a clustering that is
ε-close to C_T even without being given w.
Since some of the elements of this construction are essential in our subsequent
proofs, we summarize in the following the main ideas of this proof.
Main Ideas of the Construction. Assume first that we are given w. We use
Algorithm 1 with τ = 2αw/5 and b = (1 + 5/α)ε. For the analysis, let us define
dcrit = αw/5. We call a point x good if both w(x) < dcrit and w2(x) − w(x) ≥ 5dcrit;
otherwise x is called bad. By Lemma 5 and the fact that ε∗ ≤ ε, if all clusters in
the target have size greater than 2εn, then at most a (1 + 5/α)ε-fraction of the
points is bad. Let Xi be the good points in the optimal cluster Ci∗, and let
B = S \ ∪Xi be the bad points. For instances satisfying the (1 + α, ε)-property,
the threshold graph Gτ defined in Algorithm 1 has the following properties:
(i) for all x, y in the same Xi, the edge {x, y} ∈ E(Gτ ); (ii) for x ∈ Xi and
y ∈ Xj with j ≠ i, {x, y} ∉ E(Gτ ). Moreover, such points x, y do not share any
neighbors in Gτ (by the triangle inequality). This implies that each Xi is contained
in a distinct component of the graph Hτ,b; the remaining components of Hτ,b
contain vertices from the “bad bucket” B. Since the Xi’s are larger than B, the
clustering C′ obtained in Step 3 by taking the largest k components in Hτ,b
and adding the vertices of all other smaller components to one of them differs
from the optimal clustering C∗ only in the bad points, which constitute an O(ε/α)
fraction of the total.
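Although Algorithm 1 itself is not reproduced in this summary, the graph-based
steps just described can be rendered as the following Python sketch. We assume
here, following the construction in [3], that Gτ joins points at distance at most τ
and that Hτ,b joins points sharing more than bn common neighbors in Gτ ; all
identifiers are our own.

    import networkx as nx

    def h_components(dist, tau, b_frac):
        # Connected components of H_{tau,b}, largest first; dist is an
        # n x n array of pairwise distances, b_frac plays the role of b.
        n = dist.shape[0]
        g = nx.Graph()
        g.add_nodes_from(range(n))
        for i in range(n):
            for j in range(i + 1, n):
                if dist[i, j] <= tau:          # threshold graph G_tau
                    g.add_edge(i, j)
        neigh = {v: set(g.neighbors(v)) for v in range(n)}
        h = nx.Graph()
        h.add_nodes_from(range(n))
        for i in range(n):
            for j in range(i + 1, n):
                if len(neigh[i] & neigh[j]) > b_frac * n:
                    h.add_edge(i, j)
        return sorted((set(c) for c in nx.connected_components(h)),
                      key=len, reverse=True)

    def threshold_clustering(dist, tau, b_frac, k):
        comps = h_components(dist, tau, b_frac)
        clusters = [set(c) for c in comps[:k]]
        for c in comps[k:]:          # Step 3: absorb each smaller
            clusters[0] |= c         # component into one of the k clusters
        return clusters

Here the smaller components are simply dumped into the first cluster; the
analysis only requires that they be attached to some of the k largest components.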
To argue that the clustering C′ is ε-close to CT , we call a point x “red” if it
satisfies w2(x) − w(x) < 5dcrit, “yellow” if it is not red but w(x) ≥ dcrit, and
“green” otherwise. So, the green points are those in the sets Xi, and we have
partitioned the bad set B into red points and yellow points. The clustering C′
agrees with C∗ on the green points, so without loss of generality we may assume
Xi ⊆ C′i. Since each cluster C′i has a strict majority of green points, all of
which are clustered as in C∗, for a non-red point x the median
distance to points in its correct cluster with respect to C∗ is less than the median
distance to points in any incorrect cluster. Thus, C′ agrees with C∗ on all non-red
points. Since there are at most (ε − ε∗)n red points on which CT and C∗ agree
by Lemma 5 — and C′ and CT might disagree on all these points — this implies
dist(C′, CT ) ≤ (ε − ε∗) + ε∗ = ε, as desired.
The “unknown w” Case. If we are not given the value w, and every target
cluster has size at least (4 + 15/α)εn + 2, we instead run Algorithm 1 (with
τ = 2αw/5 and b = (1 + 5/α)ε) repeatedly for different values of w, starting with
w = 0 (so the graph Gτ is empty) and at each step increasing w to the next
value such that Gτ contains at least one new edge. We say that a point is
missed if it does not belong to the k largest components of Hτ,b. The number
of missed points decreases with increasing w, and we stop with the smallest w
for which we miss at most bn = (1 + 5/α)εn points and each of the k largest
components contains more than 2bn points. Clearly, for the correct value of
w, we miss at most bn points because we miss only bad points. Additionally,
every Xi contains more than 2bn points. This implies that our guess for w
can only be smaller than the correct w, and the resulting graphs Gτ and Hτ,b
can only have fewer edges than the corresponding graphs for the correct w.
However, since we miss at most bn points and every set Xi contains more than
bn points, there must be good points from every good set Xi that are not missed.
Hence, each of the k largest components corresponds to a distinct cluster Ci∗. We
might misclassify all bad points and at most bn good points (those not in the
k largest components), but this nonetheless guarantees that each C′i contains at
least |Xi| − bn ≥ bn + 2 correctly clustered green points (with respect to C∗) and
at most bn misclassified points. Therefore, as shown above for the case of known
w, the resulting clustering C′ will correctly cluster all non-red points as in C∗
and so is at distance at most ε from CT .
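The search over w can be sketched as follows. The candidate values are exactly
those at which Gτ gains a new edge, i.e. w = 5d(x, y)/(2α) for some pair {x, y};
the sketch reuses the hypothetical h_components routine from above and is,
again, only an illustration of the procedure described in the text.

    def find_w(dist, alpha, b_frac, k):
        # Smallest w such that at most b*n points are missed by the k
        # largest components of H_{tau,b} and each of those components
        # contains more than 2*b*n points.
        n = dist.shape[0]
        candidates = sorted({5 * dist[i, j] / (2 * alpha)
                             for i in range(n) for j in range(i + 1, n)})
        for w in candidates:
            tau = 2 * alpha * w / 5
            comps = h_components(dist, tau, b_frac)[:k]
            missed = n - sum(len(c) for c in comps)
            if missed <= b_frac * n and all(len(c) > 2 * b_frac * n
                                            for c in comps):
                return w, comps
        return None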
a clustering C1, . . . , Ck of the sample S that is O(b)-close to the target on the
sample. In particular, all good points in the sample that are in the same cluster
form cliques in the graph Hτ,b, and good points from different clusters lie in
different connected components of this graph. So, taking the largest connected
components of this graph gives us a clustering that is O(b)-close to the target
clustering restricted to the sample S.
If we do not know w, then we use the same approach as in Theorem 6. That
is, we start by setting w = 0 and increase it until the k largest components in the
corresponding graph Hτ,b cover a large fraction of the points. The key point is
that the correctness of this approach follows from the fact that the number of
good points in every cluster is more than twice the total number of bad points.
As we have argued above, this holds with probability at least 1 − δ for the
sample S as well; hence, using arguments similar to those in Theorem 6, we can
cluster the whole space with error at most ε.
Note that one can speed up Algorithm 2 as follows. Instead of repeatedly calling
Algorithm 1 from scratch, we can store the graphs G and H and only add
new edges to them in every iteration of Algorithm 2. Note also that in the test
phase, when a new point z arrives, we compute for every cluster Ci the median
distance of z to all sample points in Ci (and not to all the points added to Ci so
far), and assign z to the cluster that minimizes this median distance. Note finally
that a natural approach that does not work (due to the bad points) is to compute a
centroid/median for each Ci and then insert new points based on the resulting
Voronoi diagram.
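The test-phase rule just described admits a direct sketch; here d is an assumed
distance oracle and sample_clusters holds, for each cluster, only its sample
points (both names are ours):

    import statistics

    def insert_point(z, sample_clusters, d):
        # Assign z to the cluster minimizing the median distance from z
        # to that cluster's *sample* points (never to points added later).
        best, best_med = None, float("inf")
        for i, cluster in enumerate(sample_clusters):
            med = statistics.median(d(z, x) for x in cluster)
            if med < best_med:
                best, best_med = i, med
        return best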
Proof Sketch. Let A1, . . . , Ak be sets of size n(1 − ν)/k and let x1, . . . , xk be ad-
ditional points not belonging to any of the sets A1, . . . , Ak such that the optimal
k-median solution on the set A1 ∪ . . . ∪ Ak is the clustering C = {A1, . . . , Ak}
and the instance (A1 ∪ . . . ∪ Ak, d) satisfies the (1 + α, ε)-property. We assume
that S ⊆ N and that every set Ai consists of n(1 − ν)/k points at exactly the
same position ai ∈ N. In our construction, we will have a1 < . . . < ak.
By placing the point x1 very far away from all the sets Ai and by placing A1
and A2 much closer together than any other pair of sets, we can achieve that
the optimal k-median solution on the set A1 ∪ . . . ∪ Ak ∪ {x1} is the clustering
{A1 ∪ A2, A3, . . . , Ak, {x1}} and that the instance (A1 ∪ . . . ∪ Ak ∪ {x1}, d) satisfies
the (1 + α, ε)-property. We can continue analogously and place x2 very far away
from all the sets Ai and from x1. Then the optimal k-median clustering on the
set A1 ∪ . . . ∪ Ak ∪ {x1, x2} will be {A1 ∪ A2 ∪ A3, A4, . . . , Ak, {x1}, {x2}} if
A2 and A3 are much closer together than Ai and Ai+1 for i ≥ 3. This instance
also satisfies the (1 + α, ε)-property. In this way, each of the clusterings {A1 ∪ . . . ∪
Ai, Ai+1, . . . , Ak, {x1}, {x2}, . . . , {xi−1}} is a consistent target clustering, and the
distance between any two of them is at least γ.
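To make the construction concrete, the following toy sketch lays out such an
instance on the line; the spacings and the far-away distance are placeholder
values of our own, not the calibrated constants of the proof.

    def build_lower_bound_instance(n, k, nu, num_outliers,
                                   gap=10.0, far=1e9):
        # k co-located groups A_1, ..., A_k with rapidly growing gaps,
        # plus far-away singleton points x_1, x_2, ...
        group_size = int(n * (1 - nu) / k)
        points, pos = [], 0.0
        for i in range(k):
            points.extend([pos] * group_size)  # all of A_{i+1} sits at pos
            pos += gap ** (i + 1)              # consecutive gaps grow
        outliers = [pos + far * (j + 1) for j in range(num_outliers)]
        return points, outliers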
Note that in the example in Theorem 10, all the clusterings that satisfy the
(ν, 1 + α, ε)-property have the feature that the total number of points that come
from large clusters (of size at least n(1 − ν)/k) is at least (1 − ν)n. We show that
in such cases we also have an upper bound of k on the clustering complexity.
Proof. The main idea of the proof is to use the structure of the graphs H to show
that the clusterings that are consistent with the (ν, 1 + α, ε)-property are almost
laminar with respect to each other. Note that for all w′ < w we have Gw′ ⊆ Gw
and Hw′ ⊆ Hw. Here we use Gw and Hw as abbreviations for Gτ and Hτ with
τ = 2αw/5. In the following, we say that a cluster is large if it contains at least 2bn
elements. To find a list of clusterings that “covers” all the relevant clusterings, we
use the following algorithm. We keep increasing the value of w until we reach a
value w1 such that the following is satisfied: let K1, . . . , Kk denote the k largest
connected components of the graph Hw1 and assume |K1| ≥ |K2| ≥ . . . ≥ |Kk|.
We set k1 = max{i ∈ [k] : |Ki| ≥ bn} and stop at the smallest w1 for which the
clusters K1, . . . , Kk1 together cover a significant fraction of the space, namely a
1 − (b + β) fraction. Let S̃ = K1 ∪ . . . ∪ Kk1. The first clustering we add to the
list contains a cluster for each of the components K1, . . . , Kk1, and it assigns the
points in S \ S̃ arbitrarily to these clusters. Now we increase the value of w, and
each time we add an edge in Hw between two points in different components Ki
and Kj, we merge the corresponding clusters to obtain a new clustering with at
least one cluster fewer. We add this clustering to our list and continue until only
one cluster is left. Since the number of clusters decreases by at least one in every
step, the list of clusterings produced this way has length at most k1 ≤ k. Let
w1, w2, . . . denote the values of w at which clusterings are added to the list.
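The sweep just described can be rendered as the following sketch, again on top
of the hypothetical h_components routine from above. For brevity it records one
clustering per candidate value of w at which components merge, rather than one
per individual edge; b and β are passed in as fractions.

    def build_clustering_list(dist, alpha, b_frac, beta, k):
        n = dist.shape[0]
        ws = sorted({5 * dist[i, j] / (2 * alpha)
                     for i in range(n) for j in range(i + 1, n)})
        # Phase 1: find w1 and the components K_1, ..., K_{k1}.
        for t, w in enumerate(ws):
            comps = h_components(dist, 2 * alpha * w / 5, b_frac)
            big = [c for c in comps[:k] if len(c) >= b_frac * n]
            if sum(len(c) for c in big) >= (1 - (b_frac + beta)) * n:
                break
        else:
            return []
        label = {x: i for i, comp in enumerate(big) for x in comp}
        for x in range(n):
            label.setdefault(x, 0)       # points in S \ S~: arbitrary
        out = [dict(label)]
        # Phase 2: raise w further; whenever the H_w components holding
        # the K_i merge, record the coarsened clustering.
        reps = [next(iter(c)) for c in big]
        num = len(big)
        for w in ws[t + 1:]:
            comps = h_components(dist, 2 * alpha * w / 5, b_frac)
            comp_of = {x: ci for ci, c in enumerate(comps) for x in c}
            new_label = {i: comp_of[r] for i, r in enumerate(reps)}
            if len(set(new_label.values())) < num:
                num = len(set(new_label.values()))
                out.append({x: new_label[label[x]] for x in range(n)})
            if num == 1:
                break
        return out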
To complete the proof, we show that any clustering C satisfying the property is
(2b + β)-close to one of the clusterings in the list we constructed. Let wC denote
the value corresponding to C. First we notice that wC ≥ w1. This follows easily
from the structure of the graph HwC : it has one connected component for every
large cluster in C, and each of these components must contain at least bn points,
as every large cluster contains at least 2bn points and the bad set contains at most
bn points. Also, by definition and the fact that the size of the bad set is bounded
by bn, these components together cover at least a 1 − (b + β)
fraction of the points. This proves that wC ≥ w1 by the definition of w1. Now let
i be maximal such that wi ≤ wC . We show that the clustering we output at wi is
(2b + β)-close to the clustering C. Let K′1, . . . , K′k1 denote the components in Hwi
that evolved from the Ki, and let K″1, . . . , K″k1 denote the evolved components
in HwC . As wC < wi+1, we can assume (up to reordering) that K′i = K″i
on the set S̃. As all points in S̃ that are not in the bad set for wi are clustered
in C according to the components K″1, . . . , K″k1, the clusterings corresponding
to wi and wC can only differ on S \ S̃ and the bad set for wi. Using the fact that
|S \ S̃| ≤ (b + β)n and that the size of the bad set is bounded by bn, we get that
the clustering we output at wi is (2b+β)-close to the clustering C, as desired.
Moreover, if every large cluster is at least as large as (12 + 20/α)εn + 2νn + 2βn,
then, since for w1 the size of the missed set is at most (6 + 10/α)εn + νn + βn, the
intersection of the good set with every large cluster is larger than the missed set
for wi for any i. This then implies that if we apply the median argument from
Step 4 of Algorithm 1, the clustering we get for wi is (ν + ε + β)-close to the
clustering C if i is chosen as in the previous proof. Together with Theorem 11
this implies the following corollary.
Corollary 12. Let b = (6 + 10/α)ε + ν. Let F be the family of clusterings with
the property that the average cluster size n/k is at least 2bn/(1 − β). Then the
(ν + ε + β, k) restricted clustering complexity of the (ν, 1 + α, ε)-property with
respect to F is at most k, and we can efficiently construct a list of length at most
k such that any clustering in F that is consistent with the (ν, 1 + α, ε)-property
is (ν + ε + β)-close to one of the clusterings in the list.
The Inductive Case. We show here how the algorithm in Theorem 11 can be
extended to the inductive setting.
Theorem 13. Let b = (6 + 10/α)ε + ν. Let F be the family of clusterings with
the property that the total number of points that come from clusters of size at
least 2bn is at least (1 − β)n. If we draw a sample S of size O((1/ε) · ln(k/δ)),
then we can efficiently produce a list of length at most k such that any clustering
in the family F that is consistent with the (ν, 1 + α, ε)-property is 3(2b + β)-close
to one of the clusterings in the list with probability at least 1 − δ.
Proof Sketch. In the training phase, we run the algorithm in Theorem 11
over the sample to get a list of clusterings L. Then we run an independent
“test phase” for each clustering in this list. Let C be one such clustering in the
list L, with clusters C1, . . . , Ck, and let S̃ be the set of relevant points defined in
Theorem 11. In the test phase, when a new point x comes in, we compute
for each cluster Ci the median distance of x to Ci ∩ S̃, and insert it into the
cluster Ci to which it has the smallest median distance.
To prove correctness we use the fact that, as shown in Theorem 11, the
(2b + β, k)-clustering complexity of the (ν, 1 + α, ε)-property is at most k when
restricted to clusterings in which the total number of points coming from clusters
of size at least 2bn is at least (1 − β)n. Let L′ be a list of k1 ≤ k clusterings such
that any consistent clustering is (2b + β)-close to one of them.
Now the argument is similar to the one in Theorem 7. In the proof of that
theorem, we used a Chernoff bound to argue that, with probability at least 1 − δ,
the good set of any cluster that is contained in the sample is more than twice
as large as the total bad set in the sample. Now we additionally apply a union
bound over the at most k clusterings in the list L′ to ensure this property for
each of the clusterings. From that point on, the arguments are analogous to the
arguments in Theorem 7.
If there are two different consistent clusterings C^1 and C^2 that have the same
value w then, by the properties of Gw, all points in S \ (B1 ∪ B2) are identically
clustered, where Bi denotes the bad set of C^i. Hence, dist(C^1, C^2) ≤
(|B1| + |B2|)/n ≤ 2b. This implies that we do not lose too much by choosing, for
every value w with multiple consistent clusterings, one of them as representative.
To be precise, let w1 < w2 < · · · < ws be a list of all values for which a correct
clustering exists and, for every wi, let C^i denote a correct clustering with value
wi. We construct a sparsified list L of clusterings as follows: insert C^1 into L;
if the last clustering added to L is C^i, add C^j for the smallest j > i for which
dist(C^i, C^j) ≥ (k + 2)b. This way, the list L will contain clusterings
C′^1, . . . , C′^s′ with values w′1, . . . , w′s′ such that every correct clustering
is (k + 4)b-close to at least one of the clusterings in L.
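The sparsification step admits a one-pass sketch; dist_fn stands for the
clustering distance (e.g. the matching-based sketch given earlier), and cs is the
list C^1, C^2, . . . ordered by increasing value w:

    def sparsify(cs, k, b, dist_fn):
        # Keep C^1; thereafter keep the next clustering whose distance
        # from the last kept one is at least (k + 2) * b.
        kept = [cs[0]]
        for c in cs[1:]:
            if dist_fn(kept[-1], c) >= (k + 2) * b:
                kept.append(c)
        return kept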
It remains to bound the length s′ of the list L. Let us assume for contradiction
that s′ ≥ k + 1. According to the properties of the graphs Gw′i, the clusterings
that are induced by the clusterings C′^1, . . . , C′^(k+1) on the set
S \ (B′1 ∪ . . . ∪ B′(k+1)) are laminar, where B′i is the bad set of C′^i. Further-
more, as the bad set B′1 ∪ . . . ∪ B′(k+1) has size at most (k + 1)bn, two
consecutive clusterings in the list must differ on the set S \ (B′1 ∪ . . . ∪ B′(k+1)),
which, together with the laminarity, implies that two clusters must have
merged. This can happen at most k − 1 times, contradicting the assumption that
s′ ≥ k + 1.
We will improve the result in the above proposition by imposing that consec-
utive clusterings in the list L in the above proof are significantly different in
the laminar part. In particular, we will make use of the following lemma, which
shows that if we have a laminar list of clusterings, then the sum of the pairwise
distances between consecutive clusterings cannot be too large; this implies that if
the pairwise distances between consecutive clusterings are all large, then the list
must be short.
Proof. When going from C^i to C^(i+1), clusters contained in the clustering C^i
merge into bigger clusters contained in C^(i+1). Merging the clusters
K1, . . . , Kℓ ∈ C^i with |K1| ≥ |K2| ≥ · · · ≥ |Kℓ| into a cluster K ∈ C^(i+1)
contributes (|K2| + · · · + |Kℓ|)/n to the distance between C^i and C^(i+1). When
going from C^i to C^(i+1), multiple such merges can occur, and we know that their
total contribution to the distance must be at least β. We consider a single merge
in which the pieces K1, . . . , Kℓ ∈ C^i merge into K ∈ C^(i+1) virtually as ℓ − 1
merges and associate with each of them a type. We say that the merge
corresponding to Ki, i = 2, . . . , ℓ, has type j ∈ N if |Ki| ∈ [n/2^(j+1), n/2^j).
If Ki has type j, we say that the data points contained in Ki participate in a
merge of type j.
For the step from C^i to C^(i+1), let xij denote the total number of virtual merges
of type j that occur. The number of merges of type j that can occur during the
whole sequence from C^1 to C^s is bounded from above by 2^(j+1), as each of the
n data points can participate at most once in a merge of type j. This follows
because after a point participates in a merge of type j, it belongs to a cluster of
size at least 2 · n/2^(j+1) = n/2^j, and clusters only grow along the sequence.
This yields

    (s − 1)β/4 ≤ Σ_{i=1}^{s−1} Σ_{j=1}^{L} xij / 2^(j+1) ≤ L,

and hence s ≤ 4L/β + 1.
Open Questions. The main concrete technical questions left open are whether
one can show a better upper bound on the clustering complexity in the case of
small target clusters, and whether in this case there is an efficient algorithm for
constructing a short list of clusterings such that every consistent clustering is
close to one of the clusterings in the list.
More generally, it would also be interesting to analyze other natural variations
of the (c, ε)-property. For example, a natural direction would be to consider
variations expressing the belief that only the c-approximate clusterings that might
be returned by natural approximation algorithms are close to the target. In
particular, many approximation algorithms for clustering return Voronoi-based
clusterings [7]. In this context, a natural relaxation of the (c, ε)-property is to
assume that only the Voronoi-based clusterings that are c-approximations to
the optimal solution are ε-close to the target. It would be interesting to analyze
whether this is sufficient for efficiently finding low-error clusterings, both in the
realizable and in the agnostic setting.
References
1. Jain, K., Mahdian, M., Saberi, A.: A new greedy approach for facility location
problems. In: STOC (2002)
2. Charikar, M., Guha, S., Tardos, E., Shmoys, D.B.: A constant-factor approximation
algorithm for the k-median problem. In: Proceedings of the Thirty-First Annual
ACM Symposium on Theory of Computing (1999)
3. Balcan, M.F., Blum, A., Gupta, A.: Approximate clustering without the approx-
imation. In: Proceedings of the ACM-SIAM Symposium on Discrete Algorithms
(2009)
4. Balcan, M.F., Blum, A., Vempala, S.: A discriminative framework for clustering via
similarity functions. In: Proceedings of the 40th ACM Symposium on Theory of
Computing (2008)
5. Balcan, M.F., Braverman, M.: Finding low error clusterings. In: Proceedings of the
22nd Annual Conference on Learning Theory (2009)
6. Kearns, M.J., Schapire, R.E., Sellie, L.M.: Toward efficient agnostic learning. Ma-
chine Learning Journal (1994)
7. Kanungo, T., Mount, D.M., Netanyahu, N.S., Piatko, C.D., Silverman, R., Wu, A.Y.:
A local search approximation algorithm for k-means clustering. In: Proceedings of
the Eighteenth Annual Symposium on Computational Geometry (2002)
8. Valiant, L.: A theory of the learnable. Commun. ACM 27(11), 1134–1142 (1984)