ACADEMISCH PROEFSCHRIFT
door
Jafar Tanha
The research reported in this thesis has been carried out under the auspices of
SIKS, the Dutch Research School for Information and Knowledge Systems.
Copyright © 2013 by Jafar Tanha. All rights reserved.
ISBN: 978-90-5335-669-2
Acknowledgments
Completing my PhD thesis has probably been the most challenging and stressful activity of the first 36 years of my life. The best and worst moments of my doctoral journey have been shared with many people, in particular with my family. I have spent several privileged years at the Informatics Institute of the University of Amsterdam, whose members will always remain dear to me.
First of all, I would like to express my deeply-felt thanks and gratitude to my
dissertation promotor, Professor Hamideh Afsarmanesh for her warm encourage-
ment and thoughtful guidance. Thanks to her, I learned a lot about the scientific research process and the preparation of research reports. Moreover, Hamideh has given me the freedom to explore directions that I found interesting, which eventually led to this thesis.
I thank Dr. Maarten van Someren, who was my co-promotor throughout my doctoral research and provided insightful guidance on the research. I
would like to express my gratitude to Maarten, not only for his excellent com-
ments, but also for listening to me whenever I was excited about a new idea.
I also thank the other members of my thesis committee: Professor Arno Siebes,
Professor Max Welling, Professor Hendrik Blockeel, Professor Joost Kok, Profes-
sor Pieter Adriaans, and Professor Frans Goren, whose helpful suggestions have
increased the readability of my thesis.
Special thanks go to my friends in Iran: Dr. Alireza Aliahmadi, Dr. Davood Karimzadgan, Esmaeil Mousavi, Dr. Feisal Hassani, and Dr. Hamidreza Rabiee, whose helpful suggestions enabled me to obtain the scholarship from the Ministry of Science, Research, and Technology of Iran for my doctoral program.
Members of the Federated Collaborative Networks (FCN) and HCS labs also deserve my sincerest thanks; their friendship and assistance have meant more to me than I could ever express. Special thanks to my former colleagues: Simon,
Gerben, Sofiya, Katja, and Mattijs for their assistance and friendship.
I would also like to thank my colleagues: Naser Ayat, Mahdi Sargolzaei,
Miriam Ghijsen, Mahdieh Shadi, Hodjat Soleimani, Masoud Mazloom, Amirhossein Habibian, and Mohammad Shafahi, with whom I have had insightful discus-
sions and who have provided me assistance with my research.
My friends in Amsterdam have been my sources of happiness and support.
Special thanks go to Fereidoon, Akhtar, and Bita Farrokhnia, Atique and Firouzeh
Sultanpour, Mohammad, Mehran, Mohammadali Hadad, Ramin Sarrami, Mah-
moud Botshekan, Hamidreza Fallah, Hamid Abbasi, Said Eslami, Amir Avan,
Mohammad Jazayeri, Mahdi Jaghoori, Naser Asghari, Azadeh Gharaiee, Yaser
Rahmani, Mohammad Amin Moradi, and Fatemeh Moayed. I am very happy that, in many cases, my friendship with them has extended well beyond our shared time in Amsterdam.
Finally, my debt of gratitude and thanks go to my parents, Gholamhassan
and Robab. Their love provided my inspiration and has been my driving force.
I owe them everything and wish that I could show them just how much I love
and appreciate them. I am grateful to my wife and my lovely daughter, Leila
and Ayda, whose support, love, and encouragement have allowed me to complete this journey. I would also like to thank my mother-in-law, Fatemeh Saghafi, for her
unconditional support. Finally, I would like to dedicate this work to my late
relatives including my grandmother Sona Shakourzadeh, my grandfathers Reza
Tanha and Hamid Hajizadeh, and my father-in-law Ghader Shoeibi, who left us
too soon. I hope that this work makes them proud.
Jafar Tanha
March 2013
Contents

Acknowledgments

1 Introduction
  1.1 Research Questions and Solution Approach
  1.2 Outline

2 Background
  2.1 Supervised Ensemble Learning
    2.1.1 Bagging
    2.1.2 Boosting
    2.1.3 Multiclass Boosting
  2.2 Semi-Supervised Learning and Ensemble Methods
    2.2.1 Basic Definitions
    2.2.2 Semi-Supervised Learning Algorithms
    2.2.3 Multiclass Semi-Supervised Boosting

    3.5.1 Random Forest
    3.5.2 The Random Subspaces Method
    3.5.3 The Ensemble Self-Training Algorithm
  3.6 Experiments
    3.6.1 Datasets
    3.6.2 Setup of the Experiments
    3.6.3 Statistical Test
    3.6.4 Decision Tree Learners
  3.7 Results
    3.7.1 Self-Training with a Single Classifier
    3.7.2 Self-Training with an Ensemble of Trees
    3.7.3 Results of Experiments on Web-page Datasets
  3.8 Conclusion and Discussion

4 Ensemble Co-Training
  4.1 Introduction
  4.2 Semi-Supervised Co-Training
  4.3 A Bound for Learning from Noisy Data
    4.3.1 Criterion for Error Rate and Number of Unlabeled Data
    4.3.2 Ensemble-Co-Training Algorithm
  4.4 Experiments
    4.4.1 Datasets
    4.4.2 Base Learner
  4.5 Results
  4.6 Conclusion and Discussion

    5.8.2 Discussion
    5.8.3 Convergence of Multi-SemiAdaBoost
    5.8.4 Sensitivity to the Scale Parameter
    5.8.5 Sensitivity to the Size of the Newly-Labeled Data
  5.9 Conclusion and Discussion

A Appendix
B Appendix
Abstract
Samenvatting
Chapter 1
Introduction
problem, there are many learning algorithms, e.g. decision tree learning, which
is used in this research and addressed in Chapter 3.
The effectiveness of supervised learning algorithms depends on the availability of sufficient labeled data. However, in many classification problems, such as object detection, action recognition, and document and web-page categorization, a large amount of unlabeled data is readily available, while labeled data is usually not easy to obtain. Assigning labels to data - annotating data - in some domains, such as medical diagnosis and bioinformatics, requires special measurements or specialized devices, which are often expensive. Labeling is also difficult in domains that require extensive preprocessing, e.g., segmentation of speech signals and images. Furthermore, in most practical cases such a task is tedious and time-consuming, and needs to be performed repeatedly by a human.
On the other hand, unlabeled data is usually available in abundance, and collecting a large pool of it is easy. The Internet in particular provides large amounts of unlabeled data for many applications, such as document and web-page classification, and learning from data on the Internet is becoming widespread. It is therefore interesting and important to investigate methods that can effectively learn from both labeled and unlabeled data - so-called semi-supervised learning. This is especially the case for methods that use only a small set of labeled data and a large pool of unlabeled data to build their classification models.
In fact, semi-supervised learning is a learning task that uses unlabeled data and
combines the implicit information in the unlabeled data with the explicit clas-
sification information of the labeled data, in order to improve the classification
performance.
Most existing methods for semi-supervised learning employ a supervised learn-
ing algorithm to build a classification model using labeled and unlabeled data.
One promising family of supervised learning methods in many applications is ensemble methods, which combine many classification models in order to build a strong model; see Chapter 2 for more details. In this dissertation
our interest is in developing methods for semi-supervised learning with a focus
on ensemble methods. We focus on two types of algorithms for semi-supervised
learning, which we call iterative and additive algorithms.
Iterative semi-supervised learning algorithms wrap around a base learner and replace their current hypothesis with a new hypothesis at each iteration. During training, the unlabeled data are first labeled and assigned a confidence, which produces a set of newly-labeled data at each iteration. In most algorithms a subset of these data is then selected and added to the pool of initial labeled data, yielding a larger training set; selecting this subset of newly-labeled data is not an easy task and strongly affects the performance of iterative algorithms. The training process is repeated until it reaches a stopping condition, and the classifier constructed in the final iteration becomes the final classifier. Two examples of this approach are
This approach can have various problems, such as the imbalanced class distribu-
tions, increased complexity, no guarantee for having an optimal joint classifier or
probability estimation, and different scales for the outputs of generated binary
classifiers, which complicate the task of combining them. The second approach is
to use a multiclass classifier directly. Although a number of methods have been proposed for multiclass supervised learning, they are not able to handle multiclass semi-supervised learning. There is therefore a need to extend multiclass supervised learning to multiclass semi-supervised learning. We propose two multiclass semi-supervised boosting algorithms that solve the multiclass classification
problems directly. We then look at the following questions:
1.2 Outline
The rest of this dissertation is organized as follows.
In Chapter 2 we introduce the preliminary definitions and concepts that are
used throughout the thesis. In this chapter, we first give a brief introduction
to ensemble methods. We then give an overview of the semi-supervised learning
algorithms, in which the research in this thesis is situated.
In Chapter 3, we propose several modifications to the decision tree learner in
order to improve its probability estimation, when it is used for self-training as the
base learner. We then show how these modifications help the decision tree to benefit from the unlabeled data. Next, we use the modified decision trees as the base learners in ensembles of trees. We end with a set of experiments showing the performance of the proposed methods.
In Chapter 4 we propose a new form of co-training, which uses N different
learning algorithms. We use a theorem to derive a measure for deciding whether adding a set of unlabeled data can reduce the error of a component classifier. We then compare the proposed method to other related methods, to show both the effectiveness of the proposed criterion for estimating the error rate and that of the stopping condition.
Multiclass classification is the subject of Chapter 5. We develop two boosting
algorithms to deal with multiclass semi-supervised learning. We first give a formal
definition of multiclass classification and then design two multiclass exponential
loss functions. We derive two algorithms from the aforementioned loss functions.
We end with a set of experiments on public datasets that show the performance of the proposed methods and compare them to state-of-the-art methods in this field.
Chapter 6 investigates the use of the methods proposed in Chapter 5 on real-world applications. We use data from accelerometers attached to birds to recognize bird behavior, and text data from TREC and Reuters for document classification. We then apply the aforementioned algorithms to these two real applications.
Finally, in Chapter 7 we conclude the thesis and point to possible further
research about the methods that are proposed in this thesis.
The results reported in Chapters 3 to 6 of this thesis have been published.
The highlights of the author’s publications follow:
• A part of the approach and results in Chapter 3 was presented at the IEEE International Conference on Intelligent Computing and Intelligent Systems (ICIS) in 2011 [92]; an extended version of this work has been submitted to the journal Neural Computing and Applications.
• The approach and results addressed in Chapter 4 were presented at the 23rd IEEE International Conference on Tools with Artificial Intelligence (ICTAI) in 2011 [93].
• A part of the approach and results in Chapter 5 was presented at the IEEE International Conference on Data Mining (ICDM) in 2012 [95], and a full version of this work has been accepted for publication in Pattern Recognition Letters.
where F(x) is the final classification model and wi is either the weight of the i-th component base learner or the voting weight of the i-th component classifier in the final decision F(x).
The required training sets for an ensemble classifier are normally generated
based on multiple subsamples from the training data, as addressed in Sections
2.1.1 and 2.1.2, or from subsets of the features, as mentioned in Section 3.5.2. A
typical ensemble method for a classification problem includes the following components:
• Training set: a set of labeled examples used for training.
The task of the diversity generator in the ensemble classifier is to use classifiers that differ from each other as much as possible. Kuncheva and Whitaker [54] introduce several methods to measure the diversity of an ensemble classifier.
The task of the combiner is to produce the final classification model by combining
the generated classification models.
There are mainly two kinds of ensemble frameworks for building the ensembles:
dependent (sequential) and independent (parallel) [77]. In a dependent framework
the output of a classifier is used in the generation of the next classifier. Thus it
is possible to take advantage of the knowledge generated in previous iterations to
guide learning in the next iterations. An example of this approach is boosting.
In the second framework, called independent, each classifier is built independently,
and their outputs are then combined by voting methods, as addressed in Section
4.3.2. A well-known example of this framework is bagging. Figure 2.1 shows
an overview of the independent ensemble methods. These two frameworks are
further described below.
2.1.1 Bagging
Bootstrap aggregating (bagging) [13] is an ensemble method that generates hy-
potheses for its ensemble independently from each other, using random subsets of
the training set. For each classifier, a training set is generated by randomly drawing, with replacement, N examples, where N is the size of the original training
set. In the resampling process many of the original examples may be repeated in
the resulting training set. Each independent classifier in the ensemble is generated
with a different random sample from the training set. Since bagging resamples
the training set with replacement, each instance can be sampled multiple times.
Examples that are not sampled are kept as a test set, called the Out-Of-Bag (OOB) set, and are used to evaluate the generated classification model.
Breiman [13] showed that bagging is effective for “unstable” learning algo-
rithms where small changes in the training set result in large changes in predic-
tions. Breiman claimed that neural networks and decision trees are examples
of unstable learning algorithms. A more detailed discussion about bagging in
relation to our approach is presented in Chapter 3.
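To make the resampling and Out-Of-Bag evaluation concrete, the following is a minimal sketch of bagging in Python; it assumes scikit-learn's DecisionTreeClassifier as the base learner and integer class labels, and it is an illustration rather than the exact implementation used in this thesis.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_trees=25, seed=0):
    """Train a bagged ensemble of trees and report the mean Out-Of-Bag error."""
    rng = np.random.RandomState(seed)
    n = len(X)
    trees, oob_errors = [], []
    for _ in range(n_trees):
        idx = rng.choice(n, size=n, replace=True)        # bootstrap sample (with replacement)
        oob = np.setdiff1d(np.arange(n), idx)            # examples never drawn for this tree
        tree = DecisionTreeClassifier().fit(X[idx], y[idx])
        if len(oob) > 0:
            oob_errors.append(np.mean(tree.predict(X[oob]) != y[oob]))
        trees.append(tree)
    return trees, float(np.mean(oob_errors))

def bagging_predict(trees, X):
    """Majority vote over the trees (labels are assumed to be 0, 1, ... integers)."""
    votes = np.stack([t.predict(X) for t in trees]).astype(int)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)

Because each tree is built on an independent bootstrap sample, the trees can be trained in parallel and the OOB error gives an unbiased estimate without a separate test set.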
2.1.2 Boosting
Dependent ensemble frameworks work by assigning weights to training examples
in order to produce a set of diverse weak classifiers. These frameworks, like boost-
ing [34], are generic ensemble learning frameworks, which sequentially construct a
linear combination of base classifiers. Boosting is a general method for improving
the performance of weak classifiers, such as classification rules or decision trees.
The method works by repeatedly running a weak learner on weighted training examples. The generated classifiers are then combined into a final composite
classifier, which often achieves a better performance than a single weak classi-
fier. One of the first boosting algorithms, AdaBoost (Adaptive Boosting), was
introduced by Freund and Schapire [34]. The main idea behind the boosting al-
gorithm is to give more weight to examples that are misclassified by hypotheses
that were constructed so far. At each iteration weights of the data are updated
on the basis of classification errors. Examples that are misclassified by previous
classifiers are overweighted relative to examples that were correctly classified. As
a result, the weak learner focuses on the hard examples in later iterations. Finally
the weighted combination of the weak classifiers becomes the final classification
model. Many boosting algorithms have been proposed besides AdaBoost [34], for example Arcing [14], LogitBoost, and GentleBoost [37]. They mainly differ in the
classification problem.
Other methods, like GentleBoost [37], AdaBoost.MH [85], and CD-Boost [79], still decompose the multiclass problem into several binary sub-problems with different strategies, which in general works better than methods like one-vs-all [79].
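For concreteness, here is a minimal sketch of the binary AdaBoost weight-update scheme described in Section 2.1.2, using decision stumps as weak learners; labels are assumed to be in {-1, +1}, and this is an illustration rather than the exact formulation used later in this thesis.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    """Binary AdaBoost; y must contain labels in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # uniform initial example weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)
        if err >= 0.5:                         # weak learner no better than chance
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
        # Misclassified examples receive more weight for the next round.
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Sign of the weighted combination of the weak classifiers."""
    F = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(F)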
where F(·) is a classifier and ℓu denotes the penalty term for the unlabeled data. There are many semi-supervised learning methods, which vary in their underlying assumptions. Two main assumptions for semi-supervised learning are the cluster assumption and the manifold assumption.
Cluster Assumption
The cluster assumption addresses the fact that the data space consists of a number
of clusters and if points are in the same cluster then they likely belong to the same
class.

[Figure 2.2: The positive and negative signs show labeled examples from two different classes. The gray regions depict the unlabeled samples. The dashed decision boundary, obtained by training on the labeled examples only, usually crosses dense regions of the feature space; using the unlabeled data, it is moved to regions with lower density (solid line).]

The cluster assumption thus says that the decision boundary should lie in
a low-density region [18]. In other words, the cluster assumption emphasizes that examples lying in very dense regions of the feature space are likely to share the same class label; see Figure 2.2. Many successful semi-supervised
learning methods are based on this assumption, such as Semi-Supervised SVM
[5], low density separation [18], and Transductive SVM [50]. Our approaches in
this thesis are also based on this assumption. Furthermore, from the decision boundary point of view, one can say that the best decision boundary is situated in low-density regions. Such a decision boundary should maximize not only the margin of the labeled data, but also the (pseudo-)margin of the unlabeled data.
Manifold Assumption
The manifold assumption is that the (high-dimensional) data lie on a low-dimensional
manifold, see Figure 2.2. Many graph-based semi-supervised methods are based
on this assumption, for example Markov random walks [47], Label propagation
[112], and Laplacian SVMs [4]. In graph-based semi-supervised learning methods
usually the underlying geometry of the data is represented as a graph, where the samples are the vertices and the similarities between examples form the edges of the graph.
In other words, graph-based methods learn the manifold structure of the fea-
ture space with labeled and unlabeled data and use this information to improve
Iterative-based Methods
Self-training is one of the earliest paradigms of iterative methods for semi-supervised
learning. In self-training a classifier is first trained with a small set of labeled
data. The classifier is then used to classify the unlabeled data. Next, the most confidently predicted unlabeled points, together with their predicted labels, are added to the training set. The classifier is re-trained and the procedure is repeated until it reaches a stopping condition, after which the latest classifier is output. In the self-
training algorithm, an issue is that the misclassified examples will propagate to
later iterations and can have a strong effect on the result. Thus, there is a need
to find a metric to select a set of high-confidence predictions at each iteration of
the self-training procedure. Some algorithms try to avoid this by “unlearning”
unlabeled points if the prediction confidence drops below a threshold [105]. Self-
training has been applied to several natural language processing tasks. Yarowsky
[105] uses self-training for word sense disambiguation. Rosenberg et al. [78] pro-
posed a self-training approach to object detection. A self-training algorithm is
used to recognize different nouns in [76]. [64] proposes a self-training algorithm to
classify dialogues as “emotional” or “non-emotional” with a procedure involving
two classifiers. In [100] a semi-supervised self-training approach using a hybrid of
Naive Bayes and decision trees is used to classify sentences as subjective or objec-
tive. The main difficulty in self-training is how to select a set of high-confidence
predictions. In Chapter 3 we discuss such a selection metric.
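As a concrete illustration of the self-training loop described above, the following sketch wraps a probabilistic base classifier (assumed to follow a scikit-learn-style fit/predict_proba interface); the fixed confidence threshold and stopping condition are simple assumed choices, not the selection metric proposed in Chapter 3.

import numpy as np

def self_training(clf, X_lab, y_lab, X_unlab, threshold=0.9, max_iter=20):
    """Generic self-training loop around a probabilistic base classifier.

    At each iteration, unlabeled examples predicted with probability above
    `threshold` are moved, with their pseudo-labels, into the training set.
    """
    X_train, y_train = X_lab.copy(), y_lab.copy()
    pool = X_unlab.copy()
    for _ in range(max_iter):
        clf.fit(X_train, y_train)
        if len(pool) == 0:
            break
        proba = clf.predict_proba(pool)
        conf = proba.max(axis=1)               # confidence of each prediction
        pseudo = proba.argmax(axis=1)          # pseudo-label (class index)
        keep = conf >= threshold               # high-confidence selection metric
        if not keep.any():
            break                              # stopping condition: nothing selected
        X_train = np.vstack([X_train, pool[keep]])
        y_train = np.concatenate([y_train, clf.classes_[pseudo[keep]]])
        pool = pool[~keep]
    return clf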
Generative models are also based on iterative approaches. A generative model
estimates the joint distribution p(x, y) = p(y)p(x|y) where p(x|y) is an identifiable
mixture distribution, such as a mixture of Gaussians. In this approach each
class is estimated by a Gaussian distribution model and the unlabeled examples
are used to estimate parameters of the distribution. The mixture model ideally
should be identifiable. If the mixture model assumption holds, then the unlabeled
data effectively improve the classification performance, otherwise they cannot
be helpful [111]. Nigam et al. [71] apply the EM algorithm with a mixture of
multinomials, using an iterative approach for text classification. They showed
that the resulting classification models work better than those trained only on
labeled data. Recently, Fujino et al. [38] extended generative mixture models by
including a bias correction term and discriminative training using the maximum
entropy principle.
One of the successful iterative approaches for semi-supervised learning is co-
training [8]. Co-training involves two, preferably independent, views of data both
of which are individually sufficient for training a classifier. In co-training, the
“views” are subsets of the features that describe the data. Each classifier pre-
dicts labels for the unlabeled data and a degree of confidence. Unlabeled instances
that are labeled with high confidence by one classifier are used as training data for
the other. This is repeated until none of the classifiers changes. In co-training, instead of different views of the data, which are rare in many domains, one can use different learning algorithms. Goldman and Zhou [39] propose a co-training algorithm
using different learning algorithms. This version uses statistical tests to decide
which unlabeled data can be labeled with confidence. As mentioned earlier the
main difficulty in iterative methods is how to select a set of high-confidence pre-
dictions. Co-training uses the agreement between classifiers for selection, which
is not always helpful and some misclassified examples will propagate to later it-
erations. In Chapter 4, we propose an ensemble co-training algorithm that uses
N different learning algorithms to select from unlabeled examples.
Margin-based Methods
Supervised margin-based methods are popular and successful methods for classification tasks, which is a good motivation to extend them to semi-supervised learning. Many semi-supervised margin-based methods are extensions of the Support Vector Machine (SVM). An SVM minimizes a function that combines the error on the training data and the cost of the margin. To extend this function to semi-supervised learning, there is a need to define the loss function
for unlabeled examples. Transductive Support Vector Machine (TSVM) [49] is
one of the early efforts to extend standard SVMs to semi-supervised learning. In
a standard SVM, there is only labeled data for training and the goal is to find
a linear decision boundary with maximum margin in the Reproducing Kernel
Hilbert Space. Since in semi-supervised learning there is a large set of unlabeled data, a TSVM should force the decision boundary to lie in a low-density region of the feature space by maximizing the margin of both the labeled and the unlabeled data. One can use the following regularization term for the unlabeled data in the loss function of a TSVM:
Graph-based Methods
Typically, graph-based semi-supervised learning methods are based on the mani-
fold assumption. These methods define a graph where the nodes are the examples
(labeled and unlabeled) and edges show the similarity between the examples.
These methods usually assume label smoothness over the graph. The goal of
many graph-based methods is to estimate a function on the graph. This function should minimize the loss on the labeled examples and be smooth over the whole graph. For that, a consistency term is defined between the class labels and the similarity among the data.
There are many methods for graph-based semi-supervised learning. Examples
are Markov random walks [47], Label propagation [112], and Laplacian SVMs [4].
The following loss function is used in many graph-based methods:
\ell_u(F(x)) = \sum_{x' \in X_u,\, x \neq x'} S(x, x') \, \lVert F(x) - F(x') \rVert_2^2 \qquad (2.5)
where S(x, x') is the pairwise similarity and \lVert \cdot \rVert_2 is the Euclidean norm. We use this term in a different way in our proposed method, employing an exponential function, in Chapter 5.
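As an illustration of the term in equation (2.5), the sketch below computes this regularizer for a set of unlabeled points; the Gaussian (RBF) pairwise similarity is an assumed choice of S(x, x') and may differ from the similarity used by any particular method.

import numpy as np

def rbf_similarity(X, sigma=1.0):
    """Pairwise Gaussian similarity S(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def graph_regularizer(F_vals, S):
    """Equation (2.5): sum over pairs of S(x, x') * ||F(x) - F(x')||^2.

    F_vals: array of shape (n, c) with classifier outputs for the unlabeled points.
    S:      (n, n) pairwise similarity matrix.
    """
    diffs = np.sum((F_vals[:, None, :] - F_vals[None, :, :]) ** 2, axis=-1)
    mask = ~np.eye(len(F_vals), dtype=bool)   # exclude the pairs with x = x'
    return np.sum(S[mask] * diffs[mask])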
Additive-based Methods
Boosting, as an additive method, is one of the most promising supervised learning methods in many applications [28]. The goal of boosting is to minimize equation (2.2), which is the cost of the margin, at each iteration of the boosting
procedure. The expression yF (x) is the classification margin by definition. A
boosting algorithm can be extended to semi-supervised learning by defining the
margin for unlabeled data. Examples of this approach are ASSEMBLE [6] and
Marginboost [23]. They extend the margin definition to unlabeled examples us-
ing a “pseudo-margin”. The “pseudo-margin” for unlabeled examples is defined as |F(x)| in ASSEMBLE and F(x)^2 in MarginBoost. Algorithm 1 shows the
“pseudo-code” of the ASSEMBLE method. At each iteration the boosting algo-
rithm assigns “pseudo-labels” and weights to unlabeled examples. Then, a set
of examples are sampled based on their weights. This set is used to train a new
component classifier. Next the weights and ensemble classifier are updated iter-
atively. The weighted combination of the classifiers forms the final classification
model.
In Chapter 5, we discuss boosting algorithms for semi-supervised learning in
more detail.
Algorithm 1 ASSEMBLE
Initialize: L, U, S, D_0(x), H_0(x);
  L: labeled data; U: unlabeled data; S: sampled examples;
  D_0(x): weight of each example; H_0(x) = 0;
  y_i ← [1, 0, −1] for i ∈ U; M: number of iterations
for t = 0 to M do
  - Train classifier f_{t+1} based on S and D_t;
  - if Σ_i D_t(i) y_i f_{t+1}(x_i) ≤ 0 then return H_t;
  - Compute ε_{t+1} ← Σ_i D_t(i) I(y_i ≠ f_{t+1}(x_i));
  - Compute α_{t+1} as in AdaBoost;
  - H_{t+1} ← H_t + α_{t+1} f_{t+1};
  - y_i ← H_{t+1}(x_i) for i ∈ U;
  - D_{t+1}(i) ← D_t(i) exp(α_{t+1} I(y_i ≠ f_{t+1}(x_i))) / Z_t,
    where Z_t is the normalization factor and I(x) is 1 if x is true, and 0 otherwise;
  - S ← Sample(L + U, |L|, D_{t+1});
end for
return H_T;
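The following is a compact, hedged sketch of the ASSEMBLE loop above in Python, assuming a scikit-learn-style decision tree as the base learner and binary labels in {-1, +1}; the initialization of the pseudo-labels and the sampling scheme are simplified relative to the original algorithm.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def assemble(X_lab, y_lab, X_unlab, n_rounds=30, seed=0):
    """Simplified ASSEMBLE: boosting with pseudo-labels for the unlabeled data."""
    rng = np.random.RandomState(seed)
    X = np.vstack([X_lab, X_unlab])
    y = np.concatenate([y_lab, np.ones(len(X_unlab))])   # assumed initial pseudo-labels
    n, n_lab = len(X), len(X_lab)
    D = np.full(n, 1.0 / n)                               # example weights
    H = np.zeros(n)                                       # ensemble output on X
    models, alphas = [], []
    for _ in range(n_rounds):
        idx = rng.choice(n, size=n_lab, p=D)              # sample examples by weight
        f = DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx])
        pred = f.predict(X)
        if np.sum(D * y * pred) <= 0:                     # stopping condition
            break
        err = np.sum(D * (pred != y))
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10)) # as in AdaBoost
        models.append(f); alphas.append(alpha)
        H += alpha * pred
        y[n_lab:] = np.sign(H[n_lab:])                    # refresh pseudo-labels
        y[n_lab:][y[n_lab:] == 0] = 1
        D = D * np.exp(alpha * (pred != y))               # re-weight misclassified examples
        D /= D.sum()
    return models, alphas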
and different scales for the outputs of generated binary classifiers which compli-
cates combining them, see [48] and [79]. Recently, in [99] a new boosting method
is used for semi-supervised multiclass learning which uses similarity between pre-
dictions and data. It maps labels to n-dimensional space, but this mapping may
not lead to train a classifier that minimizes the margin cost. SERBoost [80] uses
expectation regularization to maximize the margin of unlabeled data.
The main difficulties of the multiclass classification problem in boosting style
are: (1) how to define the margin, (2) how to design the loss function for margin
cost, and (3) how to find the weighting factor for the data and the classifiers. For
addressing these three challenges, we propose two methods in Chapter 5, called
Multiclass SemiBoost and Multi-SemiAdaBoost. We then evaluate our proposed
methods and compare these methods to the state-of-the-art methods using many
public machine learning datasets.
Chapter 3
Self-Training With Decision Trees
3.1 Introduction
In this chapter we focus on self-training with a decision tree learner as the base
learner. Self-training is a commonly used technique for semi-supervised learning.
In self-training, a classifier is first trained on a small set of labeled data. This
classifier is then used to classify the unlabeled data. Next, a set of unlabeled data
points, of which class labels can be predicted with high confidence, is added to
the training set. The classifier is re-trained and this procedure is repeated until
it reaches a stopping condition. In self-training misclassified examples are used
for further learning and thus can have a strong effect on the result. There is thus
a need to find a metric for the selection from the unlabeled data. In a decision
tree classifier the class distribution at the leaves is normally used as probability
estimation for the predictions. We show below that a standard decision tree
learning algorithm like C4.5 performs poorly as the base learner in self-training.
Accuracy after self-training is almost the same as after supervised learning from
only the labeled data. The algorithm does not benefit from the unlabeled data.
Provost and Domingos [73] showed that the distributions at the leaves of decision
trees provide poor ranking for probability estimates. Using class frequencies at the
leaves for probability estimation brings two problems when ranking and selecting
instances: the first is that the sample size at the leaves is almost always small,
which produces poor estimates, and the second is that all instances at a leaf
are assigned the same probability. Examples that are far from the boundary
are more likely to belong to a class than those close to the boundary. Because
of poor probability estimation, selecting newly-labeled data in self-training is
error-prone and these errors will propagate to later iterations. However, in many
domains decision trees are the best choice as the base learner, see [73]. This has
motivated us to look for improvements of the probability estimation for decision
tree learning.
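For intuition, the Laplace correction mentioned here replaces the raw class frequencies at a leaf by smoothed estimates; a minimal sketch, assuming k classes and the raw class counts at a leaf, is:

def leaf_probability(class_counts):
    """Laplace-corrected class probabilities at a decision tree leaf.

    class_counts: raw counts n_1, ..., n_k of training examples of each class
    that reach the leaf. The correction adds one virtual example per class, so
    small leaves are pulled toward the uniform distribution instead of giving
    extreme 0/1 estimates.
    """
    k = len(class_counts)
    total = sum(class_counts)
    return [(n + 1) / (total + k) for n in class_counts]

# Example: a leaf with 3 positive and 0 negative examples is estimated as
# (3 + 1) / (3 + 2) = 0.8 rather than 1.0.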
We consider several methods that may improve the probability estimation of
tree classifiers: (a) No-pruning and applying the Laplacian Correction (C4.4) [73],
(b) Grafting [103], (c) a combination of Grafting with Laplacian Correction and
No-pruning, (d) a Naive Bayes Tree classifier (NBTree) [53], (e) using a distance-
based measure combined with the improved decision tree learners. Our hypothesis
is that these modified decision tree learners will show accuracy similar to the
standard decision tree learner when applied to the labeled data only, but will
benefit from the unlabeled data when used as the base classifier in self-training.
Next we extend our analysis from single decision tree as the base classifier to
ensembles of decision trees, in particular the Random Subspace Method (RSM)
[44] and Random Forests (RF) [15]. In this case, probability estimates can be obtained from the predictions of multiple trees. However, since these trees suffer
from poor probability estimation, the ensemble learner will not benefit much from
self-training on unlabeled data. For the same reason as for the single decision tree
learner, we also expect that using the modified decision tree learners as the base
learner for the ensemble will improve self-training with the ensemble classifier as
the base learner. The results of the experiments on several benchmark UCI datasets and on web-page datasets described by visual features [10] confirm this.
The rest of this chapter is organized as follows. Section 3.2 reviews related
work on semi-supervised learning. In Section 3.3, decision tree learning algorithms
as the supervised learner in self-training are evaluated. In Section 3.4 and 3.5
we address the improvements for self-training. The setup of the experiments is presented in Section 3.6. Section 3.7 presents the results of the
experiments and in Section 3.8, we present our conclusions.
leaves of decision trees are not very accurate. It seems that pruning methods
tend to prune too much so that the resulting decision tree tends to “underfit” the
data. Provost and Domingos propose several modifications of the C4.5 decision
tree learner for finding better probability estimates. They propose the decision tree learner C4.4, which does not prune and uses the Laplace correction to smooth the probabilities at the leaves. Experiments show that this improves the probability
estimation of the decision tree learner.
Here, we consider several methods for improving the probability estimates at
the leaves of decision trees. Besides C4.4, these are the Naive Bayes Tree, the Grafted Decision Tree [103], and a global distance-based measure.
3.4.2 NBTree
The Naive Bayesian Tree learner, NBTree [53], combines the Naive Bayes Clas-
sifier with decision tree learning. In an NBTree, a local Naive Bayes Classifier is
constructed at each leaf of decision tree that is built by a standard decision tree
learning algorithm like C4.5. NBTree achieves in some domains higher accuracy
than either a Naive Bayes Classifier or a decision tree learner. NBTree uses addi-
tional attributes and gives a posterior probability distribution that can be used to
estimate the confidence of the classifier. Similar to the standard decision tree, a threshold for numerical attributes is chosen using an entropy minimization strategy.
For numerical attributes, NBTree fits a Gaussian distribution to estimate the
conditional probabilities. Unlike the standard decision tree learner, this uses the
distance of values to the decision boundary to estimate the posterior probabilities.
This will improve the probability estimates.
this method, for each unlabeled example the distance between the corresponding
example and all other positive examples is computed based on the Mahalanobis
distance method which differs from Euclidean distance in that it takes into ac-
count the correlation of the data. It is defined as follows:
D(X) = \sqrt{(X - \bar{X})^T S^{-1} (X - \bar{X})} \qquad (3.1)
where X = (X1, ..., Xn) is a multivariate vector, X̄ = (X̄1, ..., X̄n) is the mean, and S is the covariance matrix. For each unlabeled example, the distance is calculated
from all positive and negative examples separately. Then the absolute value of the
difference between pi and qi is calculated as a score for each unlabeled example. These scores are then used as the selection metric. Algorithm 3 shows the selection procedure based on the Mahalanobis distance: a score is assigned to each example, and the examples with the highest scores are selected from the unlabeled examples. We use this selection metric together with the classifier's own selection metric, the probability estimate, to select from the unlabeled examples.
In Algorithm 3, a set of unlabeled examples is first selected. This subset is
then used in the self-training process to assign “pseudo-labels”. Next, the high-
confidence predictions from this subset are added, with their “pseudo-labels”, to the
training set. This procedure is repeated until it reaches a stopping condition.
Algorithm 4 shows the self-training procedure.
[Figure 3.2: Distance between unlabeled examples X and Y and the positive and negative examples, based on the Mahalanobis distance.]
Algorithm 3 SelectionMetric
Initialize: U, P;
  U: unlabeled examples; P: percentage of examples to select;
for each xi ∈ U do
  - pi ← distance between xi and all positive examples;
  - qi ← distance between xi and all negative examples;
  - wi ← |pi − qi| as the score of xi;
end for
S ← Select the P% of examples with the highest scores;
return S
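A small sketch of this selection metric in Python, assuming binary labels and a Mahalanobis distance computed to each class mean with the pooled covariance of the labeled data (the exact distance computation used in the thesis may differ):

import numpy as np

def mahalanobis(x, X_class, cov_inv):
    """Mahalanobis distance of x to the mean of one class, equation (3.1)."""
    diff = x - X_class.mean(axis=0)
    return np.sqrt(diff @ cov_inv @ diff)

def selection_scores(X_unlab, X_pos, X_neg):
    """Score each unlabeled example by |distance to positives - distance to negatives|."""
    X_lab = np.vstack([X_pos, X_neg])
    cov_inv = np.linalg.pinv(np.cov(X_lab, rowvar=False))  # pooled covariance (pseudo-inverse)
    scores = []
    for x in X_unlab:
        p = mahalanobis(x, X_pos, cov_inv)
        q = mahalanobis(x, X_neg, cov_inv)
        scores.append(abs(p - q))
    return np.array(scores)

def select_top(X_unlab, scores, percent=10):
    """Return the `percent`% of unlabeled examples with the highest scores."""
    k = max(1, int(len(X_unlab) * percent / 100))
    idx = np.argsort(scores)[::-1][:k]
    return X_unlab[idx], idx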
all have little predictive power. The ensemble classifiers do not suffer from this
problem because the random component and the ensemble allow including more
features.
are not included in the training set are called Out-Of-Bag data for that tree and
these are used for estimating the error, called Out Of Bag Error. Therefore, there
is no need for cross-validation or a separate test set to get an unbiased test set error
in this method. Random Forest uses randomized feature selection while the tree is
growing. In the case of high-dimensional datasets this property is crucial, because when there are hundreds or thousands of features, for example in medical diagnosis and document classification, many weakly relevant features may not appear in a single decision tree at all. The final hypothesis in Random Forest is produced by majority voting among the trees. Many semi-supervised algorithms
employ the Random Forest approach, such as Co-Forest [108].
where

P(k \mid x) = \frac{1}{N} \sum_{i=1}^{N} P_i(k \mid x) \qquad (3.3)
and k is the class label and P_i(k|x) is the probability estimate of the i-th tree for sample x. This method produces a good ranking of probability estimates by voting, provided that each component classifier produces reliable probability estimates.
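A minimal sketch of this averaged probability estimate, assuming each tree exposes a scikit-learn-style predict_proba method:

import numpy as np

def ensemble_proba(trees, X):
    """Average the per-tree class probability estimates, as in equation (3.3)."""
    # Each tree returns an (n_samples, n_classes) matrix of probabilities.
    return np.mean([tree.predict_proba(X) for tree in trees], axis=0)

def ensemble_confidence(trees, X):
    """Predicted class index and its confidence for each sample."""
    proba = ensemble_proba(trees, X)
    return proba.argmax(axis=1), proba.max(axis=1)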
Boosting is also a promising ensemble method, which sequentially constructs a weighted combination of classifiers. However, it has two main drawbacks [29]: it performs poorly in the case of a small training set and in the case of noisy labeled data. For this reason, we focus on RF and RSM as the ensemble classifiers in this chapter.
Since our goal is to improve the probability estimates of decision trees, we employ modified versions of C4.5, as mentioned earlier, to generate the trees in an ensemble
classifier. Algorithm 6 shows the ensemble self-training algorithm. As can be
seen, first the decision trees are generated by bagging or the random subspace
method. Next the ensemble classifier assigns “pseudo-labels” and confidence to
the unlabeled examples at each iteration. Labeling is performed by using different
voting strategies, such as majority voting or average probability. Then a set of high-confidence predictions, based on the pre-defined threshold T, is selected for the next iteration. A new training set is constructed by combining the selected newly-labeled examples with the original labeled examples. Next, a new ensemble of trees based on the new training set is generated. The training process is repeated until it
reaches a stopping condition.
The selection metric in this algorithm is based on the ensemble classifier, which
on one hand improves the classification accuracy and on the other hand improves
the probability estimation of the trees. As a result, the probability estimate used as the selection metric in this algorithm is more likely to select high-confidence predictions and performs better than a single classifier.
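Putting the pieces together, the sketch below combines a Random Subspace ensemble of trees with the self-training loop; it is an illustrative reconstruction under assumed defaults (feature fraction, threshold T), not the exact Algorithm 6.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_rsm(X, y, n_trees=10, subspace_frac=0.5, seed=0):
    """Random Subspace ensemble: each tree is trained on a random feature subset."""
    rng = np.random.RandomState(seed)
    d = X.shape[1]
    k = max(1, int(subspace_frac * d))
    ensemble = []
    for _ in range(n_trees):
        feats = rng.choice(d, size=k, replace=False)
        tree = DecisionTreeClassifier().fit(X[:, feats], y)
        ensemble.append((tree, feats))
    return ensemble

def rsm_proba(ensemble, X):
    """Average probability estimate over the subspace trees."""
    return np.mean([t.predict_proba(X[:, f]) for t, f in ensemble], axis=0)

def ensemble_self_training(X_lab, y_lab, X_unlab, T=0.85, max_iter=10):
    """Ensemble self-training: re-fit the ensemble on labeled plus
    high-confidence pseudo-labeled examples until nothing new is selected."""
    X_train, y_train, pool = X_lab, y_lab, X_unlab
    for _ in range(max_iter):
        ensemble = fit_rsm(X_train, y_train)
        if len(pool) == 0:
            break
        proba = rsm_proba(ensemble, pool)
        keep = proba.max(axis=1) >= T                      # threshold T selection
        if not keep.any():
            break
        labels = np.unique(y_train)[proba.argmax(axis=1)[keep]]
        X_train = np.vstack([X_train, pool[keep]])
        y_train = np.concatenate([y_train, labels])
        pool = pool[~keep]
    return ensemble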
3.6 Experiments
In the experiments we compare the performance of the supervised decision tree
learning algorithms to self-training. We expect to see that the decision tree
learners that give better probability estimates for their predictions will benefit
more from unlabeled examples. In other words, the difference in performance
between supervised learning from the labeled data and semi-supervised learning
on all data should be larger when the probability estimates of the predictions are
more accurate. We then make the same comparison for ensemble learners.
3.6.1 Datasets
We use a number of UCI datasets [32] and a web-page classification task [10] to evaluate the performance of our proposed methods. Recently the UCI datasets
have been extensively used for evaluating semi-supervised learning methods. Con-
sequently, we adopt UCI datasets to assess the performance of the proposed al-
gorithm in comparison to supervised learning.
UCI datasets
Fourteen datasets from the UCI repository are used in our experiments. We
selected these datasets because they differ in the number of features and examples and in the class distribution. Information about these datasets is given in Table
3.1. All sets have two classes and Perc. represents the percentage of the largest
class.
Web-Pages Datasets
To automatically classify and process web-pages in [10], the visual appearance of the web-pages is considered one of the main features. Current systems for web-page classification usually use the textual content of the pages, including both the displayed content and the underlying (HTML) code. However, the visual appearance of web-pages can be a very important feature. Using generic visual features to classify web-pages for several different types of tasks is the main task in [10]. Simple color and edge histograms, as well as Gabor and texture features, are used for the classification task. These were extracted using an off-the-shelf visual feature extraction package (Lire, written in Java, [62]). The learning task is to find a classifier that recognizes the aesthetic value and the recency of a page. In this task the number of attributes is much larger than in the UCI datasets used in this study, which makes it more suitable for ensemble methods.
Information about these datasets is given in Table 3.2. All datasets have two classes and Prc. represents the percentage of the largest class label. For each web-page a number of low-level features are computed, such as color and edge histograms, and Tamura and Gabor features [10]. In this experiment we use the binary classes Ugly and Beautiful as labels for the Aesthetic dataset. These labels were assigned by human evaluators with a very high inter-rater agreement.
Figure 3.3 shows samples of Ugly and Beautiful web-pages. For the Recency dataset, we look at Old and New web-page designs.
F_F = \frac{(N-1)\,\chi_F^2}{N(k-1) - \chi_F^2} \qquad (3.5)

This statistic is distributed according to the F-distribution with k − 1 and (k − 1)(N − 1) degrees of freedom.
As mentioned earlier, Friedman’s test shows whether there is a significant
difference between the averages or not. The next step for comparing the perfor-
mance of the algorithms with each other is to use a post-hoc test. We use Holm’s
method [45] for post-hoc tests. This sequentially checks the hypothesis ordered
z = \frac{R_i - R_j}{\sqrt{\frac{k(k+1)}{6N}}} \qquad (3.6)
The value z is used to find the corresponding probability from the normal
distribution table, which is compared with the corresponding value of α.
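As a hedged illustration of this statistical procedure, the following sketch computes the Friedman statistic with scipy, derives F_F as in equation (3.5), and applies Holm's step-down correction to the pairwise z-test p-values against a control algorithm; it assumes an accuracy matrix with one row per dataset and one column per algorithm.

import numpy as np
from scipy import stats

def friedman_holm(acc, alpha=0.05):
    """acc: (N datasets, k algorithms) matrix of classification accuracies."""
    N, k = acc.shape
    chi2, _ = stats.friedmanchisquare(*[acc[:, j] for j in range(k)])
    F_F = (N - 1) * chi2 / (N * (k - 1) - chi2)           # equation (3.5)

    # Average ranks (rank 1 = best accuracy on a dataset).
    ranks = np.mean([stats.rankdata(-row) for row in acc], axis=0)

    # Pairwise z statistics against a control (column 0), equation (3.6).
    se = np.sqrt(k * (k + 1) / (6.0 * N))
    z = (ranks[1:] - ranks[0]) / se
    p = 2 * stats.norm.sf(np.abs(z))                       # two-sided p-values

    # Holm's step-down procedure: compare ordered p-values to alpha / (m - i).
    order = np.argsort(p)
    rejected = np.zeros(len(p), dtype=bool)
    for i, j in enumerate(order):
        if p[j] <= alpha / (len(p) - i):
            rejected[j] = True
        else:
            break
    return F_F, ranks, p, rejected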
3.7 Results
Tables 3.3, 3.4, 3.7, and 3.8 give the classification accuracies for the experiments.
For each base classifier, the performance of the supervised learning only on labeled
data and self-training on both labeled and unlabeled data are reported.
learning from the labeled data only and self-training from labeled and unlabeled
data. The average improvement over all datasets is 0.15 %.
Table 3.3 gives the classification accuracy of C4.4 and ST-C4.4. Using C4.4 as the base learner in the self-training algorithm does not result in higher classification accuracy if only the labeled data are used, but unlike the basic decision tree learner, C4.4 enables self-training to become effective for nearly all the datasets. The average improvement over the used datasets is 1.9%. The reason for the improvement is that the Laplacian correction and no-pruning give a better ranking of the probability estimates of the decision tree, which leads to selecting a set of high-confidence predictions.
In Table 3.3, we can see the same observation for J48graft (grafted decision
tree). When using only labeled data we see no difference with the standard al-
gorithm but self-training improves the performance of the supervised J48graft
on all datasets. The average improvement over all datasets is 1.6%. Next, we combine the Laplacian correction and no-pruning in J48graft, yielding C4.4graft. This modification of the grafted decision tree gives better results when it is used as the base learner in self-training. The results show that the self-training algorithm outperforms the supervised C4.4graft classifier on all datasets. A T-test on the results
shows that self-training significantly improves the classification performance of
C4.4graft classifier on 10 out of 14 datasets and the average improvement over
all datasets is 3.5%. Finally, the results of experiments on NBTree classifier as
the base learner in self-training show that it improves the performance of NBTree
classifier on 13 out of 14 datasets and the average improvement is 2.7%.
Statistical Analysis
In this section the results in Table 3.3 are analyzed using statistical tests. Table
3.5 shows the rank of each algorithm on each dataset according to Friedman’s test.
Average ranks are reported in the last row. Friedman’s test checks whether the
measured average ranks are significantly different from the mean rank Rj = 2.96
expected under the null-hypothesis. Then, according to the equations (4.17) and
(3.5), χ²_F = 39.54 and F_F = 31.24.
Table 3.6 shows the results. The Holm procedure rejects the first, second, third, and fourth hypotheses, since the corresponding p-values are smaller than the adjusted α values (0.05). We conclude that the performance of C4.4graft, NBTree, J48graft, and C4.4 as the base learner in the self-training algorithm is significantly different from that of the standard decision tree (J48).
Based on the results of the experiments, we conclude that improving the probability estimates of the tree classifiers leads to a better selection metric for the self-training algorithm and produces a better classification model. We observe that the Laplacian correction, no-pruning, grafting, and NBTree produce better probability estimates in tree classifiers. We also observe that the Mahalanobis distance method for sampling is useful and guides a decision tree learner to select a set of high-confidence predictions.
Table 3.7 gives the classification accuracies of all experiments on the UCI datasets. The columns RFG, RREP, RG, and RNB show the classification performance of the supervised classifiers Random Forest with C4.4graft and RSM with REPTree, C4.4graft, and NBTree, respectively, and of their corresponding self-training algorithms. Using RF with C4.4graft as the base learner in self-training improves the classification performance of the supervised classifier RF on 13 out of 14 datasets. As can be seen, the results are better than for a single decision tree,
but in most cases the differences are not significant. We suspect that this is due to using the bagging method for generating the different training sets in Random Forest. With a small set of labeled data, bagging does not work very well [94], because the pool of labeled data is too small for re-sampling. However, the average improvement is 1.9% over all datasets.
In the second experiment, we use the RSM ensemble classifier with improved
versions of decision trees. We observe that using C4.4graft and NBTree as the
base classifiers in RSM is more effective than using REPTree, when RSM is used
as the base learner in self-training. The results show that RSM with C4.4graft as
the base learner in self-training improves the classification performance of RSM
on 13 out of 14 datasets and the average improvement over all datasets is 2.7%.
The same results are shown in Table 3.7 for RSM when NBTree is the base learner.
a set of experiments with single classifiers, J48 and C4.4graft, as well. Figure
3.4 shows the performance obtained by self-training algorithms and supervised
classifiers. Figures 3.4.a and 3.4.b show the performance of self-training with ensemble classifiers, and Figures 3.4.c and 3.4.d give the performance of self-training with single classifiers on the web-page datasets. Consistent with our hypothesis we
[Figure 3.4: Plots of classification accuracy versus the amount of labeled data and versus the number of trees for the web-page datasets.]
This chapter presents a method that uses an ensemble of classifiers for co-training
rather than feature subsets. The ensemble is used to estimate the probability of
incorrect labeling and this is used with a theorem by Angluin and Laird [1] to
derive a measure for deciding if adding a set of unlabeled data will reduce the error
of a component classifier or not. Our method does not require a time-consuming test for selecting a subset of unlabeled data. Experiments show that in most cases
our method outperforms similar methods.
This chapter is an extension of the paper published by Tanha et al. [93] at the 23rd IEEE International Conference on Tools with Artificial Intelligence (ICTAI) in 2011.
4.1 Introduction
Here we consider co-training [8] as one of the widely used semi-supervised learning
methods. Co-training involves two, preferably independent, views of data both
of which are individually sufficient for training a classifier. In co-training, the
“views” are subsets of the features that describe the data. Each classifier predicts
labels for the unlabeled data and a degree of confidence. Unlabeled instances that
are labeled with high confidence by one classifier are used as training data for
the other. This is repeated until none of the classifiers changes. Algorithm 7 gives
an overview of the co-training algorithm. Another approach to co-training is
to use different learning algorithms instead of feature subsets as in [39]. This
version uses statistical tests to decide which unlabeled data should be labeled
with confidence.
In this chapter, we propose two improvements for co-training. We call the
resulting method Ensemble-Co-Training. First we consider co-training by an en-
semble of N classifiers that are trained in parallel by different learning algorithms, and second we derive a stopping criterion based on the training error rate, using a theorem by Angluin and Laird [1] that describes the effect of adding uncertain data.
Algorithm 7 Co-Training
Initialize: L, U;
  L: labeled data; U: unlabeled data; N: number of iterations;
  X1 and X2 are two views of the feature space;
Sample a set of unlabeled examples U′ from U;
i ← 1;
for each iteration i < N do
  Train classifier f1 on the labeled data L using feature view X1;
  Train classifier f2 on the labeled data L using feature view X2;
  Allow f1 to assign labels to the unlabeled examples in U′;
  Allow f2 to assign labels to the unlabeled examples in U′;
  Select a set of high-confidence predictions;
  Add the newly-labeled examples to L;
  Re-sample U′ from U;
end for
Output: Generate the final hypothesis;
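A minimal two-view co-training sketch in Python is given below for concreteness; it assumes two scikit-learn-style classifiers, user-supplied feature-index arrays for the two views, and a fixed confidence threshold, and it omits the sampled pool U′ and other refinements of Algorithm 7.

import numpy as np

def co_training(clf1, clf2, X_lab, y_lab, X_unlab, view1, view2,
                threshold=0.9, n_iter=10):
    """Two-view co-training: each classifier labels examples for the other.

    view1, view2: index arrays selecting the feature subsets of the two views.
    Both classifiers are assumed to provide fit/predict_proba.
    """
    X_train, y_train, U = X_lab, y_lab, X_unlab
    for _ in range(n_iter):
        clf1.fit(X_train[:, view1], y_train)
        clf2.fit(X_train[:, view2], y_train)
        if len(U) == 0:
            break
        selected = np.zeros(len(U), dtype=bool)
        labels = np.empty(len(U), dtype=y_lab.dtype)
        for clf, view in ((clf1, view1), (clf2, view2)):
            proba = clf.predict_proba(U[:, view])
            keep = proba.max(axis=1) >= threshold          # high-confidence predictions
            labels[keep & ~selected] = clf.classes_[
                proba.argmax(axis=1)][keep & ~selected]
            selected |= keep
        if not selected.any():                             # nothing left to add
            break
        X_train = np.vstack([X_train, U[selected]])
        y_train = np.concatenate([y_train, labels[selected]])
        U = U[~selected]
    return clf1, clf2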
Training an ensemble that votes on the labels for unlabeled data has the advan-
tage that better estimates of confidence of predictions are obtained. This is useful
because semi-supervised learning is used in settings where only a small amount
of labeled data is available. The stopping criterion is useful because it prevents unnecessary extra iterations. Our proposed method uses Theorem 1, which has two parameters taken from PAC-learning (Probably Approximately Correct) [24] with an intuitive meaning: the probability and the size of a prediction error. We derive the algorithm from this theorem. We then perform a set of experiments on UCI datasets to show the effect of the proposed method. Our experiments show that
Ensemble-Co-Training improves the classification performance and outperforms
the other related methods.
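For intuition only, the sketch below shows the Angluin and Laird sample-size bound as it is commonly cited, together with an illustrative "is it worth adding the newly-labeled data" check of the kind used in related co-training-style methods (e.g., Tri-training); the exact measure derived in this chapter follows from Theorem 1 in Section 4.3 and may differ.

import math

def angluin_laird_sample_size(epsilon, eta, num_hypotheses, delta=0.05):
    """Sample size sufficient for PAC learning under classification noise.

    Angluin and Laird's bound as commonly cited:
        m >= 2 / (epsilon^2 * (1 - 2*eta)^2) * ln(2 * N / delta)
    where eta < 0.5 is the noise rate in the training labels and N is the
    number of hypotheses.
    """
    assert eta < 0.5
    return 2.0 / (epsilon ** 2 * (1 - 2 * eta) ** 2) * math.log(2 * num_hypotheses / delta)

def keep_adding(prev_error, prev_newly_labeled, new_error, new_newly_labeled):
    """Illustrative criterion (not the thesis' exact measure): adding the
    newly-labeled data is considered worthwhile only if the product of the
    estimated error (noise) rate and the size of the newly-labeled set
    decreases between iterations, so the effective epsilon in the bound
    above does not grow."""
    return new_error * new_newly_labeled < prev_error * prev_newly_labeled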
The rest of this chapter is organized as follows. Section 4.2 outlines differ-
ent kinds of co-training algorithms. In Section 4.3 we derive the Ensemble-Co-
Training algorithm. Section 4.4 describes the experimental setup. In Section 4.5
we compare the results of Ensemble-Co-Training on the UCI datasets. Finally,
in Section 4.6, we present our conclusion and discussion.
and Mitchell [8] works well in domains that naturally have multiple views of data.
Nigam and Ghani [72] analyzed the effectiveness and applicability of co-training
when there are no two natural views of data. They show that when independent
and redundant views exist, co-training algorithms outperform other algorithms
using unlabeled data, otherwise there is no difference. However in practice, many
domains are not described by a large number of attributes that can naturally
be split into two views. Randomly partitioning the attributes into views is not always effective [72]. The authors of [72] proposed a new multi-view co-training algo-
rithm, which is called co-EM. This algorithm combines multi-view learning with
the probabilistic EM model. The Naive Bayes classifier is used to estimate class
labels.
Another type of co-training algorithm uses a learning method with a single
view and multiple classifiers. An example of this approach was developed by
Goldman and Zhou [39]. They propose a method which trains two different
classifiers with different learning algorithms. Their method uses a statistical test as the selection metric to select unlabeled data for labeling. The rest of the co-training process in their method is the same as in the original co-training. The same
authors proposed Democratic Co-Learning [107]. In democratic co-learning, a set
of different learning algorithms is used to train a set of classifiers separately on the labeled data set in a self-training manner. They use a statistical method for
selecting unlabeled data for labeling.
Zhou and Li [109] propose the Tri-Training method, which is a single view and
multiple classifiers type of co-training algorithm. The proposed method uses a
base learning algorithm to construct three classifiers from different sub-samples.
The classifiers are first trained on data sets that are generated from the initial
labeled data via bootstrap sampling [13]. Two classifiers then predict labels
for the third classifier. Predictions are made via majority voting by the three
final classifiers. Tri-Training does not need multiple views of data nor statistical
testing. Tri-Training can be done with more than three classifiers, which gives
a method called Co-Forest [60]. A problem with this approach in the context of
semi-supervised learning is that there is only a small amount of labeled data and
therefore it is not possible to select subsamples that vary enough and are of a
sufficient size. Furthermore, ensemble learning methods need different classifiers
[54] to be effective, but, as mentioned earlier, when there is only a small amount of
labeled data then all initial training data that are generated by bagging methods
will be roughly the same. Therefore, in this case the ensemble approach will not
be effective.
There are many applications for the original co-training approach and co-EM,
for example see [12]. In [88] an ensemble of decision trees is used for image
classification. The basic idea of this algorithm is to start with a limited number
of labeled images, and gradually propagate the labels to the unlabeled images.
At each iteration a set of newly-labeled examples is added to the training set to
improve the performance of the decision tree learning algorithm. The improved
decision trees are used to propagate labels to more images, and the process is
repeated until it reaches the stopping condition. This algorithm directly uses
the predictions of weak decision tree classifiers, which are not accurate enough.
Adding mislabeled instances may lead to worse results.
In this chapter we propose a single view and multiple classifiers type of co-
training algorithm. The main idea of our approach is to use multiple base classifiers with different learning algorithms instead of using the same base learner on different subsamples of the original labeled data. In this approach, the algorithm does not require independent and redundant attributes; instead it needs N different hypotheses, as in ensemble methods. For this, we employ N different learning algorithms. This approach also selects a subset of high-confidence predictions at each iteration using the different classifiers, which is more challenging in self-training. We use a notion from PAC-learning for controlling the error rate, selecting a subset of high-confidence predictions by the ensemble, and deciding when to stop the training process.
Finally, our proposed method is related to “Disagreement Boosting” [58],
which performs a form of boosting in which examples are re-weighted not by
their predictability but by disagreement between multiple classifiers. Both meth-
ods use agreement between trained classifiers. In the boosting approach the learning algorithm constructs a hypothesis in each iteration and incorporates these into a linear combination, whereas in our approach each learning algorithm contributes one hypothesis. These are then combined to make a prediction.
At each iteration, the ensemble determines which high-confidence predictions should be added to the initial labeled data in order to improve the
classification performance. Finally, the training process will be stopped when the
estimated error rate in the initial labeled data is expected to increase. Figure 4.1
shows the general overview of Ensemble-Co-Training. In this section we review
[Figure 4.1: Overview of the Ensemble-Co-Training algorithm. An ensemble Hp of N component classifiers is trained on the labeled data; Hp estimates the error rate of each component classifier, labels sampled unlabeled data, and high-confidence newly-labeled examples are added to each component classifier's training set based on the proposed criterion.]
the theorem that we use and show how it can be used to define a criterion for
adding newly-labeled data. The entire algorithm is presented in section 4.3.2.
cannot expect that the effect gives substantial improvements on the labeled data
either. This motivates a need for an estimate of the effect of labeling unlabeled
data and adding them to the training data. Inspired by [39] and [109], we formu-
late a function that estimates the true classification error of a hypothesis from
the size of the training set and the probability that a data point is mislabeled.
This is based on a theorem, in the style of PAC-learning, by Angluin and Laird
[1]. This theorem is as follows.
Theorem 1. If we draw a sequence σ of m data points where each data point
has a probability η of being mislabeled, and we compute the set of hypotheses
that are consistent with σ then if
\[ m \ge \frac{2}{\epsilon^{2}(1-2\eta)^{2}} \ln\!\left(\frac{2N}{\delta}\right) \tag{4.1} \]
holds, where ε is the classification error of the worst remaining candidate hypothesis on σ, η (< 0.5) is an upper bound on the noise rate in the classifications of
the data, N is the number of hypotheses, and δ is a probability that expresses
confidence, then for a hypothesis Hi that minimizes disagreement with σ holds
that:
\[ \Pr\big[\, d(H_i, H^{*}) \ge \epsilon \,\big] \le \delta \tag{4.2} \]
where d(·,·) is the sum over the probabilities of differences between the classifications of the data according to hypothesis Hi and the actual labels.
To construct an estimator for the error of a hypothesis, ε, we rewrite the above
inequality as follows. First, we set δ to a fixed value, which means that we assume
that the probability of an error is equal for all hypotheses, and second assume
that N is (approximately) constant between two iterations.
Here, we introduce c such that c = 2λ ln(2N/δ), where λ is chosen so that equation (4.1) holds. Substituting c in (4.1) gives:
\[ m = \frac{c}{\epsilon^{2}(1-2\eta)^{2}} \tag{4.3} \]
Rearranging (4.3), we obtain the quantity
\[ n = \frac{c}{\epsilon^{2}} = m\,(1-2\eta)^{2} \tag{4.4} \]
which grows as the worst-case error ε shrinks and which we track at each iteration of the training process. In more detail, let L, U, and Li,j denote the labeled data, the unlabeled data, and the newly-labeled instances for the jth classifier in the ith iteration of the training process, respectively. Moreover, in the ith iteration, a
component classifier hj has an initial labeled data with size |L| and a number
of newly-labeled data with size |Li,j |, determined by the ensemble Hp . Assume
that the error rate of Hp on Li,j is êi,j . Then the number of instances that are
mislabeled by Hp in Li,j is estimated as êi,j|Li,j|, where êi,j is an upper bound on the classification error rate of Hp. The same computation is also done for the initial labeled data, giving ηL|L|, where ηL denotes the classification noise rate of L. As mentioned above, the training set in the ith iteration of the training process is L ∪ Li,j for a component classifier hj. In this training set, the number of noisy instances is estimated as ηL|L| instances from L plus êi,j|Li,j| instances from Li,j. Therefore, the noise rate in this training set can be estimated by:
\[ \eta_{i,j} = \frac{\eta_{L}|L| + \hat{e}_{i,j}|L_{i,j}|}{|L| + |L_{i,j}|} \tag{4.5} \]
As shown, the noise rate in the ith iteration for a component classifier hj is estimated by (4.5). Since c in (4.4) is a constant, for simplicity assume that c = 1; substituting (4.5) in (4.4), when training is in the ith iteration, gives:
\[ n_{i,j} \simeq \frac{1}{\epsilon_{i,j}^{2}} = (|L| + |L_{i,j}|)(1 - 2\eta_{i,j})^{2} . \tag{4.6} \]
From this we derive a criterion for whether adding data reduces the error of
the result of learning or not.
Theorem 2. If the following inequality is satisfied in the ith and (i − 1)th iterations:
\[ \frac{\hat{e}_{i,j}}{\hat{e}_{i-1,j}} < \frac{|L_{i-1,j}|}{|L_{i,j}|} < 1 \tag{4.7} \]
where j = 1, 2, ..., k, then the worst-case error rate for a component classifier hj satisfies ε_{i,j} < ε_{i−1,j}.
Proof:
Given the inequalities in (4.7), we have:
\[ |L_{i,j}| > |L_{i-1,j}| \quad \text{and} \quad \hat{e}_{i,j}|L_{i,j}| < \hat{e}_{i-1,j}|L_{i-1,j}| \tag{4.8} \]
and then
\[ \frac{\eta_{L}|L| + \hat{e}_{i,j}|L_{i,j}|}{|L| + |L_{i,j}|} < \frac{\eta_{L}|L| + \hat{e}_{i-1,j}|L_{i-1,j}|}{|L| + |L_{i-1,j}|} \tag{4.10} \]
and hence η_{i,j} < η_{i−1,j}. Together with (4.6) and (4.10) this gives n_{i,j} > n_{i−1,j}, and since according to (4.4) n ∝ 1/ε², it follows that ε_{i,j} < ε_{i−1,j}.
Theorem 2 can be interpreted as saying that if the inequality (4.7) is satisfied,
then the worst-case error rate of a component classifier hj will be iteratively
reduced, while the size of the set of newly-labeled instances iteratively increases during the training process.
As can be derived from (4.7), êi,j < êi−1,j and |Li,j | > |Li−1,j | should be
satisfied at the same time. However, in some cases êi,j |Li,j | < êi−1,j |Li−1,j | may
be violated because |Li,j| might be much larger than |Li−1,j|. When this occurs, in order not to stop the training process, a subsample of Li,j is randomly selected such that the new Li,j satisfies:
\[ |L_{i,j}| < \frac{\hat{e}_{i-1,j}\,|L_{i-1,j}|}{\hat{e}_{i,j}} . \tag{4.14} \]
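To make the criterion concrete, the following Python sketch (with illustrative helper names, not code from the thesis) computes the noise rate (4.5), the quantity n_{i,j} of (4.6) with c = 1, the stopping criterion (4.7), and the subsampling cap (4.14), assuming the disagreement-based error estimates ê are already available:

```python
import math
import random

def noise_rate(eta_L, n_L, e_hat, n_new):
    """Estimated noise rate of L ∪ L_{i,j}, Eq. (4.5)."""
    return (eta_L * n_L + e_hat * n_new) / (n_L + n_new)

def n_value(n_L, n_new, eta):
    """Worst-case quantity n_{i,j} ≃ 1/ε², Eq. (4.6) with c = 1."""
    return (n_L + n_new) * (1.0 - 2.0 * eta) ** 2

def criterion_holds(e_prev, size_prev, e_curr, size_curr):
    """Inequality (4.7): ê_{i,j}/ê_{i-1,j} < |L_{i-1,j}|/|L_{i,j}| < 1."""
    return e_curr / e_prev < size_prev / size_curr < 1.0

def cap_newly_labeled(newly_labeled, e_prev, size_prev, e_curr):
    """If (4.7) fails only because L_{i,j} grew too fast, subsample it so that
    ê_{i,j}|L_{i,j}| < ê_{i-1,j}|L_{i-1,j}| (Eq. 4.14)."""
    limit = int(math.ceil(e_prev * size_prev / e_curr)) - 1
    if len(newly_labeled) > limit > 0:
        newly_labeled = random.sample(newly_labeled, limit)
    return newly_labeled
```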
The condition in inequality (4.7) is used for stopping the training process
and controlling the number of newly-labeled data as well as the error rate in
the Ensemble-Co-Training algorithm. In Section 4.3.2 we present the Ensemble-
Co-Training algorithm based on the analysis that we did in this section. This
condition is based on several assumptions. The theorem holds for a "candidate elimination" type of learning algorithm rather than a best-hypothesis estimator that learns from examples that are drawn at random. In our application the examples are selected and a single hypothesis is constructed in each iteration. At
the end the multiple hypotheses are combined in an ensemble. An overview of
our proposed algorithm is presented in Figure 4.2.
• Initialize: L, U, H;
• L: Labeled data; U: Unlabeled data;
• H: Ensemble of classifiers;
• At each iteration i:
  1. for j ∈ {1, 2, ..., k}
     - Find êi,j as the error rate for component classifier hj based on disagreement among the classifiers
     - Assign labels to the unlabeled examples based on the agreement among the ensemble Hp
     - Sample high-confidence examples for component classifier hj
     - Build the component classifier hj based on the newly-labeled and original labeled examples
  2. Control the error rate for each component classifier based on inequality (4.7)
     - Update ensemble H
• Generate final hypothesis
Hp estimates the error rate for each component classifier. After that, a subset of U
is selected by ensemble Hp for a component classifier hj . A pre-defined threshold
is used for selecting high-confidence predictions. Data that have an improvement
of error above the threshold are added to the labeled training data. Note that
each classifier has its own training set throughout the training process. This prevents the classifiers from converging too early and strengthens the effect of the ensemble. The
data that is labeled for the classifier is not removed from the unlabeled data U to
allow it to be labeled for other classifiers as well. This training process is repeated
until there are no more data that can be labeled such that they improve the per-
formance of any classifier. An outline of the Ensemble-Co-Training algorithm is
shown in Figure 4.2.
In the Ensemble-Co-Training algorithm, instead of computing ε_{i,j}, we use the disagreement among the classifiers as the error rate, denoted êi,j. Through the training
process the algorithm attempts to decrease the disagreement among classifiers in
order to improve the performance.
Note that inequality (4.7) sometimes cannot be satisfied because the size of Li,j is much larger than the size of Li−1,j. In this case the training process would stop early, even though the error rate could still be reduced; the reason is that the error rate is a worst-case error rather than an expectation. To solve this problem, a subsample of Li,j is randomly selected that does satisfy inequality (4.7), i.e., êi,j|Li,j| < êi−1,j|Li−1,j|; this gives:
\[ |L_{i,j}| = \left[ \frac{\hat{e}_{i-1,j}\,|L_{i-1,j}|}{\hat{e}_{i,j}} \right] - c_0 . \tag{4.15} \]
The final hypothesis combines the component classifiers by averaging their class-probability estimates:
\[ \arg\max_{m} \left( \frac{1}{K} \sum_{i=1}^{K} p_i(y_m \mid x) \right) , \tag{4.17} \]
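A minimal sketch of this combination rule, assuming each trained component classifier exposes a scikit-learn-style predict_proba method over a common class ordering (an assumption made for this illustration):

```python
import numpy as np

def ensemble_predict(classifiers, X):
    """Combination rule of Eq. (4.17): average the class-probability estimates
    of the component classifiers and pick the highest-scoring class index."""
    probs = np.mean([clf.predict_proba(X) for clf in classifiers], axis=0)
    return probs.argmax(axis=1)
```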
Ensemble-Co-Training(L, U, H, θ)
Input:  L: initial labeled data;  U: unlabeled instances;
        H: the N classifiers;
        θ: the pre-defined confidence threshold;
        {P_l}, l = 1, ..., M: prior class probabilities
Begin
    for j ∈ {1, 2, ..., k} do
        h_j ← Learn(Classifier_j, L)
        L_{0,j} ← ∅
        ê_{0,j} ← 0.5                       // upper bound of the error rate
    end for
    i ← 0                                   // iteration counter
    repeat
        i ← i + 1
        for j ∈ {1, 2, ..., k} do
            ê_{i,j} ← EstimateError(H_p, L)
            if (ê_{i,j} < ê_{i-1,j}) then
                for each x ∈ U do
                    if (Confidence(x, H_p) > θ) then
                        LabelingUnlabeled(x, H_p)
                        L_{i,j} ← L_{i,j} ∪ {(x, H_p(x))}
                end for
                if (|L_{i-1,j}| = 0) then
                    |L_{i-1,j}| ← ⌊ ê_{i,j} / (ê_{i-1,j} − ê_{i,j}) + 1 ⌋
                if (|L_{i-1,j}| < |L_{i,j}| and ê_{i,j}|L_{i,j}| < ê_{i-1,j}|L_{i-1,j}|) then
                    update_{i,j} ← TRUE
                else if (|L_{i-1,j}| > ê_{i,j} / (ê_{i-1,j} − ê_{i,j})) then
                    for each class l, sample n_l ∝ P_l such that
                        L_{i,j} ← Subsample(L_{i,j}, ⌈ ê_{i-1,j}|L_{i-1,j}| / ê_{i,j} − 1 ⌉)
                    update_{i,j} ← TRUE
                end if
            end if
        end for
        for j ∈ {1, 2, ..., k} do
            if (update_{i,j} = TRUE) then
                h_j ← Learn(Classifier_j, L ∪ L_{i,j})
                ê_{i-1,j} ← ê_{i,j}
                |L_{i-1,j}| ← |L_{i,j}|
            end if
        end for
    until (none of the h_j changes)
    Output: final hypothesis H(x) ← arg max_y Σ_{i : h_i(x) = y} 1
End
4.4 Experiments
In the experiments we compare Ensemble-Co-Training (ECT) with several other
algorithms. One comparison is with using the base learner on the labeled data
only for supervised learning and the second is with semi-supervised learning al-
gorithms. The purpose is to evaluate if Ensemble-Co-Training exploits the in-
formation in the unlabeled data. We also compare Ensemble-Co-Training to the
variations that use different voting strategies for LabelingUnlabeled and Estima-
teError functions. Furthermore, we include comparisons with the algorithms for
semi-supervised learning, in particular self-training, Tri-Training, and Co-Forest.
In our experiments, we use the WEKA implementation of the classifiers with
default parameter settings [42].
4.4.1 Datasets
We use UCI datasets for assessing the proposed algorithms. We selected these
datasets, because they vary in the number of features and examples, and dis-
tribution of classes. Table 4.1 summarizes the specification of eight benchmark
datasets from the UCI data repository [32] which are used in our experiments.
For each dataset, about 30 percent of the data are kept as test set, and the rest
is used as training examples, which include a small amount of labeled and a large
pool of unlabeled data. Training instances in each experiment are partitioned
into 90 percent unlabeled data and 10 percent labeled data, keeping the class
proportions in all sets similar to the original data set. In each experiment, eight
independent runs are performed with different random partitions of L and U .
The average results are summarized in Tables 4.2 to 4.7.
adaptations that often improve performance in domains with sparse data. The
Random Forest [15] approach is employed in the Co-Forest learning method. We
describe Ensemble-Co-Training for any number of supervised learning algorithms.
Here, we only consider four learners: C4.4grafted, Naive Bayes, Random forest,
and J48 with the Laplacian correction algorithm. To make performance comparable across Co-Forest, self-training, and Ensemble-Co-Training, we set the value of the pre-defined threshold (θ) to 0.75 for all classifiers.
4.5 Results
In the first experiment we compare the learning algorithms in the supervised
setting. Table 4.2 shows the classification accuracies. As can be seen, in general
ensemble methods achieve better classification accuracies than a single classifier,
which is what we expected. The best performance is boldfaced for each dataset.
in the results, but it is less than using Ensemble-Co-Training with the criterion.
The third column (“ECT Majority Voting”) of Table 4.6 shows the results of
using the criterion and the second column (“ECT”) of Table 4.7 gives the results
without employing the criterion. The presented method not only improves the
accuracy, but also it reduces the complexity of the training process. Another
observation in this experiment is that the Ensemble-Co-Training without the
criterion also works better than Democratic Co-Learning, because it benefits
from the co-training approach for the training process.
[Figure omitted: classification accuracy (y-axis) per dataset (bupa, colic, diabetes, heart, hepatitis, ionosphere, tic-tac-toe, vote).]
with discriminative classifiers (that do not try to model the distribution of the
data). They point out that an assumption is required on the relation between
the distribution of the data and of the classes. Without such an assumption
semi-supervised learning is not possible. Our method is implicitly based on the
assumption that the learning algorithms that construct the individual hypotheses
bring the relevant learning biases to bear on the data and that their agreement is
a good measure for the predictability of the data. The experiments show that the
learning algorithms that were included together give good results in a range of
application domains that vary substantially in many dimensions. As future work, it would be interesting to make a further comparison with Disagreement-Boosting methods.
Chapter 5
Multiclass Semi-Supervised Boosting
5.1 Introduction
Most of the existing semi-supervised learning methods were originally designed
for binary classification problems. Nevertheless, many application domains, such
as text and document classification, pattern recognition, and text mining, involve
more than two classes. To solve the multiclass classification problem two main
approaches have been proposed. The first is to convert the multiclass problem
into a set of binary classification problems. Examples of this approach include
one-vs-all, one-vs-one, and error-correcting output code [30]. This approach can
have various problems such as imbalanced class distribution, increased complex-
ity, no guarantee to have an optimal joint classifier or probability estimation, and
different scales for the outputs of generated binary classifiers which complicates
combining them, see [48] and [79]. The second approach is to use a multiclass
classifier directly. Although a number of methods have been proposed for multi-
class supervised learning, for example [69], [110], and [79], they are not able to
handle multiclass semi-supervised learning, which is the aim of this study.
Most semi-supervised learning methods work in an iterative or an additive manner. Iterative semi-supervised learning algorithms replace their hypothesis at each iteration by a new one. Additive algorithms add a new component to a linear
combination as in boosting algorithms. Typically, self-training algorithms are
iterative. Additive algorithms, like boosting [35], are generic ensemble learning
frameworks, which sequentially construct linear combination of base classifiers.
Boosting is one of the most successful ensemble methods for supervised learning.
The boosting approach has been extended to semi-supervised learning with dif-
ferent strategies. Examples are MarginBoost [23] and ASSEMBLE [6]. The main
difficulties in this approach are how to sample the unlabeled examples and which
class labels to assign to them. MarginBoost and ASSEMBLE use the classifier
to predict the class labels of the unlabeled examples, the “pseudo-labels”. These
algorithms then attempt to minimize the margin cost for the labeled data and
a cost associated with the “pseudo-labels” of the unlabeled data. A sample of
the pseudo-labeled data is then used in the next iteration. The pseudo-labels and
also the associated cost depend strongly on the classifier predictions and therefore
this type of algorithm cannot exploit information from the unlabeled data and
the final decision boundary is very close to that of the initial classifier, see [20]
and [66].
To solve the above problem, a recent idea is to use a smoothness regularizer
based on the cluster and the manifold assumptions. The idea is to not use the
“pseudo-margin” from the predictions of the base learner directly. Instead, besides the margin on the labeled data, the “consistency” is also maximized. Consistency
is a form of smoothness. Class labels are consistent if they are equal for data
that are similar. For this the definition of the weight for data and for classifiers
in the boosting algorithm must be modified. This is done in SemiBoost [66]
and RegBoost [20]. Experiments show that this idea is more effective than the
approach based on “pseudo-margin” or confidence, see [20] and [66]. However,
this solution is limited to binary classification problems. For a multiclass semi-
supervised learning problem, the state-of-the-art algorithms are now those based
on the “pseudo-margin” [90] or the use of binary methods, like SemiBoost and
Regboost, to handle the multiclass case employing the one-versus-all or similar
meta-algorithm, which is problematic as mentioned earlier. Recently, in [99] a
new boosting method is used for semi-supervised multiclass learning which uses
similarity between predictions and data. It maps labels to an n-dimensional space, but this mapping may not lead to training a classifier that properly minimizes the margin cost.
In this chapter we present a boosting algorithm for multiclass semi-supervised
learning, named Multi-SemiAdaBoost. Unlike many semi-supervised learning
algorithms that are extensions of specific base classifiers, like SVM, to semi-
where F (.) is a binary classifier, `u shows the penalty term for the unlabeled
data, and l and u are the labeled and unlabeled data. Typically semi-supervised
learning methods vary in the underlying assumptions. Two main assumptions for
semi-supervised learning are the cluster and manifold assumptions.
Many graph-based methods, as introduced in Chapter 2, for semi-supervised
learning are based on the manifold assumption. These methods build a graph
based on the pairwise similarity between examples (labeled and unlabeled). Then,
the goal is to estimate a function on the graph. This function should minimize
the loss on labeled examples and be smooth over the whole graph. Many graph-based
methods use the following loss function:
\[ \ell_u(F(x)) = \sum_{\substack{x' \in X_u \\ x \ne x'}} S(x, x')\, \| F(x) - F(x') \|_2^2 \tag{5.2} \]
where S(x, x′) is the pairwise similarity. This penalty term in fact forces the classifier to predict similar class labels for similar examples, so that the inconsistency over the whole graph is minimized. Examples of the methods that use the above loss
function are Spectral Graph Transducer [51], Label Propagation approach [112],
Gaussian process based approach [55], and Manifold Regularization [4]. Two
main issues in graph-based approaches are: (1) they are very sensitive to the chosen similarity kernel [114], and (2) they were originally designed for binary classification problems. Another challenge in graph-based methods is to find a suitable graph structure [113]. We use the above smoothness term in our proposed method in a different way for the multiclass case.
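For illustration only (this is not an algorithm from the thesis), the smoothness penalty of (5.2) could be evaluated with numpy as follows, assuming a precomputed similarity matrix S and classifier outputs F over the unlabeled points:

```python
import numpy as np

def smoothness_penalty(S, F):
    """Graph smoothness loss of Eq. (5.2): sum of S(x, x') * ||F(x) - F(x')||^2
    over pairs of unlabeled points. S has shape (n, n), F has shape (n, d)."""
    diff = F[:, None, :] - F[None, :, :]          # pairwise differences F(x) - F(x')
    sq_dist = np.sum(diff ** 2, axis=-1)          # squared Euclidean norms
    np.fill_diagonal(sq_dist, 0.0)                # exclude the x == x' pairs
    return np.sum(S * sq_dist)
```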
Most of the semi-supervised learning algorithms that are based on the clus-
ter assumption are those that use the unlabeled data to regularize the decision
boundary. These algorithms are usually extensions of a specific base classifier, such as SVM. Examples of these methods are Semi-Supervised SVM [5], Transductive
SVM (TSVM) [50], and [19]. A TSVM uses the following regularization term in
its loss function:
examples are used throughout the training process and will propagate to later
iterations.
Additive algorithms are of the boosting type. Often boosting is used
to improve the classification margin by sampling a subset of high-confidence pre-
dictions of the base classifier with their “pseudo-labels”. Then, these newly-
labeled examples are used besides the original labeled examples for constructing
a new classifier for the next iteration. At each iteration a new classifier is con-
structed but this does not replace the previous. It then receives a weight and
is added to ensemble of classifiers. Examples of this type of algorithm for semi-
supervised learning are ASSEMBLE [6] and Marginboost [23]. This approach
uses the “pseudo-margin” concept for unlabeled data and forms the following
penalty term for unlabeled data:
\[ \ell_u(F(x)) = \sum_{x \in X_u} \exp(-|F(x)|) \tag{5.4} \]
Although the additive approach improves the classification margin, there are
two main issues. First, it relies only on the classification confidence to assign
“pseudo-labels” and does not provide novel information from the unlabeled ex-
amples, such as similarity between examples or marginal distributions. Conse-
quently, the new classifier trained on the newly-labeled examples is likely to have the same decision boundary as the first classifier instead of constructing a new one. The reason is that by adapting the decision boundary the poor
predictions will not gain higher confidence, but instead the examples with high
classification confidence will gain even higher confidence. This issue also has been
addressed in [66] and [20]. Another issue with this method is that the proposed
loss function was originally designed for binary classification problems and may not work properly for multiclass classification problems, which require defining a multiclass margin and a multiclass loss function. To solve the first problem, a
recent approach is to use the classification confidence as well as additional infor-
mation from unlabeled data like pairwise similarities together in order to design
a new form of loss function. The goal of the designed loss function is to mini-
mize the inconsistency between the classifier predictions and pairwise similarities.
Examples are SemiBoost [66] and RegBoost [20]. They derive criteria to assign
confidence to each unlabeled example as well as to the newly constructed classifiers. Then a set of high-confidence predictions is selected based on the assigned confidence. Although experiments show that they outperform the other methods, they are originally designed for binary classification problems.
Most recently, several semi-supervised boosting algorithms were developed for multiclass classification problems, see [99], [81], and [90]. In [99] a new boosting method is used for semi-supervised multiclass learning which uses similarity between predictions and data. It maps labels to an n-dimensional space, but this mapping may not lead to training a classifier that properly minimizes the margin cost. RMSBoost [81] uses the expectation regularization principle for multiclass
classification problems. The cross-entropy between the prior probability and the optimized model is used for regularization. The label priors used in RMSBoost then give a probabilistic way of assigning priors to unlabeled examples based on the underlying cluster structure. The extension of supervised Multiclass AdaBoost [110] is
used in [90] to design a method for multiclass semi-supervised learning based on
“pseudo-margin” approach as in ASSEMBLE, called SemiMultiAdaBoost. There
is thus a need for a direct multiclass algorithm for semi-supervised learning based on both the cluster and the manifold assumptions.
Our proposed method, Multi-SemiAdaBoost (MSAB), combines ideas from the methods reviewed above into the design of a multiclass semi-supervised boosting algorithm based on the manifold and cluster assumptions. MSAB exploits the advantages of graph-based methods to minimize the inconsistency between data, as well as the ensemble approach to minimize the margin cost, in order to solve the multiclass classification problem directly. We design an exponential loss function for multiclass problems.
which is known as the Bayes decision rule. Implementing the Bayes decision rule is not easy because it requires estimating PD(Y|X). One possible solution is to use the boosting approach, which combines many weak classifiers to estimate the Bayes
and
where LM[·,·] is a multiclass loss function. Note that the class labels ±1 play a significant role in the formulation of the loss function for binary classification problems. Therefore, there is a need to map the labels {1, ..., K} to a K-dimensional space so that this mapping is consistent with (5.8), which aims at finding an optimal predictor. For this, let Yi ∈ R^K be a K-dimensional vector with all entries equal to −1/(K−1) except a 1 in position i, as a coding method. Each class label i ∈ {1, ..., K} is then mapped into a vector Yi that indicates a class label. Each entry yi,j is thus defined as follows:
\[ y_{i,j} = \begin{cases} 1 & \text{if } i = j \\[4pt] \dfrac{-1}{K-1} & \text{if } i \ne j \end{cases} \tag{5.10} \]
where Yi ∈ Y and \(\sum_{j=1}^{K} y_{i,j} = 0\).
The above notation for the multiclass setting was used by [110] and [56] for multiclass AdaBoost and multiclass SVM, respectively. Here, we use the aforementioned multiclass notation in an exponential loss function. The resulting empirical risk is:
\[ R_M(H) \simeq \sum_{i=1}^{n} \exp\!\left( \frac{-1}{K} \langle Y_i, H(x_i) \rangle \right) \tag{5.12} \]
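As a small illustration (not part of the thesis), the coding of (5.10) and the empirical risk (5.12) can be computed as follows in Python; the array shapes and function names are assumptions made for the example:

```python
import numpy as np

def class_coding(K):
    """Rows are the vectors Y_i of Eq. (5.10): 1 in position i, -1/(K-1) elsewhere."""
    Y = np.full((K, K), -1.0 / (K - 1))
    np.fill_diagonal(Y, 1.0)
    return Y                                         # each row sums to zero

def multiclass_exp_risk(Y_codes, labels, H):
    """Empirical risk of Eq. (5.12): sum_i exp(-(1/K) <Y_i, H(x_i)>).
    labels are integer class indices; H has shape (n, K)."""
    K = Y_codes.shape[1]
    inner = np.einsum('ij,ij->i', Y_codes[labels], H)  # <Y_i, H(x_i)> per example
    return np.sum(np.exp(-inner / K))
```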
It has been shown that RM (H) is Bayes consistent, see [110]. Then
which means that (5.7) implements the Bayes decision rule. This justifies the use
of (5.12) as loss function for multiclass classification problem. We use the above
notation to formulate the loss function for multiclass semi-supervised learning.
In multiclass semi-supervised learning for the labeled points Xl = (x1 , x2 , ..., xl )
labels {1, ..., K} are provided and for the unlabeled points Xu = (xl+1 , xl+2 , ..., xl+u ),
the labels are not known. We use the coding method in (5.11) to formulate the
multiclass semi-supervised learning task.
For our algorithm we need a (symmetric) similarity matrix S = [Si,j ]n×n ,
where Si,j = Sj,i is the similarity between the points xi and xj . S lu = [Si,j ]nl ×nu
denotes the similarity matrix of the labeled and unlabeled data and S uu =
[Si,j ]nu ×nu of the unlabeled data. Our algorithm is a “meta-learner” that uses
a supervised learning algorithm as base learner.
We assume that the labeled and unlabeled data are drawn independently from
the same data distribution. In applications of semi-supervised learning, normally l ≪ u, where l is the number of labeled examples and u is the number of unlabeled examples.
where H(.) is an ensemble classifier and Lu (.) is the penalty term related to
the unlabeled examples, and C1 and C2 are the contribution of the labeled and
unlabeled data respectively. The terms L(.) and Lu (.) define the margin and
“pseudo-margin” for labeled and unlabeled examples respectively. Since there is
no true label for unlabeled examples, “pseudo-labels” are used for them and there-
fore instead of margin, “pseudo-margin” is defined for the unlabeled examples,
like |H(x)| in ASSEMBLE. Considering (5.14) in the boosting setting, the goal
of each iteration is to sample the unlabeled examples and assign “pseudo-labels”
to them. Then a new component classifier is built based on the new training set
which minimizes (5.14). For sampling and labeling the prediction of the boosted
classifier H(x) is used.
Unlike the existing semi-supervised boosting methods that rely only on the classifier predictions for labeling, our approach defines the loss for the unlabeled
data in terms of consistency, the extent to which examples that are similar have
the same label or “pseudo-label”. This explicitly includes the similarity between
examples. If examples that are similar, according to the similarity matrix, are
assigned different classes then this adds a penalty to the loss function. For this we
design an exponential multiclass loss function instead of directly using a binary
loss function. This gives three terms for the loss function: (1) the empirical loss
on the labeled data, (2) the consistency between labeled and unlabeled data, and
(3) the consistency between unlabeled data. Combining these terms results in a
minimization problem with the following objective function:
where C1 , C2 , and C3 are weights for the three terms. As we showed in (5.13), the
proposed multiclass exponential loss function in multiclass AdaBoost effectively
implements the Bayes decision rule. We adapt this loss function for multiclass
semi-supervised learning. The loss function for labeled data becomes:
\[ F_l(Y, H) = \sum_{i=1}^{n_l} \exp\!\left( -\frac{1}{K} \langle H(x_i), Y_i \rangle \right) \tag{5.16} \]
where nl and nu are the number of labeled and unlabeled data respectively.
The third term of (5.15) measures the consistency between the unlabeled
examples in terms of the similarity information and the classifier predictions. We
use the harmonic function to define Fuu (S uu , H). This is a popular way to define
the consistency in many graph-based methods, for example [115].
\[ F_{uu}(S^{uu}, H) = \sum_{i,j \in n_u} S^{uu}(x_i, x_j) \exp\!\left( \frac{1}{K-1} \langle H(x_i) - H(x_j), \vec{1} \rangle \right) \tag{5.18} \]
Combining the above three terms leads to the following optimization problem:
\[ \begin{aligned} &\text{minimize} && F(Y, S, H) \\ &\text{subject to} && H_1(x) + \dots + H_K(x) = 0, \quad x \in \mathcal{X}. \end{aligned} \tag{5.19} \]
where X is the training set. Here, we briefly show that F(Y, S, H) is convex. The first and second terms of (5.15) are composed of exponential functions and dot products; since exp(·) and its argument are convex and Si,j is positive semi-definite, (5.16) and (5.17) are convex. Expanding (5.18) as a power series yields the cosh(·) function, which is convex. Therefore, (5.15) is convex. The convex optimization problem can be solved effectively by numerical methods.
Another valuable property of the loss function (5.19) is that, on the one hand, it uses the boosting approach to exploit the cluster assumption as in many boosting methods and, on the other hand, it employs (5.18) as a regularization term for the unlabeled data, which is used in many graph-based methods, such as LapSVM [4], for manifold regularization. Therefore, the proposed method effectively exploits both the cluster and the manifold assumption. The relationship between manifold regularization and (5.18) is shown in Appendix A.
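For concreteness, the sketch below (a Python illustration, not code from the thesis) evaluates the three terms of the objective; the exact form of the labeled-unlabeled term (5.17) is not reproduced above, so it is assumed here to match the corresponding second term of (5.22):

```python
import numpy as np

def msab_objective(Y_l, H_l, H_u, S_lu, S_uu, K, C1=1.0, C2=1.0, C3=1.0):
    """Evaluate the three-term loss of (5.15).
    Y_l: (n_l, K) coded labels; H_l: (n_l, K) ensemble outputs on labeled data;
    H_u: (n_u, K) ensemble outputs on unlabeled data;
    S_lu: (n_l, n_u) and S_uu: (n_u, n_u) similarity blocks."""
    # labeled-data margin term, Eq. (5.16)
    F_l = np.sum(np.exp(-np.einsum('ik,ik->i', H_l, Y_l) / K))
    # labeled-unlabeled consistency term (form assumed from the second term of (5.22))
    F_lu = np.sum(S_lu * np.exp(-(Y_l @ H_u.T) / (K - 1)))
    # unlabeled-unlabeled consistency term, Eq. (5.18): <H(x_i) - H(x_j), 1>
    s = H_u.sum(axis=1)
    F_uu = np.sum(S_uu * np.exp((s[:, None] - s[None, :]) / (K - 1)))
    return C1 * F_l + C2 * F_lu + C3 * F_uu
```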
where β t ∈ R is the weight of the base classifier ht (x), and ht (x) is determined
by P (x), which is a multiclass base learner. Let also ht (x) denote a classifier as
follows:
\[ \forall x \in \mathbb{R}^{P}\!: \quad h^t(x) : \mathcal{X} \rightarrow \mathcal{Y} \tag{5.21} \]
where Y was defined in (5.11). The goal of each iteration is to find a new compo-
nent classifier ht (x) and its weight β t that minimizes the objective function (5.19)
and computes the weights for the data.
Two main approaches to solve the optimization problem (5.19) are: gradient
descent [20] and bound optimization [66]. A difficulty of the gradient descent
method is to determine the step size at each iteration. We therefore use the
bound optimization approach to derive the boosting algorithm for (5.19), which
automatically determines the step size at each iteration of boosting process.
At the t-th iteration, the optimization problem (5.15) is derived by substituting (5.20) in (5.19):
\[ \begin{aligned} F = \underset{h^t(x),\,\beta^t}{\arg\min}\;\; & C_1 \sum_{i=1}^{n_l} \exp\!\left( -\frac{1}{K} \big\langle H^{t-1}(x_i) + \beta^t h^t(x_i),\, Y_i \big\rangle \right) \\ {}+{}\; & C_2 \sum_{i=1}^{n_l} \sum_{j=1}^{n_u} S^{lu}(x_i, x_j) \exp\!\left( -\frac{1}{K-1} \big\langle H^{t-1}(x_j) + \beta^t h^t(x_j),\, Y_i \big\rangle \right) \\ {}+{}\; & C_3 \sum_{i,j \in n_u} S^{uu}(x_i, x_j) \sum_{k \in l} \exp\!\left( \frac{1}{K-1} \big\langle \big(H^{t-1}(x_i) + \beta^t h^t(x_i)\big) - \big(H^{t-1}(x_j) + \beta^t h^t(x_j)\big),\, e_k \big\rangle \right) \end{aligned} \tag{5.22} \]
where
\[ W_i = \exp\!\left( \frac{-1}{K} \langle H_i^{t-1}, Y_i \rangle \right) \tag{5.26} \]
and
\[ P_{i,k} = \sum_{j=1}^{n_l} S^{lu}(x_i, x_j)\, e^{\frac{-1}{K-1} \langle H_i^{t-1},\, e_k \rangle}\, \delta(Y_j, e_k) \;+\; \sum_{j=1}^{n_u} S^{uu}(x_i, x_j)\, e^{\frac{1}{K-1} \langle H_j^{t-1} - H_i^{t-1},\, e_k \rangle}\, e^{\frac{1}{K-1}} \tag{5.27} \]
In the optimization problem (5.25), Pi,k is used for weighting the unlabeled examples and Wi is used for weighting the labeled data. According to these criteria, the algorithm decides whether to select a sample for the new training set or not.
The expression in (5.25) involves both β t and ht(x) and is hence difficult to solve directly. In Appendix A, we take the derivative of (5.25), which leads to the optimal class value and weights for the unlabeled data as well as the weights for the classifiers. Then,
\[ \hat{Y}_i = \arg\max_{k}\,(P_{i,k}) \tag{5.28} \]
\[ \beta^t = \frac{(K-1)^2}{K} \left( \log(K-1) + \log\!\left( \frac{1-\epsilon^t}{\epsilon^t} \right) \right) \tag{5.31} \]
which is a generalization of the weighting factor of multiclass AdaBoost [110]. If the classification problem is binary (K = 2), we obtain the formula for β that is used in AdaBoost [35]. Our formulation can thus be seen as a generalization of AdaBoost to multiclass semi-supervised learning.
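As an illustrative sketch (assuming the weighted error ε^t of the new component classifier and the matrix P of values P_{i,k} have already been computed), the classifier weight (5.31) and the pseudo-label assignment (5.28) could be written as:

```python
import numpy as np

def beta_msab(eps_t, K):
    """Classifier weight of Eq. (5.31); reduces to 0.5*log((1-eps)/eps) when K = 2."""
    return ((K - 1) ** 2 / K) * (np.log(K - 1) + np.log((1.0 - eps_t) / eps_t))

def pseudo_labels(P):
    """Eq. (5.28): assign each unlabeled example the class maximizing P_{i,k};
    the maximal value itself serves as that example's sampling weight."""
    return P.argmax(axis=1), P.max(axis=1)
```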
Algorithm 8 Multi-SemiAdaBoost
Initialize: L, U, S, H 0 (x) = 0; L: Labeled data; U: Unlabeled data;
S: Similarity Matrix; H 0 (x): Ensemble of Classifiers; t ← 1;
while (β t > 0) and (t < M ) do // M is the maximum number of iterations
for each xi ∈ L do
Compute Wi for labeled example xi based on (5.26)
for each xi ∈ U do
- Compute Pi,k for unlabeled example xi based on the pairwise
similarity and classifier predictions using (5.27)
- Assign pseudo-labels to unlabeled examples based on (5.28)
- Normalize the weights of labeled and unlabeled examples
- Sample a set of high-confidence examples from labeled and
unlabeled examples
- Build a new weighted classifier ht (x) based on the newly-labeled
and original labeled examples
- Compute the weights and β t for each new classifier ht (x) using (5.29)
- Update H t ← H t−1 + β t ht
- t←t+1
end while
Output: Generate final hypothesis based on the weights and classifiers
At each iteration, the algorithm computes a weight for each unlabeled and labeled example. In order to compute the value of Pi,k, we use the similarity information among the data and the classifier predictions, and Wi is computed as in multiclass AdaBoost (SAMME). Then (5.28) is used to assign a “pseudo-label” to an unlabeled example, and the corresponding maximal value of Pi,k becomes its weight. In the next
step, a set of high-confidence newly-labeled data besides the labeled data are used
based on the weights for training a new classifier, called ht (x). The algorithm
then uses the value of Pi,k as weight to sample data which will lead to a decrease
of the value of the objective function. This new set is consistent with classifier
predictions and similarity between examples. As can be seen in Algorithm 8, a
new classifier is built at each iteration of the boosting process. Then the weight
of ht (x), called β t , is computed for each new classifier. The boosting process is
repeated until it reaches a stopping condition. After finishing, the final hypothesis
is the weighted combination of the generated classifiers. Although it has been empirically determined that using a fixed number of classifiers (around 20) is best for AdaBoost [37], we use β t ≤ 0 as a stopping criterion. This is analogous to using ε t ≥ 1/2 for AdaBoost. Optionally, for pragmatic reasons the algorithm
can of course be given a threshold on the number of iterations.
As can be seen in (5.27), the value of Pi,k depends on several factors: (i) the value will be large if xi cannot be classified confidently, meaning that Hi^{t−1} · ek is small, while one of its close neighbors is labeled; the first term of (5.27) captures this; (ii) the example is close to some unlabeled examples that are confidently classified, which means a large similarity and a large Hj^{t−1} · ek for the unlabeled example xj; this comes from the second term of (5.27). This shows the
role of similarity information for sample selection in Multi-SemiAdaBoost and
makes it different from the other methods, such as ASSEMBLE, MarginBoost
and SemiMultiAdaBoost [90].
To solve (5.32), we use the same analysis as for Multi-SemiAdaBoost; more detail is given in Appendix B. The expressions that are required for the resulting algorithm, which we call Multiclass SemiBoost, are:
\[ P_{i,k} = \sum_{j=1}^{n_l} S^{lu}(x_i, x_j)\, e^{\frac{-1}{K} \langle H_i^{t-1},\, e_k \rangle}\, \delta(Y_j, e_k) \;+\; \sum_{j=1}^{n_u} S^{uu}(x_i, x_j)\, e^{\frac{1}{K} \langle H_j^{t-1} - H_i^{t-1},\, e_k \rangle}\, e^{\frac{1}{K}} \tag{5.35} \]
and then,
and then,
\[ \hat{Y}_i = \arg\max_{k}\,(P_{i,k}) \tag{5.36} \]
and the weight for the example will be \( w_i' = \max_k |P_{i,k}| \). Meanwhile the value for β is:
\[ \beta^t = (K-1)\log(K-1) + \log\!\left( \frac{1-\epsilon^t}{\epsilon^t} \right) \tag{5.37} \]
where Pi,k is a criterion to assign the “pseudo-labels” to the unlabeled examples
and β t is the weight of each new component classifier ht (x).
(5.37), which is different from the weight of each component classifier in Multi-
SemiAdaBoost. Assigning weights for the unlabeled data is now based on the Pi,k
in (5.35).
5.7 Experiments
In the experiments we compare Multi-SemiAdaBoost (MSAB) with several other
algorithms. One comparison is with using the base learner on the labeled data
only and a second is with (multiclass) AdaBoost (SAMME [110]) using the same
base learner. The purpose is to evaluate if MSAB exploits the information in the
unlabeled data. We also compare MSAB with a version that does not include
the margin for the labeled data in the objective function, Multiclass SemiBoost
(MCSB). The effect of the labeled data depends on the quality of the base clas-
sifier and we would like to evaluate its importance. We include comparisons
with the state-of-the-art algorithms for semi-supervised boosting, in particular
ASSEMBLE, SemiMultiAdaBoost [90], RegBoost [20], and SemiBoost [66]. Like
MSAB, SemiBoost and RegBoost use smoothness regularization but they are
limited to binary classification. For comparison, we use the one-vs-all method to
handle the multiclass classification problem with RegBoost and SemiBoost. In
our experiments, we use the WEKA implementation of the classifiers with default
parameter settings [42].
Three components in Multi-SemiAdaBoost and Multiclass SemiBoost are: (i) the similarity between examples, (ii) sampling from the unlabeled examples, and (iii) the stopping condition for the boosting process.
\[ S(x_i, x_j) = \exp\!\left( -\frac{\| x_i - x_j \|_2^2}{\sigma^2} \right) \tag{5.38} \]
where σ is the scale parameter which plays a major role in the performance of
the kernel, and should be carefully tuned to the problem [114].
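A minimal sketch of computing this similarity matrix, assuming the rows of X are the feature vectors (the function name is illustrative only):

```python
import numpy as np

def rbf_similarity(X, sigma):
    """Pairwise similarity of Eq. (5.38): S_ij = exp(-||x_i - x_j||^2 / sigma^2)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(sq_dists, 0.0) / sigma ** 2)
```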
and on the other hand, selecting a large set of newly-labeled examples may include
some poor examples. One possible solution for that is to use a threshold or even
a fixed number which is optimized through the training process. Sampling data
for labeling is based on a measure of confidence, Pd(xi) in (5.39), where Ŷi is computed from (5.28). Pd(xi) is viewed as the probability distribution
of classes of the example xi , which amounts to a measure of confidence. For the
labeled data we use Wi in (5.26) as weight:
\[ P_d(x_i) = \frac{W_i}{\sum_{i=1}^{n_l} W_i} \tag{5.40} \]
where xi ∈ L. We use learning from weighted examples as in AdaBoost, and at each iteration we select the top 15% of the unlabeled data based on the weights and add them to the training set.
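A sketch of this selection step, assuming w_u holds the (unnormalized) weights of the unlabeled pool; the 15% fraction follows the setting described above, while the helper name is an assumption for this example:

```python
import numpy as np

def select_top_fraction(w_u, fraction=0.15):
    """Normalize the unlabeled-example weights (cf. the form of (5.40)) and return
    the indices of the top fraction to be added to the training set."""
    p = w_u / w_u.sum()
    n_select = max(1, int(np.ceil(fraction * len(w_u))))
    return np.argsort(p)[::-1][:n_select]
```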
Another component in the proposed methods is to identify the stopping con-
ditions. As mentioned in Section 5.5.1, we use both β t ≤ 0 and a fixed number of iterations as stopping conditions.
different subsets of training and testing data. The results reported refer to the
test set.
5.8 Results
Tables 5.2, 5.3, and 5.4 give the results of all experiments. The second and
third columns in these tables give the performance of supervised multiclass base
classifiers (DT for Decision Trees, NB for Naive Bayes, and SVM for Support
Vector Machine) and multiclass AdaBoost meta classifier, SAMME [110], using
the same classifiers as base learner. The columns ASML, SMBoost, RBoost,
and SBoost show the performance of four semi-supervised boosting algorithms
ASSEMBLE, SemiMultiAdaBoost, RegBoost and SemiBoost respectively.
Table 5.2: The classification accuracy and standard deviation of different algo-
rithms with 10% labeled examples and J48 as base learner
Supervised Learning Semi-Supervised Learning
DataSets DT SAMME ASML SMBoost RBoost SBoost MSAB MCSB
Balance 70.72 74.75 78.01 77.73 75.34 76.56 81.9 80.27
5.5 3.17 3.5 3.1 4.1 3.7 2.01 1.4
Car 76.93 78.34 78.97 72.88 78.4 78.59 80.4 79.81
1.6 2 1.6 1.3 1.3 1 1.7 1.8
Cmc 42.03 44.89 40.69 39.04 47.89 48.46 49.26 48.81
3.8 4.2 3.2 3.1 2.9 2.2 2.7 3
Dermat- 73 76.99 87.28 87.77 87.1 86.68 88.37 88.37
ology 4.1 5 4.2 3.1 3.9 3.9 2.2 2.2
Diabetes 68.81 70.97 71.98 69.78 71.5 73.2 74.96 73.2
5 2.5 4.5 5 4.1 3.4 2.6 2.7
Ecoli 65.88 66.82 75.54 77.41 79.43 77.41 85.98 85.98
12.9 8.5 4 7 3.8 4.6 3.4 3.4
Glass 49.79 50.94 52.84 54.62 60.21 61.55 66.38 63.86
9.6 6.1 7.1 9.5 4.4 4.6 3.4 3.1
Iris 71.13 71.13 77.08 88.69 91.4 93.45 94.64 93.54
8.9 8.9 6.3 8.1 6.2 5.6 4.4 3
Liver 54.86 56.93 51.03 51.17 62.01 62.46 63.27 63.27
7.5 7.6 5.7 2.8 2.8 1.4 4.1 4.1
Optdigit 63.26 76.52 79.42 87.71 76.41 74.65 80.28 84.13
2.4 3.4 2.6 1.4 2.1 1.3 2.5 2
Sonar 60.08 59.87 57.98 60.5 72.23 74.42 73.57 74.57
9.2 6.9 9 8.1 5.1 5 5 2.1
Soybean 51 47.83 71.37 68.51 67.1 61.49 71.06 73.06
4.4 4.6 2.5 2 2.5 3.4 2.4 3.4
Vowel 35.92 42.57 26.52 26.7 44.31 47.94 57.02 52.02
4.7 2.6 3.2 2.5 2.7 3 3 2.1
Wave 70.1 76.04 76.95 75.53 77.82 77.82 79.15 78.16
1.7 0.8 1.2 1.7 1.3 1.3 1.1 1.8
Wine 67.36 67.36 91.92 93.27 94.1 94.73 96.49 95.12
10.3 10.3 4.4 5.3 4.9 4.8 3.3 3.7
Zoo 66.19 56.66 85.23 85.23 87.94 87.14 90.00 92.59
5.2 8.1 4.2 4.2 5.4 5.5 4.7 3.2
Tables 5.2, 5.3, and 5.4 show that MSAB gives better results than all four semi-supervised algorithms ASSEMBLE, SemiMultiAdaBoost, RegBoost, and SemiBoost on 14 (for J48), 13 (for Naive Bayes), or 16 (for SVM) out of 16 datasets. The improvement of MSAB on most of the datasets is significant, and it outperforms ASSEMBLE and SMBoost on 15 out of 16 datasets when the base classifier is J48. Similar results are seen with the other base classifiers.
Table 5.3: The classification accuracy and standard deviation of different algo-
rithms with 10% labeled examples and Naive Bayes as base learner
Supervised Learning Semi-Supervised Learning
DataSets NB SAMME ASML SMBoost RBoost SBoost MSAB MCSB
Balance 75.88 76.71 79.77 79.98 78.24 77.36 82.09 80.12
3.5 2.5 2 2.5 3.5 4.1 3.2 3.8
Car 77.5 78.51 77.64 80.42 79.12 78.66 80.75 79.46
1.9 0.5 1.9 2.5 9.5 2.1 1.8 0.8
Cmc 46.98 48.05 47.35 46.94 48.2 49.09 51.11 49.92
1.4 1.3 3.1 2.2 2.1 1.4 1.7 1.64
Dermat- 81.52 81.52 86.94 85.25 86.44 86.44 90.00 91.64
ology 7.4 7.4 7.5 4.5 3.3 3.9 2.4 1.1
Diabetes 73.45 73.05 73.64 75.09 72.12 74.90 75.29 75.42
1.1 1.5 1.6 2.4 3.1 1.8 2.8 2.1
Ecoli 81.68 78.69 81.3 76.07 80 80.74 83.92 84.11
3.3 6.3 1.9 8 2.7 2.8 3.2 2.8
Glass 50 50.21 45.58 51.26 57.01 56.51 60.08 59.87
7 6.7 9.9 8.7 7 7.1 3.4 3.9
Iris 81.25 80.95 87.49 91.07 87.32 90.47 93.45 92.85
5.5 6.6 3.4 5.9 4.7 6.3 4.7 4.4
Liver 55.39 55.04 52.74 55.22 55.06 55.75 60.17 56.63
5.5 7.2 4.1 8.8 7 6.3 5.7 6.9
Optdigit 72.04 70.57 84.60 87.35 80.54 77.95 83.61 85.10
1.9 1.7 1 2 3.1 2.5 2 2.5
Sonar 64.41 63.23 62.64 65.58 65.5 70.91 70.58 71.00
7.1 5.1 7.6 5.6 5.1 5.4 6.5 7.2
Soybean 70 68.44 76.85 73.07 67.77 57.77 74.3 76.70
1.4 7.1 1.2 3.9 1.1 1.1 2.9 1.8
Vowel 47.52 48.02 25.82 32.78 47.34 44.5 52.53 51.40
3.4 2 3.3 3.5 2.7 2.2 3 3.8
Wave 80.44 80.91 80.35 79.94 81.01 81.83 83.43 82.94
1.4 1.9 1.1 0.7 1.7 1.3 0.6 1.2
Wine 84.56 84.56 94.38 93.68 93.69 92.28 95.17 94.33
5.8 5.8 2.2 2.9 3 2.6 2.9 2.2
Zoo 87.77 87.77 89.44 94.00 89.52 89.52 91.11 91.11
4 4 2.5 2.7 3.8 3.8 3.4 3.8
Table 5.4: The classification accuracy and standard deviation of different algo-
rithms with 10% labeled examples and SVM as base learner
Supervised Learning Semi-Supervised Learning
DataSets SVM SAMME ASML SMBoost RBoost SBoost MSAB MCSB
Balance 82.66 81.69 83.28 80.79 82.21 83.84 86.33 84.38
3.2 4.8 2.6 4.1 3.1 3.4 2.5 3.6
Car 80.92 83.83 77.95 77.95 81.23 80 82.32 82.21
2 1 2.3 2.2 2 0.9 1.9 2.2
Cmc 45.12 45.49 43.09 44.29 48.32 47.72 51.14 50.78
1.9 2.4 2.6 1.9 3.7 4.7 2.4 2.1
Dermat- 86.54 86.84 89.54 82.48 91.01 90.67 91.66 91.24
ology 1.7 1.7 2.5 1.7 3.1 3.8 1.8 2
Diabetes 71.3 72.56 73.35 73.43 72.64 72.64 74.07 73.04
2.6 1.9 2.4 2.3 2.3 2.8 2.1 2.6
Ecoli 79.2 82.35 81.19 74.64 80.21 77.68 86.24 84.22
1.7 3.6 2.8 5.8 3.1 3.6 3.3 2
Glass 47.64 48.23 46.76 42.94 53.12 52.94 55.88 55.14
5.2 5.9 4.2 5.6 4.9 4.6 4.8 5.4
Iris 84.32 86.45 81.25 88.65 88.95 86.19 96.87 94.67
6.7 9.2 11 8 6.5 9.7 2.7 4.4
Liver 57.66 55.01 56.19 50.44 57.98 58.99 61.94 60.61
1.6 6 2.8 8.3 5.3 3.2 2.3 2.7
Optdigit 89.92 89.92 89.71 84.78 88.01 88.6 90.50 90.50
1.2 1.2 1.2 4.9 3.2 1.7 1.4 1.4
Sonar 65.88 65.88 67.35 62.05 68.43 70 73.52 72.35
4 4 6 6.7 4.5 4.6 2.5 5.3
Soybean 72.69 72.69 59.56 56.01 68.09 68.82 77.93 77.46
3 3 2 1.3 1.7 1.3 0.7 1.4
Vowel 47.52 48.02 23.82 30.78 45.89 42.50 51.90 51.40
3.4 2 3.3 3.5 4.6 2.2 3.6 3.8
Wave 80.44 80.91 80.35 79.94 82.01 81.83 83.43 82.94
1.4 1.9 1.1 0.7 1.8 1.3 0.6 1.2
Wine 88.81 88.81 95.61 95.08 95.98 96.49 97.19 97.10
4.9 4.9 2.4 3 2.9 3 2 1.7
Zoo 95.83 95.83 90.76 90 92 92 95.83 95.83
1.6 1.6 2.5 2.1 3.1 3.3 1.6 1.6
To study the sensitivity of MSAB to the number of labeled data, we run a set
of experiments with different proportions of labeled data which vary from 5%
to 40%. We compare supervised base classifiers to supervised AdaBoost meta
classifier and MSAB. We expect that the difference between supervised algorithms
and MSAB algorithm decreases when more labeled data are available. Similar to
the previous experiments, we use a separate test set to evaluate the performance. Figure 5.1 shows the performance on 5 datasets, Iris, Glass, Sonar, Soybean, and Wine, with different base learners.
It is observed that with the additional unlabeled data and only 5% labeled
data, MSAB algorithm significantly improves the performance of the base clas-
sifier. Consistent with our hypothesis, MSAB improves the performance of the
base classifier and performs better than supervised AdaBoost with the same base
classifier.
5.8.2 Discussion
There are datasets where the proposed algorithms may not significantly improve
the performance of the base classifiers. In these cases the supervised algorithm
outperforms all the semi-supervised algorithms; for example, in Table 5.3 the AdaBoost meta classifier performs the same as the proposed methods on the Car, Cmc, and Diabetes datasets and outperforms the other methods. This kind of result emphasizes that unlabeled examples are not guaranteed to be useful and to improve the performance. Comparing the results across the tables, we observe that in almost all cases MSAB and Multiclass SemiBoost improve the performance of the base learner, and in most cases they outperform the other boosting algorithms compared in this study.
[Figure 5.1: Classification accuracy as a function of the percentage of labeled data on the Iris, Glass, Sonar, Soybean, and Wine datasets, for the different base learners.]
Figure 5.2: Objective function and β value on the Wine and Optdigit datasets as a function of the number of iterations.
Figure 5.3: Performance of the base learning algorithms J48 and SVM with 10% labeled examples on the Iris and Optdigit datasets, for increasing values of the scale parameter sigma.
Figure 5.4: Performance of the base learners J48 and SVM with 10% labeled examples, for increasing sample sizes of unlabeled data points, on two datasets.
data was that of Yoda et al. [106], using bi-axial accelerometers to differentiate
whether Adelie penguins (Pygoscelis adeliae) were upright, prone, walking, por-
poising, or tobogganing. Data were labeled by the authors from inspecting the
accelerometer data. In [67], cow behaviors are classified from accelerometer data
using Support Vector Machines. This achieved a reasonable recognition of behav-
iors such as standing, lying, and feeding. In [83], European shags were equipped
with a one-dimensional (surge) accelerometer logger. Parameters of wavelets were
clustered and used to recognize behavior. The approach was not empirically val-
idated. Experiments with recognizing human behaviors are described in [75].
Recently [70] compared several learning algorithms on accelerometer data from
vultures that were labeled from observations at locations where vultures spend
much time. Differences between the algorithms were small. Shamoun-Baranes
et al. [87] describe in detail how a decision tree learner is applied to accelerometer data from oystercatchers. However, labeling the accelerometer signal manually is a difficult and time-consuming process and requires detailed knowledge of animal behavior and an understanding of the tri-axial accelerometer data. Therefore, this problem seems very well suited to semi-supervised classification, where the task is multiclass classification.
a wide range of behaviors. The dataset consists of data from seven different lesser
black-backed gulls (Larus fuscus) collected in June and July 2010. The record-
ing settings varied per bird and date, and therefore the number of measurements
varies between birds. Logs were split into segments of 1 second and these were
classified into seven behavioral categories from graphical plots of the accelerom-
eter data by an expert in bird behavior who is also familiar with mathematical
models. The categories are: stand, flap, soar, walk, fight, care and brood. The
number of features calculated over a 1-second segment is 59; these are summarized in Table 6.1. The resulting dataset consists of 5589 segments.
Table 6.2: Average Classification Accuracy and Standard Deviation of J48 for
Bird dataset with different proportions of labeled data
set of experiments with three different base learners, J48, Naive Bayes, and SVM.
Figures 6.2 and 6.3 show the results. As can be seen, J48 as supervised classifier
and base classifier in MCSB outperforms the others. Figure 6.3.a shows that in
the supervised case, J48 outperforms the other supervised classifiers. There is an
Table 6.4: Average Classification Accuracy and Standard Deviation of SVM for
Bird dataset with different proportions of labeled data
exception when the labeled data is less than 5%. In this case, the Naive Bayes
classifier performs better.
[Figure: Classification accuracy as a function of the percentage of labeled data; the two panels compare the J48 and SVM base learners with AdaBoost, Multiclass-SemiBoost, and training on the fully labeled data.]
As can be seen from Tables 6.2, 6.3, 6.4 and Figures 6.2 and 6.3, there are
clear patterns in the results. The performance of the AdaBoost meta classifier is
nearly equal to its base learner, when the base learners are the Naive Bayes and
SVM classifiers. It also outperforms its base learner for the Decision Tree learning
algorithm when the number of labeled data is increasing. Another observation is
that J48 as base learner outperforms the other base learners in both the supervised and the semi-supervised case. When there are more than about 600 (15%) labeled examples, the unlabeled data are no longer helpful and do not improve the performance of the base classifier anymore, which is an interesting observation for biologists.
[Figure: Classification accuracy as a function of the percentage of labeled data.]
SemiBoost (MCSB) benefit from the unlabeled data and how this depends on
the base classifier. We also compare the proposed algorithms to four semi-
supervised algorithms ASSEMBLE (ASML), SemiMultiAdaBoost (SMBoost),
RegBoost (RBoost) [20], and SemiBoost (SBoost) [65]. RegBoost and SemiBoost
use the one-vs-all approach to handle the multiclass text classification problem in
our experiments. Furthermore, we include the (supervised) multiclass AdaBoost
(SAMME) meta classifier in the comparisons. As base learner we use the Decision
Tree and the Naive Bayes (NB) classifiers.
In the experiments, 30 percent of the data is kept as the test set and the rest is
used for training. We use only 10% labeled data in the experiments. We run each
experiment 10 times with different subsets of training and testing data.
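As an illustration of this evaluation protocol (not the thesis's actual experimental code), the following sketch repeats a stratified 70/30 split ten times and keeps only 10% of the training portion labeled; a scikit-learn decision tree stands in for the J48 base learner, and a semi-supervised learner would replace the supervised baseline shown here. Treating the 10% as a fraction of the training portion is an assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def run_protocol(X, y, n_runs=10, test_size=0.30, labeled_fraction=0.10):
    """70/30 train/test split, 10% of the training data labeled, repeated n_runs times."""
    scores = []
    for seed in range(n_runs):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=seed)
        # Keep only labeled_fraction of the training data as labeled; the rest
        # would be treated as unlabeled data by a semi-supervised learner.
        X_lab, X_unlab, y_lab, _ = train_test_split(
            X_train, y_train, train_size=labeled_fraction,
            stratify=y_train, random_state=seed)
        clf = DecisionTreeClassifier(random_state=seed).fit(X_lab, y_lab)  # supervised baseline
        scores.append(accuracy_score(y_test, clf.predict(X_test)))
    return np.mean(scores), np.std(scores)
```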
Table 6.6: The classification accuracy and standard deviation of different algorithms with 10% labeled examples and J48 as base learner

          Supervised                Semi-Supervised Learning
DataSets  J48         SAMME        ASML        SMBoost     RBoost      SBoost      MCSB         MSAB
re0       63.35±1.4   65.43±2.1    66.17±1.2   65.33±3.6   65.23±3.0   59.63±2.8   67.11±2.2    68.74±2.3
re1       67.33±2.6   67.18±3.0    64.36±5.0   62.06±3.8   67.12±2.5   68.65±1.5   71.052±0.2   69.69±1.7
tr11      69.84±5.6   72.13±4.9    69.72±5.4   68.95±1.7   68.12±7.0   69.57±8.3   77.73±3.1    77.6±2.6
tr12      73.97±4.5   75.85±3.8    65.56±8.0   65.73±6.7   68.87±7.3   65.13±7.0   79.67±3.1    80.1±2.8
tr21      73.45±4.7   75.3±6.3     77±3.9      79.62±3.6   77.5±4.3    76.23±5.1   79.16±3.0    80.86±3.3
tr23      72.02±8.7   77.77±11     71.23±13    67.06±8.4   78.34±8.0   77.57±7.5   84.32±1.3    85.51±3.2
tr31      88.43±3.3   91.29±3.5    92.02±1.8   84.58±4.6   80.33±8.0   80.33±8.0   90.96±2.4    92.22±2.1
tr41      79.92±3.0   84.15±3.1    83.8±2.8    78.69±2.8   76.89±4.5   76.93±4.8   87.08±2.2    87.32±2.1
tr45      80.86±3.5   86.73±2.4    86.03±2.2   76.7±2.0    77.65±5.6   74.71±5.8   86.2±1.8     89.67±2.3
wap       52.19±4.2   58.58±3.3    55.64±6.2   57.74±4.3   53.4±4.1    52.57±3.4   59.32±3.6    62.36±3.6
Table 6.7: The classification accuracy and standard deviation of different algorithms with 10% labeled examples and Naive Bayes as base learner

          Supervised                 Semi-Supervised Learning
DataSets  NB           SAMME        ASML        SMBoost     RBoost      SBoost      MCSB         MSAB
re0       58.73±7.0    63.96±3.7    61.39±8.5   59.23±4.0   50.5±4.3    45.56±4.3   63.32±4.7    64.11±4.9
re1       64.62±5.0    65.37±4.4    64.64±5.5   62.44±8.3   64.89±7.0   47.55±3.1   66.95±4.2    67.48±5.2
tr11      64.37±2.2    67.93±2.7    74.04±3.5   61.06±10    56.75±8.0   54.19±5.0   71.14±3.3    71.75±2.8
tr12      64.79±3.6    63.64±4.2    57.39±7.1   59.43±6.6   50.21±4.5   52.04±3.3   73.98±5.6    73.72±4.4
tr21      61.8±11      62.26±11.3   0           54.55±6.5   77.12±5.8   76.62±4.6   66.66±10.7   66.97±10.2
tr23      72.38±10.8   72.38±10.8   76.82±3.4   54.28±7.3   61.9±11.5   61.9±11.5   76.71±8.1    79.04±9.5
tr31      87.7±2.6     88.2±3.2     89.92±3.3   80.89±1.8   77.09±2.7   76.3±3.0    90.36±1.5    90.19±1.1
tr41      77.46±1.7    75.63±5.1    81.19±2.1   71.47±1.9   58.9±5.0    60.63±5.9   81.12±1.5    80.84±1.0
tr45      76.12±5.4    77.02±5.7    77.2±3.6    69.63±7.5   64.39±7.0   60.63±6.7   79.9±4.5     79.36±3.1
wap       57.95±1.3    57.85±1.1    60.13±3.4   56.33±2.7   45.6±3.0    46.93±2.3   60.16±1.7    59.63±1.9
Chapter 7
Conclusion and Discussion
In this chapter, we first look at the questions posed in the Introduction chapter
and how they are answered through our approach, as presented in Chapters 3 to
6. We then point to some directions for future work.
In Chapter 3, we first use a decision tree as the base learner for self-training.
This chapter deals with the research questions 1 and 2. A self-training algorithm
wraps around a base learner, trains a classifier, and predicts labels for unlabeled
data. Then, a set of newly-labeled data is selected for the next iterations. As
addressed in Chapter 3, the selection metric plays the main role in self-training.
The probability estimation of the base learner is often used as the selection metric
to select among the newly-labeled data. With our experiments, we demonstrate
that the current probability estimation of the decision tree is not an adequate
selection metric for self-training. This is due to the fact that the distributions
at the leaves of the decision trees provide poor ranking for probability estima-
tion. Therefore, there is a need for methods that can improve the probability
estimation of the decision tree. In order to answer questions 1 and 2, we have
made several modifications to the basic decision tree learner that produce better
probability estimation. We show that these modifications do not produce better
performance when used only on the labeled data, but they do benefit more
from the unlabeled data in self-training. The modifications that we consider are based on the Naive Bayes Tree, a combination of no-pruning and the Laplace correction, grafting, an ensemble of trees, and a distance-based measure.
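For reference, a minimal sketch of the self-training loop described above is given here; the confidence threshold, iteration limit, and the use of predict_proba as the selection metric are illustrative choices, not the exact settings studied in Chapter 3.

```python
import numpy as np
from sklearn.base import clone

def self_training(base_learner, X_lab, y_lab, X_unlab, threshold=0.9, max_iter=10):
    """Wrap a base learner: train it, predict labels for the unlabeled data,
    and move confidently predicted examples into the labeled set."""
    X_lab, y_lab, X_unlab = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    clf = clone(base_learner).fit(X_lab, y_lab)
    for _ in range(max_iter):
        if len(X_unlab) == 0:
            break
        proba = clf.predict_proba(X_unlab)
        conf = proba.max(axis=1)                 # selection metric: probability estimate
        pick = conf >= threshold
        if not pick.any():
            break
        new_y = clf.classes_[proba[pick].argmax(axis=1)]
        X_lab = np.vstack([X_lab, X_unlab[pick]])
        y_lab = np.concatenate([y_lab, new_y])
        X_unlab = X_unlab[~pick]
        clf = clone(base_learner).fit(X_lab, y_lab)   # retrain on the enlarged labeled set
    return clf
```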
4. What is the effect of the number of newly labeled data on the error rate of
a component classifier in an ensemble?
5b. How to use these weights to assign pseudo-labels to the unlabeled ex-
amples?
• they work well in domains for which the manifold or the cluster assumption
holds. This is achieved by a loss function based on consistency between data points.
• they can boost any supervised learning algorithm as the base learner. This
property makes it possible to select a learning algorithm that is suitable
for a domain because of its learning bias, computational cost, or other
reasons.
• they maximize consistency, in the sense that similar instances are
more likely to receive the same class label, and they maximize the margin.
From the loss functions we derive criteria to assign weights for the unlabeled
examples. The sampling process is then performed based on the weights. Next,
the weights are used to assign “pseudo-labels” to unlabeled examples. Our ex-
perimental results on a number of benchmark UCI datasets show that the pro-
posed algorithms perform better than the state-of-the-art algorithms for multi-
class semi-supervised learning. We also observe that the resulting classification
models generated by the proposed methods are more stable than the correspond-
ing supervised classification models. Another interesting observation is that there
are datasets where the proposed algorithms may not significantly improve the
performance of the base classifiers. In these cases the supervised algorithm out-
performs all the semi-supervised algorithms. This kind of result emphasizes that
unlabeled examples are not guaranteed to be useful and to improve performance.
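The overall procedure shared by the proposed boosting algorithms can be summarized in the following sketch. The function compute_weights is only a placeholder for the criterion derived from the loss function, and the fixed classifier weight beta is a simplification; the thesis derives both from the objective, so this is an outline of the generic loop rather than the actual algorithms.

```python
import numpy as np
from sklearn.base import clone

def semi_supervised_boosting(base_learner, X_lab, y_lab, X_unlab,
                             compute_weights, n_rounds=20, sample_size=50):
    """Generic loop: at each round, score the unlabeled examples, pseudo-label the
    highest-scoring ones, and fit a new weak classifier on labeled + pseudo-labeled data.

    `compute_weights(ensemble, X_unlab)` stands in for the criterion derived from the
    loss function; it should return (weights, pseudo_labels) for the unlabeled examples.
    The base learner is assumed to accept sample weights.
    """
    ensemble = []          # list of (beta, classifier) pairs
    for _ in range(n_rounds):
        weights, pseudo = compute_weights(ensemble, X_unlab)
        top = np.argsort(weights)[-sample_size:]          # sample the highest-weight examples
        X_round = np.vstack([X_lab, X_unlab[top]])
        y_round = np.concatenate([y_lab, pseudo[top]])
        w_round = np.concatenate([np.ones(len(y_lab)), weights[top]])
        h = clone(base_learner).fit(X_round, y_round, sample_weight=w_round)
        beta = 1.0                    # classifier weight; in the thesis this is derived from the loss
        ensemble.append((beta, h))
    return ensemble
```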
We further apply the methods to bird behavior recognition and text classifi-
cation in Chapter 6. First, we consider the task of semi-supervised learning to
recognize bird behavior from accelerometer data. Labeling accelerometer signals
manually is a difficult, time-consuming process, and requires detailed knowledge
of animal behavior and understanding of the tri-axial accelerometer data. This
is a multiclass classification task and seems very well suited for semi-supervised
learning. We use different proportions of labeled data, which vary from 1% to
20% in the experiments. With 1% labeled data we achieve 7% improvement,
when the decision tree learner is used as the base learner. We demonstrate how-
ever that with 20% labeled data MCSB gives results that are comparable with the
results of fully labeled data. In addition, we evaluate the proposed algorithms on
the text classification problem using the popular text datasets from TREC and
Reuters. The results show significant improvements in performance when only
10% of the data is labeled. We end by comparing MSAB and MCSB to the
state-of-the-art methods on text datasets. The results of our experiments show
that our proposed algorithms significantly outperform the other methods.
Manifold
In graph-based methods, the inconsistency between \{(x_i, y_i)\}_{i=1}^{n} and the similarity matrix is:

F(y) = \sum_{i,j} S(x_i, x_j)(y_i - y_j)^2 = y^T L y    (A.1)
where L is the graph Laplacian. We use the exponential function (5.18) in order to measure
the inconsistency between the similarity and the classifier predictions, which can be reformulated
(using a power series) as:

F(y, H) = \sum_{i,j} S(x_i, x_j)\cosh(\langle H(x_i) - H(x_j), \vec{1} \rangle)
        \simeq \sum_{i,j} S(x_i, x_j) + \sum_{i,j} S(x_i, x_j)\,\langle H(x_i) - H(x_j), \vec{1} \rangle^2    (A.2)
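The approximation in (A.2) presumably comes from truncating the power series of the hyperbolic cosine after the quadratic term (stated here only for clarity; the factor 1/2 of the quadratic term is then absorbed into the regularization constant):

\cosh(z) = 1 + \frac{z^2}{2!} + \frac{z^4}{4!} + \dots \approx 1 + \frac{z^2}{2},
\qquad z = \langle H(x_i) - H(x_j), \vec{1} \rangle,

so that \sum_{i,j} S(x_i, x_j)\cosh(z_{ij}) is, up to constants, a constant term plus the Laplacian-style quadratic penalty in (A.2).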
In (A.2), the first term is constant and the second term is equivalent to the graph Laplacian,
which is used as a manifold regularizer in many graph-based methods. Minimizing the
inconsistency in (5.18) is therefore equivalent to minimizing the graph Laplacian term, as in graph-based methods [4],
which are based on the manifold assumption.
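As a quick numerical check of the identity in (A.1) (here interpreting the sum as running over unordered pairs i < j, with L = D - S the unnormalized graph Laplacian), the following snippet verifies it on a random symmetric similarity matrix; it is only an illustration and not part of the original derivation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
S = rng.random((n, n))
S = (S + S.T) / 2            # symmetric similarity matrix
np.fill_diagonal(S, 0.0)
D = np.diag(S.sum(axis=1))   # degree matrix
L = D - S                    # unnormalized graph Laplacian
y = rng.standard_normal(n)

# Sum over unordered pairs i < j of S_ij (y_i - y_j)^2 equals y^T L y.
lhs = sum(S[i, j] * (y[i] - y[j]) ** 2 for i in range(n) for j in range(i + 1, n))
rhs = y @ L @ y
assert np.isclose(lhs, rhs)
print(lhs, rhs)
```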
Proof.1
Let the function \langle Y_i, e_k \rangle be equal to 1 if a data point with label Y_i belongs to class k. Then (5.22)
can be rewritten as follows:

F = \underset{h^t(x), \beta^t}{\arg\min} \; C_1 \sum_{i=1}^{n_l} \exp\Big(-\frac{1}{K}\langle H^{t-1}(x_i) + \beta^t h^t(x_i), Y_i \rangle\Big)
    + C_2 \sum_{i=1}^{n_l}\sum_{j=1}^{n_u} S^{lu}(x_i, x_j)\, e^{-\frac{1}{K-1}\sum_{k \in l}\langle Y_i, e_k \rangle \langle H_j^{t-1} + \beta^t h_j^t,\, e_k \rangle}
    + C_3 \sum_{i,j \in n_u} S^{uu}(x_i, x_j)\, e^{\frac{1}{K-1}\sum_{k \in l}\langle (H_i^{t-1} + \beta^t h_i^t) - (H_j^{t-1} + \beta^t h_j^t),\, e_k \rangle}    (A.3)

Using the inequality between the geometric and the arithmetic mean,

(c_1 c_2 \cdots c_n)^{1/n} \le \frac{c_1 + c_2 + \dots + c_n}{n},    (A.4)

and then replacing the value of Y_i results in (5.23), where \delta is defined in (5.24).
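For clarity, the way (A.4) presumably enters the derivation is by pulling the class sum out of the exponent: the exponential of an average of n terms is the geometric mean of the individual exponentials, which (A.4) bounds by their arithmetic mean,

\exp\Big(\frac{1}{n}\sum_{k=1}^{n} a_k\Big) = \Big(\prod_{k=1}^{n} e^{a_k}\Big)^{1/n} \le \frac{1}{n}\sum_{k=1}^{n} e^{a_k}.

Applied with the per-class terms of (A.3) in the role of the a_k, this yields the class-wise sums of exponentials that appear in (5.23).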
Proof.2
According to the formulation that we mentioned earlier, -\frac{1}{K-1} \le \beta^t h(x_i) \cdot e_k \le 1; multiplying
this by \frac{1}{K-1} and applying the exponential function gives the following inequality:

e^{-\frac{1}{(K-1)^2}} \le e^{\frac{\beta^t}{K-1} h(x_i) \cdot e_k} \le e^{\frac{1}{K-1}}    (A.5)

Using (A.5) to decompose the third term of (5.23) leads to the following expression:

e^{\frac{\beta^t}{K-1}\langle h(x_i) - h(x_j), e_k \rangle} \le e^{\frac{1}{K-1}}\, e^{-\frac{\beta^t}{K-1}\langle h(x_j), e_k \rangle}    (A.6)
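In going from (A.5) to (A.6), the exponential is presumably just factored and the upper bound from (A.5) applied to the factor involving h(x_i):

e^{\frac{\beta^t}{K-1}\langle h(x_i) - h(x_j), e_k \rangle}
  = e^{\frac{\beta^t}{K-1}\langle h(x_i), e_k \rangle}\; e^{-\frac{\beta^t}{K-1}\langle h(x_j), e_k \rangle}
  \le e^{\frac{1}{K-1}}\; e^{-\frac{\beta^t}{K-1}\langle h(x_j), e_k \rangle},

since \beta^t \langle h(x_i), e_k \rangle \le 1 by the range stated above.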
F_1 \le \sum_{i=1}^{n_l} \exp\Big(-\frac{1}{K}\langle H_i^{t-1}, Y_i \rangle\Big)\exp\Big(-\frac{\beta^t}{K}\langle h_i^t, Y_i \rangle\Big)
    + \sum_{i=1}^{n_l}\sum_{j=1}^{n_u}\sum_{k \in l} S^{lu}(x_i, x_j)\exp\Big(\frac{-1}{K-1}\langle H_j^{t-1}, e_k \rangle\Big)\exp\Big(\frac{-\beta^t}{K-1}\langle h_j^t, e_k \rangle\Big)\delta(Y_i, e_k)
    + \sum_{i,j \in n_u}\sum_{k \in l} S^{uu}(x_i, x_j)\exp\Big(\frac{1}{K-1}\langle H_i^{t-1} - H_j^{t-1}, e_k \rangle\Big)\, e^{\frac{1}{K-1}}\, e^{\frac{-\beta^t}{K-1}\langle h_j^t, e_k \rangle}    (A.7)
Factoring out the common term e^{\frac{-\beta^t}{K-1}\langle h(x_j), e_k \rangle} and re-arranging terms then results in:

F_1 \le \sum_{i=1}^{n_l} \exp\Big(-\frac{1}{K}\langle H_i^{t-1}, Y_i \rangle\Big)\exp\Big(-\frac{\beta^t}{K}\langle h_i^t, Y_i \rangle\Big)
    + \sum_{j=1}^{n_u}\sum_{k \in l} \exp\Big(\frac{-\beta^t}{K-1}\langle h_j^t, e_k \rangle\Big)\Big[\sum_{i=1}^{n_l} S^{lu}(x_i, x_j)\exp\Big(\frac{-1}{K-1}\langle H_j^{t-1}, e_k \rangle\Big)\delta(Y_i, e_k)
    + \sum_{i \in n_u} S^{uu}(x_i, x_j)\exp\Big(\frac{1}{K-1}\langle H_j^{t-1} - H_i^{t-1}, e_k \rangle\Big)\, e^{\frac{1}{K-1}}\Big]    (A.8)
As a result, this gives:

F_1 \le \sum_{i=1}^{n_l} W_i \exp\Big(\frac{-\beta^t}{K}\langle h_i^t, Y_i \rangle\Big) + \sum_{i=1}^{n_u}\sum_{k \in l} P_{i,k} \exp\Big(\frac{-\beta^t}{K-1}\langle h_i^t, e_k \rangle\Big)    (A.9)

where

W_i = \exp\Big(\frac{-1}{K}\langle H_i^{t-1}, Y_i \rangle\Big)    (A.10)

and

P_{i,k} = \sum_{j=1}^{n_l} S^{lu}(x_i, x_j)\, e^{\frac{-1}{K-1}\langle H_i^{t-1}, e_k \rangle}\,\delta(Y_i, e_k)
        + \sum_{j=1}^{n_u} S^{uu}(x_i, x_j)\, e^{\frac{1}{K-1}\langle H_j^{t-1} - H_i^{t-1}, e_k \rangle}\, e^{\frac{1}{K-1}}    (A.11)
Proof.3
We use the following inequality to decompose the elements of F2 :
Proof.1
Let the function \langle Y_i, e_k \rangle be equal to 1 if a data point with label Y_i belongs to class k. Then (5.32)
can be rewritten as follows:

F_1 = \underset{h^t(x), \beta^t}{\arg\min} \; C_1 \sum_{i=1}^{n_l}\sum_{j=1}^{n_u} S^{lu}(x_i, x_j)\, e^{-\frac{1}{K}\sum_{k \in l}\langle Y_i, e_k \rangle \langle H_j^{t-1} + \beta^t h_j^t,\, e_k \rangle}
    + C_2 \sum_{i,j \in n_u} S^{uu}(x_i, x_j)\, e^{\frac{1}{K}\sum_{k \in l}\langle (H_i^{t-1} + \beta^t h_i^t) - (H_j^{t-1} + \beta^t h_j^t),\, e_k \rangle}    (B.1)

Using the inequality between the geometric and the arithmetic mean,

(c_1 c_2 \cdots c_n)^{1/n} \le \frac{c_1 + c_2 + \dots + c_n}{n},    (B.2)

and then replacing the value of Y_i results in the following expression:

F_1 = \underset{h^t(x), \beta^t}{\arg\min} \; C_1 \sum_{i=1}^{n_l}\sum_{j=1}^{n_u}\sum_{k \in l} S^{lu}(x_i, x_j)\, e^{\frac{-1}{K}\langle H_j^{t-1}, e_k \rangle}\, e^{\frac{-\beta^t}{K}\langle h_j^t, e_k \rangle}\,\delta(Y_i, e_k)
    + C_2 \sum_{i,j \in n_u}\sum_{k \in l} S^{uu}(x_i, x_j)\, e^{\frac{1}{K}\langle H_i^{t-1} - H_j^{t-1}, e_k \rangle}\, e^{\frac{\beta^t}{K}\langle h_i^t - h_j^t, e_k \rangle}    (B.3)
Proof.2
According to the formulation for multiclass classification, -\frac{1}{K-1} \le \beta^t \langle h(x_i), e_k \rangle \le 1; multiplying
this by \frac{1}{K} and applying the exponential function gives the following inequality:

e^{-\frac{1}{K(K-1)}} \le e^{\frac{\beta^t}{K} h(x_i) \cdot e_k} \le e^{\frac{1}{K}}    (B.4)

Using (B.4) to decompose the second term of (B.3) leads to the following expression:

e^{\frac{\beta^t}{K}\langle h(x_i) - h(x_j), e_k \rangle} \le e^{\frac{1}{K}}\, e^{\frac{-\beta^t}{K}\langle h(x_j), e_k \rangle}    (B.5)
F_1 \le \sum_{i=1}^{n_l}\sum_{j=1}^{n_u}\sum_{k \in l} S^{lu}(x_i, x_j)\exp\Big(\frac{-1}{K}\langle H_j^{t-1}, e_k \rangle\Big)\exp\Big(\frac{-\beta^t}{K}\langle h_j^t, e_k \rangle\Big)\delta(Y_i, e_k)
    + \sum_{i,j \in n_u}\sum_{k \in l} S^{uu}(x_i, x_j)\exp\Big(\frac{1}{K}\langle H_i^{t-1} - H_j^{t-1}, e_k \rangle\Big)\, e^{\frac{1}{K}}\, e^{\frac{-\beta^t}{K}\langle h_j^t, e_k \rangle}    (B.6)
Factoring out the common term e^{\frac{-\beta^t}{K}\langle h(x_j), e_k \rangle} and re-arranging terms then results in:

F_1 \le \sum_{j=1}^{n_u}\sum_{k \in l} \exp\Big(\frac{-\beta^t}{K}\langle h_j^t, e_k \rangle\Big)\Big[\sum_{i=1}^{n_l} S^{lu}(x_i, x_j)\exp\Big(\frac{-1}{K}\langle H_j^{t-1}, e_k \rangle\Big)\delta(Y_i, e_k)
    + \sum_{i \in n_u} S^{uu}(x_i, x_j)\exp\Big(\frac{1}{K}\langle H_j^{t-1} - H_i^{t-1}, e_k \rangle\Big)\exp\Big(\frac{1}{K}\Big)\Big]    (B.7)
As a result we obtain:

F_1 \le \sum_{i=1}^{n_u}\sum_{k \in l} P_{i,k} \exp\Big(\frac{-\beta^t}{K}\langle h_i^t, e_k \rangle\Big)    (B.8)

where

P_{i,k} = \sum_{j=1}^{n_l} S^{lu}(x_i, x_j)\, e^{\frac{-1}{K}\langle H_i^{t-1}, e_k \rangle}\,\delta(Y_i, e_k)
        + \sum_{j=1}^{n_u} S^{uu}(x_i, x_j)\, e^{\frac{1}{K}\langle H_j^{t-1} - H_i^{t-1}, e_k \rangle}\, e^{\frac{1}{K}}    (B.9)
Proof.3
We use the following inequality to decompose the elements of F2 :
F_2 \le \sum_{i=1}^{n_u}\sum_{k \in l} P_{i,k}\Big(e^{\frac{-\beta^t}{K}} + e^{\frac{\beta^t}{K}} - 1\Big) - \sum_{i=1}^{n_u}\sum_{k \in l} P_{i,k}\,\frac{\beta^t}{K}\langle h_i^t, e_k \rangle    (B.11)
The first term in inequality (B.11) is independent of h_i. Hence, to minimize F_2, it is sufficient at
each iteration of the boosting process to find the examples with the largest P_{i,k}. We assign
the value P_{i,k} of a selected example as the weight of the newly-labeled example, hence w_i = \max_k |P_{i,k}|.
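In code form, the selection rule just described amounts to taking, for each unlabeled example, the class with the largest P_{i,k} as its pseudo-label and max_k |P_{i,k}| as its weight. The sketch below assumes the matrix P has already been computed as in (B.9); it illustrates only this rule, not the full algorithm.

```python
import numpy as np

def pseudo_label_and_weight(P):
    """P has shape (n_unlabeled, K): entry P[i, k] is the criterion from (B.9).

    Returns, for each unlabeled example, the pseudo-label (argmax over classes)
    and the weight w_i = max_k |P[i, k]|.
    """
    pseudo_labels = np.argmax(P, axis=1)        # class with the largest criterion value
    weights = np.max(np.abs(P), axis=1)         # w_i = max_k |P_{i,k}|
    return pseudo_labels, weights

# Example with 4 unlabeled examples and K = 3 classes.
P = np.array([[0.20, 0.70, 0.10],
              [0.05, 0.02, 0.40],
              [0.30, 0.10, 0.25],
              [0.00, 0.90, 0.05]])
labels, w = pseudo_label_and_weight(P)
print(labels, w)
```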
To prove the second part of the proposition, we expand F_2 as follows:

F_2 = \sum_{i=1}^{n_u}\sum_{k \in l} P_{i,k} \exp\Big(\frac{-\beta^t}{K}\langle h_i^t, e_k \rangle\Big)
    = \sum_{i=1}^{n_u}\sum_{k \in l} P_{i,k} \exp\Big(\frac{-\beta^t}{K}\Big)\,\delta'(\langle h_i^t, e_k \rangle, P_i = k)
    + \sum_{i=1}^{n_u}\sum_{k \in l} P_{i,k} \exp\Big(\frac{\beta^t}{K(K-1)}\Big)\,\delta'(\langle h_i^t, e_k \rangle, P_i \ne k)    (B.12)
Abstract
We have made several modifications to the basic decision tree learner that produce better probability estimations. We
show that these modifications do not produce better performance if used only on
labeled data, but they do benefit more from the unlabeled data in self-training.
The modifications that we consider are based on Naive Bayes Tree, a combination
of No-pruning and Laplace correction, grafting, ensemble of trees, and using a
distance-based measure. The second proposed iterative method is a new form
of co-training that uses an ensemble of different learning algorithms. Based on
a theorem that relates noise in the data to the error of hypotheses learned from
these data, we propose a selection metric and an error rate estimate for a classifier
at each iteration of the training process. Ensembles of classifiers
are used to derive criteria for sampling and error rate estimation in a co-training
style. Our experimental results on a number of benchmark UCI datasets show
that the proposed methods outperform the other related methods.
Additive algorithms add a new component to a linear combination of classi-
fiers. These algorithms build a new classifier, assign a weight and add it to a
linear combination of weak classifiers. An example of this approach is the boost-
ing method. Boosting is one of the most successful meta-classifiers for supervised
learning. This motivates us to extend a boosting method for semi-supervised
learning. Most of the current semi-supervised learning methods were originally
designed for binary classification problems. However, many practical domains, for
example typical recognition of patterns, objects, and documents, involve more
than two classes. To solve the multiclass classification problem two main ap-
proaches can be used. The first is to convert the multiclass problem into a set
of binary classification problems, e.g. one-vs-all. This approach can have various
problems, such as the imbalanced class distributions, increased complexity, no
guarantee for having an optimal joint classifier or a probability estimation, and
different scales for the outputs of generated binary classifiers, which complicate
the task of combining them. The second approach is to use a multiclass classifier
directly. Although a number of methods have been proposed for multiclass super-
vised learning, they are not able to handle multiclass semi-supervised learning.
There is therefore a need to extend multiclass supervised learning methods to multiclass
semi-supervised learning. We propose two multiclass semi-supervised boosting
algorithms that solve the multiclass classification problems directly. Our pro-
posed algorithms are based on two main semi-supervised learning assumptions,
the manifold and the cluster assumption, to exploit the information from the un-
labeled examples in order to improve the performance of the base classifier. The
first algorithm uses two regularization terms on labeled and unlabeled examples
to minimize the inconsistency between similarity matrix and classifier predictions.
The second algorithm uses an extra term in the loss function for minimizing the cost
of the margin for labeled data as well as minimizing the inconsistency between
the classifier predictions and pairwise similarities. These algorithms employ the
similarity matrix and the classifier predictions to find a criterion to assign weights
to data. The results of our experiments on the UCI datasets indicate that the
proposed algorithms outperform the state-of-the-art methods in this field.
We further applied the methods to bird behavior recognition and text classi-
fication. First, we considered the task of semi-supervised learning to recognize
bird behavior from accelerometer data. Labeling accelerometer signals manually
is a difficult and time-consuming process and requires detailed knowledge of an-
imal behavior and an understanding of the tri-axial accelerometer data. This is a
multiclass classification task, and the problem therefore seems very well suited for
semi-supervised learning. We used different proportions of labeled data, varying
from 1% to 20%. With 1% labeled data we achieved a 7% improvement
when the decision tree learner was used as the base learner. We also evaluated
the proposed algorithms on the text classification problem using the popular text
datasets from TREC and Reuters. The results showed significant improvements
in performance when only 10% of the data was labeled.
Samenvatting
confidence. We show that these improvements do not lead to better performance
when only the labeled training data are used, but that they do yield an improvement
when unlabeled data are also used in a self-training algorithm. The modifications are
based on the Naive Bayes Tree, a combination of no-pruning and the Laplace correction,
grafting, an ensemble of decision trees, and the use of a distance-based measure.

The second iterative method is a new form of co-training that uses an
ensemble of different learning algorithms. A theorem that relates noise in the
data, and its amount, to the error of hypotheses learned from these data is used to
derive a criterion for selecting newly-labeled examples. The agreement among the
classifications in an ensemble is used to estimate the amount of noise. Experiments
on datasets from the UCI repository show that these methods work better than
comparable methods.

Additive algorithms build a new classifier in each iteration, assign it a weight,
and add it to a linear combination. An example of this is boosting. We apply the
principle of boosting to semi-supervised learning. Most boosting methods are designed
for binary classification. Many problems, however, such as recognizing types of web
pages and types of objects, involve more than two classes. Two approaches have been
proposed for multiclass classification. The first is to convert the multiclass problem
into a collection of binary classification problems; examples of this are one-vs-all,
one-vs-one, and error-correcting output codes. This approach has a number of problems,
such as imbalanced class distributions, increased complexity, no guarantee of finding an
optimal classifier or probability estimate, and different scales for the outputs of the
binary classifiers, which makes them difficult to combine. The second approach is to
use multiclass classification directly. We introduce a boosting-based algorithm for
semi-supervised multiclass classification. The algorithm is based on two commonly made
assumptions, the manifold and the cluster assumption, in order to exploit the new
information from the unlabeled examples. It uses the margin on the labeled data and
the similarity between unlabeled data in an exponential loss function. We give a formal
definition of this loss function and derive functions for the weights of classifiers and
of unlabeled data by minimizing the objective function. Results on datasets from the
UCI repository indicate that the proposed algorithm works better than algorithms that
represent the state of the art.

We have applied our methods to the recognition of bird behavior and to text
classification. The first task concerns recognizing bird behavior from accelerometer
data. Labeling such data with behaviors is a difficult and time-consuming process and
requires detailed knowledge of bird behavior and an understanding of the three-dimensional
accelerometer data. This is a typical example of a multiclass problem. Because there is
a large amount of unlabeled data, this problem is very well suited for semi-supervised
learning. Experiments show that with 1% of the data labeled, semi-supervised learning
gives a 7% improvement. Results on datasets from the TREC competition and the Reuters
dataset show comparable results.
Bibliography
[1] Dana Angluin and Philip Laird. Learning from noisy examples. Machine Learning, 2:
343–370, 1988. ISSN 0885-6125.
[2] Maria-Florina Balcan and Avrim Blum. A discriminative model for semi-supervised learn-
ing. J. ACM, 57(3), 2010.
[6] K.P. Bennett, A. Demiriz, and R. Maclin. Exploiting unlabeled data in ensemble meth-
ods. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge
discovery and data mining, pages 289–296. ACM, 2002.
[7] Hendrik Blockeel and Luc De Raedt. Top-down induction of first-order logical decision
trees. Artificial Intelligence, 101(1-2):285 – 297, 1998. ISSN 0004-3702.
[8] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In
Proceedings of the eleventh annual conference on Computational learning theory, pages
92–100. ACM, 1998.
[9] Avrim Blum and Tom M. Mitchell. Combining labeled and unlabeled data with co-
training. In COLT, pages 92–100, 1998.
[10] V. Boer, M.W. Someren, and T. Lupascu. Web page classification using image analysis
features. Web Information Systems and Technologies, pages 272–285, 2011.
[11] D. Boley, M. Gini, R. Gross, E.H. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher,
and J. Moore. Document categorization and query generation on the world wide web using
webace. Artificial Intelligence Review, 13(5):365–391, 1999.
[12] Ulf Brefeld and Tobias Scheffer. Co-em support vector learning. In Proceedings of the
twenty-first international conference on Machine learning, ICML ’04, pages 16–, New
York, NY, USA, 2004. ACM. ISBN 1-58113-838-5.
[14] L. Breiman. Arcing classifier (with discussion and a rejoinder by the author). The annals
of statistics, 26(3):801–849, 1998.
[17] O. Chapelle and A. Zien. Semi-supervised classification by low density separation. 2004.
[18] O. Chapelle, B. Schölkopf, A. Zien, et al. Semi-supervised learning, volume 2. MIT press
Cambridge, MA:, 2006.
[19] O. Chapelle, V. Sindhwani, and S.S. Keerthi. Optimization techniques for semi-supervised
support vector machines. The Journal of Machine Learning Research, 9:203–233, 2008.
[20] K. Chen and S. Wang. Semi-supervised learning via regularized boosting working on
multiple semi-supervised assumptions. Pattern Analysis and Machine Intelligence, IEEE
Transactions on, 33(1):129–143, 2011.
[21] M. Collins and Y. Singer. Unsupervised models for named entity classification. In Pro-
ceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language
Processing and Very Large Corpora, pages 189–196, 1999.
[22] T. De Bie and N. Cristianini. Convex methods for transduction. In Advances in Neural Infor-
mation Processing Systems 16: Proceedings of the 2003 Conference, volume 16, page 73.
MIT Press, 2004.
[24] S. Dasgupta, D. McAllester, and M. Littman. PAC Generalization Bounds for Co-
Training. In Proceedings of Neural Information Proccessing Systems, 2001.
[25] A. Demiriz, K.P. Bennett, and J. Shawe-Taylor. Linear programming boosting via column
generation. Machine Learning, 46(1):225–254, 2002.
[26] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data
via the em algorithm. Journal of the Royal Statistical Society. Series B (Methodological),
39(1):pp. 1–38, 1977. ISSN 00359246.
[27] J. Demšar. Statistical comparisons of classifiers over multiple data sets. The Journal of
Machine Learning Research, 7:1–30, 2006.
[28] T. Dietterich. Ensemble methods in machine learning. Multiple classifier systems, pages
1–15, 2000.
[29] T.G. Dietterich. An experimental comparison of three methods for constructing ensembles
of decision trees: Bagging, boosting, and randomization. Machine learning, 40(2):139–
157, 2000.
[30] T.G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting
output codes. Arxiv preprint cs/9501101, 1995.
[31] Tom Fawcett and Nina Mishra, editors. Machine Learning, Proceedings of the Twentieth
International Conference (ICML 2003), August 21-24, 2003, Washington, DC, USA,
2003. AAAI Press. ISBN 1-57735-189-4.
[32] A. Frank and A. Asuncion. UCI machine learning repository, 2010. URL
http://archive.ics.uci.edu/ml.
[34] Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In
ICML, pages 148–156, 1996.
[35] Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In
ICML, pages 148–156, 1996.
[36] M.A. Friedl, C.E. Brodley, and A.H. Strahler. Maximizing land cover classification accu-
racies produced by decision trees at continental to global scales. Geoscience and Remote
Sensing, IEEE Transactions on, 37(2):969 –977, mar 1999.
[37] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view
of boosting (with discussion and a rejoinder by the authors). The annals of statistics, 28
(2):337–407, 2000.
[38] Akinori Fujino, Naonori Ueda, and Kazumi Saito. A hybrid generative/discriminative
approach to semi-supervised classifier design. In AAAI, pages 764–769, 2005.
[39] Sally A. Goldman and Yan Zhou. Enhancing supervised learning with unlabeled data. In
ICML, pages 327–334, 2000.
[40] I. Gyllensten. Physical activity recognition in daily life using a triaxial accelerometer,
2010.
[41] G. Haffari and A. Sarkar. Analysis of semi-supervised learning with the yarowsky algo-
rithm. In Uncertainty in Artificial Intelligence (UAI), 2007.
[42] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and
Ian H. Witten. The weka data mining software: an update. SIGKDD Explor. Newsl., 11:
10–18, November 2009. ISSN 1931-0145.
[43] Eui-Hong Han and George Karypis. Centroid-based document classification: Analysis
and experimental results. In Djamel Zighed, Jan Komorowski, and Jan Zytkow, editors,
Principles of Data Mining and Knowledge Discovery, volume 1910 of Lecture Notes in
Computer Science, pages 116–123. Springer Berlin / Heidelberg, 2000.
[44] T.K. Ho. The random subspace method for constructing decision forests. Pattern Analysis
and Machine Intelligence, IEEE Transactions on, 20(8):832–844, 1998.
[45] S. Holm. A simple sequentially rejective multiple test procedure. Scandinavian journal
of statistics, pages 65–70, 1979.
[46] R.L. Iman and J.M. Davenport. Approximations of the critical region of the fbietkan
statistic. Communications in Statistics-Theory and Methods, 9(6):571–595, 1980.
[47] M.S.T. Jaakkola. Partially labeled classification with markov random walks. In Advances
in neural information processing systems 14: proceedings of the 2002 conference, volume 2,
page 945. MIT Press, 2002.
[48] R. Jin and J. Zhang. Multi-class learning by smoothed boosting. Machine learning, 67
(3):207–227, 2007.
[49] Thorsten Joachims. Transductive inference for text classification using support vector
machines. In ICML, pages 200–209, 1999.
[50] Thorsten Joachims. Transductive inference for text classification using support vector
machines. In Ivan Bratko and Saso Dzeroski, editors, ICML, pages 200–209. Morgan
Kaufmann, 1999. ISBN 1-55860-612-2.
[51] Thorsten Joachims. Transductive learning via spectral graph partitioning. In Fawcett
and Mishra [31], pages 290–297. ISBN 1-57735-189-4.
[52] Josef Kittler, Mohamad Hatef, Robert P.W. Duin, and Jiri Matas. On combining clas-
sifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20:226–239,
1998. ISSN 0162-8828.
[53] Ron Kohavi. Scaling up the accuracy of naive-bayes classifiers: A decision-tree hybrid.
In KDD, pages 202–207, 1996.
[54] L.I. Kuncheva and C.J. Whitaker. Measures of diversity in classifier ensembles and their
relationship with the ensemble accuracy. Machine learning, 51(2):181–207, 2003.
[55] N.D. Lawrence and M.I. Jordan. Semi-supervised learning via gaussian processes. Ad-
vances in neural information processing systems, 17:753–760, 2005.
[56] Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines. Journal of the
American Statistical Association, 99(465):67–81, 2004.
[57] B. Leskes and L. Torenvliet. The value of agreement a new boosting algorithm. Journal
of Computer and System Sciences, 74(4):557–586, 2008.
[58] Boaz Leskes and Leen Torenvliet. The value of agreement a new boosting algorithm. J.
Comput. Syst. Sci., 74(4):557–586, 2008.
[60] Ming Li and Zhi-Hua Zhou. Improve computer-aided diagnosis with machine learning
techniques using undiagnosed samples. Systems, Man and Cybernetics, Part A: Systems
and Humans, IEEE Transactions on, 37(6):1088 –1098, 2007.
[61] Yuanqing Li, Cuntai Guan, Huiqi Li, and Zhengyang Chin. A self-training semi-supervised
svm algorithm and its application in an eeg-based brain computer interface speller system.
Pattern Recognition Letters, 29(9):1285 – 1294, 2008.
[62] M. Lux and S.A. Chatzichristofis. Lire: lucene image retrieval: an extensible java cbir
library. In Proceeding of the 16th ACM international conference on Multimedia, pages
1085–1088. ACM, 2008.
[63] R. Maclin and D. Opitz. Popular ensemble methods: An empirical study. Arxiv preprint
arXiv:1106.0257, 2011.
[64] B. Maeireizo, D. Litman, and R. Hwa. Co-training for predicting emotions with spoken
dialogue data. In Proceedings of the ACL 2004 on Interactive poster and demonstration
sessions, page 28. Association for Computational Linguistics, 2004.
[65] P.K. Mallapragada, R. Jin, and A.K. Jain. Active query selection for semi-supervised
clustering. In Pattern Recognition, 2008. ICPR 2008. 19th International Conference on,
pages 1–4. IEEE, 2008.
[66] P.K. Mallapragada, R. Jin, A.K. Jain, and Y. Liu. Semiboost: Boosting for semi-
supervised learning. Pattern Analysis and Machine Intelligence, IEEE Transactions on,
31(11):2000–2014, 2009.
[67] Paula Martiskainen, Mikko Järvinen, Jukka-Pekka Skön, Jarkko Tiirikainen, Mikko
Kolehmainen, and Jaakko Mononen. Cow behaviour pattern recognition using a three-
dimensional accelerometer and support vector machines. Applied Animal Behaviour Sci-
ence, 119(1-2):32–38, 2009.
[68] D.J. Miller and H.S. Uyar. A mixture of experts classifier with learning based on both
labelled and unlabelled data. In M. Mozer, M.I. Jordan, and T. Petsche, editors, Advances
in Neural Information Processing Systems, pages 571–577, Boston, 1997. MIT Press.
[69] I. Mukherjee and R.E. Schapire. A theory of multiclass boosting. Arxiv preprint
arXiv:1108.2989, 2011.
[70] Ran Nathan, Orr Spiegel, Scott Fortmann-Roe, Roi Harel, Martin Wikelski, and Wayne M
Getz. Using tri-axial acceleration data to identify behavioral modes of free-ranging ani-
mals: general concepts and tools illustrated for griffon vultures. J Exp Biol, 215(pt 6):
986, 2012.
[71] K. Nigam, A.K. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled
and unlabeled documents using em. Machine learning, 39(2):103–134, 2000.
[72] Kamal Nigam and Rayid Ghani. Analyzing the effectiveness and applicability of co-
training. In CIKM, pages 86–93. ACM, 2000.
[73] Foster J. Provost and Pedro Domingos. Tree induction for probability-based ranking.
Machine Learning, 52(3):199–215, 2003.
[74] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993. ISBN
1-55860-238-0.
[75] Nishkam Ravi, Nikhil D, Preetham Mysore, and Michael L. Littman. Activity recognition
from accelerometer data. In Proceedings of the Seventeenth Conference on Innovative
Applications of Artificial Intelligence (IAAI), pages 1541–1546. AAAI Press, 2005.
[76] E. Riloff, J. Wiebe, and W. Phillips. Exploiting subjectivity classification to improve
information extraction. In Proceedings of the National Conference on Artificial Intelli-
gence, volume 20, page 1106. AAAI Press, 2005.
[78] Chuck Rosenberg, Martial Hebert, and Henry Schneiderman. Semi-supervised self-
training of object detection models. In WACV/MOTION, pages 29–36. IEEE Computer
Society, 2005. ISBN 0-7695-2271-8.
[79] M.J. Saberian and N. Vasconcelos. Multiclass boosting: Theory and algorithms. In Proc.
Neural Information Processing Systems (NIPS), 2011.
[80] A. Saffari, H. Grabner, and H. Bischof. Serboost: Semi-supervised boosting with expec-
tation regularization. Computer Vision–ECCV 2008, pages 588–601, 2008.
[82] A. Saffari, M. Godec, T. Pock, C. Leistner, and H. Bischof. Online multi-class lpboost.
In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages
3570–3577. IEEE, 2010.
[83] K.Q. Sakamoto, K. Sato, M. Ishizuka, Y. Watanuki, A. Takahashi, F. Daunt, and S. Wan-
less. Can ethograms be automatically generated using body acceleration data from free-
ranging birds? PLoS One, 4(4):e5379, 2009.
[84] R.E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated pre-
dictions. Machine learning, 37(3):297–336, 1999.
[85] R.E. Schapire and Y. Singer. Boostexter: A boosting-based system for text categorization.
Machine learning, 39(2):135–168, 2000.
[86] B.M. Shahshahani and D.A. Landgrebe. The effect of unlabeled samples in reducing the
small sample size problem and mitigating the Hughes phenomenon. Geoscience and Remote
Sensing, IEEE Transactions on, 32(5):1087–1095, September 1994. ISSN 0196-2892.
[87] J. Shamoun-Baranes, R. Bom, E.E. van Loon, B.J. Ens, K. Oosterbeek, and W. Bouten.
From sensor data to animal behaviour: An oystercatcher example. PloS one, 7(5):e37997,
2012.
[88] Aayush Sharma, Gang Hua, Zicheng Liu, and Zhengyou Zhang. Meta-tag propagation by
co-training an ensemble classifier for improving image search relevance. Computer Vision
and Pattern Recognition Workshop, 0:1–6, 2008.
[89] Jeffrey Simonoff. Smoothing Methods in Statistics. Springer, New York, 1996.
[90] E. Song, D. Huang, G. Ma, and C.C. Hung. Semi-supervised multi-class adaboost by
exploiting unlabeled data. Expert Systems with Applications, 2010.
[94] J. Tanha, M. van Someren, and H. Afsarmanesh. Ensemble based co-training. In 23rd
Benelux Conference on Artificial Intelligence, pages 223–231, 2011.
[95] J. Tanha, M. van Someren, and H. Afsarmanesh. An adaboost algorithm for multiclass
semi-supervised learning. In Data Mining (ICDM), 2012 12th IEEE International Con-
ference on. IEEE, 2012.
[99] H. Valizadegan, R. Jin, and A. Jain. Semi-supervised boosting for multi-class classifica-
tion. Machine Learning and Knowledge Discovery in Databases, pages 522–537, 2008.
[100] Bin Wang, Bruce Spencer, Charles X. Ling, and Harry Zhang. Semi-supervised self-
training for sentence subjectivity classification. In Proceedings of 21st conference on
Advances in artificial intelligence, pages 344–355, Berlin, Heidelberg, 2008. Springer-
Verlag.
[102] M.K. Warmuth, K. Glocer, and G. Rätsch. Boosting algorithms for maximizing the soft
margin. Advances in neural information processing systems, 20:1585–1592, 2008.
[103] Geoffrey I. Webb. Decision tree grafting from the all tests but one partition. In Thomas
Dean, editor, IJCAI, pages 702–707. Morgan Kaufmann, 1999. ISBN 1-55860-613-0.
[108] Z.H. Zhou and M. Li. Semi-supervised learning by disagreement. Knowledge and Infor-
mation Systems, 24(3):415–439, 2010.
[109] Zhi-Hua Zhou and Ming Li. Tri-training: Exploiting unlabeled data using three classifiers.
IEEE Transactions on Knowledge and Data Engineering, 17:1529–1541, 2005.
[110] J. Zhu, S. Rosset, H. Zou, and T. Hastie. Multi-class adaboost. Statistics and its Interface,
2:349–360, 2009.
[111] X. Zhu. Semi-Supervised Learning Literature Survey. Technical Report 1530, Computer
Sciences, University of Wisconsin-Madison, 2005.
[112] X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label prop-
agation. School Comput. Sci., Carnegie Mellon Univ., Pittsburgh, PA, Tech. Rep. CMU-
CALD-02-107, 2002.
[114] Xiaojin Zhu, Zoubin Ghahramani, and John D. Lafferty. Semi-supervised learning using
gaussian fields and harmonic functions. In Fawcett and Mishra [31], pages 912–919. ISBN
1-57735-189-4.
[115] Xiaojin Zhu, Zoubin Ghahramani, and John D. Lafferty. Semi-supervised learning using
gaussian fields and harmonic functions. In ICML, pages 912–919, 2003.
SIKS Dissertation Series