
Mach Learn

DOI 10.1007/s10994-014-5457-9

Asymptotic analysis of estimators on multi-label data

Andreas P. Streich · Joachim M. Buhmann

Received: 20 November 2011 / Accepted: 4 June 2014


© The Author(s) 2014. This article is published with open access at Springerlink.com

Abstract Multi-label classification extends the standard multi-class classification paradigm


by dropping the assumption that classes have to be mutually exclusive, i.e., the same data
item might belong to more than one class. Multi-label classification has many important
applications in e.g. signal processing, medicine, biology and information security, but the
analysis and understanding of the inference methods based on data with multiple labels are
still underdeveloped. In this paper, we formulate a general generative process for multi-label
data, i.e. we associate each label (or class) with a source. To generate multi-label data items,
the emissions of all sources in the label set are combined. In the training phase, only the prob-
ability distributions of these (single-label) sources need to be learned. Inference on multi-label
data requires solving an inverse problem; models of the data generation process therefore
require additional assumptions to guarantee well-posedness of the inference procedure. Similarly,
in the prediction (test) phase, the distributions of all single-label sources in the label
set are combined using the combination function to determine the probability of a label set.
We formally describe several previously presented inference methods and introduce a novel,
general-purpose approach, where the combination function is determined based on the data
and/or on a priori knowledge of the data generation mechanism. This framework includes
cross-training and new source training (also named label power set method) as special cases.
We derive an asymptotic theory for estimators based on multi-label data and investigate
the consistency and efficiency of estimators obtained by several state-of-the-art inference
techniques. Several experiments confirm these findings and emphasize the importance of a
sufficiently complex generative model for real-world applications.

Editors: Grigorios Tsoumakas, Min-Ling Zhang, and Zhi-Hua Zhou.

A. P. Streich (B)
Science and Technology Group, Phonak AG, Laubisrütistrasse 28, 8712 Stäfa, Switzerland
e-mail: andreas.streich@alumni.ethz.ch

J. M. Buhmann
Department of Computer Science, ETH Zurich, Universitätstrasse 6, 8092 Zurich, Switzerland
e-mail: jbuhmann@inf.ethz.ch


Keywords Generative model · Asymptotic analysis · Multi-label classification · Consistency

1 Introduction

Multi-labelled data are encountered in classification of acoustic and visual scenes (Boutell
et al. 2004), in text categorization (Joachims 1998; McCallum 1999), in medical diagnosis
(Kawai and Takahashi 2009) and other application areas. For the classification of acoustic
scenes, consider for example the well-known Cocktail-Party problem (Arons 1992), where
several signals are mixed together and the objective is to detect the original signal. For a more
detailed overview, we refer to Tsoumakas et al. (2010) and Zhang et al. (2013).

1.1 Prior art in multi-label learning and classification

In spite of its growing significance and attention, the theoretical analysis of multi-label
classification is still in its infancy with limited literature. Some recent publications, however,
show an interest in gaining fundamental insight into the problem of classifying multi-label data.
Most attention is thereby devoted to correlations in the label sets. Using error-correcting
output codes for multi-label classification (Dietterich and Bakiri 1995) has been proposed
very early to “correct” invalid (i.e. improbable) label sets. The principle of maximum entropy
is employed in Zhu et al. (2005) to capture correlations in the label set. The assumption of
small label sets is exploited in the framework of compressed sensing by Hsu et al. (2009).
Conditional random fields are used in Ghamrawi and McCallum (2005) to parameterize label
co-occurrences. Instead of independent dichotomies, a series of classifiers is built in Read
et al. (2009), where a classifier gets the output of all preceding classifiers in the chain as
additional input. A probabilistic version thereof is presented in Dembczyński et al. (2010).
Two important gaps in the theory of multi-label classification have attracted the attention
of the community in recent years: first, most research programs primarily focus on the label
set, while an interpretation of how multi-label data arise is missing in the vast majority of
the cases. Deconvolution problems (Streich 2010) define a special case of inference from
multi-label data, as discussed in Chap. 2. In-depth analysis of the asymptotic behaviour of
the estimators has been presented in Masry (1991, 1993). Secondly, a large number of quality
measures has been presented, but the understanding of how these are related to each other is
underdeveloped. Dembczyński et al. (2012) analyse the interrelation between some of the
most commonly used performance metrics. A theoretical analysis of the Bayes consistency
of learning algorithms with respect to different loss functions is presented in Gao and Zhou
(2013).
This contribution mainly addresses the issue of how multi-label data are generated, i.e.,
we propose a generative model for multi-label data. A datum is composed of emissions by
multiple sources. The emitting sources are indicated by the label set. These emissions are
combined by a problem specific combination function like the linear superposition principle
in optics or acoustics. The combination function specifies a core model assumption in the
data generation process. Each source generates data items according to a source specific
probability distribution. This point of view, as the reader should note, points into a direction
that is orthogonal to the previously mentioned literature on label correlation: extra knowledge
on the distribution of the label sets can coherently be represented by a prior over the label
sets.


Furthermore, we assume that the sources are described by parametric distributions.1 In


this setting, the accuracy of the parameter estimators is a fundamental quantity for assessing the
quality of an inference scheme. This measure is of central interest in asymptotic theory, which
investigates the distribution of a summary statistic in the asymptotic limit (Brazzale et al.
2007). Asymptotic analysis of parametric models has become an essential tool in statistics, as
the exact distributions of the quantities of interest cannot be measured in most settings. In the
first place, asymptotic analysis is used to check whether an estimation method is consistent,
i.e. whether the obtained estimators converge to the correct parameter values if the number of
data items available for inference goes to infinity. Furthermore, asymptotic theory provides
approximate answers where exact ones are not available, namely in the case of data sets of
finite size. Asymptotic analysis describes for example how efficiently an inference method
uses the given data for parameter estimation (Liang and Jordan 2008).
Consistent inference schemes are essential for generative classifiers, and a more efficient
inference scheme yields more precise classification results than a less efficient one, given the
same training data. More specifically, the expected error of a classifier converges to the Bayes
error for maximum a posteriori classification, if the estimated parameters converge to the
true parameter values (Devroye et al. 1996). In this paper, we first review the state-of-the-art
asymptotic theory for estimators based on single-label data. We then extend the asymptotic
analysis to inference on multi-label data and prove statements about the identifiability of
parameters and the asymptotic distribution of their estimators in this demanding setting.

1.2 Advantages of generative models

Generative models define only one approach to machine learning problems. For classification,
discriminative models directly estimate the posterior distributions of class labels given data
and, thereby, they avoid an explicit estimate of class specific likelihood distributions. A
further reduction in complexity is obtained by discriminant functions, which map a data item
directly to a set of classes or clusters (Hastie et al. 1993).
Generative models are the most demanding of all alternatives. If the only goal is to classify
data in an easy setting, designing and inferring the complete generative model might be
a wasteful use of resources and demand excessive amounts of data. However, particularly in
demanding scenarios, there exist well-founded reasons for generative models (Bishop 2007):
Generative description of data Even though this may be considered as stating the obvi-
ous, we emphasize that assumptions on the generative process underlying the observed
data naturally enter into a generative model. Incorporating such prior knowledge into
discriminative models typically proves significantly more difficult.
Interpretability The nature of multi-source data is best understood by studying how such
data are generated. In most applications, the sources in the generative model come with
a clear semantic meaning. Determining their parameters is thus not only an intermediate
step to the final goal of classification, but an important piece of information on the
structure of the data. Consider the cocktail party problem, where several speech and noise
sources are superposed to the speech of the dialogue partner. Identifying the sources which
generate the perceived signal is a demanding problem. The final goal, however, might
go even further and consist of finding out what your dialogue partner said. A generative
model for the sources present in the current acoustic situation enables us to determine the
most likely emission of each source given the complete signal. This approach, referred to

1 This supposition significantly simplifies the subsequent calculations; it is, however, not essential for the
approach proposed here.


as model-based source separation (Hershey et al. 2010), critically depends on a reliable source model.
Reject option and outlier detection Given a generative model, we can also determine the
probability of a particular data item. Samples with a low probability are called outliers.
Their generation is not confidently represented by the generative model, and no reliable
assignment of a data item to a set of sources is possible. Furthermore, outlier detection
might be helpful in the overall system in which the machine learning application is
integrated: outliers may be caused by a defective measurement device or by fraud.

Since these advantages of generative models are prevalent in the considered applications,
we restrict ourselves to generative methods when comparing our approaches with existing
techniques.

1.3 A generative understanding of multi-label data

When defining a generative model, a distribution for each source has to be defined. To do so,
one usually employs a parametric distribution, possibly based on prior knowledge or a study
of the distribution of the data with a particular label. In the multi-label setting, the combi-
nation function is a further key component of the generative model. This function defines
the semantics of the multi-label: while each single-labelled observation item is understood
as a sample from a probability distribution identified by its label, multi-label observations
are understood as a combination of the emissions of all sources in the label set. The combi-
nation function describes how the individual source emissions are combined to the observed
data. Choosing an appropriate combination function is essential for successful inference and
prediction. As we demonstrate in this paper, an inappropriate combination function might
lead to inconsistent parameter estimators and worse label predictions, both compared to a
simplistic approach where multi-label data items are ignored. Conversely, choosing the right
combination function will allow us to extract more information from the training data, thus
yielding more precise parameter estimators and superior classification accuracy.
The prominence of the combination function in the generative model naturally raises
the question how this combination function can be determined. Specifying the combination
function can be a challenging task when applying the deconvolutive method for multi-label
classification. However, in our previous work, we found that the combination
function can typically be determined based on the data and prior knowledge, i.e. expertise in
the field. For example in role mining, the disjunction of Boolean data is the natural choice
(see Streich et al. 2009 for details), while the addition of (supposedly) Gaussian emissions
is widely used in the classification of sounds (Streich and Buhmann 2008).

2 A generative model for multi-label data

We now present the generative process that we assume to have produced the observed data.
Such generative models are widely found for single-label classification and clustering, but
have not yet been formulated in a general form for multi-label data.

2.1 Label sets and source emissions

Let K denote the number of sources, and N the number of data items. We assume that
the systematic regularities of the observed data are generated by a set K = {1, . . . , K } of
K sources. Furthermore, we assume that all sources have the same sample space Ω. Each


Fig. 1 The generative model A for an observation X with source set L. An independent sample Λk is drawn from each source k according to the distribution P(Λk|θk). The source set L is sampled from the source set distribution P(L). These samples are then combined to the observation by the combination function cκ(Λ, L). Note that the observation X only depends on emissions from sources contained in the source set L

source k ∈ K emits samples Λk ∈ Ω according to a given parametric probability distribution
P(Λk|θk), where θk is the parameter tuple of source k. Realizations of the random variables
Λk are denoted by ξk. Note that both the parameters θk and the emission Λk can be vectors.
In this case, θk,1, θk,2, . . . and Λk,1, Λk,2, . . . denote different components of these vectors,
respectively. Emissions of different sources are assumed to be independent of each other. The
tuple of all source emissions is denoted by Λ := (Λ1, . . . , ΛK); its probability distribution is
given by $P(\Lambda \mid \theta) = \prod_{k=1}^{K} P(\Lambda_k \mid \theta_k)$. The tuple of the parameters of all K sources is denoted
by θ := (θ1, . . . , θK).
Given an observation X = x, the source set L = {λ1 , . . . , λ M } ⊆ K denotes the set
of all sources involved in generating X . The set of all possible label sets is denoted by L.
If L = {λ}, i.e. |L| = 1, X is called a single-label data item, and X is assumed to be a
sample from source λ. On the other hand, if |L| > 1, X is called a multi-label data item
and is understood as a combination of the emissions of all sources in the label set L. This
combination is formalized by the combination function cκ : Ω^K × L → Ω, where κ is
a set of parameters the combination function might depend on. Note that the combination
function only depends on emissions of sources in the label set and is independent of any
other emissions.
The generative process A for a data item, as illustrated in Fig. 1, consists of the following
three steps:

(1) Draw a label set L from the distribution P(L).


(2) For each k ∈ K, draw an independent sample Λk ∼ P(Λk|θk) from source k. Set
Λ := (Λ1, . . . , ΛK).
(3) Combine the source samples into the observation X = cκ(Λ, L).
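To make this three-step process concrete, the following Python sketch samples data from the generative process for K = 3 unit-variance Gaussian sources combined by linear superposition; the source means, the label-set distribution and the additive combination function are illustrative assumptions, not part of the general model.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: K = 3 unit-variance Gaussian sources (assumed values).
theta = np.array([-2.0, 0.0, 3.0])       # source parameters theta_k (here: means)
label_sets = [(0,), (1,), (2,), (0, 2)]  # possible label sets L
pi = np.array([0.3, 0.3, 0.2, 0.2])      # label set probabilities P(L)

def combine(emissions, label_set):
    # Deterministic combination function c_kappa: linear superposition (an assumption).
    return sum(emissions[k] for k in label_set)

def generate(n):
    data = []
    for _ in range(n):
        L = label_sets[rng.choice(len(label_sets), p=pi)]  # (1) draw a label set L ~ P(L)
        emissions = rng.normal(theta, 1.0)                 # (2) draw Lambda_k ~ P(Lambda_k | theta_k)
        x = combine(emissions, L)                          # (3) combine the emissions of the sources in L
        data.append((x, L))
    return data

print(generate(3))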

2.2 The combination function

The combination function models how emissions of one or several sources are combined to
the structure component of the observation X . Often, the combination function reflects a priori
knowledge of the data generation process like the linear superposition law of electrodynamics
and acoustics or disjunctions in role mining. For source sets of cardinality one, i.e. for
single-label data, the combination function chooses the emission of the corresponding source:
cκ(Λ, {λ}) = Λλ.
For source sets with more than one source, the combination function can be either deter-
ministic or stochastic. Examples for deterministic combination functions are the (weighted)
sum and the Boolean OR operation. In this case, the value of X is completely determined by Λ


and L. In terms of probability distribution, a deterministic combination function corresponds


to a point mass at X = cκ(Λ, L):

$$P(X \mid \Lambda, \mathcal{L}) = 1_{\{X = c_\kappa(\Lambda, \mathcal{L})\}}.$$
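As a second illustrative deterministic combination function, the disjunction mentioned for role mining can be sketched as follows; emissions and observations are assumed to be binary vectors and the combination is an element-wise OR.

import numpy as np

def combine_or(emissions, label_set):
    # Deterministic Boolean combination: element-wise OR of the emissions of the
    # sources in the label set (binary vectors, as in the role-mining example).
    out = np.zeros_like(emissions[next(iter(label_set))])
    for k in label_set:
        out = np.logical_or(out, emissions[k])
    return out.astype(int)

emissions = {1: np.array([1, 0, 0, 1]), 2: np.array([0, 1, 0, 1])}
print(combine_or(emissions, {1, 2}))   # [1 1 0 1]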

Stochastic combination functions allow us to formulate e.g. the well-known mixture discrim-
inant analysis as a multi-label problem (Streich 2010). However, stochastic combination
functions render inference more complex, since a description of the stochastic behaviour of
the function has to be learned in addition to the parameters of the source distributions. In the
considered applications, deterministic combination functions suffice to model the assumed
generative process. For this reason, we will not further discuss probabilistic combination
functions in this paper.

2.3 Probability distribution for structured data

Given the assumed generative process A, the probability of an observation X for source set
L and parameters θ amounts to

$$P(X \mid \mathcal{L}, \theta) = \int P(X \mid \Lambda, \mathcal{L}) \, dP(\Lambda \mid \theta)$$

We refer to P(X |L, θ) as the proxy distribution of observations with source set L. Note that
in the presented interpretation of multi-label data, the distributions P(X |L, θ) for all source
sets L are derived from the single-source distributions.
For a full generative model, we introduce πL as the probability of source set L. The overall
probability of a data item D = (X, L) is thus
 
$$P(X, \mathcal{L} \mid \theta) = \pi_{\mathcal{L}} \cdot \int \cdots \int P(X \mid \Lambda, \mathcal{L}) \, dP(\Lambda_1 \mid \theta_1) \cdots dP(\Lambda_K \mid \theta_K) \qquad (1)$$

Several samples from the generative process are assumed to be independent and identically
distributed (i.i.d.). The probability of N observations X = (X1, . . . , XN) with source sets
L = (L1, . . . , LN) is thus $P(\mathbf{X}, \mathbf{L} \mid \theta) = \prod_{n=1}^{N} P(X_n, \mathcal{L}_n \mid \theta)$. The assumption of i.i.d. data
items allows us a substantial simplification of the model but is not a requirement for the
assumed generative model.
To give an example of our generative model, we re-formulate the model used in McCallum
(1999) in the terminology of this contribution. Omitting the mixture weights of individual
classes within the label set (denoted by λ in the original contribution) and understanding
a single document as a collection of W words, the probability of a single document is
$P(X) = \sum_{\mathcal{L} \in \mathbb{L}} P(\mathcal{L}) \prod_{w=1}^{W} \sum_{\lambda \in \mathcal{L}} P(X_w \mid \lambda)$. Comparing with the assumed data likelihood
(Eq. 1), we find that the combination function is the juxtaposition, i.e. every word emitted
by a source during the generative process will be found in the document.
A similar word-based mixture model for multi-label text classification is presented in Ueda
and Saito (2006). Rosen-Zvi et al. (2004) introduce the author-topic model, a generative
model for documents that combines the mixture model over words with Latent Dirichlet
Allocation (Blei et al. 2003) to include authorship information: each author is associated
with a multinomial distribution over topics and each topic is associated with a multinomial
distribution over words. A document with multiple authors is modeled as a distribution
over topics that is a mixture of the distributions associated with the authors. An additional
dependency on the recipient is introduced in McCallum et al. (2005) in order to predict
people’s roles from email communications. Yano et al. (2009) uses the topic model to predict


the response to political blogs. We are not aware of any generative approaches to multi-label
classification in other domains than text categorization.

2.4 Quality measures for multi-label classification

The quality measure mathematically formulates the evaluation criteria for the machine learn-
ing task at hand. A whole series of measures has been defined (Tsoumakas and Katakis
2007) to cover different requirements of multi-label classification. Commonly used are aver-
age precision, coverage, Hamming loss, one-error and ranking loss (Schapire and Singer
2000; Zhang and Zhou 2006) as well as accuracy, precision, recall and F-Score (Godbole
and Sarawagi 2004; Qi et al. 2007). We will focus on the balanced error rate (BER) (adapted
from single-label classification) and precision, recall and F-score (inspired by information
retrieval).
The BER is the ratio of incorrectly classified samples per label set, averaged (with equal
weight) over all label sets:

$$\mathrm{BER}(\hat{\mathbf{L}}, \mathbf{L}) := \frac{1}{|\mathbb{L}|} \sum_{\mathcal{L} \in \mathbb{L}} \frac{\sum_n 1_{\{\hat{\mathcal{L}}_n \neq \mathcal{L}\}} \, 1_{\{\mathcal{L}_n = \mathcal{L}\}}}{\sum_n 1_{\{\mathcal{L}_n = \mathcal{L}\}}}$$

While the BER considers the entire label set, precision and recall are calculated first
per label. We first calculate the true positives $tp_k = \sum_{n=1}^{N} 1_{\{k \in \hat{\mathcal{L}}_n\}} 1_{\{k \in \mathcal{L}_n\}}$, false positives
$fp_k = \sum_{n=1}^{N} 1_{\{k \in \hat{\mathcal{L}}_n\}} 1_{\{k \notin \mathcal{L}_n\}}$ and false negatives $fn_k = \sum_{n=1}^{N} 1_{\{k \notin \hat{\mathcal{L}}_n\}} 1_{\{k \in \mathcal{L}_n\}}$ for each
class k. The precision prec_k of class k is the number of data items correctly identified as
belonging to k, divided by the number of all data items identified as belonging to k. The
recall rec_k of class k is the number of instances correctly recognized as belonging to this
class, divided by the number of instances which belong to class k:

$$prec_k := \frac{tp_k}{tp_k + fp_k} \qquad rec_k := \frac{tp_k}{tp_k + fn_k}$$
Good performance with respect to either precision or recall alone can be obtained by either
very conservatively assigning data items to classes (leading to typically small label sets and
a high precision, but a low recall) or by attributing labels in a very generous way (yielding
high recall, but low precision). The F-score Fk , defined as the harmonic mean of precision
and recall, finds a balance between the two measures:
$$F_k := \frac{2 \cdot rec_k \cdot prec_k}{rec_k + prec_k}$$
Precision, recall and the F-score are determined individually for each base label k. We report
the average over all labels k (macro averaging). All these measures take values between 0
(worst) and 1 (best). The error rate and the BER are quality measures computed on an entire
data set. Their values also range from 0 to 1, but here 0 is best.
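A minimal sketch of these label-based measures, assuming that true and predicted label sets are given as Python sets; the function and variable names are illustrative.

def macro_scores(true_sets, pred_sets, labels):
    # Macro-averaged precision, recall and F-score over the base labels.
    labels = list(labels)
    precs, recs, f1s = [], [], []
    for k in labels:
        tp = sum(1 for t, p in zip(true_sets, pred_sets) if k in p and k in t)
        fp = sum(1 for t, p in zip(true_sets, pred_sets) if k in p and k not in t)
        fn = sum(1 for t, p in zip(true_sets, pred_sets) if k not in p and k in t)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        precs.append(prec)
        recs.append(rec)
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    n = len(labels)
    return sum(precs) / n, sum(recs) / n, sum(f1s) / n

def balanced_error_rate(true_sets, pred_sets):
    # Average, over the distinct true label sets, of the fraction of items with that
    # true label set whose predicted label set differs from it.
    distinct = {frozenset(t) for t in true_sets}
    errors = []
    for L in distinct:
        idx = [i for i, t in enumerate(true_sets) if frozenset(t) == L]
        errors.append(sum(frozenset(pred_sets[i]) != L for i in idx) / len(idx))
    return sum(errors) / len(errors)

# Illustrative usage with three base labels 1, 2, 3:
true_sets = [{1}, {2}, {1, 3}, {2}]
pred_sets = [{1}, {2, 3}, {1}, {2}]
print(macro_scores(true_sets, pred_sets, labels=[1, 2, 3]))
print(balanced_error_rate(true_sets, pred_sets))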
Besides the quality criteria on the classification output, the accuracy of the parameter
estimator compares the estimated source parameters with the true source parameters. This
model-based criterion thus assesses the obtained solution of the essential inference prob-
lem in generative classification. However, a direct comparison between true and estimated
parameters is typically only possible for experiments with synthetically generated data. The
possibility to directly assess the inference quality and the extensive control over the experi-
mental setting are actually the main reasons why, in this paper, we focus on experiments with
synthetic data. We measure the accuracy of the parameter estimation by the mean square


Table 1 Overview over the probability distributions used in this paper

Symbol              Meaning
P_θk(Λk)            True distribution of the emissions of source k, given θk
P_θ(Λ)              True joint distribution of the emissions of all sources
P_L,θ(X)            True distribution of the observations X with label set L
P^M_L,θ(X)          Distribution of the observation X with label set L, as assumed by method M, and given parameters θ
P_L,D(X)            Empirical distribution of an observation X with label set L in the data set D
P_π(L)              True distribution of the label sets
P_D(L)              Empirical distribution of the label sets in D
P_θ(D)              True distribution of data item D
P^M_θ(D)            Distribution of data item D as assumed by method M
P_D(D)              Empirical distribution of a data item D in the data set D
P^M_{D,θk}(Λk)      Conditional distribution of the emission Λk of source k given D and θk, as assumed by inference method M
P^M_{D,θ}(Λ)        Conditional distribution of the source emissions Λ given θ and D, as assumed by inference method M

A data item D = (X, L) is an observation X along with its label set L

error (MSE), defined as the average squared distance between the true parameter θ and its
estimator θ̂:

$$\mathrm{MSE}(\hat{\theta}, \theta) := \frac{1}{K} \sum_{k=1}^{K} \mathbb{E}_{\hat{\theta}_k}\!\left[ \left\| \theta_{k,\cdot} - \hat{\theta}_{\pi(k),\cdot} \right\|^2 \right].$$

The MSE can be decomposed as follows:

$$\mathrm{MSE}(\hat{\theta}, \theta) = \frac{1}{K} \sum_{k=1}^{K} \left( \mathbb{E}_{\hat{\theta}_k}\!\left[ \left\| \theta_{k,\cdot} - \hat{\theta}_{\pi(k),\cdot} \right\| \right]^2 + \mathbb{V}_{\hat{\theta}_k}\!\left[ \hat{\theta}_k \right] \right) \qquad (2)$$

The first term Eθ̂k[||θk,· − θ̂π(k),· ||] is the expected deviation of the estimator θ̂π(k),· from the
true value θk,· , called the bias of the estimator. The second term Vθ̂k[θ̂k ] indicates the variance
of the estimator over different data sets. We will rely on this bias-variance decomposition
when computing the asymptotic distribution of the mean-squared error of the estimators. In
the experiments, we will report the root mean square error (RMS).
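As a small sketch, the RMS error can be computed as follows, assuming the true and estimated parameters are stored as K × S arrays and that the matching π(k) is taken to be the permutation of the estimated sources minimizing the error (an assumption; the text does not fix how π is selected).

import itertools
import numpy as np

def rms_error(theta_true, theta_hat):
    # theta_true, theta_hat: arrays of shape (K, S); pi(k) is chosen as the
    # permutation of the estimated sources that minimizes the MSE (assumption).
    K = theta_true.shape[0]
    best_mse = min(
        np.mean(np.sum((theta_true - theta_hat[list(perm)]) ** 2, axis=1))
        for perm in itertools.permutations(range(K))
    )
    return float(np.sqrt(best_mse))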

3 Preliminaries

Preliminaries to study the asymptotic behaviour of the estimators obtained by different infer-
ence methods are introduced in this section. This paper uses an elaborate notation; the
probability distributions used are summarized in Table 1.

3.1 Exponential family distributions

In the following, we assume that the source distributions are members of the exponential
family (Wainwright and Jordan 2008). This assumption implies that the distribution Pθk(Λk)
of source k admits a density pθk(ξk) of the following form:


$$p_{\theta_k}(\xi_k) = \exp\left( \langle \theta_k, \phi(\xi_k) \rangle - A(\theta_k) \right). \qquad (3)$$


Here θk are the natural parameters, φ(ξk) are the sufficient statistics of the sample ξk of source
k, and $A(\theta_k) := \log \int \exp\left( \langle \theta_k, \phi(\xi_k) \rangle \right) d\xi_k$ is the log-partition function. The expression
$\langle \theta_k, \phi(\xi_k) \rangle := \sum_{s=1}^{S} \theta_{k,s} \cdot (\phi(\xi_k))_s$ denotes the inner product between the natural parameters
θk and the sufficient statistics φ(ξk). The number S is called the dimensionality of the expo-
nential family. θk,s is the sth dimension of the parameter vector of source k, and (φ(ξk))s
is the sth dimension of the sufficient statistics.
distribution is denoted by Θ. The class of exponential family distributions contains many of
the widely used probability distributions: the Bernoulli, Poisson and χ² distributions are
one-dimensional exponential family distributions; the Gamma, Beta and normal distributions
are examples of two-dimensional exponential family distributions.
The joint distribution of the independent sources is $P_\theta(\Lambda) = \prod_{k=1}^{K} P_{\theta_k}(\Lambda_k)$, with the
density function $p_\theta(\xi) = \prod_{k=1}^{K} p_{\theta_k}(\xi_k)$. To shorten the notation, we define the vectorial
sufficient statistic $\phi(\xi) := (\phi(\xi_1), \ldots, \phi(\xi_K))^T$, the parameter vector $\theta := (\theta_1, \ldots, \theta_K)^T$
and the cumulative log-partition function $A(\theta) := \sum_{k=1}^{K} A(\theta_k)$. Using the parameter vector
θ and the emission vector ξ, the density function pθ of the source emissions is
$p_\theta(\xi) = \prod_{k=1}^{K} p_{\theta_k}(\xi_k) = \exp\left( \langle \theta, \phi(\xi) \rangle - A(\theta) \right)$.
Exponential family distributions have the property that the derivatives of the log-partition
function with respect to the parameter vector θ are moments of the sufficient statistics φ(·).
Namely the first and second derivative of A(·) are the expected first and second moment of
the statistics:
$$\nabla_\theta A(\theta) = \mathbb{E}_{\Lambda \sim P_\theta}[\phi(\Lambda)] \qquad \nabla^2_\theta A(\theta) = \mathbb{V}_{\Lambda \sim P_\theta}[\phi(\Lambda)] \qquad (4)$$
where E X ∼P [X ] and VX ∼P [X ] denote the expectation value and the covariance matrix of a
random variable X sampled from distribution P. In all statements in this paper, we assume
that all considered variances are finite.
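The first identity of Eq. 4 can be checked numerically for a simple member of the exponential family; the sketch below uses the Bernoulli distribution in natural parameterization, where φ(ξ) = ξ, A(θ) = log(1 + e^θ) and E[φ(Λ)] is the logistic function of θ. The finite-difference check is purely illustrative.

import numpy as np

def A(theta):
    # Log-partition function of the Bernoulli distribution in natural parameterization.
    return np.log1p(np.exp(theta))

theta = 0.7
eps = 1e-6
grad_A = (A(theta + eps) - A(theta - eps)) / (2 * eps)  # numerical derivative of A
mean_phi = 1.0 / (1.0 + np.exp(-theta))                 # E[phi(Lambda)] = sigmoid(theta)
print(abs(grad_A - mean_phi) < 1e-6)                    # first identity of Eq. 4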

3.2 Identifiability

The representation of exponential family distributions in Eq. 3 may not be unique, e.g. if
the sufficient statistics φ(ξk ) are mutually dependent. In this case, the dimensionality S of
the exponential family distribution can be reduced. Unless this is done, the parameters θk
are unidentifiable: there exist at least two different parameter values θk^(1) ≠ θk^(2) which
imply the same probability distribution p_{θk^(1)} = p_{θk^(2)}. These two parameter values cannot be
distinguished based on observations; they are therefore called unidentifiable (Lehmann and
Casella 1998).

Definition 1 (Identifiability) Let ℘ = {pθ : θ ∈ Θ} be a parametric statistical model
with parameter space Θ. ℘ is called identifiable if the mapping θ → pθ is one-to-one:
p_{θ^(1)} = p_{θ^(2)} ⟺ θ^(1) = θ^(2) for all θ^(1), θ^(2) ∈ Θ.

Identifiability of the model in the sense that the mapping θ → pθ can be inverted is equivalent
to being able to learn the true parameters of the model if an infinite number of samples from
the model can be observed (Lehmann and Casella 1998).
In all concrete learning problems, identifiability is always conditioned on the data. Obvi-
ously, if there are no observations from a particular source (class), the likelihood of the data
is independent of the parameter values of the never-occurring source. The parameters of the
particular source are thus unidentifiable.


3.3 M- and Z -estimators

A popular method to determine the estimators θ̂ = (θ̂1 , . . . , θ̂ K ) for a generative model


based on independent and identically distributed (i.i.d.) N data items D = (D1, . . . , DN) is
to maximize a criterion function $\theta \mapsto M_N(\theta) = \frac{1}{N} \sum_{n=1}^{N} m_\theta(D_n)$, where $m_\theta : \mathcal{D} \to \mathbb{R}$
are known functions. An estimator $\hat\theta = \arg\max_\theta M_N(\theta)$ maximizing $M_N(\theta)$ is called an
M-estimator, where M stands for maximization.
For continuously differentiable criterion functions, the maximizing value is often determined
by setting the derivative with respect to θ equal to zero. With $\psi_\theta(D) := \nabla_\theta m_\theta(D)$, this
yields an equation of the type $\Psi_N(\theta) = \frac{1}{N} \sum_{n=1}^{N} \psi_\theta(D_n)$, and the parameter θ is then deter-
mined such that $\Psi_N(\theta) = 0$. This type of estimator is called a Z-estimator, with Z standing
for zero.

Maximum-likelihood estimators Maximum likelihood estimators are M-estimators with the


criterion function $m_\theta(D) := \ell(\theta; D)$. The corresponding Z-estimator, which we will use in
this paper, is obtained by computing the derivative of the log-likelihood with respect to the
parameter vector θ, called the score:

$$\psi_\theta(D) = \nabla_\theta \, \ell(\theta; D). \qquad (5)$$

Convergence Assume that there exists an asymptotic criterion function θ → Ψ(θ) such that
the sequence of criterion functions converges in probability to a fixed limit: $\Psi_N(\theta) \xrightarrow{P} \Psi(\theta)$
for every θ. Convergence can only be obtained if there is a unique zero θ0 of Ψ(·), and
if only parameters θ close to θ0 yield a value of Ψ(θ) close to zero. Thus, θ0 has to be a
well-separated zero of Ψ(·) (van der Vaart 1998):

Theorem 1 Let Ψ_N be random vector-valued functions and let Ψ be a fixed vector-valued
function of θ such that for every ε > 0

$$\sup_{\theta \in \Theta} \|\Psi_N(\theta) - \Psi(\theta)\| \xrightarrow{P} 0 \qquad \inf_{\theta : d(\theta, \theta_0) \geq \varepsilon} \|\Psi(\theta)\| > \|\Psi(\theta_0)\| = 0.$$

Then any sequence of estimators $\hat\theta_N$ such that $\Psi_N(\hat\theta_N) = o_P(1)$ converges in probability
to θ0.

The notation o_P(1) denotes a sequence of random vectors that converges to 0 in probability,
and d(θ, θ0) indicates the Euclidean distance between the estimator θ and the true value θ0.
The second condition implies that θ0 is the only zero of Ψ(·) outside a neighborhood of
size ε around θ0. As Ψ(·) is defined as the derivative of the likelihood function (Eq. 5), this
criterion is equivalent to a concave likelihood function over the whole parameter space Θ. If
the likelihood function is not concave, there are several (local) optima, and convergence to
the global maximizer θ 0 cannot be guaranteed.

Asymptotic normality Given consistency, the question arises how the estimators $\hat\theta_N$ are dis-
tributed around the asymptotic limit θ0. Assuming the criterion function θ → ψθ(D)
to be twice continuously differentiable, $\Psi_N(\hat\theta_N)$ can be expanded through a Taylor series
around θ0. Then, using the central limit theorem, $\hat\theta_N$ is found to be normally distributed
around θ0 (van der Vaart 1998). Defining $v^{\otimes} := v v^T$, we get the following theorem (all
expectation values w.r.t. the true distribution of the data items D):


 
Theorem 2 Assume that $\mathbb{E}_D\!\left[ \psi_{\theta_0}(D)^{\otimes} \right] < \infty$ and that the map $\theta \mapsto \mathbb{E}_D[\psi_\theta(D)]$ is differ-
entiable at a zero θ0 with non-singular derivative matrix. Then, the sequence $\sqrt{N} \cdot (\hat\theta_N - \theta_0)$
is asymptotically normal: $\sqrt{N} \cdot (\hat\theta_N - \theta_0) \to \mathcal{N}(0, \Sigma)$ with asymptotic variance

$$\Sigma = \mathbb{E}_D\!\left[ \nabla_\theta \psi_{\theta_0}(D) \right]^{-1} \cdot \mathbb{E}_D\!\left[ \psi_{\theta_0}(D)^{\otimes} \right] \cdot \left( \mathbb{E}_D\!\left[ \nabla_\theta \psi_{\theta_0}(D) \right]^{-1} \right)^T. \qquad (6)$$

3.4 Maximum-likelihood estimation on single-label data

To estimate parameters on single-label data, a data set D = {(X n , λn )}, n = 1, . . . , N ,


with λn ∈ {1, . . . , K } for all n, is separated according to the class label, so that one gets K
sets $\mathcal{X}_1, \ldots, \mathcal{X}_K$, where $\mathcal{X}_k := \{X_n \mid (X_n, \lambda_n) \in \mathcal{D}, \lambda_n = k\}$ contains all observations with
label k. All samples in $\mathcal{X}_k$ are assumed to be i.i.d. random variables distributed according
to $P(X \mid \theta_k)$. It is assumed that the samples in $\mathcal{X}_k$ do not provide any information about $\theta_{k'}$
if $k \neq k'$, i.e. parameters for the different classes are functionally independent of each other
(Duda et al. 2000). Therefore, we obtain K independent parameter estimation problems,
each with criterion function $\Psi_{N_k}(\theta_k) = \frac{1}{N_k} \sum_{X \in \mathcal{X}_k} \psi_{\theta_k}((X, k))$, where $N_k := |\mathcal{X}_k|$. The
parameter estimator $\hat\theta_k$ is then determined such that $\Psi_{N_k}(\theta_k) = 0$. More specifically, for
maximum-likelihood estimation of parameters of exponential family distributions (Eq. 3),
the criterion function $\psi_{\theta_k}(D) = \nabla_\theta \ell(\theta; D)$ (Eq. 5) for a data item D = (X, {k}) becomes
$\psi_{\theta_k}(D) = \phi(X) - \mathbb{E}_{\Lambda_k \sim P_{\theta_k}}[\phi(\Lambda_k)]$. Choosing $\hat\theta_k$ such that the criterion function $\Psi_{N_k}(\theta_k)$
is zero means changing the model parameter such that the average value of the sufficient
statistics of the observations coincides with the expected sufficient statistics:

$$\Psi_{N_k}(\theta_k) = \frac{1}{N_k} \sum_{X \in \mathcal{X}_k} \left( \phi(X) - \mathbb{E}_{\Lambda_k \sim P_{\theta_k}}[\phi(\Lambda_k)] \right). \qquad (7)$$

Hence, maximum-likelihood estimators in exponential families are moment estimators


(Wainwright and Jordan 2008). The theorems of consistency and asymptotic normality are
directly applicable.
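For a Gaussian source with known unit variance, the sufficient statistic is φ(ξ) = ξ and the moment condition of Eq. 7 reduces to matching the sample mean of each class; a minimal sketch under this assumption (the data are illustrative):

import numpy as np

def single_label_mle(data):
    # Moment estimator per source (Eq. 7): average the sufficient statistics
    # phi(X) = X of all observations carrying that label (unit-variance Gaussian).
    groups = {}
    for x, label in data:
        groups.setdefault(label, []).append(x)
    return {k: float(np.mean(xs)) for k, xs in groups.items()}

rng = np.random.default_rng(1)
data = [(rng.normal(-2.0), 1) for _ in range(500)] + [(rng.normal(3.0), 2) for _ in range(500)]
print(single_label_mle(data))  # approximately {1: -2.0, 2: 3.0}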
With the same formalism, it becomes clear why the inference problems for different classes
are independent: assume an observation X with label k is given. Under the assumption of the
generative model, the label k states that X is a sample from source pθk . Trying to derive infor-
mation about the parameter $\theta_{k'}$ of a second source $k' \neq k$ from X, we would differentiate $p_{\theta_k}$ with
respect to $\theta_{k'}$ to get the score function. Since $p_{\theta_k}$ is independent of $\theta_{k'}$, this derivative is zero,
and the data item (X, k) does not contribute to the criterion function $\Psi_{N_{k'}}(\theta_{k'})$ (Eq. 7) for $\theta_{k'}$.

Fisher information For inference in a parametric model with a consistent estimator θ̂k → θk ,
the Fisher information I (Fisher 1925) is defined as the second moment of the score function.
Since the parameter estimator θ̂ is chosen such that the average of the score function is zero,
the second moment of the score function corresponds to its variance:

$$\mathcal{I}_{\mathcal{X}_k}(\theta_k) := \mathbb{E}_{X \sim P_{\theta_k^G}}\!\left[ \psi_{\theta_k}(X)^{\otimes} \right] = \mathbb{V}_{X \sim P_{\theta_k^G}}[\phi(X)], \qquad (8)$$

where the expectation is taken with respect to the true distribution $P_{\theta_k^G}$. The Fisher information
thus indicates to what extent the score function depends on the parameter. The larger this
dependency is, the more the observed data depend on the parameter value, and the more
accurately this parameter value can be determined for a given set of training data. According
to the Cramér–Rao bound (Rao 1945; Cramér 1946, 1999), the reciprocal of the Fisher


information is a lower bound on the variance of any unbiased estimator of a deterministic


parameter. An estimator $\hat\theta_k$ is called efficient if $\mathbb{V}_{X \sim P_{\theta_k^G}}[\hat\theta_k] = (\mathcal{I}_{\mathcal{X}_k}(\theta_k))^{-1}$.

4 Asymptotic distribution of multi-label estimators

We now extend the asymptotic analysis to estimators based on multi-label data. We restrict
ourselves to maximum likelihood estimators for the parameters of exponential family distri-
butions. As we are mainly interested in comparing different ways to learn from data, we also
assume the parametric form of the distribution to be known.

4.1 From observations to source emissions

In single-label inference problems, each observation provides a sample of a source indicated


by the label, as discussed in Sect. 3.4. In the case of inference based on multi-label data,
the situation is more involved, since the source emissions cannot be observed directly. The
relation between the source emissions and the observations is formalized by the combination
function (see Sect. 2) describing the observation X based on an emission vector Λ and the
label set L.
To perform inference, we have to determine which emission vector Λ has produced the
observed X . To solve this inverse problem, an inference method relies on additional con-
straints besides assuming the parametric form of the distribution, namely on the combination
function. These design assumptions — made implicitly or explicitly — enable the infer-
ence scheme to derive information about the distribution of the source emissions given an
observation.
In this analysis, we focus on differences in the assumed combination function.
P^M(X|Λ, L) denotes the probabilistic representation of the combination function: it spec-
ifies the probability distribution of an observation X given the emission vector Λ and the
label set L, as assumed by method M. We formally describe several techniques along with
the analysis of their estimators in Sect. 5. It is worth mentioning that for single-label data,
all estimation techniques considered in this work are equal and yield consistent and efficient
parameter estimators, as they agree on the combination function for single-label data: the
identity function is the only reasonable choice in this case.
The probability distribution of X given the label set L, the parameters θ and the com-
bination function assumed by method M is computed by marginalizing Λ out of the joint
distribution of Λ and X:

$$P^M_{\mathcal{L},\theta}(X) := P^M(X \mid \mathcal{L}, \theta) = \int P^M(X \mid \Lambda, \mathcal{L}) \, dP(\Lambda \mid \theta)$$

For the probability of a data item D = (X, L) given the parameters θ and under the assump-
tions made by model M, we have

$$P^M_\theta(D) := P^M(X, \mathcal{L} \mid \theta) = \pi_{\mathcal{L}} \cdot \int P^M(X \mid \Lambda, \mathcal{L}) \, p(\Lambda \mid \theta) \, d\Lambda. \qquad (9)$$

Estimating the probability of the label set L, πL , is a standard problem of estimating the
parameters of a categorical distribution. According to the law of large numbers, the empirical
frequency of occurrence converges to the true probability for each label set. Therefore, we
do not further investigate this estimation problem and assume that the true value of πL can
be determined for all L ∈ L.


The probability of a particular emission vector Λ given a data item D and the parameters
θ is computed using Bayes' theorem:

$$P^M_{D,\theta}(\Lambda) := P^M(\Lambda \mid X, \mathcal{L}, \theta) = \frac{P^M(X \mid \Lambda, \mathcal{L}) \cdot P(\Lambda \mid \theta)}{P^M(X \mid \mathcal{L}, \theta)} \qquad (10)$$
The dependency of $P^M_{D,\theta}(\Lambda)$ on the parameter vector θ indicates that the estimation of the contribu-
tions of a source may depend on the parameters of a different source. When solving clustering
problems, we also find cross-dependencies between parameters of different classes. How-
ever, these dependencies are due to the fact that the class assignments are not known but are
probabilistically estimated. If the true class labels were known, the dependencies would dis-
appear. In the context of multi-label classification, however, the mutual dependencies persist
even when the true labels (called label set in our context) are known.
The distribution P^M(Λ|D, θ) describes the essential difference between inference meth-
ods for multi-label data. For an inference method M which assumes that an observation X
is a sample from each source contained in the label set L, P^M(Λk|D, θ) is a point mass
(Dirac mass) at X. In the above example of the sum of Gaussian emissions, P^M(Λ|D, θ)
has a continuous density function.

4.2 Conditions for identifiability

As in the standard scenario of learning from single-label data, parameter inference is only
possible if the parameters θ are identifiable. Conversely, parameters are unidentifiable if
θ^(1) ≠ θ^(2), but P_{θ^(1)} = P_{θ^(2)}. For our setting as specified in Eq. 9, this is the case if

$$\sum_{n=1}^{N} \log\left( \pi_{\mathcal{L}_n} \int P^M(X_n \mid \xi, \mathcal{L}_n) \, p(\xi \mid \theta^{(1)}) \, d\xi \right) = \sum_{n=1}^{N} \log\left( \pi_{\mathcal{L}_n} \int P^M(X_n \mid \xi, \mathcal{L}_n) \, p(\xi \mid \theta^{(2)}) \, d\xi \right)$$

but θ^(1) ≠ θ^(2). The following situations imply such a scenario:
– A particular source k never occurs in the label set, formally |{L ∈ L|k ∈ L}| = 0 or
πL = 0 ∀L ∈ L : L ∋ k. This excess parameterization is the trivial case — one cannot
infer the parameters of a source without observing emissions from that source. In such
a case, the probability of the observed data (Eq. 9) is invariant of the parameters θk of
source k.
– The combination function ignores all (!) emissions of a particular source k. Thus, under
the assumptions of the inference method M, the emission Λk of source k never has an
influence on the observation. Hence, the combination function does not depend on Λk.
If this independence holds for all L, information on the source parameters θk cannot be
obtained from the data.
– The data available for inference does not support distinguishing different parameters of
a pair of sources. Assume e.g. that source 2 only occurs together with source 1, i.e. for
all n with 2 ∈ Ln , we also have 1 ∈ Ln . Unless the combination function is such that
information can be derived about the emissions of the two sources 1 and 2 for at least
some of the data items, there is a set of parameters θ1 and θ2 for the two sources that
yields the same likelihood.
If the distribution of a particular source is unidentifiable, the chosen representation is prob-
lematic for the data at hand and might e.g. contain redundancies, such as a source (class)
which is never observed. More specifically, in the first two cases, there does not exist any
empirical evidence for the existence of a source which is either never observed or has no


influence on the data. In the last case, one might doubt if the two classes 1 and 2 are really
separate entities, or whether it might be more reasonable to merge them to a single class.
Conversely, non-compliance with the three above conditions is a necessary (but not sufficient!)
condition for parameter identifiability in the model.

4.3 Maximum likelihood estimation on multi-label data

Based on the probability of a data item D given the parameter vector θ under the assumptions
of the inference method M (Eq. 9) and using a uniform prior over the parameters, the
log-likelihood of a parameter θ given a data item D = (X, L) is given by $\ell^M(\theta; D) =
\log(P^M(X, \mathcal{L} \mid \theta))$. Using the particular properties of exponential family distributions (Eq. 4),
the score function is

$$\psi^M_\theta(D) = \nabla_\theta \, \ell^M(\theta; D) = \mathbb{E}_{\Lambda \sim P^M_{D,\theta}}[\phi(\Lambda)] - \nabla_\theta A(\theta) \qquad (11)$$
$$\qquad\qquad\quad = \mathbb{E}_{\Lambda \sim P^M_{D,\theta}}[\phi(\Lambda)] - \mathbb{E}_{\Lambda \sim P_\theta}[\phi(\Lambda)]. \qquad (12)$$

Comparing with the score function obtained in the single-label case (Eq. 7), the difference in
the first term becomes apparent. While the first term is the sufficient statistic of the observation
in the previous case, we now find the expected value of the sufficient statistic of the emissions,
conditioned on D = (X, L). This formulation contains the single-label setting as a special
case: given the single-label observation X with label k, we are sure that the kth source has
emitted X, i.e. Λk = X. In the more general case of multi-label data, several emission vectors
Λ might have produced the observed X. The distribution of these emission vectors (given D and
θ) is given by Eq. 10. The expectation of the sufficient statistics of the emissions with respect
to this distribution now plays the role of the sufficient statistic of the observation in the
single-label case.
As in the single-label case, we assume that several emissions are independent given their
sources (conditional independence). The likelihood and the criterion function for a data set
D = (D1 , . . . , D N ) thus factorize:

$$\Psi^M_N(\theta) = \frac{1}{N} \sum_{n=1}^{N} \psi^M_\theta(D_n) \qquad (13)$$
In the following, we study Z-estimators $\hat\theta^M_N$ obtained by setting $\Psi^M_N(\hat\theta^M_N) = 0$. We analyse
the asymptotic behaviour of the criterion function $\Psi^M_N$ and derive conditions for consistent
estimators as well as their convergence rates.
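As an illustration of such a Z-estimator, consider two unit-variance Gaussian sources combined additively (an assumed special case). For a single-label item the emission is observed directly; for an item with label set {1, 2}, standard Gaussian conditioning gives E[Λ_k | X, θ] = θ_k + (X − θ_1 − θ_2)/2. Setting the criterion function of Eq. 13 to zero then amounts to the fixed-point iteration sketched below; with single-label items present, the iteration converges to the moment-matching solution.

import numpy as np

def multilabel_moment_estimator(data, iters=100):
    # Fixed-point iteration for the moment condition: each theta_k is set to the
    # average of E[Lambda_k | D, theta] over the data items containing label k.
    # Assumes two unit-variance Gaussian sources (labels 1, 2) combined by addition.
    theta = {1: 0.0, 2: 0.0}
    for _ in range(iters):
        sums, counts = {1: 0.0, 2: 0.0}, {1: 0, 2: 0}
        for x, L in data:
            for k in L:
                if len(L) == 1:
                    e_k = x                                           # emission observed directly
                else:
                    e_k = theta[k] + (x - theta[1] - theta[2]) / 2.0  # conditional expectation
                sums[k] += e_k
                counts[k] += 1
        theta = {k: sums[k] / counts[k] for k in theta}
    return theta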

4.4 Asymptotic behaviour of the estimation equation

We analyse the criterion function in Eq. 13. The N observations used to estimate $\Psi^M_N(\theta^M_0)$
originate from a mixture distribution specified by the label sets. Using the i.i.d. assumption
and defining $\mathcal{D}_{\mathcal{L}} := \{(X', \mathcal{L}') \in \mathcal{D} \mid \mathcal{L}' = \mathcal{L}\}$, we derive

$$\Psi^M_N(\theta) = \frac{1}{N} \sum_{\mathcal{L} \in \mathbb{L}} \sum_{D \in \mathcal{D}_{\mathcal{L}}} \psi^M_\theta(D) = \frac{1}{N} \sum_{\mathcal{L} \in \mathbb{L}} |\mathcal{D}_{\mathcal{L}}| \, \frac{1}{|\mathcal{D}_{\mathcal{L}}|} \sum_{D \in \mathcal{D}_{\mathcal{L}}} \psi^M_\theta(D) \qquad (14)$$

Denote by $P_{\mathcal{L},\mathcal{D}}$ the empirical distribution of observations with label set $\mathcal{L}$. Then,

$$\frac{1}{N_{\mathcal{L}}} \sum_{D \in \mathcal{D}_{\mathcal{L}}} \psi^M_\theta(D) = \mathbb{E}_{X \sim P_{\mathcal{L},\mathcal{D}}}\!\left[ \psi^M_\theta((X, \mathcal{L})) \right] \quad \text{with } N_{\mathcal{L}} := |\mathcal{D}_{\mathcal{L}}|$$


is an average of independent, identically distributed random variables. By the law of large


numbers, this empirical average converges to the true average as the number of data items,
NL, goes to infinity:

$$\mathbb{E}_{X \sim P_{\mathcal{L},\mathcal{D}}}\!\left[ \psi^M_\theta((X, \mathcal{L})) \right] \;\rightarrow\; \mathbb{E}_{X \sim P_{\mathcal{L},\theta^G}}\!\left[ \psi^M_\theta((X, \mathcal{L})) \right]. \qquad (15)$$

Furthermore, define $\hat{\pi}_{\mathcal{L}} := N_{\mathcal{L}}/N$. Again by the law of large numbers, we get $\hat{\pi}_{\mathcal{L}} \rightarrow \pi_{\mathcal{L}}$.
Inserting (15) into (14), we derive

$$\Psi^M_N(\theta) = \sum_{\mathcal{L} \in \mathbb{L}} \hat{\pi}_{\mathcal{L}} \, \mathbb{E}_{X \sim P_{\mathcal{L},\mathcal{D}}}\!\left[ \psi^M_\theta((X, \mathcal{L})) \right] \;\rightarrow\; \sum_{\mathcal{L} \in \mathbb{L}} \pi_{\mathcal{L}} \, \mathbb{E}_{X \sim P_{\mathcal{L},\theta^G}}\!\left[ \psi^M_\theta((X, \mathcal{L})) \right] \qquad (16)$$

Inserting the value of the score function (Eq. 12) into Eq. 16 yields

$$\Psi^M_N(\theta) \;\rightarrow\; \mathbb{E}_{D \sim P_{\theta^G}}\!\left[ \mathbb{E}_{\Lambda \sim P^M_{D,\theta}}[\phi(\Lambda)] - \mathbb{E}_{\Lambda \sim P_\theta}[\phi(\Lambda)] \right] \qquad (17)$$

This expression shows that the maximum likelihood estimator is a moment estimator also
for inference based on multi-label data. However, the source emissions cannot be observed
directly, and the expected value of their sufficient statistics substitutes for this missing informa-
tion. The average is taken with respect to the distribution of the source emissions assumed
by the inference method M.

4.5 Conditions for consistent estimators

Estimators are characterized by properties like consistency and efficiency. The following
theorem specifies conditions under which the estimator $\hat\theta^M_N$ is consistent.

Theorem 3 (Consistency of estimators) Assume the inference method M uses the true con-
ditional distribution of the source emissions Λ given data items, i.e. for all data items
D = (X, L), P^M(Λ|(X, L), θ) = P^G(Λ|(X, L), θ), and that P^M(X|L, θ) is concave.
Then the estimator θ̂ determined as a zero of $\Psi^M_N(\theta)$ (Eq. 17) is consistent.

Proof The true parameter of the generative process, denoted by θ^G, is a zero of Ψ^G(θ),
the criterion function derived from the true generative model. According to Theorem 1,
$\sup_{\theta \in \Theta} \|\Psi^M_N(\theta) - \Psi^G(\theta)\| \xrightarrow{P} 0$ is a necessary condition for consistency of $\hat\theta^M_N$. Inserting
the criterion function $\Psi^M_N(\theta)$ (Eq. 17) yields the condition

$$\mathbb{E}_{D \sim P_{\theta^G}}\!\left[ \mathbb{E}_{\Lambda \sim P^M_{D,\theta}}[\phi(\Lambda)] \right] - \mathbb{E}_{D \sim P_{\theta^G}}\!\left[ \mathbb{E}_{\Lambda \sim P^G_{D,\theta}}[\phi(\Lambda)] \right] = 0. \qquad (18)$$

Splitting the generative process for the data items $D \sim P_{\theta^G}$ into a separate generation of the
label set L and an observation X, $\mathcal{L} \sim P_{\pi^G}$, $X \sim P_{\mathcal{L},\theta^G}$, Eq. 18 is fulfilled if

$$\sum_{\mathcal{L} \in \mathbb{L}} \pi_{\mathcal{L}} \, \mathbb{E}_{X \sim P_{\mathcal{L},\theta^G}}\!\left[ \mathbb{E}_{\Lambda \sim P^M_{(X,\mathcal{L}),\theta}}[\phi(\Lambda)] - \mathbb{E}_{\Lambda \sim P^G_{(X,\mathcal{L}),\theta}}[\phi(\Lambda)] \right] = 0. \qquad (19)$$

Using the assumption that $P^M_{(X,\mathcal{L}),\theta} = P^G_{(X,\mathcal{L}),\theta}$ for all data items D = (X, L), this condition
is trivially fulfilled. □
is trivially fulfilled. 


Differences between $P^M_{D^\delta,\theta}$ and $P^G_{D^\delta,\theta}$ for some data items $D^\delta = (X^\delta, \mathcal{L}^\delta)$, on the other
hand, have no effect on the consistency of the result if either the probability of D^δ is zero,
or if the expected value of the sufficient statistics is identical for the two different parameter
vectors. The first situation implies that either the label set L^δ never occurs in any data item,


or the observation X δ never occurs with label set Lδ . The second situation implies that
the parameters are unidentifiable. Hence, we formulate the stronger conjecture that if an
inference procedure yields inconsistent estimators on data with a particular label set, its
overall parameter estimators are inconsistent. This implies, in particular, that inconsistencies
on two (or more) label sets cannot compensate each other to yield an estimator which is
consistent on the entire data set.
As we show in Sect. 5, ignoring all multi-label data yields consistent estimators. However,
discarding a possibly large part of the data is not efficient, which motivates the quest for more
advanced inference techniques to retrieve information about the source parameters from multi-
label data.

4.6 Efficiency of parameter estimation

Given that an estimator θ̂ is consistent, the next question of interest concerns the rate at
which the deviation from the true parameter value converges to zero. This rate is given by
the asymptotic variance of the estimator (Eq. 6). We will compute the asymptotic variance
specifically for maximum likelihood estimators in order to compare different inference tech-
niques which yield consistent estimators in terms of how efficiently they use the provided
data set for inference.

Fisher information The Fisher information is introduced to measure the information content
of a data item for the parameters of the source that is assumed to have generated the data. In
multi-label classification, the definition of the Fisher information (Eq. 8) has to be extended,
as the source emissions are only indirectly observed:

Definition 2 (Fisher information of multi-label data) The Fisher information $\mathcal{I}_{\mathcal{L}}$ measures
the amount of information a data item D = (X, L) with label set L contains about the
parameter vector θ:

$$\mathcal{I}_{\mathcal{L}} := \mathbb{V}_{\Lambda \sim P_\theta}[\phi(\Lambda)] - \mathbb{E}_{X \sim P_{\mathcal{L},\theta}}\!\left[ \mathbb{V}_{\Lambda \sim P^M_{D,\theta}}[\phi(\Lambda)] \right] \qquad (20)$$

The term $\mathbb{V}_{\Lambda \sim P^M_{D,\theta}}[\phi(\Lambda)]$ measures the uncertainty about the emission vector Λ, given a data
item D. This term vanishes if and only if the data item D completely determines the source
emission(s) of all involved sources. In the other extreme case where the data item D does
not reveal any information about the source emissions, this is equal to $\mathbb{V}_{\Lambda \sim P_\theta}[\phi(\Lambda)]$, and the
Fisher information vanishes.

Asymptotic variance We now determine the asymptotic variance of an estimator.

Theorem 4 (Asymptotic variance) Denote by $P^M_{D,\theta}(\Lambda)$ the distribution of the emission vec-
tor Λ given the data item D and the parameters θ, under the assumptions made by the
inference method M. Furthermore, let $\mathcal{I}_{\mathcal{L}}$ denote the Fisher information of data with label
set L. Then, the asymptotic variance of the maximum likelihood estimator θ̂ is given by

$$\Sigma = \left( \mathbb{E}_{\mathcal{L}}[\mathcal{I}_{\mathcal{L}}] \right)^{-1} \cdot \mathbb{V}_D\!\left[ \mathbb{E}_{\Lambda \sim P^M_{D,\theta}}[\phi(\Lambda)] \right] \cdot \left( \mathbb{E}_{\mathcal{L}}[\mathcal{I}_{\mathcal{L}}] \right)^{-T}, \qquad (21)$$

where all expectations and variances are computed with respect to the true distribution.

Proof We derive the asymptotic variance based on Theorem 2 on asymptotic normality of


Z -estimators. The first and last factor in Eq. 6 are the derivative of the criterion function


$\psi^M_\theta(D)$ (Eq. 11):

$$\nabla_\theta \psi^M_\theta(D) = \nabla^2_\theta \, \ell^M(\theta; D) = \frac{\nabla^2_\theta P^M_\theta(D)}{P^M_\theta(D)} - \left( \frac{\nabla_\theta P^M_\theta(D)}{P^M_\theta(D)} \right)^{\otimes}$$

where $v^{\otimes}$ denotes the outer product of vector v. The particular properties of the exponential
family distributions imply

$$\frac{\nabla^2_\theta P^M_\theta(D)}{P^M_\theta(D)} = \left( \mathbb{E}_{\Lambda \sim P^M_{D,\theta}}[\phi(\Lambda)] - \mathbb{E}_{\Lambda \sim P_\theta}[\phi(\Lambda)] \right)^{\otimes} + \mathbb{V}_{\Lambda \sim P^M_{D,\theta}}[\phi(\Lambda)] - \mathbb{V}_{\Lambda \sim P_\theta}[\phi(\Lambda)].$$

With $\nabla_\theta P^M_\theta(D) / P^M_\theta(D) = \psi^M_\theta(D)$ and using Eq. 12, we get

$$\nabla_\theta \psi^M_\theta(D) = \mathbb{V}_{\Lambda \sim P^M_{D,\theta}}[\phi(\Lambda)] - \mathbb{V}_{\Lambda \sim P_\theta}[\phi(\Lambda)].$$

The expected Fisher information matrix over all label sets results from computing the expec-
tation over the data items D:

$$\mathbb{E}_{D \sim P_{\theta^G}}\!\left[ \nabla_\theta \psi_\theta(D) \right] = \mathbb{E}_{D \sim P_{\theta^G}}\!\left[ \mathbb{V}_{\Lambda \sim P^M_{D,\theta}}[\phi(\Lambda)] - \mathbb{V}_{\Lambda \sim P_\theta}[\phi(\Lambda)] \right] = \mathbb{E}_{\mathcal{L}}[\mathcal{I}_{\mathcal{L}}].$$

For the middle term of Eq. 6, we have

$$\mathbb{E}_{D \sim P_{\theta^G}}\!\left[ (\psi_\theta(D))^{\otimes} \right] = \mathbb{V}_{D \sim P_{\theta^G}}\!\left[ \mathbb{E}_{\Lambda \sim P^M_{D,\hat\theta}}[\phi(\Lambda)] \right] + \left( \mathbb{E}_{D \sim P_{\theta^G}}\!\left[ \mathbb{E}_{\Lambda \sim P^M_{D,\hat\theta}}[\phi(\Lambda)] - \mathbb{E}_{\Lambda \sim P_{\hat\theta}}[\phi(\Lambda)] \right] \right)^{\otimes}$$

The condition for θ̂ given in Eq. 17 implies

$$\mathbb{E}_{D \sim P_{\theta^G}}\!\left[ (\psi_\theta(D))^{\otimes} \right] = \mathbb{V}_{D \sim P_{\theta^G}}\!\left[ \mathbb{E}_{\Lambda \sim P^M_{D,\theta}}[\phi(\Lambda)] \right] \qquad (22)$$

Using Eq. 6, we derive the expression for the asymptotic variance of the estimator θ stated
in the theorem. 


According to this result, the asymptotic variance of the estimator is determined by two factors.
We analyse them in the following two subsections and afterwards derive some well-known
results for special cases.

(A) Bias-variance decomposition We define the expectation-deviance for label set L as


the difference between the expected value of the sufficient statistics under the distribution
assumed by method M, given observations with label set L, and the expected value of the
sufficient statistic given all data items:

$$\mathcal{E}^M_{\mathcal{L}} := \mathbb{E}_{X \sim P_{\mathcal{L},\theta^G}}\!\left[ \mathbb{E}_{\Lambda \sim P^M_{(X,\mathcal{L}),\hat\theta}}[\phi(\Lambda)] \right] - \mathbb{E}_{D' \sim P_{\theta^G}}\!\left[ \mathbb{E}_{\Lambda \sim P^M_{D',\hat\theta}}[\phi(\Lambda)] \right] \qquad (23)$$

The middle factor (Eq. 22) of the estimator variance is the variance in the expectation values
of the sufficient statistics of Λ. Using $\mathbb{E}_X[X^2] = \mathbb{E}_X[X]^2 + \mathbb{V}_X[X]$ and splitting D = (X, L)
into the observation X and the label set L, it can be decomposed as

$$\mathbb{V}_{D \sim P_{\theta^G}}\!\left[ \mathbb{E}_{\Lambda \sim P^M_{D,\hat\theta}}[\phi(\Lambda)] \right] = \mathbb{E}_{\mathcal{L}}\!\left[ \left( \mathcal{E}^M_{\mathcal{L}} \right)^{\otimes} \right] + \mathbb{E}_{\mathcal{L}}\!\left[ \mathbb{V}_{X \sim P_{\mathcal{L},\theta^G}}\!\left[ \mathbb{E}_{\Lambda \sim P^M_{(X,\mathcal{L}),\hat\theta}}[\phi(\Lambda)] \right] \right]. \qquad (24)$$

Two independent effects thus cause a high variance of the estimator:


(1) The expected value of the sufficient statistics of the source emissions based on observa-
tions with a particular label L deviates from the true parameter value. Note that this effect
can be present even if the estimator is consistent: these deviations of sufficient statistics
conditioned on a particular label set might cancel out each other when averaging over all
label sets and thus yield a consistent estimator. However, an estimator obtained by such
a procedure has a higher variance than an estimator which is obtained by a procedure
which yields consistent estimators also conditioned on every label set.
(2) The expected value of the sufficient statistics of the source emissions given the observation
X varies with X . This contribution is typically large for one-against-all methods (Rifkin
and Klautau 2004).

Note that for inference methods which fulfil the conditions of Theorem 3, we have $\mathcal{E}^M_{\mathcal{L}} = 0$.
Methods which yield consistent estimators on any label set are thus not only provably
consistent, but also yield parameters with less variation.

(B) Special cases The above result reduces to well-known formulas for some special cases of
single-label assignments.
Variance of estimators on single-label data If estimation is based on single-label data, i.e.
D = (X, L) and L = {λ}, the source emissions are fully determined by the available data,
as the observations are considered to be direct emissions of the respective source.

$$P^M_{D,\theta}(\Lambda) = \prod_{k=1}^{K} P^M_{D,\theta_k}(\Lambda_k), \quad \text{with} \quad P^M_{D,\theta_k}(\Lambda_k) = \begin{cases} 1_{\{\Lambda_k = X\}} & \text{if } k = \lambda \\ P(\Lambda_k \mid \theta_k) & \text{otherwise} \end{cases}$$

The estimation procedure is thus independent for every source k. Furthermore, we have
$\mathbb{E}_{\Lambda_k \sim P^M_{D,\theta_k}}[\phi(\Lambda_k)] = \phi(X)$ and $\mathbb{V}_{\Lambda_k \sim P^M_{D,\theta_k}}[\phi(\Lambda_k)] = 0$. Hence, Σ is a diagonal matrix, with
diagonal elements

$$\Sigma_{kk} = \mathcal{I}_{\{k\}}^{-1} \left( \mathbb{V}_{D \sim P_{\theta_k^G}}[\phi(X)] + \left( \mathbb{E}_{X \sim P_{\theta_k^G}}[\phi(X)] - \mathbb{E}_{\Lambda_k \sim P_{\theta_k}}[\phi(\Lambda_k)] \right)^{\otimes} \right) \mathcal{I}_{\{k\}}^{-1}$$

Variance of consistent estimators Consistent estimators are characterized by the expression
$\mathbb{E}_{D \sim P_{\theta^G}}\!\left[ \mathbb{E}_{\Lambda \sim P^M_{D,\theta}}[\phi(\Lambda)] \right] = \mathbb{E}_{\Lambda \sim P_\theta}[\phi(\Lambda)]$ and thus

$$\Sigma = \left( \mathbb{E}_{\mathcal{L}}[\mathcal{I}_{\mathcal{L}}] \right)^{-1} \cdot \mathbb{V}_{D \sim P_{\theta^G}}\!\left[ \mathbb{E}_{\Lambda \sim P^M_{D,\theta}}[\phi(\Lambda)] \right] \cdot \left( \mathbb{E}_{\mathcal{L}}[\mathcal{I}_{\mathcal{L}}] \right)^{-1}.$$

Variance of consistent estimators on single-label data Combining the two aforementioned
conditions, we derive

$$\Sigma_{\lambda\lambda} = \mathbb{V}_{\Lambda \sim P_\theta}[\phi(\Lambda)]^{-1} = \left( \mathcal{I}_{\mathcal{D}_\lambda}(\theta_\lambda) \right)^{-1}, \qquad (25)$$

which corresponds to the well-known result for single-label data (Eq. 8).

5 Asymptotic analysis of multi-label inference methods

In this section, we formally describe several techniques for inference based on multi-label
data and apply the results obtained in Sect. 4 to study the asymptotic behaviour of estimators
obtained with these methods.


5.1 Ignore training (Mignore)

The ignore training is probably the simplest, but also the most limited way of treating multi-
label data: data items which belong to more than one class are simply ignored (Boutell et al.
2004), i.e. the estimation of source parameters is uniquely based on single-label data. The
overall probability of an emission vector Λ given the data item D thus factorizes:

$$P^{ignore}_{D,\theta}(\Lambda) = \prod_{k=1}^{K} P^{ignore}_{D,\theta,k}(\Lambda_k) \qquad (26)$$
Each of the factors $P^{ignore}_{D,\theta,k}(\Lambda_k)$, representing the probability distribution of source k, only
depends on the parameter θk, i.e. we have $P^{ignore}_{D,\theta,k}(\Lambda_k) = P^{ignore}_{D,\theta_k}(\Lambda_k)$ for all k = 1, . . . , K.
A data item D = (X, L) does exclusively provide information about source k if L = {k}. In
the case L ≠ {k}, the probability distribution of emissions Λk, $P^{ignore}_{D,\hat\theta_k}(\Lambda_k)$, is invariant to
data item D.

$$P^{ignore}_{D,\hat\theta_k}(\Lambda_k) = \begin{cases} 1_{\{\Lambda_k = X\}} & \text{if } \mathcal{L} = \{k\} \\ P^{ignore}_{\hat\theta_k}(\Lambda_k) & \text{otherwise} \end{cases} \qquad (27)$$

Observing a multi-label data item does not change the assumed probability distribution of
any of the classes, as these data items are discarded by Mignore. From Eqs. 26 and 27, we
obtain the following criterion function given a data item D:

$$\psi^{ignore}_\theta(D) = \sum_{k=1}^{K} \psi^{ignore}_{\theta_k}(D), \qquad \psi^{ignore}_{\theta_k}(D) = \begin{cases} \phi(X) - \mathbb{E}_{\Lambda_k \sim P_{\hat\theta_k}}[\phi(\Lambda_k)] & \text{if } \mathcal{L} = \{k\} \\ 0 & \text{otherwise} \end{cases} \qquad (28)$$
The estimator θ̂^ignore is consistent and normally distributed:

Lemma 1 The estimator $\hat\theta^{ignore}_N$ determined as a zero of $\Psi^{ignore}_N(\theta)$ as defined in Eqs. 13
and 28 is distributed according to $\sqrt{N} \cdot (\hat\theta^{ignore}_N - \theta^G) \rightarrow \mathcal{N}(0, \Sigma^{ignore})$. The covariance
matrix $\Sigma^{ignore}$ is given by $\Sigma^{ignore} = \mathrm{diag}(\Sigma^{ignore}_{11}, \ldots, \Sigma^{ignore}_{KK})$, with
$\Sigma^{ignore}_{kk} = \mathbb{V}_{X \sim P_{\theta_k}}\!\left[ \psi^{ignore}_{\theta_k}((X, \{k\})) \right]^{-1}$.

This statement follows directly from Theorem 2 about the asymptotic distribution of
estimators based on single-label data. A formal proof is given in Sect. 1 in the appendix.
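A minimal sketch of ignore training under the unit-variance Gaussian assumptions used earlier: every multi-label item is dropped and each source mean is estimated from its single-label observations only; the data format matches the earlier generative sketch.

import numpy as np

def ignore_training(data):
    # M_ignore: discard every multi-label data item (Eq. 28) and estimate each
    # source from its single-label observations (unit-variance Gaussian assumption).
    groups = {}
    for x, L in data:
        if len(L) == 1:
            k = next(iter(L))
            groups.setdefault(k, []).append(x)
    return {k: float(np.mean(xs)) for k, xs in groups.items()}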

5.2 New source training (Mnew )

New source training defines new meta-classes for each label set such that every data item belongs to a single class (in terms of these meta-labels) (Boutell et al. 2004). Doing so, the number of parameters to be inferred increases substantially compared to the generative process. We define the number of possible label sets as $L := |\mathcal{L}|$ and assume an arbitrary, but fixed, ordering of the possible label sets. Let L[l] be the l-th label set in this ordering. Then, we have $P^{new}_{D,\theta}(\epsilon) = \prod_{l=1}^{L} P^{new}_{D,\theta,l}(\epsilon_l)$. As for M_ignore, each of the factors represents the probability distribution of one of the sources given the data item D. Hence

$$P^{new}_{D,\theta,l}(\epsilon_l) = P^{new}_{D,\theta_l}(\epsilon_l) = \begin{cases} \mathbb{1}_{\{\epsilon_l = X\}} & \text{if } L = L[l] \\ P^{new}_{L,\theta_l}(\epsilon_l) & \text{otherwise} \end{cases} \qquad (29)$$


For the criterion function on a data item D = (X, L), we thus have

$$\psi^{new}_{\theta}(D) = \sum_{l=1}^{L} \psi^{new}_{\theta_l}(D), \qquad \psi^{new}_{\theta_l}(D) = \begin{cases} \phi(X) - E_{\epsilon_l\sim P_{\theta_l}}[\phi(\epsilon_l)] & \text{if } L = L[l] \\ 0 & \text{otherwise} \end{cases}$$

The estimator $\hat\theta^{new}_N$ is consistent and normally distributed:

Lemma 2 The estimator $\hat\theta^{new}_N$ obtained as a zero of the criterion function $\Psi^{new}_N(\theta)$ is asymptotically distributed as $\sqrt{N}\,(\hat\theta^{new}_N - \theta^G) \to \mathcal{N}(0, \Sigma^{new})$. The covariance matrix $\Sigma^{new}$ is block-diagonal: $\Sigma^{new} = \mathrm{diag}(\Sigma^{new}_{11}, \ldots, \Sigma^{new}_{LL})$, with the diagonal elements given by $\Sigma^{new}_{ll} = V_{X\sim P^{new}_{L[l],\theta_l}}\big[\psi_{\theta^G_l}(X)\big]^{-1}$.

Again, this corresponds to the result obtained for consistent single-label inference tech-
niques in Eq. 25. The main drawback of this method is that there are typically not enough
training data available to reliably estimate a parameter set for each label set. Furthermore, it
is not possible to assign a new data item to a label set which is not seen in the training data.
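A corresponding sketch for M_new (again ours, with hypothetical names): every distinct label set is treated as its own meta-source, so items with label set {1, 2} are fitted by a separate proxy distribution.

```python
# Illustrative sketch of M_new / label power set training: every distinct label
# set gets its own Gaussian "meta-source".
import numpy as np

def new_source_training(X, labels):
    """Return {frozenset label set: (mean, variance)}, one proxy source per label set."""
    params = {}
    for L in set(map(frozenset, labels)):
        X_L = np.array([x for x, l in zip(X, labels) if frozenset(l) == L])
        params[L] = (X_L.mean(), X_L.var())
    return params
```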

5.3 Cross-training (M_cross)

Cross-training (Boutell et al. 2004) takes each sample X which belongs to class k as an emission of class k, independently of the other labels of the data item. The probability of $\epsilon$ thus factorizes into a product over the probabilities of the different source emissions:

$$P^{cross}_{D,\theta}(\epsilon) = \prod_{k=1}^{K} P^{cross}_{D,\theta,k}(\epsilon_k) \qquad (30)$$

As all sources are assumed to be independent of each other, we have for all k

$$P^{cross}_{D,\theta,k}(\epsilon_k) = P^{cross}_{D,\theta_k}(\epsilon_k), \qquad P^{cross}_{D,\theta_k}(\epsilon_k) = \begin{cases} \mathbb{1}_{\{\epsilon_k = X\}} & \text{if } k \in L \\ P_{\theta_k}(\epsilon_k) & \text{otherwise} \end{cases} \qquad (31)$$

Again, $P^{cross}_{D,\theta_k}(\epsilon_k) = P_{\theta_k}(\epsilon_k)$ in the case $k \notin L$ means that X does not provide any information about the assumed $P_{\theta_k}$, i.e. the estimated distribution is unchanged. For the criterion function, we have

$$\psi^{cross}_{\theta}(D) = \sum_{k=1}^{K} \psi^{cross}_{\theta_k}(D), \qquad \psi^{cross}_{\theta_k}(D) = \begin{cases} \phi(X) - E_{\epsilon_k\sim P_{\theta_k}}[\phi(\epsilon_k)] & \text{if } k \in L \\ 0 & \text{otherwise} \end{cases} \qquad (32)$$

The parameters obtained by M_cross are not consistent:

Lemma 3 The estimator $\hat\theta^{cross}$ obtained as a zero of the criterion function $\Psi^{cross}_N(\theta)$ is inconsistent if the training data set contains at least one multi-label data item.

The inconsistency is due to the fact that multi-label data items are used to estimate the parameters of all sources in their label set without considering the influence of the other sources. The bias of the estimator grows as the fraction of multi-label data used for the estimation increases. A formal proof is given in the appendix (Sect. 1).
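For comparison, a sketch of M_cross (ours, hypothetical names): every data item is used, unmodified, for every source in its label set, which is exactly what produces the bias described in Lemma 3.

```python
# Illustrative sketch of M_cross: every item is treated as a direct emission of
# every source in its label set; multi-label items therefore bias the estimates.
import numpy as np

def cross_training(X, labels, n_sources=2):
    params = {}
    for k in range(1, n_sources + 1):
        X_k = np.array([x for x, L in zip(X, labels) if k in L])
        params[k] = (X_k.mean(), X_k.var())
    return params
```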


5.4 Deconvolutive training (M_deconv)

The deconvolutive training method estimates the distribution of the source emissions given a data item. Modelling the generative process, the distribution of an observation X given the emission vector $\epsilon$ and the label set L is

$$P^{deconv}(X \mid \epsilon, L) = \mathbb{1}_{\{X = c(\epsilon, L)\}}$$

Integrating out the source emissions, we obtain the probability of an observation X as $P^{deconv}(X \mid L, \theta) = \int P(X \mid \epsilon, L)\, dP(\epsilon \mid \theta)$. Using Bayes' theorem and the above notation, we have:

$$P^{deconv}(\epsilon \mid D, \theta) = \frac{P^{deconv}(X \mid \epsilon, L) \cdot P^{deconv}(\epsilon \mid \theta)}{P^{deconv}(X \mid L, \theta)} \qquad (33)$$
If the true combination function is provided to the method, or the method can correctly estimate this function, then $P^{deconv}(\epsilon \mid D, \theta)$ corresponds to the true conditional distribution. The target function is defined by

$$\psi^{deconv}_{\theta}(D) = E_{\epsilon\sim P^{deconv}_{D,\hat\theta}}[\phi(\epsilon)] - E_{\epsilon\sim P_{\theta}}[\phi(\epsilon)] \qquad (34)$$

Unlike in the methods presented before, the combination function c(·, ·) in M_deconv influences the assumed distribution of the emissions $\epsilon$, $P^{deconv}_{D,\hat\theta}(\epsilon)$. For this reason, it is not possible to describe the distribution of the estimators obtained by this method in general.
However, given the identifiability conditions discussed in Sect. 3.2, the parameter estimators
converge to their true values.
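For the additive two-Gaussian setting analysed in Sect. 6, the moment condition of M_deconv can be written down explicitly (see the appendix). The following sketch (ours, hypothetical names; the blending weight lam corresponds to the $\lambda$ of Eq. 55 in the appendix) rearranges the empirical moment condition into a 2x2 linear system and solves it for the two source means.

```python
# Illustrative sketch of deconvolutive moment matching for the additive
# two-Gaussian setting: items with label set {1, 2} are modelled as eps1 + eps2.
import numpy as np

def deconv_means(X, labels, lam=0.5):
    X = np.asarray(X, dtype=float)
    mask = lambda s: np.array([set(l) == s for l in labels])
    frac = lambda s: mask(s).mean()          # empirical label-set fraction
    xbar = lambda s: X[mask(s)].mean()       # average observation per label set
    p1, p2, p12 = frac({1}), frac({2}), frac({1, 2})
    A = np.array([[p1 + p12 * (1 - lam), p12 * (1 - lam)],
                  [p12 * lam,            p2 + p12 * lam]])
    b = np.array([p1 * xbar({1}) + p12 * (1 - lam) * xbar({1, 2}),
                  p2 * xbar({2}) + p12 * lam * xbar({1, 2})])
    return np.linalg.solve(A, b)   # (mu1_hat, mu2_hat)
```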

6 Addition of Gaussian-distributed emissions

Multi-label Gaussian sources allow us to study the influence of addition as a link function. We consider the case of two univariate Gaussian distributions with sample space $\mathbb{R}$. The probability density function is $p(\xi) = \frac{1}{\sigma\sqrt{2\pi}}\exp\!\big(-\frac{(\xi-\mu)^2}{2\sigma^2}\big)$. Mean and standard deviation of the k-th source are denoted by $\mu_k$ and $\sigma_k$, respectively, for k = 1, 2.

6.1 Theoretical investigation

Rearranging terms in order to write the Gaussian distribution as a member of the exponential family (Eq. 3), we derive

$$\theta_k = \left(\frac{\mu_k}{\sigma_k^2},\; -\frac{1}{2\sigma_k^2}\right)^T \qquad \phi(x) = \left(x,\; x^2\right)^T \qquad A(\theta_k) = -\frac{\theta_{k,1}^2}{4\theta_{k,2}} - \frac{1}{2}\ln\!\left(-2\theta_{k,2}\right)$$

The natural parameters $\theta$ are not the most common parameterization of the Gaussian distribution. However, the usual parameters $(\mu_k, \sigma_k^2)$ can be easily computed from the parameters $\theta_k$:

$$-\frac{1}{2\sigma_k^2} = \theta_{k,2} \iff \sigma_k^2 = -\frac{1}{2\theta_{k,2}} \qquad \frac{\mu_k}{\sigma_k^2} = \theta_{k,1} \iff \mu_k = \sigma_k^2\cdot\theta_{k,1}. \qquad (35)$$

The parameter space is $\Theta = \{(\theta_1, \theta_2) \in \mathbb{R}^2 \mid \theta_2 < 0\}$. In the following, we assume $\mu_1 = -a$ and $\mu_2 = a$. The parameters of the first and second source are thus $\theta_1 = (-\frac{a}{\sigma_1^2}, -\frac{1}{2\sigma_1^2})^T$ and $\theta_2 = (\frac{a}{\sigma_2^2}, -\frac{1}{2\sigma_2^2})^T$. As combination function, we choose the addition: $c(\epsilon_1, \epsilon_2) = \epsilon_1 + \epsilon_2$.
We allow both single labels and the label set {1, 2}, i.e. $\mathcal{L} = \{\{1\}, \{2\}, \{1, 2\}\}$. The expected values of the observation X conditioned on the label set are

$$E_{X\sim P_1}[X] = -a \qquad E_{X\sim P_2}[X] = a \qquad E_{X\sim P_{1,2}}[X] = 0. \qquad (36)$$

Since the convolution of two Gaussian distributions is again a Gaussian distribution, data with the multi-label set {1, 2} is also distributed according to a Gaussian. We denote the parameters of this proxy-distribution by $\theta_{12} = \big(0,\; -\frac{1}{2(\sigma_1^2+\sigma_2^2)}\big)^T$.

Lemma 4 Assume a generative setting as described above. Denote the total number of data items by N and the fraction of data items with label set L by $\pi_L$. Furthermore, we define $w_{12} := \pi_2\sigma_1^2 + \pi_1\sigma_2^2$, $s_{12} := \sigma_1^2 + \sigma_2^2$, and $m_1 := \pi_2\sigma_1^2\sigma_{12}^2 + 2\pi_1\sigma_2^2 s_{12}$, $m_2 := \pi_1\sigma_2^2\sigma_{12}^2 + 2\pi_2\sigma_1^2 s_{12}$. The MSE in the estimator of the mean, averaged over all sources, for the inference methods M_ignore, M_new, M_cross and M_deconv, is as follows:

$$MSE(\hat\mu^{ignore}, \mu) = \frac{1}{2}\left(\frac{\sigma_1^2}{\pi_1 N} + \frac{\sigma_2^2}{\pi_2 N}\right) \qquad (37)$$

$$MSE(\hat\mu^{new}, \mu) = \frac{1}{3}\left(\frac{\sigma_1^2}{\pi_1 N} + \frac{\sigma_2^2}{\pi_2 N} + \frac{\sigma_1^2+\sigma_2^2}{\pi_{12} N}\right) \qquad (38)$$

$$MSE(\hat\mu^{cross}, \mu) = \frac{1}{2}\,\pi_{12}^2\left(\frac{1}{(\pi_1+\pi_{12})^2} + \frac{1}{(\pi_2+\pi_{12})^2}\right)a^2 + \frac{1}{2}\,\pi_{12}\left(\frac{\pi_1}{(\pi_1+\pi_{12})^3 N} + \frac{\pi_2}{(\pi_2+\pi_{12})^3 N}\right)a^2 + \frac{1}{2}\left(\frac{\pi_1\sigma_1^2 + \pi_{12}\sigma_{12}^2}{(\pi_1+\pi_{12})^2 N} + \frac{\pi_2\sigma_2^2 + \pi_{12}\sigma_{12}^2}{(\pi_2+\pi_{12})^2 N}\right) \qquad (39)$$

$$MSE(\hat\mu^{deconv}, \mu) = \frac{1}{2}\left(\frac{\pi_{12}^2\sigma_2^2 w_{12} + \pi_{12}\pi_2 m_1 + \pi_1\pi_2^2 s_{12}^2}{(\pi_1\pi_2 s_{12} + \pi_{12} w_{12})^2 N}\,\sigma_1^2 + \frac{\pi_{12}^2\sigma_1^2 w_{12} + \pi_{12}\pi_1 m_2 + \pi_1^2\pi_2 s_{12}^2}{(\pi_1\pi_2 s_{12} + \pi_{12} w_{12})^2 N}\,\sigma_2^2\right) \qquad (40)$$

The proof mainly consists of lengthy calculations and is given in Sect. 1. We rely on the computer-algebra system Maple for parts of the calculations.
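The following Monte Carlo sketch (ours, not part of the paper's Maple computations; all constants are illustrative) shows how such predictions can be checked empirically, here for Eq. 37 with a = 3.5, unit variances and equal label-set fractions.

```python
# Illustrative Monte Carlo check of Eq. 37: additive Gaussian setting with
# a = 3.5, unit variances and equal label-set fractions.
import numpy as np

rng = np.random.default_rng(1)
a, N, runs = 3.5, 300, 2000
pi1 = pi2 = pi12 = 1 / 3
sq_err = []
for _ in range(runs):
    sets = rng.choice(3, size=N, p=[pi1, pi2, pi12])     # 0 -> {1}, 1 -> {2}, 2 -> {1,2}
    e1, e2 = rng.normal(-a, 1, N), rng.normal(a, 1, N)
    X = np.where(sets == 0, e1, np.where(sets == 1, e2, e1 + e2))
    mu1_hat, mu2_hat = X[sets == 0].mean(), X[sets == 1].mean()   # M_ignore estimates
    sq_err.append(((mu1_hat + a) ** 2 + (mu2_hat - a) ** 2) / 2)
pred = 0.5 * (1 / (pi1 * N) + 1 / (pi2 * N))
print(np.mean(sq_err), pred)   # both approx. 0.01
```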

6.2 Experimental results

To verify the theoretical result, we apply the presented inference techniques to synthetic data,
generated with a = 3.5 and unit variance: σ1 = σ2 = 1. The Bayes error, i.e. the error of the
optimal generative classifier, in this setting is 9.59 %. We use training data sets of different
size and test sets of the same size as the maximal size of the training data sets. All experiments
are repeated with 100 randomly sampled training and test data sets.
In Fig. 2, the average deviation of the estimated source centroids from the true centroids is plotted for different inference techniques and a varying number of training data, and compared with the values predicted by the asymptotic analysis. The theoretical predictions agree with the deviations measured in the experiments. Small differences are observed for small training set sizes, as in this setting both the law of large numbers and the central limit theorem, on which we rely in our analysis, are not fully applicable. As the number of data items increases, these deviations vanish.

Fig. 2 Deviation of parameter values from true values: the box plots indicate the values obtained in an experiment with 100 runs, the red line gives the RMS predicted by the asymptotic analysis. Note the difference in scale in Fig. 2c. Panels: (a) estimator accuracy for M_ignore, (b) for M_new, (c) for M_cross, (d) for M_deconv; x-axes show the training set size N, y-axes the RMS of the mean estimators.
M_cross has a clear bias, i.e. a deviation from the true parameter values which does not vanish as the number of data items grows to infinity. All other inference techniques are consistent, but differ in the convergence rate: M_deconv attains the fastest convergence, followed by M_ignore. M_new has the slowest convergence of the analysed consistent inference techniques, as this method infers parameters of a separate class for the multi-label data. Due to the generative process, these data items have a higher variance, which entails a high variance of the respective estimator. Therefore, M_new has a higher average estimation error than M_ignore.
The quality of the classification results obtained by the different methods is reported in Fig. 3. The low precision value of M_deconv shows that this classification rule is more likely to assign a wrong label to a data item than the competing inference methods. Paying this price, on the other hand, M_deconv yields the highest recall values of all classification techniques analysed in this paper. On the other extreme, M_cross and M_ignore have a precision of 100 %, but a very low recall of about 75 %. Note that M_ignore only handles single-label data and is thus limited to attributing single labels. In the setting of these experiments, the single-label data items are very clearly separated. Confusions are thus very unlikely, which explains the very precise labels as well as the low recall rate. In terms of the F-score, defined as the harmonic mean of the precision and the recall, M_deconv yields the best results for all training set sizes, closely followed by M_new. M_ignore and M_cross perform inferior to M_deconv and M_new.

Fig. 3 Classification quality of different inference methods. 100 training and test data sets are generated from two sources with mean ±3.5 and standard deviation 1. Panels: (a) average precision, (b) average recall, (c) average F-score, (d) balanced error rate, each as a function of the training set size N, for M_ignore, M_new, M_cross and M_deconv.

Also for the BER, the deconvolutive model yields the best results, with M_new reaching similar results. Both M_cross and M_ignore incur significantly increased errors. In M_cross, this effect is caused by the biased estimators, while M_ignore discards all training data with label set {1, 2} and can thus "not do anything with such data".

6.3 Influence of model mismatch

Deconvolutive training requires a more elaborate model design than the other methods presented here, as the combination function has to be specified as well, which poses an additional source of potential errors compared to e.g. M_new.
To investigate the sensitivity of the classification results to model mismatch, we again generate Gaussian-distributed data from two sources with mean ±3.5 and unit variance, as in the previous section. However, the true combination function is now set to $c((\epsilon_1, \epsilon_2)^T, \{1,2\}) = \epsilon_1 + 1.5\cdot\epsilon_2$, but the model assumes a combination function as in the previous section, i.e. $\hat c((\epsilon_1, \epsilon_2)^T, \{1,2\}) = \epsilon_1 + \epsilon_2$. The probabilities of the individual label sets are $\pi_{\{1\}} = \pi_{\{2\}} = 0.4$ and $\pi_{\{1,2\}} = 0.2$. The classification results for this setting are displayed in Fig. 4. For the quality measures precision and recall, M_new and M_deconv are quite similar in this example. For the more comprehensive quality measures F-score and BER, we observe that M_deconv is advantageous for small training data sets. Hence, the deconvolutive approach is beneficial for small training data sets even when the combination function is not correctly modelled. With more training data, M_new catches up and then outperforms M_deconv. The explanation for this behavior lies in the bias-variance decomposition of the estimation error for the model parameters (Eq. 2): M_new uses more source distributions (and hence more parameters) to estimate the data distribution, but does not rely on assumptions on the combination function. M_deconv, on the contrary, is more thrifty with parameters, but relies on assumptions on the combination function. In a setting with little training data, the variance dominates the accuracy of the parameter estimators, and M_deconv will therefore yield more precise parameter estimators and superior classification results. As the number of training data increases, the variance of the estimators decreases, and the (potential) bias dominates the parameter estimation error. With a misspecified model, M_deconv yields poorer results than M_new in this setting.

Fig. 4 Classification quality of different inference methods, with a deviation between the true and the assumed combination function for the label set {1, 2}. Data is generated from two sources with mean ±3.5 and standard deviation 1. The experiment is run with 100 pairs of training and test data sets. Panels: (a) average precision, (b) average recall, (c) average F-score, (d) balanced error rate, each as a function of the training set size N.
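For completeness, a sketch (ours, hypothetical names and constants) of the mismatched data generation used in this experiment: the true combination rule is $\epsilon_1 + 1.5\cdot\epsilon_2$, while the deconvolutive model continues to assume $\epsilon_1 + \epsilon_2$.

```python
# Illustrative sketch of the mismatched generative process of Sect. 6.3.
import numpy as np

def sample_mismatched(N, a=3.5, seed=4):
    rng = np.random.default_rng(seed)
    sets = rng.choice(3, size=N, p=[0.4, 0.4, 0.2])      # {1}, {2}, {1, 2}
    e1, e2 = rng.normal(-a, 1, N), rng.normal(a, 1, N)
    X = np.where(sets == 0, e1, np.where(sets == 1, e2, e1 + 1.5 * e2))
    return X, sets
```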

7 Disjunction of Bernoulli-distributed emissions

We consider the Bernoulli distribution as an example of a discrete distribution in the exponential family with emissions in $\mathbb{B} := \{0, 1\}$. The Bernoulli distribution has one parameter β, which describes the probability of a 1.
7.1 Theoretical investigation

The Bernoulli distribution is a member of the exponential family with the following parameterization: $\theta_k = \log\frac{\beta_k}{1-\beta_k}$, $\phi(\epsilon_k) = \epsilon_k$, and $A(\theta_k) = -\log\big(1 - \frac{\exp\theta_k}{1+\exp\theta_k}\big)$. As combination function, we consider the Boolean OR, which yields a 1 if either of the two inputs is 1, and 0 otherwise. Thus, we have
$$P(X = 1 \mid L = \{1, 2\}) = \beta_1 + \beta_2 - \beta_1\beta_2 =: \beta_{12} \qquad (41)$$

Note that $\beta_{12} \geq \max\{\beta_1, \beta_2\}$: when combining the emissions of two Bernoulli distributions with a Boolean OR, the probability of a one is at least as large as the probability that one of the sources emitted a one. Equality implies either that the partner source never emits a one, i.e. $\beta_{12} = \beta_1$ if and only if $\beta_2 = 0$, or that one of the sources always emits a one, i.e. $\beta_{12} = \beta_1$ if $\beta_1 = 1$. The conditional probability distributions are as follows:

$$P(\epsilon \mid (X, \{1\}), \theta) = \mathbb{1}_{\{\epsilon^{(1)} = X\}} \cdot Ber(\epsilon^{(2)} \mid \theta^{(2)}) \qquad (42)$$
$$P(\epsilon \mid (X, \{2\}), \theta) = Ber(\epsilon^{(1)} \mid \theta^{(1)}) \cdot \mathbb{1}_{\{\epsilon^{(2)} = X\}} \qquad (43)$$
$$P(\epsilon \mid (0, \{1, 2\}), \theta) = \mathbb{1}_{\{\epsilon^{(1)} = 0\}} \cdot \mathbb{1}_{\{\epsilon^{(2)} = 0\}} \qquad (44)$$
$$P(\epsilon \mid (1, \{1, 2\}), \theta) = \frac{P(\epsilon, X = 1 \mid L = \{1, 2\}, \theta)}{P(X = 1 \mid L = \{1, 2\}, \theta)} \qquad (45)$$

In particular, the joint distribution of the emission vector $\epsilon$ and the observation X is as follows:

$$P\big(\epsilon = (\xi_1, \xi_2)^T, X = (\xi_1 \vee \xi_2) \mid L = \{1, 2\}, \theta\big) = (1-\beta_1)^{1-\xi_1}(1-\beta_2)^{1-\xi_2}\beta_1^{\xi_1}\beta_2^{\xi_2}$$

All other combinations of $\epsilon$ and X have probability 0.
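A small sketch (ours, hypothetical constants) makes the OR combination concrete: it enumerates the joint emission distribution, recovers $\beta_{12}$ of Eq. 41, and computes the conditional distribution of Eq. 45 for X = 1.

```python
# Illustrative enumeration of the Boolean-OR combination of two Bernoulli sources.
from itertools import product

beta1, beta2 = 0.4, 0.2
joint = {(x1, x2): (beta1 if x1 else 1 - beta1) * (beta2 if x2 else 1 - beta2)
         for x1, x2 in product((0, 1), repeat=2)}        # P(eps1, eps2)
beta12 = sum(p for (x1, x2), p in joint.items() if x1 | x2)
print(beta12, beta1 + beta2 - beta1 * beta2)             # 0.52 == 0.52 (Eq. 41)
cond = {e: p / beta12 for e, p in joint.items() if e[0] | e[1]}
print(cond)   # P(eps | X = 1, L = {1, 2}), Eq. 45; the pair (0, 0) is excluded
```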
Lemma 5 Consider the generative setting described above, with N data items in total. The fraction of data items with label set L is denoted by $\pi_L$. Furthermore, define $v_1 := \beta_1(1-\beta_1)$, $v_2 := \beta_2(1-\beta_2)$, $v_{12} := \beta_{12}(1-\beta_{12})$, $w_1 := \beta_1(1-\beta_2)$, $w_2 := \beta_2(1-\beta_1)$ and

$$\hat v_1 = \frac{\pi_{12}}{(\pi_1+\pi_{12})^2}\, w_2\, (1 - \pi_{12} w_2) \qquad \hat v_2 = \frac{\pi_{12}}{(\pi_2+\pi_{12})^2}\, w_1\, (1 - \pi_{12} w_1). \qquad (46)$$

The MSE in the estimator of the parameter $\hat\beta$, averaged over all sources, for the inference methods M_ignore, M_new, M_cross and M_deconv is as follows:

$$MSE(\hat\beta^{new}, \beta) = \frac{1}{3}\left(\frac{\beta_1(1-\beta_1)}{\pi_1 N} + \frac{\beta_2(1-\beta_2)}{\pi_2 N} + \frac{\beta_{12}(1-\beta_{12})}{\pi_{12} N}\right) \qquad (47)$$

$$MSE(\hat\beta^{ignore}, \beta) = \frac{1}{2}\left(\frac{\beta_1(1-\beta_1)}{\pi_1 N} + \frac{\beta_2(1-\beta_2)}{\pi_2 N}\right) \qquad (48)$$

$$MSE(\hat\beta^{cross}, \beta) = \frac{1}{2}\left(\frac{\pi_{12}}{\pi_1+\pi_{12}}\, w_2\right)^{\otimes} + \frac{1}{2}\left(\frac{\pi_{12}}{\pi_2+\pi_{12}}\, w_1\right)^{\otimes} + \frac{1}{2}\,\frac{1}{\pi_1 N}\,\frac{v_1^2}{\hat v_1^2}\left(\frac{\pi_{12}^2(\beta_1-\beta_{12})^2}{(\pi_1+\pi_{12})^3} + \frac{\pi_1 v_1 + \pi_{12} v_{12}}{(\pi_1+\pi_{12})^2}\right) + \frac{1}{2}\,\frac{1}{\pi_2 N}\,\frac{v_2^2}{\hat v_2^2}\left(\frac{\pi_{12}^2(\beta_2-\beta_{12})^2}{(\pi_2+\pi_{12})^3} + \frac{\pi_2 v_2 + \pi_{12} v_{12}}{(\pi_2+\pi_{12})^2}\right) \qquad (49)$$

$$MSE(\hat\beta^{deconv}, \beta) = \frac{1}{2}\,\frac{1}{\pi_1 N}\, v_1\,\frac{\pi_2\beta_{12} + \pi_{12} w_2}{\pi_{12}(\pi_1 w_2 + \pi_2 w_1) + \pi_1\pi_2\beta_{12}} + \frac{1}{2}\,\frac{1}{\pi_2 N}\, v_2\,\frac{\pi_1\beta_{12} + \pi_{12} w_1}{\pi_{12}(\pi_1 w_2 + \pi_2 w_1) + \pi_1\pi_2\beta_{12}} \qquad (50)$$

The proof of this lemma involves lengthy calculations that we partially perform in Maple. Details are given in Section A.3 of Streich (2010).
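As with the Gaussian case, these predictions can be checked by simulation; the following sketch (ours, illustrative constants) verifies Eq. 48 for M_ignore under the Boolean-OR setting with β1 = 0.4 and β2 = 0.2.

```python
# Illustrative Monte Carlo check of Eq. 48: Boolean-OR combination of Bernoulli
# sources with equal label-set fractions and N = 300 items per training set.
import numpy as np

rng = np.random.default_rng(2)
b1, b2, N, pi, runs = 0.4, 0.2, 300, 1 / 3, 2000
sq_err = []
for _ in range(runs):
    sets = rng.choice(3, size=N, p=[pi, pi, pi])
    e1, e2 = rng.random(N) < b1, rng.random(N) < b2
    X = np.where(sets == 0, e1, np.where(sets == 1, e2, e1 | e2))
    b1_hat, b2_hat = X[sets == 0].mean(), X[sets == 1].mean()   # M_ignore estimates
    sq_err.append(((b1_hat - b1) ** 2 + (b2_hat - b2) ** 2) / 2)
pred = 0.5 * (b1 * (1 - b1) / (pi * N) + b2 * (1 - b2) / (pi * N))
print(np.mean(sq_err), pred)   # both approx. 0.002
```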


Fig. 5 Deviation of parameter values from true values: the box plots indicate the values obtained in an experiment with 100 runs, the red line gives the RMS predicted by the asymptotic analysis. Panels: (a) estimator accuracy for M_ignore, (b) for M_new, (c) for M_cross, (d) for M_deconv; x-axes show the training set size N, y-axes the RMS of the β estimators.

7.2 Experimental results

To evaluate the estimators obtained by the different inference methods, we use a setting with $\beta_1 = 0.40 \cdot \mathbb{1}_{10\times 1}$ and $\beta_2 = 0.20 \cdot \mathbb{1}_{10\times 1}$, where $\mathbb{1}_{10\times 1}$ denotes a 10-dimensional vector of ones. Each dimension is treated independently, and all results reported here are averages and standard deviations over 100 independent training and test samples.
The RMS of the estimators obtained by the different inference techniques is depicted in Fig. 5. We observe that the asymptotic values predicted by theory are in good agreement with the deviations measured in the experiments, thus confirming the theoretical results. M_cross yields clearly biased estimators, while M_deconv yields the most accurate parameters.
Recall that the parameter describing the proxy distribution of data items from the label set {1, 2} is defined as $\beta_{12} = \beta_1 + \beta_2 - \beta_1\beta_2$ (Eq. 41) and is thus larger than either β1 or β2. While the expectation of the Bernoulli distribution is thus increasing, the variance $\beta_{12}(1-\beta_{12})$ of the proxy distribution is smaller than the variance of the base distributions. To study the influence of this effect on the estimator precision, we compare the RMS of the source estimators obtained by M_deconv and M_new, illustrated in Fig. 6: the method M_deconv is most advantageous if at least one of β1 or β2 is small. In this case, the variance of the proxy distribution is approximately the sum of the variances of the base distributions. As the parameters β of the base distributions increase, the advantage of M_deconv in comparison to M_new decreases. If β1 or β2 is high, the variance of the proxy distribution is smaller than the variance of any of the base distributions, and M_new yields on average more accurate estimators than M_deconv.

Fig. 6 Comparison of the estimation accuracy for β for the two methods M_new and M_deconv for different values of β1 and β2. Panels (contour plots over β1 and β2): (a) RMS(β̂^new, β), (b) RMS(β̂^deconv, β), (c) RMS(β̂^deconv, β) − RMS(β̂^new, β), (d) (RMS(β̂^deconv, β) − RMS(β̂^new, β)) / RMS(β̂^new, β).

8 Conclusion

In this paper, we develop a general framework to describe inference techniques for multi-
label data. Based on this generative model, we derive an inference method which respects
the assumed semantics of multi-label data. The generality of the framework also enables us
to formally characterize previously presented inference algorithms for multi-label data.
To theoretically assess different inference methods, we derive the asymptotic distribution
of estimators obtained on multi-label data and thus confirm experimental results on synthetic
data. Additionally, we prove that cross training yields inconsistent parameter estimators.


As we show in several experiments, the differences in estimator accuracy directly translate into significantly different classification performances for the considered classification techniques.
In our experiments, we have observed that the magnitude of the quality differences between the considered classification methods largely depends on the quality criterion used to assess a classification result. A theoretical analysis of the performance of classification techniques with respect to different quality criteria will be an interesting continuation of this work.

Acknowledgments We appreciate valuable discussions with Cheng Soon Ong. This work was in part funded
by CTI grant Nr. 8539.2;2 EPSS-ES.

Open Access This article is distributed under the terms of the Creative Commons Attribution License which
permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source
are credited.

Appendix 1: Asymptotic distribution of estimators

This section contains the proofs of the lemmas describing the asymptotic distribution of estimators obtained by the inference methods M_ignore, M_new and M_cross in Sect. 5.

Proof Lemma 1 M_ignore reduces the estimation problem to the standard single-label classification problem for K independent sources. The results of the single-label asymptotic analysis are directly applicable: the estimators $\hat\theta^{ignore}$ are consistent and converge to $\theta^G$.
As only single-label data is used in the estimation process, the estimators for different sources are independent and the asymptotic covariance matrix is block-diagonal, as stated in Lemma 1. The diagonal elements are given by Eq. 25, which yields the given expression.

Proof Lemma 2 M_new reduces the estimation problem to the standard single-label classification problem for $L := |\mathcal{L}|$ independent sources. The results of standard asymptotic analysis (Sect. 3.4) are therefore directly applicable: the parameter estimators $\hat\theta^{new}$ for all single-label sources (including the proxy-distributions) are consistent with the true parameter values $\theta^G$ and asymptotically normally distributed, as stated in the lemma.
The covariance matrix of the estimators is block-diagonal as the parameters are estimated independently for each source. Using Eq. 25, we obtain the values for the diagonal elements as given in the lemma.


Proof Lemma 3 The parameters $\theta_k$ of source k are estimated independently for each source. Combining Eqs. 17 and 32, the condition for $\theta_k$ is

$$\Psi^{cross}_N(\theta_k) := \sum_{D} \psi^{cross}_{\theta_k}(D) \stackrel{!}{=} 0.$$

$\psi^{cross}_{\theta_k}(D) = 0$ in the case $k \notin L$ thus implies that D has no influence on the parameter estimation. For simpler notation, we define the set of all label sets which contain k as $\mathcal{L}_k$, formally $\mathcal{L}_k := \{L \in \mathcal{L} \mid k \in L\}$. The asymptotic criterion function for $\theta_k$ is then given by

$$\Psi^{cross}(\theta_k) = E_{D\sim P_{\theta^G}}\big[E_{\epsilon_k\sim P^{cross}_{D,\theta_k}}[\phi(\epsilon_k)]\big] - E_{\epsilon_k\sim P_{\theta_k}}[\phi(\epsilon_k)] = \sum_{L\in\mathcal{L}_k}\pi_L\, E_{X\sim P_{L,\theta^G}}[\phi(X)] + \sum_{L\notin\mathcal{L}_k}\pi_L\, E_{\epsilon\sim P_{\theta_k}}[\phi(\epsilon)] - E_{\epsilon_k\sim P_{\theta_k}}[\phi(\epsilon_k)]$$


Setting $\Psi^{cross}(\theta_k) = 0$ yields

$$E_{X\sim P_{\hat\theta_k^{cross}}}[\phi(X)] = \frac{1}{1 - \sum_{L\notin\mathcal{L}_k}\pi_L} \sum_{L\in\mathcal{L}_k} \pi_L\, E_{X\sim P_{L,\theta^G}}[\phi(X)]. \qquad (51)$$

The mismatch of $\hat\theta_k^{cross}$ thus grows as the fraction of multi-label data grows. Furthermore, the mismatch depends on the dissimilarity of the sufficient statistics of the partner labels from the sufficient statistics of source k.
the sufficient statistics of source k. 


Appendix 2: Lemma 4

Proof Lemma 4 This proof consists mainly of computing summary statistics.

Ignore training (M_ignore)

Mean value of the mean estimator As derived in the general description of the method in Sect. 5.1, the ignore training yields consistent estimators for the single-label source distributions: $\hat\theta_{1,1} \to -\frac{a}{\sigma_1^2}$ and $\hat\theta_{2,1} \to \frac{a}{\sigma_2^2}$.
Variance of the mean estimator Recall that we assume to have $\pi_L N$ observations with label set L, and the variance of the source emissions is assumed to be $V_{\epsilon\sim P_k}[\phi(\epsilon)] = \sigma_k^2$. The variance of the estimator for the single-label source means based on a training set of size N is thus $V[\hat\mu_k] = \sigma_k^2/(\pi_k N)$.
Mean-squared error of the estimator With the above, the MSE, averaged over the two sources, is given by

$$MSE\big(\hat\theta_\mu^{ignore}, \theta\big) = \frac{1}{2}\left(\frac{\sigma_1^2}{\pi_1 N} + \frac{\sigma_2^2}{\pi_2 N}\right).$$

Since the estimators obtained by M_ignore are consistent, the MSE only depends on the variance of the estimator.

New source training (M_new)

Mean value of the estimator The training is based on single-label data items and therefore yields consistent estimators (Theorem 2). Note that this method uses three sources to model the generative process in the given example: $\hat\theta_{1,1} \to -\frac{a}{\sigma_1^2}$, $\hat\theta_{2,1} \to \frac{a}{\sigma_2^2}$, $\hat\theta_{12,1} \to 0$.
Variance of the mean estimator The variance is given in Lemma 2 and takes the following values in our setting:

$$V\big[\hat\mu_1\big] = \frac{\sigma_1^2}{\pi_1 N} \qquad V\big[\hat\mu_2\big] = \frac{\sigma_2^2}{\pi_2 N} \qquad V\big[\hat\mu_{12}\big] = \frac{\sigma_{12}^2}{\pi_{12} N} = \frac{\sigma_1^2 + \sigma_2^2}{\pi_{12} N}$$

Since the observations with label set L = {1, 2} have a higher variance than single-label observations, the estimator $\hat\mu_{12}$ also has a higher variance than the estimators for single sources.
Mean-squared error of the estimator Given the above, the MSE is given by

$$MSE\big(\hat\theta_\mu^{new}, \theta\big) = \frac{1}{3}\left(\frac{\sigma_1^2}{\pi_1 N} + \frac{\sigma_2^2}{\pi_2 N} + \frac{\sigma_1^2 + \sigma_2^2}{\pi_{12} N}\right).$$


Cross-training (M_cross)

As described in Eq. 30, the probability distributions of the source emissions given the observations are assumed to be mutually independent by M_cross. The criterion function $\psi^{cross}_{\theta_k}(D)$ is given in Eq. 32. The parameter $\theta_k$ is chosen according to Eq. 51:

$$E_{X\sim P_{\theta_k^{cross}}}[X] = \frac{1}{1 - \sum_{L\notin\mathcal{L}_k}\pi_L} \sum_{L\in\mathcal{L}_k} \pi_L\, E_{X\sim P_{L,\theta^G}}[X]$$

Mean value of the mean estimator With the conditional expectations of the observations given the labels (see Eq. 36), we have for the mean estimate of source 1:

$$\hat\mu_1 = E_{X\sim P_{\theta_1^{cross}}}[X] = \frac{1}{1-\pi_2}\left(\pi_1 E_{X\sim P_{\{1\},\theta^G}}[X] + \pi_{12} E_{X\sim P_{\{1,2\},\theta^G}}[X]\right) = -\frac{\pi_1\cdot a}{\pi_1+\pi_{12}} = -\frac{a}{1+\frac{\pi_{12}}{\pi_1}}$$

$$\text{and similarly} \qquad \hat\mu_2 = \frac{\pi_2\cdot a}{\pi_2+\pi_{12}} = \frac{a}{1+\frac{\pi_{12}}{\pi_2}}$$

The deviation from the true value increases with the ratio of multi-label data items compared
to the number of single-label data items from the corresponding source.
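A quick numerical sanity check (ours, not part of the original derivation; constants are illustrative) of this limit for a = 3.5 and equal label-set fractions, where the cross-training mean estimate converges to −a/2 instead of −a:

```python
# Illustrative check: the cross-training mean of source 1 converges to
# -pi1 * a / (pi1 + pi12) = -a / 2 = -1.75 rather than to -3.5.
import numpy as np

rng = np.random.default_rng(3)
a, N = 3.5, 200_000
sets = rng.choice(3, size=N, p=[1 / 3, 1 / 3, 1 / 3])
e1, e2 = rng.normal(-a, 1, N), rng.normal(a, 1, N)
X = np.where(sets == 0, e1, np.where(sets == 1, e2, e1 + e2))
print(X[(sets == 0) | (sets == 2)].mean())   # approx. -1.75
```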

Mean value of the standard deviation estimator According to the principle of maximum likelihood, the estimator for the source variance $\sigma_k^2$ is the empirical variance of all data items which contain k in their label sets:

$$\hat\sigma_1^2 = \frac{1}{|D_1\cup D_{12}|}\sum_{x\in D_1\cup D_{12}}\!\big(x-\hat\mu_1\big)^2 = \frac{1}{N(\pi_1+\pi_{12})}\left(\sum_{x\in D_1}\big(x-\hat\mu_1\big)^2 + \sum_{x\in D_{12}}\big(x-\hat\mu_1\big)^2\right) = \frac{\pi_1\pi_{12}}{(\pi_1+\pi_{12})^2}\,a^2 + \frac{\pi_1\sigma_{G,1}^2 + \pi_{12}\sigma_{G,12}^2}{\pi_1+\pi_{12}} \qquad (52)$$

$$\text{and similarly} \quad \hat\sigma_2^2 = \frac{\pi_2\pi_{12}\,a^2}{(\pi_2+\pi_{12})^2} + \frac{\pi_2\sigma_{G,2}^2 + \pi_{12}\sigma_{G,12}^2}{\pi_2+\pi_{12}}. \qquad (53)$$

The variance of the source emissions under the assumptions of method M_cross is given by $V_{\epsilon\sim P_\theta}[\phi(\epsilon)] = \mathrm{diag}\big(\hat\sigma_1^2, \hat\sigma_2^2\big)$.

Variance of the mean estimator We use the decomposition derived in Sect. 4.6 to determine the variance. Using the expected values of the sufficient statistics conditioned on the label sets and the variances thereof, as given in Table 2, we have

$$E_{\mathcal{L}}\Big[V_{X\sim P_{L,\theta^G}}\big[E_{\epsilon\sim P^{cross}_{(X,L),\hat\theta}}[\phi(\epsilon)]\big]\Big] = \begin{pmatrix} \pi_1\sigma_1^2 + \pi_{12}\sigma_{12}^2 & \pi_{12}\sigma_{12}^2 \\ \pi_{12}\sigma_{12}^2 & \pi_2\sigma_2^2 + \pi_{12}\sigma_{12}^2 \end{pmatrix}.$$

Furthermore, the expected value of the sufficient statistics over all data items is

$$E_{D\sim P_{\theta^G}}\big[E_{\epsilon\sim P^{cross}_{D,\hat\theta}}[\phi(\epsilon)]\big] = \begin{pmatrix} -\pi_1 a + \pi_2\hat\mu_1 \\ \pi_1\hat\mu_2 + \pi_2 a \end{pmatrix}$$

Table 2 Quantities used to determine the asymptotic behavior of parameter estimators obtained by M_cross for a Gaussian distribution

Quantity | L = {1} | L = {2} | L = {1, 2}
$E_{\epsilon\sim P^{cross}_{(X,L),\hat\theta}}[\phi(\epsilon)]$ | $(X, \hat\mu_2)^T$ | $(\hat\mu_1, X)^T$ | $(X, X)^T$
$E_{X\sim P_{L,\theta^G}}\big[E_{\epsilon\sim P^{cross}_{(X,L),\hat\theta}}[\phi(\epsilon)]\big]$ | $(-a, \hat\mu_2)^T$ | $(\hat\mu_1, a)^T$ | $(0, 0)^T$
$V_{X\sim P_{L,\theta^G}}\big[E_{\epsilon\sim P^{cross}_{(X,L),\hat\theta}}[\phi(\epsilon)]\big]$ | $\mathrm{diag}(\sigma_1^2, 0)$ | $\mathrm{diag}(0, \sigma_2^2)$ | $\sigma_{12}^2\begin{pmatrix}1 & 1\\ 1 & 1\end{pmatrix}$

Hence

$$E_{\mathcal{L}\sim P_\pi}\!\left[\left(E_{X\sim P_{L,\theta^G}}\big[E_{\epsilon\sim P^{cross}_{(X,L),\hat\theta}}[\phi(\epsilon)]\big] - E_{D'\sim P_{\theta^G}}\big[E_{\epsilon\sim P^{cross}_{D',\hat\theta}}[\phi(\epsilon)]\big]\right)^{\otimes}\right] = \begin{pmatrix} \frac{\pi_1\pi_{12}}{\pi_1+\pi_{12}}\,a^2 & -\frac{\pi_1\pi_{12}\pi_2}{(\pi_1+\pi_{12})(\pi_2+\pi_{12})}\,a^2 \\ -\frac{\pi_1\pi_{12}\pi_2}{(\pi_1+\pi_{12})(\pi_2+\pi_{12})}\,a^2 & \frac{\pi_2\pi_{12}}{\pi_2+\pi_{12}}\,a^2 \end{pmatrix}$$

The variance of the sufficient statistics of the emissions of the single sources and the Fisher information matrices for each label set are thus given by

$$V_{\epsilon\sim P^{cross}_{(X,\{1\}),\hat\theta}}[\phi(\epsilon)] = \begin{pmatrix}0 & 0\\ 0 & \hat\sigma_2^2\end{pmatrix} \qquad I_{\{1\}} = -\begin{pmatrix}\hat\sigma_1^2 & 0\\ 0 & 0\end{pmatrix}$$
$$V_{\epsilon\sim P^{cross}_{(X,\{2\}),\hat\theta}}[\phi(\epsilon)] = \begin{pmatrix}\hat\sigma_1^2 & 0\\ 0 & 0\end{pmatrix} \qquad I_{\{2\}} = -\begin{pmatrix}0 & 0\\ 0 & \hat\sigma_2^2\end{pmatrix}$$
$$V_{\epsilon\sim P^{cross}_{(X,\{1,2\}),\hat\theta}}[\phi(\epsilon)] = \begin{pmatrix}0 & 0\\ 0 & 0\end{pmatrix} \qquad I_{\{1,2\}} = -\begin{pmatrix}\hat\sigma_1^2 & 0\\ 0 & \hat\sigma_2^2\end{pmatrix}$$

The expected value of the Fisher information matrices over all label sets is

$$E_{\mathcal{L}\sim P_{\mathcal{L}}}[I_{\mathcal{L}}] = -\mathrm{diag}\big((\pi_1+\pi_{12})\hat\sigma_1^2,\ (\pi_2+\pi_{12})\hat\sigma_2^2\big)$$
where the values of $\hat\sigma_1$ and $\hat\sigma_2$ are given in Eqs. 52 and 53. Putting everything together, the covariance matrix of the estimator $\hat\theta^{cross}$ is given by

$$\Sigma_\theta^{cross} = \begin{pmatrix} v_{\theta,11} & v_{\theta,12} \\ v_{\theta,12} & v_{\theta,22} \end{pmatrix}$$

with diagonal elements

$$v_{\theta,11} = \frac{\pi_1+\pi_{12}}{\pi_1\pi_{12}a^2 + \pi_1\sigma_1^2 + \pi_{12}\sigma_{12}^2} \qquad v_{\theta,22} = \frac{\pi_2+\pi_{12}}{\pi_2\pi_{12}a^2 + \pi_2\sigma_2^2 + \pi_{12}\sigma_{12}^2}.$$

To get the variance of the mean estimator, recall Eq. 35. The covariance matrix for the mean estimator is

$$\Sigma_\mu^{cross} = \begin{pmatrix} v_{\mu,11} & v_{\mu,12} \\ v_{\mu,12} & v_{\mu,22} \end{pmatrix}, \qquad \text{with } v_{\mu,11} = \frac{1}{\pi_1+\pi_{12}}\cdot\left(\frac{\pi_1\pi_{12}}{(\pi_1+\pi_{12})^2}\,a^2 + \frac{\pi_1\sigma_1^2 + \pi_{12}\sigma_{12}^2}{\pi_1+\pi_{12}}\right)$$
$$v_{\mu,22} = \frac{1}{\pi_2+\pi_{12}}\cdot\left(\frac{\pi_2\pi_{12}}{(\pi_2+\pi_{12})^2}\,a^2 + \frac{\pi_2\sigma_2^2 + \pi_{12}\sigma_{12}^2}{\pi_2+\pi_{12}}\right).$$


The first term in the brackets gives the variance of the means of the two true sources involved
in generating the samples used to estimate the mean of the particular source. The second
term is the average variance of the sources.

Mean-squared error of the mean estimator Finally, the mean squared error is given by:

$$MSE(\hat\mu^{cross}, \mu) = \frac{1}{2}\,\pi_{12}^2\left(\frac{1}{(\pi_1+\pi_{12})^2} + \frac{1}{(\pi_2+\pi_{12})^2}\right)a^2 + \frac{1}{2}\,\pi_{12}\left(\frac{1}{(\pi_1+\pi_{12})N}\,\frac{\pi_1}{(\pi_1+\pi_{12})^2} + \frac{1}{(\pi_2+\pi_{12})N}\,\frac{\pi_2}{(\pi_2+\pi_{12})^2}\right)a^2 + \frac{1}{2}\left(\frac{1}{(\pi_1+\pi_{12})N}\,\frac{\pi_1\sigma_1^2+\pi_{12}\sigma_{12}^2}{\pi_1+\pi_{12}} + \frac{1}{(\pi_2+\pi_{12})N}\,\frac{\pi_2\sigma_2^2+\pi_{12}\sigma_{12}^2}{\pi_2+\pi_{12}}\right)$$

This expression describes the three effects contributing to the estimation error of M_cross:
– The first line indicates the inconsistency of the estimator. This term grows with the mean
of the true sources (a and −a, respectively) and with the ratio of multi-label data items.
Note that this term is independent of the number of data items.
– The second line measures the variance of the observation x given the label set L, averaged
over all label sets and all sources. This term thus describes the excess variance of the
estimator due to the inconsistency in the estimation procedure.
– The third line is the weighted average of the variance of the individual sources, as it is
also found for consistent estimators.
The second and third line describe the variance of the observations according to the law of total variance:

$$V_X[X] = \underbrace{V_{\mathcal{L}}\big[E_X[X \mid L]\big]}_{\text{second line}} + \underbrace{E_{\mathcal{L}}\big[V_X[X \mid L]\big]}_{\text{third line}}$$

Note that $(\pi_1+\pi_{12})N$ and $(\pi_2+\pi_{12})N$ are the numbers of data items used to infer the parameters of source 1 and 2, respectively.

Deconvolutive training (M_deconv)

Mean value of the mean estimator The conditional expectations of the sufficient statistics of the single-label data are:

$$E_{\epsilon\sim P^{deconv}_{(X,\{1\}),\theta}}[\phi_1(\epsilon)] = \begin{pmatrix}X\\ \hat\mu_2\end{pmatrix} \qquad E_{\epsilon\sim P^{deconv}_{(X,\{2\}),\theta}}[\phi_1(\epsilon)] = \begin{pmatrix}\hat\mu_1\\ X\end{pmatrix} \qquad (54)$$

Observations X with label set L = {1, 2} are interpreted as the sum of the emissions from the two sources. Therefore, there is no unique expression for the conditional expectation of the source emissions given the data item D = (X, L):

$$E_{\epsilon\sim P^{deconv}_{(X,\{1,2\}),\theta}}[\phi_1(\epsilon)] = \begin{pmatrix}\hat\mu_1\\ X-\hat\mu_1\end{pmatrix} = \begin{pmatrix}X-\hat\mu_2\\ \hat\mu_2\end{pmatrix}$$

We use a parameter $\lambda\in[0,1]$ to parameterize the blending between these two extremes:

$$E_{\epsilon\sim P^{deconv}_{(X,\{1,2\}),\theta}}[\phi_1(\epsilon)] = \lambda\begin{pmatrix}\hat\mu_1\\ X-\hat\mu_1\end{pmatrix} + (1-\lambda)\begin{pmatrix}X-\hat\mu_2\\ \hat\mu_2\end{pmatrix} \qquad (55)$$

Furthermore, we have $E_{\epsilon\sim P_\theta}[\phi_1(\epsilon)] = \big(\hat\mu_1, \hat\mu_2\big)^T$. The criterion function $\psi^{deconv}_\theta(D)$ for the parameter vector $\theta$ then implies the condition

$$\pi_1\begin{pmatrix}\bar X_1\\ \hat\mu_2\end{pmatrix} + \pi_2\begin{pmatrix}\hat\mu_1\\ \bar X_2\end{pmatrix} + \pi_{12}\begin{pmatrix}\lambda\hat\mu_1 + (1-\lambda)(\bar X_{12}-\hat\mu_2)\\ \lambda(\bar X_{12}-\hat\mu_1) + (1-\lambda)\hat\mu_2\end{pmatrix} \stackrel{!}{=} \begin{pmatrix}\hat\mu_1\\ \hat\mu_2\end{pmatrix},$$

where we have defined $\bar X_1$, $\bar X_2$ and $\bar X_{12}$ as the average of the observations with label set {1}, {2} and {1, 2}, respectively. Solving for $\hat\mu$, we get

$$\hat\mu_1 = \frac{1}{2}\big((1+\lambda)\bar X_1 + (1-\lambda)\bar X_{12} - (1-\lambda)\bar X_2\big) \qquad \hat\mu_2 = \frac{1}{2}\big(-\lambda\bar X_1 + \lambda\bar X_{12} + (2-\lambda)\bar X_2\big).$$

Since $E[\bar X_1] = -a$, $E[\bar X_{12}] = 0$ and $E[\bar X_2] = a$, the mean estimators are consistent independent of the chosen $\lambda$: $E[\hat\mu_1] = -a$ and $E[\hat\mu_2] = a$. In particular, we have, for all L:

$$E_{X\sim P_{L,\theta^G}}\big[E_{\epsilon\sim P^{deconv}_{(X,L),\hat\theta}}[\phi(\epsilon)]\big] = E_{D'\sim P_{\theta^G}}\big[E_{\epsilon\sim P^{deconv}_{D',\hat\theta}}[\phi(\epsilon)]\big]$$

Mean of the variance estimator We compute the second component $\phi_2(\epsilon)$ of the sufficient statistics vector $\phi(\epsilon)$ for the emissions given a data item. For single-label data items, we have

$$E_{\epsilon\sim P^{deconv}_{(X,\{1\}),\hat\theta}}[\phi_2(\epsilon)] = \begin{pmatrix}X^2\\ \hat\mu_2^2+\hat\sigma_2^2\end{pmatrix} \qquad E_{\epsilon\sim P^{deconv}_{(X,\{2\}),\hat\theta}}[\phi_2(\epsilon)] = \begin{pmatrix}\hat\mu_1^2+\hat\sigma_1^2\\ X^2\end{pmatrix}$$

For multi-label data items, the situation is again more involved. As when determining the estimator for the mean, we find again two extreme cases:

$$E_{\epsilon\sim P^{deconv}_{(X,\{1,2\}),\hat\theta}}[\phi_2(\epsilon)] = \begin{pmatrix}X^2-\hat\mu_2^2-\hat\sigma_2^2\\ \hat\mu_2^2+\hat\sigma_2^2\end{pmatrix} = \begin{pmatrix}\hat\mu_1^2+\hat\sigma_1^2\\ X^2-\hat\mu_1^2-\hat\sigma_1^2\end{pmatrix}$$

We use again a parameter $\lambda\in[0,1]$ to parameterize the blending between the two extreme cases and write

$$E_{\epsilon\sim P^{deconv}_{(X,\{1,2\}),\hat\theta}}[\phi_2(\epsilon)] = \lambda\begin{pmatrix}X^2-\hat\mu_2^2-\hat\sigma_2^2\\ \hat\mu_2^2+\hat\sigma_2^2\end{pmatrix} + (1-\lambda)\begin{pmatrix}\hat\mu_1^2+\hat\sigma_1^2\\ X^2-\hat\mu_1^2-\hat\sigma_1^2\end{pmatrix}$$

Since the estimators for the mean are consistent, we do not distinguish between the true and the estimated mean values any more. Using $E_{X\sim P_{\{l\},\theta^G}}[X^2] = \mu_l^2+\sigma_l^2$ for l = 1, 2, and $E_{X\sim P_{\{1,2\},\theta^G}}[X^2] = \mu_1^2+\mu_2^2+\sigma_1^2+\sigma_2^2$, the criterion function implies, in the consistent case, the following condition for the standard deviation parameters

$$\pi_1\begin{pmatrix}\mu_1^2+\sigma_1^2\\ \mu_2^2+\hat\sigma_2^2\end{pmatrix} + \pi_2\begin{pmatrix}\mu_1^2+\hat\sigma_1^2\\ \mu_2^2+\sigma_2^2\end{pmatrix} + \pi_{12}\begin{pmatrix}\lambda\big(\mu_1^2+\sigma_1^2+\sigma_2^2-\hat\sigma_2^2\big) + (1-\lambda)\big(\mu_1^2+\hat\sigma_1^2\big)\\ \lambda\big(\mu_2^2+\hat\sigma_2^2\big) + (1-\lambda)\big(\mu_2^2+\sigma_1^2+\sigma_2^2-\hat\sigma_1^2\big)\end{pmatrix} \stackrel{!}{=} \begin{pmatrix}\mu_1^2+\hat\sigma_1^2\\ \mu_2^2+\hat\sigma_2^2\end{pmatrix}$$

Solving for $\hat\sigma_1$ and $\hat\sigma_2$, we find $\hat\sigma_1 = \sigma_1$ and $\hat\sigma_2 = \sigma_2$. The estimators for the standard deviation are thus consistent as well.

Variance of the mean estimator Based on Eqs. 54 and 55, the variance of the conditional expectation values over observations X with label set L, for the three possible label sets, is given by

$$V_{X\sim P_{\{1\},\theta^G}}\big[E_{\epsilon\sim P^{deconv}_{(X,\{1\}),\theta}}[\phi(\epsilon)]\big] = \mathrm{diag}\big(\sigma_1^2, 0\big)$$
$$V_{X\sim P_{\{2\},\theta^G}}\big[E_{\epsilon\sim P^{deconv}_{(X,\{2\}),\theta}}[\phi(\epsilon)]\big] = \mathrm{diag}\big(0, \sigma_2^2\big)$$
$$V_{X\sim P_{\{1,2\},\theta^G}}\big[E_{\epsilon\sim P^{deconv}_{(X,\{1,2\}),\theta}}[\phi(\epsilon)]\big] = \sigma_{12}^2\begin{pmatrix}(1-\lambda)^2 & \lambda(1-\lambda)\\ \lambda(1-\lambda) & \lambda^2\end{pmatrix}$$

and thus

$$E_{\mathcal{L}\sim P_\pi}\big[V_{X\sim P_{L,\theta^G}}\big[E_{\epsilon\sim P^{deconv}_{(X,L),\theta}}[\phi(\epsilon)]\big]\big] = \begin{pmatrix}\pi_1\sigma_1^2 & 0\\ 0 & \pi_2\sigma_2^2\end{pmatrix} + \pi_{12}\sigma_{12}^2\begin{pmatrix}(1-\lambda)^2 & \lambda(1-\lambda)\\ \lambda(1-\lambda) & \lambda^2\end{pmatrix}$$
The variance of the assumed source emissions is given by

$$V_{\epsilon\sim P^{deconv}_{(X,\{1\}),\theta}}[\phi(\epsilon)] = \mathrm{diag}\big(0, \sigma_2^2\big) \qquad V_{\epsilon\sim P^{deconv}_{(X,\{2\}),\theta}}[\phi(\epsilon)] = \mathrm{diag}\big(\sigma_1^2, 0\big)$$
$$V_{\epsilon\sim P^{deconv}_{(X,\{1,2\}),\theta}}[\phi(\epsilon)] = V_{\epsilon\sim P^{deconv}_{(X,\{1,2\}),\theta}}\!\begin{pmatrix}\lambda\epsilon_1 + (1-\lambda)(X-\epsilon_2)\\ \lambda(X-\epsilon_1) + (1-\lambda)\epsilon_2\end{pmatrix} = \lambda^2\begin{pmatrix}\sigma_1^2 & -\sigma_1^2\\ -\sigma_1^2 & \sigma_1^2\end{pmatrix} + (1-\lambda)^2\begin{pmatrix}\sigma_2^2 & -\sigma_2^2\\ -\sigma_2^2 & \sigma_2^2\end{pmatrix}$$

With $V_{\epsilon\sim P_\theta}[\phi(\epsilon)] = \mathrm{diag}\big(\sigma_1^2, \sigma_2^2\big)$, the Fisher information matrices for the single-label data are given by $I_{\{1\}} = -\mathrm{diag}\big(\sigma_1^2, 0\big)$ and $I_{\{2\}} = -\mathrm{diag}\big(0, \sigma_2^2\big)$. For the label set L = {1, 2}, we have

$$I_{\{1,2\}} = \begin{pmatrix}(\lambda^2-1)\sigma_1^2 + (1-\lambda)^2\sigma_2^2 & -\lambda^2\sigma_1^2 - (1-\lambda)^2\sigma_2^2\\ -\lambda^2\sigma_1^2 - (1-\lambda)^2\sigma_2^2 & \lambda^2\sigma_1^2 + \big((1-\lambda)^2-1\big)\sigma_2^2\end{pmatrix}$$
Choosing $\lambda$ such that the trace of the information matrix $I_{\{1,2\}}$ is maximized yields $\lambda = \sigma_2^2/\big(\sigma_1^2+\sigma_2^2\big)$ and the following value for the information matrix of label set {1, 2}:

$$I_{\{1,2\}} = -\frac{1}{\sigma_1^2+\sigma_2^2}\begin{pmatrix}\sigma_1^4 & \sigma_1^2\sigma_2^2\\ \sigma_1^2\sigma_2^2 & \sigma_2^4\end{pmatrix}$$

The expected Fisher information matrix is then given by

$$E_{\mathcal{L}\sim P_\pi}[I_{\mathcal{L}}] = -\begin{pmatrix}\sigma_1^2\Big(\pi_1 + \pi_{12}\frac{\sigma_1^2}{\sigma_1^2+\sigma_2^2}\Big) & \pi_{12}\frac{\sigma_1^2\sigma_2^2}{\sigma_1^2+\sigma_2^2}\\ \pi_{12}\frac{\sigma_1^2\sigma_2^2}{\sigma_1^2+\sigma_2^2} & \sigma_2^2\Big(\pi_2 + \pi_{12}\frac{\sigma_2^2}{\sigma_1^2+\sigma_2^2}\Big)\end{pmatrix}$$

With this, we have

$$\Sigma_\theta^{deconv} = \begin{pmatrix}v^2_{\theta,11} & v^2_{\theta,12}\\ v^2_{\theta,12} & v^2_{\theta,22}\end{pmatrix},$$

with the matrix elements given by

$$v^2_{\theta,11} = \frac{\pi_{12}^2\sigma_2^2 w_{12} + \pi_{12}\pi_2\big(\pi_2\sigma_1^2\sigma_{12}^2 + 2\pi_1\sigma_2^2 s_{12}\big) + \pi_1\pi_2^2 s_{12}^2}{\sigma_1^2\,(\pi_1\pi_2 s_{12} + \pi_{12} w_{12})^2}$$
$$v^2_{\theta,12} = \frac{\pi_{12}^2 w_{12} + \pi_{12}\pi_1\pi_2\big(2s_{12} - \sigma_{12}^2\big)}{(\pi_1\pi_2 s_{12} + \pi_{12} w_{12})^2}$$
$$v^2_{\theta,22} = \frac{\pi_{12}^2\sigma_1^2 w_{12} + \pi_{12}\pi_1\big(\pi_1\sigma_2^2\sigma_{12}^2 + 2\pi_2\sigma_1^2 s_{12}\big) + \pi_1^2\pi_2 s_{12}^2}{\sigma_2^2\,(\pi_1\pi_2 s_{12} + \pi_{12} w_{12})^2}$$

where, for simpler notation, we have defined $w_{12} := \pi_2\sigma_1^2 + \pi_1\sigma_2^2$ and $s_{12} := \sigma_1^2 + \sigma_2^2$. For the variance of the mean estimators, using Eq. 35, we get

$$\Sigma_\mu^{deconv} = \begin{pmatrix}v^2_{\mu,11} & v^2_{\mu,12}\\ v^2_{\mu,12} & v^2_{\mu,22}\end{pmatrix}$$

with

$$v^2_{\mu,11} = \frac{\pi_{12}^2\sigma_2^2 w_{12} + \pi_{12}\pi_2\big(\pi_2\sigma_1^2\sigma_{12}^2 + 2\pi_1\sigma_2^2 s_{12}\big) + \pi_1\pi_2^2 s_{12}^2}{(\pi_1\pi_2 s_{12} + \pi_{12} w_{12})^2}\,\sigma_1^2 \qquad (56)$$
$$v^2_{\mu,12} = \frac{\pi_{12}^2 w_{12} + \pi_{12}\pi_1\pi_2\big(2s_{12} - \sigma_{12}^2\big)}{(\pi_1\pi_2 s_{12} + \pi_{12} w_{12})^2}\,\sigma_1^2\sigma_2^2$$
$$v^2_{\mu,22} = \frac{\pi_{12}^2\sigma_1^2 w_{12} + \pi_{12}\pi_1\big(\pi_1\sigma_2^2\sigma_{12}^2 + 2\pi_2\sigma_1^2 s_{12}\big) + \pi_1^2\pi_2 s_{12}^2}{(\pi_1\pi_2 s_{12} + \pi_{12} w_{12})^2}\,\sigma_2^2. \qquad (57)$$

Mean-squared error of the mean estimator Given that the estimators $\hat\mu^{deconv}$ are consistent, the mean squared error of the estimator is given by the average of the diagonal elements of $\Sigma_\mu^{deconv}$:

$$MSE\big(\hat\mu^{deconv}\big) = \frac{1}{2}\,\mathrm{tr}\big(\Sigma_\mu^{deconv}\big) = \frac{v^2_{\mu,11} + v^2_{\mu,22}}{2}.$$

Inserting the expressions in Eqs. 56 and 57 yields the expression given in the lemma.

References

Arons, B. (1992). A review of the cocktail party effect. Journal of the American Voice I/O Society, 12, 35–50.
Bishop, C. M. (2007). Pattern recognition and machine learning. Information science and statistics. Berlin:
Springer.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. Journal of Machine Learning
Research, 3, 993–1022.
Boutell, M., Luo, J., Shen, X., & Brown, C. (2004). Learning multi-label scene classification. Pattern Recog-
nition, 37(9), 1757–1771.
Brazzale, A. R., Davison, A. C., & Reid, N. (2007). Applied asymptotics: Case studies in small-sample
statistics. Cambridge: Cambridge University Press.
Cramér, H. (1946). Contributions to the theory of statistical estimation. Skand. Aktuarietids, 29, 85–94.
Cramér, H. (1999). Mathematical methods of statistics. Princeton: Princeton University Press.
Dembczyński, K., Cheng, W., & Hüllermeier, E. (2010). Bayes optimal multilabel classification via proba-
bilistic classifier chains. In Proceedings of the 27th International Conference on Machine Learning.
Dembczyński, K., Waegeman, W., Cheng, W., & Hüllermeier, E. (2012). On label dependence and loss
minimization in multi-label classification. Machine Learning, 88(1–2), 5–45.
Devroye, L., Györfi, L., & Lugosi, G. (1996). A probabilistic theory of pattern recognition. Stochastic modelling
and applied probability. Heidelberg: Springer.
Dietterich, T. G., & Bakiri, G. (1995). Solving multiclass learning problems via error-correcting output codes.
Journal of Articificial Intelligence Research, 2, 263–286.
Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification (2nd ed.). Hoboken: Wiley-Interscience.
Fisher, R. A. (1925). Theory of statistical estimation. Mathematical Proceedings of the Cambridge Philosoph-
ical Society, 22, 700–725.
Gao, W., & Zhou, Z.-H. (2013). On the consistency of multi-label learning. Artificial Intelligence, 199–200,
22–44.
Ghamrawi, N. & McCallum, A. (2005). Collective multi-label classification. In Proceedings of the ACM
Conference on Information and Knowledge Management (CIKM), pp. 195–200.
Godbole, S. & Sarawagi, S. (2004). Discriminative methods for multi-labeled classification. In Proceedings
of the 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 22–30.
Hastie, T., Tibshirani, R., & Buja, A. (1993). Flexible discriminant analysis by optimal scoring. Journal of the
American Statistical Association, 89, 1255–1270.


Hershey, J. R., Rennie, S. J., Olsen, P. A., & Kristjansson, T. T. (2010). Super-human multi-talker speech
recognition: A graphical modeling approach. Computer Speech and Language, 24(1), 45–66.
Hsu, D., Kakade, S., Langford, J., & Zhang, T. (2009). Multi-label prediction via compressed sensing. In
Proceedings of NIPS.
Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features.
In Proceedings of ECML.
Kawai, K., & Takahashi, Y. (2009). Identification of the dual action antihypertensive drugs using tfs-based
support vector machines. Chem-Bio Informatics Journal, 9, 41–51.
Lehmann, E. L., & Casella, G. (1998). Theory of point estimation. New York: Springer.
Liang, P. & Jordan, M. I. (2008). An asymptotic analysis of generative, discriminative, and pseudolikelihood
estimators. In Proceedings of ICML, pp. 584–591, New York, USA. ACM.
Masry, E. (1991). Multivariate probability density deconvolution for stationary random processes. IEEE Trans-
actions on Information Theory, 37(4), 1105–1115.
Masry, E. (1993). Strong consistency and rates for deconvolution of multivariate densities of stationary
processes. Stochastic Processes and Their Applications, 47(1), 53–74.
McCallum, A., Corrada-Emmanuel, A., & Wang, X. (2005). The author-recipient-topic model for topic and role discovery in social networks: Experiments with Enron and academic email. Technical report, Department of Computer Science, University of Massachusetts Amherst, Amherst, MA.
McCallum, A. K. (1999). Multi-label text classification with a mixture model trained by EM. In Proceedings
of NIPS.
Qi, G.-J., Hua, X.-S., Rui, Y., Tang, J., Mei, T., & Zhang, H.-J. (2007). Correlative multi-label video annotation.
In Proceedings of the 15th ACM International Conference on Multimedia, pp. 17–26.
Rao, C. R. (1945). Information and the accuracy attainable in the estimation of statistical parameters. Bulletin
of the Calcutta Mathematical Society, 37, 81–91.
Read, J., Pfahringer, B., Holmes, G., & Frank, E. (2009). Classifier chains for multi-label classification.
Machine Learning and Knowledge Discovery in Databases, 278, 254–269.
Rifkin, R., & Klautau, A. (2004). In defense of one-vs-all classification. Journal of Machine Learning Research,
5, 101–141.
Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2004). The author-topic model for authors and
documents. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence.
Schapire, R. E., & Singer, Y. (2000). Boostexter: A boosting-based system for text categorization. Machine
Learning, 39(2/3), 135–168.
Streich, A. P. (2010). Multi-label classification and clustering for acoustics and computer security. PhD thesis,
ETH Zurich.
Streich, A. P. & Buhmann, J. M. (2008). Classification of multi-labeled data: A generative approach. In
Procedings of ECML, pp. 390–405.
Streich, A. P., Frank, M., Basin, D., & Buhmann, J. M. (2009). Multi-assignment clustering for boolean data.
In Proceedings of ICML, pp. 969–976. Omnipress.
Tsoumakas, G., & Katakis, I. (2007). Multi label classification: An overview. International Journal of Data
Warehousing and Mining, 3(3), 1–13.
Tsoumakas, G., Katakis, I., & Vlahavas, I. (2010). Data mining and knowledge discovery handbook. In O.
Maimon & L. Rokach (Eds.), Mining multi-label data (2nd ed.). Heidelberg: Springer.
Ueda, N., & Saito, K. (2006). Parametric mixture model for multitopic text. Systems and Computers in Japan,
37(2), 56–66.
van der Vaart, A. W. (1998). Asymptotic statistics. Cambridge series in statistical and probabilistic mathe-
matics. Cambridge: Cambridge University Press.
Wainwright, M. J., & Jordan, M. I. (2008). Graphical models, exponential families, and variational inference.
Foundations and Trends in Machine Learning, 1(1–2), 1–305.
Yano, T., Cohen, W. W., & Smith, N. A. (2009). Predicting response to political blog posts with topic models.
In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American
Chapter of the Association for Computational Linguistics, pp. 477–485.
Zhang, M.-L., & Zhou, Z.-H. (2006). Multi-label neural network with applications to functional genomics and
text categorization. IEEE Transactions on Knowledge and Data Engineering, 18(10), 1338–1351.
Zhang, M.-L. & Zhou, Z.-H. (2013). A review on multi-label learning algorithms. IEEE Transactions on
Knowledge and Data Engineering. in press.
Zhu, S., Ji, X., Xu, W., & Gong, Y. (2005). Multi-labelled classification using maximum entropy method. In
Proceedings of SIGIR.
