Submitted to Statistical Science


By Peter Bühlmann and Torsten Hothorn

ETH Zürich and Universität Erlangen-Nürnberg
We present a statistical perspective on boosting. Special empha-
sis is given to estimating potentially complex parametric or nonpara-
metric models, including generalized linear and additive models as
well as regression models for survival analysis. Concepts of degrees
of freedom and corresponding Akaike or Bayesian information cri-
teria, particularly useful for regularization and variable selection in
high-dimensional covariate spaces, are discussed as well.
The practical aspects of boosting procedures for fitting statistical
models are illustrated by means of the dedicated open-source soft-
ware package mboost. This package implements functions which can
be used for model fitting, prediction and variable selection. It is flex-
ible, allowing for the implementation of new boosting algorithms op-
timizing user-specified loss functions.

1. Introduction. Freund and Schapire’s AdaBoost algorithm for clas-
sification [29–31] has attracted much attention in the machine learning com-
munity [cf. 76, and the references therein] as well as in related areas in statis-
tics [15, 16, 33]. Various versions of the AdaBoost algorithm have proven to
be very competitive in terms of prediction accuracy in a variety of applica-
tions. Boosting methods have been originally proposed as ensemble methods,
see Section 1.1, which rely on the principle of generating multiple predictions
and majority voting (averaging) among the individual classifiers.
Later, Breiman [15, 16] made a path-breaking observation that the Ada-
Boost algorithm can be viewed as a gradient descent algorithm in func-
tion space, inspired by numerical optimization and statistical estimation.
Moreover, Friedman et al. [33] laid out further important foundations which
linked AdaBoost and other boosting algorithms to the framework of statis-
tical estimation and additive basis expansion. In their terminology, boosting
is represented as “stagewise, additive modeling”: the word “additive” doesn’t
imply a model fit which is additive in the covariates, see our Section 4, but
refers to the fact that boosting is an additive (in fact, a linear) combination
of “simple” (function) estimators. Also Mason et al. [62] and Rätsch et al.
[70] developed related ideas which were mainly acknowledged in the machine
learning community.

Keywords and phrases: Generalized linear models, Generalized additive models, Gradient boosting, Survival analysis, Variable selection, Software

imsart-sts ver. 2005/10/19 file: BuehlmannHothorn_Boosting.tex date: June 4, 2007

In Hastie et al. [42], additional views on boosting are
given: in particular, the authors first pointed out the relation between boosting and ℓ1-penalized estimation. The insights of Friedman et al. [33] opened
new perspectives, namely to use boosting methods in many other contexts
than classification. We mention here boosting methods for regression (in-
cluding generalized regression) [22, 32, 71], for density estimation [73], for
survival analysis [45, 71] or for multivariate analysis [33, 59]. In quite a few
of these proposals, boosting is not only a black-box prediction tool but also
an estimation method for models with a specific structure such as linearity
or additivity [18, 22, 45]. Boosting can then be seen as an interesting reg-
ularization scheme for estimating a model. This statistical perspective will
drive the focus of our exposition of boosting.
We present here some coherent explanations and illustrations of concepts
about boosting, some derivations which are novel, and we aim to increase
the understanding of some methods and some selected known results. Be-
sides giving an overview on theoretical concepts of boosting as an algorithm
for fitting statistical models, we look at the methodology from a practical
point of view as well. The dedicated add-on package mboost [“model-based
boosting”, 43] to the R system for statistical computing [69] implements
computational tools which enable the data analyst to work with the theoretical concepts explained in this paper as closely as possible. The illustrations
presented throughout the paper focus on three regression problems with con-
tinuous, binary and censored response variables, some of them having a large
number of covariates. For each example, we only present the most important
steps of the analysis. The complete analysis is contained in a vignette as part
of the mboost package (see Appendix A) so that every result shown in this
paper is reproducible.
Unless stated differently, we assume that the data are realizations of ran-
dom variables

(X1 , Y1 ), . . . , (Xn , Yn )

from a stationary process with p-dimensional predictor variables Xi and one-
dimensional response variables Yi ; for the case of multivariate responses,
some references are given in Section 9.1. In particular, the setting above
includes independent, identically distributed (i.i.d.) observations. The gen-
eralization to stationary processes is fairly straightforward: the methods and
algorithms are the same as in the i.i.d. framework, but the mathematical
theory requires more elaborate techniques. Essentially, one needs to ensure
that some (uniform) laws of large numbers still hold, e.g., assuming station-
ary, mixing sequences: some rigorous results are given in [59] and [57].



1.1. Ensemble schemes: multiple prediction and aggregation. Ensemble
schemes construct multiple function estimates or predictions from re-weighted
data and use a linear (or sometimes convex) combination thereof for pro-
ducing the final, aggregated estimator or prediction.
First, we specify a base procedure which constructs a function estimate
ĝ(·) with values in R, based on some data (X1, Y1), . . . , (Xn, Yn):

   (X1, Y1), . . . , (Xn, Yn)  −−(base procedure)−→  ĝ(·).

For example, a very popular base procedure is a regression tree.
Then, generating an ensemble from the base procedures, i.e., an ensemble
of function estimates or predictions, works generally as follows:

   re-weighted data 1  −−(base procedure)−→  ĝ^[1](·)
   re-weighted data 2  −−(base procedure)−→  ĝ^[2](·)
   · · ·
   re-weighted data M  −−(base procedure)−→  ĝ^[M](·)

   aggregation:  f̂_A(·) = Σ_{m=1}^M α_m ĝ^[m](·).
What is termed “re-weighted data” here means that we assign individual
data weights to each of the n sample points. We have also implicitly
assumed that the base procedure allows for weighted fitting, i.e.,
estimation is based on a weighted sample. Throughout the paper (except in
Section 1.2), we assume that a base procedure estimate gˆ(·) is real-valued
(i.e., a regression procedure) making it more adequate for the “statistical
perspective” on boosting, in particular for the generic FGD algorithm in
Section 2.1.
The above description of an ensemble scheme is too general to be of any
direct use. The specification of the data re-weighting mechanism as well as
the form of the linear combination coefficients {α_m}_{m=1}^M are crucial, and various choices characterize different ensemble schemes. Most boosting methods
are special kinds of sequential ensemble schemes, where the data weights in
iteration m depend on the results from the previous iteration m − 1 only
(memoryless with respect to iterations m − 2, m − 3, . . .). Examples of other
ensemble schemes include bagging [14] or random forests [1, 17].
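In code, the aggregation step is just a weighted sum of the fitted base procedures. A minimal Python sketch (illustrative names, not tied to any package; sequential schemes such as boosting differ only in how the ĝ^[m] and α_m are produced):

```python
def ensemble_predict(x, base_fits, alphas):
    # f̂_A(x) = Σ_m α_m ĝ[m](x): linear combination of the M base-procedure fits
    return sum(a * g(x) for a, g in zip(alphas, base_fits))
```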

1.2. AdaBoost. The AdaBoost algorithm for binary classification [31] is
the most well known boosting algorithm. The base procedure is a classifier



with values in {0, 1} (slightly different from a real-valued function estimator
as assumed above), e.g., a classification tree.
AdaBoost algorithm

1. Initialize some weights for individual sample points: w_i^[0] = 1/n for
   i = 1, . . . , n. Set m = 0.

2. Increase m by 1. Fit the base procedure to the weighted data, i.e.,
   do a weighted fitting using the weights w_i^[m−1], yielding the classifier
   ĝ^[m](·).

3. Compute the weighted in-sample misclassification rate

   err^[m] = Σ_{i=1}^n w_i^[m−1] I(Y_i ≠ ĝ^[m](X_i)) / Σ_{i=1}^n w_i^[m−1],

   α^[m] = log((1 − err^[m]) / err^[m]),

   and up-date the weights

   w̃_i = w_i^[m−1] exp(α^[m] I(Y_i ≠ ĝ^[m](X_i))),
   w_i^[m] = w̃_i / Σ_{j=1}^n w̃_j.

4. Iterate steps 2 and 3 until m = mstop and build the aggregated classifier
   by weighted majority voting:

   f̂_AdaBoost(x) = argmax_{y∈{0,1}} Σ_{m=1}^{mstop} α^[m] I(ĝ^[m](x) = y).

By using the terminology mstop (instead of M as in the general description
of ensemble schemes), we emphasize here and later that the iteration process
should be stopped to avoid overfitting. It is a tuning parameter of AdaBoost
which may be selected using some cross-validation scheme.
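The steps above can be sketched in a few lines; the following is an illustrative Python toy (one-dimensional inputs, decision stumps as base procedure; all function names are ours, not from mboost):

```python
import math

def stump_fit(X, y, w):
    """Weighted best decision stump on a 1-D feature: a toy base procedure."""
    best = None
    for thr in sorted(set(X)):
        for sign in (1, -1):
            pred = [1 if (x >= thr) == (sign == 1) else 0 for x in X]
            err = sum(wi for wi, pi, yi in zip(w, pred, y) if pi != yi)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    _, thr, sign = best
    return lambda x: 1 if (x >= thr) == (sign == 1) else 0

def adaboost(X, y, mstop=10):
    """AdaBoost for labels y in {0, 1}, following steps 1-4 above."""
    n = len(X)
    w = [1.0 / n] * n                                   # step 1
    ensemble = []
    for _ in range(mstop):                              # steps 2-3
        g = stump_fit(X, y, w)
        err = sum(wi for wi, xi, yi in zip(w, X, y) if g(xi) != yi) / sum(w)
        err = min(max(err, 1e-10), 1 - 1e-10)           # guard log() at 0 or 1
        alpha = math.log((1 - err) / err)
        w = [wi * math.exp(alpha * (g(xi) != yi)) for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
        ensemble.append((alpha, g))
    def predict(x):                                     # step 4: weighted vote
        votes = {0: 0.0, 1: 0.0}
        for alpha, g in ensemble:
            votes[g(x)] += alpha
        return max(votes, key=votes.get)
    return predict
```

Misclassified points get their weights inflated by exp(α^[m]), so the next base procedure concentrates on them; mstop plays the role of the tuning parameter discussed above.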

1.3. Slow overfitting behavior. It was debated until about the year
2000 whether the AdaBoost algorithm is immune to overfitting when running
more iterations, i.e., whether stopping would be unnecessary. It is clear nowadays
that AdaBoost and also other boosting algorithms eventually overfit,
and early stopping (using a value of mstop before convergence of the
surrogate loss function, given in (3.3), takes place) is necessary [7, 51, 64].
We emphasize that this is not in contradiction to the experimental results



by Breiman [15] where the test set misclassification error still decreases after
the training misclassification error is zero (because the training error of the
surrogate loss function in (3.3) is not zero before numerical convergence).
Nevertheless, the AdaBoost algorithm is quite resistant to overfitting
(slow overfitting behavior) when increasing the number of iterations mstop .
This has been observed empirically, although some cases with clear overfit-
ting do occur for some datasets [64]. A stream of work has been devoted to
developing VC-type bounds for the generalization (out-of-sample) error to
explain why boosting overfits only very slowly. Schapire et al. [77] prove a
remarkable bound for the generalization misclassification error for classifiers
in the convex hull of a base procedure. This bound for the misclassifica-
tion error has been improved by Koltchinskii and Panchenko [53], deriving
also a generalization bound for AdaBoost which depends on the number of
boosting iterations.
It has been argued in [33, rejoinder] and [21] that the overfitting resis-
tance (slow overfitting behavior) is much stronger for the misclassification
error than many other loss functions such as the (out-of-sample) negative
log-likelihood (e.g., squared error in Gaussian regression). Thus, boosting’s
resistance to overfitting is coupled with the general fact that overfitting is
less of an issue for classification (i.e., the 0-1 loss function). Furthermore, it is
proved in [6] that the misclassification risk can be bounded by the risk of
the surrogate loss function: it demonstrates from a different perspective that
the 0-1 loss can exhibit quite a different behavior than the surrogate loss.
Finally, Section 5.1 develops the variance and bias for boosting when
utilized to fit a one-dimensional curve. Figure 5.1 illustrates the difference
between the boosting and the smoothing spline approach, and the eigen-
analysis of the boosting method (see Formula (5.2)) yields the following:
boosting’s variance increases with exponentially small increments while its
squared bias decreases exponentially fast as the number of iterations grows.
This also explains why boosting’s overfitting kicks in very slowly.

1.4. Historical remarks. The idea of boosting as an ensemble method for
improving the predictive performance of a base procedure seems to have its
roots in machine learning. Kearns and Valiant [52] proved that if individual
classifiers perform at least slightly better than guessing at random, their
predictions can be combined and averaged yielding much better predictions.
Later, Schapire [75] proposed a boosting algorithm with provable polynomial
run-time to construct such a better ensemble of classifiers. The AdaBoost
algorithm [29–31] is considered a first path-breaking step towards practically
feasible boosting algorithms.


The results from Breiman [15, 16], showing that boosting can be interpreted as a functional gradient descent algorithm, uncover older roots of boosting. In the context of regression, there is an immediate connection to the Gauss-Southwell algorithm [79] for solving a linear system of equations (see Section 4.1) and to Tukey’s [83] method of “twicing” (see Section 5.1).

2. Functional gradient descent. Breiman [15, 16] showed that the AdaBoost algorithm can be represented as a steepest descent algorithm in function space which we call functional gradient descent (FGD). Friedman et al. [33] and Friedman [32] then developed a more general, statistical framework which yields a direct interpretation of boosting as a method for function estimation. In their terminology, it is a “stagewise, additive modeling” approach (but the word “additive” doesn’t imply a model fit which is additive in the covariates, see Section 4). In the sequel, FGD and boosting are used as equivalent terminology for the same method or algorithm.

Consider the problem of estimating a real-valued function

(2.1)  f*(·) = argmin_{f(·)} E[ρ(Y, f(X))],

where ρ(·, ·) is a loss function which is typically assumed to be differentiable and convex with respect to the second argument. For example, the squared error loss ρ(y, f) = |y − f|² yields the well-known population minimizer f*(x) = E[Y | X = x].

2.1. The generic FGD or boosting algorithm. Estimation of f*(·) in (2.1) with boosting can be done by considering the empirical risk n^{-1} Σ_{i=1}^n ρ(Y_i, f(X_i)) and pursuing iterative steepest descent in function space. The following algorithm has been given by Friedman [32].

Generic FGD algorithm

1. Initialize f̂^[0](·) with an offset value. Common choices are

   f̂^[0](·) ≡ argmin_c n^{-1} Σ_{i=1}^n ρ(Y_i, c)

   or f̂^[0](·) ≡ 0. Set m = 0.

2. Increase m by 1. Compute the negative gradient −(∂/∂f) ρ(Y, f) and evaluate it at f̂^[m−1](X_i):

   U_i = −(∂/∂f) ρ(Y_i, f)|_{f = f̂^[m−1](X_i)},  i = 1, . . . , n.

3. Fit the negative gradient vector U_1, . . . , U_n to X_1, . . . , X_n by the real-valued (e.g., regression) base procedure:

   (X_i, U_i)_{i=1}^n  −−(base procedure)−→  ĝ^[m](·).

   Thus, ĝ^[m](·) can be viewed as an approximation of the negative gradient vector.

4. Up-date f̂^[m](·) = f̂^[m−1](·) + ν · ĝ^[m](·), where 0 < ν ≤ 1 is a step-length factor (see below).

5. Iterate steps 2 to 4 until m = mstop for some stopping iteration mstop.

The stopping iteration, which is the main tuning parameter, can be determined via cross-validation or some information criterion, see Section 5.4. The choice of the step-length factor ν in step 4 is of minor importance, as long as it is “small” such as ν = 0.1. A smaller value of ν typically requires a larger number of boosting iterations and thus more computing time, while the predictive accuracy has been empirically found to be potentially better and almost never worse when choosing ν “sufficiently small” (e.g., ν = 0.1) [32]. The latter statement is based on empirical evidence and some mathematical reasoning as described at the beginning of Section 7.1. Friedman [32] suggests to use an additional line search between steps 3 and 4 (in case of other loss functions ρ(·, ·) than squared error): it yields a slightly different algorithm but the additional line search seems unnecessary for achieving a good estimator f̂^[mstop].

2.1.1. Alternative formulation in function space. In steps 2 and 3 of the generic FGD algorithm, we associated with U_1, . . . , U_n a negative gradient vector. A reason for this can be seen from the following formulation in function space which is similar to the exposition in Mason et al. [62] and to the discussion in Ridgeway [72].

Consider the empirical risk functional C(f) = n^{-1} Σ_{i=1}^n ρ(Y_i, f(X_i)) and the usual inner product ⟨f, g⟩ = n^{-1} Σ_{i=1}^n f(X_i)g(X_i). We can then calculate the negative Gâteaux derivative dC(·) of the functional C(·),

   −dC(f)(x) = −(∂/∂α) C(f + α δ_x)|_{α=0},  f : R^p → R,  x ∈ R^p,

where δ_x denotes the delta- (or indicator-) function at x ∈ R^p. In particular, when evaluating the derivative −dC at f̂^[m−1] and X_i, we get

   −dC(f̂^[m−1])(X_i) = n^{-1} U_i,

with U_1, . . . , U_n exactly as in steps 2 and 3 of the generic FGD algorithm. Thus, the negative gradient vector U_1, . . . , U_n can be interpreted as a functional (Gâteaux) derivative evaluated at the data points.

We point out that the algorithm in Mason et al. [62] is different from the generic FGD method above: while the latter is fitting the negative gradient vector by the base procedure, typically using (nonparametric) least squares, Mason et al. [62] fit the base procedure by maximizing ⟨U, ĝ⟩ = n^{-1} Σ_{i=1}^n U_i ĝ(X_i). For certain base procedures, the two algorithms coincide. For example, if ĝ(·) is the componentwise linear least squares base procedure described in (4.1), it holds that n^{-1} Σ_{i=1}^n (U_i − ĝ(X_i))² = C − ⟨U, ĝ⟩, where C = n^{-1} Σ_{i=1}^n U_i² is a constant.

3. Some loss functions and boosting algorithms. Various boosting algorithms can be defined by specifying different (surrogate) loss functions ρ(·, ·). The mboost package provides an environment for defining loss functions via boost family objects, as exemplified below.

3.1. Binary classification. For binary classification, the response variable is Y ∈ {0, 1} with P[Y = 1] = p. Often, it is notationally more convenient to encode the response by Ỹ = 2Y − 1 ∈ {−1, +1} (this coding is used in mboost as well). We consider the negative binomial log-likelihood as loss function:

   −(y log(p) + (1 − y) log(1 − p)).

We parametrize p = exp(f)/(exp(f) + exp(−f)) so that f = log(p/(1 − p))/2 equals half of the log-odds ratio; the factor 1/2 is a bit unusual but it will enable that the population minimizer of the loss in (3.1) is the same as for the exponential loss in (3.3) below. Then, the negative log-likelihood is

   log(1 + exp(−2ỹf)).

By scaling, we prefer to use the equivalent loss function

(3.1)  ρ_log-lik(ỹ, f) = log₂(1 + exp(−2ỹf)),

which then becomes an upper bound of the misclassification error, see Figure 1. In mboost, the negative gradient of this loss function is implemented in a function Binomial() returning an object of class boost family which contains the negative gradient function as a slot (assuming a binary response variable y ∈ {−1, +1}).
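The generic FGD algorithm of Section 2.1 works for any such pairing of a loss with its negative gradient. A minimal Python sketch (hypothetical helper names, not the mboost implementation; fit_base stands in for an arbitrary real-valued base procedure):

```python
def fgd_boost(X, y, neg_gradient, fit_base, offset, nu=0.1, mstop=100):
    """Generic functional gradient descent (FGD) boosting, as in Section 2.1.

    neg_gradient(y_i, f_i): the negative gradient -d rho(y_i, f)/df at f = f_i.
    fit_base(X, U): returns a fitted base procedure g, callable on a single x.
    """
    n = len(X)
    f = [offset] * n                      # step 1: f̂[0] ≡ offset
    base_fits = []
    for _ in range(mstop):                # steps 2-5
        U = [neg_gradient(y[i], f[i]) for i in range(n)]   # step 2
        g = fit_base(X, U)                # step 3: fit U to X
        base_fits.append(g)
        for i in range(n):                # step 4: f̂[m] = f̂[m-1] + nu * ĝ[m]
            f[i] += nu * g(X[i])
    def predict(x):
        return offset + nu * sum(g(x) for g in base_fits)
    return predict
```

With neg_gradient(y, f) = y − f (squared error up to the factor 1/2, see Section 3.3) the loop reduces to repeated residual fitting.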

The population minimizer can be shown to be [33]

   f*_log-lik(x) = (1/2) log(p(x)/(1 − p(x))),  p(x) = P[Y = 1 | X = x].

The loss function in (3.1) is a function of ỹf, the so-called margin value, where the function f induces the following classifier for Y:

   C(x) = 1 if f(x) > 0,
   C(x) = 0 if f(x) < 0,
   C(x) undetermined if f(x) = 0.

Hence, a misclassification (including the undetermined case) happens if and only if Ỹ f(X) ≤ 0. Therefore, the misclassification loss is

(3.2)  ρ_0-1(y, f) = I{ỹf ≤ 0},

whose population minimizer is equivalent to the Bayes classifier (for Ỹ ∈ {−1, +1})

   f*_0-1(x) = +1 if p(x) > 1/2,  −1 if p(x) ≤ 1/2,

where p(x) = P[Y = 1 | X = x]. Note that the 0-1 loss in (3.2) cannot be used for boosting or FGD: it is non-differentiable and also non-convex as a function of the margin value ỹf. The negative log-likelihood loss in (3.1) can be viewed as a convex upper approximation of the (computationally intractable) non-convex 0-1 loss, see Figure 1. We will describe in Section 3.3 the BinomialBoosting algorithm (similar to LogitBoost [33]) which uses the negative log-likelihood as loss function (i.e., the surrogate loss which is the implementing loss function for the algorithm).

Another upper convex approximation of the 0-1 loss function in (3.2) is the exponential loss

(3.3)  ρ_exp(y, f) = exp(−ỹf),

implemented (with notation y ∈ {−1, +1}) in mboost as the AdaExp() family. The population minimizer can be shown to be the same as for the log-likelihood loss [33]:

   f*_exp(x) = (1/2) log(p(x)/(1 − p(x))),  p(x) = P[Y = 1 | X = x].
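That the population minimizer of the exponential loss equals half the log-odds can also be checked numerically: for fixed p, E[exp(−Ỹ f)] = p·e^{−f} + (1 − p)·e^{f} is minimized at f = log(p/(1 − p))/2. An illustrative sketch:

```python
import math

def expected_exp_loss(f, p):
    # E[exp(-Ỹ f)] for Ỹ ∈ {-1, +1} with P[Ỹ = +1] = p
    return p * math.exp(-f) + (1 - p) * math.exp(f)

def f_star(p):
    # claimed population minimizer: half the log-odds
    return 0.5 * math.log(p / (1 - p))
```

Evaluating expected_exp_loss on a grid of candidate values around f_star(p) confirms the minimum is attained there.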

Using functional gradient descent with different (surrogate) loss functions yields different boosting algorithms. When using the log-likelihood loss in (3.1), we obtain LogitBoost [33] or BinomialBoosting from Section 3.3; with the exponential loss in (3.3), we essentially get AdaBoost [30] from Section 1.2.

We interpret the boosting estimate f̂^[m](·) as an estimate of the population minimizer f*(·). Thus, the outputs from AdaBoost, Logit- or BinomialBoosting are estimates of half of the log-odds ratio. In particular, we define probability estimates via

   p̂^[m](x) = exp(f̂^[m](x)) / (exp(f̂^[m](x)) + exp(−f̂^[m](x))).

The reason for constructing these probability estimates is based on the fact that boosting with a suitable stopping iteration is consistent [7, 51]. Some cautionary remarks about this line of argumentation are presented by Mease et al. [64].

Very popular in machine learning is the hinge function, the standard loss function for support vector machines:

   ρ_SVM(y, f) = [1 − ỹf]₊,

where [x]₊ = x·I{x > 0} denotes the positive part. It is also an upper convex bound of the misclassification error, see Figure 1. Its population minimizer is

   f*_SVM(x) = sign(p(x) − 1/2),

which is the Bayes classifier for Ỹ ∈ {−1, +1}. Since f*_SVM(·) is a classifier and a non-invertible function of p(x), there is no direct way to obtain conditional class probability estimates.

3.2. Regression. For regression with response Y ∈ R, we most often use the squared error loss (scaled by the factor 1/2 such that the negative gradient vector equals the residuals, see Section 3.3 below),

(3.4)  ρ_L2(y, f) = (1/2)|y − f|²,

with population minimizer f*_L2(x) = E[Y | X = x].
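The conversion from a fitted margin function to class probabilities is a one-liner; a quick Python check of the display above (illustrative only):

```python
import math

def prob_from_margin(f):
    # p̂(x) = exp(f̂(x)) / (exp(f̂(x)) + exp(-f̂(x))), i.e. the logistic map of 2 f̂(x)
    return math.exp(f) / (math.exp(f) + math.exp(-f))
```

Since f̂ estimates half the log-odds, feeding f = log(p/(1 − p))/2 back in recovers p.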

Fig 1. Losses, as functions of the margin value ỹf = (2y − 1)f, for binary classification. Left panel with monotone loss functions: 0-1 loss, exponential loss, negative log-likelihood, hinge loss (SVM); right panel with non-monotone loss functions: squared error (L2) and absolute error (L1) as in (3.5).

The corresponding boosting algorithm is L2Boosting, see Friedman [32] and Bühlmann and Yu [22]. It is described in more detail in Section 3.3. This loss function is available in mboost as family GaussReg().

Alternative loss functions which have some robustness properties (with respect to the error distribution, i.e., in “Y-space”) include the L1- and Huber-loss. The former is

   ρ_L1(y, f) = |y − f|,

with population minimizer f*(x) = median(Y | X = x), and is implemented in mboost as Laplace(). Although the L1-loss is not differentiable at the point y = f, we can compute partial derivatives since the single point y = f (usually) has probability zero to be realized by the data. A compromise between the L1- and L2-loss is the Huber-loss function from robust statistics:

   ρ_Huber(y, f) = |y − f|²/2        if |y − f| ≤ δ,
                   δ(|y − f| − δ/2)  if |y − f| > δ,

which is available in mboost as Huber(). A strategy for choosing (a changing) δ adaptively has been proposed by Friedman [32]:

   δ_m = median({|Y_i − f̂^[m−1](X_i)|; i = 1, . . . , n}),

where the previous fit f̂^[m−1](·) is used.

3.2.1. Connections to binary classification. Motivated from the population point of view, the L2- or L1-loss can also be used for binary classification. For Y ∈ {0, 1}, the population minimizers are

   f*_L2(x) = E[Y | X = x] = p(x) = P[Y = 1 | X = x],
   f*_L1(x) = median(Y | X = x) = 1 if p(x) > 1/2,  0 if p(x) ≤ 1/2.

Thus, the population minimizer of the L1-loss is the Bayes classifier.

Moreover, both the L1- and L2-loss functions can be parametrized as functions of the margin value ỹf (ỹ ∈ {−1, +1}):

   |ỹ − f| = |1 − ỹf|,
(3.5)  |ỹ − f|² = |1 − ỹf|² = 1 − 2ỹf + (ỹf)².

The L2-loss for classification (with response variable y ∈ {−1, +1}) is implemented in GaussClass(). The L1- and L2-loss functions are non-monotone functions of the margin value ỹf, see Figure 1. A negative aspect is that they penalize margin values which are greater than 1: penalizing large margin values can be seen as a way to encourage solutions f̂ ∈ [−1, 1], which is the range of the population minimizers f*_L1 and f*_L2 (for Ỹ ∈ {−1, +1}), respectively. However, as discussed below, we prefer to use monotone loss functions.

All loss functions mentioned for binary classification (displayed in Figure 1) can be viewed and interpreted from the perspective of proper scoring rules, cf. Buja et al. [24]. We usually prefer the negative log-likelihood loss in (3.1) because (i) it yields probability estimates; (ii) it is a monotone loss function of the margin value ỹf; and (iii) it grows linearly as the margin value ỹf tends to −∞, unlike the exponential loss in (3.3). The third point reflects a robustness aspect: it is similar to Huber’s loss function which also penalizes large values linearly (instead of quadratically as with the L2-loss).

3.3. Two important boosting algorithms. Table 1 summarizes the most popular loss functions and their corresponding boosting algorithms. We now describe the two algorithms appearing in the last two rows of Table 1 in more detail.
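The robustness point about linear versus exponential growth of the losses for strongly negative margins is easy to verify numerically: the negative log-likelihood loss (3.1) grows roughly linearly at rate 2/log 2 per unit of margin, while the exponential loss (3.3) blows up exponentially. A small illustrative check:

```python
import math

def rho_loglik(margin):
    # log2(1 + exp(-2 * margin)): the negative log-likelihood loss (3.1)
    return math.log2(1 + math.exp(-2 * margin))

def rho_exp(margin):
    # exp(-margin): the exponential loss (3.3)
    return math.exp(-margin)
```

At a margin of −5, rho_loglik is close to 10/log 2 ≈ 14.4, whereas rho_exp is already e^5 ≈ 148.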

range spaces          ρ(y, f)                     f*(x)                        algorithm
y ∈ {0, 1}, f ∈ R     exp(−(2y − 1)f)             (1/2) log(p(x)/(1 − p(x)))   AdaBoost
y ∈ {0, 1}, f ∈ R     log₂(1 + e^{−2(2y−1)f})     (1/2) log(p(x)/(1 − p(x)))   LogitBoost / BinomialBoosting
y ∈ R, f ∈ R          |y − f|²/2                  E[Y | X = x]                 L2Boosting

Table 1
Various loss functions ρ(y, f), population minimizers f*(x) and names of corresponding boosting algorithms; p(x) = P[Y = 1 | X = x].

3.3.1. L2Boosting. L2Boosting is the simplest and perhaps most instructive boosting algorithm. It is very useful for regression, in particular in the presence of very many predictor variables. Applying the general description of the FGD-algorithm from Section 2.1 to the squared error loss function ρ_L2(y, f) = |y − f|²/2, we obtain the following algorithm.

L2Boosting algorithm

1. Initialize f̂^[0](·) with an offset value. The default value is f̂^[0](·) ≡ Ȳ. Set m = 0.

2. Increase m by 1. Compute the residuals U_i = Y_i − f̂^[m−1](X_i) for i = 1, . . . , n.

3. Fit the residual vector U_1, . . . , U_n to X_1, . . . , X_n by the real-valued (e.g., regression) base procedure:

   (X_i, U_i)_{i=1}^n  −−(base procedure)−→  ĝ^[m](·).

4. Up-date f̂^[m](·) = f̂^[m−1](·) + ν · ĝ^[m](·), where 0 < ν ≤ 1 is a step-length factor (as in the general FGD-algorithm).

5. Iterate steps 2 to 4 until m = mstop for some stopping iteration mstop.

The stopping iteration mstop is the main tuning parameter which can be selected using cross-validation or some information criterion as described in Section 5.4. The derivation from the generic FGD algorithm in Section 2.1 is straightforward: note that the negative gradient vector becomes the residual vector. Thus, L2Boosting amounts to refitting residuals multiple times. Tukey [83] recognized this to be useful and proposed “twicing”, which is nothing else than L2Boosting using mstop = 2 (and ν = 1).

3.3.2. BinomialBoosting: the FGD version of LogitBoost. We already gave some reasons at the end of Section 3.1 why the negative log-likelihood loss function in (3.1) is very useful for binary classification problems. Friedman et al. [33] were first in advocating this, and they proposed LogitBoost,

which is very similar to the generic FGD algorithm when using the loss from (3.1): the deviation from FGD is the use of Newton’s method involving the Hessian matrix (instead of a step-length for the gradient). For the sake of coherence with the generic functional gradient descent algorithm in Section 2.1, we describe here a version of LogitBoost; to avoid conflicting terminology, we coin it BinomialBoosting.

BinomialBoosting algorithm

Apply the generic FGD algorithm from Section 2.1 using the loss function ρ_log-lik from (3.1). The default offset value is f̂^[0](·) ≡ log(p̂/(1 − p̂))/2, where p̂ is the relative frequency of Y = 1.

With BinomialBoosting, there is no need that the base procedure is able to do weighted fitting: this constitutes a slight difference to the requirement for LogitBoost [33].

3.4. Other data structures and models. Due to the generic nature of boosting or functional gradient descent, we can use the technique in very many other settings. For data with univariate responses and loss functions which are differentiable with respect to the second argument, the boosting algorithm is described in Section 2.1. Survival analysis is an important area of application with censored observations: we describe in Section 8 how to deal with it.

4. Choosing the base procedure. Every boosting algorithm requires the specification of a base procedure. This choice can be driven by the aim of optimizing the predictive capacity only, or by considering some structural properties of the boosting estimate in addition. We find the latter usually more interesting as it allows for better interpretation of the resulting model.

We recall that the generic boosting estimator is a sum of base procedure estimates

   f̂^[m](·) = ν Σ_{k=1}^m ĝ^[k](·).

Therefore, structural properties of the boosting function estimator are induced by a linear combination of structural characteristics of the base procedure. The following important examples of base procedures yield useful structures for the boosting estimator f̂^[m](·). The notation is as follows: ĝ(·) is an estimate from a base procedure which is based on data (X_1, U_1), . . . , (X_n, U_n), where (U_1, . . . , U_n) denotes the current negative gradient. In the sequel, the jth component of a vector c will be denoted by c^(j).

4.1. Componentwise linear least squares for linear models. Boosting can be very useful for fitting potentially high-dimensional generalized linear models. Consider the base procedure

(4.1)  ĝ(x) = β̂^(Ŝ) x^(Ŝ),
       β̂^(j) = Σ_{i=1}^n X_i^(j) U_i / Σ_{i=1}^n (X_i^(j))²,
       Ŝ = argmin_{1≤j≤p} Σ_{i=1}^n (U_i − β̂^(j) X_i^(j))².

It selects the best variable in a simple linear model in the sense of ordinary least squares fitting. When using L2Boosting with this base procedure, we select in every iteration one predictor variable, not necessarily a different one for each iteration, and we up-date the function linearly:

   f̂^[m](x) = f̂^[m−1](x) + ν β̂^(Ŝ_m) x^(Ŝ_m),

where Ŝ_m denotes the index of the selected predictor variable in iteration m. Alternatively, the up-date of the coefficient estimates is

   β̂^[m] = β̂^[m−1] + ν · β̂^(Ŝ_m).

The notation should be read that only the Ŝ_m-th component of the coefficient estimate β̂^[m] (in iteration m) has been up-dated. For every iteration m, we obtain a linear model fit. As m tends to infinity, f̂^[m](·) converges to a least squares solution which is unique if the design matrix has full rank p ≤ n. The method is also known as matching pursuit in signal processing [60] and as the weak greedy algorithm in computational mathematics [81], and it is a Gauss-Southwell algorithm [79] for solving a linear system of equations. We will discuss more properties of L2Boosting with componentwise linear least squares in Section 5.2.

When using BinomialBoosting with componentwise linear least squares from (4.1), we obtain a fit, including variable selection, of a linear logistic regression model.

As will be discussed in more detail in Section 5.2, boosting typically shrinks the (logistic) regression coefficients towards zero. Usually, we do not want to shrink the intercept term. In addition, we advocate to use boosting on mean centered predictor variables X̃_i^(j) = X_i^(j) − X̄^(j). In the case of a linear model, when centering also the response Ỹ_i = Y_i − Ȳ, this becomes

   Ỹ_i = Σ_{j=1}^p β^(j) X̃_i^(j) + noise_i,

which forces the regression surface through the center (x̃^(1), ..., x̃^(p), ỹ) = (0, ..., 0), as with ordinary least squares. Note that it is not necessary to center the response variables when using the default offset value f̂^[0] = Ȳ in L2Boosting (for BinomialBoosting, we would center the predictor variables only but never the response, and we would use f̂^[0] ≡ argmin_c n^{-1} Σ_{i=1}^n ρ(Y_i, c)).

Illustration: Prediction of total body fat. Garcia et al. [34] report on the development of predictive regression equations for body fat content by means of p = 9 common anthropometric measurements which were obtained for n = 71 healthy German women. In addition, the women's body composition was measured by Dual Energy X-Ray Absorptiometry (DXA). This reference method is very accurate in measuring body fat but finds little applicability in practical environments, mainly because of high costs and the methodological efforts needed. Therefore, a simple regression equation for predicting DXA measurements of body fat is of special interest for the practitioner. Backward-elimination was applied to select important variables from the available anthropometrical measurements, and Garcia et al. [34] report a final linear model utilizing hip circumference, knee breadth and a compound covariate which is defined as the sum of log chin skinfold, log triceps skinfold and log subscapular skinfold:

R> bf_lm <- lm(DEXfat ~ hipcirc + kneebreadth + anthro3a, data = bodyfat)
R> coef(bf_lm)
(Intercept)     hipcirc kneebreadth    anthro3a
  -75.23478     0.51153     1.90199     8.90964

A simple regression formula which is easy to communicate, such as a linear combination of only a few covariates, is of special interest in this application: we employ the glmboost function from package mboost to fit a linear regression model by means of L2Boosting with componentwise linear least squares. We center the covariates prior to model fitting in addition. By default, the function glmboost fits a linear model (with initial mstop = 100 and shrinkage parameter ν = 0.1) by minimizing squared error (argument family = GaussReg() is the default):

R> bf_glm <- glmboost(DEXfat ~ ., data = bodyfat,
+      control = boost_control(center = TRUE))

Note that, by default, the mean of the response variable is used as an offset in the first step of the boosting algorithm. As mentioned above, the special form of the base learner, i.e., componentwise linear least squares, allows for a reformulation

of the boosting fit in terms of a linear combination of the covariates which can be assessed via

R> coef(bf_glm)
 (Intercept)          age    waistcirc      hipcirc
    0.000000     0.013602     0.189716     0.351626
elbowbreadth  kneebreadth     anthro3a     anthro3b
   -0.384140     1.736589     3.326860     3.656524
    anthro3c      anthro4
    0.595363     0.000000
attr(,"offset")
[1] 30.783

We notice that most covariates have been used for fitting and thus no extensive variable selection was performed in the above model. Thus, we need to investigate how many boosting iterations are appropriate. Resampling methods such as cross-validation or the bootstrap can be used to estimate the out-of-sample error for a varying number of boosting iterations. The out-of-bootstrap mean squared error for 100 bootstrap samples is depicted in the upper part of Figure 2. The plot leads to the impression that approximately mstop = 44 would be a sufficient number of boosting iterations. In Section 5.4, a corrected version of the Akaike information criterion (AIC) is proposed for determining the optimal number of boosting iterations; see the bottom part of Figure 2 in addition. This criterion attains its minimum for

R> mstop(aic <- AIC(bf_glm))
[1] 45

boosting iterations. The coefficients of the linear model with mstop = 45 boosting iterations are

R> coef(bf_glm[mstop(aic)])
 (Intercept)          age    waistcirc      hipcirc
   0.0000000    0.0023271    0.1893046    0.3488781
elbowbreadth  kneebreadth     anthro3a     anthro3b
   0.0000000    1.5217686    3.3268603    3.6051548
    anthro3c      anthro4
   0.5043133    0.0000000
attr(,"offset")
[1] 30.783

and thus 7 covariates have been selected for the final model (intercept equal to zero occurs here for mean centered response and predictors and hence, n^{-1} Σ_{i=1}^n Y_i = 30.783 is the intercept in the uncentered model). Note that the variables hipcirc, kneebreadth and anthro3a, which we have used for

Fig 2. bodyfat data: Out-of-bootstrap squared error for varying number of boosting iterations mstop (top). The dashed horizontal line depicts the average out-of-bootstrap error of the linear model for the pre-selected variables hipcirc, kneebreadth and anthro3a fitted via ordinary least squares. The lower part shows the corrected AIC criterion.

fitting a linear model at the beginning of this paragraph, have been selected by the boosting algorithm as well.

4.2. Componentwise smoothing spline for additive models. Additive and generalized additive models, introduced by Hastie and Tibshirani [40] (see also [41]), have become very popular for adding more flexibility to the linear structure in generalized linear models. Such flexibility can also be added in boosting (whose framework is especially useful for high-dimensional problems). We can choose a nonparametric base procedure for function estimation. Suppose that f̂^(j)(·) is a least squares cubic smoothing spline estimate based on

(4.2)  U_1, ..., U_n against X_1^(j), ..., X_n^(j) with fixed degrees of freedom df.

That is,

(4.3)  f̂^(j)(·) = argmin_{f(·)} Σ_{i=1}^n (U_i − f(X_i^(j)))² + λ ∫ (f''(x))² dx,

where λ > 0 is a tuning parameter such that the trace of the corresponding hat matrix equals df. For further details, we refer to Green and Silverman [36]. As a note of caution, we use in the sequel the terminology of "hat matrix" in a broad sense: it is a linear operator but not a projection in general.

The base procedure is then

  ĝ(x) = f̂^(Ŝ)(x^(Ŝ)),
  f̂^(j)(·) as above and Ŝ = argmin_{1≤j≤p} Σ_{i=1}^n (U_i − f̂^(j)(X_i^(j)))²,

where the degrees of freedom df are the same for all f̂^(j)(·).

L2Boosting with componentwise smoothing splines yields an additive model, including variable selection, i.e., a fit which is additive in the predictor variables. This can be seen immediately since L2Boosting proceeds additively for up-dating the function f̂^[m](·), see Section 3.1. We can normalize to obtain the following additive model estimator:

  f̂^[m](x) = μ̂ + Σ_{j=1}^p f̂^[m],(j)(x^(j)),
  n^{-1} Σ_{i=1}^n f̂^[m],(j)(X_i^(j)) = 0 for all j = 1, ..., p.
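A minimal numerical sketch of this componentwise fitting (illustrative Python on simulated data; a row-normalized Gaussian kernel smoother stands in for the cubic smoothing spline of (4.3), and all names are ours): each iteration smooths the residuals against one selected predictor and adds that smooth to the predictor's partial function, so the fit remains additive by construction.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 150, 4
X = rng.uniform(-2, 2, size=(n, p))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=n)

# one linear smoother ("hat matrix") per predictor; a kernel smoother
# stands in here for the least squares cubic smoothing spline of (4.3)
h = 0.3
S = [np.exp(-0.5 * ((X[:, j][:, None] - X[:, j][None, :]) / h) ** 2)
     for j in range(p)]
S = [K / K.sum(axis=1, keepdims=True) for K in S]

nu, mstop = 0.1, 200
offset = y.mean()
partial = np.zeros((p, n))          # f^[m],(j) evaluated at the data points
f = np.full(n, offset)
for _ in range(mstop):
    u = y - f
    fits = [K @ u for K in S]       # componentwise smooths of the residuals
    rss = [np.sum((u - g) ** 2) for g in fits]
    s = int(np.argmin(rss))         # selected predictor
    partial[s] += nu * fits[s]
    f = f + nu * fits[s]

# the result is additive: an offset plus one partial function per predictor
assert np.allclose(f, offset + partial.sum(axis=0))
```

The rows of `partial` are the (unnormalized) partial contributions; centering them to mean zero and moving the means into the offset gives the normalized form displayed above.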

As with the componentwise linear least squares base procedure, we can use componentwise smoothing splines also in BinomialBoosting, yielding an additive logistic regression fit.

Componentwise smoothing splines can be generalized to pairwise smoothing splines which searches for and fits the best pairs of predictor variables such that a smooth of U_1, ..., U_n against this pair of predictors reduces the residual sum of squares most. With L2Boosting, this yields a nonparametric model fit with first order interaction terms. The procedure has been empirically demonstrated to be often much better than fitting with MARS [23].

The degrees of freedom in the smoothing spline base procedure should be chosen "small" such as df = 4. This yields low variance but typically large bias of the base procedure. The bias can then be reduced by additional boosting iterations. This choice of low variance but high bias has been analyzed in Bühlmann and Yu [22], see also Section 4.4.

Illustration: Prediction of total body fat (cont.). Being more flexible than the linear model which we fitted to the bodyfat data in Section 4.1, we estimate an additive model using the gamboost function from mboost (first with pre-specified mstop = 100 boosting iterations, ν = 0.1 and squared error loss):

R> bf_gam <- gamboost(DEXfat ~ ., data = bodyfat)

The degrees of freedom in the componentwise smoothing spline base procedure can be defined by the dfbase argument, defaulting to 4.

We can estimate the number of boosting iterations mstop using the corrected AIC criterion described in Section 5.4 via

R> mstop(aic <- AIC(bf_gam))
[1] 46

Similar to the linear regression model, the partial contributions of the covariates can be extracted from the boosting fit. For the most important variables, the partial fits are given in Figure 3, showing some slight non-linearity, mainly for kneebreadth.

4.3. Trees. In the machine learning community, regression trees are the most popular base procedures. They have the advantage to be invariant under monotone transformations of predictor variables, i.e., we do not need to search for good data transformations. Moreover, regression trees handle covariates measured at different scales (continuous, ordinal or nominal variables) in a unified way; unbiased split or variable selection in the context of different scales is proposed in [47].
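For concreteness, a minimal decision stump base procedure, i.e., a regression tree with two terminal nodes, can be sketched as follows (illustrative Python with an exhaustive split search; this is a sketch only, not the conditional inference trees used by mboost):

```python
import numpy as np

def stump(X, u):
    """Decision stump base procedure: the best single split on a single
    predictor, fitting one mean on each side (a regression tree with
    two terminal nodes)."""
    best_rss, best = np.inf, None
    for j in range(X.shape[1]):
        for c in np.unique(X[:, j])[:-1]:       # candidate split points
            left = X[:, j] <= c
            ml, mr = u[left].mean(), u[~left].mean()
            rss = np.sum((u - np.where(left, ml, mr)) ** 2)
            if rss < best_rss:
                best_rss, best = rss, (j, c, ml, mr)
    j, c, ml, mr = best
    return lambda Xnew: np.where(Xnew[:, j] <= c, ml, mr)
```

A stump is invariant under monotone transformations of the predictors: only the ordering of each column enters the split search.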

Fig 3. bodyfat data: Partial contributions of four covariates (hipcirc, waistcirc, kneebreadth, anthro3b) in an additive model (without centering of estimated functions to mean zero).

When using stumps, i.e., a tree with two terminal nodes only, the boosting estimate will be an additive model in the original predictor variables, because every stump-estimate is a function of a single predictor variable only. Similarly, boosting trees with (at most) d terminal nodes results in a nonparametric model having at most interactions of order d − 2. Therefore, if we want to constrain the degree of interactions, we can easily do this by constraining the (maximal) number of nodes in the base procedure.

Illustration: Prediction of total body fat (cont.). Both the gbm package [74] and the mboost package are helpful when decision trees are to be used as base procedures. In mboost, the function blackboost implements boosting

for fitting such classical black-box models:

R> bf_black <- blackboost(DEXfat ~ ., data = bodyfat,
+      control = boost_control(mstop = 500))

Conditional inference trees [47], as available from the party package [46], are utilized as base procedures; here, the function boost_control defines the number of boosting iterations mstop. Alternatively, we can use the function gbm from the gbm package, which yields roughly the same fit as can be seen from Figure 4.

Fig 4. bodyfat data: Fitted values of both the gbm and mboost implementations of L2Boosting with different regression trees as base learners.

4.4. The low-variance principle. We have seen above that the structural properties of a boosting estimate are determined by the choice of a base procedure. In our opinion, the structure specification should come first. After having made a choice, the question becomes how "complex" the base procedure should be. For example, how should we choose the degrees of freedom for the componentwise smoothing spline in (4.2)? A general answer is: choose the base procedure (having the desired structure) with low variance at the price of larger estimation bias. For the componentwise smoothing splines,

this would imply a low number of degrees of freedom, e.g., df = 4, implying low variance but potentially large estimation bias. We give some reasons for the low-variance principle in Section 5.1 (Replica 1). Moreover, it has been demonstrated in Friedman [32] that a small step-size factor ν can often be beneficial and almost never yields substantially worse predictive performance of boosting estimates. Note that a small step-size factor can be seen as a shrinkage of the base procedure by the factor ν, implying low variance but potentially large estimation bias.

5. L2Boosting. As described already in Section 3.3, L2Boosting is functional gradient descent using the squared error loss, which amounts to repeated fitting of ordinary residuals. Here, we aim at increasing the understanding of the simple L2Boosting algorithm. We first start with a toy problem of curve estimation, and we will then illustrate concepts and results which are especially useful for high-dimensional data. These can serve as heuristics for boosting algorithms with other convex loss functions for problems in, e.g., classification or survival analysis.

5.1. Nonparametric curve estimation: from basics to asymptotic optimality. Consider the toy problem of estimating a regression function E[Y|X = x] with one-dimensional predictor X ∈ R and a continuous response Y ∈ R.

Consider the case with a linear base procedure having a hat matrix H: R^n → R^n, mapping the response variables Y = (Y_1, ..., Y_n)^T to their fitted values (f̂(X_1), ..., f̂(X_n))^T. Examples include nonparametric kernel smoothers or smoothing splines. It is easy to show that the hat matrix of the L2Boosting fit (for simplicity, with f̂^[0] ≡ 0 and ν = 1) in iteration m equals:

(5.1)  B_m = B_{m−1} + H(I − B_{m−1}) = I − (I − H)^m.

Formula (5.1) allows for several insights. First, if the base procedure satisfies ‖I − H‖ < 1 for a suitable norm, i.e., has a "learning capacity" such that the residual vector is shorter than the input-response vector, we see that B_m converges to the identity I as m → ∞, and B_m Y converges to the fully saturated model Y, interpolating the response variables exactly. Thus, we see here explicitly that we have to stop early with the boosting iterations in order to prevent over-fitting.

When specializing to the case of a cubic smoothing spline base procedure (cf. (4.3)), it is useful to invoke some eigen-analysis. The spectral representation is

  H = U D U^T,  U^T U = U U^T = I,  D = diag(λ_1, ..., λ_n),

where λ_1 ≥ λ_2 ≥ ... ≥ λ_n denote the (ordered) eigenvalues of H. It then follows with (5.1) that

  B_m = U D_m U^T,  D_m = diag(d_{1,m}, ..., d_{n,m}),  d_{i,m} = 1 − (1 − λ_i)^m.

It is well known that a smoothing spline satisfies λ_1 = λ_2 = 1 and 0 < λ_i < 1 (i = 3, ..., n). Therefore, the eigenvalues of the boosting hat operator (matrix) in iteration m satisfy:

(5.2)  d_{1,m} ≡ d_{2,m} ≡ 1 for all m,
(5.3)  0 < d_{i,m} = 1 − (1 − λ_i)^m < 1 (i = 3, ..., n),  d_{i,m} → 1 (m → ∞).

When comparing the spectrum, i.e., the set of eigenvalues, of a smoothing spline with its boosted version, we have the following. For both cases, the largest two eigenvalues are equal to 1. Moreover, all other eigenvalues can be changed by either varying the degrees of freedom df = Σ_{i=1}^n λ_i in a single smoothing spline, or by varying the boosting iteration m with some fixed (low-variance) smoothing spline base procedure having fixed (low) values λ_i. In Figure 5 we demonstrate the difference between the two approaches for changing the "complexity" of the estimated curve fit by means of a toy example first shown in [22]. Both methods have about the same minimum mean squared error, but L2Boosting over-fits much more slowly than a single smoothing spline.

By careful inspection of the eigen-analysis for this simple case of boosting a smoothing spline, Bühlmann and Yu [22] proved an asymptotic minimax rate result:

Replica 1. [22] When stopping the boosting iterations appropriately, i.e., mstop = m_n = O(n^{4/(2ξ+1)}), m_n → ∞ (n → ∞) with ξ ≥ 2 as below, L2Boosting with cubic smoothing splines having fixed degrees of freedom achieves the minimax convergence rate over Sobolev function classes of smoothness degree ξ ≥ 2, as n → ∞.

Two items are interesting. First, minimax rates are achieved by using a base procedure with fixed degrees of freedom, which means low variance from an asymptotic perspective. Secondly, L2Boosting with cubic smoothing splines has the capability to adapt to higher order smoothness of the true underlying function: thus, with the stopping iteration as the one and

only tuning parameter, we can nevertheless adapt to any higher-order degree of smoothness (without the need of choosing a higher order spline base procedure). Recently, asymptotic convergence and minimax rate results have been established for early-stopped boosting in more general settings [10, 91].

Fig 5. Mean squared prediction error E[(f(X) − f̂(X))²] for the regression model Y_i = 0.8 X_i + sin(6 X_i) + ε_i (i = 1, ..., n = 100), with ε ∼ N(0, 2), X_i ∼ U(−1/2, 1/2), averaged over 100 simulation runs. Left: L2Boosting with smoothing spline base procedure (having fixed degrees of freedom df = 4) and using ν = 0.1, for varying number of boosting iterations. Right: single smoothing spline with varying degrees of freedom.

5.1.1. L2Boosting using kernel estimators. As we have pointed out in Replica 1, L2Boosting of smoothing splines can achieve faster mean squared error convergence rates than the classical O(n^{−4/5}), assuming that the true underlying function is sufficiently smooth. We illustrate here a related phenomenon with kernel estimators.

We consider fixed, univariate design points x_i = i/n (i = 1, ..., n) and the Nadaraya-Watson kernel estimator for the nonparametric regression function E[Y|X = x]:

  ĝ(x; h) = (nh)^{−1} Σ_{i=1}^n K((x − x_i)/h) Y_i = n^{−1} Σ_{i=1}^n K_h(x − x_i) Y_i,

where h > 0 is the bandwidth, K(·) a kernel in the form of a probability density which is symmetric around zero and K_h(x) = h^{−1} K(x/h). It is straightforward to derive the form of L2Boosting using m = 2 iterations

(with f̂^[0] ≡ 0 and ν = 1), i.e., twicing [83], with the Nadaraya-Watson kernel estimator:

  f̂^[2](x) = (nh)^{−1} Σ_{i=1}^n K_h^tw(x − x_i) Y_i,  K_h^tw(u) = 2 K_h(u) − K_h ∗ K_h(u),
  where K_h ∗ K_h(u) = n^{−1} Σ_{r=1}^n K_h(u − x_r) K_h(x_r).

For fixed design points x_i = i/n, the kernel K_h^tw(·) is asymptotically equivalent to a higher-order kernel (which can take negative values) yielding a squared bias term of order O(h^8), assuming that the true regression function is four times continuously differentiable. Thus, twicing or L2Boosting with m = 2 iterations amounts to be a Nadaraya-Watson kernel estimator with a higher-order kernel. This explains from another angle why boosting is able to improve the mean squared error rate of the base procedure. More details, including also non-equispaced designs, are given in DiMarzio and Taylor [27].

5.2. L2Boosting for high-dimensional linear models. Consider a potentially high-dimensional linear model

(5.4)  Y_i = β_0 + Σ_{j=1}^p β^(j) X_i^(j) + ε_i,  i = 1, ..., n,

where ε_1, ..., ε_n are i.i.d. with E[ε_i] = 0 and independent from all X_i's. We allow for the number of predictors p to be much larger than the sample size n. The model encompasses the representation of a noisy signal by an expansion with an over-complete dictionary of functions {g^(j)(·): j = 1, ..., p}, e.g., for surface modeling with design points Z_i ∈ R²:

  Y_i = f(Z_i) + ε_i,  f(z) = Σ_j β^(j) g^(j)(z)  (z ∈ R²).

Fitting the model (5.4) can be done using L2Boosting with the componentwise linear least squares base procedure from Section 4.1, which fits in every iteration the best predictor variable reducing the residual sum of squares most. This method has the following basic properties:

1. As the number m of boosting iterations increases, the L2Boosting estimate f̂^[m](·) converges to a least squares solution. This solution is unique if the design matrix has full rank p ≤ n.
2. When stopping early, which is usually needed to avoid over-fitting, the L2Boosting method often does variable selection.

3. The coefficient estimates β̂^[m] are (typically) shrunken versions of a least squares estimate β̂_OLS, related to the Lasso as described in Section 5.2.1.

Illustration: Breast cancer subtypes. Variable selection is especially important in high-dimensional situations. As an example, we study a binary classification problem involving p = 7129 gene expression levels in n = 49 breast cancer tumor samples [data taken from 90]. For each sample, a binary response variable describes the lymph node status (25 negative and 24 positive).

The data are stored in form of an exprSet object westbc [see 35], and we first extract the matrix of expression levels and the response variable:

R> x <- t(exprs(westbc))
R> y <- pData(westbc)$nodal.y

We aim at using L2Boosting for classification, see Section 3.2.1, with classical AIC based on the binomial log-likelihood for stopping the boosting iterations. Thus, we first transform the factor y to a numeric variable with 0/1 coding:

R> yfit <- as.numeric(y) - 1

The general framework implemented in mboost allows us to specify the negative gradient (the ngradient argument) corresponding to the surrogate loss function, here the squared error loss implemented as a function rho, and a different evaluating loss function (the loss argument), here the negative binomial log-likelihood, with the Family function as follows:

R> rho <- function(y, f, w = 1) {
+      p <- pmax(pmin(1 - 1e-05, f), 1e-05)
+      -y * log(p) - (1 - y) * log(1 - p)
+  }
R> ngradient <- function(y, f, w = 1) y - f
R> offset <- function(y, w) weighted.mean(y, w)
R> L2fm <- Family(ngradient = ngradient, loss = rho, offset = offset)

The resulting object (called L2fm), bundling the negative gradient, the loss function and a function for computing an offset term (offset), can now be passed to the glmboost function for boosting with componentwise linear least squares (here initial mstop = 200 iterations are used):

R> ctrl <- boost_control(mstop = 200, center = TRUE)
R> west_glm <- glmboost(x, yfit, family = L2fm, control = ctrl)

The question how to choose mstop can be addressed by the classical AIC criterion as follows:

R> aic <- AIC(west_glm, method = "classical")
R> mstop(aic)
[1] 100

where the AIC is computed as −2(log-likelihood) + 2(degrees of freedom) = 2(evaluating loss) + 2(degrees of freedom), see Formula (5.8). The notion of degrees of freedom is discussed in Section 5.3.

Figure 6 shows the AIC curve depending on the number of boosting iterations. When we stop after mstop = 100 boosting iterations, we obtain 33 genes with non-zero regression coefficients whose standardized values β̂^(j) √(V̂ar(X^(j))) are depicted in the left panel of Figure 6.

This form of estimation and variable selection is computationally very efficient: fitting such a linear model to p = 7129 covariates for n = 49 observations takes about 3.6 seconds on a medium scale desktop computer (Intel Pentium 4, 2.8 GHz). As a comparison, computing all Lasso solutions, using package lars [28, 39] in R (with use.Gram=FALSE), takes about 6.7 seconds. Of course, we could also use BinomialBoosting for analyzing the data: the computational CPU time would be of the same order of magnitude, i.e., only a few seconds.

Fig 6. westbc data: Standardized regression coefficients β̂^(j) √(V̂ar(X^(j))) (left panel) for mstop = 100 determined from the classical AIC criterion shown in the right panel.
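A toy simulation of this behaviour (hypothetical data in illustrative Python, not the westbc example): with p much larger than n and a sparse truth, early-stopped L2Boosting with componentwise linear least squares touches only a few coordinates.

```python
import numpy as np

# Sparse high-dimensional toy model: p = 200 predictors, n = 50
# observations, only the first two coefficients non-zero.
rng = np.random.default_rng(2)
n, p = 50, 200
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)                       # mean-centered predictors
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=n)

beta = np.zeros(p)
f = np.full(n, y.mean())
nu, mstop = 0.1, 50
for _ in range(mstop):
    u = y - f                             # current residuals
    b = X.T @ u / np.sum(X ** 2, axis=0)  # componentwise OLS coefficients
    rss = np.sum((u[:, None] - X * b) ** 2, axis=0)
    s = int(np.argmin(rss))               # selected predictor
    beta[s] += nu * b[s]
    f = f + nu * b[s] * X[:, s]

selected = np.flatnonzero(beta)
# early stopping acts as variable selection: far fewer than p coordinates
assert len(selected) < 25
assert {0, 1}.issubset(set(selected))
```

Letting mstop grow would drive the fit towards a least squares solution and gradually pull in more of the 200 coordinates, which is why the stopping iteration is the regularization parameter here.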

5.2.1. Connections to the Lasso. Hastie et al. [42] first pointed out an intriguing connection between L2Boosting with componentwise linear least squares and the Lasso [82], which is the following ℓ1-penalty method:

(5.5)  β̂(λ) = argmin_β n^{−1} Σ_{i=1}^n (Y_i − β_0 − Σ_{j=1}^p β^(j) X_i^(j))² + λ Σ_{j=1}^p |β^(j)|.

Efron et al. [28] made the connection rigorous and explicit: they consider a version of L2Boosting, called forward stagewise linear regression (FSLR), and they show that FSLR with infinitesimally small step-sizes (i.e., the value ν in step 4 of the L2Boosting algorithm in Section 3.1) produces a set of solutions which is approximately equivalent to the set of Lasso solutions when varying the regularization parameter λ in Lasso (see (5.5) above). The approximate equivalence is derived by representing FSLR and Lasso as two different modifications of the computationally efficient least angle regression (LARS) algorithm from Efron et al. [28] (see also [68] for generalized linear models); the latter is very similar to the algorithm proposed earlier by Osborne et al. [67]. In special cases where the design matrix satisfies a "positive cone condition", FSLR, Lasso and LARS all coincide [28, p. 425]. Moreover, for more general situations, when adding some backward steps to boosting, such modified L2Boosting coincides with the Lasso (Zhao and Yu [93]).

Despite the fact that L2Boosting and Lasso are not equivalent methods in general, it may be useful to interpret boosting as being "related" to ℓ1-penalty based methods.

5.2.2. Asymptotic consistency in high dimensions. We review here a result establishing asymptotic consistency for very high-dimensional but sparse linear models as in (5.4). To capture the notion of high-dimensionality, we equip the model with a dimensionality p = p_n which is allowed to grow with sample size n; moreover, the coefficients β^(j) = β_n^(j) are now potentially depending on n, and the regression function is denoted by f_n(·).

Replica 2. [18] Consider the linear model in (5.4). Assume that p_n = O(exp(n^{1−ξ})) for some 0 < ξ ≤ 1 (high-dimensionality) and sup_{n∈N} Σ_{j=1}^{p_n} |β_n^(j)| < ∞ (sparseness of the true regression function w.r.t. the ℓ1-norm); moreover, the variables X_i^(j) are bounded and E[|ε_i|^{4/ξ}] < ∞. Then: when stopping the boosting iterations appropriately, i.e., m = m_n → ∞ (n → ∞) sufficiently slowly, L2Boosting with componentwise linear least squares satisfies

  E_{X_new}[(f̂_n^[m_n](X_new) − f_n(X_new))²] → 0 in probability (n → ∞),

where X_new denotes new predictor variables, independent of and with the same distribution as the X-component of the data (X_i, Y_i) (i = 1, ..., n).

Replica 2 identifies boosting as a method which is able to consistently estimate a very high-dimensional but sparse linear model; for the Lasso in (5.5), a similar result holds as well [37]. The result holds for almost arbitrary designs, and no assumptions about collinearity or correlations are required. In terms of empirical performance, there seems to be no overall superiority of L2Boosting over Lasso or vice-versa.

5.2.3. Transforming predictor variables. In view of Replica 2, we may enrich the design matrix in model (5.4) with many transformed predictors: if the true regression function can be represented as a sparse linear combination of original or transformed predictors, consistency is still guaranteed. It should be noted, though, that the inclusion of non-effective variables in the design matrix does degrade the finite-sample performance to a certain extent. For example, higher order interactions can be specified in generalized AN(C)OVA models, and L2Boosting with componentwise linear least squares can be used to select a small number out of potentially many interaction terms.

As an option for continuously measured covariates, we may utilize a B-spline basis, as illustrated in the next paragraph. We emphasize that during the process of L2Boosting with componentwise linear least squares, individual spline basis functions from various predictor variables are selected and fitted one at a time; in contrast, L2Boosting with componentwise smoothing splines fits a whole smoothing spline function (for a selected predictor variable) at a time.

Illustration: Prediction of total body fat (cont.). Such transformations and estimation of a corresponding linear model can be done with the glmboost function, where the model formula performs the computations of all transformations by means of the bs (B-spline basis) function from the package splines. First, we set up a formula transforming each covariate

R> bsfm
DEXfat ~ bs(age) + bs(waistcirc) + bs(hipcirc) + bs(elbowbreadth) +
    bs(kneebreadth) + bs(anthro3a) + bs(anthro3b) + bs(anthro3c) +
    bs(anthro4)

and then fit the complex linear model by using the glmboost function with initial mstop = 5000 boosting iterations:

R> ctrl <- boost_control(mstop = 5000)
R> bf_bs <- glmboost(bsfm, data = bodyfat, control = ctrl)
R> mstop(aic <- AIC(bf_bs))
[1] 2891

The corrected AIC criterion (see Section 5.4) suggests to stop after mstop = 2891 boosting iterations, and the final model selects 21 (transformed) predictor variables. Again, the partial contributions of each of the 9 original covariates can be computed easily and are shown in Figure 7 (for the same variables as in Figure 3). Note that the functional relationship derived from the model fitted above (Figure 7) is qualitatively the same as the one derived from the additive model (Figure 3).

5.3. Degrees of freedom for L2Boosting. A notion of degrees of freedom will be useful for estimating the stopping iteration of boosting (Section 5.4).

Componentwise linear least squares. We consider L2Boosting with componentwise linear least squares. Denote by

    H^(j) = X^(j) (X^(j))⊤ / ||X^(j)||²,    j = 1, ..., p,

the n × n hat matrix for the linear least squares fitting operator using the jth predictor variable X^(j) = (X_1^(j), ..., X_n^(j))⊤ only; here ||x||² = x⊤x denotes the squared Euclidean norm of a vector x ∈ Rⁿ. The hat matrix of the componentwise linear least squares base procedure (see (4.1)) is then H^(Ŝ): (U_1, ..., U_n) ↦ (Û_1, ..., Û_n), where Ŝ is as in (4.1). Similarly to (5.2), we then obtain the hat matrix of L2Boosting in iteration m:

(5.6)    B_m = B_{m−1} + ν·H^(Ŝ_m)(I − B_{m−1})
             = I − (I − νH^(Ŝ_m))(I − νH^(Ŝ_{m−1})) ··· (I − νH^(Ŝ_1)),

where Ŝ_r ∈ {1, ..., p} denotes the component which is selected in the componentwise least squares base procedure in the rth boosting iteration. We emphasize that B_m depends on the response variable Y via the selected components Ŝ_r, r = 1, ..., m. Due to this dependence on Y, B_m should be viewed as an approximate hat matrix only. Neglecting the selection effect of Ŝ_r (r = 1, ..., m), we define the degrees of freedom of the boosting fit in iteration m as

    df(m) = trace(B_m).
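The hat-matrix recursion (5.6) and the resulting degrees of freedom df(m) = trace(B_m) can be made concrete in a small self-contained sketch. The following Python code is our own illustration, not mboost code, and all function names are ours; it runs L2Boosting with componentwise linear least squares while tracking B_m:

```python
# Illustrative sketch of L2Boosting with componentwise linear least squares,
# tracking the approximate hat matrix B_m of (5.6) and df(m) = trace(B_m).
# Not mboost code; all names are ours. Pure Python for transparency.

def mat_mult(A, B):
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][t] * B[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

def boost_hat_matrix(X, y, m_stop, nu=0.1):
    """X: n rows, p columns; returns (B_m, df(m)) after m_stop iterations."""
    n, p = len(X), len(X[0])
    cols = [[X[i][j] for i in range(n)] for j in range(p)]
    # H^(j) = x_j x_j^T / ||x_j||^2, the hat matrix of the jth predictor
    H = []
    for x in cols:
        nrm = sum(v * v for v in x)
        H.append([[x[a] * x[b] / nrm for b in range(n)] for a in range(n)])
    B = [[0.0] * n for _ in range(n)]
    f = [0.0] * n
    for _ in range(m_stop):
        u = [y[i] - f[i] for i in range(n)]  # current residuals
        # componentwise selection: smallest residual sum of squares
        best, best_rss = 0, float("inf")
        for j, x in enumerate(cols):
            beta = sum(x[i] * u[i] for i in range(n)) / sum(v * v for v in x)
            rss = sum((u[i] - beta * x[i]) ** 2 for i in range(n))
            if rss < best_rss:
                best, best_rss = j, rss
        # B_m = B_{m-1} + nu * H^(S_m) (I - B_{m-1}), cf. (5.6)
        Hj = H[best]
        IB = [[(1.0 if a == b else 0.0) - B[a][b] for b in range(n)]
              for a in range(n)]
        upd = mat_mult(Hj, IB)
        B = [[B[a][b] + nu * upd[a][b] for b in range(n)] for a in range(n)]
        f = [f[i] + nu * sum(Hj[i][k] * u[k] for k in range(n))
             for i in range(n)]
    return B, sum(B[i][i] for i in range(n))
```

With a single predictor, this gives df(m) = 1 − (1 − ν)^m, so the degrees of freedom grow smoothly from ν towards 1 as m increases: df(m) behaves very differently from simply counting the selected variables.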

Fig 7. bodyfat data: Partial fits for a linear model fitted to transformed covariates using B-splines (without centering of estimated functions to mean zero); panels shown: hipcirc, waistcirc, kneebreadth, anthro3b.

Even with ν = 1, df(m) is very different from counting the number of variables which have been selected until iteration m. Having some notion of degrees of freedom at hand, we can estimate the error variance σ²_ε = E[ε²_i] in the linear model (5.4) by

    σ̂²_ε = (n − df(m_stop))⁻¹ Σ_{i=1}^n (Y_i − f̂^[m_stop](X_i))².

Moreover, we can represent

(5.7)    B_m = Σ_{j=1}^p B_m^(j),

where B_m^(j) is the (approximate) hat matrix which yields the fitted values for the jth predictor, i.e., B_m^(j) Y = X^(j) β̂_j^[m]. Note that the B_m^(j)'s can be easily computed in an iterative way by up-dating as follows:

    B_m^(Ŝ_m) = B_{m−1}^(Ŝ_m) + ν·H^(Ŝ_m)(I − B_{m−1}),
    B_m^(j) = B_{m−1}^(j)    for all j ≠ Ŝ_m.

Thus, we have a decomposition of the total degrees of freedom into p terms:

    df(m) = Σ_{j=1}^p df^(j)(m),    df^(j)(m) = trace(B_m^(j)).

The individual degrees of freedom df^(j)(m) are a useful measure to quantify the "complexity" of the individual coefficient estimate β̂_j^[m].

5.4. Internal stopping criteria for L2Boosting. Having some degrees of freedom at hand, we can now use information criteria for estimating a good stopping iteration, without pursuing some sort of cross-validation. We can use the corrected AIC [49]:

    AIC_c(m) = log(σ̂²) + (1 + df(m)/n) / (1 − (df(m) + 2)/n),
    σ̂² = n⁻¹ Σ_{i=1}^n (Y_i − (B_m Y)_i)².

Alternatively, we may employ the gMDL criterion (Hansen and Yu [38]):

    gMDL(m) = log(S) + (df(m)/n) log(F),
    S = n σ̂² / (n − df(m)),    F = (Σ_{i=1}^n Y_i² − n σ̂²) / (df(m) S).

The gMDL criterion bridges the AIC and BIC in a data-driven way: it is an attempt to adaptively select the better among the two. In mboost, the corrected AIC criterion can be computed via AIC(x, method = "corrected") (with x being an object returned by glmboost or gamboost called with family = GaussReg()).

When using L2Boosting for binary classification (see also the end of Section 3.2 and the illustration in Section 5.2), the following is relevant: if (B_m Y)_i ∉ [0, 1], we truncate by max(min((B_m Y)_i, 1 − δ), δ) for some small δ > 0, e.g., δ = 10⁻⁵. We then prefer to work with the binomial log-likelihood in AIC, i.e.,

(5.8)    AIC(m) = −2 Σ_{i=1}^n [Y_i log((B_m Y)_i) + (1 − Y_i) log(1 − (B_m Y)_i)] + 2 df(m),

or for BIC(m) with the penalty term log(n) df(m).

6. Boosting for variable selection. We address here the question whether boosting is a good variable selection scheme, i.e., whether it leads to a consistent model selection method. The mathematical properties of boosting for variable selection are still open questions. For problems with many predictor variables, boosting is computationally much more efficient than classical all subset selection schemes.

6.1. L2Boosting. When borrowing from the analogy of L2Boosting with the Lasso (see Section 5.2), the following is relevant. Consider a linear model as in (5.4), allowing for p ≫ n but being sparse. Then, there is a sufficient and "almost" necessary neighborhood stability condition (the word "almost" refers to a strict inequality "<" whereas "≤" suffices for sufficiency) such that, for some suitable penalty parameter λ in (5.5), the Lasso finds the true underlying sub-model (the predictor variables with corresponding regression coefficients ≠ 0) with probability tending quickly to 1 as n → ∞ [65]. It is important to note the role of this sufficient and "almost" necessary condition: Zhao and Yu [94] call it the "irrepresentable condition", which has (mainly) implications on the "degree of collinearity" of the design (predictor variables), and they give examples where it holds and where it fails to be true. A further complication is the fact that, when tuning the Lasso for prediction optimality, i.e., choosing the penalty parameter λ in (5.5) such that the mean squared error is minimal, the prediction optimal tuned Lasso selects asymptotically too large models. In fact, the probability for estimating the true sub-model converges to a number which is less than one, or even zero if the problem is high-dimensional [65]. The bias of the Lasso mainly causes the difficulties mentioned above, and we often would like to construct estimators which are less biased. Nevertheless, the (computationally efficient) Lasso seems to be a very useful method for variable filtering: for many cases, the prediction optimal tuned Lasso selects a sub-model which contains the true model with high probability.

It is instructive to look at regression with orthonormal design, i.e., the model (5.4) with Σ_{i=1}^n X_i^(j) X_i^(k) = δ_jk. Then, the Lasso and also L2Boosting with componentwise linear least squares and using very small ν (in step 4 of L2Boosting, see Section 3.3) yield the soft-threshold estimator [23, 28]; see Figure 8.
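For this orthonormal design, the estimators compared in Figure 8 reduce to simple componentwise thresholding rules applied to z, a single component of X⊤Y. The Python functions below are our own hedged sketch (not from the paper's code): the adaptive Lasso form assumes the initial estimator β̂_init = z, in which case it equals the non-negative garrote threshold z(1 − λ/z²)₊; scaling conventions for λ vary.

```python
def soft_threshold(z, lam):
    # Lasso / L2Boosting with small nu: shrink by lam, constant bias
    if abs(z) <= lam:
        return 0.0
    return (abs(z) - lam) * (1.0 if z > 0 else -1.0)

def hard_threshold(z, lam):
    # keep-or-kill: unbiased beyond the threshold
    return z if abs(z) > lam else 0.0

def adaptive_lasso(z, lam):
    # with beta_init = z: z * (1 - lam / z**2)_+ ; bias vanishes as |z| grows
    return z * max(0.0, 1.0 - lam / z ** 2) if z != 0 else 0.0
```

For large |z|, the soft-threshold estimate stays a constant λ away from z, whereas the adaptive Lasso estimate approaches z, illustrating the bias difference discussed next.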

The soft-threshold estimator exhibits the same amount of bias regardless of how much the observation (the variable z in Figure 8) exceeds the threshold. This is in contrast to the hard-threshold estimator and the adaptive Lasso in (6.1), which are much better in terms of bias.

Fig 8. Hard-threshold (dotted-dashed), soft-threshold (dotted) and adaptive Lasso (solid) estimator in a linear model with orthonormal design. The value on the abscissa, denoted by z, is a single component of X⊤Y.

A nice proposal to correct Lasso's over-estimation behavior is the adaptive Lasso, given by Zou [96]. It is based on re-weighting the penalty function: instead of (5.5), the adaptive Lasso estimator is

(6.1)    β̂(λ) = argmin_β  n⁻¹ Σ_{i=1}^n (Y_i − β_0 − Σ_{j=1}^p β^(j) X_i^(j))² + λ Σ_{j=1}^p |β^(j)| / |β̂_init^(j)|,

where β̂_init is an initial estimator, e.g., the Lasso (from a first stage of Lasso estimation). In special settings, e.g., for the orthonormal design above, the adaptive Lasso coincides with the non-negative garrote [13]. Consistency of the adaptive Lasso for variable selection has been proved for the case with fixed predictor-dimension p [96] and also for the high-dimensional case with p = p_n ≫ n [48].

6.2. Twin Boosting. Twin Boosting [19] is the boosting analogue to the adaptive Lasso. It consists of two stages of boosting: the first stage is as usual, and the second stage is enforced to resemble the first boosting round. For example, if a variable has not been selected in the first round of boosting, it will not be selected in the second: this property also holds for the adaptive Lasso in (6.1), i.e., β̂_init^(j) = 0 enforces β̂^(j) = 0. Twin Boosting with componentwise linear least squares is proved to be equivalent to the adaptive Lasso for the case of an orthonormal linear model, and it is empirically shown, in general and for various base procedures and models, that it has much better variable selection properties than the corresponding boosting algorithm [19]. In special settings, similar results can be obtained with Sparse Boosting [23]; however, Twin Boosting is much more generically applicable. We do not expect that boosting is free from the difficulties which occur when using the Lasso for variable selection. The hope is, though, that boosting, too, would produce an interesting set of sub-models when varying the number of iterations.

7. Boosting for exponential family models. For exponential family models with general loss functions, we can use the generic FGD algorithm as described in Section 2.1.

First, we address the issue about omitting a line search between steps 3 and 4 of the generic FGD algorithm. Consider the empirical risk at iteration m:

(7.1)    n⁻¹ Σ_{i=1}^n ρ(Y_i, f̂^[m](X_i)) ≈ n⁻¹ Σ_{i=1}^n ρ(Y_i, f̂^[m−1](X_i)) − ν n⁻¹ Σ_{i=1}^n U_i ĝ^[m](X_i),

using a first-order Taylor expansion and the definition of U_i. Consider the case with the componentwise linear least squares base procedure, without loss of generality with standardized predictor variables (i.e., n⁻¹ Σ_{i=1}^n (X_i^(j))² = 1 for all j). Then,

    ĝ^[m](x) = (n⁻¹ Σ_{i=1}^n U_i X_i^(Ŝ_m)) x^(Ŝ_m),

and the expression in (7.1) becomes:

(7.2)    n⁻¹ Σ_{i=1}^n ρ(Y_i, f̂^[m](X_i)) ≈ n⁻¹ Σ_{i=1}^n ρ(Y_i, f̂^[m−1](X_i)) − ν (n⁻¹ Σ_{i=1}^n U_i X_i^(Ŝ_m))².

In case of the squared error loss ρ_L2(y, f) = |y − f|²/2, we obtain the exact identity:

    n⁻¹ Σ_{i=1}^n ρ_L2(Y_i, f̂^[m](X_i)) = n⁻¹ Σ_{i=1}^n ρ_L2(Y_i, f̂^[m−1](X_i)) − ν(1 − ν/2)(n⁻¹ Σ_{i=1}^n U_i X_i^(Ŝ_m))².

Comparing this with formula (7.2), we see that functional gradient descent with a general loss function and without additional line-search behaves very similarly to L2Boosting (since ν is small) with respect to optimizing the empirical risk. This completes our reasoning why the line-search in the general functional gradient descent algorithm can be omitted, of course at the price of doing more iterations, but not necessarily more computing time (since the line-search is omitted in every iteration). For L2Boosting, the numerical convergence rate is n⁻¹ Σ_{i=1}^n ρ_L2(Y_i, f̂^[m](X_i)) = O(m^{−1/6}) (m → ∞) [81].

7.1. BinomialBoosting. For binary classification with Y ∈ {0, 1}, BinomialBoosting uses the negative binomial log-likelihood from (3.1) as loss function; the algorithm is described in Section 3.3. Since the population minimizer is f*(x) = log[p(x)/(1 − p(x))]/2, estimates from BinomialBoosting are on half of the logit-scale: the componentwise linear least squares base procedure yields a logistic linear model fit, while using componentwise smoothing splines fits a logistic additive model. Many of the concepts and facts from Section 5 about L2Boosting become useful heuristics for BinomialBoosting.

One principal difference is the derivation of the boosting hat matrix. Instead of (5.6), a linearization argument leads to the following recursion (assuming f̂^[0](·) ≡ 0) for an approximate hat matrix B_m:

    B_1 = 4νW^[0] H^(Ŝ_1),
    B_m = B_{m−1} + 4νW^[m−1] H^(Ŝ_m)(I − B_{m−1})    (m ≥ 2),
(7.3)    W^[m] = diag(p̂^[m](X_i)(1 − p̂^[m](X_i)), 1 ≤ i ≤ n).
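The exact identity for the squared error loss can be verified numerically. The toy computation below is our own code (the data values are arbitrary); it performs one boosting step with a standardized predictor and checks that the empirical risk drops by exactly ν(1 − ν/2)(n⁻¹ Σ U_i X_i)².

```python
# Numeric check of the exact identity for rho_L2(y, f) = |y - f|^2 / 2.
n = 4
X = [1.0, -1.0, 1.0, -1.0]        # standardized: n^{-1} * sum X_i^2 = 1
U = [0.5, 1.0, -0.25, 2.0]        # current residuals Y_i - f^[m-1](X_i)
nu = 0.1

beta = sum(u * x for u, x in zip(U, X)) / n   # base procedure coefficient
risk_before = sum(u * u for u in U) / (2 * n)
risk_after = sum((u - nu * beta * x) ** 2 for u, x in zip(U, X)) / (2 * n)
drop = risk_before - risk_after
assert abs(drop - nu * (1 - nu / 2) * beta ** 2) < 1e-12
```

The identity holds for any residual vector, which is why no line search is needed: for small ν, the guaranteed decrease ν(1 − ν/2)β̂² is already close to the first-order decrease νβ̂² in (7.2).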

A derivation is given in Appendix B. Degrees of freedom are then defined as in Section 5.3, df(m) = trace(B_m), and they can be used for information criteria, e.g.,

    AIC(m) = −2 Σ_{i=1}^n [Y_i log(p̂^[m](X_i)) + (1 − Y_i) log(1 − p̂^[m](X_i))] + 2 df(m),

or for BIC(m) with the penalty term log(n) df(m). In mboost, this AIC criterion can be computed via AIC(x, method = "classical") (with x being an object returned by glmboost or gamboost called with family = Binomial()).

Illustration: Wisconsin prognostic breast cancer. Prediction models for recurrence events in breast cancer patients, based on covariates which have been computed from a digitized image of a fine needle aspirate of breast tissue (those measurements describe characteristics of the cell nuclei present in the image), have been studied by Street et al. [80] [the data is part of the UCI repository 11]. We first analyze this data as a binary prediction problem (recurrence vs. non-recurrence) and later, in Section 8, by means of survival models. We are faced with many covariates (p = 32) for a limited number of observations without missing values (n = 194), and variable selection is an important issue. We can choose a classical logistic regression model via AIC in a stepwise algorithm as follows:

R> cc <- complete.cases(wpbc)
R> wpbc2 <- wpbc[cc, colnames(wpbc) != "time"]
R> wpbc_step <- step(glm(status ~ ., data = wpbc2, family = binomial()),
+     trace = 0)

The final model consists of 16 parameters with

R> logLik(wpbc_step)
'log Lik.' -80.13 (df=16)
R> AIC(wpbc_step)
[1] 192.26

and we want to compare this model to a logistic regression model fitted via gradient boosting. We simply select the Binomial family (with default offset of 1/2 log(p̂/(1 − p̂)), where p̂ is the empirical proportion of recurrences) and we initially use mstop = 500 boosting iterations.

R> ctrl <- boost_control(mstop = 500, center = TRUE)
R> wpbc_glm <- glmboost(status ~ ., data = wpbc2, family = Binomial(),
+     control = ctrl)

The classical AIC criterion (-2 log-likelihood + 2 df) suggests to stop after

R> aic <- AIC(wpbc_glm, "classical")
R> aic
[1] 199.54
Optimal number of boosting iterations: 465

boosting iterations. We now restrict the number of boosting iterations to mstop = 465 and then obtain the estimated coefficients via

R> wpbc_glm <- wpbc_glm[mstop(aic)]
R> coef(wpbc_glm)[abs(coef(wpbc_glm)) > 0]

The resulting model comprises an intercept and non-zero coefficients for mean_radius, mean_texture, mean_smoothness, mean_symmetry, mean_fractaldim, tsize, pnodes, SE_texture, SE_perimeter, SE_compactness, SE_concavity, SE_concavepoints, SE_symmetry, SE_fractaldim, worst_radius, worst_perimeter, worst_area, worst_smoothness and worst_compactness (because of using the offset-value f̂^[0], we have to add the value f̂^[0] to the reported intercept estimate above for the logistic regression model).

A generalized additive model adds more flexibility to the regression function but is still interpretable. We fit a logistic additive model to the wpbc data as follows:

R> wpbc_gam <- gamboost(status ~ ., data = wpbc2, family = Binomial())
R> mopt <- mstop(aic <- AIC(wpbc_gam, "classical"))
R> aic
[1] 199.76
Optimal number of boosting iterations: 99
Degrees of freedom (for mstop = 99): 14.583
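The classical AIC used above can be written down compactly. The helper below is our own sketch (not mboost's internal code); it evaluates −2 times the binomial log-likelihood plus 2 df(m), truncating fitted probabilities away from 0 and 1 as discussed for (5.8).

```python
import math

def binomial_aic(y, p_hat, df, delta=1e-5):
    """y: 0/1 responses; p_hat: fitted probabilities; df: degrees of freedom."""
    ll = 0.0
    for yi, pi in zip(y, p_hat):
        pi = max(min(pi, 1.0 - delta), delta)  # truncate into [delta, 1-delta]
        ll += yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
    return -2.0 * ll + 2.0 * df

# Minimizing binomial_aic over the boosting path (one df value per iteration
# m) yields the stopping iteration reported by AIC(x, method = "classical").
```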

This model selected 16 out of 32 covariates. The partial contributions of the four most important variables are depicted in Figure 9, indicating a remarkable degree of non-linearity.

Fig 9. wpbc data: Partial contributions of four selected covariates (tsize, worst_smoothness, worst_radius, SE_texture) in an additive logistic model (without centering of estimated functions to mean zero).

7.2. PoissonBoosting. For count data with Y ∈ {0, 1, 2, ...}, we can use Poisson regression: we assume that Y|X = x has a Poisson(λ(x)) distribution, and the goal is to estimate the function f(x) = log(λ(x)). The negative log-likelihood yields the loss function

    ρ(y, f) = −yf + exp(f),    f = log(λ),

which can be used in the functional gradient descent algorithm in Section 2.1.
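As a quick sanity check of this loss (our own code, not one of the paper's examples), the negative gradient −∂ρ(y, f)/∂f = y − exp(f) = y − λ is a residual on the intensity scale; a central finite difference confirms it:

```python
import math

def poisson_loss(y, f):
    # rho(y, f) = -y*f + exp(f): the negative Poisson log-likelihood,
    # up to the log(y!) term which does not depend on f
    return -y * f + math.exp(f)

def neg_gradient(y, f):
    return y - math.exp(f)   # = y - lambda

y, f, h = 3.0, 0.7, 1e-6
fd = -(poisson_loss(y, f + h) - poisson_loss(y, f - h)) / (2 * h)
assert abs(fd - neg_gradient(y, f)) < 1e-6
```

In each boosting iteration, the base procedure is thus fitted to the residuals Y_i − λ̂^[m−1](X_i).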

It is implemented in mboost as the Poisson() family. Similarly to (7.3), the approximate boosting hat matrix is computed by the following recursion:

    B_1 = νW^[0] H^(Ŝ_1),
    B_m = B_{m−1} + νW^[m−1] H^(Ŝ_m)(I − B_{m−1})    (m ≥ 2),
(7.4)    W^[m] = diag(λ̂^[m](X_i), 1 ≤ i ≤ n),

and analogous concepts can be used.

7.3. Initialization of boosting. We have briefly described in Sections 2.1 and 4.3 the issue of choosing an initial value f̂^[0](·) for boosting. This can be quite important for applications where we would like to estimate some parts of a model in an unpenalized (non-regularized) fashion and others being subject to regularization.

For example, we may think of a parametric form of f̂^[0](·), estimated by maximum likelihood, and deviations from the parametric model would be built in by pursuing boosting iterations (with a nonparametric base procedure). A concrete example would be: f̂^[0](·) is the maximum likelihood estimate in a generalized linear model, and boosting would be done with componentwise smoothing splines to model additive deviations from the generalized linear model. A related strategy has been used in [4] for modeling multivariate volatility in financial time series.

Another example would be a linear model Y = Xβ + ε as in (5.4), where some of the predictor variables, say the first q predictor variables X^(1), ..., X^(q), enter the estimated linear model in an unpenalized way. We propose to do ordinary least squares regression on X^(1), ..., X^(q): consider the projection P_q onto the linear span of X^(1), ..., X^(q), and use L2Boosting with componentwise linear least squares on the new response (I − P_q)Y and the new (p − q)-dimensional predictor (I − P_q)X. The final model estimate is then

    Σ_{j=1}^q β̂_OLS,j x^(j) + Σ_{j=q+1}^p β̂_j^[m_stop] x̃^(j),

where the latter part is from L2Boosting and x̃^(j) is the residual when linearly regressing x^(j) onto x^(1), ..., x^(q). A special case which is used in most applications is with q = 1 and X^(1) ≡ 1, encoding for an intercept: then, (I − P_1)Y = Y − Ȳ and (I − P_1)X^(j) = X^(j) − n⁻¹ Σ_{i=1}^n X_i^(j). This is exactly the proposal at the end of Section 4.1.

8. Survival analysis. The negative gradient of Cox's partial likelihood can be used to fit proportional hazards models to censored response variables with boosting algorithms [71]. Of course, all types of base procedures can be

utilized; for example, componentwise linear least squares fits a Cox model with a linear predictor. Furthermore, the concepts of degrees of freedom and information criteria are analogous to those in Sections 5.3 and 5.4.

Alternatively, we can use the weighted least squares framework with weights arising from inverse probability censoring. We sketch this approach in the sequel; details are given in [45]. We assume complete data of the following form: survival times T_i ∈ R₊ (some of them right-censored) and predictors X_i ∈ Rᵖ, i = 1, ..., n. We transform the survival times to the log-scale, but this step is not crucial for what follows: Y_i = log(T_i). What we observe is

    O_i = (Ỹ_i, X_i, ∆_i),    Ỹ_i = log(T̃_i),    T̃_i = min(T_i, C_i),

where ∆_i = I(T_i ≤ C_i) is a censoring indicator and C_i the censoring time. Here, we make the restrictive assumption that C_i is conditionally independent of T_i given X_i (and we assume independence among different indices i): this implies that the coarsening at random assumption holds; see van der Laan and Robins [89].

We consider the squared error loss for the complete data, ρ(y, f) = |y − f|² (without the irrelevant factor 1/2). For the observed data, the following weighted version turns out to be useful:

    ρ_obs(o, f) = (ỹ − f)² ∆ / G(t̃|x),    G(c|x) = P[C > c|X = x],

i.e., the observed data loss function is weighted by the inverse probability for censoring ∆ G(t̃|x)⁻¹ (the weights are inverse probabilities of censoring, IPC). Under the coarsening at random assumption, it then holds that

    E_{Y,X}[(Y − f(X))²] = E_O[ρ_obs(O, f(X))].

The strategy is then to estimate G(·|x), e.g., by the Kaplan-Meier estimator, and do weighted L2Boosting using the weighted squared error loss:

    n⁻¹ Σ_{i=1}^n ∆_i / Ĝ(T̃_i|X_i) · (Ỹ_i − f(X_i))²,

where the weights are of the form ∆_i Ĝ(T̃_i|X_i)⁻¹ (the specification of the estimator Ĝ(t|x) may play a substantial role in the whole procedure). As demonstrated in the previous sections, we can use various base procedures as long as they allow for weighted least squares fitting. Details are given in [45].
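The IPC weighting scheme above is easy to state in code. The sketch below is our own illustration: mboost's IPCweights() estimates Ĝ from the data (e.g. via Kaplan-Meier), whereas here Ghat is simply passed in as a function.

```python
def ipc_weights(obs_times, status, Ghat):
    # Delta_i / Ghat(t_i); censored observations (status 0) get weight zero
    return [d / Ghat(t) if d else 0.0 for t, d in zip(obs_times, status)]

def weighted_l2_risk(y, fitted, w):
    # n^{-1} * sum w_i * (y_i - f(x_i))^2, the weighted squared error loss
    n = len(y)
    return sum(wi * (yi - fi) ** 2
               for yi, fi, wi in zip(y, fitted, w)) / n
```

Censored observations contribute nothing to the risk; uncensored observations are up-weighted by the inverse estimated probability of remaining uncensored up to their observed time.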

Illustration: Wisconsin prognostic breast cancer (cont.). Instead of the binary response variable describing the recurrence status, we make use of the additionally available time information for modeling the time to recurrence, i.e., all observations with non-recurrence are censored. First, we calculate IPC weights:

R> censored <- wpbc$status == "R"
R> iw <- IPCweights(Surv(wpbc$time, censored))
R> wpbc3 <- wpbc[, names(wpbc) != "status"]

and fit a weighted linear model by boosting with componentwise linear weighted least squares as base procedure:

R> ctrl <- boost_control(mstop = 500, center = TRUE)
R> wpbc_surv <- glmboost(log(time) ~ ., data = wpbc3, control = ctrl,
+     weights = iw)
R> mstop(aic <- AIC(wpbc_surv))
[1] 122
R> wpbc_surv <- wpbc_surv[mstop(aic)]

The following variables have been selected for fitting:

R> names(coef(wpbc_surv)[abs(coef(wpbc_surv)) > 0])
 [1] "mean_radius"         "mean_texture"
 [3] "mean_perimeter"      "mean_smoothness"
 [5] "mean_symmetry"       "SE_texture"
 [7] "SE_smoothness"       "SE_concavepoints"
 [9] "SE_symmetry"         "worst_concavepoints"

and the fitted values are depicted in Figure 10, showing a reasonable model fit. Alternatively, a Cox model with linear predictor can be fitted using L2Boosting by implementing the negative gradient of the partial likelihood (see [71]) via

R> ctrl <- boost_control(center = TRUE)
R> glmboost(Surv(wpbc$time, wpbc$status == "N") ~ ., data = wpbc,
+     family = CoxPH(), control = ctrl)

For more examples, such as fitting an additive Cox model using mboost, see [44].

9. Other works. A very different exposition than ours is the overview of boosting by Meir and Rätsch [66]. We briefly summarize here some other works which have not been mentioned in the earlier sections.

Fig 10. wpbc data: Fitted values of an IPC-weighted linear model, taking both time to recurrence and censoring information into account. The radius of the circles is proportional to the IPC weight of the corresponding observation; censored observations with IPC weight zero are not plotted.

9.1. Methodology and applications. Boosting methodology has been used for various other statistical models than what we have discussed in the previous sections. Models for multivariate responses are studied in [20, 26]; some multi-class boosting methods are discussed in [33, 95]. Other works deal with boosting approaches for generalized linear and nonparametric models [55, 56, 85–87], for flexible semiparametric mixed models [88] or for nonparametric models with quality constraints [54, 59]. Boosting methods for estimating propensity scores, a special weighting scheme for modeling observational data, are proposed by [63].

There are numerous applications of boosting methods to real data problems. We mention here classification of tumor types from gene expressions [25, 26], multivariate financial time series [2–4], text classification [78], document routing [50] and survival analysis [8] (different from the approach in Section 8).

9.2. Asymptotic theory. The asymptotic analysis of boosting algorithms includes consistency and minimax rate results. The first consistency result for AdaBoost has been given by Jiang [51] (with early stopping, which is necessary in practice and in theory), and a different constructive proof with a range for the stopping value mstop = mstop,n is given in [7]. Later, Zhang and Yu [92] generalized the results for a functional gradient descent with an additional relaxation scheme, and their theory covers also more general loss functions than the exponential loss in AdaBoost. For L2Boosting, the first minimax rate result has been established by Bühlmann and Yu [22]. This has been extended to much more general settings by Yao et al. [91] and Bissantz et al. [10].

In the machine learning community, there has been a substantial focus on estimation in the convex hull of function classes (cf. [5, 6, 58]). For example, one may want to estimate a regression or probability function by using

    Σ_{k=1}^∞ ŵ_k ĝ^[k](·),    ŵ_k ≥ 0,    Σ_{k=1}^∞ ŵ_k = 1,

where the ĝ^[k](·)'s belong to a function class such as stumps or trees with a fixed number of terminal nodes. The estimator above is a convex combination of individual functions, in contrast to boosting, which pursues a linear combination. By scaling, one can actually look at this as a linear combination of functions whose coefficients satisfy Σ_k ŵ_k = λ; this then represents an ℓ1-constraint as in Lasso, a relation which we have already seen from another perspective in Section 5.2. Consistency of such convex combination or ℓ1-regularized "boosting" methods has been given by Lugosi and Vayatis [58]. Mannor et al. [61] and Blanchard et al. [12] derived results for rates of convergence of (versions of) convex combination schemes.

APPENDIX A: SOFTWARE

The data analyses presented in this paper have been performed using the mboost add-on package to the R system of statistical computing. Its implementation and user-interface reflect our statistical perspective of boosting as a tool for estimation in structured models, extending the reference implementation of tree-based gradient boosting from the gbm package [74]. The theoretical ingredients of boosting algorithms, such as loss functions and their negative gradients, base learners and internal stopping criteria, find their computational counterparts in the mboost package. For example, mboost allows to fit potentially high-dimensional linear or smooth additive models, and it has methods to compute degrees of freedom which in turn

allow for the use of information criteria such as AIC or BIC or for the estimation of variance. For high-dimensional (generalized) linear models, our implementation is very fast to fit models even when the dimension of the predictor space is in the ten-thousands.

Moreover, we aren't limited to already implemented boosting algorithms but can easily set up our own boosting procedure by implementing the negative gradient of the surrogate loss function of interest. The Family function in mboost can be used to create an object of class boost family implementing the negative gradient for general surrogate loss functions. Such an object can later be fed into the fitting procedure of a linear or additive model which optimizes the corresponding empirical risk (an example is given in Section 5.3).

Both the source version as well as binaries for several operating systems of the mboost [43] package are freely available from the Comprehensive R Archive Network (http://CRAN.R-project.org). The reader can install our package directly from the R prompt via

R> install.packages("mboost", dependencies = TRUE)
R> library("mboost")

All analyses presented in this paper are contained in a package vignette. The rendered output of the analyses is available by the R-command

R> vignette("mboost_illustrations", package = "mboost")

whereas the R code for reproducibility of our analyses can be assessed by

R> edit(vignette("mboost_illustrations", package = "mboost"))

There are several alternative implementations of boosting techniques available as R add-on packages: the reference implementation for tree-based gradient boosting is gbm [74], and boosting for additive models based on penalized B-splines is implemented in GAMBoost [9, 84].

APPENDIX B: DERIVATION OF BOOSTING HAT MATRICES

Derivation of formula (7.3). The negative gradient is

    −(∂/∂f) ρ(y, f) = 2(y − p),    p = exp(f)/(exp(f) + exp(−f)).

Next, we linearize p̂^[m]: we denote by p̂^[m] = (p̂^[m](X_1), ..., p̂^[m](X_n))⊤ and analogously for f̂^[m]. Then,

(B.1)    p̂^[m] ≈ p̂^[m−1] + (∂p/∂f)|_{f = f̂^[m−1]} (f̂^[m] − f̂^[m−1])
              = p̂^[m−1] + 2W^[m−1] νH^(Ŝ_m) 2(Y − p̂^[m−1]),

and B¨ uhlmann. (2005). [2] Audrino. D. 1 ≤ i ≤ n). and Barone-Adesi. and Geman. Neural Computation 9 1545–1588. F. (2003). (2005). Hothorn was supported by Deutsche Forschungs- gemeinschaft (DFG) under grant HO 3242/1-3. The work of T. Here. BOOSTING ALGORITHMS AND MODEL FITTING 47 where W [m] = diag(ˆ p(Xi )(1 − pˆ(Xi )).tex date: June 4. . Roman Lutz and Lukas Meier for discussions and detailed remarks. the negative gradient is ∂ − ρ(y. .4) as in the binomial case above. Journal of Computational Finance 6 65–89. F. ˆ Bm ≈ Bm−1 + ν4W [m−1] HSm (I − Bm−1 ) (m ≥ 2). G. 2007 . A multivariate FGD technique to improve VaR computation in equity markets.1) ˆ B1 ≈ ν4W [0] HS1 . analogously to (B. the editor and the executive editor Ed George for constructive comments. imsart-sts ver. f ) = y − λ. and Barone-Adesi. λ ˆ [m] (Xn ))> we get. We then complete the derivation of where W [m] = diag(λ(X (7. P. (1997).1). [4] Audrino. we thank four referees. Since for the hat matrix. we obtain from (B. . Computational Management Science 2 87–106. . [3] Audrino.  Derivation of formula (7.  ACKNOWLEDGMENTS We would like to thank Axel Benner. F. G.4). 2005/10/19 file: BuehlmannHothorn_Boosting. ∂f ˆ [m] = (λ When linearizing λ ˆ [m] (X1 ). Moreover. 1 ≤ i ≤ n). λ = exp(f ). Volatility estimation with functional gradi- ent descent for very high-dimensional financial time series. Shape quantization and recognition with random- ized trees.3) is approximately true. Y. Bm Y = pˆ[m] . The arguments are analogous as for the bino- mial case above. Functional gradient descent for financial time series with an application to the measurement of market risk. which shows that (7. Florian Leitenstorfer. ˆ i )). Journal of Banking and Finance 29 959–977. REFERENCES [1] Amit. ˆ [m−1] + ∂λ | ˆm−1 fˆ[m] − fˆ[m−1]   ˆ [m] ≈ λ λ ∂f f =f ˆ [m−1] + W [m−1] νH(Sˆm ) Y − λ   = λ ˆ [m−1] .

[5] Bartlett, P. (2003). Prediction algorithms: complexity, concentration and convexity. In Proceedings of the 13th IFAC Symposium on System Identification.
[6] Bartlett, P., Jordan, M. and McAuliffe, J. (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association 101 138–156.
[7] Bartlett, P. and Traskin, M. (2007). AdaBoost is consistent. In Advances in Neural Information Processing Systems, vol. 19.
[8] Benner, A. (2002). Application of "aggregated classifiers" in survival time studies. In Proceedings in Computational Statistics (COMPSTAT) (W. Härdle and B. Rönz, eds.). Physica-Verlag, Heidelberg.
[9] Binder, H. (2006). GAMBoost: Generalized additive models by likelihood based boosting. R package version 0.9-3. URL http://CRAN.R-project.org
[10] Bissantz, N., Hohage, T., Munk, A. and Ruymgaart, F. (2007). Convergence rates of general regularization methods for statistical inverse problems and applications. SIAM Journal of Numerical Analysis (in press).
[11] Blake, C. and Merz, C. (1998). UCI repository of machine learning databases. URL ftp://ftp.ics.uci.edu
[12] Blanchard, G., Lugosi, G. and Vayatis, N. (2003). On the rate of convergence of regularized boosting classifiers. Journal of Machine Learning Research 4 861–894.
[13] Breiman, L. (1995). Better subset regression using the nonnegative garrote. Technometrics 37 373–384.
[14] Breiman, L. (1996). Bagging predictors. Machine Learning 24 123–140.
[15] Breiman, L. (1998). Arcing classifiers (with discussion). The Annals of Statistics 26 801–849.
[16] Breiman, L. (1999). Prediction games & arcing algorithms. Neural Computation 11 1493–1517.
[17] Breiman, L. (2001). Random forests. Machine Learning 45 5–32.
[18] Bühlmann, P. (2006). Boosting for high-dimensional linear models. The Annals of Statistics 34 559–583.
[19] Bühlmann, P. (2007). Twin boosting: improved feature selection and prediction. Tech. rep., ETH Zürich. URL http://www.stat.ethz.ch/~buhlmann/TwinBoosting1.pdf
[20] Bühlmann, P. and Lutz, R. (2006). Boosting algorithms: with an application to bootstrapping multivariate time series. In The Frontiers in Statistics (J. Fan and H. Koul, eds.). Imperial College Press.
[21] Bühlmann, P. and Yu, B. (2000). Discussion of additive logistic regression: a statistical view (J. Friedman, T. Hastie and R. Tibshirani, auths.). The Annals of Statistics 28 377–386.
[22] Bühlmann, P. and Yu, B. (2003). Boosting with the L2 loss: Regression and classification. Journal of the American Statistical Association 98 324–339.
[23] Bühlmann, P. and Yu, B. (2006). Sparse boosting. Journal of Machine Learning Research 7 1001–1024.
[24] Buja, A., Stuetzle, W. and Shen, Y. (2005). Loss functions for binary class probability estimation: structure and applications. Tech. rep., University of Washington.

[25] Dettling, M. (2004). BagBoosting for tumor classification with gene expression data. Bioinformatics 20 3583–3593.
[26] Dettling, M. and Bühlmann, P. (2003). Boosting for tumor classification with gene expression data. Bioinformatics 19 1061–1069.
[27] DiMarzio, M. and Taylor, C. (2005). Multistep kernel regression smoothing by boosting. Tech. rep., University of Leeds.
[28] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression (with discussion). The Annals of Statistics 32 407–451.
[29] Freund, Y. and Schapire, R. (1995). A decision-theoretic generalization of on-line learning and an application to boosting. In Proceedings of the Second European Conference on Computational Learning Theory. Lecture Notes in Computer Science, Springer.
[30] Freund, Y. and Schapire, R. (1996). Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA.
[31] Freund, Y. and Schapire, R. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55 119–139.
[32] Friedman, J. (2001). Greedy function approximation: a gradient boosting machine. The Annals of Statistics 29 1189–1232.
[33] Friedman, J., Hastie, T. and Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting (with discussion). The Annals of Statistics 28 337–407.
[34] Garcia, A., Wagner, K., Hothorn, T., Koebnick, C., Zunft, H. and Trippo, U. (2005). Improved prediction of body fat by measuring skinfold thickness, circumferences, and bone breadths. Obesity Research 13 626–634.
[35] Gentleman, R., Carey, V., Bates, D., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J., Hornik, K., Hothorn, T., Huber, W., Iacus, S., Irizarry, R., Leisch, F., Li, C., Mächler, M., Rossini, A., Sawitzki, G., Smith, C., Smyth, G., Tierney, L., Yang, J. and Zhang, J. (2004). Bioconductor: open software development for computational biology and bioinformatics. Genome Biology 5 R80.
[36] Green, P. and Silverman, B. (1994). Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. Chapman & Hall, London.
[37] Greenshtein, E. and Ritov, Y. (2004). Persistence in high-dimensional predictor selection and the virtue of over-parametrization. Bernoulli 10 971–988.
[38] Hansen, M. and Yu, B. (2001). Model selection and minimum description length principle. Journal of the American Statistical Association 96 746–774.
[39] Hastie, T. and Efron, B. (2004). lars: Least Angle Regression, Lasso and Forward Stagewise. R package version 0.9-7. URL http://CRAN.R-project.org.
[40] Hastie, T. and Tibshirani, R. (1986). Generalized additive models (with discussion). Statistical Science 1 297–318.
[41] Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models. Chapman & Hall, London.
[42] Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, New York.
[43] Hothorn, T. and Bühlmann, P. (2006). mboost: Model-Based Boosting. R package version 0.5-8. URL http://CRAN.R-project.org.

[44] Hothorn, T. and Bühlmann, P. (2006). Model-based boosting in high dimensions. Bioinformatics 22 2828–2829.
[45] Hothorn, T., Bühlmann, P., Dudoit, S., Molinaro, A. and van der Laan, M. (2006). Survival ensembles. Biostatistics 7 355–373.
[46] Hothorn, T., Hornik, K. and Zeileis, A. (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics 15 651–674.
[47] Hothorn, T., Hornik, K. and Zeileis, A. (2006). party: A Laboratory for Recursive Part(y)itioning. R package version 0.9-11. URL http://CRAN.R-project.org.
[48] Huang, J., Ma, S. and Zhang, C.-H. (2006). Adaptive Lasso for sparse high-dimensional regression. Tech. rep., University of Iowa.
[49] Hurvich, C., Simonoff, J. and Tsai, C.-L. (1998). Smoothing parameter selection in nonparametric regression using an improved Akaike information criterion. Journal of the Royal Statistical Society, Series B 60 271–293.
[50] Iyer, R., Lewis, D., Schapire, R., Singer, Y. and Singhal, A. (2000). Boosting for document routing. In Proceedings of CIKM-00, 9th ACM Int. Conf. on Information and Knowledge Management (A. Agah, J. Callan and E. Rundensteiner, eds.). ACM Press.
[51] Jiang, W. (2004). Process consistency for AdaBoost (with discussion). The Annals of Statistics 32 13–29 (disc. pp. 85–134).
[52] Kearns, M. and Valiant, L. (1994). Cryptographic limitations on learning Boolean formulae and finite automata. Journal of the Association for Computing Machinery 41 67–95.
[53] Koltchinskii, V. and Panchenko, D. (2002). Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics 30 1–50.
[54] Leitenstorfer, F. and Tutz, G. (2006). Smoothing with curvature constraints based on boosting techniques. In Proceedings in Computational Statistics (COMPSTAT) (A. Rizzi and M. Vichi, eds.). Physica-Verlag, Heidelberg.
[55] Leitenstorfer, F. and Tutz, G. (2007). Generalized monotonic regression based on B-splines with an application to air pollution data. Biostatistics (in press).
[56] Leitenstorfer, F. and Tutz, G. (2007). Knot selection by boosting techniques. Computational Statistics & Data Analysis 51 4605–4621.
[57] Lozano, A., Kulkarni, S. and Schapire, R. (2006). Convergence and consistency of regularized boosting algorithms with stationary β-mixing observations. In Advances in Neural Information Processing Systems (Y. Weiss, B. Schölkopf and J. Platt, eds.), vol. 18. MIT Press.
[58] Lugosi, G. and Vayatis, N. (2004). On the Bayes-risk consistency of regularized boosting methods (with discussion). The Annals of Statistics 32 30–55 (disc. pp. 85–134).
[59] Lutz, R. and Bühlmann, P. (2006). Boosting for high-multivariate responses in high-dimensional linear regression. Statistica Sinica 16 471–494.
[60] Mallat, S. and Zhang, Z. (1993). Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing 41 3397–3415.
[61] Mannor, S., Meir, R. and Zhang, T. (2003). Greedy algorithms for classification: consistency, convergence rates, and adaptivity. Journal of Machine Learning Research 4 713–741.
[62] Mason, L., Baxter, J., Bartlett, P. and Frean, M. (2000). Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers (A. Smola, P. Bartlett, B. Schölkopf and D. Schuurmans, eds.). MIT Press, Cambridge.

[63] McCaffrey, D., Ridgeway, G. and Morral, A. (2004). Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychological Methods 9 403–425.
[64] Mease, D., Wyner, A. and Buja, A. (2007). Cost-weighted boosting with jittering and over/under-sampling: JOUS-boost. Journal of Machine Learning Research 8 409–439.
[65] Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the Lasso. The Annals of Statistics 34 1436–1462.
[66] Meir, R. and Rätsch, G. (2003). An introduction to boosting and leveraging. In Advanced Lectures on Machine Learning (S. Mendelson and A. Smola, eds.). Lecture Notes in Computer Science, Springer.
[67] Osborne, M., Presnell, B. and Turlach, B. (2000). A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis 20 389–403.
[68] Park, M.-Y. and Hastie, T. (2007). An L1 regularization-path algorithm for generalized linear models. Journal of the Royal Statistical Society, Series B (in press).
[69] R Development Core Team (2006). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. URL http://www.R-project.org.
[70] Rätsch, G., Onoda, T. and Müller, K. (2001). Soft margins for AdaBoost. Machine Learning 42 287–320.
[71] Ridgeway, G. (1999). The state of boosting. Computing Science and Statistics 31 172–181.
[72] Ridgeway, G. (2000). Discussion of "Additive logistic regression: a statistical view of boosting" (J. Friedman, T. Hastie and R. Tibshirani, auths.). The Annals of Statistics 28 393–400.
[73] Ridgeway, G. (2002). Looking for lumps: boosting and bagging for density estimation. Computational Statistics & Data Analysis 38 379–392.
[74] Ridgeway, G. (2006). gbm: Generalized Boosted Regression Models. R package version 1.5-7. URL http://www.i-pensieri.com/gregr/gbm.shtml.
[75] Schapire, R. (1990). The strength of weak learnability. Machine Learning 5 197–227.
[76] Schapire, R. (2002). The boosting approach to machine learning: an overview. In MSRI Workshop on Nonlinear Estimation and Classification (D. Denison, M. Hansen, C. Holmes, B. Mallick and B. Yu, eds.). Springer.
[77] Schapire, R., Freund, Y., Bartlett, P. and Lee, W. (1998). Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics 26 1651–1686.
[78] Schapire, R. and Singer, Y. (2000). Boostexter: a boosting-based system for text categorization. Machine Learning 39 135–168.
[79] Southwell, R. (1946). Relaxation Methods in Theoretical Physics. Oxford University Press.
[80] Street, W., Mangasarian, O. and Wolberg, W. (1995). An inductive learning approach to prognostic prediction. In Proceedings of the Twelfth International Conference on Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA.

[81] Temlyakov, V. (2000). Weak greedy algorithms. Advances in Computational Mathematics 12 213–227.
[82] Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B 58 267–288.
[83] Tukey, J. (1977). Exploratory Data Analysis. Addison-Wesley, Reading, MA.
[84] Tutz, G. and Binder, H. (2006). Generalized additive modelling with implicit variable selection by likelihood based boosting. Biometrics 62 961–971.
[85] Tutz, G. and Binder, H. (2007). Boosting Ridge regression. Computational Statistics & Data Analysis (in press).
[86] Tutz, G. and Hechenbichler, K. (2005). Aggregating classifiers with ordinal response structure. Journal of Statistical Computation and Simulation 75 391–408.
[87] Tutz, G. and Leitenstorfer, F. (2007). Generalized smooth monotonic regression in additive modelling. Journal of Computational and Graphical Statistics 16 165–188.
[88] Tutz, G. and Reithinger, F. (2007). Flexible semiparametric mixed models. Statistics in Medicine 26 2872–2900.
[89] van der Laan, M. and Robins, J. (2003). Unified Methods for Censored Longitudinal Data and Causality. Springer, New York.
[90] West, M., Blanchette, C., Dressman, H., Huang, E., Ishida, S., Spang, R., Zuzan, H., Olson, J., Marks, J. and Nevins, J. (2001). Predicting the clinical status of human breast cancer by using gene expression profiles. Proceedings of the National Academy of Sciences (USA) 98 11462–11467.
[91] Yao, Y., Rosasco, L. and Caponnetto, A. (2007). On early stopping in gradient descent learning. Constructive Approximation (in press). URL http://math.berkeley.edu/~yao/publications/earlystop.pdf.
[92] Zhang, T. and Yu, B. (2005). Boosting with early stopping: convergence and consistency. The Annals of Statistics 33 1538–1579.
[93] Zhao, P. and Yu, B. (2006). On model selection consistency of Lasso. Journal of Machine Learning Research 7 2541–2563.
[94] Zhao, P. and Yu, B. (2005). Boosted Lasso. Tech. rep., University of California, Berkeley.
[95] Zhu, J., Rosset, S., Zou, H. and Hastie, T. (2005). Multiclass AdaBoost. Tech. rep., Stanford University.
[96] Zou, H. (2006). The adaptive Lasso and its oracle properties. Journal of the American Statistical Association 101 1418–1429.

Seminar für Statistik
ETH Zürich
CH-8092 Zürich
Switzerland
e-mail: buhlmann@stat.ethz.ch

Institut für Medizininformatik, Biometrie und Epidemiologie
Friedrich-Alexander-Universität Erlangen-Nürnberg
Waldstraße 6, D-91054 Erlangen, Germany
e-mail: Torsten.Hothorn@R-project.org