# Beyond Classification

Rob Schapire
Princeton University
[currently visiting Yahoo! Research]
## Classification and Beyond

• earlier, studied classification learning
  • goal: learn to classify examples into fixed set of categories
  • want to predict correct class as often as possible
  • many applications
• however, often faced with learning problems that don't fit this paradigm
  • predicting real-valued quantities:
    • how many times will some web page be visited?
    • how much will be bid on a particular advertisement?
  • predicting probabilities:
    • what is the probability user will click on some link?
    • how likely is it that some user is a spammer?
## This Lecture

• general techniques for:
  • predicting real-valued quantities — “regression”
  • predicting probabilities
• central, unifying idea: loss minimization
## Regression
## Example: Weather Prediction

• meteorologists A and B apply for job
• to test which is better:
  • ask each to predict how much it will rain
  • observe actual amount
  • repeat

|           | A (predicted) | B (predicted) | actual outcome |
|-----------|---------------|---------------|----------------|
| Monday    | 1.2           | 0.5           | 0.9            |
| Tuesday   | 0.1           | 0.3           | 0.0            |
| Wednesday | 2.0           | 1.0           | 2.1            |

• how to judge who gave better predictions?
## Example (cont.)

• natural idea:
  • measure discrepancy between predictions and outcomes
  • e.g., measure using absolute difference
  • choose forecaster with closest predictions overall

|           | A   | B   | actual outcome | difference A | difference B |
|-----------|-----|-----|----------------|--------------|--------------|
| Monday    | 1.2 | 0.5 | 0.9            | 0.3          | 0.4          |
| Tuesday   | 0.1 | 0.3 | 0.0            | 0.1          | 0.3          |
| Wednesday | 2.0 | 1.0 | 2.1            | 0.1          | 1.1          |
| total     |     |     |                | 0.5          | 1.8          |

• could have measured discrepancy in other ways
  • e.g., difference squared
• which measure to use?
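The totals in the table can be reproduced in a few lines (the numbers are taken from the example above; the squared-difference totals are added for comparison):

```python
# predictions and outcomes from the table above
preds_a = [1.2, 0.1, 2.0]
preds_b = [0.5, 0.3, 1.0]
actual = [0.9, 0.0, 2.1]

# total absolute difference for each forecaster
abs_a = sum(abs(p - y) for p, y in zip(preds_a, actual))
abs_b = sum(abs(p - y) for p, y in zip(preds_b, actual))

# the alternative measure: total squared difference
sq_a = sum((p - y) ** 2 for p, y in zip(preds_a, actual))
sq_b = sum((p - y) ** 2 for p, y in zip(preds_b, actual))
```

Under either measure A comes out ahead here (0.5 vs. 1.8 absolute; 0.11 vs. 1.46 squared), but the two measures need not agree in general.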
## Loss

• each forecast scored using loss function
  • x = weather conditions
  • f(x) = predicted amount
  • y = actual outcome
• loss function L(f(x), y) measures discrepancy between prediction f(x) and outcome y
• e.g.:
  • absolute loss: L(f(x), y) = |f(x) − y|
  • square loss: L(f(x), y) = (f(x) − y)²
• which L to use?
  • need to understand properties of loss functions
## Square Loss

• square loss often sensible because it encourages predictions close to the true expectation
• fix x
• say y random with µ = E[y]
• predict f = f(x)
• can show:

  E[L(f, y)] = E[(f − y)²] = (f − µ)² + Var(y)

  where the Var(y) term is the intrinsic randomness in y
• therefore:
  • minimized when f = µ
  • lower square loss ⇒ f closer to µ
  • forecaster with lowest square loss has predictions closest to E[y|x] on average
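The decomposition above is easy to check numerically; a minimal sketch, assuming NumPy is available (the exponential distribution and the 0.5 offset are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=100_000)  # y random, mean mu = 2

mu = y.mean()
f = mu + 0.5  # a deliberately off-target prediction

# the empirical version of E[(f - y)^2] = (f - mu)^2 + Var(y)
# is an exact identity, up to floating point
loss = np.mean((f - y) ** 2)
decomposed = (f - mu) ** 2 + y.var()

# and predicting the mean itself gives strictly lower loss
loss_at_mean = np.mean((mu - y) ** 2)
```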
## Learning for Regression

• say examples (x, y) generated at random
• expected square loss

  E[L_f] ≡ E[(f(x) − y)²]

  is minimized when f(x) = E[y|x] for all x
• how to minimize from training data (x_1, y_1), …, (x_m, y_m)?
• attempt to find f with minimum empirical loss:

  Ê[L_f] ≡ (1/m) Σ_{i=1}^m (f(x_i) − y_i)²

• if Ê[L_f] ≈ E[L_f] for all f, then the f that minimizes Ê[L_f] will approximately minimize E[L_f]
• for this to be possible, need to choose f of restricted form to avoid overfitting
## Linear Regression

• e.g., if x ∈ R^n, could choose to use linear predictors of form f(x) = w · x
• then need to find w to minimize

  (1/m) Σ_{i=1}^m (w · x_i − y_i)²

• can solve in closed form
• can also minimize on-line (e.g., using gradient descent)
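Both routes can be sketched in a few lines of NumPy (synthetic data; the true weight vector, noise level, learning rate, and iteration count are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=200)

# closed form: least-squares solution of X w = y
w_closed, *_ = np.linalg.lstsq(X, y, rcond=None)

# gradient descent on the same empirical square loss
w = np.zeros(3)
lr = 0.1
for _ in range(500):
    grad = (2 / len(y)) * X.T @ (X @ w - y)  # gradient of (1/m) sum (w.x_i - y_i)^2
    w -= lr * grad
```

Both reach the same minimizer; the on-line route matters when the data is too large to solve in one shot.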
## Regularization

• to constrain predictor further, common to add a regularization term to encourage small weights:

  (1/m) Σ_{i=1}^m (w · x_i − y_i)² + λ‖w‖²

  (in this case, called “ridge regression”)
• can significantly improve performance by limiting overfitting
• requires tuning of λ parameter
• different forms of regularization have different properties
  • e.g., using ‖w‖₁ instead tends to encourage “sparse” solutions
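Setting the gradient of the ridge objective to zero gives (Xᵀ X + λ m I) w = Xᵀ y, so the minimizer is still available in closed form; a minimal NumPy sketch (the data and the two λ values are arbitrary illustrations):

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form minimizer of (1/m) sum (w.x_i - y_i)^2 + lam * ||w||^2."""
    m, n = X.shape
    return np.linalg.solve(X.T @ X + lam * m * np.eye(n), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, 0.0, 0.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

w_light = ridge(X, y, lam=0.01)   # mild regularization: close to plain least squares
w_heavy = ridge(X, y, lam=100.0)  # heavy regularization: weights shrunk toward zero
```

Larger λ shrinks all weights toward zero but does not produce exact zeros; that sparsity-inducing behavior is what the ‖w‖₁ variant is for.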
## Absolute Loss

• what if we instead use L(f(x), y) = |f(x) − y|?
• can show E[|f(x) − y|] is minimized when f(x) = median of y’s conditional distribution, given x
• potentially quite different behavior from square loss
• not used as often
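The contrast with square loss shows up whenever the distribution is skewed, since then the median and mean differ; a quick numerical check, assuming NumPy (the exponential distribution is an arbitrary skewed example):

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.exponential(scale=2.0, size=100_000)  # skewed: median (2 ln 2) < mean (2)

med, mu = np.median(y), y.mean()

# the median, not the mean, minimizes average absolute loss
loss_at_median = np.mean(np.abs(med - y))
loss_at_mean = np.mean(np.abs(mu - y))
```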
## Summary so far

• can handle prediction of real-valued outcomes by:
  • choosing a loss function
  • computing a prediction rule with minimum loss on training data
• different loss functions have different properties:
  • square loss estimates conditional mean
  • absolute loss estimates conditional median
• what if goal is to estimate entire conditional distribution of y given x?
## Estimating Probabilities
## Weather Example (revisited)

• say goal now is to predict probability of rain
• again, can compare A and B’s predictions:

|           | A   | B   | actual outcome |
|-----------|-----|-----|----------------|
| Monday    | 60% | 80% | rain           |
| Tuesday   | 20% | 70% | no-rain        |
| Wednesday | 90% | 50% | no-rain        |

• which is better?
## Plausible Approaches

• similar to classification
• but goal now is to predict probability of class
• could reduce to regression:

  y = 1 if rain, 0 if no-rain

• minimize square loss to estimate

  E[y|x] = Pr[y = 1|x] = Pr[rain|x]

• reasonable, though somewhat awkward and unnatural (especially when there are more than two possible outcomes)
## Different Approach: Maximum Likelihood

• each forecaster is predicting a distribution over the set of outcomes y ∈ {rain, no-rain} for given x
• can compute probability of the observed outcomes according to each forecaster — the “likelihood”

|           | A   | B   | actual outcome | likelihood A | likelihood B |
|-----------|-----|-----|----------------|--------------|--------------|
| Monday    | 60% | 80% | rain           | 0.6          | 0.8          |
| Tuesday   | 20% | 70% | no-rain        | 0.8          | 0.3          |
| Wednesday | 90% | 50% | no-rain        | 0.1          | 0.5          |

  likelihood(A) = 0.6 × 0.8 × 0.1
  likelihood(B) = 0.8 × 0.3 × 0.5

• intuitively, higher likelihood ⇒ better fit of estimated probabilities to observations
• so: choose the maximum-likelihood forecaster
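Multiplying out the likelihoods above (note that on Tuesday, A's 20% chance of rain means A assigned 80% to the no-rain outcome that occurred):

```python
# probability each forecaster assigned to the outcome that actually occurred
probs_a = [0.6, 0.8, 0.1]  # rain Mon, no-rain Tue, no-rain Wed
probs_b = [0.8, 0.3, 0.5]

lik_a = probs_a[0] * probs_a[1] * probs_a[2]
lik_b = probs_b[0] * probs_b[1] * probs_b[2]
```

Here likelihood(A) = 0.048 and likelihood(B) = 0.12, so maximum likelihood prefers B; A's single confident mistake on Wednesday (90% rain, but no rain) costs it heavily.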
## Log Loss

• given training data (x_1, y_1), …, (x_m, y_m)
• f(y|x) = predicted probability of y on given x
• likelihood of f = Π_{i=1}^m f(y_i|x_i)
• maximizing likelihood ≡ minimizing negative log likelihood

  Σ_{i=1}^m (−log f(y_i|x_i))

• L(f(·|x), y) = −log f(y|x) is called “log loss”
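For the weather example, the same comparison in log-loss form (probabilities from the likelihood table; sums of −log terms in place of products of probabilities):

```python
import math

# probability each forecaster assigned to the observed outcomes
probs_a = [0.6, 0.8, 0.1]
probs_b = [0.8, 0.3, 0.5]

# total log loss: sum of -log f(y_i | x_i)
log_loss_a = sum(-math.log(p) for p in probs_a)
log_loss_b = sum(-math.log(p) for p in probs_b)
```

Lower total log loss corresponds exactly to higher likelihood, since −log turns products into sums and reverses the ordering.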
## Estimating Probabilities

• Pr[y|x] = true probability of y given x
• can prove: E[−log f(y|x)] is minimized when f(y|x) = Pr[y|x]
• more generally,

  E[−log f(y|x)] = (average distance between f(·|x) and Pr[·|x]) + (intrinsic uncertainty of Pr[·|x])

• so: minimizing log loss encourages choice of predictor close to true conditional probabilities
## Learning

• given training data (x_1, y_1), …, (x_m, y_m), choose f(y|x) to minimize

  (1/m) Σ_i (−log f(y_i|x_i))

• as before, need to restrict form of f
• e.g.: if x ∈ R^n, y ∈ {0, 1}, common to use f of form

  f(y = 1|x) = σ(w · x), where σ(z) = 1/(1 + e^{−z})

• can numerically find w to minimize log loss
• “logistic regression”
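A minimal logistic-regression sketch, assuming NumPy (synthetic data; the true weights, learning rate, and iteration count are arbitrary choices, and plain gradient descent stands in for the more careful numerical methods usually used):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# synthetic data whose labels really do follow a logistic model
rng = np.random.default_rng(4)
X = rng.normal(size=(500, 2))
w_true = np.array([3.0, -2.0])
y = (rng.random(500) < sigmoid(X @ w_true)).astype(float)

# gradient descent on average log loss; the gradient works out to X^T (p - y) / m
w = np.zeros(2)
lr = 0.5
for _ in range(2000):
    p = sigmoid(X @ w)
    w -= lr * X.T @ (p - y) / len(y)

train_acc = np.mean((sigmoid(X @ w) > 0.5) == (y == 1))
```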
## Log Loss and Square Loss

• e.g.: if x ∈ R^n, y ∈ R, can take f(y|x) to be Gaussian with mean w · x and fixed variance
• then minimizing log loss ≡ linear regression
• in general: square loss ≡ log loss with Gaussian conditional probability distributions (and fixed variance)
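The equivalence is a one-line calculation: writing out −log of the Gaussian density with mean w · x and fixed variance σ² gives

```latex
-\log f(y \mid x)
  = -\log\left[ \frac{1}{\sqrt{2\pi\sigma^2}}
      \exp\left( -\frac{(y - w \cdot x)^2}{2\sigma^2} \right) \right]
  = \frac{(y - w \cdot x)^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2)
```

With σ fixed, the second term is a constant, so minimizing total log loss over w is exactly minimizing the sum of squared errors.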
## Classification and Loss Minimization

• in classification learning, try to minimize 0-1 loss:

  L(f(x), y) = 1 if f(x) ≠ y, 0 otherwise

• expected 0-1 loss = generalization error
• empirical 0-1 loss = training error
• computationally and numerically difficult loss since discontinuous and not convex
• to handle this, both AdaBoost and SVM’s minimize alternative surrogate losses
  • AdaBoost: “exponential” loss
  • SVM’s: “hinge” loss
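Writing the classifier as a real-valued f with label y ∈ {−1, +1}, these losses are usually expressed in terms of the margin y·f(x); a small sketch showing that both surrogates sit on or above the 0-1 loss (the margin values checked are arbitrary):

```python
import math

def zero_one(margin):
    """0-1 loss in margin form: an error iff y*f(x) <= 0."""
    return 1.0 if margin <= 0 else 0.0

def hinge(margin):
    """SVM surrogate: max(0, 1 - y*f(x))."""
    return max(0.0, 1.0 - margin)

def exponential(margin):
    """AdaBoost surrogate: exp(-y*f(x))."""
    return math.exp(-margin)
```

Both surrogates are continuous and convex, and each upper-bounds the 0-1 loss, so driving a surrogate down also drives the training error down.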
## Summary

• much of learning can be viewed simply as loss minimization
• different losses have different properties and purposes
  • regression (real-valued labels):
    • use square loss to estimate conditional mean
    • use absolute loss to estimate conditional median
  • estimating conditional probabilities:
    • use log loss (≡ maximum likelihood)
  • classification:
    • use 0-1 loss (or a surrogate)
• provides a unified and flexible means of algorithm design
