
**Beyond Classification**

Rob Schapire

Princeton University

[currently visiting Yahoo! Research]

**Classification and Beyond**

• earlier, studied classification learning

• goal: learn to classify examples into fixed set of categories

• want to predict correct class as often as possible

• many applications

• however, often faced with learning problems that don’t fit this paradigm:

• predicting real-valued quantities:

• how many times will some web page be visited?

• how much will be bid on a particular advertisement?

• predicting probabilities:

• what is the probability user will click on some link?

• how likely is it that some user is a spammer?

**This Lecture**

• general techniques for:

• predicting real-valued quantities — “regression”

• predicting probabilities

• central, unifying idea: loss minimization

**Regression**

**Example: Weather Prediction**

• meteorologists A and B apply for job

• to test which is better:

• ask each to predict how much it will rain

• observe actual amount

• repeat

|           | A (prediction) | B (prediction) | actual outcome |
|-----------|----------------|----------------|----------------|
| Monday    | 1.2            | 0.5            | 0.9            |
| Tuesday   | 0.1            | 0.3            | 0.0            |
| Wednesday | 2.0            | 1.0            | 2.1            |

• how to judge who gave better predictions?

**Example (cont.)**

• natural idea:

• measure discrepancy between predictions and outcomes

• e.g., measure using absolute difference

• choose forecaster with closest predictions overall

|           | A (prediction) | B (prediction) | actual outcome | A difference | B difference |
|-----------|----------------|----------------|----------------|--------------|--------------|
| Monday    | 1.2            | 0.5            | 0.9            | 0.3          | 0.4          |
| Tuesday   | 0.1            | 0.3            | 0.0            | 0.1          | 0.3          |
| Wednesday | 2.0            | 1.0            | 2.1            | 0.1          | 1.1          |
| total     |                |                |                | 0.5          | 1.8          |

• could have measured discrepancy in other ways

• e.g., difference squared

• which measure to use?
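
To make the comparison concrete, here is a minimal plain-Python sketch (the variable names are ours, not the slides’) that totals both discrepancy measures from the table above; on this data A wins under either measure:

```python
# Total the absolute and squared differences between each forecaster's
# predictions and the actual rainfall amounts from the table above.
preds_a = [1.2, 0.1, 2.0]
preds_b = [0.5, 0.3, 1.0]
actual  = [0.9, 0.0, 2.1]

abs_a = sum(abs(p - y) for p, y in zip(preds_a, actual))    # 0.5
abs_b = sum(abs(p - y) for p, y in zip(preds_b, actual))    # 1.8
sq_a  = sum((p - y) ** 2 for p, y in zip(preds_a, actual))  # 0.11
sq_b  = sum((p - y) ** 2 for p, y in zip(preds_b, actual))  # 1.46

print(f"absolute: A={abs_a:.2f}, B={abs_b:.2f}")
print(f"squared:  A={sq_a:.2f}, B={sq_b:.2f}")
```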

**Loss**

• each forecast scored using loss function

x = weather conditions

f (x) = predicted amount

y = actual outcome

• loss function L(f (x), y) measures discrepancy between prediction f (x) and outcome y

• e.g.:

• absolute loss: L(f (x), y) = |f (x) − y|

• square loss: L(f (x), y) = (f (x) − y)²

• which L to use?

• need to understand properties of loss functions

**Square Loss**

• square loss often sensible because encourages predictions close to true expectation

• fix x

• say y random with µ = E[y]

• predict f = f (x)

• can show:

E[L(f , y)] = E[(f − y)²] = (f − µ)² + Var(y)

(the Var(y) term is the intrinsic randomness of the outcome)

• therefore:

• minimized when f = µ

• lower square loss ⇒ f closer to µ

• forecaster with lowest square loss has predictions closest to E[y|x] on average
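
The decomposition used on this slide follows from a standard expansion around µ; a short sketch of the argument (treating f as fixed once x is fixed):

```latex
\begin{aligned}
\mathbb{E}\big[(f-y)^2\big]
  &= \mathbb{E}\big[\big((f-\mu)+(\mu-y)\big)^2\big] \\
  &= (f-\mu)^2 + 2(f-\mu)\,\mathbb{E}[\mu-y] + \mathbb{E}\big[(\mu-y)^2\big] \\
  &= (f-\mu)^2 + \operatorname{Var}(y)
  \qquad \text{since } \mathbb{E}[\mu-y] = 0 .
\end{aligned}
```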

**Learning for Regression**

• say examples (x, y) generated at random

• expected square loss

E[L_f] ≡ E[(f (x) − y)²]

**minimized when f (x) = E[y|x] for all x**

• how to minimize from training data (x_1, y_1), …, (x_m, y_m)?

• attempt to find f with minimum empirical loss:

Ê[L_f] ≡ (1/m) ∑_{i=1}^{m} (f (x_i) − y_i)²

• if ∀f : Ê[L_f] ≈ E[L_f], then the f that minimizes Ê[L_f] will approximately minimize E[L_f]

• to be possible, need to choose f of restricted form to avoid overfitting

**Linear Regression**

• e.g., if x ∈ R^n, could choose to use linear predictors of form

f (x) = w · x

• then need to find w to minimize

(1/m) ∑_{i=1}^{m} (w · x_i − y_i)²

• can solve in closed form

• can also minimize on-line (e.g. using gradient descent)
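
Both routes can be sketched in a few lines of numpy (a minimal illustration on synthetic data, not code from the slides): the closed form solves the normal equations (XᵀX)w = Xᵀy, and gradient descent repeatedly steps against the gradient of the empirical loss.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 3
X = rng.normal(size=(m, n))                 # rows are examples x_i
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=m)

# closed form: solve the normal equations (X^T X) w = X^T y
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# gradient descent on (1/m) * sum_i (w . x_i - y_i)^2
w = np.zeros(n)
lr = 0.1
for _ in range(500):
    grad = (2.0 / m) * X.T @ (X @ w - y)    # gradient of the empirical loss
    w -= lr * grad

print(np.allclose(w, w_closed, atol=1e-3))  # the two routes agree
```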

**Regularization**

• to constrain predictor further, common to add regularization term to encourage small weights:

(1/m) ∑_{i=1}^{m} (w · x_i − y_i)² + λ‖w‖²

(in this case, called “ridge regression”)

• can significantly improve performance by limiting overfitting

• requires tuning of λ parameter

• different forms of regularization have different properties

• e.g., using ‖w‖₁ instead tends to encourage “sparse” solutions
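
Ridge regression also has a closed form; a minimal numpy sketch (assuming the scaling convention of the objective above, where λ multiplies ‖w‖² directly): setting the gradient to zero gives (XᵀX + λmI)w = Xᵀy.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize (1/m) * sum_i (w . x_i - y_i)^2 + lam * ||w||^2."""
    m, n = X.shape
    # gradient = (2/m) X^T (Xw - y) + 2*lam*w = 0
    return np.linalg.solve(X.T @ X + lam * m * np.eye(n), X.T @ y)
```

In practice λ is tuned by holding out part of the training data and choosing the value with the lowest held-out loss.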

**Absolute Loss**

• what if instead use L(f (x), y) = |f (x) − y| ?

• can show E[|f (x) − y|] minimized when f (x) = median of y’s conditional distribution, given x

• potentially, quite different behavior from square loss

• not used so often
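
A quick numerical illustration of the mean/median contrast (a sketch on a deliberately skewed sample of our own choosing): among constant predictions, the sample mean minimizes average square loss while the sample median minimizes average absolute loss.

```python
import numpy as np

y = np.array([0.0, 0.1, 0.2, 0.3, 5.0])   # skewed: mean and median differ
grid = np.linspace(-1, 6, 7001)            # candidate constant predictions f

sq = [np.mean((f - y) ** 2) for f in grid]
ab = [np.mean(np.abs(f - y)) for f in grid]

print(grid[np.argmin(sq)], np.mean(y))     # ~1.12 = sample mean
print(grid[np.argmin(ab)], np.median(y))   # ~0.2  = sample median
```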

**Summary so far**

• can handle prediction of real-valued outcomes by:

• choosing a loss function

• computing a prediction rule with minimum loss on training data

• different loss functions have different properties:

• square loss estimates conditional mean

• absolute loss estimates conditional median

• what if goal is to estimate entire conditional distribution of y given x?

**Estimating Probabilities**

**Weather Example (revisited)**

• say goal now is to predict probability of rain

• again, can compare A and B’s predictions:

|           | A (prediction) | B (prediction) | actual outcome |
|-----------|----------------|----------------|----------------|
| Monday    | 60%            | 80%            | rain           |
| Tuesday   | 20%            | 70%            | no-rain        |
| Wednesday | 90%            | 50%            | no-rain        |

• which is better?

**Plausible Approaches**

• similar to classification

• but goal now is to predict probability of class

• could reduce to regression:

y = 1 if rain, 0 if no-rain

• minimize square loss to estimate

E[y|x] = Pr[y = 1|x] = Pr[rain|x]

• reasonable, though somewhat awkward and unnatural (especially when more than two possible outcomes)

**Different Approach: Maximum Likelihood**

• each forecaster predicting distribution over set of outcomes y ∈ {rain, no-rain} for given x

• can compute probability of observed outcomes according to each forecaster — “likelihood”

|           | A (prediction) | B (prediction) | actual outcome | A likelihood | B likelihood |
|-----------|----------------|----------------|----------------|--------------|--------------|
| Monday    | 60%            | 80%            | rain           | 0.6          | 0.8          |
| Tuesday   | 20%            | 70%            | no-rain        | 0.8          | 0.3          |
| Wednesday | 90%            | 50%            | no-rain        | 0.1          | 0.5          |

likelihood(A) = .6 × .8 × .1

likelihood(B) = .8 × .3 × .5

• intuitively, higher likelihood ⇒ better fit of estimated probabilities to observations

• so: choose maximum-likelihood forecaster
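
Carrying out the arithmetic above (a small plain-Python sketch; the log-loss lines preview the next slide, since sums of −log p are more numerically stable than long products):

```python
import math

# probability each forecaster assigned to the outcome actually observed
obs_prob_a = [0.6, 0.8, 0.1]
obs_prob_b = [0.8, 0.3, 0.5]

def likelihood(ps):
    out = 1.0
    for p in ps:
        out *= p
    return out

print(likelihood(obs_prob_a))                 # 0.048
print(likelihood(obs_prob_b))                 # 0.120 -> B fits better
print(sum(-math.log(p) for p in obs_prob_a))  # ~3.04 (log loss of A)
print(sum(-math.log(p) for p in obs_prob_b))  # ~2.12 (log loss of B, smaller)
```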

**Log Loss**

• given training data (x_1, y_1), …, (x_m, y_m)

• f (y|x) = predicted probability of y for given x

• likelihood of f = ∏_{i=1}^{m} f (y_i|x_i)

• maximizing likelihood ≡ minimizing negative log likelihood

∑_{i=1}^{m} (−log f (y_i|x_i))

• L(f (·|x), y) = −log f (y|x) called “log loss”

**Estimating Probabilities**

• Pr[y|x] = true probability of y given x

• can prove: E[−log f (y|x)] minimized when f (y|x) = Pr[y|x]

• more generally,

E[−log f (y|x)] = (average distance between f (·|x) and Pr[·|x]) + (intrinsic uncertainty of Pr[·|x])

• so: minimizing log loss encourages choice of predictor close to true conditional probabilities

**Learning**

• given training data (x_1, y_1), …, (x_m, y_m), choose f (y|x) to minimize

(1/m) ∑_i (−log f (y_i|x_i))

• as before, need to restrict form of f

• e.g.: if x ∈ R^n, y ∈ {0, 1}, common to use f of form

f (y = 1|x) = σ(w · x)

where σ(z) = 1/(1 + e^(−z))

• can numerically find w to minimize log loss

• “logistic regression”
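
A minimal numpy sketch of this numerical route (synthetic data and plain gradient descent, chosen for illustration): for f (y = 1|x) = σ(w · x), the gradient of the average log loss works out to (1/m) Xᵀ(σ(Xw) − y).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
m, n = 500, 2
X = rng.normal(size=(m, n))
w_true = np.array([2.0, -1.0])
y = (rng.random(m) < sigmoid(X @ w_true)).astype(float)  # labels in {0, 1}

# gradient descent on (1/m) * sum_i -log f(y_i|x_i)
w = np.zeros(n)
lr = 0.5
for _ in range(2000):
    w -= lr * (X.T @ (sigmoid(X @ w) - y)) / m

print(w)  # roughly recovers w_true
```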

**Log Loss and Square Loss**

• e.g.: if x ∈ R^n, y ∈ R, can take f (y|x) to be gaussian with mean w · x and fixed variance

• then minimizing log loss ≡ linear regression

• general: square loss ≡ log loss with gaussian conditional probability distributions (and fixed variance)
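
A sketch of why: writing f (y|x) as a gaussian density with mean w · x and fixed variance σ², its negative log is squared error up to terms that do not depend on w,

```latex
-\log f(y \mid x)
  = -\log\!\left[ \frac{1}{\sqrt{2\pi}\,\sigma}
      \exp\!\left( -\frac{(y - w \cdot x)^2}{2\sigma^2} \right) \right]
  = \frac{(y - w \cdot x)^2}{2\sigma^2} + \log\!\left( \sqrt{2\pi}\,\sigma \right),
```

so with σ fixed, minimizing the average log loss over w is exactly least-squares linear regression.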

**Classification and Loss Minimization**

• in classification learning, try to minimize 0-1 loss

L(f (x), y) = 1 if f (x) ≠ y, 0 otherwise

• expected 0-1 loss = generalization error

• empirical 0-1 loss = training error

• computationally and numerically difficult loss since discontinuous and not convex

• to handle, both AdaBoost and SVM’s minimize alternative surrogate losses

• AdaBoost: “exponential” loss

• SVM’s: “hinge” loss
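
For intuition, all three losses can be written as functions of the margin z = y·f (x) with labels in {−1, +1} (a small sketch; the margin formulation is standard but not spelled out on the slide). Both surrogates are convex upper bounds on the 0-1 loss:

```python
import math

def zero_one(z):     # 1 on a mistake (non-positive margin), else 0
    return 1.0 if z <= 0 else 0.0

def exponential(z):  # AdaBoost's surrogate: e^{-z}
    return math.exp(-z)

def hinge(z):        # SVM's surrogate: max(0, 1 - z)
    return max(0.0, 1.0 - z)

for z in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    print(f"z={z:+.1f}  0-1={zero_one(z):.0f}  "
          f"exp={exponential(z):.3f}  hinge={hinge(z):.3f}")
```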

**Summary**

• much of learning can be viewed simply as loss minimization

• different losses have different properties and purposes

• regression (real-valued labels):

• use square loss to estimate conditional mean

• use absolute loss to estimate conditional median

• estimating conditional probabilities:

• use log loss (≡ maximum likelihood)

• classiﬁcation:

• use 0/1-loss (or surrogate)

• provides unified and flexible means of algorithm design


