
Ridge Regression

Ryota Tomioka
Department of Mathematical Informatics
The University of Tokyo
About this class

- Bad news: this class will be in English.
- Good news: the topic, ridge regression, is probably already familiar to you.
- Even better news: if you ask a question in English during the class, you don't need to hand in any assignment (no report) for this class.
  # Of course you can still ask questions in Japanese, but then you have to hand in your assignment as usual.
Why English?

Number of speakers?
- Chinese (Mandarin): 845 million
- Spanish: 329 million
- English: 328 million

Let's compare the Gamma distribution article on Wikipedia.

No! It is not the number of speakers but the number of readers.
- Japanese/Spanish/Chinese Wikipedia: read mostly by native speakers.
- English Wikipedia: read by many non-native speakers.

English for non-native speakers
- English is the best language to express your ideas, inventions, and research.
- Nobody speaks (or writes) perfect English.
- The world is full of bad English (but who cares).
Outline

- Ridge Regression (regularized linear regression)
  - Formulation
  - Handling nonlinearity using basis functions
  - Classification
  - Multi-class classification
- Singularity: the dark side of RR
  - Why does it happen?
  - How can we avoid it?
- Summary
Problem Setting

- Training examples: (x_i, y_i) (i = 1, …, n), x_i ∈ R^p.
- Goal: learn a linear function

    f(x*) = w^⊤x*    (w ∈ R^p)

  that predicts the output y* for a test point (x*, y*) ~ P(X, Y).
- Note that the test point is not included in the training examples (we want generalization!).

[Figure: training pairs (x_1, y_1), …, (x_n, y_n) drawn IID from P(X, Y); a test input x* arrives and its output is unknown.]
Ridge Regression

Solve the minimization problem

    minimize_w  ‖y − Xw‖² + λ‖w‖²

where ‖y − Xw‖² is the training error and λ‖w‖² is the regularization (ridge) term (λ: regularization constant).

Note: this can be interpreted as Maximum A Posteriori (MAP) estimation with a Gaussian likelihood and a Gaussian prior (with noise variance σ² and prior variance τ², the regularization constant is λ = σ²/τ²).

Here

    y = (y_1, y_2, …, y_n)^⊤        (target output)
    X = (x_1, x_2, …, x_n)^⊤        (n × p design matrix, one row per example)
Designing the design matrix

- Columns of X can be different sources of info.
  E.g., predicting the price of an apartment from columns such as: #rooms, train st., size, bathroom, sunlight, pet OK.
- Columns of X can also be derived features.
  E.g., polynomial regression:

    X = [ x_1^{p−1} ⋯ x_1² x_1 1
          x_2^{p−1} ⋯ x_2² x_2 1
              ⋮
          x_n^{p−1} ⋯ x_n² x_n 1 ]
Solving ridge regression

Take the gradient and solve

    −X^⊤(y − Xw) + λw = 0,

which gives

    ŵ = (X^⊤X + λI_p)^{−1} X^⊤y        (I_p: p × p identity matrix)

The solution can also be written as (exercise)

    ŵ = X^⊤(XX^⊤ + λI_n)^{−1} y
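As a sanity check, here is a minimal MATLAB/Octave sketch (on synthetic data, not part of the original slides) that computes both forms and verifies they coincide:

    % primal vs. dual form of the ridge solution on synthetic data
    n = 50; p = 10; lambda = 0.1;
    X = randn(n, p); y = randn(n, 1);
    w_primal = (X'*X + lambda*eye(p)) \ (X'*y);    % (X'X + lambda*I_p)^{-1} X'y
    w_dual   = X' * ((X*X' + lambda*eye(n)) \ y);  % X'(XX' + lambda*I_n)^{-1} y
    disp(norm(w_primal - w_dual))                  % ~0 up to rounding error

The dual form is preferable when p ≫ n, since it inverts an n × n rather than a p × p matrix.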
Example: polynomial fitting

Degree-(p−1) polynomial model:

    y = w_1 x^{p−1} + ⋯ + w_{p−1} x + w_p + noise
      = (x^{p−1} ⋯ x 1) (w_1, …, w_{p−1}, w_p)^⊤ + noise

Design matrix:

    X = [ x_1^{p−1} ⋯ x_1² x_1 1
          x_2^{p−1} ⋯ x_2² x_2 1
              ⋮
          x_n^{p−1} ⋯ x_n² x_n 1 ]
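A minimal MATLAB/Octave sketch of this recipe (the target function and noise level are invented for illustration):

    % degree-(p-1) polynomial fit via ridge regression
    n = 20; p = 6; lambda = 0.001;
    x = linspace(-1, 1, n)';                    % scalar inputs
    y = x.^2 - x.^4 + 0.1*randn(n, 1);          % synthetic noisy targets
    Xd = bsxfun(@power, x, (p-1):-1:0);         % columns: x.^(p-1), ..., x, 1
    w = (Xd'*Xd + lambda*eye(p)) \ (Xd'*y);     % ridge solution
    yhat = Xd * w;                              % fitted values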

Example: 5th-order polynomial fitting

[Figure: true function, learned function (λ = 0.001), and training examples.]

    True:    w* = (0, 0, 1, 0, 1, 0)^⊤
    Learned: ŵ = (0.63, 0.37, 3.3, 0.41, 3.0, 0.052)^⊤
Example: RBF fitting

Gaussian radial basis function (Gaussian-RBF):

    φ(x; μ_c) = exp( −(x − μ_c)²/2 )

[Figure: a set of Gaussian-RBF bumps centered at points μ_c.]
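A sketch of RBF fitting (placing the centers at the training inputs and fixing the width to 1 are modeling choices, not prescribed by the slides):

    % Gaussian-RBF ridge regression on synthetic 1-D data
    n = 30; lambda = 1e-3;
    x = linspace(-2, 2, n)'; y = sin(2*x) + 0.3*randn(n, 1);
    mu = x';                                        % centers at training inputs
    Phi = exp(-bsxfun(@minus, x, mu).^2 / 2);       % n x n design matrix
    w = (Phi'*Phi + lambda*eye(n)) \ (Phi'*y);
    xs = linspace(-2, 2, 200)';                     % dense test grid
    Phis = exp(-bsxfun(@minus, xs, mu).^2 / 2);
    fhat = Phis * w;                                % learned function

Sweeping lambda over 10⁻⁸, …, 10 reproduces the qualitative behavior in the following figures.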

RR-RBF results for λ = 10⁻⁸, 10⁻⁷, 10⁻⁶, 10⁻⁵, 10⁻⁴, 10⁻³, 10⁻², 10⁻¹, 1, 10

[Figures: one panel per λ, each showing the true function, the learned function, and the training examples on x ∈ [−2, 2]. Small λ overfits the noise; large λ oversmooths.]
Binary classification

- Target y is +1 or −1: e.g., orange (+1) or lemon (−1).
- Just apply ridge regression with +1/−1 targets (forget about the Gaussian noise assumption!).
- We again use the Gaussian RBF, now with a vector input x:

    φ(x; μ_c) = exp( −‖x − μ_c‖²/2 )

- y ∈ {+1, −1}ⁿ is the vector of outputs to be predicted.
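A minimal sketch (synthetic two-dimensional data, and a raw linear design matrix rather than the RBF features, for brevity):

    % binary classification by ridge regression on +1/-1 targets
    n = 100; p = 2; lambda = 0.1;
    X = randn(n, p);
    y = sign(X(:,1) + X(:,2) + 0.1*randn(n, 1));  % synthetic labels
    w = (X'*X + lambda*eye(p)) \ (X'*y);          % regress on the labels
    ypred = sign(X * w);                          % predicted class
    accuracy = mean(ypred == y)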
Classification: Truth

[Figure: the true labels (+1/−1) of points in the 2-D input space.]
Classification with RR, λ = 0.1

[Figures: truth (left) vs. learned decision regions (right) for λ = 0.1.]

Classification with RR, λ = 1

[Figures: truth (left) vs. learned decision regions (right) for λ = 1.]

Classification with RR, λ = 10

[Figures: truth (left) vs. learned decision regions (right) for λ = 10.]

Classification with RR, λ = 100

[Figures: truth (left) vs. learned decision regions (right) for λ = 100.]
Multi-class classification

[Figure: sample 16×16 USPS images of the digits 0-9.]
USPS digits dataset

http://www-stat-class.stanford.edu/~tibs/ElemStatLearn/datasets/zip.info
7291 training samples, 2007 test samples.

Define the output Y as an indicator (one-hot) matrix with one row per sample and one column per digit class:

    Y = [ 0 1 0 0 0 ⋯
          1 0 0 0 0 ⋯
          0 0 0 1 0 ⋯
              ⋮
          0 1 0 0 0 ⋯ ]      (number of samples × number of classes)
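A sketch of the multi-class recipe (the data below are synthetic stand-ins for the USPS images; each column of Y is one binary ridge problem, and the predicted class is the column with the largest output):

    % multi-class classification via one-hot ridge regression
    n = 500; p = 256; K = 10; lambda = 1e-6;
    Xtr = randn(n, p);                             % stand-in for USPS images
    labels = randi(K, n, 1);
    Ytr = full(sparse((1:n)', labels, 1, n, K));   % n x K indicator matrix
    W = (Xtr'*Xtr + lambda*eye(p)) \ (Xtr'*Ytr);   % solves all K problems at once
    [~, pred] = max(Xtr * W, [], 2);               % class with largest output
    accuracy = mean(pred == labels)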
USPS dataset

[Figure: classification accuracy (%) on the USPS test set vs. number of training samples (500 to 8000), λ = 10⁻⁶.]

- We can obtain 88% accuracy on a held-out test set using about 7000 training examples; held-out performance is what machine learners care about.
- A machine can learn! (using a very simple learning algorithm)
Summary (so far)

- Ridge regression (RR) is very simple.
- RR can be coded in one line (MATLAB):

    W = (X'*X + lambda*eye(p)) \ (X'*Y);

- RR can prevent over-fitting by regularization.
- Classification problems can also be solved by properly defining the output Y.
- Nonlinearities can be handled by using basis functions (polynomial, Gaussian RBF, etc.).
Singularity - the dark side of RR

USPS dataset (p = 256) (what I have been hiding):
the more data, the less accurate??
[Figure: accuracy (%) vs. number of samples n (16 to 8192), λ = 10⁻⁶. 256 is the number of pixels (16×16) in the image.]
Breast Cancer Wisconsin (diagnostic) dataset (p = 30)

[Figure: accuracy (%) vs. number of samples n (0 to 250), λ = 10⁻⁶.]

30 real-valued features: radius, texture, perimeter, area, etc.
SPECT Heart dataset (p = 22)

[Figure: accuracy (%) vs. number of samples n (0 to 250), λ = 10⁻⁶.]

22 binary features.
Spambase dataset (p = 57)

[Figure: accuracy (%) vs. number of samples n (10⁰ to 10⁴, log scale), λ = 10⁻⁶.]

55 real-valued features (word frequency, character frequency) and 2 integer-valued features (run-length).
Musk dataset (p = 166)

[Figure: accuracy (%) vs. number of samples n (0 to 500), λ = 10⁻⁶.]

166 real-valued features.
Singularity

- Why does it happen?
- How can we avoid it?

Why does it happen?

Let's analyze the simplest case: regression.

- Model: the design matrix X is fixed (X is not a random variable).
- Output (Gaussian noise):

    y = Xw* + ε,   ε ~ N(0, σ²I_n)

- Estimator:

    ŵ = (X^⊤X + λI_p)^{−1} X^⊤y

- Generalization error (estimated ŵ vs. truth w*; expectation with respect to ε):

    E_ε ‖ŵ − w*‖²
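A Monte-Carlo sketch of this quantity (the dimensions and noise level are illustrative choices):

    % estimate E||what - w*||^2 over fresh noise draws, fixed design
    n = 80; p = 100; lambda = 1e-6; sigma = 1; T = 200;
    X = randn(n, p); wstar = randn(p, 1);
    err = zeros(T, 1);
    for t = 1:T
      y = X*wstar + sigma*randn(n, 1);            % fresh Gaussian noise
      what = (X'*X + lambda*eye(p)) \ (X'*y);
      err(t) = norm(what - wstar)^2;
    end
    mean(err)                                     % approximates E||what - w*||^2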
Analysis strategy

(1) Bias-variance decomposition:

    E_ε ‖ŵ − w*‖² = E_ε ‖ŵ − w̄‖² + ‖w̄ − w*‖²

where the first term is the variance, the second is the squared bias, and w̄ = E_ε ŵ is the mean estimator.

(2) Analyze the variance: E_ε ‖ŵ − w̄‖² = ?

(3) Analyze the bias: ‖w̄ − w*‖² = ?
Analyze the variance (sketch)

1. Show that

    E_ε ‖ŵ − w̄‖² = σ² Tr[ (X^⊤X + λI_p)^{−2} X^⊤X ]

2. Let s_1 > 0, …, s_m > 0 be the positive singular values of X (m = min(n, p)). Show that

    E_ε ‖ŵ − w̄‖² = σ² Σ_{i=1}^m s_i²/(s_i² + λ)²
                   → σ² Σ_{i=1}^m 1/s_i²       (λ → 0)

The variance can be large if the minimum singular value is close to zero!
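The formula in step 2 is easy to evaluate numerically; a sketch (same illustrative setup as before):

    % variance term from the singular values of X
    n = 80; p = 100; lambda = 1e-6; sigma = 1;
    X = randn(n, p);
    s = svd(X);                                  % the m = min(n,p) singular values
    variance = sigma^2 * sum(s.^2 ./ (s.^2 + lambda).^2)
    % as lambda -> 0 this tends to sigma^2 * sum(1 ./ s.^2),
    % which blows up when the smallest singular value is near zero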
Analyze the bias (sketch)

1. Show that

    ‖w̄ − w*‖² = λ² ‖(X^⊤X + λI_p)^{−1} w*‖²

2. Show that

    ‖w̄ − w*‖² = Σ_{i=1}^p ( λ v_i^⊤w* / (s_i² + λ) )²

where s_i = 0 (if i > m) and v_i is the i-th right singular vector of X. As λ → 0,

    ‖w̄ − w*‖² → Σ_{i=n+1}^p (v_i^⊤w*)²   (n < p),
                 0                        (otherwise).
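A sketch checking step 2 against the closed form of step 1 (illustrative dimensions):

    % bias term via the full SVD of X
    n = 80; p = 100; lambda = 1e-6;
    X = randn(n, p); wstar = randn(p, 1);
    [~, S, V] = svd(X);                           % V is p x p
    s = zeros(p, 1); s(1:min(n, p)) = diag(S);    % s_i = 0 for i > m
    bias2_svd    = sum((lambda * (V'*wstar) ./ (s.^2 + lambda)).^2)
    bias2_direct = norm(lambda * ((X'*X + lambda*eye(p)) \ wstar))^2
    % the two expressions agree up to rounding error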
Result (λ = 10⁻⁶)

[Figure: generalization error ‖ŵ − w*‖² vs. number of samples (0 to 250); p = 100, λ = 10⁻⁶; curves: simulation, bias, variance.]
Result (λ = 0.001)

[Figure: generalization error ‖ŵ − w*‖² vs. number of samples (0 to 250); p = 100, λ = 0.001; curves: simulation, bias, variance.]
Result (λ = 1)

[Figure: generalization error ‖ŵ − w*‖² vs. number of samples (0 to 250); p = 100, λ = 1; curves: simulation, bias, variance.]
How about classification?

- Model: the input vector x_i is sampled from the standard Gaussian distribution (x_i is a random variable):

    x_i ~ N(0, I_p)   (i = 1, …, n)

  The true classifier is also a normal random variable:

    w* ~ N(0, I_p)

- Output (not a Gaussian noise model!):

    y = sign(Xw*)

- Generalization error (angle between the estimated ŵ and the truth w*):

    ε(ŵ, w*) = (1/π) arccos( ŵ^⊤w* / (‖ŵ‖ ‖w*‖) )
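A sketch of this setup, sampling one instance and evaluating the angle-based error (dimensions are illustrative):

    % generalization error of ridge regression on sign outputs
    p = 100; n = 80; lambda = 1e-6;
    wstar = randn(p, 1);                      % true classifier
    X = randn(n, p);                          % rows x_i ~ N(0, I_p)
    y = sign(X * wstar);                      % noiseless sign outputs
    what = (X'*X + lambda*eye(p)) \ (X'*y);
    err = acos((what'*wstar) / (norm(what)*norm(wstar))) / pi
    accuracy = 1 - err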
Analyzing classification

Let α = n/p (number of samples / number of features) and assume

    n → ∞,  p → ∞,  λ → 0     (λ: regularization constant)

- Analyze the inner product:

    E ŵ^⊤w* = √(2p/π) α   (α < 1),
              √(2p/π)     (α > 1)

- Analyze the norm: E ‖w*‖² = p and

    E ‖ŵ‖² = α(1 − 2α/π)/(1 − α)                 (α < 1),
             ((2/π)(α − 1) + 1 − 2/π)/(α − 1)    (α > 1)

Both branches of E ‖ŵ‖² diverge as α → 1, i.e., at n = p.
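The norm blow-up at n = p is easy to see empirically; a sketch:

    % ||what||^2 as n crosses p = 100 (lambda near zero)
    p = 100; lambda = 1e-6;
    wstar = randn(p, 1);
    for n = [50 90 100 110 200]
      X = randn(n, p); y = sign(X*wstar);
      what = (X'*X + lambda*eye(p)) \ (X'*y);
      fprintf('n = %3d   ||what||^2 = %10.1f\n', n, norm(what)^2);
    end
    % the norm peaks sharply around n = p, driving the overlap
    % what'*wstar / (||what|| ||wstar||) toward zero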
Analyzing classification (result)

[Figure: classification accuracy (%) vs. number of samples n (0 to 250); p = 100, λ = 10⁻⁶; the simulation and the analysis curves agree.]
How can we avoid the singularity?

- Regularization
- Logistic regression

[Figure: the logistic sigmoid P(y = +1 | x) as a function of w^⊤x.]

    log [ P(y = +1|x) / P(y = −1|x) ] = w^⊤x

    minimize_w  Σ_{i=1}^n log(1 + exp(−y_i w^⊤x_i)) + (λ/2)‖w‖²

where the first term is the training error and the second is the regularization term (λ: regularization constant).
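Logistic regression has no closed-form solution, but the objective is smooth and convex; a minimal gradient-descent sketch (the step size and iteration count are arbitrary choices):

    % L2-regularized logistic regression by gradient descent
    n = 200; p = 10; lambda = 1; eta = 1e-3;
    X = randn(n, p); y = sign(X(:,1) + 0.5*randn(n, 1));  % synthetic labels
    w = zeros(p, 1);
    for t = 1:5000
      m = y .* (X*w);                            % margins y_i * w'x_i
      g = -X' * (y ./ (1 + exp(m))) + lambda*w;  % gradient of the objective
      w = w - eta * g;
    end

The logistic loss penalizes only the margin, not the exact ±1 value of the output, which is one reason it behaves better near n = p.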
Summary

- Ridge regression (RR) is very simple and easy to implement.
- RR has wide application, e.g., classification and multi-class classification.
- Be careful about the singularity: adding data does not always help improve performance.
- Analyzing the singularity predicts the simulated performance quantitatively:
  - Regression setting: the variance goes to infinity at n = p.
  - Classification setting: the norm ‖ŵ‖² goes to infinity at n = p.
Further readings

- The Elements of Statistical Learning (Hastie, Tibshirani, Friedman), 2009 (2nd edition):
  - Ridge regression (Sec. 3.4)
  - Bias & variance (Sec. 7.3)
  - Cross validation (Sec. 7.10)
- Statistical Mechanics of Generalization (Opper and Kinzel), in Models of Neural Networks III: Association, Generalization, and Representation, 1995:
  - Analysis of the perceptron
  - Singularity
