
Ridge Regression

Ryota Tomioka
Department of Mathematical Informatics
The University of Tokyo
About this class

- Bad news: this class will be in English.
- Good news: the topic, ridge regression, is probably already familiar to you.
- Even better news: if you ask a question in English during the class, you don't need to hand in any assignment (no report) for this class.
  # Of course you can still ask questions in Japanese, but then you have to hand in your assignment as usual.
Why English?

Number of speakers?
- Chinese (Mandarin): 845 million
- Spanish: 329 million
- English: 328 million

Let's compare the Gamma distribution article on Wikipedia.

No! It is not the number of speakers but the number of readers.
- Japanese/Spanish/Chinese Wikipedia: read mostly by native speakers.
- English Wikipedia: read by many non-native speakers.

English for non-native speakers
- English is the best language to express your ideas, inventions, and research.
- Nobody speaks (or writes) perfect English.
- The world is full of bad English (but who cares).
Outline

- Ridge Regression (regularized linear regression)
  - Formulation
  - Handling nonlinearity using basis functions
  - Classification
  - Multi-class classification
- Singularity: the dark side of RR
  - Why does it happen?
  - How can we avoid it?
- Summary
Problem Setting

- Training examples: (x_i, y_i) (i = 1, …, n), x_i ∈ R^p.
- Goal: learn a linear function

    f(x*) = w^⊤x*    (w ∈ R^p)

  that predicts the output y* for a test point (x*, y*) ~ P(X, Y).
- Note that the test point is not included in the training examples (we want generalization!).

[Figure: training pairs (x_1, y_1), …, (x_n, y_n) drawn IID from P(X, Y); a test input x* arrives and its output is unknown.]
Ridge Regression

Solve the minimization problem

    minimize_w  ‖y − Xw‖² + λ‖w‖²

where ‖y − Xw‖² is the training error and λ‖w‖² is the regularization (ridge) term (λ: regularization constant).

Note: this can be interpreted as Maximum A Posteriori (MAP) estimation with a Gaussian likelihood and a Gaussian prior (with noise variance σ² and prior variance τ², the regularization constant is λ = σ²/τ²).

Here

    y = (y_1, y_2, …, y_n)^⊤        (target output)
    X = (x_1, x_2, …, x_n)^⊤        (n × p design matrix, one row per example)
Designing the design matrix

- Columns of X can be different sources of info.
  E.g., predicting the price of an apartment from columns such as: #rooms, train st., size, bathroom, sunlight, pet OK.
- Columns of X can also be derived features.
  E.g., polynomial regression:

    X = [ x_1^{p−1} ⋯ x_1² x_1 1
          x_2^{p−1} ⋯ x_2² x_2 1
              ⋮
          x_n^{p−1} ⋯ x_n² x_n 1 ]
Solving ridge regression

Take the gradient and solve

    −X^⊤(y − Xw) + λw = 0,

which gives

    ŵ = (X^⊤X + λI_p)^{−1} X^⊤y        (I_p: p × p identity matrix)

The solution can also be written as (exercise)

    ŵ = X^⊤(XX^⊤ + λI_n)^{−1} y
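As a sanity check, here is a minimal MATLAB/Octave sketch (on synthetic data, not part of the original slides) that computes both forms and verifies they coincide:

    % primal vs. dual form of the ridge solution on synthetic data
    n = 50; p = 10; lambda = 0.1;
    X = randn(n, p); y = randn(n, 1);
    w_primal = (X'*X + lambda*eye(p)) \ (X'*y);    % (X'X + lambda*I_p)^{-1} X'y
    w_dual   = X' * ((X*X' + lambda*eye(n)) \ y);  % X'(XX' + lambda*I_n)^{-1} y
    disp(norm(w_primal - w_dual))                  % ~0 up to rounding error

The dual form is preferable when p ≫ n, since it inverts an n × n rather than a p × p matrix.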
Example: polynomial fitting

Degree-(p−1) polynomial model:

    y = w_1 x^{p−1} + ⋯ + w_{p−1} x + w_p + noise
      = (x^{p−1} ⋯ x 1) (w_1, …, w_{p−1}, w_p)^⊤ + noise

Design matrix:

    X = [ x_1^{p−1} ⋯ x_1² x_1 1
          x_2^{p−1} ⋯ x_2² x_2 1
              ⋮
          x_n^{p−1} ⋯ x_n² x_n 1 ]
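A minimal MATLAB/Octave sketch of this recipe (the target function and noise level are invented for illustration):

    % degree-(p-1) polynomial fit via ridge regression
    n = 20; p = 6; lambda = 0.001;
    x = linspace(-1, 1, n)';                    % scalar inputs
    y = x.^2 - x.^4 + 0.1*randn(n, 1);          % synthetic noisy targets
    Xd = bsxfun(@power, x, (p-1):-1:0);         % columns: x.^(p-1), ..., x, 1
    w = (Xd'*Xd + lambda*eye(p)) \ (Xd'*y);     % ridge solution
    yhat = Xd * w;                              % fitted values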

Example: 5th-order polynomial fitting

[Figure: true function, learned function (λ = 0.001), and training examples.]

    True:    w* = (0, 0, 1, 0, 1, 0)^⊤
    Learned: ŵ = (0.63, 0.37, 3.3, 0.41, 3.0, 0.052)^⊤
Example: RBF fitting

Gaussian radial basis function (Gaussian-RBF):

    φ(x; μ_c) = exp( −(x − μ_c)²/2 )

[Figure: a set of Gaussian-RBF bumps centered at points μ_c.]
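A sketch of RBF fitting (placing the centers at the training inputs and fixing the width to 1 are modeling choices, not prescribed by the slides):

    % Gaussian-RBF ridge regression on synthetic 1-D data
    n = 30; lambda = 1e-3;
    x = linspace(-2, 2, n)'; y = sin(2*x) + 0.3*randn(n, 1);
    mu = x';                                        % centers at training inputs
    Phi = exp(-bsxfun(@minus, x, mu).^2 / 2);       % n x n design matrix
    w = (Phi'*Phi + lambda*eye(n)) \ (Phi'*y);
    xs = linspace(-2, 2, 200)';                     % dense test grid
    Phis = exp(-bsxfun(@minus, xs, mu).^2 / 2);
    fhat = Phis * w;                                % learned function

Sweeping lambda over 10⁻⁸, …, 10 reproduces the qualitative behavior in the following figures.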

RR-RBF results for λ = 10⁻⁸, 10⁻⁷, 10⁻⁶, 10⁻⁵, 10⁻⁴, 10⁻³, 10⁻², 10⁻¹, 1, 10

[Figures: one panel per λ, each showing the true function, the learned function, and the training examples on x ∈ [−2, 2]. Small λ overfits the noise; large λ oversmooths.]
Binary classification

- Target y is +1 or −1: e.g., orange (+1) or lemon (−1).
- Just apply ridge regression with +1/−1 targets (forget about the Gaussian noise assumption!).
- We again use the Gaussian RBF, now with a vector input x:

    φ(x; μ_c) = exp( −‖x − μ_c‖²/2 )

- y ∈ {+1, −1}ⁿ is the vector of outputs to be predicted.
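A minimal sketch (synthetic two-dimensional data, and a raw linear design matrix rather than the RBF features, for brevity):

    % binary classification by ridge regression on +1/-1 targets
    n = 100; p = 2; lambda = 0.1;
    X = randn(n, p);
    y = sign(X(:,1) + X(:,2) + 0.1*randn(n, 1));  % synthetic labels
    w = (X'*X + lambda*eye(p)) \ (X'*y);          % regress on the labels
    ypred = sign(X * w);                          % predicted class
    accuracy = mean(ypred == y)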
Classification: Truth

[Figure: the true labels (+1/−1) of points in the 2-D input space.]
Classification with RR, λ = 0.1

[Figures: truth (left) vs. learned decision regions (right) for λ = 0.1.]

Classification with RR, λ = 1

[Figures: truth (left) vs. learned decision regions (right) for λ = 1.]

Classification with RR, λ = 10

[Figures: truth (left) vs. learned decision regions (right) for λ = 10.]

Classification with RR, λ = 100

[Figures: truth (left) vs. learned decision regions (right) for λ = 100.]
Multi-class classification

[Figure: sample 16×16 USPS images of the digits 0-9.]
USPS digits dataset

http://www-stat-class.stanford.edu/~tibs/ElemStatLearn/datasets/zip.info
7291 training samples, 2007 test samples.

Define the output Y as an indicator (one-hot) matrix with one row per sample and one column per digit class:

    Y = [ 0 1 0 0 0 ⋯
          1 0 0 0 0 ⋯
          0 0 0 1 0 ⋯
              ⋮
          0 1 0 0 0 ⋯ ]      (number of samples × number of classes)
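A sketch of the multi-class recipe (the data below are synthetic stand-ins for the USPS images; each column of Y is one binary ridge problem, and the predicted class is the column with the largest output):

    % multi-class classification via one-hot ridge regression
    n = 500; p = 256; K = 10; lambda = 1e-6;
    Xtr = randn(n, p);                             % stand-in for USPS images
    labels = randi(K, n, 1);
    Ytr = full(sparse((1:n)', labels, 1, n, K));   % n x K indicator matrix
    W = (Xtr'*Xtr + lambda*eye(p)) \ (Xtr'*Ytr);   % solves all K problems at once
    [~, pred] = max(Xtr * W, [], 2);               % class with largest output
    accuracy = mean(pred == labels)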
USPS dataset

[Figure: classification accuracy (%) on the USPS test set vs. number of training samples (500 to 8000), λ = 10⁻⁶.]

- We can obtain 88% accuracy on a held-out test set using about 7000 training examples; held-out performance is what machine learners care about.
- A machine can learn! (using a very simple learning algorithm)
Summary (so far)

- Ridge regression (RR) is very simple.
- RR can be coded in one line (MATLAB):

    W = (X'*X + lambda*eye(p)) \ (X'*Y);

- RR can prevent over-fitting by regularization.
- Classification problems can also be solved by properly defining the output Y.
- Nonlinearities can be handled by using basis functions (polynomial, Gaussian RBF, etc.).
Singularity - the dark side of RR

USPS dataset (p = 256) (what I have been hiding):
the more data, the less accurate??
[Figure: accuracy (%) vs. number of samples n (16 to 8192), λ = 10⁻⁶. 256 is the number of pixels (16×16) in the image.]
Breast Cancer Wisconsin (diagnostic) dataset (p = 30)

[Figure: accuracy (%) vs. number of samples n (0 to 250), λ = 10⁻⁶.]

30 real-valued features: radius, texture, perimeter, area, etc.
SPECT Heart dataset (p = 22)

[Figure: accuracy (%) vs. number of samples n (0 to 250), λ = 10⁻⁶.]

22 binary features.
Spambase dataset (p = 57)

[Figure: accuracy (%) vs. number of samples n (10⁰ to 10⁴, log scale), λ = 10⁻⁶.]

55 real-valued features (word frequency, character frequency) and 2 integer-valued features (run-length).
Musk dataset (p = 166)

[Figure: accuracy (%) vs. number of samples n (0 to 500), λ = 10⁻⁶.]

166 real-valued features.
Singularity

- Why does it happen?
- How can we avoid it?

Why does it happen?

Let's analyze the simplest case: regression.

- Model: the design matrix X is fixed (X is not a random variable).
- Output (Gaussian noise):

    y = Xw* + ε,   ε ~ N(0, σ²I_n)

- Estimator:

    ŵ = (X^⊤X + λI_p)^{−1} X^⊤y

- Generalization error (estimated ŵ vs. truth w*; expectation with respect to ε):

    E_ε ‖ŵ − w*‖²
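A Monte-Carlo sketch of this quantity (the dimensions and noise level are illustrative choices):

    % estimate E||what - w*||^2 over fresh noise draws, fixed design
    n = 80; p = 100; lambda = 1e-6; sigma = 1; T = 200;
    X = randn(n, p); wstar = randn(p, 1);
    err = zeros(T, 1);
    for t = 1:T
      y = X*wstar + sigma*randn(n, 1);            % fresh Gaussian noise
      what = (X'*X + lambda*eye(p)) \ (X'*y);
      err(t) = norm(what - wstar)^2;
    end
    mean(err)                                     % approximates E||what - w*||^2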
Analysis strategy

(1) Bias-variance decomposition:

    E_ε ‖ŵ − w*‖² = E_ε ‖ŵ − w̄‖² + ‖w̄ − w*‖²

where the first term is the variance, the second is the squared bias, and w̄ = E_ε ŵ is the mean estimator.

(2) Analyze the variance: E_ε ‖ŵ − w̄‖² = ?

(3) Analyze the bias: ‖w̄ − w*‖² = ?
Analyze the variance (sketch)

1. Show that

    E_ε ‖ŵ − w̄‖² = σ² Tr[ (X^⊤X + λI_p)^{−2} X^⊤X ]

2. Let s_1 > 0, …, s_m > 0 be the positive singular values of X (m = min(n, p)). Show that

    E_ε ‖ŵ − w̄‖² = σ² Σ_{i=1}^m s_i²/(s_i² + λ)²
                   → σ² Σ_{i=1}^m 1/s_i²       (λ → 0)

The variance can be large if the minimum singular value is close to zero!
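The formula in step 2 is easy to evaluate numerically; a sketch (same illustrative setup as before):

    % variance term from the singular values of X
    n = 80; p = 100; lambda = 1e-6; sigma = 1;
    X = randn(n, p);
    s = svd(X);                                  % the m = min(n,p) singular values
    variance = sigma^2 * sum(s.^2 ./ (s.^2 + lambda).^2)
    % as lambda -> 0 this tends to sigma^2 * sum(1 ./ s.^2),
    % which blows up when the smallest singular value is near zero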
Analyze the bias (sketch)

1. Show that

    ‖w̄ − w*‖² = λ² ‖(X^⊤X + λI_p)^{−1} w*‖²

2. Show that

    ‖w̄ − w*‖² = Σ_{i=1}^p ( λ v_i^⊤w* / (s_i² + λ) )²

where s_i = 0 (if i > m) and v_i is the i-th right singular vector of X. As λ → 0,

    ‖w̄ − w*‖² → Σ_{i=n+1}^p (v_i^⊤w*)²   (n < p),
                 0                        (otherwise).
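A sketch checking step 2 against the closed form of step 1 (illustrative dimensions):

    % bias term via the full SVD of X
    n = 80; p = 100; lambda = 1e-6;
    X = randn(n, p); wstar = randn(p, 1);
    [~, S, V] = svd(X);                           % V is p x p
    s = zeros(p, 1); s(1:min(n, p)) = diag(S);    % s_i = 0 for i > m
    bias2_svd    = sum((lambda * (V'*wstar) ./ (s.^2 + lambda)).^2)
    bias2_direct = norm(lambda * ((X'*X + lambda*eye(p)) \ wstar))^2
    % the two expressions agree up to rounding error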
Result (λ = 10⁻⁶)

[Figure: generalization error ‖ŵ − w*‖² vs. number of samples (0 to 250); p = 100, λ = 10⁻⁶; curves: simulation, bias, variance.]
Result (λ = 0.001)

[Figure: generalization error ‖ŵ − w*‖² vs. number of samples (0 to 250); p = 100, λ = 0.001; curves: simulation, bias, variance.]
Result (λ = 1)

[Figure: generalization error ‖ŵ − w*‖² vs. number of samples (0 to 250); p = 100, λ = 1; curves: simulation, bias, variance.]
How about classification?

- Model: the input vector x_i is sampled from the standard Gaussian distribution (x_i is a random variable):

    x_i ~ N(0, I_p)   (i = 1, …, n)

  The true classifier is also a normal random variable:

    w* ~ N(0, I_p)

- Output (not a Gaussian noise model!):

    y = sign(Xw*)

- Generalization error (angle between the estimated ŵ and the truth w*):

    ε(ŵ, w*) = (1/π) arccos( ŵ^⊤w* / (‖ŵ‖ ‖w*‖) )
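A sketch of this setup, sampling one instance and evaluating the angle-based error (dimensions are illustrative):

    % generalization error of ridge regression on sign outputs
    p = 100; n = 80; lambda = 1e-6;
    wstar = randn(p, 1);                      % true classifier
    X = randn(n, p);                          % rows x_i ~ N(0, I_p)
    y = sign(X * wstar);                      % noiseless sign outputs
    what = (X'*X + lambda*eye(p)) \ (X'*y);
    err = acos((what'*wstar) / (norm(what)*norm(wstar))) / pi
    accuracy = 1 - err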
Analyzing classification

Let α = n/p (number of samples / number of features) and assume

    n → ∞,  p → ∞,  λ → 0     (λ: regularization constant)

- Analyze the inner product:

    E ŵ^⊤w* = √(2p/π) α   (α < 1),
              √(2p/π)     (α > 1)

- Analyze the norm: E ‖w*‖² = p and

    E ‖ŵ‖² = α(1 − 2α/π)/(1 − α)                 (α < 1),
             ((2/π)(α − 1) + 1 − 2/π)/(α − 1)    (α > 1)

Both branches of E ‖ŵ‖² diverge as α → 1, i.e., at n = p.
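The norm blow-up at n = p is easy to see empirically; a sketch:

    % ||what||^2 as n crosses p = 100 (lambda near zero)
    p = 100; lambda = 1e-6;
    wstar = randn(p, 1);
    for n = [50 90 100 110 200]
      X = randn(n, p); y = sign(X*wstar);
      what = (X'*X + lambda*eye(p)) \ (X'*y);
      fprintf('n = %3d   ||what||^2 = %10.1f\n', n, norm(what)^2);
    end
    % the norm peaks sharply around n = p, driving the overlap
    % what'*wstar / (||what|| ||wstar||) toward zero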
Analyzing classification (result)

[Figure: classification accuracy (%) vs. number of samples n (0 to 250); p = 100, λ = 10⁻⁶; the simulation and the analysis curves agree.]
How can we avoid the singularity?

- Regularization
- Logistic regression

[Figure: the logistic sigmoid P(y = +1 | x) as a function of w^⊤x.]

    log [ P(y = +1|x) / P(y = −1|x) ] = w^⊤x

    minimize_w  Σ_{i=1}^n log(1 + exp(−y_i w^⊤x_i)) + (λ/2)‖w‖²

where the first term is the training error and the second is the regularization term (λ: regularization constant).
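Logistic regression has no closed-form solution, but the objective is smooth and convex; a minimal gradient-descent sketch (the step size and iteration count are arbitrary choices):

    % L2-regularized logistic regression by gradient descent
    n = 200; p = 10; lambda = 1; eta = 1e-3;
    X = randn(n, p); y = sign(X(:,1) + 0.5*randn(n, 1));  % synthetic labels
    w = zeros(p, 1);
    for t = 1:5000
      m = y .* (X*w);                            % margins y_i * w'x_i
      g = -X' * (y ./ (1 + exp(m))) + lambda*w;  % gradient of the objective
      w = w - eta * g;
    end

The logistic loss penalizes only the margin, not the exact ±1 value of the output, which is one reason it behaves better near n = p.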
Summary

- Ridge regression (RR) is very simple and easy to implement.
- RR has wide application, e.g., classification and multi-class classification.
- Be careful about the singularity: adding data does not always help improve performance.
- Analyzing the singularity predicts the simulated performance quantitatively:
  - Regression setting: the variance goes to infinity at n = p.
  - Classification setting: the norm ‖ŵ‖² goes to infinity at n = p.
Further readings

- The Elements of Statistical Learning (Hastie, Tibshirani, Friedman), 2009 (2nd edition):
  - Ridge regression (Sec. 3.4)
  - Bias & variance (Sec. 7.3)
  - Cross validation (Sec. 7.10)
- Statistical Mechanics of Generalization (Opper and Kinzel), in Models of Neural Networks III: Association, Generalization, and Representation, 1995:
  - Analysis of the perceptron
  - Singularity
