
Kernels and Kernelized Perceptron

Instructor: Alan Ritter

Many Slides from Carlos Guestrin and Luke Zettlemoyer


What if the data is not linearly separable?

Use features of features
of features of features…

   φ(x) = ( x1, …, xn, x1x2, x1x3, …, e^(x1), … )ᵀ

Feature space can get really large really quickly!
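To make "really large really quickly" concrete, here is a minimal sketch (my own illustration, not from the slides) that explicitly builds just the degree-2 part of such a feature map; the function name and the omission of the e^(x1) term are my choices.

```python
import itertools
import numpy as np

def explicit_degree2_features(x):
    """Original coordinates plus all pairwise products x_i * x_j (i <= j).
    This is only the degree-2 slice of the feature map sketched above."""
    x = np.asarray(x, dtype=float)
    pairs = [x[i] * x[j]
             for i, j in itertools.combinations_with_replacement(range(len(x)), 2)]
    return np.concatenate([x, np.array(pairs)])

x = np.random.randn(100)                    # n = 100 input features
print(len(explicit_degree2_features(x)))    # 100 + 100*101/2 = 5150 features, just for degree 2
```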


Non-linear features: 1D input

•  Datasets that are linearly separable with some noise work out great:
   [figure: 1D points along the x axis, separated by a threshold at 0]

•  But what are we going to do if the dataset is just too hard?
   [figure: 1D points along the x axis that no single threshold can separate]

•  How about… mapping data to a higher-dimensional space:
   [figure: the same points plotted against x and x², where they become separable]
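As a concrete (hypothetical, not from the slides) illustration of that last bullet: a 1D dataset with the positive class in the middle cannot be split by one threshold on x, but after mapping x → (x, x²) a linear rule works.

```python
import numpy as np

# Hypothetical 1D data: positives in the middle, negatives on both sides.
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([-1, -1, 1, 1, 1, -1, -1])

phi = np.column_stack([x, x ** 2])          # map each point to (x, x^2)

# In the lifted space, "second coordinate < 1" is a linear decision rule
# and it classifies every point correctly.
pred = np.where(phi[:, 1] < 1.0, 1, -1)
print(np.array_equal(pred, y))              # True
```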
Feature spaces

•  General idea: map to a higher dimensional space
   –  if x is in Rⁿ, then φ(x) is in Rᵐ for m > n
   –  Can now learn feature weights w in Rᵐ and predict:
         y = sign(w · φ(x))
   –  A linear function in the higher dimensional space will be non-linear in
      the original space

x → φ(x)
Higher order polynomials

m = number of input features,  d = degree of polynomial

[figure: number of monomial terms vs. number of input dimensions, plotted for
 d = 2, 3, and 4; it grows fast! For d = 6, m = 100: about 1.6 billion terms]
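A quick check of that count (assuming the slide is counting monomials of degree exactly d, via the standard multiset formula C(m + d - 1, d)):

```python
from math import comb

def num_monomials(m, d):
    """Number of distinct monomials of degree exactly d in m variables:
    multisets of size d drawn from m variables, i.e. C(m + d - 1, d)."""
    return comb(m + d - 1, d)

print(num_monomials(100, 6))   # 1609344100, i.e. the "about 1.6 billion terms"
```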
Efficient dot-product of polynomials

Polynomials of degree exactly d:

d = 1:
   φ(u) · φ(v) = [u1, u2] · [v1, v2] = u1v1 + u2v2 = u · v

d = 2:
   φ(u) · φ(v) = [u1², u1u2, u2u1, u2²] · [v1², v1v2, v2v1, v2²]
               = u1²v1² + 2 u1v1 u2v2 + u2²v2²
               = (u1v1 + u2v2)²
               = (u · v)²

For any d (we will skip the proof):

   K(u, v) = φ(u) · φ(v) = (u · v)^d
•  Cool! Taking a dot product and raising it to a power gives the same result
   as mapping into the high-dimensional space and then taking the dot product there.
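A quick numerical sanity check of the d = 2 identity (my own check, using the explicit 4-dimensional map from the derivation above):

```python
import numpy as np

def phi2(u):
    """Explicit degree-2 map [u1^2, u1*u2, u2*u1, u2^2] from the derivation above."""
    u1, u2 = u
    return np.array([u1 * u1, u1 * u2, u2 * u1, u2 * u2])

u = np.array([0.3, -1.7])
v = np.array([2.0, 0.5])

lhs = phi2(u) @ phi2(v)       # map explicitly, then take the dot product
rhs = (u @ v) ** 2            # kernel: dot product in input space, then square
print(np.isclose(lhs, rhs))   # True
```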
The "Kernel Trick"

•  A kernel function defines a dot product in some feature space:
      K(u, v) = φ(u) · φ(v)
•  Example:
   2-dimensional vectors u = [u1  u2] and v = [v1  v2]; let K(u, v) = (1 + u · v)².
   Need to show that K(xi, xj) = φ(xi) · φ(xj):
      K(u, v) = (1 + u · v)² = 1 + u1²v1² + 2 u1v1 u2v2 + u2²v2² + 2 u1v1 + 2 u2v2
              = [1, u1², √2 u1u2, u2², √2 u1, √2 u2] · [1, v1², √2 v1v2, v2², √2 v1, √2 v2]
              = φ(u) · φ(v),   where φ(x) = [1, x1², √2 x1x2, x2², √2 x1, √2 x2]
•  Thus, a kernel function implicitly maps data to a high-dimensional space
   (without the need to compute each φ(x) explicitly).
•  But, it isn't obvious yet how we will incorporate it into actual learning
   algorithms…
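A small numerical check (not part of the slides) that this expansion is right:

```python
import numpy as np

def phi(x):
    """The explicit 6-dimensional map from the example above."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, x1 ** 2, s * x1 * x2, x2 ** 2, s * x1, s * x2])

u = np.array([1.2, -0.7])
v = np.array([0.4, 2.5])

print(np.isclose((1.0 + u @ v) ** 2, phi(u) @ phi(v)))   # True
```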
"Kernel trick" for the Perceptron!

•  Never compute features explicitly!!!
   –  Compute dot products in closed form: K(u, v) = φ(u) · φ(v)

•  Standard Perceptron:
   •  set wi = 0 for each feature i
   •  For t = 1..T, i = 1..n:
      –  y = sign(w · φ(xi))
      –  if y ≠ yi
         •  w = w + yi φ(xi)

•  Kernelized Perceptron:
   •  set ai = 0 for each example i
   •  For t = 1..T, i = 1..n:
      –  y = sign((Σk ak φ(xk)) · φ(xi))
           = sign(Σk ak K(xk, xi))
      –  if y ≠ yi
         •  ai += yi

•  At all times during learning:
      w = Σk ak φ(xk)
   Exactly the same computations, but can use K(u, v) to avoid enumerating
   the features!!! (A code sketch follows below.)
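A minimal Python sketch of the kernelized Perceptron pseudocode above (the function and variable names are mine; following the worked example on the next slide, sign(0) is treated as -1):

```python
import numpy as np

def kernel_perceptron(X, y, kernel, T=10):
    """Train coefficients a_k; `kernel` is any K(u, v) = phi(u) . phi(v)."""
    n = len(X)
    a = np.zeros(n)                                   # one coefficient a_k per training example
    K = np.array([[kernel(X[j], X[i]) for i in range(n)] for j in range(n)])
    for _ in range(T):
        for i in range(n):
            pred = 1 if a @ K[:, i] > 0 else -1       # sign(sum_k a_k K(x_k, x_i)), sign(0) -> -1
            if pred != y[i]:
                a[i] += y[i]                          # mistake: update this example's coefficient
    return a

def kernel_predict(a, X, kernel, x):
    """Classify a new point x using the learned coefficients."""
    score = sum(a_k * kernel(x_k, x) for a_k, x_k in zip(a, X))
    return 1 if score > 0 else -1
```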
Kernelized Perceptron: example run

•  set ai = 0 for each example i
•  For t = 1..T, i = 1..n:
   –  y = sign(Σk ak K(xk, xi))
   –  if y ≠ yi
      •  ai += yi

Training data:                 Kernel K(u, v) = (u · v)², e.g.
   x1   x2    y                   K(x1, x2) = K([1,1], [-1,1])
    1    1    1                             = (1×(-1) + 1×1)²
   -1    1   -1                             = 0
   -1   -1    1
    1   -1   -1

Kernel matrix:
   K     x1   x2   x3   x4
   x1     4    0    4    0
   x2     0    4    0    4
   x3     4    0    4    0
   x4     0    4    0    4

Trace:
   Initial:   a = [a1, a2, a3, a4] = [0, 0, 0, 0]
   t=1, i=1:  Σk ak K(xk, x1) = 0×4 + 0×0 + 0×4 + 0×0 = 0, sign(0) = -1
              a1 += y1 → a1 += 1, new a = [1, 0, 0, 0]
   t=1, i=2:  Σk ak K(xk, x2) = 1×0 + 0×4 + 0×0 + 0×4 = 0, sign(0) = -1
   t=1, i=3:  Σk ak K(xk, x3) = 1×4 + 0×0 + 0×4 + 0×0 = 4, sign(4) = 1
   t=1, i=4:  Σk ak K(xk, x4) = 1×0 + 0×4 + 0×0 + 0×4 = 0, sign(0) = -1
   t=2, i=1:  Σk ak K(xk, x1) = 1×4 + 0×0 + 0×4 + 0×0 = 4, sign(4) = 1
   …
   Converged!!!

Final classifier (writing x = [x1, x2]):
   y = Σk ak K(xk, x)
     = 1×K(x1, x) + 0×K(x2, x) + 0×K(x3, x) + 0×K(x4, x)
     = K(x1, x)
     = K([1, 1], x)        (because x1 = [1, 1])
     = (x1 + x2)²          (because K(u, v) = (u · v)²)
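A short standalone check of this example (my own, not from the slides): build the kernel matrix for K(u, v) = (u · v)² and run one pass of the updates.

```python
import numpy as np

# The four training points and labels from the table above.
X = np.array([[1, 1], [-1, 1], [-1, -1], [1, -1]], dtype=float)
y = np.array([1, -1, 1, -1])

K = (X @ X.T) ** 2            # K[i, j] = (x_i . x_j)^2
print(K)                      # rows [4 0 4 0], [0 4 0 4], [4 0 4 0], [0 4 0 4], matching the table

# One pass of the kernelized Perceptron updates (sign(0) treated as -1, as in the trace).
a = np.zeros(4)
for i in range(4):
    pred = 1 if a @ K[:, i] > 0 else -1
    if pred != y[i]:
        a[i] += y[i]
print(a)                      # [1. 0. 0. 0.], only the first example triggers an update
```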
Common kernels

•  Polynomials of degree exactly d:    K(u, v) = (u · v)^d

•  Polynomials of degree up to d:      K(u, v) = (1 + u · v)^d

•  Gaussian kernels:                   K(u, v) = exp(-‖u - v‖² / (2σ²))

•  Sigmoid:                            K(u, v) = tanh(η u · v + ν)

•  And many others: very active area of research!
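Minimal implementations of the kernels listed above (my sketch; the parameter names sigma, eta, nu are my choices, not fixed by the slides):

```python
import numpy as np

def poly_kernel_exact(u, v, d):
    return np.dot(u, v) ** d                   # polynomials of degree exactly d

def poly_kernel_up_to(u, v, d):
    return (1.0 + np.dot(u, v)) ** d           # polynomials of degree up to d

def gaussian_kernel(u, v, sigma=1.0):
    diff = np.asarray(u) - np.asarray(v)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def sigmoid_kernel(u, v, eta=1.0, nu=0.0):
    return np.tanh(eta * np.dot(u, v) + nu)    # not positive semidefinite for all eta, nu
```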
Overfitting?

•  Huge feature space with kernels, what about overfitting???
   –  Often robust to overfitting, e.g. if you don't make too many Perceptron updates
   –  SVMs (which we will see next) will have a clearer story for avoiding overfitting
   –  But everything overfits sometimes!!!
•  Can control by:
   –  Choosing a better kernel
   –  Varying parameters of the kernel (width of Gaussian, etc.)
Kernels in logistic regression

   P(Y = 0 | X = x, w, w0) = 1 / (1 + exp(w0 + w · x))

•  Define weights in terms of data points:

      w = Σj αj φ(xj)

   P(Y = 0 | X = x, w, w0) = 1 / (1 + exp(w0 + Σj αj φ(xj) · φ(x)))
                           = 1 / (1 + exp(w0 + Σj αj K(xj, x)))

•  Derive the gradient descent rule on αj, w0 (a sketch follows below)
•  Similar tricks work for all linear models: SVMs, etc.
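A minimal sketch of that gradient descent rule (my own derivation under the model above, with labels y in {0, 1} and the standard log-loss; function and parameter names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def kernel_logistic_regression(K, y, lr=0.1, steps=500):
    """K is the n x n kernel matrix with K[j, i] = K(x_j, x_i); y has entries in {0, 1}.
    Under the model above, P(Y = 1 | x_i) = sigmoid(w0 + sum_j alpha_j K(x_j, x_i))."""
    n = len(y)
    alpha = np.zeros(n)
    w0 = 0.0
    for _ in range(steps):
        f = w0 + alpha @ K                 # f[i] = w0 + sum_j alpha_j K(x_j, x_i)
        err = sigmoid(f) - y               # gradient of the log-loss w.r.t. f
        alpha -= lr * (K @ err)            # d(loss)/d(alpha_j) = sum_i err_i K(x_j, x_i)
        w0 -= lr * err.sum()
    return alpha, w0
```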
What you need to know

•  The kernel trick
•  How to derive the polynomial kernel
•  Common kernels
•  The kernelized Perceptron
