
Kernels and Kernelized Perceptron

Instructor: Alan Ritter

Many Slides from Carlos Guestrin and Luke Zettlemoyer


What if the data is not linearly separable?

Use features of features
of features of features…

   φ(x) = ( x1, …, xn, x1x2, x1x3, …, e^(x1), … )ᵀ

Feature space can get really large really quickly!
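To make "really large really quickly" concrete, here is a minimal sketch (my own illustration, not from the slides) that explicitly builds just the degree-2 part of such a feature map; the function name and the omission of the e^(x1) term are my choices.

```python
import itertools
import numpy as np

def explicit_degree2_features(x):
    """Original coordinates plus all pairwise products x_i * x_j (i <= j).
    This is only the degree-2 slice of the feature map sketched above."""
    x = np.asarray(x, dtype=float)
    pairs = [x[i] * x[j]
             for i, j in itertools.combinations_with_replacement(range(len(x)), 2)]
    return np.concatenate([x, np.array(pairs)])

x = np.random.randn(100)                    # n = 100 input features
print(len(explicit_degree2_features(x)))    # 100 + 100*101/2 = 5150 features, just for degree 2
```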


Non-linear features: 1D input

•  Datasets that are linearly separable with some noise work out great:
   [figure: 1D points along the x axis, separated by a threshold at 0]

•  But what are we going to do if the dataset is just too hard?
   [figure: 1D points along the x axis that no single threshold can separate]

•  How about… mapping data to a higher-dimensional space:
   [figure: the same points plotted against x and x², where they become separable]
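As a concrete (hypothetical, not from the slides) illustration of that last bullet: a 1D dataset with the positive class in the middle cannot be split by one threshold on x, but after mapping x → (x, x²) a linear rule works.

```python
import numpy as np

# Hypothetical 1D data: positives in the middle, negatives on both sides.
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([-1, -1, 1, 1, 1, -1, -1])

phi = np.column_stack([x, x ** 2])          # map each point to (x, x^2)

# In the lifted space, "second coordinate < 1" is a linear decision rule
# and it classifies every point correctly.
pred = np.where(phi[:, 1] < 1.0, 1, -1)
print(np.array_equal(pred, y))              # True
```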
Feature spaces

•  General idea: map to a higher dimensional space
   –  if x is in Rⁿ, then φ(x) is in Rᵐ for m > n
   –  Can now learn feature weights w in Rᵐ and predict:
         y = sign(w · φ(x))
   –  A linear function in the higher dimensional space will be non-linear in
      the original space

x → φ(x)
Higher order polynomials

m = number of input features,  d = degree of polynomial

[figure: number of monomial terms vs. number of input dimensions, plotted for
 d = 2, 3, and 4; it grows fast! For d = 6, m = 100: about 1.6 billion terms]
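A quick check of that count (assuming the slide is counting monomials of degree exactly d, via the standard multiset formula C(m + d - 1, d)):

```python
from math import comb

def num_monomials(m, d):
    """Number of distinct monomials of degree exactly d in m variables:
    multisets of size d drawn from m variables, i.e. C(m + d - 1, d)."""
    return comb(m + d - 1, d)

print(num_monomials(100, 6))   # 1609344100, i.e. the "about 1.6 billion terms"
```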
Efficient dot-product of polynomials

Polynomials of degree exactly d:

d = 1:
   φ(u) · φ(v) = [u1, u2] · [v1, v2] = u1v1 + u2v2 = u · v

d = 2:
   φ(u) · φ(v) = [u1², u1u2, u2u1, u2²] · [v1², v1v2, v2v1, v2²]
               = u1²v1² + 2 u1v1 u2v2 + u2²v2²
               = (u1v1 + u2v2)²
               = (u · v)²

For any d (we will skip the proof):

   K(u, v) = φ(u) · φ(v) = (u · v)^d
•  Cool! Taking a dot product and raising it to a power gives the same result
   as mapping into the high-dimensional space and then taking the dot product there.
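A quick numerical sanity check of the d = 2 identity (my own check, using the explicit 4-dimensional map from the derivation above):

```python
import numpy as np

def phi2(u):
    """Explicit degree-2 map [u1^2, u1*u2, u2*u1, u2^2] from the derivation above."""
    u1, u2 = u
    return np.array([u1 * u1, u1 * u2, u2 * u1, u2 * u2])

u = np.array([0.3, -1.7])
v = np.array([2.0, 0.5])

lhs = phi2(u) @ phi2(v)       # map explicitly, then take the dot product
rhs = (u @ v) ** 2            # kernel: dot product in input space, then square
print(np.isclose(lhs, rhs))   # True
```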
The "Kernel Trick"

•  A kernel function defines a dot product in some feature space:
      K(u, v) = φ(u) · φ(v)
•  Example:
   2-dimensional vectors u = [u1  u2] and v = [v1  v2]; let K(u, v) = (1 + u · v)².
   Need to show that K(xi, xj) = φ(xi) · φ(xj):
      K(u, v) = (1 + u · v)² = 1 + u1²v1² + 2 u1v1 u2v2 + u2²v2² + 2 u1v1 + 2 u2v2
              = [1, u1², √2 u1u2, u2², √2 u1, √2 u2] · [1, v1², √2 v1v2, v2², √2 v1, √2 v2]
              = φ(u) · φ(v),   where φ(x) = [1, x1², √2 x1x2, x2², √2 x1, √2 x2]
•  Thus, a kernel function implicitly maps data to a high-dimensional space
   (without the need to compute each φ(x) explicitly).
•  But, it isn't obvious yet how we will incorporate it into actual learning
   algorithms…
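A small numerical check (not part of the slides) that this expansion is right:

```python
import numpy as np

def phi(x):
    """The explicit 6-dimensional map from the example above."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, x1 ** 2, s * x1 * x2, x2 ** 2, s * x1, s * x2])

u = np.array([1.2, -0.7])
v = np.array([0.4, 2.5])

print(np.isclose((1.0 + u @ v) ** 2, phi(u) @ phi(v)))   # True
```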
"Kernel trick" for the Perceptron!

•  Never compute features explicitly!!!
   –  Compute dot products in closed form: K(u, v) = φ(u) · φ(v)

•  Standard Perceptron:
   •  set wi = 0 for each feature i
   •  For t = 1..T, i = 1..n:
      –  y = sign(w · φ(xi))
      –  if y ≠ yi
         •  w = w + yi φ(xi)

•  Kernelized Perceptron:
   •  set ai = 0 for each example i
   •  For t = 1..T, i = 1..n:
      –  y = sign((Σk ak φ(xk)) · φ(xi))
           = sign(Σk ak K(xk, xi))
      –  if y ≠ yi
         •  ai += yi

•  At all times during learning:
      w = Σk ak φ(xk)
   Exactly the same computations, but can use K(u, v) to avoid enumerating
   the features!!! (A code sketch follows below.)
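A minimal Python sketch of the kernelized Perceptron pseudocode above (the function and variable names are mine; following the worked example on the next slide, sign(0) is treated as -1):

```python
import numpy as np

def kernel_perceptron(X, y, kernel, T=10):
    """Train coefficients a_k; `kernel` is any K(u, v) = phi(u) . phi(v)."""
    n = len(X)
    a = np.zeros(n)                                   # one coefficient a_k per training example
    K = np.array([[kernel(X[j], X[i]) for i in range(n)] for j in range(n)])
    for _ in range(T):
        for i in range(n):
            pred = 1 if a @ K[:, i] > 0 else -1       # sign(sum_k a_k K(x_k, x_i)), sign(0) -> -1
            if pred != y[i]:
                a[i] += y[i]                          # mistake: update this example's coefficient
    return a

def kernel_predict(a, X, kernel, x):
    """Classify a new point x using the learned coefficients."""
    score = sum(a_k * kernel(x_k, x) for a_k, x_k in zip(a, X))
    return 1 if score > 0 else -1
```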
Kernelized Perceptron: example run

•  set ai = 0 for each example i
•  For t = 1..T, i = 1..n:
   –  y = sign(Σk ak K(xk, xi))
   –  if y ≠ yi
      •  ai += yi

Training data:                 Kernel K(u, v) = (u · v)², e.g.
   x1   x2    y                   K(x1, x2) = K([1,1], [-1,1])
    1    1    1                             = (1×(-1) + 1×1)²
   -1    1   -1                             = 0
   -1   -1    1
    1   -1   -1

Kernel matrix:
   K     x1   x2   x3   x4
   x1     4    0    4    0
   x2     0    4    0    4
   x3     4    0    4    0
   x4     0    4    0    4

Trace:
   Initial:   a = [a1, a2, a3, a4] = [0, 0, 0, 0]
   t=1, i=1:  Σk ak K(xk, x1) = 0×4 + 0×0 + 0×4 + 0×0 = 0, sign(0) = -1
              a1 += y1 → a1 += 1, new a = [1, 0, 0, 0]
   t=1, i=2:  Σk ak K(xk, x2) = 1×0 + 0×4 + 0×0 + 0×4 = 0, sign(0) = -1
   t=1, i=3:  Σk ak K(xk, x3) = 1×4 + 0×0 + 0×4 + 0×0 = 4, sign(4) = 1
   t=1, i=4:  Σk ak K(xk, x4) = 1×0 + 0×4 + 0×0 + 0×4 = 0, sign(0) = -1
   t=2, i=1:  Σk ak K(xk, x1) = 1×4 + 0×0 + 0×4 + 0×0 = 4, sign(4) = 1
   …
   Converged!!!

Final classifier (writing x = [x1, x2]):
   y = Σk ak K(xk, x)
     = 1×K(x1, x) + 0×K(x2, x) + 0×K(x3, x) + 0×K(x4, x)
     = K(x1, x)
     = K([1, 1], x)        (because x1 = [1, 1])
     = (x1 + x2)²          (because K(u, v) = (u · v)²)
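A short standalone check of this example (my own, not from the slides): build the kernel matrix for K(u, v) = (u · v)² and run one pass of the updates.

```python
import numpy as np

# The four training points and labels from the table above.
X = np.array([[1, 1], [-1, 1], [-1, -1], [1, -1]], dtype=float)
y = np.array([1, -1, 1, -1])

K = (X @ X.T) ** 2            # K[i, j] = (x_i . x_j)^2
print(K)                      # rows [4 0 4 0], [0 4 0 4], [4 0 4 0], [0 4 0 4], matching the table

# One pass of the kernelized Perceptron updates (sign(0) treated as -1, as in the trace).
a = np.zeros(4)
for i in range(4):
    pred = 1 if a @ K[:, i] > 0 else -1
    if pred != y[i]:
        a[i] += y[i]
print(a)                      # [1. 0. 0. 0.], only the first example triggers an update
```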
Common kernels

•  Polynomials of degree exactly d:    K(u, v) = (u · v)^d

•  Polynomials of degree up to d:      K(u, v) = (1 + u · v)^d

•  Gaussian kernels:                   K(u, v) = exp(-‖u - v‖² / (2σ²))

•  Sigmoid:                            K(u, v) = tanh(η u · v + ν)

•  And many others: very active area of research!
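Minimal implementations of the kernels listed above (my sketch; the parameter names sigma, eta, nu are my choices, not fixed by the slides):

```python
import numpy as np

def poly_kernel_exact(u, v, d):
    return np.dot(u, v) ** d                   # polynomials of degree exactly d

def poly_kernel_up_to(u, v, d):
    return (1.0 + np.dot(u, v)) ** d           # polynomials of degree up to d

def gaussian_kernel(u, v, sigma=1.0):
    diff = np.asarray(u) - np.asarray(v)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def sigmoid_kernel(u, v, eta=1.0, nu=0.0):
    return np.tanh(eta * np.dot(u, v) + nu)    # not positive semidefinite for all eta, nu
```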
Overfitting?

•  Huge feature space with kernels, what about overfitting???
   –  Often robust to overfitting, e.g. if you don't make too many Perceptron updates
   –  SVMs (which we will see next) will have a clearer story for avoiding overfitting
   –  But everything overfits sometimes!!!
•  Can control by:
   –  Choosing a better kernel
   –  Varying parameters of the kernel (width of Gaussian, etc.)
Kernels in logistic regression

   P(Y = 0 | X = x, w, w0) = 1 / (1 + exp(w0 + w · x))

•  Define weights in terms of data points:

      w = Σj αj φ(xj)

   P(Y = 0 | X = x, w, w0) = 1 / (1 + exp(w0 + Σj αj φ(xj) · φ(x)))
                           = 1 / (1 + exp(w0 + Σj αj K(xj, x)))

•  Derive the gradient descent rule on αj, w0 (a sketch follows below)
•  Similar tricks work for all linear models: SVMs, etc.
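A minimal sketch of that gradient descent rule (my own derivation under the model above, with labels y in {0, 1} and the standard log-loss; function and parameter names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def kernel_logistic_regression(K, y, lr=0.1, steps=500):
    """K is the n x n kernel matrix with K[j, i] = K(x_j, x_i); y has entries in {0, 1}.
    Under the model above, P(Y = 1 | x_i) = sigmoid(w0 + sum_j alpha_j K(x_j, x_i))."""
    n = len(y)
    alpha = np.zeros(n)
    w0 = 0.0
    for _ in range(steps):
        f = w0 + alpha @ K                 # f[i] = w0 + sum_j alpha_j K(x_j, x_i)
        err = sigmoid(f) - y               # gradient of the log-loss w.r.t. f
        alpha -= lr * (K @ err)            # d(loss)/d(alpha_j) = sum_i err_i K(x_j, x_i)
        w0 -= lr * err.sum()
    return alpha, w0
```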
What you need to know

•  The kernel trick
•  How to derive the polynomial kernel
•  Common kernels
•  The kernelized Perceptron
