
Support Vector and Kernel Machines

Nello Cristianini

BIOwulf Technologies

nello@support-vector.net

http://www.support-vector.net/tutorial.html

ICML 2001

A Little History

- Support Vector Machines: introduced in COLT-92 by Boser, Guyon and Vapnik. Greatly developed ever since.
- Initially popularized in the NIPS community, now an important and active field of all Machine Learning research.
- Special issues of the Machine Learning Journal, and of the Journal of Machine Learning Research.
- Kernel Machines: a large class of learning algorithms, of which SVMs are a particular instance.

www.support-vector.net

A Little History

- Annual workshop at NIPS
- Centralized website: www.kernel-machines.org
- Textbook (2000): see www.support-vector.net
- Now a large and diverse community: from machine learning, optimization, statistics, neural networks, functional analysis, etc.
- Successful applications in many fields (bioinformatics, text, handwriting recognition, etc.)
- Fast-expanding field, everybody welcome!

Preliminaries

- Task of this class of algorithms: exploit complex patterns in data (e.g. by clustering, classifying, ranking, or cleaning the data)
- Typical problems: how to represent complex patterns; and how to exclude spurious (unstable) patterns (= overfitting)
- The first is a computational problem; the second a statistical problem.

Very Informal Reasoning

- We can implicitly define the class of possible patterns by introducing a notion of similarity between data
- Example: similarity between documents
  - By length
  - By topic
  - By language …
- Choice of similarity ⇒ choice of relevant features

More Formal Reasoning

- Kernel methods exploit information about the inner products between data items
- Many standard algorithms can be rewritten so that they only require inner products between data (inputs)
- Kernel functions = inner products in some feature space (potentially very complex)
- If the kernel is given, there is no need to specify what features of the data are being used

Just in Case …

- Dot product: ⟨x, z⟩ = Σᵢ xᵢ zᵢ
- Hyperplane: ⟨w, x⟩ + b = 0

[Figure: points x and o in the plane, separated by the hyperplane ⟨w, x⟩ + b = 0 with normal vector w and offset b]

Overview of the Tutorial

- Introduce the kernel approach with the example of the Kernel Perceptron
- Derive Support Vector Machines
- Other kernel-based algorithms
- Properties and limitations of kernels
- On kernel alignment
- On optimizing kernel alignment

Parts I and II: Overview

- Linear Learning Machines
- Kernel-Induced Feature Spaces
- Generalization Theory
- Optimization Theory
- Support Vector Machines (SVM)

Modularity (IMPORTANT CONCEPT)

- Any kernel-based learning algorithm is composed of two modules:
  - A general-purpose learning machine
  - A problem-specific kernel function
- Any kernel-based algorithm can be fitted with any kernel
- Kernels themselves can be constructed in a modular way
- Great for software engineering (and for analysis)

1. Linear Learning Machines

- Simplest case: the classification function is a hyperplane in input space
- The Perceptron algorithm (Rosenblatt, 1957)
- Useful to study this simple case before looking at SVMs and kernel methods in general

Basic Notation

- Input space: x ∈ X
- Output space: y ∈ Y = {−1, +1}
- Hypothesis: h ∈ H
- Real-valued function: f : X → ℝ
- Training set: S = {(x₁, y₁), …, (xᵢ, yᵢ), …}
- Test error: ε
- Dot product: ⟨x, z⟩

Perceptron

- Linear separation of the input space:

  f(x) = ⟨w, x⟩ + b
  h(x) = sign(f(x))

[Figure: points x and o separated by the hyperplane defined by w and b]

Perceptron Algorithm

Update rule (ignoring the threshold): if yᵢ ⟨wₖ, xᵢ⟩ ≤ 0 then

  wₖ₊₁ ← wₖ + η yᵢ xᵢ
  k ← k + 1

[Figure: a misclassified point triggering an update of w]
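The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not production code; the toy dataset and the learning rate η are assumptions made for the example.

```python
# Minimal sketch of the primal perceptron update rule.

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def perceptron(X, y, eta=1.0, epochs=100):
    """Learn (w, b) such that sign(<w, x> + b) classifies the data."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (dot(w, xi) + b) <= 0:      # mistake: update
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
                b += eta * yi
                mistakes += 1
        if mistakes == 0:                       # converged (data separable)
            break
    return w, b

# Linearly separable toy data (illustrative assumption)
X = [[2.0, 1.0], [1.0, 2.0], [-1.0, -1.0], [-2.0, -1.5]]
y = [+1, +1, -1, -1]
w, b = perceptron(X, y)
assert all(yi * (dot(w, xi) + b) > 0 for xi, yi in zip(X, y))
```

Note that w only ever changes by adding multiples of misclassified training points, which is exactly the observation the next slide builds on.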

Observations

- The solution is a linear combination of training points:

  w = Σ αᵢ yᵢ xᵢ,   αᵢ ≥ 0

- Only informative points are used (mistake driven)
- The coefficient of a point in the combination reflects its 'difficulty'

Observations - 2

- Mistake bound:

  M ≤ (R/γ)²

  where R is the radius of the smallest ball containing the data and γ is the margin

- It is possible to rewrite the algorithm using this alternative representation

Dual Representation (IMPORTANT CONCEPT)

The decision function can be rewritten as follows:

  f(x) = ⟨w, x⟩ + b = Σ αᵢ yᵢ ⟨xᵢ, x⟩ + b

  w = Σ αᵢ yᵢ xᵢ

Dual Representation

- The update rule becomes: if yᵢ (Σⱼ αⱼ yⱼ ⟨xⱼ, xᵢ⟩ + b) ≤ 0 then αᵢ ← αᵢ + η
- In the dual representation, the data appear only inside dot products
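The dual update rule above can be sketched directly: the weight vector w never appears, only the coefficients αᵢ and dot products between inputs. The dataset and the use of the plain dot product as kernel are illustrative assumptions.

```python
# Perceptron in dual form: only alpha_i and dot products K(x, z) are used.

def K(x, z):
    return sum(xi * zi for xi, zi in zip(x, z))   # linear kernel = plain dot product

def dual_perceptron(X, y, eta=1.0, epochs=100):
    alpha = [0.0] * len(X)
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for i, (xi, yi) in enumerate(zip(X, y)):
            f = sum(a * yj * K(xj, xi) for a, yj, xj in zip(alpha, y, X)) + b
            if yi * f <= 0:                       # mistake on point i
                alpha[i] += eta
                b += eta * yi
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, b

# Toy data (illustrative assumption)
X = [[2.0, 1.0], [1.0, 2.0], [-1.0, -1.0], [-2.0, -1.5]]
y = [+1, +1, -1, -1]
alpha, b = dual_perceptron(X, y)
f = lambda x: sum(a * yi * K(xi, x) for a, yi, xi in zip(alpha, y, X)) + b
assert all(yi * f(xi) > 0 for xi, yi in zip(X, y))
```

Replacing K with any other kernel turns this into the kernel perceptron, with no other change to the algorithm.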

Duality: First Property of SVMs

- Duality is the first feature of Support Vector Machines
- SVMs are Linear Learning Machines represented in a dual fashion:

  f(x) = ⟨w, x⟩ + b = Σ αᵢ yᵢ ⟨xᵢ, x⟩ + b

- Data appear only within dot products (in the decision function and in the training algorithm)

Limitations of LLMs

- Data that are not linearly separable
- Noisy data

Non-Linear Classifiers

- One solution: a network of simple linear classifiers (neurons): a Neural Network (problems: local minima; many parameters; heuristics needed to train; etc.)
- Another solution: map data into a richer feature space including non-linear features, then use a linear classifier

Learning in the Feature Space

- Map the data into a feature space where they are linearly separable:  x → φ(x)

[Figure: non-separable points x and o in input space X mapped by φ into feature space F, where they become linearly separable]

Problems with Feature Space

- Working in a high-dimensional feature space solves the problem of expressing complex functions

BUT:

- There is a computational problem (working with very large vectors)
- And a generalization theory problem (curse of dimensionality)

Implicit Mapping to Feature Space

- We will introduce kernels to deal efficiently with feature spaces with many dimensions
- Can make it possible to use infinite dimensions, efficiently in time / space
- Other advantages, both practical and conceptual

Kernel-Induced Feature Spaces

- In the dual representation, the data only appear inside dot products:

  f(x) = Σ αᵢ yᵢ ⟨φ(xᵢ), φ(x)⟩ + b

- The dimensionality of the feature space is not necessarily important. We may not even know the map φ

Kernels (IMPORTANT CONCEPT)

- A kernel is a function that returns the dot product between the images of its two arguments:

  K(x₁, x₂) = ⟨φ(x₁), φ(x₂)⟩

- Given a function K, it is possible to verify that it is a kernel

Kernels

- One can use LLMs in a feature space by simply rewriting them in dual representation and replacing dot products with kernels:

  ⟨x, z⟩ → K(x, z) = ⟨φ(x), φ(z)⟩

The Kernel Matrix (IMPORTANT CONCEPT)

The learning algorithm only needs the matrix of kernel values between all pairs of training points (the Gram matrix):

  K = [ K(1,1)  K(1,2)  K(1,3)  …  K(1,m)
        …       …       …       …  …
        K(m,1)  K(m,2)  K(m,3)  …  K(m,m) ]
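Computing a kernel matrix is a one-liner in practice. A short sketch, where the sample points and the choice of a degree-2 polynomial kernel are illustrative assumptions:

```python
# Build the Gram matrix K[i][j] = K(x_i, x_j) for a small sample.

def poly_kernel(x, z, d=2):
    """Polynomial kernel: <x, z>^d."""
    return sum(xi * zi for xi, zi in zip(x, z)) ** d

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]        # three toy points
K = [[poly_kernel(xi, xj) for xj in X] for xi in X]

# The Gram matrix is symmetric, as expected for an inner product matrix.
assert all(K[i][j] == K[j][i] for i in range(3) for j in range(3))
```

This m×m matrix is the "bottleneck" the next slide describes: the learning algorithm never needs anything about the data beyond K.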

The Kernel Matrix

- Information 'bottleneck': contains all the information available to the learning algorithm
- Fuses information about the data AND the kernel
- Many interesting properties (e.g. it is symmetric and positive semi-definite)

Mercer's Theorem

- The kernel matrix is Symmetric Positive Definite
- Any symmetric positive definite matrix can be regarded as a kernel matrix, that is, as an inner product matrix in some space
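The finite-sample face of Mercer's condition is that any Gram matrix built from a valid kernel satisfies vᵀKv ≥ 0 for every vector v (the quadratic form equals ‖Σᵢ vᵢ φ(xᵢ)‖²). A sketch that spot-checks this numerically; the Gaussian kernel, the random data, and the sampling check are illustrative assumptions, not a proof of positive semi-definiteness:

```python
# Spot-check v' K v >= 0 for a Gaussian-kernel Gram matrix.
import math
import random

def gauss_kernel(x, z, sigma=1.0):
    sq = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-sq / (2 * sigma ** 2))

random.seed(0)
X = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(5)]
K = [[gauss_kernel(xi, xj) for xj in X] for xi in X]

for _ in range(1000):
    v = [random.uniform(-1, 1) for _ in range(5)]
    quad = sum(v[i] * K[i][j] * v[j] for i in range(5) for j in range(5))
    assert quad >= -1e-12          # non-negative up to rounding error
```

A full check would compute the eigenvalues of K and verify they are all non-negative; sampling random v merely illustrates the property.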

More Formally: Mercer's Theorem

- Every (semi-) positive definite, symmetric function is a kernel: i.e. there exists a mapping φ such that it is possible to write:

  K(x₁, x₂) = ⟨φ(x₁), φ(x₂)⟩

- Positive definite here means:

  ∫∫ K(x, z) f(x) f(z) dx dz ≥ 0   ∀f ∈ L₂

Mercer's Theorem

  K(x₁, x₂) = Σᵢ λᵢ φᵢ(x₁) φᵢ(x₂)

Examples of Kernels

- Polynomial: K(x, z) = ⟨x, z⟩ᵈ
- Gaussian:  K(x, z) = e^(−‖x − z‖² / 2σ²)

Example: Polynomial Kernels

x = (x₁, x₂);  z = (z₁, z₂);

⟨x, z⟩² = (x₁z₁ + x₂z₂)²
        = x₁²z₁² + x₂²z₂² + 2 x₁z₁x₂z₂
        = ⟨(x₁², x₂², √2 x₁x₂), (z₁², z₂², √2 z₁z₂)⟩
        = ⟨φ(x), φ(z)⟩
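The identity above is easy to verify numerically: evaluating ⟨x, z⟩² must give the same number as the dot product of the explicit feature maps φ(x) = (x₁², x₂², √2 x₁x₂). The particular test points are illustrative assumptions:

```python
# Verify <x, z>^2 == <phi(x), phi(z)> for the explicit degree-2 feature map.
import math

def phi(v):
    x1, x2 = v
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

x, z = (3.0, 1.0), (2.0, -1.0)
lhs = dot(x, z) ** 2          # kernel evaluation: (x1 z1 + x2 z2)^2 = 25.0
rhs = dot(phi(x), phi(z))     # explicit feature-space dot product = 25.0
assert abs(lhs - rhs) < 1e-9
```

Note the asymmetry in cost: the kernel side is one dot product in 2 dimensions, while the explicit side requires constructing 3-dimensional feature vectors; for degree d in n dimensions the explicit side grows combinatorially while the kernel side does not.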

Example: Polynomial Kernels

[Figure]

Example: the two spirals

[Figure: the two-spirals problem, separable with Gaussian kernels]

Making Kernels (IMPORTANT CONCEPT)

- New kernels can be made from existing kernels by simple operations. If K, K′ are kernels, then:
  - K + K′ is a kernel
  - cK is a kernel, if c > 0
  - aK + bK′ is a kernel, for a, b > 0
  - etc.
- We can make complex kernels from simple ones: modularity!
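The closure properties above translate directly into code: kernels are just functions, and positive combinations of them are again kernels. A sketch, where the two base kernels and the weights are illustrative assumptions:

```python
# Build a new kernel as a positive combination aK + bK' of two kernels.

def linear(x, z):
    return sum(xi * zi for xi, zi in zip(x, z))

def poly(x, z):
    return linear(x, z) ** 2

def combine(k1, k2, a, b):
    """aK + bK' is a kernel for a, b > 0 (modularity)."""
    assert a > 0 and b > 0
    return lambda x, z: a * k1(x, z) + b * k2(x, z)

k = combine(linear, poly, a=1.0, b=0.5)
x, z = (1.0, 2.0), (3.0, 0.0)
assert k(x, z) == 1.0 * linear(x, z) + 0.5 * poly(x, z)   # 3 + 4.5 = 7.5
```

Because the combined function is still a kernel, it can be dropped into any kernel-based algorithm unchanged, which is the modularity point the slide makes.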

Second Property of SVMs

- Use a dual representation

AND

- Operate in a kernel-induced feature space, that is:

  f(x) = Σ αᵢ yᵢ ⟨φ(xᵢ), φ(x)⟩ + b

  is a linear function in the feature space implicitly defined by K

Kernels over General Structures

- Kernels can be defined over general structures: over sequences, over trees, etc.
- Applied in text categorization, bioinformatics, etc.

A Bad Kernel …

… would be a kernel whose kernel matrix is mostly diagonal: all points orthogonal to each other, no clusters, no structure …

  K = [ 1 0 0 … 0
        0 1 0 … 0
        … … … … …
        0 0 0 … 1 ]

No Free Kernel (IMPORTANT CONCEPT)

- If the mapping is into a space with too many irrelevant features, the kernel matrix becomes diagonal
- Need some prior knowledge of the target to choose a good kernel

Other Kernel-Based Algorithms

- The kernel approach is not limited to LLMs (e.g. clustering; PCA; etc.). A dual representation is often possible (in optimization problems, by the Representer theorem).

BREAK

NEW TOPIC: Generalization

- The curse of dimensionality: it is easy to overfit in high-dimensional spaces (= regularities could be found in the training set that are accidental, that is, that would not be found again in a test set)
- Not every separating hyperplane generalizes equally well (consider a hyperplane that separates the data: many such hyperplanes exist)
- We need a principled way to choose the best hyperplane

The Generalization Problem

- Many principles have been proposed for choosing the best hyperplane (inductive principles)
- Bayes, statistical learning theory / PAC, MDL, …
- Each can be used; we will focus on a simple case motivated by statistical learning theory (it will give the basic SVM)

Statistical (Computational) Learning Theory

- Generalization bounds on the test error (in a PAC setting: assumption of i.i.d. data; etc.)
- Standard bounds from VC theory give upper and lower bounds proportional to the VC dimension
- The VC dimension of LLMs is proportional to the dimension of the space (can be huge)

Assumptions and Definitions

- A fixed distribution D generates the data
- Train and test points are drawn randomly (i.i.d.) from D
- Training error of h: fraction of points in S misclassified by h
- Test error of h: probability under D of misclassifying a point x
- VC dimension: size of the largest subset of X shattered by H (every dichotomy implemented)

VC Bounds

  ε = Õ(VC / m)

VC = (number of dimensions of X) + 1

Margin-Based Bounds

  ε = Õ((R/γ)² / m)

  γ = minᵢ yᵢ f(xᵢ) / ‖f‖

Margin-Based Bounds (IMPORTANT CONCEPT)

- If the margin is large (relative to R), the margin-based bound can be applied and better generalization can be achieved:

  ε = Õ((R/γ)² / m)

- Best hyperplane: the maximal-margin one
- The margin is large if the kernel is chosen well

Maximal Margin Classifier

- The simplest case of SVM: the maximal-margin hyperplane in feature space
- Third feature of SVMs: maximize the margin
- SVMs control capacity by increasing the margin, not by reducing the number of degrees of freedom (dimension-free capacity control).

Two Kinds of Margin

- Functional margin:  margin_funct = minᵢ yᵢ f(xᵢ)
- Geometric margin:   margin_geom = minᵢ yᵢ f(xᵢ) / ‖w‖

[Figure: geometric margin γ of a separating hyperplane f]

Max Margin = Minimal Norm

- If we fix the functional margin to 1 (canonical hyperplane), the geometric margin equals 1/‖w‖
- Hence, maximize the margin by minimizing the norm

Max Margin = Minimal Norm

Distance between the two convex hulls:

  ⟨w, x⁺⟩ + b = +1
  ⟨w, x⁻⟩ + b = −1

  ⟨w, (x⁺ − x⁻)⟩ = 2

  ⟨w/‖w‖, (x⁺ − x⁻)⟩ = 2/‖w‖

[Figure: margin γ between the hyperplanes ⟨w, x⟩ + b = +1 and ⟨w, x⟩ + b = −1]

The Primal Problem (IMPORTANT STEP)

Minimize:    ⟨w, w⟩
subject to:  yᵢ (⟨w, xᵢ⟩ + b) ≥ 1

Optimization Theory

- Finding the maximal-margin hyperplane is a constrained optimization problem (quadratic programming)
- Use Lagrange theory (or Kuhn-Tucker theory)
- Lagrangian:

  L(w, b) = ½ ⟨w, w⟩ − Σ αᵢ [yᵢ (⟨w, xᵢ⟩ + b) − 1],   αᵢ ≥ 0

From Primal to Dual

  L(w, b) = ½ ⟨w, w⟩ − Σ αᵢ [yᵢ (⟨w, xᵢ⟩ + b) − 1],   αᵢ ≥ 0

Differentiate and substitute:

  ∂L/∂b = 0   ⇒   Σ αᵢ yᵢ = 0
  ∂L/∂w = 0   ⇒   w = Σ αᵢ yᵢ xᵢ

The Dual Problem (IMPORTANT STEP)

Maximize:    W(α) = Σᵢ αᵢ − ½ Σᵢ,ⱼ αᵢ αⱼ yᵢ yⱼ ⟨xᵢ, xⱼ⟩

Subject to:  αᵢ ≥ 0,   Σᵢ αᵢ yᵢ = 0

Convexity (IMPORTANT CONCEPT)

- The dual problem is quadratic and convex: no local minima (second effect of Mercer's conditions)
- Solvable in polynomial time …
- (Convexity is another fundamental property of SVMs)

Kuhn-Tucker Theorem

Properties of the solution:

- Duality: can use kernels
- KKT conditions:

  αᵢ [yᵢ (⟨w, xᵢ⟩ + b) − 1] = 0,   ∀i

- Sparseness: only the points nearest to the hyperplane (margin = 1) have positive weight in

  w = Σ αᵢ yᵢ xᵢ

- They are called support vectors

KKT Conditions Imply Sparseness

Sparseness: another fundamental property of SVMs

[Figure: only the points on the margin (the support vectors) receive positive weight]

Properties of SVMs - Summary

✓ Duality
✓ Kernels
✓ Margin
✓ Convexity
✓ Sparseness

Dealing with Noise

- When the data are not cleanly separable in feature space, the margin distribution can be optimized instead:

  ε ≤ (1/m) (R² + Σᵢ ξᵢ²) / γ²

  yᵢ (⟨w, xᵢ⟩ + b) ≥ 1 − ξᵢ

The Soft-Margin Classifier

Minimize:    ½ ⟨w, w⟩ + C Σᵢ ξᵢ

or:          ½ ⟨w, w⟩ + C Σᵢ ξᵢ²

Subject to:  yᵢ (⟨w, xᵢ⟩ + b) ≥ 1 − ξᵢ

Slack Variables

  ε ≤ (1/m) (R² + Σᵢ ξᵢ²) / γ²

  yᵢ (⟨w, xᵢ⟩ + b) ≥ 1 − ξᵢ

Soft Margin - Dual Lagrangian

- The 1-norm case gives box constraints:

  W(α) = Σᵢ αᵢ − ½ Σᵢ,ⱼ αᵢ αⱼ yᵢ yⱼ ⟨xᵢ, xⱼ⟩

  0 ≤ αᵢ ≤ C,   Σᵢ αᵢ yᵢ = 0

- The 2-norm case adds a diagonal term:

  W(α) = Σᵢ αᵢ − ½ Σᵢ,ⱼ αᵢ αⱼ yᵢ yⱼ ⟨xᵢ, xⱼ⟩ − (1/2C) Σᵢ αᵢ²

  0 ≤ αᵢ,   Σᵢ αᵢ yᵢ = 0

The Regression Case

- For regression, the same properties are retained, introducing the ε-insensitive loss:

  L(y, f(x)) = max(0, |y − ⟨w, x⟩ − b| − ε)

Regression: the ε-tube

[Figure: the ε-insensitive tube around the regression function]

Implementation Techniques

- Maximizing a quadratic function, subject to a linear equality constraint (and inequalities as well):

  W(α) = Σᵢ αᵢ − ½ Σᵢ,ⱼ αᵢ αⱼ yᵢ yⱼ K(xᵢ, xⱼ)

  αᵢ ≥ 0,   Σᵢ αᵢ yᵢ = 0

Simple Approximation

- Stochastic Gradient Ascent (sequentially update one weight at a time) gives an excellent approximation in most cases:

  αᵢ ← αᵢ + (1 / K(xᵢ, xᵢ)) · (1 − yᵢ Σⱼ αⱼ yⱼ K(xᵢ, xⱼ))
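The coordinate-wise update above is simple enough to sketch end to end. This is an illustrative sketch, not a full solver: the dataset, the kernel, the value of C, and the number of epochs are assumptions, the bias term is omitted, and the soft-margin box constraint 0 ≤ αᵢ ≤ C is added by clipping after each update.

```python
# Stochastic gradient ascent on the dual SVM objective, one alpha at a time.

def K(x, z):
    return sum(xi * zi for xi, zi in zip(x, z))   # linear kernel (assumption)

def train_dual(X, y, C=10.0, epochs=200):
    n = len(X)
    G = [[K(X[i], X[j]) for j in range(n)] for i in range(n)]   # Gram matrix
    alpha = [0.0] * n
    for _ in range(epochs):
        for i in range(n):
            # update rule: alpha_i += (1 - y_i * f(x_i)) / K(x_i, x_i)
            f = sum(alpha[j] * y[j] * G[i][j] for j in range(n))
            alpha[i] += (1.0 - y[i] * f) / G[i][i]
            alpha[i] = min(max(alpha[i], 0.0), C)               # box constraint
    return alpha

# Separable toy data (illustrative assumption)
X = [[2.0, 1.0], [1.0, 2.0], [-1.0, -1.0], [-2.0, -1.5]]
y = [+1, +1, -1, -1]
alpha = train_dual(X, y)
f = lambda x: sum(a * yi * K(xi, x) for a, yi, xi in zip(alpha, y, X))
assert all(yi * f(xi) > 0 for xi, yi in zip(X, y))
```

Each step moves one coordinate toward yᵢ f(xᵢ) = 1, which is why the procedure cannot handle the equality constraint Σ αᵢ yᵢ = 0 exactly; that limitation motivates SMO on the next slide.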

Full Solution: S.M.O.

- Sequential Minimal Optimization: update two weights simultaneously
- Realizes gradient ascent without leaving the linear constraint (J. Platt).

Other "Kernelized" Algorithms

- The same approach applies beyond LLMs: Fisher discriminant, Bayes classifier, ridge regression, etc.
- Much work in past years on designing kernel-based algorithms
- Now: more work on designing good kernels (for any algorithm)

On Combining Kernels

- Too many features lead to overfitting, also in kernel methods
- Kernel combination needs to be based on principles
- Alignment

Kernel Alignment (IMPORTANT CONCEPT)

- We measure the similarity between two kernels via their Alignment (= similarity between Gram matrices):

  A(K₁, K₂) = ⟨K₁, K₂⟩ / √(⟨K₁, K₁⟩ ⟨K₂, K₂⟩)
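Here ⟨·,·⟩ is the Frobenius inner product between matrices, Σᵢⱼ K₁[i][j]·K₂[i][j]. A sketch of the computation; the three-point label vector and the identity matrix used for comparison are illustrative assumptions:

```python
# Compute the alignment A(K1, K2) between two Gram matrices.
import math

def frob(K1, K2):
    """Frobenius inner product: sum_ij K1[i][j] * K2[i][j]."""
    return sum(a * b for r1, r2 in zip(K1, K2) for a, b in zip(r1, r2))

def alignment(K1, K2):
    return frob(K1, K2) / math.sqrt(frob(K1, K1) * frob(K2, K2))

# Ideal kernel YY' for labels y = (+1, +1, -1)
y = [1, 1, -1]
YY = [[yi * yj for yj in y] for yi in y]

assert abs(alignment(YY, YY) - 1.0) < 1e-12      # any kernel has alignment 1 with itself

# A "bad" (diagonal) kernel is poorly aligned with the ideal one
I = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
assert abs(alignment(I, YY) - 1.0 / math.sqrt(3)) < 1e-12
```

Alignment is a number in [−1, 1] by the Cauchy-Schwarz inequality, which is what licenses the "correlation coefficient" reading on the next slide.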

Many Interpretations

- As a correlation coefficient between 'oracles'
- The ideal kernel should be given by the labels vector (after all, the target is the only relevant feature!)

The Ideal Kernel

  YY′ = [  1  1 −1 … −1
           1  1 −1 … −1
          −1 −1  1 …  1
           …  …  …  …  …
          −1 −1  1 …  1 ]

Combining Kernels

- Improved alignment is obtained by combining kernels that are aligned to the target and not aligned to each other.
- Alignment with the target:

  A(K₁, YY′) = ⟨K₁, YY′⟩ / √(⟨K₁, K₁⟩ ⟨YY′, YY′⟩)

Spectral Machines

- It is possible to fit a set of labels to a given kernel
- By solving this problem:

  y = argmax (y′ K y) / (y′ y),   yᵢ ∈ {−1, +1}

- Approximated by the principal eigenvector (thresholded) (see the Courant-Hilbert theorem)

Courant-Hilbert Theorem

- Principal eigenvalue / eigenvector characterized by:

  λ = max_v (v′ A v) / (v′ v)

Optimizing Kernel Alignment

- One can optimize the alignment over the choice of kernel given the labels, or over the labels given the kernel, and vice versa
- In the first case: a model selection method
- In the second case: a clustering / transduction method

Applications of SVMs

- Bioinformatics
- Machine Vision
- Text Categorization
- Handwritten Character Recognition
- Time series analysis

Text Kernels

- Latent semantic kernels (ICML 2001)
- String matching kernels
- …
- See the KerMIT project …

Bioinformatics

- Gene Expression
- Protein sequences
- Phylogenetic Information
- Promoters
- …

Conclusions

- Much more than just a replacement for neural networks
- A general and rich class of pattern recognition methods
- Book on SVMs: www.support-vector.net
- www.kernel-machines.org
- www.NeuroCOLT.org
