
Neural Nets
Published by _sweater_, 07/06/2010
https://www.scribd.com/doc/22897294/Neural-Nets

Vasant Honavar

**Artificial Neural Networks**

Jihoon Yang

Data Mining Research Laboratory, Department of Computer Science, Sogang University
Email: yangjh@sogang.ac.kr  URL: mllab.sogang.ac.kr/people/jhyang.html

Neural Networks

• Decision trees are good at modeling nonlinear interactions among a small subset of attributes
• Sometimes we are interested in linear interactions among all attributes
• Simple neural networks are good at modeling such interactions
• The resulting models have close connections with naïve Bayes


**Learning Threshold Functions**

Outline:
• Background
• Threshold logic functions
• Connection to logic
• Connection to geometry
• Learning threshold functions – the perceptron algorithm and its variants
• Perceptron convergence theorem


**Background – Neural computation**

• 1900: Birth of neuroscience – Ramon Cajal et al.
• 1913: Behaviorist or stimulus-response psychology
• 1930–50: Theory of Computation, Church-Turing Thesis
• 1943: McCulloch & Pitts, “A logical calculus of neuronal activity”
• 1949: Hebb – Organization of Behavior
• 1956: Birth of Artificial Intelligence – “Computers and Thought”
• 1960–65: Perceptron model developed by Rosenblatt


**Background – Neural computation**

• 1969: Minsky and Papert criticize the Perceptron
• 1969: Chomsky argues for a universal innate grammar
• 1970: Rise of cognitive psychology and knowledge-based AI
• 1975: Learning algorithms for multi-layer neural networks
• 1985: Resurgence of neural networks and machine learning
• 1988: Birth of computational neuroscience
• 1990: Successful applications (stock market, OCR, robotics)
• 1990–2000: New synthesis of behaviorist and cognitive or representational approaches in AI and psychology
• 2000–: Synthesis of logical and probabilistic approaches to representation and learning


**Background – Brains and Computers**

• The brain consists of about 10^11 neurons, each of which is connected to about 10^4 neighbors
• Each neuron is slow (~1 millisecond to respond to a stimulus), but the brain is astonishingly fast at perceptual tasks (e.g. face recognition)
• The brain processes and learns from multiple sources of sensory information (visual, tactile, auditory…)
• The brain is massively parallel, shallowly serial, modular, and hierarchical, with recurrent and lateral connectivity within and between modules
• If cognition is – or at least can be modeled by – computation, it is natural to ask how and what brains compute


**Brain and information processing**

[Figure: map of the cortex – primary motor cortex, motor association cortex, primary somatosensory cortex, sensory association area, primary visual cortex, visual association area, auditory cortex, auditory association area, speech comprehension area, prefrontal cortex]

Neural Networks

**Ramon Cajal, 1900**


Neurons and Computation


**McCulloch-Pitts computational model of a neuron**

[Figure: inputs x₁ … xₙ with synaptic weights w₁ … wₙ, plus a bias input x₀ = 1 with weight w₀, feed a summation unit Σ whose output is y]

y = 1 if Σᵢ₌₀ⁿ wᵢxᵢ > 0
y = −1 otherwise

When a neuron receives input signals from other neurons, its membrane voltage increases. When it exceeds a certain threshold, the neuron “fires” a burst of pulses.
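The model above is easy to make concrete. Below is a minimal Python sketch (illustrative code of ours, not from the slides), with the bias weight w₀ folded in as the weight on the constant input x₀ = 1:

```python
# Minimal McCulloch-Pitts threshold unit: outputs +1 when the weighted sum
# (including the bias weight w0 on the implicit input x0 = 1) is positive,
# and -1 otherwise.
def threshold_neuron(weights, x):
    """weights = [w0, w1, ..., wn]; x = [x1, ..., xn]."""
    s = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return 1 if s > 0 else -1
```

With w₀ = −1.5 and w₁ = w₂ = 1 this unit computes the logical AND of two 0/1 inputs, the example worked out on a later slide.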


**Threshold neuron – Connection with Geometry**

[Figure: in the (x₁, x₂) plane, the decision boundary w₁x₁ + w₂x₂ + w₀ = 0 separates the half-plane w₁x₁ + w₂x₂ + w₀ > 0 (class C₁) from the half-plane w₁x₁ + w₂x₂ + w₀ < 0 (class C₂); the weight vector (w₁, w₂) is normal to the boundary]

Σᵢ₌₁ⁿ wᵢxᵢ + w₀ = 0 describes a hyperplane which divides the instance space ℜⁿ into two half-spaces:

χ⁺ = { Xₚ ∈ ℜⁿ | W · Xₚ + w₀ > 0 }  and  χ⁻ = { Xₚ ∈ ℜⁿ | W · Xₚ + w₀ < 0 }

McCulloch-Pitts neuron or Threshold neuron

y = sign(W · X + w₀) = sign(Σᵢ₌₀ⁿ wᵢxᵢ) = sign(Wᵀ X + w₀)

where X = [x₁ x₂ … xₙ]ᵀ and W = [w₁ w₂ … wₙ]ᵀ

sign(v) = 1 if v > 0
sign(v) = −1 otherwise


Threshold neuron – Connection with Geometry


**Threshold neuron – Connection with Geometry**

• Instance space: ℜⁿ
• Hypothesis space: the set of (n−1)-dimensional hyperplanes defined in the n-dimensional instance space
• A hypothesis is defined by Σᵢ₌₀ⁿ wᵢxᵢ = 0
• The orientation of the hyperplane is governed by (w₁ … wₙ)ᵀ
• The perpendicular distance of the hyperplane from the origin is |w₀| / √(w₁² + w₂² + … + wₙ²)
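The distance formula can be checked numerically with a small helper (a hypothetical function of ours, not part of the slides):

```python
import math

def hyperplane_distance_from_origin(w, w0):
    """Perpendicular distance of the hyperplane w.x + w0 = 0 from the origin:
    |w0| / sqrt(w1^2 + ... + wn^2)."""
    return abs(w0) / math.sqrt(sum(wi * wi for wi in w))
```

For example, the line 3x₁ + 4x₂ + 10 = 0 lies at distance 10/5 = 2 from the origin.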


**Threshold neuron as a pattern classifier**

• The threshold neuron can be used to classify a set of instances into one of two classes C₁, C₂
• If the output of the neuron for input pattern Xₚ is +1, then Xₚ is assigned to class C₁
• If the output is −1, then the pattern Xₚ is assigned to C₂

• Example:
  [w₀ w₁ w₂]ᵀ = [−1 −1 1]ᵀ, Xₚᵀ = [1 0]
  W · Xₚ + w₀ = −1 + (−1) = −2, so Xₚ is assigned to class C₂


**Threshold neuron – Connection with Logic**

• Suppose the input space is {0,1}ⁿ
• Then the threshold neuron computes a Boolean function f: {0,1}ⁿ → {−1, 1}

• Example – Let w₀ = −1.5; w₁ = w₂ = 1
  – In this case, the threshold neuron implements the logical AND function

| x₁ | x₂ | g(X) | y |
|----|----|------|---|
| 0 | 0 | −1.5 | −1 |
| 0 | 1 | −0.5 | −1 |
| 1 | 0 | −0.5 | −1 |
| 1 | 1 | 0.5 | 1 |


**Threshold neuron – Connection with Logic**

• A threshold neuron with the appropriate choice of weights can implement the Boolean AND, OR, and NOT functions
• Theorem: For any arbitrary Boolean function f, there exists a network of threshold neurons that can implement f
• Theorem: Any arbitrary finite state automaton can be realized using threshold neurons and delay units
• Networks of threshold neurons, given access to unbounded memory, can compute any Turing-computable function
• Corollary: Brains, if given access to enough working memory, can compute any computable function
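As a small illustration of the network theorem (with weights hand-picked by us, a sketch rather than slide material): two layers of threshold units suffice for XOR, a function that a later slide shows cannot be computed by any single threshold neuron.

```python
# Two-layer network of threshold logic units computing XOR on 0/1 inputs.
# Hidden units detect "x1 AND NOT x2" and "x2 AND NOT x1"; the output
# unit ORs them. All weights are hand-chosen for illustration.
def tlu(w0, w, x):
    """Threshold logic unit: 1 if w0 + w.x > 0, else 0."""
    return 1 if w0 + sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

def xor_net(x1, x2):
    h1 = tlu(-0.5, [1, -1], [x1, x2])   # fires only for (1, 0)
    h2 = tlu(-0.5, [-1, 1], [x1, x2])   # fires only for (0, 1)
    return tlu(-0.5, [1, 1], [h1, h2])  # OR of the hidden units
```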


**Threshold neuron – Connection with Logic**

• Theorem: There exist functions that cannot be implemented by a single threshold neuron

• Example: Exclusive OR

[Figure: the four XOR instances in the (x₁, x₂) plane – no single line separates the two classes. Why?]


**Threshold neuron – Connection with Logic**

• Definition: A function that can be computed by a single threshold neuron is called a threshold function
• Of the 16 2-input Boolean functions, 14 are Boolean threshold functions
• As n increases, the number of Boolean threshold functions becomes an increasingly small fraction of the total number of n-input Boolean functions:

N_Threshold(n) ≤ 2^(n²);  N_Boolean(n) = 2^(2ⁿ)


**Terminology and Notation**

• Synonyms: Threshold function, Linearly separable function, Linear discriminant function
• Synonyms: Threshold neuron, McCulloch-Pitts neuron, Perceptron, Threshold Logic Unit (TLU)

• We often include w₀ as one of the components of W and incorporate x₀ as the corresponding component of X, with the understanding that x₀ = 1; then y = 1 if W · X > 0 and y = −1 otherwise


**Learning Threshold Functions**

• A training example Eₖ is an ordered pair (Xₖ, dₖ) where

Xₖ = [x₀ₖ x₁ₖ … xₙₖ]ᵀ is an (n+1)-dimensional input pattern, and

dₖ = f(Xₖ) ∈ {−1, 1} is the desired output of the classifier, where f is an unknown target function to be learned

• A training set E is simply a multi-set of examples


**Learning Threshold Functions**

S⁺ = { Xₖ | (Xₖ, dₖ) ∈ E and dₖ = 1 }
S⁻ = { Xₖ | (Xₖ, dₖ) ∈ E and dₖ = −1 }

• We say that a training set E is linearly separable if and only if ∃W* such that ∀Xₚ ∈ S⁺, W* · Xₚ > 0 and ∀Xₚ ∈ S⁻, W* · Xₚ < 0

• Learning task: Given a linearly separable training set E, find a solution W* such that ∀Xₚ ∈ S⁺, W* · Xₚ > 0 and ∀Xₚ ∈ S⁻, W* · Xₚ < 0


**Rosenblatt’s Perceptron Learning Algorithm**

1. Initialize W = [0 0 … 0]ᵀ
2. Set the learning rate η > 0
3. Repeat until a complete pass through E results in no weight updates:
   For each training example Eₖ ∈ E {
     yₖ ← sign(W · Xₖ)
     W ← W + η(dₖ − yₖ)Xₖ
   }
4. Return W* ← W
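The algorithm translates directly into Python. The sketch below follows the slides' conventions (augmented patterns with x₀ = 1, targets in {−1, +1}, sign(0) taken as −1); the function names are ours:

```python
# Rosenblatt's perceptron rule: cycle through the examples, updating
# W <- W + eta*(d - y)*X whenever the prediction y = sign(W.X) is wrong,
# until a complete pass produces no weight updates.
def sign(v):
    return 1 if v > 0 else -1

def train_perceptron(examples, eta=0.5, max_passes=1000):
    """examples: list of (X, d) with X an augmented tuple and d in {-1, +1}."""
    w = [0.0] * len(examples[0][0])
    for _ in range(max_passes):
        updated = False
        for x, d in examples:
            y = sign(sum(wi * xi for wi, xi in zip(w, x)))
            if y != d:
                w = [wi + eta * (d - y) * xi for wi, xi in zip(w, x)]
                updated = True
        if not updated:
            break  # converged: a full pass with no weight updates
    return w
```

On the linearly separable training set of the worked example on the next slide, this converges to a separating weight vector after a handful of updates.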


**Perceptron Learning Algorithm – Example**

Let S⁺ = {(1, 1, 1), (1, 1, −1), (1, 0, −1)}, S⁻ = {(1, −1, −1), (1, −1, 1), (1, 0, 1)}, W = (0, 0, 0), η = 1/2

| Xₖ | dₖ | W | W·Xₖ | yₖ | Update? | Updated W |
|----|----|---|------|----|---------|-----------|
| (1, 1, 1) | 1 | (0, 0, 0) | 0 | −1 | Yes | (1, 1, 1) |
| (1, 1, −1) | 1 | (1, 1, 1) | 1 | 1 | No | (1, 1, 1) |
| (1, 0, −1) | 1 | (1, 1, 1) | 0 | −1 | Yes | (2, 1, 0) |
| (1, −1, −1) | −1 | (2, 1, 0) | 1 | 1 | Yes | (1, 2, 1) |
| (1, −1, 1) | −1 | (1, 2, 1) | 0 | −1 | No | (1, 2, 1) |
| (1, 0, 1) | −1 | (1, 2, 1) | 2 | 1 | Yes | (0, 2, 0) |
| (1, 1, 1) | 1 | (0, 2, 0) | 2 | 1 | No | (0, 2, 0) |


**Perceptron Convergence Theorem (Novikoff)**

Theorem: Let E = {(Xₖ, dₖ)} be a training set where Xₖ ∈ {1} × ℜⁿ and dₖ ∈ {−1, 1}.
Let S⁺ = { Xₖ | (Xₖ, dₖ) ∈ E & dₖ = 1 } and S⁻ = { Xₖ | (Xₖ, dₖ) ∈ E & dₖ = −1 }.

The perceptron algorithm is guaranteed to terminate after a bounded number t of weight updates with a weight vector W* such that ∀Xₖ ∈ S⁺, W* · Xₖ ≥ δ and ∀Xₖ ∈ S⁻, W* · Xₖ ≤ −δ for some δ > 0, whenever such W* ∈ ℜⁿ⁺¹ and δ > 0 exist – that is, whenever E is linearly separable.

The bound on the number t of weight updates is given by

t ≤ ( ‖W*‖ L / δ )²   where L = max_{Xₖ ∈ S} ‖Xₖ‖ and S = S⁺ ∪ S⁻
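For a concrete feel for the bound, the quantities ‖W*‖, L, and δ can be computed for any known separating vector (an illustrative helper of ours, not from the slides):

```python
import math

def novikoff_bound(w_star, examples):
    """Upper bound (||W*|| L / delta)^2 on the number of perceptron weight
    updates, taking delta as the worst-case margin min_k d_k * (W* . X_k)."""
    margins = [d * sum(wi * xi for wi, xi in zip(w_star, x))
               for x, d in examples]
    delta = min(margins)
    assert delta > 0, "w_star must separate the training set"
    L = max(math.sqrt(sum(xi * xi for xi in x)) for x, _ in examples)
    norm = math.sqrt(sum(wi * wi for wi in w_star))
    return (norm * L / delta) ** 2
```

For the tiny set {((1, 1), +1), ((1, −1), −1)} with W* = (0, 1), the margin is δ = 1, L = √2, ‖W*‖ = 1, so the bound is 2 updates.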


Proof of Perceptron Convergence Theorem

Let W_t be the weight vector after t weight updates.

[Figure: vectors W* and W_t with the angle θ between them]

Invariant: ∀θ, cos θ ≤ 1


Proof of Perceptron Convergence Theorem

Let W* be such that ∀Xₖ ∈ S⁺, W* · Xₖ ≥ δ and ∀Xₖ ∈ S⁻, W* · Xₖ ≤ −δ.
WLOG assume that the hyperplane W* · X = 0 passes through the origin.
Let ∀Xₖ ∈ S⁺, Zₖ = Xₖ; ∀Xₖ ∈ S⁻, Zₖ = −Xₖ; Z = {Zₖ}.

(∀Xₖ ∈ S⁺, W* · Xₖ ≥ δ & ∀Xₖ ∈ S⁻, W* · Xₖ ≤ −δ) ⇔ (∀Zₖ ∈ Z, W* · Zₖ ≥ δ)

Let E′ = {(Zₖ, 1)}

Proof of Perceptron Convergence Theorem

[Weight update based on example (Zₖ, 1)] ⇔ [(dₖ = 1) ∧ (yₖ = −1)]

W_{t+1} = W_t + η(dₖ − yₖ)Zₖ, where W₀ = [0 0 … 0]ᵀ and η > 0

∴ W* · W_{t+1} = W* · (W_t + 2ηZₖ) = W* · W_t + 2η(W* · Zₖ)

Since ∀Zₖ ∈ Z, W* · Zₖ ≥ δ:

∀t, W* · W_{t+1} ≥ W* · W_t + 2ηδ

∴ ∀t, W* · W_t ≥ 2tηδ …………………… (a)


Proof of Perceptron Convergence Theorem

‖W_{t+1}‖² ≡ W_{t+1} · W_{t+1} = (W_t + 2ηZₖ) · (W_t + 2ηZₖ)
  = (W_t · W_t) + 4η(W_t · Zₖ) + 4η²(Zₖ · Zₖ)

Note: a weight update based on Zₖ occurs ⇔ (W_t · Zₖ ≤ 0)

∴ ‖W_{t+1}‖² ≤ ‖W_t‖² + 4η²‖Zₖ‖² ≤ ‖W_t‖² + 4η²L²

Hence ‖W_t‖² ≤ 4tη²L²

∴ ∀t, ‖W_t‖ ≤ 2ηL√t …………………… (b)


Proof of Perceptron Convergence Theorem

From (a) we have:

∀t, 2tηδ ≤ W* · W_t
⇒ ∀t, 2tηδ ≤ ‖W*‖ ‖W_t‖ cos θ
⇒ ∀t, 2tηδ ≤ ‖W*‖ ‖W_t‖   (∵ ∀θ, cos θ ≤ 1)

Substituting the upper bound on ‖W_t‖ from (b):

∀t, 2tηδ ≤ ‖W*‖ · 2ηL√t
⇒ ∀t, √t ≤ ‖W*‖L/δ
⇒ ∀t, t ≤ ( ‖W*‖ L / δ )²

**Notes on the Perceptron Convergence Theorem**

• The bound on the number of weight updates does not depend on the learning rate
• The bound is not useful in determining when to stop the algorithm, because it depends on the norm of the unknown weight vector W* and on δ
• The convergence theorem offers no guarantees when the training data set is not linearly separable

• Exercise: Prove that the perceptron algorithm is robust with respect to fluctuations in the learning rate: 0 < η_min ≤ η_t ≤ η_max < ∞


Multiple classes

• One-versus-rest: K−1 binary classifiers
• One-versus-one: K(K−1)/2 binary classifiers

**Problem: Green region has ambiguous class membership**


**Multi-category classifiers**

Winner-Take-All Network
• Define K linear functions of the form:

y_k(X) = W_kᵀ X + w_k0

h(X) = argmax_k y_k(X) = argmax_k (W_kᵀ X + w_k0)

[Figure: three decision regions C₁, C₂, C₃]

• Decision surface between class C_k and C_j:

(W_k − W_j)ᵀ X + (w_k0 − w_j0) = 0


**Linear separator for K classes**

• Decision regions defined by (W_k − W_j)ᵀ X + (w_k0 − w_j0) = 0 are singly connected and convex
• For any points X_A, X_B ∈ R_k, any point X̂ that lies on the line connecting X_A and X_B,

X̂ = λX_A + (1 − λ)X_B, where 0 ≤ λ ≤ 1,

also lies in R_k


Winner-Take-All Networks

y_ip = 1 iff W_i · Xₚ > W_j · Xₚ ∀j ≠ i; y_ip = 0 otherwise

W₁ = [1 −1 −1]ᵀ, W₂ = [1 1 1]ᵀ, W₃ = [2 0 0]ᵀ

Note: the W_j are augmented weight vectors.

| Xₚ | W₁·Xₚ | W₂·Xₚ | W₃·Xₚ | y₁ | y₂ | y₃ |
|----|-------|-------|-------|----|----|----|
| (1, −1, −1) | 3 | −1 | 2 | 1 | 0 | 0 |
| (1, 1, −1) | 1 | 1 | 2 | 0 | 0 | 1 |
| (1, −1, 1) | 1 | 1 | 2 | 0 | 0 | 1 |
| (1, 1, 1) | −1 | 3 | 2 | 0 | 1 | 0 |

**What does neuron 3 compute?**


Linear separability of multiple classes

Let S₁, S₂, S₃, …, S_M be multisets of instances and C₁, C₂, C₃, …, C_M be disjoint classes, with ∀i, S_i ⊆ C_i and ∀i ≠ j, C_i ∩ C_j = ∅.

We say that the sets S₁, S₂, S₃, …, S_M are linearly separable iff there exist weight vectors W₁*, W₂*, …, W_M* such that

∀i, ∀Xₚ ∈ S_i: W_i* · Xₚ > W_j* · Xₚ ∀j ≠ i


**Training WTA Classifiers**

d_kp = 1 iff Xₚ ∈ C_k; d_kp = 0 otherwise
y_kp = 1 iff W_k · Xₚ > W_j · Xₚ ∀j ≠ k

• Suppose d_kp = 1, y_jp = 1 and y_kp = 0:
  W_k ← W_k + ηXₚ; W_j ← W_j − ηXₚ; all other weights are left unchanged.
• Suppose d_kp = 1, y_jp = 0 and y_kp = 1: the weights are unchanged.
• Suppose d_kp = 1 and ∀j, y_jp = 0 (there was a tie):
  W_k ← W_k + ηXₚ; all other weights are left unchanged.
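These rules can be sketched in a few lines of Python (helper names are ours; ties in the argmax are broken toward the lowest index, a simplification of the slide's explicit tie case):

```python
# Winner-take-all group of K linear units: the predicted class is the
# index of the unit with the largest activation W_k . x.
def wta_predict(weights, x):
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in weights]
    return max(range(len(scores)), key=lambda k: scores[k])

def wta_train(weights, examples, eta=0.5, max_passes=100):
    """examples: list of (x, class_index); weights: list of K weight lists.
    On an error, reward the correct unit and punish the wrong winner."""
    for _ in range(max_passes):
        updated = False
        for x, k in examples:
            j = wta_predict(weights, x)
            if j != k:
                weights[k] = [wi + eta * xi for wi, xi in zip(weights[k], x)]
                weights[j] = [wi - eta * xi for wi, xi in zip(weights[j], x)]
                updated = True
        if not updated:
            break
    return weights
```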


**WTA Convergence Theorem**

• Given a linearly separable training set, the WTA learning algorithm is guaranteed to converge to a solution within a finite number of weight updates

• Proof sketch: Transform the WTA training problem into the problem of training a single perceptron on a suitably transformed training set; the convergence proof for the WTA learning algorithm then reduces to that of the perceptron learning algorithm


**WTA Convergence Theorem**

Let Wᵀ = [W₁ W₂ … W_M]ᵀ be a concatenation of the weight vectors associated with the M neurons in the WTA group. Consider a multi-category training set E = {(Xₚ, f(Xₚ))} where ∀Xₚ, f(Xₚ) ∈ {C₁, …, C_M}.

Let Xₚ ∈ C₁. Generate (M−1) training examples for an M(n+1)-input perceptron:

X_p12 = [Xₚ  −Xₚ  φ  …  φ]
X_p13 = [Xₚ  φ  −Xₚ  φ  …  φ]
…
X_p1M = [Xₚ  φ  φ  …  −Xₚ]

where φ is an all-zero vector with the same dimension as Xₚ; set the desired output of the corresponding perceptron example to 1 in each case. Similarly, from each training example for an (n+1)-input WTA, we can generate (M−1) examples for an M(n+1)-input single neuron. Let the union of the resulting |E|(M−1) examples be E′.


WTA Convergence Theorem

By construction, there is a one-to-one correspondence between the weight vector Wᵀ = [W₁ W₂ … W_M]ᵀ that results from training an M-neuron WTA on the multi-category set of examples E and the result of training an M(n+1)-input perceptron on the transformed training set E′. Hence the convergence proof of the WTA learning algorithm follows from the perceptron convergence theorem.


**Weight space representation**

• Pattern space representation:
  – Coordinates of the space correspond to attributes (features)
  – A point in the space represents an instance
  – A weight vector W defines a hyperplane W · X = 0

• Weight space (dual) representation:
  – Coordinates define a weight space
  – A point in the space represents a choice of weights W
  – An instance Xₚ defines a hyperplane W · Xₚ = 0


Weight space representation

[Figure: in the (w₀, w₁) weight space, each instance defines a hyperplane – W · Xₚ = 0 and W · Xᵣ = 0 for Xₚ, Xᵣ ∈ S⁺ and W · X_q = 0 for X_q ∈ S⁻; the solution region is the set of weight vectors on the correct side of every such hyperplane]


Weight space representation

[Figure: in the (w₀, w₁) weight space, the perceptron update W_{t+1} ← W_t + ηXₚ moves W_t by ηXₚ toward the hyperplane W · Xₚ = 0, Xₚ ∈ S⁺]

Fractional correction rule:

W_{t+1} ← W_t + λ ((W_t · Xₚ + ε) / (Xₚ · Xₚ + ε)) (d_p − y_p) Xₚ

0 < λ < 1; λ ≠ 0.5 when d_p, y_p ∈ {−1, 1}

ε > 0 is a constant (to handle the case when the dot product W_t · Xₚ or Xₚ · Xₚ (or both) approaches zero).


“Perceptrons” (1969)

“The perceptron […] has many features that attract attention: its linearity, its intriguing learning theorem; its clear paradigmatic simplicity as a kind of parallel computation. There is no reason to suppose that any of these virtues carry over to the many-layered version. Nevertheless, we consider it to be an important research problem to elucidate (or reject) our intuitive judgement that the extension is sterile.” [pp. 231–232]


Limitations of Perceptrons

• Perceptrons can only represent threshold functions
• Perceptrons can only learn linear decision boundaries
• What if the data are not linearly separable?
  – Modify the learning procedure or the weight update equation? (e.g. Pocket algorithm, Thermal perceptron)
  – More complex networks?
  – Non-linear transformations into a feature space where the data become separable?


**Extending Linear Classifiers: Learning in feature spaces**

• Map data into a feature space where they become linearly separable:

x → φ(x)

[Figure: two classes (x and o) that are not linearly separable in the input space X become linearly separable after the mapping Φ into the feature space]


Exclusive OR revisited

• In the feature (hidden) space:

φ₁(x₁, x₂) = e^(−‖X − W₁‖²) = z₁
φ₂(x₁, x₂) = e^(−‖X − W₂‖²) = z₂

with W₁ = [1, 1]ᵀ and W₂ = [0, 0]ᵀ

[Figure: the XOR inputs mapped into the (z₁, z₂) plane – (1,1), (0,0), and the coincident pair (0,1)/(1,0) – with a linear decision boundary separating the classes]

• When mapped into the feature space ⟨z₁, z₂⟩, C₁ and C₂ become linearly separable, so a linear classifier with φ₁(X) and φ₂(X) as inputs can be used to solve the XOR problem.
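The feature map is easy to verify numerically (a sketch of ours; the separating line z₁ + z₂ = 1 used below is one choice consistent with the figure, not a value given on the slide):

```python
import math

# Radial-basis feature map with centers W1 = (1, 1) and W2 = (0, 0):
# z_i = exp(-||x - W_i||^2).
def xor_features(x1, x2, c1=(1.0, 1.0), c2=(0.0, 0.0)):
    d1 = (x1 - c1[0]) ** 2 + (x2 - c1[1]) ** 2
    d2 = (x1 - c2[0]) ** 2 + (x2 - c2[1]) ** 2
    return math.exp(-d1), math.exp(-d2)
```

The inputs (0,1) and (1,0) collapse onto the same feature point, and a single line such as z₁ + z₂ = 1 then separates the XOR-true inputs from the others.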


**Learning in the Feature Space**

• High-dimensional feature spaces

X = (x₁, x₂, …, xₙ) → φ(X) = (φ₁(X), φ₂(X), …, φ_d(X))

where typically d ≫ n, solve the problem of expressing complex functions
• But this introduces:
  – A computational problem (working with very large vectors) – solved using the kernel trick (implicit feature spaces)
  – A generalization problem (curse of dimensionality) – solved by maximizing the margin of separation, first implemented in SVMs (Vapnik)
• We will return to SVMs later.


**Linear Classifiers – linear discriminant functions**

• A perceptron implements a linear discriminant function – a linear decision surface given by

y(X) = Wᵀ X + w₀ = 0, where X = [x₁ … xₙ]ᵀ and W = [w₁ … wₙ]ᵀ

• The solution hyperplane simply has to separate the classes
• We can consider alternative criteria for separating hyperplanes


**Project data onto a line joining the means of the two classes**

Measure of separation of the classes – separation of the projected means:

m₂ − m₁ = Wᵀ(μ₂ − μ₁)

Problems:
• The separation can be made arbitrarily large by increasing the magnitude of W – constrain W to be of unit length
• Classes that are well separated in the original space can have non-trivial overlap in the projection – maximize the between-class variance in the projection


**Fisher’s Linear Discriminant**

• Given two classes, find the linear discriminant W ∈ ℜⁿ that maximizes Fisher’s discriminant ratio:

f(W; μ₁, Σ₁, μ₂, Σ₂) = (Wᵀ(μ₁ − μ₂))² / (Wᵀ(Σ₁ + Σ₂)W) = Wᵀ(μ₁ − μ₂)(μ₁ − μ₂)ᵀW / (Wᵀ(Σ₁ + Σ₂)W)

• Set ∂f(W; μ₁, Σ₁, μ₂, Σ₂) / ∂W = 0


**Fisher’s Linear Discriminant**

f(W; μ₁, Σ₁, μ₂, Σ₂) = Wᵀ(μ₁ − μ₂)(μ₁ − μ₂)ᵀW / (Wᵀ(Σ₁ + Σ₂)W)

Setting ∂f/∂W = 0 and applying the quotient rule:

(Wᵀ(Σ₁ + Σ₂)W) ∂/∂W (Wᵀ(μ₁ − μ₂)(μ₁ − μ₂)ᵀW) − (Wᵀ(μ₁ − μ₂)(μ₁ − μ₂)ᵀW) ∂/∂W (Wᵀ(Σ₁ + Σ₂)W) = 0

(Wᵀ(Σ₁ + Σ₂)W) · 2(μ₁ − μ₂)(μ₁ − μ₂)ᵀW − (Wᵀ(μ₁ − μ₂)(μ₁ − μ₂)ᵀW) · 2(Σ₁ + Σ₂)W = 0

(μ₁ − μ₂)(μ₁ − μ₂)ᵀW = k(Σ₁ + Σ₂)W   (k = const)

(μ₁ − μ₂) = k′(Σ₁ + Σ₂)W   (k′ = const, since (μ₁ − μ₂)(μ₁ − μ₂)ᵀW has the same direction as (μ₁ − μ₂))

W* ∝ (Σ₁ + Σ₂)⁻¹(μ₁ − μ₂)


**Fisher’s Linear Discriminant**

The W ∈ ℜⁿ that maximizes Fisher’s discriminant ratio is

W* = (Σ₁ + Σ₂)⁻¹(μ₁ − μ₂)

• Unique solution
• Easy to compute
• Has a probabilistic interpretation
• Can be updated incrementally as new data become available
• Naturally extends to K-class problems
• Can be generalized (using the kernel trick) to handle non-linearly separable class boundaries
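Because the solution is a closed form, it fits in a short, dependency-free sketch for the two-dimensional case (helper names are ours; the Σᵢ are computed as per-class covariance-style scatter matrices, which changes W* only by a scale factor):

```python
# Fisher's discriminant direction W* = (Sigma1 + Sigma2)^{-1} (mu1 - mu2)
# for 2-D data, using plain lists to avoid external dependencies.
def mean(xs):
    n = len(xs)
    return [sum(x[j] for x in xs) / n for j in range(len(xs[0]))]

def scatter(xs, mu):
    """Covariance-style scatter matrix: sum of (x - mu)(x - mu)^T / n."""
    n = len(xs)
    s = [[0.0, 0.0], [0.0, 0.0]]
    for x in xs:
        d = [x[0] - mu[0], x[1] - mu[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j] / n
    return s

def fisher_direction(class1, class2):
    mu1, mu2 = mean(class1), mean(class2)
    s = [[a + b for a, b in zip(r1, r2)]
         for r1, r2 in zip(scatter(class1, mu1), scatter(class2, mu2))]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    d = [mu1[0] - mu2[0], mu1[1] - mu2[1]]
    return [inv[0][0] * d[0] + inv[0][1] * d[1],
            inv[1][0] * d[0] + inv[1][1] * d[1]]
```

In the isotropic case the returned direction is proportional to μ₁ − μ₂, matching the observation on a later slide.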


Project data based on Fisher discriminant


**Fisher’s Linear Discriminant**

• Can be shown to maximize between-class separation
• If the samples in each class have a Gaussian distribution, then classification using the Fisher discriminant can be shown to yield a minimum-error classifier
• If the within-class variance is isotropic, then Σ₁ and Σ₂ are proportional to the identity matrix I, and the W corresponding to the Fisher discriminant is proportional to the difference between the class means (μ₁ − μ₂)
• Can be generalized to K classes


**Least Squares for Classification**

P(cₖ | X) = P(X | cₖ)P(cₖ) / Σ_{l=1}^{M} P(X | c_l)P(c_l)

tₖ(Xₚ) = t_kp = 1 if Xₚ ∈ cₖ;  tₖ(Xₚ) = t_kp = 0 if Xₚ ∉ cₖ   (kth target output)

gₖ(Xₚ; W) = kth output for input Xₚ

E_S(W) = Σ_{p=1}^{P} (gₖ(Xₚ; W) − t_kp)²
  = Σ_{Xₚ ∈ cₖ} (gₖ(Xₚ; W) − 1)² + Σ_{Xₚ ∉ cₖ} (gₖ(Xₚ; W) − 0)²

**Least Squares for Classification**

lim_{|S|→∞} (1/|S|) E_S(W)
 = P(cₖ) ∫ (gₖ(X; W) − 1)² P(X | cₖ) dX + P(c_{i≠k}) ∫ gₖ(X; W)² P(X | c_{i≠k}) dX
 = ∫ gₖ(X; W)² P(X) dX − 2 ∫ gₖ(X; W) P(X, cₖ) dX + ∫ P(X, cₖ) dX
 = ∫ (gₖ(X; W) − P(cₖ | X))² P(X) dX + ∫ P(cₖ | X) P(c_{i≠k} | X) P(X) dX

(the second integral in the last line is independent of W)

• Because the least-squares criterion minimizes this quantity with respect to W, we have gₖ(X; W) ≈ P(cₖ | X), assuming that the functions gₖ(X; W) are expressive enough to represent P(cₖ | X)
• Exercise: Show that the Fisher discriminant is a special case of Least Squares classification


**Generative vs. Discriminative Models**

• Bayesian decision theory revisited
• Generative models – Naïve Bayes
• Discriminative models – Perceptron, Fisher discriminant, Support vector machines
• Relating generative and discriminative models
• Tradeoffs between generative and discriminative models
• Generalizations and extensions


**Alternative realization of the Bayesian recipe**

• Chef 1 – Generative (Informative) model
  – Note that P(c₁|X) = P(X|c₁)P(c₁) / P(X)
  – Model P(X|c₁), P(X|c₂), P(c₁), and P(c₂)
  – Using Bayes rule, choose c₁ if P(X|c₁)P(c₁) > P(X|c₂)P(c₂); otherwise choose c₂

• Chef 2 – Discriminative model – e.g. conditional training
  – Model P(c₁|X), P(c₂|X), or the ratio P(c₁|X)/P(c₂|X) directly
  – Choose c₁ if P(c₁|X)/P(c₂|X) > 1; otherwise choose c₂


**Generative vs. Discriminative Classifiers**

• Generative classifiers
  – Assume some functional form for P(X|C), P(C)
  – Estimate the parameters of P(X|C), P(C) directly from training data
  – Use Bayes rule to calculate P(C|X=x)
• Discriminative classifiers – conditional version
  – Assume some functional form for P(C|X)
  – Estimate the parameters of P(C|X) directly from training data
• Discriminative classifiers – maximum margin version
  – Assume some functional form f(W) for the discriminant
  – Find the W that maximizes the margin of separation between classes (e.g. SVM)


Generative vs. Discriminative Models


**Which chef cooks a better Bayesian recipe?**

• In theory, generative and conditional models produce identical results in the limit
  – The classification produced by the generative model is the same as that produced by the discriminative model
  – That is, given unlimited data, and assuming that both approaches select the correct form for the relevant probability distribution or the model for the discriminant function, they will produce identical results
  – If the assumed form of the probability distribution is incorrect, then it is possible that the generative model might have a higher classification error than the discriminative model
• How about in practice?


**Which chef cooks a better Bayesian recipe?**

• In practice:
  – The error of the classifier that uses the discriminative model can be lower than that of the classifier that uses the generative model
  – Naïve Bayes is a generative model
  – A perceptron is a discriminative model, and so is the SVM
  – An SVM can outperform naïve Bayes on classification

• If the goal is classification, it might be useful to consider discriminative models that directly learn the classifier without solving the harder intermediate problem of modeling the joint probability distribution of inputs and classes (Vapnik)


**From generative to discriminative models**

• Assume the classes are binary: y ∈ {0, 1}
• Suppose we model the class by a binomial distribution with parameter q:

p(y | q) = q^y (1 − q)^(1−y)

• Assume each component x_j of the input X has a Gaussian distribution with parameters θ_j, and that the components are independent given the class:

p(x, y | Θ) = p(y | q) ∏_{j=1}^{n} p(x_j | y, θ_j)

where Θ = (q, θ₁, …, θₙ)


From generative to discriminative models

p(x_j | y = 0, θ_j) = (1 / (2πσ_j²)^(1/2)) exp{ −(x_j − μ_0j)² / (2σ_j²) }

p(x_j | y = 1, θ_j) = (1 / (2πσ_j²)^(1/2)) exp{ −(x_j − μ_1j)² / (2σ_j²) }

where θ_j = (μ_0j, μ_1j, σ_j)

(Note: we have assumed that ∀j, σ_0j = σ_1j = σ_j)


**From generative to discriminative models**

• The calculation of the posterior probability p(y = 1 | x, Θ) is simplified if we use matrix notation:

p(x | y = 1, Θ) = ∏_{j=1}^{n} (1 / ((2π)^(1/2) σ_j)) exp{ −(1/2) ((x_j − μ_1j)/σ_j)² }
  = (1 / ((2π)^(n/2) |Σ|^(1/2))) exp{ −(1/2) (x − μ₁)ᵀ Σ⁻¹ (x − μ₁) }

where μ₁ = (μ_11, …, μ_1n)ᵀ and Σ = diag(σ₁², …, σₙ²)


**From generative to discriminative models**

p(y = 1 | x, Θ) = p(x | y = 1, Θ) p(y = 1 | q) / [ p(x | y = 1, Θ) p(y = 1 | q) + p(x | y = 0, Θ) p(y = 0 | q) ]

 = q exp{−½(x − μ₁)ᵀΣ⁻¹(x − μ₁)} / [ q exp{−½(x − μ₁)ᵀΣ⁻¹(x − μ₁)} + (1 − q) exp{−½(x − μ₀)ᵀΣ⁻¹(x − μ₀)} ]

 = 1 / ( 1 + exp{ −log(q/(1 − q)) + ½(x − μ₁)ᵀΣ⁻¹(x − μ₁) − ½(x − μ₀)ᵀΣ⁻¹(x − μ₀) } )

 = 1 / ( 1 + exp{ −(μ₁ − μ₀)ᵀΣ⁻¹x + ½(μ₁ − μ₀)ᵀΣ⁻¹(μ₁ + μ₀) − log(q/(1 − q)) } )

 = 1 / (1 + exp(−βᵀx − γ))

with βᵀ = (μ₁ − μ₀)ᵀΣ⁻¹ and γ = −½(μ₁ − μ₀)ᵀΣ⁻¹(μ₁ + μ₀) + log(q/(1 − q)), where we have used AᵀDA − BᵀDB = (A + B)ᵀD(A − B) for a symmetric matrix D.


**From generative to discriminative models
**

p(y = 1 | x, Θ) = 1 / (1 + exp(−β^T x − γ))

• The posterior probability that Y = 1 takes the form

φ(z) = 1 / (1 + e^{−z}),  where z = β^T x + γ is an affine function of x
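As a quick numeric sanity check of this result, the sketch below (with made-up parameter values for μ_0, μ_1, a diagonal Σ, and the prior q — none of these numbers come from the slides) computes β and γ as derived above and verifies that the logistic form agrees with the direct Bayes-rule posterior:

```python
import numpy as np

# Illustrative parameters (assumed, not from the slides)
mu0 = np.array([0.0, 0.0])
mu1 = np.array([2.0, 1.0])
sigma2 = np.array([1.0, 4.0])          # per-component variances (diagonal Sigma)
q = 0.3                                 # prior p(y = 1)

Sigma_inv = np.diag(1.0 / sigma2)
beta = Sigma_inv @ (mu1 - mu0)
gamma = -0.5 * (mu1 - mu0) @ Sigma_inv @ (mu1 + mu0) + np.log(q / (1 - q))

def posterior_logistic(x):
    # logistic function of the affine function beta^T x + gamma
    return 1.0 / (1.0 + np.exp(-(beta @ x + gamma)))

def posterior_bayes(x):
    # direct Bayes rule; the Gaussian normalizer is shared by both
    # classes (equal covariance) and cancels, so it is omitted
    def gauss(x, mu):
        return np.exp(-0.5 * (x - mu) @ Sigma_inv @ (x - mu))
    num = q * gauss(x, mu1)
    return num / (num + (1 - q) * gauss(x, mu0))

x = np.array([1.5, -0.5])
assert np.isclose(posterior_logistic(x), posterior_bayes(x))
```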


Sigmoid or Logistic Function


**Implications of the logistic posterior
**

• The posterior probability of Y is a logistic function of an affine function of x
• Contours of equal posterior probability are lines in the input space
• β^T x is proportional to the projection of x on β, and this projection is equal for all vectors x that lie along a line that is orthogonal to β
• Special case – variances of the Gaussians = 1
– The contours of equal posterior probability are lines that are orthogonal to the difference vector between the means of the two classes
• Equal posterior for the two classes when z = 0


Geometric interpretation (diagonal Σ)

Contour plot


Geometric interpretation

p(y = 1 | x, Θ) = p(x | y = 1, Θ) p(y = 1 | q) / [ p(x | y = 1, Θ) p(y = 1 | q) + p(x | y = 0, Θ) p(y = 0 | q) ]
= 1 / ( 1 + exp{ −(μ_1 − μ_0)^T Σ^{−1} x + (1/2)(μ_1 − μ_0)^T Σ^{−1}(μ_1 + μ_0) − log(q/(1 − q)) } )
= 1 / (1 + exp(−β^T x − γ)) = 1 / (1 + e^{−z})

when q = 1 − q,  z = (μ_1 − μ_0)^T Σ^{−1} ( x − (μ_1 + μ_0)/2 )

• In this case, the posterior probabilities for the two classes are equal when x is equidistant from the two means


Geometric interpretation

• If the prior probabilities of the classes are such that q > 0.5, the effect is to shift the logistic function to the left, resulting in a larger value for the posterior probability for Y = 1 at any given point in the input space
• q < 0.5 results in a shift of the logistic function to the right, resulting in a smaller value for the posterior probability for Y = 1 (or larger value for the posterior probability for Y = 0)


Geometric interpretation (general Σ)

Contour plot

Now the equi-probability contours are still lines in the input space although the lines are no longer orthogonal to the difference in means of the two classes


**Generalization to multiple classes – Softmax function
**

• Y is a multinomial variable which takes on one of K values

q_k = p(y = k | q) = p(y^k = 1 | q),  where (y = k) ≡ (y^k = 1) and q = (q_1, q_2, …, q_K)

• As before, X is a multivariate Gaussian

**p(X | y^k = 1, Θ) = (1 / ((2π)^{n/2} |Σ|^{1/2})) exp{ −(1/2)(X − μ_k)^T Σ^{−1}(X − μ_k) }
where μ_k = (μ_{k1}, …, μ_{kn}); and ∀k, Σ_k = Σ
**

(the covariance matrix is assumed to be the same for each class)


**Generalization to multiple classes – Softmax function
**

• Posterior probability for class k is obtained via Bayes rule

p(y^k = 1 | X, Θ) = p(X | y^k = 1, Θ) p(y^k = 1 | q) / ∑_{l=1}^{K} p(X | y^l = 1, Θ) p(y^l = 1 | q)

= q_k exp{ −(1/2)(X − μ_k)^T Σ^{−1}(X − μ_k) } / ∑_{l=1}^{K} q_l exp{ −(1/2)(X − μ_l)^T Σ^{−1}(X − μ_l) }

= exp{ μ_k^T Σ^{−1} X − (1/2) μ_k^T Σ^{−1} μ_k + log q_k } / ∑_{l=1}^{K} exp{ μ_l^T Σ^{−1} X − (1/2) μ_l^T Σ^{−1} μ_l + log q_l }


**Generalization to multiple classes – Softmax function
**

• We have shown that

p(y^k = 1 | X, Θ) = exp{ μ_k^T Σ^{−1} X − (1/2) μ_k^T Σ^{−1} μ_k + log q_k } / ∑_{l=1}^{K} exp{ μ_l^T Σ^{−1} X − (1/2) μ_l^T Σ^{−1} μ_l + log q_l }

**• Defining parameter vectors and augmenting the input vector X by adding a constant input of 1 we have
**

β_k = [ −(1/2) μ_k^T Σ^{−1} μ_k + log q_k ;  Σ^{−1} μ_k ]

p(y^k = 1 | X, Θ) = e^{β_k^T X} / ∑_{l=1}^{K} e^{β_l^T X} = e^{⟨β_k, X⟩} / ∑_{l=1}^{K} e^{⟨β_l, X⟩}
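The K-class construction can be checked numerically. The sketch below (with randomly generated, purely illustrative means, variances, and priors) builds the augmented parameter vectors β_k and confirms that the softmax over β_k^T [1, x] equals the posterior obtained directly from Bayes rule:

```python
import numpy as np

# Illustrative parameters (assumed, not from the slides)
rng = np.random.default_rng(0)
K, n = 3, 2
mus = rng.normal(size=(K, n))           # class means mu_k
sigma2 = np.array([1.0, 2.0])           # shared diagonal covariance
q = np.array([0.2, 0.5, 0.3])           # class priors q_k
Sinv = np.diag(1.0 / sigma2)

# beta_k = [ -1/2 mu_k^T Sigma^-1 mu_k + log q_k ; Sigma^-1 mu_k ]
betas = np.array([np.concatenate(([-0.5 * m @ Sinv @ m + np.log(qk)], Sinv @ m))
                  for m, qk in zip(mus, q)])

x = np.array([0.7, -1.2])
x_aug = np.concatenate(([1.0], x))      # augment X with a constant input of 1
scores = betas @ x_aug
softmax = np.exp(scores) / np.exp(scores).sum()

# direct Bayes posterior (shared Gaussian normalizer cancels)
quad = np.array([-0.5 * (x - m) @ Sinv @ (x - m) for m in mus])
bayes = q * np.exp(quad)
bayes /= bayes.sum()
assert np.allclose(softmax, bayes)
```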


**Generalization to multiple classes – Softmax function
**

p(y^k = 1 | X, Θ) = e^{β_k^T X} / ∑_{l=1}^{K} e^{β_l^T X} = e^{⟨β_k, X⟩} / ∑_{l=1}^{K} e^{⟨β_l, X⟩}

**• Corresponds to the decision rule
**

h(X) = arg max_k p(y^k = 1 | X, Θ) = arg max_k e^{⟨β_k, X⟩} = arg max_k ⟨β_k, X⟩

**• Consider the ratio of posterior prob. for classes k and j ≠ k
**

p(y^k = 1 | X, Θ) / p(y^j = 1 | X, Θ) = ( e^{⟨β_k, X⟩} / ∑_{l=1}^{K} e^{⟨β_l, X⟩} ) · ( ∑_{l=1}^{K} e^{⟨β_l, X⟩} / e^{⟨β_j, X⟩} ) = e^{⟨β_k, X⟩} / e^{⟨β_j, X⟩} = e^{⟨(β_k − β_j), X⟩}


Equi-probability contours of the softmax function

( β 3 − β1 ) T X = 0

( β1 − β 2 ) T X = 0

( β 2 − β 3 )T X = 0


**Naïve Bayes classifier with discrete attributes and K classes
**

q_k = prior probability of class k

**η_kji = probability that x_j (the jth component of X) takes the ith value in its domain when X belongs to class k
**

p(X, Y | Θ) = p(Y | q) ∏_{j=1}^{n} p(x_j | Y, θ_j)

q_k = p(y^k = 1 | q)
η_kji = p(x_j^i = 1 | y^k = 1, η)

p(y^k = 1 | X, η) = q_k ∏_j ∏_i (η_kji)^{x_j^i} / ∑_l q_l ∏_j ∏_i (η_lji)^{x_j^i}


**Naïve Bayes classifier with discrete attributes and K classes
**

q_k = p(y^k = 1 | q);  η_kji = p(x_j^i = 1 | y^k = 1, η)

p(y^k = 1 | X, η) = q_k ∏_j ∏_i (η_kji)^{x_j^i} / ∑_l q_l ∏_j ∏_i (η_lji)^{x_j^i}

= exp{ log q_k + ∑_{j=1}^{n} ∑_{i=1}^{N_j} x_j^i log η_kji } / ∑_{l=1}^{K} exp{ log q_l + ∑_{j=1}^{n} ∑_{i=1}^{N_j} x_j^i log η_lji }

= e^{β_k^T X} / ∑_{l=1}^{K} e^{β_l^T X} = e^{⟨β_k, X⟩} / ∑_{l=1}^{K} e^{⟨β_l, X⟩}
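The same identity — naïve Bayes posterior equals a softmax of a linear function of the one-hot encoded input — can be verified with a tiny made-up model (two classes, two attributes with three values each; all probabilities below are invented for illustration):

```python
import numpy as np

q = np.array([0.4, 0.6])                      # class priors q_k (assumed values)
# eta[k, j, i]: probability that attribute j takes value i in class k
eta = np.array([[[0.2, 0.5, 0.3], [0.6, 0.3, 0.1]],
                [[0.1, 0.1, 0.8], [0.3, 0.3, 0.4]]])

# one-hot input: attribute 0 takes value 2, attribute 1 takes value 0
x = np.array([[0, 0, 1], [1, 0, 0]])

# product form of the posterior
prod = q * np.array([np.prod(eta[k] ** x) for k in range(2)])
posterior_prod = prod / prod.sum()

# softmax-of-linear form: log q_k + sum_j sum_i x_j^i log eta_kji
scores = np.log(q) + np.array([(x * np.log(eta[k])).sum() for k in range(2)])
posterior_soft = np.exp(scores) / np.exp(scores).sum()
assert np.allclose(posterior_prod, posterior_soft)
```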


**From generative to discriminative models
**

• A curious fact about all of the generative models we have considered so far:
– The posterior probability of the class can be expressed as a logistic function in the case of a binary classifier, and as a softmax function in the case of a K-class classifier
• For multinomial and Gaussian class conditional densities (in the case of the latter, with equal but otherwise arbitrary covariance matrices):
– The contours of equal posterior probabilities of classes are hyperplanes in the input (feature) space
• The result is a simple linear classifier analogous to the perceptron (for binary classification) or winner-take-all network (for K-ary classification)
• These results hold for a more general class of distributions


**The exponential family of distributions
**

• The exponential family is specified by

p(X | η) = h(X) e^{ η^T G(X) − A(η) }

where η is a parameter vector and A(η), h(X), and G(X) are appropriately chosen functions
• Gaussian, binomial, and multinomial (and many other) distributions belong to the exponential family


The Bernoulli distribution belongs to the exponential family

p(X | η) = h(X) e^{ η^T G(X) − A(η) }

• The Bernoulli distribution with success rate q is given by

**p(x | q) = q^x (1 − q)^{1−x} = exp{ log(q/(1 − q)) x + log(1 − q) }
**

• We can see that the Bernoulli distribution belongs to the exponential family by choosing

η = log(q/(1 − q));  G(x) = x;  h(x) = 1
A(η) = −log(1 − q) = log(1 + e^η)
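The choice of η, G, h, and A above can be confirmed numerically — for any success rate q (the value 0.35 below is arbitrary), the exponential-family form reproduces the Bernoulli mass function at both x = 0 and x = 1:

```python
import math

q = 0.35                                # arbitrary success rate for illustration
eta = math.log(q / (1 - q))             # natural parameter
A = math.log(1 + math.exp(eta))         # log-partition function A(eta) = -log(1-q)

for x in (0, 1):
    direct = (q ** x) * ((1 - q) ** (1 - x))      # q^x (1-q)^(1-x)
    exp_family = math.exp(eta * x - A)            # h(x) exp(eta*G(x) - A(eta)), h=1, G(x)=x
    assert math.isclose(direct, exp_family)
```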


The Gaussian distribution belongs to the exponential family

p(X | η) = h(X) e^{ η^T G(X) − A(η) }

• The univariate Gaussian distribution can be written as

p(x | μ, σ²) = (1 / ((2π)^{1/2} σ)) exp{ −(x − μ)² / (2σ²) }
            = (1 / (2π)^{1/2}) exp{ (μ/σ²) x − (1/(2σ²)) x² − μ²/(2σ²) − ln σ }

• We see that the Gaussian distribution belongs to the exponential family by choosing

η = [ μ/σ² , −1/(2σ²) ]^T;  A(η) = μ²/(2σ²) + ln σ
G(x) = [ x , x² ]^T;  h(x) = 1/(2π)^{1/2}


**The exponential family of distributions
**

• The exponential family, which is given by

p(X | η) = h(X) e^{ η^T G(X) − A(η) }

where η is a parameter vector and A(η), h(X), and G(X) are appropriately chosen functions, can be shown to include several additional distributions such as the multinomial, the Poisson, the Gamma, and the Dirichlet, among others


**From generative to discriminative models
**

• In the case of the generative models we have seen – The posterior probability of class can be expressed in the form of a logistic function in the case of a binary classifier and a softmax function in the case of a K-class classifier

– The contours of equal posterior probabilities of classes are

hyperplanes in the input (feature) space yielding a linear classifier (for binary classification) or winner-take-all network (for K-ary classification) • We just showed that the probability distributions underlying the generative models considered belong to the exponential family • What can we say about the classifiers when the underlying generative models are distributions from the exponential family?


**Classification problem for generic class conditional density from the exponential family
**

p(X | η) = h(X) e^{ η^T G(X) − A(η) }

• Consider a binary classification task with the densities for class 0 and class 1 parameterized by η_0 and η_1. Further assume G(x) is a linear function of x (before augmenting x with a 1)

p(y = 1 | x, η) = p(x | y = 1, η) p(y = 1 | q) / [ p(x | y = 1, η) p(y = 1 | q) + p(x | y = 0, η) p(y = 0 | q) ]

= exp{ η_1^T G(x) − A(η_1) } h(x) q_1 / [ exp{ η_1^T G(x) − A(η_1) } h(x) q_1 + exp{ η_0^T G(x) − A(η_0) } h(x) q_0 ]

= 1 / ( 1 + exp{ (η_0 − η_1)^T G(x) − A(η_0) + A(η_1) + log(q_0/q_1) } )

• Note that this is a logistic function of a linear function of x


**Classification problem for generic class conditional density from the exponential family
**

p(X | η) = h(X) e^{ η^T G(X) − A(η) }

• Consider the K-ary classification task; suppose G(x) is a linear function of x

p(y^k = 1 | x, η) = exp{ η_k^T G(x) − A(η_k) } q_k / ∑_{l=1}^{K} exp{ η_l^T G(x) − A(η_l) } q_l

= exp{ η_k^T G(x) − A(η_k) + log q_k } / ∑_{l=1}^{K} exp{ η_l^T G(x) − A(η_l) + log q_l }

which is a softmax function of a linear function of x !!


Summary

• A variety of class conditional densities all yield the same logistic-linear or softmax-linear (with respect to parameters) form for the posterior probability
• In practice, choosing a class conditional density can be difficult – especially in high dimensional spaces – e.g. for a multivariate Gaussian the covariance matrix grows quadratically in the number of dimensions
• The invariance of the functional form of the posterior probability with respect to the choice of the distribution is good news!
• It is not necessary to specify the class conditional density at all if we can work directly with the posterior – which brings us to discriminative models!


**Maximum margin classifiers
**

• Discriminative classifiers that maximize the margin of separation • Support Vector Machines – Feature spaces – Kernel machines – VC theory and generalization bounds – Maximum margin classifiers • Comparisons with conditionally trained discriminative classifiers


Perceptrons revisited

• Perceptrons – Can only compute threshold functions – Can only represent linear decision surfaces – Can only learn to classify linearly separable training data • How can we deal with non linearly separable data? – Map data into a typically higher dimensional feature space where the classes become separable • Two problems must be solved – Computational problem of working with high dimensional feature space – Overfitting problem in high dimensional feature spaces


**Maximum margin model
**

• We cannot outperform the Bayes optimal classifier when
– The generative model assumed is correct
– The data set is large enough to ensure reliable estimation of the parameters of the models
• But discriminative models may be better than generative models when
– The correct generative model is seldom known
– The data set is often simply not large enough
• Maximum margin classifiers are a kind of discriminative classifier designed to circumvent the overfitting problem


**Extending Linear Classifiers: Learning in feature spaces
**

• Map data into a feature space where they are linearly separable

x → ϕ(x)

[Figure: points of the two classes (x and o) in the input space X are mapped by ϕ into the feature space Φ, where they become linearly separable]


**Linear Separability in Feature Spaces
**

• The original input space can always be mapped to some higherdimensional feature space where the training data become separable:


**Learning in the Feature Space
**

• High dimensional feature spaces

X = (x_1, x_2, …, x_n) → ϕ(X) = (ϕ_1(X), ϕ_2(X), …, ϕ_d(X))

where typically d >> n, solves the problem of expressing complex functions
• But this introduces
– a computational problem (working with very large vectors)
– a generalization problem (curse of dimensionality)
• SVMs offer an elegant solution to both problems


**The Perceptron Algorithm (primal form)
**

initialize W_0 ← 0, b_0 ← 0, k ← 0, η ∈ ℜ^+
repeat
    error ← false
    for i = 1..l
        if y_i (⟨W_k, X_i⟩ + b_k) ≤ 0 then
            W_{k+1} ← W_k + η y_i X_i
            b_{k+1} ← b_k + η y_i
            k ← k + 1
            error ← true
        endif
    endfor
until (error == false)
return k, (W_k, b_k) where k is the number of mistakes
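The primal perceptron can be sketched directly in Python. The data set below (AND-like labels) is an invented, linearly separable example; on non-separable data the loop simply stops after max_epochs:

```python
def perceptron_primal(X, y, eta=1.0, max_epochs=100):
    """Primal perceptron: returns (k, w, b) where k is the number of mistakes.
    Assumes labels y_i in {-1, +1}."""
    n = len(X[0])
    w = [0.0] * n
    b = 0.0
    k = 0
    for _ in range(max_epochs):
        error = False
        for xi, yi in zip(X, y):
            # mistake when y_i (<w, x_i> + b) <= 0
            if yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) <= 0:
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
                b += eta * yi
                k += 1
                error = True
        if not error:
            break
    return k, w, b

# Tiny linearly separable example (AND-like labels)
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
y = [-1, -1, -1, 1]
k, w, b = perceptron_primal(X, y)
assert all(yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) > 0 for xi, yi in zip(X, y))
```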


**The Perceptron Algorithm Revisited
**

• The perceptron works by adding misclassified positive or subtracting misclassified negative examples to an arbitrary weight vector, which (without loss of generality) we assumed to be the zero vector • So the final weight vector is a linear combination of training points

w = ∑_{i=1}^{l} α_i y_i x_i

where, since the sign of the coefficient of x_i is given by the label y_i, the α_i are positive values, proportional to the number of times misclassification of x_i has caused the weight to be updated; α_i is called the embedding strength of the pattern x_i


Dual Representation

• The decision function can be rewritten as:

h(X) = sgn(⟨W, X⟩ + b) = sgn( ⟨ ∑_{j=1}^{l} α_j y_j X_j , X ⟩ + b ) = sgn( ∑_{j=1}^{l} α_j y_j ⟨X_j, X⟩ + b )

• On training example (X_i, y_i), the update rule is:

if y_i ( ∑_{j=1}^{l} α_j y_j ⟨X_j, X_i⟩ + b ) ≤ 0, then α_i ← α_i + η

• WLOG, we can take η = 1
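A sketch of the dual-form perceptron (with η = 1, on the same invented AND-like data as before) shows that the learned embedding strengths α_i recover a primal weight vector w = Σ α_i y_i x_i that separates the data:

```python
def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

def perceptron_dual(X, y, max_epochs=100):
    """Dual perceptron with eta = 1: returns (alpha, b).
    The primal weights are w = sum_i alpha_i y_i x_i."""
    l = len(X)
    alpha = [0.0] * l
    b = 0.0
    for _ in range(max_epochs):
        error = False
        for i in range(l):
            s = sum(alpha[j] * y[j] * dot(X[j], X[i]) for j in range(l)) + b
            if y[i] * s <= 0:
                alpha[i] += 1.0        # embedding strength update
                b += y[i]
                error = True
        if not error:
            break
    return alpha, b

X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
y = [-1, -1, -1, 1]
alpha, b = perceptron_dual(X, y)
w = [sum(alpha[i] * y[i] * X[i][d] for i in range(len(X))) for d in range(2)]
assert all(yi * (dot(w, xi) + b) > 0 for xi, yi in zip(X, y))
```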


**Implication of Dual Representation
**

• When Linear Learning Machines are represented in the dual form

h(X_i) = sgn(⟨W, X_i⟩ + b) = sgn( ∑_{j=1}^{l} α_j y_j ⟨X_j, X_i⟩ + b )

• Data appear only inside dot products (in decision function and in training algorithm) • The matrix

G = ( ⟨X_i, X_j⟩ )_{i,j=1}^{l}

which is the matrix of pairwise dot products between training samples is called the Gram matrix


**Implicit Mapping to Feature Space
**

• Kernel machines

– Solve the computational problem of working with many dimensions

– Can make it possible to use infinite dimensions efficiently – Offer other advantages, both practical and conceptual


**Kernel-Induced Feature Space
**

f(X_i) = ⟨W, ϕ(X_i)⟩ + b
h(X_i) = sgn(f(X_i))

where ϕ : X → Φ is a non-linear map from input space to feature space

• In the dual representation, the data points only appear inside dot products

h(X_i) = sgn(⟨W, ϕ(X_i)⟩ + b) = sgn( ∑_{j=1}^{l} α_j y_j ⟨ϕ(X_j), ϕ(X_i)⟩ + b )


Kernels

• Kernel function returns the value of the dot product between the images of the two arguments

K ( x1 , x2 ) = ϕ ( x1 ), ϕ ( x2 )

• When using kernels, the dimensionality of the feature space Φ is not necessarily important because of the special properties of kernel functions

• Given a function K, it is possible to verify that it is a kernel


Kernel Machines

• We can use perceptron learning algorithm in the feature space by taking its dual representation and replacing dot products with kernels

⟨x_1, x_2⟩ ← K(x_1, x_2) = ⟨ϕ(x_1), ϕ(x_2)⟩
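The substitution above turns the dual perceptron into a kernel perceptron. In the sketch below (an illustrative construction, not from the slides), a degree-2 polynomial kernel lets the perceptron learn XOR labels — a data set no linear perceptron can separate in the input space:

```python
def kernel_perceptron(X, y, K, max_epochs=100):
    """Dual perceptron with dot products replaced by a kernel K; eta = 1."""
    l = len(X)
    alpha, b = [0.0] * l, 0.0
    for _ in range(max_epochs):
        error = False
        for i in range(l):
            s = sum(alpha[j] * y[j] * K(X[j], X[i]) for j in range(l)) + b
            if y[i] * s <= 0:
                alpha[i] += 1.0
                b += y[i]
                error = True
        if not error:
            break
    return alpha, b

def poly_kernel(u, v, d=2):
    # (1 + <u, v>)^d : implicit quadratic feature space
    return (1 + sum(a * c for a, c in zip(u, v))) ** d

X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
y = [-1, 1, 1, -1]                      # XOR labels: not linearly separable
alpha, b = kernel_perceptron(X, y, poly_kernel)

def predict(x):
    s = sum(alpha[j] * y[j] * poly_kernel(X[j], x) for j in range(len(X))) + b
    return 1 if s > 0 else -1

assert [predict(x) for x in X] == y
```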


**The Kernel Matrix
**

• The kernel matrix is the Gram matrix in the feature space Φ (the matrix of pairwise dot products between feature vectors corresponding to the training samples):

K = | K(1,1)  K(1,2)  K(1,3)  …  K(1,l) |
    | K(2,1)  K(2,2)  K(2,3)  …  K(2,l) |
    |   …       …       …     …    …    |
    | K(l,1)  K(l,2)  K(l,3)  …  K(l,l) |


**Properties of Kernel Matrices
**

• It is easy to show that the Gram matrix (and hence the kernel matrix) is
– a square matrix
– symmetric (K = K^T)
– positive semi-definite: w^T K w ≥ 0 for all vectors w, i.e. all eigenvalues of K are non-negative
– Recall that the eigenvalues of a square matrix A are given by the values of λ that satisfy |A − λI| = 0

• Any symmetric positive semi-definite matrix can be regarded as a kernel matrix, that is, as an inner product matrix in some feature space Φ

K ( X i , X j ) = ϕ ( X i ), ϕ ( X j )
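These properties are easy to check numerically: any Gram matrix built from an explicit feature map is symmetric with non-negative eigenvalues (the random feature vectors below are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
Phi = rng.normal(size=(6, 4))          # 6 samples, 4 feature dimensions
G = Phi @ Phi.T                        # G_ij = <phi(x_i), phi(x_j)>

assert np.allclose(G, G.T)             # symmetric
eigvals = np.linalg.eigvalsh(G)
assert np.all(eigvals >= -1e-10)       # positive semi-definite (up to round-off)
```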


**Mercer’s Theorem: Characterization of Kernel Functions
**

• A function K : X × X → ℜ is said to be (finitely) positive semidefinite if – K is a symmetric function: K(x,y) = K(y,x)

– Matrices formed by restricting K to any finite subset of the

space X are positive semi-definite

• Every (finitely) positive semi-definite, symmetric function is a kernel: i.e. there exists a mapping ϕ such that it is possible to write

K ( X i , X j ) = ϕ ( X i ), ϕ ( X j )


**Mercer’s Theorem: Characterization of Kernel Functions
**

• Eigenvalues of the Gram matrix define an expansion of Mercer’s kernels:

K(X_i, X_j) = ⟨ϕ(X_i), ϕ(X_j)⟩ = ∑_l λ_l ϕ_l(X_i) ϕ_l(X_j)

That is, the eigenvalues act as features!

• Recall that the eigenvalues of a square matrix A correspond to solutions of |A – λI| = 0


Examples of Kernels

• Simple examples of kernels:

K(X, Z) = ⟨X, Z⟩^d

K(X, Z) = e^{ −‖X − Z‖² / (2σ²) }  (Gaussian kernel)


Example: Polynomial Kernels

x = (x_1, x_2),  z = (z_1, z_2)

⟨x, z⟩² = (x_1 z_1 + x_2 z_2)²
       = x_1² z_1² + x_2² z_2² + 2 x_1 z_1 x_2 z_2
       = ⟨ (x_1², x_2², √2 x_1 x_2) , (z_1², z_2², √2 z_1 z_2) ⟩
       = ⟨ϕ(x), ϕ(z)⟩
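The identity ⟨x, z⟩² = ⟨ϕ(x), ϕ(z)⟩ can be verified for any concrete pair of points (the two vectors below are arbitrary):

```python
import math

def phi(v):
    # explicit feature map for the degree-2 polynomial kernel in 2D
    x1, x2 = v
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

x, z = (3.0, -1.0), (0.5, 2.0)
lhs = (x[0] * z[0] + x[1] * z[1]) ** 2            # <x, z>^2 in the input space
rhs = sum(a * b for a, b in zip(phi(x), phi(z)))  # <phi(x), phi(z)> in feature space
assert math.isclose(lhs, rhs)
```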


**Making Kernels – Closure Properties
**

• The set of kernels is closed under some operations; If K1, K2 are kernels over X × X , then the following are kernels:

**K(X, Z) = K_1(X, Z) + K_2(X, Z)
K(X, Z) = a K_1(X, Z), a > 0
K(X, Z) = K_1(X, Z) K_2(X, Z)
K(X, Z) = f(X) f(Z), where f : X → ℜ
K(X, Z) = K_3(ϕ(X), ϕ(Z)), where ϕ : X → ℜ^N and K_3 is a kernel over ℜ^N × ℜ^N
K(X, Z) = X^T B Z, where B is a symmetric positive definite n × n matrix
**

• We can make complex kernels from simple ones: modularity!


Kernels

• We can define kernels over arbitrary instance spaces including
– finite dimensional vector spaces
– Boolean spaces
– Σ* where Σ is a finite alphabet
– documents, graphs, etc.

• Kernels need not always be expressed by a closed form formula • Many useful kernels can be computed by complex algorithms (e.g. diffusion kernels over graphs)


**String Kernel (p-spectrum kernel)
**

• The p-spectrum of a string is the histogram – the vector of numbers of occurrences of all possible contiguous substrings of length p
• We can define a kernel function K(s,t) over Σ* × Σ* as the inner product of the p-spectra of s and t

s = statistics
t = computation
p = 3
Common substrings: tat, ati
K(s,t) = 2
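The p-spectrum kernel is a few lines of code — count the p-grams of each string and take the inner product of the counts; the slide's example yields K = 2 because "tat" and "ati" each occur once in both strings:

```python
from collections import Counter

def p_spectrum_kernel(s, t, p):
    """Inner product of the p-gram histograms of s and t."""
    cs = Counter(s[i:i + p] for i in range(len(s) - p + 1))
    ct = Counter(t[i:i + p] for i in range(len(t) - p + 1))
    return sum(cs[g] * ct[g] for g in cs)

assert p_spectrum_kernel("statistics", "computation", 3) == 2
```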


Kernel Machines

• Kernel machines are Linear Learning Machines, that:

– Use a dual representation – Operate in a kernel induced feature space (that is a linear

function in the feature space implicitly defined by the Gram matrix corresponding to the data set)

h(X_i) = sgn(⟨W, ϕ(X_i)⟩ + b) = sgn( ∑_{j=1}^{l} α_j y_j ⟨ϕ(X_j), ϕ(X_i)⟩ + b )


**Kernels – the good, the bad, and the ugly
**

• Bad kernel – A kernel whose Gram (kernel) matrix is mostly diagonal – all data points are orthogonal to each other, and hence the machine is unable to detect hidden structure in the data

| 1  0  0  …  0 |
| 0  1  0  …  0 |
| 0  0  1  …  0 |
| …  …  …  …  … |
| 0  0  0  …  1 |


**Kernels – the good, the bad, and the ugly
**

• Good kernel – Corresponds to a Gram (kernel) matrix in which subsets of data points belonging to the same class are similar to each other, and hence the machine can detect hidden structure in the data:

| 3  2  0  0  0 |  Class 1
| 2  3  0  0  0 |
| 0  0  4  3  3 |  Class 2
| 0  0  3  4  2 |
| 0  0  3  2  4 |


**Kernels – the good, the bad, and the ugly
**

• When mapping into a space with too many irrelevant features, the kernel matrix becomes diagonal
• We need some prior knowledge of the target to choose a good kernel


**Learning in the Feature Space
**

• High dimensional feature spaces

X = (x_1, x_2, …, x_n) → ϕ(X) = (ϕ_1(X), ϕ_2(X), …, ϕ_d(X))

where typically d >> n, solves the problem of expressing complex functions

• But this introduces – computational problem (working with very large vectors) solved by the kernel trick – implicit computation of dot products in kernel induced feature spaces via dot products in the input space – generalization theory problem (curse of dimensionality)


**The Generalization Problem
**

• The curse of dimensionality – It is easy to overfit in high dimensional spaces

• The learning problem is ill posed
– There are infinitely many hyperplanes that separate the training data
– Need a principled approach to choose an optimal hyperplane


**The Generalization Problem
**

• “Capacity” of the machine – its ability to learn any training set without error – is related to the VC dimension
• Excellent memory is not an asset when it comes to learning from limited data
• “A machine with too much capacity is like a botanist with a photographic memory who, when presented with a new tree, concludes that it is not a tree because it has a different number of leaves from anything she has seen before; a machine with too little capacity is like the botanist’s lazy brother, who declares that if it’s green, it’s a tree” – C. Burges


**History of Key Developments leading to SVM
**

• 1958: Perceptron (Rosenblatt)
• 1963: Margin (Vapnik)
• 1964: Kernel Trick (Aizerman)
• 1965: Optimization formulation (Mangasarian)
• 1971: Kernels (Wahba)
• 1992-1994: SVM (Vapnik)
• 1996 – present: Rapid growth, numerous applications, extensions to other problems


**A Little Learning Theory
**

• Suppose:
– We are given l training examples (X_i, y_i)
– Train and test points are drawn randomly (i.i.d.) from some unknown probability distribution D(X, y)
• The machine learns the mapping X_i → y_i and outputs a hypothesis h(X, α, b); a particular choice of (α, b) generates a “trained machine”
• The expectation of the test error, or expected risk, is

R(α, b) = (1/2) ∫ | y − h(X, α, b) | dD(X, y)


**A Bound on the Generalization Performance
**

• The empirical risk is:

R_emp(α, b) = (1/(2l)) ∑_{i=1}^{l} | y_i − h(X_i, α, b) |

• Choose some δ such that 0 < δ < 1; with probability 1 − δ the following bound – the risk bound of h(X, α, b) on distribution D – holds (Vapnik, 1995)

**R(α, b) ≤ R_emp(α, b) + √( ( d (log(2l/d) + 1) − log(δ/4) ) / l )
**

where d ≥ 0 is called the VC dimension, a measure of the “capacity” of the machine
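The second term of the bound — the VC confidence — is directly computable once d, l, and δ are fixed. The sketch below (with made-up values of d, l, and δ) shows the qualitative behavior discussed later: the term grows with d and shrinks with l:

```python
import math

def vc_confidence(d, l, delta):
    # sqrt( (d (log(2l/d) + 1) - log(delta/4)) / l )
    return math.sqrt((d * (math.log(2 * l / d) + 1) - math.log(delta / 4)) / l)

# Larger VC dimension -> larger confidence term (looser bound)
assert vc_confidence(20, 10000, 0.05) < vc_confidence(200, 10000, 0.05)
# More training data -> smaller confidence term (tighter bound)
assert vc_confidence(20, 100000, 0.05) < vc_confidence(20, 10000, 0.05)
```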


**A Bound on the Generalization Performance
**

• The second term in the right-hand side is called VC confidence

**• Three key points about the actual risk bound: – It is independent of D(X, y)
**

– It is usually not possible to compute the left-hand side
– If we know d, we can compute the right-hand side

• The risk bound gives us a way to compare learning machines!


**The Vapnik-Chervonenkis (VC) Dimension
**

• Definition: The VC dimension of a set of functions H = {h(X, α, b)} is d if and only if there exists a set of d data points such that each of the 2^d possible labelings (dichotomies) of the d data points can be realized using some member of H, but no set of q > d points satisfies this property


The VC Dimension

• A set S of instances is said to be shattered by a hypothesis class H if and only if for every dichotomy of S there exists a hypothesis in H that is consistent with the dichotomy
• The VC dimension of H is the size of the largest subset of X shattered by H
• The VC dimension measures the capacity of a set H of hypotheses (functions)
• If for any number N it is possible to find N points X_1, …, X_N that can be separated in all the 2^N possible ways, we will say that the VC dimension of the set is infinite


**The VC Dimension Example
**

• Suppose that the data live in ℜ² and the set {h(X, α)} consists of oriented straight lines (linear discriminants)
• It is possible to find three points that can be shattered by this set of functions; it is not possible to find four
• So the VC dimension of the set of linear discriminants in ℜ² is three


**The VC Dimension of Hyperplanes
**

• Theorem 1: Consider some set of m points in ℜ^n. Choose any one of the points as origin. Then the m points can be shattered by oriented hyperplanes if and only if the position vectors of the remaining points are linearly independent.
• Corollary: The VC dimension of the set of oriented hyperplanes in ℜ^n is n+1, since we can always choose n+1 points, and then choose one of the points as origin, such that the position vectors of the remaining points are linearly independent, but can never choose n+2 such points (since no n+1 vectors in ℜ^n can be linearly independent).


The VC Dimension

• VC dimension can be infinite even when the number of parameters of the set {h(X, α)} of hypothesis functions is low

• Example: {h(X, α)} ≡ sgn(sin(αX)), X, α ∈ ℜ. For any integer l and any labels y_1, …, y_l, y_i ∈ {−1, 1}, we can find l points X_1, …, X_l and a parameter α such that those points are shattered by h(X, α). Those points are:

x_i = 10^{−i},  i = 1, …, l

and the parameter α is:

α = π ( 1 + ∑_{i=1}^{l} ((1 − y_i) 10^i) / 2 )


The VC Dimension

• Let H be a hypothesis class over an instance space X • Both H and X may be infinite • We need a way to describe the behavior of H on a finite set of points S ⊆ X

S = {X_1, X_2, …, X_m}

• For any concept class H over X, and any S ⊆ X

Π H (S ) = {h ∩ S : h ∈ H }

• Equivalently, we can write

Π H (S ) = {(h( X 1 )......h( X m )) : h ∈ H }

**• Π_H(S) is the set of all dichotomies or behaviors on S that are induced or realized by H
**


The VC Dimension

• If Π_H(S) = {0,1}^m where |S| = m, or equivalently |Π_H(S)| = 2^m, we say that S is shattered by H

• A set S of instances is said to be shattered by a hypothesis class H if and only if for every dichotomy of S, there exists a hypothesis in H that is consistent with the dichotomy


**VC Dimension of a Hypothesis Class
**

• Definition: The VC dimension V(H) of a hypothesis class H defined over an instance space X is the cardinality d of the largest subset of X that is shattered by H. If arbitrarily large finite subsets of X can be shattered by H, V(H) = ∞
• How can we show that V(H) is at least d? Find a set of cardinality at least d that is shattered by H
• How can we show that V(H) = d? Show that V(H) is at least d and that no set of cardinality (d+1) can be shattered by H


**VC Dimension of a Hypothesis Class – Example
**

• Example: Let the instance space X be the 2-dimensional Euclidean space. Let the hypothesis space H be the set of axis-parallel rectangles in the plane. V(H) = 4 (there exists a set of 4 points that can be shattered by axis-parallel rectangles, but no set of 5 points can be shattered by H)


Some Useful Properties of VC Dimension

(H_1 ⊆ H_2) ⇒ V(H_1) ≤ V(H_2)
(H = H_1 ∪ H_2) ⇒ V(H) ≤ V(H_1) + V(H_2) + 1
If H_l is formed by a union or intersection of l concepts from H, V(H_l) = O(V(H) l log l)
If V(H) = d, then Π_H(m) = max{ |Π_H(S)| : |S| = m } ≤ Φ_d(m), where Φ_d(m) = 2^m if m ≤ d and Φ_d(m) = O(m^d) if m > d
(H̄ = {X − h : h ∈ H}) ⇒ V(H̄) = V(H)
If H is a finite concept class, V(H) ≤ log₂ |H|


Risk Bound

• What are the VC dimension and the empirical risk of the nearest neighbor classifier?

• Any number of points, labeled arbitrarily, can be shattered by a nearest-neighbor classifier, thus d = ∞ and empirical risk = 0

• So the bound provides no useful information in this case

**Minimizing the Bound by Minimizing d
**

R(α) ≤ R_emp(α) + √( [d(log(2l/d) + 1) − log(δ/4)] / l )

• The VC confidence (second term) depends on d/l

• One should choose the learning machine whose set of functions has minimal d

• For large values of d/l the bound is not tight

Minimizing the Risk Bound by Minimizing d

[Figure: the risk bound plotted as a function of d/l]

**Bounds on Error of Classification
**

• Vapnik proved that the error ε of a classification function h for separable data sets is

ε = O(d/l)

where d is the VC dimension of the hypothesis class and l is the number of training examples

• The classification error
– depends on the VC dimension of the hypothesis class
– is independent of the dimensionality of the feature space

**Structural Risk Minimization
**

• Finding a learning machine with the minimum upper bound on the actual risk
– leads us to a method of choosing an optimal machine for a given task
– is the essential idea of structural risk minimization (SRM)

• Let H1 ⊂ H2 ⊂ H3 ⊂ … be a sequence of nested subsets of hypotheses whose VC dimensions satisfy d1 < d2 < d3 < …
– SRM consists of finding the subset of functions that minimizes the upper bound on the actual risk
– SRM involves training a series of classifiers, and choosing from the series the classifier for which the sum of empirical risk and VC confidence is minimal

**Margin Based Bounds on Error of Classification
**

• The error ε of a classification function h for separable data sets is

ε = O(d/l)

• Margin-based bound:

ε = O( (1/l) (L/γ)² )

where L = max_p ||X_p||, γ = min_i [ y_i f(x_i) / ||w|| ], and f(x) = ⟨w, x⟩ + b

• Important insight: the error of a classifier trained on a separable data set shrinks as its margin grows, and is independent of the dimensionality of the input space!

**Maximal Margin Classifier
**

• The bounds on the error of classification suggest the possibility of improving generalization by maximizing the margin

• Minimize the risk of overfitting by choosing the maximal margin hyperplane in feature space

• SVMs control capacity by increasing the margin, not by reducing the number of features

Margin

• Linear separation of the input space:

f(X) = ⟨W, X⟩ + b

h(X) = sign(f(X))

[Figure: separating hyperplane with normal vector W and offset b/||W||, with labeled points on either side]

**Functional and Geometric Margin
**

• The functional margin of a linear discriminant (w, b) w.r.t. a labeled pattern (x_i, y_i) ∈ ℝ^d × {−1, 1} is defined as

γ_i ≡ y_i (⟨w, x_i⟩ + b)

• If the functional margin is negative, then the pattern is incorrectly classified; if it is positive, then the classifier predicts the correct label

• The larger |γ_i|, the further away x_i is from the discriminant

• This is made more precise in the notion of the geometric margin

γ̃_i ≡ γ_i / ||w||

which measures the Euclidean distance of the point from the decision boundary

Geometric Margin

[Figure: the geometric margins γ_i and γ_j of two points X_i ∈ S+ and X_j ∈ S−]

The geometric margin of two points

Geometric Margin

Example: W = (1, 1), b = −1, X_i = (1, 1), Y_i = 1

Functional margin: γ_i = (Y_i)((1)·x1 + (1)·x2 − 1) = (1)(1 + 1 − 1) = 1

Geometric margin: γ_i / ||W|| = 1/√2
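The example above can be verified in a few lines. The sketch below computes the functional and geometric margins exactly as defined on the preceding slides:

```python
import math

def functional_margin(w, b, x, y):
    # gamma_i = y_i * (<w, x_i> + b)
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)

def geometric_margin(w, b, x, y):
    # gamma_i / ||w||: Euclidean distance from the decision boundary
    return functional_margin(w, b, x, y) / math.hypot(*w)

w, b = (1.0, 1.0), -1.0
print(functional_margin(w, b, (1.0, 1.0), 1))  # 1.0
print(geometric_margin(w, b, (1.0, 1.0), 1))   # 1/sqrt(2) ~ 0.7071
```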

**Margin of a Training Set
**

• The functional margin of a training set: γ = min_i γ_i

• The geometric margin of a training set: γ̃ = min_i γ̃_i

**Maximum Margin Separating Hyperplane**

• γ = min_i γ_i is called the (functional) margin of (W, b) w.r.t. the data set S = {(X_i, y_i)}

• The margin of a training set S is the maximum geometric margin over all hyperplanes; a hyperplane realizing this maximum is a maximal margin hyperplane

[Figure: maximal margin hyperplane]

**VC Dimension of Δ-Margin Hyperplanes**

• If data samples belong to a sphere of radius R, then the set of Δ-margin hyperplanes has VC dimension bounded by

VC(H) ≤ min(R²/Δ², n) + 1

• WLOG, we can take R = 1 by normalizing the data points of each class so that they lie within a sphere of radius 1

• For large-margin hyperplanes, the VC dimension is controlled independently of the dimensionality n of the feature space

Maximizing Margin

Minimizing ||W||

• The definition of the hyperplane (W, b) does not change if we rescale it to (σW, σb) for σ > 0

• The functional margin depends on this scaling, but the geometric margin does not

• If we fix (by rescaling) the functional margin to 1, the geometric margin equals 1/||W||

• Then we can maximize the margin by minimizing the norm ||W||

Maximizing Margin

Minimizing ||W||

• Let x+ and x− be the nearest data points on either side of the hyperplane. Fixing the functional margin to be 1, we have:

⟨w, x+⟩ + b = +1
⟨w, x−⟩ + b = −1
⟨w, (x+ − x−)⟩ = 2
⟨w/||w||, (x+ − x−)⟩ = 2/||w||

Learning as Optimization

• Minimize ⟨W, W⟩

Subject to: y_i (⟨W, X_i⟩ + b) ≥ 1

**Digression – Minimizing/Maximizing Functions**

Consider f(x), a function of a scalar variable x with domain D_x.

f(x) is convex over some sub-domain D ⊆ D_x if ∀X1, X2 ∈ D, the chord joining f(X1) and f(X2) lies above the graph of f(x).

f(x) has a local minimum at x = X_a if ∃ a neighborhood U ⊆ D_x around X_a such that ∀x ∈ U, f(x) ≥ f(X_a).

We say that lim_{x→a} f(x) = A if for any ε > 0, ∃δ > 0 such that |f(x) − A| < ε ∀x such that |x − a| < δ.

**Minimizing/Maximizing Functions**

We say that f(x) is continuous at x = a if

lim_{ε→0} f(a + ε) = lim_{ε→0} f(a − ε) = f(a)

The derivative of the function f(x) is defined as

df/dx = lim_{Δx→0} [f(x + Δx) − f(x)] / Δx

df/dx |_{x=X0} = 0 if X0 is a local maximum or a local minimum.

Minimizing/Maximizing Functions

d(u + v)/dx = du/dx + dv/dx

d(uv)/dx = u·(dv/dx) + v·(du/dx)

d(u/v)/dx = [v·(du/dx) − u·(dv/dx)] / v²

**Taylor Series Approximation of Functions**

If f(x) is differentiable, i.e., its derivatives df/dx, d²f/dx² = d/dx(df/dx), …, dⁿf/dxⁿ exist at x = X0, and f(x) is continuous in the neighborhood of x = X0, then

f(x) = f(X0) + (df/dx |_{x=X0})(x − X0) + … + (1/n!)(dⁿf/dxⁿ |_{x=X0})(x − X0)ⁿ + …

The first-order approximation is

f(x) ≈ f(X0) + (df/dx |_{x=X0})(x − X0)
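A quick numerical illustration of the first-order approximation (not part of the slides): for f(x) = eˣ around X0 = 0 the approximation is 1 + x, and the error near X0 is on the order of x²/2:

```python
import math

def taylor1(f, df, x0, x):
    """First-order Taylor approximation of f around x0."""
    return f(x0) + df(x0) * (x - x0)

# f(x) = e^x, whose derivative is also e^x; around x0 = 0: f(x) ~ 1 + x
approx = taylor1(math.exp, math.exp, 0.0, 0.1)
print(approx)                       # 1.1
print(abs(math.exp(0.1) - approx))  # small error, roughly x^2/2 = 0.005
```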

Chain Rule

Let f(X) = f(x0, x1, x2, …, xn). The partial derivative ∂f/∂x_j is obtained by treating all x_i, i ≠ j, as constants.

Chain rule: let z = φ(u1, …, um) and u_i = f_i(x0, x1, …, xn). Then for all k,

∂z/∂x_k = Σ_{i=1}^{m} (∂z/∂u_i)(∂u_i/∂x_k)

Taylor Series Approximation of Multivariate Functions

Let f(X) = f(x0, x1, x2, …, xn) be differentiable and continuous at X0 = (x00, x10, x20, …, xn0). Then

f(X) ≈ f(X0) + Σ_{i=0}^{n} (∂f/∂x_i |_{X=X0}) (x_i − x_i0)

Minimizing/Maximizing Multivariate Functions

To find X* that minimizes f(X), we change the current guess X_C in the direction of the negative gradient of f(X) evaluated at X_C:

X_C ← X_C − η (∂f/∂x0, ∂f/∂x1, …, ∂f/∂xn) |_{X=X_C}

for small (ideally infinitesimally small) η > 0 (why?)

**Minimizing/Maximizing Multivariate Functions
**

Suppose we move from Z0 to Z1. We want to ensure f(Z1) ≤ f(Z0). In the neighborhood of Z0, using the Taylor series expansion, we can write

f(Z1) = f(Z0 + ΔZ) = f(Z0) + (df/dZ |_{Z=Z0})(ΔZ) + …

Δf = f(Z1) − f(Z0) ≈ (df/dZ |_{Z=Z0})(ΔZ)

We want to make sure that Δf ≤ 0. If we choose

ΔZ = −η (df/dZ |_{Z=Z0})  →  Δf = −η (df/dZ |_{Z=Z0})² ≤ 0

Minimizing/Maximizing Functions

[Figure: surface plot of f(x1, x2) with gradient descent steps from X_C = (x1C, x2C) toward the minimum X*]

Gradient descent/ascent is guaranteed to find the minimum/maximum when the function has a single minimum/maximum
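The update rule above can be sketched in a few lines. This is an illustrative implementation on a convex quadratic with a single minimum (the function and step size are my choices, not from the slides):

```python
def grad_descent(grad, x0, eta=0.1, steps=200):
    """Minimize a function by repeatedly stepping along the negative gradient."""
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return x

# f(x1, x2) = (x1 - 2)^2 + (x2 + 1)^2 has its unique minimum at (2, -1)
grad = lambda x: [2 * (x[0] - 2), 2 * (x[1] + 1)]
x_star = grad_descent(grad, [0.0, 0.0])
print(x_star)  # ~ [2.0, -1.0]
```

With a single minimum the iterates contract toward X* geometrically; for non-convex functions the same loop can stall in a local minimum.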

Constrained Optimization

• Primal optimization problem

• Given functions f, g_i (i = 1…k) and h_j (j = 1…m) defined on a domain Ω ⊆ ℝⁿ:

minimize f(w), w ∈ Ω  { objective function
subject to g_i(w) ≤ 0, i = 1…k  { inequality constraints
h_j(w) = 0, j = 1…m  { equality constraints

• Shorthand: g(w) ≤ 0 denotes g_i(w) ≤ 0, i = 1…k; h(w) = 0 denotes h_j(w) = 0, j = 1…m

• Feasible region: F = {w ∈ Ω : g(w) ≤ 0, h(w) = 0}

Optimization Problems

• Linear program – the objective function as well as the equality and inequality constraints are linear

• Quadratic program – the objective function is quadratic, and the equality and inequality constraints are linear

• Inequality constraints g_i(w) ≤ 0 can be active, i.e. g_i(w) = 0, or inactive, i.e. g_i(w) < 0

• Inequality constraints are often transformed into equality constraints using slack variables: g_i(w) ≤ 0 ⇔ g_i(w) + ξ_i = 0 with ξ_i ≥ 0

• We will be interested primarily in convex optimization problems

**Convex Optimization Problem**

• If the function f is convex, any local minimum w* of an unconstrained optimization problem with objective function f is also a global minimum, since for any u ≠ w*, f(w*) ≤ f(u)

• A set Ω ⊂ ℝⁿ is called convex if ∀w, u ∈ Ω and for any θ ∈ (0, 1), the point (θw + (1 − θ)u) ∈ Ω

• A convex optimization problem is one in which the set Ω, the objective function, and all the constraints are convex

Lagrangian Theory

• Given an optimization problem with an objective function f(w) and equality constraints h_j(w) = 0, j = 1…m, we define the Lagrangian function as

L(w, β) = f(w) + Σ_{j=1}^{m} β_j h_j(w)

where the β_j are called Lagrange multipliers. A necessary condition for w* to be a minimum of f(w) subject to the constraints h_j(w) = 0, j = 1…m, is

∂L(w*, β*)/∂w = 0,  ∂L(w*, β*)/∂β = 0

• The condition is sufficient if L(w, β*) is a convex function of w

**Lagrangian Theory – Example**

minimize f(x, y) = x + 2y
subject to x² + y² − 4 = 0

L(x, y, λ) = x + 2y + λ(x² + y² − 4)

∂L(x, y, λ)/∂x = 1 + 2λx = 0
∂L(x, y, λ)/∂y = 2 + 2λy = 0
∂L(x, y, λ)/∂λ = x² + y² − 4 = 0

Solving the above, we have λ = ±√5/4, x = ∓2/√5, y = ∓4/√5

f is minimized when x = −2/√5, y = −4/√5
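The stationary points found via the Lagrangian can be cross-checked numerically. The sketch below sweeps the parametrization x = 2cos t, y = 2sin t of the constraint circle and keeps the point with the smallest objective value:

```python
import math

# Minimize x + 2y on the circle x^2 + y^2 = 4 by a fine parameter sweep
N = 100_000
best = min(((2 * math.cos(2 * math.pi * i / N), 2 * math.sin(2 * math.pi * i / N))
            for i in range(N)),
           key=lambda p: p[0] + 2 * p[1])
print(best)  # ~ (-2/sqrt(5), -4/sqrt(5)) ~ (-0.894, -1.789)
```

This brute-force check only works because the feasible set is one-dimensional; the Lagrangian machinery is what generalizes.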

**Lagrangian Theory – Example**

• Find the lengths u, v, w of the sides of the box that has the largest volume for a given surface area c

minimize −uvw
subject to wu + uv + vw = c/2

L = −uvw + β(wu + uv + vw − c/2)

∂L/∂w = 0 = −uv + β(u + v);  ∂L/∂u = 0 = −vw + β(v + w);  ∂L/∂v = 0 = −wu + β(u + w)

βv(w − u) = 0 and βw(u − v) = 0 ⇒ u = v = w = √(c/6)

**Lagrangian Theory – Example**

• The entropy of a probability distribution p = (p1 … pn) over a finite set {1, 2, …, n} is defined as

H(p) = −Σ_{i=1}^{n} p_i log₂ p_i

• The maximum entropy distribution can be found by minimizing −H(p) subject to the constraints

Σ_{i=1}^{n} p_i = 1,  ∀i p_i ≥ 0

L(p, β) = Σ_{i=1}^{n} p_i log₂ p_i + β(Σ_{i=1}^{n} p_i − 1)

The uniform distribution p = (1/n, …, 1/n) has the maximum entropy
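A quick sanity check of the maximum-entropy claim (an illustration, not a proof): compute H for the uniform distribution over n = 4 outcomes and compare it against many random distributions:

```python
import math, random

def entropy(p):
    # H(p) = -sum_i p_i log2 p_i, with the convention 0*log 0 = 0
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

n = 4
uniform = [1 / n] * n
print(entropy(uniform))  # log2(4) = 2.0

# No randomly drawn distribution over n outcomes beats the uniform one
random.seed(0)
for _ in range(1000):
    raw = [random.random() for _ in range(n)]
    p = [x / sum(raw) for x in raw]
    assert entropy(p) <= entropy(uniform) + 1e-12
```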

**Generalized Lagrangian Theory**

• Given an optimization problem with domain Ω ⊆ ℝⁿ:

minimize f(w), w ∈ Ω  { objective function
subject to g_i(w) ≤ 0, i = 1…k  { inequality constraints
h_j(w) = 0, j = 1…m  { equality constraints

where f is convex and the g_i and h_j are affine, we can define the generalized Lagrangian function as

L(w, α, β) = f(w) + Σ_{i=1}^{k} α_i g_i(w) + Σ_{j=1}^{m} β_j h_j(w)

• An affine function is a linear function plus a translation: F(x) is affine if F(x) = G(x) + b, where G(x) is a linear function of x and b is a constant

**Generalized Lagrangian Theory: KKT Conditions**

• Given an optimization problem with domain Ω ⊆ ℝⁿ:

minimize f(w), w ∈ Ω  { objective function
subject to g_i(w) ≤ 0, i = 1…k  { inequality constraints
h_j(w) = 0, j = 1…m  { equality constraints

where f is convex and the g_i and h_j are affine. The necessary and sufficient conditions for w* to be an optimum are the existence of α* and β* such that

∂L(w*, α*, β*)/∂w = 0,  ∂L(w*, α*, β*)/∂β = 0

α_i* g_i(w*) = 0;  g_i(w*) ≤ 0;  α_i* ≥ 0;  i = 1…k

**Generalized Lagrangian Theory – Example**

minimize f(x, y) = (x − 1)² + (y − 3)²
subject to x + y ≤ 2, x − y ≤ 0

L = (x − 1)² + (y − 3)² + λ1(x + y − 2) + λ2(x − y)

∂L/∂x = 2(x − 1) + λ1 + λ2 = 0
∂L/∂y = 2(y − 3) + λ1 − λ2 = 0
λ1(x + y − 2) = 0;  λ2(x − y) = 0
λ1 ≥ 0, λ2 ≥ 0, x + y ≤ 2, x − y ≤ 0

Case 1: λ1 = 0, λ2 = 0 ⇒ 2(x − 1) = 0, 2(y − 3) = 0 ⇒ x = 1, y = 3. This is not a feasible solution because it violates x + y ≤ 2

Case 2: λ1 = 0, λ2 ≠ 0 does not yield a feasible solution

Case 3: λ1 ≠ 0, λ2 = 0 ⇒ 2(x − 1) + λ1 = 0, 2(y − 3) + λ1 = 0, λ1(x + y − 2) = 0, which yields x = (2 − λ1)/2, y = (6 − λ1)/2, and λ1((2 − λ1)/2 + (6 − λ1)/2 − 2) = 0 ⇒ λ1 = 2, x = 0, y = 2, which is a feasible solution

Case 4: λ1 ≠ 0, λ2 ≠ 0 does not yield a feasible solution

Hence, the solution is x = 0, y = 2
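The case analysis above can be confirmed by brute force. This sketch searches a fine grid of feasible points for the minimizer (the grid bounds and resolution are my choices):

```python
# Brute-force check of the KKT solution (x, y) = (0, 2) for
# min (x-1)^2 + (y-3)^2  s.t.  x + y <= 2 and x - y <= 0
def f(x, y):
    return (x - 1) ** 2 + (y - 3) ** 2

grid = [(i - 500) / 100 for i in range(1001)]  # -5.00 .. 5.00 in steps of 0.01
best = min(((x, y) for x in grid for y in grid
            if x + y <= 2 and x - y <= 0),
           key=lambda p: f(*p))
print(best)  # (0.0, 2.0)
```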

**Dual Optimization Problem**

• A primal problem can be transformed into a dual by setting to zero the derivatives of the Lagrangian with respect to the primal variables, and substituting the results back into the Lagrangian to remove the dependence on the primal variables, resulting in

θ(α, β) = inf_{w∈Ω} L(w, α, β)

which contains only dual variables and needs to be maximized under simpler constraints

• The infimum of a set S of real numbers, denoted inf(S), is defined to be the largest real number that is smaller than or equal to every number in S. If no such number exists, then inf(S) = −∞

**Dual Optimization Problem**

• Theorem: let w ∈ Ω be a feasible solution of the primal problem and (α, β) a feasible solution of the dual problem. Then f(w) ≥ θ(α, β)

• Corollary: the value of the dual is upper bounded by the value of the primal:

sup{θ(α, β) : α ≥ 0} ≤ inf{f(w) : g(w) ≤ 0, h(w) = 0}

• Corollary: if f(w*) = θ(α*, β*), where α* ≥ 0, g(w*) ≤ 0, and h(w*) = 0, then w* and (α*, β*) solve the primal and dual problems respectively. In this case α_i* g_i(w*) = 0 for i = 1, …, k

**Maximum Margin Hyperplane**

• The problem of finding the maximal margin hyperplane is a constrained optimization problem

• Use Lagrangian theory (extended by Karush, Kuhn, and Tucker – KKT)

• Lagrangian:

L_p(w) = (1/2)⟨w, w⟩ − Σ_i α_i [y_i(⟨w, x_i⟩ + b) − 1],  α_i ≥ 0   (4)

**From Primal to Dual**

• Minimize L_p(w) with respect to (w, b), requiring that the derivatives of L_p(w) with respect to (w, b) vanish, subject to the constraints α_i ≥ 0

• Differentiating L_p(w):

∂L_p/∂w = 0   (5)
∂L_p/∂b = 0   (6)

yields

w = Σ_i α_i y_i x_i   (7)
Σ_i α_i y_i = 0   (8)

• Substituting these equality constraints back into L_p(w), we obtain the dual problem

**The Dual Problem**

• Maximize:

L_D = (1/2)⟨Σ_i α_i y_i x_i, Σ_j α_j y_j x_j⟩ − Σ_i α_i [y_i(⟨Σ_j α_j y_j x_j, x_i⟩ + b) − 1]
    = (1/2)Σ_ij α_i α_j y_i y_j ⟨x_i, x_j⟩ − Σ_ij α_i α_j y_i y_j ⟨x_i, x_j⟩ − b Σ_i α_i y_i + Σ_i α_i
    = −(1/2)Σ_ij α_i α_j y_i y_j ⟨x_i, x_j⟩ + Σ_i α_i

subject to Σ_i α_i y_i = 0 and α_i ≥ 0

• Duality permits the use of kernels!

• The value of b does not appear in the dual problem, so b needs to be found from the primal constraints:

b = − [max_{y_i=−1}(⟨w, x_i⟩) + min_{y_i=1}(⟨w, x_i⟩)] / 2

**Karush-Kuhn-Tucker Conditions**

L_P(w) = (1/2)⟨w, w⟩ − Σ_i α_i [y_i(⟨w, x_i⟩ + b) − 1]

∂L_P/∂w = 0 ⇒ w = Σ_i α_i y_i x_i
∂L_P/∂b = 0 ⇒ Σ_i α_i y_i = 0
α_i [y_i(⟨w, x_i⟩ + b) − 1] = 0
α_i ≥ 0
y_i(⟨w, x_i⟩ + b) − 1 ≥ 0

• Solving for the SVM is equivalent to finding a solution to the KKT conditions

**Karush-Kuhn-Tucker Conditions for SVM**

• The KKT conditions state that optimal solutions α_i, (w, b) must satisfy

α_i [y_i(⟨w, x_i⟩ + b) − 1] = 0

• Only the training samples x_i for which the functional margin = 1 can have nonzero α_i; they are called support vectors

• The optimal hyperplane can be expressed in the dual representation in terms of this subset of training samples – the support vectors:

f(x, α, b) = Σ_{i=1}^{l} y_i α_i ⟨x_i, x⟩ + b = Σ_{i∈SV} y_i α_i ⟨x_i, x⟩ + b

**Karush-Kuhn-Tucker Conditions for SVM**

⟨w, w⟩ = Σ_{j=1}^{l} Σ_{i=1}^{l} y_i y_j α_i α_j ⟨x_i, x_j⟩
       = Σ_{j∈SV} α_j y_j Σ_{i∈SV} α_i y_i ⟨x_i, x_j⟩
       = Σ_{j∈SV} α_j (1 − y_j b)
         (since for x_j ∈ SV, y_j(Σ_{i∈SV} α_i y_i ⟨x_i, x_j⟩ + b) = 1 ⇒ y_j Σ_{i∈SV} α_i y_i ⟨x_i, x_j⟩ = 1 − y_j b)
       = Σ_{i∈SV} α_i   (using Σ_i α_i y_i = 0)

So we have

γ = 1/||w|| = 1/√(Σ_{i∈SV} α_i)

Support Vector Machines Yield Sparse Solutions

[Figure: maximal margin separator with margin γ; support vectors from the positive class are shown in green and those from the negative class in purple]

Example: XOR

Q(α) = Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j K(x_i, x_j)

K(x_i, x_j) = (1 + ⟨x_i, x_j⟩)²

φ(x) = [1, x1², √2·x1x2, x2², √2·x1, √2·x2]ᵀ

Example: XOR

Q(α) = α1 + α2 + α3 + α4 − (1/2)(9α1² − 2α1α2 − 2α1α3 + 2α1α4 + 9α2² + 2α2α3 − 2α2α4 + 9α3² − 2α3α4 + 9α4²)

**• Setting the derivatives of Q w.r.t. the dual variables to 0:**

9α1 − α2 − α3 + α4 = 1
−α1 + 9α2 + α3 − α4 = 1
−α1 + α2 + 9α3 − α4 = 1
α1 − α2 − α3 + 9α4 = 1

α0,1 = α0,2 = α0,3 = α0,4 = 1/8,  Q0(α) = 1/4

**• Not surprisingly, each of the training samples is a support vector**

w0 = Σ_{i=1}^{N} α_i y_i φ(x_i) = [0 0 −1/√2 0 0 0]ᵀ
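The 4×4 linear system above can be solved directly to confirm that every α_i = 1/8. The sketch below uses a small Gauss-Jordan elimination (pure Python, no external libraries):

```python
# Solve the XOR dual system A alpha = [1, 1, 1, 1] by Gauss-Jordan elimination
def solve(A, b):
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))  # partial pivoting
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * p for a, p in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

A = [[9, -1, -1, 1],
     [-1, 9, 1, -1],
     [-1, 1, 9, -1],
     [1, -1, -1, 9]]
alpha = solve(A, [1, 1, 1, 1])
print(alpha)  # [0.125, 0.125, 0.125, 0.125]
```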

Example: Two Spirals


**Summary of Maximal Margin Classifier**

• Good generalization when the training data is noise-free and separable in the kernel-induced feature space

• Excessively large Lagrange multipliers often signal outliers – data points that are most difficult to classify – so SVM can be used for identifying outliers

• Focuses on boundary points
– If they happen to be mislabeled (noisy training data), the result is a terrible classifier if the training set is separable
– Noisy training data often makes the training set non-separable
– Using highly nonlinear kernels to make the training set separable increases the likelihood of overfitting, despite maximizing the margin

**Non-Separable Case**

• In the case of non-separable data in feature space, the objective function grows arbitrarily large

• Solution: relax the constraint that all training data be correctly classified, but only when necessary to do so

• Cortes and Vapnik (1995) introduced slack variables ξ_i, i = 1, …, l in the constraints:

ξ_i = max(0, γ − y_i(⟨w, x_i⟩ + b))   (17)

Non Separable Non-Separable Case

ξ_i = max(0, γ − y_i(⟨w, x_i⟩ + b))

[Figure: separating hyperplane with margin γ and slack variables ξ_i, ξ_j for points inside the margin or misclassified]

The generalization error can be shown to be

ε ≤ (1/l) (L² + Σ_i ξ_i²) / γ²

**Non-Separable Case**

• For an error to occur, the corresponding ξ_i must exceed γ (which was chosen to be unity)

• So Σ_i ξ_i is an upper bound on the number of training errors

• So the objective function is changed from ||w||²/2 to ||w||²/2 + C Σ_i ξ_iᵏ, where C and k are parameters (typically k = 1 or 2, and C is a large positive constant)

**The Soft-Margin Classifier**

• Minimize:

(1/2)⟨w, w⟩ + C Σ_i ξ_i

Subject to: y_i(⟨w, x_i⟩ + b) ≥ 1 − ξ_i,  ξ_i ≥ 0

L_P = (1/2)||w||² + C Σ_i ξ_i − Σ_i α_i (y_i(⟨w, x_i⟩ + b) − 1 + ξ_i) − Σ_i μ_i ξ_i

where the first two terms are the objective function, the α_i terms enforce the inequality constraints involving ξ_i, and the μ_i terms enforce the non-negativity constraints on ξ_i

**The Soft-Margin Classifier: KKT Conditions**

ξ_i ≥ 0, α_i ≥ 0, μ_i ≥ 0
μ_i ξ_i = 0
α_i (y_i(⟨w, x_i⟩ + b) − 1 + ξ_i) = 0

α_i > 0 only if
• x_i is a support vector, i.e. ⟨w, x_i⟩ + b = ±1 and hence ξ_i = 0, OR
• x_i is misclassified by (w, b) and hence ξ_i > 0

**From Primal to Dual**

L_P = (1/2)||w||² + C Σ_i ξ_i − Σ_i α_i (y_i(⟨w, x_i⟩ + b) − 1 + ξ_i) − Σ_i μ_i ξ_i

Equating the derivatives of L_P w.r.t. the primal variables to 0:

∂L_P/∂w = 0 ⇒ w = Σ_i α_i y_i x_i
∂L_P/∂b = 0 ⇒ Σ_i α_i y_i = 0
∂L_P/∂ξ_i = 0 ⇒ α_i + μ_i = C ⇒ 0 ≤ α_i ≤ C

Substituting back into L_P, we get the dual L_D to be maximized:

L_D = Σ_i α_i − (1/2) Σ_ij α_i α_j y_i y_j ⟨x_i, x_j⟩

**Soft-Margin Dual Lagrangian**

• Maximize:

L_D = Σ_i α_i − (1/2) Σ_ij α_i α_j y_i y_j ⟨x_i, x_j⟩

Subject to 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0

KKT conditions:

α_i (y_i(⟨w, x_i⟩ + b) − 1 + ξ_i) = 0
ξ_i (α_i − C) = 0

• Slack variables are non-zero (misclassified training sample) only when α_i = C

• For true support vectors, 0 < α_i < C

• For all other (correctly classified) training samples, α_i = 0

Implementation Techniques

• Maximizing a quadratic function, subject to linear equality and inequality constraints:

W(α) = Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j)

0 ≤ α_i ≤ C,  Σ_i α_i y_i = 0

∂W(α)/∂α_i = 1 − y_i Σ_j α_j y_j K(x_i, x_j)

**On-line Algorithm for the 1-Norm Soft Margin (bias b = 0)**

Given training set D
α ← 0
repeat
  for i = 1 to l
    η_i = 1/K(x_i, x_i)
    α_i ← α_i + η_i (1 − y_i Σ_j α_j y_j K(x_i, x_j))
    if α_i < 0 then α_i ← 0
    else if α_i > C then α_i ← C
  endfor
until stopping criterion satisfied
return α

Note: in the general setting, b ≠ 0 and we need to solve for b
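The pseudocode above translates almost line for line into Python. The sketch below is an illustrative implementation on a toy 1-D data set; the kernel ⟨x, z⟩ + 1 is my choice so that an implicit bias is absorbed into the kernel and b = 0 suffices:

```python
# On-line 1-norm soft-margin algorithm (bias b = 0), as in the pseudocode
def kernel(x, z):
    return sum(xi * zi for xi, zi in zip(x, z)) + 1.0  # linear kernel + implicit bias

def train(X, y, C=10.0, epochs=100):
    alpha = [0.0] * len(X)
    for _ in range(epochs):                       # "repeat ... until"
        for i in range(len(X)):                   # "for i = 1 to l"
            eta = 1.0 / kernel(X[i], X[i])
            g = 1 - y[i] * sum(alpha[j] * y[j] * kernel(X[i], X[j])
                               for j in range(len(X)))
            alpha[i] = min(C, max(0.0, alpha[i] + eta * g))  # clip to [0, C]
    return alpha

def predict(X, y, alpha, x):
    s = sum(alpha[j] * y[j] * kernel(X[j], x) for j in range(len(X)))
    return 1 if s >= 0 else -1

X = [(-2.0,), (-1.0,), (1.0,), (2.0,)]
y = [-1, -1, 1, 1]
alpha = train(X, y)
print([predict(X, y, alpha, x) for x in X])  # [-1, -1, 1, 1]
```

Each inner step is an exact coordinate maximization of the dual, clipped to the box [0, C], so on this separable toy set the α converge and the trained classifier labels all points correctly.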

Implementation Techniques

• Use QP packages (MINOS, LOQO, quadprog from the MATLAB optimization toolbox). They are not online and require that the data be held in memory in the form of a kernel matrix

• Stochastic gradient ascent: sequentially update one weight at a time, so it is online. It gives an excellent approximation in most cases:

α̂_i ← α_i + (1/K(x_i, x_i)) (1 − y_i Σ_j α_j y_j K(x_i, x_j))

**Chunking and Decomposition**

Given training set D
select an arbitrary working set D̂ ⊂ D
repeat
  α ← 0
  solve the optimization problem on D̂
  select a new working set from the data not satisfying the KKT conditions
until stopping criterion satisfied
return α

**Sequential Minimal Optimization (SMO)**

• At each step, SMO optimizes two weights α_i, α_j simultaneously, in order not to violate the linear constraint Σ_i α_i y_i = 0

• Optimization of the two weights is performed analytically

• Realizes gradient descent without leaving the linear constraint (J. Platt)

• Online versions exist (Li, Long; Gentile)

SVM Implementations

• SVMLight – one of the first practical implementations of SVM (T. Joachims)

• Matlab toolboxes for SVM

• LIBSVM – one of the best SVM implementations, by Chih-Chung Chang and Chih-Jen Lin of National Taiwan University http://www.csie.ntu.edu.tw/~cjlin/libsvm/

• WLSVM – LIBSVM integrated with the WEKA machine learning toolbox, by Yasser El-Manzalawy of the ISU AI Lab http://www.cs.iastate.edu/~yasser/wlsvm/

**Why does SVM work?**

• The feature space is often very high dimensional. Why don't we have the curse of dimensionality?

• A classifier in a high-dimensional space has many parameters and is hard to estimate

• Vapnik argues that the fundamental problem is not the number of parameters to be estimated; rather, the problem is the capacity of a classifier

• Typically, a classifier with many parameters is very flexible, but there are also exceptions
– Let x_i = 10⁻ⁱ where i ranges from 1 to n. The one-parameter classifier sign(sin(αx)) can classify all x_i correctly for every possible combination of class labels on the x_i
– This 1-parameter classifier is very flexible

**Why does SVM work?**

• Vapnik argues that a classifier should be characterized not by the number of its parameters but by its capacity
– This is formalized by the "VC dimension" of a classifier

• The minimization of ||w||² subject to the condition that the margin = 1 has the effect of restricting the VC dimension of the classifier in the feature space

• The SVM performs structural risk minimization: the empirical risk (training error), plus a term related to the generalization ability of the classifier, is minimized

• The term ½||w||² "shrinks" the parameters towards zero to avoid overfitting

**Choosing the Kernel Function**

• Probably the most tricky part of using SVM

• The kernel function should maximize the similarity among instances within a class while accentuating the differences between classes

• A variety of kernels have been proposed (diffusion kernel, Fisher kernel, string kernel, …) for different types of data

• In practice, a low-degree polynomial kernel or an RBF kernel with a reasonable width is a good initial try for data that live in a fixed-dimensional input space

• Low-order Markov kernels and their relatives are good ones to consider for structured data – strings, images, etc.

**Other Aspects of SVM**

• How to use SVM for multi-class classification?
– One can change the QP formulation to become multi-class
– More often, multiple binary classifiers are combined
– One can train multiple one-versus-all classifiers, or combine multiple pairwise classifiers "intelligently"

• How to interpret the SVM discriminant function value as a probability?
– By performing logistic regression on the SVM output over a set of data that is not used for training

• Some SVM software (like LIBSVM) has these features built in

**Recap: How to Use SVM**

• Prepare the data set

• Select the kernel function to use

• Select the parameters of the kernel function and the value of C
– You can use the values suggested by the SVM software, or use a tuning set to tune C

• Execute the training algorithm and obtain the α_i and b

• Unseen data can be classified using the α_i, the support vectors, and b

**Strengths and Weaknesses of SVM**

• Strengths
– Training is relatively easy: no local optima
– It scales relatively well to high-dimensional data
– The tradeoff between classifier complexity and error can be controlled explicitly
– Non-traditional data like strings and trees can be used as input to SVM, instead of feature vectors

• Weaknesses
– Need to choose a "good" kernel function

Recent Developments

• Better understanding of the relation between SVM and regularized discriminative classifiers (e.g., logistic regression)

• Knowledge-based SVM – incorporating prior knowledge as constraints in SVM optimization (Shavlik et al.)

• A zoo of kernel functions

• Extensions of SVM to multi-class and structured-label classification tasks

• One-class SVM (anomaly detection): with only examples from one (normal) class, estimate its distribution in order to distinguish it from unlike, unseen examples (outliers)

**Representative Applications of SVM
**

• • • • Handwritten letter classification Text classification Image classification (e.g. face recognition) Bioinformatics – Classification of tissues – healthy versus cancerous – based on gene expression data – P t i function/structure classification Protein f ti / t t l ifi ti – Protein sub-cellular localization • …

**Learning Real-Valued Functions**

• (Bayesian recipe for learning real-valued functions)
• (Learning linear functions using gradient descent in weight space)
• Universal function approximation theorem
• Learning nonlinear functions using gradient descent in weight space
• Practical considerations and examples

**Kolmogorov’s theorem (Kolmogorov, 1940)**

• In search of more expressive hypothesis spaces
• Any continuous function from input to output can be expressed in the form

g(x₁, …, xₙ) = Σⱼ₌₁²ⁿ⁺¹ gⱼ( Σᵢ uᵢⱼ(xᵢ) )   ∀(x₁, …, xₙ) ∈ Iⁿ (I = [0, 1]; n ≥ 2)

by choosing proper nonlinearities gⱼ and uᵢⱼ

**Universal function approximation theorem (Cybenko, 1989)**

• Let ϕ: ℜ → ℜ be a non-constant, bounded (hence nonlinear), monotone, continuous function. Let Iᴺ be the N-dimensional unit hypercube in ℜᴺ.
• Let C(Iᴺ) = {f: Iᴺ → ℜ} be the set of all continuous functions with domain Iᴺ and range ℜ. Then for any function f ∈ C(Iᴺ) and any ε > 0, there exist an integer L and a set of real values θ, αⱼ, θⱼ, wⱼᵢ (1 ≤ j ≤ L; 1 ≤ i ≤ N) such that

F(x₁, …, x_N) = Σⱼ₌₁ᴸ αⱼ ϕ( Σᵢ₌₁ᴺ wⱼᵢ xᵢ − θⱼ ) − θ

is a uniform approximation of f – that is,

∀(x₁, …, x_N) ∈ Iᴺ, |F(x₁, …, x_N) − f(x₁, …, x_N)| < ε

**Universal function approximation theorem (UFAT)**

F(x₁, …, x_N) = Σⱼ₌₁ᴸ αⱼ ϕ( Σᵢ₌₁ᴺ wⱼᵢ xᵢ − θⱼ ) − θ

• Unlike Kolmogorov’s theorem, UFAT requires only one kind of nonlinearity to approximate any arbitrary nonlinear function to any desired accuracy
• The sigmoid function satisfies the UFAT requirements:

ϕ(z) = 1 / (1 + e^(−az)); a > 0, with lim_(z→−∞) ϕ(z) = 0 and lim_(z→+∞) ϕ(z) = 1

• Similar universal approximation properties can be guaranteed for other functions (e.g., radial basis functions)
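The theorem can be illustrated numerically: with sigmoidal units ϕ(wⱼx − θⱼ) whose inner parameters are fixed at random, solving a least-squares problem for just the outer coefficients αⱼ already yields a close approximation of a smooth target. A minimal NumPy sketch (the number of units, the random parameter ranges, and the sin target are illustrative choices; the outer bias θ is absorbed into the fit):

```python
import numpy as np

def phi(z):
    # sigmoid nonlinearity satisfying the UFAT requirements (a = 1)
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
L = 50                                  # number of sigmoidal units
w = rng.uniform(-10.0, 10.0, L)         # inner weights w_j, fixed at random
theta = rng.uniform(-10.0, 10.0, L)     # inner thresholds theta_j

x = np.linspace(0.0, 1.0, 200)
f = np.sin(2.0 * np.pi * x)             # smooth target on the unit interval

# Design matrix Phi[p, j] = phi(w_j * x_p - theta_j); fit alpha_j by least squares
Phi = phi(np.outer(x, w) - theta)
alpha, *_ = np.linalg.lstsq(Phi, f, rcond=None)
F = Phi @ alpha                         # F(x) = sum_j alpha_j phi(w_j x - theta_j)
max_err = float(np.max(np.abs(F - f)))
```

With enough units the maximum deviation from the target is small, matching the ε in the theorem statement.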

**Universal function approximation theorem**

• UFAT guarantees the existence of arbitrarily accurate approximations of continuous functions defined over bounded subsets of ℜᴺ
• UFAT tells us the representational power of a certain class of multilayer networks relative to the set of continuous functions defined on bounded subsets of ℜᴺ
• UFAT is not constructive – it does not tell us how to choose the parameters to construct a desired function
• To learn an unknown function from data, we need an algorithm to search the hypothesis space of multilayer networks
• The generalized delta rule allows the form of the nonlinearity to be learned from the training data

**Feed-forward neural networks**

• A feed-forward 3-layer network consists of 3 layers of nodes: input nodes, hidden nodes, and output nodes
• The layers are interconnected by modifiable weights from the input nodes to the hidden nodes and from the hidden nodes to the output nodes
• More general topologies (with more than 3 layers of nodes, or connections that skip layers – e.g., direct connections between input and output nodes) are also possible


A three-layer network that approximates the exclusive-OR function

**Three-layer feed-forward neural network**

• A single bias unit is connected to each unit other than the input units
• Net input:

nⱼ = Σᵢ₌₁ᴺ xᵢ wⱼᵢ + wⱼ₀ = Σᵢ₌₀ᴺ xᵢ wⱼᵢ ≡ Wⱼ · X,

where the subscript i indexes units in the input layer and j units in the hidden layer; wⱼᵢ denotes the input-to-hidden layer weight at hidden unit j
• The output of a hidden unit is a nonlinear function of its net input, yⱼ = f(nⱼ), e.g.,

yⱼ = 1 / (1 + e^(−nⱼ))

**Three-layer feed-forward neural network**

• Each output unit similarly computes its net activation based on the hidden unit signals as:

nₖ = Σⱼ₌₁ⁿᴴ yⱼ wₖⱼ + wₖ₀ = Σⱼ₌₀ⁿᴴ yⱼ wₖⱼ = Wₖ · Y,

where the subscript k indexes units in the output layer and nH denotes the number of hidden units
• The output can be a linear or a nonlinear function of the net input, e.g., zₖ = nₖ
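The two layers of computation above can be sketched directly; a minimal NumPy forward pass with sigmoid hidden units and linear output units (the bias unit appears as a constant 1 prepended to each layer's input; the network sizes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W_hidden, W_out):
    # n_j = sum_i x_i w_ji + w_j0: the bias is handled by augmenting x with x_0 = 1
    x_aug = np.concatenate(([1.0], x))
    y = sigmoid(W_hidden @ x_aug)        # hidden outputs y_j = f(n_j)
    y_aug = np.concatenate(([1.0], y))   # bias unit for the output layer
    z = W_out @ y_aug                    # linear output units: z_k = n_k
    return y, z

# 2 inputs, 3 hidden units, 1 output; weight matrices include the bias column
W_hidden = np.zeros((3, 3))              # shape: (n_hidden, 1 + n_inputs)
W_out = np.zeros((1, 4))                 # shape: (n_outputs, 1 + n_hidden)
y, z = forward(np.array([0.3, -0.7]), W_hidden, W_out)
```

With all-zero weights every hidden net input is 0, so each sigmoid sits at 0.5 and the linear outputs are 0 — a quick check that the plumbing is right.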


Computing nonlinear functions using a feed-forward neural network


Realizing non-linearly separable class boundaries using a 3-layer feed-forward neural network

**Learning nonlinear functions**

Given a training set, determine:
• Network structure – the number of hidden nodes or, more generally, the network topology
– Start small and grow the network
– Start with a sufficiently large network and prune away the unnecessary connections
• For a given structure, the parameters (weights) that minimize the error on the training samples (e.g., the mean squared error)
• For now, we focus on the latter

**Generalized delta rule – error back-propagation**

• Challenge – we know the desired outputs for nodes in the output layer, but not for the hidden layer
• Need to solve the credit assignment problem – dividing the credit or blame for the performance of the output nodes among the hidden nodes
• The generalized delta rule offers an elegant solution to the credit assignment problem in feed-forward neural networks in which each neuron computes a differentiable function of its inputs
• The solution can be generalized to other kinds of networks, including networks with cycles

**Feed-forward networks**

• Forward operation – computing the output for a given input based on the current weights
• Learning – modification of the network parameters (weights) to minimize an appropriate error measure
• Because each neuron computes a differentiable function of its inputs, if the error is a differentiable function of the network outputs, then the error is a differentiable function of the weights in the network – so we can perform gradient descent!


A fully connected 3-layer network

**Generalized delta rule**

• Let tₖₚ be the k-th target (or desired) output for input pattern Xₚ, let zₖₚ be the output produced by the k-th output node, and let W represent all the weights in the network
• Training error:

E_S(W) = Σₚ ½ Σₖ₌₁ᴹ (tₖₚ − zₖₚ)² = Σₚ Eₚ(W)

• The weights are initialized with pseudo-random values and are changed in a direction that will reduce the error:

Δwⱼᵢ = −η ∂E_S/∂wⱼᵢ    Δwₖⱼ = −η ∂E_S/∂wₖⱼ

**Generalized delta rule**

η > 0 is a suitable learning rate; W ← W + ΔW

Hidden-to-output weights:

∂Eₚ/∂wₖⱼ = (∂Eₚ/∂nₖₚ) · (∂nₖₚ/∂wₖⱼ), with ∂nₖₚ/∂wₖⱼ = yⱼₚ

∂Eₚ/∂nₖₚ = (∂Eₚ/∂zₖₚ) · (∂zₖₚ/∂nₖₚ) = −(tₖₚ − zₖₚ)(1)

wₖⱼ ← wₖⱼ − η ∂Eₚ/∂wₖⱼ = wₖⱼ + η(tₖₚ − zₖₚ) yⱼₚ = wₖⱼ + η δₖₚ yⱼₚ

**Generalized delta rule**

Weights from input to hidden units:

∂Eₚ/∂wⱼᵢ = Σₖ₌₁ᴹ (∂Eₚ/∂zₖₚ)(∂zₖₚ/∂wⱼᵢ) = Σₖ₌₁ᴹ (∂Eₚ/∂zₖₚ)(∂zₖₚ/∂yⱼₚ)(∂yⱼₚ/∂nⱼₚ)(∂nⱼₚ/∂wⱼᵢ)

= Σₖ₌₁ᴹ ∂/∂zₖₚ [ ½ Σₗ₌₁ᴹ (tₗₚ − zₗₚ)² ] (wₖⱼ)(yⱼₚ)(1 − yⱼₚ)(xᵢₚ)

= −Σₖ₌₁ᴹ (tₖₚ − zₖₚ)(wₖⱼ)(yⱼₚ)(1 − yⱼₚ)(xᵢₚ)

= −( Σₖ₌₁ᴹ δₖₚ wₖⱼ )(yⱼₚ)(1 − yⱼₚ)(xᵢₚ) ≡ −δⱼₚ xᵢₚ

wⱼᵢ ← wⱼᵢ + η δⱼₚ xᵢₚ
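The derived δ expressions can be verified numerically: compute ∂Eₚ/∂w from the formulas above and compare against a finite difference of Eₚ. A minimal NumPy sketch (sigmoid hidden units, linear output units, bias terms omitted for brevity; the network sizes and input pattern are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def E_p(W1, W2, x, t):
    # forward pass: sigmoid hidden layer, linear output layer
    y = sigmoid(W1 @ x)
    z = W2 @ y
    return 0.5 * np.sum((t - z) ** 2), y, z

rng = np.random.default_rng(1)
W1 = rng.normal(0.0, 0.5, (3, 2))        # input-to-hidden weights w_ji
W2 = rng.normal(0.0, 0.5, (2, 3))        # hidden-to-output weights w_kj
x = np.array([0.4, -0.2])
t = np.array([1.0, 0.0])

E, y, z = E_p(W1, W2, x, t)
delta_k = t - z                          # delta_kp = (t_kp - z_kp), linear outputs
delta_j = (delta_k @ W2) * y * (1 - y)   # delta_jp = (sum_k delta_kp w_kj) y_jp (1 - y_jp)
grad_W2 = -np.outer(delta_k, y)          # dE_p/dw_kj = -delta_kp y_jp
grad_W1 = -np.outer(delta_j, x)          # dE_p/dw_ji = -delta_jp x_ip

# finite-difference check of one weight in each layer
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
num1 = (E_p(W1p, W2, x, t)[0] - E) / eps
W2p = W2.copy(); W2p[0, 0] += eps
num2 = (E_p(W1, W2p, x, t)[0] - E) / eps
```

Agreement between the analytic and numerical gradients is the standard sanity check for a back-propagation implementation.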

**Back-propagation algorithm**

Start with small random initial weights
Until the desired stopping criterion is satisfied do:
Select a training sample from S
Compute the outputs of all nodes based on the current weights and the input sample
Compute the weight updates for the output nodes
Compute the weight updates for the hidden nodes
Update the weights
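The loop above, with the update rules derived earlier, is enough to learn XOR; a minimal NumPy sketch with sigmoid hidden and output units (because the output unit here is also a sigmoid, its δ carries an extra z(1 − z) factor; the network size, learning rate η, and epoch count are illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(7)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([0.0, 1.0, 1.0, 0.0])           # XOR targets

n_hidden, eta = 5, 0.5
W1 = rng.uniform(-0.5, 0.5, (n_hidden, 3))   # hidden weights incl. bias column
W2 = rng.uniform(-0.5, 0.5, n_hidden + 1)    # output weights incl. bias

def forward(x):
    x_aug = np.concatenate(([1.0], x))
    y = sigmoid(W1 @ x_aug)
    z = sigmoid(W2 @ np.concatenate(([1.0], y)))
    return x_aug, y, z

for epoch in range(10000):
    for p in rng.permutation(len(X)):        # randomized presentation order
        x_aug, y, z = forward(X[p])
        delta_k = (T[p] - z) * z * (1 - z)   # sigmoid output delta
        delta_j = delta_k * W2[1:] * y * (1 - y)
        W2 += eta * delta_k * np.concatenate(([1.0], y))
        W1 += eta * np.outer(delta_j, x_aug)

preds = [round(forward(x)[2]) for x in X]
```

After training, thresholding the output at 0.5 recovers the XOR truth table.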

**Using neural networks for classification**

Network outputs are real-valued. How can we use the networks for classification?

F(Xₚ) = argmaxₖ zₖₚ(Xₚ)

Classify a pattern by assigning it to the class that corresponds to the index of the output node with the largest output for the pattern

**Some Useful Tricks**

• Initializing weights to small random values that place the neurons in the linear portion of their operating range for most of the patterns in the training set improves the speed of convergence, e.g.,

wⱼᵢ = ± (1/2N) Σᵢ₌₁,…,ₙ 1/|xᵢ|

for the input-to-hidden layer weights, with the sign of each weight chosen at random, and

wₖⱼ = ± (1/2N) Σ 1/ϕ(Σᵢ wⱼᵢ xᵢ)

for the hidden-to-output layer weights, with the sign of each weight chosen at random

**Some Useful Tricks**

• Use of a momentum term allows the effective learning rate for each weight to adapt as needed and helps speed up convergence – in a network with 2 layers of weights:

wⱼᵢ(t+1) = wⱼᵢ(t) + Δwⱼᵢ(t), where Δwⱼᵢ(t) = −η ∂E_S/∂wⱼᵢ |_(wⱼᵢ = wⱼᵢ(t)) + α Δwⱼᵢ(t−1)
wₖⱼ(t+1) = wₖⱼ(t) + Δwₖⱼ(t), where Δwₖⱼ(t) = −η ∂E_S/∂wₖⱼ |_(wₖⱼ = wₖⱼ(t)) + α Δwₖⱼ(t−1)

with 0 < α, η < 1; typical values are η = 0.5 to 0.6, α = 0.8 to 0.9
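The momentum update can be sketched on a toy one-dimensional error surface E(w) = ½w² (so ∂E/∂w = w), using the slide's typical values η = 0.5, α = 0.8; the error surface itself is an illustrative choice:

```python
def momentum_step(w, grad, prev_dw, eta=0.5, alpha=0.8):
    # dw(t) = -eta * dE/dw + alpha * dw(t-1)
    dw = -eta * grad + alpha * prev_dw
    return w + dw, dw

# minimize E(w) = 0.5 * w**2 starting from w = 4.0
w, dw = 4.0, 0.0
for _ in range(100):
    w, dw = momentum_step(w, w, dw)  # gradient of E at w is w itself
```

Each step blends the current gradient with the previous step, so the trajectory oscillates but spirals in toward the minimum at w = 0.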

**Some Useful Tricks**

• Use of a sigmoid function that satisfies ϕ(−z) = −ϕ(z) helps speed up convergence:

ϕ(z) = a (e^(bz) − e^(−bz)) / (e^(bz) + e^(−bz))

With a = 1.716 and b = 2/3, ∂ϕ/∂z |_(z=0) ≈ 1, and ϕ(z) is approximately linear in the range −1 < z < 1

**Some Useful Tricks**

• Randomizing the order of presentation of training examples from one pass to the next helps avoid local minima
• Introducing small amounts of noise into the weight updates (or into the examples) during training helps improve generalization – it minimizes overfitting, makes the learned approximation more robust to noise, and helps avoid local minima
• If using the suggested sigmoid nodes in the output layer, set the target output to 1 for the target class and −1 for all others

**Some Useful Tricks**

• Regularization helps avoid overfitting and improves generalization:

R(W) = λE(W) + (1 − λ)C(W); 0 ≤ λ ≤ 1

C(W) = ½( Σⱼᵢ wⱼᵢ² + Σₖⱼ wₖⱼ² )

so that −∂C/∂wⱼᵢ = −wⱼᵢ and −∂C/∂wₖⱼ = −wₖⱼ

• Start with λ close to 1 and gradually lower it during training. When λ < 1, the criterion tends to drive the weights toward zero, setting up a tension between error reduction and complexity minimization
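Gradient descent on R(W) gives the familiar “weight decay” update: every step shrinks each weight toward zero in proportion to (1 − λ), on top of the usual error-gradient step. A minimal sketch (the error gradient is a placeholder argument, and η, λ are illustrative values):

```python
def regularized_step(w, grad_E, eta=0.1, lam=0.9):
    # dR/dw = lam * dE/dw + (1 - lam) * w, so the update both descends on the
    # error E and decays w toward zero
    return w - eta * (lam * grad_E + (1.0 - lam) * w)

# with a zero error gradient, the weight decays geometrically:
w = 1.0
for _ in range(50):
    w = regularized_step(w, 0.0)
```

With grad_E = 0 each step multiplies w by (1 − η(1 − λ)) = 0.99, which is exactly the tension described above: nothing opposes the pull toward zero until the error term pushes back.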

**Some Useful Tricks**

Input and output encodings:
• Do not eliminate natural proximity in the input or output space
• Do not normalize input patterns to unit length if the length is likely to be relevant for distinguishing between classes
• Do not introduce unwarranted proximity as an artifact
• Do not use log₂ M outputs to encode M classes; use M outputs instead to avoid spurious proximity in the output space
• Use error-correcting codes when feasible

**Some Useful Tricks**

Examples of a good code:
• Binary thermometer codes for encoding real values
• Suppose we can use 10 bits to represent a value between −1.0 and +1.0
• We can quantize the interval [−1, 1] into 10 equal parts
• 0.38 in thermometer code is 1111000000
• 0.60 in thermometer code is 1111110000
• Note that values close along the real number line have thermometer codes that are close in Hamming distance

Example of a bad code:
• Ordinary binary representations of integers
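A thermometer encoder is a few lines of code; a minimal sketch (the rounding rule below is one plausible choice of bin boundaries, so individual codewords may differ from the slide's examples, but the key property — nearby values have nearby codes in Hamming distance — holds either way):

```python
def thermometer(x, lo=-1.0, hi=1.0, bits=10):
    # the count of leading 1s grows monotonically with x's position in [lo, hi]
    ones = int(round((x - lo) / (hi - lo) * bits))
    ones = max(0, min(bits, ones))
    return "1" * ones + "0" * (bits - ones)

def hamming(u, v):
    # number of bit positions where the two codewords differ
    return sum(a != b for a, b in zip(u, v))

code_low, code_high = thermometer(-1.0), thermometer(1.0)
```

Contrast with ordinary binary: 7 (0111) and 8 (1000) are adjacent integers but differ in all four bits.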

**Some Useful Tricks**

• Use of problem-specific information (if known) speeds up convergence and improves generalization
• In networks designed for translation-invariant visual image classification, building in translation invariance as a constraint on the weights helps
• If we know the function to be approximated is smooth, we can build that in as part of the criterion to be minimized – minimize, in addition to the error, the gradient of the error with respect to the inputs

**Some Useful Tricks**

• Manufacture training data – train networks with translated and rotated patterns if translation- and rotation-invariant recognition is desired
• Incorporate hints during training: hints are used as additional outputs during training to help shape the hidden-layer representation

Hint nodes (e.g., vowels versus consonants in training a phoneme recognizer)

**Some Useful Tricks**

• Reducing the effective number of free parameters (degrees of freedom) helps improve generalization
• Regularization
• Preprocess the data to reduce the dimensionality of the input:
– Train a neural network with output the same as the input, but with fewer hidden neurons than the number of inputs
– Use the hidden-layer outputs as inputs to a second network that does the function approximation

**Some Useful Tricks**

• The choice of an appropriate error function is critical – do not blindly minimize the sum-squared error; there are many cases where other criteria are appropriate
• Example:

E_S(W) = Σₚ₌₁ᴾ Σₖ₌₁ᴹ tₖₚ ln( tₖₚ / zₖₚ )

is appropriate for minimizing the distance between the target probability distribution over the M output variables and the probability distribution represented by the network

**Some Useful Tricks**

• Interpreting the outputs as class-conditional probabilities
• Use linear or exponential (but not sigmoid) output nodes:

nₖₚ = Σⱼ₌₀ⁿᴴ wₖⱼ yⱼₚ

linear output: zₖₚ = nₖₚ / Σₗ₌₁ᴹ nₗₚ

exponential output: zₖₚ = e^(nₖₚ) / Σₗ₌₁ᴹ e^(nₗₚ)
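The exponential (softmax) output can be sketched directly; a minimal NumPy version (subtracting the maximum net activation before exponentiating is a standard numerical-stability trick, not part of the slide's formula — it leaves the ratios unchanged):

```python
import numpy as np

def softmax(n):
    # z_k = exp(n_k) / sum_l exp(n_l)
    e = np.exp(n - np.max(n))   # shift for numerical stability; ratios unchanged
    return e / e.sum()

z = softmax(np.array([2.0, 1.0, 0.1]))
```

The outputs are strictly positive and sum to 1, which is what lets them be read as class-conditional probabilities.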

**Other Topics on ANN**

• Constructive/generative neural network learning
• Hopfield networks with Hebbian learning – associative memory, optimization
• Adaptive Resonance Theory (ART) and its variants
• Kohonen’s Self-Organizing Map (SOM)
• …

