# Lecture 18: Generative Models

CS 475: Machine Learning

Raman Arora

April 5, 2017

Slides credit: Greg Shakhnarovich

## Review: Kernel support vector machines

Mercer kernels: $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j)$.

The optimization problem (learning the SVM):

$$\max_{\alpha} \left\{ \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i,j=1}^N \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j) \right\}$$

• Need to compute the kernel matrix for the training data.

Prediction:

$$\hat{y} = \operatorname{sign}\left( \hat{w}_0 + \sum_{\alpha_i > 0} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) \right)$$

• Need to compute $K(\mathbf{x}_i, \mathbf{x})$ for all support vectors $\mathbf{x}_i$.
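The prediction rule only needs kernel evaluations against the support vectors. A minimal numpy sketch of this rule; the RBF kernel is used, and the toy support vectors, multipliers, and bias below are hypothetical, not from the slides:

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / sigma^2)
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def svm_predict(x, support_x, support_alpha, support_y, w0, sigma=1.0):
    # yhat = sign( w0 + sum over support vectors of alpha_i * y_i * K(x_i, x) )
    s = w0 + sum(a * y * rbf_kernel(xi, x, sigma)
                 for xi, a, y in zip(support_x, support_alpha, support_y))
    return 1 if s >= 0 else -1

# Toy example: two support vectors with hypothetical alpha values.
sv_x = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
alpha = [1.0, 1.0]
labels = [+1, -1]
print(svm_predict(np.array([0.1, 0.1]), sv_x, alpha, labels, w0=0.0))  # near the +1 SV
print(svm_predict(np.array([1.9, 2.1]), sv_x, alpha, labels, w0=0.0))  # near the -1 SV
```

With an RBF kernel, points close to a support vector are dominated by that support vector's term, which is why the two test points land on opposite sides.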

## Review: Representer theorem

Consider the optimization problem

$$\mathbf{w}^* = \operatorname*{argmin}_{\mathbf{w}} \|\mathbf{w}\|^2 \quad \text{s.t.} \quad y_i(\mathbf{w} \cdot \mathbf{x}_i + w_0) \ge 1 \;\; \forall i.$$

Theorem: the solution can be represented as

$$\mathbf{w}^* = \sum_{i=1}^N \alpha_i \mathbf{x}_i.$$

This is the "magic" behind support vector machines!

## Review: Representer theorem, proof I

Let $\mathbf{w}^* = \mathbf{w}_X + \mathbf{w}_\perp$, where $\mathbf{w}_X = \sum_{i=1}^N \alpha_i \mathbf{x}_i \in \operatorname{Span}(\mathbf{x}_1, \ldots, \mathbf{x}_N)$ and $\mathbf{w}_\perp \notin \operatorname{Span}(\mathbf{x}_1, \ldots, \mathbf{x}_N)$, so that $\mathbf{w}_\perp \cdot \mathbf{x}_i = 0$ for all $i = 1, \ldots, N$.

For all $\mathbf{x}_i$ we have

$$\mathbf{w}^* \cdot \mathbf{x}_i = \mathbf{w}_X \cdot \mathbf{x}_i + \mathbf{w}_\perp \cdot \mathbf{x}_i = \mathbf{w}_X \cdot \mathbf{x}_i,$$

therefore $y_i(\mathbf{w}^* \cdot \mathbf{x}_i + w_0) \ge 1 \Rightarrow y_i(\mathbf{w}_X \cdot \mathbf{x}_i + w_0) \ge 1$: we have a solution $\mathbf{w}_X$ that satisfies all the constraints.

## Review: Representer theorem, proof II

Now suppose $\mathbf{w}_\perp \ne \mathbf{0}$. Since $\mathbf{w}_X \cdot \mathbf{w}_\perp = 0$,

$$\|\mathbf{w}^*\|^2 = \mathbf{w}^* \cdot \mathbf{w}^* = (\mathbf{w}_X + \mathbf{w}_\perp) \cdot (\mathbf{w}_X + \mathbf{w}_\perp) = \underbrace{\mathbf{w}_X \cdot \mathbf{w}_X}_{\|\mathbf{w}_X\|^2} + \underbrace{\mathbf{w}_\perp \cdot \mathbf{w}_\perp}_{\|\mathbf{w}_\perp\|^2},$$

and so $\|\mathbf{w}_X\|^2 < \|\mathbf{w}_X\|^2 + \|\mathbf{w}_\perp\|^2 = \|\mathbf{w}^*\|^2$. This contradicts the optimality of $\mathbf{w}^*$; hence $\mathbf{w}^* = \mathbf{w}_X$. QED

## Review: Kernel SVM in the primal

Recall: $\hat{y} = \operatorname{sign}\left( \hat{w}_0 + \sum_{\alpha_i > 0} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) \right)$.

We cannot write $\mathbf{w}$ explicitly; instead, we optimize over $\alpha$. How can we write the regularizer?

$$\|\mathbf{w}\|^2 = \mathbf{w} \cdot \mathbf{w} = \left[ \sum_i \alpha_i y_i \phi(\mathbf{x}_i) \right] \cdot \left[ \sum_j \alpha_j y_j \phi(\mathbf{x}_j) \right] = \sum_{i,j=1}^N \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j).$$

Substituting this expression for the regularizer (and writing the loss through $K$ as well) gives a learning objective in $\alpha$ alone.

## Review: Kernels

Representer theorem:

$$\mathbf{w}^* = \operatorname*{argmin}_{\mathbf{w}} \|\mathbf{w}\|^2 \;\; \text{s.t.} \;\; y_i(\mathbf{w} \cdot \mathbf{x}_i + w_0) \ge 1 \;\; \forall i \quad \Rightarrow \quad \mathbf{w}^* = \sum_{i=1}^N \alpha_i \mathbf{x}_i$$

Polynomial kernel (includes linear): $K(\mathbf{x}_i, \mathbf{x}_j; c, d) = (c + \mathbf{x}_i \cdot \mathbf{x}_j)^d$

RBF kernel: $K(\mathbf{x}_i, \mathbf{x}_j; \sigma) = \exp\left( -\frac{1}{\sigma^2} \|\mathbf{x}_i - \mathbf{x}_j\|^2 \right)$

## Review: SVM with RBF (Gaussian) kernels

Data are linearly separable in the (infinite-dimensional) feature space. We don't need to explicitly compute dot products in that feature space: instead we simply evaluate the RBF kernel.
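A Mercer kernel must produce a symmetric positive semidefinite kernel matrix on any dataset. A small numpy sketch checking this for the two kernels above; the helper names and the random test data are made up for illustration:

```python
import numpy as np

def poly_kernel_matrix(X, c=1.0, d=2):
    # K[i, j] = (c + x_i . x_j)^d
    return (c + X @ X.T) ** d

def rbf_kernel_matrix(X, sigma=1.0):
    # K[i, j] = exp(-||x_i - x_j||^2 / sigma^2)
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / sigma ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))

for K in (poly_kernel_matrix(X), rbf_kernel_matrix(X)):
    assert np.allclose(K, K.T)                  # symmetric
    assert np.linalg.eigvalsh(K).min() > -1e-9  # all eigenvalues >= 0 (up to round-off)
print("both kernel matrices are symmetric PSD")
```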

## Review: SVM with RBF kernels: geometry

Positive margin: level set $\{\mathbf{x} : \sum \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) = 1\}$; negative margin: level set $\{\mathbf{x} : \sum \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) = -1\}$.

## Review: SVM regression

The key ideas: the $\epsilon$-insensitive loss and the $\epsilon$-tube.

[Figure: the $\epsilon$-insensitive loss $\ell(z)$, flat for $|z| \le \epsilon$, and the $\epsilon$-tube $y(x) \pm \epsilon$ with slack $\xi > 0$ above the tube and $\tilde{\xi} > 0$ below it.]

Two sets of slack variables:

$$y_i \le f(\mathbf{x}_i) + \epsilon + \xi_i, \qquad y_i \ge f(\mathbf{x}_i) - \epsilon - \tilde{\xi}_i, \qquad \xi_i \ge 0, \; \tilde{\xi}_i \ge 0.$$

Optimization:

$$\min \; C \sum_i \left( \xi_i + \tilde{\xi}_i \right) + \frac{1}{2} \|\mathbf{w}\|^2$$
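The $\epsilon$-insensitive loss itself can be sketched in a few lines; `eps_insensitive_loss` is an illustrative name, and the sample values are hypothetical:

```python
import numpy as np

def eps_insensitive_loss(y, f, eps=0.5):
    # l(z) = max(0, |y - f| - eps): exactly zero inside the eps-tube
    return np.maximum(0.0, np.abs(y - f) - eps)

print(eps_insensitive_loss(1.0, 1.2))  # inside the tube -> 0.0
print(eps_insensitive_loss(1.0, 2.0))  # 0.5 outside the tube -> 0.5
```

The positive part of this loss above the tube is the slack $\xi_i$, and below the tube it is $\tilde{\xi}_i$.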

The multi-class SVM extension is next.

## SVM with more than two classes

Some classifiers are "natively multiclass", e.g. logistic regression and decision trees. With any natively binary classifier (AdaBoost, SVM), our options for $C > 2$ classes include:

• One-vs-all: build $C$ classifiers; need to reconcile them, which is easy if we have calibrated $p(y \mid \mathbf{x})$.
• One-vs-one: build $\binom{C}{2}$ classifiers; need to reconcile them, which is more problematic since we can have inconsistencies.
• Build some sort of "tournament", or a class tree: often the most efficient, but how should we build the tree?
• Extend to multi-class by modifying the machinery: softmax is such an extension of logistic regression.

## Multiclass SVM: setup

There have been many attempts to generalize the SVM to multiple classes; we will follow the one due to Crammer and Singer (2000).

Basic idea: for $C$ classes, learn $\mathbf{w}_c$ for $c = 1, \ldots, C$, and predict

$$\hat{y}(\mathbf{x}, \mathbf{w}_1, \ldots, \mathbf{w}_C) = \operatorname*{argmax}_c \; \mathbf{w}_c \cdot \mathbf{x}.$$

We can stack the $\mathbf{w}_c$s into the rows of a matrix $\mathbf{W}$.

Notation: $\delta_{a,b} = 1$ iff $a = b$, otherwise 0.

Empirical 0/1 loss on $(\mathbf{x}, y)$: $[\![\hat{y}(\mathbf{x}, \mathbf{W}) \ne y]\!]$.

Surrogate loss on $(\mathbf{x}, y)$:

$$\max_r \left( \mathbf{w}_r^T \mathbf{x} + 1 - \delta_{y,r} \right) - \mathbf{w}_y^T \mathbf{x}$$
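The basic prediction rule, with the per-class weight vectors stacked as rows of $\mathbf{W}$, can be sketched as follows (the toy $\mathbf{W}$ and inputs are assumed for illustration):

```python
import numpy as np

# Stack the per-class weight vectors w_c as rows of W (here C = 3 classes, d = 2 features).
W = np.array([[ 1.0,  0.0],   # w_1
              [ 0.0,  1.0],   # w_2
              [-1.0, -1.0]])  # w_3

def predict(W, x):
    # yhat = argmax_c  w_c . x  (class indices 0..C-1 here)
    return int(np.argmax(W @ x))

print(predict(W, np.array([2.0, 0.5])))  # scores [2.0, 0.5, -2.5] -> class 0
print(predict(W, np.array([0.0, 2.0])))  # scores [0.0, 2.0, -2.0] -> class 1
```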

## Multiclass SVM: optimization

The surrogate loss is a bound on the 0/1 loss:

$$\frac{1}{N} \sum_i [\![\hat{y}(\mathbf{x}_i, \mathbf{W}) \ne y_i]\!] \;\le\; \frac{1}{N} \sum_i \left[ \max_r \left( \mathbf{w}_r^T \mathbf{x}_i + 1 - \delta_{y_i,r} \right) - \mathbf{w}_{y_i}^T \mathbf{x}_i \right]$$

Proceed as in the (separable) SVM: we want to find the lowest-norm solution that achieves a 1-margin:

$$\min_{\mathbf{W}} \; \|\mathbf{W}\|_2^2 \quad \text{s.t.} \quad \mathbf{w}_{y_i}^T \mathbf{x}_i - \mathbf{w}_c^T \mathbf{x}_i \ge 1 \quad \forall i, \; c \ne y_i,$$

where $\|\mathbf{W}\|_2^2$ is the (squared) Frobenius norm of $\mathbf{W}$.

## Multiclass SVM: soft constraint version

General (non-separable) case:

$$\min_{\mathbf{W}} \; \frac{\lambda}{2} \|\mathbf{W}\|_2^2 + \sum_i \xi_i \quad \text{s.t.} \quad \mathbf{w}_{y_i}^T \mathbf{x}_i - \mathbf{w}_c^T \mathbf{x}_i \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad \forall i, \; c \ne y_i.$$

Introducing Lagrange multipliers $\alpha_{i,r}$:

$$\min_{\mathbf{W}, \xi} \; \max_{\alpha} \; \frac{\lambda}{2} \sum_c \|\mathbf{w}_c\|_2^2 + \sum_i \xi_i + \sum_i \sum_r \alpha_{i,r} \left[ (\mathbf{w}_r^T - \mathbf{w}_{y_i}^T)\, \mathbf{x}_i + 1 - \delta_{y_i,r} - \xi_i \right] \quad \text{s.t.} \quad \alpha_{i,r} \ge 0 \;\; \forall i, r.$$
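The surrogate can be evaluated directly; because the $\max$ ranges over all $r$ and subtracts $\delta_{y,r}$, the loss is 0 for a correct prediction with margin at least 1 and exceeds 1 for a misclassification. A numpy sketch with toy weights and labels, assumed for illustration:

```python
import numpy as np

def cs_surrogate(W, x, y):
    # max_r (w_r.x + 1 - delta_{y,r}) - w_y.x, where delta_{y,r} = 1 iff r == y
    scores = W @ x
    delta = np.zeros(len(scores))
    delta[y] = 1.0
    return float(np.max(scores + 1.0 - delta) - scores[y])

W = np.array([[1.0, 0.0],
              [0.0, 1.0]])
x = np.array([2.0, 0.5])
print(cs_surrogate(W, x, y=0))  # correctly classified with margin > 1 -> 0.0
print(cs_surrogate(W, x, y=1))  # misclassified -> 2.5
```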

## Generative models: optimal classification

Reminder: the expected classification error is minimized by

$$h(\mathbf{x}) = \operatorname*{argmax}_c \; p(y = c \mid \mathbf{x}) = \operatorname*{argmax}_c \; \frac{p(\mathbf{x} \mid y = c)\, p(y = c)}{p(\mathbf{x})}.$$

The Bayes classifier:

$$\begin{aligned}
h^*(\mathbf{x}) &= \operatorname*{argmax}_c \; \frac{p(\mathbf{x} \mid y = c)\, p(y = c)}{p(\mathbf{x})} \\
&= \operatorname*{argmax}_c \; p(\mathbf{x} \mid y = c)\, p(y = c) \\
&= \operatorname*{argmax}_c \; \left\{ \log p(\mathbf{x} \mid y = c) + \log p(y = c) \right\}.
\end{aligned}$$

Note: $p(\mathbf{x}) = \sum_c p(\mathbf{x}, y = c)$ is equal for all $c$, and can be ignored.
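A minimal sketch of the Bayes classifier in the log form above, for two 1-D Gaussian classes; the means, variances, and priors are hypothetical, not from the slides:

```python
import math

def log_gauss(x, mu, sigma):
    # log N(x; mu, sigma^2)
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

classes = {1: (-2.0, 1.0, 0.5),  # class -> (mu, sigma, prior)
           2: ( 2.0, 1.0, 0.5)}

def bayes_classify(x):
    # h*(x) = argmax_c  log p(x | y=c) + log p(y=c)
    return max(classes, key=lambda c: log_gauss(x, classes[c][0], classes[c][1])
                                      + math.log(classes[c][2]))

print(bayes_classify(-1.5))  # closer to the mean of class 1
print(bayes_classify(3.0))   # closer to the mean of class 2
```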

## Bayes risk

The risk (probability of error) of the Bayes classifier $h^*$ is called the Bayes risk $R^*$:

$$R^* = 1 - \int_{\mathbf{x}} \max_c \, \left\{ p(\mathbf{x} \mid y = c)\, p(y = c) \right\} d\mathbf{x}$$

This is the minimal achievable risk for the given classification problem, with any classifier! In a sense, $R^*$ measures the inherent difficulty of the problem.

[Figure: two class densities $p(x, y)$ for $y = 1$ and $y = 2$ over $x \in [-10, 10]$, with the regions where $h^*(x) = 1$ and $h^*(x) = 2$ marked.]

## Discriminant functions

We can construct, for each class $c$, a discriminant function $\delta_c(\mathbf{x})$,

$$\delta_c(\mathbf{x}) = \log p(\mathbf{x} \mid y = c) + \log p(y = c),$$

such that $h^*(\mathbf{x}) = \operatorname*{argmax}_c \delta_c(\mathbf{x})$.

We can simplify $\delta_c$ by removing terms and factors common to all $c$, since they won't affect the decision boundary. For example, if $p(y = c) = 1/C$ for all $c$, we can drop the prior term: $\delta_c(\mathbf{x}) = \log p(\mathbf{x} \mid y = c)$.
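$R^*$ can be approximated numerically in one dimension. A sketch assuming two equal-variance Gaussian classes with equal priors (all parameter values hypothetical); in this symmetric case the optimal threshold is the midpoint, giving a closed form to check against:

```python
import math

# Numerically approximate R* = 1 - integral of max_c p(x | y=c) p(y=c) dx
# for two 1-D Gaussian classes, by left Riemann summation.
def gauss(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

mus, priors, sigma = (-1.0, 1.0), (0.5, 0.5), 1.0

n, lo, hi = 20000, -10.0, 10.0
dx = (hi - lo) / n
integral = sum(max(p * gauss(lo + i * dx, mu, sigma) for mu, p in zip(mus, priors))
               for i in range(n)) * dx
R_star = 1.0 - integral

# Sanity check: with equal priors and equal variances the optimal threshold is the
# midpoint 0, so R* = Phi(-|mu_2 - mu_1| / (2 sigma)) = Phi(-1), about 0.1587.
closed_form = 0.5 * (1 + math.erf(-1.0 / math.sqrt(2)))
print(round(R_star, 4), round(closed_form, 4))
```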

## Two-category case

In the case of two classes $y \in \{\pm 1\}$, the Bayes classifier is

$$h^*(\mathbf{x}) = \operatorname*{argmax}_{c = \pm 1} \delta_c(\mathbf{x}) = \operatorname{sign}\left( \delta_{+1}(\mathbf{x}) - \delta_{-1}(\mathbf{x}) \right).$$

The decision boundary is given by $\delta_{+1}(\mathbf{x}) - \delta_{-1}(\mathbf{x}) = 0$.

• Sometimes $f(\mathbf{x}) = \delta_{+1}(\mathbf{x}) - \delta_{-1}(\mathbf{x})$ is referred to as a discriminant function.

With equal priors, this is equivalent to the (log-)likelihood ratio test:

$$h^*(\mathbf{x}) = \operatorname{sign}\left[ \log \frac{p(\mathbf{x} \mid y = +1)}{p(\mathbf{x} \mid y = -1)} \right].$$

## Equal covariance Gaussian case

Consider the case of $p_c(\mathbf{x}) = \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_c, \boldsymbol{\Sigma})$, with an equal prior for all classes:

$$\begin{aligned}
\delta_k(\mathbf{x}) &= \log p(\mathbf{x} \mid y = k) \\
&= \underbrace{-\tfrac{d}{2}\log(2\pi) - \tfrac{1}{2} \log |\boldsymbol{\Sigma}|}_{\text{same for all } k} - \tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_k) \\
&\propto \text{const} \underbrace{- \; \mathbf{x}^T \boldsymbol{\Sigma}^{-1} \mathbf{x}}_{\text{same for all } k} + \boldsymbol{\mu}_k^T \boldsymbol{\Sigma}^{-1} \mathbf{x} + \mathbf{x}^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_k - \boldsymbol{\mu}_k^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_k.
\end{aligned}$$

Now consider two classes $r$ and $q$:

$$\delta_r(\mathbf{x}) \propto 2\boldsymbol{\mu}_r^T \boldsymbol{\Sigma}^{-1} \mathbf{x} - \boldsymbol{\mu}_r^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_r, \qquad \delta_q(\mathbf{x}) \propto 2\boldsymbol{\mu}_q^T \boldsymbol{\Sigma}^{-1} \mathbf{x} - \boldsymbol{\mu}_q^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_q.$$

## Linear discriminant

Two-class discriminants:

$$\delta_r(\mathbf{x}) - \delta_q(\mathbf{x}) = 2\boldsymbol{\mu}_r^T \boldsymbol{\Sigma}^{-1} \mathbf{x} - \boldsymbol{\mu}_r^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_r - 2\boldsymbol{\mu}_q^T \boldsymbol{\Sigma}^{-1} \mathbf{x} + \boldsymbol{\mu}_q^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_q = \mathbf{w}^T \mathbf{x} + w_0$$

If we know what $\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_C$ and $\boldsymbol{\Sigma}$ are, we can compute the optimal $\mathbf{w}$, $w_0$ directly. What should we do when we don't know the Gaussians?

## Generative models for classification

In generative models one explicitly models $p(\mathbf{x}, y)$ or, equivalently, $p(\mathbf{x} \mid y = c)$ and $p(y = c)$, to derive discriminants.

• Typically, the model imposes a certain parametric form on the assumed distributions, and requires estimation of the parameters from data.
• Most popular: Gaussian for continuous data, multinomial for discrete data.
• We will see non-parametric models later in this class.
• Often, the classifier is OK even if the data clearly don't conform to the assumptions.

[Figure: three 2D datasets with Gaussian class-conditional models fit to each class.]
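Given known means and a shared covariance, $\mathbf{w}$ and $w_0$ follow directly from the difference of discriminants above: $\mathbf{w} = 2\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_r - \boldsymbol{\mu}_q)$ and $w_0 = -\boldsymbol{\mu}_r^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_r + \boldsymbol{\mu}_q^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_q$. A numpy sketch with hypothetical means and identity covariance:

```python
import numpy as np

mu_r = np.array([2.0, 0.0])
mu_q = np.array([0.0, 2.0])
Sigma = np.eye(2)

Sinv = np.linalg.inv(Sigma)
w = 2 * Sinv @ (mu_r - mu_q)                          # w = 2 Sigma^-1 (mu_r - mu_q)
w0 = -(mu_r @ Sinv @ mu_r) + (mu_q @ Sinv @ mu_q)     # w0 from the quadratic terms

def f(x):
    # f(x) > 0 -> class r, f(x) < 0 -> class q
    return w @ x + w0

print(f(np.array([3.0, 0.0])) > 0)  # near mu_r -> True
print(f(np.array([0.0, 3.0])) > 0)  # near mu_q -> False
```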

## Gaussians with unequal covariances

What if we remove the restriction that $\boldsymbol{\Sigma}_c = \boldsymbol{\Sigma}$ for all $c$? We compute the ML estimates of $\boldsymbol{\mu}_c$, $\boldsymbol{\Sigma}_c$ for each $c$, and get discriminants (and decision boundaries) that are quadratic in $\mathbf{x}$:

$$\delta_c(\mathbf{x}) = -\frac{1}{2} \mathbf{x}^T \boldsymbol{\Sigma}_c^{-1} \mathbf{x} + \boldsymbol{\mu}_c^T \boldsymbol{\Sigma}_c^{-1} \mathbf{x} + \text{const}_c$$

Decision boundary with two classes: $\delta_1(\mathbf{x}) - \delta_0(\mathbf{x}) = 0$.

## Quadratic decision boundaries

What do quadratic boundaries look like in 2D? Second-degree curves can be any conic section. Can all of these arise from two Gaussian classes?
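A numpy sketch of the quadratic discriminant, assuming equal priors and hypothetical means and covariances. With equal means but unequal (isotropic) covariances, the boundary $\delta_1 - \delta_0 = 0$ is a circle, one of the conic sections just mentioned:

```python
import numpy as np

def delta(x, mu, Sigma):
    # delta_c(x) = -0.5 x' S^-1 x + mu' S^-1 x + const_c,
    # where const_c = -0.5 mu' S^-1 mu - 0.5 log|S| (equal priors assumed)
    Sinv = np.linalg.inv(Sigma)
    return (-0.5 * x @ Sinv @ x + mu @ Sinv @ x
            - 0.5 * mu @ Sinv @ mu - 0.5 * np.log(np.linalg.det(Sigma)))

mu0, S0 = np.array([0.0, 0.0]), np.eye(2)
mu1, S1 = np.array([0.0, 0.0]), 4.0 * np.eye(2)  # same mean, wider covariance

# Points near the origin go to the tight Gaussian, far points to the wide one.
print(delta(np.array([0.5, 0.0]), mu0, S0) > delta(np.array([0.5, 0.0]), mu1, S1))
print(delta(np.array([4.0, 0.0]), mu0, S0) > delta(np.array([4.0, 0.0]), mu1, S1))
```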

[Figure: examples of quadratic decision boundaries arising from pairs of Gaussian classes; from Duda, Hart & Stork.]

## Mixture models

So far, we have assumed that each class has a single coherent model.

[Figure: a 2D dataset in which the examples form several distinct clusters.]

What if the examples (within the same class) are from a number of distinct "types"?

## Examples

• Images of the same person under different conditions: with/without glasses, different views, different expressions.
• Images of the same category but different sorts of objects: chairs with/without armrests.
• Different ways of pronouncing the same phonemes.
• Multiple topics within the same document.

## Mixture models: assumptions

• $k$ underlying types (components).
• $y_i$ is the identity of the component "responsible" for $\mathbf{x}_i$.
• $y_i$ is a hidden (latent) variable: never observed.

A mixture model:

$$p(\mathbf{x}; \boldsymbol{\pi}) = \sum_{c=1}^k p(y = c)\, p(\mathbf{x} \mid y = c),$$

where $\pi_c = p(y = c)$ are the mixing probabilities. We need to parametrize the component densities $p(\mathbf{x} \mid y = c)$.

## Parametric mixtures

Suppose that the parameters of the $c$-th component are $\boldsymbol{\theta}_c$. Then we can denote $\boldsymbol{\theta} = [\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_k]$ and write

$$p(\mathbf{x}; \boldsymbol{\theta}, \boldsymbol{\pi}) = \sum_{c=1}^k \pi_c \cdot p(\mathbf{x}; \boldsymbol{\theta}_c).$$

Any valid setting of $\boldsymbol{\theta}$ and $\boldsymbol{\pi}$, subject to $\sum_{c=1}^k \pi_c = 1$, produces a valid pdf.

Example: mixture of Gaussians. If the $c$-th component is a Gaussian, $p(\mathbf{x} \mid y = c) = \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c)$, then

$$p(\mathbf{x}; \boldsymbol{\theta}, \boldsymbol{\pi}) = \sum_{c=1}^k \pi_c \cdot \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c),$$

where $\boldsymbol{\theta} = [\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\Sigma}_k]$.

[Figure: $0.7 \times$ (one Gaussian density) $+ \; 0.3 \times$ (another) $=$ a two-component mixture density.]

## Generative model for a mixture

The generative process with a $k$-component mixture:

• The parameters $\boldsymbol{\theta}_c$ for each component $c$ are fixed.
• Draw $y_i \sim [\pi_1, \ldots, \pi_k]$.
• Given $y_i$, draw $\mathbf{x}_i \sim p\left(\mathbf{x} \mid y_i; \boldsymbol{\theta}_{y_i}\right)$.

The entire generative model for $\mathbf{x}$ and $y$:

$$p(\mathbf{x}, y; \boldsymbol{\theta}, \boldsymbol{\pi}) = p(y; \boldsymbol{\pi}) \cdot p(\mathbf{x} \mid y; \boldsymbol{\theta}_y)$$

Any data point $\mathbf{x}_i$ could have been generated in $k$ ways.
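The two-step generative process and the mixture density can be sketched in numpy for a 1-D Gaussian mixture; the mixing weights and component parameters below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
pis = np.array([0.7, 0.3])     # mixing probabilities pi_c
mus = np.array([-2.0, 3.0])    # component means
sigmas = np.array([1.0, 0.5])  # component standard deviations

def sample(n):
    # Draw y_i ~ pi, then x_i ~ N(mu_{y_i}, sigma_{y_i}^2).
    y = rng.choice(len(pis), size=n, p=pis)
    x = rng.normal(mus[y], sigmas[y])
    return x, y

def density(x):
    # p(x) = sum_c pi_c N(x; mu_c, sigma_c^2)
    comp = np.exp(-(x - mus) ** 2 / (2 * sigmas ** 2)) / (sigmas * np.sqrt(2 * np.pi))
    return float(pis @ comp)

x, y = sample(10000)
print(np.mean(y == 0))               # should be close to pi_0 = 0.7
print(density(-2.0) > density(0.5))  # the mixture is denser near a component mean
```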

## Likelihood of a mixture model

Idea: estimate the set of parameters that maximizes the likelihood of the observed data. The log-likelihood of $\boldsymbol{\pi}, \boldsymbol{\theta}$:

$$\log p(\mathbf{X}; \boldsymbol{\pi}, \boldsymbol{\theta}) = \sum_{i=1}^N \log \sum_{c=1}^k \pi_c\, \mathcal{N}(\mathbf{x}_i; \boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c).$$

There is no closed-form solution because of the sum inside the log:

• We need to take into account all possible components that could have generated $\mathbf{x}_i$.
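The log-likelihood can still be evaluated as written; a common trick is log-sum-exp for numerical stability of the inner sum. A 1-D numpy sketch with hypothetical parameter values:

```python
import numpy as np

def log_likelihood(X, pis, mus, sigmas):
    # log N(x_i; mu_c, s_c^2), one row per point, one column per component
    lg = (-0.5 * np.log(2 * np.pi * sigmas ** 2)
          - (X[:, None] - mus[None, :]) ** 2 / (2 * sigmas ** 2))
    a = np.log(pis)[None, :] + lg        # log(pi_c) + log N(...)
    m = a.max(axis=1, keepdims=True)     # log-sum-exp over components, per point
    return float(np.sum(m[:, 0] + np.log(np.exp(a - m).sum(axis=1))))

X = np.array([-2.1, -1.9, 3.0])
pis = np.array([0.7, 0.3])
mus = np.array([-2.0, 3.0])
sigmas = np.array([1.0, 0.5])
ll = log_likelihood(X, pis, mus, sigmas)
print(ll < 0.0)  # log of weighted densities; negative for these points
```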