
Max Welling

Donald Bren School of Information and Computer Science

University of California Irvine

April 20, 2010


Contents

Preface

Learning and Intuition

1 Data and Information
1.1 Data Representation
1.2 Preprocessing the Data

2 Data Visualization

3 Learning
3.1 In a Nutshell

4 Types of Machine Learning
4.1 In a Nutshell

5 Nearest Neighbors Classification
5.1 The Idea In a Nutshell

6 The Naive Bayesian Classifier
6.1 The Naive Bayes Model
6.2 Learning a Naive Bayes Classifier
6.3 Class-Prediction for New Instances
6.4 Regularization
6.5 Remarks
6.6 The Idea In a Nutshell

7 The Perceptron
7.1 The Perceptron Model
7.2 A Different Cost Function: Logistic Regression
7.3 The Idea In a Nutshell

8 Support Vector Machines
8.1 The Non-Separable Case

9 Support Vector Regression

10 Kernel Ridge Regression
10.1 Kernel Ridge Regression
10.2 An Alternative Derivation

11 Kernel K-means and Spectral Clustering

12 Kernel Principal Components Analysis
12.1 Centering Data in Feature Space

13 Fisher Linear Discriminant Analysis
13.1 Kernel Fisher LDA
13.2 A Constrained Convex Programming Formulation of FDA

14 Kernel Canonical Correlation Analysis
14.1 Kernel CCA

A Essentials of Convex Optimization
A.1 Lagrangians and all that

B Kernel Design
B.1 Polynomial Kernels
B.2 All Subsets Kernel
B.3 The Gaussian Kernel

Preface

In the winter quarter of 2007 I taught an undergraduate course in machine learning at UC Irvine. While I had been teaching machine learning at the graduate level, it soon became clear that teaching the same material to an undergraduate class was a whole new challenge. Much of machine learning is built upon concepts from mathematics such as partial derivatives, eigenvalue decompositions, multivariate probability densities and so on. I quickly found that these concepts could not be taken for granted at an undergraduate level. The situation was aggravated by the lack of a suitable textbook. Excellent textbooks do exist for this field, but I found all of them too technical for a first encounter with machine learning. This experience led me to believe there was a genuine need for a simple, intuitive introduction to the concepts of machine learning. A first read to whet the appetite, so to speak, a prelude to the more technical and advanced textbooks. Hence, the book you see before you is meant for those starting out in the field who need a simple, intuitive explanation of some of the most useful algorithms that our field has to offer.

Machine learning is a relatively recent discipline that emerged from the general field of artificial intelligence. To build intelligent machines, researchers realized that these machines should learn from and adapt to their environment. It is simply too costly and impractical to design intelligent systems by first gathering all the expert knowledge ourselves and then hard-wiring it into a machine. For instance, after many years of intense research we can now recognize faces in images to a high degree of accuracy. But the world has approximately 30,000 visual object categories according to some estimates (Biederman). Should we invest the same effort to build good classifiers for monkeys, chairs, pencils, axes etc., or should we build systems that can observe millions of training images, some with labels (e.g., these pixels in the image correspond to a car) but most of them without side information? Although there is currently no system which can recognize even on the order of 1,000 object categories (the best systems get about 60% correct on 100 categories), the fact that we pull it off seemingly effortlessly serves as a "proof of concept" that it can be done. But there is no doubt in my mind that building truly intelligent machines will involve learning from data.

The ﬁrst reason for the recent successes of machine learning and the growth of

the ﬁeld as a whole is rooted in its multidisciplinary character. Machine learning

emerged from AI but quickly incorporated ideas from ﬁelds as diverse as statis-

tics, probability, computer science, information theory, convex optimization, con-

trol theory, cognitive science, theoretical neuroscience, physics and more. To

give an example, the main conference in this field is called "Advances in Neural Information Processing Systems", referring to information theory, theoretical neuroscience and cognitive science.

The second, perhaps more important reason for the growth of machine learn-

ing is the exponential growth of both available data and computer power. While

the field is built on theory and tools developed in statistics, machine learning recognizes that the most exciting progress can be made by leveraging the enormous flood of data that is generated each year by satellites, sky observatories, particle accelerators, the human genome project, banks, the stock market, the army, seismic measurements, the internet, video, scanned text and so on. It is difficult to appreciate the exponential growth of data that our society is generating. To give an example, a modern satellite generates roughly the same amount of data as all previous satellites produced together. This insight has shifted the attention from

highly sophisticated modeling techniques on small datasets to more basic analy-

sis on much larger data-sets (the latter sometimes called data-mining). Hence the

emphasis shifted to algorithmic efﬁciency and as a result many machine learning

faculty (like myself) can typically be found in computer science departments. To

give some examples of recent successes of this approach one would only have

to turn on one computer and perform an internet search. Modern search engines

do not run terribly sophisticated algorithms, but they manage to store and sift

through almost the entire content of the internet to return sensible search results.

There has also been much success in the ﬁeld of machine translation, not because

a new model was invented but because many more translated documents became

available.

The ﬁeld of machine learning is multifaceted and expanding fast. To sample

a few sub-disciplines: statistical learning, kernel methods, graphical models, ar-

tiﬁcial neural networks, fuzzy logic, Bayesian methods and so on. The ﬁeld also

covers many types of learning problems, such as supervised learning, unsuper-

vised learning, semi-supervised learning, active learning, reinforcement learning

etc. I will only cover the most basic approaches in this book from a highly per-


sonal perspective. Instead of trying to cover all aspects of the entire ﬁeld I have

chosen to present a few popular and perhaps useful tools and approaches. But

what will (hopefully) be significantly different from most other scientific books is

the manner in which I will present these methods. I have always been frustrated

by the lack of proper explanation of equations. Many times I have been staring at

a formula having not the slightest clue where it came from or how it was derived.

Many books also excel in stating facts in an almost encyclopedic style, without

providing the proper intuition of the method. This is my primary mission: to write

a book which conveys intuition. The ﬁrst chapter will be devoted to why I think

this is important.


This book was written during my sabbatical at the Radboud University in Nijmegen (Netherlands). I would like to thank Hans for discussions on intuition, and Prof. Bert Kappen, who leads an excellent group of postdocs and students, for his hospitality. Marga, kids, UCI,...


Learning and Intuition

We have all experienced the situation that the solution to a problem presents itself

while riding your bike, walking home, “relaxing” in the washroom, waking up in

the morning, taking your shower etc. Importantly, it did not appear while bang-

ing your head against the problem in a conscious effort to solve it, staring at the

equations on a piece of paper. In fact, I would claim that all my bits and pieces of progress have occurred while taking a break and "relaxing out of the problem".

Greek philosophers walked in circles when thinking about a problem; most of us

stare at a computer screen all day. The purpose of this chapter is to make you more

aware of where your creative mind is located and to interact with it in a fruitful

manner.

My general thesis is that contrary to popular belief, creative thinking is not

performed by conscious thinking. It is rather an interplay between your conscious mind, which prepares the seeds to be planted into the unconscious part of your mind. The unconscious mind will munch on the problem "out of sight" and

return promising roads to solutions to the consciousness. This process iterates

until the conscious mind decides the problem is sufﬁciently solved, intractable or

plain dull and moves on to the next. It may be a little unsettling to learn that at

least part of your thinking goes on in a part of your mind that seems inaccessible

and has a very limited interface with what you think of as yourself. But it is un-

deniable that it is there and it is also undeniable that it plays a role in the creative

thought-process.

To become a creative thinker one should learn how to play this game more

effectively. To do so, we should think about the language in which to represent

knowledge that is most effective in terms of communication with the unconscious.

In other words, what type of “interface” between conscious and unconscious mind

should we use? It is probably not a good idea to memorize all the details of a

complicated equation or problem. Instead we should extract the abstract idea and

capture the essence of it in a picture. This could be a movie with colors and other


baroque features or a more "dull" representation, whatever works. Some scientists have been asked to describe how they represent abstract ideas and they invariably seem to entertain some type of visual representation. A beautiful account of this in the case of mathematicians can be found in a marvellous book "XXX" (Hadamard).

By building accurate visual representations of abstract ideas we create a data-

base of knowledge in the unconscious. This collection of ideas forms the basis for

what we call intuition. I often ﬁnd myself listening to a talk and feeling uneasy

about what is presented. The reason seems to be that the abstract idea I am trying

to capture from the talk clashes with a similar idea that is already stored. This in

turn can be a sign that I either misunderstood the idea before and need to update

it, or that there is actually something wrong with what is being presented. In a

similar way I can easily detect that some idea is a small perturbation of what I

already knew (I feel happily bored), or something entirely new (I feel intrigued

and slightly frustrated). While the novice is continuously challenged and often

feels overwhelmed, the more experienced researcher feels at ease 90% of the time

because the "new" idea was already in his/her data-base, which therefore needs no or very little updating.

Somehow our unconscious mind can also manipulate existing abstract ideas

into new ones. This is what we usually think of as creative thinking. One can

stimulate this by seeding the mind with a problem. This is a conscious effort

and is usually a combination of detailed mathematical derivations and building

an intuitive picture or metaphor for the thing one is trying to understand. If you

focus enough time and energy on this process and walk home for lunch you’ll ﬁnd

that you’ll still be thinking about it in a much more vague fashion: you review

and create visual representations of the problem. Then you get your mind off the

problem altogether and when you walk back to work suddenly parts of the solu-

tion surface into consciousness. Somehow, your unconscious took over and kept

working on your problem. The essence is that you created visual representations

as the building blocks for the unconscious mind to work with.

In any case, whatever the details of this process are (and I am no psychologist)

I suspect that any good explanation should include both an intuitive part, including

examples, metaphors and visualizations, and a precise mathematical part where

every equation and derivation is properly explained. This then is the challenge I

have set to myself. It will be your task to insist on understanding the abstract idea

that is being conveyed and build your own personalized visual representations. I

will try to assist in this process but it is ultimately you who will have to do the

hard work.


Many people may ﬁnd this somewhat experimental way to introduce students

to new topics counter-productive. Undoubtedly for many it will be. If you feel

under-challenged and become bored I recommend you move on to the more ad-

vanced text-books of which there are many excellent samples on the market (for

a list see (books)). But I hope that for most beginning students this intuitive style

of writing may help to gain a deeper understanding of the ideas that I will present

in the following. Above all, have fun!


Chapter 1

Data and Information

Data is everywhere in abundant amounts. Surveillance cameras continuously

capture video, every time you make a phone call your name and location gets

recorded, often your clicking pattern is recorded when surﬁng the web, most ﬁ-

nancial transactions are recorded, satellites and observatories generate tera-bytes

of data every year, the FBI maintains a DNA-database of most convicted crimi-

nals, soon all written text from our libraries will be digitized; need I go on?

But data in itself is useless. Hidden inside the data is valuable information.

The objective of machine learning is to pull the relevant information from the data

and make it available to the user. What do we mean by “relevant information”?

When analyzing data we typically have a specific question in mind, such as: "How many types of car can be discerned in this video?" or "What will the weather be next week?". So the answer can take the form of a single number (there are 5 cars), a sequence of numbers (the temperature next week) or a complicated pattern (the

cloud conﬁguration next week). If the answer to our query is itself complex we

like to visualize it using graphs, bar-plots or even little movies. But one should

keep in mind that the particular analysis depends on the task one has in mind.

Let me spell out a few tasks that are typically considered in machine learning:

Prediction: Here we ask ourselves whether we can extrapolate the information

in the data to new unseen cases. For instance, if I have a data-base of attributes

of Hummers such as weight, color, number of people it can hold etc. and another

data-base of attributes of Ferraris, then one can try to predict the type of car

(Hummer or Ferrari) from a new set of attributes. Another example is predicting

the weather (given all the recorded weather patterns in the past, can we predict the

weather next week), or the stock prices.


Interpretation: Here we seek to answer questions about the data. For instance,

what property of this drug was responsible for its high success-rate? Does a secu-

rity officer at the airport apply racial profiling in deciding whose luggage to check?

How many natural groups are there in the data?

Compression: Here we are interested in compressing the original data, i.e. reducing the number of bits needed to represent it. For instance, files on your computer can

be “zipped” to a much smaller size by removing much of the redundancy in those

ﬁles. Also, JPEG and GIF (among others) are compressed representations of the

original pixel-map.

All of the above objectives depend on the fact that there is structure in the

data. If data is completely random there is nothing to predict, nothing to interpret

and nothing to compress. Hence, all tasks are somehow related to discovering

or leveraging this structure. One could say that data is highly redundant and that

this redundancy is exactly what makes it interesting. Take the example of natu-

ral images. If you are required to predict the color of the pixels neighboring some random pixel in an image, you would be able to do a pretty good job (for instance, 20% may be blue sky, and predicting the neighbors of a blue-sky pixel is easy). Also, if we were to generate images at random they would not look like natural scenes at all. For one, they wouldn't contain objects. Only a tiny fraction of

all possible images looks “natural” and so the space of natural images is highly

structured.

Thus, all of these concepts are intimately related: structure, redundancy, pre-

dictability, regularity, interpretability, compressibility. They refer to the “food”

for machine learning: without structure there is nothing to learn. The same thing

is true for human learning. From the day we are born we start noticing that there

is structure in this world. Our survival depends on discovering and recording this

structure. If I walk into this brown cylinder with a green canopy I suddenly stop,

it won’t give way. In fact, it damages my body. Perhaps this holds for all these

objects. When I cry my mother suddenly appears. Our game is to predict the

future accurately, and we predict it by learning its structure.

1.1 Data Representation

What does “data” look like? In other words, what do we download into our com-

puter? Data comes in many shapes and forms, for instance it could be words from

a document or pixels from an image. But it will be useful to convert data into a


standard format so that the algorithms that we will discuss can be applied to it.

Most datasets can be represented as a matrix, $X = [X_{in}]$, with rows indexed by "attribute-index" $i$ and columns indexed by "data-index" $n$. The value $X_{in}$ for attribute $i$ and data-case $n$ can be binary, real, discrete etc., depending on what we measure. For instance, if we measure the weight and color of 100 cars, the matrix $X$ is $2 \times 100$ dimensional and $X_{1,20} = 20{,}684.57$ is the weight of car nr. 20 in some units (a real value) while $X_{2,20} = 2$ is the color of car nr. 20 (say one of 6 predefined colors).

Most datasets can be cast in this form (but not all). For documents, we can give each distinct word of a prespecified vocabulary a number and simply count how often a word was present. Say the word "book" is defined to have number 10,568 in the vocabulary; then $X_{10568,5076} = 4$ would mean: the word "book" appeared 4 times in document 5076. Sometimes the different data-cases do not have the same number of attributes. Consider searching the internet for images of rats. You'll retrieve a large variety of images, most with a different number of pixels. We can either try to rescale the images to a common size or we can simply leave those entries in the matrix empty. It may also occur that a certain entry is supposed to be there but couldn't be measured. For instance, if we run an optical character recognition system on a scanned document some letters will not be recognized. We'll use a question mark "?" to indicate that that entry wasn't observed.
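To make this concrete, here is a minimal sketch in Python/NumPy; the attribute values and the use of NaN as the "?" marker are made-up choices for illustration, not part of the text:

```python
import numpy as np

# Toy data matrix X with rows = attributes, columns = data-cases.
# Attribute 1: weight of the car, attribute 2: color code (one of 6 values).
X = np.array([[20684.57, 1500.0, 2300.5],   # weights of cars 1..3
              [2.0,      5.0,    np.nan]])  # colors; car 3's color is unobserved ("?")

print(X.shape)            # (2, 3): 2 attributes, 3 data-cases
print(X[0, 1])            # weight of car nr. 2
print(np.isnan(X[1, 2]))  # True: this entry plays the role of the question mark
```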

It is very important to realize that there are many ways to represent data and

not all are equally suitable for analysis. By this I mean that in one representation the structure may be obvious while in another representation it may become totally obscure. It is still there, but just harder to find. The algorithms that we will discuss are based on certain assumptions, such as "Hummers and Ferraris can be separated by a line", see figure ??. While this may be true if we measure

weight in kilograms and height in meters, it is no longer true if we decide to re-

code these numbers into bit-strings. The structure is still in the data, but we would

need a much more complex assumption to discover it. A lesson to be learned is

thus to spend some time thinking about in which representation the structure is as

obvious as possible and transform the data if necessary before applying standard

algorithms. In the next section we’ll discuss some standard preprocessing opera-

tions. It is often advisable to visualize the data before preprocessing and analyzing

it. This will often tell you if the structure is a good match for the algorithm you

had in mind for further analysis. Chapter ?? will discuss some elementary visual-

ization techniques.


1.2 Preprocessing the Data

As mentioned in the previous section, algorithms are based on assumptions and

can become more effective if we transform the data ﬁrst. Consider the following

example, depicted in figure ??a. The algorithm we consider consists of estimating the area that the data occupy. It grows a circle starting at the origin and at the point where it contains all the data we record the area of the circle. In the figure we see why this will be a bad estimate: the data-cloud is not centered. If we had first centered it we would have obtained a reasonable estimate. Although this example is somewhat

simple-minded, there are many, much more interesting algorithms that assume

centered data. To center data we will introduce the sample mean of the data, given

by,

$$ E[X]_i = \frac{1}{N} \sum_{n=1}^{N} X_{in} \qquad (1.1) $$

Hence, for every attribute i separately, we simply add all the attribute values across

data-cases and divide by the total number of data-cases. To transform the data so

that their sample mean is zero, we set,

$$ X'_{in} = X_{in} - E[X]_i \quad \forall n \qquad (1.2) $$

It is now easy to check that the sample mean of $X'$ indeed vanishes. An illustra-

tion of the global shift is given in ﬁgure ??b. We also see in this ﬁgure that the

algorithm described above now works much better!

In a similar spirit as centering, we may also wish to scale the data along the

coordinate axes in order to make it more "spherical". Consider figure ??a,b. In

this case the data was ﬁrst centered, but the elongated shape still prevented us

from using the simplistic algorithm to estimate the area covered by the data. The

solution is to scale the axes so that the spread is the same in every dimension. To

deﬁne this operation we ﬁrst introduce the notion of sample variance,

$$ V[X]_i = \frac{1}{N} \sum_{n=1}^{N} X_{in}^2 \qquad (1.3) $$

where we have assumed that the data was ﬁrst centered. Note that this is similar

to the sample mean, but now we have used the square. It is important that we

have removed the sign of the data-cases (by taking the square) because otherwise

positive and negative signs might cancel each other out. By ﬁrst taking the square,

all data-cases first get mapped to the positive half of the axes (for each dimension or attribute separately) and then added and divided by N. You have perhaps noticed that the variance does not have the same units as X itself. If X is measured in grams, then the variance is measured in grams squared. So to scale the data to have the same scale in every dimension we divide by the square-root of the variance, which is usually called the sample standard deviation,

$$ X''_{in} = \frac{X'_{in}}{\sqrt{V[X']_i}} \quad \forall n \qquad (1.4) $$

Note again that sphering requires centering implying that we always have to per-

form these operations in this order, ﬁrst center, then sphere. Figure ??a,b,c illus-

trate this process.
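A minimal sketch of equations (1.1)-(1.4) in Python/NumPy; the toy matrix and the convention rows = attributes, columns = data-cases are assumptions made for illustration:

```python
import numpy as np

X = np.array([[1.0, 2.0, 6.0, 7.0],      # attribute 1 over 4 data-cases
              [10.0, 11.0, 9.0, 14.0]])  # attribute 2 over 4 data-cases

mean = X.mean(axis=1, keepdims=True)                  # E[X]_i, eq. (1.1)
X_centered = X - mean                                 # X'_in,  eq. (1.2)

var = (X_centered ** 2).mean(axis=1, keepdims=True)   # V[X]_i, eq. (1.3)
X_sphered = X_centered / np.sqrt(var)                 # X''_in, eq. (1.4)

print(X_sphered.mean(axis=1))        # ~0 for every attribute: data is centered
print((X_sphered ** 2).mean(axis=1)) # ~1 for every attribute: data is sphered
```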

You may now be asking, "well, what if the data were elongated in a diagonal direction?". Indeed, we can also deal with such a case by first centering, then rotating such that the elongated direction points in the direction of one of the axes, and then scaling. This requires quite a bit more math, and we will postpone this issue until chapter ?? on "principal components analysis". However, the question is in fact a very deep one, because one could argue that one could keep changing the data using more and more sophisticated transformations until all the structure was removed from the data and there would be nothing left to analyze! It is indeed true that the pre-processing steps can be viewed as part of the modeling process in that they identify structure (and then remove it). By remembering the sequence of transformations you performed you have implicitly built a model. Conversely, many algorithms can be easily adapted to model the mean and scale of the data.

Now, the preprocessing is no longer necessary and becomes integrated into the

model.

Just as preprocessing can be viewed as building a model, we can use a model

to transform structured data into (more) unstructured data. The details of this

process will be left for later chapters but a good example is provided by compres-

sion algorithms. Compression algorithms are based on models for the redundancy

in data (e.g. text, images). The compression consists of removing this redundancy and transforming the original data into a less structured or less redundant (and hence more succinct) code. Models and structure-reducing data transformations are in a sense each other's reverse: we often associate with a model an understanding of how the data was generated, starting from random noise. Conversely, pre-processing starts with the data and understands how we can get back to the unstructured random state of the data [FIGURE].

Finally, I will mention one more popular data-transformation technique. Many

algorithms are based on the assumption that data is roughly symmetric around the origin. If the data happens to be strictly positive, it doesn't fit this assumption very well. Taking the following logarithm can help in that case,

$$ X'_{in} = \log(a + X_{in}) \qquad (1.5) $$
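As a one-line sketch of transformation (1.5); the data values and the offset a = 1.0 are arbitrary choices made for illustration:

```python
import numpy as np

X = np.array([[0.3, 12.0, 150.0, 4.2]])  # strictly positive data
a = 1.0                                   # offset, chosen here arbitrarily
X_log = np.log(a + X)                     # eq. (1.5)
print(X_log)
```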

Chapter 2

Data Visualization

The process of data analysis does not just consist of picking an algorithm, ﬁtting

it to the data and reporting the results. We have seen that we need to choose a

representation for the data necessitating data-preprocessing in many cases. De-

pending on the data representation and the task at hand we then have to choose

an algorithm to continue our analysis. But even after we have run the algorithm

and studied the results we are interested in, we may realize that our initial choice of

algorithm or representation may not have been optimal. We may therefore decide

to try another representation/algorithm, compare the results and perhaps combine

them. This is an iterative process.

What may help us in deciding the representation and algorithm for further

analysis? Consider the two datasets in Figure ??. In the left ﬁgure we see that the

data naturally forms clusters, while in the right ﬁgure we observe that the data is

approximately distributed on a line. The left ﬁgure suggests a clustering approach

while the right ﬁgure suggests a dimensionality reduction approach. This illus-

trates the importance of looking at the data before you start your analysis instead

of (literally) blindly picking an algorithm. After your ﬁrst peek, you may decide

to transform the data and then look again to see if the transformed data better suit

the assumptions of the algorithm you have in mind.

"Looking at the data" sounds easier than it really is. The reason is that we are not equipped to think in more than 3 dimensions, while most data lives in much higher dimensions. For instance, image patches of size 10 × 10 live in a 100-dimensional pixel space. How are we going to visualize it? There are many answers to

this problem, but most involve projection: we determine a number of, say, 2 or

3 dimensional subspaces onto which we project the data. The simplest choice of

subspaces are the ones aligned with the features, e.g. we can plot $X_{1n}$ versus $X_{2n}$


etc. An example of such a scatter plot is given in Figure ??.

Note that we have a total of d(d − 1)/2 possible two dimensional projections

which amounts to 4950 projections for 100 dimensional data. This is usually too

many to manually inspect. How do we cut down on the number of dimensions?

Perhaps random projections may work? Unfortunately, that turns out not to be a

great idea in many cases. The reason is that data projected on a random subspace

often looks distributed according to what is known as a Gaussian distribution (see

Figure ??). The deeper reason behind this phenomenon is the central limit theo-

rem which states that the sum of a large number of independent random variables

is (under certain conditions) distributed as a Gaussian distribution. Hence, if we

denote by $w$ a vector in $\mathbb{R}^d$ and by $x$ the $d$-dimensional random variable, then $y = w^T x$ is the value of the projection. This is clearly a weighted sum of the random variables $x_i$, $i = 1, \ldots, d$. If we assume that the $x_i$ are approximately independent, then we can see that their sum will be governed by this central limit theorem. Analogously, a dataset $\{X_{in}\}$ can thus be visualized in one dimension by "histogramming"$^1$ the values of $Y = w^T X$, see Figure ??. In this figure we clearly recognize the characteristic "Bell-shape" of the Gaussian distribution of projected and histogrammed data.

$^1$A histogram is a bar-plot where the height of the bar represents the number of items that had a value located in the interval on the x-axis on which the bar stands (i.e. the base of the bar). If many items have a value around zero, then the bar centered at zero will be very high.
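The bell-shape is easy to reproduce in a few lines; a minimal sketch, where the uniform data, the number of dimensions and the random projection direction are all choices made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 100, 5000
X = rng.uniform(-1.0, 1.0, size=(d, N))   # decidedly non-Gaussian attributes

w = rng.normal(size=d)                    # a random projection direction
Y = w @ X                                 # Y_n = w^T X_n for every data-case

counts, edges = np.histogram(Y, bins=30)  # histogram of the projected values
# The counts typically rise and fall in the familiar bell-shape predicted by
# the central limit theorem, even though each attribute was uniform.
print(counts)
```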

In one sense the central limit theorem is a rather helpful quirk of nature. Many

variables follow Gaussian distributions and the Gaussian distribution is one of

the few distributions which have very nice analytic properties. Unfortunately, the

Gaussian distribution is also the most uninformative distribution. This notion of

“uninformative” can actually be made very precise using information theory and

states: Given a ﬁxed mean and variance, the Gaussian density represents the least

amount of information among all densities with the same mean and variance. This

is rather unfortunate for our purposes because Gaussian projections are the least

revealing dimensions to look at. So in general we have to work a bit harder to see

interesting structure.

A large number of algorithms has been devised to search for informative pro-

jections. The simplest is "principal component analysis", or PCA for short ??.

Here, interesting means dimensions of high variance. However, it was recognized

that high variance is not always a good measure of interestingness and one should

rather search for dimensions that are non-Gaussian. For instance, "independent components analysis" (ICA) ?? and "projection pursuit" ?? search for dimensions that have heavy tails relative to Gaussian distributions. Another criterion is to find projections onto which the data has multiple modes. A more recent

approach is to project the data onto a potentially curved manifold ??.

Scatter plots are of course not the only way to visualize data. It's a creative exercise, and anything that helps enhance your understanding of the data is allowed in this game. To illustrate I will give a few examples from a


Chapter 3

Learning

This chapter is without question the most important one of the book. It concerns

the core, almost philosophical question of what learning really is (and what it is

not). If you want to remember one thing from this book you will ﬁnd it here in

this chapter.

Ok, let’s start with an example. Alice has a rather strange ailment. She is not

able to recognize objects by their visual appearance. At her home she is doing

just fine: her mother explained to Alice, for every object in her house, what it is and

how you use it. When she is home, she recognizes these objects (if they have not

been moved too much), but when she enters a new environment she is lost. For

example, if she enters a new meeting room she needs a long time to infer what

the chairs and the table are in the room. She has been diagnosed with a severe

case of ”overﬁtting”. What is the matter with Alice? Nothing is wrong with her

memory because she remembers the objects once she has seen them. In fact, she

has a fantastic memory. She remembers every detail of the objects she has seen.

And every time she sees a new object she reasons that the object in front of her

is surely not a chair because it doesn’t have all the features she has seen in ear-

lier chairs. The problem is that Alice cannot generalize the information she has

observed from one instance of a visual object category to other, yet unobserved

members of the same category. The fact that Alice’s disease is so rare is under-

standable: there must have been a strong selection pressure against this disease.

Imagine our ancestors walking through the savanna one million years ago. A lion

appears on the scene. Ancestral Alice has seen lions before, but not this particular

one and it does not induce a fear response. Of course, she has no time to logically infer the possibility that this animal may be dangerous. Alice's contemporaries

noticed that the animal was yellow-brown, had manes etc. and immediately un-


derstood that this was a lion. They understood that all lions have these particular

characteristics in common, but may differ in some other ones (like the presence

of a scar someplace).

Bob has another disease which is called over-generalization. Once he has seen

an object he believes almost everything is some, perhaps twisted, instance of the same object class. (In fact, I seem to suffer from this now and then, when I think all of machine learning can be explained by one new exciting principle.)

If ancestral Bob walks the savanna and he has just encountered an instance of

a lion and ﬂed into a tree with his buddies, the next time he sees a squirrel he

believes it is a small instance of a dangerous lion and ﬂees into the trees again.

Over-generalization seems to be rather common among small children.

One of the main conclusions from this discussion is that we should neither

over-generalize nor over-ﬁt. We need to be on the edge of being just right. But

just right about what? It doesn’t seem there is one correct God-given deﬁnition

of the category chairs. We seem to all agree, but one can surely ﬁnd examples

that would be difﬁcult to classify. When do we generalize exactly right? The

magic word is PREDICTION. From an evolutionary standpoint, all we have to

do is make correct predictions about aspects of life that help us survive. Nobody

really cares about the definition of a lion, but we do care about our responses

to the various animals (run away for lion, chase for deer). And there are a lot

of things that can be predicted in the world. This food kills me but that food is

good for me. Drumming my ﬁsts on my hairy chest in front of a female generates

opportunities for sex, sticking my hand into that yellow-orange flickering "flame"

hurts my hand and so on. The world is wonderfully predictable and we are very

good at predicting it.

So why do we care about object categories in the ﬁrst place? Well, apparently

they help us organize the world and make accurate predictions. The category lions

is an abstraction and abstractions help us to generalize. In a certain sense, learning

is all about ﬁnding useful abstractions or concepts that describe the world. Take

the concept “ﬂuid”, it describes all watery substances and summarizes some of

their physical properties. Or the concept of "weight": an abstraction that describes

a certain property of objects.

Here is one very important corollary for you: “machine learning is not in

the business of remembering and regurgitating observed information, it is in the

business of transferring (generalizing) properties from observed data onto new,

yet unobserved data”. This is the mantra of machine learning that you should

repeat to yourself every night before you go to bed (at least until the ﬁnal exam).

The information we receive from the world has two components to it: there


is the part of the information which does not carry over to the future, the un-

predictable information. We call this “noise”. And then there is the information

that is predictable, the learnable part of the information stream. The task of any

learning algorithm is to separate the predictable part from the unpredictable part.

Now imagine Bob wants to send an image to Alice. He has to pay 1 dollar cent

for every bit that he sends. If the image were completely white it would be really

stupid of Bob to send the message: pixel 1: white, pixel 2: white, pixel 3: white,.....

He could just have sent the message "all pixels are white!". The blank image is completely predictable but carries very little information. Now imagine an image that consists of white noise (your television screen if the cable is not connected). To send the exact image Bob will have to send pixel 1: white, pixel 2: black, pixel 3: black,.... Bob cannot do better because there is no predictable information in that image, i.e. there is no structure to be modeled. You can imagine playing a game where you reveal one pixel at a time to someone and pay him $1 for every next pixel he predicts correctly. For the white image he can do this perfectly; for the noisy picture he would be guessing at random. Real pictures are in between: some pixels

are very hard to predict, while others are easier. To compress the image, Bob can

extract rules such as: always predict the same color as the majority of the pixels

next to you, except when there is an edge. These rules constitute the model for the

regularities of the image. Instead of sending the entire image pixel by pixel, Bob

will now ﬁrst send his rules and ask Alice to apply the rules. Every time the rule

fails Bob also sends a correction: pixel 103: white, pixel 245: black. A few rules and two corrections are obviously cheaper than 256 pixel values and no rules.
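A toy version of Bob's scheme, as a sketch; the 16-pixel row and the simplified rule "predict the previous pixel's color" are made up for illustration:

```python
# 0 = black, 1 = white; a mostly-white row of 16 pixels with one black blob.
image = [1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

# Rule: predict that every pixel has the same color as the pixel before it.
corrections = []
for i in range(1, len(image)):
    if image[i] != image[i - 1]:           # the rule fails here
        corrections.append((i, image[i]))  # send a correction for this pixel

# Bob sends the rule, the first pixel and 2 corrections instead of 16 values.
print(corrections)   # [(5, 0), (7, 1)]
```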

There is one fundamental tradeoff hidden in this game. Since Bob is sending

only a single image it does not pay to send an incredibly complicated model that

would require more bits to explain than simply sending all pixel values. If he

were sending 1 billion images it would pay off to first send the complicated

model because he would be saving a fraction of all bits for every image. On the

other hand, if Bob wants to send 2 pixels, there really is no need to send a model whatsoever. Therefore: the size of Bob's model depends on the amount

of data he wants to transmit. Ironically, the boundary between what is model

and what is noise depends on how much data we are dealing with! If we use a

model that is too complex we overﬁt to the data at hand, i.e. part of the model

represents noise. On the other hand, if we use a too simple model we ”underﬁt”

(over-generalize) and valuable structure remains unmodeled. Both lead to sub-

optimal compression of the image. But both also lead to suboptimal prediction

on new images. The compression game can therefore be used to ﬁnd the right

size of model complexity for a given dataset. And so we have discovered a deep


connection between learning and compression.

Now let's think for a moment about what we really mean by "a model". A model

represents our prior knowledge of the world. It imposes structure that is not nec-

essarily present in the data. We call this the “inductive bias”. Our inductive bias

often comes in the form of a parametrized model. That is to say, we deﬁne a

family of models but let the data determine which of these models is most appro-

priate. A strong inductive bias means that we don’t leave ﬂexibility in the model

for the data to work on. We are so convinced of ourselves that we basically ignore

the data. The downside is that if we are creating a “bad bias” towards to wrong

model. On the other hand, if we are correct, we can learn the remaining degrees

of freedom in our model from very few data-cases. Conversely, we may leave the

door open for a huge family of possible models. If we now let the data zoom in

on the model that best explains the training data it will overﬁt to the peculiarities

of that data. Now imagine you sampled 10 datasets of the same size N and train

these very ﬂexible models separately on each of these datasets (note that in reality

you only have access to one such dataset but please play along in this thought

experiment). Let’s say we want to determine the value of some parameter θ. Be-

cause the models are so ﬂexible, we can actually model the idiosyncrasies of each

dataset. The result is that the value for θ is likely to be very different for each

dataset. But because we didn’t impose much inductive bias the average of many

such estimates will be about right. We say that the bias is small, but the vari-

ance is high. In the case of very restrictive models the opposite happens: the bias

is potentially large but the variance small. Note that not only is a large bias bad

(for obvious reasons), a large variance is bad as well: because we only have one

dataset of size N, our estimate could be very far off simply because we were unlucky with

the dataset we were given. What we should therefore strive for is to inject all our

prior knowledge into the learning problem (this makes learning easier) but avoid

injecting the wrong prior knowledge. If we don’t trust our prior knowledge we

should let the data speak. However, letting the data speak too much might lead to

overﬁtting, so we need to ﬁnd the boundary between too complex and too simple

a model and get its complexity just right. Access to more data means that the data

can speak more relative to prior knowledge. That, in a nutshell is what machine

learning is all about.
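The thought experiment with the 10 datasets is easy to simulate; a sketch in which the true line, the noise level and the two model families (a rigid line through the origin versus a flexible degree-5 polynomial) are assumptions made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10                                      # size of each dataset
x = np.linspace(0, 1, N)
true_slope = 2.0                            # the "parameter theta" we try to recover

rigid, flexible = [], []
for _ in range(10):                         # sample 10 datasets of the same size N
    y = true_slope * x + rng.normal(scale=0.5, size=N)
    # rigid model: straight line through the origin (strong inductive bias)
    rigid.append(np.sum(x * y) / np.sum(x * x))
    # flexible model: degree-5 polynomial; read off its linear coefficient
    flexible.append(np.polyfit(x, y, deg=5)[-2])

print(np.mean(rigid), np.std(rigid))        # small variance across datasets
print(np.mean(flexible), np.std(flexible))  # typically a much larger variance
```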


3.1 In a Nutshell

Learning is all about generalizing regularities in the training data to new, yet un-

observed data. It is not about remembering the training data. Good generalization

means that you need to balance prior knowledge with information from data. De-

pending on the dataset size, you can entertain more or less complex models. The

correct size of model can be determined by playing a compression game. Learning

= generalization = abstraction = compression.


Chapter 4

Types of Machine Learning

We will now turn our attention to some of the learning problems that we will

encounter in this book. The most well studied problem in ML is that of supervised

learning. To explain this, let's first look at an example. Bob wants to learn how

to distinguish between bobcats and mountain lions. He types these words into

Google Image Search and closely studies all catlike images of bobcats on the one

hand and mountain lions on the other. Some months later on a hiking trip in the

San Bernardino mountains he sees a big cat....

The data that Bob collected was labelled because Google is supposed to only

return pictures of bobcats when you search for the word ”bobcat” (and similarly

for mountain lions). Let's call the images $X_1, \ldots, X_n$ and the labels $Y_1, \ldots, Y_n$. Note that the $X_i$ are much higher dimensional objects because they represent all the information extracted from the image (approximately 1 million pixel color values), while $Y_i$ is simply $-1$ or $1$ depending on how we choose to label our classes. So,

that would be a ratio of about 1 million to 1 in terms of information content! The

classiﬁcation problem can usually be posed as ﬁnding (a.k.a. learning) a function

f(x) that approximates the correct class labels for any input x. For instance, we

may decide that sign[f(x)] is the predictor for our class label. In the following we

will be studying quite a few of these classiﬁcation algorithms.
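As a minimal illustration of such a decision function; the weight vector, offset and feature values below are made-up numbers, not a trained model:

```python
import numpy as np

w = np.array([0.8, -0.4])            # made-up weights defining f(x) = w^T x + b
b = -0.1

def f(x):
    return w @ x + b                 # a simple (linear) score function

x_new = np.array([1.2, 0.3])
label = np.sign(f(x_new))            # sign[f(x)] is the predicted class (+1 or -1)
print(label)
```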

There is also a different family of learning problems known as unsupervised

learning problems. In this case there are no labels Y involved, just the features

X. Our task is not to classify, but to organize the data, or to discover the structure

in the data. This may be very useful for visualizing data, compressing data,

or organizing data for easy accessibility. Extracting structure in data often leads

to the discovery of concepts, topics, abstractions, factors, causes, and more such

terms that all really mean the same thing. These are the underlying semantic


factors that can explain the data. Knowing these factors is like denoising the

data where we ﬁrst peel off the uninteresting bits and pieces of the signal and

subsequently transform it onto an often lower dimensional space which exposes the

underlying factors.

There are two dominant classes of unsupervised learning algorithms: cluster-

ing based algorithms assume that the data organizes into groups. Finding these

groups is then the task of the ML algorithm and the identity of the group is the se-

mantic factor. Another class of algorithms strives to project the data onto a lower

dimensional space. This mapping can be nonlinear, but the underlying assump-

tion is that the data is approximately distributed on some (possibly curved) lower

dimensional manifold embedded in the input space. Unrolling that manifold is

then the task of the learning algorithm. In this case the dimensions should be

interpreted as semantic factors.

There are many variations on the above themes. For instance, one is often

confronted with a situation where you have access to many more unlabeled data (only $X_i$) and many fewer labeled instances (both $(X_i, Y_i)$). Take the task of clas-

sifying news articles by topic (weather, sports, national news, international etc.).

Some people may have labeled some news-articles by hand but there won’t be all

that many of those. However, we do have a very large digital library of scanned

newspapers available. Shouldn’t it be possible to use those scanned newspapers

somehow to improve the classifier? Imagine that the data naturally clusters into

well separated groups (for instance because news articles reporting on different

topics use very different words). This is depicted in Figure ??. Note that there

are only very few cases which have labels attached to them. From this ﬁgure it

becomes clear that the expected optimal decision boundary nicely separates these

clusters. In other words, you do not expect that the decision boundary will cut

through one of the clusters. Yet that is exactly what would happen if you would

only be using the labeled data. Hence, by simply requiring that decision bound-

aries do not cut through regions of high probability we can improve our classiﬁer.

The subﬁeld that studies how to improve classiﬁcation algorithms using unlabeled

data goes under the name “semi-supervised learning”.

A fourth major class of learning algorithms deals with problems where the

supervised signal consists only of rewards (or costs) that are possibly delayed.

Consider for example a mouse that needs to solve a labyrinth in order to obtain

his food. While making his decisions he will not receive any feedback (apart from

perhaps slowly getting more hungry). It’s only at the end when he reaches the

cheese that he receives his positive feedback, and he will have to use this to reinforce his perhaps random earlier decisions that led him to the cheese. These problems


fall under the name ”reinforcement learning”. It is a very general setup in which

almost all known cases of machine learning can be cast, but this generality also

means that these types of problems can be very difficult. The most general RL

problems do not even assume that you know what the world looks like (i.e. the

maze for the mouse), so you have to simultaneously learn a model of the world

and solve your task in it. This dual task induces interesting trade-offs: should

you invest time now to learn machine learning and reap the beneﬁt later in terms

of a high salary working for Yahoo!, or should you stop investing now and start

exploiting what you have learned so far? This is clearly a function of age, or

the time horizon that you still have to take advantage of these investments. The

mouse is similarly confronted with the problem of whether he should try out this

new alley in the maze that can cut down his time to reach the cheese considerably,

or whether he should simply stay with what he has learned and take the route he already knows. This clearly depends on how often he thinks he will have to run through the

same maze in the future. We call this the exploration versus exploitation trade-off.

The reason that RL is a very exciting ﬁeld of research is because of its biological

relevance. Do we not also have to figure out how the world works and survive in it?

Let’s go back to the news-articles. Assume we have control over what article

we will label next. Which one would we pick? Surely the one that would be most

informative in some suitably defined sense. Or take the mouse in the maze. Given that he decides to explore, where does he explore? Surely he will try to seek out alleys

that look promising, i.e. alleys that he expects to maximize his reward. We call

the problem of ﬁnding the next best data-case to investigate “active learning”.

One may also be faced with learning multiple tasks at the same time. These

tasks are related but not identical. For instance, consider the problem of recom-

mending movies to customers of Netﬂix. Each person is different and would re-

ally require a separate model to make the recommendations. However, people also

share commonalities, especially when people show evidence of being of the same

"type" (for example, a sci-fi fan or a comedy fan). We can learn personalized models

but share features between them. Especially for new customers, where we don’t

have access to many movies that were rated by the customer, we need to “draw

statistical strength” from customers who seem to be similar. From this example

it has hopefully become clear that we are trying to learn models for many differ-

ent yet related problems and that we can build better models if we share some of

the things learned for one task with the other ones. The trick is not to share too

much nor too little and how much we should share depends on how much data and

prior knowledge we have access to for each task. We call this subﬁeld of machine

learning "multi-task learning".


4.1 In a Nutshell

There are many types of learning problems within machine learning. Supervised

learning deals with predicting class labels from attributes, unsupervised learn-

ing tries to discover interesting structure in data, semi-supervised learning uses

both labeled and unlabeled data to improve predictive performance, reinforcement

learning can handle simple feedback in the form of delayed reward, active learn-

ing optimizes the next sample to include in the learning algorithm and multi-task

learning deals with sharing common model components between related learning

tasks.

Chapter 5

Nearest Neighbors Classiﬁcation

Perhaps the simplest algorithm to perform classiﬁcation is the “k nearest neigh-

bors (kNN) classifier". As usual we assume that we have data of the form $\{X_{in}, Y_n\}$, where $X_{in}$ is the value of attribute $i$ for data-case $n$ and $Y_n$ is the label for data-case $n$. We also need a measure of similarity between data-cases, which we will denote by $K(X_n, X_m)$, where larger values of $K$ denote more similar data-cases.

Given these preliminaries, classiﬁcation is embarrassingly simple: when you

are provided with the attributes $X_t$ for a new (unseen) test-case, you first find the $k$ most similar data-cases in the dataset by computing $K(X_t, X_n)$ for all $n$. Call

this set S. Then, each of these k most similar neighbors in S can cast a vote on

the label of the test case, where each neighbor predicts that the test case has the

same label as itself. Assuming binary labels and an odd number of neighbors, this

will always result in a decision.

Although kNN algorithms are often associated with this simple voting scheme,

more sophisticated ways of combining the information of these neighbors are al-

lowed. For instance, one could weigh each vote by the similarity to the test-case.

This results in the following decision rule,

$$ Y_t = 1 \quad \text{if} \quad \sum_{n \in S} K(X_t, X_n)(2Y_n - 1) > 0 \qquad (5.1) $$

$$ Y_t = 0 \quad \text{if} \quad \sum_{n \in S} K(X_t, X_n)(2Y_n - 1) < 0 \qquad (5.2) $$

and ﬂipping a coin if it is exactly 0.
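A minimal sketch of this decision rule in Python/NumPy; the Gaussian choice for the similarity K and the toy data are assumptions made for illustration, with labels 0/1 as in the text:

```python
import numpy as np

def K(a, b, sigma=1.0):
    """Similarity: larger for more similar data-cases (a Gaussian kernel here)."""
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def knn_predict(X, Y, x_test, k=3):
    sims = np.array([K(x_test, x) for x in X])
    S = np.argsort(-sims)[:k]                  # indices of the k most similar cases
    vote = np.sum(sims[S] * (2 * Y[S] - 1))    # weighted vote, eqs. (5.1)-(5.2)
    if vote == 0:                              # tie: flip a coin
        return np.random.randint(2)
    return 1 if vote > 0 else 0

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
Y = np.array([0, 0, 1, 1])
print(knn_predict(X, Y, np.array([0.95, 1.05])))   # -> 1
```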

Why do we expect this algorithm to work intuitively? The reason is that we

expect data-cases with similar labels to cluster together in attribute space. So to


ﬁgure out the label of a test-case we simply look around and see what labels our

neighbors have. Asking your closest neighbor is like betting all your money on a

single piece of advice and you might get really unlucky if your closest neighbor

happens to be an odd-one-out. It’s typically better to ask several opinions before

making your decision. However, if you ask around too much you will be forced to

ask advice from data-cases that are no longer very similar to you. So there is some

optimal number of neighbors to ask, which may be different for every problem.

Determining this optimal number of neighbors is not easy, but we can again use

cross validation (section ??) to estimate it.

So what is good and bad about kNN? First, its simplicity makes it attractive.

Very few assumptions about the data are used in the classiﬁcation process. This

property can also be a disadvantage: if you have prior knowledge about how the

data was generated, it's better to use it, because less information has to be ex-

tracted from the data. A second consideration is computation time and memory

efﬁciency. Assume you have a very large dataset, but you need to make decisions

very quickly. As an example, consider surfing the web-pages of Amazon.com.

Whenever you search for a book, it likes to suggest 10 others. To do that it could

classify books into categories and suggest the top ranked in that category. kNN re-

quires Amazon to store all features of all books at a location that is accessible for

fast computation. Moreover, to classify kNN has to do the neighborhood search

every time again. Clearly, there are tricks that can be played with smart indexing,

but wouldn’t it be much easier if we would have summarized all books by a sim-

ple classiﬁcation function f

θ

(X), that “spits out” a class for any combination of

features X?

This distinction between algorithms/models that summarize the data in a fixed set of parameters and those that require memorizing every data-item is often called "parametric" versus "non-parametric". It's important to realize that this is somewhat of a misnomer: non-parametric models can have parameters (such as the number of neighbors to consider). The key distinction is rather whether the data is summarized through a set of parameters which together comprise a classification function $f_\theta(X)$, or whether we retain all the data to do the classification "on the fly".

kNN is also known to suffer from the "curse of high dimensions". If we use many features to describe our data, and in particular when most of these features turn out to be irrelevant and noisy for the classification, then kNN is quickly confused. Imagine that there are two features that contain all the information necessary for a perfect classification, but that we have added 98 noisy, uninformative features. The neighbors in the two-dimensional space of the relevant features are unfortunately no longer likely to be the neighbors in the 100-dimensional space, because 98 noisy dimensions have been added. This effect is detrimental to the kNN algorithm. Once again, it is very important to choose your initial representation with care and to preprocess the data before you apply the algorithm. In this case, preprocessing takes the form of "feature selection", on which a whole book in itself could be written.

5.1 The Idea In a Nutshell

To classify a new data-item you ﬁrst look for the k nearest neighbors in feature

space and assign it the same label as the majority of these neighbors.


Chapter 6

The Naive Bayesian Classiﬁer

In this chapter we will discuss the "Naive Bayes" (NB) classifier. It has proven to be very useful in many applications, both in science as well as in industry. In the introduction I promised I would try to avoid the use of probabilities as much as possible. However, in this chapter I'll make an exception, because the NB classifier is most naturally explained with the use of probabilities. Fortunately, we will only need the most basic concepts.

6.1 The Naive Bayes Model

NB is mostly used when dealing with discrete-valued attributes. We will explain the algorithm in this context but note that extensions to continuous-valued attributes are possible. We will restrict attention to classification problems between two classes and refer to section ?? for approaches to extend this to more than two classes.

In our usual notation we consider $D$ discrete-valued attributes $X_i \in [0, .., V_i],\ i = 1..D$. Note that each attribute can have a different number of values $V_i$. If the original data was supplied in a different format, e.g. $X_1 = [\text{Yes}, \text{No}]$, then we simply reassign these values to fit the above format, $\text{Yes} = 1$, $\text{No} = 0$ (or reversed). In addition we are also provided with a supervised signal, in this case the labels $Y = 0$ and $Y = 1$ indicating that the data-item fell in class 0 or class 1. Again, which class is assigned to 0 or 1 is arbitrary and has no impact on the performance of the algorithm.

Before we move on, let's consider a real-world example: spam-filtering. Every day your mailbox gets bombarded with hundreds of spam emails. To give an example of the traffic that this generates: the University of California, Irvine receives on the order of 2 million spam emails a day. Fortunately, the bulk of these emails (approximately 97%) is filtered out or dumped into your spam-box and will never reach your attention. How is this done? Well, it turns out to be a classic example of

a classiﬁcation problem: spam or ham, that’s the question. Let’s say that spam

will receive a label 1 and ham a label 0. Our task is thus to label each new email

with either 0 or 1. What are the attributes? Rephrasing this question, what would

you measure in an email to see if it is spam? Certainly, if I read "viagra" in the subject I would stop right there and dump it in the spam-box. What else? Here are a few: "enlargement, cheap, buy, pharmacy, money, loan, mortgage, credit" and so on. We can build a dictionary of words that we can detect in each email. This dictionary could also include word phrases such as "buy now" or "penis enlargement"; one can make phrases as sophisticated as necessary. One could

measure whether the words or phrases appear at least once or one could count the

actual number of times they appear. Spammers know about the way these spam

ﬁlters work and counteract by slight misspellings of certain key words. Hence we

might also want to detect words like “via gra” and so on. In fact, a small arms race

has ensued where spam ﬁlters and spam generators ﬁnd new tricks to counteract

the tricks of the “opponent”. Putting all these subtleties aside for a moment we’ll

simply assume that we measure a number of these attributes for every email in a

dataset. We’ll also assume that we have spam/ham labels for these emails, which

were acquired by someone removing spam emails by hand from his/her inbox.

Our task is then to train a predictor for spam/ham labels for future emails where

we have access to attributes but not to labels.

The NB model is what we call a "generative" model. This means that we imagine how the data was generated in an abstract sense. For emails, this works as follows: an imaginary entity first decides how many spam and ham emails it will generate on a daily basis. Say it decides to generate 40% spam and 60% ham. We will assume this doesn't change with time (of course it does, but we will make this simplifying assumption for now). It will then decide what the chance is that a certain word appears $k$ times in a spam email. For example, the word "viagra" has a chance of 96% to not appear at all, 1% to appear once, 0.9% to appear twice, etc. These probabilities are clearly different for spam and ham; "viagra" should have a much smaller probability to appear in a ham email (but it could of course; consider that I send this text to my publisher by email). Given these probabilities, we can then go on and try to generate emails that actually look like real emails, i.e. with proper sentences, but we won't need that in the following. Instead we make the simplifying assumption that an email consists of "a bag of words", in random order.

6.2 Learning a Naive Bayes Classiﬁer

Given a dataset, $\{X_{in}, Y_n\},\ i = 1..D,\ n = 1..N$, we wish to estimate what these probabilities are. To start with the simplest one: what would be a good estimate for the percentage of spam versus ham emails that our imaginary entity uses to generate emails? Well, we can simply count how many spam and ham emails we have in our data. This is given by,

$$P(\text{spam}) = \frac{\#\ \text{spam emails}}{\text{total}\ \#\ \text{emails}} = \frac{\sum_n \mathbb{I}[Y_n = 1]}{N} \qquad (6.1)$$

Here we mean with $\mathbb{I}[A = a]$ a function that is only equal to 1 if its argument is satisfied, and zero otherwise. Hence, in the equation above it counts the number of instances for which $Y_n = 1$. Since the remainder of the emails must be ham, we also find that

$$P(\text{ham}) = 1 - P(\text{spam}) = \frac{\#\ \text{ham emails}}{\text{total}\ \#\ \text{emails}} = \frac{\sum_n \mathbb{I}[Y_n = 0]}{N} \qquad (6.2)$$

where we have used that $P(\text{ham}) + P(\text{spam}) = 1$ since an email is either ham or spam.

Next, we need to estimate how often we expect to see a certain word or phrase in either a spam or a ham email. In our example we could for instance ask ourselves what the probability is that we find the word "viagra" $k$ times, with $k = 0, 1, > 1$, in a spam email. Let's recode this as $X_{\text{viagra}} = 0$ meaning that we didn't observe "viagra", $X_{\text{viagra}} = 1$ meaning that we observed it once, and $X_{\text{viagra}} = 2$ meaning that we observed it more than once. The answer is again that we can count how often these events happened in our data and use that as an estimate for the real probabilities according to which the emails were generated. First, for spam we find,

$$P_{\text{spam}}(X_i = j) = \frac{\#\ \text{spam emails for which the word}\ i\ \text{was found}\ j\ \text{times}}{\text{total}\ \#\ \text{of spam emails}} \qquad (6.3)$$
$$= \frac{\sum_n \mathbb{I}[X_{in} = j \wedge Y_n = 1]}{\sum_n \mathbb{I}[Y_n = 1]} \qquad (6.4)$$

Here we have defined the symbol $\wedge$ to mean that both statements to the left and right of this symbol should hold true in order for the entire sentence to be true. For ham emails, we compute exactly the same quantity,

$$P_{\text{ham}}(X_i = j) = \frac{\#\ \text{ham emails for which the word}\ i\ \text{was found}\ j\ \text{times}}{\text{total}\ \#\ \text{of ham emails}} \qquad (6.5)$$
$$= \frac{\sum_n \mathbb{I}[X_{in} = j \wedge Y_n = 0]}{\sum_n \mathbb{I}[Y_n = 0]} \qquad (6.6)$$

Both these quantities should be computed for all words or phrases (or more generally attributes).
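These counting estimates are only a few lines of code. Below is a small sketch (the function name and data layout are my own choices for illustration); it assumes $X$ is an $N \times D$ integer array of attribute values, $Y$ a 0/1 label vector, and that both classes occur in the data.

import numpy as np

def train_naive_bayes(X, Y, V):
    """X: (N, D) integer attribute values, Y: (N,) labels in {0, 1},
    V: list with the number of possible values of each attribute."""
    N, D = X.shape
    p_spam = np.mean(Y == 1)                      # Eq. (6.1)
    p_ham = 1.0 - p_spam                          # Eq. (6.2)
    # p_x[c][i][j] = P_c(X_i = j), estimated by counting, Eqs. (6.3)-(6.6).
    p_x = {c: [np.zeros(V[i]) for i in range(D)] for c in (0, 1)}
    for c in (0, 1):
        Xc = X[Y == c]
        for i in range(D):
            for j in range(V[i]):
                p_x[c][i][j] = np.mean(Xc[:, i] == j)
    return p_spam, p_ham, p_x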

We have now ﬁnished the phase where we estimate the model from the data.

We will often refer to this phase as “learning” or training a model. The model

helps us understand how data was generated in some approximate setting. The

next phase is that of prediction or classiﬁcation of new email.

6.3 Class-Prediction for New Instances

New email does not come with a label ham or spam (if it did, we could throw spam in the spam-box right away). What we do see are the attributes $\{X_i\}$. Our task is to guess the label based on the model and the measured attributes. The approach we take is simple: calculate whether the email has a higher probability of being generated from the spam or the ham model. For example, because the word "viagra" has a tiny probability of being generated under the ham model, it will end up with a higher probability under the spam model. But clearly, all words have a say in this process. It's like a large committee of experts, one for each word. Each member casts a vote and can say things like: "I am 99% certain it's spam", or "It's almost definitely not spam (0.1% spam)". Each of these opinions will be multiplied together to generate a final score. We then figure out whether ham or spam has the highest score.

There is one little practical caveat with this approach, namely that the product of a large number of probabilities, each of which is necessarily smaller than one, very quickly gets so small that your computer can't handle it. There is an easy fix though. Instead of multiplying probabilities as scores, we use the logarithms of those probabilities and add the logarithms. This is numerically stable and leads to the same conclusion because if $a > b$ then we also have that $\log(a) > \log(b)$ and vice versa. In equations we compute the score as follows:

$$S_{\text{spam}} = \sum_i \log P_{\text{spam}}(X_i = v_i) + \log P(\text{spam}) \qquad (6.7)$$

where with $v_i$ we mean the value for attribute $i$ that we observe in the email under consideration, i.e. if the email contains no mention of the word "viagra" we set $v_{\text{viagra}} = 0$.

The first term in Eqn. 6.7 adds all the log-probabilities under the spam model of observing the particular value of each attribute. Every time a word is observed that has a high probability under the spam model, and hence has often been observed in the spam emails of the dataset, this score is boosted. The last term adds an extra factor to the score that expresses our prior belief of receiving a spam email instead of a ham email. We compute a similar score for ham, namely,

$$S_{\text{ham}} = \sum_i \log P_{\text{ham}}(X_i = v_i) + \log P(\text{ham}) \qquad (6.8)$$

and compare the two scores. Clearly, a large score for spam relative to ham provides evidence that the email is indeed spam. If your goal is to minimize the total number of errors (whether they involve spam or ham), then the decision should be to choose the class which has the highest score.

In reality, one type of error could have more serious consequences than another. For instance, a spam email making it into my inbox is not too bad, but an important email that ends up in my spam-box (which I never check) may have serious consequences. To account for this we introduce a general threshold $\theta$ and use the following decision rule,

$$Y = 1 \quad \text{if} \quad S_1 > S_0 + \theta \qquad (6.9)$$
$$Y = 0 \quad \text{if} \quad S_1 < S_0 + \theta \qquad (6.10)$$

If these quantities are equal you flip a coin.
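A sketch of this prediction step, continuing the hypothetical train_naive_bayes routine from the previous section (again, the names are mine and the routine is just one way to write it):

import numpy as np

def predict_naive_bayes(x, p_spam, p_ham, p_x, theta=0.0):
    """x: (D,) observed attribute values v_i for one email."""
    # Log-scores of Eqs. (6.7) and (6.8); log(0) = -inf is allowed here.
    with np.errstate(divide="ignore"):
        s_spam = np.log(p_spam) + sum(np.log(p_x[1][i][v]) for i, v in enumerate(x))
        s_ham = np.log(p_ham) + sum(np.log(p_x[0][i][v]) for i, v in enumerate(x))
    # Decision rule of Eqs. (6.9)-(6.10) with threshold theta.
    if s_spam > s_ham + theta:
        return 1
    if s_spam < s_ham + theta:
        return 0
    return np.random.randint(2)  # flip a coin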

If θ = −∞, we always decide in favor of label Y = 1, while if we use

θ = +∞ we always decide in favor of Y = 0. The actual value is a matter of

taste. To evaluate a classiﬁer we often draw an ROC curve. An ROC curve is

obtained by sliding θ between −∞ and +∞ and plotting the true positive rate

(the number of examples with label Y = 1 also classiﬁed as Y = 1 divided by the

total number of examples with Y = 1) versus the false positive rate (the number

of examples with label Y = 0 classiﬁed as Y = 1 divided by the total number of

examples with Y = 0). For more details see chapter ??.
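To make the ROC construction concrete, here is a small sketch (my own illustration) that slides $\theta$ over the observed score differences and records the true and false positive rates; it assumes both classes are present in the evaluation set.

import numpy as np

def roc_curve(score_diff, labels):
    """score_diff[n] = S_spam - S_ham for example n; labels in {0, 1}.
    Returns arrays of (false positive rate, true positive rate) pairs."""
    thresholds = np.sort(np.unique(score_diff))
    fpr, tpr = [], []
    for theta in np.concatenate(([-np.inf], thresholds, [np.inf])):
        pred = (score_diff > theta).astype(int)    # decide Y = 1 if S_1 > S_0 + theta
        tpr.append(np.mean(pred[labels == 1] == 1))
        fpr.append(np.mean(pred[labels == 0] == 1))
    return np.array(fpr), np.array(tpr)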


6.4 Regularization

The spam filter algorithm that we discussed in the previous sections does unfortunately not work very well if we wish to use many attributes (words, word-phrases). The reason is that for many attributes we may not encounter a single example in the dataset. Say for example that we defined the word "Nigeria" as an attribute, but that our dataset did not include one of those spam emails where you are promised mountains of gold if you invest your money in someone's bank in Nigeria. Also assume there are indeed a few ham emails which talk about the nice people in Nigeria. Then any future email that mentions Nigeria is classified as ham with 100% certainty. More importantly, one cannot recover from this decision even if the email also mentions viagra, enlargement, mortgage and so on, all in a single email! This can be seen from the fact that $\log P_{\text{spam}}(X_{\text{Nigeria}} > 0) = -\infty$ while the final score is a sum of these individual word-scores.

To counteract this phenomenon, we give each word in the dictionary a small probability of being present in any email (spam or ham), before seeing the data. This process is called smoothing. The impact on the estimated probabilities is given below,

$$P_{\text{spam}}(X_i = j) = \frac{\alpha + \sum_n \mathbb{I}[X_{in} = j \wedge Y_n = 1]}{V_i\,\alpha + \sum_n \mathbb{I}[Y_n = 1]} \qquad (6.12)$$
$$P_{\text{ham}}(X_i = j) = \frac{\alpha + \sum_n \mathbb{I}[X_{in} = j \wedge Y_n = 0]}{V_i\,\alpha + \sum_n \mathbb{I}[Y_n = 0]} \qquad (6.13)$$

where $V_i$ is the number of possible values of attribute $i$. Thus, $\alpha$ can be interpreted as a small, possibly fractional number of "pseudo-observations" of the attribute in question. It's like adding these observations to the actual dataset.
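In code, the smoothed counts of Eqs. (6.12)-(6.13) simply replace the plain frequencies used earlier; a sketch (my own illustration, with a hypothetical value of alpha):

import numpy as np

def smoothed_estimate(X, Y, V, alpha=0.5):
    """p_x[c][i][j] = (alpha + #{X_in=j, Y_n=c}) / (V_i*alpha + #{Y_n=c})."""
    N, D = X.shape
    p_x = {c: [np.zeros(V[i]) for i in range(D)] for c in (0, 1)}
    for c in (0, 1):
        Xc = X[Y == c]
        n_c = len(Xc)
        for i in range(D):
            for j in range(V[i]):
                count = np.sum(Xc[:, i] == j)
                p_x[c][i][j] = (alpha + count) / (V[i] * alpha + n_c)
    return p_x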

What value for $\alpha$ do we use? Fitting its value on the dataset will not work, because the reason we added it was exactly that we assumed there was too little data in the first place (we hadn't received one of those annoying "Nigeria" emails yet); fitting $\alpha$ on the same data thus relates to the phenomenon of overfitting. However, we can use the trick described in section ?? where we split the data into two pieces. We learn a model on one chunk and adjust $\alpha$ such that performance on the other chunk is optimal. We play this game multiple times with different splits and average the results.


6.5 Remarks

One of the main limitations of the NB classifier is that it assumes independence between attributes (this is presumably the reason why we call it the naive Bayesian classifier). This is reflected in the fact that each attribute has an independent vote in the final score. However, imagine that I measure the words "home" and "mortgage". Observing "mortgage" certainly raises the probability of observing "home". We say that they are positively correlated. It would therefore be more fair if we attributed a smaller weight to "home" if we already observed "mortgage", because they convey the same thing: this email is about mortgages for your home. One way to obtain a more fair voting scheme is to model these dependencies explicitly. However, this comes at a computational cost (a longer time before you receive your email in your inbox) which may not always be worth the additional accuracy. One should also note that more parameters do not necessarily improve accuracy, because too many parameters may lead to overfitting.

6.6 The Idea In a Nutshell

Consider Figure ??. We can classify data by building a model of how the data was

generated. For NB we ﬁrst decide whether we will generate a data-item from class

Y = 0 or class Y = 1. Given that decision we generate the values for D attributes

independently. Each class has a different model for generating attributes. Clas-

siﬁcation is achieved by computing which model was more likely to generate the

new data-point, biasing the outcome towards the class that is expected to generate

more data.


Chapter 7

The Perceptron

We will now describe one of the simplest parametric classifiers: the perceptron, and its cousin the logistic regression classifier. However, despite its simplicity it should not be under-estimated! It is the workhorse for most companies involved with some form of machine learning (perhaps tying with the decision tree classifier). One could say that it represents the canonical parametric approach to classification, where we believe that a straight line is sufficient to separate the two classes of interest. An example of this is given in Figure ??, where the assumption that the two classes can be separated by a line is clearly valid.

However, this assumption need not always be true. Looking at Figure ?? we clearly observe that there is no straight line that will do the job for us. What can we do? Our first inclination is probably to try and fit a more complicated separation boundary. However, there is another trick that we will be using often in this book. Instead, we can increase the dimensionality of the space by "measuring" more things of the data. Call $\phi_k(X)$ feature $k$ that was measured from the data. The features can be highly nonlinear functions. The simplest choice may be to also measure $\phi_k(X) = X_k^2$ for each attribute $X_k$. But we may also measure cross-products such as $\phi_{ij}(X) = X_i X_j,\ \forall i, j$. The latter will allow you to explicitly model correlations between attributes. For example, if $X_i$ represents the presence (1) or absence (0) of the word "viagra" and similarly for $X_j$ and the presence/absence of the word "dysfunction", then the cross-product feature $X_i X_j$ lets you model the presence of both words simultaneously (which should be helpful in trying to find out what this document is about). We can add as many features as we like, adding another dimension for every new feature. In this higher dimensional space we can now be more confident in assuming that the data can be separated by a line.
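A minimal sketch of such a feature expansion (my own illustration; the particular choice of squares and pairwise products is just the example discussed above):

import numpy as np

def expand_features(X):
    """Map each row x = (x_1, ..., x_D) to (x, x_k^2 for all k, x_i*x_j for i < j)."""
    N, D = X.shape
    squares = X ** 2
    cross = np.array([[X[n, i] * X[n, j] for i in range(D) for j in range(i + 1, D)]
                      for n in range(N)])
    return np.hstack([X, squares, cross]) if D > 1 else np.hstack([X, squares])

# A line in the expanded space corresponds to a curved boundary in the original space.
X = np.array([[1.0, 2.0], [0.5, -1.0]])
print(expand_features(X).shape)   # (2, 5): two originals, two squares, one cross-product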


I would like to warn the reader at this point that more features are not necessarily a good thing if the new features are uninformative for the classification task at

hand. The problem is that they introduce noise in the input that can mask the

actual signal (i.e. the good, discriminative features). In fact, there is a whole

subﬁeld of ML that deals with selecting relevant features from a set that is too

large. The problem of too many dimensions is sometimes called “the curse of

high dimensionality”. Another way of seeing this is that more dimensions often

lead to more parameters in the model (as in the case for the perceptron) and can

hence lead to overﬁtting. To combat that in turn we can add regularizers as we

will see in the following.

With the introduction of regularizers, we can sometimes play magic and use an infinite number of features. How we play this magic will be explained when we discuss kernel methods in the next sections. But let us first start simple

with the perceptron.

7.1 The Perceptron Model

Our assumption is that a line can separate the two classes of interest. To make our life a little easier we will switch to the $Y = \{+1, -1\}$ representation. With this, the condition can be expressed mathematically as [1],

$$Y_n \approx \mathrm{sign}\left(\sum_k w_k X_{kn} - \alpha\right) \qquad (7.1)$$

where "sign" is the sign-function (+1 for nonnegative reals and −1 for negative reals). We have introduced $K+1$ parameters $\{w_1, .., w_K, \alpha\}$ which define the line for us. The vector $\mathbf{w}$ represents the direction orthogonal to the decision boundary depicted in Figure ??. For example, a line through the origin is represented by $\mathbf{w}^T\mathbf{x} = 0$, i.e. all vectors $\mathbf{x}$ with a vanishing inner product with $\mathbf{w}$. The scalar quantity $\alpha$ represents the offset of the line $\mathbf{w}^T\mathbf{x} = 0$ from the origin, i.e. the shortest distance from the origin to the line. This can be seen by writing the points on the line as $\mathbf{x} = \mathbf{y} + \mathbf{v}$, where $\mathbf{y}$ is a fixed vector pointing to an arbitrary point on the line and $\mathbf{v}$ is the vector on the line starting at $\mathbf{y}$ (see Figure ??). Hence, $\mathbf{w}^T(\mathbf{y} + \mathbf{v}) - \alpha = 0$. Since by definition $\mathbf{w}^T\mathbf{v} = 0$, we find $\mathbf{w}^T\mathbf{y} = \alpha$, which means that $\alpha$ is the projection of $\mathbf{y}$ onto $\mathbf{w}$, which is the shortest distance from the origin to the line.

[1] Note that we can replace $X_k \rightarrow \phi_k(X)$, but for the sake of simplicity we will refrain from doing so at this point.

We would like to estimate these parameters from the data (which we will do in a minute), but it is important to notice that the number of parameters is fixed in advance. In some sense, we believe so much in our assumption that the data is linearly separable that we stick to it irrespective of how many data-cases we will encounter. This fixed capacity of the model is typical for parametric methods, but perhaps a little unrealistic for real data. A more reasonable assumption is that the decision boundary may become more complex as we see more data. Too few data-cases simply do not provide the resolution (evidence) necessary to see more complex structure in the decision boundary. Recall that non-parametric methods, such as the "nearest neighbors" classifier, actually do have this desirable feature. Nevertheless, the linear separability assumption comes with some computational advantages as well, such as very fast class prediction on new test data. I believe that this computational convenience may be at the root of its popularity. By the way, when we take the limit of an infinite number of features, we will have happily returned to the land of "non-parametrics", but we have to exercise a little patience before we get there.

Now let's write down a cost function that we wish to minimize in order for our linear decision boundary to become a good classifier. Clearly, we would like to control performance on future, yet unseen test data. However, this is a little hard (since we don't have access to this data by definition). As a surrogate we will simply fit the line parameters on the training data. It cannot be stressed enough that this is dangerous in principle due to the phenomenon of overfitting (see section ??). If we have introduced very many features and no form of regularization, then we have many parameters to fit. When this capacity is too large relative to the number of data-cases at our disposal, we will be fitting the idiosyncrasies of this particular dataset and these will not carry over to the future test data. So, one should split off a subset of the training data and reserve it for monitoring performance (one should not use this set in the training procedure). Cycling through multiple splits and averaging the result was the cross-validation procedure discussed in section ??. If we do not use too many features relative to the number of data-cases, the model class is very limited and overfitting is not an issue. (In fact, one may want to worry more about "underfitting" in this case.)

OK, so now that we agree on writing down a cost on the training data, we need to choose an explicit expression. Consider now the following choice:

$$C(\mathbf{w}, \alpha) = \frac{1}{2}\frac{1}{N}\sum_{n=1}^{N}\left(Y_n - \mathbf{w}^T\mathbf{X}_n + \alpha\right)^2 \qquad (7.2)$$

where we have rewritten $\mathbf{w}^T\mathbf{X}_n = \sum_k w_k X_{kn}$. If we minimize this cost, then $\mathbf{w}^T\mathbf{X}_n - \alpha$ tends to be positive when $Y_n = +1$ and negative when $Y_n = -1$. This is what we want! Once optimized, we can then easily use our optimal parameters to perform prediction on new test data $\mathbf{X}_{\text{test}}$ as follows:

$$\tilde{Y}_{\text{test}} = \mathrm{sign}\left(\sum_k w^*_k X_{k,\text{test}} - \alpha^*\right) \qquad (7.3)$$

where $\tilde{Y}$ is used to indicate the predicted value for $Y$.

So far so good, but how do we obtain our values for $\{\mathbf{w}^*, \alpha^*\}$? The simplest approach is to compute the gradient and slowly descend on the cost function (see appendix ?? for background). In this case, the gradients are simple:

$$\nabla_{\mathbf{w}} C(\mathbf{w}, \alpha) = -\frac{1}{N}\sum_{n=1}^{N}(Y_n - \mathbf{w}^T\mathbf{X}_n + \alpha)\,\mathbf{X}_n = -\mathbf{X}(\mathbf{Y} - \mathbf{X}^T\mathbf{w} + \alpha) \qquad (7.4)$$
$$\nabla_{\alpha} C(\mathbf{w}, \alpha) = \frac{1}{N}\sum_{n=1}^{N}(Y_n - \mathbf{w}^T\mathbf{X}_n + \alpha) = (\mathbf{Y} - \mathbf{X}^T\mathbf{w} + \alpha) \qquad (7.5)$$

where in the latter matrix expressions we have used the convention that $\mathbf{X}$ is the matrix with elements $X_{kn}$. Our gradient descent is now simply given as,

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta\,\nabla_{\mathbf{w}} C(\mathbf{w}_t, \alpha_t) \qquad (7.6)$$
$$\alpha_{t+1} = \alpha_t - \eta\,\nabla_{\alpha} C(\mathbf{w}_t, \alpha_t) \qquad (7.7)$$

Iterating these equations until convergence will minimize the cost function. One may criticize plain vanilla gradient descent for many reasons. For example, you need to carefully choose the stepsize $\eta$ or risk either excruciatingly slow convergence or exploding values of the iterates $\mathbf{w}_t, \alpha_t$. Even if convergence is achieved asymptotically, it is typically slow. Using a Newton-Raphson method will improve convergence properties considerably but is also very expensive. Many methods have been developed to improve the optimization of the cost function, but that is not the focus of this book.

However, I do want to mention a very popular approach to optimization on very large datasets known as "stochastic gradient descent". The idea is to select a single data-item randomly and perform an update of the parameters based on that item alone:

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \eta\,(Y_n - \mathbf{w}^T\mathbf{X}_n + \alpha)\,\mathbf{X}_n \qquad (7.8)$$
$$\alpha_{t+1} = \alpha_t - \eta\,(Y_n - \mathbf{w}^T\mathbf{X}_n + \alpha) \qquad (7.9)$$


The fact that we are picking data-cases randomly injects noise into the updates, so even close to convergence we are "wiggling around" the solution. If we decrease the stepsize, however, the wiggles get smaller. So a sensible strategy would be to slowly decrease the stepsize and wiggle our way to the solution. This stochastic gradient descent is actually very efficient in practice if we can find a good annealing schedule for the stepsize. Why is that? It seems that if we use more data-cases in a mini-batch to perform a parameter update, we should be able to make larger steps in parameter space by using bigger stepsizes. While this reasoning holds close to the solution, it does not hold far away from the solution. The intuitive reason is that far away from convergence every datapoint will tell you the same story: move in direction X to improve your model. You simply do not need to query many datapoints in order to extract that information. So for a bad model there is a lot of redundancy in the information that data-cases can convey about improving the parameters, and querying a few is sufficient. Closer to convergence you need to either use more data or decrease the stepsize to increase the resolution of your gradients.

This type of reasoning clearly makes an effort to include the computational budget as part of the overall objective. This is what we have argued in chapter XX is the distinguishing feature of machine learning. If you are not convinced about how important this is in the face of modern-day datasets, imagine the following. Company C organizes a contest where they provide a virtually infinite dataset for some prediction task. You can earn 1 million dollars if you make accurate predictions on some test set by Friday next week. You can choose between a single parameter update based on all the data or many updates on small subsets of the data. Who do you think will win the contest?

7.2 A Different Cost function: Logistic Regression

The cost function of Eq. 7.2 penalizes gross violations of one's predictions rather severely (quadratically). This is sometimes counter-productive, because the algorithm might get obsessed with improving the performance on one single data-case at the expense of all the others. The real cost simply counts the number of mislabelled instances, irrespective of how badly off your prediction function $\mathbf{w}^T\mathbf{X}_n + \alpha$ was. So, a different cost function is often used,

$$C(\mathbf{w}, \alpha) = -\frac{1}{N}\sum_{n=1}^{N} Y_n \tanh(\mathbf{w}^T\mathbf{X}_n + \alpha) \qquad (7.10)$$

The function $\tanh(\cdot)$ is plotted in Figure ??. It shows that the cost can never be larger than 2, which ensures robustness against outliers. We leave it to the reader to derive the gradients and formulate the gradient descent algorithm.
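For readers who want to check their derivation afterwards, a sketch of the gradients (my own computation, using $1 - \tanh^2(z)$ as the derivative of $\tanh(z)$):

$$\nabla_{\mathbf{w}} C(\mathbf{w}, \alpha) = -\frac{1}{N}\sum_{n=1}^{N} Y_n \left(1 - \tanh^2(\mathbf{w}^T\mathbf{X}_n + \alpha)\right)\mathbf{X}_n, \qquad \nabla_{\alpha} C(\mathbf{w}, \alpha) = -\frac{1}{N}\sum_{n=1}^{N} Y_n \left(1 - \tanh^2(\mathbf{w}^T\mathbf{X}_n + \alpha)\right)$$

after which gradient descent proceeds exactly as in Eqs. (7.6)-(7.7).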

7.3 The Idea In a Nutshell

Figure ?? tells the story. One assumes that your data can be separated by a line. Any line can be represented by $\mathbf{w}^T\mathbf{x} = \alpha$. Data-cases from one class satisfy $\mathbf{w}^T\mathbf{X}_n \le \alpha$, while data-cases from the other class satisfy $\mathbf{w}^T\mathbf{X}_n \ge \alpha$. To achieve that, you write down a cost function that penalizes data-cases falling on the wrong side of the line and minimize it over $\{\mathbf{w}, \alpha\}$. For a test case you simply compute the sign of $\mathbf{w}^T\mathbf{X}_{\text{test}} - \alpha$ to make a prediction as to which class it belongs to.

Chapter 8

Support Vector Machines

Our task is to predict whether a test sample belongs to one of two classes. We receive training examples of the form $\{x_i, y_i\},\ i = 1, ..., n$, with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$. We call $\{x_i\}$ the covariates or input vectors and $\{y_i\}$ the response variables or labels.

We consider a very simple example where the data are in fact linearly separable: i.e., I can draw a straight line $f(x) = \mathbf{w}^T x - b$ such that all cases with $y_i = -1$ fall on one side and have $f(x_i) < 0$, and cases with $y_i = +1$ fall on the other side and have $f(x_i) > 0$. Given that we have achieved that, we could classify new test cases according to the rule $y_{\text{test}} = \mathrm{sign}(f(x_{\text{test}}))$.

However, typically there are infinitely many such hyper-planes, obtained by small perturbations of a given solution. How do we choose between all these hyper-planes, which solve the separation problem for our training data but may have different performance on the newly arriving test cases? For instance, we could choose to put the line very close to members of one particular class, say $y = -1$. Intuitively, when test cases arrive we will not make many mistakes on cases that should be classified with $y = +1$, but we will very easily make mistakes on the cases with $y = -1$ (for instance, imagine that a new batch of test cases arrives which are small perturbations of the training data). A sensible thing thus seems to be to choose the separation line as far away from both the $y = -1$ and $y = +1$ training cases as we can, i.e. right in the middle.

Geometrically, the vector $\mathbf{w}$ is directed orthogonal to the line defined by $\mathbf{w}^T x = b$. This can be understood as follows. First take $b = 0$. Now it is clear that all vectors $x$ with vanishing inner product with $\mathbf{w}$ satisfy this equation, i.e. all vectors orthogonal to $\mathbf{w}$ satisfy this equation. Now translate the hyperplane away from the origin over a vector $a$. The equation for the plane then becomes $(x - a)^T \mathbf{w} = 0$, i.e. we find for the offset that $b = a^T \mathbf{w}$, which is the projection of $a$ onto the vector $\mathbf{w}$. Without loss of generality we may thus choose $a$ perpendicular to the plane, in which case the length $||a|| = |b|/||\mathbf{w}||$ represents the shortest, orthogonal distance between the origin and the hyperplane.

We now define two more hyperplanes parallel to the separating hyperplane. They represent the planes that cut through the closest training examples on either side. We will call them "support hyper-planes" in the following, because the data-vectors they contain support the plane.

We define the distances between these support hyperplanes and the separating hyperplane to be $d_+$ and $d_-$ respectively. The margin, $\gamma$, is defined to be $d_+ + d_-$. Our goal is now to find the separating hyperplane so that the margin is largest, while the separating hyperplane is equidistant from both support hyperplanes.

We can write the following equations for the support hyperplanes:

$$\mathbf{w}^T x = b + \delta \qquad (8.1)$$
$$\mathbf{w}^T x = b - \delta \qquad (8.2)$$

We now note that we have over-parameterized the problem: if we scale $\mathbf{w}$, $b$ and $\delta$ by a constant factor $\alpha$, the equations for $x$ are still satisfied. To remove this ambiguity we will require that $\delta = 1$; this sets the scale of the problem, i.e. whether we measure distance in millimeters or meters.

We can now also compute the value $d_+ = (|b + 1| - |b|)/||\mathbf{w}|| = 1/||\mathbf{w}||$ (this is only true if $b \notin (-1, 0)$, since the origin doesn't fall in between the hyperplanes in that case; if $b \in (-1, 0)$ you should use $d_+ = (|b + 1| + |b|)/||\mathbf{w}|| = 1/||\mathbf{w}||$). Hence the margin is equal to twice that value: $\gamma = 2/||\mathbf{w}||$.

With the above definition of the support planes we can write down the following constraints that any solution must satisfy,

$$\mathbf{w}^T x_i - b \le -1 \quad \forall\, y_i = -1 \qquad (8.3)$$
$$\mathbf{w}^T x_i - b \ge +1 \quad \forall\, y_i = +1 \qquad (8.4)$$

or in one equation,

$$y_i(\mathbf{w}^T x_i - b) - 1 \ge 0 \qquad (8.5)$$

We now formulate the primal problem of the SVM:

$$\text{minimize}_{\mathbf{w}, b} \quad \frac{1}{2}||\mathbf{w}||^2 \quad \text{subject to} \quad y_i(\mathbf{w}^T x_i - b) - 1 \ge 0 \quad \forall i \qquad (8.6)$$

Thus, we maximize the margin, subject to the constraints that all training cases fall on either side of the support hyper-planes. The data-cases that lie on the support hyper-planes are called support vectors, since they support the hyper-planes and hence determine the solution to the problem.

The primal problem can be solved by a quadratic program. However, it is not ready to be kernelised, because its dependence is not only on inner products between data-vectors. Hence, we transform to the dual formulation by first writing the problem using a Lagrangian,

$$L(\mathbf{w}, b, \alpha) = \frac{1}{2}||\mathbf{w}||^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i(\mathbf{w}^T x_i - b) - 1 \right] \qquad (8.7)$$

The solution that minimizes the primal problem subject to the constraints is given by $\min_{\mathbf{w}} \max_{\alpha} L(\mathbf{w}, \alpha)$, i.e. a saddle point problem. When the original objective function is convex (and only then), we can interchange the minimization and maximization. Doing so, we find the conditions on $\mathbf{w}$ and $b$ that must hold at the saddle point we are solving for. These are obtained by taking derivatives with respect to $\mathbf{w}$ and $b$ and solving,

$$\mathbf{w} - \sum_i \alpha_i y_i x_i = 0 \;\Rightarrow\; \mathbf{w}^* = \sum_i \alpha_i y_i x_i \qquad (8.8)$$
$$\sum_i \alpha_i y_i = 0 \qquad (8.9)$$

Inserting this back into the Lagrangian we obtain what is known as the dual problem,

$$\text{maximize} \quad L_D = \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{ij} \alpha_i \alpha_j y_i y_j\, x_i^T x_j$$
$$\text{subject to} \quad \sum_i \alpha_i y_i = 0 \qquad (8.10)$$
$$\alpha_i \ge 0 \quad \forall i \qquad (8.11)$$

The dual formulation of the problem is also a quadratic program, but note that the number of variables $\alpha_i$ in this problem is equal to the number of data-cases, $N$. The crucial point is, however, that this problem only depends on $x_i$ through the inner product $x_i^T x_j$. This is readily kernelised through the substitution $x_i^T x_j \rightarrow k(x_i, x_j)$. This is a recurrent theme: the dual problem lends itself to kernelisation, while the primal problem did not.

The theory of duality guarantees that for convex problems the dual problem will be concave, and moreover, that the unique solution of the primal problem corresponds to the unique solution of the dual problem. In fact, we have $L_P(\mathbf{w}^*) = L_D(\alpha^*)$, i.e. the "duality-gap" is zero.

Next we turn to the conditions that must necessarily hold at the saddle point, and thus at the solution of the problem. These are called the KKT conditions (which stands for Karush-Kuhn-Tucker). These conditions are necessary in general, and sufficient for convex optimization problems. They can be derived from the primal problem by setting the derivatives with respect to $\mathbf{w}$ and $b$ to zero. Also, the constraints themselves are part of these conditions, and we need that for inequality constraints the Lagrange multipliers are non-negative. Finally, an important condition called "complementary slackness" needs to be satisfied,

$$\partial_{\mathbf{w}} L_P = 0 \;\rightarrow\; \mathbf{w} - \sum_i \alpha_i y_i x_i = 0 \qquad (8.12)$$
$$\partial_b L_P = 0 \;\rightarrow\; \sum_i \alpha_i y_i = 0 \qquad (8.13)$$
$$\text{constraint:} \quad y_i(\mathbf{w}^T x_i - b) - 1 \ge 0 \qquad (8.14)$$
$$\text{multiplier condition:} \quad \alpha_i \ge 0 \qquad (8.15)$$
$$\text{complementary slackness:} \quad \alpha_i \left[ y_i(\mathbf{w}^T x_i - b) - 1 \right] = 0 \qquad (8.16)$$

It is the last equation which may be somewhat surprising. It states that either the inequality constraint is satisfied but not saturated, $y_i(\mathbf{w}^T x_i - b) - 1 > 0$, in which case $\alpha_i$ for that data-case must be zero, or the inequality constraint is saturated, $y_i(\mathbf{w}^T x_i - b) - 1 = 0$, in which case $\alpha_i$ can be any value $\alpha_i \ge 0$. Inequality constraints which are saturated are said to be "active", while unsaturated constraints are inactive. One could imagine the process of searching for a solution as a ball which runs down the primal objective function using gradient descent. At some point it will hit a wall, which is the constraint, and although the derivative is still pointing partially towards the wall, the constraint prohibits the ball from going on. This is an active constraint because the ball is glued to that wall. When a final solution is reached, we could remove some constraints without changing the solution; these are inactive constraints. One could think of the term $\partial_{\mathbf{w}} L_P$ as the force acting on the ball. We see from the first equation above that only the data-cases with $\alpha_i \neq 0$ exert a force on the ball that balances with the force from the curved quadratic surface $\frac{1}{2}||\mathbf{w}||^2$.

The training cases with $\alpha_i > 0$, representing active constraints on the position of the support hyperplanes, are called support vectors. These are the vectors that are situated in the support hyperplanes and they determine the solution. Typically, there are only a few of them, which people call a "sparse" solution (most $\alpha$'s vanish).

What we are really interested in is the function $f(\cdot)$ which can be used to classify future test cases,

$$f(x) = \mathbf{w}^{*T} x - b^* = \sum_i \alpha_i y_i\, x_i^T x - b^* \qquad (8.17)$$

As an application of the KKT conditions we derive a solution for $b^*$ by using the complementary slackness condition,

$$b^* = \sum_j \alpha_j y_j\, x_j^T x_i - y_i, \qquad i\ \text{a support vector} \qquad (8.18)$$

where we used $y_i^2 = 1$. So, using any support vector one can determine $b^*$, but for numerical stability it is better to average over all of them (although they should obviously be consistent).

The most important conclusion is again that this function $f(\cdot)$ can thus be expressed solely in terms of inner products $x_i^T x_j$, which we can replace with kernel entries $k(x_i, x_j)$ to move to high dimensional non-linear feature spaces. Moreover, since $\alpha$ is typically very sparse, we don't need to evaluate many kernel entries in order to predict the class of a new input $x$.
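As a small sketch of how the solution is used at test time (my own illustration; it assumes the dual variables alpha, the labels y, the training inputs X and the offset b have already been obtained from some QP solver, and uses a Gaussian kernel purely as an example):

import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def svm_decision(x, X, y, alpha, b, kernel=rbf_kernel):
    """Evaluate f(x) = sum_i alpha_i y_i k(x_i, x) - b, the kernelised Eq. (8.17)."""
    sv = alpha > 1e-10                         # only support vectors contribute
    fx = sum(alpha[i] * y[i] * kernel(X[i], x) for i in np.where(sv)[0]) - b
    return np.sign(fx)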

8.1 The Non-Separable case

Obviously, not all datasets are linearly separable, and so we need to change the formalism to account for that. Clearly, the problem lies in the constraints, which cannot always be satisfied. So, let's relax those constraints by introducing "slack variables" $\xi_i$,

$$\mathbf{w}^T x_i - b \le -1 + \xi_i \quad \forall\, y_i = -1 \qquad (8.19)$$
$$\mathbf{w}^T x_i - b \ge +1 - \xi_i \quad \forall\, y_i = +1 \qquad (8.20)$$
$$\xi_i \ge 0 \quad \forall i \qquad (8.21)$$

The variables $\xi_i$ allow for violations of the constraints. We should penalize the objective function for these violations, otherwise the above constraints become void (simply always pick $\xi_i$ very large). Penalty functions of the form $C\left(\sum_i \xi_i\right)^k$ will lead to convex optimization problems for positive integers $k$. For $k = 1, 2$ it is still a quadratic program (QP). In the following we will choose $k = 1$. $C$ controls the tradeoff between the penalty and the margin.

To be on the wrong side of the separating hyperplane, a data-case would need $\xi_i > 1$. Hence, the sum $\sum_i \xi_i$ can be interpreted as a measure of how "bad" the violations are, and it is an upper bound on the number of violations.

The new primal problem thus becomes,

$$\text{minimize}_{\mathbf{w}, b, \xi} \quad L_P = \frac{1}{2}||\mathbf{w}||^2 + C\sum_i \xi_i$$
$$\text{subject to} \quad y_i(\mathbf{w}^T x_i - b) - 1 + \xi_i \ge 0 \quad \forall i \qquad (8.22)$$
$$\xi_i \ge 0 \quad \forall i \qquad (8.23)$$

leading to the Lagrangian,

$$L(\mathbf{w}, b, \xi, \alpha, \mu) = \frac{1}{2}||\mathbf{w}||^2 + C\sum_i \xi_i - \sum_{i=1}^{N} \alpha_i \left[ y_i(\mathbf{w}^T x_i - b) - 1 + \xi_i \right] - \sum_{i=1}^{N} \mu_i \xi_i \qquad (8.24)$$

from which we derive the KKT conditions,

$$1.\ \partial_{\mathbf{w}} L_P = 0 \;\rightarrow\; \mathbf{w} - \sum_i \alpha_i y_i x_i = 0 \qquad (8.25)$$
$$2.\ \partial_b L_P = 0 \;\rightarrow\; \sum_i \alpha_i y_i = 0 \qquad (8.26)$$
$$3.\ \partial_{\xi_i} L_P = 0 \;\rightarrow\; C - \alpha_i - \mu_i = 0 \qquad (8.27)$$
$$4.\ \text{constraint-1:} \quad y_i(\mathbf{w}^T x_i - b) - 1 + \xi_i \ge 0 \qquad (8.28)$$
$$5.\ \text{constraint-2:} \quad \xi_i \ge 0 \qquad (8.29)$$
$$6.\ \text{multiplier condition-1:} \quad \alpha_i \ge 0 \qquad (8.30)$$
$$7.\ \text{multiplier condition-2:} \quad \mu_i \ge 0 \qquad (8.31)$$
$$8.\ \text{complementary slackness-1:} \quad \alpha_i \left[ y_i(\mathbf{w}^T x_i - b) - 1 + \xi_i \right] = 0 \qquad (8.32)$$
$$9.\ \text{complementary slackness-2:} \quad \mu_i \xi_i = 0 \qquad (8.33)$$

From here we can deduce the following facts. If we assume that $\xi_i > 0$, then $\mu_i = 0$ (9), hence $\alpha_i = C$ (3) and thus $\xi_i = 1 - y_i(x_i^T \mathbf{w} - b)$ (8). Also, when $\xi_i = 0$ we have $\mu_i > 0$ (9) and hence $\alpha_i < C$ (3). If in addition to $\xi_i = 0$ we also have that $y_i(\mathbf{w}^T x_i - b) - 1 = 0$, then $\alpha_i > 0$ (8). Otherwise, if $y_i(\mathbf{w}^T x_i - b) - 1 > 0$, then $\alpha_i = 0$. In summary, as before, for points not on the support plane and on the correct side we have $\xi_i = \alpha_i = 0$ (all constraints inactive). On the support plane we still have $\xi_i = 0$, but now $\alpha_i > 0$. Finally, for data-cases on the wrong side of the support hyperplane the $\alpha_i$ max out at $\alpha_i = C$ and the $\xi_i$ balance the violation of the constraint such that $y_i(\mathbf{w}^T x_i - b) - 1 + \xi_i = 0$.

Geometrically, we can calculate the gap between the support hyperplane and a violating data-case to be $\xi_i/||\mathbf{w}||$. This can be seen because the plane defined by $y_i(\mathbf{w}^T x - b) - 1 + \xi_i = 0$ is parallel to the support plane at a distance $|1 + y_i b - \xi_i|/||\mathbf{w}||$ from the origin. Since the support plane is at a distance $|1 + y_i b|/||\mathbf{w}||$ from the origin, the result follows.

Finally, we need to convert to the dual problem to solve it efficiently and to kernelise it. Again, we use the KKT equations to get rid of $\mathbf{w}$, $b$ and $\xi$,

$$\text{maximize} \quad L_D = \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{ij} \alpha_i \alpha_j y_i y_j\, x_i^T x_j$$
$$\text{subject to} \quad \sum_i \alpha_i y_i = 0 \qquad (8.35)$$
$$0 \le \alpha_i \le C \quad \forall i \qquad (8.36)$$

Surprisingly, this is almost the same QP as before, but with an extra constraint on the multipliers $\alpha_i$, which now live in a box. This constraint is derived from the fact that $\alpha_i = C - \mu_i$ and $\mu_i \ge 0$. We also note that the problem only depends on the inner products $x_i^T x_j$, which are ready to be kernelised.


Chapter 9

Support Vector Regression

In kernel ridge regression we have seen that the final solution was not sparse in the variables $\alpha$. We will now formulate a regression method that is sparse, i.e. it has the concept of support vectors that determine the solution.

The thing to notice is that the sparseness arose from complementary slackness conditions, which in turn came from the fact that we had inequality constraints. In the SVM the penalty that was paid for being on the wrong side of the support plane was given by $C\sum_i \xi_i^k$ for positive integers $k$, where $\xi_i$ is the orthogonal distance away from the support plane. Note that the term $||\mathbf{w}||^2$ was there to penalize large $\mathbf{w}$ and hence to regularize the solution. Importantly, there was no penalty if a data-case was on the right side of the plane. Because all these data-points do not have any effect on the final solution, the $\alpha$ was sparse. Here we do the same thing: we introduce a penalty for being too far away from the predicted line $\mathbf{w}^T\Phi_i + b$, but once you are close enough, i.e. in some "epsilon-tube" around this line, there is no penalty. We thus expect that all the data-cases which lie inside the tube will have no impact on the final solution and hence have corresponding $\alpha_i = 0$. Using the analogy of springs: in the case of ridge-regression the springs were attached between the data-cases and the decision surface, hence every item had an impact on the position of this boundary through the force it exerted (recall that the surface was made of "rubber" and pulled back because it was parameterized using a finite number of degrees of freedom, or because it was regularized). For SVR there are only springs attached between data-cases outside the tube, and these attach to the tube, not the decision boundary. Hence, data-items inside the tube have no impact on the final solution (or rather, changing their position slightly doesn't perturb the solution).

We introduce different constraints for violating the tube constraint from above and from below,

$$\text{minimize}_{\mathbf{w}, \xi, \hat{\xi}} \quad \frac{1}{2}||\mathbf{w}||^2 + \frac{C}{2}\sum_i (\xi_i^2 + \hat{\xi}_i^2)$$
$$\text{subject to} \quad \mathbf{w}^T\Phi_i + b - y_i \le \varepsilon + \xi_i \quad \forall i$$
$$y_i - \mathbf{w}^T\Phi_i - b \le \varepsilon + \hat{\xi}_i \quad \forall i \qquad (9.1)$$

The primal Lagrangian becomes,

$$L_P = \frac{1}{2}||\mathbf{w}||^2 + \frac{C}{2}\sum_i (\xi_i^2 + \hat{\xi}_i^2) + \sum_i \alpha_i(\mathbf{w}^T\Phi_i + b - y_i - \varepsilon - \xi_i) + \sum_i \hat{\alpha}_i(y_i - \mathbf{w}^T\Phi_i - b - \varepsilon - \hat{\xi}_i) \qquad (9.2)$$

Remark I: We could have added the constraints $\xi_i \ge 0$ and $\hat{\xi}_i \ge 0$. However, it is not hard to see that the final solution will satisfy that requirement automatically, and there is no sense in constraining the optimization with conditions that the optimal solution satisfies anyway. To see this, imagine that some $\xi_i$ is negative; then, by setting $\xi_i = 0$, the cost is lower and none of the constraints is violated, so $\xi_i = 0$ is preferred. We also note that due to the above reasoning we will always have at least one of $\xi_i, \hat{\xi}_i$ equal to zero: inside the tube both are zero, outside the tube one of them is zero. This means that at the solution we have $\xi_i \hat{\xi}_i = 0$.

Remark II: Note that we don't fix the scale by setting $\varepsilon = 1$ like we did with $\delta = 1$ in the SVM case. The reason is that $\{y_i\}$ now determines the scale of the problem, i.e. we have not over-parameterized the problem.

We now take the derivatives w.r.t. $\mathbf{w}$, $b$, $\xi$ and $\hat{\xi}$ to find the following KKT conditions (there are more, of course),

$$\mathbf{w} = \sum_i (\hat{\alpha}_i - \alpha_i)\Phi_i \qquad (9.3)$$
$$\xi_i = \alpha_i / C, \qquad \hat{\xi}_i = \hat{\alpha}_i / C \qquad (9.4)$$

Plugging this back in, and using that we now also have $\alpha_i \hat{\alpha}_i = 0$, we find the dual problem,

$$\text{maximize}_{\alpha, \hat{\alpha}} \quad -\frac{1}{2}\sum_{ij}(\hat{\alpha}_i - \alpha_i)(\hat{\alpha}_j - \alpha_j)\left(K_{ij} + \frac{1}{C}\delta_{ij}\right) + \sum_i (\hat{\alpha}_i - \alpha_i)y_i - \sum_i (\hat{\alpha}_i + \alpha_i)\varepsilon$$
$$\text{subject to} \quad \sum_i (\hat{\alpha}_i - \alpha_i) = 0$$
$$\alpha_i \ge 0,\ \hat{\alpha}_i \ge 0 \quad \forall i \qquad (9.5)$$

From the complementary slackness conditions we can read off the sparseness of the solution:

$$\alpha_i(\mathbf{w}^T\Phi_i + b - y_i - \varepsilon - \xi_i) = 0 \qquad (9.6)$$
$$\hat{\alpha}_i(y_i - \mathbf{w}^T\Phi_i - b - \varepsilon - \hat{\xi}_i) = 0 \qquad (9.7)$$
$$\xi_i \hat{\xi}_i = 0, \qquad \alpha_i \hat{\alpha}_i = 0 \qquad (9.8)$$

where we added the last conditions by hand (they don't seem to follow directly from the formulation). Now we clearly see that if a case is above the tube, $\hat{\xi}_i$ will take on the smallest possible value that satisfies the constraint, $\hat{\xi}_i = y_i - \mathbf{w}^T\Phi_i - b - \varepsilon$. This implies that $\hat{\alpha}_i$ will take on a positive value, and the farther outside the tube, the larger the $\hat{\alpha}_i$ (you can think of it as a compensating force). Note that in this case $\alpha_i = 0$. A similar story goes if $\xi_i > 0$ and $\alpha_i > 0$. If a data-case is inside the tube, the $\alpha_i, \hat{\alpha}_i$ are necessarily zero, and hence we obtain sparseness.

We now change variables to make this optimization problem look more similar to the SVM and ridge-regression cases. Introduce $\beta_i = \hat{\alpha}_i - \alpha_i$ and use $\hat{\alpha}_i \alpha_i = 0$ to write $\hat{\alpha}_i + \alpha_i = |\beta_i|$,

$$\text{maximize}_{\beta} \quad -\frac{1}{2}\sum_{ij}\beta_i \beta_j \left(K_{ij} + \frac{1}{C}\delta_{ij}\right) + \sum_i \beta_i y_i - \sum_i |\beta_i|\varepsilon$$
$$\text{subject to} \quad \sum_i \beta_i = 0 \qquad (9.9)$$

where the constraint comes from the fact that we included a bias term $b$. [1]

From the slackness conditions we can also find a value for $b$ (similar to the SVM case). Also, as usual, the prediction for a new data-case is given by,

$$y = \mathbf{w}^T\Phi(x) + b = \sum_i \beta_i K(x_i, x) + b \qquad (9.10)$$
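A sketch of the resulting predictor (my own illustration; beta and b are assumed to come out of the dual optimization, and only data-cases outside the tube have a non-zero beta_i):

import numpy as np

def svr_predict(x, X_train, beta, b, kernel):
    """Evaluate y(x) = sum_i beta_i K(x_i, x) + b, Eq. (9.10)."""
    active = np.abs(beta) > 1e-10              # support vectors: cases outside the tube
    return sum(beta[i] * kernel(X_train[i], x) for i in np.where(active)[0]) + b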

[1] Note, by the way, that we could not use the trick we used in ridge-regression of defining a constant feature $\phi_0 = 1$ and $b = w_0$. The reason is that the objective does not depend on $b$.

It is an interesting exercise for the reader to work her way through the case


where the penalty is linear instead of quadratic, i.e.

$$\text{minimize}_{\mathbf{w}, \xi, \hat{\xi}} \quad \frac{1}{2}||\mathbf{w}||^2 + C\sum_i (\xi_i + \hat{\xi}_i)$$
$$\text{subject to} \quad \mathbf{w}^T\Phi_i + b - y_i \le \varepsilon + \xi_i \quad \forall i$$
$$y_i - \mathbf{w}^T\Phi_i - b \le \varepsilon + \hat{\xi}_i \quad \forall i \qquad (9.11)$$
$$\xi_i \ge 0,\ \hat{\xi}_i \ge 0 \quad \forall i \qquad (9.12)$$

leading to the dual problem,

$$\text{maximize}_{\beta} \quad -\frac{1}{2}\sum_{ij}\beta_i \beta_j K_{ij} + \sum_i \beta_i y_i - \sum_i |\beta_i|\varepsilon$$
$$\text{subject to} \quad \sum_i \beta_i = 0 \qquad (9.13)$$
$$-C \le \beta_i \le +C \quad \forall i \qquad (9.14)$$

where we note that the quadratic penalty on the size of $\beta$ is replaced by a box constraint, as is to be expected when switching from the $L_2$ norm to the $L_1$ norm.

Final remark: Let’s remind ourselves that the quadratic programs that we have

derived are convex optimization problems which have a unique optimal solution

which can be found efﬁciently using numerical methods. This is often claimed as

great progress w.r.t. the old neural networks days which were plagued by many

local optima.

Chapter 10

Kernel ridge Regression

Possibly the most elementary algorithm that can be kernelized is ridge regression. Here our task is to find a linear function that models the dependencies between covariates $\{x_i\}$ and response variables $\{y_i\}$, both continuous. The classical way to do that is to minimize the quadratic cost,

$$C(\mathbf{w}) = \frac{1}{2}\sum_i (y_i - \mathbf{w}^T x_i)^2 \qquad (10.1)$$

However, if we are going to work in feature space, where we replace $x_i \rightarrow \Phi(x_i)$, there is a clear danger that we overfit. Hence we need to regularize. This is an important topic that will return in future classes.

A simple yet effective way to regularize is to penalize the norm of $\mathbf{w}$. This is sometimes called "weight-decay". It remains to be determined how to choose the regularization strength $\lambda$ introduced below; the most common approach is to use cross-validation or leave-one-out estimates. The total cost function hence becomes,

$$C = \frac{1}{2}\sum_i (y_i - \mathbf{w}^T x_i)^2 + \frac{1}{2}\lambda||\mathbf{w}||^2 \qquad (10.2)$$

which needs to be minimized. Taking derivatives and equating them to zero gives,

$$\sum_i (y_i - \mathbf{w}^T x_i)\,x_i = \lambda \mathbf{w} \;\Rightarrow\; \mathbf{w} = \left(\lambda I + \sum_i x_i x_i^T\right)^{-1} \sum_j y_j x_j \qquad (10.3)$$

We see that the regularization term helps to stabilize the inverse numerically by

bounding the smallest eigenvalues away from zero.


10.1 Kernel Ridge Regression

We now replace all data-cases with their feature vectors: $x_i \rightarrow \Phi_i = \Phi(x_i)$. In this case the number of dimensions can be much higher, or even infinitely higher, than the number of data-cases. There is a neat trick that allows us to perform the inverse above in the smaller of the two spaces, either the dimension of the feature space or the number of data-cases. The trick is given by the following identity,

$$(P^{-1} + B^T R^{-1} B)^{-1} B^T R^{-1} = P B^T (B P B^T + R)^{-1} \qquad (10.4)$$

Now note that if $B$ is not square, the inverse is performed in spaces of different dimensionality. To apply this to our case we define the matrix $\Phi$ with elements $\Phi_{ai}$ and the vector $\mathbf{y}$ with elements $y_i$. The solution is then given by,

$$\mathbf{w} = (\lambda I_d + \Phi\Phi^T)^{-1}\Phi\mathbf{y} = \Phi(\Phi^T\Phi + \lambda I_n)^{-1}\mathbf{y} \qquad (10.5)$$

This equation can be rewritten as $\mathbf{w} = \sum_i \alpha_i \Phi(x_i)$ with $\alpha = (\Phi^T\Phi + \lambda I_n)^{-1}\mathbf{y}$. This is an equation that will be a recurrent theme, and it can be interpreted as: the solution $\mathbf{w}$ must lie in the span of the data-cases, even if the dimensionality of the feature space is much larger than the number of data-cases. This seems intuitively clear, since the algorithm is linear in feature space.

We finally need to show that we never actually need access to the feature vectors, which could be infinitely long (which would be rather impractical). What we need in practice is the predicted value for a new test point $x$. This is computed by projecting it onto the solution $\mathbf{w}$,

$$y = \mathbf{w}^T\Phi(x) = \mathbf{y}^T(\Phi^T\Phi + \lambda I_n)^{-1}\Phi^T\Phi(x) = \mathbf{y}^T(K + \lambda I_n)^{-1}\kappa(x) \qquad (10.6)$$

where $K(x_i, x_j) = \Phi(x_i)^T\Phi(x_j)$ and $\kappa_i(x) = K(x_i, x)$. The important message here is of course that we only need access to the kernel $K$.
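Kernel ridge regression is only a few lines of code. A sketch (my own illustration, using a Gaussian kernel as an example):

import numpy as np

def fit_krr(X, y, kernel, lam=0.1):
    """Compute the dual weights alpha = (K + lam*I)^{-1} y used in Eq. (10.6)."""
    N = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])
    return np.linalg.solve(K + lam * np.eye(N), y)

def predict_krr(x, X, alpha, kernel):
    """y(x) = sum_i alpha_i K(x_i, x), i.e. y^T (K + lam*I)^{-1} kappa(x)."""
    return sum(a * kernel(xi, x) for a, xi in zip(alpha, X))

# Toy usage with a Gaussian kernel.
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2))
X = np.array([[0.0], [1.0], [2.0]]); y = np.array([0.0, 1.0, 4.0])
alpha = fit_krr(X, y, rbf, lam=0.01)
print(predict_krr(np.array([1.5]), X, alpha, rbf))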

We can now add a bias to the whole story by adding one more, constant feature to $\Phi$: $\Phi_0 = 1$. The value of $w_0$ then represents the bias, since

$$\mathbf{w}^T\Phi = \sum_a w_a \Phi_{ai} + w_0 \qquad (10.7)$$

Hence, the story goes through unchanged.


10.2 An alternative derivation

Instead of optimizing the cost function above, we can introduce Lagrange multipliers into the problem. This will have the effect that the derivation goes along similar lines as the SVM case. We introduce new variables, $\xi_i = y_i - \mathbf{w}^T\Phi_i$, and rewrite the objective as the following constrained QP,

$$\text{minimize}_{\mathbf{w}, \xi} \quad L_P = \sum_i \xi_i^2$$
$$\text{subject to} \quad y_i - \mathbf{w}^T\Phi_i = \xi_i \quad \forall i \qquad (10.8)$$
$$||\mathbf{w}|| \le B \qquad (10.9)$$

This leads to the Lagrangian,

$$L_P = \sum_i \xi_i^2 + \sum_i \beta_i\left[y_i - \mathbf{w}^T\Phi_i - \xi_i\right] + \lambda\left(||\mathbf{w}||^2 - B^2\right) \qquad (10.10)$$

Two of the KKT conditions tell us that at the solution we have:

$$2\xi_i = \beta_i \quad \forall i, \qquad 2\lambda\mathbf{w} = \sum_i \beta_i \Phi_i \qquad (10.11)$$

Plugging it back into the Lagrangian, we obtain the dual Lagrangian,

$$L_D = \sum_i \left(-\frac{1}{4}\beta_i^2 + \beta_i y_i\right) - \frac{1}{4\lambda}\sum_{ij}\beta_i\beta_j K_{ij} - \lambda B^2 \qquad (10.12)$$

We now redefine $\alpha_i = \beta_i/(2\lambda)$ to arrive at the following dual optimization problem,

$$\text{maximize}_{\alpha, \lambda} \quad -\lambda^2\sum_i \alpha_i^2 + 2\lambda\sum_i \alpha_i y_i - \lambda\sum_{ij}\alpha_i\alpha_j K_{ij} - \lambda B^2 \quad \text{s.t.} \quad \lambda \ge 0 \qquad (10.13)$$

Taking derivatives w.r.t. $\alpha$ gives precisely the solution we had already found,

$$\alpha^* = (K + \lambda I)^{-1}\mathbf{y} \qquad (10.14)$$

Formally we also need to maximize over λ. However, different choices of λ cor-

respond to different choices for B. Either λ or B should be chosen using cross-

validation or some other measure, so we could as well vary λ in this process.


One big disadvantage of ridge-regression is that we don't have sparseness in the $\alpha$ vector, i.e. there is no concept of support vectors. Sparseness is useful because when we test a new example, we only have to sum over the support vectors, which is much faster than summing over the entire training-set. In the SVM the sparseness was born out of the inequality constraints, because the complementary slackness conditions told us that if a constraint was inactive, then the corresponding multiplier $\alpha_i$ was zero. There is no such effect here.

Chapter 11

Kernel K-means and Spectral

Clustering

The objective in K-means can be written as follows:

$$C(z, \mu) = \sum_i ||x_i - \mu_{z_i}||^2 \qquad (11.1)$$

where we wish to minimize over the assignment variables $z_i$ (which can take values $z_i = 1, .., K$) for all data-cases $i$, and over the cluster means $\mu_k,\ k = 1..K$. It is not hard to show that the following iterations achieve that,

$$z_i = \arg\min_k ||x_i - \mu_k||^2 \qquad (11.2)$$
$$\mu_k = \frac{1}{N_k}\sum_{i \in C_k} x_i \qquad (11.3)$$

where $C_k$ is the set of data-cases assigned to cluster $k$.
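A direct implementation of these two iterations (my own sketch; initializing the means with K randomly chosen data-cases is one common choice, and the data is assumed to be a float array):

import numpy as np

def kmeans(X, K, n_iters=100, rng=np.random.default_rng(0)):
    """Alternate the assignment step (11.2) and the mean update (11.3)."""
    mu = X[rng.choice(len(X), size=K, replace=False)]    # initial cluster means
    for _ in range(n_iters):
        # Assignment: each data-case goes to the closest cluster mean.
        d = np.sum((X[:, None, :] - mu[None, :, :]) ** 2, axis=2)
        z = np.argmin(d, axis=1)
        # Update: each mean becomes the average of its assigned data-cases.
        for k in range(K):
            if np.any(z == k):
                mu[k] = X[z == k].mean(axis=0)
    return z, mu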

Now, let's assume we have defined many features, $\phi(x_i)$, and wish to do clustering in feature space. The objective is similar to before,

$$C(z, \mu) = \sum_i ||\phi(x_i) - \mu_{z_i}||^2 \qquad (11.4)$$

We will now introduce an $N \times K$ assignment matrix $Z_{nk}$, each row of which represents a data-case and contains exactly one 1, placed in column $k$ if that data-case is assigned to cluster $k$. As a result we have $\sum_k Z_{nk} = 1$ and $N_k = \sum_n Z_{nk}$. Also define $L = \mathrm{diag}[1/\sum_n Z_{nk}] = \mathrm{diag}[1/N_k]$. Finally, define $\Phi_{in} = \phi_i(x_n)$. With these definitions you can now check that the matrix $M$ defined as

$$M = \Phi Z L Z^T \qquad (11.5)$$

consists of $N$ columns, one for each data-case, where each column contains a copy of the cluster mean $\mu_k$ to which that data-case is assigned. Using this we can write out the K-means cost as,

$$C = \mathrm{tr}[(\Phi - M)(\Phi - M)^T] \qquad (11.6)$$

Next we can show that $Z^T Z = L^{-1}$ (check this), and thus that $(Z L Z^T)^2 = Z L Z^T$. In other words, it is a projection. Similarly, $I - Z L Z^T$ is a projection on the complement space. Using this we simplify Eqn. 11.6 as,

$$C = \mathrm{tr}[\Phi(I - Z L Z^T)^2\Phi^T] \qquad (11.7)$$
$$= \mathrm{tr}[\Phi(I - Z L Z^T)\Phi^T] \qquad (11.8)$$
$$= \mathrm{tr}[\Phi\Phi^T] - \mathrm{tr}[\Phi Z L Z^T\Phi^T] \qquad (11.9)$$
$$= \mathrm{tr}[K] - \mathrm{tr}[L^{\frac{1}{2}} Z^T K Z L^{\frac{1}{2}}] \qquad (11.10)$$

where we used that $\mathrm{tr}[AB] = \mathrm{tr}[BA]$, and $L^{\frac{1}{2}}$ is defined by taking the square root of the diagonal elements.

Note that only the second term depends on the clustering matrix Z, so we can

we can now formulate the following equivalent kernel clustering problem,

max

Z

tr[L

1

2

Z

T

KZL

1

2

] (11.11)

such that: Z is a binary clustering matrix. (11.12)

This objective is entirely speciﬁed in terms of kernels and so we have once again

managed to move to the ”dual” representation. Note also that this problem is

very difﬁcult to solve due to the constraints which forces us to search of binary

matrices.

Our next step will be to approximate this problem through a relaxation of
this constraint. First we recall that Z^T Z = L^{-1} ⇒ L^{1/2} Z^T Z L^{1/2} = I. Renaming
H = ZL^{1/2}, with H an N×K dimensional matrix, we can formulate the following
relaxation of the problem,

max_H  tr[H^T K H]                                                           (11.13)
subject to  H^T H = I                                                        (11.14)

Note that we do not require H to be binary any longer. The hope is that the
solution is close to some clustering solution that we can then extract a posteriori.

The above problem should look familiar. Interpret the columns of H as a
collection of K mutually orthonormal basis vectors. The objective can then be
written as,

Σ_{k=1}^K h_k^T K h_k                                                        (11.15)

By choosing h_k proportional to the K largest eigenvectors of K we will maximize
the objective, i.e. we have

K = UΛU^T   ⇒   H = U_{[1:K]} R                                              (11.16)

where R is a rotation inside the eigenvalue space, RR^T = R^T R = I. Using this
you can now easily verify that tr[H^T K H] = Σ_{k=1}^K λ_k, where {λ_k}, k = 1..K, are
the largest K eigenvalues.

What is perhaps surprising is that the solution to this relaxed kernel-clustering
problem is given by kernel-PCA! Recall that for kernel PCA we also solved for the
eigenvalues of K. How then do we extract a clustering solution from kernel-PCA?
Recall that the columns of H (the eigenvectors of K) should approximate the
binary matrix Z which had a single 1 per row, indicating to which cluster data-case
n is assigned. We could try to simply threshold the entries of H so that the largest
value is set to 1 and the remaining ones to 0. However, it often works better to
first normalize H,

Ĥ_{nk} = H_{nk} / √(Σ_k H_{nk}^2)                                            (11.17)

All rows of Ĥ are located on the unit sphere. We can now run a simple clustering
algorithm such as K-means on the data matrix Ĥ to extract K clusters. The above
procedure is sometimes referred to as "spectral clustering".
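The whole pipeline fits in a few lines of NumPy (an illustrative sketch, not part of the original text); it assumes the kmeans routine sketched earlier in this chapter and a precomputed kernel matrix.

import numpy as np

def spectral_clustering(K_mat, n_clusters):
    # K_mat: N x N symmetric positive semi-definite kernel (Gram) matrix.
    # Step 1: the columns of H are the largest eigenvectors of K (relaxed problem 11.13-11.14).
    eigvals, eigvecs = np.linalg.eigh(K_mat)        # eigh returns eigenvalues in ascending order
    H = eigvecs[:, -n_clusters:]
    # Step 2: normalize every row onto the unit sphere, eqn (11.17).
    H_hat = H / np.linalg.norm(H, axis=1, keepdims=True)
    # Step 3: run ordinary K-means on the rows of H_hat.
    z, _ = kmeans(H_hat, n_clusters)
    return z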

Conclusion: Kernel-PCA can be viewed as a nonlinear feature extraction tech-

nique. Input is a matrix of similarities (the kernel matrix or Gram matrix) which

should be positive semi-deﬁnite and symmetric. If you extract two or three fea-

tures (dimensions) you can use it as a non-linear dimensionality reduction method

(for purposes of visualization). If you use the result as input to a simple clustering

method (such as K-means) it becomes a nonlinear clustering method.


Chapter 12

Kernel Principal Components

Analysis

Let's first see what PCA is when we do not worry about kernels and feature spaces.
We will always assume that we have centered data, i.e. Σ_i x_i = 0. This can
always be achieved by a simple translation of the axis.

Our aim is to ﬁnd meaningful projections of the data. However, we are facing

an unsupervised problem where we don’t have access to any labels. If we had,

we should be doing Linear Discriminant Analysis. Due to this lack of labels,

our aim will be to ﬁnd the subspace of largest variance, where we choose the

number of retained dimensions beforehand. This is clearly a strong assumption,

because it may happen that there is interesting signal in the directions of small

variance, in which case PCA is not a suitable technique (and we should perhaps

use a technique called independent component analysis). However, usually it is

true that the directions of smallest variance represent uninteresting noise.

To make progress, we start by writing down the sample covariance matrix C,

C = (1/N) Σ_i x_i x_i^T                                                      (12.1)

The eigenvalues of this matrix represent the variance in the eigen-directions of

data-space. The eigen-vector corresponding to the largest eigenvalue is the direc-

tion in which the data is most stretched out. The second direction is orthogonal

to it and picks the direction of largest variance in that orthogonal subspace etc.

Thus, to reduce the dimensionality of the data, we project the data onto the re-


tained eigen-directions of largest variance:

UΛU^T = C   ⇒   C = Σ_a λ_a u_a u_a^T                                        (12.2)

and the projection is given by,

y_i = U_k^T x_i   ∀i                                                         (12.3)

where U_k means the d×k sub-matrix containing the first k eigenvectors as columns.
As a side effect, we can now show that the projected data are de-correlated in this
new basis:

(1/N) Σ_i y_i y_i^T = (1/N) Σ_i U_k^T x_i x_i^T U_k = U_k^T C U_k = U_k^T UΛU^T U_k = Λ_k      (12.4)

where Λ_k is the (diagonal) k × k sub-matrix corresponding to the largest eigenvalues.
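As a quick illustrative sketch (not part of the original text), the whole procedure in NumPy on centered data:

import numpy as np

def pca_project(X, k):
    # X: N x d data matrix, assumed centered (subtract the sample mean first).
    N = len(X)
    C = X.T @ X / N                      # sample covariance, eqn (12.1)
    eigvals, U = np.linalg.eigh(C)       # eigenvalues in ascending order
    U_k = U[:, ::-1][:, :k]              # d x k matrix of the top-k eigenvectors
    Y = X @ U_k                          # projections y_i = U_k^T x_i, eqn (12.3)
    return Y, U_k

# Sanity check of eqn (12.4): Y.T @ Y / N is (approximately) diagonal,
# with the k largest eigenvalues of C on its diagonal.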

Another convenient property of this procedure is that the reconstruction error,
in L_2-norm, between y and x is minimal, i.e. the quantity

Σ_i ||x_i − P_k x_i||^2                                                      (12.5)

where P_k = U_k U_k^T is the projection onto the subspace spanned by the columns of
U_k, is minimal.

Now imagine that there are more dimensions than data-cases, i.e. some di-

mensions remain unoccupied by the data. In this case it is not hard to show that

the eigen-vectors that span the projection space must lie in the subspace spanned

by the data-cases. This can be seen as follows,

λ_a u_a = C u_a = (1/N) Σ_i x_i x_i^T u_a = (1/N) Σ_i (x_i^T u_a) x_i
   ⇒   u_a = Σ_i [(x_i^T u_a)/(N λ_a)] x_i = Σ_i α_i^a x_i                   (12.6)

where u_a is some arbitrary eigen-vector of C. The last expression can be interpreted
as: "every eigen-vector can be exactly written (i.e. losslessly) as some
linear combination of the data-vectors, and hence it must lie in its span". This
also implies that instead of the eigenvalue equation Cu = λu we may consider
the N projected equations x_i^T C u = λ x_i^T u  ∀i. From these equations the coefficients
α_i^a can be computed efficiently in a space of dimension N (and not d) as follows,

x_i^T C u_a = λ_a x_i^T u_a   ⇒
x_i^T (1/N) Σ_k x_k x_k^T Σ_j α_j^a x_j = λ_a x_i^T Σ_j α_j^a x_j   ⇒
(1/N) Σ_{j,k} α_j^a [x_i^T x_k][x_k^T x_j] = λ_a Σ_j α_j^a [x_i^T x_j]       (12.7)

We now rename the matrix [x_i^T x_j] = K_{ij} to arrive at,

K^2 α^a = N λ_a K α^a   ⇒   K α^a = λ̃_a α^a   with   λ̃_a = N λ_a            (12.8)

So, we have derived an eigenvalue equation for α which in turn completely determines
the eigenvectors u. By requiring that u is normalized we find,

u_a^T u_a = 1   ⇒   Σ_{i,j} α_i^a α_j^a [x_i^T x_j] = α_a^T K α_a = N λ_a α_a^T α_a = 1
   ⇒   ||α_a|| = 1/√(N λ_a)                                                  (12.9)

Finally, when we receive a new data-case t and we would like to compute its projection
onto the new reduced space, we compute,

u_a^T t = Σ_i α_i^a x_i^T t = Σ_i α_i^a K(x_i, t)                            (12.10)

This equation should look familiar; it is central to most kernel methods.

Obviously, the whole exposition was set up so that in the end we only needed
the matrix K to do our calculations. This implies that we are now ready to kernelize
the procedure by replacing x_i → Φ(x_i) and defining K_{ij} = Φ(x_i)Φ(x_j)^T,
where Φ(x_i) = Φ_{ia}.
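A minimal NumPy sketch of the kernelized procedure follows (not part of the original text); it takes a precomputed (and, as discussed in the next section, ideally centered) kernel matrix as input.

import numpy as np

def kernel_pca_fit(K_mat, k):
    # Solve K alpha = (N lambda) alpha, eqn (12.8), keeping the k largest eigenvalues.
    tilde_lam, A = np.linalg.eigh(K_mat)
    tilde_lam, A = tilde_lam[::-1][:k], A[:, ::-1][:, :k]
    # Rescale so that ||alpha^a|| = 1/sqrt(N lambda_a) = 1/sqrt(tilde_lambda_a), eqn (12.9).
    return A / np.sqrt(tilde_lam)[None, :]

def kernel_pca_project(K_test, A):
    # K_test[m, i] = K(x_i, t_m) for new cases t_m; projections via eqn (12.10).
    return K_test @ A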

12.1 Centering Data in Feature Space

It is in fact very difficult to explicitly center the data in feature space. But, we
know that the final algorithm only depends on the kernel matrix, so if we can
center the kernel matrix we are done as well. A kernel matrix is given by
K_{ij} = Φ_i Φ_j^T. We now center the features using,

Φ_i = Φ_i − (1/N) Σ_k Φ_k                                                    (12.11)

Hence the kernel in terms of the new features is given by,

K^c_{ij} = (Φ_i − (1/N) Σ_k Φ_k)(Φ_j − (1/N) Σ_l Φ_l)^T                      (12.12)
        = Φ_i Φ_j^T − [(1/N) Σ_k Φ_k] Φ_j^T − Φ_i [(1/N) Σ_l Φ_l^T] + [(1/N) Σ_k Φ_k][(1/N) Σ_l Φ_l^T]      (12.13)
        = K_{ij} − κ_i 1_j^T − 1_i κ_j^T + k 1_i 1_j^T                       (12.14)

with   κ_i = (1/N) Σ_k K_{ik}                                                (12.15)
and    k = (1/N^2) Σ_{ij} K_{ij}                                             (12.16)

Hence, we can compute the centered kernel in terms of the non-centered kernel
alone and no features need to be accessed.

At test-time we need to compute,

K^c(t_i, x_j) = [Φ(t_i) − (1/N) Σ_k Φ(x_k)][Φ(x_j) − (1/N) Σ_l Φ(x_l)]^T     (12.17)

Using a similar calculation (left for the reader) you can find that this can be expressed
easily in terms of K(t_i, x_j) and K(x_i, x_j) as follows,

K^c(t_i, x_j) = K(t_i, x_j) − κ(t_i) 1_j^T − 1_i κ(x_j)^T + k 1_i 1_j^T      (12.18)
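Both centering operations act on the kernel matrices alone, as the following NumPy sketch illustrates (not part of the original text):

import numpy as np

def center_kernel(K):
    # Training kernel: K^c = K - kappa 1^T - 1 kappa^T + k 1 1^T, eqns (12.14)-(12.16).
    kappa = K.mean(axis=1, keepdims=True)        # kappa_i = (1/N) sum_k K_ik, as a column
    k_bar = K.mean()                             # k = (1/N^2) sum_ij K_ij
    return K - kappa - kappa.T + k_bar

def center_test_kernel(K_tx, K):
    # K_tx[m, j] = K(t_m, x_j); centered test kernel according to eqn (12.18).
    kappa_t = K_tx.mean(axis=1, keepdims=True)   # (1/N) sum_k K(t_m, x_k), as a column
    kappa_x = K.mean(axis=1, keepdims=True).T    # (1/N) sum_l K(x_j, x_l), as a row
    return K_tx - kappa_t - kappa_x + K.mean()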

Chapter 13

Fisher Linear Discriminant Analysis

The most famous example of dimensionality reduction is ”principal components

analysis". This technique searches for directions in the data that have largest variance
and subsequently projects the data onto them. In this way, we obtain a lower

dimensional representation of the data, that removes some of the ”noisy” direc-

tions. There are many difﬁcult issues with how many directions one needs to

choose, but that is beyond the scope of this note.

PCA is an unsupervised technique and as such does not include label informa-

tion of the data. For instance, if we imagine 2 cigar like clusters in 2 dimensions,

one cigar has y = 1 and the other y = −1. The cigars are positioned in parallel

and very closely together, such that the variance in the total data-set, ignoring the

labels, is in the direction of the cigars. For classiﬁcation, this would be a terrible

projection, because all labels get evenly mixed and we destroy the useful infor-

mation. A much more useful projection is orthogonal to the cigars, i.e. in the

direction of least overall variance, which would perfectly separate the data-cases

(obviously, we would still need to perform classiﬁcation in this 1-D space).

So the question is, how do we utilize the label information in ﬁnding informa-

tive projections? To that purpose Fisher-LDA considers maximizing the following

objective:

J(w) = (w^T S_B w) / (w^T S_W w)                                             (13.1)

where S_B is the "between classes scatter matrix" and S_W is the "within classes
scatter matrix". Note that due to the fact that scatter matrices are proportional to
the covariance matrices we could have defined J using covariance matrices – the
proportionality constant would have no effect on the solution. The definitions of
the scatter matrices are:

S_B = Σ_c N_c (µ_c − x̄)(µ_c − x̄)^T                                          (13.2)
S_W = Σ_c Σ_{i∈c} (x_i − µ_c)(x_i − µ_c)^T                                   (13.3)

where,

µ_c = (1/N_c) Σ_{i∈c} x_i                                                    (13.4)
x̄ = (1/N) Σ_i x_i = (1/N) Σ_c N_c µ_c                                        (13.5)

and N_c is the number of cases in class c. Oftentimes you will see that for 2 classes
S_B is defined as S'_B = (µ_1 − µ_2)(µ_1 − µ_2)^T. This is the scatter of class 1 with
respect to the scatter of class 2, and you can show that S_B = (N_1 N_2 / N) S'_B, but since it
boils down to multiplying the objective with a constant it makes no difference to
the final solution.

Why does this objective make sense? Well, it says that a good solution is one

where the class-means are well separated, measured relative to the (sum of the)

variances of the data assigned to a particular class. This is precisely what we want,

because it implies that the gap between the classes is expected to be big. It is also

interesting to observe that since the total scatter,

S_T = Σ_i (x_i − x̄)(x_i − x̄)^T                                               (13.6)

is given by S_T = S_W + S_B, the objective can be rewritten as,

J(w) = (w^T S_T w) / (w^T S_W w) − 1                                         (13.7)

and hence can be interpreted as maximizing the total scatter of the data while

minimizing the within scatter of the classes.

An important property to notice about the objective J is that it is invariant
w.r.t. rescalings of the vectors, w → αw. Hence, we can always choose w such
that the denominator is simply w^T S_W w = 1, since it is a scalar itself. For this
reason we can transform the problem of maximizing J into the following constrained
optimization problem,

min_w  −(1/2) w^T S_B w                                                      (13.8)
s.t.   w^T S_W w = 1                                                         (13.9)

corresponding to the Lagrangian,

L_P = −(1/2) w^T S_B w + (1/2) λ (w^T S_W w − 1)                             (13.10)

(the halves are added for convenience). The KKT conditions tell us that the following
equation needs to hold at the solution,

S_B w = λ S_W w   ⇒   S_W^{-1} S_B w = λ w                                   (13.11)

This almost looks like an eigen-value equation, if only the matrix S_W^{-1} S_B had
been symmetric (in fact, it is called a generalized eigen-problem). However, we
can apply the following transformation, using the fact that S_B is symmetric positive
definite and can hence be written as S_B^{1/2} S_B^{1/2}, where S_B^{1/2} is constructed from its
eigenvalue decomposition as S_B = UΛU^T → S_B^{1/2} = UΛ^{1/2} U^T. Defining v = S_B^{1/2} w
we get,

S_B^{1/2} S_W^{-1} S_B^{1/2} v = λ v                                         (13.12)

This problem is a regular eigenvalue problem for the symmetric, positive definite
matrix S_B^{1/2} S_W^{-1} S_B^{1/2}, for which we can find solutions λ_k and v_k that
correspond to solutions w_k = S_B^{-1/2} v_k.

It remains to choose which eigenvalue and eigenvector corresponds to the desired
solution. Plugging the solution back into the objective J, we find,

J(w) = (w^T S_B w) / (w^T S_W w) = (λ_k w_k^T S_W w_k) / (w_k^T S_W w_k) = λ_k      (13.13)

from which it immediately follows that we want the largest eigenvalue to maximize
the objective¹.

¹ If you try to find the dual and maximize that, you'll get the wrong sign, it seems. My best
guess of what goes wrong is that the constraint is not linear and as a result the problem is not
convex, and hence we cannot expect the optimal dual solution to be the same as the optimal primal
solution.
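For concreteness, here is a small NumPy sketch (not part of the original text) that builds the scatter matrices and solves the generalized eigen-problem of eqn (13.11) for the projection direction:

import numpy as np

def fisher_lda(X, y):
    # X: N x d data matrix, y: class labels; returns the direction w with the largest eigenvalue in eqn (13.11).
    classes = np.unique(y)
    x_bar = X.mean(0)
    d = X.shape[1]
    S_B, S_W = np.zeros((d, d)), np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(0)
        S_B += len(Xc) * np.outer(mu_c - x_bar, mu_c - x_bar)   # eqn (13.2)
        S_W += (Xc - mu_c).T @ (Xc - mu_c)                      # eqn (13.3)
    # Generalized eigen-problem S_W^{-1} S_B w = lambda w.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    w = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
    return w / np.linalg.norm(w)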


13.1 Kernel Fisher LDA

So how do we kernelize this problem? Unlike with SVMs, the dual problem does
not seem to reveal the kernelized problem naturally. But inspired by the SVM case
we make the following key assumption,

w = Σ_i α_i Φ(x_i)                                                           (13.14)

This is a central recurrent equation that keeps popping up in every kernel machine.
It says that although the feature space is very high (or even infinite) dimensional,
with a finite number of data-cases the final solution, w*, will not have a component
outside the space spanned by the data-cases. It would not make much sense to
do this transformation if the number of data-cases is larger than the number of
dimensions, but this is typically not the case for kernel-methods. So, we argue
that although there are possibly infinite dimensions available a priori, at most N
are being occupied by the data, and the solution w must lie in their span. This is a
case of the "representer theorem", which intuitively reasons as follows. The solution
w solves some eigenvalue equation, S_B^{1/2} S_W^{-1} S_B^{1/2} w = λw, where both S_B
and S_W (and hence its inverse) lie in the span of the data-cases. Hence, the part
w_⊥ that is perpendicular to this span will be projected to zero, and the equation
above puts no constraints on those dimensions. They can be arbitrary and have no
impact on the solution. If we now assume a very general form of regularization on
the norm of w, then these orthogonal components will be set to zero in the final
solution: w_⊥ = 0.

= 0.

In terms of α the objective J(α) becomes,

J(α) =

α

T

S

Φ

B

α

α

T

S

Φ

W

α

(13.15)

where it is understood that vector notation nowapplies to a different space, namely

the space spanned by the data-vectors, R

N

. The scatter matrices in kernel space

can expressed in terms of the kernel only as follows (this requires some algebra to

13.1. KERNEL FISHER LDA 67

verify),

S

Φ

B

=

¸

c

κ

c

κ

T

c

−κκ

T

(13.16)

S

Φ

W

= K

2

−

¸

c

N

c

κ

c

κ

T

c

(13.17)

κ

c

=

1

N

c

¸

i∈c

K

ij

(13.18)

κ =

1

N

¸

i

K

ij

(13.19)

So, we have managed to express the problem in terms of kernels only, which
is what we were after. Note that since the objective in terms of α has exactly
the same form as that in terms of w, we can solve it by solving the generalized
eigenvalue equation. This scales as N^3, which is certainly expensive for many
datasets. More efficient optimization schemes solving a slightly different problem
and based on efficient quadratic programs exist in the literature.

Projections of new test-points into the solution space can be computed by,

w^T Φ(x) = Σ_i α_i K(x_i, x)                                                 (13.20)

as usual. In order to classify the test point we still need to divide the space into
regions which belong to one class. The easiest possibility is to pick the class
with the smallest Mahalanobis distance: d(x, µ^Φ_c) = (x^α − µ^α_c)^2 / (σ^α_c)^2, where µ^α_c
and σ^α_c represent the class mean and standard deviation in the 1-d projected space
respectively. Alternatively, one could train any classifier in the 1-d subspace.

One very important issue that we did not pay attention to is regularization.
Clearly, as it stands the kernel machine will overfit. To regularize we can add a
term to the denominator,

S_W → S_W + βI                                                               (13.21)

Adding a diagonal term to this matrix makes sure that very small eigenvalues
are bounded away from zero, which improves numerical stability in computing the
inverse. If we write the Lagrangian formulation where we maximize a constrained
quadratic form in α, the extra term appears as a penalty proportional to ||α||^2,
which acts as a weight decay term, favoring smaller values of α over larger ones.
Fortunately, the optimization problem has exactly the same form in the regularized
case.
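The following NumPy sketch (not part of the original text) assembles the kernel scatter matrices literally as written in eqns (13.16)-(13.19), adds the regularizer of eqn (13.21), and solves the resulting generalized eigen-problem in α; the value β = 1e-3 is an arbitrary placeholder.

import numpy as np

def kernel_fda(K, y, beta=1e-3):
    # K: N x N kernel matrix, y: class labels, beta: regularizer of eqn (13.21).
    N = K.shape[0]
    kappa = K.mean(axis=1)                             # eqn (13.19)
    S_B = -np.outer(kappa, kappa)
    S_W = K @ K
    for c in np.unique(y):
        idx = (y == c)
        kappa_c = K[:, idx].mean(axis=1)               # eqn (13.18)
        S_B += np.outer(kappa_c, kappa_c)              # eqn (13.16)
        S_W -= idx.sum() * np.outer(kappa_c, kappa_c)  # eqn (13.17)
    S_W += beta * np.eye(N)                            # regularization, eqn (13.21)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    return np.real(eigvecs[:, np.argmax(np.real(eigvals))])

def kfda_project(K_new, alpha):
    # K_new[m, i] = K(x_i, x_m) for new points x_m; projection of eqn (13.20).
    return K_new @ alpha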


13.2 A Constrained Convex Programming Formulation of FDA

Chapter 14

Kernel Canonical Correlation

Analysis

Imagine you are given 2 copies of a corpus of documents, one written in English,

the other written in German. You may consider an arbitrary representation of

the documents, but for deﬁniteness we will use the “vector space” representation

where there is an entry for every possible word in the vocabulary and a document

is represented by count values for every word, i.e. if the word "the" appeared 12
times and is the first word in the vocabulary, we have X_1(doc) = 12, etc.

Let’s say we are interested in extracting low dimensional representations for

each document. If we had only one language, we could consider running PCA

to extract directions in word space that carry most of the variance. This has the

ability to infer semantic relations between the words such as synonymy, because

if words tend to co-occur often in documents, i.e. they are highly correlated, they

tend to be combined into a single dimension in the new space. These spaces can

often be interpreted as topic spaces.

If we have two translations, we can try to ﬁnd projections of each representa-

tion separately such that the projections are maximally correlated. Hopefully, this

implies that they represent the same topic in two different languages. In this way

we can extract language independent topics.

Let x be a document in English and y a document in German. Consider the
projections: u = a^T x and v = b^T y. Also assume that the data have zero mean.
We now consider the following objective,

ρ = E[uv] / √(E[u^2] E[v^2])                                                 (14.1)


We want to maximize this objective, because this would maximize the correlation

between the univariates u and v. Note that we divided by the standard deviation

of the projections to remove scale dependence.

This exposition is very similar to the Fisher discriminant analysis story and I

encourage you to reread that. For instance, there you can ﬁnd how to generalize

to cases where the data is not centered. We also introduced the following “trick”.

Since we can rescale a and b without changing the problem, we can constrain
E[u^2] and E[v^2] to be equal to 1. This then allows us to write the problem as,

maximize_{a,b}  ρ = E[uv]
subject to  E[u^2] = 1
            E[v^2] = 1                                                       (14.2)

Or, if we construct a Lagrangian and write out the expectations we find,

min_{a,b} max_{λ_1,λ_2}  Σ_i a^T x_i y_i^T b − (1/2) λ_1 (Σ_i a^T x_i x_i^T a − N) − (1/2) λ_2 (Σ_i b^T y_i y_i^T b − N)      (14.3)

where we have multiplied by N. Let's take derivatives w.r.t. a and b to see what
the KKT equations tell us,

Σ_i x_i y_i^T b − λ_1 Σ_i x_i x_i^T a = 0                                    (14.4)
Σ_i y_i x_i^T a − λ_2 Σ_i y_i y_i^T b = 0                                    (14.5)

First notice that if we multiply the first equation with a^T and the second with
b^T and subtract the two, while using the constraints, we arrive at λ_1 = λ_2 = λ.
Next, rename S_xy = Σ_i x_i y_i^T, S_x = Σ_i x_i x_i^T and S_y = Σ_i y_i y_i^T. We define
the following larger matrices: S_D is the block diagonal matrix with S_x and S_y on
the diagonal and zeros on the off-diagonal blocks. Also, we define S_O to be the
matrix with S_xy on the off-diagonal blocks and zeros on the diagonal blocks.
Finally we define c = [a, b]. The two equations can then be written jointly as,

S_O c = λ S_D c   ⇒   S_D^{-1} S_O c = λ c   ⇒   S_O^{1/2} S_D^{-1} S_O^{1/2} (S_O^{1/2} c) = λ (S_O^{1/2} c)      (14.6)

which is again a regular eigenvalue equation for c′ = S_O^{1/2} c.
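A compact NumPy sketch of this primal solution (not part of the original text; it simply builds S_D and S_O and solves the generalized eigenvalue problem of eqn (14.6) numerically):

import numpy as np

def cca(X, Y):
    # X: N x p, Y: N x q, both assumed centered; returns a, b and the top correlation lambda.
    p, q = X.shape[1], Y.shape[1]
    S_x, S_y, S_xy = X.T @ X, Y.T @ Y, X.T @ Y
    S_D = np.block([[S_x, np.zeros((p, q))], [np.zeros((q, p)), S_y]])
    S_O = np.block([[np.zeros((p, p)), S_xy], [S_xy.T, np.zeros((q, q))]])
    # Generalized eigenvalue problem S_O c = lambda S_D c.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_D, S_O))
    i = np.argmax(np.real(eigvals))
    c = np.real(eigvecs[:, i])
    return c[:p], c[p:], np.real(eigvals[i])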


14.1 Kernel CCA

As usual, the starting point is to map the data-cases to feature vectors Φ(x_i) and
Ψ(y_i). When the dimensionality of the space is larger than the number of data-cases
in the training-set, then the solution must lie in the span of the data-cases, i.e.

a = Σ_i α_i Φ(x_i)        b = Σ_i β_i Ψ(y_i)                                 (14.7)

Using this equation in the Lagrangian we get,

L = α^T K_x K_y β − (1/2) λ (α^T K_x^2 α − N) − (1/2) λ (β^T K_y^2 β − N)    (14.8)

where α is a vector in a different, N-dimensional, space than e.g. a, which lives in
a D-dimensional space, and K_x is the Gram matrix with entries [K_x]_{ij} = Φ(x_i)Φ(x_j)^T,
and similarly for K_y.

Taking derivatives w.r.t. α and β we find,

K_x K_y β = λ K_x^2 α                                                        (14.9)
K_y K_x α = λ K_y^2 β                                                        (14.10)

Let's try to solve these equations by assuming that K_x is full rank (which is typically
the case). We get α = λ^{-1} K_x^{-1} K_y β and hence K_y^2 β = λ^2 K_y^2 β, which
always has a solution for λ = 1. By recalling that,

ρ = (1/N) a^T S_xy b = (1/N) λ a^T S_x a = λ                                 (14.11)

we observe that this represents the solution with maximal correlation and hence
the preferred one. This is a typical case of over-fitting, and it emphasizes again the need
to regularize in kernel methods. This can be done by adding a diagonal term to the
constraints in the Lagrangian (or equivalently to the denominator of the original
objective), leading to the Lagrangian,

L = α^T K_x K_y β − (1/2) λ (α^T K_x^2 α + η||α||^2 − N) − (1/2) λ (β^T K_y^2 β + η||β||^2 − N)      (14.12)

One can see that this acts as a quadratic penalty on the norm of α and β. The
resulting equations are,

K_x K_y β = λ (K_x^2 + ηI) α                                                 (14.13)
K_y K_x α = λ (K_y^2 + ηI) β                                                 (14.14)

Analogous to the primal problem, we define big matrices: K_D, which contains
(K_x^2 + ηI) and (K_y^2 + ηI) as blocks on the diagonal and zeros at the blocks off
the diagonal, and the matrix K_O, which has the matrix K_x K_y on the right-upper
off-diagonal block and K_y K_x at the left-lower off-diagonal block. Also, we define
γ = [α, β]. This leads to the equation,

K_O γ = λ K_D γ   ⇒   K_D^{-1} K_O γ = λ γ   ⇒   K_O^{1/2} K_D^{-1} K_O^{1/2} (K_O^{1/2} γ) = λ (K_O^{1/2} γ)      (14.15)

which is again a regular eigenvalue equation. Note that the regularization also

moved the smallest eigenvalue away from zero, and hence made the inverse more

numerically stable. The value for η needs to be chosen using cross-validation or

some other measure. Solving the equations using this larger eigen-value problem

is actually not quite necessary, and more efﬁcient methods exist (see book).
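As an illustrative NumPy sketch (not part of the original text), the regularized kernel CCA problem of eqn (14.15) can be solved as follows; the value of η is an arbitrary placeholder and would normally be chosen by cross-validation, as noted above.

import numpy as np

def kernel_cca(K_x, K_y, eta=0.1):
    # K_x, K_y: N x N kernel matrices of the two views; eta: regularizer of eqns (14.13)-(14.14).
    N = K_x.shape[0]
    Z = np.zeros((N, N))
    K_D = np.block([[K_x @ K_x + eta * np.eye(N), Z],
                    [Z, K_y @ K_y + eta * np.eye(N)]])
    K_O = np.block([[Z, K_x @ K_y],
                    [K_y @ K_x, Z]])
    # Generalized eigenvalue problem K_O gamma = lambda K_D gamma, eqn (14.15).
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(K_D, K_O))
    i = np.argmax(np.real(eigvals))
    gamma = np.real(eigvecs[:, i])
    return gamma[:N], gamma[N:], np.real(eigvals[i])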

The solutions are not expected to be sparse, because eigen-vectors are not
expected to be sparse. One would have to replace L_2 norm penalties with L_1
norm penalties to obtain sparsity.

Appendix A

Essentials of Convex Optimization

A.1 Lagrangians and all that

Most kernel-based algorithms fall into two classes, either they use spectral tech-

niques to solve the problem, or they use convex optimization techniques to solve

the problem. Here we will discuss convex optimization.

A constrained optimization problem can be expressed as follows,

minimize_x  f_0(x)
subject to  f_i(x) ≤ 0  ∀i
            h_j(x) = 0  ∀j                                                   (A.1)

That is, we have inequality constraints and equality constraints. We now write
the primal Lagrangian of this problem, which will be helpful in the following
development,

L_P(x, λ, ν) = f_0(x) + Σ_i λ_i f_i(x) + Σ_j ν_j h_j(x)                      (A.2)

where we will assume in the following that λ_i ≥ 0 ∀i. From here we can define
the dual Lagrangian by,

L_D(λ, ν) = inf_x L_P(x, λ, ν)                                               (A.3)

This objective can actually become −∞ for certain values of its arguments. We
will call parameters λ ≥ 0, ν for which L_D > −∞ dual feasible.

It is important to notice that the dual Lagrangian is a concave function of λ, ν,
because it is a pointwise infimum of a family of functions that are linear in λ, ν.
Hence, even if the primal is not convex, the dual is certainly concave!

It is not hard to show that

L_D(λ, ν) ≤ p*                                                               (A.4)

where p* is the primal optimal point. This simply follows because
Σ_i λ_i f_i(x) + Σ_j ν_j h_j(x) ≤ 0 for a primal feasible point x*.

Thus, the dual problem always provides a lower bound to the primal problem.
The optimal lower bound can be found by solving the dual problem,

maximize_{λ,ν}  L_D(λ, ν)
subject to  λ_i ≥ 0  ∀i                                                      (A.5)

which is therefore a convex optimization problem. If we call d* the dual optimal
point we always have d* ≤ p*, which is called weak duality. p* − d* is called
the duality gap. Strong duality holds when p* = d*. Strong duality is very nice,
in particular if we can express the primal solution x* in terms of the dual solution
λ*, ν*, because then we can simply solve the dual problem and convert the
answer back to the primal domain, since we know that solution must then be optimal.
Often the dual problem is easier to solve.

So when does strong duality hold? Up to some mathematical details the answer
is: if the primal problem is convex and the equality constraints are linear.
This means that f_0(x) and {f_i(x)} are convex functions and h_j(x) = Ax − b.

The primal problem can be written as follows,

p* = inf_x sup_{λ≥0,ν} L_P(x, λ, ν)                                          (A.6)

This can be seen as follows by noting that sup_{λ≥0,ν} L_P(x, λ, ν) = f_0(x) when

x is feasible but ∞ otherwise. To see this ﬁrst check that by violating one of

the constraints you can ﬁnd a choice of λ, ν that makes the Lagrangian inﬁnity.

Also, when all the constraints are satisﬁed, the best we can do is maximize the

additional terms to be zero, which is always possible. For instance, we can simply

set all λ, ν to zero, even though this is not necessary if the constraints themselves

vanish.

The dual problem by definition is given by,

d* = sup_{λ≥0,ν} inf_x L_P(x, λ, ν)                                          (A.7)

Hence, the "sup" and "inf" can be interchanged if strong duality holds, and
the optimal solution is then a saddle-point. It is important to realize that the order of
maximization and minimization matters for arbitrary functions (but not for convex
functions). Try to imagine a "V"-shaped valley which runs diagonally across the
coordinate system. If we first maximize over one direction, keeping the other
direction fixed, and then minimize the result, we end up with the lowest point on
the rim. If we reverse the order we end up with the highest point in the valley.

There are a number of important necessary conditions that hold for problems

with zero duality gap. These Karush-Kuhn-Tucker conditions turn out to be sufﬁ-

cient for convex optimization problems. They are given by,

∇f_0(x*) + Σ_i λ*_i ∇f_i(x*) + Σ_j ν*_j ∇h_j(x*) = 0                         (A.8)
f_i(x*) ≤ 0                                                                  (A.9)
h_j(x*) = 0                                                                  (A.10)
λ*_i ≥ 0                                                                     (A.11)
λ*_i f_i(x*) = 0                                                             (A.12)

The first equation is easily derived because we already saw that p* = inf_x L_P(x, λ*, ν*),
and hence all the derivatives must vanish. This condition has a nice interpretation
as a "balancing of forces". Imagine a ball rolling down a surface defined by f_0(x)

(i.e. you are doing gradient descent to ﬁnd the minimum). The ball gets blocked

by a wall, which is the constraint. If the surface and constraint is convex then if the

ball doesn’t move we have reached the optimal solution. At that point, the forces

on the ball must balance. The first term represents the force of the ball against the

wall due to gravity (the ball is still on a slope). The second term represents the re-

action force of the wall in the opposite direction. The λ represents the magnitude

of the reaction force, which needs to be higher if the surface slopes more. We say

that this constraint is “active”. Other constraints which do not exert a force are

"inactive" and have λ = 0. The latter statement can be read off from the last KKT
condition, which we call "complementary slackness". It says that either f_i(x) = 0
(the constraint is saturated and hence active), in which case λ is free to take on a
non-zero value, or the constraint is inactive, f_i(x) < 0, in which case λ must
vanish. As we will see soon, the active constraints will correspond to the support
vectors in SVMs!
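As a small worked example (not in the original text), consider minimizing f_0(x) = x^2 subject to the single inequality constraint f_1(x) = 1 − x ≤ 0. The Lagrangian is L_P = x^2 + λ(1 − x), and the KKT conditions require 2x − λ = 0, λ ≥ 0, 1 − x ≤ 0 and λ(1 − x) = 0. The unconstrained minimum x = 0 violates the constraint, so the constraint must be active: x* = 1, and stationarity then gives λ* = 2 > 0, the magnitude of the "reaction force" of the wall. Complementary slackness holds because the constraint is saturated, f_1(x*) = 0, while λ* is non-zero.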


Complementary slackness is easily derived by,

f_0(x*) = L_D(λ*, ν*) = inf_x [ f_0(x) + Σ_i λ*_i f_i(x) + Σ_j ν*_j h_j(x) ]
        ≤ f_0(x*) + Σ_i λ*_i f_i(x*) + Σ_j ν*_j h_j(x*)                      (A.13)
        ≤ f_0(x*)                                                            (A.14)

where the first line follows from Eqn. A.6, the second because the infimum is never
larger than the value at the particular point x*, and the last because f_i(x*) ≤ 0,
λ*_i ≥ 0 and h_j(x*) = 0. Hence all inequalities are equalities; and since each term
in the sum is non-positive, each term must vanish separately.

Appendix B

Kernel Design

B.1 Polynomials Kernels

The construction that we will follow below is to first write feature vectors as products
of subsets of input attributes, i.e. define feature vectors as follows,

φ_I(x) = x_1^{i_1} x_2^{i_2} ... x_n^{i_n}                                   (B.1)

where we can put various restrictions on the possible combinations of indices

which are allowed. For instance, we could require that their sum is a constant

s, i.e. there are precisely s terms in the product. Or we could require that each
i_j ∈ {0, 1}. Generally speaking, the best choice depends on the problem you are

modelling, but another important constraint is that the corresponding kernel must

be easy to compute.

Let's define the kernel as usual as,

K(x, y) = Σ_I φ_I(x) φ_I(y)                                                  (B.2)

where I = [i_1, i_2, ..., i_n]. We have already encountered the polynomial kernel as,

K(x, y) = (R + x^T y)^d = Σ_{s=0}^d [d! / (s!(d − s)!)] R^{d−s} (x^T y)^s    (B.3)

where the last equality follows from a binomial expansion. If we write out the
term,

(x^T y)^s = (x_1 y_1 + x_2 y_2 + ... + x_n y_n)^s
          = Σ_{i_1+i_2+...+i_n = s} [s! / (i_1! i_2! ... i_n!)] (x_1 y_1)^{i_1} (x_2 y_2)^{i_2} ... (x_n y_n)^{i_n}      (B.4)

Taken together with eqn. B.3 we see that the features correspond to,

φ_I(x) = √[ (d! / ((d − s)! i_1! i_2! ... i_n!)) R^{d−s} ]  x_1^{i_1} x_2^{i_2} ... x_n^{i_n}    with   i_1 + i_2 + ... + i_n = s ≤ d      (B.5)

The point is really that in order to efficiently compute the total sum of (n+d)!/(n! d!) terms
we have inserted very special coefficients. The only true freedom we have left is
in choosing R: for larger R we down-weight higher order polynomials more.

The question we want to answer is: how much freedom do we have in choosing
different coefficients while still being able to compute the inner product efficiently?
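The following short Python sketch (not part of the original text) makes the bookkeeping concrete for small n and d: it enumerates the explicit features of eqn (B.5) by brute force and checks that their inner product equals the directly computed kernel of eqn (B.3).

import numpy as np
from itertools import product
from math import factorial, sqrt

def poly_kernel(x, y, R=1.0, d=2):
    return (R + x @ y) ** d                              # eqn (B.3), computed in O(n)

def poly_features(x, R=1.0, d=2):
    # Explicit features of eqn (B.5): one entry per multi-index (i_1,...,i_n) with i_1+...+i_n <= d.
    feats = []
    for idx in product(range(d + 1), repeat=len(x)):
        s = sum(idx)
        if s > d:
            continue
        coef = factorial(d) / (factorial(d - s) * np.prod([factorial(i) for i in idx]))
        feats.append(sqrt(coef * R ** (d - s)) * np.prod([x[j] ** idx[j] for j in range(len(x))]))
    return np.array(feats)

x, y = np.array([0.5, -1.0]), np.array([2.0, 0.3])
print(poly_kernel(x, y), poly_features(x) @ poly_features(y))   # the two numbers agree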

B.2 All Subsets Kernel

We define the feature again as a product of powers of input attributes. However,
in this case, the choice of power is restricted to {0, 1}, i.e. the feature is present
or absent. For n input dimensions (number of attributes) we have 2^n possible
combinations.

Let's compute the kernel function:

K(x, y) = Σ_I φ_I(x) φ_I(y) = Σ_I Π_{j: i_j = 1} x_j y_j = Π_{i=1}^n (1 + x_i y_i)      (B.6)

where the last identity follows from the fact that,

Π_{i=1}^n (1 + z_i) = 1 + Σ_i z_i + Σ_{ij} z_i z_j + ... + z_1 z_2 ... z_n   (B.7)

i.e. a sum over all possible combinations. Note that in this case again, it is much
more efficient to compute the kernel directly than to sum over the features. Also note
that in this case there is no decaying factor multiplying the monomials.
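A tiny Python check (not part of the original text) contrasting the O(n) product of eqn (B.6) with the brute-force sum over all 2^n subset features:

import numpy as np
from itertools import product

def all_subsets_kernel(x, y):
    # Eqn (B.6): product over attributes of (1 + x_i y_i).
    return np.prod(1.0 + x * y)

x, y = np.array([0.5, -1.0, 2.0]), np.array([1.0, 0.3, -0.5])
print(all_subsets_kernel(x, y))

# Brute force: sum over all 2^n subsets of the product of x_j y_j over the chosen attributes.
brute = sum(np.prod((x * y)[np.array(mask, dtype=bool)]) for mask in product([0, 1], repeat=len(x)))
print(brute)        # same value, but requires 2^n terms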


B.3 The Gaussian Kernel

This is given by,

K(x, y) = exp( −||x − y||^2 / (2σ^2) )                                       (B.8)

where σ controls the flexibility of the kernel: for very small σ the Gram matrix
becomes the identity and every point is very dissimilar to any other point. On
the other hand, for σ very large we find the constant kernel, with all entries equal
to 1, and hence all points look completely similar. This underscores the need in
kernel-methods for regularization; it is easy to perform perfectly on the training data,
which does not imply you will do well on new test data.
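These two extremes are easy to see numerically; here is a short NumPy sketch (not part of the original text) that evaluates the Gram matrix of eqn (B.8) for a tiny random data set at both ends of the σ range.

import numpy as np

def gaussian_gram(X, sigma):
    # Eqn (B.8) evaluated for every pair of rows of X.
    sq = np.sum(X**2, 1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-d2 / (2 * sigma**2))

X = np.random.randn(5, 3)
print(np.round(gaussian_gram(X, 1e-2), 3))   # tiny sigma: essentially the identity matrix
print(np.round(gaussian_gram(X, 1e3), 3))    # huge sigma: essentially the all-ones matrix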

In the RKHS construction the features corresponding to the Gaussian kernel
are Gaussians around the data-case, i.e. smoothed versions of the data-cases,

φ(x) = exp( −||x − ·||^2 / (2σ^2) )                                          (B.9)

and thus every new direction which is added to the feature space is going to be orthogonal
to all directions outside the width of the Gaussian and somewhat aligned
to close-by points.

Since the inner product of any feature vector with itself is 1, all vectors have

length 1. Moreover, the inner product between any two different feature vectors is

positive, implying that all feature vectors can be represented in the positive orthant

(or any other orthant), i.e. they lie on a sphere of radius 1 in a single orthant.

80 APPENDIX B. KERNEL DESIGN

Bibliography

81

2

Contents

Preface Learning and Intuition 1 Data and Information 1.1 Data Representation . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 Preprocessing the Data . . . . . . . . . . . . . . . . . . . . . . . Data Visualization iii vii 1 2 4 7

2 3

Learning 11 3.1 In a Nutshell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 Types of Machine Learning 17 4.1 In a Nutshell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 Nearest Neighbors Classiﬁcation 21 5.1 The Idea In a Nutshell . . . . . . . . . . . . . . . . . . . . . . . . 23 The Naive Bayesian Classiﬁer 6.1 The Naive Bayes Model . . . . . 6.2 Learning a Naive Bayes Classiﬁer 6.3 Class-Prediction for New Instances 6.4 Regularization . . . . . . . . . . . 6.5 Remarks . . . . . . . . . . . . . . 6.6 The Idea In a Nutshell . . . . . . . 25 25 27 28 30 31 31

4

5

6

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

7

The Perceptron 33 7.1 The Perceptron Model . . . . . . . . . . . . . . . . . . . . . . . 34 i

. . .1 Polynomials Kernels . . . . . . . . . . . . . . . . .1 Centering Data in Feature Space . . . . . . . . . . . . . . . . . . . . B. . . . . . . 9 Support Vector Regression 10 Kernel ridge Regression 10.2 An alternative derivation . . . . . . . The Idea In a Nutshell . . . . . B Kernel Design B. . . . .3 The Gaussian Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 Kernel Canonical Correlation Analysis 14. . . . . . . . . . . . . . . . . .2 A Constrained Convex Programming Formulation of FDA . . .ii 7.3 CONTENTS A Different Cost function: Logistic Regression . . . . . .2 All Subsets Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1 Kernel CCA . . . . . . . . . . . . 11 Kernel K-means and Spectral Clustering 12 Kernel Principal Components Analysis 12. . . . . . . . . . .1 The Non-Separable case . . . . . . . . . . . . . . . . . . . . . 37 38 39 43 47 51 52 53 55 59 61 63 66 68 69 71 73 73 77 77 78 79 8 Support Vector Machines 8. .2 7. . . . . . . . .1 Kernel Fisher LDA . . . A Essentials of Convex Optimization A. . . . . B. . . . . . . . . . . . . . . 10.1 Kernel Ridge Regression . . . . . . . . . . . . . . . . . . . .1 Lagrangians and all that . . . . . . . . . . . . . . . . . . . . . 13. . . . . 13 Fisher Linear Discriminant Analysis 13. . . . . . . . . . .

pencils. Excellent textbooks do exist for this ﬁeld. some with labels (e. Machine learning is a relatively recent discipline that emerged from the general ﬁeld of artiﬁcial intelligence only quite recently. intuitive introduction into the concepts of machine learning. I quickly found that these concepts could not be taken for granted at an undergraduate level. after many years of intense research the we can now recognize faces in images to a high degree accuracy. Should we invest the same effort to build good classiﬁers for monkeys. It is simply too costly and impractical to design intelligent systems by ﬁrst gathering all the expert knowledge ourselves and then hard-wiring it into a machine. or should we build systems to can observe millions of training images. eigenvalue decompositions. intuitive explanation of some of the most useful algorithms that our ﬁeld has to offer. For instance. axes etc. the book you see before you is meant for those starting out in the ﬁeld who need a simple. Much of machine learning is build upon concepts from mathematics such as partial derivatives. chairs. in these pixels in the image correspond to a car) but most of them without side information? Although there is currently no system which can recognize even in the order of 1000 object categories (the best system can get iii . While I had been teaching machine learning at a graduate level it became soon clear that teaching the same material to an undergraduate class was a whole new challenge. multivariate probability densities and so on. but I found all of them to be too technical for a ﬁrst encounter with machine learning. But the world has approximately 30.g. A ﬁrst read to wet the appetite so to speak. a prelude to the more technical and advanced textbooks. This experience led me to believe there was a genuine need for a simple. The situation was aggravated by the lack of a suitable textbook. Hence.Preface In winter quarter 2007 I taught an undergraduate course in machine learning at UC Irvine.000 visual object categories according to some estimates (Biederman). To build intelligent machines researchers realized that these machines should learn from and adapt to their environment.

the internet. To give an example. particle accelerators. semi-supervised learning. Machine learning emerged from AI but quickly incorporated ideas from ﬁelds as diverse as statistics. perhaps more important reason for the growth of machine learning is the exponential growth of both available data and computer power. scanned text and so on. Modern search engines do not run terribly sophisticated algorithms. fuzzy logic. I will only cover the most basic approaches in this book from a highly per- . such as supervised learning. the main conference in this ﬁeld is called: advances in neural information processing systems. physics and more. But there is no doubt in my mind that building truly intelligent machines will involve learning from data. artiﬁcial neural networks.iv PREFACE about 60% correct on 100 categories). graphical models. The second. information theory. seismic measurements. cognitive science. reinforcement learning etc. To sample a few sub-disciplines: statistical learning. the army. theoretical neuroscience. The ﬁeld of machine learning is multifaceted and expanding fast. The ﬁrst reason for the recent successes of machine learning and the growth of the ﬁeld as a whole is rooted in its multidisciplinary character. banks. It is difﬁcult to appreciate the exponential growth of data that our society is generating. but they manage to store and sift through almost the entire content of the internet to return sensible search results. referring to information theory and theoretical neuroscience and cognitive science. the fact that we pull it off seemingly effortlessly serves as a “proof of concept” that it can be done. kernel methods. active learning. sky observatories. To give some examples of recent successes of this approach one would only have to turn on one computer and perform an internet search. Bayesian methods and so on. video. probability. a modern satellite generates roughly the same amount of data all previous satellites produced together. computer science. not because a new model was invented but because many more translated documents became available. control theory. The ﬁeld also covers many types of learning problems. unsupervised learning. To give an example. the human genome project. This insight has shifted the attention from highly sophisticated modeling techniques on small datasets to more basic analysis on much larger data-sets (the latter sometimes called data-mining). Hence the emphasis shifted to algorithmic efﬁciency and as a result many machine learning faculty (like myself) can typically be found in computer science departments. While the ﬁeld is build on theory and tools developed statistics machine learning recognizes that the most exiting progress can be made to leverage the enormous ﬂood of data that is generated each year by satellites. the stock market. convex optimization. There has also been much success in the ﬁeld of machine translation.

The ﬁrst chapter will be devoted to why I think this is important. Many books also excel in stating facts in an almost encyclopedic style. MEANT FOR INDUSTRY AS WELL AS BACKGROUND READING] This book was written during my sabbatical at the Radboudt University in Nijmegen (Netherlands).. But what will (hopefully) be signiﬁcantly different than most other scientiﬁc books is the manner in which I will present these methods. This is my primary mission: to write a book which conveys intuition. I have always been frustrated by the lack of proper explanation of equations.. UCI. Instead of trying to cover all aspects of the entire ﬁeld I have chosen to present a few popular and perhaps useful tools and approaches. Hans for discussion on intuition.v sonal perspective. Bert Kappen who leads an excellent group of postocs and students for his hospitality. Many times I have been staring at a formula having not the slightest clue where it came from or how it was derived. kids. without providing the proper intuition of the method. I like to thank Prof.. Marga. .

vi

PREFACE

**Learning and Intuition
**

We have all experienced the situation that the solution to a problem presents itself while riding your bike, walking home, “relaxing” in the washroom, waking up in the morning, taking your shower etc. Importantly, it did not appear while banging your head against the problem in a conscious effort to solve it, staring at the equations on a piece of paper. In fact, I would claim, that all my bits and pieces of progress have occured while taking a break and “relaxing out of the problem”. Greek philosophers walked in circles when thinking about a problem; most of us stare at a computer screen all day. The purpose of this chapter is to make you more aware of where your creative mind is located and to interact with it in a fruitful manner. My general thesis is that contrary to popular belief, creative thinking is not performed by conscious thinking. It is rather an interplay between your conscious mind who prepares the seeds to be planted into the unconscious part of your mind. The unconscious mind will munch on the problem “out of sight” and return promising roads to solutions to the consciousness. This process iterates until the conscious mind decides the problem is sufﬁciently solved, intractable or plain dull and moves on to the next. It maybe a little unsettling to learn that at least part of your thinking goes on in a part of your mind that seems inaccessible and has a very limited interface with what you think of as yourself. But it is undeniable that it is there and it is also undeniable that it plays a role in the creative thought-process. To become a creative thinker one should how learn to play this game more effectively. To do so, we should think about the language in which to represent knowledge that is most effective in terms of communication with the unconscious. In other words, what type of “interface” between conscious and unconscious mind should we use? It is probably not a good idea to memorize all the details of a complicated equation or problem. Instead we should extract the abstract idea and capture the essence of it in a picture. This could be a movie with colors and other vii

viii

LEARNING AND INTUITION

baroque features or a more “dull” representation, whatever works. Some scientist have been asked to describe how they represent abstract ideas and they invariably seem to entertain some type of visual representation. A beautiful account of this in the case of mathematicians can be found in a marvellous book “XXX” (Hardamard). By building accurate visual representations of abstract ideas we create a database of knowledge in the unconscious. This collection of ideas forms the basis for what we call intuition. I often ﬁnd myself listening to a talk and feeling uneasy about what is presented. The reason seems to be that the abstract idea I am trying to capture from the talk clashed with a similar idea that is already stored. This in turn can be a sign that I either misunderstood the idea before and need to update it, or that there is actually something wrong with what is being presented. In a similar way I can easily detect that some idea is a small perturbation of what I already knew (I feel happily bored), or something entirely new (I feel intrigued and slightly frustrated). While the novice is continuously challenged and often feels overwhelmed, the more experienced researcher feels at ease 90% of the time because the “new” idea was already in his/her data-base which therefore needs no and very little updating. Somehow our unconscious mind can also manipulate existing abstract ideas into new ones. This is what we usually think of as creative thinking. One can stimulate this by seeding the mind with a problem. This is a conscious effort and is usually a combination of detailed mathematical derivations and building an intuitive picture or metaphor for the thing one is trying to understand. If you focus enough time and energy on this process and walk home for lunch you’ll ﬁnd that you’ll still be thinking about it in a much more vague fashion: you review and create visual representations of the problem. Then you get your mind off the problem altogether and when you walk back to work suddenly parts of the solution surface into consciousness. Somehow, your unconscious took over and kept working on your problem. The essence is that you created visual representations as the building blocks for the unconscious mind to work with. In any case, whatever the details of this process are (and I am no psychologist) I suspect that any good explanation should include both an intuitive part, including examples, metaphors and visualizations, and a precise mathematical part where every equation and derivation is properly explained. This then is the challenge I have set to myself. It will be your task to insist on understanding the abstract idea that is being conveyed and build your own personalized visual representations. I will try to assist in this process but it is ultimately you who will have to do the hard work.

If you feel under-challenged and become bored I recommend you move on to the more advanced text-books of which there are many excellent samples on the market (for a list see (books)). Above all. Undoubtedly for many it will be. But I hope that for most beginning students this intuitive style of writing may help to gain a deeper understanding of the ideas that I will present in the following.ix Many people may ﬁnd this somewhat experimental way to introduce students to new topics counter-productive. have fun! .

x LEARNING AND INTUITION .

need I go on? But data in itself is useless. can we predict the weather next week). What do we mean by “relevant information”? When analyzing data we typically have a speciﬁc question in mind such as :“How many types of car can be discerned in this video” or “what will be weather next week”. most ﬁnancial transactions are recorded. But one should keep in mind that the particular analysis depends on the task one has in mind. Hidden inside the data is valuable information. if I have a data-base of attributes of Hummers such as weight. number of people it can hold etc. Let me spell out a few tasks that are typically considered in machine learning: Prediction: Here we ask ourselves whether we can extrapolate the information in the data to new unseen cases. the FBI maintains a DNA-database of most convicted criminals. satellites and observatories generate tera-bytes of data every year. or the stock prizes. bar-plots or even little movies. 1 . often your clicking pattern is recorded when surﬁng the web. Another example is predicting the weather (given all the recorded weather patterns in the past. color. soon all written text from our libraries is digitized. Surveillance cameras continuously capture video. and another data-base of attributes of Ferraries. or a sequence of numbers or (the temperature next week) or a complicated pattern (the cloud conﬁguration next week). So the answer can take the form of a single number (there are 5 cars). If the answer to our query is itself complex we like to visualize it using graphs. For instance. every time you make a phone call your name and location gets recorded.Chapter 1 Data and Information Data is everywhere in abundant amounts. The objective of machine learning is to pull the relevant information from the data and make it available to the user. then one can try to predict the type of car (Hummer or Ferrari) from a new set of attributes.

what do we download into our computer? Data comes in many shapes and forms. Also. If data is completely random there is nothing to predict. for instance it could be words from a document or pixels from an image. compressibility. you would be able to do a pretty good job (for instance 20% may be blue sky and predicting the neighbors of a blue sky pixel is easy). From the day we are born we start noticing that there is structure in this world. All of the above objectives depend on the fact that there is structure in the data. Perhaps this holds for all these objects. predictability. the number of bits needed to represent it. For one. and we predict it by learning its structure. redundancy.k. Our game is to predict the future accurately. a. For instance. Take the example of natural images. But it will be useful to convert data into a . all of these concepts are intimately related: structure. it won’t give way. 1. it wouldn’t contain objects. if we would generate images at random they would not look like natural scenes at all. regularity. ﬁles in your computer can be “zipped” to a much smaller size by removing much of the redundancy in those ﬁles. They refer to the “food” for machine learning. nothing to interpret and nothing to compress. JPEG and GIF (among others) are compressed representations of the original pixel-map.a. When I cry my mother suddenly appears. Our survival depends on discovering and recording this structure. it damages my body. Hence. Also. DATA AND INFORMATION Interpretation: Here we seek to answer questions about the data.2 CHAPTER 1. Thus. all tasks are somehow related to discovering or leveraging this structure.1 Data Representation What does “data” look like? In other words. If you are required to predict the color of the pixels neighboring to some random pixel in an image. One could say that data is highly redundant and that this redundancy is exactly what makes it interesting. In fact. For instance. interpretability. Only a tiny fraction of all possible images looks “natural” and so the space of natural images is highly structured. If I walk into this brown cylinder with a green canopy I suddenly stop. what property of this drug was responsible for its high success-rate? Does a security ofﬁcer at the airport apply racial proﬁling in deciding who’s luggage to check? How many natural groups are there in the data? Compression: Here we are interested in compressing the original data. without structure there is nothing to learn. The same thing is true for human learning.

It is still there. In the next section we’ll discuss some standard preprocessing operations. The structure is still in the data. The value Xin for attribute i and data-case n can be binary. For documents. The algorithms that we will discuss are based on certain assumptions. . It is very important to realize that there are many ways to represent data and not all are equally suitable for analysis. Most datasets can be cast in this form (but not all). It is often advisable to visualize the data before preprocessing and analyzing it. the matrix X is 2 × 100 dimensional and X1. such as.5076 = 4 would mean: the word book appeared 4 times in document 5076.20 = 2 is the color of car nr.57 is the weight of car nr. We can either try to rescale the images to a common size or we can simply leave those entries in the matrix empty. It may also occur that a certain entry is supposed to be there but it couldn’t be measured. Say the word “book” is deﬁned to have nr.1.. Most datasets can be represented as a matrix. but just harder to ﬁnd. 20 in some units (a real value) while X2. Consider searching the internet for images about rats. discrete etc. 20 (say one of 6 predeﬁned colors). to indicate that that entry wasn’t observed. DATA REPRESENTATION 3 standard format so that the algorithms that we will discuss can be applied to it. While this may be true if we measure weight in kilograms and height in meters. but we would need a much more complex assumption to discover it. A lesson to be learned is thus to spend some time thinking about in which representation the structure is as obvious as possible and transform the data if necessary before applying standard algorithms. X = [Xin ]. 10.20 = 20. see ﬁgure ??. By this I mean that in some representation the structure may be obvious while in other representation is may become totally obscure. if we run an optical character recognition system on a scanned document some letters will not be recognized. if we measure weight and color of 100 cars. real. it is no longer true if we decide to recode these numbers into bit-strings.1. This will often tell you if the structure is a good match for the algorithm you had in mind for further analysis. 684. “Hummers and Ferraries can be separated with by a line. For instance. depending on what we measure. 568 in the vocabulary then X10568. You’ll retrieve a large variety of images most with a different number of pixels. We’ll use a question mark “?”. we can give each distinct word of a prespeciﬁed vocabulary a nr. Sometimes the different data-cases do not have the same number of attributes. and simply count how often a word was present. Chapter ?? will discuss some elementary visualization techniques. with rows indexed by “attribute-index” i and columns indexed by “data-index” n. For instance.

1.2 Preprocessing the Data

As mentioned in the previous section, algorithms are based on assumptions and can become more effective if we transform the data first. Consider the following example, depicted in figure ??a. The algorithm consists of estimating the area that the data occupy: it grows a circle starting at the origin and, at the point where it contains all the data, we record the area of the circle. The figure shows why this will be a bad estimate: the data-cloud is not centered. If we had first centered it we would have obtained a reasonable estimate. Although this example is somewhat simple-minded, there are many, much more interesting algorithms that assume centered data.

To center data we will introduce the sample mean of the data, given by,

E[X]_i = (1/N) Σ_{n=1}^N X_{in}    (1.1)

Hence, for every attribute i separately, we simply add all the attribute values across data-cases and divide by the total number of data-cases. To transform the data so that their sample mean is zero, we set,

X'_{in} = X_{in} − E[X]_i   ∀n    (1.2)

It is now easy to check that the sample mean of X' indeed vanishes. An illustration of the global shift is given in figure ??b. We also see in this figure that the algorithm described above now works much better!

In a similar spirit as centering, we may also wish to scale the data along the coordinate axes in order to make it more "spherical". Consider figure ??a. In this case the data was first centered, but the elongated shape still prevented us from using the simplistic algorithm to estimate the area covered by the data. The solution is to scale the axes so that the spread is the same in every dimension. To define this operation we first introduce the notion of sample variance,

V[X]_i = (1/N) Σ_{n=1}^N X_{in}^2    (1.3)

where we have assumed that the data was first centered. Note that this is similar to the sample mean, but now we have used the square. It is important that we have removed the sign of the data-cases (by taking the square) because otherwise positive and negative signs might cancel each other out. By first taking the square, all data-cases first get mapped to the positive half of the axes (for each dimension or

attribute separately) and then added and divided by N.

You have perhaps noticed that variance does not have the same units as X itself. If X is measured in grams, then variance is measured in grams squared. So to scale the data to have the same scale in every dimension we divide by the square-root of the variance, which is usually called the sample standard deviation,

X''_{in} = X'_{in} / √(V[X']_i)   ∀n    (1.4)

Note again that sphering requires centering, implying that we always have to perform these operations in this order: first center, then sphere. Figure ??a,b,c illustrate this process.

You may now be asking, "well, what if the data were elongated in a diagonal direction?". Indeed, we can also deal with such a case by first centering, then rotating such that the elongated direction points in the direction of one of the axes, and then scaling. This requires quite a bit more math, and we will postpone this issue until chapter ?? on "principal components analysis".

However, the question of how much preprocessing is allowed is in fact a very deep one, because one could argue that one could keep changing the data using more and more sophisticated transformations until all the structure was removed from the data and there would be nothing left to analyze! It is indeed true that the pre-processing steps can be viewed as part of the modeling process in that they identify structure (and then remove it). By remembering the sequence of transformations you performed you have implicitly built a model. Conversely, many algorithms can easily be adapted to model the mean and scale of the data; the preprocessing is then no longer necessary and becomes integrated into the model.

Models and structure-reducing data transformations are in a sense each other's reverse: we often associate with a model an understanding of how the data was generated, starting from random noise. Conversely, pre-processing starts with the data and understands how we can get back to the unstructured random state of the data [FIGURE]. Just as preprocessing can be viewed as building a model, we can use a model to transform structured data into (more) unstructured data. The details of this process will be left for later chapters, but a good example is provided by compression algorithms. Compression algorithms are based on models for the redundancy in data (e.g. text, images). The compression consists in removing this redundancy and transforming the original data into a less structured or less redundant (and hence more succinct) code.

Finally, I will mention one more popular data-transformation technique. Many algorithms are based on the assumption that data is sort of symmetric around

the origin. If data happens to be just positive, it doesn't fit this assumption very well. Taking the following logarithm can help in that case,

X'_{in} = log(a + X_{in})    (1.5)
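The centering, sphering and log-transform operations above translate directly into code. The following is a minimal sketch (mine, not part of the original text) in Python/NumPy; the function name, the choice of toy data and the attributes-by-data-cases layout of the matrix X are assumptions.

```python
import numpy as np

def preprocess(X, a=1.0, log_transform=False):
    """Center, sphere and (optionally) log-transform a data matrix X.

    X is assumed to be (attributes x data-cases), following Eq. (1.1)-(1.5).
    """
    if log_transform:
        X = np.log(a + X)                        # Eq. (1.5), for positive-valued data
    mean = X.mean(axis=1, keepdims=True)         # sample mean E[X]_i, Eq. (1.1)
    Xc = X - mean                                # centering, Eq. (1.2)
    var = (Xc ** 2).mean(axis=1, keepdims=True)  # sample variance V[X]_i, Eq. (1.3)
    return Xc / np.sqrt(var)                     # sphering, Eq. (1.4)

# toy usage: 2 attributes with very different scales, 100 data-cases
X = np.random.rand(2, 100) * np.array([[10.0], [0.1]])
print(preprocess(X).std(axis=1))  # roughly [1., 1.] after sphering
```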

ﬁtting it to the data and reporting the results. We may therefore decide to try another representation/algorithm. 2 or 3 dimensional subspaces onto which we project the data. you may decide to transform the data and then look again to see if the transformed data better suit the assumptions of the algorithm you have in mind.Chapter 2 Data Visualization The process of data analysis does not just consist of picking an algorithm. but most involve projection: we determine a number of.g. while in the right ﬁgure we observe that the data is approximately distributed on a line. say. This is an iterative process. In the left ﬁgure we see that the data naturally forms clusters. How are we going to visualize it? There are many answers to this problem. we may realize that our initial choice of algorithm or representation may not have been optimal. After your ﬁrst peek. The reason is that we are not equipped to think in more than 3 dimensions. The simplest choice of subspaces are the ones aligned with the features. The left ﬁgure suggests a clustering approach while the right ﬁgure suggests a dimensionality reduction approach. Depending on the data representation and the task at hand we then have to choose an algorithm to continue our analysis. compare the results and perhaps combine them. This illustrates the importance of looking at the data before you start your analysis instead of (literally) blindly picking an algorithm. “Looking at the data” sounds more easy than it really is. e. What may help us in deciding the representation and algorithm for further analysis? Consider the two datasets in Figure ??. while most data lives in much higher dimensions. But even after we have run the algorithm and study the results we are interested in. For instance image patches of size 10 × 10 live in a 100 pixel space. we can plot X1n versus X2n 7 . We have seen that we need to choose a representation for the data necessitating data-preprocessing in many cases.

etc. An example of such a scatter plot is given in Figure ??. Note that we have a total of d(d − 1)/2 possible two dimensional projections, which amounts to 4950 projections for 100 dimensional data. This is usually too many to manually inspect. How do we cut down on the number of dimensions? Perhaps random projections may work? Unfortunately that turns out to be not a great idea in many cases. The reason is that data projected on a random subspace often looks distributed according to what is known as a Gaussian distribution (see Figure ??). The deeper reason behind this phenomenon is the central limit theorem, which states that the sum of a large number of independent random variables is (under certain conditions) distributed as a Gaussian distribution. Hence, if we denote with w a vector in R^d and by x the d-dimensional random variable, then y = w^T x is the value of the projection. This is clearly a weighted sum of the random variables x_i, i = 1..d. If we assume that the x_i are approximately independent, then we can see that their sum will be governed by this central limit theorem. Analogously, a dataset {X_{in}} can thus be visualized in one dimension by "histogramming"1 the values of Y = w^T X, see Figure ??. In this figure we clearly recognize the characteristic "Bell-shape" of the Gaussian distribution of projected and histogrammed data.

In one sense the central limit theorem is a rather helpful quirk of nature. Many variables follow Gaussian distributions, and the Gaussian distribution is one of the few distributions which have very nice analytic properties. Unfortunately, the Gaussian distribution is also the most uninformative distribution. This notion of "uninformative" can actually be made very precise using information theory and states: given a fixed mean and variance, the Gaussian density represents the least amount of information among all densities with the same mean and variance. This is rather unfortunate for our purposes, because Gaussian projections are the least revealing dimensions to look at. So in general we have to work a bit harder to see interesting structure.

A large number of algorithms has been devised to search for informative projections, the simplest being "principal component analysis" or PCA for short ??. Here, interesting means dimensions of high variance. However, it was recognized that high variance is not always a good measure of interestingness and one should rather search for dimensions that are non-Gaussian. For instance, "independent components analysis" (ICA) ?? and "projection pursuit" ?? search for dimensions that have heavy tails relative to Gaussian distributions.

1 A histogram is a bar-plot where the height of the bar represents the number of items that had a value located in the interval on the x-axis on which the bar stands (i.e. the basis of the bar). If many items have a value around zero, then the bar centered at zero will be very high.
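You can see the central limit effect for yourself with a few lines of code. The sketch below (my own, not from the text) projects a high-dimensional dataset onto a random direction w and histograms Y = w^T X; the uniform toy data and the use of matplotlib are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

d, N = 100, 5000
X = np.random.rand(d, N)       # toy data: d attributes, N data-cases
w = np.random.randn(d)
w /= np.linalg.norm(w)         # random unit-length projection direction
Y = w @ X                      # projected values y_n = w^T x_n

plt.hist(Y, bins=50)           # typically shows the Gaussian "Bell-shape"
plt.xlabel("w^T x")
plt.show()
```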

Another criterion is to find projections onto which the data has multiple modes. A more recent approach is to project the data onto a potentially curved manifold ??.

Scatter plots are of course not the only way to visualize data. It's a creative exercise, and anything that helps enhance your understanding of the data is allowed in this game. To illustrate I will give a few examples from a


A lion appears on the scene. had manes etc. The fact that Alice’s disease is so rare is understandable there must have been a strong selection pressure against this disease. At her home she is doing just ﬁne: her mother explained Alice for every object in her house what is is and how you use it. but not this particular one and it does not induce a fear response. She remembers every detail of the objects she has seen. Alice’s contemporaries noticed that the animal was yellow-brown. She is not able to recognize objects by their visual appearance. she has a fantastic memory. What is the matter with Alice? Nothing is wrong with her memory because she remembers the objects once she has seem them. let’s start with an example. When she is home. almost philosophical question of what learning really is (and what it is not). she has no time to infer the possibility that this animal may be dangerous logically. yet unobserved members of the same category. if she enters a new meeting room she needs a long time to infer what the chairs and the table are in the room. and immediately un11 . If you want to remember one thing from this book you will ﬁnd it here in this chapter. Imagine our ancestors walking through the savanna one million years ago. For example. She has been diagnosed with a severe case of ”overﬁtting”. Ok. Of course. It concerns the core. In fact. Ancestral Alice has seen lions before.Chapter 3 Learning This chapter is without question the most important one of the book. The problem is that Alice cannot generalize the information she has observed from one instance of a visual object category to other. And every time she sees a new objects she reasons that the object in front of her is surely not a chair because it doesn’t have all the features she has seen in earlier chairs. Alice has a rather strange ailment. but when she enters a new environment she is lost. she recognizes these objects (if they have not been moved too much).

Once he has seen an object he believes almost everything is some. all we have to do is make correct predictions about aspects of life that help us survive. I seem to suffer from this so now and then when I think all of machine learning can be explained by this one new exciting principle). sticking my hand into that yellow-orange ﬂickering“ﬂame” hurts my hand and so on. When do we generalize exactly right? The magic word is PREDICTION. This food kills me but that food is good for me. it describes all watery substances and summarizes some of their physical properties. apparently they help us organize the world and make accurate predictions. We seem to all agree. but may differ in some other ones (like the presence of a scar someplace). The world is wonderfully predictable and we are very good at predicting it. chase for deer). Take the concept “ﬂuid”. Bob has another disease which is called over-generalization. The category lions is an abstraction and abstractions help us to generalize. learning is all about ﬁnding useful abstractions or concepts that describe the world. Nobody really cares about the deﬁnition of lion. So why do we care about object categories in the ﬁrst place? Well. One of the main conclusions from this discussion is that we should neither over-generalize nor over-ﬁt. Ot he concept of “weight”: an abstraction that describes a certain property of objects. But just right about what? It doesn’t seem there is one correct God-given deﬁnition of the category chairs. but we do care about the our responses to the various animals (run away for lion. They understood that all lions have these particular characteristics in common. the next time he sees a squirrel he believes it is a small instance of a dangerous lion and ﬂees into the trees again. perhaps twisted instance of the same object class (In fact.12 CHAPTER 3. And there are a lot of things that can be predicted in the world. In a certain sense. From an evolutionary standpoint. LEARNING derstood that this was a lion. but one can surely ﬁnd examples that would be difﬁcult to classify. Here is one very important corollary for you: “machine learning is not in the business of remembering and regurgitating observed information. it is in the business of transferring (generalizing) properties from observed data onto new. If ancestral Bob walks the savanna and he has just encountered an instance of a lion and ﬂed into a tree with his buddies. This is the mantra of machine learning that you should repeat to yourself every night before you go to bed (at least until the ﬁnal exam). The information we receive from the world has two components to it: there . We need to be on the edge of being just right. Over-generalization seems to be rather common among small children. yet unobserved data”. Drumming my ﬁsts on my hairy chest in front of a female generates opportunities for sex.

there really is no need in sending a model whatsoever. The compression game can therefore be used to ﬁnd the right size of model complexity for a given dataset. pixel 2: white. To compress the image.. Since Bob is sending only a single image it does not pay to send an incredibly complicated model that would require more bits to explain than simply sending all pixel values. On the other hand. i. Bob can extract rules such as: always predict the same color as the majority of the pixels next to you.. Now imagine Bob wants to send an image to Alice. pixel 3: black. But both also lead to suboptimal prediction on new images. These rules constitute the model for the regularities of the image. Bob can not do better because there is no predictable information in that image. The task of any learning algorithm is to separate the predictable part from the unpredictable part. If he would be sending 1 billion images it would pay off to ﬁrst send the complicated model because he would be saving a fraction of all bits for every image.. For the white image you can do perfect. A few rules and two corrections is obviously cheaper than 256 pixel values and no rules... Real pictures are in between: some pixels are very hard to predict. He has to pay 1 dollar cent for every bit that he sends. To send the exact image Bob will have to send pixel 1: white. pixel 3: white. Now imagine a image that consist of white noise (your television screen if the cable is not connected). the boundary between what is model and what is noise depends on how much data we are dealing with! If we use a model that is too complex we overﬁt to the data at hand. pixel 245: black. if we use a too simple model we ”underﬁt” (over-generalize) and valuable structure remains unmodeled. Therefore: the size of Bob’s model depends on the amount of data he wants to transmit. the learnable part of the information stream. You can imagine playing a game and revealing one pixel at a time to someone and pay him 1$ for every next pixel he predicts correctly. part of the model represents noise. He could just have send the message all pixels are white!. Every time the rule fails Bob also send a correction: pixel 103: white. while others are easier. for the noisy picture you would be random guessing. And then there is the information that is predictable. Instead of sending the entire image pixel by pixel. i. except when there is an edge. If the image were completely white it would be really stupid of Bob to send the message: pixel 1: white. there is no structure to be modeled..e. Both lead to suboptimal compression of the image.. the unpredictable information.e. pixel 2: black. There is one fundamental tradeoff hidden in this game. And so we have discovered a deep . Bob will now ﬁrst send his rules and ask Alice to apply the rules.. if Bob wants to send 2 pixels. We call this “noise”. On the other hand.13 is the part of the information which does not carry over to the future. The blank image is completely predictable but carries very little information. Ironically..

The result is that the value for θ is likely to be very different for each dataset. If we don’t trust our prior knowledge we should let the data speak. That is to say. so we need to ﬁnd the boundary between too complex and too simple a model and get its complexity just right. We are so convinced of ourselves that we basically ignore the data. LEARNING Now let’s think for a moment what we really mean with “a model”. we may leave the door open for a huge family of possible models.14 connection between learning and compression. Our inductive bias often comes in the form of a parametrized model. CHAPTER 3. Access to more data means that the data can speak more relative to prior knowledge. Conversely. The downside is that if we are creating a “bad bias” towards to wrong model. in a nutshell is what machine learning is all about. we can learn the remaining degrees of freedom in our model from very few data-cases. If we now let the data zoom in on the model that best explains the training data it will overﬁt to the peculiarities of that data. That. but the variance is high. Because the models are so ﬂexible. A strong inductive bias means that we don’t leave ﬂexibility in the model for the data to work on. a large variance is bad as well: because we only have one dataset of size N. A model represents our prior knowledge of the world. What we should therefore strive for is to inject all our prior knowledge into the learning problem (this makes learning easier) but avoid injecting the wrong prior knowledge. In the case of very restrictive models the opposite happens: the bias is potentially large but the variance small. Now imagine you sampled 10 datasets of the same size N and train these very ﬂexible models separately on each of these datasets (note that in reality you only have access to one such dataset but please play along in this thought experiment). our estimate could be very far off simply we were unlucky with the dataset we were given. we deﬁne a family of models but let the data determine which of these models is most appropriate. . On the other hand. It imposes structure that is not necessarily present in the data. We call this the “inductive bias”. Note that not only is a large bias is bad (for obvious reasons). However. We say that the bias is small. letting the data speak too much might lead to overﬁtting. we can actually model the idiosyncrasies of each dataset. if we are correct. But because we didn’t impose much inductive bias the average of many of such estimates will be about right. Let’s say we want to determine the value of some parameter θ.

3.1 In a Nutshell

Learning is all about generalizing regularities in the training data to new, yet unobserved data. It is not about remembering the training data. Good generalization means that you need to balance prior knowledge with information from data. Depending on the dataset size, you can entertain more or less complex models; the correct size of model can be determined by playing a compression game. Learning = generalization = abstraction = compression.


or to discover the structure in the data. Let’s call the images X1 . factors. that would be a ratio of about 1 million to 1 in terms of information content! The classiﬁcation problem can usually be posed as ﬁnding (a. There is also a different family of learning problems known as unsupervised learning problems. In this case there are no labels Y involved. For instance.. So. The data that Bob collected was labelled because Google is supposed to only return pictures of bobcats when you search for the word ”bobcat” (and similarly for mountain lions). abstractions. compressing data.. . learning) a function f (x) that approximates the correct class labels for any input x. Our task is not to classify. This may be very useful for visualization data.. Extracting structure in data often leads to the discovery of concepts. Bob want to learn how to distinguish between bobcats and mountain lions. Some months later on a hiking trip in the San Bernardino mountains he sees a big cat. . In the following we will be studying quite a few of these classiﬁcation algorithms.k.a. Yn . He types these words into Google Image Search and closely studies all catlike images of bobcats on the one hand and mountain lions on the other. and more such terms that all really mean the same thing. The most well studied problem in ML is that of supervised learning.Chapter 4 Types of Machine Learning We now will turn our attention and discuss some learning problems that we will encounter in this book. These are the underlying semantic 17 . but to organize the data.Xn and the labels Y1 .. while Yi is simply −1 or 1 depending on how we choose to label our classes. just the features X.. we may decide that sign[f (x)] is the predictor for our class label. To explain this. Note that Xi are much higher dimensional objects because they represent all the information extracted from the image (approximately 1 million pixel color values). causes. topics... let’s ﬁrst look at an example. or organizing data for easy accessibility.

While making his decisions he will not receive any feedback (apart from perhaps slowly getting more hungry). we do have a very large digital library of scanned newspapers available. national news. Yi). TYPES OF MACHINE LEARNING factors that can explain the data. Knowing these factors is like denoising the data where we ﬁrst peel off the uninteresting bits and pieces of the signal and subsequently transform onto an often lower dimensional space which exposes the underlying factors. In other words. international etc. Consider for example a mouse that needs to solve a labyrinth in order to obtain his food.). Finding these groups is then the task of the ML algorithm and the identity of the group is the semantic factor. one is often confronted with a situation where you have access to many more unlabeled data (only Xi ) and many fewer labeled instances (both (Xi .18 CHAPTER 4. Yet that is exactly what would happen if you would only be using the labeled data. A fourth major class of learning algorithms deals with problems where the supervised signal consists only of rewards (or costs) that are possibly delayed. Hence. There are two dominant classes of unsupervised learning algorithms: clustering based algorithms assume that the data organizes into groups. The subﬁeld that studies how to improve classiﬁcation algorithms using unlabeled data goes under the name “semi-supervised learning”. This mapping can be nonlinear. For instance. Shouldn’t it be possible to use those scanned newspapers somehow to to improve the classiﬁer? Imagine that the data naturally clusters into well separated groups (for instance because news articles reporting on different topics use very different words). These problem . From this ﬁgure it becomes clear that the expected optimal decision boundary nicely separates these clusters. It’s only at the end when he reaches the cheese that receives his positive feedback. Note that there are only very few cases which have labels attached to them. There are many variations on the above themes. sports. but the underlying assumption is that the data is approximately distributed on some (possibly curved) lower dimensional manifold embedded in the input space. Take the task of classifying news articles by topic (weather. This is depicted in Figure ??). Another class of algorithms strives to project the data onto a lower dimensional space. you do not expect that the decision boundary will cut through one of the clusters. Unrolling that manifold is then the task of the learning algorithm. by simply requiring that decision boundaries do not cut through regions of high probability we can improve our classiﬁer. However. and he will have use this to reinforce his perhaps random earlier decisions that lead him to the cheese. Some people may have labeled some news-articles by hand but there won’t be all that many of those. In this case the dimensions should be interpreted as semantic factors.

we need to “draw statistical strength” from customers who seem to be similar. It is a very general setup in which almost all known cases of machine learning can be cast. We can learn personalized models but share features between them. where we don’t have access to many movies that were rated by the customer. people also share commonalities. These tasks are related but not identical. or the time horizon that you still have to take advantage of these investments. especially when people show evidence of being of the same “type” (for example a sf fan or a comedy fan). consider the problem if recommending movies to customers of Netﬂix. Do we not also have ﬁgure out how the world works and survive in it? Let’s go back to the news-articles. i. Each person is different and would really require a separate model to make the recommendations. or should you stop investing now and start exploiting what you have learned so far? This is clearly a function of age. From this example it has hopefully become clear that we are trying to learn models for many different yet related problems and that we can build better models if we share some of the things learned for one task with the other ones. Or the mouse in the maze. The trick is not to share too much nor too little and how much we should share depends on how much data and prior knowledge we have access to for each task. The mouse is similarly confronted with the problem of whether he should try out this new alley in the maze that can cut down his time to reach the cheese considerably. Which one would be pick. This clearly depends on how often he thinks he will have to run through the same maze in the future.19 fall under the name ”reinforcement learning”. alleys that he expects to maximize his reward. where does he explore? Surely he will try to seek out alleys that look promising. However. For instance. Given that decides to explore. . or whether he should simply stay with he has learned and take the route he already knows. Especially for new customers. We call this the exploration versus exploitation trade-off. This dual task induces interesting trade-offs: should you invest time now to learn machine learning and reap the beneﬁt later in terms of a high salary working for Yahoo!. The most general RL problems do not even assume that you know what the world looks like (i. We call the problem of ﬁnding the next best data-case to investigate “active learning”. Assume we have control over what article we will label next. Surely the one that would be most informative in some suitably deﬁned sense. but this generality also means that these type of problems can be very difﬁcult. We call this subﬁeld of machine learning:“multi-task learning. One may also be faced with learning multiple tasks at the same time.e. the maze for the mouse). so you have to simultaneously learn a model of the world and solve your task in it. The reason that RL is a very exciting ﬁeld of research is because of its biological relevance.e.

4.1 In a Nutshell

There are many types of learning problems within machine learning. Supervised learning deals with predicting class labels from attributes; unsupervised learning tries to discover interesting structure in data; semi-supervised learning uses both labeled and unlabeled data to improve predictive performance; reinforcement learning can handle simple feedback in the form of delayed reward; active learning optimizes the next sample to include in the learning algorithm; and multi-task learning deals with sharing common model components between related learning tasks.

Chapter 5

Nearest Neighbors Classification

Perhaps the simplest algorithm to perform classification is the "k nearest neighbors (kNN) classifier". As usual we assume that we have data of the form {X_{in}, Y_n}, where X_{in} is the value of attribute i for data-case n and Y_n is the label for data-case n. We also need a measure of similarity between data-cases, which we will denote with K(X_n, X_m), where larger values of K denote more similar data-cases.

Given these preliminaries, classification is embarrassingly simple: when you are provided with the attributes X_t for a new (unseen) test-case, you first find the k most similar data-cases in the dataset by computing K(X_t, X_n) for all n. Call this set S. Then, each of these k most similar neighbors in S can cast a vote on the label of the test case, where each neighbor predicts that the test case has the same label as itself. Assuming binary labels and an odd number of neighbors, this will always result in a decision. Although kNN algorithms are often associated with this simple voting scheme, more sophisticated ways of combining the information of these neighbors are allowed. For instance, one could weigh each vote by the similarity to the test-case. This results in the following decision rule,

Y_t = 1   if   Σ_{n∈S} K(X_t, X_n)(2Y_n − 1) > 0    (5.1)
Y_t = 0   if   Σ_{n∈S} K(X_t, X_n)(2Y_n − 1) < 0    (5.2)

and flipping a coin if it is exactly 0.    (5.3)

Why do we expect this algorithm to work intuitively? The reason is that we expect data-cases with similar labels to cluster together in attribute space.
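The decision rule above maps directly to code. Here is a minimal sketch (my own, not the author's) of the weighted vote; the Gaussian similarity K(x, x') = exp(−||x − x'||²) and the NumPy array layout are assumptions.

```python
import numpy as np

def knn_classify(X, Y, x_test, k=3):
    """Weighted kNN vote following Eq. (5.1)-(5.3).

    X: (N, d) training attributes, Y: (N,) labels in {0, 1}, x_test: (d,).
    """
    # similarity K(x_test, x_n); a Gaussian kernel is assumed here
    K = np.exp(-np.sum((X - x_test) ** 2, axis=1))
    S = np.argsort(-K)[:k]                  # indices of the k most similar cases
    vote = np.sum(K[S] * (2 * Y[S] - 1))    # weighted vote, Eq. (5.1)-(5.2)
    if vote == 0:                           # Eq. (5.3): break ties with a coin flip
        return np.random.randint(2)
    return int(vote > 0)
```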

To figure out the label of a test-case we simply look around and see what labels our neighbors have. Asking your closest neighbor is like betting all your money on a single piece of advice, and you might get really unlucky if your closest neighbor happens to be an odd-one-out. It's typically better to ask several opinions before making your decision. However, if you ask around too much you will be forced to ask advice from data-cases that are no longer very similar to you. So there is some optimal number of neighbors to ask, which may be different for every problem. Determining this optimal number of neighbors is not easy, but we can again use cross validation (section ??) to estimate it.

So what is good and bad about kNN? First, its simplicity makes it attractive. Very few assumptions about the data are used in the classification process. This property can also be a disadvantage: if you have prior knowledge about how the data was generated, it's better to use it, because less information has to be extracted from the data.

A second consideration is computation time and memory efficiency. Assume you have a very large dataset, but you need to make decisions very quickly. As an example, consider surfing the web-pages of Amazone.com. Whenever you search for a book, it likes to suggest 10 others. To do that it could classify books into categories and suggest the top ranked in that category. kNN requires Amazone to store all features of all books at a location that is accessible for fast computation. Moreover, to classify, kNN has to do the neighborhood search every time again. Clearly, there are tricks that can be played with smart indexing, but wouldn't it be much easier if we had summarized all books by a simple classification function fθ(X) that "spits out" a class for any combination of features X?

This distinction between algorithms/models that require memorizing every data-item is often called "parametric" versus "non-parametric". It's important to realize that this is somewhat of a misnomer: non-parametric models can have parameters (such as the number of neighbors to consider). The key distinction is rather whether the data is summarized through a set of parameters which together comprise a classification function fθ(X), or whether we retain all the data to do the classification "on the fly".

kNN is also known to suffer from the "curse of high dimensions". If we use many features to describe our data, and in particular when most of these features turn out to be irrelevant and noisy for the classification, then kNN is quickly confused. Imagine that there are two features that contain all the information necessary for a perfect classification, but that we have added 98 noisy, uninformative features. The neighbors in the two dimensional space of the relevant features are unfortunately no longer likely to be the neighbors in the 100 dimensional space,

because 98 noisy dimensions have been added. This effect is detrimental to the kNN algorithm. Once again, it is very important to choose your initial representation with much care and to preprocess the data before you apply the algorithm. In this case, preprocessing takes the form of "feature selection", on which a whole book in itself could be written.

5.1 The Idea In a Nutshell

To classify a new data-item you first look for the k nearest neighbors in feature space and assign it the same label as the majority of these neighbors.


Note that each attribute can have a different number of values Vi . e. which class is assigned to 0 or 1 is arbitrary and has no impact on the performance of the algorithm.Chapter 6 The Naive Bayesian Classiﬁer In this chapter we will discuss the “Naive Bayes” (NB) classiﬁer. Y es = 1. Before we move on. Fortunately.D. To give an 25 . in chapter I’ll make an exception. In the introduction I promised I would try to avoid the use of probabilities as much as possible.g. Vi ]. then we simply reassign these values to ﬁt the above format. . let’s consider a real world example: spam-ﬁltering. In our usual notation we consider D discrete valued attributes Xi ∈ [0. Every day your mailbox get’s bombarded with hundreds of spam emails. 6. It has proven to be very useful in many application both in science as well as in industry. No = 0 (or reversed). No]... i = 1. because the NB classiﬁer is most naturally explained with the use of probabilities. We will restrict attention to classiﬁcation problems between two classes and refer to section ?? for approaches to extend this two more than two classes. X1 = [Y es. we will only need the most basic concepts. Again. in this case the labels are Y = 0 and Y = 1 indicating that that data-item fell in class 0 or class 1. If the original data was supplied in a different format..1 The Naive Bayes Model NB is mostly used when dealing with discrete-valued attributes. However. We will explain the algorithm in this context but note that extensions to continuous-valued attributes are possible. In addition we are also provided with a supervised signal.

the bulk of these emails (approximately 97%) is ﬁltered out or dumped into your spam-box and will reach your attention.9% to appear twice etc. How is this done? Well. We can build a dictionary of words that we can detect in each email. which were acquired by someone removing spam emails by hand from his/her inbox. it turns out to be a classic example of a classiﬁcation problem: spam or ham. One could measure whether the words or phrases appear at least once or one could count the actual number of times they appear. “viagra” should have a much smaller probability to appear in a ham email (but it could of course. For emails. The NB model is what we call a “generative” model. loan. mortgage. THE NAIVE BAYESIAN CLASSIFIER example of the trafﬁc that it generates: the university of California Irvine receives on the order of 2 million spam emails a day. money. what would you measure in an email to see if it is spam? Certainly. a small arms race has ensued where spam ﬁlters and spam generators ﬁnd new tricks to counteract the tricks of the “opponent”. cheap. We’ll also assume that we have spam/ham labels for these emails. Our task is thus to label each new email with either 0 or 1. in random . “penis enlargement”.26 CHAPTER 6. What else? Here are a few: “enlargement. These probabilities are clearly different for spam and ham. It will then decide what the chance is that a certain word appears k times in a spam email. buy. an imaginary entity ﬁrst decides how many spam and ham emails it will generate on a daily basis. What are the attributes? Rephrasing this question. the word “viagra” has a chance of 96% to not appear at all. In fact. Putting all these subtleties aside for a moment we’ll simply assume that we measure a number of these attributes for every email in a dataset. it decides to generate 40% spam and 60% ham. but we won’t need that in the following. with proper sentences. i. 0. Hence we might also want to detect words like “via gra” and so on. pharmacy. consider I send this text to my publisher by email). but we will make this simplifying assumption for now). 1% to appear once. Say. Our task is then to train a predictor for spam/ham labels for future emails where we have access to attributes but not to labels. that’s the question. if I would read “viagra” in the subject I would stop right there and dump it in the spam-box. one can make phrases as sophisticated as necessary. Given these probabilities. This dictionary could also include word phrases such as “buy now”. We will assume this doesn’t change with time (of course it does.e. we can then go on and try to generate emails that actually look like real emails. credit” and so on. This means that we imagine how the data was generated in an abstract sense. Spammers know about the way these spam ﬁlters work and counteract by slight misspellings of certain key words. For example. this works as follows. Let’s say that spam will receive a label 1 and ham a label 0. Fortunately. Instead we make the simplifying assumption that email consists of “a bag of words”.

order. In our example we could for instance ask ourselves what the probability is that we find the word "viagra" k times, with k = 0, 1, > 1, in a spam email. Let's recode this as X_viagra = 0 meaning that we didn't observe "viagra", X_viagra = 1 meaning that we observed it once, and X_viagra = 2 meaning that we observed it more than once.

6.2 Learning a Naive Bayes Classifier

Given a dataset, {X_{in}, Y_n}, i = 1..D, n = 1..N, we wish to estimate what these probabilities are. To start with the simplest one, what would be a good estimate for the percentage of spam versus ham emails that our imaginary entity uses to generate emails? Well, we can simply count how many spam and ham emails we have in our data. This is given by,

P(spam) = (# spam emails) / (total # emails) = Σ_n I[Y_n = 1] / N    (6.1)

Here we mean with I[A = a] a function that is only equal to 1 if its argument is satisfied, and zero otherwise. Hence, in the equation above it counts the number of instances that Y_n = 1. Since the remainder of the emails must be ham, we also find that

P(ham) = 1 − P(spam) = (# ham emails) / (total # emails) = Σ_n I[Y_n = 0] / N    (6.2)

Next, we need to estimate how often we expect to see a certain word or phrase in either a spam or a ham email. The answer is again that we can count how often these events happened in our data and use that as an estimate for the real probabilities according to which it generated emails. First for spam we find,

P_spam(X_i = j) = (# spam emails for which the word i was found j times) / (total # of spam emails)    (6.3)
               = Σ_n I[X_{in} = j ∧ Y_n = 1] / Σ_n I[Y_n = 1]    (6.4)

Here we have defined the symbol ∧ to mean that both statements to the left and right of this symbol should hold true in order for the entire sentence to be true.
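These counting estimates are easy to implement. The sketch below (mine, not the author's code) computes them for both classes at once — it therefore already covers the analogous ham-model quantities introduced in the next paragraph; NumPy and the (N × D) integer data matrix layout are assumptions.

```python
import numpy as np

def learn_naive_bayes(X, Y, V):
    """Estimate P(spam) and P_class(X_i = j) by counting, Eq. (6.1)-(6.6).

    X: (N, D) integer attribute values, Y: (N,) labels (1 = spam, 0 = ham),
    V: list with the number of possible values V_i for each attribute.
    """
    N, D = X.shape
    p_spam = np.mean(Y == 1)                               # Eq. (6.1)
    p_word = {0: [], 1: []}                                # per class, per attribute
    for c in (0, 1):
        for i in range(D):
            counts = np.array([np.sum((X[:, i] == j) & (Y == c)) for j in range(V[i])])
            p_word[c].append(counts / np.sum(Y == c))      # Eq. (6.3)-(6.6)
    return p_spam, p_word
```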

For ham emails, we compute exactly the same quantity,

P_ham(X_i = j) = (# ham emails for which the word i was found j times) / (total # of ham emails)    (6.5)
              = Σ_n I[X_{in} = j ∧ Y_n = 0] / Σ_n I[Y_n = 0]    (6.6)

Both these quantities should be computed for all words or phrases (or more generally attributes). We have now finished the phase where we estimate the model from the data. We will often refer to this phase as "learning" or training a model. The model helps us understand how data was generated in some approximate setting.

6.3 Class-Prediction for New Instances

The next phase is that of prediction or classification of new email. New email does not come with a label ham or spam (if it did we could throw spam in the spam-box right away). What we do see are the attributes {X_i}. Our task is to guess the label based on the model and the measured attributes. The approach we take is simple: calculate whether the email has a higher probability of being generated from the spam or the ham model. For example, because the word "viagra" has a tiny probability of being generated under the ham model, it will end up with a higher probability under the spam model. Clearly, all words have a say in this process. It's like a large committee of experts: each member casts a vote and can say things like "I am 99% certain it's spam", or "It's almost definitely not spam (0.1% spam)". Each of these opinions will be multiplied together to generate a final score. We then figure out whether ham or spam has the highest score.

There is one little practical caveat with this approach, namely that the product of a large number of probabilities, each of which is necessarily smaller than one, very quickly gets so small that your computer can't handle it. There is an easy fix though. Instead of multiplying probabilities as scores, we use the logarithms of those probabilities and add the logarithms. This is numerically stable and leads to the same conclusion, because if a > b then we also have that log(a) > log(b) and vice versa. In equations we compute the score as follows:

S_spam = Σ_i log P_spam(X_i = v_i) + log P(spam)    (6.7)


where with v_i we mean the value for attribute i that we observe in the email under consideration, i.e. if the email contains no mention of the word "viagra" we set v_viagra = 0. The first term in Eqn. 6.7 adds all the log-probabilities under the spam model of observing the particular value of each attribute. Every time a word is observed that has high probability under the spam model, and hence has often been observed in the dataset, it will boost this score. The last term adds an extra factor to the score that expresses our prior belief of receiving a spam email instead of a ham email. We compute a similar score for ham, namely,

S_ham = Σ_i log P_ham(X_i = v_i) + log P(ham)    (6.8)

and compare the two scores. Clearly, a large score for spam relative to ham provides evidence that the email is indeed spam. If your goal is to minimize the total number of errors (whether they involve spam or ham) then the decision should be to choose the class which has the highest score. In reality, one type of error could have more serious consequences than another. For instance, a spam email making it into my inbox is not too bad, but an important email that ends up in my spam-box (which I never check) may have serious consequences. To account for this we introduce a general threshold θ and use the following decision rule,

Y = 1   if   S_1 > S_0 + θ    (6.9)
Y = 0   if   S_1 < S_0 + θ    (6.10)

If these quantities are equal you ﬂip a coin. If θ = −∞, we always decide in favor of label Y = 1, while if we use θ = +∞ we always decide in favor of Y = 0. The actual value is a matter of taste. To evaluate a classiﬁer we often draw an ROC curve. An ROC curve is obtained by sliding θ between −∞ and +∞ and plotting the true positive rate (the number of examples with label Y = 1 also classiﬁed as Y = 1 divided by the total number of examples with Y = 1) versus the false positive rate (the number of examples with label Y = 0 classiﬁed as Y = 1 divided by the total number of examples with Y = 0). For more details see chapter ??.
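Putting the log-scores and the threshold rule together, a sketch of the prediction step (again mine, reusing the data structures from the learning sketch above, which is an assumed helper) might look as follows:

```python
import numpy as np

def classify_email(x, p_spam, p_word, theta=0.0):
    """Compare log-scores S_spam and S_ham, Eq. (6.7)-(6.10).

    x: (D,) observed attribute values v_i; p_spam, p_word as returned by
    the learn_naive_bayes sketch above.  Note that a zero count gives
    log(0) = -inf here, which is exactly the problem section 6.4 fixes.
    """
    s_spam = np.log(p_spam) + sum(np.log(p_word[1][i][v]) for i, v in enumerate(x))
    s_ham = np.log(1.0 - p_spam) + sum(np.log(p_word[0][i][v]) for i, v in enumerate(x))
    if s_spam > s_ham + theta:
        return 1                      # spam
    if s_spam < s_ham + theta:
        return 0                      # ham
    return np.random.randint(2)       # equal scores: flip a coin
```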


6.4 Regularization

The spam filter algorithm that we discussed in the previous sections unfortunately does not work very well if we wish to use many attributes (words, word-phrases). The reason is that for many attributes we may not encounter a single example in the dataset. Say for example that we defined the word "Nigeria" as an attribute, but that our dataset did not include one of those spam emails where you are promised mountains of gold if you invest your money in someone's bank in Nigeria. Also assume there are indeed a few ham emails which talk about the nice people in Nigeria. Then any future email that mentions Nigeria is classified as ham with 100% certainty. More importantly, one cannot recover from this decision even if the email also mentions viagra, enlargement, mortgage and so on, all in a single email! This can be seen from the fact that log P_spam(X_"Nigeria" > 0) = −∞ while the final score is a sum of these individual word-scores. To counteract this phenomenon, we give each word in the dictionary a small probability of being present in any email (spam or ham), before seeing the data. This process is called smoothing. The impact on the estimated probabilities is given below,

P_spam(X_i = j) = ( α + Σ_n I[X_{in} = j ∧ Y_n = 1] ) / ( V_i α + Σ_n I[Y_n = 1] )    (6.12)

P_ham(X_i = j) = ( α + Σ_n I[X_{in} = j ∧ Y_n = 0] ) / ( V_i α + Σ_n I[Y_n = 0] )    (6.13)

where V_i is the number of possible values of attribute i. Thus, α can be interpreted as a small, possibly fractional number of "pseudo-observations" of the attribute in question. It's like adding these observations to the actual dataset. What value for α do we use? Fitting its value on the dataset will not work, because the reason we added it was exactly that we assumed there was too little data in the first place (we hadn't received one of those annoying "Nigeria" emails yet); fitting α on that same data thus relates to the phenomenon of overfitting. However, we can use the trick described in section ?? where we split the data into two pieces. We learn a model on one chunk and adjust α such that performance on the other chunk is optimal. We play this game multiple times with different splits and average the results.
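In code, smoothing only changes the counting step of the earlier learning sketch. A hedged variant (the function name and layout mirror my earlier sketch and are my own):

```python
import numpy as np

def learn_naive_bayes_smoothed(X, Y, V, alpha=1.0):
    """Counting estimates with alpha pseudo-observations, Eq. (6.12)-(6.13)."""
    N, D = X.shape
    p_spam = np.mean(Y == 1)
    p_word = {0: [], 1: []}
    for c in (0, 1):
        n_c = np.sum(Y == c)
        for i in range(D):
            counts = np.array([np.sum((X[:, i] == j) & (Y == c)) for j in range(V[i])])
            p_word[c].append((alpha + counts) / (V[i] * alpha + n_c))  # never exactly zero
    return p_spam, p_word
```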


6.5 Remarks

One of the main limitations of the NB classiﬁer is that it assumes independence between attributes (This is presumably the reason why we call it the naive Bayesian classiﬁer). This is reﬂected in the fact that each classiﬁer has an independent vote in the ﬁnal score. However, imagine that I measure the words, “home” and “mortgage”. Observing “mortgage” certainly raises the probability of observing “home”. We say that they are positively correlated. It would therefore be more fair if we attributed a smaller weight to “home” if we already observed mortgage because they convey the same thing: this email is about mortgages for your home. One way to obtain a more fair voting scheme is to model these dependencies explicitly. However, this comes at a computational cost (a longer time before you receive your email in your inbox) which may not always be worth the additional accuracy. One should also note that more parameters do not necessarily improve accuracy because too many parameters may lead to overﬁtting.

6.6 The Idea In a Nutshell

Consider Figure ??. We can classify data by building a model of how the data was generated. For NB we ﬁrst decide whether we will generate a data-item from class Y = 0 or class Y = 1. Given that decision we generate the values for D attributes independently. Each class has a different model for generating attributes. Classiﬁcation is achieved by computing which model was more likely to generate the new data-point, biasing the outcome towards the class that is expected to generate more data.


What can we do? Our ﬁrst inclination is probably to try and ﬁt a more complicated separation boundary. Call φk (X) feature k that was measured from the data. if Xi represents the presence (1) or absence (0) of the word “viagra” and similarly for Xj and the presence/absence of the word “dysfunction”. One could say that it represents the canonical parametric approach to classiﬁcation where we believe that a straight line is sufﬁcient to separate the two classes of interest. The features can be highly nonlinear functions. For example. An example of this is given in Figure ?? where the assumption that the two classes can be separated by a line is clearly valid. 33 . We can add as many features as we like. The simplest choice may be to also measure φi (X) = Xi2 . The latter will allow you to explicitly model correlations between attributes. Looking at Figure ?? we clearly observe that there is no straight line that will do the job for us. In this higher dimensional space we can now be more conﬁdent in assuming that the data can be separated by a line. there is another trick that we ill be using often in this book. But we may also measure cross-products such as φij (X) = Xi Xj . ∀k for each attribute Xk . However. j. then the cross product feature Xi Xj let’s you model the presence of both words simultaneously (which should be helpful in trying to ﬁnd out what this document is about). despite its simplicity it should not be under-estimated! It is the workhorse for most companies involved with some form of machine learning (perhaps tying with the decision tree classiﬁer). However. However. Instead we can increase the dimensionality of the space by “measuring” more things of the data. adding another dimension for every new feature. ∀i.Chapter 7 The Perceptron We will now describe one the simplest parametric classiﬁers: the perceptron and its cousin the logistic regression classiﬁer. this assumption need not always be true.

i. Another way of seeing this is that more dimensions often lead to more parameters in the model (as in the case for the perceptron) and can hence lead to overﬁtting. i. The problem is that they introduce noise in the input that can mask the actual signal (i. This can be seen by writing the points on the line as x = y + v where y is a ﬁxed vector pointing to an arbitrary point on the line and v is the vector on the line starting at y (see Figure ??). we can express the condition mathematically expressed as1 . Note that we can replace Xk → φk (X) but that for the sake of simplicity we will refrain from doing so at this point. 1 . α} which deﬁne the line for us.e. the shortest distance from the origin to the line. all vectors x with a vanishing inner product with w. a line through the origin is represented by wT x = 0. wK . How we play this magic will be explained when we will discuss kernel methods in the next sections. Hence. . The scalar quantity α represents the offset of the line wT x = 0 from the origin. Since by deﬁnition wT v = 0. With this. The vector w represents the direction orthogonal to the decision boundary depicted in Figure ??. For example. −1} representation. wT (y + v) − α = 0. THE PERCEPTRON I like to warn the reader at this point that more features is not necessarily a good thing if the new features are uninformative for the classiﬁcation task at hand.e. To make our life a little easier we will switch to the Y = {+1. discriminative features).34 CHAPTER 7. In fact..1) where “sign” is the sign-function (+1 for nonnegative reals and −1 for negative reals). The problem of too many dimensions is sometimes called “the curse of high dimensionality”. we ﬁnd wT y = α which means that α is the projection of y onto w which is the shortest distance from the origin to the line.e. But let us ﬁrst start simple with the perceptron. With the introduction of regularizers. the good.1 The Perceptron Model Our assumption is that a line can separate the two classes of interest.. there is a whole subﬁeld of ML that deals with selecting relevant features from a set that is too large. we can sometimes play magic and use an inﬁnite number of features. 7. Yn ≈ sign( k wk Xkn − α) (7. To combat that in turn we can add regularizers as we will see in the following. We have introduced K + 1 parameters {w1 .

one should split of a subset of the training data and reserve it for monitoring performance (one should not use this set in the training procedure). we would like to control performance on future. but it is important to notice that the number of parameters is ﬁxed in advance. the linear separability assumption comes with some computation advantages as well.1. When this capacity is too large relative to the number of data cases at our disposal. such as very fast class prediction on new test data. As a surrogate we will simply ﬁt the line parameters on the training data. so now that we agree on writing down a cost on the training data. when we take the limit of an inﬁnite number of features. If we do not use too many features relative to the number of data-cases. this is a little hard (since we don’t have access to this data by deﬁnition). Now let’s write down a cost function that we wish to minimize in order for our linear decision boundary to become a good classiﬁer. one may want to worry more about “underﬁtting” in this case. If we have introduced very many features and no form of regularization then we have many parameters to ﬁt. we will have happily returned the land of “non-parametrics” but we have exercise a little patience before we get there. (In fact. such as the “nearest-neighbors” classiﬁers actually do have this desirable feature. Too few data-cases simply do not provide the resolution (evidence) necessary to see more complex structure in the decision boundary. So. Recall that non-parametric methods. This ﬁxed capacity of the model is typical for parametric methods. we will be ﬁtting the idiosyncrasies of this particular dataset and these will not carry over to the future test data. However. yet unseen test data.2) . Nevertheless. Cycling though multiple splits and averaging the result was the cross-validation procedure discussed in section ??. By the way.7. the model class is very limited and overﬁtting is not an issue. but perhaps a little unrealistic for real data. I believe that this computational convenience may be at the root for its popularity. we need to choose an explicit expression. THE PERCEPTRON MODEL 35 We like to estimate these parameters from the data (which we will do in a minute). α) = 2N N (Yn − wT Xn + α)2 n=1 (7. In some sense.) Ok. A more reasonable assumption is that the decision boundary may become more complex as we see more data. It can not be stressed enough that this is dangerous in principle due to the phenomenon of overﬁtting (see section ??). we believe so much in our assumption that the data is linearly separable that we stick to it irrespective of how many data-cases we will encounter. Consider now the following choice: 1 1 C(w. Clearly.

where we have rewritten w^T X_n = Σ_k w_k X_{kn}. If we minimize this cost then w^T X_n − α tends to be positive when Y_n = +1 and negative when Y_n = −1. This is what we want! Once optimized we can then easily use our optimal parameters to perform prediction on new test data X_test as follows:

Ỹ_test = sign( Σ_k w*_k X_{k,test} − α* )    (7.3)

So far so good, but how do we obtain our values for {w*, α*}? The simplest approach is to compute the gradient and slowly descend on the cost function (see appendix ?? for background). In this case, the gradients are simple:

∇_w C(w, α) = −(1/N) Σ_{n=1}^N (Y_n − w^T X_n + α) X_n = −(1/N) X (Y − X^T w + α)    (7.4)
∇_α C(w, α) = (1/N) Σ_{n=1}^N (Y_n − w^T X_n + α)    (7.5)

where in the matrix expression we have used the convention that X is the matrix with elements X_{kn}. Our gradient descent is now simply given as,

w_{t+1} = w_t − η ∇_w C(w_t, α_t)    (7.6)
α_{t+1} = α_t − η ∇_α C(w_t, α_t)    (7.7)

Iterating these equations until convergence will minimize the cost function. One may criticize plain vanilla gradient descent for many reasons. For example, you need to choose the stepsize η carefully or risk either excruciatingly slow convergence or exploding values of the iterates w_t. Even if convergence is achieved asymptotically, it is typically slow. Using a Newton-Raphson method will improve convergence properties considerably but is also very expensive. Many methods have been developed to improve the optimization of the cost function, but that is not the focus of this book. However, I do want to mention a very popular approach to optimization on very large datasets known as "stochastic gradient descent". The idea is to select a single data-item randomly and perform an update on the parameters based on that:

w_{t+1} = w_t + η (Y_n − w^T X_n + α) X_n    (7.8)
α_{t+1} = α_t − η (Y_n − w^T X_n + α)    (7.9)
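As a concrete illustration, here is a small sketch (mine, not the author's) of both the batch updates (7.6)-(7.7) and the stochastic updates (7.8)-(7.9) in NumPy; the stepsize, the number of steps and the toy array layout are assumptions.

```python
import numpy as np

def perceptron_train(X, Y, eta=0.1, steps=1000, stochastic=False):
    """Minimize C(w, alpha) = 1/(2N) * sum_n (Y_n - w^T X_n + alpha)^2.

    X: (K, N) attributes, Y: (N,) labels in {-1, +1}.
    """
    K, N = X.shape
    w, alpha = np.zeros(K), 0.0
    for t in range(steps):
        if stochastic:                       # Eq. (7.8)-(7.9): one random data-case
            n = np.random.randint(N)
            r = Y[n] - w @ X[:, n] + alpha
            w += eta * r * X[:, n]
            alpha -= eta * r
        else:                                # Eq. (7.4)-(7.7): full-batch gradients
            r = Y - X.T @ w + alpha          # residuals, shape (N,)
            w += eta / N * (X @ r)
            alpha -= eta / N * np.sum(r)
    return w, alpha

def perceptron_predict(X_test, w, alpha):
    return np.sign(X_test.T @ w - alpha)     # Eq. (7.3)
```

The stochastic branch touches only a single data-case per update, which is exactly the redundancy argument developed in the next paragraph.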

The fact that we are picking data-cases randomly injects noise into the updates, so even close to convergence we keep "wiggling around" the solution. If we decrease the stepsize, however, the wiggles get smaller. So a sensible strategy is to slowly decrease the stepsize and wiggle our way to the solution. This stochastic gradient descent is actually very efficient in practice if we can find a good annealing schedule for the stepsize.

Why does this work? It seems that if we use more data-cases in a mini-batch to perform a parameter update, we should be able to make larger steps in parameter space by using bigger stepsizes. While this reasoning holds close to the solution, it does not hold far away from it. The intuitive reason is that far away from convergence every datapoint will tell you the same story: move in direction X to improve your model. You simply do not need to query many datapoints in order to extract that information. So for a bad model there is a lot of redundancy in the information that data-cases can convey about improving the parameters, and querying a few is sufficient. Closer to convergence you need to either use more data or decrease the stepsize to increase the resolution of your gradients. You can choose between a single parameter update based on all the data or many updates on small subsets of the data. This type of reasoning clearly makes an effort to include the computational budget as part of the overall objective, which is what we argued in chapter XX is the distinguishing feature of machine learning. If you are not convinced of how important this is in the face of modern day datasets, imagine the following. Company C organizes a contest where they provide a virtually infinite dataset for some prediction task. You can earn 1 million dollars if you make accurate predictions on some test set by Friday next week. Who do you think will win the contest?

7.2 A Different Cost function: Logistic Regression

The cost function of Eq. 7.2 penalizes gross violations of one's predictions rather severely (quadratically). This is sometimes counter-productive, because the algorithm might get obsessed with improving the performance on one single data-case at the expense of all the others. Recall that the real cost simply counts the number of mislabelled instances, irrespective of how badly off your prediction function w^T X_n + α was. So a different cost function is often used,

C(w, α) = −(1/N) Σ_{n=1}^N Y_n tanh(w^T X_n + α)   (7.10)

The function tanh(·) is plotted in figure ??. It shows that the cost can never be larger than 2, which ensures robustness against outliers. We leave it to the reader to derive the gradients and formulate the corresponding gradient descent algorithm.

7.3 The Idea In a Nutshell

Figure ?? tells the story. One assumes that the data can be separated by a line. Any line can be represented by w^T x = α. Data-cases from one class satisfy w^T x_n ≤ α while data-cases from the other class satisfy w^T x_n ≥ α. To achieve that, you write down a cost function that penalizes data-cases falling on the wrong side of the line and minimize it over {w, α}. For a test case you simply compute the sign of w^T x_test − α to make a prediction as to which class it belongs to.

Chapter 8

Support Vector Machines

Our task is to predict whether a test sample belongs to one of two classes. We receive training examples of the form {x_i, y_i}, i = 1, ..., n, with x_i ∈ R^d and y_i ∈ {−1, +1}. We call {x_i} the co-variates or input vectors and {y_i} the response variables or labels.

We consider a very simple example where the data are in fact linearly separable: i.e. I can draw a straight line f(x) = w^T x − b such that all cases with y_i = −1 fall on one side and have f(x_i) < 0, and cases with y_i = +1 fall on the other side and have f(x_i) > 0. Given that we have achieved that, we could classify new test cases according to the rule y_test = sign(f(x_test)).

However, typically there are infinitely many such hyper-planes, obtained by small perturbations of a given solution. How do we choose between all these hyper-planes, which all solve the separation problem for our training data but may have different performance on the newly arriving test cases? For instance, we could choose to put the line very close to members of one particular class, say y = −1. Intuitively, when test cases arrive we will not make many mistakes on cases that should be classified with y = +1, but we will very easily make mistakes on the cases with y = −1 (for instance, imagine that a new batch of test cases arrives which are small perturbations of the training data). A sensible choice is therefore to place the separation line as far away from both y = −1 and y = +1 training cases as we can, i.e. right in the middle.

Geometrically, the vector w is directed orthogonal to the line defined by w^T x = b. This can be understood as follows. First take b = 0. Now it is clear that all vectors x with vanishing inner product with w satisfy this equation, i.e. all vectors orthogonal to w satisfy this equation. Now translate the hyperplane away from the origin over a vector a. The equation for the plane then becomes (x − a)^T w = 0,

i.e. we find that the offset is b = a^T w, which is the projection of a onto the vector w. Without loss of generality we may choose a perpendicular to the plane, in which case the length ||a|| = |b|/||w|| represents the shortest, orthogonal distance between the origin and the hyperplane.

We now define two more hyperplanes parallel to the separating hyperplane. They represent the planes that cut through the closest training examples on either side; we will call them "support hyper-planes" in the following, because the data-vectors they contain support the planes. We define the distances between the support hyperplanes and the separating hyperplane to be d+ and d− respectively. The margin, γ, is defined to be d+ + d−. Our goal is now to find the separating hyperplane so that the margin is largest, while the separating hyperplane is equidistant from both support hyperplanes.

We can write the following equations for the support hyperplanes:

w^T x = b + δ   (8.1)
w^T x = b − δ   (8.2)

We now note that we have over-parameterized the problem: if we scale w, b and δ by a constant factor α, the equations for x are still satisfied. To remove this ambiguity we will require that δ = 1; this sets the scale of the problem, i.e. whether we measure distance in millimeters or meters.

We can now also compute the value d+ = ( | |b + 1| − |b| | ) / ||w|| = 1/||w||; this expression holds when b ∉ (−1, 0), since then the origin does not fall in between the support hyperplanes. If b ∈ (−1, 0) you should use d+ = ( | |b + 1| + |b| | ) / ||w|| = 1/||w||. Either way, the margin is equal to twice that value: γ = 2/||w||.

With the above definition of the support planes we can write down the following constraints that any solution must satisfy,

w^T x_i − b ≤ −1   ∀ y_i = −1   (8.3)
w^T x_i − b ≥ +1   ∀ y_i = +1   (8.4)

or in one equation,

y_i (w^T x_i − b) − 1 ≥ 0   (8.5)

We now formulate the primal problem of the SVM:

minimize_{w,b}   (1/2) ||w||²
subject to       y_i (w^T x_i − b) − 1 ≥ 0   ∀i   (8.6)

Thus, we maximize the margin subject to the constraints that all training cases fall on either side of the support hyper-planes. The data-cases that lie on the support hyperplanes are called support vectors, since they support the hyper-planes and hence determine the solution to the problem.

The primal problem can be solved with a quadratic program. However, it is not yet ready to be kernelised, because its dependence on the data is not only through inner products between data-vectors. Hence, we transform to the dual formulation by first writing the problem using a Lagrangian,

L(w, b, α) = (1/2) ||w||² − Σ_{i=1}^N α_i [ y_i (w^T x_i − b) − 1 ]   (8.7)

The solution that minimizes the primal problem subject to the constraints is given by min_w max_α L(w, α), i.e. a saddle point problem. When the original objective function is convex (and only then), we can interchange the minimization and maximization. Doing so, we can find the condition on w that must hold at the saddle point we are solving for; this is done by taking derivatives wrt w and b and solving,

w − Σ_i α_i y_i x_i = 0   ⇒   w* = Σ_i α_i y_i x_i   (8.8)
Σ_i α_i y_i = 0   (8.9)

Inserting this back into the Lagrangian we obtain what is known as the dual problem,

maximize   L_D = Σ_{i=1}^N α_i − (1/2) Σ_{ij} α_i α_j y_i y_j x_i^T x_j   (8.10)
subject to   Σ_i α_i y_i = 0,   α_i ≥ 0   ∀i   (8.11)

The dual formulation of the problem is also a quadratic program, but note that the number of variables α_i in this problem is equal to the number of data-cases, N. The crucial point is, however, that this problem depends on the x_i only through the inner products x_i^T x_j. It is therefore readily kernelised through the substitution x_i^T x_j → k(x_i, x_j). This is a recurrent theme: the dual problem lends itself to kernelisation, while the primal problem did not.
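To make the dual concrete, here is a minimal numpy sketch that maximizes L_D by projected gradient ascent. A real implementation would use a dedicated QP solver or the SMO algorithm; the alternating projection onto the constraints used here is only approximate, and the stepsize and iteration count are illustrative. The bias and decision function use the expressions derived a little further on (Eqs. 8.17 and 8.18).

```python
import numpy as np

def svm_dual_hard_margin(K, y, eta=1e-3, iters=5000):
    """Maximize sum_i a_i - 1/2 sum_ij a_i a_j y_i y_j K_ij subject to
    a_i >= 0 and sum_i a_i y_i = 0. K is the Gram matrix, y holds +1/-1 labels."""
    N = len(y)
    Q = (y[:, None] * y[None, :]) * K        # Q_ij = y_i y_j K_ij
    a = np.zeros(N)
    for _ in range(iters):
        grad = np.ones(N) - Q @ a            # gradient of the dual objective
        a += eta * grad
        a = np.maximum(a, 0.0)               # enforce a_i >= 0
        a -= y * (a @ y) / N                 # project onto sum_i a_i y_i = 0
        a = np.maximum(a, 0.0)               # (alternating projection, approximate)
    return a

def svm_bias(K, y, a, tol=1e-6):
    """b = sum_j a_j y_j K(x_j, x_i) - y_i for a support vector i (Eq. 8.18),
    averaged over all support vectors for numerical stability."""
    sv = a > tol
    b_vals = (K[sv] * (a * y)).sum(axis=1) - y[sv]
    return b_vals.mean()

def svm_decision(K_test, y, a, b):
    """f(x) = sum_i a_i y_i k(x_i, x) - b (Eq. 8.17); K_test has shape (n_test, N)."""
    return K_test @ (a * y) - b
```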

The theory of duality guarantees that for convex problems the dual problem will be concave, and moreover that the unique solution of the primal problem corresponds to the unique solution of the dual problem. In fact, we have L_P(w*) = L_D(α*), i.e. the "duality-gap" is zero.

Next we turn to the conditions that must necessarily hold at the saddle point, and thus at the solution of the problem. These are called the KKT conditions (which stands for Karush-Kuhn-Tucker). These conditions are necessary in general, and sufficient for convex optimization problems. They can be derived from the primal problem by setting the derivatives wrt w and b to zero. Also, the constraints themselves are part of these conditions, and we need that for inequality constraints the Lagrange multipliers are non-negative. Finally, an important condition called "complementary slackness" needs to be satisfied:

∂_w L_P = 0   →   w − Σ_i α_i y_i x_i = 0   (8.12)
∂_b L_P = 0   →   Σ_i α_i y_i = 0   (8.13)
constraint:   y_i (w^T x_i − b) − 1 ≥ 0   (8.14)
multiplier condition:   α_i ≥ 0   (8.15)
complementary slackness:   α_i [ y_i (w^T x_i − b) − 1 ] = 0   (8.16)

It is the last equation which may be somewhat surprising. It states that either the inequality constraint is satisfied but not saturated, y_i(w^T x_i − b) − 1 > 0, in which case α_i for that data-case must be zero, or the inequality constraint is saturated, y_i(w^T x_i − b) − 1 = 0, in which case α_i can take any value α_i ≥ 0. Inequality constraints which are saturated are said to be "active", while unsaturated constraints are inactive.

One could imagine the process of searching for a solution as a ball which runs down the primal objective function using gradient descent. At some point it will hit a wall, which is the constraint, and although the derivative is still pointing partially towards the wall, the constraint prohibits the ball from going on. This is an active constraint, because the ball is glued to that wall. When a final solution is reached, we could remove the inactive constraints without changing the solution. One could think of the term ∂_w L_P as the force acting on the ball: we see from the first equation above that only the constraints with α_i ≠ 0 exert a force on the ball that balances the force from the curved quadratic surface.

The training cases with α_i > 0, representing active constraints on the position of the support hyperplanes, are called support vectors. These are the vectors

We should penalize the objective function for these violations.21) The variables. which cannot always be satisﬁed. What we are really interested in is the function f (·) which can be used to classify future test cases. there are only few of them. Typically. THE NON-SEPARABLE CASE 43 that are situated in the support hyperplane and they determine the solution.19) (8. Clearly. and so we need to change the formalism to account for that. Moreover. the problem lies in the constraints.18) 8.8.17) As an application of the KKT conditions we derive a solution for b∗ by using the complementary slackness condition. since α is typically very sparse. αj yj xT xi − yi j i a support vector (8.1 The Non-Separable case Obviously. So.1. ξi allow for violations of the constraint. The most important conclusion is again that this function f (·) can thus be expressed solely in terms of inner products xT xi which we can replace with keri nel matrices k(xi . ξi . f (x) = w∗T x − b∗ = i αi yi xT x − b∗ i (8. Penalty functions of the form C( i ξi)k . but for numerical stability it is better to average over all of them (although they should obviously be consistent). otherwise the above constraints become void (simply always pick ξi very large). xj ) to move to high dimensional non-linear spaces. So. not all datasets are linearly separable. which people call a “sparse” solution (most α’s vanish). let’s relax those constraints by introducing “slack variables”. we don’t need to evaluate many kernel entries in order to predict the class of the new input x. wT xi − b ≤ −1 + ξi wT xi − b ≥ +1 − ξi ξi ≥ 0 ∀i ∀ yi = −1 ∀ yi = +1 (8. using any support vector one can determine b.20) (8. b∗ = j 2 where we used yi = 1.

will lead to convex optimization problems for positive integers k. In the following we will choose k = 1. In that case the sum Σ_i ξ_i can be interpreted as a measure of how "bad" the violations are, and it is an upper bound on the number of violations (to be on the wrong side of the separating hyperplane a data-case needs ξ_i > 1). The constant C controls the tradeoff between the penalty and the margin.

The new primal problem thus becomes,

minimize_{w,b,ξ}   L_P = (1/2) ||w||² + C Σ_i ξ_i   (8.22)
subject to   y_i (w^T x_i − b) − 1 + ξ_i ≥ 0   ∀i,   ξ_i ≥ 0   ∀i   (8.23)

It is still a quadratic program (QP), and it leads to the Lagrangian,

L(w, b, ξ, α, µ) = (1/2) ||w||² + C Σ_i ξ_i − Σ_i α_i [ y_i (w^T x_i − b) − 1 + ξ_i ] − Σ_i µ_i ξ_i   (8.24)

from which we derive the KKT conditions,

1. ∂_w L_P = 0   →   w − Σ_i α_i y_i x_i = 0   (8.25)
2. ∂_b L_P = 0   →   Σ_i α_i y_i = 0   (8.26)
3. ∂_ξ L_P = 0   →   C − α_i − µ_i = 0   (8.27)
4. constraint-1:   y_i (w^T x_i − b) − 1 + ξ_i ≥ 0   (8.28)
5. constraint-2:   ξ_i ≥ 0   (8.29)
6. multiplier condition-1:   α_i ≥ 0   (8.30)
7. multiplier condition-2:   µ_i ≥ 0   (8.31)
8. complementary slackness-1:   α_i [ y_i (w^T x_i − b) − 1 + ξ_i ] = 0   (8.32)
9. complementary slackness-2:   µ_i ξ_i = 0   (8.33)

From here we can deduce the following facts. If we assume that ξ_i > 0, then µ_i = 0 (condition 9), hence α_i = C (condition 3), and thus

ξ_i = 1 − y_i (x_i^T w − b)   (8.34)

(condition 8). When ξ_i = 0 we typically have µ_i > 0 (condition 9) and hence α_i < C. If, in addition to ξ_i = 0, we also have that y_i (w^T x_i − b) − 1 = 0, then α_i > 0 (condition 8). Otherwise, if y_i (w^T x_i − b) − 1 > 0,

then α_i = 0.

In summary: for points not on the support plane and on the correct side, all constraints are inactive and ξ_i = α_i = 0. On the support plane we still have ξ_i = 0, but now α_i > 0. Finally, for data-cases on the wrong side of the support hyperplane the α_i max out at α_i = C, and the ξ_i balance the violation of the constraint such that y_i (w^T x_i − b) − 1 + ξ_i = 0.

Geometrically, we can calculate the gap between the support hyperplane and a violating data-case to be ξ_i / ||w||. This can be seen because the plane defined by y_i (w^T x − b) − 1 + ξ_i = 0 is parallel to the support plane, at a distance |1 + y_i b − ξ_i| / ||w|| from the origin; since the support plane is at a distance |1 + y_i b| / ||w||, the result follows.

Finally, as before, we use the KKT equations to get rid of w, b and ξ, i.e. we convert to the dual problem in order to solve it efficiently and to kernelise it:

maximize   L_D = Σ_{i=1}^N α_i − (1/2) Σ_{ij} α_i α_j y_i y_j x_i^T x_j   (8.35)
subject to   Σ_i α_i y_i = 0,   0 ≤ α_i ≤ C   ∀i   (8.36)

Surprisingly, this is almost the same QP as before, but with an extra constraint on the multipliers α_i, which now live in a box. This constraint is derived from the fact that α_i = C − µ_i and µ_i ≥ 0. We also note that the problem again only depends on the inner products x_i^T x_j, which are ready to be kernelised.


Chapter 9

Support Vector Regression

In kernel ridge regression we have seen that the final solution was not sparse in the variables α. We will now formulate a regression method that is sparse, i.e. it has the concept of support vectors that determine the solution.

The thing to notice is that the sparseness arose from complementary slackness conditions, which in turn came from the fact that we had inequality constraints. In the SVM the penalty that was paid for being on the wrong side of the support plane was given by C Σ_i ξ_i^k for positive integers k, where ξ_i is the orthogonal distance away from the support plane. Importantly, there was no penalty if a data-case was on the right side of the plane. Here we do the same thing: we introduce a penalty for being too far away from the predicted line w^T Φ_i + b, but once you are close enough, i.e. inside some "epsilon-tube" around this line, there is no penalty. We thus expect that all the data-cases which lie inside the tube will have no impact on the final solution and hence have corresponding α_i = 0.

Using the analogy of springs: in the case of ridge-regression the springs were attached between the data-cases and the decision surface, hence every item had an impact on the position of this surface through the force it exerted (recall that the surface was made of "rubber" and pulled back because it was parameterized using a finite number of degrees of freedom, or because it was regularized). For SVR there are only springs attached between data-cases outside the tube, and these attach to the tube, not the decision boundary. Hence data-items inside the tube have no impact on the final solution (or rather, changing their position slightly doesn't perturb the solution).

Note that the term ||w||² is still there to penalize large w and hence to regularize the solution. We introduce different constraints for violating the tube constraint from above

and from below,

minimize_{w,b,ξ,ξ̂}   (1/2) ||w||² + (C/2) Σ_i (ξ_i² + ξ̂_i²)
subject to   w^T Φ_i + b − y_i ≤ ε + ξ_i   ∀i
             y_i − w^T Φ_i − b ≤ ε + ξ̂_i   ∀i   (9.1)

Remark I: We could have added the constraints ξ_i ≥ 0 and ξ̂_i ≥ 0, but it is not hard to see that the final solution satisfies that requirement automatically, so there is no sense in constraining the optimization further. To see this, imagine some ξ_i is negative; then by setting ξ_i = 0 the cost is lower and none of the constraints is violated. We also note that, by the same reasoning, at least one of ξ_i, ξ̂_i is always zero: inside the tube both are zero, outside the tube one of them is zero. This means that at the solution we have ξ_i ξ̂_i = 0.

Remark II: Note that we don't scale ε = 1 like we scaled δ = 1 in the SVM case. The reason is that {y_i} now determines the scale of the problem, i.e. we have not over-parameterized the problem.

The primal Lagrangian becomes,

L_P = (1/2) ||w||² + (C/2) Σ_i (ξ_i² + ξ̂_i²) + Σ_i α_i (w^T Φ_i + b − y_i − ε − ξ_i) + Σ_i α̂_i (y_i − w^T Φ_i − b − ε − ξ̂_i)   (9.2)

We now take derivatives w.r.t. w, b, ξ and ξ̂ to find the following KKT conditions (there are more, of course),

w = Σ_i (α̂_i − α_i) Φ_i   (9.3)
ξ_i = α_i / C,   ξ̂_i = α̂_i / C   (9.4)

Plugging this back in, and using that at the solution we also have α_i α̂_i = 0, we find the dual problem,

maximize_{α,α̂}   −(1/2) Σ_{ij} (α̂_i − α_i)(α̂_j − α_j)(K_ij + (1/C) δ_ij) + Σ_i (α̂_i − α_i) y_i − Σ_i (α_i + α̂_i) ε
subject to   Σ_i (α̂_i − α_i) = 0,   α_i ≥ 0,   α̂_i ≥ 0   ∀i   (9.5)

From the complementary slackness conditions we can read off the sparseness of the solution,

α_i (w^T Φ_i + b − y_i − ε − ξ_i) = 0   (9.6)
α̂_i (y_i − w^T Φ_i − b − ε − ξ̂_i) = 0   (9.7)
ξ_i ξ̂_i = 0,   α_i α̂_i = 0   (9.8)

where we added the last conditions by hand (they don't seem to follow directly from the formulation). Now we clearly see that if a case lies above the tube, ξ̂_i will take on the smallest possible value that satisfies the constraint, ξ̂_i = y_i − w^T Φ_i − b − ε. This implies that α̂_i will take on a positive value, and the farther outside the tube the larger α̂_i (you can think of it as a compensating force); note that in this case α_i = 0. A similar story goes for cases below the tube, where ξ_i > 0 and α_i > 0. If a data-case is inside the tube the α_i, α̂_i are necessarily zero, and hence we obtain sparseness. From the slackness conditions we can also find a value for b (similar to the SVM case).

We now change variables to make this optimization problem look more similar to the SVM and ridge-regression cases. Introduce β_i = α̂_i − α_i and use α_i α̂_i = 0 to write α_i + α̂_i = |β_i|,

maximize_β   −(1/2) Σ_{ij} β_i β_j (K_ij + (1/C) δ_ij) + Σ_i β_i y_i − Σ_i |β_i| ε
subject to   Σ_i β_i = 0   (9.9)

where the constraint comes from the fact that we included a bias term b. (Note by the way that we could not use the trick from ridge-regression of defining a constant feature φ_0 = 1 and setting b = w_0, because the objective does not depend on b.)

Finally, the prediction for a new data-case x is given by,

y = w^T Φ(x) + b = Σ_i β_i K(x_i, x) + b   (9.10)
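A small sketch of this prediction rule, assuming the dual variables β and the bias b have already been obtained from a QP solver; only the support vectors (cases with nonzero β, i.e. outside the tube) contribute to the sum.

```python
import numpy as np

def svr_predict(K_test, beta, b, tol=1e-8):
    """y(x) = sum_i beta_i K(x_i, x) + b (Eq. 9.10), summing only over the
    support vectors. K_test has shape (n_test, n_train)."""
    sv = np.abs(beta) > tol                 # data-cases outside the epsilon-tube
    return K_test[:, sv] @ beta[sv] + b
```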

It is an interesting exercise for the reader to work through the case where the penalty is linear instead of quadratic, i.e.

minimize_{w,b,ξ,ξ̂}   (1/2) ||w||² + C Σ_i (ξ_i + ξ̂_i)   (9.11)
subject to   w^T Φ_i + b − y_i ≤ ε + ξ_i   ∀i
             y_i − w^T Φ_i − b ≤ ε + ξ̂_i   ∀i   (9.12)
             ξ_i ≥ 0,   ξ̂_i ≥ 0   ∀i   (9.13)

leading to the dual problem,

maximize_β   −(1/2) Σ_{ij} β_i β_j K_ij + Σ_i β_i y_i − Σ_i |β_i| ε
subject to   Σ_i β_i = 0,   −C ≤ β_i ≤ +C   ∀i   (9.14)

where we note that the quadratic penalty on the size of β has been replaced by a box constraint, as is to be expected when switching from an L2 norm to an L1 norm penalty.

Final remark: let's remind ourselves that the quadratic programs we have derived are convex optimization problems, which have a unique optimal solution that can be found efficiently using numerical methods. This is often claimed as great progress w.r.t. the old neural-networks days, which were plagued by many local optima.

Chapter 10

Kernel ridge Regression

Possibly the most elementary algorithm that can be kernelized is ridge regression. Here our task is to find a linear function that models the dependencies between covariates {x_i} and response variables {y_i}, both continuous. The classical way to do that is to minimize the quadratic cost,

C(w) = (1/2) Σ_i (y_i − w^T x_i)²   (10.1)

However, if we are going to work in feature space, where we replace x_i → Φ(x_i), there is a clear danger that we overfit. Hence we need to regularize. This is an important topic that will return in future classes. A simple yet effective way to regularize is to penalize the norm of w; this is sometimes called "weight-decay". It remains to be determined how to choose λ; the most used approach is cross validation or leave-one-out estimates. The total cost function hence becomes,

C = (1/2) Σ_i (y_i − w^T x_i)² + (1/2) λ ||w||²   (10.2)

which needs to be minimized. Taking derivatives and equating them to zero gives,

Σ_i (y_i − w^T x_i) x_i = λ w   ⇒   w = ( λ I + Σ_i x_i x_i^T )^{-1} ( Σ_j y_j x_j )   (10.3)

We see that the regularization term helps to stabilize the inverse numerically by bounding the smallest eigenvalues away from zero.

10.1 Kernel Ridge Regression

We now replace all data-cases with their feature vectors: x_i → Φ_i = Φ(x_i). The number of dimensions can now be much higher, or even infinitely higher, than the number of data-cases. There is a neat trick that allows us to perform the inverse above in the smaller of the two spaces, either the dimension of the feature space or the number of data-cases. The trick is given by the following identity,

(P^{-1} + B^T R^{-1} B)^{-1} B^T R^{-1} = P B^T (B P B^T + R)^{-1}   (10.4)

Note that if B is not square, the inverses on the two sides are performed in spaces of different dimensionality. To apply this to our case we define the matrix Φ with elements Φ_{ai} = Φ_a(x_i) and the vector y with elements y_i. The solution is then given by,

w = (λ I_d + Φ Φ^T)^{-1} Φ y = Φ (Φ^T Φ + λ I_n)^{-1} y   (10.5)

This equation can be rewritten as w = Σ_i α_i Φ(x_i) with α = (Φ^T Φ + λ I_n)^{-1} y. This is an equation that will be a recurrent theme, and it can be interpreted as: the solution w must lie in the span of the data-cases, even if the dimensionality of the feature space is much larger than the number of data-cases. This seems intuitively clear, since the algorithm is linear in feature space.

We finally need to show that we never actually need access to the feature vectors, which could be infinitely long (which would be rather impractical). What we need in practice is the predicted value for a new test point x. This is computed by projecting it onto the solution w,

y(x) = w^T Φ(x) = y^T (Φ^T Φ + λ I_n)^{-1} Φ^T Φ(x) = y^T (K + λ I_n)^{-1} κ(x)   (10.6)

where K_ij = K(x_i, x_j) = Φ(x_i)^T Φ(x_j) and κ_i(x) = K(x_i, x). The important message here is of course that we only need access to the kernel K.

We can add a bias to the whole story by adding one more, constant feature to Φ: Φ_0 = 1. The value of w_0 then represents the bias, since

w^T Φ = Σ_a w_a Φ_a + w_0   (10.7)

Hence, the story goes through unchanged.
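A minimal numpy sketch of Eqs. 10.5 and 10.6, using a Gaussian kernel as an example; the kernel choice, σ and λ are illustrative assumptions, to be set by cross-validation in practice.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """K_ij = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kernel_ridge_fit(K, y, lam=1.0):
    """Dual solution alpha = (K + lambda I)^{-1} y."""
    N = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(N), y)

def kernel_ridge_predict(K_test, alpha):
    """y(x) = sum_i alpha_i K(x_i, x); K_test has shape (n_test, n_train)."""
    return K_test @ alpha

# usage sketch (X_train, y_train, X_test assumed given):
# K = gaussian_kernel(X_train, X_train, sigma=0.5)
# alpha = kernel_ridge_fit(K, y_train, lam=0.1)
# y_hat = kernel_ridge_predict(gaussian_kernel(X_test, X_train, sigma=0.5), alpha)
```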

10.2 An alternative derivation

Instead of optimizing the cost function above we can introduce Lagrange multipliers into the problem. This has the effect that the derivation proceeds along lines similar to the SVM case. We introduce new variables ξ_i = y_i − w^T Φ_i and rewrite the objective as the following constrained QP,

minimize_{w,ξ}   L_P = Σ_i ξ_i²   (10.8)
subject to   y_i − w^T Φ_i = ξ_i   ∀i,   ||w|| ≤ B   (10.9)

This leads to the Lagrangian,

L_P = Σ_i ξ_i² + Σ_i β_i [ y_i − w^T Φ_i − ξ_i ] + λ (||w||² − B²)   (10.10)

Two of the KKT conditions tell us that at the solution,

2 ξ_i = β_i   ∀i,     2 λ w = Σ_i β_i Φ_i   (10.11)

Plugging this back into the Lagrangian, we obtain the dual Lagrangian,

L_D = Σ_i ( −(1/4) β_i² + β_i y_i ) − (1/(4λ)) Σ_{ij} β_i β_j K_ij − λ B²   (10.12)

We now redefine α_i = β_i / (2λ) to arrive at the following dual optimization problem,

maximize_{α, λ ≥ 0}   −λ² Σ_i α_i² + 2λ Σ_i α_i y_i − λ Σ_{ij} α_i α_j K_ij − λ B²   (10.13)

Taking derivatives w.r.t. α gives precisely the solution we had already found,

α* = (K + λ I)^{-1} y   (10.14)

Formally we also need to maximize over λ. However, different choices of λ correspond to different choices for B, so we could just as well vary λ in this process. Either λ or B should be chosen using cross-validation or some other measure.

One big disadvantage of ridge-regression is that we don't have sparseness in the α vector, i.e. there is no concept of support vectors. Sparseness is useful because, when we test a new example, we then only have to sum over the support vectors, which is much faster than summing over the entire training-set. In the SVM the sparseness was born out of the inequality constraints, because the complementary slackness conditions told us that if a constraint was inactive, then the corresponding multiplier α_i was zero. There is no such effect here.

Chapter 11

Kernel K-means and Spectral Clustering

The objective in K-means can be written as follows,

C(z, µ) = Σ_i ||x_i − µ_{z_i}||²   (11.1)

which we wish to minimize over the assignment variables z_i (which can take values z_i = 1, ..., K, for all data-cases i) and over the cluster means µ_k, k = 1, ..., K. It is not hard to show that the following iterations achieve that,

z_i = arg min_k ||x_i − µ_k||²   (11.2)
µ_k = (1/N_k) Σ_{i ∈ C_k} x_i   (11.3)

where C_k is the set of data-cases assigned to cluster k.

Now let's assume we have defined many features φ(x_i) and wish to do the clustering in feature space. The objective is similar to before,

C(z, µ) = Σ_i ||φ(x_i) − µ_{z_i}||²   (11.4)

We will now introduce an N × K assignment matrix Z_{nk}, each row of which represents a data-case and contains exactly one 1, in the column of the cluster that case is assigned to. As a result we have Σ_k Z_nk = 1 and N_k = Σ_n Z_nk. Also define L = diag[ 1/Σ_n Z_nk ] = diag[ 1/N_k ].

Finally define Φ_{in} = φ_i(x_n). With these definitions you can check that the matrix M defined as

M = Φ Z L Z^T   (11.5)

consists of N columns, one for each data-case, where each column contains a copy of the cluster mean µ_k to which that data-case is assigned. Using this we can write the K-means cost as,

C = tr[ (Φ − M)(Φ − M)^T ]   (11.6)

Next, we can show that Z^T Z = L^{-1} (check this), and thus that (Z L Z^T)² = Z L Z^T, i.e. it is a projection; similarly, I − Z L Z^T is the projection onto the complement space. Using this we simplify eqn. 11.6 as,

C = tr[ Φ (I − Z L Z^T)² Φ^T ]   (11.7)
  = tr[ Φ (I − Z L Z^T) Φ^T ]   (11.8)
  = tr[ Φ Φ^T ] − tr[ Φ Z L Z^T Φ^T ]   (11.9)
  = tr[ K ] − tr[ L^{1/2} Z^T K Z L^{1/2} ]   (11.10)

where we used that tr[AB] = tr[BA], and L^{1/2} is defined by taking the square root of the diagonal elements. Note that only the second term depends on the clustering matrix Z, so we can now formulate the following equivalent kernel clustering problem,

max_Z   tr[ L^{1/2} Z^T K Z L^{1/2} ]   (11.11)
such that Z is a binary clustering matrix.   (11.12)

This objective is entirely specified in terms of kernels, so we have once again managed to move to the "dual" representation. Note also that this problem is very difficult to solve exactly, because the constraint forces us to search over binary matrices. Our next step will be to approximate the problem through a relaxation of this constraint. First recall that Z^T Z = L^{-1}, so L^{1/2} Z^T Z L^{1/2} = I. Renaming H = Z L^{1/2}, with H an N × K dimensional matrix, we can formulate the following relaxation of the problem,

max_H   tr[ H^T K H ]   (11.13)
subject to   H^T H = I   (11.14)

Note that we no longer require H to be binary. The hope is that the solution is close to some clustering solution, which we can then extract a posteriori.

The above problem should look familiar. Interpret the columns of H as a collection of K mutually orthonormal basis vectors. The objective can then be written as,

Σ_{k=1}^K h_k^T K h_k   (11.15)

By choosing h_k proportional to the K largest eigenvectors of K we maximize the objective, i.e. with K = U Λ U^T,

H = U_{[1:K]} R   (11.16)

where R is a rotation inside the eigenvector space, R R^T = R^T R = I. Using this you can easily verify that tr[H^T K H] = Σ_{k=1}^K λ_k, where {λ_k}, k = 1, ..., K are the largest K eigenvalues.

What is perhaps surprising is that the solution to this relaxed kernel-clustering problem is given by kernel-PCA! Recall that for kernel PCA we also solved for the eigenvectors of K. The above procedure is sometimes referred to as "spectral clustering". Its input is a matrix of similarities (the kernel matrix or Gram matrix), which should be positive semi-definite and symmetric.

How then do we extract a clustering solution from kernel-PCA? Recall that the columns of H (the eigenvectors of K) should approximate the binary matrix Z, which had a single 1 per row indicating to which cluster data-case n is assigned. We could try to simply threshold the entries of H so that the largest value in each row is set to 1 and the remaining ones to 0. However, it often works better to first normalize H,

Ĥ_nk = H_nk / sqrt( Σ_k H_nk² )   (11.17)

All rows of Ĥ are then located on the unit sphere. We can now run a simple clustering algorithm, such as K-means, on the data matrix Ĥ to extract K clusters.

Conclusion: Kernel-PCA can be viewed as a nonlinear feature extraction technique. If you extract two or three features (dimensions) you can use it as a non-linear dimensionality reduction method (for purposes of visualization). If you use the result as input to a simple clustering method (such as K-means), it becomes a nonlinear clustering method.
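A compact numpy/scikit-learn sketch of this recipe (Gram matrix in, cluster labels out). It assumes K is symmetric positive semi-definite and that the number of clusters is chosen by the user.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(K, n_clusters, random_state=0):
    """Relaxed kernel K-means: take the top-K eigenvectors of the Gram matrix,
    normalize each row to unit length (Eq. 11.17), then run ordinary K-means."""
    eigvals, eigvecs = np.linalg.eigh(K)           # eigenvalues in ascending order
    H = eigvecs[:, -n_clusters:]                   # K largest eigenvectors as columns
    norms = np.linalg.norm(H, axis=1, keepdims=True)
    H_hat = H / np.maximum(norms, 1e-12)           # rows on the unit sphere
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=random_state).fit_predict(H_hat)
    return labels
```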


Chapter 12

Kernel Principal Components Analysis

Let's first see what PCA is when we do not worry about kernels and feature spaces. Our aim is to find meaningful projections of the data. However, we are facing an unsupervised problem where we don't have access to any labels; if we had labels, we should be doing Linear Discriminant Analysis. Due to this lack of labels, our aim will be to find the subspace of largest variance, where we choose the number of retained dimensions beforehand. This is clearly a strong assumption, because it may happen that there is interesting signal in the directions of small variance, in which case PCA is not a suitable technique (and we should perhaps use a technique called independent component analysis). However, usually it is true that the directions of smallest variance represent uninteresting noise.

To make progress, we start by writing down the sample covariance matrix C,

C = (1/N) Σ_i x_i x_i^T   (12.1)

We will always assume that the data are centered, i.e. Σ_i x_i = 0; this can always be achieved by a simple translation of the axes.

The eigenvalues of this matrix represent the variance in the eigen-directions of data-space. The eigen-vector corresponding to the largest eigenvalue is the direction in which the data is most stretched out. The second direction is orthogonal to it and picks the direction of largest variance in that orthogonal subspace, and so on. Thus, to reduce the dimensionality of the data, we project the data onto the retained eigen-directions of largest variance.

Writing the eigen-decomposition as C = U Λ U^T = Σ_a λ_a u_a u_a^T, the projection is given by,

y_i = U_k^T x_i   (12.2)

where U_k denotes the d × k sub-matrix containing the first k eigenvectors as columns. A convenient property of this procedure is that the reconstruction error in L2-norm between y and x, i.e.

Σ_i ||x_i − P_k x_i||²   (12.3)

is minimal, where P_k = U_k U_k^T is the projection onto the subspace spanned by the columns of U_k. As a side effect, we can also show that the projected data are de-correlated in the new basis:

(1/N) Σ_i y_i y_i^T = (1/N) Σ_i U_k^T x_i x_i^T U_k = U_k^T C U_k = U_k^T U Λ U^T U_k = Λ_k   (12.4)

where Λ_k is the (diagonal) k × k sub-matrix corresponding to the largest eigenvalues.

Now imagine that there are more dimensions than data-cases, i.e. some dimensions remain unoccupied by the data. In this case it is not hard to show that the eigen-vectors that span the projection space must lie in the subspace spanned by the data-cases. This can be seen as follows,

λ_a u_a = C u_a = (1/N) Σ_i x_i x_i^T u_a = (1/N) Σ_i (x_i^T u_a) x_i   ⇒   u_a = Σ_i [ (x_i^T u_a) / (N λ_a) ] x_i = Σ_i α_i^a x_i   (12.5, 12.6)

where u_a is an arbitrary eigen-vector of C. The last expression can be interpreted as: every eigen-vector can be written exactly (i.e. losslessly) as a linear combination of the data-vectors, and hence it must lie in their span. This also implies that instead of the eigenvalue equation C u = λ u we may consider the N projected equations x_i^T C u = λ x_i^T u, ∀i. From these equations the coefficients

α_i^a can be computed efficiently in a space of dimension N (and not d), as follows,

x_i^T C u_a = λ_a x_i^T u_a   ⇒
(1/N) x_i^T Σ_k x_k x_k^T Σ_j α_j^a x_j = λ_a x_i^T Σ_j α_j^a x_j   ⇒
(1/N) Σ_{j,k} α_j^a [x_i^T x_k][x_k^T x_j] = λ_a Σ_j α_j^a [x_i^T x_j]   (12.7)

We now rename the matrix [x_i^T x_j] = K_ij to arrive at,

K² α^a = N λ_a K α^a   ⇒   K α^a = λ̃_a α^a   with   λ̃_a = N λ_a   (12.8)

So we have derived an eigenvalue equation for α, which in turn completely determines the eigenvectors u. By requiring that u is normalized we find,

u_a^T u_a = 1   ⇒   Σ_{i,j} α_i^a α_j^a [x_i^T x_j] = α^{aT} K α^a = N λ_a α^{aT} α^a = 1   ⇒   ||α^a|| = 1 / sqrt(N λ_a)   (12.9)

Obviously, the whole exposition was set up so that in the end we only need the matrix K to do our calculations. This implies that we are now ready to kernelize the procedure, by replacing x_i → Φ(x_i) and defining K_ij = Φ(x_i)^T Φ(x_j).

Finally, when we receive a new data-case t and we would like to compute its projections onto the new reduced space, we compute,

u_a^T t = Σ_i α_i^a x_i^T t = Σ_i α_i^a K(x_i, t)   (12.10)

This equation should look familiar: it is central to most kernel methods.

12.1 Centering Data in Feature Space

It is in fact very difficult to explicitly center the data in feature space. But we know that the final algorithm only depends on the kernel matrix, so if we can center the kernel matrix we are done as well. We center the features using,

Φ̃_i = Φ_i − (1/N) Σ_k Φ_k   (12.11)

Hence the kernel in terms of the centered features is given by,

K^c_ij = ( Φ_i − (1/N) Σ_k Φ_k )( Φ_j − (1/N) Σ_l Φ_l )^T   (12.12)
       = Φ_i Φ_j^T − (1/N) Σ_k Φ_k Φ_j^T − (1/N) Σ_l Φ_i Φ_l^T + (1/N²) Σ_{k,l} Φ_k Φ_l^T   (12.13)
       = K_ij − κ_i − κ_j + k   (12.14)

with

κ_i = (1/N) Σ_k K_ik   (12.15)
k = (1/N²) Σ_{ij} K_ij   (12.16)

Hence, we can compute the centered kernel in terms of the non-centered kernel alone, and no features need to be accessed.

At test-time we need to compute,

K^c(t_i, x_j) = [ Φ(t_i) − (1/N) Σ_k Φ(x_k) ] [ Φ(x_j) − (1/N) Σ_l Φ(x_l) ]^T   (12.17)

Using a similar calculation (left for the reader) you can find that this can be expressed easily in terms of K(t_i, x_j) and K(x_i, x_j) as follows,

K^c(t_i, x_j) = K(t_i, x_j) − (1/N) Σ_l K(t_i, x_l) − (1/N) Σ_k K(x_k, x_j) + k   (12.18)
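A small numpy sketch of these centering formulas, for the training Gram matrix and for the kernel between test and training cases; it assumes the uncentered kernels have already been computed.

```python
import numpy as np

def center_kernel(K):
    """Centered training kernel of Eqs. 12.12-12.16:
    K^c = K - 1N K - K 1N + 1N K 1N, with 1N the N x N matrix of entries 1/N."""
    N = K.shape[0]
    one = np.full((N, N), 1.0 / N)
    return K - one @ K - K @ one + one @ K @ one

def center_test_kernel(K_test, K_train):
    """Centered test kernel of Eq. 12.18; K_test has shape (n_test, N)."""
    N = K_train.shape[0]
    one_t = np.full((K_test.shape[0], N), 1.0 / N)
    one = np.full((N, N), 1.0 / N)
    return K_test - one_t @ K_train - K_test @ one + one_t @ K_train @ one
```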

e. In this way. PCA is an unsupervised technique and as such does not include label information of the data. how do we utilize the label information in ﬁnding informative projections? To that purpose Fisher-LDA considers maximizing the following objective: w T SB w J(w) = T (13. which would perfectly separate the data-cases (obviously. The cigars are positioned in parallel and very closely together. Note that due to the fact that scatter matrices are proportional to the covariance matrices we could have deﬁned J using covariance matrices – the proportionality constant would have no effect on the solution. A much more useful projection is orthogonal to the cigars. is in the direction of the cigars.Chapter 13 Fisher Linear Discriminant Analysis The most famous example of dimensionality reduction is ”principal components analysis”. in the direction of least overall variance. that removes some of the ”noisy” directions. The deﬁnitions of 63 . because all labels get evenly mixed and we destroy the useful information. ignoring the labels. There are many difﬁcult issues with how many directions one needs to choose. This technique searches for directions in the data that have largest variance and subsequently project the data onto it. if we imagine 2 cigar like clusters in 2 dimensions.1) w SW w where SB is the “between classes scatter matrix” and SW is the “within classes scatter matrix”. but that is beyond the scope of this note. such that the variance in the total data-set. one cigar has y = 1 and the other y = −1. we would still need to perform classiﬁcation in this 1-D space). For instance. For classiﬁcation. this would be a terrible projection. i. we obtain a lower dimensional representation of the data. So the question is.

S_B = Σ_c N_c (µ_c − x̄)(µ_c − x̄)^T   (13.2)
S_W = Σ_c Σ_{i∈c} (x_i − µ_c)(x_i − µ_c)^T   (13.3)

where,

µ_c = (1/N_c) Σ_{i∈c} x_i   (13.4)
x̄ = (1/N) Σ_i x_i = (1/N) Σ_c N_c µ_c   (13.5)

and N_c is the number of cases in class c. Oftentimes you will see that for 2 classes S_B is defined as S'_B = (µ_1 − µ_2)(µ_1 − µ_2)^T. This is the scatter of class 1 with respect to class 2, and you can show that S_B = (N_1 N_2 / N) S'_B; since this boils down to multiplying the objective by a constant, it makes no difference to the final solution.

Why does this objective make sense? It says that a good solution is one where the class-means are well separated, measured relative to the (sum of the) variances of the data assigned to each class. This is precisely what we want, because it implies that the gap between the classes is expected to be big. It is also interesting to observe that, since the total scatter,

S_T = Σ_i (x_i − x̄)(x_i − x̄)^T   (13.6)

is given by S_T = S_W + S_B, the objective can be rewritten as,

J(w) = (w^T S_T w) / (w^T S_W w) − 1   (13.7)

and hence can be interpreted as maximizing the total scatter of the data while minimizing the within scatter of the classes.

An important property of the objective J is that it is invariant w.r.t. rescalings of the vector, w → αw. Since the denominator is a scalar, we can therefore always choose w such that w^T S_W w = 1. For this reason we can transform the problem of maximizing J into the following constrained optimization problem,

min_w   −(1/2) w^T S_B w   (13.8)
s.t.    w^T S_W w = 1   (13.9)

(the half is added for convenience), corresponding to the Lagrangian,

L_P = −(1/2) w^T S_B w + (1/2) λ (w^T S_W w − 1)   (13.10)

The KKT conditions tell us that the following equation needs to hold at the solution,

S_B w = λ S_W w   ⇒   S_W^{-1} S_B w = λ w   (13.11)

This almost looks like an eigenvalue equation; it would be one if the matrix S_W^{-1} S_B were symmetric (in fact, it is called a generalized eigen-problem). However, we can apply the following transformation, using the fact that S_B is symmetric positive definite and can hence be written as S_B = S_B^{1/2} S_B^{1/2}, where S_B^{1/2} is constructed from the eigenvalue decomposition S_B = U Λ U^T as S_B^{1/2} = U Λ^{1/2} U^T. Defining v = S_B^{1/2} w we get,

S_B^{1/2} S_W^{-1} S_B^{1/2} v = λ v   (13.12)

This is a regular eigenvalue problem for the symmetric, positive definite matrix S_B^{1/2} S_W^{-1} S_B^{1/2}, for which we can find solutions λ_k and v_k that correspond to solutions w_k = S_B^{-1/2} v_k.

It remains to choose which eigenvalue and eigenvector correspond to the desired solution. Plugging a solution back into the objective J, we find,

J(w_k) = (w_k^T S_B w_k) / (w_k^T S_W w_k) = λ_k (w_k^T S_W w_k) / (w_k^T S_W w_k) = λ_k   (13.13)

from which it immediately follows that we want the largest eigenvalue to maximize the objective. (A side remark: if you try to find the dual problem and maximize that, you seem to get the wrong sign. My best guess of what goes wrong is that the constraint is not linear, so the problem is not convex, and hence we cannot expect the optimal dual solution to coincide with the optimal primal solution.)
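As an illustration, here is a small scipy sketch that builds the scatter matrices and solves the generalized eigenproblem of Eq. 13.11 directly; it assumes S_W is positive definite (otherwise add a small ridge, as discussed for the kernel version below).

```python
import numpy as np
from scipy.linalg import eigh

def fisher_lda(X, y):
    """Fisher LDA direction: solve S_B w = lambda S_W w and keep the
    eigenvector with the largest eigenvalue. X has shape (N, d)."""
    classes = np.unique(y)
    xbar = X.mean(axis=0)
    d = X.shape[1]
    SB = np.zeros((d, d))
    SW = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        SB += len(Xc) * np.outer(mc - xbar, mc - xbar)   # between-class scatter
        SW += (Xc - mc).T @ (Xc - mc)                    # within-class scatter
    # scipy's eigh solves the generalized symmetric-definite problem SB w = lam SW w
    eigvals, W = eigh(SB, SW)
    return W[:, -1], eigvals[-1]        # eigenvector of the largest eigenvalue
```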

13.1 Kernel Fisher LDA

So how do we kernelize this problem? Unlike for SVMs, the dual problem does not seem to reveal the kernelized problem naturally. Instead, inspired by the SVM case, we make the following key assumption,

w = Σ_i α_i Φ(x_i)   (13.14)

This is a central, recurrent equation that keeps popping up in every kernel machine. It says that although the feature space is very high (or even infinite) dimensional, with a finite number of data-cases the final solution w* will not have a component outside the space spanned by the data-cases. This is a case of the "representer theorem", which intuitively reasons as follows. The solution w is the solution to some eigenvalue equation, S_B^{1/2} S_W^{-1} S_B^{1/2} w = λ w, where both S_B and S_W (and hence its inverse) lie in the span of the data-cases. Hence, the part w_⊥ that is perpendicular to this span is projected to zero, and the equation puts no constraints on those dimensions: they can be arbitrary and have no impact on the solution. If we now assume a very general form of regularization on the norm of w, then these orthogonal components will be set to zero in the final solution, w_⊥ = 0. So we argue that although there are possibly infinitely many dimensions available a priori, at most N of them are occupied by the data, and the solution must lie in their span. (It would not make much sense to use this representation if the number of data-cases were larger than the number of dimensions, but that is typically not the case for kernel methods.)

In terms of α, the objective J becomes,

J(α) = (α^T S_B^Φ α) / (α^T S_W^Φ α)   (13.15)

where it is understood that vector notation now applies to a different space, R^N. The scatter matrices in kernel space can be expressed in terms of the kernel only, as follows (this requires some algebra to verify),

S_B^Φ = Σ_c N_c κ_c κ_c^T − N κ κ^T   (13.16)
S_W^Φ = K² − Σ_c N_c κ_c κ_c^T   (13.17)

with

κ_c the vector with entries (κ_c)_j = (1/N_c) Σ_{i∈c} K_ij   (13.18)
κ the vector with entries κ_j = (1/N) Σ_i K_ij   (13.19)

So, we have managed to express the problem in terms of kernels only, which is what we were after. Projections of new test-points onto the solution direction can be computed, as usual, by

w^T Φ(x) = Σ_i α_i K(x_i, x)   (13.20)

In order to classify a test point we still need to divide the 1-d projected space into regions belonging to each class. The easiest possibility is to pick the class c with smallest Mahalanobis distance d(x, µ_c^Φ) = (x^α − µ_c^α)² / (σ_c^α)², where µ_c^α and σ_c^α represent the class mean and standard deviation in the 1-d projected space, respectively. Alternatively, one could train any classifier in the 1-d subspace.

One very important issue that we did not pay attention to is regularization: clearly, as it stands the kernel machine will overfit. To regularize, we can add a term to the denominator,

S_W → S_W + β I   (13.21)

Adding this diagonal term makes sure that very small eigenvalues are bounded away from zero, which improves numerical stability when computing the inverse. If we write the Lagrangian formulation, where we maximize a constrained quadratic form in α, the extra term appears as a penalty proportional to ||α||², which acts as a weight-decay term favoring smaller values of α over larger ones. Note that since the objective in terms of α has exactly the same form as that in terms of w, the optimization problem has exactly the same form in the regularized case, and we can solve it by solving the generalized eigenvalue equation. This scales as N³, which is certainly expensive for many datasets. More efficient optimization schemes, solving a slightly different problem and based on efficient quadratic programs, exist in the literature.
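A small scipy sketch of the regularized kernel Fisher discriminant along these lines: it builds S_B^Φ and S_W^Φ from the Gram matrix (using the normalization written above, which is my reading of Eqs. 13.16-13.19) and solves the generalized eigenproblem; the ridge parameter β is an illustrative choice.

```python
import numpy as np
from scipy.linalg import eigh

def kernel_fisher_lda(K, y, beta=1e-3):
    """Kernel Fisher LDA: scatter matrices in alpha-space plus a ridge beta*I,
    solved as S_B^Phi a = lambda (S_W^Phi + beta I) a."""
    N = K.shape[0]
    kappa = K.mean(axis=1)                        # kappa_j = (1/N) sum_i K_ij
    SB = -N * np.outer(kappa, kappa)
    SW = K @ K
    for c in np.unique(y):
        Kc = K[:, y == c]
        Nc = Kc.shape[1]
        kappa_c = Kc.mean(axis=1)                 # (1/N_c) sum_{i in c} K_ij
        SB += Nc * np.outer(kappa_c, kappa_c)
        SW -= Nc * np.outer(kappa_c, kappa_c)
    eigvals, A = eigh(SB, SW + beta * np.eye(N))  # generalized eigenproblem
    return A[:, -1]                               # alpha with largest eigenvalue

def kfda_project(K_test, alpha):
    """1-d projection of new cases: w^T Phi(x) = sum_i alpha_i K(x_i, x)."""
    return K_test @ alpha
```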

13.2 A Constrained Convex Programming Formulation of FDA

Chapter 14

Kernel Canonical Correlation Analysis

Imagine you are given 2 copies of a corpus of documents, one written in English, the other written in German. You may consider an arbitrary representation of the documents, but for definiteness we will use the "vector space" representation, where there is an entry for every possible word in the vocabulary and a document is represented by the count values for every word: i.e. if the word "the" appeared 12 times and it is the first word in the vocabulary, we have X_1(doc) = 12, etc.

Let's say we are interested in extracting low dimensional representations for each document. If we had only one language, we could consider running PCA to extract directions in word space that carry most of the variance. This has the ability to infer semantic relations between the words, such as synonymy, because if words tend to co-occur often in documents, they tend to be combined into a single dimension in the new space. These spaces can often be interpreted as topic spaces.

If we have two translations, we can instead try to find projections of each representation separately such that the projections are maximally correlated. Hopefully, this implies that they represent the same topic in two different languages. In this way we can extract language independent topics.

Let x be a document in English and y a document in German. Also assume that the data have zero mean. Consider the projections u = a^T x and v = b^T y. We now consider the following objective,

ρ = E[uv] / sqrt( E[u²] E[v²] )   (14.1)

We want to maximize this objective, because this would maximize the correlation between the univariates u and v. Note that we divided by the standard deviations of the projections to remove scale dependence. (This exposition is very similar to the Fisher discriminant analysis story, and I encourage you to reread that; there you can also find how to generalize to data that is not centered.)

Since we can rescale a and b without changing the problem, we can constrain E[u²] and E[v²] to be equal to 1. This allows us to write the problem as,

maximize_{a,b}   ρ = E[uv]
subject to   E[u²] = 1,   E[v²] = 1   (14.2)

Or, if we construct a Lagrangian and write out the expectations, we find,

L(a, b, λ1, λ2) = Σ_i a^T x_i y_i^T b − (λ1/2)( Σ_i a^T x_i x_i^T a − N ) − (λ2/2)( Σ_i b^T y_i y_i^T b − N )   (14.3)

where we have multiplied through by N. Let's take derivatives wrt a and b to see what the KKT equations tell us,

Σ_i x_i y_i^T b − λ1 Σ_i x_i x_i^T a = 0   (14.4)
Σ_i y_i x_i^T a − λ2 Σ_i y_i y_i^T b = 0   (14.5)

First notice that if we multiply the first equation by a^T and the second by b^T, subtract the two, and use the constraints, we arrive at λ1 = λ2 = λ. Next, rename S_xy = Σ_i x_i y_i^T, S_x = Σ_i x_i x_i^T and S_y = Σ_i y_i y_i^T. We also use the following "trick" to write the two coupled equations as a single system: define S_D to be the block-diagonal matrix with S_x and S_y on the diagonal blocks and zeros on the off-diagonal blocks, define S_O to be the matrix with S_xy and S_yx on the off-diagonal blocks and zeros on the diagonal blocks, and define c = [a; b]. The two equations can then be written jointly as,

S_O c = λ S_D c   ⇒   S_D^{-1} S_O c = λ c   ⇒   S_O^{1/2} S_D^{-1} S_O^{1/2} (S_O^{1/2} c) = λ (S_O^{1/2} c)   (14.6)

which is again a regular eigenvalue equation for c' = S_O^{1/2} c.

14.1 Kernel CCA

As usual, the starting point is to map the data-cases to feature vectors Φ(x_i) and Ψ(y_i). When the dimensionality of the feature space is larger than the number of data-cases in the training-set, the solution must lie in the span of the data-cases, i.e.

a = Σ_i α_i Φ(x_i),   b = Σ_i β_i Ψ(y_i)   (14.7)

where α is a vector in a different, N-dimensional space than e.g. a, which lives in a D-dimensional space. Using this in the Lagrangian, with K_x,ij = Φ(x_i)^T Φ(x_j) and similarly for K_y, we get,

L = α^T K_x K_y β − (λ/2)( α^T K_x² α − N ) − (λ/2)( β^T K_y² β − N )   (14.8)

Taking derivatives w.r.t. α and β we find,

K_x K_y β = λ K_x² α   (14.9)
K_y K_x α = λ K_y² β   (14.10)

Let's try to solve these equations assuming that K_x is full rank (which is typically the case). From the first equation α = λ^{-1} K_x^{-1} K_y β, and substituting this into the second equation gives K_y² β = λ² K_y² β, which always has a solution with λ = 1. By recalling that

ρ = (1/N) a^T S_xy b = (1/N) λ a^T S_x a = λ   (14.11)

we observe that this represents the solution with maximal correlation, and hence the "preferred" one. But a perfect correlation of 1, irrespective of the data, is a typical case of over-fitting, and it emphasizes again the need to regularize in kernel methods. This can be done by adding a diagonal term to the constraints in the Lagrangian (or, equivalently, to the denominator of the original objective), leading to,

L = α^T K_x K_y β − (λ/2)( α^T K_x² α + η ||α||² − N ) − (λ/2)( β^T K_y² β + η ||β||² − N )   (14.12)

One can see that this acts as a quadratic penalty on the norms of α and β. The resulting equations are,

K_x K_y β = λ (K_x² + η I) α   (14.13)
K_y K_x α = λ (K_y² + η I) β   (14.14)

Analogous to the primal problem, we define big matrices: K_D, which contains (K_x² + ηI) and (K_y² + ηI) as blocks on the diagonal and zeros on the off-diagonal blocks, and K_O, which has K_x K_y in the upper-right off-diagonal block and K_y K_x in the lower-left off-diagonal block. Also, we define γ = [α; β]. This leads to the equation,

K_O γ = λ K_D γ   ⇒   K_D^{-1} K_O γ = λ γ   ⇒   K_O^{1/2} K_D^{-1} K_O^{1/2} (K_O^{1/2} γ) = λ (K_O^{1/2} γ)   (14.15)

which is again a regular eigenvalue equation. Note that the regularization also moved the smallest eigenvalue away from zero, and hence made the inverse more numerically stable. The value for η needs to be chosen using cross-validation or some other measure.

Solving the equations via this larger eigenvalue problem is actually not strictly necessary, and more efficient methods exist (see book). The solutions are not expected to be sparse, because eigenvectors are not expected to be sparse; one would have to replace the L2 norm penalties with L1 norm penalties to obtain sparsity.
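A small scipy sketch of this regularized kernel CCA eigenproblem, built directly from the two Gram matrices; the regularization strength η is an illustrative choice.

```python
import numpy as np
from scipy.linalg import eigh

def kernel_cca(Kx, Ky, eta=1e-2):
    """Regularized kernel CCA: solve K_O gamma = lambda K_D gamma (Eq. 14.15)
    for gamma = [alpha; beta] and return the top correlated pair of directions."""
    N = Kx.shape[0]
    Z = np.zeros((N, N))
    KD = np.block([[Kx @ Kx + eta * np.eye(N), Z],
                   [Z, Ky @ Ky + eta * np.eye(N)]])
    KO = np.block([[Z, Kx @ Ky],
                   [Ky @ Kx, Z]])
    eigvals, V = eigh(KO, KD)            # generalized symmetric-definite problem
    gamma = V[:, -1]                     # eigenvector of the largest eigenvalue
    alpha, beta = gamma[:N], gamma[N:]
    return alpha, beta, eigvals[-1]
```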

Appendix A

Essentials of Convex Optimization

A.1 Lagrangians and all that

Most kernel-based algorithms fall into two classes: either they use spectral techniques to solve the problem, or they use convex optimization techniques. Here we will discuss convex optimization. A constrained optimization problem can be expressed as follows,

minimize_x   f_0(x)
subject to   f_i(x) ≤ 0   ∀i,   h_j(x) = 0   ∀j   (A.1)

That is, we have inequality constraints and equality constraints. We now write the primal Lagrangian of this problem, which will be helpful in the following development,

L_P(x, λ, ν) = f_0(x) + Σ_i λ_i f_i(x) + Σ_j ν_j h_j(x)   (A.2)

where we will assume in the following that λ_i ≥ 0 ∀i. From here we can define the dual Lagrangian by,

L_D(λ, ν) = inf_x L_P(x, λ, ν)   (A.3)

This objective can actually become −∞ for certain values of its arguments. We will call parameters λ ≥ 0, ν for which L_D > −∞ dual feasible.

It is important to notice that the dual Lagrangian is a concave function of λ, ν, because it is a pointwise infimum of a family of linear functions in λ, ν. Hence, even if the primal is not convex, the dual is certainly concave. It is not hard to show that

L_D(λ, ν) ≤ p*   (A.4)

where p* is the primal optimal point. This simply follows because Σ_i λ_i f_i(x*) + Σ_j ν_j h_j(x*) ≤ 0 for a primal feasible point x*. Thus, the dual problem always provides a lower bound to the primal problem. The optimal lower bound can be found by solving the dual problem,

maximize_{λ,ν}   L_D(λ, ν)
subject to   λ_i ≥ 0   ∀i   (A.5)

which is therefore a convex optimization problem. If we call d* the dual optimal point, we always have d* ≤ p*, which is called weak duality; p* − d* is called the duality gap. Strong duality holds when p* = d*. Strong duality is very nice, because then we can simply solve the dual problem and convert the answer to the primal domain, since we know that solution must then be optimal; in particular when we can express the primal solution x* in terms of the dual solution λ*, ν*. Often the dual problem is easier to solve.

So when does strong duality hold? Up to some mathematical details the answer is: if the primal problem is convex and the equality constraints are linear. This means that f_0(x) and the {f_i(x)} are convex functions and h_j(x) = Ax − b.

The primal problem can be written as follows,

p* = inf_x sup_{λ≥0,ν} L_P(x, λ, ν)   (A.6)

This can be seen by noting that sup_{λ≥0,ν} L_P(x, λ, ν) = f_0(x) when x is feasible, but ∞ otherwise. To see this, first check that by violating one of the constraints you can find a choice of λ, ν that makes the Lagrangian infinite. When all the constraints are satisfied, the best we can do is maximize the additional terms to be zero, which is always possible; for instance, we can simply set all λ, ν to zero, even though this is not necessary if the constraint terms themselves vanish. The dual problem, by definition, is given by,

d* = sup_{λ≥0,ν} inf_x L_P(x, λ, ν)   (A.7)

Hence, when strong duality holds, the "sup" and "inf" can be interchanged, and the optimal solution is a saddle-point. It is important to realize that the order of maximization and minimization matters for arbitrary functions (but not for convex functions). Try to imagine a "V"-shaped valley which runs diagonally across the coordinate system. If we first maximize over one direction, keeping the other direction fixed, and then minimize the result, we end up with the lowest point on the rim. If we reverse the order we end up with the highest point in the valley.

There are a number of important necessary conditions that hold for problems with zero duality gap. These Karush-Kuhn-Tucker (KKT) conditions turn out to be sufficient for convex optimization problems. They are given by,

∇f_0(x*) + Σ_i λ_i* ∇f_i(x*) + Σ_j ν_j* ∇h_j(x*) = 0   (A.8)
f_i(x*) ≤ 0   (A.9)
h_j(x*) = 0   (A.10)
λ_i* ≥ 0   (A.11)
λ_i* f_i(x*) = 0   (A.12)

The first equation is easily derived because we already saw that p* = inf_x L_P(x, λ*, ν*), and hence all the derivatives must vanish. This condition has a nice interpretation as a "balancing of forces". Imagine a ball rolling down a surface defined by f_0(x) (i.e. you are doing gradient descent to find the minimum). The ball gets blocked by a wall, which is the constraint. If the surface and constraint are convex, then if the ball doesn't move we have reached the optimal solution. At that point the forces on the ball must balance. The first term represents the force of the ball against the wall due to gravity (the ball is still on a slope); the second term represents the reaction force of the wall in the opposite direction. The λ represents the magnitude of the reaction force, which needs to be larger if the surface slopes more. We say that this constraint is "active". Other constraints, which do not exert a force, are "inactive" and have λ = 0. The latter statement can be read off from the last KKT condition, which we call "complementary slackness". It says that either f_i(x) = 0 (the constraint is saturated and hence active), in which case λ_i is free to take on a non-zero value, or the constraint is inactive, f_i(x) < 0, in which case λ_i must vanish. As we will see, the active constraints will correspond to the support vectors in SVMs!

Complementary slackness is easily derived by,

f_0(x*) = L_D(λ*, ν*) = inf_x [ f_0(x) + Σ_i λ_i* f_i(x) + Σ_j ν_j* h_j(x) ]   (A.13)
         ≤ f_0(x*) + Σ_i λ_i* f_i(x*) + Σ_j ν_j* h_j(x*)
         ≤ f_0(x*)   (A.14)

where the first equality follows from the zero duality gap, the second line because the infimum is never larger than the value at the particular point x*, and the last because f_i(x*) ≤ 0, λ_i* ≥ 0 and h_j(x*) = 0. Hence all inequalities are equalities, and since each term λ_i* f_i(x*) is non-positive, each term must vanish separately.

Appendix B

Kernel Design

B.1 Polynomial Kernels

The construction that we will follow below is to first write feature vectors as products of subsets of the input attributes, i.e. we define feature vectors as follows,

\phi_I(x) = x_{i_1} x_{i_2} \cdots x_{i_n}     (B.1)

where I = [i_1, i_2, \ldots, i_n], and where we can put various restrictions on the possible combinations of indices which are allowed. For instance, we could require that their sum is a constant s, i.e. that there are precisely s terms in the product, or we could require that each i_j ∈ {0, 1}. Generally speaking, the best choice depends on the problem you are modelling, but another important constraint is that the corresponding kernel must be easy to compute. Let's define the kernel as usual as,

K(x, y) = \sum_I \phi_I(x)\, \phi_I(y)     (B.2)

We have already encountered the polynomial kernel as,

K(x, y) = (R + x^T y)^d = \sum_{s=0}^{d} \frac{d!}{s!(d-s)!}\, R^{d-s} (x^T y)^s     (B.3)

where the last equality follows from a binomial expansion. If we write out the term (x^T y)^s,

(x^T y)^s = (x_1 y_1 + x_2 y_2 + \cdots + x_n y_n)^s = \sum_{i_1 + i_2 + \cdots + i_n = s} \frac{s!}{i_1!\, i_2! \cdots i_n!} (x_1 y_1)^{i_1} (x_2 y_2)^{i_2} \cdots (x_n y_n)^{i_n}     (B.4)

Taken together with eqn. B.3, we see that the features correspond to,

\phi_I(x) = \sqrt{\frac{d!}{(d-s)!\, i_1!\, i_2! \cdots i_n!}\, R^{d-s}}\; x_1^{i_1} x_2^{i_2} \cdots x_n^{i_n}, \qquad i_1 + i_2 + \cdots + i_n = s \le d     (B.5)

i.e. all monomials of total degree at most d, of which there are (n+d)!/(n! d!) in total. The point is really that, in order to compute the total sum of terms efficiently, we have inserted very special coefficients. The only true freedom we have left is in choosing R: for larger R we down-weight higher order polynomials more. The question we want to answer is: how much freedom do we have in choosing different coefficients while still being able to compute the inner product efficiently?

B.2 All Subsets Kernel

We define the features again as products of powers of the input attributes, but in this case the choice of power is restricted to {0, 1}, i.e. each attribute is either present or absent in a feature. For n input dimensions (number of attributes) we therefore have 2^n possible combinations. Let's compute the kernel function:

K(x, y) = \sum_I \phi_I(x)\, \phi_I(y) = \sum_I \prod_{j: i_j = 1} x_j y_j = \prod_{i=1}^{n} (1 + x_i y_i)     (B.6)

where the last identity follows from the fact that,

\prod_{i=1}^{n} (1 + z_i) = 1 + \sum_i z_i + \sum_{i<j} z_i z_j + \cdots + z_1 z_2 \cdots z_n     (B.7)

i.e. a sum over all possible combinations. Note that in this case there is no decaying factor multiplying the monomials, and that, with 2^n features, it is again much more efficient to compute the kernel directly than to sum over the features.
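The correspondence between these closed-form kernels and their explicit feature maps is easy to check numerically for small n. The following sketch (an added illustration; the choices n = 3, d = 3, R = 1 and the helper functions are ours, not from the text) enumerates the features of eqn. B.5 and of the all-subsets construction, and compares the inner products of the feature vectors with the direct kernel evaluations of eqns. B.3 and B.6.

```python
import numpy as np
from itertools import product
from math import factorial, prod

def poly_features(x, d, R):
    """Explicit features of the polynomial kernel (R + x.y)^d, cf. eqn. (B.5)."""
    n = len(x)
    feats = []
    for powers in product(range(d + 1), repeat=n):   # all multi-indices (i1, ..., in)
        s = sum(powers)
        if s > d:
            continue
        coef = factorial(d) / (factorial(d - s) * prod(factorial(i) for i in powers))
        monomial = prod(xj ** ij for xj, ij in zip(x, powers))
        feats.append(np.sqrt(coef * R ** (d - s)) * monomial)
    return np.array(feats)

def all_subsets_features(x):
    """Explicit features of the all-subsets kernel: one monomial per subset of attributes."""
    n = len(x)
    return np.array([prod(xj for xj, bit in zip(x, bits) if bit)
                     for bits in product([0, 1], repeat=n)])

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=3)
d, R = 3, 1.0

# Polynomial kernel: inner product of explicit features vs. direct evaluation (B.3).
print(poly_features(x, d, R) @ poly_features(y, d, R), "vs", (R + x @ y) ** d)

# All-subsets kernel: sum over all 2^n subsets vs. the product form (B.6).
print(all_subsets_features(x) @ all_subsets_features(y), "vs", np.prod(1 + x * y))
```

For n = 3 and d = 3 the explicit sums are still cheap, but the number of features grows as (n+d)!/(n! d!) and 2^n respectively, which is exactly why evaluating the closed-form kernel directly is the practical option.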

B.3 The Gaussian Kernel

This kernel is given by,

K(x, y) = \exp\!\left( -\frac{1}{2\sigma^2} \|x - y\|^2 \right)     (B.8)

where σ controls the flexibility of the kernel: for very small σ the Gram matrix becomes the identity matrix and every point is very dissimilar to any other point, whereas for very large σ we find the constant kernel, with all entries equal to 1, and hence all points look completely similar. This underscores the need for regularization in kernel methods: it is easy to perform perfectly on the training data, but that does not imply you will do well on new test data.

In the RKHS construction, the features corresponding to the Gaussian kernel are Gaussians centred on the data-cases, i.e. smoothed versions of the data-cases,

\phi(x) = \exp\!\left( -\frac{1}{2\sigma^2} \|x - \cdot\|^2 \right)     (B.9)

and thus every new direction which is added to the feature space is going to be orthogonal to all directions outside the width of the Gaussian and somewhat aligned to close-by points. Since the inner product of any feature vector with itself is 1, all feature vectors have length 1. Moreover, inner products between any two different feature vectors are positive, implying that all feature vectors can be represented in the positive orthant (or any other single orthant), i.e. they lie on a sphere of radius 1 in a single orthant.
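The two limits of the Gram matrix described above are easy to verify numerically. The sketch below (an added illustration with arbitrary random data-cases) builds the Gram matrix for a handful of points and shows that it approaches the identity matrix for very small σ and the all-ones matrix for very large σ.

```python
import numpy as np

def gaussian_gram(X, sigma):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), cf. eqn. (B.8)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * sigma**2))

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))          # five 2-dimensional data-cases

for sigma in (0.01, 1.0, 100.0):
    print(f"sigma = {sigma:6.2f}")
    print(np.round(gaussian_gram(X, sigma), 2))
# Very small sigma: approximately the identity (every point dissimilar to every other point).
# Very large sigma: approximately all ones (every point looks like every other point).
```

With σ = 0.01 the Gram matrix is essentially the identity, so any labelling of the training points can be fit perfectly, which is precisely the regularization issue raised above.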


