
Charles Elkan

elkan@cs.ucsd.edu

March 29, 2011


Contents

1 Introduction
  1.1 Limitations of predictive analytics
  1.2 Overview

2 Predictive analytics in general
  2.1 Supervised learning
  2.2 Data cleaning and recoding
  2.3 Linear regression
  2.4 Interpreting coefficients of a linear model
  2.5 Evaluating performance

3 Introduction to Rapidminer
  3.1 Standardization of features
  3.2 Example of a Rapidminer process
  3.3 Other notes on Rapidminer

4 Support vector machines
  4.1 Loss functions
  4.2 Regularization
  4.3 Linear soft-margin SVMs
  4.4 Nonlinear kernels
  4.5 Selecting the best SVM settings

5 Classification with a rare class
  5.1 Measuring performance
  5.2 Thresholds and lift
  5.3 Ranking examples
  5.4 Conditional probabilities
  5.5 Isotonic regression
  5.6 Univariate logistic regression
  5.7 Pitfalls of link prediction

6 Detecting overfitting: cross-validation
  6.1 Cross-validation
  6.2 Nested cross-validation

7 Making optimal decisions
  7.1 Predictions, decisions, and costs
  7.2 Cost matrix properties
  7.3 The logic of costs
  7.4 Making optimal decisions
  7.5 Limitations of cost-based analysis
  7.6 Rules of thumb for evaluating data mining campaigns
  7.7 Evaluating success

8 Learning classifiers despite missing labels
  8.1 The standard scenario for learning a classifier
  8.2 Sample selection bias in general
  8.3 Covariate shift
  8.4 Reject inference
  8.5 Positive and unlabeled examples
  8.6 Further issues

9 Recommender systems
  9.1 Applications of matrix approximation
  9.2 Measures of performance
  9.3 Additive models
  9.4 Multiplicative models
  9.5 Combining models by fitting residuals
  9.6 Further issues

10 Text mining
  10.1 The bag-of-words representation
  10.2 The multinomial distribution
  10.3 Training Bayesian classifiers
  10.4 Burstiness
  10.5 Discriminative classification
  10.6 Clustering documents
  10.7 Topic models
  10.8 Latent semantic analysis
  10.9 Open questions

11 Social network analytics
  11.1 Difficulties in network mining
  11.2 Unsupervised network mining
  11.3 Collective classification
  11.4 Link prediction
  11.5 Iterative collective classification
  11.6 Other topics

12 Interactive experimentation

Bibliography

Chapter 1

Introduction

There are many definitions of data mining. We shall take it to mean the application of learning algorithms and statistical methods to real-world datasets. Data mining is useful in numerous domains in science, engineering, business, and elsewhere. We shall focus on applications related to business, but the most useful methods are largely the same for applications in science or engineering.

The focus will be on methods for making predictions. For example, the available data may be a customer database, along with labels indicating which customers failed to pay their bills. The goal will then be to predict which other customers might fail to pay in the future. In general, analytics is a newer name for data mining. Predictive analytics indicates a focus on making predictions.

The main alternative to predictive analytics can be called descriptive analytics. This area is often also called "knowledge discovery in data" or KDD. In a nutshell, the goal of descriptive analytics is to discover patterns in data. Finding patterns is often fascinating and sometimes highly useful, but in general it is harder to obtain direct benefit from descriptive analytics than from predictive analytics. For example, suppose that customers of Whole Foods tend to be liberal and wealthy. This pattern may be noteworthy and interesting, but what should Whole Foods do with the finding? Often, the same finding suggests two courses of action that are both plausible, but contradictory. In such a case, the finding is really not useful in the absence of additional knowledge. For example, perhaps Whole Foods should direct its marketing towards additional wealthy and liberal people. Or perhaps that demographic is saturated, and it should aim its marketing at a different, currently less tapped, group of people.

In contrast, predictions can typically be used directly to make decisions that maximize benefit to the decision-maker. For example, customers who are more likely not to pay in the future can have their credit limit reduced now. It is important to understand the difference between a prediction and a decision. Data mining lets us make predictions, but predictions are useful to an agent only if they allow the agent to make decisions that have better outcomes.

Some people may feel that the focus in this course on maximizing profit is distasteful or disquieting. After all, maximizing profit for a business may be at the expense of consumers, and may not benefit society at large. There are several responses to this feeling. First, maximizing profit in general is maximizing efficiency. Society can use the tax system to spread the benefit of increased profit. Second, increased efficiency often comes from improved accuracy in targeting, which benefits the people being targeted. Businesses have no motive to send advertising to people who will merely be annoyed and not respond.

1.1 Limitations of predictive analytics

It is important to understand the limitations of predictive analytics. First, in general, one cannot make progress without a dataset for training of adequate size and quality. Second, it is crucial to have a clear definition of the concept that is to be predicted, and to have historical examples of the concept. Consider for example this extract from an article in the London Financial Times dated May 13, 2009:

    Fico, the company behind the credit score, recently launched a service that pre-qualifies borrowers for modification programmes using their in-house scoring data. Lenders pay a small fee for Fico to refer potential candidates for modifications that have already been vetted for inclusion in the programme. Fico can also help lenders find borrowers that will best respond to modifications and learn how to get in touch with them.

It is hard to see how this could be a successful application of data mining, because it is hard to see how a useful labeled training set could exist. The target concept is "borrowers that will best respond to modifications." From a lender's perspective (and FICO works for lenders, not borrowers) such a borrower is one who would not pay under his current contract, but who would pay if given a modified contract. Especially in 2009, lenders had no long historical experience with offering modifications to borrowers, so FICO did not have relevant data. Moreover, the definition of the target is based on a counterfactual, that is, on reading the minds of borrowers. Data mining cannot read minds.

For a successful data mining application, the actions to be taken based on predictions need to be defined clearly and to have reliable profit consequences. The difference between a regular payment and a modified payment is often small, for example $200 in the case described in the newspaper article. It is not clear that giving people modifications will really change their behavior dramatically.

Also, for a successful data mining application, actions must not have major unintended consequences. Here, modifications may change behavior in undesired ways. A person requesting a modification is already thinking of not paying. Those who get modifications may be motivated to return and request further concessions.

Additionally, for predictive analytics to be successful, the training data must be representative of the test data. Typically, the training data come from the past, while the test data arise in the future. If the phenomenon to be predicted is not stable over time, then predictions are likely not to be useful. Here, changes in the general economy, in the price level of houses, and in social attitudes towards foreclosures are all likely to change the behavior of borrowers in the future, in systematic but unknown ways.

Last but not least, for a successful application it helps if the consequences of actions are essentially independent for different examples. This may not be the case here. Rational borrowers who hear about others getting modifications will try to make themselves appear to be candidates for a modification. So each modification generates a cost that is not restricted to the loss incurred with respect to the individual getting the modification.

An even clearer example of an application of predictive analytics that is unlikely to succeed is learning a model to predict which persons will commit a major terrorist act. There are so few positive training examples that statistically reliable patterns cannot be learned. Moreover, intelligent terrorists will take steps not to fit in with patterns exhibited by previous terrorists [Jonas and Harper, 2006].

1.2 Overview

In this course we shall only look at methods that have state-of-the-art accuracy, that are sufficiently simple and fast to be easy to use, and that have well-documented successful applications. We will tread a middle ground between focusing on theory at the expense of applications, and understanding methods only at a cookbook level.

Often, we shall not look at multiple methods for the same task, when there is one method that is at least as good as all the others from most points of view. In particular, for classifier learning, we will look at support vector machines (SVMs) in detail. We will not examine alternative classifier learning methods such as decision trees, neural networks, boosting, and bagging. All these methods are excellent, but it is hard to identify clearly important scenarios in which they are definitely superior to SVMs. We may also look at random forests, a nonlinear method that is often superior to linear SVMs, and which is widely used in commercial applications nowadays.

Chapter 2

Predictive analytics in general

This chapter explains supervised learning, linear regression, and data cleaning and recoding.

2.1 Supervised learning

The goal of a supervised learning algorithm is to obtain a classifier by learning from training examples. A classifier is something that can be used to make predictions on test examples. This type of learning is called "supervised" because of the metaphor that a teacher (i.e. a supervisor) has provided the true label of each training example.

Each training and test example is represented in the same way, as a row vector of fixed length p. Each element in the vector representing an example is called a feature value. It may be a real number or a value of any other type. A training set is a set of vectors with known label values. It is essentially the same thing as a table in a relational database, and an example is one row in such a table. Row, tuple, and vector are essentially synonyms. A column in such a table is often called a feature, or an attribute, in data mining. Sometimes it is important to distinguish between a feature, which is an entire column, and a feature value.

The label y for a test example is unknown. The output of the classifier is a conjecture about y, i.e. a predicted y value. Often each label value y is a real number. In this case, supervised learning is called "regression" and the classifier is called a "regression model." The word "classifier" is usually reserved for the case where label values are discrete. In the simplest but most common case, there are just two label values. These may be called -1 and +1, or 0 and 1, or no and yes, or negative and positive.

With n training examples, and with each example consisting of values for p different features, the training data are a matrix with n rows and p columns, along with a column vector of y values. The cardinality of the training set is n, while its dimensionality is p. We use the notation x_ij for the value of feature number j of example number i. The label of example i is y_i. True labels are known for training examples, but not for test examples.

2.2 Data cleaning and recoding

In real-world data, there is a lot of variability and complexity in features. Some features are real-valued. Other features are numerical but not real-valued, for example integers or monetary amounts. Many features are categorical, e.g. for a student the feature "year" may have values freshman, sophomore, junior, and senior. Usually the names used for the different values of a categorical feature make no difference to data mining algorithms, but are critical for human understanding. Sometimes categorical features have names that look numerical, e.g. zip codes, and/or have an unwieldy number of different values. Dealing with these is difficult.

It is also difficult to deal with features that are really only meaningful in conjunction with other features, such as "day" in the combination "day month year." Moreover, important features (meaning features with predictive power) may be implicit in other features. For example, the day of the week may be predictive, but only the day/month/year date is given in the original data. An important job for a human is to think which features may be predictive, based on understanding the application domain, and then to write software that makes these features explicit. No learning algorithm can be expected to discover day-of-week automatically as a function of day/month/year.

Even with all the complexity of features, many aspects are typically ignored in data mining. Usually, units such as dollars and dimensions such as kilograms are omitted. Difficulty in determining feature values is also ignored: for example, how does one define the year of a student who has transferred from a different college, or who is part-time?

A very common difficulty is that the value of a given feature is missing for some training and/or test examples. Often, missing values are indicated by question marks. However, often missing values are indicated by strings that look like valid known values, such as 0 (zero). It is important not to treat a value that means missing inadvertently as a valid regular value.

Some training algorithms can handle missingness internally. If not, the simplest approach is just to discard all examples (rows) with missing values. Keeping only examples without any missing values is called "complete case analysis." An equally simple, but different, approach is to discard all features (columns) with missing values. However, in many applications, both these approaches eliminate too much useful training data.

Also, the fact that a particular feature is missing may itself be a useful predictor. Therefore, it is often beneficial to create an additional binary feature that is 0 for missing and 1 for present. If a feature with missing values is retained, then it is reasonable to replace each missing value by the mean or mode of the non-missing values. This process is called imputation. More sophisticated imputation procedures exist, but they are not always better.
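As a concrete sketch of the indicator-plus-imputation idea (the data and column names below are invented for illustration):

```python
# Mean imputation plus a 0/1 "present" indicator for each feature.
# Hypothetical data: each row is [income, age]; None marks a missing value.
rows = [[50000, 34], [None, 29], [72000, None], [61000, 45]]

def impute_column(values):
    """Replace each None by the mean of the non-missing values,
    and return a parallel indicator column (1 = value was present)."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    filled = [v if v is not None else mean for v in values]
    indicator = [0 if v is None else 1 for v in values]
    return filled, indicator

income, income_present = impute_column([r[0] for r in rows])
age, age_present = impute_column([r[1] for r in rows])
```

The indicator columns let a model learn that missingness itself is predictive; mode imputation for categorical features follows the same pattern.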

Some training algorithms can only handle categorical features. For these, features that are numerical can be discretized. The range of the numerical values is partitioned into a fixed number of intervals that are called bins. The word "partitioned" means that the bins are exhaustive and mutually exclusive, i.e. non-overlapping. One can set boundaries for the bins so that each bin has equal width, i.e. the boundaries are regularly spaced, or one can set boundaries so that each bin contains approximately the same number of training examples, i.e. the bins are "equal count." Each bin is given an arbitrary name. Each numerical value is then replaced by the name of the bin in which the value lies. It often works well to use ten bins.
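A minimal sketch of the two binning schemes, using made-up values and k = 5 bins (bin indices stand in for arbitrary bin names):

```python
# Equal-width vs. equal-count discretization of a numerical feature.
values = [1, 2, 2, 3, 5, 8, 13, 21, 34, 55]

def equal_width_bins(values, k):
    """Assign each value a bin index 0..k-1; boundaries are regularly spaced."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_count_bins(values, k):
    """Assign bin indices so each bin holds roughly the same number of examples."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins
```

On skewed data like this, equal-width binning crowds most values into the lowest bins, while equal-count binning spreads examples evenly.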

Other training algorithms can only handle real-valued features. For these, categorical features must be made numerical. The values of a binary feature can be recoded as 0.0 or 1.0. It is conventional to code "false" or "no" as 0.0, and "true" or "yes" as 1.0. Usually, the best way to recode a feature that has k different categorical values is to use k real-valued features. For the jth categorical value, set the jth of these features equal to 1.0 and set all k − 1 others equal to 0.0. [1]

Categorical features with many values (say, over 20) are often difficult to deal with. Typically, human intervention is needed to recode them intelligently. For example, zip codes may be recoded as just their first one or two digits, since these indicate meaningful regions of the United States. If one has a large dataset with many examples from each of the 50 states, then it may be reasonable to leave this as a categorical feature with 50 values. Otherwise, it may be useful to group small states together in sensible ways, for example to create a New England group for MA, CT, VT, ME, NH.

An intelligent way to recode discrete predictors is to replace each discrete value by the mean of the target conditioned on that discrete value. For example, if the average label value is 20 for men versus 16 for women, these values could replace the male and female values of a variable for gender. This idea is especially useful as a way to convert a discrete feature with many values, for example the 50 U.S. states, into a useful single numerical feature.

[1] The ordering of the values, i.e. which value is associated with j = 1, etc., is arbitrary. Mathematically it is preferable to use only k − 1 real-valued features. For the last categorical value, set all k − 1 features equal to 0.0. For the jth categorical value where j < k, set the jth feature value to 1.0 and set all k − 1 others equal to 0.0.

However, as just explained, the standard way to recode a discrete feature with m values is to introduce m − 1 binary features. With this standard approach, the training algorithm can learn a coefficient for each new feature that corresponds to an optimal numerical value for the corresponding discrete value. Conditional means are likely to be meaningful and useful, but they will not yield better predictions than the coefficients learned in the standard approach. A major advantage of the conditional-means approach is that it avoids an explosion in the dimensionality of training and test examples.
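A sketch of conditional-mean recoding; the data below are invented so that the means come out to 20 and 16, as in the gender example above:

```python
# Replace each categorical value by the mean of the target y
# conditioned on that value.
def conditional_mean_encode(values, targets):
    sums, counts = {}, {}
    for v, y in zip(values, targets):
        sums[v] = sums.get(v, 0.0) + y
        counts[v] = counts.get(v, 0) + 1
    means = {v: sums[v] / counts[v] for v in sums}
    return [means[v] for v in values], means

genders = ["male", "female", "male", "female", "male"]
labels = [22, 15, 18, 17, 20]
encoded, means = conditional_mean_encode(genders, labels)
```

In practice the conditional means must be computed on training data only, and rare categories may need smoothing toward the global mean.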

Mixed types.

Sparse data.

Normalization. After conditional-mean new values have been created, they can be scaled to have zero mean and unit variance in the same way as other features.

2.3 Linear regression

Let x be an instance and let y be its real-valued label. For linear regression, x must be a vector of real numbers of fixed length. Remember that this length p is often called the dimension, or dimensionality, of x. Write x = ⟨x_1, x_2, ..., x_p⟩. The linear regression model is

    y = b_0 + b_1 x_1 + b_2 x_2 + ... + b_p x_p.

The righthand side above is called a linear function of x. The linear function is defined by its coefficients b_0 to b_p. These coefficients are the output of the data mining algorithm.

The coefficient b_0 is called the intercept. It is the value of y predicted by the model if x_i = 0 for all i. Of course, it may be completely unrealistic that all features x_i have value zero. The coefficient b_i is the amount by which the predicted y value increases if x_i increases by 1, if the value of all other features is unchanged. For example, suppose x_i is a binary feature where x_i = 0 means female and x_i = 1 means male, and suppose b_i = −2.5. Then the predicted y value for males is lower by 2.5, everything else being held constant.

Suppose that the training set has cardinality n, i.e. it consists of n examples of the form ⟨x_i, y_i⟩, where x_i = ⟨x_i1, ..., x_ip⟩. Let b be any set of coefficients. The predicted value for x_i is

    ŷ_i = f(x_i; b) = b_0 + ∑_{j=1}^{p} b_j x_ij.

The semicolon in the expression f(x_i; b) emphasizes that the vector x_i is a variable input, while b is a fixed set of parameter values. If we define x_i0 = 1 for every i, then we can write

    ŷ_i = ∑_{j=0}^{p} b_j x_ij.

The constant x_i0 = 1 can be called a pseudo-feature.

Finding the optimal values of the coefficients b_0 to b_p is the job of the training algorithm. To make this task well-defined, we need a definition of what "optimal" means. The standard approach is to say that optimal means minimizing the sum of squared errors on the training set, where the squared error on training example i is (y_i − ŷ_i)². The training algorithm then finds

    b̂ = argmin_b ∑_{i=1}^{n} (f(x_i; b) − y_i)².

The objective function ∑_i (y_i − ∑_j b_j x_ij)² is called the sum of squared errors, or SSE for short. Note that during training the n different x_i and y_i values are fixed, while the parameters b are variable.
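One way to compute the SSE-minimizing coefficients is NumPy's least-squares solver. The data below are synthetic, generated from known coefficients so the answer can be checked, with the pseudo-feature x_i0 = 1 prepended so that b_0 is the intercept:

```python
import numpy as np

# Generate noise-free synthetic data from known coefficients, then
# recover them by minimizing the SSE with np.linalg.lstsq.
rng = np.random.default_rng(0)
n, p = 100, 2
X = rng.normal(size=(n, p))
X1 = np.hstack([np.ones((n, 1)), X])   # prepend the pseudo-feature x_i0 = 1
true_b = np.array([3.0, -2.0, 0.5])    # b_0, b_1, b_2
y = X1 @ true_b
b_hat, residuals, rank, _ = np.linalg.lstsq(X1, y, rcond=None)
```

With noise-free data and n > p, the minimizer is unique and the true coefficients are recovered exactly (up to floating-point error).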

The optimal coefficient values b̂ are not defined uniquely if the number n of training examples is less than the number p of features. Even if n > p is true, the optimal coefficients have multiple equivalent values if some features are themselves related linearly. Here, "equivalent" means that the different sets of coefficients achieve the same minimum SSE. For an intuitive example, suppose features 1 and 2 are height and weight respectively. Suppose that x_2 = 120 + 5(x_1 − 60) = −180 + 5x_1 approximately, where the units of measurement are pounds and inches. Then the same model can be written in many different ways:

• y = b_0 + b_1 x_1 + b_2 x_2

• y = b_0 + b_1 x_1 + b_2 (−180 + 5 x_1) = [b_0 − 180 b_2] + [b_1 + 5 b_2] x_1 + 0 x_2

and more. In the extreme, suppose x_1 = x_2. Then all models y = b_0 + b_1 x_1 + b_2 x_2 for which b_1 + b_2 equals a constant are equivalent.

When two or more features are approximately related linearly, then the true values of the coefficients of those features are not well determined. The coefficients obtained by training will be strongly influenced by randomness in the training data. Regularization is a way to reduce the influence of this type of randomness. Consider all models y = b_0 + b_1 x_1 + b_2 x_2 for which b_1 + b_2 = c. Among these models, there is a unique one that minimizes the function b_1² + b_2². This model has b_1 = b_2 = c/2. We can obtain it by setting the objective function for training to be the sum of squared errors (SSE) plus a function that penalizes large values of the coefficients. A simple penalty function of this type is ∑_{j=1}^{p} b_j². A parameter λ can control the relative importance of the two objectives, namely SSE and penalty:

    b̂ = argmin_b (1/n) ∑_{i=1}^{n} (y_i − ŷ_i)² + λ (1/p) ∑_{j=1}^{p} b_j².

If λ = 0 then one gets the standard least-squares linear regression solution. As λ gets larger, the penalty on large coefficients gets stronger, and the typical values of coefficients get smaller. The parameter λ is often called the strength of regularization. The fractions 1/n and 1/p do not make an essential difference. They can be used to make the numerical value of λ easier to interpret.

The penalty function ∑_{j=1}^{p} b_j² is the square of the L_2 norm of the vector b. Using it for linear regression is called ridge regression. Any penalty function that treats all coefficients b_j equally, like the L_2 norm, is sensible only if the typical magnitudes of the values of each feature are similar; this is an important motivation for data normalization. Note that in the formula ∑_{j=1}^{p} b_j² the sum excludes the intercept coefficient b_0. One reason for doing this is that the target y values are typically not normalized.
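Ridge regression has a standard closed-form solution, (XᵀX + λP) b = Xᵀy, where P is the identity with a zero in the intercept position so that b_0 is not penalized. In the sketch below the data are synthetic and the 1/n and 1/p factors are absorbed into λ:

```python
import numpy as np

# Ridge regression in closed form, with the intercept excluded
# from the penalty.
rng = np.random.default_rng(1)
n, p = 50, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])
y = rng.normal(size=n)

def ridge(X, y, lam):
    penalty = np.eye(X.shape[1])
    penalty[0, 0] = 0.0            # do not penalize the intercept b_0
    return np.linalg.solve(X.T @ X + lam * penalty, X.T @ y)

b_small = ridge(X, y, 0.01)
b_large = ridge(X, y, 1000.0)
```

As λ grows, the penalized coefficients shrink toward zero, while the intercept remains free to absorb the mean of y.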

2.4 Interpreting coefﬁcients of a linear model

It is common to desire a data mining model that is interpretable, that is, one that can be used not only to make predictions, but also to understand mechanisms in the phenomenon that is being studied. Linear models, whether for regression or for classification as described in later chapters, do appear interpretable at first sight. However, much caution is needed when attempting to derive conclusions from numerical coefficients.

Consider the following linear regression model for predicting high-density lipoprotein (HDL) cholesterol levels. [2]

[2] HDL cholesterol is considered beneficial and is sometimes called "good" cholesterol. Source: http://www.jerrydallal.com/LHSP/importnt.htm. Predictors have been reordered here from most to least statistically significant, as measured by p-value.

    predictor   coefficient   std.error   Tstat   p-value
    intercept     1.16448      0.28804     4.04    <.0001
    BMI          -0.01205      0.00295    -4.08    <.0001
    LCHOL         0.31109      0.10936     2.84    0.0051
    GLUM         -0.00046      0.00018    -2.50    0.0135
    DIAST         0.00255      0.00103     2.47    0.0147
    BLC           0.05055      0.02215     2.28    0.0239
    PRSSY        -0.00041      0.00044    -0.95    0.3436
    SKINF         0.00147      0.00183     0.81    0.4221
    AGE          -0.00092      0.00125    -0.74    0.4602

From most to least statistically significant, the predictors are body mass index, the log of total cholesterol, diastolic blood pressure, vitamin C level in blood, systolic blood pressure, skinfold thickness, and age in years. (It is not clear what GLUM is.)

The example illustrates at least two crucial issues. First, if predictors are collinear, then one may appear significant and the other not, when in fact both are significant or both are not. Above, diastolic blood pressure is statistically significant, but systolic is not. This may possibly be true for some physiological reason. But it may also be an artifact of collinearity.

Second, a predictor may be practically important, and statistically significant, but still useless for interventions. This happens if the predictor and the outcome have a common cause, or if the outcome causes the predictor. Above, vitamin C is statistically significant. But it may be that vitamin C is simply an indicator of a generally healthy diet high in fruits and vegetables. If this is true, then merely taking a vitamin C supplement will not cause an increase in HDL level.

A third crucial issue is that a correlation may disagree with other knowledge and assumptions. For example, vitamin C is generally considered beneficial or neutral. If lower vitamin C were associated with higher HDL, one would be cautious about believing this relationship, even if the association was statistically significant.

2.5 Evaluating performance

In a real-world application of supervised learning, we have a training set of examples with labels, and a test set of examples with unknown labels. The whole point is to make predictions for the test examples.

However, in research or experimentation we want to measure the performance achieved by a learning algorithm. To do this we use a test set consisting of examples with known labels. We train the classifier on the training set, apply it to the test set, and then measure performance by comparing the predicted labels with the true labels (which were not available to the training algorithm).

It is absolutely vital to measure the performance of a classifier on an independent test set. Every training algorithm looks for patterns in the training data, i.e. correlations between the features and the class. Some of the patterns discovered may be spurious, i.e. they are valid in the training data due to randomness in how the training data was selected from the population, but they are not valid, or not as strong, in the whole population. A classifier that relies on these spurious patterns will have higher accuracy on the training examples than it will on the whole population. Only accuracy measured on an independent test set is a fair estimate of accuracy on the whole population. The phenomenon of relying on patterns that are strong only in the training data is called overfitting. In practice it is an omnipresent danger.
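Overfitting can be demonstrated in a few lines. In this synthetic sketch the true relation is linear, but a degree-9 polynomial is interpolated through 10 noisy training points and then evaluated on an independent test set from the same distribution:

```python
import numpy as np

# Fit a degree-9 polynomial (near-)exactly through 10 noisy training
# points, then evaluate on fresh data from the same distribution.
rng = np.random.default_rng(2)

def sample(n):
    x = rng.uniform(-1.0, 1.0, n)
    y = x + 0.3 * rng.normal(size=n)   # true relation: linear plus noise
    return x, y

x_train, y_train = sample(10)
x_test, y_test = sample(100)
coeffs = np.linalg.solve(np.vander(x_train, 10), y_train)

def mse(x, y):
    return float(np.mean((np.vander(x, 10) @ coeffs - y) ** 2))

train_mse = mse(x_train, y_train)
test_mse = mse(x_test, y_test)
```

The training error is near zero because the polynomial has fit the noise, while the test error is much larger; the gap is overfitting made visible.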

Most training algorithms have some settings that the user can choose between. For ridge regression the main algorithmic parameter is the degree of regularization λ. Other algorithmic choices are which sets of features to use. It is natural to run a supervised learning algorithm many times, and to measure the accuracy of the function (classifier or regression function) learned with different settings. A set of labeled examples used to measure accuracy with different algorithmic settings, in order to pick the best settings, is called a validation set. If you use a validation set, it is important to have a final test set that is independent of both the training set and the validation set. For fairness, the final test set must be used only once. The only purpose of the final test set is to estimate the true accuracy achievable with the settings previously chosen using the validation set.

Dividing the available data into training, validation, and test sets should be done

randomly, in order to guarantee that each set is a random sample from the same distri-

bution. However, a very important real-world issue is that future real test examples,

for which the true label is genuinely unknown, may be not a sample from the same

distribution.
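The random three-way split described above can be sketched in Python as follows. This is an illustrative helper only; the 60/20/20 proportions are an assumption for the example, not from the text.

```python
import random

def three_way_split(examples, seed=0, frac_train=0.6, frac_valid=0.2):
    """Randomly partition labeled examples into training, validation,
    and test sets, so each set is a sample from the same distribution."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(frac_train * n)
    n_valid = int(frac_valid * n)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

train, valid, test = three_way_split(list(range(100)))
```

Remember that the final test split must be used only once, after all settings have been chosen using the validation split.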

Quiz question

Suppose you are building a model to predict how many dollars someone will spend at

Sears. You know the gender of each customer, male or female. Since you are using

linear regression, you must recode this discrete feature as continuous. You decide

to use two real-valued features, x_11 and x_12. The coding is a standard "one of n"
scheme, as follows:

gender   x_11   x_12
male      1      0
female    0      1

Learning from a large training set yields the model

y = ... + 15 x_11 + 75 x_12 + ...

Dr. Roebuck says “Aha! The average woman spends $75, but the average man spends

only $15.”

Write your name below, and then answer the following three parts with one or

two sentences each:

(a) Explain why Dr. Roebuck’s conclusion is not valid. The model only predicts

spending of $75 for a woman if all other features have value zero. This may not be

true for the average woman. Indeed it will not be true for any woman, if features

such as “age” are not normalized.

(b) Explain what conclusion can actually be drawn from the numbers 15 and 75.

The conclusion is that if everything else is held constant, then on average a woman

will spend $60 more than a man. Note that if some feature values are systematically

different for men and women, then even this conclusion is not useful, because it is not

reasonable to hold all other feature values constant.

(c) Explain a desirable way to simplify the model. The two features x_11 and
x_12 are linearly dependent, since they always sum to one. Hence, they make the
optimal model undefined, in the absence of regularization. It would be good to
eliminate one of these two features. The expressiveness of the model would be
unchanged.
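The one-of-n coding used in the quiz, together with the simplification of dropping one redundant column, can be sketched as follows. The helper name is hypothetical, not part of the original notes.

```python
def one_of_n(values, drop_most_common=True):
    """Encode a discrete feature as 0/1 columns ("one of n" coding).
    Optionally drop the column of the most common value, so that the
    remaining columns are not redundant given an intercept."""
    levels = sorted(set(values))
    if drop_most_common:
        most_common = max(levels, key=values.count)
        levels = [v for v in levels if v != most_common]
    return [[1.0 if v == level else 0.0 for level in levels] for v in values]

rows = one_of_n(["male", "female", "female"])  # keeps only the "male" column
```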

Quiz for April 6, 2010

Your name:

Suppose that you are training a model to predict how many transactions a credit card

customer will make. You know the education level of each customer. Since you are

using linear regression, you recode this discrete feature as continuous. You decide to

use two real-valued features, x_37 and x_38. The coding is a "one of two" scheme, as
follows:

                   x_37   x_38
college grad        1      0
not college grad    0      1

Learning from a large training set yields the model

y = ... + 5.5 x_37 + 3.2 x_38 + ...

(a) Dr. Goldman concludes that the average college graduate makes 5.5 transactions.

Explain why Dr. Goldman’s conclusion is likely to be false.

The model only predicts 5.5 transactions if all other features, including the in-

tercept, have value zero. This may not be true for the average college grad. It will

certainly be false if features such as “age” are not normalized.

(b) Dr. Sachs concludes that being a college graduate causes a person to make 2.3

more transactions, on average. Explain why Dr. Sachs’ conclusion is likely to be

false also.

First, if any other feature has different values on average for the two groups,

for example income, then 5.5 − 3.2 = 2.3 is not the average difference in predicted

y value between groups. Said another way, it is unlikely to be reasonable to hold all

other feature values constant when comparing groups.

Second, even if 2.3 is the average difference, one cannot say that this difference

is caused by being a college graduate. There may be some other unknown common

cause, for example.

Linear regression assignment

This assignment is due at the start of class on Tuesday April 12, 2011. You should

work in a team of two. Choose a partner who has a different background from you.

Download the file cup98lrn.zip from http://archive.ics.uci.edu/ml/databases/kddcup98/kddcup98.html. Read the associated documentation. Load the data into Rapidminer (or other software for data mining such

as R). Select the 4843 records that have feature TARGET_B=1. Save these as a

native-format Rapidminer example set.

Now, build a linear regression model to predict the ﬁeld TARGET_D as accurately

as possible. Use root mean squared error (RMSE) as the deﬁnition of error, and use

ten-fold cross-validation to measure RMSE. Do a combination of the following:

• Recode non-numerical features as numerical.

• Discard useless features.

• Transform features to improve their usefulness.

• Compare different strengths of ridge regularization.

Do the steps above repeatedly in order to explore alternative ways of using the data.

The outcome should be the best possible model that you can ﬁnd that uses 30 or

fewer of the original features.

To make your work more efﬁcient, be sure to save the 4843 records in a format

that Rapidminer can load quickly. You can use three-fold cross-validation during

development to save time also. If you normalize all input features, and you use

strong regularization (ridge parameter 10^7 perhaps), then the regression coefficients

will indicate the relative importance of features.

The deliverable is a brief report that is formatted similarly to this assignment

description. Describe what you did that worked, and your results. Explain any as-

sumptions that you made, and any limitations on the validity or reliability of your

results. If you use Rapidminer, include a printout of your ﬁnal process. Include your

ﬁnal regression model. Do not speculate about future work, and do not explain ideas

that did not work. Write in the present tense. Organize your report logically, not

chronologically.

Comments on the regression assignment

It is typically useful to rescale predictors to have mean zero and variance one. How-

ever, it loses interpretability to rescale the target variable. Note that if all predictors

have mean zero, then the intercept of a linear regression model is the mean of the

target, $15.6243 here, assuming that the intercept is not regularized.

The assignment speciﬁcally asks you to report root mean squared error, RMSE.

One could also report mean squared error, MSE, but whichever is chosen should be

used consistently. In general, do not confuse readers by switching between multiple

performance measures without a good reason.
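RMSE is simply the square root of MSE, so either determines the other; a minimal sketch:

```python
import math

def rmse(predictions, targets):
    """Root mean squared error between predicted and true values."""
    assert len(predictions) == len(targets)
    mse = sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)
    return math.sqrt(mse)
```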

As is often the case, good performance can be achieved with a very simple model.

The most informative single feature is LASTGIFT, the dollar amount of the person’s

most recent gift. A model based on just this single feature achieves RMSE of $9.98.

In 2009, three of 11 teams achieved similar ﬁnal RMSEs that were slightly better than

$9.00. The two teams that omitted LASTGIFT achieved RMSE worse than $11.00.

However, it is possible to do signiﬁcantly better.

The assignment asks you to produce a ﬁnal model based on at most 30 of the

original features. Despite this directive, it is not a good idea to begin by choosing

a subset of the 480 original features based on human intuition. The teams that did

this all omitted features that in fact would have made their ﬁnal models considerably

better, including sometimes the feature LASTGIFT. As explained above, it is also

not a good idea to automatically eliminate features with missing values.

Rapidminer has operators that search for highly predictive subsets of variables.

These operators have two major problems. First, they are too slow to be used on a

large initial set of variables, so it is easy for human intuition to pick an initial set

that is bad. Second, these operators try a very large number of alternative subsets

of variables, and pick the one that performs best on some dataset. Because of the

high number of alternatives considered, this subset is likely to overﬁt substantially

the dataset on which it is best. For more discussion of this problem, see the section

on nested cross-validation in a later chapter.

Chapter 3

Introduction to Rapidminer

By default, many Java implementations allocate so little memory that Rapidminer

will quickly terminate with an “out of memory” message. This is not a problem with

Windows 7, but otherwise, launch Rapidminer with a command like

java -Xmx1g -jar rapidminer.jar

where 1g means one gigabyte.

3.1 Standardization of features

The recommended procedure is as follows, in order.

• Normalize all numerical features to have mean zero and variance 1.

• Convert each nominal feature with k alternative values into k different binary

features.

• Optionally, drop all binary features with fewer than 100 examples for either

binary value.

• Convert each binary feature into a numerical feature with values 0.0 and 1.0.

It is not recommended to normalize numerical features obtained from binary features.

The reason is that this normalization would destroy the sparsity of the dataset, and

hence make some efﬁcient training algorithms much slower. Note that destroying

sparsity is not an issue when the dataset is stored as a dense matrix, which it is by

default in Rapidminer.

Normalization is also questionable for variables whose distribution is highly un-

even. For example, almost all donation amounts are under $50, but a few are up


to $500. No linear normalization, whether by z-scoring or by transformation to the

range 0 to 1 or otherwise, will allow a linear classiﬁer to discriminate well between

different common values of the variable. For variables with uneven distributions, it

is often useful to apply a nonlinear transformation that makes their distribution have

more of a uniform or Gaussian shape. A common transformation that often achieves

this is to take logarithms. A useful trick when zero is a frequent value of a variable

x is to use the mapping x → log(x + 1). This leaves zeros unchanged and hence

preserves sparsity.
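The x → log(x + 1) mapping is one line of code; this sketch just makes its two properties explicit:

```python
import math

def log1p_transform(x):
    """x -> log(x + 1): zero stays zero (sparsity is preserved) and the
    long right tail of an uneven variable is compressed."""
    return math.log(x + 1.0)
```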

Eliminating binary features for which either value has fewer than 100 training

examples is a heuristic way to prevent overﬁtting and at the same time make training

faster. It may not be necessary, or beneﬁcial, if regularization is used during training.

If regularization is not used, then at least one binary feature must be dropped from

each set created by converting a multivalued feature, in order to prevent the existence

of multiple equivalent models. For efﬁciency and interpretability, it is best to drop the

binary feature corresponding to the most common value of the original multivalued

feature.

If you have separate training and test sets, it is important to do all preprocessing

in a perfectly consistent way on both datasets. The simplest approach is to concatenate
both sets before preprocessing, and then split them again afterwards. However,
concatenation allows information from the test set to influence the details of
preprocessing. This is a form of information leakage, that is, of using the test set
during training, which is not legitimate.

3.2 Example of a Rapidminer process

Figure 3.2 shows a tree structure of Rapidminer operators that together perform a

standard data mining task. The tree illustrates how to perform some common sub-

tasks.

The ﬁrst operator is ExampleSource. “Read training example” is a name for this

particular instance of this operator. If you have two instances of the same operator,

you can identify them with different names. You select an operator by clicking on

it in the left pane. Then, in the right pane, the Parameters tab shows the arguments

of the operator. At the top there is an icon that is a picture of either a person with a

red sweater, or a person with an academic cap. Click on the icon to toggle between

these. When the person in red is visible, you are in expert mode, where you can see

all the parameters of each operator.

The example source operator has a parameter named attributes. Normally this
is a file name with extension "aml." The corresponding file should contain a schema

definition for a dataset. The actual data are in a file with the same name with extension "dat."

Root node of Rapidminer process (Process)
    Read training examples (ExampleSource)
    Select features by name (FeatureNameFilter)
    Convert nominal to binary (Nominal2Binominal)
    Remove almost-constant features (RemoveUselessAttributes)
    Convert binary to real-valued (Nominal2Numerical)
    Z-scoring (Normalization)
    Scan for best strength of regularization (GridParameterOptimization)
        Cross-validation (XValidation)
            Regularized linear regression (W-LinearRegression)
            ApplierChain (OperatorChain)
                Applier (ModelApplier)
                Compute RMSE on test fold (RegressionPerformance)

Figure 3.1: Rapidminer process for regularized linear regression.

The easiest way to create these files is by clicking on "Start Data Loading

Wizard.” The ﬁrst step with this wizard is to specify the ﬁle to read data from, the

character that begins comment lines, and the decimal point character. Ticking the

box for “use double quotes” can avoid some error messages.

In the next panel, you specify the delimiter that divides ﬁelds within each row

of data. If you choose the wrong delimiter, the data preview at the bottom will look

wrong. In the next panel, tick the box to make the ﬁrst row be interpreted as ﬁeld

names. If this is not true for your data, the easiest fix is to make it true outside Rapidminer, with a text editor or otherwise. When you click Next on this panel, all rows

of data are loaded. Error messages may appear in the bottom pane. If there are no

errors and the data ﬁle is large, then Rapidminer hangs with no visible progress. The

same thing happens if you click Previous from the following panel. You can use a

CPU monitor to see what Rapidminer is doing.

The next panel asks you to specify the type of each attribute. The wizard guesses

this based only on the ﬁrst row of data, so it often makes mistakes, which you have to

ﬁx by trial and error. The following panel asks you to say which features are special.

The most common special choice is “label” which means that an attribute is a target

to be predicted.

Finally, you specify a ﬁle name that is used with “aml” and “dat” extensions to

save the data in Rapidminer format.

To keep just features with certain names, use the operator FeatureNameFilter. Let

the argument skip_features_with_name be .* and let the argument except_

features_with_name identify the features to keep. In our sample process, it is

(.*AMNT.*)|(.*GIFT.*)(YRS.*)|(.*MALE)|(STATE)|(PEPSTRFL)|(.*GIFT)

|(MDM.*)|(RFA_2.*).

In order to convert a discrete feature with k different values into k real-valued

0/1 features, two operators are needed. The ﬁrst is Nominal2Binominal, while

the second is Nominal2Numerical. Note that the documentation of the latter

operator in Rapidminer is misleading: it cannot convert a discrete feature into multi-

ple numerical features directly. The operator Nominal2Binominal is quite slow.

Applying it to discrete features with more than 50 alternative values is not recom-

mended.

The simplest way to ﬁnd a good value for an algorithm setting is to use the

XValidation operator nested inside the GridParameterOptimization op-

erator. The way to indicate nesting of operators is by dragging and dropping. First

create the inner operator subtree. Then insert the outer operator. Then drag the root

of the inner subtree, and drop it on top of the outer operator.


3.3 Other notes on Rapidminer

Reusing existing processes.

Saving datasets.

Operators from Weka versus from elsewhere.

Comments on speciﬁc operators: Nominal2Numerical, Nominal2Binominal, Re-

moveUselessFeatures.

Eliminating non-alphanumerical characters, using quote marks, trimming lines,

trimming commas.

Chapter 4

Support vector machines

This chapter explains soft-margin support vector machines (SVMs), including linear

and nonlinear kernels. It also discusses detecting overﬁtting via cross-validation, and

preventing overﬁtting via regularization.

We have seen how to use linear regression to predict a real-valued label. Now we

will see how to use a similar model to predict a binary label. In later chapters, where

we think probabilistically, it will be convenient to assume that a binary label takes on

true values 0 or 1. However, in this chapter it will be convenient to assume that the

label y has true value either +1 or −1.

4.1 Loss functions

Let x be an instance, let y be its true label, and let f(x) be a prediction. Assume that

the prediction is real-valued. If we need to convert it into a binary prediction, we will

threshold it at zero:

ŷ = 2 I(f(x) ≥ 0) − 1

where I() is an indicator function that is 1 if its argument is true, and 0 if its argument

is false.

A so-called loss function l measures how good the prediction is. Typically loss

functions do not depend directly on x, and a loss of zero corresponds to a perfect

prediction, while otherwise loss is positive. The most obvious loss function is

l(f(x), y) = I(f(x) ≠ y)

which is called the 0-1 loss function. The usual deﬁnition of accuracy uses this loss

function. However, it has undesirable properties. First, it loses information: it does

not distinguish between predictions f(x) that are almost right, and those that are


very wrong. Second, mathematically, its derivative is either undeﬁned or zero, so it

is difﬁcult to use in training algorithms that try to minimize loss via gradient descent.

A better loss function is squared error:

l(f(x), y) = (f(x) − y)^2

which is inﬁnitely differentiable everywhere, and does not lose information when the

prediction f(x) is real-valued. However, this loss function says that the prediction

f(x) = 1.5 is as undesirable as f(x) = 0.5 when the true label is y = 1. Intuitively,

if the true label is +1, then a prediction with the correct sign that is greater than 1

should not be considered incorrect.

The following loss function, which is called hinge loss, satisﬁes the intuition just

suggested:

l(f(x), y) = max{0, 1 − y f(x)}.

The hinge loss function deserves some explanation. Suppose the true label is y = 1.

Then the loss is zero as long as the prediction f(x) ≥ 1. The loss is positive, but less

than 1, if 0 < f(x) < 1. The loss is large, i.e. greater than 1, if f(x) < 0.
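The case analysis above can be checked with a minimal sketch (illustrative only, not part of the original notes):

```python
def hinge_loss(fx, y):
    """Hinge loss max{0, 1 - y * f(x)} for a real-valued prediction fx
    and a true label y in {+1, -1}."""
    return max(0.0, 1.0 - y * fx)
```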

Using hinge loss is the ﬁrst major insight behind SVMs. An SVM classiﬁer f

is trained to minimize hinge loss. The training process aims to achieve predictions

f(x) ≥ 1 for all training instances x with true label y = +1, and to achieve predic-

tions f(x) ≤ −1 for all training instances x with y = −1. Overall, training seeks

to classify points correctly, and to distinguish clearly between the two classes, but it

does not seek to make predictions be exactly +1 or −1. In this sense, the training

process intuitively aims to ﬁnd the best possible classiﬁer, without trying to satisfy

any unnecessary additional objectives also.

4.2 Regularization

Given a set of training examples (x_i, y_i) for i = 1 to i = n, the total training loss
(sometimes called empirical loss) is the sum

∑_{i=1}^n l(f(x_i), y_i).

Suppose that the function f is selected by minimizing average training loss:

f = argmin_{f ∈ F} (1/n) ∑_{i=1}^n l(f(x_i), y_i)

where F is a space of candidate functions. If F is too ﬂexible, and/or the training

set is too small, then we run the risk of overﬁtting the training data. But if F is too


restricted, then we run the risk of underﬁtting the data. In general, we do not know in

advance what the best space F is for a particular training set. A possible solution to

this dilemma is to choose a ﬂexible space F, but at the same time to impose a penalty

on the complexity of f. Let c(f) be some real-valued measure of complexity. The

learning process then becomes to solve

f = argmin_{f ∈ F} λ c(f) + (1/n) ∑_{i=1}^n l(f(x_i), y_i).

Here, λ is a parameter that controls the relative strength of the two objectives, namely

to minimize the complexity of f and to minimize training error.
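As an illustrative sketch (the helper names dot, loss, and complexity are hypothetical), the penalized objective for a linear model f(x) = w · x can be evaluated as:

```python
def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def regularized_objective(w, examples, loss, complexity, lam):
    """Penalized training objective: lam * c(w) + (1/n) * sum_i loss(f(x_i), y_i)
    for a linear model f(x) = w . x."""
    n = len(examples)
    empirical = sum(loss(dot(w, x), y) for x, y in examples) / n
    return lam * complexity(w) + empirical

# Example: squared loss with a square-norm penalty.
value = regularized_objective(
    w=[1.0],
    examples=[([2.0], 1.0)],
    loss=lambda f, y: (f - y) ** 2,
    complexity=lambda w: sum(wj * wj for wj in w),
    lam=0.5,
)
```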

Suppose that the space of candidate functions is defined by a vector w ∈ R^d of
parameters, i.e. we can write f(x) = g(x; w) where g is some fixed function. In this
case we can define the complexity of each candidate function to be the norm of the
corresponding w. Most commonly we use the square norm:

c(f) = ||w||^2 = ∑_{j=1}^d w_j^2.

However we could also use other norms, including the L_0 norm

c(f) = ||w||_0 = ∑_{j=1}^d I(w_j ≠ 0)

or the L_1 norm

c(f) = ||w||_1 = ∑_{j=1}^d |w_j|.

The square norm is the most convenient mathematically.
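The three penalties can be written out directly; these are illustrative helpers, not from the text:

```python
def l0_norm(w):
    """Number of non-zero coefficients."""
    return sum(1 for wj in w if wj != 0)

def l1_norm(w):
    """Sum of absolute values of the coefficients."""
    return sum(abs(wj) for wj in w)

def square_norm(w):
    """Squared L2 norm, the penalty that is most convenient mathematically."""
    return sum(wj * wj for wj in w)
```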

4.3 Linear soft-margin SVMs

A linear classifier is a function f(x) = g(x; w) = x · w where g is the dot product
function. Putting the ideas above together, the objective of learning is to find

w = argmin_{w ∈ R^d} λ ||w||^2 + (1/n) ∑_{i=1}^n max{0, 1 − y_i (w · x_i)}.

A linear soft-margin SVM classiﬁer is precisely the solution to this optimization

problem. It can be proved that the solution to this minimization problem is always


unique. Moreover, the objective function is convex, so there are no local minima.

Note that d is the dimensionality of x, and w has the same dimensionality.

An equivalent way of writing the same optimization problem is

w = argmin_{w ∈ R^d} ||w||^2 + C ∑_{i=1}^n max{0, 1 − y_i (w · x_i)}

with C = 1/(nλ). Many SVM implementations let the user specify C as opposed

to λ. A small value for C corresponds to strong regularization, while a large value

corresponds to weak regularization. Intuitively, everything else being equal, a smaller

training set should require a smaller C value. However, useful guidelines are not

known for what the best value of C might be for a given dataset. In practice, one has

to try multiple values of C, and ﬁnd the best value experimentally.
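The conversion C = 1/(nλ) is trivial to compute; a sketch with a hypothetical helper name:

```python
def c_from_lambda(n, lam):
    """Soft-margin parameter C = 1/(n * lambda); a small C means
    strong regularization, a large C means weak regularization."""
    return 1.0 / (n * lam)
```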

Mathematically, the optimization problem above is called an unconstrained pri-

mal formulation. There is an alternative formulation that is equivalent, and is useful

theoretically. This so-called dual formulation is

max_{α ∈ R^n} ∑_{i=1}^n α_i − (1/2) ∑_{i=1}^n ∑_{j=1}^n α_i α_j y_i y_j (x_i · x_j)

subject to 0 ≤ α_i ≤ C.

The primal and dual formulations are different optimization problems, but they have
the same unique solution. The solution to the dual problem is a coefficient α_i for
each training example. Notice that the optimization is over R^n, whereas it is over R^d
in the primal formulation. The trained classifier is f(x) = w · x where the vector

w = ∑_{i=1}^n α_i y_i x_i.

This equation says that w is a weighted linear combination of the training instances
x_i, where the weight of each instance is between 0 and C, and the sign of each
instance is its label y_i. The training instances x_i such that α_i > 0 are called support
vectors. These instances are the only ones that contribute to the final classifier.
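The equation w = ∑ α_i y_i x_i can be sketched directly; note that only the support vectors, those with α_i > 0, contribute (illustrative helper, not from the text):

```python
def primal_from_dual(alphas, labels, instances):
    """Recover the primal weight vector w = sum_i alpha_i * y_i * x_i
    from the dual coefficients; only support vectors contribute."""
    d = len(instances[0])
    w = [0.0] * d
    for a, y, x in zip(alphas, labels, instances):
        if a > 0:  # skip non-support vectors
            for j in range(d):
                w[j] += a * y * x[j]
    return w
```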

The constrained dual formulation is the basis of the training algorithms used by

standard SVM implementations, but recent research has shown that the unconstrained

primal formulation in fact leads to faster training algorithms, at least in the linear case

as above. Moreover, the primal version is easier to understand and easier to use as a

foundation for proving bounds on generalization error. However, the dual version is

easier to extend to obtain nonlinear SVM classiﬁers. This extension is based on the

idea of a kernel function.


4.4 Nonlinear kernels

Consider two instances x_i and x_j, and consider their dot product x_i · x_j. This dot
product is a measure of the similarity of the two instances: it is large if they are
similar and small if they are not. Dot product similarity is closely related to Euclidean
distance through the identity

d(x_i, x_j) = ||x_i − x_j|| = (||x_i||^2 − 2 x_i · x_j + ||x_j||^2)^{1/2}

where by definition ||x||^2 = x · x.¹

Therefore, the equation

f(x) = w · x = ∑_{i=1}^n α_i y_i (x_i · x)

says that the prediction for a test example x is a weighted average of the training
labels y_i, where the weight of y_i is the product of α_i and the degree to which x is
similar to x_i.

Consider a re-representation of instances x → Φ(x) where the transformation Φ
is a function R^d → R^d′. In principle, we could use the dot product to define similarity
in the new space R^d′, and train an SVM classifier in this space. However, suppose
we have a function K(x_i, x_j) = Φ(x_i) · Φ(x_j). This function is all we need in order
to write down the optimization problem and its solution; we do not need to know the
function Φ in any explicit way. Specifically, let k_ij = K(x_i, x_j). The learning task
is to solve

max_{α ∈ R^n} ∑_{i=1}^n α_i − (1/2) ∑_{i=1}^n ∑_{j=1}^n α_i α_j y_i y_j k_ij

subject to 0 ≤ α_i ≤ C.

The solution is

f(x) = [∑_{i=1}^n α_i y_i Φ(x_i)] · Φ(x) = ∑_{i=1}^n α_i y_i K(x_i, x).

This classifier is a weighted combination of at most n functions, one for each training
instance x_i. These are called basis functions.

The result above says that in order to train a nonlinear SVM classifier, all that we
need is the kernel matrix of size n by n whose entries are k_ij. And in order to apply
the trained nonlinear classifier, all that we need is the kernel function K. The function
Φ never needs to be known explicitly. Using K exclusively in this way instead of Φ
is called the "kernel trick." Practically, the function K can be much easier to deal
with than Φ, because K is just a mapping to R, rather than to a high-dimensional
space R^d′.

¹If the instances have unit length, that is ||x_i|| = ||x_j|| = 1, then Euclidean distance and dot product similarity are perfectly anticorrelated. For many applications of support vector machines, it is advantageous to normalize features to have the same mean and variance. It can be advantageous also to normalize instances so that they have unit length. However, in general one cannot have both normalizations be true at the same time.

Intuitively, regardless of which kernel K is used, that is regardless of which re-
representation Φ is used, the complexity of the classifier f is limited, since it is
defined by at most n coefficients α_i. The function K is fixed, so it does not increase
the intuitive complexity of the classifier.

One particular kernel is especially important. The radial basis function (RBF)
kernel is the function

K(x_i, x) = exp(−γ ||x_i − x||^2)

where γ > 0 is an adjustable parameter. Using an RBF kernel, each basis function
K(x_i, x) is "radial" since it is based on the Euclidean distance ||x_i − x|| from x_i to x.
With an RBF kernel, the classifier f(x) = ∑_i α_i y_i K(x_i, x) is similar to a nearest-
neighbor classifier. Given a test instance x, its predicted label f(x) is a weighted
average of the labels y_i of the support vectors x_i. The support vectors that contribute
non-negligibly to the predicted label are those for which the Euclidean distance
||x_i − x|| is small.

The RBF kernel can also be written

K(x_i, x) = exp(−||x_i − x||^2 / σ^2)

where σ^2 = 1/γ. This notation emphasizes the similarity with a Gaussian distribu-
tion. A smaller value for γ, i.e. a larger value for σ^2, corresponds to basis functions
that are less peaked, i.e. that are significantly non-zero for a wider range of x values.
Using a larger value for σ^2 is similar to using a larger number k of neighbors in a
nearest neighbor classifier.
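The RBF kernel itself is a one-liner; a minimal sketch (illustrative only):

```python
import math

def rbf_kernel(xi, x, gamma):
    """K(xi, x) = exp(-gamma * ||xi - x||^2). The value approaches zero
    as the Euclidean distance grows, so distant support vectors
    contribute negligibly to the prediction."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, x))
    return math.exp(-gamma * sq_dist)
```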

4.5 Selecting the best SVM settings

Getting good results with support vector machines requires some care. The consensus

opinion is that the best SVM classiﬁer is typically at least as accurate as the best of

any other type of classiﬁer, but obtaining the best SVM classiﬁer is not trivial. The

following procedure is recommended as a starting point:

1. Code data in the numerical format needed for SVM training.

2. Scale each attribute to have range 0 to 1, or −1 to +1, or to have mean zero

and unit variance.


3. Use cross-validation to ﬁnd the best value for C for a linear kernel.

4. Use cross-validation to ﬁnd the best values for C and γ for an RBF kernel.

5. Train on the entire available data using the parameter values found to be best

via cross-validation.

It is reasonable to start with C = 1 and γ = 1, and to try values that are smaller and
larger by factors of 2:

⟨C, γ⟩ ∈ {..., 2^−3, 2^−2, 2^−1, 1, 2, 2^2, 2^3, ...}^2.

This process is called a grid search; it is easy to parallelize of course. It is important

to try a wide enough range of values to be sure that more extreme values do not give

better performance. Once you have found the best values of C and γ to within a

factor of 2, doing a more ﬁne-grained grid search around these values is sometimes

beneﬁcial, but not always.
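The grid of candidate ⟨C, γ⟩ pairs described above can be generated as follows; the range of ±3 powers of two is illustrative, and in practice a wider range should be tried:

```python
def grid_values(k):
    """Powers of two from 2**-k to 2**k, centered on 1, as candidate
    values for C and gamma in a grid search."""
    return [2.0 ** e for e in range(-k, k + 1)]

# Cartesian product: every combination of C and gamma is a grid point.
grid = [(C, g) for C in grid_values(3) for g in grid_values(3)]
```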


Quiz question

(a) Draw the hinge loss function for the case where the true label y = 1. Label the

axes clearly.

(b) Explain where the derivative is (i) zero, (ii) constant but not zero, or (iii) not

deﬁned.

(c) For each of the three cases for the derivative, explain its intuitive implications

for training an SVM classiﬁer.

Quiz for April 20, 2010

Your name:

Suppose that you have trained SVM classiﬁers carefully for a given learning task.

You have selected the settings that yield the best linear classiﬁer and the best RBF

classiﬁer. It turns out that both classiﬁers perform equally well in terms of accuracy

for your task.

(a) Now you are given many times more training data. After ﬁnding optimal settings

again and retraining, do you expect the linear classiﬁer or the RBF classiﬁer to have

better accuracy? Explain very brieﬂy.

(b) Do you expect the optimal settings to involve stronger regularization, or weaker

regularization? Explain very brieﬂy.

CSE 291 Assignment

This assignment is due at the start of class on Tuesday April 20. As before, you

should work in a team of two, choosing a partner with a different background. You

may keep the same partner as before, or change partners.

This project uses data published by Kevin Hillstrom, a well-known data mining consultant. You can find the data at http://cseweb.ucsd.edu/users/elkan/250B/HillstromData.csv. For a detailed description, see http://minethatdata.com/blog/2008/03/minethatdata-e-mail-analytics-and-data.html.

For this assignment, use only the data for customers who are not sent any email

promotion. Your job is to train a good model to predict which customers visit the

retailer’s website. For now, you should ignore information about which customers

make a purchase, and how much they spend.

Build support vector machine (SVM) models to predict the target label as ac-

curately as possible. In the same general way as for linear regression, recode non-

numerical features as numerical, and transform features to improve their usefulness.

Train the best possible model using a linear kernel, and also the best possible model

using a radial basis function (RBF) kernel. The outcome should be the two most

accurate SVM classiﬁers that you can ﬁnd, without overﬁtting or underﬁtting.

Decide thoughtfully which measure of accuracy to use, and explain your choice

in your report. Use nested cross-validation carefully to ﬁnd the best settings for

training, and to evaluate the accuracy of the best classiﬁers as fairly as possible. In

particular, you should identify good values of the soft-margin C parameter for both

kernels, and of the width parameter for the RBF kernel.

For linear SVMs, the Rapidminer operator named FastLargeMargin is recom-

mended. Because it is fast, you can explore models based on a large number of

transformed features. Training nonlinear SVMs is much slower, but one can hope

that good performance can be achieved with fewer features.

As before, the deliverable is a well-organized, well-written, and well-formatted

report of about two pages. Describe what you did that worked, and your results. Ex-

plain any assumptions that you made, and any limitations on the validity or reliability

of your results. Explain carefully your nested cross-validation procedure.

Include a printout of your ﬁnal Rapidminer process, and a description of your

two ﬁnal models (not included in the two pages). Do not speculate about future

work, and do not explain ideas that do not work. Write in the present tense. Organize

your report logically, not chronologically.

Chapter 5

Classification with a rare class

In many data mining applications, the goal is to find needles in a haystack. That is, most examples are negative but a few examples are positive. The goal is to identify the rare positive examples, as accurately as possible. For example, most credit card transactions are legitimate, but a few are fraudulent. We have a standard binary classifier learning problem, but both the training and test sets are unbalanced. In a balanced set, the fraction of examples of each class is about the same. In an unbalanced set, some classes are rare while others are common.

5.1 Measuring performance

A major difficulty with unbalanced data is that accuracy is not a meaningful measure of performance. Suppose that 99% of credit card transactions are legitimate. Then we can get 99% accuracy by predicting trivially that every transaction is legitimate. On the other hand, suppose we somehow identify 5% of transactions for further investigation, and half of all fraudulent transactions occur among these 5%. Clearly the identification process is doing something worthwhile and not trivial. But its accuracy is only 95%.

For concreteness in further discussion, we will consider only the two-class case, and we will call the rare class positive. Rather than talk about fractions or percentages of a set, we will talk about actual numbers (also called counts) of examples. It turns out that thinking about actual numbers leads to less confusion and more insight than thinking about fractions. Suppose the test set has a certain total size n, say n = 1000. We can represent the performance of the trivial classifier as follows:

                        predicted
                    positive  negative
  truth  positive       0        10
         negative       0       990

The performance of the non-trivial classifier is

                        predicted
                    positive  negative
  truth  positive       5         5
         negative      45       945

A table like the ones above is called a 2×2 contingency table. Above, rows correspond to actual labels, while columns correspond to predicted labels. It would be equally valid to swap the rows and columns. Unfortunately, there is no standard convention about whether rows are actual or predicted. Remember that there is a universal convention that in notation like x_ij the first subscript refers to rows while the second subscript refers to columns.

A table like the ones above is also called a confusion matrix. For supervised learning with discrete predictions, only a confusion matrix gives complete information about the performance of a classifier. No single number that summarizes performance, for example accuracy, can provide a full picture of the usefulness of a classifier.

The four entries in a 2×2 contingency table have standard names. They are called true positives tp, false positives fp, true negatives tn, and false negatives fn, as follows:

                        predicted
                    positive  negative
  truth  positive      tp        fn
         negative      fp        tn

The terminology true positive, etc., is standard, but as mentioned above, whether columns correspond to predicted and rows to actual, or vice versa, is not standard.

As mentioned, the entries in a confusion matrix are counts, i.e. integers. The total of the four entries is tp + tn + fp + fn = n, the number of test examples. Depending on the application, different summaries are computed from these entries. In particular, accuracy a = (tp + tn)/n. Assuming that n is known, three of the counts in a confusion matrix can vary independently. This is the reason why no single number can describe completely the performance of a classifier. When writing a report, it is best to give the full confusion matrix explicitly, so that readers can calculate whatever performance measurements they are most interested in.

When accuracy is not meaningful, two summary measures that are commonly used are called precision and recall. They are defined as follows:

• precision p = tp/(tp + fp), and

• recall r = tp/(tp + fn).

The names "precision" and "recall" come from the field of information retrieval. In other research areas, recall is often called sensitivity, while precision is sometimes called positive predictive value.

Precision is undefined for a classifier that predicts that every test example is negative, that is when tp + fp = 0. Worse, precision can be misleadingly high for a classifier that predicts that only a few test examples are positive. Consider the following confusion matrix:

                        predicted
                    positive  negative
  truth  positive       1         9
         negative       0       990

Precision is 100% but 90% of actual positives are missed. F-measure is a widely used metric that overcomes this limitation of precision. It is the harmonic average of precision and recall:

    F = 2 / (1/p + 1/r) = 2pr / (p + r).

For the confusion matrix above, p = 1 and r = 0.1, so F = (2 · 1 · 0.1)/(1 + 0.1) ≈ 0.18.

Besides accuracy, precision, recall, and F-measure, many other summaries are also commonly computed from confusion matrices. Some of these are called specificity, false positive rate, false negative rate, positive and negative likelihood ratio, kappa coefficient, and more. Rather than rely on agreement and understanding of the definitions of these, it is preferable simply to report a full confusion matrix explicitly.
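The summaries discussed above follow directly from the four counts. A minimal sketch in Python, using the counts of the non-trivial classifier from the beginning of this section:

```python
# Counts from the non-trivial classifier's confusion matrix above.
tp, fn, fp, tn = 5, 5, 45, 945

n = tp + fn + fp + tn               # total number of test examples
accuracy = (tp + tn) / n
precision = tp / (tp + fp)          # undefined when tp + fp = 0
recall = tp / (tp + fn)             # also called sensitivity, or tpr
f_measure = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f_measure)
```

Note that accuracy is high (0.95) even though precision is only 0.1, which illustrates why no single summary tells the whole story.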

5.2 Thresholds and lift

A confusion matrix is always based on discrete predictions. Often, however, these predictions are obtained by thresholding a real-valued predicted score. For example, an SVM classifier yields a real-valued prediction f(x) which is then compared to the threshold zero to obtain a discrete yes/no prediction. Confusion matrices cannot represent information about the usefulness of underlying real-valued predictions. We shall return to this issue below, but first we shall consider the issue of selecting a threshold.

Setting the threshold determines the number tp + fp of examples that are predicted to be positive. In some scenarios, there is a natural threshold such as zero for an SVM. However, even when a natural threshold exists, it is possible to change the threshold to achieve a target number of positive predictions. This target number is often based on a so-called budget constraint. Suppose that all examples predicted to be positive are subjected to further investigation. An external limit on the resources available will determine how many examples can be investigated. This number is a natural target for the value fp + tp. Of course, we want to investigate those examples that are most likely to be actual positives, so we want to investigate the examples x with the highest prediction scores f(x). Therefore, the correct strategy is to choose a threshold t such that we investigate all examples x with f(x) ≥ t and the number of such examples is determined by the budget constraint.
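The strategy of deriving the threshold t from a budget can be sketched as follows; the scores and the budget below are made-up values for illustration:

```python
# Sketch: given real-valued scores f(x) and a budget of `budget`
# investigations, choose t so that the `budget` highest-scoring
# examples satisfy f(x) >= t. (With tied scores at t, slightly more
# than `budget` examples may pass.)
scores = [2.1, -0.3, 0.8, 1.5, -1.2, 0.4, 3.0, -0.7]
budget = 3

t = sorted(scores, reverse=True)[budget - 1]   # the budget-th largest score
investigate = [s for s in scores if s >= t]
print(t, investigate)
```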

Given that a fixed number of examples are predicted to be positive, a natural question is how good a classifier is at capturing the actual positives within this number. This question is answered by a measure called lift. The definition is a bit complex. First, let the fraction of examples predicted to be positive be x = (tp + fp)/n. Next, let the base rate of actual positive examples be b = (tp + fn)/n. Let the density of actual positives within the predicted positives be d = tp/(tp + fp). Now, the lift at x is defined to be the ratio d/b. Intuitively, a lift of 2 means that actual positives are twice as dense among the predicted positives as they are among all examples.

Lift can be expressed as

    d/b = (tp/(tp + fp)) / ((tp + fn)/n)
        = (tp · n) / ((tp + fp)(tp + fn))
        = (tp/(tp + fn)) · (n/(tp + fp))
        = recall / prediction rate.
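As a check on the algebra above, lift can be computed both ways from the counts of the non-trivial classifier used earlier in this chapter:

```python
# Counts from the non-trivial classifier's confusion matrix.
tp, fp, fn, tn = 5, 45, 5, 945
n = tp + fp + fn + tn

prediction_rate = (tp + fp) / n        # x in the text
base_rate = (tp + fn) / n              # b
density = tp / (tp + fp)               # d
lift = density / base_rate             # definition of lift at x

recall = tp / (tp + fn)
# The identity derived above: lift equals recall / prediction rate.
assert abs(lift - recall / prediction_rate) < 1e-12
print(lift)
```

Here actual positives are ten times as dense among the predicted positives as in the whole test set.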

Lift is a useful measure of success if the number of examples that should be predicted to be positive is determined by external considerations. However, budget constraints should normally be questioned. In the credit card scenario, perhaps too many transactions are being investigated and the marginal benefit of investigating some transactions is negative. Or, perhaps too few transactions are being investigated; there would be a net benefit if additional transactions were investigated. Making optimal decisions about how many examples should be predicted to be positive is discussed in the next chapter.

While a budget constraint is not normally a rational way of choosing a threshold for predictions, it is still more rational than choosing an arbitrary threshold. In particular, the threshold zero for an SVM classifier has some mathematical meaning but is not a rational guide for making decisions.

Figure 5.1: ROC curves for three alternative binary classifiers. Source: Wikipedia.

5.3 Ranking examples

Applying a threshold to a classifier with real-valued outputs loses information, because the distinction between examples on the same side of the threshold is lost. Moreover, at the time a classifier is trained and evaluated, it is often the case that the threshold to be used for decision-making is not known. Therefore, it is useful to compare different classifiers across the range of all possible thresholds.

Typically, it is not meaningful to use the same numerical threshold directly for different classifiers. However, it is meaningful to compare the recall achieved by different classifiers when the threshold for each one is set to make their false positive rates equal. This is what an ROC curve does.

Concretely, an ROC curve is a plot of the performance of a classifier, where the horizontal axis measures false positive rate (fpr) and the vertical axis measures true positive rate (tpr). (ROC stands for "receiver operating characteristic." This terminology originates in the theory of detection based on electromagnetic waves.) The rates are defined as

    fpr = fp / (fp + tn)        tpr = tp / (tp + fn).

Note that tpr is the same as recall and is sometimes also called "hit rate."

In an ROC plot, the ideal point is at the top left. One classifier uniformly dominates another if its curve is always above the other's curve. It happens often that the ROC curves of two classifiers cross, which implies that neither one dominates the other uniformly.

ROC plots are informative, but do not provide a quantitative measure of the performance of classifiers. A natural quantity to consider is the area under the ROC curve, often abbreviated AUC. The AUC is 0.5 for a classifier whose predictions are no better than random, while it is 1 for a perfect classifier. The AUC has an intuitive meaning: it can be proved that it equals the probability that the classifier will rank correctly two randomly chosen examples where one is positive and one is negative.

One reason why AUC is widely used is that, as shown by the probabilistic meaning just mentioned, it has built into it the implicit assumption that the rare class is more important than the common class.
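The probabilistic meaning of AUC suggests a direct way to compute it: count, over all positive-negative pairs, how often the positive example is scored higher, with ties counting half. This brute-force sketch is quadratic in the class sizes; the scores below are made up for illustration:

```python
# AUC as the probability that a random positive example is scored
# higher than a random negative one (ties count 0.5).
def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0]))
```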

5.4 Conditional probabilities

Reason 1 to want probabilities: predicting expectations.
Ideal versus achievable probability estimates.
Reason 2 to want probabilities: understanding predictive power.
Definition of well-calibrated probability.
Brier score.
Converting scores into probabilities.

5.5 Isotonic regression

Let f_i be prediction scores on a dataset, and let y_i ∈ {0, 1} be the corresponding true labels. Let f_j be the f_i values sorted from smallest to largest, and let y_j be the y_i values sorted in the same order. For each f_j we want to find an output value g_j such that the g_j values are monotonically increasing, and squared error relative to the y_j values is minimized. Formally, the optimization problem is

    min over g_1, ..., g_n of Σ_j (y_j − g_j)²   subject to g_j ≤ g_{j+1} for j = 1 to j = n − 1.

It is a remarkable fact that if squared error is minimized, then the resulting predictions are well-calibrated probabilities.

There is an elegant algorithm called "pool adjacent violators" (PAV) that solves this problem in linear time. The algorithm is as follows, where pooling a set means replacing each member of the set by the arithmetic mean of the set.

    Let g_j = y_j for all j
    Start with j = 1 and increase j until the first j such that g_j > g_{j+1}
    Pool g_j and g_{j+1}
    Move left: if g_{j−1} > g_j then pool g_{j−1} to g_{j+1}
    Continue to the left until monotonicity is satisfied
    Proceed to the right

Given a test example x, the procedure to predict a well-calibrated probability is as follows:

    Apply the classifier to obtain f(x)
    Find j such that f_j ≤ f(x) ≤ f_{j+1}
    The predicted probability is g_j.
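The PAV algorithm can be implemented with a stack of blocks, each holding a mean and a size; merging adjacent blocks while they violate monotonicity is equivalent to the left-and-right pooling described above, and runs in linear time. A minimal sketch, assuming the labels are already sorted by score:

```python
# Pool-adjacent-violators, stack-based variant. Input: 0/1 labels sorted
# by prediction score. Output: monotone values g_j minimizing squared error.
def pav(y_sorted):
    blocks = []                      # each block is [mean, size]
    for y in y_sorted:
        blocks.append([float(y), 1])
        # Merge while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, s2 = blocks.pop()
            m1, s1 = blocks.pop()
            s = s1 + s2
            blocks.append([(m1 * s1 + m2 * s2) / s, s])
    g = []
    for mean, size in blocks:
        g.extend([mean] * size)      # expand blocks back to one g per example
    return g

print(pav([0, 1, 0, 1, 1]))
```

Each output value is the mean of the labels in its pooled block, so the g_j are piecewise constant and nondecreasing, as the optimization problem requires.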

5.6 Univariate logistic regression

The disadvantage of isotonic regression is that it creates a lookup table for converting scores into estimated probabilities. An alternative is to use a parametric model. The most common model is called univariate logistic regression. The model is

    log (p / (1 − p)) = a + b·f

where f is a prediction score and p is the corresponding estimated probability.

The equation above shows that the logistic regression model is essentially a linear model with intercept a and coefficient b. An equivalent way of writing the model is

    p = 1 / (1 + e^{−(a + b·f)}).

As above, let f_i be prediction scores on a training set, and let y_i ∈ {0, 1} be the corresponding true labels. The parameters a and b are chosen to minimize the total loss

    Σ_i l(1 / (1 + e^{−(a + b·f_i)}), y_i).

The precise loss function l that is typically used in the optimization is called conditional log likelihood (CLL) and is explained in Chapter ??? below. However, one could also use squared error, which would be consistent with isotonic regression.
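A minimal sketch of fitting a and b by gradient descent on the negative conditional log likelihood; the scores, labels, learning rate, and iteration count below are arbitrary illustrative choices, not from the text:

```python
import math

# Fit a and b for p = 1 / (1 + exp(-(a + b*f))) by minimizing -CLL.
def fit_univariate_logistic(f, y, lr=0.1, steps=5000):
    a, b = 0.0, 0.0
    for _ in range(steps):
        ga = gb = 0.0
        for fi, yi in zip(f, y):
            p = 1.0 / (1.0 + math.exp(-(a + b * fi)))
            ga += p - yi           # gradient of -CLL w.r.t. a
            gb += (p - yi) * fi    # gradient of -CLL w.r.t. b
        a -= lr * ga / len(f)
        b -= lr * gb / len(f)
    return a, b

f = [-2.0, -1.0, 1.0, 2.0]   # made-up classifier scores
y = [0, 0, 1, 1]             # made-up true labels
a, b = fit_univariate_logistic(f, y)
p = 1.0 / (1.0 + math.exp(-(a + b * 0.0)))
print(a, b, p)
```

By the symmetry of this made-up data, the fitted intercept stays near zero and a score of 0 maps to an estimated probability near 0.5.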

Quiz

All questions below refer to a classifier learning task with two classes, where the base rate for the positive class y = 1 is 5%.

(a) Suppose that a probabilistic classifier predicts p(y = 1|x) = c for some constant c, for all test examples x. Explain why c = 0.05 is the best value for c.

The value 0.05 is the only constant that is a well-calibrated probability. "Well-calibrated" means that c = 0.05 equals the average frequency of positive examples in sets of examples with predicted score c.

(b) What is the error rate of the classifier from part (a)? What is its MSE?

With a prediction threshold of 0.5, all examples are predicted to be negative, so the error rate is 5%. The MSE is

    0.05 · (1 − 0.05)² + 0.95 · (0 − 0.05)² = 0.05 · 0.95 · 1

which equals 0.0475.

(c) Suppose a well-calibrated probabilistic classifier satisfies a ≤ p(y = 1|x) ≤ b for all x. What is the maximum possible lift at 10% of this classifier?

If the upper bound b is a well-calibrated probability, then the fraction of positive examples among the highest-scoring examples is at most b. Hence, the lift is at most b/0.05 where 0.05 is the base rate. Note that this is the highest possible lift at any percentage, not just at 10%.

2009 Assignment

This assignment is due at the start of class on Tuesday April 28, 2009. As before, you should work in a team of two. You may change partners, or keep the same partner.

Like previous assignments, this one uses the KDD98 training set. However, you should now use the entire dataset. (Revised.) The goal is to train a classifier with real-valued outputs that identifies test examples with TARGET_B = 1. Specifically, the measure of success to optimize is lift at 10%. That is, as many positive test examples as possible should be among the 10% of test examples with highest prediction score.

You should use logistic regression first, because this is a fast and reliable method for training probabilistic classifiers. If you are using Rapidminer, then use the logistic regression option of the FastLargeMargin operator. As before, recode and transform features to improve their usefulness. Do feature selection to reduce the size of the training set as much as is reasonably possible.

Next, you should apply a different learning method to the same training set that you developed using logistic regression. The objective is to see whether this other method can perform better than logistic regression, using the same data coded in the same way. The second learning method can be a support vector machine, a neural network, or a decision tree, for example. Apply cross-validation to find good algorithm settings.

(Deleted: When necessary, use a postprocessing method (isotonic regression or logistic regression) to obtain calibrated estimates of conditional probabilities. Investigate the probability estimates produced by your two methods. What are the minimum, mean, and maximum predicted probabilities? Discuss whether these are reasonable.)

Assignment

This assignment is due at the start of class on Tuesday April 27, 2010. As before, you should work in a team of two. You may change partners, or keep the same partner.

Like the previous assignment, this one uses the e-commerce data published by Kevin Hillstrom. However, the goal now is to predict who makes a purchase on the website (again, for customers who are not sent any promotion). This is a highly unbalanced classification task.

First, use logistic regression. If you are using Rapidminer, then use the logistic regression option of the FastLargeMargin operator. As before, recode and transform features to improve their usefulness. Investigate the probability estimates produced by your two methods. What are the minimum, mean, and maximum predicted probabilities? Discuss whether these are reasonable.

Second, use your creativity to get the best possible performance in predicting who makes a purchase. The measure of success to optimize is lift at 25%. That is, as many positive test examples as possible should be among the 25% of test examples with highest prediction score. You may apply any learning algorithms that you like. Can any method achieve better accuracy than logistic regression?

For predicting who makes a purchase, compare learning a classifier directly with learning two classifiers, the first to predict who visits the website and the second to predict which visitors make purchases. Note that mathematically

    p(buy | x) = p(buy | x, visit) p(visit | x).

As before, be careful not to fool yourself about the success of your methods.

Quiz for April 27, 2010

Your name:

The current assignment is based on the equation

    p(buy = 1 | x) = p(buy = 1 | x, visit = 1) p(visit = 1 | x).

Explain clearly but briefly why this equation is true.

5.7 Pitfalls of link prediction

This section was originally an assignment. The goal of writing it as an assignment was to help develop three important abilities. The first ability is critical thinking, i.e. the skill of identifying what is most important and then identifying what may be incorrect. The second ability is understanding a new application domain quickly, in this case a task in computational biology. The third ability is presenting opinions in a persuasive way. Students were asked to explain arguments and conclusions crisply, concisely, and clearly.

Assignment

The paper you should read is Predicting protein-protein interactions from primary structure by Joel Bock and David Gough, published in the journal Bioinformatics in 2001. The full text of this paper in PDF is supposed to be available free. If you have difficulty obtaining it, please post on the class message board.

You should figure out and describe three major flaws in the paper. The flaws concern

• how the dataset is constructed,

• how each example is represented, and

• how performance is measured and reported.

Each of the three mistakes is serious. The paper has 248 citations according to Google Scholar as of May 26, 2009, but unfortunately each flaw by itself makes the results of the paper not useful as a basis for future research. Each mistake is described sufficiently clearly in the paper: it is a sin of commission, not a sin of omission.

The second mistake, how each example is represented, is the most subtle, but at least one of the papers citing this paper does explain it clearly. It is connected with how SVMs are applied here. Remember the slogan: "If you cannot represent it then you cannot learn it."

Separately, provide a brief critique of the four benefits claimed for SVMs in the section of the paper entitled Support vector machine learning. Are these benefits true? Are they unique to SVMs? Does the work described in this paper take advantage of them?

Sample answers

Here is a brief summary of what I see as the most important flaws of the paper Predicting protein-protein interactions from primary structure.

(1) How the dataset is constructed. The problem here is that the negative examples are not pairs of genuine proteins. Instead, they are pairs of randomly generated amino acid sequences. It is quite possible that these artificial sequences could not fold into actual proteins at all. The classifiers reported in this paper may learn mainly to distinguish between real proteins and non-proteins.

The authors acknowledge this concern, but do not overcome it. They could have used pairs of genuine proteins as negative examples. It is true that one cannot be sure that any given pair really is non-interacting. However, the great majority of pairs do not interact. Moreover, if a negative example really is an interaction, that will presumably slightly reduce the apparent accuracy of a trained classifier, but not change overall conclusions.

(2) How each example is represented. This is a subtle but clear-cut and important issue, assuming that the research uses a linear classifier.

Let x_1 and x_2 be two proteins and let f(x_1) and f(x_2) be their representations as vectors in R^d. The pair of proteins is represented as the concatenated vector ⟨f(x_1) f(x_2)⟩ ∈ R^{2d}. Suppose a trained linear SVM has parameter vector w. By definition w ∈ R^{2d} also. (If there is a bias coefficient, so w ∈ R^{2d+1}, the conclusion is the same.)

Now suppose the first protein x_1 is fixed and consider varying the second protein x_2. Proteins x_2 will be ranked according to the numerical value of the dot product w · ⟨f(x_1) f(x_2)⟩. This is equal to w_1 · f(x_1) + w_2 · f(x_2) where the vector w is written as ⟨w_1 w_2⟩. If x_1 is fixed, then the first term is constant and the second term w_2 · f(x_2) determines the ranking. The ranking of x_2 proteins will be the same regardless of what the x_1 protein is. This fundamental drawback of linear classifiers for predicting interactions is pointed out in [Vert and Jacob, 2008, Section 5].

With a concatenated representation of protein pairs, a linear classifier can at best learn the propensity of individual proteins to interact. Such a classifier cannot represent patterns that depend on features that are true only for specific protein pairs. This is the relevance of the slogan "If you cannot represent it then you cannot learn it."

Note: Previously I was under the impression that the authors stated that they used a linear kernel. On rereading the paper, it fails to mention at all what kernel they use. If the research uses a linear kernel, then the argument above is applicable.
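The argument can be illustrated numerically. In this made-up sketch, a fixed linear weight vector scores concatenated pairs, and two very different first proteins induce exactly the same ranking of candidate partners:

```python
# A linear score on a concatenated pair ranks candidate x2 vectors
# identically for every fixed x1, because the x1 term is a constant
# offset. All vectors here are invented for illustration.
w1 = [0.5, -1.0]          # weights on f(x1)
w2 = [2.0, 0.3]           # weights on f(x2)

def score(fx1, fx2):
    return (sum(a * b for a, b in zip(w1, fx1))
            + sum(a * b for a, b in zip(w2, fx2)))

candidates = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # f(x2) vectors

def ranking(fx1):
    return sorted(range(len(candidates)),
                  key=lambda i: score(fx1, candidates[i]),
                  reverse=True)

# Two different first proteins yield the same ranking of partners.
print(ranking([3.0, 3.0]), ranking([-7.0, 2.0]))
```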

(3) How performance is measured and reported. Most pairs of proteins are non-interacting. It is reasonable to use training sets where negative examples (non-interacting pairs) are undersampled. However, it is not reasonable or correct to report performance (accuracy, precision, recall, etc.) on test sets where negative examples are under-represented, which is what is done in this paper.

As for the four claimed advantages of SVMs:

1. SVMs are nonlinear while requiring "relatively few" parameters.

The authors do not explain what kernel they use. In any case, with a nonlinear kernel the number of parameters is the number of support vectors, which is often close to the number of training examples. It is not clear relative to what this can be considered "few."

2. SVMs have an analytic upper bound on generalization error.

This upper bound does motivate the SVM approach to training classifiers, but it typically does not provide useful guarantees for specific training sets. In any case, a bound of this type has nothing to do with assigning confidences to individual predictions. In practice overfitting is prevented by straightforward search for the value of the C parameter that is empirically best, not by applying a theorem.

3. SVMs have fast training, which is essential for screening large datasets.

SVM training is slow compared to many other classifier learning methods, except for linear classifiers trained by fast algorithms that were only published after 2001, when this paper was published. As mentioned above, a linear classifier is not appropriate with the representation of protein pairs used in this paper.

In any case, what is needed for screening many test examples is fast classifier application, not fast training. Applying a linear classifier is fast, whether it is an SVM or not. Applying a nonlinear SVM typically has the same order-of-magnitude time complexity as applying a nearest-neighbor classifier, which is the slowest type of classifier in common use.

4. SVMs can be continuously updated in response to new data.

At least one algorithm is known for updating an SVM given a new training example, but it is not cited in this paper. I do not know any algorithm for training an optimal new SVM efficiently, that is without retraining on old data. In any case, new real-world protein data arrives slowly enough that retraining from scratch is feasible.

Chapter 6

Detecting overfitting: cross-validation

6.1 Cross-validation

Usually we have a fixed database of labeled examples available, and we are faced with a dilemma: we would like to use all the examples for training, but we would also like to use many examples as an independent test set. Cross-validation is a procedure for overcoming this dilemma. It is the following algorithm.

Input: training set S, integer constant k
Procedure:
    partition S into k disjoint equal-sized subsets S_1, ..., S_k
    for i = 1 to i = k
        let T = S \ S_i
        run learning algorithm with T as training set
        test the resulting classifier on S_i obtaining tp_i, fp_i, tn_i, fn_i
    compute tp = Σ_i tp_i, fp = Σ_i fp_i, tn = Σ_i tn_i, fn = Σ_i fn_i

The output of cross-validation is a confusion matrix based on using each labeled example as a test example exactly once. Whenever an example is used for testing a classifier, it has not been used for training that classifier. Hence, the confusion matrix obtained by cross-validation is a fair indicator of the performance of the learning algorithm on independent test examples.
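The algorithm above can be sketched as follows. The train and predict functions here are a made-up majority-class "learner" so that the example runs standalone; any real train/predict pair could be substituted:

```python
# k-fold cross-validation that pools one confusion matrix over the folds.
def cross_validate(examples, labels, k, train, predict):
    tp = fp = tn = fn = 0
    # Deterministic folds for illustration: every k-th example.
    folds = [list(range(i, len(examples), k)) for i in range(k)]
    for fold in folds:
        train_idx = [i for i in range(len(examples)) if i not in fold]
        model = train([examples[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        for i in fold:
            pred = predict(model, examples[i])
            if labels[i] == 1:
                tp += pred == 1
                fn += pred == 0
            else:
                fp += pred == 1
                tn += pred == 0
    return tp, fp, tn, fn

# Made-up learner: always predict the most common training label.
train = lambda xs, ys: int(sum(ys) * 2 >= len(ys))
predict = lambda model, x: model

xs = list(range(10))
ys = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # rare positive class
print(cross_validate(xs, ys, k=5, train=train, predict=predict))
```

Note that the four pooled counts sum to n, so the output is a legitimate confusion matrix for the whole labeled set.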

If n labeled examples are available, the largest possible number of folds is k = n. This special case is called leave-one-out cross-validation (LOOCV). However, the time complexity of cross-validation is k times that of running the training algorithm once, so often LOOCV is computationally infeasible. In recent research the most common choice for k is 10.

Note that cross-validation does not produce any single final classifier, and the confusion matrix it provides is not the performance of any specific single classifier. Instead, this matrix is an estimate of the average performance of a classifier learned from a training set of size (k − 1)n/k where n is the size of S. The common procedure is to create a final classifier by training on all of S, and then to use the confusion matrix obtained from cross-validation as an informal estimate of the performance of this classifier. This estimate is likely to be conservative in the sense that the final classifier may have slightly better performance since it is based on a slightly larger training set.

Suppose that the time to train a classifier is proportional to the number of training examples, and the time to make predictions on test examples is negligible in comparison. The time required for k-fold cross-validation is then O(k · ((k − 1)/k) · n) = O((k − 1)n). Three-fold cross-validation is therefore twice as time-consuming as two-fold. This suggests that for preliminary experiments, two-fold is a good choice.

The results of cross-validation can be misleading. For example, if each example is duplicated in the training set and we use a nearest-neighbor classifier, then LOOCV will show a zero error rate. Cross-validation with other values of k will also yield misleadingly low error estimates.

Subsampling means omitting examples from some classes. Subsampling the common class is a standard approach to learning from unbalanced data, i.e. data where some classes are very common while others are very rare. It is reasonable to do subsampling in the training folds of cross-validation, but not in the test fold. Reported performance numbers should always be based on a set of test examples that is directly typical of a genuine test population. It is a not uncommon mistake with cross-validation to do subsampling on the test set.

6.2 Nested cross-validation

A model selection procedure is a method for choosing the best learning algorithm from a set of alternatives. Often, the alternatives are all the same algorithm but with different parameter settings, for example different C and γ values. Another common case is where the alternatives are different subsets of features. Typically, the number of alternatives is finite and the only way to evaluate an alternative is to run it explicitly on a training set. So, how should we do model selection?

A simple way to apply cross-validation for model selection is the following:

Input: dataset S, integer k, set V of alternative algorithm settings
Procedure:
    partition S randomly into k disjoint subsets S_1, ..., S_k of equal size
    for each setting v in V
        for i = 1 to i = k
            let T = S \ S_i
            run the learning algorithm with setting v and T as training set
            apply the trained model to S_i obtaining performance e_i
        end for
        let M(v) be the average of e_i
    end for
    select v̂ = argmax_v M(v)

The output of this model selection procedure is v̂. The input set V of alternative settings can be a grid of parameter values.

But, any procedure for selecting parameter values is itself part of the learning algorithm. It is crucial to understand this point. The setting v̂ is chosen to maximize M(v), so M(v̂) is not a fair estimate of the performance to be expected from v̂ on future data. Stated another way, v̂ is chosen to optimize performance on all of S, so v̂ is likely to overfit S.

Notice that the division of S into subsets happens just once, and the same division is used for all settings v. This choice reduces the random variability in the evaluation of different v. A new partition of S could be created for each setting v, but this would not overcome the issue that v̂ is chosen to optimize performance on S.

What should we do about the fact that any procedure for selecting parameter

values is itself part of the learning algorithm? One answer is that this procedure

should itself be evaluated using cross-validation. This process is called nested cross-

validation, because one cross-validation is run inside another.

Speciﬁcally, nested cross-validation is the following process:

Input: dataset S, integers k and k′, set V of alternative algorithm settings

Procedure:
    partition S randomly into k disjoint subsets S_1, ..., S_k of equal size
    for i = 1 to k
        let T = S \ S_i
        partition T randomly into k′ disjoint subsets T_1, ..., T_{k′} of equal size
        for each setting v in V
            for j = 1 to k′
                let U = T \ T_j
                run the learning algorithm with setting v and U as training set
                apply the trained model to T_j, obtaining performance e_j
            end for
            let M(v) be the average of e_j
        end for
        select v̂ = argmax_v M(v)
        run the learning algorithm with setting v̂ and T as training set
        apply the trained model to S_i, obtaining performance e_i
    end for
    report the average of e_i

Now, the final reported average of the e_i is the estimated performance of the classifier obtained by running the same model selection procedure (i.e. the search over each setting v) on the whole dataset S.

Some notes:

1. Trying every setting in V explicitly can be prohibitive. It may be preferable

to search in V in a more clever way, using a genetic algorithm or some other

heuristic method. As mentioned above, the search for a good member of V

is itself part of the learning algorithm, so it can certainly be intelligent and/or

heuristic.

2. Above, the same partition of T is used for each setting v. This reduces ran-

domness a little bit compared to using a different partition for each v, but the

latter would be correct also.
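The nested procedure can be sketched compactly as follows. This is an illustrative translation of the pseudocode, not a reference implementation; the name `nested_cv` and the callback signatures `train_fn(v, T)` and `score_fn(model, fold)` are invented for this sketch, and the inner fold count k′ is written `k2`.

```python
import random

def nested_cv(S, k, k2, V, train_fn, score_fn, seed=0):
    """Nested cross-validation: the outer loop estimates the performance of
    the whole model-selection procedure; the inner loop chooses v-hat using
    each outer training set T alone."""
    rng = random.Random(seed)
    data = S[:]
    rng.shuffle(data)
    outer = [data[i::k] for i in range(k)]
    outer_scores = []
    for i in range(k):
        T = [ex for j, f in enumerate(outer) if j != i for ex in f]
        inner = [T[j::k2] for j in range(k2)]

        def inner_M(v):
            # inner cross-validation score M(v), computed on T only
            perf = []
            for j in range(k2):
                U = [ex for m, f in enumerate(inner) if m != j for ex in f]
                perf.append(score_fn(train_fn(v, U), inner[j]))
            return sum(perf) / k2

        v_hat = max(V, key=inner_M)
        # retrain on all of T with v-hat, then evaluate once on the held-out fold
        outer_scores.append(score_fn(train_fn(v_hat, T), outer[i]))
    return sum(outer_scores) / k
```

The key point the code makes visible: the held-out fold `outer[i]` is never consulted while choosing `v_hat`.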

Quiz for April 13, 2010

Your name:

The following cross-validation procedure is intended to ﬁnd the best regularization

parameter R for linear regression, and to report a fair estimate of the RMSE to be

expected on future test data.

Input: dataset S, integer k, set V of alternative R values

Procedure:
    partition S randomly into k disjoint subsets S_1, ..., S_k of equal size
    for each R value in V
        for i = 1 to k
            let T = S \ S_i
            train linear regression with parameter R and T as training set
            apply the trained regression equation to S_i, obtaining SSE e_i
        end for
        compute M = Σ_i e_i
        report R and RMSE = sqrt(M / |S|)
    end for

(a) Suppose that you choose the R value for which the reported RMSE is lowest.

Explain why this method is likely to be overoptimistic, as an estimate of the RMSE

to be expected on future data.

Each R value is being evaluated on the entire set S. The value that seems to be

best is likely overﬁtting this set.

(b) Very brieﬂy, suggest an improved variation of the method above.

The procedure to select a value for R is part of the learning algorithm. This whole procedure should be evaluated using a separate test set, or via cross-validation.

Additional notes: The procedure to select a value for R uses cross-validation, so

if this procedure is evaluated itself using cross-validation, then the entire process is

nested cross-validation.

Incorrect answers include the following:


• “The partition of S should be stratified.” No; first of all, we are doing regression, so stratification is not well-defined, and second, failing to stratify increases variability but does not cause overfitting.

• “The partition of S should be done separately for each R value, not just once.”

No; a different partition for each R value might increase the variability in the

evaluation of each R, but it would not change the fact that the best R is being

selected according to its performance on all of S.

Two basic points to remember are that it is never fair to evaluate a learning method on

its training set, and that any search for settings for a learning algorithm (e.g. search

for a subset of features, or for algorithmic parameters) is part of the learning method.

Chapter 7

Making optimal decisions

This chapter discusses making optimal decisions based on predictions, and maximiz-

ing the value of customers.

7.1 Predictions, decisions, and costs

Decisions and predictions are conceptually very different. For example, a prediction

concerning a patient may be “allergic” or “not allergic” to aspirin, while the corre-

sponding decision is whether or not to administer the drug. Predictions can often be

probabilistic, while decisions typically cannot.

Suppose that examples are credit card transactions and the label y = 1 designates

a legitimate transaction. Then making the decision y = 1 for an attempted transaction

means acting as if the transaction is legitimate, i.e. approving the transaction. The

essence of cost-sensitive decision-making is that it can be optimal to act as if one

class is true even when some other class is more probable. For example, if the cost

of approving a fraudulent transaction is proportional to the dollar amount involved,

then it can be rational not to approve a large transaction, even if the transaction is

most likely legitimate. Conversely, it can be rational to approve a small transaction

even if there is a high probability it is fraudulent.

Mathematically, let i be the predicted class and let j be the true class. If i = j then the prediction is correct, while if i ≠ j the prediction is incorrect. The (i, j) entry in a cost matrix c is the cost of acting as if class i is true, when in fact class j is true. Here, predicting i means acting as if i is true, so one could equally well call this deciding i.

A cost matrix c has the following structure when there are only two classes:

                      actual negative    actual positive
    predict negative  c(0,0) = c_00      c(0,1) = c_01
    predict positive  c(1,0) = c_10      c(1,1) = c_11

The cost of a false positive is c_10 while the cost of a false negative is c_01. We follow the convention that cost matrix rows correspond to alternative predicted classes, while columns correspond to actual classes. In short the convention is row/column = i/j = predicted/actual. (This convention is the opposite of the one in Section 5.1, so perhaps we should switch one of these to make the conventions similar.)

The optimal prediction for an example x is the class i that minimizes the expected cost

    e(x, i) = Σ_j p(j|x) c(i, j).        (7.1)

For each i, e(x, i) is an expectation computed by summing over the alternative possibilities for the true class of x. In this framework, the role of a learning algorithm is to produce a classifier that for any example x can estimate the probability p(j|x) of each class j being the true class of x.
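Equation (7.1) amounts to one dot product per row of the cost matrix, followed by an argmin. A minimal sketch (the function name `optimal_prediction` is invented here; `probs[j]` plays the role of p(j|x) and `cost[i][j]` the role of c(i, j)):

```python
def optimal_prediction(probs, cost):
    """Return the class i minimizing the expected cost
    e(x, i) = sum_j p(j | x) * c(i, j).
    probs[j] = p(j | x); cost[i][j] = cost of predicting i when truth is j."""
    n = len(cost)
    expected = [sum(probs[j] * cost[i][j] for j in range(len(probs)))
                for i in range(n)]
    return min(range(n), key=lambda i: expected[i])
```

With cost matrix [[0, 10], [1, 0]], a 5% chance of class 1 still gives expected costs 0.5 versus 0.95, so predicting class 0 is optimal, illustrating that the most probable class and the cheapest decision can differ.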

7.2 Cost matrix properties

Conceptually, the cost of labeling an example incorrectly should always be greater than the cost of labeling it correctly. For a 2x2 cost matrix, this means that it should always be the case that c_10 > c_00 and c_01 > c_11. We call these conditions the “reasonableness” conditions.

Suppose that the first reasonableness condition is violated, so c_00 ≥ c_10 but still c_01 > c_11. In this case the optimal policy is to label all examples positive. Similarly, if c_10 > c_00 but c_11 ≥ c_01 then it is optimal to label all examples negative. (The reader can analyze the case where both reasonableness conditions are violated.)

For some cost matrices, some class labels are never predicted by the optimal

policy given by Equation (7.1). The following is a criterion for when this happens.

Say that row m dominates row n in a cost matrix C if for all j, c(m, j) ≥ c(n, j).

In this case the cost of predicting n is no greater than the cost of predicting m,

regardless of what the true class j is. So it is optimal never to predict m. As a special

case, the optimal prediction is always n if row n is dominated by all other rows in

a cost matrix. The two reasonableness conditions for a two-class cost matrix imply

that neither row in the matrix dominates the other.

Given a cost matrix, the decisions that are optimal are unchanged if each entry in

the matrix is multiplied by a positive constant. This scaling corresponds to changing

the unit of account for costs. Similarly, the decisions that are optimal are unchanged


if a constant is added to each entry in the matrix. This shifting corresponds to chang-

ing the baseline away from which costs are measured. By scaling and shifting entries,

any two-class cost matrix

    c_00  c_01
    c_10  c_11

that satisfies the reasonableness conditions can be transformed into a simpler matrix that always leads to the same decisions:

    0    c'_01
    1    c'_11

where c'_01 = (c_01 − c_00)/(c_10 − c_00) and c'_11 = (c_11 − c_00)/(c_10 − c_00). From a matrix perspective, a 2x2 cost matrix effectively has two degrees of freedom.
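The invariance under scaling and shifting is easy to check numerically, since multiplying every entry by a > 0 and adding b maps each expected cost E to aE + b and therefore preserves the argmin. A small sketch (function names `best_action` and `transform` are invented here):

```python
def best_action(probs, cost):
    """Index of the row of `cost` with minimum expected cost under `probs`."""
    exp = [sum(p * c for p, c in zip(probs, row)) for row in cost]
    return exp.index(min(exp))

def transform(cost, a, b):
    """Scale every entry by a > 0 and shift by b; optimal decisions are
    unchanged because expected costs become a*E + b for every action."""
    return [[a * c + b for c in row] for row in cost]
```

Looping over several values of p(y=1|x) confirms that the original and transformed matrices always select the same action.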

7.3 The logic of costs

Costs are not necessarily monetary. A cost can also be a waste of time, or the severity

of an illness, for example. Although most recent research in machine learning has

used the terminology of costs, doing accounting in terms of beneﬁts is generally

preferable, because avoiding mistakes is easier, since there is a natural baseline from

which to measure all beneﬁts, whether positive or negative. This baseline is the state

of the agent before it takes a decision regarding an example. After the agent has

made the decision, if it is better off, its beneﬁt is positive. Otherwise, its beneﬁt is

negative.

When thinking in terms of costs, it is easy to posit a cost matrix that is logi-

cally contradictory because not all entries in the matrix are measured from the same

baseline. For example, consider the so-called German credit dataset that was pub-

lished as part of the Statlog project [Michie et al., 1994]. The cost matrix given with

this dataset at http://www.sgi.com/tech/mlc/db/german.names is as

follows:

                  actual bad    actual good
    predict bad        0             1
    predict good       5             0

Here examples are people who apply for a loan from a bank. “Actual good” means

that a customer would repay a loan while “actual bad” means that the customer would

default. The action associated with “predict bad” is to deny the loan. Hence, the cash flow relative to any baseline associated with this prediction is the same regardless of whether “actual good” or “actual bad” is true. In every economically reasonable

cost matrix for this domain, both entries in the “predict bad” row must be the same.

If these entries are different, it is because different baselines have been chosen for

each entry.

Costs or beneﬁts can be measured against any baseline, but the baseline must be

ﬁxed. An opportunity cost is a foregone beneﬁt, i.e. a missed opportunity rather than

an actual penalty. It is easy to make the mistake of measuring different opportunity

costs against different baselines. For example, the erroneous cost matrix above can

be justiﬁed informally as follows: “The cost of approving a good customer is zero,

and the cost of rejecting a bad customer is zero, because in both cases the correct

decision has been made. If a good customer is rejected, the cost is an opportunity

cost, the foregone proﬁt of 1. If a bad customer is approved for a loan, the cost is the

lost loan principal of 5.”

To see concretely that the reasoning in quotes above is incorrect, suppose that the

bank has one customer of each of the four types. Clearly the cost matrix above is

intended to imply that the net change in the assets of the bank is then −4. Alterna-

tively, suppose that we have four customers who receive loans and repay them. The

net change in assets is then +4. Regardless of the baseline, any method of accounting

should give a difference of 8 between these scenarios. But with the erroneous cost

matrix above, the ﬁrst scenario gives a total cost of 6, while the second scenario gives

a total cost of 0.

In general the amount in some cells of a cost or beneﬁt matrix may not be con-

stant, and may be different for different examples. For example, consider the credit

card transactions domain. Here the beneﬁt matrix might be

               fraudulent    legitimate
    refuse        $20          −$20
    approve       −x           0.02x

where x is the size of the transaction in dollars. Approving a fraudulent transaction

costs the amount of the transaction because the bank is liable for the expenses of

fraud. Refusing a legitimate transaction has a non-trivial cost because it annoys a

customer. Refusing a fraudulent transaction has a non-trivial beneﬁt because it may

prevent further fraud and lead to the arrest of a criminal.
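With an example-dependent benefit matrix like this one, the decision rule compares the expected benefit of each action for the particular transaction. A minimal sketch using the matrix above (the function name `approve_transaction` and the parameter names are invented; the default entries are the ones in the text):

```python
def approve_transaction(p_fraud, x, refuse_fraud=20.0, refuse_legit=-20.0,
                        fee_rate=0.02):
    """Decide whether to approve a card transaction of x dollars by comparing
    expected benefits. Approving fraud costs the full amount x, approving a
    legitimate transaction earns fee_rate * x; refusing has fixed benefits."""
    refuse = p_fraud * refuse_fraud + (1 - p_fraud) * refuse_legit
    approve = p_fraud * (-x) + (1 - p_fraud) * fee_rate * x
    return approve >= refuse
```

Because x enters the matrix, the same fraud probability can lead to approving a $10 transaction but refusing a $1000 one.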

7.4 Making optimal decisions

Given known costs for correct and incorrect predictions, a test example should be

predicted to have the class that leads to the lowest expected cost. This expectation is


the predicted average cost, and is computed using the conditional probability of each

class given the example.

In the two-class case, the optimal prediction is class 1 if and only if the expected cost of this prediction is less than or equal to the expected cost of predicting class 0, i.e. if and only if

    p(y=0|x) c_10 + p(y=1|x) c_11 ≤ p(y=0|x) c_00 + p(y=1|x) c_01

which is equivalent to

    (1 − p) c_10 + p c_11 ≤ (1 − p) c_00 + p c_01

given p = p(y=1|x). If this inequality is in fact an equality, then predicting either class is optimal.

The threshold for making optimal decisions is p* such that

    (1 − p*) c_10 + p* c_11 = (1 − p*) c_00 + p* c_01.

Assuming the reasonableness conditions c_10 > c_00 and c_01 ≥ c_11, the optimal prediction is class 1 if and only if p ≥ p*. Rearranging the equation for p* leads to

    c_00 − c_10 = −p* c_10 + p* c_11 + p* c_00 − p* c_01

which has the solution

    p* = (c_10 − c_00) / (c_10 − c_00 + c_01 − c_11)

assuming the denominator is nonzero, which is implied by the reasonableness conditions. This formula for p*

shows that any 2x2 cost matrix has essentially only one

degree of freedom from a decision-making perspective, although it has two degrees

of freedom from a matrix perspective. The cause of the apparent contradiction is that

the optimal decision-making policy is a nonlinear function of the cost matrix.
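The threshold formula is short enough to state as code, together with a check that expected costs really are equal at p*. A sketch (the function name `p_star` is invented here):

```python
def p_star(c00, c01, c10, c11):
    """Decision threshold for the positive class: predict class 1 when
    p(y=1 | x) >= p*.  Assumes the reasonableness conditions c10 > c00 and
    c01 > c11, which make the denominator positive."""
    return (c10 - c00) / (c10 - c00 + c01 - c11)
```

For ordinary 0/1 loss (c00 = c11 = 0, c01 = c10 = 1) the threshold is 0.5, recovering the usual "predict the more probable class" rule.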

Note that in some domains costs have more impact than probabilities on decision-making. An extreme case is that with some cost matrices the optimal decision is always the same, regardless of what the various outcome probabilities are for a given example x.¹ Decisions and predictions may be not in 1:1 correspondence.

¹The situation where the optimal decision is fixed is well-known in philosophy as Pascal’s wager. Essentially, this is the case where the minimax decision and the minimum expected cost decision are always the same. An argument analogous to Pascal’s wager provides a justification for Occam’s razor in machine learning. Essentially, a PAC theorem says that if we have a fixed, limited number of training examples, and the true concept is simple, then if we pick a simple concept then we will be right. However if we pick a complex concept, there is no guarantee we will be right. The benefit matrix is

                       nature simple    nature complex
    predict simple        succeed           fail
    predict complex        fail             fail

Here “simple” means that the hypothesis is drawn from a space with low cardinality, while “complex” means it is drawn from a space with high cardinality.


7.5 Limitations of cost-based analysis

The analysis of decision-making above assumes that the objective is to minimize

expected cost. This objective is reasonable when a game is played repeatedly, and

the actual class of examples is set stochastically.

The analysis may not be appropriate when the agent is risk-averse and can make only a few decisions. It is also not appropriate when the actual class of examples is set by a non-random adversary. Further complications arise in practice: costs may need to be estimated via learning; we may need to make repeated decisions over time about the same example; decisions about one example may influence other examples; and when there are more than two classes, we do not get full label information on training examples.

7.6 Rules of thumb for evaluating data mining campaigns

Let q be a fraction of the test set. There is a remarkable rule of thumb that the lift attainable at q is around 1/√q. For example, if q = 0.25 then the attainable lift is around √4 = 2.

The rule of thumb is not valid for very small values of q, for two reasons. The first reason is mathematical: an upper bound on the lift is 1/t, where t is the overall fraction of positives; t is also called the target rate, response rate, or base rate. The second reason is empirical: lifts observed in practice tend to be well below 10. Hence, the rule of thumb can reasonably be applied when q > 0.02 and q > t².

The fraction of all positive examples that belong to the top-ranked fraction q of the test set is q · (1/√q) = √q. The lift in the bottom 1 − q is (1 − √q)/(1 − q). We have

    lim_{q→1} (1 − √q)/(1 − q) = 0.5.

This says that the examples ranked lowest are positive with probability equal to t/2. In other words, the lift for the lowest ranked examples cannot be driven below 0.5.

Let c be the cost of a contact and let b be the beneﬁt of a positive. This means

that the beneﬁt matrix is


                      actual negative    actual positive
    decide negative         0                  0
    decide positive        −c                b − c

Let n be the size of the test set. The profit at q is the total benefit minus the total cost, which is

    n q b t · (1/√q) − n q c = nc(bt√q/c − q) = nc(k√q − q)

where k = tb/c. Profit is maximized when

    0 = d/dq (k q^0.5 − q) = 0.5 k q^{−0.5} − 1.

The solution is q = (k/2)².

This solution is in the range zero to one only when k ≤ 2. When tb/c > 2 then maximum profit is achieved by soliciting every prospect, so data mining is pointless. The maximum profit that can be achieved is

    nc(k(k/2) − (k/2)²) = nck²/4 = ncq.

Remarkably, this is always the same as the cost of running the campaign, which is nqc.
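The profit function and its maximizer can be written down directly, which also lets one verify numerically that the maximum profit equals the campaign cost ncq. A sketch (function names invented for this illustration):

```python
import math

def optimal_fraction(t, b, c):
    """Profit-maximizing fraction q* = (k/2)**2 with k = t*b/c, capped at 1:
    when k >= 2, soliciting everyone is optimal."""
    k = t * b / c
    return min(1.0, (k / 2.0) ** 2)

def profit(n, t, b, c, q):
    """Campaign profit n*c*(k*sqrt(q) - q) under the 1/sqrt(q) lift rule."""
    k = t * b / c
    return n * c * (k * math.sqrt(q) - q)
```

At q* the profit equals n·c·q*, and nudging q in either direction only reduces it, consistent with the derivative condition above.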

As k decreases, the attainable proﬁt decreases fast, i.e. quadratically. If k < 0.4

then q < 0.04 and the attainable proﬁt is less than 0.04nc. Such a campaign may

have high risk, because the lift attainable at small q is typically worse than suggested

by the rule of thumb. Note however that a small adverse change in c, b, or t is not

likely to make the campaign lose money, because the expected revenue is always

twice the campaign cost.

It is interesting to consider whether data mining is beneﬁcial or not from the

point of view of society at large. Remember that c is the cost of one contact from the

perspective of the initiator of the campaign. Each contact also has a cost or beneﬁt

for the recipient. Generally, if the recipient responds, one can assume that the contact

was beneﬁcial for him or her, because s/he only responds if the response is beneﬁcial.

However, if the recipient does not respond, which is the majority case, then one can

assume that the contact caused a net cost to him or her, for example a loss of time.

From the point of view of the initiator of a campaign a cost or beneﬁt for a

respondent is an externality. It is not rational for the initiator to take into account

these beneﬁts or costs, unless they cause respondents to take actions that affect the

initiator, such as closing accounts. However, it is rational for society to take into

account these externalities.

The conclusion above is that the revenue from a campaign, for its initiator, is

roughly twice its cost. Suppose that the beneﬁt of responding for a respondent is λb


where b is the beneﬁt to the initiator. Suppose also that the cost of a solicitation to

a person is µc where c is the cost to the initiator. The net beneﬁt to respondents is

positive as long as µ < 2λ.

The reasoning above clariﬁes why spam email campaigns are harmful to society.

For these campaigns, the cost c to the initiator is tiny. However, the cost to a recipient

is not tiny, so µ is large. Whatever λ is, it is likely that µ > 2λ.

In summary, data mining is only beneﬁcial in a narrow sweet spot, where

tb/2 ≤ c ≤ αtb

where α is some constant greater than 1. The product tb is the average beneﬁt of

soliciting a random customer. If the cost c of solicitation is less than half of this, then

it is rational to contact all potential respondents. If c is much greater than the average

beneﬁt, then the campaign is likely to have high risk for the initiator.

As an example of the reasoning above, consider the scenario of the 1998 KDD contest. Here t = 0.05 approximately, c = $0.68, and the average benefit is b = $15 approximately. We have k = tb/c = 75/68 ≈ 1.10. The rule of thumb predicts that the optimal fraction of people to solicit is q = 0.30, while the achievable profit per person is cq = $0.21. In fact, the methods that perform best in this domain achieve profit of about $0.16, while soliciting about 70% of all people.
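The KDD98 arithmetic can be reproduced in a few lines, which is a useful sanity check on the rule-of-thumb quantities:

```python
t, c, b = 0.05, 0.68, 15.0        # base rate, cost per contact, mean gift
k = t * b / c                     # ratio of expected benefit to contact cost
q = (k / 2) ** 2                  # predicted optimal fraction to solicit
profit_per_person = c * q         # attainable profit equals campaign cost c*q
```

This yields k ≈ 1.10, q ≈ 0.30, and about $0.21 of attainable profit per person, matching the figures quoted above.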

7.7 Evaluating success

In order to evaluate whether a real-world campaign is proceeding as expected, we need to estimate a confidence interval for the profit on the test set. This should be a range [a, b] such that the probability is (say) 95% that the observed profit lies in this range.

Intuitively, there are two sources of uncertainty for each test person: whether s/he

will donate, and if so, how much. We need to work out the logic of how to quantify

these uncertainties, and then the logic of combining the uncertainties into an overall

conﬁdence interval. The central issue here is not any mathematical details such as

Student distributions versus Gaussians. It is the logic of tracking where uncertainty

arises, and how it is propagated.


Quiz (2009)

Suppose your lab is trying to crystallize a protein. You can try experimental condi-

tions x that differ on temperature, salinity, etc. The label y = 1 means crystallization

is successful, while y = 0 means failure. Assume (not realistically) that the results

of different experiments are independent.

You have a classiﬁer that predicts p(y = 1[x), the probability of success of ex-

periment x. The cost of doing one experiment is $60. The value of successful crys-

tallization is $9000.

(a) Write down the beneﬁt matrix involved in deciding rationally whether or not

to perform a particular experiment.

The benefit matrix is

                      success     failure
    do experiment    9000 − 60      −60
    don’t                0            0

(b) What is the threshold probability of success needed in order to perform an

experiment?

It is rational to do an experiment under conditions x if and only if the expected benefit is positive, that is if and only if

    (9000 − 60) p + (−60)(1 − p) > 0

where p = p(success|x). The threshold value of p is 60/9000 = 1/150.

(c) Is it rational for your lab to take into account a budget constraint such as “we

only have a technician available to do one experiment per day”?

No. A budget constraint is an alternative rule for making decisions that is less

rational than maximizing expected beneﬁt. The optimal behavior is to do all experi-

ments that have success probability over 1/150.

Additional explanation: The $60 cost of doing an experiment should include

the expense of technician time. If many experiments are worth doing, then more

technicians should be hired.

If the budget constraint is unavoidable, then experiments should be done starting

with those that have highest probability of success.

If just one successful crystallization is enough, then experiments should also be

done in order of declining success probability, until the ﬁrst actual success.

Quiz for May 4, 2010

Your name:

Suppose that you work for a bank that wants to prevent criminal laundering of money. The label y = 1 means a money transfer is criminal, while y = 0 means the transfer is legal. You have a classifier that estimates the probability p(y = 1|x) where x is a vector of feature values describing a money transfer.

Let z be the dollar amount of the transfer. The matrix of costs (negative) and benefits (positive) involved in deciding whether or not to deny a transfer is as follows:

              criminal     legal
    deny          0       −0.10z
    allow        −z        0.01z

Work out the rational policy based on p(y = 1|x) for deciding whether or not to allow a transfer.
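One way to explore the quiz numerically is to compare expected benefits directly, as in Section 7.4. The following sketch (the function name `deny_transfer` is invented; the entries are taken from the quiz's matrix) makes it easy to see that the resulting threshold on p(y = 1|x) does not depend on z, since every entry scales linearly with z:

```python
def deny_transfer(p_criminal, z):
    """Deny the transfer when the expected benefit of denying exceeds the
    expected benefit of allowing, using the quiz's benefit matrix.
    z is the dollar amount of the transfer."""
    deny = p_criminal * 0.0 + (1 - p_criminal) * (-0.10 * z)
    allow = p_criminal * (-z) + (1 - p_criminal) * (0.01 * z)
    return deny > allow
```

Trying a few probabilities shows the decision flips somewhere around p ≈ 0.1, at the same point for any positive z.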


2009 Assignment

This assignment is due at the start of class on Tuesday May 5. As before, you should

work in a team of two, and you are free to change partners or not.

This assignment is the last one to use the KDD98 data. You should now train

on the entire training set, and measure ﬁnal success on the test set that you have not

previously used. The goal is to solicit an optimal subset of the test examples. The

measure of success to maximize is total donations received minus $0.68 for every

solicitation.

You should train a regression function to predict donation amounts, and a classi-

ﬁer to predict donation probabilities. For a test example x, let the predicted donation

be a(x) and let the predicted donation probability be p(x). You should decide to send

a donation request to person x if and only if

p(x) a(x) ≥ 0.68.

You should use the training set for all development work. In particular, you should

use part of the training set for debugging your procedure for reading in test examples,

making decisions concerning them, and tallying total proﬁt. Only use the actual test

set once, to measure the ﬁnal success of your method.

Notes: The test instances are in the ﬁle cup98val.zip at http://archive.

ics.uci.edu/ml/databases/kddcup98/kddcup98.html. The test set

labels are in valtargt.txt. The labels are sorted by CONTROLN, unlike the test

instances.

2010 Assignment

This week’s assignment is to participate in the PAKDD 2010 data mining contest.

Details of the contest are at http://sede.neurotech.com.br/PAKDD2010/.

Each team of two students should register and download the ﬁles for the contest.

Your ﬁrst goal should be to understand the data and submission formats, and to

submit a correct set of predictions. Make sure that when you have good predictions

later, you will not run into any technical difﬁculties.

Your second goal should be to understand the contest scenario and the differences

between the training and two test datasets. Do some exploratory analysis of the three

datasets. In your written report, explain your understanding of the scenario, and your

general ﬁndings about the datasets.

Next, based on your general understanding, design a sensible approach for achiev-

ing the contest objective. Implement this approach and submit predictions to the con-

test. Of course, you may reﬁne your approach iteratively and you may make multiple

submissions. Meet the May 3 deadline for uploading your best predictions and a

copy of your report. The contest rules ask each team to submit a paper of four pages.

The manuscript should be in the scientiﬁc paper format of the confer-

ence detailing the stages of the KDD process, focusing on the aspects

which make the solution innovative. The manuscript should be the ba-

sis for writing up a full paper for an eventual pos-conference proceed-

ing (currently under negotiation). The authors of top solutions and se-

lected manuscripts with innovative approaches will have the opportunity

to submit to that forum. The quality of the manuscripts will not be used

for the competition assessment, unless in the case of ties.

You can ﬁnd template ﬁles for LaTeX and Word at http://www.springer.

com/computer/lncs?SGWID=0-164-6-793341-0. Do not worry about

formatting details.

Chapter 8

Learning classiﬁers despite

missing labels

This chapter discusses how to learn two-class classiﬁers from nonstandard training

data. Speciﬁcally, we consider three different but related scenarios where labels are

missing for some but not all training examples.

8.1 The standard scenario for learning a classiﬁer

In the standard situation, the training and test sets are two different random samples

from the same population. Being from the same population means that both sets

follow a common probability distribution. Let x be a training or test instance, and let

y be its label. The common distribution is p(x, y).

The notation p(x, y) is an abbreviation for p(X = x and Y = y) where X and

Y are random variables, and x and y are possible values of these variables. If X and

Y are both discrete-valued random variables, this distribution is a probability mass

function (pmf). Otherwise, it is a probability density function (pdf).

It is important to understand that we can always write

    p(x, y) = p(x) p(y|x)

and also

    p(x, y) = p(y) p(x|y)

without loss of generality and without making any assumptions. The equations above

are sometimes called the chain rule of probabilities. They are true both when the

probability values are probability masses, and when they are probability densities.


8.2 Sample selection bias in general

Suppose that the label y is observed for some training instances x, but not for others.

This means that the training instances x for which y is observed are not a random

sample from the same population as the test instances. Let s be a new random vari-

able with value s = 1 if and only if x is selected. This means that y is observed if

and only if s = 1. An important question is whether the unlabeled training examples

can be exploited in some way. However, here we focus on a different question: how

can we adjust for the fact that the labeled training examples are not representative of

the test examples?

Formally, x, y, and s are random variables. There is some fixed unknown joint probability distribution p(x, y, s) over triples ⟨x, y, s⟩. We can identify three possibilities for s. The easiest case is when p(s = 1|x, y) is a constant. In this case, the labeled training examples are a fully random subset. We have fewer training examples, but otherwise we are in the usual situation. This case is called “missing completely at random” (MCAR).

A more difficult situation arises if s is correlated with x and/or with y, that is if p(s = 1) ≠ p(s = 1|x, y). Here, there are two subcases. First, suppose that s depends on x but, given x, not on y. In other words, when x is fixed then s is independent of y. This means that p(s = 1|x, y) = p(s = 1|x) for all x. In this case, Bayes’ rule gives that

    p(y|x, s = 1) = p(s = 1|x, y) p(y|x) / p(s = 1|x)
                  = p(s = 1|x) p(y|x) / p(s = 1|x)
                  = p(y|x)

assuming that p(s = 1|x) > 0 for all x. Therefore we can learn a correct model of p(y|x) from just the labeled training data, without using the unlabeled data in any way. This case is rather misleadingly called “missing at random” (MAR). It is not the case that labels are missing in a totally random way, because missingness does depend on x. It is also not the case that s and y are independent. However, it is true that s and y are conditionally independent, conditional on x. Concretely, for each value of x the equation p(y|x, s = 0) = p(y|x) = p(y|x, s = 1) holds.

The assumption that p(s = 1 | x) > 0 is important. The real-world meaning of this assumption is that label information must be available with non-zero probability for every possible instance x. Otherwise, it might be the case for some x that no labeled training examples are available from which to estimate p(y | x, s = 1).

Suppose that even when x is known, there is still some correlation between s and y. In this case the label y is said to be “missing not at random” (MNAR) and inference is much more difficult. We do not discuss this case further here.


8.3 Covariate shift

Imagine that the available training data come from one hospital, but we want to learn a classifier to use at a different hospital. Let x be a patient and let y be a label such as “has swine flu.” The distribution p(x) of patients is different at the two hospitals, but we are willing to assume that the relationship to be learned, which is p(y | x), is unchanged.

In a scenario like the one above, no unlabeled training examples are available. Labeled training examples from all classes are available, but the distribution of instances is different for the training and test sets. This scenario is called “covariate shift,” where covariate means feature and shift means change of distribution.

Following the argument in the previous section, the assumption that p(y | x) is unchanged means that p(y | x, s = 1) = p(y | x, s = 0), where s = 1 refers to the first hospital and s = 0 refers to the second hospital. The simple approach to covariate shift is thus to train the classifier on the data ⟨x, y, s = 1⟩ in the usual way, and to apply it to patients from the second hospital directly.

8.4 Reject inference

Suppose that we want to build a model that predicts who will repay a loan. We need to be able to apply the model to the entire population of future applicants (sometimes called the “through the door” population). The usable training examples are people who were actually granted a loan in the past. Unfortunately, these people are not representative of the population of all future applicants. They were previously selected precisely because they were thought to be more likely to repay. The problem of “reject inference” is to learn somehow from previous applicants who were not approved, for whom we hence do not have training labels.

Another example is the task of learning to predict the benefit of a medical treatment. If doctors have historically given the treatment only to patients who were particularly likely to benefit, then the observed success rate of the treatment is likely to be higher than its success rate on “through the door” patients. Conversely, if doctors have historically used the treatment only on patients who were especially ill, then the “through the door” success rate may be higher than the observed success rate.

The “reject inference” scenario is similar to the covariate shift scenario, but with two differences. The first difference is that unlabeled training instances are available. The second difference is that the label being missing is expected to be correlated with the value of the label. The important similarity is the common assumption that missingness depends only on x, and not additionally on y. Whether y is missing may be correlated with the value of y, but this correlation disappears after conditioning on x.

If missingness does depend on y, even after conditioning on x, then we are in the MNAR situation and in general we can draw no firm conclusions. An example of this situation is “survivor bias.” Suppose our analysis is based on historical records, and those records are more likely to exist for survivors, everything else being equal. Then p(s = 1 | y = 1, x) > p(s = 1 | y = 0, x), where “everything else being equal” means that x is the same on both sides.

A useful fact is the following. Suppose we want to estimate E[f(z)] where z follows the distribution p(z), but we can only draw samples of z from the distribution q(z). The fact is

E[f(z) | z ∼ p(z)] = E[f(z) p(z)/q(z) | z ∼ q(z)],

assuming q(z) > 0 for all z. More generally the requirement is that q(z) > 0 whenever p(z) > 0, assuming the definition 0/0 = 0. The equation above is called the importance-sampling identity.
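To make the identity concrete, here is a small self-contained simulation; the particular distributions p and q below are illustrative choices, not from the text. Samples drawn from q, weighted by p(z)/q(z), estimate an expectation under p.

```python
import random

# Importance sampling: estimate E[f(z)] under p using samples from q,
# weighting each sample by w(z) = p(z)/q(z).
# p is uniform on {0, 1, 2, 3}; q is a different distribution on the same support.

p = [0.25, 0.25, 0.25, 0.25]
q = [0.4, 0.3, 0.2, 0.1]
f = lambda z: z * z  # E[f(z)] under p is 0.25 * (0 + 1 + 4 + 9) = 3.5

rng = random.Random(0)
n = 200_000
total = 0.0
for _ in range(n):
    z = rng.choices(range(4), weights=q)[0]  # draw z ~ q
    total += f(z) * p[z] / q[z]              # weight by p(z)/q(z)

estimate = total / n
print(estimate)  # close to the true value 3.5
```

Note that the few samples with large weights p(z)/q(z) (here z = 3, with weight 2.5) dominate the variance of the estimate, which previews the variance issue discussed below.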

Let the goal be to compute E[f(x, y) | x, y ∼ p(x, y)] for any function f. To make notation more concise, write this as E[f], and write E[f(x, y) | x, y ∼ p(x, y | s = 1)] as E[f | s = 1]. We have

E[f] = E[f · p(x) p(y | x) / (p(x | s = 1) p(y | x, s = 1)) | s = 1]
     = E[f · p(x) / p(x | s = 1) | s = 1]

since p(y | x) = p(y | x, s = 1). Applying Bayes' rule to p(x | s = 1) gives

E[f] = E[f · p(x) / (p(s = 1 | x) p(x) / p(s = 1)) | s = 1]
     = E[f · p(s = 1) / p(s = 1 | x) | s = 1].

The constant p(s = 1) can be estimated as r/n, where r is the number of labeled training examples and n is the total number of training examples. Let p̂(s = 1 | x) be a trained model of the conditional probability p(s = 1 | x). Averaging over the r labeled examples with the estimate r/n in place of p(s = 1), the estimate of E[f] is then

(1/n) Σ_{i=1}^{r} f(x_i, y_i) / p̂(s = 1 | x_i).


This estimate is called a “plug-in” estimate because it is based on plugging the observed values of ⟨x, y⟩ into a formula that would be correct if based on integrating over all values of ⟨x, y⟩.

The weighting approach just explained is correct in the statistical sense that it is unbiased, if the propensity estimates p̂(s = 1 | x) are correct. However, this approach typically has high variance, since a few labeled examples with high values of 1/p̂(s = 1 | x) can dominate the sum. Therefore an important question is whether alternative approaches exist that have lower variance. One simple heuristic is to place a ceiling on the values 1/p̂(s = 1 | x). For example, the ceiling 1000 is used by [Huang et al., 2006]. However, no good method for selecting the ceiling value is known.
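As a sketch, the reweighted plug-in estimate with a weight ceiling might look like the following; the function names and toy data are hypothetical stand-ins, with `g_hat` playing the role of the trained propensity model p̂(s = 1 | x).

```python
def reweighted_estimate(labeled, g_hat, n_total, ceiling=1000.0):
    """Plug-in estimate of E[f]: (1/n) * sum of f(x_i, y_i) / g_hat(x_i)
    over the labeled examples, with 1/g_hat(x) capped at `ceiling`."""
    total = 0.0
    for x, y, f_value in labeled:
        inv_propensity = min(1.0 / g_hat(x), ceiling)  # heuristic variance control
        total += f_value * inv_propensity
    return total / n_total

# Toy check: if p(s=1|x) = 0.5 everywhere and 2 of 4 examples are labeled,
# the estimate recovers the plain average of f.
labeled = [(0, 1, 1.0), (1, 1, 1.0)]  # (x, y, f(x, y)) triples
est = reweighted_estimate(labeled, lambda x: 0.5, n_total=4)
print(est)  # (1.0*2 + 1.0*2) / 4 = 1.0
```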

In medical research, the ratio p(s = 1)/p(s = 1 | x) is called the “inverse probability of treatment” (IPT) weight.

When does the reject inference scenario give rise to MNAR bias?

8.5 Positive and unlabeled examples

Suppose that only positive examples are labeled. This fact can be stated formally as the equation

p(s = 1 | x, y = 0) = 0. (8.1)

Without some assumption about which positive examples are labeled, it is impossible to make progress. A common assumption is that the labeled positive examples are chosen completely randomly from all positive examples. Let this be called the “selected completely at random” assumption. Stated formally, it is that

p(s = 1 | x, y = 1) = p(s = 1 | y = 1) = c. (8.2)

Another way of stating the assumption is that s and x are conditionally independent given y.

A training set consists of two subsets, called the labeled (s = 1) and unlabeled (s = 0) sets. Suppose we provide these two sets as inputs to a standard training algorithm. This algorithm will yield a function g(x) such that g(x) = p(s = 1 | x) approximately. The following lemma shows how to obtain a model of p(y = 1 | x) from g(x).

Lemma 1. Suppose the “selected completely at random” assumption holds. Then p(y = 1 | x) = p(s = 1 | x)/c, where c = p(s = 1 | y = 1).


Proof. Remember that the assumption is p(s = 1 | y = 1, x) = p(s = 1 | y = 1). Now consider p(s = 1 | x). We have that

p(s = 1 | x) = p(y = 1 ∧ s = 1 | x)
             = p(y = 1 | x) p(s = 1 | y = 1, x)
             = p(y = 1 | x) p(s = 1 | y = 1).

The result follows by dividing each side by p(s = 1 | y = 1).

Several consequences of the lemma are worth noting. First, the model f(x) = g(x)/c of p(y = 1 | x) is an increasing function of g. This means that if the classifier f is only used to rank examples x according to the chance that they belong to class y = 1, then the classifier g can be used directly instead of f.

Second, f = g/p(s = 1 | y = 1) is a well-defined probability, with f ≤ 1, only if g ≤ p(s = 1 | y = 1). What this says is that g > p(s = 1 | y = 1) is impossible. This is reasonable because the labeled (positive) and unlabeled (negative) training sets for g are samples from overlapping regions in x space. Hence it is impossible for any example x to belong to the positive class for g with a high degree of certainty.

The value of the constant c = p(s = 1 | y = 1) can be estimated using a trained classifier g and a validation set of examples. Let V be such a validation set that is drawn from the overall distribution p(x, y, s) in the same manner as the nontraditional training set. Let P be the subset of examples in V that are labeled (and hence positive). The estimator of p(s = 1 | y = 1) is the average value of g(x) for x in P. Formally, the estimator is e_1 = (1/n) Σ_{x∈P} g(x), where n is the cardinality of P.

We shall show that e_1 = p(s = 1 | y = 1) = c if it is the case that g(x) = p(s = 1 | x) for all x. To do this, all we need to show is that g(x) = c for x ∈ P. We can show this as follows:

g(x) = p(s = 1 | x)
     = p(s = 1 | x, y = 1) p(y = 1 | x) + p(s = 1 | x, y = 0) p(y = 0 | x)
     = p(s = 1 | x, y = 1) · 1 + 0 · 0    (since x ∈ P)
     = p(s = 1 | y = 1).

Note that in principle any single example from P is sufficient to determine c, but that in practice averaging over all members of P is preferable.
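The estimator e_1 can be checked with a small simulation. The data-generating process below is an illustrative assumption in which y is a deterministic function of x (so that p(y = 1 | x) = 1 for labeled x, as the argument above requires), and g is taken to be the exact propensity p(s = 1 | x).

```python
import random

rng = random.Random(1)
c = 0.4                                  # true p(s=1 | y=1)

def p_y1(x):                             # p(y=1|x): deterministic in x here
    return 1.0 if x > 0.5 else 0.0

def g(x):                                # exact nontraditional classifier:
    return c * p_y1(x)                   # p(s=1|x) = c * p(y=1|x)

# Build a validation set and keep its labeled (hence positive) subset P
P = []
for _ in range(10_000):
    x = rng.random()
    y = 1 if rng.random() < p_y1(x) else 0
    s = 1 if (y == 1 and rng.random() < c) else 0
    if s == 1:
        P.append(x)

e1 = sum(g(x) for x in P) / len(P)       # average of g over P
print(e1)  # equals c = 0.4 up to float rounding, since g(x) = c on P
```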

There is an alternative way of using Lemma 1. Let the goal be to estimate E_{p(x,y,s)}[h(x, y)] for any function h, where p(x, y, s) is the overall distribution. To make notation more concise, write this as E[h]. We want an estimator of E[h] based on a positive-only training set of examples of the form ⟨x, s⟩.


Clearly p(y = 1 | x, s = 1) = 1. Less obviously,

p(y = 1 | x, s = 0) = p(s = 0 | x, y = 1) p(y = 1 | x) / p(s = 0 | x)
                    = [1 − p(s = 1 | x, y = 1)] p(y = 1 | x) / [1 − p(s = 1 | x)]
                    = (1 − c) p(y = 1 | x) / [1 − p(s = 1 | x)]
                    = (1 − c) [p(s = 1 | x)/c] / [1 − p(s = 1 | x)]
                    = ((1 − c)/c) · p(s = 1 | x) / [1 − p(s = 1 | x)].

By definition,

E[h] = Σ_{x,y,s} h(x, y) p(x, y, s)
     = Σ_x p(x) Σ_{s=0}^{1} p(s | x) Σ_{y=0}^{1} p(y | x, s) h(x, y)
     = Σ_x p(x) [ p(s = 1 | x) h(x, 1)
                  + p(s = 0 | x) ( p(y = 1 | x, s = 0) h(x, 1) + p(y = 0 | x, s = 0) h(x, 0) ) ].

The plug-in estimate of E[h] is then the empirical average

(1/m) [ Σ_{⟨x, s=1⟩} h(x, 1) + Σ_{⟨x, s=0⟩} ( w(x) h(x, 1) + (1 − w(x)) h(x, 0) ) ]

where

w(x) = p(y = 1 | x, s = 0) = ((1 − c)/c) · p(s = 1 | x) / [1 − p(s = 1 | x)] (8.3)

and m is the cardinality of the training set. What this says is that each labeled example is treated as a positive example with unit weight, while each unlabeled example is treated as a combination of a positive example with weight p(y = 1 | x, s = 0) and a negative example with complementary weight 1 − p(y = 1 | x, s = 0). The probability p(s = 1 | x) is estimated as g(x), where g is the nontraditional classifier explained in the previous section.


The result above on estimating E[h] can be used to modify a learning algorithm in order to make it work with positive and unlabeled training data. One method is to give training examples individual weights. Positive examples are given unit weight and unlabeled examples are duplicated; one copy of each unlabeled example is made positive with weight p(y = 1 | x, s = 0) and the other copy is made negative with weight 1 − p(y = 1 | x, s = 0).
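The duplication-and-weighting scheme can be sketched as follows; the inputs are placeholders, with g and c assumed to come from the estimation procedures described earlier.

```python
def weighted_training_set(data, g, c):
    """Turn positive/unlabeled pairs (x, s) into weighted labeled triples.

    Labeled examples (s=1) become positives with unit weight. Each unlabeled
    example (s=0) is duplicated: a positive copy with weight w(x) per
    equation (8.3) and a negative copy with weight 1 - w(x)."""
    out = []
    for x, s in data:
        if s == 1:
            out.append((x, 1, 1.0))
        else:
            gx = g(x)                                  # estimate of p(s=1|x)
            w = ((1.0 - c) / c) * gx / (1.0 - gx)      # p(y=1 | x, s=0)
            out.append((x, 1, w))
            out.append((x, 0, 1.0 - w))
    return out

# Example: with g(x) = 0.2 everywhere and c = 0.4,
# w = (0.6/0.4) * (0.2/0.8) = 0.375
triples = weighted_training_set([("a", 1), ("b", 0)], lambda x: 0.2, 0.4)
print(triples)
```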

8.6 Further issues

Spam ﬁltering scenario.

Observational studies.

Concept drift.

Moral hazard.

Adverse selection.

Quiz for May 11, 2010

Your name:

The importance sampling identity is the equation

E[f(z) | z ∼ p(z)] = E[f(z) w(z) | z ∼ q(z)]

where w(z) = p(z)/q(z). We assume that q(z) > 0 for all z such that p(z) > 0, and we define 0/0 = 0. Suppose that the training set consists of values z sampled according to the probability distribution q(z). Explain intuitively which members of the training set will have greatest influence on the estimate of E[f(z) | z ∼ p(z)].

Quiz for 2009

(a) Suppose you use the weighting approach to deal with reject inference. What are the minimum and maximum possible values of the weights?

Let x be a labeled example, and let its weight be p(s = 1)/p(s = 1 | x). Intuitively, this weight is how many copies are needed to allow the one labeled example to represent all the unlabeled examples that are similar. The conditional probability p(s = 1 | x) can range between 0 and 1, so the weight can range between p(s = 1) and infinity.

(b) Suppose you use the weighting approach to learn from positive and unlabeled examples. What are the minimum and maximum possible values of the weights?

In this scenario, weights are assigned to unlabeled examples, not to labeled examples as above. The weights here are probabilities p(y = 1 | x, s = 0), so they range between 0 and 1.

(c) Explain intuitively what can go wrong if the “selected completely at random”

assumption is false, when learning from positive and unlabeled examples.

The “selected completely at random” assumption says that the positive examples

with known labels are perfectly representative of the positive examples with unknown

labels. If this assumption is false, then there will be unlabeled examples that in fact

are positive, but that we treat as negative, because they are different from the labeled

positive examples. The trained model of the positive class will be too narrow.


Assignment (revised)

The goal of this assignment is to train useful classifiers using training sets with missing information of three different types: (i) covariate shift, (ii) reject inference, and (iii) no labeled negative training examples. In http://www.cs.ucsd.edu/users/elkan/291/dmfiles.zip you can find four datasets: one test set, and one training set for each of the three scenarios. Training set (i) has 5,164 examples, sets (ii) and (iii) have 11,305 examples, and the test set has 11,307 examples. Each example has values for 13 predictors. (Many thanks to Aditya Menon for creating these files.)

You should train a classiﬁer separately based on each training set, and measure

performance separately but on the same test set. Use accuracy as the primary measure

of success, and use the logistic regression option of FastLargeMargin as the

main training method. In each case, you should be able to achieve better than 82%

accuracy.

All four datasets are derived from the so-called Adult dataset that is available

at http://archive.ics.uci.edu/ml/datasets/Adult. Each example

describes one person. The label to be predicted is whether or not the person earns

over $50,000 per year. This is an interesting label to predict because it is analogous

to a label describing whether or not the person is a customer that is desirable in some

way. (We are not using the published weightings, so the fnlwgt feature has been

omitted from our datasets.)

First, do cross-validation on the test set to establish the accuracy that is achievable

when all training set labels are known. In your report, show graphically a learning

curve, that is accuracy as a function of the number of training examples, for 1000,

2000, etc. training examples.

Training set (i) requires you to overcome covariate shift, since it does not follow the same distribution as the population of test examples. Evaluate experimentally the effectiveness of learning p(y | x) from biased training sets of size 1000, 2000, etc.

Training set (ii) requires reject inference, because the training examples are a random sample from the test population, but the training label is known only for some training examples. The persons with known labels, on average, are better prospects than the ones with unknown labels. Compare learning p(y | x) directly with learning p(y | x) after reweighting.

In training set (iii), a random subset of positive training examples have known labels. Other training examples may be negative or positive. Use an appropriate weighting method. Explain how you estimate the constant c = p(s = 1 | y = 1) and discuss the accuracy of this estimate. The true value is c = 0.4; in a real application we would not know this, of course.


For each scenario and each training method, include in your report a learning curve figure that shows accuracy as a function of the number (1000, 2000, etc.) of labeled training examples used. Discuss the extent to which each of the three missing-label scenarios reduces achievable accuracy.

Chapter 9

Recommender systems

The collaborative ﬁltering (CF) task is to recommend items to a user that he or she

is likely to like, based on ratings for different items provided by the same user and

on ratings provided by other users. The general assumption is that users who give

similar ratings to some items are likely also to give similar ratings to other items.

From a formal perspective, the input to a collaborative filtering algorithm is a matrix of incomplete ratings. Each row corresponds to a user, while each column corresponds to an item. If user 1 ≤ i ≤ m has rated item 1 ≤ j ≤ n, the matrix entry x_ij is the value of this rating. Often rating values are integers between 1 and 5. If a user has not rated an item, the corresponding matrix entry is missing. Missing ratings are often represented as 0, but this should not be viewed as an actual rating value.

The output of a collaborative filtering algorithm is a prediction of the value of each missing matrix entry. Typically the predictions are not required to be integers. Given these predictions, many different real-world tasks can be performed. For example, a recommender system might suggest to a user those items for which the predicted ratings by this user are highest.

There are two main general approaches to the formal collaborative filtering task: nearest-neighbor-based and model-based. Given a user and item for which a prediction is wanted, nearest neighbor (NN) approaches use a similarity function between rows, and/or a similarity function between columns, to pick a subset of relevant other users or items. The prediction is then some sort of average of the known ratings in this subset.

Model-based approaches to collaborative filtering construct a low-complexity representation of the complete x_ij matrix. This representation is then used instead of the original matrix to make predictions. Typically each prediction can be computed in O(1) time using only a fixed number of coefficients from the representation.


The most popular model-based collaborative ﬁltering algorithms are based on

standard matrix approximation methods such as the singular value decomposition

(SVD), principal component analysis (PCA), or nonnegative matrix factorization

(NNMF). Of course these methods are themselves related to each other.

From the matrix decomposition perspective, the fundamental issue in collaborative filtering is that the matrix given for training is incomplete. Many algorithms have been proposed to extend SVD, PCA, NNMF, and related methods to apply to incomplete matrices, often based on expectation-maximization (EM). Unfortunately, methods based on EM are intractably slow for large collaborative filtering tasks, because they require many iterations, each of which computes a matrix decomposition on a full m by n matrix. Here, we discuss an efficient approach to decomposing large incomplete matrices.

9.1 Applications of matrix approximation

In addition to collaborative filtering, matrix approximation has many other important applications. Examples include the voting records of politicians, the results of sports matches, and distances in computer networks. The largest research area with a goal similar to collaborative filtering is so-called “item response theory” (IRT), a subfield of psychometrics. The aim of IRT is to develop models of how a person answers a multiple-choice question on a test. Users and items in collaborative filtering correspond to test-takers and test questions in IRT. Missing ratings are called omitted responses in IRT.

One conceptual difference between collaborative filtering research and IRT research is that the aim in IRT is often to find a single parameter for each individual and a single parameter for each question that together predict answers as well as possible. The idea is that each individual's parameter value is then a good measure of intrinsic ability, and each question's parameter value is a good measure of difficulty. In contrast, in collaborative filtering research the goal is often to find multiple parameters describing each person and each item, because the assumption is that each item has multiple relevant aspects and each person has separate preferences for each of these aspects.

9.2 Measures of performance

Given a set of predicted ratings and a matching set of true ratings, the difference between the two sets can be measured in alternative ways. The two standard measures are mean absolute error (MAE) and mean squared error (MSE). Given a probability distribution over possible values, MAE is minimized by taking the median of the distribution, while MSE is minimized by taking the mean of the distribution. In general the mean and the median are different, so in general predictions that minimize MAE are different from predictions that minimize MSE.

Most matrix approximation algorithms aim to minimize MSE between the training data (a complete or incomplete matrix) and the approximation obtained from the learned low-complexity representation. Some methods can be used with equal ease to minimize MSE or MAE.
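The median/mean fact is easy to verify numerically on a small set of values; the values below are an illustrative example, not from the text.

```python
values = [1, 1, 2, 5, 5]   # median = 2, mean = 2.8

def mae(pred):
    return sum(abs(pred - v) for v in values) / len(values)

def mse(pred):
    return sum((pred - v) ** 2 for v in values) / len(values)

# Search a fine grid of candidate constant predictions
grid = [x / 100.0 for x in range(100, 501)]
best_mae = min(grid, key=mae)   # minimized at the median
best_mse = min(grid, key=mse)   # minimized at the mean
print(best_mae, best_mse)       # 2.0 and 2.8
```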

9.3 Additive models

Let us consider first models where each matrix entry is represented as the sum of two contributions, one from its row and one from its column. Formally, a_ij = r_i + c_j, where a_ij is the approximation of x_ij, and r_i and c_j are scalars to be learned.

Define the training mean x̄_i· of each row to be the mean of all its known values; define the training mean x̄_·j of each column similarly. The user-mean model sets r_i = x̄_i· and c_j = 0, while the item-mean model sets r_i = 0 and c_j = x̄_·j. A slightly more sophisticated baseline is the “bimean” model: r_i = 0.5 x̄_i· and c_j = 0.5 x̄_·j.

The “bias from mean” model has r_i = x̄_i· and c_j equal to the training mean over column j of the residuals x_ij − r_i. Intuitively, c_j is the average amount by which users who provide a rating for item j like this item more or less than the average item that they have rated.
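These baseline models can be sketched on a toy sparse matrix; the ratings dictionary below is invented for illustration.

```python
# Sparse ratings: {(i, j): x_ij}; absent entries are missing.
ratings = {(0, 0): 4, (0, 1): 2, (1, 0): 5, (1, 2): 3, (2, 1): 1}

def mean(xs):
    xs = list(xs)
    return sum(xs) / len(xs)

rows = {i for i, _ in ratings}
cols = {j for _, j in ratings}

# "Bias from mean": r_i is the row training mean, c_j is the column
# training mean of the residuals x_ij - r_i.
r = {i: mean(x for (a, b), x in ratings.items() if a == i) for i in rows}
c = {j: mean(x - r[a] for (a, b), x in ratings.items() if b == j) for j in cols}

def predict(i, j):
    return r[i] + c[j]

print(predict(2, 0))   # prediction for the missing entry (2, 0): 1.0 + 1.0 = 2.0
```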

The optimal additive model can be computed quite straightforwardly. Let I be the set of matrix indices ⟨i, j⟩ for which x_ij is known. The MSE-optimal additive model is the one that minimizes

Σ_{(i,j)∈I} (r_i + c_j − x_ij)^2.

This optimization problem is a special case of a sparse least-squares linear regression problem: find z that minimizes ||Az − b||^2, where the column vector z = ⟨r_1, …, r_m, c_1, …, c_n⟩, b is a column vector of x_ij values, and the corresponding row of A is all zero except for ones in positions i and m + j. The vector z can be computed by many methods. The standard method uses the Moore-Penrose pseudoinverse of A: z = (A′A)^{−1} A′b. However, this approach requires inverting an (m + n) by (m + n) matrix, which is computationally expensive. Section 9.4 below gives a gradient-descent method to obtain the optimal z which only requires O(|I|) time and space.

The matrix A is always rank-deficient, with rank at most m + n − 1, because any constant can be added to all r_i and subtracted from all c_j without changing the matrix reconstruction. If some users or items have very few ratings, the rank of A may be less than m + n − 1. However, the rank deficiency of A does not cause computational problems in practice.
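The rank deficiency corresponds to a concrete invariance that is easy to check: shifting all r_i by a constant t and all c_j by −t leaves every reconstruction r_i + c_j unchanged. The values below are arbitrary.

```python
r = [0.5, 1.0, -0.25]
c = [2.0, 0.0]
t = 3.7  # arbitrary shift

original = [ri + cj for ri in r for cj in c]
shifted = [(ri + t) + (cj - t) for ri in r for cj in c]

same = all(abs(a - b) < 1e-9 for a, b in zip(original, shifted))
print(same)  # True, up to floating-point rounding
```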


9.4 Multiplicative models

Multiplicative models are similar to additive models, but the row and column values are multiplied instead of added. Specifically, x_ij is approximated as a_ij = r_i c_j, with r_i and c_j being scalars to be learned. Like additive models, multiplicative models have one unnecessary degree of freedom. We can fix any single value r_i or c_j to be a constant without changing the space of achievable approximations.

A multiplicative model is a rank-one matrix decomposition, since the rows and columns of the approximation matrix are linearly proportional. We now present a general algorithm for learning row value/column value models; additive and multiplicative models are special cases. Let x_ij be a matrix entry and let r_i and c_j be the corresponding row and column values. We approximate x_ij by a_ij = f(r_i, c_j) for some fixed function f. The approximation error is defined to be e(f(r_i, c_j), x_ij). The training error E over all known matrix entries I is the pointwise sum of these errors.

To learn r_i and c_j by minimizing E by gradient descent, we evaluate

∂E/∂r_i = Σ_{(i,j)∈I} ∂/∂r_i e(f(r_i, c_j), x_ij)
        = Σ_{(i,j)∈I} [∂e(f(r_i, c_j), x_ij)/∂f(r_i, c_j)] · [∂f(r_i, c_j)/∂r_i].

As before, I is the set of matrix indices for which x_ij is known. Consider the special case where e(u, v) = |u − v|^p for p > 0. This case is a generalization of MAE and MSE: p = 2 corresponds to MSE and p = 1 corresponds to MAE. We have

∂e(u, v)/∂u = p|u − v|^{p−1} ∂|u − v|/∂u = p|u − v|^{p−1} sgn(u − v).

Here sgn(a) is the sign of a, that is, −1, 0, or 1 if a is negative, zero, or positive. For computational purposes we can ignore the non-differentiable case u = v. Therefore

∂E/∂r_i = Σ_{(i,j)∈I} p|f(r_i, c_j) − x_ij|^{p−1} sgn(f(r_i, c_j) − x_ij) · ∂f(r_i, c_j)/∂r_i.

Now suppose f(r_i, c_j) = r_i + c_j, so ∂f(r_i, c_j)/∂r_i = 1. We obtain

∂E/∂r_i = p Σ_{(i,j)∈I} |r_i + c_j − x_ij|^{p−1} sgn(r_i + c_j − x_ij).


Alternatively, suppose f(r_i, c_j) = r_i c_j, so ∂f(r_i, c_j)/∂r_i = c_j. We get

∂E/∂r_i = p Σ_{(i,j)∈I} |r_i c_j − x_ij|^{p−1} sgn(r_i c_j − x_ij) c_j.

Given the gradients above, we apply online (stochastic) gradient descent. This means that we iterate over each triple ⟨r_i, c_j, x_ij⟩ in the training set, compute the gradient with respect to r_i and c_j based just on this one example, and perform the updates

r_i := r_i − λ ∂/∂r_i e(f(r_i, c_j), x_ij)

and

c_j := c_j − λ ∂/∂c_j e(f(r_i, c_j), x_ij)

where λ is a learning rate. Note that λ determines the step sizes for stochastic gradient descent; no separate algorithm parameter is needed for this.

After any ﬁnite number of epochs, stochastic gradient descent does not converge

fully. The choice of 30 epochs is a type of early stopping that leads to good results by

not overﬁtting the training data. For the learning rate we use a decreasing schedule:

λ = 0.2/e for additive models and λ = 0.4/e for multiplicative models, where

1 ≤ e ≤ 30 is the number of the current epoch.
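A minimal sketch of this procedure for the multiplicative model with squared error follows; the toy data and the small random initialization are assumptions made for illustration.

```python
import random

def fit_multiplicative(entries, m, n, epochs=30, seed=0):
    """SGD for a_ij = r_i * c_j with e(u, v) = (u - v)^2 and
    learning rate lambda = 0.4 / e on epoch e."""
    rng = random.Random(seed)
    r = [0.1 + 0.1 * rng.random() for _ in range(m)]   # small random init (assumption)
    c = [0.1 + 0.1 * rng.random() for _ in range(n)]
    for e in range(1, epochs + 1):
        lam = 0.4 / e
        for (i, j), x in entries.items():
            err = r[i] * c[j] - x            # f(r_i, c_j) - x_ij
            r_i_old = r[i]
            r[i] -= lam * 2.0 * err * c[j]   # d e / d r_i = 2 err c_j
            c[j] -= lam * 2.0 * err * r_i_old
    return r, c

# Toy incomplete rank-one matrix: entries of u v^T with u=(1,2), v=(1,2),
# with the (1,1) entry held out.
entries = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0}
r, c = fit_multiplicative(entries, 2, 2)

train_mse = sum((r[i] * c[j] - x) ** 2 for (i, j), x in entries.items()) / len(entries)
print(train_mse, r[1] * c[1])   # small training error; prediction near 4
```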

Gradient descent as described above directly optimizes precisely the same objective function (given the MSE error function) that is called “incomplete data likelihood” in EM approaches to factorizing matrices with missing entries. It is sometimes forgotten that EM is just one approach to solving maximum-likelihood problems; incomplete matrix factorization is an example of a maximum-likelihood problem where an alternative solution method is superior.

9.5 Combining models by ﬁtting residuals

Simple models can be combined into more complex models. In general, a second model is trained to fit the difference between the predictions made by the first model and the truth as specified by the training data; this difference is called the residual. The most straightforward way to combine models is additive. Let a_ij be the prediction from the first model and let x_ij be the corresponding training value. The second model is trained to minimize error relative to the residual x_ij − a_ij. Let b_ij be the prediction made by the second model for matrix entry ij. The combined prediction is then a_ij + b_ij. If the first and second models are both multiplicative, then the combined model is a rank-two approximation. The extension to higher-rank approximations is obvious.
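The residual-fitting recipe is generic in the base learner. A sketch follows, in which the simple mean-based learners are placeholders for any component model.

```python
def combine(entries, learn1, learn2):
    """Train model 1 on the data, model 2 on the residuals, and
    return the additive combination of their predictions."""
    m1 = learn1(entries)
    residuals = {k: x - m1(k) for k, x in entries.items()}
    m2 = learn2(residuals)
    return lambda k: m1(k) + m2(k)

def column_mean_learner(entries):
    by_col = {}
    for (i, j), x in entries.items():
        by_col.setdefault(j, []).append(x)
    means = {j: sum(v) / len(v) for j, v in by_col.items()}
    return lambda k: means.get(k[1], 0.0)

def row_mean_learner(entries):
    by_row = {}
    for (i, j), x in entries.items():
        by_row.setdefault(i, []).append(x)
    means = {i: sum(v) / len(v) for i, v in by_row.items()}
    return lambda k: means.get(k[0], 0.0)

entries = {(0, 0): 1.0, (1, 0): 3.0, (0, 1): 2.0}
pred = combine(entries, column_mean_learner, row_mean_learner)
print(pred((1, 0)))  # 2.0 (column mean) + 1.0 (row residual mean) = 3.0
```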


Standard principal component analysis (PCA) is a variety of mixed model. Before the first rank-one approximation is computed, the mean of each column is subtracted from that column of the original matrix. Thus the actual first approximation of the original matrix is a_ij = (0 + m_j) + r_i c_j = (1 · m_j) + r_i c_j, where m_j is an initial column value and r_i c_j is a multiplicative model.

The common interpretation of a combined model is that each component model

represents an aspect or property of items and users. The idea is that the column

value for each item is the intensity with which it possesses this property, and the

row value for each user is the weighting the user places on this property. However

this point of view implicitly treats each component model as equally important. It is

more accurate to view each component model as an adjustment to previous models,

where each model is empirically less important than each previous one, because the

numerical magnitude of its contribution is smaller. The aspect of items represented

by each model cannot be understood in isolation, but only in the context of previous

aspects.

9.6 Further issues

One issue not discussed above is regularization: attempting to improve generalization by reducing the effective number of parameters in the trained low-complexity representation.

Another issue not discussed is any potential probabilistic interpretation of a matrix decomposition. One reason for not considering models that are explicitly probabilistic is that these are usually based on Gaussian assumptions that are incorrect by definition when the observed data are integers and/or in a fixed interval.

A third issue not discussed is any explicit model for how matrix entries come

to be missing. There is no assumption or analysis concerning whether entries are

missing “completely at random” (MCAR), “at random” (MAR), or “not at random”

(MNAR). Intuitively, in the collaborative ﬁltering context, missing ratings are likely

to be MNAR. This means that even after accounting for other inﬂuences, whether or

not a rating is missing still depends on its value, because people select which items

to rate based on how much they anticipate liking them. Concretely, everything else

being equal, low ratings are still more likely to be missing than high ratings. This

intuition can be conﬁrmed empirically. On the Movielens dataset, the average rating

in the training set is 3.58. The average prediction from one reasonable method for

ratings in the test set, i.e. ratings that are unknown but known to exist, is 3.79. The

average prediction for ratings that in fact do not exist is 3.34.

Amazon: people who looked at x bought y eventually (within 24hrs usually)


Quiz

For each part below, say whether the statement in italics is true or false, and then

explain your answer brieﬂy.

(a) For any model, the average predicted rating for unrated movies is expected to

be less than the average actual rating for rated movies.

True. People are more likely to like movies that they have actually watched, than

random movies that they have not watched. Any good model should capture this fact.

In more technical language, the value of a rating is correlated with whether or

not it is missing.

(b) Let the predicted value of rating x_ij be r_i c_j + s_i d_j, and suppose r_i and c_j are trained first. For all viewers i, r_i should be positive, but for some i, s_i should be negative.

True. Ratings $x_{ij}$ are positive, and $x_{ij} = r_i c_j$ on average, so $r_i$ and $c_j$ should always be positive. (They could always both be negative, but that would be unintuitive without being more expressive.)

The term $s_i d_j$ models the difference $x_{ij} - r_i c_j$. This difference is on average zero, so it is sometimes positive and sometimes negative. In order to allow $s_i d_j$ to be negative, $s_i$ must be negative sometimes. (Making $s_i$ be always positive, while allowing $d_j$ to be negative, might be possible. However, in this case the expressiveness of the model $s_i d_j$ would be reduced.)

(c) We have a training set of 500,000 ratings for 10,000 viewers and 1000 movies,

and we train a rank-50 unregularized factor model. This model is likely to overﬁt the

training data.

True. The unregularized model has $50 \cdot (10{,}000 + 1{,}000) = 550{,}000$ parameters, which is more than the number of data points for training. Hence, overfitting is practically certain.

Quiz for May 25, 2010

Page 87 of the lecture notes says

Given the gradients above, we apply online (stochastic) gradient descent. This means that we iterate over each triple $\langle r_i, c_j, x_{ij} \rangle$ in the training set, compute the gradient with respect to $r_i$ and $c_j$ based just on this one example, and perform the updates
$$r_i := r_i - \lambda \frac{\partial}{\partial r_i} e(f(r_i, c_j), x_{ij})$$
and
$$c_j := c_j - \lambda \frac{\partial}{\partial c_j} e(f(r_i, c_j), x_{ij})$$
where $\lambda$ is a learning rate.

State and explain what the first update rule is for the special case $e(u, v) = (u - v)^2$ and $f(r_i, c_j) = r_i c_j$.
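For this special case, the chain rule gives $\frac{\partial}{\partial r_i} e = 2(r_i c_j - x_{ij})\, c_j$, so the first update rule becomes $r_i := r_i + 2\lambda(x_{ij} - r_i c_j)\, c_j$. A minimal sketch of one such step (the function name is illustrative):

```python
def sgd_step(r_i, c_j, x_ij, lam):
    """One online gradient step for e(u, v) = (u - v)**2 with f = r_i * c_j.
    d/dr_i e = 2 * (r_i*c_j - x_ij) * c_j, and symmetrically for c_j."""
    err = r_i * c_j - x_ij               # signed prediction error
    r_new = r_i - lam * 2.0 * err * c_j
    c_new = c_j - lam * 2.0 * err * r_i  # uses the old r_i, matching the notes
    return r_new, c_new
```

Note that both parameters are updated from the same prediction error, so the step pulls the prediction $r_i c_j$ toward the observed rating $x_{ij}$.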

Assignment

The goal of this assignment is to apply a collaborative ﬁltering method to the task of

predicting movie ratings. You should use the small MovieLens dataset available at

http://www.grouplens.org/node/73. This has 100,000 ratings given by

943 users to 1682 movies.

You can select any collaborative ﬁltering method that you like. You may reuse

existing software, or you may write your own, in Matlab or in another programming

language. Whatever your choice, you must understand fully the algorithm that you

apply, and you should explain it with mathematical clarity in your report. The method

that you choose should handle missing values in a sensible and efﬁcient way.

When reporting ﬁnal results, do ﬁve-fold cross-validation, where each rating is

assigned randomly to one fold. Note that this experimental procedure makes the task

easier, because it is likely that every user and every movie is represented in each

training set. Moreover, evaluation is biased towards users who have provided more

ratings; it is easier to make accurate predictions for these users.
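The random fold assignment described above can be sketched as follows; `assign_folds` is an illustrative helper, not part of any required library:

```python
import random

def assign_folds(ratings, k=5, seed=0):
    """Assign each (user, movie, rating) triple to one of k folds at random."""
    rng = random.Random(seed)
    return [rng.randrange(k) for _ in ratings]

ratings = [(1, 10, 4), (1, 20, 3), (2, 10, 5), (3, 30, 2)]
folds = assign_folds(ratings)
# Fold f's test set is the ratings with fold index f; the rest form the training set.
test_fold_0 = [r for r, f in zip(ratings, folds) if f == 0]
```

Because folds are drawn per rating rather than per user, heavy raters appear in every training fold, which is exactly the bias discussed above.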

In your report, show mean absolute error graphically as a function of a measure of complexity of your chosen method. If you select a matrix factorization method, this measure of complexity will likely be rank. Also show timing information graphically.

Discuss whether you could run your chosen method on the full Netflix dataset of about $10^8$ ratings. Also discuss whether your chosen method needs a regularization technique to reduce overfitting.

Good existing software includes the following:

• Jason Rennie’s fast maximum margin matrix factorization for collaborative ﬁl-

tering (MMMF) at http://people.csail.mit.edu/jrennie/matlab/.

• Guy Lebanon’s toolkit at http://www-2.cs.cmu.edu/˜lebanon/IR-lab.

htm.

You are also welcome to write your own code, or to choose other software. If you

choose other existing software, you may want to ask the instructor for comments

ﬁrst.

Chapter 10

Text mining

This chapter explains how to do data mining on datasets that are collections of doc-

uments. Text mining tasks include

• classifier learning,

• clustering,

• topic modeling, and

• latent semantic analysis.

Classiﬁers for documents are useful for many applications. Major uses for binary

classiﬁers include spam detection and personalization of streams of news articles.

Multiclass classiﬁers are useful for routing messages to recipients.

Most classiﬁers for documents are designed to categorize according to subject

matter. However, it is also possible to learn to categorize according to qualitative

criteria such as helpfulness for product reviews submitted by consumers.

Classiﬁers are useful for ranking documents as well as for dividing them into

categories. With a training set of very helpful product reviews, and another training

set of very unhelpful reviews, we can learn a scoring function that sorts other reviews

according to their degree of helpfulness. There is often no need to pick a threshold,

which would be arbitrary, to separate marginally helpful from marginally unhelpful

reviews.

In many applications of multiclass classiﬁcation, a single document can belong

to more than one category, so it is correct to predict more than one label. This task is

speciﬁcally called multilabel classiﬁcation. In standard multiclass classiﬁcation, the

classes are mutually exclusive, i.e. a special type of negative correlation is ﬁxed in

advance. In multilabel classiﬁcation, it is important to learn the positive and negative

correlations between classes.


10.1 The bag-of-words representation

The ﬁrst question we must answer is how to represent documents. For genuine un-

derstanding of natural language one must obviously preserve the order of the words

in documents. However, for many large-scale data mining tasks, including classify-

ing and clustering documents, it is sufﬁcient to use a simple representation that loses

all information about word order.

Given a collection of documents, the first task to perform is to identify the set of all words used at least once in at least one document. This set is called the vocabulary $V$. Often, it is reduced in size by keeping only words that are used in at least two documents. (Words that are found only once are often misspellings or other mistakes.) Although the vocabulary is a set, we fix an arbitrary ordering for it so we can refer to word 1 through word $m$, where $m = |V|$ is the size of the vocabulary.

Once $V$ has been fixed, each document is represented as a vector of length $m$ with integer entries. If this vector is $x$, then its $j$th component $x_j$ is the number of appearances of word $j$ in the document. The length of the document is $n = \sum_{j=1}^m x_j$. For typical documents, $n$ is much smaller than $m$ and $x_j = 0$ for most words $j$.

Many applications of text mining also eliminate from the vocabulary so-called

“stop” words. These are words that are common in most documents and do not

correspond to any particular subject matter. In linguistics these words are some-

times called “function” words. They include pronouns (you, he, it), connectives (and,

because, however), prepositions (to, of, before), auxiliaries (have, been, can), and

generic nouns (amount, part, nothing). It is important to appreciate, however, that

generic words carry a lot of information for many tasks, including identifying the

author of a document or detecting its genre.

A collection of documents is represented as a two-dimensional matrix where

each row describes a document and each column corresponds to a word. Each entry

in this matrix is an integer count; most entries are zero. It makes sense to view each

column as a feature. It also makes sense to learn a low-rank approximation of the

whole matrix; doing this is called latent semantic analysis (LSA) and is discussed in

Section 10.8 below.

10.2 The multinomial distribution

Once we have a representation for individual documents, the natural next step is

to select a model for a set of documents. This model is a probability distribution.

Given a training set of documents, we will choose values for the parameters of the

distribution that make the training documents have high probability.


Then, given a test document, we can evaluate its probability according to the

model. The higher this probability is, the more similar the test document is to the

training set.

The probability distribution that we use is the multinomial. Mathematically, this distribution is
$$p(x; \theta) = \frac{n!}{\prod_{j=1}^m x_j!} \prod_{j=1}^m \theta_j^{x_j}$$

where the data x are a vector of non-negative integers and the parameters θ are a

real-valued vector. Both vectors have the same length m. The components of θ are

non-negative and have unit sum: $\sum_{j=1}^m \theta_j = 1$.

Intuitively, $\theta_j$ is the probability of word $j$ while $x_j$ is the count of word $j$. Each time word $j$ appears in the document, it contributes a factor $\theta_j$ to the total probability, hence the term $\theta_j$ raised to the power $x_j$.

Like any discrete distribution, a multinomial has to sum to one, where the sum is

over all possible data points. Here, a data point is a document containing $n$ words. The number of such documents is exponential in their length $n$: it is $m^n$. The probability of any individual document will therefore be very small. What is important is the relative probability of different documents. A document that mostly uses words with high probability will have higher relative probability.

At first sight, computing the probability of a document requires $O(m)$ time because of the product over $j$. However, if $x_j = 0$ then $\theta_j^{x_j} = 1$, so the $j$th factor can be omitted from the product. Similarly, $0! = 1$, so the $j$th factor can be omitted from $\prod_{j=1}^m x_j!$. Hence, computing the probability of a document needs only $O(n)$ time.

Because the probabilities of individual documents decline exponentially with length $n$, it is necessary to do numerical computations with log probabilities:
$$\log p(x; \theta) = \log n! - \Big[\sum_{j=1}^m \log x_j!\Big] + \Big[\sum_{j=1}^m x_j \log \theta_j\Big].$$
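This log probability can be computed in $O(n)$ time by skipping words with $x_j = 0$; a sketch, using the identity $\log k! = \texttt{lgamma}(k+1)$:

```python
import math

def log_multinomial(x, theta):
    """log p(x; theta) in O(n) time: words with x_j = 0 contribute
    nothing to either the factorial sum or the theta product."""
    n = sum(x)
    logp = math.lgamma(n + 1)                     # log n!
    for xj, tj in zip(x, theta):
        if xj > 0:
            logp += xj * math.log(tj) - math.lgamma(xj + 1)
    return logp
```

For a dense count vector the loop still touches all $m$ entries; with a sparse representation (word index, count pairs) it would touch only the $n$ or fewer nonzero entries.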

Given a set of training documents, the maximum-likelihood estimate of the $j$th parameter is
$$\theta_j = \frac{1}{T} \sum_x x_j$$
where the sum is over all documents $x$ belonging to the training set. The normalizing constant is $T = \sum_x \sum_j x_j$, which is the sum of the sizes of all training documents.

If a multinomial has $\theta_j = 0$ for some $j$, then every document with $x_j > 0$ for this $j$ has zero probability, regardless of any other words in the document. Probabilities that are perfectly zero are undesirable, so we want $\theta_j > 0$ for all $j$. Smoothing with a constant $c$ is the way to achieve this. We set
$$\theta_j \propto c + \sum_x x_j$$
where the symbol $\propto$ means “proportional to.” The constant $c$ is called a pseudocount. Intuitively, it is a notional number of appearances of word $j$ that are assumed to exist, regardless of the true number of appearances. Typically $c$ is chosen in the range $0 < c \leq 1$. Because the equality $\sum_j \theta_j = 1$ must be preserved, the normalizing constant must be $T' = mc + T$ in
$$\theta_j = \frac{1}{T'}\Big(c + \sum_x x_j\Big).$$
In order to avoid big changes in the estimated probabilities $\theta_j$, one should have $c < T/m$.
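The smoothed maximum-likelihood estimate can be sketched as follows (an illustrative helper using dense count lists for clarity):

```python
def fit_multinomial(docs, c=1.0):
    """Smoothed MLE: theta_j proportional to c + sum over documents of x_j."""
    m = len(docs[0])
    totals = [c + sum(doc[j] for doc in docs) for j in range(m)]
    T_prime = sum(totals)            # equals m*c + T
    return [t / T_prime for t in totals]

theta = fit_multinomial([[2, 0, 1], [1, 1, 0]], c=1.0)   # two toy documents
```

With $c = 1$ every $\theta_j$ is strictly positive, even for word 2 of the second toy document, which never appears in one of the documents.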

Technically, one multinomial is a distribution over all documents of a ﬁxed size

n. Therefore, what is learned by the maximum-likelihood process just described is

in fact a different distribution for each size n. These distributions, although separate,

have the same parameter values.

Generative process: the multinomial arises from sampling words with replacement.

10.3 Training Bayesian classiﬁers

Bayesian learning is an approach to learning classifiers based on Bayes’ rule. Let $x$ be an example and let $y$ be its class. Suppose the alternative class values are 1 to $K$. We can write
$$p(y = k \mid x) = \frac{p(x \mid y = k)\, p(y = k)}{p(x)}.$$
In order to use the expression on the right for classification, we need to learn three items from training data: $p(x \mid y = k)$, $p(y = k)$, and $p(x)$. We can estimate $p(y = k)$ easily as $n_k / \sum_{k=1}^K n_k$, where $n_k$ is the number of training examples with class label $k$. The denominator $p(x)$ can be computed as the sum of numerators
$$p(x) = \sum_{k=1}^K p(x \mid y = k)\, p(y = k).$$

In general, the class-conditional distribution p(x[y = k) can be any distribution

whose parameters are estimated using the training examples that have class label k.

Note that the training process for each class uses only the examples from that class, and is separate from the training process for all other classes. Usually, each class is represented by one multinomial fitted by maximum-likelihood as described in Section 10.2. In the formula
$$\theta_j = \frac{1}{T'}\Big(c + \sum_x x_j\Big)$$
the sum is taken over the documents in one class. Unfortunately, the exact value of $c$ can strongly influence classification accuracy.
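The training and classification procedure just described can be sketched as follows (illustrative code; dense count lists, log-space scoring to avoid underflow):

```python
import math

def train(docs_by_class, c=1.0):
    """Fit one smoothed multinomial per class, plus log class priors."""
    n_total = sum(len(docs) for docs in docs_by_class.values())
    model = {}
    for k, docs in docs_by_class.items():
        m = len(docs[0])
        totals = [c + sum(doc[j] for doc in docs) for j in range(m)]
        T_prime = sum(totals)
        model[k] = (math.log(len(docs) / n_total),
                    [math.log(t / T_prime) for t in totals])
    return model

def classify(x, model):
    """Return argmax_k of log p(y=k) + sum_j x_j log theta_kj."""
    def score(k):
        log_prior, log_theta = model[k]
        return log_prior + sum(xj * lt for xj, lt in zip(x, log_theta) if xj > 0)
    return max(model, key=score)
```

The multinomial coefficient is omitted from the score because, as shown in the quiz at the end of this chapter, it is the same for every class and so does not affect the argmax.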

As discussed in Chapter ?? above, when one class of documents is rare, it is not reasonable to use accuracy to measure the success of a classifier for documents. Instead, it is common to use the so-called F-measure. This measure is the harmonic mean of precision and recall:
$$F = \frac{2}{1/p + 1/r} = \frac{2pr}{p + r}$$
where $p$ and $r$ are precision and recall for the rare class.

10.4 Burstiness

The multinomial model says that each appearance of the same word $j$ always has the same probability $\theta_j$. In reality, additional appearances of the same word are less

surprising, i.e. they have higher probability. Consider the following excerpt from a

newspaper article, for example.

Toyota Motor Corp. is expected to announce a major overhaul. Yoshi

Inaba, a former senior Toyota executive, was formally asked by Toyota

this week to oversee the U.S. business. Mr. Inaba is currently head of an

international airport close to Toyota’s headquarters in Japan.

Toyota’s U.S. operations now are suffering from plunging sales. Mr. Inaba was credited with laying the groundwork for Toyota’s fast growth in the U.S. before he left the company.

Recently, Toyota has had to idle U.S. assembly lines, pay workers

who aren’t producing vehicles and offer a limited number of voluntary

buyouts. Toyota now employs 36,000 in the U.S.

The multinomial distribution arises from a process of sampling words with re-

placement. An alternative distribution named the the Dirichlet compound multino-

mial (DCM) arises from an urn process that captures the authorship process better.

Consider a bucket with balls of [V [ different colors. After a ball is selected randomly,

it is not just replaced, but also one more ball of the same color is added. Each time a


ball is drawn, the chance of drawing the same color again is increased. This increased

probability models the phenomenon of burstiness.

Let the initial number of balls with color $j$ be $\beta_j$. These initial values are the parameters of the DCM distribution. The DCM parameter vector $\beta$ has length $|V|$, like the multinomial parameter vector, but the sum of the components of $\beta$ is unconstrained. This one extra degree of freedom allows the DCM to discount multiple observations of the same word, in an adjustable way. The smaller the parameter values $\beta_j$ are, the more bursty words are.
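The urn process can be simulated directly; a sketch (the function name is illustrative, and fractional initial “ball counts” are allowed since the $\beta_j$ are real-valued):

```python
import random

def draw_dcm_document(beta, n, seed=0):
    """Sample n words from the Polya urn: after drawing color j, put the ball
    back AND add one more ball of color j, so repeats become more likely."""
    rng = random.Random(seed)
    balls = list(beta)               # current (possibly fractional) ball counts
    counts = [0] * len(beta)
    for _ in range(n):
        u = rng.uniform(0.0, sum(balls))
        j, acc = 0, balls[0]
        while u > acc:               # inverse-CDF draw proportional to balls
            j += 1
            acc += balls[j]
        counts[j] += 1
        balls[j] += 1.0              # reinforcement: this is the burstiness
    return counts
```

With small $\beta_j$, the first few draws dominate the urn quickly, so sampled documents tend to repeat a few words heavily, mimicking the Toyota excerpt above.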

10.5 Discriminative classiﬁcation

There is a consensus that linear support vector machines (SVMs) are the best known

method for learning classiﬁers for documents. Nonlinear SVMs are not beneﬁcial,

because the number of features is typically much larger than the number of training

examples. For the same reason, choosing the strength of regularization appropriately,

typically by cross-validation, is crucial.

When using a linear SVM for text classiﬁcation, accuracy can be improved con-

siderably by transforming the raw counts. Since counts do not follow Gaussian distri-

butions, it is not sensible to make them have mean zero and variance one. Inspired by

the discussion above of burstiness, it is sensible to replace each count x by log(1+x).

This transformation maps 0 to 0, so it preserves sparsity. A more extreme transforma-

tion that loses information is to make each count binary, that is to replace all non-zero

values by one.

One can also transform counts in a supervised way, that is in a way that uses label

information. This is most straightforward when there are just two classes. Experi-

mentally, the following transformation gives the highest accuracy:
$$x \to \mathrm{sgn}(x)\, \big|\log(tp/fn) - \log(fp/tn)\big|.$$
Above, $tp$ is the number of positive training examples containing the word, and $fn$ is the number of these examples not containing the word, while $fp$ is the number of negative training examples containing the word, and $tn$ is the number of these examples not containing the word. If any of these numbers is zero, we replace it by 0.5, which of course is less than any of these numbers that is genuinely non-zero.

Notice that in the formula above the positive and negative classes are treated in a perfectly symmetric way. The value $|\log(tp/fn) - \log(fp/tn)|$ is large if $tp/fn$ and $fp/tn$ have very different values. In this case the word is highly diagnostic for at least one of the two classes.


The transformation above is sometimes called logodds weighting. Intuitively,

ﬁrst each count x is transformed into a binary feature sgn(x), and then these features

are weighted according to their predictiveness.
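A sketch of log-odds weighting with the 0.5 replacement rule (helper names are illustrative):

```python
import math

def logodds_weight(tp, fn, fp, tn):
    """|log(tp/fn) - log(fp/tn)|, replacing any zero count by 0.5."""
    tp, fn, fp, tn = (v if v > 0 else 0.5 for v in (tp, fn, fp, tn))
    return abs(math.log(tp / fn) - math.log(fp / tn))

def transform(x, weight):
    """sgn(x) times the word's weight: binarize the count, then weight it."""
    return weight if x > 0 else 0.0
```

A word occurring equally often in both classes gets weight zero, while a word with very different occurrence ratios gets a large weight, as discussed above.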

10.6 Clustering documents

Suppose that we have a collection of documents, and we want to ﬁnd an organization

for these, i.e. we want to do unsupervised learning. The simplest variety of unsuper-

vised learning is clustering.

Clustering can be done probabilistically by fitting a mixture distribution to the given collection of documents. Formally, a mixture distribution is a probability density function of the form
$$p(x) = \sum_{k=1}^K \alpha_k\, p(x; \theta_k).$$
Here, $K$ is the number of components in the mixture model. For each $k$, $p(x; \theta_k)$ is the distribution of component number $k$. The scalar $\alpha_k$ is the proportion of component number $k$. Each component is a cluster.

10.7 Topic models

Mixture models, and clusterings in general, are based on the assumption that each

data point is generated by a single component model. For documents, if compo-

nents are topics, it is often more plausible to assume that each document can contain

words from multiple topics. Topic models are probabilistic models that make this

assumption.

Latent Dirichlet allocation (LDA) is the most widely used topic model. It is

based on the intuition that each document contains words from multiple topics; the

proportion of each topic in each document is different, but the topics themselves are

the same for all documents.

The generative process assumed by the LDA model is as follows:

Given: a Dirichlet distribution with parameter vector $\alpha$ of length $K$
Given: a Dirichlet distribution with parameter vector $\beta$ of length $V$
for topic number 1 to topic number $K$:
    draw a multinomial with parameter vector $\phi_k$ according to $\beta$
for document number 1 to document number $M$:
    draw a topic distribution, i.e. a multinomial $\theta$, according to $\alpha$
    for each word in the document:
        draw a topic $z$ according to $\theta$
        draw a word $w$ according to $\phi_z$

Note that $z$ is an integer between 1 and $K$ for each word.

The prior distributions $\alpha$ and $\beta$ are assumed to be fixed and known, as are the number $K$ of topics, the number $M$ of documents, the length $N_m$ of each document, and the cardinality $V$ of the vocabulary (i.e. the dictionary of all words).
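The generative process can be simulated directly. The sketch below samples each document’s topic mixture $\theta$ from the Dirichlet by normalizing independent Gamma draws (a standard equivalent construction); the topic multinomials `phi` are passed in fixed rather than drawn, which is a simplification:

```python
import random

def lda_generate(alpha, phi, doc_lengths, seed=0):
    """Simulate LDA's generative process.  alpha: Dirichlet parameters over
    the K topics; phi: K multinomials, each over the V words of one topic."""
    rng = random.Random(seed)
    docs = []
    for n_m in doc_lengths:
        # theta ~ Dirichlet(alpha), via normalized independent Gamma draws
        g = [rng.gammavariate(a, 1.0) for a in alpha]
        theta = [w / sum(g) for w in g]
        words = []
        for _ in range(n_m):
            z = rng.choices(range(len(alpha)), weights=theta)[0]       # topic
            words.append(rng.choices(range(len(phi[z])), weights=phi[z])[0])
        docs.append(words)
    return docs
```

Each document gets its own $\theta$, while the $\phi_k$ are shared across documents, matching the intuition stated above.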

For learning, the training data are the words in all documents. Learning has two goals: (i) to infer the document-specific multinomial $\theta$ for each document, and (ii) to infer the topic distribution $\phi_k$ for each topic. After training, each $\phi_k$ is a vector of word probabilities indicating the content of topic $k$. The distribution $\theta$ of each document is useful for classifying new documents, measuring similarity between documents, and more.

When applying LDA, it is not necessary to learn $\alpha$ and $\beta$. Steyvers and Griffiths recommend fixed uniform values $\alpha = 50/K$ and $\beta = 0.01$, where $K$ is the number of topics.

10.8 Latent semantic analysis

10.9 Open questions

It would be interesting to work out an extension of the logodds mapping for the multiclass case. One suggestion is
$$x \to \mathrm{sgn}(x) \max_i \big|\log(tp_i/fn_i)\big|$$
where $i$ ranges over the classes, $tp_i$ is the number of training examples in class $i$ containing the word, and $fn_i$ is the number of these examples not containing the word.

It is also not known if combining the log transformation and logodds weighting is beneficial, as in
$$x \to \log(x + 1)\, \big|\log(tp/fn) - \log(fp/tn)\big|.$$


Quiz

(a) Explain why, with a multinomial distribution, “the probabilities of individual documents decline exponentially with length $n$.”

The probability of document $x$ of length $n$ according to a multinomial distribution is
$$p(x; \theta) = \frac{n!}{\prod_{j=1}^m x_j!} \prod_{j=1}^m \theta_j^{x_j}.$$
[Rough argument.] Each $\theta_j$ value is less than 1. In total $n = \sum_j x_j$ of these values are multiplied together. Hence as $n$ increases, the product $\prod_{j=1}^m \theta_j^{x_j}$ decreases exponentially. Note that the multinomial coefficient $n! / \prod_{j=1}^m x_j!$ does increase with $n$, but more slowly.

(b) Consider the multiclass Bayesian classifier
$$\hat{y} = \operatorname*{argmax}_k \frac{p(x \mid y = k)\, p(y = k)}{p(x)}.$$
Simplify the expression inside the argmax operator as much as possible, given that the model $p(x \mid y = k)$ for each class is a multinomial distribution.

The denominator $p(x)$ is the same for all $k$, so it does not influence which $k$ is the argmax. Within the multinomial distributions $p(x \mid y = k)$, the multinomial coefficient does not depend on $k$, so it is constant for a single $x$ and it can be eliminated also, giving
$$\hat{y} = \operatorname*{argmax}_k\; p(y = k) \prod_{j=1}^m \theta_{kj}^{x_j}$$
where $\theta_{kj}$ is the $j$th parameter of the $k$th multinomial.

(c) Consider the classifier from part (b) and suppose that there are just two classes. Simplify the classifier further into a linear classifier.

Let the two classes be $k = 0$ and $k = 1$, so we can write
$$\hat{y} = 1 \text{ if and only if } p(y = 1) \prod_{j=1}^m \theta_{1j}^{x_j} > p(y = 0) \prod_{j=1}^m \theta_{0j}^{x_j}.$$
Taking logarithms and using indicator function notation gives
$$\hat{y} = I\Big(\log p(y = 1) + \sum_{j=1}^m x_j \log \theta_{1j} - \log p(y = 0) - \sum_{j=1}^m x_j \log \theta_{0j} > 0\Big).$$
The classifier simplifies to
$$\hat{y} = I\Big(c_0 + \sum_{j=1}^m x_j c_j > 0\Big)$$
where the coefficients are $c_0 = \log p(y = 1) - \log p(y = 0)$ and $c_j = \log \theta_{1j} - \log \theta_{0j}$. The expression inside the indicator function is a linear function of $x$.

Quiz for May 18, 2010

Your name:

Consider the task of learning to classify documents into one of two classes, using the

bag-of-words representation. Explain why a regularized linear SVM is expected to

be more accurate than a Bayesian classiﬁer using one maximum-likelihood multino-

mial for each class.

Write one or two sentences for each of the following reasons.

(a) Handling words that are absent in one class in training.

(b) Learning the relative importance of different words.


Assignment (2009)

The purpose of this assignment is to compare two different approaches to learning

a classiﬁer for text documents. The dataset to use is called Classic400. It consists

of 400 documents from three categories over a vocabulary of 6205 words. The cate-

gories are quite distinct from a human point of view, and high accuracy is achievable.

First, you should try Bayesian classiﬁcation using a multinomial model for each

of the three classes. Note that you may need to smooth the multinomials with pseu-

docounts. Second, you should train a support vector machine (SVM) classiﬁer. You

will need to select a method for adapting SVMs for multiclass classiﬁcation. Since

there are more features than training examples, regularization is vital.

For each of the two classiﬁer learning methods, investigate whether you can

achieve better accuracy via feature selection, feature transformation, and/or feature

weighting. Because there are relatively few examples, using ten-fold cross-validation

to measure accuracy is suggested. Try to evaluate the statistical signiﬁcance of dif-

ferences in accuracy that you ﬁnd. If you allow any leakage of information from the

test fold to training folds, be sure to explain this in your report.

The dataset is available at http://www.cs.ucsd.edu/users/elkan/

291/classic400.zip, which contains three ﬁles. The main ﬁle cl400.csv is

a comma-separated 400 by 6205 array of word counts. The ﬁle truelabels.csv

gives the actual class of each document, while wordlist gives the string corre-

sponding to the word indices 1 to 6205. These strings are not needed for train-

ing or applying classiﬁers, but they are useful for interpreting classiﬁers. The ﬁle

classic400.mat contains the same three ﬁles in the form of Matlab matrices.

Assignment due on May 18, 2010

The purpose of this assignment is to compare two different approaches to learning a

binary classiﬁer for text documents. The dataset to use is the movie review polarity

dataset, version 2.0, published by Lillian Lee at http://www.cs.cornell.

edu/People/pabo/movie-review-data/. Be sure to read the README

carefully, and to understand the data and task fully.

First, you should try Bayesian classiﬁcation using a multinomial model for each

of the two classes. You should smooth the multinomials with pseudocounts. Sec-

ond, you should train a linear discriminative classiﬁer, either logistic regression or a

support vector machine. Since there are more features than training examples, regu-

larization is vital.

For both classiﬁer learning methods, investigate whether you can achieve better

accuracy via feature selection, feature transformation, and/or feature weighting. Try

to evaluate the statistical signiﬁcance of differences in accuracy that you ﬁnd. Think

carefully about any ways in which you may be allowing leakage of information from

test subsets that makes estimates of accuracy be biased. Discuss this issue in your

report.

Compare the accuracy that you can achieve with accuracies reported in some of

the many published papers that use this dataset. You can ﬁnd links to these papers at

http://www.cs.cornell.edu/People/pabo/movie-review-data/

otherexperiments.html. Also, analyze your trained classiﬁer to identify what

features of movie reviews are most indicative of the review being favorable or unfa-

vorable.

Feedback on this text mining assignment: watch out for leakage and overfitting; over 90% accuracy is achievable. For SVMs, the C parameter is not the strength of regularization but rather its inverse. Those who did not try strong regularization found worse performance with an SVM than with the Bayesian classifier.

Chapter 11

Social network analytics

A social network is a graph where nodes represent individual people and edges rep-

resent relationships.

Example: a telephone network, in which nodes are subscribers and edges are phone calls, as recorded in call detail records (CDRs).

There are two types of data mining one can do with a network: supervised and

unsupervised. The aim of supervised learning is to obtain a model that can predict

labels for nodes, or labels for edges. For nodes, this is sometimes called collective

classiﬁcation. For edges, the most basic label is existence. Predicting whether or not

edges exist is called link prediction.

Examples of tasks that involve collective classiﬁcation: predicting churn of sub-

scribers; recognizing fraudulent applicants.

Examples of tasks that involve link prediction: suggesting new friends on Face-

book, identifying appropriate citations between scientiﬁc papers, predicting which

pairs of terrorists know each other.

11.1 Difﬁculties in network mining

Many networks are not strictly social networks, but have similar properties. Exam-

ples: citation networks, protein interaction networks.

Both collective classiﬁcation and link prediction are transductive problems. Trans-

duction is the situation where the set of test examples is known at the time a classiﬁer

is to be trained. In this situation, the real goal is not to train a classiﬁer. Instead, it is

simply to make predictions about speciﬁc examples. In principle, some methods for

transduction might make predictions without having an explicit reusable classiﬁer.

We will look at methods that do involve explicit classifiers. However, these classifiers will be usable only for nodes that are part of the network that is known


at training time. We will not solve the cold-start problem of making predictions for

nodes that are not known during training.

Social networks, and other graphs in data mining, can be complex. In particular,

edges may be directed or undirected, they can be of multiple types, and/or they can

be weighted. The graph can be bipartite or not.

Given a social network, nodes have two fundamental characteristics. First, they

have identity. This means that we know which nodes are the same and which are

different in the graph, but we do not necessarily know anything else about nodes.

Second, nodes may be associated with vectors that specify the values of features.

These vectors are sometimes called side-information.

For many data mining tasks where the examples are people, there are ﬁve general

types of feature: personal demographic, group demographic, behavioral, incentive,

and social. If group demographic features have predictive power, it is often because

they are reﬂections of social features. For example, if zipcode is predictive of the

brand of automobile a person will purchase, that is likely because of contagion effects

between people who know each other, or see each other on the streets.

11.2 Unsupervised network mining

The aim of unsupervised data mining is often to come up with a model that explains

the statistics of the connectivity of the network. Mathematical models that explain

patterns of connectivity are often time-based, e.g. the “rich get richer” idea. This

type of learning can be fascinating as sociology. For example, it has revealed that

nonsmokers tend to stop being friends with smokers, but thin people do not stop

being friends with obese people. Models of entire networks can lead to predictions

about the evolution of the network that can be useful for making decisions. For

example, one can identify the most inﬂuential nodes, and make special efforts to

reach them.

11.3 Collective classiﬁcation

General approach: Extend each node with a ﬁxed-length vector of feature values

derived from its neighbors, and/or the whole network. A node can have a varying

number of neighbors, but many learning algorithms require ﬁxed-length representa-

tions. For this reason, features based on neighbors are often aggregates.

These features may include:

1. Aggregates over neighbors, computed with operators such as “sum,” “mean,” “mode,” “minimum,” “maximum,” “count,” and “exists.”


2. Low-rank approximation of the adjacency matrix. Empirically, the Enron email adjacency matrix has a rank of only about 2.

3. Low-rank approximation of the Laplacian or modularity matrix.
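Aggregating over a node’s neighbors, as in item 1, can be sketched as follows (illustrative; adjacency stored as a dict of neighbor lists, with a single per-node value to aggregate):

```python
def neighbor_features(node, adj, values):
    """Fixed-length aggregate features over a node's variable-size neighborhood."""
    vals = [values[nbr] for nbr in adj.get(node, [])]
    if not vals:   # isolated node: no neighbors to aggregate over
        return {"count": 0, "sum": 0.0, "mean": 0.0, "min": 0.0, "max": 0.0}
    return {"count": len(vals), "sum": sum(vals),
            "mean": sum(vals) / len(vals), "min": min(vals), "max": max(vals)}

adj = {"a": ["b", "c"], "b": ["a"]}          # hypothetical undirected edges
values = {"a": 1.0, "b": 2.0, "c": 4.0}      # per-node value to aggregate
feats = neighbor_features("a", adj, values)
```

However many neighbors a node has, the feature vector has the same length, which is exactly why aggregates are needed before applying a standard learning algorithm.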

11.4 Link prediction

Link prediction is similar to collective classiﬁcation, but involves some additional

difﬁculties. First, the label to be predicted is typically highly unbalanced: between

most pairs of nodes, there is no edge.

Second, there is a major pitfall in using a linear classiﬁer.

A further distinction is inferring unobserved links versus forecasting future new links; the former is similar to the task of predicting protein-protein interactions.

Nevertheless, the most successful general approach to link prediction is the same

as for collective classiﬁcation: Extend each node with a vector of feature values

derived from the network.

One can also extend each potential edge with a vector of feature values. For

example, for predicting future co-authorship:

1. Keyword overlap is the most informative feature for predicting co-authorship.

2. Sum of neighbors and sum of papers are next most predictive; these features

measure the propensity of an individual as opposed to any interaction between

individuals.

3. Shortest-distance is the most informative graph-based feature.
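A minimal sketch of such per-edge features, in plain Python with invented toy data (the adjacency lists, keyword sets, and paper counts are hypothetical; shortest distance is computed by breadth-first search):

```python
from collections import deque

def shortest_distance(adj, a, b):
    """BFS shortest-path length between nodes a and b; None if unreachable."""
    seen, frontier, dist = {a}, deque([a]), {a: 0}
    while frontier:
        u = frontier.popleft()
        if u == b:
            return dist[u]
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                dist[v] = dist[u] + 1
                frontier.append(v)
    return None

def edge_features(adj, keywords, papers, a, b):
    """Feature vector for the candidate edge (a, b), following the list above."""
    overlap = len(keywords[a] & keywords[b])      # keyword overlap
    neighbor_sum = len(adj[a]) + len(adj[b])      # sum of neighbor counts
    paper_sum = papers[a] + papers[b]             # sum of paper counts
    d = shortest_distance(adj, a, b)
    return [overlap, neighbor_sum, paper_sum, -1 if d is None else d]

# Toy co-authorship network: 0-1 and 1-2 are existing edges; 3 is isolated.
adj = {0: {1}, 1: {0, 2}, 2: {1}, 3: set()}
keywords = {0: {"svm", "kernels"}, 1: {"svm"}, 2: {"networks"}, 3: {"ranking"}}
papers = {0: 5, 1: 2, 2: 7, 3: 1}
print(edge_features(adj, keywords, papers, 0, 2))  # [0, 2, 12, 2]
```

The first two features in the returned vector measure individual propensity, while overlap and distance measure the interaction between the pair.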

Cold-start problem for new nodes. Generalization to new networks.

11.5 Iterative collective classiﬁcation

There are many cases where we want to predict labels for nodes in a social network.

Often the training and test examples, that is the labeled and unlabeled nodes, are

members of the same network, that is, they are not completely separate datasets.

The standard approach to cross-validation is not appropriate when examples are

linked in a network.

In traditional classiﬁcation tasks, the labels of examples are assumed to be inde-

pendent. (This is not the same as assuming that the examples themselves are indepen-

dent; see the discussion of sample selection bias above.) If labels are independent,

then a classiﬁer can make predictions separately for separate examples. However,

if labels are not independent, then in principle a classiﬁer can achieve higher accu-

racy by predicting labels for related examples simultaneously. This situation is called

collective classiﬁcation.

Suppose that examples are nodes in a graph, and nodes are joined by edges.

Edges can have labels and/or weights, so in effect multiple graphs can be overlaid

on the same examples. For example, if nodes are persons then one set of edges may

represent the “same address” relationship while another set may represent the “made

telephone call to” relationship.

Intuitively, the labels of neighbors are often correlated. Given a node x, we would

like to use the labels of the neighbors of x as features when predicting the label of x,

and vice versa. A principled approach to collective classiﬁcation would ﬁnd mutu-

ally consistent predicted labels for x and its neighbors. However, in general there is

no guarantee that mutually consistent labels are unique, and there is no general algo-

rithm for inferring them. Experimentally, a simple iterative algorithm often performs

as well as more sophisticated methods for collective classiﬁcation.

Given a node x, let N(x) be the set of its neighbors. Let S(x) be the bag of

labels of nodes in N(x), where we allow “unknown” as a special label value. Let

g(x) be some representation of S(x), perhaps using aggregate operators. A classiﬁer

is a function f(x, g(x)). A training set may include examples with known labels and

examples with unknown labels. Let L be the training examples with known labels.

The examples that are used in training are the set

E = ∪_{x∈L} N(x).
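Computed literally, this set is the union of the neighbor sets of the labeled nodes; a minimal sketch with a hypothetical adjacency structure:

```python
def training_examples(adj, L):
    """E = union of N(x) over labeled nodes x in L."""
    E = set()
    for x in L:
        E |= adj[x]
    return E

# Hypothetical graph: nodes 0 and 2 are labeled.
adj = {0: {1, 2}, 1: {0}, 2: {0, 3}, 3: {2}}
print(training_examples(adj, {0, 2}))  # {0, 1, 2, 3}
```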

Given a trained classiﬁer f(x, g(x)), the algorithm for classifying test examples is

the following.

Initialization: for each test node x,
    compute N(x), S(x), and g(x)
    compute the prediction ŷ = f(x, g(x))
repeat
    select an ordering R of the nodes x
    for each x according to R:
        let S(x) be the current predicted labels of N(x)
        compute the prediction ŷ = f(x, g(x))
until no change in predicted labels

The algorithm above is purely heuristic, but it is sensible and often effective in prac-

tice.
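The iteration above can be sketched in Python as follows. Here f is a stand-in classifier that simply predicts the mode of the neighbors' current labels; in a real application f would be trained as described earlier, using the representation g(x):

```python
def iterative_classification(adj, labels, max_iters=10):
    """Iteratively predict labels of unlabeled nodes (label None) from the
    current predicted labels of their neighbors. The classifier here is a
    stand-in: predict the most common non-None neighbor label."""
    pred = dict(labels)
    for _ in range(max_iters):
        changed = False
        for x in sorted(pred):                    # a fixed ordering R of nodes
            if labels[x] is not None:
                continue                          # known labels stay fixed
            S = [pred[v] for v in adj[x] if pred[v] is not None]
            if S:
                y = max(set(S), key=S.count)      # mode of neighbor labels
                if y != pred[x]:
                    pred[x] = y
                    changed = True
        if not changed:                           # until no change
            break
    return pred

# Path graph 0-1-2-3 with the two ends labeled; the middle nodes receive
# labels propagated from their neighbors over the iterations.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
labels = {0: "a", 1: None, 2: None, 3: "a"}
print(iterative_classification(adj, labels))
```

On this toy graph the algorithm converges in one pass, assigning label "a" to every node.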

11.6 Other topics

Enron email visualization: http://jheer.org/enron/v1/enron_nice_1.png

Network-focused versus node-focused analysis.

Cascading models of behavior.

Record linkage, alias detection.

Betweenness Centrality: Brandes 2001.

Quiz for June 1, 2010

Your name:

Consider the task of predicting which pairs of nodes in a social network are linked.

Speciﬁcally, suppose you have a training set of pairs for which edges are known to

exist, or known not to exist. You need to make predictions for the remaining edges.

Let the adjacency matrix A have dimension n × n. You have trained a matrix factorization A = UV, where U has dimension n × k for some k < n. Because A is symmetric, V is the transpose of U.

Let u_i be the row of U that represents node i. Let ⟨j, k⟩ be a pair for which you need to predict whether an edge exists. Consider these two possible ways to make this prediction:

1. Predict the dot-product u_j · u_k.

2. Predict f([u_j, u_k]), where [u_j, u_k] is the concatenation of the vectors u_j and u_k, and f is a trained logistic regression model.

Explain which of these two approaches is best, and why. (Do not mention any other approaches, which are outside the scope of this question.)

Assignment due on June 1, 2010

The purpose of this assignment is to apply and evaluate methods to predict the pres-

ence of links that are unknown in a social network. Use either the Cora dataset or

the Terrorists dataset. Both datasets are available at http://www.cs.umd.edu/~sen/lbc-proj/LBC.html.

In the Cora dataset, each node represents an academic paper. Each paper has a

label, where the seven label values are different subareas of machine learning. Each

paper is also represented by a bag-of-words vector of length 1433. The network

structure is that each paper is cited by, or cites, at least one other paper in the dataset.

The Terrorists dataset is more complicated. For explanations see the paper Entity

and relationship labeling in afﬁliation networks by Zhao, Sen, and Getoor from the

2006 ICML Workshop on Statistical Network Analysis, available at http://www.mindswap.org/papers/2006/RelClzPIT.pdf. According to Table 1 and

Section 6 of this paper, there are 917 edges that connect 435 terrorists. There is

information about each terrorist, and also about each edge. The edges are labeled

with types.

For either dataset, your task is to pretend that some edges are unknown, and then

to predict these, based on the known edges and on the nodes. Suppose that you are

doing ten-fold cross-validation on the Terrorists dataset. Then you would pretend

that 92 actual edges are unknown. Based on the 917 − 92 = 825 known edges, and on all 435 nodes, you would predict a score for each of the 435 × 434/2 potential

edges. The higher the score of the 92 held-out edges, the better.

You should use logistic regression to predict the score of each potential edge.

When predicting edges, you may use features obtained from (i) the network structure

of the edges known to be present and absent, (ii) properties of the nodes, and/or

(iii) properties of pairs of nodes. For (i), use a method for converting the network structure into a fixed-length vector of feature values for each node, as discussed in

class. Using one or more of the feature types (i), (ii), and (iii) yields seven alternative

sets of features. Do experiments comparing at least two of these.

Each experiment should use logistic regression and ten-fold cross-validation.

Think carefully about exactly what information is legitimately included in training

folds. Also, avoid the mistakes discussed in Section 5.7 above.
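As one possible sketch of the modeling step (not a prescribed solution to the assignment), logistic regression can be fit to edge feature vectors by plain gradient descent; the single feature and the data below are invented for illustration, and in practice a standard package would be used:

```python
import numpy as np

def train_logistic(X, y, lr=0.5, iters=2000):
    """Fit logistic regression by gradient descent on the log loss
    (a minimal sketch; a standard package would be used in practice)."""
    X = np.hstack([np.ones((len(X), 1)), X])   # prepend an intercept column
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))       # predicted probabilities
        w -= lr * X.T @ (p - y) / len(y)       # gradient of the log loss
    return w

def score(w, x):
    """Probability that the potential edge with feature vector x exists."""
    z = w[0] + np.dot(w[1:], x)
    return 1.0 / (1.0 + np.exp(-z))

# Invented training data: one edge feature (say, keyword overlap);
# candidate edges with high overlap are the ones that exist.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])
w = train_logistic(X, y)
print(round(score(w, [0.0]), 3), round(score(w, [5.0]), 3))
```

Ranking all potential edges by this score, and checking where the held-out edges fall in the ranking, gives the evaluation described above.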

Chapter 12

Interactive experimentation

This chapter discusses data-driven optimization of websites, in particular via A/B

testing.

A/B testing can reveal “willingness to pay.” Compare proﬁt at prices x and y,

based on number sold. Choose the price that yields higher proﬁt.
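The comparison can be sketched as follows; the prices, unit cost, and sales counts here are invented for illustration:

```python
def profit_per_visitor(price, unit_cost, visitors, sales):
    """Expected profit per visitor at a given price."""
    return (price - unit_cost) * (sales / visitors)

# Hypothetical A/B test: the lower price converts more often, but the
# higher price earns more per sale; profit per visitor decides between them.
a = profit_per_visitor(price=20.0, unit_cost=5.0, visitors=1000, sales=50)
b = profit_per_visitor(price=30.0, unit_cost=5.0, visitors=1000, sales=35)
print(a, b)  # 0.75 per visitor at price A, 0.875 at price B
```

Here the higher price wins despite fewer sales, which is exactly the "willingness to pay" information that an A/B test reveals.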

Quiz

The following text is from a New York Times article dated May 30, 2009.

Mr. Herman had run 27 ads on the Web for his client Vespa, the

scooter company. Some were rectangular, some square. And the text

varied: One tagline said, “Smart looks. Smarter purchase,” and dis-

played a $0 down, 0 percent interest offer. Another read, “Pure fun. And

function,” and promoted a free T-shirt.

Vespa’s goal was to ﬁnd out whether a ﬁnancial offer would attract

customers, and Mr. Herman’s data concluded that it did. The $0 down

offer attracted 71 percent more responses from one group of Web surfers

than the average of all the Vespa ads, while the T-shirt offer drew 29

percent fewer.

(a) What basic principle of the scientiﬁc method did Mr. Herman not follow,

according to the description above?

(b) Suppose that it is true that the ﬁnancial offer does attract the highest number of

purchasers. What is an important reason why the other offer might still be preferable

for the company?

(c) Explain the writing mistake in the phrase “Mr. Herman’s data concluded that

it did.” Note that the error is relatively high-level; it is not a spelling or grammar

mistake.

Bibliography

[Huang et al., 2006] Huang, J., Smola, A., Gretton, A., Borgwardt, K. M., and

Schölkopf, B. (2006). Correcting sample selection bias by unlabeled data. In Proceedings of the Neural Information Processing Systems Conference (NIPS 2006).

[Jonas and Harper, 2006] Jonas, J. and Harper, J. (2006). Effective counterterrorism and the limited role of predictive data mining. Technical report, Cato Institute. Available at http://www.cato.org/pub_display.php?pub_id=6784.

[Michie et al., 1994] Michie, D., Spiegelhalter, D. J., and Taylor, C. C. (1994). Ma-

chine Learning, Neural and Statistical Classiﬁcation. Ellis Horwood.

[Vert and Jacob, 2008] Vert, J.-P. and Jacob, L. (2008). Machine learning for in

silico virtual screening and chemical genomics: New strategies. Combinatorial

Chemistry & High Throughput Screening, 11(8):677–685(9).

For example. but the methods that are most useful are mostly the same for applications in science or engineering. the available data may be a customer database. For example. The main alternative to predictive analytics can be called descriptive analytics. In general. the goal of descriptive analytics is to discover patterns in data. along with labels indicating which customers failed to pay their bills.Chapter 1 Introduction There are many deﬁnitions of data mining. business. group of people? In contrast. and elsewhere where data mining is useful. Or perhaps that demographic is saturated. but what should Whole Foods do with the ﬁnding? Often. but contradictory. There are numerous data mining domains in science. and it should aim its marketing at a currently less tapped. but in general it is harder to obtain direct beneﬁt from descriptive analytics than from predictive analytics. Predictive analytics indicates a focus on making predictions. perhaps Whole Foods at should direct its marketing towards additional wealthy and liberal people. predictions can be typically be used directly to make decisions that maximize beneﬁt to the decision-maker. For example. The focus will be on methods for making predictions. We shall focus on applications that are related to business. the ﬁnding is really not useful in the absence of additional knowledge. suppose that customers of Whole Foods tend to be liberal and wealthy. analytics is a newer name for data mining. The goal will then be to predict which other customers might fail to pay in the future. This area is often also called “knowledge discovery in data” or KDD. For example. customers who are more likely 5 . engineering. In a nutshell. This pattern may be noteworthy and interesting. In such a case. We shall take it to mean the application of learning algorithms and statistical methods to real-world datasets. the same ﬁnding suggests two courses of action that are both plausible. 
Finding patterns is often fascinating and sometimes highly useful. different.

Businesses have no motive to send advertising to people who will merely be annoyed and not respond. It is important to understand the difference between a prediction and a decision. Moreover. and to have historical examples of the concept.6 CHAPTER 1. which beneﬁts the people being targeted. increased efﬁciency often comes from improved accuracy in targeting. First. the deﬁnition of the target is based on a counterfactual. in general. Second. the actions to be taken based on predictions need to be deﬁned clearly and to have reliable proﬁt consequences. The target concept is “borrowers that will best respond to modiﬁcations. It is hard to see how this could be a successful application of data mining. Society can use the tax system to spread the beneﬁt of increased proﬁt. but who would pay if given a modiﬁed contract. Data mining lets us make predictions.” From a lender’s perspective (and Fico works for lenders not borrowers) such a borrower is one who would not pay under his current contract. Consider for example this extract from an article in the London Financial Times dated May 13. maximizing proﬁt for a business may be at the expense of consumers. one cannot make progress without a dataset for training of adequate size and quality. so FICO did not have relevant data. lenders had no long historical experience with offering modiﬁcations to borrowers. the company behind the credit score. Especially in 2009. The . For a successful data mining application. Data mining cannot read minds. recently launched a service that pre-qualiﬁes borrowers for modiﬁcation programmes using their inhouse scoring data. and may not beneﬁt society at large. 2009: Fico. There are several responses to this feeling. Some people may feel that the focus in this course on maximizing proﬁt is distasteful or disquieting. After all. it is crucial to have a clear deﬁnition of the concept that is to be predicted. 
Fico can also help lenders ﬁnd borrowers that will best respond to modiﬁcations and learn how to get in touch with them. 1. Lenders pay a small fee for Fico to refer potential candidates for modiﬁcations that have already been vetted for inclusion in the programme. but predictions are useful to an agent only if they allow the agent to make decisions that have better outcomes. maximizing proﬁt in general is maximizing efﬁciency. Second.1 Limitations of predictive analytics It is important to understand the limitations of predictive analytics. because it is hard to see how a useful labeled training set could exist. INTRODUCTION not to pay in the future can have their credit limit reduced now. First. that is on reading the minds of borrowers.

for predictive analytics to be successful. It is not clear that giving people modiﬁcations will really change their behavior dramatically. 2006]. There are so few positive training examples that statistically reliable patterns cannot be learned. and bagging. Here. We will not examine alternative classiﬁer learning methods such as decision trees.1.2 Overview In this course we shall only look at methods that have state-of-the-art accuracy. are all likely to change the behavior of borrowers in the future. OVERVIEW 7 difference between a regular payment and a modiﬁed payment is often small. An even more clear example of an application of predictive analytics that is unlikely to succeed is learning a model to predict which persons will commit a major terrorist act. in systematic but unknown ways. So each modiﬁcation generates a cost that is not restricted to the loss incurred with respect to the individual getting the modiﬁcation. Those who get modiﬁcations may be motivated to return and request further concessions. All these methods are excellent. while the test data arise in the future. We will tread a middle ground between focusing on theory at the expense of applications. that are sufﬁciently simple and fast to be easy to use. and in social attitudes towards foreclosures. If the phenomenon to be predicted is not stable over time. for a successful application it helps if the consequences of actions are essentially independent for different examples. actions must not have major unintended consequences. we shall not look at multiple methods for the same task. Additionally. and that have well-documented successful applications. we will look at support vector machines (SVMs) in detail. A person requesting a modiﬁcation is already thinking of not paying. changes in the general economy. modiﬁcations may change behavior in undesired ways. 1. Moreover. the training data come from the past. and understanding methods only at a cookbook level. for classiﬁer learning. 
for example $200 in the case described in the newspaper article. in the price level of houses. intelligent terrorists will take steps not to ﬁt in with patterns exhibited by previous terrorists [Jonas and Harper. Often. the training data must be representative of the test data. Rational borrowers who hear about others getting modiﬁcations will try to make themselves appear to be candidates for a modiﬁcation. Here. then predictions are likely not to be useful. but it is hard to identify clearly important scenarios in which they are deﬁnitely superior to SVMs. neural networks. For a successful data mining application also. In particular. . Typically. boosting.2. This may be not the case here. Last but not least. when there is one method that is at least as good as all the others from most points of view.

and which is widely used in commercial applications nowadays.8 CHAPTER 1. a nonlinear method that is often superior to linear SVMs. INTRODUCTION We may also look at random forests. .

This type of learning is called “supervised” because of the metaphor that a teacher (i. there are just two label values. and data cleaning and recoding. A classiﬁer is something that can be used to make predictions on test examples. A column in such a table is often called a feature. and an example is one row in such a table. Sometimes it is important to distinguish between a feature. tuple.1 Supervised learning The goal of a supervised learning algorithm is to obtain a classiﬁer by learning from training examples.e. i. In the simplest but most common case. It is essentially the same thing as a table in a relational database. supervised learning is called “regression” and the classiﬁer is called a “regression model. These may be called -1 and +1. The label y for a test example is unknown. in data mining. A training set is a set of vectors with known label values. Each training and test example is represented in the same way. or negative and positive. With n training examples. a supervisor) has provided the true label of each training example.e. or an attribute. the training data are a matrix with n rows and p columns. or 0 and 1. and with each example consisting of values for p different features. In this case.” The word “classiﬁer” is usually reserved for the case where label values are discrete. a predicted y value.Chapter 2 Predictive analytics in general This chapter explains supervised learning. Each element in the vector representing an example is called a feature value. along with 9 . It may be real number or a value of any other type. The output of the classiﬁer is a conjecture about y. and vector are essentially synonyms. Row. linear regression. and a feature value. which is an entire column. or no and yes. 2. Often each label value y is a real number. as a row vector of ﬁxed length p.

True labels are known for training examples. zip codes. often missing values are indicated by strings that look like valid known values.10 CHAPTER 2. Dealing with these is difﬁcult. many aspects are typically ignored in data mining. or who is part-time? A very common difﬁculty is that the value of a given feature is missing for some training and/or test examples. Some training algorithms can handle missingness internally. but not for test examples. Usually. the simplest approach is just to discard all examples (rows) with missing values. Keeping only examples without any missing values is called “complete case analysis. junior. but are critical for human understanding. An important job for a human is to think which features may be predictive. important features (meaning features with predictive power) may be implicit in other features. Usually the names used for the different values of a categorical feature make no difference to data mining algorithms. e. for a student the feature “year” may have values freshman. while its dimensionality is p. but only the day/month/year date is given in the original data. 2. for example integers or monetary amounts. e. such as “day” in the combination “day month year. The cardinality of the training set is n. Even with all the complexity of features.” Moreover.” An equally simple.g. For example. the day of the week may be predictive. and/or have an unwieldy number of different values. but different. PREDICTIVE ANALYTICS IN GENERAL a column vector of y values. Other features are numerical but not real-valued. Sometimes categorical features have names that look numerical. such as 0 (zero). sophomore. Difﬁculty in determining feature values is also ignored: for example. approach is to discard all features (columns) with missing val- . there is a lot of variability and complexity in features. It is important not to treat a value that means missing inadvertently as a valid regular value. The label of example i is yi . 
No learning algorithm can be expected to discover day-of-week automatically as a function of day/month/year. based on understanding the application domain. missing values are indicated by question marks. Many features are categorical. how does one deﬁne the year of a student who has transferred from a different college. However.g. and then to write software that makes these features explicit. If not. units such as dollars and dimensions such as kilograms are omitted.2 Data cleaning and recoding In real-world data. and senior. Often. It is also difﬁcult to deal with features that are really only meaningful in conjunction with other features. We use the notation xij for the value of feature number j of example number i. Some features are real-valued.

then it is reasonable to replace each missing value by the mean or mode of the non-missing values. An intelligent way to recode discrete predictors is to replace each discrete value by the mean of the target conditioned on that discrete value. non-overlapping. For example. Mathematically it is preferable to use only k − 1 real-valued features. This idea is especially useful as The ordering of the values. Otherwise. One can set boundaries for the bins so that each bin has equal width.0. For example. the best way to recode a feature that has k different categorical values is to use k real-valued features. which value is associated with j = 1.” Each bin is given an arbitrary name.e. it is often beneﬁcial to create an additional binary feature that is 0 for missing and 1 for present. Therefore.0. The values of a binary feature can be recoded as 0. VT. It often works well to use ten bins. set all k − 1 features equal to 0. over 20) are often difﬁcult to deal with. CT. the bins are “equal count. then it may be reasonable to leave this as a categorical feature with 50 values. This process is called imputation. If a feature with missing values is retained. i. The range of the numerical values is partitioned into a ﬁxed number of intervals that are called bins. set the jth of these features equal to 1.0 and set all k − 1 others equal to 0.0. or one can set boundaries so that each bin contains approximately the same number of training examples. For the jth categorical value where j < k. i.2. For these. since these indicate meaningful regions of the United States. The word “partitioned” means that the bins are exhaustive and mutually exclusive.e. features that are numerical can be discretized. Also. but they are not always better.0. and “true” or “yes” as 1.2. NH. However. is arbitrary.0.0 and set all k − 1 others equal to 0. the boundaries are regularly spaced. human intervention is needed to recode them intelligently. 
If one has a large dataset with many examples from each of the 50 states. For these. For the last categorical value. ME. 1 . Each numerical value is then replaced by the name of the bin in which the value lies. categorical features must be made numerical. For the jth categorical value. set the jth feature value to 1. if the average label value is 20 for men versus 16 for women. etc.e. Typically. zipcodes may be recoded as just their ﬁrst one or two letters. these values could replace the male and female values of a variable for gender. Other training algorithms can only handle real-valued features. in many applications. it may be useful to group small states together in sensible ways.e.1 Categorical features with many values (say. i. DATA CLEANING AND RECODING 11 ues. It is conventional to code “false” or “no” as 0. Usually.0 or 1.. Some training algorithms can only handle categorical features. More sophisticated imputation procedures exist. both these approaches eliminate too much useful training data. i. for example to create a New England group for MA. the fact that a particular feature is missing may itself be a useful predictor.0.

12 CHAPTER 2.5. Write x = x1 . xip . suppose xi is a binary feature where xi = 0 means female and xi = 1 means male.3 Linear regression Let x be an instance and let y be its real-valued label. . It is the value of y predicted by the model if xi = 0 for all i. After conditional-mean new values have been created. The predicted value for xi is p yi = f (xi . and suppose bi = −2. A major advantage of the conditionalmeans approach is that it avoids an explosion in the dimensionality of training and test examples. . The coefﬁcient b0 is called the intercept. . i. . . Normalization. The linear regression model is y = b0 + b1 x1 + b2 x2 + . Mixed types. 2. x must be a vector of real numbers of ﬁxed length. x2 . Let b be any set of coefﬁcients. Then the predicted y value for males is lower by 2. the standard way to recode a discrete feature with m values is to introduce m − 1 binary features. The linear function is deﬁned by its coefﬁcients b0 to bp . . PREDICTIVE ANALYTICS IN GENERAL a way to convert a discrete feature with many values. Of course. xp . . For example. everything else being held constant. The righthand side above is called a linear function of x. they can be scaled to have zero mean and unit variance in the same way as other features. or dimensionality. yi . With this standard approach. Suppose that the training set has cardinality n. it may be completely unrealistic that all features xi have value zero. However. where xi = xi1 . the training algorithm can learn a coefﬁcient for each new feature that corresponds to an optimal numerical value for the corresponding discrete value. + bp xp . if the value of all other features is unchanged.e.5. it consists of n examples of the form xi . For linear regression. as just explained. states. b) = b0 + ˆ j=1 bj xij . . but they will not yield better predictions than the coefﬁcients learned in the standard approach. Remember that this length p is often called the dimension. . Sparse data. . 
Conditional means are likely to be meaningful and useful. . The coefﬁcient bi is the amount by which the predicted y value increases if xi increases by 1. These coefﬁcients are the output of the data mining algorithm. of x. for example the 50 U.S. into a useful single numerical feature.

Let the linear model be y = b_0 + b_1 x_1 + ... + b_p x_p. If we define x_i0 = 1 for every i, then we can write

    ŷ_i = Σ_{j=0}^p b_j x_ij.

The constant x_i0 = 1 can be called a pseudo-feature.

Finding the optimal values of the coefficients b_0 to b_p is the job of the training algorithm. To make this task well-defined, we need a definition of what "optimal" means. The standard approach is to say that optimal means minimizing the sum of squared errors on the training set, where the squared error on training example i is (y_i − ŷ_i)². The training algorithm then finds

    b̂ = argmin_b Σ_{i=1}^n (f(x_i; b) − y_i)².

The semicolon in the expression f(x_i; b) emphasizes that the vector x_i is a variable input, while b is a fixed set of parameter values. Note that during training the n different x_i and y_i values are fixed, while the parameters b are variable.

The objective function Σ_i (y_i − Σ_j b_j x_ij)² is called the sum of squared errors, or SSE for short.

The optimal coefficient values b̂ are not defined uniquely if the number n of training examples is less than the number p of features. Even if n > p is true, the optimal coefficients have multiple equivalent values if some features are themselves related linearly. Here, "equivalent" means that the different sets of coefficients achieve the same minimum SSE. For an intuitive example, suppose features 1 and 2 are height and weight respectively, where the units of measurement are inches and pounds. Suppose that approximately x_2 = 120 + 5(x_1 − 60) = −180 + 5 x_1. Then the same model can be written in many different ways:

• y = b_0 + b_1 x_1 + b_2 x_2
• y = b_0 + b_1 x_1 + b_2 (−180 + 5 x_1) = [b_0 − 180 b_2] + [b_1 + 5 b_2] x_1 + 0 · x_2

and more.

In the extreme, suppose x_1 = x_2. Consider all models y = b_0 + b_1 x_1 + b_2 x_2 for which b_1 + b_2 equals a constant c. All these models are equivalent. Among them, there is a unique one that minimizes the function b_1² + b_2², namely the model with b_1 = b_2 = c/2. We can obtain this particular model by setting the objective function for training to be the sum of squared errors plus a function that penalizes large values of the coefficients.

When two or more features are approximately related linearly, the true values of the coefficients of those features are not well determined. The coefficients obtained by training will be strongly influenced by randomness in the training data. Regularization is a way to reduce the influence of this type of randomness.
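The extreme case above can be checked numerically. The sketch below (plain numpy, toy data) duplicates a feature exactly; numpy's least-squares routine returns the minimum-norm solution for a rank-deficient system, which splits the shared coefficient evenly, exactly as minimizing b_1² + b_2² predicts.

```python
import numpy as np

# Toy data where feature 2 is an exact copy of feature 1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
X = np.column_stack([np.ones(100), x1, x1])   # intercept, x1, x2 = x1
y = 3.0 + 4.0 * x1 + rng.normal(scale=0.1, size=100)

# Any coefficients with b1 + b2 = 4 give the same SSE. np.linalg.lstsq
# returns the minimum-norm solution, so the weight is split evenly:
# b1 = b2 = c/2, matching the argument in the text.
b = np.linalg.lstsq(X, y, rcond=None)[0]
print(b)   # b[1] and b[2] are equal; their sum is close to 4
```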

A simple penalty function of this type is Σ_{j=1}^p b_j². Note that in this formula the sum excludes the intercept coefficient b_0. One reason for excluding b_0 is that the target y values are typically not normalized.

A parameter λ can control the relative importance of the two objectives, namely SSE and penalty:

    b̂ = argmin_b (1/n) Σ_{i=1}^n (y_i − ŷ_i)² + λ (1/p) Σ_{j=1}^p b_j².

If λ = 0 then one gets the standard least-squares linear regression solution. As λ gets larger, the penalty on large coefficients gets stronger, and the typical values of coefficients get smaller. The parameter λ is often called the strength of regularization. The fractions 1/n and 1/p do not make an essential difference; they can be used to make the numerical value of λ easier to interpret.

The penalty function Σ_{j=1}^p b_j² is the square of the L2 norm of the vector b. Using it for linear regression is called ridge regression. Any penalty function that treats all coefficients b_j equally, like the L2 norm does, is sensible only if the typical magnitudes of the values of each feature are similar; this is an important motivation for data normalization.

2.4 Interpreting coefficients of a linear model

It is common to desire a data mining model that is interpretable, that is one that can be used not only to make predictions, but also to understand mechanisms in the phenomenon that is being studied. Linear models, whether for regression or for classification as described in later chapters, do appear interpretable at first sight. However, much caution is needed when attempting to derive conclusions from numerical coefficients. Consider the linear regression model in the table below for predicting high-density lipoprotein (HDL) cholesterol levels.² Predictors have been reordered from most to least statistically significant, as measured by p-value.

² HDL cholesterol is considered beneficial and is sometimes called "good" cholesterol. Source: http://www.jerrydallal.com/LHSP/importnt.htm

predictor   coefficient   std.error    Tstat   p-value
intercept       1.16448     0.28804     4.04    <.0001
BMI            -0.01205     0.00295    -4.08    <.0001
LCHOL           0.31109     0.10936     2.84    0.0051
GLUM           -0.00046     0.00018    -2.50    0.0135
DIAST           0.00255     0.00103     2.47    0.0147
BLC             0.05055     0.02215     2.28    0.0239
PRSSY          -0.00041     0.00044    -0.95    0.3436
SKINF           0.00147     0.00183     0.81    0.4221
AGE            -0.00092     0.00125    -0.74    0.4602

From most to least statistically significant, the predictors are body mass index, the log of total cholesterol, GLUM, diastolic blood pressure, vitamin C level in blood, systolic blood pressure, skinfold thickness, and age in years. (It is not clear what GLUM is.)

The example illustrates at least two crucial issues. First, if predictors are collinear, then one may appear significant and the other not, when in fact both are significant or both are not. Above, diastolic blood pressure is statistically significant, but systolic blood pressure is not. This may possibly be true for some physiological reason, but it may also be an artifact of collinearity. Second, a predictor may be practically important, and statistically significant, but still useless for interventions. This happens if the predictor and the outcome have a common cause, or if the outcome causes the predictor.

A third crucial issue is that a correlation may disagree with other knowledge and assumptions. Vitamin C is generally considered beneficial or neutral, so if lower vitamin C was associated with higher HDL, one would be cautious about believing this relationship, even if the association was statistically significant. Above, vitamin C is statistically significant, and the association is positive. If the association is causal, then merely taking a vitamin C supplement will cause an increase in HDL level. But it may be that vitamin C is simply an indicator of a generally healthy diet high in fruits and vegetables, in which case a supplement would not have this effect.

2.5 Evaluating performance

In a real-world application of supervised learning, we have a training set of examples with labels, and a test set of examples with unknown labels. For the test examples, the true label is genuinely unknown, and the whole point is to make predictions for them. However, in research or experimentation we want to measure the performance achieved by a learning algorithm. To do this we use a test set consisting of examples with known labels.
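The first issue, collinearity making individual coefficients unstable, is easy to demonstrate. The sketch below (plain numpy, synthetic data standing in for two strongly related predictors such as systolic and diastolic pressure) refits the model on 200 resampled datasets: each individual coefficient swings wildly from sample to sample, while their sum stays stable.

```python
import numpy as np

# Two nearly collinear predictors, refit on many resampled datasets.
rng = np.random.default_rng(0)
pairs = []
for _ in range(200):
    z = rng.normal(size=200)                    # shared underlying signal
    x1 = z + rng.normal(scale=0.05, size=200)   # near-duplicate predictor 1
    x2 = z + rng.normal(scale=0.05, size=200)   # near-duplicate predictor 2
    y = z + rng.normal(size=200)
    X = np.column_stack([np.ones(200), x1, x2])
    pairs.append(np.linalg.lstsq(X, y, rcond=None)[0][1:])
pairs = np.array(pairs)

# Individual coefficients vary greatly between samples; their sum does not.
print(pairs.std(axis=0), pairs.sum(axis=1).std())
```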

We train the classifier on the training set, apply it to the test set, and then measure performance by comparing the predicted labels with the true labels, which were not available to the training algorithm.

It is absolutely vital to measure the performance of a classifier on an independent test set. Every training algorithm looks for patterns in the training data, i.e. correlations between the features and the class. Some of the patterns discovered may be spurious, i.e. valid in the training data due to randomness in how the training data was selected from the population, but not valid, or not as strong, in the whole population. A classifier that relies on these spurious patterns will have higher accuracy on the training examples than it will on the whole population. The phenomenon of relying on patterns that are strong only in the training data is called overfitting. In practice it is an omnipresent danger. Only accuracy measured on an independent test set is a fair estimate of accuracy on the whole population.

Most training algorithms have some settings that the user can choose between. For ridge regression the main algorithmic parameter is the degree of regularization λ. Other algorithmic choices are which sets of features to use. It is natural to run a supervised learning algorithm many times, and to measure the accuracy of the function (classifier or regression function) learned with different settings, in order to pick the best settings. A set of labeled examples used to measure accuracy with different algorithmic settings is called a validation set. If you use a validation set, it is important to have a final test set that is independent of both the training set and the validation set. For fairness, the final test set must be used only once. Its only purpose is to estimate the true accuracy achievable with the settings previously chosen using the validation set.

Dividing the available data into training, validation, and test sets should be done randomly, in order to guarantee that each set is a random sample from the same distribution. Nevertheless, a very important real-world issue is that future real test examples, for which the true label is genuinely unknown, may not be a sample from the same distribution.
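The three-way division can be sketched as follows. This is plain Python; the 60/20/20 proportions are illustrative, not prescribed by the text.

```python
import numpy as np

# Randomly divide 1000 examples into training, validation, and test sets.
rng = np.random.default_rng(0)
idx = rng.permutation(1000)
train, val, test = idx[:600], idx[600:800], idx[800:]

# train:      fit one model per candidate setting
# validation: pick the setting whose model scores best here
# test:       used exactly once, to estimate accuracy of the chosen setting
print(len(train), len(val), len(test))
```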

Quiz question

Write your name below, and then answer the following three parts with one or two sentences each.

Suppose you are building a model to predict how many dollars someone will spend at Sears. You know the gender of each customer, male or female. Since you are using linear regression, you must recode this discrete feature as continuous. You decide to use two real-valued features, x11 and x12. The coding is a standard "one of n" scheme, as follows:

    gender   x11   x12
    male      1     0
    female    0     1

Learning from a large training set yields the model

    y = ... + 15 x11 + 75 x12 + ...

Dr. Roebuck says "Aha! The average woman spends $75, but the average man spends only $15."

(a) Explain why Dr. Roebuck's conclusion is not valid. The model only predicts spending of $75 for a woman if all other features have value zero. This may not be true for the average woman, and indeed it will not be true for any woman if features such as "age" are not normalized.

(b) Explain what conclusion can actually be drawn from the numbers 15 and 75. The conclusion is that if everything else is held constant, then on average a woman will spend $60 more than a man. Note that if some feature values are systematically different for men and women, then even this conclusion is not useful, because it is not reasonable to hold all other feature values constant.

(c) Explain a desirable way to simplify the model. The two features x11 and x12 are linearly related. Hence, in the absence of regularization, they make the optimal model be undefined. It would be good to eliminate one of these two features; the expressiveness of the model would be unchanged.

Quiz for April 6, 2010

Your name:

Suppose that you are training a model to predict how many transactions a credit card customer will make. You know the education level of each customer. Since you are using linear regression, you recode this discrete feature as continuous. You decide to use two real-valued features, x37 and x38. The coding is a "one of two" scheme, as follows:

    education          x37   x38
    college grad        1     0
    not college grad    0     1

Learning from a large training set yields the model

    y = ... + 5.5 x37 + 3.2 x38 + ...

(a) Dr. Goldman concludes that the average college graduate makes 5.5 transactions. Explain why Dr. Goldman's conclusion is likely to be false. The model only predicts 5.5 transactions for a college graduate if all other features, including the intercept, have value zero. This may not be true for the average college grad.

(b) Dr. Sachs concludes that being a college graduate causes a person to make 5.5 − 3.2 = 2.3 more transactions, on average. Explain why Dr. Sachs' conclusion is likely to be false also. First, 2.3 is not the average difference in predicted y value between the two groups if any other feature, for example income, has different values on average for the groups. Said another way, it is unlikely to be reasonable to hold all other feature values constant when comparing groups, and the conclusion will certainly be false if features such as "age" are not normalized. Second, even if 2.3 is the average difference, one cannot say that this difference is caused by being a college graduate. There may be some other unknown common cause.

Linear regression assignment

This assignment is due at the start of class on Tuesday April 12, 2011. You should work in a team of two. Choose a partner who has a different background from you.

Download the file cup98lrn.zip from http://archive.ics.uci.edu/ml/databases/kddcup98/kddcup98.html. Read the associated documentation. Load the data into Rapidminer (or other software for data mining such as R). Select the 4843 records that have feature TARGET_B=1. Save these as a native-format Rapidminer example set. To make your work more efficient, be sure to save the 4843 records in a format that Rapidminer can load quickly.

Now, build a linear regression model to predict the field TARGET_D as accurately as possible. Use root mean squared error (RMSE) as the definition of error. Do a combination of the following:

• Discard useless features.
• Transform features to improve their usefulness.
• Recode non-numerical features as numerical.
• Compare different strengths of ridge regularization.

If you normalize all input features, and you use strong regularization (ridge parameter 10^7 perhaps), then the regression coefficients will indicate the relative importance of features. Do the steps above repeatedly in order to explore alternative ways of using the data. Use ten-fold cross-validation to measure RMSE. You can use three-fold cross-validation during development to save time. The outcome should be the best possible model that you can find that uses 30 or fewer of the original features.

The deliverable is a brief report that is formatted similarly to this assignment description. Describe what you did that worked, and your results. Explain any assumptions that you made, and any limitations on the validity or reliability of your results. Include your final regression model. If you use Rapidminer, include a printout of your final process. Organize your report logically, not chronologically. Write in the present tense. Do not speculate about future work, and do not explain ideas that did not work.
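The core measurement loop of the assignment, ridge regression evaluated by ten-fold cross-validated RMSE, can be sketched in plain numpy rather than Rapidminer. The synthetic data and the λ values below are illustrative only.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: solve (X'X + lam*I) b = X'y.
    Column 0 of X is the intercept, which is not penalized."""
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

def cv_rmse(X, y, lam, k=10, seed=0):
    """k-fold cross-validated root mean squared error."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), k)
    sq = []
    for i in range(k):
        tr = np.concatenate(folds[:i] + folds[i + 1:])
        b = ridge_fit(X[tr], y[tr], lam)
        sq.append((X[folds[i]] @ b - y[folds[i]]) ** 2)
    return float(np.sqrt(np.concatenate(sq).mean()))

rng = np.random.default_rng(1)
x = rng.normal(size=(300, 5))
X = np.column_stack([np.ones(300), x])
y = x @ rng.normal(size=5) + rng.normal(scale=2.0, size=300)
for lam in [0.0, 1.0, 100.0]:
    print(lam, round(cv_rmse(X, y, lam), 3))   # compare regularization strengths
```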

Comments on the regression assignment

It is typically useful to rescale predictors to have mean zero and variance one. However, it loses interpretability to rescale the target variable. Note that if all predictors have mean zero, then the intercept of a linear regression model is the mean of the target, $15.6243 here, assuming that the intercept is not regularized.

The assignment specifically asks you to report root mean squared error, RMSE. One could also report mean squared error, MSE, but whichever is chosen should be used consistently. In general, do not confuse readers by switching between multiple performance measures without a good reason.

In general, it is not a good idea to begin by choosing a subset of the 480 original features based on human intuition, because it is easy for human intuition to pick an initial set that is bad. Despite this directive, some teams did begin this way. The teams that did this all omitted features that in fact would have made their final models considerably better, including sometimes the feature LASTGIFT, the dollar amount of the person's most recent gift. The most informative single feature is LASTGIFT, and, as is often the case, good performance can be achieved with a very simple model: a model based on just this single feature achieves RMSE of $9.98. Still, it is possible to do significantly better. In 2009, three of 11 teams achieved similar final RMSEs that were slightly better than $9.00. The two teams that omitted LASTGIFT achieved RMSE worse than $11.00. For a similar reason, it is also not a good idea to eliminate automatically all features with missing values.

The assignment asks you to produce a final model based on at most 30 of the original features. Rapidminer has operators that search for highly predictive subsets of variables, but these operators have two major problems. First, they are too slow to be used on a large initial set of variables. Second, these operators try a very large number of alternative subsets of variables, and pick the one that performs best on some dataset. Because of the high number of alternatives considered, this subset is likely to overfit substantially the dataset on which it is best. For more discussion of this problem, see the section on nested cross-validation in a later chapter.
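The remark above that centering the predictors makes the unregularized intercept equal the mean of the target is easy to verify. A quick numpy check on toy data:

```python
import numpy as np

# With every predictor centered to mean zero, the unregularized
# least-squares intercept equals the mean of the target.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
x = x - x.mean(axis=0)                       # center each predictor exactly
y = 10.0 + x @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=200)
X = np.column_stack([np.ones(200), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
print(b[0], y.mean())                        # the two values agree
```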

Chapter 3

Introduction to Rapidminer

By default, many Java implementations allocate so little memory that Rapidminer will quickly terminate with an "out of memory" message. This is not a problem with Windows 7, but otherwise, launch Rapidminer with a command like

    java -Xmx1g -jar rapidminer.jar

where 1g means one gigabyte.

3.1 Standardization of features

The recommended procedure is as follows, in order.

• Convert each nominal feature with k alternative values into k different binary features.
• Optionally, drop all binary features with fewer than 100 examples for either binary value.
• Convert each binary feature into a numerical feature with values 0.0 and 1.0.
• Normalize all numerical features to have mean zero and variance 1.

It is not recommended to normalize numerical features obtained from binary features. The reason is that this normalization would destroy the sparsity of the dataset, and hence make some efficient training algorithms much slower. (Note that destroying sparsity is not an issue when the dataset is stored as a dense matrix, which it is by default in Rapidminer.)

Normalization is also questionable for variables whose distribution is highly uneven. For example, almost all donation amounts are under $50, but a few are up to $500.
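The recommended steps can be sketched in plain numpy. This is a toy illustration: the rare-value threshold is scaled down from 100 to 2, and the log(x + 1) mapping for uneven variables, discussed below, is included at the end.

```python
import numpy as np

def one_hot(values):
    """Convert a nominal feature into k binary 0/1 features."""
    cats = sorted(set(values))
    return np.array([[1.0 if v == c else 0.0 for c in cats]
                     for v in values]), cats

state = ["CA", "NY", "CA", "TX", "NY", "CA", "CA", "NY"]
B, cats = one_hot(state)

# Drop binary features where either value is too rare
# (threshold 2 here, 100 in the text).
counts = B.sum(axis=0)
B = B[:, (counts >= 2) & (len(B) - counts >= 2)]

# Z-score ordinary numerical features, but not the 0/1 columns,
# which would lose their sparsity.
age = np.array([34.0, 51.0, 29.0, 42.0, 60.0, 45.0, 38.0, 55.0])
age_z = (age - age.mean()) / age.std()

# For a very uneven variable such as donation amount, a log transform
# keeps zeros at zero and compresses the long tail.
gift = np.array([0.0, 0.0, 5.0, 10.0, 0.0, 25.0, 50.0, 500.0])
gift_log = np.log(gift + 1.0)
print(B.shape, round(float(age_z.mean()), 6), gift_log[:3])
```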

For a variable with a highly uneven distribution, no linear normalization, whether by z-scoring or by transformation to the range 0 to 1 or otherwise, will allow a linear classifier to discriminate well between different common values of the variable. For such variables, it is often useful to apply a nonlinear transformation that makes their distribution have more of a uniform or Gaussian shape. A common transformation that often achieves this is to take logarithms. A useful trick when zero is a frequent value of a variable x is to use the mapping x ↦ log(x + 1). This leaves zeros unchanged and hence preserves sparsity.

Eliminating binary features for which either value has fewer than 100 training examples is a heuristic way to prevent overfitting and at the same time make training faster.

If regularization is not used, then at least one binary feature must be dropped from each set created by converting a multivalued feature, in order to prevent the existence of multiple equivalent models. For efficiency and interpretability, it is best to drop the binary feature corresponding to the most common value of the original multivalued feature. However, if regularization is used during training, dropping a binary feature may not be necessary, or beneficial.

If you have separate training and test sets, it is important to do all preprocessing in a perfectly consistent way on both datasets. The simplest approach is to concatenate both sets before preprocessing, and then split them after. However, concatenation allows information from the test set to influence the details of preprocessing. This is a form of information leakage from the test set, that is of using the test set during training, which is not legitimate.

3.2 Example of a Rapidminer process

Figure 3.1 shows a tree structure of Rapidminer operators that together perform a standard data mining task. The tree illustrates how to perform some common subtasks.

At the top of the Rapidminer window there is an icon that is a picture of either a person with a red sweater, or a person with an academic cap. Click on the icon to toggle between these. When the person in red is visible, you are in expert mode, where you can see all the parameters of each operator. You select an operator by clicking on it in the left pane. Then, in the right pane, the Parameters tab shows the arguments of the operator. If you have two instances of the same operator, you can identify them with different names; "Read training examples" is a name for one particular instance of its operator.

The first operator is ExampleSource. This operator has a parameter named attributes. Normally this is a file name with extension "aml." The corresponding file should contain a schema definition for a dataset. The actual data are in a file with the same name with extension "dat."

Figure 3.1: Rapidminer process for regularized linear regression. The operator tree is:

    Root node of Rapidminer process (Process)
        Read training examples (ExampleSource)
        Select features by name (FeatureNameFilter)
        Convert nominal to binary (Nominal2Binominal)
        Remove almost-constant features (RemoveUselessAttributes)
        Convert binary to real-valued (Nominal2Numerical)
        Z-scoring (Normalization)
        Scan for best strength of regularization (GridParameterOptimization)
            Cross-validation (XValidation)
                Regularized linear regression (W-LinearRegression)
                ApplierChain (OperatorChain)
                    Applier (ModelApplier)
                    Compute RMSE on test fold (RegressionPerformance)

The easiest way to create the "aml" and "dat" files is by clicking on "Start Data Loading Wizard." The first step with this wizard is to specify the file to read data from. In the next panel, you specify the delimiter that divides fields within each row of data, the character that begins comment lines, and the decimal point character. If you choose the wrong delimiter, the data preview at the bottom will look wrong. In the next panel, tick the box to make the first row be interpreted as field names. If this is not true for your data, the easiest fix is to make it true outside Rapidminer, with a text editor or otherwise.

When you click Next on this panel, all rows of data are loaded. Error messages may appear in the bottom pane. Ticking the box for "use double quotes" can avoid some error messages. If there are no errors and the data file is large, then Rapidminer appears to hang with no visible progress; you can use a CPU monitor to see what Rapidminer is actually doing. The same thing happens if you click Previous from the following panel.

The next panel asks you to specify the type of each attribute. The wizard guesses this based only on the first row of data, so it often makes mistakes, which you have to fix by trial and error. The following panel asks you to say which features are special. The most common special choice is "label," which means that an attribute is a target to be predicted. Finally, you specify a file name that is used with "aml" and "dat" extensions to save the data in Rapidminer format.

The way to indicate nesting of operators is by dragging and dropping. First create the inner operator subtree. Then insert the outer operator. Then drag the root of the inner subtree, and drop it on top of the outer operator.

In order to convert a discrete feature with k different values into k real-valued 0/1 features, two operators are needed. The first is Nominal2Binominal, while the second is Nominal2Numerical. Note that the documentation of the latter operator in Rapidminer is misleading: it cannot convert a discrete feature into multiple numerical features directly. The operator Nominal2Binominal is quite slow, so applying it to discrete features with more than 50 alternative values is not recommended.

To keep just features with certain names, use the operator FeatureNameFilter. Let the argument skip_features_with_name be .* and let the argument except_features_with_name identify the features to keep. In our sample process, it is

    (.*AMNT.*)|(.*GIFT.*)|(YRS.*)|(.*MALE)|(STATE)|(PEPSTRFL)|(MDM.*)|(RFA_2.*)

The simplest way to find a good value for an algorithm setting is to use the XValidation operator nested inside the GridParameterOptimization operator.

3.3 Other notes on Rapidminer

This section lists additional topics: reusing existing processes; saving datasets; operators from Weka versus from elsewhere; comments on specific operators (Nominal2Numerical, Nominal2Binominal, RemoveUselessFeatures); and cleaning data by eliminating non-alphanumerical characters, trimming lines, trimming commas, and using quote marks.


Chapter 4

Support vector machines

This chapter explains soft-margin support vector machines (SVMs), including linear and nonlinear kernels. It also discusses detecting overfitting via cross-validation, and preventing overfitting via regularization.

We have seen how to use linear regression to predict a real-valued label. Now we will see how to use a similar model to predict a binary label. In later chapters, where we think probabilistically, it will be convenient to assume that a binary label takes on true values 0 or 1. However, in this chapter it will be convenient to assume that the label y has true value either +1 or −1.

4.1 Loss functions

Let x be an instance, let y be its true label, and let f(x) be a prediction. Assume that the prediction is real-valued. If we need to convert it into a binary prediction, we will threshold it at zero:

    ŷ = 2 · I(f(x) ≥ 0) − 1

where I() is an indicator function that is 1 if its argument is true, and 0 if its argument is false.

A so-called loss function l measures how good the prediction is. Typically loss functions do not depend directly on x, a loss of zero corresponds to a perfect prediction, and otherwise loss is positive. The most obvious loss function is

    l(f(x), y) = I(f(x) ≠ y)

which is called the 0-1 loss function. The usual definition of accuracy uses this loss function. However, the 0-1 loss function has undesirable properties. First, it loses information: it does not distinguish between predictions f(x) that are almost right, and those that are very wrong.

Second, the derivative of the 0-1 loss function is either undefined or zero everywhere, so it is difficult to use in training algorithms that try to minimize loss via gradient descent.

A better-behaved loss function is squared error:

    l(f(x), y) = (f(x) − y)²

which is infinitely differentiable everywhere, and does not lose information when the prediction f(x) is real-valued. However, this loss function says that the prediction f(x) = 1.5 is as undesirable as f(x) = 0.5 when the true label is y = 1. Intuitively, if the true label is +1, then a prediction with the correct sign that is greater than 1 should not be considered incorrect.

The following loss function, which is called hinge loss, satisfies the intuition just suggested:

    l(f(x), y) = max{0, 1 − y f(x)}.

The hinge loss function deserves some explanation. Suppose the true label is y = 1. Then the loss is zero as long as the prediction f(x) ≥ 1. The loss is positive, but less than 1, if 0 < f(x) < 1. The loss is large, i.e. greater than 1, if f(x) < 0. Using hinge loss is the first major insight behind SVMs.

An SVM classifier f is trained to minimize hinge loss. The training process aims to achieve predictions f(x) ≥ 1 for all training instances x with true label y = +1, and to achieve predictions f(x) ≤ −1 for all training instances x with y = −1. Overall, training seeks to classify points correctly, and to distinguish clearly between the two classes, but it does not seek to make predictions be exactly +1 or −1. In this sense, the training process intuitively aims to find the best possible classifier, without trying to satisfy any unnecessary additional objectives also.

4.2 Regularization

Given a set of training examples ⟨x_i, y_i⟩ for i = 1 to i = n, the total training loss (sometimes called empirical loss) is the sum Σ_{i=1}^n l(f(x_i), y_i). Suppose that the function f is selected by minimizing average training loss:

    f = argmin_{f ∈ F} (1/n) Σ_{i=1}^n l(f(x_i), y_i)

where F is a space of candidate functions. If F is too flexible, and/or the training set is too small, then we run the risk of overfitting the training data. But if F is too restricted, then we run the risk of underfitting the data.
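The three regimes of the hinge loss described above are easy to tabulate:

```python
def hinge(fx, y):
    """Hinge loss max(0, 1 - y*f(x)) for labels y in {-1, +1}."""
    return max(0.0, 1.0 - y * fx)

# True label +1: zero loss once the prediction clears the margin at 1,
# loss below 1 inside the margin, loss above 1 for wrong-sign predictions.
for fx in [2.0, 1.0, 0.5, 0.0, -1.0]:
    print(fx, hinge(fx, +1))   # 0.0, 0.0, 0.5, 1.0, 2.0
```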

In general, we do not know in advance what the best space F is for a particular training set. A possible solution to this dilemma is to choose a flexible space F, but at the same time to impose a penalty on the complexity of f. Let c(f) be some real-valued measure of complexity. The learning process then becomes to solve

    f = argmin_{f ∈ F} λ c(f) + (1/n) Σ_{i=1}^n l(f(x_i), y_i).

Here, λ is a parameter that controls the relative strength of the two objectives, namely to minimize the complexity of f and to minimize training error.

Suppose that the space of candidate functions is defined by a vector w ∈ R^d of parameters, i.e. we can write f(x) = g(x; w) where g is some fixed function. In this case we can define the complexity of each candidate function to be the norm of the corresponding w. Most commonly we use the square of the L2 norm:

    c(f) = ||w||₂² = Σ_{j=1}^d w_j².

However we could also use other norms, including the L0 norm

    c(f) = ||w||₀ = Σ_{j=1}^d I(w_j ≠ 0)

or the L1 norm

    c(f) = ||w||₁ = Σ_{j=1}^d |w_j|.

The square of the L2 norm is the most convenient mathematically.

4.3 Linear soft-margin SVMs

A linear classifier is a function f(x) = g(x; w) where g is the dot product function: f(x) = x · w. Note that d is the dimensionality of x, and w has the same dimensionality. Putting the ideas above together, the objective of learning is to find

    w = argmin_{w ∈ R^d} λ ||w||² + (1/n) Σ_{i=1}^n max{0, 1 − y_i (w · x_i)}.

A linear soft-margin SVM classifier is precisely the solution to this optimization problem. It can be proved that the solution to this minimization problem is always unique.
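The objective just stated can be minimized directly by subgradient descent, since the hinge term has a simple subgradient. Below is a minimal sketch in plain numpy (fixed learning rate, no bias term, toy data); it is meant to illustrate the objective, not to compete with real SVM solvers.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.1, lr=0.01, epochs=200):
    """Subgradient descent on lam*||w||^2 + (1/n)*sum of hinge losses."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        margins = y * (X @ w)
        active = margins < 1   # examples whose hinge loss is nonzero
        # subgradient: 2*lam*w minus the mean of y_i*x_i over active examples
        grad = 2.0 * lam * w - (y[active, None] * X[active]).sum(axis=0) / n
        w -= lr * grad
    return w

# Well-separated toy data in two dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2.0, 0.5, size=(50, 2)),
               rng.normal(-2.0, 0.5, size=(50, 2))])
y = np.array([+1] * 50 + [-1] * 50)
w = train_linear_svm(X, y)
acc = float((np.sign(X @ w) == y).mean())
print(w, acc)   # training accuracy close to 1.0
```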

Moreover, the objective function is convex, so there are no local minima. Mathematically, the optimization problem above is called an unconstrained primal formulation. The primal version is easier to understand, and easier to use as a foundation for proving bounds on generalization error.

An equivalent way of writing the same optimization problem is

    w = argmin_{w ∈ R^d} ||w||² + C Σ_{i=1}^n max{0, 1 − y_i (w · x_i)}

with C = 1/(nλ). Many SVM implementations let the user specify C as opposed to λ. A small value for C corresponds to strong regularization, while a large value corresponds to weak regularization. Intuitively, everything else being equal, a smaller training set should require a smaller C value. However, in practice, useful guidelines are not known for what the best value of C might be for a given dataset. One has to try multiple values of C, and find the best value experimentally.

There is an alternative formulation that is equivalent, and is useful theoretically. This so-called dual formulation is

    max_{α ∈ R^n} Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j (x_i · x_j)
    subject to 0 ≤ α_i ≤ C.

Notice that the optimization is over R^n in the dual formulation, whereas it is over R^d in the primal formulation. The primal and dual formulations are different optimization problems, but they have the same unique solution. The solution to the dual problem is a coefficient α_i for each training example. The trained classifier is f(x) = w · x where the vector

    w = Σ_{i=1}^n α_i y_i x_i.

This equation says that w is a weighted linear combination of the training instances x_i, where the weight of each instance is between 0 and C, and the sign of each instance is its label y_i. The training instances x_i such that α_i > 0 are called support vectors. These instances are the only ones that contribute to the final classifier.

The constrained dual formulation is the basis of the training algorithms used by standard SVM implementations, but recent research has shown that the unconstrained primal formulation in fact leads to faster training algorithms, at least in the linear case. Nevertheless, the dual version is easier to extend to obtain nonlinear SVM classifiers. This extension is based on the idea of a kernel function.
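With an SVM implementation that exposes the dual solution, the relationship w = Σ α_i y_i x_i can be checked directly. The sketch below assumes scikit-learn is available; its SVC stores the products α_i y_i for the support vectors in the attribute dual_coef_.

```python
import numpy as np
from sklearn.svm import SVC   # assumes scikit-learn is installed

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.5, 1.0, size=(100, 2)),
               rng.normal(-1.5, 1.0, size=(100, 2))])
y = np.array([+1] * 100 + [-1] * 100)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors only, so the
# primal weight vector is their weighted sum of training instances.
w = clf.dual_coef_ @ clf.support_vectors_
print(len(clf.support_), np.allclose(w, clf.coef_))
```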

4.4 Nonlinear kernels

Consider two instances x_i and x_j, and consider their dot product x_i · x_j. This dot product is a measure of the similarity of the two instances: it is large if they are similar and small if they are not. Dot product similarity is closely related to Euclidean distance through the identity

    d(x_i, x_j) = ||x_i − x_j|| = (||x_i||² − 2 x_i · x_j + ||x_j||²)^{1/2}

where by definition ||x||² = x · x. For many applications of support vector machines, it is advantageous to normalize features to have the same mean and variance. It can be advantageous also to normalize instances so that they have unit length. However, in general one cannot have both normalizations be true at the same time.¹

Remember that the linear classifier can be written

    f(x) = w · x = Σ_{i=1}^n α_i y_i (x_i · x).

This equation says that the prediction for a test example x is a weighted average of the training labels y_i, where the weight of y_i is the product of α_i and the degree to which x is similar to x_i.

Now consider a re-representation of instances x ↦ Φ(x), where the transformation Φ is a function from R^d to some other space R^d′. In principle, we could use the dot product to define similarity in the new space R^d′, and train an SVM classifier in this space. Specifically, suppose we have a function K(x_i, x_j) = Φ(x_i) · Φ(x_j). This function is all we need in order to write down the optimization problem and its solution; we do not need to know the function Φ in any explicit way. Specifically, let k_ij = K(x_i, x_j). The learning task is to solve

    max_{α ∈ R^n} Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j k_ij
    subject to 0 ≤ α_i ≤ C.

The solution is

    f(x) = [Σ_{i=1}^n α_i y_i Φ(x_i)] · Φ(x) = Σ_{i=1}^n α_i y_i K(x_i, x).

The result above says that in order to train a nonlinear SVM classifier, all that we need is the kernel matrix of size n by n whose entries are k_ij. And in order to apply the trained nonlinear classifier, all that we need is the kernel function K. The classifier is a weighted combination of at most n functions, one for each training instance x_i. These functions are called basis functions.

¹ If the instances have unit length, that is ||x_i|| = ||x_j|| = 1, then Euclidean distance and dot product similarity are perfectly anticorrelated.
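For one simple kernel, the feature map Φ can be written out explicitly and the identity K(u, v) = Φ(u) · Φ(v) checked directly. The quadratic kernel K(u, v) = (u · v)² on R² is a standard textbook example, not one used elsewhere in this chapter:

```python
import numpy as np

def phi(x):
    """Explicit feature map for K(u, v) = (u.v)^2 on R^2:
    Phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2)."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2.0) * x[0] * x[1]])

rng = np.random.default_rng(0)
u, v = rng.normal(size=2), rng.normal(size=2)

k_direct = (u @ v) ** 2        # kernel function: no feature map needed
k_mapped = phi(u) @ phi(v)     # dot product in the transformed space
print(k_direct, k_mapped)      # the two values agree
```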

One particular kernel is especially important. The radial basis function (RBF) kernel is the function

    K(x_i, x) = exp(−γ ||x_i − x||²)

where γ > 0 is an adjustable parameter. Using an RBF kernel, each basis function K(x_i, x) is "radial" since it is based on the Euclidean distance ||x_i − x|| from x_i to x. With an RBF kernel, the classifier f(x) = Σ_i α_i y_i K(x_i, x) is similar to a nearest-neighbor classifier. Given a test instance x, its predicted label f(x) is a weighted average of the labels y_i of the support vectors x_i. The support vectors that contribute non-negligibly to the predicted label are those for which the Euclidean distance ||x_i − x|| is small.

The RBF kernel can also be written

    K(x_i, x) = exp(−||x_i − x||²/σ²)

where σ² = 1/γ. This notation emphasizes the similarity with a Gaussian distribution. A smaller value for γ, i.e. a larger value for σ², corresponds to basis functions that are less peaked, i.e. that are significantly non-zero for a wider range of x values. Using a larger value for σ² is similar to using a larger number k of neighbors in a nearest neighbor classifier.

Note that the function Φ never needs to be known explicitly. Using K exclusively in this way instead of Φ is called the "kernel trick." Practically, the function K can be much easier to deal with than Φ, because K is just a mapping to R, rather than to a high-dimensional space. The function K is fixed, so it does not increase the intuitive complexity of the classifier. Regardless of which kernel K is used, that is regardless of which re-representation Φ is used, the complexity of the classifier f is limited, since it is defined by at most n coefficients α_i.

4.5 Selecting the best SVM settings

Getting good results with support vector machines requires some care. The consensus opinion is that the best SVM classifier is typically at least as accurate as the best of any other type of classifier, but obtaining the best SVM classifier is not trivial. The following procedure is recommended as a starting point:

1. Code data in the numerical format needed for SVM training.

2. Scale each attribute to have range 0 to 1, or range −1 to +1, or to have mean zero and unit variance.

. . This process is called a grid search.5. 2. 2−1 . . 22 . doing a more ﬁne-grained grid search around these values is sometimes beneﬁcial. 2−3 . 2−2 . Use cross-validation to ﬁnd the best value for C for a linear kernel. Once you have found the best values of C and γ to within a factor of 2. 23 . It is important to try a wide enough range of values to be sure that more extreme values do not give better performance.}2 . and to try values that are smaller and larger by factors of 2: C. SELECTING THE BEST SVM SETTINGS 3. . γ ∈ {. 1.4. Train on the entire available data using the parameter values found to be best via cross-validation. . it is easy to parallelize of course. Use cross-validation to ﬁnd the best values for C and γ for an RBF kernel. 33 5. . . It is reasonable to start with C = 1 and γ = 1. 4. but not always.
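The RBF kernel, the resulting classifier f(x) = Σi αi yi K(xi, x), and the factor-of-2 grid of candidate settings can all be sketched directly. This is an illustrative sketch only: the coefficients αi below are chosen by hand rather than learned by an SVM trainer, and the function names are ours.

```python
import math

def rbf_kernel(xi, x, gamma):
    """K(xi, x) = exp(-gamma * ||xi - x||^2), with gamma > 0."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, x))
    return math.exp(-gamma * sq_dist)

def rbf_classifier(support_vectors, alphas, labels, gamma):
    """f(x) = sum_i alpha_i * y_i * K(xi, x): a weighted vote of
    support-vector labels, dominated by nearby support vectors."""
    def f(x):
        return sum(a * y * rbf_kernel(xi, x, gamma)
                   for a, y, xi in zip(alphas, labels, support_vectors))
    return f

clf = rbf_classifier([[0.0, 0.0], [1.0, 1.0]], [1.0, 1.0], [+1, -1], gamma=1.0)
print(clf([0.1, 0.0]) > 0)   # True: this point is near the positive support vector
print(clf([0.9, 1.0]) > 0)   # False: this point is near the negative support vector

# candidate (C, gamma) pairs for a grid search, spaced by factors of 2
powers = [2.0 ** e for e in range(-3, 4)]
grid = [(C, g) for C in powers for g in powers]
```

Each pair in `grid` would then be evaluated by cross-validation, keeping the pair with the best cross-validated performance.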

Quiz question

(a) Draw the hinge loss function for the case where the true label is y = 1. Label the axes clearly.

(b) Explain where the derivative is (i) zero, (ii) constant but not zero, or (iii) not defined.

(c) For each of the three cases for the derivative, explain its intuitive implications for training an SVM classifier.
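Intuitions about the quiz question can be checked numerically. The sketch below (our own helper, not part of any SVM library) evaluates the hinge loss at points in each of the three derivative regimes.

```python
def hinge_loss(y, f):
    """Hinge loss max(0, 1 - y*f) for a true label y in {-1, +1} and raw score f."""
    return max(0.0, 1.0 - y * f)

print(hinge_loss(+1, 2.0))    # 0.0: margin satisfied, derivative w.r.t. f is zero
print(hinge_loss(+1, -1.0))   # 2.0: margin violated, derivative w.r.t. f is -y
print(hinge_loss(+1, 1.0))    # 0.0: the kink at y*f = 1, derivative not defined
```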

Quiz for April 20, 2010

Your name:

Suppose that you have trained SVM classifiers carefully for a given learning task. You have selected the settings that yield the best linear classifier and the best RBF classifier. It turns out that both classifiers perform equally well in terms of accuracy for your task.

(a) Now you are given many times more training data. After finding optimal settings again and retraining, do you expect the linear classifier or the RBF classifier to have better accuracy? Explain very briefly.

(b) Do you expect the optimal settings to involve stronger regularization, or weaker regularization? Explain very briefly.

CSE 291 Assignment

This assignment is due at the start of class on Tuesday April 20. As before, you should work in a team of two. You may keep the same partner as before, or change partners, choosing a partner with a different background.

This project uses data published by Kevin Hillstrom, a well-known data mining consultant. For a detailed description, see http://minethatdata.com/blog/2008/03/minethatdata-e-mail-analytics-and-data.html. You can find the data at http://cseweb.ucsd.edu/users/elkan/250B/HillstromData.csv. For now, use only the data for customers who are not sent any email promotion. Your job is to train a good model to predict which customers visit the retailer's website. For now, you should ignore information about which customers make a purchase, and how much they spend.

Build support vector machine (SVM) models to predict the target label as accurately as possible. Train the best possible model using a linear kernel, and also the best possible model using a radial basis function (RBF) kernel. As before, recode nonnumerical features as numerical, and transform features to improve their usefulness. In the same general way as for linear regression, you can explore models based on a large number of transformed features. Training nonlinear SVMs is much slower, but one can hope that good performance can be achieved with fewer features. For linear SVMs, the Rapidminer operator named FastLargeMargin is recommended, because it is fast.

Use nested cross-validation carefully to find the best settings for training, and to evaluate the accuracy of the best classifiers as fairly as possible. In particular, you should identify good values of the soft-margin C parameter for both kernels, and of the width parameter for the RBF kernel, without overfitting or underfitting. The outcome should be the two most accurate SVM classifiers that you can find. Decide thoughtfully which measure of accuracy to use, and explain your choice in your report.

As before, the deliverable is a well-organized, well-written, and well-formatted report of about two pages. Include a printout of your final Rapidminer process, and a description of your two final models (not included in the two pages). Organize your report logically, not chronologically. Write in the present tense. Describe what you did that worked, and do not explain ideas that did not work. Do not speculate about future work. Explain carefully your nested cross-validation procedure, any assumptions that you made, and any limitations on the validity or reliability of your results.

Chapter 5

Classification with a rare class

In many data mining applications, some classes are rare while others are common, and the goal is to find needles in a haystack. That is, most examples are negative, but a few examples are positive, and the aim is to identify the rare positive examples as accurately as possible. For example, most credit card transactions are legitimate, but a few are fraudulent.

We have a standard binary classifier learning problem, but both the training and test sets are unbalanced. In a balanced set, the fraction of examples of each class is about the same; in an unbalanced set, the fractions are very different. For concreteness in further discussion, we will consider only the two-class case, and we will call the rare class positive.

Rather than talk about fractions or percentages of a set, we will talk about actual numbers (also called counts) of examples. It turns out that thinking about actual numbers leads to less confusion and more insight than thinking about fractions. Suppose the test set has a certain total size n, say n = 1000.

5.1 Measuring performance

A major difficulty with unbalanced data is that accuracy is not a meaningful measure of performance. Suppose that 99% of credit card transactions are legitimate. Then we can get 99% accuracy by predicting trivially that every transaction is legitimate. On the other hand, suppose we somehow identify 5% of transactions for further investigation, and half of all fraudulent transactions occur among these 5%. Clearly the identification process is doing something worthwhile and not trivial. But its accuracy is only 95%.

We can represent the performance of the trivial classifier as follows:

                          predicted
                      positive  negative
    truth  positive       0        10
           negative       0       990

The performance of the non-trivial classifier is

                          predicted
                      positive  negative
    truth  positive       5         5
           negative      45       945

A table like the ones above is called a 2×2 contingency table. The four entries in a 2×2 contingency table have standard names: they are called true positives tp, false positives fp, true negatives tn, and false negatives fn, as follows:

                          predicted
                      positive  negative
    truth  positive      tp        fn
           negative      fp        tn

A table like the ones above is also called a confusion matrix. Unfortunately, there is no standard convention about whether rows are actual or predicted. Above, rows correspond to actual labels, while columns correspond to predicted labels; it would be equally valid to swap the rows and columns. Remember that there is a universal convention that in notation like xij the first subscript refers to rows while the second subscript refers to columns. The terminology true positive, false positive, etc. is standard, but whether columns correspond to predicted and rows to actual, or vice versa, is not standard.

For supervised learning with discrete predictions, the entries in a confusion matrix are counts, i.e. integers. The total of the four entries is tp + tn + fp + fn = n, the number of test examples. Assuming that n is known, three of the counts in a confusion matrix can vary independently.

Depending on the application, different summaries are computed from these entries, for example accuracy a = (tp + tn)/n. No single number that summarizes performance can provide a full picture of the usefulness of a classifier; only a confusion matrix gives complete information about the performance of a classifier.
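The accuracy figures quoted for the two credit card classifiers follow directly from the confusion-matrix counts, as this throwaway sketch (the function name is ours) confirms:

```python
def accuracy(tp, fp, fn, tn):
    """Accuracy a = (tp + tn) / n, computed from confusion-matrix counts."""
    return (tp + tn) / (tp + fp + fn + tn)

# trivial classifier: predicts every transaction legitimate (n = 1000, 10 frauds)
print(accuracy(tp=0, fp=0, fn=10, tn=990))   # 0.99
# non-trivial classifier: investigates 50 transactions, catches 5 of the 10 frauds
print(accuracy(tp=5, fp=45, fn=5, tn=945))   # 0.95
```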

For this reason, when writing a report, it is preferable simply to give the full confusion matrix explicitly, so that readers can calculate whatever performance measurements they are most interested in.

When accuracy is not meaningful, two summary measures that are commonly used are called precision and recall. They are defined as follows:

• precision p = tp/(tp + fp), and
• recall r = tp/(tp + fn).

The names "precision" and "recall" come from the field of information retrieval. In other research areas, recall is often called sensitivity, while precision is sometimes called positive predictive value. Besides accuracy, precision, and recall, many other summaries are also commonly computed from confusion matrices. Some of these are called specificity, false positive rate, false negative rate, positive and negative likelihood ratio, kappa coefficient, and more. Rather than rely on agreement and understanding of the definitions of all these, it is preferable simply to report a full confusion matrix explicitly.

Precision is undefined for a classifier that predicts that every test example is negative, that is when tp + fp = 0. Worse, precision can be misleadingly high for a classifier that predicts that only a few test examples are positive. Consider the following confusion matrix:

                          predicted
                      positive  negative
    truth  positive       1         9
           negative       0       990

Precision is 100%, but 90% of actual positives are missed. F-measure is a widely used metric that overcomes this limitation of precision. It is the harmonic average of precision and recall:

    F = 2 / (1/p + 1/r) = 2pr / (p + r).

For the confusion matrix above, F = 2 · 1 · 0.1 / (1 + 0.1) ≈ 0.18.

5.2 Thresholds and lift

A confusion matrix is always based on discrete predictions. Often, these predictions are obtained by thresholding a real-valued predicted score. For example, an SVM classifier yields a real-valued prediction f(x) which is then compared to the threshold zero to obtain a discrete yes/no prediction. Confusion matrices cannot represent information about the usefulness of the underlying real-valued predictions.

In some scenarios, there is a natural threshold, such as zero for an SVM. However, even when a natural threshold exists, it is possible to change the threshold to achieve a target number of positive predictions. This target number is often based on a so-called budget constraint. Suppose that all examples predicted to be positive are subjected to further investigation. An external limit on the resources available will determine how many examples can be investigated. This number is a natural target for the value tp + fp. Intuitively, we want to investigate those examples that are most likely to be actual positives, so we want to investigate the examples x with the highest prediction scores f(x). Therefore, the correct strategy is to choose a threshold t such that we investigate all examples x with f(x) ≥ t, where the number of such examples is determined by the budget constraint.

Of course, budget constraints should normally be questioned. In the credit card scenario, perhaps too many transactions are being investigated, and the marginal benefit of investigating some transactions is negative. Or, perhaps too few transactions are being investigated, and there would be a net benefit if additional transactions were investigated. Making optimal decisions about how many examples should be predicted to be positive is discussed in the next chapter. While a budget constraint is not normally a rational way of choosing a threshold for predictions, it is still more rational than choosing an arbitrary threshold.

Setting the threshold determines the number tp + fp of examples that are predicted to be positive. Given that a fixed number of examples are predicted to be positive, a natural question is how good a classifier is at capturing the actual positives within this number. This question is answered by a measure called lift. The definition is a bit complex. First, let the fraction of examples predicted to be positive be x = (tp + fp)/n. Next, let the base rate of actual positive examples be b = (tp + fn)/n. Let the density of actual positives within the predicted positives be d = tp/(tp + fp). Now, the lift at x is defined to be the ratio d/b. Intuitively, a lift of 2 means that actual positives are twice as dense among the predicted positives as they are among all examples. Lift can be expressed as

    d/b = [tp/(tp + fp)] / [(tp + fn)/n]
        = tp · n / [(tp + fp)(tp + fn)]
        = [tp/(tp + fn)] · [n/(tp + fp)]
        = recall / prediction rate.

Lift is a useful measure of success if the number of examples that should be predicted to be positive is determined by external considerations. We shall return to this issue below.

5.3 Ranking examples

Applying a threshold to a classifier with real-valued outputs loses information, because the distinction between examples on the same side of the threshold is lost. Moreover, at the time a classifier is trained and evaluated, it is often the case that the threshold to be used for decision-making is not known. The threshold zero for an SVM classifier has some mathematical meaning, but it is not a rational guide for making decisions. In particular, it is not meaningful to use the same numerical threshold directly for different classifiers. However, it is meaningful to compare the recall achieved by different classifiers when the threshold for each one is set to make their precisions equal. More generally, it is useful to compare different classifiers across the range of all possible thresholds. This is what an ROC curve does.¹

[Figure 5.1: ROC curves for three alternative binary classifiers. Source: Wikipedia.]

Concretely, an ROC curve is a plot of the performance of a classifier, where the horizontal axis measures false positive rate (fpr) and the vertical axis measures true positive rate (tpr). These are defined as

    fpr = fp / (fp + tn)
    tpr = tp / (tp + fn).

Note that tpr is the same as recall, and is sometimes also called "hit rate." In an ROC plot, the ideal point is at the top left. One classifier uniformly dominates another if its curve is always above the other's curve. It happens often that the ROC curves of two classifiers cross, which implies that neither one dominates the other uniformly.

ROC plots are informative, but do not provide a quantitative measure of the performance of classifiers. A natural quantity to consider is the area under the ROC curve, often abbreviated AUC. The AUC is 0.5 for a classifier whose predictions are no better than random, while it is 1 for a perfect classifier. The AUC has an intuitive meaning: it can be proved that it equals the probability that the classifier will rank correctly two randomly chosen examples where one is positive and one is negative. One reason why AUC is widely used is that, as shown by the probabilistic meaning just mentioned, it has built into it the implicit assumption that the rare class is more important than the common class.

¹ ROC stands for "receiver operating characteristic." This terminology originates in the theory of detection based on electromagnetic waves.

5.4 Conditional probabilities

• Reason 1 to want probabilities: predicting expectations.
• Reason 2 to want probabilities: understanding predictive power.
• Converting scores into probabilities.
• Ideal versus achievable probability estimates.
• Definition of a well-calibrated probability.
• Brier score.

5.5 Isotonic regression

Let fi be prediction scores on a dataset, and let yi ∈ {0, 1} be the corresponding true labels. Let f_j be the fi values sorted from smallest to largest, and let y_j be the yi values sorted in the same order.
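The summary measures defined above — precision, recall, F-measure, lift, and AUC — are all simple functions of the confusion-matrix counts and, for AUC, of the ranking of scores. The sketch below implements them directly from the definitions; the quadratic-time AUC computation is fine for illustration but not for large test sets.

```python
def precision_recall_f(tp, fp, fn):
    p = tp / (tp + fp)            # undefined when tp + fp = 0
    r = tp / (tp + fn)
    f = 2 * p * r / (p + r)       # harmonic mean of precision and recall
    return p, r, f

def lift(tp, fp, fn, tn):
    n = tp + fp + fn + tn
    b = (tp + fn) / n             # base rate of actual positives
    d = tp / (tp + fp)            # density of positives among predicted positives
    return d / b

def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties = 1/2)."""
    wins = sum(1.0 if sp > sn else 0.5 if sp == sn else 0.0
               for sp in pos_scores for sn in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

print(precision_recall_f(tp=1, fp=0, fn=9))   # (1.0, 0.1, ~0.18)
print(lift(tp=5, fp=45, fn=5, tn=945))        # 10.0 for the credit card classifier
print(auc([0.9, 0.8], [0.7, 0.1]))            # 1.0: perfect ranking
```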

For each f_j we want to find an output value g_j such that the g_j values are monotonically increasing, and squared error relative to the y_j values is minimized. Formally, the optimization problem is

    min over g_1, ..., g_n of Σ_j (y_j − g_j)²  subject to g_j ≤ g_{j+1} for j = 1 to j = n − 1.

There is an elegant algorithm called "pool adjacent violators" (PAV) that solves this problem in linear time. The algorithm is as follows, where pooling a set of values means replacing each member of the set by the arithmetic mean of the set:

    Let g_j = y_j for all j.
    Start with j = 1 and increase j until the first j such that g_j > g_{j+1}, i.e. the first violation of monotonicity.
    Pool g_j and g_{j+1}.
    Move left: if g_{j−1} is greater than the pooled value, then pool g_{j−1} to g_{j+1}.
    Continue to the left until monotonicity is satisfied.
    Proceed to the right.

It is a remarkable fact that if squared error is minimized in this way, then the resulting predictions are well-calibrated probabilities. Given a test example x, the procedure to predict a well-calibrated probability is as follows:

    Apply the classifier to obtain f(x).
    Find j such that f_j ≤ f(x) ≤ f_{j+1}.
    The predicted probability is g_j.

5.6 Univariate logistic regression

The disadvantage of isotonic regression is that it creates a lookup table for converting scores into estimated probabilities. An alternative is to use a parametric model. The most common model is called univariate logistic regression. The model is

    log [p / (1 − p)] = a + bf

where f is a prediction score and p is the corresponding estimated probability. An equivalent way of writing the model is

    p = 1 / (1 + e^{−(a + bf)}).

The equation above shows that the logistic regression model is essentially a linear model with intercept a and coefficient b. As above, let fi be prediction scores on a training set, and let yi ∈ {0, 1} be the corresponding true labels. The parameters a and b are chosen to minimize the total loss

    Σ_i l( 1 / (1 + e^{−(a + b fi)}), yi ).

The precise loss function l that is typically used in the optimization is called conditional log likelihood (CLL), and is explained in a later chapter. However, one could also use squared error, which would be consistent with isotonic regression.
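The PAV algorithm can be implemented by keeping a stack of pooled blocks, each storing a sum and a count. This is a standard linear-time realization of the "move left" step, though the bookkeeping differs slightly from the left-to-right description above; the code is a sketch, not taken from the text.

```python
def pav(y_sorted):
    """Isotonic regression of y_sorted (labels ordered by classifier score):
    returns nondecreasing values g minimizing squared error, by pooling
    adjacent blocks whenever their means violate monotonicity."""
    blocks = []                        # each block is [sum, count]
    for y in y_sorted:
        blocks.append([float(y), 1])
        # pool backwards while the last two block means are out of order
        while len(blocks) > 1 and \
                blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    g = []
    for s, c in blocks:
        g.extend([s / c] * c)          # expand each pooled block to its mean
    return g

print(pav([0, 1, 0, 1, 1]))   # [0.0, 0.5, 0.5, 1.0, 1.0]
```

The second and third labels violate monotonicity, so they are pooled to their mean 0.5; the result is nondecreasing and well calibrated on each pooled block.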

Quiz

All questions below refer to a classifier learning task with two classes, where the base rate for the positive class y = 1 is 5%.

(a) Suppose that a probabilistic classifier predicts p(y = 1|x) = c for some constant c, for all test examples x. Explain why c = 0.05 is the best value for c.

"Well-calibrated" means that c equals the average frequency of positive examples in sets of examples with predicted score c. The value 0.05 is the only constant that is a well-calibrated probability.

(b) What is the error rate of the classifier from part (a)? What is its MSE?

With a prediction threshold of 0.5, all examples are predicted to be negative, so the error rate is 5%. The MSE is 0.05 · (1 − 0.05)² + 0.95 · (0 − 0.05)² = 0.0475.

(c) Suppose a well-calibrated probabilistic classifier satisfies a ≤ p(y = 1|x) ≤ b for all x. What is the maximum possible lift at 10% of this classifier?

If the upper bound b is a well-calibrated probability, then the fraction of positive examples among the highest-scoring examples is at most b. Hence, the lift is at most b/0.05, where 0.05 is the base rate. Note that this is the highest possible lift at any percentage, not just at 10%.
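The MSE arithmetic in part (b) is easy to verify numerically:

```python
base = 0.05   # base rate of the positive class
c = 0.05      # the constant predicted probability from part (a)

# a fraction `base` of examples has true label 1, the rest have true label 0
mse = base * (1 - c) ** 2 + (1 - base) * (0 - c) ** 2
print(round(mse, 6))   # 0.0475
```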

2009 Assignment

This assignment is due at the start of class on Tuesday April 28, 2009. As before, you should work in a team of two. You may change partners, or keep the same partner.

Like previous assignments, this one uses the KDD98 training set, with the same data coded in the same way; however, you should now use the entire dataset. The goal is to train a classifier with real-valued outputs that identifies test examples with TARGET_B = 1. The measure of success to optimize is lift at 10%. That is, as many positive test examples as possible should be among the 10% of test examples with highest prediction score.

You should use logistic regression first, because this is a fast and reliable method for training probabilistic classifiers. If you are using Rapidminer, then use the logistic regression option of the FastLargeMargin operator. As before, recode and transform features to improve their usefulness. Do feature selection to reduce the size of the training set as much as is reasonably possible. Apply cross-validation to find good algorithm settings.

Next, you should apply a different learning method to the same training set that you developed using logistic regression. The second learning method can be, for example, a support vector machine, a neural network, or a decision tree. The objective is to see whether this other method can perform better than logistic regression. When necessary, use a postprocessing method (isotonic regression or logistic regression) to obtain calibrated estimates of conditional probabilities.

Investigate the probability estimates produced by your two methods. What are the minimum, mean, and maximum predicted probabilities? Discuss whether these are reasonable.

Assignment

This assignment is due at the start of class on Tuesday April 27, 2010. As before, you should work in a team of two. You may change partners, or keep the same partner.

Like the previous assignment, this one uses the e-commerce data published by Kevin Hillstrom. However, the goal now is to predict who makes a purchase on the website (again, for customers who are not sent any promotion). This is a highly unbalanced classification task. The measure of success to optimize is lift at 25%. That is, as many positive test examples as possible should be among the 25% of test examples with highest prediction score.

First, use logistic regression. If you are using Rapidminer, then use the logistic regression option of the FastLargeMargin operator. As before, recode and transform features to improve their usefulness. Second, use your creativity to get the best possible performance in predicting who makes a purchase. You may apply any learning algorithms that you like. Can any method achieve better accuracy than logistic regression? As before, be careful not to fool yourself about the success of your methods.

For predicting who makes a purchase, compare learning a classifier directly with learning two classifiers: the first to predict who visits the website, and the second to predict which visitors make purchases. Note that mathematically p(buy|x) = p(buy|x, visit)p(visit|x).

Investigate the probability estimates produced by your two methods. What are the minimum, mean, and maximum predicted probabilities? Discuss whether these are reasonable.

Quiz for April 27, 2010

Your name:

The current assignment is based on the equation

    p(buy = 1|x) = p(buy = 1|x, visit = 1) p(visit = 1|x).

Explain clearly but briefly why this equation is true.
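One way to see why the equation holds is to condition on the visit variable via the law of total probability; the key step is that a customer who does not visit cannot buy, so the visit = 0 term vanishes:

```latex
p(\mathrm{buy}=1 \mid x)
  = \sum_{v \in \{0,1\}} p(\mathrm{buy}=1 \mid x, \mathrm{visit}=v)\,
                         p(\mathrm{visit}=v \mid x)
  % the visit = 0 term is zero because p(buy = 1 | x, visit = 0) = 0
  = p(\mathrm{buy}=1 \mid x, \mathrm{visit}=1)\, p(\mathrm{visit}=1 \mid x).
```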

5.7 Pitfalls of link prediction

This section was originally an assignment. The goal of writing it as an assignment was to help develop three important abilities. The first ability is critical thinking, i.e. the skill of identifying what is most important and then identifying what may be incorrect. The second ability is understanding a new application domain quickly, in this case a task in computational biology. The third ability is presenting opinions in a persuasive way. Students were asked to explain arguments and conclusions crisply, concisely, and clearly.

Assignment. The paper you should read is "Predicting protein-protein interactions from primary structure" by Joel Bock and David Gough, published in the journal Bioinformatics in 2001. The paper has 248 citations according to Google Scholar as of May 26, 2009. The full text of this paper in PDF is supposed to be available free. If you have difficulty obtaining it, please post on the class message board.

You should figure out and describe three major flaws in the paper. Each mistake is described sufficiently clearly in the paper: it is a sin of commission, not a sin of omission. Each of the three mistakes is serious, and unfortunately each flaw by itself makes the results of the paper not useful as a basis for future research. The flaws concern

• how the dataset is constructed,
• how each example is represented, and
• how performance is measured and reported.

The second mistake, how each example is represented, is the most subtle. It is connected with how SVMs are applied here. It is not explained clearly in the paper itself, but at least one of the papers citing this paper does explain it clearly. Remember the slogan: "If you cannot represent it then you cannot learn it."

Separately, provide a brief critique of the four benefits claimed for SVMs in the section of the paper entitled Support vector machine learning. Are these benefits true? Are they unique to SVMs? Does the work described in this paper take advantage of them?

Sample answers

Here is a brief summary of what I see as the most important flaws of the paper "Predicting protein-protein interactions from primary structure."

(1) How the dataset is constructed. The problem here is that the negative examples are not pairs of genuine proteins. Instead, they are pairs of randomly generated amino acid sequences. It is quite possible that these artificial sequences could not fold into actual proteins at all. The classifiers reported in this paper may learn mainly to distinguish between real proteins and non-proteins. The authors could have used pairs of genuine proteins as negative examples. It is true that one cannot be sure that any given pair really is non-interacting. However, the great majority of pairs do not interact. Moreover, if a negative example really is an interaction, that will presumably slightly reduce the apparent accuracy of a trained classifier, but not change overall conclusions. The authors acknowledge this concern, but do not overcome it.

(2) How each example is represented. This is a subtle but clear-cut and important issue, assuming that the research uses a linear classifier. Let x1 and x2 be two proteins, and let f(x1) and f(x2) be their representations as vectors in R^d. The pair of proteins is represented as the concatenated vector ⟨f(x1), f(x2)⟩ ∈ R^2d. Suppose a trained linear SVM has parameter vector w. By definition w ∈ R^2d also. (If there is a bias coefficient, so w ∈ R^{2d+1}, the conclusion is the same.) Now suppose the first protein x1 is fixed, and consider varying the second protein x2. Proteins x2 will be ranked according to the numerical value of the dot product w · ⟨f(x1), f(x2)⟩. This is equal to w1 · f(x1) + w2 · f(x2), where the vector w is written as ⟨w1, w2⟩. If x1 is fixed, then the first term is constant and the second term w2 · f(x2) determines the ranking. Hence the ranking of x2 proteins will be the same regardless of what the x1 protein is. With a concatenated representation of protein pairs, a linear classifier can at best learn the propensity of individual proteins to interact. Such a classifier cannot represent patterns that depend on features that are true only for specific protein pairs. This is the relevance of the slogan "If you cannot represent it then you cannot learn it." This fundamental drawback of linear classifiers for predicting interactions is pointed out in [Vert and Jacob, 2008, Section 5].

Note: Previously I was under the impression that the authors stated that they used a linear kernel. On rereading the paper, it fails to mention at all what kernel they use. If the research uses a linear kernel, then the argument above is applicable.

(3) How performance is measured and reported. Most pairs of proteins are non-interacting, so it is reasonable to use training sets where negative examples (non-interacting pairs) are undersampled. However, it is not reasonable or correct to report performance (accuracy, precision, recall, etc.) on test sets where negative examples are under-represented, which is what is done in this paper. Reported performance numbers should always be based on a set of test examples that is directly typical of a genuine test population.

As for the four claimed advantages of SVMs:

1. SVMs have fast training. In fact, SVM training is slow compared to many other classifier learning methods, except for linear classifiers trained by fast algorithms that were only published after 2001, when this paper was published. In any case, what is needed for screening many test examples is fast classifier application, not fast training. Applying a linear classifier is fast, but applying a nonlinear SVM typically has the same order-of-magnitude time complexity as applying a nearest-neighbor classifier, which is the slowest type of classifier in common use.

2. SVMs are nonlinear while requiring "relatively few" parameters. With a nonlinear kernel, the number of parameters is the number of support vectors, which is often close to the number of training examples. It is not clear relative to what this can be considered "few." In any case, the authors do not explain what kernel they use, and as argued above, a linear classifier is not appropriate with the representation of protein pairs used in this paper.

3. SVMs have an analytic upper bound on generalization error. This upper bound does motivate the SVM approach to training classifiers, but it typically does not provide useful guarantees for specific training sets. In practice, overfitting is prevented by straightforward search for the value of the C parameter that is empirically best, not by applying a theorem. In any case, a bound of this type has nothing to do with assigning confidences to individual predictions, whether the classifier is an SVM or not.

4. SVMs can be continuously updated in response to new data. At least one algorithm is known for updating an SVM given a new training example, but it is not cited in this paper. I do not know any algorithm for training an optimal new SVM efficiently, that is without retraining on old data. In any case, new real-world protein data arrives slowly enough that retraining from scratch is feasible.
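The argument in point (2) — that a linear scorer over concatenated pair representations ranks candidate partners x2 identically for every fixed x1 — can be demonstrated numerically. This is an illustrative sketch with made-up weight and feature vectors, not code from the paper.

```python
def linear_pair_score(w1, w2, f1, f2):
    """w . <f(x1), f(x2)> = w1 . f(x1) + w2 . f(x2) for a concatenated pair vector."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    return dot(w1, f1) + dot(w2, f2)

w1, w2 = [0.5, -1.0], [2.0, 0.3]
candidates = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # possible partners x2

# rank the candidate partners for two different first proteins x1:
# only the constant first term changes, so the order is identical
for f1 in ([3.0, 3.0], [-7.0, 2.0]):
    ranked = sorted(candidates,
                    key=lambda f2: linear_pair_score(w1, w2, f1, f2),
                    reverse=True)
    print(ranked)
```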

.

. f ni compute tp = i tpi . However. the time complexity of cross-validation is k times that of running the training algorithm 53 . It is the following algorithm. f n = i f ni The output of cross-validation is a confusion matrix based on using each labeled example as a test example exactly once. and we are faced with a dilemma: we would like to use all the examples for training.1 Cross-validation Usually we have a ﬁxed database of labeled examples available. Whenever an example is used for testing a classiﬁer. . it has not been used for training that classiﬁer. Sk for i = 1 to i = k let T = S \ Si run learning algorithm with T as training set test the resulting classiﬁer on Si obtaining tpi . f pi .Chapter 6 Detecting overﬁtting: cross-validation 6. tn = i tni . but we would also like to use many examples as an independent test set. This special case is called leave-one-out cross-validation (LOOCV). . Cross-validation is a procedure for overcoming this dilemma. the largest possible number of folds is k = n. . f p = i f pi . Hence. Input: Training set S. integer constant k Procedure: partition S into k disjoint equal-sized subsets S1 . If n labeled examples are available. the confusion matrix obtained by cross-validation is a fair indicator of the performance of the learning algorithm on independent test examples. tni .

Typically. i. the number of alternatives is ﬁnite and the only way to evaluate an alternative is to run it explicitly on a training set. The time required for k-fold cross-validation is then O(k · (k − 1)/k · n) = O((k − 1)n). So. Subsampling the common class is a standard approach to learning from unbalanced data. and the time to make predictions on test examples is negligible in comparison. this matrix is an estimate of the average performance of a classiﬁer learned from a training set of size (k−1)n/k where n is the size of S. 6. Another common case is where the alternatives are different subsets of features. It is a not uncommon mistake with cross-validation to do subsampling on the test set. This estimate is likely to be conservative in the sense that the ﬁnal classiﬁer may have slightly better performance since it is based on a slightly larger training set. DETECTING OVERFITTING: CROSS-VALIDATION once. and then to use the confusion matrix obtained from cross-validation as an informal estimate of the performance of this classiﬁer. For example. the alternatives are all the same algorithm but with different parameter settings.54 CHAPTER 6. then LOOCV will show a zero error rate. Reported performance numbers should always be based on a set of test examples that is directly typical of a genuine test population. Instead. but not in the test fold.e. The common procedure is to create a ﬁnal classiﬁer by training on all of S. In recent research the most common choice for k is 10. Suppose that the time to train a classiﬁer is proportional to the number of training examples.2 Nested cross-validation A model selection procedure is a method for choosing the best learning algorithm from a set of alternatives. integer k. Three-fold cross-validation is therefore twice as time-consuming as two-fold. Often. It is reasonable to do subsampling in the training folds of cross-validation. two-fold is a good choice. The results of cross-validation can be misleading. 
This suggests that for preliminary experiments. how should we do model selection? A simple way to apply cross-validation for model selection is the following: Input: dataset S. and the confusion matrix it provides is not the performance of any speciﬁc single classiﬁer. for example different C and γ values. data where some classes are very common while others are very rare. Note that cross-validation does not produce any single ﬁnal classiﬁer. set V of alternative algorithm settings Procedure: . Subsampling means omitting examples from some classes. so often LOOCV is computationally infeasible. Cross-validation with other values of k will also yield misleadingly low error estimates. if each example is duplicated in the training set and we use a nearest-neighbor classiﬁer.

    partition S randomly into k disjoint subsets S1, ..., Sk of equal size
    for each setting v in V
        for i = 1 to i = k
            let T = S \ Si
            run the learning algorithm with setting v and T as the training set
            apply the trained model to Si, obtaining performance ei
        end for
        let M(v) be the average of the ei
    end for
    select v̂ = argmaxv M(v)

The output of this model selection procedure is v̂. The input set V of alternative settings can be a grid of parameter values. Notice that the division of S into subsets happens just once, and the same division is used for all settings v. This choice reduces the random variability in the evaluation of different v. A new partition of S could be created for each setting v, but this would not overcome the issue that v̂ is chosen to optimize performance on S.

The setting v̂ is chosen to maximize M(v), that is, to optimize performance on all of S, so v̂ is likely to overfit S, and M(v̂) is not a fair estimate of the performance to be expected from v̂ on future data. Stated another way, any procedure for selecting parameter values is itself part of the learning algorithm. It is crucial to understand this point.

What should we do about the fact that any procedure for selecting parameter values is itself part of the learning algorithm? One answer is that this procedure should itself be evaluated using cross-validation. This process is called nested cross-validation, because one cross-validation is run inside another. Specifically, nested cross-validation is the following process:

Input: dataset S, integers k and k′, set V of alternative algorithm settings

Procedure:
    partition S randomly into k disjoint subsets S1, ..., Sk of equal size
    for i = 1 to i = k
        let T = S \ Si
        partition T randomly into k′ disjoint subsets T1, ..., Tk′ of equal size
        for each setting v in V
            for j = 1 to j = k′
                let U = T \ Tj
                run the learning algorithm with setting v and U as the training set
                apply the trained model to Tj, obtaining performance ej
            end for

            let M(v) be the average of the ej
        end for
        select v̂ = argmaxv M(v)
        run the learning algorithm with setting v̂ and T as the training set
        apply the trained model to Si, obtaining performance ei
    end for
    report the average of the ei

Now, the final reported average ei is the estimated performance of the classifier obtained by running the same model selection procedure (i.e. the search over each setting v) on the whole dataset S. Some notes:

1. Above, the same partition of T is used for each setting v. This reduces randomness a little bit compared to using a different partition for each v, but the latter would be correct also.

2. As mentioned above, the search for a good member of V is itself part of the learning algorithm, so it can certainly be intelligent and/or heuristic. Trying every setting in V explicitly can be prohibitive, so it may be preferable to search in V in a more clever way, using a genetic algorithm or some other heuristic method.
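The two procedures above can be sketched together in code. This is an illustrative implementation under simplifying assumptions (performance is a score to maximize; `train` and `score` are hypothetical callables supplied by the user), not code from the text.

```python
import random
import statistics

def cv_score(xs, ys, k, train, score, setting, seed=0):
    """Plain k-fold cross-validation: average score of one setting v."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # same partition reused for every setting
    scores = []
    for i in range(k):
        test = folds[i]
        tr = [j for j in idx if j not in set(test)]
        model = train([xs[j] for j in tr], [ys[j] for j in tr], setting)
        scores.append(score(model, [xs[j] for j in test], [ys[j] for j in test]))
    return statistics.mean(scores)

def nested_cv(xs, ys, k, k2, settings, train, score):
    """Outer loop: estimate the performance of the whole model-selection
    procedure. Inner loop (cv_score) picks the best setting on T = S \\ S_i."""
    idx = list(range(len(xs)))
    random.Random(1).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    outer = []
    for i in range(k):
        test = folds[i]
        tr = [j for j in idx if j not in set(test)]
        T_x, T_y = [xs[j] for j in tr], [ys[j] for j in tr]
        best = max(settings, key=lambda v: cv_score(T_x, T_y, k2, train, score, v))
        model = train(T_x, T_y, best)  # retrain the selected setting on all of T
        outer.append(score(model, [xs[j] for j in test], [ys[j] for j in test]))
    return statistics.mean(outer)
```

Each `setting` plays the role of v ∈ V, and the returned number corresponds to the reported average of the ei.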

Quiz for April 13, 2010

Your name:

The following cross-validation procedure is intended to find the best regularization parameter R for linear regression, and to report a fair estimate of the RMSE to be expected on future test data.

Input: dataset S, integer k, set V of alternative R values

Procedure:
    partition S randomly into k disjoint subsets S1, ..., Sk of equal size
    for each R value in V
        for i = 1 to i = k
            let T = S \ Si
            train linear regression with parameter R and T as the training set
            apply the trained regression equation to Si, obtaining SSE ei
        end for
        compute M = Σi ei
        report R and RMSE = √(M/|S|)
    end for

(a) Suppose that you choose the R value for which the reported RMSE is lowest. Explain why this method is likely to be overoptimistic, as an estimate of the RMSE to be expected on future data.

Each R value is being evaluated on the entire set S. The value that seems to be best is likely overfitting this set.

(b) Very briefly, suggest an improved variation of the method above.

The procedure to select a value for R is part of the learning algorithm. This whole procedure should be evaluated using a separate test set, or via cross-validation.

Additional notes: The procedure to select a value for R uses cross-validation, so if this procedure is evaluated itself using cross-validation, then the entire process is nested cross-validation. Incorrect answers include the following:

• "The partition of S should be stratified." No: first of all, we are doing regression, so stratification is not well-defined; and second, failing to stratify increases variability but does not cause overfitting.

• "The partition of S should be done separately for each R value, not just once." No: a different partition for each R value might increase the variability in the evaluation of each R, but it would not change the fact that the best R is being selected according to its performance on all of S.

Two basic points to remember are that it is never fair to evaluate a learning method on its training set, and that any search for settings for a learning algorithm (e.g. search for a subset of features, or for algorithmic parameters) is part of the learning method.

For example. while decisions typically cannot. and maximizing the value of customers. Mathematically. Suppose that examples are credit card transactions and the label y = 1 designates a legitimate transaction. If i = j then the prediction is correct. even if the transaction is most likely legitimate. i. and costs Decisions and predictions are conceptually very different. then it can be rational not to approve a large transaction.e. A cost matrix c has the following structure when there are only two classes: 59 .1 Predictions. predicting i means acting as if i is true. The (i. 7. Here. when in fact class j is true. while the corresponding decision is whether or not to administer the drug. if the cost of approving a fraudulent transaction is proportional to the dollar amount involved. decisions. For example. Conversely. j) entry in a cost matrix c is the cost of acting as if class i is true. let i be the predicted class and let j be the true class. approving the transaction. so one could equally well call this deciding i. while if i = j the prediction is incorrect. The essence of cost-sensitive decision-making is that it can be optimal to act as if one class is true even when some other class is more probable. Then making the decision y = 1 for an attempted transaction means acting as if the transaction is legitimate. it can be rational to approve a small transaction even if there is a high probability it is fraudulent. Predictions can often be probabilistic. a prediction concerning a patient may be “allergic” or “not allergic” to aspirin.Chapter 7 Making optimal decisions This chapter discusses making optimal decisions based on predictions.

                        actual negative      actual positive
    predict negative    c(0, 0) = c00        c(0, 1) = c01
    predict positive    c(1, 0) = c10        c(1, 1) = c11

The cost of a false positive is c10 while the cost of a false negative is c01. We follow the convention that cost matrix rows correspond to alternative predicted classes, while columns correspond to actual classes. In short the convention is row/column = i/j = predicted/actual. (This convention is the opposite of the one in Section 5.1, so perhaps we should switch one of these to make the conventions similar.)

The optimal prediction for an example x is the class i that minimizes the expected cost

    e(x, i) = Σj p(j|x) c(i, j).                                    (7.1)

For each i, e(x, i) is an expectation computed by summing over the alternative possibilities for the true class of x. In this framework, the role of a learning algorithm is to produce a classifier that for any example x can estimate the probability p(j|x) of each class j being the true class of x.

7.2 Cost matrix properties

Conceptually, the cost of labeling an example incorrectly should always be greater than the cost of labeling it correctly. For a 2x2 cost matrix, this means that it should always be the case that c10 > c00 and c01 > c11. We call these conditions the "reasonableness" conditions.

Suppose that the first reasonableness condition is violated, so c00 ≥ c10 but still c01 > c11. In this case the optimal policy is to label all examples positive. Similarly, if c10 > c00 but c11 ≥ c01 then it is optimal to label all examples negative. (The reader can analyze the case where both reasonableness conditions are violated.)

For some cost matrices, some class labels are never predicted by the optimal policy given by Equation (7.1). The following is a criterion for when this happens. Say that row m dominates row n in a cost matrix C if for all j, c(m, j) ≥ c(n, j). In this case the cost of predicting n is no greater than the cost of predicting m, regardless of what the true class j is. So it is optimal never to predict m. As a special case, the optimal prediction is always n if row n is dominated by all other rows in a cost matrix. The two reasonableness conditions for a two-class cost matrix imply that neither row in the matrix dominates the other.

Given a cost matrix, the decisions that are optimal are unchanged if each entry in the matrix is multiplied by a positive constant. This scaling corresponds to changing the unit of account for costs. Similarly, the decisions that are optimal are unchanged if a constant is added to each entry in the matrix.
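Equation (7.1) and the argmin rule translate directly into code. The sketch below is illustrative, with rows of `c` indexed by predicted class and columns by actual class, following the convention above.

```python
def expected_cost(c, probs, i):
    """e(x, i) = sum over j of p(j|x) * c[i][j]."""
    return sum(p * cij for p, cij in zip(probs, c[i]))

def optimal_prediction(c, probs):
    """The class i minimizing expected cost for one example x."""
    return min(range(len(c)), key=lambda i: expected_cost(c, probs, i))
```

For instance, with c00 = 0, c01 = 10, c10 = 1, c11 = 0 and p(y = 1|x) = 0.2, predicting class 1 is optimal (expected cost 0.8 versus 2.0) even though class 0 is more probable — exactly the cost-sensitive phenomenon described in Section 7.1.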

This shifting corresponds to changing the baseline away from which costs are measured. By scaling and shifting entries, any two-class cost matrix

    c00    c01
    c10    c11

that satisfies the reasonableness conditions can be transformed into a simpler matrix that always leads to the same decisions:

    0      c′01
    1      c′11

where c′01 = (c01 − c00)/(c10 − c00) and c′11 = (c11 − c00)/(c10 − c00). From a matrix perspective, a 2x2 cost matrix effectively has two degrees of freedom.

7.3 The logic of costs

Costs are not necessarily monetary. A cost can also be a waste of time, or the severity of an illness, for example. Although most recent research in machine learning has used the terminology of costs, doing accounting in terms of benefits is generally preferable, because avoiding mistakes is easier, since there is a natural baseline from which to measure all benefits, whether positive or negative. This baseline is the state of the agent before it takes a decision regarding an example. After the agent has made the decision, if it is better off, its benefit is positive. Otherwise, its benefit is negative.

When thinking in terms of costs, it is easy to posit a cost matrix that is logically contradictory because not all entries in the matrix are measured from the same baseline. For example, consider the so-called German credit dataset that was published as part of the Statlog project [Michie et al., 1994]. The cost matrix given with this dataset at http://www.sgi.com/tech/mlc/db/german.names is as follows:

                    actual bad    actual good
    predict bad         0              1
    predict good        5              0

Here examples are people who apply for a loan from a bank. "Actual good" means that a customer would repay a loan while "actual bad" means that the customer would default. The action associated with "predict bad" is to deny the loan.

In every economically reasonable cost matrix for this domain, both entries in the "predict bad" row must be the same: when a loan is denied, the cashflow relative to any baseline associated with this prediction is the same regardless of whether "actual good" or "actual bad" is true. If these entries are different, it is because different baselines have been chosen for each entry. Costs or benefits can be measured against any baseline, but the baseline must be fixed.

An opportunity cost is a foregone benefit, i.e. a missed opportunity rather than an actual penalty. It is easy to make the mistake of measuring different opportunity costs against different baselines. For example, the erroneous cost matrix above can be justified informally as follows: "The cost of approving a good customer is zero, and the cost of rejecting a bad customer is zero, because in both cases the correct decision has been made. If a good customer is rejected, the cost is an opportunity cost, the foregone profit of 1. If a bad customer is approved for a loan, the cost is the lost loan principal of 5."

To see concretely that the reasoning in quotes above is incorrect, suppose that the bank has one customer of each of the four types. Clearly the cost matrix above is intended to imply that the net change in the assets of the bank is then −4. Alternatively, suppose that we have four customers who receive loans and repay them. The net change in assets is then +4. Regardless of the baseline, any method of accounting should give a difference of 8 between these scenarios. But with the erroneous cost matrix above, the first scenario gives a total cost of 6, while the second scenario gives a total cost of 0.

In general the amount in some cells of a cost or benefit matrix may not be constant, and may be different for different examples. For example, consider the credit card transactions domain. Here the benefit matrix might be

                fraudulent     legitimate
    refuse        $20            −$20
    approve       −x             0.02x

where x is the size of the transaction in dollars. Approving a fraudulent transaction costs the amount of the transaction because the bank is liable for the expenses of fraud. Refusing a legitimate transaction has a non-trivial cost because it annoys a customer. Refusing a fraudulent transaction has a non-trivial benefit because it may prevent further fraud and lead to the arrest of a criminal.

7.4 Making optimal decisions

Given known costs for correct and incorrect predictions, a test example should be predicted to have the class that leads to the lowest expected cost.

This expectation is the predicted average cost, and is computed using the conditional probability of each class given the example. In the two-class case, the optimal prediction is class 1 if and only if the expected cost of this prediction is less than or equal to the expected cost of predicting class 0, i.e. if and only if

    p(y = 0|x)c10 + p(y = 1|x)c11 ≤ p(y = 0|x)c00 + p(y = 1|x)c01

which is equivalent to

    (1 − p)c10 + pc11 ≤ (1 − p)c00 + pc01

given p = p(y = 1|x). If this inequality is in fact an equality, then predicting either class is optimal.

The threshold for making optimal decisions is p∗ such that

    (1 − p∗)c10 + p∗c11 = (1 − p∗)c00 + p∗c01.

Assuming the reasonableness conditions c10 > c00 and c01 ≥ c11, the optimal prediction is class 1 if and only if p ≥ p∗. Rearranging the equation for p∗ leads to

    c00 − c10 = −p∗c10 + p∗c11 + p∗c00 − p∗c01

which has the solution

    p∗ = (c10 − c00) / (c10 − c00 + c01 − c11)

assuming the denominator is nonzero, which is implied by the reasonableness conditions. This formula for p∗ shows that any 2x2 cost matrix has essentially only one degree of freedom from a decision-making perspective, although it has two degrees of freedom from a matrix perspective. The cause of the apparent contradiction is that the optimal decision-making policy is a nonlinear function of the cost matrix.

Note that in some domains costs have more impact than probabilities on decision-making. An extreme case is that with some cost matrices the optimal decision is always the same, regardless of what the various outcome probabilities are for a given example x.¹ The situation where the optimal decision is fixed is well-known in philosophy as Pascal's wager. Essentially, this is the case where the minimax decision and the minimum expected cost decision are always the same.

An argument analogous to Pascal's wager provides a justification for Occam's razor in machine learning. Essentially, a PAC theorem says that if we have a fixed, limited number of training examples, and the true concept is simple, then if we pick a simple concept then we will be right. However if we pick a complex concept, there is no guarantee we will be right.

¹ Decisions and predictions may be not in 1:1 correspondence.
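The threshold formula can be checked numerically against the direct expected-cost comparison. This is a small illustrative sketch, not code from the text.

```python
def p_star(c00, c01, c10, c11):
    """p* = (c10 - c00) / (c10 - c00 + c01 - c11); predict class 1 iff p >= p*."""
    return (c10 - c00) / (c10 - c00 + c01 - c11)

def predict_positive(p, c00, c01, c10, c11):
    """Direct comparison: class 1 iff (1-p)c10 + p*c11 <= (1-p)c00 + p*c01."""
    return (1 - p) * c10 + p * c11 <= (1 - p) * c00 + p * c01
```

With the symmetric 0/1 cost matrix the threshold is 0.5; making false negatives ten times costlier (c01 = 10, c10 = 1) drops it to 1/11, so even fairly improbable positives are predicted positive.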

This means that the benefit matrix for the Occam's razor argument is

                        nature simple    nature complex
    predict simple        succeed           fail
    predict complex       fail              fail

Here "simple" means that the hypothesis is drawn from a space with low cardinality, while "complex" means it is drawn from a space with high cardinality.

7.5 Limitations of cost-based analysis

The analysis of decision-making above assumes that the objective is to minimize expected cost. This objective is reasonable when a game is played repeatedly, and the actual class of examples is set stochastically. The analysis may not be appropriate when the agent is risk-averse and can make only a few decisions. It is also not appropriate when the actual class of examples is set by a non-random adversary. Other limitations of the analysis include the following:

• We may need to make repeated decisions over time about the same example.
• Decisions about one example may influence other examples.
• Costs may need to be estimated via learning.
• When there are more than two classes, we do not get full label information on training examples.

7.6 Rules of thumb for evaluating data mining campaigns

Let q be a fraction of the test set, and let t be the overall fraction of positives; t is also called the target rate, response rate, or base rate. There is a remarkable rule of thumb that the lift attainable at q is around √(1/q). The fraction of all positive examples that belong to the top-ranked fraction q of the test set is then q · √(1/q) = √q. The lift in the bottom 1 − q is (1 − √q)/(1 − q). We have

    lim(q→1) (1 − √q)/(1 − q) = 0.5.

This says that the examples ranked lowest are positive with probability equal to t/2. In other words, the lift for the lowest ranked examples cannot be driven below 0.5.

The rule of thumb is not valid for very small values of q, for two reasons. The first reason is mathematical: an upper bound on the lift is 1/t, where t is the overall fraction of positives. The second reason is empirical: lifts observed in practice tend to be well below 10; for example, if q = 0.25 then the attainable lift is around √4 = 2. Hence, the rule of thumb can reasonably be applied when q > 0.02 and q > t².

Let c be the cost of a contact and let b be the benefit of a positive.
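The rule of thumb is easy to compute. The formulas below are the ones just stated; the numbers in the usage note are merely illustrative.

```python
import math

def lift_at(q):
    """Rule-of-thumb lift in the top-ranked fraction q: sqrt(1/q)."""
    return math.sqrt(1 / q)

def bottom_lift(q):
    """Lift in the remaining fraction 1 - q: (1 - sqrt(q)) / (1 - q)."""
    return (1 - math.sqrt(q)) / (1 - q)
```

For example, lift_at(0.25) is 2, the fraction of positives captured is q · lift_at(q) = √q, and bottom_lift(q) approaches 0.5 as q → 1.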

The benefit matrix is

                        actual negative    actual positive
    decide negative          0                  0
    decide positive         −c                 b − c

Let n be the size of the test set. The cost of soliciting the top-ranked fraction q of the test set is nqc. The profit at q is the total benefit minus the total cost, which is

    nqbt√(1/q) − nqc = nc(bt√q/c − q) = nc(k√q − q)

where k = tb/c. Profit is maximized when

    0 = d/dq (kq^0.5 − q) = k(0.5)q^−0.5 − 1.

The solution is q = (k/2)². This solution is in the range zero to one only when k ≤ 2. When tb/c > 2 then maximum profit is achieved by soliciting every prospect, so data mining is pointless. The maximum profit that can be achieved is

    nc(k(k/2) − (k/2)²) = nck²/4 = ncq.

Remarkably, this is always the same as the cost of running the campaign, which is nqc. The conclusion above is that the revenue from a campaign, for its initiator, is roughly twice its cost. Note however that a small adverse change in c, b, or t is not likely to make the campaign lose money, because the expected revenue is always twice the campaign cost. As k decreases, the attainable profit decreases fast, quadratically. If k < 0.4 then q < 0.04 and the attainable profit is less than 0.04nc. Such a campaign may have high risk.

Each contact also has a cost or benefit for the recipient. From the point of view of the initiator of a campaign, a cost or benefit for a respondent is an externality. It is not rational for the initiator to take into account these benefits or costs, unless they cause respondents to take actions that affect the initiator, such as closing accounts. However, it is rational for society to take into account these externalities. Generally, if the recipient responds, one can assume that the contact was beneficial for him or her, because s/he only responds if the response is beneficial. But if the recipient does not respond, then one can assume that the contact caused a net cost to him or her, for example a loss of time.

It is interesting to consider whether data mining is beneficial or not from the point of view of society at large. Suppose that the benefit of responding for a respondent is λb, where b is the benefit to the initiator.
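The optimization just derived can be packaged as a short sketch, using only the formulas above.

```python
import math

def optimal_fraction(t, b, c):
    """q maximizing profit nc(k sqrt(q) - q), where k = t*b/c.
    For k <= 2 the maximizer is (k/2)^2; for k > 2, solicit everyone."""
    k = t * b / c
    return min(1.0, (k / 2) ** 2)

def max_profit_per_person(t, b, c):
    """Profit per test-set person at the optimal q: c(k sqrt(q) - q)."""
    k = t * b / c
    q = optimal_fraction(t, b, c)
    return c * (k * math.sqrt(q) - q)
```

With the 1998 KDD contest numbers used elsewhere in this chapter (t ≈ 0.05, b ≈ $15, c = $0.68), k ≈ 1.10, the optimal fraction is about 0.30, and the profit per person equals cq ≈ $0.21 — exactly the per-person cost of the campaign, confirming that revenue is twice cost.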

Suppose also that the cost of a solicitation to a person is µc, where c is the cost to the initiator. The net benefit to respondents is positive as long as µ < 2λ. The reasoning above clarifies why spam email campaigns are harmful to society: for these campaigns, the cost c to the initiator is tiny, but the cost to a recipient is not tiny, so µ is large. Whatever λ is, it is likely that µ > 2λ.

In summary, data mining is only beneficial in a narrow sweet spot, where tb/2 ≤ c ≤ αtb for some constant α greater than 1. The product tb is the average benefit of soliciting a random customer. If the cost c of solicitation is less than half of this, then it is rational to contact all potential respondents. If c is much greater than the average benefit, then the campaign is likely to have high risk for the initiator.

As an example of the reasoning above, consider the scenario of the 1998 KDD contest. Here t = 0.05 about, c = $0.68, and the average benefit is b = $15 approximately. We have k = tb/c = 75/68 = 1.10. The rule of thumb predicts that the optimal fraction of people to solicit is q = 0.30, while the achievable profit per person is cq = $0.21. In fact, the methods that perform best in this domain achieve profit of about $0.16, while soliciting about 70% of all people.

7.7 Evaluating success

In order to evaluate whether a real-world campaign is proceeding as expected, we need to estimate a confidence interval for the profit on the test set. This should be a range [a, b] such that the probability is (say) 95% that the observed profit lies in this range. Intuitively, there are two sources of uncertainty for each test person: whether s/he will donate, and if so, how much. We need to work out the logic of how to quantify these uncertainties, and then the logic of combining the uncertainties into an overall confidence interval. The central issue here is not any mathematical details such as Student distributions versus Gaussians. It is the logic of tracking where uncertainty arises, and how it is propagated.

Quiz (2009)

Suppose your lab is trying to crystallize a protein. You can try experimental conditions x that differ on temperature, salinity, etc. You have a classifier that predicts p(y = 1|x), the probability of success of experiment x. The label y = 1 means crystallization is successful, while y = 0 means failure. The value of successful crystallization is $9000. The cost of doing one experiment is $60. Assume (not realistically) that the results of different experiments are independent.

(a) Write down the benefit matrix involved in deciding rationally whether or not to perform a particular experiment.

The benefit matrix is

                      success      failure
    do experiment     9000 − 60     −60
    don't                0            0

(b) What is the threshold probability of success needed in order to perform an experiment?

It is rational to do an experiment under conditions x if and only if the expected benefit is positive, that is if and only if

    (9000 − 60) · p + (−60) · (1 − p) > 0

where p = p(success|x). The threshold value of p is 60/9000 = 1/150. The optimal behavior is to do all experiments that have success probability over 1/150.

(c) Is it rational for your lab to take into account a budget constraint such as "we only have a technician available to do one experiment per day"?

No. A budget constraint is an alternative rule for making decisions that is less rational than maximizing expected benefit. If many experiments are worth doing, then more technicians should be hired.

Additional explanation: The $60 cost of doing an experiment should include the expense of technician time. If the budget constraint is unavoidable, then experiments should be done starting with those that have the highest probability of success. If just one successful crystallization is enough, then experiments should also be done in order of declining success probability, until the first actual success.
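The expected-benefit calculation in part (b) can be written out directly; this sketch uses the quiz's numbers as defaults.

```python
def expected_benefit(p, value=9000, cost=60):
    """Expected benefit of one experiment with success probability p:
    (value - cost) * p + (-cost) * (1 - p)."""
    return (value - cost) * p - cost * (1 - p)

def should_experiment(p, value=9000, cost=60):
    """Rational iff the expected benefit is positive, i.e. p > cost / value."""
    return expected_benefit(p, value, cost) > 0
```

The break-even probability is cost/value = 60/9000 = 1/150 ≈ 0.0067, matching the answer above.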

Quiz for May 4, 2010

Your name:

Suppose that you work for a bank that wants to prevent criminal laundering of money. The label y = 1 means a money transfer is criminal, while y = 0 means the transfer is legal. You have a classifier that estimates the probability p(y = 1|x) where x is a vector of feature values describing a money transfer. Let z be the dollar amount of the transfer. The matrix of costs (negative) and benefits (positive) involved in deciding whether or not to deny a transfer is as follows:

             criminal      legal
    deny        0         −0.10z
    allow      −z          0.01z

Work out the rational policy based on p(y = 1|x) for deciding whether or not to allow a transfer.
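One way to work this out (a sketch, not an official solution): deny is rational when the expected benefit of denying is at least that of allowing, i.e. (1 − p)(−0.10z) ≥ p(−z) + (1 − p)(0.01z), which simplifies to 1.11p ≥ 0.11, giving the threshold p ≥ 0.11/1.11 ≈ 0.099 — notably, independent of the amount z.

```python
def deny_transfer(p, z):
    """Deny iff the expected benefit of denying is at least that of allowing.
    E[deny]  = p * 0 + (1 - p) * (-0.10 * z)
    E[allow] = p * (-z) + (1 - p) * (0.01 * z)"""
    return (1 - p) * (-0.10 * z) >= p * (-z) + (1 - p) * (0.01 * z)
```

The threshold does not depend on z because every entry of the matrix scales linearly with z, and optimal decisions are unchanged when a cost matrix is multiplied by a positive constant.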

2009 Assignment

This assignment is due at the start of class on Tuesday May 5. As before, you should work in a team of two, and you are free to change partners or not. This assignment is the last one to use the KDD98 data.

The goal is to solicit an optimal subset of the test examples. The measure of success to maximize is total donations received minus $0.68 for every solicitation. You should train a regression function to predict donation amounts, and a classifier to predict donation probabilities. For a test example x, let the predicted donation be a(x) and let the predicted donation probability be p(x). You should decide to send a donation request to person x if and only if p(x) · a(x) ≥ 0.68.

You should use the training set for all development work. In particular, you should use part of the training set for debugging your procedure for reading in test examples, making decisions concerning them, and tallying total profit. You should then train on the entire training set, and measure final success on the test set that you have not previously used. Only use the actual test set once, to measure the final success of your method.

Notes: The test instances are in the file cup98val.zip at http://archive.ics.uci.edu/ml/databases/kddcup98/kddcup98.html. The test set labels are in valtargt.txt. The labels are sorted by CONTROLN, unlike the test instances.
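The decision rule of the assignment is one line; the profit-tallying helper below is a hypothetical sketch of the bookkeeping (its names and structure are our own, not part of the assignment).

```python
COST = 0.68  # cost of one solicitation, in dollars

def solicit(p, a, cost=COST):
    """Send a request iff the expected donation p(x) * a(x) covers the cost."""
    return p * a >= cost

def total_profit(decisions, donations, cost=COST):
    """Total donations received from solicited people, minus cost per solicitation."""
    received = sum(d for decide, d in zip(decisions, donations) if decide)
    return received - cost * sum(decisions)
```

For example, soliciting two people who donate $10.00 and $0.00 yields a profit of 10.00 − 2 · 0.68 = $8.64.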

2010 Assignment

This week's assignment is to participate in the PAKDD 2010 data mining contest. Details of the contest are at http://sede.neurotech.com.br/PAKDD2010/. Each team of two students should register and download the files for the contest.

Your first goal should be to understand the data and submission formats, and to submit a correct set of predictions. Make sure that when you have good predictions later, you will not run into any technical difficulties.

Your second goal should be to understand the contest scenario and the differences between the training and two test datasets. Do some exploratory analysis of the three datasets. Next, based on your general understanding, design a sensible approach for achieving the contest objective. Implement this approach and submit predictions to the contest. Of course, you may refine your approach iteratively and you may make multiple submissions.

Meet the May 3 deadline for uploading your best predictions and a copy of your report. In your written report, explain your understanding of the scenario, and your general findings about the datasets.

The contest rules ask each team to submit a paper of four pages: "The manuscript should be in the scientific paper format of the conference detailing the stages of the KDD process, focusing on the aspects which make the solution innovative. The quality of the manuscripts will not be used for the competition assessment, unless in the case of ties. The manuscript should be the basis for writing up a full paper for an eventual post-conference proceeding (currently under negotiation). The authors of top solutions and selected manuscripts with innovative approaches will have the opportunity to submit to that forum." You can find template files for LaTeX and Word at http://www.springer.com/computer/lncs?SGWID=0-164-6-793341-0. Do not worry about formatting details.

Chapter 8

Learning classifiers despite missing labels

This chapter discusses how to learn two-class classifiers from nonstandard training data. Specifically, we consider three different but related scenarios where labels are missing for some but not all training examples.

8.1 The standard scenario for learning a classifier

In the standard situation, the training and test sets are two different random samples from the same population. Being from the same population means that both sets follow a common probability distribution. Let x be a training or test instance, and let y be its label. The common distribution is p(x, y). The notation p(x, y) is an abbreviation for p(X = x and Y = y) where X and Y are random variables, and x and y are possible values of these variables. If X and Y are both discrete-valued random variables, this distribution is a probability mass function (pmf). Otherwise, it is a probability density function (pdf).

It is important to understand that we can always write p(x, y) = p(x)p(y|x) and also p(x, y) = p(y)p(x|y) without loss of generality and without making any assumptions. The equations above are sometimes called the chain rule of probabilities. They are true both when the probability values are probability masses, and when they are probability densities.

8.2 Sample selection bias in general

Suppose that the label y is observed for some training instances x, but not for others. Let s be a new random variable with value s = 1 if and only if x is selected, so that y is observed if and only if s = 1. There is some fixed unknown joint probability distribution p(x, y, s) over triples (x, y, s), where x, y, and s are random variables.

We can identify three possibilities for s. The easiest case is when p(s = 1|x, y) is a constant, that is, if p(s = 1) = p(s = 1|x, y). In this case, the labeled training examples are a fully random subset. We have fewer training examples, but otherwise we are in the usual situation. This case is called "missing completely at random" (MCAR).

A more difficult situation arises if s is correlated with x and/or with y. First, suppose that s depends on x but, conditional on x, not on y. This means that p(s = 1|x, y) = p(s = 1|x) for all x. Bayes' rule gives that

    p(y|x, s = 1) = p(s = 1|x, y) p(y|x) / p(s = 1|x)
                  = p(s = 1|x) p(y|x) / p(s = 1|x)
                  = p(y|x)

assuming that p(s = 1|x) > 0 for all x. In other words, for each value of x the equation p(y|x, s = 1) = p(y|x) = p(y|x, s = 0) holds. Therefore we can learn a correct model of p(y|x) from just the labeled training data, without using the unlabeled data in any way.

This case is rather misleadingly called "missing at random" (MAR), because missingness does depend on x. It is not the case that labels are missing in a totally random way, and it is not the case that s and y are independent. However, conditional on x, it is true that s and y are independent: when x is fixed, then s is independent of y.

The assumption that p(s = 1|x) > 0 is important. Its real-world meaning is that label information must be available with non-zero probability for every possible instance x. Otherwise, it might be the case for some x that no labeled training examples are available from which to estimate p(y|x, s = 1). An important question is whether the unlabeled training examples can be exploited in some way; here we focus on a different question: how can we adjust for the fact that the labeled training examples are not representative of the test examples? Formally, the training instances x for which y is observed are not necessarily a random sample from the same population as the test instances.

Second, suppose that even when x is known, there is still some correlation between s and y. In this case the label y is said to be "missing not at random" (MNAR), and inference is much more difficult. We do not discuss this case further here.
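The MAR claim that p(y|x, s = 1) = p(y|x) when selection depends only on x can be checked with a small simulation. All the probabilities below are invented for illustration:

```python
import random

random.seed(2)

# Hypothetical setup: x in {0, 1}; the positive rate differs by x,
# and the labeling (selection) probability also depends only on x.
p_y1 = {0: 0.2, 1: 0.7}   # p(y = 1 | x)
p_s1 = {0: 0.9, 1: 0.3}   # p(s = 1 | x): labels are much rarer when x = 1

counts = {0: [0, 0], 1: [0, 0]}   # per x: [labeled examples, labeled positives]
for _ in range(200_000):
    x = random.randint(0, 1)
    y = 1 if random.random() < p_y1[x] else 0
    s = 1 if random.random() < p_s1[x] else 0   # s depends on x only
    if s == 1:
        counts[x][0] += 1
        counts[x][1] += y

# Empirical p(y = 1 | x, s = 1): matches p(y = 1 | x) despite the biased selection.
est = {x: counts[x][1] / counts[x][0] for x in counts}
```

Although x = 1 examples are labeled three times less often, the conditional label frequency among labeled examples is unbiased for each x, which is exactly what makes learning p(y|x) from the labeled subset valid in the MAR case.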

8.3 Covariate shift

Imagine that the available training data come from one hospital, but we want to learn a classifier to use at a different hospital. Let x be a patient and let y be a label such as "has swine flu." The distribution p(x) of patients is different at the two hospitals, but we are willing to assume that the relationship to be learned, which is p(y|x), is unchanged. This scenario is called "covariate shift," where covariate means feature, and shift means change of distribution.

Following the argument in the previous section, the assumption that p(y|x) is unchanged means that p(y|x, s = 1) = p(y|x, s = 0), where s = 1 refers to the first hospital and s = 0 refers to the second hospital. The simple approach to covariate shift is thus to train the classifier on the data (x, y, s = 1) in the usual way, and to apply it to patients from the second hospital directly.

8.4 Reject inference

Suppose that we want to build a model that predicts who will repay a loan. The usable training examples are people who were actually granted a loan in the past. Unfortunately, these people are not representative of the population of all future applicants: they were previously selected precisely because they were thought to be more likely to repay. We need to be able to apply the model to the entire population of future applicants (sometimes called the "through the door" population). The problem of "reject inference" is to learn somehow from previous applicants who were not approved, for whom we hence do not have training labels.

Another example is the task of learning to predict the benefit of a medical treatment. If doctors have historically given the treatment only to patients who were particularly likely to benefit, then the observed success rate of the treatment is likely to be higher than its success rate on "through the door" patients. Conversely, if doctors have historically used the treatment only on patients who were especially ill, then the "through the door" success rate may be higher than the observed success rate.

The "reject inference" scenario is similar to the covariate shift scenario, but with two differences. The first difference concerns which training examples are available: in the covariate shift scenario no unlabeled training examples are available, while in a scenario like the one above unlabeled training instances are available, and labeled training examples from all classes are available. The second difference is that the label being missing is expected to be correlated with the value of the label. The important similarity is the common assumption that missingness depends only on x, and not additionally on y. Whether y is missing may

be correlated with the value of y, but this correlation must disappear after conditioning on x. If missingness does depend on y even after conditioning on x, then we are in the MNAR situation, and in general we can draw no firm conclusions. An example of this situation is "survivor bias." Suppose our analysis is based on historical records, and those records are more likely to exist for survivors, everything else being equal. Then p(s = 1|y = 1, x) > p(s = 1|y = 0, x), where "everything else being equal" means that x is the same on both sides.

A useful fact is the following. Suppose we want to estimate E[f(z)] where z follows the distribution p(z), but we can only draw samples of z from the distribution q(z). The fact is

    E[f(z) | z ~ p(z)] = E[f(z) p(z)/q(z) | z ~ q(z)],

assuming q(z) > 0 for all z. More generally, the requirement is that q(z) > 0 whenever p(z) > 0, with the definition 0/0 = 0. The equation above is called the importance-sampling identity.

Let the goal be to compute E[f(x, y)] for any function f, where (x, y) ~ p(x, y). To make notation more concise, write this as E[f], and write E[f(x, y) | (x, y) ~ p(x, y|s = 1)] as E[f | s = 1]. We have

    E[f] = E[ f * p(x)p(y|x) / (p(x|s = 1) p(y|x, s = 1)) | s = 1 ]
         = E[ f * p(x) / p(x|s = 1) | s = 1 ]

since p(y|x) = p(y|x, s = 1). Applying Bayes' rule to p(x|s = 1) gives

    E[f] = E[ f * p(x) / (p(s = 1|x) p(x) / p(s = 1)) | s = 1 ]
         = E[ f * p(s = 1) / p(s = 1|x) | s = 1 ].

Let p^(s = 1|x) be a trained model of the conditional probability p(s = 1|x). The constant p(s = 1) can be estimated as r/n, where r is the number of labeled training examples and n is the total number of training examples. The estimate of E[f] is then

    (1/n) * sum_{i=1}^{r} f(x_i, y_i) / p^(s = 1|x_i).
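As a minimal runnable illustration of the importance-sampling identity (the target and proposal distributions below are invented for the example), the following sketch estimates E[f(z)] under a target p using samples drawn only from a proposal q:

```python
import random

random.seed(0)

# Hypothetical discrete target p(z) and proposal q(z) over {0, 1, 2, 3}.
p = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}
q = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}

def f(z):
    return z * z

# Draw from q, then reweight each sample by w(z) = p(z)/q(z).
zs = random.choices(list(q.keys()), weights=list(q.values()), k=200_000)
estimate = sum(f(z) * p[z] / q[z] for z in zs) / len(zs)

# Exact value for comparison: E[f] = sum_z f(z) p(z) = 5.0.
true_value = sum(f(z) * pz for z, pz in p.items())
```

Samples where the ratio p(z)/q(z) is large dominate the sum, which is the same high-variance phenomenon that affects the reweighting estimators in this chapter.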

This estimate is called a "plug-in" estimate because it is based on plugging the observed values of (x, y) into a formula that would be correct if based on integrating over all values of (x, y). The weighting approach just explained is correct in the statistical sense that it is unbiased, if the propensity estimates p^(s = 1|x) are correct. However, this approach typically has high variance, since a few labeled examples with high values for 1/p^(s = 1|x) can dominate the sum. Therefore an important question is whether alternative approaches exist that have lower variance. One simple heuristic is to place a ceiling on the values 1/p^(s = 1|x); for example, the ceiling 1000 is used by [Huang et al., 2006]. However, no good method for selecting the ceiling value is known. In medical research, the ratio p(s = 1)/p(s = 1|x) is called the "inverse probability of treatment" (IPT) weight.

When does the reject inference scenario give rise to MNAR bias?

8.5 Positive and unlabeled examples

Suppose that only positive examples are labeled. This fact can be stated formally as the equation

    p(s = 1|x, y = 0) = 0.    (8.1)

Without some assumption about which positive examples are labeled, it is impossible to make progress. A common assumption is that the labeled positive examples are chosen completely randomly from all positive examples. Let this be called the "selected completely at random" assumption. Stated formally, it is that

    p(s = 1|x, y = 1) = p(s = 1|y = 1) = c.    (8.2)

Another way of stating the assumption is that s and x are conditionally independent given y.

A training set consists of two subsets, called the labeled (s = 1) and unlabeled (s = 0) sets. Suppose we provide these two sets as inputs to a standard training algorithm. This algorithm will yield a function g(x) such that g(x) = p(s = 1|x) approximately. The following lemma shows how to obtain a model of p(y = 1|x) from g(x).

Lemma 1. Suppose the "selected completely at random" assumption holds. Then p(y = 1|x) = p(s = 1|x)/c, where c = p(s = 1|y = 1).

Proof. Remember that the assumption is p(s = 1|y = 1, x) = p(s = 1|y = 1). Now consider p(s = 1|x). We have that

    p(s = 1|x) = p(y = 1 and s = 1 | x)
               = p(y = 1|x) p(s = 1|y = 1, x)
               = p(y = 1|x) p(s = 1|y = 1).

The result follows by dividing each side by p(s = 1|y = 1).

Several consequences of the lemma are worth noting. First, f = g/p(s = 1|y = 1) is a well-defined probability f <= 1 only if g <= p(s = 1|y = 1). What this says is that g > p(s = 1|y = 1) is impossible. This is reasonable because the labeled (positive) and unlabeled (negative) training sets for g are samples from overlapping regions in x space. Hence it is impossible for any example x to belong to the positive class for g with a high degree of certainty. Second, f is an increasing function of g. This means that if the classifier f is only used to rank examples x according to the chance that they belong to class y = 1, then the classifier g can be used directly instead of f.

The value of the constant c = p(s = 1|y = 1) can be estimated using a trained classifier g and a validation set of examples. Let V be such a validation set that is drawn from the overall distribution p(x, y, s) in the same manner as the nontraditional training set. Let P be the subset of examples in V that are labeled (and hence positive). The estimator of p(s = 1|y = 1) is the average value of g(x) for x in P. Formally, the estimator is e1 = (1/n) sum_{x in P} g(x), where n is the cardinality of P.

We shall show that e1 = p(s = 1|y = 1) = c if it is the case that g(x) = p(s = 1|x) for all x. To do this, all we need to show is that g(x) = c for x in P. We can show this as follows:

    g(x) = p(s = 1|x)
         = p(s = 1|x, y = 1) p(y = 1|x) + p(s = 1|x, y = 0) p(y = 0|x)
         = p(s = 1|x, y = 1) * 1 + 0 * 0    since x in P
         = p(s = 1|y = 1).

Note that in principle any single example from P is sufficient to determine c, but that in practice averaging over all members of P is preferable.

There is an alternative way of using Lemma 1. Let the goal be to estimate E_{p(x,y,s)}[h(x, y)] for any function h, where p(x, y, s) is the overall distribution. To make notation more concise, write this as E[h]. We want an estimator of E[h] based on a positive-only training set of examples of the form (x, s).
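The estimator e1 can be checked on synthetic data. The sketch below is hypothetical: it uses a one-dimensional instance space where the true label is a deterministic function of x (so that p(y = 1|x) is exactly 0 or 1, as the derivation above requires for x in P), and it uses the ideal nontraditional classifier g(x) = p(s = 1|x) rather than a trained one:

```python
import random

random.seed(1)

c = 0.3  # true p(s = 1 | y = 1): the labeling frequency for positives

def p_pos(x):
    # True p(y = 1 | x): deterministic, so it equals 0 or 1 exactly.
    return 1.0 if x > 0 else 0.0

def g(x):
    # Ideal nontraditional classifier: g(x) = p(s = 1|x) = c * p(y = 1|x).
    return c * p_pos(x)

# Simulate a validation set V; P is its labeled (and hence positive) subset.
P = []
for _ in range(100_000):
    x = random.gauss(0.0, 1.0)
    y = 1 if random.random() < p_pos(x) else 0
    s = 1 if (y == 1 and random.random() < c) else 0  # selected completely at random
    if s == 1:
        P.append(x)

e1 = sum(g(x) for x in P) / len(P)  # estimator of c
# A model of p(y = 1 | x) is then g(x) / e1, per Lemma 1.
```

Here e1 recovers c exactly because g(x) = c for every labeled example; with a trained, imperfect g the estimate is only approximate.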

Clearly p(y = 1|x, s = 1) = 1. Less obviously,

    p(y = 1|x, s = 0) = p(s = 0|x, y = 1) p(y = 1|x) / p(s = 0|x)
                      = [1 - p(s = 1|x, y = 1)] p(y = 1|x) / (1 - p(s = 1|x))
                      = (1 - c) p(y = 1|x) / (1 - p(s = 1|x))
                      = (1 - c) [p(s = 1|x)/c] / (1 - p(s = 1|x))
                      = ((1 - c)/c) * p(s = 1|x) / (1 - p(s = 1|x)).    (8.3)

By definition,

    E[h] = sum_x p(x) sum_s p(s|x) sum_y p(y|x, s) h(x, y)
         = sum_x p(x) [ p(s = 1|x) h(x, 1)
                      + p(s = 0|x) ( p(y = 1|x, s = 0) h(x, 1) + p(y = 0|x, s = 0) h(x, 0) ) ].

The plug-in estimate of E[h] is then the empirical average

    (1/m) [ sum_{(x, s=1)} h(x, 1) + sum_{(x, s=0)} ( w(x) h(x, 1) + (1 - w(x)) h(x, 0) ) ],

where w(x) = p(y = 1|x, s = 0) and m is the cardinality of the training set. What this says is that each labeled example is treated as a positive example with unit weight, while each unlabeled example is treated as a combination of a positive example with weight p(y = 1|x, s = 0) and a negative example with complementary weight 1 - p(y = 1|x, s = 0). The probability p(s = 1|x) is estimated as g(x), where g is the nontraditional classifier explained in the previous section.
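The weighting scheme implied by equation (8.3) can be written as a short helper. The function below is a hypothetical sketch: `g` stands for any model of p(s = 1|x), and `c` for an estimate of p(s = 1|y = 1):

```python
def pu_weighted_examples(examples, g, c):
    """Convert positive/unlabeled data into a weighted two-class training set.

    examples: iterable of (x, s) pairs, with s = 1 for labeled (positive)
    and s = 0 for unlabeled. Returns (x, label, weight) triples.
    """
    out = []
    for x, s in examples:
        if s == 1:
            out.append((x, 1, 1.0))  # labeled example: positive with unit weight
        else:
            # w = p(y = 1 | x, s = 0) from equation (8.3)
            w = ((1.0 - c) / c) * g(x) / (1.0 - g(x))
            out.append((x, 1, w))        # unlabeled: positive copy with weight w
            out.append((x, 0, 1.0 - w))  # ...and negative copy with weight 1 - w
    return out

# Example with a constant hypothetical model g(x) = 0.2 and c = 0.4:
rows = pu_weighted_examples([("a", 1), ("b", 0)], g=lambda x: 0.2, c=0.4)
```

The resulting triples can be fed to any learner that accepts per-example weights, such as weighted logistic regression.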

The result above on estimating E[h] can be used to modify a learning algorithm in order to make it work with positive and unlabeled training data. One method is to give training examples individual weights. Positive examples are given unit weight, and unlabeled examples are duplicated: one copy of each unlabeled example is made positive with weight p(y = 1|x, s = 0), and the other copy is made negative with weight 1 - p(y = 1|x, s = 0).

8.6 Further issues

Spam filtering scenario. Concept drift. Adverse selection. Moral hazard. Observational studies.

Quiz for May 11, 2010

Your name:

The importance sampling identity is the equation

    E[f(z) | z ~ p(z)] = E[f(z) w(z) | z ~ q(z)]

where w(z) = p(z)/q(z). We assume that q(z) > 0 for all z such that p(z) > 0, and we define 0/0 = 0.

Suppose that the training set consists of values z sampled according to the probability distribution q(z). Explain intuitively which members of the training set will have greatest influence on the estimate of E[f(z) | z ~ p(z)].

Quiz for 2009

(a) Suppose you use the weighting approach to deal with reject inference. What are the minimum and maximum possible values of the weights?

Let x be a labeled example, and let its weight be p(s = 1)/p(s = 1|x). The conditional probability p(s = 1|x) can range between 0 and 1, so the weight can range between p(s = 1) and infinity. Intuitively, this weight is how many copies are needed to allow the one labeled example to represent all the unlabeled examples that are similar.

(b) Suppose you use the weighting approach to learn from positive and unlabeled examples. What are the minimum and maximum possible values of the weights?

In this scenario, weights are assigned to unlabeled examples, not to labeled examples as above. The weights here are probabilities p(y = 1|x, s = 0), so they range between 0 and 1.

(c) Explain intuitively what can go wrong if the "selected completely at random" assumption is false, when learning from positive and unlabeled examples.

The "selected completely at random" assumption says that the positive examples with known labels are perfectly representative of the positive examples with unknown labels. If this assumption is false, then there will be unlabeled examples that in fact are positive, but that we treat as negative, because they are different from the labeled positive examples. The trained model of the positive class will be too narrow.

Assignment (revised)

The goal of this assignment is to train useful classifiers using training sets with missing information of three different types: (i) covariate shift, (ii) reject inference, and (iii) no labeled negative training examples.

In http://www.cs.ucsd.edu/users/elkan/291/dmfiles.zip you can find four datasets: one test set, and one training set for each of the three scenarios. All four datasets are derived from the so-called Adult dataset that is available at http://archive.ics.uci.edu/ml/datasets/Adult. (Many thanks to Aditya Menon for creating these files.) Each example describes one person. Each example has values for 13 predictors. (We are not using the published weightings, so the fnlwgt feature has been omitted from our datasets.) The label to be predicted is whether or not the person earns over $50,000 per year. This is an interesting label to predict because it is analogous to a label describing whether or not the person is a customer that is desirable in some way.

Training set (i) has 5,164 examples, sets (ii) and (iii) have 11,305 examples, and the test set has 11,307 examples. You should train a classifier separately based on each training set, and measure performance separately but on the same test set. Use accuracy as the primary measure of success. First, do cross-validation on the test set to establish the accuracy that is achievable when all training set labels are known, because these examples are a random sample from the test population; you should be able to achieve better than 82% accuracy. Use the logistic regression option of FastLargeMargin as the main training method.

Training set (i) requires you to overcome covariate shift, since it does not follow the same distribution as the population of test examples. Use an appropriate weighting method. Evaluate experimentally the effectiveness of learning p(y|x) from biased training sets of size 1000, 2000, etc. Compare learning p(y|x) directly with learning p(y|x) after reweighting.

Training set (ii) requires reject inference. The persons with known labels are, on average, better prospects than the ones with unknown labels; of course, in a real application we would not know this.

In training set (iii), a random subset of positive training examples have known labels. Other training examples may be negative or positive, but the training label is known only for some training examples. Explain how you estimate the constant c = p(s = 1|y = 1), and discuss how close your estimate is to the true value of c.

In your report, show graphically a learning curve, that is, accuracy as a function of the number of training examples, for 1000, 2000, etc. training examples.

For each scenario and each training method, include in your report a learning curve figure that shows accuracy as a function of the number (1000, 2000, etc.) of labeled training examples used. Discuss the extent to which each of the three missing-label scenarios reduces achievable accuracy.

Chapter 9

Recommender systems

The collaborative filtering (CF) task is to recommend items to a user that he or she is likely to like, based on ratings for different items provided by the same user and on ratings provided by other users. From a formal perspective, the input to a collaborative filtering algorithm is a matrix of incomplete ratings. Each row corresponds to a user, while each column corresponds to an item. If user 1 <= i <= m has rated item 1 <= j <= n, the matrix entry xij is the value of this rating. Often rating values are integers between 1 and 5. If a user has not rated an item, the corresponding matrix entry is missing. Missing ratings are often represented as 0, but this should not be viewed as an actual rating value.

The output of a collaborative filtering algorithm is a prediction of the value of each missing matrix entry. Typically the predictions are not required to be integers. Given these predictions, many different real-world tasks can be performed. For example, a recommender system might suggest to a user those items for which the predicted ratings by this user are highest.

There are two main general approaches to the formal collaborative filtering task: nearest-neighbor-based and model-based. The general assumption is that users who give similar ratings to some items are likely also to give similar ratings to other items. Given a user and item for which a prediction is wanted, nearest neighbor (NN) approaches use a similarity function between rows, and/or a similarity function between columns, to pick a subset of relevant other users or items. The prediction is then some sort of average of the known ratings in this subset.

Model-based approaches to collaborative filtering construct a low-complexity representation of the complete xij matrix. This representation is then used instead of the original matrix to make predictions. Typically each prediction can be computed in O(1) time using only a fixed number of coefficients from the representation.


The most popular model-based collaborative ﬁltering algorithms are based on standard matrix approximation methods such as the singular value decomposition (SVD), principal component analysis (PCA), or nonnegative matrix factorization (NNMF). Of course these methods are themselves related to each other. From the matrix decomposition perspective, the fundamental issue in collaborative ﬁltering is that the matrix given for training is incomplete. Many algorithms have been proposed to extend SVD, PCA, NNMF, and related methods to apply to incomplete matrices, often based on expectation-maximization (EM). Unfortunately, methods based on EM are intractably slow for large collaborative ﬁltering tasks, because they require many iterations, each of which computes a matrix decomposition on a full m by n matrix. Here, we discuss an efﬁcient approach to decomposing large incomplete matrices.

9.1 Applications of matrix approximation

In addition to collaborative ﬁltering, many other applications of matrix approximation are important also. Examples include the voting records of politicians, the results of sports matches, and distances in computer networks. The largest research area with a goal similar to collaborative ﬁltering is so-called “item response theory” (IRT), a subﬁeld of psychometrics. The aim of IRT is to develop models of how a person answers a multiple-choice question on a test. Users and items in collaborative ﬁltering correspond to test-takers and test questions in IRT. Missing ratings are called omitted responses in IRT. One conceptual difference between collaborative ﬁltering research and IRT research is that the aim in IRT is often to ﬁnd a single parameter for each individual and a single parameter for each question that together predict answers as well as possible. The idea is that each individual’s parameter value is then a good measure of intrinsic ability and each question’s parameter value is a good measure of difﬁculty. In contrast, in collaborative ﬁltering research the goal is often to ﬁnd multiple parameters describing each person and each item, because the assumption is that each item has multiple relevant aspects and each person has separate preferences for each of these aspects.

9.2 Measures of performance

Given a set of predicted ratings and a matching set of true ratings, the difference between the two sets can be measured in alternative ways. The two standard measures are mean absolute error (MAE) and mean squared error (MSE). Given a probability distribution over possible values, MAE is minimized by taking the median of the distribution, while MSE is minimized by taking the mean of the distribution. In general the mean and the median are different, so in general predictions that minimize MAE are different from predictions that minimize MSE. Most matrix approximation algorithms aim to minimize MSE between the training data (a complete or incomplete matrix) and the approximation obtained from the learned low-complexity representation. Some methods can be used with equal ease to minimize MSE or MAE.
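The median/mean distinction is easy to verify numerically. For the made-up set of ratings {1, 1, 5}, a grid search over candidate predictions shows that MAE is minimized at the median 1 while MSE is minimized near the mean 7/3:

```python
ratings = [1, 1, 5]

def mae(pred):
    return sum(abs(pred - r) for r in ratings) / len(ratings)

def mse(pred):
    return sum((pred - r) ** 2 for r in ratings) / len(ratings)

grid = [i / 100 for i in range(100, 501)]  # candidate predictions 1.00 .. 5.00
best_for_mae = min(grid, key=mae)  # the median, 1.0
best_for_mse = min(grid, key=mse)  # close to the mean, 7/3 = 2.33...
```

With a skewed distribution like this one, the two optimal predictions differ by more than a full rating point, so the choice of error measure genuinely changes what a method should predict.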

9.3 Additive models

Let us consider first models where each matrix entry is represented as the sum of two contributions, one from its row and one from its column. Formally, aij = ri + cj, where aij is the approximation of xij, and ri and cj are scalars to be learned.

Define the training mean of each row, written xi., to be the mean of all its known values; define the training mean of each column, written x.j, similarly. The user-mean model sets ri = xi. and cj = 0, while the item-mean model sets ri = 0 and cj = x.j. A slightly more sophisticated baseline is the "bimean" model: ri = 0.5 xi. and cj = 0.5 x.j. The "bias from mean" model has ri = xi. and cj equal to the training mean over column j of the residuals xij - ri. Intuitively, cj is the average amount by which users who provide a rating for item j like this item more or less than the average item that they have rated.

The optimal additive model can be computed quite straightforwardly. Let I be the set of matrix indices (i, j) for which xij is known. The MSE-optimal additive model is the one that minimizes

    sum_{(i,j) in I} (ri + cj - xij)^2.

This optimization problem is a special case of a sparse least-squares linear regression problem: find z that minimizes ||Az - b||^2, where the column vector z = (r1, ..., rm, c1, ..., cn), b is a column vector of xij values, and the corresponding row of A is all zero except for ones in positions i and m + j. The vector z can be computed by many methods. The standard method uses the Moore-Penrose pseudoinverse of A: z = (A'A)^{-1} A'b, where A' denotes the transpose of A. However, this approach requires inverting an m + n by m + n matrix, which is computationally expensive. Section 9.4 below gives a gradient-descent method to obtain the optimal z that requires only O(|I|) time and space.

The matrix A is always rank-deficient, with rank at most m + n - 1, because any constant can be added to all ri and subtracted from all cj without changing the matrix reconstruction. If some users or items have very few ratings, the rank of A may be less than m + n - 1. However, the rank deficiency of A does not cause computational problems in practice.
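The baseline additive models are simple to compute directly. Below is a small hypothetical example (a 3 x 3 matrix with `None` marking missing entries) of the "bias from mean" model:

```python
# Ratings matrix; None marks a missing entry.
X = [[5, 3, None],
     [4, None, 1],
     [None, 2, 1]]
m, n = len(X), len(X[0])

# r_i: training mean of each row, over known values only.
r = [sum(v for v in row if v is not None) / sum(1 for v in row if v is not None)
     for row in X]

# c_j: mean residual x_ij - r_i over the users who rated item j.
c = []
for j in range(n):
    resid = [X[i][j] - r[i] for i in range(m) if X[i][j] is not None]
    c.append(sum(resid) / len(resid))

def predict(i, j):
    return r[i] + c[j]

# predict(0, 2) fills the missing entry in the first row.
```

Here r = [4.0, 2.5, 1.5], and item 0 has a positive bias (users rate it above their personal averages) while item 2 has a negative one, so predictions adjust each user's mean up or down per item.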

9.4 Multiplicative models

Multiplicative models are similar to additive models, but the row and column values are multiplied instead of added. Specifically, xij is approximated as aij = ri cj, with ri and cj being scalars to be learned. Like additive models, multiplicative models have one unnecessary degree of freedom: we can fix any single value ri or cj to be a constant without changing the space of achievable approximations. A multiplicative model is a rank-one matrix decomposition, since the rows and columns of the approximation matrix are linearly proportional.

We now present a general algorithm for learning row value/column value models; additive and multiplicative models are special cases. Let xij be a matrix entry, and let ri and cj be the corresponding row and column values. We approximate xij by aij = f(ri, cj) for some fixed function f. The approximation error is defined to be e(f(ri, cj), xij). The training error E over all known matrix entries I is the pointwise sum of errors. To learn ri and cj by minimizing E by gradient descent, we evaluate

    dE/dri = sum_{(i,j) in I} d/dri e(f(ri, cj), xij)
           = sum_{(i,j) in I} [de(f(ri, cj), xij) / df(ri, cj)] * [df(ri, cj) / dri].

As before, I is the set of matrix indices for which xij is known. Consider the special case where e(u, v) = |u - v|^p for p > 0. This case is a generalization of MAE and MSE: p = 2 corresponds to MSE and p = 1 corresponds to MAE. We have

    de(u, v)/du = p |u - v|^(p-1) d|u - v|/du = p |u - v|^(p-1) sgn(u - v).

Here sgn(a) is the sign of a, that is -1, 0, or 1 if a is negative, zero, or positive. For computational purposes we can ignore the non-differentiable case u = v. Therefore

    dE/dri = sum_{(i,j) in I} p |f(ri, cj) - xij|^(p-1) sgn(f(ri, cj) - xij) * df(ri, cj)/dri.

Now suppose f(ri, cj) = ri + cj, so df(ri, cj)/dri = 1. We obtain

    dE/dri = p sum_{(i,j) in I} |ri + cj - xij|^(p-1) sgn(ri + cj - xij).

Alternatively, suppose f(ri, cj) = ri cj, so df(ri, cj)/dri = cj. We get

    dE/dri = p sum_{(i,j) in I} |ri cj - xij|^(p-1) sgn(ri cj - xij) cj.

Given the gradients above, we apply online (stochastic) gradient descent. This means that we iterate over each triple (ri, cj, xij) in the training set, compute the gradient with respect to ri and cj based just on this one example, and perform the updates

    ri := ri - lambda * d/dri e(f(ri, cj), xij)    and    cj := cj - lambda * d/dcj e(f(ri, cj), xij),

where lambda is a learning rate. Note that lambda determines the step sizes for stochastic gradient descent; no separate algorithm parameter is needed for this. For the learning rate we use a decreasing schedule: lambda = 0.2/e for additive models and lambda = 0.4/e for multiplicative models, where 1 <= e <= 30 is the number of the current epoch. After any finite number of epochs, stochastic gradient descent does not converge fully. The choice of 30 epochs is a type of early stopping that leads to good results by not overfitting the training data.

Gradient descent as described above directly optimizes precisely the same objective function (given the MSE error function) that is called "incomplete data likelihood" in EM approaches to factorizing matrices with missing entries. It is sometimes forgotten that EM is just one approach to solving maximum-likelihood problems. In general, incomplete matrix factorization is an example of a maximum-likelihood problem where an alternative solution method is superior.

9.5 Combining models by fitting residuals

Simple models can be combined into more complex models. The most straightforward way to combine models is additively: a second model is trained to fit the difference between the predictions made by the first model and the truth as specified by the training data; this difference is called the residual. Let aij be the prediction from the first model, and let xij be the corresponding training value. The second model is trained to minimize error relative to the residual xij - aij. Let bij be the prediction made by the second model for matrix entry ij. The combined prediction is then aij + bij. If the first and second models are both multiplicative, then the combined model is a rank-two approximation. The extension to higher-rank approximations is obvious.

Standard principal component analysis (PCA) is a variety of mixed model. Before the first rank-one approximation is computed, the mean of each column is subtracted from that column of the original matrix. Thus the actual first approximation of the original matrix is aij = (0 + mj) + ri cj = (1 * mj) + ri cj, where mj is an initial column value and ri cj is a multiplicative model.

The common interpretation of a combined model is that each component model represents an aspect or property of items and users. The idea is that the column value for each item is the intensity with which it possesses this property, and the row value for each user is the weighting the user places on this property. However, this point of view implicitly treats each component model as equally important. It is more accurate to view each component model as an adjustment to previous models, where each model is empirically less important than each previous one, because the numerical magnitude of its contribution is smaller. The aspect of items represented by each model cannot be understood in isolation, but only in the context of previous aspects.

9.6 Further issues

One issue not discussed above is regularization: attempting to improve generalization by reducing the effective number of parameters in the trained low-complexity representation.

Another issue not discussed is any potential probabilistic interpretation of a matrix decomposition. One reason for not considering models that are explicitly probabilistic is that these are usually based on Gaussian assumptions that are incorrect by definition when the observed data are integers and/or in a fixed interval.

A third issue not discussed is any explicit model for how matrix entries come to be missing. There is no assumption or analysis concerning whether entries are missing "completely at random" (MCAR), "at random" (MAR), or "not at random" (MNAR). Intuitively, in the collaborative filtering context, missing ratings are likely to be MNAR, because people select which items to rate based on how much they anticipate liking them. This intuition can be confirmed empirically. On the Movielens dataset, the average rating in the training set is 3.58. The average prediction from one reasonable method for ratings in the test set, i.e. ratings that are unknown but known to exist, is 3.79. The average prediction for ratings that in fact do not exist is 3.34. This means that even after accounting for other influences, whether or not a rating is missing still depends on its value: low ratings are more likely to be missing than high ratings, everything else being equal.

Amazon: people who looked at x eventually bought y (usually within 24 hours).
(b) Let the predicted value of rating xij be ri cj + si dj and suppose ri and cj are trained ﬁrst. In order to allow si dj to be negative. (a) For any model. so ri and cj should always be positive. si should be negative.9. (Making si be always positive. This model is likely to overﬁt the training data. so it is sometimes positive and sometimes negative. True. ri should be positive. and xij = ri cj on average. However in this case the expressiveness of the model si dj would be reduced. In more technical language. For all viewers i. say whether the statement in italics is true or false. (They could always both be negative. and we train a rank-50 unregularized factor model. and then explain your answer brieﬂy. Hence.000 viewers and 1000 movies. People are more likely to like movies that they have actually watched. Ratings xij are positive. the average predicted rating for unrated movies is expected to be less than the average actual rating for rated movies. The unregularized model has 50 · (10.6. This difference is on average zero. True. FURTHER ISSUES 89 Quiz For each part below. which is more than the number of data points for training. Any good model should capture this fact.000 ratings for 10. might be possible. 000 parameters. the value of a rating is correlated with whether or not it is missing. True.) (c) We have a training set of 500. overﬁtting is practically certain.) The term si dj models the difference xij − ri cj . than random movies that they have not watched. 000 + 1000) = 550. si must be negative sometimes. . but that would be unintuitive without being more expressive. while allowing dj to be negative. but for some i.

Quiz for May 25, 2010

Page 87 of the lecture notes says:

Given the gradients above, we apply online (stochastic) gradient descent. This means that we iterate over each triple ⟨ri, cj, xij⟩ in the training set, compute the gradient with respect to ri and cj based just on this one example, and perform the updates

ri := ri − λ ∂/∂ri e(f(ri, cj), xij)   and   cj := cj − λ ∂/∂cj e(f(ri, cj), xij)

where λ is a learning rate.

State and explain what the first update rule is for the special case e(u, v) = (u − v)^2 and f(ri, cj) = ri · cj.
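For the special case in the quiz, the chain rule gives ∂e/∂ri = 2(ri cj − xij) cj, so the first update rule is ri := ri − 2λ(ri cj − xij) cj. The following pure-Python sketch applies this update online; the function name and the toy rank-1 data are illustrative, not from the notes.

```python
import random

def sgd_factorization(triples, n_rows, n_cols, lam=0.01, epochs=1000, seed=0):
    """Online gradient descent for xij ≈ ri * cj with squared error.

    For e(u, v) = (u - v)^2 and f(ri, cj) = ri * cj, the chain rule gives
    d e / d ri = 2 * (ri * cj - xij) * cj, so the first update rule is
    ri := ri - lam * 2 * (ri * cj - xij) * cj, and symmetrically for cj.
    """
    rng = random.Random(seed)
    r = [rng.uniform(0.1, 1.0) for _ in range(n_rows)]
    c = [rng.uniform(0.1, 1.0) for _ in range(n_cols)]
    for _ in range(epochs):
        for i, j, x in triples:
            err = r[i] * c[j] - x            # f(ri, cj) - xij
            r[i] -= lam * 2 * err * c[j]     # first update rule, for ri
            c[j] -= lam * 2 * err * r[i]     # second update, using the new r[i]
    return r, c

# Rank-1 data xij = 0.5 * (i+1) * (j+1) is recovered up to rescaling.
data = [(i, j, 0.5 * (i + 1) * (j + 1)) for i in range(3) for j in range(4)]
r, c = sgd_factorization(data, 3, 4)
```

After training, each product r[i] * c[j] approximates the corresponding training rating, even though the individual factors are only determined up to a shared scale.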

Assignment

The goal of this assignment is to apply a collaborative filtering method to the task of predicting movie ratings. You should use the small MovieLens dataset available at http://www.grouplens.org/node/73. This has 100,000 ratings given by 943 users to 1682 movies.

You can select any collaborative filtering method that you like. The method that you choose should handle missing values in a sensible and efficient way. You may reuse existing software, or you may write your own. Good existing software includes the following:

• Jason Rennie's fast maximum margin matrix factorization for collaborative filtering (MMMF) at http://people.csail.mit.edu/jrennie/matlab/.
• Guy Lebanon's toolkit at http://www-2.cs.cmu.edu/˜lebanon/IR-lab.htm.

You are also welcome to write your own code, in Matlab or in another programming language, or to choose other software. If you choose other existing software, you may want to ask the instructor for comments first. Whatever your choice, you must understand fully the algorithm that you apply, and you should explain it with mathematical clarity in your report.

In your experiments, do five-fold cross-validation, where each rating is assigned randomly to one fold. Note that this experimental procedure makes the task easier, because it is likely that every user and every movie is represented in each training set. Moreover, evaluation is biased towards users who have provided more ratings; it is easier to make accurate predictions for these users.

When reporting final results, show mean absolute error graphically as a function of a measure of complexity of your chosen method. If you select a matrix factorization method, this measure of complexity will likely be rank. Also show timing information graphically. Also discuss whether your chosen method needs a regularization technique to reduce overfitting, and discuss whether you could run your chosen method on the full Netflix dataset of about 10^8 ratings.
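The fold-assignment procedure described above (each rating, not each user, is randomly assigned to a fold) can be sketched as follows. The helper names are hypothetical, and mean absolute error is shown because it is the measure the assignment asks for.

```python
import random

def assign_folds(ratings, n_folds=5, seed=0):
    """Assign each (user, movie, rating) triple randomly to one fold.

    Because individual ratings, not users, are split, every user and movie
    is likely represented in every training set, as the text notes.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for triple in ratings:
        folds[rng.randrange(n_folds)].append(triple)
    return folds

def mean_absolute_error(predict, test_fold):
    """Average of |prediction - actual| over one held-out fold."""
    return sum(abs(predict(u, m) - r) for u, m, r in test_fold) / len(test_fold)

# Toy data: 100 ratings, all equal to 3.0, split into five folds.
ratings = [(u, m, 3.0) for u in range(10) for m in range(10)]
folds = assign_folds(ratings)
```

A constant predictor of 3.5 then has mean absolute error 0.5 on any fold, which gives a simple baseline to compare a factorization model against.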


Chapter 10

Text mining

This chapter explains how to do data mining on datasets that are collections of documents. Text mining tasks include

• classifier learning,
• clustering,
• topic modeling, and
• latent semantic analysis.

Classifiers for documents are useful for many applications. Major uses for binary classifiers include spam detection and personalization of streams of news articles. Multiclass classifiers are useful for routing messages to recipients. Most classifiers for documents are designed to categorize according to subject matter. However, it is also possible to learn to categorize according to qualitative criteria such as helpfulness for product reviews submitted by consumers.

In many applications of multiclass classification, a single document can belong to more than one category, so it is correct to predict more than one label. This task is specifically called multilabel classification. In standard multiclass classification, the classes are mutually exclusive, i.e. a special type of negative correlation is fixed in advance. In multilabel classification, it is important to learn the positive and negative correlations between classes.

Classifiers are useful for ranking documents as well as for dividing them into categories. With a training set of very helpful product reviews, and another training set of very unhelpful reviews, we can learn a scoring function that sorts other reviews according to their degree of helpfulness. There is often no need to pick a threshold, which would be arbitrary, to separate marginally helpful from marginally unhelpful reviews.

10.1 The bag-of-words representation

The first question we must answer is how to represent documents. For genuine understanding of natural language one must obviously preserve the order of the words in documents. However, for many large-scale data mining tasks, including classifying and clustering documents, it is sufficient to use a simple representation that loses all information about word order.

Given a collection of documents, the first task to perform is to identify the set of all words used at least once in at least one document. This set is called the vocabulary V. Often, it is reduced in size by keeping only words that are used in at least two documents. (Words that are found only once are often mis-spellings or other mistakes.) Many applications of text mining also eliminate from the vocabulary so-called "stop" words. These are words that are common in most documents and do not correspond to any particular subject matter. In linguistics these words are sometimes called "function" words. They include pronouns (you, he, it), connectives (and, because, however), prepositions (to, of, before), auxiliaries (have, been, can), and generic nouns (amount, part, nothing). It is important to appreciate, however, that generic words carry a lot of information for many tasks, including identifying the author of a document or detecting its genre.

Although the vocabulary is a set, we fix an arbitrary ordering for it so we can refer to word 1 through word m, where m = |V| is the size of the vocabulary. Once V has been fixed, each document is represented as a vector of length m with integer entries. If this vector is x then its jth component xj is the number of appearances of word j in the document. The length of the document is n = Σj xj. For typical documents, n is much smaller than m and xj = 0 for most words j.

A collection of documents is represented as a two-dimensional matrix where each row describes a document and each column corresponds to a word. Each entry in this matrix is an integer count; most entries are zero. It makes sense to view each column as a feature. It also makes sense to learn a low-rank approximation of the whole matrix; doing this is called latent semantic analysis (LSA) and is discussed in Section 10.8 below.

10.2 The multinomial distribution

Once we have a representation for individual documents, the natural next step is to select a model for a set of documents. This model is a probability distribution. Given a training set of documents, we will choose values for the parameters of the distribution that make the training documents have high probability.
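As a concrete illustration, the representation just described can be sketched as follows: build the vocabulary, keep only words used in at least two documents, fix an ordering, and produce one count vector per document. Whitespace tokenization is a simplifying assumption here.

```python
from collections import Counter

def bag_of_words(documents, min_docs=2):
    """Build a vocabulary and a document-term count matrix.

    Words used in fewer than `min_docs` documents are dropped, since
    single occurrences are often misspellings or other mistakes.
    """
    doc_tokens = [doc.lower().split() for doc in documents]
    doc_freq = Counter()
    for tokens in doc_tokens:
        doc_freq.update(set(tokens))          # count documents, not occurrences
    vocab = sorted(w for w, df in doc_freq.items() if df >= min_docs)
    matrix = []
    for tokens in doc_tokens:
        counts = Counter(tokens)
        # Row i describes document i; column j counts word j in it.
        matrix.append([counts.get(w, 0) for w in vocab])
    return vocab, matrix

docs = ["the cat sat on the mat", "the dog sat", "a cat and a dog"]
vocab, X = bag_of_words(docs)
```

For these three toy documents the retained vocabulary is cat, dog, sat, the; words such as "mat" and "on" appear in only one document and are dropped.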

The probability distribution that we use is the multinomial. Mathematically, this distribution is

p(x; θ) = (n! / ∏j xj!) ∏j θj^xj

where the data x are a vector of non-negative integers and the parameters θ are a real-valued vector. Both vectors have the same length m. Intuitively, θj is the probability of word j, while xj is the count of word j. Each time word j appears in the document it contributes an amount θj to the total probability, hence the term θj to the power xj.

The components of θ are non-negative and have unit sum: Σj θj = 1. Like any discrete distribution, a multinomial has to sum to one, where the sum is over all possible data points. Here, a data point is a document containing n words. The number of such documents is exponential in their length n: it is m^n. The probability of any individual document will therefore be very small. What is important is the relative probability of different documents. A document that mostly uses words with high probability will have higher relative probability.

At first sight, computing the probability of a document requires O(m) time because of the product over j. However, if xj = 0 then θj^xj = 1, so the jth factor can be omitted from the product. Similarly, 0! = 1, so the jth factor can be omitted from ∏j xj!. Hence, computing the probability of a document needs only O(n) time.

Because the probabilities of individual documents decline exponentially with length n, it is necessary to do numerical computations with log probabilities:

log p(x; θ) = log n! − [Σj log xj!] + [Σj xj · log θj].

Given a set of training documents, the maximum-likelihood estimate of the jth parameter is

θj = (1/T) Σx xj

where the sum is over all documents x belonging to the training set. The normalizing constant is T = Σx Σj xj, which is the sum of the sizes of all training documents. Given a test document, we can evaluate its probability according to the model. The higher this probability is, the more similar the test document is to the training set.

If a multinomial has θj = 0 for some j, then every document with xj > 0 for this j has zero probability, regardless of any other words in the document. Probabilities that are perfectly zero are undesirable, so we want θj > 0 for all j.
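The O(n) log-probability computation can be sketched as follows, representing a document sparsely by its nonzero counts. Here `math.lgamma(k + 1)` computes log k!, which avoids overflow for long documents.

```python
import math

def log_multinomial(counts, log_theta):
    """log p(x; θ) = log n! - Σj log xj! + Σj xj * log θj.

    Only words with xj > 0 contribute, so the cost is O(n) in the number
    of distinct words present rather than O(m) in the vocabulary size.
    """
    n = sum(counts.values())
    logp = math.lgamma(n + 1)                 # log n!
    for word, xj in counts.items():
        logp -= math.lgamma(xj + 1)           # subtract log xj!
        logp += xj * log_theta[word]          # add xj * log θj
    return logp

# A two-word document under a uniform two-word multinomial:
# p = 2! * 0.5 * 0.5 = 0.5.
lp = log_multinomial({"a": 1, "b": 1}, {"a": math.log(0.5), "b": math.log(0.5)})
```

Working in log space also makes it easy to compare documents of the same length n, since the log n! term is then a shared constant.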

Smoothing with a constant c is the way to achieve this. We set

θj ∝ c + Σx xj

where the symbol ∝ means "proportional to." The constant c is called a pseudocount. Intuitively, it is a notional number of appearances of word j that are assumed to exist, regardless of the true number of appearances. Because the equality Σj θj = 1 must be preserved, the normalizing constant must be T′ = mc + T in

θj = (1/T′) (c + Σx xj).

In order to avoid big changes in the estimated probabilities θj, one should have c < T/m. Typically c is chosen in the range 0 < c ≤ 1.

Technically, one multinomial is a distribution over all documents of a fixed size n, so what is learned by the maximum-likelihood process just described is in fact a different distribution for each size n. These distributions, although separate, have the same parameter values.

10.3 Training Bayesian classifiers

Bayesian learning is an approach to learning classifiers based on Bayes' rule. Let x be an example and let y be its class. Suppose the alternative class values are 1 to K. We can write

p(y = k|x) = p(x|y = k) p(y = k) / p(x).

In order to use the expression on the right for classification, we need to learn three items from training data: p(x|y = k), p(y = k), and p(x). We can estimate p(y = k) easily as nk / Σk nk, where nk is the number of training examples with class label k. The denominator p(x) can be computed as the sum of numerators p(x) = Σk p(x|y = k) p(y = k).

In general, the class-conditional distribution p(x|y = k) can be any distribution whose parameters are estimated using the training examples that have class label k. Note that the training process for each class uses only the examples from that class, and is separate from the training process for all other classes.
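The training recipe above, one smoothed multinomial per class plus class priors, can be sketched in a few lines. The function and variable names are illustrative, not from the notes.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(labeled_docs, c=1.0):
    """Fit one smoothed multinomial per class, plus class priors.

    θj is proportional to c + (count of word j in the class), with
    normalizer m*c + T, where T is the class's total word count and
    m the vocabulary size, so that Σj θj = 1.
    """
    vocab = set()
    class_counts = defaultdict(Counter)
    class_docs = Counter()
    for words, label in labeled_docs:
        class_docs[label] += 1
        class_counts[label].update(words)
        vocab.update(words)
    m, n_docs = len(vocab), sum(class_docs.values())
    model = {}
    for label, counts in class_counts.items():
        T = sum(counts.values())
        log_theta = {w: math.log((c + counts[w]) / (m * c + T)) for w in vocab}
        model[label] = (math.log(class_docs[label] / n_docs), log_theta)
    return model

def classify(model, words):
    """argmax over k of log p(y=k) + Σj xj log θkj."""
    def score(label):
        log_prior, log_theta = model[label]
        return log_prior + sum(log_theta[w] for w in words if w in log_theta)
    return max(model, key=score)

model = train_multinomial_nb([
    ("buy cheap pills".split(), "spam"),
    ("cheap pills online".split(), "spam"),
    ("meeting agenda notes".split(), "ham"),
    ("notes from the meeting".split(), "ham"),
])
```

Note that each class's multinomial is estimated only from that class's documents, exactly as the text describes, and the pseudocount c keeps every θj strictly positive.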

For documents, each class is represented by one multinomial fitted by maximum-likelihood as described in Section 10.2. In the formula

θj = (1/T′) (c + Σx xj)

the sum is taken over the documents in one class. Unfortunately, the exact value of c can strongly influence classification accuracy.

As discussed in Chapter ?? above, when one class of documents is rare, it is not reasonable to use accuracy to measure the success of a classifier for documents. Instead, it is common to use the so-called F-measure. This measure is the harmonic mean of precision and recall:

f = 2 / (1/p + 1/r)

where p and r are precision and recall for the rare class.

10.4 Burstiness

The multinomial model says that each appearance of the same word j always has the same probability θj. In reality, additional appearances of the same word are less surprising, i.e. they have higher probability. Consider the following excerpt from a newspaper article:

Toyota Motor Corp. is expected to announce a major overhaul. Yoshi Inaba, a former senior Toyota executive, was formally asked by Toyota this week to oversee the U.S. business. Mr. Inaba was credited with laying the groundwork for Toyota's fast growth in the U.S. before he left the company. Mr. Inaba is currently head of an international airport close to Toyota's headquarters in Japan. Toyota's U.S. operations now are suffering from plunging sales. Recently, Toyota has had to idle U.S. assembly lines, pay workers who aren't producing vehicles and offer a limited number of voluntary buyouts. Toyota now employs 36,000 in the U.S.

The multinomial distribution arises from a generative process of sampling words with replacement. An alternative distribution named the Dirichlet compound multinomial (DCM) arises from an urn process that captures the authorship process better. Consider a bucket with balls of |V| different colors. After a ball is selected randomly, it is not just replaced, but also one more ball of the same color is added.
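The urn process behind the DCM can be simulated directly; this Pólya-urn sketch (the function name and toy parameter values are illustrative) shows how reinforcement after each draw produces bursty word counts.

```python
import random
from collections import Counter

def polya_urn_document(beta, n_words, seed=0):
    """Draw one document from the urn process behind the DCM.

    Start with beta[j] (notional) balls of each color j; after each draw,
    return the ball and add one more of the same color, so a word that
    has appeared already becomes more likely to appear again.
    """
    rng = random.Random(seed)
    urn = dict(beta)
    words = []
    for _ in range(n_words):
        total = sum(urn.values())
        r = rng.uniform(0, total)
        for w, n_balls in urn.items():
            r -= n_balls
            if r <= 0:
                words.append(w)
                urn[w] += 1          # reinforcement: one extra ball of this color
                break
    return Counter(words)

# Small initial values make documents bursty: a few colors dominate.
counts = polya_urn_document({w: 0.2 for w in "abcde"}, 50)
```

With large initial counts the urn behaves almost like sampling with replacement, i.e. like the multinomial; small initial counts give strong burstiness, matching the remark that smaller parameter values mean burstier words.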

Each time a ball is drawn, the chance of drawing the same color again is increased. This increased probability models the phenomenon of burstiness. Let the initial number of balls with color j be βj. These initial values are the parameters of the DCM distribution. The DCM parameter vector β has length |V|, like the multinomial parameter vector, but the sum of the components of β is unconstrained. This one extra degree of freedom allows the DCM to discount multiple observations of the same word, in an adjustable way. The smaller the parameter values βj are, the more words are bursty.

10.5 Discriminative classification

There is a consensus that linear support vector machines (SVMs) are the best known method for learning classifiers for documents. Nonlinear SVMs are not beneficial, because the number of features is typically much larger than the number of training examples. For the same reason, choosing the strength of regularization appropriately, typically by cross-validation, is crucial.

When using a linear SVM for text classification, accuracy can be improved considerably by transforming the raw counts. Since counts do not follow Gaussian distributions, it is not sensible to make them have mean zero and variance one. Inspired by the discussion above of burstiness, it is sensible to replace each count x by log(1 + x). This transformation maps 0 to 0, so it preserves sparsity. A more extreme transformation that loses information is to make each count binary, that is to replace all non-zero values by one.

One can also transform counts in a supervised way, that is in a way that uses label information. This is most straightforward when there are just two classes. Experimentally, the following transformation gives the highest accuracy:

x → sgn(x) · |log tp/fn − log fp/tn|

where tp is the number of positive training examples containing the word, fn is the number of positive examples not containing the word, fp is the number of negative training examples containing the word, and tn is the number of negative examples not containing the word. The value |log tp/fn − log fp/tn| is large if tp/fn and fp/tn have very different values. In this case the word is highly diagnostic for at least one of the two classes. Notice that in the formula above the positive and negative classes are treated in a perfectly symmetric way. If any of these numbers is zero, we replace it by 0.5, which of course is less than any of these numbers that is genuinely non-zero.
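The supervised transformation just described can be sketched directly from its definition; the toy counts below are illustrative.

```python
import math

def logodds_weight(tp, fn, fp, tn, floor=0.5):
    """|log tp/fn - log fp/tn|, with any zero replaced by `floor`."""
    tp, fn, fp, tn = (v if v > 0 else floor for v in (tp, fn, fp, tn))
    return abs(math.log(tp / fn) - math.log(fp / tn))

def transform(count, weight):
    """x -> sgn(x) * weight; zero stays zero, preserving sparsity."""
    return weight if count > 0 else 0.0

# A word in 90 of 100 positive examples but only 5 of 100 negative ones
# is highly diagnostic; a word spread evenly gets weight zero.
w_diag = logodds_weight(90, 10, 5, 95)   # = log 9 + log 19
w_flat = logodds_weight(50, 50, 50, 50)  # = 0
```

The symmetric treatment of the two classes is visible in the formula: swapping the roles of positive and negative counts leaves the absolute value unchanged.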

The transformation above is sometimes called logodds weighting. Intuitively, first each count x is transformed into a binary feature sgn(x), and then these features are weighted according to their predictiveness.

10.6 Clustering documents

Suppose that we have a collection of documents, and we want to find an organization for these, i.e. we want to do unsupervised learning. The simplest variety of unsupervised learning is clustering. Clustering can be done probabilistically by fitting a mixture distribution to the given collection of documents. Formally, a mixture distribution is a probability density function of the form

p(x) = Σ_{k=1}^K αk p(x; θk).

Here, K is the number of components in the mixture model. For each k, p(x; θk) is the distribution of component number k. The scalar αk is the proportion of component number k. Each component is a cluster.

10.7 Topic models

Mixture models, and clusterings in general, are based on the assumption that each data point is generated by a single component model; intuitively, if components are topics, each document concerns a single topic. For documents, it is often more plausible to assume that each document can contain words from multiple topics. Topic models are probabilistic models that make this assumption.

Latent Dirichlet allocation (LDA) is the most widely used topic model. It is based on the intuition that each document contains words from multiple topics; the proportion of each topic in each document is different, but the topics themselves are the same for all documents. The generative process assumed by the LDA model is as follows:

Given: Dirichlet distribution with parameter vector α of length K
Given: Dirichlet distribution with parameter vector β of length V
for topic number 1 to topic number K
    draw a multinomial with parameter vector φk according to β
for document number 1 to document number M
    draw a topic distribution, i.e. a multinomial θ according to α
    for each word in the document
        draw a topic z according to θ
        draw a word w according to φz

Note that z is an integer between 1 and K for each word, where K is the number of topics, while φ is a vector of word probabilities for each topic, indicating the content of that topic. The prior distributions α and β are assumed to be fixed and known, as are the number K of topics, the number M of documents, the length Nm of each document, and the cardinality V of the vocabulary (i.e. the dictionary of all words). When applying LDA, it is not necessary to learn α and β. Steyvers and Griffiths recommend fixed uniform values α = 50/K and β = .01.

For learning, the training data are the words in all documents. Learning has two goals: (i) to infer the document-specific multinomial θ for each document, and (ii) to infer the topic distribution φk for each topic. After training, the distribution θ of each document is useful for classifying new documents, measuring similarity between documents, and more.

10.8 Latent semantic analysis

10.9 Open questions

It would be interesting to work out an extension of the logodds mapping for the multiclass case. One suggestion is

x → sgn(x) · max_i |log tpi/fni|

where i ranges over the classes, tpi is the number of training examples in class i containing the word, and fni is the number of these examples not containing the word. It is also not known if combining the log transformation and logodds weighting is beneficial, as in

x → log(x + 1) · |log tp/fn − log fp/tn|.
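The LDA generative process above can be simulated directly. In this sketch the topic-word multinomials φk are supplied by hand rather than drawn from the Dirichlet prior β, and all names and toy values are illustrative.

```python
import random

def lda_generate(alpha, phi, doc_lengths, seed=0):
    """Sample documents from the LDA generative process.

    `phi` maps each topic k to its word multinomial φk (here given
    directly instead of being drawn from the Dirichlet prior β). Each
    document draws its own topic mixture θ from a Dirichlet with
    parameters `alpha`; each word draws a topic z from θ, then a word
    w from φz.
    """
    rng = random.Random(seed)

    def draw(dist):
        r = rng.random()
        for item, p in dist.items():
            r -= p
            if r <= 0:
                return item
        return item                               # guard against rounding

    def dirichlet(params):
        g = {k: rng.gammavariate(a, 1.0) for k, a in params.items()}
        total = sum(g.values())
        return {k: v / total for k, v in g.items()}

    docs = []
    for n_words in doc_lengths:
        theta = dirichlet(alpha)                  # per-document topic mixture
        words = []
        for _ in range(n_words):
            z = draw(theta)                       # topic for this word
            words.append(draw(phi[z]))            # word from topic z
        docs.append(words)
    return docs

phi = {0: {"gene": 0.7, "cell": 0.3}, 1: {"stock": 0.6, "price": 0.4}}
docs = lda_generate({0: 0.5, 1: 0.5}, phi, [10, 10])
```

Running the generator makes the two key points of the model concrete: each document mixes words from both topics, but the topics themselves are shared across all documents.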

Quiz

(a) Explain why, with a multinomial distribution, "the probabilities of individual documents decline exponentially with length n."

The probability of document x of length n according to a multinomial distribution is

p(x; θ) = (n! / ∏j xj!) ∏j θj^xj.

[Rough argument.] Each θj value is less than 1. In total, n = Σj xj of these values are multiplied together. Hence as n increases the product ∏j θj^xj decreases exponentially. Note that the multinomial coefficient n! / ∏j xj! does increase with n, but more slowly.

(b) Consider the multiclass Bayesian classifier

ŷ = argmax_k p(x|y = k) p(y = k) / p(x).

Simplify the expression inside the argmax operator as much as possible, given that the model p(x|y = k) for each class is a multinomial distribution.

The denominator p(x) is the same for all k, so it does not influence which k is the argmax, and it can be eliminated. Within the multinomial distributions p(x|y = k), the multinomial coefficient does not depend on k, so it is constant for a single x and it can be eliminated also, giving

ŷ = argmax_k p(y = k) ∏j θkj^xj

where θkj is the jth parameter of the kth multinomial.

(c) Consider the classifier from part (b) and suppose that there are just two classes. Simplify the classifier further into a linear classifier.

Let the two classes be k = 0 and k = 1 so we can write

ŷ = 1 if and only if p(y = 1) ∏j θ1j^xj > p(y = 0) ∏j θ0j^xj.

Taking logarithms and using indicator function notation gives

ŷ = I( log p(y = 1) + Σj xj log θ1j − log p(y = 0) − Σj xj log θ0j > 0 ).

The expression inside the indicator function is a linear function of x. The classifier simplifies to

ŷ = I( c0 + Σj xj cj > 0 )

where the coefficients are c0 = log p(y = 1) − log p(y = 0) and cj = log θ1j − log θ0j.
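The equivalence derived in part (c) is easy to check numerically: the linear classifier with coefficients c0 and cj makes exactly the same predictions as the direct Bayesian comparison. The parameter values below are illustrative.

```python
import math

def linear_coefficients(prior1, prior0, theta1, theta0):
    """c0 = log p(y=1) - log p(y=0); cj = log θ1j - log θ0j."""
    c0 = math.log(prior1) - math.log(prior0)
    cs = [math.log(t1) - math.log(t0) for t1, t0 in zip(theta1, theta0)]
    return c0, cs

def predict(c0, cs, x):
    """ŷ = I(c0 + Σj xj cj > 0)."""
    return int(c0 + sum(xj * cj for xj, cj in zip(x, cs)) > 0)

def bayes_predict(prior1, prior0, theta1, theta0, x):
    """Direct comparison of log p(y=1) + Σ xj log θ1j vs the class-0 score."""
    s1 = math.log(prior1) + sum(xj * math.log(t) for xj, t in zip(x, theta1))
    s0 = math.log(prior0) + sum(xj * math.log(t) for xj, t in zip(x, theta0))
    return int(s1 > s0)

theta1, theta0 = [0.7, 0.2, 0.1], [0.1, 0.3, 0.6]
c0, cs = linear_coefficients(0.4, 0.6, theta1, theta0)
```

The two functions are the same expression rearranged, so they must agree on every count vector x.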

Quiz for May 18, 2010

Your name:

Consider the task of learning to classify documents into one of two classes, using the bag-of-words representation. Explain why a regularized linear SVM is expected to be more accurate than a Bayesian classifier using one maximum-likelihood multinomial for each class. Write one or two sentences for each of the following reasons.

(a) Handling words that are absent in one class in training.

(b) Learning the relative importance of different words.

Assignment (2009)

The purpose of this assignment is to compare two different approaches to learning a classifier for text documents. The dataset to use is called Classic400. It consists of 400 documents from three categories over a vocabulary of 6205 words. The categories are quite distinct from a human point of view, and high accuracy is achievable. The dataset is available at http://www.cs.ucsd.edu/users/elkan/291/classic400.zip, which contains three files. The main file cl400.csv is a comma-separated 400 by 6205 array of word counts. The file truelabels.csv gives the actual class of each document, while wordlist gives the string corresponding to the word indices 1 to 6205. These strings are not needed for training or applying classifiers, but they are useful for interpreting classifiers. The file classic400.mat contains the same three files in the form of Matlab matrices.

First, you should try Bayesian classification using a multinomial model for each of the three classes. Note that you may need to smooth the multinomials with pseudocounts. Second, you should train a support vector machine (SVM) classifier. You will need to select a method for adapting SVMs for multiclass classification. Since there are more features than training examples, regularization is vital.

For each of the two classifier learning methods, investigate whether you can achieve better accuracy via feature selection, feature transformation, and/or feature weighting. Because there are relatively few examples, using ten-fold cross-validation to measure accuracy is suggested. Try to evaluate the statistical significance of differences in accuracy that you find. If you allow any leakage of information from the test fold to training folds, be sure to explain this in your report.

Assignment due on May 18, 2010

The purpose of this assignment is to compare two different approaches to learning a binary classifier for text documents. The dataset to use is the movie review polarity dataset, version 2.0, published by Lillian Lee at http://www.cs.cornell.edu/People/pabo/movie-review-data/. Be sure to read the README carefully, and to understand the data and task fully.

First, you should try Bayesian classification using a multinomial model for each of the two classes. You should smooth the multinomials with pseudocounts. Second, you should train a linear discriminative classifier, either logistic regression or a support vector machine. Since there are more features than training examples, regularization is vital.

For both classifier learning methods, investigate whether you can achieve better accuracy via feature selection, feature transformation, and/or feature weighting. Try to evaluate the statistical significance of differences in accuracy that you find. Think carefully about any ways in which you may be allowing leakage of information from test subsets that makes estimates of accuracy be biased. Compare the accuracy that you can achieve with accuracies reported in some of the many published papers that use this dataset. You can find links to these papers at http://www.cs.cornell.edu/People/pabo/movie-review-data/otherexperiments.html. Also, analyze your trained classifier to identify what features of movie reviews are most indicative of the review being favorable or unfavorable.

Feedback on this text mining assignment: leakage and overfitting were the main pitfalls; discuss these issues in your report. Over 90% accuracy is achievable on this dataset. For SVMs the C parameter is not the strength of regularization but rather the reverse. Those who did not try strong regularization found worse performance with an SVM than with the Bayesian classifier.


Chapter 11

Social network analytics

A social network is a graph where nodes represent individual people and edges represent relationships.

11.1 Difficulties in network mining

Many networks are not strictly social networks, but have similar properties. Examples: citation networks, protein interaction networks. Example: a telephone network, where nodes are subscribers, edges are phone calls, and the data are call detail records (CDRs).

There are two types of data mining one can do with a network: supervised and unsupervised. The aim of supervised learning is to obtain a model that can predict labels for nodes, or labels for edges. For nodes, this is sometimes called collective classification. For edges, the most basic label is existence; predicting whether or not edges exist is called link prediction.

Examples of tasks that involve collective classification: predicting churn of subscribers; recognizing fraudulent applicants. Examples of tasks that involve link prediction: suggesting new friends on Facebook; identifying appropriate citations between scientific papers; predicting which pairs of terrorists know each other.

Both collective classification and link prediction are transductive problems. Transduction is the situation where the set of test examples is known at the time a classifier is to be trained. In this situation, the real goal is not to train a classifier; it is simply to make predictions about specific examples. In principle, some methods for transduction might make predictions without having an explicit reusable classifier. We will look at methods that do involve explicit classifiers. However, these classifiers will be usable only for nodes that are part of the network that is known

at training time. We will not solve the cold-start problem of making predictions for nodes that are not known during training.

Social networks, and other graphs in data mining, can be complex. The graph can be bipartite or not. Edges may be directed or undirected, they can be of multiple types, and/or they can be weighted. Nodes have two fundamental characteristics. First, they have identity. This means that we know which nodes are the same and which are different in the graph, but we do not necessarily know anything else about nodes. Second, nodes may be associated with vectors that specify the values of features. These vectors are sometimes called side-information.

11.2 Unsupervised network mining

The aim of unsupervised data mining is often to come up with a model that explains the statistics of the connectivity of the network. Mathematical models that explain patterns of connectivity are often time-based, e.g. the "rich get richer" idea. Models of entire networks can lead to predictions about the evolution of the network that can be useful for making decisions. In particular, given a social network, one can identify the most influential nodes, and make special efforts to reach them. This type of learning can be fascinating as sociology. For example, it has revealed that nonsmokers tend to stop being friends with smokers, but thin people do not stop being friends with obese people.

11.3 Collective classification

General approach: extend each node with a fixed-length vector of feature values derived from its neighbors. A node can have a varying number of neighbors, but many learning algorithms require fixed-length representations. For this reason, features based on neighbors are often aggregates. Aggregation operators include "sum," "mean," "mode," "minimum," "maximum," "count," and "exists."

For many data mining tasks where the examples are people, there are five general types of feature: personal demographic, group demographic, behavioral, incentive, and social. If group demographic features have predictive power, it is often because they are reflections of social features. For example, if zipcode is predictive of the brand of automobile a person will purchase, that is likely because of contagion effects between people who know each other, or see each other on the streets.
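The general approach of turning a variable-size neighborhood into a fixed-length vector of aggregates can be sketched as follows; the attribute and the particular set of aggregation operators are illustrative.

```python
def neighbor_features(graph, values):
    """Fixed-length aggregate features from a variable number of neighbors.

    `graph` maps each node to its set of neighbors; `values` maps each node
    to a numeric attribute. Aggregates such as sum, mean, min, max, and
    count turn the variable-size neighborhood into a fixed-length vector
    that standard learning algorithms can consume.
    """
    features = {}
    for node, neighbors in graph.items():
        vals = [values[v] for v in neighbors]
        if vals:
            features[node] = [sum(vals), sum(vals) / len(vals),
                              min(vals), max(vals), len(vals)]
        else:
            features[node] = [0, 0.0, 0, 0, 0]   # isolated node
    return features

g = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}, "d": set()}
f = neighbor_features(g, {"a": 10, "b": 2, "c": 4, "d": 7})
```

Every node gets a vector of the same length regardless of its degree, which is exactly what makes this representation usable with ordinary classifiers.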

However, the standard approach to cross-validation is not appropriate when examples are linked in a network. In traditional classification tasks, the labels of examples are assumed to be independent. (This is not the same as assuming that the examples themselves are independent.) If labels are independent, then a classifier can make predictions separately for separate examples. Often the training and test examples, that is the labeled and unlabeled nodes, are members of the same network, so they are not completely separate datasets; see the discussion of sample selection bias above.

11.4 Link prediction

Link prediction is similar to collective classification, but involves some additional difficulties. First, the label to be predicted is typically highly unbalanced: between most pairs of nodes, there is no edge. Second, there is a major pitfall in using a linear classifier. Other issues include:

1. Inferring unobserved links versus forecasting future new links.
2. The cold-start problem for new nodes.
3. Generalization to new networks.

Nevertheless, the most successful general approach to link prediction is the same as for collective classification: extend each node with a vector of feature values derived from the network. One can also extend each potential edge with a vector of feature values. Empirically, for predicting future co-authorship:

1. Keyword overlap is the most informative feature for predicting co-authorship.
2. Sum of neighbors and sum of papers are next most predictive.
3. Shortest-distance is the most informative graph-based feature.

However, these features measure the propensity of an individual as opposed to any interaction between individuals. Unsupervised alternatives include low rank approximation of the adjacency matrix, and low rank approximation of the Laplacian or modularity matrix; for example, the Enron email adjacency matrix has rank approximately 2 only. The situation is similar for protein-protein interaction networks.

11.5 Iterative collective classification

There are many cases where we want to predict labels for nodes in a social network.

Suppose that examples are nodes in a graph, and nodes are joined by edges. Edges can have labels and/or weights, so in effect multiple graphs can be overlaid on the same examples. For example, if nodes are persons then one set of edges may represent the "same address" relationship while another set may represent the "made telephone call to" relationship.

If labels are not independent, then in principle a classifier can achieve higher accuracy by predicting labels for related examples simultaneously. This situation is called collective classification. Intuitively, the labels of neighbors are often correlated. Given a node x, we would like to use the labels of the neighbors of x as features when predicting the label of x.

A training set may include examples with known labels and examples with unknown labels, where we allow "unknown" as a special label value. Given a node x, let N(x) be the set of its neighbors. Let S(x) be the bag of labels of nodes in N(x), and let g(x) be some representation of S(x), perhaps using aggregate operators. A classifier is a function f(x, g(x)). Let L be the training examples with known labels. The examples that are used in training are the set E = ∪_{x ∈ L} N(x).

Given a trained classifier f(x, g(x)), the algorithm for classifying test examples is the following:

    Initialization: for each test node x
        compute N(x), S(x), and g(x)
        compute prediction ŷ = f(x, g(x))
    repeat
        select an ordering R of the nodes
        for each x according to R
            let S(x) be the current predicted labels of N(x)
            compute prediction ŷ = f(x, g(x))
    until no change in predicted labels

The algorithm above is purely heuristic, but it is sensible and often effective in practice. Experimentally, a simple iterative algorithm often performs as well as more sophisticated methods for collective classification. A principled approach to collective classification would find mutually consistent predicted labels for x and its neighbors, and vice versa. However, in general there is no guarantee that mutually consistent labels are unique, and there is no general algorithm for inferring them.
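The iterative procedure can be sketched as a short Python function. This is an illustrative sketch only: the trained classifier f, the representation g, and the neighbor map are assumed to be supplied by the caller, and all names here are hypothetical.

```python
import random

def iterative_classification(test_nodes, neighbors, g, f, max_iters=100):
    """Heuristic iterative algorithm for collective classification.

    test_nodes: iterable of node identifiers
    neighbors:  dict mapping each node x to a list of its neighbors N(x)
    g:          function mapping a bag of neighbor labels S(x) to features
    f:          trained classifier, f(x, g(S(x))) -> predicted label
    """
    labels = {}
    # Initialization: neighbors with no prediction yet count as "unknown".
    for x in test_nodes:
        S = [labels.get(n, "unknown") for n in neighbors[x]]
        labels[x] = f(x, g(S))

    for _ in range(max_iters):
        changed = False
        R = list(test_nodes)
        random.shuffle(R)  # select an ordering R of the nodes
        for x in R:
            S = [labels.get(n, "unknown") for n in neighbors[x]]
            y = f(x, g(S))
            if y != labels[x]:
                labels[x] = y
                changed = True
        if not changed:  # until no change in predicted labels
            break
    return labels
```

With a classifier that, say, copies the majority label of its already-labeled neighbors, predicted labels propagate outward from nodes whose predictions do not depend on their neighbors.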

11.6 Other topics

Network-focused versus node-focused analysis.

Cascading models of behavior.

Record linkage and alias detection.

Betweenness centrality: Brandes 2001.

Enron email visualization: http://jheer.org/enron/v1/enron_nice_1.png

Quiz for June 1, 2010

Your name:

Consider the task of predicting which pairs of nodes in a social network are linked. Let the adjacency matrix A have dimension n × n. You have trained a matrix factorization A = U V, where U has dimension n × k for some k ≪ n. Because A is symmetric, V is the transpose of U. Let ui be the row of U that represents node i. Let j, k be a pair for which you need to predict whether an edge exists.

Consider these two possible ways to make this prediction:

1. Predict the dot-product uj · uk.

2. Predict f([uj, uk]), where [uj, uk] is the concatenation of the vectors uj and uk, and f is a trained logistic regression model. Specifically, suppose you have a training set of pairs for which edges are known to exist, or known not to exist. You need to make predictions for the remaining edges.

Explain which of these two approaches is best, and why. (Do not mention any other approaches, which are outside the scope of this question.)
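To make the two candidate predictors concrete, here is a small NumPy sketch with made-up latent vectors U; in practice U would come from actually factorizing the adjacency matrix, and the logistic regression model of approach 2 would still need to be fitted. The sketch illustrates only the mechanics of each approach, not which one is preferable.

```python
import numpy as np

# Hypothetical latent vectors: n = 4 nodes, rank k = 2.
U = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.0, 1.0],
              [0.1, 0.9]])

def score_dot(j, k):
    # Approach 1: the predicted edge score is the dot product uj . uk
    return float(U[j] @ U[k])

def features_concat(j, k):
    # Approach 2: the feature vector [uj, uk] that would be fed to a
    # trained logistic regression model f (not fitted here)
    return np.concatenate([U[j], U[k]])

# Nodes 0 and 1 have similar latent vectors; nodes 0 and 2 do not,
# so the dot-product score ranks the pair (0, 1) above (0, 2).
print(score_dot(0, 1) > score_dot(0, 2))  # True
```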

Assignment due on June 1, 2010

The purpose of this assignment is to apply and evaluate methods to predict the presence of links that are unknown in a social network. Use either the Cora dataset or the Terrorists dataset. Both datasets are available at http://www.cs.umd.edu/˜sen/lbc-proj/LBC.html.

In the Cora dataset, each node represents an academic paper. The network structure is that each paper is cited by, or cites, at least one other paper in the dataset. Each paper is also represented by a bag-of-words vector of length 1433. Each paper has a label, where the seven label values are different subareas of machine learning.

The Terrorists dataset is more complicated. There is information about each terrorist, and also about each edge. The edges are labeled with types. For explanations see the paper Entity and relationship labeling in affiliation networks by Zhao, Sen, and Getoor from the 2006 ICML Workshop on Statistical Network Analysis, available at http://www.mindswap.org/papers/2006/RelClzPIT.pdf. According to Table 1 and Section 6 of this paper, there are 917 edges that connect 435 terrorists.

For either dataset, your task is to pretend that some edges are unknown, and then to predict these. When predicting edges, you may use features obtained from (i) the network structure of the edges known to be present and absent, (ii) properties of the nodes, and/or (iii) properties of pairs of nodes. For (i), use a method for converting the network structure into a fixed-length vector of feature values for each node, as discussed in class. Using one or more of the feature types (i), (ii), and (iii) yields seven alternative sets of features. Do experiments comparing at least two of these. Each experiment should use logistic regression and ten-fold cross-validation.

Suppose that you are doing ten-fold cross-validation on the Terrorists dataset. Then you would pretend that 92 actual edges are unknown. Based on the 917 − 92 = 825 known edges, and on all 435 nodes, you would use logistic regression to predict a score for each of the 435 · 434/2 potential edges. The higher the scores of the 92 held-out edges, the better.

Think carefully about exactly what information is legitimately included in training folds. Also, avoid the mistakes discussed in Section 5.7 above.
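The cross-validation bookkeeping for this assignment might be sketched as follows. The helper names are hypothetical, and the feature construction and the logistic regression model itself are omitted; the sketch only shows how the folds of held-out edges and the set of potential edges are enumerated.

```python
import itertools
import random

def potential_edges(n):
    """All unordered node pairs: n*(n-1)/2 candidate edges to score."""
    return list(itertools.combinations(range(n), 2))

def edge_folds(known_edges, num_folds=10, seed=0):
    """Split the known edges into folds. Each fold in turn is
    'pretended unknown' and should be scored highly by a model
    trained using only the remaining known edges."""
    edges = list(known_edges)
    random.Random(seed).shuffle(edges)
    return [edges[i::num_folds] for i in range(num_folds)]

# Terrorists dataset: 435 nodes and 917 known edges, so each experiment
# scores 435*434/2 = 94395 potential edges, with about 92 edges held out.
print(len(potential_edges(435)))       # 94395
print(len(edge_folds(range(917))[0]))  # 92
```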

Chapter 12

Interactive experimentation

This chapter discusses data-driven optimization of websites, in particular via A/B testing. For example, A/B testing can reveal "willingness to pay": show price x to one group of visitors and price y to another, compare the profit at each price based on the number sold, and choose the price that yields the higher profit.
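As a minimal sketch of the price comparison, with an entirely made-up unit cost and sales counts, note that the higher price can win even though it sells fewer units:

```python
def profit(price, unit_cost, num_sold):
    """Profit for one arm of the test: margin per unit times units sold."""
    return (price - unit_cost) * num_sold

# Hypothetical A/B test: equal traffic sees price x = 20 or price y = 25.
profit_x = profit(20, 12, 1000)  # 8 * 1000 = 8000
profit_y = profit(25, 12, 700)   # 13 * 700 = 9100
best_price = 25 if profit_y > profit_x else 20
print(best_price)  # 25
```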


Quiz

The following text is from a New York Times article dated May 30, 2009.

Mr. Herman had run 27 ads on the Web for his client Vespa, the scooter company. Some were rectangular, some square. And the text varied: One tagline said, "Smart looks. Smarter purchase," and displayed a $0 down, 0 percent interest offer. Another read, "Pure fun. And function," and promoted a free T-shirt. Vespa's goal was to find out whether a financial offer would attract customers, and Mr. Herman's data concluded that it did. The $0 down offer attracted 71 percent more responses from one group of Web surfers than the average of all the Vespa ads, while the T-shirt offer drew 29 percent fewer.

(a) What basic principle of the scientific method did Mr. Herman not follow, according to the description above?

(b) Suppose that it is true that the financial offer does attract the highest number of purchasers. What is an important reason why the other offer might still be preferable for the company?

(c) Explain the writing mistake in the phrase "Mr. Herman's data concluded that it did." Note that the error is relatively high-level; it is not a spelling or grammar mistake.

