Charles Elkan
elkan@cs.ucsd.edu
May 31, 2011
Contents

1 Introduction
1.1 Limitations of predictive analytics
1.2 Overview

2 Predictive analytics in general
2.1 Supervised learning
2.2 Data cleaning and recoding
2.3 Linear regression
2.4 Interpreting coefficients of a linear model
2.5 Evaluating performance

3 Introduction to Rapidminer
3.1 Standardization of features
3.2 Example of a Rapidminer process
3.3 Other notes on Rapidminer

4 Support vector machines
4.1 Loss functions
4.2 Regularization
4.3 Linear soft-margin SVMs
4.4 Dual formulation
4.5 Nonlinear kernels
4.6 Selecting the best SVM settings

5 Doing valid experiments
5.1 Cross-validation
5.2 Nested cross-validation

6 Classification with a rare class
6.1 Thresholds and lift
6.2 Ranking examples
6.3 Training to overcome imbalance
6.4 Conditional probabilities
6.5 Isotonic regression
6.6 Univariate logistic regression
6.7 Pitfalls of link prediction

7 Logistic regression

8 Making optimal decisions
8.1 Predictions, decisions, and costs
8.2 Cost matrix properties
8.3 The logic of costs
8.4 Making optimal decisions
8.5 Limitations of cost-based analysis
8.6 Rules of thumb for evaluating data mining campaigns
8.7 Evaluating success

9 Learning classifiers despite missing labels
9.1 The standard scenario for learning a classifier
9.2 Sample selection bias in general
9.3 Covariate shift
9.4 Reject inference
9.5 Positive and unlabeled examples
9.6 Further issues

10 Recommender systems
10.1 Applications of matrix approximation
10.2 Measures of performance
10.3 Additive models
10.4 Multiplicative models
10.5 Combining models by fitting residuals
10.6 Regularization
10.7 Further issues

11 Text mining
11.1 The bag-of-words representation
11.2 The multinomial distribution
11.3 Training Bayesian classifiers
11.4 Burstiness
11.5 Discriminative classification
11.6 Clustering documents
11.7 Topic models
11.8 Open questions

12 Matrix factorization and applications
12.1 Singular value decomposition
12.2 Principal component analysis
12.3 Latent semantic analysis
12.4 Matrix factorization with missing data
12.5 Spectral clustering
12.6 Graph-based regularization

13 Social network analytics
13.1 Issues in network mining
13.2 Unsupervised network mining
13.3 Collective classification
13.4 The social dimensions approach
13.5 Supervised link prediction

14 Interactive experimentation

15 Reinforcement learning
15.1 Markov decision processes
15.2 RL versus cost-sensitive learning
15.3 Algorithms to find an optimal policy
15.4 Q functions
15.5 The Q-learning algorithm
15.6 Fitted Q iteration
15.7 Representing continuous states and actions
15.8 Inventory management applications

Bibliography
Chapter 1
Introduction
There are many definitions of data mining. We shall take it to mean the application
of learning algorithms and statistical methods to real-world datasets. Data mining
is useful in numerous domains in science, engineering, business, and elsewhere. We
shall focus on applications that are related to business, but the most useful methods
are largely the same for applications in science or engineering.
The focus will be on methods for making predictions. For example, the available
data may be a customer database, along with labels indicating which customers failed
to pay their bills. The goal will then be to predict which other customers might fail
to pay in the future. In general, analytics is a newer name for data mining. Predictive
analytics indicates a focus on making predictions.
The main alternative to predictive analytics can be called descriptive analytics.
This area is often also called “knowledge discovery in data” or KDD. In a nutshell,
the goal of descriptive analytics is to discover patterns in data. Finding patterns is
often fascinating and sometimes highly useful, but in general it is harder to obtain
direct benefit from descriptive analytics than from predictive analytics. For example,
suppose that customers of Whole Foods tend to be liberal and wealthy. This pattern
may be noteworthy and interesting, but what should Whole Foods do with the finding?
Often, the same finding suggests two courses of action that are both plausible,
but contradictory. In such a case, the finding is really not useful in the absence of
additional knowledge. For example, perhaps Whole Foods should direct its marketing
towards additional wealthy and liberal people. Or perhaps that demographic is
saturated, and it should aim its marketing at a currently less tapped, different, group
of people.
In contrast, predictions can typically be used directly to make decisions that
maximize benefit to the decision-maker. For example, customers who are more likely
not to pay in the future can have their credit limit reduced now. It is important to
understand the difference between a prediction and a decision. Data mining lets us
make predictions, but predictions are useful to an agent only if they allow the agent
to make decisions that have better outcomes.
Some people may feel that the focus in this course on maximizing profit is
distasteful or disquieting. After all, maximizing profit for a business may be at the
expense of consumers, and may not benefit society at large. There are several
responses to this feeling. First, maximizing profit in general is maximizing efficiency.
Society can use the tax system to spread the benefit of increased profit. Second,
increased efficiency often comes from improved accuracy in targeting, which benefits
the people being targeted. Businesses have no motive to send advertising to people
who will merely be annoyed and not respond.
1.1 Limitations of predictive analytics
It is important to understand the limitations of predictive analytics. First, in general,
one cannot make progress without a dataset for training of adequate size and quality.
Second, it is crucial to have a clear deﬁnition of the concept that is to be predicted,
and to have historical examples of the concept. Consider for example this extract
from an article in the London Financial Times dated May 13, 2009:
Fico, the company behind the credit score, recently launched a service
that pre-qualifies borrowers for modification programmes using their
in-house scoring data. Lenders pay a small fee for Fico to refer potential
candidates for modifications that have already been vetted for inclusion
in the programme. Fico can also help lenders find borrowers that will
best respond to modifications and learn how to get in touch with them.
It is hard to see how this could be a successful application of data mining, because
it is hard to see how a useful labeled training set could exist. The target concept
is “borrowers that will best respond to modifications.” From a lender’s perspective
(and Fico works for lenders, not borrowers) such a borrower is one who would not pay
under his current contract, but who would pay if given a modified contract. Especially
in 2009, lenders had no long historical experience with offering modifications to
borrowers, so Fico did not have relevant data. Moreover, the definition of the target
is based on a counterfactual, that is, on reading the minds of borrowers. Data mining
cannot read minds.
For a successful data mining application, the actions to be taken based on
predictions need to be defined clearly and to have reliable profit consequences. The
difference between a regular payment and a modified payment is often small, for
example $200 in the case described in the newspaper article. It is not clear that giving
people modifications will really change their behavior dramatically.
For a successful data mining application also, actions must not have major
unintended consequences. Here, modifications may change behavior in undesired ways.
A person requesting a modification is already thinking of not paying. Those who get
modifications may be motivated to return and request further concessions.
Additionally, for predictive analytics to be successful, the training data must be
representative of the test data. Typically, the training data come from the past, while
the test data arise in the future. If the phenomenon to be predicted is not stable
over time, then predictions are likely not to be useful. Here, changes in the general
economy, in the price level of houses, and in social attitudes towards foreclosures,
are all likely to change the behavior of borrowers in the future, in systematic but
unknown ways.
Last but not least, for a successful application it helps if the consequences of
actions are essentially independent for different examples. This may not be the case
here. Rational borrowers who hear about others getting modifications will try to
make themselves appear to be candidates for a modification. So each modification
generates a cost that is not restricted to the loss incurred with respect to the individual
getting the modification.
An even clearer example of an application of predictive analytics that is unlikely
to succeed is learning a model to predict which persons will commit a major
terrorist act. There are so few positive training examples that statistically reliable
patterns cannot be learned. Moreover, intelligent terrorists will take steps not to fit in
with patterns exhibited by previous terrorists [Jonas and Harper, 2006].
1.2 Overview
In this course we shall only look at methods that have state-of-the-art accuracy, that
are sufficiently simple and fast to be easy to use, and that have well-documented
successful applications. We will tread a middle ground between focusing on theory
at the expense of applications, and understanding methods only at a cookbook level.
Often, we shall not look at multiple methods for the same task, when there is one
method that is at least as good as all the others from most points of view. In particular,
for classifier learning, we will look at support vector machines (SVMs) in detail. We
will not examine alternative classifier learning methods such as decision trees, neural
networks, boosting, and bagging. All these methods are excellent, but it is hard to
identify clearly important scenarios in which they are definitely superior to SVMs.
We may also look at random forests, a nonlinear method that is often superior to
linear SVMs, and which is widely used in commercial applications nowadays.
CSE 291 Quiz for April 5, 2011
Your name:
Student id, if you have one:
Email address (important):
Suppose that you work for Apple. The company has developed an unprecedented
new product called the iKid, which is aimed at children. Your colleague says “Let’s
use data mining on our database of existing customers to learn a model that predicts
who is likely to buy an iKid.”
Is your colleague’s idea good or bad? Explain at least one signiﬁcant speciﬁc
reason that supports your answer.
Chapter 2
Predictive analytics in general
This chapter explains supervised learning, linear regression, and data cleaning and
recoding.
2.1 Supervised learning
The goal of a supervised learning algorithm is to obtain a classiﬁer by learning from
training examples. A classiﬁer is something that can be used to make predictions on
test examples. This type of learning is called “supervised” because of the metaphor
that a teacher (i.e. a supervisor) has provided the true label of each training example.
Each training and test example is represented in the same way, as a row vector
of ﬁxed length p. Each element in the vector representing an example is called a
feature value. It may be a real number or a value of any other type. A training set is
a set of vectors with known label values. It is essentially the same thing as a table
in a relational database, and an example is one row in such a table. Row, tuple, and
vector are essentially synonyms. A column in such a table is often called a feature,
or an attribute, in data mining. Sometimes it is important to distinguish between a
feature, which is an entire column, and a feature value.
The label y for a test example is unknown. The output of the classiﬁer is a
conjecture about y, i.e. a predicted y value. Often each label value y is a real number.
In this case, supervised learning is called “regression” and the classiﬁer is called a
“regression model.” The word “classiﬁer” is usually reserved for the case where label
values are discrete. In the simplest but most common case, there are just two label
values. These may be called −1 and +1, or 0 and 1, or no and yes, or negative and
positive.
With n training examples, and with each example consisting of values for p different
features, the training data are a matrix with n rows and p columns, along with
a column vector of y values. The cardinality of the training set is n, while its
dimensionality is p. We use the notation x_{ij} for the value of feature number j of example
number i. The label of example i is y_i. True labels are known for training examples,
but not for test examples.
2.2 Data cleaning and recoding
In real-world data, there is a lot of variability and complexity in features. Some
features are real-valued. Other features are numerical but not real-valued, for example
integers or monetary amounts. Many features are categorical, e.g. for a student the
feature “year” may have values freshman, sophomore, junior, and senior. Usually the
names used for the different values of a categorical feature make no difference to data
mining algorithms, but are critical for human understanding. Sometimes categorical
features have names that look numerical, e.g. zip codes, and/or have an unwieldy
number of different values. Dealing with these is difficult.

It is also difficult to deal with features that are really only meaningful in
conjunction with other features, such as “day” in the combination “day month year.”
Moreover, important features (meaning features with predictive power) may be
implicit in other features. For example, the day of the week may be predictive, but only
the day/month/year date is given in the original data. An important job for a human
is to think which features may be predictive, based on understanding the application
domain, and then to write software that makes these features explicit. No learning
algorithm can be expected to discover day-of-week automatically as a function of
day/month/year.
Even with all the complexity of features, many aspects are typically ignored in
data mining. Usually, units such as dollars and dimensions such as kilograms are
omitted. Difﬁculty in determining feature values is also ignored: for example, how
does one deﬁne the year of a student who has transferred from a different college, or
who is parttime?
A very common difficulty is that the value of a given feature is missing for some
training and/or test examples. Often, missing values are indicated by question marks.
However, missing values are also often indicated by strings that look like valid known
values, such as 0 (zero). It is important not to inadvertently treat a value that means
missing as a valid regular value.
Some training algorithms can handle missingness internally. If not, the simplest
approach is just to discard all examples (rows) with missing values. Keeping only
examples without any missing values is called “complete case analysis.” An equally
simple, but different, approach is to discard all features (columns) with missing
values. However, in many applications, both these approaches eliminate too much useful
training data.
Also, the fact that a particular feature is missing may itself be a useful predictor.
Therefore, it is often beneﬁcial to create an additional binary feature that is 0 for
missing and 1 for present. If a feature with missing values is retained, then it is
reasonable to replace each missing value by the mean or mode of the nonmissing
values. This process is called imputation. More sophisticated imputation procedures
exist, but they are not always better.
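The indicator-plus-imputation recipe above can be sketched in a few lines. This is a minimal Python/NumPy sketch with made-up feature values, assuming NaN encodes missingness:

```python
import numpy as np

# Hypothetical feature column with missing values encoded as np.nan.
x = np.array([3.0, np.nan, 5.0, 4.0, np.nan, 6.0])

# Binary indicator feature: 0.0 for missing, 1.0 for present.
present = (~np.isnan(x)).astype(float)

# Impute each missing value with the mean of the non-missing values.
mean = np.nanmean(x)
x_imputed = np.where(np.isnan(x), mean, x)

print(present)    # [1. 0. 1. 1. 0. 1.]
print(x_imputed)  # [3.  4.5 5.  4.  4.5 6. ]
```

The indicator column is kept alongside the imputed column, so the model can still exploit the fact of missingness itself.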
Some training algorithms can only handle categorical features. For these, features
that are numerical can be discretized. The range of the numerical values is partitioned
into a ﬁxed number of intervals that are called bins. The word “partitioned” means
that the bins are exhaustive and mutually exclusive, i.e. non-overlapping. One can
set boundaries for the bins so that each bin has equal width, i.e. the boundaries are
regularly spaced, or one can set boundaries so that each bin contains approximately
the same number of training examples, i.e. the bins are “equal count.” Each bin is
given an arbitrary name. Each numerical value is then replaced by the name of the
bin in which the value lies. It often works well to use ten bins.
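The two binning schemes can be sketched as follows (Python/NumPy, hypothetical values; `np.digitize` maps each value to the index of its bin, and the quantile boundaries produce the “equal count” variant):

```python
import numpy as np

values = np.array([1.0, 2.0, 2.5, 3.0, 7.0, 8.0, 9.0, 20.0])
k = 4  # number of bins

# Equal-width bins: boundaries regularly spaced over the range.
width_edges = np.linspace(values.min(), values.max(), k + 1)
width_bins = np.digitize(values, width_edges[1:-1])  # bin index 0..k-1

# Equal-count bins: boundaries at quantiles, so each bin holds
# approximately the same number of training examples.
count_edges = np.quantile(values, np.linspace(0.0, 1.0, k + 1))
count_bins = np.digitize(values, count_edges[1:-1])

print(width_bins)  # [0 0 0 0 1 1 1 3] -- the outlier 20 isolates bin 3
print(count_bins)  # [0 0 1 1 2 2 3 3] -- two values per bin
```

Note how the outlier at 20 leaves some equal-width bins nearly empty, while the equal-count bins stay balanced; this is why equal-count binning is often preferred for skewed features.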
Other training algorithms can only handle real-valued features. For these,
categorical features must be made numerical. The values of a binary feature can be
recoded as 0.0 or 1.0. It is conventional to code “false” or “no” as 0.0, and “true” or
“yes” as 1.0. Usually, the best way to recode a feature that has k different categorical
values is to use k real-valued features. For the jth categorical value, set the jth of
these features equal to 1.0 and set all k − 1 others equal to 0.0.¹
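A minimal sketch of this k-column recoding, using the student “year” feature from earlier as a hypothetical example:

```python
import numpy as np

# Hypothetical categorical feature with k = 4 values.
year = ["freshman", "sophomore", "junior", "senior", "junior"]
categories = ["freshman", "sophomore", "junior", "senior"]

# One 0.0/1.0 column per categorical value: the jth column is 1.0
# exactly when the example has the jth value, and 0.0 otherwise.
onehot = np.array([[1.0 if v == c else 0.0 for c in categories]
                   for v in year])

print(onehot.shape)  # (5, 4): five examples, k = 4 new features
```

Each row sums to 1.0, since every example has exactly one categorical value.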
Categorical features with many values (say, over 20) are often difficult to deal
with. Typically, human intervention is needed to recode them intelligently. For
example, zip codes may be recoded as just their first one or two digits, since these
indicate meaningful regions of the United States. If one has a large dataset with
many examples from each of the 50 states, then it may be reasonable to leave this
as a categorical feature with 50 values. Otherwise, it may be useful to group small
states together in sensible ways, for example to create a New England group for MA,
CT, VT, ME, NH.
An intelligent way to recode discrete predictors is to replace each discrete value
by the mean of the target conditioned on that discrete value. For example, if the
average label value is 20 for men versus 16 for women, these values could replace
the male and female values of a variable for gender. This idea is especially useful as
a way to convert a discrete feature with many values, for example the 50 U.S. states,
into a useful single numerical feature.

¹ The ordering of the values, i.e. which value is associated with j = 1, etc., is arbitrary.
Mathematically it is preferable to use only k − 1 real-valued features. For the last categorical
value, set all k − 1 features equal to 0.0. For the jth categorical value where j < k, set the jth
feature value to 1.0 and set all k − 1 others equal to 0.0.
However, as just explained, the standard way to recode a discrete feature with
m values is to introduce m − 1 binary features. With this standard approach, the
training algorithm can learn a coefﬁcient for each new feature that corresponds to an
optimal numerical value for the corresponding discrete value. Conditional means are
likely to be meaningful and useful, but they will not yield better predictions than the
coefﬁcients learned in the standard approach. A major advantage of the conditional
means approach is that it avoids an explosion in the dimensionality of training and
test examples.
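The conditional-mean recoding can be sketched directly. The data below are made up; the key point in the code is that the means are computed once from training labels and then reused verbatim at test time:

```python
from collections import defaultdict

# Hypothetical training data: (state, target) pairs.
train = [("MA", 22.0), ("MA", 18.0), ("CT", 16.0), ("CT", 14.0)]

# Mean of the TRAINING target conditioned on each discrete value.
sums = defaultdict(float)
counts = defaultdict(int)
for state, y in train:
    sums[state] += y
    counts[state] += 1
cond_mean = {s: sums[s] / counts[s] for s in sums}

print(cond_mean)  # {'MA': 20.0, 'CT': 15.0}

# At test time the SAME training means replace the discrete values;
# test labels are never consulted.
test_states = ["CT", "MA"]
encoded = [cond_mean[s] for s in test_states]
print(encoded)  # [15.0, 20.0]
```

A production version would also need a fallback mean for discrete values never seen in training.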
Mixed types.
Sparse data.
Normalization. After conditional-mean new values have been created, they can
be scaled to have zero mean and unit variance in the same way as other features.
When preprocessing and recoding data, it is vital not to peek at test data. If
preprocessing is based on test data in any way, then the test data are available indirectly
for training, which can lead to overfitting. If a feature is normalized by z-scoring, its
mean and standard deviation must be computed using only the training data. Then,
later, the same mean and standard deviation values must be used for normalizing this
feature in test data.
Even more important, target values from test data must not be used in any way
before or during training. When a discrete value is replaced by the mean of the target
conditioned on it, the mean must be just of training target values. Later, in test data,
the same discrete value must be replaced by the same mean of training target values.
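The z-scoring discipline above can be sketched in a few lines (Python/NumPy, made-up values); the only subtlety is which data the statistics come from:

```python
import numpy as np

x_train = np.array([2.0, 4.0, 6.0, 8.0])
x_test = np.array([5.0, 10.0])

# Mean and standard deviation are computed from the TRAINING data only.
mu = x_train.mean()       # 5.0
sigma = x_train.std()     # sqrt(5)

z_train = (x_train - mu) / sigma

# The SAME training statistics are applied to the test data;
# recomputing mu and sigma on x_test would leak test information.
z_test = (x_test - mu) / sigma
print(z_test)  # [0.0, 5/sqrt(5)]
```

Note that z_test does not have zero mean and unit variance, and should not: the transformation is fixed by the training set.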
2.3 Linear regression
Let x be an instance and let y be its real-valued label. For linear regression, x must
be a vector of real numbers of fixed length. Remember that this length p is often
called the dimension, or dimensionality, of x. Write x = ⟨x_1, x_2, ..., x_p⟩. The
linear regression model is

    y = b_0 + b_1 x_1 + b_2 x_2 + ... + b_p x_p.
The right-hand side above is called a linear function of x. The linear function is
defined by its coefficients b_0 to b_p. These coefficients are the output of the data
mining algorithm.
The coefficient b_0 is called the intercept. It is the value of y predicted by the
model if x_i = 0 for all i. Of course, it may be completely unrealistic that all features
x_i have value zero. The coefficient b_i is the amount by which the predicted y value
increases if x_i increases by 1, if the value of all other features is unchanged. For
example, suppose x_i is a binary feature where x_i = 0 means female and x_i = 1
means male, and suppose b_i = −2.5. Then the predicted y value for males is lower
by 2.5, everything else being held constant.
Suppose that the training set has cardinality n, i.e. it consists of n examples of
the form ⟨x_i, y_i⟩, where x_i = ⟨x_{i1}, ..., x_{ip}⟩. Let b be any set of coefficients. The
predicted value for x_i is

    ŷ_i = f(x_i; b) = b_0 + Σ_{j=1}^{p} b_j x_{ij}.

The semicolon in the expression f(x_i; b) emphasizes that the vector x_i is a variable
input, while b is a fixed set of parameter values. If we define x_{i0} = 1 for every i, then
we can write

    ŷ_i = Σ_{j=0}^{p} b_j x_{ij}.

The constant x_{i0} = 1 can be called a pseudo-feature.
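The pseudo-feature makes the prediction a single dot product. A minimal sketch with hypothetical numbers:

```python
import numpy as np

# n = 3 examples, p = 2 features; the values are hypothetical.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
b = np.array([0.5, 2.0, -1.0])  # b_0 (intercept), b_1, b_2

# Prepend the pseudo-feature x_i0 = 1 to every example, so each
# prediction is one dot product over j = 0..p.
X1 = np.hstack([np.ones((X.shape[0], 1)), X])
y_hat = X1 @ b

print(y_hat)  # [0.5 2.5 4.5]
```

For example, the first prediction is 0.5·1 + 2.0·1.0 + (−1.0)·2.0 = 0.5.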
Finding the optimal values of the coefficients b_0 to b_p is the job of the training
algorithm. To make this task well-defined, we need a definition of what “optimal”
means. The standard approach is to say that optimal means minimizing the sum of
squared errors on the training set, where the squared error on training example i is
(y_i − ŷ_i)². The training algorithm then finds

    b̂ = argmin_b Σ_{i=1}^{n} (f(x_i; b) − y_i)².

The objective function Σ_i (y_i − Σ_j b_j x_{ij})² is called the sum of squared errors, or
SSE for short. Note that during training the n different x_i and y_i values are fixed,
while the parameters b are variable.
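Minimizing the SSE can be sketched numerically; here `np.linalg.lstsq` finds the least-squares coefficients for a tiny made-up training set generated by y = 1 + 2x:

```python
import numpy as np

# Hypothetical training set: four examples with one feature,
# generated exactly by y = 1 + 2 * x (no noise).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Add the pseudo-feature column of ones, then minimize the SSE.
X1 = np.hstack([np.ones((X.shape[0], 1)), X])
b, residuals, rank, _ = np.linalg.lstsq(X1, y, rcond=None)

print(b)  # [1. 2.] -- the intercept b_0 and slope b_1
```

With noisy data the recovered coefficients would only approximate the generating values, and the residuals array would report the achieved minimum SSE.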
The optimal coefficient values b̂ are not defined uniquely if the number n of training
examples is less than the number p of features. Even if n > p is true, the optimal
coefficients have multiple equivalent values if some features are themselves related
linearly. Here, “equivalent” means that the different sets of coefficients achieve the
same minimum SSE. For an intuitive example, suppose features 1 and 2 are height
and weight respectively. Suppose that x_2 = 120 + 5(x_1 − 60) = −180 + 5x_1
approximately, where the units of measurement are pounds and inches. Then the same
model can be written in many different ways:

• y = b_0 + b_1 x_1 + b_2 x_2

• y = b_0 + b_1 x_1 + b_2 (−180 + 5x_1) = [b_0 − 180 b_2] + [b_1 + 5 b_2] x_1 + 0 x_2

and more. In the extreme, suppose x_1 = x_2. Then all models y = b_0 + b_1 x_1 + b_2 x_2
are equivalent for which b_1 + b_2 equals a constant.
When two or more features are approximately related linearly, then the true
values of the coefficients of those features are not well determined. The coefficients
obtained by training will be strongly influenced by randomness in the training data.
Regularization is a way to reduce the influence of this type of randomness. Consider
all models y = b_0 + b_1 x_1 + b_2 x_2 for which b_1 + b_2 = c. Among these models, there is
a unique one that minimizes the function b_1² + b_2². This model has b_1 = b_2 = c/2. We
can obtain it by setting the objective function for training to be the sum of squared
errors (SSE) plus a function that penalizes large values of the coefficients. A simple
penalty function of this type is Σ_{j=1}^{p} b_j². A parameter λ can control the relative
importance of the two objectives, namely SSE and penalty:

    b̂ = argmin_b (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)² + λ (1/p) Σ_{j=1}^{p} b_j².

If λ = 0 then one gets the standard least-squares linear regression solution. As λ
gets larger, the penalty on large coefficients gets stronger, and the typical values of
coefficients get smaller. The parameter λ is often called the strength of regularization.
The fractions 1/n and 1/p do not make an essential difference. They can be used to
make the numerical value of λ easier to interpret.
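The penalized objective has a closed-form minimizer, which can be sketched as follows. This is a minimal illustration, not a production implementation; the diagonal matrix D is the identity with a zero for the intercept, matching the convention below that b_0 is not penalized:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize (1/n)*SSE + lam*(1/p)*sum of squared coefficients.

    Sketch: setting the gradient of the penalized objective to zero
    gives the linear system (X'X/n + lam*D/p) b = X'y/n, where D is
    the identity with D[0,0] = 0 so the intercept b_0 is unpenalized.
    X is assumed to already include the pseudo-feature column of ones.
    """
    n, p1 = X.shape
    p = p1 - 1                     # number of real features
    D = np.eye(p1)
    D[0, 0] = 0.0                  # do not penalize the intercept
    A = X.T @ X / n + lam * D / p
    return np.linalg.solve(A, X.T @ y / n)

# With two identical (perfectly collinear) features, the penalty picks
# the symmetric solution b_1 = b_2 instead of an arbitrary split.
X = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 2.0],
              [1.0, 3.0, 3.0],
              [1.0, 4.0, 4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])   # y = 1 + 2*x
b = ridge_fit(X, y, 0.1)
print(b[1], b[2])  # equal, and their sum is shrunk slightly below 2
```

With lam = 0 this system reduces to the ordinary least-squares normal equations whenever those have a unique solution; with lam > 0 the system is always solvable, even for perfectly collinear features.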
The penalty function Σ_{j=1}^{p} b_j² is the square of the L2 norm of the vector b. Using
it for linear regression is called ridge regression. Any penalty function that treats all
coefficients b_j equally, like the L2 norm, is sensible only if the typical magnitudes of
the values of each feature are similar; this is an important motivation for data
normalization. Note that in the formula Σ_{j=1}^{p} b_j² the sum excludes the intercept
coefficient b_0. One reason for doing this is that the target y values are typically not
normalized.
2.4 Interpreting coefﬁcients of a linear model
It is common to desire a data mining model that is interpretable, that is one that
can be used not only to make predictions, but also to understand mechanisms in the
phenomenon that is being studied. Linear models, whether for regression or for
classification as described in later chapters, do appear interpretable at first sight.
However, much caution is needed when attempting to derive conclusions from numerical
coefficients.
Consider the following linear regression model for predicting high-density
cholesterol (HDL) levels.²

² HDL cholesterol is considered beneficial and is sometimes called “good” cholesterol. Source:
http://www.jerrydallal.com/LHSP/importnt.htm. Predictors have been reordered
here from most to least statistically significant, as measured by p-value.
predictor    coefficient   std. error   T-stat   p-value
intercept    1.16448       0.28804      4.04     <.0001
BMI          0.01205       0.00295      4.08     <.0001
LCHOL        0.31109       0.10936      2.84     0.0051
GLUM         0.00046       0.00018      2.50     0.0135
DIAST        0.00255       0.00103      2.47     0.0147
BLC          0.05055       0.02215      2.28     0.0239
PRSSY        0.00041       0.00044      0.95     0.3436
SKINF        0.00147       0.00183      0.81     0.4221
AGE          0.00092       0.00125      0.74     0.4602
From most to least statistically significant, the predictors are body mass index, the
log of total cholesterol, diastolic blood pressure, vitamin C level in blood, systolic
blood pressure, skinfold thickness, and age in years. (It is not clear what GLUM is.)
The example illustrates at least two crucial issues. First, if predictors are collinear,
then one may appear signiﬁcant and the other not, when in fact both are signiﬁcant or
both are not. Above, diastolic blood pressure is statistically signiﬁcant, but systolic
is not. This may possibly be true for some physiological reason. But it may also be
an artifact of collinearity.
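This collinearity artifact is easy to demonstrate synthetically. The sketch below uses made-up data in which two predictors are nearly identical (as systolic and diastolic pressure are strongly correlated) and only their sum matters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two nearly collinear predictors; the response depends only on
# their sum: y = x1 + x2 + noise.
coefs = []
for trial in range(3):
    x1 = rng.normal(size=100)
    x2 = x1 + 0.01 * rng.normal(size=100)   # x2 is almost equal to x1
    y = x1 + x2 + 0.1 * rng.normal(size=100)
    X = np.column_stack([np.ones(100), x1, x2])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    coefs.append((b[1], b[2]))

# The individual coefficients b_1 and b_2 are poorly determined and
# swing from resample to resample, but b_1 + b_2 stays close to 2.
for b1, b2 in coefs:
    print(round(b1, 2), round(b2, 2), round(b1 + b2, 2))
```

The apparent significance of one predictor and not the other is then an accident of the particular sample, exactly the artifact described above.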
Second, a predictor may be practically important, and statistically significant,
but still useless for interventions. This happens if the predictor and the outcome
have a common cause, or if the outcome causes the predictor. Above, vitamin C
is statistically significant. But it may be that vitamin C is simply an indicator of a
generally healthy diet high in fruits and vegetables. If this is true, then merely taking
a vitamin C supplement will not cause an increase in HDL level.
A third crucial issue is that a correlation may disagree with other knowledge and
assumptions. For example, vitamin C is generally considered beneficial or neutral.
If lower vitamin C were associated with higher HDL, one would be cautious about
believing this relationship, even if the association were statistically significant.
2.5 Evaluating performance
In a realworld application of supervised learning, we have a training set of examples
with labels, and a test set of examples with unknown labels. The whole point is to
make predictions for the test examples.
However, in research or experimentation we want to measure the performance
achieved by a learning algorithm. To do this we use a test set consisting of examples
with known labels. We train the classiﬁer on the training set, apply it to the test set,
and then measure performance by comparing the predicted labels with the true labels
(which were not available to the training algorithm).
It is absolutely vital to measure the performance of a classiﬁer on an independent
test set. Every training algorithm looks for patterns in the training data, i.e. correlations
between the features and the class. Some of the patterns discovered may be
spurious, i.e. they are valid in the training data due to randomness in how the training
data was selected from the population, but they are not valid, or not as strong,
in the whole population. A classiﬁer that relies on these spurious patterns will have
higher accuracy on the training examples than it will on the whole population. Only
accuracy measured on an independent test set is a fair estimate of accuracy on the
whole population. The phenomenon of relying on patterns that are strong only in the
training data is called overﬁtting. In practice it is an omnipresent danger.
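The gap between training and test accuracy can be seen in a tiny simulation (a sketch, not from the text): a 1-nearest-neighbor classifier memorizes randomly labeled one-dimensional training data, so its training accuracy is perfect even though there is no real pattern to learn, while its accuracy on an independent test set is near chance.

```python
import random

def nearest_neighbor_predict(train_x, train_y, x):
    """Predict the label of the single closest training point (1-NN)."""
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[i]

def accuracy(xs, ys, train_x, train_y):
    hits = sum(nearest_neighbor_predict(train_x, train_y, x) == y
               for x, y in zip(xs, ys))
    return hits / len(xs)

random.seed(0)
# Labels are pure coin flips: there is NO real pattern to discover.
train_x = [random.random() for _ in range(200)]
train_y = [random.choice([0, 1]) for _ in range(200)]
test_x = [random.random() for _ in range(200)]
test_y = [random.choice([0, 1]) for _ in range(200)]

train_acc = accuracy(train_x, train_y, train_x, train_y)  # memorized: exactly 1.0
test_acc = accuracy(test_x, test_y, train_x, train_y)     # near chance, about 0.5
```

Every spurious "pattern" the memorizer relies on is valid only in the training sample, which is exactly the overfitting phenomenon described above.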
Most training algorithms have some settings that the user can choose between.
For ridge regression the main algorithmic parameter is the degree of regularization
λ. Other algorithmic choices are which sets of features to use. It is natural to run
a supervised learning algorithm many times, and to measure the accuracy of the
function (classiﬁer or regression function) learned with different settings. A set of
labeled examples used to measure accuracy with different algorithmic settings, in
order to pick the best settings, is called a validation set. If you use a validation set,
it is important to have a ﬁnal test set that is independent of both the training set
and the validation set. For fairness, the ﬁnal test set must be used only once. The
only purpose of the ﬁnal test set is to estimate the true accuracy achievable with the
settings previously chosen using the validation set.
Dividing the available data into training, validation, and test sets should be done
randomly, in order to guarantee that each set is a random sample from the same
distribution. However, a very important real-world issue is that future real test
examples, for which the true label is genuinely unknown, may not be a sample from
the same distribution.
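A minimal sketch of such a random three-way split (the fractions and the function name are illustrative, not prescribed by the text):

```python
import random

def split_three_ways(examples, f_train=0.6, f_valid=0.2, seed=0):
    """Randomly partition examples into training, validation, and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # random order => each set is a random sample
    n = len(shuffled)
    n_train = int(f_train * n)
    n_valid = int(f_valid * n)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

train, valid, test = split_three_ways(range(100))
```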
Quiz question
Suppose you are building a model to predict how many dollars someone will spend at
Sears. You know the gender of each customer, male or female. Since you are using
linear regression, you must recode this discrete feature as continuous. You decide
to use two real-valued features, x11 and x12. The coding is a standard "one of n"
scheme, as follows:

gender  x11  x12
male     1    0
female   0    1
Learning from a large training set yields the model
y = ... + 15 x11 + 75 x12 + ...
Dr. Roebuck says “Aha! The average woman spends $75, but the average man spends
only $15.”
Write your name below, and then answer the following three parts with one or
two sentences each:
(a) Explain why Dr. Roebuck’s conclusion is not valid. The model only predicts
spending of $75 for a woman if all other features have value zero. This may not be
true for the average woman. Indeed it will not be true for any woman, if features
such as “age” are not normalized.
(b) Explain what conclusion can actually be drawn from the numbers 15 and 75.
The conclusion is that if everything else is held constant, then on average a woman
will spend $60 more than a man. Note that if some feature values are systematically
different for men and women, then even this conclusion is not useful, because it is not
reasonable to hold all other feature values constant.
(c) Explain a desirable way to simplify the model. The two features x11 and x12
are linearly related. Hence, they make the optimal model be undefined, in the
absence of regularization. It would be good to eliminate one of these two features.
The expressiveness of the model would be unchanged.
Quiz for April 6, 2010
Your name:
Suppose that you are training a model to predict how many transactions a credit card
customer will make. You know the education level of each customer. Since you are
using linear regression, you recode this discrete feature as continuous. You decide to
use two real-valued features, x37 and x38. The coding is a "one of two" scheme, as
follows:

                  x37  x38
college grad       1    0
not college grad   0    1
Learning from a large training set yields the model
y = ... + 5.5 x37 + 3.2 x38 + ...
(a) Dr. Goldman concludes that the average college graduate makes 5.5 transactions.
Explain why Dr. Goldman’s conclusion is likely to be false.
The model only predicts 5.5 transactions if all other features, including the
intercept, have value zero. This may not be true for the average college grad. It will
certainly be false if features such as "age" are not normalized.
(b) Dr. Sachs concludes that being a college graduate causes a person to make 2.3
more transactions, on average. Explain why Dr. Sachs’ conclusion is likely to be
false also.
First, if any other features have different values on average for college graduates
and non-graduates, for example income, then 5.5 − 3.2 = 2.3 is not the average
difference in predicted y value between the two groups. Said another way, it is
unlikely to be reasonable to hold all other feature values constant when comparing
groups.
Second, even if 2.3 is the average difference, one cannot say that this difference
is caused by being a college graduate. There may be some other unknown common
cause, for example.
CSE 291 Quiz for April 12, 2011
Your name:
Email address (important):
Suppose that you are training a model (a classifier) to predict whether or not a
graduate student drops out. You have enough positive and negative training examples,
from a selective graduate program. You are training the model correctly, without
overﬁtting or any other mistake. Assume further that verbal ability is a strong causal
inﬂuence on the target, and that the verbal GRE score is a good measure of verbal
ability.
Explain two separate reasonable scenarios where the verbal GRE score, despite
all the assumptions above being true, does not have strong predictive power in a
trained model.
Linear regression assignment
This assignment is due at the start of class on Tuesday April 12, 2011. You should
work in a team of two. Choose a partner who has a different background from you.
Download the file cup98lrn.zip from http://archive.ics.uci.edu/ml/databases/kddcup98/kddcup98.html.
Read the associated documentation. Load the data into Rapidminer (or other software
for data mining such as R). Select the 4843 records that have feature TARGET_B=1.
Save these as a native-format Rapidminer example set.
Now, build a linear regression model to predict the field TARGET_D as accurately
as possible. Use root mean squared error (RMSE) as the definition of error, and use
ten-fold cross-validation to measure RMSE. Do a combination of the following:
• Recode nonnumerical features as numerical.
• Discard useless features.
• Transform features to improve their usefulness.
• Compare different strengths of ridge regularization.
Do the steps above repeatedly in order to explore alternative ways of using the data.
The outcome should be the best possible model that you can ﬁnd that uses 30 or
fewer of the original features.
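The RMSE and cross-validation machinery the assignment asks for can be sketched as follows. This is only an illustration: the stand-in "model" is just the mean of the training targets, not an actual regression on the KDD data, and the example values are made up.

```python
import math

def rmse(predictions, targets):
    """Root mean squared error between predictions and true values."""
    n = len(targets)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n)

def cross_validated_rmse(targets, k=10):
    """Estimate RMSE of a mean-only model by k-fold cross-validation.

    Each fold is held out once; the "model" (here just the mean of the
    training targets) is fit on the remaining folds."""
    errors = []
    for fold in range(k):
        test = [t for i, t in enumerate(targets) if i % k == fold]
        train = [t for i, t in enumerate(targets) if i % k != fold]
        mean = sum(train) / len(train)
        errors.append(rmse([mean] * len(test), test))
    return sum(errors) / len(errors)

example = cross_validated_rmse(
    [10.0, 12.0, 8.0, 11.0, 9.0, 10.0, 13.0, 7.0, 10.0, 10.0], k=5)
```

A real solution would replace the mean-only predictor with the regression model fit on each set of training folds.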
To make your work more efficient, be sure to save the 4843 records in a format
that Rapidminer can load quickly. You can use threefold cross-validation during
development to save time also. If you normalize all input features, and you use
strong regularization (ridge parameter 10^7 perhaps), then the regression coefficients
will indicate the relative importance of features.
The deliverable is a brief report that is formatted similarly to this assignment
description. Describe what you did that worked, and your results. Explain any
assumptions that you made, and any limitations on the validity or reliability of your
results. If you use Rapidminer, include a printout of your final process. Include your
final regression model. Do not speculate about future work, and do not explain ideas
that did not work. Write in the present tense. Organize your report logically, not
chronologically.
Comments on the regression assignment
It is typically useful to rescale predictors to have mean zero and variance one.
However, it loses interpretability to rescale the target variable. Note that if all
predictors have mean zero, then the intercept of a linear regression model is the mean
of the target, $15.6243 here, assuming that the intercept is not regularized.
The assignment speciﬁcally asks you to report root mean squared error, RMSE.
One could also report mean squared error, MSE, but whichever is chosen should be
used consistently. In general, do not confuse readers by switching between multiple
performance measures without a good reason.
As is often the case, good performance can be achieved with a very simple model.
The most informative single feature is LASTGIFT, the dollar amount of the person’s
most recent gift. A model based on just this single feature achieves RMSE of $9.98.
In 2009, three of 11 teams achieved similar ﬁnal RMSEs that were slightly better than
$9.00. The two teams that omitted LASTGIFT achieved RMSE worse than $11.00.
However, it is possible to do signiﬁcantly better.
The assignment asks you to produce a ﬁnal model based on at most 30 of the
original features. Despite this directive, it is not a good idea to begin by choosing
a subset of the 480 original features based on human intuition. The teams that did
this all omitted features that in fact would have made their ﬁnal models considerably
better, including sometimes the feature LASTGIFT. As explained above, it is also
not a good idea to automatically eliminate features with missing values.
Rapidminer has operators that search for highly predictive subsets of variables.
These operators have two major problems. First, they are too slow to be used on a
large initial set of variables, so it is easy for human intuition to pick an initial set
that is bad. Second, these operators try a very large number of alternative subsets
of variables, and pick the one that performs best on some dataset. Because of the
high number of alternatives considered, this subset is likely to overﬁt substantially
the dataset on which it is best. For more discussion of this problem, see the section
on nested crossvalidation in a later chapter.
Chapter 3
Introduction to Rapidminer
By default, many Java implementations allocate so little memory that Rapidminer
will quickly terminate with an “out of memory” message. This is not a problem with
Windows 7, but otherwise, launch Rapidminer with a command like

java -Xmx1g -jar rapidminer.jar

where 1g means one gigabyte.
3.1 Standardization of features
The recommended procedure is as follows, in order.
• Normalize all numerical features to have mean zero and variance 1.
• Convert each nominal feature with k alternative values into k different binary
features.
• Optionally, drop all binary features with fewer than 100 examples for either
binary value.
• Convert each binary feature into a numerical feature with values 0.0 and 1.0.
It is not recommended to normalize numerical features obtained from binary features.
The reason is that this normalization would destroy the sparsity of the dataset, and
hence make some efﬁcient training algorithms much slower. Note that destroying
sparsity is not an issue when the dataset is stored as a dense matrix, which it is by
default in Rapidminer.
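The procedure above can be sketched as follows, using plain column lists. The helper names are hypothetical, and the binary 0.0/1.0 columns produced by one-hot encoding are deliberately left unnormalized, as recommended.

```python
def zscore(values):
    """Normalize a numerical column to mean zero and variance one."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    sd = var ** 0.5 or 1.0  # guard against a constant column
    return [(v - mean) / sd for v in values]

def one_hot(values):
    """Convert a nominal column with k values into k binary 0.0/1.0 columns."""
    levels = sorted(set(values))
    return {level: [1.0 if v == level else 0.0 for v in values]
            for level in levels}

def keep_binary(column, min_count=100):
    """Heuristic: keep a binary column only if both values occur often enough."""
    ones = sum(1 for v in column if v == 1.0)
    return min(ones, len(column) - ones) >= min_count

ages = zscore([20, 30, 40])
colors = one_hot(["red", "blue", "red"])
```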
Normalization is also questionable for variables whose distribution is highly
uneven. For example, almost all donation amounts are under $50, but a few are up
to $500. No linear normalization, whether by z-scoring or by transformation to the
range 0 to 1 or otherwise, will allow a linear classiﬁer to discriminate well between
different common values of the variable. For variables with uneven distributions, it
is often useful to apply a nonlinear transformation that makes their distribution have
more of a uniform or Gaussian shape. A common transformation that often achieves
this is to take logarithms. A useful trick when zero is a frequent value of a variable
x is to use the mapping x → log(x + 1). This leaves zeros unchanged and hence
preserves sparsity.
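A minimal sketch of the x → log(x + 1) trick: the long right tail is compressed, zeros stay at zero, and the ordering of values is preserved.

```python
import math

def log1p_transform(values):
    """Map x -> log(x + 1): compresses a long right tail, leaves zeros at zero."""
    return [math.log(x + 1) for x in values]

donations = [0, 0, 5, 10, 500]
transformed = log1p_transform(donations)
```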
Eliminating binary features for which either value has fewer than 100 training
examples is a heuristic way to prevent overﬁtting and at the same time make training
faster. It may not be necessary, or beneﬁcial, if regularization is used during training.
If regularization is not used, then at least one binary feature must be dropped from
each set created by converting a multivalued feature, in order to prevent the existence
of multiple equivalent models. For efﬁciency and interpretability, it is best to drop the
binary feature corresponding to the most common value of the original multivalued
feature.
If you have separate training and test sets, it is important to do all preprocessing
in a perfectly consistent way on both datasets. The simplest approach is to concatenate
both sets before preprocessing, and then split them afterwards. However, concatenation
allows information from the test set to influence the details of preprocessing. This is
a form of information leakage, that is, of using the test set during training, which is
not legitimate.
3.2 Example of a Rapidminer process
Figure 3.1 shows a tree structure of Rapidminer operators that together perform a
standard data mining task. The tree illustrates how to perform some common
subtasks.
The ﬁrst operator is ExampleSource. “Read training example” is a name for this
particular instance of this operator. If you have two instances of the same operator,
you can identify them with different names. You select an operator by clicking on
it in the left pane. Then, in the right pane, the Parameters tab shows the arguments
of the operator. At the top there is an icon that is a picture of either a person with a
red sweater, or a person with an academic cap. Click on the icon to toggle between
these. When the person in red is visible, you are in expert mode, where you can see
all the parameters of each operator.
The example source operator has a parameter named attributes. Normally this
is a file name with extension "aml." The corresponding file should contain a schema
definition for a dataset. The actual data are in a file with the same name, with
extension "dat."

Root node of Rapidminer process (Process)
    Read training examples (ExampleSource)
    Select features by name (FeatureNameFilter)
    Convert nominal to binary (Nominal2Binominal)
    Remove almost-constant features (RemoveUselessAttributes)
    Convert binary to real-valued (Nominal2Numerical)
    Z-scoring (Normalization)
    Scan for best strength of regularization (GridParameterOptimization)
        Cross-validation (XValidation)
            Regularized linear regression (W-LinearRegression)
            ApplierChain (OperatorChain)
                Applier (ModelApplier)
                Compute RMSE on test fold (RegressionPerformance)

Figure 3.1: Rapidminer process for regularized linear regression. Each line shows
the chosen instance name, with the operator type in parentheses; indentation shows
nesting.

The easiest way to create these files is by clicking on "Start Data Loading
Wizard.” The ﬁrst step with this wizard is to specify the ﬁle to read data from, the
character that begins comment lines, and the decimal point character. Ticking the
box for “use double quotes” can avoid some error messages.
In the next panel, you specify the delimiter that divides ﬁelds within each row
of data. If you choose the wrong delimiter, the data preview at the bottom will look
wrong. In the next panel, tick the box to make the ﬁrst row be interpreted as ﬁeld
names. If this is not true for your data, the easiest fix is to make it true outside
Rapidminer, with a text editor or otherwise. When you click Next on this panel, all rows
of data are loaded. Error messages may appear in the bottom pane. If there are no
errors and the data ﬁle is large, then Rapidminer hangs with no visible progress. The
same thing happens if you click Previous from the following panel. You can use a
CPU monitor to see what Rapidminer is doing.
The next panel asks you to specify the type of each attribute. The wizard guesses
this based only on the ﬁrst row of data, so it often makes mistakes, which you have to
ﬁx by trial and error. The following panel asks you to say which features are special.
The most common special choice is “label” which means that an attribute is a target
to be predicted.
Finally, you specify a ﬁle name that is used with “aml” and “dat” extensions to
save the data in Rapidminer format.
To keep just features with certain names, use the operator FeatureNameFilter. Let
the argument skip_features_with_name be .* and let the argument except_
features_with_name identify the features to keep. In our sample process, it is
(.*AMNT.*)(.*GIFT.*)(YRS.*)(.*MALE)(STATE)(PEPSTRFL)(.*GIFT)
(MDM.*)(RFA_2.*).
In order to convert a discrete feature with k different values into k real-valued
0/1 features, two operators are needed. The first is Nominal2Binominal, while
the second is Nominal2Numerical. Note that the documentation of the latter
operator in Rapidminer is misleading: it cannot convert a discrete feature into
multiple numerical features directly. The operator Nominal2Binominal is quite slow.
Applying it to discrete features with more than 50 alternative values is not
recommended.
The simplest way to ﬁnd a good value for an algorithm setting is to use the
XValidation operator nested inside the GridParameterOptimization op
erator. The way to indicate nesting of operators is by dragging and dropping. First
create the inner operator subtree. Then insert the outer operator. Then drag the root
of the inner subtree, and drop it on top of the outer operator.
3.3 Other notes on Rapidminer
Reusing existing processes.
Saving datasets.
Operators from Weka versus from elsewhere.
Comments on specific operators: Nominal2Numerical, Nominal2Binominal,
RemoveUselessFeatures.
Eliminating non-alphanumeric characters, using quote marks, trimming lines,
trimming commas.
Chapter 4
Support vector machines
This chapter explains soft-margin support vector machines (SVMs), including linear
and nonlinear kernels. It also discusses detecting overfitting via cross-validation, and
preventing overfitting via regularization.
We have seen how to use linear regression to predict a realvalued label. Now we
will see how to use a similar model to predict a binary label. In later chapters, where
we think probabilistically, it will be convenient to assume that a binary label takes on
true values 0 or 1. However, in this chapter it will be convenient to assume that the
label y has true value either +1 or −1.
4.1 Loss functions
Let x be an instance, let y be its true label, and let f(x) be a prediction. Assume that
the prediction is realvalued. If we need to convert it into a binary prediction, we will
threshold it at zero:
ŷ = 2 · I(f(x) ≥ 0) − 1
where I() is an indicator function that is 1 if its argument is true, and 0 if its argument
is false.
A so-called loss function l measures how good the prediction is. Typically loss
functions do not depend directly on x, and a loss of zero corresponds to a perfect
prediction, while otherwise loss is positive. The most obvious loss function is
l(f(x), y) = I(f(x) ≠ y)

which is called the 0-1 loss function. The usual definition of accuracy uses this loss
function. However, it has undesirable properties. First, it loses information: it does
not distinguish between predictions f(x) that are almost right, and those that are
very wrong. Second, mathematically, its derivative is either undeﬁned or zero, so it
is difﬁcult to use in training algorithms that try to minimize loss via gradient descent.
A better loss function is squared error:
l(f(x), y) = (f(x) − y)²
which is inﬁnitely differentiable everywhere, and does not lose information when the
prediction f(x) is realvalued. However, this loss function says that the prediction
f(x) = 1.5 is as undesirable as f(x) = 0.5 when the true label is y = 1. Intuitively,
if the true label is +1, then a prediction with the correct sign that is greater than 1
should not be considered incorrect.
The following loss function, which is called hinge loss, satisﬁes the intuition just
suggested:
l(f(x), −1) = max{0, 1 + f(x)}
l(f(x), +1) = max{0, 1 − f(x)}
To understand this loss function, look at a ﬁgure showing l(f(x), −1) as a function
of the output f(x) of the classiﬁer. Then the loss is zero (the best case) as long as
the prediction f(x) ≤ −1. The further away f(x) is from −1 in the wrong direction,
the bigger the loss. There is a similar but opposite pattern when the true label is
y = +1.
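The three loss functions discussed above can be written down directly; the hinge loss here uses the single-equation form max{0, 1 − y f(x)}, which agrees with the two cases given earlier.

```python
def zero_one_loss(fx, y):
    """0-1 loss on the thresholded prediction yhat = 2*I(f(x) >= 0) - 1."""
    yhat = 1 if fx >= 0 else -1
    return 0 if yhat == y else 1

def squared_loss(fx, y):
    """Squared error; penalizes f(x) = 1.5 as much as f(x) = 0.5 when y = 1."""
    return (fx - y) ** 2

def hinge_loss(fx, y):
    """Zero once the prediction is beyond the margin on the correct side."""
    return max(0.0, 1.0 - y * fx)
```

Note that hinge_loss(2.0, 1) = 0 while squared_loss(2.0, 1) = 1: only the hinge loss accepts confident predictions with the correct sign.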
Using hinge loss is the first major insight behind SVMs. An SVM classifier f
is trained to minimize hinge loss. The training process aims to achieve predictions
f(x) ≥ 1 for all training instances x with true label y = +1, and to achieve
predictions f(x) ≤ −1 for all training instances x with y = −1. Overall, training seeks
to classify points correctly, and to distinguish clearly between the two classes, but it
does not seek to make predictions be exactly +1 or −1. In this sense, the training
process intuitively aims to find the best possible classifier, without trying to satisfy
any unnecessary additional objectives also.
4.2 Regularization
Given a set of training examples ⟨x_i, y_i⟩ for i = 1 to i = n, the total training loss
(sometimes called empirical loss) is the sum

Σ_{i=1}^n l(f(x_i), y_i).

(Footnote: it is mathematically convenient to write the hinge loss function in a single
equation as l(f(x), y) = max{0, 1 − y f(x)}.)
Suppose that the function f is selected by minimizing average training loss:

f = argmin_{f ∈ F} Σ_{i=1}^n l(f(x_i), y_i)

where F is a space of candidate functions. If F is too flexible, and/or the training
set is too small, then we run the risk of overfitting the training data. But if F is too
restricted, then we run the risk of underfitting the data. In general, we do not know in
advance what the best space F is for a particular training set. A possible solution to
this dilemma is to choose a flexible space F, but at the same time to impose a penalty
on the complexity of f. Let c(f) be some real-valued measure of complexity. The
learning process then becomes to solve

f = argmin_{f ∈ F} λ c(f) + Σ_{i=1}^n l(f(x_i), y_i).
Here, λ is a parameter that controls the relative strength of the two objectives, namely
to minimize the complexity of f and to minimize training error.
Suppose that the space of candidate functions is defined by a vector w ∈ R^d of
parameters, i.e. we can write f(x) = g(x; w) where g is some fixed function. In this
case we can define the complexity of each candidate function to be the norm of the
corresponding w. Most commonly we use the square norm:

c(f) = ||w||² = Σ_{j=1}^d w_j².
However we could also use other norms, including the L0 norm

c(f) = ||w||_0 = Σ_{j=1}^d I(w_j ≠ 0)

or the L1 norm

c(f) = ||w||_1 = Σ_{j=1}^d |w_j|.

The square norm is the most convenient mathematically.
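A minimal sketch of the three complexity measures just defined:

```python
def l0_norm(w):
    """||w||_0: the number of nonzero coefficients."""
    return sum(1 for wj in w if wj != 0)

def l1_norm(w):
    """||w||_1: the sum of absolute values of the coefficients."""
    return sum(abs(wj) for wj in w)

def square_norm(w):
    """||w||^2: the most mathematically convenient complexity measure."""
    return sum(wj * wj for wj in w)

w = [3.0, 0.0, -4.0]
```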
4.3 Linear soft-margin SVMs
A linear classifier is a function f(x) = g(x; w) = x · w where g is the dot product
function. Putting the ideas above together, the objective of learning is to find

w = argmin_{w ∈ R^d} λ ||w||² + (1/n) Σ_{i=1}^n max{0, 1 − y_i (w · x_i)}

where the multiplier 1/n is included because it is conventional. A linear soft-margin
SVM classifier is precisely the solution to this optimization problem. It can be proved
that the solution to the minimization problem is always unique. Moreover, the
objective function is convex, so there are no local minima. Note that d is the
dimensionality of x, and w has the same dimensionality.
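The objective being minimized can be written as a short function (a sketch; the toy data below are illustrative):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def primal_objective(w, xs, ys, lam):
    """lambda * ||w||^2 + (1/n) * sum of hinge losses: the quantity a linear
    soft-margin SVM minimizes over w."""
    penalty = lam * dot(w, w)
    hinge = sum(max(0.0, 1.0 - y * dot(w, x)) for x, y in zip(xs, ys))
    return penalty + hinge / len(xs)

# Toy 1-d data: two points beyond the margin, one inside it.
xs = [[2.0], [-2.0], [0.5]]
ys = [1, -1, 1]
obj = primal_objective([1.0], xs, ys, lam=0.1)
```

Here only the third point incurs hinge loss (1 − 0.5 = 0.5), so the objective is 0.1 + 0.5/3.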
An equivalent way of writing the same optimization problem is

w = argmin_{w ∈ R^d} ||w||² + C Σ_{i=1}^n max{0, 1 − y_i (w · x_i)}
with C = 1/(nλ). Many SVM implementations let the user specify C as opposed
to λ. A small value for C corresponds to strong regularization, while a large value
corresponds to weak regularization. Intuitively, when dimensionality d is ﬁxed, a
smaller training set should require a smaller C value. However, useful guidelines are
not known for what the best value of C might be for a given dataset. In practice, one
has to try multiple values of C, and ﬁnd the best value experimentally.
Mathematically, the optimization problem above is called an unconstrained
primal formulation. It is the easiest description of SVMs to understand, and it is the
basis of the fastest known algorithms for training linear SVMs.
4.4 Dual formulation
There is an alternative formulation that is equivalent, and is useful theoretically. This
so-called dual formulation is

max_{α ∈ R^n} Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j (x_i · x_j)

subject to 0 ≤ α_i ≤ C.

The primal and dual formulations are different optimization problems, but they have
the same unique solution. The solution to the dual problem is a coefficient α_i for
each training example. Notice that the optimization is over R^n, whereas it is over
R^d in the primal formulation. The trained classifier is f(x) = w · x where the vector

w = Σ_{i=1}^n α_i y_i x_i.

This equation says that w is a weighted linear combination of the training instances
x_i, where the weight of each instance is between 0 and C, and the sign of each
instance is its label y_i.
Often, solving the dual optimization problem leads to α_i being exactly zero for
many training instances. The training instances x_i such that α_i > 0 are called support
vectors. These instances are the only ones that contribute to the final classifier.
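Recovering w from dual coefficients can be sketched as follows; instances with α_i = 0 contribute nothing, illustrating why only the support vectors matter. The numbers are illustrative, not the solution to any particular dual problem.

```python
def weights_from_dual(alphas, ys, xs):
    """w = sum_i alpha_i * y_i * x_i; only support vectors (alpha_i > 0) contribute."""
    d = len(xs[0])
    w = [0.0] * d
    for alpha, y, x in zip(alphas, ys, xs):
        if alpha > 0:  # non-support vectors add nothing to w
            for j in range(d):
                w[j] += alpha * y * x[j]
    return w

alphas = [0.5, 0.0, 1.0]          # the second instance is not a support vector
ys = [1, 1, -1]
xs = [[1.0, 2.0], [3.0, 3.0], [0.0, 1.0]]
w = weights_from_dual(alphas, ys, xs)
```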
If the number of training examples n is much bigger than the number of features
d then most training examples will not be support vectors. This fact is the basis for
a picture that is often used to illustrate the intuitive idea that an SVM maximizes the
margin between positive and negative training examples. This picture is not incorrect,
but it is misleading for several reasons. First, the SVMs that are useful in practice
are so-called soft-margin SVMs, where some training points can be misclassified.
Second, the regularization that is part of the loss function for SVM training is most
useful in cases where n is small relative to d. In those cases, often half or more of
all training points become support vectors. Third, the strength of regularization that
gives best performance on test data cannot be determined by looking at distances
between training points.
The constrained dual formulation is the basis of the training algorithms used by
the ﬁrst generation of SVM implementations. However, as just mentioned, recent
research has shown that the unconstrained primal formulation leads to faster training
algorithms in the linear case. Moreover, the primal version is easier to understand and
easier to use as a foundation for proving bounds on generalization error. However, the
dual version is easier to extend to obtain nonlinear SVM classiﬁers. This extension
is based on the idea of a kernel function.
4.5 Nonlinear kernels
Consider two instances x_i and x_j, and consider their dot product x_i · x_j. This dot
product is a measure of the similarity of the two instances: it is large if they are
similar and small if they are not. Dot product similarity is closely related to Euclidean
distance through the identity

d(x_i, x_j) = ||x_i − x_j|| = (||x_i||² − 2 x_i · x_j + ||x_j||²)^{1/2}
where by definition ||x||² = x · x. If the instances have unit length, that is
||x_i|| = ||x_j|| = 1, then Euclidean distance and dot product similarity are perfectly
anticorrelated. Therefore, the equation
f(x) = w · x = Σ_{i=1}^n α_i y_i (x_i · x)

says that the prediction for a test example x is a weighted average of the training
labels y_i, where the weight of y_i is the product of α_i and the degree to which x is
similar to x_i.
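The identity relating Euclidean distance to dot products can be checked numerically (a sketch):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def euclidean(u, v):
    """Direct Euclidean distance ||u - v||."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def distance_via_dots(u, v):
    """(||u||^2 - 2 u.v + ||v||^2)^(1/2): the identity from the text."""
    return math.sqrt(dot(u, u) - 2 * dot(u, v) + dot(v, v))

u, v = [1.0, 2.0, 2.0], [4.0, 6.0, 2.0]
```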
Consider a re-representation of instances x → Φ(x) where the transformation Φ
is a function R^d → R^d′. In principle, we could use the dot product to define
similarity in the new space R^d′, and train an SVM classifier in this space. However,
suppose that we have a function K(x_i, x_j) = Φ(x_i) · Φ(x_j). This function is all
that is needed in order to write down the optimization problem and its solution. If we
know K then we do not need to know the function Φ in any explicit way. Specifically,
let k_ij = K(x_i, x_j). The learning task is to solve
max_{α ∈ R^n} Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j k_ij

subject to 0 ≤ α_i ≤ C.
The solution is

f(x) = [Σ_{i=1}^n α_i y_i Φ(x_i)] · Φ(x) = Σ_{i=1}^n α_i y_i K(x_i, x).

This classifier is a weighted combination of at most n functions, one for each support
vector x_i. These functions are sometimes called basis functions.
The result above says that in order to train a nonlinear SVM classifier, all that is
needed is the kernel matrix of size n by n whose entries are k_ij. In order to apply
the trained nonlinear classifier, that is to make predictions for arbitrary test examples,
the kernel function K is needed. The function Φ never needs to be known explicitly.
Using K exclusively in this way instead of Φ is called the "kernel trick." Practically,
the function K can be much easier to deal with than Φ, because K is just a mapping
to R, rather than to a high-dimensional space R^d′.
Intuitively, regardless of which kernel K is used, that is regardless of which
re-representation Φ is used, the complexity of the classifier f is limited, since it is
defined by at most n coefficients α_i. The function K is fixed, so it does not increase
the intuitive complexity of the classifier.
(Footnote: for many applications of support vector machines, it is advantageous to
normalize features to have the same mean and variance. It can be advantageous also
to normalize instances so that they have unit length. However, in general one cannot
have both normalizations be true at the same time.)
One particular kernel is especially important. The radial basis function (RBF)
kernel is the function

K(x_i, x) = exp(−γ ||x_i − x||²)

where γ > 0 is an adjustable parameter. Note that the value of K(x_i, x) is always
between 0 and 1, with the maximum occurring only if x = x_i. Using an RBF kernel,
each basis function K(x_i, x) is "radial" since it is based on the Euclidean distance
||x_i − x|| between a support vector x_i and a data point x.
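A minimal sketch of the RBF kernel, showing that its value is 1 only when the two points coincide and decays toward 0 as they move apart:

```python
import math

def rbf_kernel(xi, x, gamma):
    """K(xi, x) = exp(-gamma * ||xi - x||^2); always in (0, 1]."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, x))
    return math.exp(-gamma * sq_dist)

k_same = rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5)   # maximum value: 1.0
k_far = rbf_kernel([1.0, 2.0], [10.0, 2.0], gamma=0.5)   # distant point: near 0
```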
An important intuition about SVMs is that with an RBF kernel, the classifier
f(x) = Σ_i α_i y_i K(x_i, x) is similar to a nearest-neighbor classifier. Given a test
instance x, its predicted label f(x) is a weighted average of the labels y_i of the
support vectors x_i. The support vectors that contribute non-negligibly to the predicted
label are those for which the Euclidean distance ||x_i − x|| is small.
The RBF kernel can also be written

K(x_i, x) = exp(−||x_i − x||² / σ²)

where σ² = 1/γ. This notation emphasizes the similarity with a Gaussian
distribution. A smaller value for γ, i.e. a larger value for σ², gives basis functions that
are less peaked, i.e. that are significantly above zero for a wider range of x values.
Using a larger value for σ² is similar to using a larger number k of neighbors in a
nearest-neighbor classifier.
4.6 Selecting the best SVM settings
Getting good results with support vector machines requires some care. The consensus
opinion is that the best SVM classiﬁer is typically at least as accurate as the best of
any other type of classiﬁer, but obtaining the best SVM classiﬁer is not trivial. The
following procedure is recommended as a starting point:
1. Code data in the numerical format needed for SVM training.
2. Scale each attribute to have range 0 to 1, or to have range −1 to +1, or to have
mean zero and unit variance.
3. Use cross-validation to find the best value for C for a linear kernel.
4. Use cross-validation to find the best values for C and γ for an RBF kernel.
38 CHAPTER 4. SUPPORT VECTOR MACHINES
5. Train on the entire available data using the parameter values found to be best
via crossvalidation.
It is reasonable to start with C = 1 and γ = 1, and to try values that are smaller and larger by factors of 2:

    (C, γ) ∈ {..., 2^−3, 2^−2, 2^−1, 1, 2, 2^2, 2^3, ...}².

This process is called a grid search; it is easy to parallelize, of course. It is important to try a wide enough range of values to be sure that more extreme values do not give better performance. Once you have found the best values of C and γ to within a factor of 2, doing a more fine-grained grid search around these values is sometimes beneficial, but not always.
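The grid search described above can be sketched as follows. The `train_and_eval` function is a hypothetical stand-in for cross-validated SVM training, which a real experiment would supply:

```python
import itertools

def grid_search(train_and_eval, c_grid, gamma_grid):
    """Exhaustive grid search: evaluate every (C, gamma) pair and keep the
    best. `train_and_eval` is assumed to return a cross-validated accuracy
    estimate for one setting; here it is a hypothetical stand-in."""
    best = None
    for C, gamma in itertools.product(c_grid, gamma_grid):
        acc = train_and_eval(C, gamma)
        if best is None or acc > best[0]:
            best = (acc, C, gamma)
    return best

grid = [2.0 ** k for k in range(-3, 4)]   # powers of 2 around 1, as above
# Made-up evaluation function whose optimum is at C = 2 and gamma = 0.25.
fake_eval = lambda C, gamma: -abs(C - 2.0) - abs(gamma - 0.25)
best_acc, best_C, best_gamma = grid_search(fake_eval, grid, grid)
```

Each `train_and_eval` call is independent of the others, which is why the loop parallelizes trivially.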
Quiz question
(a) Draw the hinge loss function for the case where the true label y = 1. Label the
axes clearly.
(b) Explain where the derivative is (i) zero, (ii) constant but not zero, or (iii) not
deﬁned.
(c) For each of the three cases for the derivative, explain its intuitive implications
for training an SVM classiﬁer.
CSE 291 Quiz for April 19, 2011
Your name:
Email address:
As seen in class, the hinge loss function is

    l(f(x), −1) = max{0, 1 + f(x)}
    l(f(x), +1) = max{0, 1 − f(x)}

(a) Explain what the two arguments of the hinge loss function are.
(b) Explain the reason why l(2, 1) = l(1, 1) but l(−2, 1) ≠ l(−1, 1).
Quiz for April 20, 2010
Your name:
Suppose that you have trained SVM classiﬁers carefully for a given learning task.
You have selected the settings that yield the best linear classiﬁer and the best RBF
classiﬁer. It turns out that both classiﬁers perform equally well in terms of accuracy
for your task.
(a) Now you are given many times more training data. After ﬁnding optimal settings
again and retraining, do you expect the linear classiﬁer or the RBF classiﬁer to have
better accuracy? Explain very brieﬂy.
(b) Do you expect the optimal settings to involve stronger regularization, or weaker
regularization? Explain very brieﬂy.
CSE 291 Assignment 2 (2011)
This assignment is due at the start of class on Tuesday April 19, 2011. As before,
you should work in a team of two, choosing a partner with a different background.
You may keep the same partner as before, or change partners.
Like the previous assignment, this one uses the KDD98 training set. However, you should now use more of the dataset. The goal is to train a support vector machine (SVM) classifier that identifies examples that make a nonzero donation, that is, examples for which TARGET_B = 1. Only about 5% of all examples are positive. The first thing that you should do is extract all positive examples, and a random subset of twice as many negative examples. We will see next week how to deal with all the negative examples.
Train SVM models to predict the target label as accurately as possible. In the same general way as for linear regression, recode non-numerical features as numerical, and transform features to improve their usefulness. Train the best possible linear SVM, and also the best possible SVM using a radial basis function (RBF) kernel. The outcome should be the two SVM classifiers with highest 0/1 accuracy that you can find, without overfitting or underfitting.
Use cross-validation and grid search carefully to find the best settings for training, and to evaluate the accuracy of the best classifiers as fairly as possible. Identify good values for the soft-margin C parameter for both kernels, and for the width parameter for the RBF kernel. For linear SVMs, the Rapidminer operator named FastLargeMargin is recommended. Because it is fast, you can explore models based on a large number of transformed features. Training nonlinear SVMs is much slower.
As before, the deliverable is a short report that is organized well, written well, and formatted well. Describe what you did that worked, and your results. Explain any assumptions that you made, and any limitations on the validity or reliability of your results. Be clear and explicit about the 0/1 accuracy that you achieve on test data. Explain carefully your cross-validation procedure. If you use Rapidminer, include a printout of your final processes. Whatever software you use, describe your final linear and RBF models.
Do not speculate about future work, and do not explain ideas that do not work.
Write in the present tense. Organize your report logically, not chronologically.
CSE 291 Assignment (2010)
This assignment is due at the start of class on Tuesday April 20, 2010. As before,
you should work in a team of two, choosing a partner with a different background.
You may keep the same partner as before, or change partners.
This project uses data published by Kevin Hillstrom, a well-known data mining consultant. You can find the data at http://cseweb.ucsd.edu/users/elkan/250B/HillstromData.csv. For a detailed description, see http://minethatdata.com/blog/2008/03/minethatdataemailanalyticsanddata.html.
For this assignment, use only the data for customers who are not sent any email
promotion. Your job is to train a good model to predict which customers visit the
retailer’s website. For now, you should ignore information about which customers
make a purchase, and how much they spend.
Build support vector machine (SVM) models to predict the target label as accurately as possible. In the same general way as for linear regression, recode non-numerical features as numerical, and transform features to improve their usefulness. Train the best possible model using a linear kernel, and also the best possible model using a radial basis function (RBF) kernel. The outcome should be the two most accurate SVM classifiers that you can find, without overfitting or underfitting.
Decide thoughtfully which measure of accuracy to use, and explain your choice in your report. Use nested cross-validation carefully to find the best settings for training, and to evaluate the accuracy of the best classifiers as fairly as possible. In particular, you should identify good values of the soft-margin C parameter for both kernels, and of the width parameter for the RBF kernel.
For linear SVMs, the Rapidminer operator named FastLargeMargin is recommended. Because it is fast, you can explore models based on a large number of transformed features. Training nonlinear SVMs is much slower, but one can hope that good performance can be achieved with fewer features.
As before, the deliverable is a well-organized, well-written, and well-formatted report of about two pages. Describe what you did that worked, and your results. Explain any assumptions that you made, and any limitations on the validity or reliability of your results. Explain carefully your nested cross-validation procedure.
Include a printout of your final Rapidminer process, and a description of your two final models (not included in the two pages). Do not speculate about future work, and do not explain ideas that do not work. Write in the present tense. Organize your report logically, not chronologically.
Chapter 5
Doing valid experiments
Consider a classiﬁer that predicts membership in one of two classes, called positive
and negative. Suppose that a set of examples has a certain total size n, say n = 1000.
We can represent the performance of any classifier with a table as follows:

                    predicted
                 positive  negative
    truth positive    tp       fn
          negative    fp       tn

where tp + fn + fp + tn = n. A table like the one above is called a 2×2 confusion matrix. Rows correspond to actual labels, while columns correspond to predicted labels.¹
The four entries in a 2×2 confusion matrix have standard names. They are true positives tp, false positives fp, true negatives tn, and false negatives fn. It is important to remember that these numbers are counts, i.e. integers, not ratios, i.e. fractions. Depending on the application, different summaries are computed from a confusion matrix. In particular, accuracy is the ratio (tp + tn)/n.
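As a concrete illustration, the four counts and the accuracy ratio can be computed as follows. Labels are coded +1 and −1, and the data below are made up:

```python
def confusion_counts(y_true, y_pred):
    """Count tp, fn, fp, tn for labels coded +1 and -1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == +1 and p == +1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == +1 and p == -1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == +1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    return tp, fn, fp, tn

y_true = [+1, +1, -1, -1, -1]     # made-up actual labels
y_pred = [+1, -1, +1, -1, -1]     # made-up predicted labels
tp, fn, fp, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + fn + fp + tn)   # (1 + 2) / 5 = 0.6
```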
5.1 Cross-validation
¹ It would be equally valid to swap the rows and columns. Unfortunately, there is no standard convention about whether rows are actual or predicted. Remember that there is a universal convention that in notation like x_ij the first subscript refers to rows while the second subscript refers to columns.

Usually we have a fixed database of labeled examples available, and we are faced with a dilemma: we would like to use all the examples for training, but we would also like to use many examples as an independent test set. Cross-validation is a procedure for overcoming this dilemma. It is the following algorithm.
Input: Training set S, integer constant k
Procedure:
    partition S into k disjoint equal-sized subsets S_1, ..., S_k
    for i = 1 to i = k
        let T = S \ S_i
        run learning algorithm with T as training set
        test the resulting classifier on S_i, obtaining tp_i, fp_i, tn_i, fn_i
    compute tp = Σ_i tp_i, fp = Σ_i fp_i, tn = Σ_i tn_i, fn = Σ_i fn_i
The output of cross-validation is a confusion matrix based on using each labeled example as a test example exactly once. Whenever an example is used for testing a classifier, it has not been used for training that classifier. Hence, the confusion matrix obtained by cross-validation is a fair indicator of the performance of the learning algorithm on independent test examples.
If n labeled examples are available, the largest possible number of folds is k = n. This special case is called leave-one-out cross-validation (LOOCV). However, the time complexity of cross-validation is k times that of running the training algorithm once, so often LOOCV is computationally infeasible. In recent research the most common choice for k is 10.
A disadvantage of cross-validation is that it does not produce any single final classifier, and the confusion matrix it provides is not the performance of any specific single classifier. Instead, this matrix is an estimate of the average performance of a classifier learned from a training set of size (k − 1)n/k where n is the size of S. The common procedure is to create a final classifier by training on all of S, and then to use the confusion matrix obtained from cross-validation as an informal estimate of the performance of this classifier. This estimate is likely to be conservative in the sense that the final classifier may have slightly better performance since it is based on a slightly larger training set.
Suppose that the time to train a classifier is proportional to the number of training examples, and the time to make predictions on test examples is negligible in comparison. The time required for k-fold cross-validation is then O(k · ((k − 1)/k) · n) = O((k − 1)n). Threefold cross-validation is therefore twice as time-consuming as twofold. This suggests that for preliminary experiments, twofold is a good choice.
The results of cross-validation can be misleading. For example, if each example is duplicated in the training set and we use a nearest-neighbor classifier, then LOOCV will show a zero error rate. Cross-validation with other values of k will also yield misleadingly low error estimates.
Subsampling means omitting examples from some classes. Subsampling the common class is a standard approach to learning from unbalanced data, i.e. data where some classes are very common while others are very rare. It is reasonable to do subsampling in the training folds of cross-validation, but not in the test fold. Reported performance numbers should always be based on a set of test examples that is directly typical of a genuine test population. A common mistake with cross-validation is to do subsampling on the test set.²
5.2 Nested cross-validation
A model selection procedure is a method for choosing the best classifier, or learning algorithm, from a set of alternatives. Often, the alternatives are all the same algorithm but with different parameter settings, for example SVMs with different C and γ values. Another common case is where the alternatives are different subsets of features. Typically, the number of alternatives is finite and the only way to evaluate an alternative is to run it explicitly on a training set. So, how should we do model selection?
A simple way to apply cross-validation for model selection is the following:
Input: dataset S, integer k, set V of alternative algorithm settings
Procedure:
    partition S randomly into k disjoint subsets S_1, ..., S_k of equal size
    for each setting v in V
        for i = 1 to i = k
            let T = S \ S_i
            run the learning algorithm with setting v and T as training set
            apply the trained model to S_i, obtaining performance e_i
        end for
        let M(v) be the average of e_i
    end for
    select v̂ = argmax_v M(v)

The output of this model selection procedure is v̂. The input set V of alternative settings can be a grid of parameter values.
But, any procedure for selecting parameter values is itself part of the learning algorithm. It is crucial to understand this point. The setting v̂ is chosen to maximize M(v), so M(v̂) is not a fair estimate of the performance to be expected from v̂ on future data. Stated another way, v̂ is chosen to optimize performance on all of S, so v̂ is likely to overfit S.

² Nevertheless, for the assignment due on April 19, 2011, you should subsample the negatives in the KDD98 dataset just once, initially. Then, do cross-validation under the assumption that 2:1 is the true ratio of negatives to positives.
Notice that the division of S into subsets happens just once, and the same division is used for all settings v. This choice reduces the random variability in the evaluation of different v. A new partition of S could be created for each setting v, but this would not overcome the issue that v̂ is chosen to optimize performance on S.
What should we do about the fact that any procedure for selecting parameter values is itself part of the learning algorithm? One answer is that this procedure should itself be evaluated using cross-validation. This process is called nested cross-validation, because one cross-validation is run inside another.
Specifically, nested cross-validation is the following process:
Input: dataset S, integers k and k′, set V of alternative algorithm settings
Procedure:
    partition S randomly into k disjoint subsets S_1, ..., S_k of equal size
    for i = 1 to i = k
        let T = S \ S_i
        partition T randomly into k′ disjoint subsets T_1, ..., T_{k′} of equal size
        for each setting v in V
            for j = 1 to j = k′
                let U = T \ T_j
                run the learning algorithm with setting v and U as training set
                apply the trained model to T_j, obtaining performance e_ij
            end for
            let M_i(v) be the average of e_ij
        end for
        select v̂_i = argmax_v M_i(v)
        run the learning algorithm with setting v̂_i and T as training set
        apply the trained model to S_i, obtaining performance e_i
    end for
    report e, the average of e_i
Like standard cross-validation, nested cross-validation does not produce a single final model. The obvious thing to do is to run non-nested cross-validation on the whole dataset, obtain a recommended setting v̂, and then train once on the whole dataset using v̂, yielding a single final model f.
The number e obtained by nested cross-validation is an estimate of the performance of f. One could argue that nested cross-validation is an unreasonable amount of effort to obtain a single number that, although fair, is not provably correct, and is likely too low because it is based on smaller training sets.
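The nested procedure above can be sketched as follows. The `train` and `evaluate` functions are hypothetical stand-ins for a real learning algorithm and performance measure:

```python
def nested_cv(S, k, k_inner, settings, train, evaluate):
    """Nested cross-validation. `train(v, data)` returns a model and
    `evaluate(model, data)` returns a performance number."""
    folds = [S[i::k] for i in range(k)]
    outer_scores = []
    for i in range(k):
        T = [x for j, fold in enumerate(folds) if j != i for x in fold]
        inner = [T[j::k_inner] for j in range(k_inner)]
        def inner_score(v):
            # Average performance of setting v over the k' inner folds.
            total = 0.0
            for j in range(k_inner):
                U = [x for m, fold in enumerate(inner) if m != j for x in fold]
                total += evaluate(train(v, U), inner[j])
            return total / k_inner
        v_hat = max(settings, key=inner_score)       # best setting for this outer fold
        # Retrain on all of T with the chosen setting, test on the held-out fold.
        outer_scores.append(evaluate(train(v_hat, T), folds[i]))
    return sum(outer_scores) / k                     # the number e reported above

# Toy check: the "model" is just the setting itself, and setting 3 is best everywhere.
e = nested_cv(list(range(12)), 3, 2, [1, 2, 3, 4],
              train=lambda v, data: v,
              evaluate=lambda model, data: float(-abs(model - 3)))
```

Note that each outer fold may choose a different v̂_i, which is exactly why the procedure evaluates the selection process rather than any one model.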
Whether or not one uses nested cross-validation, trying every setting in V explicitly can be prohibitive. It may be preferable to search in V in a more clever way, using a genetic algorithm or some other heuristic method. As mentioned above, the search for a good member of the set V is itself part of the learning algorithm, so it can certainly be intelligent and/or heuristic.
Quiz for April 13, 2010
Your name:
The following cross-validation procedure is intended to find the best regularization parameter R for linear regression, and to report a fair estimate of the RMSE to be expected on future test data.
Input: dataset S, integer k, set V of alternative R values
Procedure:
    partition S randomly into k disjoint subsets S_1, ..., S_k of equal size
    for each R value in V
        for i = 1 to i = k
            let T = S \ S_i
            train linear regression with parameter R and T as training set
            apply the trained regression equation to S_i, obtaining SSE e_i
        end for
        compute M = Σ_i e_i
        report R and RMSE = sqrt(M / |S|)
    end for
(a) Suppose that you choose the R value for which the reported RMSE is lowest. Explain why this method is likely to be overoptimistic, as an estimate of the RMSE to be expected on future data.
Each R value is being evaluated on the entire set S. The value that seems to be best is likely overfitting this set.
(b) Very briefly, suggest an improved variation of the method above.
The procedure to select a value for R is part of the learning algorithm. This whole procedure should be evaluated using a separate test set, or via cross-validation.
Additional notes: The procedure to select a value for R uses cross-validation, so if this procedure is evaluated itself using cross-validation, then the entire process is nested cross-validation.
Incorrect answers include the following:
• "The partition of S should be stratified." No; first of all, we are doing regression, so stratification is not well-defined, and second, failing to stratify increases variability but does not cause overfitting.
• "The partition of S should be done separately for each R value, not just once." No; a different partition for each R value might increase the variability in the evaluation of each R, but it would not change the fact that the best R is being selected according to its performance on all of S.
Two basic points to remember are that it is never fair to evaluate a learning method on
its training set, and that any search for settings for a learning algorithm (e.g. search
for a subset of features, or for algorithmic parameters) is part of the learning method.
Chapter 6
Classiﬁcation with a rare class
In many data mining applications, the goal is to find needles in a haystack. That is, most examples are negative but a few examples are positive. The goal is to identify the rare positive examples, as accurately as possible. For example, most credit card transactions are legitimate, but a few are fraudulent. We have a standard binary classifier learning problem, but both the training and test sets are unbalanced. In a balanced set, the fraction of examples of each class is about the same. In an unbalanced set, some classes are rare while others are common.
A major difficulty with unbalanced data is that accuracy is not a meaningful measure of performance. Suppose that 99% of credit card transactions are legitimate. Then we can get 99% accuracy by predicting trivially that every transaction is legitimate. On the other hand, suppose we somehow identify 5% of transactions for further investigation, and half of all fraudulent transactions occur among these 5%. Clearly the identification process is doing something worthwhile and not trivial. But its accuracy is only 95%.
For concreteness in further discussion, we will consider only the two-class case, and we will call the rare class positive. Rather than talk about fractions or percentages of a set, we will talk about actual numbers (also called counts) of examples. It turns out that thinking about actual numbers leads to less confusion and more insight than thinking about fractions. Suppose the test set has a certain total size n, say n = 1000. We can represent the performance of the trivial classifier as the following confusion matrix:

                    predicted
                 positive  negative
    truth positive     0       10
          negative     0      990
The performance of the nontrivial classifier is

                    predicted
                 positive  negative
    truth positive     5        5
          negative    45      945
For supervised learning with discrete predictions, only a confusion matrix gives complete information about the performance of a classifier. No single number, for example accuracy, can provide a full picture of a confusion matrix, or of the usefulness of a classifier. Assuming that n is known, three of the counts in a confusion matrix can vary independently. When writing a report, it is best to give the full confusion matrix explicitly, so that readers can calculate whatever performance measurements they are most interested in.
When accuracy is not meaningful, two summary measures that are commonly used are called precision and recall. They are defined as follows:
• precision p = tp/(tp + fp), and
• recall r = tp/(tp + fn).
The names "precision" and "recall" come from the field of information retrieval. In other research areas, recall is often called sensitivity, while precision is sometimes called positive predictive value.
Precision is undefined for a classifier that predicts that every test example is negative, that is when tp + fp = 0. Worse, precision can be misleadingly high for a classifier that predicts that only a few test examples are positive. Consider the following confusion matrix:

                    predicted
                 positive  negative
    truth positive     1        9
          negative     0      990

Precision is 100% but 90% of actual positives are missed. F-measure is a widely used metric that overcomes this limitation of precision. It is the harmonic average of precision and recall:

    F = 2 / (1/p + 1/r) = 2pr / (r + p).

For the confusion matrix above, F = 2 · 0.1/(1 + 0.1) ≈ 0.18.
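The definitions above can be checked directly on this last confusion matrix:

```python
def precision_recall_f(tp, fn, fp):
    p = tp / (tp + fp)          # precision: undefined when tp + fp = 0
    r = tp / (tp + fn)          # recall
    f = 2 * p * r / (p + r)     # F-measure: harmonic mean of p and r
    return p, r, f

# The matrix above: tp = 1, fn = 9, fp = 0 (tn = 990 is not needed here).
p, r, f = precision_recall_f(tp=1, fn=9, fp=0)
```

This reproduces the text: precision 1.0, recall 0.1, and F-measure about 0.18.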
Besides accuracy, precision, recall, and F-measure, many other summaries are also commonly computed from confusion matrices. Some of these are called specificity, false positive rate, false negative rate, positive and negative likelihood ratio, kappa coefficient, and more. Rather than rely on agreement and understanding of the definitions of these, it is preferable simply to report a full confusion matrix explicitly.
6.1 Thresholds and lift
A confusion matrix is always based on discrete predictions. Often, however, these predictions are obtained by thresholding a real-valued predicted score. For example, an SVM classifier yields a real-valued prediction f(x) which is then compared to the threshold zero to obtain a discrete yes/no prediction. Confusion matrices cannot represent information about the overall usefulness of underlying real-valued predictions. We shall return to this issue below, but first we shall consider the issue of selecting a threshold.
Setting the threshold determines the number tp + fp of examples that are predicted to be positive. In some scenarios, there is a natural threshold such as zero for an SVM. However, even when a natural threshold exists, it is possible to change the threshold to achieve a target number of positive predictions. This target number is often based on a so-called budget constraint. Suppose that all examples predicted to be positive are subjected to further investigation. An external limit on the resources available will determine how many examples can be investigated. This number is a natural target for the value fp + tp. Of course, we want to investigate those examples that are most likely to be actual positives, so we want to investigate the examples x with the highest prediction scores f(x). Therefore, the correct strategy is to choose a threshold t such that we investigate all examples x with f(x) ≥ t, where the number of such examples is determined by the budget constraint.
Given that a fixed number of examples are predicted to be positive, a natural question is how good a classifier is at capturing the actual positives within this number. This question is answered by a measure called lift. Intuitively, a lift of 2 means that actual positives are twice as dense among the predicted positives as they are among all examples.
The precise definition of lift is a bit complex. First, let the fraction of examples predicted to be positive be the prediction rate x = (tp + fp)/n. Next, let the base rate of actual positive examples be b = (tp + fn)/n. Let the density of actual positives within the predicted positives be d = tp/(tp + fp). Now, the lift at prediction rate x is defined to be the ratio d/b. Note that the lift will be different at different prediction rates. Typically, the smaller the prediction rate, the higher the lift.
Lift can be expressed as

    d/b = [tp/(tp + fp)] / [(tp + fn)/n]
        = tp · n / [(tp + fp)(tp + fn)]
        = [tp/(tp + fn)] · [n/(tp + fp)]
        = recall / prediction rate.
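The identity can be checked numerically on the non-trivial credit-card classifier above, where 5% of transactions are investigated and half of all frauds are caught:

```python
def lift(tp, fn, fp, tn):
    """Lift = density of actual positives among predicted positives,
    divided by the base rate of positives in the whole test set."""
    n = tp + fn + fp + tn
    prediction_rate = (tp + fp) / n
    base_rate = (tp + fn) / n
    density = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Sanity check of the algebraic identity derived above.
    assert abs(density / base_rate - recall / prediction_rate) < 1e-12
    return density / base_rate

value = lift(tp=5, fn=5, fp=45, tn=945)
```

The result is a lift of 10: recall 0.5 divided by prediction rate 0.05.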
Lift is a useful measure of success if the number of examples that should be predicted to be positive is determined by external considerations. However, budget constraints should normally be questioned. In the credit card scenario, perhaps too many transactions are being investigated and the marginal benefit of investigating some transactions is negative. Or, perhaps too few transactions are being investigated; there would be a net benefit if additional transactions were investigated. Making optimal decisions about how many examples should be predicted to be positive is discussed in the next chapter.
While a budget constraint is not normally a rational way of choosing a threshold for predictions, it is still more rational than choosing an arbitrary threshold. In particular, the threshold zero for an SVM classifier has some mathematical meaning but is not a rational guide for making decisions.
6.2 Ranking examples
Applying a threshold to a classifier with real-valued outputs loses information, because the distinction between examples on the same side of the threshold is lost. Moreover, at the time a classifier is trained and evaluated, it is often the case that the threshold to be used for decision-making is not known. Therefore, it is useful to compare different classifiers across the range of all possible thresholds.
Typically, it is not meaningful to use the same numerical threshold directly for different classifiers. However, it is meaningful to compare the recall achieved by different classifiers when the threshold for each one is set to make their false positive rates equal. This is what an ROC curve does.¹
Concretely, an ROC curve is a plot of the performance of a classifier, where the horizontal axis measures false positive rate (fpr) and the vertical axis measures true positive rate (tpr). These are defined as

    fpr = fp / (fp + tn)    and    tpr = tp / (tp + fn).

¹ ROC stands for "receiver operating characteristic." This terminology originates in the theory of detection based on electromagnetic waves. The receiver is an antenna and its operating characteristic is the sensitivity that it is tuned for.
Figure 6.1: ROC curves for three alternative binary classiﬁers. Source: Wikipedia.
Note that tpr is the same as recall and is sometimes also called "hit rate."
In an ROC plot, the ideal point is at the top left. One classifier uniformly dominates another if its curve is always above the other's curve. It happens often that the ROC curves of two classifiers cross, which implies that neither one dominates the other uniformly.
ROC plots are informative, but do not provide a quantitative measure of the performance of classifiers. A natural quantity to consider is the area under the ROC curve, often abbreviated AUC. The AUC is 0.5 for a classifier whose predictions are no better than random, while it is 1 for a perfect classifier. The AUC has an intuitive meaning: it can be proved that it equals the probability that the classifier will rank correctly two randomly chosen examples where one is positive and one is negative.
One reason why AUC is widely used is that, as shown by the probabilistic meaning just mentioned, it has built into it the implicit assumption that the rare class is more important than the common class.
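The probabilistic meaning of AUC gives a direct way to compute it, by counting positive-negative pairs. The scores and labels below are made up, and ties are counted as one half:

```python
def auc(scores, labels):
    """Empirical AUC: the probability that a randomly chosen positive example
    is scored higher than a randomly chosen negative one (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == +1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    wins = sum(1.0 if sp > sn else 0.5 if sp == sn else 0.0
               for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [+1, -1, +1, -1, -1]
a = auc(scores, labels)    # 5 of the 6 positive-negative pairs are ranked correctly
```

This brute-force pairwise count takes quadratic time; sorting by score gives the same number in O(n log n), which matters for large test sets.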
The previous sections were about measuring performance in applications where one class is rare but important. This section describes ideas for actually performing better on such applications.
The most obvious idea is to undersample the negative class. That is, only include in the training set a randomly chosen subset of the available negative examples. A major disadvantage of this approach is that it loses information. It does, however, make training faster, which can be an important benefit.
Consider the regularized loss function that is minimized by SVM training:

    c(f) + C Σ_{i=1}^{n} l(f(x_i), y_i).

The parameter C controls the tradeoff between regularization and minimizing empirical loss. The empirical loss function treats all n training examples equally. If most training examples belong to one class, then most pressure will be to achieve high accuracy on that class. One response to class imbalance is to give different importances to each class in the optimization task. Specifically, we can minimize

    c(f) + C_{−1} Σ_{i : y_i = −1} l(f(x_i), −1) + C_{1} Σ_{i : y_i = 1} l(f(x_i), 1).

Good values of the two C parameters can be chosen by cross-validation, or their ratio can be fixed so that each class has equal importance. Many SVM software packages allow the user to specify these two different C parameters.
A subtle idea is that we can actually optimize the AUC of a linear classifier directly on the training set. The empirical AUC of a linear classifier with coefficients w is

    A = (1 / (n_{−1} n_{1})) Σ_{i : y_i = 1} Σ_{j : y_j = −1} I(w^T x_i > w^T x_j).

This is the same as 0/1 accuracy on a new set Q:

    A = (1/|Q|) Σ_{⟨i,j⟩ ∈ Q} I( I(w^T x_i > w^T x_j) = I(y_i > y_j) )

where Q = {⟨i, j⟩ : y_i ≠ y_j}. Now, note that

    I(w^T x_i > w^T x_j) = I(w^T (x_i − x_j) > 0).
This formulation suggests learning a linear classifier w such that a training example x_i − x_j has 0/1 label I(y_i > y_j). Specifically, we minimize

    λ c(f) + Σ_{⟨i,j⟩ ∈ Q} l(w^T (x_i − x_j), y_i).
The training set Q for this learning task is balanced, because if the pair ⟨i, j⟩ belongs to Q then so does ⟨j, i⟩. Note also that y_i = 1 if and only if y_j = −1, so we want x_i − x_j to have the label 1 if and only if y_i = 1.
The set Q can have size up to n²/4 if the original training set has size n. However, the minimization can often be done efficiently by stochastic gradient descent, picking elements ⟨i, j⟩ from Q randomly. The weight vector w typically reaches convergence after seeing only O(n) elements of Q.
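The stochastic gradient idea can be sketched as follows, using the hinge loss on randomly chosen positive-negative pairs. For simplicity this sketch omits the regularizer c(f) and any learning-rate schedule, so it is an illustration rather than a tuned implementation:

```python
import random

def auc_sgd(X, y, rate=0.1, steps=2000, seed=0):
    """SGD on the pairwise hinge loss l(w . (x_i - x_j), 1) over pairs with
    y_i = +1 and y_j = -1. Regularization is omitted for simplicity."""
    rng = random.Random(seed)
    pos = [i for i, yi in enumerate(y) if yi == +1]
    neg = [i for i, yi in enumerate(y) if yi == -1]
    d = len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        i, j = rng.choice(pos), rng.choice(neg)
        diff = [X[i][m] - X[j][m] for m in range(d)]
        margin = sum(wm * dm for wm, dm in zip(w, diff))
        if margin < 1:                 # hinge loss is active: move w toward the pair
            w = [wm + rate * dm for wm, dm in zip(w, diff)]
    return w

# One-dimensional made-up data where positives have larger feature values.
X = [[1.0], [2.0], [-1.0], [-2.0]]
y = [+1, +1, -1, -1]
w = auc_sgd(X, y)
```

Sampling a pair ⟨i, j⟩ uniformly at random avoids ever materializing the quadratic-size set Q.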
6.4 Conditional probabilities
• Reason 1 to want probabilities: predicting expectations.
• Ideal versus achievable probability estimates.
• Reason 2 to want probabilities: understanding predictive power.
• Definition of well-calibrated probability.
• Brier score.
• Converting scores into probabilities.
6.5 Isotonic regression
Let f_i be prediction scores on a dataset, and let y_i ∈ {0, 1} be the corresponding true labels. Let f_j be the f_i values sorted from smallest to largest, and let y_j be the y_i values sorted in the same order. For each f_j we want to find an output value g_j such that the g_j values are monotonically increasing, and squared error relative to the y_j values is minimized. Formally, the optimization problem is

    min_{g_1, ..., g_n} Σ_j (y_j − g_j)²  subject to  g_j ≤ g_{j+1} for j = 1 to j = n − 1.
It is a remarkable fact that if squared error is minimized, then the resulting predictions
are wellcalibrated probabilities.
There is an elegant algorithm called "pool adjacent violators" (PAV) that solves this problem in linear time. The algorithm is as follows, where pooling a set means replacing each member of the set by the arithmetic mean of the set.

    let g_j = y_j for all j
    start with j = 1 and increase j until the first j such that g_j ≰ g_{j+1}
    pool g_j and g_{j+1}
    move left: if g_{j−1} ≰ g_j then pool g_{j−1} to g_{j+1}
    continue to the left until monotonicity is satisfied
    proceed to the right
Given a test example x, the procedure to predict a well-calibrated probability is as follows:

    apply the classifier to obtain f(x)
    find j such that f_j ≤ f(x) ≤ f_{j+1}
    the predicted probability is g_j
6.6 Univariate logistic regression
The disadvantage of isotonic regression is that it creates a lookup table for converting
scores into estimated probabilities. An alternative is to use a parametric model. The
most common model is called univariate logistic regression. The model is
log [p / (1 − p)] = a + b f

where f is a prediction score and p is the corresponding estimated probability.
The equation above shows that the logistic regression model is essentially a linear model with intercept a and coefficient b. An equivalent way of writing the model is

p = 1 / (1 + e^{−(a + bf)}).
As above, let f_i be prediction scores on a training set, and let y_i ∈ {0, 1} be the corresponding true labels. The parameters a and b are chosen to minimize the total loss

Σ_i l(1 / (1 + e^{−(a + b f_i)}), y_i).

The precise loss function l that is typically used in the optimization is called conditional log likelihood (CLL) and is explained in Chapter ??? below. However, one could also use squared error, which would be consistent with isotonic regression.
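A minimal sketch of fitting a and b is below, using gradient ascent on the log conditional likelihood. The function name, learning rate, and step count are arbitrary choices for this illustration, not part of the text above.

```python
import math

def fit_univariate_logistic(scores, labels, lr=0.1, steps=20000):
    """Fit p = 1 / (1 + exp(-(a + b*f))) by gradient ascent on the
    log conditional likelihood of the labels given the scores."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(steps):
        ga = gb = 0.0
        for f, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a + b * f)))
            ga += y - p          # d LCL / d a
            gb += (y - p) * f    # d LCL / d b
        a += lr * ga / n
        b += lr * gb / n
    return a, b
```

Scores that rank positives higher should yield b > 0, and at the optimum the predicted probabilities sum to the number of positive labels (the gradient condition for a).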
6.7 Pitfalls of link prediction
This section was originally an assignment. The goal of writing it as an assignment
was to help develop three important abilities. The ﬁrst ability is critical thinking,
i.e. the skill of identifying what is most important and then identifying what may be
incorrect. The second ability is understanding a new application domain quickly, in
this case a task in computational biology. The third ability is presenting opinions in
a persuasive way. Students were asked to explain arguments and conclusions crisply,
concisely, and clearly.
Assignment: The paper you should read is Predicting protein-protein
interactions from primary structure by Joel Bock and David Gough,
published in the journal Bioinformatics in 2001. Figure out and describe
three major ﬂaws in the paper. The ﬂaws concern
• how the dataset is constructed,
• how each example is represented, and
• how performance is measured and reported.
Each of the three mistakes is serious. The paper has 248 citations according to Google Scholar as of May 26, 2009, but unfortunately each flaw by itself makes the results of the paper not useful as a basis for future research. Each mistake is described sufficiently clearly in the paper: it is a sin of commission, not a sin of omission.
The second mistake, how each example is represented, is the most
subtle, but at least one of the papers citing this paper does explain it
clearly. It is connected with how SVMs are applied here. Remember the
slogan: “If you cannot represent it then you cannot learn it.”
Separately, provide a brief critique of the four benefits claimed for SVMs in the section of the paper entitled Support vector machine learning. Are these benefits true? Are they unique to SVMs? Does the work described in this paper take advantage of them?
Sample answers: Here is a brief summary of what I see as the most important
issues with the paper.
(1) How the dataset is constructed. The problem here is that the negative examples are not pairs of genuine proteins. Instead, they are pairs of randomly generated amino acid sequences. It is quite possible that these artificial sequences could not fold into actual proteins at all. The classifiers reported in this paper may learn mainly to distinguish between real proteins and non-proteins.
The authors acknowledge this concern, but do not overcome it. They could have used pairs of genuine proteins as negative examples. It is true that one cannot be sure that any given pair really is non-interacting. However, the great majority of pairs do not interact. Moreover, if a negative example really is an interaction, that will presumably slightly reduce the apparent accuracy of a trained classifier, but not change overall conclusions.
(2) How each example is represented. This is a subtle but clear-cut and important issue, assuming that the research uses a linear classifier.

Let x_1 and x_2 be two proteins and let f(x_1) and f(x_2) be their representations as vectors in R^d. The pair of proteins is represented as the concatenated vector ⟨f(x_1) f(x_2)⟩ ∈ R^{2d}. Suppose a trained linear SVM has parameter vector w. By definition w ∈ R^{2d} also. (If there is a bias coefficient, so w ∈ R^{2d+1}, the conclusion is the same.)
Now suppose the first protein x_1 is fixed and consider varying the second protein x_2. Proteins x_2 will be ranked according to the numerical value of the dot product w · ⟨f(x_1) f(x_2)⟩. This is equal to w_1 · f(x_1) + w_2 · f(x_2) where the vector w is written as ⟨w_1 w_2⟩. If x_1 is fixed, then the first term is constant and the second term w_2 · f(x_2) determines the ranking. The ranking of x_2 proteins will be the same regardless of what the x_1 protein is. This fundamental drawback of linear classifiers for predicting interactions is pointed out in [Vert and Jacob, 2008, Section 5].
With a concatenated representation of protein pairs, a linear classifier can at best learn the propensity of individual proteins to interact. Such a classifier cannot represent patterns that depend on features that are true only for specific protein pairs. This is the relevance of the slogan “If you cannot represent it then you cannot learn it.”
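The argument can be checked numerically with made-up vectors. The weight values and candidate feature vectors below are hypothetical; the point is only that changing f(x_1) shifts every pair's score by the same constant w_1 · f(x_1), so the ranking of candidate partner proteins never changes.

```python
def score(w1, w2, f1, f2):
    """Linear score of the concatenated pair representation <f1 f2>."""
    return sum(a * b for a, b in zip(w1, f1)) + sum(a * b for a, b in zip(w2, f2))

# Hypothetical weight halves and candidate second-protein feature vectors.
w1, w2 = [0.5, -1.0], [2.0, 1.0]
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

def ranking(f1):
    """Indices of candidates ordered by decreasing pair score, f1 fixed."""
    return sorted(range(len(candidates)),
                  key=lambda k: score(w1, w2, f1, candidates[k]),
                  reverse=True)

# The ranking is identical for any fixed first protein, because f1 only
# adds the same constant w1 . f1 to every candidate's score.
assert ranking([3.0, 3.0]) == ranking([-5.0, 7.0])
```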
Note: Previously I was under the impression that the authors stated that they used a linear kernel. On rereading, the paper does not mention at all which kernel was used. If the research uses a linear kernel, then the argument above is applicable.
(3) How performance is measured and reported. Most pairs of proteins are non-interacting. It is reasonable to use training sets where negative examples (non-interacting pairs) are undersampled. However, it is not reasonable or correct to report performance (accuracy, precision, recall, etc.) on test sets where negative examples are underrepresented, which is what is done in this paper.
As for the four claimed advantages of SVMs:
1. SVMs are nonlinear while requiring “relatively few” parameters.
The authors do not explain what kernel they use. In any case, with a nonlinear
kernel the number of parameters is the number of support vectors, which is
often close to the number of training examples. It is not clear relative to what
this can be considered “few.”
2. SVMs have an analytic upper bound on generalization error.
This upper bound does motivate the SVM approach to training classiﬁers, but
it typically does not provide useful guarantees for speciﬁc training sets. In
any case, a bound of this type has nothing to do with assigning conﬁdences to
individual predictions. In practice overﬁtting is prevented by straightforward
search for the value of the C parameter that is empirically best, not by applying
a theorem.
3. SVMs have fast training, which is essential for screening large datasets.
SVM training is slow compared to many other classifier learning methods, except for linear classifiers trained by fast algorithms that were only published after 2001, when this paper was published. As mentioned above, a linear classifier is not appropriate with the representation of protein pairs used in this paper.
In any case, what is needed for screening many test examples is fast classifier application, not fast training. Applying a linear classifier is fast, whether it is an SVM or not. Applying a nonlinear SVM typically has the same order-of-magnitude time complexity as applying a nearest-neighbor classifier, which is the slowest type of classifier in common use.
4. SVMs can be continuously updated in response to new data.
At least one algorithm is known for updating an SVM given a new training example, but it is not cited in this paper. I do not know any algorithm for training an optimal new SVM efficiently, that is, without retraining on old data. In any case, new real-world protein data arrives slowly enough that retraining from scratch is feasible.
Quiz for 2009
All questions below refer to a classiﬁer learning task with two classes, where the
base rate for the positive class y = 1 is 5%.
(a) Suppose that a probabilistic classifier predicts p(y = 1 | x) = c for some constant c, for all test examples x. Explain why c = 0.05 is the best value for c.
The value 0.05 is the only constant that is a well-calibrated probability. “Well-calibrated” means that c = 0.05 equals the average frequency of positive examples in sets of examples with predicted score c.
(b) What is the error rate of the classiﬁer from part (a)? What is its MSE?
With a prediction threshold of 0.5, all examples are predicted to be negative, so
the error rate is 5%. The MSE is

0.05 · (1 − 0.05)^2 + 0.95 · (0 − 0.05)^2 = 0.05 · 0.95 · 1

which equals 0.0475.
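The arithmetic can be verified directly; this also shows the shortcut MSE = c(1 − c) when the constant prediction c equals the base rate.

```python
base_rate = 0.05
c = 0.05  # constant predicted probability

# Expected squared error of the constant predictor:
# fraction positive * (1 - c)^2 + fraction negative * (0 - c)^2
mse = base_rate * (1 - c) ** 2 + (1 - base_rate) * (0 - c) ** 2
assert abs(mse - 0.0475) < 1e-9
# Equivalently c * (1 - c), since c equals the base rate here.
assert abs(mse - c * (1 - c)) < 1e-9
```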
(c) Suppose a well-calibrated probabilistic classifier satisfies a ≤ p(y = 1 | x) ≤ b for all x. What is the maximum possible lift at 10% of this classifier?
If the upper bound b is a well-calibrated probability, then the fraction of positive examples among the highest-scoring examples is at most b. Hence, the lift is at most b/0.05 where 0.05 is the base rate. Note that this is the highest possible lift at any percentage, not just at 10%.
Quiz for April 27, 2010
Your name:
The current assignment is based on the equation

p(buy = 1 | x) = p(buy = 1 | x, visit = 1) · p(visit = 1 | x).
Explain clearly but brieﬂy why this equation is true.
CSE 291 Quiz for April 26, 2011
Your name:
Email address:
The lecture notes for last week (Section 6.3) suggest training to maximize AUC by minimizing the objective function

λ c(f) + Σ_{⟨i,j⟩ ∈ Q} l(w^T (x_i − x_j), y_i).
Here, the set Q identifies all pairs of original training examples x_i and x_j that have different labels: Q = {⟨i, j⟩ : y_i ≠ y_j}.
Dr. Jekyll says that in fact, one only needs to use a smaller set

Q′ = {⟨i, j⟩ : y_i = +1 and y_j = −1}.
Is Dr. Jekyll correct? Explain your answer brieﬂy.
CSE 291 Assignment (2011)
This assignment is due at the start of class on Tuesday April 26, 2011. As before, you
should work in a team of two. You may change partners, or keep the same partner.
Like the previous assignments, this one uses the KDD98 training set. However, you should now use the entire dataset. The goal is to train a classifier with real-valued outputs that identifies test examples with TARGET_B = 1. The measure of success
to optimize is AUC (area under the ROC curve). As explained in class, AUC is
the probability that a random positive example scores higher than a random negative
example.
You should try (i) a linear classifier and (ii) any other supervised learning method of your choice. For (i), if you use Rapidminer, then the FastLargeMargin operator is recommended. As before, recode and transform features to improve their usefulness. Do feature selection to reduce the size of the training set as much as is reasonably possible.
Both learning methods should be applied to the same training set. The objective
is to see whether the second method can perform better than the linear method, using
the same data coded in the same way. The second method can be a nonlinear support
vector machine, a neural network, or a decision tree, for example. The second method
can also be a linear classiﬁer that is different from your ﬁrst linear method. For
example, you may choose to compare logistic regression and a linear SVM.
For both methods, apply cross-validation to find good algorithm settings.
CSE 291 Assignment (2010)
This assignment is due at the start of class on Tuesday April 27, 2010. As before, you
should work in a team of two. You may change partners, or keep the same partner.
Like the previous assignment, this one uses the e-commerce data published by Kevin Hillstrom. However, the goal now is to predict who makes a purchase on the website (again, for customers who are not sent any promotion). This is a highly unbalanced classification task.
First, use logistic regression. If you are using Rapidminer, then use the logistic regression option of the FastLargeMargin operator. As before, recode and transform features to improve their usefulness. Investigate the probability estimates produced by your two methods. What are the minimum, mean, and maximum predicted probabilities? Discuss whether these are reasonable.
Second, use your creativity to get the best possible performance in predicting
who makes a purchase. The measure of success to optimize is lift at 25%. That is, as
many positive test examples as possible should be among the 25% of test examples
with highest prediction score. You may apply any learning algorithms that you like.
Can any method achieve better accuracy than logistic regression?
For predicting who makes a purchase, compare learning a classifier directly with learning two classifiers, the first to predict who visits the website and the second to predict which visitors make purchases. Note that mathematically

p(buy | x) = p(buy | x, visit) · p(visit | x).
As before, be careful not to fool yourself about the success of your methods.
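As a sanity check on the chained equation, with hypothetical numbers (a 30% visit rate for customers like x, and a 10% purchase rate among those visitors):

```python
# Hypothetical rates: 30% of customers like x visit the site,
# and 10% of those visitors buy something.
p_visit = 0.30
p_buy_given_visit = 0.10

# Buying implies visiting, so p(buy, visit | x) = p(buy | x); hence
# p(buy | x) = p(buy | x, visit = 1) * p(visit = 1 | x).
p_buy = p_buy_given_visit * p_visit
```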
Chapter 7
Logistic regression
This chapter shows how to obtain a linear classifier whose predictions are probabilities.
We assume that the label y follows a probability distribution that is different for different examples x. Technically, for each x there is a different distribution f(y | x; w) of y, but all these distributions share the same parameters w. In the binary case, the possible values of y are y = 1 and y = 0.
Given training data consisting of ⟨x_i, y_i⟩ pairs, the principle of maximum conditional likelihood says to choose a parameter estimate ŵ that maximizes the product Π_i f(y_i | x_i; w). Then, for any specific value of x, ŵ can be used to predict values for y.
We assume that we never want to predict values of x. If data points x themselves are always known, then there is no need to have a probabilistic model of them. In order to justify the conditional likelihood being a product, we just need to assume that the y_i are independent conditional on the x_i. We do not need to assume that the x_i are independent.
When y is binary and x is a real-valued vector, the model

p(y = 1 | x; α, β) = 1 / (1 + exp −[α + Σ_{j=1}^{d} β_j x_j])

is called logistic regression. We use j to index over the feature values x_1 to x_d of a single example of dimensionality d, since we use i below to index over training examples 1 to n.
The logistic regression model is easier to understand in the form

log [p / (1 − p)] = α + Σ_{j=1}^{d} β_j x_j
where p is an abbreviation for p(y = 1 | x; α, β). The ratio p/(1 − p) is called the odds of the event y = 1 given x, and log[p/(1 − p)] is called the log odds. Since probabilities range between 0 and 1, odds range between 0 and +∞ and log odds range unboundedly between −∞ and +∞. A linear expression of the form α + Σ_j β_j x_j can also take unbounded values, so it is reasonable to use a linear expression as a model for log odds, but not as a model for odds or for probabilities. Essentially, logistic regression is the simplest possible model for a random yes/no outcome that depends linearly on predictors x_1 to x_d.
For each feature j, exp(β_j x_j) is a multiplicative scaling factor on the odds p/(1 − p). If the predictor x_j is binary, then exp(β_j) is the extra odds of having the outcome y = 1 when x_j = 1, compared to when x_j = 0. If the predictor x_j is real-valued, then exp(β_j) is the extra odds of having the outcome y = 1 when the value of x_j increases by one unit. A major limitation of the basic logistic regression model is that the probability p must either increase monotonically, or decrease monotonically, as a function of each predictor x_j. The basic model does not allow the probability to depend in a U-shaped way on any x_j.
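The odds interpretation can be illustrated numerically. The coefficient value and starting probability below are made up; the point is that flipping a binary feature with coefficient β multiplies the odds by exp(β).

```python
import math

beta = 0.7   # hypothetical coefficient of a binary predictor x_j
p0 = 0.20    # probability of y = 1 when x_j = 0, all else fixed

odds0 = p0 / (1 - p0)            # odds when x_j = 0
odds1 = odds0 * math.exp(beta)   # setting x_j = 1 multiplies odds by exp(beta)
p1 = odds1 / (1 + odds1)         # convert the new odds back to a probability
```

Note that the probability itself is not multiplied by exp(β); only the odds are, which is why p1 here rises from 0.20 to about 0.33 rather than to 0.40.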
Given the training set {⟨x_1, y_1⟩, . . . , ⟨x_n, y_n⟩}, training a logistic regression classifier involves maximizing the log conditional likelihood (LCL). This is the sum of the LCL for each training example:

LCL = Σ_{i=1}^{n} log L(w; y_i | x_i) = Σ_{i=1}^{n} log f(y_i | x_i; w).
Given a single training example ⟨x_i, y_i⟩, the log conditional likelihood is log p_i if the true label y_i = 1 and log(1 − p_i) if y_i = 0, where p_i = p(y = 1 | x_i; w). In a single formula, the log conditional likelihood is

y_i log p_i + (1 − y_i) log(1 − p_i).
Now write

p_i = 1 / (1 + exp −h_i)

where h_i = α + Σ_j β_j x_ij is the value of the linear function for example i. Then

1 − p_i = 1 − 1 / (1 + exp −h_i) = (exp −h_i) / (1 + exp −h_i) = 1 / ((exp h_i) + 1).

So, with labels recoded as y_i ∈ {−1, +1}, the LCL for one example is

−log(1 + exp −y_i h_i)
and maximizing the LCL is equivalent to minimizing

Σ_i log(1 + exp −y_i h_i).
The loss function inside the sum is called log loss. It has some similarity in shape to
hinge loss.
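The algebra above can be checked numerically: once labels are recoded from {0, 1} to {−1, +1}, the two ways of writing the per-example LCL agree exactly.

```python
import math

def lcl_01(y, h):
    """Per-example LCL with y in {0, 1}: y log p + (1 - y) log(1 - p)."""
    p = 1.0 / (1.0 + math.exp(-h))
    return y * math.log(p) + (1 - y) * math.log(1 - p)

def lcl_pm1(y, h):
    """Per-example LCL with y recoded to {-1, +1}: -log(1 + exp(-y h))."""
    return -math.log(1.0 + math.exp(-y * h))

# The two forms agree once labels are recoded: 1 -> +1 and 0 -> -1.
for h in (-2.0, -0.5, 0.0, 1.3):
    assert abs(lcl_01(1, h) - lcl_pm1(+1, h)) < 1e-12
    assert abs(lcl_01(0, h) - lcl_pm1(-1, h)) < 1e-12
```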
Consider learning a logistic regression classifier for categorizing documents. Suppose that word number j appears only in documents whose labels are positive. The partial derivative of the log conditional likelihood with respect to the parameter for this word is

∂LCL/∂β_j = Σ_i (y_i − p_i) x_ij.
This derivative will always be positive, as long as the predicted probability p_i is not perfectly one for all these documents. Therefore, following the gradient will send β_j to infinity. Then, every test document containing this word will be predicted to be positive with certainty, regardless of all other words in the test document. This overconfident generalization is an example of overfitting.
There is a standard method for solving this overfitting problem that is quite simple, but quite successful. The solution is called regularization. The idea is to impose a penalty on the magnitude of the parameter values. This penalty should be minimized, in a trade-off with maximizing likelihood. Mathematically, the optimization problem to be solved is

β̂ = argmax_β LCL − µ ||β||^2

where ||β|| is the L_2 norm of the parameter vector, and the constant µ quantifies the trade-off between maximizing likelihood and minimizing parameter values.
This type of regularization is called quadratic or Tikhonov regularization. A major reason why it is popular is that it can be derived from several different points of view. In particular, it arises as a consequence of assuming a Gaussian prior on parameters. It also arises from theorems on minimizing generalization error, i.e. error on independent test sets drawn from the same distribution as the training set. And, it arises from robust classification: assuming that each training point lies in an unknown location inside a sphere centered on its measured location.
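A minimal sketch of regularized training is below: plain gradient ascent on LCL − µ||β||², with the intercept α left unpenalized (a common choice, though the text above does not specify it). The function name, learning rate, and step count are arbitrary choices for this sketch, not an optimized solver.

```python
import math

def train_logistic(X, y, mu=0.1, lr=0.1, steps=2000):
    """Gradient ascent on LCL - mu * ||beta||^2 for binary labels y in {0, 1}.
    The intercept alpha is not penalized."""
    d = len(X[0])
    alpha, beta = 0.0, [0.0] * d
    for _ in range(steps):
        ga, gb = 0.0, [0.0] * d
        for xi, yi in zip(X, y):
            h = alpha + sum(bj * v for bj, v in zip(beta, xi))
            p = 1.0 / (1.0 + math.exp(-h))
            ga += yi - p
            for j in range(d):
                gb[j] += (yi - p) * xi[j]
        alpha += lr * ga
        for j in range(d):
            # The penalty contributes -2 * mu * beta_j to the gradient.
            beta[j] += lr * (gb[j] - 2.0 * mu * beta[j])
    return alpha, beta
```

On a toy dataset where a feature appears only in positive examples, the unregularized weight would grow without bound as described above, but with µ > 0 it stays finite, and it shrinks further as µ grows.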
Chapter 8
Making optimal decisions
This chapter discusses making optimal decisions based on predictions, and maximizing the value of customers.
8.1 Predictions, decisions, and costs
Decisions and predictions are conceptually very different. For example, a prediction concerning a patient may be “allergic” or “not allergic” to aspirin, while the corresponding decision is whether or not to administer the drug. Predictions can often be probabilistic, while decisions typically cannot.
Suppose that examples are credit card transactions and the label y = 1 designates a legitimate transaction. Then making the decision y = 1 for an attempted transaction means acting as if the transaction is legitimate, i.e. approving the transaction. The essence of cost-sensitive decision-making is that it can be optimal to act as if one class is true even when some other class is more probable. For example, if the cost of approving a fraudulent transaction is proportional to the dollar amount involved, then it can be rational not to approve a large transaction, even if the transaction is most likely legitimate. Conversely, it can be rational to approve a small transaction even if there is a high probability it is fraudulent.
Mathematically, let i be the predicted class and let j be the true class. If i = j then the prediction is correct, while if i ≠ j the prediction is incorrect. The (i, j) entry in a cost matrix c is the cost of acting as if class i is true, when in fact class j is true. Here, predicting i means acting as if i is true, so one could equally well call this deciding i.
A cost matrix c has the following structure when there are only two classes:

                    actual negative     actual positive
predict negative    c(0, 0) = c_00      c(0, 1) = c_01
predict positive    c(1, 0) = c_10      c(1, 1) = c_11
The cost of a false positive is c_10 while the cost of a false negative is c_01. We follow the convention that cost matrix rows correspond to alternative predicted classes, while columns correspond to actual classes. In short the convention is row/column = i/j = predicted/actual. (This convention is the opposite of the one in Section 5, so perhaps we should switch one of these to make the conventions similar.)
The optimal prediction for an example x is the class i that minimizes the expected cost

e(x, i) = Σ_j p(j | x) c(i, j).     (8.1)

For each i, e(x, i) is an expectation computed by summing over the alternative possibilities for the true class of x. In this framework, the role of a learning algorithm is to produce a classifier that for any example x can estimate the probability p(j | x) of each class j being the true class of x.
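Equation (8.1) translates directly into code. The cost numbers below are hypothetical; they are chosen so that the optimal decision can differ from the most probable class, as discussed above.

```python
def optimal_class(probs, cost):
    """Return the class i minimizing the expected cost
    e(x, i) = sum_j p(j | x) * c(i, j), as in Equation (8.1)."""
    expected = [sum(p * cij for p, cij in zip(probs, row)) for row in cost]
    return min(range(len(cost)), key=lambda i: expected[i])

# Hypothetical costs: predicting 1 wrongly costs 100,
# predicting 0 wrongly costs only 10.
cost = [[0.0, 10.0],    # row: predict 0; columns: true class 0, 1
        [100.0, 0.0]]   # row: predict 1

# Even when class 1 has probability 0.8, predicting 0 is optimal here,
# because a wrong prediction of 1 is ten times more expensive.
assert optimal_class([0.05, 0.95], cost) == 1
assert optimal_class([0.20, 0.80], cost) == 0
```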
8.2 Cost matrix properties
Conceptually, the cost of labeling an example incorrectly should always be greater than the cost of labeling it correctly. For a 2×2 cost matrix, this means that it should always be the case that c_10 > c_00 and c_01 > c_11. We call these conditions the “reasonableness” conditions.
Suppose that the first reasonableness condition is violated, so c_00 ≥ c_10 but still c_01 > c_11. In this case the optimal policy is to label all examples positive. Similarly, if c_10 > c_00 but c_11 ≥ c_01 then it is optimal to label all examples negative. (The reader can analyze the case where both reasonableness conditions are violated.)
For some cost matrices, some class labels are never predicted by the optimal
policy given by Equation (8.1). The following is a criterion for when this happens.
Say that row m dominates row n in a cost matrix C if for all j, c(m, j) ≥ c(n, j).
In this case the cost of predicting n is no greater than the cost of predicting m,
regardless of what the true class j is. So it is optimal never to predict m. As a special
case, the optimal prediction is always n if row n is dominated by all other rows in
a cost matrix. The two reasonableness conditions for a twoclass cost matrix imply
that neither row in the matrix dominates the other.
Given a cost matrix, the decisions that are optimal are unchanged if each entry in the matrix is multiplied by a positive constant. This scaling corresponds to changing the unit of account for costs. Similarly, the decisions that are optimal are unchanged if a constant is added to each entry in the matrix. This shifting corresponds to changing the baseline away from which costs are measured. By scaling and shifting entries, any two-class cost matrix

    c_00    c_01
    c_10    c_11

that satisfies the reasonableness conditions can be transformed into a simpler matrix that always leads to the same decisions:

    0       c′_01
    1       c′_11

where c′_01 = (c_01 − c_00)/(c_10 − c_00) and c′_11 = (c_11 − c_00)/(c_10 − c_00). From a matrix perspective, a 2×2 cost matrix effectively has two degrees of freedom.
8.3 The logic of costs
Costs are not necessarily monetary. A cost can also be a waste of time, or the severity of an illness, for example. Although most recent research in machine learning has used the terminology of costs, doing accounting in terms of benefits is generally preferable, because mistakes are easier to avoid: there is a natural baseline from which to measure all benefits, whether positive or negative. This baseline is the state of the agent before it takes a decision regarding an example. After the agent has made the decision, if it is better off, its benefit is positive. Otherwise, its benefit is negative.
When thinking in terms of costs, it is easy to posit a cost matrix that is logically contradictory because not all entries in the matrix are measured from the same baseline. For example, consider the so-called German credit dataset that was published as part of the Statlog project [Michie et al., 1994]. The cost matrix given with this dataset at http://www.sgi.com/tech/mlc/db/german.names is as follows:
                actual bad    actual good
predict bad         0              1
predict good        5              0
Here examples are people who apply for a loan from a bank. “Actual good” means that a customer would repay a loan while “actual bad” means that the customer would default. The action associated with “predict bad” is to deny the loan. Hence, the cash flow relative to any baseline associated with this prediction is the same regardless of whether “actual good” or “actual bad” is true. In every economically reasonable cost matrix for this domain, both entries in the “predict bad” row must be the same. If these entries are different, it is because different baselines have been chosen for each entry.
Costs or beneﬁts can be measured against any baseline, but the baseline must be
ﬁxed. An opportunity cost is a foregone beneﬁt, i.e. a missed opportunity rather than
an actual penalty. It is easy to make the mistake of measuring different opportunity
costs against different baselines. For example, the erroneous cost matrix above can
be justiﬁed informally as follows: “The cost of approving a good customer is zero,
and the cost of rejecting a bad customer is zero, because in both cases the correct
decision has been made. If a good customer is rejected, the cost is an opportunity
cost, the foregone proﬁt of 1. If a bad customer is approved for a loan, the cost is the
lost loan principal of 5.”
To see concretely that the reasoning in quotes above is incorrect, suppose that the bank has one customer of each of the four types. Clearly the cost matrix above is intended to imply that the net change in the assets of the bank is then −4. Alternatively, suppose that we have four customers who receive loans and repay them. The net change in assets is then +4. Regardless of the baseline, any method of accounting should give a difference of 8 between these scenarios. But with the erroneous cost matrix above, the first scenario gives a total cost of 6, while the second scenario gives a total cost of 0.
In general the amount in some cells of a cost or benefit matrix may not be constant, and may be different for different examples. For example, consider the credit card transactions domain. Here the benefit matrix might be

              fraudulent    legitimate
refuse           $20          −$20
approve          −x           0.02x

where x is the size of the transaction in dollars. Approving a fraudulent transaction costs the amount of the transaction because the bank is liable for the expenses of fraud. Refusing a legitimate transaction has a nontrivial cost because it annoys a customer. Refusing a fraudulent transaction has a nontrivial benefit because it may prevent further fraud and lead to the arrest of a criminal.
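With this benefit matrix, the optimal decision depends on the transaction size x, matching the discussion in Section 8.1. The fraud probability used below is hypothetical.

```python
def benefit_approve(p_fraud, x):
    """Expected benefit of approving a transaction of x dollars:
    -x if fraudulent, 0.02 * x if legitimate."""
    return p_fraud * (-x) + (1 - p_fraud) * 0.02 * x

def benefit_refuse(p_fraud, x):
    """Expected benefit of refusing: $20 if fraudulent, -$20 if legitimate."""
    return p_fraud * 20.0 + (1 - p_fraud) * (-20.0)

p = 0.10  # hypothetical fraud probability for this transaction
# Even at 10% fraud probability, a small transaction is worth approving,
# while a large one is not.
small_ok = benefit_approve(p, 100.0) > benefit_refuse(p, 100.0)
large_ok = benefit_approve(p, 1000.0) > benefit_refuse(p, 1000.0)
```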
8.4 Making optimal decisions
Given known costs for correct and incorrect predictions, a test example should be predicted to have the class that leads to the lowest expected cost. This expectation is the predicted average cost, and is computed using the conditional probability of each class given the example.
In the two-class case, the optimal prediction is class 1 if and only if the expected cost of this prediction is less than or equal to the expected cost of predicting class 0, i.e. if and only if

p(y = 0 | x) c_10 + p(y = 1 | x) c_11 ≤ p(y = 0 | x) c_00 + p(y = 1 | x) c_01

which is equivalent to

(1 − p) c_10 + p c_11 ≤ (1 − p) c_00 + p c_01

given p = p(y = 1 | x). If this inequality is in fact an equality, then predicting either class is optimal.
class is optimal.
The threshold for making optimal decisions is p* such that

(1 − p*) c_10 + p* c_11 = (1 − p*) c_00 + p* c_01.
Assuming the reasonableness conditions c_10 > c_00 and c_01 ≥ c_11, the optimal prediction is class 1 if and only if p ≥ p*. Rearranging the equation for p* leads to
c_00 − c_10 = −p* c_10 + p* c_11 + p* c_00 − p* c_01

which has the solution

p* = (c_10 − c_00) / (c_10 − c_00 + c_01 − c_11)
assuming the denominator is nonzero, which is implied by the reasonableness conditions. This formula for p* shows that any 2×2 cost matrix has essentially only one degree of freedom from a decision-making perspective, although it has two degrees of freedom from a matrix perspective. The cause of the apparent contradiction is that the optimal decision-making policy is a nonlinear function of the cost matrix.
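The formula for p* can be exercised with small examples; the cost values below are made up.

```python
def p_star(c00, c01, c10, c11):
    """Threshold probability: predicting class 1 is optimal when
    p(y = 1 | x) >= p*, assuming the reasonableness conditions hold."""
    return (c10 - c00) / (c10 - c00 + c01 - c11)

# Symmetric 0/1 loss gives the familiar threshold 0.5.
assert p_star(0.0, 1.0, 1.0, 0.0) == 0.5
# If a false negative costs 9 times more than a false positive,
# class 1 should be predicted already at p = 0.1.
assert abs(p_star(0.0, 9.0, 1.0, 0.0) - 0.1) < 1e-12
```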
Note that in some domains costs have more impact than probabilities on decision-making. An extreme case is that with some cost matrices the optimal decision is always the same, regardless of what the various outcome probabilities are for a given example x.¹ Decisions and predictions may not be in 1:1 correspondence.

¹The situation where the optimal decision is fixed is well-known in philosophy as Pascal’s wager. Essentially, this is the case where the minimax decision and the minimum expected cost decision are always the same. An argument analogous to Pascal’s wager provides a justification for Occam’s razor in machine learning. Essentially, a PAC theorem says that if we have a fixed, limited number of training examples, and the true concept is simple, then if we pick a simple concept then we will be right. However if we pick a complex concept, there is no guarantee we will be right.
8.5 Limitations of cost-based analysis
The analysis of decision-making above assumes that the objective is to minimize expected cost. This objective is reasonable when a game is played repeatedly, and the actual class of examples is set stochastically.
The analysis may not be appropriate in several situations:
• when the agent is risk-averse and can make only a few decisions;
• when the actual class of examples is set by a non-random adversary;
• when costs need to be estimated via learning;
• when repeated decisions must be made over time about the same example;
• when decisions about one example influence other examples;
• when there are more than two classes, since then we do not get full label information on training examples.
8.6 Rules of thumb for evaluating data mining campaigns
This section is preliminary and incomplete. It is not part of the material for 291 in 2011.

Let q be a fraction of the test set. There is a remarkable rule of thumb that the lift attainable at q is around √(1/q). For example, if q = 0.25 then the attainable lift is around √4 = 2.
The rule of thumb is not valid for very small values of q, for two reasons. The first reason is mathematical: an upper bound on the lift is 1/t, where t is the overall fraction of positives; t is also called the target rate, response rate, or base rate. The second reason is empirical: lifts observed in practice tend to be well below 10. Hence, the rule of thumb can reasonably be applied when q > 0.02 and q > t^2.
The fraction of all positive examples that belong to the top-ranked fraction q of the test set is q · √(1/q) = √q. The lift in the bottom 1 − q is (1 − √q)/(1 − q). We have

lim_{q→1} (1 − √q)/(1 − q) = 0.5.

This says that the examples ranked lowest are positive with probability equal to t/2. In other words, the lift for the lowest ranked examples cannot be driven below 0.5.
[Footnote continued from the previous page:]

                   nature simple    nature complex
predict simple        succeed           fail
predict complex       fail              fail

Here “simple” means that the hypothesis is drawn from a space with low cardinality, while “complex” means it is drawn from a space with high cardinality.
Let c be the cost of a contact and let b be the benefit of a positive. This means that the benefit matrix is

                   actual negative    actual positive
decide negative          0                  0
decide positive         −c                b − c

Let n be the size of the test set. The profit at q is the total benefit minus the total cost, which is

n q b t √(1/q) − n q c = n c (b t √q / c − q) = n c (k √q − q)
where k = tb/c. Profit is maximized when

0 = d/dq (k q^{0.5} − q) = k (0.5) q^{−0.5} − 1.

The solution is q = (k/2)^2.
This solution is in the range zero to one only when k ≤ 2. When tb/c > 2 then maximum profit is achieved by soliciting every prospect, so data mining is pointless. The maximum profit that can be achieved is

n c (k (k/2) − (k/2)^2) = n c k^2 / 4 = n c q.

Remarkably, this is always the same as the cost of running the campaign, which is nqc.
As k decreases, the attainable proﬁt decreases fast, i.e. quadratically. If k < 0.4
then q < 0.04 and the attainable proﬁt is less than 0.04nc. Such a campaign may
have high risk, because the lift attainable at small q is typically worse than suggested
by the rule of thumb. Note however that a small adverse change in c, b, or t is not
likely to make the campaign lose money, because the expected revenue is always
twice the campaign cost.
It is interesting to consider whether data mining is beneﬁcial or not from the
point of view of society at large. Remember that c is the cost of one contact from the
perspective of the initiator of the campaign. Each contact also has a cost or beneﬁt
for the recipient. Generally, if the recipient responds, one can assume that the contact
was beneﬁcial for him or her, because s/he only responds if the response is beneﬁcial.
However, if the recipient does not respond, which is the majority case, then one can
assume that the contact caused a net cost to him or her, for example a loss of time.
From the point of view of the initiator of a campaign a cost or beneﬁt for a
respondent is an externality. It is not rational for the initiator to take into account
these beneﬁts or costs, unless they cause respondents to take actions that affect the
initiator, such as closing accounts. However, it is rational for society to take into
account these externalities.
The conclusion above is that the revenue from a campaign, for its initiator, is
roughly twice its cost. Suppose that the beneﬁt of responding for a respondent is λb
where b is the beneﬁt to the initiator. Suppose also that the cost of a solicitation to
a person is µc where c is the cost to the initiator. The net beneﬁt to respondents is
positive as long as µ < 2λ.
The reasoning above clariﬁes why spam email campaigns are harmful to society.
For these campaigns, the cost c to the initiator is tiny. However, the cost to a recipient
is not tiny, so µ is large. Whatever λ is, it is likely that µ > 2λ.
In summary, data mining is only beneﬁcial in a narrow sweet spot, where
tb/2 ≤ c ≤ αtb
where α is some constant greater than 1. The product tb is the average beneﬁt of
soliciting a random customer. If the cost c of solicitation is less than half of this, then
it is rational to contact all potential respondents. If c is much greater than the average
beneﬁt, then the campaign is likely to have high risk for the initiator.
As an example of the reasoning above, consider the scenario of the 1998 KDD
contest. Here t = 0.05 approximately, c = $0.68, and the average benefit is b = $15
approximately. We have k = tb/c = 0.75/0.68 ≈ 1.10. The rule of thumb predicts that
the optimal fraction of people to solicit is q = (k/2)² ≈ 0.30, while the achievable
profit per person is cq ≈ $0.21. In fact, the methods that perform best in this domain
achieve profit of about $0.16, while soliciting about 70% of all people.
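The numbers above follow directly from the formulas of this section. This sketch just checks the arithmetic; it is not code from the contest.

```python
t = 0.05     # approximate base rate of donors
c = 0.68     # cost per solicitation, dollars
b = 15.0     # approximate average benefit of a donation, dollars

k = t * b / c                  # about 1.10
q_opt = (k / 2.0) ** 2         # optimal fraction to solicit, about 0.30
profit_per_person = c * q_opt  # about $0.21
```
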
8.7 Evaluating success
In order to evaluate whether a realworld campaign is proceeding as expected, we
need to estimate a confidence interval for the profit on the test set. This should be a
range [a, b] such that the probability is (say) 95% that the observed profit lies in this
range.
Intuitively, there are two sources of uncertainty for each test person: whether s/he
will donate, and if so, how much. We need to work out the logic of how to quantify
these uncertainties, and then the logic of combining the uncertainties into an overall
conﬁdence interval. The central issue here is not any mathematical details such as
Student distributions versus Gaussians. It is the logic of tracking where uncertainty
arises, and how it is propagated.
Quiz (2009)
Suppose your lab is trying to crystallize a protein. You can try experimental conditions
x that differ on temperature, salinity, etc. The label y = 1 means crystallization
is successful, while y = 0 means failure. Assume (not realistically) that the results
of different experiments are independent.
You have a classifier that predicts p(y = 1|x), the probability of success of experiment
x. The cost of doing one experiment is $60. The value of successful crystallization
is $9000.
(a) Write down the beneﬁt matrix involved in deciding rationally whether or not
to perform a particular experiment.
The benefit matrix is

                 success       failure
do experiment    9000 − 60     −60
don’t            0             0
(b) What is the threshold probability of success needed in order to perform an
experiment?
It is rational to do an experiment under conditions x if and only if the expected
benefit is positive, that is if and only if

(9000 − 60) p + (−60) (1 − p) > 0,

where p = p(success|x). The threshold value of p is 60/9000 = 1/150.
(c) Is it rational for your lab to take into account a budget constraint such as “we
only have a technician available to do one experiment per day”?
No. A budget constraint is an alternative rule for making decisions that is less
rational than maximizing expected benefit. The optimal behavior is to do all experiments
that have success probability over 1/150.
Additional explanation: The $60 cost of doing an experiment should include
the expense of technician time. If many experiments are worth doing, then more
technicians should be hired.
If the budget constraint is unavoidable, then experiments should be done starting
with those that have highest probability of success.
If just one successful crystallization is enough, then experiments should also be
done in order of declining success probability, until the ﬁrst actual success.
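The reasoning in parts (a) and (b) can be summarized in a few lines of code, using the numbers of the quiz.

```python
cost = 60.0     # cost of one experiment, dollars
value = 9000.0  # value of a successful crystallization, dollars

def expected_benefit(p):
    """Expected benefit of doing one experiment with success probability p."""
    return (value - cost) * p + (-cost) * (1.0 - p)

# Do the experiment iff p exceeds this threshold, which is 1/150.
threshold = cost / value

# At the threshold the expected benefit is exactly zero.
at_threshold = expected_benefit(threshold)
```
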
Quiz for May 4, 2010
Your name:
Suppose that you work for a bank that wants to prevent criminal laundering of money.
The label y = 1 means a money transfer is criminal, while y = 0 means the transfer
is legal. You have a classifier that estimates the probability p(y = 1|x) where x is a
vector of feature values describing a money transfer.
Let z be the dollar amount of the transfer. The matrix of costs (negative) and
benefits (positive) involved in deciding whether or not to deny a transfer is as follows:

        criminal   legal
deny    0          −0.10z
allow   −z         0.01z

Work out the rational policy based on p(y = 1|x) for deciding whether or not to
allow a transfer.
CSE 291 Quiz for May 3, 2011
Your name:
Email address:
Suppose that you are looking for a rare plant. You can search geographical locations
x that differ on rainfall, urbanization, etc. The label y = 1 means the plant is found,
while y = 0 means failure. Assume (not realistically) that the results of different
searches are independent.
You have a classifier that predicts p(y = 1|x), the probability of success when
searching at location x. The cost of doing one search is $200. The value of finding
the plant is $7000. For every two unsuccessful searches, on average, the information
gained is expected to save you one future search.
(a) Write down the beneﬁt matrix involved in deciding rationally whether or not
to perform a particular search.
(b) What is the threshold probability of success needed in order to perform a
search?
(c) Suppose that you are constrained to do only one search per day. Where should
you search each day?
CSE 291 Assignment 4 (2011)
This assignment is due at the start of class on Tuesday May 10, 2011. As before, you
should work in a team of two, and you are free to change partners or not. You have
two weeks for this assignment, so do a careful job and write a comprehensive report.
This assignment is the last one to use the KDD98 data. You should now train
on the entire training set, and measure final success on the test set that you have not
previously used. The test set is in the file cup98val.zip at
http://archive.ics.uci.edu/ml/databases/kddcup98/kddcup98.html. The test set
labels are in valtargt.txt. These labels are sorted by CONTROLN, unlike the
test instances.
The goal is to solicit an optimal subset of the test examples. The measure of
success to maximize is total donations received minus $0.68 for every solicitation.
You should train a regression function to predict donation amounts, and a classifier
to predict donation probabilities. For a test example x, let the predicted donation
be a(x) and let the predicted donation probability be p(x). It is rational to send a
donation request to person x if and only if

p(x) · a(x) ≥ 0.68.
Use only the training set for all development work. Use the actual test set only once,
to measure the ﬁnal success of your method.
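The decision rule above can be sketched as follows. The arrays here are illustrative stand-ins; the regression function and classifier still have to be trained on the KDD98 training data.

```python
# Hypothetical predictions for five test people: predicted donation
# amount a(x) and predicted donation probability p(x).
a = [20.0, 5.0, 1.0, 15.0, 0.5]
p = [0.10, 0.20, 0.50, 0.01, 0.90]

COST = 0.68  # cost of one solicitation, dollars

# Solicit person x iff the expected donation p(x) * a(x) covers the cost.
solicit = [pi * ai >= COST for pi, ai in zip(p, a)]
```
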
CSE 291 Assignment (2010)
This week’s assignment is to participate in the PAKDD 2010 data mining contest.
Details of the contest are at http://sede.neurotech.com.br/PAKDD2010/.
Each team of two students should register and download the ﬁles for the contest.
Your ﬁrst goal should be to understand the data and submission formats, and to
submit a correct set of predictions. Make sure that when you have good predictions
later, you will not run into any technical difﬁculties.
Your second goal should be to understand the contest scenario and the differences
between the training and two test datasets. Do some exploratory analysis of the three
datasets. In your written report, explain your understanding of the scenario, and your
general ﬁndings about the datasets.
Next, based on your general understanding, design a sensible approach for achieving
the contest objective. Implement this approach and submit predictions to the contest.
Of course, you may refine your approach iteratively and you may make multiple
submissions. Meet the May 3 deadline for uploading your best predictions and a
copy of your report. The contest rules ask each team to submit a paper of four pages.
The manuscript should be in the scientific paper format of the conference
detailing the stages of the KDD process, focusing on the aspects
which make the solution innovative. The manuscript should be the basis
for writing up a full paper for an eventual post-conference proceeding
(currently under negotiation). The authors of top solutions and selected
manuscripts with innovative approaches will have the opportunity
to submit to that forum. The quality of the manuscripts will not be used
for the competition assessment, unless in the case of ties.
You can ﬁnd template ﬁles for LaTeX and Word at http://www.springer.
com/computer/lncs?SGWID=016467933410. Do not worry about
formatting details.
Chapter 9
Learning classifiers despite missing labels
This chapter discusses how to learn two-class classifiers from non-standard training
data. Specifically, we consider three different but related scenarios where labels are
missing for some but not all training examples.
9.1 The standard scenario for learning a classiﬁer
In the standard situation, the training and test sets are two different random samples
from the same population. Being from the same population means that both sets
follow a common probability distribution. Let x be a training or test instance, and let
y be its label. The common distribution is p(x, y).
The notation p(x, y) is an abbreviation for p(X = x and Y = y) where X and
Y are random variables, and x and y are possible values of these variables. If X and
Y are both discretevalued random variables, this distribution is a probability mass
function (pmf). Otherwise, it is a probability density function (pdf).
It is important to understand that we can always write

p(x, y) = p(x) p(y|x)

and also

p(x, y) = p(y) p(x|y)

without loss of generality and without making any assumptions. The equations above
are sometimes called the chain rule of probabilities. They are true both when the
probability values are probability masses, and when they are probability densities.
9.2 Sample selection bias in general
Suppose that the label y is observed for some training instances x, but not for others.
This means that the training instances x for which y is observed are not a random
sample from the same population as the test instances. Let s be a new random variable
with value s = 1 if and only if x is selected. This means that y is observed if
and only if s = 1. An important question is whether the unlabeled training examples
can be exploited in some way. However, here we focus on a different question: how
can we adjust for the fact that the labeled training examples are not representative of
the test examples?
Formally, x, y, and s are random variables. There is some fixed unknown joint
probability distribution p(x, y, s) over triples ⟨x, y, s⟩. We can identify three possibilities
for s. The easiest case is when p(s = 1|x, y) is a constant. In this case,
the labeled training examples are a fully random subset. We have fewer training
examples, but otherwise we are in the usual situation. This case is called “missing
completely at random” (MCAR).
A more difficult situation arises if s is correlated with x and/or with y, that is
if p(s = 1) ≠ p(s = 1|x, y). Here, there are two subcases. First, suppose that
s depends on x but, given x, not on y. In other words, when x is fixed then s is
independent of y. This means that p(s = 1|x, y) = p(s = 1|x) for all x. In this case,
Bayes’ rule gives that

p(y|x, s = 1) = p(s = 1|x, y) p(y|x) / p(s = 1|x)
             = p(s = 1|x) p(y|x) / p(s = 1|x)
             = p(y|x)

assuming that p(s = 1|x) > 0 for all x. Therefore we can learn a correct model of
p(y|x) from just the labeled training data, without using the unlabeled data in any
way. This case is rather misleadingly called “missing at random” (MAR). It is not
the case that labels are missing in a totally random way, because missingness does
depend on x. It is also not the case that s and y are independent. However, it is true
that s and y are conditionally independent, conditional on x. Concretely, for each
value of x the equation p(y|x, s = 0) = p(y|x) = p(y|x, s = 1) holds.
The assumption that p(s = 1|x) > 0 is important. The real-world meaning of
this assumption is that label information must be available with nonzero probability
for every possible instance x. Otherwise, it might be the case for some x that no
labeled training examples are available from which to estimate p(y|x, s = 1).
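The MAR claim can be illustrated with a small simulation. Assuming a discrete x and a selection probability that depends only on x (the particular probabilities below are arbitrary choices for illustration), the positive rate among labeled examples with a given x matches the positive rate among all examples with that x.

```python
import random

random.seed(0)

# p(y = 1 | x) for two values of x, and p(s = 1 | x) depending only on x.
p_y_given_x = {0: 0.8, 1: 0.3}
p_s_given_x = {0: 0.9, 1: 0.2}

counts = {x: [0, 0] for x in (0, 1)}          # all examples: [n, n_positive]
counts_labeled = {x: [0, 0] for x in (0, 1)}  # labeled examples only

for _ in range(200000):
    x = random.randint(0, 1)
    y = 1 if random.random() < p_y_given_x[x] else 0
    s = 1 if random.random() < p_s_given_x[x] else 0
    counts[x][0] += 1
    counts[x][1] += y
    if s == 1:
        counts_labeled[x][0] += 1
        counts_labeled[x][1] += y

# For each x, the estimate of p(y=1|x, s=1) should be close to p(y=1|x).
rates = {x: counts[x][1] / counts[x][0] for x in (0, 1)}
rates_labeled = {x: counts_labeled[x][1] / counts_labeled[x][0] for x in (0, 1)}
```
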
Suppose that even when x is known, there is still some correlation between s
and y. In this case the label y is said to be “missing not at random” (MNAR) and
inference is much more difﬁcult. We do not discuss this case further here.
9.3 Covariate shift
Imagine that the available training data come from one hospital, but we want to learn
a classifier to use at a different hospital. Let x be a patient and let y be a label such
as “has swine flu.” The distribution p(x) of patients is different at the two hospitals,
but we are willing to assume that the relationship to be learned, which is p(y|x), is
unchanged.
In a scenario like the one above, no unlabeled training examples are available.
Labeled training examples from all classes are available, but the distribution of instances
is different for the training and test sets. This scenario is called “covariate
shift,” where covariate means feature, and shift means change of distribution.
Following the argument in the previous section, the assumption that p(y|x) is
unchanged means that p(y|x, s = 1) = p(y|x, s = 0), where s = 1 refers to the first
hospital and s = 0 refers to the second hospital. The simple approach to covariate
shift is thus to train the classifier on the data ⟨x, y, s = 1⟩ in the usual way, and to apply
it to patients from the second hospital directly.
9.4 Reject inference
Suppose that we want to build a model that predicts who will repay a loan. We need
to be able to apply the model to the entire population of future applicants (sometimes
called the “through the door” population). The usable training examples are people
who were actually granted a loan in the past. Unfortunately, these people are not
representative of the population of all future applicants. They were previously selected
precisely because they were thought to be more likely to repay. The problem of
“reject inference” is to learn somehow from previous applicants who were not approved,
for whom we hence do not have training labels.
Another example is the task of learning to predict the benefit of a medical treatment.
If doctors have historically given the treatment only to patients who were
particularly likely to benefit, then the observed success rate of the treatment is likely
to be higher than its success rate on “through the door” patients. Conversely, if doctors
have historically used the treatment only on patients who were especially ill, then
the “through the door” success rate may be higher than the observed success rate.
The “reject inference” scenario is similar to the covariate shift scenario, but with
two differences. The ﬁrst difference is that unlabeled training instances are available.
The second difference is that the label being missing is expected to be correlated
with the value of the label. The important similarity is the common assumption that
missingness depends only on x, and not additionally on y. Whether y is missing may
be correlated with the value of y, but this correlation disappears after conditioning
on x.
If missingness does depend on y, even after conditioning on x, then we are in the
MNAR situation and in general we can draw no firm conclusions. An example of
this situation is “survivor bias.” Suppose our analysis is based on historical records,
and those records are more likely to exist for survivors, everything else being equal.
Then p(s = 1|y = 1, x) > p(s = 1|y = 0, x) where “everything else being equal”
means that x is the same on both sides.
A useful fact is the following. Suppose we want to estimate E[f(z)] where z
follows the distribution p(z), but we can only draw samples of z from the distribution
q(z). The fact is

E[f(z) | z ∼ p(z)] = E[f(z) · p(z)/q(z) | z ∼ q(z)],

assuming q(z) > 0 for all z. More generally the requirement is that q(z) > 0
whenever p(z) > 0, assuming the definition 0/0 = 0. The equation above is called
the importance-sampling identity.
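A minimal numerical check of the identity, using two discrete distributions over the same support (the particular distributions and function f here are arbitrary choices for illustration):

```python
# Two distributions over z in {0, 1, 2}, with q(z) > 0 wherever p(z) > 0.
p = {0: 0.2, 1: 0.5, 2: 0.3}
q = {0: 0.4, 1: 0.4, 2: 0.2}

def f(z):
    return z * z

# Direct expectation of f(z) under p.
direct = sum(p[z] * f(z) for z in p)

# Same expectation, written as an expectation under q with weights p/q.
reweighted = sum(q[z] * f(z) * (p[z] / q[z]) for z in q)
```
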
Let the goal be to compute E[f(x, y) | x, y ∼ p(x, y)] for any function f. To make
notation more concise, write this as E[f], and write E[f(x, y) | x, y ∼ p(x, y | s = 1)]
as E[f | s = 1]. We have

E[f] = E[ f · p(x) p(y|x) / ( p(x|s = 1) p(y|x, s = 1) ) | s = 1 ]
     = E[ f · p(x) / p(x|s = 1) | s = 1 ]

since p(y|x) = p(y|x, s = 1). Applying Bayes’ rule to p(x|s = 1) gives

E[f] = E[ f · p(x) / ( p(s = 1|x) p(x) / p(s = 1) ) | s = 1 ]
     = E[ f · p(s = 1) / p(s = 1|x) | s = 1 ].
The constant p(s = 1) can be estimated as r/n where r is the number of labeled
training examples and n is the total number of training examples. Let p̂(s = 1|x) be
a trained model of the conditional probability p(s = 1|x). The estimate of E[f] is
then

(1/n) Σ_{i=1}^{r} f(x_i, y_i) / p̂(s = 1|x_i),

that is, the empirical average of f · p̂(s = 1)/p̂(s = 1|x) over the r labeled examples,
with p̂(s = 1) = r/n.
This estimate is called a “plug-in” estimate because it is based on plugging the observed
values of ⟨x, y⟩ into a formula that would be correct if based on integrating
over all values of ⟨x, y⟩.
The weighting approach just explained is correct in the statistical sense that it
is unbiased, if the propensity estimates p̂(s = 1|x) are correct. However, this approach
typically has high variance, since a few labeled examples with high values
of 1/p̂(s = 1|x) dominate the sum. Therefore an important question is whether
alternative approaches exist that have lower variance. One simple heuristic is to
place a ceiling on the values 1/p̂(s = 1|x). For example the ceiling 1000 is used
by [Huang et al., 2006]. However, no good method for selecting the ceiling value is
known.
In medical research, the ratio p(s = 1)/p(s = 1|x) is called the “inverse probability
of treatment” (IPT) weight.
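A sketch of the plug-in estimate, assuming that propensity estimates p̂(s = 1|x) for the labeled examples are already available; the optional ceiling on 1/p̂ is the heuristic mentioned above.

```python
def weighted_estimate(labeled, propensity, n, ceiling=None):
    """Plug-in estimate of E[f] from the labeled examples only.

    labeled    : list of f(x_i, y_i) values for the r labeled examples
    propensity : list of estimated p(s=1|x_i) for the same examples
    n          : total number of training examples, labeled and unlabeled
    ceiling    : optional cap on the inverse-propensity weights 1/p
    """
    total = 0.0
    for f_val, p_hat in zip(labeled, propensity):
        w = 1.0 / p_hat
        if ceiling is not None:
            w = min(w, ceiling)
        total += f_val * w
    return total / n

# Toy illustration: two labeled examples out of n = 4 training examples.
est = weighted_estimate([1.0, 0.0], [0.5, 0.5], n=4)

# With a tiny propensity, the ceiling caps the weight at 100 instead of 1000.
est_capped = weighted_estimate([1.0], [0.001], n=1, ceiling=100)
```
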
When does the reject inference scenario give rise to MNAR bias?
9.5 Positive and unlabeled examples
Suppose that only positive examples are labeled. This fact can be stated formally as
the equation

p(s = 1|x, y = 0) = 0.    (9.1)

Without some assumption about which positive examples are labeled, it is impossible
to make progress. A common assumption is that the labeled positive examples
are chosen completely randomly from all positive examples. Let this be called the
“selected completely at random” assumption. Stated formally, it is that

p(s = 1|x, y = 1) = p(s = 1|y = 1) = c.    (9.2)

Another way of stating the assumption is that s and x are conditionally independent
given y.
A training set consists of two subsets, called the labeled (s = 1) and unlabeled
(s = 0) sets. Suppose we provide these two sets as inputs to a standard training
algorithm. This algorithm will yield a function g(x) such that g(x) = p(s = 1|x)
approximately. The following lemma shows how to obtain a model of p(y = 1|x)
from g(x).

Lemma 1. Suppose the “selected completely at random” assumption holds. Then
p(y = 1|x) = p(s = 1|x)/c where c = p(s = 1|y = 1).
Proof. Remember that the assumption is p(s = 1|y = 1, x) = p(s = 1|y = 1).
Now consider p(s = 1|x). Because s = 1 implies y = 1, we have that

p(s = 1|x) = p(y = 1 ∧ s = 1|x)
           = p(y = 1|x) p(s = 1|y = 1, x)
           = p(y = 1|x) p(s = 1|y = 1).

The result follows by dividing each side by p(s = 1|y = 1).
Several consequences of the lemma are worth noting. First, f(x) = p(y = 1|x) =
g(x)/c is an increasing function of g. This means that if the classifier f is only used
to rank examples x according to the chance that they belong to class y = 1, then the
classifier g can be used directly instead of f.
Second, f = g/p(s = 1|y = 1) is a well-defined probability (f ≤ 1) only if
g ≤ p(s = 1|y = 1). What this says is that g > p(s = 1|y = 1) is impossible. This
is reasonable because the labeled (positive) and unlabeled (negative) training sets for
g are samples from overlapping regions in x space. Hence it is impossible for any
example x to belong to the positive class for g with a high degree of certainty.
The value of the constant c = p(s = 1|y = 1) can be estimated using a trained
classifier g and a validation set of examples. Let V be such a validation set that is
drawn from the overall distribution p(x, y, s) in the same manner as the nontraditional
training set. Let P be the subset of examples in V that are labeled (and hence
positive). The estimator of p(s = 1|y = 1) is the average value of g(x) for x in P.
Formally the estimator is

e1 = (1/n) Σ_{x∈P} g(x)

where n is the cardinality of P.
We shall show that e1 = p(s = 1|y = 1) = c if it is the case that g(x) = p(s = 1|x)
for all x. To do this, all we need to show is that g(x) = c for x ∈ P. We can
show this as follows:

g(x) = p(s = 1|x)
     = p(s = 1|x, y = 1) p(y = 1|x) + p(s = 1|x, y = 0) p(y = 0|x)
     = p(s = 1|x, y = 1) · 1 + 0 · 0    since x ∈ P
     = p(s = 1|y = 1).

Note that in principle any single example from P is sufficient to determine c, but that
in practice averaging over all members of P is preferable.
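A simulation of this estimator, using synthetic data in which y is a deterministic function of x; that is the case, used in the argument above, where p(y = 1|x) is 0 or 1 and a perfect g makes the estimator exact. The true p(s = 1|x) plays the role of a perfectly trained g.

```python
import random

random.seed(1)

c_true = 0.4  # true p(s = 1 | y = 1)

def p_y_given_x(x):
    """Deterministic labeling: y = 1 exactly when x > 0.5."""
    return 1.0 if x > 0.5 else 0.0

def g(x):
    """Ideal nontraditional classifier: g(x) = p(s=1|x) = c * p(y=1|x)."""
    return c_true * p_y_given_x(x)

# Generate a validation set and collect g(x) over labeled examples (the set P).
labeled_g = []
for _ in range(10000):
    x = random.random()
    y = 1 if random.random() < p_y_given_x(x) else 0
    s = 1 if (y == 1 and random.random() < c_true) else 0
    if s == 1:
        labeled_g.append(g(x))

# e1: average of g(x) over the labeled (hence positive) examples.
e1 = sum(labeled_g) / len(labeled_g)
```
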
There is an alternative way of using Lemma 1. Let the goal be to estimate
E_{p(x,y,s)}[h(x, y)] for any function h, where p(x, y, s) is the overall distribution. To
make notation more concise, write this as E[h]. We want an estimator of E[h] based
on a positive-only training set of examples of the form ⟨x, s⟩.
Clearly p(y = 1|x, s = 1) = 1. Less obviously,

p(y = 1|x, s = 0) = p(s = 0|x, y = 1) p(y = 1|x) / p(s = 0|x)
                  = [1 − p(s = 1|x, y = 1)] p(y = 1|x) / [1 − p(s = 1|x)]
                  = (1 − c) p(y = 1|x) / [1 − p(s = 1|x)]
                  = (1 − c) [p(s = 1|x)/c] / [1 − p(s = 1|x)]
                  = ((1 − c)/c) · p(s = 1|x) / (1 − p(s = 1|x)).
By definition

E[h] = Σ_{x,y,s} h(x, y) p(x, y, s)
     = Σ_x p(x) Σ_{s=0}^{1} p(s|x) Σ_{y=0}^{1} p(y|x, s) h(x, y)
     = Σ_x p(x) [ p(s = 1|x) h(x, 1)
                + p(s = 0|x) ( p(y = 1|x, s = 0) h(x, 1)
                             + p(y = 0|x, s = 0) h(x, 0) ) ].
The plug-in estimate of E[h] is then the empirical average

(1/m) [ Σ_{⟨x, s=1⟩} h(x, 1) + Σ_{⟨x, s=0⟩} ( w(x) h(x, 1) + (1 − w(x)) h(x, 0) ) ]

where

w(x) = p(y = 1|x, s = 0) = ((1 − c)/c) · p(s = 1|x) / (1 − p(s = 1|x))    (9.3)

and m is the cardinality of the training set. What this says is that each labeled example
is treated as a positive example with unit weight, while each unlabeled example
is treated as a combination of a positive example with weight p(y = 1|x, s = 0)
and a negative example with complementary weight 1 − p(y = 1|x, s = 0). The
probability p(s = 1|x) is estimated as g(x) where g is the nontraditional classifier
explained in the previous section.
The result above on estimating E[h] can be used to modify a learning algorithm
in order to make it work with positive and unlabeled training data. One method is to
give training examples individual weights. Positive examples are given unit weight
and unlabeled examples are duplicated; one copy of each unlabeled example is made
positive with weight p(y = 1|x, s = 0) and the other copy is made negative with
weight 1 − p(y = 1|x, s = 0).
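The duplication scheme can be sketched as follows, assuming a trained nontraditional classifier g(x) ≈ p(s = 1|x) and an estimate of c are already available; the particular g and data below are hand-specified for illustration.

```python
def pu_weighted_examples(data, g, c):
    """Convert positive-unlabeled data into a weighted training set.

    data : list of (x, s) pairs, s = 1 for labeled (hence positive) examples
    g    : function with g(x) approximately p(s = 1 | x)
    c    : estimate of p(s = 1 | y = 1)

    Returns a list of (x, y, weight) triples.
    """
    out = []
    for x, s in data:
        if s == 1:
            out.append((x, 1, 1.0))  # labeled example: positive, unit weight
        else:
            gx = g(x)
            # w = p(y=1 | x, s=0) by equation (9.3)
            w = ((1.0 - c) / c) * gx / (1.0 - gx)
            out.append((x, 1, w))        # positive copy
            out.append((x, 0, 1.0 - w))  # negative copy with complementary weight
    return out

# Tiny illustration with one labeled and one unlabeled example.
examples = pu_weighted_examples([(0.9, 1), (0.2, 0)], g=lambda x: 0.2 * x, c=0.4)
```
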
9.6 Further issues
Spam ﬁltering scenario.
Observational studies.
Concept drift.
Moral hazard.
Adverse selection.
Quiz for May 11, 2010
Your name:
The importance sampling identity is the equation

E[f(z) | z ∼ p(z)] = E[f(z) w(z) | z ∼ q(z)]

where

w(z) = p(z)/q(z).
We assume that q(z) > 0 for all z such that p(z) > 0, and we deﬁne 0/0 = 0.
Suppose that the training set consists of values z sampled according to the probability
distribution q(z). Explain intuitively which members of the training set will have
greatest inﬂuence on the estimate of E[f(z)[z ∼ p(z)].
Quiz for 2009
(a) Suppose you use the weighting approach to deal with reject inference. What are
the minimum and maximum possible values of the weights?
Let x be a labeled example, and let its weight be p(s = 1)/p(s = 1|x). Intuitively,
this weight is how many copies are needed to allow the one labeled example
to represent all the unlabeled examples that are similar. The conditional probability
p(s = 1|x) can range between 0 and 1, so the weight can range between p(s = 1)
and infinity.
(b) Suppose you use the weighting approach to learn from positive and unlabeled
examples. What are the minimum and maximum possible values of the weights?
In this scenario, weights are assigned to unlabeled examples, not to labeled examples
as above. The weights here are probabilities p(y = 1|x, s = 0), so they range
between 0 and 1.
(c) Explain intuitively what can go wrong if the “selected completely at random”
assumption is false, when learning from positive and unlabeled examples.
The “selected completely at random” assumption says that the positive examples
with known labels are perfectly representative of the positive examples with unknown
labels. If this assumption is false, then there will be unlabeled examples that in fact
are positive, but that we treat as negative, because they are different from the labeled
positive examples. The trained model of the positive class will be too narrow.
Assignment (revised)
The goal of this assignment is to train useful classifiers using training sets with missing
information of three different types: (i) covariate shift, (ii) reject inference, and
(iii) no labeled negative training examples. In
http://www.cs.ucsd.edu/users/elkan/291/dmfiles.zip you can find four datasets: one test set, and
one training set for each of the three scenarios. Training set (i) has 5,164 examples,
sets (ii) and (iii) have 11,305 examples, and the test set has 11,307 examples. Each
example has values for 13 predictors. (Many thanks to Aditya Menon for creating
these files.)
You should train a classiﬁer separately based on each training set, and measure
performance separately but on the same test set. Use accuracy as the primary measure
of success, and use the logistic regression option of FastLargeMargin as the
main training method. In each case, you should be able to achieve better than 82%
accuracy.
All four datasets are derived from the socalled Adult dataset that is available
at http://archive.ics.uci.edu/ml/datasets/Adult. Each example
describes one person. The label to be predicted is whether or not the person earns
over $50,000 per year. This is an interesting label to predict because it is analogous
to a label describing whether or not the person is a customer that is desirable in some
way. (We are not using the published weightings, so the fnlwgt feature has been
omitted from our datasets.)
First, do crossvalidation on the test set to establish the accuracy that is achievable
when all training set labels are known. In your report, show graphically a learning
curve, that is accuracy as a function of the number of training examples, for 1000,
2000, etc. training examples.
Training set (i) requires you to overcome covariate shift, since it does not follow
the same distribution as the population of test examples. Evaluate experimentally the
effectiveness of learning p(y|x) from biased training sets of size 1000, 2000, etc.
Training set (ii) requires reject inference, because the training examples are a
random sample from the test population, but the training label is known only for some
training examples. The persons with known labels, on average, are better prospects
than the ones with unknown labels. Compare learning p(y|x) directly with learning
p(y|x) after reweighting.
In training set (iii), a random subset of positive training examples have known
labels. Other training examples may be negative or positive. Use an appropriate
weighting method. Explain how you estimate the constant c = p(s = 1|y = 1) and
discuss the accuracy of this estimate. The true value is c = 0.4; in a real application
we would not know this, of course.
For each scenario and each training method, include in your report a learning
curve figure that shows accuracy as a function of the number (1000, 2000, etc.)
of labeled training examples used. Discuss the extent to which each of the three
missing-label scenarios reduces achievable accuracy.
Chapter 10
Recommender systems
The collaborative ﬁltering (CF) task is to recommend items to a user that he or she
is likely to like, based on ratings for different items provided by the same user and
on ratings provided by other users. The general assumption is that users who give
similar ratings to some items are likely also to give similar ratings to other items.
From a formal perspective, the input to a collaborative filtering algorithm is a
matrix of incomplete ratings. Each row corresponds to a user, while each column
corresponds to an item. If user 1 ≤ i ≤ m has rated item 1 ≤ j ≤ n, the matrix
entry x_ij is the value of this rating. Often rating values are integers between 1 and 5.
If a user has not rated an item, the corresponding matrix entry is missing. Missing
ratings are often represented as 0, but this should not be viewed as an actual rating
value.
The output of a collaborative filtering algorithm is a prediction of the value of
each missing matrix entry. Typically the predictions are not required to be integers.
Given these predictions, many different real-world tasks can be performed. For example,
a recommender system might suggest to a user those items for which the
predicted ratings by this user are highest.
There are two main general approaches to the formal collaborative filtering task:
nearest-neighbor-based and model-based. Given a user and item for which a prediction
is wanted, nearest neighbor (NN) approaches use a similarity function between
rows, and/or a similarity function between columns, to pick a subset of relevant other
users or items. The prediction is then some sort of average of the known ratings in
this subset.
Model-based approaches to collaborative filtering construct a low-complexity
representation of the complete x_ij matrix. This representation is then used instead of
the original matrix to make predictions. Typically each prediction can be computed
in O(1) time using only a fixed number of coefficients from the representation.
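The O(1) prediction claim can be made concrete. Assuming a rank-k factorization x ≈ UVᵀ has already been learned, predicting any single entry is just a k-term dot product. This sketch hand-specifies tiny factors rather than learning them.

```python
# Hypothetical rank-2 factors for m = 3 users and n = 4 items.
U = [[1.0, 0.5],
     [0.0, 1.0],
     [2.0, 0.0]]   # one row of k coefficients per user
V = [[1.0, 0.0],
     [0.5, 1.0],
     [1.0, 1.0],
     [0.0, 2.0]]   # one row of k coefficients per item

def predict(i, j):
    """Predicted rating of item j by user i: a k-term dot product."""
    return sum(u * v for u, v in zip(U[i], V[j]))

# Prediction for a missing entry, e.g. user 0 and item 3.
pred = predict(0, 3)
```
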
The most popular model-based collaborative filtering algorithms are based on
standard matrix approximation methods such as the singular value decomposition
(SVD), principal component analysis (PCA), or non-negative matrix factorization
(NNMF). Of course these methods are themselves related to each other.
From the matrix decomposition perspective, the fundamental issue in collaborative filtering is that the matrix given for training is incomplete. Many algorithms have been proposed to extend SVD, PCA, NNMF, and related methods to apply to incomplete matrices, often based on expectation-maximization (EM). Unfortunately, methods based on EM are intractably slow for large collaborative filtering tasks, because they require many iterations, each of which computes a matrix decomposition on a full m-by-n matrix. Here, we discuss an efficient approach to decomposing large incomplete matrices.
10.1 Applications of matrix approximation
In addition to collaborative filtering, many other applications of matrix approximation are important. Examples include the voting records of politicians, the results of sports matches, and distances in computer networks. The largest research area with a goal similar to collaborative filtering is so-called "item response theory" (IRT), a subfield of psychometrics. The aim of IRT is to develop models of how a person answers a multiple-choice question on a test. Users and items in collaborative filtering correspond to test-takers and test questions in IRT. Missing ratings are called omitted responses in IRT.
One conceptual difference between collaborative filtering research and IRT research is that the aim in IRT is often to find a single parameter for each individual and a single parameter for each question that together predict answers as well as possible. The idea is that each individual's parameter value is then a good measure of intrinsic ability and each question's parameter value is a good measure of difficulty. In contrast, in collaborative filtering research the goal is often to find multiple parameters describing each person and each item, because the assumption is that each item has multiple relevant aspects and each person has separate preferences for each of these aspects.
10.2 Measures of performance
Given a set of predicted ratings and a matching set of true ratings, the difference between the two sets can be measured in alternative ways. The two standard measures are mean absolute error (MAE) and mean squared error (MSE). Given a probability distribution over possible values, MAE is minimized by taking the median of the distribution while MSE is minimized by taking the mean of the distribution. In general the mean and the median are different, so in general predictions that minimize MAE are different from predictions that minimize MSE.
Most matrix approximation algorithms aim to minimize MSE between the training data (a complete or incomplete matrix) and the approximation obtained from the learned low-complexity representation. Some methods can be used with equal ease to minimize MSE or MAE.
10.3 Additive models
Let us consider first models where each matrix entry is represented as the sum of two contributions, one from its row and one from its column. Formally, a_ij = r_i + c_j where a_ij is the approximation of x_ij and r_i and c_j are scalars to be learned.
Define the training mean x̄_i of each row to be the mean of all its known values; define the training mean x̄_j of each column similarly. The user-mean model sets r_i = x̄_i and c_j = 0, while the item-mean model sets r_i = 0 and c_j = x̄_j. A slightly more sophisticated baseline is the "bimean" model: r_i = 0.5 x̄_i and c_j = 0.5 x̄_j.
The "bias from mean" model has r_i = x̄_i and c_j equal to the training mean, over all rows i with a known rating for item j, of the difference x_ij − r_i. Intuitively, c_j is the average amount by which users who provide a rating for item j like this item more or less than the average item that they have rated.
The optimal additive model can be computed quite straightforwardly. Let I be the set of matrix indices for which x_ij is known. The MSE-optimal additive model is the one that minimizes

∑_{(i,j)∈I} (r_i + c_j − x_ij)².
This optimization problem is a special case of a sparse least-squares linear regression problem: find z that minimizes ||Az − b||² where the column vector z = ⟨r_1, ..., r_m, c_1, ..., c_n⟩ᵀ, b is a column vector of x_ij values, and the corresponding row of A is all zero except for ones in positions i and m + j. The vector z can be computed by many methods, including the Moore-Penrose pseudoinverse of A: z = (AᵀA)⁻¹Aᵀb. However this approach requires inverting an (m+n)-by-(m+n) matrix, which is computationally expensive. For linear regression in general, approaches based on QR factorization are more efficient. For this particular special case of linear regression, Section 10.4 below gives a gradient-descent method to obtain the optimal z that only requires O(|I|) time and space.
The matrix A is always rank-deficient, with rank at most m+n−1, because any constant can be added to all r_i and subtracted from all c_j without changing the matrix reconstruction. If some users or items have very few ratings, the rank of A may be less than m+n−1. However the rank deficiency of A does not cause computational problems in practice.
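The construction of A and b, and the pseudoinverse solution, can be sketched as follows. This is a minimal illustration with a made-up 3-user by 3-item ratings matrix; numpy's lstsq routine is used because it computes the pseudoinverse solution, so the rank deficiency of A is harmless.

```python
import numpy as np

# Hypothetical 3-user by 3-item ratings; np.nan marks missing entries.
X = np.array([[5.0, 3.0, np.nan],
              [4.0, np.nan, 1.0],
              [np.nan, 2.0, 2.0]])
m, n = X.shape
I = [(i, j) for i in range(m) for j in range(n) if not np.isnan(X[i, j])]

# One row of A per known rating: ones in positions i (for r_i) and m+j (for c_j).
A = np.zeros((len(I), m + n))
b = np.empty(len(I))
for k, (i, j) in enumerate(I):
    A[k, i] = 1.0
    A[k, m + j] = 1.0
    b[k] = X[i, j]

# lstsq returns the minimum-norm least-squares solution, so the rank
# deficiency of A (rank at most m+n-1) causes no problem.
z, *_ = np.linalg.lstsq(A, b, rcond=None)
r, c = z[:m], z[m:]
pred = r[:, None] + c[None, :]   # a_ij = r_i + c_j for every entry
```

The model predicts every entry, including the missing ones, while being fitted only on the known entries.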
10.4 Multiplicative models
Multiplicative models are similar to additive models, but the row and column values are multiplied instead of added. Specifically, x_ij is approximated as a_ij = r_i c_j with r_i and c_j being scalars to be learned. Like additive models, multiplicative models have one unnecessary degree of freedom, because any single value r_i or c_j can be fixed without changing the space of achievable approximations.
A multiplicative model is a rank-one matrix decomposition, since the rows and columns of the matrix of approximations a_ij are linearly proportional. We now present a general algorithm for learning row value/column value models; additive and multiplicative models are special cases. Let x_ij be a matrix entry and let r_i and c_j be the corresponding row and column values. We approximate x_ij by a_ij = f(r_i, c_j) for some fixed function f. The approximation error is defined to be e(f(r_i, c_j), x_ij). The training error E over all known matrix entries I is the pointwise sum of errors.
To learn r_i and c_j by minimizing E by gradient descent, we evaluate

∂E/∂r_i = ∑_{(i,j)∈I} ∂/∂r_i e(f(r_i, c_j), x_ij)
        = ∑_{(i,j)∈I} [∂e(f(r_i, c_j), x_ij)/∂f(r_i, c_j)] · ∂f(r_i, c_j)/∂r_i.
As before, I is the set of matrix indices for which x_ij is known. Consider the special case where e(u, v) = |u − v|^p for p > 0. This case is a generalization of MAE and MSE: p = 2 corresponds to MSE and p = 1 corresponds to MAE. We have

∂e(u, v)/∂u = p|u − v|^{p−1} ∂|u − v|/∂u = p|u − v|^{p−1} sgn(u − v).

Here sgn(a) is the sign of a, that is −1, 0, or 1 if a is negative, zero, or positive. For computational purposes we can assume the derivative is zero for the non-differentiable case u = v. Therefore

∂E/∂r_i = ∑_{(i,j)∈I} p|f(r_i, c_j) − x_ij|^{p−1} sgn(f(r_i, c_j) − x_ij) · ∂f(r_i, c_j)/∂r_i.
Now suppose f(r_i, c_j) = r_i + c_j so ∂f(r_i, c_j)/∂r_i = 1. We obtain

∂E/∂r_i = p ∑_{(i,j)∈I} |r_i + c_j − x_ij|^{p−1} sgn(r_i + c_j − x_ij).
Alternatively, suppose f(r_i, c_j) = r_i c_j so ∂f(r_i, c_j)/∂r_i = c_j. We get

∂E/∂r_i = p ∑_{(i,j)∈I} |r_i c_j − x_ij|^{p−1} sgn(r_i c_j − x_ij) c_j.
Given the gradients above, we apply online (stochastic) gradient descent. This means that for each triple ⟨r_i, c_j, x_ij⟩ in the training set, we compute the gradient with respect to r_i and c_j based just on this one example, and perform the updates

r_i := r_i − λ ∂/∂r_i e(f(r_i, c_j), x_ij)

and

c_j := c_j − λ ∂/∂c_j e(f(r_i, c_j), x_ij)

where λ is a learning rate. An epoch is one pass over the whole training set, performing the updates above once for each (i, j) ∈ I.
After any finite number of epochs, stochastic gradient descent does not converge fully. The learning rate typically follows a decreasing schedule such as λ = a/(b + e) where e is the number of the current epoch. The values a and b are often chosen to be small positive integers. The maximum number of epochs is another parameter that needs to be chosen.
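The whole procedure, for the multiplicative model f(r_i, c_j) = r_i c_j with squared error, can be sketched as follows. The data and the schedule constants a and b are made-up illustrative choices, not values recommended in the text.

```python
import random

def sgd_rank_one(triples, m, n, epochs=200, a=1.0, b=20.0):
    """Online gradient descent for a_ij = r_i * c_j with squared error.

    triples: list of (i, j, x_ij) for the known matrix entries I."""
    random.seed(0)
    r = [1.0] * m
    c = [1.0] * n
    for e in range(epochs):
        lam = a / (b + e)              # decreasing learning-rate schedule
        random.shuffle(triples)
        for i, j, x in triples:        # one epoch = one update per known entry
            err = r[i] * c[j] - x      # f(r_i, c_j) - x_ij
            # r_i := r_i - lam * 2 * err * c_j, and symmetrically for c_j
            r[i], c[j] = r[i] - lam * 2 * err * c[j], c[j] - lam * 2 * err * r[i]
    return r, c

# Rank-one data x_ij = u_i * v_j, so an exact fit is possible in principle.
u, v = [0.5, 1.0, 1.5], [1.0, 0.5, 2.0]
known = [(i, j, u[i] * v[j]) for i in range(3) for j in range(3)]
r, c = sgd_rank_one(known, 3, 3)
```

Note that only the products r_i c_j are determined by the data; the individual vectors are determined only up to the spare degree of freedom mentioned in Section 10.4.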
Gradient descent as described above directly optimizes precisely the same objective function (given the MSE error function) that is called "incomplete data likelihood" in EM approaches to factorizing matrices with missing entries. It is sometimes forgotten that EM is just one approach to solving maximum-likelihood problems; incomplete matrix factorization is an example of a maximum-likelihood problem where an alternative solution method is superior.
10.5 Combining models by fitting residuals

Simple models can be combined into more complex models. In general, a second model is trained to fit the difference between the predictions made by the first model and the truth as specified by the training data; this difference is called the residual.
The most straightforward way to combine models is additive. Let a_ij be the prediction from the first model and let x_ij be the corresponding training value. The second model is trained to minimize error relative to the residual x_ij − a_ij. Let b_ij be the prediction made by the second model for matrix entry ij. The combined prediction is then a_ij + b_ij. If the first and second models are both multiplicative, then the combined model is a rank-two approximation. The extension to higher-rank approximations is obvious.
The common interpretation of a combined model is that each component model
represents an aspect or property of items and users. The idea is that the column
value for each item is the intensity with which it possesses this property, and the
row value for each user is the weighting the user places on this property. However
this point of view implicitly treats each component model as equally important. It is
more accurate to view each component model as an adjustment to previous models,
where each model is empirically less important than each previous one, because the
numerical magnitude of its contribution is smaller. The aspect of items represented
by each model cannot be understood in isolation, but only in the context of previous
aspects.
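As a concrete sketch of residual fitting, the user-mean model can serve as the first model, with the column means of its residuals as the second model; this combination reproduces the "bias from mean" model of Section 10.3. The small ratings matrix below is made up for illustration.

```python
import numpy as np

# Hypothetical ratings; np.nan marks missing entries.
X = np.array([[5.0, 3.0, np.nan],
              [4.0, np.nan, 1.0],
              [np.nan, 2.0, 2.0]])

# First model: the user mean, a_ij = r_i (the training mean of row i).
r = np.nanmean(X, axis=1)
A = np.repeat(r[:, None], X.shape[1], axis=1)

# Second model fits the residuals x_ij - a_ij: here their column means,
# which reproduces the "bias from mean" model of Section 10.3.
resid = X - A
b = np.nanmean(resid, axis=0)
B = np.repeat(b[None, :], X.shape[0], axis=0)

combined = A + B   # combined prediction a_ij + b_ij, defined for every entry
```

On the known entries, the combined model fits at least as well as the first model alone, since the second model is trained directly on the leftover error.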
10.6 Regularization
A matrix factorization model of rank k has k parameters for each person and for each movie. For example, consider a training set of 500,000 ratings for 10,000 viewers and 1000 movies. A rank-50 unregularized factor model is almost certain to overfit the training data, because it has 50 × (10,000 + 1000) = 550,000 parameters, which is more than the number of data points for training.
We can typically improve generalization by using a large value for k while reducing the effective number of parameters via regularization.
The unregularized loss function is

E = ∑_{(i,j)∈I} e(f(r_i, c_j), x_ij)

where I is the set of ij pairs for which x_ij is known. The corresponding L2-regularized loss function is

E = μ (∑_i r_i² + ∑_j c_j²) + ∑_{(i,j)∈I} e(f(r_i, c_j), x_ij)

where μ is the strength of regularization. Remember that this expression has only two parameter vectors r and c because it is just for a factorization of rank one. The partial derivative with respect to r_i is

∂E/∂r_i = 2μ r_i + ∑_{(i,j)∈I} [∂e(f(r_i, c_j), x_ij)/∂f(r_i, c_j)] · ∂f(r_i, c_j)/∂r_i.
Assume that e(u, v) = (u − v)² and f(r_i, c_j) = r_i c_j, so ∂f(r_i, c_j)/∂r_i = c_j. Then

∂E/∂r_i = 2μ r_i + 2 ∑_{(i,j)∈I} (r_i c_j − x_ij) c_j.
Using stochastic gradient descent, for each training example ⟨r_i, c_j, x_ij⟩ we do the updates

r_i := r_i − 2λμ r_i − 2λ(r_i c_j − x_ij) c_j

and similarly for c_j.
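A single regularized update can be written as a small function; the numeric values below are arbitrary, chosen only for illustration. With μ = 0 the update reduces to the unregularized one, and the extra term −2λμ r_i shrinks r_i toward zero at every step.

```python
def sgd_step_regularized(r_i, c_j, x_ij, lam, mu):
    """One stochastic update of r_i for the L2-regularized rank-one model."""
    err = r_i * c_j - x_ij                      # f(r_i, c_j) - x_ij
    return r_i - 2 * lam * mu * r_i - 2 * lam * err * c_j

# mu = 0 recovers the unregularized update; mu > 0 adds shrinkage.
plain = sgd_step_regularized(2.0, 1.0, 3.0, lam=0.1, mu=0.0)
shrunk = sgd_step_regularized(2.0, 1.0, 3.0, lam=0.1, mu=0.5)
```

Here `plain` is 2.2 while `shrunk` is 2.0: the data-fitting term pushes r_i up, and the regularization term pulls it back toward zero.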
When doing regularization, it is reasonable not to regularize the vectors learned for the rank-one factorization. The reason is that these are always positive, if ratings x_ij are positive, because x_ij = r_i c_j on average. (The vectors could always both be negative, but that would be unintuitive without being more expressive.)

After the rank-one factorization, residuals are zero on average, so learned vectors must have positive and negative entries, and regularization towards zero is reasonable.
10.7 Further issues
Another issue not discussed is any potential probabilistic interpretation of a matrix decomposition. One reason for not considering models that are explicitly probabilistic is that these are usually based on Gaussian assumptions that are incorrect by definition when the observed data are integers and/or in a fixed interval.
A third issue not discussed is any explicit model for how matrix entries come to be missing. There is no assumption or analysis concerning whether entries are missing "completely at random" (MCAR), "at random" (MAR), or "not at random" (MNAR). Intuitively, in the collaborative filtering context, missing ratings are likely to be MNAR. This means that even after accounting for other influences, whether or not a rating is missing still depends on its value, because people select which items to rate based on how much they anticipate liking them. Concretely, everything else being equal, low ratings are still more likely to be missing than high ratings. This intuition can be confirmed empirically. On the MovieLens dataset, the average rating in the training set is 3.58. The average prediction from one reasonable method for ratings in the test set, i.e. ratings that are unknown but known to exist, is 3.79. The average prediction for ratings that in fact do not exist is 3.34.
Amazon: people who looked at x bought y eventually (within 24hrs usually)
Quiz
For each part below, say whether the statement in italics is true or false, and then explain your answer briefly.
(a) For any model, the average predicted rating for unrated movies is expected to
be less than the average actual rating for rated movies.
True. People are more likely to like movies that they have actually watched, than
random movies that they have not watched. Any good model should capture this fact.
In more technical language, the value of a rating is correlated with whether or
not it is missing.
(b) Let the predicted value of rating x_ij be r_i c_j + s_i d_j and suppose r_i and c_j are trained first. For all viewers i, r_i should be positive, but for some i, s_i should be negative.
True. Ratings x_ij are positive, and x_ij = r_i c_j on average, so r_i and c_j should always be positive. (They could always both be negative, but that would be unintuitive without being more expressive.)
The term s_i d_j models the difference x_ij − r_i c_j. This difference is on average zero, so it is sometimes positive and sometimes negative. In order to allow s_i d_j to be negative, s_i must be negative sometimes. (Making s_i be always positive, while allowing d_j to be negative, might be possible. However in this case the expressiveness of the model s_i d_j would be reduced.)
(c) We have a training set of 500,000 ratings for 10,000 viewers and 1000 movies, and we train a rank-50 unregularized factor model. This model is likely to overfit the training data.

True. The unregularized model has 50 × (10,000 + 1000) = 550,000 parameters, which is more than the number of data points for training. Hence, overfitting is practically certain.
Quiz for May 25, 2010
Page 87 of the lecture notes says
Given the gradients above, we apply online (stochastic) gradient descent. This means that we iterate over each triple ⟨r_i, c_j, x_ij⟩ in the training set, compute the gradient with respect to r_i and c_j based just on this one example, and perform the updates

r_i := r_i − λ ∂/∂r_i e(f(r_i, c_j), x_ij)

and

c_j := c_j − λ ∂/∂c_j e(f(r_i, c_j), x_ij)

where λ is a learning rate.
State and explain what the first update rule is for the special case e(u, v) = (u − v)² and f(r_i, c_j) = r_i c_j.
CSE 291 Quiz for May 10, 2011
Your name:
Email address:
Section 10.6 of the lecture notes says "Using stochastic gradient descent, for each training example ⟨r_i, c_j, x_ij⟩ we do the updates

r_i := r_i − 2λμ r_i − 2λ(r_i c_j − x_ij) c_j

and similarly for c_j."
In the right-hand side of the update rule for r_i, explain the role of λ, of μ, and of r_i c_j.
Assignment
The goal of this assignment is to apply a collaborative filtering method to the task of predicting movie ratings. You should use the small MovieLens dataset available at http://www.grouplens.org/node/73. This has 100,000 ratings given by 943 users to 1682 movies.
You can select any collaborative filtering method that you like. You may reuse existing software, or you may write your own, in Matlab or in another programming language. Whatever your choice, you must understand fully the algorithm that you apply, and you should explain it with mathematical clarity in your report. The method that you choose should handle missing values in a sensible and efficient way.
When reporting final results, do five-fold cross-validation, where each rating is assigned randomly to one fold. Note that this experimental procedure makes the task easier, because it is likely that every user and every movie is represented in each training set. Moreover, evaluation is biased towards users who have provided more ratings; it is easier to make accurate predictions for these users.
In your report, show mean absolute error graphically as a function of a measure of complexity of your chosen method. If you select a matrix factorization method, this measure of complexity will likely be rank. Also show timing information graphically. Discuss whether you could run your chosen method on the full Netflix dataset of about 10^8 ratings. Also discuss whether your chosen method needs a regularization technique to reduce overfitting.
Good existing software includes the following:

• Jason Rennie's fast maximum margin matrix factorization for collaborative filtering (MMMF) at http://people.csail.mit.edu/jrennie/matlab/.

• Guy Lebanon's toolkit at http://www2.cs.cmu.edu/~lebanon/IRlab.htm.
You are also welcome to write your own code, or to choose other software. If you choose other existing software, you may want to ask the instructor for comments first.
CSE 291 Assignment 5 (2011)
This assignment is due at the start of class on Tuesday May 17, 2011.
The goal of this project is to implement a collaborative filtering method for the task of predicting movie ratings. You should use the small MovieLens dataset available at http://www.grouplens.org/node/73. This dataset has 100,000 ratings given by 943 users to 1682 movies.

You should write your own code to implement a matrix factorization method with regularization that allows for missing data. Using R is recommended, but you may choose Matlab or another programming language. You may use stochastic gradient descent (SGD) as described above, or you may use another method such as alternating least squares.
When reporting final results, do five-fold cross-validation, where each rating is assigned randomly to one fold. Note that this experimental procedure makes the task easier, because it is likely that every user and every movie is represented in each training set. Moreover, evaluation is biased towards users who have provided more ratings; it is easier to make accurate predictions for these users.
Do experiments to investigate the benefits of regularization. If you use SGD, do experiments to investigate how to set the learning rate.
In your report, show mean absolute error and mean squared error graphically, as a function of the rank of the matrix factorization. Discuss whether the time and memory efficiency of your code is good enough to run on the full Netflix dataset of about 10^8 ratings.
Chapter 11
Text mining
This chapter explains how to do data mining on datasets that are collections of documents. Text mining tasks include

• classifier learning,
• clustering,
• topic modeling, and
• latent semantic analysis.
Classifiers for documents are useful for many applications. Major uses for binary classifiers include spam detection and personalization of streams of news articles. Multiclass classifiers are useful for routing messages to recipients.
Most classifiers for documents are designed to categorize according to subject matter. However, it is also possible to learn to categorize according to qualitative criteria such as helpfulness for product reviews submitted by consumers, or importance for email messages.
Classifiers can be even more useful for ranking documents than for dividing them into categories. With a training set of very helpful product reviews, and another training set of very unhelpful reviews, we can learn a scoring function that sorts other reviews according to their degree of helpfulness. There is often no need to pick a threshold, which would be arbitrary, to separate marginally helpful from marginally unhelpful reviews. The Google email feature called Priority Inbox uses a classifier to rank messages according to their predicted importance as perceived by the recipient.
As discussed in Chapter 6, when one class of examples is rare, it is not reasonable to use 0/1 accuracy to measure the success of a classifier. In text mining, it is especially common to use what is called F-measure. This measure is the harmonic mean of precision and recall:

f = 2 / (1/p + 1/r)

where p and r are precision and recall for the rare class.
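The formula can be checked with a few lines of code; the precision and recall values below are made up for illustration. Note how the harmonic mean penalizes an imbalance between precision and recall far more than the arithmetic mean would.

```python
def f_measure(p, r):
    """F-measure: harmonic mean of precision p and recall r."""
    return 2.0 / (1.0 / p + 1.0 / r)

balanced = f_measure(0.5, 0.5)   # harmonic mean of equal values is that value
skewed = f_measure(0.9, 0.1)     # 0.18, far below the arithmetic mean of 0.5
```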
The mistakes made by a binary classifier are either false positives or false negatives. A false positive is a failure of precision, while a false negative is a failure of recall. In domains such as medical diagnosis, a failure of recall is often considered more serious than a failure of precision. The reason is that the latter has mainly a monetary cost, but the former can have a significant health consequence. In applications of document classification, failures of recall are often not serious. The central reason is that if the content of a document is genuinely important, many documents are likely to contain different versions of the same content. So a failure of recall on one of these documents will not lead to an overall failure.

Figuring out the optimal trade-off between failures of recall and precision, given that a failure of recall has a certain probability of being remedied later, is a problem in decision theory.
In many applications of multiclass classification, a single document can belong to more than one category, so it is correct to predict more than one label. This task is specifically called multilabel classification. In standard multiclass classification, the classes are mutually exclusive, i.e. a special type of negative correlation is fixed in advance. In multilabel classification, it is important to learn the positive and negative correlations between classes.

Text mining is often used as a proxy for data mining on underlying entities represented by text documents. For example, employees may be represented by their resumes, images may be represented by captions, and music or movies may be represented by reviews. The correlations between labels may arise from correlations in the underlying entities. For example, the labels "slow" and "romantic" may be positively correlated for songs.
11.1 The bag-of-words representation
The first question is how to represent documents. For genuine understanding of natural language the order of the words in documents must be preserved. However, for many large-scale data mining tasks, including classifying and clustering documents, it is sufficient to use a simple representation that loses all information about word order.
Given a collection of documents, the first task to perform is to identify the set of all words used at least once in at least one document. This set is called the vocabulary V. Often, it is reduced in size by keeping only words that are used in at least two documents. (Words that are found only once are often misspellings or other mistakes.) Although the vocabulary is a set, we fix an arbitrary ordering for it, so that we can refer to word 1 through word m where m = |V| is the size of the vocabulary.
Once V has been fixed, each document is represented as a vector of length m with integer entries. If this vector is x then its jth component x_j is the number of appearances of word j in the document. The length of the document is n = ∑_{j=1}^m x_j. For typical documents, n is much smaller than m and x_j = 0 for most words j.
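The representation can be sketched directly; the two example documents below are made up.

```python
from collections import Counter

# Two made-up documents; real collections are of course much larger.
docs = ["the cat sat on the mat", "the dog sat"]

# The vocabulary V: every word used at least once, in a fixed arbitrary order.
vocab = sorted(set(w for d in docs for w in d.split()))

def bag_of_words(doc):
    """Return the count vector x of length m = |V| for one document."""
    counts = Counter(doc.split())
    return [counts.get(w, 0) for w in vocab]

x = bag_of_words(docs[0])
n = sum(x)   # the document length n = sum_j x_j
```

For a realistic vocabulary, the resulting vector would be stored sparsely, since most entries are zero.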
Many applications of text mining also eliminate from the vocabulary so-called "stop" words. These are words that are common in most documents and do not correspond to any particular subject matter. In linguistics these words are sometimes called "function" words. They include pronouns (you, he, it), connectives (and, because, however), prepositions (to, of, before), auxiliaries (have, been, can), and generic nouns (amount, part, nothing). It is important to appreciate, however, that generic words carry a lot of information for many tasks, including identifying the author of a document or detecting its genre.
A collection of documents is represented as a two-dimensional matrix where each row describes a document and each column corresponds to a word. Each entry in this matrix is an integer count; most entries are zero. It makes sense to view each column as a feature. It also makes sense to learn a low-rank approximation of the whole matrix; doing this is called latent semantic analysis (LSA) and is discussed in Section 12.3 below.
11.2 The multinomial distribution
Once we have a representation for individual documents, the natural next step is
to select a model for a set of documents. This model is a probability distribution.
Given a training set of documents, we will choose values for the parameters of the
distribution that make the training documents have high probability.
Then, given a test document, we can evaluate its probability according to the
model. The higher this probability is, the more similar the test document is to the
training set.
The probability distribution that we use is the multinomial. Mathematically, this distribution is

p(x; θ) = (n! / ∏_{j=1}^m x_j!) ∏_{j=1}^m θ_j^{x_j}

where the data x are a vector of nonnegative integers and the parameters θ are a real-valued vector. Both vectors have the same length m. The components of θ are nonnegative and have unit sum: ∑_{j=1}^m θ_j = 1.
Intuitively, θ_j is the probability of word j while x_j is the count of word j. Each time word j appears in the document it contributes an amount θ_j to the total probability, hence the term θ_j to the power x_j. The fraction before the product is called the multinomial coefficient. It is the number of different word sequences that might have given rise to the observed vector of counts x. Note that the multinomial coefficient is a function of x, but not of θ. Therefore, when comparing the probability of a document according to different models, that is according to different parameter vectors θ, we can ignore the multinomial coefficient.
Like any discrete distribution, a multinomial has to sum to one, where the sum is over all possible data points. Here, a data point is a document containing n words. The number of such documents is exponential in their length n: it is m^n. The probability of any individual document will therefore be very small. What is important is the relative probability of different documents. A document that mostly uses words with high probability will have higher relative probability.
At first sight, computing the probability of a document requires O(m) time because of the product over j. However, if x_j = 0 then θ_j^{x_j} = 1 so the jth factor can be omitted from the product. Similarly, 0! = 1 so the jth factor can be omitted from ∏_{j=1}^m x_j!. Hence, computing the probability of a document needs only O(n) time.
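The O(n) computation can be sketched as follows, representing a document sparsely by its nonzero counts only, and working in log space for numerical stability. The three-word vocabulary, the parameter values, and the example document are made up.

```python
import math

def multinomial_log_prob(counts, theta):
    """log p(x; theta) where counts maps word j to x_j > 0.

    Zero-count words are simply absent from counts, so the cost is O(n),
    not O(m)."""
    n = sum(counts.values())
    logp = math.lgamma(n + 1)             # log n!
    for j, x_j in counts.items():
        logp -= math.lgamma(x_j + 1)      # subtract log x_j!
        logp += x_j * math.log(theta[j])  # add x_j log theta_j
    return logp

# Hypothetical three-word vocabulary and a document of length n = 3.
theta = {0: 0.5, 1: 0.25, 2: 0.25}
logp = multinomial_log_prob({0: 2, 1: 1}, theta)
```

For this tiny example the exact probability is 3 × 0.5² × 0.25 = 0.1875, where 3 is the multinomial coefficient.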
Because the probabilities of individual documents decline exponentially with length n, it is necessary to do numerical computations with log probabilities:

log p(x; θ) = log n! − [∑_{j=1}^m log x_j!] + [∑_{j=1}^m x_j log θ_j].
Given a set of training documents, the maximum-likelihood estimate of the jth parameter is

θ_j = (1/T) ∑_x x_j

where the sum is over all documents x belonging to the training set. The normalizing constant is T = ∑_x ∑_j x_j which is the sum of the sizes of all training documents.
If a multinomial has θ_j = 0 for some j, then every document with x_j > 0 for this j has zero probability, regardless of any other words in the document. Probabilities that are perfectly zero are undesirable, so we want θ_j > 0 for all j. Smoothing with a constant c is the way to achieve this. We set

θ_j ∝ c + ∑_x x_j

where the symbol ∝ means "proportional to." The constant c is called a pseudocount. Intuitively, it is a notional number of appearances of word j that are assumed to exist, regardless of the true number of appearances. Typically c is chosen in the range 0 < c ≤ 1. Because the equality ∑_j θ_j = 1 must be preserved, the normalizing constant must be T′ = mc + T in

θ_j = (1/T′) (c + ∑_x x_j).

In order to avoid big changes in the estimated probabilities θ_j, one should have c < T/m.
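The smoothed estimate can be sketched as follows; the two training documents, given as count vectors over a made-up three-word vocabulary, are hypothetical.

```python
def smoothed_theta(docs, m, c=1.0):
    """theta_j = (c + sum_x x_j) / T' with T' = m*c + T.

    docs: list of count vectors of length m; c is the pseudocount."""
    totals = [sum(x[j] for x in docs) for j in range(m)]
    T = sum(totals)               # total size of all training documents
    T_prime = m * c + T           # normalizing constant after smoothing
    return [(c + t) / T_prime for t in totals]

# Word 2 never appears in training, yet it still gets nonzero probability.
theta = smoothed_theta([[2, 1, 0], [1, 0, 0]], m=3)
```

Here T = 4 and T′ = 7, so the estimates are 4/7, 2/7, and 1/7, which sum to one as required.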
Technically, one multinomial is a distribution over all documents of a fixed size n. Therefore, what is learned by the maximum-likelihood process just described is in fact a different distribution for each size n. These distributions, although separate, have the same parameter values.
The generative process underlying the multinomial distribution is sampling words with replacement.
11.3 Training Bayesian classifiers
Bayesian learning is an approach to learning classifiers based on Bayes' rule. Let x be an example and let y be its class. Suppose the alternative class values are 1 to K. We can write

p(y = k|x) = p(x|y = k) p(y = k) / p(x).

In order to use the expression on the right for classification, we need to learn three items from training data: p(x|y = k), p(y = k), and p(x). We can estimate p(y = k) easily as n_k / ∑_{k=1}^K n_k where n_k is the number of training examples with class label k. The denominator p(x) can be computed as the sum of numerators

p(x) = ∑_{k=1}^K p(x|y = k) p(y = k).

In general, the class-conditional distribution p(x|y = k) can be any distribution whose parameters are estimated using the training examples that have class label k.
Note that the training process for each class uses only the examples from that class, and is separate from the training process for all other classes. Usually, each class is represented by one multinomial fitted by maximum likelihood as described in Section 11.2. In the formula

θ_j = (1/T′) (c + ∑_x x_j)

the sum is taken over the documents in one class. Unfortunately, the exact value of c can strongly influence classification accuracy.
11.4 Burstiness
The multinomial model says that each appearance of the same word j always has the same probability θ_j. In reality, additional appearances of the same word are less surprising, i.e. they have higher probability. Consider the following excerpt from a newspaper article, for example.
Toyota Motor Corp. is expected to announce a major overhaul. Yoshi Inaba, a former senior Toyota executive, was formally asked by Toyota this week to oversee the U.S. business. Mr. Inaba is currently head of an international airport close to Toyota's headquarters in Japan.

Toyota's U.S. operations now are suffering from plunging sales. Mr. Inaba was credited with laying the groundwork for Toyota's fast growth in the U.S. before he left the company.

Recently, Toyota has had to idle U.S. assembly lines, pay workers who aren't producing vehicles and offer a limited number of voluntary buyouts. Toyota now employs 36,000 in the U.S.
The multinomial distribution arises from a process of sampling words with replacement. An alternative distribution named the Dirichlet compound multinomial (DCM) arises from an urn process that captures the authorship process better. Consider a bucket with balls of |V| different colors. After a ball is selected randomly, it is not just replaced, but also one more ball of the same color is added. Each time a ball is drawn, the chance of drawing the same color again is increased. This increased probability models the phenomenon of burstiness.
Let the initial number of balls with color j be β_j. These initial values are the parameters of the DCM distribution. The DCM parameter vector β has length |V|, like the multinomial parameter vector, but the sum of the components of β is unconstrained. This one extra degree of freedom allows the DCM to discount multiple observations of the same word, in an adjustable way. The smaller the parameter values β_j are, the more bursty words are.
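The urn process can be simulated directly. This sketch (function and parameter names are illustrative) draws one bag of words from a DCM with initial ball counts beta:

```python
import random
from collections import Counter

def polya_urn_sample(beta, n_draws, seed=0):
    """Simulate the urn behind the DCM: draw a ball, replace it, and
    add one more ball of the same color. Returns word counts."""
    rng = random.Random(seed)
    weights = list(beta)               # current ball count per color
    counts = Counter()
    for _ in range(n_draws):
        r = rng.uniform(0, sum(weights))
        cum = 0.0
        for j, w in enumerate(weights):
            cum += w
            if r <= cum:
                counts[j] += 1
                weights[j] += 1.0      # burstiness: same color more likely now
                break
    return counts
```

With small values of beta the counts concentrate on few colors (bursty documents); as the beta values grow, the draws approach independent multinomial sampling.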
11.5 Discriminative classiﬁcation
There is a consensus that linear support vector machines (SVMs) are the best known method for learning classifiers for documents. Nonlinear SVMs are not beneficial, because the number of features is typically much larger than the number of training examples. For the same reason, choosing the strength of regularization appropriately, typically by cross-validation, is crucial.
When using a linear SVM for text classification, accuracy can be improved considerably by transforming the raw counts. Since counts do not follow Gaussian distributions, it is not sensible to make them have mean zero and variance one. Inspired by the discussion above of burstiness, it is sensible to replace each count x_j by log(1 + x_j). This transformation maps 0 to 0, so it preserves sparsity. A more extreme transformation that loses information is to make each count binary, that is to replace all nonzero values by one.
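In numpy terms, the two transformations are one-liners; a minimal sketch:

```python
import numpy as np

def log_transform(X):
    """Replace each count x by log(1 + x); zeros stay zero, so the
    sparsity pattern of X is preserved."""
    return np.log1p(X)

def binarize(X):
    """More extreme transformation: every nonzero count becomes one."""
    return (X > 0).astype(float)
```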
One can also transform counts in a supervised way, that is in a way that uses label information. This is most straightforward when there are just two classes. Experimentally, the following transformation is better than others:

x_j → I(x_j > 0) · | log(tp/fn) − log(fp/tn) |.

Above, tp is the number of positive training examples containing word j, and fn is the number of these examples not containing the word, while fp is the number of negative training examples containing the word, and tn is the number of these examples not containing the word. If any of these numbers is zero, we replace it by 0.5, which of course is less than any of these numbers that is genuinely nonzero.
The transformation above is sometimes called log-odds weighting. It ignores how often a word appears in an individual document. Instead, it makes each word count binary, and gives the word a weight that is the same for every document, but depends on how predictive the word is. The positive and negative classes are treated in a symmetric way. The value | log(tp/fn) − log(fp/tn) | is large if tp/fn and fp/tn have very different values. In this case the word is highly diagnostic for at least one of the two classes.
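A small sketch of the weight computation for one word, with the zero-replacement rule above (the function name is illustrative):

```python
import math

def log_odds_weight(tp, fn, fp, tn):
    """|log(tp/fn) - log(fp/tn)| with zero counts replaced by 0.5."""
    tp, fn, fp, tn = (v if v > 0 else 0.5 for v in (tp, fn, fp, tn))
    return abs(math.log(tp / fn) - math.log(fp / tn))
```

A word that is equally frequent in both classes gets weight zero, while a word concentrated in one class gets a large weight.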
11.6 Clustering documents
Suppose that we have a collection of documents, and we want to find an organization for these, i.e. we want to do unsupervised learning. The simplest variety of unsupervised learning is clustering.

Clustering can be done probabilistically by fitting a mixture distribution to the given collection of documents. Formally, a mixture distribution is a probability density function of the form

p(x) = Σ_{k=1}^{K} α_k p(x; θ_k).

Here, K is the number of components in the mixture model. For each k, p(x; θ_k) is the distribution of component number k. The scalar α_k is the proportion of component number k. Each component is a cluster.
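Evaluating a mixture density is just a weighted sum of component densities; a minimal sketch with arbitrary component pdfs:

```python
def mixture_pdf(x, alphas, component_pdfs):
    """p(x) = sum_k alpha_k * p(x; theta_k). The proportions alphas
    must be nonnegative and sum to one."""
    assert abs(sum(alphas) - 1.0) < 1e-9
    return sum(a * pdf(x) for a, pdf in zip(alphas, component_pdfs))
```

Fitting the proportions and the component parameters from data is usually done with the EM algorithm.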
11.7 Topic models
Mixture models, and clusterings in general, are based on the assumption that each data point is generated by a single component model. For documents, if components are topics, it is often more plausible to assume that each document can contain words from multiple topics. Topic models are probabilistic models that make this assumption.
Latent Dirichlet allocation (LDA) is the most widely used topic model. It is
based on the intuition that each document contains words from multiple topics; the
proportion of each topic in each document is different, but the topics themselves are
the same for all documents.
The generative process assumed by the LDA model is as follows:

Given: Dirichlet distribution with parameter vector α of length K
Given: Dirichlet distribution with parameter vector β of length V
for topic number 1 to topic number K
    draw a multinomial with parameter vector φ_k according to β
for document number 1 to document number M
    draw a topic distribution, i.e. a multinomial θ according to α
    for each word in the document
        draw a topic z according to θ
        draw a word w according to φ_z

Note that z is an integer between 1 and K for each word.
The prior distributions α and β are assumed to be fixed and known, as are the number K of topics, the number M of documents, the length N_m of each document, and the cardinality V of the vocabulary (i.e. the dictionary of all words).

For learning, the training data are the words in all documents. Learning has two goals: (i) to infer the document-specific multinomial θ for each document, and (ii) to infer the topic distribution φ_k for each topic. After training, φ_k is a vector of word probabilities for each topic indicating the content of that topic. The distribution θ of each document is useful for classifying new documents, measuring similarity between documents, and more.

When applying LDA, it is not necessary to learn α and β. Steyvers and Griffiths recommend fixed uniform values α = 50/K and β = 0.01, where K is the number of topics.
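The generative process above translates line by line into code. This sketch samples synthetic documents given a fixed Dirichlet parameter α and an already-drawn topic matrix φ with K rows and V columns (the function name is illustrative):

```python
import numpy as np

def lda_generate(alpha, phi, doc_lengths, seed=0):
    """Sample documents from the LDA generative process.
    alpha: length-K Dirichlet parameter; phi: K-by-V matrix whose row k
    is the word distribution of topic k."""
    rng = np.random.default_rng(seed)
    K, V = phi.shape
    docs = []
    for n in doc_lengths:
        theta = rng.dirichlet(alpha)        # topic mixture for this document
        z = rng.choice(K, size=n, p=theta)  # topic of each word
        docs.append([int(rng.choice(V, p=phi[k])) for k in z])
    return docs
```

Learning inverts this process: given only the words, infer θ for each document and φ_k for each topic.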
11.8 Open questions
It would be interesting to work out an extension of the log-odds mapping for the multiclass case. One suggestion is

x_j → I(x_j > 0) · max_k | log(tp_jk / fn_jk) |

where k ranges over the classes, tp_jk is the number of training examples in class k containing word j, and fn_jk is the number of these examples not containing the word.
It is also not known if combining the log transformation and log-odds weighting is beneficial, as in

x_j → log(x_j + 1) · | log(tp/fn) − log(fp/tn) |.
Quiz
(a) Explain why, with a multinomial distribution, “the probabilities of individual documents decline exponentially with length n.”

The probability of document x of length n according to a multinomial distribution is

p(x; θ) = ( n! / Π_{j=1}^{m} x_j! ) Π_{j=1}^{m} θ_j^{x_j}.

[Rough argument.] Each θ_j value is less than 1. In total n = Σ_j x_j of these values are multiplied together. Hence as n increases the product Π_{j=1}^{m} θ_j^{x_j} decreases exponentially. Note the multinomial coefficient n! / Π_{j=1}^{m} x_j! does increase with n, but more slowly.
(b) Consider the multiclass Bayesian classifier

ŷ = argmax_k p(x | y = k) p(y = k) / p(x).

Simplify the expression inside the argmax operator as much as possible, given that the model p(x | y = k) for each class is a multinomial distribution.

The denominator p(x) is the same for all k, so it does not influence which k is the argmax. Within the multinomial distributions p(x | y = k), the multinomial coefficient does not depend on k, so it is constant for a single x and it can be eliminated also, giving

ŷ = argmax_k p(y = k) Π_{j=1}^{m} θ_kj^{x_j}

where θ_kj is the jth parameter of the kth multinomial.
(c) Consider the classifier from part (b) and suppose that there are just two classes. Simplify the classifier further into a linear classifier.

Let the two classes be k = 0 and k = 1 so we can write

ŷ = 1 if and only if p(y = 1) Π_{j=1}^{m} θ_1j^{x_j} > p(y = 0) Π_{j=1}^{m} θ_0j^{x_j}.

Taking logarithms and using indicator function notation gives

ŷ = I( log p(y = 1) + Σ_{j=1}^{m} x_j log θ_1j − log p(y = 0) − Σ_{j=1}^{m} x_j log θ_0j > 0 ).

The classifier simplifies to

ŷ = I( c_0 + Σ_{j=1}^{m} x_j c_j > 0 )

where the coefficients are c_0 = log p(y = 1) − log p(y = 0) and c_j = log θ_1j − log θ_0j. The expression inside the indicator function is a linear function of x.
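This equivalence is easy to check numerically. The sketch below compares the Bayesian argmax with the derived linear rule on random parameters and random count vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.dirichlet(np.ones(6), size=2)   # row k: multinomial of class k
prior = np.array([0.4, 0.6])

c0 = np.log(prior[1]) - np.log(prior[0])    # coefficients from the derivation
c = np.log(theta[1]) - np.log(theta[0])

def bayes_predict(x):
    """argmax_k of p(y=k) * prod_j theta_kj^x_j, computed in log space."""
    scores = np.log(prior) + x @ np.log(theta).T
    return int(np.argmax(scores))

def linear_predict(x):
    return int(c0 + x @ c > 0)

for _ in range(20):
    x = rng.integers(0, 5, size=6).astype(float)
    assert bayes_predict(x) == linear_predict(x)
```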
Quiz for May 18, 2010
Your name:
Consider the task of learning to classify documents into one of two classes, using the bag-of-words representation. Explain why a regularized linear SVM is expected to be more accurate than a Bayesian classifier using one maximum-likelihood multinomial for each class.
Write one or two sentences for each of the following reasons.
(a) Handling words that are absent in one class in training.
(b) Learning the relative importance of different words.
CSE 291 Quiz for May 17, 2011
Your name:
Email address:
Consider the binary Bayesian classifier

ŷ = argmax_{k ∈ {0,1}} p(x | y = k) p(y = k) / p(x).

Simplify the expression inside the argmax operator as much as possible, given that the model p(x | y = k) for each class is a multinomial distribution:

p(x | y = 0) = ( n! / Π_{j=1}^{m} x_j! ) Π_{j=1}^{m} α_j^{x_j}

and

p(x | y = 1) = ( n! / Π_{j=1}^{m} x_j! ) Π_{j=1}^{m} β_j^{x_j}.

Also assume that p(y = 0) = 0.5.
The denominator p(x) is the same for all k, so it does not influence which k is the argmax. Within the multinomial distributions, the multinomial coefficient does not depend on the parameters, so it is constant for a single x and it can be eliminated also, giving

ŷ = argmax_k p(y = k) Π_{j=1}^{m} θ_kj^{x_j}

where θ_kj is either α_j or β_j.
The assumption p(y = 0) = 0.5 implies p(y = 1) = 0.5, so the p(y = k) factor also does not influence classification. We can write

ŷ = 1 if and only if Π_{j=1}^{m} β_j^{x_j} > Π_{j=1}^{m} α_j^{x_j}.

Using indicator function notation gives

ŷ = I( Π_{j=1}^{m} (β_j / α_j)^{x_j} > 1 ).

Note that there is no guaranteed relationship between α_j and β_j. We know that Σ_j α_j = 1 and similarly for the β parameter vector, but this knowledge does not help in making a further simplification.
Assignment (2009)
The purpose of this assignment is to compare two different approaches to learning a classifier for text documents. The dataset to use is called Classic400. It consists of 400 documents from three categories over a vocabulary of 6205 words. The categories are quite distinct from a human point of view, and high accuracy is achievable.
First, you should try Bayesian classification using a multinomial model for each of the three classes. Note that you may need to smooth the multinomials with pseudocounts. Second, you should train a support vector machine (SVM) classifier. You will need to select a method for adapting SVMs for multiclass classification. Since there are more features than training examples, regularization is vital.

For each of the two classifier learning methods, investigate whether you can achieve better accuracy via feature selection, feature transformation, and/or feature weighting. Because there are relatively few examples, using tenfold cross-validation to measure accuracy is suggested. Try to evaluate the statistical significance of differences in accuracy that you find. If you allow any leakage of information from the test fold to training folds, be sure to explain this in your report.
The dataset is available at http://www.cs.ucsd.edu/users/elkan/291/classic400.zip, which contains three files. The main file cl400.csv is a comma-separated 400 by 6205 array of word counts. The file truelabels.csv gives the actual class of each document, while wordlist gives the string corresponding to the word indices 1 to 6205. These strings are not needed for training or applying classifiers, but they are useful for interpreting classifiers. The file classic400.mat contains the same three files in the form of Matlab matrices.
CSE 291 Assignment 6 (2011)
This assignment is due at the start of class on Tuesday May 24, 2011.
The purpose of this assignment is to compare different approaches to learning a binary classifier for text documents. The dataset to use is the movie review polarity dataset, version 2.0, published by Lillian Lee at http://www.cs.cornell.edu/People/pabo/movie-review-data/. Be sure to read the README carefully, and to understand the data and task fully.
First, you should try Bayesian classification using a multinomial model for each of the two classes. You should smooth the multinomials with pseudocounts. Second, you should train a linear discriminative classifier, either logistic regression or a support vector machine. Since there are more features than training examples, regularization is vital. To get good performance with an SVM, you need to use strong regularization, that is a small value of C.

For both classifier learning methods, investigate whether you can achieve better accuracy via feature selection, feature transformation, and/or feature weighting. Try to evaluate the statistical significance of differences in accuracy that you find. Think carefully about any ways in which you may be allowing leakage of information from test subsets that makes estimates of accuracy biased. Discuss this issue in your report. If you achieve very high accuracy, it is likely that you are a victim of leakage in some way.
Compare the accuracy that you can achieve with accuracies reported in some of the many published papers that use this dataset. You can find links to these papers at http://www.cs.cornell.edu/People/pabo/movie-review-data/other-experiments.html. Also, analyze your trained classifier to identify what features of movie reviews are most indicative of the review being favorable or unfavorable.
Chapter 12
Matrix factorization and applications
Many datasets are essentially two-dimensional matrices. For these datasets, methods based on ideas from linear algebra often provide considerable insight. It is remarkable that similar algorithms are useful for datasets as diverse as the following.

• In many datasets, each row is an example and each column is a feature. For example, in the KDD 1998 dataset each row describes a person. There are 22 features corresponding to different previous campaigns requesting donations. For each person and each campaign there is a binary feature value, 1 if the person made a donation and 0 otherwise.
• In a text dataset, each row represents a document while each column represents
a word.
• In a collaborative ﬁltering dataset, each row can be a user and each column can
be a movie.
• In a social network, each person corresponds to a row and also to a column.
The network is a graph, and the matrix is its adjacency matrix.
• In a kernel matrix, each example corresponds to a row and also to a column. The number in row i and column j is the degree of similarity between examples i and j.
The cases above show that the matrix representing a dataset may be square or rectangular. Its entries may be binary, that is 0 or 1, or they may be real numbers. In some cases every entry has a known value, while in other cases some entries are unknown; missing is a synonym for unknown.
12.1 Singular value decomposition
A mathematical idea called the singular value decomposition (SVD) is at the root of most ways of analyzing datasets that are matrices. One reason why the SVD is so important is that it is defined for every matrix. Other related concepts are defined only for matrices that are square, or only for matrices that are invertible. The SVD is defined for every matrix all of whose entries are known. Its extension to matrices with missing entries is discussed elsewhere.
Before explaining the SVD, let us review the concepts of linear dependence and rank. Consider a subset of rows of a matrix. Suppose that the rows in this subset are called a_1 to a_k. Let v be any row vector of length k. A linear combination of these rows is the new row vector that is the matrix product of v and the matrix whose rows are a_1, a_2, ..., a_k. A set of rows is called linearly independent if no row is a linear combination of the others.
The rank of a matrix is the largest number of linearly independent rows that it
has. If a matrix has rank k, then it has a subset of k rows such that (i) none of
these is a linear combination of the others, but (ii) every row not in the subset is a
linear combination of the rows in the subset. The rank of a matrix is a measure of its
complexity. Consider for example a user/movie matrix. Suppose that its rank is k.
This means that there are k users such that the ratings of every other user are a linear
function of the ratings of these k users. Formally, these k users are a basis for the
whole matrix.
Let A be any matrix, square or rectangular. Let A have m rows and n columns, and let r be the rank of A. The SVD is a product of three matrices named U, S, and V. Specifically,

A = U S V^T

where V^T means the transpose of V, which can also be written V'. The matrix U is m by r, S is r by r, and V is n by r.
Some conditions on U, S, and V make the SVD unique. First, all the off-diagonal entries of S are zero, its diagonal entries σ_i are positive, and they are sorted from largest to smallest. These entries are called singular values. (They are closely related to eigenvalues, but remember that the SVD is well-defined for rectangular matrices, while only square matrices have eigenvalues.) Next, the columns of
U, which are vectors of length m, are orthonormal. This means that each column is a unit vector, and the columns are perpendicular to each other. Concretely, let u_i and u_j be any two columns of U where i ≠ j. We have u_i · u_i = 1 and u_i · u_j = 0. Similarly, the columns of V, which are vectors of length n, are orthonormal to each other.
Because U and V have orthonormal columns, the matrix product in the SVD can be simplified to

A = Σ_{k=1}^{r} σ_k u_k v_k^T.

Here, u_k is the kth column of U, and v_k^T is the transpose of the kth column of V. The product u_k v_k^T is an m by n matrix of rank 1. It is often called the outer product of the vectors u_k and v_k. This simplification decomposes A into the sum of r matrices of rank 1.
For data mining, the SVD is most interesting because of the following property. Let U_k be the k leftmost columns of U. This means that U_k is an m by k matrix whose columns are the same as the first k columns of U. Similarly, let V_k be the k leftmost columns of V. Let S_k be the k by k diagonal matrix whose diagonal entries are S_11 to S_kk. Define

A_k = U_k S_k V_k^T.

The rank of A_k is k. The following theorem was published by the mathematicians Carl Eckart and Gale Young in 1936.
Theorem:

A_k = argmin_{X : rank(X) ≤ k} ||A − X||².

The theorem says that among all matrices of rank k or less, A_k has the smallest squared error with respect to A. The squared error is simply

||A − X||² = Σ_{i=1}^{m} Σ_{j=1}^{n} (A_ij − X_ij)²

which is the square of what is called the Frobenius norm of the matrix A − X. The SVD essentially answers the question, what is the best low-complexity approximation of a given matrix A?
The matrices U_k, S_k, and V_k are called truncated versions of U, S, and V. The omitted columns in the truncated matrices correspond to the smallest diagonal entries in S. The hope is that these correspond to noise in the original observed A.

Notice that the matrices U, S, and V are themselves truncated in our definition to have r columns, where r is the rank of A. This definition is sometimes called the compact SVD. The omitted columns correspond to singular values σ_i = 0 for i > r.
The truncated SVD is called a matrix approximation:

A ≈ U_k S_k V_k^T.

The matrix U_k has k columns, and the same number of rows as A. Each row of U_k can be viewed as a low-dimensional representation of the corresponding row of A. Similarly, each column of V_k^T is a k-dimensional representation of the corresponding column of A. Note that the rows of U_k, and the columns of V_k^T, are not orthonormal. If an approximation with just two factors is desired, the matrix S_k can be included in U_k and/or V_k.
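With numpy, the rank-k approximation is a few lines. This sketch also shows that the squared error shrinks as k grows, as the Eckart-Young theorem guarantees:

```python
import numpy as np

def truncated_svd_approx(A, k):
    """A_k = U_k S_k V_k^T, the best rank-k approximation of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A = np.arange(12.0).reshape(3, 4)
errors = [np.linalg.norm(A - truncated_svd_approx(A, k)) for k in (1, 2, 3)]
assert errors[0] >= errors[1] >= errors[2]
```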
12.2 Principal component analysis
Standard principal component analysis (PCA) is the most widely used method for reducing the dimensionality of a dataset. The idea is to identify, one by one, the directions along which a dataset varies the most. These directions are called the principal components. The first direction is a line that goes through the mean of the data points, and maximizes the sum of the squared projections of the points onto the line. Each new direction is orthogonal to all previous ones.
Let A be a matrix with m rows corresponding to examples and n columns corresponding to features. The first step in PCA is to make each feature have mean zero, that is to subtract the mean of each column from each entry in it. Let this new data matrix be B, with rows b_1 to b_m. A direction is represented by a column vector w of length n with unit L_2 norm. The first principal component of B is

w_1 = argmax_{||w||_2 = 1} Σ_{i=1}^{m} (b_i w)².

Each data point b_i is a row vector of length n. The principal component is a column vector of the same length. The scalar (b_i w) is the projection of b_i along w, which is the one-dimensional representation of the data point.
Each data point with the first component removed is

b′_i = b_i − (b_i w) w^T.

The second principal component can be found recursively as

w_2 = argmax_{||w||_2 = 1} Σ_{i=1}^{m} (b′_i w)².
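In practice the components are computed via the SVD of the centered matrix B: the right singular vectors are the principal components. A minimal sketch (function name illustrative):

```python
import numpy as np

def principal_components(A, k):
    """First k principal components of A (rows = examples) via SVD of
    the column-centered matrix."""
    B = A - A.mean(axis=0)           # make each feature zero-mean
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Vt[:k]                    # rows are unit-norm directions

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))
W = principal_components(A, 2)
assert np.allclose(np.linalg.norm(W, axis=1), 1.0)
assert abs(W[0] @ W[1]) < 1e-9       # orthogonal directions
```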
Principal component analysis is widely used, but it has some major drawbacks. First, it is sensitive to the scaling of features. Without changing the correlations between features, if some features are multiplied by constants, then the PCA changes. The principal components tend to be dominated by the features that are numerically largest. Second, PCA is unsupervised. It does not take into account any label that is to be predicted. The directions of maximum variation are usually not the directions that differentiate best between examples with different labels. Third, PCA is linear. If the data points lie in a cloud with nonlinear structure, PCA will not capture this structure.

PCA is closely related to the SVD, which is used in practice to compute principal components and the re-representation of data points.
12.3 Latent semantic analysis
Polysemy is the phenomenon in language that a word such as “fire” can have multiple different meanings. Another important phenomenon is synonymy, the ability of different words such as “firearm” and “gun” to have similar meanings. Because of synonymy, the bag-of-words vectors representing two documents can be dissimilar, despite similarity in meaning between the documents. Conversely, document vectors may be similar despite different meanings, because of polysemy.
Latent semantic analysis (LSA), also called latent semantic indexing (LSI), is an attempt to improve the representation of documents by taking into account polysemy and synonymy in a way that is induced from the documents themselves. The key idea is to relate documents and words via concepts. These concepts are latent, i.e. hidden, because they are not provided explicitly. Concepts can also be called topics, or dimensions. The hope is that the concepts revealed by LSA do have semantic meaning.
Intuitively, a document is viewed as a weighted combination of concepts. Importantly, each document is created from more than one concept, so concepts are different from, and more expressive than, cluster centers. Synonymy is captured by permitting different word choices within one concept, while polysemy is captured by allowing the same word to appear in multiple concepts. The number of independent dimensions or concepts is usually chosen to be far smaller than the number of words in the vocabulary. A typical number is 300.
Concretely, LSI uses the singular value decomposition of the matrix A where each row is a document and each column is a word. Let A = U S V^T as above. Each row of U is interpreted as the representation of a document in concept space. Similarly, each column of V^T is interpreted as the representation of a word. Each singular value in S is the strength of the corresponding concept in the whole document collection.
Remember that the SVD can be written as

A = Σ_{k=1}^{r} σ_k u_k v_k^T

where u_k is the kth column of U and v_k^T is the kth row of V^T. Each vector v_k^T is essentially a concept, with a certain weight for each word. The vector u_k gives the strength of this concept for each document. A concept can be thought of as a pseudo-word; U_ik = (u_k)_i is the count of pseudo-word k in document i.
Weighting schemes that transform the raw counts of words in documents are critical to the performance of LSI. A scheme that is popular is called log-entropy. The weight for word j in document i is

[ 1 + Σ_i p_ij log_n p_ij ] log(x_ij + 1)

where p_ij is an estimate of the probability of occurrence of word j in document i, usually a maximum likelihood estimate, and log_n denotes the logarithm with base n, the number of documents. Note that words that are equally likely to occur in every document have the smallest weights.
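A sketch of the weighting, assuming the standard reading of the scheme (the entropy of each word's distribution over documents, normalized by log n, times a log-scaled local count):

```python
import numpy as np

def log_entropy_weights(X):
    """Log-entropy weighting for a documents-by-words count matrix X:
    global weight 1 + sum_i p_ij log p_ij / log n per word, times the
    local weight log(1 + x_ij)."""
    n_docs = X.shape[0]
    col = np.maximum(X.sum(axis=0), 1e-12)
    p = X / col                                    # p_ij estimates
    plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    g = 1.0 + plogp.sum(axis=0) / np.log(n_docs)   # global weight per word
    return g * np.log1p(X)                         # global * local

X = np.array([[2.0, 1.0], [2.0, 0.0], [2.0, 3.0]])
W = log_entropy_weights(X)
assert np.abs(W[:, 0]).max() < 1e-9   # uniformly distributed word: weight ~0
```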
LSI has drawbacks that research has not yet fully overcome. In no particular order, some of these are as follows.

• For large datasets, while A can be represented efficiently owing to its sparse nature, A_k cannot. For example, the TREC-2 collection used in research on information retrieval has 741,696 documents and 138,166 unique words. Since there are only 80 million nonzero entries in A, it can be stored in single precision floating point using 320 megabytes of memory. In contrast, because A_k is dense, storing it for any value of k would require about 410 gigabytes.

• Computing the SVD of large matrices is expensive. However, present-day computational power and efficient implementations of the SVD, including stochastic gradient descent methods, make it possible to apply LSI to large collections.

• The first concept found by LSI typically has little semantic interest. Let u be the first column of U and let v be the first row of V^T. The vector u has an entry for every document, while v has an entry for every word in the vocabulary. Typically, the entries in u are roughly proportional to the lengths of the corresponding documents, while each entry in v is roughly proportional to the frequency in the whole collection of the corresponding word.

• Mathematically, every pair of rows of V^T must be orthogonal. This means that no correlations between topics are allowed.
• Apart from the first column, every column of U has about as many negative as positive entries; the same is true for rows of V^T. This means that every document has a negative loading for half the topics, and every topic has a negative loading for half the words.

• The rows of the V_k matrix representing rare words, which are often highly discriminative, typically have small norms. That is, the rare words have small loadings on every dimension. This has a negative impact on LSI performance in applications. Unfortunately, both log-entropy and traditional IDF schemes fail to address this problem.
12.4 Matrix factorization with missing data
Alternating least squares.
Convexity.
Regularization.
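For the alternating least squares idea named above, a minimal sketch: fix V and solve a regularized least-squares problem for each row of U using only the observed entries, then swap roles and repeat. All names here are illustrative, and the regularization parameter lam corresponds to the regularization mentioned in the outline.

```python
import numpy as np

def als(A, mask, k=2, lam=0.001, n_iters=50, seed=0):
    """Factor A ~= U V^T from the entries where mask is True, by
    alternating ridge-regression solves for the rows of U and of V."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    U = rng.normal(scale=0.1, size=(m, k))
    V = rng.normal(scale=0.1, size=(n, k))
    for _ in range(n_iters):
        for i in range(m):                          # update row i of U
            obs = mask[i]
            Vo = V[obs]
            U[i] = np.linalg.solve(Vo.T @ Vo + lam * np.eye(k), Vo.T @ A[i, obs])
        for j in range(n):                          # update row j of V
            obs = mask[:, j]
            Uo = U[obs]
            V[j] = np.linalg.solve(Uo.T @ Uo + lam * np.eye(k), Uo.T @ A[obs, j])
    return U, V

A = np.outer([1.0, 2.0, 3.0], [1.0, 2.0])           # rank-1 target
mask = np.ones_like(A, dtype=bool)
mask[0, 1] = False                                   # one entry unobserved
U, V = als(A, mask, k=1)
assert np.abs((U @ V.T - A)[mask]).max() < 0.05      # fits observed entries
```

Each subproblem is convex (ordinary ridge regression) even though the joint problem is not, which is why the alternating scheme is attractive; the regularization term lam keeps the per-row solves well-posed.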
12.5 Spectral clustering
12.6 Graph-based regularization

Let w_ij be the similarity between nodes i and j in a graph. Assume that this similarity is the penalty for the two nodes having different labels. Let f_i be the label of node i, either 0 or 1. The cut size of the graph is

Σ_{i,j : f_i ≠ f_j} w_ij = Σ_{i,j} w_ij (f_i − f_j)²
= Σ_{i,j} w_ij (f_i² − 2 f_i f_j + f_j²)
= Σ_{i,j} w_ij (f_i² + f_j²) − 2 f^T W f
= 2 Σ_i (Σ_j w_ij) f_i² − 2 f^T W f
where f is a column vector. Let the sum of row i of the weight matrix W be v_i = Σ_j w_ij, and let V be the diagonal matrix with diagonal entries v_i. The cut size is then

−2 f^T W f + 2 f^T V f = 2 f^T (V − W) f.
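The identity between the combinatorial cut size and the quadratic form is easy to verify numerically; a sketch with a random symmetric W:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
W = rng.uniform(size=(n, n))
W = (W + W.T) / 2                      # symmetric similarity matrix
np.fill_diagonal(W, 0.0)
f = rng.integers(0, 2, size=n).astype(float)   # 0/1 label vector

# Direct cut size: total similarity across the 0/1 partition
cut = sum(W[i, j] for i in range(n) for j in range(n) if f[i] != f[j])

V = np.diag(W.sum(axis=1))             # diagonal matrix of row sums
assert np.isclose(cut, 2 * f @ (V - W) @ f)
```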
The term involving W resembles an empirical loss function, while the term involving V resembles a regularizer.

Remember that f is a 0/1 vector. We can relax the problem by allowing f to be real-valued between 0 and 1. Finding the optimal f, given W, is then doable efficiently using eigenvectors; see the background section below. This approach is well-known in research on spectral clustering and on graph-based semi-supervised learning; see [Zhu and Goldberg, 2009, Section 5.3].
Let z_i be the suggested label of node i according to the base method mentioned in the previous section. The objective that we want to minimize is the linear combination

Σ_i (f_i − z_i)² + λ Σ_{i,j : f_i ≠ f_j} w_ij
= Σ_i (f_i² − 2 f_i z_i + z_i²) + 2λ f^T (V − W) f
= f^T I f − 2 f^T I z + z^T I z + 2λ f^T (V − W) f
= −2 f^T I z + z^T I z + 2λ f^T (V − W + (1/(2λ)) I) f.
Minimizing this expression is equivalent to maximizing

f^T I z + λ f^T (W − V − (1/(2λ)) I) f

where the term that does not depend on f has been omitted.
The following result is standard.

Theorem: Let A be symmetric and let B be positive definite. Then the maximum value of x^T A x given x^T B x = 1 is the largest eigenvalue of B^{−1} A. The vector x that achieves the maximum is the corresponding eigenvector of B^{−1} A.
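The theorem can be checked numerically: the top eigenpair of B^{-1}A, rescaled so that x^T B x = 1, attains the stated maximum. A sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.normal(size=(n, n))
A = (A + A.T) / 2                         # symmetric
M = rng.normal(size=(n, n))
B = M @ M.T + n * np.eye(n)               # positive definite

evals, evecs = np.linalg.eig(np.linalg.inv(B) @ A)
i = np.argmax(evals.real)
lam, x = evals.real[i], evecs[:, i].real
x = x / np.sqrt(x @ B @ x)                # enforce the constraint x^T B x = 1
assert np.isclose(x @ B @ x, 1.0)
assert np.isclose(x @ A @ x, lam)         # the maximum equals the top eigenvalue
```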
CSE 291 Quiz for May 24, 2011
Your name:
Email address:
The compact singular value decomposition of the matrix A can be written as

A = Σ_{k=1}^{r} σ_k u_k v_k^T

where the vectors u_k are orthonormal, and so are the vectors v_k.

(i) What can you say about r in general?

(ii) Suppose that one row of A is all zero. What can you say about r?
Chapter 13
Social network analytics
A social network is a graph where nodes represent individual people and links represent relationships. More generally, in a network nodes represent examples and edges represent relationships.
Examples:
• Telephone network. Nodes are subscribers, and edges are phone relationships. Note that call detail records (CDRs) must be aggregated.
• Facebook.
• Protein-protein interactions.
• Gene-gene regulation.
• Papers citing papers.
• Sameauthor relationship between citations.
• Credit card applications.
There are two types of data mining one can do with a network: supervised and unsupervised. The aim of supervised learning is to obtain a model that can predict labels for nodes, or labels for edges. For nodes, this is sometimes called collective classification. For edges, the most basic label is existence. Predicting whether or not edges exist is called link prediction. (Predicting whether nodes exist has not been studied much.)

Examples of tasks that involve collective classification: predicting churn of subscribers; recognizing fraudulent applicants; recognizing the emotional state of Facebook members; predicting who will play a new game.
Figure 13.1: Visualization of the Enron email network from http://jheer.org/enron/v1/enron_nice_1.png.
Examples of tasks that involve link prediction: suggesting new friends on Facebook, identifying appropriate citations between scientific papers, predicting which pairs of terrorists know each other.

Structural versus dynamic link prediction: inferring unobserved edges versus forecasting future new edges.
13.1 Issues in network mining
Both collective classification and link prediction are transductive problems. Transduction is the situation where the set of test examples is known at the time a classifier is to be trained. In this situation, the real goal is not to train a classifier. Instead, it is simply to make predictions about specific examples. In principle, some methods for transduction might make predictions without having an explicit reusable classifier. We will look at methods that do involve explicit classifiers. However, these classifiers will be usable only for nodes that are part of the network that is known at training time. We will not solve the cold-start problem of making predictions for nodes that are not known during training.
Social networks, and other graphs in data mining, can be complex. In particular,
13.2. UNSUPERVISED NETWORK MINING 139
edges may be directed or undirected, they can be of multiple types, and/or they can
be weighted. The graph can be bipartite, e.g. authors and papers.
Given a social network, nodes have two fundamental characteristics. First, they have identity. This means that we know which nodes are the same and which are different in the graph, but we do not necessarily know anything else about nodes. Second, nodes (and maybe edges) may be associated with vectors that specify the values of features. These vectors are sometimes called side-information.
Sometimes an edge between two nodes means that they represent the same real-world entity. Recognizing that different records refer to the same person, or that different citations describe the same paper, is called deduplication or record linkage.
Maybe the most fundamental and most difficult issue in mining network data, including social networks, is the problem of missing information. More often than not, it is only partially known which edges exist in a network.
13.2 Unsupervised network mining
The aim of unsupervised data mining is often to come up with a model that explains the pattern of connectivity of the network, in particular its statistics. Mathematical models that explain patterns of connectivity are often time-based, e.g. the “rich get richer” idea. This type of learning can be fascinating as sociology. For example, it has revealed that people who stop smoking tend to stop being friends with smokers, but people who lose weight do not stop being friends with obese people.
Often unsupervised data mining in a network does not allow making useful decisions in an obvious way. But some models of entire networks can lead to predictions, often about the evolution of the network, that can be useful for making decisions. For example, one can identify the most influential nodes, and make special efforts to reach them.
Unsupervised methods are often used for link prediction, based on local topological features. Let Γ(x) denote the set of neighbors of node x. The preferential attachment score is

a(x, y) = |Γ(x)| · |Γ(y)|

while the Adamic-Adar score is

a(x, y) = Σ_{z ∈ Γ(x) ∩ Γ(y)} 1 / log |Γ(z)|.

The Katz score is

a(x, y) = Σ_{l=1}^{∞} β^l |paths(x, y, l)|

where paths(x, y, l) is the number of paths of length l from node x to node y. This measure gives reduced weight to longer paths.
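These three scores can be computed directly from neighbor sets. The sketch below uses a made-up toy graph; the Katz score is truncated at a maximum length, and it counts walks rather than simple paths, which is a common simplification.

```python
import math

# Toy undirected graph; Gamma(x) = graph[x] is the neighbor set of node x.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}

def preferential_attachment(g, x, y):
    # a(x, y) = |Gamma(x)| * |Gamma(y)|
    return len(g[x]) * len(g[y])

def adamic_adar(g, x, y):
    # Sum of 1 / log|Gamma(z)| over common neighbors z. Common neighbors
    # of degree 1 would give log 1 = 0 in the denominator, so skip them.
    return sum(1.0 / math.log(len(g[z]))
               for z in g[x] & g[y] if len(g[z]) > 1)

def katz(g, x, y, beta=0.1, max_len=4):
    # Truncated Katz score: sum over l of beta^l times the number of
    # walks of length l from x to y, counted by dynamic programming.
    score = 0.0
    counts = {x: 1}  # walks of the current length ending at each node
    for l in range(1, max_len + 1):
        new_counts = {}
        for node, c in counts.items():
            for nbr in g[node]:
                new_counts[nbr] = new_counts.get(nbr, 0) + c
        counts = new_counts
        score += beta ** l * counts.get(y, 0)
    return score
```

Note that the Katz score decays geometrically in path length, so truncating the infinite sum at a small maximum length loses little when β is small.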
The Laplacian matrix of a graph is D − A where A is the adjacency matrix and D is the diagonal matrix of vertex degrees. The modularity matrix has elements

B_ij = A_ij − d_i d_j / (2m)

where d_i is the degree of vertex i and m is the total number of edges. This is the actual minus the expected number of edges between the two vertices, where the expected number depends only on degrees.
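Both matrices can be sketched in a few lines of plain Python; the small adjacency matrix below is made up for illustration.

```python
# Toy symmetric adjacency matrix for four vertices.
A = [
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
]
n = len(A)
deg = [sum(row) for row in A]   # d_i, the degree of vertex i
m = sum(deg) // 2               # total number of edges

# Laplacian: D - A
L = [[(deg[i] if i == j else 0) - A[i][j] for j in range(n)]
     for i in range(n)]

# Modularity matrix: B_ij = A_ij - d_i * d_j / (2m)
B = [[A[i][j] - deg[i] * deg[j] / (2 * m) for j in range(n)]
     for i in range(n)]
```

A quick sanity check on either matrix is that each row sums to zero, which follows directly from the definitions.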
13.3 Collective classiﬁcation
In traditional classiﬁer learning, the labels of examples are assumed to be indepen
dent. (This is not the same as assuming that the examples themselves are indepen
dent; see the discussion of sample selection bias above.) If labels are independent,
then a classiﬁer can make predictions separately for separate examples. However,
if labels are not independent, then in principle a classiﬁer can achieve higher accu
racy by predicting labels for related examples in a related way. As mentioned, this
situation is called collective classiﬁcation.
Intuitively, the labels of neighbors are often correlated. Given a node x, we would like to use the labels of the neighbors of x as features when predicting the label of x, and vice versa. A principled approach to collective classification would find mutually consistent predicted labels for x and its neighbors. However, in general there is no guarantee that mutually consistent labels are unique, and there is no general algorithm for inferring them.
A different approach has proved more successful. This approach is to extend each node with a fixed-length vector of feature values derived from its neighbors, and/or the whole network. Then, a standard classifier learning method is applied. Because a node can have a varying number of neighbors, but standard learning algorithms need vector representations of fixed length, features based on neighbors are often aggregates. Aggregates may include sums, means, modes, minimums, maximums, counts, and indicator functions, notably for existence.
Conventional features describing persons may be implicit versions of aggregates based on social networks. For many data mining tasks where the examples are people, there are five general types of feature: personal demographic, group demographic, behavioral, situational, and social. If group demographic features, in particular, have predictive power, it is often because they are aggregates of features of linked people. For example, if zip code is predictive of the brand of automobile a person will purchase, that may be because of contagion effects between people who know each other, or see each other on the streets.
Often, the most useful vector of feature values for a node is obtained by low-rank approximation of the adjacency matrix. For example, the Enron email adjacency matrix shown in Figure 13.1 empirically has rank approximately 2.
The simplest case is when the adjacency matrix is fully known, edges are undirected, and they are unweighted or have real-valued weights. In this case, the standard SVD can be used directly. An undirected adjacency matrix A is symmetric, meaning A = A^T. For such a matrix the SVD is

A = USU^T = Σ_{k=1}^{r} λ_k u_k u_k^T

where the λ_k scalars are the diagonal entries of the matrix S and the u_k vectors are the orthonormal columns of the matrix U. Each u_k vector has one entry for each node. Then, the vector ⟨U_{i1}, . . . , U_{ik}⟩ = ⟨(u_1)_i, . . . , (u_k)_i⟩ can be used as the fixed-length representation of the ith node of the graph. The representation length k can be chosen between 1 and r. Because the λ_k values are ordered from largest to smallest, intuitively the vector u_1 captures more of the overall structure of the network than u_2, and so on. The vectors u_k for large k may represent only random noise.
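For a symmetric matrix this decomposition can be computed with a standard eigensolver. The sketch below (a minimal NumPy illustration, with a made-up adjacency matrix) keeps the k components with the largest |λ_k| and uses row i of the result as the representation of node i.

```python
import numpy as np

def node_embeddings(A, k):
    # For a symmetric real matrix, eigh returns A = U S U^T with
    # orthonormal eigenvector columns in U.
    eigvals, U = np.linalg.eigh(A)
    order = np.argsort(-np.abs(eigvals))  # largest |lambda| first
    return U[:, order[:k]]                # row i represents node i

# Toy symmetric adjacency matrix with four nodes.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = node_embeddings(A, 2)  # each node gets a 2-dimensional representation
```

On a large sparse graph one would use a sparse eigensolver that computes only the leading components rather than the full decomposition.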
If edges have discrete labels from more than two classes, then the numerical SVD
cannot be used directly. If some edges are unknown, then the standard SVD is not
appropriate. Instead, one should use a variant that allows some entries of the matrix
A to be unknown. The stochastic gradient approach that is used for collaborative
ﬁltering can be applied.
Often, there are no edges that are known to be absent. In other words, if an edge
is known, then it exists, but if an edge is unknown, it may or may not exist. This
situation is sometimes called “positive only” or “positive unlabeled.” In this case, it
is reasonable to use the standard SVD where the value 1 means “present” and the
value 0 means “unknown.”
Of course, one may also use a low-rank approximation of a matrix derived from the adjacency matrix, such as the Laplacian or modularity matrix.
In many domains, in effect multiple networks are overlaid on the same nodes.
For example, if nodes are persons then one set of edges may represent the “same
address” relationship while another set may represent the “made telephone call to”
relationship.
142 CHAPTER 13. SOCIAL NETWORK ANALYTICS
13.4 The social dimensions approach
The method of [Tang and Liu, 2009a] appears to be the best approach so far for predicting labels of nodes in a social network. The idea is as follows.
1. Compute the modularity matrix from the adjacency matrix.
2. Compute the top k eigenvectors of the modularity matrix.
3. Extend the vector representing each node with a vector of length k containing
its component of each eigenvector.
4. Train a linear classiﬁer using the extended vectors.
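Steps 1 to 3 can be sketched as follows (a NumPy illustration with an invented toy matrix; a real implementation would use a sparse eigensolver, and step 4 would feed the result, together with any other features, to a linear classifier):

```python
import numpy as np

def social_dimension_features(A, k):
    """Top-k eigenvectors of the modularity matrix, one row per node."""
    deg = A.sum(axis=1)
    m = deg.sum() / 2.0                        # total number of edges
    B = A - np.outer(deg, deg) / (2.0 * m)     # modularity matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(-eigvals)               # largest eigenvalues first
    return eigvecs[:, order[:k]]               # row i extends node i's vector

# Toy symmetric adjacency matrix.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
extra = social_dimension_features(A, 2)
```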
An intuitive weakness of this approach is that the eigenvectors are learned in an
unsupervised way that ignores the speciﬁc collective classiﬁcation task to be solved.
The method of [Tang and Liu, 2009a] has two limitations when applied to very
large networks. First, the eigenvectors are dense so storing the extended vector for
each node can use too much memory. Second, computing these eigenvectors may
need too much time and/or too much memory. The method of [Tang and Liu, 2009b]
is an alternative designed not to have these drawbacks. It works as follows.
1. Let n be the number of nodes and let e be the number of edges. Make a sparse
binary dataset D with e rows and n columns, where each edge is represented
by ones for its two nodes.
2. Use the k-means algorithm with cosine similarity to partition the set of rows into a large number k of disjoint groups. Note that each column represents one node; the number of ones in a column is the degree of the node.
3. Extend the vector representing each node with a sparse binary vector of length
k. The jth component of this vector is 1 if and only if the node has at least one
edge belonging to cluster number j.
4. Train a linear classiﬁer using the extended vectors.
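A rough sketch of this procedure is below. It uses dense arrays and a naive spherical k-means purely for illustration; the real method depends on sparse data structures to scale, and every name here is invented.

```python
import numpy as np

def edge_cluster_features(edges, n, k, iters=10, seed=0):
    """Cluster edges with spherical k-means, then give each node a
    sparse binary indicator of the clusters its edges fall in."""
    rng = np.random.default_rng(seed)
    # Each edge (i, j) becomes a binary row with ones at columns i and j.
    D = np.zeros((len(edges), n))
    for row, (i, j) in enumerate(edges):
        D[row, i] = D[row, j] = 1.0
    # Unit-normalize rows so that dot product equals cosine similarity.
    D /= np.linalg.norm(D, axis=1, keepdims=True)
    centers = D[rng.choice(len(edges), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(D @ centers.T, axis=1)  # nearest center
        for c in range(k):
            members = D[assign == c]
            if len(members):
                m = members.sum(axis=0)
                centers[c] = m / np.linalg.norm(m)
    features = np.zeros((n, k))
    for (i, j), c in zip(edges, assign):
        features[i, c] = features[j, c] = 1.0      # node touches cluster c
    return features
```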
13.5 Supervised link prediction
Supervised link prediction is similar to collective classiﬁcation, but involves some
additional difﬁculties. First, the label to be predicted is typically highly unbalanced:
between most pairs of nodes, there is no edge. The common response is to subsample
pairs of nodes that are not joined by an edge, but a better solution is to use a ranking
loss function during training.
13.5. SUPERVISED LINK PREDICTION 143
The most successful general approach to link prediction is essentially the same as
for collective classiﬁcation: extend each node with a vector of feature values derived
from the network, then use a combination of the vector for each node to predict the
existence of an edge between the two nodes.
The vectors representing two nodes should not be combined by concatenation; see Section 6.7 above. A reasonable alternative is to combine the vectors by pointwise multiplication.
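For instance, with made-up 4-dimensional node vectors, the pointwise product gives an edge representation of the same length that, unlike concatenation, is symmetric in the two nodes:

```python
import numpy as np

u_j = np.array([0.2, -0.5, 0.1, 0.8])  # hypothetical vector for node j
u_k = np.array([0.3, -0.4, 0.0, 0.7])  # hypothetical vector for node k

edge_features = u_j * u_k              # pointwise product, length 4
```

A linear classifier applied to this representation then scores the pair the same way regardless of the order of the two nodes.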
A combination of vectors representing nodes is a representation of a possible edge. One can also represent a potential edge with edge-based feature values. For example, for predicting future coauthorship:
1. Keyword overlap is the most informative feature for predicting coauthorship.
2. Sum of neighbors and sum of papers are next most predictive; these features
measure the propensity of an individual as opposed to any interaction between
individuals.
3. Shortest distance is the most informative graph-based feature.
Major open research issues are the cold-start problem for new nodes, and how to generalize to new networks.
CSE 291 Quiz for May 31, 2011
Your name:
Email address:
Consider the task of predicting which pairs of nodes in a social network are linked.
Speciﬁcally, suppose you have a training set of pairs for which edges are known to
exist, or known not to exist. You need to make predictions for the remaining edges.
Let the adjacency matrix A have dimension n by n. You have trained a low-rank matrix factorization A = UV, where U has dimension n by r for some r ≪ n. Because A is symmetric, V is the transpose of U.

Let u_i be the row of U that represents node i. Let ⟨j, k⟩ be a pair for which you need to predict whether an edge exists. Consider these two possible ways to make this prediction:

1. Simply predict the dot-product u_j · u_k.

2. Predict f([u_j, u_k]) where [u_j, u_k] is the concatenation of the vectors u_j and u_k, and f is a trained logistic regression model.

Explain an important reason why the first approach is better.
Assignment due on June 1, 2010
The purpose of this assignment is to apply and evaluate methods to predict the presence of links that are unknown in a social network. Use either the Cora dataset or the Terrorists dataset. Both datasets are available at http://www.cs.umd.edu/~sen/lbcproj/LBC.html.
In the Cora dataset, each node represents an academic paper. Each paper has a label, where the seven label values are different subareas of machine learning. Each paper is also represented by a bag-of-words vector of length 1433. The network structure is that each paper is cited by, or cites, at least one other paper in the dataset.
The Terrorists dataset is more complicated. For explanations see the paper Entity and relationship labeling in affiliation networks by Zhao, Sen, and Getoor from the 2006 ICML Workshop on Statistical Network Analysis, available at http://www.mindswap.org/papers/2006/RelClzPIT.pdf. According to Table 1 and Section 6 of this paper, there are 917 edges that connect 435 terrorists. There is information about each terrorist, and also about each edge. The edges are labeled with types.
For either dataset, your task is to pretend that some edges are unknown, and then to predict these, based on the known edges and on the nodes. Suppose that you are doing ten-fold cross-validation on the Terrorists dataset. Then you would pretend that 92 actual edges are unknown. Based on the 917 − 92 = 825 known edges, and on all 435 nodes, you would predict a score for each of the 435 · 434/2 potential edges. The higher the score of the 92 held-out edges, the better.
You should use logistic regression to predict the score of each potential edge. When predicting edges, you may use features obtained from (i) the network structure of the edges known to be present and absent, (ii) properties of the nodes, and/or (iii) properties of pairs of nodes. For (i), use a method for converting the network structure into a fixed-length vector of feature values for each node, as discussed in class. Using one or more of the feature types (i), (ii), and (iii) yields seven alternative sets of features. Do experiments comparing at least two of these.

Each experiment should use logistic regression and ten-fold cross-validation. Think carefully about exactly what information is legitimately included in training folds. Also, avoid the mistakes discussed in Section 6.7 above.
CSE 291 Assignment 7 (2011)
This assignment is due at the start of class on Tuesday May 31, 2011.
The goal is to apply and evaluate methods to predict the presence and type of links in a social network. The dataset to use has information about 244 supposed terrorists. The information was compiled originally by the Profiles in Terror project [1]. The dataset [2] was created at the University of Maryland [3] and edited at UCSD by Eric Doi and Ke Tang. Note that the information is far from fully correct or complete.
The dataset has 851 rows and 1227 columns. Each row describes a known link between two terrorists. The first and second columns are id numbers between 1 and 244 for the two terrorists. Of course, an unknown number of additional unknown links exist. The last column of the dataset indicates the type of a known link. The value in the last column is 1 if the two terrorists are members of the same organization, and 0 if they have some other relationship (family members, contacted each other, or used the same facility such as a training camp).
There are 461 positive same-organization links; see Figure 13.2. Columns 3 to 614 contain 612 binary feature values for the person identified by the first column, and columns 615 to 1226 contain values for the same features for the second person. Unfortunately the meaning of these features is not known precisely, and presumably the information is quite incomplete: zero may mean missing more often than it means false; see Figure 13.3. For more details see a paper by Eric Doi [4].
There are two different link prediction tasks for this dataset. One task is to predict the existence of edges, regardless of their type. The other task is to distinguish between same-organization links and links of other types. Note that each task involves only binary classification. For the first task the two classes can be labeled “present” and “absent” while for the second task the classes can be labeled “colleague” and “other.”
When measuring performance for the first task, you can assume that the 851 known edges are the only edges that really exist. Similarly, you can measure performance for the second task using only the edges with known labels 0 and 1. The paper by Eric Doi considers the latter task, for which results are presented in its Figure 6. Design your experiments so that your results can be compared to those in that figure.
For both tasks, do ten-fold cross-validation. For the first task, this means pretending that 85 true edges are unknown. Then, based on the 851 − 85 = 766 known edges, and on all 244 nodes, you predict a score for each of the 244 · 243/2 potential edges. The higher the score of the 85 held-out edges, the better.

[1] www.amazon.com/ProfilesTerrorMiddleTerroristOrganizations/dp/0742535258
[2] cseweb.ucsd.edu/~elkan/291/PITdata.csv
[3] www.cs.umd.edu/~sen/lbcproj/LBC.html
[4] cseweb.ucsd.edu/~elkan/291/Eric.Doi_report.pdf
Use a linear classiﬁer such as logistic regression to compute a score for each
potential edge. When computing scores, you may use features obtained from (i) the
network structure of known and/or unknown edges, (ii) properties of nodes, and (iii)
properties of pairs of nodes. For (i), use matrix factorization to convert the adjacency
matrix (or a transformation of it) into a vector of feature values for each node, as
discussed in class.
While designing your experiments, think carefully about the following issues.
• During crossvalidation, exactly what information is it legitimate to include in
test examples?
• How should you take into account the fact that, almost certainly, some edges
are present but are not among the 851 known edges?
• How should you take into account the fact that many feature values are likely
missing for many nodes?
• How can you avoid the mistakes discussed in Section 6.7 above?
In your report, have separate sections for “Design of experiments” and “Results of experiments.” In the results section, in order to allow comparisons with previous work, do report 0/1 accuracy for the task of distinguishing between same-organization links and links of other types. However, also discuss whether or not this is a good measure of success.
Figure 13.2: Visualization of 461 “same organization” links between 244 supposed terrorists.
Figure 13.3: Visualization of feature values for the 244 supposed terrorists.
Chapter 14
Interactive experimentation
This chapter discusses data-driven optimization of websites, in particular via A/B testing.

A/B testing can reveal “willingness to pay.” Compare profit at prices x and y, based on the number sold, and choose the price that yields higher profit.
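As a sketch (all visitor and purchase counts below are invented), profit per visitor in each arm of the test decides the price:

```python
def profit_per_visitor(price, visitors, purchases, unit_cost=0.0):
    # Observed profit per visitor at this price during the test.
    return (price - unit_cost) * purchases / visitors

# Arm A showed price 100, arm B showed price 120, to 5000 visitors each.
profit_x = profit_per_visitor(price=100.0, visitors=5000, purchases=300)
profit_y = profit_per_visitor(price=120.0, visitors=5000, purchases=230)
best_price = 100.0 if profit_x >= profit_y else 120.0
```

In practice one should also check that the observed difference in profit is statistically significant before committing to a price.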
Quiz
The following text is from a New York Times article dated May 30, 2009.
Mr. Herman had run 27 ads on the Web for his client Vespa, the scooter company. Some were rectangular, some square. And the text varied: One tagline said, “Smart looks. Smarter purchase,” and displayed a $0 down, 0 percent interest offer. Another read, “Pure fun. And function,” and promoted a free T-shirt.

Vespa’s goal was to find out whether a financial offer would attract customers, and Mr. Herman’s data concluded that it did. The $0 down offer attracted 71 percent more responses from one group of Web surfers than the average of all the Vespa ads, while the T-shirt offer drew 29 percent fewer.
(a) What basic principle of the scientific method did Mr. Herman not follow, according to the description above?

(b) Suppose that it is true that the financial offer does attract the highest number of purchasers. What is an important reason why the other offer might still be preferable for the company?

(c) Explain the writing mistake in the phrase “Mr. Herman’s data concluded that it did.” Note that the error is relatively high-level; it is not a spelling or grammar mistake.
Chapter 15
Reinforcement learning
Reinforcement learning is the type of learning done by an agent who is trying to
ﬁgure out a good policy for interacting with an environment. The framework is that
time is discrete, and at each time step the agent perceives the current state of the
environment and selects a single action. After performing the action, the agent earns
a realvalued reward or penalty, time moves forward, and the environment shifts into
a new state. The goal of the agent is to learn a policy for choosing actions that leads
to the best possible longterm total of rewards.
Reinforcement learning is fundamentally different from supervised learning because explicit correct input/output pairs are never given to the agent. That is, the agent is never told what the optimal action is in a given state. Nor is the agent ever told what its future rewards will be. However, at each time step the agent does find out what its current short-term reward is.
Reinforcement learning has real-world applications in domains where there is an intrinsic tradeoff between short-term and long-term rewards and penalties, and where there is an environment that can be modeled as unknown and random. Successful application areas include controlling robots, playing games, scheduling elevators, treating cancer, and managing relationships with customers [Abe et al., 2002, Simester et al., 2006]. In the latter three examples, the environment consists of humans, who are modeled collectively as acting probabilistically. More specifically, each person is a different instantiation of the environment. The environment is considered to be a set of states. Different people who happen to be in the same state are assumed to have the same probability distribution of reactions.
Reinforcement learning is appropriate for domains where the agent faces a stochastic but non-adversarial environment. It is not suitable for domains where the agent faces an intelligent opponent, because in adversarial domains the opponent does not change state probabilistically. For example, reinforcement learning methods are not appropriate for learning to play chess. However, reinforcement learning can be successful for learning to play games that involve both strategy and chance, such as backgammon, if it is reasonable to model the opponent as following a randomized strategy. Game theory is the appropriate formal framework for modeling agents interacting with intelligent opponents. Of course, finding good algorithms is even harder in game theory than in reinforcement learning.
A fundamental distinction in reinforcement learning is whether learning is online or offline. In offline learning, the agent is given a database of recorded experience interacting with the environment. The job of the agent is to learn the best possible policy for directing future interaction with the same environment. Note that this is different from learning to predict the rewards achieved by following the policy used to generate the recorded experience. The learning that is needed is called off-policy, since it is about a policy that is different from the one used to generate the training data.

In online learning, the agent must balance exploration, which lets it learn more about the environment, with exploitation, where it uses its current estimate of the best policy to earn rewards. Most research on reinforcement learning has concerned online learning, but offline learning is more useful for many applications.
15.1 Markov decision processes
The basic reinforcement learning scenario is formalized using three concepts:
1. a set S of potential states of the environment,
2. a set A of potential actions of the agent, and
3. rewards that are real numbers.
At each time t, the agent perceives the state s_t ∈ S and its available actions A. The agent chooses one action a ∈ A and receives from the environment the reward r_t and then the new state s_{t+1}.

Formally, the environment is represented as a so-called Markov decision process (MDP). An MDP is a four-tuple ⟨S, A, p(·|·,·), r(·,·)⟩ where

• S is the set of states, which is also called the state space,

• A is the set of actions,

• r(s, a) is the immediate reward received after doing action a from state s, and

• p(s_{t+1} = s′ | s_t = s, a_t = a) is the probability that action a in state s at time t will lead to state s′ at time t + 1.
The states of an MDP possess the Markov property. This means that if the current
state of the MDP at time t is known, then transitions to a new state at time t + 1 are
independent of all previous states. In a nutshell, history is irrelevant.
Some extensions of the MDP framework are possible without changing the nature of the problem fundamentally. In particular, the reward may be a function of the next state as well as of the current state and action, that is, the reward can be a function r(s, a, s′). Or, for maximal mathematical simplicity, the reward can be a function just of the current state, r(s). Or, the MDP definition can be extended to allow r to be stochastic as opposed to deterministic. In this case, p(r | s, a, s′) is the probability that the immediate reward is r, given state s, action a, and next state s′. It is crucial that the probability distributions of state transitions and rewards are stationary: transitions and rewards may be stochastic, but their probabilities are the same at all times.
The goal of the agent is to maximize a cumulative function R of the rewards. If the maximum time step is t = T where T is finite, then the goal can be to maximize R = r_0 + r_1 + . . . + r_T. If the maximum time horizon is infinite, then the goal is to maximize a discounted sum that involves a discounting factor γ, which is usually a number just under 1:

R = Σ_{t=0}^{∞} γ^t r(s_t, a_t).

The factor γ makes the sum above finite even though it is a sum over an infinite number of time steps.
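For a finite sequence of observed rewards, the discounted total is a one-liner (a sketch; the reward sequence below is made up):

```python
def discounted_return(rewards, gamma=0.9):
    # R = sum over t of gamma^t * r_t
    return sum(gamma ** t * r for t, r in enumerate(rewards))

R = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25
```

With rewards bounded by some constant c, the infinite discounted sum is bounded by c / (1 − γ), which is why the sum stays finite.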
In order to maximize its total reward, the agent must acquire a policy, i.e. a mapping π : S → A that specifies which action to take for every state. An optimal policy π* is one that maximizes the expected value of R, where the expectation is based on the transition probability distribution p(s_{t+1} = s′ | s_t = s, a_t = π*(s_t)).

Given an MDP, an optimal policy is sometimes called a solution of the MDP. If the time horizon is infinite, then optimal policies have the Markov property also: the best action to choose depends on the current state only, regardless of prior history. With a finite time horizon, the best action may be a function π*(s, t) of the current time t as well as the current state s.
15.2 RL versus costsensitive learning
Remember that cost-sensitive learning (CSL) is rather a misnomer; a more accurate name is cost-sensitive decision-making. The differences between CSL and RL include the following.
• In CSL, the customer is assumed to be in a certain state, and then a single characteristic becomes true; this is the label to be predicted. In RL, every aspect of the state of the customer can change from one time point to the next.
• In CSL, the customer’s label is assumed to be independent of the agent’s prediction. In RL, the next state depends on the agent’s action.

• In CSL, the agent chooses an action based on separate estimates of the customer’s label and the immediate benefit to the agent of each label. In RL, the agent chooses an action to maximize long-term benefit.

• In CSL, long-term effects of actions can only be taken into account heuristically. In RL, these are learned; consider priming and annoyance for example.
• In CSL, the beneﬁt must be estimated for every possible action. But the actual
beneﬁt is observed only for one action.
• In CSL, a customer is typically represented by a realvalued vector. In RL, the
state of a customer is typically one of a number of discrete classes.
Both CSL and RL face the issue of possibly inadequate exploration. Suppose that
the agent learns in 2010 never to give a loan to owners of convertibles, based on data
from 2009. Then in 2011, when the agent wants to learn from 2010 data, it will have
no evidence that owners of convertibles are bad risks.
15.3 Algorithms to ﬁnd an optimal policy
If the agent knows the transition probabilities p(s′|s, a) of the MDP, then it can calculate an optimal policy; no learning is required. To see how to find the optimal policy, consider a real-valued function U of states called the utility function. Intuitively, U^π(s) is the expected total reward achieved by starting in state s and acting according to the policy π. Recursively,

U^π(s) = r(s, π(s)) + γ Σ_{s′} p(s′|s, π(s)) U^π(s′).

The equation above is called the Bellman equation. The Bellman optimality equation describes the optimal policy:

U*(s) = max_a [r(s, a) + γ Σ_{s′} p(s′|s, a) U*(s′)].

Let the function V be an approximation of the function U*. We can get an improved approximation by picking any state s and doing the update

V(s) := max_a [r(s, a) + γ Σ_{s′} p(s′|s, a) V(s′)].   (15.1)

The value iteration algorithm of Bellman (1957) is simply to do this single step repeatedly for all states s in any order. If the state space S is finite, then V can be stored as a vector and it can be initialized to be V(s) = 0 for each state s.
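Value iteration is only a few lines of code. In the sketch below, the states, actions, transition probabilities, and rewards of the toy MDP are all invented for illustration:

```python
S = ["poor", "rich"]
A = ["save", "spend"]
# p[(s, a)] maps each next state s2 to p(s2 | s, a).
p = {("poor", "save"):  {"poor": 0.5, "rich": 0.5},
     ("poor", "spend"): {"poor": 1.0, "rich": 0.0},
     ("rich", "save"):  {"poor": 0.0, "rich": 1.0},
     ("rich", "spend"): {"poor": 0.5, "rich": 0.5}}
r = {("poor", "save"): 0.0, ("poor", "spend"): 1.0,
     ("rich", "save"): 2.0, ("rich", "spend"): 3.0}
gamma = 0.9

V = {s: 0.0 for s in S}
for _ in range(200):  # repeat the one-step update (15.1) for all states
    V = {s: max(r[s, a] + gamma * sum(q * V[s2]
                                      for s2, q in p[s, a].items())
                for a in A)
         for s in S}
```

After enough sweeps, one further application of the update changes V only negligibly, which is the practical sign of convergence.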
Lemma: The optimal utility function U* is unique, and the V function converges to U* regardless of how V is initialized.

Proof: Assume that the state space S is finite, so the function V(s) can be viewed as a vector. Let V_1 and V_2 be two such vectors, and let V′_1 and V′_2 be the results of updating V_1 and V_2. Consider the maximum discrepancy between V′_1 and V′_2:

d = max_s |V′_1(s) − V′_2(s)|
  = max_s |max_{a_1} [r(s, a_1) + γ Σ_{s′} p(s′|s, a_1) V_1(s′)] − max_{a_2} [r(s, a_2) + γ Σ_{s′} p(s′|s, a_2) V_2(s′)]|.

This maximum discrepancy goes to zero.
At any point in the algorithm, a current guess for the optimal policy can be computed from the approximate utility function V as

π(s) = argmax_a [r(s, a) + γ Σ_{s′} p(s′|s, a) V(s′)].   (15.2)

Note that (15.2) is identical to (15.1) except for the argmax operator in the place of the max operator. An important fact about the value iteration method is that π(s) may converge to the optimal policy long before V converges to the true U*.
An alternative algorithm called policy iteration is due to Howard (1960). This method repeats two steps until convergence:

1. Compute V exactly based on π.

2. Update π based on V.

Step 2 is performed using Equation (15.2). Computing V exactly in Step 1 is formulated and solved as a set of simultaneous linear equations. There is one equation for each s in S:

V(s) = r(s, π(s)) + γ Σ_{s′} p(s′|s, π(s)) V(s′).

Policy iteration has the advantage that there is a clear stopping condition: when the function π does not change after performing Step 2, then the algorithm terminates.
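A sketch of policy iteration on a toy MDP follows. All quantities are invented; as a simplification, Step 1 here evaluates V by iterating the linear system to convergence instead of solving the equations directly.

```python
S = ["poor", "rich"]
A = ["save", "spend"]
p = {("poor", "save"):  {"poor": 0.5, "rich": 0.5},
     ("poor", "spend"): {"poor": 1.0, "rich": 0.0},
     ("rich", "save"):  {"poor": 0.0, "rich": 1.0},
     ("rich", "spend"): {"poor": 0.5, "rich": 0.5}}
r = {("poor", "save"): 0.0, ("poor", "spend"): 1.0,
     ("rich", "save"): 2.0, ("rich", "spend"): 3.0}
gamma = 0.9

def evaluate(pi):
    # Step 1: V(s) = r(s, pi(s)) + gamma * sum_s' p(s'|s, pi(s)) V(s')
    V = {s: 0.0 for s in S}
    for _ in range(500):
        V = {s: r[s, pi[s]] + gamma * sum(q * V[s2]
             for s2, q in p[s, pi[s]].items()) for s in S}
    return V

def improve(V):
    # Step 2: greedy policy with respect to V, as in Equation (15.2)
    return {s: max(A, key=lambda a: r[s, a] + gamma *
                   sum(q * V[s2] for s2, q in p[s, a].items()))
            for s in S}

pi = {s: "spend" for s in S}   # arbitrary initial policy
while True:
    new_pi = improve(evaluate(pi))
    if new_pi == pi:           # stopping condition: policy unchanged
        break
    pi = new_pi
```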
15.4 Q functions
An agent can execute the algorithms from the previous section, i.e. value iteration
and policy iteration, only if it knows the transition probabilities of the environment.
If these probabilities are unknown, then the agent must learn them at the same time
it is trying to discover an optimal policy. This learning problem is obviously more
difﬁcult. It is called reinforcement learning.
Rather than learn transition probabilities or an optimal policy directly, some of the most successful RL methods learn a so-called Q function. A Q function is an extended utility function. Specifically, the Q^π function is like the U^π function except that it has two arguments and it only follows the policy π starting in the next state:

Q^π(s, a) = r(s, a) + γ Σ_{s′} p(s′|s, a) Q^π(s′, π(s′)).

In words, the Q^π function is the value of taking the action a and then continuing according to the policy π. Given a Q function, the recommended action for a state s is â = π_Q(s) = argmax_a Q(s, a).
When using a Q function, at each time step the action a is determined by a max
imization operation that is a function only of the current state s. From this point
of view, at each time step actions seem to be chosen in a greedy way, ignoring the
future. The key that makes the approach in fact not greedy is that the Q function is
learned. The learned Q function emphasizes aspects of the state that are predictive of
longterm reward.
15.5 The Q-learning algorithm
Experience during learning consists of ⟨s, a⟩ pairs together with an observed reward r and an observed next state s′. That is, the agent was in state s, it tried doing a, it received r, and s′ happened. The agent can use each such fragment of experience to update the Q function directly. For any state s_t and any action a_t, the agent can update the Q function as follows:

Q(s_t, a_t) := (1 − α) Q(s_t, a_t) + α [r_t + γ max_a Q(s_{t+1}, a)].

In this update rule, r_t is the observed immediate reward, α is a learning rate such that 0 < α < 1, and γ is the discount rate.
Assume that the state space and the action space are ﬁnite, so we can use a two
dimensional table to store the Q function. The Q learning algorithm is as follows:
1. Initialize Q values, and start in some state s
0
.
2. Given the current state s, choose an action a.
3. Observe immediate reward r and the next state s′ = s_{t+1}.
.
4. Update Q(s, a) using the update rule above.
5. Continue from Step 2.
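The steps above can be sketched as a short tabular implementation. This is an illustrative sketch, not part of the text: the function name, the fixed learning rate, and the ε-greedy action choice in Step 2 are assumptions; `step(s, a)` is a hypothetical interface to the environment that returns the observed reward and next state.

```python
import random

def q_learning(n_states, n_actions, step, n_steps=10000,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning. `step(s, a)` returns (reward, next_state)."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]  # Step 1: initialize Q
    s = 0                                             # start in some state s0
    for _ in range(n_steps):
        # Step 2: choose an action (here: epsilon-greedy, one possible strategy).
        if rng.random() < epsilon:
            a = rng.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda b: Q[s][b])
        # Step 3: observe immediate reward and next state.
        r, s_next = step(s, a)
        # Step 4: Q(s,a) := (1 - alpha) Q(s,a) + alpha [r + gamma max_a' Q(s',a')]
        Q[s][a] = (1 - alpha) * Q[s][a] + alpha * (r + gamma * max(Q[s_next]))
        s = s_next                                    # Step 5: continue
    return Q
```

On a toy two-state chain where action 1 in state 0 yields reward 1, the learned table prefers action 1 in state 0, as expected.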
This algorithm is guaranteed to converge to optimal Q values, if (i) the environment
is stationary, (ii) the environment satisﬁes the Markov assumption, (iii) each state
is visited repeatedly, (iv) each action is tried in each state repeatedly, and (v) the
learning rate parameter decreases over time. All these assumptions are reasonable.
However, the algorithm may converge extremely slowly.
The learning algorithm does not say which action to select at each step. We need
a separate so-called exploration strategy. This strategy must trade off exploration,
that is trying unusual actions, with exploitation, which means preferring actions with
high estimated Q values.
An ε-greedy policy selects with probability 1 − ε the action that is optimal
according to the current Q function, and with probability ε it chooses an action from
a uniform distribution over all other actions. In other words, with probability ε the
policy does exploration, while with probability 1 − ε it does exploitation.
A softmax policy chooses action a_i with probability proportional to exp(b Q(s, a_i)),
where b is a constant. This means that actions with higher estimated Q values are
more likely to be chosen (exploitation), but every action always has a nonzero
chance of being chosen (exploration). Smaller values of b make the softmax strategy
explore more, while larger values make it do more exploitation.
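Both exploration strategies fit in a few lines. This is a sketch under the assumption that the Q values for the current state are given as a plain list with at least two actions; the function names are illustrative.

```python
import math
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability 1 - epsilon pick the greedy action; with probability
    epsilon pick uniformly among the *other* actions (assumes >= 2 actions)."""
    best = max(range(len(q_values)), key=lambda a: q_values[a])
    if rng.random() < 1 - epsilon:
        return best
    others = [a for a in range(len(q_values)) if a != best]
    return rng.choice(others)

def softmax_policy(q_values, b, rng=random):
    """Pick action a_i with probability proportional to exp(b * Q(s, a_i))."""
    m = max(q_values)  # subtract the max before exponentiating, for stability
    weights = [math.exp(b * (q - m)) for q in q_values]
    return rng.choices(range(len(q_values)), weights=weights, k=1)[0]
```

With ε = 0 the first policy is purely greedy; with a large b the second policy almost always exploits, illustrating how b controls the exploration/exploitation balance.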
Q-learning has at least two major drawbacks. First, there is no known general
method for choosing a good learning rate α. Second, if Q(s, a) is approximated by
a continuous function, for example a neural network, then when the value of Q(s, a)
is updated for one ⟨s, a⟩ pair, the Q values of other state-action pairs are changed in
unpredictable ways.
The Q-learning algorithm is designed for online learning. If historical training
data is available, Q-learning can be applied, but the next section explains a better
alternative.
15.6 Fitted Q iteration
As mentioned, in the batch setting for RL, an optimal policy or Q function must be
learned from historical data [Neumann, 2008]. A single historical training example
is a quadruple ⟨s, a, r, s′⟩ where s is a state, a is the action actually taken in this state,
r is the immediate reward that was obtained, and s′ is the state that was observed to
come next. The method named Q iteration [Murphy, 2005, Ernst et al., 2005] is the
following algorithm:
    Define Q_0(s, a) = 0 for all s and a.
    For horizon h = 0, 1, 2, . . .
        For each example ⟨s, a, r, s′⟩
            let v = r + γ max_b Q_h(s′, b)
        Train Q_{h+1} with examples ⟨(s, a), v⟩.
The variable h is called the horizon, because it counts how many steps of lookahead
are implicit in the learned Q function. In some applications, in particular medical
clinical trials, the maximum horizon is a small fixed integer. In these cases the
discount factor γ can be set to one.
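The loop above can be sketched generically, with the supervised learner passed in as a function. This is an illustrative sketch, not the text's implementation: the `train` interface and the table-based learner in the usage note are assumptions.

```python
def fitted_q_iteration(data, n_actions, train, gamma=0.9, horizon=10):
    """data: list of (s, a, r, s_next) quadruples (historical experience).
    train(X, y): supervised learner that returns a function q(s, a).
    Returns the Q function fitted at the final horizon."""
    def q(s, a):
        return 0.0                       # Q_0(s, a) = 0 for all s and a
    for _ in range(horizon):             # horizon h = 0, 1, 2, ...
        X, y = [], []
        for (s, a, r, s_next) in data:   # for each example <s, a, r, s'>
            v = r + gamma * max(q(s_next, b) for b in range(n_actions))
            X.append((s, a))
            y.append(v)
        q = train(X, y)                  # fit Q_{h+1} to examples <(s, a), v>
    return q
```

For discrete states, `train` can simply memorize a table of targets; with continuous states it would be a regression learner, as the following sections discuss.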
Fitted Q iteration has significant advantages over Q-learning. No learning rate is
needed, and all training Q values are fitted simultaneously, so the underlying
supervised learning algorithm can make sure that all of them are fitted reasonably well.
Another major advantage of fitted Q iteration, compared to other methods for
offline reinforcement learning, is that the historical training data can be collected
in any way that is convenient. There is no need to collect trajectories starting at
any specific states. Indeed, there is no need to collect trajectories of consecutive
states, that is, of the form ⟨s_1, a_1, r_1, s_2, a_2, r_2, s_3, . . .⟩. And, there is no need to know
the policy π that was followed to generate historical episodes of the form ⟨s, a =
π(s), r, s′⟩. Q iteration is an off-policy method, meaning that data collected while
following one or more arbitrary non-optimal policies can be used to learn a better
policy.
Off-policy methods for RL face the problem of bias in sample selection. The
probability distribution of training examples ⟨s, a⟩ is different from the probability
distribution of optimal examples ⟨s, π*(s)⟩. The fundamental reason why Q iteration
is correct as an off-policy method is that it is discriminative as opposed to generative.
Q iteration does not model in any way the distribution of state-action pairs ⟨s, a⟩.
Instead, it uses a purely discriminative method to learn only how the values to be
predicted depend on s and a.
15.7 Representing continuous states and actions
Requiring that actions and states be discrete is restrictive. To be useful for many
applications, a reinforcement learning algorithm should allow states and actions to
be represented as real-valued vectors. For example, in many business domains the
state is essentially the current characteristics of a customer, who is represented by a
high-dimensional real-valued vector [Simester et al., 2006].
Using the fitted Q iteration algorithm, or many other methods, there are two basic
operations that need to be performed with a Q function. First, finding optimal actions
should be efficient. Second, the Q function must be learnable from training examples.
Suppose that a Q function is written as Q(s, a) = s^T W a. This is called a bilinear
representation. It has important advantages. Given s, the optimal action is

    a* = argmax_b Q(s, b) = argmax_b x · b

where x = s^T W. This maximization, subject to linear restrictions on the
components of b, is a linear programming (LP) task. Hence, it is tractable in practice, even
for high-dimensional action vectors. The linear constraints may be lower and upper
bounds for components (subactions) of b. They may also involve multiple
components of b, for example a budget limit in combination with different costs for different
subactions.¹
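For the special case where the only constraints are per-component bounds, the LP decouples and can be solved directly, which makes the maximization easy to illustrate. The function name and the list-of-rows representation of W below are assumptions; constraints that couple components, such as a budget, require a genuine LP solver (for example scipy.optimize.linprog).

```python
def optimal_action(s, W, bounds):
    """Maximize Q(s, b) = s^T W b subject to low_j <= b_j <= high_j only.
    With pure box constraints the LP decouples: each component b_j goes to
    its upper bound if x_j > 0 and to its lower bound otherwise, where
    x = s^T W. (Coupled constraints need a real LP solver.)"""
    n = len(W[0])
    # x_j = sum_i s_i * W[i][j], i.e. the row vector x = s^T W.
    x = [sum(s_i * row[j] for s_i, row in zip(s, W)) for j in range(n)]
    return [high if xj > 0 else low for xj, (low, high) in zip(x, bounds)]
```

This also makes concrete the "bang-bang" behavior discussed in the footnote: every component of the returned action sits at an extreme of its feasible interval.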
With the bilinear approach, the Q function is learnable in a straightforward way
by standard linear regression. Let the vectors s and a be of length m and n
respectively, so that W ∈ R^{m×n}. The key is to notice that

    s^T W a = Σ_{i=1}^{m} Σ_{j=1}^{n} (W ∘ s a^T)_{ij} = vec(W) · vec(s a^T).    (15.3)
In this equation, s a^T is the matrix that is the outer product of the vectors s and a, and
∘ denotes the elementwise product of two matrices, which is sometimes called the
Hadamard product. The notation vec(A) means the matrix A converted into a vector
by concatenating its columns. Based on Equation (15.3), each training triple ⟨s, a, v⟩
can be converted into the pair ⟨vec(s a^T), v⟩ and vec(W) can be learned by standard
linear regression.
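This conversion can be sketched directly with numpy; the function name is illustrative, and ordinary least squares stands in for whichever regression method is preferred.

```python
import numpy as np

def fit_bilinear_q(S, A, v):
    """Learn W in Q(s, a) = s^T W a from N training triples.
    S: (N, m) array of states; A: (N, n) array of actions; v: (N,) targets.
    Uses s^T W a = vec(W) . vec(s a^T) and ordinary least squares."""
    N, m = S.shape
    n = A.shape[1]
    # Each row of X is vec(s a^T): columns of the outer product concatenated
    # (column-major order, matching the definition of vec in the text).
    X = np.stack([np.outer(s, a).ravel(order="F") for s, a in zip(S, A)])
    w, *_ = np.linalg.lstsq(X, v, rcond=None)
    return w.reshape((m, n), order="F")  # recover W from vec(W)
```

On noiseless synthetic data generated from a known W, least squares recovers the matrix exactly, confirming that Equation (15.3) linearizes the problem.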
The bilinear representation of Q functions contains an interaction term for every
component of the action vector and of the state vector. Hence, it can represent the
consequences of each action component as depending on each aspect of the state.
¹ The LP approach can handle constraints that combine two or more components of the action
vector. Given constraints of this nature, the optimal value for each action component is in general not
simply its lower or upper bound. Even when optimal action values are always lower or upper bounds,
the LP approach is far from trivial in high-dimensional action spaces: it finds the optimal value for
every action component in polynomial time, whereas naive search might need O(2^d) time where d is
the dimension of the action space.
The solution to an LP problem is always a vertex of the feasible region. Therefore, the
maximization operation always selects a so-called "bang-bang" action vector, each component of which is an
extreme value. In many domains, actions that change smoothly over time are desirable. In at least some
domains, the criteria to be minimized or maximized to achieve desirably smooth solutions are known,
for example the "minimum jerk" principle [Viviani and Flash, 1995]. These criteria can be included as
additional penalties in the maximization problem to be solved to find the optimal action. If the criteria
are nonlinear, the maximization problem will be harder to solve, but it may still be convex, and it will
still be possible to learn a bilinear Q function representing how long-term reward depends on the current
state and action vectors.
Some other representations for Q functions are less satisfactory. The simplest
representation, as mentioned above, is tabular: a separate value is stored for each
combination of a state value s and an action value a. This representation is usable only
when both the state space and the action space are discrete and of low cardinality.
A general representation is linear, using fixed basis functions. In this representation
Q(s, a) = w · [φ_1(s, a), . . . , φ_m(s, a)] where φ_1 to φ_m are fixed real-valued functions
and w is a vector of length m. The bilinear approach is a special case of this
representation where each basis function is the product of a state component and an action
component. An intuitive drawback of other cases of this representation is that they
ignore the distinction between states and actions: essentially, s and a are concatenated
as inputs to the basis functions φ_i. A computational drawback is that, in general, no
efficient way of performing the maximization operation argmax_b Q(s, b) is known,
whether or not there are constraints on feasible action vectors b. Other common
representations are neural networks [Riedmiller, 2005] and ensembles of decision trees
[Ernst et al., 2005]; these have the same difficulty.
Note that any fixed format for Q functions, including a bilinear representation, in
general makes the whole RL process suboptimal, since the true optimization problem
that represents long-term reward may not be expressible in this format.
15.8 Inventory management applications
In inventory management applications, at each time step the agent sees a demand
vector and/or a supply vector for a number of products. The agent must decide how to
satisfy each demand in order to maximize long-term benefit, subject to various rules
about the substitutability and perishability of products. Managing the stocks of a
blood bank is a typical application of this type. As a concrete illustration, this section
describes results using the specific formulation used in [Yu, 2007] and Chapter 12 of
[Powell, 2007].
The blood bank stores blood of eight types: AB+, AB−, A+, A−, B+, B−, O+, and
O−. Each period, the manager sees a certain level of demand for each type, and a
certain level of supply. Some types of blood can be substituted for some others; 27
of the 64 conceivable substitutions are allowed. Blood that is not used gets older, and
beyond a certain age must be discarded. With three discrete ages, the inventory of
the blood bank is a vector in R^{24}. The state of the system is this vector concatenated
with the 8-dimensional demand vector and a unit constant component. The action
vector has 3 × 27 + 1 = 82 dimensions since there are 3 ages, 27 possible allocations,
and a unit constant pseudo-action. Each component of the action vector is a quantity
of blood of one type and age, used to meet demand for blood of the same or another
type. The LP solved to determine the action at each time step has learned coefficients
in its objective function, but fixed constraints, such as that the total quantity supplied
from each of the 24 stocks must not be more than the current stock amount.

Table 15.1: Short-term reward functions for blood bank management.

                                 [Yu, 2007]    new
    supply exact blood type          50          0
    substitute O blood               60          0
    substitute other type            45          0
    fail to meet demand               0        −60
    discard blood                   −20          0
For training and for testing, a trajectory is a series of periods. In each period,
the agent sees a demand vector drawn randomly from a specific distribution. Then,
blood remaining in stock is aged by one period, blood that is too old is discarded,
and fresh blood arrives according to a different probability distribution. Training is
based on 1000 trajectories of length 10 periods in which the agent follows a greedy
policy.² Testing is based on trajectories of the same length where supply and demand
vectors are drawn from the same distributions, but the agent follows a learned policy.
In many real-world situations the objectives to be maximized, both short-term
and long-term, are subject to debate. Intuitive measures of long-term success may
not be mathematically consistent with intuitive immediate reward functions, or even
with the MDP framework. The blood bank scenario illustrates this issue.
Table 15.1 shows two different immediate reward functions. Each entry is the
benefit accrued by meeting one unit of demand with supply of a certain nature. The
middle column is the short-term reward function used in previous work, while the
right column is an alternative that is more consistent with the long-term evaluation
measure used previously (described below).
For a blood bank, the percentage of unsatisfied demand is an important long-term
measure of success [Yu, 2007]. If a small fraction of demand is not met, that
is acceptable because all high-priority demands can still be met. However, if more
than 10% of demand for any type is not met in a given period, then some patients
may suffer seriously. Therefore, we measure both the average percentage of unmet
demand and the frequency of unmet demand over 10%. The A+ blood type is an
especially good indicator of success, because the discrepancy between supply and
demand is greatest for it: on average, 34.00% of demand but only 27.94% of supply
is for A+ blood.

² The greedy policy is the policy that supplies blood to maximize one-step reward, at each time
step. Many domains are like the blood bank domain in that the optimal one-step policy, also called
myopic or greedy, can be formulated directly as a linear programming problem with known coefficients.
With a bilinear representation of the Q function, coefficients are learned that formulate the long-term
optimal policy as a problem in the same class. The linear representation is presumably not sufficiently
expressive to represent the truly optimal long-term policy, but it is expressive enough to represent a
policy that is better than the greedy one.

Table 15.2: Success of alternative learned policies.

                                  average unmet    frequency of severe
                                  A+ demand        unmet A+ demand
    greedy policy                     7.3%              46%
    bilinear Q iteration (i)         18.4%              29%
    bilinear Q iteration (ii)         7.55%             12%
Table 15.2 shows the performance of the greedy policy and of policies learned
by fitted Q iteration using the two immediate reward functions of Table 15.1. Given
the immediate reward function (i), fitted Q iteration satisfies on average less of the
demand for A+ blood. However, this reward function does not include an explicit
penalty for failing to meet demand. The bilinear method correctly maximizes
(approximately) the discounted long-term average of immediate rewards. The last
column of Table 15.1 shows a different immediate reward function that does penalize
failures to meet demand. With this reward function, fitted Q iteration learns a policy
that more than cuts in half the frequency of severe failures.
Bibliography
[Abe et al., 2002] Abe, N., Pednault, E., Wang, H., Zadrozny, B., Fan, W., and Apte, C. (2002). Empirical comparison of various reinforcement learning strategies for sequential targeted marketing. In Proceedings of the IEEE International Conference on Data Mining, pages 3–10. IEEE.

[Dietterich, 2009] Dietterich, T. G. (2009). Machine learning and ecosystem informatics: Challenges and opportunities. In Proceedings of the 1st Asian Conference on Machine Learning (ACML), pages 1–5. Springer.

[Džeroski et al., 2001] Džeroski, S., De Raedt, L., and Driessens, K. (2001). Relational reinforcement learning. Machine Learning, 43(1):7–52.

[Ernst et al., 2005] Ernst, D., Geurts, P., and Wehenkel, L. (2005). Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(1):503–556.

[Gordon, 1995a] Gordon, G. J. (1995a). Stable fitted reinforcement learning. In Advances in Neural Information Processing Systems (NIPS), pages 1052–1058.

[Gordon, 1995b] Gordon, G. J. (1995b). Stable function approximation in dynamic programming. In Proceedings of the International Conference on Machine Learning (ICML), pages 261–268.

[Huang et al., 2006] Huang, J., Smola, A., Gretton, A., Borgwardt, K. M., and Schölkopf, B. (2006). Correcting sample selection bias by unlabeled data. In Proceedings of the Neural Information Processing Systems Conference (NIPS 2006).

[Jonas and Harper, 2006] Jonas, J. and Harper, J. (2006). Effective counterterrorism and the limited role of predictive data mining. Technical report, Cato Institute. Available at http://www.cato.org/pub_display.php?pub_id=6784.

[Judd and Solnick, 1994] Judd, K. L. and Solnick, A. J. (1994). Numerical dynamic programming with shape-preserving splines. Unpublished paper from the Hoover Institution available at http://bucky.stanford.edu/papers/dpshape.pdf.

[Michie et al., 1994] Michie, D., Spiegelhalter, D. J., and Taylor, C. C. (1994). Machine Learning, Neural and Statistical Classification. Ellis Horwood.

[Murphy, 2005] Murphy, S. A. (2005). A generalization error for Q-learning. Journal of Machine Learning Research, 6:1073–1097.

[Neumann, 2008] Neumann, G. (2008). Batch-mode reinforcement learning for continuous state spaces: A survey. ÖGAI Journal, 27(1):15–23.

[Powell, 2007] Powell, W. B. (2007). Approximate Dynamic Programming. John Wiley & Sons, Inc.

[Riedmiller, 2005] Riedmiller, M. (2005). Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In Proceedings of the 16th European Conference on Machine Learning (ECML), pages 317–328.

[Simester et al., 2006] Simester, D. I., Sun, P., and Tsitsiklis, J. N. (2006). Dynamic catalog mailing policies. Management Science, 52(5):683–696.

[Stachurski, 2008] Stachurski, J. (2008). Continuous state dynamic programming via nonexpansive approximation. Computational Economics, 31(2):141–160.

[Tang and Liu, 2009a] Tang, L. and Liu, H. (2009a). Relational learning via latent social dimensions. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 817–826. ACM.

[Tang and Liu, 2009b] Tang, L. and Liu, H. (2009b). Scalable learning of collective behavior based on sparse social dimensions. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM), pages 1107–1116. ACM.

[Vert and Jacob, 2008] Vert, J.-P. and Jacob, L. (2008). Machine learning for in silico virtual screening and chemical genomics: New strategies. Combinatorial Chemistry & High Throughput Screening, 11(8):677–685.

[Viviani and Flash, 1995] Viviani, P. and Flash, T. (1995). Minimum-jerk, two-thirds power law, and isochrony: converging approaches to movement planning. Journal of Experimental Psychology, 21:32–53.

[Yu, 2007] Yu, V. (2007). Approximate dynamic programming for blood inventory management. Honors thesis, Princeton University.

[Zhu and Goldberg, 2009] Zhu, X. and Goldberg, A. B. (2009). Introduction to semi-supervised learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 3(1):1–130.
This pattern may be noteworthy and interesting. the goal of descriptive analytics is to discover patterns in data. We shall take it to mean the application of learning algorithms and statistical methods to realworld datasets. customers who are more likely 5 . The goal will then be to predict which other customers might fail to pay in the future. Finding patterns is often fascinating and sometimes highly useful. and it should aim its marketing at a currently less tapped. different. but what should Whole Foods do with the ﬁnding? Often. There are numerous data mining domains in science. perhaps Whole Foods at should direct its marketing towards additional wealthy and liberal people. For example. engineering. along with labels indicating which customers failed to pay their bills. This area is often also called “knowledge discovery in data” or KDD. For example. For example. predictions can be typically be used directly to make decisions that maximize beneﬁt to the decisionmaker.Chapter 1 Introduction There are many deﬁnitions of data mining. suppose that customers of Whole Foods tend to be liberal and wealthy. In general. Or perhaps that demographic is saturated. The focus will be on methods for making predictions. but in general it is harder to obtain direct beneﬁt from descriptive analytics than from predictive analytics. the ﬁnding is really not useful in the absence of additional knowledge. In such a case. Predictive analytics indicates a focus on making predictions. but contradictory. We shall focus on applications that are related to business. the available data may be a customer database. analytics is a newer name for data mining. The main alternative to predictive analytics can be called descriptive analytics. the same ﬁnding suggests two courses of action that are both plausible. business. but the methods that are most useful are mostly the same for applications in science or engineering. and elsewhere where data mining is useful. In a nutshell. For example. 
group of people? In contrast.
increased efﬁciency often comes from improved accuracy in targeting. 1. Second. Second. one cannot make progress without a dataset for training of adequate size and quality. which beneﬁts the people being targeted. Fico can also help lenders ﬁnd borrowers that will best respond to modiﬁcations and learn how to get in touch with them. 2009: Fico.1 Limitations of predictive analytics It is important to understand the limitations of predictive analytics. so FICO did not have relevant data. because it is hard to see how a useful labeled training set could exist. Society can use the tax system to spread the beneﬁt of increased proﬁt. in general. lenders had no long historical experience with offering modiﬁcations to borrowers. Moreover. Consider for example this extract from an article in the London Financial Times dated May 13. maximizing proﬁt in general is maximizing efﬁciency. that is on reading the minds of borrowers. but who would pay if given a modiﬁed contract. Businesses have no motive to send advertising to people who will merely be annoyed and not respond. recently launched a service that prequaliﬁes borrowers for modiﬁcation programmes using their inhouse scoring data. The target concept is “borrowers that will best respond to modiﬁcations. It is important to understand the difference between a prediction and a decision. INTRODUCTION not to pay in the future can have their credit limit reduced now. and to have historical examples of the concept. but predictions are useful to an agent only if they allow the agent to make decisions that have better outcomes. the company behind the credit score. it is crucial to have a clear deﬁnition of the concept that is to be predicted.6 CHAPTER 1. Data mining lets us make predictions.” From a lender’s perspective (and Fico works for lenders not borrowers) such a borrower is one who would not pay under his current contract. 
Lenders pay a small fee for Fico to refer potential candidates for modiﬁcations that have already been vetted for inclusion in the programme. There are several responses to this feeling. and may not beneﬁt society at large. Data mining cannot read minds. For a successful data mining application. It is hard to see how this could be a successful application of data mining. Especially in 2009. the actions to be taken based on predictions need to be deﬁned clearly and to have reliable proﬁt consequences. After all. First. the deﬁnition of the target is based on a counterfactual. maximizing proﬁt for a business may be at the expense of consumers. The . First. Some people may feel that the focus in this course on maximizing proﬁt is distasteful or disquieting.
The difference between a regular payment and a modified payment is often small, for example $200 in the case described in the newspaper article. It is not clear that giving people modifications will really change their behavior dramatically. Moreover, for predictive analytics to be successful, the training data must be representative of the test data. Typically, the training data come from the past, while the test data arise in the future. If the phenomenon to be predicted is not stable over time, then predictions are likely not to be useful. Here, changes in the general economy, in the price level of houses, and in social attitudes towards foreclosures are all likely to change the behavior of borrowers in the future, in systematic but unknown ways.

For a successful data mining application also, actions must not have major unintended consequences. Here, modifications may change behavior in undesired ways. A person requesting a modification is already thinking of not paying. Those who get modifications may be motivated to return and request further concessions.

Last but not least, for a successful application it helps if the consequences of actions are essentially independent for different examples. This may be not the case here. Rational borrowers who hear about others getting modifications will try to make themselves appear to be candidates for a modification. So each modification generates a cost that is not restricted to the loss incurred with respect to the individual getting the modification.

An even more clear example of an application of predictive analytics that is unlikely to succeed is learning a model to predict which persons will commit a major terrorist act. There are so few positive training examples that statistically reliable patterns cannot be learned. Moreover, intelligent terrorists will take steps not to fit in with patterns exhibited by previous terrorists [Jonas and Harper, 2006].

1.2 Overview

In this course we shall only look at methods that have state-of-the-art accuracy, that are sufficiently simple and fast to be easy to use, and that have well-documented successful applications. We will tread a middle ground between focusing on theory at the expense of applications, and understanding methods only at a cookbook level. Often, we shall not look at multiple methods for the same task, when there is one method that is at least as good as all the others from most points of view. In particular, for classifier learning, we will look in detail at support vector machines (SVMs), which are widely used in commercial applications nowadays. We will not examine alternative classifier learning methods such as decision trees, neural networks, boosting, and bagging. All these methods are excellent, but it is hard to identify clearly important scenarios in which they are definitely superior to SVMs. We may also look at random forests, a nonlinear method that is often superior to linear SVMs.
CSE 291 Quiz for April 5, 2011

Your name:
Student id, if you have one:
Email address (important):

Suppose that you work for Apple. The company has developed an unprecedented new product called the iKid, which is aimed at children. Your colleague says "Let's use data mining on our database of existing customers to learn a model that predicts who is likely to buy an iKid."

Is your colleague's idea good or bad? Explain at least one significant specific reason that supports your answer.
Chapter 2

Predictive analytics in general

This chapter explains supervised learning, linear regression, and data cleaning and recoding.

2.1 Supervised learning

The goal of a supervised learning algorithm is to obtain a classifier by learning from training examples. A classifier is something that can be used to make predictions on test examples. This type of learning is called "supervised" because of the metaphor that a teacher (i.e. a supervisor) has provided the true label of each training example.

Each training and test example is represented in the same way, as a row vector of fixed length p. Each element in the vector representing an example is called a feature value. It may be a real number or a value of any other type. A training set is a set of vectors with known label values. It is essentially the same thing as a table in a relational database, and an example is one row in such a table. Row, tuple, and vector are essentially synonyms. A column in such a table is often called a feature, or an attribute, in data mining. Sometimes it is important to distinguish between a feature, which is an entire column, and a feature value.

The label y for a test example is unknown. The output of the classifier is a conjecture about y, i.e. a predicted y value. Often each label value y is a real number. In this case, supervised learning is called "regression" and the classifier is called a "regression model." The word "classifier" is usually reserved for the case where label values are discrete. In the simplest but most common case, there are just two label values. These may be called −1 and +1, or 0 and 1, or no and yes, or negative and positive.

With n training examples, and with each example consisting of values for p different features, the training data are a matrix with n rows and p columns, along with a column vector of y values. The cardinality of the training set is n, while its dimensionality is p. We use the notation xij for the value of feature number j of example number i. The label of example i is yi. True labels are known for training examples, but not for test examples.

2.2 Data cleaning and recoding

In real-world data, there is a lot of variability and complexity in features. Some features are real-valued. Other features are numerical but not real-valued, for example integers or monetary amounts. Many features are categorical, e.g. for a student the feature "year" may have values freshman, sophomore, junior, and senior. Usually the names used for the different values of a categorical feature make no difference to data mining algorithms, but they are critical for human understanding. Sometimes categorical features have names that look numerical, e.g. zip codes, and/or have an unwieldy number of different values. Dealing with these is difficult.

Often, important features (meaning features with predictive power) may be implicit in other features. For example, the day of the week may be predictive, but only the day/month/year date is given in the original data. No learning algorithm can be expected to discover day-of-week automatically as a function of day/month/year. An important job for a human is to think which features may be predictive, based on understanding the application domain, and then to write software that makes these features explicit.

Even with all the complexity of features, many aspects are typically ignored in data mining. Usually, units such as dollars and dimensions such as kilograms are omitted. Difficulty in determining feature values is also ignored: for example, how does one define the year of a student who has transferred from a different college, or who is part-time? It is also difficult to deal with features that are really only meaningful in conjunction with other features, such as "day" in the combination "day month year."
A very common difficulty is that the value of a given feature is missing for some training and/or test examples. Often, missing values are indicated by question marks. However, often missing values are indicated by strings that look like valid known values, such as 0 (zero). It is important not to treat a value that means missing inadvertently as a valid regular value.

Some training algorithms can handle missingness internally. If not, the simplest approach is just to discard all examples (rows) with missing values. Keeping only examples without any missing values is called "complete case analysis." An equally simple, but different, approach is to discard all features (columns) with missing values. However, in many applications, both these approaches eliminate too much useful training data. Therefore, if a feature with missing values is retained, then it is reasonable to replace each missing value by the mean or mode of the non-missing values. This process is called imputation. More sophisticated imputation procedures exist, but they are not always better. Also, in many applications, the fact that a particular feature is missing may itself be a useful predictor, so it is often beneficial to create an additional binary feature that is 0 for missing and 1 for present.
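The mean-imputation-with-indicator scheme can be sketched as follows. This is only an illustrative sketch: the two-column toy data and the convention of marking missing values with None are invented for this example.

```python
import statistics

# Hypothetical toy data: each row is (age, income); None marks a missing value.
rows = [(25, 40000), (32, None), (47, 52000), (None, 61000), (38, None)]

def impute_with_indicators(rows):
    """Replace each missing value by the column mean of the non-missing
    values, and append one binary 'was missing' indicator per column."""
    cols = list(zip(*rows))
    means = [statistics.mean(v for v in col if v is not None) for col in cols]
    out = []
    for row in rows:
        filled = [m if v is None else v for v, m in zip(row, means)]
        indicators = [1 if v is None else 0 for v in row]
        out.append(filled + indicators)
    return out

recoded = impute_with_indicators(rows)
# Row (32, None) becomes [32, mean_income, 0, 1]: the income is imputed
# and the second indicator records that it was originally missing.
```

In a real application the means would be computed on training data only and then reused on test data.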
Some training algorithms can only handle real-valued features. For these, categorical features must be made numerical. The values of a binary feature can be recoded as 0.0 or 1.0. It is conventional to code "false" or "no" as 0.0, and "true" or "yes" as 1.0. Usually, the best way to recode a feature that has k different categorical values is to use k real-valued features: for the jth categorical value, set the jth of these features equal to 1.0 and set all k − 1 others equal to 0.0. Mathematically it is preferable to use only k − 1 real-valued features: for the jth categorical value where j < k, set the jth feature value to 1.0 and set all others equal to 0.0, and for the last categorical value, set all k − 1 features equal to 0.0. The ordering of the values, i.e. which value is associated with j = 1 and so on, is arbitrary.

Categorical features with many values (say, over 20) are often difficult to deal with. Typically, human intervention is needed to recode them intelligently. For example, zip codes may be recoded as just their first one or two digits, since these indicate meaningful regions of the United States. If one has a large dataset with many examples from each of the 50 states, then it may be reasonable to leave this as a categorical feature with 50 values. Otherwise, it may be useful to group small states together in sensible ways, for example to create a New England group for MA, CT, VT, NH, and ME.
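The (k − 1)-feature coding described above can be sketched as follows. State abbreviations are used purely as example values, and dropping the last category in sorted order is an arbitrary choice (any one category can play the all-zeros role).

```python
def dummy_code(values):
    """Map each categorical value to k - 1 binary features; the last
    category in sorted order is coded as all zeros."""
    categories = sorted(set(values))
    kept = categories[:-1]                # drop one category
    return [[1.0 if v == c else 0.0 for c in kept] for v in values]

codes = dummy_code(["CT", "MA", "VT", "MA"])
# Categories sort as CT, MA, VT and VT is dropped, so:
# "CT" -> [1.0, 0.0], "MA" -> [0.0, 1.0], "VT" -> [0.0, 0.0]
```

Keeping all k features instead is the same idea without dropping the final category.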
Conversely, some training algorithms can only handle categorical features. For these, features that are numerical can be discretized. The range of the numerical values is partitioned into a fixed number of intervals that are called bins. The word "partitioned" means that the bins are exhaustive and mutually exclusive, i.e. non-overlapping. One can set boundaries for the bins so that each bin has equal width, i.e. the boundaries are regularly spaced, or one can set boundaries so that each bin contains approximately the same number of training examples, i.e. the bins are "equal count." It often works well to use ten bins. Each bin is given an arbitrary name, and each numerical value is then replaced by the name of the bin in which the value lies.
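The two boundary-setting schemes can be sketched as follows. The toy values and the use of integers as bin "names" are illustrative assumptions; the equal-width sketch also assumes the feature has at least two distinct values.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k bins of equal width; the maximum
    value is clamped into the last bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_count_bins(values, k):
    """Assign each value to a bin so each bin holds roughly the same
    number of examples (rank-based)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

vals = [1, 2, 3, 4, 100, 200]
# equal_width_bins(vals, 2) -> [0, 0, 0, 0, 0, 1]: the outlier dominates.
# equal_count_bins(vals, 2) -> [0, 0, 0, 1, 1, 1]: three values per bin.
```

The contrast shows why equal-count bins are often preferable for skewed variables.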
An intelligent way to recode discrete predictors is to replace each discrete value by the mean of the target conditioned on that discrete value. For example, if the average label value is 20 for men versus 16 for women, these values could replace the male and female values of a variable for gender. This idea is especially useful as a way to convert a discrete feature with many values, for example the 50 U.S. states, into a useful single numerical feature. A major advantage of the conditional-means approach is that it avoids an explosion in the dimensionality of training and test examples. In contrast, the standard way to recode a discrete feature with m values is to introduce m − 1 binary features; with this standard approach, the training algorithm can learn a coefficient for each new feature that corresponds to an optimal numerical value for the corresponding discrete value. Conditional means are likely to be meaningful and useful, but they will not yield better predictions than the coefficients learned in the standard approach. After conditional-mean new values have been created, they can be scaled to have zero mean and unit variance in the same way as other features.

When preprocessing and recoding data, it is vital not to peek at test data. If preprocessing is based on test data in any way, then the test data is available indirectly for training, which can lead to overfitting. If a feature is normalized by z-scoring, its mean and standard deviation must be computed using only the training data; then, the same mean and standard deviation values must be used for normalizing this feature in test data. Similarly, when a discrete value is replaced by the mean of the target conditioned on it, the mean must be just of training target values; later, in test data, the same discrete value must be replaced by the same mean of training target values. Even more important, target values from test data must not be used in any way before or during training.

2.3 Linear regression

Let x be an instance and let y be its real-valued label. For linear regression, x must be a vector of real numbers of fixed length. Remember that this length p is often called the dimension, or dimensionality, of x. Write x = ⟨x1, x2, . . . , xp⟩. The linear regression model is

y = b0 + b1x1 + b2x2 + . . . + bpxp.

The right-hand side above is called a linear function of x. The linear function is defined by its coefficients b0 to bp. These coefficients are the output of the data mining algorithm.

The coefficient b0 is called the intercept. It is the value of y predicted by the model if xi = 0 for all i. Of course, it may be completely unrealistic that all features xi have value zero. The coefficient bi is the amount by which the predicted y value increases if xi increases by 1, if the value of all other features is unchanged.
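As a small illustrative sketch (the coefficients and feature values below are invented), a prediction from the linear model is just the intercept plus a weighted sum of the feature values:

```python
def predict(b0, b, x):
    """Linear model prediction: y = b0 + b1*x1 + ... + bp*xp."""
    return b0 + sum(bj * xj for bj, xj in zip(b, x))

# With b0 = 2, b = [3, -1] and x = [4, 5]: y = 2 + 3*4 + (-1)*5 = 9.
y_hat = predict(2.0, [3.0, -1.0], [4.0, 5.0])
```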
For example, suppose xi is a binary feature where xi = 0 means female and xi = 1 means male, and suppose bi = −2.5. Then the predicted y value for males is lower by 2.5, everything else being held constant.

Finding the optimal values of the coefficients b0 to bp is the job of the training algorithm. To make this task well-defined, we need a definition of what "optimal" means. The standard approach is to say that optimal means minimizing the sum of squared errors on the training set, where the squared error on training example i is (yi − ŷi)². Suppose that the training set has cardinality n, i.e. it consists of n examples of the form ⟨xi, yi⟩, where xi = ⟨xi1, . . . , xip⟩. Let b be any set of coefficients. The predicted value for xi is

ŷi = f(xi; b) = b0 + Σ_{j=1}^{p} bj xij.

The semicolon in the expression f(xi; b) emphasizes that the vector xi is a variable input, while b is a fixed set of parameter values. If we define xi0 = 1 for every i, then we can write

ŷi = Σ_{j=0}^{p} bj xij.

The constant xi0 = 1 can be called a pseudo-feature. The training algorithm then finds

b̂ = argmin_b Σ_{i=1}^{n} (f(xi; b) − yi)².

The objective function Σ_i (yi − Σ_j bj xij)² is called the sum of squared errors, or SSE for short. Note that during training the n different xi and yi values are fixed, while the parameters b are variable.

The optimal coefficient values b̂ are not defined uniquely if the number n of training examples is less than the number p of features. Even if n > p is true, the optimal coefficients have multiple equivalent values if some features are themselves related linearly. Here, "equivalent" means that the different sets of coefficients achieve the same minimum SSE.
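Minimizing the SSE has a closed-form solution via the normal equations. A minimal sketch with invented toy data follows; a real implementation would use a library routine and guard against singular matrices.

```python
import numpy as np

# Training data with the pseudo-feature x_i0 = 1 prepended for the intercept.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

# b_hat = argmin_b sum_i (x_i . b - y_i)^2, via the normal equations
# (X^T X) b = X^T y.
b = np.linalg.solve(X.T @ X, X.T @ y)
# Here y = 2x exactly, so the fit recovers intercept 0 and slope 2.
```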
For an intuitive example of equivalent coefficients, suppose features 1 and 2 are height and weight respectively, where the units of measurement are inches and pounds. Suppose that x2 = 120 + 5(x1 − 60) = −180 + 5x1 approximately. Then the same model can be written in many different ways:

• y = b0 + b1x1 + b2x2
• y = b0 + b1x1 + b2(−180 + 5x1) = [b0 − 180b2] + [b1 + 5b2]x1 + 0x2

and more. In the extreme, suppose x1 = x2. Then all models y = b0 + b1x1 + b2x2 for which b1 + b2 equals a constant are equivalent.

When two or more features are approximately related linearly, the true values of the coefficients of those features are not well determined. The coefficients obtained by training will be strongly influenced by randomness in the training data. Regularization is a way to reduce the influence of this type of randomness. Consider all models y = b0 + b1x1 + b2x2 for which b1 + b2 = c. Among these models, there is a unique one that minimizes the function b1² + b2²; this model has b1 = b2 = c/2. We can obtain it by setting the objective function for training to be the sum of squared errors (SSE) plus a function that penalizes large values of the coefficients.

A simple penalty function of this type is Σ_{j=1}^{p} bj², the square of the L2 norm of the vector b. Using it for linear regression is called ridge regression. A parameter λ can control the relative importance of the two objectives, namely SSE and penalty:

b̂ = argmin_b (1/n) Σ_{i=1}^{n} (yi − ŷi)² + λ (1/p) Σ_{j=1}^{p} bj².

If λ = 0 then one gets the standard least-squares linear regression solution. As λ gets larger, the penalty on large coefficients gets stronger, and the typical values of coefficients get smaller. The parameter λ is often called the strength of regularization. The fractions 1/n and 1/p do not make an essential difference, but they can be used to make the numerical value of λ easier to interpret.
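The ridge objective also has a closed-form solution. A minimal sketch with invented toy data, leaving the intercept unpenalized:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge regression via (X^T X + lam * I) b = X^T y, with an intercept
    column prepended and the intercept coefficient b0 not penalized."""
    n, p = X.shape
    Xa = np.hstack([np.ones((n, 1)), X])     # add intercept pseudo-feature
    penalty = lam * np.eye(p + 1)
    penalty[0, 0] = 0.0                      # do not penalize b0
    return np.linalg.solve(Xa.T @ Xa + penalty, Xa.T @ y)

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])
b_unreg = ridge_fit(X, y, 0.0)     # recovers y = x: b approximately [0, 1]
b_strong = ridge_fit(X, y, 100.0)  # strong lambda shrinks the slope toward 0
```

Comparing the two fits shows the shrinkage effect of increasing λ.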
Note that in the formula Σ_{j=1}^{p} bj² the sum excludes the intercept coefficient b0. One reason for doing this is that the target y values are typically not normalized. Any penalty function that treats all coefficients bj equally, like the L2 norm, is sensible only if the typical magnitudes of the values of each feature are similar; this is an important motivation for data normalization.

2.4 Interpreting coefficients of a linear model

It is common to desire a data mining model that is interpretable, that is one that can be used not only to make predictions, but also to understand mechanisms in the phenomenon that is being studied. Linear models, whether for regression or for classification as described in later chapters, do appear interpretable at first sight. However, much caution is needed when attempting to derive conclusions from numerical coefficients.
Consider the following linear regression model for predicting high-density cholesterol (HDL) levels.² Predictors have been reordered here from most to least statistically significant, as measured by p-value. Source: http://www.jerrydallal.com/LHSP/importnt.htm.

² HDL cholesterol is considered beneficial and is sometimes called "good" cholesterol.
predictor   coefficient   std. error   T-stat   p-value
intercept       …             …           …        …
BMI             …             …           …        …
LCHOL           …             …           …        …
GLUM            …             …           …        …
DIAST           …             …           …        …
BLC             …             …           …        …
PRSSY           …             …           …        …
SKINF           …             …           …        …
AGE             …             …           …        …

From most to least statistically significant, the predictors are body mass index (BMI), the log of total cholesterol (LCHOL), GLUM, diastolic blood pressure (DIAST), vitamin C level in blood (BLC), systolic blood pressure (PRSSY), skinfold thickness (SKINF), and age in years (AGE). (It is not clear what GLUM is.)

The example illustrates at least two crucial issues. First, a predictor may be practically important, and statistically significant, but still useless for interventions. This happens if the predictor and the outcome have a common cause, or if the outcome causes the predictor. Above, vitamin C is statistically significant. This may possibly be true for some physiological reason; if so, then merely taking a vitamin C supplement will cause an increase in HDL level. But it may be that vitamin C is simply an indicator of a generally healthy diet high in fruits and vegetables, in which case the predictor and the outcome have a common cause, and a supplement will not change HDL.

Second, if predictors are collinear, then one may appear significant and the other not, when in fact both are significant or both are not. Above, diastolic blood pressure is statistically significant, but systolic is not. This difference may be real, but it may also be an artifact of collinearity.

A third crucial issue is that a correlation may disagree with other knowledge and assumptions. For example, vitamin C is generally considered beneficial or neutral. If lower vitamin C was associated with higher HDL, one would be cautious about believing this relationship, even if the association was statistically significant.

2.5 Evaluating performance

In a real-world application of supervised learning, we have a training set of examples with labels, and a test set of examples with unknown labels. The whole point is to make predictions for the test examples. However, in research or experimentation we want to measure the performance achieved by a learning algorithm. To do this we use a test set consisting of examples
with known labels. We train the classifier on the training set, apply it to the test set, and then measure performance by comparing the predicted labels with the true labels (which were not available to the training algorithm).

It is absolutely vital to measure the performance of a classifier on an independent test set. Every training algorithm looks for patterns in the training data, i.e. correlations between the features and the class. Some of the patterns discovered may be spurious, i.e. valid in the training data due to randomness in how the training data was selected from the population, but not valid, or not as strong, in the whole population. A classifier that relies on these spurious patterns will have higher accuracy on the training examples than it will on the whole population. Only accuracy measured on an independent test set is a fair estimate of accuracy on the whole population. The phenomenon of relying on patterns that are strong only in the training data is called overfitting. In practice it is an omnipresent danger.

Most training algorithms have some settings that the user can choose between. For ridge regression the main algorithmic parameter is the degree of regularization λ. Other algorithmic choices are which sets of features to use. It is natural to run a supervised learning algorithm many times, and to measure the accuracy of the function (classifier or regression function) learned with different settings. A set of labeled examples used to measure accuracy with different algorithmic settings, in order to pick the best settings, is called a validation set. If you use a validation set, it is important to have a final test set that is independent of both the training set and the validation set. For fairness, the final test set must be used only once. The only purpose of the final test set is to estimate the true accuracy achievable with the settings previously chosen using the validation set.

Dividing the available data into training, validation, and test sets should be done randomly, in order to guarantee that each set is a random sample from the same distribution. However, a very important real-world issue is that future real test examples, for which the true label is genuinely unknown, may be not a sample from the same distribution.
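The random three-way division can be sketched as follows. The 60/20/20 fractions, the fixed seed, and the toy dataset are illustrative assumptions, not prescriptions.

```python
import random

def split_data(examples, seed=0, frac_train=0.6, frac_valid=0.2):
    """Randomly shuffle, then cut into training, validation, and final
    test sets; the final test set should be used only once."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(frac_train * n)
    n_valid = int(frac_valid * n)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

train, valid, test = split_data(list(range(100)))
```

Shuffling before cutting is what makes each set a random sample from the same distribution.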
Quiz question

Suppose you are building a model to predict how many dollars someone will spend at Sears. You know the gender of each customer, male or female. Since you are using linear regression, you must recode this discrete feature as continuous. You decide to use two real-valued features, x11 and x12. The coding is a standard "one of n" scheme, as follows:

gender   x11  x12
male      1    0
female    0    1

Learning from a large training set yields the model

y = . . . + 15x11 + 75x12 + . . .

Dr. Roebuck says "Aha! The average woman spends $75, but the average man spends only $15."

Write your name below, and then answer the following three parts with one or two sentences each:

(a) Explain why Dr. Roebuck's conclusion is not valid. The model only predicts spending of $75 for a woman if all other features have value zero. This may not be true for the average woman; indeed it will not be true for any woman, if features such as "age" are not normalized. Hence, Dr. Roebuck's conclusion is not valid.

(b) Explain what conclusion can actually be drawn from the numbers 15 and 75. The conclusion is that if everything else is held constant, then on average a woman will spend $60 more than a man. Note that if some feature values are systematically different for men and women, then even this conclusion is not useful, because it is not reasonable to hold all other feature values constant.

(c) Explain a desirable way to simplify the model. The two features x11 and x12 are linearly related. Hence, in the absence of regularization, they make the optimal model be undefined. It would be good to eliminate one of these two features. The expressiveness of the model would be unchanged.
Quiz for April 6, 2010

Your name:

Suppose that you are training a model to predict how many transactions a credit card customer will make. You know the education level of each customer. Since you are using linear regression, you recode this discrete feature as continuous. You decide to use two real-valued features, x37 and x38. The coding is a "one of two" scheme, as follows:

                  x37  x38
college grad       1    0
not college grad   0    1

Learning from a large training set yields the model

y = . . . + 5.5x37 + 3.2x38 + . . .

(a) Dr. Goldman concludes that the average college graduate makes 5.5 transactions. Explain why Dr. Goldman's conclusion is likely to be false. The model only predicts 5.5 transactions if all other features, including the intercept, have value zero. This may not be true for the average college grad. It will certainly be false if features such as "age" are not normalized.

(b) Dr. Sachs concludes that being a college graduate causes a person to make 5.5 − 3.2 = 2.3 more transactions, on average. Explain why Dr. Sachs' conclusion is likely to be false also. First, if any other feature has different values on average for college graduates and non-graduates, for example income, then 2.3 is not the average difference in predicted y value between the groups; said another way, it is unlikely to be reasonable to hold all other feature values constant when comparing groups. Second, even if 2.3 is the average difference, one cannot say that this difference is caused by being a college graduate. There may be some other unknown common cause.
CSE 291 Quiz for April 12, 2011

Your name:
Email address (important):

Suppose that you are training a model (a classifier) to predict whether or not a graduate student drops out from a selective graduate program. Assume that verbal ability is a strong causal influence on the target, and that the verbal GRE score is a good measure of verbal ability. You have enough positive and negative training examples, and you are training the model correctly, without overfitting or any other mistake.

Explain two separate reasonable scenarios where, despite all the assumptions above being true, the verbal GRE score does not have strong predictive power in a trained model.
Linear regression assignment

This assignment is due at the start of class on Tuesday April 12, 2011. You should work in a team of two. Choose a partner who has a different background from you.

Download the file cup98lrn.zip from http://archive.ics.uci.edu/ml/databases/kddcup98/kddcup98.html. Read the associated documentation. Load the data into Rapidminer (or other software for data mining such as R). Select the 4843 records that have feature TARGET_B=1. Save these as a native-format Rapidminer example set. Now, build a linear regression model to predict the field TARGET_D as accurately as possible. Use root mean squared error (RMSE) as the definition of error, and use tenfold cross-validation to measure RMSE.

Do a combination of the following:
• Recode non-numerical features as numerical.
• Discard useless features.
• Transform features to improve their usefulness.
• Compare different strengths of ridge regularization.

Do the steps above repeatedly in order to explore alternative ways of using the data. The outcome should be the best possible model that you can find that uses 30 or fewer of the original features.

To make your work more efficient, be sure to save the 4843 records in a format that Rapidminer can load quickly. You can use threefold cross-validation during development to save time also. If you normalize all input features, and you use strong regularization (ridge parameter 10⁷ perhaps), then the regression coefficients will indicate the relative importance of features.

The deliverable is a brief report that is formatted similarly to this assignment description. Describe what you did that worked, and your results. Explain any assumptions that you made, and any limitations on the validity or reliability of your results. Include your final regression model. If you use Rapidminer, include a printout of your final process. Organize your report logically, not chronologically. Write in the present tense. Do not speculate about future work, and do not explain ideas that did not work.
Comments on the regression assignment

As is often the case, good performance can be achieved with a very simple model. The most informative single feature is LASTGIFT, the dollar amount of the person's most recent gift. A model based on just this single feature achieves RMSE of $9.98. However, it is possible to do significantly better. In 2009, three of 11 teams achieved similar final RMSEs that were slightly better than $9.00.

The assignment asks you to produce a final model based on at most 30 of the original features. Despite this directive, it is not a good idea to begin by choosing a subset of the 480 original features based on human intuition; it is easy for human intuition to pick an initial set that is bad. The teams that did this all omitted features that in fact would have made their final models considerably better, including sometimes the feature LASTGIFT. The two teams that omitted LASTGIFT achieved RMSE worse than $11.00. Similarly, it is also not a good idea to eliminate automatically features with missing values.

Rapidminer has operators that search for highly predictive subsets of variables. These operators have two major problems. First, they are too slow to be used on a large initial set of variables. Second, these operators try a very large number of alternative subsets of variables, and pick the one that performs best on some dataset. Because of the high number of alternatives considered, the chosen subset is likely to overfit substantially the dataset on which it is best. For more discussion of this problem, see the section on nested cross-validation in a later chapter.

It is typically useful to rescale predictors to have mean zero and variance one. However, it loses interpretability to rescale the target variable. Note that if all predictors have mean zero, then the intercept of a linear regression model is the mean of the target, $15.6243 here, assuming that the intercept is not regularized.

The assignment specifically asks you to report root mean squared error, RMSE. One could also report mean squared error, MSE, but whichever is chosen should be used consistently. In general, do not confuse readers by switching between multiple performance measures without a good reason.
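As a minimal sketch of the error measure used above (toy numbers only), RMSE is the square root of the mean squared difference between targets and predictions; a model that always predicts the mean of the targets is the natural baseline to beat.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

y = [10.0, 20.0, 30.0]
baseline = [sum(y) / len(y)] * len(y)   # always predict the mean
err = rmse(y, baseline)
```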
Chapter 3

Introduction to Rapidminer

By default, many Java implementations allocate so little memory that Rapidminer will quickly terminate with an "out of memory" message. This is not a problem with Windows 7, but otherwise, launch Rapidminer with a command like

java -Xmx1g -jar rapidminer.jar

where 1g means one gigabyte.

3.1 Standardization of features

The recommended procedure is as follows, in order.

• Normalize all numerical features to have mean zero and variance 1.
• Convert each nominal feature with k alternative values into k different binary features.
• Optionally, drop all binary features with fewer than 100 examples for either binary value.
• Convert each binary feature into a numerical feature with values 0.0 and 1.0.

It is not recommended to normalize numerical features obtained from binary features. The reason is that this normalization would destroy the sparsity of the dataset, and hence make some efficient training algorithms much slower. (Note that destroying sparsity is not an issue when the dataset is stored as a dense matrix, which it is by default in Rapidminer.) Normalization is also questionable for variables whose distribution is highly uneven. For example, almost all donation amounts are under $50, but a few are up
However. in order to prevent the existence of multiple equivalent models. you are in expert mode. The example source operator has a parameter named attributes. or a person with an academic cap. will allow a linear classiﬁer to discriminate well between different common values of the variable. whether by zscoring or by transformation to the range 0 to 1 or otherwise. which is not legitimate. If regularization is not used. Eliminating binary features for which either value has fewer than 100 training examples is a heuristic way to prevent overﬁtting and at the same time make training faster. No linear normalization.2 Example of a Rapidminer process Figure 3. When the person in red is visible. Click on the icon to toggle between these. The tree illustrates how to perform some common subtasks. If you have two instances of the same operator. The actual data are in a ﬁle with the same name with exten .” The corresponding ﬁle should contain a schema deﬁnition for a dataset. Normally this is ﬁle name with extension “aml. For variables with uneven distributions. or beneﬁcial. Then. The ﬁrst operator is ExampleSource. concatenation allows information from the test set to inﬂuence the details of preprocessing. The simplest is to concatenate both sets before preprocessing. For efﬁciency and interpretability. in the right pane. 3. A common transformation that often achieves this is to take logarithms. A useful trick when zero is a frequent value of a variable x is to use the mapping x → log(x + 1). “Read training example” is a name for this particular instance of this operator. it is often useful to apply a nonlinear transformation that makes their distribution have more of a uniform or Gaussian shape. You select an operator by clicking on it in the left pane. it is important to do all preprocessing in a perfectly consistent way on both datasets. it is best to drop the binary feature corresponding to the most common value of the original multivalued feature. 
This leaves zeros unchanged and hence preserves sparsity. where you can see all the parameters of each operator. If you have separate training and test sets. the Parameters tab shows the arguments of the operator. that is of using the test set during training. and then split them after.2 shows a tree structure of Rapidminer operators that together perform a standard data mining task. INTRODUCTION TO RAPIDMINER to $500.26 CHAPTER 3. you can identify them with different names. which is a form of information leakage from the test set. It may not be necessary. if regularization is used during training. then at least one binary feature must be dropped from each set created by converting a multivalued feature. At the top there is an icon that is a picture of either a person with a red sweater.
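The log transformation and z-scoring above can be checked in a few lines. The following is a small numpy sketch with made-up donation amounts (it is illustrative only, not part of Rapidminer):

```python
import numpy as np

# Hypothetical donation amounts: highly uneven, with zero a frequent value.
amounts = np.array([0.0, 5.0, 10.0, 0.0, 25.0, 500.0])

# The mapping x -> log(x + 1) leaves zeros unchanged, preserving sparsity.
logged = np.log(amounts + 1.0)

# Z-scoring: transform to mean zero and variance one.
zscored = (logged - logged.mean()) / logged.std()

assert logged[0] == 0.0                 # zeros stay exactly zero
assert abs(zscored.mean()) < 1e-9       # mean zero after z-scoring
assert abs(zscored.std() - 1.0) < 1e-9  # unit variance after z-scoring
```

Note that applying the nonlinear log transformation before z-scoring is what lets a linear model discriminate between the many small values.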
Figure 3.1 shows the process as a tree of nested operators; each node gives a descriptive name followed by the operator type in parentheses:

Root node of Rapidminer process (Process)
    Read training examples (ExampleSource)
    Select features by name (FeatureNameFilter)
    Convert nominal to binary (Nominal2Binominal)
    Remove almost-constant features (RemoveUselessAttributes)
    Convert binary to real-valued (Nominal2Numerical)
    Z-scoring (Normalization)
    Scan for best strength of regularization (GridParameterOptimization)
        Cross-validation (XValidation)
            Regularized linear regression (W-LinearRegression)
            ApplierChain (OperatorChain)
                Applier (ModelApplier)
                Compute RMSE on test fold (RegressionPerformance)

Figure 3.1: Rapidminer process for regularized linear regression.
The easiest way to create these files is by clicking on "Start Data Loading Wizard." The first step with this wizard is to specify the file to read data from, the character that begins comment lines, and the decimal point character. Ticking the box for "use double quotes" can avoid some error messages.

In the next panel, you specify the delimiter that divides fields within each row of data. If you choose the wrong delimiter, the data preview at the bottom will look wrong. In the next panel, tick the box to make the first row be interpreted as field names. If this is not true for your data, the easiest fix is to make it true outside Rapidminer, with a text editor or otherwise.

When you click Next on this panel, all rows of data are loaded. Error messages may appear in the bottom pane, which you have to fix by trial and error. The same thing happens if you click Previous from the following panel. If there are no errors and the data file is large, then Rapidminer hangs with no visible progress. You can use a CPU monitor to see what Rapidminer is doing.

The next panel asks you to specify the type of each attribute. The wizard guesses this based only on the first row of data, so it often makes mistakes. The following panel asks you to say which features are special. The most common special choice is "label," which means that an attribute is a target to be predicted. Finally, you specify a file name that is used with "aml" and "dat" extensions to save the data in Rapidminer format.

To keep just features with certain names, use the operator FeatureNameFilter. Let the argument skip_features_with_name be .* and let the argument except_features_with_name identify the features to keep. In our sample process, it combines groups such as (.*AMNT.*), (.*GIFT.*), (MDM.*), (RFA_2.*), (.*MALE), (STATE), (PEPSTRFL), and (YRS.*).

In order to convert a discrete feature with k different values into k real-valued 0/1 features, two operators are needed. The first is Nominal2Binominal, while the second is Nominal2Numerical. Note that the documentation of the latter operator in Rapidminer is misleading: it cannot convert a discrete feature into multiple numerical features directly. The operator Nominal2Binominal is quite slow. Applying it to discrete features with more than 50 alternative values is not recommended.

The way to indicate nesting of operators is by dragging and dropping. First create the inner operator subtree. Then insert the outer operator. Then drag the root of the inner subtree, and drop it on top of the outer operator. The simplest way to find a good value for an algorithm setting is to use the XValidation operator nested inside the GridParameterOptimization operator.
3.3 Other notes on Rapidminer

• Reusing existing processes.
• Saving datasets.
• Operators from Weka versus from elsewhere.
• Eliminating non-alphanumerical characters: trimming commas, trimming lines, using quote marks.
• Comments on specific operators: Nominal2Numerical, Nominal2Binominal, RemoveUselessFeatures.
Chapter 4

Support vector machines

This chapter explains soft-margin support vector machines (SVMs), including linear and nonlinear kernels. It also discusses preventing overfitting via regularization, and detecting overfitting via cross-validation.

We have seen how to use linear regression to predict a real-valued label. Now we will see how to use a similar model to predict a binary label. In later chapters, where we think probabilistically, it will be convenient to assume that a binary label takes on true values 0 or 1. However, in this chapter it will be convenient to assume that the label y has true value either +1 or −1.

4.1 Loss functions

Let x be an instance, let y be its true label, and let f(x) be a prediction. Assume that the prediction is real-valued. If we need to convert it into a binary prediction, we will threshold it at zero:

ŷ = 2 · I(f(x) ≥ 0) − 1

where I() is an indicator function that is 1 if its argument is true, and 0 if its argument is false.

A so-called loss function l measures how good the prediction is. Typically loss functions do not depend directly on x, and a loss of zero corresponds to a perfect prediction, while otherwise loss is positive. The most obvious loss function is

l(f(x), y) = I(f(x) ≠ y)

which is called the 0-1 loss function. The usual definition of accuracy uses this loss function. However, it has undesirable properties. First, it loses information: it does not distinguish between predictions f(x) that are almost right, and those that are very wrong. Second, mathematically, its derivative is either undefined or zero, so it is difficult to use in training algorithms that try to minimize loss via gradient descent.

A better loss function is squared error:

l(f(x), y) = (f(x) − y)²

which is infinitely differentiable everywhere, and does not lose information when the prediction f(x) is real-valued. However, this loss function says that the prediction f(x) = 1.5 is as undesirable as f(x) = 0.5 when the true label is y = 1. Intuitively, if the true label is +1, then a prediction with the correct sign that is greater than 1 should not be considered incorrect. The following loss function, which is called hinge loss, satisfies the intuition just suggested:

l(f(x), −1) = max{0, 1 + f(x)}
l(f(x), +1) = max{0, 1 − f(x)}

To understand this loss function, look at a figure showing l(f(x), −1) as a function of the output f(x) of the classifier. Suppose the true label is y = −1. Then the loss is zero (the best case) as long as the prediction f(x) ≤ −1. The further away f(x) is from −1 in the wrong direction, the bigger the loss. There is a similar but opposite pattern when the true label is y = +1.

Using hinge loss is the first major insight behind SVMs.¹ An SVM classifier f is trained to minimize hinge loss. The training process aims to achieve predictions f(x) ≥ 1 for all training instances x with true label y = +1, and to achieve predictions f(x) ≤ −1 for all training instances x with y = −1. In this sense, training seeks to classify points correctly, and to distinguish clearly between the two classes, but it does not seek to make predictions be exactly +1 or −1. Overall, the training process intuitively aims to find the best possible classifier, without trying to satisfy any unnecessary additional objectives also.

¹It is mathematically convenient to write the hinge loss function as l(f(x), y) = max{0, 1 − y f(x)}.

4.2 Regularization

Given a set of training examples ⟨xi, yi⟩ for i = 1 to i = n, the total training loss (sometimes called empirical loss) is the sum

Σ_{i=1}^{n} l(f(xi), yi).
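The hinge loss and the total training loss just defined can be checked numerically. The following is a small Python sketch, with labels in {−1, +1} and made-up predictions:

```python
def hinge_loss(fx, y):
    """Hinge loss l(f(x), y) = max(0, 1 - y*f(x)) for labels y in {-1, +1}."""
    return max(0.0, 1.0 - y * fx)

# Zero loss for predictions on the correct side of the margin,
# growing linearly the further f(x) goes in the wrong direction:
assert hinge_loss(1.5, +1) == 0.0
assert hinge_loss(0.5, +1) == 0.5
assert hinge_loss(-2.0, +1) == 3.0

# Total (empirical) training loss: the sum of per-example hinge losses,
# here for some made-up predictions f(x_i) and labels y_i.
predictions = [2.0, 0.5, -1.0, -0.5]
labels      = [ +1,  +1,   -1,   -1]
total = sum(hinge_loss(fx, y) for fx, y in zip(predictions, labels))
assert total == 1.0   # 0.0 + 0.5 + 0.0 + 0.5
```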
Suppose that the function f is selected by minimizing average training loss:

f = argmin_{f∈F} Σ_{i=1}^{n} l(f(xi), yi)

where F is a space of candidate functions. If F is too flexible, then we run the risk of overfitting the training data. But if F is too restricted, then we run the risk of underfitting the data. In general, we do not know in advance what the best space F is for a particular training set. A possible solution to this dilemma is to choose a flexible space F, but at the same time to impose a penalty on the complexity of f. Let c(f) be some real-valued measure of complexity. The learning process then becomes to solve

f = argmin_{f∈F} λ c(f) + Σ_{i=1}^{n} l(f(xi), yi).

Here, λ is a parameter that controls the relative strength of the two objectives, namely to minimize the complexity of f and to minimize training error.

Suppose that the space of candidate functions is defined by a vector w ∈ Rd of parameters, i.e. we can write f(x) = g(x, w) where g is some fixed function. In this case we can define the complexity of each candidate function to be the norm of the corresponding w. Most commonly we use the square norm:

c(f) = ‖w‖² = Σ_{j=1}^{d} wj²

However, we could also use other norms, including the L0 norm

c(f) = ‖w‖₀ = Σ_{j=1}^{d} I(wj ≠ 0)

or the L1 norm

c(f) = ‖w‖₁ = Σ_{j=1}^{d} |wj|.

The square norm is the most convenient mathematically.
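The three norms above can be computed directly; the following is a small numpy sketch with an illustrative weight vector:

```python
import numpy as np

w = np.array([3.0, 0.0, -4.0])

l0 = np.count_nonzero(w)   # L0 norm: number of nonzero weights
l1 = np.abs(w).sum()       # L1 norm: sum of absolute values
sq = np.dot(w, w)          # squared L2 norm, the usual SVM penalty

assert l0 == 2
assert l1 == 7.0
assert sq == 25.0
```

The L0 and L1 penalties encourage sparse weight vectors, while the square norm is the most convenient for optimization.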
4.3 Linear soft-margin SVMs

A linear classifier is a function f(x) = g(x, w) = x · w where g is the dot product function. Note that d is the dimensionality of x, and w has the same dimensionality. Putting the ideas above together, the objective of learning is to find

w = argmin_{w∈Rd} λ‖w‖² + (1/n) Σ_{i=1}^{n} max{0, 1 − yi (w · xi)}

where the multiplier 1/n is included because it is conventional. A linear soft-margin SVM classifier is precisely the solution to this optimization problem. It can be proved that the solution to the minimization problem is always unique. Moreover, the objective function is convex, so there are no local minima.

An equivalent way of writing the same optimization problem is

w = argmin_{w∈Rd} ‖w‖² + C Σ_{i=1}^{n} max{0, 1 − yi (w · xi)}

with C = 1/(nλ). Many SVM implementations let the user specify C as opposed to λ. A small value for C corresponds to strong regularization, while a large value corresponds to weak regularization. Intuitively, when dimensionality d is fixed, a smaller training set should require a smaller C value. However, useful guidelines are not known for what the best value of C might be for a given dataset. In practice, one has to try multiple values of C, and find the best value experimentally.

Mathematically, the optimization problem above is called an unconstrained primal formulation. It is the easiest description of SVMs to understand, and it is the basis of the fastest known algorithms for training linear SVMs.

4.4 Dual formulation

There is an alternative formulation that is equivalent, and is useful theoretically. This so-called dual formulation is

max_{α∈Rn} Σ_{i=1}^{n} αi − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} αi αj yi yj (xi · xj) subject to 0 ≤ αi ≤ C.

The primal and dual formulations are different optimization problems, but they have the same unique solution.
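The unconstrained primal objective ‖w‖² + C Σ max{0, 1 − yi(w · xi)} can be minimized by subgradient descent. The following is a minimal illustrative sketch (fixed learning rate, no intercept term, tiny made-up dataset), not a production solver:

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Subgradient descent on ||w||^2 + C * sum_i max(0, 1 - y_i (w . x_i)).
    A sketch only: fixed step size, no intercept, no stopping criterion."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        margins = y * (X @ w)
        active = margins < 1.0   # examples with nonzero hinge loss
        # Subgradient: 2w from the penalty, minus C * sum over active y_i x_i.
        grad = 2.0 * w - C * (y[active, None] * X[active]).sum(axis=0)
        w -= lr * grad
    return w

# Tiny linearly separable toy problem, labels in {-1, +1}.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = train_linear_svm(X, y)
assert all(np.sign(X @ w) == y)   # training points classified correctly
```

Because the objective is convex, this simple procedure approaches the unique minimum; real implementations use far more efficient methods.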
The solution to the dual problem is a coefficient αi for each training example. The trained classifier is f(x) = w · x where the vector

w = Σ_{i=1}^{n} αi yi xi.

This equation says that w is a weighted linear combination of the training instances xi, where the weight of each instance is between 0 and C, and the sign of each instance is its label yi. Often, solving the dual optimization problem leads to αi being exactly zero for many training instances. The training instances xi such that αi > 0 are called support vectors. These instances are the only ones that contribute to the final classifier. If the number of training examples n is much bigger than the number of features d, then most training examples will not be support vectors.

This fact is the basis for a picture that is often used to illustrate the intuitive idea that an SVM maximizes the margin between positive and negative training examples. This picture is not incorrect, but it is misleading for several reasons. First, the SVMs that are useful in practice are so-called soft-margin SVMs, where some training points can be misclassified. Second, often half or more of all training points become support vectors; in those cases, the regularization that is part of the loss function for SVM training is most useful when n is small relative to d. Third, the strength of regularization that gives best performance on test data cannot be determined by looking at distances between training points.

Notice that the optimization is over Rn in the dual formulation, whereas it is over Rd in the primal formulation. The constrained dual formulation is the basis of the training algorithms used by the first generation of SVM implementations. However, recent research has shown that the unconstrained primal formulation leads to faster training algorithms in the linear case. Moreover, the primal version is easier to understand and easier to use as a foundation for proving bounds on generalization error. However, the dual version is easier to extend to obtain nonlinear SVM classifiers. This extension is based on the idea of a kernel function.

4.5 Nonlinear kernels

Consider two instances xi and xj, and consider their dot product xi · xj. This dot product is a measure of the similarity of the two instances: it is large if they are similar and small if they are not. Dot product similarity is closely related to Euclidean distance through the identity

d(xi, xj) = ‖xi − xj‖ = (‖xi‖² − 2 xi · xj + ‖xj‖²)^{1/2}

where by definition ‖x‖² = x · x. If the instances have unit length, that is ‖xi‖ = ‖xj‖ = 1, then Euclidean distance and dot product similarity are perfectly anticorrelated.²

Consider a re-representation of instances x → Φ(x) where the transformation Φ is a function Rd → Rd′. In principle, we could use dot product to define similarity in the new space Rd′, and train an SVM classifier in this space. However, suppose that we have a function K(xi, xj) = Φ(xi) · Φ(xj). This function is all that is needed in order to write down the optimization problem and its solution. Specifically, let kij = K(xi, xj). The learning task is to solve

max_{α∈Rn} Σ_{i=1}^{n} αi − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} αi αj yi yj kij subject to 0 ≤ αi ≤ C.

The solution is

f(x) = [Σ_{i=1}^{n} αi yi Φ(xi)] · Φ(x) = Σ_{i=1}^{n} αi yi K(xi, x).

If we know K, then we do not need to know the function Φ in any explicit way. The result above says that in order to train a nonlinear SVM classifier, all that is needed is the kernel matrix of size n by n whose entries are kij. The function Φ never needs to be known explicitly, regardless of which re-representation Φ is used. In order to apply the trained nonlinear classifier, that is to make predictions for arbitrary test examples, the kernel function K is needed. Using K exclusively in this way instead of Φ is called the "kernel trick." Practically, the function K can be much easier to deal with than Φ, because K is just a mapping to R, rather than to a high-dimensional space Rd′.

Intuitively, the equation

f(x) = w · x = Σ_{i=1}^{n} αi yi (xi · x)

says that the prediction for a test example x is a weighted average of the training labels yi, where the weight of yi is the product of αi and the degree to which x is similar to xi.

This classifier is a weighted combination of at most n functions, one for each support vector xi. These functions are sometimes called basis functions. Regardless of which kernel K is used, the complexity of the classifier f is limited, since it is defined by at most n coefficients αi.

²For many applications of support vector machines, it is advantageous to normalize features to have the same mean and variance. It can be advantageous also to normalize instances so that they have unit length. However, in general one cannot have both normalizations be true at the same time.
One particular kernel is especially important. The radial basis function (RBF) kernel is the function

K(xi, x) = exp(−γ‖xi − x‖²)

where γ > 0 is an adjustable parameter. The RBF kernel can also be written

K(xi, x) = exp(−‖xi − x‖²/σ²)

where σ² = 1/γ. This notation emphasizes the similarity with a Gaussian distribution. Each basis function K(xi, x) is "radial" since it is based on the Euclidean distance ‖xi − x‖ between a support vector xi and a data point x. Note that the value of K(xi, x) is always between 0 and 1, with the maximum occurring only if x = xi.

An important intuition about SVMs is that with an RBF kernel, the classifier f(x) = Σi αi yi K(xi, x) is similar to a nearest-neighbor classifier. Given a test instance x, its predicted label f(x) is a weighted average of the labels yi of the support vectors xi. The support vectors that contribute non-negligibly to the predicted label are those for which the Euclidean distance ‖xi − x‖ is small. A smaller value for γ, i.e. a larger value for σ², gives a basis function that is less peaked, that is significantly above zero for a wider range of x values. Using a larger value for σ² is similar to using a larger number k of neighbors in a nearest-neighbor classifier.

4.6 Selecting the best SVM settings

Getting good results with support vector machines requires some care. The consensus opinion is that the best SVM classifier is typically at least as accurate as the best of any other type of classifier, but obtaining the best SVM classifier is not trivial. The following procedure is recommended as a starting point:

1. Code data in the numerical format needed for SVM training.
2. Scale each attribute to have range 0 to 1, or to have range −1 to +1, or to have mean zero and unit variance.
3. Use cross-validation to find the best value for C for a linear kernel.
4. Use cross-validation to find the best values for C and γ for an RBF kernel.
5. Train on the entire available data using the parameter values found to be best via cross-validation.

It is reasonable to start with C = 1 and γ = 1, and to try values that are smaller and larger by factors of 2:

C, γ ∈ {…, 2⁻³, 2⁻², 2⁻¹, 1, 2, 2², 2³, …}.

This process is called a grid search; it is easy to parallelize, of course. It is important to try a wide enough range of values to be sure that more extreme values do not give better performance. Once you have found the best values of C and γ to within a factor of 2, doing a more fine-grained grid search around these values is sometimes beneficial, but not always.
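The grid search over powers of 2 can be sketched as follows. Here cross_val_accuracy is an assumed stand-in for evaluating one (C, γ) setting by cross-validation, and the toy accuracy surface at the end is made up purely for illustration:

```python
def grid_search(cross_val_accuracy, exponents=range(-3, 4)):
    """Try C and gamma over powers of 2 and return the best setting found.
    cross_val_accuracy(C, gamma) is an assumed evaluation function."""
    best = None
    for c_exp in exponents:
        for g_exp in exponents:
            C, gamma = 2.0 ** c_exp, 2.0 ** g_exp
            acc = cross_val_accuracy(C, gamma)
            if best is None or acc > best[0]:
                best = (acc, C, gamma)
    return best

# Toy stand-in accuracy surface, peaked at C = 2, gamma = 0.5:
toy = lambda C, g: -((C - 2.0) ** 2 + (g - 0.5) ** 2)
acc, C, gamma = grid_search(toy)
assert (C, gamma) == (2.0, 0.5)
assert acc == 0.0
```

The grid evaluations are independent, which is why this search is easy to parallelize.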
Quiz question

(a) Draw the hinge loss function for the case where the true label is y = 1. Label the axes clearly.

(b) Explain where the derivative is (i) zero, (ii) constant but not zero, or (iii) not defined.

(c) For each of the three cases for the derivative, explain its intuitive implications for training an SVM classifier.
CSE 291 Quiz for April 19, 2011

Your name:
Email address:

As seen in class, the hinge loss function is

l(f(x), −1) = max{0, 1 + f(x)}
l(f(x), +1) = max{0, 1 − f(x)}

(a) Explain what the two arguments of the hinge loss function are.

(b) Explain the reason why l(2, 1) = l(1, 1) but l(−2, 1) ≠ l(−1, 1).
Quiz for April 20, 2010

Your name:

Suppose that you have trained SVM classifiers carefully for a given learning task. You have selected the settings that yield the best linear classifier and the best RBF classifier. It turns out that both classifiers perform equally well in terms of accuracy for your task.

(a) Now you are given many times more training data. After finding optimal settings again and retraining, do you expect the linear classifier or the RBF classifier to have better accuracy? Explain very briefly.

(b) Do you expect the optimal settings to involve stronger regularization, or weaker regularization? Explain very briefly.
CSE 291 Assignment 2 (2011)

This assignment is due at the start of class on Tuesday April 19, 2011. As before, you should work in a team of two. You may keep the same partner as before, or change partners, choosing a partner with a different background.

Like the previous assignment, this one uses the KDD98 training set; you should now use more of the dataset. The goal is to train a support vector machine (SVM) classifier that identifies examples who make a nonzero donation, that is examples for which TARGET_B = 1. Only about 5% of all examples are positive. The first thing that you should do is extract all positive examples, and a random subset of twice as many negative examples. We will see next week how to deal with all the negative examples.

In the same general way as for linear regression, recode non-numerical features as numerical, and transform features to improve their usefulness. Because it is fast, you can explore models based on a large number of transformed features.

Train SVM models to predict the target label as accurately as possible: the best possible linear SVM, and also the best possible SVM using a radial basis function (RBF) kernel. For linear SVMs, the Rapidminer operator named FastLargeMargin is recommended. Training nonlinear SVMs is much slower. Use cross-validation and grid search carefully to find the best settings for training, without overfitting or underfitting, and to evaluate the accuracy of the best classifiers as fairly as possible. Identify good values for the soft-margin C parameter for both kernels, and for the width parameter for the RBF kernel. The outcome should be the two SVM classifiers with highest 0/1 accuracy that you can find.

As before, the deliverable is a short report that is organized well, written well, and formatted well. Organize your report logically, not chronologically. Write in the present tense. Describe what you did that worked, and your results. Do not speculate about future work, and do not explain ideas that do not work. Explain any assumptions that you made, and any limitations on the validity or reliability of your results. Explain carefully your cross-validation procedure. Be clear and explicit about the 0/1 accuracy that you achieve on test data. Describe your final linear and RBF models. If you use Rapidminer, include a printout of your final processes. Whatever software you use, the same reporting standards apply.
CSE 291 Assignment (2010)

This assignment is due at the start of class on Tuesday April 20, 2010. As before, you should work in a team of two. You may keep the same partner as before, or change partners, choosing a partner with a different background.

This project uses data published by Kevin Hillstrom, a well-known data mining consultant. You can find the data at http://cseweb.ucsd.edu/users/elkan/250B/HillstromData.csv. For a detailed description, see http://minethatdata.com/blog/2008/03/minethatdataemailanalyticsanddata.html. Your job is to train a good model to predict which customers visit the retailer's website. For now, you should ignore information about which customers make a purchase, and how much they spend. For this assignment, use only the data for customers who are not sent any email promotion.

In the same general way as for linear regression, recode non-numerical features as numerical, and transform features to improve their usefulness. Because it is fast, you can explore models based on a large number of transformed features, but one can hope that good performance can be achieved with fewer features. Decide thoughtfully which measure of accuracy to use, and explain your choice in your report.

Build support vector machine (SVM) models to predict the target label as accurately as possible. Train the best possible model using a linear kernel, and also the best possible model using a radial basis function (RBF) kernel. For linear SVMs, the Rapidminer operator named FastLargeMargin is recommended. Training nonlinear SVMs is much slower. Use nested cross-validation carefully to find the best settings for training, without overfitting or underfitting, and to evaluate the accuracy of the best classifiers as fairly as possible. In particular, you should identify good values of the soft-margin C parameter for both kernels, and of the width parameter for the RBF kernel. The outcome should be the two most accurate SVM classifiers that you can find.

As before, the deliverable is a well-organized, well-written, and well-formatted report of about two pages. Organize your report logically, not chronologically. Write in the present tense. Describe what you did that worked, and your results. Do not speculate about future work, and do not explain ideas that do not work. Explain any assumptions that you made, and any limitations on the validity or reliability of your results. Explain carefully your nested cross-validation procedure. Include a printout of your final Rapidminer process, and a description of your two final models (not included in the two pages).
Chapter 5

Doing valid experiments

Consider a classifier that predicts membership in one of two classes, called positive and negative. Suppose that a set of examples has a certain total size n, say n = 1000. We can represent the performance of any classifier with a table as follows:

                      predicted
                  positive   negative
truth  positive      tp         fn
       negative      fp         tn

where tp + fn + fp + tn = n. A table like the one above is called a 2×2 confusion matrix. Rows correspond to actual labels, while columns correspond to predicted labels.¹

The four entries in a 2×2 confusion matrix have standard names. They are true positives tp, false positives fp, true negatives tn, and false negatives fn. It is important to remember that these numbers are counts, i.e. integers, not ratios, i.e. fractions. Depending on the application, different summaries are computed from a confusion matrix. In particular, accuracy is the ratio (tp + tn)/n.

¹It would be equally valid to swap the rows and columns. Unfortunately, there is no standard convention about whether rows are actual or predicted. Remember that there is a universal convention that in notation like xij the first subscript refers to rows while the second subscript refers to columns.

5.1 Cross-validation

Usually we have a fixed database of labeled examples available, and we are faced with a dilemma: we would like to use all the examples for training, but we would also like to use many examples as an independent test set.
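The four counts and the accuracy ratio can be computed directly from lists of actual and predicted labels. A small sketch, with labels coded 0/1 and 1 the positive class:

```python
def confusion_matrix(actual, predicted):
    """2x2 confusion matrix counts for labels in {0, 1} (1 = positive).
    The entries are counts (integers), not ratios."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    return tp, fn, fp, tn

actual    = [1, 1, 0, 0, 0, 1]
predicted = [1, 0, 0, 1, 0, 1]
tp, fn, fp, tn = confusion_matrix(actual, predicted)
assert (tp, fn, fp, tn) == (2, 1, 1, 2)
assert tp + fn + fp + tn == len(actual)            # counts sum to n
assert abs((tp + tn) / len(actual) - 2/3) < 1e-12  # accuracy = (tp + tn)/n
```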
Cross-validation is a procedure for overcoming this dilemma. It is the following algorithm.

Input: Training set S, integer constant k
Procedure:
  partition S into k disjoint equal-sized subsets S1, …, Sk
  for i = 1 to i = k
    let T = S \ Si
    run learning algorithm with T as training set
    test the resulting classifier on Si obtaining tpi, fpi, tni, fni
  compute tp = Σi tpi, fp = Σi fpi, tn = Σi tni, fn = Σi fni

The output of cross-validation is a confusion matrix based on using each labeled example as a test example exactly once. Whenever an example is used for testing a classifier, it has not been used for training that classifier. Hence, the confusion matrix obtained by cross-validation is a fair indicator of the performance of the learning algorithm on independent test examples. This matrix is an estimate of the average performance of a classifier learned from a training set of size (k − 1)n/k where n is the size of S.

A disadvantage of cross-validation is that it does not produce any single final classifier, and the confusion matrix it provides is not the performance of any specific single classifier. The common procedure is to create a final classifier by training on all of S, and then to use the confusion matrix obtained from cross-validation as an informal estimate of the performance of this classifier. This estimate is likely to be conservative in the sense that the final classifier may have slightly better performance since it is based on a slightly larger training set.

In recent research the most common choice for k is 10. If n labeled examples are available, the largest possible number of folds is k = n. This special case is called leave-one-out cross-validation (LOOCV). Suppose that the time to train a classifier is proportional to the number of training examples, and the time to make predictions on test examples is negligible in comparison. The time required for k-fold cross-validation is then O(k · (k − 1)/k · n) = O((k − 1)n). Three-fold cross-validation is therefore twice as time-consuming as two-fold. Overall, the time complexity of cross-validation grows with k, so often LOOCV is computationally infeasible. This suggests that for preliminary experiments, two-fold is a good choice.

The results of cross-validation can be misleading. For example, if each example is duplicated in the training set and we use a nearest-neighbor classifier, then LOOCV will show a zero error rate. Cross-validation with other values of k will also yield misleadingly low error estimates.
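The algorithm above can be sketched in Python; here train_and_test is an assumed stand-in for running the learning algorithm on one training set and evaluating it on the held-out fold:

```python
import random

def cross_validation(examples, k, train_and_test):
    """k-fold cross-validation: partition into k disjoint subsets, train on
    the other k-1 folds, test on the held-out fold, and pool the results.
    train_and_test(train, test) should return per-example results for test."""
    examples = list(examples)
    random.shuffle(examples)
    folds = [examples[i::k] for i in range(k)]   # disjoint, near-equal sizes
    pooled = []
    for i in range(k):
        test = folds[i]
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]
        pooled.extend(train_and_test(train, test))
    return pooled

data = list(range(10))
# A dummy learner that just records which examples were tested.
tested = cross_validation(data, k=5, train_and_test=lambda tr, te: te)
# Each labeled example is used as a test example exactly once:
assert sorted(tested) == data
```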
Subsampling means omitting examples from some classes. Subsampling the common class is a standard approach to learning from unbalanced data, i.e. data where some classes are very common while others are very rare. It is reasonable to do subsampling in the training folds of cross-validation, but not in the test fold. It is a not uncommon mistake with cross-validation to do subsampling on the test set. Reported performance numbers should always be based on a set of test examples that is directly typical of a genuine test population. It is crucial to understand this point. So, for the assignment due on April 19, 2011, you should subsample the negatives in the KDD98 dataset just once, initially. Then, do cross-validation under the assumption that 2:1 is the true ratio of negatives to positives.

5.2 Nested cross-validation

A model selection procedure is a method for choosing the best classifier, or learning algorithm, from a set of alternatives. Often, the alternatives are all the same algorithm but with different parameter settings, for example SVMs with different C and γ values. Another common case is where the alternatives are different subsets of features. Typically, the number of alternatives is finite and the only way to evaluate an alternative is to run it explicitly on a training set. So, how should we do model selection?

A simple way to apply cross-validation for model selection is the following:

Input: dataset S, integer k, set V of alternative algorithm settings
Procedure:
  partition S randomly into k disjoint subsets S1, …, Sk of equal size
  for each setting v in V
    for i = 1 to i = k
      let T = S \ Si
      run the learning algorithm with setting v and T as training set
      apply the trained model to Si obtaining performance ei
    end for
    let M(v) be the average of ei
  end for
  select v̂ = argmax_v M(v)

The output of this model selection procedure is v̂. The input set V of alternative settings can be a grid of parameter values. The setting v̂ is chosen to maximize M(v), so M(v̂) is not a fair estimate of the performance to be expected from v̂ on future data.
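The model selection procedure above can be sketched as follows. Here evaluate is an assumed stand-in for training with setting v and measuring performance on the held-out fold; note that the single fixed partition is reused for every v:

```python
def select_setting(examples, k, settings, evaluate):
    """Choose v_hat maximizing average held-out performance M(v), using the
    same k-fold partition for every setting v."""
    folds = [examples[i::k] for i in range(k)]
    def M(v):
        scores = []
        for i in range(k):
            train = [e for j, fold in enumerate(folds) if j != i for e in fold]
            scores.append(evaluate(v, train, folds[i]))
        return sum(scores) / k
    return max(settings, key=M)

# Toy evaluation function that prefers v = 0.25 regardless of the data:
v_hat = select_setting(list(range(8)), k=4,
                       settings=[0.125, 0.25, 0.5, 1.0],
                       evaluate=lambda v, tr, te: -abs(v - 0.25))
assert v_hat == 0.25
```

Because v̂ is chosen to optimize performance on all of S, the winning score M(v̂) overstates what v̂ will achieve on fresh data, which is what motivates the nested procedure below.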
Notice that the division of S into subsets happens just once, and the same division is used for all settings v. This choice reduces the random variability in the evaluation of different v. A new partition of S could be created for each setting v, but this would not overcome the fundamental issue: v̂ is chosen to optimize performance on all of S, so v̂ is likely to overfit S.

What should we do about the fact that any procedure for selecting parameter values is itself part of the learning algorithm? One answer is that this whole procedure should itself be evaluated using cross-validation. Specifically, nested cross-validation is the following process:

Input: dataset S, integers k and k′, set V of alternative algorithm settings
Procedure:
    partition S randomly into k disjoint subsets S1, ..., Sk of equal size
    for i = 1 to i = k
        let T = S \ Si
        partition T randomly into k′ disjoint subsets T1, ..., Tk′ of equal size
        for each setting v in V
            for j = 1 to j = k′
                let U = T \ Tj
                run the learning algorithm with setting v and U as training set
                apply the trained model to Tj, obtaining performance eij
            end for
            let Mi(v) be the average of eij
        end for
        select v̂i = argmax_v Mi(v)
        run the learning algorithm with setting v̂i and T as training set
        apply the trained model to Si, obtaining performance ei
    end for
    report ê, the average of ei

This process is called nested cross-validation because one cross-validation is run inside another. Like standard cross-validation, nested cross-validation does not produce a single final model. The obvious thing to do is to run non-nested cross-validation on the whole dataset, obtain a recommended setting v̂, and then train once on the whole dataset using v̂, yielding a single final model f̂. The number ê obtained by nested cross-validation is then an estimate of the performance of f̂ on future data. This estimate is fair, but it is likely too low, because it is based on smaller training sets. One could argue that nested cross-validation is an unreasonable amount of effort to obtain a single number that, although fair, is not provably correct.
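The two procedures above can be sketched in Python. Everything below is invented purely for illustration: the function names, and a toy "learner" that predicts the training-set mean of y shrunk by a factor v, evaluated by negative squared error.

```python
import random

def cross_val_select(S, k, V, train, evaluate, rng):
    """Model selection by k-fold cross-validation: returns the setting in V
    with the best average held-out performance M(v)."""
    S = S[:]
    rng.shuffle(S)                        # the random partition happens once
    folds = [S[i::k] for i in range(k)]
    def M(v):
        perf = []
        for i in range(k):
            T = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
            model = train(v, T)
            perf.append(evaluate(model, folds[i]))
        return sum(perf) / k
    return max(V, key=M)                  # v-hat = argmax_v M(v)

def nested_cross_val(S, k, k2, V, train, evaluate, rng):
    """Nested cross-validation: each outer training set T gets its own inner
    cross-validation to pick a setting, so the held-out fold Si is never
    used for model selection. Returns e-hat, the average outer performance."""
    S = S[:]
    rng.shuffle(S)
    folds = [S[i::k] for i in range(k)]
    outer = []
    for i in range(k):
        T = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        v_i = cross_val_select(T, k2, V, train, evaluate, rng)
        model = train(v_i, T)
        outer.append(evaluate(model, folds[i]))
    return sum(outer) / k

# Toy learner: predict the training-set mean of y, shrunk by factor v.
train = lambda v, T: v * (sum(y for _, y in T) / len(T))
evaluate = lambda m, fold: -sum((m - y) ** 2 for _, y in fold) / len(fold)
data = [(None, 2.0)] * 20
best = cross_val_select(data, 4, [0.0, 0.5, 1.0], train, evaluate, random.Random(0))
e_hat = nested_cross_val(data, 4, 3, [0.0, 0.5, 1.0], train, evaluate, random.Random(0))
```

Note that `cross_val_select` is called on T only, never on the full set S, which is exactly what makes the outer estimate fair.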
As mentioned above, trying every setting in V explicitly can be prohibitive. It may be preferable to search V in a more clever way, using a genetic algorithm or some other heuristic method. Whether or not one uses nested cross-validation, the search for a good member of the set V is itself part of the learning algorithm, so it can certainly be intelligent and/or heuristic.
Quiz for April 13, 2010

Your name:

The following cross-validation procedure is intended to find the best regularization parameter R for linear regression, and to report a fair estimate of the RMSE to be expected on future test data.

Input: dataset S, integer k, set V of alternative R values
Procedure:
    partition S randomly into k disjoint subsets S1, ..., Sk of equal size
    for each R value in V
        for i = 1 to i = k
            let T = S \ Si
            train linear regression with parameter R and T as training set
            apply the trained regression equation to Si, obtaining SSE ei
        end for
        compute M = Σi ei
        report R and RMSE = √(M/|S|)
    end for

(a) Suppose that you choose the R value for which the reported RMSE is lowest, and use this lowest RMSE as an estimate of the RMSE to be expected on future data. Explain why this method is likely to be over-optimistic.

Each R value is being evaluated on the entire set S. The value that seems to be best is likely overfitting this set.

(b) Very briefly, suggest an improved variation of the method above.

The procedure to select a value for R is part of the learning algorithm. This whole procedure should be evaluated using a separate test set, or via cross-validation.

Additional notes: The procedure to select a value for R itself uses cross-validation, so if this procedure is evaluated using cross-validation, then the entire process is nested cross-validation. Incorrect answers include the following:
• "The partition of S should be done separately for each R value, not just once." No: a different partition for each R value might increase the variability in the evaluation of each R, but it would not change the fact that the best R is being selected according to its performance on all of S.

• "The partition of S should be stratified." No: first of all, we are doing regression here, so stratification is not well-defined; second, failing to stratify increases variability but does not cause overfitting.

Two basic points to remember are, first, that it is never fair to evaluate a learning method on its training set, and, second, that any search for settings for a learning algorithm (e.g., a search for a subset of features, or for algorithmic parameters) is part of the learning method.
Chapter 6

Classification with a rare class

In many data mining applications, the goal is to find needles in a haystack. That is, most examples are negative but a few examples are positive, and the goal is to identify the rare positive examples as accurately as possible. For example, most credit card transactions are legitimate, but a few are fraudulent.

For concreteness in further discussion, we will consider only the two-class case, and we will call the rare class positive. We have a standard binary classifier learning problem, but both the training and test sets are unbalanced. In a balanced set, the fraction of examples of each class is about the same; in an unbalanced set, some classes are rare while others are common.

Rather than talk about fractions or percentages of a set, we will talk about actual numbers (also called counts) of examples. It turns out that thinking about actual numbers leads to less confusion and more insight than thinking about fractions. Suppose the test set has a certain total size n, say n = 1000.

A major difficulty with unbalanced data is that accuracy is not a meaningful measure of performance. Suppose that 99% of credit card transactions are legitimate. Then we can get 99% accuracy by predicting trivially that every transaction is legitimate. On the other hand, suppose we somehow identify 5% of transactions for further investigation, and half of all fraudulent transactions occur among these 5%. Clearly the identification process is doing something worthwhile and not trivial. But its accuracy is only 95%.

We can represent the performance of the trivial classifier as the following confusion matrix:

                         predicted
                     positive   negative
    truth  positive      0         10
           negative      0        990
The performance of the nontrivial classifier is

                         predicted
                     positive   negative
    truth  positive      5          5
           negative     45        945

For supervised learning with discrete predictions, only a confusion matrix gives complete information about the performance of a classifier. No single number, for example accuracy, can provide a full picture of a confusion matrix, or of the usefulness of a classifier. Assuming that n is known, three of the counts in a confusion matrix can vary independently. When writing a report, it is best to give the full confusion matrix explicitly, so that readers can calculate whatever performance measurements they are most interested in.

When accuracy is not meaningful, two summary measures that are commonly used are called precision and recall. They are defined as follows:

• precision p = tp/(tp + fp), and
• recall r = tp/(tp + fn).

The names "precision" and "recall" come from the field of information retrieval. In other research areas, recall is often called sensitivity, while precision is sometimes called positive predictive value.

Precision is undefined for a classifier that predicts that every test example is negative, that is, when tp + fp = 0. Worse, precision can be misleadingly high for a classifier that predicts that only a few test examples are positive. Consider the following confusion matrix:

                         predicted
                     positive   negative
    truth  positive      1          9
           negative      0        990

Precision is 100% but 90% of actual positives are missed. F-measure is a widely used metric that overcomes this limitation of precision. It is the harmonic average of precision and recall:

    F = 2 / (1/p + 1/r) = 2pr / (r + p).

For the confusion matrix above, F = 2 · 0.1/(1 + 0.1) = 0.18.
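The definitions above can be computed directly from the four counts; a minimal sketch, with the function name invented for illustration:

```python
def prf(tp, fp, fn, tn):
    """Precision, recall, and F-measure from the four confusion-matrix
    counts. Precision (and hence F) is undefined, returned as None,
    when tp + fp = 0."""
    p = tp / (tp + fp) if tp + fp > 0 else None
    r = tp / (tp + fn)
    f = 2 * p * r / (p + r) if p else None   # harmonic mean of p and r
    return p, r, f

# The matrix with tp = 1, fp = 0, fn = 9, tn = 990 discussed in the text:
p, r, f = prf(1, 0, 9, 990)   # p = 1.0, r = 0.1, F = 2·0.1/1.1 ≈ 0.18
```

For the trivial classifier that predicts every example negative, `prf(0, 0, 10, 990)` returns `None` for precision, matching the remark that precision is undefined in that case.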
Besides accuracy, precision, recall, and F-measure, many other summary statistics are also commonly computed from confusion matrices. Some of these are called specificity, false positive rate, false negative rate, positive and negative likelihood ratio, kappa coefficient, and more. Rather than rely on agreement about, and understanding of, the definitions of all these, it is preferable simply to report a full confusion matrix explicitly.

6.1 Thresholds and lift

A confusion matrix is always based on discrete predictions. Often, however, these predictions are obtained by thresholding a real-valued predicted score. For example, an SVM classifier yields a real-valued prediction f(x) which is then compared to the threshold zero to obtain a discrete yes/no prediction. Confusion matrices cannot represent information about the overall usefulness of the underlying real-valued predictions. We shall return to this issue below, but first we consider the issue of selecting a threshold.

Typically, there is a natural threshold, such as zero for an SVM. However, even when a natural threshold exists, it is possible to change the threshold to achieve a target number of positive predictions. Setting the threshold determines the number tp + fp of examples that are predicted to be positive. This target number is often based on a so-called budget constraint. Suppose that all examples predicted to be positive are subjected to further investigation. An external limit on the resources available will determine how many examples can be investigated, and this number is a natural target for the value tp + fp.

Intuitively, we want to investigate those examples that are most likely to be actual positives, so we want to investigate the examples x with the highest prediction scores f(x). Therefore, the correct strategy is to choose a threshold t such that we investigate all examples x with f(x) ≥ t, where the number of such examples is determined by the budget constraint.

Given that a fixed number of examples are predicted to be positive, a natural question is how good a classifier is at capturing the actual positives within this number. This question is answered by a measure called lift. The precise definition of lift is a bit complex. First, let the base rate of actual positive examples be b = (tp + fn)/n. Next, let the fraction of examples predicted to be positive be the prediction rate x = (tp + fp)/n. Finally, let the density of actual positives within the predicted positives be d = tp/(tp + fp). The lift at prediction rate x is then defined to be the ratio d/b. For example, a lift of 2 means that actual positives are twice as dense among the predicted positives as they are among all examples. Note that the lift will be different at different prediction rates; typically, the smaller the prediction rate, the higher the lift.
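Under these definitions, lift can be computed directly from a confusion matrix; a small sketch, with the function name invented for illustration:

```python
def lift(tp, fp, fn, tn):
    """Lift at the prediction rate implied by the confusion matrix:
    d / b, the density of actual positives among the predicted positives,
    divided by the base rate of actual positives."""
    n = tp + fp + fn + tn
    b = (tp + fn) / n            # base rate of actual positives
    d = tp / (tp + fp)           # density within the predicted positives
    return d / b

# Credit-card example from the text: n = 1000, base rate 1%, 50 transactions
# investigated, 5 of the 10 frauds caught: lift(5, 45, 5, 945) is 10.0.
```

In this example the prediction rate is 5% and recall is 50%, so the same number also comes out as recall divided by prediction rate.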
Lift can be expressed as

    d/b = [tp/(tp + fp)] / [(tp + fn)/n]
        = (tp · n) / [(tp + fp)(tp + fn)]
        = [tp/(tp + fn)] / [(tp + fp)/n]
        = recall / prediction rate.

Lift is a useful measure of success if the number of examples that should be predicted to be positive is determined by external considerations. However, budget constraints should normally be questioned. In the credit card scenario, perhaps too many transactions are being investigated, and the marginal benefit of investigating some of them is negative. Or, perhaps too few transactions are being investigated, and there would be a net benefit if additional transactions were investigated. Making optimal decisions about how many examples should be predicted to be positive is discussed in the next chapter.

While a budget constraint is not normally a rational way of choosing a threshold for predictions, it is still more rational than choosing an arbitrary threshold. In particular, the threshold zero for an SVM classifier has some mathematical meaning but is not a rational guide for making decisions.

6.2 Ranking examples

Applying a threshold to a classifier with real-valued outputs loses information, because the distinction between examples on the same side of the threshold is lost. Moreover, at the time a classifier is trained and evaluated, it is often the case that the threshold to be used for decision-making is not yet known. Therefore, it is useful to compare different classifiers across the range of all possible thresholds. It is not meaningful to use the same numerical threshold directly for different classifiers, but it is meaningful, for example, to compare the recall achieved by different classifiers when the threshold for each one is set to make their precisions equal. This is what an ROC curve does.¹ Concretely, an ROC curve is a plot of the performance of a classifier, where the horizontal axis measures false positive rate (fpr) and the vertical axis measures true positive rate (tpr). These are defined as

    fpr = fp / (fp + tn)
    tpr = tp / (tp + fn).

¹ROC stands for "receiver operating characteristic." This terminology originates in the theory of detection based on electromagnetic waves. The receiver is an antenna, and its operating characteristic is the sensitivity that it is tuned for.
Figure 6.1: ROC curves for three alternative binary classifiers. Source: Wikipedia.

Note that tpr is the same as recall, and is sometimes also called "hit rate." In an ROC plot, the ideal point is at the top left. One classifier uniformly dominates another if its curve is always above the other's curve. It happens often that the ROC curves of two classifiers cross, which implies that neither one dominates the other uniformly.

ROC plots are informative, but do not provide a quantitative measure of the performance of classifiers. A natural quantity to consider is the area under the ROC curve, often abbreviated AUC. The AUC is 0.5 for a classifier whose predictions are no better than random, while it is 1 for a perfect classifier. The AUC has an intuitive meaning: it can be proved that it equals the probability that the classifier will rank correctly two randomly chosen examples where one is positive and one is negative. One reason why AUC is widely used is that, as shown by the probabilistic meaning just mentioned, it has built into it the implicit assumption that the rare class is more important than the common class.
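The pairwise-probability meaning of AUC suggests a direct (if quadratic-time) way to compute it; a sketch, where ties are counted as half, a common convention not discussed in the text:

```python
def auc(scores, labels):
    """Empirical AUC as a pairwise probability: the fraction of
    (positive, negative) pairs that the scores rank correctly,
    with tied pairs counted as half. Labels are 1 or 0."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

A perfect ranking gives 1, a fully reversed ranking gives 0, and a constant score gives 0.5, matching the values stated above.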
Now. Many SVM software packages allow the user to specify these two different C parameters. then most pressure will be to achieve high accuracy on that class. which can be an important beneﬁt. If most training examples belong to one class. only include in the training set a randomly chosen subset of the available negative examples. The empirical loss function treats all n training examples equally.3 Training to overcome imbalance The previous sections were about measuring performance in applications where one class is rare but important.j ∈Q where Q = { i. Good values of the two C parameters can be chosen by crossvalidation. A subtle idea is that we can actually optimize AUC of a linear classiﬁer directly on the training set. This section describes ideas for actually performing better on such applications. That is. It does however make training faster. This is the same as 0/1 accuracy on a new set Q: A= 1 Q I(I(w xi > w xj ) = I(yi > yj )). note that I(w xi > w xj ) = I(w (xi − xj ) > 0). CLASSIFICATION WITH A RARE CLASS 6. 1). Consider the regularized loss function that is minimized by SVM training: n c(f ) + C i=1 l(f (xi ). The most obvious idea is to undersample the negative class. A major disadvantage of this approach is that it loses information.58 CHAPTER 6. we can minimize c(f ) + C−1 i:yi =−1 l(f (xi ). Speciﬁcally. yi ). One response to class imbalance is to give different importances to each class in the optimization task. . or their ratio can be ﬁxed so that each class has equal importance. The empirical AUC of a linear classiﬁer with coefﬁcients w is A= 1 n−1 n1 i:yi =1 j:yj =−1 I(w xi > w xj ). j : yi = yj }. −1) + C1 i:yi =1 l(f (xi ). The parameter C controls the tradeoff between regularization and minimizing empirical loss. i.
This formulation suggests learning a linear classifier w such that a training example xi − xj has the 0/1 label I(yi > yj). Note also that yi = 1 if and only if yj = −1, so we want xi − xj to have the label 1 if and only if yi = 1. The training set Q for this learning task is balanced, because if the pair ⟨i, j⟩ belongs to Q then so does ⟨j, i⟩. Formally, we minimize

    λ c(f) + Σ_{⟨i,j⟩ ∈ Q} l(w·(xi − xj), yi).

The set Q can have size up to n²/4 if the original training set has size n. However, the minimization can often be done efficiently by stochastic gradient descent, picking elements ⟨i, j⟩ from Q randomly. The weight vector w typically reaches convergence after seeing only O(n) elements of Q.

6.4 Conditional probabilities

Reason 1 to want probabilities: predicting expectations. Reason 2 to want probabilities: understanding predictive power. Definition of well-calibrated probability. Converting scores into probabilities. Ideal versus achievable probability estimates. Brier score.

6.5 Isotonic regression

Let fi be prediction scores on a dataset, and let yi ∈ {0, 1} be the corresponding true labels. Let f(j) be the fi values sorted from smallest to largest, and let y(j) be the yi values sorted in the same order. For each f(j) we want to find an output value gj such that the gj values are monotonically increasing, and such that squared error relative to the y(j) values is minimized. Formally, the optimization problem is

    min over g1, ..., gn of Σj (y(j) − gj)²   subject to gj ≤ gj+1 for j = 1 to j = n − 1.

It is a remarkable fact that if squared error is minimized, then the resulting predictions are well-calibrated probabilities. There is an elegant algorithm called "pool adjacent violators" (PAV) that solves this problem in linear time. Here, pooling a set means replacing each member of the set by the arithmetic mean of the set. The algorithm is as follows:

    Let gj = y(j) for all j
    Start with j = 1 and increase j until the first violation, i.e., the first j such that gj > gj+1
    Pool gj and gj+1
    Move left: if gj−1 > gj, pool gj−1 through gj+1
    Continue to the left until monotonicity is satisfied
    Proceed to the right
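PAV can be sketched compactly using an equivalent formulation with weighted blocks instead of the left-right scan above: each value starts as its own block, and adjacent blocks are pooled (replaced by their mean) while their means violate monotonicity. The function name and formulation are choices for this sketch, not the text's exact procedure.

```python
def pav(y):
    """Pool Adjacent Violators via weighted blocks: returns the
    monotonically non-decreasing least-squares fit to the sequence y.
    Each block stores [sum, count]; means are compared by
    cross-multiplication to avoid division inside the loop."""
    blocks = []
    for v in y:
        blocks.append([float(v), 1])
        # pool while the previous block's mean exceeds the last block's mean
        while len(blocks) > 1 and \
              blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    g = []
    for s, c in blocks:
        g.extend([s / c] * c)
    return g
```

For example, `pav([0, 1, 0, 1])` pools the middle violating pair into two values of 0.5, and an already-monotonic input is returned unchanged.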
Given a test example x, the procedure to predict a well-calibrated probability is as follows:

    Apply the classifier to obtain f(x)
    Find j such that f(j) ≤ f(x) ≤ f(j+1)
    The predicted probability is gj

6.6 Univariate logistic regression

The disadvantage of isotonic regression is that it creates a lookup table for converting scores into estimated probabilities. An alternative is to use a parametric model. The most common model is called univariate logistic regression:

    log [p / (1 − p)] = a + b f

where f is a prediction score and p is the corresponding estimated probability. An equivalent way of writing the model is

    p = 1 / (1 + e^−(a + b f)).

The equation above shows that the univariate logistic regression model is essentially a linear model with intercept a and coefficient b. As above, let fi be prediction scores on a training set, and let yi ∈ {0, 1} be the corresponding true labels. The parameters a and b are chosen to minimize the total loss

    Σi l( 1 / (1 + e^−(a + b fi)), yi ).

The precise loss function l that is typically used in this optimization is called conditional log likelihood (CLL), and is explained in Chapter 7 below. However, one could also use squared error, which would be consistent with isotonic regression.

6.7 Pitfalls of link prediction

This section was originally an assignment. The goal of writing it as an assignment was to help develop three important abilities.
The first ability is critical thinking: the skill of identifying what is most important, and then identifying what may be incorrect. The second ability is understanding a new application domain quickly, in this case a task in computational biology. The third ability is presenting opinions in a persuasive way. Students were asked to explain arguments and conclusions crisply, concisely, and clearly.

Assignment: The paper you should read is "Predicting protein–protein interactions from primary structure" by Joel Bock and David Gough, published in the journal Bioinformatics in 2001. Figure out and describe three major flaws in the paper. The flaws concern

• how the dataset is constructed,
• how each example is represented, and
• how performance is measured and reported.

Each of the three mistakes is serious. The paper has 248 citations according to Google Scholar as of May 26, 2009, but unfortunately each flaw by itself makes the results of the paper not useful as a basis for future research. Each mistake is described sufficiently clearly in the paper: it is a sin of commission, not a sin of omission. The second mistake, how each example is represented, is the most subtle, but at least one of the papers citing this paper does explain it clearly. It is connected with how SVMs are applied here. Remember the slogan: "If you cannot represent it then you cannot learn it."

Separately, provide a brief critique of the four benefits claimed for SVMs in the section of the paper entitled "Support vector machine learning." Are these benefits true? Are they unique to SVMs? Does the work described in this paper take advantage of them?

Sample answers: Here is a brief summary of what I see as the most important issues with the paper.

(1) How the dataset is constructed. The problem here is that the negative examples are not pairs of genuine proteins. Instead, they are pairs of randomly generated amino acid sequences. It is quite possible that these artificial sequences could not fold into actual proteins at all.
The classifiers reported in this paper may learn mainly to distinguish between real proteins and non-proteins. The authors acknowledge this concern, but do not overcome it. They could have used pairs of genuine proteins as negative examples. It is true that one cannot be sure that any given pair really is non-interacting. However, the great majority of pairs do not interact. Moreover, if a negative example really is an interaction, that will presumably slightly reduce the apparent accuracy of a trained classifier, but not change overall conclusions.
(2) How each example is represented. This is a subtle but clear-cut and important issue, assuming that the research uses a linear classifier. Let x1 and x2 be two proteins, and let f(x1) and f(x2) be their representations as vectors in R^d. The pair of proteins is represented as the concatenated vector [f(x1); f(x2)] ∈ R^2d. Suppose a trained linear SVM has parameter vector w. By definition w ∈ R^2d also. (If there is a bias coefficient, so w ∈ R^(2d+1), the conclusion is the same.)

Now suppose the first protein x1 is fixed, and consider varying the second protein x2. Proteins x2 will be ranked according to the numerical value of the dot product w · [f(x1); f(x2)]. Writing w as [w1; w2], this dot product equals w1 · f(x1) + w2 · f(x2). If x1 is fixed, then the first term is constant, so the second term w2 · f(x2) determines the ranking. Hence the ranking of x2 proteins will be the same regardless of what the x1 protein is. This fundamental drawback of linear classifiers for predicting interactions is pointed out in [Vert and Jacob, 2008, Section 5].

With a concatenated representation of protein pairs, a linear classifier can at best learn the propensity of individual proteins to interact. Such a classifier cannot represent patterns that depend on features that are true only for specific protein pairs. This is the relevance of the slogan "If you cannot represent it then you cannot learn it."

Note: Previously I was under the impression that the authors stated that they used a linear kernel. On rereading the paper, it fails to mention at all what kernel they use. If the research uses a linear kernel, then the argument above is applicable.

(3) How performance is measured and reported. Most pairs of proteins are non-interacting. It is reasonable to use training sets where negative examples (non-interacting pairs) are undersampled. However, it is not reasonable or correct to report performance (accuracy, precision, recall, etc.)
on test sets where negative examples are underrepresented, which is what is done in this paper. As for the four claimed advantages of SVMs: 1. SVMs are nonlinear while requiring “relatively few” parameters. The authors do not explain what kernel they use. In any case, with a nonlinear kernel the number of parameters is the number of support vectors, which is often close to the number of training examples. It is not clear relative to what this can be considered “few.” 2. SVMs have an analytic upper bound on generalization error. This upper bound does motivate the SVM approach to training classiﬁers, but it typically does not provide useful guarantees for speciﬁc training sets. In any case, a bound of this type has nothing to do with assigning conﬁdences to individual predictions. In practice overﬁtting is prevented by straightforward
search for the value of the C parameter that is empirically best, not by applying a theorem.

3. SVMs have fast training, which is essential for screening large datasets. SVM training is slow compared to many other classifier learning methods, except for linear classifiers trained by fast algorithms that were only published after 2001, when this paper was published. As mentioned above, a linear classifier is not appropriate with the representation of protein pairs used in this paper. In any case, what is needed for screening many test examples is fast classifier application, not fast training. Applying a linear classifier is fast, whether it is an SVM or not. Applying a nonlinear SVM typically has the same order-of-magnitude time complexity as applying a nearest-neighbor classifier, which is the slowest type of classifier in common use.

4. SVMs can be continuously updated in response to new data. At least one algorithm is known for updating an SVM given a new training example, but it is not cited in this paper. I do not know any algorithm for efficiently training an optimal new SVM, that is, without retraining on old data. In any case, new real-world protein data arrives slowly enough that retraining from scratch is feasible.
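The argument in point (2) above, that a linear scorer on concatenated features ranks candidate partners x2 identically for every fixed x1, can be checked numerically. All vectors and names below are made up for illustration:

```python
# Two arbitrary weight blocks standing in for a trained w = [w1; w2].
w1, w2 = [0.3, -1.2], [0.8, 0.5]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def score(f1, f2):
    # w · [f(x1); f(x2)] = w1 · f(x1) + w2 · f(x2)
    return dot(w1, f1) + dot(w2, f2)

candidates = [[1, 0], [0, 1], [2, 2], [-1, 3]]   # made-up f(x2) vectors

def ranking(f1):
    """Indices of candidates sorted by score, best first, for a fixed f(x1)."""
    return sorted(range(len(candidates)),
                  key=lambda j: -score(f1, candidates[j]))

# The ordering over x2 never changes, whatever f(x1) is:
# ranking([5, 5]) == ranking([-7, 0.1]) == ranking([0, 0])
```

The first term w1 · f(x1) shifts every candidate's score by the same constant, which is exactly why it cannot affect the ordering.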
Quiz for 2009

All questions below refer to a classifier learning task with two classes, where the base rate for the positive class y = 1 is 5%.

(a) Suppose that a probabilistic classifier predicts p(y = 1 | x) = c for some constant c, for all test examples x. Explain why c = 0.05 is the best value for c.

"Well-calibrated" means that c equals the average frequency of positive examples in sets of examples with predicted score c. The value 0.05 is the base rate, so 0.05 is the only constant that is a well-calibrated probability.

(b) What is the error rate of the classifier from part (a)? What is its MSE?

With a prediction threshold of 0.5, all examples are predicted to be negative, so the error rate is 5%. The MSE is 0.05 · (1 − 0.05)² + 0.95 · (0 − 0.05)² = 0.05 · 0.95 · 1, which equals 0.0475.

(c) Suppose a well-calibrated probabilistic classifier satisfies a ≤ p(y = 1 | x) ≤ b for all x. What is the maximum possible lift at 10% of this classifier?

If the upper bound b is a well-calibrated probability, then the fraction of positive examples among the highest-scoring examples is at most b. Hence, the lift is at most b/0.05. Note that this is the highest possible lift at any percentage, not just at 10%.
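The arithmetic in part (b) is easy to check directly:

```python
# Constant predictor c = 0.05 on a test set matching the 5% base rate.
base = 0.05
error_rate = base                    # every example is predicted negative,
                                     # so exactly the positives are wrong
mse = base * (1 - 0.05) ** 2 + (1 - base) * (0 - 0.05) ** 2
# mse simplifies to 0.05 * 0.95 = 0.0475
```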
Quiz for April 27, 2010

Your name:

The current assignment is based on the equation

    p(buy = 1 | x) = p(buy = 1 | x, visit = 1) p(visit = 1 | x).

Explain clearly but briefly why this equation is true.
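A minimal numeric check of the equation, with made-up probabilities for one fixed x. The key fact is that buy = 1 is impossible without visit = 1:

```python
p_visit = 0.3                 # assumed p(visit = 1 | x)
p_buy_given_visit = 0.2       # assumed p(buy = 1 | x, visit = 1)
joint = {
    (1, 1): p_visit * p_buy_given_visit,        # visit and buy
    (1, 0): p_visit * (1 - p_buy_given_visit),  # visit, no buy
    (0, 0): 1 - p_visit,                        # no visit (hence no buy)
}
# Marginalizing out visit recovers the product form exactly, because the
# outcome (visit = 0, buy = 1) has probability zero.
p_buy = sum(pr for (visit, buy), pr in joint.items() if buy == 1)
```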
CSE 291 Quiz for April 26, 2011

Your name:
Email address:

The lecture notes for last week (Section 6.3) suggest training to maximize AUC by minimizing the objective function

    λ c(f) + Σ_{⟨i,j⟩ ∈ Q} l(w·(xi − xj), yi).

Here, the set Q identifies all pairs of original training examples xi and xj that have different labels:

    Q = { ⟨i, j⟩ : yi ≠ yj }.

Dr. Jekyll says that in fact, one only needs to use a smaller set

    Q′ = { ⟨i, j⟩ : yi = +1 and yj = −1 }.

Is Dr. Jekyll correct? Explain your answer briefly.
As explained in class, AUC is the probability that a random positive example scores higher than a random negative example.

CSE 291 Assignment (2011)

This assignment is due at the start of class on Tuesday April 26, 2011. As before, you should work in a team of two. You may change partners, or keep the same partner. Like the previous assignments, this one uses the KDD98 training set. However, you should now use the entire dataset.

The goal is to train a classifier with real-valued outputs that identifies test examples with TARGET_B = 1. The measure of success to optimize is AUC (area under the ROC curve). You should try (i) a linear classifier and (ii) any other supervised learning method of your choice. For (i), if you use Rapidminer, then the FastLargeMargin operator is recommended. For example, the second method can be a nonlinear support vector machine, a neural network, or a decision tree. The second method can also be a linear classifier that is different from your first linear method. The objective is to see whether the second method can perform better than the linear method.

Both learning methods should be applied to the same training set, using the same data coded in the same way. As before, recode and transform features to improve their usefulness. Do feature selection to reduce the size of the training set as much as is reasonably possible. For both methods, apply cross-validation to find good algorithm settings.
CSE 291 Assignment (2010)

This assignment is due at the start of class on Tuesday April 27, 2010. As before, you should work in a team of two. You may change partners, or keep the same partner. Like the previous assignment, this one uses the e-commerce data published by Kevin Hillstrom. However, the goal now is to predict who makes a purchase on the website (again, for customers who are not sent any promotion). This is a highly unbalanced classification task.

The measure of success to optimize is lift at 25%. That is, as many positive test examples as possible should be among the 25% of test examples with highest prediction score.

First, use logistic regression. If you are using Rapidminer, then use the logistic regression option of the FastLargeMargin operator. Second, compare learning a classifier directly with learning two classifiers: the first to predict who visits the website, and the second to predict which visitors make purchases. Note that mathematically

    p(buy | x) = p(buy | x, visit) p(visit | x).

You may apply any learning algorithms that you like. Can any method achieve better accuracy than logistic regression? For predicting who makes a purchase, use your creativity to get the best possible performance. As before, recode and transform features to improve their usefulness, and be careful not to fool yourself about the success of your methods.

Investigate the probability estimates produced by your two methods. What are the minimum, mean, and maximum predicted probabilities? Discuss whether these are reasonable.
Chapter 7

Logistic regression

This chapter shows how to obtain a linear classifier whose predictions are probabilities.

We assume that the label y follows a probability distribution that is different for different examples x. Technically, for each x there is a different distribution f(y | x; w) of y, but all these distributions share the same parameters w. We assume that we never want to predict values of x: if data points x themselves are always known, then there is no need to have a probabilistic model of them.

In the binary case, the possible values of y are y = 1 and y = 0. We use j to index over the feature values x1 to xd of a single example of dimensionality d, since we use i below to index over training examples 1 to n. When y is binary and x is a real-valued vector, the model

    p(y = 1 | x; α, β) = 1 / (1 + exp −[α + Σ_{j=1}^{d} βj xj])

is called logistic regression.

Given training data consisting of ⟨xi, yi⟩ pairs, the principle of maximum conditional likelihood says to choose a parameter estimate ŵ that maximizes the product Πi f(yi | xi; w). In order to justify the conditional likelihood being a product, we do not need to assume that the xi are independent; we just need to assume that the yi are independent conditional on the xi. Then, for any specific value of x, ŵ can be used to predict values for y.

The logistic regression model is easier to understand in the form

    log [p / (1 − p)] = α + Σ_{j=1}^{d} βj xj
where p is an abbreviation for p(y = 1 | x; α, β). The ratio p/(1 − p) is called the odds of the event y = 1 given x, and log[p/(1 − p)] is called the log odds. Since probabilities range between 0 and 1, odds range between 0 and +∞, and log odds range unboundedly between −∞ and +∞. A linear expression of the form α + Σj βj xj can also take unbounded values, so it is reasonable to use a linear expression as a model for log odds, but not as a model for odds or for probabilities. Essentially, logistic regression is the simplest possible model for a random yes/no outcome that depends linearly on predictors x1 to xd.

For each feature j, exp(βj xj) is a multiplicative scaling factor on the odds p/(1 − p). If the predictor xj is binary, then exp(βj) is the extra odds of having the outcome y = 1 when xj = 1, compared to when xj = 0. If the predictor xj is real-valued, then exp(βj) is the extra odds of having the outcome y = 1 when the value of xj increases by one unit. A major limitation of the basic logistic regression model is that the probability p must either increase monotonically, or decrease monotonically, as a function of each predictor xj. The basic model does not allow the probability to depend in a U-shaped way on any xj.

Given the training set {⟨x1, y1⟩, ..., ⟨xn, yn⟩}, training a logistic regression classifier involves maximizing the log conditional likelihood (LCL). This is the sum of the LCL for each training example:

    LCL = Σ_{i=1}^{n} log L(w; yi | xi) = Σ_{i=1}^{n} log f(yi | xi; w).

Given a single training example ⟨xi, yi⟩, the log conditional likelihood is log pi if the true label yi = 1, and log(1 − pi) if yi = 0, where pi = p(y = 1 | xi; w). Equivalently, it is

    yi log pi + (1 − yi) log(1 − pi).

Now write hi = α + Σ_{j=1}^{d} βj xij for the linear score of example i, so that pi = 1/(1 + exp −hi). Then

    1 − pi = 1 − 1/(1 + exp −hi)
           = (exp −hi)/(1 + exp −hi)
           = 1/((exp hi) + 1).

So, if the labels are encoded here as yi ∈ {−1, +1}, the LCL for one example can be written compactly as

    −log(1 + exp −yi hi)
and maximizing the LCL is equivalent to minimizing

Σ_i log(1 + exp −y_i h_i).

The loss function inside the sum is called log loss. It has some similarity in shape to hinge loss.

Consider learning a logistic regression classifier for categorizing documents. Suppose that word number j appears only in documents whose labels are positive. The partial derivative of the log conditional likelihood with respect to the parameter for this word is

∂/∂β_j LCL = Σ_i (y_i − p_i) x_{ij}.

This derivative will always be positive, as long as the predicted probability p_i is not perfectly one for all these documents. Therefore, following the gradient will send β_j to infinity. Then, every test document containing this word will be predicted to be positive with certainty, regardless of all other words in the test document. This overconfident generalization is an example of overfitting.

There is a standard method for solving this overfitting problem that is quite simple, but quite successful. The solution is called regularization. The idea is to impose a penalty on the magnitude of the parameter values. This penalty should be minimized, in a trade-off with maximizing likelihood. Mathematically, the optimization problem to be solved is

β̂ = argmax_β LCL − µ‖β‖²

where ‖β‖ is the L2 norm of the parameter vector, and the constant µ quantifies the trade-off between maximizing likelihood and minimizing parameter values. This type of regularization is called quadratic or Tikhonov regularization. A major reason why it is popular is that it can be derived from several different points of view. In particular, it arises as a consequence of assuming a Gaussian prior on parameters. It also arises from theorems on minimizing generalization error, i.e. error on independent test sets drawn from the same distribution as the training set. And, it arises from robust classification: assuming that each training point lies in an unknown location inside a sphere centered on its measured location.
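A minimal sketch of regularized training by gradient ascent on LCL − µ‖β‖², for one feature plus an intercept. The data, learning rate, step count, and the choice to penalize the intercept as well are illustrative assumptions, not from the text. The feature occurs only in positive examples, so without the penalty the coefficient would grow without bound.

```python
import math

def train_logistic_l2(xs, ys, mu, lr=0.1, steps=3000):
    """Gradient ascent on LCL - mu*(alpha^2 + beta^2).

    One feature plus intercept; for simplicity this sketch penalizes the
    intercept too (an assumption, not part of the formulation above)."""
    alpha = beta = 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(alpha + beta * x)))
            grad_a += y - p        # derivative of LCL w.r.t. alpha
            grad_b += (y - p) * x  # derivative of LCL w.r.t. beta
        alpha += lr * (grad_a - 2.0 * mu * alpha)
        beta += lr * (grad_b - 2.0 * mu * beta)
    return alpha, beta

# The feature x = 1 appears only in positive examples, so the data are
# linearly separable and unregularized training would send beta to infinity.
xs = [1, 1, 0, 0]
ys = [1, 1, 0, 0]
_, beta_weak = train_logistic_l2(xs, ys, mu=0.1)
_, beta_strong = train_logistic_l2(xs, ys, mu=1.0)
```

A larger µ pulls the coefficient closer to zero, trading some likelihood for less overconfidence.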
Chapter 8

Making optimal decisions

This chapter discusses making optimal decisions based on predictions, and maximizing the value of customers.

8.1 Predictions, decisions, and costs

Decisions and predictions are conceptually very different. For example, a prediction concerning a patient may be "allergic" or "not allergic" to aspirin, while the corresponding decision is whether or not to administer the drug. Predictions can often be probabilistic, while decisions typically cannot.

Suppose that examples are credit card transactions and the label y = 1 designates a legitimate transaction. Then making the decision y = 1 for an attempted transaction means acting as if the transaction is legitimate, i.e. approving the transaction. In general, predicting i means acting as if i is true, so one could equally well call this deciding i.

The essence of cost-sensitive decision-making is that it can be optimal to act as if one class is true even when some other class is more probable. For example, if the cost of approving a fraudulent transaction is proportional to the dollar amount involved, then it can be rational not to approve a large transaction, even if the transaction is most likely legitimate. Conversely, it can be rational to approve a small transaction even if there is a high probability it is fraudulent.

Mathematically, let i be the predicted class and let j be the true class. If i = j then the prediction is correct, while if i ≠ j the prediction is incorrect. The (i, j) entry in a cost matrix c is the cost of acting as if class i is true, when in fact class j is true. A cost matrix c has the following structure when there are only two classes:
                  actual negative   actual positive
predict negative  c(0, 0) = c00     c(0, 1) = c01
predict positive  c(1, 0) = c10     c(1, 1) = c11

The cost of a false positive is c10 while the cost of a false negative is c01. We follow the convention that cost matrix rows correspond to alternative predicted classes, while columns correspond to actual classes. In short the convention is row/column = i/j = predicted/actual. (This convention is the opposite of the one in Section 5, so perhaps we should switch one of these to make the conventions similar.)

The optimal prediction for an example x is the class i that minimizes the expected cost

e(x, i) = Σ_j p(j|x) c(i, j).    (8.1)

For each i, e(x, i) is an expectation computed by summing over the alternative possibilities for the true class of x. In this framework, the role of a learning algorithm is to produce a classifier that for any example x can estimate the probability p(j|x) of each class j being the true class of x.

8.2 Cost matrix properties

Conceptually, the cost of labeling an example incorrectly should always be greater than the cost of labeling it correctly. For a 2x2 cost matrix, this means that it should always be the case that c10 > c00 and c01 > c11. We call these conditions the "reasonableness" conditions. Suppose that the first reasonableness condition is violated, so c00 ≥ c10 but still c01 > c11. In this case the optimal policy is to label all examples positive. Similarly, if c10 > c00 but c11 ≥ c01 then it is optimal to label all examples negative. (The reader can analyze the case where both reasonableness conditions are violated.)

For some cost matrices, some class labels are never predicted by the optimal policy given by Equation (8.1). The following is a criterion for when this happens. Say that row m dominates row n in a cost matrix C if for all j, c(m, j) ≥ c(n, j). In this case the cost of predicting n is no greater than the cost of predicting m, regardless of what the true class j is. So it is optimal never to predict m. As a special case, the optimal prediction is always n if row n is dominated by all other rows in a cost matrix. The two reasonableness conditions for a two-class cost matrix imply that neither row in the matrix dominates the other.

Given a cost matrix, the decisions that are optimal are unchanged if each entry in the matrix is multiplied by a positive constant. This scaling corresponds to changing the unit of account for costs. Similarly, the decisions that are optimal are unchanged if a constant is added to each entry in the matrix. This shifting corresponds to changing the baseline away from which costs are measured.
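Equation (8.1) is straightforward to implement. The cost matrix and probabilities below are invented for illustration; the point is that the cheaper prediction can be the less probable class.

```python
def optimal_prediction(cost, probs):
    """Return the class i minimizing e(x, i) = sum_j probs[j] * cost[i][j].

    cost[i][j] is the cost of predicting class i when class j is true;
    probs[j] is the estimated probability that j is the true class."""
    expected = [sum(p * c_ij for p, c_ij in zip(probs, row)) for row in cost]
    return min(range(len(cost)), key=lambda i: expected[i])

# Example: a false negative (c01 = 10) costs ten times a false positive
# (c10 = 1), so predicting class 1 is optimal even when p(1|x) is only 0.2.
cost = [[0, 10],
        [1, 0]]
decision = optimal_prediction(cost, [0.8, 0.2])
```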
By scaling and shifting entries, any two-class cost matrix

c00  c01
c10  c11

that satisfies the reasonableness conditions can be transformed into a simpler matrix that always leads to the same decisions:

0  c01'
1  c11'

where c01' = (c01 − c00)/(c10 − c00) and c11' = (c11 − c00)/(c10 − c00). From a matrix perspective, a 2x2 cost matrix effectively has two degrees of freedom.

8.3 The logic of costs

Costs are not necessarily monetary. A cost can also be a waste of time, or the severity of an illness, for example. Although most recent research in machine learning has used the terminology of costs, doing accounting in terms of benefits is generally preferable, because avoiding mistakes is easier, since there is a natural baseline from which to measure all benefits. This baseline is the state of the agent before it takes a decision regarding an example. After the agent has made the decision, if it is better off, its benefit is positive. Otherwise, its benefit is negative.

When thinking in terms of costs, it is easy to posit a cost matrix that is logically contradictory because not all entries in the matrix are measured from the same baseline. For example, consider the so-called German credit dataset that was published as part of the Statlog project [Michie et al., 1994]. The cost matrix given with this dataset at http://www.sgi.com/tech/mlc/db/german.names is as follows:

             predict bad   predict good
actual bad   0             5
actual good  1             0

Here examples are people who apply for a loan from a bank. "Actual good" means that a customer would repay a loan while "actual bad" means that the customer would default. The action associated with "predict bad" is to deny the loan.
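The two-parameter simplification described above is easy to compute. This sketch uses an invented cost matrix and checks that the simplified matrix yields the same decision as the original for several probabilities.

```python
def simplify_cost_matrix(c00, c01, c10, c11):
    """Scale and shift a reasonable 2x2 cost matrix into the equivalent
    form [[0, c01'], [1, c11']] that leads to the same decisions."""
    denom = c10 - c00  # positive under the reasonableness condition c10 > c00
    return (c01 - c00) / denom, (c11 - c00) / denom

def expected_costs(c00, c01, c10, c11, p):
    """Expected cost of predicting 0 and of predicting 1, with p = p(y=1|x)."""
    return (1 - p) * c00 + p * c01, (1 - p) * c10 + p * c11

# Invented matrix: c00 = 2, c01 = 50, c10 = 10, c11 = 2.
c01_s, c11_s = simplify_cost_matrix(2.0, 50.0, 10.0, 2.0)
```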
Costs or benefits can be measured against any baseline, but the baseline must be fixed, and it must be the same for all entries in the matrix. An opportunity cost is a foregone benefit, i.e. a missed opportunity rather than an actual penalty. It is easy to make the mistake of measuring different opportunity costs against different baselines.

For the German credit dataset, the cost matrix above can be justified informally as follows: "The cost of approving a good customer is zero, and the cost of rejecting a bad customer is zero, because in both cases the correct decision has been made. If a good customer is rejected, the cost is an opportunity cost, the foregone profit of 1. If a bad customer is approved for a loan, the cost is the lost loan principal of 5."

To see concretely that the reasoning in quotes above is incorrect, suppose that we have four customers who receive loans and repay them. The net change in assets is then +4. Alternatively, suppose that the bank has one customer of each of the four types. Clearly the cost matrix above is intended to imply that the net change in the assets of the bank is then −4. Regardless of the baseline, any method of accounting should give a difference of 8 between these scenarios. But with the erroneous cost matrix above, the first scenario gives a total cost of 0, while the second scenario gives a total cost of 6.

In every economically reasonable cost matrix for this domain, both entries associated with "predict bad" must be the same. The cash flow relative to any baseline associated with this prediction is the same regardless of whether "actual good" or "actual bad" is true, because in both cases the loan is denied. If these entries are different, it is because different baselines have been chosen for each entry.

In general the amount in some cells of a cost or benefit matrix may not be constant, and may be different for different examples. For example, consider the credit card transactions domain. Here the benefit matrix might be

            refuse   approve
fraudulent  $20      −x
legitimate  −$20     0.02x

where x is the size of the transaction in dollars. Approving a fraudulent transaction costs the amount of the transaction because the bank is liable for the expenses of fraud. Refusing a fraudulent transaction has a nontrivial benefit because it may prevent further fraud and lead to the arrest of a criminal. Refusing a legitimate transaction has a nontrivial cost because it annoys a customer.

8.4 Making optimal decisions

Given known costs for correct and incorrect predictions, a test example should be predicted to have the class that leads to the lowest expected cost. This expectation is
the predicted average cost, and is computed using the conditional probability of each class given the example. In the two-class case, the optimal prediction is class 1 if and only if the expected cost of this prediction is less than or equal to the expected cost of predicting class 0, i.e. if and only if

p(y = 0|x) c10 + p(y = 1|x) c11 ≤ p(y = 0|x) c00 + p(y = 1|x) c01

which is equivalent to

(1 − p) c10 + p c11 ≤ (1 − p) c00 + p c01

given p = p(y = 1|x). If this inequality is in fact an equality, then predicting either class is optimal.

The threshold for making optimal decisions is p∗ such that

(1 − p∗) c10 + p∗ c11 = (1 − p∗) c00 + p∗ c01.

Assuming the reasonableness conditions c10 > c00 and c01 > c11, the optimal prediction is class 1 if and only if p ≥ p∗. Rearranging the equation for p∗ leads to

c00 − c10 = −p∗ c10 + p∗ c11 + p∗ c00 − p∗ c01

which has the solution

p∗ = (c10 − c00) / (c10 − c00 + c01 − c11)

assuming the denominator is nonzero, which is implied by the reasonableness conditions. This formula for p∗ shows that any 2x2 cost matrix has essentially only one degree of freedom from a decision-making perspective, although it has two degrees of freedom from a matrix perspective. The cause of the apparent contradiction is that the optimal decision-making policy is a nonlinear function of the cost matrix.¹

Note that in some domains costs have more impact than probabilities on decision-making. An extreme case is that with some cost matrices the optimal decision is always the same, regardless of what the various outcome probabilities are for a given example x. Essentially, this is the case where the minimax decision and the minimum expected cost decision are always the same. The situation where the optimal decision is fixed is well-known in philosophy as Pascal's wager.

An argument analogous to Pascal's wager provides a justification for Occam's razor in machine learning. Essentially, a PAC theorem says that if we have a fixed, limited number of training examples, and the true concept is simple, then if we pick a simple concept then we will be right. However if we pick a complex concept, there is no guarantee we will be right.

¹ Decisions and predictions may be not in 1:1 correspondence.
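The threshold formula can be sketched directly; the cost values below are invented for illustration.

```python
def decision_threshold(c00, c01, c10, c11):
    """p* = (c10 - c00) / (c10 - c00 + c01 - c11); predict class 1 iff p >= p*."""
    return (c10 - c00) / (c10 - c00 + c01 - c11)

# For 0/1 loss (c00 = c11 = 0, c01 = c10 = 1), the threshold is the usual 0.5.
p_star_01 = decision_threshold(0, 1, 1, 0)

# If false negatives cost 9 times as much as false positives, p* drops to 0.1:
# it becomes optimal to predict class 1 even for quite improbable positives.
p_star_skewed = decision_threshold(0, 9, 1, 0)
```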
The argument can be summarized in a payoff table:

                predict simple   predict complex
nature simple   succeed          fail
nature complex  fail             fail

Here "simple" means that the hypothesis is drawn from a space with low cardinality, while "complex" means it is drawn from a space with high cardinality.

8.5 Limitations of cost-based analysis

The analysis of decision-making above assumes that the objective is to minimize expected cost. This objective is reasonable when a game is played repeatedly, and the actual class of examples is set stochastically. The analysis may not be appropriate when the agent is risk-averse and can make only a few decisions. It is also not appropriate when the actual class of examples is set by a non-random adversary. There are other limitations as well. We may need to make repeated decisions over time about the same example. Decisions about one example may influence other examples. Costs may need to be estimated via learning. And often we do not get full label information on training examples.

8.6 Rules of thumb for evaluating data mining campaigns

This section is preliminary and incomplete. It is not part of the material for 291 in 2011.

Let q be a fraction of the test set. There is a remarkable rule of thumb that the lift attainable at q is around √(1/q). For example, if q = 0.25 then the attainable lift is around √4 = 2. The fraction of all positive examples that belong to the top-ranked fraction q of the test set is then q · √(1/q) = √q.

The rule of thumb is not valid for very small values of q, for two reasons. The first reason is mathematical: an upper bound on the lift is 1/t, where t is the overall fraction of positives; t is also called the target rate, response rate, or base rate. The second reason is empirical: lifts observed in practice tend to be well below 10. Hence, the rule of thumb can reasonably be applied when q > 0.02 and q > t².

The lift in the bottom 1 − q of the test set is (1 − √q)/(1 − q). We have

lim_{q→1} (1 − √q)/(1 − q) = 0.5.

This says that the examples ranked lowest are positive with probability equal to t/2. In other words, the lift for the lowest ranked examples cannot be driven below 0.5.
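A small numerical sketch of the rule of thumb; this is just the arithmetic above, not data from the text.

```python
import math

def attainable_lift(q):
    """Rule-of-thumb lift attainable in the top fraction q: sqrt(1/q)."""
    return math.sqrt(1.0 / q)

def fraction_of_positives(q):
    """Fraction of all positives captured in the top q: q*sqrt(1/q) = sqrt(q)."""
    return q * attainable_lift(q)

def bottom_lift(q):
    """Lift among the bottom 1-q of the ranking: (1 - sqrt(q)) / (1 - q)."""
    return (1.0 - math.sqrt(q)) / (1.0 - q)
```

For example, `attainable_lift(0.25)` is 2, and `bottom_lift(q)` approaches 0.5 as q approaches 1, matching the limit above.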
Let c be the cost of a contact and let b be the benefit of a positive, i.e. of a response. This means that the benefit matrix is

                 decide negative   decide positive
actual negative  0                 −c
actual positive  0                 b − c

Let n be the size of the test set. The profit at q is the total benefit minus the total cost, which is

nqbt√(1/q) − nqc = nc(bt√q/c − q) = nc(k√q − q)

where k = tb/c. Profit is maximized when

0 = d/dq (kq^0.5 − q) = k(0.5)q^−0.5 − 1.

The solution is q = (k/2)². This solution is in the range zero to one only when k ≤ 2. When tb/c > 2 then maximum profit is achieved by soliciting every prospect. The maximum profit that can be achieved is

nc(k(k/2) − (k/2)²) = nck²/4 = ncq.

Remarkably, this is always the same as the cost of running the campaign, which is nqc. In other words, the expected revenue is always twice the campaign cost, so a small adverse change in c, b, or t is not likely to make the campaign lose money.

As k decreases, the attainable profit decreases fast, quadratically. If k < 0.4 then q < 0.04 and the attainable profit is less than 0.04nc. Such a campaign may have high risk, because the lift attainable at small q is typically worse than suggested by the rule of thumb.

Remember that c is the cost of one contact from the perspective of the initiator of the campaign. Each contact also has a cost or benefit for the recipient. From the point of view of the initiator of a campaign, a cost or benefit for a respondent is an externality. It is not rational for the initiator to take into account these benefits or costs, unless they cause respondents to take actions that affect the initiator, such as closing accounts. However, it is rational for society to take into account these externalities. Generally, if the recipient responds, one can assume that the contact was beneficial for him or her, because s/he only responds if the response is beneficial. However, if the recipient does not respond, which is the majority case, then one can assume that the contact caused a net cost to him or her, for example a loss of time. It is interesting to consider whether data mining is beneficial or not from the point of view of society at large.
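The optimization above can be sketched in a few lines. The prospect count, response rate, benefit, and contact cost below are invented example values, not figures from the text.

```python
def optimal_campaign(n, t, b, c):
    """Optimal solicitation fraction and resulting profit, under the
    rule-of-thumb lift model: profit(q) = n*c*(k*sqrt(q) - q), k = t*b/c."""
    k = t * b / c
    q = min(1.0, (k / 2.0) ** 2)  # interior optimum only when k <= 2
    profit = n * c * (k * q ** 0.5 - q)
    return q, profit

# Invented example: 100,000 prospects, 5% response rate, $15 benefit, $0.68 cost.
q_opt, max_profit = optimal_campaign(100_000, 0.05, 15.0, 0.68)
```

At the optimum the profit equals n·c·q, the cost of running the campaign, as derived above.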
The conclusion above is that the revenue from a campaign, for its initiator, is roughly twice its cost. Suppose that the benefit of responding for a respondent is λb, where b is the benefit to the initiator. Suppose also that the cost of a solicitation to a person is µc, where c is the cost to the initiator. The net benefit to respondents is positive as long as µ < 2λ. The reasoning above clarifies why spam email campaigns are harmful to society: for these campaigns, the cost c to the initiator is tiny, but the cost to a recipient is not tiny, so µ is large. Whatever λ is, it is likely that µ > 2λ.

The product tb is the average benefit of soliciting a random customer. If the cost c of solicitation is less than half of this, then it is rational to contact all potential respondents, so data mining is pointless. If c is much greater than the average benefit, then the campaign is likely to have high risk for the initiator. In summary, data mining is only beneficial in a narrow sweet spot, where tb/2 ≤ c ≤ αtb for some constant α greater than 1.

As an example of the reasoning above, consider the scenario of the 1998 KDD contest. Here t = 0.05 about, c = $0.68, and the average benefit is b = $15 approximately. We have k = tb/c = 75/68 = 1.10. The rule of thumb predicts that the optimal fraction of people to solicit is q = 0.30, while the achievable profit per person is cq = $0.21. In fact, the methods that perform best in this domain achieve profit of about $0.16, while soliciting about 70% of all people.

8.7 Evaluating success

In order to evaluate whether a real-world campaign is proceeding as expected, we need to estimate a confidence interval for the profit on the test set. This should be a range [a, b] such that the probability is (say) 95% that the observed profit lies in this range. Intuitively, there are two sources of uncertainty for each test person: whether s/he will donate, and if so, how much. We need to work out the logic of how to quantify these uncertainties, and then the logic of combining the uncertainties into an overall confidence interval. The central issue here is not any mathematical details such as Student distributions versus Gaussians. It is the logic of tracking where uncertainty arises, and how it is propagated.
Quiz (2009)

Suppose your lab is trying to crystallize a protein. You can try experimental conditions x that differ on temperature, salinity, etc. You have a classifier that predicts p(y = 1|x), the probability of success of experiment x. The label y = 1 means crystallization is successful, while y = 0 means failure. The value of successful crystallization is $9000. The cost of doing one experiment is $60. Assume (not realistically) that the results of different experiments are independent.

(a) Write down the benefit matrix involved in deciding rationally whether or not to perform a particular experiment.

The benefit matrix is

          do experiment   don't
success   9000 − 60       0
failure   −60             0

(b) What is the threshold probability of success needed in order to perform an experiment?

It is rational to do an experiment under conditions x if and only if the expected benefit is positive, that is if and only if

(9000 − 60) · p + (−60) · (1 − p) > 0

where p = p(success|x). The threshold value of p is 60/9000 = 1/150.

(c) Is it rational for your lab to take into account a budget constraint such as "we only have a technician available to do one experiment per day"?

No. A budget constraint is an alternative rule for making decisions that is less rational than maximizing expected benefit. The optimal behavior is to do all experiments that have success probability over 1/150. If just one successful crystallization is enough, then experiments should be done starting with those that have highest probability of success, until the first actual success. If the budget constraint is unavoidable, then experiments should also be done in order of declining success probability. If many experiments are worth doing, then more technicians should be hired.

Additional explanation: The $60 cost of doing an experiment should include the expense of technician time.
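The quiz arithmetic can be sketched as follows; the function simply evaluates the expected-benefit expression from part (b).

```python
def expected_benefit(p, value=9000.0, cost=60.0):
    """Expected benefit of running one experiment with success probability p."""
    return (value - cost) * p + (-cost) * (1.0 - p)

# The break-even probability is cost/value = 60/9000 = 1/150.
threshold = 60.0 / 9000.0
```

Expected benefit is positive exactly when p exceeds the threshold 1/150.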
Quiz for May 4, 2010

Your name:

Suppose that you work for a bank that wants to prevent criminal laundering of money. You have a classifier that estimates the probability p(y = 1|x) where x is a vector of feature values describing a money transfer. The label y = 1 means a money transfer is criminal, while y = 0 means the transfer is legal. Let z be the dollar amount of the transfer. The matrix of costs (negative) and benefits (positive) involved in deciding whether or not to deny a transfer is as follows:

           deny      allow
criminal   0         −z
legal      −0.10z    0.01z

Work out the rational policy based on p(y = 1|x) for deciding whether or not to allow a transfer.
CSE 291 Quiz for May 3, 2011

Your name:
Email address:

Suppose that you are looking for a rare plant. You can search geographical locations x that differ on rainfall, urbanization, etc. You have a classifier that predicts p(y = 1|x), the probability of success when searching at location x. The label y = 1 means the plant is found, while y = 0 means failure. The value of finding the plant is $7000. The cost of doing one search is $200. Assume (not realistically) that the results of different searches are independent.

(a) Write down the benefit matrix involved in deciding rationally whether or not to perform a particular search.

(b) What is the threshold probability of success needed in order to perform a search?

(c) Suppose that you are constrained to do only one search per day. For every two unsuccessful searches, the information gained is expected to save you, on average, one future search. Where should you search each day?
CSE 291 Assignment 4 (2011)

This assignment is due at the start of class on Tuesday May 10, 2011. You have two weeks for this assignment. As before, you should work in a team of two, and you are free to change partners or not. This assignment is the last one to use the KDD98 data, so do a careful job and write a comprehensive report.

You should train a regression function to predict donation amounts, and a classifier to predict donation probabilities. For a test example x, let the predicted donation be a(x) and let the predicted donation probability be p(x). The goal is to solicit an optimal subset of the test examples. It is rational to send a donation request to person x if and only if p(x) · a(x) ≥ 0.68. The measure of success to maximize is total donations received minus $0.68 for every solicitation.

You should now train on the entire training set, and use only the training set for all development work. The test set is in the file cup98val.zip at http://archive.ics.uci.edu/ml/databases/kddcup98/kddcup98.html. The test set labels are in valtargt.txt. These labels are sorted by CONTROLN, unlike the test instances. Use the actual test set only once, to measure the final success of your method on data that you have not previously used.
CSE 291 Assignment (2010)

This week's assignment is to participate in the PAKDD 2010 data mining contest. Details of the contest are at http://sede.neurotech.com.br/PAKDD2010/. Each team of two students should register and download the files for the contest.

Your first goal should be to understand the data and submission formats, and to submit a correct set of predictions. Make sure that when you have good predictions later, you will not run into any technical difficulties. Your second goal should be to understand the contest scenario and the differences between the training and two test datasets. Do some exploratory analysis of the three datasets.

Next, based on your general understanding, design a sensible approach for achieving the contest objective. Implement this approach and submit predictions to the contest. Of course, you may refine your approach iteratively and you may make multiple submissions. Meet the May 3 deadline for uploading your best predictions and a copy of your report.

In your written report, explain your understanding of the scenario, and your general findings about the datasets. The contest rules ask each team to submit a paper of four pages. The manuscript should be in the scientific paper format of the conference, detailing the stages of the KDD process, focusing on the aspects which make the solution innovative. Do not worry about formatting details. You can find template files for LaTeX and Word at http://www.springer.com/computer/lncs?SGWID=016467933410. The quality of the manuscripts will not be used for the competition assessment, unless in the case of ties. The manuscript should be the basis for writing up a full paper for an eventual post-conference proceeding (currently under negotiation). The authors of top solutions and selected manuscripts with innovative approaches will have the opportunity to submit to that forum.
Chapter 9

Learning classifiers despite missing labels

This chapter discusses how to learn two-class classifiers from non-standard training data. Specifically, we consider three different but related scenarios where labels are missing for some but not all training examples.

9.1 The standard scenario for learning a classifier

In the standard situation, the training and test sets are two different random samples from the same population. Being from the same population means that both sets follow a common probability distribution. Let x be a training or test instance, and let y be its label. The common distribution is p(x, y).

The notation p(x, y) is an abbreviation for p(X = x and Y = y) where X and Y are random variables, and x and y are possible values of these variables. If X and Y are both discrete-valued random variables, this distribution is a probability mass function (pmf). Otherwise, it is a probability density function (pdf).

It is important to understand that we can always write p(x, y) = p(x)p(y|x) and also p(x, y) = p(y)p(x|y), without loss of generality and without making any assumptions. The equations above are sometimes called the chain rule of probabilities. They are true both when the probability values are probability masses, and when they are probability densities.
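For a discrete distribution the chain rule can be verified directly. The joint pmf below is invented for illustration; the check confirms p(x, y) = p(x)·p(y|x) for every cell, with no assumptions about independence.

```python
# Joint pmf over x in {0,1} and y in {0,1}; values are arbitrary but sum to 1.
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def marginal_x(x):
    """p(x) = sum over y of p(x, y)."""
    return sum(p for (xv, _), p in joint.items() if xv == x)

def conditional_y_given_x(y, x):
    """p(y|x) = p(x, y) / p(x)."""
    return joint[(x, y)] / marginal_x(x)

# The chain rule p(x, y) = p(x) * p(y|x) holds for every cell.
check = all(
    abs(joint[(x, y)] - marginal_x(x) * conditional_y_given_x(y, x)) < 1e-12
    for x in (0, 1) for y in (0, 1)
)
```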
9.2 Sample selection bias in general

Suppose that the label y is observed for some training instances x, but not for others; otherwise we are in the usual situation. Let s be a new random variable with value s = 1 if and only if x is selected. This means that y is observed if and only if s = 1. Here, x, y, and s are random variables, and there is some fixed unknown joint probability distribution p(x, y, s) over triples ⟨x, y, s⟩.

We can identify three possibilities for s. The easiest case is when p(s = 1|x, y) is a constant, that is when p(s = 1) = p(s = 1|x, y). In this case, the labeled training examples are a fully random subset. We have fewer training examples, but otherwise we are in the usual situation. This case is called "missing completely at random" (MCAR).

A more difficult situation arises if s is correlated with x and/or with y. In this case, there are two subcases. First, suppose that s depends on x but, conditional on x, not on y. This means that p(s = 1|x, y) = p(s = 1|x) for all x. In other words, when x is fixed then s is independent of y: s and y are conditionally independent, given x. This case is rather misleadingly called "missing at random" (MAR). It is not the case that labels are missing in a totally random way, because missingness does depend on x. It is also not the case that s and y are independent; unconditionally, there is still some correlation between s and y.

In the MAR case, the training instances x for which y is observed are not a random sample from the same population as the test instances. However, Bayes' rule gives that

p(y|x, s = 1) = p(s = 1|x, y) p(y|x) / p(s = 1|x) = p(s = 1|x) p(y|x) / p(s = 1|x) = p(y|x)

assuming that p(s = 1|x) > 0 for all x. In other words, for each value of x the equation p(y|x, s = 0) = p(y|x) = p(y|x, s = 1) holds. Therefore we can learn a correct model of p(y|x) from just the labeled training data, without using the unlabeled data in any way. An important question is whether the unlabeled training examples can be exploited in some way. However, here we focus on a different question: how can we adjust for the fact that the labeled training examples are not representative of the test examples?

The assumption that p(s = 1|x) > 0 is important. The real-world meaning of this assumption is that label information must be available with nonzero probability for every possible instance x. Otherwise, it might be the case for some x that no labeled training examples are available from which to estimate p(y|x, s = 1).

Second, suppose that even when x is known, there is still some correlation between s and y. In this case the label y is said to be "missing not at random" (MNAR) and inference is much more difficult. We do not discuss this case further here.
and shift means change of distribution. but we want to learn a classiﬁer to use at a different hospital. then the “through the door” success rate may be higher than the observed success rate.3 Covariate shift Imagine that the available training data come from one hospital. Another example is the task of learning to predict the beneﬁt of a medical treatment. They were previously selected precisely because they were thought to be more likely to repay. the assumption that p(yx) is unchanged means that p(yx. but we are willing to assume that the relationship to be learned. Conversely. The important similarity is the common assumption that missingness depends only on x. The simple approach to covariate shift is thus train the classiﬁer on the data x. Labeled training examples from all classes are available. s = 1 in the usual way.3.4 Reject inference Suppose that we want to build a model that predicts who will repay a loan.” The distribution p(x) of patients is different at the two hospitals. The “reject inference” scenario is similar to the covariate shift scenario. Let x be a patient and let y be a label such as “has swine ﬂu. s = 1) = p(yx. no unlabeled training examples are available. and not additionally on y. The usable training examples are people who were actually granted a loan in the past. is unchanged. but the distribution of instances is different for the training and test sets. 9. The problem of “reject inference” is to learn somehow from previous applicants who were not approved. and to apply it to patients from the second hospital directly. Following the argument in the previous section. Whether y is missing may . for whom we hence do not have training labels. if doctors have historically used the treatment only on patients who were especially ill. In a scenario like the one above. COVARIATE SHIFT 89 9. s = 0) where s = 1 refers to the ﬁrst hospital and s = 0 refers to the second hospital.9. y. 
The covariate shift scenario is so called because covariate means feature and shift means change of distribution: the distribution of the features x shifts between training and testing, while the relationship to be learned, which is p(y|x), is assumed unchanged. The reject inference scenario is similar to the covariate shift scenario, but with two differences. The first difference is that unlabeled training instances are available, namely the previous applicants who were not approved. The second difference is that the label being missing is expected to be correlated with the value of the label.
Following the argument in the previous section, whether y is missing may be correlated with the value of y, but this correlation must disappear after conditioning on x. If missingness does depend on y even after conditioning on x, then we are in the MNAR situation, and in general we can draw no firm conclusions. An example of this situation is "survivor bias." Suppose our analysis is based on historical records, and those records are more likely to exist for survivors, everything else being equal. Then p(s = 1|y = 1, x) > p(s = 1|y = 0, x), where "everything else being equal" means that x is the same on both sides.

A useful fact is the following. Suppose we want to estimate E[f(z)] where z follows the distribution p(z), but we can only draw samples of z from the distribution q(z). The fact is

    E[f(z) | z ~ p(z)] = E[f(z) p(z)/q(z) | z ~ q(z)]

assuming q(z) > 0 for all z. More generally, the requirement is that q(z) > 0 whenever p(z) > 0, with the definition 0/0 = 0. The equation above is called the importance-sampling identity.

Let the goal be to compute E[f(x, y)] for any function f, where (x, y) ~ p(x, y). To make notation more concise, write this as E[f], and write E[f(x, y) | (x, y) ~ p(x, y | s = 1)] as E[f | s = 1]. We have

    E[f] = E[ f · p(x)p(y|x) / (p(x|s = 1)p(y|x, s = 1)) | s = 1 ]
         = E[ f · p(x)/p(x|s = 1) | s = 1 ]

since p(y|x) = p(y|x, s = 1). Applying Bayes' rule to p(x|s = 1) gives

    E[f] = E[ f · p(x) / (p(s = 1|x)p(x)/p(s = 1)) | s = 1 ]
         = p(s = 1) E[ f / p(s = 1|x) | s = 1 ].

The constant p(s = 1) can be estimated as r/n, where r is the number of labeled training examples and n is the total number of training examples. Let p̂(s = 1|x) be a trained model of the conditional probability p(s = 1|x). The estimate of E[f] is then

    (r/n) · (1/r) Σ_{i=1..r} f(x_i, y_i) / p̂(s = 1|x_i)  =  (1/n) Σ_{i=1..r} f(x_i, y_i) / p̂(s = 1|x_i).
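As a sketch, the reweighting estimate can be checked on synthetic data. Everything concrete below (the scalar feature, the propensity function, and using the true propensity in place of a trained model p̂(s = 1|x)) is an illustrative assumption, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (an illustrative assumption): x is a scalar feature, the quantity
# to estimate is E[f] with f(x, y) = y, and the probability of being labeled
# depends only on x, not on y given x.
n = 100_000
x = rng.uniform(0.0, 1.0, size=n)
y = (rng.uniform(size=n) < x).astype(float)   # p(y = 1 | x) = x
p_s1_given_x = 0.2 + 0.6 * x                  # labeling propensity p(s = 1 | x)
s = rng.uniform(size=n) < p_s1_given_x

# Naive estimate from labeled examples only: biased upward, because labeling
# favors large x, and large x means large p(y = 1 | x).
naive = y[s].mean()

# Plug-in estimate: (1/n) * sum over labeled i of f(x_i, y_i) / p_hat(s=1|x_i).
# Here the true propensity stands in for a trained model p_hat.
weighted = (y[s] / p_s1_given_x[s]).sum() / n
```

Under this toy model E[y] = 0.5, so the weighted estimate should land near 0.5 while the naive estimate is pulled toward 0.6.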
This estimate is called a "plug-in" estimate because it is based on plugging the observed values of x, y into a formula that would be correct if based on integrating over all values of x, y. In medical research, the ratio p(s = 1)/p(s = 1|x) is called the "inverse probability of treatment" (IPT) weight.

The weighting approach just explained is correct in the statistical sense that it is unbiased, if the propensity estimates p̂(s = 1|x) are correct. However, this approach typically has high variance, since a few labeled examples with high values for 1/p̂(s = 1|x) can dominate the sum. One simple heuristic is to place a ceiling on the values 1/p̂(s = 1|x); for example, the ceiling 1000 is used by [Huang et al., 2006]. However, no good method for selecting the ceiling value is known. Therefore an important question is whether alternative approaches exist that have lower variance. When does the reject inference scenario give rise to MNAR bias?

9.5 Positive and unlabeled examples

Suppose that only positive examples are labeled. A training set consists of two subsets, called the labeled (s = 1) and unlabeled (s = 0) sets. Stated formally, only positive examples can be labeled:

    p(s = 1|x, y = 0) = 0.    (9.1)

Without some assumption about which positive examples are labeled, it is impossible to make progress. A common assumption is that the labeled positive examples are chosen completely randomly from all positive examples. This can be stated formally as the equation

    p(s = 1|x, y = 1) = p(s = 1|y = 1) = c.    (9.2)

Another way of stating the assumption is that s and x are conditionally independent given y.

Suppose we provide these two sets as inputs to a standard training algorithm. This algorithm will yield a function g(x) such that g(x) = p(s = 1|x) approximately. The following lemma shows how to obtain a model of p(y = 1|x) from g(x).

Lemma 1. Suppose the "selected completely at random" assumption holds. Then p(y = 1|x) = p(s = 1|x)/c where c = p(s = 1|y = 1).
Let this be called the "selected completely at random" assumption.
Proof. Remember that the assumption is p(s = 1|y = 1, x) = p(s = 1|y = 1). Now consider p(s = 1|x). We have that

    p(s = 1|x) = p(y = 1 ∧ s = 1|x)
               = p(y = 1|x) p(s = 1|y = 1, x)
               = p(y = 1|x) p(s = 1|y = 1).

The result follows by dividing each side by p(s = 1|y = 1).

Several consequences of the lemma are worth noting. First, f is an increasing function of g. This means that if the classifier f is only used to rank examples x according to the chance that they belong to class y = 1, then the classifier g can be used directly instead of f. Second, f = g/p(s = 1|y = 1) is a well-defined probability f ≤ 1 only if g ≤ p(s = 1|y = 1). What this says is that g > p(s = 1|y = 1) is impossible. This is reasonable because the labeled (positive) and unlabeled (negative) training sets for g are samples from overlapping regions in x space; hence it is impossible for any example x to belong to the positive class for g with a high degree of certainty.

The value of the constant c = p(s = 1|y = 1) can be estimated using a trained classifier g and a validation set of examples. Let V be such a validation set that is drawn from the overall distribution p(x, y, s) in the same manner as the nontraditional training set, and let P be the subset of examples in V that are labeled (and hence positive). The estimator of p(s = 1|y = 1) is the average value of g(x) for x in P. Formally, the estimator is e1 = (1/n) Σ_{x∈P} g(x) where n is the cardinality of P.

We shall show that e1 = p(s = 1|y = 1) = c if it is the case that g(x) = p(s = 1|x) for all x. To do this, all we need to show is that g(x) = c for x ∈ P. We can show this as follows:

    g(x) = p(s = 1|x)
         = p(s = 1|x, y = 1) p(y = 1|x) + p(s = 1|x, y = 0) p(y = 0|x)
         = p(s = 1|x, y = 1) · 1 + 0 · 0    since x ∈ P
         = p(s = 1|y = 1).

Note that in principle any single example from P is sufficient to determine c, but that in practice averaging over all members of P is preferable.

There is an alternative way of using Lemma 1. Let the goal be to estimate E_{p(x,y,s)}[h(x, y)] for any function h, where p(x, y, s) is the overall distribution. To make notation more concise, write this as E[h]. We want an estimator of E[h] based on a positive-only training set of examples of the form (x, s).
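The estimator e1 and Lemma 1 can be exercised on synthetic data, as a sketch. The toy distribution (y determined by thresholding a uniform x, with c = 0.4) is an illustrative assumption, and the ideal g(x) = p(s = 1|x) is plugged in directly rather than trained.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy PU data (illustrative assumptions): x is uniform on [0, 1], y = 1
# exactly when x > 0.5, and each positive example is labeled with constant
# probability c = 0.4 ("selected completely at random").
c_true = 0.4
n = 100_000
x = rng.uniform(size=n)
y = (x > 0.5).astype(int)
s = ((rng.uniform(size=n) < c_true) & (y == 1)).astype(int)

# The nontraditional classifier approximates g(x) = p(s = 1 | x), which here
# equals c * p(y = 1 | x); we plug in this ideal g instead of training one.
def g(x):
    return c_true * (x > 0.5)

# Estimator e1: the average of g(x) over the labeled (hence positive) examples.
P = x[s == 1]
e1 = g(P).mean()            # recovers c

# Lemma 1: the traditional classifier is f(x) = g(x) / c = p(y = 1 | x).
f_vals = g(x) / e1
```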
Clearly p(y = 1|x, s = 1) = 1. Less obviously, for an unlabeled example,

    p(y = 1|x, s = 0) = p(s = 0|x, y = 1) p(y = 1|x) / p(s = 0|x)
                      = [1 − p(s = 1|x, y = 1)] p(y = 1|x) / [1 − p(s = 1|x)]
                      = (1 − c) p(y = 1|x) / [1 − p(s = 1|x)]
                      = (1 − c) [p(s = 1|x)/c] / [1 − p(s = 1|x)]
                      = ((1 − c)/c) · p(s = 1|x) / (1 − p(s = 1|x)).    (9.3)

By definition,

    E[h] = Σ_{x,y,s} p(x, y, s) h(x, y)
         = Σ_x p(x) Σ_s p(s|x) Σ_y p(y|x, s) h(x, y)
         = Σ_x p(x) [ p(s = 1|x) h(x, 1)
                      + p(s = 0|x) ( p(y = 1|x, s = 0) h(x, 1) + p(y = 0|x, s = 0) h(x, 0) ) ].

The plug-in estimate of E[h] is then the empirical average

    (1/m) [ Σ_{(x, s=1)} h(x, 1) + Σ_{(x, s=0)} ( w(x) h(x, 1) + (1 − w(x)) h(x, 0) ) ]

where w(x) = p(y = 1|x, s = 0) and m is the cardinality of the training set. What this says is that each labeled example is treated as a positive example with unit weight, while each unlabeled example is treated as a combination of a positive example with weight p(y = 1|x, s = 0) and a negative example with complementary weight 1 − p(y = 1|x, s = 0). The probability p(s = 1|x) is estimated as g(x), where g is the nontraditional classifier explained in the previous section.
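The weight w(x) from equation (9.3) can be written as a small helper; the numeric values of c and g(x) below are illustrative, not from the text.

```python
import numpy as np

# Sketch of equation (9.3): weights for unlabeled examples, given g(x) =
# estimated p(s = 1 | x) and c = estimated p(s = 1 | y = 1).
def unlabeled_weight(g_x, c):
    """w(x) = p(y = 1 | x, s = 0) = ((1 - c)/c) * g(x) / (1 - g(x))."""
    g_x = np.asarray(g_x, dtype=float)
    return ((1.0 - c) / c) * g_x / (1.0 - g_x)

c = 0.4
g_unlabeled = np.array([0.1, 0.2, 0.3])    # g(x) for three unlabeled examples
w = unlabeled_weight(g_unlabeled, c)

# Each unlabeled example becomes two weighted copies: positive with weight
# w(x) and negative with weight 1 - w(x); labeled examples keep unit weight.
pos_weight = w
neg_weight = 1.0 - w
```

Since g(x) ≤ c under the assumptions of Lemma 1, these weights stay in [0, 1].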
The result above on estimating E[h] can be used to modify a learning algorithm in order to make it work with positive and unlabeled training data. One method is to give training examples individual weights. Positive examples are given unit weight, and unlabeled examples are duplicated: one copy of each unlabeled example is made positive with weight p(y = 1|x, s = 0), and the other copy is made negative with weight 1 − p(y = 1|x, s = 0).

9.6 Further issues

Spam filtering scenario. Adverse selection. Moral hazard. Concept drift. Observational studies.
Quiz for May 11, 2010

Your name:

The importance sampling identity is the equation

    E[f(z) | z ~ p(z)] = E[f(z) w(z) | z ~ q(z)]

where w(z) = p(z)/q(z). We assume that q(z) > 0 for all z such that p(z) > 0, and we define 0/0 = 0. Suppose that the training set consists of values z sampled according to the probability distribution q(z). Explain intuitively which members of the training set will have greatest influence on the estimate of E[f(z) | z ~ p(z)].
Quiz for 2009

(a) Suppose you use the weighting approach to deal with reject inference. What are the minimum and maximum possible values of the weights?

Let x be a labeled example, and let its weight be p(s = 1)/p(s = 1|x). The conditional probability p(s = 1|x) can range between 0 and 1, so the weight can range between p(s = 1) and infinity. Intuitively, this weight is how many copies are needed to allow the one labeled example to represent all the unlabeled examples that are similar.

(b) Suppose you use the weighting approach to learn from positive and unlabeled examples. What are the minimum and maximum possible values of the weights?

In this scenario, weights are assigned to unlabeled examples, not to labeled examples as above. The weights here are probabilities p(y = 1|x, s = 0), so they range between 0 and 1.

(c) Explain intuitively what can go wrong if the "selected completely at random" assumption is false.

The "selected completely at random" assumption says that the positive examples with known labels are perfectly representative of the positive examples with unknown labels. If this assumption is false, then there will be unlabeled examples that in fact are positive, but that we treat as negative because they are different from the labeled positive examples. The trained model of the positive class will be too narrow.
Assignment (revised)

The goal of this assignment is to train useful classifiers using training sets with missing information of three different types: (i) covariate shift, (ii) reject inference, and (iii) no labeled negative training examples. In http://www.cs.ucsd.edu/users/elkan/291/dmfiles.zip you can find four datasets, including one test set. (Many thanks to Aditya Menon for creating these files.) All four datasets are derived from the so-called Adult dataset that is available at http://archive.ics.uci.edu/ml/datasets/Adult. Each example describes one person. The label to be predicted is whether or not the person earns over $50,000 per year. (We are not using the published weightings, so the fnlwgt feature has been omitted from our datasets.)

You should train a classifier separately based on each training set, and measure performance separately but on the same test set. Use accuracy as the primary measure of success, and use the logistic regression option of FastLargeMargin as the main training method. First, do cross-validation on the test set to establish the accuracy that is achievable when all training set labels are known; this is meaningful because those examples are a random sample from the test population.

Training set (i) has 5,305 examples. It requires you to overcome covariate shift, since it does not follow the same distribution as the population of test examples. Use an appropriate weighting method. Training set (ii) requires reject inference. The persons with known labels are, on average, better prospects than the ones with unknown labels; other training examples may be negative or positive, but the training label is known only for some training examples. In training set (iii), a random subset of positive training examples have known labels. The true value is c = 0.4; of course, in a real application we would not know this.

Compare learning p(y|x) directly with learning p(y|x) after reweighting. In your report, show graphically a learning curve, that is, accuracy as a function of the number of training examples, for 1000, 2000, etc. training examples.
The four datasets comprise one training set for each of the three scenarios plus the test set; training sets (ii) and (iii) have 11,164 examples, and the test set has 11,307 examples. Each example has values for 13 predictors. The earnings label is an interesting one to predict because it is analogous to a label describing whether or not the person is a customer that is desirable in some way. Evaluate experimentally the effectiveness of learning p(y|x) from biased training sets of size 1000, 2000, etc. Explain how you estimate the constant c = p(s = 1|y = 1) and discuss the accuracy of this estimate. In each case, you should be able to achieve better than 82% accuracy.
For each scenario and each training method, include in your report a learning curve figure that shows accuracy as a function of the number (1000, 2000, etc.) of labeled training examples used. Discuss the extent to which each of the three missing-label scenarios reduces achievable accuracy.
Chapter 10
Recommender systems
The collaborative filtering (CF) task is to recommend items to a user that he or she is likely to like, based on ratings for different items provided by the same user and on ratings provided by other users. The general assumption is that users who give similar ratings to some items are likely also to give similar ratings to other items. From a formal perspective, the input to a collaborative filtering algorithm is a matrix of incomplete ratings. Each row corresponds to a user, while each column corresponds to an item. If user 1 ≤ i ≤ m has rated item 1 ≤ j ≤ n, the matrix entry xij is the value of this rating. Often rating values are integers between 1 and 5. If a user has not rated an item, the corresponding matrix entry is missing. Missing ratings are often represented as 0, but this should not be viewed as an actual rating value. The output of a collaborative filtering algorithm is a prediction of the value of each missing matrix entry. Typically the predictions are not required to be integers. Given these predictions, many different real-world tasks can be performed. For example, a recommender system might suggest to a user those items for which the predicted ratings by this user are highest. There are two main general approaches to the formal collaborative filtering task: nearest-neighbor-based and model-based. Given a user and item for which a prediction is wanted, nearest neighbor (NN) approaches use a similarity function between rows, and/or a similarity function between columns, to pick a subset of relevant other users or items. The prediction is then some sort of average of the known ratings in this subset. Model-based approaches to collaborative filtering construct a low-complexity representation of the complete xij matrix. This representation is then used instead of the original matrix to make predictions. Typically each prediction can be computed in O(1) time using only a fixed number of coefficients from the representation.
The most popular model-based collaborative filtering algorithms are based on standard matrix approximation methods such as the singular value decomposition (SVD), principal component analysis (PCA), or nonnegative matrix factorization (NNMF). Of course these methods are themselves related to each other. From the matrix decomposition perspective, the fundamental issue in collaborative filtering is that the matrix given for training is incomplete. Many algorithms have been proposed to extend SVD, PCA, NNMF, and related methods to apply to incomplete matrices, often based on expectation-maximization (EM). Unfortunately, methods based on EM are intractably slow for large collaborative filtering tasks, because they require many iterations, each of which computes a matrix decomposition on a full m by n matrix. Here, we discuss an efficient approach to decomposing large incomplete matrices.
10.1
Applications of matrix approximation
In addition to collaborative filtering, many other applications of matrix approximation are important also. Examples include the voting records of politicians, the results of sports matches, and distances in computer networks. The largest research area with a goal similar to collaborative filtering is so-called "item response theory" (IRT), a subfield of psychometrics. The aim of IRT is to develop models of how a person answers a multiple-choice question on a test. Users and items in collaborative filtering correspond to test-takers and test questions in IRT. Missing ratings are called omitted responses in IRT. One conceptual difference between collaborative filtering research and IRT research is that the aim in IRT is often to find a single parameter for each individual and a single parameter for each question that together predict answers as well as possible. The idea is that each individual's parameter value is then a good measure of intrinsic ability and each question's parameter value is a good measure of difficulty. In contrast, in collaborative filtering research the goal is often to find multiple parameters describing each person and each item, because the assumption is that each item has multiple relevant aspects and each person has separate preferences for each of these aspects.
10.2
Measures of performance
Given a set of predicted ratings and a matching set of true ratings, the difference between the two sets can be measured in alternative ways. The two standard measures are mean absolute error (MAE) and mean squared error (MSE). Given a probability distribution over possible values, MAE is minimized by taking the median of the distribution, while MSE is minimized by taking the mean of the distribution. In general the mean and the median are different, so in general predictions that minimize MAE are different from predictions that minimize MSE. Most matrix approximation algorithms aim to minimize MSE between the training data (a complete or incomplete matrix) and the approximation obtained from the learned low-complexity representation. Some methods can be used with equal ease to minimize MSE or MAE.
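The claim that MAE is minimized by the median while MSE is minimized by the mean can be checked numerically with a small sketch; the sample values here are arbitrary illustrations.

```python
import numpy as np

# Among constant predictions, the median minimizes MAE and the mean
# minimizes MSE (numerical check on an arbitrary sample).
values = np.array([1.0, 2.0, 5.0, 5.0, 7.0])   # median 5.0, mean 4.0
candidates = np.linspace(0.0, 8.0, 8001)        # candidate constant predictions

mae = np.abs(values[:, None] - candidates[None, :]).mean(axis=0)
mse = ((values[:, None] - candidates[None, :]) ** 2).mean(axis=0)

best_for_mae = candidates[mae.argmin()]   # close to the median
best_for_mse = candidates[mse.argmin()]   # close to the mean
```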
10.3
Additive models
Let us consider first models where each matrix entry is represented as the sum of two contributions, one from its row and one from its column. Formally, aij = ri + cj, where aij is the approximation of xij, and ri and cj are scalars to be learned. Define the training mean x̄i· of each row to be the mean of all its known values; define the training mean x̄·j of each column similarly. The user-mean model sets ri = x̄i· and cj = 0, while the item-mean model sets ri = 0 and cj = x̄·j. A slightly more sophisticated baseline is the "bi-mean" model: ri = 0.5 x̄i· and cj = 0.5 x̄·j. The "bias from mean" model has ri = x̄i· and cj equal to the average of (xij − ri) over all rows for column number j. Intuitively, cj is then the average amount by which users who provide a rating for item j like this item more or less than the average item that they have rated.

The optimal additive model can be computed quite straightforwardly. Let I be the set of matrix indices (i, j) for which xij is known. The MSE-optimal additive model is the one that minimizes

    Σ_{(i,j)∈I} (ri + cj − xij)².
This optimization problem is a special case of a sparse least-squares linear regression problem: find z that minimizes ‖Az − b‖², where the column vector z = (r1, ..., rm, c1, ..., cn), b is a column vector of the xij values, and the row of A corresponding to entry (i, j) is all zero except for ones in positions i and m + j. The vector z can be computed by many methods, including the Moore-Penrose pseudoinverse of A: z = (AᵀA)⁻¹Aᵀb. However, this approach requires inverting an m + n by m + n matrix, which is computationally expensive. For linear regression in general, approaches based on QR factorization are more efficient. For this particular special case of linear regression, Section 10.4 below gives a gradient-descent method to obtain the optimal z that only requires O(|I|) time and space. The matrix A is always rank-deficient, with rank at most m + n − 1, because any constant can be added to all ri and subtracted from all cj without changing the matrix reconstruction. If some users or items have very few ratings, the rank of A may be lower still.
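The least-squares formulation can be sketched directly. The toy ratings matrix below is an illustrative assumption; np.linalg.lstsq returns the pseudoinverse (minimum-norm) solution, so the rank deficiency just discussed is handled automatically.

```python
import numpy as np

# Sketch of the sparse least-squares formulation of the additive model
# a_ij = r_i + c_j, on a toy ratings matrix (np.nan = missing).
X = np.array([[5.0, 3.0, np.nan],
              [4.0, np.nan, 1.0],
              [np.nan, 2.0, 1.0]])
m, n = X.shape
idx = np.argwhere(~np.isnan(X))        # the index set I of known entries

# One row of A per known rating, with ones in positions i and m + j.
A = np.zeros((len(idx), m + n))
b = np.empty(len(idx))
for k, (i, j) in enumerate(idx):
    A[k, i] = 1.0
    A[k, m + j] = 1.0
    b[k] = X[i, j]

# lstsq uses the pseudoinverse, so the rank deficiency of A is harmless.
z, *_ = np.linalg.lstsq(A, b, rcond=None)
r, c = z[:m], z[m:]
approx = r[:, None] + c[None, :]       # fills in the missing entries too
```

Because the user-mean model is itself an additive model, the least-squares fit can never have larger training error than it.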
However, the rank deficiency of A does not cause computational problems in practice.

10.4 Multiplicative models

Multiplicative models are similar to additive models, but the row and column values are multiplied instead of added. Specifically, xij is approximated as aij = ri cj, with ri and cj being scalars to be learned. A multiplicative model is a rank-one matrix decomposition, since the rows (and the columns) of the approximation matrix are linearly proportional. Like additive models, multiplicative models have one unnecessary degree of freedom, because any single value ri or cj can be fixed without changing the space of achievable approximations.

We now present a general algorithm for learning row value/column value models; additive and multiplicative models are special cases. Let xij be a matrix entry, and let ri and cj be the corresponding row and column values. We approximate xij by aij = f(ri, cj) for some fixed function f. The approximation error is defined to be e(f(ri, cj), xij). As before, I is the set of matrix indices for which xij is known. The training error E over all known matrix entries is the pointwise sum of errors:

    E = Σ_{(i,j)∈I} e(f(ri, cj), xij).

To learn ri and cj by minimizing E by gradient descent, we evaluate

    ∂E/∂ri = Σ_{(i,j)∈I} ∂/∂ri e(f(ri, cj), xij)
           = Σ_{(i,j)∈I} [∂ e(f(ri, cj), xij) / ∂f(ri, cj)] · [∂ f(ri, cj) / ∂ri].

Consider the special case where e(u, v) = |u − v|^p for p > 0. This case is a generalization of MAE and MSE: p = 2 corresponds to MSE and p = 1 corresponds to MAE. We have

    ∂/∂u e(u, v) = ∂/∂u |u − v|^p = p |u − v|^(p−1) sgn(u − v).

Here sgn(a) is the sign of a, that is −1, 0, or 1 depending on whether a is negative, zero, or positive. For computational purposes, we can assume the derivative is zero in the non-differentiable case u = v. Therefore

    ∂E/∂ri = Σ_{(i,j)∈I} p |f(ri, cj) − xij|^(p−1) sgn(f(ri, cj) − xij) · ∂f(ri, cj)/∂ri.
Now suppose f(ri, cj) = ri + cj, so ∂f(ri, cj)/∂ri = 1. We obtain

    ∂E/∂ri = p Σ_{(i,j)∈I} |ri + cj − xij|^(p−1) sgn(ri + cj − xij).

Alternatively, suppose f(ri, cj) = ri cj, so ∂f(ri, cj)/∂ri = cj. We get

    ∂E/∂ri = p Σ_{(i,j)∈I} |ri cj − xij|^(p−1) sgn(ri cj − xij) cj.

Given the gradients above, we apply online (stochastic) gradient descent. This means that for each triple (ri, cj, xij) in the training set, we compute the gradient with respect to ri and cj based just on this one example, and perform the updates

    ri := ri − λ ∂/∂ri e(f(ri, cj), xij)    and    cj := cj − λ ∂/∂cj e(f(ri, cj), xij)

where λ is a learning rate. An epoch is one pass over the whole training set, performing the updates above once for each (i, j) ∈ I. The learning rate typically follows a decreasing schedule such as λ = a/(b + e), where e is the number of the current epoch; the values a and b are often chosen to be small positive integers. After any finite number of epochs, stochastic gradient descent does not converge fully, so the maximum number of epochs is another parameter that needs to be chosen.

Gradient descent as described above directly optimizes precisely the same objective function (given the MSE error function) that is called the "incomplete data likelihood" in EM approaches to factorizing matrices with missing entries. It is sometimes forgotten that EM is just one approach to solving maximum-likelihood problems; incomplete matrix factorization is an example of a maximum-likelihood problem where an alternative solution method is superior.

10.5 Combining models by fitting residuals

Simple models can be combined into more complex models. The most straightforward way to combine models is additively. Let aij be the prediction from the first model, and let xij be the corresponding training value; the difference xij − aij is called the residual.
A second model is then trained to fit the difference between the predictions made by the first model and the truth as specified by the training data.
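To make the recipe concrete, here is a sketch in Python: a rank-one model is learned by stochastic gradient descent on squared error, and a second rank-one model is then fit to the residuals. The toy matrix, initialization, and learning-rate schedule λ = a/(b + e) are illustrative choices, not values from the text.

```python
import numpy as np

# Sketch: learn a_ij = r_i * c_j by SGD on squared error, then fit a second
# rank-one model to the residuals x_ij - a_ij.
def sgd_rank_one(X, known, epochs=300, a=0.5, b=10.0):
    m, n = X.shape
    r = np.full(m, 0.5)
    c = np.full(n, 0.5)
    for epoch in range(epochs):
        lam = a / (b + epoch)               # decreasing learning-rate schedule
        for i, j in known:
            err = r[i] * c[j] - X[i, j]
            r_old = r[i]
            r[i] -= lam * 2.0 * err * c[j]  # gradient of (r_i c_j - x_ij)^2
            c[j] -= lam * 2.0 * err * r_old
    return r, c

X = np.array([[5.0, 3.0, 4.0],
              [4.0, 2.0, 3.0],
              [1.0, 1.0, 1.0]])
known = [(i, j) for i in range(3) for j in range(3)]

r1, c1 = sgd_rank_one(X, known)
resid = X - np.outer(r1, c1)                  # residuals of the first model
r2, c2 = sgd_rank_one(resid, known)           # second model fits the residuals
approx = np.outer(r1, c1) + np.outer(r2, c2)  # combined rank-two prediction
```

Fitting the residuals with a second rank-one model should reduce the reconstruction error below that of the first model alone.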
Let bij be the prediction made by the second model for matrix entry ij. The combined prediction is then aij + bij. If the first and second models are both multiplicative, then the combined model is a rank-two approximation. The extension to higher-rank approximations is obvious.

Each component model in such a combination is an adjustment to the previous models, where each model is empirically less important than each previous one, because the numerical magnitude of its contribution is smaller. Viewing all the component models as equally important would be misleading: the aspect of items represented by each model cannot be understood in isolation, but only in the context of previous aspects.

10.6 Regularization

A matrix factorization model of rank k has k parameters for each person and for each movie. For example, consider a training set of 500,000 ratings for 10,000 viewers and 1000 movies. A rank-50 unregularized factor model is almost certain to overfit the training data, because it has 50 · (10,000 + 1000) = 550,000 parameters, which is more than the number of data points for training. We can typically improve generalization by using a large value for k while reducing the effective number of parameters via regularization.

The unregularized loss function is

    E = Σ_{(i,j)∈I} e(f(ri, cj), xij)

where I is the set of (i, j) pairs for which xij is known. Remember that this expression has only two parameter vectors r and c because it is just for a factorization of rank one. The corresponding L2-regularized loss function is

    E = µ(Σ_i ri² + Σ_j cj²) + Σ_{(i,j)∈I} e(f(ri, cj), xij)

where µ is the strength of regularization. The partial derivative with respect to ri is

    ∂E/∂ri = 2µri + Σ_{(i,j)∈I} [∂ e(f(ri, cj), xij) / ∂f(ri, cj)] · [∂ f(ri, cj) / ∂ri].
The common interpretation of a combined model is that each component model represents an aspect or property of items and users: the column value for each item is the intensity with which it possesses this property, and the row value for each user is the weighting the user places on this property. Meanwhile, the second model is trained to minimize error relative to the residual xij − aij.
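Written as code, the regularized stochastic update for the multiplicative model (with e(u, v) = (u − v)² and f(ri, cj) = ri cj) is a one-step function. This is a sketch: the default λ and µ values are illustrative, and updating cj with the old value of ri is one common convention, not specified in the text.

```python
# Sketch of the L2-regularized stochastic update for the multiplicative model.
def regularized_sgd_step(r_i, c_j, x_ij, lam=0.01, mu=0.1):
    """r_i := r_i - 2*lam*mu*r_i - 2*lam*(r_i*c_j - x_ij)*c_j, and the
    symmetric update for c_j (here computed with the old value of r_i)."""
    err = r_i * c_j - x_ij
    new_r = r_i - 2.0 * lam * mu * r_i - 2.0 * lam * err * c_j
    new_c = c_j - 2.0 * lam * mu * c_j - 2.0 * lam * err * r_i
    return new_r, new_c
```

For example, starting from r_i = c_j = 0.5 with x_ij = 3.0, the error term pulls both parameters upward while the µ term shrinks them slightly toward zero.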
Assume that e(u, v) = (u − v)² and f(ri, cj) = ri cj. Then ∂/∂ri e(f(ri, cj), xij) = 2(ri cj − xij) cj, so

    ∂E/∂ri = 2µri + 2 Σ_{(i,j)∈I} (ri cj − xij) cj.

Using stochastic gradient descent, for each training example (ri, cj, xij) we do the updates

    ri := ri − 2λµri − 2λ(ri cj − xij) cj

and similarly for cj. When doing regularization, it is reasonable not to regularize the vectors learned for the rank-one factorization; the reason is that these are always positive, so shrinking them towards zero would be a systematic bias.

10.7 Further issues

Another issue not discussed above is any potential probabilistic interpretation of a matrix decomposition. One reason for not considering models that are explicitly probabilistic is that these are usually based on Gaussian assumptions that are incorrect by definition when the observed data are integers and/or in a fixed interval.

A third issue not discussed is any explicit model for how matrix entries come to be missing. There is no assumption or analysis concerning whether entries are missing "completely at random" (MCAR), "at random" (MAR), or "not at random" (MNAR). Intuitively, in the collaborative filtering context, missing ratings are likely to be MNAR, because people select which items to rate based on how much they anticipate liking them. This means that even after accounting for other influences, whether or not a rating is missing still depends on its value: everything else being equal, low ratings are more likely to be missing than high ratings. This intuition can be confirmed empirically. On the Movielens dataset, the average rating in the training set is 3.58; the average prediction from one reasonable method for ratings in the test set, that is ratings that are unknown but known to exist, is 3.79; and the average prediction for ratings that in fact do not exist is 3.34.
If ratings xij are positive, the learned rank-one vectors are positive too. (The vectors could always both be negative, but that would be unintuitive without being more expressive.) After the rank-one factorization, residuals are zero on average, because xij = ri cj on average; learned vectors for subsequent factors must therefore have positive and negative entries, and for them regularization towards zero is reasonable.

Amazon: people who looked at x bought y eventually (usually within 24 hours).
Quiz

For each part below, say whether the statement in italics is true or false, and then explain your answer briefly.

(a) For any model, the average predicted rating for unrated movies is expected to be less than the average actual rating for rated movies.

True. People are more likely to like movies that they have actually watched than random movies that they have not watched. In more technical language, the value of a rating is correlated with whether or not it is missing. Any good model should capture this fact. Hence, the average predicted rating for unrated movies is expected to be less than the average actual rating for rated movies.

(b) Let the predicted value of rating xij be ri cj + si dj, and suppose ri and cj are trained first. For all viewers i, ri should be positive, but for some viewers i, si should be negative.

True. Ratings xij are positive, and xij = ri cj on average, so ri and cj should always be positive. (They could always both be negative, but that would be unintuitive without being more expressive.) The term si dj models the difference xij − ri cj. This difference is on average zero, so it is sometimes positive and sometimes negative. In order to allow si dj to be negative, si must be negative sometimes. (Making si be always positive, while allowing dj to be negative, might be possible. However, in this case the expressiveness of the model si dj would be reduced.)

(c) We have a training set of 500,000 ratings for 10,000 viewers and 1000 movies, and we train a rank-50 unregularized factor model. This model is likely to overfit the training data.

True. The unregularized model has 50 · (10,000 + 1000) = 550,000 parameters, which is more than the number of data points for training, so overfitting is practically certain.
Quiz for May 25, 2010

Page 87 of the lecture notes says:

"Given the gradients above, we apply online (stochastic) gradient descent. This means that we iterate over each triple (ri, cj, xij) in the training set, compute the gradient with respect to ri and cj based just on this one example, and perform the updates

    ri := ri − λ ∂/∂ri e(f(ri, cj), xij)    and    cj := cj − λ ∂/∂cj e(f(ri, cj), xij)

where λ is a learning rate."

State and explain what the first update rule is for the special case e(u, v) = (u − v)² and f(ri, cj) = ri · cj.
CSE 291 Quiz for May 10, 2011
Your name: Email address: Section 10.6 of the lecture notes says "Using stochastic gradient descent, for each training example ri , cj , xij we do the updates ri := ri − 2λµri − 2λ(ri cj − xij )cj and similarly for cj ." In the right-hand side of the update rule for ri, explain the role of λ, of µ, and of ri cj.
Assignment
The goal of this assignment is to apply a collaborative filtering method to the task of predicting movie ratings. You should use the small MovieLens dataset available at http://www.grouplens.org/node/73. This has 100,000 ratings given by 943 users to 1682 movies. You can select any collaborative filtering method that you like. You may reuse existing software, or you may write your own, in Matlab or in another programming language. Whatever your choice, you must understand fully the algorithm that you apply, and you should explain it with mathematical clarity in your report. The method that you choose should handle missing values in a sensible and efficient way. When reporting final results, do fivefold cross-validation, where each rating is assigned randomly to one fold. Note that this experimental procedure makes the task easier, because it is likely that every user and every movie is represented in each training set. Moreover, evaluation is biased towards users who have provided more ratings; it is easier to make accurate predictions for these users. In your report, show mean absolute error graphically as a function of a measure of complexity of your chosen method. If you select a matrix factorization method, this measure of complexity will likely be rank. Also show timing information graphically. Discuss whether you could run your chosen method on the full Netflix dataset of about 10^8 ratings. Also discuss whether your chosen method needs a regularization technique to reduce overfitting. Good existing software includes the following: • Jason Rennie's fast maximum margin matrix factorization for collaborative filtering (MMMF) at http://people.csail.mit.edu/jrennie/matlab/. • Guy Lebanon's toolkit at http://www2.cs.cmu.edu/~lebanon/IRlab.htm. You are also welcome to write your own code, or to choose other software. If you choose other existing software, you may want to ask the instructor for comments first.
CSE 291 Assignment 5 (2011)
This assignment is due at the start of class on Tuesday May 17, 2011. The goal of this project is to implement a collaborative filtering method for the task of predicting movie ratings. You should use the small MovieLens dataset available at http://www.grouplens.org/node/73. This dataset has 100,000 ratings given by 943 users to 1682 movies.

You should write your own code to implement a matrix factorization method with regularization that allows for missing data. Using R is recommended, but you may choose Matlab or another programming language. You may use stochastic gradient descent (SGD) as described above, or you may use another method such as alternating least squares.

When reporting final results, do five-fold cross-validation, where each rating is assigned randomly to one fold. Note that this experimental procedure makes the task easier, because it is likely that every user and every movie is represented in each training set. Moreover, evaluation is biased towards users who have provided more ratings; it is easier to make accurate predictions for these users.

Do experiments to investigate the benefits of regularization. If you use SGD, do experiments to investigate how to set the learning rate. In your report, show mean absolute error and mean squared error graphically, as a function of the rank of the matrix factorization. Discuss whether the time and memory efficiency of your code is good enough to run on the full Netflix dataset of about 10^8 ratings.
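The SGD update rule quoted in the quiz above can be sketched as follows. This is a minimal illustration in Python with NumPy (the assignment suggests R or Matlab); the toy ratings matrix, the learning rate lam, and the regularization strength mu are all hypothetical values, not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ratings: rows are users, columns are movies, np.nan marks missing entries.
X = np.array([[5.0, 4.0, np.nan],
              [4.0, np.nan, 1.0],
              [1.0, 1.0, 5.0],
              [np.nan, 1.0, 4.0]])
observed = [(i, j) for i in range(X.shape[0])
            for j in range(X.shape[1]) if not np.isnan(X[i, j])]

rank = 2    # number of latent factors
lam = 0.05  # learning rate (lambda in the lecture notes)
mu = 0.01   # regularization strength (mu in the lecture notes)
R = 0.1 * rng.standard_normal((X.shape[0], rank))  # user factor vectors r_i
C = 0.1 * rng.standard_normal((X.shape[1], rank))  # movie factor vectors c_j

for epoch in range(500):
    for i, j in observed:
        err = R[i] @ C[j] - X[i, j]
        # Update from the notes: r_i := r_i - 2*lam*mu*r_i - 2*lam*err*c_j,
        # and similarly for c_j; both use the pre-update values.
        r_new = R[i] - 2 * lam * mu * R[i] - 2 * lam * err * C[j]
        C[j] = C[j] - 2 * lam * mu * C[j] - 2 * lam * err * R[i]
        R[i] = r_new

# Training-set mean absolute error over the observed ratings.
mae = np.mean([abs(R[i] @ C[j] - X[i, j]) for i, j in observed])
```

A real submission would also hold out folds for cross-validation and vary the rank, as the assignment requires.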
Chapter 11

Text mining

This chapter explains how to do data mining on datasets that are collections of documents. Text mining tasks include
• classifier learning,
• clustering,
• topic modeling, and
• latent semantic analysis.

Classifiers for documents are useful for many applications. Major uses for binary classifiers include spam detection and personalization of streams of news articles. Multiclass classifiers are useful for routing messages to recipients. Most classifiers for documents are designed to categorize according to subject matter. However, it is also possible to learn to categorize according to qualitative criteria such as helpfulness for product reviews submitted by consumers, or importance for email messages. The Google email feature called Priority Inbox uses a classifier to rank messages according to their predicted importance as perceived by the recipient.

Classifiers can be even more useful for ranking documents than for dividing them into categories. With a training set of very helpful product reviews, and another training set of very unhelpful reviews, we can learn a scoring function that sorts other reviews according to their degree of helpfulness. There is often no need to pick a threshold, which would be arbitrary, to separate marginally helpful from marginally unhelpful reviews.

As discussed in Chapter 6, when one class of examples is rare, it is not reasonable to use 0/1 accuracy to measure the success of a classifier. In text mining it is especially common to use what is called F-measure. This measure is the harmonic mean of precision and recall:

    f = 2 / (1/p + 1/r)

where p and r are precision and recall for the rare class.

The mistakes made by a binary classifier are either false positives or false negatives. A false positive is a failure of precision, while a false negative is a failure of recall. In domains such as medical diagnosis, a failure of recall is often considered more serious than a failure of precision. The reason is that the latter has mainly a monetary cost, but the former can have a significant health consequence. However, for many large-scale data mining tasks, failures of recall are often not serious. The central reason is that if the content of a document is genuinely important, many documents are likely to contain different versions of the same content. So a failure of recall on one of these documents will not lead to an overall failure. Figuring out the optimal trade-off between failures of recall and precision, given that a failure of recall has a certain probability of being remedied later, is a problem in decision theory.

In standard multiclass classification, the classes are mutually exclusive. In applications of document classification, however, a single document can belong to more than one category, so it is correct to predict more than one label. This task is specifically called multilabel classification. In multilabel classification, it is important to learn the positive and negative correlations between classes. For example, the labels "slow" and "romantic" may be positively correlated for songs. In standard multiclass classification, a special type of negative correlation is fixed in advance: the classes are mutually exclusive.

Text mining is often used as a proxy for data mining on underlying entities represented by text documents. For example, employees may be represented by their resumes, images may be represented by captions, and music or movies may be represented by reviews. The correlations between labels may arise from correlations in the underlying entities.

11.1 The bag-of-words representation

The first question is how to represent documents. For genuine understanding of natural language, the order of the words in documents must be preserved. However, for many large-scale data mining tasks, including classifying and clustering documents, it is sufficient to use a simple representation that loses all information about word order. Given a collection of documents, the first task to perform is to identify the set of all words used at least once in at least one document. This set is called the vocabulary.
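The vocabulary and count-vector construction just described can be sketched as follows. This is a minimal illustration on toy strings; a real system would tokenize and normalize text more carefully.

```python
from collections import Counter

docs = ["the cat sat on the mat",
        "the dog sat",
        "the cat and the dog"]

# Vocabulary: all words used at least once, with a fixed arbitrary ordering.
vocabulary = sorted({w for d in docs for w in d.split()})
index = {w: j for j, w in enumerate(vocabulary)}

# One row per document, one column per word; entry x_j counts word j.
counts = [[0] * len(vocabulary) for _ in docs]
for i, d in enumerate(docs):
    for w, c in Counter(d.split()).items():
        counts[i][index[w]] = c

# The length n of each document is the sum of its counts.
lengths = [sum(row) for row in counts]
```

The rows of `counts` form exactly the document/word matrix described below, with most entries zero for realistic vocabularies.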
Often, the vocabulary is reduced in size by keeping only words that are used in at least two documents. (Words that are found only once are often misspellings or other mistakes.) Many applications of text mining also eliminate from the vocabulary so-called "stop" words. These are words that are common in most documents and do not correspond to any particular subject matter. In linguistics these words are sometimes called "function" words. They include pronouns (you, he, it), connectives (and, because, however), prepositions (to, of, before), auxiliaries (have, been, can), and generic nouns (amount, part, nothing). It is important to appreciate, however, that generic words carry a lot of information for many tasks, including identifying the author of a document or detecting its genre.

Although the vocabulary is a set, we fix an arbitrary ordering for it, so that we can refer to word 1 through word m, where m = |V| is the size of the vocabulary. Once V has been fixed, each document is represented as a vector with integer entries of length m. If this vector is x then its jth component xj is the number of appearances of word j in the document. The length of the document is n = Σ_{j=1}^m xj. For typical documents, n is much smaller than m and xj = 0 for most words j.

A collection of documents is represented as a two-dimensional matrix where each row describes a document and each column corresponds to a word. Each entry in this matrix is an integer count; most entries are zero. It makes sense to view each column as a feature. It also makes sense to learn a low-rank approximation of the whole matrix; doing this is called latent semantic analysis (LSA) and is discussed in Section 12.3 below.

11.2 The multinomial distribution

Once we have a representation for individual documents, the natural next step is to select a model for a set of documents. This model is a probability distribution. Given a training set of documents, we will choose values for the parameters of the distribution that make the training documents have high probability. Then, given a test document, we can evaluate its probability according to the model. The higher this probability is, the more similar the test document is to the training set.

The probability distribution that we use is the multinomial. Mathematically, this distribution is

    p(x; θ) = ( n! / ∏_{j=1}^m xj! ) ∏_{j=1}^m θj^xj

where the data x are a vector of non-negative integers and the parameters θ are a real-valued vector. Both vectors have the same length m. The components of θ are non-negative and have unit sum: Σ_{j=1}^m θj = 1.
Like any discrete distribution, a multinomial has to sum to one, where the sum is over all possible data points. Here, a data point is a document containing n words. The number of such documents is exponential in their length n: it is m^n. The probability of any individual document will therefore be very small. What is important is the relative probability of different documents. A document that mostly uses words with high probability will have higher relative probability.

Intuitively, θj is the probability of word j while xj is the count of word j. Each time word j appears in the document it contributes a factor θj to the total probability, hence the term θj raised to the power xj. The fraction before the product is called the multinomial coefficient. It is the number of different word sequences that might have given rise to the observed vector of counts x. Note that the multinomial coefficient is a function of x, but not of θ. Hence, when comparing the probability of a document according to different models, that is according to different parameter vectors θ, we can ignore the multinomial coefficient.

At first sight, computing the probability of a document requires O(m) time because of the product over j. However, if xj = 0 then θj^xj = 1, so the jth factor can be omitted from the product. Similarly, 0! = 1, so the jth factor can be omitted from ∏_{j=1}^m xj!. Therefore, computing the probability of a document needs only O(n) time.

Because the probabilities of individual documents decline exponentially with length n, it is necessary to do numerical computations with log probabilities:

    log p(x; θ) = log n! − [ Σ_{j=1}^m log xj! ] + [ Σ_{j=1}^m xj · log θj ].

Given a set of training documents, the maximum-likelihood estimate of the jth parameter is

    θj = (1/T) Σ_x xj

where the sum is over all documents x belonging to the training set. The normalizing constant is T = Σ_x Σ_j xj, which is the sum of the sizes of all training documents.

If a multinomial has θj = 0 for some j, then every document with xj > 0 for this j has zero probability, regardless of any other words in the document. Probabilities that are perfectly zero are undesirable, so we want θj > 0 for all j. Smoothing with a constant c is the way to achieve this. We set

    θj ∝ c + Σ_x xj

where the symbol ∝ means "proportional to." The constant c is called a pseudocount. Intuitively, it is a notional number of appearances of word j that are assumed to exist, regardless of the true number of appearances.
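These computations can be sketched on hypothetical toy counts: the smoothed estimate θj = (c + Σx xj)/(mc + T), and the log probability evaluated in O(n) time by skipping words with xj = 0 (math.lgamma(n + 1) computes log n!).

```python
from math import lgamma, log

# Toy count vectors over a vocabulary of m = 4 words (hypothetical data).
train = [[2, 1, 0, 0],
         [1, 0, 1, 0]]
m = 4
c = 1.0  # pseudocount
T = sum(sum(doc) for doc in train)
theta = [(c + sum(doc[j] for doc in train)) / (m * c + T) for j in range(m)]

def log_prob(x, theta):
    """log p(x; theta) for a multinomial, skipping the zero-count factors."""
    n = sum(x)
    lp = lgamma(n + 1)                 # log n!
    for xj, tj in zip(x, theta):
        if xj > 0:                     # factors with xj = 0 contribute nothing
            lp += xj * log(tj) - lgamma(xj + 1)
    return lp

lp = log_prob([1, 1, 0, 0], theta)
```

Note that every θj is positive even though word 4 never appears in the training data, which is exactly what the pseudocount is for.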
In order to avoid big changes in the estimated probabilities θj, one should have c < T/m. Typically c is chosen in the range 0 < c ≤ 1. Unfortunately, the exact value of c can strongly influence classification accuracy. Because the equality Σ_j θj = 1 must be preserved, the normalizing constant must be T′ = mc + T in

    θj = (1/T′)(c + Σ_x xj).

Technically, one multinomial is a distribution over all documents of a fixed size n. Therefore, what is learned by the maximum-likelihood process just described is in fact a different distribution for each size n. These distributions, although separate, have the same parameter values.

11.3 Training Bayesian classifiers

Bayesian learning is an approach to learning classifiers based on Bayes' rule. Let x be an example and let y be its class. Suppose the alternative class values are 1 to K. We can write

    p(y = k | x) = p(x | y = k) p(y = k) / p(x).

In order to use the expression on the right for classification, we need to learn three items from training data: p(x | y = k), p(y = k), and p(x). We can estimate p(y = k) easily as nk / Σ_{k=1}^K nk, where nk is the number of training examples with class label k. The denominator p(x) can be computed as the sum of numerators

    p(x) = Σ_{k=1}^K p(x | y = k) p(y = k).

In general, the class-conditional distribution p(x | y = k) can be any distribution whose parameters are estimated using the training examples that have class label k. Usually, each class is represented by one multinomial fitted by maximum likelihood as described in Section 11.2. In the formula

    θj = (1/T)(c + Σ_x xj)

the sum is taken over the documents in one class. Note that the training process for each class uses only the examples from that class, and is separate from the training process for all other classes.
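The training procedure above can be sketched end to end: estimate the priors p(y = k) from class frequencies, fit one smoothed multinomial per class using only that class's documents, and classify with the largest log posterior. The toy counts below are hypothetical, and the multinomial coefficient is omitted because it does not depend on k.

```python
from math import log

# Toy training documents as count vectors over m = 3 words, with class labels.
docs   = [[3, 0, 1], [2, 1, 0], [0, 2, 2], [0, 3, 1]]
labels = [0, 0, 1, 1]
m, K, c = 3, 2, 1.0

def fit_class(k):
    """Smoothed multinomial for class k, using only that class's documents."""
    rows = [d for d, y in zip(docs, labels) if y == k]
    T = sum(sum(d) for d in rows)
    return [(c + sum(d[j] for d in rows)) / (m * c + T) for j in range(m)]

theta = [fit_class(k) for k in range(K)]
prior = [labels.count(k) / len(labels) for k in range(K)]

def predict(x):
    # argmax_k of log p(y = k) + sum_j x_j log theta_kj
    scores = [log(prior[k]) + sum(xj * log(tj) for xj, tj in zip(x, theta[k]))
              for k in range(K)]
    return scores.index(max(scores))

pred = predict([4, 0, 1])
```

Because p(x) is the same for every k, it is never computed; only the numerators are compared.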
11.4 Burstiness

The multinomial model says that each appearance of the same word j always has the same probability θj. The multinomial distribution arises from a process of sampling words with replacement. In reality, additional appearances of the same word are less surprising, i.e. they have higher probability. This increased probability models the phenomenon of burstiness. Consider the following excerpt from a newspaper article.

Toyota Motor Corp. is expected to announce a major overhaul. Yoshi Inaba, a former senior Toyota executive, was formally asked by Toyota this week to oversee the U.S. business. Mr. Inaba is currently head of an international airport close to Toyota's headquarters in Japan. Mr. Inaba was credited with laying the groundwork for Toyota's fast growth in the U.S. before he left the company. Toyota's U.S. operations now are suffering from plunging sales. Recently, Toyota has had to idle U.S. assembly lines, pay workers who aren't producing vehicles, and offer a limited number of voluntary buyouts. Toyota now employs 36,000 in the U.S.

An alternative distribution named the Dirichlet compound multinomial (DCM) arises from an urn process that captures the authorship process better. Consider a bucket with balls of |V| different colors. Let the initial number of balls with color j be βj. After a ball is selected randomly, it is not just replaced, but also one more ball of the same color is added. Thus, each time a ball is drawn, the chance of drawing the same color again is increased. This increase models burstiness, in an adjustable way.

These initial values are the parameters of the DCM distribution. The DCM parameter vector β has length |V|, like the multinomial parameter vector, but the sum of the components of β is unconstrained. This one extra degree of freedom allows the DCM to discount multiple observations of the same word. The smaller the parameter values βj are, the more words are bursty.

11.5 Discriminative classification

There is a consensus that linear support vector machines (SVMs) are the best known method for learning classifiers for documents. Because the number of features is typically much larger than the number of training examples, regularization is crucial, and its strength should be chosen appropriately, typically by cross-validation. Nonlinear SVMs are not beneficial.
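The urn process can be simulated directly. The sketch below, with hypothetical parameter values, checks that small initial counts βj make a "document" repeat the same few colors, while large values behave like an ordinary near-uniform multinomial.

```python
import random

random.seed(0)

def draw_urn(beta, n):
    """Draw n balls; each drawn ball is replaced plus one extra of its color."""
    counts = list(beta)
    out = []
    for _ in range(n):
        total = sum(counts)
        r = random.uniform(0, total)
        j, acc = 0, counts[0]
        while r > acc:
            j += 1
            acc += counts[j]
        counts[j] += 1          # replace the ball and add one more of the same color
        out.append(j)
    return out

def mean_max_repeat(beta, n, trials=2000):
    """Average count of the most frequent color in an n-draw sample."""
    return sum(max(draw_urn(beta, n).count(j) for j in range(len(beta)))
               for _ in range(trials)) / trials

bursty  = mean_max_repeat([0.2] * 5, 10)   # small beta_j: very bursty draws
uniform = mean_max_repeat([50.0] * 5, 10)  # large beta_j: close to multinomial
```

With small βj the first draw almost fixes the color of the following draws, which is the burstiness behavior the DCM is designed to model.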
When using a linear SVM for text classification, accuracy can be improved considerably by transforming the raw counts. Since counts do not follow Gaussian distributions, it is not sensible to make them have mean zero and variance one. Inspired by the discussion above of burstiness, it is sensible instead to replace each count xj by log(1 + xj). This transformation maps 0 to 0, so it preserves sparsity. A more extreme transformation that loses information is to make each count binary, that is to replace all nonzero values by one.

One can also transform counts in a supervised way, that is in a way that uses label information. This is most straightforward when there are just two classes. Experimentally, the following transformation is better than others:

    xj → I(xj > 0) · | log(tp/fn) − log(fp/tn) |.

Here, tp is the number of positive training examples containing word j, and fn is the number of these examples not containing the word, while fp is the number of negative training examples containing the word, and tn is the number of these examples not containing the word. If any of these numbers is zero, we replace it by 0.5, which of course is less than any of these numbers that is genuinely nonzero.

The transformation above is sometimes called log-odds weighting. It makes each word count binary. It ignores how often a word appears in an individual document, and gives the word a weight that is the same for every document, but depends on how predictive the word is. The value | log(tp/fn) − log(fp/tn) | is large if tp/fn and fp/tn have very different values. In this case the word is highly diagnostic for at least one of the two classes. The positive and negative classes are treated in a symmetric way.

11.6 Clustering documents

Suppose that we have a collection of documents, and we want to find an organization for these. That is, we want to do unsupervised learning. The simplest variety of unsupervised learning is clustering. Clustering can be done probabilistically by fitting a mixture distribution to the given collection of documents. Formally, a mixture distribution is a probability density function of the form

    p(x) = Σ_{k=1}^K αk p(x; θk).

Here, K is the number of components in the mixture model. For each k, p(x; θk) is the distribution of component number k. The scalar αk is the proportion of component number k. Each component is a cluster.
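The log-odds weighting transformation can be sketched as follows, on hypothetical counts; zeros are replaced by 0.5 as described above.

```python
from math import log

def log_odds_weight(tp, fn, fp, tn):
    """Weight for one word: |log(tp/fn) - log(fp/tn)|, with zeros replaced by 0.5."""
    tp, fn, fp, tn = (v if v > 0 else 0.5 for v in (tp, fn, fp, tn))
    return abs(log(tp / fn) - log(fp / tn))

def transform(x, weights):
    """Replace each count x_j by I(x_j > 0) times the weight of word j."""
    return [w if xj > 0 else 0.0 for xj, w in zip(x, weights)]

# Word 0 appears in 9 of 10 positive docs but only 1 of 10 negative docs:
# highly diagnostic. Word 1 appears in 5 of 10 docs of each class: uninformative.
weights = [log_odds_weight(9, 1, 1, 9), log_odds_weight(5, 5, 5, 5)]
vec = transform([3, 7], weights)
```

Note that the diagnostic word keeps a large weight no matter how often it occurs in a particular document, while the uninformative word is zeroed out entirely.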
11.7 Topic models

Mixture models, and clusterings in general, are based on the assumption that each data point is generated by a single component model. For documents, if components are topics, it is often more plausible to assume that each document can contain words from multiple topics. Topic models are probabilistic models that make this assumption.

Latent Dirichlet allocation (LDA) is the most widely used topic model. It is based on the intuition that each document contains words from multiple topics; the proportion of each topic in each document is different, but the topics themselves are the same for all documents.

The generative process assumed by the LDA model is as follows:

    Given: Dirichlet distribution with parameter vector α of length K
    Given: Dirichlet distribution with parameter vector β of length V
    for topic number 1 to topic number K
        draw a multinomial with parameter vector φk according to β
    for document number 1 to document number M
        draw a topic distribution, i.e. a multinomial, θ according to α
        for each word in the document
            draw a topic z according to θ
            draw a word w according to φz

Note that z is an integer between 1 and K for each word. Here, K is the number of topics, and φk is a vector of word probabilities for each topic, indicating the content of that topic.

For learning, the training data are the words in all documents. Learning has two goals: (i) to infer the document-specific multinomial θ for each document, and (ii) to infer the distribution φk over words for each topic. The prior distributions α and β are assumed to be fixed and known, as are the number K of topics, the number M of documents, the length Nm of each document, and the cardinality V of the vocabulary (i.e. the dictionary of all words). When applying LDA, it is not necessary to learn α and β. Steyvers and Griffiths recommend fixed uniform values α = 50/K and β = 0.01.

After training, the distribution θ of each document is useful for classifying new documents, measuring similarity between documents, and more.
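The generative process can be simulated directly. The sketch below uses NumPy's Dirichlet and categorical sampling, with hypothetical small values of K, V, and M, and a fixed document length N for simplicity; real documents of course vary in length.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, M, N = 3, 8, 5, 20          # topics, vocabulary size, documents, words/doc
alpha = np.full(K, 50.0 / K)      # Steyvers and Griffiths' recommended prior
beta = np.full(V, 0.01)

# Draw one multinomial phi_k over words for each topic.
phi = rng.dirichlet(beta, size=K)

corpus = []
for _ in range(M):
    theta = rng.dirichlet(alpha)   # topic proportions for this document
    words = []
    for _ in range(N):
        z = rng.choice(K, p=theta)     # draw a topic for this word position
        w = rng.choice(V, p=phi[z])    # draw the word from that topic
        words.append(int(w))
    corpus.append(words)
```

Inference runs this process in reverse: given only `corpus`, it recovers estimates of θ for each document and φk for each topic.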
11.8 Open questions

It would be interesting to work out an extension of the log-odds mapping for the multiclass case. One suggestion is

    xj → I(xj > 0) · max_k | log(tpjk / fnjk) |

where k ranges over the classes, tpjk is the number of training examples in class k containing word j, and fnjk is the number of these examples not containing the word.

It is also not known if combining the log transformation and log-odds weighting is beneficial, as in

    xj → log(xj + 1) · | log(tp/fn) − log(fp/tn) |.
Quiz

(a) Explain why, with a multinomial distribution, "the probabilities of individual documents decline exponentially with length n."

The probability of document x of length n according to a multinomial distribution is

    p(x; θ) = ( n! / ∏_{j=1}^m xj! ) ∏_{j=1}^m θj^xj.

[Rough argument.] Each θj value is less than 1. In total n = Σ_j xj of these values are multiplied together, so as n increases the product ∏_{j=1}^m θj^xj decreases exponentially. Note that the multinomial coefficient n!/∏_{j=1}^m xj! does increase with n, but more slowly.

(b) Consider the multiclass Bayesian classifier

    ŷ = argmax_k p(x|y = k) p(y = k) / p(x),

given that the model p(x|y = k) for each class is a multinomial distribution. Simplify the expression inside the argmax operator as much as possible.

The denominator p(x) is the same for all k, so it does not influence which k is the argmax. Within the multinomial distributions p(x|y = k), the multinomial coefficient does not depend on k, so it is constant for a single x and it can be eliminated also, giving

    ŷ = argmax_k p(y = k) ∏_{j=1}^m θkj^xj

where θkj is the jth parameter of the kth multinomial.

(c) Consider the classifier from part (b) and suppose that there are just two classes. Simplify the classifier further into a linear classifier.

Let the two classes be k = 0 and k = 1, so we can write

    ŷ = 1 if and only if p(y = 1) ∏_{j=1}^m θ1j^xj > p(y = 0) ∏_{j=1}^m θ0j^xj.

Taking logarithms and using indicator function notation gives

    ŷ = I( log p(y = 1) + Σ_{j=1}^m xj log θ1j − log p(y = 0) − Σ_{j=1}^m xj log θ0j > 0 ).

The classifier simplifies to

    ŷ = I( c0 + Σ_{j=1}^m xj cj > 0 )

where the coefficients are c0 = log p(y = 1) − log p(y = 0) and cj = log θ1j − log θ0j. The expression inside the indicator function is a linear function of x.
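The equivalence derived above is easy to check numerically: the linear classifier with coefficients c0 and cj agrees with comparing the two class scores directly. The parameter values below are hypothetical.

```python
from math import log

# Hypothetical two-class multinomial parameters over m = 3 words, plus priors.
theta1 = [0.5, 0.3, 0.2]
theta0 = [0.2, 0.3, 0.5]
p1, p0 = 0.4, 0.6

c0 = log(p1) - log(p0)
cs = [log(t1) - log(t0) for t1, t0 in zip(theta1, theta0)]

def linear(x):
    """The simplified linear classifier I(c0 + sum_j x_j c_j > 0)."""
    return 1 if c0 + sum(xj * cj for xj, cj in zip(x, cs)) > 0 else 0

def direct(x):
    """Direct comparison of the two log class scores."""
    s1 = log(p1) + sum(xj * log(t) for xj, t in zip(x, theta1))
    s0 = log(p0) + sum(xj * log(t) for xj, t in zip(x, theta0))
    return 1 if s1 > s0 else 0

docs = [[3, 1, 0], [0, 2, 4], [1, 1, 1]]
agree = all(linear(x) == direct(x) for x in docs)
```

The agreement holds for every count vector, since the two decision rules differ only by monotonic algebra.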
Quiz for May 18, 2010

Your name:

Consider the task of learning to classify documents into one of two classes, using the bag-of-words representation. Explain why a regularized linear SVM is expected to be more accurate than a Bayesian classifier using one maximum-likelihood multinomial for each class. Write one or two sentences for each of the following reasons.

(a) Handling words that are absent in one class in training.

(b) Learning the relative importance of different words.
CSE 291 Quiz for May 17, 2011

Your name: Email address:

Consider the binary Bayesian classifier

    ŷ = argmax_{k ∈ {0,1}} p(x|y = k) p(y = k) / p(x),

given that the model p(x|y = k) for each class is a multinomial distribution:

    p(x|y = 0) = ( n! / ∏_{j=1}^m xj! ) ∏_{j=1}^m αj^xj

and

    p(x|y = 1) = ( n! / ∏_{j=1}^m xj! ) ∏_{j=1}^m βj^xj.

Also assume that p(y = 0) = 0.5. Simplify the expression inside the argmax operator as much as possible.

The denominator p(x) is the same for all k, so it does not influence which k is the argmax. Within the multinomial distributions, the multinomial coefficient does not depend on the parameters, so it is constant for a single x and it can be eliminated also, giving

    ŷ = argmax_k p(y = k) ∏_{j=1}^m θkj^xj

where θkj is either αj or βj. The assumption p(y = 0) = 0.5 implies p(y = 1) = 0.5, so the p(y = k) factor also does not influence classification. We can write

    ŷ = 1 if and only if ∏_{j=1}^m βj^xj > ∏_{j=1}^m αj^xj.

Using indicator function notation gives

    ŷ = I( ∏_{j=1}^m (βj/αj)^xj > 1 ).

We know that Σ_j αj = 1 and similarly for the β parameter vector, but this knowledge does not help in making a further simplification. Note that there is no guaranteed relationship between αj and βj.
Assignment (2009)

The purpose of this assignment is to compare two different approaches to learning a classifier for text documents. The dataset to use is called Classic400. It consists of 400 documents from three categories over a vocabulary of 6205 words. The categories are quite distinct from a human point of view, and high accuracy is achievable.

The dataset is available at http://www.cs.ucsd.edu/users/elkan/291/classic400.zip, which contains three files. The main file cl400.csv is a comma-separated 400 by 6205 array of word counts. The file truelabels.csv gives the actual class of each document, while wordlist gives the string corresponding to the word indices 1 to 6205. These strings are not needed for training or applying classifiers, but they are useful for interpreting classifiers. The file classic400.mat contains the same three files in the form of Matlab matrices.

First, you should try Bayesian classification using a multinomial model for each of the three classes. Note that you may need to smooth the multinomials with pseudocounts. Second, you should train a support vector machine (SVM) classifier. You will need to select a method for adapting SVMs for multiclass classification. Since there are more features than training examples, regularization is vital.

For each of the two classifier learning methods, investigate whether you can achieve better accuracy via feature selection, feature transformation, and/or feature weighting. Because there are relatively few examples, using tenfold cross-validation to measure accuracy is suggested. Try to evaluate the statistical significance of differences in accuracy that you find. If you allow any leakage of information from the test fold to training folds, be sure to explain this in your report.
CSE 291 Assignment 6 (2011)

This assignment is due at the start of class on Tuesday May 24, 2011. The purpose of this assignment is to compare different approaches to learning a binary classifier for text documents. The dataset to use is the movie review polarity dataset, version 2.0, published by Lillian Lee at http://www.cs.cornell.edu/People/pabo/moviereviewdata/. Be sure to read the README carefully, and to understand the data and task fully.

First, you should try Bayesian classification using a multinomial model for each of the two classes. You should smooth the multinomials with pseudocounts. Second, you should train a linear discriminative classifier, either logistic regression or a support vector machine. Since there are more features than training examples, regularization is vital. To get good performance with an SVM, you need to use strong regularization, that is a small value of C.

For both classifier learning methods, investigate whether you can achieve better accuracy via feature selection, feature transformation, and/or feature weighting. Try to evaluate the statistical significance of differences in accuracy that you find. Also, analyze your trained classifier to identify what features of movie reviews are most indicative of the review being favorable or unfavorable.

Compare the accuracy that you can achieve with accuracies reported in some of the many published papers that use this dataset. You can find links to these papers at http://www.cs.cornell.edu/People/pabo/moviereviewdata/otherexperiments.html. Think carefully about any ways in which you may be allowing leakage of information from test subsets that makes estimates of accuracy be biased. If you achieve very high accuracy, it is likely that you are a victim of leakage in some way. Discuss this issue in your report.
Chapter 12

Matrix factorization and applications

Many datasets are essentially two-dimensional matrices. In some cases every entry has a known value, while in other cases some entries are unknown. In many datasets, missing is a synonym for unknown. For these datasets, methods based on ideas from linear algebra often provide considerable insight. It is remarkable that similar algorithms are useful for datasets as diverse as the following.

• In many datasets, each row is an example and each column is a feature. For example, in the KDD 1998 dataset each row describes a person. There are 22 features corresponding to different previous campaigns requesting donations. For each person and each campaign there is a binary feature value, 1 if the person made a donation and 0 otherwise.

• In a text dataset, each row represents a document while each column represents a word.

• In a collaborative filtering dataset, each row can be a user and each column can be a movie. Its entries may be binary, that is 0 or 1, or they may be real numbers.

• In a social network, each person corresponds to a row and also to a column. The network is a graph, and the matrix is its adjacency matrix.

• In a kernel matrix, each example corresponds to a row and also to a column. The number in row i and column j is the degree of similarity between examples i and j.

The cases above show that the matrix representing a dataset may be square or rectangular.
its diagonal entries σi are positive. while only square matrices have eigenvalues. Speciﬁcally. then it has a subset of k rows such that (i) none of these is a linear combination of the others. The rank of a matrix is the largest number of linearly independent rows that it has. A linear combination of these rows is the new row vector that is the matrix product a1 a2 v . and V . The matrix U is m by r. and V is n by r.1 Singular value decomposition A mathematical idea called the singular value decomposition (SVD) is at the root of most ways of analyzing datasets that are matrices. Its extension to matrices with missing entries is discussed elsewhere.1 Next. but (ii) every row not in the subset is a linear combination of the rows in the subset. The SVD is a product of three matrices named U . Formally. MATRIX FACTORIZATION AND APPLICATIONS 12. A = U SV T where V T means the transpose of V . This means that there are k users such that the ratings of every other user are a linear function of the ratings of these k users. all the offdiagonal entries of S are zero. Other related concepts are deﬁned only for matrices that are square. S is r by r. but remember that the SVD is welldeﬁned for rectangular matrices. The rank of a matrix is a measure of its complexity. these k users are a basis for the whole matrix. Let A have m rows and n columns. square or rectangular. One important reason why the SVD is so important is that it is deﬁned for every matrix. Consider for example a user/movie matrix. Let v be any row vector of length k. . which can also be written V . . ak A set of rows is called linearly independent if no row is a linear combination of the others. the columns of They are closely related to eigenvalues. or only for matrices that are invertible. These entries are called singular values. Before explaining the SVD. If a matrix has rank k. and let r be the rank of A. S. Let A be any matrix. Suppose that its rank is k. and V make the SVD unique. 
Consider a subset of rows of a matrix. S.128 CHAPTER 12. First. and they are sorted from largest to smallest. 1 . let us review the concepts of linear dependence and rank. The SVD is deﬁned for every matrix all of whose entries are known. Suppose that the rows in this subset are called a1 to ak . Some conditions on U .
Notice that the matrices U, S, and V are themselves truncated in our definition to have r columns. This definition is sometimes called the compact SVD. The omitted columns correspond to singular values σi = 0 for i > r.

The columns of U, which are vectors of length m, are orthonormal, and similarly the columns of V, which are vectors of length n, are orthonormal to each other. This means that each column is a unit vector, and the columns are perpendicular to each other. Concretely, let ui and uj be any two columns of U where i ≠ j. We have ui · ui = 1 and ui · uj = 0.

Because U and V have orthonormal columns, the matrix product in the SVD can be simplified to

    A = Σ_{k=1}^{r} σk uk vk^T

where r is the rank of A. Here, uk is the kth column of U, and vk^T is the kth row of V^T, which is the transpose of the kth column of V. The product uk vk^T is an m by n matrix of rank 1. It is often called the outer product of the vectors uk and vk. This simplification decomposes A into the sum of r matrices of rank 1.

For data mining, the SVD is most interesting because of the following property. Let Uk be the k leftmost columns of U. This means that Uk is an m by k matrix whose columns are the same as the first k columns of U. Similarly, let Vk be the k leftmost columns of V, and let Sk be the k by k diagonal matrix whose diagonal entries are S11 to Skk. The matrices Uk, Sk, and Vk are called truncated versions of U, S, and V. Define Ak = Uk Sk Vk^T. The rank of Ak is k. The omitted columns in the truncated matrices correspond to the smallest diagonal entries in S. The hope is that these correspond to noise in the original observed A.

The following theorem was published by the mathematicians Carl Eckart and Gale Young in 1936.

Theorem: Ak = argmin_{X : rank(X) ≤ k} ||A − X||².

The theorem says that among all matrices of rank k or less, Ak has the smallest squared error with respect to A. The squared error is simply

    ||A − X||² = Σ_{i=1}^{m} Σ_{j=1}^{n} (Aij − Xij)²

and is sometimes called the squared Frobenius norm of the matrix A − X. The SVD essentially answers the question, what is the best low-complexity approximation of a given matrix A?
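The truncated SVD and the Eckart–Young property can be checked numerically. A minimal sketch, assuming numpy is available; the matrix entries are made up for illustration:

```python
import numpy as np

# A small 4 by 3 matrix with arbitrary illustrative entries.
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 1.0],
              [1.0, 1.0, 0.0],
              [2.0, 2.0, 2.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 1
# Truncated factors: first k columns of U, top k singular values,
# first k rows of V^T. Ak = Uk Sk Vk^T has rank k.
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The squared error of the best rank-k approximation equals the sum of
# the squared singular values that were dropped.
err_k = np.sum((A - Ak) ** 2)
print(err_k, np.sum(s[k:] ** 2))
```

The two printed numbers agree, which is exactly the content of the Eckart–Young theorem for this matrix.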
The truncated SVD is called a matrix approximation: A ≈ Uk Sk Vk^T. The matrix Uk has k columns, and the same number of rows as A. Each row of Uk can be viewed as a low-dimensional representation of the corresponding row of A. Similarly, each column of Vk^T is a k-dimensional representation of the corresponding column of A. Note that the rows of Uk, and the columns of Vk^T, are not orthonormal. If an approximation with just two factors is desired, the matrix Sk can be included in Uk and/or Vk.

12.2 Principal component analysis

Standard principal component analysis (PCA) is the most widely used method for reducing the dimensionality of a dataset. The idea is to identify, one by one, the directions along which a dataset varies the most. The first direction is a line that goes through the mean of the data points, and maximizes the squared projection of each point onto the line. Each new direction is orthogonal to all previous ones. These directions are called the principal components.

Let A be a matrix with m rows corresponding to examples and n columns corresponding to features. The first step in PCA is to make each feature have mean zero, that is to subtract the mean of each column from each entry in it. Let this new data matrix be B, with rows b1 to bm. A direction is represented by a column vector w of length n with unit L2 norm. The first principal component of B is

    w1 = argmax_{||w||_2 = 1} Σ_{i=1}^{m} (bi · w)².

Each data point bi is a row vector of length n. The principal component is a column vector of the same length. The scalar (bi · w) is the projection of bi along w, which is the one-dimensional representation of the data point. Each data point with the first component removed is bi' = bi − (bi · w) w^T. The second principal component can be found recursively as

    w2 = argmax_{||w||_2 = 1} Σ_{i=1}^{m} (bi' · w)².

Principal component analysis is widely used, but it has some major drawbacks. First, it is sensitive to the scaling of features. Without changing the correlations between features, if some features are multiplied by constants, then the PCA changes. The principal components tend to be dominated by the features that are numerically largest. Second, PCA is unsupervised. It does not take into account any label that is to be predicted. The directions of maximum variation are usually not the directions that differentiate best between examples with different labels. Third, PCA is linear. If the data points lie in a cloud with nonlinear structure, PCA will not capture this structure.

PCA is closely related to SVD, which is used in practice to compute principal components and the re-representation of data points.
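The first principal component can be computed without any library, as the top eigenvector of B^T B found by power iteration. A minimal sketch; the data values are hypothetical, chosen to lie roughly along a diagonal line:

```python
# Toy dataset (hypothetical numbers): m = 4 points, n = 2 features.
data = [[2.0, 1.9], [0.0, 0.1], [1.0, 1.1], [3.0, 2.9]]
m, n = len(data), len(data[0])

# Step 1 of PCA: make each feature have mean zero.
means = [sum(row[j] for row in data) / m for j in range(n)]
B = [[row[j] - means[j] for j in range(n)] for row in data]

# The first principal component w maximizes sum_i (b_i . w)^2 with ||w|| = 1,
# i.e. it is the top eigenvector of C = B^T B; find it by power iteration.
C = [[sum(B[i][a] * B[i][b] for i in range(m)) for b in range(n)]
     for a in range(n)]
w = [1.0] * n
for _ in range(200):
    w = [sum(C[a][b] * w[b] for b in range(n)) for a in range(n)]
    norm = sum(x * x for x in w) ** 0.5
    w = [x / norm for x in w]

print(w)  # unit vector pointing along the direction of largest variation
```

For this data, both components of w are positive and roughly equal, matching the diagonal spread of the points.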
12.3 Latent semantic analysis

Polysemy is the phenomenon in language that a word such as "fire" can have multiple different meanings. Another important phenomenon is synonymy, i.e. the ability of different words such as "firearm" and "gun" to have similar meanings. Because of synonymy, the bag-of-words vectors representing two documents can be dissimilar, despite similarity in meaning between the documents. Conversely, because of polysemy, document vectors may be similar despite different meanings.

Latent semantic analysis (LSA), also called latent semantic indexing (LSI), is an attempt to improve the representation of documents by taking into account polysemy and synonymy in a way that is induced from the documents themselves. LSI uses the singular value decomposition of the matrix A where each row is a document and each column is a word. Let A = U S V^T as above.

The key idea is to relate documents and words via concepts. Concepts can also be called topics, or dimensions. These concepts are latent, i.e. hidden, because they are not provided explicitly. Intuitively, a document is viewed as a weighted combination of concepts. Importantly, each document is created from more than one concept, so concepts are different from, and more expressive than, cluster centers. Synonymy is captured by permitting different word choices within one concept, while polysemy is captured by allowing the same word to appear in multiple concepts. The hope is that the concepts revealed by LSA do have semantic meaning.

Concretely, each row of U is interpreted as the representation of a document in concept space. Similarly, each column of V^T is interpreted as the representation of a word. Each singular value in S is the strength of the corresponding concept in the whole document collection. The number of independent dimensions or concepts is usually chosen to be far smaller than the number of words in the vocabulary. A typical number is 300.
Remember that the SVD can be written as

    A = Σ_{k=1}^{r} σk uk vk^T

where uk is the kth column of U and vk^T is the kth row of V^T. Each vector vk is essentially a concept. A concept can be thought of as a pseudoword, with a certain weight for each word. In contrast, the vector uk gives the strength of this concept for each document: mathematically, Uik = (uk)i is the count of pseudoword k in document i.

LSI has drawbacks that research has not yet fully overcome. In no particular order, some of these are as follows.

• The first concept found by LSI typically has little semantic interest. Let u be the first column of U and let v be the first row of V^T. The vector u has an entry for every document, while v has an entry for every word in the vocabulary. Typically, the entries in u are roughly proportional to the lengths of the corresponding documents, while each entry in v is roughly proportional to the frequency in the whole collection of the corresponding word.

• Mathematically, every pair of rows of V^T must be orthogonal. This means that no correlations between topics are allowed.

• For large datasets, A can be represented efficiently owing to its sparse nature, but Ak cannot, because Ak is dense. For example, the TREC2 collection used in research on information retrieval has 741,696 documents and 138,166 unique words. Since there are only 80 million nonzero entries in A, it can be stored in single precision floating point using 320 megabytes of memory. However, storing Ak for any value of k would require about 410 gigabytes.

• Computing the SVD of large matrices is expensive. However, present day computational power and efficient implementations of SVD, including stochastic gradient descent methods, make it possible to apply LSI to large collections.

Weighting schemes that transform the raw counts of words in documents are critical to the performance of LSI. A popular scheme is called log-entropy. Note that under this scheme, words that are equally likely to occur in every document have the smallest weights.
The weight for word j in document i is

    [1 + Σ_{i=1}^{n} p_ij log_n p_ij] · log(x_ij + 1)

where x_ij is the count of word j in document i, n is the number of documents, the sum ranges over all documents, and p_ij is an estimate, usually a maximum likelihood estimate, of the probability of occurrence of word j in document i. The first factor is the same for all documents: it equals one plus the normalized (base-n) entropy term for word j, so it is zero for a word spread evenly across all documents.
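One reading of the log-entropy scheme can be sketched as follows; the toy counts are hypothetical, chosen so that one word is spread evenly across documents (and therefore gets weight zero), and p_ij is taken as the maximum likelihood estimate x_ij divided by the total count of word j:

```python
import math

# Toy document-word counts x[i][j] (hypothetical): 3 documents, 2 words.
# Word 0 occurs evenly in every document; word 1 is concentrated.
x = [[2, 0], [2, 1], [2, 7]]

def log_entropy_weights(x):
    ndocs, nwords = len(x), len(x[0])
    weights = [[0.0] * nwords for _ in range(ndocs)]
    for j in range(nwords):
        total = sum(x[i][j] for i in range(ndocs))
        # p_ij: maximum likelihood estimate of the probability that an
        # occurrence of word j falls in document i.
        entropy = 0.0
        for i in range(ndocs):
            p = x[i][j] / total
            if p > 0:
                entropy += p * math.log(p, ndocs)  # log base n
        g = 1.0 + entropy  # global weight; 0 for an evenly spread word
        for i in range(ndocs):
            weights[i][j] = g * math.log(x[i][j] + 1)
    return weights

w = log_entropy_weights(x)
print(w)
```

As claimed, the evenly distributed word 0 receives weight zero in every document, while the concentrated word 1 receives positive weights.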
Two more drawbacks of LSI concern the entries of the factor matrices.

• Apart from the first column, every column of U has about as many negative as positive entries, and the same is true for rows of V^T. That is, every document has a negative loading for half the topics, and every topic has a negative loading for half the words.

• The rows of the factor matrix representing rare words, which are often highly discriminative, typically have small norms. That is, the rare words have small loadings on every dimension. This has a negative impact on LSI performance in applications. Unfortunately, both log-entropy and traditional IDF schemes fail to address this problem.

12.4 Matrix factorization with missing data

Alternating least squares. Regularization. Convexity.

12.5 Spectral clustering

12.6 Graph-based regularization

Let wij be the similarity between nodes i and j in a graph. Assume that this similarity is the penalty for the two nodes having different labels. Let fi be the label of node i, either 0 or 1, and let f be the column vector of labels. The cut size of the graph is

    Σ_{i,j: fi ≠ fj} wij = Σ_{i,j} wij (fi − fj)²
                        = Σ_{i,j} wij (fi² − 2 fi fj + fj²)
                        = Σ_{i,j} wij (fi² + fj²) − 2 f'Wf
                        = 2 Σ_i (Σ_j wij)(fi²) − 2 f'Wf.

Let the sum of row i of the weight matrix W be vi = Σ_j wij, and let V be the diagonal matrix with diagonal entries vi. The cut size is then

    −2 f'Wf + 2 f'Vf = 2 f'(V − W)f.
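The identity derived above — for 0/1 labels, the cut size over ordered pairs equals 2 f'(V − W)f — can be checked directly on a toy graph; the weights below are hypothetical:

```python
# Toy symmetric similarity matrix W (hypothetical weights) and 0/1 labels f.
W = [[0, 2, 1, 0],
     [2, 0, 0, 1],
     [1, 0, 0, 3],
     [0, 1, 3, 0]]
f = [0, 0, 1, 1]
n = len(W)

# Cut size: total similarity between node pairs with different labels
# (counting ordered pairs, as in the derivation above).
cut = sum(W[i][j] for i in range(n) for j in range(n) if f[i] != f[j])

# The same quantity via the quadratic form 2 f'(V - W) f,
# where V is diagonal with row sums v_i.
v = [sum(W[i]) for i in range(n)]
quad = 2 * (sum(v[i] * f[i] * f[i] for i in range(n))
            - sum(f[i] * W[i][j] * f[j]
                  for i in range(n) for j in range(n)))

print(cut, quad)  # both equal 4 for this toy graph
```

The two computed quantities agree, as the algebra above guarantees.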
Let zi be the suggested label of node i according to the base method mentioned in the previous section. The objective that we want to minimize is the linear combination

    Σ_i (fi − zi)² + λ Σ_{i,j: fi ≠ fj} wij
    = Σ_i (fi² − 2 fi zi + zi²) + 2λ f'(V − W)f
    = f'If − 2 f'Iz + z'Iz + 2λ f'(V − W)f
    = −2 f'Iz + z'Iz + 2λ f'(V − W + (1/(2λ)) I)f.

Minimizing this expression is equivalent to maximizing

    f'Iz + λ f'(W − V − (1/(2λ)) I)f

where the term that does not depend on f has been omitted. The term involving W resembles an empirical loss function, while the term involving V resembles a regularizer.

Remember that f is a 0/1 vector. We can relax the problem by allowing f to be real-valued between 0 and 1. Finding the optimal f, given W, is then doable efficiently using eigenvectors; see the background section below. This approach is well-known in research on spectral clustering and on graph-based semisupervised learning; see [Zhu and Goldberg, 2009, Section 5.3].

The following result is standard.

Theorem: Let A be symmetric and let B be positive definite. Then the maximum value of x'Ax given x'Bx = 1 is the largest eigenvalue of B^-1 A. The vector x that achieves the maximum is the corresponding eigenvector of B^-1 A.
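The theorem can be illustrated numerically. A sketch assuming numpy is available; the matrices below are hypothetical, chosen only to satisfy the theorem's conditions:

```python
import numpy as np

# A symmetric, B positive definite (both hypothetical).
A = np.array([[2.0, 1.0], [1.0, 3.0]])
B = np.array([[2.0, 0.0], [0.0, 1.0]])

# The maximizer of x'Ax subject to x'Bx = 1 is the eigenvector of
# B^-1 A with the largest eigenvalue, rescaled so that x'Bx = 1.
evals, evecs = np.linalg.eig(np.linalg.inv(B) @ A)
i = int(np.argmax(evals.real))
lam = float(evals.real[i])
x = evecs[:, i].real
x = x / np.sqrt(x @ B @ x)  # enforce the constraint x'Bx = 1

print(float(x @ A @ x), lam)  # the attained value equals the eigenvalue
```

Since A x = λ B x at the optimum, x'Ax = λ x'Bx = λ, which is what the printed values confirm.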
CSE 291 Quiz for May 24, 2011

Your name:
Email address:

The compact singular value decomposition of the matrix A can be written as

    A = Σ_{k=1}^{r} σk uk vk^T

where the vectors uk are orthonormal, and so are the vectors vk.

(i) What can you say about r in general?

(ii) Suppose that one row of A is all zero. What can you say about r?
Chapter 13 Social network analytics

A social network is a graph where nodes represent individual people and links represent relationships. More generally, in a network nodes represent examples and edges represent relationships. Examples:

• Telephone network. Nodes are subscribers, and edges are phone relationships. Note that call detail records (CDRs) must be aggregated.
• Facebook.
• Papers citing papers.
• Same-author relationship between citations.
• Protein-protein interactions.
• Gene-gene regulation.
• Credit card applications.

There are two types of data mining one can do with a network: supervised and unsupervised. The aim of supervised learning is to obtain a model that can predict labels for nodes, or labels for edges. Predicting labels for nodes is sometimes called collective classification. For edges, the most basic label is existence. Predicting whether or not edges exist is called link prediction. (Predicting whether nodes exist has not been studied much.)

Examples of tasks that involve collective classification: predicting churn of subscribers, recognizing fraudulent applicants, recognizing the emotional state of Facebook members, predicting who will play a new game.
Figure 13.1: Visualization of the Enron email network from http://jheer.org/enron/v1/enron_nice_1.png.

Examples of tasks that involve link prediction: suggesting new friends on Facebook, predicting which pairs of terrorists know each other, identifying appropriate citations between scientific papers. One can distinguish structural versus dynamic link prediction: inferring unobserved edges versus forecasting future new edges.

13.1 Issues in network mining

Both collective classification and link prediction are transductive problems. Transduction is the situation where the set of test examples is known at the time a classifier is to be trained. In this situation, the real goal is not to train a classifier; instead, it is simply to make predictions about specific examples. In principle, some methods for transduction might make predictions without having an explicit reusable classifier. We will look at methods that do involve explicit classifiers. However, these classifiers will be usable only for nodes that are part of the part of the network that is known at training time. In particular, we will not solve the cold-start problem of making predictions for nodes that are not known during training.

Social networks, and other graphs in data mining, can be complex.
For example, edges may be directed or undirected, they can be of multiple types, and/or they can be weighted. The graph can be bipartite, e.g. authors and papers. Moreover, nodes have two fundamental characteristics. First, they have identity. This means that we know which nodes are the same and which are different in the graph, but we do not necessarily know anything else about nodes. Sometimes an edge between two nodes means that they represent the same real-world entity. Recognizing that different records refer to the same person, or that different citations describe the same paper, is called deduplication or record linkage. Second, nodes (and maybe edges) may be associated with vectors that specify the values of features. These vectors are sometimes called side-information.

Maybe the most fundamental and most difficult issue in mining network data, including social networks, is the problem of missing information. More often than not, it is only known partially which edges exist in a network.

13.2 Unsupervised network mining

The aim of unsupervised data mining is often to come up with a model that explains the pattern of the connectivity of the network, in particular its statistics. This type of learning can be fascinating as sociology. For example, it has revealed that people who stop smoking tend to stop being friends with smokers, but people who lose weight do not stop being friends with obese people. Often, unsupervised data mining in a network does not allow making useful decisions in an obvious way. But some models of entire networks can lead to predictions, often about the evolution of the network, e.g. via the "rich get richer" idea, that can be useful for making decisions. For example, given a social network, one can identify the most influential nodes and make special efforts to reach them.

Unsupervised methods are often used for link prediction.
Mathematical models that explain patterns of connectivity are often time-based. Local topological features of the graph can also be used directly to score potential edges. Let Γ(x) denote the set of neighbors of node x. The preferential attachment score of a pair of nodes is

    a(x, y) = |Γ(x)| · |Γ(y)|

while the Adamic-Adar score is

    a(x, y) = Σ_{z ∈ Γ(x) ∩ Γ(y)} 1 / log |Γ(z)|.

The Katz score is

    a(x, y) = Σ_{l=1}^{∞} β^l · paths(x, y, l)
where paths(x, y, l) is the number of paths of length l from node x to node y. This measure gives reduced weight to longer paths.

The Laplacian matrix of a graph is D − A, where A is the adjacency matrix and D is the diagonal matrix of vertex degrees. The modularity matrix has elements

    Bij = Aij − (di · dj) / (2m)

where di is the degree of vertex i and m is the total number of edges. This is the actual minus the expected number of edges between the two vertices, where the expected number depends only on degrees.

13.3 Collective classification

In traditional classifier learning, the labels of examples are assumed to be independent. (This is not the same as assuming that the examples themselves are independent; see the discussion of sample selection bias above.) If labels are independent, then a classifier can make predictions separately for separate examples. However, if labels are not independent, then in principle a classifier can achieve higher accuracy by predicting labels for related examples in a related way. In a network, in particular, the labels of neighbors are often correlated. As mentioned, this situation is called collective classification.

Intuitively, given a node x, we would like to use the labels of the neighbors of x as features when predicting the label of x, and vice versa. A principled approach to collective classification would find mutually consistent predicted labels for x and its neighbors. However, in general there is no guarantee that mutually consistent labels are unique, and there is no general algorithm for inferring them. A different approach has proved more successful. This approach is to extend each node with a fixed-length vector of feature values derived from its neighbors. Then, a standard classifier learning method is applied. Because a node can have a varying number of neighbors, but standard learning algorithms need vector representations of fixed length, features based on neighbors are often aggregates. Aggregates may include sums, means, modes, minimums, maximums, counts, and indicator functions.

For many data mining tasks where the examples are people, there are five general types of feature: personal demographic, group demographic, behavioral, situational, and social. Conventional features describing persons may be implicit versions of aggregates based on social networks. If group demographic features have predictive power, it is often because they are aggregates of features of linked people. For example, if zip code is predictive of the brand of automobile a person will purchase, that may be because of contagion effects between people who know each other, or see each other on the streets.
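To make the unsupervised link-prediction scores from Section 13.2 concrete, here is a minimal sketch of two of them on a hypothetical toy graph, stored as neighbor sets:

```python
import math

# Toy undirected graph as neighbor sets (hypothetical edges).
neighbors = {
    'a': {'b', 'c', 'd'},
    'b': {'a', 'c'},
    'c': {'a', 'b'},
    'd': {'a'},
}

def preferential_attachment(x, y):
    # "Rich get richer": high-degree pairs get high scores.
    return len(neighbors[x]) * len(neighbors[y])

def adamic_adar(x, y):
    # Shared neighbors count more when they have few neighbors themselves.
    return sum(1.0 / math.log(len(neighbors[z]))
               for z in neighbors[x] & neighbors[y])

pa = preferential_attachment('b', 'd')  # 2 * 1 = 2
aa = adamic_adar('b', 'c')              # shared neighbor 'a' has degree 3
print(pa, aa)
```

The Katz score could be added in the same style by counting paths via powers of the adjacency matrix, truncating the infinite sum at some maximum length.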
In many domains, the most useful vector of feature values for a node is obtained by low rank approximation of the adjacency matrix. The simplest case is when the adjacency matrix is fully known. Often, edges are undirected, and they are unweighted or have real-valued weights. An undirected adjacency matrix A is symmetric, meaning A = A^T. For such a matrix the SVD is

    A = U S U^T = Σ_{k=1}^{r} λk uk uk^T

where the λk scalars are the diagonal entries of the matrix S and the uk vectors are the orthonormal columns of the matrix U. Each uk vector has one entry for each node. In other words, the vector

    (Ui1, ..., Uik) = ((u1)i, ..., (uk)i)

can be used as the fixed-length representation of the ith node of the graph. The representation length k can be chosen between 1 and r. Because the λi values are ordered from largest to smallest, intuitively the vector u1 captures more of the overall structure of the network than u2, and so on. The vectors uk for large k may represent only random noise. For example, the Enron email adjacency matrix shown in Figure 13.1 empirically has rank only 2, approximately. Instead of the adjacency matrix itself, one may also use a low rank approximation of a matrix derived from the adjacency matrix, such as the Laplacian or modularity matrix.

If some edges are unknown, then the numerical SVD cannot be used directly. Instead, one should use a variant that allows some entries of the matrix A to be unknown. The stochastic gradient approach that is used for collaborative filtering can be applied. Often, there are no edges that are known to be absent: if an edge is known, then it exists, but if an edge is unknown, it may or may not exist. This situation is sometimes called "positive only" or "positive unlabeled." In this case, it is reasonable to use the standard SVD where the value 1 means "present" and the value 0 means "unknown."

If edges have discrete labels from more than two classes, then the standard SVD is not appropriate. In effect, multiple networks are overlaid on the same nodes. For example, if nodes are persons, then one set of edges may represent the "same address" relationship while another set may represent the "made telephone call to" relationship.
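The leading vector u1 of a symmetric adjacency matrix can be computed by power iteration, with no numerical library needed. A sketch on a hypothetical toy graph (two triangles joined by one edge):

```python
# Toy symmetric adjacency matrix (hypothetical): nodes 0-2 and 3-5 each
# form a triangle, and nodes 2 and 3 are joined by a bridge edge.
A = [[0, 1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 0, 1, 0, 0],
     [0, 0, 1, 0, 1, 1],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 1, 1, 0]]
n = len(A)

# Power iteration for the leading eigenvector u1; entry i can serve as a
# one-dimensional representation of node i (the case k = 1).
u = [1.0] * n
for _ in range(200):
    u = [sum(A[i][j] * u[j] for j in range(n)) for i in range(n)]
    norm = sum(x * x for x in u) ** 0.5
    u = [x / norm for x in u]

print(u)
```

By the symmetry of the graph, the two bridge nodes 2 and 3 receive equal values, larger than those of the other nodes, so u1 does capture overall structure.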
13.4 The social dimensions approach

The method of [Tang and Liu, 2009a] appears to be the best approach so far for predicting labels of nodes in a social network. It works as follows.

1. Compute the modularity matrix from the adjacency matrix.
2. Compute the top k eigenvectors of the modularity matrix.
3. Extend the vector representing each node with a vector of length k containing its component of each eigenvector.
4. Train a linear classifier using the extended vectors.

An intuitive weakness of this approach is that the eigenvectors are learned in an unsupervised way that ignores the specific collective classification task to be solved. The method of [Tang and Liu, 2009a] also has two limitations when applied to very large networks. First, computing these eigenvectors may need too much time and/or too much memory. Second, the eigenvectors are dense, so storing the extended vector for each node can use too much memory.

The method of [Tang and Liu, 2009b] is an alternative designed not to have these drawbacks. The idea is as follows. Let n be the number of nodes and let e be the number of edges.

1. Make a sparse binary dataset D with e rows and n columns, where each edge is represented by ones for its two nodes. Note that each column represents one node; the number of ones in this column is the degree of the node.
2. Use the kmeans algorithm with cosine similarity to partition the set of rows into a large number k of disjoint groups.
3. Extend the vector representing each node with a sparse binary vector of length k. The jth component of this vector is 1 if and only if the node has at least one edge belonging to cluster number j.
4. Train a linear classifier using the extended vectors.
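Step 1 of the first method above — computing the modularity matrix — can be sketched as follows. The toy graph is hypothetical; in practice the top eigenvectors of B would then be found with a numerical library:

```python
# Toy undirected graph (hypothetical edges): (0,1), (0,2), (1,2), (2,3).
A = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
n = len(A)
deg = [sum(row) for row in A]
m = sum(deg) / 2  # total number of edges

# Modularity matrix: actual minus expected number of edges,
# B_ij = A_ij - d_i d_j / (2m).
B = [[A[i][j] - deg[i] * deg[j] / (2 * m) for j in range(n)]
     for i in range(n)]

# Sanity check: each row of B sums to zero, because sum_j d_j = 2m.
print([round(sum(row), 10) for row in B])
```

The zero row sums are a useful check that the expected-edge term was computed correctly.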
13.5 Supervised link prediction

Supervised link prediction is similar to collective classification, but involves some additional difficulties. First, the label to be predicted is typically highly unbalanced: between most pairs of nodes, there is no edge. The common response is to subsample pairs of nodes that are not joined by an edge, but a better solution is to use a ranking loss function during training.

The most successful general approach to link prediction is essentially the same as for collective classification: extend each node with a vector of feature values derived from the network, then use a combination of the vectors for the two nodes to predict the existence of an edge between them. A combination of vectors representing nodes is a representation of a possible edge. The vectors representing two nodes should not be combined by concatenation. A reasonable alternative is to combine the vectors by pointwise multiplication. One can also represent a potential edge with edge-based feature values; see Section 6.7 above.

For example, for predicting future coauthorship:

1. Keyword overlap is the most informative feature for predicting coauthorship.
2. Sum of neighbors and sum of papers are next most predictive. Note that these features measure the propensity of an individual, as opposed to any interaction between individuals.
3. Shortest-distance is the most informative graph-based feature.

Major open research issues are the cold-start problem for new nodes, and how to generalize to new networks.
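Combining node vectors by pointwise multiplication can be sketched as follows; the low-dimensional node representations are made-up illustrative values, standing in for rows of a factor matrix such as Uk:

```python
# Hypothetical node representations, e.g. rows of a low-rank factor of
# the adjacency matrix; the numbers are made up for illustration.
u = {
    1: [0.9, 0.1],
    2: [0.8, 0.2],
    3: [-0.1, 0.9],
}

def edge_features(j, k):
    # Pointwise multiplication is symmetric in j and k, unlike
    # concatenation, and a linear model over these features can express
    # the dot-product u_j . u_k.
    return [a * b for a, b in zip(u[j], u[k])]

print(edge_features(1, 2))
print(edge_features(1, 3))  # mixed signs suggest a less likely edge
```

Note the symmetry: edge_features(j, k) == edge_features(k, j), so the representation does not depend on the arbitrary order of the two nodes, whereas a concatenation [u_j, u_k] does.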
CSE 291 Quiz for May 31, 2011

Your name:
Email address:

Consider the task of predicting which pairs of nodes in a social network are linked. Let the adjacency matrix A have dimension n by n. You have trained a low-rank matrix factorization A = UV, where U has dimension n by r for some r << n. Because A is symmetric, V is the transpose of U. Let ui be the row of U that represents node i.

Specifically, suppose you have a training set of pairs for which edges are known to exist, or known not to exist. You need to make predictions for the remaining edges. Let j, k be a pair for which you need to predict whether an edge exists. Consider these two possible ways to make this prediction:

1. Simply predict the dot-product uj · uk.
2. Predict f([uj, uk]) where [uj, uk] is the concatenation of the vectors uj and uk, and f is a trained logistic regression model.

Explain an important reason why the first approach is better.
Assignment due on June 1, 2010

The purpose of this assignment is to apply and evaluate methods to predict the presence of links that are unknown in a social network. Use either the Cora dataset or the Terrorists dataset. Both datasets are available at http://www.cs.umd.edu/~sen/lbcproj/LBC.html.

In the Cora dataset, each node represents an academic paper. Each paper has a label, where the seven label values are different subareas of machine learning. Each paper is also represented by a bag-of-words vector of length 1433. The network structure is that each paper is cited by, or cites, at least one other paper in the dataset.

The Terrorists dataset is more complicated. There is information about each terrorist, and also about each edge. The edges are labeled with types. For explanations see the paper Entity and relationship labeling in affiliation networks by Zhao, Sen, and Getoor from the 2006 ICML Workshop on Statistical Network Analysis, available at http://www.mindswap.org/papers/2006/RelClzPIT.pdf. According to Table 1 and Section 6 of this paper, there are 917 edges that connect 435 terrorists.

For either dataset, your task is to pretend that some edges are unknown, and then to predict these. Each experiment should use logistic regression and tenfold crossvalidation. Suppose that you are doing tenfold crossvalidation on the Terrorists dataset. Then you would pretend that 92 actual edges are unknown. Based on the 917 − 92 = 825 known edges, and on all 435 nodes, you would predict a score for each of the 435 · 434/2 potential edges. The higher the score of the 92 held-out edges, the better.

You should use logistic regression to predict the score of each potential edge, based on the known edges and on the nodes. When predicting edges, you may use features obtained from (i) the network structure of the edges known to be present and absent, (ii) properties of the nodes, and/or (iii) properties of pairs of nodes. For (i), use a method for converting the network structure into a vector of fixed length of feature values for each node, as discussed in class. Using one or more of the feature types (i), (ii), and (iii) yields seven alternative sets of features. Do experiments comparing at least two of these.

Think carefully about exactly what information is legitimately included in training folds. Also, avoid the mistakes discussed in Section 6.7 above.
CSE 291 Assignment 7 (2011)

This assignment is due at the start of class on Tuesday May 31, 2011. The goal is to apply and evaluate methods to predict the presence and type of links in a social network.

The dataset to use has information about 244 supposed terrorists [1]. The dataset [2] was created at the University of Maryland [3] and edited at UCSD by Eric Doi and Ke Tang. The information was compiled originally by the Profiles in Terror project. Note that the information is far from fully correct or complete.

The dataset has 851 rows and 1227 columns. Each row describes a known link between two terrorists. The first and second columns are id numbers between 1 and 244 for the two terrorists. Columns 3 to 614 contain 612 binary feature values for the person identified by the first column, and columns 615 to 1226 contain values for the same features for the second person; see Figure 13.3. Unfortunately the meaning of these features is not known precisely, and presumably the information is quite incomplete: zero may mean missing more often than it means false.

The last column of the dataset indicates the type of a known link. The value in the last column is 1 if the two terrorists are members of the same organization, and 0 if they have some other relationship (family members, contacted each other, or used the same facility such as a training camp). There are 461 positive same-organization links; see Figure 13.2.

There are two different link prediction tasks for this dataset. One task is to predict the existence of edges, regardless of their type. The other task is to distinguish between same-organization links and links of other types. Note that each task involves only binary classification. For the first task the two classes can be labeled "present" and "absent," while for the second task the classes can be labeled "colleague" and "other." For more details see the paper by Eric Doi [4], which considers the latter task and presents results in its Figure 6. Design your experiments so that your results can be compared to those results.

For the first task, you can assume that the 851 known edges are the only edges that really exist. Of course, an unknown number of additional unknown links exist. For both tasks, do tenfold crossvalidation. For the first task, this means pretending that 85 true edges are unknown. Then, based on the 851 − 85 = 766 known edges, and on all 244 nodes, you predict a score for each of the 244 · 243/2 potential edges. The higher the score of the 85 held-out edges, the better. Similarly, you can measure performance for the second task using only the edges with known labels 0 and 1.

Use a linear classifier such as logistic regression to compute a score for each potential edge. When computing scores, you may use features obtained from (i) the network structure of known and/or unknown edges, (ii) properties of nodes, and (iii) properties of pairs of nodes. For (i), use matrix factorization to convert the adjacency matrix (or a transformation of it) into a vector of feature values for each node, as discussed in class.

While designing your experiments, think carefully about the following issues.

• During crossvalidation, exactly what information is it legitimate to include in test examples?
• How should you take into account the fact that, almost certainly, some edges are present but are not among the 851 known edges?
• How should you take into account the fact that many feature values are likely missing for many nodes?
• How can you avoid the mistakes discussed in Section 6.7 above?

In your report, have separate sections for "Design of experiments" and "Results of experiments." In the results section, do report 0/1 accuracy for the task of distinguishing between same-organization links and links of other types, in order to allow comparisons with previous work. However, also discuss whether or not this is a good measure of success.

1. www.amazon.com/ProfilesTerrorMiddleTerroristOrganizations/dp/0742535258
2. cseweb.ucsd.edu/~elkan/291/PITdata.csv
3. www.cs.umd.edu/~sen/lbcproj/LBC.html
4. cseweb.ucsd.edu/~elkan/291/Eric.Doi_report.pdf
Figure 13.2: Visualization of 461 "same organization" links between 244 supposed terrorists.

Figure 13.3: Visualization of feature values for the 244 supposed terrorists.
Chapter 14 Interactive experimentation

This chapter discusses data-driven optimization of websites, in particular via A/B testing. A/B testing can reveal "willingness to pay." Compare profit at prices x and y, based on number sold. Choose the price that yields higher profit.
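A minimal sketch of the comparison, with made-up counts and treating revenue as profit (i.e. ignoring cost per unit):

```python
# Hypothetical A/B test outcome: two visitor groups of equal size,
# each shown one price; the counts are made up for illustration.
price_x, sold_x = 20.0, 130
price_y, sold_y = 25.0, 110

profit_x = price_x * sold_x  # 2600.0
profit_y = price_y * sold_y  # 2750.0

best = price_x if profit_x > profit_y else price_y
print(best)  # 25.0: fewer units sold, but higher profit
```

The example illustrates the point of the text: the price that sells the most units is not necessarily the price that maximizes profit.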
Quiz

The following text is from a New York Times article dated May 30, 2009.

Mr. Herman had run 27 ads on the Web for his client Vespa, the scooter company. Some were rectangular, some square. And the text varied: One tagline said, "Pure fun. And function," and promoted a free T-shirt. Another read, "Smart looks. Smarter purchase," and displayed a $0 down, 0 percent interest offer. Vespa's goal was to find out whether a financial offer would attract customers, and Mr. Herman's data concluded that it did. The $0 down offer attracted 71 percent more responses from one group of Web surfers than the average of all the Vespa ads, while the T-shirt offer drew 29 percent fewer.

(a) What basic principle of the scientific method did Mr. Herman not follow, according to the description above?

(b) Suppose that it is true that the financial offer does attract the highest number of purchasers. What is an important reason why the other offer might still be preferable for the company?

(c) Explain the writing mistake in the phrase "Mr. Herman's data concluded that it did." Note that the error is relatively high-level; it is not a spelling or grammar mistake.
Chapter 15

Reinforcement learning

Reinforcement learning is the type of learning done by an agent who is trying to figure out a good policy for interacting with an environment. The framework is that time is discrete, and the environment is considered to be a set of states. At each time step the agent perceives the current state of the environment and selects a single action. After performing the action, the agent earns a realvalued reward or penalty, time moves forward, and the environment shifts into a new state. The goal of the agent is to learn a policy for choosing actions that leads to the best possible longterm total of rewards.

Reinforcement learning is fundamentally different from supervised learning because explicit correct input/output pairs are never given to the agent. That is, the agent is never told what the optimal action is in a given state. Nor is the agent ever told what its future rewards will be. However, at each time step the agent does find out what its current shortterm reward is.

Reinforcement learning has realworld applications in domains where there is an intrinsic tradeoff between shortterm and longterm rewards and penalties, and where there is an environment that can be modeled as unknown and random. Successful application areas include controlling robots, playing games, scheduling elevators, treating cancer, and managing relationships with customers [Abe et al., 2002, Simester et al., 2006]. In the latter three examples, the environment consists of humans, who are modeled collectively as acting probabilistically. For example, each person is a different instantiation of the environment, and different people who happen to be in the same state are assumed to have the same probability distribution of reactions.

Reinforcement learning is appropriate for domains where the agent faces a stochastic but nonadversarial environment. It is not suitable for domains where the agent faces an intelligent opponent, because in adversarial domains the opponent does not change state probabilistically. More specifically, reinforcement learning methods are not
appropriate for learning to play chess. Game theory is the appropriate formal framework for modeling agents interacting with intelligent opponents. Of course, finding good algorithms is even harder in game theory than in reinforcement learning. However, reinforcement learning can be successful for learning to play games that involve both strategy and chance, such as backgammon, if it is reasonable to model the opponent as following a randomized strategy.

A fundamental distinction in reinforcement learning is whether learning is online or offline. In online learning, the agent must balance exploration, which lets it learn more about the environment, with exploitation, where it uses its current estimate of the best policy to earn rewards. In offline learning, the agent is given a database of recorded experience interacting with the environment. The job of the agent is to learn the best possible policy for directing future interaction with the same environment. Note that this is different from learning to predict the rewards achieved by following the policy used to generate the recorded experience. The learning that is needed is called offpolicy since it is about a policy that is different from the one used to generate training data. Most research on reinforcement learning has concerned online learning, but offline learning is more useful for many applications.

15.1 Markov decision processes

The basic reinforcement learning scenario is formalized using three concepts:

1. a set S of potential states of the environment,
2. a set A of potential actions of the agent, and
3. rewards that are real numbers.

Formally, the environment is represented as a socalled Markov decision process (MDP). An MDP is a fourtuple ⟨S, A, p(· | ·, ·), r(·, ·)⟩ where

• S is the set of states, which is also called the state space,
• A is the set of actions,
• p(st+1 = s′ | st = s, at = a) is the probability that action a in state s at time t will lead to state s′ at time t + 1, and
• r(s, a) is the immediate reward received after doing action a from state s.

At each time t, the agent perceives the state st ∈ S and its available actions A. The agent chooses one action a ∈ A and receives from the environment the reward rt and then the new state st+1.
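A toy MDP can be written down directly as the four-tuple. Everything in the sketch below (state names, probabilities, rewards) is hypothetical:

```python
import random

# A hypothetical MDP as the four-tuple (S, A, p, r).
S = ("low", "high")
A = ("wait", "advertise")

# p[(s, a)][s'] = p(s_{t+1} = s' | s_t = s, a_t = a)
p = {
    ("low",  "wait"):      {"low": 0.9, "high": 0.1},
    ("low",  "advertise"): {"low": 0.4, "high": 0.6},
    ("high", "wait"):      {"low": 0.3, "high": 0.7},
    ("high", "advertise"): {"low": 0.2, "high": 0.8},
}

# r[(s, a)] = immediate reward for doing action a in state s
r = {
    ("low",  "wait"): 0.0, ("low",  "advertise"): -1.0,
    ("high", "wait"): 2.0, ("high", "advertise"): 1.0,
}

def step(s, a, rng=random):
    """One time step: sample s_{t+1} from p(. | s, a), return (r_t, s_{t+1})."""
    nxt = rng.choices(S, weights=[p[(s, a)][t] for t in S])[0]
    return r[(s, a)], nxt

reward, s1 = step("low", "advertise")
print(reward, s1)
```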
The states of an MDP possess the Markov property. This means that if the current state of the MDP at time t is known, then transitions to a new state at time t + 1 are independent of all previous states. In this case, history is irrelevant. It is crucial that the probability distributions of state transitions and rewards are stationary: transitions and rewards may be stochastic, but their probabilities are the same at all times.

The goal of the agent is to maximize a cumulative function R of the rewards. If the maximum time step is t = T where T is finite, then the goal can be to maximize R = r0 + r1 + . . . + rT. If the time horizon is infinite, then the goal is to maximize a discounted sum that involves a discounting factor γ, which is usually a number just under 1:

    R = Σ_{t=0}^{∞} γ^t r(st, at).

The factor γ makes the sum above finite even though it is a sum over an infinite number of time steps.

An optimal policy π∗ is one that maximizes the expected value of R, where the expectation is based on the transition probability distribution p(st+1 = s′ | st = s, at = π∗(st)). If the maximum time horizon is infinite, then optimal policies have the Markov property also: the best action to choose depends on the current state only, regardless of prior history. With a finite time horizon, the best action may be a function π∗(s, t) of the current time t as well as of the current state s.

Some extensions of the MDP framework are possible without changing the nature of the problem fundamentally. In particular, the reward may be a function of the next state as well as of the current state and action, that is, the reward can be a function r(s, a, s′). Or, for maximal mathematical simplicity, the reward can be a function just of the current state, r(s). Also, the MDP definition can be extended to allow r to be stochastic as opposed to deterministic: p(r | s, a, s′) is the probability that the immediate reward is r, given state s, action a, and next state s′.

In order to maximize its total reward, the agent must acquire a policy, i.e. a mapping π : S → A that specifies which action to take for every state. Given an MDP, an optimal policy is sometimes called a solution of the MDP.

15.2 RL versus costsensitive learning

Remember that costsensitive learning (CSL) is rather a misnomer; a more accurate name is costsensitive decisionmaking. The differences between CSL and RL include the following.

• In CSL, the customer is assumed to be in a certain state, and then a single characteristic becomes true; this is the label to be predicted. In RL, every aspect of the state of the customer can change from one time point to the next.

• In CSL, the state of a customer is typically one of a number of discrete classes. In RL, a customer is typically represented by a realvalued vector.

• In CSL, the customer's label is assumed to be independent of the agent's prediction. In RL, the next state depends on the agent's action; consider priming and annoyance, for example.

• In CSL, the agent chooses an action based on separate estimates of the customer's label and the immediate benefit to the agent of each label, so longterm effects of actions can only be taken into account heuristically. In RL, the agent chooses an action to maximize longterm benefit.

• In CSL, the immediate benefit of each label is assumed known, so no learning is required to estimate benefits. In RL, these are learned: the benefit must be estimated for every possible action, but the actual benefit is observed only for one action.

Both CSL and RL face the issue of possibly inadequate exploration. Suppose that the agent learns in 2010, based on data from 2009, never to give a loan to owners of convertibles. Then in 2011, when the agent wants to learn from 2010 data, it will have no evidence that owners of convertibles are bad risks.

15.3 Algorithms to find an optimal policy

If the agent knows the transition probabilities p(s′ | s, a) of the MDP, then it can calculate an optimal policy. To see how to find the optimal policy, consider a realvalued function U of states called the utility function. Intuitively, U^π(s) is the expected total reward achieved by starting in state s and acting according to the policy π. Recursively,

    U^π(s) = r(s, π(s)) + γ Σ_{s′} p(s′ | s, π(s)) U^π(s′).

The equation above is called the Bellman equation. The Bellman optimality equation describes the optimal policy:

    U∗(s) = max_a [ r(s, a) + γ Σ_{s′} p(s′ | s, a) U∗(s′) ].

Let the function V be an approximation of the function U∗. We can get an improved approximation by picking any state s and doing the update

    V(s) := max_a [ r(s, a) + γ Σ_{s′} p(s′ | s, a) V(s′) ].          (15.1)
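The update (15.1), applied repeatedly in sweeps over all states, is easy to sketch in code. The two-state, two-action MDP below and all its numbers are hypothetical:

```python
# Repeated application of update (15.1) on a hypothetical MDP.
p = {0: {0: [0.9, 0.1], 1: [0.4, 0.6]},   # p[s][a][s'] = p(s' | s, a)
     1: {0: [0.3, 0.7], 1: [0.2, 0.8]}}
r = {0: {0: 0.0, 1: -1.0}, 1: {0: 2.0, 1: 1.0}}
gamma = 0.9
states, actions = (0, 1), (0, 1)

def backup(V, s, a):
    """r(s,a) + gamma * sum_s' p(s'|s,a) V(s')"""
    return r[s][a] + gamma * sum(p[s][a][t] * V[t] for t in states)

V = {s: 0.0 for s in states}
delta = 1.0
while delta > 1e-10:          # sweep until the values stop changing
    delta = 0.0
    for s in states:
        best = max(backup(V, s, a) for a in actions)
        delta = max(delta, abs(best - V[s]))
        V[s] = best

print({s: round(v, 2) for s, v in V.items()})
```

Because γ < 1, each sweep is a contraction, so the loop terminates.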
The value iteration algorithm of Bellman (1957) is simply to do this single step repeatedly for all states s in any order. If the state space S is finite, then V can be stored as a vector, and it can be initialized to be V(s) = 0 for each state s. At any point in the algorithm, a current guess for the optimal policy can be computed from the approximate utility function V as

    π(s) = argmax_a [ r(s, a) + γ Σ_{s′} p(s′ | s, a) V(s′) ].          (15.2)

Note that (15.2) is identical to (15.1) except for the argmax operator in the place of the max operator.

Lemma: The optimal utility function U∗ is unique, and the V function converges to U∗ regardless of how V is initialized.

Proof: Assume that the state space S is finite, so the function V(s) can be viewed as a vector. Let V1 and V2 be two such vectors, and let V1′ and V2′ be the results of updating V1 and V2. Consider the maximum discrepancy between V1′ and V2′:

    d = max_s { V1′(s) − V2′(s) }
      = max_s { max_{a1} [ r(s, a1) + γ Σ_{s′} p(s′ | s, a1) V1(s′) ]
              − max_{a2} [ r(s, a2) + γ Σ_{s′} p(s′ | s, a2) V2(s′) ] }.

This maximum discrepancy goes to zero.

An important fact about the value iteration method is that π(s) may converge to the optimal policy long before V converges to the true U∗. An alternative algorithm called policy iteration is due to Howard (1960). This method repeats two steps until convergence:

1. Compute V exactly based on π.
2. Update π based on V.

Step 2 is performed using Equation (15.2). Computing V exactly in Step 1 is formulated and solved as a set of simultaneous linear equations. There is one equation for each s in S:

    V(s) = r(s, π(s)) + γ Σ_{s′} p(s′ | s, π(s)) V(s′).

Policy iteration has the advantage that there is a clear stopping condition: when the function π does not change after performing Step 2, the algorithm terminates.
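A sketch of policy iteration on a hypothetical two-state, two-action MDP follows. Step 1 solves the simultaneous linear equations exactly, and Step 2 is the greedy update of Equation (15.2). All numbers are invented:

```python
import numpy as np

# Policy iteration: exact evaluation plus greedy improvement.
P = np.array([[[0.9, 0.1], [0.4, 0.6]],    # P[s, a, s'] = p(s' | s, a)
              [[0.3, 0.7], [0.2, 0.8]]])
R = np.array([[0.0, -1.0], [2.0, 1.0]])    # R[s, a] = r(s, a)
gamma, nS = 0.9, 2

pi = np.zeros(nS, dtype=int)               # initial policy: always action 0
while True:
    # Step 1: solve V = r_pi + gamma * P_pi V as a linear system.
    P_pi = P[np.arange(nS), pi]
    r_pi = R[np.arange(nS), pi]
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    # Step 2: greedy improvement, as in Equation (15.2).
    Q = R + gamma * P @ V
    pi_new = Q.argmax(axis=1)
    if np.array_equal(pi_new, pi):         # the clear stopping condition
        break
    pi = pi_new

print(pi.tolist(), np.round(V, 2).tolist())
```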
15.4 Q functions

An agent can execute the algorithms from the previous section, value iteration and policy iteration, only if it knows the transition probabilities of the environment. If these probabilities are unknown, then the agent must learn them at the same time it is trying to discover an optimal policy. This learning problem is obviously more difficult. It is called reinforcement learning.

Rather than learn transition probabilities or an optimal policy directly, some of the most successful RL methods learn a socalled Q function. A Q function is an extended utility function. Specifically, the Qπ function is like the Uπ function except that it has two arguments and it only follows the policy π starting in the next state:

    Qπ(s, a) = r(s, a) + γ Σ_{s′} p(s′ | s, a) Qπ(s′, π(s′)).

In words, the Qπ function is the value of taking the action a and then continuing according to the policy π.

Given a Q function, the recommended action for a state s is a = πQ(s) = argmax_a Q(s, a). When using an estimated Q function Q̂, at each time step the action a is determined by a maximization operation that is a function only of the current state s. From this point of view, at each time step actions seem to be chosen in a greedy way, ignoring the future. The key that makes the approach in fact not greedy is that the Q function is learned. The learned Q function emphasizes aspects of the state that are predictive of longterm reward.

15.5 The Qlearning algorithm

Experience during learning consists of ⟨s, a⟩ pairs together with an observed reward r and an observed next state s′. That is, the agent was in state s, it tried doing a, it received r, and s′ happened. The agent can use each such fragment of experience to update the Q function directly. Specifically, for any state st and any action at, the agent can update the Q function as follows:

    Q(st, at) := (1 − α) Q(st, at) + α [ rt + γ max_a Q(st+1, a) ].

In this update rule, α is a learning rate such that 0 < α < 1, rt is the observed immediate reward, and γ is the discount rate.

Assume that the state space and the action space are finite, so we can use a twodimensional table to store the Q function. The Qlearning algorithm is as follows:

1. Initialize Q values, and start in some state s0.
2. Given the current state s, choose an action a.
3. Observe immediate reward r and the next state s′ = st+1.
4. Update Q(s, a) using the update rule above.
5. Continue from Step 2.

This algorithm is guaranteed to converge to optimal Q values, if (i) the environment is stationary, (ii) the environment satisfies the Markov assumption, (iii) each state is visited repeatedly, (iv) each action is tried in each state repeatedly, and (v) the learning rate parameter decreases over time. All these assumptions are reasonable. However, there is no known general method for choosing a good learning rate α.

The learning algorithm does not say which action to select at each step. We need a separate socalled exploration strategy. This strategy must trade off exploration, that is trying unusual actions, with exploitation, which means preferring actions with high estimated Q values. An ε-greedy policy selects with probability 1 − ε the action that is optimal according to the current Q function, and with probability ε it chooses the action from a uniform distribution over all other actions. In other words, with probability ε the policy does exploration, while with probability 1 − ε it does exploitation. A softmax policy chooses action ai with probability proportional to exp[b Q(s, ai)] where b is a constant. This means that actions with higher estimated Q values are more likely to be chosen (exploitation), but every action always has a nonzero chance of being chosen (exploration). Smaller values of b make the softmax strategy explore more, while larger values make it do more exploitation.

Qlearning has at least two major drawbacks. First, if Q(s, a) is approximated by a continuous function, for example a neural network, then when the value of Q(s, a) is updated for one ⟨s, a⟩ pair, the Q values of other stateaction pairs are changed in unpredictable ways. Second, the algorithm may converge extremely slowly. If historical training data is available, Qlearning can be applied, but the next section explains a better alternative.
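A minimal sketch of tabular Q-learning with ε-greedy exploration follows. The two-state MDP and all its numbers are hypothetical, and the decreasing learning rate is one simple way to satisfy condition (v):

```python
import random

# Tabular Q-learning with epsilon-greedy exploration on a made-up MDP.
p = {0: {0: [0.9, 0.1], 1: [0.4, 0.6]},   # p[s][a][s'] = p(s' | s, a)
     1: {0: [0.3, 0.7], 1: [0.2, 0.8]}}
r = {0: {0: 0.0, 1: -1.0}, 1: {0: 2.0, 1: 1.0}}
gamma, eps = 0.9, 0.2

rng = random.Random(0)
Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
N = {(s, a): 0 for s in (0, 1) for a in (0, 1)}   # visit counts
s = 0
for t in range(200_000):
    # epsilon-greedy: explore with probability eps, otherwise exploit.
    if rng.random() < eps:
        a = rng.choice([0, 1])
    else:
        a = max((0, 1), key=lambda b: Q[(s, b)])
    reward = r[s][a]
    s_next = rng.choices([0, 1], weights=p[s][a])[0]
    # A learning rate that decreases with the visit count.
    N[(s, a)] += 1
    alpha = N[(s, a)] ** -0.6
    # Q(s,a) := (1 - alpha) Q(s,a) + alpha [r + gamma max_a' Q(s',a')]
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (
        reward + gamma * max(Q[(s_next, b)] for b in (0, 1)))
    s = s_next

greedy = {s: max((0, 1), key=lambda b: Q[(s, b)]) for s in (0, 1)}
print(greedy)
```

As the text notes, the greedy policy read off from Q typically stabilizes well before the Q values themselves converge.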
15.6 Fitted Q iteration

The Qlearning algorithm is designed for online learning. As mentioned, in the batch setting for RL, an optimal policy or Q function must be learned from historical data [Neumann, 2008, Ernst et al., 2005]. A single historical training example is a quadruple ⟨s, a, r, s′⟩ where s is a state, a is the action actually taken in this state, r is the immediate reward that was obtained, and s′ is the state that was observed to come next. The method named Qiteration [Murphy, 2005, Ernst et al., 2005] is the following algorithm:

    Define Q0(s, a) = 0 for all s and a.
    For horizon h = 0, 1, 2, . . .:
        (a) For each example ⟨s, a, r, s′⟩ let v = r + γ max_b Qh(s′, b).
        (b) Train Qh+1 with examples ⟨⟨s, a⟩, v⟩.

Step (b) can be implemented with linear regression, tree ensembles, or many other methods. The variable h is called the horizon, because it counts how many steps of lookahead are implicit in the learned Q function. In some applications, in particular medical clinical trials, the maximum horizon is a small fixed integer. In these cases the discount factor γ can be set to one.

Qiteration is an offpolicy method, meaning that data collected while following one or more arbitrary nonoptimal policies can be used to learn a better policy. The fundamental reason why Qiteration is correct as an offpolicy method is that it is discriminative as opposed to generative. Qiteration does not model in any way the distribution of stateaction pairs ⟨s, a⟩. Instead, it uses a purely discriminative method to learn only how the values to be predicted depend on s and a. Indeed, the probability distribution of training examples ⟨s, a⟩ is different from the probability distribution of optimal examples ⟨s, π∗(s)⟩. Offpolicy methods for RL face the problem of bias in sample selection.

Fitted Qiteration has significant advantages over Qlearning. First, there is no need to know the policy π that was followed to generate historical episodes, that is episodes of the form s1, a1, r1, s2, a2, r2, s3, . . . . Second, no learning rate is needed, and all training Q values are fitted simultaneously, so the underlying supervised learning algorithm can make sure that all of them are fitted reasonably well. Another major advantage of fitted Qiteration, compared to other methods for offline reinforcement learning, is that the historical training data can be collected in any way that is convenient. There is no need to collect trajectories of consecutive states, and there is no need to collect trajectories starting at any specific states.

15.7 Representing continuous states and actions

Requiring that actions and states be discrete is restrictive. To be useful for many applications, a reinforcement learning algorithm should allow states and actions to be represented as realvalued vectors. For example, in many business domains the state is essentially the current characteristics of a customer, who is represented by a highdimensional realvalued vector [Simester et al., 2006]. Using the fitted Q iteration algorithm, there are two basic operations that need to be performed with a Q function. First, the Q function must be learnable from training examples. Second, finding optimal actions should be efficient: given a state s, computing a = π(s) = argmax_a Q(s, a) should be fast.
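Before turning to continuous representations, the fitted Q iteration loop can be sketched in the discrete case. Everything below is hypothetical: a two-state MDP, a uniformly random logging policy, and per-(s, a) averaging of the regression targets standing in for the supervised learner (trees or networks in practice):

```python
import random
from collections import defaultdict

# Fitted Q iteration on logged quadruples (s, a, r, s').
p = {0: {0: [0.9, 0.1], 1: [0.4, 0.6]},   # p[s][a][s'] = p(s' | s, a)
     1: {0: [0.3, 0.7], 1: [0.2, 0.8]}}
r = {0: {0: 0.0, 1: -1.0}, 1: {0: 2.0, 1: 1.0}}
gamma = 0.9

# Historical data collected by a uniformly random behavior policy;
# Q-iteration is off-policy, so this is legitimate training data.
rng = random.Random(1)
data, s = [], 0
for _ in range(20_000):
    a = rng.choice([0, 1])
    s_next = rng.choices([0, 1], weights=p[s][a])[0]
    data.append((s, a, r[s][a], s_next))
    s = s_next

Q = defaultdict(float)        # Q_0 is zero everywhere
for h in range(80):           # the horizon h grows by one per pass
    targets = defaultdict(list)
    for (s, a, rew, s_next) in data:
        v = rew + gamma * max(Q[(s_next, b)] for b in (0, 1))
        targets[(s, a)].append(v)
    # "Train" Q_{h+1}: average the targets for each (s, a) cell.
    Q = defaultdict(float,
                    {sa: sum(vs) / len(vs) for sa, vs in targets.items()})

policy = {s: max((0, 1), key=lambda a: Q[(s, a)]) for s in (0, 1)}
print(policy)
```

Note that no learning rate appears anywhere, and the logged trajectories need not start in any particular state.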
Suppose that a Q function is written as Q(s, a) = s^T W a, where the vectors s and a are of length m and n respectively, so that W ∈ Rm×n. This is called a bilinear representation. The key is to notice that

    s^T W a = Σ_{i=1}^{m} Σ_{j=1}^{n} (W ◦ s a^T)ij = vec(W) · vec(s a^T).          (15.3)

In this equation, s a^T is the matrix that is the outer product of the vectors s and a, and ◦ denotes the elementwise product of two matrices, which is sometimes called the Hadamard product. The notation vec(A) means the matrix A converted into a vector by concatenating its columns. Based on Equation (15.3), each training triple ⟨s, a, v⟩ can be converted into the pair ⟨vec(s a^T), v⟩, and vec(W) can be learned by standard linear regression. Hence, the Q function is learnable in a straightforward way.

Given s, the optimal action is

    a∗ = argmax_b Q(s, b) = argmax_b x · b   where x = s^T W,

subject to linear restrictions on the components of b. This maximization is a linear programming (LP) task. The linear constraints may be lower and upper bounds for components (subactions) of b. They may also involve multiple components of b, for example a budget limit in combination with different costs for different subactions.

The bilinear representation has important advantages. It contains an interaction term for every component of the action vector and of the state vector. Therefore, it can represent the consequences of each action component as depending on each aspect of the state. Linear constraints can be imposed on the action vector, and it will still be possible to learn a bilinear Q function representing how longterm reward depends on the current state and action vectors.1

The solution to an LP problem is always a vertex of the feasible region. Therefore, the maximization operation always selects a socalled "bang-bang" action vector, each component of which is an extreme value. In at least some domains, actions that change smoothly over time are desirable, for example following the "minimum jerk" principle [Viviani and Flash, 1995]. In many domains, the criteria to be minimized or maximized to achieve desirably smooth solutions are known. These criteria can be included as additional penalties in the maximization problem to be solved to find the optimal action. If the criteria are nonlinear, the maximization problem will be harder to solve, but it may still be convex.

1 The LP approach is far from trivial in highdimensional action spaces: it finds the optimal value for every action component in polynomial time, whereas naive search might need O(2^d) time where d is the dimension of the action space, even when optimal action values are always lower or upper bounds. The LP approach can also handle constraints that combine two or more components of the action vector; given constraints of this nature, the optimal value for each action component is in general not simply its lower or upper bound. Nevertheless, the approach is tractable in practice, even for highdimensional action vectors.
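A numerical sketch of the identity (15.3) and its two uses: learning vec(W) by least squares, and the bang-bang maximization in the special case where the only constraints are bounds. All sizes and numbers are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3
W = rng.standard_normal((m, n))
s = rng.standard_normal(m)
a = rng.standard_normal(n)

def vec(M):
    return M.flatten(order="F")   # concatenate the columns of M

# Check identity (15.3): s^T W a = vec(W) . vec(s a^T).
lhs = s @ W @ a
rhs = vec(W) @ vec(np.outer(s, a))
assert np.allclose(lhs, rhs)

# Learn vec(W) by ordinary least squares from noiseless triples.
X = np.array([vec(np.outer(rng.standard_normal(m), rng.standard_normal(n)))
              for _ in range(200)])
v = X @ vec(W)
w_hat, *_ = np.linalg.lstsq(X, v, rcond=None)
print(np.allclose(w_hat, vec(W)))  # True

# With only lower/upper bounds on the action, argmax_b x . b picks each
# component at an extreme: upper bound where the coefficient x_i is
# positive, lower bound where it is negative (a "bang-bang" vector).
x = s @ W
lower, upper = -np.ones(n), np.ones(n)
b_star = np.where(x > 0, upper, lower)
print(b_star)
```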
Some other representations for Q functions are less satisfactory. The simplest representation is tabular: a separate value is stored for each combination of a state value s and an action value a. This representation is usable only when both the state space and the action space are discrete and of low cardinality. A general representation is linear, using fixed basis functions. In this representation Q(s, a) = w · [φ1(s, a), · · · , φm(s, a)] where φ1 to φm are fixed realvalued functions and w is a vector of length m. The bilinear approach is a special case of this representation where each basis function is the product of a state component and an action component. Other common representations are neural networks [Riedmiller, 2005] and ensembles of decision trees [Ernst et al., 2005]. An intuitive drawback of other cases of this representation is that they ignore the distinction between states and actions: essentially, s and a are concatenated as inputs to the basis functions φi. A computational drawback is that, in general, no efficient way of performing the maximization operation argmax_b Q(s, b) is known, whether or not there are constraints on feasible action vectors b; these representations all have the same difficulty. Note that any fixed format for Q functions, including a bilinear representation, in general makes the whole RL process suboptimal, since the true optimization problem that represents longterm reward may not be expressible in this format.

15.8 Inventory management applications

In inventory management applications, at each time step the agent sees a demand vector and/or a supply vector for a number of products. The agent must decide how to satisfy each demand in order to maximize longterm benefit, subject to various rules about the substitutability and perishability of products. Managing the stocks of a blood bank is a typical application of this type. As a concrete illustration, this section describes results using the specific formulation used in [Yu, 2007] and Chapter 12 of [Powell, 2007].

The blood bank stores blood of eight types: AB+, AB−, A+, A−, B+, B−, O+, and O−. Some types of blood can be substituted for some others; 27 of the 64 conceivable substitutions are allowed. Blood that is not used gets older, and beyond a certain age must be discarded. With three discrete ages, the inventory of the blood bank is a vector in R24. Each period, the manager sees a certain level of demand for each type, and a certain level of supply. The state of the system is the inventory vector concatenated with the 8dimensional demand vector and a unit constant component. Each component of the action vector is a quantity of blood of one type and age, used to meet demand for blood of the same or another type. The action vector has 3 · 27 + 1 = 82 dimensions since there are 3 ages, 27 possible allocations, and a unit constant pseudoaction. The LP solved to determine the action at each time step has learned coefficients
Table 15.1: Shortterm reward functions for blood bank management.

                              [Yu, 2007]    new
    supply exact blood type       50          0
    substitute O blood            60          0
    substitute other type         45          0
    fail to meet demand            0        −60
    discard blood               −20           0
in its objective function, but fixed constraints, such as that the total quantity supplied from each of the 24 stocks must not be more than the current stock amount. For training and for testing, a trajectory is a series of periods. In each period, the agent sees a demand vector drawn randomly from a specific distribution. Then, blood remaining in stock is aged by one period, blood that is too old is discarded, and fresh blood arrives according to a different probability distribution. Training is based on 1000 trajectories of length 10 periods in which the agent follows a greedy policy.2 Testing is based on trajectories of the same length where supply and demand vectors are drawn from the same distributions, but the agent follows a learned policy.

In many realworld situations the objectives to be maximized, both shortterm and longterm, are subject to debate. Intuitive measures of longterm success may not be mathematically consistent with intuitive immediate reward functions, or even with the MDP framework. The blood bank scenario illustrates this issue. Table 15.1 shows two different immediate reward functions. Each entry is the benefit accrued by meeting one unit of demand with supply of a certain nature. The middle column is the shortterm reward function used in previous work, while the right column is an alternative that is more consistent with the longterm evaluation measure used previously (described below).

For a blood bank, the percentage of unsatisfied demand is an important longterm measure of success [Yu, 2007]. If a small fraction of demand is not met, that is acceptable because all highpriority demands can still be met. However, if more than 10% of demand for any type is not met in a given period, then some patients
2 The greedy policy is the policy that supplies blood to maximize onestep reward at each time step. Many domains are like the blood bank domain in that the optimal onestep policy, also called myopic or greedy, can be formulated directly as a linear programming problem with known coefficients. With a bilinear representation of the Q function, coefficients are learned that formulate the longterm optimal policy as a problem in the same class. The linear representation is presumably not sufficiently expressive to represent the truly optimal longterm policy, but it is expressive enough to represent a policy that is better than the greedy one.
Table 15.2: Success of alternative learned policies.

                                 average unmet    frequency of severe
                                 A+ demand        unmet A+ demand
    greedy policy                    7.3%               46%
    bilinear Qiteration (i)         18.4%               29%
    bilinear Qiteration (ii)         7.55%              12%
may suffer seriously. Therefore, we measure both the average percentage of unmet demand and the frequency of unmet demand over 10%. The A+ blood type is an especially good indicator of success, because the discrepancy between supply and demand is greatest for it: on average, 34.00% of demand but only 27.94% of supply is for A+ blood. Table 15.2 shows the performance of the greedy policy and of policies learned by ﬁtted Q iteration using the two immediate reward functions of Table 15.1. Given the immediate reward function (i), ﬁtted Qiteration satisﬁes on average less of the demand for A+ blood. However, this reward function does not include an explicit penalty for failing to meet demand. The bilinear method correctly maximizes (approximately) the discounted longterm average of immediate rewards. The last column of Table 15.1 shows a different immediate reward function that does penalize failures to meet demand. With this reward function, ﬁtted Q iteration learns a policy that more than cuts in half the frequency of severe failures.
Bibliography
[Abe et al., 2002] Abe, N., Pednault, E., Wang, H., Zadrozny, B., Fan, W., and Apte, C. (2002). Empirical comparison of various reinforcement learning strategies for sequential targeted marketing. In Proceedings of the IEEE International Conference on Data Mining, pages 3–10. IEEE.

[Dietterich, 2009] Dietterich, T. G. (2009). Machine learning and ecosystem informatics: Challenges and opportunities. In Proceedings of the 1st Asian Conference on Machine Learning (ACML), pages 1–5. Springer.

[Džeroski et al., 2001] Džeroski, S., De Raedt, L., and Driessens, K. (2001). Relational reinforcement learning. Machine Learning, 43(1):7–52.

[Ernst et al., 2005] Ernst, D., Geurts, P., and Wehenkel, L. (2005). Treebased batch mode reinforcement learning. Journal of Machine Learning Research, 6(1):503–556.

[Gordon, 1995a] Gordon, G. J. (1995a). Stable fitted reinforcement learning. In Advances in Neural Information Processing Systems (NIPS), pages 1052–1058.

[Gordon, 1995b] Gordon, G. J. (1995b). Stable function approximation in dynamic programming. In Proceedings of the International Conference on Machine Learning (ICML), pages 261–268.

[Huang et al., 2006] Huang, J., Smola, A., Gretton, A., Borgwardt, K. M., and Schölkopf, B. (2006). Correcting sample selection bias by unlabeled data. In Proceedings of the Neural Information Processing Systems Conference (NIPS 2006).

[Jonas and Harper, 2006] Jonas, J. and Harper, J. (2006). Effective counterterrorism and the limited role of predictive data mining. Technical report, Cato Institute. Available at http://www.cato.org/pub_display.php?pub_id=6784.
[Judd and Solnick, 1994] Judd, K. and Solnick, A. (1994). Numerical dynamic programming with shapepreserving splines. Unpublished paper from the Hoover Institution, available at http://bucky.stanford.edu/papers/dpshape.pdf.

[Michie et al., 1994] Michie, D., Spiegelhalter, D. J., and Taylor, C. C. (1994). Machine Learning, Neural and Statistical Classification. Ellis Horwood.

[Murphy, 2005] Murphy, S. A. (2005). A generalization error for Qlearning. Journal of Machine Learning Research, 6:1073–1097.

[Neumann, 2008] Neumann, G. (2008). Batchmode reinforcement learning for continuous state spaces: A survey. ÖGAI Journal, 27(1):15–23.

[Powell, 2007] Powell, W. B. (2007). Approximate Dynamic Programming. John Wiley & Sons, Inc.

[Riedmiller, 2005] Riedmiller, M. (2005). Neural fitted Q iteration - first experiences with a data efficient neural reinforcement learning method. In Proceedings of the 16th European Conference on Machine Learning (ECML), pages 317–328.

[Simester et al., 2006] Simester, D., Sun, P., and Tsitsiklis, J. N. (2006). Dynamic catalog mailing policies. Management Science, 52(5):683–696.

[Stachurski, 2008] Stachurski, J. (2008). Continuous state dynamic programming via nonexpansive approximation. Computational Economics, 31(2):141–160.

[Tang and Liu, 2009a] Tang, L. and Liu, H. (2009a). Relational learning via latent social dimensions. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 817–826. ACM.

[Tang and Liu, 2009b] Tang, L. and Liu, H. (2009b). Scalable learning of collective behavior based on sparse social dimensions. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM), pages 1107–1116. ACM.

[Vert and Jacob, 2008] Vert, J.-P. and Jacob, L. (2008). Machine learning for in silico virtual screening and chemical genomics: New strategies. Combinatorial Chemistry & High Throughput Screening, 11(8):677–685.

[Viviani and Flash, 1995] Viviani, P. and Flash, T. (1995). Minimumjerk, twothirds power law, and isochrony: converging approaches to movement planning. Journal of Experimental Psychology, 21:32–53.
[Yu, 2007] Yu, V. (2007). Approximate dynamic programming for blood inventory management. Honors thesis, Princeton University.

[Zhu and Goldberg, 2009] Zhu, X. and Goldberg, A. (2009). Introduction to semisupervised learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 3(1):1–130.