
# Machine Learning Tutorial

Section 1

June 10, 2011

This ppt includes some slides/slide-parts/text taken from online materials created by the following people:
- Greg Grudic
- Alexander Vezhnevets
- Hal Daumé III

## What is Machine Learning?

The goal of machine learning is to build computer
systems that can adapt and learn from their
experience.
Tom Dietterich

SS 2011 | Computer Science Department | UKP Lab - György Szarvas

A Generic System

[Figure: a system box with inputs x1, x2, ..., xN, hidden variables h1, h2, ..., hK, and outputs y1, y2, ..., yM]

Input Variables: x = (x1, x2, ..., xN)
Hidden Variables: h = (h1, h2, ..., hK)
Output Variables: y = (y1, y2, ..., yM)

## When are ML algorithms NOT needed?

When the relationships between all system variables
(input, output, and hidden) are completely
understood!
This is NOT the case for almost any real system!


The Sub-Fields of ML

Supervised Learning
Reinforcement Learning
Unsupervised Learning


Supervised Learning
Given: Training examples

$$\{(x_1, f(x_1)), (x_2, f(x_2)), \ldots, (x_P, f(x_P))\}$$

for some unknown function (system) $y = f(x)$

Find $f(x)$
Predict $y' = f(x')$, where $x'$ is not in the training set
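A minimal sketch of this setting, assuming a one-dimensional input and a 1-nearest-neighbour rule as a stand-in for the learned model. The function name `predict` and the toy data are illustrative and not taken from the slides.

```python
def predict(train, x_new):
    """Return the label of the training example whose input is closest to x_new."""
    nearest_x, nearest_y = min(train, key=lambda pair: abs(pair[0] - x_new))
    return nearest_y

# Toy training set {(x_i, f(x_i))}; the true f is unknown to the learner.
train = [(1.0, "low"), (2.0, "low"), (8.0, "high"), (9.0, "high")]

# 7.2 does not occur in the training set, so this is a genuine prediction.
prediction = predict(train, 7.2)
```

Any other learner could take the place of the nearest-neighbour rule; the point is only that the model is built from the pairs (x, f(x)) and then queried on unseen x.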

## Model, model quality

Definition: A computer program is said to learn
from experience E
with respect to some class of tasks T
and performance measure P,
if its performance at tasks in T, as measured by P, improves
with experience E.
Learned hypothesis: model of problem/task T
Model quality: accuracy/performance measured by P

## Data / Examples / Sample / Instances

Data: experience E in the form of examples / instances
characteristic of the whole input space
representative sample
independent and identically distributed (no bias in selection / observations)

Good example
1000 abstracts chosen randomly out of 20M PubMed entries (abstracts)
probably i.i.d.
representative?
if annotation is involved it is always a question of compromises

## Definitely bad example

all abstracts that have John Smith as an author

## Instances have to be comparable to each other


## Data / Examples / Sample / Instances

Example: a set of queries and, for each query, a set of top retrieved documents
(characterized via tf, idf, tf*idf, PRank, BM25 scores)
try predicting relevance for reranking!
the top retrieved set depends on the underlying IR system!
issues with representativeness, but for reranking this is fine
the characterization depends on the query (except PRank), i.e. only certain pairs (for
the same Q) are meaningfully comparable (cf. independent examples for the
same Q)
we have to normalize the features per query to have the same mean/variance
we have to form pairs and compare e.g. the difference of feature values

Toy example:
Q = learning, rank 1: tf = 15, rank 100: tf = 2
Q = overfitting, rank 1: tf = 2, rank 10: tf = 0
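The two remedies named above (per-query normalization and pairwise feature differences) can be sketched as follows. This is an illustration under assumptions: z-scoring is one possible normalization, and the helper names `normalize_per_query` and `pairwise_differences` are hypothetical.

```python
from itertools import combinations
from statistics import mean, pstdev

def normalize_per_query(values):
    """Z-score the feature values of one query's documents."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def pairwise_differences(values):
    """Feature differences for all document pairs of the same query."""
    return [a - b for a, b in combinations(values, 2)]

# tf values from the toy example (only two documents per query are given).
tf_by_query = {"learning": [15, 2], "overfitting": [2, 0]}

normalized = {q: normalize_per_query(tf) for q, tf in tf_by_query.items()}
pairs = {q: pairwise_differences(z) for q, z in normalized.items()}
```

Before normalization, the tf of the top document for "overfitting" equals the tf of the rank-100 document for "learning". After normalization, both queries put their better document at the same value, so pairs become comparable across queries.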

Features
The available examples (experience) have to be
described to the algorithm in a consumable format
Here: examples are represented as vectors of pre-defined features
E.g. for credit risk assessment, typical features can be: income range,
history, real estate properties, criminal record,
city of residence, etc.

Feature types:
binary (criminal record, Y/N)
nominal (city of residence, X)
ordinal (income range, 0-10K, 10-20K, ...)
numeric


## Machine Learning Tutorial

CB, GS, REC

Section 2

Experimental practice
by now you've learned what machine learning is; in the supervised approach you
need (carefully selected / prepared) examples that you describe through features;
the algorithm then learns a model of the problem based on the examples (usually
some kind of optimization is performed in the background); and as a result,
improvement is observed in terms of some performance measure

## Machine Learning Tutorial for the UKP lab

June 10, 2011

Model parameters
2 kinds of parameters
one the user sets for the training procedure in advance: hyperparameter
the degree of the polynomial to fit in regression
number/size of hidden layers in a neural network
number of instances per leaf in a decision tree

one that actually gets optimized through the training: parameter
regression coefficients
network weights
size/depth of decision tree (in Weka; other implementations might allow you to control that)

we usually do not talk about the latter, but refer to hyperparameters as parameters

Hyperparameters
the fewer the algorithm has, the better
Naive Bayes the best? No parameters!
usually algorithms with better discriminative power are not parameter-free

typically are set to optimize performance (on validation set, or through cross-validation)
manual, grid search, simulated annealing, gradient descent, etc.

common pitfall:
selecting the hyperparameters via CV (e.g. 10-fold) and then reporting those same cross-validation results

## Cross-validation, Illustration

[Figure: the data set is split into five folds X1, ..., X5; in each iteration one fold (here X4) is used for testing and the remaining folds for training]

The result is an average over all iterations

Cross-Validation
n-fold CV: common practice for making (hyper)parameter estimation more robust
round robin training/testing n times, with (n-1)/n data to train and 1/n data to evaluate the model
typical: random splits, without replacement (each instance tests exactly once)
the other way: random subsampling cross-validation
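A minimal sketch of the round-robin scheme described above, using only the standard library. The callback `evaluate(train, test)` is a hypothetical placeholder for "train a model on these indices and score it on those", and the function names are illustrative.

```python
import random

def k_fold_indices(n_instances, n_folds, seed=0):
    """Yield (train_indices, test_indices) for each of the n_folds rounds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)  # random split, without replacement
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def cross_validate(n_instances, n_folds, evaluate):
    """Average the score returned by evaluate(train, test) over all folds."""
    scores = [evaluate(train, test)
              for train, test in k_fold_indices(n_instances, n_folds)]
    return sum(scores) / len(scores)
```

Every instance lands in exactly one test fold, so each one is tested exactly once; the seed is fixed only to make the split reproducible.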

n-fold CV: common practice to report average performance, deviation, etc.
No Unbiased Estimator of the Variance of K-Fold Cross-Validation (Bengio and Grandvalet 2004)
bad practice? problem: training sets largely overlap, test errors are also dependent
tends to underestimate real variance of CV (thus e.g. confidence intervals are to be treated with extreme caution)
5x2 CV is a better option: do 2-fold CV and repeat 5 times, calculate average: less overlap in training sets

Folding via natural units of processing for the given task
typically, document boundaries; best practice is doing it yourself!
ML package / CSV representation is not aware of e.g. document boundaries!
The PPI case
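Folding by document can be sketched as below: whole documents are assigned to folds, so instances from one document never end up on both the training and the test side. The greedy size balancing and the name `grouped_folds` are assumptions made for illustration.

```python
from collections import defaultdict

def grouped_folds(doc_ids, n_folds):
    """Assign whole documents to folds; returns one list of instance indices per fold."""
    by_doc = defaultdict(list)
    for index, doc in enumerate(doc_ids):
        by_doc[doc].append(index)
    folds = [[] for _ in range(n_folds)]
    # Largest documents first, each one into the currently smallest fold.
    for doc in sorted(by_doc, key=lambda d: len(by_doc[d]), reverse=True):
        min(folds, key=len).extend(by_doc[doc])
    return folds

# doc_ids[i] is the (hypothetical) document that instance i was extracted from.
doc_ids = ["d1", "d1", "d1", "d2", "d2", "d3", "d4", "d4"]
folds = grouped_folds(doc_ids, n_folds=2)
```

A generic ML package that only sees rows of a CSV file cannot do this on its own, because the document identifier is usually not part of the feature vector.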


Cross-Validation
Ideally the valid settings are:
take off-the-shelf algorithms, avoid parameter tuning and compare results, e.g. via cross-validation
n.b. you probably do the folding yourself, trying to minimize biases!

do parameter tuning (n.b. selecting/tuning your features is also tuning!)
but then normally you have to have a blind set (from the beginning)
e.g. have a look at shared tasks, e.g. CoNLL: a practical way to learn
experimental best practice and to align to the predefined standards (you might even
benefit from comparative results, etc.)

You might want to do something different
be aware of these & the consequences

The ML workflow
Common ML experimenting pipeline
1. define the task
instance, target variable/labels, collect and label/annotate data
credit risk assessment: 1 credit, good/bad credit, credits that ran out in the previous year

2. define and collect/calculate features, define train / validation (development) ((test!)) / test (evaluation) data
3. pick a learning algorithm (e.g. decision tree), train model
train on training set
optimize/set model hyperparameters (e.g. number of instances / leaf, use pruning, ...) according to performance on validation data
cross validation: use all training data as validation data

4. test model accuracy on (blind) test set
5. use the model to predict unseen instances with an expected accuracy similar to that seen on test

## Try this in Weka

=== Run information ===
Relation: segment
Instances: 1500
Attributes: 20
Test mode: split 80.0% train, remainder test

Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Correctly Classified Instances 290 96.6667 %
Incorrectly Classified Instances 10 3.3333 %

Scheme: weka.classifiers.trees.J48 -C 0.25 -M 12
Correctly Classified Instances 281 93.6667 %
Incorrectly Classified Instances 19 6.3333 %

Model complexity
Fitting a polynomial regression:

$$\hat{f}(x) = \sum_{n=0}^{M} a_n x^n$$

By, for instance, least squares:

$$\hat{a} = \arg\min_{a} \sum_{j=1}^{P} \left( y_j - \sum_{n=0}^{M} a_n x_j^n \right)^2$$

[Figure: four panels showing polynomial fits of degree M = 0, M = 1, M = 3 and M = 9 to the same data points, x from 0.0 to 1.0]
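The least-squares fit can be sketched in plain Python by solving the normal equations. This is one way to compute the minimizer, not necessarily what any particular package does; the function names and the data points are illustrative and not taken from the slides.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients a_0..a_degree via the normal equations."""
    size = degree + 1
    # Augmented matrix [A^T A | A^T y] for the design matrix A[j][n] = xs[j] ** n.
    rows = [[sum(x ** (r + c) for x in xs) for c in range(size)]
            + [sum(y * x ** r for x, y in zip(xs, ys))]
            for r in range(size)]
    # Gaussian elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, size):
            factor = rows[r][col] / rows[col][col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    # Back substitution.
    coeffs = [0.0] * size
    for r in reversed(range(size)):
        known = sum(rows[r][c] * coeffs[c] for c in range(r + 1, size))
        coeffs[r] = (rows[r][size] - known) / rows[r][r]
    return coeffs

def squared_error(xs, ys, coeffs):
    """Sum of squared residuals of the fitted polynomial on (xs, ys)."""
    return sum((y - sum(a * x ** n for n, a in enumerate(coeffs))) ** 2
               for x, y in zip(xs, ys))

# Illustrative data points on [0, 1] (not from the slides).
xs = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
ys = [0.1, 0.9, 0.7, -0.6, -0.9, 0.0]
errors = {m: squared_error(xs, ys, fit_polynomial(xs, ys, m)) for m in (0, 1, 3)}
```

The training error in `errors` can only go down as M grows, because each higher-degree model contains the lower-degree ones. So the training error alone cannot tell you which M generalizes best.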

## Data size and model complexity

Important concept: discriminative power of the algorithm
linear vs nonlinear model
some theoretical aspects:
a 1-hidden-layer NN with unlimited hidden nodes can
approximate any smooth function/surface arbitrarily well

## Data size and model complexity

Overfitting: the model perfectly learns to classify training data, but
has no (bad) generalization ability
results in high test error (useless model)
typical for small sample sizes and powerful models

Underfitting: the model is not capable of learning the (complex)
patterns in the training set
Reasons of Underfitting and Overfitting:
lack of discriminative power
small sample size
noise in the data /labels or features/
generalization ability of algorithm has to be chosen wrt. sample size

Size (complexity) of learnt model
grows with data size
if the data is consistent, this is OK

## Predictions Confusion matrix

TP: p classified as p
FP: n classified as p
TN: n classified as n
FN: p classified as n
Good prediction: TP+TN
Error: FP (false alarm) + FN (miss)

Evaluation measures
Accuracy
The rate of correct predictions made by the model over a data set (cf. coverage).
(TP+TN) / (TP+FN+FP+TN)

Error rate
The rate of incorrect predictions made by the model over a data set (cf. coverage).
(FP+FN) / (TP+FN+FP+TN)
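As a small worked example, the four confusion-matrix counts and both rates can be computed as follows. The labels and the helper name `confusion_counts` are made up for illustration.

```python
def confusion_counts(gold, predicted, positive="p"):
    """Count TP, FP, TN, FN with respect to one positive class."""
    tp = fp = tn = fn = 0
    for g, p in zip(gold, predicted):
        if p == positive and g == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif g == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

gold      = ["p", "p", "n", "n", "p", "n", "n", "n"]
predicted = ["p", "n", "n", "p", "p", "n", "n", "n"]

tp, fp, tn, fn = confusion_counts(gold, predicted)
total = tp + fp + tn + fn
accuracy = (tp + tn) / total    # 6 of 8 predictions are correct
error_rate = (fp + fn) / total  # 1 false alarm + 1 miss
```

Accuracy and error rate always sum to 1, so reporting either one is enough.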

[Root]?[Mean|Absolute][Squared]?Error
The difference between the predicted and actual values
e.g., over P evaluated examples:

$$\mathrm{RMSE} = \sqrt{\frac{1}{P} \sum_{j=1}^{P} \left( f(x_j) - y_j \right)^2}$$
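Two members of this family of error measures, sketched in Python. The function names are illustrative; `predicted` holds the model outputs f(x) and `actual` the true values y.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between predicted and actual values."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

def mae(predicted, actual):
    """Mean absolute error between predicted and actual values."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)
```

Squaring makes RMSE weight large individual errors more heavily than the mean absolute error does.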

Algorithms (e.g. those in Weka) typically optimize these
there might be a mismatch between the optimization objective and the actual evaluation measure
optimizing different measures is a research area on its own (e.g. in ML for IR, a.k.a. learning to rank)

Evaluation measures
TP: p classified as p
FP: n classified as p
TN: n classified as n
FN: p classified as n

Precision
Fraction of correctly predicted positives among all predicted positives
TP/(TP+FP)

Recall
Fraction of correctly predicted positives among all actual positives
TP/(TP+FN)

F measure
weighted harmonic mean of Precision and Recall (usually equal weighted, β=1)

$$F_\beta = (1 + \beta^2) \, \frac{\mathrm{precision} \cdot \mathrm{recall}}{\beta^2 \cdot \mathrm{precision} + \mathrm{recall}}$$

Only makes sense for a subset of classes (usually measured for a single
class)
Micro-averaged over all classes (single-label classification), it equals the accuracy
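The three measures follow directly from the confusion-matrix counts. A minimal sketch; returning 0.0 when a denominator is zero is a convention chosen here, not something the slides prescribe.

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """Precision, recall and F-beta from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta ** 2
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f

# One correct positive, one false alarm, no miss.
p, r, f1 = precision_recall_f(tp=1, fp=1, fn=0)
```

With beta > 1 the measure leans towards recall, with beta < 1 towards precision.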

Evaluation measures
Sequence P/R/F, e.g. in Named Entity Recognition, Chunking, etc.
A sequence of tokens with the same label is treated as a single instance
John_PER studied_O at_O the_O Johns_ORG Hopkins_ORG University_ORG before_O joining_O IBM_ORG.
Why? We need complete phrases to be identified correctly
How? With external evaluation script, e.g. conlleval for NER

Example tagging:
John_PER studied_O at_O the_O Johns_PER Hopkins_PER University_ORG before_O joining_O IBM_ORG.

Multiple penalty:
3 Positives: John (PER), Johns Hopkins University (ORG), IBM (ORG)
2 FPs: Johns Hopkins (PER) and University (ORG)
1 FN: Johns Hopkins University (ORG)
F(PER) = 0.67, F(ORG) = 0.5

Loss types
1. The real loss function given to us by the world. Typically involves notions of money saved, time saved, lives saved, hopes of tenure saved, etc. We rarely have any access to this function.
2. The human-evaluation function. Typical examples are fluency/adequacy judgments, relevance assessments, etc. We can perform these evaluations, but they are slow and costly. They require humans in the loop.
3. Automatic correlation-driving functions. Typical examples are Bleu, Rouge, word error rate, mean-average-precision. These require humans at the front of the loop, but after that are cheap and quick. Typically some effort has been put into showing correlation between these and something higher up.
4. Automatic intuition-driven functions. Typical examples are accuracy (for anything), f-score (for parsing, chunking and named-entity recognition), alignment error rate (for word alignment) and perplexity (for language modeling). These also require humans at the front of the loop, but differ from (3) in that they are not actually compared with higher-up tasks.

Be careful what you are optimizing! Some measures (typically of Type 4)
become dysfunctional when you are optimizing them!

Evaluation measures
Sequence P/R/F, e.g. in Named Entity Recognition, Chunking, etc.
John_PER studied_O at_O the_O Johns_ORG Hopkins_ORG University_ORG before_O joining_O IBM_ORG.

Example tagging 1:

John_PER studied_O at_O the_O Johns_PER Hopkins_PER University_ORG before_O joining_O IBM_ORG.
3 Positives: John (PER), Johns Hopkins University (ORG), IBM (ORG)
2 FPs: Johns Hopkins (PER) and University (ORG)
1 FN: Johns Hopkins University (ORG)
F(PER) = 0.67, F(ORG) = 0.5

Example tagging 2:

John_PER studied_O at_O the_O Johns_O Hopkins_O University_O before_O joining_O IBM_ORG.
3 Positives: John (PER), Johns Hopkins University (ORG), IBM (ORG)
0 FP
1 FN: Johns Hopkins University (ORG)
F(PER) = 1.0, F(ORG) = 0.67

Optimizing phrase-F can encourage / prefer systems that do not mark entities!
most likely, this is bad!!
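The phrase-level scores for both example taggings can be reproduced with a short sketch. It assumes the simplification used on the slides (a maximal run of tokens with the same label is one phrase) rather than the IOB scheme that conlleval reads; the function names are illustrative.

```python
def spans(tags):
    """Extract (start, end, label) phrases: maximal runs of one non-O label."""
    found, start = set(), None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel closes a final run
        if start is not None and tag != tags[start]:
            found.add((start, i - 1, tags[start]))
            start = None
        if start is None and tag != "O":
            start = i
    return found

def phrase_f1(gold_tags, predicted_tags, label):
    """F1 over complete phrases of one label; a phrase counts only if it matches exactly."""
    gold = {s for s in spans(gold_tags) if s[2] == label}
    pred = {s for s in spans(predicted_tags) if s[2] == label}
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold     = "PER O O O ORG ORG ORG O O ORG".split()
tagging1 = "PER O O O PER PER ORG O O ORG".split()
tagging2 = "PER O O O O O O O O ORG".split()

scores1 = {label: phrase_f1(gold, tagging1, label) for label in ("PER", "ORG")}
scores2 = {label: phrase_f1(gold, tagging2, label) for label in ("PER", "ORG")}
```

Tagging 2 scores higher on both labels although it finds strictly less of the entity: the partially correct guess in tagging 1 is punished as two false positives plus a false negative.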

ROC curve
Receiver Operating Characteristic curve
Curve that depicts the relation between recall (sensitivity) and false
positives (1-specificity)

[Figure: ROC curve; y-axis: Sensitivity (Recall); x-axis: False Positives, FP / (FP+TN); the best-case and worst-case curves are marked]

Evaluation measures
Area under ROC curve, AUC
As you vary the decision threshold, you can plot the recall vs. false
positive rate
The area under the curve measures how accurately your model
separates positives from negatives
perfect ranking: AUC = 1.0
random decision: AUC = 0.5
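AUC can be computed without drawing the curve, because it equals the fraction of (positive, negative) pairs that the model scores in the right order. A minimal sketch of that pairwise formulation; the function name and the quadratic-time loop are choices made for brevity.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels are 1 for positives and 0 for negatives; tied scores count as half.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

perfect = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])    # every positive above every negative
constant = auc([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0])   # the scores carry no information
```

The function needs at least one positive and one negative example, otherwise the denominator is zero.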

Similarly (e.g. in IR): area under P/R curve
when there are too many (true) negatives
correctly identifying negatives is not interesting anyway

## Evaluation measures (Ranking)

Precision@K
number of true positives in the top K predictions (ranks), divided by K

MAP
The average of precisions computed at the point of each of the positives in the
ranked list (P=0 for positives not ranked at all), averaged over all queries

NDCG
For graded relevance / ranking
Highly relevant documents appearing lower in a search result list should be
penalized as the graded relevance value is reduced logarithmically proportional
to the position of the result.
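The three ranking measures can be sketched as follows. Assumptions: `relevant` is a 0/1 list in rank order, `gains` holds graded relevance values in rank order, the NDCG variant uses linear gain with a log2(rank + 1) discount, and the ideal ordering is built from the listed documents only. Other NDCG variants exist, and the function names are illustrative.

```python
import math

def precision_at_k(relevant, k):
    """relevant[i] is 1 if the document at rank i + 1 is a true positive, else 0."""
    return sum(relevant[:k]) / k

def average_precision(relevant, n_positives=None):
    """Mean of precision@rank over the ranks of the positives.

    Positives missing from the list contribute precision 0 when the true
    number of positives is passed as n_positives.
    """
    hits = [precision_at_k(relevant, i + 1) for i, r in enumerate(relevant) if r]
    total = n_positives if n_positives is not None else sum(relevant)
    return sum(hits) / total if total else 0.0

def ndcg(gains, k=None):
    """DCG with a log2(rank + 1) discount, normalized by the ideal ordering."""
    def dcg(values):
        return sum(g / math.log2(i + 2) for i, g in enumerate(values))
    k = len(gains) if k is None else k
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal else 0.0

ranking = [1, 0, 1, 0]                 # positives at ranks 1 and 3
ap = average_precision(ranking)        # (1/1 + 2/3) / 2
```

MAP is then the mean of `average_precision` over all queries.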


Learning curve
Measures how the accuracy / error of the model changes with
sample size
iteration number

Smaller sample
worse accuracy
more likely bias in the estimate (representative sample)
variance in the estimate

Typical learning curve
[Figure: typical learning curve]

If it looks differently:
you are plotting error vs. size/iteration
you are doing something wrong!
overfitting (iteration, not sample size)!

Data or Algorithm?
Compare the accuracy of various machine learning algorithms with a
varying amount of training data (Banko & Brill, 2001):

Winnow
perceptron
naive Bayes
memory-based learner

Features:
bag of words: words within a window of the target word
collocations containing specific words and/or part of speech

Training corpus: 1-billion words from a variety of English texts
(news articles, literature, scientific abstracts, etc.)

## Take home messages (up until now)

Supervised learning: based on a set of labeled examples (x, f(x)) learn the
input-output mapping, i.e. f(x)
3 factors of successful machine learning models
much data
good features
well-suited learning algorithm

ML workflow
1. problem definition
2. feature engineering; experimental setup /train, validation, test/
3. selection of learning algorithm, (hyper)parameter tuning, training a final model
4. predict unseen examples & fill tables / draw figures for the paper - test

Careful with
data representation (i.i.d., comparability, ...)
experimental setup (cross-validation, blind testing, ...)
data size and algorithm selection (+ overfitting, underfitting, ...)
evaluation measures