TBMI 26
Ola Friman
ola.friman@foi.se
Lecture 2
Supervised learning – Linear models
Recap – Supervised learning
Task: Learn to predict/classify new data from labeled examples.
Input: Training data examples {x_i, y_i}, i = 1,...,N, where x_i is a feature vector and y_i is a class label in the set Ω. Today we'll mostly assume two classes: Ω = {−1, 1}.
Output: A function f(x; w_1,...,w_k) → Ω
Find a function f and adjust the parameters w_1,...,w_k so that new feature vectors are classified correctly. Generalization!
How does the brain make decisions? (on the low level!)
Basic unit: the neuron
The human brain has approximately 100 billion (10^11) neurons.
Each neuron is connected to about 7000 other neurons.
Approx. 10^14–10^15 synapses (connections).
[Figure: a neuron with its cell body, axon, and dendrites labeled]
Model of a neuron
[Figure: input signals x_1, x_2,...,x_n are multiplied by weights w_1, w_2,...,w_n, summed (Σ), and passed through an activation function]
The Perceptron
(McCulloch & Pitts 1943, Rosenblatt 1962)
f(x_1,...,x_n; w_0, w_1,...,w_n) = σ(w_0 + Σ_{i=1}^n w_i x_i) = σ(w_0 + w^T x)
Extra reading on the history of the perceptron:
http://www.csulb.edu/~cwallis/artificialn/History.htm
Common activation functions σ:
Step function: σ(x) = sign(x) – not differentiable!
Linear function: σ(x) = x – not biologically plausible!
Sigmoid function: σ(x) = tanh(x)
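As a small illustration (a NumPy sketch with made-up numbers, not from the slides), the perceptron output under each activation choice:

```python
import numpy as np

def perceptron(x, w, w0, activation=np.sign):
    """Perceptron output sigma(w0 + w^T x)."""
    return activation(w0 + w @ x)

x = np.array([0.5, -1.0])   # illustrative feature vector
w = np.array([2.0, 1.0])    # illustrative weights
print(perceptron(x, w, w0=0.5, activation=lambda a: a))  # linear: w0 + w^T x = 0.5
print(perceptron(x, w, w0=0.5, activation=np.sign))      # step function
print(perceptron(x, w, w0=0.5, activation=np.tanh))      # sigmoid (tanh)
```

Only the activation function changes; the weighted sum w_0 + w^T x is the same in all three cases.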
Notational simplification: Bias weight
f(x_1,...,x_n; w_0,...,w_n) = σ(Σ_{i=0}^n w_i x_i) = σ(w^T x)
Add a constant 1 to the feature vector so that we don't have to treat w_0 separately.
Instead of x = [x_1,...,x_n]^T, we have x = [1, x_1,...,x_n]^T with x_0 = 1.
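A one-line version of this trick in NumPy (an illustrative sketch):

```python
import numpy as np

def augment(X):
    """Prepend a constant 1 to each feature vector so that the bias
    weight w0 becomes an ordinary weight: x -> [1, x1, ..., xn]."""
    X = np.atleast_2d(X)
    return np.hstack([np.ones((X.shape[0], 1)), X])

X = np.array([[0.5, -1.0],
              [2.0,  3.0]])
print(augment(X))  # first column is all ones
```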
Activation functions in 2D
σ(w^T x) = sign(w^T x)
σ(w^T x) = w^T x
σ(w^T x) = tanh(w^T x)
[Figure: each activation plotted over a 2D feature space]
Geometry of linear classifiers
[Figure: in the (x_1, x_2) plane, the decision boundary is the line y = w_0 + w_1 x_1 + w_2 x_2 = 0; the normal vector (w_1, w_2)^T points into the region where y > 0, with y < 0 on the other side]
Another view – the super feature
w^T x = Σ_{i=0}^n w_i x_i
Reduces the original n features into one single feature.
If seen as vectors, w^T x is the projection of x onto w.
[Figure: vector x projected onto w = (w_1, w_2)^T in the feature plane]
Advantages of a parametric model
Only stores a few parameters (w_0, w_1,...,w_n) instead of all the training samples, as in kNN.
Very fast to evaluate which side of the line a new sample is on: w^T x < 0 or w^T x > 0.
f(x; w) = σ(w^T x) is also known as a discriminant function.
Which linear classifier to choose?
[Figure: several candidate lines y = w_0 + w_1 x_1 + w_2 x_2 = 0 separate the two classes; each divides the plane into regions y < 0 and y > 0]
Find the best separator – Optimization!
Min/max of a cost function c(w_0, w_1,...,w_n) with the weights w_0, w_1,...,w_n as parameters.
Two ways to optimize:
Algebraic: Set the derivative to zero and solve! ∂c/∂w_i = 0
Iterative numeric: Follow the gradient direction until a minimum/maximum of c is reached. This is called gradient descent/ascent!
Gradient descent/ascent
How to get to the lowest point? Follow the gradient:
∇f = [∂f/∂x_1, ∂f/∂x_2]^T
[Figure: a cost surface; the gradient points in the direction of steepest ascent]
Gradient descent – Choosing the step length η
[Figure: a small η converges slowly; a large η oscillates; a too large η diverges – "gradient escent"!]
Choosing the step length
The size of the gradient matters too! Since each step is η∇c(w), a small derivative gives small steps and a large derivative gives large steps.
[Figure: the cost function c(w) with steps proportional to the local slope]
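A minimal gradient-descent sketch on a one-dimensional quadratic cost (the cost c(w) = (w − 3)^2 and the step lengths η are illustrative assumptions):

```python
def gradient_descent(dcdw, w, eta, steps):
    """Repeat w <- w - eta * dc/dw (use + instead of - for gradient ascent)."""
    for _ in range(steps):
        w = w - eta * dcdw(w)
    return w

# c(w) = (w - 3)^2, so dc/dw = 2(w - 3); minimum at w = 3
dcdw = lambda w: 2.0 * (w - 3.0)
print(gradient_descent(dcdw, w=0.0, eta=0.1, steps=100))  # converges near 3
print(gradient_descent(dcdw, w=0.0, eta=1.1, steps=20))   # too large eta: diverges
```

The second call illustrates the "too large η" case from the figure: each step overshoots the minimum by more than it gained.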
Local optima
Gradient search is not guaranteed to find the global minimum/maximum. With a sufficiently small step length, the closest local optimum will be found.
[Figure: a cost function with one global minimum and several local minima]
Online learning vs. batch learning
Batch learning: Use all training examples to update the classifier. Most often used.
Online learning: Update the classifier using one training example at a time. Can be used when training samples arrive sequentially, e.g., adaptive learning. Also known as stochastic descent/ascent.
Three different cost functions
Perceptron algorithm
Linear Discriminant Analysis
(a.k.a. Fisher Linear Discriminant)
Support Vector Machines
Perceptron algorithm
Historically (1960s) one of the first machine learning algorithms.
Maximize the following cost function (linear activation function):
c(w_0, w_1,...,w_n) = Σ_{i∈I} y_i w^T x_i
n = # features
I = set of misclassified training samples
y_i ∈ {−1, 1} depending on the class of training sample i
Considers only misclassified samples. Each term y_i w^T x_i is negative for a misclassified sample, so c(w_0, w_1,...,w_n) is always negative, or 0 for a perfect classification!
Perceptron algorithm, cont.
c(w_0, w_1,...,w_n) = Σ_{i∈I} y_i w^T x_i
∂c/∂w_k = Σ_{i∈I} y_i x_ik, or in vector form: ∂c/∂w = Σ_{i∈I} y_i x_i
Gradient ascent:
w_{t+1} = w_t + η ∂c/∂w = w_t + η Σ_{i∈I} y_i x_i    (Eq. 1)
Algorithm:
1. Start with a random w
2. Iterate Eq. 1 until convergence
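The update rule above (Eq. 1) can be sketched in NumPy as follows; the toy data, step length, and iteration cap are illustrative assumptions, and X is assumed to already contain the constant-1 bias feature:

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, max_iter=1000, seed=0):
    """Batch perceptron: w <- w + eta * sum_{i in I} y_i x_i,
    where I is the set of currently misclassified samples.
    X already includes the constant-1 bias feature."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])          # 1. start with a random w
    for _ in range(max_iter):
        miscl = np.sign(X @ w) != y          # membership in the set I
        if not miscl.any():                  # converged: all samples correct
            break
        w = w + eta * (y[miscl] @ X[miscl])  # 2. iterate Eq. 1
    return w

# Two linearly separable point clouds (bias feature prepended)
X = np.array([[1, 2.0, 2.0], [1, 3.0, 2.5], [1, -2.0, -1.0], [1, -3.0, -2.0]])
y = np.array([1, 1, -1, -1])
w = train_perceptron(X, y)
print(np.sign(X @ w))  # all training samples correctly classified
```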
Perceptron example
[Figure: perceptron training iterations on a two-class example]
Perceptron algorithm, cont.
If the classes are linearly separable, the perceptron algorithm will converge to a solution that separates the classes. If the classes are not linearly separable, the algorithm will fail. Which solution it arrives at depends on the start value w.
Linear Discriminant Analysis (LDA)
Consider the projected feature vectors.
[Figure: two classes in the (x_1, x_2) plane projected onto the direction (w_1, w_2)^T]
LDA – Different projections
[Figure: two different projections of the same data; the optimal projection separates the classes, and the separating plane is orthogonal to it]
LDA – Separation
[Figure: a good projection gives small within-class variance and a large distance between class means; a poor projection gives large variance and a small distance]
Goal: minimize variance and maximize distance.
Maximize:
c = (μ_1 − μ_2)^2 / (σ_1^2 + σ_2^2)
LDA – Cost function
c = (μ_1 − μ_2)^2 / (σ_1^2 + σ_2^2)
Variance of the projected samples (with class mean vector m = (1/n) Σ_{i=1}^n x_i):
σ^2 = (1/n) Σ_{i=1}^n (w^T x_i − w^T m)^2 = w^T [ (1/n) Σ_{i=1}^n (x_i − m)(x_i − m)^T ] w = w^T C w
where C is the covariance matrix, so σ_1^2 + σ_2^2 = w^T C_1 w + w^T C_2 w = w^T (C_1 + C_2) w = w^T C_tot w
Distance between the projected means:
μ = (1/n) Σ_{i=1}^n w^T x_i = w^T m
(μ_1 − μ_2)^2 = (w^T m_1 − w^T m_2)^2 = w^T (m_1 − m_2)(m_1 − m_2)^T w = w^T M w
LDA – Cost function, cont.
c(w) = (μ_1 − μ_2)^2 / (σ_1^2 + σ_2^2) = (w^T M w) / (w^T C_tot w)
This form is called a (generalized) Rayleigh quotient, which is maximized by the eigenvector with the largest eigenvalue of the generalized eigenvalue problem M w = λ C_tot w!
Simplification: M w = (m_1 − m_2)(m_1 − m_2)^T w = K (m_1 − m_2), where K = (m_1 − m_2)^T w is just some scalar. Hence
w ~ C_tot^{-1} (m_1 − m_2)
Scaling of w is not important!
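A NumPy sketch of this closed-form solution, w ∝ C_tot^{-1}(m_1 − m_2) with m_1 and m_2 the class mean vectors (the Gaussian toy data is an illustrative assumption):

```python
import numpy as np

def lda_direction(X1, X2):
    """LDA projection direction w ~ C_tot^{-1} (m1 - m2),
    where X1, X2 hold one sample per row for each class."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    C_tot = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)
    return np.linalg.solve(C_tot, m1 - m2)  # scaling of w is unimportant

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[2.0, 0.0], scale=0.5, size=(50, 2))
X2 = rng.normal(loc=[-2.0, 0.0], scale=0.5, size=(50, 2))
w = lda_direction(X1, X2)
print((X1 @ w).mean() > (X2 @ w).mean())  # class 1 projects to larger values
```

Using `np.linalg.solve` avoids forming the explicit inverse of C_tot.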
LDA – Examples
[Figure: LDA applied to two example 2D datasets]
Support Vector Machines (SVM)
Idea: The optimal separation line remains the same; the feature points close to the class limits are more important! These are called support vectors.
[Figure: two classes with the support vectors highlighted]
SVM – Maximum margin
[Figure: separating line with margin 2/||w|| between the two classes]
Choose the w that gives the maximum margin!
SVM – Cost function
[Figure: the decision line w^T x + w_0 = 0, with margin lines w^T x + w_0 = 1 and w^T x + w_0 = −1; correctly classified samples satisfy w^T x + w_0 > 1 on one side and w^T x + w_0 < −1 on the other]
Scaling of w is free – choose a specific scaling!
Let x_p be a point on the decision line, and let x_s = x_p + Δ be a support vector on the margin line:
w^T x_s + w_0 = w^T (x_p + Δ) + w_0 = 1
w^T Δ + w^T x_p + w_0 = 1, and since w^T x_p + w_0 = 0, we get w^T Δ = 1.
Δ is perpendicular to the decision line, i.e., parallel to w, so w^T Δ = ||w|| ||Δ||.
For the support vector, the margin ||Δ|| = 1 / ||w||.
SVM – Cost function, cont.
Maximizing the margin ||Δ|| = 1 / ||w|| is the same as minimizing ||w||!
min_w ||w||   subject to   y_i (w^T x_i + w_0) ≥ 1 for all i
No training samples may reside inside the margin region!
The optimization procedure is outside the scope of this course.
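Since the constrained optimization itself is out of scope, here is only a hedged sketch: a soft-margin variant trained by subgradient descent on the hinge loss, a common practical substitute for the hard-margin problem above (the data, λ, η, and epoch count are all illustrative assumptions, not the slide's method):

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, eta=0.01, epochs=500):
    """Soft-margin linear SVM via subgradient descent on
    lam*||w||^2 + mean(max(0, 1 - y_i (w^T x_i + w0)))."""
    n, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + w0)
        viol = margins < 1                        # samples inside the margin
        gw = 2 * lam * w - (y[viol] @ X[viol]) / n  # subgradient w.r.t. w
        g0 = -y[viol].sum() / n                     # subgradient w.r.t. w0
        w, w0 = w - eta * gw, w0 - eta * g0
    return w, w0

X = np.array([[2.0, 2.0], [3.0, 2.5], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, w0 = train_linear_svm(X, y)
print(np.sign(X @ w + w0))  # separates the two classes
```

The regularization weight λ here plays the role of the margin/violation trade-off; with hard-margin separable data it drives w toward the maximum-margin solution.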
Summary – Linear classifiers
Perceptron algorithm:
Of historical interest
Linear Discriminant Analysis:
Simple to implement
Very useful as a first classifier to try
Support Vector Machines:
Considered by many to be the state-of-the-art classification principle, especially the nonlinear forms (see next lecture)
Many software packages exist on the internet
What about more than 2 classes?
Common solution: Combine several binary classifiers and evaluate the most likely class.
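A one-vs-rest sketch of this idea (the per-class weight vectors are hypothetical; in practice each row of W would be trained with one of the binary methods above):

```python
import numpy as np

def one_vs_rest_predict(X, W):
    """One-vs-rest: one linear discriminant w_c per class (rows of W);
    pick the class whose discriminant value w_c^T x is largest."""
    scores = X @ W.T            # (n_samples, n_classes)
    return scores.argmax(axis=1)

# Three hypothetical per-class weight vectors (bias feature included)
W = np.array([[0.0,  1.0,  0.0],   # class 0: large x1
              [0.0, -1.0,  0.0],   # class 1: small x1
              [0.0,  0.0,  1.0]])  # class 2: large x2
X = np.array([[1.0,  3.0, 0.0],
              [1.0, -3.0, 0.0],
              [1.0,  0.0, 5.0]])
print(one_vs_rest_predict(X, W))  # -> [0 1 2]
```

Taking the argmax over the discriminant values is one simple way to "evaluate the most likely class".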