The XOR problem: there is no single line (hyperplane) that separates class A from class B. In contrast, the AND and OR operations are linearly separable problems.
The Two-Layer Perceptron
Then class B is located outside the shaded area and class A inside. This is a two-phase design.
• Phase 1: Draw two lines (hyperplanes)
$g_1(x) = 0, \quad g_2(x) = 0$
Each of them is realized by a perceptron. The outputs of the perceptrons will be
$y_i = f(g_i(x)) \in \{0, 1\}, \quad i = 1, 2$
depending on the position of $x$.
• Phase 2: Find the position of $x$ w.r.t. both lines, based on the values of $y_1$, $y_2$.
[Table: for each input combination $(x_1, x_2)$, the 1st-phase outputs $(y_1, y_2)$ and the resulting 2nd-phase decision]
The decision is now performed on the transformed $y$ data:
$g(y) = 0$
This can be performed via a second line, which can also be realized by a perceptron.
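To make the two-phase design concrete, here is a minimal numerical sketch in Python. The particular lines chosen for $g_1$, $g_2$ and for the second-phase line are illustrative assumptions (any pair of lines that isolates the two XOR classes works); class A denotes the XOR = 1 inputs and class B the XOR = 0 inputs.

```python
import numpy as np

def step(a):
    """Step activation f(.) with values in {0, 1}."""
    return (a > 0).astype(int)

# Phase 1: two lines g1(x) = x1 + x2 - 0.5 and g2(x) = x1 + x2 - 1.5
# (one possible choice; each row holds the weights [w1, w2, w0])
W1 = np.array([[1.0, 1.0, -0.5],
               [1.0, 1.0, -1.5]])

# Phase 2: one line in the transformed y space, g(y) = y1 - y2 - 0.5
w2 = np.array([1.0, -1.0, -0.5])

def two_layer_perceptron(x):
    y = step(W1 @ np.append(x, 1.0))      # hidden layer outputs (y1, y2)
    return step(w2 @ np.append(y, 1.0))   # output: 1 -> class A, 0 -> class B

for x, label in [((0, 0), "B"), ((0, 1), "A"), ((1, 0), "A"), ((1, 1), "B")]:
    out = two_layer_perceptron(np.array(x))
    print(x, "->", "A" if out else "B", "(true:", label + ")")
```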
The computations of the first phase perform a mapping that transforms the nonlinearly separable problem into a linearly separable one.
The architecture
• This is known as the two-layer perceptron, with one hidden and one output layer. The activation functions are step functions with values $f(\cdot) \in \{0, 1\}$.
• The first (hidden) layer performs a mapping of a vector $x \in \mathbb{R}^l$ onto the vertices of the unit-side hypercube $H_p$:
$x \to y = [y_1, \ldots, y_p]^T, \quad y_i \in \{0, 1\}, \; i = 1, 2, \ldots, p$
Intersections of these hyperplanes form regions in the
l-dimensional space. Each region corresponds to a
vertex of the Hp unit hypercube.
For example, the 001 vertex corresponds to the region located on the negative side of the first two hyperplanes ($g_1(x) < 0$, $g_2(x) < 0$) and on the positive side of the third ($g_3(x) > 0$).
The output neuron realizes a hyperplane in the transformed $y$ space, one that separates some of the vertices from the others. Thus, the two-layer perceptron has the capability to classify vectors into classes that consist of unions of polyhedral regions. But NOT ANY union. It depends on the relative position of the corresponding vertices.
Three-Layer Perceptrons
The architecture
The idea is similar to the XOR problem. It realizes more than one hyperplane in the transformed $y \in \mathbb{R}^p$ space.
The reasoning
Overall: the neurons of the first (hidden) layer form hyperplanes, the neurons of the second layer form regions as intersections of the corresponding half-spaces, and the output neuron forms unions of such regions. Thus, the three-layer perceptron can separate classes consisting of any union of polyhedral regions.
• The task is a nonlinear optimization one. For the gradient descent method:
$w_1^r(\text{new}) = w_1^r(\text{old}) + \Delta w_1^r, \qquad \Delta w_1^r = -\mu\, \frac{\partial J}{\partial w_1^r}$
The Procedure:
• Initialize unknown weights randomly with small values.
• Compute the gradient terms backwards, starting with
the weights of the last (3rd) layer and then moving
towards the first
• Update the weights
• Repeat the procedure until a termination criterion is met.
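The procedure can be written compactly for a two-layer network with sigmoid activations. This is a minimal illustrative sketch; the network size, the learning rate mu, the squared-error cost and the XOR toy data are assumptions made for the example, not prescribed above.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_epoch(X, Y, W1, W2, mu=0.1):
    """One gradient-descent pass for a 2-layer net with squared-error cost.
    X: (N, l) inputs, Y: (N, k) desired 0/1 outputs.
    W1: (p, l+1) hidden-layer weights, W2: (k, p+1) output-layer weights."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append bias input
    Z = sigmoid(Xb @ W1.T)                          # hidden-layer outputs
    Zb = np.hstack([Z, np.ones((Z.shape[0], 1))])
    Y_hat = sigmoid(Zb @ W2.T)                      # network outputs

    # Gradient terms computed backwards: output layer first, then hidden layer
    delta2 = (Y_hat - Y) * Y_hat * (1 - Y_hat)
    grad_W2 = delta2.T @ Zb
    delta1 = (delta2 @ W2[:, :-1]) * Z * (1 - Z)
    grad_W1 = delta1.T @ Xb

    # Weight updates: w(new) = w(old) - mu * dJ/dw
    return W1 - mu * grad_W1, W2 - mu * grad_W2

# Usage: small random initial weights, repeat until a termination criterion is met
rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = np.array([[0.], [1.], [1.], [0.]])              # XOR targets
W1, W2 = 0.1 * rng.standard_normal((2, 3)), 0.1 * rng.standard_normal((1, 3))
for _ in range(5000):
    W1, W2 = backprop_epoch(X, Y, W1, W2, mu=0.5)
```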
The Cost function choice
Examples:
• The Least Squares
$J = \sum_{i=1}^{N} E(i)$
$E(i) = \sum_{m=1}^{k} e_m^2(i) = \sum_{m=1}^{k} \big(y_m(i) - \hat{y}_m(i)\big)^2, \quad i = 1, 2, \ldots, N$
where $y_m(i)$ is the desired response of the $m$th output neuron (1 or 0) for input $x(i)$, and $\hat{y}_m(i)$ is the actual response of the $m$th output neuron, in the interval [0, 1], for input $x(i)$.
• The cross-entropy:
$J = \sum_{i=1}^{N} E(i)$
$E(i) = -\sum_{m=1}^{k} \Big[ y_m(i)\,\ln \hat{y}_m(i) + \big(1 - y_m(i)\big)\ln\big(1 - \hat{y}_m(i)\big) \Big]$
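For a single input $x(i)$, the two per-sample terms $E(i)$ can be computed as follows (a small illustrative sketch; the numerical values are made up).

```python
import numpy as np

y     = np.array([1.0, 0.0, 0.0])        # desired responses y_m(i), one per output neuron
y_hat = np.array([0.8, 0.1, 0.3])        # actual responses y_hat_m(i), in [0, 1]

E_ls = np.sum((y - y_hat) ** 2)                                  # Least Squares term E(i)
E_ce = -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))  # cross-entropy term E(i)

print(E_ls, E_ce)   # the total cost J sums these terms over all N training points
```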
Remark 1: A common feature of all the above is the danger of convergence to a local minimum. "Well-formed" cost functions guarantee convergence to a "good" solution, that is, one that classifies correctly ALL training patterns, provided such a solution exists. The cross-entropy cost function is well formed; the Least Squares is not.
Remark 2: Both the Least Squares and the cross-entropy cost functions lead to output values $\hat{y}_m(i)$ that optimally approximate the class a-posteriori probabilities:
$\hat{y}_m(i) \approx P(\omega_m \mid x(i))$
Choice of the network size.
How big should a network be? How many layers and how many neurons per layer? There are two major directions:
• Pruning techniques: these start from a large network and then remove weights and/or neurons iteratively, according to a criterion.
— Methods based on parameter sensitivity: the change in the cost function due to small weight perturbations is approximated by
$\delta J \simeq \sum_i g_i\, \delta w_i + \frac{1}{2}\sum_i h_{ii}\, \delta w_i^2 + \frac{1}{2}\sum_i \sum_{j \neq i} h_{ij}\, \delta w_i\, \delta w_j$
where
$g_i = \frac{\partial J}{\partial w_i}, \qquad h_{ij} = \frac{\partial^2 J}{\partial w_i\, \partial w_j}$
Pruning is now achieved by the following procedure:
• Train the network using Backpropagation for a number of steps.
• Compute the saliencies $s_i = \frac{h_{ii} w_i^2}{2}$.
• Remove the weights with small $s_i$.
• Repeat the process.
Alternatively, pruning can be implemented via a regularized cost function:
$J = \sum_{i=1}^{N} E(i) + a\, E_p(w)$
The second term favours small values for the weights, e.g.,
$E_p(w) = \sum_k h(w_k^2), \qquad h(w_k^2) = \frac{w_k^2}{w_0^2 + w_k^2}$
where $w_0$ is a preselected parameter (close to unity). After some training steps, weights with small values are removed.
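A minimal sketch of the saliency-based pruning step described above, assuming the diagonal Hessian terms $h_{ii}$ have already been estimated; the fraction of weights kept per pass is an arbitrary illustrative choice.

```python
import numpy as np

def prune_by_saliency(w, h_diag, keep_fraction=0.8):
    """Zero out the weights with the smallest saliencies s_i = h_ii * w_i^2 / 2."""
    s = 0.5 * h_diag * w ** 2                       # saliencies
    threshold = np.quantile(s, 1.0 - keep_fraction)
    mask = s >= threshold                           # keep only the most salient weights
    return w * mask, mask

# Example: prune 20% of the weights, then retrain and repeat
w      = np.array([0.9, -0.05, 0.4, 0.01, -0.7])
h_diag = np.array([2.0, 1.5, 0.8, 1.2, 2.5])
w_pruned, kept = prune_by_saliency(w, h_diag)
print(w_pruned, kept)
```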
• Constructive techniques:
They start with a small network and keep increasing it,
according to a predetermined procedure and criterion.
Remark: Why not start with a large network and let the algorithm decide which weights are small? This approach is naïve; it overlooks the fact that classifiers must have good generalization properties. A large network can result in small errors on the training set, since it can learn the particular details of the training set. On the other hand, it will not be able to perform well on data unknown to it. The size of the network must be:
• Large enough to learn what makes data of the same class similar and data from different classes dissimilar.
• Small enough not to be able to learn the underlying differences between data of the same class; learning them leads to the so-called overfitting.
Example: [figure illustrating overfitting]
Overtraining is another side of the same coin, i.e.,
the network adapts to the peculiarities of the training
set.
Generalized Linear Classifiers
• Are there any functions $f_i(\cdot)$ and an appropriate $k$, so that the mapping
$x \to y = [f_1(x), \ldots, f_k(x)]^T$
leads to a linearly separable problem in the $k$-dimensional space? That is, so that
$w_0 + w^T y > 0, \quad \text{if } x \in \omega_1$
$w_0 + w^T y < 0, \quad \text{if } x \in \omega_2$
In such a case, this is equivalent to approximating the nonlinear discriminant function $g(x)$ in terms of the $f_i(x)$, i.e.,
$g(x) = w_0 + \sum_{i=1}^{k} w_i f_i(x) \; > (<) \; 0$
Capacity of the l-dimensional space in Linear Dichotomies
Assume $N$ points in $\mathbb{R}^l$, assumed to be in general position, that is, no subset of $l+1$ of them lies on an $(l-1)$-dimensional hyperplane.
Cover's theorem states: the number of groupings that can be formed by $(l-1)$-dimensional hyperplanes to separate the $N$ points into two classes is
$O(N, l) = 2\sum_{i=0}^{l} \binom{N-1}{i}, \qquad \binom{N-1}{i} = \frac{(N-1)!}{(N-1-i)!\; i!}$
Example: $N = 4$, $l = 2$: $O(4, 2) = 14$.
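The count $O(N, l)$ is easy to evaluate numerically; the short sketch below reproduces the $O(4, 2) = 14$ example and illustrates the behaviour discussed next for $N = r(l+1)$ with $r < 2$.

```python
from math import comb

def O(N, l):
    """Number of linearly separable dichotomies of N points in general position in R^l."""
    return 2 * sum(comb(N - 1, i) for i in range(min(l, N - 1) + 1))

print(O(4, 2))                      # 14, as in the example

# Probability of linear separability P = O(N, l) / 2**N
for l in (2, 10, 50):
    N = int(1.5 * (l + 1))          # N = r(l+1) with r = 1.5 < 2
    print(l, O(N, l) / 2 ** N)      # tends to 1 as l grows
```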
The probability of grouping the $N$ points into two linearly separable classes is
$P_N^l = \frac{O(N, l)}{2^N}, \qquad N = r(l+1)$
Thus, the probability of having $N$ points in linearly separable classes tends to 1, for large $l$, provided $N < 2(l+1)$.
Radial Basis Function Networks (RBF)
Choose the functions $f_i(\cdot)$ to be Gaussian-shaped:
$f_i(x) = \exp\left(-\frac{\|x - c_i\|^2}{2\sigma_i^2}\right)$
Example: The XOR problem
• Define:
$c_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \quad c_2 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \quad \sigma_1 = \sigma_2 = \frac{1}{\sqrt{2}}$
$y = \begin{bmatrix} \exp(-\|x - c_1\|^2) \\ \exp(-\|x - c_2\|^2) \end{bmatrix}$
$g(y) = y_1 + y_2 - 1 = 0$
$g(x) = \exp(-\|x - c_1\|^2) + \exp(-\|x - c_2\|^2) - 1 = 0$
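A short numerical check of this construction (illustrative sketch): the inputs (0,0) and (1,1) give $g > 0$, while (0,1) and (1,0) give $g < 0$, so the transformed problem is linearly separable.

```python
import numpy as np

c1, c2 = np.array([1.0, 1.0]), np.array([0.0, 0.0])

def g(x):
    x = np.asarray(x, dtype=float)
    y1 = np.exp(-np.sum((x - c1) ** 2))     # exp(-||x - c1||^2), i.e. sigma = 1/sqrt(2)
    y2 = np.exp(-np.sum((x - c2) ** 2))
    return y1 + y2 - 1.0                    # g(y) = y1 + y2 - 1

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, round(g(x), 3))                # (0,0), (1,1): ~0.135 ; (0,1), (1,0): ~ -0.264
```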
Training of the RBF networks
$g(x) = w_0 + w^T y$
Once the centers $c_i$ and the widths $\sigma_i$ have been fixed, this is a typical linear classifier design problem.
Universal Approximators
Support Vector Machines: The non-linear case
• Recall the dual problem: maximize over $\lambda_i$
$\sum_i \lambda_i - \frac{1}{2}\sum_{i,j} \lambda_i \lambda_j y_i y_j\, \boldsymbol{y}_i^T \boldsymbol{y}_j$
where $\boldsymbol{y}_i \in \mathbb{R}^k$ are the training points mapped into the transformed space.
• Also, the classifier will be
$g(\boldsymbol{y}) = w^T \boldsymbol{y} + w_0 = \sum_{i=1}^{N_s} \lambda_i y_i\, \boldsymbol{y}_i^T \boldsymbol{y} + w_0$
where $x \to \boldsymbol{y} \in \mathbb{R}^k$.
• High complexity.
Something clever: compute the inner products in the high-dimensional space as functions of inner products performed in the low-dimensional space!!!
Let $x \to \boldsymbol{y} = [x_1^2, \; \sqrt{2}\, x_1 x_2, \; x_2^2]^T \in \mathbb{R}^3$
Then, it is easy to show that
$\boldsymbol{y}_i^T \boldsymbol{y}_j = (x_i^T x_j)^2$
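A quick numerical check of this identity (illustrative sketch; the two vectors are arbitrary):

```python
import numpy as np

def phi(x):
    # Mapping x -> y = [x1^2, sqrt(2)*x1*x2, x2^2]^T in R^3
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

xi, xj = np.array([0.7, -1.2]), np.array([2.0, 0.5])
print(phi(xi) @ phi(xj))        # inner product in the high-dimensional space
print((xi @ xj) ** 2)           # (xi^T xj)^2 computed in the original space: same value
```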
Mercer's Theorem
Let $x \to \phi(x) \in H$. Then the inner product in $H$ is
$\sum_r \phi_r(x)\,\phi_r(z) = K(x, z)$
where $K(x, z)$ is a symmetric and continuous function satisfying
$\int\!\!\int K(x, z)\, g(x)\, g(z)\, dx\, dz \ge 0$
for any $g(x)$ with $\int g(x)^2\, dx < +\infty$.
The opposite is also true. Any kernel, with the above properties,
corresponds to an inner product in SOME space!!!
Examples of kernels
• Polynomial:
$K(x, z) = (x^T z + 1)^q, \quad q > 0$
• Radial Basis Functions:
$K(x, z) = \exp\left(-\frac{\|x - z\|^2}{\sigma^2}\right)$
• Hyperbolic Tangent:
$K(x, z) = \tanh(\beta\, x^T z + \gamma)$
for appropriate values of $\beta$, $\gamma$.
SVM Formulation
• Step 1: Choose an appropriate kernel. This implicitly assumes a mapping to a higher-dimensional (yet not explicitly known) space.
• Step 2: Solve
$\max_{\lambda}\left( \sum_i \lambda_i - \frac{1}{2}\sum_{i,j} \lambda_i \lambda_j y_i y_j K(x_i, x_j) \right)$
subject to: $0 \le \lambda_i \le C, \quad i = 1, 2, \ldots, N$
$\sum_i \lambda_i y_i = 0$
This results (implicitly) in
$w = \sum_{i=1}^{N_s} \lambda_i y_i\, \phi(x_i)$
• Step 3: Assign $x$ to $\omega_1$ ($\omega_2$) if
$g(x) = \sum_{i=1}^{N_s} \lambda_i y_i K(x_i, x) + w_0 \; > (<) \; 0$
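Putting the three steps together, the decision only requires kernel evaluations. The following is a minimal sketch assuming the multipliers $\lambda_i$, the labels $y_i$, the support vectors and $w_0$ have already been obtained from Step 2; the numerical values below are placeholders, not the result of an actual optimization.

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def svm_decision(x, support_vectors, labels, lambdas, w0, kernel=rbf_kernel):
    """g(x) = sum_i lambda_i * y_i * K(x_i, x) + w0; assign to class 1 if g(x) > 0."""
    g = sum(l * y * kernel(xi, x) for l, y, xi in zip(lambdas, labels, support_vectors))
    return g + w0

# Placeholder values standing in for the output of the optimization in Step 2
sv      = [np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([0.0, 1.0])]
labels  = [+1, +1, -1]
lambdas = [0.5, 0.5, 1.0]
w0      = 0.0

x = np.array([0.9, 0.8])
print("class 1" if svm_decision(x, sv, labels, lambdas, w0) > 0 else "class 2")
```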
Decision Trees
[Figure: a decision tree with node questions of the form "is feature $x_i \le a$?"]
Design elements that define a decision tree:
• Each node, $t$, is associated with a subset $X_t \subseteq X$, where $X$ is the training set. At each node, $X_t$ is split into two (binary splits) disjoint descendant subsets $X_{t,Y}$ and $X_{t,N}$, where
$X_{t,Y} \cap X_{t,N} = \emptyset$
$X_{t,Y} \cup X_{t,N} = X_t$
• A splitting criterion must be adopted for the best split of Xt
into Xt,Y and Xt,N.
Set of Questions: In OBCT trees the set of questions is of the type
"Is $x_i \le \alpha$?"
The choice of the specific $x_i$ and of the threshold value $\alpha$, for each node $t$, is the result of a search, during training, over the features and a set of possible threshold values. The final combination is the one that results in the best value of a criterion.
Splitting Criterion: The main idea behind splitting at each node is that the resulting descendant subsets $X_{t,Y}$ and $X_{t,N}$ should be more class-homogeneous than $X_t$. Thus the criterion must be in harmony with this goal. A commonly used criterion is the node impurity:
$I(t) = -\sum_{i=1}^{M} P(\omega_i \mid t)\, \log_2 P(\omega_i \mid t)$
where
$P(\omega_i \mid t) = \frac{N_t^i}{N_t}$
and $N_t^i$ is the number of data points in $X_t$ that belong to class $\omega_i$. The decrease in node impurity is defined as:
$\Delta I(t) = I(t) - \frac{N_{t,Y}}{N_t} I(t_Y) - \frac{N_{t,N}}{N_t} I(t_N)$
• The goal is to choose the parameters in each node
(feature and threshold) that result in a split with the
highest decrease in impurity.
Summary of an OBCT algorithmic scheme:
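A minimal recursive sketch of such a scheme, assembled from the design elements above (binary questions "is $x_i \le \alpha$?", entropy impurity, stop-splitting when the impurity decrease becomes negligible); the stopping threshold is an illustrative assumption.

```python
import numpy as np

def impurity(labels):
    """Entropy impurity I(t) = -sum_i P(w_i|t) log2 P(w_i|t); labels: integer class indices."""
    p = np.bincount(labels) / len(labels)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def best_split(X, labels):
    """Search all features and candidate thresholds for the split maximizing Delta I(t)."""
    best = (None, None, 0.0)                       # (feature, threshold, decrease)
    I_t = impurity(labels)
    for i in range(X.shape[1]):
        for alpha in np.unique(X[:, i]):
            yes = X[:, i] <= alpha
            if yes.all() or (~yes).all():
                continue
            dI = I_t - (yes.mean() * impurity(labels[yes])
                        + (~yes).mean() * impurity(labels[~yes]))
            if dI > best[2]:
                best = (i, alpha, dI)
    return best

def grow_tree(X, labels, min_decrease=1e-3):
    feature, alpha, dI = best_split(X, labels)
    if feature is None or dI < min_decrease:       # declare a leaf
        return {"class": np.bincount(labels).argmax()}
    yes = X[:, feature] <= alpha
    return {"question": (feature, alpha),          # "is x_feature <= alpha ?"
            "yes": grow_tree(X[yes], labels[yes], min_decrease),
            "no":  grow_tree(X[~yes], labels[~yes], min_decrease)}
```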
Remarks:
• A critical factor in the design is the size of the tree.
Usually one grows a tree to a large size and then applies
various pruning techniques.
• Decision trees belong to the class of unstable classifiers.
This can be overcome by a number of “averaging”
techniques. Bagging is a popular technique. Using
bootstrap techniques in X, various trees are constructed,
Ti, i=1, 2, …, B. The decision is taken according to a
majority voting rule.
Combining Classifiers
The basic philosophy behind combining different classifiers lies in the fact that even the "best" classifier fails on some patterns that other classifiers may classify correctly. Combining classifiers aims at exploiting this complementary information residing in the various classifiers.
• Product Rule: Assign $x$ to the class $\omega_i$:
$i = \arg\max_k \prod_{j=1}^{L} P_j(\omega_k \mid x)$
where $P_j(\omega_k \mid x)$ is the respective posterior probability of the $j$th classifier.
• Majority Voting Rule: Assign $x$ to the class for which there is a consensus, or when at least $\ell_c$ of the classifiers agree on the class label of $x$, where:
$\ell_c = \begin{cases} \frac{L}{2} + 1, & L \text{ even} \\ \frac{L+1}{2}, & L \text{ odd} \end{cases}$
otherwise the decision is rejection, that is, no decision is taken.
Thus, a correct decision is made if the majority of the classifiers agree on the correct label, and a wrong decision if the majority agrees on the wrong label.
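A minimal sketch of the majority voting rule with the rejection option described above (the label encoding and the example predictions are illustrative assumptions):

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: list of class labels, one per classifier (L classifiers).
    Returns the winning label if at least l_c classifiers agree, otherwise None (rejection)."""
    L = len(predictions)
    l_c = L // 2 + 1 if L % 2 == 0 else (L + 1) // 2
    label, count = Counter(predictions).most_common(1)[0]
    return label if count >= l_c else None

print(majority_vote(["A", "A", "B"]))       # 'A'  (L = 3 odd, l_c = 2)
print(majority_vote(["A", "B", "C", "B"]))  # None (L = 4 even, l_c = 3, no label reaches it)
```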
Dependent or not Dependent classifiers?
Remarks:
• The majority voting and the summation schemes rank among
the most popular combination schemes.
The AdaBoost Algorithm.
Construct an optimally designed classifier of the form:
$f(x) = \operatorname{sign}\big(F(x)\big)$
where:
$F(x) = \sum_{k=1}^{K} a_k\, \phi(x; \theta_k)$
$\phi(x; \theta)$ denotes a base (weak) classifier and $\theta$ is a parameter vector.
• The essence of the method: design the series of classifiers
$\phi(x; \theta_1),\; \phi(x; \theta_2),\; \ldots,\; \phi(x; \theta_K)$
each one trained on a reweighted version of the training set.
• Updating the weights for each sample $x_i, \; i = 1, 2, \ldots, N$:
$w_i^{(m+1)} = \frac{w_i^{(m)} \exp\big(-y_i\, a_m\, \phi(x_i; \theta_m)\big)}{Z_m}$
– $Z_m$ is a normalizing factor common for all samples.
– $a_m = \frac{1}{2}\ln\frac{1 - P_m}{P_m}$, where $P_m$ is the weighted error rate of the $m$th classifier.
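A minimal sketch of the resulting training loop, using decision stumps as the base classifiers $\phi(x; \theta)$; the choice of stump, the number of rounds $K$ and the clamping of $P_m$ away from zero are illustrative assumptions.

```python
import numpy as np

def train_stump(X, y, w):
    """Pick the (feature, threshold, sign) minimizing the weighted error P_m."""
    best = (np.inf, None)
    for i in range(X.shape[1]):
        for thr in np.unique(X[:, i]):
            for sign in (+1, -1):
                pred = sign * np.where(X[:, i] <= thr, 1, -1)
                err = np.sum(w[pred != y])
                if err < best[0]:
                    best = (err, (i, thr, sign))
    return best                                    # (P_m, theta_m)

def stump_predict(X, theta):
    i, thr, sign = theta
    return sign * np.where(X[:, i] <= thr, 1, -1)

def adaboost(X, y, K=10):
    """X: (N, l) data, y: labels in {-1, +1}."""
    w = np.full(len(y), 1.0 / len(y))              # initial sample weights
    ensemble = []
    for _ in range(K):
        P_m, theta = train_stump(X, y, w)
        a_m = 0.5 * np.log((1 - P_m) / max(P_m, 1e-12))
        w = w * np.exp(-y * a_m * stump_predict(X, theta))
        w /= w.sum()                               # division by Z_m
        ensemble.append((a_m, theta))
    return ensemble

def F(X, ensemble):
    """Final classifier f(x) = sign(F(x))."""
    return np.sign(sum(a * stump_predict(X, theta) for a, theta in ensemble))
```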
Remarks:
• The training error rate tends to zero after a few iterations. The test error levels off at some value.