
Multilayer Perceptron

We need neural nets because we do not know the joint distribution (how about KDE?).

First approximation: the learner is restricted to a parametric family of functions (the functions the network can realize).

(Figure: the true optimal solution vs. the optimum attainable within the restricted family.)
Second approximation: the expectation is replaced by a time average over the training samples.

What happens when the size of the hidden layer (number of neurons) is increased?

• By the Universal Approximation Theorem (UAT) the approximation becomes better, but what about generalization?

Vapnik's formulation:

How good are the inputs? Are the neurons considered sufficient in number and type?

• Relevant to the effect due to the finite training set
• Relevant to the effect due to the assumed network

Formulation in terms of VC (Vapnik–Chervonenkis) dimensions!

Vapnik–Chervonenkis dimension

Capacity $h$ of the family of binary classification functions realized by a learning machine.

Single hidden layer: more neurons, larger $h$.

Machine capacity too small (over-determined) vs. machine capacity too large (under-determined), in relation to the input dimension $N$.

$f \in \mathfrak{F}$ : the set of all functions that can be realized by the machine.

$f : \mathbb{R}^N \to \{0,1\}$ : two-class problem.

$P_e^N(f)$ : classification error probability estimated from $N$ i.i.d. training samples, i.e. the fraction of the training samples $(x_i, y_i)$ for which the error $f(x_i) \neq y_i$ occurs.

$f^*(N)$ : the function in $\mathfrak{F}$ selected (by minimizing the empirical error) from the $N$ training samples.

$P_e(f)$ : generalization error probability of $f$ for any sample.

The smaller the gap between the empirical and the true error probability, the better.
$P_e$ : the generalization error probability minimized over all realizable functions,
$$P_e = \min_{f \in \mathfrak{F}} P_e(f).$$

The gaps among $P_e^N(f^*)$, $P_e(f^*)$ and $P_e$: the smaller, the better.

Vapnik–Chervonenkis theorem:

The empirical and true error probabilities corresponding to a function $f \in \mathfrak{F}$ satisfy
$$\mathrm{Prob}\left[\max_{f \in \mathfrak{F}} \left| P_e^N(f) - P_e(f) \right| > \varepsilon \right] \le 8\,\mathbb{S}(\mathfrak{F}, N)\,\exp\!\left(-\frac{N \varepsilon^2}{32}\right).$$

$\mathbb{S}(\mathfrak{F}, N)$ : shatter coefficient of the class $\mathfrak{F}$.

It is related to the maximum number of dichotomies of $N$ samples ($2^N$), the number of elements in the power set of the set of $N$ training samples.

Setup for the shatter coefficient of class $\mathfrak{F}$: $\mathfrak{X}$ is the set of all inputs and $\mathfrak{X}_N$ is a set of $N$ samples. Each $f \in \mathfrak{F}$ performs a dichotomy of the samples, and all instances of the two sets of samples so obtained are collected in $\mathfrak{O}$.

If there exists an $\mathfrak{X}_N \subset \mathfrak{X}$, for the largest $N$, such that
$$\{\, o \cap \mathfrak{X}_N : o \in \mathfrak{O} \,\} = \mathfrak{p}(\mathfrak{X}_N),$$
then the shatter coefficient of $\mathfrak{F}$ equals $|\mathfrak{p}(\mathfrak{X}_N)| = 2^N$ (the cardinality of the power set); otherwise it is $< 2^N$.

VC dimension definition:
The largest integer $k \ge 1$ for which $\mathbb{S}(\mathfrak{F}, k) = 2^k$ is called the VC dimension of the class $\mathfrak{F}$ and is denoted by $V_c$. If $\mathbb{S}(\mathfrak{F}, N) = 2^N$ for every $N$ (in $\mathfrak{X}$), the VC dimension is infinite.

Alternative definition (one more than the previous): the smallest integer $k \ge 1$ for which $\mathbb{S}(\mathfrak{F}, k) < 2^k$.
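As a concrete illustration (an assumed toy class, not from the slides), the shatter coefficient and VC dimension can be checked by brute force for 1-D threshold classifiers $f_t(x) = 1$ iff $x \ge t$. The sketch below enumerates the label patterns such classifiers realize on $N$ points; it finds $\mathbb{S}(\mathfrak{F}, N) = N + 1$, so only $N = 1$ is shattered and $V_c = 1$.

```python
# Illustrative sketch (assumed toy class, not from the slides): shatter
# coefficient of 1-D threshold classifiers f_t(x) = 1 if x >= t else 0.

def realizable_dichotomies(points):
    """Label tuples (f_t(x_1), ..., f_t(x_N)) realizable by some threshold t."""
    # A threshold below all points and one just above each point are enough
    # to produce every distinct labeling.
    cands = [min(points) - 1.0] + [p + 1e-9 for p in points]
    return {tuple(int(x >= t) for x in points) for t in cands}

def shatter_coefficient(n):
    """S(F, n) for thresholds; any n distinct points give the same count."""
    points = [float(i) for i in range(n)]
    return len(realizable_dichotomies(points))

for n in range(1, 6):
    s = shatter_coefficient(n)
    print(f"N = {n}: S = {s}, 2^N = {2 ** n}, shattered: {s == 2 ** n}")
# S(F, N) = N + 1, so S(F, 1) = 2 but S(F, 2) < 4: the VC dimension is V_c = 1.
```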

In general, in an $L$-dimensional space, the VC dimension of a perceptron is $L + 1$.
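For instance (a standard illustration, not worked out on the slide), for $L = 2$ the statement gives $V_c = 3$: three points in general position can be shattered by a line, while four cannot. The XOR labeling shows the failure at four points:

```latex
% Assume a perceptron w_1 x_1 + w_2 x_2 + b > 0 fires exactly for the points labeled 1.
\begin{align*}
(1,0),\,(0,1) \text{ labeled } 1 &: \quad w_1 + b > 0,\ \ w_2 + b > 0
  \;\Rightarrow\; w_1 + w_2 + 2b > 0, \\
(0,0),\,(1,1) \text{ labeled } 0 &: \quad b \le 0,\ \ w_1 + w_2 + b \le 0
  \;\Rightarrow\; w_1 + w_2 + 2b \le 0.
\end{align*}
% The two implications contradict each other, so no (w_1, w_2, b) realizes this
% dichotomy: these four points cannot be shattered, consistent with V_c = 3.
```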

Sauer–Shelah lemma:

If $k$ is the VC dimension of a class $\mathfrak{F}$, then the collection set $\mathfrak{O}$ of $\mathfrak{F}$ can consist of at most
$$\sum_{i=0}^{k} \binom{N}{i} \;\le\; (N+1)^k$$
sets.
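A quick numeric check of the lemma's growth (values chosen purely for illustration): with $k = 3$ and $N = 10$,

```latex
% Worked check of the Sauer-Shelah bound for k = 3, N = 10:
\sum_{i=0}^{3} \binom{10}{i} = 1 + 10 + 45 + 120 = 176 \;\le\; (10 + 1)^{3} = 1331
```

so at most 176 of the $2^{10} = 1024$ possible dichotomies of the 10 samples are realizable: beyond $V_c$ the collection grows only polynomially in $N$.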

Also, the reduction in the generalization error probability with $N$ is given by
$$\mathrm{Prob}\left[\, P_e(f^*) - \min_{f \in \mathfrak{F}} P_e(f) > \varepsilon \,\right] \le 8\,\mathbb{S}(\mathfrak{F}, N)\,\exp\!\left(-\frac{N \varepsilon^2}{128}\right).$$
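To get a feel for the numbers (illustrative, assumed values of $V_c$ and $\varepsilon$, not from the slides), the sketch below evaluates the right-hand side in $\log_{10}$, using the Sauer–Shelah estimate $\mathbb{S}(\mathfrak{F}, N) \le (N+1)^{V_c}$; the bound only becomes meaningful once $N$ is large compared with $V_c / \varepsilon^2$.

```python
# Illustrative sketch (assumed Vc and eps): log10 of the VC bound
# 8 * S(F, N) * exp(-N * eps^2 / 128), with S(F, N) <= (N + 1)**Vc (Sauer-Shelah).
import math

def log10_vc_bound(n, vc, eps, denom=128.0):
    """log10 of 8 * (n + 1)**vc * exp(-n * eps**2 / denom)."""
    return (math.log10(8.0)
            + vc * math.log10(n + 1)
            - n * eps**2 / (denom * math.log(10.0)))

vc, eps = 10, 0.1                       # assumed capacity and accuracy level
for n in (10**5, 10**6, 10**7, 10**8):
    print(f"N = {n:>9}: log10(bound) = {log10_vc_bound(n, vc, eps):9.1f}")
# Positive values mean the bound is vacuous (> 1); it collapses toward zero
# probability only for N large relative to Vc / eps^2.
```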

Relation between the number of training samples, the VC dimension and the error:
$$N(\varepsilon, \rho) = \max\left\{ \frac{k_1 V_c}{\varepsilon^2}\,\ln\frac{k_2 V_c}{\varepsilon^2},\ \ \frac{k_3}{\varepsilon^2}\,\ln\frac{8}{\rho} \right\}$$
where, for $N \ge N(\varepsilon, \rho)$, $\mathrm{Prob}\big(\,|P_e(f^*) - P_e| > \varepsilon\,\big) \le \rho$, and $k_1$, $k_2$, $k_3$ are predefined constants. This is the required sample complexity.

A bound on the estimation error as well: with probability $(1 - \rho)$,
$$P_e(f) \;\le\; P_e^N(f) + \sqrt{\frac{V_c\left(\ln\frac{2N}{V_c} + 1\right) - \ln\frac{\rho}{4}}{N}}.$$
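A rough sense of scale (illustrative, assumed $V_c$ and $\rho$, not from the slides): the sketch below evaluates the square-root confidence term and shows how slowly the guaranteed gap between true and empirical error shrinks with $N$.

```python
# Illustrative sketch (assumed Vc and rho): width of the confidence term
# sqrt((Vc * (ln(2N/Vc) + 1) - ln(rho / 4)) / N) as N grows.
import math

def confidence_term(n, vc, rho):
    """Bound on P_e(f) - P_e^N(f) that holds with probability 1 - rho."""
    return math.sqrt((vc * (math.log(2 * n / vc) + 1) - math.log(rho / 4)) / n)

vc, rho = 50, 0.05                      # assumed capacity and confidence level
for n in (10**3, 10**4, 10**5, 10**6):
    print(f"N = {n:>7}: gap <= {confidence_term(n, vc, rho):.3f}")
# The term decays roughly like sqrt(Vc * ln N / N): shrinking the guaranteed
# gap by 10x needs roughly 100x more samples (up to the log factor).
```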
Third approximation (computational consideration): the active budget constraint on learning is either

• the size of the training sample set ($N$), or

• the computing time.

NN output & hidden units:

Before the output: the weights, i.e. the output unit acts on a linear pre-activation $z$ computed from the hidden-layer activations.

Linear units for a Gaussian output distribution: the distribution is not known, but a linear output gives the best chance for gradient flow.
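One standard way to unpack this bullet (a sketch of the usual argument, not derived on the slide): if the conditional distribution is modeled as a Gaussian whose mean is the linear output, maximizing the log-likelihood is just minimizing squared error, and the gradient never saturates.

```latex
% Gaussian output with mean \hat{y} = W^{\top} h + b and fixed (unit) covariance:
\begin{align*}
p(y \mid x) &= \mathcal{N}\!\left(y;\ \hat{y},\ I\right), \\
-\log p(y \mid x) &= \tfrac{1}{2}\,\lVert y - \hat{y} \rVert^{2} + \text{const},
\end{align*}
% so the gradient w.r.t. \hat{y} is (\hat{y} - y): linear, never saturating,
% which is the "good gradient flow" of a linear output unit.
```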

Sigmoid units for a Bernoulli output distribution: the output should go to 0 or 1, but a hard threshold is bad for gradient flow (its gradient is zero almost everywhere), so a sigmoid is used instead.

Logistic (sigmoid) unit: can be seen as converting the pre-activation $z$, treated as an unnormalized log probability, into a probability.
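A sketch of that reading (following the usual construction; $z$ is the pre-activation): take $yz$ as the unnormalized log probability of the label $y \in \{0, 1\}$ and normalize:

```latex
% Unnormalized log probabilities: log \tilde{P}(y) = y z, for y in {0, 1}.
\begin{align*}
P(y \mid x) &= \frac{\exp(yz)}{\exp(0 \cdot z) + \exp(1 \cdot z)}
             = \frac{\exp(yz)}{1 + \exp(z)}
             = \sigma\!\big((2y - 1)\,z\big), \\
P(y = 1 \mid x) &= \sigma(z) = \frac{1}{1 + e^{-z}}.
\end{align*}
```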

Cost: the negative log-likelihood balances the log against the exp inside the sigmoid, allowing good gradient flow.

It reduces to the softplus function $\zeta\big((1 - 2y)\,z\big)$, which saturates only when $[y, z]$ = [near 0, very negative] or [near 1, very positive], i.e. when the prediction is already correct.
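A minimal numeric sketch (assumed toy values, not from the slides) of why this cost keeps gradients alive: write the loss as $\zeta\big((1-2y)z\big)$ and inspect $\partial L / \partial z$ for confidently wrong versus confidently right predictions.

```python
# Illustrative sketch: binary cross-entropy on the logit z written as a
# softplus, and its gradient w.r.t. z.  softplus(x) = log(1 + exp(x)).
import math

def softplus(x):
    # Stable form: log(1 + exp(x)) = max(x, 0) + log(1 + exp(-|x|)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x)) if x >= 0 else math.exp(x) / (1.0 + math.exp(x))

def loss_and_grad(y, z):
    """Loss -log P(y|z) = softplus((1 - 2y) z); gradient dL/dz = sigmoid(z) - y."""
    return softplus((1.0 - 2.0 * y) * z), sigmoid(z) - y

for y, z in [(1, -8.0), (0, 8.0), (1, 8.0), (0, -8.0)]:
    loss, grad = loss_and_grad(y, z)
    print(f"y={y}, z={z:+.0f}:  loss={loss:7.4f}  dL/dz={grad:+.4f}")
# Confidently wrong (first two rows): loss ~ |z| and gradient ~ +/-1, so
# learning proceeds.  Confidently right (last two rows): loss and gradient
# ~ 0, i.e. saturation only where it is harmless.
```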

Softmax units for a Multinoulli output: an approximate (soft) WINNER-TAKE-ALL.

Maps $z$ to multiple probability values whose sum is 1; the $z$ themselves are considered the unnormalized log probabilities.

But: saturation when the differences among the $z$ values are large!
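A small sketch of both points (assumed values, not from the slides): the stable softmax computation with max-subtraction, and how large gaps among the $z$ values push the output toward a one-hot, winner-take-all vector.

```python
# Illustrative sketch: softmax over unnormalized log probabilities z,
# computed with max-subtraction for numerical stability.
import math

def softmax(z):
    m = max(z)                          # subtracting max(z) does not change the result
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

print(softmax([2.0, 1.0, 0.1]))   # moderate gaps: graded probabilities (~[0.66, 0.24, 0.10])
print(softmax([10.0, 1.0, 0.1]))  # large gaps: ~[0.9998, 1.2e-4, 5.0e-5], nearly one-hot
# With large differences the winner saturates near 1 and the losers near 0:
# the winner-take-all behaviour noted above.
```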

