
Multilayer Perceptron

We need neural nets because we do not know the joint distribution (how about KDE?).

First approximation: the learner is restricted to a parametric family of functions (the functions the network can realize).

(Figure: the true optimal solution vs. the optimum attainable within the restricted family.)
Second approximation: the expectation is replaced by a time average over the training samples.

What happens when the size of the hidden layer (number of neurons) is increased?

• By the Universal Approximation Theorem (UAT) the approximation becomes better, but what about generalization?

Vapnik's formulation:

How good are the inputs? Are the neurons considered sufficient in number and type?

• Relevant to the effect due to the finite training set
• Relevant to the effect due to the assumed network

Formulation in terms of VC (Vapnik–Chervonenkis) dimensions!

Vapnik–Chervonenkis dimension

Capacity $h$ of the family of binary classification functions realized by a learning machine.

Single hidden layer: more neurons, larger $h$.

Machine capacity too small (over-determined) vs. machine capacity too large (under-determined), in relation to the input dimension $N$.

$f \in \mathfrak{F}$ : the set of all functions that can be realized by the machine.

$f : \mathbb{R}^N \to \{0,1\}$ : two-class problem.

$P_e^N(f)$ : classification error probability estimated from $N$ i.i.d. training samples, i.e. the fraction of the training samples $(x_i, y_i)$ for which the error $f(x_i) \neq y_i$ occurs.

$f^*(N)$ : the function in $\mathfrak{F}$ selected (by minimizing the empirical error) from the $N$ training samples.

$P_e(f)$ : generalization error probability of $f$ for any sample.

The smaller the gap between the empirical and the true error probability, the better.
$P_e$ : the generalization error probability minimized over all realizable functions,
$$P_e = \min_{f \in \mathfrak{F}} P_e(f).$$

The gaps among $P_e^N(f^*)$, $P_e(f^*)$ and $P_e$: the smaller, the better.

Vapnik–Chervonenkis theorem:

The empirical and true error probabilities corresponding to a function $f \in \mathfrak{F}$ satisfy
$$\mathrm{Prob}\left[\max_{f \in \mathfrak{F}} \left| P_e^N(f) - P_e(f) \right| > \varepsilon \right] \le 8\,\mathbb{S}(\mathfrak{F}, N)\,\exp\!\left(-\frac{N \varepsilon^2}{32}\right).$$

$\mathbb{S}(\mathfrak{F}, N)$ : shatter coefficient of the class $\mathfrak{F}$.

It is related to the maximum number of dichotomies of $N$ samples ($2^N$), the number of elements in the power set of the set of $N$ training samples.

Setup for the shatter coefficient of class $\mathfrak{F}$: $\mathfrak{X}$ is the set of all inputs and $\mathfrak{X}_N$ is a set of $N$ samples. Each $f \in \mathfrak{F}$ performs a dichotomy of the samples, and all instances of the two sets of samples so obtained are collected in $\mathfrak{O}$.

If there exists an $\mathfrak{X}_N \subset \mathfrak{X}$, for the largest $N$, such that
$$\{\, o \cap \mathfrak{X}_N : o \in \mathfrak{O} \,\} = \mathfrak{p}(\mathfrak{X}_N),$$
then the shatter coefficient of $\mathfrak{F}$ equals $|\mathfrak{p}(\mathfrak{X}_N)| = 2^N$ (the cardinality of the power set); otherwise it is $< 2^N$.

VC dimension definition:
The largest integer $k \ge 1$ for which $\mathbb{S}(\mathfrak{F}, k) = 2^k$ is called the VC dimension of the class $\mathfrak{F}$ and is denoted by $V_c$. If $\mathbb{S}(\mathfrak{F}, N) = 2^N$ for every $N$ (in $\mathfrak{X}$), the VC dimension is infinite.

Alternative definition (one more than the previous): the smallest integer $k \ge 1$ for which $\mathbb{S}(\mathfrak{F}, k) < 2^k$.
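As a concrete illustration (an assumed toy class, not from the slides), the shatter coefficient and VC dimension can be checked by brute force for 1-D threshold classifiers $f_t(x) = 1$ iff $x \ge t$. The sketch below enumerates the label patterns such classifiers realize on $N$ points; it finds $\mathbb{S}(\mathfrak{F}, N) = N + 1$, so only $N = 1$ is shattered and $V_c = 1$.

```python
# Illustrative sketch (assumed toy class, not from the slides): shatter
# coefficient of 1-D threshold classifiers f_t(x) = 1 if x >= t else 0.

def realizable_dichotomies(points):
    """Label tuples (f_t(x_1), ..., f_t(x_N)) realizable by some threshold t."""
    # A threshold below all points and one just above each point are enough
    # to produce every distinct labeling.
    cands = [min(points) - 1.0] + [p + 1e-9 for p in points]
    return {tuple(int(x >= t) for x in points) for t in cands}

def shatter_coefficient(n):
    """S(F, n) for thresholds; any n distinct points give the same count."""
    points = [float(i) for i in range(n)]
    return len(realizable_dichotomies(points))

for n in range(1, 6):
    s = shatter_coefficient(n)
    print(f"N = {n}: S = {s}, 2^N = {2 ** n}, shattered: {s == 2 ** n}")
# S(F, N) = N + 1, so S(F, 1) = 2 but S(F, 2) < 4: the VC dimension is V_c = 1.
```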

In general, in an $L$-dimensional space, the VC dimension of a perceptron is $L + 1$.
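For instance (a standard illustration, not worked out on the slide), for $L = 2$ the statement gives $V_c = 3$: three points in general position can be shattered by a line, while four cannot. The XOR labeling shows the failure at four points:

```latex
% Assume a perceptron w_1 x_1 + w_2 x_2 + b > 0 fires exactly for the points labeled 1.
\begin{align*}
(1,0),\,(0,1) \text{ labeled } 1 &: \quad w_1 + b > 0,\ \ w_2 + b > 0
  \;\Rightarrow\; w_1 + w_2 + 2b > 0, \\
(0,0),\,(1,1) \text{ labeled } 0 &: \quad b \le 0,\ \ w_1 + w_2 + b \le 0
  \;\Rightarrow\; w_1 + w_2 + 2b \le 0.
\end{align*}
% The two implications contradict each other, so no (w_1, w_2, b) realizes this
% dichotomy: these four points cannot be shattered, consistent with V_c = 3.
```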

Sauer–Shelah lemma:

If $k$ is the VC dimension of a class $\mathfrak{F}$, then the collection set $\mathfrak{O}$ of $\mathfrak{F}$ can consist of at most
$$\sum_{i=0}^{k} \binom{N}{i} \;\le\; (N+1)^k$$
sets.
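A quick numeric check of the lemma's growth (values chosen purely for illustration): with $k = 3$ and $N = 10$,

```latex
% Worked check of the Sauer-Shelah bound for k = 3, N = 10:
\sum_{i=0}^{3} \binom{10}{i} = 1 + 10 + 45 + 120 = 176 \;\le\; (10 + 1)^{3} = 1331
```

so at most 176 of the $2^{10} = 1024$ possible dichotomies of the 10 samples are realizable: beyond $V_c$ the collection grows only polynomially in $N$.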

Also, the reduction in the generalization error probability with $N$ is given by
$$\mathrm{Prob}\left[\, P_e(f^*) - \min_{f \in \mathfrak{F}} P_e(f) > \varepsilon \,\right] \le 8\,\mathbb{S}(\mathfrak{F}, N)\,\exp\!\left(-\frac{N \varepsilon^2}{128}\right).$$
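To get a feel for the numbers (illustrative, assumed values of $V_c$ and $\varepsilon$, not from the slides), the sketch below evaluates the right-hand side in $\log_{10}$, using the Sauer–Shelah estimate $\mathbb{S}(\mathfrak{F}, N) \le (N+1)^{V_c}$; the bound only becomes meaningful once $N$ is large compared with $V_c / \varepsilon^2$.

```python
# Illustrative sketch (assumed Vc and eps): log10 of the VC bound
# 8 * S(F, N) * exp(-N * eps^2 / 128), with S(F, N) <= (N + 1)**Vc (Sauer-Shelah).
import math

def log10_vc_bound(n, vc, eps, denom=128.0):
    """log10 of 8 * (n + 1)**vc * exp(-n * eps**2 / denom)."""
    return (math.log10(8.0)
            + vc * math.log10(n + 1)
            - n * eps**2 / (denom * math.log(10.0)))

vc, eps = 10, 0.1                       # assumed capacity and accuracy level
for n in (10**5, 10**6, 10**7, 10**8):
    print(f"N = {n:>9}: log10(bound) = {log10_vc_bound(n, vc, eps):9.1f}")
# Positive values mean the bound is vacuous (> 1); it collapses toward zero
# probability only for N large relative to Vc / eps^2.
```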

Relation between the number of training samples, the VC dimension and the error:
$$N(\varepsilon, \rho) = \max\left\{ \frac{k_1 V_c}{\varepsilon^2}\,\ln\frac{k_2 V_c}{\varepsilon^2},\ \ \frac{k_3}{\varepsilon^2}\,\ln\frac{8}{\rho} \right\}$$
where, for $N \ge N(\varepsilon, \rho)$, $\mathrm{Prob}\big(\,|P_e(f^*) - P_e| > \varepsilon\,\big) \le \rho$, and $k_1$, $k_2$, $k_3$ are predefined constants. This is the required sample complexity.

A bound on the estimation error as well: with probability $(1 - \rho)$,
$$P_e(f) \;\le\; P_e^N(f) + \sqrt{\frac{V_c\left(\ln\frac{2N}{V_c} + 1\right) - \ln\frac{\rho}{4}}{N}}.$$
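A rough sense of scale (illustrative, assumed $V_c$ and $\rho$, not from the slides): the sketch below evaluates the square-root confidence term and shows how slowly the guaranteed gap between true and empirical error shrinks with $N$.

```python
# Illustrative sketch (assumed Vc and rho): width of the confidence term
# sqrt((Vc * (ln(2N/Vc) + 1) - ln(rho / 4)) / N) as N grows.
import math

def confidence_term(n, vc, rho):
    """Bound on P_e(f) - P_e^N(f) that holds with probability 1 - rho."""
    return math.sqrt((vc * (math.log(2 * n / vc) + 1) - math.log(rho / 4)) / n)

vc, rho = 50, 0.05                      # assumed capacity and confidence level
for n in (10**3, 10**4, 10**5, 10**6):
    print(f"N = {n:>7}: gap <= {confidence_term(n, vc, rho):.3f}")
# The term decays roughly like sqrt(Vc * ln N / N): shrinking the guaranteed
# gap by 10x needs roughly 100x more samples (up to the log factor).
```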
Third approximation (computational consideration): the active budget constraint on learning is either

• the size of the training sample set ($N$), or

• the computing time.

NN output & hidden units:

Before the output: the weights, i.e. the output unit acts on a linear pre-activation $z$ computed from the hidden-layer activations.

Linear units for a Gaussian output distribution: the distribution is not known, but a linear output gives the best chance for gradient flow.
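One standard way to unpack this bullet (a sketch of the usual argument, not derived on the slide): if the conditional distribution is modeled as a Gaussian whose mean is the linear output, maximizing the log-likelihood is just minimizing squared error, and the gradient never saturates.

```latex
% Gaussian output with mean \hat{y} = W^{\top} h + b and fixed (unit) covariance:
\begin{align*}
p(y \mid x) &= \mathcal{N}\!\left(y;\ \hat{y},\ I\right), \\
-\log p(y \mid x) &= \tfrac{1}{2}\,\lVert y - \hat{y} \rVert^{2} + \text{const},
\end{align*}
% so the gradient w.r.t. \hat{y} is (\hat{y} - y): linear, never saturating,
% which is the "good gradient flow" of a linear output unit.
```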

Sigmoid units for a Bernoulli output distribution: the output should go to 0 or 1, but a hard threshold is bad for gradient flow (its gradient is zero almost everywhere), so a sigmoid is used instead.

Logistic (sigmoid) unit: can be seen as converting the pre-activation $z$, treated as an unnormalized log probability, into a probability.
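A sketch of that reading (following the usual construction; $z$ is the pre-activation): take $yz$ as the unnormalized log probability of the label $y \in \{0, 1\}$ and normalize:

```latex
% Unnormalized log probabilities: log \tilde{P}(y) = y z, for y in {0, 1}.
\begin{align*}
P(y \mid x) &= \frac{\exp(yz)}{\exp(0 \cdot z) + \exp(1 \cdot z)}
             = \frac{\exp(yz)}{1 + \exp(z)}
             = \sigma\!\big((2y - 1)\,z\big), \\
P(y = 1 \mid x) &= \sigma(z) = \frac{1}{1 + e^{-z}}.
\end{align*}
```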

Cost: the negative log-likelihood balances the log against the exp inside the sigmoid, allowing good gradient flow.

It reduces to the softplus function $\zeta\big((1 - 2y)\,z\big)$, which saturates only when $[y, z]$ = [near 0, very negative] or [near 1, very positive], i.e. when the prediction is already correct.
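A minimal numeric sketch (assumed toy values, not from the slides) of why this cost keeps gradients alive: write the loss as $\zeta\big((1-2y)z\big)$ and inspect $\partial L / \partial z$ for confidently wrong versus confidently right predictions.

```python
# Illustrative sketch: binary cross-entropy on the logit z written as a
# softplus, and its gradient w.r.t. z.  softplus(x) = log(1 + exp(x)).
import math

def softplus(x):
    # Stable form: log(1 + exp(x)) = max(x, 0) + log(1 + exp(-|x|)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x)) if x >= 0 else math.exp(x) / (1.0 + math.exp(x))

def loss_and_grad(y, z):
    """Loss -log P(y|z) = softplus((1 - 2y) z); gradient dL/dz = sigmoid(z) - y."""
    return softplus((1.0 - 2.0 * y) * z), sigmoid(z) - y

for y, z in [(1, -8.0), (0, 8.0), (1, 8.0), (0, -8.0)]:
    loss, grad = loss_and_grad(y, z)
    print(f"y={y}, z={z:+.0f}:  loss={loss:7.4f}  dL/dz={grad:+.4f}")
# Confidently wrong (first two rows): loss ~ |z| and gradient ~ +/-1, so
# learning proceeds.  Confidently right (last two rows): loss and gradient
# ~ 0, i.e. saturation only where it is harmless.
```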

Softmax units for a Multinoulli output: an approximate (soft) WINNER-TAKE-ALL.

Maps $z$ to multiple probability values whose sum is 1; the $z$ themselves are considered the unnormalized log probabilities.

But: saturation when the differences among the $z$ values are large!
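A small sketch of both points (assumed values, not from the slides): the stable softmax computation with max-subtraction, and how large gaps among the $z$ values push the output toward a one-hot, winner-take-all vector.

```python
# Illustrative sketch: softmax over unnormalized log probabilities z,
# computed with max-subtraction for numerical stability.
import math

def softmax(z):
    m = max(z)                          # subtracting max(z) does not change the result
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

print(softmax([2.0, 1.0, 0.1]))   # moderate gaps: graded probabilities (~[0.66, 0.24, 0.10])
print(softmax([10.0, 1.0, 0.1]))  # large gaps: ~[0.9998, 1.2e-4, 5.0e-5], nearly one-hot
# With large differences the winner saturates near 1 and the losers near 0:
# the winner-take-all behaviour noted above.
```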

